English Pages 217 [218] Year 2023
Studies in Big Data 129
Sanjiban Sekhar Roy Ching-Hsien Hsu Venkateshwara Kagita Editors
Deep Learning Applications in Image Analysis
Studies in Big Data Volume 129
Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Big Data” (SBD) publishes new developments and advances in the various areas of Big Data- quickly and with a high quality. The intent is to cover the theory, research, development, and applications of Big Data, as embedded in the fields of engineering, computer science, physics, economics and life sciences. The books of the series refer to the analysis and understanding of large, complex, and/or distributed data sets generated from recent digital sources coming from sensors or other physical instruments as well as simulations, crowd sourcing, social networks or other internet transactions, such as emails or video click streams and other. The series contains monographs, lecture notes and edited volumes in Big Data spanning the areas of computational intelligence including neural networks, evolutionary computation, soft computing, fuzzy systems, as well as artificial intelligence, data mining, modern statistics and Operations research, as well as self-organizing systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. The books of this series are reviewed in a single blind peer review process. Indexed by SCOPUS, EI Compendex, SCIMAGO and zbMATH. All books published in the series are submitted for consideration in Web of Science.
Editors Sanjiban Sekhar Roy School of Computer Science and Engineering Vellore Institute of Technology Vellore, TN, India
Ching-Hsien Hsu College of Information and Electrical Engineering Asia University Musashino, Taiwan
Venkateshwara Kagita Department of Computer Science and Engineering National Institute of Technology Warangal Warangal, India
ISSN 2197-6503 ISSN 2197-6511 (electronic) Studies in Big Data ISBN 978-981-99-3783-7 ISBN 978-981-99-3784-4 (eBook) https://doi.org/10.1007/978-981-99-3784-4 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
This book is dedicated to my mother “Papri Roy” –Sanjiban Sekhar Roy
Preface
In recent times, deep learning has achieved cutting-edge results on a wide range of image-related problems. Deep learning models are appealing because they can interpret images and perform vision tasks without requiring a complex series of specialized methods. In recent years, deep learning has emerged as the fastest-growing field in artificial intelligence and has found widespread application across various domains, showcasing its effectiveness and rapid development. These applications range from handwritten character recognition, automatic diagnosis of COVID-19 from X-ray images, and classification of imbalanced image datasets to image captioning, vehicle overspeed detection systems, and many others. The topics included in this book will interest academicians and researchers working on image-related problems in deep learning and machine learning. Graduate students, postgraduates, and Ph.D. scholars working in these fields will also benefit greatly. This edited book covers the following works on the applications of deep learning to various image-related problems:
• Autoencoder and Deep Convolutional Generative Adversarial Network in Improving Classification Performance of Bangla Handwritten
• Deep Learning-Based Approaches Using Feature Selection Methods for Automatic Diagnosis of COVID-19 Disease from X-RAY Images
• Image Captioning Using Deep Transfer Learning
• Vehicle Over speed Detection system
• An Intelligent System for Video-Based Proximity Analysis
• Melanoma cancer detection using deep learning
• Plant Diseases Classification using Neural Network: AlexNet
• Hyperspectral Images: A Succinct Analytical Deep Learning Study
• Chest X-Ray image classification of Pneumonia Disease using Efficient Net and InceptionV3
• Detection of Cancer using Deep Learning Techniques
The intention in compiling this book is to give readers a good idea of both the theory and the practice behind the above-mentioned applications by showcasing the uses of deep learning. We hope that readers will benefit significantly from learning the state of the art of deep learning applications in the domain of imagery. Keep reading, learning, and inquiring.
Vellore, TN, India
September 2020
Dr. Sanjiban Sekhar Roy Professor, School of Computer Science and Engineering Vellore Institute of Technology Vellore, India [email protected]
Contents
Autoencoder and Deep Convolutional Generative Adversarial Network in Improving the Performance of Bangla Handwritten Character Recognition
Tanzina Akter Tani, Mir Moynuddin Ahmed Shibly, Md. Shoumique Hasan, Nilofa Yeasmin, and Shamim Ripon

Deep Learning-Based Approaches Using Feature Selection Methods for Automatic Diagnosis of COVID-19 Disease from X-Ray Images
Burak Taşci

Image Captioning Using Deep Transfer Learning
Tapan Kumar Das

Vehicle Over Speed Detection System
K. Ganesan, N. S. Manikandan, and Vijayan Sugumaran

An Intelligent System for Video-Based Proximity Analysis
Sergey Antonov, Mikhail Bogachev, Pavel Leyba, Aleksandr Sinitca, and Dmitrii Kaplun

Deep Learning-Based Conjunctival Melanoma Detection Using Ocular Surface Images
Kanchon Kanti Podder, Mohammad Kaosar Alam, Zakaria Shams Siam, Khandaker Reajul Islam, Proma Dutta, Adam Mushtak, Amith Khandakar, Shona Pedersen, and Muhammad E. H. Chowdhury

Plant Diseases Classification Using Neural Network: AlexNet
Mohd Anas, Sanjiban Sekhar Roy, Kunwar S. Srivastava, and Jashabir Chakraborty

Hyperspectral Images: A Succinct Analytical Deep Learning Study
L. Sandeep Kumar, G. K. Panda, and B. K. Tripathy

Chest X-Ray Image Classification of Pneumonia Disease Using EfficientNet and InceptionV3
Neel Ghoshal, Mohd Anas, and Sanjiban Sekhar Roy

Detection of Cancer Using Deep Learning Techniques
Apoorv Singh, Arjunaditya, and B. K. Tripathy
About the Editors
Sanjiban Sekhar Roy is currently a Professor with the School of Computer Science and Engineering, Vellore Institute of Technology. He received his Ph.D. degree from the Vellore Institute of Technology, Vellore, India, in 2016. He has edited a handful of special issues for journals and has published numerous articles in high-impact SCI journals such as IEEE Transactions on Computational Social Systems; Scientific Reports, Nature; and Computers and Electrical Engineering, Elsevier, as well as in many other reputed journals. Dr. Roy has published nine books with reputed international publishers such as Springer, Elsevier and IGI Global. His research interests are deep learning and advanced machine learning. Dr. Roy was a recipient of the “Diploma of Excellence” Award for academic research from the Ministry of National Education, Romania. He was also an Associate Researcher with Ton Duc Thang University, Ho Chi Minh City, Vietnam, from 2019 to 2020.

Ching-Hsien Hsu is Chair Professor of the College of Information and Electrical Engineering, Asia University, Taiwan; Professor in the Department of Computer Science and Information Engineering, National Chung Cheng University; and Research Consultant, Department of Medical Research, China Medical University Hospital, China Medical University, Taiwan. His research includes cloud and edge computing, big data analytics, high performance computing systems, parallel and distributed systems, artificial intelligence, medical AI and natural language processing. He has published 350+ papers in top journals such as IEEE TPDS, IEEE TSC, ACM TOMM, IEEE TCC, IEEE TETC, IEEE System, and IEEE Network, in top conferences, and in book chapters in these areas. Dr. Hsu is the editor-in-chief of the International Journal of Grid and High Performance Computing and the International Journal of Big Data Intelligence, and serves on the editorial boards of a number of prestigious journals, including IEEE Transactions on Service Computing, IEEE Transactions on Cloud Computing, International Journal of Cloud Computing, Journal of Communication Systems, International Journal of Computational Science, and AutoSoft Journal. He has been an author/co-author or an editor/co-editor of 10 books from Elsevier, Springer, IGI Global, World Scientific and McGraw-Hill. Dr. Hsu has received talent awards seven times from the Ministry of Science and Technology and the Ministry of Education, and the distinguished award for excellence in research nine times from Chung Hua University, Taiwan. Prof. Hsu is president of the Taiwan Association of Cloud Computing; Chair of the IEEE Technical Committee on Cloud Computing (TCCLD); a Fellow of the IET (IEE); and a senior member of the IEEE.

Venkateswara Rao Kagita is an Assistant Professor at NIT Warangal. He obtained his Ph.D. from the University of Hyderabad. His research interests are data mining, machine learning, and deep learning, with a specific focus on machine learning techniques for recommender systems. His research works have been published in various reputed journals and conference proceedings. He has also delivered guest lectures at several international and national workshops, IITs, NITs, and universities.
Autoencoder and Deep Convolutional Generative Adversarial Network in Improving the Performance of Bangla Handwritten Character Recognition Tanzina Akter Tani, Mir Moynuddin Ahmed Shibly, Md. Shoumique Hasan, Nilofa Yeasmin, and Shamim Ripon
1 Introduction
Handwritten character recognition has been an area of interest among deep learning researchers and practitioners in recent years. Because of its many possible applications, a significant number of studies have been carried out on handwritten text and character recognition in different languages, such as English [1], Japanese [2], and Latin [3]. Bangla is the first and official language of Bangladesh, and it is the fourth most popular language in the world, spoken by almost 300 million people [4]. Given this large number of native users, handwritten character recognition of the Bangla language plays a very important role in a wide range of applications, including bank cheque processing, postal code identification, zip code scanning, interpretation of national ID numbers, Bangla optical character recognition (OCR), and many more [5, 6]. In the Bangla language, there are 11 vowels, 39 consonants, and a considerable number of vowel diacritics, consonant conjuncts and diacritics, as well as digits, symbols, and punctuation marks. Recognizing handwritten Bangla characters is difficult and complicated for a number of reasons: (a) there are a lot of compound characters in the Bangla alphabet, (b) the forms of certain characters are nearly identical, and (c) as different people write in different ways, the same character written by different people will have different forms, sizes, and curvatures.

T. A. Tani · Md. S. Hasan · N. Yeasmin · S. Ripon (B)
Department of Computer Science and Engineering, East West University, Dhaka, Bangladesh
e-mail: [email protected]
M. M. A. Shibly
Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis, Studies in Big Data 129, https://doi.org/10.1007/978-981-99-3784-4_1
To overcome these problems, several efforts have been made to improve recognition accuracy. Convolutional neural networks (CNN) [4, 7–9], deep CNN [10], and ensemble learning methods [11, 12] have been applied in recent years. However, the scarcity of Bangla datasets and the class imbalance in those datasets remain barriers to the recognition problem. Ensemble methods and image augmentation are among the many ways to overcome this issue. The generative adversarial network (GAN) introduced in [13] is another way to produce new instances of data. The presence of outliers in the dataset can also make recognition difficult, as they mislead the training of the models. Eliminating outliers can therefore yield statistically more meaningful results.

In Bangla handwriting-related studies, researchers have used different classification approaches. The authors in [14] suggested a hierarchical method for segmenting characters from sentences, with a multilayer perceptron (MLP) as the classification algorithm, whereas a fusion classifier combining an MLP, an RBF network, and an SVM is suggested in [15]. In [16], Bangla handwriting images are classified into 50 groups by using a multilayer perceptron neural network. Deep learning methods such as convolutional neural network (CNN)-based architectures have been used in the majority of recent works. Some of these works are limited to simple characters [17], while others concentrate on handwritten digits [18, 19]. Additionally, work has been done on a subset of the compound characters of the Bangla language [20]. One of the major issues in Bangla handwritten character recognition is the limited availability of a complete handwritten character dataset. Generating Bangla handwritten characters is one way to address this problem. The deep convolutional generative adversarial network (DCGAN) [21] has been used by some researchers to generate Bangla handwritten digits [22, 23]. However, there has not been much work focused on generating complicated curative characters and classifying Bangla handwritten characters using them.

The deep neural network is a widely used technique for analyzing and classifying different types of images [24–27]. The residual network (ResNet) is one of the prominent neural network-based architectures and has long been used for image classification and identification with excellent results. For example, researchers have used transfer learning with ResNet-50 for malaria cell-image classification [28] and for malicious software classification [29]. Residual networks have also been applied in some Bangla handwritten character recognition studies [30–32].

This chapter proposes a two-fold approach with a residual network classifier to classify Bangla handwritten characters. First, a model is created using a ResNet variant called ResNet-50 to classify the target dataset, which in this case is the Ekush [33] dataset. After this classification, the dataset is cleaned by removing outliers with an autoencoder, and the classification is performed again using the same ResNet-50 model. Finally, the classes with fewer images are augmented with additional images generated by DCGAN, so that the number of images among the classes becomes balanced. This dataset is then classified with the ResNet-50 model. In the end, a detailed comparative analysis is conducted over the results obtained from these experiments to measure the strengths of the adopted methods.
The rest of the chapter is structured as follows. Section 2 covers a detailed review of the state-of-the-art of Bangla handwritten character recognition. The methods and materials of this study and elaborate result analysis with discussion are presented in Sects. 3, 4, and 5. The chapter ends with an appropriate conclusion section.
2 Related Work
The researchers in [4] introduced a CNN model named EkushNet, which produced satisfactory results on the Ekush [33] and CMATERdb [34] datasets. The authors state that EkushNet gave the best results on Bangla character recognition relative to prior work. Their proposed model reached 96.90% accuracy on the training set and 97.73% on the validation set of the Ekush dataset after 50 iterations. The authors also applied cross-validation on the CMATERdb dataset and found that EkushNet is 95.01% accurate. Another research work [7] applied only a CNN model to Bangla handwritten character identification and obtained 85.96% accuracy on the test dataset, whereas the authors in [10] achieved 95% accuracy by using a deep CNN model. Both works used 50 alphabet classes of the Ekush dataset. Another study [20] achieved 95.05% accuracy on 122 classes of the Ekush dataset using a DCNN model. The authors also experimented on two other databases, CMATERdb and the BanglaLekha-Isolated dataset [35]. The authors in [36] reported an accuracy of 98.78% on the CMATERdb dataset, using five different approaches for classification.

The authors of [11] found that an ensemble of convolutional neural networks outperforms a single CNN model in recognizing Bangla handwriting. They proposed a stacked generalization ensemble framework consisting of six CNN models, which reached 96.72% accuracy on the test set after only 40 epochs. Another study [37] applied three approaches. First, seven CNN models were applied to recognize Bangla handwritten characters. Then the best-performing model, ResNet-50, which achieved 97.81% accuracy, was used for feature extraction, with classification done by traditional classification algorithms. In the last step, the authors employed different ensemble techniques for the classification task. The stacked generalization ensemble method achieved 98.68% test accuracy, the best result among all the adopted methods. All the experiments of this study were done on the Ekush and BanglaLekha-Isolated datasets.
The authors of another study [38] experimented with six CNN models and evaluated which DCNN model performs best on the CMATERdb [34] dataset. The results showed that all the DCNN models worked well, but the DenseNet model outperformed the others. They also pointed out that the DCNN framework works better than other object recognition methods. Another work [17] showed that data augmentation can improve handwritten character identification accuracy. The authors tested their algorithms on the alphabets of the BanglaLekha-Isolated dataset and obtained 91.81% accuracy without augmented images and 95.25% accuracy with augmented images. They also compared other machine learning approaches to assess the efficiency of these methods. The comparative analysis revealed that CNN outperforms SVM and LSTM with or without data augmentation. They also tested their proposed approaches on other datasets with similar characteristics. The experiment demonstrated 95.07% test accuracy on the 59 classes of the Ekush dataset.

The performance of a classifier can be enhanced by enlarging the dataset. GAN as a data augmentation technique can help to expand the dataset [23]. In [22], the authors proposed a DCGAN architecture that successfully augmented four Bangla handwritten character datasets. In that work, the authors focused only on the digit datasets and did not attempt to measure CNN model performance. Evidence that adding GAN-generated images improves classifier performance on handwritten datasets is given by another study [23]. The proposed method increased the accuracy on the MNIST dataset by using the GAN approach. The authors also used GAN on three Indian handwritten numeral datasets: Bangla, Devanagari, and Oriya. The accuracy improved on all the datasets. However, the results also showed that combining too many GAN-generated images with the real dataset might degrade performance. Another digit recognition and generation work [39] proposed a network architecture that achieved 99.44% accuracy on the BHAND [40] dataset. After that, the study applied a semi-supervised GAN (SGAN) to generate Bangla digits. One more GAN-related work [41] proposed a conditional GAN-based method for generating character images based on class. That study used three separate Bangla handwritten character datasets and was able to produce very realistic images by 1500 epochs. However, it did not apply any classification to the generated images.
The literature review reveals that most of the research has been done with either CNN or deep CNN models. Apart from the choice of classifier, there has been very little variation in the approaches taken in Bangla handwritten character recognition works. Only a few studies have combined GAN-generated images with classifiers. Furthermore, none of the studies has used outlier detection as part of its pipeline, so this literature review identifies a knowledge gap regarding outlier identification and elimination. In this work, both approaches are explored to enhance the recognition performance of Bangla handwritten characters.
3 Materials and Methods
To improve handwritten character recognition of the Bangla language, a series of steps has been followed, and a set of experiments has been conducted to evaluate the effectiveness of the proposed model. These steps are described in this section. A schematic view of the overall steps is shown in Fig. 1. As shown in Fig. 1, various experiments are conducted over the dataset. Algorithm 1 outlines the pseudocode of the proposed methods. The adopted steps of the model are described in the subsequent sections.
Fig. 1 Schematic view of the proposed model
Algorithm 1: Bangla Handwritten Character Recognition
1. Procedure Handwritten Character Recognition
   Input: Ekush dataset as D
   Output: Prediction of handwritten characters
2. Rename D with classification labels
3. TRAIN_DATA, TEST_DATA ← TRAIN_TEST_SPLIT(D, 0.7, 0.3)
4. ResNet-50_MODEL ← TRAIN_ResNet-50_MODEL(TRAIN_DATA)
5. prediction_results ← ResNet-50_MODEL(TEST_DATA)
6. performance ← PERFORMANCE_SCORE(prediction_results, Labels of TEST_DATA)
7. For i = 0 to NUMBER_OF_OUTLIER_CLASS do
8.     N ← Sample few inlier images from class i
9.     TEST_SET ← All images from class i − N
10.    OUTLIER_DETECTOR ← Create_autoencoder(N)
11.    OUTLIERS, INLIERS ← OUTLIER_DETECTOR(TEST_SET)
12.    DISCARD(OUTLIERS)
13. Dnew ← New pure dataset after discarding outliers
14. Calculate prediction performance using steps 3-6 using Dnew and continue next step
15. G ← GAN_CLASS_DATA
16. For i = 0 to NUMBER_OF_GAN_CLASS do
17.    G ← GAN_CLASS_DATA[i]
18.    For each EPOCH do
19.        For each image_batch in G do
20.            generated_image ← GENERATOR(random_noise, training=TRUE)
21.            real_output ← DISCRIMINATOR(image, training=TRUE)
22.            fake_output ← DISCRIMINATOR(generated_image, training=TRUE)
23.            GENERATOR_LOSS(fake_output)
24.            DISCRIMINATOR_LOSS(real_output, fake_output)
25.            GRADIENTS_OF_GENERATOR // to update the generator
26.            GRADIENTS_OF_DISCRIMINATOR // to update the discriminator
27.            optimizer ← Apply ADAM_OPTIMIZER on GENERATOR, DISCRIMINATOR
28.        if (epoch % 50 == 0) then
29.            Gnew ← Save GENERATED_IMAGES
30.    Dnew ← Gnew
31.    D ← Merge G and D
32. Calculate prediction performance using steps 3-6 and continue next step
COMPARE prediction performance from all three cases
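Steps 3 to 6 of Algorithm 1 (the 70/30 split and the performance score) can be illustrated with a minimal, standard-library-only sketch. The function names `train_test_split` and `accuracy`, the fixed seed, and the toy data are illustrative assumptions rather than the chapter's implementation, and plain accuracy stands in for whatever `PERFORMANCE_SCORE` computes.

```python
import random


def train_test_split(samples, train_fraction=0.7, seed=0):
    """Shuffle (image, label) pairs and split them into train and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted_labels, true_labels):
    """Fraction of samples whose predicted label equals the true label."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)


# Toy stand-in for the dataset: ten "images" (only ids here) with labels 0 or 1.
toy_data = [(f"img_{i}", i % 2) for i in range(10)]
train_data, test_data = train_test_split(toy_data)
print(len(train_data), len(test_data))
```

The same two helpers would be reused unchanged for the outlier-cleaned and GAN-augmented datasets, which is what makes the three cases comparable.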
Fig. 2 Different handwritten characters from the Ekush dataset
3.1 Dataset
BanglaLekha-Isolated [35], ISI [42], NumtaDB [43], CMATERdb [34], and Ekush [33] are a few of the datasets that contain Bangla handwritten characters and numerals. The Ekush dataset is selected in this study for experimental purposes because it contains more classes than any other Bangla handwritten dataset. The Ekush dataset consists of basic and compound characters, numerals, and modifiers. Its 122 classes of characters are grouped into five categories: 10 modifiers, 11 vowels, 39 consonants, 52 widely used compound letters, and 10 numeral digits. The dataset contains about 729,750 images. A few images from the dataset are shown in Fig. 2. The images are grayscale with a size of 28 × 28 pixels.
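As a quick arithmetic check of the composition described above, the short sketch below sums the category sizes and computes the mean number of images per class. The mean is only indicative: the text gives an approximate image total, and the classes are known to be imbalanced.

```python
# Category sizes as listed in the dataset description.
category_sizes = {
    "modifiers": 10,
    "vowels": 11,
    "consonants": 39,
    "compound letters": 52,
    "numeral digits": 10,
}

total_classes = sum(category_sizes.values())
total_images = 729_750  # approximate total stated in the text
mean_images_per_class = total_images / total_classes

print(total_classes)                 # 122
print(round(mean_images_per_class))  # 5982
```

A class with only a few thousand images, or fewer than a thousand, therefore sits well below the average class size.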
3.2 Outlier Detection
Outliers increase the variability of the results and lower statistical power. Removing outliers can therefore lead to statistically more meaningful results. In the Ekush dataset, Bangla handwritten characters are categorized into 122 groups. While individual characters were being grouped into their respective classes, some characters were placed in the wrong classes. As a result, a few characters from various classes are mixed up [37]. As already mentioned, some Bangla handwritten characters bear a striking resemblance to one another. During the pre-processing of the dataset, it was discovered that classes 87 and 97, classes 19 and 84, and classes 69, 76, 110, and 111 contain outliers of one another because of their resemblance. Since the character instances are anomalous only in this specific context, they are termed contextual outliers. To locate outliers in the individual groups, a semi-supervised outlier detection approach using autoencoders has been applied. Figure 3 presents the process of outlier detection and of preparing a purer dataset. Initially, the images of the 122 classes were analyzed, and the eight classes that potentially contain more outliers were identified. From each class that contains more than 3000 images, 1000 inlier images were selected to form the training set of the autoencoder-based outlier detection model. For the classes with fewer than 3000 images, the training set contained 500 images. No outlier image was fed to the outlier detection model during training. After training, the rest of the images were tested using the model and the outliers were identified. The inlier images, along with the previously selected pure training set, were used to develop robust classifiers.

Fig. 3 Workflow of the outlier detection
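The training-set sizing rule described in this section can be written down as a small sketch. The function names are hypothetical, the inlier ids are assumed to have been verified by hand beforehand, and the treatment of a class with exactly 3000 images (not specified in the text) is an assumption.

```python
def inlier_training_size(class_size, threshold=3000, large=1000, small=500):
    """Number of hand-picked inlier images used to train the detector.

    Classes with more than `threshold` images use `large` inliers, the rest
    use `small`. Treating exactly 3000 as "small" is an assumption.
    """
    return large if class_size > threshold else small


def split_for_outlier_detection(image_ids, verified_inlier_ids):
    """Return (training inliers, remaining images still to be screened)."""
    n_train = inlier_training_size(len(image_ids))
    train = list(verified_inlier_ids)[:n_train]
    train_set = set(train)
    to_screen = [i for i in image_ids if i not in train_set]
    return train, to_screen


# Toy example: a class of 3500 images, 1200 of which were verified as inliers.
train, to_screen = split_for_outlier_detection(range(3500), range(1200))
print(len(train), len(to_screen))
```

Only the `to_screen` images are passed through the trained detector; the training inliers are kept as they are and merged back afterwards.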
3.2.1 Autoencoder
Autoencoders are special neural networks that learn lower-dimensional features of complex data from unlabeled examples [44] and then try to reconstruct the original input from these simpler encoded features. This type of neural network has been shown to perform well in numerous fields, such as generative modeling, classification, clustering, recommender systems, and dimensionality reduction [45], but in this work it is used as an outlier detection model. The convolutional autoencoder depicted in Fig. 4 has been used to detect outliers in the Bangla handwritten character dataset. The network consists of three major components: an encoder network, a bottleneck layer, and a decoder network. The encoder network starts with an input layer, followed by three convolutional layers. The output of the last convolutional layer is flattened and passed through a dense layer, which produces a lower-dimensional feature vector. This layer is known as the bottleneck and is followed by the decoder network. The job of the decoder is to reproduce the input as closely as possible. It uses transposed convolutional layers, which perform the inverse operation of typical convolutional layers.
Fig. 4 Convolutional autoencoder for outlier detection
The autoencoder network has been trained with only a few inlier images. The intuition behind using only inlier images is to make the model familiar with what is normal, so that during testing it reconstructs outlier images poorly and their reconstruction error becomes high. Images with a reconstruction error above a specific threshold are then labeled as outliers and discarded from the dataset. The reconstruction error is calculated as the mean squared error between the input image and its reconstruction.
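The thresholding step described above can be sketched without any deep learning library, given images and their reconstructions as flat lists of pixel values. The function names, the toy 4-pixel images, and the threshold of 0.1 are illustrative assumptions; the chapter does not state the threshold it used.

```python
def mean_squared_error(original, reconstructed):
    """MSE between two images given as flat sequences of pixel values."""
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of pixels")
    squared = ((o - r) ** 2 for o, r in zip(original, reconstructed))
    return sum(squared) / len(original)


def flag_outliers(originals, reconstructions, threshold):
    """True for every image whose reconstruction error exceeds the threshold."""
    return [
        mean_squared_error(o, r) > threshold
        for o, r in zip(originals, reconstructions)
    ]


# Toy 4-pixel "images": the first is reconstructed well, the second poorly.
originals = [[0.0, 1.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0]]
reconstructions = [[0.1, 0.9, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]]
flags = flag_outliers(originals, reconstructions, threshold=0.1)
print(flags)  # [False, True]
```

In practice the threshold would be chosen per class, for example from the distribution of reconstruction errors on the inlier training images.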
3.3 Generative Adversarial Network The Ekush Bangla handwritten dataset contains several imbalanced classes. Data augmentation can be a way for generating a number of images in order to balance a dataset. Data augmentation approaches such as rotation, and scaling can expand a dataset but do not always add information. Generative Adversarial Network (GAN), on the other hand, can generate synthetic images that can bring additional information to the dataset. We have chosen a deep convolutional generative adversarial network (DCGAN) as it is the most effective architecture for improving classification and identification [46]. We have only taken five classes from the Ekush dataset as these classes have much fewer images than others. The outlier-removed classes that are common in these classes are used as input data in the proposed GAN model. Table 1 shows the classes that have been used in DCGAN. The generative adversarial network is a method for creating new synthetic data that consists of two models: generator and discriminator. The generator attempts to create a new image from the random noise and feeds it into the discriminator model, which determines whether the image is fake or real. If the discriminator determines it to be fake, the generator attempts again to create a new image to deceive the discriminator. The fight between these two models will continue until the generator becomes incredibly powerful, creating a synthetic image that the discriminator model is unable to differentiate. A general view of GAN is shown in Fig. 5. Though a few experimental setups have been altered, we have adopted the DCGAN model shown in [20] as their approach has achieved good results for generating Ekush dataset images. The DCGAN architecture is defined briefly in the following section. A CNN is used for both discriminator and generator in DCGAN architecture. Before passing to the DCGAN model, the images have been prepared by converting all the
T. A. Tani et al.
Table 1 Classes used in DCGAN
Class no. | Number of images
72 | 4186
76 | 4261
97 | 4100
110 | 2012
111 | 986
Fig. 5 General overview of GAN
images into a single channel and by resizing them all to 28 × 28 pixels. After that, all the images are normalized to the range [−1, 1]. The GAN model has been run separately for each of the chosen classes.
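The normalization to [−1, 1] can be illustrated with a short sketch. It assumes 8-bit grayscale input with pixel values from 0 to 255, which the text does not state explicitly; the function names are hypothetical.

```python
def normalize_pixels(image):
    """Map 8-bit grayscale values in [0, 255] to floats in [-1, 1]."""
    return [[p / 127.5 - 1.0 for p in row] for row in image]


def denormalize_pixels(image):
    """Map values in [-1, 1] back to integers in [0, 255], e.g. to view generated images."""
    return [[int(round((p + 1.0) * 127.5)) for p in row] for row in image]


# A tiny 2 x 2 "image" stands in for a 28 x 28 single-channel character image.
sample = [[0, 64], [128, 255]]
scaled = normalize_pixels(sample)
print(scaled[0][0], scaled[1][1])
```

This range matches the Tanh output of the generator, so real and generated images reach the discriminator on the same scale.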
3.3.1 Generator Architecture
In a GAN, the generator model is used to create new images from a random variable. A random noise vector of size 100 is given to the generator of our model. This is forwarded to a dense layer with 1024 hidden units. To keep the GAN model stable, batch normalization is used in both the generator and the discriminator [21]. The ReLU activation function is used in all layers except the output layer, where the Tanh activation is used. The Tanh activation function keeps the output pixels in the [−1, 1] range, which is later used as the discriminator input [23]. Again, another
Autoencoder and Deep Convolutional Generative Adversarial Network …
Fig. 6 Generator architecture
dense layer having 6272 neurons is used. After the following batch normalization and ReLU activation, the output is reshaped. The data must be up-sampled inside the generator model to produce an output image of the required size. Two convolution layers have been used: the first layer consists of 64 filters with a kernel size of 3, and the second layer consists of one filter with a kernel size of 3. In both layers, zero padding has been applied. A 2D upsampling layer is used just before each convolutional layer. The architecture of the generator model is given in Fig. 6.
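The flow of tensor shapes through the generator can be traced with a small sketch. Two details are assumptions rather than facts from the text: the reshape target of 7 × 7 × 128 (chosen only because 7 · 7 · 128 = 6272) and an upsampling factor of 2 per upsampling layer. Under these assumptions the output comes out at 28 × 28 × 1, matching the image size used in this chapter.

```python
def upsample_2d(shape, factor=2):
    """2D upsampling multiplies height and width by the given factor."""
    height, width, channels = shape
    return (height * factor, width * factor, channels)


def conv_same(shape, filters):
    """A stride-1 convolution with zero ('same') padding keeps height and width."""
    height, width, _ = shape
    return (height, width, filters)


def generator_shape_trace(noise_dim=100):
    """List (layer name, output shape) pairs for the generator described in the text."""
    trace = [("noise", (noise_dim,)), ("dense_1024", (1024,)), ("dense_6272", (6272,))]
    shape = (7, 7, 128)  # ASSUMPTION: reshape target, since 7 * 7 * 128 == 6272
    trace.append(("reshape", shape))
    shape = upsample_2d(shape)  # ASSUMPTION: upsampling factor of 2
    trace.append(("upsample_1", shape))
    shape = conv_same(shape, filters=64)
    trace.append(("conv_64_relu", shape))
    shape = upsample_2d(shape)
    trace.append(("upsample_2", shape))
    shape = conv_same(shape, filters=1)
    trace.append(("conv_1_tanh", shape))
    return trace


for name, shape in generator_shape_trace():
    print(f"{name:>14}: {shape}")
```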
3.3.2 Discriminator Architecture
In the discriminator model, two convolution layers have been used. With 64 filters, a kernel size of 5, a stride of 2, and ‘same’ padding, the first convolutional layer receives an input of shape 28 × 28 × 1. The same kernel size, stride, and padding are used in the second convolution layer, which has 128 filters. The outputs are then flattened and passed to a dense layer with 256 hidden neurons. The LeakyReLU activation function with α = 0.2 has been used in every layer of the discriminator model, as it helps the GAN model perform well [21]. The parameter α is the leakiness of the LeakyReLU activation function: it scales negative inputs instead of setting them to zero, so negative values still pass through the network, which prevents the dying-ReLU state. After that, a 25% dropout has been used to keep the discriminator model from overfitting. Finally, a dense layer with a single output unit and sigmoid activation has been used. The architecture of the discriminator model is given in Fig. 7.
Following [21], we have used the Adam optimizer in our DCGAN model. Although another study [47] has used an Adam optimizer with a learning rate of 1 × 10^−5 and a β1 momentum of 0.1, we have changed the learning rate to 1.5 × 10^−4 and the momentum term to β1 = 0.5 in both the discriminator and generator models, as we have found that these values helped to stabilize the training. The β1 term controls the exponential decay rate of the running average of the gradient; the decay factor is multiplied by itself at the end of each batch step [48]. Binary cross-entropy has been employed to measure the loss of the discriminator and the generator. We have used two separate batch sizes: a batch size of 64 for classes 110 and 111, because the number of real images is limited, and a batch size of 128 for the
Fig. 7 Discriminator architecture
Table 2 DCGAN generated dataset details
Class no. | Epoch number | Total DCGAN images
72 | 1900 | 964
76 | 1750 | 1360
97 | 750 | 896
110 | 1850 | 3267
111 | 3750 | 3594
other three classes. For these classes, the model has been trained for 2000 epochs, except for class 111, which has been trained for 4000 epochs. The higher number of epochs for class 111 is needed because its training data are very scarce, which prevents the generator from producing quality synthetic images in the early epochs. Every 50 epochs, we saved and inspected the generated images. We have taken images for these five classes at various epochs and identified the epoch at which the quality of the synthetic images is good enough compared to the actual images. We have taken a fixed number of images for each of these classes so that the classification model is trained with at least 4000 images per class. Table 2 shows the selected epoch and the total number of images that are added to the actual training dataset.
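The LeakyReLU activation and the binary cross-entropy losses mentioned for the DCGAN can be sketched in plain Python. This is an illustrative sketch, not the authors' training code: summing the real and fake terms in the discriminator loss is one common convention and is an assumption here, as are the function names and the clipping constant `eps`.

```python
import math


def leaky_relu(x, alpha=0.2):
    """Pass positive inputs unchanged and scale negative inputs by alpha."""
    return x if x > 0 else alpha * x


def binary_cross_entropy(targets, predictions, eps=1e-7):
    """Average binary cross-entropy; predictions are clipped away from 0 and 1."""
    total = 0.0
    for t, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(targets)


def discriminator_loss(d_real, d_fake):
    """Real images are labeled 1 and generated images 0 (ASSUMPTION: terms are summed)."""
    real_term = binary_cross_entropy([1] * len(d_real), d_real)
    fake_term = binary_cross_entropy([0] * len(d_fake), d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """The generator is rewarded when the discriminator outputs values near 1 for fakes."""
    return binary_cross_entropy([1] * len(d_fake), d_fake)


# Hypothetical discriminator outputs (sigmoid probabilities) for one small batch.
print(discriminator_loss([0.9, 0.8], [0.2, 0.1]))
print(generator_loss([0.2, 0.1]))
```

A discriminator that outputs 0.5 for every image yields a generator loss of ln 2, the point at which it can no longer separate real from generated samples.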
3.4 Classification
Before applying the classification model, all the images have been resized to 28 × 28 pixels in grayscale mode. We have used ResNet-50 to classify the 122 classes of the Ekush dataset. As the name implies, the model consists of 50 layers. A brief description of ResNet-50 is given in the following section.
3.4.1 ResNet-50
Identity and convolutional blocks are the two kinds of block used in the ResNet-50 architecture, chosen according to the dimensions of the input and output. Both blocks have a skip connection over the main path, which helps the model learn an identity function. The identity function allows the model to skip layers whose training does not add value to the accuracy [49]. In the identity block, there are three Conv2D layers with stride (1,1) and random initialization with a seed of zero. Only the second Conv2D has padding. Batch normalization and ReLU activation follow each Conv2D, except that the shortcut is added before the final ReLU activation. In the convolutional block, the skip connection has a Conv2D layer and batch normalization that the identity block does not have. Apart from this, the structure is almost the same as the identity block. The first convolution layer on the main path and the convolution layer on the shortcut path have a stride of (s,s), and the rest have (1,1).
The ResNet-50 architecture has five stages. The dataset image dimension of 28 × 28 × 1 is given as the input shape to the ResNet-50 architecture before entering these stages. The first stage of ResNet-50 has a 7 × 7 convolutional layer with 32 filters and (1,1) strides. Right after that, batch normalization and a 3 × 3 MaxPooling layer are used. The remaining stages have two, three, five, and two identity blocks, respectively, together with a convolutional block. After the five stages, there is an average pooling layer with (2,2) strides, which is used to reduce the output size. Finally, a fully connected dense layer with SoftMax activation is used to produce the outputs for the 122 classes. The diagram of the ResNet-50 architecture is given in Fig. 8.
To train the ResNet-50 models, the Adam optimizer with the default learning rate of 0.001 has been used, with categorical cross-entropy as the loss function. A batch size of 1024 provided the best accuracy in [11]. Following this study, the batch size has been set to 1024. Furthermore, 100 epochs have been used to train all the approaches.
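The role of the skip connection can be shown with a deliberately simplified identity block. The sketch below is an assumption-laden toy: it replaces the three Conv2D layers with two small linear layers, omits batch normalization, and works on plain Python lists. It only demonstrates the point made above, namely that when the main path contributes nothing, the block reduces to passing its input through.

```python
def relu(vector):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in vector]


def linear(vector, weights):
    """Multiply a vector by a square weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def identity_block(vector, w1, w2):
    """y = ReLU(F(x) + x), where F is linear -> ReLU -> linear (simplified main path)."""
    main = relu(linear(vector, w1))
    main = linear(main, w2)
    # The shortcut is added before the final ReLU, as in the block described above.
    return relu([m + x for m, x in zip(main, vector)])


zeros = [[0.0] * 3 for _ in range(3)]
x = [1.0, 2.0, 0.5]

# With an all-zero main path, the block returns its (non-negative) input unchanged.
print(identity_block(x, zeros, zeros))
```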
Fig. 8 ResNet-50 architecture
Table 3 Overview of the dataset for all approaches
Dataset | Total number of images | Train set | Test set | Validation set
Original dataset | 729,750 | 547,131 | 109,777 | 72,842
Outlier removed dataset | 727,849 | 545,724 | 109,467 | 72,658
DCGAN + outlier removed dataset | 737,659 | 555,224 | 109,777 | 72,658
3.5 Train, Test, and Validation
Three different datasets have been created after the removal of outliers and the DCGAN image generation. All the images in each approach have been split in such a way that 70% of the images are in the training set, 20% are in the test set, and 10% are in the validation set. The DCGAN-generated images are used only to balance the training set and are added after the split to avoid bias. The total number of images and the sizes of the train, test, and validation sets are presented in Table 3.
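The order of operations described above, splitting first and adding synthetic images to the training set only, can be sketched as follows. The percentages follow the split stated in the text; the helper name `split_indices`, the fixed seed, and the toy record lists are assumptions made for the example.

```python
import random


def split_indices(n, train_pct=70, test_pct=20, seed=0):
    """Shuffle indices 0..n-1 and split them into train, test, and validation parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = n * train_pct // 100
    n_test = n * test_pct // 100
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]
    return train, test, validation


# Toy records: (source, id). Real images are split; synthetic ones are not.
real_images = [("real", i) for i in range(100)]
synthetic_images = [("dcgan", i) for i in range(30)]

train_idx, test_idx, val_idx = split_indices(len(real_images))
train_set = [real_images[i] for i in train_idx] + synthetic_images  # added after the split
test_set = [real_images[i] for i in test_idx]
val_set = [real_images[i] for i in val_idx]

print(len(train_set), len(test_set), len(val_set))
```

Because the generated images never enter the test or validation sets, the reported scores measure performance on real handwriting only.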
4 Results
To improve the performance of Bangla handwritten character recognition, a semi-supervised image outlier detection model has first been proposed, and a generative adversarial network model has then been used to balance the dataset. For both strategies, a subset of the 122 classes has been chosen based on the recommendations of other works [37] and on domain knowledge of Bangla handwritten characters. Outliers have been excluded from seven classes using an autoencoder-based model, and five classes have been balanced using the DCGAN model. In this section, the outcomes of the experiments are explained in detail.
4.1 Result Analysis
The Ekush dataset has been classified using the ResNet-50 framework on three different datasets (the original dataset, the outlier-removed dataset, and the outlier-removed dataset with DCGAN-generated images). The ResNet-50 model has achieved 97.63% test accuracy on the original dataset consisting of 122 classes. The second approach, where the outliers are removed from seven classes, has achieved 97.95% test accuracy. The final approach, where outliers are removed from the original dataset and DCGAN-generated images are used to balance the training set, has achieved 97.92% test accuracy. The precision, recall, F1-score, and accuracy yielded by the ResNet-50 model for the three approaches are shown in Table 4. It illustrates
Table 4 Performance of all proposed approaches
Method | Precision (%) | Recall (%) | F1 Score (%) | Test accuracy (%)
Original dataset | 97.64 | 97.63 | 97.62 | 97.63
Outlier removed dataset | 97.96 | 97.95 | 97.95 | 97.95
DCGAN + outlier removed dataset | 97.93 | 97.92 | 97.92 | 97.92
that the ResNet-50 models trained on the outlier-removed dataset and on the dataset balanced with DCGAN-generated images have both outperformed the model trained on the original dataset.
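For reference, the F1-score is the harmonic mean of precision and recall. The short sketch below applies that standard formula to the class 111 precision and recall values reported in Tables 5 and 7 of this chapter; the function names are ours.

```python
def precision(true_positives, false_positives):
    """Fraction of predicted positives that are correct."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Fraction of actual positives that are found."""
    return true_positives / (true_positives + false_negatives)


def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


# Class 111: original dataset, outlier-removed dataset, and DCGAN-balanced dataset.
for label, p, r in [("original", 0.86, 0.62),
                    ("outlier removed", 0.84, 0.68),
                    ("balanced with DCGAN", 0.93, 0.88)]:
    print(f"{label}: F1 = {f1_score(p, r):.2f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, the low recall of class 111 on the original dataset dominates its F1-score.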
4.1.1 Result of Outlier Detection
The model accuracy has improved from 97.63% to 97.95% after outliers are removed from seven classes of the dataset, demonstrating the benefit of outlier removal. The same trend of improvement can be observed in most of the individual classes. In Table 5, the precision for classes 76, 87, and 97 has increased by 4, 1, and 5 percentage points, respectively, suggesting that the performance has improved for these three classes compared to the original dataset. In classes 84 and 111, the recall has improved by 1 and 6 percentage points for the outlier-removed dataset compared to the original dataset, which also indicates an improvement of the classifier. For four classes (76, 84, 97, and 111) among the seven, the F1-score has increased compared to the original dataset, indicating that their images are classified better. The remaining three classes (19, 69, and 87) have not seen any change in F1-score. However, a few classes have experienced drops in precision or recall even after removing the outliers. One reason is that, even though outliers are eliminated from those classes, some noise remains. Another explanation is that outlier elimination also removes some genuine images from these classes, resulting in a dataset that is less balanced than the original. These drops are small and do not reduce the F1-score of any of the seven classes. Removing outliers from specific classes has also reduced the number of false positives and false negatives for classes other than the ones discussed. This, along with the improved performance in these specific classes, has been the key ingredient in achieving a better overall performance. Taken together, the results indicate that the classifier benefits from excluding outliers from the original dataset.
The improvement in classification performance shows the effectiveness of the autoencoder-based outlier detection model on Bangla handwritten character images. In Table 6, the numbers of discarded images from the chosen seven classes are shown. The greatest number of outliers has been removed from class 19, whereas the smallest number has been removed from class 111. There is also a correlation between the
Table 5 Classifier evaluation after outlier exclusion
Class | Precision (original dataset) | Precision (outlier removed dataset) | Recall (original dataset) | Recall (outlier removed dataset) | F1-score (original dataset) | F1-score (outlier removed dataset)
19 | 0.95 | 0.95 | 0.98 | 0.96 | 0.96 | 0.96
69 | 0.91 | 0.89 | 0.94 | 0.94 | 0.92 | 0.92
76 | 0.90 | 0.94 | 0.89 | 0.89 | 0.89 | 0.92
84 | 0.95 | 0.95 | 0.92 | 0.93 | 0.93 | 0.94
87 | 0.97 | 0.98 | 0.97 | 0.96 | 0.97 | 0.97
97 | 0.88 | 0.93 | 0.95 | 0.94 | 0.92 | 0.93
111 | 0.86 | 0.84 | 0.62 | 0.68 | 0.72 | 0.75
Table 6 Outliers removed dataset details
Class | No. of images in the original dataset | No. of outliers removed
19 | 6180 | 272
69 | 5676 | 234
76 | 4446 | 185
84 | 5788 | 240
87 | 6136 | 257
97 | 4264 | 164
111 | 1028 | 42
number of outliers removed and the original size of the class: the more images a class has, the more outliers have been removed. In Fig. 9, a few representative inlier and outlier images from class 69 that are detected by the model are presented. By looking at the images, one can see that the images in the right part of the figure are anomalous, while the images on the left side are inliers. However, there are cases where outliers have not been accurately detected, and inliers have been wrongly identified as outliers. Despite this, the overall outlier detection scheme has been successful, as it has improved the ResNet-50 model performance. Figure 10 also supports the effectiveness of the outlier detection model. The inlier images of class 19 have been divided into four batches, and all the images of each batch have been superimposed into a single image. Each batch consists of approximately 1500 images. In contrast, the 272 outlier images detected by the model have also been superimposed into a single image. It is apparent from the figure that the superimposed inliers tend to retain the inherent shape of the character even with 1500 images. On the other hand, only 272 outliers are enough to make the corresponding superimposed image completely jumbled, which further validates the effectiveness of the outlier detector.
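The relationship between class size and the number of removed outliers can be quantified directly from the Table 6 counts. The sketch below computes the Pearson correlation coefficient and the removed fraction per class; the function name `pearson_r` is ours, and no values beyond those in Table 6 are used.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Table 6: class -> (images in the original dataset, outliers removed).
table_6 = {19: (6180, 272), 69: (5676, 234), 76: (4446, 185), 84: (5788, 240),
           87: (6136, 257), 97: (4264, 164), 111: (1028, 42)}

sizes = [size for size, _ in table_6.values()]
removed = [count for _, count in table_6.values()]
fractions = {cls: count / size for cls, (size, count) in table_6.items()}

print(f"Pearson r = {pearson_r(sizes, removed):.3f}")
for cls, fraction in fractions.items():
    print(f"class {cls}: {fraction:.1%} removed")
```

The coefficient comes out close to 1, and the removed share stays near 4% in every class, which is consistent with a threshold that flags a roughly constant proportion of each class.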
Fig. 9 Inliers versus outliers in class 69
Fig. 10 Superimposed inliers versus superimposed outliers
4.1.2 Result of Balancing Dataset with DCGAN
Apart from the existence of outliers, the Ekush dataset is also imbalanced. Five such imbalanced classes (72, 76, 97, 110, and 111) have been selected, and their training sets have been balanced using DCGAN-generated images. No generated image has been added to either the validation or the test set. The test accuracy of ResNet-50 has improved from 97.63% to 97.92% after adding synthesized images to the training set. Moreover, Table 7 shows that almost all the evaluation metrics are improved through the use of DCGAN-generated images. Class 111 in particular has improved markedly. Only class 72 shows a drop: its precision has fallen by 2 percentage points relative to the original dataset. This means that when the
Table 7 Classifier evaluation after applying DCGAN
Class | Precision (original dataset) | Precision (balanced dataset with DCGAN) | Recall (original dataset) | Recall (balanced dataset with DCGAN) | F1-score (original dataset) | F1-score (balanced dataset with DCGAN)
72 | 0.99 | 0.97 | 0.93 | 0.94 | 0.96 | 0.95
76 | 0.90 | 0.97 | 0.89 | 0.94 | 0.89 | 0.96
97 | 0.88 | 0.95 | 0.95 | 0.96 | 0.92 | 0.96
110 | 0.97 | 0.99 | 0.92 | 0.93 | 0.94 | 0.96
111 | 0.86 | 0.93 | 0.62 | 0.88 | 0.72 | 0.90
classifier predicts that an image belongs to class 72, it is correct less often than when trained on the original dataset. The decrease in the precision of class 72 may be caused by noisy images generated in the DCGAN experiment. This drop is small, and the corresponding recall has increased, which means that the classifier identifies more of the class 72 images than it did with the original dataset. There are three classes to which both the outlier detection model and DCGAN have been applied. For all three classes, ResNet-50 with DCGAN-generated images has outperformed the ResNet-50 trained on the outlier-removed dataset. The reason for this improvement is that the DCGAN model has been trained on those individual classes after the removal of outliers, which has produced good-quality images. The overall performance indicates that adding the proposed DCGAN-generated images to the real dataset can improve the classification result. Figure 11 shows a comparison of original images and DCGAN-generated images. From the figure, it is difficult to distinguish between original and synthesized images without the labels, which suggests that the DCGAN has generated good-quality images. However, for class 111, the generated images have not been up to the mark, because fewer images were available to train the DCGAN.
Using this generative adversarial network has helped us to tackle the class imbalance problem. Excluding the five classes to which DCGAN has been applied, the average training size is nearly 4575 images per class. In contrast, those five classes had only 2389 training images per class on average. One class, class 111, had only 770 training instances, which led the classifier to achieve an F1-score of only 72%. After balancing only the training set with 3594 synthesized images, the F1-score has improved to 90%. The changes in the training sample sizes are illustrated in Fig. 12. In classes 110 and 111, the number of synthesized images added has been more than 3000, and for the other three classes this number has been around 1000. For four of these five classes, the classifier performance in terms of the F1-score has improved. Moreover, the overall performance of the ResNet-50 classifier trained on the balanced dataset has been better than that of the classifier trained on the original dataset. This supports the applicability of DCGAN for generating synthesized Bangla handwritten character images.
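The per-class changes summarized above can be checked with a few lines of arithmetic. The sketch uses only the F1-scores from Table 7 and the class 111 counts quoted in the text (770 real training images and 3594 synthesized images); the variable names are ours.

```python
# Table 7: class -> (F1 on the original dataset, F1 on the DCGAN-balanced dataset).
f1_scores = {72: (0.96, 0.95), 76: (0.89, 0.96), 97: (0.92, 0.96),
             110: (0.94, 0.96), 111: (0.72, 0.90)}

gains = {cls: round(after - before, 2) for cls, (before, after) in f1_scores.items()}
improved = [cls for cls, gain in gains.items() if gain > 0]
largest_gain_class = max(gains, key=gains.get)

# Class 111 training size after adding the synthesized images.
class_111_train = 770 + 3594

print("improved classes:", improved)
print("largest gain:", largest_gain_class, gains[largest_gain_class])
print("class 111 training images after balancing:", class_111_train)
```

Four of the five classes improve, class 111 gains the most, and its balanced training set exceeds the 4000-image target mentioned in Section 3.3.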
Fig. 11 Some original images versus some DCGAN generated images
Fig. 12 Training size before versus training size after applying DCGAN
4.2 Overfitting Handling
Training and validation accuracy and loss are illustrated in Figs. 13, 14, and 15. On both the original dataset and the outlier-removed dataset, the ResNet-50 model fits the handwritten characters well, as illustrated in Figs. 13 and 14. Fig. 15 shows an exception, in which the model is applied to the Ekush dataset after outliers are removed and DCGAN is used. Except for one epoch in the validation loss, the model predicts well, because the training and validation accuracy and loss stay close to each other. Additionally, in the learning curve of each approach, the training and the validation loss are initially high and
Fig. 13 Accuracy and loss of ResNet-50 on the original dataset
Fig. 14 Accuracy and loss of ResNet-50 on outlier removed dataset
then gradually decrease in the same direction, indicating that the model does not overfit. Though there is a slight gap between the training and validation curves, it is too small to be considered overfitting. However, the validation loss at epoch 88 has increased to about 1.6, which is high compared to other epochs. This spike may be due to noise in the dataset: for this particular batch of images, the model is unable to predict the classes correctly. No such spike exists for the outlier-removed dataset or the original dataset, which suggests that there is some noise in the DCGAN-generated images. In addition, because the training and validation sets are split randomly, this particular batch may have received the noisiest images.
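An isolated spike such as the one at epoch 88 can also be located automatically from a logged validation-loss curve. The sketch below is one simple heuristic, not a method used in this study: it flags an epoch when its loss exceeds a multiple of the median of the preceding epochs. The curve is synthetic and purely illustrative (it is not the data behind Fig. 15); only the spike height of 1.6 at epoch 88 is taken from the text, and the window and factor values are arbitrary assumptions.

```python
import statistics


def find_spikes(losses, window=5, factor=3.0):
    """Return 1-based epochs whose loss exceeds factor x median of the previous window."""
    spikes = []
    for i in range(window, len(losses)):
        baseline = statistics.median(losses[i - window:i])
        if losses[i] > factor * baseline:
            spikes.append(i + 1)
    return spikes


# SYNTHETIC curve: a smoothly decaying loss over 100 epochs, for illustration only.
val_loss = [1.0 / (1 + 0.1 * epoch) + 0.1 for epoch in range(100)]
val_loss[87] = 1.6  # epoch 88 (1-based), the spike height reported in the text

print(find_spikes(val_loss))
```

Using the median as the baseline keeps the epochs right after the spike from being flagged, since a single extreme value barely shifts the median of the window.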
4.3 Comparison with State-of-the-Art
Outlier elimination on the Ekush dataset has not been reported before. To our knowledge, we are the first to experiment on the outlier-removed Ekush dataset. The authors in [22] only performed
Fig. 15 Accuracy and loss of ResNet-50 on outlier removed and DCGAN applied dataset
DCGAN to enlarge the Ekush dataset, but no classification was performed on the generated images. A comparative analysis of the current work with others that used only the Ekush dataset is given in Table 8. Our ResNet-50 model on the original dataset has achieved 97.63% accuracy on the test set (Table 4), which is higher than the accuracies reported in [10, 11, 20] but below those of EkushNet [4] and the ResNet-50 of [37]. Shibly et al. [37] achieved the best test accuracy of 98.68% on the Ekush dataset, but that has been obtained through an ensemble of ten CNN models. Their highest performance with a single CNN model has been 97.81% using ResNet-50, which both of our proposed methods exceed. Our work has also achieved better performance than an ensemble [11] and deep CNN techniques [10, 20] applied to the same dataset. Moreover, although we have applied outlier removal to only seven classes and DCGAN to only five classes, our two approaches have outperformed the other related works listed in Table 8. However, the improvement is minor, as only some of the 122 classes of the Ekush dataset have been considered in our study. Nevertheless, the results indicate that our proposed outlier removal and DCGAN approaches are capable of improving the classification performance.
Table 8 Performance comparison with the state of the art
Work reference | Method | Number of classes | Test accuracy (%)
[4] | CNN | 122 | 97.73
[10] | Deep CNN (Bengali handwritten alphabets of Ekush dataset) | 50 | 95.00
[37] | ResNet-50 | 122 | 97.81
[20] | Deep CNN | 122 | 95.05
[11] | Stacked generalization ensemble method | 122 | 96.72
Proposed method | Outlier Removal + ResNet-50 | 122 | 97.95
Proposed method | DCGAN + ResNet-50 | 122 | 97.92
5 Discussion
Eliminating outliers, applying DCGAN, and comparing the character recognition performance of these two approaches have not previously been reported together on the Ekush dataset. ResNet-50 is one of the most popular models and can achieve a very good result on the Ekush dataset, as in [37]. In addition to managing the vanishing gradient problem, the ResNet-50 model can achieve strong results with a low error rate. Apart from that, by applying skip connections, it can bypass layers that do not benefit the output [50]. The results have shown that ResNet-50 has given a better performance than the widely used CNN models.
Outlier detection is very beneficial if there is a probability of images being found in the wrong classes. The result analysis has shown that the test accuracy as well as the precision, recall, and F1-scores have improved after applying outlier detection to seven classes of the Ekush dataset. There is also an improvement in the overall classification result for the 122 classes. Moreover, outlier detection and elimination on three classes (76, 97, and 111) help our DCGAN to generate good-quality images. However, certain classes chosen for this outlier detection approach have a smaller volume of data, so training the model with this limited data reduced the precision. The outcome could be better if outlier detection were applied to the whole dataset.
Using DCGAN-generated images as an augmentation technique on the outlier-removed dataset has improved the test accuracy by 0.29 percentage points over the original Ekush dataset. DCGAN has not only increased the size of the dataset but also created variant images that add more information to the original dataset. In our study, only five classes of images have been augmented by the DCGAN approach, and the number of generated images is only enough to bring the training set of each class to around 4000 images. However, the whole dataset still has imbalanced classes besides the chosen classes. Yet even with a small number of generated images, the study has shown an improvement in the classification result. If we could generate more images for these classes, the accuracy might improve further. However, as mentioned in [23], we should be careful not to generate too many images, as this may degrade the performance.
6 Conclusion
Handwritten character recognition is a widely known research problem. This study adopts a two-fold approach on one of the largest Bangla handwritten datasets, namely the Ekush dataset, with the ResNet-50 classifier. First, outliers are detected and eliminated, which achieves a test accuracy of 97.95%. In the second approach, DCGAN is used to generate images for the original dataset, which yields an accuracy of 97.92%. However, the results could be improved further if the adopted approaches were applied to the whole dataset. Because of the limited computing resources, we
have taken only a few classes of the Ekush dataset for our experiments. Despite this, the results obtained from the adopted novel approaches have demonstrated superior performance to the majority of the related works. In the future, other Bangla handwritten character datasets may also be used to evaluate the efficacy of these methods. In addition, other classifier models, such as VGG-16, Xception, DenseNet, and AlexNet, can also be explored with these two proposed methods.
Data Availability Statement
All the codes and the dataset can be accessed at the following repositories.
Tanzina Akter Tani and Shibly, Moynuddin Ahmed (2022): Codes. figshare. Journal contribution. https://doi.org/10.6084/m9.figshare.18933470.
Tanzina Akter Tani and Shibly, Moynuddin Ahmed (2022): Dataset. figshare. Dataset. https://doi.org/10.6084/m9.figshare.18931760.
DCGAN generated images: http://doi.org/10.6084/m9.figshare.14754309.
References
1. Yuan, A., Bai, G., Jiao, L., & Liu, Y. (2012). Offline handwritten English character recognition based on convolutional neural network. In Proceedings 10th IAPR International Workshop on Document Analysis Systems, DAS 2012 (pp. 125–129). https://doi.org/10.1109/DAS.2012.61
2. Kimura, F., Wakabayashi, T., Tsuruoka, S., & Miyake, Y. (1997). Improvement of handwritten Japanese character recognition using weighted direction code histogram. Pattern Recognition, 30(8), 1329–1337. https://doi.org/10.1016/S0031-3203(96)00153-7
3. Ciresan, D. C., Meier, U., & Schmidhuber, J. (2012). Transfer learning for Latin and Chinese characters with deep neural networks. In Proceedings of the international joint conference on neural networks (pp. 1–6). https://doi.org/10.1109/IJCNN.2012.6252544
4. Azad Rabby, A. K. M. S., Haque, S., Abujar, S., & Hossain, S. A. (2018). Ekushnet: Using convolutional neural network for Bangla handwritten recognition. Procedia Computer Science, 143, 603–610. https://doi.org/10.1016/j.procs.2018.10.437
5. Ahmed, S., et al. (2019). Hand sign to bangla speech: A deep learning in vision based system for recognizing hand sign digits and generating bangla speech. https://doi.org/10.2139/ssrn.3358187
6. Manisha, N., Sreenivasa, E., & Krishna, Y. (2016). Role of offline handwritten character recognition system in various applications. International Journal of Computer Applications. https://doi.org/10.5120/ijca2016908349
7. Rahman, Md. M., Akhand, M. A. H., Islam, S., Chandra Shill, P., & Hafizur Rahman, M. M. (2015). Bangla handwritten character recognition using convolutional neural network. International Journal of Image, Graphics and Signal Processing, 7(8), 42–49. https://doi.org/10.5815/ijigsp.2015.08.05
8. Ghosh, T., Abedin, M. H. Z., Al Banna, H., Mumenin, N., & Abu Yousuf, M. (2021). Performance analysis of state of the art convolutional neural network architectures in Bangla handwritten character recognition. Pattern Recognition and Image Analysis, 31(1), 60–71. https://doi.org/10.1134/S1054661821010089
9. Chowdhury, R. R., Hossain, M. S., ul Islam, R., Andersson, K., & Hossain, S. (2019). Bangla handwritten character recognition using convolutional neural network with data augmentation. In 2019 Joint 8th international conference on informatics, electronics & vision (ICIEV) and 2019 3rd international conference on imaging, vision & pattern recognition (icIVPR) (pp. 318–323). https://doi.org/10.1109/ICIEV.2019.8858545
10. Ahmed, S., Tabsun, F., Reyadh, A. S., Shaafi, A. I., & Shah, F. M. (2019). Bengali handwritten alphabet recognition using deep convolutional neural network. In 5th International conference on computer, communication, chemical, materials and electronic engineering, IC4ME2 2019. https://doi.org/10.1109/IC4ME247184.2019.9036572
11. Shibly, M. M. A., Tisha, T. A., & Ripon, S. H. (2021). Stacked generalization ensemble method to classify Bangla handwritten character. In Proceedings of international conference on sustainable expert systems. Lecture Notes in Networks and Systems 176. https://doi.org/10.1007/978-981-33-4355-9_46
12. Mamun, M. R., Al Nazi, Z., & Yusuf, M. S. (2018). Bangla handwritten digit recognition approach with an ensemble of deep residual networks. In International conference on bangla speech and language processing, ICBSLP 2018 (pp. 21–22). https://doi.org/10.1109/ICBSLP.2018.8554674
13. Goodfellow, I., et al. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems, 27.
14. Basu, S., Das, N., Sarkar, R., Kundu, M., Nasipuri, M., & Basu, D. K. (2009). A hierarchical approach to recognition of handwritten Bangla characters. Pattern Recognition, 42(7), 1467–1484. https://doi.org/10.1016/j.patcog.2009.01.008
15. Bhowmik, T. K., Ghanty, P., Roy, A., & Parui, S. K. (2009). SVM-based hierarchical architectures for handwritten Bangla character recognition. International Journal on Document Analysis and Recognition, 12(2), 97–108. https://doi.org/10.1007/s10032-009-0084-x
16. Bhattacharya, U., Gupta, B. K., & Parui, S. K. (2007). Direction code based features for recognition of online handwritten characters of Bangla. In Proceedings of the international conference on document analysis and recognition, ICDAR, 2007. https://doi.org/10.1109/ICDAR.2007.4378675
17. Chowdhury, R. R., Hossain, M. S., Ul Islam, R., Andersson, K., & Hossain, S. (2019). Bangla handwritten character recognition using convolutional neural network with data augmentation. In 2019 Joint 8th international conference on informatics, electronics and vision, ICIEV 2019 and 3rd international conference on imaging, vision and pattern recognition, icIVPR 2019 with international conference on activity and behavior computing, ABC 2019 (pp. 318–323). https://doi.org/10.1109/ICIEV.2019.8858545
18. Shopon, M., Mohammed, N., & Abedin, M. A. (2017). Bangla handwritten digit recognition using autoencoder and deep convolutional neural network. In IWCI 2016-2016 International Workshop on Computational Intelligence. https://doi.org/10.1109/IWCI.2016.7860340
19. Shopon, M., Mohammed, N., & Abedin, M. A. (2017). Image augmentation by blocky artifact in deep convolutional neural network for handwritten digit recognition. In IEEE international conference on imaging, vision and pattern recognition, icIVPR 2017 (pp. 1–6). https://doi.org/10.1109/ICIVPR.2017.7890867
20. Mashrukh Zayed, M., Neyamul Kabir Utsha, S. M., & Waheed, S. (2021). Handwritten bangla character recognition using deep convolutional neural network: Comprehensive analysis on three complete datasets. Advances in Intelligent Systems and Computing. https://doi.org/10.1007/978-981-33-4673-4_7
21. Radford, A., Metz, L., & Chintala, S. (2016). Unsupervised representation learning with deep convolutional generative adversarial networks. In 4th International conference on learning representations, ICLR 2016-conference track proceedings.
22. Haque, S., Shahinoor, S. A., Rabby, A. K. M. S. A., Abujar, S., & Hossain, S. A. (2018). OnkoGan: Bangla handwritten digit generation with deep convolutional generative adversarial networks. In Recent Trends in image processing and pattern recognition, second international conference, RTIP2R 2018, Solapur, India, 21–22 Dec 2018, Revised Selected Papers, Part III, 2018, vol. 1037 (pp. 108–117). https://doi.org/10.1007/978-981-13-9187-3_10
23. Jha, G., & Cecotti, H. (2020). Data augmentation for handwritten digit recognition using generative adversarial networks. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-020-08883-w
24. Biswas, R., Vasan, A., & Roy, S. S. (2020). Dilated deep neural network for segmentation of retinal blood vessels in fundus images. Iranian Journal of Science and Technology, Transactions of Electrical Engineering, 44(1), 505–518. https://doi.org/10.1007/s40998-019-00213-7
Autoencoder and Deep Convolutional Generative Adversarial Network …
25
25. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN for brain tumor classification. Applied Sciences, 10(14), 4915. https://doi.org/10.3390/app10144915 26. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022). Deep convolutional neural network for environmental sound classification via dilation. Journal of Intelligent & Fuzzy Systems, 43(2), 1827–1833. https://doi.org/10.3233/JIFS-219283 27. Roy, S. S., et al. (2022). L2 regularized deep convolutional neural networks for fire detection. Journal of Intelligent & Fuzzy Systems, 43(2), 1799–1810. https://doi.org/10.3233/JIFS219281 28. Reddy, A. S. B., & Juliet, D. S. (2019). Transfer learning with ResNet-50 for malaria cell-image classification. In International Conference on Communication and Signal Processing (ICCSP) (pp. 945–949). https://doi.org/10.1109/ICCSP.2019.8697909 29. Rezende, E., Ruppert, G., Carvalho, T., Ramos, F., & de Geus, P. (2017). Malicious software classification using transfer learning of ResNet-50 deep neural network. In Proceedings of the 16th IEEE international conference on machine learning and applications, ICMLA 2017 (pp. 1011–1014). https://doi.org/10.1109/ICMLA.2017.00-19 30. Alif, M. A. R., Ahmed, S., & Hasan, M. A. (2017). Isolated Bangla handwritten character recognition with convolutional neural network. In 2017 20th International conference of computer and information technology (ICCIT) (pp. 1–6). 31. Alom, M. Z., Sidike, P., Hasan, M., Taha, T. M., & Asari, V. K. (2018). Handwritten Bangla character recognition using the state-of-the-art deep convolutional neural networks. Computational Intelligence and Neuroscience. https://doi.org/10.1155/2018/6747098 32. Khan, M. M., Uddin, M. S., Parvez, M. Z., & Nahar, L. (2022). A squeeze and excitation ResNeXt-based deep learning model for Bangla handwritten compound character recognition. Journal of King Saud University Computer and Information Sciences, 34(6), 3356–3364. 
https:/ /doi.org/10.1016/j.jksuci.2021.01.021 33. Rabby, A. K. M. S. A., Haque, S., Islam, M. S., Abujar, S., & Hossain, S. A. (2019). Ekush: A multipurpose and multitype comprehensive database for online off-line Bangla handwritten characters. Communications in Computer and Information Science. https://doi.org/10.1007/ 978-981-13-9187-3_14 34. Sarkar, R., Das, N., Basu, S., Kundu, M., Nasipuri, M., & Basu, D. K. (2012). CMATERdb1: A database of unconstrained handwritten Bangla and Bangla-English mixed script document image. International Journal on Document Analysis and Recognition. https://doi.org/10.1007/ s10032-011-0148-6 35. Biswas, M., et al. (2017). BanglaLekha-Isolated: A multi-purpose comprehensive dataset of handwritten Bangla isolated characters. Data in Brief . https://doi.org/10.1016/j.dib.2017. 03.035 36. Alom, Z., Sidike, P., Taha, T. M., & Asari, V. K. (2017). Handwritten bangla digit recognition using deep learning, p. 1712. 37. Shibly, M. M. A., Tisha, T. A., Tani, T. A., & Ripon, S. (2021). Convolutional neural networkbased ensemble methods to recognize Bangla handwritten character. PeerJ Computer Science, 7, 1–30. https://doi.org/10.7717/peerj-cs.565 38. Alom, M. Z., Sidike, P., Hasan, M., Taha, T. M., & Asari, V. K. (2017). Handwritten bangla character recognition using the state-of-art deep convolutional neural networks, p.1712. 39. Sikder, M. F. (2020). Bangla handwritten digit recognition and generation. In: Proceedings of international joint conference on computational intelligence (pp. 547–556). 40. Rahman, M. S. (2016). Towards optimal convolutional neural network parameters for bengali handwritten numerals recognition. In 19th international conference on computer and information technology (ICCIT) (pp. 431–436). 41. Nishat, Z. K., & Shopon, M. (2019). Synthetic class specific Bangla handwritten character generation using conditional generative adversarial networks. 
In 2019 International conference on bangla speech and language processing (ICBSLP 2019). https://doi.org/10.1109/ICBSLP 47725.2019.201475 42. Chaudhuri, B. B. (2006). A complete handwritten numeral database of Bangla-A major Indic script. In 10th international workshop on frontiers of handwriting recognition (IWFHR), La Baule, France.
26
T. A. Tani et al.
43. Alam, S., Reasat, T., Doha, R. M., & Humayun, A. I. (2018). NumtaDB-assembled Bengali handwritten digits, pp 1–4. 44. Kramer, M. A. (1991). Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37(2), 233–243. https://doi.org/10.1002/aic.690370209 45. Bank, D., Koenigstein, N., & Giryes, R. (2020). Autoencoders. In Machine learning: Methods and applications to brain disorders (pp. 193–208). https://doi.org/10.1016/B978-0-12-8157398.00011-0 46. Alqahtani, H., Kavakli-Thorne, M., & Kumar, G. (2021). Applications of generative adversarial networks (GANs): An updated review. Archives of Computational Methods in Engineering, 28(2), 525–552. https://doi.org/10.1007/s11831-019-09388-y 47. Haque, S., Shahinoor, S. A., Rabby, A. K. M. S. A., Abujar, S., & Hossain, S. A. (2019). OnkoGan: Bangla handwritten digit generation with deep convolutional generative adversarial networks. Communications in Computer and Information Science. https://doi.org/10.1007/978981-13-9187-3_10 48. Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. Preprint at arXiv arXiv:1412.6980. 49. Theckedath, D., & Sedamkar, R. R. (2020). Detecting affect states using VGG16, ResNet50 and SE-ResNet50 networks. SN Computer Science. https://doi.org/10.1007/s42979-020-0114-9 50. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778). https://doi.org/10.1109/CVPR.2016.90
Deep Learning-Based Approaches Using Feature Selection Methods for Automatic Diagnosis of COVID-19 Disease from X-Ray Images

Burak Taşci
B. Taşci (✉)
Fırat University Vocational School of Technical Sciences, Elazığ, Elazığ, Turkey
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis, Studies in Big Data 129, https://doi.org/10.1007/978-981-99-3784-4_2

1 Introduction

The novel coronavirus (COVID-19) pandemic created worldwide chaos in a very short time. As of July 2021, more than 206 million official cases had been reported worldwide, and the number of deaths due to COVID-19 had exceeded 4 million [1]. Many countries have developed policies to cope with the pandemic and minimize its effects. Turkey, in particular, is among the few countries that set an example to the world through its early measures and social isolation rules. Early action is vital for COVID-19 and similar pandemics: if cases can be detected early, patients can be isolated so that uninfected individuals remain safe. Science and technology contribute greatly to the precautionary policies implemented for this purpose. One of the most important contributions is predicting how the pandemic will evolve. Two main approaches appear in this context. The first comprises statistical approaches and mathematical models. The second comprises artificial intelligence-based approaches, which have received more attention in recent years.

In the literature, there are various approaches to disease detection from biomedical images based on machine learning and deep learning methods [2–8]. Javaheri et al. [9] tried to detect COVID-19, community-acquired pneumonia (CAP), and other diseases in 89,145 images obtained from 5 different hospitals using BCDU-Net (U-Net). The reported results were 91.66%, 87.5%, 95%, and 94% accuracy, sensitivity, AUC, and specificity, respectively. Rehmen et al. [10] used CT and X-ray images of 200 COVID-19(+), 200 healthy, 200 bacterial pneumonia, and 200 viral pneumonia cases. Using the ResNet101 transfer learning method, the reported results were 98.75%, 97.5%, 96.43%, and 100% accuracy, sensitivity,
precision, and specificity, respectively. JavadiMoghaddam et al. [11] proposed a deep learning model called Wavelet CNN-4, which consists of a wavelet layer and four convolution layers, with a Squeeze-and-Excitation block layer in the coupling layer. They compared the proposed model with pre-trained models such as VGG11, ResNet18, ResNet50, and Inception-v3. The proposed model achieved 99.03% accuracy. Chen et al. [12] tried to detect COVID-19 and other diseases in 35,355 images using U-Net++. The results obtained were 98.85%, 94.34%, 99.16%, 88.37%, and 99.4% accuracy, sensitivity, specificity, precision, and AUC, respectively. Wu et al. [13] used CT images of 368 COVID-19(+) cases and 127 cases of other diseases. Using the ResNet50 transfer learning method, the reported results were 76%, 81.1%, 61.5%, and 81.9% accuracy, sensitivity, AUC, and specificity, respectively.

Mobiny et al. [14] used 349 COVID-19(+) and 397 COVID-19(−) CT images. Images preprocessed with GAN, rescaling, and cropping were used in the DECAPS and DECAPS + Peekaboo architectures. The DECAPS + Peekaboo method reached 87.6%, 84.3%, 87.1%, and 85.2% accuracy, sensitivity, F1-score, and specificity, respectively. Balaha et al. [15] proposed a hybrid learning and optimization approach based on pre-trained models to detect COVID-19. The Harris Hawks Optimization (HHO) algorithm was used to optimize the hyperparameters. They performed data augmentation after combining three publicly available datasets. The Weighted Summation Method (WSM) was used as an evaluation metric to compare combinations of models, and the best accuracy, 99.33%, was obtained with VGG19. Li et al. [16] proposed an automated deep learning framework, COVNet, to accurately identify COVID-19 from chest CT. A chest CT dataset of 4356 images was reportedly used to build the models.
With this model, in distinguishing COVID-19 patients from other pneumonia patients, a sensitivity of 87% and an area under the curve (AUC) of 0.95 were obtained. He et al. [17] used 349 COVID-19(+) and 397 COVID-19(−) CT images. The Self-Trans method was used for preprocessing. Using the DenseNet-169 transfer learning method, the results reached were 85%, 94%, and 86% F1-score, AUC, and accuracy, respectively. Ahamed et al. [18] trained their proposed model on datasets of chest X-ray and CT images. Images were preprocessed and enlarged before entering the proposed ResNet50V2 model. Extra layers were added to the base model, together with regularization and fine-tuning. They classified the images in two-class, three-class, and four-class settings, with and without preprocessing. The model achieved 99.01% and 83.6% accuracy in the three-class setting with and without preprocessing, respectively.

Pathak et al. [19] used 413 COVID-19(+) and 439 normal or pneumonia CT images. ResNet50 feature extraction was used as preprocessing, and a CNN was used for classification. The CNN reached 93.01%, 91.45%, 94.77%, and 95.18% accuracy, sensitivity, specificity, and precision, respectively. Shi et al. [20] applied a machine learning algorithm, Random Forest (RF), to screen for COVID-19. CT images of 2685 patients were used to evaluate the models. Under fivefold cross-validation, the model achieved accuracy, sensitivity, and specificity of 87.9%, 90.7%, and 83.3%, respectively.
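The studies above are compared through accuracy, sensitivity, specificity, precision, and F1-score. As a reminder of how these quantities relate to one another, the following minimal sketch computes them from binary confusion-matrix counts. The function name and the counts are hypothetical and are not taken from any of the cited studies.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute common diagnostic metrics from binary confusion-matrix counts."""
    sensitivity = tp / (tp + fn)      # true-positive rate (recall)
    specificity = tn / (tn + fp)      # true-negative rate
    precision = tp / (tp + fp)        # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts, chosen only to illustrate the formulas.
metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The sketch assumes every denominator is non-zero; a production implementation would need to handle empty classes explicitly.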
The primary contributions of this study are as follows:
• The proposed model exploits the classification capability of features derived from the pre-trained deep architectures AlexNet and ResNet101.
• The study examines the Chi-square, NCA, mRMR, and ReliefF feature selection algorithms in order to reduce the number of features obtained from pre-trained deep neural networks and to identify the most effective deep features. The AlexNet and ResNet101 features that give the best results are combined, and the mRMR feature selection algorithm is applied to the combined features. In the experimental studies, a highly successful diagnostic model for chest X-ray image classification was obtained using these selected features.
• Deep features were obtained using pre-trained CNNs, and these features were used to optimize the parameters of the best SVM classifier. This method achieved the highest performance, with a score of 98.21%, in the classification of chest X-ray images.

The rest of the chapter is organized as follows: the materials and methods are presented in the second section, the experimental studies and results in the third section, and the discussion in the fourth section.
2 Material and Method

2.1 Methodology

This research proposes an efficient method for detecting COVID-19 with a high degree of accuracy using pre-trained network models. Figure 1 depicts the workflow of the approach. Preprocessing techniques are first applied to the X-ray images, primarily to improve classification performance. To highlight the regions of interest in the X-ray images and reduce the overall number of gray tones, the image gradient was computed with the Sobel operator. In the second step, marker-controlled watershed segmentation (MCWS) was used to segment these regions in the gradient images. In the last step, features were extracted with 13 pre-trained models, and the number of extracted features was reduced using the Chi-square, NCA, mRMR, and ReliefF feature selection methods. The selected features obtained from each pre-trained network were given to 13 different classifiers. The highest performance was observed with AlexNet and ResNet101, so these two architectures were reused for feature extraction. The FC8 layer of the AlexNet model and the FC1000 layer of the ResNet101 model each provide 1000 features. In the proposed method, feature extraction was carried out in both the training and testing phases. The combined 2000 (1000 + 1000) features were reduced to 200 features by the mRMR feature selection method. Finally, the reduced features were given to 13 different classifiers for reclassification, and the highest performance was obtained with SVM.

Fig. 1 Framework of the proposed approach. [Diagram: images from the dataset (Covid-19, Pneumonia, Normal) pass through pre-processing (gradient, watershed, resize); each of the 13 pre-trained networks (AlexNet, EfficientNet B0, GoogleNet, Inception ResNet-v2, Inception v3, ResNet-18, ResNet-50, ResNet-101, DenseNet-201, VGG16, VGG19, MobileNetv2, NASNet-Large) yields 1000 features, which are reduced with ReliefF, NCA, mRMR, and Chi-square; the combined features are reduced with the mRMR feature selection algorithm and classified with SVM.]
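The fusion step described above (1000 + 1000 features reduced to 200) can be sketched as follows. This is only an illustration under stated assumptions: the feature values are random placeholders rather than real AlexNet or ResNet101 activations, the scores are random stand-ins for mRMR scores, and `fuse_and_select` is a hypothetical helper name.

```python
import random


def fuse_and_select(feats_a, feats_b, scores, n_keep):
    """Concatenate two per-sample feature lists and keep the n_keep best columns."""
    fused = [a + b for a, b in zip(feats_a, feats_b)]   # 1000 + 1000 columns per sample
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_keep])                       # indices of the retained columns
    reduced = [[row[i] for i in keep] for row in fused]
    return reduced, keep


rng = random.Random(0)
n_samples = 4
# Placeholder activations standing in for the two 1000-dimensional layer outputs.
alexnet_feats = [[rng.random() for _ in range(1000)] for _ in range(n_samples)]
resnet_feats = [[rng.random() for _ in range(1000)] for _ in range(n_samples)]
# Placeholder scores; in the described pipeline these would come from mRMR.
placeholder_scores = [rng.random() for _ in range(2000)]

reduced, keep = fuse_and_select(alexnet_feats, resnet_feats, placeholder_scores, 200)
print(len(reduced), len(reduced[0]))
```

The same column indices would have to be reused at test time so that training and test samples share one feature layout.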
2.1.1 Preprocessing
The gradient method is first applied to the input images: gradient magnitudes and directions are calculated with a directional gradient operator. The watershed method is then applied, as is usual, to the gradient of the image. Using the 8 neighboring points around each point, the steepest and roughest directions in the image are detected [21]. Points of minimum height in the image are marked with individual identifiers. Using the gradient information in the image, the descending regions are then followed, and the watershed method associates every pixel with its respective minimum point [22].
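The gradient step can be illustrated with a small pure-Python sketch of the Sobel operator. It only computes the gradient magnitude on a toy image; the watershed segmentation itself is not covered, and border pixels are simply left at zero, which is an assumption made here for brevity.

```python
import math

# Standard 3 x 3 Sobel kernels for horizontal and vertical intensity changes.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(img):
    """Return the gradient magnitude of a 2-D list of gray values."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = gy = 0.0
            # Visit the 8 neighbors (and the center) of pixel (r, c).
            for dr in range(3):
                for dc in range(3):
                    p = img[r + dr - 1][c + dc - 1]
                    gx += SOBEL_X[dr][dc] * p
                    gy += SOBEL_Y[dr][dc] * p
            out[r][c] = math.hypot(gx, gy)
    return out


# Toy image with a vertical step edge between the second and third columns.
image = [[0, 0, 1, 1] for _ in range(4)]
grad = sobel_magnitude(image)
for row in grad:
    print(row)
```

High values in `grad` mark the edge, which is where a watershed transform applied to the gradient image would place its region boundaries.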
2.1.2 Feature Selection Algorithms
Feature selection, in short, is the creation of a feature vector that is smaller and more functional than the original feature vector yet equivalent to it, formed by choosing a subset of the class-related features obtained from the deep learning models.
Neighborhood Component Analysis (NCA)

NCA is a feature weighting method that can be used to select the optimum subset of features by maximizing an objective function that evaluates classification accuracy over the training data. To obtain the weight vector $w$ corresponding to the feature vector $x_i$, the approach optimizes a nearest-neighbor classifier [23]. Within the NCA framework, a reference sample point $x_j$ is chosen for each sample $x_i$. The closer the two samples are, the higher the probability that $x_j$ is selected as the reference point for $x_i$. Closeness is measured by the weighted distance $D_w$ given in Eq. 1, where $r$ is the number of features:

$$D_w(x_i, x_j) = \sum_{m=1}^{r} w_m^2 \left| x_{im} - x_{jm} \right| \tag{1}$$
Here $w_m$ is the weight assigned to the $m$th feature. The relationship between the probability $P_{ij}$ and the weighted distance $D_w$ is established by a kernel function $k$ that returns large values for small $D_w$. $P_{ij}$ is defined by the following equation:

$$P_{ij} = \frac{k\left(D_w(x_i, x_j)\right)}{\sum_{l=1,\, l \neq i}^{n} k\left(D_w(x_i, x_l)\right)} \tag{2}$$
By convention, $P_{ii} = 0$. The kernel function is defined as $k(z) = \exp(-z/\sigma)$, where the parameter $\sigma$ is the kernel width, which affects the probability that sample $x_j$ is selected as the reference point. The probability $P_i$ that $x_i$ is classified correctly is given in Eq. 3, where $Y_{ij} = 1$ if $x_i$ and $x_j$ belong to the same class and $Y_{ij} = 0$ otherwise:

$$P_i = \sum_{j=1,\, j \neq i}^{n} P_{ij} Y_{ij} \tag{3}$$
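Equations 1 to 3 can be traced on a toy example with the following sketch. It only evaluates the quantities for fixed weight vectors and does not perform the optimization over $w$ that NCA carries out; the data, the function names, and the choice $\sigma = 1$ are assumptions made for illustration.

```python
import math


def weighted_distance(xi, xj, w):
    """Eq. 1: sum of squared weights times absolute feature differences."""
    return sum(wm ** 2 * abs(a - b) for wm, a, b in zip(w, xi, xj))


def reference_probabilities(X, w, sigma=1.0):
    """Eq. 2 with kernel k(z) = exp(-z / sigma); the diagonal stays at zero."""
    n = len(X)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        kernel = [
            0.0 if j == i else math.exp(-weighted_distance(X[i], X[j], w) / sigma)
            for j in range(n)
        ]
        total = sum(kernel)
        for j in range(n):
            P[i][j] = kernel[j] / total
    return P


def correct_probability(P, y):
    """Eq. 3: probability mass that each sample places on same-class references."""
    n = len(y)
    return [sum(P[i][j] for j in range(n) if j != i and y[j] == y[i]) for i in range(n)]


# Toy data: feature 0 separates the two classes, feature 1 does not.
X = [[0.0, 5.0], [0.1, 1.0], [1.0, 5.2], [1.1, 0.9]]
y = [0, 0, 1, 1]

p_good = correct_probability(reference_probabilities(X, [1.0, 0.0]), y)
p_bad = correct_probability(reference_probabilities(X, [0.0, 1.0]), y)
print(sum(p_good), sum(p_bad))
```

Putting the weight on the discriminative feature gives a larger total probability of correct classification, which is the quantity that NCA seeks to maximize.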
ReliefF

One of the best-known feature selection approaches is the Relief algorithm, which can produce accurate and useful estimates of feature quality. It does so by assigning weights to the features. If a feature is useful, the nearest samples of the same class are expected to be closer to one another along that feature than the nearest samples of any other class [24]. A convex optimization problem is solved, and the result is used to determine the feature weights. However, the Relief algorithm is limited to two-class problems and cannot process incomplete data. The ReliefF method, an enhanced version of the Relief algorithm, was proposed as a solution to these and other difficulties. This enhanced approach is more robust and can handle noisy and incomplete data.

The ReliefF algorithm works as follows. First, a sample $R_i$ is selected at random. Then its k nearest neighbors from the same class, called $H_j$, and its k nearest neighbors from each of the other classes, called $M_j(C)$, are selected. Depending on the values of $R_i$, $H_j$, and $M_j(C)$, the weight $w[A]$ is updated for every feature $A$. Feature weights range from −1 to +1, and the largest positive values indicate the most important features. This process is repeated for a number of iterations determined by the user. The diff function calculates the difference, that is, the distance, between the values of a feature for two samples. Its calculation depends on whether the feature is nominal or numeric. Let $I_1$ and $I_2$ be samples and $A$ a feature: if the feature is nominal, diff is calculated as in Eq. 4; if it is numeric, as in Eq. 5. Choosing k greater than 1 increases the robustness of the algorithm against noisy data. This value can be set by the user, but if k is chosen as 1, the algorithm is sensitive to noisy data. In many studies k is set to 10, although trying different values of k can be more useful for examining the importance levels of the features. Finally, choosing k too small leads to similarly poor results.
$$\mathrm{diff}(A, I_1, I_2) = \begin{cases} 0, & I_1 = I_2 \\ 1, & I_1 \neq I_2 \end{cases} \tag{4}$$

$$\mathrm{diff}(A, I_1, I_2) = \left| I_1 - I_2 \right| \times \frac{1}{\max(A) - \min(A)} \tag{5}$$
Chi-Square Test

The chi-square test is a univariate filter method. It works on categorical variables and detects the relationships and dependencies between them. The test has two steps. In the first step, the chi-square statistic of the observed values is calculated with respect to the expected values. In the second step, the statistic is compared with a predetermined threshold, and a decision is made accordingly. The features are scored by their chi-square statistic, and the features with the best scores are used. The chi-square statistic is obtained using Eqs. 6–8. In Eq. 6, $I$ is the number of intervals, $J$ is the number of classes, and $N_{ij}$ is the number of samples of the $j$th class in the $i$th interval. $E_{ij}$, given in Eq. 7, is the expected number of samples of the $j$th class in the $i$th interval when the feature and the class are independent; $N_i$ is the number of samples in the $i$th interval, $N_j$ is the number of samples in the $j$th class, and $N$ is the total number of samples. Finally, $d$, given in Eq. 8, is the number of degrees of freedom of the chi-square distribution used for the test statistic [25, 26].

$$\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{\left(N_{ij} - E_{ij}\right)^2}{E_{ij}} \tag{6}$$

$$E_{ij} = \frac{N_i N_j}{N} \tag{7}$$

$$d = (I - 1)(J - 1) \tag{8}$$
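Equations 6 to 8 translate directly into a few lines of code. The sketch below assumes the feature has already been discretized into intervals and summarized as a table of counts with one row per interval and one column per class; the function name and the counts are hypothetical.

```python
def chi_square(table):
    """Return the chi-square statistic (Eq. 6) and degrees of freedom (Eq. 8)."""
    n_intervals, n_classes = len(table), len(table[0])
    row_totals = [sum(row) for row in table]                                  # N_i
    col_totals = [sum(table[i][j] for i in range(n_intervals)) for j in range(n_classes)]  # N_j
    total = sum(row_totals)                                                   # N
    stat = 0.0
    for i in range(n_intervals):
        for j in range(n_classes):
            expected = row_totals[i] * col_totals[j] / total                  # Eq. 7
            stat += (table[i][j] - expected) ** 2 / expected
    dof = (n_intervals - 1) * (n_classes - 1)
    return stat, dof


# Hypothetical counts: a feature whose intervals are strongly tied to the class.
dependent = [[30, 10], [10, 30]]
# Hypothetical counts: a feature that carries no class information.
independent = [[20, 20], [20, 20]]

print(chi_square(dependent))
print(chi_square(independent))
```

A feature with a larger statistic is more strongly associated with the class label and would be ranked higher. The sketch assumes no interval or class is empty, since an expected count of zero would cause a division by zero.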
Minimum Redundancy Maximum Relevance (mRMR)

The mRMR algorithm is an entropy-based feature selection algorithm proposed by Peng et al. in 2005 [27]. It is a filter algorithm that selects the features most strongly associated with the class labels of the data to be classified, while keeping the redundancy among the selected features low. The algorithm uses mutual information to measure the similarity between two features or between a feature and the class labels [28]. In essence, mRMR ranks all features from the most valuable to the least valuable and leaves it to the user to decide how many features to use for the classification problem. It should therefore be considered a feature ranking algorithm rather than a feature selection algorithm.
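The ranking behavior described above can be sketched with a greedy procedure on discrete features. The sketch assumes the difference form of the criterion (relevance minus mean redundancy), which is one common variant and not necessarily the one used in the study; the function names and the toy columns are hypothetical.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Mutual information (in nats) between two discrete sequences of equal length."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log((c / n) / ((count_a[u] / n) * (count_b[v] / n)))
        for (u, v), c in count_ab.items()
    )


def mrmr_rank(columns, labels):
    """Greedily rank features by relevance minus mean redundancy with those already chosen."""
    relevance = [mutual_information(col, labels) for col in columns]
    selected, remaining = [], list(range(len(columns)))
    while remaining:
        def score(f):
            if not selected:
                return relevance[f]
            redundancy = sum(
                mutual_information(columns[f], columns[s]) for s in selected
            ) / len(selected)
            return relevance[f] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 1, 1, 1, 1, 1],   # informative feature
    [0, 0, 0, 1, 1, 1, 1, 1],   # exact duplicate of the first (fully redundant)
    [0, 1, 0, 0, 1, 1, 0, 1],   # weaker but complementary feature
]
ranking = mrmr_rank(columns, labels)
print(ranking)
```

The duplicate column is ranked last even though it is as relevant as the first column, because its redundancy with the already selected feature outweighs its relevance. The user would then keep the first few entries of the ranking.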
2.1.3 Pre-trained Networks
Transfer learning is defined as a learning structure in which the features obtained by deep learning models developed for a particular purpose are used as inputs to other machine learning methods. In this study, the deep learning models AlexNet, EfficientNet B0, GoogleNet, Inception ResNet-v2, Inception-v3, DenseNet201, NASNet-Large, MobileNetV2, ResNet18, ResNet50, ResNet101, VGG16, and VGG19 were used. Table 1 lists the number of layers, depth, number of parameters, and image input size of these networks.
AlexNet

AlexNet was developed by the deep learning pioneers Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton [29]. This deep convolutional neural network has a total of 25 layers: 5 convolution layers, 3 max pooling layers, 2 dropout layers, 3 fully connected layers, 7 ReLU layers, 2 normalization layers, a softmax layer, an input layer, and a classification (output) layer. The input layer of AlexNet takes images of 227 × 227 × 3. Classification takes place in the final layer, which outputs the class assigned to the input image.

Table 1 Deep learning networks used in the study

Pre-trained model      Layer   Depth   Number of parameters (Million)   Image input size
AlexNet                25      8       61.0                             227 × 227
EfficientNet B0        290     82      5.3                              224 × 224
GoogleNet              144     22      7.0                              224 × 224
Inception ResNet-v2    825     164     55.9                             299 × 299
Inception v3           316     48      23.9                             299 × 299
Densenet201            708     201     3.5                              224 × 224
Nasnetlarge            1243    –       88.9                             331 × 331
Mobilenetv2            154     53      3.5                              224 × 224
Resnet-18              71      18      11.7                             224 × 224
Resnet-50              177     50      25.6                             224 × 224
Resnet-101             347     101     44.6                             224 × 224
VGG16                  41      16      138.0                            224 × 224
VGG19                  47      19      144.0                            224 × 224
DenseNet201

In DenseNet (Densely Connected Convolutional Network), each layer is connected to the other layers in a feed-forward manner. Each layer of the DenseNet design takes as input the feature maps of all the layers that came before it and passes its own feature maps on to the layers that come after it [30]. DenseNet topologies have the advantage of strengthening feature propagation and reducing the number of parameters by permitting feature reuse [31]. The DenseNet-121 design, for example, is composed of four dense blocks, three transition layers, and 121 layers in total (117 convolutional, 3 transition, and 1 classification); DenseNet201, used in this study, has a depth of 201 (Table 1).
MobileNetV2

MobileNet designs are built on a modular architecture that allows the development of both shallow and deep neural networks. Two simple global hyperparameters of this architecture provide an efficient trade-off between latency and accuracy. Based on the constraints of the problem, these hyperparameters allow the model builder to select an appropriately sized model for the application.
NASNet-Large

NASNet-Large is a 1243-layer convolutional neural network trained on more than one million images from the ImageNet dataset. The network can classify images into one thousand object categories, including animals, balloons, and flowers. As a result, it has learned rich feature representations for a wide range of images. The network expects input images of 331 × 331 pixels.
EfficientNet B0

EfficientNet, a CNN study published by Google in 2019, provides significant improvements in accuracy and efficiency. The efficiency model presented in that study offers a new approach because it is also applicable to other CNN models. EfficientNet-B0 is the baseline network, developed using AutoML MNAS [32]. It consists of 290 layers. The input layer of EfficientNet B0 takes images of 224 × 224 × 3, as listed in Table 1.
GoogleNet

GoogLeNet came first in the ImageNet 2014 image classification competition with a success rate of 93.33%. The GoogLeNet architecture consists of 144 layers and has shown that, given large datasets, increasing the number of layers improves classification performance. The input layer of GoogLeNet takes images of 224 × 224 × 3. To avoid the overload caused by large images, it applies filters of several sizes, such as 1 × 1, 3 × 3, and 5 × 5, at the same stage. Unlike other architectures, it processes the image through these layers in parallel rather than stacking them, because stacking was considered to bring drawbacks such as increased memory use and longer computation time [25].
Inception ResNet-V2

The Inception-ResNet-V2 architecture combines residual connections with a new version of the Inception architecture. The network makes efficient use of residual connections [33], and its feature extraction performance is quite good. In this architecture, residual units are added to each Inception module to prevent the degradation of the network gradient that usually accompanies an increase in the number of layers. The Inception ResNet-v2 architecture consists of 825 layers. Its input layer takes images of 299 × 299 × 3.
Inception V3

The Inception architecture emerged with the GoogleNet model. The GoogleNet model, proposed by Szegedy et al. (2015), tries to keep the computational cost constant while increasing depth and width. To this end, the Inception concept combines the outputs obtained by applying different convolution filters together [34]. The Inception-v3 architecture consists of 316 layers. Its input layer takes images of 299 × 299 × 3.
ResNet-18

The ResNet-18 pre-trained model, which provides rich features, was trained on more than one million images from the ImageNet dataset and takes inputs of 224 × 224. Although it has only 71 layers and a depth of 18, it has been found to give successful and faster results than some deeper models [35].

ResNet-50

The ResNet micro-architecture module differs structurally from other architectures. In some cases it is preferable to skip the transformation between certain layers and pass directly to a later layer. By allowing this, the ResNet architecture raised performance to higher levels.
The ResNet50 architecture is a network of 177 layers with a depth of 50. In addition to this layered structure, it specifies how the inter-layer connections are made [36].

ResNet-101

The ResNet-101 structure has 347 layers and a depth of 101. ResNet's bypass (skip) connection between layers is referred to as a ResBlock. Even if nothing is learned in a layer, the ResBlock makes the model more robust by carrying the information from the previous layer forward to the next layer. The ResBlock thereby addresses the vanishing gradient problem. Gradient descent is used as the optimization algorithm. The input layer of ResNet-101 takes images of 224 × 224 × 3 [36].

VGG16

The VGG16 model consists of a total of 41 layers, 16 of which have learnable weights, together with ReLU and pooling layers. The learnable layers comprise thirteen convolutional and three fully connected layers. Similar to AlexNet, the VGG16 model uses 3 × 3 filters with a stride of 1 pixel in all convolutional layers, and max pooling layers follow the convolutional layers. Max pooling uses a 2 × 2 filter with a stride of 2. To extract feature vectors, the activations of the first and second fully connected layers (fc6, fc7) were used. The fc6 and fc7 output vectors each contain 4096 features. Training uses 224 × 224 RGB images [37].

VGG19

The VGG19 network was developed by the Visual Geometry Group (VGG) at the University of Oxford. It has 19 weight layers, 16 convolutional and 3 fully connected, along with 5 max pooling layers and 1 softmax layer. The input to this network is an image of size (224, 224, 3). It has approximately 144 million trainable parameters. Filters of 3 × 3 with a stride of one pixel are used so that the whole image is covered [37].
2.1.4 Support Vector Machine
SVM is a machine learning model developed by Vapnik–Chervonenkis in 1995. It is used mainly for classification and also in clustering and regression problems. Particularly in recent years, it has been one of the most successful machine learning algorithms for solving classification problems. The basic purpose of the SVM model is to find the hyperplane that best separates the classes of the target variable from each other [38].
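The idea of a separating hyperplane can be made concrete with a small linear SVM trained by sub-gradient descent on the regularized hinge loss. This is a didactic sketch under stated assumptions, not the classifier configuration used in the study: it is linear, binary, and uses hypothetical hyperparameters (`lr`, `lam`, `epochs`) and toy data.

```python
def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=100):
    """Sub-gradient descent on the L2-regularized hinge loss; labels must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Sample is misclassified or inside the margin: move the hyperplane.
                w = [wj + lr * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Sample is safely classified: only the regularization term acts.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(w, b, x):
    """Return the side of the hyperplane w.x + b = 0 on which x lies."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data.
X = [[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print(w, b, [predict(w, b, x) for x in X])
```

A multi-class problem such as Covid-19, pneumonia, and normal would additionally need a one-vs-one or one-vs-rest scheme, and non-linear kernels are often used in practice.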
2.1.5 K-Nearest Neighbors (K-NN)
Although the k-NN classifier is a simple type of classifier, it is one of the classifiers that gives good results. It is called “simple” because it does not require any training step, which distinguishes it from other classifiers: the training data are used directly by the classifier during the classification process, without a separate training stage. Given a test sample, the k nearest neighbors of this sample in the training set are found, and the number of neighbors belonging to each class is counted. The sample is then assigned to the class with the largest number of neighbors [39]. Several mathematical formulas are available for the concept of distance in the k-NN classifier; they are given in Eqs. 9–11. In the Minkowski distance, where k denotes the order of the distance rather than the number of neighbors, choosing k = 1 yields the Manhattan distance and choosing k = 2 yields the Euclidean distance.

\[ \text{Euclidean} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{9} \]

\[ \text{Manhattan} = \sum_{i=1}^{n} |x_i - y_i| \tag{10} \]

\[ \text{Minkowski} = \left( \sum_{i=1}^{n} |x_i - y_i|^k \right)^{1/k} \tag{11} \]
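The k-NN procedure and the distances of Eqs. 9–11 can be sketched in a few lines of Python. This is a minimal illustration on made-up two-dimensional points, not the classifier configuration used in the study; the function names and the parameter `n_neighbors` (the number of neighbors, kept separate from the Minkowski order `k`) are assumptions.

```python
from collections import Counter


def minkowski(x, y, k=2):
    """Minkowski distance of order k (k=1: Manhattan, k=2: Euclidean)."""
    return sum(abs(a - b) ** k for a, b in zip(x, y)) ** (1.0 / k)


def knn_predict(train_x, train_y, sample, n_neighbors=3, k=2):
    """Classify `sample` by majority vote among its nearest training points."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: minkowski(pair[0], sample, k))
    votes = Counter(label for _, label in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]


# Toy 2-D data; the values and labels are made up for illustration.
train_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_y = ["Normal", "Normal", "Normal",
           "COVID-19", "COVID-19", "COVID-19"]

print(minkowski((0, 0), (3, 4), k=1))              # 7.0
print(minkowski((0, 0), (3, 4), k=2))              # 5.0
print(knn_predict(train_x, train_y, (4.5, 5.5)))   # COVID-19
```

Because there is no training step, all of the work happens at prediction time, when every stored training sample is compared with the test sample.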
2.1.6 Decision Trees
Decision trees allow data to be processed rapidly. They perform classification by splitting the data according to the values of particular features. For this purpose, some features are designated as inputs and some as outputs and presented to the algorithm; the tree then shows which input values lead to a given result in the output feature. One of the methods used to create a model is the EBT method. Ensemble approaches combine several learning methods to increase the prediction accuracy of individual learning algorithms. They are a linear combination of different models that produces better predictions without significantly increasing complexity. Bagging and boosting are two of the most widely used ensemble methods: bagging reduces the error variance of the base learning algorithms, whereas boosting specifically reduces their bias [40, 41].
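The bagging idea described above can be sketched as follows. The example is a simplified illustration under stated assumptions: the base learners are one-feature decision stumps rather than full trees, the one-dimensional data are made up, and the function names are hypothetical. It is not the ensemble implementation used in the study.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Fit a one-feature threshold rule (decision stump) by minimizing errors."""
    best = None
    for threshold in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= threshold else right) != y
                         for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right


def fit_bagged_stumps(xs, ys, n_models=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(xs)) for _ in xs]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagged_predict(models, x):
    """Aggregate the individual predictions by majority vote."""
    return Counter(model(x) for model in models).most_common(1)[0][0]


# Toy 1-D data (made up): small values -> class 0, large values -> class 1.
xs = [0.1, 0.4, 0.5, 0.9, 1.2, 2.8, 3.1, 3.3, 3.9, 4.2]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
models = fit_bagged_stumps(xs, ys)
print(bagged_predict(models, 0.3), bagged_predict(models, 3.5))
```

Each stump sees a slightly different resampling of the data, and averaging their votes is what reduces the variance of the combined prediction.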
Fig. 2 COVID-X-Ray scan dataset sample images (panels: Covid-19, Pneumonia, Normal)
2.2 Dataset
The dataset consists of 1061 X-ray images labeled by radiologists. It was edited after being downloaded from the Kaggle website [42, 43]. The X-ray images fall into three classes: COVID-19, Pneumonia and Normal. The dataset contains 361 COVID-19, 500 Pneumonia and 200 Normal chest X-ray images. The COVID-19 cases consist of chest X-ray images of 200 male and 161 female patients. The mean age of the patients is over 45. The images range in height from 143 to 1637 pixels (average 491 pixels) and in width from 76 to 1225 pixels (average 383 pixels). Figure 2 shows example X-ray scans of COVID-19, Normal and Pneumonia patients in the dataset.
3 Performance Measurement Metrics
The success of the machine learning classifiers was determined by the agreement between the predicted class label and the actual class value. Labeling data with a positive true class value as positive is referred to as a true positive (TP), and labeling it as negative as a false negative (FN); labeling data with a negative true class value as negative is referred to as a true negative (TN), and labeling it as positive as a false positive (FP). For the proposed method, the performance metrics were computed from the TP, TN, FP, and FN counts in the confusion matrix. Accuracy, sensitivity, specificity, precision, and F-score were used as performance measures and were computed with the following equations.

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{12} \]

\[ \text{Sensitivity} = \frac{TP}{TP + FN} \tag{13} \]

\[ \text{Specificity} = \frac{TN}{TN + FP} \tag{14} \]

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{15} \]

\[ F\text{-score} = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \tag{16} \]
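Eqs. 12–16 can be applied to a multi-class confusion matrix by treating each class in turn as the positive class (one-vs-rest). The sketch below shows this under stated assumptions: the confusion matrix is a hypothetical example whose row sums merely match the class sizes of the dataset (361, 200, 500), the function name is illustrative, and the code assumes that every class is predicted at least once so that no denominator is zero.

```python
def per_class_metrics(cm, labels):
    """One-vs-rest metrics from a confusion matrix (rows: true, cols: predicted)."""
    total = sum(sum(row) for row in cm)
    results = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                   # true class i, predicted otherwise
        fp = sum(row[i] for row in cm) - tp    # predicted class i, true otherwise
        tn = total - tp - fn - fp
        sensitivity = tp / (tp + fn)                          # Eq. 13
        specificity = tn / (tn + fp)                          # Eq. 14
        precision = tp / (tp + fp)                            # Eq. 15
        results[label] = {
            "accuracy": (tp + tn) / total,                    # Eq. 12
            "sensitivity": sensitivity,
            "specificity": specificity,
            "precision": precision,
            "f_score": 2 * precision * sensitivity
                       / (precision + sensitivity),           # Eq. 16
        }
    return results


labels = ["COVID-19", "Normal", "Pneumonia"]
# Hypothetical confusion matrix, for illustration only.
cm = [[350, 11, 0],
      [27, 173, 0],
      [0, 0, 500]]

for label, metrics in per_class_metrics(cm, labels).items():
    print(label, {name: round(100 * value, 2) for name, value in metrics.items()})
```

The one-vs-rest convention is a design choice of this sketch; other conventions for multi-class metrics exist and can give different per-class values.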
4 Experimental Studies
The experimental results in this study were obtained in the MATLAB environment, on an all-in-one computer with an i7 processor, 16 GB of RAM, and a 4 GB graphics card. The images in the dataset were resized to 224 × 224, 227 × 227, 299 × 299 and 331 × 331 pixels, and classification was performed. The convolutional neural network models AlexNet, EfficientNet B0, GoogleNet, Inception ResNet-v2, Inception-v3, DenseNet201, MobileNetV2, Nasnet-Large, ResNet18, ResNet50, ResNet101, VGG16 and VGG19 were used, together with the Chi-square, NCA, mRMR and ReliefF feature selection methods. A total of 2000 features were selected: 1000 from the FC8 layer of AlexNet and 1000 from the FC1000 layer of ResNet101. These features were reduced to 200 with the mRMR feature selection method, and the 200 features were passed to 13 different classifiers. The highest performance was obtained with SVM. Figure 3 shows the confusion matrices of the configurations with which the 13 pre-trained networks and the combined network reached their highest accuracy. The ResNet50 + AlexNet network with the Cubic SVM classifier and mRMR feature selection gave the best accuracy, 98.21%, while the Inception ResNet-v2 network with the Cubic SVM classifier and NCA feature selection gave the lowest of these, 95.00%. Figure 4 shows the accuracy values of the pre-trained networks for each classifier and feature selection method.
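The fusion and selection steps described above (concatenating the features of two networks, then keeping a subset with mRMR) can be illustrated with a small pure-Python sketch. Several assumptions apply: the feature values are made up, absolute Pearson correlation stands in for the mutual information used by mRMR proper, and all names are hypothetical. It is not the MATLAB procedure used in the study.

```python
def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def mrmr_select(columns, target, n_select):
    """Greedy mRMR-style selection: relevance minus mean redundancy.

    Absolute Pearson correlation is used as a proxy for mutual information.
    """
    relevance = [abs(pearson(col, target)) for col in columns]
    selected = []
    while len(selected) < n_select:
        best_j, best_score = None, None
        for j in range(len(columns)):
            if j in selected:
                continue
            redundancy = 0.0
            if selected:
                redundancy = sum(abs(pearson(columns[j], columns[s]))
                                 for s in selected) / len(selected)
            score = relevance[j] - redundancy
            if best_score is None or score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected


# Made-up per-sample features from two hypothetical networks (8 samples).
target = [0, 0, 0, 0, 1, 1, 1, 1]
net_a = [(0.1, 0.5), (0.2, 0.1), (0.1, 0.9), (0.3, 0.3),
         (0.9, 0.3), (0.8, 0.9), (1.0, 0.1), (0.9, 0.5)]
net_b = [(0.1, 0.3, 0.9), (0.2, 0.1, 0.2), (0.1, 0.2, 0.4), (0.3, 0.0, 0.7),
         (0.9, 0.7, 0.2), (0.8, 1.0, 0.9), (1.0, 0.8, 0.7), (0.9, 0.9, 0.4)]

fused = [a + b for a, b in zip(net_a, net_b)]   # feature fusion by concatenation
columns = list(zip(*fused))                     # one tuple per fused feature
print(mrmr_select(columns, target, 2))          # [0, 3]
```

In this toy data the first feature of `net_b` duplicates the first feature of `net_a`, so the redundancy penalty steers the second pick to a different informative feature (index 3) rather than to the duplicate (index 2).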
Fig. 3 Confusion matrices with the highest accuracy. Each panel plots True Class against Predicted Class for the classes Covid-19, Normal and Pneumonia. Panels: AlexNet-Cubic SVM; DenseNet-Cubic SVM-NCA; EfficientNet B0-Cubic SVM-mRMR; GoogleNet-Cubic SVM-NCA; Inception ResNet-v2-Cubic SVM; Inception-v3-Cubic SVM-Chi2; MobileNetV2-Subspace KNN-NCA; NasNet Large-Cubic SVM-ReliefF; ResNet-18-Cubic SVM; ResNet50-Quadratic Discriminant-NCA; ResNet101-Cubic SVM-NCA; VGG16-Cubic SVM-mRMR; VGG19-Cubic SVM-NCA; ResNet50+AlexNet-Cubic SVM-mRMR
Fig. 4 Graphs of accuracy values of pre-trained networks according to classifiers and feature selections. Panels: AlexNet; DenseNet201; EfficientNet B0; GoogleNet; Inception ResNet-v2; Inception v3; MobileNetV2; Nasnet-Large; ResNet-18; ResNet-50; ResNet-101; VGG16; VGG19. Horizontal axis: feature selection method (No, Chi2, mRMR, NCA, ReliefF); vertical axis: accuracy (%). Legend: Ensemble Subspace KNN, Quadratic Discriminant, Bilayered Neural Network, Medium Gaussian SVM, Ensemble Bagged Trees, Weighted KNN, Narrow Neural Network, Wide Neural Network, Cubic SVM, Quadratic SVM, Ensemble Boosted Trees, Fine Tree, Fine KNN
For the AlexNet network, the Cubic SVM classifier had the highest accuracy (96.42%), and the Medium Gaussian SVM classifier with mRMR feature selection had the lowest (89.2%). For the DenseNet-201 network, the Cubic SVM classifier with NCA feature selection had the highest accuracy (96.61%), and the Quadratic Discriminant classifier with Chi2 feature selection the lowest (89.6%). For the EfficientNet B0 network, the Cubic SVM classifier with NCA feature selection had the highest accuracy (96.51%), and the Fine Tree classifier with Chi2 feature selection the lowest (89.3%). For the GoogleNet network, the Cubic SVM classifier with NCA feature selection had the highest accuracy (96.06%), and the Quadratic SVM classifier with mRMR feature selection the lowest (89.7%). For the Inception ResNet-v2 network, the Cubic SVM classifier had the highest accuracy (95.0%), and the Medium Gaussian SVM classifier with NCA feature selection the lowest (89.7%). For the Inception v3 network, the Cubic SVM classifier with Chi2 feature selection had the highest accuracy (96.14%), and the Bilayered Neural Network the lowest (89.2%). For the MobileNetV2 network, the Cubic SVM classifier with Chi2 feature selection had the highest accuracy (96.14%), and the Quadratic Discriminant classifier with ReliefF feature selection the lowest (90.0%). For the Nasnet-Large network, the Cubic SVM classifier with ReliefF feature selection had the highest accuracy (96.32%), and the Medium Gaussian SVM classifier with ReliefF feature selection the lowest (89.7%). For the ResNet18 network, the Cubic SVM classifier had the highest accuracy (96.04%), and the Quadratic Discriminant classifier with ReliefF feature selection the lowest (90.0%). For the ResNet50 network, the Cubic SVM classifier with NCA feature selection had the highest accuracy (97.08%), and the Quadratic Discriminant classifier with NCA feature selection the lowest (90.1%). For the ResNet101 network, the Quadratic Discriminant classifier with NCA feature selection had the highest accuracy (96.04%), and the Fine Tree classifier with ReliefF feature selection the lowest (90.4%). For the VGG16 network, the Cubic SVM classifier with mRMR feature selection had the highest accuracy (96.42%), and the Medium Gaussian SVM classifier with NCA feature selection the lowest (90.0%). For the VGG19 network, the Quadratic Discriminant classifier with NCA feature selection had the highest accuracy (95.66%), and the Medium Gaussian SVM classifier with NCA feature selection the lowest (89.3%). Table 2 gives the sensitivity, specificity, precision and F-score results of the classifiers used in the proposed method. For the Pneumonia class, the accuracy, sensitivity, specificity, precision and F-score metrics were all 100%.
In the COVID-19 class, for the sensitivity metric, the GoogleNet network with the Cubic SVM classifier and NCA feature selection gave the best result (100%), and the Inception ResNet-v2 network with the Cubic SVM classifier and NCA feature selection gave the worst (94.18%). For the specificity metric, the ResNet50 + AlexNet network with the Cubic SVM classifier and mRMR feature selection gave the best result (98.14%), and the VGG19 network with the Cubic SVM classifier and mRMR feature selection the worst (93.71%). For the precision metric, the ResNet50 + AlexNet network with the Cubic SVM classifier and mRMR feature selection gave the best result (96.47%), and the VGG19 network with the Cubic SVM classifier and mRMR feature selection the worst (89.08%). For the F-score metric, the ResNet50 + AlexNet network with the Cubic SVM classifier and mRMR feature selection gave the best result (97.39%), and the Inception ResNet-v2 network with the Cubic SVM classifier and NCA feature selection the worst (92.77%). In the Normal class, for the sensitivity metric, the VGG19 network with the Cubic SVM classifier and mRMR feature selection gave the best result (99.77%), and the GoogleNet network with the Cubic SVM classifier and NCA feature selection the worst (96.04%). For the specificity metric, the Inception ResNet-v2 network with the Cubic SVM classifier and NCA feature selection gave the best result (98.95%), and the VGG19 network with the Cubic SVM classifier and mRMR feature selection the worst (78.00%). For the precision metric, the ResNet50 + AlexNet network with the Cubic SVM classifier and mRMR feature selection gave the best result (98.50%), and the Inception ResNet-v2 network with the Cubic SVM classifier and NCA feature selection the worst (84.00%). For the F-score metric, the ResNet50 + AlexNet network with the Cubic SVM classifier and mRMR feature selection gave the best result (98.90%), and the Inception-v3 network with the Cubic SVM classifier and Chi2 feature selection the worst (96.38%).

Table 2 Other performance metrics of classifiers
| Model | Class | Accuracy (%) | Sensitivity (%) | Specificity (%) | Precision (%) | F-Score (%) |
|---|---|---|---|---|---|---|
| AlexNet-Cubic SVM | COVID-19 | 96.42 | 96.95 | 96.14 | 92.84 | 94.85 |
| | Normal | | 98.72 | 86.50 | 96.92 | 97.81 |
| | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00 |
| DenseNet-201-Cubic SVM-NCA | COVID-19 | 96.61 | 97.51 | 96.14 | 92.88 | 95.14 |
| | Normal | | 98.95 | 86.50 | 96.93 | 97.93 |
| | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00 |
| EfficientNet B0-Cubic SVM-mRMR | COVID-19 | 96.51 | 96.12 | 96.71 | 93.78 | 94.94 |
| | Normal | | 98.37 | 88.50 | 97.36 | 97.86 |
| | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00 |
| GoogleNet-Cubic SVM-NCA | COVID-19 | 96.06 | 100.00 | 94.86 | 90.93 | 95.25 |
| | Normal | | 100.00 | 98.37 | 86.00 | 97.58 |
| | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00 |
| Inception ResNet-v2-Cubic SVM-NCA | COVID-19 | 95.00 | 94.18 | 95.43 | 91.40 | 92.77 |
| | Normal | | 97.56 | 84.00 | 96.33 | 96.94 |
| | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00 |
| Inception-v3-Cubic SVM-Chi2 | COVID-19 | 96.14 | 97.51 | 95.43 | 91.67 | 94.50 |
| | Normal | | 96.14 | 98.95 | 84.00 | 96.38 |
| | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00 |
| MobileNetV2-Subspace KNN-NCA | COVID-19 | 96.14 | 95.57 | 96.43 | 93.24 | 94.39 |
| | Normal | | 98.14 | 87.50 | 97.13 | 97.63 |
| | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00 |
| NasNet Large-Cubic SVM-ReliefF | COVID-19 | 96.32 | 96.68 | 96.14 | 92.82 | 94.71 |
| | Normal | | 98.61 | 86.50 | 96.92 | 97.75 |
| | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00 |
| ResNet18-Cubic SVM | COVID-19 | 96.04 | 96.12 | 96.00 | 92.53 | 94.29 |
| | Normal | | 98.37 | 86.00 | 96.80 | 97.58 |
| | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00 |
| ResNet101-Cubic SVM-NCA | COVID-19 | 97.08 | 95.57 | 97.86 | 95.83 | 95.70 |
| | Normal | | 98.14 | 92.50 | 98.26 | 98.20 |
| | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00 |
| ResNet50-Cubic SVM-NCA | COVID-19 | 96.04 | 97.23 | 95.43 | 91.64 | 94.35 |
| | Normal | | 98.84 | 84.00 | 96.38 | 97.59 |
| | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00 |
| VGG16-Cubic SVM-mRMR | COVID-19 | 96.42 | 94.74 | 97.29 | 94.74 | 94.74 |
| | Normal | | 97.79 | 90.50 | 97.79 | 97.79 |
| | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00 |
| VGG19-Cubic SVM-mRMR | COVID-19 | 95.66 | 99.45 | 93.71 | 89.08 | 93.98 |
| | Normal | | 99.77 | 78.00 | 95.13 | 97.39 |
| | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00 |
| ResNet50 + AlexNet-Cubic SVM-mRMR | COVID-19 | 98.21 | 98.34 | 98.14 | 96.47 | 97.39 |
| | Normal | | 99.30 | 93.50 | 98.50 | 98.90 |
| | Pneumonia | | 100.00 | 100.00 | 100.00 | 100.00 |
5 Discussion
In this section, the performance criteria (accuracy, sensitivity and specificity) of studies using pre-trained models and of the proposed method are discussed. Evaluations in the literature are usually made on combined datasets. Since the datasets and the evaluation criteria used in these studies differ, no method can be said to be strictly superior to the others. The performance scores of these methods are given in Table 3. Abbas et al. [44] developed a modified deep neural network for X-ray images in order to distinguish COVID-19 cases more effectively. Their model, called DeTraC, includes three inner layers. It was built with ResNet18 as the backbone and achieved 95.12% accuracy on the X-ray dataset. Wang et al. [45] used 44 COVID-19-positive and 55 typical viral pneumonia CT images in their study. As preprocessing, ROIs were extracted by visual inspection. The M-inception algorithm they applied obtained an accuracy of 82.9%, a sensitivity of 81%, an F1-score of 84%, an AUC of 77%, and a specificity of 90%. Alqudah et al. [46] used SVM, Random Forest and CNN in their study and achieved 95.2% accuracy, 93.3% sensitivity, 100% specificity and 100% precision. Hemdan et al. [47] proposed the COVIDX-Net deep learning classifier framework for COVID-19 diagnosis from X-ray images. They also evaluated seven distinct DCNN models, including VGG19 and DenseNet201, and showed that the VGG19 and DenseNet classifiers performed best.
Table 3 Literature studies and results
| Ref | Dataset | Method | Accuracy (%) | Sensitivity | Specificity | Precision | F-Score |
|---|---|---|---|---|---|---|---|
| Abbas et al. [44] | COVID-19 image data collection [49] | DeTraC-ResNet-18 | 95.12 | 97.97% | 91.87% | 93.36% | – |
| Wang et al. [45] | The chest x-ray images (pneumonia) [42] | VGG16, VGG19, DenseNet201, Inception_ResNet_V2, Inception_V3, Resnet50, MobileNet_V2, Xception | 82.9 | 81.00% | 77.00% | – | 84.00% |
| Alqudah et al. [46] | COVID-19 image data collection [49] | SVM, Random Forest, CNN | 95.20 | 93.30% | 100.00% | 100.00% | – |
| Hemdan et al. [47] | COVID-19 image data collection [49] | VGG19, DenseNet201, InceptionV3, ResNetV2, InceptionResNetV2, Xception, MobileNetV2 | 90.00 | – | – | 83.00% | 91.00% |
| Narin et al. [48] | COVID-19 image data collection [49] | ResNet50, InceptionV3, InceptionResNetV2 | 98.00 | – | – | 100.00% | 98.00% |
| Proposed method | COVID-19 image dataset [42, 43] | AlexNet, EfficientNet B0, GoogleNet, Inception ResNet-v2, Inception-v3, DenseNet201, MobileNetV2, Nasnet-Large, ResNet18, ResNet50, ResNet101, VGG16, VGG19 | COVID-19 = 98.21; Normal = 98.21; Pneumonia = 98.21 | 98.34%; 99.30%; 100.00% | 98.14%; 93.50%; 100.00% | 96.47%; 98.50%; 100.00% | 97.39%; 98.90%; 100.00% |
Narin et al. [48] used deep CNN-based models to classify X-ray images for COVID-19. CNN-based models (InceptionResNetV2, ResNet50, and InceptionV3) were applied to chest X-ray radiographs to detect patients infected with coronavirus pneumonia. According to their experiments, the ResNet50 model reached 98.00% accuracy. The proposed approach reached an accuracy of 98.21%, with 100% sensitivity and specificity for the Pneumonia class. For the COVID-19 class, the sensitivity, specificity, precision and F-score values were 98.34%, 98.14%, 96.47%, and 97.39%, respectively.
6 Results
The rapid spread of the COVID-19 pandemic all over the world and its negative effects on people clearly demonstrate the importance of detecting positive cases at an early stage and of intervening rapidly and correctly. In this study, a three-class dataset consisting of X-ray images obtained during the COVID-19 epidemic was classified with the transfer learning method. Preprocessing techniques were applied to the X-ray images to improve classification performance: the Sobel gradient operator was used to highlight the point regions in the X-ray images and to reduce the number of gray tones. The Chi-square, NCA, mRMR and ReliefF feature selection methods were used. First, the results of 13 pre-trained models were compared. Then, a total of 2000 features were selected from AlexNet and ResNet101. These features were reduced to 200 with the mRMR feature selection method, and the 200 features were passed to 13 different classifiers. The highest performance, 98.21%, was obtained with SVM after applying mRMR feature selection to the combined ResNet50 + AlexNet model. For the COVID-19 class, the highest accuracy (98.21%) was obtained with ResNet50 + AlexNet Cubic SVM, the highest sensitivity (100%) with the GoogleNet Cubic SVM classifier, and the highest specificity (98.14%), precision (96.47%) and F-score (97.39%) with ResNet50 + AlexNet Cubic SVM. The proposed approach shows that pre-trained CNN architectures and feature extraction methods can be used together. In addition, the study confirms that the network weights can be combined efficiently, rather than considering the performance of the feature selection methods separately. The major limitation of this study is that the method requires more powerful hardware if applied to larger datasets.
References
1. CoronaVirus Updates. (2022). https://www.worldometers.info/coronavirus/
2. Jalali, S. M. J., Ahmadian, M., Ahmadian, S., Hedjam, R., Khosravi, A., & Nahavandi, S. (2022). X-ray image based COVID-19 detection using evolutionary deep learning approach. Expert Systems with Applications, 201, 116942.
3. Dhiman, G., Chang, V., Kant Singh, K., & Shankar, A. (2022). Adopt: Automatic deep learning and optimization-based approach for detection of novel coronavirus covid-19 disease using x-ray images. Journal of Biomolecular Structure and Dynamics, 40(13), 5836–5847.
4. Roy, S. S., Goti, V., Sood, A., Roy, H., Gavrila, T., Floroian, D., Paraschiv, N., & Mohammadi-Ivatloo, B. (2014). L2 regularized deep convolutional neural networks for fire detection. Journal of Intelligent & Fuzzy Systems, 1–12.
5. Ravi, V., Narasimhan, H., Chakraborty, C., & Pham, T. D. (2022). Deep learning-based meta-classifier approach for COVID-19 classification using CT scan and chest X-ray images. Multimedia Systems, 28(4), 1401–1415.
6. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN for brain tumor classification. Applied Sciences, 10(14), 4915.
7. Biswas, R., Vasan, A., & Roy, S. S. (2020). Dilated deep neural network for segmentation of retinal blood vessels in fundus images. Iranian Journal of Science and Technology, Transactions of Electrical Engineering, 44(1), 505–518.
8. Samui, P., Roy, S. S., & Balas, V. E. (2017). Handbook of neural computation. Academic Press.
9. Javaheri, T., Homayounfar, M., Amoozgar, Z., Reiazi, R., Homayounieh, F., Abbas, E., Laali, A., Radmard, A. R., Gharib, M. H., & Mousavi, S. A. J. (2021). CovidCTNet: An open-source deep learning approach to diagnose covid-19 using small cohort of CT images. NPJ Digital Medicine, 4(1), 1–10.
10. Rehman, A., Naz, S., Khan, A., Zaib, A., & Razzak, I. (2022). Improving coronavirus (COVID-19) diagnosis using deep transfer learning. In Proceedings of international conference on information technology and applications (pp. 23–37). Springer.
11. JavadiMoghaddam, S., & Gholamalinejad, H. (2021). A novel deep learning based method for COVID-19 detection from CT image. Biomedical Signal Processing and Control, 70, 102987.
12. Chen, J., Wu, L., Zhang, J., Zhang, L., Gong, D., Zhao, Y., Chen, Q., Huang, S., Yang, M., & Yang, X. (2020). Deep learning-based model for detecting 2019 novel coronavirus pneumonia on high-resolution computed tomography. Scientific Reports, 10(1), 1–11.
13. Wu, X., Hui, H., Niu, M., Li, L., Wang, L., He, B., Yang, X., Li, L., Li, H., & Tian, J. (2020). Deep learning-based multi-view fusion model for screening 2019 novel coronavirus pneumonia: A multicentre study. European Journal of Radiology, 128, 109041.
14. Mobiny, A., Cicalese, P., Zare, S., Yuan, P., Abavisani, M., Wu, C., Ahuja, J., de Groot, P., & Van Nguyen, H. (2020). Covid R-l detection using CT scans with detail-oriented capsule networks.
15. Balaha, H. M., El-Gendy, E. M., & Saafan, M. M. (2021). CovH2SD: A COVID-19 detection approach based on Harris Hawks Optimization and stacked deep learning. Expert Systems with Applications, 186, 115805.
16. Li, L., Qin, L., Xu, Z., Yin, Y., Wang, X., Kong, B., Bai, J., Lu, Y., Fang, Z., & Song, Q. (2020). Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT. Radiology.
17. He, X., Yang, X., Zhang, S., Zhao, J., Zhang, Y., Xing, E., & Xie, P. (2020). Sample-efficient deep learning for COVID-19 diagnosis based on CT scans. Medrxiv.
18. Ahamed, K. U., Islam, M., Uddin, A., Akhter, A., Paul, B. K., Yousuf, M. A., Uddin, S., Quinn, J. M., & Moni, M. A. (2021). A deep learning approach using effective preprocessing techniques to detect COVID-19 from chest CT-scan and X-ray images. Computers in Biology and Medicine, 139, 105014.
19. Pathak, Y., Shukla, P. K., Tiwari, A., Stalin, S., & Singh, S. (2020). Deep transfer learning based classification model for COVID-19 disease. Irbm.
20. Shi, F., Xia, L., Shan, F., Song, B., Wu, D., Wei, Y., Yuan, H., Jiang, H., He, Y., & Gao, Y. (2021). Large-scale screening to distinguish between COVID-19 and community-acquired pneumonia using infection size-aware classification. Physics in Medicine & Biology, 66(6), 065031.
21. Tarabalka, Y., Chanussot, J., & Benediktsson, J. A. (2010). Segmentation and classification of hyperspectral images using watershed transformation. Pattern Recognition, 43(7), 2367–2379.
22. Gauch, J. M. (1999). Image segmentation and analysis via multiscale gradient watershed hierarchies. IEEE Transactions on Image Processing, 8(1), 69–79.
23. Yang, W., Wang, K., & Zuo, W. (2012). Neighborhood component feature selection for high-dimensional data. Journal of Computers, 7(1), 161–168.
24. Robnik-Šikonja, M., & Kononenko, I. (2003). Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning, 53(1), 23–69.
25. Liu, H., Li, J., & Wong, L. (2002). A comparative study on feature selection and classification methods using gene expression profiles and proteomic patterns. Genome Informatics, 13, 51–60.
26. McHugh, M. L. (2013). The chi-square test of independence. Biochemia Medica, 23(2), 143–149.
27. Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1226–1238.
28. Ding, C., & Peng, H. (2005). Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology, 3(02), 185–205.
29. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90.
30. Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4700–4708).
31. Iandola, F., Moskewicz, M., Karayev, S., Girshick, R., Darrell, T., & Keutzer, K. (2014). Densenet: Implementing efficient convnet descriptor pyramids. Preprint at arXiv:1404.1869.
32. Tan, M., & Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, PMLR (pp. 6105–6114).
33. Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-first AAAI conference on artificial intelligence.
34. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1–9).
35. Ou, X., Yan, P., Zhang, Y., Tu, B., Zhang, G., Wu, J., & Li, W. (2019). Moving object detection method via ResNet-18 with encoder–decoder structure in complex scenes. IEEE Access, 7, 108152–108160.
36. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).
37. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. Preprint at arXiv:1409.1556.
38. Vapnik, V. (1999). The nature of statistical learning theory. Springer science & business media.
39. McRoberts, R. E., Tomppo, E. O., Finley, A. O., & Heikkinen, J. (2007). Estimating areal means and variances of forest attributes using the k-Nearest Neighbors technique and satellite imagery. Remote Sensing of Environment, 111(4), 466–480.
40. Bühlmann, P. (2012). Bagging, boosting and ensemble methods. In Handbook of computational statistics (pp. 985–1022). Springer.
41. Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
42. COVID-19 chest xray. (2022). https://www.kaggle.com/bachrr/covid-chest-xray
43. Chest X-Ray Images (Pneumonia). (2022). Retrieved from https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia
44. Abbas, A., Abdelsamea, M. M., & Gaber, M. M. (2021). Classification of COVID-19 in chest X-ray images using DeTraC deep convolutional neural network. Applied Intelligence, 51(2), 854–864.
45. Wang, S., Kang, B., Ma, J., Zeng, X., Xiao, M., Guo, J., Cai, M., Yang, J., Li, Y., & Meng, X. (2021). A deep learning algorithm using CT images to screen for Corona Virus Disease (COVID-19). European Radiology, 31(8), 6096–6104.
46. Alqudah, A. M., Qazan, S., Alquran, H., Qasmieh, I. A., & Alqudah, A. (2020). COVID-2019 detection using X-ray images and artificial intelligence hybrid systems. Biomedical Signal and Image Analysis and Project.
47. Hemdan, E. E.-D., Shouman, M. A., & Karar, M. E. (2020). Covidx-net: A framework of deep learning classifiers to diagnose covid-19 in x-ray images. Preprint at arXiv:2003.11055.
48. Narin, A., Kaya, C., & Pamuk, Z. (2021). Automatic detection of coronavirus disease (covid-19) using x-ray images and deep convolutional neural networks. Pattern Analysis and Applications, 24(3), 1207–1220.
49. Cohen, J. P., Morrison, P., Dao, L., Roth, K., Duong, T. Q., & Ghassemi, M. (2020). Covid-19 image data collection: Prospective predictions are the future. Preprint at arXiv:2006.11988.
Image Captioning Using Deep Transfer Learning Tapan Kumar Das
1 Introduction
Generating a textual description of an image is an easy task for a human being; for a machine, however, explaining an image requires computer vision to interpret the image and NLP to describe it [1]. Hence, in order to generate a caption automatically for a particular photograph, the system must be trained to recognise the content of the image and then to express that content in natural language [2]. With the advent of deep learning methods, especially for image feature extraction and processing [3], this problem has been addressed swiftly. Deep learning techniques such as the convolutional neural network (CNN) are widely used for image processing tasks because of their ability to deal with millions of underlying features [4]. It is well established that CNN techniques are quite efficient for a variety of medical image processing tasks, e.g. COVID-19 lung CT scans [5], MRI images for brain tumor diagnosis [6, 7], retinal blood vessels [8], angiograms [9], chest X-rays [10] and many more. On seeing the picture shown in Fig. 1, some of us might say “A Little is talking brown guiding grassy”, some might say “Little boy is playing with toys”, and others might say “A little boy is designing the house”. All of these observations are valid, and further captions are also possible. Producing them requires no special training or effort for a human being; this is not the case for a machine, however, which cannot produce an appropriate description just by glancing at an image.
T. K. Das (B) School of Information Technology and Engineering, Vellore Institute of Technology, Vellore 632014, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis, Studies in Big Data 129, https://doi.org/10.1007/978-981-99-3784-4_3
Fig.1 A sample image
This study of generating captions for images has the following significance.
• The experiments are based on transfer learning coupled with a Convolutional Neural Network (CNN).
• We aim to boost model performance by making subtle changes to the model's block diagram.
• The objective is to produce semantically and syntactically sound captions for the input images by using phrases, instead of words, as elementary units.
Motivation This problem is immensely useful in real-world applications. A few applications where this study is relevant are listed below:
• Self-driving cars: By automatically and readily generating a caption for the scene around the car, the self-driving system could move closer to full autonomy.
• Aid to the blind: A product that guides blind persons while they walk on the road would meet a real need. This is possible by converting the surrounding scene into text and then converting the text into voice.
• Google image search: As with Google search, image search could become popular if an image were first transformed into a caption and the underlying text were then searched.
2 Related Studies Different techniques for image captioning exist; they include retrieval-based and template-based methods. Recently, deep learning based captioning has become very popular due to the quality and appropriateness of the textual descriptions it produces. Deep learning based attention mechanisms also deliver promising results in captioning [11]. Most of the models are encoder-decoder based, and LSTM and bidirectional LSTM networks are used as the decoder in most systems [12]. Similarly, for encoding, VGG16 and ResNet50 are employed for their effectiveness in vectorising images [13]. A few studies on image captioning that have used deep learning for image processing and text description are presented in Table 1.
Table 1 Contemporary studies on caption generation using deep learning methods
Studies | Objective | Methodology | Result
Chen et al. [14] | Mapping between images and their textual descriptions | Generating the caption using a recurrent neural network | Capable of generating novel captions
Sharma et al. [15] | Image captioning by integrating visual and external knowledge | The methodologies used are LSTM, CNN, and knowledge from an external source | Extracted classes belong to the True class in general
You et al. [16] | Image captioning with semantic attention | Combines both top-down and bottom-up strategies | State-of-the-art performance as compared to standard benchmarks
Rampal et al. [17] | Image captioning using neural network compression | VGG16 or ResNet50 as an encoder, LSTM as decoder, and the Flickr8k dataset | As compared to the uncompressed model, achieves a 73.1% reduction in model size and a 7.7% increase in BLEU score
Arnav et al. [18] | Image captioning using deep learning | Two input streams are merged and passed to an LSTM layer | Results show 5 sentences, generated using a beam size of 5, along with the average log probability of the sequence of words
Yao et al. [19] | Boosting image captioning with attributes | Devised the CNN plus RNN architecture to generate descriptions | Examines image representations and high-level attributes
Singh et al. [20] | Image captioning using artificial intelligence | CNN, RNN-LSTM | Discussed various algorithms such as CNN, RNN, and LSTM
Wang et al. [21] | Image captioning | CNN and two separate LSTM networks | Achieved high performance
3 Methodology We used a combined CNN-RNN model to extract features from the images and the text. We then used an evaluation model to check the accuracy of the proposed model, and tracked its performance at each epoch by means of the error rate. We follow a top-down approach and use transfer learning both to extract features and to train the model so as to obtain accurate captions. In fact, transfer learning is applied twice in our model: InceptionV3 extracts features from the images, and GloVe extracts features from the text (captions). Finally, we test the model on held-out test images to assess its accuracy. The detailed methodology consists of the following steps:
• Data collection.
• Data cleaning and pre-processing.
• Pre-processing yields a vocabulary of 1652 unique words from the training dataset. We employed the InceptionV3 transfer learning model.
• We encoded all the training and testing images that are input to our model.
• After removing the stop-words during data cleaning, we have 7578 words in our vocabulary.
• We also used a transfer learning model (GloVe) to extract features from our pre-processed text data.
• We then built and trained our network. Finally, we evaluated its performance on the test data.
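The data cleaning and vocabulary steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the chapter's code: the helper names `clean_caption` and `build_vocabulary`, the `min_count` threshold, and the two sample captions are all hypothetical and chosen only for the example.

```python
import string
from collections import Counter


def clean_caption(text):
    """Lowercase a caption, strip punctuation, and drop one-letter or non-alphabetic tokens."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return [tok for tok in tokens if len(tok) > 1 and tok.isalpha()]


def build_vocabulary(captions, min_count=1):
    """Return the sorted list of words that occur at least min_count times (assumed threshold)."""
    counts = Counter(tok for cap in captions for tok in clean_caption(cap))
    return sorted(word for word, n in counts.items() if n >= min_count)


# Hypothetical sample captions, used only to exercise the helpers.
captions = [
    "A little boy is playing with toys .",
    "A little boy is designing the house!",
]
vocab = build_vocabulary(captions)
# Index 0 is left free for padding, a common convention assumed here.
word_to_index = {word: i + 1 for i, word in enumerate(vocab)}
print(vocab)
```

Raising `min_count` trims rare words from the vocabulary, which is one common way to keep the decoder's output layer small.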
3.1 Dataset We used the Flickr8k dataset, which contains around 8000 images: 6000 images are used for training the model, 1000 for validation and the remaining 1000 for testing, in order to determine the model's efficiency. Each image has five captions (Fig. 2). Figure 3 exhibits a few sample images from the Flickr8k dataset, and Fig. 4 shows that each image has five different captions. The Flickr8k dataset is loaded from the repository, and the data is then pre-processed by removing extra whitespace, punctuation, and other distractions. For encoding, a CNN is used. The input image is fed to the CNN to extract the features. After the features are processed by a series of layers, the last hidden state of the CNN is connected to
Fig. 2 Encoder-decoder based image captioning process
Fig. 3 Sample images in the dataset
Fig. 4 Caption for the images
the decoder. In this framework, an RNN serves as the decoder, which performs language modelling at the word level. A schematic diagram of the encoder-decoder based image captioning process is shown in Fig. 2.
3.2 Inception Model for Images Here we used a pretrained Inception V3 model to extract features from the images. Inception V3 is a widely used image recognition model that has been shown to attain greater than 78.1% accuracy on the standard ImageNet dataset. The architecture of Inception V3 is depicted in Fig. 5.
Fig. 5 Inception V3 architecture diagram
The encoding and decoding processes, together with the detailed layers and parameters of the corresponding models, are shown in Figs. 6 and 7 respectively. A summary of the caption model, showing the detailed network layers and the total number of parameters trained, is given in Fig. 8.
Fig. 6 Encoding the model summary
Fig. 7 Decoding the vectored model summary
4 Result The main objective is to predict the caption for a given image. For prediction, we applied a deep learning based model and focussed mainly on how well it predicts captions for the images in the dataset. To evaluate the quality of the generated text, we used BLEU (Bilingual Evaluation Understudy), which works by matching each generated text against a set of reference texts written by humans. The result is a score that reflects the overall quality of the generated text. We achieved a BLEU score of 0.645 on the dataset considered.
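To make the matching principle behind BLEU concrete, the sketch below computes clipped n-gram precision against several references and combines it with a brevity penalty. It is a simplified sentence-level version written for illustration; the function names and the toy sentences are assumptions, and it is not the evaluation script that produced the score reported above.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in any single reference."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())


def bleu(candidate, references, max_n=4):
    """Geometric mean of the n-gram precisions times the brevity penalty (no smoothing)."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    c = len(candidate)
    # Reference length closest to the candidate length (shorter one on ties).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


candidate = "a little boy is playing".split()
references = [
    "a little boy is playing with toys".split(),
    "little boy is playing".split(),
]
print(bleu(candidate, references))
```

Clipping is what stops a degenerate caption such as "the the the" from scoring well merely by repeating a common reference word.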
4.1 Sample Output To test the effectiveness of the designed model, we ran it on a few images from the Flickr8k dataset; the output captions obtained are shown in Figs. 9, 10, 11, 12 and 13.
Fig. 8 Caption model summary
Fig. 9 Sample output for image 1
Fig. 10 Sample output for image 2
Fig. 11 Sample output for image 3
Fig. 12 Sample output for image 4
Fig. 13 Sample output for image 5
5 Conclusion In this chapter, we carried out the image captioning task by integrating two deep learning techniques, i.e. a CNN with an RNN. For training the encoder-decoder model, we used the Flickr8k dataset. The trained model achieved a BLEU score of 0.645 when tested on unseen images of the dataset. The efficiency of content-based image retrieval depends on the quality of the textual description of the image. Image caption generation can therefore widen the scope of application areas such as medicine, security and other fields where the underlying image carries implicit meaning. Moreover, the image captioning framework can automate and promote large-scale image annotation, which can lead on to video captioning and video dialog.
References 1. Sharma, H., Agrahari, M., Singh, S. K., Firoj, M., & Mishra, R. K. (2020). Image captioning: A comprehensive survey. In 2020 International Conference on Power Electronics & IoT Applications in Renewable Energy and its Control (PARC) (pp. 325–328). IEEE. 2. Stefanini, M., Cornia, M., Baraldi, L., Cascianelli, S., Fiameni, G., & Cucchiara, R. (2022). From show to tell: a survey on deep learning-based image captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence. 3. Hossain, M. Z., Sohel, F., Shiratuddin, M. F., & Laga, H. (2019). A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CsUR), 51(6), 1–36. 4. Chohan, M., Khan, A., Mahar, M. S., Hassan, S., Ghafoor, A., & Khan, M. (2020). Image captioning using deep learning: A systematic. Image, 11(5).
5. Tiwari, R. S., Das, T. K., Srinivasan, K., & Chang, C. Y. (2022). Conceptualising a channel-based overlapping CNN tower architecture for COVID-19 identification from CT-scan images. Scientific Reports, 12(1), 1–15. 6. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN for brain tumor classification. Applied Sciences, 10(14), 4915. 7. Das, T. K., Roy, P. K., Uddin, M., Srinivasan, K., Chang, C. Y., & Syed-Abdul, S. (2021). Early tumor diagnosis in brain MR images via deep convolutional neural network model. Computers, Materials and Continua, 68(2), 2413–2429. 8. Biswas, R., Vasan, A., & Roy, S. S. (2020). Dilated deep neural network for segmentation of retinal blood vessels in fundus images. Iranian Journal of Science and Technology, Transactions of Electrical Engineering, 44(1), 505–518. 9. Roy, S. S., Hsu, C., Samaran, A., Goyal, R., Pande, A., et al. (2023). Vessels segmentation in angiograms using convolutional neural network: A deep learning based approach. CMES-Computer Modeling in Engineering & Sciences, 136(1), 241–255. 10. Das, T. K., Chowdhary, C. L., & Gao, X. Z. (2020). Chest X-ray investigation: a convolutional neural network approach. Journal of Biomimetics, Biomaterials and Biomedical Engineering, 45, 57–70. Trans Tech Publications Ltd. 11. Zohourianshahzadi, Z., & Kalita, J. K. (2022). Neural attention for image captioning: Review of outstanding methods. Artificial Intelligence Review, 55(5), 3833–3862. 12. Wang, C., Yang, H., Bartz, C., & Meinel, C. (2016). Image captioning with deep bidirectional LSTMs. In Proceedings of the 24th ACM International Conference on Multimedia (pp. 988–997). 13. Rampal, H., & Mohanty, A. (2020). Efficient CNN-LSTM based image captioning using neural network compression. Preprint retrieved from arXiv:2012.09708. 14. Chen, X., & Zitnick, C. L. (2014). Learning a recurrent visual representation for image caption generation. Preprint retrieved from arXiv:1411.5654. 15. Sharma, H., & Jalal, A. S. (2020). Incorporating external knowledge for image captioning using CNN and LSTM. Modern Physics Letters B, 34(28), 2050315. 16. You, Q., Jin, H., Wang, Z., Fang, C., & Luo, J. (2016). Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4651–4659). 17. Rampal, H., & Mohanty, A. (2020). Efficient CNN-LSTM based image captioning using neural network compression. Preprint retrieved from arXiv:2012.09708. 18. Arnav, J. H., & Pulkit, M. (2018). Image captioning using deep learning. 19. Yao, T., Pan, Y., Li, Y., Qiu, Z., & Mei, T. (2017). Boosting image captioning with attributes. In Proceedings of the IEEE International Conference on Computer Vision (pp. 4894–4902). 20. Singh, Y. P., Ahmed, S. A. L. E., Singh, P., Kumar, N., & Diwakar, M. (2021). Image captioning using artificial intelligence. In Journal of Physics: Conference Series (Vol. 1854, No. 1, p. 012048). IOP Publishing. 21. Wang, C., Yang, H., Bartz, C., & Meinel, C. (2016). Image captioning with deep bidirectional LSTMs. In Proceedings of the 24th ACM International Conference on Multimedia (pp. 988–997).
Vehicle Over Speed Detection System K. Ganesan, N. S. Manikandan, and Vijayan Sugumaran
1 Introduction Every year, many people around the world die in vehicle accidents, which are among the most common causes of death. Accidents not only kill people but also injure a large number of them. Among the several causes of accidents, excessive vehicle speed is the most important, so high-speed vehicles must be managed. Consequently, government organisations, academic institutions, and automobile manufacturers have begun various studies and projects to lower the likelihood of accidents and provide safety to passengers and drivers. Researchers have used different mechanisms to detect vehicle over-speed on highways, such as VANET technology connected to a cloud server [1], video-based region of interest (ROI) processing [2], and speed prediction based on electronic toll collection data [3]. To manage high-speed vehicles on the highway, the Tamil Nadu government planned to install an over-speed detection device at toll plazas. Figure 1 depicts a block diagram of over-speed detection at a toll plaza. This architecture is made up of a vehicle detection system, a common cloud server that is linked to an RTO server, and an over-speed detection system.
K. Ganesan (B) Professor, Higher Academic Grade, School of Information Technology and Engineering, Vellore Institute of Technology (VIT), Vellore 632014, Tamil Nadu, India e-mail: [email protected] N. S. Manikandan Senior System Architect, TIFAC-CORE Automotive Infotronics, Vellore Institute of Technology, Vellore 632014, Tamil Nadu, India V. Sugumaran Distinguished Professor of Management Information Systems, School of Business Administration, Oakland University, Rochester, MI, USA © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis, Studies in Big Data 129, https://doi.org/10.1007/978-981-99-3784-4_4
Fig. 1 Block diagram of the proposed system for high speed detection
Neural computation [6] and deep learning [7] enable state-of-the-art applications in many domains, including vehicle detection from satellite images [4], tumour extraction from medical images [5], and many others. Vehicle detection and license plate detection and recognition play a major role in the initial stages of the vehicle over-speed detection system. In particular, the system must detect Indian vehicles and extract Indian license plate information. For Indian vehicle detection at the toll plaza, Rajput et al. [8] used YOLOv3 object detection and classification to detect and classify vehicles. They classified six types of vehicles, each of which can be charged a separate toll. Their findings at the toll plaza showed an average recall of 86.3% and a precision of 94.1%. An important role of this system is to locate and extract vehicle license plate information. Many researchers have used deep learning techniques [9–15]; some have used image processing techniques [16, 17]. However, locating and extracting Indian vehicle license plates is often complicated by improper positioning, damage, occlusion, or the use of unauthorised fonts [18]. For Indian license plate localization and data extraction, Jagtap et al. [19] proposed a novel method for detecting license plates with various font styles. It relies on adaptive image segmentation in conjunction with Artificial Neural Network (ANN) character recognition. The approach combines morphological operations with horizontal and vertical edge histograms to accomplish plate localization
and character segmentation. To recognize characters, a two-layer feed-forward back-propagation ANN is used. The results show an overall accuracy of 89.5%. To handle Indian license plate irregularities, Ravirathinam et al. [20] built a pipeline using a number of Faster Region-based Convolutional Neural Networks to address the Indian situation in a variety of scenarios. Because there is no publicly accessible dataset for Indian license plates, they created a balanced dataset using frames from videos and images from mobile devices, accounting for all the irregularities. Their pipeline achieved an overall total correctness of 88.5% and a partial correctness of 10% for Indian plates. The overall correctness increased to 91% with the addition of a new heuristics system. The accuracy of license plate detection for all kinds of vehicles was 94.98%. The extracted license plate information is sometimes incorrect. For OCR correction in chaotic Indian traffic videos with complicated license plate patterns, Singh et al. [18] proposed a modular framework. It starts from the output of a state-of-the-art deep learning model trained on video frames. Because it reads text from videos rather than from single images, the framework uses multi-frame consensus to generate suggestions. To aid the correction process, their human-interactive framework first uses an object detector and a tracker to split the multi-vehicle videos into multiple clips, each containing a single vehicle. The framework then offers recommendations for each vehicle using multi-frame consensus. The user is shown interactive suggestions for selected clips only, allowing them to verify or correct the predictions quickly and easily. This high-quality output can be used to continuously update a sizable surveillance database, which will improve the accuracy of deep models in difficult real-world scenarios. 
Regarding cloud platforms, Khan et al. [21] proposed an IoT-based system that uses two detection points with surveillance cameras to measure the average speed between them. To enforce speed limits, the measured data is sent to the cloud for additional processing. Entry and exit points are used to detect any anomaly in a particular area. For instance, the failure of a car to reach the exit point after passing through the entry point can be flagged. The system is made up of a mobile phone application and a web network that exchange real-time data, including information about passing vehicles such as entry time, pictures, and license plate registration numbers. Such a system has the advantages of requiring little human involvement, requiring fewer speed guns, and monitoring vehicles even when they are not in the camera’s field of view. The speed limit between two toll gates is determined by traffic density or government traffic rules and regulations. However, the roads between toll plazas are generally curvy and have speed limits. The majority of cloud-based vehicle over-speed detection systems are unaware of road curvature. For extracting data about horizontal curves from road GIS maps, Li et al. [22] present a fully automated method. Their methodology has four aims: (a) regardless of the type of curve, each road’s curves in the selected road’s surface layers are identified; (b) each curve is automatically classified as either simple or compound; (c) each simple curve’s radius, degree of curvature, and length, and each compound curve’s radius, are
all automatically determined; and (d) curve characteristics and layers are automatically created in the GIS for all detected curves. Using their technique, 96.7% of curves were correctly identified and their geometric information was computed. However, the existing road curvature extraction method does not account for curvature noise or for curvature in hilly terrain. Thus, existing over-speed detection systems have gaps: they are aware neither of the curvature of the highway nor of curvature noise. To bridge these gaps, the proposed system includes the following features:
• The YOLO object detection model is used for vehicle detection and vehicle type extraction.
• An image processing technique is used to locate and extract license plates from detected vehicle images.
• The information on the localised license plate is extracted using the CRNN deep learning text extraction model.
• The proposed curvature-aware travel time estimation model calculates the travel time between two toll plazas, and the cloud-based system detects over-speeding vehicles.
The remainder of this chapter is arranged as follows: Sect. 2 describes the vehicle detection and license plate extraction system, which is subdivided into vehicle detection and type classification, license plate localization, and license plate recognition, followed by the travel time estimation and over-speed detection system. The latter is further subdivided into the new curve finding method, curve speed limit database creation, curve-aware travel time estimation, and vehicle over-speed detection. Section 3 discusses the results of vehicle detection, license plate localization and text extraction, the new curve finding method, curve-aware travel time estimation, and vehicle over-speed detection. Finally, Sect. 4 provides the conclusion and future work.
2 Proposed Model Figure 2 depicts the proposed system’s architecture. This system has been subdivided into three subsystems. The first subsystem detects vehicles at toll gates and extracts license plate information as well as vehicle type. The second subsystem uses a road curvature extraction module, a curve aware speed limitation module, and a curvature aware travel time estimator to characterize the curvature aware journey time between two toll gates. Over speed detection is the third and final subsystem. It is made up of a toll gate system and a common cloud server infrastructure. The three subsystems are briefly described below.
Fig. 2 The architecture of the proposed system
2.1 Vehicle Detection and License Plate Extraction System This system consists of three modules: vehicle detection and vehicle type classification, license plate localization, and license plate recognition. Details of each of these modules are provided below.
2.1.1 Vehicle Detection and Type Classification: YOLO
YOLO [23] applies a single CNN to the entire image and divides the image into an M × M grid. For each grid cell, B bounding boxes and their associated confidence scores are predicted. The bounding boxes are then assessed with the class confidence score, which is the product of the conditional class probability and the box confidence score and therefore measures the level of certainty in both classification and localization. The mathematical definitions are as follows:
\[ \text{box confidence score} \equiv \Pr(\text{object}) \cdot \mathrm{IoU} \]
\[ \text{conditional class probability} \equiv \Pr(\text{class}_i \mid \text{object}) \]
\[ \text{class confidence score} \equiv \Pr(\text{class}_i) \cdot \mathrm{IoU} = \Pr(\text{object}) \cdot \mathrm{IoU} \times \Pr(\text{class}_i \mid \text{object}) \tag{1} \]
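The score in Eq. (1) can be checked numerically with a short sketch. The box format `(x_min, y_min, x_max, y_max)`, the function names, and the sample values are assumptions made for illustration; the 0.30 cut-off mirrors the 30% threshold mentioned in this section.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def class_confidence(p_object, p_class_given_object, iou_value):
    """Eq. (1): box confidence (Pr(object) * IoU) times the conditional class probability."""
    box_confidence = p_object * iou_value
    return box_confidence * p_class_given_object


SCORE_THRESHOLD = 0.30  # predictions scoring below this are discarded


def keep_detection(score, threshold=SCORE_THRESHOLD):
    """Return True when a detection survives the confidence threshold."""
    return score >= threshold


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))          # two 2x2 boxes sharing a 1x1 corner
score = class_confidence(0.9, 0.8, 0.5)
print(overlap, score, keep_detection(score))
```

Because the score is a product, a box is kept only when the detector is reasonably sure both that an object is present and that it belongs to the given class.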
In Eq. (1), Pr(object) denotes the likelihood that an object is present in the box, and IoU is the intersection over union between the predicted box and the ground truth. Pr(class_i | object) is the probability that an object belongs to class i given that an object is present, and Pr(class_i) is the probability that the object belongs to class i. YOLO resizes an input image to 448 × 448 pixels. The image is then passed through a convolutional network, yielding a 7 × 7 × 30 tensor. The tensor contains (1) the coordinates of each bounding box rectangle and (2) the probability distribution over all classes for which the system has been trained. Predictions whose confidence score (probability) is below 30% are eliminated. To calculate the loss when comparing predictions with the ground truth, YOLO uses the sum-squared error. The loss function has three parts: the classification loss; the localization loss, which is the error between the predicted bounding box and the ground truth; and the confidence loss, which includes a separately weighted term for the boxes that do not contain any object. The overall formula is:
\[
\begin{aligned}
&\lambda_{coord} \sum_{i=0}^{s^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{s^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{s^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{obj} (c_i - \hat{c}_i)^2 + \lambda_{noobj} \sum_{i=0}^{s^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{noobj} (c_i - \hat{c}_i)^2 \\
&+ \sum_{i=0}^{s^2} \mathbf{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
\tag{2}
\]
where s × s is the grid size (M × M above), and the indicator 1^{obj}_{ij} equals 1 when the j-th bounding box in cell i is responsible for an object and 0 otherwise.
2.1.2 License Plate Localization
Finding the location of the license plate in the vehicle image is a critical task. Grayscale conversion, thresholding, and morphological operations such as dilation and erosion are used to localize the plate. The Canny edge detector is then used to detect the license plate edges, and the located license plate is cropped from the vehicle image [24].
2.1.3 License Plate Recognition: CRNN
The CRNN [25] consists of a CNN, a bi-directional LSTM, and a CTC layer, and can be viewed as an encoder-decoder structure. The CNN is the feature sequence encoder, which produces image feature sequences. The decoder, made up of the bi-directional LSTM and CTC layers, produces character sequences.
To maintain the original aspect ratio, the CNN input is resized to a width of (W × 32)/H pixels and a height of 32 pixels, where W and H are the width and height of the original image. Because characters are tall and thin, with a height greater than their width, the later pooling layers use a stride of 2 × 1 rather than 2 × 2. As a result, each point of the final feature map corresponds to a tall and thin receptive field in the original image. The input image is downsampled by two pooling layers with a 2 × 2 stride and three pooling layers with a 2 × 1 stride. The final dimension of the feature map is b × 1 × [(W × 8)/H] × C, where b is the batch size, 1 is the height, (W × 8)/H is the width, and C is the number of channels. The structure of the CNN used for feature extraction is displayed in Table 1.
Table 1 CNN network of CRNN
Layers | CNN network | Output size
Conv1 | (3 × 3 conv) × 6 | 32 × [(W × 32)/H]
Connection layer | Relu, 1 × 1 conv, dropout | 32 × [(W × 32)/H]
 | 2 × 2 average pool, stride 2 × 2 | 16 × [(W × 16)/H]
Conv2 | (3 × 3 conv) × 6 | 16 × [(W × 16)/H]
Connection layer | Relu, 1 × 1 conv, dropout | 16 × [(W × 16)/H]
 | 2 × 2 average pool, stride 2 × 2 | 8 × [(W × 8)/H]
Conv3 | (3 × 3 conv) × 6 | 8 × [(W × 8)/H]
Connection layer_1 | Relu, 1 × 1 conv, dropout | 8 × [(W × 8)/H]
 | 2 × 1 average pool, stride 2 × 1 | 4 × [(W × 8)/H]
Conv4 | (3 × 3 conv) × 6 | 4 × [(W × 8)/H]
Connection layer_1 | Relu, 1 × 1 conv, dropout | 4 × [(W × 8)/H]
 | 2 × 1 average pool, stride 2 × 1 | 2 × [(W × 8)/H]
Conv5 | (3 × 3 conv) × 6 | 2 × [(W × 8)/H]
Connection layer_1 | Relu, 1 × 1 conv, dropout | 2 × [(W × 8)/H]
 | 2 × 1 average pool, stride 2 × 1 | 1 × [(W × 8)/H]
The CRNN decoder is composed of the bi-directional LSTM layer and the CTC layer. The bi-directional LSTM takes the column vectors of the feature map as its input. Its output is a probability matrix of size [(W × 8)/H] × C, where C here is the number of character labels (the 26 English uppercase letters, the 26 English lowercase letters, and a space); it gives the probability of each character for each column vector. The width of the extracted feature map is (W × 8)/H. The likelihood of a label sequence is determined by applying the CTC layer to the bi-directional LSTM's output. During training, this likelihood is the conditional probability defined in the CTC layer, and its negative log-likelihood serves as the loss function for training the network. The CTC layer sums the probabilities of all paths that map to the true label sequence. For example, the paths ‘hee-ll-o’ and ‘hh-ee-ll-oo’ (where ‘-’ signifies a space) both yield the label sequence ‘helo’ once duplicates and spaces are eliminated. At test time, the recognition result is the character sequence with the highest probability.
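The path-to-label mapping described above can be illustrated with a few lines of code. This is a sketch, not the chapter's decoder: the function names are hypothetical, and `greedy_decode` is a common approximation (pick the most likely label per column, then collapse) rather than an exact search for the most probable label sequence.

```python
from itertools import groupby


def ctc_collapse(path, blank="-"):
    """Merge consecutive repeated symbols, then remove the blank symbol."""
    merged = (symbol for symbol, _ in groupby(path))
    return "".join(symbol for symbol in merged if symbol != blank)


def greedy_decode(prob_columns, alphabet, blank="-"):
    """Take the arg-max label in every column of the probability matrix and collapse the path."""
    path = "".join(
        alphabet[max(range(len(column)), key=column.__getitem__)]
        for column in prob_columns
    )
    return ctc_collapse(path, blank)


print(ctc_collapse("hee-ll-o"))      # both example paths from the text
print(ctc_collapse("hh-ee-ll-oo"))   # collapse to the same label sequence

# Toy probability matrix over the alphabet "-ab" (index 0 is the blank).
columns = [
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.9, 0.05, 0.05],
    [0.2, 0.1, 0.7],
]
print(greedy_decode(columns, "-ab"))
```

A doubled letter survives only when a blank separates its two occurrences, so the path 'hel-lo' collapses to 'hello' whereas 'hee-ll-o' collapses to 'helo'.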
2.2 Travel Time Estimating and Over-Speed Detection System This system has four modules: road curvature identification, curve speed restriction declaration, curve aware travel time computation, and vehicle overspeed detection. They are described in detail below.
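One way the four modules could fit together is sketched below. It is an illustrative assumption rather than the chapter's implementation: the route is taken to be a list of `(length_m, speed_limit_kmh)` segments (straight stretches and curves, each with its own limit), the shortest legal travel time is the sum of the per-segment times, and a vehicle is flagged when it covers the route faster than that. The function names, the `tolerance_s` parameter, and the sample numbers are hypothetical.

```python
def min_travel_time_s(segments):
    """Shortest legal travel time in seconds for (length_m, speed_limit_kmh) segments."""
    return sum(length_m / (limit_kmh / 3.6) for length_m, limit_kmh in segments)


def is_over_speed(entry_time_s, exit_time_s, segments, tolerance_s=0.0):
    """Flag a vehicle whose measured travel time is below the shortest legal time."""
    actual = exit_time_s - entry_time_s
    return actual + tolerance_s < min_travel_time_s(segments)


# Hypothetical route: 9 km of straight road at 90 km/h and a 1 km curve at 60 km/h.
route = [(9000.0, 90.0), (1000.0, 60.0)]
print(min_travel_time_s(route))        # about 420 seconds
print(is_over_speed(0.0, 400.0, route))
print(is_over_speed(0.0, 450.0, route))
```

Treating curves as separate, slower segments lengthens the minimum legal time, so a vehicle that merely keeps to the straight-road limit through the curves would still be flagged.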
2.2.1 New Curve Detection Algorithm
A path from the source to the destination is constructed. As shown in Fig. 3a, the path is made up of a series of segment points (S1 to S9), with consecutive points connected by straight lines. (Note: in India, all vehicles drive on the left side [26].)
Fig. 3 a Google map road segment points b Identified curve
The equations for detecting a curve from the sequence of segment points are given below. Before calculating the radius of curvature, we first calculate the great-circle distance between two points using the Haversine formula:
\[ \mathit{val} = \sin^2(\Delta\phi/2) + \cos\phi_1 \times \cos\phi_2 \times \sin^2(\Delta\lambda/2) \tag{3} \]
\[ \mathit{res} = 2 \times \tan^{-1}\!\left( \frac{\sqrt{\mathit{val}}}{\sqrt{1 - \mathit{val}}} \right) \tag{4} \]
\[ \mathit{dst} = ER \cdot \mathit{res} \tag{5} \]
where ϕ is latitude, λ is longitude, and ER is the radius of the earth (ER = 6,371 km). Let a be the distance from segment point S1 to S2, b the distance from S2 to S3, and c the distance from S1 to S3. The radius is then
\[ \mathit{radius} = \frac{a \times b \times c}{\sqrt{(a + b + c) \times (b + c - a) \times (c + a - b) \times (a + b - c)}} \tag{6} \]
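Equations (3) to (6) translate directly into code. The sketch below is a minimal illustration: the function names are assumptions, distances are returned in metres, and returning infinity for collinear points is a choice made here so that straight stretches are never mistaken for curves.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # ER = 6,371 km, expressed in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees (Eqs. 3-5)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    val = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    res = 2 * math.atan2(math.sqrt(val), math.sqrt(max(0.0, 1 - val)))
    return EARTH_RADIUS_M * res


def circumradius(a, b, c):
    """Radius of the circle through three points with pairwise distances a, b, c (Eq. 6)."""
    product = (a + b + c) * (b + c - a) * (c + a - b) * (a + b - c)
    if product <= 0:
        return math.inf  # collinear points: treat as a straight line
    return (a * b * c) / math.sqrt(product)


print(haversine_m(0.0, 0.0, 0.0, 1.0))  # one degree of longitude on the equator
print(circumradius(3.0, 4.0, 5.0))      # a 3-4-5 right triangle
```

Using `atan2` instead of a plain arctangent of the ratio avoids a division by zero when the two points are antipodal.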
According to the Indian Roads Congress [27, 28], a vehicle can travel at a speed of 70 to 80 km/h in a 1000 m radius curve on the Indian highways [26]. So, we assume that the maximum radius of an Indian road curve is 1000 m. Using Eq. (6) at segment points S1, S2, and S3 from Fig. 3a, we find that the radius of these three segment points is more than 1000 m because they are interconnected like a straight line. So, we check the next adjacent three-segment points S2, S3, and S4. The radius of these three-segment points (S2, S3, and S4) is less than 1000 m because it looks like a curve. These three segment points’ radius values are recorded in the radius list, and this procedure is repeated for subsequent segment points until we reach the final set of segment points along the path. Figure 3a segment points S3–S9 yield six curve radii R1, R2, R3, R4, R5, and R6. This information is saved in the radius list. After that, the average curve radius (R1 to R6) of the segment points S2 to S9 is calculated. The detected curve of the path in Fig. 3a is shown in Fig. 3b. (Explained in Algorithm 1). The curve list keeps track of the curve’s starting segment point S2, ending segment point S8, mid-segment point S6, and computed average curve radius. The method is utilized with the route between the origin and destination, and the found curves and their attributes (curve starting point, curve ending point, curve mid-point, and average curve radius) are saved in the curve list. Figure 4 shows that the source location is on a highway, but the destination location is on a mountainous (hilly) terrain. The curves on the highway are always large. A single curve that is 1000 m long, as seen in Fig. 4 (top red solid circle), is a good example. The mountainous landscape here features several hairpin bends. These curves have a radius of 50 to 150 m, and some curves are 500 m long. 
As the assumed maximum curve radius of roads in India is 1000 m, multiple hairpin curves form a single curve, as seen in Fig. 4 (bottom two red solid circles).
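The sliding three-point procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the chapter's Algorithm 1: the points are given as planar (x, y) coordinates in metres (a geodesic distance function can be passed as `dist`), consecutive windows below the threshold are merged into one curve, and the names `detect_curves` and `_close` are hypothetical.

```python
import math


def circumradius(a, b, c):
    """Circle radius from three pairwise distances, Eq. (6); infinite for collinear points."""
    product = (a + b + c) * (b + c - a) * (c + a - b) * (a + b - c)
    return math.inf if product <= 0 else (a * b * c) / math.sqrt(product)


def _close(start, end, radii):
    """Summarize one curve: start, middle and end point index plus the average radius."""
    return {"start": start, "mid": (start + end) // 2, "end": end,
            "avg_radius": sum(radii) / len(radii)}


def detect_curves(points, max_radius_m=1000.0, dist=math.dist):
    """Merge consecutive three-point windows whose radius is below max_radius_m.

    Assumption: a curve ends at the first window that is straight again.
    """
    curves, radii, start = [], [], None
    for i in range(len(points) - 2):
        p1, p2, p3 = points[i], points[i + 1], points[i + 2]
        r = circumradius(dist(p1, p2), dist(p2, p3), dist(p1, p3))
        if r < max_radius_m:
            if start is None:
                start = i            # first point of the first curved window
            radii.append(r)
        elif start is not None:
            # the last curved window was i - 1, which covers points up to i + 1
            curves.append(_close(start, i + 1, radii))
            radii, start = [], None
    if start is not None:            # the path ends inside a curve
        curves.append(_close(start, len(points) - 1, radii))
    return curves
```

Because adjacent hairpin bends all fall below the 1000 m threshold, this merging rule reports them as a single curve, which matches the behaviour noted for the bottom two circles in Fig. 4.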
K. Ganesan et al.
Algorithm 1: Finding curves on a path
Input: GPS segment points, required radius
Output: collection of identified curves and their attributes

    Function curve_detection(list_segment_points, required_radius_meter)
        for each segment from list_segment_points do
            a = distance between segment S1 and S2
            b = distance between segment S2 and S3
            c = distance between segment S1 and S3
            det_curve_radius = (a * b * c) /
                (sqrt((a + b + c) * (b + c - a) * (c + a - b) * (a + b - c)))
            radius_count = 0
            collected_radius = 0
            if det_curve_radius …

Fig. 15 Suggestion to reduce vehicle speed
Fig. 16 Field test at (a) Pallikonda and (b) Ranipet toll plazas
4 Conclusion

Highway traffic moving at excessive speed needs to be controlled. The proposed vehicle over-speed detection system can be used to determine whether a vehicle travelling between two toll plazas was travelling at an excessive speed. To this end, a new curve-finding algorithm is proposed to determine the travel time of the vehicle precisely, and this curve-aware travel time is used in the proposed over-speed detection system. The Pallikonda and Ranipet toll plazas took part in the real-time test-bed for a two-hour testing period under the direction of the RTO, Government of Tamil Nadu. Two vehicles were found to be speeding and were fined. The system is currently being tested at two plazas; in the future, it could be expanded to all toll plazas. In addition, the camera-based license plate extraction module will be replaced by an RFID tag-based vehicle information extraction module, which is currently used in every vehicle in India under the brand name FASTag.
References

1. Nayak, R. P., Sethi, S., & Bhoi, S. K. (2018). PHVA: A position based high speed vehicle detection algorithm for detecting high speed vehicles using vehicular cloud. In 2018 International Conference on Information Technology (ICIT). https://doi.org/10.1109/icit.2018.00054
2. Krishnakumar, B., Kousalya, K., Mohana, R., Vellingiriraj, E., Maniprasanth, K., & Krishnakumar, E. (2022). Detection of vehicle speeding violation using video processing techniques. In 2022 International Conference on Computer Communication and Informatics (ICCCI). https://doi.org/10.1109/iccci54379.2022.9740909
3. Zou, F., Ren, Q., Tian, J., Guo, F., Huang, S., Liao, L., & Wu, J. (2022). Expressway speed prediction based on electronic toll collection data. Electronics, 11(10), 1613. https://doi.org/10.3390/electronics11101613
4. Shen, J., Zhou, W., Liu, N., Sun, H., Li, D., & Zhang, Y. (2022). An anchor-free lightweight deep convolutional network for vehicle detection in aerial images. IEEE Transactions on Intelligent Transportation Systems.
5. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN for brain tumor classification. Applied Sciences, 10(14), 4915.
6. Samui, P., Roy, S. S., & Balas, V. E. (Eds.). (2017). Handbook of neural computation. Academic Press.
7. Biswas, R., Vasan, A., & Roy, S. S. (2019). Dilated deep neural network for segmentation of retinal blood vessels in fundus images. Iranian Journal of Science and Technology, Transactions of Electrical Engineering, 1–14.
8. Rajput, S. K., Patni, J. C., Alshamrani, S. S., Chaudhari, V., Dumka, A., Singh, R., Rashid, M., Gehlot, A., & AlGhamdi, A. S. (2022). Automatic vehicle identification and classification model using the YOLOv3 algorithm for a toll management system. Sustainability, 14(15), 9163. https://doi.org/10.3390/su14159163
9. Wang, W., Yang, J., Chen, M., & Wang, P. (2019). A light CNN for end-to-end car license plates detection and recognition. IEEE Access, 7, 173875–173883. https://doi.org/10.1109/access.2019.2956357
10. Huang, Q., Cai, Z., & Lan, T. (2021). A new approach for character recognition of multi-style vehicle license plates. IEEE Transactions on Multimedia, 23, 3768–3777. https://doi.org/10.1109/tmm.2020.3031074
11. Seo, T., & Kang, D. (2022). A robust layout-independent license plate detection and recognition model based on attention method. IEEE Access, 10, 57427–57436. https://doi.org/10.1109/access.2022.3178192
12. Henry, C., Ahn, S. Y., & Lee, S. (2020). Multinational license plate recognition using generalized character sequence detection. IEEE Access, 8, 35185–35199. https://doi.org/10.1109/access.2020.2974973
13. Park, S., Yu, S., Kim, J., & Yoon, H. (2022). An all-in-one vehicle type and license plate recognition system using YOLOv4. Sensors, 22(3), 921. https://doi.org/10.3390/s22030921
14. Alam, N., Ahsan, M., Based, M. A., & Haider, J. (2021). Intelligent system for vehicles number plate detection and recognition using convolutional neural networks. Technologies, 9(1), 9. https://doi.org/10.3390/technologies9010009
15. Alghyaline, S. (2022). Real-time Jordanian license plate recognition using deep learning. Journal of King Saud University-Computer and Information Sciences, 34(6), 2601–2609. https://doi.org/10.1016/j.jksuci.2020.09.018
16. Raghunandan, K. S., Shivakumara, P., Jalab, H. A., Ibrahim, R. W., Kumar, G. H., Pal, U., & Lu, T. (2018). Riesz fractional based model for enhancing license plate detection and recognition. IEEE Transactions on Circuits and Systems for Video Technology, 28(9).
17. Dalarmelina, N. D., Teixeira, M. A., & Meneguette, R. I. (2019). A real-time automatic plate recognition system based on optical character recognition and wireless sensor networks for ITS. Sensors, 20(1), 55. https://doi.org/10.3390/s20010055
18. Singh, P., Patwa, B., Saluja, R., Ramakrishnan, G., & Chaudhuri, P. (2019). StreetOCRCorrect: An interactive framework for OCR corrections in chaotic Indian street videos. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). https://doi.org/10.1109/icdarw.2019.10036
19. Jagtap, J., & Holambe, S. (2018). Multi-style license plate recognition using artificial neural network for Indian vehicles. In 2018 International Conference on Information, Communication, Engineering and Technology (ICICET). https://doi.org/10.1109/icicet.2018.8533707
20. Ravirathinam, P., & Patawari, A. (2019). Automatic license plate recognition for Indian roads using Faster-RCNN. In 2019 11th International Conference on Advanced Computing (ICoAC). https://doi.org/10.1109/icoac48765.2019.246853
21. Khan, S. U., Alam, N., Jan, S. U., & Koo, I. S. (2022). IoT-enabled vehicle speed monitoring system. Electronics, 11(4), 614. https://doi.org/10.3390/electronics11040614
22. Li, Z., Chitturi, M., Bill, A., & Noyce, D. (2012). Automated identification and extraction of horizontal curve information from geographic information system roadway maps. Transportation Research Record: Journal of the Transportation Research Board, 2291, 80–92.
23. Horzyk, A., & Ergun, E. (2020). YOLOv3 precision improvement by the weighted centers of confidence selection. In 2020 International Joint Conference on Neural Networks (IJCNN). https://doi.org/10.1109/ijcnn48605.2020.9206848
24. Jayaraman, S., Esakkirajan, S., & Veerakumar, T. (2015). Digital image processing. Tata McGraw Hill publication, Indian Edition.
25. Shi, B., Bai, X., & Yao, C. (2017). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11), 2298–2304. https://doi.org/10.1109/tpami.2016.2646371
26. Bains, M. S., Bhardwaj, A., Arkatkar, S., & Velmurugan, S. (2013). Effect of speed limit compliance on roadway capacity of Indian expressways. Procedia-Social and Behavioral Sciences, 104, 458–467.
27. IRC: 73. (1980). Geometric design standards for rural (non-urban) highways. Indian Roads Congress.
28. IRC: 38. (1988). Guidelines for design of horizontal curves for highways and design tables. Indian Roads Congress.
An Intelligent System for Video-Based Proximity Analysis

Sergey Antonov, Mikhail Bogachev, Pavel Leyba, Aleksandr Sinitca, and Dmitrii Kaplun

S. Antonov · D. Kaplun (B): Department of Automation and Control Processes, Saint Petersburg Electrotechnical University "LETI", St. Petersburg 197022, Russia
M. Bogachev · P. Leyba · A. Sinitca · D. Kaplun: Centre for Digital Telecommunication Technologies, Saint Petersburg Electrotechnical University "LETI", St. Petersburg 197022, Russia
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis, Studies in Big Data 129, https://doi.org/10.1007/978-981-99-3784-4_5

1 Introduction

Boosted by the COVID-19 pandemic, digital technologies have played an increasingly significant role in the public-health response worldwide, including contact tracing. Budd et al. [12] provide a comprehensive review of digital innovations developed in response to COVID-19 worldwide, including legal, ethical and privacy barriers to their implementation, as well as organizational and workforce restrictions. The review covers technologies developed in response to five public-health needs: epidemiological surveillance, rapid case identification, control of community transmission, communication of essential medical information, and clinical support [5]. Interrupting community transmission requires rapid tracing and quarantining of contacts in order to prevent further transmission. Technologies supporting such activities are largely based on proximity tracing [17], which is usually implemented using smartphone apps [57, 59] and low-power Bluetooth technologies. Hossain et al. [18] recently proposed a B5G framework that exploits the high throughput and low latency of the modern 5G network standard to exchange chest X-ray [20] or CT scan images [41] for early instrumental detection of COVID-19, and to develop a mass surveillance system to control and manage social distancing, mask wearing, and body temperature monitoring. This approach lies in the context of various AI-based integrated emergency response solutions that have attracted increasing interest in recent years [40, 42, 44].

Privacy is one of the major concerns in this context, strongly limiting the applicability of various solutions. As a prominent example, Norway has stopped using the Smittestopp app and switched to the Bluetooth approach [60]. Several international frameworks with various systematic approaches to privacy preservation are emerging, including Decentralized Privacy-Preserving Proximity Tracing [58], the Pan-European Privacy-Preserving Proximity Tracing initiative [61] and the joint Google–Apple framework [56]. A key limitation of contact-tracing apps such as those mentioned above is that they require a large proportion of the population to use the app. In practice, their effectiveness is strongly limited by smartphone ownership, user compliance, and technical compatibility [12]. An alternative approach, which can be more effective in a variety of scenarios, is proximity tracing based on video surveillance.

There are only a few works addressing video surveillance in the context of the COVID-19 pandemic. Punn et al. [38] propose a framework that uses the YOLO v3 object detection model to detect people and the DeepSORT approach to distinguish between them and to track the identified persons according to their assigned IDs. The results of the YOLO v3 model are further compared with other popular convolutional neural network architectures, such as SSD (Single Shot Detector), R-CNN (Region-Based CNN) and their modifications. Rezaei et al.
[39] use a YOLOv4-based framework and inverse perspective mapping to improve the accuracy of person identification and social distance tracking in the presence of disturbance factors, such as crowd occlusion, partial visibility, and lighting variations; they also provide a risk assessment scheme based on the statistical analysis of personalized movement trajectories and the rate of social distancing violations. As with mobile apps for proximity tracing, any solution based on video surveillance needs to address privacy concerns. In this paper we propose a framework that builds on the ideas of object detection and trajectory analysis from the literature on pedestrian tracking, and also integrates elements that address privacy issues: a facial recognition system that maps faces to anonymized IDs, and the construction of an anonymized potential spread graph, which can be used in scenarios such as contact tracing and epidemiological surveillance.

Now, more than two years since the onset of the pandemic, public attention is increasingly shifting towards finding optimal exit strategies, including adapting the technologies that were rapidly deployed earlier in the course of the pandemic and finding their place in the post-pandemic society. Here we show explicitly how the proposed AI-based framework for proximity tracing based on video surveillance in public places can be used in different scenarios, ranging from individual contact tracing and epidemiological surveillance of crowds to improved planning of public spaces.
The existing body of work on, e.g., automatic pedestrian behavior analysis can be adapted to this context [52]. These approaches usually employ various models for object detection. However, the pandemic has largely changed our vision of the goals that have to be achieved in public space planning. There is compelling evidence that various social distancing measures also reduced the spread of other infectious diseases such as the common cold or flu. Even outside the pandemic context, these diseases account for around 166 million lost working days in the U.S. alone, a figure that nearly doubles when parents who skip work because of colds caught by their children are taken into account. Therefore, adapting the technologies widely used during the COVID-19 pandemic to reduce community transmission of other respiratory diseases, such as the common cold and flu, could help to reduce these losses at least partially.

The rest of the paper is organized as follows. Section 2 presents an overview of the proposed framework and the corresponding video data processing pipeline. Section 3 focuses on the proximity networks, which can be used in a variety of scenarios to address public-health needs. Section 4 describes the evaluation of our approach on a series of videos captured by street surveillance cameras. Section 5 introduces statistical quantities that are associated with the risks of community transmission and discusses how they could be used for future improvements in public space planning aimed at reducing community transmission risks in the post-pandemic society.
2 Overview of the Framework

A schematic overview of the proposed framework is presented in Fig. 1. The framework contains four main modules that are responsible for person detection, distance calculation, face recognition, and network construction, respectively. The first three are repeated for each frame, while the last one combines the information gathered from the entire scene. Frames are fed into a trained convolutional neural network model, "ssdlite mobilenet v2 coco", for object detection. The output of the model contains the coordinates of the bounding boxes of all persons detected in the frame. To find the actual coordinates of each person relative to other people, the bounding box coordinates are passed to the OpenCV computer vision library, which performs a bird's eye projection using the homography matrix. The next step is to link the coordinates to individual person IDs by matching them with previous frames and to update the trajectories. To facilitate the linkage, whenever a position does not appear in close proximity to the trajectory of an already identified person, a facial recognition algorithm is applied to the cropped image for identification. To calculate distances between people, a VP-tree is built and a modified nearest neighbor algorithm is applied to it. Once individual trajectories have been obtained for each detected person (identified by a unique ID) from the video analysis, the quantities of interest for contact tracing include the groups of people that appear in close proximity to each person and the durations for which they remain in proximity.

Fig. 1 Diagram of workflow
3 Construction of Proximity Networks

3.1 People Detection and Accuracy Evaluation

A common approach to the detection of individual persons in video obtained from fixed surveillance cameras is based on convolutional neural networks. There are several approaches to neural network training, among them supervised, unsupervised and reinforcement learning. While supervised learning generally leads to superior accuracy and performance, it requires large amounts of data at the learning stage. Datasets used for network training should contain the "ground truth" information, including segmentation, localization, and object classification, typically summarized in the associated annotation files. Among multiple variants, convolutional neural networks [27] should be noted as a common solution for object detection. The choice of a particular solution and its validation are largely based on accuracy metrics such as Precision (Prc), Recall (Rec), Intersection over Union (IoU) and mean Average Precision (mAP). Figure 2 illustrates the IoU, a measure based on the Jaccard index that evaluates the overlap between the reference and the predicted bounding boxes. To obtain the accuracy metrics, the IoU is compared against a fixed threshold Θ, which equals 0.5 in our example. When IoU < Θ, the decision is made in favor of hypothesis H0; otherwise the decision is made in favor of hypothesis H1. The accuracy of the decision-making procedure is quantified by the true positive (TP) rate, indicating the rate of decisions in favor of hypothesis H1 when H1 is valid, and by the false positive (FP) rate, indicating the rate of decisions in favor of hypothesis H1 when H0 is valid (see, e.g., [48] and references therein).

Fig. 2 Intersection over Union (IoU), a measure based on the Jaccard index that evaluates the overlap between the reference (green border) and the predicted (red border) bounding boxes

Based on the above rates, one can estimate the precision

$$\mathrm{Prc} = \frac{TP}{TP + FP} \tag{1}$$

and the recall

$$\mathrm{Rec} = \frac{TP}{TP + FN}, \tag{2}$$

where FN denotes the false negatives, i.e., decisions in favor of hypothesis H0 when H1 is valid.
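The quantities in Eqs. (1) and (2) can be illustrated with a small sketch. It assumes boxes given as (xmin, ymin, xmax, ymax) tuples and a greedy one-to-one matching of predicted to reference boxes; the matching rule and the function names are assumptions of this example and are not specified in the text.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predicted, reference, threshold=0.5):
    """Precision and recall (Eqs. 1-2) with greedy one-to-one matching.

    Assumption: each reference box can confirm at most one prediction.
    """
    unmatched = list(reference)
    tp = 0
    for box in predicted:
        best = max(unmatched, key=lambda ref: iou(box, ref), default=None)
        if best is not None and iou(box, best) >= threshold:  # decision for H1
            tp += 1
            unmatched.remove(best)
    fp = len(predicted) - tp          # predictions without a matching reference
    fn = len(unmatched)               # reference boxes that were missed
    prc = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prc, rec
```

Two unit-overlap 2 × 2 boxes shifted by one pixel in each direction give an IoU of 1/7, well below the threshold of 0.5, so such a prediction would count as a false positive.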
Similarly to the approach taken in [16], we also calculate the widely used detection accuracy measure mAP, obtained as the area under the Prc × Rec curve. By definition, both precision and recall are bounded between 0 and 1, and thus mAP is also bounded between 0 and 1. It is common to estimate mAP from the interpolated Prc × Rec curve as

$$\mathrm{mAP} = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1\}} p_{\mathrm{interp}}(r), \tag{3}$$

where $p_{\mathrm{interp}}(r)$ is the interpolation of the Prc × Rec curve at the recall level r.
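Eq. (3) can be sketched in a few lines. The example assumes the interpolation convention commonly used with the PASCAL VOC benchmark, in which the interpolated precision at recall level r is the maximum precision observed at any recall greater than or equal to r; the text itself does not spell out the interpolation, so this choice and the name `eleven_point_ap` are assumptions.

```python
def eleven_point_ap(recalls, precisions):
    """11-point interpolated average precision for one Prc x Rec curve (Eq. 3).

    recalls and precisions are parallel sequences of measured operating points.
    Assumption: p_interp(r) = max precision over all points with recall >= r.
    """
    total = 0.0
    for k in range(11):
        r = k / 10
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
        total += max(candidates, default=0.0)  # no point at this recall: precision 0
    return total / 11
```

For a curve with the two operating points (Rec, Prc) = (0.5, 1.0) and (1.0, 0.5), the six recall levels up to 0.5 contribute 1.0 each and the remaining five contribute 0.5 each, giving 8.5/11.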
3.2 Finding Coordinates of Each Person

Before proceeding to walking trajectories, one has to transform from the homogeneous (also known as projective) coordinates to the world coordinates (corresponding to the bird's eye view) by means of projective geometry techniques [31]. For simplicity, each detected object represented by a bounding box is associated with its pivot point, resulting in a simplified transformation expressed by 3 × 3 matrices:

$$\begin{bmatrix} x'_i \\ y'_i \\ w_i \end{bmatrix} = w \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \tag{4}$$

where x, y are the coordinates of the plane. In order to transform from homogeneous coordinates to world coordinates, one has to divide the resulting coordinates by $w_i$. Accordingly, the procedure of finding the location of each person in world coordinates can be expressed as

$$\begin{bmatrix} w_i x'_i \\ w_i y'_i \\ w_i \end{bmatrix} = H \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix}, \tag{5}$$

where H is the projection matrix, sometimes also referred to as the homography matrix, which can be estimated using a number of approaches [33], such as direct linear transformation (DLT) and robust estimation (RANSAC). Assuming that the pivot point of each bounding box is located in the center of its lower edge, it can be found as

$$x_i = x_{\min} + \frac{x_{\max} - x_{\min}}{2}, \tag{6}$$

$$y_i = y_{\min}, \tag{7}$$

where $(x_{\min}, x_{\max}, y_{\min}, y_{\max})$ are the bounding box coordinates. Thus, for a given homography matrix, the transformation to the world coordinates can be expressed as

$$\begin{bmatrix} w_i x'_i \\ w_i y'_i \\ w_i \end{bmatrix} = H \begin{bmatrix} x_{\min} + \dfrac{x_{\max} - x_{\min}}{2} \\ y_{\min} \\ 1 \end{bmatrix}. \tag{8}$$
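The mapping in Eqs. (5) to (8) amounts to one matrix-vector product followed by a division. The following dependency-free sketch assumes that H is given as a nested 3 × 3 list and that a box is ordered (xmin, xmax, ymin, ymax) as in the text; the function names are hypothetical.

```python
def pivot_point(box):
    """Pivot of a bounding box (xmin, xmax, ymin, ymax), following Eqs. (6)-(7).

    Note: whether ymin or ymax is the lower edge depends on the image axis
    convention; this sketch simply follows Eq. (7).
    """
    xmin, xmax, ymin, ymax = box
    return xmin + (xmax - xmin) / 2, ymin


def to_world(H, point):
    """Apply a 3x3 homography H to an image point and dehomogenize (Eq. 5)."""
    x, y = point
    xw, yw, w = (row[0] * x + row[1] * y + row[2] for row in H)
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    return xw / w, yw / w  # divide by w_i to obtain world coordinates
```

For example, `to_world(H, pivot_point(box))` evaluates Eq. (8) for one detection. In practice the same two steps are available in OpenCV, where `cv2.findHomography` estimates H from point correspondences and `cv2.perspectiveTransform` applies it to arrays of points.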
3.3 Extracting Walking Trajectories

Constructing walking trajectories from locations obtained in individual video frames requires linking the location points that correspond to the same person in consecutive frames. The first step is commonly a search for the nearest neighbor points. It is usually performed using one of the following algorithms: linear (full) search, search in kd-trees [35], search in BSP-trees [28], LS-hashing [36], the method with keywords [50], and search in VP-trees [54]. As linear search is computationally inefficient due to its O(n) complexity per query, alternative algorithms are in focus. The LS-hashing algorithm is based on finding a simple hash function that can be used instead of a direct comparison of point coordinates. It is highly efficient once such a hash function is known, but finding one is not straightforward in many real-world scenarios. The idea of the keyword algorithm is to store a list of objects with rarely observed coordinates, which also limits its applicability. Therefore, in our case the remaining options, namely the kd-tree, BSP-tree and VP-tree search algorithms, are of greater interest. In the following, we focus on the VP-tree search, since it searches for other points in a circular vicinity around the current pivot point, which is relevant to the contact proximity analysis problem.
3.3.1 VP-Tree Construction

Like most tree construction algorithms, building a hierarchical VP-tree is a recursive procedure. In the first iteration, a vantage point is selected and the average distance from this point to all other points is calculated. The input set of points is then divided into two subtrees: a point is assigned to the inner (left) subtree if its distance to the vantage point is less than the average, and to the outer (right) subtree otherwise. The same operation is repeated for each subtree. Thus, each node of the tree stores a vantage point and a radius that bounds the points of its inner subtree. The complexity of the tree construction algorithm is O(n log n).
3.3.2 Finding Nearest Neighbor in VP-Tree

The algorithm for finding the nearest neighbor of a point x is also recursive. At any given step, one considers a tree node that has a vantage point q and a radius r. Let point x be located at a distance d from q. If d is below r, the search descends recursively into the subtree of the node that contains the points closer to the vantage point than r; upon reaching this subtree, a linear search is performed among its points. Otherwise, the search proceeds to the subtree of the node that contains the points located farther from q than r. When constructing the trajectory of a single walker, x is given by the coordinates of this person in the previous frame, and the desired nearest point gives the coordinates of the same person in the current frame.
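The construction and search steps can be sketched together as follows. The sketch splits each node at the mean distance to the vantage point, as described in Sect. 3.3.1, picks the vantage point at random, and recurses down to single points instead of ending with a linear scan. It also adds the standard check of the other subtree whenever the current best distance crosses the node boundary, which the description above leaves implicit. All names are illustrative assumptions.

```python
import math
import random


class Node:
    def __init__(self, point, threshold, inner, outer):
        self.point = point          # vantage point
        self.threshold = threshold  # node radius: mean distance to the vantage point
        self.inner = inner          # subtree of points closer than the threshold
        self.outer = outer          # subtree of the remaining points


def build(points, rng=random.Random(0)):
    """Recursively build a VP-tree from a list of (x, y) tuples."""
    if not points:
        return None
    points = list(points)
    vp = points.pop(rng.randrange(len(points)))  # random vantage point (an assumption)
    if not points:
        return Node(vp, 0.0, None, None)
    dists = [math.dist(vp, p) for p in points]
    mean = sum(dists) / len(dists)
    inner = [p for p, d in zip(points, dists) if d < mean]
    outer = [p for p, d in zip(points, dists) if d >= mean]
    return Node(vp, mean, build(inner, rng), build(outer, rng))


def nearest(node, x, best=(math.inf, None)):
    """Return (distance, point) of the stored point closest to x."""
    if node is None:
        return best
    d = math.dist(node.point, x)
    if d < best[0]:
        best = (d, node.point)
    if d < node.threshold:
        first, second = node.inner, node.outer
    else:
        first, second = node.outer, node.inner
    best = nearest(first, x, best)
    # Visit the other side only if the ball of radius best[0] around x
    # reaches across the node boundary.
    if abs(d - node.threshold) <= best[0]:
        best = nearest(second, x, best)
    return best
```

For trajectory linking, the tree would be built from the positions in the current frame and queried with each person's position from the previous frame.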
3.4 Finding People That Appear in Close Proximity

In the context of contact tracing, the next step is typically to find all points that appear in close proximity, usually determined by a circular area of a certain radius around each person, first for a given video frame corresponding to a single point in time. Since the original VP-tree search algorithm finds a single nearest neighbor only, it has to be generalized to search for potentially multiple neighbors within a circle of a given radius. In this context, there are three possible situations:

1. The entire search area is included in the internal subtree:

$$d(q, s) + r < T, \tag{9}$$

where d(q, s) is the distance from the center of the node to the search point, r is the search radius, and T is the node radius, which determines the border of the inner subtree. The world scale of the distances between two points in the bird's eye view is determined using the size of the camera pixel obtained from the calibration procedure, and the distance between two points is calculated as

$$d(p_i, p_j) = \sqrt{(x_j - x_i)^2 + (y_j - y_i)^2}. \tag{11}$$

If condition (9) is met, the search can continue in the internal subtree only.

2. The entire search area is included in the external subtree:

$$d(q, s) - r \geq T. \tag{12}$$

If this condition is met, the search can continue in the external subtree only.

3. The search area is distributed over both subtrees. In this case, the search is performed in both subtrees.

The complexity of searching for nearest neighbors is O(log n).
3.5 Face Analysis and Recognition

Facial recognition is a long-studied problem that has attracted increasing attention in recent years, leading to considerable advances in methodology and algorithm development (see, e.g., [4] for a detailed review). One of the key issues regarding the widespread use of facial recognition technologies is privacy, and thus methodological approaches that are capable of integrating anonymization in a systematic way are favorable. In this work, detected faces are mapped to anonymized IDs, which can then be stored in the system, allowing identities to be revealed in a controlled way.

Face analysis and recognition is generally a stepwise procedure that includes finding and selecting all faces in the images, their initial preprocessing and alignment, identification of unique facial features, and their comparison against a database of known people. The procedure is typically implemented as a pipeline in which the steps are independent of each other, so that the technique used at each step can be chosen independently from a number of available solutions. Since there is a large body of recent work in this field, we provide only a brief overview of the available solutions and their pros and cons.

Face detection techniques largely rely upon several well-established approaches. In retrospect, the Viola-Jones method [53], while being one of the first widely available and computationally efficient solutions, was characterized by relatively high false detection rates, a requirement for frontal facial images, and low robustness against occlusion. The Histogram of Oriented Gradients (HOG) method [15] is based on analyzing the gradient of the binarized image, followed by its segmentation into small segments and finding those where the arrangement of gradients is close to a known facial image, often denoted as the HOG pattern. The key advantages of this method are its computational efficiency, reasonable effectiveness for slightly non-frontal images, and moderate robustness against occlusions, while its major drawback is the requirement for high-resolution images: it fails on low-resolution images due to discreteness effects.

In recent years, Multi-Task Cascaded Convolutional Networks (MTCNN) [55] became one of the most popular solutions for finding faces in images based on the DNN (Deep Neural Network) approach. The algorithm first rescales the image and then applies three consecutive networks: the Proposal Network (P-Net) looks for candidate facial regions, the Refine Network (R-Net) filters the bounding boxes, and the Output Network (O-Net) localizes facial landmarks (such as eyes and mouth). Another recent and powerful alternative is the MMOD algorithm introduced by Davis E. King and implemented in the Dlib library [23]. Since it appears to be among the most accurate of the methods discussed above, while also working well for different face orientations and even under substantial occlusion, it has been chosen as the face detector in this work. However, it is also important to note that deep learning algorithms, while typically outperforming other approaches in terms of accuracy, require considerably more computational resources, which may become a limiting factor in limited-resource scenarios, with large amounts of data, or under online analysis requirements.

Face rescaling and alignment is an intermediate step between face detection and face recognition. Common solutions are based on finding specific face landmarks that can be used as pivot points in the rescaling and alignment procedure.

Face recognition techniques are also well developed. Early approaches were largely based on algorithms such as Eigenface [21], Fisherface [3] and Local Binary Patterns Histogram (LBPH) [46]. As these algorithms proved to have numerous drawbacks, here we follow a more recent approach based on Convolutional Neural Networks (CNN), which remain one of the most effective and reliable solutions to date. Prominent examples include Google FaceNet [45], based on convolutional layers that learn face representations directly from the image.
FaceNet was trained on the Labelled Faces in the Wild (LFW) [19] dataset to achieve invariance to illumination, pose, and other variable conditions. Other notable examples include OpenFace [2]. In this work, we also used a neural network based solution implemented in the Dlib library.

Finally, recognized faces should be associated with the IDs of particular persons. This is a typical problem for machine learning classification algorithms. If no match is found, a new ID is added to the database. In this work, we used a KNN classifier, although many alternative classifiers would do the job.
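The last step, mapping a recognized face to an anonymized ID and registering a new ID when nothing matches, can be sketched as follows. The example is a simplification under stated assumptions: it uses a 1-nearest-neighbour rule on face descriptors (the k = 1 case of a KNN classifier), Euclidean distance, a hypothetical matching threshold of 0.6, and random hexadecimal strings as anonymized IDs. The class name `AnonymousGallery` is not taken from the chapter.

```python
import math
import uuid


class AnonymousGallery:
    """Maps face descriptors to anonymized IDs with a 1-nearest-neighbour rule."""

    def __init__(self, max_distance=0.6):
        self.max_distance = max_distance  # hypothetical threshold, to be tuned on real data
        self.entries = []                 # list of (descriptor, anonymized_id)

    def identify(self, descriptor):
        """Return the ID of the closest known face, or register a new ID."""
        best_id, best_dist = None, math.inf
        for known, person_id in self.entries:
            dist = math.dist(known, descriptor)
            if dist < best_dist:
                best_id, best_dist = person_id, dist
        if best_id is None or best_dist > self.max_distance:
            best_id = uuid.uuid4().hex    # no match: add a new anonymized ID
        self.entries.append((tuple(descriptor), best_id))
        return best_id
```

Since the gallery stores only descriptors and random IDs, the link between an ID and a real identity would have to be kept in a separate, access-controlled store, in line with the controlled disclosure described above.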
4 Experiments

4.1 Combined Dataset for Neural Network Training

Next, we evaluated the approach using several sample videos recorded by surveillance cameras in busy outdoor public places. For neural network training, we combined two datasets that are among the most popular for training object detection algorithms, PASCAL VOC [16] and COCO [25]. Although they differ in the amount of annotation, both contain sufficient information to extract bounding boxes around detected people. Figure 3 shows the histogram of the person count per image for the resulting dataset, indicating that the majority of images contained a single person, while a significant number of images contained up to twenty different people.

Fig. 3 Distribution of the number of people in images
4.2 Training a Neural Network Model

We used the combined dataset described above to train the convolutional neural network model "ssdlite mobilenet v2 coco", a lightweight version of SSD (Single Shot MultiBox Detector) [26] based on the joint architecture of SSDLite and MobileNetV2 [43], which is characterized by high object detection accuracy (evaluated by mAP) and computational performance in various image analysis problems. Figure 4 shows the loss function obtained during model training, while Fig. 5 shows the average accuracy of object detection (mAP); together they indicate that the chosen neural network model demonstrates high accuracy of object recognition.
4.3 Frame Processing
When processing a video frame, the system detects people using the previously trained neural network model “ssdlite_mobilenet_v2_coco”. After people were detected by the trained network, their bounding box coordinates were subjected to the homography matrix based transformation and the nearest neighbor search algorithm, followed by the face detection and recognition algorithms, as described
Fig. 4 Evolution of the loss during model training
above. Figure 6 exemplifies a processed video frame with indicated bounding boxes, where those appearing in close proximity (for an arbitrary 2 m threshold) are shown in red, while others are shown in green. Figure 7 shows the corresponding bird’s eye view for the same frame, using similar color notation. Another example is shown in Figs. 8 and 9, respectively.
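The per-frame geometry can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the original system: the bottom-center point of each bounding box is used as the ground contact point, the 3x3 homography matrix `H` (pixels to meters on the ground plane) is assumed to be known from calibration, and a brute-force pairwise distance computation replaces the nearest neighbor search for brevity.

```python
import numpy as np

def to_ground_plane(boxes, H):
    """Map boxes (x_min, y_min, x_max, y_max) in pixels to ground-plane points.

    The bottom-center of each box is taken as the person's position
    (an assumption of this sketch) and transformed by the homography H.
    """
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    feet = np.column_stack([
        (boxes[:, 0] + boxes[:, 2]) / 2.0,  # horizontal center
        boxes[:, 3],                        # bottom edge
        np.ones(len(boxes)),                # homogeneous coordinate
    ])
    mapped = feet @ np.asarray(H, dtype=float).T
    return mapped[:, :2] / mapped[:, 2:3]   # back to Cartesian coordinates

def proximity_flags(points, threshold=2.0):
    """True for each person closer than `threshold` to at least one other person."""
    points = np.asarray(points, dtype=float).reshape(-1, 2)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)          # ignore the distance to oneself
    return (dist < threshold).any(axis=1)
```

Boxes flagged `True` would be drawn in red and the remaining ones in green; for crowded frames a spatial index such as a k-d tree would scale better than the quadratic pairwise computation used here.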
4.4 Contact Network Graph
In epidemiological contact tracing, an important quantity that strongly influences the transmission risk is the duration of contact between each pair of individuals. The corresponding framework for a given public space can be represented by a weighted graph, where the nodes correspond to individual persons, while the weights of the links between them represent contact durations. Figure 10 exemplifies a contact graph for a representative short scene, where link weights represent contact durations in seconds. In order to reduce the risk of infection transmission in public spaces, it is essential to reduce the duration of contacts. Alternatively, under the assumption that contact duration above a certain threshold is associated with increased risk of infection
Fig. 5 Evolution of the mAP@0.5 during model training
Fig. 6 Results of processing frame 1
Fig. 7 Bird’s eye view of frame 1
Fig. 8 Results of processing frame 2
Fig. 9 Bird’s eye view of frame 2
Fig. 10 Contact graph
transmission, one can focus on the reduction of the number of links above a certain threshold weight, i.e., the number of pairs of individuals that appear in close proximity to each other for durations above a certain threshold value.
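A weighted contact graph of this kind can be accumulated frame by frame. The following sketch assumes that each frame has already been reduced to a mapping from anonymized person ID to ground-plane position in meters; the function names, the 2 m threshold, and the frame interval `frame_dt` are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations
import math

def build_contact_graph(frames, threshold=2.0, frame_dt=1.0):
    """Accumulate pairwise contact durations over a sequence of frames.

    frames:   iterable of dicts {person_id: (x, y)} in ground-plane meters
    frame_dt: time represented by one frame, in seconds
    Returns a dict {(id_a, id_b): duration_in_seconds} with id_a < id_b.
    """
    durations = defaultdict(float)
    for positions in frames:
        for a, b in combinations(sorted(positions), 2):
            (xa, ya), (xb, yb) = positions[a], positions[b]
            if math.hypot(xa - xb, ya - yb) < threshold:
                durations[(a, b)] += frame_dt
    return dict(durations)

def count_risky_links(graph, min_duration):
    """Number of links whose accumulated contact duration exceeds min_duration."""
    return sum(1 for weight in graph.values() if weight > min_duration)
```

For example, `count_risky_links(build_contact_graph(frames), 15.0)` would give the number of pairs that spent more than 15 s in close proximity, which is the quantity one would aim to reduce.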
5 Further Interpretation and Outlook Towards Adaptation to the Post-pandemic Society Goals
Now, more than two years after the onset of the pandemic, public attention is increasingly shifting towards finding optimal exit strategies, including adaptation of the technologies that were rapidly deployed earlier in the course of the pandemic, and finding their place in the post-pandemic society. In the following, we consider how the above solutions could be used in scenarios other than individual contact tracing or epidemiological surveillance of crowds, for example, to improve the planning of public spaces. Planning of public spaces strongly affects the probability of congestion, the formation of crowds, and the organization of queues, which in turn largely determine the total number of contacts that remain in close proximity above a certain duration. There
is a number of well-known mathematical models widely used to simulate collective dynamics, from particle movement to walking trajectories. One of the simplest models for simulating walking trajectories is a 2D random walk characterized by random increments. In real-world settings, purely random increments are unlikely, due to inevitable interactions between walkers and stationary objects, as well as between walkers and other walkers, which lead to adjustments of their trajectories; thus correlated and self-avoiding walks appear more relevant. For a recent literature overview of the problem from a multidisciplinary perspective, we refer to [30, 37, 51], as well as to several relevant special cases, including the presence of obstacles [47] and compactness constraints [24], capable of representing typical features of real-world public space settings. In the following, we consider several short scenes, calculate statistics for the quantities of interest, and compare them against similar results for both uncorrelated and correlated random walk models obtained by computer simulations. Figure 11 shows the pairwise contact duration matrices representing the duration of time each pair of individuals remains in close proximity, for a sample proximity threshold value. To simplify the comparison between different scenes, as well as between video analysis based and random walk simulation based results, we define the proximity threshold as a certain quantile of the distance distribution for all walkers that can be observed simultaneously within the scene. This kind of normalization is a common approach to the comparison of datasets at different scales, see e.g. [7]. In this particular example, we have chosen TQ = 5, indicating that on average every fifth pairwise distance falls below the threshold.
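The quantile-based threshold and the resulting contact duration matrix can be sketched as below. The sketch reads TQ as "the threshold is the 1/TQ quantile of all simultaneously observed pairwise distances", which is our interpretation of the definition above; the array layout (frames x walkers x 2, with NaN marking walkers absent from the scene) and the function names are assumptions made for illustration.

```python
import numpy as np

def pairwise_distances(trajectories):
    """Distances between all walker pairs in every frame.

    trajectories: array of shape (n_frames, n_walkers, 2); NaN marks absence.
    Returns an array of shape (n_frames, n_walkers, n_walkers).
    """
    traj = np.asarray(trajectories, dtype=float)
    diff = traj[:, :, None, :] - traj[:, None, :, :]
    return np.linalg.norm(diff, axis=-1)

def quantile_threshold(dist, tq=5):
    """Proximity threshold as the 1/tq quantile of the observed pairwise distances."""
    upper = np.triu_indices(dist.shape[1], k=1)   # count each pair once
    observed = dist[:, upper[0], upper[1]]
    return np.nanquantile(observed, 1.0 / tq)     # pairs with an absent walker are ignored

def contact_duration_matrix(dist, threshold, frame_dt=1.0):
    """Total time each pair spends closer than `threshold`."""
    with np.errstate(invalid="ignore"):
        close = dist < threshold                  # comparisons with NaN give False
    durations = close.sum(axis=0).astype(float) * frame_dt
    np.fill_diagonal(durations, 0.0)              # no self-contacts
    return durations
```

Because the threshold is a quantile rather than a fixed distance, scenes of different size or density yield comparable fractions of close pairs, which is what makes the matrices comparable across scenes and with simulations.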
For a statistical characterization of the contact graph properties, a straightforward approach is to consider the distribution of contact durations obtained for all possible pairs of individuals. Figure 12 shows the statistics for six different short scenes, including complementary cumulative distribution functions (CCDFs) indicating the probabilities that inter-arrival times and durations that each individual remains within the scene exceed the function argument, as well as similar quantities for the pairwise contact durations for all possible pairs of individuals, each for three different threshold values, corresponding to TQ = 2, 5 and 10, respectively. The figure shows that the normalized distributions, expressed in units of the average contact duration obtained separately for each scene and each threshold value, tend to follow a simple exponential. This is generally an expected result, which is in good agreement with similar quantities obtained by computer simulations of random walks characterized by the same average inter-arrival times and durations (for simplicity, exponential distributions of arrivals and durations within the scene have been considered). The theoretical background behind this distribution is rather simple and can be explained via an event-based concept, considering any pair of individuals following random trajectories coming into proximity as a random event. In this simplest scenario, these events constitute Poisson processes with parameters generally depending on both inter-arrival and duration times, as well as on the average distances between different walkers and the step sizes performed by a single walker in a given time unit (e.g. one second). However, since the inter-event distribution for any Poisson process decays
Fig. 11 Examples of pairwise contact duration matrices for six representative short scenes captured from a street video surveillance camera for TQ = 5. Matrix sizes are determined by the total number of individuals captured in each scene, with their total pairwise duration of proximity (in seconds) indicated by color
by an exponential with only one free parameter, namely the average value, normalization by division by this average value for each distribution results in a data collapse, with all curves following the same pattern close to a simple exponential with unit average. Deviations from this simple theoretical scenario can be attributed to discreteness and finite size effects. As one can see from the figure, these deviations are comparable for the observational and for the simulated data, given that the simulated data contain a similar number of frames, average inter-arrival intervals
Fig. 12 Statistics for six different short scenes, including (a) CCDFs of inter-arrival times and (b) durations that each individual remains within the scene, as well as (c) distributions of the contact times for all possible pairs of individuals, each for three different thresholds TQ = 2, 5 and 10. Straight black lines show a simple exponential, while dashed colored curves show similar results for simulated random walks with parameters similar to those in the observational data (blue curves correspond to the absence of correlations, while red curves correspond to long-range correlated random walks with Hurst exponent H = 1.5)
and durations of individuals remaining within the scene, and thus also similar total numbers of individual trajectories in the entire scene. However, in most real-world scenarios walking trajectories strongly deviate from the simplest random model. Typical reasons are the localization of objects of attraction (e.g. counters, doors, passages etc.), as well as obstacles (e.g. barriers, billboards, kiosks etc.) in both indoor and outdoor public spaces, leading to spatial clustering of the walking trajectories. In addition, traffic regulations (e.g. revolving doors, traffic lights at crosswalks etc.) lead to additional temporal clustering of the walking trajectories. Among various models used to characterize motion from the statistical physics viewpoint, long-range correlations appear the most relevant in the context of human dynamics (for a recent and comprehensive review of literature on the topic, we refer to [22]). To account for both spatial and temporal clustering, two-dimensional long-range correlated fields seem to be a relevant model. Recent data, including our own results, indicate that long-range correlations are strongly associated with clustering of events, generally leading to heavy-tailed distributions of both inter-event times and event durations, with the latter being crucial for the contact proximity durations. The impact of long-range temporal correlations on the event dynamics has been investigated both analytically [32] and numerically [1, 14], indicating that the interval distributions between consecutive events in a series broaden from a single exponential for the simplest Poisson process scenario to a stretched exponential for linear long-term correlations, and finally converge to a power-law decay for strong long-term correlations, especially in the presence of nonlinear interactions in the system [6, 7].
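The data collapse expected in the uncorrelated (Poisson) reference case can be checked numerically. The sketch below computes an empirical CCDF of durations normalized by their mean, so that exponentially distributed samples with different averages should all approach exp(-x); the function name and the synthetic samples are illustrative assumptions and do not reproduce the observational data.

```python
import numpy as np

def normalized_ccdf(durations):
    """Empirical CCDF of durations expressed in units of their mean.

    Returns (x, ccdf), where x are the sorted normalized durations and
    ccdf[i] is the fraction of samples that are at least x[i]
    (exact when there are no tied values).
    """
    d = np.sort(np.asarray(durations, dtype=float))
    x = d / d.mean()
    ccdf = 1.0 - np.arange(len(d)) / len(d)
    return x, ccdf

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for mean_duration in (0.5, 3.0):
        x, ccdf = normalized_ccdf(rng.exponential(scale=mean_duration, size=20000))
        # Largest gap between the empirical curve and the unit-mean exponential
        print(mean_duration, np.max(np.abs(ccdf - np.exp(-x))))
```

Heavy-tailed contact durations, as expected for correlated walks, would show up as systematic departures from exp(-x) at large x rather than as small finite-size scatter around it.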
Moreover, in recent years similar distributions of the inter-event times have been observed in a number of real-world complex systems, ranging from bursty access patterns driven by user interactions in public computer networks [6, 29, 34, 49] to various natural phenomena, e.g. in geophysics [8, 13]. Finally, our recent data indicate that spatial long-range correlations lead to the manifestations of similar laws in biological polymer structures [9–11]. Figure 13 exemplifies similar distributions obtained by computer simulations for walks with random increments with Hurst exponent H = 0.5 and long-range spatiotemporally correlated increments with Hurst exponent H = 1.5. The figure shows explicitly that stronger spatio-temporal correlations lead to broader contact duration distributions, indicating that a larger fraction of pairs of individuals remain within the same proximity thresholds for longer times (depicted by a more pronounced initial decay in the exceedance probability distributions), compared to the random increments scenario. The figure also shows that, while some general qualitative conclusions are possible based on these simulations, particular functional forms of the distributions obtained for finite systems exhibit non-trivial shapes that are determined by a complex interplay of correlations, discreteness and finite size effects, and thus are determined not only by their asymptotic behaviors that could eventually be derived from known theoretical assumptions, but also depend explicitly on the system size. As a remark, obtaining Fig. 13 required simulated datasets that contained 110 times more time steps and 11 times more individual walkers, altogether resulting in ~10^3 times more walker positions, and potentially up to ~10^6 times more pairwise distances,
Fig. 13 Pairwise proximity duration distributions obtained by computer simulations for a random walk with random increments with Hurst exponent H = 0.5 (blue curves) and long-range spatiotemporally correlated increments with Hurst exponent H = 1.5 (red curves), respectively
compared to the observational video examples used in our study. Since obtaining comparable statistics for different public places by video analysis requires considerable computational effort, we believe that a more detailed analysis, including long-term video analysis and fitting of the best correlated walkers model, aimed at a better understanding of how public space planning affects both the spatio-temporal walking trajectory correlation patterns and the contact proximity distributions, remains beyond the scope of this study and could be considered as an outlook for future research directions.
6 Conclusion and Outlook
To summarize, digital technologies have played a major role in the global response to COVID-19 from the onset of the pandemic, especially in the context of digital epidemiological surveillance and contact tracing, and have proved their effectiveness in the real-world context, being strongly associated with a number of success stories
leading to the rapid suppression of community transmission and reduction of incidence rates. While AI and machine learning techniques have been widely applied in web-based epidemic information support tools and online case tracing, they have not yet been fully explored in the context of proximity tracing and subsequent analysis for more informed planning of public spaces aimed at reducing the number and duration of contacts. In this paper, we have proposed a framework for proximity tracing based on video surveillance. However, as with the use of mobile apps and Bluetooth, privacy considerations cannot be emphasized enough for any approach to be of practical use. This is one of the fundamental ideas in our framework, realized by using anonymized IDs to identify individuals. Further exploring how privacy can be integrated in the proposed solution is the most immediate future research direction. Other directions include training other neural network models and comparing them to find the best model. Trained models will be evaluated based on the above parameters, such as mAP with a set IoU threshold of 0.5, the error of the trained model, and the number of frames per second (FPS) achieved in object detection. In addition, we will further evaluate the approach using large datasets from crowded streets. Now, more than two years after the onset of the pandemic, public attention increasingly shifts towards finding optimal exit strategies, including adaptation of these technologies and finding their place in the post-pandemic society. Looking towards this goal, we also consider how proximity tracing based on video surveillance in public places could be adapted to facilitate improved planning of public spaces.
Acknowledgment The work of Sergey Antonov was supported by the Ministry of Science and Higher Education of the Russian Federation “Goszadanie” No 075-01024-21-02 from 29.09.2021 (Project No. FSEE-2021-0014).
References
1. Altmann, E., & Kantz, H. (2005). Recurrence time analysis, long-term correlations, and extreme events. Physical Review E, 71(5), 056106.
2. Amos, B., Ludwiczuk, B., & Satyanarayanan, M. (2016). Openface: A general-purpose face recognition library with mobile applications. Technical report, CMU-CS-16–118, CMU School of Computer Science.
3. Anggo, M., & Arapu, L. (2018). Face recognition using fisherface method. Journal of Physics: Conference Series, 1028, 012119. https://doi.org/10.1088/1742-6596/1028/1/012119
4. Balaban, S. (2015). Deep learning and face recognition: the state of the art. In Biometric and Surveillance Technology for Human and Activity Identification XII (vol. 9457, p. 94570B). International Society for Optics and Photonics.
5. Biswas, R., Vasan, A., & Roy, S. S. (2020). Dilated deep neural network for segmentation of retinal blood vessels in fundus images. Iranian Journal of Science and Technology, Transactions of Electrical Engineering, 44(1), 505–518.
6. Bogachev, M., & Bunde, A. (2009). On the occurrence and predictability of overloads in telecommunication networks. EPL (Europhysics Letters), 86(6), 66002.
7. Bogachev, M., Eichner, J., & Bunde, A. (2007). Effect of nonlinear correlations on the statistics of return intervals in multifractal data sets. Physical Review Letters, 99(24), 240601.
8. Bogachev, M., Eichner, J., & Bunde, A. (2008). On the occurrence of extreme events in long-term correlated and multifractal data sets. Pure and Applied Geophysics, 165, 1195–1207.
9. Bogachev, M., Kayumov, A., & Bunde, A. (2014). Universal internucleotide statistics in full genomes: A footprint of the dna structure and packaging? PLoS ONE, 9(12), e112534.
10. Bogachev, M., Kayumov, A., Markelov, O., & Bunde, A. (2016). Statistical prediction of protein structural, localization and functional properties by the analysis of its fragment mass distributions after proteolytic cleavage. Scientific Reports, 6, 22286.
11. Bogachev, M., Markelov, O., Kayumov, A., & Bunde, A. (2017). Superstatistical model of bacterial DNA architecture. Scientific Reports, 7, 43034.
12. Budd, J., Miller, B. S., Manning, E. M., Lampos, V., Zhuang, M., Edelstein, M., Rees, G., Emery, V. C., Stevens, M. M., Keegan, N., et al. (2020). Digital technologies in the public-health response to covid-19. Nature Medicine, 1–10.
13. Bunde, A., Bogachev, M., & Lennartz, S. Precipitation and river flow: Long-term memory and predictability of extreme events. Extreme Events and Natural Hazards: The Complexity Perspective, 139–152.
14. Bunde, A., Eichner, J., Havlin, S., & Kantelhardt, J. (2004). Return intervals of rare events in records with long-term persistence. Physica A: Statistical Mechanics and its Applications, 342(1), 308–314.
15. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005) (vol. 1, pp. 886–893). IEEE. https://doi.org/10.1109/cvpr.2005.177
16. Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1), 98–136.
17. Ferretti, L., Wymant, C., Kendall, M., Zhao, L., Nurtay, A., Abeler-Dörner, L., Parker, M., Bonsall, D., & Fraser, C. (2020). Quantifying sars-cov-2 transmission suggests epidemic control with digital contact tracing. Science, 368(6491).
18. Hossain, M. S., Muhammad, G., & Guizani, N. (2020). Explainable ai and mass surveillance system-based healthcare framework to combat covid-19 like pandemics. IEEE Network, 34(4), 126–132.
19. Huang, G. B., Ramesh, M., Berg, T., & Learned-Miller, E. (2007). Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst.
20. Jalali, S. M. J., Ahmadian, M., Ahmadian, S., Hedjam, R., Khosravi, A., & Nahavandi, S. (2022). X-ray image based COVID-19 detection using evolutionary deep learning approach. Expert Systems with Applications, 201, 116942.
21. Jalled, F. (2017). Face recognition machine vision system using eigenfaces.
22. Karsai, M., Jo, H. H., Kaski, K., et al. (2018). Bursty human dynamics. Springer.
23. King, D. E. (2015). Max-margin object detection.
24. Lellouche, S., & Souris, M. (2020). Distribution of distances between elements in a compact set. Stats, 3(1), 1–15.
25. Lin, T., Maire, M., Belongie, S. J., Bourdev, L. D., Girshick, R. B., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. CoRR abs/1405.0312
26. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., Berg, A. C. (2016). Ssd: Single shot multibox detector (pp. 21–37). Lecture Notes in Computer Science. https://doi.org/10.1007/978-3-319-46448-0_2
27. Li, Z., Yang, W., Peng, S., & Liu, F. (2020). A survey of convolutional neural networks: Analysis, applications, and prospects.
28. Maneewongvatana, S., & Mount, D. M. (2001). An empirical study of a new approach to nearest neighbor searching. In Algorithm Engineering and Experimentation (pp. 172–187). Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-44808-x_14
29. Markelov, O., Nguyen, V., & Bogachev, M. (2017). Statistical modeling of the internet traffic dynamics: To which extent do we need long-term correlations? Physica A: Statistical Mechanics and its Applications, 485, 48–60.
30. Moltchanov, D. (2012). Distance distributions in random networks. Ad Hoc Networks, 10(6), 1146–1166.
31. Mundy, J. L., Zisserman, A., et al. (1992). Geometric invariance in computer vision (Vol. 92). MIT press Cambridge.
32. Newell, G., & Rosenblatt, M. (1962). Zero crossing probabilities for gaussian stationary processes. The Annals of Mathematical Statistics, 33(4), 1306–1313.
33. Nguyen, T., Chen, S. W., Shivakumar, S. S., Taylor, C. J., & Kumar, V. (2017). Unsupervised deep homography: A fast and robust homography estimation model.
34. Nguyen, V., Markelov, O., Serdyuk, A., Vasenev, A., & Bogachev, M. (2018). Universal rank-size statistics in network traffic: Modeling collective access patterns by zipf’s law with long-term correlations. EPL (Europhysics Letters), 123(5), 50001.
35. Panigrahy, R. (2008). An improved algorithm finding nearest neighbor using kd-trees. Lecture Notes in Computer Science, pp. 387–398. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-78773-0_34
36. Pan, J., & Manocha, D. (2011). Fast gpu-based locality sensitive hashing for k-nearest neighbor computation. In Proceedings of the 19th ACM SIGSPATIAL international conference on advances in geographic information systems, GIS, pp. 211–220. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/2093973.2094002
37. Pönisch, W., & Zaburdaev, V. (2018). Relative distance between tracers as a measure of diffusivity within moving aggregates. The European Physical Journal B, 91(2), 1–7.
38. Punn, N. S., Sonbhadra, S. K., & Agarwal, S. (2020). Monitoring covid-19 social distancing with person detection and tracking via fine-tuned yolo v3 and deepsort techniques.
39. Rezaei, M., & Azarmi, M. (2020). Deepsocial: Social distancing monitoring and infection risk assessment in covid-19 pandemic. arXiv preprint arXiv:2008.11672
40. Roy, S. S., Goti, V., Sood, A., Roy, H., Gavrila, T., Floroian, D., Paraschiv, N. & Mohammadi-Ivatloo, B. (2014). L2 regularized deep convolutional neural networks for fire detection. Journal of Intelligent & Fuzzy Systems, 1–12.
41. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN for brain tumor classification. Applied Sciences, 10(14), 4915.
42. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022). Deep convolutional neural network for environmental sound classification via dilation. Journal of Intelligent & Fuzzy Systems, 1–7.
43. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks.
44. Samui, P., Roy, S. S., & Balas, V. E. (Eds.). (2017). Handbook of neural computation. Academic Press.
45. Schroff, F., Kalenichenko, D., & Philbin, J. (2015). Facenet: A unified embedding for face recognition and clustering. In 2015 IEEE conference on computer vision and pattern recognition (CVPR), pp. 815–823. https://doi.org/10.1109/CVPR.2015.7298682
46. Singh, S., Kaur, A., & Taqdir, A. (2015). A face recognition technique using local binary pattern method. IJARCCE, 165–168. https://doi.org/10.17148/IJARCCE.2015.4340
47. Skliros, A., & Chirikjian, G. S. (2008). Position and orientation distributions for locally self-avoiding walks in the presence of obstacles. Polymer, 49(6), 1701–1715.
48. Sokolova, A., Uljanitski, Y., Kayumov, A. R., & Bogachev, M. I. (2021). Improved online event detection and differentiation by a simple gradient-based nonlinear transformation: Implications for the biomedical signal and image analysis. Biomedical Signal Processing and Control, 66, 102470.
49. Tamazian, A., Nguyen, V., Markelov, O., & Bogachev, M. (2016). Universal model for collective access patterns in the internet traffic dynamics: A superstatistical approach. EPL (Europhysics Letters), 115(1), 10008.
50. Tao, Y., & Sheng, C. (2014). Fast nearest neighbor search with keywords. IEEE Transactions on Knowledge and Data Engineering, 26, 878–888. https://doi.org/10.1109/TKDE.2013.66
51. Tejedor, V., Schad, M., Bénichou, O., Voituriez, R., & Metzler, R. (2011). Encounter distribution of two random walkers on a finite one-dimensional interval. Journal of Physics A: Mathematical and Theoretical, 44(39), 395005.
52. Vannoorenberghe, P., Motamed, C., Blosseville, J. M., & Postaire, J. G. (1997). Automatic pedestrian recognition using real-time motion analysis. In International conference on image analysis and processing (pp. 493–500). Springer.
53. Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition (CVPR 2001, vol. 1, pp. I–I). IEEE
54. Yianilos, P. N. (1993). Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the fourth annual ACM-SIAM symposium on discrete algorithms, SODA, pp. 311–321. Society for Industrial and Applied Mathematics, USA.
55. Zhang, K., Zhang, Z., Li, Z., & Qiao, Y. (2016). Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10), 1499–1503. https://doi.org/10.1109/lsp.2016.2603342
56. Apple and google framework. https://www.apple.com/newsroom/2020/04/apple-and-google-partner-on-covid-19-contact-tracing-technology/
57. Covidsafe app, Australia. https://www.health.gov.au/resources/apps-and-tools/covidsafe-app
58. The dp-3t project. https://github.com/DP-3T/documents
59. Hamagen app, israel. https://govextra.gov.il/ministry-of-health/hamagen-app/download-en/
60. Norway halting smittestop app. https://www.amnesty.org/en/latest/news/2020/06/norway-covid19-contact-tracing-app-privacy-win/
61. Pepp-pt project. https://github.com/pepp-pt/pepp-pt-documentation/blob/master/PEPP-PT-high-level-overview.pdf
Deep Learning-Based Conjunctival Melanoma Detection Using Ocular Surface Images
Kanchon Kanti Podder, Mohammad Kaosar Alam, Zakaria Shams Siam, Khandaker Reajul Islam, Proma Dutta, Adam Mushtak, Amith Khandakar, Shona Pedersen, and Muhammad E. H. Chowdhury
Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-981-99-3784-4_6.
K. K. Podder: Department of Biomedical Physics and Technology, University of Dhaka, Dhaka 1000, Bangladesh
M. K. Alam · Z. S. Siam: Department of Electrical, Electronic and Systems Engineering, Universiti Kebangsaan Malaysia, 43600 Bangi, Malaysia
Z. S. Siam: Department of Electrical and Computer Engineering, Presidency University, Dhaka, Bangladesh
K. R. Islam · A. Khandakar · M. E. H. Chowdhury (B): Department of Electrical Engineering, Qatar University, 2713 Doha, Qatar, e-mail: [email protected]
P. Dutta: Department of Electrical and Electronic Engineering, Chittagong University of Engineering and Technology, Chittagong 4349, Bangladesh
A. Mushtak: Clinical Imaging Department, Hamad Medical Corporation, Doha, Qatar
S. Pedersen: Department of Basic Medical Sciences, College of Medicine, Qatar University, 2713 Doha, Qatar
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis, Studies in Big Data 129, https://doi.org/10.1007/978-981-99-3784-4_6
1 Introduction
The eye is one of the most crucial and intricate human sensory organs. It aids in our ability to visualize objects as well as our perception of light, depth, and colour. Conjunctival nevus [1], a relatively common disorder, has several distinct clinical presentations [2]. Patients presenting with conjunctival lesions are frequently encountered in routine clinical practice
[3]. Conjunctival nevi can exhibit a range of malignant or benign characteristics [4]. Conjunctival melanoma [1] is an uncommon but potentially fatal malignant growth of the eye that develops from melanocytes found among the basal cells of the conjunctival epithelium [5]. This uncommon tumour accounts for around 2% of all eye tumours, 5% of ocular melanomas [6], and 0.25% of all melanomas [7]. Conjunctival melanoma is associated with mortality rates of at least 30% [8] and demands costly treatment, while a poor prognosis is linked to a delayed diagnosis [3, 5]. Conjunctival melanoma often manifests as a pigmented or colourful sharp conjunctival lesion; however, unusual cases with a variety of morphologies might cause the diagnosis to be delayed [9]. This condition may arise from either a nevus or acquired melanosis [10]. To reduce the mortality caused by this condition, prompt diagnosis and practical means of detection are necessary, given the current situation in several countries with ageing populations and insufficient healthcare resources. In a conventional clinical examination, an ophthalmologist views the ocular surface under a slit lamp to determine whether a patient has conjunctival melanoma, and a biopsy is necessary to verify the diagnosis [3]. The implementation of these in-clinic investigations has, however, been considerably impacted by the recent COVID-19 outbreak [11]. Therefore, ophthalmologists face significant difficulties in the prompt identification of conjunctival melanoma [3]. Medical imaging has already been greatly impacted by deep learning, and this influence is only expected to increase in the future [12, 13]. According to several experts, deep learning is going to be a key factor in the future of medicine and a key instrument for medical practice and research [14–18].
In the analysis of medical images, deep learning methods have already demonstrated impressive, and often unprecedented, performance in a wide range of tasks spanning both low- and high-level image processing functions, including image classification, detection, segmentation, enhancement, denoising, reconstruction, registration etc. [19–26]. Deep learning techniques that make use of digital images of pathological lesions are thought to be useful for enhancing the detection of skin malignancies [27, 28]. Even though many studies utilizing deep learning models have concentrated on skin melanoma [29–32], the use of modern deep learning technology to identify conjunctival melanomas has been underexplored. Because of the lack of substantial data, including ground truth data, on conjunctival diseases, training traditional deep neural networks to identify conjunctival melanoma is very difficult. Very recently, deep learning techniques for identifying conjunctival melanoma from ocular surface images were explored [3]. However, the dataset used there was not well curated, and more research is required to improve the classification performance further. The goal of the current study is to examine contemporary deep learning techniques for detecting conjunctival melanoma using a sizable, enhanced ocular surface image dataset. Four classes of image data, namely conjunctival melanoma, melanosis or nevus, normal conjunctiva, and pterygium [33] images, have been used in the present
study. Considering the research gap in the classification of conjunctival melanomas, this study makes the following contributions:
• A well-curated dataset for conjunctival melanoma, validated by medical experts.
• An effective and faster augmentation technique, as an alternative to CycleGAN-based augmentation [3], for enlarging a small conjunctival melanoma dataset.
• A high-performing deep learning model that classifies the different eye conditions with high accuracy. Additionally, we incorporated the interpretability of our findings.
This study intends to verify the hypothesis that conjunctival lesions can be classified, and conjunctival melanoma detected, from ocular surface images with the help of deep learning. The prompt identification of conjunctival lesions might be made easier by this investigation. The remainder of this study is organized as follows. The next section describes the materials and methods that were used. Afterwards, the findings are presented and discussed. Finally, we present the conclusion and potential future research.
2 Methodology
This study proposes a system in which an image of the eye taken with a smartphone is classified as normal or as one of several eye-related medical conditions. The pipeline comprises data collection, data cleaning and validation, CNN training and evaluation, and visual interpretation. Figure 1 illustrates the step-by-step workflow of the methodology proposed in this study.
2.1 Data Collection
This research focuses on analysing the anterior segment of the eye using a deep learning system and images of the ocular surface. The preliminary melanoma dataset on which our dataset was built was taken from [3]. Normal, Pterygium, Nevus, and Conjunctival melanosis were the four categories present in that dataset. The medical experts of our team identified some irrelevant and problematic images in the dataset of [3]. Because ocular images of subjects with conjunctival anomalies are widely available online and can be found through keyword searches (for example, “normal conjunctiva”, “pterygium”, “conjunctival nevus”, “conjunctival melanosis”), we removed the irrelevant images from the dataset proposed in [3] and added new images. Expert physicians double-checked the data to make sure they were accurate and valuable. The details of the original dataset and
Fig. 1 Depiction of methodology adopted in this study
Fig. 2 Dataset details before and after cleaning and validation
the proposed dataset in this study are illustrated in Fig. 2 and a sample representation of the different classes in the dataset is available in Fig. 3.
2.2 Data Augmentation
The dataset proposed in this study is referred to as the “Four Class Dataset”. A second dataset was derived from it. Here we
Fig. 3 A sample representation of the proposed dataset displaying images from class “Normal Conjunctiva”, “Conjunctival Melanoma”, “Nevus”, and “Pterygium”
have a “Binary Class Dataset” in which “Normal Conjunctiva” is separated from “Abnormal Conjunctiva” (which includes pterygium, nevus, and conjunctival melanosis). Both datasets were divided into a training set, a validation set, and a test set in proportions of 70%, 10%, and 20%, respectively. Because the datasets are small, four augmentation techniques were employed to enlarge the training set (Fig. 4): random rotation, random affine transformation, padding, and colour correction. Data augmentation is a proven way to counter the problem of small datasets, as shown in other publications [34–36]. The specifics of the four augmentation techniques explored in this study are provided in Table 1. Both single and multiple random augmentations were used in this investigation. Whether a single augmentation or multiple augmentations would be
Fig. 4 A representation of single and multiple augmentation techniques on an ocular surface image
Table 1 Augmentation techniques and ranges used in the training set of proposed datasets

| Augmentation technique | Range |
|---|---|
| Random rotation | +20 to −20 degrees |
| Random affine | Degree = 0; Translate range = (0.05, 0.15); Scaling range = (0.9, 0.95) |
| Padding | Range = (0, 10); Fill = (black, white); Mode = (‘Constant’, ‘Edge’) |
| Colour correction | Brightness = (0, 0.2); Contrast = (0, 0.2) |
used was determined randomly in the augmentation model. If multiple augmentations were chosen, the augmentation model then randomly decided which combination of augmentations to use. Figure 4 shows single and multiple augmentation techniques applied to an image of the ocular surface. In each of the two datasets, the training set of each class was expanded to 3000 samples by applying these four augmentation techniques. Because the validation and test sets were used to evaluate the deep learning models in a real-world setting, these two sets were left unchanged throughout the process. Table 2 describes the sizes of the datasets along with the augmentation [37] factors.

Table 2 The detailed description of proposed datasets. The curated dataset is validated by expert doctors and the training samples are increased by an augmentation factor using different augmentation techniques

| Dataset | Class | Original data | Validation set | Testing set | Training set | Augmentation factor | Training set samples after augmentation |
|---|---|---|---|---|---|---|---|
| Binary class | Normal conjunctiva | 125 | 13 | 25 | 87 | 34.48 | 3000 |
| Binary class | Abnormal conjunctiva | 285 | 28 | 57 | 200 | 15 | 3000 |
| Four class dataset | Normal conjunctiva | 125 | 13 | 25 | 87 | 34.48 | 3000 |
| Four class dataset | Nevus | 85 | 8 | 17 | 60 | 50 | 3000 |
| Four class dataset | Conjunctival melanoma | 130 | 13 | 26 | 91 | 32.97 | 3000 |
| Four class dataset | Pterygium | 70 | 7 | 14 | 49 | 61.44 | 3000 |
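The random single/multiple augmentation policy and the augmentation factor described above can be sketched as follows. This is a minimal standard-library sketch and not the authors' code: the names `sample_plan` and `augmentation_factor`, the 50/50 choice between single and multiple augmentation, and the uniform sampling inside each range are assumptions. Only the parameter ranges (Table 1) and the target of 3000 training samples per class come from the text.

```python
import random

# Parameter ranges copied from Table 1; how values are drawn is an assumption.
AUGMENTATIONS = {
    "rotation": lambda rng: {"degrees": rng.uniform(-20, 20)},
    "affine": lambda rng: {
        "degrees": 0,
        "translate": rng.uniform(0.05, 0.15),
        "scale": rng.uniform(0.9, 0.95),
    },
    "padding": lambda rng: {
        "pad": rng.randint(0, 10),
        "fill": rng.choice(["black", "white"]),
        "mode": rng.choice(["constant", "edge"]),
    },
    "colour": lambda rng: {
        "brightness": rng.uniform(0, 0.2),
        "contrast": rng.uniform(0, 0.2),
    },
}


def sample_plan(rng):
    """Pick one augmentation, or a random combination of two or more."""
    names = list(AUGMENTATIONS)
    if rng.random() < 0.5:  # assumed 50/50 split between single and multiple
        chosen = [rng.choice(names)]
    else:
        k = rng.randint(2, len(names))
        chosen = rng.sample(names, k)
    # Draw concrete parameters for every chosen augmentation.
    return {name: AUGMENTATIONS[name](rng) for name in chosen}


def augmentation_factor(train_size, target=3000):
    """Ratio between the augmented and the original training-set size."""
    return round(target / train_size, 2)


rng = random.Random(0)
print(sample_plan(rng))
print(augmentation_factor(87), augmentation_factor(200))
```

With the binary-class training sizes of 87 and 200 images, `augmentation_factor` reproduces the factors of 34.48 and 15 listed in Table 2.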
2.3 Convolutional Neural Network (CNN) Based Classification Models
This study used state-of-the-art CNNs to classify ocular surface images of normal eyes and of different eye conditions. Four CNN architectures, ResNet, DenseNet, GoogLeNet, and EfficientNet, were used with pre-trained weights. We selected these architectures because of their efficacy in a previous publication [38]. All four had been trained on the large benchmark dataset “ImageNet” [39], and the resulting pre-trained weights were reused here following the well-known concept of transfer learning: the CNN models are initialized with the pre-trained weights and then optimized during training on the ocular surface images. Details of the trained CNN architectures are given below:
2.3.1 GoogLeNet
GoogLeNet, proposed in [40], is built on the Inception module. Its authors proposed a wider and deeper Inception-based network, which achieved slightly better performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014 competition. In the Inception module with dimensionality reduction used by GoogLeNet, a 1 × 1 convolution is added before every 3 × 3 and 5 × 5 convolution. The model is 22 layers deep (27 layers when pooling layers are counted), with nine Inception modules stacked linearly. The last Inception module is connected to a global average pooling layer. The detailed model architecture, with convolutional layers, pooling, and activations, is available in [40].
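The effect of placing a 1 × 1 convolution before a larger convolution can be checked with simple weight counting. The sketch below is illustrative only: the helper `conv_weights` is a hypothetical name, biases are ignored, and the channel sizes (192 inputs, 16 reduced channels, 32 outputs) are assumed example values in the style of an early Inception block.

```python
def conv_weights(in_channels, out_channels, kernel_size):
    """Number of weights in a 2D convolution (biases ignored)."""
    return in_channels * out_channels * kernel_size * kernel_size


# Assumed example channel sizes for one 5 x 5 branch of an Inception module.
in_ch, reduced_ch, out_ch = 192, 16, 32

# 5 x 5 convolution applied directly to the 192-channel input.
direct = conv_weights(in_ch, out_ch, 5)

# 1 x 1 reduction to 16 channels followed by the 5 x 5 convolution.
reduced = conv_weights(in_ch, reduced_ch, 1) + conv_weights(reduced_ch, out_ch, 5)

print(direct)   # 153600
print(reduced)  # 15872
```

Under these assumed sizes the reduced branch needs roughly a tenth of the weights, which is the motivation for the dimensionality-reduction variant of the module.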
2.3.2 ResNet
The ResNet architecture proposed in [41] was designed to counter the vanishing gradient problem in deeper CNN architectures. As a CNN grows deeper and stacks more complex feature extractors, the features of the earlier layers start to vanish from the network, and the gradient vanishes as a result. The residual connection in ResNet addresses this problem with a skip connection that carries the features of an earlier layer to deeper layers. In this study, ResNet18, ResNet50, and ResNet152 were used. The number that follows the name ResNet indicates the number of neural network layers in the architecture. Thus, ResNet architectures with 18, 50, and 152 layers were evaluated and compared with the other CNN architectures in this ocular surface image classification research.
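A toy calculation can make the role of the skip connection concrete. The sketch below is a deliberately simplified scalar model, not a CNN: every layer is assumed to be the linear map y = w·x, and the function names are hypothetical. It only illustrates why adding the identity path keeps the gradient from shrinking towards zero as depth grows.

```python
def plain_gradient(w, depth):
    """d(output)/d(input) for y = w * x stacked `depth` times."""
    return w ** depth


def residual_gradient(w, depth):
    """Same stack, but each layer computes y = x + w * x (identity skip)."""
    return (1 + w) ** depth


w, depth = 0.01, 20  # assumed small per-layer weight and a 20-layer stack
print(plain_gradient(w, depth))     # shrinks towards zero
print(residual_gradient(w, depth))  # stays close to one
```

In the plain stack the gradient is a product of small factors, while in the residual stack each factor is one plus a small term, so the signal reaching the early layers is preserved.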
2.3.3 DenseNet
The authors of [42] observed that deeper CNN models can be more accurate and more efficient to train when they contain shorter connections between layers close to the input and layers close to the output. Based on this observation, they proposed DenseNet, which connects each layer to every other layer in a feed-forward fashion. They reported several benefits of DenseNet, including alleviation of the vanishing-gradient problem, better feature propagation, and feature reuse. This connectivity pattern achieved benchmark results on the ImageNet dataset while also significantly reducing the number of parameters. Both the DenseNet-161 and DenseNet-201 architectures were used in this study; their depths are 161 and 201 layers, respectively.
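The dense connectivity pattern can be quantified with two small formulas: a block of L layers has L(L+1)/2 direct connections, and the number of input channels grows linearly because each layer receives the concatenated outputs of all earlier layers. The sketch below assumes hypothetical helper names and example values (64 initial channels, growth rate 32) that are not taken from the chapter.

```python
def dense_connections(num_layers):
    """Direct connections in a dense block of `num_layers` layers: L(L+1)/2."""
    return num_layers * (num_layers + 1) // 2


def input_channels(layer_index, initial_channels, growth_rate):
    """Channels seen by layer `layer_index` (1-based) when every earlier
    layer contributes `growth_rate` feature maps via concatenation."""
    return initial_channels + growth_rate * (layer_index - 1)


print(dense_connections(5))       # 15
print(input_channels(4, 64, 32))  # 160
```

A conventional feed-forward stack of five layers would have only five connections, which is why dense blocks propagate and reuse features more directly.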
2.3.4 EfficientNet
CNNs such as VGGNet, ResNet, MobileNet, and SENet employ a variety of methods to improve the accuracy of the network. These methods typically scale up at least one of three dimensions: width, depth, or input resolution. The authors of [43] studied these scaling methods and integrated them into EfficientNet by proposing a scaling mechanism that scales all of these dimensions consistently. EfficientNet_B7, a member of the EfficientNet family, achieved 84.3% top-1 accuracy on ImageNet, and the pre-trained weights of this model were used in our ocular surface image classification.
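The compound scaling idea can be sketched numerically. The base coefficients below (1.2 for depth, 1.1 for width, 1.15 for resolution) are the values commonly quoted from the EfficientNet paper [43] and are used here as an assumption; the function names are hypothetical, and the sketch does not reproduce the exact configuration of EfficientNet_B7.

```python
# Base coefficients for depth, width, and resolution (assumed from [43]).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Multipliers for depth, width, and resolution at compound coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_multiplier(phi):
    """Approximate growth in FLOPs: depth scales linearly, width and
    resolution scale quadratically."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


for phi in (0, 1, 2):
    print(phi, compound_scale(phi), round(flops_multiplier(phi), 2))
```

The coefficients are chosen so that the product α·β²·γ² is close to 2, meaning that each unit increase of the compound coefficient roughly doubles the computational cost.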
2.4 Visualization Methods
How a CNN performs and the reasoning behind its decisions has always been an intriguing topic. With the development of visualization tools over the years, this curiosity can now be satisfied effectively. Such tools expose the rationale behind a model's inference in a form that humans can follow, which in turn builds confidence in the CNN's outputs. Among the various visualization tools, Grad-CAM [44] was chosen for this investigation because it has shown promising performance in recent computer vision problems [45]. Gradient-weighted Class Activation Mapping uses the gradients of the class score with respect to the feature maps of the final convolutional layer to produce a localization map that highlights the image regions contributing to the decision. The benefit of Grad-CAM over other visualization techniques is that it is applicable to a wide variety of CNN architectures, with or without fully connected layers [45]. Because sensitive medical condition classification was carried out in this study, it was necessary to confirm the region of interest with visualization for the CNN model to take it
Table 3 Details for hyper-parameters used for all CNN models to train on “Binary Class” and “Four Class” datasets

| Hyper-parameters | Details |
|---|---|
| Batch size | 4 |
| Optimizer | Adam |
| Loss function | NLL Loss |
| Learning rate | 0.0001 |
| Total epoch | 20 |
| Epoch patience | 6 |
| Drop factor of learning rate | 0.1 |
| Maximum epoch stop | 10 |
| Stop criteria | Loss |
into consideration. As a result, this ultimately strengthened the trust in the decision-making of the models. A discussion of the Grad-CAM visual representations obtained in this ocular surface image classification is provided at the very end of the results section.
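The Grad-CAM computation described in this section can be sketched without any deep learning library. In the sketch below, the function name `grad_cam`, the nested-list data layout, and the tiny 2 × 2 inputs are all hypothetical; in practice the feature maps and gradients would come from the final convolutional layer of a trained CNN.

```python
def grad_cam(feature_maps, gradients):
    """Return an H x W heatmap from K feature maps and their gradients.

    Each feature map is weighted by the spatial mean of the gradient of the
    class score with respect to that map; the weighted sum is passed
    through a ReLU so that only positive evidence remains.
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    # One weight per feature map: global average of its gradient.
    weights = [
        sum(sum(row) for row in grad) / (height * width)
        for grad in gradients
    ]
    heatmap = [[0.0] * width for _ in range(height)]
    for w_k, fmap in zip(weights, feature_maps):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += w_k * fmap[i][j]
    # ReLU keeps only regions that support the predicted class.
    return [[max(0.0, v) for v in row] for row in heatmap]


# Two hypothetical 2 x 2 feature maps and their gradients.
A = [[[1.0, 0.0], [0.0, 2.0]],
     [[0.0, 3.0], [1.0, 0.0]]]
dY_dA = [[[1.0, 1.0], [1.0, 1.0]],      # mean gradient +1.0
         [[-0.5, -0.5], [-0.5, -0.5]]]  # mean gradient -0.5

print(grad_cam(A, dY_dA))  # [[1.0, 0.0], [0.0, 2.0]]
```

The resulting map is normally upsampled to the input resolution and overlaid on the image, which is how heatmaps such as those in a Grad-CAM figure are produced.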
2.5 Experimental Setup
Five-fold cross-validation was used in the investigation of the “Binary Class” and “Four Class” datasets. The PyTorch library and Python 3.7 were used in this study. Training, validation, and testing were performed on the Google Colab Pro platform with a 16 GB Tesla T4 GPU and 120 GB of high RAM. The hyper-parameters used for all investigations are given in Table 3.
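One plausible reading of the learning-rate and stopping hyper-parameters in Table 3 is sketched below. The interpretation is an assumption: the learning rate is multiplied by the drop factor (0.1) after 6 epochs without improvement in validation loss, and training stops after 10 epochs without improvement. The function `run_schedule` is a hypothetical helper that replays a list of validation losses; it is not the authors' training loop.

```python
def run_schedule(val_losses, lr=1e-4, patience=6, drop_factor=0.1, stop_after=10):
    """Replay per-epoch validation losses and return (epochs_run, final_lr)."""
    best = float("inf")
    since_best = 0  # epochs since the best loss (used for early stopping)
    since_drop = 0  # epochs without improvement since the last LR drop
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, since_best, since_drop = loss, 0, 0
        else:
            since_best += 1
            since_drop += 1
        if since_drop >= patience:
            lr *= drop_factor  # reduce the learning rate on a plateau
            since_drop = 0
        if since_best >= stop_after:
            return epoch, lr   # stop early: no improvement for too long
    return len(val_losses), lr


# Hypothetical loss curve: improves for two epochs, then plateaus.
losses = [1.0, 0.9] + [0.95] * 18
print(run_schedule(losses))
```

For this hypothetical curve the learning rate is dropped once at epoch 8 and training stops at epoch 12, well before the maximum of 20 epochs.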
2.6 Evaluation Metrics
The performance of the CNN models was assessed with the following metrics: overall accuracy, precision, sensitivity/recall, F1 score, and specificity. Let α be the number of ocular surface images predicted as true positive, γ the number predicted as false positive, δ the number predicted as true negative, and θ the number predicted as false negative. The specificity, precision, sensitivity/recall, F1 score, and overall accuracy are then given by Eqs. (1–5):

\[ \text{Specificity} = \frac{\delta}{\gamma + \delta} \tag{1} \]

\[ \text{Precision} = \frac{\alpha}{\alpha + \gamma} \tag{2} \]

\[ \text{Recall} = \frac{\alpha}{\alpha + \theta} \tag{3} \]

\[ \text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4} \]

\[ \text{Overall Accuracy} = \frac{\alpha + \delta}{\alpha + \gamma + \delta + \theta} \tag{5} \]
The confusion matrix and ROC curves provide important information about the performance of deep learning models on medical image classification. In this study, the confusion matrix and ROC curve of each CNN model were examined to identify the best-performing model by comparison with the other models.
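The five metrics of this section can be computed directly from the four confusion counts. The sketch below uses a hypothetical function name and hypothetical counts; accuracy is taken as the fraction of correctly classified images, that is, true positives plus true negatives over all images.

```python
def classification_metrics(tp, fp, tn, fn):
    """Metrics from true/false positive and negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "specificity": tn / (fp + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts, not results from this study.
m = classification_metrics(tp=90, fp=10, tn=80, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

For multi-class problems the same function can be applied per class in a one-vs-rest fashion and the results averaged.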
3 Results
3.1 Binary Classification
“Normal Conjunctiva” and “Abnormal Conjunctiva” are the two classes used for binary classification with seven CNN models. The learning curves of these seven CNNs are available in Supplementary tables 1 to 7. All the learning curves suggest that the models are well trained and show no signs of overfitting or underfitting. Figure 5 displays the mean and standard deviation of the accuracies across the five folds for the seven pre-trained CNN models. EfficientNet_B7 achieved the highest mean accuracy and the lowest standard deviation in fold-wise accuracy. GoogLeNet’s performance varied more across the five folds than that of EfficientNet_B7, which indicates that GoogLeNet was comparatively less consistent from fold to fold. Table 4 presents the binary classification results of all the employed models along with the number of trainable parameters and the inference time of each. As Table 4 shows, of the seven CNN architectures used, EfficientNet_B7 performed best on every metric: its accuracy, precision, recall, F1 score, and specificity are 99.51%, 99.52%, 99.51%, 99.51%, and 99.70%, respectively. EfficientNet_B7 is also the heaviest network in terms of trainable parameters (more than 63 million). However, all the other models achieved very similar performance on the evaluation metrics in Table 4. Notably, the lightest network in terms of trainable parameters is GoogLeNet (only ~5.6 million), yet it achieved a classification performance comparable to that of the EfficientNet_B7 model with regard to the evaluation
Fig. 5 Representation of mean and standard deviation in the five-fold accuracy of all models for binary classification
Table 4 Performance metrics of different CNN models in detection of “Normal Conjunctiva” versus “Abnormal Conjunctiva” with five-fold cross-validation method in a binary class dataset

| Model | Trainable parameters | Inference time (second) | Overall accuracy (%) | Precision (%) | Recall (%) | F1 score (%) | Specificity (%) |
|---|---|---|---|---|---|---|---|
| ResNet18 | 11,177,538 | 0.00216 | 99.27 | 99.27 | 99.27 | 99.26 | 98.78 |
| ResNet50 | 23,512,130 | 0.00532 | 99.27 | 99.27 | 99.27 | 99.26 | 99.22 |
| ResNet152 | 58,147,906 | 0.01557 | 98.78 | 98.78 | 98.78 | 98.78 | 98.76 |
| GoogLeNet | 5,601,954 | 0.00641 | 98.78 | 98.79 | 98.78 | 98.78 | 98.27 |
| DenseNet161 | 26,476,418 | 0.01893 | 98.29 | 98.30 | 98.29 | 98.29 | 98.22 |
| DenseNet201 | 18,096,770 | 0.02592 | 99.02 | 99.05 | 99.02 | 99.03 | 99.17 |
| EfficientNet_B7 | 63,792,082 | 0.03334 | 99.51 | 99.52 | 99.51 | 99.51 | 99.70 |
criteria. In terms of trainable parameters, the EfficientNet_B7 model is about 11 times heavier than the GoogLeNet model, while its accuracy and precision are 0.73 percentage points higher. ResNet18 had the shortest inference time (around 2.16 ms) and still achieved very good classification performance (accuracy of 99.27%). Because the inference time of every network is very short (less than 0.04 s), all the employed networks can be used for real-time applications.
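The trade-off between model size, accuracy, and speed can be checked with a few lines of arithmetic on the Table 4 values. The dictionary layout and variable names below are illustrative, and the images-per-second figure assumes that one image is processed per inference call, which the chapter does not state explicitly.

```python
# Values copied from Table 4 (binary classification).
models = {
    "EfficientNet_B7": {"params": 63_792_082, "accuracy": 99.51, "seconds": 0.03334},
    "GoogLeNet": {"params": 5_601_954, "accuracy": 98.78, "seconds": 0.00641},
    "ResNet18": {"params": 11_177_538, "accuracy": 99.27, "seconds": 0.00216},
}

# How many times more trainable parameters the heaviest model has.
ratio = models["EfficientNet_B7"]["params"] / models["GoogLeNet"]["params"]

# Accuracy gap in percentage points.
gap = models["EfficientNet_B7"]["accuracy"] - models["GoogLeNet"]["accuracy"]

# Approximate throughput, assuming one image per inference call.
images_per_second = {name: 1 / m["seconds"] for name, m in models.items()}

print(round(ratio, 2))  # 11.39
print(round(gap, 2))    # 0.73
for name, fps in images_per_second.items():
    print(name, round(fps))
```

Even the slowest model in this comparison handles roughly 30 images per second under that assumption, which supports the real-time claim.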
Fig. 6 (a) ROC curve and (b) confusion matrix for the best-performing EfficientNet_B7 model, which has been trained and tested on binary class data. The confusion matrices and the ROC curves of the other models can be found in the supplementary materials
How well a model distinguishes critical medical conditions from normal data can also be understood from ROC curves, the AUC score, and confusion matrices. Figure 6 shows the ROC curve and confusion matrix of the best-performing network, EfficientNet_B7, for binary classification. The confusion matrices and ROC curves of the other models used in binary classification can be found in Supplementary Figures (1–14). Figure 6a plots the true positive rate (TPR) against the false positive rate (FPR) of EfficientNet_B7 in classifying “Normal Conjunctiva” versus “Abnormal Conjunctiva” at different thresholds. The area under the ROC curve (AUC) was close to 1.00, indicating that EfficientNet_B7 classified the samples accurately across all classification thresholds. The numbers of true positive, true negative, false positive, and false negative cases of EfficientNet_B7 are shown in the confusion matrix in Fig. 6b. Only one of the 285 test instances of the “Abnormal Conjunctiva” class across the five folds was identified as “Normal Conjunctiva”. Overall, EfficientNet_B7 outperformed the other CNNs.
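The way a ROC curve and its AUC are obtained by sweeping the decision threshold can be sketched as follows. The function names, labels, and scores are hypothetical and are not data from this study; the AUC is computed with the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the threshold over every
    distinct score, from the strictest to the most permissive."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical example: 1 = "Abnormal Conjunctiva", 0 = "Normal Conjunctiva".
labels = [1, 1, 0, 1, 0, 0]
scores = [0.95, 0.90, 0.60, 0.55, 0.30, 0.10]

print(roc_points(labels, scores))
print(auc(roc_points(labels, scores)))
```

An AUC close to 1.00 means that almost every abnormal image receives a higher score than almost every normal image, regardless of the threshold chosen.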
3.2 Multi-class Classification
The seven CNN models used in binary classification were also used for four-class classification. The learning curves of these models are available in Supplementary Tables 8 to 14 and display the trend of well-fitted models. Figure 7 illustrates the mean and standard deviation of the accuracies in five-fold cross-validation of all models on multi-class classification. Multi-class classification of three ocular conditions and the normal condition from ocular surface images presents significant challenges. EfficientNet_B7, a recently developed and robust CNN, had the highest mean accuracy across all five folds (94.43 percent). Although other CNNs, such as DenseNet161, demonstrated larger
Fig. 7 Graphical representation of mean and standard deviation in the five-fold accuracy of all the models on multi-class classification
standard deviations, the standard deviation of the EfficientNet_B7 model’s accuracy was small (±1.54), indicating steady fold-wise performance. Besides fold-wise accuracy, metrics such as overall accuracy, precision, recall, F1 score, and specificity are also important for understanding how a deep learning model behaves. Table 5 presents the four-class classification results of all the CNN models along with the number of trainable parameters and the inference time of each. The multi-class classification results in Table 5 exhibit greater variability than the binary classification results (Table 4). Table 5 shows that the EfficientNet_B7 model had the highest performance across all metrics used for assessing the models: its accuracy, precision, recall/sensitivity, F1 score, and specificity are 94.42%, 94.55%, 94.42%, 94.43%, and 98.20%, respectively. In terms of accuracy and recall, the DenseNet201 model matched the EfficientNet_B7 model exactly, but EfficientNet_B7 was better in precision, F1 score, and specificity. Although DenseNet161 and ResNet50 have more trainable parameters than GoogLeNet, the lighter network still outperformed them by a small margin. ResNet18 once again had the fastest inference time (approximately 2.26 ms) with a classification accuracy of 93.45%. In addition, the inference time of all the models was less than 0.04 s, making them suitable for use in real-time settings.

Table 5 The performance metrics of different state-of-the-art CNN models in the detection of conjunctival melanoma with a five-fold cross-validation method on a four-class dataset

| Model | Trainable parameters | Inference time (second) | Overall accuracy (%) | Precision (%) | Recall (%) | F1 score (%) | Specificity (%) |
|---|---|---|---|---|---|---|---|
| ResNet18 | 11,178,564 | 0.00226 | 93.45 | 93.49 | 93.45 | 93.46 | 97.86 |
| ResNet50 | 23,516,228 | 0.00539 | 91.02 | 91.22 | 91.02 | 91.01 | 96.91 |
| ResNet152 | 58,152,004 | 0.01494 | 92.72 | 92.92 | 92.72 | 92.76 | 97.66 |
| GoogLeNet | 5,604,004 | 0.00627 | 91.75 | 91.76 | 91.75 | 91.72 | 96.98 |
| DenseNet161 | 26,480,836 | 0.01789 | 91.02 | 91.34 | 91.02 | 91.1 | 96.90 |
| DenseNet201 | 18,100,612 | 0.02269 | 94.42 | 94.43 | 94.42 | 94.42 | 98.15 |
| EfficientNet_B7 | 63,797,204 | 0.03188 | 94.42 | 94.55 | 94.42 | 94.43 | 98.20 |

Fig. 8 (a) ROC curve and (b) confusion matrix for the best-performing EfficientNet_B7 model (the “others” class is labelled as the “abnormal” class) trained and tested on the multi-class dataset

Figure 8 shows the ROC curve and confusion matrix of the best-performing model, EfficientNet_B7, for multi-class classification. In Fig. 8a, the area under the ROC curve is around 0.99, indicating close-to-perfect performance in multi-class classification. Figure 8b reports the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts of the best-performing EfficientNet_B7 model. The true positive rates of EfficientNet_B7 in classifying normal, pterygium, nevus, and melanoma are 0.98, 0.94, 0.94, and 0.91, respectively, which indicates the model’s strong capability in distinguishing the classes. The confusion matrices and the ROC curves of the other models can be found in Supplementary Figures (15–28).
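Per-class true positive rates such as those read from a multi-class confusion matrix can be derived in a one-vs-rest fashion. The sketch below uses a hypothetical function name and a hypothetical 4 × 4 matrix; the counts are not the values of Fig. 8b.

```python
def per_class_rates(matrix):
    """matrix[i][j] = number of images of true class i predicted as class j.

    Returns, for each class, the one-vs-rest recall (true positive rate)
    and specificity.
    """
    total = sum(sum(row) for row in matrix)
    rates = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                 # class k predicted as another class
        fp = sum(row[k] for row in matrix) - tp  # other classes predicted as k
        tn = total - tp - fn - fp
        rates.append({"recall": tp / (tp + fn), "specificity": tn / (tn + fp)})
    return rates


# Hypothetical counts for four classes with 50 test images each.
classes = ["normal", "pterygium", "nevus", "melanoma"]
cm = [
    [49, 1, 0, 0],
    [2, 47, 1, 0],
    [0, 1, 46, 3],
    [0, 0, 4, 46],
]

for name, r in zip(classes, per_class_rates(cm)):
    print(name, round(r["recall"], 2), round(r["specificity"], 3))
```

Reading the matrix row by row in this way also shows which pairs of classes are confused most often, for example nevus and melanoma in the hypothetical counts above.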
3.3 Comparative Analysis with Existing Literature
The proposed method of using data augmentation and pre-trained CNNs improved model performance. The comparison between the previous literature [3] and the method proposed in this study is given in Table 6. In multi-class (four-class) classification, the proposed method achieved an accuracy that is 13.42 percentage points higher and an AUC that is 0.036 higher. The EfficientNet_B7 with image
Table 6 Comparative analysis of the performance of the proposed method with counterpart literature

| Datasets | Reference | Technique | AUC | Accuracy (%) |
|---|---|---|---|---|
| Four class | Yoo et al. | CycleGAN-based image augmentation, MobileNetV2 | 0.954 | 81 |
| Four class | Proposed method | Dataset cleaning, inclusion of related images, image augmentations, EfficientNet_B7 | 0.99 | 94.42 |
| Binary class | Yoo et al. | CycleGAN-based image augmentation, MobileNetV2 | 0.976 | 96.5 |
| Binary class | Proposed method | Dataset cleaning, inclusion of related images, image augmentations, EfficientNet_B7 | 1.00 | 99.73 |
augmentation techniques outperformed the CycleGAN-based Image Augmentation and MobileNetV2-based study reported in [3]. The proposed method also outperformed the previous literature in binary classification by 3.23% accuracy and 0.024 AUC.
3.4 Visualization Using Grad-CAM
Figures 9 and 10 show the Grad-CAM visual interpretation of the best-performing models on the “Binary Class” and “Four Class” datasets, respectively. This visual representation makes the models’ prediction process easier to comprehend. Figure 9 provides a visual interpretation of EfficientNet_B7 and ResNet18, two of the top-performing models in the “Binary Class” investigation. Both models based their predictions on the relevant region of interest. Visual interpretability is especially crucial in the “Four Class” investigation, because this study categorizes three different medical conditions (Nevus, Pterygium, and Conjunctival Melanoma) in addition to Normal subjects. Figure 10 displays the visual interpretation of the best
Fig. 9 Visual interpretation of ResNet18 and EfficientNet_B7 model predictions on the “Binary Class” dataset
Fig. 10 Visual interpretation of DenseNet201 and EfficientNet_B7 model predictions on the “Four Class” dataset
performing model, EfficientNet_B7, alongside that of DenseNet201. The visualizations indicate that the features EfficientNet_B7 learned from the relevant region of interest during training are key to its capacity to classify ocular surface images with high accuracy.
4 Conclusion
This study used state-of-the-art CNN models together with data curation, validation, and single and multiple augmentation techniques to classify ocular surface images in two investigations (“Binary Class” and “Four Class”). EfficientNet_B7 was the best-performing model, with 99.73% and 94.42% accuracy for the “Binary Class” and “Four Class” investigations, respectively, using the methodology proposed in this study. The results of both investigations outperformed previously published literature [3]. Moreover, this model showed a high sensitivity of 99.51% and 94.42% for the “Binary Class” and “Four Class” investigations, respectively. Because this study involves the diagnosis of sensitive medical conditions from ocular surface images, the best model, EfficientNet_B7, was also evaluated through Grad-CAM-based visual interpretation. In future, the proposed model can be deployed on a server so that it can produce predictions with visual interpretation for clinicians and patients. Such a server-based deployment could support telemedicine facilities in remote areas and help people in rural areas to have eye conditions diagnosed easily with visual interpretation.
Funding This work was made possible by Qatar National Research Fund (QNRF) NPRP12S0227–190164 and International Research Collaboration Co-Fund (IRCC) grant: IRCC-2021–001. The statements made herein are solely the responsibility of the authors.
References
1. Damato, B., & Coupland, S. E. (2008). Conjunctival melanoma and melanosis: a reappraisal of terminology, classification and staging. Clinical & Experimental Ophthalmology, 36(8), 786–795.
2. Oellers, P., & Karp, C. L. (2012). Management of pigmented conjunctival lesions. The Ocular Surface, 10(4), 251–263.
3. Yoo, T. K., Choi, J. Y., Kim, H. K., Ryu, I. H., & Kim, J. K. (2021). Adopting low-shot deep learning for the detection of conjunctival melanoma using ocular surface images. Computer Methods and Programs in Biomedicine, 205, 106086.
4. Shields, C. L., Fasiudden, A., Mashayekhi, A., & Shields, J. A. (2004). Conjunctival nevi: clinical features and natural course in 410 consecutive patients. Archives of Ophthalmology, 122(2), 167–175.
5. Wong, J. R., Nanji, A. A., Galor, A., & Karp, C. L. (2014). Management of conjunctival malignant melanoma: a review and update. Expert Review of Ophthalmology, 9(3), 185–204.
6. Isager, P., Engholm, G., Overgaard, J., & Storm, H. (2002). Uveal and conjunctival malignant melanoma in Denmark 1943–97: observed and relative survival of patients followed through 2002. Ophthalmic Epidemiology, 13(2), 85–96.
7. Chang, A. E., Karnell, L. H., & Menck, H. R. (1998). The National Cancer Data Base report on cutaneous and noncutaneous melanoma: A summary of 84,836 cases from the past decade. Cancer: Interdisciplinary International Journal of the American Cancer Society, 83(8), 1664–1678.
8. Larsen, A. C., Dahmcke, C. M., Dahl, C., Siersma, V. D., Toft, P. B., Coupland, S. E., et al. (2015). A retrospective review of conjunctival melanoma presentation, treatment, and outcome and an investigation of features associated with BRAF mutations. JAMA Ophthalmology, 133(11), 1295–1303.
9. Kao, A., Afshar, A., Bloomer, M., & Damato, B. (2016). Management of primary acquired melanosis, nevus, and conjunctival melanoma. Cancer Control, 23(2), 117–125.
10. Damato, B., & Coupland, S. E. (2008). Conjunctival melanoma and melanosis: a reappraisal of terminology, classification and staging. Clinical & Experimental Ophthalmology, 36(8), 786–795.
11. Hallak, J. A., Scanzera, A., Azar, D. T., & Chan, R. P. (2020). Artificial intelligence in ophthalmology during COVID-19 and in the post COVID-19 era. Current Opinion in Ophthalmology, 31(5), 447.
12. Ching, T., Himmelstein, D. S., Beaulieu-Jones, B. K., Kalinin, A. A., Do, B. T., Way, G. P., et al. (2018). Opportunities and obstacles for deep learning in biology and medicine. Journal of The Royal Society Interface, 15(141), 20170387.
13. Topol, E. J. (2019). High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine, 25(1), 44–56.
14. DuBois, K. N. (2019). Deep medicine: How artificial intelligence can make healthcare human again. Perspectives on Science and Christian Faith, 71(3), 199–201.
15. Rahman, T., Akinbi, A., Chowdhury, M. E., Rashid, T. A., Şengür, A., Khandakar, A., et al. (2022). COV-ECGNET: COVID-19 detection using ECG trace images with deep convolutional neural network. Health Information Science and Systems, 10(1), 1–16.
16. Rahman, T., Khandakar, A., Islam, K. R., Soliman, M. M., Islam, M. T., Elsayed, A., et al. (2022). HipXNet: Deep learning approaches to detect aseptic loosening of hip implants using X-ray images. IEEE Access, 10, 53359–53373.
17. Abir, F. F., Alyafei, K., Chowdhury, M. E., Khandakar, A., Ahmed, R., Hossain, M. M., et al. (2022). PCovNet: A presymptomatic COVID-19 detection framework using deep learning model using wearables data. Computers in Biology and Medicine, 147, 105682.
18. Chowdhury, M. H., Shuzan, M. N. I., Chowdhury, M. E., Reaz, M. B. I., Mahmud, S., Al Emadi, N., et al. (2022). Lightweight end-to-end deep learning solution for estimating the respiration rate from photoplethysmogram signal. Bioengineering, 9(10), 558.
19. Wang, G., Ye, J. C., Mueller, K., & Fessler, J. A. (2018). Image reconstruction is a new frontier of machine learning. IEEE Transactions on Medical Imaging, 37(6), 1289–1296.
20. Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention (pp. 234–241).
21. Haskins, G., Kruger, U., & Yan, P. (2020). Deep learning in medical image registration: A survey. Machine Vision and Applications, 31(1), 1–18.
22. Karimi, D., Dou, H., Warfield, S. K., & Gholipour, A. (2020). Deep learning with noisy labels: Exploring techniques and remedies in medical image analysis. Medical Image Analysis, 65, 101759.
23. Rahman, T., Chowdhury, M. E., Khandakar, A., Mahbub, Z. B., Hossain, M. S. A., Alhatou, A., et al. (2022). BIO-CXRNET: A robust multimodal stacking machine learning technique for mortality risk prediction of COVID-19 patients using chest x-ray images and clinical data. Neural Computing and Applications.
24. Tahir, A. M., Qiblawey, Y., Khandakar, A., Rahman, T., Khurshid, U., Musharavati, F., et al. (2022). Deep learning for reliable classification of COVID-19, MERS, and SARS from chest X-ray images. Cognitive Computation, 1–21.
25. Tahir, A. M., Chowdhury, M. E., Khandakar, A., Rahman, T., Qiblawey, Y., Khurshid, U., et al. (2021). COVID-19 infection localization and severity grading from chest X-ray images. Computers in Biology and Medicine, 139, 105002.
26. Qiblawey, Y., Tahir, A., Chowdhury, M. E., Khandakar, A., Kiranyaz, S., Rahman, T., et al. (2021). Detection and severity classification of COVID-19 in CT images using deep learning. Diagnostics, 11(5), 893.
27. Pacheco, A. G. C., & Krohling, R. A. (2020). The impact of patient clinical information on automated skin cancer detection. Computers in Biology and Medicine, 116, 103545.
28. Han, S. S., Park, G. H., Lim, W., Kim, M. S., Na, J. I., Park, I., et al. (2018). Deep neural networks show an equivalent and often superior performance to dermatologists in onychomycosis diagnosis: Automatic construction of onychomycosis datasets by region-based convolutional deep neural network. PloS one, 13(1), e0191493.
29. Bhimavarapu, U., & Battineni, G. (2022). Skin lesion analysis for melanoma detection using the novel deep learning model fuzzy GC-SCNN. In Healthcare, p. 962.
30. Martin-Gonzalez, M., Azcarraga, C., Martin-Gil, A., Carpena-Torres, C., Jaen, P., & Health, P. (2022). Efficacy of a deep learning convolutional neural network system for melanoma diagnosis in a hospital population. International Journal of Environmental Research and Public Health, 19(7), 3892.
31. Haenssle, H. A., Fink, C., Schneiderbauer, R., Toberer, F., Buhl, T., Blum, A., et al. (2018). Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Annals of Oncology, 29(8), 1836–1842.
32. Brinker, T. J., Hekler, A., Enk, A. H., Klode, J., Hauschild, A., Berking, C., et al. (2019). A convolutional neural network trained with dermoscopic images performed on par with 145 dermatologists in a clinical melanoma image classification task. European Journal of Cancer, 111, 148–154.
33. Yin, G., Gendler, S., & Teichman, J. (2022). Ocular surface squamous neoplasia in a patient following oral steroids for contralateral necrotising scleritis. BMJ Case Reports CP, 15(12), e253300.
34. Rahman, T., Chowdhury, M. E., Khandakar, A., Mahbub, Z. B., Hossain, M. S. A., Alhatou, A., et al. (2022). BIO-CXRNET: A robust multimodal stacking machine learning technique for mortality risk prediction of COVID-19 patients using chest x-ray images and clinical data. arXiv preprint arXiv:2206.07595
35. Khandakar, A., Chowdhury, M. E., Reaz, M. B. I., Ali, S. H. M., Kiranyaz, S., Rahman, T., et al. (2022). A novel machine learning approach for severity classification of diabetic foot complications using thermogram images. Sensors, 22(11), 4249.
36. Rahman, T., Khandakar, A., Islam, K. R., Soliman, M. M., Islam, M. T., Elsayed, A. et al. (2022). HipXNet: Deep learning approaches to detect aseptic loos-ening of hip implants using x-ray images. IEEE Access, 10, 53359–53373. 37. Wang, H., Wang, Z., Du, M., Yang, F., Zhang, Z., Ding, S., et al. (2020). Score-CAM: Scoreweighted visual explanations for convolutional neural networks. In Proceedings of the IEEE/ CVF conference on computer vision and pattern recognition workshops (pp. 24–25). 38. Schlemper, J., Oktay, O., Schaap, M., Heinrich, M., Kainz, B., Glocker, B., et al. (2019). Attention gated networks: Learning to leverage salient regions in medical images. Medical Image Analysis, 53, 197–207. 39. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252. 40. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., et al. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1–9). 41. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778). 42. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4700–4708). 43. Tan, M., & Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning (pp. 6105–6114). 44. Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Gradcam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision (pp. 618–626). 45. Podder, K. K., Chowdhury, M. E., Tahir, A. 
M., Mahbub, Z. B., Khandakar, A., Hossain, M. S., et al. (2022). Bangla sign language (bdsl) alphabets and numerals classification using a deep learning model. Sensors, 22(2), 574.
Plant Diseases Classification Using Neural Network: AlexNet Mohd Anas, Sanjiban Sekhar Roy, Kunwar S. Srivastava, and Jashabir Chakraborty
M. Anas · S. S. Roy (B) · K. S. Srivastava
School of Computer Science and Engineering, Vellore Institute of Technology, Vellore 632014, India
e-mail: [email protected]
J. Chakraborty
Mata Gujri College of Pharmacy, Mata Gujri University, Bihar, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis, Studies in Big Data 129, https://doi.org/10.1007/978-981-99-3784-4_7
1 Introduction
Not so long ago, India was a predominantly agricultural country. Even today, there are roughly 118 million farmers in the country [1]. One of the major problems these farmers and cultivators face is the range of diseases that affect their plants. Disease not only worsens their economic situation but also affects their social life, since it can destroy hours, and sometimes years, of hard work. Several chemicals can be employed to alleviate the problem, but the main difficulty is diagnosis: unless farmers have a laboratory in their vicinity, diseases are likely to be misidentified. The situation may then worsen, as a misidentified disease often spreads to other farms.
India has seen a large increase in smartphone sales, coupled with the rise of the middle class. Telecommunication companies competing for this growing market have driven the cost of internet usage down to nearly zero. There are nearly 833 million internet users [2], which is equal to 59.28% of the population of India. In this chapter, we work towards a diagnostic tool for farmers and cultivators who have smartphones with internet access, which could reduce food loss in the country.
To help these farmers, David P. Hughes and Marcel Salathé created PlantVillage, an open-access database of more than 50,000 images of healthy and diseased crops. The database covers more than 150 crops and 1800 diseases. PlantVillage is also a community of people who help each other by answering questions and identifying diseases from the pictures attached to those questions. This is helpful, but it shares the drawbacks of human diagnosis described above [3]. In their paper, Hughes and Salathé describe the advantages of computer diagnostic tools over human diagnosis. The full set of images in their database cannot be downloaded; however, in April 2016 PlantVillage released a subset of the dataset for an image classification challenge on CrowdAI [3].
2 Machine Learning and Deep Learning
In this section, we discuss machine learning and neural networks in detail.
2.1 History
Deep learning was long an underappreciated field for several reasons, such as the absence of powerful GPUs, the absence of sufficient data, and limited scientific work. In fact, "deep learning" is a term coined to attract interest in neural networks again. There have been three phases of development in the field: it was known as cybernetics in the 1940s–1960s, as connectionism in the 1980s–1990s, and as deep learning from the late 2000s. These models are also known as artificial neural networks (ANNs) because their design is inspired by biological neural networks [4]. The earliest neural network models were simple linear models. They were designed to take inputs $\{x_1, x_2, \ldots, x_N\}$ at the input layer and associate them with an output $y$. The network would learn the weights $\{w_1, w_2, \ldots, w_N\}$ such that
\[ f(x, w) = y = x_1 w_1 + x_2 w_2 + \cdots + x_N w_N \tag{1} \]
The McCulloch-Pitts neuron, the perceptron and ADALINE (adaptive linear element) were some of these linear models. Although these models were very useful, they had limitations; most importantly, they could not represent the XOR function. Neural networks lost popularity after this discovery. A great deal of research took place during the second phase, popularly known as connectionism. The most important development in this phase was the successful use of the backpropagation algorithm for training. Algorithms from this period, such as backpropagation and LSTM, are still popular. The popularity of neural networks nevertheless declined again, because companies made unrealistic claims and then under-delivered on them. Meanwhile, various other machine learning models were performing far better than neural networks, which reduced their popularity further. In 2006, Geoffrey Hinton trained a neural network called a deep belief network, which sparked interest in neural networks again. By then the world had more computational power and more data. By 2012, deep learning had proved to be a useful state-of-the-art technology in object detection, image classification and computer vision.
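The XOR limitation of linear models can be checked numerically. The following is a minimal sketch, assuming a linear model of the form in Eq. (1) extended with a constant input (a bias term, which is our addition and not part of Eq. (1)) and fitted by least squares; all variable names are illustrative.

```python
import numpy as np

# The four XOR input patterns and their target outputs.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])

# Prepend a constant input of 1 so the linear model can learn an offset
# (an assumption added for this sketch).
A = np.hstack([np.ones((4, 1)), X])

# Least-squares fit of y ~ A @ w: the best linear model in the squared-error sense.
w, *_ = np.linalg.lstsq(A, y, rcond=None)
predictions = A @ w

print("weights:", np.round(w, 3))
print("predictions:", np.round(predictions, 3))
```

The best linear fit returns the same value for all four patterns, so no threshold on its output can separate the two XOR classes.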
2.2 Machine Learning Basics
A learning program is said to learn from experience E on task T with respect to performance measure P if its performance on T, as measured by P, improves with experience E. A learning program produces a representation R (often called a hypothesis h) of what it has learned. Another program can use R to perform T. A learning program uses a learning algorithm A to produce R from E [4].
2.2.1 Capacity, Overfitting and Underfitting
The main challenge in machine learning is that the trained model must perform well on new, previously unseen data points. This ability is called generalization. When we train a model on a dataset, we compute an error measure known as the training error, and we want this error to be as low as possible. To have a working model, however, we also want good generalization, which means that the test error should be low as well [4, 5]. Take linear regression as an example. We train the model by minimizing the training error, which is
\[ \frac{1}{m^{(\mathrm{train})}} \left\| X^{(\mathrm{train})} w - Y^{(\mathrm{train})} \right\|_2^2 \tag{2} \]
Similarly, we want the test error to be small, which is
\[ \frac{1}{m^{(\mathrm{test})}} \left\| X^{(\mathrm{test})} w - Y^{(\mathrm{test})} \right\|_2^2 \tag{3} \]
Two factors determine how well a machine learning model performs: its ability to make the training error small, and its ability to keep the gap between the training error and the test error small. Underfitting occurs when the model cannot make the training error small, and overfitting occurs when the gap between the training error and the test error is too large. In simpler words, when the model has not learned the features sufficiently we call it underfitting, whereas when the model memorizes the training data instead of learning from it we call it overfitting. We can control whether a model is more likely to overfit or underfit by altering its capacity.
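The effect of capacity can be illustrated with polynomial regression. This is a minimal sketch under our own assumptions: the data are synthetic (a noisy sine curve), the polynomial degree plays the role of capacity, and the errors are the mean squared errors of Eqs. (2) and (3).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(m):
    """Draw m noisy samples of a sine curve (synthetic data for illustration)."""
    x = rng.uniform(-1.0, 1.0, size=m)
    y = np.sin(np.pi * x) + 0.2 * rng.normal(size=m)
    return x, y

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

errors = {}
for degree in (1, 3, 9):  # low, moderate and high capacity
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train={errors[degree][0]:.4f} test={errors[degree][1]:.4f}")
```

The training error cannot increase as the degree grows, because every lower-degree polynomial is a special case of a higher-degree one. It is the gap between the training and test errors that indicates overfitting.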
2.2.2 Hyperparameters and Validation Set
Machine learning algorithms generally have several settings that control the behaviour of the training algorithm; these settings are called hyperparameters. We usually do not learn hyperparameters on the training set, because hyperparameters chosen on the training set will almost always lead to overfitting. To solve this problem, we need another dataset, known as the validation set. The validation set is taken from the training data but is not used to fit the model. It is used during and after training to estimate the generalization (test) error, and we update the hyperparameters accordingly [4]. Typically, we use 80% of the training data for training and 20% for validation.
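The 80/20 split can be sketched as follows. The function name `train_validation_split` and the toy arrays are illustrative assumptions, not part of the chapter's pipeline.

```python
import numpy as np

def train_validation_split(X, y, validation_fraction=0.2, seed=0):
    """Shuffle the examples and hold out a fraction of them for validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_val = int(round(validation_fraction * len(X)))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Toy data: 50 examples with 2 features each; the label is the example index.
X = np.arange(100, dtype=float).reshape(50, 2)
y = np.arange(50)
X_train, y_train, X_val, y_val = train_validation_split(X, y)
print(len(X_train), len(X_val))
```

Shuffling before splitting avoids a validation set that contains only the last classes of an ordered dataset.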
2.2.3 Gradient Descent and Stochastic Gradient Descent
Gradient descent and its variants are widely used in deep learning algorithms [6]. Gradient descent minimizes an error function such as the in-sample error
\[ E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} e(h(x_n), y_n) \tag{4} \]
To compute the error, or the gradient of the error, we have to evaluate the hypothesis at every point in the sample. We move down the error surface in the direction of the negative gradient. The procedure is iterative: we take one step at a time, and each step is a full epoch, where an epoch means one pass over all the examples. The weight update in this case is
\[ \Delta w = -\alpha \nabla E_{in}(w) \tag{5} \]
where $\alpha$ is the learning rate.
In stochastic gradient descent, instead of moving in the $w$ space based on all the examples, we move based on one example at a time. $\nabla E_{in}$ is based on all examples $(x_n, y_n)$. Because we are about to use another method, we refer to standard gradient descent as "batch" GD. In stochastic gradient descent, we pick one example at a time and apply gradient descent to the error at that point, $e(h(x_n), y_n)$, instead of the whole dataset. Now consider the average direction in which we are sent:
\[ \mathbb{E}_n\left[ -\nabla e(h(x_n), y_n) \right] \tag{6} \]
If we take the error measure on just one example and take its expected value over the choice of example, we obtain the batch gradient of Eq. (4) [4, 6, 7]:
\[ \mathbb{E}_n\left[ -\nabla e(h(x_n), y_n) \right] = \frac{1}{N} \sum_{n=1}^{N} -\nabla e(h(x_n), y_n) = -\nabla E_{in}(w) \tag{7} \]
So, on average, we move in the direction we want, except that each step uses only one example in the computation, and we keep repeating. Each step points in the desired direction in expectation, and over time the noise averages out so that we move along the desired direction.
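The relation between the single-example gradients and the batch gradient can be verified on a small problem. The sketch below assumes a linear hypothesis $h(x) = w \cdot x$ with squared error on synthetic data; the data, the step size `alpha` and the number of steps are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data (illustrative): y = X @ w_true + noise.
N, d = 200, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

def example_gradient(w, n):
    """Gradient of e = (h(x_n) - y_n)^2 for the linear hypothesis h(x) = w . x."""
    return 2.0 * (X[n] @ w - y[n]) * X[n]

def batch_gradient(w):
    """Gradient of E_in(w) = (1/N) * sum_n e(h(x_n), y_n)."""
    return 2.0 * X.T @ (X @ w - y) / N

# Averaging the single-example gradients over all n recovers the batch gradient (Eq. 7).
w0 = np.zeros(d)
mean_of_example_gradients = np.mean([example_gradient(w0, n) for n in range(N)], axis=0)

def sgd(alpha=0.01, steps=5000):
    """Stochastic gradient descent: one randomly chosen example per update."""
    w = np.zeros(d)
    for _ in range(steps):
        n = rng.integers(N)
        w -= alpha * example_gradient(w, n)
    return w

w_sgd = sgd()
print("batch gradient at w0:   ", batch_gradient(w0))
print("mean example gradient:  ", mean_of_example_gradients)
print("SGD estimate of weights:", np.round(w_sgd, 2))
```

Each SGD step costs one gradient evaluation instead of N, which is the practical reason for preferring it on large datasets.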
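2.2.4 Neural Network and Backpropagation
We denote the weights by $w_{ij}^{(l)}$, where $l$ indexes the layer, $i$ the input and $j$ the output [7, 8] (Fig. 1):
\[ w_{ij}^{(l)}: \quad \begin{cases} 1 \le l \le L & \text{layers} \\ 0 \le i \le d^{(l-1)} & \text{inputs} \\ 1 \le j \le d^{(l)} & \text{outputs} \end{cases} \tag{8} \]
We use $\tanh$ as the activation function (Fig. 2):
\[ \theta(s) = \tanh(s) = \frac{e^{s} - e^{-s}}{e^{s} + e^{-s}} \tag{9} \]
The output of unit $j$ in layer $l$ is
\[ x_j^{(l)} = \theta\left(s_j^{(l)}\right) = \theta\left( \sum_{i=0}^{d^{(l-1)}} w_{ij}^{(l)} x_i^{(l-1)} \right) \tag{10} \]
Apply $x$ to the input layer, $x_1^{(0)}, \ldots, x_{d^{(0)}}^{(0)}$, and propagate forward through the layers to obtain $x_1^{(L)} = h(x)$.
2.2.5 Applying Stochastic Gradient Descent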
We take one example at a time, apply it to the network, and adjust the weights of the network in the direction of the negative gradient; using a single example per update is what makes the procedure stochastic [7]. All the weights $w = \{w_{ij}^{(l)}\}$ determine $h(x)$. The error on example $(x_n, y_n)$ is
\[ e(h(x_n), y_n) = e(w) = \frac{1}{2} \left( h(x_n) - y_n \right)^2 \tag{11} \]
So, to implement SGD, all we need is the gradient $\nabla e(w)$:
\[ \nabla e(w): \quad \frac{\partial e(w)}{\partial w_{ij}^{(l)}} \quad \text{for all } i, j, l \tag{12} \]
Fig. 1 A multi-layer perceptron
Fig. 2 Graph for tanh(x) activation function
Fig. 3 Backpropagation: phase I
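All we have to do is compute this for every $i, j, l$ and then move the whole weight vector along the negative gradient (Fig. 3). We can evaluate $\partial e(w) / \partial w_{ij}^{(l)}$ using the chain rule:
\[ \frac{\partial e(w)}{\partial w_{ij}^{(l)}} = \frac{\partial e(w)}{\partial s_j^{(l)}} \times \frac{\partial s_j^{(l)}}{\partial w_{ij}^{(l)}}, \quad \text{where} \quad \frac{\partial e(w)}{\partial s_j^{(l)}} = \delta_j^{(l)} \quad \text{and} \quad \frac{\partial s_j^{(l)}}{\partial w_{ij}^{(l)}} = x_i^{(l-1)} \tag{13} \]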
Now let us find $\delta$ for the final layer. In the forward pass, we started with $x$ at the first layer and propagated it forward until we reached the output. Here we go the other way: if we know $\delta$ for the final layer, we can use it to find $\delta$ for the previous layers by propagating backwards, hence the name backpropagation. With
\[ \delta_j^{(l)} = \frac{\partial e(w)}{\partial s_j^{(l)}}, \]
the final layer has $l = L$ and $j = 1$, so
\[ \delta_1^{(L)} = \frac{\partial e(w)}{\partial s_1^{(L)}} \quad \text{and} \quad e(w) = e(h(x_n), y_n) \tag{14} \]
Fig. 4 Backpropagation: phase II
Here $e(w)$ is the error measure. The input is propagated through each layer until we reach the output $h(x_n) = x_1^{(L)}$, which is compared with the target output $y_n$:
\[ e(w) = e\left(x_1^{(L)}, y_n\right) \tag{15} \]
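Suppose we are using the squared error; then (Fig. 4)
\[ e(w) = \left( x_1^{(L)} - y_n \right)^2, \qquad x_1^{(L)} = \theta\left(s_1^{(L)}\right), \qquad \theta'(s) = 1 - \theta^2(s) \ \text{for } \tanh \]
so that $\delta_1^{(L)} = 2 \left( x_1^{(L)} - y_n \right) \left( 1 - (x_1^{(L)})^2 \right)$. For the earlier layers, the chain rule gives
\[ \delta_i^{(l-1)} = \frac{\partial e(w)}{\partial s_i^{(l-1)}} = \sum_{j=1}^{d^{(l)}} \frac{\partial e(w)}{\partial s_j^{(l)}} \times \frac{\partial s_j^{(l)}}{\partial x_i^{(l-1)}} \times \frac{\partial x_i^{(l-1)}}{\partial s_i^{(l-1)}} \tag{16} \]
where
\[ \frac{\partial e(w)}{\partial s_j^{(l)}} = \delta_j^{(l)}, \qquad \frac{\partial s_j^{(l)}}{\partial x_i^{(l-1)}} = w_{ij}^{(l)}, \qquad \frac{\partial x_i^{(l-1)}}{\partial s_i^{(l-1)}} = \theta'\left(s_i^{(l-1)}\right) \]
Hence
\[ \delta_i^{(l-1)} = \sum_{j=1}^{d^{(l)}} \delta_j^{(l)} \times w_{ij}^{(l)} \times \theta'\left(s_i^{(l-1)}\right) \tag{17} \]
\[ \delta_i^{(l-1)} = \left( 1 - \left(x_i^{(l-1)}\right)^2 \right) \sum_{j=1}^{d^{(l)}} w_{ij}^{(l)} \delta_j^{(l)} \tag{18} \]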
2.2.6 Backpropagation Algorithm
1. Initialize all weights $w_{ij}^{(l)}$ at random.
2. For t = 0, 1, … do
3. Pick n from {1, 2, …, N}
4. Forward: compute all $x_j^{(l)}$
5. Backward: compute all $\delta_j^{(l)}$
6. Update the weights: $w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \eta \, x_i^{(l-1)} \delta_j^{(l)}$
7. Iterate to the next step until it is time to stop.
8. Return the final weights $w_{ij}^{(l)}$.
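The algorithm can be sketched in a few lines of NumPy. This is an illustrative implementation under our own assumptions, not the authors' code: the error is the squared error $(x_1^{(L)} - y_n)^2$, every layer uses tanh, row 0 of each weight matrix holds the weights for the constant input $x_0 = 1$, and the layer sizes, the learning rate `eta` (the step size in the update of step 6) and the XOR data with targets in {-1, +1} are arbitrary choices.

```python
import numpy as np

def init_weights(dims, rng):
    """dims = [d0, ..., dL]; each matrix has shape (d_prev + 1, d_cur), row 0 for x_0 = 1."""
    return [rng.normal(scale=0.5, size=(dims[l - 1] + 1, dims[l]))
            for l in range(1, len(dims))]

def forward(weights, x):
    """Forward pass: returns the outputs of every layer, each with a leading constant 1."""
    xs = [np.concatenate(([1.0], x))]
    for W in weights:
        s = xs[-1] @ W                                   # s_j = sum_i w_ij * x_i
        xs.append(np.concatenate(([1.0], np.tanh(s))))   # x_j = tanh(s_j)
    return xs

def backward(weights, xs, y):
    """Backward pass: returns de/dw for every layer, with e = (output - y)^2."""
    out = xs[-1][1:]
    delta = 2.0 * (out - y) * (1.0 - out ** 2)           # delta for the final layer
    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        grads[l] = np.outer(xs[l], delta)                # x_i^(l-1) * delta_j^(l)
        if l > 0:
            x_prev = xs[l][1:]                           # skip the constant unit
            delta = (1.0 - x_prev ** 2) * (weights[l][1:, :] @ delta)  # Eq. (18)
    return grads

def train(X, Y, dims, eta=0.1, steps=5000, seed=0):
    """SGD with backpropagation: one randomly picked example per weight update."""
    rng = np.random.default_rng(seed)
    weights = init_weights(dims, rng)
    for _ in range(steps):
        n = rng.integers(len(X))
        grads = backward(weights, forward(weights, X[n]), Y[n])
        for W, g in zip(weights, grads):
            W -= eta * g                                 # w <- w - eta * x * delta
    return weights

# Toy usage on XOR with targets in {-1, +1} (illustrative data only).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[-1.0], [1.0], [1.0], [-1.0]])
trained = train(X, Y, dims=[2, 4, 1])
for x, y in zip(X, Y):
    print(x, y, forward(trained, x)[-1][1:])
```

A useful sanity check for any backpropagation code is to compare the returned gradients with finite-difference estimates of the error on a single example.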
2.3 Convolutional Neural Network
A convolutional neural network (CNN) is a special kind of neural network. It gets its name from the fact that it uses convolution in at least one of its layers. It is widely used in computer vision tasks such as image segmentation and classification [9, 10].
2.3.1 Convolution
In mathematics and engineering, convolution is a mathematical operation on two functions. It is defined as the integral of the product of the two functions after one of them is reversed and shifted:
\[ s(t) = \int x(a) \, w(t - a) \, da, \qquad s(t) = (x * w)(t) \tag{19} \]
Convolution is denoted by an asterisk (*). In deep learning, the function $x$ is known as the input and the function $w$ is known as the kernel. Convolution leverages three important ideas that can help improve a machine learning system: sparse interactions, parameter sharing and equivariant representations. Additionally, convolution provides a means of working with inputs of variable size.
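For sampled signals the integral in Eq. (19) becomes a sum. The sketch below writes the discrete convolution out explicitly and compares it with NumPy's built-in routine; the function name `convolve_1d` and the example arrays are illustrative.

```python
import numpy as np

def convolve_1d(x, w):
    """Discrete convolution s[t] = sum_a x[a] * w[t - a] (full-length output)."""
    s = np.zeros(len(x) + len(w) - 1)
    for t in range(len(s)):
        for a in range(len(x)):
            if 0 <= t - a < len(w):      # the kernel is zero outside its support
                s[t] += x[a] * w[t - a]
    return s

x = np.array([1.0, 2.0, 3.0, 4.0])   # input signal
w = np.array([0.5, 0.25])            # kernel
print(convolve_1d(x, w))
print(np.convolve(x, w))             # NumPy's built-in full convolution, for comparison
```

The same small kernel is reused at every position t, which is the parameter sharing mentioned above. Note that many deep learning libraries actually implement cross-correlation, which is the same operation without the kernel reversal.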
2.3.2 Pooling
A layer of a convolutional network has three stages: a convolution stage, an activation function such as ReLU, and a pooling stage. A pooling stage replaces the output of the net at a given location with a statistical summary of the nearby outputs. It performs down-sampling along the height and width dimensions. The most commonly used pooling operation is max pooling.
2.3.3 ReLU
The rectified linear unit (ReLU) is an activation function defined as
\[ f(x) = \max(0, x) \tag{20} \]
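The activation and pooling stages can be sketched on a small feature map. The 4 × 4 array and the non-overlapping 2 × 2 window are illustrative assumptions; real networks often use other window sizes and strides.

```python
import numpy as np

def relu(x):
    """Rectified linear unit, f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling on a 2-D array with even height and width."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([[ 1.0, -2.0,  3.0,  0.0],
                        [-1.0,  5.0, -3.0,  2.0],
                        [ 0.0, -4.0,  1.0,  1.0],
                        [ 2.0,  2.0, -6.0, -1.0]])
activated = relu(feature_map)        # negative responses are set to zero
pooled = max_pool_2x2(activated)     # each 2x2 block is replaced by its maximum
print(pooled)                        # [[5. 3.] [2. 1.]]
```

Pooling halves the height and the width here, so the next layer sees a 2 × 2 summary of the 4 × 4 input.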
Convolutional nets were some of the first working deep networks trained with backpropagation. It is not fully clear why convolutional networks succeeded when general backpropagation networks were considered to have failed.
2.4 Various Deep Learning Libraries
There are several deep learning libraries to choose from. Some popular ones are:
2.4.1 Theano
Theano is a Python-based framework developed by the LISA group, led by Yoshua Bengio, at the University of Montreal [11].
2.4.2 Torch
Torch is a deep learning framework developed by Ronan Collobert, Clement Farabet and Koray Kavukcuoglu [12].
2.4.3 Caffe
Caffe is a deep learning library written in C++ with a Python interface, developed by Yangqing Jia at the Berkeley Vision and Learning Center. The biggest advantage of Caffe is the number of pretrained networks that can be downloaded from its model zoo [13].
2.4.4 TensorFlow
TensorFlow is an open-source software library for machine learning across a range of tasks. It was created by Google to meet its need for systems capable of building and training neural networks to detect and interpret patterns and correlations.
2.4.5 Deeplearning4j
Deeplearning4j is an open-source, distributed deep learning framework for the Java and Scala programming languages. It supports a variety of neural network architectures, such as feedforward, recurrent and convolutional networks, and enables deployment of models on GPUs, CPUs and embedded devices [14].
3 Experimental Work and Results
In this section, we describe the model used and discuss the experimental results.
3.1 Dataset
The dataset on CrowdAI consists of 54,309 images for training the neural network. It covers 14 crop species and comprises 17 fungal diseases, 4 bacterial diseases, 2 mold diseases, 2 viral diseases, 1 disease caused by a mite, and 12 crop species that are visibly healthy. This gives 17 + 4 + 2 + 2 + 1 + 12 = 38 classes of images. The 14 crop species are: Apple, Blueberry, Cherry, Corn, Grape, Orange, Peach, Bell Pepper, Potato, Raspberry, Soybean, Squash, Strawberry, and Tomato. Fig. 5 shows 38 images, one for each class [3].
Fig. 5 The 38 disease classes of leaves
3.2 Data Pre-processing
Since we are fine-tuning AlexNet, the input images must have exactly the same size as the images originally used to train it. AlexNet was trained on images of 256 × 256 pixels with a central crop of 227 × 227 pixels, so all the images in the PlantVillage dataset have to be resized. Instead of reading the images straight from disk, we store them in LMDB, a high-performance embedded transactional database. While Caffe does support reading images directly from disk, using LMDB as the data store gives significant performance gains. Finally, we compute the mean of all the images, which is used in both the training and testing processes. After updating the LMDB store references, adjusting the parameters in the configuration files, and changing the hyperparameters in the solver configuration file, we train the model.
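The mean computation and central crop can be sketched in NumPy. This is only an illustration of the arithmetic, not the Caffe tooling used in the chapter: the random array stands in for images that have already been resized to 256 × 256, and the helper name `center_crop` is ours.

```python
import numpy as np

def center_crop(image, size):
    """Return the central size x size region of an H x W x C image."""
    h, w = image.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return image[top:top + size, left:left + size]

# Stand-in for a batch of RGB images already resized to 256 x 256
# (random pixel values, for illustration only).
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(8, 256, 256, 3)).astype(np.float32)

# Mean image over the set, followed by mean subtraction and a 227 x 227 central crop.
mean_image = images.mean(axis=0)
prepared = np.stack([center_crop(img - mean_image, 227) for img in images])
print(mean_image.shape, prepared.shape)
```

The mean image must be computed on the training images only and then reused unchanged at test time, so that both sets are normalized in the same way.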
3.3 Architecture
In 2012, Alex Krizhevsky, Ilya Sutskever and Geoff Hinton submitted a convolutional network called AlexNet to the ImageNet ILSVRC challenge. The ILSVRC challenge, also known as the ImageNet challenge, is conducted every year; participants have to build a model that can classify millions of images into 1000 object classes. They won the challenge that year, and since then it has always been a variation of a CNN that has won (Fig. 6).
Fig. 6 Architecture of AlexNet
The input layer in AlexNet is formed by the raw pixel values of the image, and the final layer gives a probability distribution across all the classes. The intermediate layers use a "processed version" of the output of the previous layer as their input, and over the training period they learn to respond to more and more complex features, depending on how deep they are in the overall architecture. Neural networks such as AlexNet are computationally very expensive to train; training on the ImageNet dataset can take several weeks. Fortunately, the features learnt by the earlier layers are very generic in nature and can therefore be reused on a new dataset with totally different classes. This approach is known as transfer learning or fine-tuning. In transfer learning, we take a pre-trained model, reuse its learnt weights, modify the final fully connected layers, and then train on the new dataset. This typically gives better results than training the network from scratch. Our PlantVillage dataset has 38 classes instead of the 1000 classes of ImageNet, so we have to change the num_output parameter of the final fully connected layer in the Caffe training configuration file [3, 8, 15, 16].
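The idea of keeping pre-trained layers fixed and retraining only a new 38-way output layer can be sketched without any deep learning library. Everything in the following block is a stand-in chosen for illustration: the frozen "pre-trained" layers are a fixed random projection followed by ReLU rather than real AlexNet weights, the data are synthetic, and the new layer is trained with plain gradient descent on the softmax cross-entropy loss.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, input_dim, feature_dim, n = 38, 64, 32, 760

# Frozen "pre-trained" layers: a fixed random projection followed by ReLU.
# This is only a stand-in for the earlier AlexNet layers, not the real network.
W_frozen = rng.normal(scale=1.0 / np.sqrt(input_dim), size=(input_dim, feature_dim))

def extract_features(x):
    return np.maximum(0.0, x @ W_frozen)

# Synthetic 38-class data: each example is a noisy copy of its class prototype.
prototypes = rng.normal(size=(num_classes, input_dim))
labels = rng.integers(num_classes, size=n)
inputs = prototypes[labels] + 0.3 * rng.normal(size=(n, input_dim))
features = extract_features(inputs)          # computed once, the frozen part never changes

def loss_and_gradients(W, b):
    """Softmax cross-entropy of the new output layer and its gradients."""
    logits = features @ W + b
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(probs[np.arange(n), labels]))
    probs[np.arange(n), labels] -= 1.0               # gradient with respect to the logits
    return loss, features.T @ probs / n, probs.mean(axis=0)

# New final layer with 38 outputs, trained from scratch while W_frozen stays fixed.
W_new, b_new = np.zeros((feature_dim, num_classes)), np.zeros(num_classes)
losses = []
for _ in range(200):
    loss, grad_W, grad_b = loss_and_gradients(W_new, b_new)
    losses.append(loss)
    W_new -= 0.1 * grad_W
    b_new -= 0.1 * grad_b
print(round(losses[0], 3), round(losses[-1], 3))
```

Only `W_new` and `b_new` are updated, which mirrors replacing the 1000-way output layer with a 38-way one. In full fine-tuning the earlier layers would also be updated, usually with a smaller learning rate.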
3.4 Results
If the data is pre-processed and the configuration files are correct, training the model is straightforward. During training we keep a log file in order to follow the training process; the log file can also be used to generate graphs. Training the model for 2000 iterations took roughly 2 h (Fig. 7). We track three performance measures: training loss, test loss and test accuracy. The training and test losses decreased significantly, from nearly 1 to 0.1, and the accuracy on the test dataset was around 91.3%. The two most important factors to consider in transfer learning are the size of the new dataset and its similarity to the original dataset. If the new dataset is small and similar to the original dataset, there is a high chance that the fine-tuned model will overfit. If the new dataset is large, fine-tuning is likely to work, given that both datasets are similar [17–27].
Fig. 7 Training curve for accuracy and loss with 2000 iterations
4 Conclusion
In conclusion, deep learning in the form of image classification can provide a budget-friendly and efficient solution to the problem of plant diseases affecting farmers and cultivators. Otherwise, farmers would need well-equipped labs to determine the disease. AlexNet obtains 98 to 99% accuracy on the training set and 91.3% accuracy on the test set. In the future, we would like to employ different deep learning models and apply different types of augmentation.
References
1. Agarwal, K. (2021). Indian agriculture’s enduring question: Just how many farmers does the country have? The Wire. Retrieved, 22.
2. BBC. (2023, January 23). India media guide. BBC News. https://www.bbc.com/news/world-south-asia-12557390
3. Hughes, D., & Salathé, M. (2015). An open access repository of images on plant health to enable the development of mobile disease diagnostics. arXiv preprint arXiv:1511.08060.
4. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Book in preparation for MIT Press. http://www.deeplearningbook.org
5. Jabbar, H., & Khan, R. Z. (2015). Methods to avoid over-fitting and under-fitting in supervised machine learning (comparative study). Computer Science, Communication and Instrumentation Devices, 70, 163–172.
6. Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 400–407.
7. Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2, 1–127. Also published as a book. Now Publishers.
8. Hecht-Nielsen, R. (1992). Theory of the backpropagation neural network. In Neural networks for perception (pp. 65–93). Academic Press.
9. Roy, S. S., Awad, A. I., Amare, L. A., Erkihun, M. T., & Anas, M. (2022). Multimodel phishing URL detection using LSTM, bidirectional LSTM, and GRU models. Future Internet, 14(11), 340.
10. O’Shea, K., & Nash, R. (2015). An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458.
11. Al-Rfou, R., Alain, G., Almahairi, A., Angermueller, C., Bahdanau, D., Ballas, N., Bastien, F., Bayer, J., Belikov, A., Belopolsky, A., Bengio, Y., Bergeron, A., Bergstra, J., Bisson, V., Snyder, J. B., Bouchard, N., Boulanger-Lewandowski, N., Bouthillier, X., de Brébisson, A., … Zhang, Y. (2016). Theano: A python framework for fast computation of mathematical expressions. arXiv e-prints, arXiv-1605.
12. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., ... Chintala, S. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32.
13. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., … Guadarrama, S., & Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (pp. 675–678).
14. Gibson, A., Nicholson, C., Patterson, J., Warrick, M., Black, A. D., Kokorin, V., ... & Eraly, S. (2016). Deeplearning4j: Distributed, open-source deep learning for Java and Scala on Hadoop and Spark. Towards Data Science.
15. Fei Fei, L., Karpathy, A., Johnson, J. CS231N–Stanford University
16. Krizhevsky, A., Sutskever, I., Hinton, G. E. ImageNet classification with deep convolutional neural networks. University of Toronto.
17. Roy, S. S., Goti, V., Sood, A., Roy, H., Gavrila, T., Floroian, D., Paraschiv, N., Mohammadi-Ivatloo, B. (2014). L2 regularized deep convolutional neural networks for fire detection. Journal of Intelligent & Fuzzy Systems, (Preprint), 1–12.
18. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022). Deep convolutional neural network for environmental sound classification via dilation. Journal of Intelligent & Fuzzy Systems, (Preprint), 1–7.
19. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN for brain tumor classification. Applied Sciences, 10(14), 4915.
20. Biswas, R., Vasan, A., & Roy, S. S. (2020). Dilated deep neural network for segmentation of retinal blood vessels in fundus images. Iranian Journal of Science and Technology, Transactions of Electrical Engineering, 44(1), 505–518.
21. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022). Deep convolutional neural network for environmental sound classification via dilation. Journal of Intelligent & Fuzzy Systems, (Preprint), 1–7.
22. Deep learning research should be encouraged for diagnosis and treatment of antibiotic resistance of microbial infections in treatment associated emergencies in hospitals.
23. Lee, K. C., Roy, S. S., Samui, P., & Kumar, V. (Eds.). (2020). Data analytics in biomedical engineering and healthcare. Academic Press.
24. Samui, P., Roy, S. S., & Balas, V. E. (Eds.). (2017). Handbook of neural computation. Academic Press.
25. Forecasting stock price by hybrid model of cascading multivariate adaptive regression splines and deep neural network
26. Roy, S. S., & Taguchi, Y. H. (2021). Identification of genes associated with altered gene expression and m6A profiles during hypoxia using tensor decomposition based unsupervised feature extraction. Scientific Reports, 11(1), 1–18.
27. Abu-Mostafa, Y. S., Magdon-Ismail, M., & Lin, H. T. Learning from data. https://amlbook.com/
Hyperspectral Images: A Succinct Analytical Deep Learning Study L. Sandeep Kumar, G. K. Panda, and B. K. Tripathy
L. S. Kumar
Biju Patnaik University of Technology, Rourkela, Odisha, India
G. K. Panda
MITS School of Biotechnology, Utkal University, Bhubaneswar, Odisha 765017, India
B. K. Tripathy (B)
School of Information Technology and Engineering, VIT, Vellore, Tamil Nadu 632014, India
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis, Studies in Big Data 129, https://doi.org/10.1007/978-981-99-3784-4_8
1 Introduction
Since the advent of imaging spectroscopy in the 1980s, hyperspectral images (HIs) have been acquired for their ability to resolve fine spectral detail, which supports classification in a diverse range of applications. These include remote-sensing-based environmental, atmospheric and ocean observation [66], meteorological applications, military uses [37], geological exploration and mining [53], crop, vegetation and food analysis, and biomedical fields [56]. In addition to having high spectral and spatial resolution, HIs have many bands and abundant information because they cover ultraviolet, visible, near-infrared, and mid-infrared wavelengths. This opens avenues of research in HI-based image correction [77], noise reduction [40], transformation [48], dimensionality reduction, and classification [8]. Machine learning (ML) based methods for processing HIs need a large number of valid labelled samples for training. Early research in this area focused on classification methods based on spectral information, such as support vector machines [72], random forests, neural networks [20, 67], and polynomial logistic regression [45].
An HI represents the image as a “hypercube” (x, y, λ), in which the first two dimensions indicate the spatial coordinates and the third indexes the spectral bands. As a result, each pixel represents a pattern with as many attributes as there are bands. Because HIs have a large number of bands, the volume of data to be processed is high, which motivates reducing the dimensionality and minimizing the computational complexity in many real-life HI applications. To this end, numerous dimension reduction methods based on feature extraction and feature selection have been proposed. Prominent methods include principal component analysis (PCA) [32], independent component analysis (ICA) [33, 71], and linear discriminant analysis (LDA) [17].
Deep learning has shown excellent capabilities in image processing, particularly in recent years, as image classification, target detection, and other fields have adopted it. Several deep learning network models are available to improve the performance of HI processing, such as the convolutional neural network (CNN), the deep belief network (DBN) [24], and the recurrent neural network (RNN). In addition, to address poor classification results caused by a lack of training samples, a tensor-based classification model [51, 52] was proposed; experiments revealed that when the number of training samples is small, this method outperforms support vector machines and deep learning.
In the first part of our discussion, one of our primary goals is to enhance classification accuracy. We use a hyperspectral image of the Sundarban mangrove area acquired by the Sentinel-2 satellite (Fig. 1). The input images, comprising 12 bands, are processed with spatial analysis using a DL-based 3D CNN. Principal component analysis (PCA) is applied to derive 3D patches of the Sundarban satellite image. The process achieves 96% classification accuracy. The remainder of the chapter is organized as follows. In Sect. 2 we review related research. In Sect. 3 we highlight some concepts of hyperspectral images. Section 4 gives an overview of deep learning and CNNs. Section 5 presents an empirical study of 3D-CNN classification on HIs of the Sundarban mangrove region. In Sect. 6, we present a hybrid-MSSN model, validate it experimentally on three HI datasets, and discuss the outcomes.
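The hypercube layout and PCA-based band reduction described in this introduction can be sketched as follows. This is a minimal illustration under our own assumptions: the cube is random synthetic data with 12 bands, PCA is computed through a singular value decomposition, and the number of retained components `k` is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic hypercube (rows, cols, bands); 12 bands chosen to match the text above.
rows, cols, bands = 20, 30, 12
cube = rng.normal(size=(rows, cols, bands))

# Each pixel becomes one sample with `bands` attributes.
pixels = cube.reshape(-1, bands)
centered = pixels - pixels.mean(axis=0)

# PCA via the singular value decomposition of the centered data matrix.
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)
k = 3                                        # number of components kept (illustrative)
reduced = centered @ components[:k].T        # shape (rows * cols, k)
reduced_cube = reduced.reshape(rows, cols, k)

# Fraction of the total variance explained by each principal component.
explained = singular_values ** 2 / np.sum(singular_values ** 2)
print(reduced_cube.shape, np.round(explained[:k], 3))
```

The reduced cube keeps the spatial layout, so spatial patches for a CNN can be cut from it exactly as they would be from the original image.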
Fig. 1 Bhitarkanika mangrove (Source Google Maps): a Binary image b Grayscale image c RGB image
Hyperspectral Images: A Succinct Analytical Deep Learning Study
2 Related Researches In HI classification, the spectral dimension (Fig. 2) helps identify the significant variations in reflectance between image pixels, which change with wavelength [38]. One study [31] observed that classification accuracy drops dramatically once the number of spectral bands grows beyond a certain value. Since a majority of the spectral bands are redundant, taking all bands into consideration degrades the model's performance. Dimension reduction techniques [28, 57] are used in this regard to identify such unnecessary bands without compromising the image's information content. The modified broken stick rule for HI [3] is a notable contribution to dimension reduction. In the majority of cases, the reduced band features make object identification difficult and call for discriminative spatial features. According to a study [19], neighbouring pixels in HIs generally belong to the same class; hence, using the spatial features of an HI along with its spectral features is an intuitive motivation for effective classification. Feature extraction methodologies such as the gray level co-occurrence matrix [44, 54], stationary wavelet transform (SWT) [43, 73], discrete wavelet transforms [10, 22], and morphological profiles [4, 55] have been used in many real-world applications. Neural network-based techniques have been implemented to tackle many complex problems of remote sensing [67]. DL techniques have become extremely popular in recent years, with several real-life applications such as the study of gene characteristics [25], text-based image retrieval [60], audio signal classification [9], image processing [2], health care analysis [36], measuring confidence in interviewees [61], face mask detection [64], classification of skin cancer [63] and computer vision [70]. In general
Fig. 2 Image dimension: a Hyperspectral image b RGB image
it has influenced research in AI in a major way. Some such studies are presented in [1, 69]. The application of deep learning (DL) in ANN has led to the development of Deep Neural Networks (DNN) [6]. DL algorithms used in HI classification include stacked autoencoders (SAEs) [5] and deep belief networks (DBNs) [30]. Convolutional neural networks (CNNs) [21, 49] are also used in HI classification [11]. CNNs have wide applications, such as MRI segmentation [68], diabetic retinopathy [58], the study of Big Data [7], classification of pests [16], COVID-19 detection [62] and weather classification [23]. Innovative approaches such as 2D-CNN [50], 3D-CNN [26, 46], spectral-spatial LSTMs [80], SSUN [76] and SSRN [79] have also been employed in HI classification. The literature shows that 2D-CNN alone is not able to generate discriminative features for classification [59], whereas 3D-CNN is found to be suitable for volumetric samples. However, 3D-CNN falls short in generating discriminative features for classes that have textural similarity across several spectral bands. To address these shortcomings, the HybridSN model [59] was proposed, which comprises 2-D and 3-D convolution layers to generate discriminative spectral and spatial features. The MCNN-CP model [78] also achieves promising accuracy with solutions based on 3-D and 2-D convolution layers. DL has achieved noteworthy performance in the domains of visual information processing and AI. Some special DNNs, such as gated recurrent unit networks, are used for detecting toxicity [39], and wide ResNet is used for age and gender estimation [14]. This approach pioneered the automatic extraction of hierarchical deep features in a practicable way for an HI. Such models consider an image to be organized into hierarchical components like pixels, edges, parts and objects. In contrast to shallow handcrafted features, end-to-end deep features are capable of representing more abstract and complex shapes in the image. They perform well even where there are rapid regional changes in an image. Conventional image classification presumes that the data follow a uniform distribution across classes and is therefore biased towards samples of the majority classes, which leads to an imbalance problem in the case of HIs. Hence, special measures are needed to tackle such imbalanced characteristics in HI classification [65]. The studies in [29, 47, 74] apply data augmentation, pixel-pairing and automatic allocation of unlabeled samples, respectively, and demonstrate their efficacy in HI classification. The studies in [27, 41–43] build on recent concepts: SWT and CNN, decomposition and deep residual nets, 3D-2D and depth-wise separable 1D-CNNs, CNN with Grey Wolf optimization, and 1D-EWT and 3D-CNN. The following intuitive outcomes from the literature motivate the models proposed in the subsequent sections:
1. Assemble a DL model that addresses hierarchical feature extraction.
2. Perform and learn with limited training data.
3. Demonstrate minimum information loss due to dimension reduction, convolution and max pooling operations.
4. Address the issues of vanishing gradients, computational time, class imbalance and tolerance to noise.
3 Hyperspectral Images Starting from the fundamental concepts of a digital image, we can distinguish binary, grayscale, colour and Hyperspectral images. Binary images consist of 0s and 1s, representing black and white respectively, arranged in a 2-D matrix (r rows × c columns). Grayscale digital images take values from 0 to 255 to represent black to white, with intermediate levels of gray. Following the way human cone cells render colours, combinations of red, green and blue (RGB) intensities are digitized into (r rows × c columns) × 3 channels. RGB coloration is based on the light reflected from objects at separate wavelengths (long, medium and short for red, green and blue respectively) in the visible spectrum of electromagnetic radiation, the part perceived by human eyes. Beyond the visible spectrum, many wavelengths carry valuable information that the human eye cannot perceive. Formally, a spectral image is similar to an RGB colour image but with many channels describing spatial and spectral information. A multi-spectral image consists of n band images, where each band records the light intensity at its wavelength (the bands are not necessarily spread over a contiguous wavelength range). A λ-band Hyperspectral image consists of λ grayscale images, each recording the light intensity at its wavelength, stacked on top of each other over a contiguous wavelength range (r rows × c columns × λ bands).
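The array layouts described above can be made concrete with a short NumPy sketch. The sizes and variable names below are illustrative assumptions only and are not taken from any dataset used in this chapter.

```python
import numpy as np

# Illustrative sizes (assumptions, not from the chapter's datasets).
rows, cols, bands = 4, 5, 12

binary = np.zeros((rows, cols), dtype=np.uint8)        # values in {0, 1}
gray = np.zeros((rows, cols), dtype=np.uint8)          # values in 0..255
rgb = np.zeros((rows, cols, 3), dtype=np.uint8)        # three colour channels
hsi = np.zeros((rows, cols, bands), dtype=np.float32)  # one grayscale image per band

# Two complementary views of the hyperspectral cube:
spectrum = hsi[2, 3, :]     # spectral signature of the pixel at row 2, column 3
band_image = hsi[:, :, 0]   # spatial image recorded in the first band

print(spectrum.shape, band_image.shape)
```

Slicing along the last axis gives the per-pixel spectrum used by spectral classifiers, while slicing a single band gives an ordinary grayscale image.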
4 Deep Learning and CNN The idea behind deep learning (DL) is to train machines to model complex functions so that they learn from experience and classify and recognize data or images much as a human brain does. As a type of ANN, CNN is used for image and object recognition (processing images, analyzing videos, and detecting obstacles in autonomous vehicles). There have been remarkable developments in ANN-based methods for DL-based classification and object/image recognition. Three core DL layer types (dense, convolution and output) offer learning-based HI solutions through supervised, semi-supervised or unsupervised models. HI-based DL models are being developed for many classification and object identification purposes using these three designs. Which design suits an application depends on the availability of labeled HI data. Specifically, if the HI model maps labeled datasets to the ground truth, a supervised model is used. To unveil properties of HI data from unlabeled datasets, the unsupervised design is used, and when only a small portion of the HI data is labeled, the semi-supervised design is preferred. Further, convolutional neural networks (CNNs), in contrast to deep feed-forward neural networks (DNNs) and autoencoders (AEs),
play a vital role in many HI-based intensive applications. In a high-dimensional recognition or prediction system, the convolution layers of a CNN are specifically oriented to learning local patterns from images or sequences of images. CNN models (feed-forward, in one direction) for HI classification generally follow three simple operational steps. First, the input layer receives the input image and converts it into an array of pixels. The data then pass through multiple hidden layers, where feature extraction is handled by convolution followed, as needed, by pooling and rectified linear units. Finally, object classification is handled by the fully connected layer, and the output layer assigns the label. The most general form of a CNN groups convolutional and pooling layers into modules; however, other groupings are possible. From the perspective of HI-based research, the ten most popular deep learning algorithms are Convolutional Neural Networks (CNNs), Long Short-Term Memory Networks (LSTMs), Recurrent Neural Networks (RNNs), Generative Adversarial Networks (GANs), Radial Basis Function Networks (RBFNs), Multilayer Perceptrons (MLPs), Self-Organizing Maps (SOMs), Deep Belief Networks (DBNs), Restricted Boltzmann Machines (RBMs) and Autoencoders.
4.1 HI Based Deep Feature Selection In HIs with high spectral resolution, the information in each pixel is generally represented as a one-dimensional spectral vector. A 1D-CNN model helps identify specific features of the HI (from the pool of spectral information) through such pixels for subsequent classification. Put simply, a 1D-CNN takes labeled HI data as input, processes it together with the class labels during training, updates the network weights iteratively using a stochastic gradient descent algorithm, and outputs a class for each pixel. Convolution operations on 1-D feature vectors are performed using the 1-D convolution kernel defined in Eq. 1.

$$v_{l,j}^{x} = f\left(\sum_{m}\sum_{h=0}^{H_l-1} k_{l,j,m}^{h}\, v_{(l-1),m}^{(x+h)} + \mathrm{bias}_{l,j}\right) \tag{1}$$
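Eq. 1 can be read as a sliding dot product between a kernel and the feature maps of the previous layer. The following is a minimal NumPy sketch of one output feature map under that reading, assuming stride 1 and no padding; the function name, argument layout and the default tanh activation are our assumptions for illustration, not part of the original formulation.

```python
import numpy as np

def conv1d_feature_map(prev_maps, kernels, bias, activation=np.tanh):
    """One output feature map of a 1-D convolution layer (stride 1, no padding).

    prev_maps : array (M, N), the M feature maps of layer l-1, each of length N
    kernels   : array (M, H), one length-H kernel per input feature map
    bias      : scalar bias of the output feature map
    """
    n_maps, length = prev_maps.shape
    _, kernel_len = kernels.shape
    out = np.empty(length - kernel_len + 1)
    for x in range(out.shape[0]):
        # Sum over input maps m and kernel offsets h of k[m, h] * v[m, x + h].
        out[x] = np.sum(kernels * prev_maps[:, x:x + kernel_len]) + bias
    return activation(out)

# Usage: a single spectral vector filtered with a difference kernel.
spectrum = np.arange(6, dtype=float).reshape(1, 6)
kernel = np.array([[1.0, 0.0, -1.0]])
print(conv1d_feature_map(spectrum, kernel, bias=0.0, activation=lambda a: a))
```

The explicit loop mirrors the summation indices of the equation; practical frameworks compute the same quantity with vectorized kernels.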
The 2-D CNN (Eq. 2) uses a 2-D convolution kernel to perform a convolution operation on a 2-D matrix.

$$\mathrm{map}_{l,j}^{x,y} = f\left(\sum_{m}\sum_{h=0}^{H_l-1}\sum_{w=0}^{W_l-1} k_{l,j,m}^{h,w}\, \mathrm{map}_{(l-1),m}^{(x+h)(y+w)} + \mathrm{bias}_{l,j}\right) \tag{2}$$
To perform a convolution operation on 3-D data, 3-D CNNs use 3-D convolution kernels ($R_i$ refers to the size of each kernel). As the main objective is to extract the low-level features contained in the HIs, we apply a 3-D filter to the input image, which generates a cube or cuboid in the 3-D volume space. In 3-D convolution, the same 3-D kernel is applied to overlapping 3-D cubes in the input to extract the features. Max pooling, Dropout, Batch Normalization and Flatten operations are generally used to route the multi-scale feature maps generated by each 3-D convolution layer.

$$\mathrm{map}_{i,j}^{x,y,z} = \tanh\left(\sum_{m}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1}\sum_{r=0}^{R_i-1} w_{i,j,m}^{p,q,r}\, \mathrm{map}_{(i-1),m}^{(x+p)(y+q)(z+r)} + \mathrm{bias}_{i,j}\right) \tag{3}$$
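The 3-D case in Eq. 3 applies one kernel to every overlapping cube of the input volume. A minimal NumPy sketch is given below, assuming stride 1 and no padding; the function name and array layout are hypothetical choices made for this illustration.

```python
import numpy as np

def conv3d_feature_map(prev_maps, kernels, bias):
    """One output feature map of a 3-D convolution layer with tanh activation.

    prev_maps : array (M, X, Y, Z), the M feature volumes of the previous layer
    kernels   : array (M, P, Q, R), one 3-D kernel per input volume
    bias      : scalar bias of the output feature map
    """
    _, size_x, size_y, size_z = prev_maps.shape
    _, p, q, r = kernels.shape
    out = np.empty((size_x - p + 1, size_y - q + 1, size_z - r + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for z in range(out.shape[2]):
                # The same kernel is applied to each overlapping 3-D cube.
                cube = prev_maps[:, x:x + p, y:y + q, z:z + r]
                out[x, y, z] = np.sum(kernels * cube) + bias
    return np.tanh(out)

# Usage: a 5 x 5 x 4 volume filtered with a 3 x 3 x 3 kernel gives a 3 x 3 x 2 map.
volume = np.ones((1, 5, 5, 4))
kernel = np.full((1, 3, 3, 3), 0.01)
print(conv3d_feature_map(volume, kernel, bias=0.0).shape)
```

Because the kernel also slides along the spectral axis, the output keeps a (shorter) spectral dimension, which is what lets a 3-D CNN learn joint spectral-spatial features.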
Alongside the considerable advantages of Deep Neural Networks (DNNs), some observed limitations include (a) difficulty in accommodating a large number of input features when the first hidden layer is small, (b) a steep increase in the number of weights when a large number of input features is fed to a large first hidden layer, and (c) the vanishing gradient problem when the number of layers is large: the gradient is high at neurons near the output and comparatively low near the inputs. The Spinal Fully Connected Network (SFCN), or SpinalNet [34], is modelled on the human somatosensory system and offers solutions to these issues observed in conventional DNNs. SFCN is based on gradual input, local output with probable global influence, and reconfiguration of weights during training. The architecture of the model [34] is shown in Fig. 3.
Fig. 3 SpinalNet (Source [34])
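The gradual-input idea can be sketched as a forward pass in which each hidden layer receives one half of the input together with the output of the previous hidden layer, and the output layer sees all hidden outputs. The code below is a simplified illustration under our own assumptions (no biases, ReLU activations, random weights, alternating halves); the helper names are hypothetical and it is not the reference implementation of [34].

```python
import numpy as np

def init_spinal_weights(in_dim, layer_width, n_layers, n_classes, rng):
    """Random weights for a spinal fully connected block (illustrative only)."""
    half = in_dim // 2
    half_sizes = (half, in_dim - half)
    hidden = []
    for i in range(n_layers):
        # Each layer sees one half of the input plus the previous layer's output.
        fan_in = half_sizes[i % 2] + (layer_width if i > 0 else 0)
        hidden.append(rng.normal(0.0, 0.1, size=(layer_width, fan_in)))
    out = rng.normal(0.0, 0.1, size=(n_classes, n_layers * layer_width))
    return hidden, out

def spinal_fc_forward(x, hidden, out):
    """Forward pass: gradual input, local outputs, global combination."""
    half = x.shape[0] // 2
    halves = (x[:half], x[half:])
    prev = np.zeros(0)
    collected = []
    for i, w in enumerate(hidden):
        layer_in = np.concatenate([halves[i % 2], prev])
        prev = np.maximum(0.0, w @ layer_in)  # ReLU
        collected.append(prev)
    # The output layer combines the outputs of all hidden layers.
    return out @ np.concatenate(collected)

rng = np.random.default_rng(0)
hidden, out = init_spinal_weights(in_dim=16, layer_width=4, n_layers=4,
                                  n_classes=6, rng=rng)
logits = spinal_fc_forward(rng.normal(size=16), hidden, out)
print(logits.shape)
```

Since every hidden layer takes only half of the input, the first layer stays small even for wide inputs, and the short paths from each hidden layer to the output are what help with vanishing gradients.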
4.2 HI Based Optimization In HI-based classification, optimizer algorithms are used during the learning process of the neural network. The main purpose of these algorithms is to minimize the difference between the expected and actual values by adjusting the weights so as to make the most accurate predictions. Gradient descent is one of the prominent methods adopted by many image classification applications in deep learning to obtain an optimized neural network. Gradient descent may be classified into three basic variants according to the amount of data used: batch gradient descent, stochastic gradient descent (SGD), and mini-batch gradient descent. In addition to SGD, adaptive moment estimation (Adam), AdaDelta, root mean square propagation (RMSProp), Nesterov, AdaMax and Nadam [65, 79] are also used in many applications. Adam is a replacement optimization algorithm for SGD for training deep learning models [75], which combines the capabilities of both RMSProp and AdaGrad [35]. This optimizer needs little memory and tuning, can handle sparse gradients on very noisy problems, and converges quickly with low mean absolute error; it is among the most preferred optimizers, including for Hyperspectral image analysis. The mean and variance (first and second moments) of the gradient are estimated as

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,\varepsilon_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,\varepsilon_t^{2} \tag{4}$$

and the weights are updated as

$$w_{t+1} = w_t + \Delta w_t, \quad \text{where} \quad \Delta w_t = -\eta\,\frac{m_t}{\sqrt{v_t} + \epsilon} \tag{5}$$

The relative contribution of past history with respect to the present gradient is controlled through the decay rates (the hyper-parameters β1 and β2). Here w_t is the weight at step t, η is the learning rate, ε_t is the gradient at time t, m_t is the exponential average of the gradient, v_t is the exponential average of the square of the gradient, and ϵ is a small constant that avoids division by zero.
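As a minimal sketch of how such an update is applied in practice, the following NumPy function performs one Adam step. It follows the commonly used formulation with bias-corrected moment estimates; the default values for beta1, beta2 and eps are the widely used defaults and are assumptions here, as is the toy quadratic objective in the usage lines.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of the squared gradient
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Usage: minimise the toy objective f(w) = (w - 3)^2, whose gradient is 2 (w - 3).
w = np.array([0.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 5001):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)
print(w)
```

Because the step is the ratio of the two moment estimates, its magnitude is roughly bounded by the learning rate regardless of the raw gradient scale, which is one reason Adam needs little tuning.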
5 3D-Convolutional Neural Network Based HI Classification on Sentinel-2 Satellite Data of Sundarban Mangrove Regions The Sundarban is one of the largest mangrove areas in the world, stretching from India to Bangladesh across a delta formed by the rivers Brahmaputra, Meghna and Padma in the Bay of Bengal (Fig. 5). It is home to a wide range of wildlife, including endangered species, and supports rich biodiversity across its roughly 106 islands.
Fig. 4 HI/classification map: a Sundarban b Indian pines [13] c PaviaU [13]
5.1 Dataset Description We use Hyperspectral images (HIs) of the Sundarban mangrove region. The 12-band images are collected from Sentinel-2 satellite imagery (COAH [12]). Remote-sensing satellite images contain more than three bands and therefore carry a more diverse set of information about a geographical location than general images (three bands: red, green and blue). With more data in the form of bands, we can understand and analyze the scene more effectively. The image in Fig. 4a represents a satellite image cube that contains R rows, C columns, and B bands. As stated above, the input Sentinel-2 HIs for the experiment comprise 12 bands (coastal aerosol, Near Infra-Red (NIR), Short Wave Infra-Red (SWIR), and RGB), with wavelengths ranging from 0.443 to 2.190 micrometres and a spatial resolution of 10–60 m. Using the COAH tool, HIs with less than one percent cloud cover, filtered with the cloud cover map, were selected for input image analysis (Fig. 5).
5.2 Experimental Setup and Hyper-Parameters The experiment on the Sundarban satellite HI data was run on the Google Colab Pro™ cloud platform with a graphical processing unit (GPU). Using Python libraries and methods such as rasterio, loadmat and EarthPy, the input HIs were stacked and compared against six major classes: Barren land (BL): land devoid of vegetation, or sand dunes; River (RV): river bodies; Dense Mangrove (DM): mangrove forest with dense canopy cover; Open Mangrove (OM): mangrove forest with open canopy and mudflats with very little mangrove cover; Agriculture (AG): active agricultural practice; and Human habitat (HUM): human habitation, often under the canopy shade of non-mangrove plants. Principal Component Analysis (PCA) and the TensorFlow-based Keras package for Python are used to extract 3D patches (containing the true classes) and to split the dimension-reduced input in a 0.7 : 0.3 ratio for encoding. Next, the patches are passed to a 3D-CNN built from Convolution, Dropout and Dense layers with 1,204,098 trainable parameters. The model is trained with the six optimizers discussed in Sect. 4.2 and the best one (here, Adam) is selected. The TensorBoard, EarlyStopping and ModelCheckpoint callbacks were used, respectively, to keep track of learning logs during every batch, to monitor a learning metric and stop before overfitting, and to save checkpoints at epoch level. The classification accuracy for the input HI is shown below. To handle the unbalanced classes and to minimize the loss in training and validation on the HI patches, we use the categorical cross-entropy (CCE) loss (Fig. 6). The CCE is defined as

$$\mathrm{CCE} = -\sum_{i}^{C} t_i \log f(S_i), \qquad f(S_i) = \frac{e^{S_i}}{\sum_{j}^{C} e^{S_j}}$$

where C is the set of classes, t_i is the ground truth and S_i is the corresponding CNN score for class i, with f the softmax activation function. The data were augmented using random horizontal and vertical flips. After tuning, EarlyStopping was set with monitor = 'val_loss' and restore_best_weights = True, the batch size was set to (1024 × 6), and the optimizer used was Adam with CCE.
Fig. 5 Satellite data: Sundarban mangrove a Composite image b Ground truth image c 12-band HI visualization
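The softmax and CCE definitions above can be sketched in a few lines of NumPy. This is an illustrative, framework-independent version; the function names and the small constant added inside the logarithm are our assumptions, and a Keras model would normally use the built-in loss instead.

```python
import numpy as np

def softmax(scores):
    """Softmax over the last axis, shifted by the maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(targets, scores, eps=1e-12):
    """CCE per sample. targets are one-hot rows, scores are raw class scores."""
    probs = softmax(scores)
    return -np.sum(targets * np.log(probs + eps), axis=-1)

# Usage with six classes: uniform scores give a loss of log(6) for any true class.
scores = np.zeros((1, 6))
targets = np.eye(6)[[2]]
print(categorical_cross_entropy(targets, scores))
```

With one-hot targets only the log-probability of the true class contributes, so the loss grows as the network assigns less probability to the correct class.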
Fig. 6 Training, testing: Sundarban mangrove a Accuracy b Loss
6 A Novel Deep Learning Hybrid-MSSN Architecture for Hyperspectral Image Classification Researchers often combine two or more types of architecture instead of relying on a single approach (hybrid models), which can give better results on complex problems. In other words, hybrid models are a class of methods that integrate the advantages of different models in the same system. The following sections describe the methodology for deep classification of Hyperspectral images (HIs) from three HI datasets.
6.1 Architecture of Hybrid-MSSN Model The architecture of our HI-based deep classification model is presented in Fig. 7. The model uses multi-scale CNNs and a spinal fully connected network (SFCN). For HI-based spectral and spatial feature extraction we use 3D-CNNs, and for spatial feature learning we use a 2D-CNN. First, the model is initialized with the high-dimensional, satellite-band-based HI, which carries the high-spectral features. We use principal component analysis (PCA) to filter out unnecessary bands, de-correlate the data and reduce the spectral dimension without compromising the HI's information content.
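The PCA step, together with the neighbourhood patch extraction mentioned in Sect. 5.2, can be sketched as follows. This is a NumPy-only illustration under our own assumptions: PCA is computed with an SVD on the mean-centred pixel spectra, image borders are zero-padded, and label 0 is treated as unlabeled. The function names are hypothetical and the experiments in this chapter rely on library implementations rather than this code.

```python
import numpy as np

def pca_reduce(cube, n_components):
    """Project each pixel spectrum of a (rows, cols, bands) cube onto its
    first n_components principal axes."""
    rows, cols, bands = cube.shape
    flat = cube.reshape(-1, bands).astype(np.float64)
    flat -= flat.mean(axis=0)                       # centre every band
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    reduced = flat @ vt[:n_components].T            # de-correlated components
    return reduced.reshape(rows, cols, n_components)

def extract_patches(cube, labels, window=13):
    """Cut a window x window x bands patch around every labeled pixel."""
    margin = window // 2
    padded = np.pad(cube, ((margin, margin), (margin, margin), (0, 0)),
                    mode="constant")
    patches, targets = [], []
    for r in range(cube.shape[0]):
        for c in range(cube.shape[1]):
            if labels[r, c] == 0:                   # assumption: 0 means unlabeled
                continue
            patches.append(padded[r:r + window, c:c + window, :])
            targets.append(labels[r, c])
    return np.stack(patches), np.array(targets)

# Usage on synthetic data: 12 bands reduced to 3, then 5 x 5 patches.
rng = np.random.default_rng(0)
cube = rng.normal(size=(20, 20, 12))
labels = rng.integers(0, 4, size=(20, 20))
reduced = pca_reduce(cube, n_components=3)
patches, targets = extract_patches(reduced, labels, window=5)
print(reduced.shape, patches.shape, targets.shape)
```

Each patch is centred on the pixel it is labeled with, so the classifier sees the spectral signature of that pixel together with its spatial neighbourhood.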
6.2 Dataset Description We use the following three popular HI datasets [13] to validate our Hybrid-MSSN model (Table 1).
Fig. 7 Proposed architecture of our model
Table 1 Description of experimental HI-datasets
| HI datasets | HI-captured source | HI-description | Ground truth description |
|---|---|---|---|
| Indian pines (IPD) | North-western Indiana (AVIRIS sensor) | Spatial dimension, 145 × 145 × 200 (rows, columns, filtered-bands) | Classes-16, patches 21,025; Land coverage (forest)-66%; Land coverage (farming)-33%; Crop coverage (corn, soybeans); Highways (dual lane)-1; Railway line-1; Houses/structures/roads |
| Salinas (SD) | Salinas valley, California (AVIRIS sensor) | Spatial dimension, 512 × 217 × 20 (rows, columns, filtered-bands) | Classes-16, patches; Land coverage (bare soils); Land coverage (vineyard fields); Land coverage (farming); Crop coverage (vegetables) |
| Pavia University (PUD) | Pavia, northern Italy (ROSIS sensor) | Spatial dimension, 610 × 610 × 103 (rows, columns, filtered-bands) | Classes-09, patches |
6.3 Experimental Setup and Result Analysis The detailed process of the model is outlined in Algorithm 1. The experimental setup is based on the Google Colaboratory Pro cloud platform with Python, Jupyter notebooks and GPUs. Keras, a deep learning library, was used to validate the model.
In deep CNN classification, as the network becomes deeper, the spatial dimensions of the feature maps shrink sharply, which results in a loss of information. In conventional designs, the FC layer is usually attached to the deepest convolutional (or pooling) layer, so the network depends heavily on global data, which leads to high computation time. To overcome these issues, we use both shallow and deep convolution features [76] to account for the complexity of HIs, where distinct objects are likely to have varying scales, and we use the Spinal Fully Connected Network (SpinalNet [34]) instead of the dense layer. In the experiment, we first use the PCA transformation to extract the r most informative spectral bands (r = 30 for IPD, r = 15 for PUD and r = 15 for SD) as per the Modified Broken Stick Rule (MBR) [3]. With iterative noise filtration we obtain HI-cubes with a reduced dimension of (13 × 13 × r), in line with the findings in [59]. The HI-cubes were further divided into two groups with distinct training and testing samples (Fig. 8). One group uses 10 percent of the samples for training and 90 percent for testing, and the other uses 30 and 70 percent, to compensate for the problem of class imbalance. Table 2 presents the classification outcomes on the three datasets with oversampling. The approach also achieves impressive results on all 3D patches of the three datasets without oversampling; for instance, the accuracies for the (13 × 13) patch size on the three datasets are given in Table 3. We use accuracy measures such as Overall
Fig. 8 Testing: IPD, SD, PUD a Overall accuracy b Average accuracy c Kappa score
Table 2 Classification performance: with oversampling
| HI data sets | Window size (3D-Patch) | Train:test ratio (10:90) | Train:test ratio (30:70) |
|---|---|---|---|
| (IPD) | 9 × 9 | 99.956 ± 0.01 | 99.958 ± 0.01 |
| | 11 × 11 | 99.967 ± 0.02 | 99.986 ± 0.01 |
| | 13 × 13 | 99.967 ± 0.02 | 100.000 ± 0.01 |
| (SD) | 9 × 9 | 99.934 ± 0.002 | 99.981 ± 0.002 |
| | 11 × 11 | 100.000 ± 0.001 | 100.000 ± 0.001 |
| | 13 × 13 | 100.000 ± 0.000 | 100.000 ± 0.000 |
| (PUD) | 9 × 9 | 99.989 ± 0.002 | 99.981 ± 0.002 |
| | 11 × 11 | 99.994 ± 0.002 | 100.000 ± 0.002 |
| | 13 × 13 | 100.000 ± 0.001 | 100.000 ± 0.001 |
Accuracy (OA), Average Accuracy (AA), Kappa value (KA) and class-wise accuracy to evaluate the model. We use the first category, with 10 percent training samples, for model validation. With 2^6, 2^7 and 2^8 filters (3 × 3 × 3 dimension) in the first, second and third 3D-convolution layers respectively, we adopt the 'ReLU' activation function. In the model, each 3D convolution layer is followed by 3D max pooling with pooling size
Table 3 Classification performance: without oversampling
| Data sets | Accuracy measures | Training:testing (20:80) (%) | Time (S) | Training:testing (30:70) (%) | Time (S) | Training:testing (80:20) (%) | Time (S) |
|---|---|---|---|---|---|---|---|
| (IPD) | OA | 98.65 ± 0.05 | 79.74 | 99.27 ± 0.03 | 85.38 | 99.65 ± 0.01 | 144:71 |
| | AA | 98.69 ± 0.20 | 02.21 | 98.07 ± 0.12 | 02.77 | 99.43 ± 0.02 | 0.718 |
| | K | 98.47 ± 0.10 | | 99.15 ± 0.03 | | 99.61 ± 0.02 | |
| (SD) | OA | 99.20 ± 0.05 | 184.22 | 99.58 ± 0.03 | 138.38 | 99.95 ± 0.01 | 213.21 |
| | AA | 99.32 ± 0.20 | 07.59 | 99.68 ± 0.12 | 06.67 | 99.92 ± 0.02 | 01.97 |
| | K | 99.10 ± 0.10 | | 99.53 ± 0.00 | | 99.94 ± 0.02 | |
| (PUD) | OA | 99.84 ± 0.05 | 59.18 | 99.84 ± 0.03 | 86.62 | 99.99 ± 0.01 | 206.87 |
| | AA | 99.70 ± 0.20 | 06.57 | 99.65 ± 0.12 | 05.85 | 99.99 ± 0.02 | 01.37 |
| | K | 99.79 ± 0.10 | | 99.80 ± 0.03 | | 99.99 ± 0.02 | |
2 and a dropout ratio of 0.5. The 2D-convolution layer in the model has 256 filters (3 × 3 dimension) and a dropout ratio of 0.25. In all SFCNs (1–5), the layer width is set to 256 and the half width is set to half of the layer width, rounded to an integer; these settings play a significant role. We use the Adam optimizer with the categorical cross-entropy loss function (Fig. 9). The learning rate and decay were set to 0.001 and 1e−06 respectively. The model is trained for 20 epochs with a batch size of 256. The model is compared with four published methods (Fig. 10): EMP-SVM [18], MCNN-CP [78], 2D-CNN [50] and HybridSN [59]. The performance of the model is also investigated (Table 4) by repeating the experiments with noisy data, with and without weak-class oversampling, and with different spatial sizes and train-test ratios (Fig. 11).
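The three evaluation measures used in this section (OA, AA and the Kappa value) can all be derived from the confusion matrix. The sketch below shows one standard way to compute them with NumPy; the function name is hypothetical, class labels are assumed to be integers starting at 0, and every class is assumed to occur at least once in the ground truth.

```python
import numpy as np

def classification_scores(y_true, y_pred, n_classes):
    """Return overall accuracy, average accuracy and Cohen's kappa."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                                # rows: truth, columns: prediction
    total = cm.sum()
    oa = np.trace(cm) / total                        # fraction of correct samples
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))       # mean of per-class recalls
    # Agreement expected by chance from the row and column marginals.
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / total ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

# Usage on a small hand-checkable example.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 2]
oa, aa, kappa = classification_scores(y_true, y_pred, n_classes=3)
print(round(oa, 4), round(aa, 4), round(kappa, 4))  # 0.8333 0.8333 0.75
```

OA is dominated by the large classes, AA weights every class equally, and kappa discounts the agreement that would occur by chance, which is why all three are reported for imbalanced HI datasets.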
Fig. 9 Epochs and training/validation loss a–c 10% Training IPD, SD, PUD d–f 30% Training IPD, SD, PUD
Fig. 10 Class-wise classification accuracy: Training sample (T.S.) with oversample (O.S.) a & b IPD c & d SD e & f PUD
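Table 4 Accuracy of datasets with noise
| Data sets | Accuracy (%) | Speckle noise v = 10 | Speckle noise v = 30 | Speckle noise v = 50 | Gaussian noise v = 10 | Gaussian noise v = 30 | Gaussian noise v = 50 | Salt & Pepper a = 0 | Salt & Pepper a = 0.5 | Salt & Pepper a = 1 |
|---|---|---|---|---|---|---|---|---|---|---|
| (IPD) | OA | 98.68 | 99.60 | 99.75 | 98.87 | 96.54 | 97.59 | 99.95 | 99.60 | 95.56 |
| | AA | 99.09 | 98.50 | 98.62 | 99.26 | 96.54 | 97.59 | 99.91 | 99.50 | 97.12 |
| | K | 98.49 | 99.55 | 99.72 | 98.71 | 96.55 | 98.44 | 99.94 | 99.55 | 94.95 |
| (SD) | OA | 99.94 | 96.91 | 99.38 | 99.88 | 99.94 | 99.85 | 99.83 | 99.52 | 97.16 |
| | AA | 99.89 | 96.17 | 99.33 | 99.70 | 99.88 | 99.80 | 99.79 | 99.16 | 93.85 |
| | K | 99.93 | 96.56 | 99.31 | 99.87 | 99.93 | 99.83 | 99.81 | 99.47 | 96.83 |
| (PUD) | OA | 99.97 | 99.92 | 99.76 | 99.27 | 99.10 | 95.89 | 100.00 | 99.73 | 99.53 |
| | AA | 99.94 | 99.75 | 99.70 | 99.56 | 98.31 | 93.77 | 100.00 | 99.38 | 99.49 |
| | K | 99.96 | 99.90 | 99.69 | 99.88 | 98.80 | 94.58 | 100.00 | 99.64 | 99.38 |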
7 Conclusions and Future Scope This chapter addresses basic issues related to satellite imaging and hyperspectral classification techniques. In the first part of the experimental analysis, we used a Sentinel-2 satellite image of the Sundarban mangrove and classified the land cover with respect to six ground truth labels with comparatively better accuracy. Then, motivated by issues such as limited training size, computational time and classification performance under noise, we adopted a combined 3D-2D DL approach for generating hierarchical, discriminative deep spectral-spatial features and for HI classification. A multi-scale feature learning technique is employed in the framework, which increases the ability of the model to classify objects of diverse shapes even after the information loss caused by the convolution mechanism. The use of the SpinalNet model enhances the accuracy and controls the error. Experimental results demonstrate that our model can classify with a limited number of training samples, thus avoiding the need for oversampling, and performs well even
Fig. 11 Classification maps: a–c Ground truth of IPD, SD and PUD d–f Our model (with 30% training) of IPD, SD and PUD g–i Our model (with 10% training and oversampling) of IPD, SD and PUD
in the presence of Gaussian and Poisson noise. The model is validated on three benchmark datasets, giving consistent and competitive values for Overall Accuracy (OA), Average Accuracy (AA), and Kappa Accuracy (KA) compared to four other state-of-the-art models. Being a supervised classification model, it is best used on labeled Hyperspectral datasets and is most suitable for applications based on land cover mapping, agriculture and global climate.
References
1. Adate, A., Arya, D., Shaha, A., & Tripathy, B. K. (2020). Impact of deep neural learning on artificial intelligence research. In S. Bhattacharyya, A. E. Hassanian, S. Saha, & B. K. Tripathy (Eds.), Deep Learning Research and Applications (pp. 69–84). De Gruyter Publications. https://doi.org/10.1515/9783110670905-004
2. Adate, A., & Tripathy, B. K. (2018). Deep learning techniques for image processing. In S. Bhattacharyya, H. Bhaumik, A. Mukherjee, & S. De (Eds.), Machine Learning for Big Data Analysis (pp. 69–90). De Gruyter. https://doi.org/10.1515/9783110551433-00357
3. Bajorski, P. (2010). Investigation of virtual dimensionality and broken stick rule for hyperspectral images. In 2010 2nd Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (pp. 1–4).
4. Benediktsson, J. A., Palmason, J. A., & Sveinsson, J. R. (2005). Classification of hyperspectral data from urban areas based on extended morphological profiles. IEEE Transactions on Geoscience and Remote Sensing, 43(3), 480–491.
5. Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H., et al. (2007). Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems, 19, 153.
6. Bhattacharyya, S., Snasel, V., Hassanian, A. E., Saha, S., & Tripathy, B. K. (2020). Deep learning research with engineering applications. De Gruyter Publications. ISBN: 3110670909, 9783110670905. https://doi.org/10.1515/9783110670905
7. Bhardwaj, P., Guhan, T., & Tripathy, B. K. (2021). Computational biology in the lens of CNN, Studies in Big Data. In S. S. Roy, & Y.-H. Taguchi (Eds.), Handbook of Machine Learning Applications for Genomics (Chapter 5) (vol. 103). ISBN: 978–981–16–9157–7
8. Binol, H. (2018). Ensemble learning based multiple kernel principal component analysis for dimensionality reduction and classification of hyperspectral imagery. Mathematical Problems in Engineering, 2018, 14. Article ID 9632569.
9. Bose, A., & Tripathy, B. K. (2020). Deep learning for audio signal classification. In S. Bhattacharyya, A. E. Hassanian, S. Saha, & B. K. Tripathy (Eds.), Deep Learning Research and Applications (pp. 105–136). De Gruyter Publications. https://doi.org/10.1515/978311067090500660
10. Bruce, L. M., Li, J., & Huang, Y. (2002). Automated detection of subpixel hyperspectral targets with adaptive multichannel discrete wavelet transform. IEEE Transactions on Geoscience and Remote Sensing, 40(4), 977–980.
11. Chen, Y., Lin, Z., Zhao, X., Wang, G., & Gu, Y. (2014). Deep learning-based classification of hyperspectral data. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 7(6), 2094–2107.
12. COAH: Copernicus Open Access Hub. https://scihub.copernicus.eu
13. Grupo de Inteligencia Computacional. (2014). Hyperspectral remote sensing scenes. http://www.ehu.eus/ccwintco/index.php
14. Debgupta, R., Chaudhuri, B. B., & Tripathy, B. K. (2020). A wide ResNet-based approach for age and gender estimation in face images. In A. Khanna, D. Gupta, S. Bhattacharyya, V. Snasel, J. Platos, & A. Hassanien (Eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing (vol. 1087, pp. 517–530). Springer. https://doi.org/10.1007/978-981-15-1286-5_44
15. Deepa, P., & Thilagavathi, K. (2015). Feature extraction of hyperspectral image using principal component analysis and folded-principal component analysis. In 2015 2nd International Conference on Electronics and Communication Systems (ICECS) (pp. 656–660).
16. Dharmasastha, K. N. S., Banu, K. S., Kalaichevlan, G., Lincy, B., & Tripathy, B. K. (2022). Classification of pest in tomato plants using CNN. In M. N. Mohanty, S. Das, M. Ray, & B. Patra (Eds.), Meta Heuristic Techniques in Software Engineering and Its Applications. METASOFT 2022. Artificial Intelligence-Enhanced Software and Systems Engineering (vol. 1). Springer. https://doi.org/10.1007/978-3-031-11713-8_6
17. Du, Q. (2007). Modified Fisher's linear discriminant analysis for hyperspectral imagery. IEEE Geoscience and Remote Sensing Letters, 4(4), 503–507.
168
L. S. Kumar et al.
18. Fauvel, M., Benediktsson, J. A., Chanussot, J., & Sveinsson, J. R. (2008). Spectral and spatial classification of hyperspectral data using svms and morphological profiles. IEEE Transactions on Geoscience and Remote Sensing, 46(11), 3804–3814. 19. Fauvel, M., Tarabalka, Y., Benediktsson, J. A., Chanussot, J., & Tilton, J. C. (2012). Advances in spectral-spatial classification of hyperspectral images. Proceedings of the IEEE, 101(3), 652–675. 20. Fu, A., Ma, X., & Wang, H. (2018). Classification of hyperspectral image based on hybrid neural networks. In: IGARSS 2018 2018 IEEE International Geoscience and Remote Sensing Symposium (pp. 2643–2646). 21. Fukushima, K., & Miyake, S. (1982). Neocognitron: A self-organizing neural net-work model for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets (pp. 267–285). Springer. 22. Ghasemzadeh, A., & Demirel, H. (2016) Hyperspectral face recognition using 3d discrete wavelet transform. In 2016 Sixth International Conference on Image Processing Theory, Tools and Applications (IPTA) (pp. 1–4). 23. Ghiya, A.S., Vijay, V., Ranganath, A., Chaturvedi, P., Tripathy, B.K. & Banu, K. S. (2021). Weather classification: Image embedding using xonvolutional autoencoder and predictive analysis using stacked generalization. In ANTIC conference. BHU. 24. Guo, Y., Liu, Y., Oerlemans, A., Lao, S., Wu, S., & Lew, M. S. (2016). Deep learning for visual understanding: A review. Neurocomputing, 187, 27–48. 25. Gupta, P., Bhachawat, S., Dhyani, K., & Tripathy, B. K. (2021). A study of gene characteristics and their applications using deep learning, (Chapter 4), Studies in Big Data. In S. S. Roy, & Y.-H. Taguchi (Eds.), Handbook of Machine Learning Applications for Genomics (vol. 103). ISBN: 978–981–16–9157–7, 496166_1_En 26. Hamida, A. B., Benoit, A., Lambert, P., & Amar, C. B. (2018). 3-d deep learning approach for remote sensing image classification. 
IEEE Transactions on geoscience and remote sensing, 56(8), 4420–4434. 27. Harikiran, J., Ladi, S. K., Panda, G. K., Dash, R., Ladi, P. K. (2020). Hyperspectral image classification bi-dimensional empirical mode decomposition and deep residual networks. In 2020 International Conference on Artificial Intelligence and Signal Processing (AISP) (pp.1– 6). 28. Harsanyi, J. C., & Chang, C.-I. (1994). Hyperspectral image classification and dimensionality reduction: An orthogonal subspace projection approach. IEEE Transactions on geoscience and remote sensing, 32(4), 779–785. 29. Haut, J. M., Paoletti, M. E., Plaza, J., Plaza, A., & Li, J. (2019). Hyperspectral image classification using random occlusion data augmentation. IEEE Geoscience and Remote Sensing Letters, 16(11), 1751–1755. 30. Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural computation, 18(7), 1527–1554. 31. Hughes, G. (1968). On the mean accuracy of statistical pattern recognizers. IEEE transactions on information theory, 14(1), 55–63. 32. Imani, M., & Ghassemian, H. (2014). Principal component discriminant analysis for feature extraction and classification of hyperspectral images. In 2014 Iranian Conference on Intelligent Systems (ICIS) (pp. 1–5). 33. Jayaprakash, C., Damodaran, B. B., Sowmya, V., & Soman, K. P. (2018). Dimensionality reduction of hyperspectral images for classification using randomized independent component analysis. In 2018 5th International Conference on Signal Processing and Integrated Networks (SPIN) (pp. 492–496) 34. Kabir, H. M. D., Abdar, M., Jalali, S. M. J., Khosravi, A., Atiya, A.F., Nahavandi, S., & Srinivasan, D. (2020). SpinalNet: Deep neural network with gradual input 35. Kathuria, A. (2018) Intro to optimization in deep learning: Momentum, Rmsprop and Adam. https://blog.paperspace.com/intro-to-optimization-momentum-rmsprop-adam/
Hyperspectral Images: A Succinct Analytical Deep Learning Study
169
36. Kaul, D., Raju, H., & Tripathy, B. K. (2022). Deep learning in healthcare, in: Deep Learning in Data Analytics. In: D.P. Acharjya, A. Mitra, N. Zaman (Eds,), Deep Learning in Data AnalyticsRecent Techniques, Practices and Applications, Studies in Big Data (vol. 91, pp. 97–115). Springer. https://doi.org/10.1007/978-3-030-75855-4_6 37. Ke, C. (2017). Military object detection using multiple information extracted from hyperspectral imagery. In 2017 International Conference on Progress in Informatics and Computing (PIC) (pp. 124–128). 38. Khan, M.J., Khan, H.S., Yousaf, A., Khurshid, K., & Abbas, A. (2018). Modern trends in hyperspectral image analysis: A review. IEEE Access. 6, 14118−14129 39. Kumar, V., & Tripathy, B. K. (2020). Detecting toxicity with bidirectional gated recurrent unit networks. In V. Bhateja, S. Satapathy, Y.D. Zhang, V. Aradhya (Eds.), Intelligent Computing and Communication. ICICC 2019. Advances in Intelligent Systems and Computing (vol. 1034). Springer. https://doi.org/10.1007/978-981-15-1084-7_57 40. Kwon, H., Hu, X., Theiler, J., Zare, A, & Gurram, P. (2013). Algorithms for multispectral and hyperspectral image analysis. Journal of Electrical and Computer Engineering, 2013, 2. Article ID 908906 41. Ladi, S. K., Panda, G. K., Dash, R., et al. (2022). A novel grey wolf optimisation based CNN classifier for hyperspectral image classification. Multimed Tools Appl, 81, 28207–28230. 42. Ladi, S. K., Panda, G. K., Dash, R. et al. (2022). A novel strategy for classifying spectral-spatial shallow and deep hyperspectral image features using 1D-EWT and 3D-CNN. Earth science informatics 43. Ladi, S. K., Dash, R., Panda, G. K., Ladi, P. K., & Dhupar, R. (2019). Hyperspectral image classification using swt and cnn. In 2019 International Conference on Information Technology (ICIT) (pp. 172–177). 44. Li, C., Zuo, H., Fan, T. (2017). Hyperspectral image classification based on gray level cooccurrence matrix and local mean decomposition. 
In 2017 4th International Conference on Systems and Informatics (ICSAI) (pp. 1219–1223). 45. Li, J., Bioucas-Dias, J. M., & Plaza, A. (2010). Semisupervised hyperspectral image segmentation using multinomial logistic regression with active learning. IEEE Transactions on Geoscience and Remote Sensing, 48(11), 4085–4098. 46. Li, Y., Zhang, H., & Shen, Q. (2017). Spectral–spatial classification of hyperspectral imagery with 3d convolutional neural network. Remote Sensing, 9(1), 67. 47. Li, W., Wu, G., Zhang, F., & Du, Q. (2017). Hyperspectral image classification using deep pixel-pair features. IEEE Transactions on Geoscience and Remote Sensing, 55(2), 844–853. 48. Ma, Y., Li, R., Yang, G., Sun, L., & Wang, J. (2018). A research on the combination strategies of multiple features for hyperspectral remote sensing image classification. Journal of Sensors, 2018, 14. Article ID 7341973. 49. Maheswari, K., Shaha, A., Arya, D., Tripathy, B. K., & Rajkumar, R. (2020). Convolutional neural networks: A bottom-ip approach. In S. Bhattacharyya, A. E. Hassanian, S. Saha, & B.K. Tripathy (Ed.), Deep Learning Research with Engineering Applications (pp.21–50). De Gruyter Publications. https://doi.org/10.1515/9783110670905-002 50. Makantasis, K., Karantzalos, K., Doulamis, A., & Doulamis, N. (2015). Deep super-vised learning for hyperspectral data classification through convolutional neural networks. In 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS) (pp. 4959–4962). 51. Makantasis, K., Doulamis, A. D., Doulamis, N. D., & Nikitakis, A. (2018). Tensor-based classification models for hyperspectral data analysis. IEEE Transactions on Geoscience and Remote Sensing, 56(12), 6884–6898. 52. Makantasis, K., Doulamis, A., Doulamis, N., Nikitakis, A., & Voulodimos, A. (2018). Tensorbased nonlinear classifier for highorder data analysis. In 2018 IEEE International Conference 53. Notesco, G., Dor, E. B., & Brook, A. (2014). 
Mineral mapping of makhtesh ramon in israel using hyperspectral remote sensing day and night LWIR images. In 2014 6th Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS) (pp. 1– 4).
170
L. S. Kumar et al.
54. Pesaresi, M., Gerhardinger, A., & Kayitakire, F. (2008). A robust built-up area presence index by anisotropic rotation-invariant textural measure. IEEE Journal of selected topics in applied earth observations and remote sensing, 1(3), 180–192. 55. Pesaresi, M., & Benediktsson, J. A. (2001). A new approach for the morphological segmentation of high-resolution satellite imagery. IEEE transactions on Geoscience and Remote Sensing, 39(2), 309–320. 56. Pike, R., Lu, G., Wang, D., Chen, Z. G., & Fei, B. (2016). A minimum spanning forest-based method for noninvasive cancer detection with hyperspectral imaging. IEEE Transactions on Biomedical Engineering, 63(3), 653–663. 57. Plaza, A., Mart´ınez, P., Plaza, J., P´erez, R. (2005). Dimensionality reduction and classification of hyperspectral image data using sequences of extended morphological transformations. IEEE Transactions on Geoscience and remote sensing, 43(3), 466–479. 58. Prabhavathy, P., Tripathy, B.K., Venkatesan, M. (2022). Analysis of diabetic retinopathy detection techniques using CNN Models. In: S. Mishra, H. K. Tripathy, P. Mallick, K. Shaalan (Eds.), Augmented Intelligence in Healthcare: A Pragmatic and Integrated Analysis. Studies in Computational Intelligence (vol. 1024). Springer, https://doi.org/10.1007/978-981-19-107 6-0_6 59. Roy, S. K., Krishna, G., Dubey, S. R., & Chaudhuri, B. B. (2020). Hybridsn: Exploring 3-d-2-d cnn feature hierarchy for hyperspectral image classification. IEEE Geoscience and Remote Sensing Letters, 17(2), 277–281. 60. Singhania, U., & Tripathy, B. K. (2021). Text-based image retrieval using deep learning. In Encyclopedia of Information Science and Technology (5th ed., p. 11). https://doi.org/10.4018/ 978-1-7998-3479-3.ch007 61. Rungta, R. K., Jaiswal, P., Tripathy, B. K. (2022). A deep learning based approach to measure confidence for virtual interviews. In A. K. Das et al. 
(Eds.), Proceedings of the 4th International Conference on Computational Intelligence in Pattern Recognition (CIPR) (pp. 278–291). CIPR 2022, LNNS 480. 62. Sihare, P., Khan, A. U., Bardhan, P., & Tripathy, B. K. (2022). COVID-19 detection using deep learning: A comparative study of segmentation algorithms. In A. K. Das et al. (Eds.), Proceedings of the 4th International Conference on Computational Intelligence in Pattern Recognition (CIPR) (pp. 1–10). CIPR 2022, LNNS 480. 63. Jain, S., Singhania, U., Tripathy, B.K., Abouel, E. N., Aboudaif, M. K., & Ali, K. K. (2021). Deep learning based transfer learning for classification of skin cancer. Sensors (Basel), 21(23), 8142 https://doi.org/10.3390/s21238142. (IF:4.35) 64. Surya, Y. S., Geetha Rani, K. T., & Tripathy, B. K. (2022). Social distance monitoring and face mask detection using deep learning. In: J. Nayak, H. Behera, B. Naik, S. Vimal, D. Pelusi (Eds.), Computational Intelligence in Data Mining. Smart Innovation, Systems and Technologies (vol. 281). Springer. https://doi.org/10.1007/978-981-16-9447-9_36 65. Sun, T., Jiao, L., Feng, J., Liu, F., & Zhang, X. (2015). Imbalanced hyperspectral image classification based on maximum margin. IEEE Geoscience and Remote Sensing Letters, 12(3), 522–526. 66. Teng, M. Y., Mehrubeoglu, R., King, S. A., Cammarata, K., & Simons, J. (2013). Investig tion of epifauna coverage on seagrass blades using spatial and spectral analysis of hyperspectral images. In 2013 5th Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS) (pp. 1–4). 67. Tripathy, B. K., & Anuradha, J. (2015). Soft computing-advances and applications. Cengage Learning publishers. ASIN: 8131526194, ISBN-109788131526194 68. Tripathy, B. K., Parikh, S., Ajay, P., & Magapu, C. (2022). Brain MRI segmentation techniques based on CNN and its variants, (Chapter-10). In J. Chaki (Ed.), Brain Tumor MRI Image Segmentation Using Deep Learning Techniques (pp. 161−182). 
Elsevier publications. https:// doi.org/10.1016/B978-0-323-91171-9.00001-6 69. Tripathy, B. K., & Adate, A. (2021). Impact of deep neural learning on artificial intelligence research, Chapter-8. In D. P. Acharjya et al (Ed.), Springer publications.
Hyperspectral Images: A Succinct Analytical Deep Learning Study
171
70. Voulodimos, A. (2018). Deep learning for computer vision: a brief review. Computational Intelligence and Neuroscience, 2018, 13. Article ID 7068349. 71. Wang, & Chang, C. I. (2006). Independent component analysis based dimensionality reduction with applications in hyperspectral image analysis. In IEEE Transactions on Geoscience and Remote Sensing (vol. 44, no. 6, pp. 1586–1600). 72. Wang, X., & Feng, Y. (2008). New method based on support vector machine in classification for hyperspectral data. In 2008 International Symposium on Computational Intelligence and Design (pp. 76–80) 73. Wang, Y., & Cui, S. (2014). Hyperspectral image feature classification using stationary wavelet transform. In 2014 International Conference on Wavelet Analysis and Pattern Recognition (pp. 104–108) 74. Wu, Y., Mu, G., Qin, C., Miao, Q., Ma, W., & Zhang, X. (2020). Semi-supervised hyperspectral image classification via spatial-regulated self-training. Remote Sensing, 12(1) 75. Xingjian, S., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., & Woo, W.C. (2015). Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Proceedings of the 28th International Conference on Neural Information Processing Systems (Vol. 1, pp. 802–810). 76. Xu, Y., Zhang, L., Du, B., & Zhang, F. (2018). Spectral–spatial unified networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 56(10), 5893–5909. 77. Zhang, X., Zhang, A., & Meng, X. (2015). Automatic fusion of hyperspectral images and laser scans using feature points. Journal of Sensors, 2015, 9. Article ID 415361 78. Zheng, J., Feng, Y., Bai, C., & Zhang, J. (2021). Hyperspectral image classification using mixed convolutions and covariance pooling. IEEE Transactions on Geoscience and Remote Sensing, 59(1), 522–534. 79. Zhong, Z., Li, J., Luo, Z., & Chapman, M. (2018). Spectral–spatial residual network for hyperspectral image classification: A 3-d deep learning framework. 
IEEE Transactions on Geoscience and Remote Sensing, 56(2), 847–858 80. Zhou, F., Hang, R., Liu, Q., & Yuan, X. (2019). Hyperspectral image classification using spectral-spatial lstms. Neurocomputing, 328, 39–47.
Chest X-Ray Image Classification of Pneumonia Disease Using EfficientNet and InceptionV3 Neel Ghoshal, Mohd Anas, and Sanjiban Sekhar Roy
N. Ghoshal · M. Anas · S. S. Roy (B) School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis, Studies in Big Data 129, https://doi.org/10.1007/978-981-99-3784-4_9
1 Introduction Pneumonia is a respiratory infection of the lungs. It causes inflammation and fluid buildup in the air sacs, which makes breathing difficult and can also affect cardiovascular health. Pneumonia is considered to be the single largest cause of death in children worldwide, with an estimated 5.9 million deaths annually among children under 5 years old [1]. Chest X-rays and other radiography methods have long been used in medicine to diagnose and manage conditions such as cancer, infections, emphysema and pneumonia. The analysis of X-ray images is generally carried out in person by expert radiologists. The number of cases requiring chest X-rays has increased substantially in recent times [2], so radiologists must devote more time to this task. Expertise is required because the structures in the lung are finely detailed, and a diagnosis has to be deduced from subtle characteristics that together point towards a general illness category. The large volume of chest X-rays that must be processed manually can lead to delays, higher costs and errors, all of which medical institutions need to avoid. In this chapter, we propose an automated medical image diagnosis system that gives radiologists and clinical staff an alternative, convenient way to process and analyse this data efficiently, with little manual work.
For our problem statement, we use two Convolutional Neural Network (CNN) based models to classify chest X-ray scans for pneumonia. CNNs work well for this image classification problem because they reduce the dimensionality of the data and process it efficiently while producing accurate results [3]. These advantages come from the layers of the network and their roles: the convolution layer applies learned filters to small local regions of the image and produces compact feature maps; the pooling layer takes the convolution output and reduces its dimensionality further; and the fully connected layer, which can be considered the final stage, learns which of the extracted features matter for the classification problem at hand.
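To make the dimensionality reduction concrete, the following minimal Python sketch tracks the spatial size of a feature map through one convolution and one pooling layer, using the standard output-size formula. The input size, kernel sizes and strides are illustrative assumptions and are not taken from the models used in this chapter.

```python
def output_size(n, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling window slides over n pixels."""
    return (n + 2 * padding - kernel) // stride + 1

# Hypothetical example: a 224 x 224 input (assumed, not from the chapter).
side = 224
after_conv = output_size(side, kernel=3, stride=1)        # 3x3 convolution, no padding
after_pool = output_size(after_conv, kernel=2, stride=2)  # 2x2 max pooling, stride 2

print(after_conv, after_pool)  # 222 111
```

In this sketch the pooling step roughly halves each spatial dimension, which is the further reduction in dimensionality that the pooling layer provides.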
2 Literature Survey To date, a number of approaches have been proposed for this and related medical diagnosis problems. CNNs and deep neural networks have allowed researchers to build sophisticated models for conditions including pneumonia, tuberculosis, Covid-19, lung cancer and many more [4]. The techniques used by researchers in their respective fields include convolutional neural networks, transfer learning, image-level prediction, segmentation networks, localization networks, image generation networks and domain adaptation networks [4]. For example, Crosby et al. used deep CNNs to distinguish between binary-labelled chest radiograph data [5]. Deep learning has also been used to detect foreign objects in chest radiographs [6]. Generative adversarial networks have been applied to organ segmentation and bone suppression in chest X-rays [7]. Showkat et al. studied transfer-learning-based image classifiers for detecting Covid-19 pneumonia [8]. Hirata et al. used deep learning to detect elevated pulmonary artery wedge pressure from standard chest X-ray data [9]. CNNs have thus gained a firm foothold in computer vision problems of this kind: in 2015 and 2016, more than 300 papers on applications of deep learning were published in workshops, conferences, journals and special issues in this domain [9, 10].
Fig. 1 Three samples from normal and pneumonia classes
3 Dataset The dataset used to train our proposed models was obtained from Kaggle and is named “Chest X-Ray Images (Pneumonia)”. It consists of 5863 images as training samples, each with a binary label marking it as either ‘normal’ or ‘pneumonia’. Because the label is binary, the proposed models are tasked with detecting the presence of pneumonia, not with distinguishing specific types of pneumonia such as bacterial or viral. The images in the dataset are formatted X-ray images of the lungs (Fig. 1). Normal lung X-rays make up 27% of the dataset, and the remaining images correspond to pneumonia (Fig. 2).
Fig. 2 Image category distribution in dataset
4 Data Pre-processing For data pre-processing, all images are converted to grayscale and a Gaussian blur is applied to them. Grayscale conversion reduces each pixel to a single value that represents light intensity, which suits this image classification task. Gaussian blur is applied to reduce noise and redundant detail in the pixel data; it smooths the transitions at the edges and boundaries of objects. Image erosion is also applied: the erosion function removes pixels on object boundaries, and the number of pixels affected depends on the characteristics of the image. The Canny edge detection algorithm, as implemented in OpenCV, is also used; it reduces noise, finds the intensity gradient of the image and suppresses pixels that do not belong to an edge (Figs. 3, 4 and 5).
Fig. 3 Grayscale conversion and Gaussian blurring
Fig. 4 Image erosion
Fig. 5 Canny edge detection
5 Proposed Model We use two distinct models for this classification problem: the EfficientNet model and the InceptionV3 model. Both are based on convolutional neural networks (CNNs).
5.1 EfficientNet EfficientNet is an architecture family built around model scaling in convolutional neural networks. It scales the depth, width and resolution of a network uniformly using a compound coefficient. Its distinguishing feature is that these dimensions are not scaled arbitrarily; a fixed set of scaling coefficients is used to scale network width, depth and resolution together. With this technique, its creators report higher accuracy than previous high-performing convolutional networks while also achieving better efficiency [11]. Conventional model scaling starts from (a) a baseline model and applies (b) width scaling, (c) depth scaling or (d) resolution scaling individually, whereas EfficientNet uses compound scaling, which combines all of these techniques into one scheme (Figs. 6 and 7). Compound scaling is motivated by the observation that network depth should be increased for higher-resolution images, so that larger receptive fields capture features spanning more pixels, and that network width should correspondingly be increased when the resolution is higher, so that the fine-grained patterns in the images are captured [11]. The compound scaling method uses a coefficient ϕ to scale the width, depth and resolution of the network uniformly:

\[
\begin{aligned}
\text{Depth: } & d = a^{\phi} \\
\text{Width: } & w = b^{\phi} \\
\text{Resolution: } & r = c^{\phi} \\
\text{s.t. } & a \cdot b^{2} \cdot c^{2} \approx 2, \quad a \ge 1,\ b \ge 1,\ c \ge 1
\end{aligned}
\]

where a, b and c are constants determined by a small grid search. ϕ is a user-specified coefficient that controls how many more resources are available for model scaling, while a, b and c specify how to assign the extra resources to network depth, width and resolution respectively [11].
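As a worked illustration of compound scaling, the sketch below computes the depth, width and resolution multipliers for a few values of ϕ. The default constants a = 1.2, b = 1.1 and c = 1.15 are the grid-search values reported in the EfficientNet paper [11] for its baseline network; they are used here as an assumption, since this chapter does not state which values or which EfficientNet variant were used. Because the FLOPS of a convolutional network grow roughly in proportion to d · w² · r², the constraint a · b² · c² ≈ 2 means that total FLOPS grow by approximately 2^ϕ.

```python
def compound_scale(phi, a=1.2, b=1.1, c=1.15):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return a ** phi, b ** phi, c ** phi

def flops_factor(phi, a=1.2, b=1.1, c=1.15):
    """Approximate growth in FLOPS: (a * b^2 * c^2) ** phi, close to 2 ** phi."""
    return (a * b ** 2 * c ** 2) ** phi

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, "
          f"FLOPS x{flops_factor(phi):.2f}")
```

With these constants, a · b² · c² is about 1.92, so each unit increase in ϕ roughly doubles the computational budget while all three dimensions grow together.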
Fig. 6 Baseline network with connecting layers
Fig. 7 Compound scaled network with connecting layers
The EfficientNet architecture provides the baseline network to which the compound scaling criterion described above is applied.
5.2 InceptionV3 InceptionV3 is an image recognition model that has been reported to achieve state-of-the-art accuracy on image tasks. It builds upon the InceptionV1 architecture, which consists of multiple filters in parallel layers in place of the purely sequential deep layers of a typical CNN. Each block of a basic Inception model is made of four parallel layers: 1×1, 3×3 and 5×5 convolutions and a 3×3 max pooling layer. The InceptionV3 model is assembled from building blocks including (a) convolutions, (b) average pooling, (c) max pooling, (d) concatenations, (e) dropouts and (f) softmax (Fig. 8). Building on InceptionV1, the model factorizes large convolutions into smaller ones, which reduces the cost of processing high-dimensional data. It also uses spatial factorization into asymmetric convolutions, in which an n×n convolution is replaced by a 1×n convolution followed by an n×1 convolution, for higher efficiency [12]. The model uses auxiliary classifiers, which in essence act as regularizers, and parallel stride blocks that reduce the grid size efficiently while avoiding a representational bottleneck.
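The saving from factorization can be checked with simple arithmetic. The sketch below counts the weights of a convolution layer before and after the two factorizations described above; the helper `conv_weights` and the channel count of 64 are hypothetical choices made for illustration, and bias terms are ignored.

```python
def conv_weights(kernel_h, kernel_w, channels_in, channels_out):
    """Number of weights in one convolution layer (bias terms ignored)."""
    return kernel_h * kernel_w * channels_in * channels_out

C = 64  # hypothetical channel count, kept equal for input and output

full_5x5 = conv_weights(5, 5, C, C)
two_3x3 = 2 * conv_weights(3, 3, C, C)  # one 5x5 replaced by two stacked 3x3
full_3x3 = conv_weights(3, 3, C, C)
asymmetric = conv_weights(1, 3, C, C) + conv_weights(3, 1, C, C)  # 3x3 -> 1x3 then 3x1

print(f"5x5 -> two 3x3: {1 - two_3x3 / full_5x5:.0%} fewer weights")      # 28%
print(f"3x3 -> 1x3 + 3x1: {1 - asymmetric / full_3x3:.0%} fewer weights")  # 33%
```

The ratios do not depend on the channel count as long as it is the same in every layer, so the relative saving is the same for any equal input and output width.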
Fig. 8 Input layer and output layer dimensions for InceptionV3 model
6 Experimental Outcome and Analysis 6.1 InceptionV3 Figure 9 shows the training accuracy and validation accuracy curves for the InceptionV3 model, which was trained for 15 epochs. The peak accuracy achieved by the model is 92.93%. The training accuracy rises and falls gradually as the model's prediction confidence is tuned, reaches its peak, and decreases thereafter. The validation accuracy follows a similar shape, then drops to an extremely low value before stabilizing in later epochs, which shows how sensitive the validation accuracy is to changes in the model parameters. The training loss, shown in Fig. 10, declines steeply at first and then settles at its lowest value in a stable manner. The validation loss does not decline as steeply; it passes through a sudden high peak partway through training, after which it stabilizes and ends close to the final training loss. These results indicate what CNN-based models can achieve in pneumonia diagnosis. We consider this outcome to compare favourably with other models for similar tasks, and the model is efficient owing to the architectural design of the Inception family described in Sect. 5.2.
Fig. 9 Accuracy curve of Inception model
Fig. 10 Loss value curve of Inception model
6.2 EfficientNet Figure 11 shows the training accuracy and validation accuracy curves for the EfficientNet model, which was trained for 10 epochs. The peak accuracy achieved by the model is 95.39%. The training accuracy increases steeply after the first epoch and then rises gradually and stably, reaching its peak value after the last epoch. The validation accuracy initially follows a similar shape, then drops to an extremely low value, increases steeply in the subsequent epoch, and drops to an extremely low value again two epochs later. The training loss, shown in Fig. 12, declines initially and reaches its lowest value with only negligible fluctuations along the way. The validation loss shows a steep initial decline similar to the training loss and reaches its lowest value in the following steps, but it then rises sharply, decreases in the following epoch, and increases substantially again two epochs later. These results likewise indicate what CNN-based models can achieve in pneumonia diagnosis. We consider this outcome to compare favourably with other models for similar tasks, and the model is efficient and customizable owing to the compound scaling design of the EfficientNet family described in Sect. 5.1.
Fig. 11 Accuracy curve of EfficientNet model
Fig. 12 Loss value curve of EfficientNet model
7 Discussion Radiologists, clinicians and staff working on the detection and treatment of pneumonia and related conditions are constrained by the time available, the frequency and volume of data to be processed, and the expertise required. Classifiers already exist for other medical diagnosis tasks, including breast cancer detection [13], and CNNs have recently been used for brain tumour classification [14]. Machine learning and neural network based models can ease almost all of these constraints to a significant extent. It must be noted, however, that the final diagnosis and the inferences drawn from it should ultimately be made by a trained professional; for now, these classification models serve only to aid clinicians and trained experts in streamlining their tasks. A model like this has limitations: it does not explain the reasons behind its outputs, and it cannot characterize finer distinctions within the general illness, which may call for alternative remedies when several disorders cause, or are caused by, the pneumonia. The accuracies achieved in this chapter can be improved further by using a larger dataset or by developing custom models designed exclusively for X-ray diagnostics. Another possible improvement is to include the patient's medical history as a feature variable in the dataset. Furthermore, data augmentation techniques can be incorporated in future models to achieve higher performance [15–30].
8 Conclusion In this chapter, we have discussed the experimental use and outcomes of the EfficientNet and InceptionV3 models for the medical diagnosis of pneumonia from chest X-rays. The models achieved peak accuracies of 95.39% and 92.93% respectively, at a low computational cost. The discussed frameworks can therefore be highly beneficial in the diagnosis of the disease and useful to medical practitioners and radiologists working on this problem. Further refinement of the approaches and methodologies is likely to benefit this cause and pave the way for further improvements.
References
1. Yadav, K. K., & Awasthi, S. (2016). The current status of community-acquired pneumonia management and prevention in children under 5 years of age in India: A review. Therapeutic Advances in Infectious Disease, 3(3–4), 83–97.
2. Çallı, E., Sogancioglu, E., van Ginneken, B., van Leeuwen, K. G., & Murphy, K. (2021). Deep learning for chest X-ray analysis: A survey. Medical Image Analysis, 72, 102125.
3. Li, Q., Cai, W., Wang, X., Zhou, Y., Feng, D. D., & Chen, M. (2014). Medical image classification with convolutional neural network. In 13th International Conference on Control Automation Robotics & Vision (ICARCV), Singapore, pp. 844–848. https://doi.org/10.1109/ICARCV.2014.7064414
4. Çallı, E., Sogancioglu, E., van Ginneken, B., van Leeuwen, K. G., & Murphy, K. (2021). Deep learning for chest X-ray analysis: A survey. Medical Image Analysis, 72, 102125. ISSN 1361-8415. https://doi.org/10.1016/j.media.2021.102125
5. https://www.spiedigitallibrary.org/journals/journal-of-medical-imaging/volume-7/issue-1/016501/Deep-convolutional-neural-networks-in-the-classification-of-dual-energy/https://doi.org/10.1117/1.JMI.7.1.016501.short?SSO=1
6. Deshpande, H., Harder, T., Saalbach, A., Sawarkar, A., & Buelow, T. (2020). Detection of foreign objects in chest radiographs using deep learning. In IEEE 17th International Symposium on Biomedical Imaging Workshops (ISBI Workshops), Iowa City, IA, USA, pp. 1–4. https://doi.org/10.1109/ISBIWorkshops50223.2020.9153350
7. Eslami, M., Tabarestani, S., Albarqouni, S., Adeli, E., Navab, N., & Adjouadi, M. (2020). Image-to-images translation for multi-task organ segmentation and bone suppression in chest X-ray radiography. IEEE Transactions on Medical Imaging, 39(7), 2553–2565. https://doi.org/10.1109/TMI.2020.2974159
8. Showkat, S., & Qureshi, S. (2022). Efficacy of transfer learning-based resnet models in chest x-ray image classification for detecting COVID-19 pneumonia. Chemometrics and Intelligent Laboratory Systems, 224, 104534.
9. Hirata, Y., Kusunose, K., Tsuji, T., Fujimori, K., Kotoku, J. I., & Sata, M. (2021). Deep learning for detection of elevated pulmonary artery wedge pressure using standard chest x-ray. Canadian Journal of Cardiology, 37(8), 1198–1206.
10. Greenspan, H., Summers, R. M., & van Ginneken, B. (2016). Deep learning in medical imaging: Overview and future promise of an exciting new technique. IEEE Transactions on Medical Imaging, 35(5), 1153–1159.
11. Tan, M., & Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (pp. 6105–6114). PMLR.
12. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2818–2826).
13. Mittal, D., Gaurav, D., & Sekhar Roy, S. (2015). An effective hybridized classifier for breast cancer diagnosis. In 2015 IEEE International Conference on Advanced Intelligent Mechatronics (AIM), Busan, Korea (South), pp. 1026–1031. https://doi.org/10.1109/AIM.2015.7222674
14. Roy, S. S., Rodrigues, N., & Taguchi, Y. (2020). Incremental dilations using CNN for brain tumor classification. Applied Sciences, 10(14), 4915. https://doi.org/10.3390/app10144915
15. Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6, 60.
16. Roy, S. S., Hsu, C., Samaran, A., Goyal, R., Pande, A., et al. (2023). Vessels segmentation in angiograms using convolutional neural network: A deep learning based approach. CMES-Computer Modeling in Engineering & Sciences, 136(1), 241–255.
17. Turki, T., & Roy, S. S. (2022). Novel hate speech detection using word cloud visualization and ensemble learning coupled with count vectorizer. Applied Sciences, 12(13), 6611.
18. Roy, S. S., Goti, V., Sood, A., Roy, H., Gavrila, T., Floroian, D., Mohammadi-Ivatloo, B., et al. (2014). L2 regularized deep convolutional neural networks for fire detection. Journal of Intelligent & Fuzzy Systems, 1–12.
19. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022). Deep convolutional neural network for environmental sound classification via dilation. Journal of Intelligent & Fuzzy Systems, 1–7.
20. Forecasting stock price by hybrid model of cascading multivariate adaptive regression splines and deep neural network.
21. Bose, A., Hsu, C. H., Roy, S. S., Lee, K. C., Mohammadi-Ivatloo, B., & Abimannan, S. (2021). Forecasting stock price by hybrid model of cascading multivariate adaptive regression splines and deep neural network. Computers and Electrical Engineering, 95, 107405.
22. Roy, S. S., & Taguchi, Y. H. (2021). Identification of genes associated with altered gene expression and m6A profiles during hypoxia using tensor decomposition based unsupervised feature extraction. Scientific Reports, 11(1), 1–18.
23. Roy, S. S., & Samui, P. (2021). Predicting longitudinal dispersion coefficient in natural streams using minimax probability machine regression and multivariate adaptive regression spline. International Journal of Advanced Intelligence Paradigms, 19(2), 119–127.
24. Marques, G., Agarwal, D., & de la Torre, I. (2020). Automated medical diagnosis of COVID-19 through EfficientNet convolutional neural network. Applied Soft Computing, 96, 106691.
25. Biswas, R., Vasan, A., & Roy, S. S. (2020). Dilated deep neural network for segmentation of retinal blood vessels in fundus images. Iranian Journal of Science and Technology, Transactions of Electrical Engineering, 44(1), 505–518.
26. Roy, S. S., Samui, P., Nagtode, I., Jain, H., Shivaramakrishnan, V., & Mohammadi-Ivatloo, B. (2020). Forecasting heating and cooling loads of buildings: A comparative performance analysis. Journal of Ambient Intelligence and Humanized Computing, 11(3), 1253–1264.
27. Roy, S. S., Chopra, R., Lee, K. C., Spampinato, C., & Mohammadi-Ivatloo, B. (2020). Random forest, gradient boosted machines and deep neural network for stock price forecasting: A comparative analysis on South Korean companies. International Journal of Ad Hoc and Ubiquitous Computing, 33(1), 62–71.
28. Roy, S. S., Mihalache, S. F., Pricop, E., & Rodrigues, N. (2022). Deep convolutional neural network for environmental sound classification via dilation. Journal of Intelligent & Fuzzy Systems, 1–7.
29. Chakraborty, C., Bhattacharya, M., Sharma, A. R., Roy, S. S., Islam, M. A., Chakraborty, S., Dhama, K., et al. (2022). Deep learning research should be encouraged for diagnosis and treatment of antibiotic resistance of microbial infections in treatment associated emergencies in hospitals. International Journal of Surgery (London, England), 105, 106857.
30. Lee, K. C., Roy, S. S., Samui, P., & Kumar, V. (Eds.). (2020). Data analytics in biomedical engineering and healthcare. Academic Press.
Detection of Cancer Using Deep Learning Techniques Apoorv Singh, Arjunaditya, and B. K. Tripathy
A. Singh, School of Electronics Engineering, VIT, Vellore, TN 632014, India
Arjunaditya, School of Computer Science and Engineering, VIT, Vellore, TN 632014, India
B. K. Tripathy (B), School of Information Technology and Engineering, VIT, Vellore, TN 632014, India, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. S. S. Roy et al. (eds.), Deep Learning Applications in Image Analysis, Studies in Big Data 129, https://doi.org/10.1007/978-981-99-3784-4_10

1 Introduction
Cancer is a dreaded disease that poses a threat to human society. According to data provided by the World Health Organisation, cancer accounted for 13% of all fatalities in 2018 [1], and in the coming years it is predicted to rank among the deadliest diseases in the world. As projected, 12 million individuals are likely to be affected by cancer in 2030, and the number of cancer cases is expected to rise dramatically in the next few years. Experts, specialists, and medical professionals are developing new methods to combat cancer, but it is well recognized that this battle is quite challenging [2–4]. The evaluation of medical images by technicians, supported by computers, is referred to as interpretation. Diagnostic ultrasound images, on the contrary, demand a large volume of data to be addressed by the physician and require thorough analysis in a short amount of time. These imaging processes include high-energy electromagnetic radiation. Digital photographs are analyzed by computer-assisted methods to detect the presence or absence of cancer in the early stages [5]. Analysis of medical images using computer tools supports medical professionals in interpreting the medical information inherent in the images. On the other hand, diagnosing ultrasound images using specific imaging processes such as high-intensity electromagnetic radiation requires the doctor to handle a significant quantity of data and involves thorough analysis in a short amount of time. Digital
photographs analyzed by computer-assisted methods can potentially be used to detect the presence or absence of the disease at an early stage. Early cancer detection is therefore the top priority for saving lives. Many visual examinations and manual methods are used to find and diagnose cancer in its early stages. Because human analysis of medical images requires considerable time and expertise, computerized systems for disease diagnosis have been proposed to improve the efficiency of medical image interpretation [5]. AI and machine learning (ML) have progressed rapidly in recent years, and their rise in computer vision, image processing, and computer-assisted diagnosis is eye-catching [6]. Some of these applications use traditional machine learning techniques such as Support Vector Machines (SVM), decision trees, K-Nearest Neighbour (KNN) and back propagation [7]. Figure 1 illustrates the overall relationship among AI, ML and their components. An Artificial Neural Network (ANN) has an input layer, an output layer and a number of hidden layers of neurons, as required by the application. The input layer accepts attributes in the form of input data; the weights associated with the connections are used to compute the total input to each hidden-layer node, and an activation function is then applied to obtain that node's output. This process is repeated layer after layer until it reaches the output layer, which generates the final outputs [8]. The resulting increase in prediction accuracy helps clinicians plan a subject's treatment and reduces the emotional and physical challenges caused by sickness. An important factor supporting clinical researchers is the growing number of diagnoses made using the latest AI technology. Computer engineers and health scientists can now successfully diagnose patients by using multi-factor analysis, classical logistic regression, and analysis assisted by AI.
This is made possible by theoretical and technical advancements in computer programs and statistics. These estimates are much more accurate than experimental estimates. Recently, researchers have started to develop new models to predict and detect cancer using AI. These models are crucial for increasing the precision of cancer survival and sensitivity estimates [3]. As with the management of cancer, however, detection must take place in the earliest stages of the illness; diagnosing cancer early is the most important factor in preserving the lives of many individuals [9]. Visual examination and manual techniques are typically used for this form of cancer diagnosis. Interpreting medical imagery takes a lot of effort and is quite error-prone [10]. Owing to the ambiguous nature of the symptoms, the limitations of mammography and other screening methods, and the potential for recurrence after care, detecting a cancer in its initial phases is extremely challenging [11]. High-resolution medical diagnostics in cancer investigations will therefore lead to the development of better predictive models [12]. An analysis of the literature on the identification and management of cancer shows that the application of AI approaches is expanding [13]. It has also come to light that AI techniques are more effective than conventional analysis methods such as statistical and multivariate analysis. Among AI techniques, the DL approach in particular produces excellent outcomes [14].
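The layer-by-layer computation of an ANN described earlier in this section (weighted total input, then activation, repeated up to the output layer) can be illustrated with a minimal sketch. The sigmoid activation, the bias terms, and all weight values below are assumptions chosen for illustration; they are not taken from the chapter.

```python
import math


def sigmoid(z):
    """Logistic activation (an illustrative choice of activation function)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases, activation=sigmoid):
    """Compute one layer: weighted total input per node, then activation."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        total_input = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(activation(total_input))
    return outputs


def network_forward(inputs, layers):
    """Repeat the layer computation until the output layer is reached."""
    signal = inputs
    for weights, biases in layers:
        signal = layer_forward(signal, weights, biases)
    return signal


# Hypothetical network: 3 inputs -> 2 hidden nodes -> 1 output node.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.0]),
]
output = network_forward([1.0, 0.5, -1.0], layers)
print(output)
```

Each entry of `layers` holds one weight row per node, so adding a hidden layer only means appending another `(weights, biases)` pair.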
Fig. 1 Categorization of DL neural networks
DL refers to a specific kind of neural network that has numerous hidden layers. It has recently been applied in many different industries [15] and has demonstrated particularly strong results in use cases such as voice recognition and image detection within advanced devices such as driverless cars and drones [14, 16, 17]. Fundamental classifications, including distinguishing cancerous from healthy tissue, have been carried out with models built on conventional ML techniques. Deep neural networks, on the other hand, offer a better way to use data matrices to create classification models. With these models, cancer may be identified, its progression observed and predicted, and timely and effective therapy administered [18]. DL approaches use the backpropagation algorithm to uncover fine structure in huge and frequently complex datasets. Existing techniques, such as those based on conventional machine learning, are limited in their ability to handle raw data in its native format without preprocessing [19]. Convolutional neural networks (CNN), a type of DL system, are able to learn invariant features. To build patterns for object identification tasks such as detection, segmentation, and classification, CNNs use filter banks, feature pooling layers, dropout layers, batch normalization layers, and dense layers. CNNs form a multilevel hierarchy in which the distribution of each layer's inputs varies throughout training. Preprocessed data is therefore highly desirable for achieving improved performance across tasks [20]. There
are many other CNN variations, including those with shorter connections, such as the DenseNet architecture, which significantly reduces the number of parameters needed to develop effective designs and improves feature propagation [21]. ResNet, Xception, and GoogLeNet are other CNN architectures that have proved effective recently. These networks are needed because multiscale processing is required, because performance degrades across the board as a network gets deeper, and because better topologies with fewer parameters are sought [22–25]. Another critical challenge in DL is the capability of an architecture to store information over long time periods. Long Short-Term Memory (LSTM) has been suggested as a potential remedy for this issue. Through the states of specialized units, the LSTM design enforces constant error flow while remaining local in time and space [26]. Transfer learning is another DL concept worth mentioning. It involves applying features taken from deep convolutional neural networks to new tasks. The need for this arises because the new tasks may differ significantly from the original tasks and because there may not be enough labels or inputs to train a DL architecture for them. Transfer learning also allows features to be adapted with ease so that they generalize dependably [27–29]. DL techniques utilized in cancer detection and treatment are investigated in this chapter. The purpose of the study is to demonstrate, with the help of the literature, the effectiveness of deep learning, one of the machine learning techniques, in treating a condition like cancer, as well as the methodologies and techniques that are employed and how they are applied [30].
2 Deep Learning
2.1 Basics of Deep Learning
DL has gained a lot of popularity and success in nearly every industry and has emerged as a useful tool for understanding how machines perceive the world. DL techniques are applied in fields including speech recognition, image classification, video analysis and natural language processing [31]. Analysis is performed with a mathematical model created by DL, without using any hand-designed attribute extractor. The scope for generalization is one of the key benefits of DL techniques: a learnt neural network method can be reused for additional applications and data types. When the data set is inadequate, DL performs poorly [32]. DL is a kind of machine learning approach that capitalizes on the benefits of layers of nonlinear processing units [15]. The result of the preceding layer is fed into the subsequent layer as input. In the DL approach, representations of the data are learned at multiple feature levels [33]. A hierarchy is created in the representation by deriving high-level features from low-level features. While generally based on ANN, DL techniques include more hidden layers and neurons [34]. DL techniques show excellent outcomes when processing a variety of data kinds, including text, audio, and video [35, 36]. There are several applications of DL, including information retrieval, audio and speech processing [14], multi-modal and multi-task learning, Natural Language Processing (NLP), image segmentation and image recognition [16].
2.2 Cancer Diagnosis with DL
When making a diagnosis, doctors frequently draw on their own knowledge, abilities, and experience. However talented, a doctor can never guarantee that a diagnosis is accurate, and diseases are often misdiagnosed. AI-based technologies are therefore on the agenda, because AI can evaluate vast quantities of data, resolve complicated problems, and make very accurate predictions [4]. DNNs, one of the most modern AI methods, comprise a number of computational methods that are useful for extracting information from images. DL algorithms have been used for various tasks in many medical disciplines, such as radiology and pathology. Good results have also been achieved with DL tools in tumor biology and other fields, such as medical imaging of many species [16].
2.3 Deep Neural Network Characteristics
A basic neural network consists of an input layer connected directly to the output. DNNs contain several hidden layers, which makes them effective at handling complicated problems; each layer's weights are modified using the delta learning technique. By including more hidden layers, deep neural networks can also discover complex nonlinear interactions. DNNs are employed in both unsupervised and supervised learning settings. Although learning occurs relatively slowly, good performance can be obtained, and DNNs are typically employed for classification and regression [34, 35]. In [37], lesions were identified and differentiated using a DNN and endoscopic imaging. No appreciable difference in diagnostic performance was found between the artificial intelligence system and skilled endoscopists, and the neural network approach demonstrated high accuracy in discriminating non-cancerous lesions together with high sensitivity [37]. The ability of deep neural networks to identify cancer, specifically lung cancer, in low-dose computed tomography and positron emission tomography scans was examined in [38]. The DNN algorithm was shown to give excellent results in detecting lung cancer. Their work also demonstrated that efforts to screen
for lung cancer were more successful as a result of the continued development of this technique [38].
• A DNN is a type of neural network with more than two layers and a certain level of complexity [27].
• Advanced mathematical modeling is used to gain deeper understanding, so the processing of data or features is considered to be complex.
• Pattern recognition is carried out by a neural network, which is a metaphor for the activity of the human brain [8, 20]. In particular, patterns are recognized in order to classify cells as cancerous or non-cancerous, by passing the input through successive simulated layers of neural connections [39, 40].
• Dealing with unlabeled data is the major objective of using this network, with each layer carrying out a specific type of task [11].
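Section 2.3 mentions that weights are modified with the delta learning technique. As a minimal sketch, the delta rule for a single linear unit is shown below; in a multi-layer network the same idea is applied through backpropagation. The learning rate, inputs, and target are hypothetical values chosen for illustration.

```python
def delta_rule_step(weights, inputs, target, learning_rate=0.1):
    """One delta-rule update for a single linear unit.

    Each weight moves in proportion to the output error times its input.
    """
    output = sum(w * x for w, x in zip(weights, inputs))
    error = target - output
    return [w + learning_rate * error * x for w, x in zip(weights, inputs)]


# Hypothetical training example: inputs (1.0, 2.0) should map to target 1.0.
inputs = [1.0, 2.0]
weights = [0.0, 0.0]
for _ in range(50):
    weights = delta_rule_step(weights, inputs, target=1.0)

prediction = sum(w * x for w, x in zip(weights, inputs))
print(weights, prediction)
```

With these values the error halves at every step, so the prediction approaches the target quickly; a larger learning rate could make the updates diverge.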
3 Architectures of Deep Learning Neural Networks
Based on the learning technique, the DL neural network architectures are classified into 4 categories: supervised, semi-supervised, unsupervised, and reinforcement learning [41]. Figure 1 shows how DL neural networks are categorized.
3.1 Deep Unsupervised Learning
Deep unsupervised learning architectures examine the internal representation of the data, employing a few features without the need for any labeled data. Dimensionality reduction and clustering techniques use unsupervised methods. Restricted Boltzmann Machines (RBM) and Auto-Encoders (AE) are examples of deep unsupervised learning architectures [42].
3.2 Deep Supervised Learning
Architectures for supervised learning use labeled data for training. Target results and the corresponding combinations of inputs are fed to the network [43], and what is learned in the training phase is validated during testing. Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), Convolutional Neural Networks (CNN), and gated recurrent units (GRU) are a few typical methods used under supervised learning [17].
3.3 Deep Semi-supervised Learning
Partially labeled data is used in the training phase of deep semi-supervised learning architectures. A few semi-supervised learning architectures include LSTM, RNN, Generative Adversarial Networks (GAN), GRU, and deep reinforcement learning [44].
4 Types of Deep Learning Architectures for Cancer Detection
4.1 Convolutional Neural Networks (CNN)
CNNs have been effective in the analysis of both 2D and 3D images. The majority of CNN systems are trained with a gradient-based algorithm [26]. Compared with other neural network models, there are fewer parameters to be tuned. The CNN architecture comprises both feature extraction and classification components [45]. Each feature extraction layer receives input from the layer before it and passes its output to the layer after it. Convolution, max pooling, and classification are the three types of layers that make up the CNN architecture. Even numbers are used to represent convolution layers, while odd numbers are used to represent max-pooling layers. The classification layer, the final stage of the architecture, is a fully connected layer. For higher accuracy, the architecture is trained with back propagation for classification. Max pooling, global average pooling, average pooling, and min pooling are some of the several types of pooling operations. The convolution layer convolves the data with a kernel and applies a linear or nonlinear activation function to create feature maps. The activation functions include the rectified linear, sigmoid, softmax, identity and hyperbolic tangent functions. Downsampling occurs in the pooling layer, which is also known as the subsampling layer. The number of classification layers depends on the application. Figure 2 shows the convolution neural network architecture.
Fig. 2 Architecture of a convolution neural network
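The feature extraction stages described in Sect. 4.1 (convolution with a kernel, a nonlinear activation, then max pooling) can be sketched in a few lines. The image, the edge-detecting kernel, and the 2 × 2 pooling window are illustrative assumptions; as in most DL libraries, the "convolution" here is computed without flipping the kernel.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Rectified linear activation applied element-wise."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Downsample by keeping the maximum of each size x size window."""
    rows = range(0, len(feature_map) - size + 1, size)
    cols = range(0, len(feature_map[0]) - size + 1, size)
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in cols
        ]
        for r in rows
    ]


# Hypothetical 5 x 5 image with a vertical edge between columns 1 and 2.
image = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
kernel = [[-1.0, 1.0], [-1.0, 1.0]]  # responds to dark-to-bright transitions

feature_map = relu(conv2d_valid(image, kernel))  # 4 x 4 feature map
pooled = max_pool(feature_map)                   # 2 x 2 after pooling
print(pooled)
```

The pooled map keeps the strong response at the edge while halving each spatial dimension, which is the downsampling effect attributed to the pooling layer above.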
4.2 Multi-scale Convolution Neural Network
A multi-scale convolutional neural network is created by modifying the conventional CNN [46]. It consists of three convolution layers, a rectified linear unit layer, a max-pooling layer, and two fully connected layers. The input image is downsampled, and the extracted features are sent to the multi-scale CNN.
4.3 LeNet-5
LeNet-5 is a 7-stage convolutional neural network used to classify handwritten digits. The input images are of size 32 × 32; for more complicated scenarios, a larger number of convolution layers is employed. Figure 3 shows the LeNet design, which consists of two convolutional layers, subsampling layers, and fully connected layers. Gaussian connections are used in the single output layer [47].
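The way the 32 × 32 input shrinks through LeNet-5's convolution and subsampling stages can be checked with simple arithmetic. The sketch below assumes 5 × 5 convolution kernels with stride 1 and 2 × 2 subsampling with stride 2, as in the commonly cited LeNet-5 description; these values are not stated in the text above.

```python
def output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size after a convolution or pooling stage."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# (stage name, kernel size, stride); assumed values, see the note above.
stages = [("C1", 5, 1), ("S2", 2, 2), ("C3", 5, 1), ("S4", 2, 2)]

size = 32
trace = [size]
for name, kernel, stride in stages:
    size = output_size(size, kernel, stride)
    trace.append(size)
    print(f"{name}: {size} x {size}")
```

Under these assumptions the feature maps shrink from 32 to 28, 14, 10 and finally 5 pixels per side before the fully connected layers.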
4.4 AlexNet
While AlexNet's design is similar to that of LeNet, it has more layers, more filters per layer, and stacked convolutional layers. A ReLU activation is applied after every convolutional layer and fully connected layer. It was the winning architecture in 2012, reducing the error from 26% to 15.3%. It includes data augmentation, dropout, max pooling, and ReLU activations
Fig. 3 Architecture of LeNet-5
Fig. 4 Architecture of AlexNet
in addition to 11 × 11, 5 × 5, and 3 × 3 convolutional kernels [48]. In Fig. 4, the AlexNet architecture is shown.
4.5 ZFNet
Although ZFNet's architecture was similar to AlexNet's, its settings had been fine-tuned, making it the 2013 challenge winner. The error rate was reduced to 14.8%. The number of weights is reduced by using 7 × 7 kernels rather than 11 × 11 kernels. Accuracy increases as a result of reducing the number of parameters to be tuned [49].
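The saving from smaller kernels is easy to quantify: a k × k kernel has k² weights per input channel and per filter. The following sketch compares the two kernel sizes; the channel and filter counts (3 and 96) are illustrative assumptions, not figures taken from the chapter, and biases are ignored.

```python
def kernel_weights(kernel_size, in_channels=1, filters=1):
    """Number of weights in a convolution layer (biases not counted)."""
    return kernel_size * kernel_size * in_channels * filters


per_filter_11 = kernel_weights(11)  # 121 weights per channel and filter
per_filter_7 = kernel_weights(7)    # 49 weights per channel and filter

# Hypothetical first layer: 3 input channels, 96 filters.
layer_11 = kernel_weights(11, in_channels=3, filters=96)
layer_7 = kernel_weights(7, in_channels=3, filters=96)
print(per_filter_11, per_filter_7, layer_11, layer_7)
```

Whatever the channel and filter counts, the 7 × 7 layer needs 49/121, or roughly 40%, of the weights of the 11 × 11 layer.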
4.6 GoogleNet
The GoogleNet design, whose name pays homage to LeNet, has an inception structure. It has 22 layers, and during testing the error rate decreased gradually from 6.66 to 3.66%. The architecture was the winner of ILSVRC 2014 [46]. It has a lower computational complexity than the conventional CNN architecture. Compared to other architectures like AlexNet and VGG [50], it was less frequently used. In Fig. 5, the GoogleNet architecture is shown.
4.7 VGGNet
The VGGNet, which consists of sixteen weight layers with many filters, was the runner-up in the ILSVRC 2014 classification task [39]. With this architecture, feature extraction has been
Fig. 5 Architecture of GoogleNet
Fig. 6 Architecture of VGGNet
found to be effective; however, parameter tuning is quite important. Three VGG models with 11, 16, and 19 layers were proposed: VGG-11, VGG-16, and VGG-19. All VGG models have three fully connected layers at the very end. Figure 6 shows the architecture of the VGGNet.
4.8 ResNet
ResNet, which won ILSVRC 2015, employs skip connections and batch normalization [51]. Its computational complexity is lower than that of VGGNet. The skip connections resemble the gated units used in recurrent networks. It has 152 layers in total, and the error rate is as low as 3.57%. It addresses the vanishing gradient problem. It is a traditional feed-forward NN with residual connections [52]. It consists of a number of residual blocks, whose operation depends on the architecture variant. In Fig. 7, the residual network is shown.
Fig. 7 Architecture of ResNet
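The core idea of a residual block is that its output is the learned mapping F(x) plus the unchanged input x. A minimal sketch follows; it uses small fully connected layers as a stand-in for the convolutions and omits batch normalization, and all weights are hypothetical.

```python
def relu(vector):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in vector]


def linear(vector, weights):
    """Matrix-vector product; each row of weights produces one output."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is two linear layers with a ReLU."""
    fx = linear(relu(linear(x, w1)), w2)
    return relu([f + xi for f, xi in zip(fx, x)])


zeros = [[0.0, 0.0, 0.0] for _ in range(3)]
identity = [[1.0, 0.0], [0.0, 1.0]]

# With all-zero weights F(x) = 0, so the block passes relu(x) through.
passthrough = residual_block([1.0, -2.0, 3.0], zeros, zeros)
# With identity weights F(x) = relu(x), so a positive input is doubled.
doubled = residual_block([1.0, 2.0], identity, identity)
print(passthrough, doubled)
```

Because the input is added back unchanged, a block can fall back to an identity mapping when its weights are near zero, which is one common explanation of why very deep residual networks remain trainable.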
4.9 Fully Convolutional Networks (FCNs)
In contrast to the classical CNN, the fully connected layers are replaced in the fully convolutional network by an up-sampling layer and a de-convolution layer, as shown in Fig. 8. The architecture was designed so that the up-sampling and de-convolution layers act as the reverse counterparts of the pooling and convolution layers. Adding up-sampling and de-convolution layers to the design increased its accuracy [40, 41].
4.10 U-Net
U-Net, which has two paths, was created for the segmentation of medical images. The first path is an encoder that captures the context of the image. However,
Fig. 8 Architecture of fully convolutional networks
the second path consists of transposed convolutions as well as a decoder [53, 54]. Figure 9 shows the U-Net.
4.11 Recurrent Neural Networks
Figure 10 shows the RNN's fundamental structure; various RNN design variations are described in [55]. As seen in Fig. 10, a recurrent neural network includes numerous functional blocks. An RNN operates on sequential data, and the connections among its nodes form a directed graph. It requires memory because it uses prior states as input to determine its present state. RNNs are susceptible to the vanishing gradient problem. They are used to convert input sequences into fixed-size vectors, and combining an RNN with a convolutional layer extends the effective pixel neighborhood. RNNs are used in machine translation, time series prediction, and NLP. The long short-term memory (LSTM) network is an example of an RNN [56].
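The statement that an RNN uses its prior state to determine its present state can be made concrete with a minimal recurrence. The sketch below keeps a single scalar state and uses a tanh activation; the weights are hypothetical, and a practical RNN would use vectors and learned weight matrices.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, b):
    """New state from the current input and the previous state."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def encode_sequence(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Fold a variable-length sequence into one fixed-size state."""
    h = 0.0  # initial state
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
    return h


state_a = encode_sequence([1.0, 0.0])
state_b = encode_sequence([0.0, 1.0])
print(state_a, state_b)
```

The two sequences contain the same values in a different order and yield different final states, which shows that the state carries information about the order of earlier inputs.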
4.12 Autoencoders
The autoencoder is a potent unsupervised learning architecture with three components: encoder, code, and decoder. Encoding data into a more compact representation
Fig. 9 Architecture of U-Net
is the function of the encoder. The compressed representation is therefore a distorted version of the input. The compressed input is represented by the code, the layer that sits between the encoder and the decoder and is also referred to as the bottleneck. Figure 11 shows the construction of the autoencoder. The decoder converts the code into a reconstruction of the initial input. The key characteristics of an autoencoder are that it is lossy and data-specific. Four hyperparameters, namely the code size, layer count, nodes per layer, and loss function, need to be set before training the architecture. The application areas of the autoencoder include dimension reduction, image compression, image denoising, and feature extraction [57, 58].
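The four hyperparameters listed above determine the shape of the network before any training happens. The sketch below builds the layer sizes of a symmetric autoencoder from three of them and defines mean squared error as one possible loss function; the concrete sizes (784, 128, 32) are illustrative assumptions.

```python
def autoencoder_layer_sizes(input_size, code_size, hidden_layers, nodes_per_layer):
    """Layer widths from input through the code (bottleneck) back to output."""
    encoder = [input_size] + [nodes_per_layer] * hidden_layers + [code_size]
    decoder = encoder[-2::-1]  # mirror the encoder, without repeating the code
    return encoder + decoder


def mean_squared_error(original, reconstruction):
    """Reconstruction loss: average squared difference per element."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstruction)) / len(original)


# Hypothetical configuration: 784 inputs, code size 32, two hidden layers of 128.
sizes = autoencoder_layer_sizes(784, 32, hidden_layers=2, nodes_per_layer=128)
loss = mean_squared_error([1.0, 2.0], [1.0, 4.0])
print(sizes, loss)
```

A smaller code size gives stronger compression but a lossier reconstruction, which is the trade-off behind the "lossy" characteristic mentioned above.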
4.13 Deep Belief Networks
A deep belief network consists of a Restricted Boltzmann Machine (RBM) for the pre-training phase and a feed-forward network for the fine-tuning phase. This network receives the
Fig. 10 Architecture of recurrent neural networks
features that the RBM has extracted from the input data vectors. Deep belief networks are trained with back propagation and have a slower learning rate. They also have numerous hidden layers. The primary advantage of the deep belief network is its capacity to learn higher-level features from those present in earlier layers, thanks to its layer-by-layer learning strategy [59, 60]. The architecture is shown in Fig. 12.
5 Steps for Diagnosis of Cancer by Medical Imaging
Medical imaging techniques such as MRI, CT, and ultrasound are used to evaluate the healthy function of anatomical organs and to analyze diseases [61]. Cancer diagnosis and therapy planning depend crucially on medical imaging modalities. Preprocessing, often known as filtering, is the initial step in the processing of medical images. The goal of filtering is either to eliminate image noise introduced during acquisition or to enhance image quality so that details are more accurate [62]. The term “segmentation” describes the identification of the region of interest (ROI); in the context of medical images, the ROI stands for anatomical organs or any abnormalities associated with them, such as tumors or cysts. To
Fig. 11 Architecture of autoencoders
classify cancer intensity, the classification step typically uses an ML algorithm. Compression is the process of using machine-assisted techniques to make files smaller so that they can be stored and transferred more easily. The table shows the machine learning methods that can be used in each stage of cancer diagnosis [63]. When assessing an ailment, professionals depend heavily on their first-hand observations, abilities, and experience. A doctor can never be completely sure that an assessment of the condition is entirely right, and assessments are sometimes wrong. This motivates the reliance on automated systems powered by artificial intelligence (AI), because AI can evaluate enormous volumes of information, handle complicated problems, and make accurate predictions. Deep neural networks, one of the most modern methods for AI systems, comprise a number of computational models that are useful for extracting information from digital images. DL algorithms are utilized in several medical professions [4, 16]. The steps of cancer diagnosis are as follows.
Fig. 12 Architecture of deep belief networks
5.1 Cleaning and Pre-processing
Pre-processing is the first stage of the identification process because raw images contain noise. It improves the quality of an image for later use by removing unwanted image data, known as image noise. If noise is left untreated, misclassification may occur. It is therefore crucial to clean the images properly and convert them into standard forms in order to achieve high accuracy [3].
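As a rough illustration of the cleaning step, the sketch below removes impulse noise with a median filter and then rescales intensities to a standard range. The choice of a median filter, the 3 x 3 window, and the function names median_filter and normalize are assumptions made for this example; the chapter does not prescribe a particular filter.

```python
from statistics import median


def median_filter(image, radius=1):
    """Replace each pixel by the median of its (2*radius+1)^2 window.

    Border pixels use only the part of the window inside the image.
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            out[r][c] = median(window)
    return out


def normalize(image):
    """Linearly rescale intensities to the range [0, 1]."""
    lo = min(min(row) for row in image)
    hi = max(max(row) for row in image)
    if hi == lo:  # constant image: avoid division by zero
        return [[0.0 for _ in row] for row in image]
    return [[(v - lo) / (hi - lo) for v in row] for row in image]


# Hypothetical 4 x 4 grayscale patch with two noisy pixels (255 and 200).
noisy = [
    [10, 10, 10, 10],
    [10, 255, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 200],
]
cleaned = median_filter(noisy)
scaled = normalize(noisy)
```

In this toy patch the two outliers are replaced by the surrounding background value, which is the behavior one would want before segmentation. Real pipelines would typically rely on an imaging library rather than nested lists.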
5.2 Image Segmentation
Image segmentation refers to dividing an image into distinct regions. Segmentation methods are grouped into pixel-based, region-based, model-based, and threshold-based approaches. There are also histogram thresholding, adaptive thresholding, and boundary detection approaches. These strategies are also used in combination [3, 64, 65].
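To make the threshold-based family concrete, the following sketch picks a global threshold from the intensity histogram using Otsu's method and turns it into a binary mask. Otsu's method is used here only as one well-known example of histogram thresholding; it is not named in the chapter, and the sample image and the function name otsu_threshold are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes between-class variance.

    Pixels with intensity <= t form the background class, the rest the
    foreground class. `pixels` is a flat list of integers in [0, levels).
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(levels):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # no foreground pixels left
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Hypothetical patch: dark background with a bright 2 x 2 lesion.
image = [
    [12, 15, 11, 14],
    [13, 200, 210, 12],
    [14, 205, 198, 13],
    [11, 12, 15, 14],
]
t = otsu_threshold([v for row in image for v in row])
mask = [[1 if v > t else 0 for v in row] for row in image]
```

The threshold lands between the two intensity clusters, so the mask marks only the bright region. Adaptive thresholding would instead compute a separate threshold per neighborhood, which helps when illumination varies across the image.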
5.3 Post Processing
After image segmentation, post-processing operations are applied: morphological opening and closing, island removal, region merging, border expansion, and smoothing [3].
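The opening and closing operations mentioned above can be sketched on a binary mask as follows. This is a minimal illustration assuming a 3 x 3 square structuring element and treating pixels outside the image as background; the helper names and the sample mask are invented for the example.

```python
def _window(mask, r, c):
    """Values in the 3x3 neighbourhood of (r, c); outside the image counts as 0."""
    rows, cols = len(mask), len(mask[0])
    return [
        mask[rr][cc] if 0 <= rr < rows and 0 <= cc < cols else 0
        for rr in (r - 1, r, r + 1)
        for cc in (c - 1, c, c + 1)
    ]


def erode(mask):
    """Keep a pixel only if its whole 3x3 neighbourhood is foreground."""
    return [[int(all(_window(mask, r, c))) for c in range(len(mask[0]))]
            for r in range(len(mask))]


def dilate(mask):
    """Set a pixel if any pixel in its 3x3 neighbourhood is foreground."""
    return [[int(any(_window(mask, r, c))) for c in range(len(mask[0]))]
            for r in range(len(mask))]


def opening(mask):
    """Erosion followed by dilation: removes small isolated islands."""
    return dilate(erode(mask))


def closing(mask):
    """Dilation followed by erosion: fills small holes and gaps."""
    return erode(dilate(mask))


# Hypothetical segmentation result: a 3x3 region with a one-pixel hole,
# plus a one-pixel island at (1, 1).
mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
cleaned = opening(closing(mask))
```

Closing is applied first so that the hole is filled before opening removes the island; applying opening first would erase the ring-shaped region entirely, since no 3 x 3 window fits inside it.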
6 Diagnosis of Different Types of Cancers Using DL
Table 1 lists DL architectures used for diagnosing various cancers. Neural network architectures have been extremely useful in disease detection and have contributed to research on cancers affecting different organs. In [66], a convolutional sparse autoencoder was found to be appropriate for all categories of 3-dimensional datasets. In [67], lesion identification and cancer-stage diagnosis were accomplished using CNN and handcrafted features. In another work [68], GoogLeNet was found to be the most successful, with an accuracy of 85%, compared with AlexNet at 82% and VGGNet at about 84%. Compared with a conventional predictor based on texture analysis, the model that combined a pre-trained CNN and an SVM was more successful at categorizing tumor tissue in digital mammograms [69]. Huang et al. [70] applied a DL method to breast cancer patients, using a Cox prediction model and genomic datasets to make predictions. They show that performance improves when abundant data are available and are used to integrate and simplify biomarkers and gene regulation for prediction. Shimizu and Nakayama [71] used the TCGA database to identify breast cancer genes and perform prognostic prediction. They employed AI to identify 184 genes, and then applied ML algorithms such as a random forest classifier along with DL networks. Furthermore, they built a prognostic genetic score that used just 23 of the 184 identified genes. Liu et al. [72] propose a CNN model capable of identifying small tumors in gigapixel pathology slides. The system proposed by Cruz-Roa et al. [73] identifies invasive lesions in whole-slide images while reducing human effort and time.
On breast ultrasound lesion images, CNN architectures such as LeNet, U-Net, and AlexNet, along with transfer learning, were thoroughly analyzed, and AlexNet and Patch-based LeNet were found to be the most accurate [74]. In Das et al. [75], the ROI was extracted using the watershed and Gaussian mixture model (GMM) algorithms before DNN-based tumor identification. For the segmentation of liver tumors, the FCN structure U-Net was proposed, with subsequent processing via 3D connected-component labeling to obtain better segmentation results [76]. CNNs were shown to be more accurate classifiers than classical machine learning algorithms [77]. The DNN was shown to be effective for segmenting cancerous cell growth, and it is also appropriate for segmenting small lung nodules. DNN performance improves as the amount of training data increases [78]. The convolutional neural network suggested in Golan et al. [79] is divided into two stages: the first extracts spatial features, and the second performs classification. The DL structure was used with an SVM classifier to identify lung nodules; the rule-based method reduced false positives. The new ResNet design outperforms the traditional ResNet structure for lesion segmentation [80]. Additionally, a CNN was employed to detect melanoma in conventional camera images [81]. A CNN has been proposed for detecting skin lesion borders in dermoscopy images [82]. A shallow network is used to analyze high-dimensional gene data in order to diagnose cancerous cell growth in histological images of the colon [83]. Petalidis et al. [84] published genomic data for astrocytic malignancies. To address the need for accurate classification of these cancers, they used a neural network technique to combine features from their histological subtypes. They identified 59 genes in this research and classified these variants on custom and separate data with an accuracy of 96.15%. Prostate cancer was identified in MRI images using XmasNet, a CNN-based algorithm [85], and a deep CNN applied to multiparametric MRI reached an AUC of 0.897 [86]. On the BRATS dataset, brain tumors are segmented using a deep CNN that achieves good performance through a cascaded design [87].
Table 1 Deep learning architectures for cancer diagnoses
References | Cancer type(s) | Type of data/imaging | DL architecture used | Performance metrics
[70] | Breast | Gene expression data | Multi-omics NN | Enhanced performance with more omics data
[71] | Breast | The Cancer Genome Atlas | Random forest, NN | Log-rank p < 0.05
[72] | Breast | Pathology | Convolutional neural network (CNN) | Sensitivity: 73%
[73] | Breast | Pathology | ConvNet | Positive predictive value: 71.6%
[74] | Breast | Ultrasound | AlexNet (CNN) | FPs/image: 0.16, TPF: 0.98, F-measure: 0.91
[75] | Liver | Computed tomography (CT) scan/3D | Deep neural network | Accuracy: 99.4%
[76] | Liver | Computed tomography scan | Back propagation neural network | Accuracy: 73.2%
[77] | Liver | Computed tomography scan | Convolutional neural network (CNN) | Precision: 82.67%, Dice: 80.06%, Recall: 84.34%
[78] | Lung | Computed tomography scan | Deep neural network (DNN) | Sensitivity: 78.2%, Accuracy: 82.1%, Specificity: 86.13%
[79] | Lung | Computed tomography scan | Deep neural network (DNN) | Sensitivity: 78.9%
[80] | Lung | Computed tomography scan | ResNet | Sensitivity: 0.54
[81] | Skin | Standard images from camera | Deep convolutional neural networks (DCNN) | Accuracy: 98.55, Sensitivity: 95%
[82] | Skin | Dermoscopy images | ReLU-rectified linear activation unit (CNN) | Accuracy: 86.67%
[83] | Colon | Histopathology image | Shallow neural network | Accuracy: 84%
[84] | Astrocytic tumor | Microarray gene dataset | Artificial neural network (ANN) | Accuracy: 96.15%
[85] | Prostate | Multiparametric magnetic resonance imaging (mpMRI)/3D | XmasNet (CNN) | AUC: 0.84
[86] | Prostate | Multiparametric magnetic resonance imaging (mpMRI) | Deep convolutional neural networks (DCNN) | AUC: 0.897
[87] | Brain | Magnetic resonance imaging (MRI) | Input cascade convolutional neural network | Sensitivity: 0.84, Specificity: 0.9
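The performance figures quoted in this section and in Table 1 (sensitivity, specificity, precision, accuracy, F-measure, Dice) all derive from simple counts. The sketch below shows the standard definitions; the function names and the example counts are hypothetical and are not taken from any of the cited studies.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one positive case,
    one negative case, and one positive prediction).
    """
    sensitivity = tp / (tp + fn)          # also called recall or TPF
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)            # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


def dice(pred, truth):
    """Dice overlap between two binary masks given as flat 0/1 sequences."""
    intersection = sum(p and t for p, t in zip(pred, truth))
    return 2 * intersection / (sum(pred) + sum(truth))


# Made-up counts for a 200-case test set.
m = classification_metrics(tp=80, fp=10, tn=90, fn=20)

# Made-up predicted and reference segmentation masks (flattened).
d = dice([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

For binary masks, the Dice coefficient is the same quantity as the F-measure computed over pixels, which is why segmentation papers often report Dice alongside precision and recall.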
7 Conclusions
DL has demonstrated its effectiveness in feature extraction, and this capability has improved cancer prognosis and prediction. Because of their superior feature learning, DL architectures have been widely used for cancer cell segmentation and classification and have transformed cancer diagnosis and prediction. Data augmentation has been critical for improving system performance in cancer diagnosis and prediction tasks. DL solutions are evaluated and verified in terms of reproducibility and generalizability in cancer treatment. These techniques have helped in the early detection of cancer and have contributed to patient recovery or life extension. DL-based technological innovation has started to benefit local and national medical sectors. Consequently, it is advantageous to use DL technology in cancer diagnostics and general medicine in order to gain further theoretical understanding. Researchers studying ML algorithms for disease diagnosis, as well as experts in treatment planning and delivery, have something to gain from the conclusions of this work.
References
1. Grisold, W. (Ed.) (2021). Wolfgang Grisold, Riccardo Soffietti, Stefan Oberndorfer, Guido Cavaletti (eds): Effects of cancer treatment on the nervous system.
2. Tang, J., Rangayyan, R. M., Xu, J., El Naqa, I., & Yang, Y. (2009). Computer-aided detection and diagnosis of breast cancer with mammography: Recent advances. IEEE Transactions on Information Technology in Biomedicine, 13(2), 236–251.
3. Munir, K., Elahi, H., Ayub, A., Frezza, F., & Rizzi, A. (2019). Cancer diagnosis using deep learning: A bibliographic review. Cancers, 11(9), 1235.
4. Huang, S., Yang, J., Fong, S., & Zhao, Q. (2020). Artificial intelligence in cancer diagnosis and prognosis: Opportunities and challenges. Cancer Letters, 471, 61–71.
5. Cancer Facts and Figures. (2019). American Cancer Society. https://www.cancer.org/content/dam/cancer-org/research/cancer-facts-and-statistics/annualcancerfacts-andfigures/2019/cancer-facts-and-figures-2019.pdf
6. Bhardwaj, P., Guhan, T., & Tripathy, B. K. (2021). Computational biology in the lens of CNN. In S. S. Roy, Y. H. Taguchi (Eds.), Handbook of machine learning applications for genomics (Chapter 5). Studies in Big Data. ISBN: 978-981-16-9157-7 496166_1_En
7. Tripathy, B. K., & Anuradha, J. (2015). Soft computing-advances and applications. Cengage Learning Publishers, New Delhi. ASIN: 8131526194. ISBN-10: 9788131526194.
8. Rungta, R. K., Jaiswal, P., & Tripathy, B. K. (2022). A deep learning based approach to measure confidence for virtual interviews. In A. K. Das et al. (Eds.), Proceedings of the 4th International Conference on Computational Intelligence in Pattern Recognition (CIPR), CIPR 2022 (pp. 278–291). LNNS 480.
9. Bhandari, A., Tripathy, B. K., Jawad, K., Bhatia, S., Rahmani, M. K. I., & Mash, A. (2022). Cancer detection and prediction using genetic algorithms. Comput Intell Neurosci 2022, 18. https://doi.org/10.1155/2022/1871841
10. Allahyar, A., Ubels, J., & de Ridder, J. (2019). A data-driven interactome of synergistic genes improves network-based cancer outcome prediction. PLoS Computational Biology, 15(2), e1006657.
11. Adate, A., Tripathy, B. K., Arya, D., & Shaha, A. (2020). Impact of deep neural learning on artificial intelligence research. In S. Bhattacharyya, A. E. Hassanian, S. Saha, & B. K. Tripathy (Eds.), Deep learning research and applications (pp. 69–84). De Gruyter Publications. https://doi.org/10.1515/9783110670905-004
12. Mitchell, M. J., Jain, R. K., & Langer, R. (2017). Engineering and physical sciences in oncology: Challenges and opportunities. Nature Reviews Cancer, 17(11), 659–675.
13. Obermeyer, Z., & Emanuel, E. J. (2016). Predicting the future - big data, machine learning, and clinical medicine. The New England Journal of Medicine, 375(13), 1216.
14. Graves, A., Mohamed, A. R., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 6645–6649). IEEE.
15. Bhattacharyya, D. S., Snasel, V., Hassanian, A. E., Saha, S., & Tripathy, B. K. (2020). Deep learning research with engineering applications. De Gruyter Publications. ISBN: 3110670909, 9783110670905. https://doi.org/10.1515/9783110670905
16. Bose, A., & Tripathy, B. K. (2020). Deep learning for audio signal classification. In S. Bhattacharyya, A. E. Hassanian, S. Saha, & B. K. Tripathy (Eds.), Deep learning research and applications (pp. 105–136). De Gruyter Publications. https://doi.org/10.1515/978311067090500660
17. Singhania, U., & Tripathy, B. K. (2021). Text-based image retrieval using deep learning. In Encyclopedia of information science and technology (5th edn, p. 11). https://doi.org/10.4018/978-1-7998-3479-3.ch007
18. Yagna Sai Surya, K., Geetha Rani, T., & Tripathy, B. K. (2022). Social distance monitoring and face mask detection using deep learning. In J. Nayak, H. Behera, B. Naik, S. Vimal, & D. Pelusi (Eds.), Computational intelligence in data mining (Vol. 281). Smart Innovation, Systems and Technologies. Springer, Singapore. https://doi.org/10.1007/978-981-16-9447-9_36
19. Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (pp. 448–456). PMLR.
20. Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4700–4708).
21. Kyi, C. W., Birriel, P. C., Davidsen, T. M., Ferguson, M. L., Gesuwan, P., Griner, N. B., Gerhard, D. S., et al. (2020). NCI office of cancer genomics supports multidisciplinary genomics research initiatives to advance precision oncology. Cancer Research, 80(16_Supplement), 5862–5862.
22. Pogorelov, K., Randel, K. R., Griwodz, C., Eskeland, S. L., de Lange, T., Johansen, D., Halvorsen, P., et al. (2017). Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection. In Proceedings of the 8th ACM on Multimedia Systems Conference (pp. 164–169).
23. Mesri, M., An, E., Hiltke, T., Robles, A. I., Rodriguez, H., & CPTAC Investigators. (2022). NCI’s clinical proteomic tumor analysis consortium: A proteogenomic cancer analysis program. Cancer Research, 82(12_Supplement), 6331–6331.
24. Gupta, P., Bhachawat, S., Dhyani, K., & Tripathy, B. K. (2021). A study of gene characteristics and their applications using deep learning (Chapter 4). In S. S. Roy, & Y. H. Taguchi (Eds.), Handbook of Machine Learning Applications for Genomics (Vol. 103). Studies in Big Data. ISBN: 978-981-16-9157-7, 496166_1_En.
25. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
26. Maheswari, K., Shaha, A., Arya, D., Tripathy, B. K., & Rajkumar, R. (2020). Convolutional neural networks: A bottom-up approach. In S. Bhattacharyya, A. E. Hassanian, S. Saha, & B. K. Tripathy (Eds.), Deep Learning Research with Engineering Applications (pp. 21–50). De Gruyter Publications. https://doi.org/10.1515/9783110670905-002
27. Tripathy, B. K., & Deepthi, P. H. (2015). Application of spatial FCM in detecting cancer cells. IIMT Research Network (pp. 1–6, 96–100). ISBN 878-93-82208-77-8.
28. Zhong, Z., Sun, L., & Huo, Q. (2019). An anchor-free region proposal network for Faster R-CNN-based text detection approaches. International Journal on Document Analysis and Recognition (IJDAR), 22(3), 315–327.
29. Hanefi Calp, M. (2021). Use of deep learning approaches in cancer diagnosis. In Deep Learning for Cancer Diagnosis (pp. 249–267). Springer, Singapore.
30. Karahan, Ş., & Akgül, Y. S. (2016). Eye detection by using deep learning. In 2016 24th Signal Processing and Communication Application Conference (SIU) (pp. 2145–2148). IEEE.
31. Özkan, İN. İK., & Ülker, E. (2017). Derin öğrenme ve görüntü analizinde kullanılan derin öğrenme modelleri. Gaziosmanpaşa Bilimsel Araştırma Dergisi, 6(3), 85–104.
32. Şeker, A., Diri, B., & Balık, H. H. (2017). Derin öğrenme yöntemleri ve uygulamaları hakkında bir inceleme. Gazi Mühendislik Bilimleri Dergisi, 3(3), 47–64.
33. Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends® in Machine Learning, 2(1), 1–127.
34. Tripathy, B. K., Raju, H., & Kaul, D. (2018). Deep learning in health care, accepted in deep learning for remote sensing and GIS: Frontier advancements and applications. In V. Santhi (Eds.) CRC publications
35. Ravì, D., Wong, C., Deligianni, F., Berthelot, M., Andreu-Perez, J., Lo, B., & Yang, G. Z. (2016). Deep learning for health informatics. IEEE Journal of Biomedical and Health Informatics, 21(1), 4–21.
36. Küçük, D., & Arici, N. (2018). Doğal Dil İşlemede Derin Öğrenme Uygulamalari Üzerine Bir Literatür Çalişmasi. Uluslararası Yönetim Bilişim Sistemleri ve Bilgisayar Bilimleri Dergisi, 2(2), 76–86.
37. Ohmori, M., Ishihara, R., Aoyama, K., Nakagawa, K., Iwagami, H., Matsuura, N., & Tada, T., et al. (2020). Endoscopic detection and differentiation of esophageal lesions using a deep neural network. Gastrointestinal Endoscopy, 91(2), 301–309.
38. Schwyzer, M., Ferraro, D. A., Muehlematter, U. J., Curioni-Fontecedro, A., Huellner, M. W., Von Schulthess, G. K., Kaufmann, P. A., Burger, I. A., & Messerli, M. (2018). Automated detection of lung cancer at ultralow dose PET/CT by deep neural networks–initial results. Lung Cancer, 126, 170–173.
39. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Rabinovich, A., et al. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1–9).
40. Sihare, P., Ullah Khan, A., Bardhan, P., & Tripathy, B. K. (2022). COVID-19 detection using deep learning: A comparative study of segmentation algorithms. In A. K. Das et al. (Eds.), Proceedings of the 4th International Conference on Computational Intelligence in Pattern Recognition (CIPR) (pp. 1–10), CIPR 2022, LNNS 480.
41. Dai, J., Li, Y., He, K., & Sun, J. (2016). R-fcn: Object detection via region-based fully convolutional networks. Advances in Neural Information Processing Systems, 29.
42. Raina, R., Madhavan, A., & Ng, A. Y. (2009). Large-scale deep unsupervised learning using graphics processors. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 873–880).
43. Tripathy, B. K., Dash, S., & Patro, B. N. (2012). Study of classification accuracy of microarray data for cancer classification using multivariate and hybrid feature selection method. IOSR Journal of Engineering (IOSRJEN), 2(8), 112–119. ISSN: 2250-302.
44. Adate, A., & Tripathy, B. K. (2017). Understanding single image super-resolution techniques with generative adversarial networks. Advances in Intelligent Systems and Computing. In J. Bansal, K. Das, A. Nagar, K. Deep, & A. Ojha (Eds.), Soft computing for problem solving (Vol. 816, pp. 833–840). Springer.
45. Liu, W., Wang, Z., Liu, X., Zeng, N., Liu, Y., & Alsaadi, F. E. (2017). A survey of deep neural network architectures and their applications. Neurocomputing, 234, 11–26.
46. Mustafa, H. T., Yang, J., & Zareapoor, M. (2019). Multi-scale convolutional neural network for multi-focus image fusion. Image and Vision Computing, 85, 26–35.
47. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
48. Kaul, D., Raju, H., & Tripathy, B. K. (2022). Deep learning in healthcare. In D. P. Acharjya, A. Mitra, & N. Zaman (Eds.), Deep learning in data analytics, deep learning in data analytics-recent techniques, practices and applications (Vol. 91, pp. 97–115). Studies in Big Data. Springer, Cham. https://doi.org/10.1007/978-3-030-75855-4_6
49. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. Preprint retrieved from arXiv:1409.1556.
50. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Fei-Fei, L., et al. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
51. Tripathy, B. K., Garg, N., & Nikhitha, P. (2014). Image retrieval using latent feature learning by deep architecture. In Proceedings of the IEEE ICCIC2014 (pp. 663–666).
52. Targ, S., Almeida, D., & Lyman, K. (2016). Resnet in resnet: Generalizing residual architectures. Preprint retrieved from arXiv:1603.08029.
53. Tripathy, B. K., Parikh, S., Ajay, P., & Magapu, C.: Brain MRI segmentation techniques based on CNN and its variants (Chapter-10). In J. Chaki (Ed.), Brain tumor MRI image segmentation using deep learning techniques (pp. 161–182). Elsevier publications. https://doi.org/10.1016/B978-0-323-91171-9.00001-6
54. Çiçek, Ö., Abdulkadir, A., Lienkamp, S. S., Brox, T., & Ronneberger, O. (2016). 3D U-Net: learning dense volumetric segmentation from sparse annotation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 424–432). Springer, Cham.
55. Baktha, K., & Tripathy, B. K. (2017). Investigation of recurrent neural networks in the field of sentiment analysis. In International Conference on Communication and Signal Processing (ICCSP) (pp. 2047–2050). https://doi.org/10.1109/ICCSP.2017.8286763
56. Adate, A., & Tripathy, B. K. (2019). S-LSTM-GAN: Shared recurrent neural networks with adversarial training. In A. Kulkarni, S. Satapathy, T. Kang, A. Kashan (Eds.), Proceedings of the 2nd International Conference on Data Engineering and Communication Technology (Vol. 828, pp. 107–115). Advances in Intelligent Systems and Computing. Springer, Singapore.
57. Loey, M., El-Sawy, A., & El-Bakry, H. (2017). Deep learning autoencoder approach for handwritten arabic digits recognition. Preprint retrieved from arXiv:1706.06720.
58. Thomas, S. A., Race, A. M., Steven, R. T., Gilmore, I. S., & Bunch, J. (2016). Dimensionality reduction of mass spectrometry imaging data using autoencoders. In 2016 IEEE Symposium Series on Computational Intelligence (SSCI) (pp. 1–7). IEEE.
59. Keyvanrad, M. A., & Homayounpour, M. M. (2014). A brief survey on deep belief networks and introducing a new object oriented toolbox (DeeBNet). Preprint retrieved from arXiv:1408.3264.
60. Hinton, G. E. (2009). Deep belief networks. Scholarpedia, 4(5), 5947.
61. Jeong, J. (2017). Deep learning for cancer screening in medical imaging. Hanyang Medical Reviews, 37(2), 71–76.
62. Pereira, G. C., Traughber, M., & Muzic, R. F. (2014). The role of imaging in radiation therapy planning: past, present, and future. BioMed Research International.
63. Adate, A., & Tripathy, B. K. (2018). Deep learning techniques for image processing. In S. Bhattacharyya, H. Bhaumik, A. Mukherjee, & S. De (Eds.), Machine learning for big data analysis (pp. 69–90). De Gruyter, Berlin, Boston. https://doi.org/10.1515/978311055143300357
64. Jain, S., Singhania, U., Tripathy, B., Nasr, E. A., Aboudaif, M. K., & Kamrani, A. K. (2021). Deep learning-based transfer learning for classification of skin cancer. Sensors (Basel), 21(23), 8142. https://doi.org/10.3390/s21238142
65. Tong, N., Lu, H., Ruan, X., & Yang, M. H. (2015). Salient object detection via bootstrap learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1884–1892).
66. Kallenberg, M., Petersen, K., Nielsen, M., Ng, A. Y., Diao, P., Igel, C., Lillholm, M., et al. (2016). Unsupervised deep learning applied to breast density segmentation and mammographic risk scoring. IEEE Transactions on Medical Imaging, 35(5), 1322–1331.
67. Wang, H., Roa, A. C., Basavanhally, A. N., Gilmore, H. L., Shih, N., Feldman, M., Madabhushi, A., et al. (2014). Mitosis detection in breast cancer pathology images by combining handcrafted and convolutional neural network features. Journal of Medical Imaging, 1(3), 034003.
68. Ertosun, M. G., & Rubin, D. L. (2015). Probabilistic visual search for masses within mammography images using deep learning. In 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (pp. 1310–1315). IEEE.
69. Turkki, R., Linder, N., Kovanen, P. E., Pellinen, T., & Lundin, J. (2016). Antibody-supervised deep learning for quantification of tumor-infiltrating immune cells in hematoxylin and eosin stained breast cancer samples. Journal of Pathology Informatics, 7(1), 38.
70. Huang, Z., Zhan, X., Xiang, S., Johnson, T. S., Helm, B., Yu, C. Y., Huang, K., et al. (2019). SALMON: Survival analysis learning with multi-omics neural networks on breast cancer. Frontiers in Genetics, 10, 166.
71. Shimizu, H., & Nakayama, K. I. (2019). A 23 gene–based molecular prognostic score precisely predicts overall survival of breast cancer patients. eBioMedicine, 46, 150–159.
72. Liu, Y., Gadepalli, K., Norouzi, M., Dahl, G. E., Kohlberger, T., Boyko, A., Stumpe, M. C., et al. (2017). Detecting cancer metastases on gigapixel pathology images. Preprint retrieved from arXiv:1703.02442.
73. Cruz-Roa, A., Gilmore, H., Basavanhally, A., Feldman, M., Ganesan, S., Shih, N. N., Tomaszewski, J., González, F. A., & Madabhushi, A. (2017). Accurate and reproducible invasive breast cancer detection in whole-slide images: A deep learning approach for quantifying tumor extent. Scientific Reports, 7(1), 1–14.
74. Yap, M. H., Pons, G., Marti, J., Ganau, S., Sentis, M., Zwiggelaar, R., Davison, A. K., & Marti, R. (2017). Automated breast ultrasound lesions detection using convolutional neural networks. IEEE Journal of Biomedical and Health Informatics, 22(4), 1218–1226.
75. Das, A., Acharya, U. R., Panda, S. S., & Sabut, S. (2019). Deep learning based liver cancer detection using watershed transform and Gaussian mixture model techniques. Cognitive Systems Research, 54, 165–175.
76. Devi, P., & Dabas, P. (2015). Liver tumor detection using artificial neural networks for medical images. International Journal of Innovative Research Science Technology, 2(3), 34–38.
77. Li, W. (2015). Automatic segmentation of liver tumor in CT images with deep convolutional neural networks. Journal of Computer and Communications, 3(11), 146.
78. Gruetzemacher, R., & Gupta, A. (2016). Using deep learning for pulmonary nodule detection & diagnosis.
79. Golan, R., Jacob, C., & Denzinger, J. (2016). Lung nodule detection in CT images using deep convolutional neural networks. In 2016 International Joint Conference on Neural Networks (IJCNN) (pp. 243–250). IEEE.
80. Kuan, K., Ravaut, M., Manek, G., Chen, H., Lin, J., Nazir, B., Chen, C., Howe, T. C., Zeng, Z., & Chandrasekhar, V. (2017). Deep learning for lung cancer detection: tackling the kaggle data science bowl 2017 challenge. Preprint retrieved from arXiv:1705.09435.
81. Jafari, M. H., Karimi, N., Nasr-Esfahani, E., Samavi, S., Soroushmehr, S. M. R., Ward, K., & Najarian, K. (2016). Skin lesion segmentation in clinical images using deep learning. In 2016 23rd International Conference on Pattern Recognition (ICPR) (pp. 337–342). IEEE.
82. Sabouri, P., & GholamHosseini, H. (2016). Lesion border detection using deep learning. In 2016 IEEE Congress on Evolutionary Computation (CEC) (pp. 1416–1421). IEEE.
83. Chen, H., Zhao, H., Shen, J., Zhou, R., & Zhou, Q. (2015). Supervised machine learning model for high dimensional gene data in colon cancer detection. In 2015 IEEE International Congress on Big Data (pp. 134–141). IEEE.
84. Petalidis, L. P., Oulas, A., Backlund, M., Wayland, M. T., Liu, L., Plant, K., Happerfield, L., Freeman, T. C., Poirazi, P., & Collins, V. P. (2008). Improved grading and survival prediction of human astrocytic brain tumors by artificial neural network analysis of gene expression microarray data. Molecular Cancer Therapeutics, 7(5), 1013–1024.
85. Liu, S., Zheng, H., Feng, Y., & Li, W. (2017). Prostate cancer diagnosis using deep learning with 3D multiparametric MRI. In Medical Imaging 2017: Computer-Aided Diagnosis (Vol. 10134, pp. 581–584). SPIE.
86. Tsehay, Y. K., Lay, N. S., Roth, H. R., Wang, X., Kwak, J. T., Turkbey, B. I., Pinto, P. A., Wood, B. J., & Summers, R. M. (2017). Convolutional neural network based deep-learning architecture for prostate cancer detection on multiparametric magnetic resonance images. In Medical Imaging 2017: Computer-Aided Diagnosis (Vol. 10134, pp. 20–30). SPIE.
87. Havaei, M., Davy, A., Warde, D., Biard, A., Courville, A., Bengio, Y., Pal, C., Jodoin, P. M., & Larochelle, H. (2017). Brain tumor segmentation with deep neural networks. Medical Image Analysis, 35, 18–31.