Transactions on Computer Systems and Networks
Jonah Gamba
Deep Learning Models A Practical Approach for Hands-On Professionals
Transactions on Computer Systems and Networks

Series Editor: Amlan Chakrabarti, Director and Professor, A. K. Choudhury School of Information Technology, Kolkata, West Bengal, India

Editorial Board: Jürgen Becker, Institute for Information Processing–ITIV, Karlsruhe Institute of Technology—KIT, Karlsruhe, Germany; Yu-Chen Hu, Department of Computer Science and Information Management, Providence University, Taichung City, Taiwan; Anupam Chattopadhyay, School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore; Gaurav Tribedi, EEE Department, IIT Guwahati, Guwahati, India; Sriparna Saha, Computer Science and Engineering, Indian Institute of Technology Patna, Patna, India; Saptarsi Goswami, A. K. Choudhury School of Information Technology, Kolkata, India
Transactions on Computer Systems and Networks is a unique series that aims to capture advances in the evolution of computer hardware and software systems and progress in computer networks. Computing systems in the present world span from miniature IoT nodes and embedded computing systems to large-scale cloud infrastructures, which necessitates developing systems architecture, storage infrastructure, and process management that work at various scales. Present-day networking technologies provide pervasive global coverage at scale and enable a multitude of transformative technologies. The new landscape of computing comprises self-aware autonomous systems, which are built upon a software-hardware collaborative framework. These systems are designed to execute critical and non-critical tasks involving a variety of processing resources such as multi-core CPUs, reconfigurable hardware, GPUs, and TPUs, which are managed through virtualisation, real-time process management, and fault tolerance. While AI, machine learning, and deep learning tasks are predominantly increasing in the application space, computing systems research aims at efficient means of data processing, memory management, real-time task scheduling, and scalable, secure, and energy-aware computing. The paradigm of computer networks also extends its support to this evolving application scenario through various advanced protocols, architectures, and services. This series aims to present leading works on advances in theory, design, behaviour, and applications in computing systems and networks. The Series accepts research monographs, introductory and advanced textbooks, professional books, reference works, and select conference proceedings.
Jonah Gamba Tsukuba, Ibaraki, Japan
ISSN 2730-7484 ISSN 2730-7492 (electronic) Transactions on Computer Systems and Networks ISBN 978-981-99-9671-1 ISBN 978-981-99-9672-8 (eBook) https://doi.org/10.1007/978-981-99-9672-8 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore Paper in this product is recyclable.
For our extended family, whose diversity enriches our lives and broadens our horizons And in loving memory of our departed ones, whose presence is felt in every family gathering
Preface
This book is the result of realizing the need for a practical approach to understanding deep learning models, since many existing books on the market tend to emphasize theoretical aspects, leaving newcomers and professionals seeking new solutions scrambling for effective guidelines to achieve their goals. Additionally, most available material does not take into account the important factor of rapid prototyping, where the goal is to quickly evaluate the performance of algorithms before going deep into consideration of the final implementation platforms on which the algorithms will run. The intention here is to address these problems by taking a different approach which focuses on practicality while keeping theoretical concepts to a necessary minimum. In this book, we first build the necessary foundation on deep learning models, including their current status, and progressively go into actual examples of model evaluation. A dedicated chapter is allocated to evaluating the performance of multiple algorithms on specific datasets, highlighting techniques and strategies that can address real-world challenges when deep learning is employed. By consolidating all necessary information into a single resource, readers can bypass the hassle of scouring scattered online sources, gaining a one-stop solution to dive into deep learning for object detection and classification.

To facilitate understanding, the book employs a rich array of illustrations, figures, tables, and code snippets. Comprehensive code examples are provided, empowering readers to grasp concepts quickly and develop practical solutions. The book covers essential methods and tools, ensuring a complete and comprehensive treatment that enables professionals to implement deep learning algorithms swiftly and effectively. The book is also designed to equip professionals with the necessary skills to thrive in the active field of deep learning, where it has the potential to revolutionize traditional problem-solving approaches. This book serves as a practical companion, enabling readers to grasp concepts swiftly and embark on building practical solutions. The content presented in this book is based on several years of experience in research and development. The main idea is to give a quick start to those trying to find answers within a short period of time, irrespective of background. The chapters are organized as follows:
Chapter 1 Basic Approaches in Object Detection and Classification by Deep Learning

This chapter introduces the basics of object detection and classification as target areas of deep learning. It briefly covers traditional methods such as K-nearest neighbors (KNN), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), support vector machines (SVM), random forests (RF), and gradient boosting machines. We also give an overview of deep learning within the context of artificial intelligence. With this, we aim to introduce the reader to the subject. Object classification will be the main focus of this book.

Chapter 2 Requirements for Hands-On Approach to Deep Learning

This is a bridging chapter in which we introduce some of the concepts needed to start building deep learning models in Python. The chapter starts with basic principles related to data manipulation and ends with an explanation of how to set up the modelling environment. It is expected that the reader is familiar with some high-level programming concepts, which are very easy to acquire within a short space of time. Deep learning models mostly deal with vectors and matrices as we know them from linear algebra. These objects are sometimes referred to as tensors, but from an engineering perspective they can be considered as subsets of multi-dimensional arrays, especially if one is already familiar with numerical processing tools like MATLAB, Scilab, Octave, etc. Like any other language, Python has a unique way of accessing and manipulating these arrays. This topic is concisely addressed. We also include here a discussion of some environments supported for deep learning model evaluation, both offline and online.

Chapter 3 Building Deep Learning Models

This chapter illustrates how to build deep learning models and how to train and evaluate them using the Keras framework in a simple and succinct way. We briefly explain some of the concepts behind these models so as to give the reader a smooth entry into each section, while concentrating mainly on how to use them rather than on details of the algorithms themselves. The entry point will be shallow networks, upon which the deep neural networks are developed. We then touch on convolutional neural networks (CNNs), followed by recurrent neural networks (RNNs), and finally long short-term memory (LSTM)/gated recurrent units (GRUs). Along the way, we provide examples of how each of these can be used in order to cement the ideas behind them. After that we give a quick look at the Keras library and some references for further investigation.

Chapter 4 The Building Blocks of Machine Learning and Deep Learning

In this chapter, we take a look at the three main categories of machine learning and then move on to explore how machine learning models can be evaluated. The various metrics commonly used are explained. After that, we briefly address the important topic of data preprocessing, followed by standard methods of evaluating machine learning models. One of the reasons why most models fail to perform on unseen data is the problem of overfitting. We take a look at this problem and outline some of the strategies that can be applied in order to overcome it. The next topic is a discussion of the workflow for machine learning or deep learning. The chapter ends with concluding remarks to recap the covered topics.

Chapter 5 Remote Sensing Example for Deep Learning
Recently, remote sensing has become heavily dependent on machine learning algorithms such as decision trees, random forests, support vector machines, and artificial neural networks. However, there is an increasing recognition that deep learning, which has been applied successfully in other areas such as computer vision and language processing, is a viable alternative to traditional machine learning. In this chapter, we work through one specific example of applying deep learning algorithms to one important area of remote sensing data analysis, namely land cover classification. Land cover and land use change analysis is of importance in many practical areas such as urban planning, environmental degradation monitoring, and disaster management. The main goal of this chapter is to provide a detailed understanding of the performance of various deep learning models applied to the problem of land cover classification, starting from a known dataset. Although we use remote sensing as an example, the key point here is to show the level of hyperparameter tuning that is required to get desired results from any multiclass problem to which deep learning is applied. We divide the presentation into five main parts: preliminary information on the models, including input data restrictions; exploration of the EuroSAT data contents; preprocessing steps; and performance evaluation results for several selected models. Finally, we test the performance of the models with a new dataset to get a clear picture of the limitations of the presented approach in the face of unseen data.

I hope that the material presented in the book will be valuable to all readers and enable them to move fast on employing deep learning models for various applications. Deep learning is now entering an exciting phase in which many scientists and enthusiasts are actively involved. It is also desirable that readers are able to extend the ideas covered here to their particular situations with little effort.

Tsukuba, Japan 2023
Jonah Gamba
Acknowledgements
I would like to express my gratitude to all the people who have positively contributed in various ways to the preparation of this book. First and foremost, I would like to thank my family members, Megumi, Sekai, and Mirai, for their invaluable patience during this very long process. Their understanding and accommodation made it possible for me to spare some time for putting together the material required to complete the manuscript. I would like to thank my extended family, in various places and situations, for the emotional and physical support that they have given during the period of writing this book. Special thanks go to Dr. Courage Kamusoko, Prof. Hiromi Murakami, formerly of Seikei University, and Prof. Shuji Kawasaki of Iwate University for their continuous encouragement and advice during the process of putting the book together. Let me also take this opportunity to thank the LocaSense Research Systems team for their assistance in the preparation of part of the evaluation data used in Chap. 5. I would also like to express my sincere gratitude to the Honjo Scholarship Foundation for always including old boys in their programs; their kindness made it possible for me to pursue my interest in information systems, which is the subject of this book. Last but not least, many thanks also to Smith Chae, Sivananth S. Siva Chandran, and Diya Ma of Springer for their very efficient and continuous support during the process of creating this book.

Jonah Gamba
Contents
1 Basic Approaches in Object Detection and Classification by Deep Learning
  1.1 Introduction
  1.2 Conventional Methods of Object Detection and Machine Learning
    1.2.1 K-Nearest Neighbors (KNN)
    1.2.2 Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA)
    1.2.3 Support Vector Machine (SVM)
    1.2.4 Random Forest
    1.2.5 Gradient Boosting Machines
  1.3 Deep Learning as Part of Artificial Intelligence
  1.4 Frameworks for Deep Learning
  1.5 Selection of Target Areas for This Book
  1.6 Concluding Remarks
  1.7 Self-evaluation Exercises
  References

2 Requirements for Hands-On Approach to Deep Learning
  2.1 Introduction
  2.2 Basic Python Arrays for Deep Learning
  2.3 Setting Up Environment
    2.3.1 OS Support for Offline Environments
    2.3.2 Windows Environment Creation Example
    2.3.3 Options to Consider for Online Environments
  2.4 Concluding Remarks
  2.5 Self-evaluation Exercises
  References

3 Building Deep Learning Models
  3.1 Introduction: Neural Networks Basics
    3.1.1 Shallow Networks
    3.1.2 Convolutional Neural Networks (CNNs)
    3.1.3 Recurrent Neural Networks (RNNs)
    3.1.4 Long Short-Term Memory (LSTM)/Gated Recurrent Units (GRUs)
  3.2 Using Keras as Deep Learning Framework
    3.2.1 Overview of Library
    3.2.2 Usability
  3.3 Concluding Remarks
  3.4 Self-evaluation Exercises
  References

4 The Building Blocks of Machine Learning and Deep Learning
  4.1 Introduction
  4.2 Categorization of Machine Learning
  4.3 Methods of Evaluating Machine Learning Models
    4.3.1 Data Preprocessing for Deep Learning
    4.3.2 Problem of Overfitting
  4.4 The Machine Learning Workflow
  4.5 Concluding Remarks
  4.6 Self-evaluation Exercises
  References

5 Remote Sensing Example for Deep Learning
  5.1 Introduction
  5.2 Background of the Remote Sensing Example
  5.3 Remote Sensing: Land Cover Classification
  5.4 Background of Experimental Comparison of Keras Applications Deep Learning Models Performance on EuroSAT Dataset
    5.4.1 Information on Input Data Requirements
    5.4.2 Input Restrictions (from Keras Applications Page)
    5.4.3 Training and Test Results
  5.5 Application of EuroSAT Results to Uncorrelated Dataset
    5.5.1 Evaluation of 10 Classes with Best EuroSAT Weights
    5.5.2 Training Results with 6 Classes—Unbalanced/Balanced Case
    5.5.3 Training Results with 5 Classes
  5.6 Concluding Remarks
  References
Chapter 1
Basic Approaches in Object Detection and Classification by Deep Learning
1.1 Introduction

Deep learning is a high-pace research and development topic spanning an ever-increasing number of application fields, including but not limited to text recognition, speech recognition, natural language processing, image recognition, autonomous driving, and remote sensing [1–4]. Among these applications, one of the currently trending and exciting implementations of deep learning is ChatGPT by OpenAI, which provides a platform for interactively performing queries on a wide range of subjects and getting almost human-level answers [5]. The list of applications seems endless and depends only on how problems can be transformed into machine-processible and learnable representations. Since it is practically impossible to cover every application field of deep learning in great detail, we chose to limit the scope of this book to object detection and classification, which have strong connections to computer vision. The ideas from object detection and classification can be extended to other areas with some modifications.

So, what is object detection all about? In object detection, the aim is to localize objects in an image by some predefined algorithm. The algorithms take images as input and produce identified objects together with labels, usually superimposed on the input images with a bounding box (see Figs. 1.1 and 1.2). Classification, on the other hand, is mainly concerned with assigning predefined classes to detected objects. In this respect, it is reasonable to say that detection encompasses classification. However, as we will see later, classification can stand alone as the objective of the algorithm when applied to fields like remote sensing. Remote sensing is a fascinating field of research where data provided by Google and other players finds use in several applications (see Figs. 1.3 and 1.4). In Chap. 5, we will present a comprehensive example of remote sensing classification to illustrate its practical implementation.

Our focus here will be on algorithms coming from the deep learning area. We will explore and present the practical side of state-of-the-art approaches. To be clear from the beginning, the approach that we will take here is to present concepts in a manner that will allow interested readers to start working on real-world problems.
Fig. 1.1 Flow of object detection process
Fig. 1.2 An example of multiple object detection in a driving environment
Fig. 1.3 An example of classification of remote-sensed data for building footprint detection
Fig. 1.4 An example of classification of remote-sensed land use data
Although theoretical concepts have a critical role to play in the performance of the algorithms, we will leave much of this discussion to dedicated professionals and instead concentrate on how to utilize the existing technology. The simple reason is that technology consumers normally experiment with existing methods at a very high level to confirm that they work as expected before diving deep into the theory behind them in order to improve performance. In some cases, existing algorithms work well without any modifications, thereby reducing the effort and money spent on further research. Moreover, the approach presented here is natural and will in turn allow a rapid entry into the realm of deep learning.

There is an abundance of resources in terms of data, code, and theoretical materials available on Internet platforms. However, the existence of this enormous volume of information is a two-edged sword. On one hand, it is much easier to find information on any topic of interest; on the other hand, it is harder to figure out where to start and to filter out all the clutter. Once you visit a particular site from any search engine of your choice, you are basically presented with multiple links which can be endlessly linked to other sites. To summarize, this book offers the following advantages:

Deep Understanding: It provides a more immersive and focused reading experience that allows the reader to delve deeply into a subject, offering in-depth explanations, comprehensive coverage, and a cohesive narrative that helps build a solid foundation of understanding. This depth is often missing in fragmented online search results.

Credibility and Quality: It goes through a rigorous review process, ensuring a certain level of quality, accuracy, and credibility. Referenced authors are experts in the field and have invested significant time and effort into research and writing.

Structure and Organization: It is organized in a structured manner, with chapters, sections, and an index that allows for easy navigation and reference. This makes it convenient to follow a logical progression of concepts, find specific information, and revisit previous sections.
Limited Distractions: With the material presented in the subsequent chapters, one can concentrate on the content without the distractions of ads, pop-ups, or hyperlinks. This helps maintain focus and promotes a deeper level of engagement with the material.

The above are among the reasons why you need this book: to keep your eye on the ball and strike the target without missing, in the most efficient way. Of course, it may be necessary to occasionally check some material on the Internet, but it is best to quickly come back to the main text to avoid getting lost in distractions.

So how do we begin? As mentioned above, the best way to start with deep learning is to first limit the scope of the search area. The recommended approach is to make a quick survey of the publicly available material on the subject and then choose one source that matches your final objective. For quick results, it is also instructive to take a hands-on approach where one can work through example code along the way. This makes it possible to visualize the output and make further refinements for robustness. For example, Python code can be executed to cement the ideas and to get a deeper understanding of how the concepts can be implemented. To this end, there are numerous stable, collaboratively debugged open-source packages and tools available that make it unnecessary to code algorithms from mathematical models, thereby flattening the learning curve and increasing productivity toward the intended goal, enabling rapid prototyping.
1.2 Conventional Methods of Object Detection and Machine Learning

There are a variety of machine learning algorithms that have evolved over many years of research and development. An overview of some of these algorithms can be found in [6], and a concise presentation is given in this section. The performance of these so-called shallow machine learning algorithms depends heavily on the representation of the input data they are given [4]. All machine learning algorithms are constructed around mathematical concepts that make it possible to transform input data into a form that simplifies the task of classification. After transformation, it becomes a matter of applying a logical rule to cluster the data into their respective classes. Normally a series of affine and/or nonlinear operations is applied to the input data to arrive at the final result. One example is when data becomes linearly separable by conversion from Cartesian coordinates to polar coordinates. It is important to recognize that the representation of data can make a big difference to the classification or recognition task. In the following subsections, we briefly summarize some of these conventional methods that have been found to be effective in some applications.
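To make the coordinate-transformation example concrete, the following minimal sketch (synthetic data, NumPy only; all names are illustrative) generates two concentric rings that no straight line can separate in Cartesian coordinates, yet a single threshold on the radius separates them perfectly after conversion to polar coordinates:

```python
import numpy as np

# Two concentric rings: class 0 at radius 1, class 1 at radius 3.
# In (x, y) coordinates no straight line separates them.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.concatenate([np.full(100, 1.0), np.full(100, 3.0)])
x, y = r * np.cos(theta), r * np.sin(theta)
labels = np.concatenate([np.zeros(100), np.ones(100)])

# After conversion to polar coordinates, the radius alone separates
# the classes with a simple linear threshold at r = 2.
radius = np.hypot(x, y)
predicted = (radius >= 2.0).astype(float)
print("accuracy:", np.mean(predicted == labels))  # prints 1.0
```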
1.2.1 K-Nearest Neighbors (KNN)

The K-Nearest Neighbors (KNN) algorithm is a simple and versatile supervised machine learning algorithm used for classification and regression tasks [7]. KNN is a nonparametric and instance-based learning algorithm, which means it does not make assumptions about the underlying data distribution and uses the entire training dataset for making predictions. KNN is particularly useful for tasks involving nonlinear relationships and complex decision boundaries.

KNN works on the principle that similar instances tend to have similar outcomes. Given a new data point, KNN identifies the K nearest data points (neighbors) in the training dataset and assigns the majority class, or computes the average value, of these neighbors to make a prediction. Therefore, K is the important parameter that needs to be determined in order to avoid noisy predictions. Having set the value of K, the algorithm can then proceed to use a distance metric to judge the neighborliness of given query points. Due to its simplicity, the KNN algorithm generally has low complexity. However, its performance decreases as the dimension of the feature space increases. In this respect, the algorithm is not well positioned for high-dimensional problems. In such cases, strategies such as parallelization, dimensionality reduction, and partitioning can be employed [8]. We will next take a close look at the steps followed in executing the KNN algorithm.

KNN Algorithm Steps

Several steps are involved in performing classification with the KNN algorithm, from data preparation through to making predictions on new data points. Below is a summary of the sequential steps followed for the classification task (Fig. 1.5).

Step 1: Data Preparation

Gather and preprocess a labeled dataset, which normally consists of features and their corresponding class labels. At this stage, the dataset is split into a training set and a test set for final model evaluation.

Step 2: Choosing K

Select an appropriate value for K, the number of neighbors to consider. This can be determined through techniques such as cross-validation to find the optimal K value for the given dataset. The choice of K is a trade-off between the bias and the variance of the model. Smaller values of K lead to more flexible models (low bias, high variance), while larger K values result in smoother decision boundaries (high bias, low variance).

Step 3: Distance Metric Selection

Choose a suitable distance metric (e.g., Euclidean, Mahalanobis, Manhattan, Minkowski, Chebyshev, cosine, Hamming, Jaccard, or correlation distance) to measure the similarity between data points. The choice of metric depends on the nature of the data.
Fig. 1.5 KNN processing flow
Step 4: Normalization

Normalization is necessary in order to ensure that all features contribute equally to the distance calculations, preventing features with larger magnitudes from dominating the distance computation. Commonly applied normalization methods include Min–Max scaling and Z-score scaling (a short sketch of these scaling and distance computations follows the step list below).
Step 5: Classification

For each data point in the test set, calculate the distances between the test point and all data points in the training set using the chosen distance metric. Then sort the distances in ascending order to identify the K nearest neighbors.

Step 6: Voting

The first option is to count the occurrences of each class label among the K nearest neighbors and assign the test point to the class that appears most frequently among them (majority voting). The second option is to assign different weights to neighbors based on their distance from the test point, so that closer neighbors have a higher influence on the prediction (weighted voting). In case of ties (equal occurrences of multiple class labels among the K neighbors), tie-breaking mechanisms, such as selecting the class of the closest neighbor, can be applied.

Step 7: Evaluation

After classifying all test points, evaluate the model's performance using appropriate metrics like accuracy, precision, recall, F1-score, or the confusion matrix.

Step 8: Hyperparameter Tuning

If the performance is not satisfactory, adjust hyperparameters like K or the distance metric and re-evaluate the model. This process can be repeated with multiple combinations of hyperparameters until the desired performance is achieved.

Step 9: Prediction

Once the model is tuned and evaluated, the final step is to make predictions on new, unseen data by following the same steps of distance calculation, neighbor selection, and voting. If predictions are erroneous, it may be necessary to re-examine Steps 2–8 above.

It is important to keep in mind that KNN is sensitive to noisy data, outliers, and the curse of dimensionality, just as other techniques in the same category are. Preprocessing and data cleaning steps can help alleviate some of these challenges. In summary, KNN is a straightforward algorithm that can be implemented relatively easily. As additional information for the interested reader, in Python the Scikit-learn API provides sklearn.neighbors.KNeighborsClassifier for performing KNN [9]; a minimal end-to-end usage sketch is given after Fig. 1.6. However, KNN's performance and accuracy depend heavily on parameter tuning, distance metric selection, and data preprocessing. It is also important to balance computational efficiency with model accuracy, especially when dealing with large datasets. Figure 1.6 shows an example of applying the KNN algorithm to classify a query point in the case of two classes.
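As an illustration of Steps 3 and 4, the short sketch below (NumPy only, with hypothetical feature values) applies Min–Max and Z-score scaling and then computes Euclidean and Manhattan distances between a scaled query point and the training samples:

```python
import numpy as np

# Hypothetical training data: height (cm) and weight (kg).
X = np.array([[180.0, 75.0],
              [160.0, 60.0],
              [170.0, 90.0]])

# Step 4: normalization, so that no single feature dominates the distances.
x_min, x_max = X.min(axis=0), X.max(axis=0)
X_minmax = (X - x_min) / (x_max - x_min)          # Min-Max scaling to [0, 1]
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)   # Z-score scaling

# Steps 3 and 5: distances between a (scaled) query point and all samples.
query = (np.array([175.0, 70.0]) - x_min) / (x_max - x_min)
euclidean = np.sqrt(((X_minmax - query) ** 2).sum(axis=1))
manhattan = np.abs(X_minmax - query).sum(axis=1)
print("Euclidean:", euclidean)
print("Manhattan:", manhattan)
```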
Fig. 1.6 An example of KNN applied to a query point (star), where the distance to each of the selected samples is computed
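The whole pipeline of Steps 1–9 can be sketched with scikit-learn. The fragment below is a minimal, illustrative example only (the bundled Iris dataset stands in for the reader's own data); it scales the features, selects K by five-fold cross-validation, and reports accuracy on a held-out test set:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: prepare a labeled dataset and split off a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Steps 2-4: scale the features and score several candidate K values.
best_k, best_score = None, 0.0
for k in (1, 3, 5, 7, 9, 11):
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    score = cross_val_score(pipe, X_train, y_train, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

# Steps 5-9: fit with the chosen K and predict on the held-out test set.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
model.fit(X_train, y_train)
print(f"best K = {best_k}, test accuracy = {model.score(X_test, y_test):.3f}")
```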
Merits of KNN

The KNN algorithm has several advantages that make it a popular and useful choice for certain machine learning tasks. We outline some of the key advantages of the KNN algorithm in Table 1.1.

Limitations of KNN

While KNN has several advantages, it is essential to consider its limitations, such as sensitivity to the choice of K, slow prediction times for large datasets, and the impact of irrelevant or redundant features. Proper preprocessing, parameter tuning, and validation are crucial for achieving optimal results with the KNN algorithm. The main limitations of the KNN algorithm are given in Table 1.2. To mitigate these limitations, it is important to preprocess the data, choose an appropriate K, and consider using KNN in combination with other algorithms or techniques, such as dimensionality reduction or ensemble methods. Additionally, understanding the characteristics of the dataset and the problem domain can help determine whether KNN is a suitable choice for a specific task.

Improvements to KNN Algorithm

Several improvements and variations of the KNN algorithm have been proposed to address its limitations and enhance its performance in various scenarios. One category of improvement is related to distance computation. This includes distance weighting, distance metric learning, and distance decay [10]. With distance weighting, weights are assigned to neighbors based on their distance from the query point: closer neighbors can be given higher weights, while farther neighbors receive lower weights.
Table 1.1 Merits of the KNN algorithm

Simplicity of implementation: KNN is straightforward to implement and understand. It doesn't require complex mathematical formulations or assumptions about the data distribution. This is a crucial advantage in real-time applications where algorithm resources are constrained.

Nonparametric approach: KNN is a nonparametric algorithm, meaning it imposes no assumptions on the specific functional form of the data's statistical distribution. This makes it versatile and suitable for a wide range of data patterns.

Flexibility for nonlinear data: KNN can capture complex, nonlinear relationships in the data. It's particularly useful when the decision boundary is irregular or when classes have intricate shapes.

Instance-based learning: KNN is an instance-based learning algorithm, meaning that the model learns from the specific instances in the training data instead of attempting to build a generalizable model based on abstract features. This makes it possible for KNN to handle diverse datasets, adapt to changing data, and make intuitive predictions without complex model training. This flexibility makes KNN an attractive choice for various real-world applications.

No training phase required: KNN doesn't have an explicit training phase. Once the dataset is prepared, the algorithm is ready for prediction, making it suitable for scenarios where new data arrives frequently.

Interpretability: The KNN algorithm provides easily interpretable results. Predictions are based on the actual instances in the dataset, which can help provide insights into the decision-making process.

Can handle multiclass problems: KNN can handle multiclass classification problems naturally by extending the idea of majority voting to multiple classes.

Suitable for small datasets: KNN performs well on small datasets, as it doesn't require large amounts of training data to make accurate predictions.

Robustness to noise: KNN can handle noisy data by considering multiple neighbors during prediction. Outliers or noisy instances have less impact on predictions due to the averaging effect of multiple neighbors. One other reason for noise robustness is the voting mechanism: when classifying a new data point, KNN looks at the labels of its K nearest neighbors and votes on the class. Noisy data points might have incorrect labels, but their influence is mitigated because KNN considers multiple neighbors. This means that if a couple of neighbors have incorrect labels due to noise, their impact is diluted by the correctly labeled neighbors.

No assumption of linearity: KNN doesn't assume linearity in the data, making it suitable for scenarios where relationships between features are not linear.

High recall in imbalanced classes: KNN can achieve high recall in imbalanced class distributions, as it is not biased toward any specific class and can capture minority class instances effectively.

Natural feature importance: By examining the nearest neighbors of a data point, KNN can provide insights into the importance of features for prediction.

Dynamic adaptation: KNN can adapt to changes in the data distribution without the need for retraining. As new data arrives, predictions can be updated using the existing model.

Ensemble and hybrid approaches: KNN can be used as a component in ensemble methods or hybrid models, combining its strengths with other algorithms for improved performance.
This can improve the accuracy of predictions, especially when some neighbors are more relevant than others. This approach is similar to distance decay, where a decay function reduces the influence of neighbors as their distance from the query point increases. Distance metric learning can also be applied, where the objective is to learn a customized distance metric that optimizes the neighbors' relevance for classification. Metric learning can improve the algorithm's performance when the standard distance metrics are not well suited to the data. A related approach is localized KNN, in which, instead of considering all data points equally, only a subset of neighbors closer to the query point is used. This can reduce the influence of irrelevant neighbors and improve computational efficiency. Kernel density estimation techniques can be incorporated to smooth the contribution of neighbors, which can result in more stable and robust predictions, particularly on noisy or irregular data [11]. Feature selection and dimensionality reduction can be achieved by applying techniques like Principal Component Analysis (PCA) or feature selection to reduce the dimensionality of the data before applying KNN. This can help mitigate the curse of dimensionality and improve computational efficiency [12, 13].

Other possibilities that can be explored are ensemble approaches. This involves combining predictions from multiple KNN models with different settings (e.g., different K values or distance metrics) to enhance accuracy and robustness; ensemble methods like bagging or boosting can be employed. This can be done in conjunction with approximate nearest neighbor search algorithms to accelerate the search for neighbors in high-dimensional spaces. Techniques like k-d trees or ball trees can significantly improve the algorithm's efficiency. Adaptive KNN can dynamically adjust the value of K based on the local density of data points: in regions with high data density a smaller K value can be used, while a larger K value can be employed in sparse regions [14]. Other improvements try to detect and handle outliers before applying KNN. Outliers can significantly impact the algorithm's performance, so preprocessing steps like outlier removal or outlier correction can be beneficial. Hybrid models can also be applied, such as combinations of KNN with other algorithms (decision trees, support vector machines, or neural networks), to leverage their strengths and mitigate KNN's weaknesses. Finally, localized classifiers and incremental learning have been investigated. With localized classifiers, specialized classifiers are applied in specific regions of the feature space, depending on the data characteristics. This approach can improve classification accuracy in complex or overlapping regions.
Table 1.2 Limitations of the KNN algorithm

Computational complexity: As the dataset grows, the time and memory required to make predictions using KNN can increase significantly. Searching for the nearest neighbors among a large number of data points can be computationally expensive.

High storage requirements: KNN requires storing the entire training dataset for making predictions. This can be memory-intensive, especially for large datasets with many features.

Sensitivity to feature scaling: KNN is sensitive to the scale of features. Features with larger scales can dominate the distance calculations, leading to biased results. Feature normalization or standardization is crucial before applying KNN.

Choosing an appropriate K: The choice of the K parameter significantly affects the algorithm's performance. A small K can make predictions noisy and sensitive to outliers, while a large K can lead to oversmoothing and loss of important details.

Curse of dimensionality: KNN's performance can degrade in high-dimensional spaces. As the number of dimensions increases, the distinction between nearest and non-nearest neighbors becomes less meaningful, which can lead to reduced accuracy.

Unevenly distributed data: KNN may perform poorly when the data is unevenly distributed across classes. In cases where one class significantly outweighs the others, KNN can be biased toward the majority class.

Local optima and overfitting: KNN can be prone to overfitting, especially when the K value is small. Using a larger K can help alleviate this issue, but it may lead to underfitting or oversmoothing of the decision boundary.

Outliers and noisy data: Outliers and noisy data can disproportionately influence the prediction process, especially when K is small. Robustness to outliers can be improved by using larger K values.

Distance metric choice: The choice of distance metric can significantly impact the algorithm's performance. An inappropriate distance metric can lead to inaccurate predictions. Choosing the right metric depends on the data and the problem domain.

Boundary irregularities: KNN may struggle with irregular decision boundaries or classes with intricate shapes. It's not well-suited for cases where classes overlap heavily.

Data imbalance: KNN can struggle with imbalanced class distributions, as it might favor the majority class when predicting the class of a new data point.

Lack of interpretability: While KNN provides accurate predictions, it may not provide insights into why a particular prediction was made. The algorithm lacks the interpretability of some other methods.

Efficiency in large datasets: KNN's efficiency can be compromised when working with large datasets, especially when the dimensionality is high. Approximation methods or other algorithms might be more suitable.
For incremental learning, new data points are incorporated into the model without retraining on the entire dataset, which is useful in scenarios with streaming data [15]. These improvements and variations address various challenges of the KNN algorithm and can enhance its accuracy, efficiency, and applicability to a wide range of machine learning tasks. The choice of improvement will depend on the specific characteristics of the data and the goals of the analysis.
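Several of these improvements are directly available in scikit-learn's KNeighborsClassifier. The sketch below is an illustration rather than a recommendation (the Iris dataset is again only a stand-in): it combines distance weighting with a k-d tree search, and separately chains PCA in front of KNN to counter the curse of dimensionality:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration

# Distance weighting: closer neighbors get more influence on the vote;
# a k-d tree accelerates the neighbor search in low dimensions.
weighted_knn = KNeighborsClassifier(n_neighbors=7, weights="distance",
                                    algorithm="kd_tree")
weighted_knn.fit(X, y)

# Dimensionality reduction before KNN: keep enough principal components
# to explain 95% of the variance, then classify in the reduced space.
pca_knn = make_pipeline(StandardScaler(), PCA(n_components=0.95),
                        KNeighborsClassifier(n_neighbors=5))
pca_knn.fit(X, y)

print(weighted_knn.score(X, y), pca_knn.score(X, y))
```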
1.2.2 Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA)

Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is a statistical technique used for dimensionality reduction and classification in the field of machine learning and pattern recognition. LDA aims to find a linear combination of features that maximizes the separation between classes while minimizing the variation within each class [16]. It is particularly useful for tasks such as feature extraction, data visualization, and classification.

LDA assumes that the data is normally distributed and that the classes have similar covariance matrices [17]. It takes a labeled dataset where each sample is associated with a class label. The goal is to project this high-dimensional data onto a lower-dimensional space while preserving class separability. For each class, LDA computes the mean vector (average) of the feature values and the scatter matrix (covariance matrix). The scatter matrix captures the spread of data within each class. LDA aims to maximize the distance between class means (between-class scatter) while minimizing the spread of data within each class (within-class scatter). These scatter matrices provide insights into the separability of the classes in the transformed space.

To find the optimal projection for the data, LDA performs eigenvalue decomposition on the matrix obtained by inverting the within-class scatter matrix and multiplying it with the between-class scatter matrix. The eigenvectors corresponding to the largest eigenvalues are chosen as the directions of the new feature space and are used to transform the original feature vectors into a lower-dimensional space. LDA reduces the dimensionality of the data by selecting a subset of the eigenvectors; the number of eigenvectors chosen corresponds to the desired number of dimensions in the new feature space. Typically, the number of dimensions is set to the number of classes minus one to prevent overfitting. In a classification context, the reduced-dimensional data can be fed into a classifier such as a linear classifier (e.g., logistic regression) for making predictions. In a visualization context, the reduced-dimensional data can be used to visualize the data distribution while preserving class separability.
In a nutshell, LDA provides a structured way to reduce the dimensionality of data while preserving information that is relevant for classification tasks. It is particularly effective when dealing with multiclass classification problems and can lead to improved classification accuracy by transforming the data into a space where class separability is maximized. The following steps are taken when performing classification using Linear Discriminant Analysis (Fig. 1.7).

Step 1: Data Preparation

Gather and preprocess your labeled dataset. Each data point should have a set of features and a corresponding class label. Ensure that the data satisfies the assumptions of LDA, such as normal distribution and similar covariance matrices for different classes.

Fig. 1.7 LDA processing flow
Step 2: Compute Class Means

Calculate the mean vector for each class by averaging the feature vectors of all data points belonging to that class.

Step 3: Compute Within-Class Scatter Matrix

Calculate the within-class scatter matrix (S_W). This matrix captures the spread of data points within each class and is computed by summing, over all classes, the outer products of the differences between each data point and its class mean.

Step 4: Compute Between-Class Scatter Matrix

Calculate the between-class scatter matrix (S_B), which measures the spread between different classes. It is computed by summing the outer products of the differences between the class means and the overall mean.

Step 5: Eigenvalue Decomposition

Form the matrix S_t = S_W^(-1) S_B and perform eigenvalue decomposition on it to obtain eigenvalues and eigenvectors. Sort the eigenvectors in descending order of their corresponding eigenvalues.

Step 6: Dimension Selection

Select the top k eigenvectors (where k is the number of classes minus one, or a smaller number chosen based on the desired dimensionality reduction) to form the transformation matrix W. That is, W is the matrix of eigenvectors corresponding to the k largest eigenvalues.

Step 7: Data Transformation

Transform the original data to the lower-dimensional space using the transformation matrix W, such that Y = X W, where X is the original data matrix and Y is the transformed data matrix.

Step 8: Classification

Apply a classification algorithm (e.g., logistic regression) on the reduced-dimensional data Y to perform classification.

Step 9: Model Evaluation

Split your dataset into training and testing sets. Train the classification model on the training data and evaluate its performance on the testing data using appropriate metrics.

Step 10: Predictions

Once the model is trained, you can use it to make predictions on new, unseen data by first transforming the new data using the transformation matrix W and then applying the trained classifier.
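For readers who want to see Steps 2–7 in code, here is a compact NumPy sketch written for illustration rather than numerical robustness (in practice a pseudo-inverse or regularization is safer when S_W is near-singular, and library implementations should be preferred):

```python
import numpy as np

def lda_transform(X, y, k):
    """Steps 2-7: project X onto the k most discriminative directions."""
    overall_mean = X.mean(axis=0)
    n_features = X.shape[1]
    S_W = np.zeros((n_features, n_features))  # within-class scatter
    S_B = np.zeros((n_features, n_features))  # between-class scatter
    for c in np.unique(y):
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)                     # Step 2: class mean
        S_W += (X_c - mean_c).T @ (X_c - mean_c)      # Step 3
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_B += len(X_c) * (diff @ diff.T)             # Step 4
    # Step 5: eigendecomposition of S_W^(-1) S_B.
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs[:, order[:k]].real                    # Step 6: top-k eigenvectors
    return X @ W                                      # Step 7: Y = X W

# Two synthetic 2-D Gaussian classes projected onto a single direction.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)),
               rng.normal([4, 4], 1.0, (50, 2))])
y = np.repeat([0, 1], 50)
print(lda_transform(X, y, k=1).shape)  # (100, 1)
```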
Fig. 1.8 An illustration of the operating principle of LDA
Further details of how the above steps are accomplished can be found in [4]. Additionally, in Python the Scikit-learn API provides the sklearn.discriminant_analysis.LinearDiscriminantAnalysis class for performing LDA. The basic principle of LDA is shown in Fig. 1.8.

Merits of LDA

Linear Discriminant Analysis offers several advantages, making it a valuable technique in various machine learning and pattern recognition tasks. Some of the key advantages of the LDA algorithm are listed in Table 1.3.

Limitations of LDA

Despite its advantages, LDA also has some limitations that should be considered when applying the algorithm to real-world problems. These limitations mainly stem from the underlying assumptions of the algorithm, which in some cases may not fit the data under analysis. We briefly outline some of these limitations here.

LDA makes the assumption of a Gaussian distribution, meaning that the data in each class is assumed to follow a Gaussian distribution. If this assumption is not met, the performance of LDA may degrade; in real-world datasets, the distribution of data is not always Gaussian. The assumption that the covariance matrices of the different classes are equal is also made to simplify the computation. This assumption may not hold for all datasets, especially when classes have significantly different variances or covariance structures.

LDA is also sensitive to outliers in the data. Outliers can disproportionately influence the estimation of class means and covariance matrices, leading to suboptimal results. As an example, outliers can greatly affect the calculation of class means. Since class means play a crucial role in LDA by determining the position of the decision boundary, outliers can pull the mean of a class in the direction of the outlier.
Table 1.3 Merits of the LDA algorithm

Dimensionality reduction with class separability: LDA reduces feature dimensions while retaining most of the discriminatory information, since it aims to find a lower-dimensional space that maximizes the separation between classes. This makes it useful for visualization and feature extraction.

Improved classification performance: By focusing on class separability, LDA can lead to improved classification accuracy. It reduces the "curse of dimensionality" by projecting data onto a lower-dimensional space where classes are better distinguished.

Can handle multiclass problems: LDA can handle multiclass classification problems efficiently by transforming the data into a lower-dimensional space that optimally separates different classes, even when there are more than two classes.

Reduced overfitting: By projecting data onto a lower-dimensional space, LDA reduces the complexity of the model and mitigates the risk of overfitting. This is especially helpful when the number of training samples is limited.

Utilizes class information: LDA makes use of class labels during training, which allows it to capture information about class distributions. This results in a more informed transformation that improves class separability.

Data visualization: LDA can be used for data visualization, particularly in two- or three-dimensional space. It can help in understanding the distribution of data classes and their separability, aiding in exploratory data analysis.

Complementary to other algorithms: LDA can be used in conjunction with other classification algorithms, serving as a preprocessing step to improve the quality of features fed into subsequent classifiers.

Robustness to outliers: LDA is less sensitive to outliers compared to some other techniques like principal component analysis (PCA). It focuses on maximizing the ratio of between-class variance to within-class variance, which makes it more robust.

Interpretable results: The transformation matrix computed by LDA provides insights into how the features contribute to class separability. This transparency can be valuable for understanding the importance of different features.

Low computational cost: The computational complexity of LDA is generally lower than that of more complex algorithms like support vector machines (SVMs), making it efficient for large datasets.

Well-established theory: LDA has a strong theoretical foundation, which facilitates its understanding and implementation. It has been extensively studied and applied in various fields.
This might lead to a decision boundary that does not accurately represent the actual distribution of the majority of the data. Similarly, outliers can lead to distorted within-class scatter, inaccurate between-class scatter, and loss of linearity, all of which have a negative impact on classification performance.

When the number of samples is small, LDA may overfit the data, especially if the number of features is large. In such cases, LDA can perform poorly due to the limited amount of data available for estimation. Another limitation is linear separability: LDA aims to find linear boundaries that separate classes and may struggle when classes are not linearly separable, leading to reduced classification accuracy. There are also problems associated with reduced performance on high-dimensional data. In high-dimensional feature spaces, the "curse of dimensionality" can affect the performance of LDA, because its assumptions become harder to meet as the number of features increases.

LDA's primary focus lies in transforming data into a lower-dimensional space that enhances class separability. However, it lacks an inherent mechanism for explicit feature selection. This means that while it aims to improve classification accuracy through this transformation, it does not automatically identify or eliminate irrelevant or redundant features from the dataset. This leaves the algorithm implementer with the burden of identifying the set of features suitable for the problem at hand. This shortcoming is particularly severe when the data has no obvious patterns that might give clues for feature selection.

Although LDA can be extended to address multiclass classification problems, its fundamental formulation is rooted in binary classification. Consequently, in complex scenarios involving multiple classes, the algorithm's binary origin may affect its behavior; even with techniques that expand its use to multiclass scenarios, the core design remains influenced by its binary classification heritage. The determination of the decision boundary by LDA also takes into account the prior probabilities of the different classes. This reliance on prior probabilities can be a double-edged sword. When the prior probabilities are accurate and unbiased, LDA can produce effective results. However, if these probabilities are skewed or inaccurate, it can lead to suboptimal outcomes, as the algorithm's performance is closely linked to these estimates.
it can lead to suboptimal outcomes, as the algorithm's performance is closely linked to these estimates. LDA assumes a linear relationship between the features and the class labels. If the true relationship between these elements is nonlinear, LDA might struggle to capture the underlying patterns adequately. This limitation can affect its ability to accurately classify data in cases where nonlinearity is prevalent. The LDA approach offers dimensionality reduction and feature transformation, which are valuable for improving classification accuracy. However, the transformed features generated by LDA might not be as intuitively interpretable as the original features. This reduced interpretability can hinder its application in contexts where understanding the relationships between features and classes is crucial. It is essential to carefully consider the above limitations and assess whether LDA is suitable for a particular problem. Depending on the nature of the data and the goals of the analysis, alternative techniques like Quadratic Discriminant Analysis (QDA), SVMs, or more advanced methods may be more appropriate choices. We will look at these approaches in the following sections.
Improvements to the LDA Algorithm
Several extensions and improvements to the LDA algorithm have been proposed to address its limitations. Here we provide a quick summary of these possible improvements and alternative approaches. Regularized Linear Discriminant Analysis is one way to address the shortcomings of LDA. Regularization techniques, like Ridge or LASSO regression, can be applied to LDA to address the issue of overfitting, especially when dealing with small sample sizes and high-dimensional data [18]. As will be seen in the following subsection, relaxing the assumption of equal covariance matrices for all classes by using diagonal or different-shaped covariance matrices for each class can improve performance when class covariances are not equal. In this class of solutions is Quadratic Discriminant Analysis (QDA): instead of assuming a common covariance matrix for all classes, QDA allows each class to have its own covariance matrix, which can capture more complex relationships among features. Another approach is Regularized Discriminant Analysis (RDA), a variation of LDA that uses regularization to estimate class-specific covariance matrices, addressing the issue of singularity when the sample size is small [19]. Techniques such as Ledoit–Wolf shrinkage can be employed to improve covariance matrix estimation, particularly when the number of features is large and the sample size is small. In addition, Kernel Discriminant Analysis (KDA) extends LDA to a kernelized version, enabling nonlinear separation between classes; kernel methods can increase discriminative power in complex datasets. On the other hand, Multiple Discriminant Analysis (MDA) is an extension of LDA to multiple-class problems that constructs a series of discriminant functions, providing a flexible framework for multiclass classification [20]. As usual, dimensionality reduction techniques like PCA or autoencoders can help mitigate issues with high-dimensional data.
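As a brief, hedged illustration of the shrinkage idea mentioned above, the following sketch compares plain LDA with a Ledoit–Wolf-shrunk variant using scikit-learn; the synthetic dataset and its dimensions are illustrative assumptions, not a prescription.

```python
# A minimal sketch of shrinkage-based (regularized) LDA with scikit-learn.
# The dataset here is synthetic and purely illustrative.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Few samples relative to features: the regime where plain LDA struggles
X, y = make_classification(n_samples=80, n_features=60, n_informative=10,
                           random_state=0)

plain_lda = LinearDiscriminantAnalysis(solver="lsqr")       # empirical covariance
shrunk_lda = LinearDiscriminantAnalysis(solver="lsqr",
                                        shrinkage="auto")   # Ledoit-Wolf shrinkage

print("plain LDA :", cross_val_score(plain_lda, X, y, cv=5).mean())
print("shrunk LDA:", cross_val_score(shrunk_lda, X, y, cv=5).mean())
```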
Innovative approaches combining deep learning and LDA, such as Deep Linear Discriminant Analysis (DLDA), have been proposed in the literature [21, 22]. Integrating LDA into deep learning frameworks allows for learning more complex and discriminative feature representations while retaining the benefits of LDA. Other notable methods include combining multiple LDA models, or LDA with other classification algorithms, in an ensemble to enhance classification performance and robustness, and Sparse Discriminant Analysis, which incorporates sparsity-inducing techniques to encourage feature selection and prioritize relevant features in the discriminant analysis process. More accurate estimation of class priors can also be employed to improve the performance of LDA, especially when prior probabilities are imbalanced. The above improvements and extensions address various limitations of the traditional LDA algorithm and offer more flexibility, accuracy, and robustness in various scenarios.
Quadratic Discriminant Analysis (QDA)
Quadratic Discriminant Analysis (QDA) is a statistical classification algorithm that extends the concept of Linear Discriminant Analysis (LDA) to accommodate nonlinear relationships between features and class labels. QDA is a supervised learning algorithm used for classification tasks, where the goal is to predict the class of a data point based on its features. QDA models the distribution of each class using quadratic functions, allowing it to capture more complex decision boundaries compared to LDA. We provide a detailed overview of the Quadratic Discriminant Analysis algorithm below by explaining the steps involved in the processing (Fig. 1.9).
Step 1: Data Preparation
Gather a labeled dataset with features and corresponding class labels. It is vital to ensure that the data satisfies the assumptions of QDA, including the assumption of a multivariate Gaussian distribution for each class. Data preparation can make a difference in the final model performance, so we briefly describe some of the points that need to be considered in this phase. The data cleaning process removes or corrects any missing values in the dataset; QDA, like other algorithms, requires complete data to function properly. The target is to identify and handle any outliers that could potentially skew the covariance estimates used in QDA. During data preparation, feature selection and extraction are performed. This involves evaluating the relevance of each feature to the classification task. Redundant, irrelevant, or weakly discriminative features are removed at this stage. Techniques that can be considered here include PCA, which can reduce the dimensionality of the feature space while retaining important information. It should also be ensured that each class in the dataset has sufficient representation. Highly imbalanced classes might lead to biased results; ideally, each class should have roughly equal numbers of samples if sufficient data is available. To achieve this, techniques like oversampling or undersampling may be used if necessary.
Fig. 1.9 QDA processing flow
Normalization or standardization is performed during data preparation. Depending on the nature of the features, one might need to normalize or standardize them to ensure that they are on a similar scale; most algorithms work well for data in the range 0–1. Normalization is also important for QDA's covariance calculations. If the dataset contains categorical variables, they need to be properly encoded as numerical values; techniques like one-hot encoding can be applied. Having done the necessary data processing, the next step is to decide on the train-test split for the dataset. At this point the dataset is divided into training and testing sets, normally using 80–20 or 70–30 splits. The training set is used to estimate the parameters of the QDA model, while the testing set evaluates its performance. Although not of immediate use, the appropriate evaluation metrics for the classification task can also be chosen at this stage; accuracy, precision, recall, F1-score, and the confusion matrix are common metrics used to assess QDA's performance. As a preliminary step, data visualization can be important for seeing the nature of the data at hand. Visualization helps to understand the distribution of classes, potential overlaps, and the separation between classes. One example is the commonly used Iris dataset, where simply visualizing the class histograms can already give an idea of separable and non-separable features, thereby aiding the reduction of the feature space. Visualization can help you make informed decisions
about whether QDA is suitable for your data. If the features are highly correlated, multicollinearity might impact the accuracy of covariance estimates in QDA; in this case one can consider addressing multicollinearity through feature transformation or regularization. Finally, real data does not always come in the assumed distributions and desired size; in fact, this happens more often than not. In that respect, handling non-Gaussian distributions and small sample sizes should be considered. QDA assumes that the feature distributions within each class are multivariate normal; if this assumption is violated, consider transforming your data to approach normality. Additionally, if one has a small dataset, QDA should be applied cautiously as it might lead to overfitting. Techniques like regularized discriminant analysis or dimensionality reduction can be used in such cases.
Step 2: Compute Class Statistics
Calculate the mean vector and covariance matrix for each class. These statistics provide information about the distribution of data within each class.
Quadratic Discriminant Function: For each class, QDA models the class distribution using a quadratic function. The quadratic discriminant function d_j(x) can be represented as:

$$d_j(x) = -\frac{1}{2}\log\left|\Sigma_j\right| - \frac{1}{2}\left(x - \mu_j\right)^{T}\Sigma_j^{-1}\left(x - \mu_j\right) + \log p_j,$$

$$\Sigma_j = \frac{1}{n_j}\sum_{k \in C_j}\left(x_k - \mu_j\right)\left(x_k - \mu_j\right)^{T}$$
where the index j denotes the class, and Σ_j, μ_j, and p_j are the covariance matrix, mean, and prior probability of class j, respectively; C_j is the set of training samples of class j and n_j is their number. The objective is to calculate the quadratic discriminant function for each class and assign the point x to the class with the highest discriminant score.
Step 3: Model Training
QDA does not have an explicit training phase like some other algorithms. The model parameters (mean vectors and covariance matrices) are estimated directly from the training data.
Step 4: Regularization
If the covariance matrices are ill-conditioned or if the number of training samples is small, regularization techniques can optionally be applied to stabilize the parameter estimation.
Step 5: Model Evaluation
Split the dataset into training and testing sets. Train the QDA model on the training data and evaluate its performance on the testing data using appropriate metrics (accuracy, precision, recall, F1-score, etc.).
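To make the discriminant function above concrete, here is a minimal NumPy sketch that evaluates d_j(x) from precomputed class means, covariances, and priors; the function and variable names are our own illustrative choices.

```python
# Minimal NumPy sketch of the QDA discriminant function d_j(x) above.
import numpy as np

def qda_discriminant(x, mu_j, sigma_j, prior_j):
    """Quadratic discriminant score of sample x for class j."""
    diff = x - mu_j
    _, logdet = np.linalg.slogdet(sigma_j)        # log|Sigma_j|, numerically stable
    maha = diff @ np.linalg.solve(sigma_j, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
    return -0.5 * logdet - 0.5 * maha + np.log(prior_j)

def qda_predict(x, means, covs, priors):
    """Assign x to the class with the highest discriminant score."""
    scores = [qda_discriminant(x, m, s, p) for m, s, p in zip(means, covs, priors)]
    return int(np.argmax(scores))
```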
Step 6: Hyperparameter Tuning
If needed, tune hyperparameters such as the regularization parameter to improve the model's performance.
Step 7: Prediction
Once the model is trained and evaluated, you can use it to make predictions on new, unseen data by calculating the quadratic discriminant functions and assigning the data points to the class with the highest score.
QDA is a versatile algorithm that can capture complex decision boundaries and perform well when classes have distinct covariance structures, making it particularly useful when the relationship between features and class labels is nonlinear. However, as with any algorithm, it is important to preprocess the data appropriately and to check that the assumptions of QDA, such as Gaussian class distributions and class-specific covariance matrices, are met for accurate results; these assumptions should align with the characteristics of your data. Details of how the above steps are accomplished can be found in [4]. Additionally, in Python the Scikit-learn API provides the sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis class for flexibly performing QDA, as sketched below. An example of such class separation is shown in Fig. 1.10.
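A short end-to-end sketch using the scikit-learn class mentioned above follows; the Iris data, the 80-20 split, and accuracy as the metric are illustrative choices that mirror the preceding discussion.

```python
# Sketch: QDA on the Iris dataset using the scikit-learn class mentioned above.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)  # 80-20 split, class-balanced

qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)                             # estimates means and covariances
print("test accuracy:", accuracy_score(y_test, qda.predict(X_test)))
```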
Fig. 1.10 Example of nonlinear decision boundary for the classification of a two-class problem by the QDA. Such cases cannot be handled by LDA
Merits of QDA
The Quadratic Discriminant Analysis (QDA) algorithm offers several advantages that make it a valuable tool for certain classification tasks. The key advantages are summarized in Table 1.4. QDA is particularly useful when data exhibits nonlinear relationships and varying covariance structures among classes. It can provide accurate and flexible classification in scenarios where linear classifiers might not be suitable. However, it is important to consider the assumptions of QDA, such as the Gaussian distribution and class-specific covariance matrices, to ensure its applicability to the given data.
Limitations of QDA
The Quadratic Discriminant Analysis (QDA) algorithm, while advantageous in many aspects, also has limitations that need to be considered when applying it to classification tasks. The main limitations are summarized in Table 1.5. Despite these limitations, QDA can still be a powerful tool for classification tasks, especially when data exhibits nonlinear relationships and varying covariance structures. It is important to carefully assess whether QDA is appropriate for a specific problem and ensure that the assumptions of the algorithm are met for accurate and reliable results.
Improvements of QDA
While Linear Quadratic Discriminant Analysis (LQDA) is a simplified version of Quadratic Discriminant Analysis (QDA) that assumes equal covariance matrices for all classes, there are possible improvements and variations that can enhance its performance and address its limitations. Some of these improvements are common across multiple object detection and classification methods; we summarize the common approaches here. Introducing regularization techniques to mitigate the effects of ill-conditioned covariance matrices or situations with limited data is almost standard for most classification methods, and QDA is no exception: Regularized Linear Quadratic Discriminant Analysis can stabilize parameter estimation and prevent overfitting (a brief sketch is given after this discussion). Another approach is to modify the algorithm to allow for different covariance matrices in localized regions of the feature space, which can improve accuracy by accommodating varying data distributions [23]. Implementing feature selection or dimensionality reduction techniques before applying LQDA is another common method; reducing the number of features can improve the algorithm's performance, especially in high-dimensional spaces. It is also worthwhile to consider ensemble methods: employing Bagging or Boosting with LQDA as the base classifier can enhance robustness and accuracy by combining multiple classifiers. Hybrid models can also be investigated, for example, combining LQDA with other classification algorithms, such as logistic regression, naive Bayes, or support vector machines; utilization of distance-based classifiers, like KNN, in conjunction
Table 1.4 Merits of QDA
Captures nonlinear relationships: Unlike Linear Discriminant Analysis (LDA), QDA can capture nonlinear relationships between features and class labels. This makes QDA suitable for classification problems where classes have complex decision boundaries
Flexible decision boundaries: QDA allows for more flexible decision boundaries compared to linear methods like LDA or logistic regression. It can model curved decision boundaries, enabling it to handle a wider range of data distributions
Takes into account the different covariance structures: QDA assumes class-specific covariance matrices, which means it can capture varying covariance structures within different classes. This is particularly useful when classes have distinct variations or dispersions
Makes no assumption of equal covariance matrices: Unlike LDA, QDA does not assume that all classes share the same covariance matrix. This makes QDA more robust when dealing with datasets where covariance matrices differ significantly
Can handle multimodal distributions: QDA can effectively model and differentiate between classes with multiple peaks or modes in their distributions, which might be challenging for linear classifiers
Optimal for small datasets: QDA can perform well even when the dataset is small, as it can leverage the available data to estimate class parameters more accurately
Probabilistic classification: QDA inherently provides probabilistic classification. It estimates the class membership probabilities based on the Gaussian distributions, allowing for probabilistic interpretation of predictions
Interpretability: QDA provides interpretable results, as the decision boundaries and classification probabilities are derived from explicit mathematical models
Applicable to non-Gaussian data cases: While QDA assumes Gaussian distributions, it can still perform reasonably well on data that are approximately Gaussian or have distributions close to Gaussian
Regularization may be optionally applied: QDA can be regularized to handle ill-conditioned covariance matrices or situations with limited data, improving stability and preventing overfitting
Ensemble and hybrid approaches are possible: QDA can be integrated into ensemble methods or hybrid models, combining its strengths with other algorithms for improved performance
Class-specific information modeling possible: By modeling class-specific distributions, QDA can uncover unique characteristics of each class, which might be important for interpreting and understanding the data
with LQDA to incorporate local information for classification, and exploring semi-supervised or self-training techniques [24]. Hybrid models can leverage the strengths of each algorithm to improve overall performance. Other improvements worth mentioning include Kernel Linear Quadratic Discriminant Analysis, which extends LQDA using kernel methods to allow for nonlinear decision boundaries; kernel LQDA can capture complex relationships between features and classes. Finally, the development of interpretable variations of LQDA that provide insights into the decision-making process, similar to linear models, while still capturing more complex relationships can be considered [25, 26]. As with LDA, it is important to note that while these improvements can enhance LQDA's performance, they might introduce additional complexity or computational requirements; this trade-off between complexity and performance gains always exists.
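As one concrete, hedged example of the regularization route mentioned above, scikit-learn's QDA class exposes a reg_param argument that shrinks each class covariance toward a scaled identity; the shrinkage values compared below are arbitrary illustrative choices that would normally be tuned by cross-validation.

```python
# Sketch: regularized QDA via covariance shrinkage (reg_param values illustrative).
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Small-sample, moderately high-dimensional regime where shrinkage helps
X, y = make_classification(n_samples=100, n_features=30, random_state=0)
for r in (0.0, 0.1, 0.5):                      # 0 = plain QDA, larger = more shrinkage
    qda = QuadraticDiscriminantAnalysis(reg_param=r)
    print(r, cross_val_score(qda, X, y, cv=5).mean())
```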
1.2.3 Support Vector Machine (SVM)
Due to the importance of the SVM algorithm, we first give a brief description of its historical background. The foundation for SVMs was laid by the work of Vladimir Vapnik and Alexey Chervonenkis in the late 1960s [27]. They introduced the concepts of "structural risk minimization" and the "VC dimension," which form the theoretical basis for SVMs. The concept of SVMs as we know them today was developed by Vapnik and his team at AT&T Bell Laboratories in the 1990s. In 1992, Bernhard Boser, Isabelle Guyon, and Vapnik introduced the first algorithm for training linear SVMs [17]. In 1995, Corinna Cortes and Vapnik introduced the Support Vector Classification (SVC) algorithm. The "kernel trick," a fundamental aspect of SVMs that allows nonlinear classification by mapping data into higher-dimensional spaces, was also proposed by Boser, Guyon, and Vapnik in 1992; this enabled SVMs to tackle complex classification problems. SVMs gained popularity in the early 2000s due to their strong theoretical foundations and good generalization properties. The development of SVMs was intertwined with the progress of kernel methods and machine learning in general. Researchers
Table 1.5 Limitations of QDA
Low performance for high-dimensional data: QDA can struggle with high-dimensional datasets, where the number of features is large. As the dimensionality increases, the number of parameters in the covariance matrices grows, which can lead to overfitting and computational challenges
Large sample size required: QDA may not perform well with a small number of training samples for each class. Having too few samples relative to the dimensionality of the data can lead to unreliable covariance matrix estimates
Curse of dimensionality problem: While QDA can capture nonlinear relationships, it is still susceptible to the curse of dimensionality. The performance of QDA may degrade as the number of features increases, especially when the data is sparse
Assumption of Gaussian distribution: QDA assumes that each class follows a multivariate Gaussian distribution. If the data does not adhere to this assumption, QDA may provide suboptimal results
Computational complexity: QDA involves the estimation of class-specific covariance matrices, which can be computationally intensive, especially for large datasets or datasets with many features
Sensitivity to outliers: QDA can be sensitive to outliers, as they can significantly impact the estimation of covariance matrices and, consequently, the decision boundaries
Higher variance (at the cost of reduced bias): In situations where there is limited data per class, LDA might have an advantage due to its assumption of shared covariance matrices. QDA could suffer from higher variance due to the smaller sample size
Tuning parameters: QDA may require the estimation of more parameters (covariance matrices) than Linear Discriminant Analysis (LDA), which could lead to overfitting when the training sample size is small
Limited generalization to new data: If the assumptions of Gaussian distribution and class-specific covariance matrices are not met in new data, QDA's performance may degrade
Not suitable for online learning: QDA typically requires retraining the entire model when new data arrives. This might not be efficient for scenarios where data streams in real-time
Limited interpretability: While QDA provides interpretable results, the nonlinear decision boundaries it generates might be harder to explain compared to linear methods like LDA
Limited performance on linear data: In cases where the relationships between features and classes are mostly linear, QDA might not provide significant advantages over simpler linear classifiers
introduced various enhancements, like the formulation of multiclass classification using One-vs-One and One-vs-Rest strategies, and extensions to handle regression and ranking tasks. The field of machine learning saw the emergence of deep learning methods, which led to some reduction in SVM's popularity for certain tasks, but SVMs still remain relevant for various applications. SVMs and their variants continue to be an active area of research, with efforts focused on optimization techniques, large-scale implementations, and the integration of SVMs with other machine learning approaches. Today, SVMs are widely used in many fields such as image classification, text categorization, bioinformatics, finance, and more. Their evolution from theoretical foundations to practical applications has contributed significantly to the advancement of machine learning and pattern recognition.
SVM is a powerful and versatile supervised machine learning algorithm used for classification and regression tasks [4, 7, 17, 27–38]. It is particularly effective in scenarios where the data is not linearly separable and requires finding a clear decision boundary between classes or predicting continuous values. SVM aims to find the optimal hyperplane that best separates different classes in the feature space while maximizing the margin between them. The margin is the distance between the hyperplane and the nearest data points from each class, which are called support vectors. The optimization problem involves finding the weights $w$ and bias $b$ that define the hyperplane, or decision boundary, $w^T x + b = 0$, while maximizing the margin $m = 2/\|w\|$ and minimizing the classification error. In cases where the data is not linearly separable, SVM can transform the data into a higher-dimensional space using a kernel function (e.g., polynomial, radial basis function) to create a hyperplane that can separate classes. In real-world scenarios, data might not be perfectly separable; soft margin SVM allows for some misclassification by introducing a penalty parameter C that balances between maximizing the margin and minimizing the classification error. Moreover, kernel SVM extends the algorithm to handle nonlinear classification: it uses a kernel function to implicitly map the data into a higher-dimensional space, making it possible to find a hyperplane that can separate classes. SVM can also be applied to regression tasks using support vector regression (SVR), where the goal is to predict continuous values instead of discrete classes; SVR aims to minimize the deviation of predicted values from the actual target values. Appropriate hyperparameters, such as the type of kernel, the kernel parameters, and the regularization parameter (C for C-SVM), must be selected, and cross-validation is often used to find optimal values. As with most classification algorithms, the dataset is split into training and testing sets: the SVM model is trained on the training data using the selected kernel and hyperparameters, and its performance is evaluated on the testing data using appropriate metrics (accuracy, precision, recall, F1-score for classification; RMSE, MAE for regression). Once trained, the SVM model can be used to predict the class label (classification) or continuous value (regression) of new, unseen data points.
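The following minimal sketch, on an assumed synthetic two-class dataset, fits a linear SVM with scikit-learn and reads back the quantities just introduced: the weight vector w, the bias b, the margin 2/||w||, and the support vectors.

```python
# Sketch: linear SVM, recovering w, b, and the margin 2/||w||.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)  # toy two-class data
clf = SVC(kernel="linear", C=1.0).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]   # hyperplane: w^T x + b = 0
margin = 2.0 / np.linalg.norm(w)         # geometric margin m = 2 / ||w||
print("w =", w, "b =", b)
print("margin =", margin)
print("number of support vectors:", clf.support_vectors_.shape[0])
```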
SVM is known for its ability to handle complex decision boundaries and perform well on various types of data. Its effectiveness in high-dimensional spaces and its ability to handle nonlinear relationships through kernel functions make it a popular choice in many machine learning applications. However, SVM's training time and complexity can increase with larger datasets, and the selection of appropriate kernels and hyperparameters requires careful consideration to achieve optimal results.
Steps of the SVM Algorithm
Performing classification using the SVM algorithm involves several steps, from data preparation to making predictions. Below is a step-by-step description of the flow (Fig. 1.11).
Step 1: Data Preparation
Collect and preprocess your labeled dataset, consisting of features and corresponding class labels. Ensure the data is properly scaled and normalized to prevent features with larger scales from dominating the optimization process.
Fig. 1.11 SVM processing flow
Step 2: SVM Type Selection
Decide whether you're performing binary classification (separating two classes) or multiclass classification (separating more than two classes). For binary classification, choose between the standard SVM and the soft margin SVM (C-SVM) depending on the data separability; C-SVM allows for some misclassification to find a better overall separation.
Step 3: Kernel Selection
If your data is not linearly separable in its current form, choose an appropriate kernel function to transform the data into a higher-dimensional space where it becomes separable. Kernel selection is a pivotal step in the SVM algorithm, especially when dealing with complex, nonlinearly separable data. SVMs achieve their power by implicitly transforming the original data into a higher-dimensional space through kernel functions, enabling them to find nonlinear decision boundaries. Real-world data often exhibits intricate patterns that cannot be separated linearly in the original feature space; kernels allow SVMs to map the data to a higher-dimensional space where a linear boundary can separate the transformed points effectively. The choice of kernel function determines how the data is transformed and how well the SVM captures its underlying structure. Four types of kernels are commonly used: linear, polynomial, radial basis function (RBF), and sigmoid. With linear kernels, no transformation is performed and the original feature space is used; this is suitable for linearly separable data. Polynomial kernels transform data into a higher-dimensional space using polynomial functions, which is useful for capturing moderate levels of nonlinearity; the degree parameter controls the degree of the polynomial. Radial basis function kernels, also known as Gaussian kernels, are the most widely used. They map data into an infinite-dimensional space, capturing complex nonlinear relationships; the gamma parameter determines the extent of influence of each data point. Lastly, sigmoid kernels use hyperbolic tangent functions to transform data; they can be effective, but their performance might depend heavily on the choice of parameters. Selecting the appropriate kernel depends on the characteristics of the data under analysis. If the data is not linearly separable in its original space, kernels that capture nonlinear relationships (polynomial, RBF, or sigmoid) can be considered. High-dimensional data might benefit from kernels that are more flexible in capturing complex relationships. The choice of kernel parameters (such as degree for the polynomial kernel and gamma for the RBF kernel) is critical, and proper tuning through techniques like cross-validation ensures optimal performance. Overfitting can be a problem when using highly flexible kernels like polynomial with high degrees or RBF with small gamma; regularization techniques might be needed in such cases.
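Since the best kernel is data dependent, one practical approach is to compare candidate kernels by cross-validation, as in the sketch below; the two-moons dataset and the default kernel parameters are illustrative assumptions.

```python
# Sketch: comparing SVM kernels by cross-validation on a nonlinear toy problem.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # not linearly separable
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    score = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
    print(f"{kernel:8s} mean CV accuracy: {score:.3f}")
```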
Step 4: SVM Problem Formulation
For binary classification, the goal is to find the optimal hyperplane that maximizes the margin between the two classes. This involves minimizing a cost function that accounts for misclassified points while maximizing the margin. For multiclass classification, techniques like One-vs-One (OvO) or One-vs-Rest (OvR), also called One-vs-All (OvA), are used to extend binary SVM to handle multiple classes. In the One-vs-One approach, a separate binary classifier is trained for each pair of classes; if there are N classes, this results in N(N − 1)/2 binary classifiers. During training, each classifier is trained using data from its respective pair of classes. When classifying a new data point, each classifier predicts a class label, and the class label with the majority of votes is chosen as the final prediction. OvO can handle complex decision boundaries for each pair of classes, but it can become computationally expensive for a large number of classes due to the need to train multiple classifiers. In the OvR approach, a separate binary classifier is trained for each class, treating it as the positive class and grouping all other classes as the negative class; if there are N classes, this results in N binary classifiers. During training, each classifier is trained using data from the positive class and the aggregated negative class. When classifying a new data point, each classifier predicts whether the point belongs to its positive class or not, and the class associated with the classifier that gives the highest confidence is chosen as the final prediction. OvR is more suitable for scenarios with a large number of classes since it requires fewer classifiers than OvO. In terms of computational complexity, OvR is usually more efficient than OvO when dealing with many classes, as it requires training only N classifiers compared to the N(N − 1)/2 classifiers in OvO. For predictive performance, OvO can potentially lead to more accurate results because it focuses on binary comparisons for each pair of classes, although OvR might be more balanced in terms of training data distribution. OvR is also simpler to implement, as it involves training and evaluating a set of binary classifiers independently, and it might perform better in cases of class imbalance, as it balances class distributions by treating each class as the positive class once. Therefore, the choice between OvO and OvR depends on factors like computational efficiency, predictive performance, and class distribution. Example applications of these approaches can be found in [38, 39].
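Both strategies are available in scikit-learn as generic wrappers, so the classifier counts discussed above can be verified directly; the digits dataset below is an illustrative choice with N = 10 classes.

```python
# Sketch: One-vs-One vs One-vs-Rest multiclass SVM on the 10-class digits data.
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)

# N = 10 classes: OvO trains N(N - 1)/2 = 45 classifiers, OvR trains N = 10
print("OvO classifiers:", len(ovo.estimators_))
print("OvR classifiers:", len(ovr.estimators_))
```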
Step 5: Hyperparameter Tuning
Choose the hyperparameters for the SVM, such as the regularization parameter (C in C-SVM), kernel parameters (if applicable), and other settings specific to the chosen SVM variant. Then, perform cross-validation to find the best combination of hyperparameters that yields the highest performance on validation data. SVMs have hyperparameters that are not learned from the data itself but are set before training the model. Essential hyperparameters include the regularization parameter (often denoted C in C-SVM), the choice of kernel (linear, polynomial, radial basis function, etc.), and the associated kernel-specific parameters (e.g., degree for the polynomial kernel, gamma for the RBF kernel). The selection of hyperparameters can dramatically impact the SVM's ability to generalize well to new, unseen data. An incorrect choice of hyperparameters can lead to overfitting (when the model fits the training data too closely but doesn't perform well on new data) or underfitting (when the model is too simplistic to capture the underlying patterns). To determine the best combination of hyperparameters for your SVM, you typically use cross-validation, which involves splitting your training data into multiple subsets or folds. You train the SVM on several combinations of hyperparameters and evaluate its performance on different folds, which helps you understand how well the SVM generalizes to unseen data under various hyperparameter settings. One common approach for hyperparameter tuning is grid search. In grid search, you define a range of possible values for each hyperparameter, and the algorithm tries every possible combination of these values. For each combination, you train the SVM using cross-validation and measure its performance; the combination of hyperparameters that yields the best validation performance is selected as the optimal set. Grid search can be computationally expensive, especially when dealing with multiple hyperparameters or large datasets. Random search is an alternative where you randomly sample from the hyperparameter space, and Bayesian optimization is another approach that uses probabilistic models to choose the next set of hyperparameters to evaluate based on past performance. The regularization parameter C controls the trade-off between maximizing the margin and minimizing the classification error; larger C values result in a smaller margin and potentially more training data points within it. Kernel parameters like gamma in the RBF kernel influence the flexibility of the decision boundary. These parameters require careful tuning to prevent overfitting or underfitting.
Step 6: Training the SVM
Train the SVM model on the training data using the chosen hyperparameters and kernel. During training, the SVM optimizer adjusts the weights and bias of the hyperplane to create the optimal decision boundary that separates the classes.
Step 7: Model Evaluation
Evaluate the trained SVM model on a separate testing dataset to assess its performance. Use appropriate evaluation metrics such as accuracy, precision, recall, F1-score, or ROC curves to measure the model's effectiveness.
Step 8: Fine-Tuning (Optional)
If the performance is not satisfactory, one can optionally go back to hyperparameter tuning, try different kernels, or consider adjusting the dataset to improve results.
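A minimal grid-search sketch for the C and gamma parameters discussed under Step 5 follows; the log-spaced grid values are conventional starting points, not recommendations.

```python
# Sketch: tuning C and gamma for an RBF-kernel SVM with grid search.
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
param_grid = {"C": [0.1, 1, 10, 100],        # margin/error trade-off
              "gamma": [1e-4, 1e-3, 1e-2]}   # RBF kernel width
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)                             # exhaustive search with 5-fold CV
print("best params:", search.best_params_)
print("best CV accuracy:", search.best_score_)
```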
Step 9: Prediction
Once the model's performance becomes satisfactory, one can use it to make predictions on new, unseen data points. Apply the same preprocessing steps (scaling, normalization) to the new data before making predictions.
Step 10: Model Interpretation
Depending on the kernel used, SVM might offer insights into feature importance, allowing one to understand which features contribute most to the classification decision. This step can be optionally performed.
Step 11: Deployment
Deploy the trained SVM model into production environments for making real-time predictions on new data.
SVM is a versatile algorithm that can handle a variety of classification tasks, from linear to nonlinear and binary to multiclass. Its effectiveness relies on proper data preprocessing, kernel selection, and hyperparameter tuning to achieve optimal performance. The application of SVM to linear and nonlinear cases is illustrated in Figs. 1.12 and 1.13. More details of how the above steps are accomplished can be found in [4]. Additionally, in Python the Scikit-learn API provides the sklearn.svm module for handling the multiple variations of the SVM algorithm.
Merits of SVM Algorithm
The SVM algorithm offers several advantages that contribute to its popularity and effectiveness in various machine learning applications. Some of the key advantages are summarized in Table 1.6.
Fig. 1.12 This is a scenario where two classes can be separated by linear hyperplanes. In the illustration, white circles represent one class (Class 1), while solid circles represent another class (Class 2). The optimal hyperplane that distinctly separates these classes is depicted as a bold black line. Circles aligned along the dashed lines are known as support vectors, having a significant influence on defining the hyperplane
Fig. 1.13 In this case, the classes are separable by a nonlinear hyperplane
Kernel Trick for Dimensionality Reduction: The kernel trick can also be used for dimensionality reduction, which can be useful when dealing with high-dimensional data. SVM's combination of flexibility, generalization capability, and robustness makes it a valuable tool in various domains, such as image classification, text categorization, bioinformatics, and more. However, it is important to fine-tune hyperparameters and choose the appropriate kernel for each problem to achieve optimal results.
Limitations of SVM
While the SVM algorithm offers numerous advantages, it also comes with some limitations and challenges that need to be considered when applying it to different machine learning tasks. The main limitations are summarized in Table 1.7. While SVM is a powerful algorithm with wide-ranging applications, it is important to be aware of its limitations and carefully consider whether it is the appropriate choice for a specific problem. Addressing these limitations may involve using techniques like feature engineering, kernel tuning, and model evaluation to ensure optimal performance.
Improvements to the SVM Algorithm
Several improvements and variations have been proposed to enhance the performance and address some limitations of the SVM algorithm. We give some of the possible improvements and extensions here. Automated kernel selection strategies can be developed to choose the most suitable kernel for a given dataset; this could involve exploring various kernels and measuring their impact on cross-validation performance. Additionally, combining SVM with Stochastic Gradient Descent (SGD) optimization to improve
Table 1.6 Merits of SVM algorithm
Effective in high-dimensional feature spaces: SVM performs well even in high-dimensional feature spaces, making it suitable for complex data that might be difficult to separate using linear methods
Nonlinearity handling: SVM can efficiently handle nonlinear relationships between features and classes through the use of kernel functions, enabling it to capture complex decision boundaries
Robust generalization: SVM aims to maximize the margin between classes, which helps it to generalize well to new, unseen data and reduces overfitting
Flexibility in the choice of kernels: The choice of kernel functions (linear, polynomial, RBF, sigmoid, etc.) allows SVM to be adapted to different types of data and problem domains, enhancing its versatility
Global optimization achievable: SVM's objective function aims to find the hyperplane that maximizes the margin, resulting in a globally optimal solution rather than getting stuck in local minima
Robustness to overfitting: By minimizing the classification error and maximizing the margin, SVM is less prone to overfitting, even when the number of features is greater than the number of samples
Effective in small datasets: SVM can perform well with small datasets, as it focuses on the most informative points (support vectors) rather than relying on the entire dataset
Insensitivity to irrelevant features: SVM is relatively insensitive to irrelevant features, focusing on the most discriminative features that contribute to separating the classes
Regularization: In the case of the soft margin SVM, the hyperparameter C controls the trade-off between achieving a large margin and allowing some misclassification. This built-in regularization prevents excessive model complexity
Handling unbalanced data: SVM can handle class imbalance by assigning different weights to classes or by utilizing techniques such as cost-sensitive learning
Interpretability (linear kernel): With the linear kernel, SVM can provide insights into feature importance, helping to understand the contributions of different features to the classification decision
Well-studied theory: SVM is built on solid mathematical principles, and its theoretical foundations are well-established, making it easier to understand, analyze, and implement
Support for different problems: SVM can be applied to both classification and regression tasks, making it a versatile tool for a wide range of machine learning problems
Consistency in high-dimensional data: SVM's ability to maximize margins helps maintain classification consistency, even when the number of features is much larger than the sample size
scalability and speed up training, especially for large datasets, is one possible approach [40]. Recently, incremental learning has been a subject of investigation: creation of incremental or online SVM algorithms that can adapt to new data without retraining the entire model could be effective, which is particularly useful for real-time or streaming data scenarios [41]. Applying advanced regularization techniques to SVM to handle noisy data and improve generalization leads to improved performance; techniques like L1 regularization or Elastic Net can help with feature selection and reduce model complexity [42]. Another approach is the SVM ensemble approach, which involves building ensemble models using multiple SVM classifiers to enhance performance; techniques like Bagging or Boosting can combine multiple SVMs to achieve better generalization [43]. Hybrid models combine SVM with other algorithms, such as decision trees or neural networks, to leverage their strengths and create models with improved performance. For multiclass SVM, the idea is to develop specialized algorithms for multiclass classification that go beyond the One-vs-One and One-vs-Rest approaches; hierarchical classification or direct optimization methods could be explored [44]. For kernel learning, investigating methods for automatically learning the optimal kernel from the data, potentially using unsupervised learning techniques to uncover meaningful transformations, has a chance of improving performance. In the case of imbalanced data, techniques to adapt SVM for imbalanced class distributions can be considered, such as cost-sensitive learning, adjusting class weights, or generating synthetic samples for the minority class. Designing interpretable kernels that provide insights into the decision boundary and feature importance, making SVM results easier to understand, and extending SVM to handle structured data like graphs or sequences by incorporating domain-specific similarity measures or defining custom kernel functions are other possible avenues to follow. Scalability improvements are also important for SVM: development of parallel and distributed versions of SVM algorithms can accelerate training and improve scalability on distributed computing frameworks. Other approaches include multilabel classification, which extends SVM to handle multilabel problems where instances can belong to multiple classes
Table 1.7 Limitations of SVM
High computational complexity for large datasets: For large datasets, training an SVM can be computationally intensive. The time complexity can become prohibitive as the dataset size increases
Sensitivity to noise: SVM is sensitive to noisy data, especially outliers. Outliers can have a significant impact on the position of the hyperplane and the margin, potentially leading to overfitting
High memory resource usage: SVM models can require significant memory, especially when dealing with large datasets and high-dimensional feature spaces
Choice of kernel function affects performance: The choice of the kernel function can greatly affect the performance of the SVM. Selecting an inappropriate kernel for the data can lead to suboptimal results
Difficult to train with large datasets: SVM's training time increases significantly with the number of data points, making it less suitable for very large datasets
Limited interpretability (nonlinear kernels): While linear SVMs provide interpretable feature importance, nonlinear kernels (e.g., polynomial, RBF) can make the model's decision boundary difficult to understand and visualize
Model selection and tuning difficult: Selecting the appropriate kernel and hyperparameters can be challenging. Improper choices may lead to poor generalization or overfitting
No probabilistic output for interpretation of results: SVM does not provide direct probabilistic outputs like some other algorithms (e.g., logistic regression), making it less straightforward to interpret classification confidence
Not suitable for noisy labels: Performance of SVM degrades in situations with mislabeled data or uncertain class assignments
Limited to binary and multiclass classification: While SVM is primarily designed for binary classification, it can be extended to multiclass classification using techniques like One-vs-One (OvO) or One-vs-Rest (OvR). However, direct support for multilabel classification is limited
Impact of class imbalance: In scenarios with imbalanced class distributions, SVM may have difficulty capturing the minority class adequately, affecting the overall model performance
Feature scaling necessary to ensure scale-invariance: SVM is sensitive to the scale of features, so proper feature scaling is necessary to prevent features with larger scales from dominating the optimization process
Lack of robustness in feature selection: SVM does not inherently perform feature selection. Feature selection should be performed separately, and the choice of features can impact SVM's performance
Domain knowledge required for kernel selection: Choosing the right kernel requires domain knowledge and experimentation, which can be time-consuming and may not always lead to optimal results
Limited applicability to non-Euclidean data: SVM assumes a Euclidean space, which may not be suitable for all types of data (e.g., structured data with graph-like relationships)
simultaneously, and kernel approximation techniques that reduce the computational burden of SVM training while maintaining reasonable accuracy. Finally, combining SVM with feature selection techniques to identify the most relevant features (improving both model efficiency and generalization), incorporating semi-supervised learning into SVM, and exploring ways to integrate SVM with deep learning techniques to create hybrid models that benefit from both SVM's robustness and deep learning's feature representation capabilities are further directions [45]. These improvements and extensions showcase the ongoing research and development efforts aimed at enhancing the capabilities and addressing the challenges of the SVM algorithm in various contexts. The choice of improvement depends on the specific problem, the dataset characteristics, and the computational resources available.
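As a hedged sketch of the SGD route mentioned above, scikit-learn's SGDClassifier with the hinge loss trains a linear SVM incrementally and scales to large datasets; the dataset size and the alpha value below are illustrative, not tuned.

```python
# Sketch: large-scale linear SVM trained by stochastic gradient descent.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=50_000, n_features=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# loss="hinge" gives a linear SVM objective; feature scaling matters for SGD
svm_sgd = make_pipeline(StandardScaler(),
                        SGDClassifier(loss="hinge", alpha=1e-4, random_state=0))
svm_sgd.fit(X_train, y_train)
print("test accuracy:", svm_sgd.score(X_test, y_test))
```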
1.2.4 Random Forest
Random forest is a decision tree approach that has been used to successfully solve many shallow machine learning tasks. It was the top algorithm of choice until the mid-2010s, when another decision tree-based algorithm, gradient boosting machines, took over. A random forest is composed of a large ensemble of decision trees that perform prediction tasks individually. The results of these decision trees are then combined into the final result by some form of voting, meaning that the class with the most votes becomes the output. Figure 1.14 illustrates how the random forest algorithm works in principle [33]. This approach works well because the individual decision trees are uncorrelated, so prediction errors from some models can be covered by correct results in the majority of the decision trees. Random forests offer several advantages over decision trees:
Improved accuracy: Random forests generally provide higher accuracy compared to individual decision trees. By aggregating the predictions of multiple decision trees,
Fig. 1.14 Illustration of random forest algorithm where the majority determines the final predicted class
the ensemble approach reduces overfitting and variance, resulting in more robust and accurate predictions.
Reduced overfitting: Decision trees tend to overfit the training data, capturing noise and specific patterns that may not generalize well to unseen data. Random forests mitigate overfitting by using random subsets of the data and features for each tree, reducing the risk of memorizing noise.
Robustness: Random forests are less sensitive to outliers and noisy data points compared to single decision trees. The averaging of multiple trees reduces the impact of individual noisy predictions, leading to more robust models.
Feature importance: Random forests can assess the importance of features in the model, providing insights into which features are most influential in making predictions. This information is valuable for feature selection and understanding the underlying data patterns.
Parallelism: Random forests can be easily parallelized, making them efficient for training on large datasets and taking advantage of multicore processors or distributed computing.
No need for feature scaling: Random forests are not sensitive to the scale of features. Unlike some algorithms that require feature scaling, random forests can handle features of different scales without impacting performance.
Handling missing data: Random forests can handle missing data without requiring imputation. Missing values can be efficiently dealt with during the tree-building process.
Versatility: Random forests can be used for both classification and regression tasks, making them a versatile choice for various machine learning problems.
In short, the ensemble nature of random forests, where multiple decision trees are combined, leads to more accurate, robust, and stable models compared to individual decision trees.
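A minimal random forest sketch in scikit-learn, including the feature-importance readout discussed above, is given below; the forest size and the dataset are illustrative choices.

```python
# Sketch: random forest with majority voting and feature importances.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)  # 100-tree ensemble
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
print("feature importances:", forest.feature_importances_)  # per-feature influence
```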
1.2.5 Gradient Boosting Machines
As stated in the preceding subsection, the recent champion among decision tree algorithms is the gradient boosting machine. The variant called XGBoost (extreme gradient boosting) gained prominence after winning a Kaggle competition. Gradient boosting relies on boosting, where weak learners are converted into stronger learners. In this technique, the gradient descent method is applied to the loss function to determine the model parameters [34]. XGBoost is a powerful and widely used algorithm for both regression and classification tasks, known for its speed, scalability, and high predictive performance. Gradient boosting is an ensemble learning technique that combines multiple weak learners (usually decision trees) to create a strong predictive model. It builds the models sequentially, with each new model attempting to correct the errors made by the previous ones. XGBoost extends the traditional gradient boosting algorithm by introducing several enhancements, which contribute to its effectiveness and efficiency:
Regularization: Includes L1 (Lasso) and L2 (Ridge) regularization terms in the objective function, which helps prevent overfitting and improve model generalization.
Tree pruning: Uses a depth-first approach to build decision trees and prunes branches that contribute little to the overall model's performance. This helps reduce the complexity of the model and enhance its efficiency.
Weighted quantile sketch: Employs an optimized data structure called the "weighted quantile sketch" to efficiently handle data summary statistics during tree construction, improving the speed of the algorithm.
Handling missing values: Automatically handles missing data during tree construction, eliminating the need for explicit data imputation.
Cross-validation: Includes built-in cross-validation capabilities to assess model performance and tune hyperparameters effectively.
Parallel processing: Can be parallelized, taking advantage of multicore processors and distributed computing environments, making it highly efficient for large datasets.
Due to these optimizations and improvements, XGBoost has gained significant popularity in machine learning competitions, real-world applications, and academic research. It is often regarded as one of the most powerful and versatile algorithms in the gradient boosting family.
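The sketch below, which assumes the separate xgboost package is installed, shows how the regularization and depth controls described above appear as hyperparameters; the values are illustrative, not tuned.

```python
# Sketch: gradient boosting with XGBoost (assumes `pip install xgboost`).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(n_estimators=200,   # number of boosting rounds
                      max_depth=4,        # tree depth (pruning control)
                      learning_rate=0.1,  # shrinkage per round
                      reg_lambda=1.0,     # L2 (Ridge) regularization
                      reg_alpha=0.0)      # L1 (Lasso) regularization
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```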
The early boosting variants include AdaBoost (Adaptive Boosting), which has been extensively employed to solve classification problems [35]. For conventional machine learning approaches where the underlying problem is not vision related, gradient boosting can be considered the best choice at this point in time.
1.3 Deep Learning as Part of Artificial Intelligence
Here we give a concise background of deep learning and its origins. Although deep learning has been intensively under the spotlight in recent years, it has been in the literature for a long time under different terminology [36, 37, 46, 47]. In fact, all machine learning algorithms can be broadly classified under the artificial intelligence umbrella. The field of artificial intelligence is a superset of both machine learning and deep learning, while deep learning is in turn a subset of machine learning. This can be visualized as shown in Fig. 1.15. Artificial intelligence has several definitions, but the general consensus is the desire to make machines have some level of human intelligence. While the Britannica encyclopedia defines human intelligence as the mental quality that consists of the abilities to learn from experience, adapt to new situations, understand and handle abstract concepts, and use knowledge to manipulate one's environment, we can all agree that artificial intelligence is still far from achieving this level of ability. The biggest limitations remain adaptation to new situations and handling abstract concepts. To some extent, machines are able to learn certain patterns and manipulate the environment. Given the above outstanding hurdles, computer scientists and engineers define artificial intelligence as the ability of computer systems to perform intelligent tasks. Some notable examples of these tasks include computer vision, natural language processing, text processing, and pattern recognition.
Fig. 1.15 Relationship between artificial intelligence, machine learning, and deep learning
By definition, machine learning (ML) is concerned with the study of computer algorithms and statistical models that can accomplish intelligent tasks. These algorithms can be broadly categorized into supervised learning, semi-supervised learning, unsupervised learning, and reinforcement learning. Supervised and semi-supervised learning can be combined into one category, thus effectively resulting in three categories. Reinforcement learning differs from supervised and unsupervised learning in that it does not rely on labeled or unlabeled examples of correct behavior; it is interactive and tries to maximize a reward signal, as opposed to finding hidden structures, which is the basis of unsupervised learning [48]. On the other hand, deep learning has roots in artificial neural networks, which in turn were modeled based on inspiration from human neurons or perceptrons [49], although it would be an oversimplification to say that neurons operate like artificial neural networks. For object detection and classification, we have come a long way in formulating very useful algorithms, up to and including deep learning. Some of the popular deep learning algorithms that have been successfully used in solving practical problems include region proposals (Region-Based Convolutional Neural Networks (R-CNN), Fast R-CNN, Faster R-CNN) [50], You Only Look Once (YOLO) [51], deformable convolutional networks [52], the Refinement Neural Network for Object Detection (RefineDet) [53], RetinaNet [54, 55], and many others. The number of algorithms keeps growing rapidly, but the CNN has proven to be the most widely used network architecture so far. The VGG16 architecture [56], which is built on the CNN, is one example.
1.4 Frameworks for Deep Learning

There are three main competing frameworks for implementing and evaluating deep learning algorithms, namely Keras (https://keras.io/), TensorFlow (https://www.tensorflow.org/), and PyTorch (https://pytorch.org/). Keras and TensorFlow can be viewed as complementary frameworks, which in practice reduces the choice to two. Each framework has its own pros and cons in terms of usability and performance, so choices can be made on a need basis. While Keras offers a quick start by hiding most of the programmatic details of TensorFlow, PyTorch takes the user one level deeper into Python's strengths. So, for a quick start, Keras would be the way to go, with the option to venture into PyTorch at some point. Therefore, in this book we will be building all the examples on the Keras framework.
1.5 Selection of Target Areas for This Book

This book is mainly focused on the application of deep learning to the classification of objects, targeting remote sensing data as a representative example. However, the algorithms described here are not limited to this area of application as they are
generic in nature and can be flexibly extended to general object detection tasks such as text recognition and the detection and recognition of objects in autonomous driving environments.
1.6 Concluding Remarks

Wrapping up what we have learnt so far, we briefly introduced conventional methods of object detection and machine learning principles. We also touched on deep learning to understand its roots as part of artificial intelligence, a field which actually began in the early 1950s. Finally, we ended by presenting the deep learning frameworks, among which Keras was chosen as the basis for building the application examples in the rest of the book.
1.7 Self-evaluation Exercises

1. What is the difference between object detection and object classification? How can deep learning be used to solve both of these tasks?
2. Explain the difference between support vector machines, random forests, and gradient boosting. What are some advantages and disadvantages of each approach?
3. How can convolutional neural networks (CNNs) be used for object detection and classification? Describe the architecture of a typical CNN-based object detection system.
4. Investigate object detection methods and explain their strengths and weaknesses. What are bounding boxes used for in object detection? How are they used to improve the accuracy of object detection models?
5. What is transfer learning, and how can it be used for object detection and classification? Give an example of how a pretrained model could be fine-tuned for a specific object detection task.
References 1. Francois C (2018) Deep learning with Python. Manning Publications Co. 2. Jiang X, Hadid A, Pang Y, Granger E, Feng X (2019) Deep learning in object detection and recognition, 1 edn. Springer 3. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press 4. Gamba J (2020) Radar signal processing for autonomous driving. Springer 5. ChatGPT. https://chat.openai.com/ 6. Gamba J (2020) Radar signal processing for autonomous Driving. Springer, Berlin/Heidelberg, Germany
7. Cortes C, Vapnik V (1995) Support-vector network. Mach Learn 20(3):273–297 8. Ukey N et al (2023) Survey on exact kNN queries over high-dimensional data space. Sensors 23(2):629. https://doi.org/10.3390/s23020629 9. scikit-learn. https://scikit-learn.org 10. Zhang S, Li J (2023) KNN classification with one-step computation. In: IEEE Trans Knowl Data Eng 35(3):2711–2723. https://doi.org/10.1109/TKDE.2021.3119140 11. Zhao P, Lai L (2022) Analysis of KNN density estimation. IEEE Trans Inf Theory 68(12):7971– 7995. https://doi.org/10.1109/TIT.2022.3195870 12. Liu Y, Chen H, Wang B (2020) DOA estimation of underwater acoustic signals based on PCAkNN algorithm. In: 2020 international conference on computer information and Big Data applications (CIBDA), Guiyang, China, 2020, pp 486–490. https://doi.org/10.1109/CIBDA5 0819.2020.00115 13. Rashid NEA, Nor YAIM, Sharif KKM, Khan ZI, Zakaria NA (2021) Hand gesture recognition using continuous wave (CW) radar based on hybrid PCA-KNN. In: 2021 IEEE symposium on wireless technology & applications (ISWTA), Shah Alam, Malaysia, 2021, pp 88–92. https:// doi.org/10.1109/ISWTA52208.2021.9587404 14. Zheng X et al (2021) Adaptive nearest neighbor machine translation. https://arxiv.org/abs/2105. 13022 15. Zhang J, Wang T, Ng WWY, Pedrycz W, KNNENS: a k-nearest neighbor ensemble-based method for incremental learning under data stream with emerging new classes. IEEE Trans Neural Netw Learn Syst. https://doi.org/10.1109/TNNLS.2022.3149991 16. Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugen 7:179–188 17. Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Fifth annual workshop on computational learning theory. ACM, pp 144–152 18. Li C-N, Li Y, Meng Y-H, Ren P-W, Shao Y-H (2023) L2,1 -Norm regularized robust and sparse linear discriminant analysis via an alternating direction method of multipliers. IEEE Access 11:34250–34259. https://doi.org/10.1109/ACCESS.2023.3264688 19. Dai D-Q, Yuen PC (2007) Face recognition by regularized discriminant analysis. IEEE Trans Syst, Man, Cybern, Part B (Cybernetics), 37(4):1080–1085. https://doi.org/10.1109/TSMCB. 2007.895363 20. Duda R, Hart P, Stork D (2001) Pattern classification, 2nd edn. New York 21. Lu W (2022) Regularized deep linear discriminant analysis. https://arxiv.org/abs/2105.07129 22. Chang C-C (2023) Fisher’s linear discriminant analysis with space-folding operations. In: IEEE Trans Pattern Anal Mach Intell 45(7):9233–9240. https://doi.org/10.1109/TPAMI.2022. 3233572 23. Elkhalil K, Kammoun A, Couillet R, Al-Naffouri TY, Alouini M-S (2020) A large dimensional study of regularized discriminant analysis. IEEE Trans Signal Process 68:2464–2479. https:// doi.org/10.1109/TSP.2020.2984160 24. Cai D, He X, Han J (2007) Semi-supervised discriminant analysis. In: 2007 IEEE 11th International conference on computer vision, Rio de Janeiro, Brazil, 2007, pp 1–7. https://doi.org/ 10.1109/ICCV.2007.4408856 25. Wang J, Plataniotis KN, Lu J, Venetsanopoulos AN (2008) Kernel quadratic discriminant analysis for small sample size problem. Pattern Recogn 41(5):1528–1538 26. P¸ekalska E, Haasdonk B (2009) Kernel discriminant analysis for positive definite and indefinite kernels. IEEE Trans Pattern Anal Mach Intell 31(6):1017–1032. https://doi.org/10.1109/ TPAMI.2008.290 27. Vapnik VN (1998) Statistical learning theory. Wiley, New York 28. Huang Z, Lee BG (2004) Combining non-parametric models for multisource predictive forest mapping. 
Photogramm Eng Remote Sens 70:415–425 29. Vapnik VN (1998) The nature of statistical learning theory. Wiley, New York 30. Camps-Valls G, Bruzzone L (2005) Kernel-based methods for hyperspectral image classification. IEEE Trans Geosci Remote Sens 43(6):1351–1362
31. Bruzzone L, Persello C (2009) A novel context-sensitive semi-supervised SVM classifier robust to mislabeled training samples. IEEE Trans Geosci Remote Sens 47(7) 32. Burges CJC (1998) A tutorial on support vector machines for pattern recognition. Kluwer Academic Publishers, Boston, pp 1–43 33. Breiman L (2001) Random forests. Mach Learn 45:5–32 34. Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. In: KDD‘16: proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining August, 2016, pp 785–794 35. Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55:119–139 36. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507 37. Hinton GE, Osindero S, Teh Y (2006) A fast learning algorithm for deep belief nets. Neural Comput 18 38. Kamusoko C, Gamba J (2014) Mapping woodland cover in the Miombo ecosystem: a comparison of machine learning classifiers. Land 3:524–540 39. Schultheis E, Babbar R (2021) Speeding-up one-vs-all training for extreme classification via smart initialization. https://arxiv.org/abs/2109.13122 40. Abeykoon VL, Fox GC, Kim M (2019) Performance optimization on model synchronization in parallel stochastic gradient descent based SVM. In: 2019 19th IEEE/ACM international symposium on cluster, cloud and grid computing (CCGRID), Larnaca, Cyprus, 2019, pp 508– 517. https://doi.org/10.1109/CCGRID.2019.00065 41. Pesala V, Kalakanti AK, Paul T, Ueno K, Kesarwani A, Bugata HGSP (2019) Incremental learning of SVM using backward elimination and forward selection of support vectors. In: 2019 International conference on applied machine learning (ICAML), Bhubaneswar, India, 2019, pp 9–14. https://doi.org/10.1109/ICAML48257.2019.00010 42. Xie L, Luo Y, Su S-F, Wei H (2023) Graph regularized structured output SVM for early expression detection with online extension. IEEE Trans Cybern 53(3):1419–1431. https://doi. org/10.1109/TCYB.2021.3108143 43. Cao Y, Sun Y, Li P, Su S, Vibration-based fault diagnosis for railway point machines using multi-domain features, ensemble feature selection and SVM. IEEE Trans Veh Technol. https:// doi.org/10.1109/TVT.2023.3305603 44. Liu H, Yu Z, Shum CK, Man Q, Wang B (2023) A new hierarchical multiplication and spectral mixing method for quantification of forest coverage changes using Gaofen (GF)-1 imagery in Zhejiang Province, China. IEEE Trans Geosci Remote Sens 61:1–10, Art no. 4407210. https:// doi.org/10.1109/TGRS.2023.3303078 45. Su Y, Li X, Yao J, Dong C, Wang Y (2023) A spectral–spatial feature rotation-based ensemble method for imbalanced hyperspectral image classification. IEEE Trans Geosci Remote Sens, 61:1–18, Art no. 5515918. https://doi.org/10.1109/TGRS.2023.3282064 46. Furukawa H (2018) Deep learning for end-to-end automatic target recognition from synthetic aperture radar imagery. IEICE Tech Rep 117(403):35–40, SANE 2017-92 47. Angelov A, Robertson A, Murray-Smith R, Fioranelli F (2018) Practical classification of different moving targets using automotive radar and deep neural networks. IET Radar, Sonar Navig 12(10):1082–1089 48. Sutton RS, Barto AG (2018) Reinforcement learning: an introduction, 2nd edn. MIT Press 49. Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press Inc., New York 50. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. 
In: 2014 IEEE Conference on computer vision and pattern recognition, 2014, pp 580–587. https://doi.org/10.1109/CVPR.2014.81 51. Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: 2016 IEEE Conference on computer vision and pattern recognition (CVPR), 2016, pp 779–788. https://doi.org/10.1109/CVPR.2016.91
52. Dai J et al (2017) Deformable convolutional networks. In: 2017 IEEE International conference on computer vision (ICCV), 2017, pp 764–773. https://doi.org/10.1109/ICCV.2017.89 53. Zhang S, Wen L, Lei Z, Li SZ (2021) RefineDet++: single-shot refinement neural network for object detection. IEEE Trans Circuits Syst Video Technol 31(2):674–687. https://doi.org/10. 1109/TCSVT.2020.2986402 54. Lin T-Y, Goyal P, Girshick R, He K, Dollár P (2017) Focal loss for dense object detection. In: 2017 IEEE international conference on computer vision (ICCV), 2017, pp 2999–3007. https:// doi.org/10.1109/ICCV.2017.324 55. Del Prete R, Graziano MD, Renga A (2021) RetinaNet: a deep learning architecture to achieve a robust wake detector in SAR images. In: 2021 IEEE 6th International forum on research and technology for society and industry (RTSI), 2021, pp 171–176. https://doi.org/10.1109/RTS I50628.2021.9597297 56. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. https://arxiv.org/abs/1409.1556
Chapter 2
Requirements for Hands-On Approach to Deep Learning
2.1 Introduction

This is a bridging chapter in which we introduce some of the concepts needed to start building deep learning models in Python. We will start with basic principles related to data manipulation and end with an explanation of how to set up the modeling environment. Of course, this chapter is not meant to replace any detailed course on Python but to be a stepping stone for those already familiar with or new to data structures in Python. We recommend visiting the https://www.python.org/ site for comprehensive materials on Python. There is also a vast amount of online material, both text and video, to aid the learning process, but due diligence is necessary to avoid the pitfalls mentioned in Chap. 1. In deep learning, we mostly deal with vectors and matrices as we know them from linear algebra. These objects are sometimes referred to as tensors, but from an engineering perspective they can be considered subsets of multidimensional arrays, especially if one is already familiar with numerical processing tools like MATLAB, Scilab, Octave, etc. Like any other language, Python has a unique way of accessing and manipulating these arrays.
2.2 Basic Python Arrays for Deep Learning

In Python, vectors, matrices, arrays, and tensors are all data structures used to represent and manipulate multidimensional data. In the deep learning models presented later, we will be processing data in numerical format defined by Python's NumPy library. Therefore, for our purposes we will treat tensors as multidimensional NumPy arrays [1].
Fig. 2.1 Visualization of scalar (0-D tensor) and 1-D array (1-D tensor, row vector)
In fact, tensors are a generalization of vectors and matrices to higher dimensions. They can have any number of dimensions and are used to represent multidimensional data in deep learning and scientific computing. Tensors are commonly used by libraries such as TensorFlow and PyTorch. In Python, tensors are often represented using multidimensional NumPy arrays or specialized tensor libraries. The TensorFlow library is specifically geared to perform operations on tensors, since tensors have the attractive property of being efficiently manipulable on a GPU. In deep learning, GPU processing drastically speeds up operations, reducing the required time from hours to minutes. Please refer to the companion Notebook (Chapter02.ipynb) to get a better insight into the nature of the data and also as part of the hands-on experience [2].

Scalars and 1-D Arrays (Vectors)

Scalars have zero dimensions while vectors are single-dimensional, as shown in Fig. 2.1. Specifically, vectors are one-dimensional arrays that store elements in a single row or column. They can be considered a special case of a matrix with either a single row (row vector) or a single column (column vector). In Python, vectors are often represented using one-dimensional NumPy arrays or lists.

2-D Arrays (Matrices) and 3-D Arrays (Data Cubes)

Matrices are represented as 2-D arrays and data cubes as 3-D arrays, as shown in Fig. 2.2. Matrices are two-dimensional data structures that store elements in rows and columns. They are used to represent tabular or grid-like data. Matrices can be created using nested lists or two-dimensional NumPy arrays in Python.

Multidimensional Arrays (ndarray)

Multidimensional arrays are normally visualized with a dimension greater than or equal to 3, but by definition the only requirement is that the dimension must be nonnegative. We will use image data as an example to explain multidimensional arrays. An image can be represented by (height, width, color depth), and a collection of images stacked together (such as frames in a video) would have a fourth dimension (frame number, height, width, color depth). For multiple video sequences, we end up with a fifth dimension and the representation (video number, frame number, height, width, color depth), as illustrated in Fig. 2.3. In general, two forms of representation are used for 3-D image data, namely the channels-last convention (height, width, color depth) supported by TensorFlow and the channels-first convention (color depth, height, width) supported by Theano. In this book, we will be focusing on the Keras framework, which supports both conventions. In any case, switching between the two conventions is possible by data transposition.
Fig. 2.2 Visualization of 2-D array (2-D tensor, matrix) and 3-D array (3-D tensor, cube)
Fig. 2.3 Visualization of an example of a multidimensional array (4-D array, 4-D tensor)
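As a minimal illustration of the tensor dimensions discussed above, the short NumPy sketch below (with arbitrarily chosen values) builds tensors of increasing dimension and inspects them:

import numpy as np

scalar = np.array(5)                 # 0-D tensor (scalar)
vector = np.array([1, 2, 3])         # 1-D tensor (vector)
matrix = np.array([[1, 2], [3, 4]])  # 2-D tensor (matrix)
cube = np.zeros((2, 3, 4))           # 3-D tensor (data cube)
images = np.zeros((10, 28, 28, 1))   # 4-D tensor, e.g., 10 grayscale 28 x 28 images (channels-last)

for t in (scalar, vector, matrix, cube, images):
    print(t.ndim, t.shape)           # number of dimensions and shape of each tensor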
Array Manipulation

Figure 2.4 shows an example of reshaping an array from size (3, 5) to size (5, 3). The key point is that the total number of elements in the new shape must equal the total number of elements in the original shape. Besides reshape, other array manipulation operations such as resize, transpose, squeeze, and flatten can be performed on NumPy arrays, as sketched below.
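A minimal sketch of the reshape in Fig. 2.4, along with a few of the other manipulations mentioned above, is given below (the array values are arbitrary):

import numpy as np

a = np.arange(15).reshape(3, 5)  # 15 elements arranged as (3, 5)
b = a.reshape(5, 3)              # valid because 5 * 3 = 15 elements as well
c = a.T                          # transpose, shape (5, 3)
d = a.flatten()                  # 1-D copy with all 15 elements
print(a.shape, b.shape, c.shape, d.shape)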
Fig. 2.4 Visualization of an example of array manipulation where the shape is changed
2.3 Setting Up Environment

This section is a quick guide that explains the necessary steps to create an environment using Python as the basis for deep learning algorithm evaluation. The processing steps and resources will be explained as we walk through the process. With the availability of vast resources on the internet, the interested reader should be able to rapidly create a working demo script within a few hours, if not minutes. It is assumed in this book that the reader has basic knowledge of programming and the Python environment. Deep knowledge of artificial intelligence, neural networks, or deep learning is not a prerequisite to run deep learning algorithms. Basically, it is possible to run deep learning algorithms either offline or online.
2.3.1 OS Support for Offline Environments

The examples that we will present are built on Python 3.7 and can run on a standalone Windows environment. However, we have confirmed that the setup is also straightforward using VirtualBox Ubuntu 20.04 LTS on a Windows host. The notable difference between the Windows environment and Ubuntu is that the Ubuntu Terminal is the basic tool for command line operations, so no additional terminal installation is required. For package management, we recommend using Anaconda Navigator, which can be downloaded for free from the official website (Anaconda Navigator Installation).
2.3.2 Windows Environment Creation Example

Installation of the Anaconda Navigator on Windows is quite easy to perform, and the Navigator can be started from the Start Menu. Figure 2.5 below shows an example of the interface on Windows 10. It is highly recommended to create a new environment for each classification task or project using the following steps.

1. Click Environments in the Anaconda Navigator and select Create on the bottom left side (Fig. 2.6).
2. Set the environment name (in this case env_maskrcnn as an example) and select the Python version (in this case 3.7) as in Fig. 2.7.

The environment will be shown in the list of environments, to which packages and tools can be added as necessary. In our example, we created "env_maskrcnn" and installed Spyder® and Jupyter Notebook, among other standard tools. Spyder® is a user-friendly interactive Python GUI, and Jupyter Notebook is good for visualizing demos available from GitHub and for creating new scripts before running them in Spyder, as one use-case example. The Jupyter Notebook is also handy for interactive debugging as it can easily link to online resources like Stack Overflow.
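For readers who prefer the command line to the Navigator GUI, an equivalent environment can be created directly with conda; the environment name and Python version below simply mirror the GUI example above:

conda create -n env_maskrcnn python=3.7
conda activate env_maskrcnn
conda install spyder jupyter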
Fig. 2.5 Anaconda Navigator interface
Fig. 2.6 An example of creating an environment
Fig. 2.7 An example of setting environment properties
2.3.3 Options to Consider for Online Environments

Although the Windows and Ubuntu/Linux platforms are convenient in terms of availability and control, recent trends are to rely on online platforms, specifically Google Colab (https://research.google.com/colaboratory/). The advantages of Google Colab (Colab for short) are that very minimal or no setup effort is required, and it also provides the option to use free GPU/TPU resources once an account is created. The packages needed for most classification tasks are constantly updated, simplifying package management. In addition, it is very easy to share Notebooks and check algorithm performance online. For an affordable fee, it is possible to upgrade the account if more computational resources are required. In any case, it
Fig. 2.8 An example of Google Colab interface
is always possible to unsubscribe anytime and use the free Colab account for small demo projects. An example of the online Colab interface is shown in Fig. 2.8.
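Once a notebook is open, a quick way to confirm that a GPU runtime is actually attached is the short check below; TensorFlow comes preinstalled on Colab, so no further setup is assumed beyond selecting a GPU runtime:

import tensorflow as tf

# Lists the available GPU devices; an empty list means no GPU runtime is attached
print(tf.config.list_physical_devices('GPU'))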
2.4 Concluding Remarks

Wrapping up what we have learnt so far, we presented basic Python data structures and their manipulation. We ended with reference material on setting up the environment and also gave online options to consider.
2.5 Self-evaluation Exercises

1. What is a tensor, and how is it used in deep learning? Describe the difference between a scalar, vector, and matrix, and give an example of each.
2. How do you create a one-dimensional (1-D) array in Python? Give an example of how to create an array of integers, and describe how to access individual elements of the array.
3. What is a matrix, and how do you create a two-dimensional (2-D) array in Python? Give an example of how to create a 2-D array of floating-point numbers, and describe how to perform basic operations on matrices (e.g., addition, multiplication).
4. What is a data cube, and how is it used in deep learning? Describe how to create a three-dimensional (3-D) array in Python, and give an example of how to access individual elements of the array.
5. Describe the concept of multidimensional arrays in Python. What are some common operations you can perform on multidimensional arrays? Give an example of how to perform each of these operations on a multidimensional array.
References 1. The N-dimensional array (ndarray). https://numpy.org/doc/stable/reference/arrays.ndarray.html 2. Deep-Learning-Models. https://github.com/sn-code-inside/Deep-Learning-Models
Chapter 3
Building Deep Learning Models
3.1 Introduction: Neural Networks Basics

In this chapter, we illustrate how to build deep learning models and how to train and evaluate them using the Keras framework in a simple and concise way. We briefly explain some of the concepts behind these models so as to give the reader a smooth entry into each section, while concentrating mainly on how to use them rather than on the details of the algorithms themselves. The entry point will be shallow networks, upon which deep neural networks are developed. We then touch on convolutional neural networks (CNNs), followed by recurrent neural networks (RNNs), and finally long short-term memory (LSTM)/gated recurrent units (GRUs). Along the way, we provide examples of how each of these can be used in order to cement the ideas behind them. After that we take a quick look at the Keras library and give some references for further investigation.
3.1.1 Shallow Networks

In recent terminology, neural networks can be categorized into deep and shallow neural networks. In this categorization, shallow neural networks can be thought of as the basic building blocks required to understand deep neural networks; they consist of a few hidden layers, normally one or two. In this subsection, we will give a brief overview of shallow networks since they are an important part of artificial intelligence. Artificial neural network models were originally inspired by human neurons; the earliest such model is the perceptron [1]. A comprehensive treatment of the evolution of neural networks is beyond the scope of this section, but in its basic functionality, a perceptron takes several binary inputs and produces a single binary output, as illustrated in Fig. 3.1. The output can be computed using the following expression:
Fig. 3.1 Illustration of a simple perceptron model
$$\text{output} = \begin{cases} 0, & \text{if } \sum_{i=0}^{n} w_i x_i \le \theta_0 \\ 1, & \text{if } \sum_{i=0}^{n} w_i x_i > \theta_0 \end{cases} \tag{3.1}$$

where $x_0 = 1$ and $\theta_0$ is a predetermined threshold.
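A direct NumPy translation of Eq. (3.1), with illustrative weights and threshold chosen only for demonstration, is sketched below:

import numpy as np

def perceptron(x, w, theta0):
    # x and w include the bias term: x[0] = 1 by convention
    return 1 if np.dot(w, x) > theta0 else 0

x = np.array([1, 0, 1, 1])            # x[0] = 1 (bias input)
w = np.array([0.5, -0.6, 0.3, 0.8])   # illustrative weights
print(perceptron(x, w, theta0=0.9))   # prints 1, since 0.5 + 0.3 + 0.8 = 1.6 > 0.9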
The main components of the perceptron can be summarized as follows:

1. Inputs: The data to be processed.
2. Weights: Values which determine the importance of each input.
3. Processing layer: The part that performs mathematical operations on the inputs by applying weights to them.
4. Activation function: A nonlinear output selection function, which can be a sigmoid, rectified linear unit (ReLU), tanh, or any other appropriate function.
5. Output: The result of applying the activation function to the processed input.

In the Keras framework, it is a simple matter to create a shallow neural network using Dense layers. A popular and well-known standard example that can be used to illustrate the shallow neural network in a concrete manner is the established MNIST dataset that is packaged with the Keras library. The MNIST dataset consists of two sets of data with 60,000 training images and 10,000 testing images. The images are 28 × 28 pixel grayscale images (intensities 0–255) and are a collection of handwritten images of the digits 0–9 (10 classes). MNIST, which stands for Modified National Institute of Standards and Technology, is a dataset created from the original NIST data with some modification and is extensively studied in the computer vision and machine learning literature. To those familiar with image processing algorithm evaluation, MNIST can be considered the Lena of image classification. Using this MNIST dataset, the goal is to classify the handwritten digits into one of the 10 classes (0–9). Please refer to the companion Notebook (Chapter03.ipynb) to get a better insight into the nature of the data and also as part of the hands-on experience [2]. For this purpose, the shallow neural network model can be defined as follows:
from keras import models
from keras import layers

shallownet = models.Sequential()
shallownet.add(layers.Dense(4, activation='relu', input_shape=(28 * 28,)))
shallownet.add(layers.Dense(10, activation='softmax'))

The model can then be compiled and training on the input data can be done:

shallownet.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
shallownet.fit(train_images, train_labels, epochs=5, batch_size=128)
The full example is given below:

# Import the necessary libraries
from keras import models
from keras import layers

# Load MNIST dataset from Keras
from keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Define the model by adding two Dense layers
shallownetwork = models.Sequential()
shallownetwork.add(layers.Dense(4, activation='relu', input_shape=(28 * 28,)))
shallownetwork.add(layers.Dense(10, activation='softmax'))

# Compile the model
shallownetwork.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Preprocess the data by scaling it from the [0, 255] range to the [0, 1] range
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255

# Prepare the training and test labels
from keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

# Train the network using the MNIST training dataset
history = shallownetwork.fit(train_images, train_labels, epochs=10, batch_size=64, validation_data=(test_images, test_labels))

# Plot training results
import matplotlib.pyplot as plt
plt.plot(history.history['accuracy'], label='train_accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Training/Validation Accuracy')
plt.legend(loc='lower right')

# Evaluate the model using the MNIST test dataset
test_loss, test_acc = shallownetwork.evaluate(test_images, test_labels)
print('test_acc:', test_acc)
The above simple model gives a test accuracy of 86.23% (Fig. 3.2). The utility of Keras is that it is possible to quickly adjust hyperparameters to improve on the test accuracy. As an example, increasing the size of the network to 512, recompiling, and changing the training batch size to 128 results in an increase in accuracy to 98.15%!

# Define the network model by adding two Dense layers, with the network size increased to 512
shallownetwork = models.Sequential()
shallownetwork.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
shallownetwork.add(layers.Dense(10, activation='softmax'))

# Compile the model
shallownetwork.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the network using the MNIST training dataset with an increased batch size of 128
history = shallownetwork.fit(train_images, train_labels, epochs=10, batch_size=128, validation_data=(test_images, test_labels))
Fig. 3.2 Training and validation accuracy (network size 4, batch size 64)
# Plot training results
import matplotlib.pyplot as plt
plt.plot(history.history['accuracy'], label='train_accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Training/Validation Accuracy')
plt.legend(loc='lower right')

# Evaluate the model using the MNIST test dataset
test_loss, test_acc = shallownetwork.evaluate(test_images, test_labels)
print('test_acc:', test_acc)
Figure 3.3 shows that for this shallow model, over-fitting starts after the first epoch, as indicated by the almost flat validation accuracy. The perceptron model can be extended to multiple hidden layers of perceptrons to produce complex decisions, as shown in Fig. 3.4; this is often referred to as a multilayer perceptron (MLP) in the literature.
Fig. 3.3 Training and validation accuracy (network size 512, batch size 128)
Fig. 3.4 An illustration of the construction of the multilayer (2-layer) perceptron model
3.1.2 Convolutional Neural Networks (CNNs)

The CNN is one of the most successful models used in deep neural networks, especially in image processing and computer vision. Taking a small detour into history, deep learning networks differ from conventional neural networks in the number of node layers used, which brings in the concept of depth, and they can also have loops. Basically, conventional neural networks normally have one to two hidden layers and are used for supervised prediction or classification. In contrast, deep learning networks can have several hidden layers, with the possibility of unsupervised training. Figure 3.5 illustrates one example of such a network. Examples of widely used deep learning architectures include deep neural networks (DNNs), deep belief networks (DBNs), and recurrent neural networks (RNNs) [3, 4]. The main advantage of DNNs over traditional neural networks is the ability to learn complex tasks in an unsupervised manner. However, this advantage does not come without cost [5]. Large amounts of training data are required for building the network, high computational complexity is a big burden, the algorithms are difficult to analyze, and the output cannot be predicted precisely, among other challenges. For applications such as autonomous navigation, DNNs have a promising future, and their integration into sensors like automotive radar is currently under intensive research [6]. With advances in both computational power (GPUs/TPUs) and available resources (RAM/ROM) on sensor devices, the realization of so-called intelligent sensors is now possible. For the interested reader, further details about deep Boltzmann machines (DBMs) and RNNs can be found in [7] and [8], respectively. It should be noted that RNNs have found greater success in natural language processing (NLP). Coming back to the subject of this section, a convolutional neural network (CNN) is a neural network which uses at least one convolutional layer as part of the model. The construction of a CNN involves several layers between input and output, at least one of which is a convolutional layer. A typical convolutional neural network consists of some
Fig. 3.5 An illustration of the components of a deep neural network model
combination of the following layers: convolutional layers, pooling layers, and fully connected/dense layers. Convolutional layers apply convolution operations to their inputs to extract features from the input. Pooling operations are used to reduce the size of the convolutional layer outputs by either maximization or averaging operations; normally, the pooling is done over a 2 × 2 window. Fully connected layers usually come at the top of the network (close to the output) and are also sometimes referred to as dense layers. CNNs have been successfully applied to computer vision, producing state-of-the-art performance in most applications. Figure 3.6 illustrates the structure of a typical CNN architecture, showing the progression through convolution and pooling operations. The flattening operation produces a one-dimensional array for input to the final fully connected top layers. We continue with the MNIST data as a concrete example of how to implement a CNN in Keras, following [9].

# Example of a CNN using the MNIST dataset
# Import necessary packages
from keras import layers
from keras import models

# Define the CNN model with 3 convolution layers and 2 pooling layers
cnn_model = models.Sequential()
Fig. 3.6 Typical CNN architecture
cnn_model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
cnn_model.add(layers.MaxPooling2D((2, 2)))
cnn_model.add(layers.Conv2D(64, (3, 3), activation='relu'))
cnn_model.add(layers.MaxPooling2D((2, 2)))
cnn_model.add(layers.Conv2D(64, (3, 3), activation='relu'))
cnn_model.add(layers.Flatten())
cnn_model.add(layers.Dense(64, activation='relu'))
cnn_model.add(layers.Dense(10, activation='softmax'))

# View the model summary
cnn_model.summary()

# Train the convnet on the MNIST images
train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype("float32") / 255
test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype("float32") / 255
cnn_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = cnn_model.fit(train_images, train_labels, epochs=10, batch_size=64, validation_data=(test_images, test_labels))

# Plot training results
import matplotlib.pyplot as plt
plt.plot(history.history['accuracy'], label='train_accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Training/Validation Accuracy')
plt.legend(loc='lower right')

# Evaluate the model using the MNIST test dataset
test_loss, test_acc = cnn_model.evaluate(test_images, test_labels)
print('test_acc:', test_acc)
Figure 3.7 shows the progression of the training/validation accuracy. The validation accuracy appears to be higher than the training accuracy, which indicates good generalization of the model. A decent test accuracy of 96.5% is achieved (Fig. 3.8).
3.1.3 Recurrent Neural Networks (RNNs)

Another popular type of neural network is the recurrent neural network, which has been used very successfully for applications like natural language processing and speech recognition. RNNs differ from CNNs in that they have memory, meaning that previous inputs influence the present input and output. We will not dwell much on RNNs in this text, but Fig. 3.9 gives a simplified visual illustration of how they work.
Fig. 3.7 Training and validation accuracy for the CNN model
Fig. 3.8 Test accuracy results for CNN
Suffice it to say, Keras provides the SimpleRNN layer for model construction. Below is an example of an RNN with Keras.

# Import necessary packages
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, SimpleRNN
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.datasets import mnist

# Load MNIST dataset from Keras
from keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Extract the number of labels
num_train_labels = len(np.unique(train_labels))

# Normalize data for training
train_images = train_images.reshape((60000, 28, 28))
train_images = train_images.astype("float32") / 255
Fig. 3.9 Illustration of a simplified RNN showing the rolled and unrolled representations
test_images = test_images.reshape((10000, 28, 28))
test_images = test_images.astype("float32") / 255

# Prepare the training and test labels
from keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

# Create RNN model with 256 units
rnn_model = Sequential()
rnn_model.add(SimpleRNN(256, input_shape=(28, 28)))
rnn_model.add(Dense(num_train_labels, activation='softmax'))
rnn_model.summary()

# Train the RNN model with a batch size of 128 and 20 epochs
rnn_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = rnn_model.fit(train_images, train_labels, epochs=20, batch_size=128, validation_data=(test_images, test_labels))

# Plot training results
import matplotlib.pyplot as plt
plt.plot(history.history['accuracy'], label='train_accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Training/Validation Accuracy')
plt.legend(loc='lower right')
Fig. 3.10 Training and validation accuracy for the RNN model
Fig. 3.11 Test accuracy results for RNN
# Evaluate the model using the MNIST test dataset
test_loss, test_acc = rnn_model.evaluate(test_images, test_labels)
print('test_acc:', test_acc)
With this simple 2-layer RNN model, a decent accuracy of 97.65% can be achieved for the MNIST data (Figs. 3.10 and 3.11).
3.1.4 Long Short-Term Memory (LSTM)/Gated Recurrent Units (GRUs)

The LSTM and GRU layers are designed to solve the vanishing gradient problem that makes the SimpleRNN unsuitable for most practical problems [10]. This is achieved by carrying information from earlier steps forward to later time steps using some form of forgetting factors, which considerably mitigates the vanishing-gradient problem. GRUs operate on the same principle as the LSTM, except that for the LSTM,
three gates, namely the input, output, and forget gates, are used, while for the GRU only two gates, the reset and update gates, are required. The choice between the two involves a trade-off between accuracy and computational complexity, with the LSTM generally expected to provide higher accuracy [11, 12]. Employing the same approach as for the SimpleRNN model, we compare the LSTM and GRU models built from Keras layers. We start with the LSTM model.

# Create LSTM model with 256 units
lstm_model = Sequential()
lstm_model.add(layers.LSTM(256, input_shape=(28, 28)))
lstm_model.add(Dense(num_train_labels, activation='softmax'))
lstm_model.summary()
A total of 294,410 training parameters are used for this model (Fig. 3.12).
Fig. 3.12 Model parameters summary for LSTM
Fig. 3.13 Training of LSTM progress for each epoch
# Train the LSTM model with a batch size of 128 and 20 epochs
lstm_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = lstm_model.fit(train_images, train_labels, epochs=20, batch_size=128, validation_data=(test_images, test_labels))
The validation accuracy progressively increases as the validation loss falls (Figs. 3.13, 3.14 and 3.15).

# Evaluate the model using the MNIST test dataset
test_loss, test_acc = lstm_model.evaluate(test_images, test_labels)
print('test_acc:', test_acc)
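Next, we construct and train the GRU model. Its creation code is not reproduced in the text (only the resulting parameter summary appears as Fig. 3.16), but a minimal sketch mirroring the LSTM construction above, assuming the same 256 units and the Keras GRU layer, would be:

# Create GRU model with 256 units (sketch mirroring the LSTM example above)
gru_model = Sequential()
gru_model.add(layers.GRU(256, input_shape=(28, 28)))
gru_model.add(Dense(num_train_labels, activation='softmax'))
gru_model.summary()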
A total of 221,450 training parameters are used for this model (Fig. 3.16).

# Train the GRU model with a batch size of 128 and 20 epochs
gru_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = gru_model.fit(train_images, train_labels, epochs=20, batch_size=128, validation_data=(test_images, test_labels))
The validation accuracy progressively increases as the validation loss falls (Figs. 3.17 and 3.18).
Fig. 3.14 Training and validation accuracy for the LSTM model
Fig. 3.15 Test results for LSTM
Fig. 3.16 Model parameters summary for GRU
Fig. 3.17 Training of GRU progress for each epoch
Fig. 3.18 Training and validation accuracy for the GRU model
# Evaluate the model using the MNIST test dataset
test_loss, test_acc = gru_model.evaluate(test_images, test_labels)
print('test_acc:', test_acc)
Fig. 3.19 Test accuracy results for GRU
Comparing the above results obtained under similar conditions, it can be observed that the LSTM model achieves an average speed of 70 s per epoch with a validation accuracy of 98.92% (Fig. 3.15). The GRU model achieves an average speed of 55 s per epoch and 98.81% accuracy (Fig. 3.19), which means the LSTM is 0.11% better in this example. As stated above, this improvement comes at a computational expense, as reflected in the execution speed. As shown in Figs. 3.14 and 3.18, the two models quickly achieve high accuracy in the first 5 epochs, after which over-fitting becomes visible. With the addition of more layers and hyperparameter tuning, further improvements can generally be achieved for any model, as will be seen in the next chapters.
3.2 Using Keras as a Deep Learning Framework

Keras is a widely used Python framework for machine learning and deep neural network applications due to its intuitive logical flow, ease of getting started quickly, and richness in ready-to-use packages. With very few lines of code, model evaluation on benchmark and new datasets can be accomplished efficiently. We will briefly explore the Keras framework here, but further details and the latest developments can be found at https://keras.io/.
3.2.1 Overview of the Library

The Keras API reference consists of the Models API, Layers API, Callbacks API, optimizers, metrics, applications, and many other utilities that greatly reduce the effort from concept to tangible results for engineers and scientists from various backgrounds and fields. The workflow can be reduced to three main steps: (1) define the model, (2) compile the model, and (3) evaluate the model. By continuously refining step (1), rapid evaluation of models is possible. Keras is also compatible with Ubuntu, Windows, and macOS, making it available to a wide audience. Among other characteristics, it can run on both CPU and GPU platforms.
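To make the three-step workflow concrete, the minimal skeleton below sketches it; the layer sizes and the input dimension are placeholders chosen only for illustration:

from keras import models, layers

# Step 1: define the model
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(100,)))
model.add(layers.Dense(10, activation='softmax'))

# Step 2: compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Step 3: train and evaluate the model (x_train, y_train, x_test, y_test assumed to be prepared)
# model.fit(x_train, y_train, epochs=5, batch_size=32)
# model.evaluate(x_test, y_test)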
Fig. 3.20 Survey results showing popularity of Keras in Kaggle competitions
3.2.2 Usability

The usability of Keras is evidenced by the data available at https://keras.io/why_keras/. Keras was used by the majority of the top-5 winners of Kaggle competitions, based on a 2019 survey. Additionally, the results of the 2022 state of data science and machine learning survey published by Kaggle showed that TensorFlow, the backend engine of Keras, has broad adoption in both industry and research circles, reaching approximately 61% (Fig. 3.20) [13].
3.3 Concluding Remarks

In this chapter, we provided a concise introduction to building deep learning models with practical examples. We discussed the distinctions between shallow and deep neural networks and demonstrated how to implement them using the Keras framework. Some of the popular deep learning architectures, namely the CNN and RNN, were also illustrated. In the end, we provided some background on why it makes sense to start with Keras as the framework for building and evaluating deep neural networks.
3.4 Self-evaluation Exercises

1. Explain the concept of shallow networks and their limitations. Can shallow networks be used for complex tasks such as image classification or natural language processing?
2. What are Convolutional Neural Networks (CNNs)? How do they differ from fully connected neural networks? Explain the architecture of a typical CNN.
3. Describe Recurrent Neural Networks (RNNs) and their ability to model sequential data. What are some limitations of standard RNNs, and how do Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) address these limitations?
4. Explain the Keras API and its advantages for building deep learning models. What are some features of the Keras API that make it popular among developers?
5. Give an example of building a deep learning model using the Keras API. Describe the steps involved in building a CNN or RNN using Keras, including data preparation, model definition, training, and evaluation.
References 1. Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press, Inc., New York 2. Deep-Learning-Models. https://github.com/sn-code-inside/Deep-Learning-Models 3. Hinton GE, Osindero S, Teh Y (2006) A fast learning algorithm for deep belief nets. Neural Comput 18 4. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press 5. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507 6. Wheeler TA, Holder MF, Winner H, Kochenderfer MJ (2017) Deep stochastic radar models. IEEE Intell Veh Symp IV 7. Salakhutdinov R, Hinton GE (2009) Deep Boltzmann machines. In: AISTATS, pp 448–455 8. Graves A, Mohamed A, Hinton GE (2013) Speech recognition with deep recurrent neural networks. In: ICASSP, pp 6645–6649 9. Francois C (2018) Deep learning with Python. Manning Publications Co. 10. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9:1735–1780 11. Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 conference on empirical methods in natural language processing, pp 1724–1734 12. Cahuantzi R, Chen X, Güttel S (2021) A comparison of LSTM and GRU networks for learning symbolic sequences. https://arxiv.org/abs/2107.02248 13. Kaggle (2022) State of data science and machine learning 2022. https://www.kaggle.com/kag gle-survey-2022
Chapter 4
The Building Blocks of Machine Learning and Deep Learning
4.1 Introduction

In this chapter, we take a look at the three main categories of machine learning and then move on to explore how machine learning models can be evaluated. The various metrics commonly used are explained. After that, we briefly address the important topic of data preprocessing, followed by standard methods of evaluating machine learning models. One of the main reasons most models fail to perform on unseen data is the problem of overfitting. We take a look at this problem and outline some of the strategies that can be applied in order to overcome it. The next topic is a discussion of the workflow for machine learning or deep learning. The chapter ends with concluding remarks to recap the covered topics.
4.2 Categorization of Machine Learning

As introduced in Chap. 1, there are three major categories of machine learning: supervised machine learning, unsupervised machine learning, and reinforcement learning [1, 2] (see Fig. 4.1). A supervised machine learning algorithm uses labeled input data to learn a mapping function which generates an appropriate output when given new unlabeled data. The term supervised learning comes from the fact that the process of algorithm learning uses a training dataset that can be viewed as an instructor supervising the learning process. Supervised learning can be divided into classification and regression. The classification process results in discrete or categorized outputs such as car, bicycle, pedestrian, or truck in the case of road object classification. The output class can be labeled as an integer. On the other hand, regression results in real-valued outputs such as height or width. Supervised learning is by far the most widely used type of machine learning at present, including in deep learning.
Fig. 4.1 Main branches of machine learning with representative examples
An unsupervised machine learning algorithm utilizes input data without explicitly provided labels. The algorithms work by themselves to discover patterns within the unlabeled data. To avoid wild results, human intervention is required for validation. Reinforcement learning is a comparatively new branch which has its roots in game development and has been extended to autonomous driving and other applications. Reinforcement learning differs from the other two branches described above in that intelligent agents interact with the environment to maximize a reward, and no labeled data is required. The concept of reward maximization makes reinforcement learning distinct from unsupervised learning [3, 4]. In this book, we will mainly focus on supervised learning. Supervised machine learning finds application in many areas including image recognition, speech recognition, object detection, remote sensing, and autonomous driving, just to name a few [5–8]. There is a vast amount of reference material in the literature on recognition and classification [9].
4.3 Methods of Evaluating Machine Learning Models

The first step in evaluating machine learning models, after collecting the dataset, is to decide on the split, i.e., the proportions of the dataset that will be used for the training, validation, and testing phases. In most algorithms, it is possible to first split the data into training and test datasets and then use a percentage of the training set for validation. The training dataset is used for fitting the model parameters in order to maximize prediction performance. The validation dataset is used to evaluate model performance during the training phase in order to aid tuning of the model hyperparameters. The test dataset is used for evaluating the model produced during the training phase and should be completely separate from the training dataset. An example of splitting the data for computer vision applications is to use a combination of StratifiedShuffleSplit from scikit-learn with
Fig. 4.2 Training/validation split example
ImageDataGenerator from Keras to first create training and test datasets and then partition the training data into training and validation portions.

Step 1: Import the libraries.

from sklearn.model_selection import StratifiedShuffleSplit
from keras.preprocessing.image import ImageDataGenerator
Step 2: Split with scikit-learn: 20% test and 80% training.

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=69)
Then create training and test folders based on the split.

Step 3: Using the ImageDataGenerator, define the proportion of the training data that will be used for validation during training via the validation_split argument. This is the fraction of images reserved for validation and must be between 0 and 1. In this example, the value is set to 0.1, which means that 10% of the samples will be reserved for the validation set and the remaining 90% for the training set (see Fig. 4.2).

# Training generator - reserve 0.1 of the data as the validation subset
train_gen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=60,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    horizontal_flip=True,
    validation_split=0.1
)
Using the .flow_from_directory method, create the train, validation, and test generators. TRAIN_DIR, BATCH_SIZE, and CLASS_MODE are predefined values.

train_generator = train_gen.flow_from_directory(
    directory=TRAIN_DIR,
    target_size=(64, 64),
    batch_size=BATCH_SIZE,
    class_mode=CLASS_MODE,
    subset='training',
    color_mode='rgb',
    shuffle=True,
    seed=71
)

valid_generator = train_gen.flow_from_directory(
    directory=TRAIN_DIR,
    target_size=(64, 64),
    batch_size=BATCH_SIZE,
    class_mode=CLASS_MODE,
    subset='validation',
    color_mode='rgb',
    shuffle=True,
    seed=71
)

# Test generator for evaluation purposes (only rescaling applied)
test_gen = ImageDataGenerator(
    rescale=1./255
)

test_generator = test_gen.flow_from_directory(
    directory=TEST_DIR,
    target_size=(64, 64),
    batch_size=1,
    class_mode=None,
    color_mode='rgb',
    shuffle=False,
    seed=71
)
The above approach is referred to as hold-out validation, where a single validation subset is created. Another method of cross-validation is K-fold validation, where the training dataset is divided into K equal proportions, reserving one of the K portions for validation and using the remaining K-1 portions for training in each training cycle. The performance score is calculated by averaging the K results. This approach is effective when the dataset is very small. A more computationally intensive approach is to shuffle the data and perform K-fold validation several times, once for each shuffled dataset. The final result is computed by averaging over all K-fold validations.

Performance evaluation is an important part of any model evaluation. Here we list the common methods of evaluation, but it should be kept in mind that these metrics can be used in combination, and new ones can be defined if the data cannot be correctly evaluated by any of them. It is instructive to start by defining the terminology used in performance evaluation from a classification perspective.

True Negative (TN) = correct prediction of non-existence of a class
False Negative (FN) = incorrect prediction of non-existence of a class
False Positive (FP) = incorrect prediction of existence of a class when it does not actually exist
True Positive (TP) = correct prediction of existence of a class

• Accuracy
It indicates the ratio of correct classifications, whether positive or negative.
Accuracy = (TP + TN)/(TP + FP + FN + TN)
• Precision
It is the fraction of true positives among all samples classified as positive. It is also referred to as the positive predictive value.
Precision = TP/(TP + FP)
• Recall
It is a measure of how correctly positives were classified as positive. It is also referred to as the true positive rate (TPR).
Recall = TP/(TP + FN)
• Specificity
It is a measure of how correctly negatives were actually classified as negatives. It is also referred to as the true negative rate.
Specificity = TN/(FP + TN)
• F1-score
It is a measure of accuracy and is the harmonic mean of Recall and Precision.
F1 = 2TP/(2TP + FP + FN) = 2 ∗ Precision ∗ Recall/(Precision + Recall) = 2/(1/Recall + 1/Precision)
• F2-score
The F2-score is a weighted average of recall and precision that gives more weight to recall than precision.
F2 = 5 ∗ Precision ∗ Recall/(4 ∗ Precision + Recall)
• Confusion Matrix
It is a matrix showing classification results by comparing actual classes and predicted classes. It is most informative in multiclass scenarios.
• Precision-Recall (PR) curve
Plot of precision against recall at various decision thresholds.
• Receiver Operating Characteristics (ROC) curve
It is a plot of Recall against the false positive rate (1-Specificity). It is used to judge the optimality of the model and has origins in radar processing, where the false positive rate is known as the probability of false alarm.
• Area under the ROC curve (AUC)
It is used to measure model performance based on the area under the ROC curve. It falls between 0 and 1, and values greater than 0.5 are desirable because 0.5 represents a random guess.
The above metrics are well-known and widely used. In addition, most of them can be easily imported from the sklearn.metrics module. Below is an example of how this can be achieved in a single line of code.

from sklearn.metrics import precision_recall_fscore_support, confusion_matrix, fbeta_score, accuracy_score, roc_curve, roc_auc_score, auc
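As a quick illustration, the following minimal sketch shows how a few of these metrics can be computed with the imported functions. The true and predicted labels here are hypothetical values for a three-class problem, chosen only for demonstration.

from sklearn.metrics import (precision_recall_fscore_support,
                             confusion_matrix, fbeta_score, accuracy_score)

# Hypothetical ground-truth and predicted labels for a 3-class problem
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]

print(accuracy_score(y_true, y_pred))                        # overall accuracy
print(precision_recall_fscore_support(y_true, y_pred))       # per-class P, R, F1, support
print(fbeta_score(y_true, y_pred, beta=2, average='micro'))  # global F2-score
print(confusion_matrix(y_true, y_pred))                      # actual vs. predicted counts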
4.3.1 Data Preprocessing for Deep Learning The input data used for machine learning or deep learning comes in various formats such as text, images, and even videos. Before feeding this data into a deep learning model, it is necessary to put it into a format that makes the task of training tractable. This means that besides denoising, the data will need to be vectorized and normalized as part of preprocessing. In Chap. 2, we briefly discussed some of the data structures that can be handled in machine learning algorithms. Normalization is especially important in image data processing, where the [0, 255] pixel range is transformed to the [0, 1] range used in most machine learning models. As can be seen in Sect. 4.3 of this chapter, normalization is incorporated into the ImageDataGenerator for this purpose.
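As a minimal sketch of what this normalization amounts to outside of ImageDataGenerator, consider the following; the image batch here is randomly generated purely for illustration.

import numpy as np

# Hypothetical 8-bit RGB image batch: 4 images of 64 x 64 pixels
images = np.random.randint(0, 256, size=(4, 64, 64, 3), dtype=np.uint8)

# Vectorize to float and normalize the [0, 255] pixel range to [0, 1]
images = images.astype('float32') / 255.0

print(images.min(), images.max())  # values now lie in [0, 1]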
4.3.2 Problem of Overfitting Overfitting happens when the model performance on validation data stops improving compared to the performance on the training data. This is usually seen as validation accuracy remaining constant while training accuracy continues to approach 100%. Looking at it from the loss function side, the training loss decreases with each epoch while the validation loss stops decreasing or, even worse, increases. This behavior is an indication of poor generalization of the model to unseen data. Fighting overfitting is a common problem in machine learning, including deep learning. There are various strategies that can be considered to tackle the overfitting problem (Fig. 4.3).
Fig. 4.3 Illustration of overfitting
The reason why overfitting happens is the failure of the model to generalize to new or unseen data, and the simplest and most effective solution is to collect more data. However, this is not always possible, so we have to deal with the available limited data to improve the situation. The way out of this problem is employing methods such as regularization. To understand what is happening when overfitting occurs, it is constructive to imagine trying to fit noisy data to a quadratic function. The data will obviously contain outliers. With overfitting, the model tries to approximate a function that passes through all the data points, including outliers. This is overfitting because the resulting function is only good for this particular dataset. The consequence of this is that if we get new data with the same quadratic behavior but different outliers, then our approximation will not fit properly. In the absence of additional data to smooth out the outliers, regularization is our next best solution because it will try to control the large swings in the approximating weights, thereby making generalization to new data possible. When using Keras, the following regularization techniques can be applied:

Layer weight regularization: There are three forms of regularizers, namely, kernel_regularizer, where a penalty is applied on the layer's kernel; bias_regularizer, where a penalty is applied on the layer's bias; and activity_regularizer, where a penalty is applied on the layer's output. For all three, L1, L2 or a combination of L1 and L2 (L1_L2) can be used.

Dropout: Network nodes are randomly selected and removed during training in order to reduce the network complexity.

Network capacity reduction: Network units define the size of the output of the layer; therefore a reduction in capacity will lead to fewer parameters and thereby increase the ability to generalize. Moreover, a network with many parameters can be thought of as having the ability to memorize a large volume of data but cannot perform well when required to make decisions on new data.
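As a minimal sketch of how these three techniques might be combined in a Keras model, consider the following; the layer sizes and penalty strengths here are arbitrary choices for illustration, not recommendations.

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.regularizers import l2, l1_l2

model = Sequential([
    # L2 penalty on the kernel and an L1+L2 penalty on the activations
    Dense(64, activation='relu', input_shape=(100,),
          kernel_regularizer=l2(1e-4),
          activity_regularizer=l1_l2(l1=1e-5, l2=1e-4)),
    # Randomly drop 20% of the units during training
    Dropout(0.2),
    # Reduced capacity: fewer units in the second layer
    Dense(32, activation='relu', kernel_regularizer=l2(1e-4)),
    Dense(10, activation='softmax')
])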
For completeness' sake, underfitting is also a problem, occurring when the model performs well on neither the training data nor unseen or new data. It also leads to poor generalization, but it may be indicative of poor model selection or untrainable data. These kinds of problems must be solved before training starts.
4.4 The Machine Learning Workflow Up to now we have not given any guideline on how to attack a machine learning problem from the start. Here we explain the steps involved in the machine learning/deep learning workflow (Fig. 4.4). Problem Definition The first step is to define the problem at hand in terms of the required data and what we are trying to achieve as output. At this stage, it is good practice to decide on whether the problem will be binary, multiclass, multilabel, etc. Most problems have an application domain with vast examples in the literature. It is advisable to make a survey of available approaches, among other things. Data Collection Data acquisition is one of the most tedious and time-consuming parts of the workflow. The data must be large enough to be representative of the problem under analysis. As previously stated in the overfitting section, lack of sufficient data contributes to lack of model generalization when the model is deployed on new data. So, how much data is enough? There is no straight answer to this question, but a moderate deep learning problem would require 10,000 to 100,000 data samples. On the other hand, more complex problems like machine translation would require up to one million samples. A general rule of thumb for computer vision is to collect at least 1000 data samples per class. When enough data is not available, methods of generating artificial data such as data augmentation can be implemented.
Fig. 4.4 Illustration of the machine learning workflow from preparation to deployment
Data Preparation Having collected enough data, the next step is to transform the data into a machine-consumable format. This is where vectorization and normalization can be applied before inputting the data to the model. Defining Performance Measures As described in the previous sections, how to measure performance should be decided before the models are selected. Metrics like precision, recall, accuracy, etc. come into the picture here. Leaving the decision on metrics to later stages will result in wasted effort and time, and re-evaluation of the model performance may become necessary. Given that most deep learning models require a lot of time in terms of epochs per run, setting metrics from the start will help reduce the chances of doing this task repeatedly. Model Selection With the problem defined and the data and performance measures available, the task of deciding on the model comes into place. There is no formula for this task, but it is always good to start from a simple model with a few layers and few units and increase the complexity until no gain in performance can be realized (a minimal example of such a starting point is sketched at the end of this section). Train Model Training is the critical part of the whole process, as it is at this point that we start to see the level of difficulty of the task at hand. During training, performance metrics can be monitored with tools like TensorBoard, and the decision on whether to keep the current model can be made as quickly and as early as possible. Model Evaluation If the model runs to the end, the remaining thing to do is to evaluate the model performance against benchmarks or target values. If, for example, it is required to achieve 99% accuracy but the model reaches only 70%, then it will be better to change the model or adjust the hyperparameter space. At this stage, we may decide to abandon the model or choose alternative performance measures. Hyperparameter Tuning The hyperparameters of the model can be tuned to achieve a certain level of performance. This includes changing the optimizer, learning rate, etc. and including measures for reducing overfitting if this is a problem. After hyperparameter tuning, we then retrain the model to see if there are any gains that can be achieved. By repeating this experimentation phase several times, we end up with the best model from the given data. Deployment When we are satisfied with the performance of the model on unseen data, it can be deployed into use.
Maintenance In the maintenance phase, we keep checking the real performance of the model to decide whether additional data acquisition is required.
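To illustrate the "start simple" advice from the Model Selection step above, here is a minimal sketch of a small baseline classifier in Keras; the input dimension, layer sizes, and class count are arbitrary placeholders, and complexity would only be added if such a baseline underperforms.

from keras.models import Sequential
from keras.layers import Dense

# A deliberately small baseline: one hidden layer with few units
model = Sequential([
    Dense(32, activation='relu', input_shape=(64,)),  # 64 input features (placeholder)
    Dense(10, activation='softmax')                   # 10 output classes (placeholder)
])

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()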
4.5 Concluding Remarks Wrapping up what we have learnt so far, we gave a categorization of the branches of machine learning. This was followed by an introduction to machine learning evaluation methods. After that we touched on data preprocessing for deep learning, followed by the problem of overfitting, which is of importance in all machine learning algorithms. We then ended by presenting the workflow of machine learning development. The above content should give a pretty good picture of how general deep learning algorithms are constructed.
4.6 Self-evaluation Exercises
1. What are the three main categories of machine learning? Explain the difference between supervised, unsupervised, and reinforcement learning.
2. What are some common metrics used to evaluate machine learning models? Describe the differences between accuracy, precision, recall, F1-score, and AUC-ROC.
3. What is overfitting in machine learning? Why is it a problem, and how can it be addressed? Describe some techniques for preventing overfitting in machine learning models.
4. What is the typical workflow for building a machine learning model? Describe the steps involved, including data preparation, feature selection, model selection, hyperparameter tuning, and model evaluation.
5. Give an example of building a machine learning or deep learning model using a specific framework or library (e.g., scikit-learn, TensorFlow, PyTorch). Describe the steps involved in building the model, including data preparation, feature selection, model definition, training, and evaluation.
References
1. Chollet F (2018) Deep learning with Python. Manning Publications Co.
2. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press
3. Kaelbling LP, Littman ML, Moore AW (1996) Reinforcement learning: a survey. J Artif Intell Res 4:237–285
4. Sutton RS, Barto AG (2018) Reinforcement learning: an introduction (Adaptive Computation and Machine Learning series), 2nd edn. A Bradford Book
5. Gamba J (2020) Radar signal processing for autonomous driving. Springer
6. Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press Inc., New York
7. Scikit-learn. https://scikit-learn.org/stable/
8. OpenCV. https://www.learnopencv.com/
9. El Mrabet MA, El Makkaoui K, Faize A (2021) Supervised machine learning: a survey. In: 2021 4th International conference on advanced communication technologies and networking (CommNet), pp 1–10. https://doi.org/10.1109/CommNet52204.2021.9641998
Chapter 5
Remote Sensing Example for Deep Learning
5.1 Introduction Recently, remote sensing has become heavily dependent on machine learning algorithms such as decision trees, random forests, support vector machines, and artificial neural networks. However, there is an increasing recognition that deep learning, which has been applied successfully in other areas such as computer vision and language processing, is a viable alternative to traditional machine learning methods [1]. With the availability of high-resolution imagery, it is becoming more attractive to venture into deep learning as a key technology to achieve previously unimaginable classification accuracies [2, 3]. In this chapter, we will work through a specific example of the application of deep learning algorithms to one important area of remote sensing data analysis, namely land cover classification. Land cover and land use change analysis is of importance in many practical areas such as urban planning, environmental degradation monitoring, and disaster management [4, 5].
5.2 Background of the Remote Sensing Example The main goal of this chapter is to provide a detailed understanding of the performance of various deep learning models applied to the problem of land cover classification, starting from a known dataset. We divide the presentation into five main parts, including preliminary information on the models covering input data restrictions, followed by exploration of the EuroSAT data contents, preprocessing steps, and performance evaluation results for several selected models in Sect. 5.3. Finally, we test the performance of the models with a new dataset to get a clear picture of the limitations of the presented approach in the face of unseen data in Sect. 5.4. This application example assumes basic knowledge of the Python programming language. There is an abundance of easy-to-follow material for this topic for readers
of all backgrounds, publicly available starting from python.org. We therefore assume the reader is familiar with Python syntax and with how to find needed solutions on platforms such as Stack Overflow. In addition, it is not the intention of this chapter to provide mathematical details of the inner workings of the algorithms behind the presented models. Having said this, this chapter is meant to give the interested reader a good insight into the performance of the Keras APIs (models) that are available for land cover classification. The techniques introduced here can be extended, improved, and applied to a broad range of problems.
5.3 Remote Sensing: Land Cover Classification The EuroSAT dataset is derived from the openly and freely accessible Sentinel-2 satellite images provided by the Copernicus Earth observation program. It has been demonstrated in [2] that the RGB bands of the Sentinel data give the best results in terms of accuracy. We will therefore only use the RGB dataset in this chapter. This does not in any way mean that the other bands cannot be used for classification. The folder structure for algorithm evaluation is shown in Fig. 5.1. We also give a flow diagram of the approach used to train and test the data in Fig. 5.2.
5.4 Background of Experimental Comparison of Keras Applications Deep Learning Models Performance on EuroSAT Dataset We make a comparison of the performance of various classes of Keras models using the publicly available EuroSAT dataset (https://github.com/phelber/EuroSAT) as input. We also build on the Kaggle example (land-cover-classification-with-eurosat-dataset), which is under the Apache 2.0 open-source software license, to expand the range of models evaluated, and we utilize Google Colab for convenient and efficient execution of the models by taking advantage of the available GPU resources. The high computing power is important during the training phase, where multiple epochs have to be executed. In this chapter, we will specifically evaluate ResNet, VGG, NasNet, and EfficientNet V1 models to see how they compare under similar conditions, except where model-specific treatment is necessary. Since our approach is hands-on, the code for this chapter is available at [6]. We should also emphasize that the scope of this material is the evaluation of model performance; detailed analysis of the implications of these results will be left for later treatment. However, we will also briefly compare how the models perform on completely uncorrelated data to understand the limitations and challenges of deep learning algorithm applications across various datasets. So, this material is just the beginning of a long journey. The details of the evaluated models are as follows.
Fig. 5.1 EuroSAT data folder structure for training and testing
Models Evaluated:
ResNet50
ResNet101
ResNet152
VGG16
VGG19
NasNetLarge
NasNetMobile
EfficientNet B0
EfficientNet B1
EfficientNet B2
EfficientNet B3
EfficientNet B4
EfficientNet B5
Fig. 5.2 Example processing flow of the EuroSAT data for deep learning algorithm evaluation
EfficientNet B6
EfficientNet B7

Comparison Methods and Metrics:
• Training/Validation accuracy and loss (visualization)
• Precision, Recall, F2 score (PRF)
• Confusion matrix.
5.4.1 Input Data Requirements Keras offers many models under the Applications API that can be used as a base on top of which upper layers, including dense layers, can be added. Using this approach, we check how the models perform on the publicly available EuroSAT dataset. For more details about the dataset and how it was collected, refer to https://github.com/phelber/EuroSAT. We briefly discuss the preliminary information related to the Keras models.
5.4.2 Input Restrictions (from the Keras Applications Page) The full details and arguments for each model can be found at the following link: https://keras.io/api/applications/. Here we are only interested in highlighting the limitations imposed on input data for each model at the time of writing. NasNetLarge has the highest top-1 and top-5 accuracy. The top-1 and top-5 accuracy refers to the model's performance on the ImageNet validation dataset. However, there is an issue with earlier implementations of the model as described below. During training, we found it necessary to modify the model library file's (in Keras applications) input shape argument "require_flatten" by setting it to "False" before running the training. Without this modification, an error message like "ValueError: When setting `include_top = True` and loading `imagenet` weights, `input_shape` should be (331, 331, 3)." will be thrown for NasNetLarge even if the argument "include_top" is set to False. The argument "require_flatten" is set to "True" by default, hence the need to make this adjustment to avoid the bug. For the EfficientNet models, however, the input argument is set to "require_flatten = include_top" by default, with the restriction that min_size = 32. On the other hand, the min_size restriction was not documented in the above API link at the time of writing.
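For context, the instantiation in question looks like the following minimal sketch; the (64, 64, 3) shape is simply our EuroSAT input size, not a requirement of the API.

from tensorflow.keras.applications.nasnet import NASNetLarge

# With include_top=False, an arbitrary input shape such as EuroSAT's
# 64 x 64 RGB patches should be accepted as a convolution base. Per the
# issue described above, this call raised the ValueError for NasNetLarge
# at the time of writing unless require_flatten was set to False.
conv_base = NASNetLarge(include_top=False,
                        weights='imagenet',
                        input_shape=(64, 64, 3))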
5.4.2.1 ResNet50, ResNet101, ResNet152

ResNet50
input_shape: Optional shape tuple, only to be specified if include_top is False (otherwise the input shape has to be (224, 224, 3) with 'channels_last' data format or (3, 224, 224) with 'channels_first' data format). It should have exactly 3 input channels, and width and height should be no smaller than 32, e.g., (200, 200, 3) would be one valid value.
ResNet101
input_shape: Optional shape tuple, only to be specified if include_top is False (otherwise the input shape has to be (224, 224, 3) with 'channels_last' data format or (3, 224, 224) with 'channels_first' data format). It should have exactly 3 input channels, and width and height should be no smaller than 32, e.g., (200, 200, 3) would be one valid value.
ResNet152
input_shape: Optional shape tuple, only to be specified if include_top is False (otherwise the input shape has to be (224, 224, 3) with 'channels_last' data format or (3, 224, 224) with 'channels_first' data format). It should have exactly 3 input channels, and width and height should be no smaller than 32, e.g., (200, 200, 3) would be one valid value.
5.4.2.2 VGG16 and VGG19

VGG16
input_shape: Optional shape tuple, only to be specified if include_top is False (otherwise the input shape has to be (224, 224, 3) with channels_last data format or (3, 224, 224) with channels_first data format). It should have exactly 3 input channels, and width and height should be no smaller than 32, e.g., (200, 200, 3) would be one valid value.
VGG19
input_shape: Optional shape tuple, only to be specified if include_top is False (otherwise the input shape has to be (224, 224, 3) with channels_last data format or (3, 224, 224) with channels_first data format). It should have exactly 3 input channels, and width and height should be no smaller than 32, e.g., (200, 200, 3) would be one valid value.
5.4.2.3 NasNetLarge

input_shape: Optional shape tuple, only to be specified if include_top is False (otherwise the input shape has to be (331, 331, 3) for NasNetLarge). It should have exactly 3 input channels, and width and height should be no smaller than 32, e.g., (224, 224, 3) would be one valid value.

5.4.2.4 NasNetMobile

input_shape: Optional shape tuple, only to be specified if include_top is False (otherwise the input shape has to be (224, 224, 3) for NasNetMobile). It should have exactly 3 input channels, and width and height should be no smaller than 32, e.g., (224, 224, 3) would be one valid value.
5.4.2.5 EfficientNet B0 to B7

No input width and height restriction! input_shape: Optional shape tuple, only to be specified if include_top is False. It should have exactly 3 input channels.
5.4.3 Training and Test Results Below we give a visual summary of the results obtained by running the above as convolution base models under similar settings for each class of models and using the same input shape (64 × 64 × 3). The first part of the simulation used a training–test split of 70/30, while the latter half used 80/20.
Please refer to the companion Notebook (eurosat-projectbook-blg.ipynb) for further hands-on experience [6].
5.4.3.1 Data Exploration
Import the required libraries for data loading and preprocessing.

import os                        # file directory manipulation
import shutil                    # file copying, etc.
import random                    # random number generation
from tqdm import tqdm            # execution progress
import numpy as np               # array processing
import pandas as pd              # data folder manipulation
import PIL                       # image visualization and processing tool
import matplotlib.pyplot as plt  # plotting functions
Mount Google Drive to access files from Google Colab.

from google.colab import drive
drive.mount('/content/drive')
Set the EuroSAT dataset path and extract the labels. There are 10 classes, namely AnnualCrop, Pasture, PermanentCrop, Residential, Industrial, River, SeaLake, HerbaceousVegetation, Highway, and Forest.

DATASET = "/content/drive/My Drive/Colab Notebooks/EuroSAT/2750"
LABELS = os.listdir(DATASET)
print(LABELS)
Next, we plot the class distributions of the EuroSAT dataset. There are a total of 27,000 images distributed among the classes as shown in Fig. 5.3 (a sketch of how such a plot can be produced follows below). Select 20 images arbitrarily from the whole dataset and show the classes to which they belong.
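The plotting code is not reproduced in the text; a minimal sketch that counts the images per class folder and draws a bar chart along the lines of Fig. 5.3 could look like this (the counting logic is an assumption based on the folder layout shown in Fig. 5.1):

# Count the number of images in each class folder and plot the distribution
counts = {l: len(os.listdir(os.path.join(DATASET, l))) for l in LABELS}

plt.figure(figsize=(10, 4))
plt.bar(counts.keys(), counts.values())
plt.xticks(rotation=45, ha='right')
plt.ylabel('number of images')
plt.title('EuroSAT class distribution')
plt.tight_layout()
plt.show()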
Fig. 5.3 Distribution of data among the classes
img_paths = [os.path.join(DATASET, l, l+'_1000.jpg') for l in LABELS]
img_paths = img_paths + [os.path.join(DATASET, l, l+'_2000.jpg') for l in LABELS]

def plot_sat_imgs(paths):
    plt.figure(figsize=(15, 8))
    for i in range(20):
        plt.subplot(4, 5, i+1, xticks=[], yticks=[])
        img = PIL.Image.open(paths[i], 'r')
        plt.imshow(np.asarray(img))
        plt.title(paths[i].split('/')[-2])

plot_sat_imgs(img_paths)
Fig. 5.4 Samples arbitrarily selected from the dataset for visual inspection
Figure 5.4 shows the selected samples. The sample data shows the variability in the contents of the classes, from AnnualCrop to Forest. Some similarities can be observed, for example, between the Highway and Industrial classes. The challenge for the deep learning algorithms is to be able to distinguish these classes by minimizing false positives and false negatives, among other metrics that can be used. Although NIR band data is available, our evaluation will solely use the RGB bands.
5.4.3.2 Data Preprocessing
Next the data is split into training and test sets using stratified shuffle-split from scikit-learn. We also make use of the Keras ImageDataGenerator for data augmentation. The full listing, including creation of the working directories, follows.
We copy train and test data into respective folders.
import re  # regular expression package for string pattern processing
from sklearn.model_selection import StratifiedShuffleSplit  # stratified sampling for train-test split
from keras.preprocessing.image import ImageDataGenerator  # image augmentation utility

TRAIN_DIR = '/content/drive/My Drive/Colab Notebooks/EuroSAT/working/training'
TEST_DIR = '/content/drive/My Drive/Colab Notebooks/EuroSAT/working/testing'
BATCH_SIZE = 64
NUM_CLASSES = len(LABELS)
INPUT_SHAPE = (64, 64, 3)
CLASS_MODE = 'categorical'

# Create training and testing directories
for path in (TRAIN_DIR, TEST_DIR):
    if not os.path.exists(path):
        os.mkdir(path)

# Create class label subdirectories in train and test
for l in LABELS:
    if not os.path.exists(os.path.join(TRAIN_DIR, l)):
        os.mkdir(os.path.join(TRAIN_DIR, l))
    if not os.path.exists(os.path.join(TEST_DIR, l)):
        os.mkdir(os.path.join(TEST_DIR, l))
# Execute this once to load split data into train and test folders respectively
data = {}
for l in LABELS:
    for img in os.listdir(DATASET+'/'+l):
        data.update({os.path.join(DATASET, l, img): l})

X = pd.Series(list(data.keys()))
y = pd.get_dummies(pd.Series(data.values()))

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=69)

# Split the list of image paths
for train_idx, test_idx in split.split(X, y):
    train_paths = X[train_idx]
    test_paths = X[test_idx]

# Define a new path for each image depending on training or testing
s_in = '/content/drive/My Drive/Colab Notebooks/EuroSAT/2750'
s_train = '/content/drive/My Drive/Colab Notebooks/EuroSAT/working/training'
s_test = '/content/drive/My Drive/Colab Notebooks/EuroSAT/working/testing'

new_train_paths = [re.sub(s_in, s_train, i) for i in train_paths]
new_test_paths = [re.sub(s_in, s_test, i) for i in test_paths]

train_path_map = list(zip(train_paths, new_train_paths))
test_path_map = list(zip(test_paths, new_test_paths))
# Move the files
print("moving training files..")
for i in tqdm(train_path_map):
    if not os.path.exists(i[1]):
        if not os.path.exists(re.sub('training', 'testing', i[1])):
            shutil.copy(i[0], i[1])

print("moving testing files..")
for i in tqdm(test_path_map):
    if not os.path.exists(i[1]):
        if not os.path.exists(re.sub('training', 'testing', i[1])):
            shutil.copy(i[0], i[1])

# Create an ImageDataGenerator instance which can be used for data augmentation
train_gen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=60,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    vertical_flip=True
)

train_generator = train_gen.flow_from_directory(
    directory=TRAIN_DIR,
    target_size=(64, 64),
    batch_size=BATCH_SIZE,
    class_mode=CLASS_MODE,
    color_mode='rgb',
    shuffle=True,
    seed=69
)
# Test generator for evaluation purposes with no augmentations, just rescaling
test_gen = ImageDataGenerator(
    rescale=1./255,
)

test_generator = test_gen.flow_from_directory(
    directory=TEST_DIR,
    target_size=(64, 64),
    batch_size=BATCH_SIZE,
    class_mode=CLASS_MODE,
    color_mode='rgb',
    shuffle=False,
    seed=69
)
Confirm and save the class indices.

# Print class indices
print(train_generator.class_indices)

# Save class indices
np.save('class_indices', train_generator.class_indices)
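Since np.save pickles the dictionary, reading it back in a later session requires allow_pickle; a minimal sketch:

# Load the saved class index dictionary back from class_indices.npy
class_indices = np.load('class_indices.npy', allow_pickle=True).item()
print(class_indices)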
The next sections provide details of the deep learning algorithms that will be used in the evaluation. We will start by importing all the necessary packages, followed by the definition of a generic function for model compilation and then some functions to plot and visualize results. The ResNet framework model will be taken as an example to demonstrate the evaluation procedure. After that, results of other selected models will be presented.
import tensorflow as tf
from keras.models import Model
from keras.layers import Dense, Dropout, Flatten, GlobalAveragePooling2D, BatchNormalization
from keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam
from keras.applications.vgg16 import VGG16
from tensorflow.keras.applications.vgg19 import VGG19
from tensorflow.keras.applications.resnet import ResNet50, ResNet101, ResNet152
from tensorflow.keras.applications import ResNet50V2, ResNet152V2
from tensorflow.keras.applications import Xception, InceptionV3
from tensorflow.keras.applications.nasnet import NASNetLarge, NASNetMobile
from tensorflow.keras.applications import EfficientNetB0, EfficientNetB1, EfficientNetB2
from tensorflow.keras.applications import EfficientNetB3, EfficientNetB4, EfficientNetB5
from tensorflow.keras.applications import EfficientNetB6, EfficientNetB7
from tensorflow.keras.regularizers import l2, l1, l1_l2
from tensorflow.keras import regularizers
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix, fbeta_score, accuracy_score
Configure GPUs for processing if available. It is recommended to use the first available GPU for TensorFlow processing (https://www.tensorflow.org/guide/gpu#using_multiple_gpus).

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only use the first GPU
    try:
        tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        print(e)
We then define the generic function for model selection and compilation using the following function.

# Note that for different CNN models we will be using different setups of dense layers
def compile_model(cnn_base, input_shape, n_classes, optimizer, fine_tune=None):
    if cnn_base in ('ResNet50', 'ResNet50V2', 'ResNet101', 'ResNet152', 'ResNet152V2'):
        if cnn_base == 'ResNet50':
            conv_base = ResNet50(include_top=False, weights='imagenet', input_shape=input_shape)
        elif cnn_base == 'ResNet50V2':
            conv_base = ResNet50V2(include_top=False, weights='imagenet', input_shape=input_shape)
        elif cnn_base == 'ResNet101':
            conv_base = ResNet101(include_top=False, weights='imagenet', input_shape=input_shape)
        elif cnn_base == 'ResNet152':
            conv_base = ResNet152(include_top=False, weights='imagenet', input_shape=input_shape)
        else:
            conv_base = ResNet152V2(include_top=False, weights='imagenet', input_shape=input_shape)
        top_model = conv_base.output
        top_model = Flatten()(top_model)
        top_model = Dense(1024, activity_regularizer=regularizers.l2(1e-4), activation='relu')(top_model)
        top_model = Dense(1024, activation='relu')(top_model)
        top_model = BatchNormalization()(top_model)  # added
        top_model = Dropout(0.2)(top_model)
        top_model = Dense(1024, activity_regularizer=regularizers.l2(1e-4), activation='relu')(top_model)
        top_model = Dense(1024, activation='relu')(top_model)
        top_model = BatchNormalization()(top_model)  # added
        top_model = Dropout(0.2)(top_model)
    elif cnn_base in ('VGG16', 'VGG19'):
        if cnn_base == 'VGG16':
            conv_base = VGG16(include_top=False, weights='imagenet', input_shape=input_shape)
        else:
            conv_base = VGG19(include_top=False, weights='imagenet', input_shape=input_shape)
        top_model = conv_base.output
        top_model = Flatten()(top_model)
        top_model = Dense(1024, activity_regularizer=regularizers.l2(1e-4), activation='relu')(top_model)
        top_model = BatchNormalization()(top_model)
        top_model = Dropout(0.2)(top_model)
        top_model = Dense(1024, activity_regularizer=regularizers.l2(1e-4), activation='relu')(top_model)
        top_model = BatchNormalization()(top_model)
        top_model = Dropout(0.2)(top_model)
    elif cnn_base in ('Xception', 'InceptionV3'):
        if cnn_base == 'Xception':
            conv_base = Xception(include_top=False, weights='imagenet', input_shape=input_shape)
        else:
            conv_base = InceptionV3(include_top=False, weights='imagenet', input_shape=input_shape)
        top_model = conv_base.output
        top_model = GlobalAveragePooling2D()(top_model)
        top_model = Dense(2048, activation='relu')(top_model)
        top_model = Dropout(0.2)(top_model)
        top_model = Dense(2048, activation='relu')(top_model)
        top_model = Dropout(0.2)(top_model)
    elif cnn_base in ('NASNetLarge', 'NASNetMobile'):
        if cnn_base == 'NASNetLarge':
            conv_base = NASNetLarge(include_top=False, weights='imagenet', input_shape=input_shape)
        else:
            conv_base = NASNetMobile(include_top=False, weights='imagenet', input_shape=input_shape)
        top_model = conv_base.output
        top_model = GlobalAveragePooling2D()(top_model)
        top_model = Dense(2048, activation='relu')(top_model)
        top_model = BatchNormalization()(top_model)
        top_model = Dropout(0.2)(top_model)
        top_model = Dense(2048, activation='relu')(top_model)
        top_model = BatchNormalization()(top_model)
        top_model = Dropout(0.2)(top_model)
    elif cnn_base in ('EfficientNetB0', 'EfficientNetB1', 'EfficientNetB2', 'EfficientNetB3',
                      'EfficientNetB4', 'EfficientNetB5', 'EfficientNetB6', 'EfficientNetB7'):
        if cnn_base == 'EfficientNetB0':
            conv_base = EfficientNetB0(include_top=False, weights="imagenet", input_shape=input_shape)
        elif cnn_base == 'EfficientNetB1':
            conv_base = EfficientNetB1(include_top=False, weights="imagenet", input_shape=input_shape)
        elif cnn_base == 'EfficientNetB2':
            conv_base = EfficientNetB2(include_top=False, weights="imagenet", input_shape=input_shape)
        elif cnn_base == 'EfficientNetB3':
            conv_base = EfficientNetB3(include_top=False, weights="imagenet", input_shape=input_shape)
        elif cnn_base == 'EfficientNetB4':
            conv_base = EfficientNetB4(include_top=False, weights="imagenet", input_shape=input_shape)
        elif cnn_base == 'EfficientNetB5':
            conv_base = EfficientNetB5(include_top=False, weights="imagenet", input_shape=input_shape)
        elif cnn_base == 'EfficientNetB6':
            conv_base = EfficientNetB6(include_top=False, weights="imagenet", input_shape=input_shape)
        else:
            conv_base = EfficientNetB7(include_top=False, weights="imagenet", input_shape=input_shape)
        top_model = conv_base.output
        top_model = GlobalAveragePooling2D()(top_model)
        top_model = Dense(2048, activation='relu')(top_model)
        top_model = BatchNormalization()(top_model)
        top_model = Dropout(0.2)(top_model)
        top_model = Dense(2048, activation='relu')(top_model)
        top_model = BatchNormalization()(top_model)
        top_model = Dropout(0.2)(top_model)

    output_layer = Dense(n_classes, activation='softmax')(top_model)
    model = Model(inputs=conv_base.input, outputs=output_layer)

    # Freeze the convolution base by default; if fine_tune is an integer,
    # unfreeze the layers from that index onwards
    if type(fine_tune) == int:
        for layer in conv_base.layers[fine_tune:]:
            layer.trainable = True
    else:
        for layer in conv_base.layers:
            layer.trainable = False

    model.compile(optimizer=optimizer,
                  loss='categorical_crossentropy',
                  metrics=['categorical_accuracy'])
    return model
Utility functions to plot training/validation progress and display results.
def plot_history(history):
    acc = history.history['categorical_accuracy']
    val_acc = history.history['val_categorical_accuracy']
    loss = history.history['loss']
    val_loss = history.history['val_loss']

    plt.figure(figsize=(10, 5))
    plt.subplot(1, 2, 1)
    plt.plot(acc)
    plt.plot(val_acc)
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'val'], loc='upper left')

    plt.subplot(1, 2, 2)
    plt.plot(loss)
    plt.plot(val_loss)
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'val'], loc='upper left')

    plt.show()
def display_results(y_true, y_preds, class_labels):
    results = pd.DataFrame(precision_recall_fscore_support(y_true, y_preds),
                           columns=class_labels).T
    results.rename(columns={0: 'Precision', 1: 'Recall',
                            2: 'F-Score', 3: 'Support'}, inplace=True)
    conf_mat = pd.DataFrame(confusion_matrix(y_true, y_preds),
                            columns=class_labels, index=class_labels)
    f2 = fbeta_score(y_true, y_preds, beta=2, average='micro')
    accuracy = accuracy_score(y_true, y_preds)
    print(f"Accuracy: {accuracy}")
    print(f"Global F2 Score: {f2}")
    return results, conf_mat

def plot_predictions(y_true, y_preds, test_generator, class_indices):
    fig = plt.figure(figsize=(20, 10))
    for i, idx in enumerate(np.random.choice(test_generator.samples, size=20, replace=False)):
        ax = fig.add_subplot(4, 5, i + 1, xticks=[], yticks=[])
        ax.imshow(np.squeeze(test_generator[idx]))
        pred_idx = np.argmax(y_preds[idx])
        true_idx = y_true[idx]
        plt.tight_layout()
        ax.set_title("{}\n({})".format(class_indices[pred_idx], class_indices[true_idx]),
                     color=("green" if pred_idx == true_idx else "red"))
Example of evaluating the ResNet50 model.
# Model compilation and summary display
resnet50_model = compile_model('ResNet50', INPUT_SHAPE, NUM_CLASSES,
                               Adam(learning_rate=1e-2), fine_tune=None)
resnet50_model.summary()

# Initialize the train and test generators
train_generator.reset()
test_generator.reset()

N_STEPS = train_generator.samples//BATCH_SIZE
N_VAL_STEPS = test_generator.samples//BATCH_SIZE
N_EPOCHS = 100

# Define model callbacks
checkpoint = ModelCheckpoint(filepath='/content/drive/My Drive/Colab Notebooks/EuroSAT/working/model.weights.best.hdf5',
                             monitor='val_categorical_accuracy',
                             save_best_only=True,
                             verbose=1)

early_stop = EarlyStopping(monitor='val_categorical_accuracy',
                           patience=10,
                           restore_best_weights=True,
                           mode='max')

reduce_lr = ReduceLROnPlateau(monitor='val_categorical_accuracy',
                              factor=0.5,
                              patience=3,
                              min_lr=0.00001)
# First perform pretraining of the dense layers
resnet50_history = resnet50_model.fit(train_generator,
                                      steps_per_epoch=N_STEPS,
                                      epochs=50,
                                      callbacks=[early_stop, checkpoint],
                                      validation_data=test_generator,
                                      validation_steps=N_VAL_STEPS)

# Re-train the whole network end to end
resnet50_model = compile_model('ResNet50', INPUT_SHAPE, NUM_CLASSES,
                               Adam(learning_rate=1e-4), fine_tune=0)
resnet50_model.load_weights('/content/drive/My Drive/Colab Notebooks/EuroSAT/working/model.weights.best.hdf5')

train_generator.reset()
test_generator.reset()

resnet50_history = resnet50_model.fit(train_generator,
                                      steps_per_epoch=N_STEPS,
                                      epochs=N_EPOCHS,
                                      callbacks=[early_stop, checkpoint, reduce_lr],
                                      validation_data=test_generator,
                                      validation_steps=N_VAL_STEPS)
# Plot loss and accuracy
plot_history(resnet50_history)

resnet50_model.load_weights('/content/drive/My Drive/Colab Notebooks/EuroSAT/working/model.weights.best.hdf5')

# Evaluate the model on test data and compute precision, recall,
# F-score and the confusion matrix. Display the PRF table.
class_indices = train_generator.class_indices
class_indices = dict((v, k) for k, v in class_indices.items())

test_generator_new = test_gen.flow_from_directory(
    directory=TEST_DIR,
    target_size=(64, 64),
    batch_size=1,
    class_mode=None,
    color_mode='rgb',
    shuffle=False,
    seed=69
)

predictions = resnet50_model.predict(test_generator_new,
                                     steps=len(test_generator_new.filenames))
predicted_classes = np.argmax(predictions, axis=1)
true_classes = test_generator_new.classes

prf, conf_mat = display_results(true_classes, predicted_classes, class_indices.values())
prf

# Display the confusion matrix
conf_mat

# Save the model and the weights
resnet50_model.save('/content/drive/My Drive/Colab Notebooks/EuroSAT/working/ResNet50_eurosat.h5')
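The saved .h5 file can later be restored without retraining; a minimal sketch (the path is the one used above):

from keras.models import load_model

# Restore the trained model, including architecture, weights and optimizer state
resnet50_model = load_model('/content/drive/My Drive/Colab Notebooks/EuroSAT/working/ResNet50_eurosat.h5')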
Repeating the steps used above for the ResNet50 model, all the other models can be similarly evaluated. We present some selected model evaluation results below.

Basic Assumptions and Definitions: The training/validation accuracy and loss are defined for each model. The models are compiled with the Adam optimizer and categorical cross-entropy loss, and use the categorical accuracy metric. This can be easily accomplished in Keras with one line of code. There is no fine-tuning applied. Precision is the ratio of true positives to all predicted positives of a given class. Recall is the ratio of true positives to all actual positives of a given class that should have been identified, i.e., it includes false negatives. The F2-score is a weighted average of recall and precision that gives more weight to recall than precision. It is defined by 5 ∗ precision ∗ recall/(4 ∗ precision + recall). For a given dataset, the support is defined as the number of occurrences of each class. The expectation is to have balanced data in which there are no huge differences in support. In terms of simulation settings, 100 or 200 epochs were set for all models. Additionally, an EarlyStopping patience of 10 and a ReduceLROnPlateau patience of 5 were used as defaults. The Adam optimizer with default parameters was also applied to all models.

ResNet Models

ResNet50

Training/Validation Accuracy and Loss
See Fig. 5.5. There is a steady increase in both training and validation accuracy. No over-fitting can be observed, although the early stopping comes in only after the 40th epoch.
Fig. 5.5 Training and validation loss and accuracy for ResNet50 model
This is due to lack of improvement in validation categorical accuracy. The loss is small and close to zero after the initial swings.

Precision, Recall, F-Score, Support
See Fig. 5.6. High recall (> 99%) can be obtained for the Forest class (99.9% recall, 95.1% precision) and the Residential class (100% recall, 94.7% precision). The global F2-score, not shown here, is estimated to be 96.4%.

Confusion Matrix
See Fig. 5.7.

Confusion Matrix (%)
See Fig. 5.8.
Fig. 5.6 Precision, recall, and F-score of each of the classes for ResNet50 model
Fig. 5.7 Confusion matrix showing number of hits for each of the classes for ResNet50 model
Confusion Matrix (%)
(rows: actual class; columns: predicted class, both in the order AnnualCrop, Forest, HerbaceousVegetation, Highway, Industrial, Pasture, PermanentCrop, Residential, River, SeaLake)
AnnualCrop: 0.96 0.00 0.00 0.00 0.00 0.01 0.03 0.00 0.01 0.00
Forest: 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation: 0.00 0.01 0.95 0.00 0.00 0.00 0.03 0.01 0.00 0.00
Highway: 0.01 0.00 0.00 0.94 0.01 0.00 0.00 0.01 0.03 0.00
Industrial: 0.00 0.00 0.00 0.00 0.97 0.00 0.00 0.03 0.00 0.00
Pasture: 0.01 0.03 0.01 0.00 0.00 0.94 0.01 0.00 0.00 0.00
PermanentCrop: 0.00 0.00 0.02 0.00 0.01 0.00 0.95 0.01 0.00 0.00
Residential: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River: 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.98 0.00
SeaLake: 0.02 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.94
Fig. 5.8 Confusion matrix showing ratio of hits for each of the classes for ResNet50 model. The Highway, Pasture, and SeaLake classes show the lowest accuracy of 94%, while the Forest and Residential can be classified with 100% accuracy.
ResNet101

Training/Validation Accuracy
See Fig. 5.9. There is a steady increase in both training and validation accuracy. No over-fitting can be observed, although the early stopping comes in only after the 35th epoch due to lack of improvement in validation categorical accuracy. The loss is small and close to zero after a high initial validation loss.

Precision, Recall, F-Score, Support
See Fig. 5.10. High recall (> 99%) can be obtained for the Forest class (100% recall, 94.2% precision) and the Residential class (99.8% recall, 94.1% precision). The global F2-score, not shown here, is estimated to be 96.6%.
Fig. 5.9 Training and validation loss and accuracy for ResNet101 model
Fig. 5.10 Precision, recall, and F-score of each of the classes for ResNet101 model
Confusion Matrix
See Fig. 5.11.

Confusion Matrix (%)
See Fig. 5.12.

ResNet152

Training/Validation Accuracy
See Fig. 5.13. There is a steady increase in both training and validation accuracy. No over-fitting can be observed, although the early stopping comes in only after the 35th epoch due to lack of improvement in validation categorical accuracy. The loss is small and close to zero after the initial swings.
Fig. 5.11 Confusion matrix showing number of hits for each of the classes for ResNet101 model
Confusion Matrix (%)
(rows: actual class; columns: predicted class, both in the order AnnualCrop, Forest, HerbaceousVegetation, Highway, Industrial, Pasture, PermanentCrop, Residential, River, SeaLake)
AnnualCrop: 0.96 0.00 0.00 0.00 0.00 0.00 0.03 0.00 0.01 0.00
Forest: 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation: 0.00 0.01 0.96 0.00 0.00 0.00 0.01 0.01 0.00 0.00
Highway: 0.00 0.00 0.00 0.96 0.01 0.00 0.00 0.00 0.02 0.00
Industrial: 0.00 0.00 0.00 0.00 0.96 0.00 0.00 0.04 0.00 0.00
Pasture: 0.01 0.03 0.01 0.00 0.00 0.93 0.01 0.00 0.00 0.00
PermanentCrop: 0.01 0.00 0.02 0.00 0.00 0.00 0.95 0.01 0.00 0.00
Residential: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River: 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.98 0.00
SeaLake: 0.00 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.95
Fig. 5.12 Confusion matrix showing ratio of hits for each of the classes for ResNet101 model. The Pasture class shows the lowest accuracy of 93%, while the Forest and Residential can be classified with 100% accuracy
Fig. 5.13 Training and validation loss and accuracy for ResNet152 model
Precision, Recall, F-Score, Support
See Fig. 5.14. High recall (> 99%) can be obtained for the Forest class (99.8% recall, 93.5% precision) and the Residential class (100% recall, 91.7% precision). The global F2-score, not shown here, is estimated to be 95.5%.

Confusion Matrix
See Fig. 5.15.

Confusion Matrix (%)
See Fig. 5.16.

VGG Models

VGG16
Fig. 5.14 Precision, recall, and F-score of each of the classes for ResNet152 model
Confusion Matrix
(rows: actual class; columns: predicted class, both in the order AnnualCrop, Forest, HerbaceousVegetation, Highway, Industrial, Pasture, PermanentCrop, Residential, River, SeaLake)
AnnualCrop: 1151 1 0 4 0 6 44 0 5 0
Forest: 0 1202 0 0 0 0 0 2 0 0
HerbaceousVegetation: 9 19 1093 2 1 5 36 31 0 0
Highway: 15 0 5 1135 10 5 5 16 23 0
Industrial: 0 0 0 1 720 0 0 28 1 0
Pasture: 6 25 8 0 1 553 5 0 2 0
PermanentCrop: 3 1 15 2 9 0 799 12 1 0
Residential: 0 0 0 0 0 0 0 993 0 0
River: 6 1 1 18 1 0 1 1 829 0
SeaLake: 12 37 1 0 0 1 0 0 4 959
Fig. 5.15 Confusion matrix showing number of hits for each of the classes for ResNet152 model
Confusion Matrix (%)
(rows: actual class; columns: predicted class, both in the order AnnualCrop, Forest, HerbaceousVegetation, Highway, Industrial, Pasture, PermanentCrop, Residential, River, SeaLake)
AnnualCrop: 0.95 0.00 0.00 0.00 0.00 0.00 0.04 0.00 0.00 0.00
Forest: 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation: 0.01 0.02 0.91 0.00 0.00 0.00 0.03 0.03 0.00 0.00
Highway: 0.01 0.00 0.00 0.93 0.01 0.00 0.00 0.01 0.02 0.00
Industrial: 0.00 0.00 0.00 0.00 0.96 0.00 0.00 0.04 0.00 0.00
Pasture: 0.01 0.04 0.01 0.00 0.00 0.92 0.01 0.00 0.00 0.00
PermanentCrop: 0.00 0.00 0.02 0.00 0.01 0.00 0.95 0.01 0.00 0.00
Residential: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River: 0.01 0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.97 0.00
SeaLake: 0.01 0.04 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.95
Fig. 5.16 Confusion matrix showing ratio of hits for each of the classes for ResNet152 model. The HerbaceousVegetation class shows the lowest accuracy of 91% while the Forest and Residential can be classified with 100% accuracy
5.4 Background of Experimental Comparison of Keras Applications Deep …
115
Fig. 5.17 Training and validation loss and accuracy for VGG16 model
Training/Validation Accuracy
See Fig. 5.17. There is a steady increase in both training and validation accuracy. No over-fitting can be observed, although the early stopping comes in only after the 40th epoch due to lack of improvement in validation categorical accuracy. The loss gradually decreases to below 0.2.

Precision, Recall, F-Score, Support
See Fig. 5.18. High recall (> 99%) can be obtained for the Forest class (99.2% recall, 97.9% precision) and the Residential class (99.9% recall, 96.1% precision). The global F2-score, not shown here, is estimated to be 97.1%.

Confusion Matrix
See Fig. 5.19.

Confusion Matrix (%)
See Fig. 5.20.

VGG19

Training/Validation Accuracy
See Fig. 5.21. There is a steady increase in both training and validation accuracy. No over-fitting can be observed, although the early stopping comes in only after the 35th epoch due to lack of improvement in validation categorical accuracy. The loss gradually decreases to below 0.25.
Fig. 5.18 Precision, recall, and F-score of each of the classes for VGG16 model
Fig. 5.19 Confusion matrix showing number of hits for each of the classes for VGG16 model
Confusion Matrix (%)
(rows: actual class; columns: predicted class, both in the order AnnualCrop, Forest, HerbaceousVegetation, Highway, Industrial, Pasture, PermanentCrop, Residential, River, SeaLake)
AnnualCrop: 0.93 0.00 0.00 0.00 0.00 0.01 0.06 0.00 0.00 0.00
Forest: 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation: 0.01 0.01 0.95 0.00 0.00 0.01 0.01 0.01 0.00 0.00
Highway: 0.00 0.00 0.00 0.98 0.00 0.00 0.00 0.00 0.01 0.00
Industrial: 0.00 0.00 0.00 0.00 0.97 0.00 0.00 0.02 0.00 0.00
Pasture: 0.01 0.00 0.01 0.00 0.00 0.98 0.01 0.00 0.00 0.00
PermanentCrop: 0.01 0.00 0.01 0.00 0.01 0.00 0.96 0.01 0.00 0.00
Residential: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River: 0.01 0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.98 0.00
SeaLake: 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.98
Fig. 5.20 Confusion matrix showing ratio of hits for each of the classes for VGG16 model. The AnnualCrop class shows the lowest accuracy of 93% while the Residential class can be classified with 100% accuracy
Fig. 5.21 Training and validation loss and accuracy for VGG19 model
Precision, Recall, F-Score, Support
See Fig. 5.22. High recall (> 99%) can be obtained for the Forest class (99.5% recall, 97.8% precision) and the Residential class (99.5% recall, 97.6% precision). The global F2-score, not shown here, is estimated to be 96.8%.

Confusion Matrix
See Fig. 5.23.
Fig. 5.22 Precision, recall, and F-score of each of the classes for VGG19 model
Fig. 5.23 Confusion matrix showing number of hits for each of the classes for VGG19 model
Confusion Matrix (%)
(rows: actual class; columns: predicted class, both in the order AnnualCrop, Forest, HerbaceousVegetation, Highway, Industrial, Pasture, PermanentCrop, Residential, River, SeaLake)
AnnualCrop: 0.94 0.00 0.00 0.00 0.00 0.01 0.04 0.00 0.01 0.00
Forest: 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
HerbaceousVegetation: 0.01 0.01 0.96 0.00 0.00 0.01 0.00 0.00 0.00 0.00
Highway: 0.02 0.00 0.00 0.96 0.00 0.00 0.00 0.00 0.01 0.00
Industrial: 0.00 0.00 0.00 0.00 0.97 0.00 0.00 0.03 0.00 0.00
Pasture: 0.00 0.01 0.00 0.00 0.00 0.99 0.00 0.00 0.00 0.00
PermanentCrop: 0.02 0.00 0.03 0.00 0.01 0.00 0.93 0.00 0.00 0.00
Residential: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
River: 0.01 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.98 0.00
SeaLake: 0.01 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.96
Fig. 5.24 Confusion matrix showing ratio of hits for each of the classes for VGG19 model. The PermanentCrop class shows the lowest accuracy of 93%, while the Forest and Residential can be classified with 100% accuracy
Confusion Matrix (%)
See Fig. 5.24.

NasNet Models

NasNetLarge

Training/Validation Accuracy
See Fig. 5.25. There is a steady increase in both training and validation accuracy. No over-fitting can be observed, although the early stopping comes in only after the 85th epoch due to lack of improvement in validation categorical accuracy. The loss remains close to zero after initial large swings.

Precision, Recall, F-Score, Support
See Fig. 5.26. High recall (> 99%) can be obtained for the Forest class (99.9% recall, 97.9% precision) and the Residential class (100% recall, 95.0% precision). The global F2-score, not shown here, is estimated to be 97.6%.

Confusion Matrix
See Fig. 5.27.
Fig. 5.25 Training and validation loss and accuracy for NasNetLarge model
Fig. 5.26 Precision, recall, and F-score of each of the classes for NasNetLarge model
Confusion Matrix (%)
See Fig. 5.28.

NasNetMobile

Training/Validation Accuracy
See Fig. 5.29. There is a steady increase in training accuracy. Over-fitting can be observed immediately after the 1st epoch. Early stopping comes in as early as the 10th epoch.
Fig. 5.27 Confusion matrix showing number of hits for each of the classes for NasNetLarge model
Fig. 5.28 Confusion matrix showing ratio of hits for each of the classes for NasNetLarge model. The Highway and Pasture classes show the lowest accuracy of 95%, while the Forest and Residential can be classified with 100% accuracy
Fig. 5.29 Training and validation loss and accuracy for NasNetMobile model
lack of improvement in validation categorical accuracy. The validation loss diverges despite the decreasing training loss. In this case, training stopped much earlier than for the other models. Further model tuning is required to improve both accuracy and loss and to avoid over-fitting.
Precision, Recall, F-Score, Support
See Fig. 5.30. The precision and recall were lower than expected. The global F2-score (not shown here) is estimated to be 63.0%. Further investigation is necessary to improve the performance.
Confusion Matrix
See Fig. 5.31.
Fig. 5.30 Precision, recall, and F-score of each of the classes for NasNetMobile model
Fig. 5.31 Confusion matrix showing number of hits for each of the classes for NasNetMobile model
Confusion Matrix (%)
See Fig. 5.32. The NasNetMobile model appears unsuitable for this classification task, as it shows the worst performance among all models evaluated so far.
EfficientNet Models
EfficientNet models trade off accuracy for speed. Therefore, the results are expected to show lower performance compared to the VGG and NasNet models.
EfficientNet B0
Training/Validation Accuracy
See Fig. 5.33. There is a steady increase in both training and validation accuracy. No over-fitting can be observed, although early stopping comes in only after the 35th epoch due to
Fig. 5.32 Confusion matrix showing ratio of hits for each of the classes for NasNetMobile model. The Highway class shows the lowest accuracy of 35%, while the AnnualCrop class shows the highest accuracy of 95%
Fig. 5.33 Training and validation loss and accuracy for EfficientNet B0 model
lack of improvement in validation categorical accuracy. The loss is small and close to zero after the initial swings.
Precision, Recall, F-Score, Support
See Fig. 5.34. The best-performing classes were Forest (96.3% recall, 88.5% precision) and Residential (97.6% recall, 75.1% precision). The global F2-score (not shown here) is estimated to be 73.1% (see Table 5.1).
Confusion Matrix
See Fig. 5.35.
Confusion Matrix (%)
See Fig. 5.36.
Fig. 5.34 Precision, recall, and F-score of each of the classes for EfficientNet B0 model
Fig. 5.35 Confusion matrix showing number of hits for each of the classes for EfficientNet B0 model
Fig. 5.36 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B0 model. The Highway class shows the lowest accuracy of 36%, while the Residential class shows the highest accuracy of 98%
EfficientNet B1
Training/Validation Accuracy
See Fig. 5.37. There is an initial steady increase in both training and validation accuracy. Over-fitting begins to show after the 80th epoch, where the accuracy becomes unstable and large swings in validation accuracy are evident. The loss is small and close to zero after the initial dip.
Precision, Recall, F-Score, Support
See Fig. 5.38. The best-performing classes were Residential (99.7% recall, 74.4% precision) and Forest (96.9% recall, 86.9% precision). The global F2-score (not shown here) is estimated to be 77.8%.
Fig. 5.37 Training and validation loss and accuracy for EfficientNet B1 model
Fig. 5.38 Precision, recall, and F-score of each of the classes for EfficientNet B1 model
Confusion Matrix
See Fig. 5.39.
Confusion Matrix (%)
See Fig. 5.40.
EfficientNet B2
Training/Validation Accuracy
See Fig. 5.41. A gradual increase in validation accuracy can be observed. Training terminates at the 30th epoch due to lack of improvement in accuracy. The loss remains close to zero after the initial dip.
Fig. 5.39 Confusion matrix showing the number of hits for each of the classes for EfficientNet B1 model
Fig. 5.40 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B1 model. The PermanentCrop class shows the lowest accuracy of 53% while the Residential class shows the highest accuracy of 100%
Fig. 5.41 Training and validation loss and accuracy for EfficientNet B2 model
Precision, Recall, F-Score, Support
See Fig. 5.42. The best-performing classes were Residential (97.9% recall, 74.5% precision) and Forest (95.8% recall, 87.2% precision). The global F2-score (not shown here) is estimated to be 73.5%.
Confusion Matrix
See Fig. 5.43.
Confusion Matrix (%)
See Fig. 5.44.
EfficientNet B3
Fig. 5.42 Precision, recall, and F-score of each of the classes for EfficientNet B2 model
Fig. 5.43 Confusion matrix showing the number of hits for each of the classes for EfficientNet B2 model
Fig. 5.44 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B2 model. The PermanentCrop class shows the lowest accuracy of 42% while the Forest class shows the highest accuracy of 96%
Training/Validation Accuracy
See Fig. 5.45. Wild swings in validation accuracy can be observed. However, the loss remains close to zero after the initial dip.
Precision, Recall, F-Score, Support
See Fig. 5.46.
Fig. 5.45 Training and validation loss and accuracy for EfficientNet B3 model
Fig. 5.46 Precision, recall, and F-score of each of the classes for EfficientNet B3 model
The best-performing classes were Residential (99.5% recall, 90.8% precision) and SeaLake (96.7% recall, 98.7% precision). The Forest class also achieves a decent performance of 95.8% recall and 97.0% precision. The global F2-score (not shown here) is estimated to be 90.5%.
Confusion Matrix
See Fig. 5.47.
Confusion Matrix (%)
See Fig. 5.48.
EfficientNet B4
Training/Validation Accuracy
See Fig. 5.49. There is a steady increase in both training and validation accuracy with initial fluctuations. Over-fitting begins to show after the 80th epoch, where the accuracy becomes unstable. The loss is small and close to zero after the initial dip.
Fig. 5.47 Confusion matrix showing the number of hits for each of the classes for EfficientNet B3 model
Fig. 5.48 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B3 model. The Highway class shows the lowest accuracy of 81%, while the Residential class shows the highest accuracy of 99%
Fig. 5.49 Training and validation loss and accuracy for EfficientNet B4 model
Precision, Recall, F-Score, Support
See Fig. 5.50. The best-performing classes were Residential (98.5% recall, 82.0% precision) and Forest (96.8% recall, 91.8% precision). The global F2-score (not shown here) is estimated to be 83.5%.
Confusion Matrix
See Fig. 5.51.
Fig. 5.50 Precision, recall, and F-score of each of the classes for EfficientNet B4 model
Fig. 5.51 Confusion matrix showing the number of hits for each of the classes for EfficientNet B4 model
Fig. 5.52 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B4 model. The River class shows the lowest accuracy of 71% while the Residential class shows the highest accuracy of 98%
Confusion Matrix (%)
See Fig. 5.52.
EfficientNet B5
Training/Validation Accuracy
See Fig. 5.53. There is a steady increase in both training and validation accuracy with shallow initial fluctuations. Over-fitting begins to show after the 80th epoch. The loss is small and close to zero after the initial dip.
Precision, Recall, F-Score, Support
See Fig. 5.54. The best-performing classes were Residential (98.9% recall, 89.9% precision) and Forest (97.6% recall, 90.0% precision). The global F2-score (not shown here) is estimated to be 88.8%.
Confusion Matrix
See Fig. 5.55.
Fig. 5.53 Training and validation loss and accuracy for EfficientNet B5 model
Fig. 5.54 Precision, recall, and F-score of each of the classes for EfficientNet B5 model
Confusion Matrix (%)
See Fig. 5.56.
EfficientNet B6
Training/Validation Accuracy
See Fig. 5.57. There is a steady increase in both training and validation accuracy. Over-fitting begins to show after the 80th epoch.
Fig. 5.55 Confusion matrix showing the number of hits for each of the classes for EfficientNet B5 model
Fig. 5.56 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B5 model. The Highway class shows the lowest accuracy of 78% while the Residential class shows the highest accuracy of 99%
Fig. 5.57 Training and validation loss and accuracy for EfficientNet B6 model
Precision, Recall, F-Score, Support
See Fig. 5.58. The best-performing classes were Residential (99.7% recall, 90.7% precision) and Forest (97.9% recall, 88.4% precision). The global F2-score (not shown here) is estimated to be 89.3%.
Confusion Matrix
See Fig. 5.59.
Confusion Matrix (%)
See Fig. 5.60.
EfficientNet B7
Fig. 5.58 Precision, recall, and F-score of each of the classes for EfficientNet B6 model
Fig. 5.59 Confusion matrix showing the number of hits for each of the classes for EfficientNet B6 model
Fig. 5.60 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B6 model. The Pasture class shows the lowest accuracy of 75%, while the Residential class shows the highest accuracy of 100%
Training/Validation Accuracy
See Fig. 5.61. There is a steady increase in training accuracy, and validation accuracy becomes stable after initial fluctuations. Over-fitting begins to show after the 50th epoch. The same trend can be observed for the loss function.
Precision, Recall, F-Score, Support
See Fig. 5.62. The best-performing classes were Residential (99.4% recall, 91.2% precision) and Forest (99.0% recall, 92.0% precision). The global F2-score (not shown here) is estimated to be 93.0%.
Confusion Matrix
See Fig. 5.63.
Fig. 5.61 Training and validation loss and accuracy for EfficientNet B7 model
Fig. 5.62 Precision, recall, and F-score of each of the classes for EfficientNet B7 model
Fig. 5.63 Confusion matrix showing the number of hits for each of the classes for EfficientNet B7 model
Confusion Matrix (%)
See Fig. 5.64. There is a progressive increase in accuracy for the EfficientNet class of models from EfficientNet B0 to EfficientNet B7. However, the accuracy for the individual classes varies from model to model.
Model Performance Comparison and Analysis
Intra-Model Type Comparison (Table 5.1)
There were no drastic improvements in the ResNet models from ResNet50 to ResNet152. The VGG16 model showed slightly better performance than VGG19. NasNetLarge showed the overall best performance with an F2-score of 97.6%. The EfficientNet models showed a gradual increase in accuracy from B0 to B7, with anomalies for B2 and B3. The reasons for this need further investigation but could be due to model initialization issues in the case of B2. For EfficientNetB7 it was found
Fig. 5.64 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B7 model. The PermanentCrop class shows the lowest accuracy of 88% while the Forest and Residential classes show the highest accuracy of 99%
Table 5.1 Comparison of accuracy of the evaluated family of models

Model           Accuracy (%)   F2-score (%)
ResNet50        96.34          96.41
ResNet101       96.42          96.62
ResNet152       95.40          95.47
VGG16           97.14          97.06
VGG19           96.84          96.78
NasNetLarge     97.39          97.61
NasNetMobile    61.41          62.96
EfficientNetB0  72.10          73.07
EfficientNetB1  76.38          77.82
EfficientNetB2  72.58          73.50
EfficientNetB3  90.00          90.47
EfficientNetB4  83.32          83.45
EfficientNetB5  88.26          88.78
EfficientNetB6  88.51          89.30
EfficientNetB7  92.77          93.02
that by adding BatchNormalization layers after the top Dense layers, the global mean recall could be improved from 92.8 to 96.4% (+3.6%) and the global F2-score from 93.0 to 96.7% (+3.7%). In addition, the highest precision and recall were obtained for the Residential class (99.8% recall, 96.5% precision), a noticeable improvement from (99.4% recall, 91.2% precision), followed by the Forest class (99.3% recall, 95.7% precision), up from (99.0% recall, 92.0% precision). A minimal sketch of this modification is given below, followed by the results.
EfficientNet B7
Training/Validation Accuracy
See Fig. 5.65.
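The sketch below assumes the transfer-learning setup used in these experiments (a frozen EfficientNetB7 backbone with a dense classification head); the head size, dropout rate, and input shape are illustrative assumptions, and the essential change is the BatchNormalization layer placed after the top Dense layer.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Frozen pretrained backbone; the input shape is an assumption for illustration.
base = keras.applications.EfficientNetB7(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg")
base.trainable = False

model = keras.Sequential([
    base,
    layers.Dense(1024, activation="relu"),
    layers.BatchNormalization(),  # the layer added after the top Dense layer
    layers.Dropout(0.2),
    layers.Dense(10, activation="softmax"),  # 10 EuroSAT classes
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["categorical_accuracy"])
```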
Fig. 5.65 Training and validation loss and accuracy for EfficientNet B7 model
Precision, Recall, F-Score, Support
See Fig. 5.66.
Confusion Matrix
See Fig. 5.67.
Confusion Matrix (%)
See Fig. 5.68.
Fig. 5.66 Precision, recall, and F-score of each of the classes for EfficientNet B7 model
Fig. 5.67 Confusion matrix showing the number of hits for each of the classes for EfficientNet B7 model
Fig. 5.68 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B7 model. The Pasture class shows the lowest accuracy of 94% while the Residential class achieves the highest accuracy of 100%
In short, with some slight modifications to the model, substantial performance gains can be achieved, as shown above for the EfficientNet B7 model. This is also applicable to other models and illustrates the advantage of the Keras framework for experimental modeling when quick confirmations and decisions have to be made. In fact, we checked the effect of similar changes on the ResNet50, ResNet101, VGG16, VGG19, and NasNetLarge models. The results are shown below.
ResNet50
Training/Validation Accuracy
See Fig. 5.69.
Precision, Recall, F-Score, Support
See Fig. 5.70.
Confusion Matrix
See Fig. 5.71.
Confusion Matrix (%)
See Fig. 5.72.
Fig. 5.69 Training and validation loss and accuracy for ResNet50 model
Fig. 5.70 Precision, recall, and F-score of each of the classes for ResNet50 model
ResNet101
Training/Validation Accuracy
See Fig. 5.73.
Precision, Recall, F-Score, Support
See Fig. 5.74.
Fig. 5.71 Confusion matrix showing number of hits for each of the classes for ResNet50 model
Fig. 5.72 Confusion matrix showing ratio of hits for each of the classes for ResNet50 model. The PermanentCrop class shows the lowest accuracy of 95% while the Forest and Residential can be classified with 100% accuracy
Fig. 5.73 Training and validation loss and accuracy for ResNet101 model
Fig. 5.74 Precision, recall, and F-score of each of the classes for ResNet101 model
Confusion Matrix
See Fig. 5.75.
Confusion Matrix (%)
See Fig. 5.76.
VGG16
Training/Validation Accuracy
See Fig. 5.77.
Precision, Recall, F-Score, Support
See Fig. 5.78. Forest, Residential, and River have a recall of greater than 99%, which can be considered state of the art. The improvements gained from applying batch normalization can be summarized as follows.
Fig. 5.75 Confusion matrix showing number of hits for each of the classes for ResNet101 model
Fig. 5.76 Confusion matrix showing ratio of hits for each of the classes for ResNet101 model. The HerbaceousVegetation class shows the lowest accuracy of 95% while the Forest and Residential can be classified with 100% accuracy
Fig. 5.77 Training and validation loss and accuracy for VGG16 model
Before: Average Accuracy: 97.14%, Global F2-Score: 97.06%
After: Average Accuracy: 98.07% (+0.93%), Global F2-Score: 98.05% (+0.99%)
Confusion Matrix
See Fig. 5.79.
Confusion Matrix (%)
See Fig. 5.80.
VGG19
Training/Validation Accuracy
See Fig. 5.81.
Fig. 5.78 Precision, recall, and F-score of each of the classes for VGG16 model
Fig. 5.79 Confusion matrix showing number of hits for each of the classes for VGG16 model
Fig. 5.80 Confusion matrix showing ratio of hits for each of the classes for VGG16 model. The Highway class shows the lowest accuracy of 96% while the Residential class can be classified with 100% accuracy
Fig. 5.81 Training and validation loss and accuracy for VGG19 model
Precision, Recall, F-Score, Support
See Fig. 5.82.
Confusion Matrix
See Fig. 5.83.
Confusion Matrix (%)
See Fig. 5.84.
Fig. 5.82 Precision, recall, and F-score of each of the classes for VGG19 model
Fig. 5.83 Confusion matrix showing number of hits for each of the classes for VGG19 model
Fig. 5.84 Confusion matrix showing ratio of hits for each of the classes for VGG19 model. The AnnualCrop, HerbaceousVegetation, Industrial, Pasture, and PermanentCrop classes have the lowest accuracy of 97% while the Forest and Residential can be classified with 100% accuracy
NasNetLarge
Training/Validation Accuracy
See Fig. 5.85.
Precision, Recall, F-Score, Support
See Fig. 5.86.
Confusion Matrix
See Fig. 5.87.
Confusion Matrix (%)
See Fig. 5.88. In terms of mean accuracy and F2-score, the results can be summarized as shown in Table 5.2. The VGG models give the best performance in terms of both accuracy and F2-score: VGG16 achieves an accuracy of 98.07%, while VGG19 is slightly lower at 97.89%. The same trend is reflected in the F2-score, which is 98.05% and 98.02% for the respective models. NasNetLarge showed a lower than expected performance.
Train–Test Split 80–20 Result
Fig. 5.85 Training and validation loss and accuracy for NasNetLarge model
Fig. 5.86 Precision, recall, and F-score of each of the classes for NasNetLarge model
Changing the train–test split is one way to improve the validation accuracy. In this case, we change the split from 70–30 to 80–20 and check some of the top-performing models so far; a minimal sketch of such a split is given below, followed by the results of this change.
ResNet50
Training/Validation Accuracy
See Fig. 5.89.
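One way to realize the 80–20 split is sketched below, assuming the EuroSAT images sit in class-named subfolders; the directory path, image size, and seed are placeholders.

```python
import tensorflow as tf

DATA_DIR = "EuroSAT/2750"  # placeholder path to the class-named subfolders
common = dict(validation_split=0.2, seed=42, image_size=(64, 64),
              batch_size=32, label_mode="categorical")

# Using the same seed for both calls guarantees complementary subsets.
train_ds = tf.keras.utils.image_dataset_from_directory(
    DATA_DIR, subset="training", **common)
val_ds = tf.keras.utils.image_dataset_from_directory(
    DATA_DIR, subset="validation", **common)
```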
Fig. 5.87 Confusion matrix showing number of hits for each of the classes for NasNetLarge model
Fig. 5.88 Confusion matrix showing ratio of hits for each of the classes for NasNetLarge model. The PermanentCrop class shows the lowest accuracy of 94% while the Forest and Residential can be classified with 100% accuracy
Table 5.2 Comparison of the accuracy of top models of each family

Model           Accuracy (%)   F2-score (%)
ResNet50        97.48          97.57
ResNet101       97.57          97.64
VGG16           98.07          98.05
VGG19           97.89          98.02
NasNetLarge     97.21          97.45
EfficientNetB7  96.44          96.68
Precision, Recall, F-Score, Support
See Fig. 5.90.
Confusion Matrix
See Fig. 5.91.
Confusion Matrix (%)
See Fig. 5.92.
ResNet101
Fig. 5.89 Training and validation loss and accuracy for ResNet50 model
Fig. 5.90 Precision, recall, and F-score of each of the classes for ResNet50 model
Training/Validation Accuracy
See Fig. 5.93.
Precision, Recall, F-Score, Support
See Fig. 5.94.
Confusion Matrix
See Fig. 5.95.
Fig. 5.91 Confusion matrix showing number of hits for each of the classes for ResNet50 model
Fig. 5.92 Confusion matrix showing ratio of hits for each of the classes for ResNet50 model. The AnnualCrop class shows the lowest accuracy of 95% while the Forest and Residential can be classified with 100% accuracy
Fig. 5.93 Training and validation loss and accuracy for ResNet101 model
Fig. 5.94 Precision, recall, and F-score of each of the classes for ResNet101 model
Fig. 5.95 Confusion matrix showing number of hits for each of the classes for ResNet101 model
Confusion Matrix (%)
See Fig. 5.96.
VGG16
Training/Validation Accuracy
See Fig. 5.97.
Precision, Recall, F-Score, Support
See Fig. 5.98.
Confusion Matrix
See Fig. 5.99.
Fig. 5.96 Confusion matrix showing ratio of hits for each of the classes for ResNet101 model. The Highway class shows the lowest accuracy of 94% while the Forest and Residential can be classified with 100% accuracy
Fig. 5.97 Training and validation loss and accuracy for VGG16 model
Confusion Matrix (%)
See Fig. 5.100. Retraining the above VGG16 model using the previous weights as input (a minimal sketch of such a rerun is given below) does not result in notable improvements in accuracy. The results are shown below.
Training/Validation Accuracy
See Fig. 5.101.
Precision, Recall, F-Score, Support
See Fig. 5.102.
Confusion Matrix
See Fig. 5.103.
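The rerun amounts to reloading the previously saved weights and fitting again, roughly as sketched below; the function wrapper, file name, and epoch count are assumptions, and model, train_ds, and val_ds are taken to be defined as in the earlier experiments.

```python
from tensorflow import keras

def retrain(model: keras.Model, train_ds, val_ds,
            weights_path: str = "vgg16_eurosat_prev.weights.h5"):
    # Resume from the previous solution rather than the ImageNet initialization.
    model.load_weights(weights_path)
    return model.fit(train_ds, validation_data=val_ds, epochs=50)
```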
Fig. 5.98 Precision, recall, and F-score of each of the classes for VGG16 model
Fig. 5.99 Confusion matrix showing number of hits for each of the classes for VGG16 model
Fig. 5.100 Confusion matrix showing ratio of hits for each of the classes for VGG16 model. The AnnualCrop and Highway classes show the lowest accuracy of 96% while the Forest and Residential classes can be classified with 100% accuracy
Fig. 5.101 Training and validation loss and accuracy for VGG16 model
Fig. 5.102 Precision, recall, and F-score of each of the classes for VGG16 model
Confusion Matrix (%)
See Fig. 5.104.
VGG19
Training/Validation Accuracy
See Fig. 5.105.
Fig. 5.103 Confusion matrix showing number of hits for each of the classes for VGG16 model
Fig. 5.104 Confusion matrix showing ratio of hits for each of the classes for VGG16 model. The PermanentCrop class shows the lowest accuracy of 95% while the Forest and Residential classes can be classified with 100% accuracy
Fig. 5.105 Training and validation loss and accuracy for VGG19 model
Precision, Recall, F-Score, Support
See Fig. 5.106.
Confusion Matrix
See Fig. 5.107.
Confusion Matrix (%)
See Fig. 5.108.
VGG19 Test 2 (Rerun)
A rerun of the model, as with VGG16, is performed to check whether any improvements in accuracy can be obtained. The weights from the previous training are reused in this case. Again, no significant improvements could be obtained from this approach. The reason could be that no additional learning is possible with the current parameter set. The results are shown below.
Fig. 5.106 Precision, recall, and F-score of each of the classes for VGG19 model
Fig. 5.107 Confusion matrix showing number of hits for each of the classes for VGG19 model
Fig. 5.108 Confusion matrix showing ratio of hits for each of the classes for VGG19 model. The AnnualCrop class shows the lowest accuracy of 94% while the Forest and Residential can be classified with 100% accuracy
Training/Validation Accuracy
See Fig. 5.109.
Precision, Recall, F-Score, Support
See Fig. 5.110.
Confusion Matrix
See Fig. 5.111.
Confusion Matrix (%)
See Fig. 5.112.
NasNetLarge
Training/Validation Accuracy
See Fig. 5.113.
Fig. 5.109 Training and validation loss and accuracy for VGG19 model
Fig. 5.110 Precision, recall, and F-score of each of the classes for VGG19 model
Fig. 5.111 Confusion matrix showing number of hits for each of the classes for VGG19 model
Fig. 5.112 Confusion matrix showing ratio of hits for each of the classes for VGG19 model. The AnnualCrop and Highway classes show the lowest accuracy of 96% while the Forest and Residential can be classified with 100% accuracy
Fig. 5.113 Training and validation loss and accuracy for NasNetLarge model
Precision, Recall, F-Score, Support
See Fig. 5.114.
Confusion Matrix
See Fig. 5.115.
Fig. 5.114 Precision, recall, and F-score of each of the classes for NasNetLarge model
Fig. 5.115 Confusion matrix showing number of hits for each of the classes for NasNetLarge model
Fig. 5.116 Confusion matrix showing ratio of hits for each of the classes for NasNetLarge model. The Pasture class shows the lowest accuracy of 94% while the Forest and Residential can be classified with 100% accuracy
Confusion Matrix (%)
See Fig. 5.116.
EfficientNetB7
Training/Validation Accuracy
See Fig. 5.117.
Precision, Recall, F-Score, Support
See Fig. 5.118.
Confusion Matrix
See Fig. 5.119.
Confusion Matrix (%)
See Fig. 5.120. Table 5.3 summarizes the results of this subsection on the train–test split. There is a general improvement in accuracy and the F2-score compared to the 70–30 split. Specifically, the VGG family of models showed the best performance, with VGG16 being slightly better than VGG19.
Weight Regularization
Fig. 5.117 Training and validation loss and accuracy for EfficientNet B7 model
Fig. 5.118 Precision, recall, and F-score of each of the classes for EfficientNet B7 model
To reduce overfitting, one of the strategies that can be used is L2 weight regularization. Three kinds of regularization exist, namely kernel regularization, where a penalty is applied on the layer's kernel; bias regularization, where a penalty is applied on the layer's bias; and activity regularization, where a penalty is applied on the layer's output. Kernel regularization is the most commonly used, but here we will check how L2 kernel regularization and L2 activity regularization affect the performance, using VGG16 as the example. A minimal sketch of the three penalty placements is given below.
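The three penalty placements map onto the corresponding Dense-layer arguments in Keras, roughly as follows; the 2048-unit capacity matches the first experiment below, while the regularization factor of 1e-4 is an assumed value.

```python
from tensorflow.keras import layers, regularizers

# L2 penalty on the layer's weights (kernel regularization).
kernel_reg = layers.Dense(2048, activation="relu",
                          kernel_regularizer=regularizers.l2(1e-4))

# L2 penalty on the layer's bias vector (bias regularization).
bias_reg = layers.Dense(2048, activation="relu",
                        bias_regularizer=regularizers.l2(1e-4))

# L2 penalty on the layer's output activations (activity regularization).
activity_reg = layers.Dense(2048, activation="relu",
                            activity_regularizer=regularizers.l2(1e-4))
```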
Fig. 5.119 Confusion matrix showing the number of hits for each of the classes for EfficientNet B7 model
Fig. 5.120 Confusion matrix showing ratio of hits for each of the classes for EfficientNet B7 model. The Highway class shows the lowest accuracy of 95% while the Residential class shows the highest accuracy of 100%
Table 5.3 Comparison of results for the two splits of data

                70–30 Split                   80–20 Split
Model           Accuracy (%)  F2-score (%)    Accuracy (%)  F2-score (%)
ResNet50        97.48         97.57           97.81         97.80
ResNet101       97.57         97.64           98.01         98.00
VGG16           98.07         98.05           98.18         98.09
VGG19           97.89         98.02           98.14         98.06
NasNetLarge     97.21         97.45           97.65         97.65
EfficientNetB7  96.44         96.68           97.45         97.43
Kernel Regularization (2048 units)
Training/Validation Accuracy
See Fig. 5.121.
Precision, Recall, F-Score, Support
See Fig. 5.122.
Fig. 5.121 Training and validation loss and accuracy after applying kernel regularization to VGG16 model with a capacity of 2048 units
Fig. 5.122 Precision, recall, and F-score of each of the classes after applying kernel regularization to VGG16 model with a capacity of 2048 units
Confusion Matrix
See Fig. 5.123.
Confusion Matrix (%)
See Fig. 5.124.
Activity Regularization (2048 units)
Fig. 5.123 Confusion matrix showing number of hits for each of the classes after applying kernel regularization to VGG16 model with a capacity of 2048 units
Fig. 5.124 Confusion matrix showing ratio of hits for each of the classes after applying kernel regularization to VGG16 model with a capacity of 2048 units. The AnnualCrop, Industrial, and SeaLake classes show the lowest accuracy of 97% while the Forest and Residential classes can be classified with 100% accuracy
Training/Validation Accuracy
See Fig. 5.125.
Precision, Recall, F-Score, Support
See Fig. 5.126.
Confusion Matrix
See Fig. 5.127.
Confusion Matrix (%)
See Fig. 5.128. Activity regularization seems to provide better results than kernel regularization. We will therefore consider improvements by reducing the network capacity with activity regularization.
Activity Regularization (1024 units)
Training/Validation Accuracy
See Fig. 5.129.
Fig. 5.125 Training and validation loss and accuracy after applying activity regularization to VGG16 model with a capacity of 2048 units
Fig. 5.126 Precision, recall, and F-score of each of the classes after applying activity regularization to VGG16 model with a capacity of 2048 units
Precision, Recall, F-Score, Support
See Fig. 5.130.
Confusion Matrix
See Fig. 5.131.
Fig. 5.127 Confusion matrix showing number of hits for each of the classes after applying activity regularization to VGG16 model with a capacity of 2048 units
Fig. 5.128 Confusion matrix showing ratio of hits for each of the classes after applying activity regularization to VGG16 model with a capacity of 2048 units. The HerbaceousVegetation class shows the lowest accuracy of 97% while the Forest and Residential classes can be classified with 100% accuracy
Fig. 5.129 Training and validation loss and accuracy after applying Activity regularization to VGG16 model with a capacity of 1024 units
Fig. 5.130 Precision, recall, and F-score of each of the classes after applying activity regularization to VGG16 model with a capacity of 1024 units
Fig. 5.131 Confusion matrix showing number of hits for each of the classes after applying activity regularization to VGG16 model with a capacity of 1024 units
Confusion Matrix (%)
See Fig. 5.132. A global accuracy of 98.27% is achieved when the capacity is reduced to 1024 units. In this case there is no benefit in increasing the network capacity to 2048 units, since the same accuracy is obtained. There is a 1% increase in the accuracy of the HerbaceousVegetation class, from 96 to 97%. Additionally, all of the other 9 classes have an accuracy of 98% and above.
Activity Regularization (512 units)
Training/Validation Accuracy
See Fig. 5.133.
Fig. 5.132 Confusion matrix showing ratio of hits for each of the classes after applying activity regularization to VGG16 model with a capacity of 1024 units. The HerbaceousVegetation class shows the lowest accuracy of 97% while the Forest and Residential classes can be classified with 100% accuracy
Fig. 5.133 Training and validation loss and accuracy after applying Activity regularization to VGG16 model with a capacity of 512 units
Precision, Recall, F-Score, Support
See Fig. 5.134.
Confusion Matrix
See Fig. 5.135.
Confusion Matrix (%)
See Fig. 5.136. The overall accuracy reduces to 98.17%, compared to the 98.27% obtained for 1024 units. Therefore, there is no merit in reducing the capacity in this particular case. Comparing L2 kernel regularization with L2 activity regularization, we see that activity regularization tends to give better performance in terms of both combating
Fig. 5.134 Precision, recall, and F-score of each of the classes after applying activity regularization to VGG16 model with a capacity of 512 units
Fig. 5.135 Confusion matrix showing number of hits for each of the classes after applying Activity regularization to VGG16 model with a capacity of 512 units
Fig. 5.136 Confusion matrix showing ratio of hits for each of the classes after applying Activity regularization to VGG16 model with a capacity of 512 units. The Highway class shows the lowest accuracy of 96% while the Forest and Residential classes can be classified with 100% accuracy
overfitting and accuracy. It can also be seen that in some cases the validation loss is lower than the training loss. This can be due to the fact that regularization is applied only during training and not during validation. Another commonly given reason is that the training loss is evaluated during the epoch while the validation loss is evaluated at the end of it, resulting in a slight shift of the loss curves by about half an epoch. Some strategies to avoid being too conservative during training would be lowering the regularization constant, reducing the dropout rate, and increasing the model capacity. In our case, a model capacity of 1024 units seems to be the best so far, achieving an accuracy of greater than 98% for 9 out of the 10 classes and also showing resistance to overfitting.
Dropout
In this section we take a look at what happens if we vary the dropout rate for a given network capacity of 1024 units, using the VGG16 model as the base. We evaluate dropout rates of 0.2, 0.3, 0.4, and 0.5; in practice, dropout rates of between 0.2 and 0.5 are recommended. A minimal sketch of this sweep is given below.
Dropout Rate 0.2
Training/Validation Accuracy
See Fig. 5.137.
Precision, Recall, F-Score, Support
See Fig. 5.138.
Confusion Matrix
See Fig. 5.139.
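The sweep can be expressed as rebuilding the same head for each rate, roughly as sketched below; the input shape, optimizer, and frozen-backbone setup are assumptions consistent with the earlier experiments.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_vgg16_head(rate: float) -> keras.Model:
    # Frozen VGG16 backbone with the 1024-unit head used in this section.
    base = keras.applications.VGG16(include_top=False, weights="imagenet",
                                    input_shape=(64, 64, 3), pooling="avg")
    base.trainable = False
    model = keras.Sequential([
        base,
        layers.Dense(1024, activation="relu"),
        layers.Dropout(rate),  # the hyperparameter under test
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["categorical_accuracy"])
    return model

# One model per dropout rate evaluated in the text.
models = {rate: build_vgg16_head(rate) for rate in (0.2, 0.3, 0.4, 0.5)}
```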
Fig. 5.137 Training and validation loss and accuracy of the VGG16 model with a dropout rate of 0.2
Fig. 5.138 Precision, recall, and F-score of each of the classes for the VGG16 model with a dropout rate of 0.2
Fig. 5.139 Confusion matrix showing number of hits for each of the classes using the VGG16 model as base with a dropout rate of 0.2
Confusion Matrix (%)
See Fig. 5.140. For the dropout rate of 0.2, the overall accuracy is about 98.28%. In terms of classification performance for the resulting model, HerbaceousVegetation, Highway, and PermanentCrop have a recall of less than 98%.
Dropout Rate 0.3
Training/Validation Accuracy
See Fig. 5.141.
Precision, Recall, F-Score, Support
See Fig. 5.142.
Fig. 5.140 Confusion matrix showing ratio of hits for each of the classes using the VGG16 model as base with a dropout rate of 0.2. The HerbaceousVegetation class shows the lowest accuracy of 96% while the Forest and Residential classes can be classified with 100% accuracy
Fig. 5.141 Training and validation loss and accuracy of the VGG16 model with a dropout rate of 0.3
Confusion Matrix
See Fig. 5.143.
Confusion Matrix (%)
See Fig. 5.144. For the dropout rate of 0.3, the overall accuracy is about 98.13%. In terms of classification performance for the resulting model, AnnualCrop, HerbaceousVegetation, and Highway have a recall of less than 98%.
Dropout Rate 0.4
Training/Validation Accuracy
See Fig. 5.145.
Fig. 5.142 Precision, recall, and F-score of each of the classes for the VGG16 model with a dropout rate of 0.3
Fig. 5.143 Confusion matrix showing number of hits for each of the classes using the VGG16 model as base with a dropout rate of 0.3
Fig. 5.144 Confusion matrix showing ratio of hits for each of the classes using the VGG16 model as base with a dropout rate of 0.3. The HerbaceousVegetation class shows the lowest accuracy of 96% while the Residential class can be classified with 100% accuracy
Fig. 5.145 Training and validation loss and accuracy of the VGG16 model with a dropout rate of 0.4
Precision, Recall, F-Score, Support See Fig. 5.146. Confusion Matrix See Fig. 5.147.
Fig. 5.146 Precision, recall, and F-score of each of the classes for the VGG16 model with a dropout rate of 0.4
Fig. 5.147 Confusion matrix showing number of hits for each of the classes using the VGG16 model as base with a dropout rate of 0.4
[Figure content: per-class recall from the confusion-matrix diagonal: AnnualCrop 0.96, Forest 1.00, HerbaceousVegetation 0.96, Highway 0.96, Industrial 0.98, Pasture 0.98, PermanentCrop 0.98, Residential 1.00, River 0.99, SeaLake 0.96]
Fig. 5.148 Confusion matrix showing ratio of hits for each of the classes using the VGG16 model as base with a dropout rate of 0.4. The AnnualCrop, HerbaceousVegetation, Highway, and SeaLake classes show the lowest accuracy of 96% while the Forest and Residential classes can be classified with 100% accuracy
Confusion Matrix (%)
See Fig. 5.148.
For the dropout rate of 0.4, the overall accuracy is about 97.74%. In terms of classification performance for the resulting model, AnnualCrop, HerbaceousVegetation, Highway, and SeaLake have a recall of less than 98%.
Dropout Rate 0.5
Training/Validation Accuracy
See Fig. 5.149.
Precision, Recall, F-Score, Support
See Fig. 5.150.
Confusion Matrix
See Fig. 5.151.
Confusion Matrix (%)
See Fig. 5.152.
Fig. 5.149 Training and validation loss and accuracy of the VGG16 model with a dropout rate of 0.5
Fig. 5.150 Precision, recall, and F-score of each of the classes for the VGG16 model with a dropout rate of 0.5
For the dropout rate of 0.5, the overall accuracy is about 98.17%. In terms of classification performance for the resulting model, AnnualCrop, HerbaceousVegetation, Pasture, and PermanentCrop have a recall of less than 98%. We found that for this particular dataset there was no marked improvement in model accuracy associated with increasing the dropout rate from 0.2 to 0.5. A dropout rate of 0.2 would still be sufficient to achieve a decent accuracy of 98.28%.
Fig. 5.151 Confusion matrix showing number of hits for each of the classes using the VGG16 model as base with a dropout rate of 0.5
[Figure content: per-class recall from the confusion-matrix diagonal: AnnualCrop 0.97, Forest 1.00, HerbaceousVegetation 0.96, Highway 0.98, Industrial 0.99, Pasture 0.97, PermanentCrop 0.97, Residential 1.00, River 0.99, SeaLake 0.99]
Fig. 5.152 Confusion matrix showing ratio of hits for each of the classes using the VGG16 model as base with a dropout rate of 0.5. The HerbaceousVegetation class shows the lowest accuracy of 96% while the Forest and Residential classes can be classified with 100% accuracy
This does not, however, imply that there is no merit in investigating dropout as part of a broader strategy to improve validation accuracy by algorithm tuning.
Network Capacity Reduction
Another strategy to reduce overfitting is network capacity reduction. We check the effect of reducing the network capacity from 2048 units, which is the default setting in all the above simulations, to 1024 and 512 units, respectively. As in the above cases, we evaluate the performance in terms of accuracy and F2-score for VGG16 as an example.
Capacity reduced from 2048 to 1024 units
The network capacity can be easily obtained from the model summary. In our case, we use vgg16_model.summary() to get this information, since we defined vgg16_model as the model name (a brief usage sketch follows below). The results for a network capacity of 1024 units compared to 2048 units are shown in Table 5.4. As can be seen from the numbers, the number of trainable parameters was reduced by slightly more than half, from over 8 million to about 3 million. The following figures show the effect of the capacity reduction.
Training/Validation Accuracy
See Fig. 5.153.
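For reference, this is how the parameter counts compared in Table 5.4 can be printed; the helper build_vgg16_classifier is the hypothetical sketch introduced earlier, not the book's exact code.

# Hypothetical usage: build the reduced-capacity variant and print its
# layer-by-layer summary, which includes the total, trainable, and
# non-trainable parameter counts compared in Table 5.4.
vgg16_model = build_vgg16_classifier(units=1024)
vgg16_model.summary()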
Table 5.4 Comparison of parameters for 2048 and 1024 units of the VGG16 model

Parameters                  2048 units     1024 units
Total parameters            23,144,266     17,880,906
Trainable parameters        8,421,386      3,162,122
Non-trainable parameters    14,722,880     14,718,784
Fig. 5.153 Training and validation loss and accuracy of VGG16 model with a reduced capacity of 1024 units
Precision, Recall, F-Score, Support
See Fig. 5.154.
Confusion Matrix
See Fig. 5.155.
Confusion Matrix (%)
See Fig. 5.156.
In summary, it can be observed that, in comparison with the case in which we applied weight regularization without network capacity reduction, there was a slight decrease in performance. With regularization only, an accuracy and F2-score of 98.15% were obtained; with the reduced capacity, the accuracy and F2-score were 98.11%, which translates to a decrease of 0.04%. It can be said that there is hardly any noticeable impact on performance in this case. When a large amount of training data is available, it could be beneficial to reduce the capacity to overcome overfitting, as shown by the reduced gap between the training and validation loss until about the 20th epoch.
Capacity reduced to 512 units
Fig. 5.154 Precision, recall, and F-score of each of the classes of VGG16 model with a reduced capacity of 1024 units
Fig. 5.155 Confusion matrix showing number of hits for each of the classes of VGG16 model with a reduced capacity of 1024 units
[Figure content: per-class recall from the confusion-matrix diagonal: AnnualCrop 0.97, Forest 1.00, HerbaceousVegetation 0.97, Highway 0.97, Industrial 0.99, Pasture 0.97, PermanentCrop 0.96, Residential 1.00, River 0.98, SeaLake 0.99]
Fig. 5.156 Confusion matrix showing ratio of hits for each of the classes of VGG16 model with a reduced capacity of 1024 units. The PermanentCrop class shows the lowest accuracy of 96% while the Forest and Residential classes can be classified with 100% accuracy
Using 2048 units as the base for comparison, we can see a considerable reduction of the trainable parameters to less than one quarter, as shown in Table 5.5. In terms of numbers, the count drops from slightly over 8.4 million parameters to about 1.3 million. The following figures show the effect of the capacity reduction.
Training/Validation Accuracy
See Fig. 5.157.
Precision, Recall, F-Score, Support
See Fig. 5.158.
Confusion Matrix
See Fig. 5.159.
Confusion Matrix (%)
See Fig. 5.160.

Table 5.5 Comparison of parameters for 2048 and 512 units of the VGG16 model

Parameters                  2048 units     512 units
Total parameters            23,144,266     16,035,658
Trainable parameters        8,421,386      1,318,922
Non-trainable parameters    14,722,880     14,716,736
Fig. 5.157 Training and validation loss and accuracy of VGG16 model with a reduced capacity of 512 units
Fig. 5.158 Precision, recall, and F-score of each of the classes of VGG16 model with a reduced capacity of 512 units
Fig. 5.159 Confusion matrix showing number of hits for each of the classes of VGG16 model with a reduced capacity of 512 units
[Figure content: per-class recall from the confusion-matrix diagonal: AnnualCrop 0.97, Forest 1.00, HerbaceousVegetation 0.97, Highway 0.96, Industrial 0.98, Pasture 0.98, PermanentCrop 0.98, Residential 1.00, River 0.99, SeaLake 0.99]
Fig. 5.160 Confusion matrix showing ratio of hits for each of the classes of VGG16 model with a reduced capacity of 512 units. The Highway class shows the lowest accuracy of 96% while the Forest and Residential classes can be classified with 100% accuracy
As with the case of reduction to 1024 units, there is a slight loss in performance and improved robustness to overfitting when 512 units were used, as shown in the training/validation loss graph above. In this case, an accuracy and F2-score of 98.13% were obtained after network capacity reduction, which is comparable to 98.15% for 2048 units. This translates to a decrease of 0.02%, which we think is acceptable in most practical situations. The results for regularization are summarized in Table 5.6. Although the model showed increased resistance to overfitting, the price to pay was a slight decrease in accuracy. This is common in many machine learning and deep learning scenarios, where a trade-off of some sort has to be made. More training data is always better to have. The results also show that doubling the network capacity from 1024 to 2048 did not give the additional benefit of increased accuracy.
Effect of Batch Size
Up to this point we have set the batch size for training/validation to 128. Adjusting the batch size can have an impact on the accuracy of the resulting model. For some easily trainable data, like the standard MNIST dataset, reducing the batch size may lead to improved performance. However, there is no general rule on the impact of batch size, as the effect can depend on the complexity of the problem being modeled. This means that it is necessary to try a couple of batch sizes to see how much they affect the output model performance. In general, a batch size of 32 is a good starting point when using Keras, and it is advisable to also try other sizes like 64, 128, and 256. Choosing batch sizes which are powers of 2 is recommended when using GPUs for processing in order to exploit parallel execution. We changed the batch size to 32, 64, and 256 and performed the training using the VGG16 model as the base with 1024 units, an L2 activity regularization constant of 1e-4, a dropout rate of 0.2, and an early stopping patience of 30. The number of epochs was set to 200. The results are shown below.
Batch size 32
Training/Validation Accuracy
See Fig. 5.161.
Table 5.6 Comparison of accuracy with various regularization kernels and network sizes for the VGG16 model

Regularization method                        Accuracy (%)   F2-score (%)
No regularization (2048 units)               98.18          98.08
L2 kernel regularization (2048 units)        98.15          98.15
L2 kernel reg + network size 1024            98.11          98.11
L2 kernel reg + network size 512             98.13          98.13
L2 activity regularization (2048 units)      98.28          98.28
L2 activity reg + network size 1024          98.28          98.28
L2 activity reg + network size 512           98.17          98.17
Fig. 5.161 Training and validation loss and accuracy of VGG16 model with a batch size of 32
Precision, Recall, F-Score, Support See Fig. 5.162. Confusion Matrix See Fig. 5.163. Confusion Matrix (%) See Fig. 5.164.
Fig. 5.162 Precision, recall, and F-score of each of the classes of VGG16 model with a batch size of 32
Fig. 5.163 Confusion matrix showing number of hits for each of the classes of VGG16 model with a batch size of 32
[Figure content: per-class recall from the confusion-matrix diagonal: AnnualCrop 0.97, Forest 1.00, HerbaceousVegetation 0.97, Highway 0.99, Industrial 0.98, Pasture 0.97, PermanentCrop 0.97, Residential 1.00, River 0.99, SeaLake 0.99]
Fig. 5.164 Confusion matrix showing ratio of hits for each of the classes of VGG16 model with a batch size of 32. The AnnualCrop, HerbaceousVegetation, Pasture and PermanentCrop classes show the lowest accuracy of 97% while the Forest and Residential classes can be classified with 100% accuracy
Batch size 64 Training/Validation Accuracy See Fig. 5.165. Precision, Recall, F-Score, Support See Fig. 5.166. Confusion Matrix See Fig. 5.167. Confusion Matrix (%) See Fig. 5.168. Batch size 256 Training/Validation Accuracy See Fig. 5.169.
Fig. 5.165 Training and validation loss and accuracy of VGG16 model with a batch size of 64
Fig. 5.166 Precision, recall, and F-score of each of the classes of VGG16 model with a batch size of 64
Precision, Recall, F-Score, Support See Fig. 5.170. Confusion Matrix See Fig. 5.171. Confusion Matrix (%) See Fig. 5.172.
Fig. 5.167 Confusion matrix showing number of hits for each of the classes of VGG16 model with a batch size of 64
[Figure content: per-class recall from the confusion-matrix diagonal: AnnualCrop 0.97, Forest 1.00, HerbaceousVegetation 0.98, Highway 0.97, Industrial 0.98, Pasture 0.99, PermanentCrop 0.98, Residential 1.00, River 0.98, SeaLake 1.00]
Fig. 5.168 Confusion matrix showing ratio of hits for each of the classes of VGG16 model with a batch size of 64. The AnnualCrop and Highway classes show the lowest accuracy of 97% while the Forest, Residential, and SeaLake classes can be classified with 100% accuracy. The rest of the classes are above 98% accuracy
Fig. 5.169 Training and validation loss and accuracy of VGG16 model with a batch size of 256
Fig. 5.170 Precision, recall, and F-score of each of the classes of VGG16 model with a batch size of 256
Fig. 5.171 Confusion matrix showing number of hits for each of the classes of VGG16 model with a batch size of 256
[Figure content: per-class recall from the confusion-matrix diagonal: AnnualCrop 0.98, Forest 1.00, HerbaceousVegetation 0.97, Highway 0.96, Industrial 0.98, Pasture 0.98, PermanentCrop 0.97, Residential 1.00, River 0.99, SeaLake 0.99]
Fig. 5.172 Confusion matrix showing ratio of hits for each of the classes of VGG16 model with a batch size of 256. The Highway class shows the lowest accuracy of 96%, while the Forest and Residential classes can be classified with 100% accuracy
Table 5.7 Impact of batch size on accuracy when using 1024 units with the VGG16 model as base

Batch size    Accuracy (%)
32            98.31
64            98.46
128           98.28
256           98.15
Table 5.7 shows that for a batch size of 64, a state-of-the-art accuracy of 98.46% was achieved. Further reducing the batch size to 32 gave an accuracy of 98.31%. On the other hand, increasing the batch size to 256 resulted in an accuracy of 98.15%. In summary, as the batch size increases, accuracy was observed to decrease for sizes of 64 and above. When a fixed training data sample size is available, a reduced batch size will lead to an increase in the number of steps per epoch, and therefore in training time, depending on the available computation resources. In our case, a batch size of 64 was experimentally determined to be the best to employ.
Intermodel type comparison
Using recall and F2-score as performance metrics, with a train–test split of 70–30, NasNetLarge gave the best performance with an accuracy of 97.4% and F2-score of 97.6%, followed by VGG16 (mean recall 97.14%, F2-score 97.1%), ResNet101 (mean recall 96.4%, F2-score 96.6%), and EfficientNetB7 (mean recall 92.8%, F2-score 93.0%), in that order. We explored the train–test split ratio of 80–20 and found that VGG16 gave the best results with an accuracy of 98.18% and F2-score of 98.06%. The above evaluation was performed with a batch size of 128.
One of the known strategies to fight overfitting is regularization. We investigated the effect of weight regularization, network capacity reduction, and dropout. It was found that there was minor degradation in performance with better resistance to overfitting for the VGG16 model. In fact, regularization can produce meaningful results and a stable validation loss. Specifically, L2 activity regularization produced a peak accuracy of 98.28%. An investigation into the impact of batch size resulted in the final best performance of 98.46% for the VGG16 model, with a batch size of 64. There were varying degradations in accuracy with higher and lower batch sizes. It is generally recommended to fix the batch size throughout model evaluations and also to choose a value that is a power of 2 in order to exploit computation optimizations in some GPU implementations. It should be possible to further increase the accuracy of the models by further tuning and by acquiring more training data. Based on the preceding evaluation, we can summarize the most important best-model hyperparameters as shown in Table 5.8; a sketch combining these settings follows.
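Putting the settings of Table 5.8 together, a training call might look like the following sketch. The helper build_vgg16_classifier is the hypothetical one from the earlier sketch, and the array names x_train, y_train, x_val, and y_val are assumptions.

from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Best settings from Table 5.8: 1024 units, dropout 0.2, L2(1e-4) activity
# regularization, and Adam at 1e-4 (all inside the hypothetical helper).
model = build_vgg16_classifier(dropout_rate=0.2, units=1024)

callbacks = [
    EarlyStopping(monitor="val_categorical_accuracy", patience=30,
                  restore_best_weights=True, mode="max"),
    ReduceLROnPlateau(monitor="val_categorical_accuracy", factor=0.5,
                      patience=5, min_lr=0.00001),
]
history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    batch_size=64, epochs=200, callbacks=callbacks)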
Table 5.8 Setting of best model hyperparameters with the VGG16 model as base model

Parameter            Setting
Units                1024
Batch size           64
Dropout              0.2
Regularizer          activity_regularizer, l2(1e-4)
Normalization        BatchNormalization
EarlyStopping        monitor = 'val_categorical_accuracy', patience = 30, restore_best_weights = True, mode = 'max'
Learning rate        1e-4 (Adam optimizer)
Epochs               200
ReduceLROnPlateau    monitor = 'val_categorical_accuracy', factor = 0.5, patience = 5, min_lr = 0.00001
5.5 Application of EuroSAT Results to Uncorrelated Dataset
We applied the above model to separately acquired data. This dataset is also Sentinel-2 data, covering the areas surrounding Gweru city in Zimbabwe [7]. Gweru is a small city characterized by a dry, cool winter season from May to July, a hot, dry period from August to early November, and a warm, rainy period from early November to April. The hottest month is October, while the coldest is July. Temperatures range from an average of 21 °C in July to 30 °C in October, while the annual rainfall is about 684 mm. In this chapter, only median post-rainy-season Sentinel-2 imagery will be used for land cover classification. Although the median post-rainy Sentinel-2 imagery (April–June 2020) comprises 13 spectral bands with spatial resolutions ranging between 10 and 20 m, we will only use the RGB bands, in a similar fashion to the EuroSAT dataset. It has already been shown in [8] that the RGB bands give the highest accuracy when deep learning algorithms are considered. As preparation, the original GeoTIFF data is converted into 64 × 64 patches for processing by the deep learning algorithm (a rough sketch of this step follows below). Since we have already confirmed that VGG16 is the best performing model on the EuroSAT dataset, we will evaluate only this model on the Gweru dataset. It is obvious from the location information that this data is completely uncorrelated with the EuroSAT data.
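As a rough illustration of this preparation step, the following sketch, assuming the rasterio package and an illustrative file name, cuts the RGB bands of a GeoTIFF into non-overlapping 64 × 64 patches; the Sentinel-2 band order used here is also an assumption.

import numpy as np
import rasterio  # assumed to be available for GeoTIFF I/O

# Hypothetical file name; bands 4, 3, 2 correspond to Sentinel-2 R, G, B.
with rasterio.open("gweru_post_rainy_median.tif") as src:
    rgb = src.read([4, 3, 2])          # shape: (3, height, width)

img = np.moveaxis(rgb, 0, -1)          # reorder to (height, width, 3)
patches = [img[r:r + 64, c:c + 64]
           for r in range(0, img.shape[0] - 63, 64)
           for c in range(0, img.shape[1] - 63, 64)]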
5.5.1 Evaluation of 10 Classes with Best EuroSAT Weights
Please refer to the companion notebook (zimsat-projectbook-blg.ipynb) to get better insight into the nature of the data and also as part of the hands-on experience [6].
Below is the class distribution of the Gweru dataset. See Fig. 5.173. Using the best model (vgg16_eurosat8breg_act_batch64.h5) from the EuroSAT training data, the following results are obtained. See Figs. 5.174 and 5.175. The accuracy barely exceeds 20%, and some classes, like Forest and Pasture, cannot be correctly classified at all. The Gweru test dataset consisted of only 1648 images, which makes it difficult to judge whether the same trend would hold with a larger dataset, although the expectation is that the test set size shouldn't be a factor. We note that EuroSAT was evaluated with 5400 test images and achieved an accuracy of 98.46% (see the figure below). This observation reflects the well-known fact that high validation accuracy does not always translate to high accuracy when the model is exposed to completely unseen data. So, what to do next in this situation?
Results Summary (Gweru data, model = vgg16_eurosat8breg_act_batch64.h5):
1648 images belonging to 10 classes.
Accuracy: 0.20449029126213591
Global F2 Score: 0.20449029126213591
Recap of results from the EuroSAT dataset with the same model. See Fig. 5.176.
Result Summary:
5400 images belonging to 10 classes.
Accuracy: 0.9846296296296296
Global F2 Score: 0.9846296296296296
See Fig. 5.177.
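A hypothetical way to reproduce this kind of evaluation with the saved weights is sketched below; the directory layout, the rescaling, and the variable names are assumptions, not the book's exact code.

import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.image import ImageDataGenerator

model = load_model("vgg16_eurosat8breg_act_batch64.h5")

# Assumed directory of 64 x 64 Gweru patches, one subfolder per class.
flow = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "gweru_patches/", target_size=(64, 64), batch_size=64,
    class_mode="categorical", shuffle=False)

probs = model.predict(flow)
accuracy = np.mean(np.argmax(probs, axis=1) == flow.classes)
print(f"Accuracy: {accuracy:.4f}")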
Fig. 5.173 Class distribution of original Gweru data
Fig. 5.174 PRF results for the Gweru dataset
[Figure content: per-class recall from the confusion-matrix diagonal: AnnualCrop 0.15, Forest 0.00, HerbaceousVegetation 0.12, Highway 0.06, Industrial 0.14, Pasture 0.00, PermanentCrop 0.02, Residential 0.82, River 0.03, SeaLake 0.13]
Fig. 5.175 Confusion matrix (percentage) for the Gweru dataset
Fig. 5.176 Precision, recall, accuracy results from the vgg16_eurosat8breg_act_batch64.h5 with EuroSAT dataset
Fig. 5.177 Confusion matrix results from the vgg16_eurosat8breg_act_batch64.h5 with the EuroSAT dataset
Since we already have a working model, our best bet, and indeed a key utility of the deep learning approach, is to re-use this model as a starting point and see how much improvement can be achieved. However, we are faced with a class imbalance problem: the HerbaceousVegetation class accounts for about 47% (779/1648) of the whole dataset and by far outnumbers the rest of the classes, while the minority class SeaLake has as few as 8 samples (about 0.5%). The data scarcity issue also applies to Highway, Industrial, PermanentCrop, and River, which have less than 100 data points per class. See Fig. 5.178. Some strategies to explicitly deal with this class imbalance problem that have been addressed in the literature include, but are not limited to [9–12]:
Strategy 1: Merging near-identical classes into one class
Strategy 2: Downsizing majority samples
Strategy 3: Resampling specific classes
Strategy 4: Adjusting the loss function.
[Figure content: bar chart of per-class sample counts; values 779, 123, 131, 32, 36, 98, 60, 264, 117, and 8 (total 1648)]
Fig. 5.178 Distribution of Gweru class data by numbers
We first try a combination of Strategies 1 and 2. To realize Strategy 1, we define 6 classes as in [7]: Built-up, Bare areas, Cropland, Woodland, Grass/open areas, and Water, and then map the 10 EuroSAT classes into these classes as described in Table 5.9 (a minimal relabeling sketch follows the table). As for Strategies 3 and 4, it has been observed that they introduce no new information, so the gains from these approaches are minimal. We will therefore leave them for future consideration. In any case, there is nothing better than having more real data for each class if time and resources allow.
Mapping Strategy 1:
Cropland (Cr) = AnnualCrop + PermanentCrop
Built-up (BU) = Residential + Industrial + Highway
Woodland (Wd) = Forest
Grass/open areas (Gr) = Pasture
Water (Wt) = River + SeaLake
Bare areas (BA) = HerbaceousVegetation.
With the above mapping, the distribution of the Gweru dataset is shown in Fig. 5.179. As a result of this operation, the majority class becomes BareAreas at about 44%, while the minority class is Grassland (Grass/open areas) at 5%. There is a slight improvement in the class balance, but not in the classification results, as reflected in the PRF results below.
Sample images from the 6 classes: See Fig. 5.180.

Table 5.9 Mapping Gweru dataset classes to EuroSAT dataset classes

Built-up (BU): residential, commercial, services, industrial, transportation, communication and utilities, and construction sites. Merged EuroSAT classes: Residential + Industrial + Highway.
Bare areas (BA): bare, sparsely vegetated area with > 60% soil background; includes sand and gravel mining pits and rock outcrops. Merged EuroSAT classes: HerbaceousVegetation.
Cropland (Cr): cultivated land or cropland under preparation, fallow cropland, and cropland under irrigation. Merged EuroSAT classes: AnnualCrop + PermanentCrop.
Woodland (Wd): woodlands, riverine vegetation, shrub and bush. Merged EuroSAT classes: Forest.
Grass/Open areas (Gr): grass cover, open grass areas, golf courses, and parks. Merged EuroSAT classes: Pasture.
Water (Wt): rivers, reservoirs, and lakes. Merged EuroSAT classes: River + SeaLake.
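A minimal sketch of the Strategy 1 relabeling, using the class names from Table 5.9; the dictionary layout and the helper name are illustrative assumptions.

# Map each of the 10 original classes to its merged 6-class label.
CLASS_MAP = {
    "AnnualCrop": "Cropland",      "PermanentCrop": "Cropland",
    "Residential": "BuiltUp",      "Industrial": "BuiltUp",
    "Highway": "BuiltUp",
    "Forest": "Woodland",
    "Pasture": "Grassland",
    "River": "Water",              "SeaLake": "Water",
    "HerbaceousVegetation": "BareAreas",
}

def merge_labels(labels):
    """Relabel a list of 10-class names into the merged 6-class scheme."""
    return [CLASS_MAP[label] for label in labels]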
Fig. 5.179 Distribution of Gweru class data after Strategy 1 is applied
Fig. 5.180 Sample images from 6 classes after applying Strategy 1
5.5.2 Training Results with 6 Classes: Unbalanced/Balanced Case
See Fig. 5.181.
Summary of Result:
Found 312 images belonging to 6 classes.
Accuracy: 0.6474358974358975
Global F2 Score: 0.6474358974358975
See Fig. 5.182. Despite the inability to classify Grassland, Water, and Woodland, a drastic increase in overall accuracy of 44.29 percentage points, from 20.45% for 10 classes to 64.74% for the 6 classes, has been achieved. It can therefore be confirmed that Strategy 1 is effective for accuracy. However, the precision and recall are not acceptable for all classes.
Fig. 5.181 PRF of Gweru class data after Strategy 1 is applied. The Grassland, Water, and Woodland classes have 0% recall
[Figure content reconstructed as a table: confusion matrix (%), rows = true class]

                BareAreas  BuiltUp  Cropland  Grassland  Water  Woodland
BareAreas       0.92       0.06     0.02      0.00       0.00   0.00
BuiltUp         0.18       0.82     0.00      0.00       0.00   0.00
Cropland        0.52       0.03     0.45      0.00       0.00   0.00
Grassland       0.83       0.06     0.11      0.00       0.00   0.00
Water           0.00       0.00     0.00      0.79       0.13   0.08
Woodland        0.88       0.04     0.08      0.00       0.00   0.00

Fig. 5.182 Confusion matrix results for the reduction of classes from 10 to 6
We experimentally apply Strategy 2 to the data used in Strategy 1, reducing the sample size of BareAreas to 200 by carefully selecting the data; we end up with the distribution shown below (a sketch of this downsizing step follows the summary).
See Fig. 5.183.
See Fig. 5.184.
Summary of Result:
Found 196 images belonging to 6 classes.
Accuracy: 0.6275510204081632
Global F2 Score: 0.6275510204081632
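The downsizing of Strategy 2 can be realized with a sketch like the one below, assuming lists of file paths per class; random selection with a fixed seed here stands in for the careful manual selection described above, and the function name is illustrative.

import random

def downsample_class(paths_by_class, cls="BareAreas", target=200, seed=42):
    """Randomly keep at most `target` samples of the majority class."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    reduced = dict(paths_by_class)
    if len(reduced[cls]) > target:
        reduced[cls] = rng.sample(reduced[cls], target)
    return reduced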
Some improvement for Water and Woodland can be observed, but Grassland still has a recall and precision of zero, meaning prediction is not possible. The different combinations of precision and recall, which give a better understanding of how well a model is performing for a given class, are interpreted in Table 5.10 [13].
Fig. 5.183 Data distribution after applying Strategy 2 to the Strategy 1 dataset
Fig. 5.184 PRF of Gweru class data after Strategy 2 is applied
Table 5.10 Interpretation of precision and recall results with respect to a given class

                  Low recall                                   High recall
Low precision     Class prediction unreliable                  Class prediction reliable but not for
                  (model cannot recall many precisely)         others (model recalls imprecisely)
High precision    Class prediction reliable but                Class prediction reliable
                  detectability is low                         (model recalls many precisely)
                  (model recalls few precisely)
[Figure content reconstructed as a table: confusion matrix (%), rows = true class]

                BareAreas  BuiltUp  Cropland  Grassland  Water  Woodland
BareAreas       0.85       0.00     0.03      0.00       0.00   0.13
BuiltUp         0.07       0.89     0.00      0.00       0.02   0.02
Cropland        0.39       0.03     0.55      0.00       0.00   0.03
Grassland       0.78       0.06     0.11      0.00       0.00   0.06
Water           0.58       0.00     0.08      0.00       0.33   0.00
Woodland        0.38       0.00     0.04      0.00       0.04   0.54

Fig. 5.185 Confusion matrix of Gweru class data after Strategy 2 is applied
See Fig. 5.185. We are now able to classify Water and Woodland, but still cannot classify Grassland, with some sacrifice in the form of an accuracy decrease of about 2%, to about 62.75%. We also note that when precision is high, recall is low, and vice versa. This could be the impact of the limited data size for all classes: there is not enough information to learn all the classes accurately. So, what to do next? We observe that it makes sense to use Strategy 1 again and this time merge the Grassland and BareAreas classes into one class, GrassBareAreas. We end up with the class distribution shown in Fig. 5.186; sample images from the 5 classes are shown in Fig. 5.187.
5.5.3 Training Results with 5 Classes
It is known that accuracy is not the best measure of performance for imbalanced datasets. We therefore introduce the AUC ROC metric as part of the evaluation, in addition to PRF, as shown in the figure below; a hypothetical compile step adding this metric follows.
See Fig. 5.188.
See Fig. 5.189.
Summary of Result:
Found 196 images belonging to 5 classes.
Accuracy: 0.6887755102040817
Global F2 Score: 0.6887755102040817
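Keras ships a built-in AUC metric that can be tracked alongside accuracy during training; the compile step below is a sketch, where model is assumed to be the 5-class VGG16-based classifier.

from tensorflow.keras.metrics import AUC
from tensorflow.keras.optimizers import Adam

# Track ROC AUC next to categorical accuracy during training/validation.
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["categorical_accuracy", AUC(name="auc")])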
See Fig. 5.190.
Fig. 5.186 Distribution after applying Strategy 1
Fig. 5.187 Sample images from 5 classes after applying Strategy 1
It can be seen that the strategy is effective in improving the PRF metrics across all classes, at the same time achieving a validation AUC of 94.23%. The accuracy still remains around 68.88%. This demonstrates the importance of using different metrics for different data distributions. The classification of imbalanced data is not a simple task, especially when there is a very limited number of samples, as in our case. The best chance of improving performance is to start with a large dataset in which even the minority class is well represented.
Fig. 5.188 Training/Validation accuracy and loss and AUC performance for 5-class Gweru dataset
Fig. 5.189 PRF of Gweru class data after Strategy 1 is applied to the 6 classes
[Figure content reconstructed as a table: confusion matrix (%), rows = true class]

                BuiltUp  Cropland  GrassBareAreas  Water  Woodland
BuiltUp         0.96     0.02      0.00            0.02   0.00
Cropland        0.21     0.64      0.00            0.06   0.09
GrassBareAreas  0.24     0.07      0.53            0.10   0.05
Water           0.33     0.00      0.04            0.63   0.00
Woodland        0.38     0.00      0.00            0.04   0.58

Fig. 5.190 Confusion matrix of Gweru class data after Strategy 1 is applied to the 6 classes
Additionally, relying only on traditional evaluation metrics may be counterproductive, as the expected results cannot be obtained. In that case, it may become necessary to try alternative metrics or create new ones [14].
5.6 Concluding Remarks
All models tested were shown to be good predictors for the Residential and Forest classes on the EuroSAT dataset. This gives us a hint that they can be used to detect changes in urban expansion, where forest is converted to residential areas. EfficientNet models tend to classify Residential better than Forest for the 70–30 train–test split, which is the opposite of what was observed for the ResNet, VGG, and NasNet models. In general, for the EuroSAT dataset we could see that the VGG models performed well on the 80–20 split both with and without regularization. This leads us to explore further the utility of the VGG models for land-cover classification.
In this investigation, we discovered that there are many opportunities to improve the performance of deep learning algorithms to achieve the highest possible target. Most state-of-the-art algorithms are required to achieve an accuracy of not less than 98%. Through data manipulation and algorithm hyperparameter tuning, we could achieve an accuracy of 98.46% using VGG16 as the base model, without feature engineering. Other methods, such as model ensembles, have been suggested in the literature as viable approaches, although they may lead to increased training effort due to the huge number of parameters involved. If time and computation resources are not an issue, this approach is surely worth trying.
We also evaluated the performance of the best EuroSAT model weights on a non-EuroSAT dataset, specifically the Gweru dataset described in Sect. 5.3. Unfortunately, we could not get good results using these model weights. However, on retraining the VGG16 model we could get some reasonable results, albeit with limitations due to imbalanced data. Next steps will be to explore emerging approaches, including wide ResNets, and to extend the algorithms to non-EuroSAT datasets to solve real problems. The journey has just started!
References
1. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444. https://doi.org/10.1038/nature14539
2. Chollet F (2018) Deep learning with Python. Manning Publications Co.
3. Chollet F (2017) Xception: deep learning with depthwise separable convolutions. https://arxiv.org/abs/1610.02357
4. Zhu XX et al (2017) Deep learning in remote sensing: a comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4):8–36. https://doi.org/10.1109/MGRS.2017.2762307
5. Maggiori E, Tarabalka Y, Charpiat G, Alliez P (2017) Convolutional neural networks for large-scale remote-sensing image classification. IEEE Trans Geosci Remote Sens 55(2):645–657. https://doi.org/10.1109/TGRS.2016.2612821
6. Deep-Learning-Models: https://github.com/sn-code-inside/Deep-Learning-Models
7. Kamusoko C, Kamusoko OW, Chikati E, Gamba J (2021) Mapping urban and peri-urban land cover in Zimbabwe: challenges and opportunities. Geomatics 1(1):114–147. https://doi.org/10.3390/geomatics1010009
8. Helber P, Bischke B, Dengel A, Borth D (2018) Introducing EuroSAT: a novel dataset and deep learning benchmark for land use and land cover classification. In: IGARSS 2018 – 2018 IEEE International Geoscience and Remote Sensing Symposium, pp 204–207. https://doi.org/10.1109/IGARSS.2018.8519248
9. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
10. Koßmann D, Wilhelm T, Fink GA (2021) Generation of attributes for highly imbalanced land cover data. In: 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, pp 2616–2619. https://doi.org/10.1109/IGARSS47720.2021.9554331
11. Douzas G, Bação F, Fonseca J, Khudinyan M (2019) Imbalanced learning in land cover classification: improving minority classes' prediction accuracy using the geometric SMOTE algorithm. Remote Sensing 11:3040. https://doi.org/10.3390/rs11243040
12. Buda M, Maki A, Mazurowski MA (2018) A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw 106:249–259. https://doi.org/10.1016/j.neunet.2018.07.011
13. Scikit-learn: https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
14. TensorFlow: https://www.tensorflow.org/tutorials/structured_data/imbalanced_data