Applied Deep Learning
Design and implement your own Neural Networks to solve real-world problems
Dr. Neeraj Kumar | Dr. Rajkumar Tekchandani
www.bpbonline.com
Copyright © 2023 BPB Online

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor BPB Online or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

BPB Online has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, BPB Online cannot guarantee the accuracy of this information.
First published: 2023

Published by BPB Online
WeWork, 119 Marylebone Road
London NW1 5PU
UK | UAE | INDIA | SINGAPORE

ISBN 978-93-55513-724
Dedicated to
My beloved parents:
Late Shri Jai Singh
Smt. Nachtro Devi &
My wife Palwinder Kaur and My son Bhavik Nehra and
My daughter Anushka Nehra
— Dr. Neeraj Kumar
My beloved parents:
Shri Krishan Lal Tekchandani Smt. Kanta Tekchandani and
My wife Varsha Tekchandani and My son Tarun Tekchandani and
My daughter Medha Tekchandani
— Dr. Rajkumar Tekchandani
About the Authors

• Dr. Neeraj Kumar (SMIEEE) is working as Dean DCT and full-time Professor in the Department of Computer Science and Engineering, Thapar Institute of Engineering and Technology (Deemed to be University), Patiala (Pb.), India. He is also an adjunct professor at various organizations in India and abroad. He has published more than 600 technical research papers (DBLP Link) in top-cited journals and conferences, which have been cited more than 40,000 times by well-known researchers across the globe, with a current h-index of 110 (Google Scholar Link). He was named a highly cited researcher in 2019, 2020, and 2021 in the Web of Science (WoS) list. He has guided many research scholars leading to Ph.D. (16) and M.E./M.Tech (24) degrees. His research is funded by various competitive agencies across the globe. His broad research areas are green computing and network management, IoT, big data analytics, deep learning, and cyber-security.

• Dr. Rajkumar Tekchandani (B.Tech, M.Tech, Ph.D., CSE) is working as an Assistant Professor in the Department of Computer Science and Engineering, Thapar Institute of Engineering and Technology (Deemed to be University), Patiala (Pb.), India. He previously worked in the Department of Computer Science and Engineering at Dr. B.R. Ambedkar National Institute of Technology, Jalandhar, Punjab. He has fourteen years of academic experience in Computer Science and Engineering. He has published numerous technical research papers in top-cited journals and conferences, with a current h-index of 12. He is supervising research scholars leading to Ph.D. and has guided M.E./M.Tech (18) students. His broad research areas are Deep Learning, Machine Learning, Cognitive Science, Natural Language Processing, and Software Code Clone Detection.
Acknowledgements

There are a few people we want to thank for the continued and ongoing support they have given us while writing this book. First and foremost, we would like to thank our parents and families for continuously encouraging us; we could never have completed this book without their support.

We are grateful to the courses and companies that supported us throughout the process of learning deep learning and helped us with the hands-on sessions related to applied deep learning. Thank you for all the behind-the-scenes support.

We gratefully acknowledge Mr. Vibhu Bansal for his kind technical scrutiny of this book. Our gratitude also goes to the team at BPB Publications for being supportive, giving us ample time to finish the book, and publishing it.

At last, we would like to thank Shri Chamunda Nandikeshwar Maa, who gave us enough strength and patience to write such a detailed book on deep learning. We hope this book brings joy and a wonderful experience to readers worldwide.

Thanks & Regards

Dr. Neeraj Kumar
Dean DCT and Professor
Department of Computer Science and Engineering
Thapar Institute of Engineering and Technology, Patiala, Punjab

Dr. Rajkumar Tekchandani
Assistant Professor
Department of Computer Science and Engineering
Thapar Institute of Engineering and Technology, Patiala, Punjab
Preface

This book covers many different aspects of deep learning and its importance in computer science and other allied areas, and it introduces the role of deep learning in the real world. It shows how deep learning techniques matter for various aspects of computer science. The book builds a basic understanding of neural networks and emphasizes convolutional neural networks for image-based classification. Moreover, all the deep learning concepts are implemented in hands-on sessions using Python's libraries.

This book takes both theoretical and practical approaches for deep learning readers. It covers all the concepts related to deep learning in the simplest way, with real-world examples, including machine learning, deep learning techniques, neural networks, and convolutional neural networks, which can be used for classification in various domains. You can design your own neural networks by going through the entire book.

The book is divided into eight chapters, covering the basics of artificial intelligence, machine learning, deep learning, the intuition of neural networks, CNNs, object detection and localization, RNNs, LSTMs, and GANs. Each chapter ends with review questions, MCQs, web resources, and hands-on sessions using Python, so that readers find more interest and joy in exploring deep learning techniques. The details are listed below:

Chapter 1: Basics of Artificial Intelligence and Machine Learning - Provides a brief idea of machine learning and its types, with a proper explanation of classification and regression problems, and a brief explanation of clustering algorithms with solved examples. Readers will learn the difference between binary and multi-class classification and regression, and will understand the main concept of clustering and its types. Moreover, this chapter provides insights into various clustering algorithms along with their applications. Further, the basics of machine learning and artificial intelligence are implemented using Python in this chapter.
Chapter 2: Introduction to Deep Learning with Python - Covers the basics of deep learning concepts along with the libraries and datasets used in Python. Furthermore, it discusses various loss functions and covers the single-layer neuron. Additionally, the fundamentals of the SVM classifier and performance indicators are discussed.

Chapter 3: Intuition of Neural Networks - This chapter focuses on the intuition of neural networks in depth. In this chapter, we learn about perceptrons and activation functions. Moreover, we discuss the working of gradient descent, cost functions, and the relationship between log-likelihood and MSE. Furthermore, gradient descent for linear regression and backpropagation in neural networks, with corresponding loss functions, are described in detail. We also explore the effect of integration on the derivative of the cost function. At last, various computational graphs, along with gradients and different types of activation functions, are discussed with examples.

Chapter 4: Convolutional Neural Networks - This chapter covers convolutional neural networks. We explore the various layers of a CNN and discuss the concepts of padding, stride, and pooling with suitable examples, along with one- and two-dimensional convolution. We also visualize the working of a three-dimensional filter in 2-D convolution. Moreover, we show the working of a CNN with an illustrative example under the section How CNN works?. Finally, CNN case studies are discussed, along with their architectures and corresponding parameters. Readers can build their own CNN model for image classification by going through the entire chapter.

Chapter 5: Localization and Object Detection - In this chapter, we explore various tasks related to computer vision, such as object detection, localization, classification, and segmentation. Furthermore, we discuss extracting various objects from a scene or an image, and analyze the two kinds of image segmentation: semantic and instance-based segmentation. We also explain the working of the widely used object detection algorithm YOLO. Finally, you will study and compare various object detection algorithms based on convolutional neural networks, such as RCNN, Fast RCNN, and Faster RCNN, and explore the image segmentation algorithm Mask RCNN on the COCO dataset.

Chapter 6: Sequence Modeling in Neural Networks and Recurrent Neural Networks (RNN) - This chapter mainly focuses on recurrent neural networks. We explore the basics of recurrent neural networks with the help of suitable examples and mathematical formulations, and we realize recurrent neural networks with the help of three other neural networks. We also visualize recurrence relationships in RNNs using matrix augmentation. Moreover, we illustrate the concept of backpropagation through time in RNNs. Furthermore, various sequence predictions and real-world applications of RNNs are discussed. At last, readers can build their own RNN using the hands-on sessions.

Chapter 7: Gated Recurrent Unit, Long Short-Term Memory, and Siamese Networks - This chapter focuses on GRU, LSTM, and Siamese networks. We discuss the simple gated recurrent unit, with an understanding of the input and cell update units, and then the full gated recurrent unit, showing the concept of the relevance gate in the full GRU. Furthermore, we cover the transition from GRU to LSTM, examine the various gates (input, output, and forget gates), and show the importance of LSTM over RNN. Moreover, we explain the concept of backpropagation in LSTM. We also discuss various applications of LSTM and the types of long short-term memory networks, such as vanilla, stacked, CNN, encoder-decoder, and bidirectional. Finally, Siamese networks and the triplet loss function are discussed.

Chapter 8: Generative Adversarial Networks - This chapter mainly focuses on the intuition of GANs. We discuss the architecture of a GAN along with the generator and discriminator functions. Moreover, we discuss the working of backpropagation during the training process of a GAN for classification between real and fake images. Furthermore, this chapter discusses various types of GAN, such as Cycle GAN, Pix-to-Pix GAN, Dual GAN, Stack GAN, and so on. Finally, various real-world applications of GANs are discussed in detail.
Coloured Images

Please follow the link to download the Coloured Images of the book:
https://rebrand.ly/o1ptsh1

We have code bundles from our rich catalogue of books and videos available at https://github.com/bpbpublications. Check them out!
Errata

We take immense pride in our work at BPB Publications and follow best practices to ensure the accuracy of our content and provide an enjoyable reading experience to our subscribers. Our readers are our mirrors, and we use their inputs to reflect and improve upon human errors, if any, that may have occurred during the publishing processes involved. To let us maintain the quality and help us reach out to any readers who might be having difficulties due to any unforeseen errors, please write to us at: [email protected]

Your support, suggestions, and feedback are highly appreciated by the BPB Publications' Family.

Did you know that BPB offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.bpbonline.com, and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.bpbonline.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on BPB books and eBooks.
Piracy

If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author

If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit www.bpbonline.com. We have worked with thousands of developers and tech professionals, just like you, to help them share their insights with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions. We at BPB can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about BPB, please visit www.bpbonline.com.
Join our book's Discord space

Join the book's Discord workspace for the latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:
https://discord.bpbonline.com
xi
Table of Contents 1. Basics of Artificial Intelligence and Machine Learning................................. 1 Introduction............................................................................................................... 1 Structure..................................................................................................................... 1 Objectives................................................................................................................... 2 Patterns....................................................................................................................... 2 Pattern representation........................................................................................3 Analysis of patterns............................................................................................4 Robustness..........................................................................................................4 Computational efficiency...................................................................................5 Statistical ability.................................................................................................5 Pattern classes....................................................................................................5 Intra-class variability................................................................................................ 6 Inter-class variability................................................................................................. 6 Inter-class similarity.................................................................................................. 6 Pattern analysis tasks................................................................................................ 7 Pattern classification—supervised tasks............................................................7 Pattern clustering-unsupervised tasks...............................................................8 Semi-supervised tasks........................................................................................8 Data and its types in Machine Learning................................................................ 9 Numerical data..................................................................................................9 Time series data..................................................................................................9 Text data...........................................................................................................10 Categorical data...............................................................................................10 Issues with the categorical data..................................................................11 Integer encoding..........................................................................................11 One-hot encoding........................................................................................12 Machine learning feature set.................................................................................. 
13 Handcrafted features........................................................................................13 Spatio temporal features..................................................................................14 Discriminative features....................................................................................15 Artificial intelligence........................................................................................16 Machine learning and its goals.............................................................................. 16
xii
Machine learning techniques................................................................................ 19 Deep learning techniques.................................................................................20 Discriminative learning techniques.................................................................21 Types of Regression................................................................................................. 22 Linear regression..............................................................................................23 Logistic regression.............................................................................................25 Non-linear regression.......................................................................................31 Generative learning..........................................................................................33 Clustering................................................................................................................. 41 Need of clustering.............................................................................................43 Different methods of clustering........................................................................43 Partitioning method....................................................................................43 Hierarchical method....................................................................................43 Density-based method................................................................................44 Grid-based method.....................................................................................44 Clustering algorithms.......................................................................................44 K-means clustering algorithm—partitioning method...............................44 Agglomerative clustering algorithm [hierarchical]............................................ 47 Divisive clustering algorithm [hierarchical]........................................................ 50 Clustering applications........................................................................................... 51 Weakly supervised learning.............................................................................52 Statistical models..............................................................................................53 Hyper planes.....................................................................................................55 Hands-on session on AI and machine learning basics....................................... 57 Clustering.........................................................................................................58 Multiple linear regression.................................................................................60 K-means clustering...........................................................................................62 Agglomerative clustering..................................................................................63 Decision tree.....................................................................................................64 Conclusion............................................................................................................... 65 Review questions..................................................................................................... 66 True or False............................................................................................................ 
66 Multiple choice questions....................................................................................... 67 Answers.................................................................................................................... 68 Web resources.......................................................................................................... 69
xiii
2. Introduction to Deep Learning with Python................................................71 Introduction............................................................................................................. 71 Structure................................................................................................................... 71 Objectives................................................................................................................. 72 Artificial intelligence and deep learning.............................................................. 72 Introduction to deep learning..........................................................................73 Introduction to Python.......................................................................................... 74 Features of Python............................................................................................74 Machine learning datasets...................................................................................... 74 Training dataset...............................................................................................75 Validation dataset............................................................................................75 Test dataset.......................................................................................................75 ImageNet...........................................................................................................76 Modified National Institute of Standards and Technology database (MNIST)........................................................................77 Machine learning libraries..................................................................................... 78 Numpy..............................................................................................................79 Arrays in Numpy.........................................................................................79 Array indexing............................................................................................79 Theano..............................................................................................................80 Keras.................................................................................................................80 TensorFlow.......................................................................................................80 Pandas..............................................................................................................80 SciPy.................................................................................................................81 PyTorch.............................................................................................................81 Matplotlib.........................................................................................................81 Neural network........................................................................................................ 82 Support Vector Machine(SVM)............................................................................ 84 Annotation and classification................................................................................ 84 Segmentation in deep learning.............................................................................. 85 Object detection...................................................................................................... 
86 Image processing basics for object detection....................................................86 Defining an image............................................................................................86 Semantic analysis.................................................................................................... 87 Loss functions.......................................................................................................... 88
xiv
Binary classification loss function...................................................................89 Binary cross-entropy loss function...................................................................89 Hinge loss function...........................................................................................91 Regression loss function...................................................................................92 Mean squared error loss function....................................................................92 Absolute error loss function.............................................................................93 Huber loss.........................................................................................................93 Multi-class classification loss function............................................................94 Multi-class cross-entropy loss...........................................................................95 KL divergence.......................................................................................................... 95 Confusion matrix.................................................................................................... 97 Accuracy......................................................................................................98 Precision......................................................................................................98 Recall or sensitivity.....................................................................................98 Specificity.....................................................................................................98 F-measure....................................................................................................98 Feature mapping....................................................................................................100 Identity mapping...................................................................................................101 Conclusion.............................................................................................................101 Review questions...................................................................................................101 State true or false...................................................................................................102 Multiple choice questions.....................................................................................102 Answers...........................................................................................................104 Hands-on session on introduction to deep learning........................................104
3. Intuition of Neural Networks......................................................................113 Introduction...........................................................................................................113 Structure.................................................................................................................113 Objectives...............................................................................................................114 Introduction to neuron........................................................................................114 Neural networks....................................................................................................115 McCulloch and Pitts network........................................................................117 Feed-forward (FF) neural networks...................................................................118 General formulation of the feed-forward network in matrix form..............123 Feed-forward network with two classes using the Softmax activation function....................................................................124
xv
Perceptron..............................................................................................................128 Perceptron learning rule.................................................................................129 Multi-layer perceptron (MLP).......................................................................130 Types of perceptron learning..........................................................................131 Supervised learning...................................................................................131 Unsupervised learning..............................................................................132 Reinforcement learning..................................................................................132 Gradient descent...................................................................................................132 Fundamentals of gradient descent............................................................133 Deep learning model......................................................................................136 Cost function.............................................................................................138 Gradient descent for linear regression...............................................................140 Different types of gradient descent algorithms..............................................144 Vanilla Descent algorithm........................................................................144 ADAGRAD................................................................................................145 Vanishing and exploding gradient problems.................................................145 Backpropagation in neural networks..................................................................145 Derivation of backpropagation in neural networks with loss function as binary cross entropy loss function.......................................147 Hidden layer calculations.........................................................................152 Output layer calculations..........................................................................153 Loss calculation.........................................................................................154 Backpropagation of error.....................................................................................154 Backpropagation with loss function as the mean square error function.....157 Hidden layer calculations.........................................................................158 Output layer calculations..........................................................................159 Loss calculation.........................................................................................160 Calculating backpropagation of the error................................................160 Effect of integration on the derivative of the cost function......................162 Computational graphs..........................................................................................164 Forward pass..................................................................................................165 Backward pass................................................................................................167 Forward pass differentiation and reverse pass differentiation in computational graphs....................................................................................170 Forward pass 
differentiation.....................................................................173
Backward/reverse pass differentiation for sigmoid function...................173 Activation function...............................................................................................176 Need for non-linear activation functions......................................................176 Different types of activation functions..........................................................177 Linear activation function........................................................................177 Sigmoid function.......................................................................................178 ReLU activation function..........................................................................178 Leaky ReLU activation function...............................................................179 Tanh activation function..........................................................................180 Softmax activation function.....................................................................181 The binary step activation function..........................................................181 ArcTan activation function.......................................................................182 Parametric ReLU activation function......................................................183 Exponential ReLU activation function.....................................................183 Conclusion.............................................................................................................186 Review questions...................................................................................................187 True/false questions..............................................................................................187 Multiple choice questions.....................................................................................188 Hands-on session..................................................................................................188 4. 
Convolutional Neural Networks..................................................................197 Structure.................................................................................................................198 Objectives...............................................................................................................198 Introduction to convolutional neural networks (CNN)..................................199 Drawbacks of traditional neural networks and emergence of CNN............201 CNN architecture..................................................................................................203 Layers of CNN.......................................................................................................203 Matrix representation of an image......................................................................206 Spatial filtering...............................................................................................206 Correlation.............................................................................................................207 Convolution...........................................................................................................209 Spatial filters...........................................................................................................212 Smoothing filters (low pass filters).................................................................212 Linear smoothing filter (mean filter).............................................................212 Non-linear smoothing filter (order statistics filter)..................................213
xvii
Sharpening filters (high pass filters)..............................................................215 Laplacian filter...............................................................................................216 Edge detection................................................................................................223 Image gradient................................................................................................223 Image gradient operators..........................................................................228 Sobel filters......................................................................................................230 Prewitt filters..................................................................................................230 Horizontal edge detection.........................................................................233 Padding in CNN....................................................................................................234 Concept of stride in CNN....................................................................................238 Convolution in 1-D...............................................................................................242 Convolution in 2-D...............................................................................................244 Visualization/expansion of 3-D filter in 2-D convolution...............................246 Simplified 2-D convolutions...........................................................................248 Number of parameters in a layer..............................................................253 Learning by CNN..................................................................................................254 2-D convolution example..........................................................................257 Convolution in 3-D...............................................................................................259 Importance of ReLU in the convolution layer...................................................261 Creating the first convolution layer...............................................................263 Pooling layer in CNN...........................................................................................264 Max pooling...........................................................................................................264 Average pooling..............................................................................................268 Fully connected layer............................................................................................269 Convolutional neural network with example (Implementation: How CNN works?)................................................................................................276 Convolution....................................................................................................279 Results of the first convolution layer..............................................................282 Use of Rectified Linear unit in CNN (ReLU)................................................298 Deep stacking..................................................................................................300 Fully connected layer in CNN.............................................................................301 Backpropagation............................................................................................306 Case studies of 
CNN.............................................................................................308 LeNet-5...........................................................................................................308 AlexNet...........................................................................................................310
xviii
ZFNet..............................................................................................................313 VGGNet (16-layer architecture)....................................................................313 Different types of model-fitting and dropout in neural networks................316 ResNET...........................................................................................................320 Inception module............................................................................................327 (1 * 1) Convolution...................................................................................328 GoogleNet.......................................................................................................330 Conclusion.............................................................................................................333 Web resources........................................................................................................333 Subjective questions..............................................................................................334 Multiple choice questions.....................................................................................334 Answers...........................................................................................................336 Hands-on session [CNN].....................................................................................336
5. Localization and Object Detection..............................................................345 Introduction...........................................................................................................345 Structure.................................................................................................................345 Objective.................................................................................................................346 Computer vision....................................................................................................346 Computer vision tasks....................................................................................348 Object recognition.....................................................................................348 Image classification...................................................................................350 Image classification using CNN..........................................................................351 Object localization.........................................................................................352 Object detection..............................................................................................354 Image segmentation..............................................................................................359 Semantic segmentation..................................................................................359 Instance segmentation....................................................................................360 YOLO algorithm....................................................................................................361 Encoding bounding boxes in YOLO..................................................................364 Intersection over Union (IOU)......................................................................366 Non-max suppression.....................................................................................367 Anchor-boxes.............................................................................................369 CNN for object detection.....................................................................................372 Region-based CNN...............................................................................................373
xix
Bounding box regression in RCNN................................................................375 Fast RCNN.....................................................................................................381 Loss function in fast RCNN......................................................................386 Faster RCNN..................................................................................................387 Region proposal networks (RPN).......................................................................389 Image segmentation algorithm—mask RCNN.................................................391 RoI align.........................................................................................................393 Conclusion.............................................................................................................393 Web resources........................................................................................................394 Questions................................................................................................................394 Multiple choice questions.....................................................................................394 Answers...........................................................................................................395 Hands on session (object localization and detection)......................................395 6. Sequence Modeling in Neural Networks and Recurrent Neural Networks (RNN).............................................................405 Introduction...........................................................................................................405 Structure.................................................................................................................405 Objectives...............................................................................................................406 Sequential data......................................................................................................406 Sequence generation.............................................................................................412 Sequence classification.........................................................................................413 Sequence to sequence predictions......................................................................414 Recurrent neural networks (RNN).....................................................................415 Why RNN is required?...................................................................................417 Realization of RNN...............................................................................................418 Matrix representation of fixed rule problem.................................................422 Neural network 1 (NN1)................................................................................428 Neural network 2 (NN2)................................................................................429 Sequence prediction..............................................................................................437 One-to-one sequence prediction model.........................................................438 One-to-many sequence prediction model......................................................439 Many-to-one sequence prediction model......................................................440 Many to many sequence prediction model....................................................440 RNN 
architecture..................................................................................................441
xx
Connect with realization of RNN..................................................................442 Representation of recurrent relationships..........................................................444 Matrix augmentation for representation of recurrence relationships.....445 Backpropagation in RNN.....................................................................................450 Vanishing gradient and exploding gradient problems in RNN......................461 Applications of RNN in the real world...............................................................468 Image captioning............................................................................................468 Time series prediction....................................................................................469 Natural language processing (NLP)...............................................................469 Machine translation.......................................................................................470 Conclusion.............................................................................................................470 Review questions...................................................................................................471 Multiple choice questions.....................................................................................471 Answers...........................................................................................................472 Hands on................................................................................................................472
7. Gated Recurrent Unit, Long Short-Term Memory, and Siamese Networks........................................................................................477 Structure.................................................................................................................477 Introduction...........................................................................................................478 Simple GRU...........................................................................................................478 Potential cell update unit ..............................................................................479 Update or input unit......................................................................................479 Cell update unit..............................................................................................480 Full gated recurrent unit......................................................................................480 GRU to LSTM transition......................................................................................480 Forget gate......................................................................................................484 Input modulation gate or potential cell update unit in LSTM.....................485 Input or update gate.......................................................................................486 Output gate.....................................................................................................487 Backpropagation in LSTM...................................................................................489 Applications of LSTM...........................................................................................504 Image captioning............................................................................................504 Handwriting generation.................................................................................505 Variants of LSTM..................................................................................................506
xxi
Vanilla LSTM.................................................................................................506 Stacked LSTM................................................................................................507 CNN LSTM....................................................................................................508 Encoder decoder LSTM..................................................................................508 Bidirectional LSTM........................................................................................509 Similarity measures used in Siamese networks.................................................510 Euclidean distance or L2 distance norm.......................................................513 Manhattan distance.......................................................................................514 Cosine similarity for sparse vector or sparse data set...................................514 Jaccard similarity...........................................................................................515 Dice similarity................................................................................................517 Overlap similarity..........................................................................................518 Siamese networks..................................................................................................520 Triplet loss..............................................................................................................526 Conclusion.............................................................................................................529 Questions................................................................................................................529 Multiple choice questions.....................................................................................530 Web resources........................................................................................................531 Hands-on session..................................................................................................531 8. 
Generative Adversarial Networks................................................................539 Introduction...........................................................................................................539 Structure.................................................................................................................539 Objectives........................................................................................................540 Introduction to GAN............................................................................................540 Basics of generative models.................................................................................541 Generative adversarial networks.........................................................................542 GAN architecture...........................................................................................542 Training a GAN..............................................................................................544 Training algorithm for GAN.....................................................................544 When to stop the training?........................................................................547 Understanding probability distributions...........................................................548 Mathematical generalization of GAN................................................................550 GAN training challenges......................................................................................552 The GAN loss function.........................................................................................553
xxii
The standard loss function [min-max loss]...................................................554 The discriminator loss...............................................................555 The generator loss......................................................................555 Combined loss...........................................................................555 Non-saturating loss (NS-loss)........................................................556 Wasserstein GAN loss....................................................556 Least-square GAN loss...................................................557 Types of GAN........................................................................................557 Vanilla GAN...................................................................................558 Deep convolutional GAN...............................................................558 Conditional GAN (cGAN).............................................................559 Semi-supervised GAN....................................................................561 Dual GAN......................................................................563 Pix-to-Pix GAN..............................................................564 Cycle GAN......................................................................................567 Stack GAN......................................................................................569 Applications of GAN............................................................................571 Implementation of the DCGAN.........................................................573 Importing the libraries...................................................................573 Set random seed for reproducibility................................................573 Dataset settings..............................................................574 Mount drive...............................................574 Check the mounted drive..........................................574 Unzip the dataset......................................................575 Checking the content of the dataset................................................575 Setting parameters.....................................................575 Data loader................................................576 Displaying some training samples............................................578 Weights initialization.....................................................579 Generator network.........................................................579 Discriminator network...................................................582 Loss function and optimizer..........................................................584 Training of GAN.............................................................................................584 Saving the model............................................................................586 Reload the model............................................................................586
Visualizing real versus fake images...............................................................586 Viewing the new generated sample................................................................587 Conclusion.............................................................................................................588 Multiple choice questions.....................................................................................588 Answers..................................................................................................................590 Questions................................................................................................................590 Index.................................................................................................................591
Chapter 1
Basics of Artificial Intelligence and Machine Learning

Introduction
Artificial intelligence is the demonstration of intelligent processes by machines, especially computer systems, similar to human intelligence. The process of artificial intelligence includes learning, reasoning, and self-correction. This chapter covers the basic concepts related to artificial intelligence, such as patterns, classification, and regression problems. In this chapter, we briefly describe the basic concepts of machine learning algorithms along with their types, such as deep learning and shallow learning techniques.
Structure
In this chapter, we will cover the following topics: • Patterns
• Intra-class variability • Inter-class similarity
• Pattern analysis tasks
• Data and its types in machine learning
• Machine learning feature set • Types of Regression • Classification • Clustering
• Agglomerative clustering
• Divisive clustering algorithms • Clustering applications
• Hands-on session on AI and machine learning basics
Objectives
After going through this chapter, you will be able to understand the basic concepts of machine learning and its types. You will understand the types of classification and regression problems, get a brief idea about clustering algorithms, and be able to run clustering algorithms in R.
Patterns
In the last few years, pattern analysis has been one of the emerging trends in the research community. A pattern can be described as a theme of repeating events or elements of a set of objects, where the elements repeat themselves in a predictable manner. In the digital world, a pattern is almost everywhere: for example, the color of clothes, speech patterns, and so on. It can be observed either mathematically or physically by using some type of algorithm. A pattern can be a proven solution, in a specified manner, for a common problem. In 1979, Alexander wrote that "Each pattern is a three-part rule, which expresses a relation between a certain context, a problem, and a solution." According to the Gang of Four, a pattern can be summarized into four different parts: a context where the pattern can be useful, the issues which are addressed by the pattern, the forces used for forming a solution, and the solution that resolves those forces.

Example 1: Consider the following examples and try to complete these patterns.

• 1,2,3,4,5,6,…,24,25,26,27,28.
• 1,3,5,7,9,11,…,25,27,29,31,33
• 2,3,5,7,11,13,…,29,31,37,41,43
• 1,4,9,16,25,36,…,121,144,169,196
• 1,2,4,8,16,32,64,…,1024,2048,4096,8192 • 1,1,2,3,5,8,13,…,55,89,144,233,377
• 1,1,2,4,7,13,24,…,81,149,274,504,927
• 3,5,12,24,41,63,….., 201,248,300,357,419 • 2,7,12,17,22,27,32,…..,42,47,52,57,62 • 1,6,19,42,59,…,95,117,156,191,?
As per the preceding example, it is easy to fill in the patterns initially, but as we progress downwards, it becomes a little more complex. So, a pattern is defined as any regularity or structure in data, and pattern analysis is the automatic discovery of such patterns in data.
Pattern representation

In the field of computer science, a pattern is represented using a vector of feature values. A feature is any distinct characteristic, quality, or aspect; features can be numeric (for example, width and height) or symbolic. Suppose an object has d features; the combination of these d features can be represented as a column vector of dimension d, known as a feature vector. The space defined by the d-dimensional feature vector is called the feature space. Objects are then represented as points in the feature space, and that representation is termed a scatter plot. Figure 1.1 shows various pattern representations:
Figure 1.1 (a)–(c): Pattern representation, dimension space of vector, and different classes of patterns.
Figure 1.1(a)–(c) represents the vector of size d, the dimension space of the vector and classes of patterns.
Analysis of patterns

Pattern analysis is a phase of pattern recognition that uses the knowledge already present in data to uncover patterns with data analysis techniques. It deals with automatically detecting patterns in data coming from a source and with making predictions about new data coming from that source. The information coming from the source can be of any form, such as text, images, family trees, records of commercial transactions, and so on. Identifying patterns from a finite dataset presents its own distinctive challenges. So, in order to design an effective pattern analysis algorithm, one should consider three key properties, as follows.
Robustness

The first challenge in designing an effective pattern analysis algorithm is that, in real-time applications, data can be corrupted by noise, because of the randomness of the wireless channel or by virtue of human errors. So, while designing the algorithm, it must be kept in mind that the algorithm must identify approximate patterns and handle noisy data smoothly, such that the noise
should not affect the output of patterns or data analysis techniques. The algorithms that possess this property are considered robust.
Computational efficiency

As the amount of data increases enormously day by day, the designed algorithm must be able to handle ever-larger datasets. So, if the algorithm works well for small datasets, it should also work well for large ones. Basically, computationally efficient algorithms are those whose resource requirements scale polynomially with the size of the data.
Statistical ability

This is the most basic property an algorithm should have: the patterns identified by the algorithm must be genuine, and not merely an accidental relation with the attributes of the particular dataset. We can define this property as follows: if we apply the algorithm to new data coming from the same source, it should be able to identify a similar type of pattern. Thus, the output of the algorithm should be sensitive to the data source and not to the particular dataset. If the algorithm is sensitive only to a particular dataset, its patterns are spurious; if the algorithm gives similar types of patterns from all the datasets coming from the same source, then it can be termed statistically stable.
Pattern classes

A pattern class can be defined as a collection of similar types of objects that may not be identical. A class consists of exemplars, prototypes, paradigms, and learning/training samples. Figure 1.2 represents different types of class variabilities:
Figure 1.2 (a): Low inter-class and high intra-class variability. (b) Low intra-class and high inter-class variability.
There are two types of variability and one type of similarity in pattern classes:

• Intra-class variability
• Inter-class variability
• Inter-class similarity
Machine learning algorithms have to deal with these variabilities and similarities.
Intra-class variability
It refers to deviations within a particular class for a specific object that are not part of a systematic difference between classes; basically, it is within-class variability. It can be used to map different types of objects: in land-cover mapping, for instance, bare soil, forests, and rocks can be mapped using intra-class variability. It is the variation that exists between all the samples of a particular class that are used to train the machine. Intra-class variability uses the feature space as a tool, which characterizes patterns according to their features (for example, spectral signatures), so an appropriate feature space can be chosen according to the specific requirement.
Inter-class variability
Inter-class variability means the variability among the different classes in a dataset. It can be used where one needs to separate the different classes that exist in the dataset; the accuracy of object classification depends upon inter-class variability. Figure 1.2(a) represents patterns having low inter-class and high intra-class variability, whereas figure 1.2(b) represents patterns having low intra-class and high inter-class variability. For an ideal clustering algorithm, the intra-class variability should be minimal, whereas the inter-class variability should be maximal; this is an important criterion for obtaining classes of different patterns. In classification, a labeled dataset already exists, so these variabilities do not have to be discovered in supervised learning. Both inter-class and intra-class variability depend on the chosen distance metric.
Inter-class similarity
In inter-class similarity, data points from different classes look similar: they lie almost nearest to each other and appear to belong to the same cluster. The outcome is calculated as the ratio of the total summation distance to the summation of all the distances within one cluster of the dataset. Inter-class similarity can be seen in figure 1.3:
Figure 1.3: Inter-class similarity
If you observe I1, both I and 1 look similar, but they are actually different: I represents a letter, and 1 represents a number. This illustrates inter-class similarity, that is, the similarity of a class with members of other classes.
Pattern analysis tasks
When we discuss the importance of patterns in data, the main aim of pattern analysis is the prediction of data features as a function of the values of the other features. So, the tasks associated with pattern analysis share a common feature: their prediction intention. The training data arrives as (X, Y), where X is the feature set, called a feature vector and represented as X = {x1, x2, …, xn}. A function f is required to predict the class label Y′ on the basis of a feature set X′. Basically, there are three types of pattern analysis tasks, namely supervised, semi-supervised, and unsupervised tasks. We will discuss all these types of tasks in detail.
Pattern classification—supervised tasks

Supervised tasks are pattern analysis tasks where each output has an associated label. For such tasks, a pattern can be sought in the form:

f(x, y) = L(y, g(x))

where L denotes the loss function and g is the prediction function. It gives the variation between the true value y and the output of the prediction function, so when the pattern is detected, the loss is expected to be zero. Figure 1.4 shows the process of classification using feature extraction:
Figure 1.4: Pattern classification
For example, if we take the label Y = Wild Animals, that is, a class, then the feature set is X = {canines, stripes, carnivore}. If a particular object, say a tiger, has these features, then our prediction algorithm f predicts that this object belongs to Y. Figure 1.4 depicts the pattern classification process, where X represents a feature vector.
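As a minimal sketch of the pattern f(x, y) = L(y, g(x)), consider the following Python snippet with a toy prediction function g and a zero-one loss; the feature-based rule inside g is an assumption made purely for this illustration:

```python
# A toy prediction function g; the rule used here is hypothetical.
def g(features):
    """Classify as 'Wild Animals' if the object has the 'stripes' feature."""
    return "Wild Animals" if "stripes" in features else "Domestic Animals"

def loss(y_true, y_pred):
    """Zero-one loss L(y, g(x)): 0 when the pattern holds, 1 otherwise."""
    return 0 if y_true == y_pred else 1

x = {"canines", "stripes", "carnivore"}   # feature set of a tiger
print(loss("Wild Animals", g(x)))         # 0 -- the pattern is detected
```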
Pattern clustering—unsupervised tasks

Unsupervised tasks do not have labels: only the training data is available, and labels must be predicted for the test data. Basically, two methods are used for unsupervised tasks, namely cluster analysis and principal component analysis. Cluster analysis is a sub-branch of machine learning that groups unlabeled or unclassified data. It is used in unsupervised tasks for grouping datasets by their attributes: it identifies the common entities present in the data sample and then reacts on the basis of the presence or absence of such common entities in new data coming from the same data source.
Semi-supervised tasks

In semi-supervised tasks, the labels or features are only partially known. For example, consider a ranking task where we only have a relative ordering of the training set, but our objective is to impose the same ordering on new data coming from the same source. For these types of problems, a value function is assumed, and its value is estimated during training, so new data from the same source is assessed using the output of the value function. Another situation is transduction, in which partial information is present regarding the labels: only some data arrives with an associated label, so the task becomes predicting the labels for the data that is not labeled. The main aim here is to minimize the combination of the querying cost and the generalization error.
Data and its types in Machine Learning
Data can be defined as a set of values of quantitative or qualitative variables. Nowadays, almost everything can be represented as data, and understanding the different data types is essential for machine learning models. From a machine-learning perspective, data can mostly be divided into four basic types, as follows:

1. Numerical data
2. Text data
3. Time series data
4. Categorical data

Figure 1.5 shows the different types of data used in machine learning:
Figure 1.5: Types of data
Numerical data

Numerical data is data in which the data points are numerical values; it can also be termed quantitative data, for example, the number of cars sold in the past month. Numerical data can be divided into discrete and continuous data: discrete data takes a set of distinct values, whereas continuous data can take any value within a specified range. For example, a student's marks of 10, 20, 30, 40, 50, and 60 are numeric data within the range of 0 to 100.
Time series data

Time series data can be defined as an ordered sequence of numbers collected at regular intervals of time. It has applications in finance, for example. There is a temporal value attached to a time series, such as a timestamp, so trends appear over time.
For example, measuring environmental sensor data every month produces time-series data. The difference between numerical and time series data is that a time series follows an implied ordering, whereas numeric data may consist of a bunch of values without any ordering sequence.
Text data

Basically, words are text data. When using machine learning techniques, we convert words into numbers by using appropriate functions, such as the word-embedding function word2vec. For example, while reading this book, you are actually reading text data.
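Training word2vec is beyond this chapter, but the following sketch shows the basic idea of turning words into numbers with a simple bag-of-words count using scikit-learn; the two sample sentences are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two example sentences (hypothetical data, just for illustration).
corpus = ["deep learning learns features",
          "machine learning learns patterns"]

vectorizer = CountVectorizer()             # builds a vocabulary of unique words
counts = vectorizer.fit_transform(corpus)  # counts each word per sentence

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(counts.toarray())                    # each row: word counts for one sentence
```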
Categorical data

Categorical data is a set of variables that take label values instead of numeric values. In categorical data, the possible values form a fixed set. Categorical variables can also be termed nominal variables. For example:

• A color variable contains the values: green, blue, and red
• A pet variable contains the values: cat and dog
• T-shirt size as S, M, L, and XL
Figure 1.6 shows a bar graph of categorical data:
Figure 1.6: Categorical data representation
Figure 1.6 shows a bar graph of categorical data in which fruit classes are divided into five categories, such as apple, banana, and so on and the number of pieces is shown for each fruit. Here, each value represents a different category.
Issues with the categorical data

Some machine learning algorithms work fine with categorical data. For example, a decision tree algorithm that splits on information gain can learn from categorical data without any data transformation. However, many machine learning algorithms cannot work directly on categorical data, as they require all input and output variables to be numeric. This is a constraint for clustering-based algorithms, so categorical data has to be converted into numerical data. Also, if the output variable is categorical, the predictions need to be converted back into categorical form so that they can be used for the application at hand. Hence, the question arises of how to convert categorical data into numeric data. The conversion can be done using the following encoding techniques:

• Integer encoding
• One-hot encoding

Both these techniques are explained as follows.
Integer encoding

Integer encoding is the first step in converting categorical data into numeric data. Each unique object or class is assigned a value distinct from the other classes. For example, suppose we have three classes, namely apple, banana, and orange. Table 1.1 shows an example of integer encoding:

Class  | Number
Apple  | 1
Banana | 2
Orange | 3

Table 1.1: Integer encoding
Integer encoding is easily reversible: 1 → apple, 2 → banana, and 3 → orange. The advantage of integer encoding is that the integer values have a numerically ordered relationship among themselves, which machine learning algorithms can easily understand.
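A minimal sketch of integer encoding in Python, reproducing Table 1.1 with a plain dictionary (no particular library is required):

```python
classes = ["Apple", "Banana", "Orange"]

# Assign each unique class a distinct integer, as in Table 1.1.
to_int = {label: i + 1 for i, label in enumerate(classes)}
to_label = {i: label for label, i in to_int.items()}  # reverse mapping

print(to_int)       # {'Apple': 1, 'Banana': 2, 'Orange': 3}
print(to_label[2])  # 'Banana' -- the encoding is easily reversible
```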
One-hot encoding

In machine learning, one-hot encoding uses a collection of bits with a unique 1-bit (high/ON) for each unique class and all remaining bits set to 0 (low/OFF). It is used for multi-class classification. Conversely, a representation in which only a single 0-bit is present and the rest of the bits are 1 is called one-cold. Table 1.2 represents the binary and one-hot representations of the numbers 0 to 9:

Number | Binary representation | One-hot representation
0      | 0000                  | 0000000001
1      | 0001                  | 0000000010
2      | 0010                  | 0000000100
3      | 0011                  | 0000001000
4      | 0100                  | 0000010000
5      | 0101                  | 0000100000
6      | 0110                  | 0001000000
7      | 0111                  | 0010000000
8      | 1000                  | 0100000000
9      | 1001                  | 1000000000

Table 1.2: Binary and one-hot representation
One-hot encoding is generally used to represent the different states of a state machine. In binary or gray code, a decoder is needed to determine the state of the machine, but in the case of a one-hot state machine, no decoder is needed. Table 1.3 represents the label encoding method, in which the objects are divided categorically; for example, Apple is labeled as category 1:

Fruit name | Categorical | Calories
Apple      | 1           | 95
Banana     | 2           | 70
Orange     | 3           | 65

Table 1.3: Label encoding
Table 1.4 represents one-hot encoding, where each object has only a single 1-bit (high):

Apple | Banana | Orange | Calories
0     | 1      | 0      | 70
1     | 0      | 0      | 95
0     | 0      | 1      | 65

Table 1.4: One-hot encoding
Machine learning feature set
In pattern recognition and machine learning, a feature can be defined as a distinct computable characteristic or property of a phenomenon. Choosing effective, informative, and discriminative features plays an important role in classification, regression, and pattern recognition. Different types of features are discussed as follows.
Handcrafted features

Handcrafted features refer to properties derived directly from the data (here, from the image itself) for use in different algorithms. For example, edges and corners are two simple features that can be extracted from images. A simple edge detection algorithm, such as the Laplacian (based on differentiation, the rate of change), works by determining the regions in the image where the intensity changes quickly. In simple terms, a usual image is nothing but a 2D matrix (or multiple matrices when there are multiple channels, like RGB: Red, Green, and Blue). Figure 1.7 represents a gray-scale image with intensity values from 0 to 255:
Figure 1.7: Gray-scale image
In the case of an 8-bit gray-scale (or black and white) image, the image is a two-dimensional matrix with pixel values ranging from 0 to 255, where 0 denotes complete black and 255 denotes complete white, as shown in figure 1.7.
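The following sketch applies the standard 3×3 Laplacian kernel to a tiny synthetic gray-scale matrix; the image values are made up, and SciPy's convolve2d is just one way to perform the convolution:

```python
import numpy as np
from scipy.signal import convolve2d

# A tiny synthetic 8-bit gray-scale image: dark left half, bright right half.
image = np.zeros((6, 6), dtype=float)
image[:, 3:] = 255.0

# The standard 3x3 Laplacian kernel approximates the second derivative;
# its response is large wherever the intensity changes quickly (an edge).
laplacian = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=float)

edges = convolve2d(image, laplacian, mode="same", boundary="symm")
print(np.abs(edges))  # non-zero only around the boundary between columns 2 and 3
```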
Spatio-temporal features

Space-time features capture distinctive shapes in a video. They provide an independent representation of events that occur over time during different types of motion in a scene. Spatio-temporal features are generally extracted directly from the video, avoiding the possible failures of pre-processing techniques such as tracking and motion segmentation. The motion and space features are combined over a finite time interval to form the spatio-temporal features. These features are commonly used in human action recognition systems, as shown in figure 1.8:
Figure 1.8: Use of spatio-temporal features
Discriminative features

Discriminative features are used mostly in face recognition, as shown in figure 1.9:
Figure 1.9: Facial image1
When feature learning is applied to these images, they give output as discriminative features. Figure 1.10 shows the discriminative features of a dataset consisting of the parameters such as height and time:
Figure 1.10: Discriminative features
1. https://blog.csdn.net/u014696921/article/details/70161920
Artificial intelligence

In computer science, artificial intelligence is also called machine intelligence. It is intelligence demonstrated by machines, in the way humans display natural intelligence. AI includes the processes of reasoning, self-correction, and learning. It can be classified as either strong or weak. Strong AI, also termed artificial general intelligence, is an AI system with comprehensive human cognitive capabilities; such a system can find a solution without human involvement, even when given an unfamiliar task. Weak AI, also known as narrow AI, is an AI system that is trained and intended for a specific task; for example, Siri (Apple's virtual personal assistant) is considered a weak AI. Figure 1.11 represents an image of AI:
Figure 1.11: Artificial intelligence 2
Machine learning and its goals
Machine learning can be defined as the scientific study of statistical models and algorithms that computer systems use to perform a specific task without explicit instructions, relying on patterns and inference instead. It can be seen as a subset of artificial intelligence (AI). Good generalization ability is the main objective of machine learning algorithms. The following are the goals of machine learning algorithms:

• Regression
• Binary and multi-class classification

2. https://singularityhub.com/2018/06/20/why-we-need-to-fine-tune-our-definition-of-artificial-intelligence/
• Clustering
• Recommendation systems
• Supervised and unsupervised anomaly detection

These are explained as follows:

Regression
Regression means that a model assigns a continuous value (response) to a data observation, rather than a discrete class. We will discuss regression in detail in the upcoming section.

Classification
Problems related to classification involve assigning a data point to a pre-defined class. In some cases, the classification problem assigns a class to an observation, whereas in other cases, the main objective is to predict the probability that the observation belongs to each of the given classes. We will discuss classification in detail in the Classification section. Figure 1.12 represents classification based on different scenes:
Figure 1.12: Image classification based on different scenes 3
Clustering
Clustering is defined as an unsupervised technique for determining the grouping and organization of a given dataset. It is a method of collecting information into clusters or groups. Each cluster is characterized by the set of data points it contains and by the centroid of the cluster, which is the average of all its data points. We will discuss this in detail in the Clustering section.
3. https://pdfs.semanticscholar.org/4d63/7c12ec9864174726ac282df3747f714203d6.pdf
Figure 1.13 shows the image clustering example based on scenes without any labels:
Figure 1.13: Image clustering example based on scenes.
Figure 1.12 contains labels, and figure 1.13 does not; that is exactly the difference between supervised learning (classification) and unsupervised learning (clustering). In the case of supervised learning, we have prior knowledge of the labels into which a new object is to be classified, but in the case of unsupervised learning, we do not have any prior knowledge of labels. We have to assign labels ourselves on the basis of inter-class variability, intra-class variability, and inter-class similarity. Consider figure 1.13 and try to cluster the scenes into the seven wonders of the world.
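As a small illustration of clustering without labels, the following sketch runs k-means on synthetic two-dimensional points; the data and the cluster count are assumptions made for the example:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic, unlabeled groups of 2-D points (hypothetical data).
points = np.vstack([rng.normal(0, 0.5, (20, 2)),
                    rng.normal(5, 0.5, (20, 2))])

# k-means groups the points into k clusters; each cluster is summarized
# by its centroid, the average of the points assigned to it.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.cluster_centers_)  # one centroid per cluster
print(kmeans.labels_[:5])       # cluster index assigned to each point
```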
Recommendation systems
Another type of problem that can be addressed is a recommendation system, also called a recommendation engine. These systems are a kind of information-filtering system, and they make recommendations for different types of items, such as books, products, articles, restaurants, music, movies, and so on. Two common methods in recommendation systems are collaborative filtering and content-based filtering. The use of recommendation systems can easily be seen on Amazon and Netflix. Netflix makes recommendations to keep its viewers engaged and to ensure they always have plenty of content to watch; it provides recommendations such as Top Picks for Alex, Because you watched …, and Suggestions for you. Amazon does the same thing to increase its sales by up-selling and maintaining sales through user engagement; it provides recommendations such as Related to items you viewed, Customers who bought this item also bought, and More items to consider.

Anomaly detection
Another type of problem is anomaly detection. We would like to assume that data coming from a particular source is sensible and well-behaved, but in practice this is often not the case. Sometimes irrelevant data exists because of errors or faults in measurement, or because of fraud. Sometimes anomalous measurements reveal a deteriorating piece of electronics or hardware, and anomalies can denote real problems that cannot otherwise be easily described. For example, consider a defect in manufacturing: detecting anomalies offers a degree of quality control, as well as insight into whether the steps taken to minimize the defects have actually worked. In all these cases, it can be helpful to determine anomalous values, and specific machine learning algorithms can be carefully applied for this purpose.
Machine learning techniques
The machine learning techniques are divided into two broad learning techniques, namely, deep learning and shallow learning, and further classification of the two techniques of machine learning is shown in figure 1.14:
Figure 1.14: Types of machine learning techniques
These are all the various types of machine learning techniques, which are discussed as follows.
Deep learning techniques

Deep learning is a subfield of machine learning that uses neural networks which can automatically learn the features needed for classification. It is particularly beneficial when we have a multi-class classification problem. Deep learning is also referred to as deep neural networks or deep neural learning. Figure 1.15 shows hidden layers of neurons forming a neural network:
Figure 1.15: Neural network
Now, let us discuss different types of machine-learning techniques.
Discriminative learning techniques

Working with discriminative models, if one wants to predict the label Ci for a given object d, then according to Bayes' theorem:
P(A|B) = P(B|A) P(A) / P(B)    (1.1)
where P(A) is the probability of occurrence of event A, P(B) is the probability of occurrence of event B, P(B|A) is the probability of event B given that event A has occurred, and P(A|B) is the probability of event A given that event B has occurred.

Example 2: We want to find out a patient's probability of having cancer if they are alcoholic.

• A denotes the event Patient has cancer. Previous data tells us that 20% of the patients coming to the clinic have cancer, so P(A) = 0.2.
• B denotes the event Patient is an alcoholic; 8% of the clinic's patients are alcoholics, so P(B) = 0.08.
• Among the patients diagnosed with cancer, 5% are alcoholics. This is P(B|A): the probability that a patient is alcoholic, given that they have cancer, is 0.05.

Applying Bayes' theorem, we get P(A|B) = (0.05 × 0.2)/0.08 = 0.125. Therefore, there is a 12.5% probability that a patient has cancer if they are alcoholic.
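The same calculation in Python, using the numbers of Example 2:

```python
# Bayes' theorem applied to Example 2.
p_cancer = 0.2             # P(A): prior probability that a patient has cancer
p_alcoholic = 0.08         # P(B): probability that a patient is alcoholic
p_alc_given_cancer = 0.05  # P(B|A)

# P(A|B) = P(B|A) * P(A) / P(B)
p_cancer_given_alc = p_alc_given_cancer * p_cancer / p_alcoholic
print(p_cancer_given_alc)  # 0.125, i.e., a 12.5% probability
```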
Let Ci represent the label/class, and let d represent the object. By applying Bayes' theorem, we can determine whether an object d belongs to a particular class Ci:

P(Ci|d) = P(d|Ci) P(Ci) / P(d)    (1.2)
where
• P(Ci) = Probability of the hypothesis Ci. • P(d) = Probability of the object d
• P(d|Ci) = Probability of occurrence of d given Ci.
• P(Ci|d) = Probability of occurrence of Ci given d.
In general, we need the most probable label given the training dataset. In such a case, the maximum a posteriori (MAP) label is written as follows:

C_MAP = argmax over Ci of P(Ci|d) = argmax over Ci of P(d|Ci) P(Ci)    (1.3)

(The evidence P(d) is dropped because it is the same for every class.)
We can divide the discriminative learning techniques into two parts as follows: 1. Linear regression
2. Logistic regression

These two types of regression will be discussed in the next sections.
Types of Regression
The problems based on supervised learning are divided into two parts: (i) regression and (ii) classification. Both can be used to build a concise model that uses the attribute variables to find the values of the dependent attribute. A problem is said to be a regression problem when the output (dependent) variable is a continuous or real value, for example, weight, salary, and so on. There are various models of regression. Figure 1.16 shows the types of regression models:
Figure 1.16: Types of regression models
For a proper understanding of regression, let us take the following examples. Example 3: There are four sentences given here. We have to identify which one is a regression task: • Predicting the age of your friend
• Predicting the weight of your friend
• Prediction of the nationality of people
• Predicting if the stock market will increase tomorrow

Any guesses? Let us discuss these four sentences now. Predicting the age and weight of a person are regression tasks, as these are real values, while predicting nationality is a categorical task, and the change in the stock market is a discrete value, say yes or no.

Example 4: There are four sentences given here. We have to identify which one is a regression task and which one is a classification task:

• Predicting the height of a person
• Predicting the gender of the person based on his/her handwriting • Predicting the BMI of a person
• Prediction of the nationality of the person
Predicting height and BMI are regression tasks, and predicting the gender and nationality of a person are classification tasks.
Linear regression

Linear regression is a type of regression that fits the data with the most suitable hyper-plane. It can be defined as a linear approach for modeling the relationship between independent variables and dependent variables. Figure 1.17 shows the pattern obtained using linear regression:
Figure 1.17: Linear regression
Refer to Section 9 for understanding hyperplanes in detail.

Function approximation

Statistical and geometric methods for machine learning are associated with methods of function approximation in mathematics. As an illustration, consider a classification-based learning task: we want to determine possible ways of grouping the objects in the universe. In figure 1.18, we show a universe of objects, each of which belongs to one of two classes.
Figure 1.18: Linear discrimination between two classes
Using the process of function approximation, we can describe a surface that divides the objects into different regions. Lines and linear regression methods are among the simplest function approximation methods.
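A minimal sketch of this idea: fitting the simplest function approximator, a least-squares line, to noisy synthetic data (the underlying relationship y = 2x + 1 is assumed for illustration):

```python
import numpy as np

# Noisy observations of a hypothetical linear relationship y = 2x + 1.
x = np.arange(10, dtype=float)
rng = np.random.default_rng(1)
y = 2.0 * x + 1.0 + rng.normal(0, 0.3, size=x.shape)

# np.polyfit finds the least-squares line (degree-1 polynomial),
# i.e., the simplest function approximation described above.
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)          # close to 2 and 1
print(slope * 4.0 + intercept)   # prediction for a new input x = 4
```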
Logistic regression

Logistic regression is a generalized linear regression model. It is a method used in machine learning and statistics, defined as a type of machine learning algorithm that can be used for classification problems. It is an analytical method built on the idea of probability, and it is the regression analysis that works best when the dependent variable is binary. Like other regression analysis methods, logistic regression is based on predictive analysis: it is used to describe the data and to find the association between a binary dependent variable and one or more nominal, interval, or ordinal independent variables. It is basically a classification method that assigns observed values to classes from a discrete set, for example, whether a received e-mail is spam or not, whether an online transaction is fraudulent or not, whether a tumor is benign or malignant, and so on. Logistic regression transforms its output using the sigmoid function, so that it can be thresholded to return "0" or "1", that is, binary classification. For a better understanding of logistic regression, let us take an example.

Example 5: We have credit card default data from XYZ bank, and we want to know whether the present credit card balance of a customer indicates whether the customer will default on the credit card in the near future. To classify the customers as low- or high-risk defaulters, we can define a linear regression model based on their present balance. But if you observe figure 1.19(a), for a balance less than 5,000, the linear model predicts negative values of the probability, and for a balance greater than 20,000, the model predicts probability values greater than 1. Here, predictions based on a linear model are not sensible, as a probability value must lie between 0 and 1. Observe in figure 1.19(b) that the logistic regression line is a sigmoid curve, and the value of the probability lies between 0 and 1. So, for this example, the logistic regression model works better than the linear regression model. Figure 1.19 (a)–(b) shows the difference between the patterns observed in linear and logistic regression:
Figure 1.19: (a) Linear regression (b) logistic regression
The logistic regression hypothesis aims to limit the output of the logistic function to values between 0 and 1. Linear functions fail to satisfy this requirement because their value can exceed 1 or fall below 0.

0 ≤ hθ(x) ≤ 1    (1.4)
The preceding equation represents the expected range of the logistic regression hypothesis.
Sigmoid function

We use the sigmoid function to map the predicted values to probabilities. The function can transform or map any real value into another value which lies between 0 and 1. In machine learning, we use the sigmoid function to map the predictions we get into probabilities. Figure 1.20 shows the curve of the sigmoid function:
Figure 1.20: Sigmoid function
The sigmoid function can be represented as follows:

σ(Z) = 1 / (1 + e^(−Z))    (1.5)

where Z is the real value at which the sigmoid function is evaluated.

Hypothesis representation of the logistic regression

The model for the logistic regression can be defined as follows:

Z = β0 + β1X    (1.6)

where X is the independent variable, Z is the dependent variable, and β0 and β1 are the two parameters. The hypothesis representation of logistic regression is the sigmoid function of Z:

hθ(X) = σ(Z)    (1.7)
That is,

hθ(X) = 1 / (1 + e^(−(β0 + β1X)))    (1.8)

The preceding equation represents the hypothesis of logistic regression.
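Equations (1.5) to (1.8) translate directly into a few lines of Python; the parameter values β0 and β1 below are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    """Map any real value into the (0, 1) interval (equation 1.5)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameter values beta0 and beta1, for illustration only.
beta0, beta1 = -3.0, 0.5

def hypothesis(x):
    """Logistic regression hypothesis of equation (1.8)."""
    return sigmoid(beta0 + beta1 * x)

print(hypothesis(2.0))   # a probability between 0 and 1
print(hypothesis(10.0))  # larger input -> probability closer to 1
```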
Decision boundary

As we expect the classifier to output a class based on probability, whenever we provide inputs to the classifier through the prediction function, it must return a probability value between 0 and 1. To map this probability to a discrete class (a = Domestic animals / b = Wild animals, true/false), we select a threshold, say, 0.5. Above 0.5, we classify values into class "a = Domestic animals", and below it, we classify values into class "b = Wild animals". For example, in figure 1.21, if the value of y = 0.7, then we classify it as a positive value, and if the value of y = 0.2, then we classify it as a negative value:
Figure 1.21: Decision boundary
There are two types of logistic regression: • Binary logistic regression
• Multinomial logistic regression
Binary logistic regression

Binary or binomial logistic regression handles situations in which the observed outcome of the dependent variable can take only two values, 1 or 0, for example, alive versus dead or loss versus win. In binary logistic regression, the output is coded as 0 or 1, so the regression has a straightforward interpretation. This type of regression can be used to predict the odds for a case, depending upon the observed values of the predictors, that is, the independent variables.

Multinomial logistic regression

Multinomial logistic regression handles cases in which the output can be of more than two possible types and is not in an ordered sequence. It compares multiple groups by using a combination of multiple binary logistic regressions. For example, suppose we want to study the difference in behavior among Bachelor's, Master's, and PhD students. Multinomial logistic regression will compare the Bachelor's students to the Master's and PhD students, and similarly compare the Master's students with the Bachelor's and PhD students. For each independent class, there will be at least two comparisons. Some more examples are as follows:
• Nature of person (rude, introvert, and humble) • Color of flower (green, yellow, and red)
Cost function

In linear regression, the cost function denotes the objective function that is optimized; that is, in the case of linear regression, we first define a cost function and then minimize it to develop a model with a minimum error rate.
Figure 1.22 represents the graphs corresponding to cost function estimation using logistic regression:
Figure 1.22: Cost function of logistic regression
In the case of logistic regression, the cost function is defined as the negative log-likelihood, or binary cross-entropy, function:

Cost(hθ(x), y) = −log(hθ(x)) if y = 1
Cost(hθ(x), y) = −log(1 − hθ(x)) if y = 0    (1.9)

As observed from figure 1.22, the value of the cost function at y = 1 is determined by the value of −log(hθ(x)). So, for this scenario, the cost function is written as follows:

Cost(hθ(x), 1) = −log(hθ(x))    (1.10)

For example, if the model outputs hθ(x) = 0.8 for a sample with y = 1, then the cost is −log10(0.8) ≈ 0.097. Similarly, at y = 0, the cost function is written as follows:

Cost(hθ(x), 0) = −log(1 − hθ(x))    (1.11)

The preceding two cost functions are re-written into a single cost function as follows:

J(θ) = −(1/m) Σ i=1..m [ y(i) log(hθ(x(i))) + (1 − y(i)) log(1 − hθ(x(i))) ]    (1.12)
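A quick sketch of equation (1.12) in Python, using the natural logarithm as is conventional (the labels and predictions below are made up):

```python
import numpy as np

def bce_cost(y_true, y_pred):
    """Binary cross-entropy of equation (1.12), using the natural log."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    m = len(y_true)
    return -(1.0 / m) * np.sum(y_true * np.log(y_pred)
                               + (1.0 - y_true) * np.log(1.0 - y_pred))

# Confident, correct predictions give a small cost...
print(bce_cost([1, 0], [0.9, 0.1]))
# ...while confident, wrong predictions are penalized heavily.
print(bce_cost([1, 0], [0.1, 0.9]))
```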
Non-linear regression

It is a form of regression analysis in which data is modeled using a function that is a non-linear combination of the model parameters. The function depends on one or more independent variables. In this type of regression, the data is fitted by successive approximation methods. A statistical model of non-linear regression is as follows:

b ~ f(a, β)    (1.13)

where a is a vector of independent variables, b denotes the set of observed dependent variables, and β is the set of parameters in which the function f is non-linear. Let us take the example of the Michaelis–Menten model for enzyme kinetics:
f(a, β) = (β1 a) / (β2 + a)    (1.14)

This model has one independent variable, a, and two parameters, β1 and β2. The function is non-linear because it cannot be represented by a linear combination of the two β parameters.
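As a sketch of fitting equation (1.14) by successive approximation, SciPy's curve_fit performs non-linear least squares; the synthetic data and the true parameter values below are assumptions made for the example:

```python
import numpy as np
from scipy.optimize import curve_fit

def michaelis_menten(a, beta1, beta2):
    """Equation (1.14): non-linear in the parameters beta1 and beta2."""
    return beta1 * a / (beta2 + a)

# Synthetic observations generated with beta1 = 3.0 and beta2 = 2.0.
a = np.linspace(0.1, 10.0, 30)
rng = np.random.default_rng(2)
b = michaelis_menten(a, 3.0, 2.0) + rng.normal(0, 0.05, size=a.shape)

# curve_fit refines the parameters by successive approximation
# (non-linear least squares), as described above.
params, _ = curve_fit(michaelis_menten, a, b, p0=[1.0, 1.0])
print(params)  # close to [3.0, 2.0]
```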
Linearly and non-linearly separable data

For a better understanding of linearly separable data, let us consider two classes, represented by blue and red colors. A dataset is linearly separable if there is a line that can distinguish the blue points from the red points.
Figure 1.23 represents the data which is linearly separable for a dataset:
Figure 1.23: Linearly separable data
Figure 1.24 represents the data that is not linearly separable as follows:
Figure 1.24: Linearly non-separable data
Mathematical definition of a single neuron

Logistic regression is itself a neuron, as shown in figure 1.25, and multiple neurons can be built from multiple logistic regressions. We can also define linearly separable data by an algebraic method, considering the separator as a single linear function: if the input is (x1, x2, x3), then the separator is a function as follows:
Figure 1.25: Multi-class objects build a neural network
z = w1x1 + w2x2 + w3x3 + b    (1.15)

y = σ(z) = 1 / (1 + e^(−z))    (1.16)

class = 1 if y ≥ 0.5; otherwise class = 0    (1.17)

where w1, w2, and w3 are the weights, and b is the associated bias value.
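Read as code, equations (1.15) to (1.17), as reconstructed above, become a few lines of Python; the weights and bias below are arbitrary illustration values:

```python
import numpy as np

def neuron(x, w, b):
    """A single neuron: weighted sum plus bias, then a sigmoid activation,
    which is exactly the logistic regression of equations (1.15)-(1.17)."""
    z = np.dot(w, x) + b          # equation (1.15)
    y = 1.0 / (1.0 + np.exp(-z))  # equation (1.16)
    return 1 if y >= 0.5 else 0   # equation (1.17): thresholded class

# Hypothetical weights and bias, purely for illustration.
w = np.array([0.4, -0.2, 0.7])
b = -0.1
print(neuron(np.array([1.0, 2.0, 3.0]), w, b))
```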
Generative learning

The generative learning technique works by building separate models of the positive and negative classes; we can think of a generative model as a blueprint for a class. To create these types of models, the generative learning method uses the joint probability distribution. Classification is one task addressed with generative learning, and it is discussed as follows.

Classification

Classification can be defined as a supervised learning approach in which a system learns from the labeled input data provided to it and then uses it to classify new objects. The classification can be either binary or multi-class.
Some real-world classification problems are biometric identification, handwriting recognition, speech recognition, document classification, and so on. Figure 1.26 shows static pattern classification using the feature vector X. It consists of two phases, training and testing. In the training phase, a classifier is trained by extracting multiple features from the images using feature extraction methods. Then, in the testing phase, the classifier assigns a class label, such as Tiger, to the input image:
Figure 1.26: Static pattern classification using feature vector “X”
Different types of classification algorithms in the literature are as follows: • Nearest neighbor • Decision trees
• Linear classifiers • Random forest
• Support vector machines • Neural networks • Boosted trees
The supervised tasks are classified as follows: • Binary classification
• Multi-class classification
Figure 1.27 shows the difference between the pattern observed for binary and multiclass classification:
Figure 1.27: Binary and multi-class classification
Table 1.5 shows the comparison between binary and multi-class classification:

S.No | Parameter | Binary classification | Multi-class classification
1 | Learning technique | Supervised | Supervised
2 | Characteristic of the target variable | Only two classes or target categories exist, so the target variable can take up to two categorical values. | Multiple classes or target categories exist, so the target variable can take one of many categorical values.
3 | Nature of output | Discrete variable | Discrete
4 | Example | Medical diagnosis of a single medical condition (disease versus no disease). | In handwritten alphabet recognition, the number of classes is 26.
5 | Encoding | Integer Encoding, One-Hot Encoding | One-Hot Encoding

Table 1.5: Binary versus multi-class classification
Binary classification

Problems related to binary classification involve assigning an object to one of two classes by identifying the values of the associated features. For example, consider a case where the system has to identify a medical condition of a person, such as whether
symptoms of the disease are present or not, to diagnose the disease. Hence, there are only two cases, the presence or absence of the disease; so, basically, there are only two classes in binary classification.

Multi-class classification

In many real-time applications of pattern recognition, there can be more than two classes, such as c1, c2, …, cn. Such a classification is called multi-class classification. For example, in the MNIST dataset, the classes are the digits 0 to 9. Multi-class classification assumes that every sample is assigned exactly one label. Figure 1.27 depicts the difference in the patterns observed between binary and multi-class classification. Now, let us do an exercise. There are three problems given here. You have to identify which of them are classification problems.

• Prediction of the cost of a house based on the area
• Prediction of the increase in the stock market next year • Predicting tomorrow’s weather in Delhi. Any guesses?
The prediction of weather comes under classification problems. Figure 1.28 depicts the multi-class image classification for different animals:
Figure 1.28: Multi-class image classification
Decision tree

A decision tree is a classification method that presents its results as a flow-chart-like tree structure, where each internal node represents a test on an attribute value, each branch represents an outcome of the test, and the leaf nodes represent the classes. Let us take an example where we have to classify a cat, a sparrow, and a bat based on their attributes. Figure 1.29 shows the decision tree for our example:
Figure 1.29: Decision tree
ID3 algorithm

The main algorithm used to build the decision tree is ID3. It follows a greedy, top-down search through the possible branches, and it uses entropy and information gain to build the decision tree.

Entropy

Entropy is defined as the measure of disorder, uncertainty, or impurity of a dataset. It controls the decision of splitting the data. The ID3 algorithm uses entropy to determine the homogeneity of a sample. Mathematically, entropy can be represented as follows:
E(S) = −Σj pj log2(pj)    (1.18)

where pj is the probability of class j in the dataset.
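A minimal sketch of equation (1.18) in Python, evaluated on a few made-up label lists:

```python
import numpy as np

def entropy(labels):
    """Entropy of a list of class labels (equation 1.18), in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy(["Loss", "Loss", "Loss"]))    # 0.0 -- a pure set
print(entropy(["Loss", "Profit"]))          # 1.0 -- maximum disorder
print(entropy(["Loss", "Loss", "Profit"]))  # about 0.918
```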
Information gain

Information gain is the measure of the information a feature provides about the class. It is used to construct the decision tree. The decision tree algorithm maximizes
information gain. It is based on the reduction in entropy after splitting the dataset on a feature. The information gained about Y from X is given as follows:

IG(Y, X) = E(Y) − E(Y|X)    (1.19)

where E(Y|X) is the entropy of Y given X.

Example: Table 1.6 represents a training dataset consisting of various product types:
YOP (Year of Production) | Market competition | Type | Profit/Loss
YOP < 2010 | Yes | Type-A | Loss
YOP < 2010 | No | Type-A | Loss
YOP < 2010 | No | Type-B | Loss

Table 1.6: Training dataset of product types

From the table, for YOP > 2015 there is a Profit, but there are two Loss and two Profit entries for 2010 ≤ YOP ≤ 2015.
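Since the remaining rows of Table 1.6 are not reproduced here, the following sketch of equation (1.19) uses a hypothetical completion of the Profit/Loss data, purely for illustration:

```python
import numpy as np

def entropy(labels):
    """Entropy of a list of class labels (equation 1.18), in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, x):
    """IG(Y, X) = E(Y) - E(Y|X), as in equation (1.19)."""
    y, x = np.asarray(y), np.asarray(x)
    remainder = sum((np.sum(x == v) / len(x)) * entropy(y[x == v])
                    for v in np.unique(x))
    return entropy(y) - remainder

# Hypothetical completion of the Profit/Loss data, for illustration only.
yop = ["<2010", "<2010", "<2010", "2010-2015", "2010-2015", ">2015"]
outcome = ["Loss", "Loss", "Loss", "Profit", "Loss", "Profit"]
print(information_gain(outcome, yop))  # how much YOP tells us about the outcome
```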