Deep Learning: Handbook of Statistics [48] 9780443184307

Deep Learning, Volume 48 in the Handbook of Statistics series, highlights new advances in the field, with this new volum

English Pages 270 Year 2023

Table of contents :
Front Cover
Deep Learning
Copyright
Contents
Contributors
Preface
Chapter 1: Exact deep learning machines
1. Introduction
2. EDLM constructions
3. Conclusions
References
Chapter 2: Multiscale representation learning for biomedical analysis
1. Introduction
2. Representation learning: Background
3. Multiscale embedding motivation
4. Theoretical framework
4.1. Local context embedding
4.2. Wide context embedding
4.3. Multiscale embedding
4.4. Postprocessing and inference for word similarity task
4.5. Evaluation scheme
5. Experiments, results, and discussion
5.1. Datasets
5.1.1. Training stage: Datasets and preprocessing
5.1.2. Testing stage: Datasets
5.2. Wide context embedding (context2vec)
5.3. Quantitative evaluation
5.3.1. Term similarity task
5.3.2. Downstream application task
5.3.3. Drug rediscovery test
5.4. Qualitative analysis
5.5. Error analysis
6. Conclusion and future work
References
Chapter 3: Adversarial attacks and robust defenses in deep learning
1. Introduction
2. Adversarial attacks
2.1. Fast gradient sign method
2.2. Projected gradient descent
2.3. DeepFool
2.4. Carlini and Wagner attack
2.5. Adversarial patch
2.6. Elastic
2.7. Fog
2.8. Snow
2.9. Gabor
2.10. JPEG
3. On-manifold robustness
3.1. Defense-GAN
3.2. Dual manifold adversarial training (DMAT)
3.2.1. On-manifold ImageNet
3.2.2. On-manifold AT cannot defend standard attacks and vice versa
3.2.3. Proposed method: Dual manifold adversarial training
3.2.4. DMAT improves generalization and robustness
3.2.5. DMAT improves robustness to unseen attacks
3.2.6. TRADES for DMAT
4. Knowledge distillation-based defenses
5. Defenses for object detector
6. Reverse engineering of deceptions via residual learning
6.1. Adversarial perturbation estimation
6.1.1. Image reconstruction
6.1.2. Feature reconstruction
6.1.3. Image classification
6.1.4. Residual recognition
6.1.5. End-to-end training
6.2. Experimental evaluation
Acknowledgments
References
Chapter 4: Deep metric learning for computer vision: A brief overview
1. Introduction
2. Background
3. Pair-based formulation
3.1. Contrastive loss
3.2. Triplet loss
3.3. N-pair loss
3.4. Multi-Similarity loss
4. Proxy-based methods
4.1. Proxy-NCA and Proxy-NCA++
4.2. Proxy Anchor loss
4.3. ProxyGML Loss
5. Regularizations
5.1. Language guidance
5.2. Direction regularization
6. Conclusion
References
Chapter 5: Source distribution weighted multisource domain adaptation without access to source data
1. Introduction
1.1. Main contributions
2. Related works
2.1. Unsupervised domain adaptation
2.2. Hypothesis transfer learning
2.3. Multisource domain adaptation
2.4. Source-free multisource UDA
3. Problem setting
4. Practical motivation
5. Overall framework of DECISION (Ahmed et al., 2021)—A review
5.1. Weighted information maximization
5.2. Weighted pseudo-labeling
5.3. Optimization
6. Theoretical insights
6.1. Theoretical motivation behind DECISION
7. Source distribution dependent weights (DECISION-mlp)
8. Proof of Lemma 1
9. Experiments
9.1. Experiments on DECISION
9.1.1. Datasets
9.1.2. Baseline methods
9.2. Implementation details
9.2.1. Network architecture
9.2.2. Source model training
9.2.3. Hyper-parameters
9.3. Object recognition
9.3.1. Office
9.3.2. Office–Home
9.4. Ablation study
9.4.1. Contribution of each loss
9.4.2. Analysis on the learned weights
9.4.3. Distillation into a single model
9.5. Results and analyses of DECISION-mlp
10. Conclusions and future work
References
Chapter 6: Deep learning methods for scientific and industrial research
1. Introduction
2. Data and methods
2.1. Different types of data for deep learning
2.1.1. Numerical data
2.1.2. COVID-19 data
2.1.3. Meteorological data
2.1.3.1. Gridded meteorological data
2.1.3.2. Station-level meteorological data
2.1.3.3. Crop production data
2.1.4. Image data
2.2. Methodology
2.2.1. Transfer learning
2.2.2. Federated learning
2.2.3. Long short-term memory (LSTM)
2.2.3.1. Time division LSTM
2.2.3.2. Multivariate LSTM model for COVID-19 prediction
2.2.4. SNN and CNN
3. Applications of DL techniques for multi-disciplinary studies
3.1. Applications of DL models in tumor diagnosis
3.1.1. Performance evaluations of all models trained by numerical data sets for tumor diagnosis
3.1.2. Performance evaluations of all models trained by image data sets for tumor diagnosis
3.2. Application of DL model for classifying molecular subtypes of glioma tissues
3.3. Application of the deep learning model for the prognosis of glioma patients
3.4. Applications of DL model for predicting driver gene mutations in glioma
3.5. Application of Time Division LSTM for short-term prediction of wind speed
3.5.1. Performance evaluations of Time Division LSTM for short term wind speed prediction
3.6. Application of LSTM for the estimation of crop production
3.6.1. Automated model for selection of optimal input data set for designing crop prediction model
3.6.2. Design and performance evolution of crop prediction model
3.7. Classification of tea leaves
3.8. Weather integrated deep learning techniques to predict the COVID-19 cases over states in India
4. Discussion and future prospects
Acknowledgments
References
Chapter 7: On bias and fairness in deep learning-based facial analysis
1. Introduction
2. Tasks in facial analysis
2.1. Face detection and recognition
2.1.1. Face detection
2.1.2. Face verification and identification
2.2. Attribute prediction
3. Facial analysis databases for bias study
4. Evaluation metrics
4.1. Classification parity-based metrics
4.1.1. Statistical parity
4.1.2. Disparate impact (DI)
4.1.3. Equalized odds and equality of opportunity
4.1.4. Predictive parity
4.2. Score-based metrics
4.2.1. Calibration
4.2.2. Balance for positive/negative class
4.3. Facial analysis-specific metrics
4.3.1. Fairness discrepancy rate (FDR)
4.3.2. Inequity rate (IR)
4.3.3. Degree of bias
4.3.4. Precise subgroup equivalence (PSE)
5. Fairness estimation and analysis
5.1. Fairness in face detection and recognition
5.1.1. Discovery
5.1.2. Disparate impact
5.1.3. Incorporation of demographic information during model training
5.1.4. Dataset distribution during model training
5.1.5. Role of latent factors during model training
5.2. Fairness in attribute prediction
5.2.1. Discovery
5.2.2. Disparate impact
5.2.3. Counterfactual analysis
5.2.4. Role of latent factors during model training
6. Fair algorithms and bias mitigation
6.1. Face detection and recognition
6.1.1. Adversarial learning approaches
6.1.2. Pre-trained and black box approaches
6.1.3. Generative approaches
6.1.4. Bias-aware deep learning approaches
6.2. Attribute prediction
6.2.1. Adversarial approaches
6.2.2. Pre-trained and black-box approaches
6.2.3. Generative approaches
6.2.4. Bias-aware deep learning approaches
7. Meta-analysis of algorithms
8. Topography of commercial systems and patents
9. Open challenges
9.1. Fairness in presence of occlusion
9.2. Fairness across intersectional subgroups
9.3. Trade-off between fairness and model performance
9.4. Lack of benchmark databases
9.5. Variation in evaluation protocols
9.6. Unavailability of complete information
9.7. Identification of bias in models
9.8. Quantification of fairness in datasets
10. Discussion
Acknowledgment
References
Chapter 8: Manipulating faces for identity theft via morphing and deepfake: Digital privacy
1. Introduction
2. Identity manipulation techniques
3. Identity manipulation datasets
4. Identity attack detection algorithms
5. Open challenges
6. Conclusion
References
Index
Back Cover
Handbook of Statistics
Volume 48

Deep Learning

Series Editors
C.R. Rao, AIMSCS, University of Hyderabad Campus, Hyderabad, India
Arni S.R. Srinivasa Rao, Medical College of Georgia, Augusta University, United States

Edited by
Venu Govindaraju, University at Buffalo, Buffalo, NY, United States
Arni S.R. Srinivasa Rao, Medical College of Georgia, Augusta, Georgia, United States
C.R. Rao, AIMSCS, University of Hyderabad Campus, Hyderabad, India

Academic Press is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
525 B Street, Suite 1650, San Diego, CA 92101, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom
125 London Wall, London, EC2Y 5AS, United Kingdom

Copyright © 2023 Elsevier B.V. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility. To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

ISBN: 978-0-443-18430-7
ISSN: 0169-7161

For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Zoe Kruze
Acquisitions Editor: Mariana Kuhl
Developmental Editor: Naiza Ermin Mendoza
Production Project Manager: Abdulla Sait
Cover Designer: Vicky Pearson Esser
Typeset by STRAIVE, India

Contents

Contributors
Preface

Section I Foundations

Chapter 1: Exact deep learning machines
Arni S.R. Srinivasa Rao
1. Introduction
2. EDLM constructions
3. Conclusions
References

Chapter 2: Multiscale representation learning for biomedical analysis
Abhishek Singh, Utkarsh Porwal, Anurag Bhardwaj, and Wei Jin
1. Introduction
2. Representation learning: Background
3. Multiscale embedding motivation
4. Theoretical framework
4.1 Local context embedding
4.2 Wide context embedding
4.3 Multiscale embedding
4.4 Postprocessing and inference for word similarity task
4.5 Evaluation scheme
5. Experiments, results, and discussion
5.1 Datasets
5.2 Wide context embedding (context2vec)
5.3 Quantitative evaluation
5.4 Qualitative analysis
5.5 Error analysis
6. Conclusion and future work
References

Chapter 3: Adversarial attacks and robust defenses in deep learning
Chun Pong Lau, Jiang Liu, Wei-An Lin, Hossein Souri, Pirazh Khorramshahi, and Rama Chellappa
1. Introduction
2. Adversarial attacks
2.1 Fast gradient sign method
2.2 Projected gradient descent
2.3 DeepFool
2.4 Carlini and Wagner attack
2.5 Adversarial patch
2.6 Elastic
2.7 Fog
2.8 Snow
2.9 Gabor
2.10 JPEG
3. On-manifold robustness
3.1 Defense-GAN
3.2 Dual manifold adversarial training (DMAT)
4. Knowledge distillation-based defenses
5. Defenses for object detector
6. Reverse engineering of deceptions via residual learning
6.1 Adversarial perturbation estimation
6.2 Experimental evaluation
Acknowledgments
References

Section II Advanced Methods

Chapter 4: Deep metric learning for computer vision: A brief overview
Deen Dayal Mohan, Bhavin Jawade, Srirangaraj Setlur, and Venu Govindaraju
1. Introduction
2. Background
3. Pair-based formulation
3.1 Contrastive loss
3.2 Triplet loss
3.3 N-pair loss
3.4 Multi-Similarity loss
4. Proxy-based methods
4.1 Proxy-NCA and Proxy-NCA++
4.2 Proxy Anchor loss
4.3 ProxyGML Loss
5. Regularizations
5.1 Language guidance
5.2 Direction regularization
6. Conclusion
References

Chapter 5: Source distribution weighted multisource domain adaptation without access to source data
Sk Miraj Ahmed, Dripta S. Raychaudhuri, Samet Oymak, and Amit K. Roy-Chowdhury
1. Introduction
1.1 Main contributions
2. Related works
2.1 Unsupervised domain adaptation
2.2 Hypothesis transfer learning
2.3 Multisource domain adaptation
2.4 Source-free multisource UDA
3. Problem setting
4. Practical motivation
5. Overall framework of DECISION—A review
5.1 Weighted information maximization
5.2 Weighted pseudo-labeling
5.3 Optimization
6. Theoretical insights
6.1 Theoretical motivation behind DECISION
7. Source distribution dependent weights (DECISION-mlp)
8. Proof of Lemma 1
9. Experiments
9.1 Experiments on DECISION
9.2 Implementation details
9.3 Object recognition
9.4 Ablation study
9.5 Results and analyses of DECISION-mlp
10. Conclusions and future work
References

Section III Transformative Applications

Chapter 6: Deep learning methods for scientific and industrial research
G.K. Patra, Kantha Rao Bhimala, Ashapurna Marndi, Saikat Chowdhury, Jarjish Rahaman, Sutanu Nandi, Ram Rup Sarkar, K.C. Gouda, K.V. Ramesh, Rajesh P. Barnwal, Siddhartha Raj, and Anil Saini
1. Introduction
2. Data and methods
2.1 Different types of data for deep learning
2.2 Methodology
3. Applications of DL techniques for multi-disciplinary studies
3.1 Applications of DL models in tumor diagnosis
3.2 Application of DL model for classifying molecular subtypes of glioma tissues
3.3 Application of the deep learning model for the prognosis of glioma patients
3.4 Applications of DL model for predicting driver gene mutations in glioma
3.5 Application of Time Division LSTM for short-term prediction of wind speed
3.6 Application of LSTM for the estimation of crop production
3.7 Classification of tea leaves
3.8 Weather integrated deep learning techniques to predict the COVID-19 cases over states in India
4. Discussion and future prospects
Acknowledgments
References

Chapter 7: On bias and fairness in deep learning-based facial analysis
Surbhi Mittal, Puspita Majumdar, Mayank Vatsa, and Richa Singh
1. Introduction
2. Tasks in facial analysis
2.1 Face detection and recognition
2.2 Attribute prediction
3. Facial analysis databases for bias study
4. Evaluation metrics
4.1 Classification parity-based metrics
4.2 Score-based metrics
4.3 Facial analysis-specific metrics
5. Fairness estimation and analysis
5.1 Fairness in face detection and recognition
5.2 Fairness in attribute prediction
6. Fair algorithms and bias mitigation
6.1 Face detection and recognition
6.2 Attribute prediction
7. Meta-analysis of algorithms
8. Topography of commercial systems and patents
9. Open challenges
9.1 Fairness in presence of occlusion
9.2 Fairness across intersectional subgroups
9.3 Trade-off between fairness and model performance
9.4 Lack of benchmark databases
9.5 Variation in evaluation protocols
9.6 Unavailability of complete information
9.7 Identification of bias in models
9.8 Quantification of fairness in datasets
10. Discussion
Acknowledgment
References

Chapter 8: Manipulating faces for identity theft via morphing and deepfake: Digital privacy
Akshay Agarwal and Nalini Ratha
1. Introduction
2. Identity manipulation techniques
3. Identity manipulation datasets
4. Identity attack detection algorithms
5. Open challenges
6. Conclusion
References

Index

Contributors

Akshay Agarwal (223), Department of Data Science and Engineering, IISER Bhopal, Bhopal, Madhya Pradesh, India
Sk Miraj Ahmed (81), Department of Electrical and Computer Engineering, University of California Riverside, Riverside, CA, United States
Rajesh P. Barnwal (107), CSIR-Central Mechanical Engineering Research Institute, Durgapur, West Bengal; Academy of Scientific & Innovative Research (AcSIR), Ghaziabad, India
Anurag Bhardwaj (9), Khoury College of Computer Science, Northeastern University, San Jose, CA, United States
Kantha Rao Bhimala (107), CSIR-Fourth Paradigm Institute, Bangalore, Karnataka; Academy of Scientific & Innovative Research (AcSIR), Ghaziabad, India
Rama Chellappa (29), Johns Hopkins University, Baltimore, MD, United States
Saikat Chowdhury (107), CSIR-National Chemical Laboratory, Pune, Maharashtra, India
K.C. Gouda (107), CSIR-Fourth Paradigm Institute, Bangalore, Karnataka; Academy of Scientific & Innovative Research (AcSIR), Ghaziabad, India
Venu Govindaraju (59), University at Buffalo, Buffalo, NY, United States
Bhavin Jawade (59), University at Buffalo, Buffalo, NY, United States
Wei Jin (9), Computer Science and Engineering, University of North Texas, Denton, TX, United States
Pirazh Khorramshahi (29), Johns Hopkins University, Baltimore, MD, United States
Chun Pong Lau (29), Johns Hopkins University, Baltimore, MD, United States
Wei-An Lin (29), Adobe Research, San Jose, CA, United States
Jiang Liu (29), Johns Hopkins University, Baltimore, MD, United States
Puspita Majumdar (169), Department of Computer Science, IIT Jodhpur, Jodhpur; Department of Computer Science, IIIT Delhi, Delhi, India
Ashapurna Marndi (107), CSIR-Fourth Paradigm Institute, Bangalore, Karnataka; Academy of Scientific & Innovative Research (AcSIR), Ghaziabad, India
Surbhi Mittal (169), Department of Computer Science, IIT Jodhpur, Jodhpur, India
Deen Dayal Mohan (59), University at Buffalo, Buffalo, NY, United States
Sutanu Nandi (107), CSIR-National Chemical Laboratory, Pune, Maharashtra, India
Samet Oymak (81), Department of Electrical and Computer Engineering, University of California Riverside, Riverside, CA, United States
G.K. Patra (107), CSIR-Fourth Paradigm Institute, Bangalore, Karnataka; Academy of Scientific & Innovative Research (AcSIR), Ghaziabad, India
Utkarsh Porwal (9), Walmart Inc, Hoboken, NJ, United States
Jarjish Rahaman (107), CSIR-National Chemical Laboratory, Pune, Maharashtra, India
Siddhartha Raj (107), CSIR-Central Mechanical Engineering Research Institute, Durgapur, West Bengal, India
K.V. Ramesh (107), CSIR-Fourth Paradigm Institute, Bangalore, Karnataka; Academy of Scientific & Innovative Research (AcSIR), Ghaziabad, India
Nalini Ratha (223), Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY, United States
Dripta S. Raychaudhuri (81), Department of Electrical and Computer Engineering, University of California Riverside, Riverside, CA, United States
Amit K. Roy-Chowdhury (81), Department of Electrical and Computer Engineering, University of California Riverside, Riverside, CA, United States
Anil Saini (107), CSIR-Central Electronics Engineering Research Institute, Pilani, Rajasthan; Academy of Scientific & Innovative Research (AcSIR), Ghaziabad, India
Ram Rup Sarkar (107), CSIR-National Chemical Laboratory, Pune, Maharashtra; Academy of Scientific & Innovative Research (AcSIR), Ghaziabad, India
Srirangaraj Setlur (59), University at Buffalo, Buffalo, NY, United States
Abhishek Singh (9), Computer Science and Engineering, University of North Texas, Denton, TX, United States
Richa Singh (169), Department of Computer Science, IIT Jodhpur, Jodhpur, India
Hossein Souri (29), Johns Hopkins University, Baltimore, MD, United States
Arni S.R. Srinivasa Rao (1), Laboratory for Theory and Mathematical Modeling, Division of Infectious Diseases, Medical College of Georgia; Department of Mathematics, Augusta University, Augusta, GA, United States
Mayank Vatsa (169), Department of Computer Science, IIT Jodhpur, Jodhpur, India

Preface

Over the last decade, the science of deep learning has seen rapid growth in terms of both theory and applicability. Statisticians, mathematicians, computer scientists, and other engineering and basic scientists have come together to design deep learning theories and experiments. One of the advantages of carefully planned deep learning experiments is the reduction of uncertainty in predictions and an increase in the performance of machines that are designed for a specific task. Through this volume of Handbook of Statistics on “deep learning,” we are aiming to illustrate the more realistic potential of the advancements made in the field of deep learning. We provide a wonderful collection of a wide range of research topics through didactic chapters. This collection of chapters will be useful as starting points for serious graduate students and as summaries for senior researchers.

The content ranges from foundations of deep learning models and probabilities of object detection to deep learning models that involve statistical techniques such as Bayesian, principal component analysis, regression approaches, etc. We have taken great effort to include practically implementable content such as facial recognition analysis and understanding the biases introduced in data, climate models, industrial applications, healthcare settings, and open challenges and ground-level difficulties faced in advanced applications in different settings.

The chapters in this volume are divided into the following three sections:

Section I: Foundations
Section II: Advanced Methods
Section III: Transformative Applications

In Section I, Chapter 1 by A.S.R.S. Rao introduces new machines called exact deep learning machines and the concept of detecting an object by a machine with probability 1. Such machines are argued to be the best possible alternatives to the originally proposed AI models of the mid-1950s. Chapter 2 by A. Singh, U. Porwal, A. Bhardwaj, and W. Jin provides a well-written introduction and review of contextual embedding and related studies. The chapter also describes dimensionality reduction techniques using principal component analysis and preserving the information of the original data. Here, the recent advances made in novel contextual embedding are also detailed. Chapter 3 by C.P. Lau, J. Liu, W.-A. Lin, H. Souri, P. Khorramshahi, and R. Chellappa provides detailed foundations and training on deep learning principles in adversarial attacks and their wide range of applications. The chapter provides an excellent overview of the defense mechanisms to be adapted in adversarial attacks.

Section II comprises two chapters on advanced deep learning methods. Chapter 4 by D.D. Mohan, B. Jawade, S. Setlur, and V. Govindaraju introduces advanced statistical methods and provides extremely clear insights into deep metrics, embedding spaces, and their utility in computer vision research. The chapter details the recent discussions on the design of experiments in computer vision studies and sampling strategies. The current state-of-the-art deep metric learning approaches described in the chapter provide training materials for researchers. Chapter 5 by A.K. Roy-Chowdhury, S.M. Ahmed, D.S. Raychaudhuri, and S. Oymak combines advanced methods of unsupervised domain adaptation in computer vision and image processing. Algorithms that transfer information to target domains and a comparison between single-trained source models and multiple-trained source models are introduced as well. The authors' own algorithms for the effective aggregation of source models for nonuniform types of source data are also presented.

Section III comprises three chapters that describe the practical applications of deep learning methods in various real-world scenarios. Chapter 6 by G.K. Patra, K.R. Bhimala, A. Marndi, S. Chowdhury, J. Rahaman, S. Nandi, R.R. Sarkar, K.C. Gouda, K.V. Ramesh, R.P. Barnwal, S. Raj, and A. Saini provides a detailed account of the successful implementation of various innovative projects conducted at different centers of the Council of Scientific and Industrial Research (CSIR) in India. These include climate prediction modeling, diagnosis of tumors, prediction of renewable energy generation, COVID-19, etc. The methods and designs of these experiments provide a training module for other such experiments. Chapter 7 by S. Mittal, P. Majumdar, M. Vatsa, and R. Singh summarizes the recent advancements in facial recognition analysis, data limitations, and associated statistical biases. Here, limitations in the current approaches and implementable newer methods to overcome the biases introduced due to a lack of enough data are described. The chapter concludes with open challenges and building blocks in facial analysis research worldwide. Chapter 8 by N. Ratha and A. Agarwal discusses the latest advancements and highly implementable approaches in protecting against identity theft via morphing and deepfake of faces. Here, the fundamental techniques in morphing and deepfake and their advantages and disadvantages are outlined. The chapter reviews algorithms to protect against identity thefts, defense algorithms built to detect these manipulations, etc. It concludes with highly interesting open challenges associated with such research studies.

We express our sincere thanks to Mariana Kühl Leme, acquisitions editor (Elsevier and North-Holland), for her overall administrative support throughout the preparation of this volume. Her valuable involvement in the project is highly appreciated. We thank Ms. Naiza Ermin Mendoza, developmental editor (Elsevier), for providing excellent assistance to the editors and for engaging with authors in all kinds of technical queries throughout the preparation. Her services extended until the proofing stage and production. Our thanks also go to Sudharshini Renganathan, project manager of book production, RELX India Private Limited, Chennai, India, for leading the production, responding to several rounds of queries by the authors, being available at all times for printing activities, and providing assistance to the editors. Our sincere thanks and gratitude go to all the authors for writing brilliant chapters in keeping with our requirements for the volume. We very much thank our referees for their timely assessment of the chapters.

We believe that this volume on deep learning has come up at the right time, and we are very much satisfied to have been involved in its production. We are convinced that this collection will be a useful resource for new researchers in statistical science as well as advanced scientists working in statistics, mathematics, computer science, and other disciplines.

Venu Govindaraju
Arni S.R. Srinivasa Rao
C.R. Rao

Chapter 1

Exact deep learning machines

Arni S.R. Srinivasa Rao∗,†
Laboratory for Theory and Mathematical Modeling, Division of Infectious Diseases, Medical College of Georgia, Augusta, GA, United States
Department of Mathematics, Augusta University, Augusta, GA, United States
∗ Corresponding author: e-mail: [email protected]

Abstract
Incorporating actual intelligence into machines and making them think and perform like humans is impossible. In this chapter, a new kind of machine called the EDLM (exact deep learning machine) is introduced. Such EDLMs can achieve the target with probability one and could be the best alternative to the artificial intelligence models originally envisioned in the mid-20th century by Alan Turing and others, which have so far not been realized. In the current context, achieving a target is defined as detecting a given object accurately.

Keywords: Artificial intelligence, Probability one, Object detection
MSC: 68T07, 68T05

† The author proposed the first artificial intelligence (AI)-based model in the world in February 2020 to identify COVID-19 cases using mobile-based Apps, see, for example, Srinivasa Rao and Vazquez (2020), JagWire (2020), mhealthintelligence (2020), Innovators Magazine (2020), and Timesofindia (2020). That work later inspired several such Apps all over the world during the COVID-19 pandemic. Arni Rao is also currently serving as a member of AI-enabled Technologies & Systems Domain Expert Group (DEG), constituted in 2021 by The Council of Scientific & Industrial Research (CSIR), Government of India. The central ideas of EDLM were presented by the author at several invited talks during 2021–2022.

Handbook of Statistics, Vol. 48. https://doi.org/10.1016/bs.host.2022.11.001 Copyright © 2023 Elsevier B.V. All rights reserved.

1 Introduction

It is impossible to train human-made computational machines to think and perform like humans. Training of computational machines involves making the machine read the data and understand logical rule-based instructions. A machine could be designed such that it can construct further rules from a combination of finite rules provided to the machine. Such a design involves the programming of algorithms that can instruct a machine to construct newer rules of performance. Whether or not a machine can come up with a newer rule from a list of rules already provided to it, without any specific instruction to do so, is still questionable. This is because the so-called advanced machines to date have not shown any signs of self-thinking capabilities without being instructed through some form of machine training algorithm. So far, there is no evidence that human-like intelligence can be transferred or implanted into computational machines.

There have been advancements in statistical machine learning (Angelino et al., 2017; Fokoue and Titterington, 2003; Matsui, 2010; Tadaki, 2012; Tollenaar and van der Heijden, 2013), well-defined mathematical structures in machine learning (Blum et al., 1989; Cucker and Smale, 2002; Govindaraju and Rao, 2013; Niyogi et al., 2011; Smale and Zhou, 2007; Van Messem, 2020), advancements in data science aspects (Baladram et al., 2020; Brunton and Kutz, 2022; Chen, 2015; Nazarathy and Klok, 2021; Yeturu, 2020), and improvisations in understanding data structures within machine learning (Calma et al., 2018; Ferraz et al., 2021; Poggio and Smale, 2003; Srinivasa Rao and Diamond, 2020; Srinivasa Rao and Vazquez, 2020, 2022). Training machines with exact deep learning (introduced here) could be the closest possible alternative to artificial intelligence (AI) models. So far, researchers have not succeeded in building the self-thinking machines that were originally thought possible by Alan Turing and others (COE, n.d.). Deep learning machines with information on all possible distinct objects whose data is relevant are critical for successful machine learning. Partial information on complete subjects, or complete information on only some of the distinct subjects, would not be sufficient for accurate machine learning or for machines to attain given targets.

In this chapter, we demonstrate a novel machine called the EDLM (exact deep learning machine), which can achieve the target suggested to the machine with probability one. We provide one case scenario here, and EDLM can be extended to many situations under the assumptions and limitations provided. The central idea of EDLM is explained through a theorem in the next section.
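
As a rough illustration of what detecting a target "with probability one" can look like, consider an object that is fully described by finitely many attributes, each taking one value from a fixed finite list. The Python sketch below enumerates every possible description and checks that exact matching singles out the target and nothing else. The attribute names, values, and helper functions are hypothetical and are not taken from the chapter; this is plain exhaustive exact matching under those assumptions, not the author's EDLM construction.

```python
from itertools import product

# Hypothetical attribute table: each attribute has a finite, fixed list of values.
ATTRIBUTE_VALUES = {
    "shape": ["round", "square", "oval"],
    "color": ["red", "green"],
    "size": ["small", "large"],
}

# Hypothetical target object: exactly one value chosen per attribute.
TARGET = {"shape": "round", "color": "red", "size": "small"}


def is_target(candidate, target=TARGET):
    """Return True only if every attribute value matches the target exactly."""
    return all(candidate[name] == value for name, value in target.items())


def all_descriptions(attribute_values=ATTRIBUTE_VALUES):
    """Enumerate every possible attribute combination (a finite set)."""
    names = list(attribute_values)
    for combo in product(*(attribute_values[name] for name in names)):
        yield dict(zip(names, combo))


descriptions = list(all_descriptions())
matches = [d for d in descriptions if is_target(d)]

# With 3 * 2 * 2 = 12 possible descriptions, exactly one matches the target.
print(len(descriptions), len(matches))
```

Because the attribute lists are finite and do not change, the search space is fully known in advance, so exact matching never confuses the target with another description. The sketch says nothing about noisy or time-varying attributes, which fall outside these assumptions.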

2 EDLM constructions
Theorem 1 (EDLM theorem) Suppose an object can be described within a finite number of distinct attributes and these attributes are not dynamic over time. Then, machines trained with attribute information can detect the object with probability one.
Proof Let O be the object that is to be detected accurately by a machine. Suppose O is uniquely described by exactly a attributes, say, $y_1, y_2, \ldots, y_a$, which are distinct. It is given that the attributes to describe O do not change with time or these attributes are not dynamic over time. Here attributes can be treated as the description of an object that is enough to define that particular


object. That means a attributes are enough to describe the object O. Suppose the machine is trained with a triplet of information, say, {O, A(y), Y(c)}, where O is the object to be detected by the machine, A(y) is the set of attributes that completely describe O, and Y(c) is the set of values that each attribute can pick. The object O is described based on the set of predecided finite numerical values. The set $A(y) = \{y_1, y_2, \ldots, y_a\}$. The set Y(c) consists of all possible values of each of the attributes $y_j$ for $j = 1, 2, \ldots, a$. The attribute $y_j$ can choose any one of the $c_{c_j j}$ possibilities for $j = 1, 2, \ldots, a$. That is,

\[
Y(c) = \left\{
\begin{array}{l}
y_1 = [c_{11}, c_{21}, \ldots, c_{c_1 1}] \\
y_2 = [c_{12}, c_{22}, \ldots, c_{c_2 2}] \\
\quad \vdots \\
y_a = [c_{1a}, c_{2a}, \ldots, c_{c_a a}]
\end{array}
\right\} \tag{1}
\]

In Eq. (1) the quantities $c_{c_j j}$ for $j = 1, 2, \ldots, a$ need not be identical. This implies that the number of possible values for $y_1, y_2, \ldots, y_a$ need not be the same. The object O will be described by {O, A(y), Y(c)} with each $y_j$ choosing only one of the values $[c_{1j}, c_{2j}, \ldots, c_{c_j j}]$ for $j = 1, 2, \ldots, a$. Suppose "0" indicates the absence of $c_{c_j j}$ and "1" indicates the presence of $c_{c_j j}$ in $y_j$. All the attributes of Y(c) can be listed for the absence or presence of values of each attribute as follows:

\[
\begin{array}{rcl}
y_1 &=& [c_{11}(0,1),\ c_{21}(0,1),\ \ldots,\ c_{c_1 1}(0,1)] \\
y_2 &=& [c_{12}(0,1),\ c_{22}(0,1),\ \ldots,\ c_{c_2 2}(0,1)] \\
 &\vdots& \\
y_a &=& [c_{1a}(0,1),\ c_{2a}(0,1),\ \ldots,\ c_{c_a a}(0,1)]
\end{array} \tag{2}
\]

The machine trained with the triplet could start checking for the presence of predefined attributes in any order from the set A(y). Suppose it starts checking whether or not a given object has the attribute $y_1$. Then, $c_{11}(0,1) = 1$ indicates $y_1 = c_{11}$, and $c_{11}(0,1) = 0$ indicates that $y_1$ could pick a value from $\{c_{21}, c_{31}, \ldots, c_{c_1 1}\}$ or that the attribute $y_1$ may not be present in the given object. If $y_1$ picks one of the values in $\{c_{11}, c_{21}, \ldots, c_{c_1 1}\}$, then the machine will decide that the object has the attribute $y_1$ and proceed to check whether the other attributes of O are present. If $c_{11}(0,1) = 0$ and $y_1$ does not pick any value from $\{c_{21}, c_{31}, \ldots, c_{c_1 1}\}$, then the machine will decide that the given object is not O. If $c_{j1}(0,1) = 1$ for some j, then the machine starts checking whether or not the given object has the attribute $y_2$ in the same way described for $y_1$. If $c_{j2}(0,1) = 1$ for some j, then the machine starts checking whether or not the given object has the attribute $y_3$. If $c_{j2}(0,1) = 0$ for all j, then the given object is decided by the machine as not O. The process of checking for the attributes in A(y) continues only if the previous attribute checked within A(y) was present. When the set A(y) exists in a given object, the machine classifies the object as O, else a given object is classified


as not O. The number of all possible permutations of values of attributes that could determine a given object O is

\[
c_1 \times c_2 \times \cdots \times c_a = \prod_{j=1}^{a} c_j. \tag{3}
\]
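The checking procedure and the count in Eq. (3) can be illustrated with a minimal Python sketch. The attribute names and values below are hypothetical placeholders for A(y) and Y(c), not data from the chapter, and the function names are illustrative only. The sketch assumes that an object is presented as a mapping from attribute names to observed values.

```python
from math import prod

# Hypothetical Y(c): each attribute y_j maps to its admissible values.
# These names and values are illustrative assumptions only.
Y = {
    "colour": {"red", "grey"},                 # c_1 = 2
    "texture": {"smooth", "rough", "porous"},  # c_2 = 3
    "density_class": {"low", "high"},          # c_3 = 2
}

def is_object_O(candidate, allowed=Y):
    """Return True only if every attribute in A(y) is present in the
    candidate with one of its admissible values."""
    for attribute, values in allowed.items():
        # .get returns None when the attribute is absent, which is never admissible here
        if candidate.get(attribute) not in values:
            return False  # stop at the first failed attribute: not O
    return True

def count_value_combinations(allowed=Y):
    """Number of admissible value combinations, as in Eq. (3)."""
    return prod(len(values) for values in allowed.values())
```

For example, `is_object_O({"colour": "red", "texture": "rough", "density_class": "low"})` passes every check, while a candidate with a missing or inadmissible attribute is rejected at the first failed check; `count_value_combinations()` multiplies the per-attribute counts.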

Let $E_1$ be an event such that the object O is determined with a combination of Y(c) values, say, $\{c_{11}, c_{22}, \ldots, c_{c_a a}\}$, and let $E_2$ be another event that the given object is determined as O with a combination of Y(c) values different from $\{c_{11}, c_{22}, \ldots, c_{c_a a}\}$; then $E_1 \neq E_2$. A machine trained with {O, A(y), Y(c)} could determine an object as O from any of the distinct events $E_k$ (see Fig. 1),

\[
E_k \quad \text{for } k = 1, 2, \ldots, \prod_{j=1}^{a} c_j. \tag{4}
\]

Let Ω be the set of all objects in the universe with distinct $E_k$ values in Eq. (4) which can be classified as O. Then,

\[
|\Omega| \le \prod_{j=1}^{a} c_j. \tag{5}
\]

Let Σ be the set of all objects in the universe, and two or more objects have identical $E_k$ values which can be classified as O. Then,

\[
|\Sigma| > \prod_{j=1}^{a} c_j. \tag{6}
\]

Alternatively, if (6) holds, then two or more objects in the universe that can be classified as O by the machine have identical sets in $E_k$. When an arbitrary object, not known to belong to Ω or Σ, is presented to a machine trained with the triplet, the machine can identify whether or not the object is O with probability one, because either one of the $E_k$ for $k = 1, 2, \ldots$ is attained or the given object does not satisfy any of the $E_k$. If none of the $E_k$ occurred, then the machine will classify the given object as not O. □

FIG. 1 All possible distinct events $E_k$ are trained into EDLMs. The elements within each row of Y(c) could be different, but the cardinalities of the events $E_k$ are identical.

Remark 1 Listing the two sets A(y) and Y(c) for all kinds of objects might be impossible even if A(y) is not dynamic. Although Theorem 1 can assure the detection of O with probability one under the conditions specified, the procedure to detect O becomes impossible unless each and every possible trait or attribute of O is known and the values of each attribute are predetermined.

Theorem 2 Suppose the given object is detected by the machine as O. Then,

\[
P\left(\bigcup_{k=1}^{\prod_{j=1}^{a} c_j} E_k\right) = 1.
\]

Proof The probability that one of the values of the attribute $y_1$ is picked by the machine is $1/c_1$, the probability that one of the values of the attribute $y_2$ is picked by the machine is $1/c_2$, and so on. The probability of occurrence of $E_k$ for some k will be

\[
\frac{1}{c_1} \cdot \frac{1}{c_2} \cdots \frac{1}{c_a} = \frac{1}{\prod_{j=1}^{a} c_j}. \tag{7}
\]

Therefore,

\[
P\left(\bigcup_{k=1}^{\prod_{j=1}^{a} c_j} E_k\right) = \frac{1}{\prod_{j=1}^{a} c_j} + \frac{1}{\prod_{j=1}^{a} c_j} + \cdots + \frac{1}{\prod_{j=1}^{a} c_j} \quad \Big(\textstyle\prod_{j=1}^{a} c_j \text{ times}\Big) = 1. \tag{8}
\]

In the right-hand side of Eq. (8), we have used the property that the events $E_k$ are disjoint. □

Remark 2 Suppose the given object is detected by the machine as O. Then,

\[
\bigcap_{k=1}^{\prod_{j=1}^{a} c_j} E_k = \emptyset.
\]

When an object is detected as O, only one of the events $E_k$ will occur for some k, and as soon as that occurs, the target is achieved and there is no scope for other events to occur (Fig. 2).
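As a small numerical check of the argument in the proof of Theorem 2, the sketch below enumerates every value combination for a hypothetical set of counts $c_1, c_2, c_3$ and sums the per-event probabilities with exact fractions. The counts are invented for illustration, and the uniform probability $1/c_j$ per attribute is the assumption used in the proof.

```python
from fractions import Fraction
from itertools import product

value_counts = [2, 3, 2]  # hypothetical c_1, c_2, c_3

# Each event E_k corresponds to one combination of value indices, one per attribute.
events = list(product(*(range(c) for c in value_counts)))

# Probability of a single event, as in Eq. (7): the product of 1/c_j.
p_event = Fraction(1, 1)
for c in value_counts:
    p_event *= Fraction(1, c)

# The events are distinct, so the probability of their union is the plain sum.
total = p_event * len(events)
```

Here `len(events)` equals the product of the counts, every event has the same probability, and `total` comes out to exactly one because the events do not overlap.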


FIG. 2 Suppose a particular type of stone from the planet Mars, say, O(A), whose attributes do not change with time, is collected and information about the stone in the form of A(y) and Y(c) is stored. Then EDLMs can assist in determining whether any arbitrary stone in the universe is O(A) or not. Here $O_1(A), O_2(A), \ldots, O_7(A)$ are stones of the same kind with different sizes and shapes.

3 Conclusions
Incorporating actual intelligence and self-thinking capabilities, which are an integral part of human activities, into machines is still impossible. There is a lot of difference between a machine that possesses "vast information" and a machine that possesses "human-like intelligence." Machines with "vast information" can use this information and perform according to the algorithms provided for using it. The central advantage of the EDLMs introduced here is that they can classify a given object as O or not O without any uncertainty. If a given object is O, then one of the $E_k$ will occur for sure, and the probability that the EDLM described will classify the object correctly is one. The prelisted attributes could be selected by EDLMs in a random order; however, there will be no uncertainty in the performance of the machines once the conditions of EDLMs are satisfied. Beyond the detection of predefined objects, EDLMs can be extended to other real-world situations where complete information on A(y) and Y(c) is known prior to the implementation of machine learning algorithms. Understanding randomness and uncertainty in the data is important in real-world situations (Rao, 2022); however, machines with uncertain information cannot achieve specified targets with certainty. Construction of EDLMs may not be possible in all circumstances because complete information on A(y) may not be available for each and every object. However, whenever A(y) and Y(c) are known completely for a given object or for a given specific target, EDLMs would perform accurately.


References
Angelino, E., Larus-Stone, N., Alabi, D., Seltzer, M., Rudin, C., 2017. Learning certifiably optimal rule lists for categorical data. J. Mach. Learn. Res. 18 (234), 1–78.
Baladram, M.S., Koike, A., Yamada, K.D., 2020. Introduction to supervised machine learning for data science. Interdiscip. Inform. Sci. 26 (1), 87–121.
Blum, L., Shub, M., Smale, S., 1989. On a theory of computation and complexity over the real numbers: NP-completeness, recursive functions and universal machines. Bull. Am. Math. Soc. (N.S.) 21 (1), 1–46.
Brunton, S.L., Kutz, J.N., 2022. Data-Driven Science and Engineering–Machine Learning, Dynamical Systems, and Control, second ed. Cambridge University Press, Cambridge, ISBN: 978-1-009-09848-9 (xxiv+590 pp).
Calma, A., Reitmaier, T., Sick, B., 2018. Semi-supervised active learning for support vector machines: a novel approach that exploits structure information in data. Inform. Sci. 456, 13–33.
Chen, L.M., 2015. Machine learning for data science: mathematical or computational. In: Mathematical Problems in Data Science, Springer, Cham, pp. 63–74.
COE, n.d. History of Artificial Intelligence. https://www.coe.int/en/web/artificial-intelligence/history-of-ai (accessed 07.11.22).
Cucker, F., Smale, S., 2002. On the mathematical foundations of learning. Bull. Am. Math. Soc. (N.S.) 39 (1), 1–49.
Ferraz, M.F., Júnior, L.B., Komori, A.S.K., Rech, L.C., Schneider, G.H.T., Berger, G.S., Cantieri, A.R., Lima, J., Wehrmeister, M.A., 2021. Artificial intelligence architecture based on planar LiDAR scan data to detect energy pylon structures in a UAV autonomous detailed inspection process. In: Optimization, Learning Algorithms and Applications. Commun. Comput. Inf. Sci., vol. 1488. Springer, Cham, pp. 430–443.
Fokoue, E., Titterington, D.M., 2003. Mixtures of factor analysers: Bayesian estimation and inference by stochastic simulation. Mach. Learn. 50, 73–94.
Govindaraju, V., Rao, C.R., 2013. Machine Learning: Theory and Applications. Handbook of Statistics, vol. 31. Elsevier/North-Holland, Amsterdam, ISBN: 978-0-444-53859-8 (xxiv+525 pp).
JagWire, 2020. App, AI work together to provide rapid at-home assessment of coronavirus risk woman smiling, by Toni Baker 4 on March 5. JagWire. https://jagwire.augusta.edu/app-ai-work-together-to-provide-rapid-at-home-assessment-of-coronavirus-risk/ (accessed 01.11.22).
Magazine, Innovators, 2020. App Detects Coronavirus Risk by Susan Robertson on 5th March. Innovators Magazine. https://www.innovatorsmag.com/app-detects-coronavirus-risk/ (accessed 01.11.22).
Matsui, T., 2010. On the special topic "statistical machine learning". Proc. Inst. Stat. Math. 58 (2), 139.
mhealthintelligence, 2020. AI-Powered Smartphone App Offers Coronavirus Risk Assessment By Samantha McGrail on March 13. mHealthIntelligence. https://mhealthintelligence.com/ (accessed 01.11.22).
Nazarathy, Y., Klok, H., 2021. Statistics With Julia–Fundamentals for Data Science, Machine Learning and Artificial Intelligence. Springer Series in the Data Sciences, Springer, Cham. 978-3-030-70900-6; 978-3-030-70901-3 (xii+527 pp).
Niyogi, P., Smale, S., Weinberger, S., 2011. A topological view of unsupervised learning from noisy data. SIAM J. Comput. 40 (3), 646–663.
Poggio, T., Smale, S., 2003. The mathematics of learning: dealing with data. Not. Am. Math. Soc. 50 (5), 537–544.
Rao, A.S.R.S., 2022. Randomness and uncertainty are central in most walks of life. J. Indian Inst. Sci. 102 (4), 1105–1106. https://doi.org/10.1007/s41745-022-00345-6.
Smale, S., Zhou, D.-X., 2007. Learning theory estimates via integral operators and their approximations. Constr. Approx. 26 (2), 153–172.
Srinivasa Rao, A.S.R., Diamond, M.P., 2020. Deep learning of Markov model-based machines for determination of better treatment option decisions for infertile women. Reprod. Sci. 27, 763–770. https://doi.org/10.1007/s43032-019-00082-9.
Srinivasa Rao, A.S.R., Vazquez, J., 2020. Identification of COVID-19 can be quicker through artificial intelligence framework using a mobile phone-based survey when cities and towns are under quarantine. Infect. Control Hosp. Epidemiol. 41 (7), 826–830.
Srinivasa Rao, A.S.R., Vazquez, J.A., 2022. Better hybrid systems for disease detections and early predictions. Clin. Infect. Dis. 74 (3), 556–558. https://doi.org/10.1093/cid/ciab489.
Tadaki, K., 2012. A statistical mechanical interpretation of algorithmic information theory III: composite systems and fixed points. Math. Struct. Comput. Sci. 22 (5), 752–770.
Timesofindia, 2020. App Uses AI to Provide at-Home Assessment of Coronavirus Risk: Study by PTI/Updated: Mar 5 16:24 IST. https://timesofindia.indiatimes.com/home/science/app-uses-ai-to-provide-at-home-assessment-of-coronavirus-risk-study/articleshow/74493256.cms (accessed 01.11.22).
Tollenaar, N., van der Heijden, P.G.M., 2013. Which method predicts recidivism best?: a comparison of statistical, machine learning and data mining predictive models. J. R. Stat. Soc. Ser. A 176 (2), 565–584.
Van Messem, A., 2020. Support vector machines: a robust prediction method with applications in bioinformatics. In: Principles and Methods for Data Science. Handbook of Statist., vol. 43. Elsevier/North-Holland, Amsterdam, pp. 391–466.
Yeturu, K., 2020. Machine learning algorithms, applications, and practices in data science. In: Principles and Methods for Data Science. Handbook of Statist., vol. 43. Elsevier/North-Holland, Amsterdam, pp. 81–206.

Chapter 2

Multiscale representation learning for biomedical analysis

Abhishek Singh^a, Utkarsh Porwal^b, Anurag Bhardwaj^c,∗, and Wei Jin^a

^a Computer Science and Engineering, University of North Texas, Denton, TX, United States
^b Walmart Inc, Hoboken, NJ, United States
^c Khoury College of Computer Science, Northeastern University, San Jose, CA, United States
∗ Corresponding author: e-mail: [email protected]

Abstract
Representation learning has gained prominence over the last few years and has shown that, for any underlying learning algorithm, results are vastly improved if the feature representation of the input is improved to capture domain knowledge. Word embedding approaches like Word2Vec and subword information techniques like fastText have been shown to improve multiple NLP tasks in the biomedical domain. These techniques mostly capture indirect relationships but often fail to capture deeper contextual relationships. This can be attributed to the fact that such techniques capture only short-range context defined via a co-occurrence window. In this chapter, we describe recent advances in novel contextual embeddings for a "wide sentential context." These contextual embeddings can be used to generate a composite word embedding, achieving a multiscale word representation. We further illustrate that the composite embedding performs better than the present individual state-of-the-art techniques on both intrinsic and extrinsic evaluations.
Keywords: Knowledge representation, Natural language processing, Medical applications (general)

1 Introduction

Easy access to the Internet and personal health devices has led to a massive proliferation of biomedical data today. One of the largest such datasets is PubMed, which consists of 30 million documents and 5 billion words. Some of the important applications of this dataset include: (i) concept search and retrieval, (ii) literature-based discovery (LBD), and (iii) question and answer based systems. In order to perform these tasks efficiently, a deeper understanding of biomedical concepts in these texts is a critical requirement. In this

Handbook of Statistics, Vol. 48. https://doi.org/10.1016/bs.host.2022.12.004 Copyright © 2023 Elsevier B.V. All rights reserved.


chapter, we survey recent advances in representation learning techniques for medical concepts. We primarily focus on embedding-based representations that can subsequently power richer applications such as LBD and term similarity. Drug discovery is the process of identifying a potential candidate drug for the treatment of a specific disease. It is a long, sophisticated, and expensive process that can take years if not decades. LBD, on the other hand, is a process of drug discovery that connects seemingly unrelated findings in a large and ever-growing body of biomedical literature. LBD is an inexpensive and faster route to drug discovery, and almost all LBD methods conform to the ABC model championed by Don R. Swanson. The ABC model hypothesizes that if A implies B and B implies C, then there might be a relationship between A and C. Swanson found that dietary fish oil (A) lowers blood viscosity (B), as published in some literature. He also noted that people suffering from Raynaud's disease (C) also suffer from high blood viscosity (B), as published in the literature. He then hypothesized that dietary fish oil could be used to treat Raynaud's disease (Swanson, 1986), which was later clinically confirmed. Since then, a large body of research has been published proposing ways to automatically uncover such relationships. One approach is co-occurrence methods, where co-occurrence in text is treated as a relationship. This approach leads to a high false positive rate among identified candidates because it fails to capture semantic context (Henry and McInnes, 2017). Therefore, natural language processing (NLP) techniques such as word embedding or semantic lookup information (Collobert et al., 2011; Jha et al., 2017, 2018a; Pyysalo et al., 2013; Shaik and Jin, 2019) have recently been gaining prominence. One of the primary challenges with word embedding techniques is the ability to handle and train a large-scale embedding model over such a large and diverse token set.
This problem is further exacerbated by the presence of a large number of medical abbreviations that are not found in common English text. Also, methods that use a "sliding window" to capture context fail to capture "wide sentential context." As a result, simple word embedding techniques are poorly suited to discovering long-range dependencies among biomedical terms. This limitation has an impact on biomedical applications such as literature-based discovery (LBD) (Gopalakrishnan et al., 2018; Jha et al., 2018b; Xun et al., 2017), biomedical information retrieval (Wang et al., 2018b), relation extraction (Giuliano et al., 2006), and Q&A systems (Lee et al., 2020). Jha et al. (2017) noted that although dense word embeddings do effectively learn implicit semantic context, they are agnostic of the rich semantic knowledge captured in biomedical ontologies and vocabularies, which results in poor representation of such terms without adequate local context. The authors proposed MeSH2Vec, which captures both contextual information and available explicit semantic knowledge to learn externally augmented word embeddings.
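Returning to Swanson's ABC model and the co-occurrence approach described earlier in this section, the following minimal Python sketch shows how candidate A–C links can be proposed from pairwise co-occurrences. The pairs are toy, hand-written examples rather than corpus statistics, and the function names are illustrative assumptions. As noted above, a purely co-occurrence-based procedure of this kind tends to produce many false positives.

```python
from collections import defaultdict

# Toy, hand-written co-occurrence pairs; not drawn from any real corpus.
cooccurrences = [
    ("fish oil", "blood viscosity"),
    ("blood viscosity", "Raynaud's disease"),
    ("fish oil", "platelet aggregation"),
    ("platelet aggregation", "Raynaud's disease"),
    ("aspirin", "platelet aggregation"),
]

def build_links(pairs):
    """Symmetric adjacency map: term -> set of terms it co-occurs with."""
    links = defaultdict(set)
    for left, right in pairs:
        links[left].add(right)
        links[right].add(left)
    return links

def abc_candidates(a_term, pairs):
    """Propose C terms reachable from A through some intermediate B,
    excluding terms that already co-occur directly with A."""
    links = build_links(pairs)
    b_terms = links[a_term]
    candidates = defaultdict(set)
    for b in b_terms:
        for c in links[b]:
            if c != a_term and c not in b_terms:
                candidates[c].add(b)  # remember which B terms support this C
    return dict(candidates)

hypotheses = abc_candidates("fish oil", cooccurrences)
```

In this toy example, `hypotheses` links "fish oil" to "Raynaud's disease" through two intermediate terms, but it also proposes "aspirin" through a single shared term, which illustrates why co-occurrence alone is a weak signal.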


They used MEDLINE^a for the co-occurrence information of the Medical Subject Headings (MeSH terms) required to learn word embeddings, and MeSH^b tree codes as an external explicit knowledge base to learn augmented word embeddings. The authors showed the efficacy of their proposed method on different biomedical concept similarity/relatedness tasks. Similarly, Zhang et al. (2019) proposed BioWordVec, which learns dense word representations using both a large corpus of unlabeled biomedical data and structured biomedical ontologies. The authors used PubMed and MeSH data to learn word embeddings and assessed the utility of their work on the tasks of sentence pair similarity and biomedical relation extraction. While dense word embeddings do capture semantic and contextual information effectively, several tasks require embeddings at the sentence or document level. While there are simple ways of creating such embeddings from word embeddings, researchers have also trained embeddings directly at the sentence or document level that were more expressive than the ones derived from word embeddings. Chen et al. (2019) proposed BioSentVec, sentence embeddings trained with more than 30 million documents from PubMed and clinical notes in the MIMIC-III Clinical Database. The authors evaluated these embeddings on two sentence pair similarity tasks and released the trained embeddings publicly to further biomedical text mining research. While dense embeddings were already helping researchers make great progress in all kinds of NLP applications, recent developments in deep learning using transformers caused an unprecedented acceleration. The biomedical community also benefitted from these developments (Chen et al., 2018). Lee et al. (2020) proposed BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), a domain-specific language representation model pretrained on large-scale biomedical corpora to better understand complex biomedical texts.
Authors showed the efficacy of BioBERT on tasks like biomedical named entity recognition (NER), biomedical relation extraction, and biomedical question answering. Dense word or sentence embeddings improved the accuracy of several biomedical tasks such as LBD, relation extraction, and question answering. However, since most embeddings are learned within a sliding window, more complex far-out context cannot be captured. Therefore, recently knowledge graph-based approaches have been proposed to address such limitations (Zeng et al., 2022). Embeddings representing knowledge graphs are far more expressive than just word or document embeddings. These embeddings not only preserve the structure of the graph but also capture the semantic information of entities and relations. Several embedding models have been proposed for biomedical applications. Graph neural network (GNN) is an effective and

a https://www.nlm.nih.gov/medline/medline_overview.html.
b https://www.nlm.nih.gov/mesh/meshhome.html.


popular architecture for learning graph representations as it allows complex nonlinear transformations. Lin et al. (2020) proposed KGNN, a Knowledge GNN for drug–drug interaction (DDI) prediction. Likewise, Feeney et al. (2021) used multirelational GNN for DDI prediction. Dai et al. (2021) proposed adversarial autoencoder-based knowledge graph embeddings for DDI prediction. Su et al. (2020) presented a comprehensive review of network embeddings used in biomedical applications in their paper.

2 Representation learning: Background
There has been extensive research in learning better word representations. The traditional research in this area was mostly based on the bag of words (BOW) representation (Brants et al., 2007; Zhang et al., 2006), where each individual word was represented as a one-hot vector. As the BOW representation does not consider the semantic relation between words, these methods evolved to represent words as continuous dense vectors, generally known as word embeddings, as shown by Bengio et al. (2003), Mikolov et al. (2013b), and Mnih and Kavukcuoglu (2013). These techniques have shown significant improvements in the performance of several NLP tasks. Unsupervised techniques such as the skip-gram and continuous bag-of-words (CBOW) models for learning word representations have also been used by numerous researchers (Mikolov et al., 2013a; Muneeb et al., 2015). BERT (Devlin et al., 2018) and ELMo (Peters et al., 2018) are more recent approaches to language representation that have proved effective at producing contextualized representations from a deeper structure (e.g., a bidirectional language model) for transfer learning. One of the first to use word embeddings was Collobert et al. (2011), who proposed a neural network that learnt a unified word representation suited for tasks like part-of-speech tagging, NER, and semantic role labeling. In the biomedical domain, some early attempts at applying word embeddings have seen traction in the last few years; Pyysalo et al. (2013) provided distributional semantic resources for biomedical text processing. They focused on applying Word2Vec to train word embeddings from the preprocessed PubMed and PubMed Central Open Access (PMC OA) texts.
While PubMed is a popular source to extract information, most PubMed documents also have extra information about the Medical Subject Headings (MeSH) which could be useful in understanding the semantics of the document along with externally curated knowledge-base (Jha et al., 2017, 2018a). Chiu et al. (2016) prepared a study on how to train good word embeddings for biomedical NLP, and their research provided a comparison on how the quality of word embeddings differs. BioWordVec (Zhang et al., 2019) is the latest work in the Biomedical embedding space that combines MeSH terms and fastText (Bojanowski et al., 2017) in a novel way.


It has also been observed that the ontology-based methods (Nguyen and Al-Mubaid, 2006) are better at capturing taxonomic similarity and are more correlated with coders, whereas the context vector-based methods (Jha et al., 2017), which are sensitive to both taxonomic and contextual information, correlate better with physicians. In recent work, Shaik and Jin (2019) use a curated knowledge base (SemMedDB) to design new sentences called "Semantic Sentences." These semantic sentences are used in addition to biomedical text as inputs to the word embedding model to train semantic word embeddings. Though the above-mentioned methods work well, they lack the ability to work at large scale, which is a big challenge in the biomedical domain. Adapting skip-gram and CBOW to the biomedical domain, works such as Muneeb et al. (2015) obtained distributed representations for over 400 million tokens. Although they performed well on the task of pairwise similarity, they did not perform so well on the semantic relatedness task. Related terms can be in the same document but not in the same sentence. This points toward the lack of long-range contextual modeling in word embeddings. This chapter aims to address this particular issue. It builds upon recent work by Melamud et al. (2016), which introduces a simple and scalable way to model the context using a bidirectional long short-term memory (LSTM) network. They achieve impressive results across different tasks such as sentence completion, lexical substitution, and word sense disambiguation, substantially outperforming the popular context representation of averaged word embeddings. However, as shown in Table 1, this too suffers from the problems of associative boost and specificity, which we intend to solve. In recent times, a lot of research has followed the lines of bidirectional encoder representations from transformers (BERT), and Lee et al. (2020) recently employed BERT on biomedical text.
BioBERT showed significant improvement on tasks like biomedical NER, biomedical relation extraction, and biomedical question answering; however, it is not well suited to find similarity between terms (Reif et al., 2019). In BioBERT, representation of each word is conditioned on the sentence it appears, but similarity tasks are generally context free. BioWordVec (Zhang et al., 2019) shows the best result on the term similarity measures.

3 Multiscale embedding motivation

To address the aforementioned challenges, we need to learn a better word representation that can more efficiently and effectively capture complex relationships among biomedical terms. The key observation we make is that using just a single representation technique, like Word2Vec, fastText or Context2Vec (which through bidirectional LSTM tries to encode sentential context), is not enough to capture the high level of complexity in biomedical domain. To illustrate this, please refer to Table 1, which captures the cosine similarity


TABLE 1 Example of cosine similarity scores of BioWordVec (Zhang et al., 2019), context embedding, and composite embedding.

| Word pair | BioWordVec (Zhang et al., 2019) | Context-embedding similarity | Composite-embedding similarity |
|---|---|---|---|
| Nurse–surgeon | 0.537 | 0.387 | 0.440 |
| Scalpel–surgeon | 0.485 | 0.422 | 0.447 |
| Lawyer–surgeon | 0.516 | 0.389 | 0.431 |
| Lawyer–scalpel | 0.288 | 0.352 | 0.328 |

between four pairs of terms: nurse–surgeon, scalpel–surgeon, lawyer–surgeon, and lawyer–scalpel using the BioWordVec (Zhang et al., 2019) embedding, the context embedding alone, and the composite embedding. The composite embedding is obtained by combining the word embedding and the context embedding. The pairs (nurse–surgeon) and (scalpel–surgeon) are very close in the word embedding space but only fairly close in the context embedding space, as they tend to be used in the same context. The reason is that the pair (nurse–surgeon) might not occur in the same sentence, and hence the word embedding score is lower. In contrast, (scalpel–surgeon) is generally used in the same context and also in the same sentences. Hence, its similarity scores are comparably high. Furthermore, words like (nurse–surgeon) that are more similar also tend to share functional, script, or thematic relations (Hutchison, 2003). As pointed out by Asr et al. (2018), compared with word pairs that are purely category coordinates, like (lawyer–surgeon), or purely associates, like (scalpel–surgeon), pairs that share both types of relations, like (nurse–surgeon), tend to see an additive processing benefit that reflects the privilege of both similarity and relatedness, an effect generally referred to as the "associative boost." The pair (lawyer–scalpel) has a very low score for BioWordVec (Zhang et al., 2019) but a higher score in the context embedding and composite embedding spaces. This might be attributed to the fact that surgeon connects lawyer and scalpel, hence the fairly high context and composite similarity scores. Consequently, we propose to use the power of multiple lenses. More specifically, we use a twofold composite word representation scheme. The first representation aims to capture the short-term context. The second representation aims to capture "wide sentential" contexts.
Finally, these two representations are combined meaningfully to allow capturing both short-range and long-range dependency for a single term. Toward that end, we reuse the basic concepts of methodologies like Word2Vec and Context2Vec to obtain alternate


representations for the same word and then use concepts of concatenation and principal component analysis (PCA) (Jolliffe and Cadima, 2016) to get a unified representation. With this augmented approach, we are able to handle the limitations listed in earlier paragraphs. Going back to the examples of Table 1, our proposed methodology, multilens similarity, yields similarity values that are along the expected lines. Our experiments on diverse datasets show that the proposed approach significantly improves on the state-of-art method.
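The combination step described above (concatenation followed by PCA, then cosine similarity) can be sketched with NumPy as follows. The embeddings are random stand-ins, the dimensions are kept tiny so the example runs instantly, and the row-wise L2 normalization before concatenation is an assumption of this sketch rather than a step stated in the chapter. PCA is computed here through a singular value decomposition of the centered matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["nurse", "surgeon", "scalpel", "lawyer"]

# Random stand-ins for pretrained vectors (hypothetical sizes 6 and 8).
word_emb = rng.normal(size=(len(vocab), 6))     # e.g., a word embedding
context_emb = rng.normal(size=(len(vocab), 8))  # e.g., a context embedding

def l2_normalize(matrix):
    """Scale every row to unit length (an assumption of this sketch)."""
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

def composite_embedding(word_vecs, context_vecs, n_components):
    """Concatenate the two views, then project onto the top principal components."""
    combined = np.hstack([l2_normalize(word_vecs), l2_normalize(context_vecs)])
    centered = combined - combined.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

composite = composite_embedding(word_emb, context_emb, n_components=3)
nurse_surgeon = cosine(composite[0], composite[1])
```

With real embeddings, the PCA projection would be fitted over the whole vocabulary, and the number of retained components (200 in the best-performing row of Table 2) becomes a tuning choice.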

4 Theoretical framework

4.1 Local context embedding
Mikolov et al. (2013b) created the Word2Vec tool, which uses continuous bag-of-words (CBOW) and skip-gram with negative sampling (SGNS) to learn distributed representations of words from very large datasets. Both variants use a sliding window approach to generate context and use a neural network to predict either the target word (CBOW) or the context words (skip-gram). Fig. 1 illustrates the difference between the two. Bojanowski et al. (2017) developed fastText, a subword embedding model. BioWordVec (Zhang et al., 2019) combines MeSH terms with fastText to produce a state-of-the-art embedding method. For the various embeddings referred to in Table 2, we use the respective publicly available pretrained embeddings.
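To make the sliding window idea concrete, the sketch below generates training examples for the two Word2Vec variants from a tokenized sentence. It covers only the construction of (target, context) examples; the neural network and negative sampling are omitted, and the function names and the example sentence are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) pairs: the target word is used to predict each neighbour."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context words, target) examples: the neighbours jointly predict the target."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

sentence = ["fish", "oil", "lowers", "blood", "viscosity"]
sg = skipgram_pairs(sentence, window=1)
cb = cbow_examples(sentence, window=1)
```

Words farther apart than `window` never appear in the same example, which is the short-range limitation that motivates the wide context embedding.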

4.2 Wide context embedding

The local context focus of Word2Vec prevents a deeper understanding and incorporation of the context. To alleviate this, we use a bidirectional LSTM to obtain a context representation that is also sensitive to the position of the terms in a sentence. As shown in Eq. (1), two LSTM representations are generated: one from reading the words in a sentence from left to right (lLSTM) and the other from right to left (rLSTM).

$$\mathrm{biLSTM}(W_{1:n}, i) = \mathrm{lLSTM}(l_{1:i-1}) \oplus \mathrm{rLSTM}(r_{n:i+1}) \tag{1}$$

Once the bidirectional LSTM representation is computed, a multilayer perceptron (MLP) concatenates both the left and the right representations as shown in Eq. (2), where $L_1$ and $L_2$ are the linear layers. The rectified linear unit (ReLU) is used as the activation function. Finally, the context representation $c$ for a word is computed as in Eq. (3).

$$\mathrm{MLP}(x) = L_2(\mathrm{ReLU}(L_1(x))) \tag{2}$$

$$c = \mathrm{MLP}(\mathrm{biLSTM}(W_{1:n}, i)) \tag{3}$$
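Two pieces of Eqs. (1)–(3) are easy to show in isolation: how a sentence is split into the left and right inputs for a target position, and the shape of the MLP in Eq. (2). The sketch below is a pure-Python illustration and deliberately contains no LSTM; the helper names and the tiny weights are hypothetical, and positions are 0-based in the code while the equations are 1-based.

```python
def split_context(words, i):
    """Return (left, right) inputs for the target at 0-based position i.

    left  : words before the target, read left to right   (l_{1:i-1})
    right : words after the target, read right to left    (r_{n:i+1})
    """
    left = words[:i]
    right = words[i + 1:][::-1]
    return left, right


def relu(v):
    return [max(0.0, x) for x in v]


def linear(weights, bias, v):
    """Affine map: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def mlp(x, l1, l2):
    """Eq. (2): L2(ReLU(L1(x))); l1 and l2 are (weights, bias) pairs."""
    return linear(l2[0], l2[1], relu(linear(l1[0], l1[1], x)))


words = ["the", "surgeon", "used", "a", "scalpel"]
print(split_context(words, 2))
# (['the', 'surgeon'], ['scalpel', 'a'])

# Hypothetical 2 -> 2 -> 1 network with hand-picked weights.
l1 = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
l2 = ([[1.0, 1.0]], [0.5])
print(mlp([1.0, -2.0], l1, l2))
# [1.5]
```

In the actual model, the two word sequences would each be encoded by an LSTM and the resulting vectors concatenated before being passed to the MLP.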


FIG. 1 Word2Vec model architecture.

TABLE 2 Evaluation results on the UMNSRS-Sim dataset.

| Method | Corpus | Pearson | Spearman | # |
|---|---|---|---|---|
| Mikolov et al. | Google news | 0.421 | 0.409 | 336 |
| Pyysalo et al. (2013) | PubMed + PMC | 0.549 | 0.524 | 493 |
| Chiu et al. | PubMed | 0.662 | 0.652 | 462 |
| BioWordVec (win20) | PubMed + MeSH | 0.667 | 0.657 | 521 |
| Context2Vec | PubMed | 0.601 | 0.589 | 476 |
| Context2Vec + Pyysalo et al. (2013) | PubMed + MeSH | 0.614 | 0.602 | 443 |
| Context2Vec + BioWordVec Concat | PubMed + PMC | 0.646 | 0.634 | 471 |
| Context2Vec + BioWordVec PCA-200 | PubMed + MeSH | **0.692** | **0.684** | 471 |

“#” denotes the number of pairs for which the embedding was available. “Pearson” denotes the Pearson correlation score and “Spearman” denotes the Spearman correlation score. The highest values are shown in bold. The multiscale embedding performs better than the individual embedding.


4.3 Multiscale embedding

For every word, we obtain a word embedding and a context embedding, and then combine the two using different methods. We first concatenate the embeddings: if the first embedding has 200 dimensions and the second has 300 dimensions, the combined embedding has 500 dimensions. To further reduce the dimension, we use PCA, a mathematical algorithm that reduces the dimensionality of the data while retaining most of its variation (Jolliffe and Cadima, 2016). As in Eq. (4), we use the combined embedding and its PCA-reduced version to compute the cosine similarity.

$$\cos(a, b) = \frac{\mathrm{embedding}_a \cdot \mathrm{embedding}_b}{\lVert \mathrm{embedding}_a \rVert \, \lVert \mathrm{embedding}_b \rVert} \tag{4}$$
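As a minimal sketch of the combination step, the code below concatenates a word vector with a context vector and applies the cosine similarity of Eq. (4). The PCA reduction is omitted here for brevity; the vectors are made-up toy values, and returning 0.0 for a zero vector is a convention chosen for this example rather than something stated in the text.

```python
import math


def concat(word_vec, context_vec):
    """Composite vector: word embedding followed by context embedding."""
    return list(word_vec) + list(context_vec)


def cosine(a, b):
    """Eq. (4): dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


# Toy 2-d word vectors and 3-d context vectors (hypothetical values).
term_a = concat([1.0, 0.0], [0.0, 1.0, 0.0])
term_b = concat([1.0, 0.0], [0.0, 0.0, 1.0])
print(len(term_a))             # 5
print(cosine(term_a, term_b))  # close to 0.5
```

The two toy terms agree in the word-embedding part and disagree in the context part, so the composite similarity falls between the two individual similarities (1.0 and 0.0).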

4.4 Postprocessing and inference for word similarity task

Term similarity is calculated as the cosine similarity between the two terms. As shown in Eq. (4), the cosine similarity is calculated separately using the word vector of each term in the local context embedding, context2vec, and the composite embedding. Once the similarity of each pair is computed, we compute Spearman’s coefficient between the human scores and the scores of the proposed model to compare their orderings, as shown in Eq. (5), where $d_i$ is the difference between the two ranks of each observation and $n$ is the number of observations. The human score is the score given by human judges for each pair in the respective dataset. Spearman’s coefficient “assesses how well the relationship between two variables can be described using a monotonic function” (Wikipedia Contributors, 2020). Hence, this correlation is high when the two variables rank the observations identically or very similarly. We systematically ignore words that appear only in the reference but not in our models.

4.5 Evaluation scheme

As defined in the datasets we use, we perform two types of evaluation: intrinsic and extrinsic. For intrinsic evaluation, we use the UMNSRS-Sim and UMNSRS-Rel datasets to measure the performance of our embeddings on the similarity computation task, with Spearman’s rank correlation coefficient as the metric. For extrinsic evaluation, we follow this up by applying these embeddings in a downstream application such as text classification. For the sentence similarity task, we use the simple cosine similarity and the Euclidean similarity to compare the proposed composite embedding with other available embeddings. For the DDI multiclass problem, we compute the microaverage to compare the performance of the proposed composite embedding with that of other embeddings. The DDI dataset consists of five different DDI types, namely Advice, Effect, Mechanism, Int, and Negative. For qualitative evaluation, we select a few terms and compare the performance of our embeddings with that of the baselines.

$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \tag{5}$$
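Eq. (5) can be computed directly from two lists of scores. The sketch below is a minimal illustration that assumes there are no tied values, since the closed form in Eq. (5) is exact only in that case; the function names and the example scores are hypothetical.

```python
def ranks(values):
    """Rank 1 is the smallest value; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Eq. (5): 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long lists with at least two items")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical human scores and model cosine similarities for four term pairs.
human = [100, 400, 900, 1300]
model = [0.2, 0.1, 0.7, 0.9]
print(spearman(human, model))  # close to 0.8
```

Only the ordering matters here: the human scores and the cosine similarities live on different scales, yet swapping two adjacent ranks is the only thing that lowers the coefficient below 1.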

5 Experiments, results, and discussion

In this section, we demonstrate the efficacy of the proposed approach through a variety of experimental results under different settings and datasets. We follow this up with a detailed discussion. Before delving further, however, we first explain the datasets used for training and testing.

5.1 Datasets

Herein, we first introduce the datasets used for training and obtaining the embeddings, and then those that will be used in experiments. We shall also explain the preprocessing steps employed in both training and experiment stages.

5.1.1 Training stage: Datasets and preprocessing

The National Library of Medicine (NLM) maintains a wide variety of resources to help researchers throughout the world (Humphreys and Lindberg, 1993). We use PubMed to train the context embeddings. The PubMed database has more than 30 million documents that cover the titles and abstracts of biomedical scientific publications. The PubMed dataset contains numerous fields. For the purpose of our paper, we focus on the title and the abstract in a PubMed file. Once we have the title and the abstract, we encode the characters as ASCII, convert them to lowercase, and remove stop words.

5.1.2 Testing stage: Datasets

Word embeddings can be broadly evaluated in two categories, intrinsic and extrinsic. For intrinsic evaluation, word embeddings are used to calculate or predict semantic similarity between words, terms, or sentences. For extrinsic tasks, the generated word embeddings are used as input to various downstream NLP tasks such as text classification, NER, or question answering. The datasets used for the intrinsic and extrinsic tasks are as follows:

• Intrinsic evaluation: UMNSRS. For intrinsic evaluation, we use two widely used benchmark datasets, UMNSRS-Sim and UMNSRS-Rel, compiled by Pakhomov et al. (2010). The two datasets, UMNSRS-Sim and UMNSRS-Rel, contain 566 and 587 term pairs, respectively. The similarity between these terms was judged by four medical residents from the University of Minnesota Medical School, giving a score in the range of 0–1600. A higher score means that the annotators judged the pair more similar (UMNSRS-Sim) or more related (UMNSRS-Rel).

• Extrinsic evaluation: Word embeddings are also commonly used to calculate sentence similarity. In this area, the SemEval semantic textual similarity (SemEval STS) challenges have been conducted to find effective models for this task. In this chapter, for extrinsic evaluation we use the BioCreative/OHNLP STS dataset (Wang et al., 2018a), which consists of 1068 pairs of sentences derived from clinical notes and was annotated by two medical experts on a scale of 0–5, from completely dissimilar to semantically equivalent (Wang et al., 2018a). As a second dataset we use the BIOSSES (Soğancıoğlu et al., 2017) dataset. BIOSSES (Soğancıoğlu et al., 2017) consists of 100 sentence pairs from PubMed articles annotated by 5 curators. We also evaluate our work on the DDI extraction task using the DDI 2013 corpus (Herrero-Zazo et al., 2013; Segura-Bedmar et al., 2014). The DDI dataset consists of five different DDI types, namely Advice, Effect, Mechanism, Int, and Negative. The training and test sets contain 27,793 and 5716 instances, respectively. As suggested by Zhang et al. (2019), we also hold out 10% of the training set to use as the validation set.

For both the intrinsic and extrinsic evaluation datasets, we convert the text to lowercase and remove stop words.
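The preprocessing steps named in this section (ASCII encoding, lowercasing, and stop-word removal) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the stop-word list is a small hypothetical sample, tokenization is plain whitespace splitting, and accented characters are decomposed before non-ASCII bytes are dropped.

```python
import unicodedata

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "is", "for"}


def preprocess(text, stop_words=STOP_WORDS):
    """ASCII-encode, lowercase, split on whitespace, and drop stop words."""
    # Decompose accented characters, then drop anything that is not ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    tokens = ascii_text.lower().split()
    return [t for t in tokens if t not in stop_words]


print(preprocess("The Effect of Aspirin in Sjögren Syndrome"))
# ['effect', 'aspirin', 'sjogren', 'syndrome']
```

Lowercasing at this stage is also what later merges forms such as AChE and ache, a side effect discussed in the error analysis.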

5.2 Wide context embedding (context2vec)

As explained above, context2vec is trained as an unsupervised model that efficiently learns context embeddings of wide sentential contexts, using a bidirectional LSTM as the underlying model. The goal of the model is to learn a generic task-independent embedding for a target word, given a sentence of any length. In a vanilla CBOW embedding architecture (Mikolov et al., 2013b), naive context modeling is done by averaging the word embeddings in a fixed window. In our model, by contrast, we learn the context using a bidirectional LSTM, which is a much more powerful neural network model. We follow the setup described by Melamud et al. (2016). To learn the parameters of the network, CBOW’s negative sampling objective function is used. With this approach, both the context embedding network parameters and the target word embeddings are learned. For our implementation, we use the one-hot embedding as the input layer for all the terms. Then we have two fully connected LSTM layers followed by two linear layers (a.k.a. fully connected layers). We tuned the hyperparameters (minimum word count, dropout, and batch size) to minimize the error. The best-performing configuration had an output layer size of 300 and a minimum word frequency of 8. With a mini-batch of 1000 sentences, training the model for 5 iterations on an NVIDIA GeForce RTX 2080 took 36 h. As a next step, we should try to train the model on a more powerful GPU machine.
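For readers unfamiliar with the negative sampling objective mentioned above, the sketch below evaluates the standard form of that loss for a single training example: the true target word is pulled toward the context vector and a few sampled "negative" words are pushed away. It is written from the general Word2Vec formulation, not from the authors' code; all names and vectors are hypothetical.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(context_vec, target_vec, negative_vecs):
    """-log s(t.c) - sum over negatives of log s(-t_neg.c), with s the sigmoid."""
    loss = -log_sigmoid(dot(target_vec, context_vec))
    for neg in negative_vecs:
        loss -= log_sigmoid(-dot(neg, context_vec))
    return loss


context = [1.0, 0.0]
# A target aligned with the context gives a lower loss than an unrelated one.
print(negative_sampling_loss(context, [5.0, 0.0], []) <
      negative_sampling_loss(context, [0.0, 0.0], []))
# True
```

In context2vec, `context_vec` would be the output `c` of Eq. (3) rather than an average over a fixed window, which is the main difference from vanilla CBOW.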

5.3 Quantitative evaluation

For the quantitative evaluation, we perform both intrinsic and extrinsic evaluations. This section is divided into two based on subtasks: the similarity measurement task and the downstream application task.

5.3.1 Term similarity task

Tables 2 and 3 present the Spearman’s (ρ) coefficient values obtained after applying the proposed model to the UMNSRS-Sim and UMNSRS-Rel datasets, respectively. The Spearman’s (ρ) coefficient is calculated between the cosine similarity of the terms under each individual embedding and the human judgments. As shown in Tables 2 and 3, the proposed model (Context2Vec + BioWordVec PCA-200) outperforms existing techniques and achieves the highest correlation with human judgments on both datasets. To compare our results with previous methods, we downloaded the Pyysalo et al. (2013) word embedding model and executed it against all the term pairs. For BioWordVec (Zhang et al., 2019), we downloaded the published pretrained embedding. We report the Mikolov et al. (2013a) and Chiu et al. (2016) results as given in Zhang et al. (2019). Chiu et al. (2016) show that the window size does not have much impact on the intrinsic evaluation of word embeddings. Hence, increasing the window size of the vanilla Word2Vec model would not be functionally equivalent to the contextual embedding model. We further ran an experiment combining our Context2Vec model with the Pyysalo et al. (2013) model. As Table 3 shows, the multiscale embedding performed considerably better than the individual models. Based on these results, we conclude that the proposed composite model is better at capturing the true intent than models that focus on only one aspect. The context model can easily be plugged into any existing embedding model to improve the individual model.

5.3.2 Downstream application task

There are multiple applications where this embedding can be directly used. One of the common use cases of word embeddings is to calculate sentence pair similarity. The SemEval STS tasks have been organized for five years to evaluate sentence similarity performance. One of the most common baselines used for this task is the averaged word embedding.
We average the word vectors for each word in the sentence. The cosine or Euclidean similarity measure is then used to calculate the similarity between the sentence pairs. We use this simple technique to show how the embedding performs on its own. Table 4 shows the Pearson correlation between the calculated similarity (cosine or Euclidean) and the gold standard labels. The proposed composite embedding performs best. We also perform an experiment on the DDI extraction task using the DDI 2013 corpus (Segura-Bedmar et al., 2014). The DDI corpus (Segura-Bedmar et al., 2014) is a semantically annotated corpus of documents describing drug–drug interactions (DDIs) from the DrugBank database and MedLine abstracts on the subject of DDIs. It consists of 1017 texts (784 DrugBank texts and 233 MedLine abstracts) and was manually annotated. The DDI dataset consists of five different DDI types, namely Advice, Effect, Mechanism, Int, and Negative. The training and test sets contain 27,793 and 5716 instances, respectively. To assess the quality of the embedding, we use a simple convolutional neural network (CNN) model as suggested by Zhang et al. (2019). Taking more complex steps or strategies would partly reduce the importance of the word embedding used in the model. The model was trained with either the composite embedding or BioWordVec (Zhang et al., 2019) as input. Table 5 shows that the composite embedding had the best F-score, while BioWordVec (Zhang et al., 2019) had a better recall. This also shows that the composite embedding is easy to use with any downstream task and easy to evaluate.

TABLE 3 Evaluation results on the UMNSRS-Rel dataset.

| Method | Corpus | Pearson | Spearman | # |
|---|---|---|---|---|
| Mikolov et al. | Google news | 0.359 | 0.347 | 329 |
| Pyysalo et al. (2013) | PubMed + PMC | 0.495 | 0.488 | 496 |
| Chiu et al. | PubMed | 0.600 | 0.601 | 467 |
| BioWordVec (win20) | PubMed + MeSH | 0.619 | 0.617 | 532 |
| Context2Vec | PubMed | 0.482 | 0.481 | 484 |
| Context2Vec + Pyysalo et al. (2013) | PubMed + MeSH | 0.524 | 0.524 | 484 |
| Context2Vec + BioWordVec Concat | PubMed + PMC | 0.565 | 0.561 | 484 |
| Context2Vec + BioWordVec PCA-200 | PubMed + MeSH | **0.647** | **0.642** | 484 |

“#” denotes the number of pairs for which the embedding was available. “Pearson” denotes the Pearson correlation score and “Spearman” denotes the Spearman correlation score. The highest values are shown in bold. The multiscale embedding performs better than the individual embedding.

TABLE 4 Sentence pair similarity results on the BioCreative/OHNLP STS dataset (Wang et al., 2018a) and BIOSSES (Soğancıoğlu et al., 2017).

| Dataset | Similarity measure | BioWordVec (Zhang et al., 2019) | Composite embedding |
|---|---|---|---|
| BioCreative/OHNLP STS dataset (Wang et al., 2018a) | Cosine | 0.606 | 0.619 |
| BIOSSES (Soğancıoğlu et al., 2017) | Cosine | 0.476 | 0.497 |

Composite embedding refers to the “Context2Vec + BioWordVec PCA-200” composite model. Values are the Pearson correlation between the calculated similarity and the gold standard labels.

TABLE 5 DDI (Segura-Bedmar et al., 2014) extraction evaluation results on the DDI 2013 corpus.

| Method | Precision | Recall | F-score |
|---|---|---|---|
| BioWordVec (Zhang et al., 2019) | 0.58 | 0.57 | 0.58 |
| Composite embedding | 0.62 | 0.56 | 0.59 |

Composite embedding refers to the “Context2Vec + BioWordVec PCA-200” composite model.
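The precision, recall, and F-score in Table 5 are microaveraged, as stated in Section 4.5. A minimal sketch of microaveraging for a multiclass task follows. Which labels count as positive is an assumption here (the example treats Negative as the "no interaction" class and excludes it), and the gold and predicted labels are invented for illustration.

```python
def micro_prf(gold, pred, positive_labels):
    """Microaveraged precision, recall, and F1 over the given positive labels."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p in positive_labels and p == g:
            tp += 1
        else:
            if p in positive_labels:
                fp += 1  # predicted an interaction type that is not the gold one
            if g in positive_labels:
                fn += 1  # missed (or mislabeled) a gold interaction
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Assumption: the four interaction types are positive; Negative is excluded.
POSITIVE = {"Advice", "Effect", "Mechanism", "Int"}
gold = ["Effect", "Advice", "Negative", "Mechanism", "Int"]
pred = ["Effect", "Negative", "Advice", "Mechanism", "Effect"]
print(micro_prf(gold, pred, POSITIVE))
# (0.5, 0.5, 0.5)
```

Because the counts are pooled over all classes before dividing, frequent classes dominate the microaverage, which matters for an imbalanced corpus such as DDI 2013.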

5.3.3 Drug rediscovery test

For the drug rediscovery test, we follow the method described by Sang et al. (2018). To evaluate the capability of the SemaTyP (Sang et al., 2018) method in discovering potential drugs for new diseases, a drug rediscovery test was done. In this test, 360 drug–disease relationships were selected from the therapeutic target database (TTD) (Chen et al., 2002) as the gold standard to form the test set. Each disease_i in the test set has one known associated drug_i, but the drug mechanism of action is not clear. For each disease_i, we randomly selected 100 other drugs or chemicals from TTD as candidate drugs for this disease. We then report the mean of the predicted ranks of drug_i. We perform the experiment as a closed discovery LBD test. We compute the phrase cosine similarity between the various TTD pairs and find the position of the gold drug_i for a given disease_i. We then take the mean position of the gold drug_i across all diseases. Based on Table 6, we can see that the composite embedding with 200 dimensions performs best.

TABLE 6 Mean ranking for the drug rediscovery test.

| Method | Mean ranking |
|---|---|
| BioWordVec | 29.24 |
| Context2Vec | 32.51 |
| SemaTyP (Sang et al., 2018) | 26.31 |
| Context2Vec + BioWordVec PCA-200 | 24.8 |
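The mean-ranking metric described above can be sketched as follows: for each disease, the gold drug is ranked among the candidate drugs by cosine similarity to the disease vector, and the ranks are averaged over all diseases. The vectors are toy values, the function names are hypothetical, and ties are resolved in favour of the gold drug, which is an assumption the text does not address.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def gold_rank(disease_vec, gold_drug_vec, candidate_vecs):
    """1-based rank of the gold drug among gold + candidates (higher cosine first)."""
    gold_score = cosine(disease_vec, gold_drug_vec)
    better = sum(1 for c in candidate_vecs if cosine(disease_vec, c) > gold_score)
    return better + 1


def mean_rank(cases):
    """cases: iterable of (disease_vec, gold_drug_vec, candidate_vecs)."""
    ranks = [gold_rank(d, g, c) for d, g, c in cases]
    return sum(ranks) / len(ranks)


# Two toy diseases with two candidates each (the test in the text uses 100).
cases = [
    ([1.0, 0.0], [1.0, 0.1], [[0.0, 1.0], [1.0, 0.0]]),  # one candidate beats gold
    ([0.0, 1.0], [0.0, 1.0], [[1.0, 0.0], [1.0, 1.0]]),  # gold is ranked first
]
print(mean_rank(cases))  # 1.5
```

A lower mean rank is better, which is why the composite embedding's value of 24.8 in Table 6 indicates the strongest result among the four methods.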

5.4 Qualitative analysis

In this section, we present examples of a few source words and their corresponding nearest terms in the word vector space and in the context vector space, and qualitatively evaluate the performance of our composite-embedding representation. Table 7 presents three examples and the corresponding top-4 nearest neighbors for the word-embedding, context-embedding, and composite-embedding approach. If we look at the example of “aspirin,” we observe that the word-embedding model picks warfarin, heparin, etc., which are different blood-thinning agents. In the context embedding, however, we also see clopidogrel, which is used to prevent heart attacks, and cilostazol, which is used to treat intermittent problems with blood flow to the legs and enables people to walk more with less pain. We also see terms like ibuprofen, which reflects the use of aspirin as an anti-inflammatory drug. If we carefully observe the pattern, it is clear that the word embedding literally skips words in the context around the target word and therefore finds very similar contexts for aspirin and warfarin (blood thinning). The context embedding, in contrast, considers entire sentential contexts, taking context word order and position into consideration. Hence, we are able to capture multiple contexts of aspirin, namely blood thinner and anti-inflammatory. We can also look at another example, the term “pain.” The context embedding goes beyond the terms that co-occur, like “tenderness” and “neuralgia,” but the word embedding is able to detect only terms like “headache,” “aching,” “backache,” etc. This logic is very similar to the UMNSRS dataset of similarity and relatedness. The Pakhomov et al. (2010) similarity dataset gives higher scores to term pairs like medrol and prednisolone, which are different types of steroids treating similar conditions. Overall, we can conclude that the word vector is good at modeling the similarity of terms, but the context vector is better at representing the relatedness of the terms.

TABLE 7 Example: top closest terms to a target term.

| Term | Context embedding | Word embedding |
|---|---|---|
| Pain | Painful, tenderness, hyperalgesia, neuralgia | Discomfort, headache, aching, tightness |
| Aspirin | Clopidogrel, cilostazol, indomethacin, ibuprofen | Acetaminophen, nsaid, warfarin, heparin |
| Eyes | Eye, corneas, retina, eyelids | Eye, phakic, eyeballs, corneas |
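Nearest-neighbor lists such as those in Table 7 can be produced by ranking every vocabulary term by its cosine similarity to the query term. The sketch below assumes a plain dictionary from term to vector; the two-dimensional vectors are invented for illustration and do not come from any of the embeddings discussed here.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_terms(query, embeddings, k=4):
    """Top-k terms by cosine similarity to the query term (query excluded)."""
    q = embeddings[query]
    scored = [(cosine(q, vec), term) for term, vec in embeddings.items() if term != query]
    # Highest similarity first; alphabetical order breaks ties deterministically.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [term for _, term in scored[:k]]


# Hypothetical toy vectors.
toy = {
    "aspirin": [1.0, 0.0],
    "warfarin": [0.9, 0.1],
    "ibuprofen": [0.5, 0.5],
    "retina": [0.0, 1.0],
}
print(nearest_terms("aspirin", toy, k=2))
# ['warfarin', 'ibuprofen']
```

Running the same query against the word embedding and the context embedding separately is what yields the two columns of Table 7.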

5.5 Error analysis

In this section, we discuss the various sources of error. Many of these issues are pervasive across the majority of embedding approaches, and perhaps can be solved by changing the unit of analysis from token level to character level. However, these are beyond the scope of this chapter.

• Abbreviations: Ache in common English means pain. But in our database acetylcholinesterase is referred to as AChE. Because we lowercase all text, AChE and ache become the same token. When we look at the terms similar to ache, we see terms like buche (butyrylcholinesterase) and bche (butyrylcholinesterase), which confuses the model.

• Common terms: Let us consider hypothyroidism and hyperthyroidism. Hypothyroidism is abnormally low activity of the thyroid gland. Hyperthyroidism is overactivity of the thyroid gland. The human similarity score for this pair in the MeSH-2 set is 0.4, but both the proposed model and the Pyysalo et al. (2013) model give a score of 0.8. This is because of a high degree of contextual overlap, which needs more domain-specific knowledge integration.
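The abbreviation problem described in the first item can be detected before training by checking which distinct surface forms collapse to the same token after lowercasing. The following sketch is only a diagnostic aid under that assumption; the function name and the sample vocabulary are hypothetical.

```python
from collections import defaultdict


def lowercase_collisions(vocabulary):
    """Map each lowercased token to the distinct original forms that collapse into it."""
    groups = defaultdict(set)
    for term in vocabulary:
        groups[term.lower()].add(term)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


print(lowercase_collisions(["AChE", "ache", "BChE", "pain"]))
# {'ache': ['AChE', 'ache']}
```

Terms flagged this way could be handled separately, for example by an abbreviation model of the kind mentioned as future work in the conclusion.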

6 Conclusion and future work

In this chapter, we presented a novel approach to compute a multiscale representation of biomedical terms. The classic word embedding is used as a local contextual representation and the context embedding is used to model the “wide sentential context.” A combined representation of these two embeddings, the composite embedding, is used to compute the similarity value for a pair of biomedical terms. Our experiments show that the results of the proposed approach are better than those of other state-of-the-art methods. We also show that the proposed method can easily be used in downstream tasks like sentence similarity and classification, where it performs better than the traditional word embedding. The proposed method does not need any external dataset or special processing and still achieves better results. In the future, using a better abbreviation model would help in creating a better embedding model. We would also like to train the context embedding model on a larger GPU machine to obtain better coverage, and to evaluate the model on other downstream tasks like NER to test whether the proposed model is better than existing models for those tasks.

References

Asr, F.T., Zinkov, R., Jones, M., 2018. Querying word embeddings for similarity and relatedness. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1 (Long Papers), pp. 675–684.
Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C., 2003. A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155.
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T., 2017. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146.
Brants, T., Popat, A.C., Xu, P., Och, F.J., Dean, J., 2007. Large language models in machine translation. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), June, Association for Computational Linguistics, Prague, Czech Republic, pp. 858–867.
Chen, X., Ji, Z.L., Chen, Y.Z., 2002. TTD: therapeutic target database. Nucleic Acids Res. 30 (1), 412–415.
Chen, H., Engkvist, O., Wang, Y., Olivecrona, M., Blaschke, T., 2018. The rise of deep learning in drug discovery. Drug Discov. Today 23 (6), 1241–1250.
Chen, Q., Peng, Y., Lu, Z., 2019. BioSentVec: creating sentence embeddings for biomedical texts. In: 2019 IEEE International Conference on Healthcare Informatics (ICHI), pp. 1–5.
Chiu, B., Crichton, G., Korhonen, A., Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In: Proceedings of the 15th Workshop on Biomedical Natural Language Processing, pp. 166–174.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P., 2011. Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537.
Dai, Y., Guo, C., Guo, W., Eickhoff, C., 2021. Drug-drug interaction prediction with Wasserstein adversarial autoencoder-based knowledge graph embeddings. Brief. Bioinform. 22 (4), bbaa256.
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K., 2018. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint:1810.04805.


Feeney, A., Gupta, R., Thost, V., Angell, R., Chandu, G., Adhikari, Y., Ma, T., 2021. Relation matters in sampling: a scalable multi-relational graph neural network for drug-drug interaction prediction. arXiv preprint:2105.13975.
Giuliano, C., Lavelli, A., Romano, L., 2006. Exploiting shallow linguistic information for relation extraction from biomedical literature. In: 11th Conference of the European Chapter of the Association for Computational Linguistics.
Gopalakrishnan, V., Jha, K., Xun, G., Ngo, H.Q., Zhang, A., 2018. Towards self-learning based hypotheses generation in biomedical text domain. Bioinformatics 34 (12), 2103–2115.
Henry, S., McInnes, B.T., 2017. Literature based discovery: models, methods, and trends. J. Biomed. Inform. 74, 20–32.
Herrero-Zazo, M., Segura-Bedmar, I., Martínez, P., Declerck, T., 2013. The DDI corpus: an annotated corpus with pharmacological substances and drug-drug interactions. J. Biomed. Inform. 46 (5), 914–920.
Humphreys, B.L., Lindberg, D.A., 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull. Med. Libr. Assoc. 81 (2), 170.
Hutchison, K.A., 2003. Is semantic priming due to association strength or feature overlap? A microanalytic review. Psychon. Bull. Rev. 10 (4), 785–813.
Jha, K., Xun, G., Gopalakrishnan, V., Zhang, A., 2017. Augmenting word embeddings through external knowledge-base for biomedical application. In: 2017 IEEE International Conference on Big Data (Big Data), pp. 1965–1974.
Jha, K., Wang, Y., Xun, G., Zhang, A., 2018a. Interpretable word embeddings for medical domain. In: 2018 IEEE International Conference on Data Mining (ICDM), pp. 1061–1066.
Jha, K., Xun, G., Wang, Y., Gopalakrishnan, V., Zhang, A., 2018b. Concepts-bridges: uncovering conceptual bridges based on biomedical concept evolution. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1599–1607.
Jolliffe, I.T., Cadima, J., 2016. Principal component analysis: a review and recent developments. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 374 (2065), 20150202.
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J., 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36 (4), 1234–1240.
Lin, X., Quan, Z., Wang, Z.-J., Ma, T., Zeng, X., 2020. KGNN: knowledge graph neural network for drug-drug interaction prediction. In: IJCAI, vol. 380, pp. 2739–2745.
Melamud, O., Goldberger, J., Dagan, I., 2016. context2vec: learning generic context embedding with bidirectional LSTM. In: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 51–61.
Mikolov, T., Chen, K., Corrado, G.S., Dean, J., 2013a. Efficient estimation of word representations in vector space.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J., 2013b. Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119.
Mnih, A., Kavukcuoglu, K., 2013. Learning word embeddings efficiently with noise-contrastive estimation. In: Advances in Neural Information Processing Systems, pp. 2265–2273.
Muneeb, T., Sahu, S., Anand, A., 2015. Evaluating distributed word representations for capturing semantics of biomedical concepts. In: Proceedings of BioNLP 15, pp. 158–163.
Nguyen, H.A., Al-Mubaid, H., 2006. New ontology-based semantic similarity measure for the biomedical domain. In: 2006 IEEE International Conference on Granular Computing, pp. 623–628.


Pakhomov, S., McInnes, B., Adam, T., Liu, Y., Pedersen, T., Melton, G.B., 2010. Semantic similarity and relatedness between clinical terms: an experimental study. In: AMIA Annual Symposium Proceedings, vol. 2010. American Medical Informatics Association, p. 572.
Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint:1802.05365.
Pyysalo, S., Ginter, F., Moen, H., Salakoski, T., Ananiadou, S., 2013. Distributional semantics resources for biomedical text processing. In: Proceedings of LBM 2013, pp. 39–44.
Reif, E., Yuan, A., Wattenberg, M., Viegas, F.B., Coenen, A., Pearce, A., Kim, B., 2019. Visualizing and measuring the geometry of BERT. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 32. Curran Associates, Inc, pp. 8594–8603.
Sang, S., Yang, Z., Wang, L., Liu, X., Lin, H., Wang, J., 2018. SemaTyP: a knowledge graph based literature mining method for drug discovery. BMC Bioinform. 19 (1), 1–11.
Segura-Bedmar, I., Martínez, P., Herrero-Zazo, M., 2014. Lessons learnt from the DDIExtraction-2013 shared task. J. Biomed. Inform. 51, 152–164.
Shaik, A., Jin, W., 2019. Biomedical semantic embeddings: using hybrid sentences to construct biomedical word embeddings and its applications. In: 2019 IEEE International Conference on Healthcare Informatics (ICHI), pp. 1–9.
Soğancıoğlu, G., Öztürk, H., Özgür, A., 2017. BIOSSES: a semantic sentence similarity estimation system for the biomedical domain. Bioinformatics 33 (14), i49–i58.
Su, C., Tong, J., Zhu, Y., Cui, P., Wang, F., 2020. Network embedding in biomedical data science. Brief. Bioinform. 21 (1), 182–197.
Swanson, D.R., 1986. Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspect. Biol. Med. 30 (1), 7–18.
Wang, Y., Afzal, N., Liu, S., Rastegar-Mojarad, M., Wang, L., Shen, F., Fu, S., Liu, H., 2018a. Overview of the BioCreative/OHNLP challenge 2018 task 2: clinical semantic textual similarity. In: Proceedings of the BioCreative/OHNLP Challenge, vol. 2018.
Wang, Y., Liu, S., Afzal, N., Rastegar-Mojarad, M., Wang, L., Shen, F., Kingsbury, P., Liu, H., 2018b. A comparison of word embeddings for the biomedical natural language processing. J. Biomed. Inform. 87, 12–20.
Wikipedia Contributors, 2020. Spearman’s rank correlation coefficient. Wikipedia, the free encyclopedia.
Xun, G., Jha, K., Gopalakrishnan, V., Li, Y., Zhang, A., 2017. Generating medical hypotheses based on evolutionary medical concepts. In: 2017 IEEE International Conference on Data Mining (ICDM), IEEE, pp. 535–544.
Zeng, X., Tu, X., Liu, Y., Fu, X., Su, Y., 2022. Toward better drug discovery with knowledge graph. Curr. Opin. Struct. Biol. 72, 114–126.
Zhang, Y., Hildebrand, A.S., Vogel, S., 2006. Distributed language modeling for n-best list re-ranking. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 216–223.
Zhang, Y., Chen, Q., Yang, Z., Lin, H., Lu, Z., 2019. BioWordVec, improving biomedical word embeddings with subword information and MeSH. Sci. Data 6 (1), 1–9.


Chapter 3

Adversarial attacks and robust defenses in deep learning

Chun Pong Lau (a,*), Jiang Liu (a), Wei-An Lin (b), Hossein Souri (a), Pirazh Khorramshahi (a), and Rama Chellappa (a)

(a) Johns Hopkins University, Baltimore, MD, United States
(b) Adobe Research, San Jose, CA, United States
(*) Corresponding author: e-mail: [email protected]

Abstract

Deep learning models have shown exceptional performance in many applications, including computer vision, natural language processing, and speech processing. However, if no defense strategy is considered, deep learning models are vulnerable to adversarial attacks. In this chapter, we will first describe various typical adversarial attacks. Then we will describe different adversarial defense methods for image classification and object detection tasks.

Keywords: Deep learning, Adversarial attacks, Defenses against adversarial attacks

1 Introduction

In recent years, artificial intelligence (AI) systems have been used extensively in a variety of applications. With their exceptional performance over the past decade, deep learning (DL) models have revolutionized the machine learning and AI fields. DL models have outperformed nearly all classical machine learning models, such as SVMs, naive Bayes classifiers, k-means clustering, and nearest neighbor methods, in many applications, including computer vision (Voulodimos et al., 2018), natural language processing (Cho et al., 2014), and speech processing (Graves and Jaitly, 2014), by a significant margin (Silver et al., 2016). When sufficient data are available, DL models are powerful and exceed the classical methods. In particular, with their outstanding capacity for feature extraction and generalization, DL models have made significant advances in a wide range of computer vision problems, such as object detection (Girshick et al., 2014; Liu et al., 2016; Redmon et al., 2016), object tracking (Wang et al., 2015), image classification (He et al., 2016; Krizhevsky et al., 2012), action recognition (Simonyan and Zisserman, 2014a), image captioning (Vinyals et al., 2015), human pose estimation (Newell et al., 2016; Toshev and Szegedy, 2014), face recognition (Deng et al., 2019; Ranjan et al., 2017), and semantic segmentation (Chen et al., 2017a, b; Zhao et al., 2017). However, without sufficient data, DL may not perform better than classical methods.

Handbook of Statistics, Vol. 48. https://doi.org/10.1016/bs.host.2023.01.001 Copyright © 2023 Elsevier B.V. All rights reserved.

Maliciously crafted adversarial perturbations, on the other hand, can severely degrade the performance of DL models. An attacker can fool almost any type of DL model by deliberately adding imperceptible perturbations to the inputs (Ren et al., 2020). Consequently, this phenomenon, known as adversarial attacks, is regarded as a significant barrier to the widespread deployment of DL models in security-critical and large-scale systems.

There are three categories of adversarial attacks: white-box, gray-box, and black-box attacks. White-box attacks occur when the attacker has complete access to the model's architecture, parameters, and training and testing routines. In the gray-box scenario, the attacker's knowledge is restricted to the model architecture and perhaps the training and testing routines. The most restrictive setting for the attacker is the black-box attack, in which the attacker has no knowledge of the target model.

In recent years, a variety of methods have been proposed to undermine the performance of models, including limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) (Szegedy et al., 2013), fast gradient sign method (FGSM) (Goodfellow et al., 2014), projected gradient descent (PGD) (Madry et al., 2017), DeepFool (DF) (Moosavi-Dezfooli et al., 2016), Carlini–Wagner (CW) (Carlini and Wagner, 2017), Jacobian-based saliency map attack (JSMA) (Papernot et al., 2016a), and Adversarial patch (Brown et al., 2017; Karmon et al., 2018).
Moreover, a variety of adversarial attacks, particularly Elastic, Fog, Gabor, Snow, and JPEG, have been recently presented as differentiable attacks against deep models (Kang et al., 2019a). Fig. 1 depicts some of the most popular adversarial examples against the ResNet-34 (He et al., 2016) network. It can be observed that the adversarial perturbations added to the images are not perceptible to human eyes.

In the meantime, numerous defense strategies have been proposed to make DL models secure and resilient against these attacks. Robust Optimization (Lin et al., 2020), Gradient Masking (Papernot et al., 2016b), and Adversarial Example Detection (Xu et al., 2019) are the predominant forms of these defenses (Souri et al., 2021).

Throughout this chapter, we briefly explain some of the most widely known adversarial attacks and then elaborate on some of the most recent adversarial defenses. In Section 3, we introduce On-Manifold Robustness, which concerns adversarial samples perturbed within the latent space of a generator. In Section 4, we introduce knowledge distillation-based defenses, including mutual adversarial training (MAT) (Liu et al., 2021a), which allows models to share their knowledge of adversarial robustness and teach each other to be more robust. In Section 5, we introduce defenses for object detectors, including segment and complete (SAC) (Liu et al., 2021b), which is a general framework for defending object detectors against patch attacks through detection and removal of adversarial patches. Finally, in Section 6, we introduce reverse engineering of deceptions via residual learning (REDRL) (Souri et al., 2021) as a method to detect and recognize benchmark adversarial attacks.

FIG. 1 Adversarial samples (first column) and their respective relatively sparse perturbations (second column), generated based on a ResNet-34 network trained on the RESISC45 dataset (Cheng et al., 2017). Note that pixel values of perturbations are multiplied by a factor of 10 for the purpose of visualization. (A) FGSM; (C) PGD; (E) Patch; (G) CWL2.

2 Adversarial attacks

This section describes several adversarial attack techniques and algorithms. These methods were originally intended for the image classification task; however, they can be used for other tasks such as object detection and semantic segmentation.

2.1 Fast gradient sign method

FGSM (Goodfellow et al., 2014) designs adversarial perturbations $\delta$ through a single backpropagation pass through the model $f_{image}$ and bounds $\delta$ to have an $L_\infty$ norm of $\epsilon$, i.e., $\|\delta\|_\infty = \epsilon$.
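The single-step construction can be illustrated with a minimal sketch. It assumes a toy linear softmax classifier in place of $f_{image}$ so that the input gradient has a closed form; the names `loss_and_input_grad` and `fgsm`, and the pixel range $[0, 1]$, are illustrative assumptions rather than part of the original method description.

```python
import numpy as np

def loss_and_input_grad(W, b, x, y):
    """Cross-entropy loss of a linear softmax classifier and its gradient w.r.t. the input x."""
    z = W @ x + b
    p = np.exp(z - z.max())
    p /= p.sum()
    loss = -np.log(p[y])
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    grad_x = W.T @ (p - onehot)      # chain rule through the logits z = W x + b
    return loss, grad_x

def fgsm(W, b, x, y, eps):
    """One signed-gradient step of size eps (assumed pixel range [0, 1])."""
    _, grad_x = loss_and_input_grad(W, b, x, y)
    delta = eps * np.sign(grad_x)    # ||delta||_inf = eps wherever the gradient is nonzero
    return np.clip(x + delta, 0.0, 1.0)
```

For a deep network the gradient would come from automatic differentiation instead of the closed form used here; the perturbation step itself is unchanged.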

2.2 Projected gradient descent

PGD (Madry et al., 2017), in contrast to FGSM, produces $\delta$ in multiple iterative steps. In each iteration, the step satisfies $\|\delta\|_\infty = \alpha$ and the generated adversarial sample $I_{adv}$ is forced to fall within the $\epsilon$-neighbor ball of the clean image $I_c$, i.e., $\|I_{adv} - I_c\|_\infty \le \epsilon$.
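A minimal sketch of the iterate-and-project loop is given below, again under the assumption of a toy linear softmax classifier standing in for $f_{image}$. The function names, the zero initialization at the clean image, and the $[0, 1]$ pixel range are illustrative choices, not details taken from the source.

```python
import numpy as np

def input_grad(W, b, x, y):
    """Gradient of the cross-entropy loss of a linear softmax classifier w.r.t. x."""
    z = W @ x + b
    p = np.exp(z - z.max())
    p /= p.sum()
    p[y] -= 1.0                      # p - onehot(y)
    return W.T @ p

def pgd(W, b, x_clean, y, eps, alpha, steps):
    """Signed-gradient ascent with projection onto the L-inf ball of radius eps."""
    x_adv = x_clean.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(input_grad(W, b, x_adv, y))   # step of size alpha
        x_adv = np.clip(x_adv, x_clean - eps, x_clean + eps)          # project onto eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)                              # keep a valid image
    return x_adv
```

With `steps=1` and `alpha=eps` this reduces to the FGSM step, which is one way to see PGD as its iterative refinement.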

2.3 DeepFool

DeepFool (Moosavi-Dezfooli et al., 2016) seeks a path for $I_c$ to pass the decision boundary set by the model $f_{image}$. Here, $\delta$ can be calculated as the normal vector of the linearized decision boundary.
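For intuition, the sketch below covers only the simplest case, an affine binary classifier $f(x) = w^\top x + b$, where the decision boundary is already linear and a single step along its normal suffices. The function name and the small `overshoot` factor are assumptions for illustration; for a deep network, the same step would be repeated on successive linearizations.

```python
import numpy as np

def deepfool_linear(w, b, x, overshoot=0.02):
    """Minimal L2 perturbation that moves x across the hyperplane w.x + b = 0."""
    f = w @ x + b
    delta = -(f / (w @ w)) * w       # orthogonal projection of x onto the boundary
    return x + (1.0 + overshoot) * delta   # small overshoot so the label actually flips
```

The perturbation has Euclidean length $(1 + \text{overshoot}) \cdot |f(x)| / \|w\|_2$, i.e., slightly more than the distance from $x$ to the boundary.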

2.4 Carlini and Wagner attack

The Carlini–Wagner attack (Carlini and Wagner, 2017) is a more subtle but computationally costly technique for generating adversarial examples. Through a series of gradient descent iterations, CW seeks an $I_{adv}$ that has the smallest Euclidean distance from $I_c$ while fooling $f_{image}$ into yielding a higher score (logit) value for a target class than for the rest of the classes, including the actual class of $I_c$.
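For reference, one widely used form of the targeted $L_2$ objective from Carlini and Wagner (2017) can be written as follows. The symbols $Z$ (the logits of $f_{image}$), $t$ (the target class), $c$ (a trade-off constant), and $\kappa$ (a confidence margin) are notation introduced here for illustration and do not appear in the text above.

```latex
\min_{\delta} \; \|\delta\|_2^2
  + c \cdot \max\Big( \max_{i \neq t} Z(I_c + \delta)_i - Z(I_c + \delta)_t,\; -\kappa \Big),
\qquad I_{adv} = I_c + \delta
```

The first term keeps $I_{adv}$ close to $I_c$ in Euclidean distance, while the second term becomes minimal once the target logit exceeds every other logit by at least $\kappa$.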

2.5 Adversarial patch

Adversarial patch (Karmon et al., 2018) iteratively optimizes a random patch of arbitrary size to fool the image classifier $f_{image}$. Moreover, we ensure that the generated patch lies within the $L_\infty$ ball enclosing the clean image.


2.6 Elastic

Elastic (Kang et al., 2019a) is a pixel space attack that warps the clean image $I_c$ with a vector field smoothed by a Gaussian kernel. The vector field is adversarially optimized to deceive the classifier $f_{image}$.

2.7 Fog

Fog (Kang et al., 2019a) creates adversarial occlusion that resembles fog based on the diamond-square algorithm (Fournier et al., 1982) for rendering stochastic fog. The parameters involved are adversarially optimized.

2.8 Snow

Snow (Kang et al., 2019a) simulates snowfall by randomly selecting regions within $I_c$ in which to add snowflakes. The orientation and intensity of the snowflakes are adversarially tuned.

2.9 Gabor

Gabor (Kang et al., 2019a) adversarially optimizes the noise parameters, including the orientation and bandwidth of the Gabor kernel, so that the obtained $I_{adv}$ can trick $f_{image}$.

2.10 JPEG

JPEG (Kang et al., 2019a) converts $I_c$ to the JPEG-encoded space and computes adversarial frequency components using the PGD attack. The inverse transform of the adversarial frequency components yields $I_{adv}$.

3 On-manifold robustness

Existing defenses consider properties of the trained classifier while ignoring the particular structure of the underlying image distribution. Recent advances in GANs and VAEs have been shown to be successful in characterizing the underlying image manifold.^a

^a The term "manifold" is used to refer to the existence of lower-dimensional representations for natural images. This is a commonly used term in the generative model area. However, this definition may not satisfy the exact condition of manifolds defined in mathematics.

3.1 Defense-GAN

Defense-GAN (Samangouei et al., 2018) is one of the early adversarial defense methods that leverages on-manifold images. Suppose we have a pretrained GAN model $G$, which is trained with natural images. Then, intuitively, the distribution of images produced by this GAN should be similar to the natural image distribution. In other words, given a random noise vector, passing it through the GAN model $G$ will output an image that lies within the natural image distribution. Now assume we have a perturbed image $\hat{x} \in \mathbb{R}^n$, $n = W \times H \times C$, with the ground-truth label $y \in \mathcal{Y} := \{1, \dots, |\mathcal{Y}|\}$, that fools a classifier $f_\theta : \mathbb{R}^n \to \mathcal{Y}$, i.e., $f_\theta(\hat{x}) \neq y$, where $W$, $H$, and $C$ are the width, the height, and the number of color channels of the image, respectively, and $\mathcal{Y}$ is the set of classes. To defend, Defense-GAN finds the image in the range of the GAN that is most similar to the perturbed image and uses it in place of the perturbed image. Since the GAN model ideally captures the natural image distribution, the image from the GAN is unattacked. We call this process GAN projection. The projected image is then passed to the classifier, which can classify it correctly. Mathematically,

$$z^{*} = \arg\min_{z} \|G(z) - x\|_p \tag{1}$$

with the $p$-norm, where $x$ is the input image to be projected (here, the perturbed image $\hat{x}$), $z \in \mathbb{R}^d$ is the $d$-dimensional latent vector, and the generator $G$ maps $z$ to the image space $\mathbb{R}^n$. We then pass $G(z^{*})$ to the classifier instead of the input image $\hat{x}$. This defense is effective and performs well on natural inputs. Since natural inputs lie in the natural image distribution, which is also the GAN distribution, theoretically nothing will change, and hence this method can achieve a good standard accuracy.

Although Defense-GAN is effective, it has a drawback: it cannot defend against on-manifold adversarial samples, which also lie in the GAN distribution but are able to fool the classifier. We consider the adversarial samples perturbed within the image space and the latent space as off-manifold and on-manifold adversarial samples, respectively. Mathematically,

$$\min_{\delta \in \Delta} D\big(f_\theta(x + \delta),\, f_\theta(x_t)\big), \tag{2}$$

and

$$\min_{\lambda \in \Lambda} D\big(f_\theta(G(z + \lambda)),\, f_\theta(x_t)\big), \tag{3}$$

where $x + \delta$ and $G(z + \lambda)$ are the off-manifold and on-manifold adversarial samples, respectively, $\Delta = \{\delta : \|\delta\|_p < \epsilon\}$, and $\Lambda = \{\lambda : \|\lambda\|_p < \eta\}$. For example, evasion attacks such as FGSM and PGD would be considered off-manifold attacks. These attacks have a different distribution from the GAN model, and hence Defense-GAN can defend against them effectively. On the other hand, since an on-manifold attack perturbs the image in the latent space, Defense-GAN cannot defend against all of them.

This motivates us to study whether or not leveraging the underlying manifold information can boost the robustness of the model, in particular against novel adversarial attacks. Our key intuition is that, in many cases, the latent space of GANs and VAEs represents compressed semantic-level features of images. Thus, robustifying in the latent space may guide the classification model to use robust features instead of achieving high accuracy by exploiting nonrobust features in the image space.
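The GAN projection in (1) can be sketched as gradient descent over the latent vector. The example below is only a toy illustration under strong assumptions: the generator is replaced by a hypothetical linear map $G(z) = Az + c$, the norm is taken as $p = 2$, and the step size is set from the spectral norm of $A$. A real Defense-GAN setup would differentiate through a trained generator instead.

```python
import numpy as np

def gan_projection(A, c, x, z0, steps=500):
    """Approximate z* = argmin_z ||G(z) - x||_2 for a toy linear generator G(z) = A @ z + c."""
    lr = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    z = z0.copy()
    for _ in range(steps):
        residual = A @ z + c - x
        z = z - lr * 2.0 * A.T @ residual          # gradient of ||G(z) - x||_2^2 w.r.t. z
    return z

# Usage: project a perturbed image, then classify G(z*) instead of x_hat.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))                       # hypothetical generator weights
c = rng.normal(size=20)
x_hat = A @ rng.normal(size=3) + c + 0.05 * rng.normal(size=20)   # on-manifold image plus noise
z_star = gan_projection(A, c, x_hat, z0=np.zeros(3))
x_projected = A @ z_star + c
```

Minimizing the squared norm gives the same minimizer as minimizing the norm itself; it is used here only because its gradient is simpler.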


3.2 Dual manifold adversarial training (DMAT)

Lin et al. (2020) investigate whether or not leveraging the underlying manifold information can boost model robustness, in particular against novel adversarial attacks. They consider the scenario when the manifold information of the underlying data is available. First, an "On-manifold ImageNet" (OM-ImageNet) dataset where all the samples lie exactly on a low-dimensional manifold is constructed. This is achieved by first training a StyleGAN on a subset of ImageNet natural images, and then projecting the samples onto the learned manifold. With this dataset, they show that on-manifold adversarial training (i.e., adversarial training in the latent space) could not defend against standard off-manifold attacks and vice versa. They proposed dual manifold adversarial training (DMAT), which mixes both off-manifold and on-manifold AT (see Fig. 2). AT in the image space (i.e., off-manifold AT) helps improve the robustness of the model against $L_p$ attacks, while AT in the latent space (i.e., on-manifold AT) boosts the robustness of the model against unseen non-$L_p$ attacks.

3.2.1 On-manifold ImageNet

One major difficulty in investigating the potential benefit of manifold information in general cases is the inability to obtain such information exactly. For approximate manifolds, the effect of the distribution shift between the true manifold $\mathcal{M}$ and its approximation $\tilde{\mathcal{M}}$ is difficult to quantify, leading to inconclusive evaluations. To address the issue, Lin et al. (2020) propose a novel dataset, called On-Manifold ImageNet (OM-ImageNet), which consists of images that lie exactly on-manifold.

FIG. 2 The overall pipeline of the proposed dual manifold adversarial training (DMAT). In this chapter, we consider the scenario when the information about the image manifold is available. This is achieved by projecting natural images x onto the range space of a trained generative model G. We empirically show that either standard adversarial training or on-manifold adversarial training alone does not provide sufficient robustness, while DMAT achieves improved robustness against unseen attacks. During test time, images are directly passed to the adversarially trained classifier.


The OM-ImageNet is built upon the Mixed-10 dataset introduced in the robustness library (Engstrom et al., 2019), which consists of images from 10 superclasses of ImageNet. A total of 69,480 image-label pairs are manually selected as $D^{o}_{tr} = \{(x_i, y_i)\}_{i=1}^{N}$ and another disjoint 7200 image-label pairs as $D^{o}_{te} = \{(x_j, y_j)\}_{j=1}^{M}$, both with balanced classes. The approach presented in Lin et al. (2020) first trains a StyleGAN (Karras et al., 2019) to characterize the underlying image manifold of $D^{o}_{tr}$. Formally, the StyleGAN consists of a mapping function $h : \mathcal{Z} \to \mathcal{W}$ and a synthesis network $\tilde{g} : \mathcal{W} \to \mathcal{X}$. The mapping function takes a latent code $z$ and outputs a style code $w$ in an intermediate latent space $\mathcal{W}$. Then, the synthesis network takes the style code and produces a natural-looking image $\tilde{g}(w)$. The authors of Lin et al. (2020) then follow Abdal et al. (2019) and consider the extended latent space $\mathcal{W}^{+}$ of StyleGAN. In Abdal et al. (2019), it has been shown that embedding images into the extended latent space is easier than embedding them into the $\mathcal{Z}$ or $\mathcal{W}$ space. Therefore, in the following, $g : \mathcal{W}^{+} \to \mathcal{X}$ is the generator function which approximates the image manifold for $D^{o}_{tr}$.

In order to obtain images that are completely on-manifold, each image $x_i$ in $D^{o}_{tr}$ and $D^{o}_{te}$ is projected onto the learned manifold by solving for its latent representation $w_i$ (Abdal et al., 2019). A weighted combination of the learned perceptual image patch similarity (LPIPS) (Zhang et al., 2018a) and the $L_1$ loss is used to measure the closeness between $g(w)$ and $x_i$. LPIPS is shown to be a more suitable measure for perceptual similarity than conventional metrics, and its combination with the $L_1$ or $L_2$ loss has been shown to be successful for inferring the latent representation of GANs. This strategy is adopted to derive the latent representation as in:

$$w_i = \arg\min_{w} \; \mathrm{LPIPS}(g(w), x_i) + \|g(w) - x_i\|_1. \tag{4}$$

In summary, the resulting on-manifold training and test sets can be represented by $D_{tr} = \{(w_i, g(w_i), x_i, y_i)\}_{i=1}^{N}$ and $D_{te} = \{(w_j, g(w_j), x_j, y_j)\}_{j=1}^{M}$, where $N = 69{,}480$ and $M = 7200$. The total number of categories is 10. Notice that for OM-ImageNet, the underlying manifold information is exact, which is given by $\{g(w), w \in \mathcal{W}^{+}\}$. Sample on-manifold images $g(w_i)$ from OM-ImageNet are presented in Fig. 3. On-manifold images in OM-ImageNet have diverse textures, object sizes, lighting, and poses, which makes the dataset suitable for investigating the potential benefits of using manifold information in more general scenarios.

3.2.2 On-manifold AT cannot defend standard attacks and vice versa

To investigate whether the manifold information alone can improve robustness, several ResNet-50 models are trained to achieve standard $L_\infty$ robustness at a radius of $\epsilon = 4/255$ or on-manifold robustness at a radius of $\eta = 0.02$ in the latent space.

FIG. 3 Sample images from the OM-ImageNet dataset. Unlike MNIST or CelebA datasets, images in OM-ImageNet have diverse textures. Moreover, the underlying manifold information for this dataset is exact.


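$$\min_{\theta} \sum_{i} \max_{\delta} \; L\big(f_\theta(g(w_i) + \delta),\, y_{true}\big), \quad \text{s.t. } \|\delta\|_\infty < \epsilon. \tag{5}$$

$$\min_{\theta} \sum_{i} \max_{\lambda} \; L\big(f_\theta(g(w_i + \lambda)),\, y_{true}\big), \quad \text{s.t. } \|\lambda\|_\infty < \eta. \tag{6}$$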

During training, the PGD-5 threat model is used in the image space for (5), whereas for (6) OM-FGSM and OM-PGD-5 are considered as the threat models. Mathematically,

$$\lambda_{k+1} = \eta_{iter} \cdot \mathrm{sign}\big(\nabla_{\lambda_k} L(f_\theta(G(z + \lambda_k)),\, y_{true})\big), \qquad \delta_{k+1} = \epsilon_{iter} \cdot \mathrm{sign}\big(\nabla_{\delta_k} L(f_\theta(x + \delta_k),\, y_{true})\big), \tag{7}$$

where $\epsilon_{iter}$ and $\eta_{iter}$ are the attack step sizes at each iteration for the image space and the latent space, respectively. For completeness, robust training using TRADES ($\beta = 6$) is also considered in the image space using the PGD-5 threat model. All the models are trained by the SGD optimizer with the cyclic learning rate scheduling strategy, momentum 0.9, and weight decay $5 \times 10^{-4}$ for a maximum of 20 epochs.

The approach proposed in Lin et al. (2020) evaluates the trained models using the PGD-50 and OM-PGD-50 attacks for multiple snapshots during training. The results are presented in Fig. 4. The following results are observed: (i) standard adversarial training leads to degraded standard accuracy while on-manifold adversarial training improves it, and (ii) standard adversarial training does not provide robustness to on-manifold attacks. Interestingly, (iii) on-manifold adversarial training does not provide robustness to $L_\infty$ attacks since no out-of-manifold samples are realized during training.

FIG. 4 On-manifold adversarial training does not provide robustness to standard attacks. Standard adversarial training does not provide robustness to on-manifold attacks. Left: standard accuracy. Middle: classification accuracy when the trained models are attacked by PGD-50. Right: classification accuracy when the trained models are attacked by OM-PGD-50.


3.2.3 Proposed method: Dual manifold adversarial training

The fact that standard adversarial training and on-manifold adversarial training bring complementary benefits to the model robustness motivates us to consider the following dual manifold adversarial training (DMAT) framework:

$$\min_{\theta} \sum_{i} \Big[ \max_{\delta \in \Delta} L\big(f_\theta(g(w_i) + \delta),\, y_{true}\big) + \max_{\lambda \in \Lambda} L\big(f_\theta(g(w_i + \lambda)),\, y_{true}\big) \Big], \tag{8}$$

where the classifier $f_\theta$ is robust to both off-manifold perturbations (achieved by the first term) and on-manifold perturbations (achieved by the second term). The perturbation budgets $\Delta$ and $\Lambda$ control the strengths of the threat models during training. For evaluation purposes, PGD-5 and OM-PGD-5 are considered as the standard and on-manifold threat models during training with the same perturbation budgets used by AT and OM-AT. To optimize (8), in each iteration, both standard and on-manifold adversarial examples are generated for the classifier $f_\theta$, and $f_\theta$ is updated by the gradient descent method. The robust model is trained with the identical optimizer setting as in Section 3.2.2.
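The inner maximizations of (8) can be sketched for a single sample as follows. This is a toy illustration, not the authors' implementation: it assumes a linear generator $g(w) = Aw$ and a linear softmax classifier so that all gradients are available in closed form, and it uses closed $L_\infty$ balls with hypothetical step sizes `alpha_x` and `alpha_w`. The iterative attack is written in the usual accumulate-and-project PGD form.

```python
import numpy as np

def ce_loss_and_grad(W, b, x, y):
    """Cross-entropy of a linear softmax classifier and its gradient w.r.t. x."""
    z = W @ x + b
    p = np.exp(z - z.max())
    p /= p.sum()
    loss = -np.log(p[y])
    p[y] -= 1.0                                   # p - onehot(y)
    return loss, W.T @ p

def dmat_loss(W, b, A, w, y, eps, eta, alpha_x, alpha_w, steps=5):
    """Sum of the off-manifold and on-manifold adversarial losses for one sample.

    Toy assumption: the generator is linear, g(w) = A @ w.
    """
    x = A @ w
    # Off-manifold PGD: perturb the image g(w) inside an L-inf ball of radius eps.
    delta = np.zeros_like(x)
    for _ in range(steps):
        _, gx = ce_loss_and_grad(W, b, x + delta, y)
        delta = np.clip(delta + alpha_x * np.sign(gx), -eps, eps)
    # On-manifold PGD: perturb the latent code inside an L-inf ball of radius eta.
    lam = np.zeros_like(w)
    for _ in range(steps):
        _, gx = ce_loss_and_grad(W, b, A @ (w + lam), y)
        lam = np.clip(lam + alpha_w * np.sign(A.T @ gx), -eta, eta)   # chain rule through g
    off_loss, _ = ce_loss_and_grad(W, b, x + delta, y)
    on_loss, _ = ce_loss_and_grad(W, b, A @ (w + lam), y)
    return off_loss + on_loss, delta, lam
```

In training, the returned loss would be averaged over a mini-batch and minimized with respect to the classifier parameters, which corresponds to the outer minimization in (8).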

3.2.4 DMAT improves generalization and robustness

Table 1 presents classification accuracies for different adversarial training methods against standard and on-manifold adversarial attacks. For standard adversarial attacks, a set of adaptive attacks is considered: FGSM, PGD-50, and the momentum iterative attack (MIA) (Dong et al., 2018). The per-sample worst case accuracy is also reported, where each test sample will be viewed as misclassified if one of the attacks fools the classifier. For on-manifold adversarial attacks, OM-PGD-50 is considered. Compared to standard adversarial training, DMAT achieves improved generalization on normal samples, and a significant boost for on-manifold robustness, with a slightly degraded robustness against the PGD-50 attack (with $L_\infty$ bound of 4/255). Compared to on-manifold adversarial training (OM-AT [OM-FGSM] and OM-AT [OM-PGD-5]), since out-of-manifold samples are realized for DMAT, robustness against the PGD-50 attack is also significantly improved.

TABLE 1 Classification accuracy (%) for PGD-50 and OM-PGD-50 attacks on OM-ImageNet test set.

Method                   Standard  FGSM   PGD-50  MIA    Worst case  OM-PGD-50
Normal training          74.72     2.59   0.00    0.00   0.00        0.26
AT [PGD-5]               73.31     48.02  38.88   39.21  38.80       7.23
OM-AT [OM-FGSM]          80.77     17.15  0.03    0.01   0.01        20.19
OM-AT [OM-PGD-5]         78.10     21.68  0.25    0.12   0.10        27.53
DMAT [PGD-5, OM-PGD-5]   77.96     49.12  37.86   37.65  36.66       20.53
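The per-sample worst case accuracy described above can be computed as in the following sketch. The function name and the boolean input layout (one row per attack, one column per test sample) are assumptions made for illustration.

```python
import numpy as np

def worst_case_accuracy(correct_per_attack):
    """Fraction of samples classified correctly under every attack.

    correct_per_attack: boolean array of shape (n_attacks, n_samples);
    entry [a, i] is True if sample i is still classified correctly under attack a.
    """
    correct = np.asarray(correct_per_attack, dtype=bool)
    return correct.all(axis=0).mean()   # a sample counts only if no attack fools the model

# Usage with three attacks and four test samples.
results = np.array([
    [True, True,  False, True],
    [True, False, False, True],
    [True, True,  True,  True],
])
wc = worst_case_accuracy(results)
```

By construction, this quantity can never exceed the accuracy under any individual attack, which matches the ordering of the columns in Table 1.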

3.2.5 DMAT improves robustness to unseen attacks

After demonstrating the improved robustness on known attacks brought by DMAT, Lin et al. (2020) investigate whether DMAT improves robustness against novel attacks. Several perceptual attacks proposed in Kang et al. (2019b) are considered, including Fog, Snow, Gabor, Elastic, JPEG, $L_2$, and $L_1$ attacks, which apply global color shifts and image filtering to the normal images. Results presented in Table 2 demonstrate that compared to standard adversarial training, DMAT is more robust against these attacks that are not seen during training.

TABLE 2 Classification accuracy (%) against unseen attacks applied to OM-ImageNet test set.

Method                          Fog    Snow   Elastic  Gabor  JPEG   L2     L1
Normal training                 0.03   0.06   1.20     0.03   0.00   1.70   0.00
AT [PGD-5]                      19.76  46.39  50.32    50.43  10.23  41.98  21.21
OM-AT [OM-FGSM]                 11.12  13.82  34.07    1.50   0.26   2.27   8.59
OM-AT [OM-PGD-5]                22.39  28.38  48.74    5.19   0.49   5.92   14.67
DMAT [PGD-5, OM-PGD-5] (Ours)   31.78  51.19  56.09    51.61  14.31  51.36  29.68

3.2.6 TRADES for DMAT

The proposed DMAT framework is general and can be extended to other adversarial training approaches such as TRADES (Zhang et al., 2019b). TRADES is one of the state-of-the-art methods that achieves a better trade-off between standard accuracy and robustness compared to standard AT (5). TRADES is integrated into DMAT by considering the following loss function:

$$\min_{\theta} \sum_{i} L\big(f_\theta(x_i),\, y_{true}\big) + \beta \max_{\delta} L\big(f_\theta(x_i),\, f_\theta(x_i + \delta)\big) + \beta \max_{\lambda} L\big(f_\theta(x_i),\, f_\theta(g(w_i + \lambda))\big), \tag{9}$$

where $x_i = g(w_i)$. The first two terms in (9) are the original TRADES in the image space, and the third term is the counterpart in the latent space. To solve for the two maximization problems in (9), PGD-5 and OM-PGD-5 with the same parameter settings are used. Results are presented in Table 3.

4 Knowledge distillation-based defenses

Adversarial training (Madry et al., 2017) is considered to be one of the most effective algorithms for adversarial defenses. There have been many works on improving adversarial training by using different loss functions, such as ALP (Kannan et al., 2018), TRADES (Zhang et al., 2019a), and MART (Wang et al., 2019). However, they only train one model without considering the synergy of a network cohort. In addition, most defense methods focus on a single perturbation type, which can leave them vulnerable to unseen perturbation types (Maini et al., 2020; Tramèr and Boneh, 2019). For example, models adversarially trained on $\ell_\infty$-bounded adversarial examples remain vulnerable to $\ell_1$- and $\ell_2$-bounded attacks.

Knowledge distillation (KD) (Hinton et al., 2015) is a well-known method for transferring knowledge learned by one model to another. There are many forms of KD (Wang and Yoon, 2020), including offline KD, where the teacher models are pretrained and the students learn from static teachers (Hinton et al., 2015), and online KD, where a group of student models learn from peers' predictions (Guo et al., 2020; Song and Chai, 2018; Zhang et al., 2018b). Several techniques based on KD have been proposed for adversarial defenses (Arani et al., 2020; Chen et al., 2021b; Goldblum et al., 2019; Liu et al., 2021a; Papernot et al., 2016b). Goldblum et al. (2019) demonstrate that robustness can be transferred among models through KD, and Chen et al. (2021b) show that KD can help mitigate robust overfitting.

Table 4 summarizes the differences among KD-based defenses. Defensive distillation (DD) trains a natural model, i.e., a model trained with natural examples, with another natural model as the teacher, which cannot provide strong robustness as neither the teacher nor the student is adversarially trained; DD was broken by Carlini and Wagner (2016). Adversarially robust distillation (ARD) (Goldblum et al., 2019) trains a robust model, i.e., a model trained with adversarial examples, with another robust model as the teacher to distill the robustness of a large network onto a smaller student. This strategy relies on the existence of strong teacher models, and the improvement of the student model is limited as the teacher is static. Adversarial concurrent training (ACT) (Arani et al., 2020) trains a natural model and a robust model jointly in an online KD manner to align the feature space of both. However, since natural and robust models learn fundamentally different features (Tsipras et al., 2019), aligning the feature space of a robust model with a natural model may result in degraded robustness.
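As a point of reference for the offline KD setting, the sketch below shows the commonly used distillation loss in the style of Hinton et al. (2015): a weighted sum of the hard-label cross-entropy and a temperature-softened KL term between teacher and student predictions. The temperature `T`, the weight `alpha`, and the function names are illustrative assumptions, and the specific losses used by DD, ARD, ACT, and MAT may differ from this form.

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax of logits z at temperature T."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, y, T=4.0, alpha=0.5):
    """alpha * CE(student, y) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    hard = -np.log(softmax(student_logits)[y])            # cross-entropy with the true label
    q_t = softmax(teacher_logits, T)                      # softened teacher prediction
    q_s = softmax(student_logits, T)                      # softened student prediction
    soft = np.sum(q_t * (np.log(q_t) - np.log(q_s)))      # KL(q_t || q_s)
    return alpha * hard + (1.0 - alpha) * T**2 * soft
```

In an adversarially robust variant, the student logits would typically be computed on adversarial examples; in online KD, the "teacher" logits come from a peer that is being trained at the same time.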

TABLE 3 Classification accuracy (%) against known (PGD-50 and OM-PGD-50) and unseen attacks applied to OM-ImageNet test set.

Method            Standard  PGD-50  OM-PGD-50  Fog    Snow   Elastic  Gabor  JPEG   L2
Normal training   74.72     0.00    0.26       0.03   0.06   1.20     0.03   0.00   1.70
TRADES            69.88     46.06   8.92       18.14  47.63  53.32    54.33  14.06  46.36
DMAT + TRADES     73.17     42.57   26.82      30.64  46.62  56.38    53.43  23.62  55.09

Even for TRADES, the benefit of using manifold information can also be observed.


TABLE 4 Comparisons of KD-based defenses.

Method                         Teacher model  Student model  Form of KD  Multiperturbations
DD (Papernot et al., 2016b)    Natural        Natural        Offline     No
ARD (Goldblum et al., 2019)    Robust         Robust         Offline     No
ACT (Arani et al., 2020)       Natural        Robust         Online      No
MAT (Liu et al., 2021a)        Robust         Robust         Online      Yes

"Robust" means the model is trained with adversarial examples and "Natural" means the model is trained with natural examples.

FIG. 5 Mutual adversarial training (MAT) architecture. $x$ is a clean image, $x^{\theta_1}_{adv}$ is the adversarial image of network $h_1$, and $x^{\theta_2}_{adv}$ is the adversarial image of network $h_2$.

Liu et al. (2021a) proposed MAT, which allows models to share their knowledge of adversarial robustness and teach each other to be more robust. Unlike previous KD-based defenses, MAT trains a group of robust models simultaneously, and each network learns not only from ground-truth labels as in standard AT, but also from the soft labels of peer networks, which encode the peers' knowledge for defending against adversarial attacks, to achieve stronger robustness. The architecture of MAT is shown in Fig. 5.

MAT improves model robustness through many avenues. First, MAT inherits the benefits of KD, such as improved generalization (Furlanello et al., 2018) and reduced robust overfitting (Chen et al., 2021b). Second, MAT creates a positive feedback loop of increasing model robustness. Each network serves as a robust teacher to provide semantic-aware and discriminative soft labels to its peers. By learning from strong peers, the network becomes more robust, which in turn improves the robustness of its peers. Moreover, MAT allows robust models to explore a larger space of adversarial samples and find more robust feature spaces and decision boundaries jointly. The adversarial examples of each model form a subspace of adversarial inputs (Tramèr et al., 2017), and the predictions of each model encode information about its decision boundary and adversarial subspace. In MAT, each model learns from its own adversarial examples and receives information about peers' adversarial examples through the soft labels. In this way, each model needs to consider a larger space of adversarial samples and find a feature space and decision boundary that work well not only on its own adversarial examples but also on peers' adversarial examples, which encourages solutions that are more robust and generalizable.

In addition, MAT improves model robustness to multiple perturbations. Given $M$ ($M \ge 2$) different perturbation types, since it is difficult for a single model to be robust to all perturbations, MAT trains an ensemble of $M + 1$ networks including one generalist network $h_0(\cdot\,; \theta_0)$ and $M$ specialist networks $h_1(\cdot\,; \theta_1), h_2(\cdot\,; \theta_2), \cdots, h_M(\cdot\,; \theta_M)$. Each specialist network is responsible for learning to defend against a specific perturbation, while the generalist network $h_0$ learns to defend against all perturbations with the help of the specialist networks. Previous methods (Tramèr and Boneh, 2019; Maini et al., 2020) attempt to improve model robustness against multiple perturbations by augmenting the training data. Tramèr and Boneh (2019) train a model with adversarial examples of multiple perturbation types, and Maini et al. (2020) train a model using adversarial examples generated by multi steepest descent (MSD), which incorporates multiple perturbation models into a single attack. MAT takes a different approach by transferring the robustness of specialist models to a single generalist model, which complements previous methods and achieves better performance.

5 Defenses for object detector

Object detection is an important computer vision task that plays a key role in many security-critical systems including autonomous driving, security surveillance, identity verification, and robot manufacturing (Vahab et al., 2019). Adversarial patch attacks, where the attacker distorts pixels within a region of bounded size, pose a serious threat to real-world object detection systems since they are easy to physically implement. For example, physical adversarial patches can make a stop sign (Song et al., 2018) or a person (Thys et al., 2019) disappear from object detectors, which could cause serious consequences in security-critical settings such as autonomous driving.


Despite the abundance of adversarial patch attacks on object detectors (Chen et al., 2018; Lang et al., 2021; Lee and Kolter, 2019; Li et al., 2018; Liu et al., 2018; Sharif et al., 2016; Song et al., 2018; Thys et al., 2019; Wang et al., 2021; Wu et al., 2020; Zhao et al., 2020), defenses against such attacks have not been extensively studied. Most existing defenses for patch attacks are restricted to image classification (Gittings et al., 2020; Hayes, 2018; Levine and Feizi, 2020; McCoyd et al., 2020; Rao et al., 2020; Wu et al., 2019; Xiang et al., 2021; Chiang et al., 2020). Securing object detectors is more challenging due to the complexity of the task. Most existing defenses for object detectors focus on global perturbations with the $\ell_p$ norm constraint (Chen et al., 2021a; Chiang et al., 2020; Zhang and Wang, 2019), and only a few defenses (Chiang et al., 2020, 2021; Ji et al., 2021; Saha et al., 2020; Xiang and Mittal, 2021) for patch attacks have been proposed. Saha et al. (2020) proposed Grad-defense and OOC defense for defending against blindness attacks, where the detector is blind to a specific object category chosen by the adversary. Ji et al. (2021) proposed Ad-YOLO to defend against human detection patch attacks by adding a patch class to the YOLOv2 (Redmon and Farhadi, 2017) detector such that it detects both the objects of interest and adversarial patches. DetectorGuard (DG) (Xiang and Mittal, 2021) is a provable defense against localized patch hiding attacks. These methods are designed for a specific type of patch attack or object detector.

SAC (Liu et al., 2021b) is a general defense that can robustify any object detector against patch attacks without retraining the object detector. SAC adopts a "detect and remove" strategy: it detects adversarial patches and removes the area from input images, and then feeds the masked images into the object detector. This is based on the following observation: while adversarial patches are localized, they can affect predictions not only locally but also on objects that are farther away in the image because object detection algorithms utilize spatial context for reasoning (Saha et al., 2020). This effect is especially significant for DL models, as a small localized adversarial patch can significantly disturb feature maps on a large scale due to the large receptive fields of neurons. By removing the detected patches from the images, SAC minimizes the adverse effects of adversarial patches both locally and globally.

The key to SAC is to robustly detect adversarial patches. SAC first trains a patch segmenter to segment adversarial patches from the inputs and produce an initial patch mask. It uses a self-adversarial training algorithm to enhance the robustness of the patch segmenter, which is efficient and object-detector agnostic. Since attackers can potentially attack the segmenter and disturb its outputs under adaptive attacks, SAC further utilizes a robust shape completion algorithm that exploits the patch shape prior to ensure robust detection of adversarial patches. Shape completion takes the initial patch mask and generates a "completed patch mask" that is guaranteed to cover the entire adversarial patch, given that the initial patch mask is within a certain Hamming distance from the ground-truth patch mask. The overall pipeline of SAC is shown in Fig. 6.
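The "detect and remove" idea and the coverage guarantee of shape completion can be illustrated with a simplified sketch. It assumes a single square patch of known side length `patch_size` and takes the completed mask to be the union of all candidate squares whose Hamming distance to the initial mask is at most a threshold `gamma`; these choices, the brute-force search, and the function names are assumptions for illustration and are not claimed to match the SAC implementation.

```python
import numpy as np

def complete_patch_mask(initial_mask, patch_size, gamma):
    """Union of all square candidates within Hamming distance gamma of the initial mask."""
    m = np.asarray(initial_mask, dtype=bool)
    H, W = m.shape
    completed = np.zeros_like(m)
    for i in range(H - patch_size + 1):
        for j in range(W - patch_size + 1):
            cand = np.zeros_like(m)
            cand[i:i + patch_size, j:j + patch_size] = True
            if np.count_nonzero(cand != m) <= gamma:   # Hamming distance to the initial mask
                completed |= cand
    return completed

def remove_patch(image, mask):
    """Zero out masked pixels; image has shape (H, W, C), mask has shape (H, W)."""
    return image * (~mask)[..., None]

# Usage: a 4x4 ground-truth patch, with a segmenter output that is off by two pixels.
gt = np.zeros((12, 12), dtype=bool)
gt[3:7, 5:9] = True
initial = gt.copy()
initial[3, 5] = False          # one missed patch pixel
initial[0, 0] = True           # one spurious pixel
completed = complete_patch_mask(initial, patch_size=4, gamma=3)
masked = remove_patch(np.ones((12, 12, 3)), completed)   # input for the base object detector
```

Under these assumptions, whenever the initial mask is within Hamming distance `gamma` of the true patch mask, the true square is itself one of the accepted candidates, so the completed mask is guaranteed to cover the whole patch.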


FIG. 6 Pipeline of the SAC approach. SAC detects and removes adversarial patches on pixel level through patch segmentation and shape completion, and feeds the masked images into the base object detector for prediction.
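The "detect and remove" idea with shape completion can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: it assumes a single square patch of known side length, and the names `complete_square_mask`, `remove_patch`, `size`, and `gamma` are hypothetical. Every candidate square that the initial mask fills up to `gamma` missed pixels is added to the completed mask, so the true patch location is always included whenever the initial mask is within Hamming distance `gamma` of the ground truth.

```python
from typing import List

Mask = List[List[int]]  # 1 = pixel flagged as patch, 0 = background


def complete_square_mask(initial: Mask, size: int, gamma: int) -> Mask:
    """Union of all size x size windows that the initial mask nearly fills.

    Simplifying assumption (not from the chapter): the patch is a square
    of known side length `size`.
    """
    height, width = len(initial), len(initial[0])
    completed = [[0] * width for _ in range(height)]
    for top in range(height - size + 1):
        for left in range(width - size + 1):
            # Pixels of this candidate square that the segmenter missed.
            missed = sum(
                1 - initial[r][c]
                for r in range(top, top + size)
                for c in range(left, left + size)
            )
            if missed <= gamma:
                for r in range(top, top + size):
                    for c in range(left, left + size):
                        completed[r][c] = 1
    return completed


def remove_patch(image: List[List[float]], mask: Mask, fill: float = 0.0):
    """Blank out every masked pixel before the image reaches the detector."""
    return [
        [fill if m else px for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]
```

The exhaustive window scan is quadratic in the window size and is meant only to make the covering argument concrete; a practical version would use convolutions or integral images.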

The defense performance of SAC under adaptive attacks is shown in Table 5. SAC is robust across different patch sizes on both datasets (COCO and xView) and achieves the highest mAP among the compared defenses. In addition, SAC maintains the same clean performance as the undefended model. Figs. 7 and 8 show two examples of object detection results before and after the SAC defense. Adversarial patches create spurious detections and hide foreground objects. SAC masks out the adversarial patches and restores the model predictions.

6 Reverse engineering of deceptions via residual learning

Almost all of the defense methods described in the preceding sections attempt to defend DL models against adversarial attacks either by employing adversarial training or by determining whether or not an example is adversarial (adversarial detection). However, we wish not only to detect adversarial examples but also to identify the adversarial algorithm responsible for their generation. This is extremely useful for any security-critical system, which can then choose the most effective defense based on the adversarial algorithm that created the example. A straightforward method for recognizing adversarial algorithms would be to train a classifier on a large dataset containing different adversarial examples labeled with the corresponding adversarial algorithm. Souri et al. (2021) instead argue that a more nuanced approach is to search for the signature of each adversarial algorithm within the adversarial perturbation itself. They therefore suggest training a classifier on a dataset of adversarial perturbations, named residuals, labeled with the associated adversarial algorithm. To determine whether adversarial attack algorithms can be distinguished based on their perturbations as opposed to adversarial examples alone, both approaches were evaluated in Souri et al. (2021). The confusion matrices of the attack classifiers from the two approaches are depicted in Fig. 9A and B. Fig. 9A demonstrates that the network trained solely on adversarial examples cannot classify adversarial images with high accuracy. Fig. 9B, on the other hand, demonstrates the superior performance of the classifier when perturbations are concatenated with their corresponding adversarial examples.


TABLE 5 mAP (%) under adaptive attacks with different patch sizes.

Dataset  Method                               Clean  75 × 75     100 × 100   125 × 125
COCO     Undefended                           49.0   19.8 ± 1.0  14.4 ± 0.6   9.9 ± 0.5
COCO     AT (Madry et al., 2017)              40.2   23.5 ± 0.7  18.6 ± 0.8  13.9 ± 0.3
COCO     JPEG (Dziugaite et al., 2016)        45.6   22.8 ± 0.9  18.0 ± 0.8  13.4 ± 0.7
COCO     Spatial smoothing (Xu et al., 2017)  46.0   23.2 ± 0.7  17.5 ± 1.0  13.5 ± 0.6
COCO     LGS (Naseer et al., 2019)            42.7   20.8 ± 0.7  15.9 ± 0.5  12.2 ± 0.9
COCO     APM (Chiang et al., 2021)            47.6   19.4 ± 0.4  14.7 ± 0.4  10.8 ± 0.4
COCO     SAC (Ours)                           49.0   43.6 ± 0.9  44.0 ± 0.3  39.2 ± 0.7

Dataset  Method                               Clean  50 × 50     75 × 75     100 × 100
xView    Undefended                           27.2    8.4 ± 1.6   7.1 ± 0.4   5.3 ± 1.1
xView    AT (Madry et al., 2017)              22.2   12.1 ± 0.4   8.6 ± 0.1   7.2 ± 0.7
xView    JPEG (Dziugaite et al., 2016)        23.3   11.2 ± 0.3   9.5 ± 1.0   8.3 ± 0.3
xView    Spatial smoothing (Xu et al., 2017)  21.8   11.0 ± 0.7   7.9 ± 0.6   6.5 ± 0.2
xView    LGS (Naseer et al., 2019)            19.1    8.2 ± 0.8   6.5 ± 0.4   5.4 ± 0.5
xView    APM (Chiang et al., 2021)            27.1   20.5 ± 0.3  20.2 ± 1.4  18.9 ± 0.9
xView    SAC (Ours)                           27.2   25.3 ± 0.3  23.6 ± 1.2  23.2 ± 0.3

The best performance in each column (set in bold in the original) is attained by SAC, tied with the undefended model on clean images.

6.1 Adversarial perturbation estimation

To properly recognize the attack algorithm, it is necessary to recover the adversarial perturbations from adversarial samples. This is difficult because, as seen in Fig. 1, adversarial perturbations are extremely sparse and imperceptible. REDRL (reverse engineering of deceptions via residual learning) is an end-to-end framework for estimating adversarial perturbations and for detecting and classifying adversarial attack algorithms.

FIG. 7 Visualization of object detection results from the COCO dataset. Adversarial patches create spurious detections and make the detector ignore the ground-truth objects. SAC masks out the patch and restores model predictions. (A) Ground-truth on clean image. (B) Predictions on clean image. (C) Predictions on adversarial images with a 100 × 100 patch. (D) Predictions on SAC masked images.


FIG. 8 Visualization of object detection results from the xView dataset. Adversarial patches create spurious detections and make the detector ignore the ground-truth objects. SAC masks out the patch and restores model predictions. (A) Ground-truth on clean image. (B) Predictions on clean image. (C) Predictions on adversarial images with a 100 × 100 patch. (D) Predictions on SAC masked images.

Fig. 10 illustrates the REDRL pipeline, which consists of four major components: image reconstruction, image classification, feature reconstruction, and residual recognition. To train REDRL, the adversarial image $I_{adv} = I_c + \delta$ is fed to the network, where $I_c$ and $\delta$ are, respectively, the clean image and the adversarial perturbation. REDRL attempts to recover the clean image $I_c$ from the adversarial input $I_{adv}$. It is therefore trained to generate an output image $I_g$ that is as close as possible to the clean image. To achieve this, REDRL introduces the loss functions described in the following subsections.

FIG. 9 (A) Confusion matrix of a ResNet18 network trained on adversarial images for attack algorithm classification. (B) Confusion matrix of another ResNet18 network trained on the concatenation of adversarial images with their corresponding perturbations for the same task. For both experiments, adversarial examples are crafted using the CIFAR-10 dataset (Krizhevsky, 2012). The average accuracy in the first experiment is 57.55%, whereas in the second experiment it is 94.23% (Souri et al., 2021).


FIG. 10 REDRL pipeline. The network G reconstructs Iadv and generates Ig. Throughout REDRL training, the Ψ classification network will be trained to accurately classify the attack type using concatenation of residual image Ir with its respective Iadv as its input.

6.1.1 Image reconstruction

To generate $I_g$ as close as possible to $I_c$ in pixel space, REDRL uses the L1 distance, similar to Pix2Pix (Isola et al., 2017). Here, $G$ consists of a set of identical residual blocks without batch normalization that are preceded and followed by two convolutional layers with Leaky ReLU nonlinearities, following Ledig et al. (2017). The first component of the overall loss function is therefore:

$$L_R(G) = \mathbb{E}_{I_c,\delta}\left[\lVert I_c - G(I_c + \delta)\rVert_1\right] \tag{10}$$

where $\mathbb{E}$ is the expectation operator.
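As a rough illustration of Eq. (10), the expectation can be estimated by averaging the L1 distance over a batch of clean images and perturbations. The sketch below is not the REDRL code: images are flattened to plain lists for brevity, and `generator` stands in for $G$ as an arbitrary callable.

```python
from typing import Callable, List, Sequence

Image = List[float]  # a flattened image, for brevity


def l1_reconstruction_loss(
    generator: Callable[[Image], Image],
    clean_batch: Sequence[Image],
    delta_batch: Sequence[Image],
) -> float:
    """Batch estimate of E[ ||I_c - G(I_c + delta)||_1 ] from Eq. (10)."""
    total = 0.0
    for clean, delta in zip(clean_batch, delta_batch):
        adversarial = [c + d for c, d in zip(clean, delta)]  # I_adv = I_c + delta
        generated = generator(adversarial)                   # I_g = G(I_adv)
        total += sum(abs(c - g) for c, g in zip(clean, generated))
    return total / len(clean_batch)
```

With an identity generator the loss reduces to the mean L1 norm of the perturbations, which is the value a trained $G$ is expected to drive toward zero.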

6.1.2 Feature reconstruction

In addition to the preceding loss, it is essential that $I_g$ has features similar to those of $I_c$. Inspired by the work of Naseer et al. (2020), REDRL employs the third convolutional block of an ImageNet-trained VGG16 network (Simonyan and Zisserman, 2014b), denoted $F$, as a feature extractor and minimizes the L2 distance between the features of $I_g$ and $I_c$ using the following loss function:

$$L_F(G) = \mathbb{E}_{I_c,\delta}\left[\lVert F(I_c) - F(G(I_c + \delta))\rVert_2\right] \tag{11}$$

6.1.3 Image classification

REDRL further encourages its reconstructions to leave the output of the image classifier $\Phi$ unchanged. That is, when $I_g$ and $I_c$ are fed to the image classifier $\Phi$, they should receive identical classification scores. Following the knowledge distillation (KD) loss function (Chen et al., 2019), REDRL introduces the following cross-entropy loss function:

$$L_{IC}(G) = -\mathbb{E}_{I_c,\delta}\left[\log\left(\frac{e^{\Phi_i(G(I_c + \delta))}}{\sum_{j=1}^{C} e^{\Phi_j(G(I_c + \delta))}}\right)\right] \tag{12}$$

where $i = \arg\max \Phi(I_c)$ is the class that $\Phi$ predicts for the clean image, $\Phi_i$ is the $i$th logit in the output of $\Phi$, and $C$ is the total number of image classes. The parameters of $\Phi$ must be kept frozen during REDRL training.
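The image classification loss can be sketched as an ordinary cross-entropy whose target is the classifier's own prediction on the clean image. The code below is an illustrative batch estimate under that reading, operating directly on logits; the function names are hypothetical and the frozen classifier is assumed to have been applied beforehand.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax (subtract the max before exponentiating)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_norm for z in logits]


def image_classification_loss(
    clean_logits: Sequence[Sequence[float]],
    generated_logits: Sequence[Sequence[float]],
) -> float:
    """Batch estimate of Eq. (12).

    clean_logits[n]     = Phi(I_c) for sample n (Phi is frozen)
    generated_logits[n] = Phi(G(I_c + delta)) for the same sample
    """
    total = 0.0
    for phi_clean, phi_generated in zip(clean_logits, generated_logits):
        # Pseudo-label i = argmax Phi(I_c).
        target = max(range(len(phi_clean)), key=phi_clean.__getitem__)
        total -= log_softmax(phi_generated)[target]
    return total / len(clean_logits)
```

The loss is small when the classifier assigns the reconstruction the same top class as the clean image and grows as the two predictions diverge.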

6.1.4 Residual recognition

The final component of the loss function is the attack classification cross-entropy loss:

$$L_{AC}(G) = -\mathbb{E}_{I_c,\delta}\left[\log\left(\frac{e^{\Psi_i(I_r, I_c + \delta)}}{\sum_{j=1}^{A} e^{\Psi_j(I_r, I_c + \delta)}}\right)\right] \tag{13}$$

where A, i, and Ψj are the total number of considered attack types plus one (for clean data), the true class label corresponding to Ir, and jth logit of Ψ(Ir, Iadv), respectively.
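To make the input of $\Psi$ concrete, the sketch below builds it from an adversarial image and its reconstruction. It assumes, as the term "residual" suggests but the text does not state explicitly, that $I_r = I_{adv} - I_g$, and that "concatenation" means stacking along the channel axis. The class list follows the rows of Table 6; the function names are hypothetical.

```python
from typing import List

Channel = List[List[float]]  # H x W
Image = List[Channel]        # C x H x W


def residual(adversarial: Image, generated: Image) -> Image:
    """Assumed definition: I_r = I_adv - I_g, element by element."""
    return [
        [[a - g for a, g in zip(row_a, row_g)] for row_a, row_g in zip(ch_a, ch_g)]
        for ch_a, ch_g in zip(adversarial, generated)
    ]


def attack_classifier_input(adversarial: Image, generated: Image) -> Image:
    """Stack I_r and I_adv along the channel axis (C channels become 2C)."""
    return residual(adversarial, generated) + adversarial


# A = number of attack types plus one for clean data (six classes in Table 6).
ATTACK_CLASSES = ["Clean", "PGD", "DeepFool", "CWL2", "CWLinf", "Patch"]
```

For an RGB image this yields a six-channel tensor, so the first convolution of $\Psi$ would need six input channels under this assumption.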

6.1.5 End-to-end training

Overall, the REDRL end-to-end training objective is:

$$L_{total} = \min_G \left[L_{AC}(G) + \lambda_1 L_R(G) + \lambda_2 L_F(G) + \lambda_3 L_{IC}(G)\right] \tag{14}$$

where λ1, λ2, and λ3 are hyperparameters which should be empirically determined.
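The following sketch shows how the feature loss of Eq. (11) and the weighted sum of Eq. (14) might be evaluated on a batch. It is an illustration only: `extractor` and `generator` are arbitrary callables standing in for $F$ and $G$, and the default weights of 1.0 are placeholders, not values reported in the chapter.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def feature_loss(
    extractor: Callable[[Vector], Vector],
    generator: Callable[[Vector], Vector],
    clean_batch: Sequence[Vector],
    delta_batch: Sequence[Vector],
) -> float:
    """Batch estimate of E[ ||F(I_c) - F(G(I_c + delta))||_2 ] from Eq. (11)."""
    total = 0.0
    for clean, delta in zip(clean_batch, delta_batch):
        generated = generator([c + d for c, d in zip(clean, delta)])
        diff = [a - b for a, b in zip(extractor(clean), extractor(generated))]
        total += math.sqrt(sum(x * x for x in diff))
    return total / len(clean_batch)


def redrl_total_loss(
    attack_cls: float,
    reconstruction: float,
    feature: float,
    image_cls: float,
    lambda_1: float = 1.0,  # placeholder weight, to be tuned empirically
    lambda_2: float = 1.0,  # placeholder weight, to be tuned empirically
    lambda_3: float = 1.0,  # placeholder weight, to be tuned empirically
) -> float:
    """Weighted sum inside the minimization of Eq. (14)."""
    return (
        attack_cls
        + lambda_1 * reconstruction
        + lambda_2 * feature
        + lambda_3 * image_cls
    )
```

In an actual training loop this scalar would be minimized with respect to the parameters of $G$ (and $\Psi$ is trained alongside, per Fig. 10), while $\Phi$ and $F$ stay fixed.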

6.2 Experimental evaluation

Table 6 shows the results of the REDRL evaluation. In this experiment, REDRL is trained on a synthesized dataset that includes clean images as well as adversarial examples created with five different adversarial algorithms using ResNet50, ResNext50, DenseNet121, and VGG19 architectures, on both the CIFAR-10 and Tiny ImageNet (Wu et al., 2017) datasets. These evaluation results demonstrate that REDRL can effectively classify the attack types. Furthermore, REDRL outperforms attack recognition based on adversarial images alone by roughly 63% and 43% in relative terms on CIFAR-10 and Tiny ImageNet, respectively (total accuracy rises from 57.5% to 94.2% and from 59.4% to 85.5%), demonstrating the power of residuals in the estimation of sparse perturbations.


TABLE 6 Performance comparison (%) of attack algorithm type recognition based on adversarial images Iadv, concatenation of ground-truth adversarial perturbations δ with Iadv, and concatenation of estimated residuals Ir with Iadv, i.e., REDRL.

          CIFAR-10 (input to Ψ)        Tiny ImageNet (input to Ψ)
Class     Iadv   δ, Iadv   Ir, Iadv    Iadv   δ, Iadv   Ir, Iadv
Clean     12.0   100       100         62.5   99.9      99.7
PGD       73.5   99.9      99.9        88.7   99.7      99.9
DeepFool  56.2   99.9      97.4        53.2   64.0      75.3
CWL2      73.4   98.6      96.6        28.0   96.4      66.3
CWL∞      33.4   71.6      74.1        24.2   92.7      57.7
Patch     58.4   99.9      99.9        73.8   99.9      99.6
Total     57.5   94.2      94.2        59.4   95.7      85.5
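The improvement figures of 63% and 43% quoted in Section 6.2 are consistent with relative gains computed from the Total row of Table 6. That reading is an assumption on our part; the short check below reproduces the arithmetic.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# Total row of Table 6: input Iadv versus input (Ir, Iadv).
cifar10_gain = relative_gain(57.5, 94.2)
tiny_imagenet_gain = relative_gain(59.4, 85.5)

print(f"{cifar10_gain:.1f}% {tiny_imagenet_gain:.1f}%")  # prints "63.8% 43.9%"
```

Truncating these values to whole percentages gives the 63% and 43% stated in the text.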

Acknowledgments

This work was supported by the DARPA GARD Program under the contract HR001119S0026-GARD-FP-052.

References

Abdal, R., Qin, Y., Wonka, P., 2019. Image2StyleGAN: how to embed images into the StyleGAN latent space? In: The IEEE International Conference on Computer Vision (ICCV), October.
Arani, E., Sarfraz, F., Zonooz, B., 2020. Adversarial concurrent training: optimizing robustness and accuracy trade-off of deep neural networks. arXiv preprint:2008.07015.
Brown, T.B., Mane, D., Roy, A., Abadi, M., Gilmer, J., 2017. Adversarial patch. arXiv preprint:1712.09665.
Carlini, N., Wagner, D., 2016. Defensive distillation is not robust to adversarial examples. arXiv preprint:1607.04311.
Carlini, N., Wagner, D., 2017. Towards evaluating the robustness of neural networks. In: 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57.
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2017a. Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), 834–848.


Chen, L.-C., Papandreou, G., Schroff, F., Adam, H., 2017b. Rethinking atrous convolution for semantic image segmentation. arXiv preprint:1706.05587.
Chen, S.-T., Cornelius, C., Martin, J., Chau, D.H.P., 2018. Shapeshifter: robust physical adversarial attack on faster R-CNN object detector. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 52–68.
Chen, H., Wang, Y., Xu, C., Yang, Z., Liu, C., Shi, B., Xu, C., Xu, C., Tian, Q., 2019. Data-free learning of student networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October.
Chen, P.-C., Kung, B.-H., Chen, J.-C., 2021a. Class-aware robust adversarial training for object detection. arXiv preprint:2103.16148.
Chen, T., Zhang, Z., Liu, S., Chang, S., Wang, Z., 2021b. Robust overfitting may be mitigated by properly learned smoothening. In: International Conference on Learning Representations. https://openreview.net/forum?id=qZzy5urZw9.
Cheng, G., Han, J., Lu, X., 2017. Remote sensing image scene classification: benchmark and state of the art. Proc. IEEE 105 (10), 1865–1883.
Chiang, P.-y., Curry, M.J., Abdelkader, A., Kumar, A., Dickerson, J., Goldstein, T., 2020. Detection as regression: certified object detection by median smoothing. arXiv preprint:2007.03730.
Chiang, P.-H., Chan, C.-S., Wu, S.-H., 2021. Adversarial pixel masking: a defense against physical attacks for pre-trained object detectors. In: Proceedings of the 29th ACM International Conference on Multimedia, Association for Computing Machinery, New York, NY, pp. 1856–1865.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y., 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint:1406.1078.
Deng, J., Guo, J., Xue, N., Zafeiriou, S., 2019. Arcface: additive angular margin loss for deep face recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4690–4699.
Dong, Y., Liao, F., Pang, T., Su, H., Zhu, J., Hu, X., Li, J., 2018. Boosting adversarial attacks with momentum. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9185–9193.
Dziugaite, G.K., Ghahramani, Z., Roy, D.M., 2016. A study of the effect of JPG compression on adversarial images. arXiv preprint:1608.00853.
Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., 2019. Robustness (python library). https://github.com/MadryLab/robustness.
Fournier, A., Fussell, D., Carpenter, L., 1982. Computer rendering of stochastic models. Commun. ACM 25 (6), 371–384.
Furlanello, T., Lipton, Z., Tschannen, M., Itti, L., Anandkumar, A., 2018. Born again neural networks. In: International Conference on Machine Learning, pp. 1607–1616.
Girshick, R., Donahue, J., Darrell, T., Malik, J., 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587.
Gittings, T., Schneider, S., Collomosse, J., 2020. Vax-a-net: training-time defence against adversarial patch attacks. In: Proceedings of the Asian Conference on Computer Vision.
Goldblum, M., Fowl, L., Feizi, S., Goldstein, T., 2019. Adversarially robust distillation. arXiv preprint:1905.09747.
Goodfellow, I.J., Shlens, J., Szegedy, C., 2014. Explaining and harnessing adversarial examples. arXiv preprint:1412.6572.


Graves, A., Jaitly, N., 2014. Towards end-to-end speech recognition with recurrent neural networks. In: International Conference on Machine Learning, pp. 1764–1772.
Guo, Q., Wang, X., Wu, Y., Yu, Z., Liang, D., Hu, X., Luo, P., 2020. Online knowledge distillation via collaborative learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11020–11029.
Hayes, J., 2018. On visible adversarial perturbations and digital watermarking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1597–1604.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Hinton, G., Vinyals, O., Dean, J., 2015. Distilling the knowledge in a neural network. arXiv preprint:1503.02531.
Isola, P., Zhu, J., Zhou, T., Efros, A.A., 2017. Image-to-image translation with conditional adversarial networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5967–5976.
Ji, N., Feng, Y., Xie, H., Xiang, X., Liu, N., 2021. Adversarial YOLO: defense human detection patch attacks via detecting adversarial patches. arXiv preprint:2103.08860.
Kang, D., Sun, Y., Hendrycks, D., Brown, T., Steinhardt, J., 2019a. Testing robustness against unforeseen adversaries. arXiv preprint:1908.08016.
Kang, D., Sun, Y., Hendrycks, D., Brown, T., Steinhardt, J., 2019b. Testing robustness against unforeseen adversaries. arXiv preprint:1908.08016.
Kannan, H., Kurakin, A., Goodfellow, I.J., 2018. Adversarial logit pairing. ArXiv abs/1803.06373.
Karmon, D., Zoran, D., Goldberg, Y., 2018. Lavan: localized and visible adversarial noise. In: International Conference on Machine Learning, pp. 2507–2515.
Karras, T., Laine, S., Aila, T., 2019. A style-based generator architecture for generative adversarial networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June.
Krizhevsky, A., 2012. Learning Multiple Layers of Features from Tiny Images. University of Toronto (May).
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105.
Lang, D., Chen, D., Shi, R., He, Y., 2021. Attention-guided digital adversarial patches on visual detection. Security and Communication Networks 2021.
Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., et al., 2017. Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4681–4690.
Lee, M., Kolter, Z., 2019. On physical adversarial patches for object detection. arXiv preprint:1906.11897.
Levine, A., Feizi, S., 2020. (De) randomized smoothing for certifiable defense against patch attacks. arXiv preprint:2002.10733.
Li, Y., Bian, X., Chang, M.-C., Lyu, S., 2018. Exploring the vulnerability of single shot module in object detectors via imperceptible background patches. arXiv preprint:1809.05966.
Lin, W.-A., Lau, C.P., Levine, A., Chellappa, R., Feizi, S., 2020. Dual manifold adversarial robustness: defense against Lp and non-Lp adversarial attacks. In: Advances in Neural Information Processing Systems.


Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C., 2016. SSD: single shot multibox detector. In: European Conference on Computer Vision, pp. 21–37.
Liu, X., Yang, H., Liu, Z., Song, L., Li, H., Chen, Y., 2018. DPatch: an adversarial patch attack on object detectors. arXiv preprint:1806.02299.
Liu, J., Lau, C.P., Souri, H., Feizi, S., Chellappa, R., 2021a. Mutual adversarial training: learning together is better than going alone. arXiv preprint:2112.05005.
Liu, J., Levine, A., Lau, C.P., Chellappa, R., Feizi, S., 2021b. Segment and complete: defending object detectors against adversarial patch attacks with robust patch detection. arXiv preprint:2112.04532.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A., 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint:1706.06083.
Maini, P., Wong, E., Kolter, Z., 2020. Adversarial robustness against the union of multiple perturbation models. In: International Conference on Machine Learning, pp. 6640–6650.
McCoyd, M., Park, W., Chen, S., Shah, N., Roggenkemper, R., Hwang, M., Liu, J.X., Wagner, D., 2020. Minority reports defense: defending against adversarial patches. In: Zhou, J., Conti, M., Ahmed, C.M., Au, M.H., Batina, L., Li, Z., et al. (Eds.), Applied Cryptography and Network Security Workshops. Springer International Publishing, Cham, pp. 564–582.
Moosavi-Dezfooli, S.-M., Fawzi, A., Frossard, P., 2016. Deepfool: a simple and accurate method to fool deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2574–2582.
Naseer, M., Khan, S., Porikli, F., 2019. Local gradients smoothing: defense against localized adversarial attacks. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1300–1307.
Naseer, M., Khan, S., Hayat, M., Khan, F.S., Porikli, F., 2020, June. A self-supervised approach for adversarial robustness. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Newell, A., Yang, K., Deng, J., 2016. Stacked hourglass networks for human pose estimation. In: European Conference on Computer Vision, pp. 483–499.
Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z.B., Swami, A., 2016a. The limitations of deep learning in adversarial settings. In: 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 372–387.
Papernot, N., McDaniel, P., Wu, X., Jha, S., Swami, A., 2016b. Distillation as a defense to adversarial perturbations against deep neural networks. In: 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597.
Ranjan, R., Patel, V.M., Chellappa, R., 2017. Hyperface: a deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Trans. Pattern Anal. Mach. Intell. 41 (1), 121–135.
Rao, S., Stutz, D., Schiele, B., 2020. Adversarial training against location-optimized adversarial patches. arXiv preprint:2005.02313.
Redmon, J., Farhadi, A., 2017. YOLO9000: better, faster, stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271.
Redmon, J., Divvala, S., Girshick, R., Farhadi, A., 2016. You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
Ren, K., Zheng, T., Qin, Z., Liu, X., 2020. Adversarial attacks and defenses in deep learning. Engineering 6 (3), 346–360.
Saha, A., Subramanya, A., Patil, K., Pirsiavash, H., 2020. Role of spatial context in adversarial robustness for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 784–785.


Samangouei, P., Kabkab, M., Chellappa, R., 2018. Defense-GAN: protecting classifiers against adversarial attacks using generative models. In: International Conference on Learning Representations.
Sharif, M., Bhagavatula, S., Bauer, L., Reiter, M.K., 2016. Accessorize to a crime: real and stealthy attacks on state-of-the-art face recognition. In: CCS’16: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security.
Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al., 2016. Mastering the game of go with deep neural networks and tree search. Nature 529 (7587), 484–489.
Simonyan, K., Zisserman, A., 2014a. Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp. 568–576.
Simonyan, K., Zisserman, A., 2014b. Very deep convolutional networks for large-scale image recognition. arXiv preprint:1409.1556.
Song, G., Chai, W., 2018. Collaborative learning for deep neural networks. In: Advances in Neural Information Processing Systems, pp. 1832–1841.
Song, D., Eykholt, K., Evtimov, I., Fernandes, E., Li, B., Rahmati, A., Tramer, F., Prakash, A., Kohno, T., 2018. Physical adversarial examples for object detectors. In: 12th USENIX Workshop on Offensive Technologies (WOOT 18).
Souri, H., Khorramshahi, P., Lau, C.P., Goldblum, M., Chellappa, R., 2021. Identification of attack-specific signatures in adversarial examples. arXiv preprint:2110.06802.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R., 2013. Intriguing properties of neural networks. arXiv preprint:1312.6199.
Thys, S., Van Ranst, W., Goedeme, T., 2019. Fooling automated surveillance cameras: adversarial patches to attack person detection. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), June, IEEE Computer Society, Los Alamitos, CA, pp. 49–55. https://doi.ieeecomputersociety.org/10.1109/CVPRW.2019.00012.
Toshev, A., Szegedy, C., 2014. Deeppose: human pose estimation via deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1653–1660.
Tramèr, F., Boneh, D., 2019. Adversarial training and robustness for multiple perturbations. In: Advances in Neural Information Processing Systems, pp. 5866–5876.
Tramèr, F., Papernot, N., Goodfellow, I., Boneh, D., McDaniel, P., 2017. The space of transferable adversarial examples. arXiv preprint:1704.03453.
Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., Madry, A., 2019. Robustness may be at odds with accuracy. In: International Conference on Learning Representations. https://openreview.net/forum?id=SyxAb30cY7.
Vahab, A., Naik, M.S., Raikar, P.G., SR, P., 2019. Applications of object detection system. Int. Res. J. Eng. Technol. (IRJET) 6 (4), 4186–4192.
Vinyals, O., Toshev, A., Bengio, S., Erhan, D., 2015. Show and tell: a neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164.
Voulodimos, A., Doulamis, N., Doulamis, A., Protopapadakis, E., 2018. Deep learning for computer vision: a brief review. Comput. Intell. Neurosci. 2018.
Wang, L., Yoon, K.-J., 2020. Knowledge distillation and student-teacher learning for visual intelligence: a review and new outlooks. arXiv preprint:2004.05937.
Wang, L., Ouyang, W., Wang, X., Lu, H., 2015. Visual tracking with fully convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3119–3127.


Wang, Y., Zou, D., Yi, J., Bailey, J., Ma, X., Gu, Q., 2019. Improving adversarial robustness requires revisiting misclassified examples. In: International Conference on Learning Representations. https://openreview.net/forum?id=rklOg6EFwS.
Wang, Y., Lv, H., Kuang, X., Zhao, G., Tan, Y.-a., Zhang, Q., Hu, J., 2021. Towards a physical-world adversarial patch for blinding object detection models. Inform. Sci. 556, 459–471.
Wu, T., Tong, L., Vorobeychik, Y., 2019. Defending against physically realizable attacks on image classification. arXiv preprint:1909.09552.
Wu, Z., Lim, S.-N., Davis, L.S., Goldstein, T., 2020. Making an invisibility cloak: real world adversarial attacks on object detectors. In: European Conference on Computer Vision, pp. 1–17.
Xiang, C., Mittal, P., 2021. DetectorGuard: provably securing object detectors against localized patch hiding attacks. arXiv preprint:2102.02956.
Xiang, C., Bhagoji, A.N., Sehwag, V., Mittal, P., 2021. PatchGuard: a provably robust defense against adversarial patches via small receptive fields and masking. In: 30th USENIX Security Symposium (USENIX Security).
Xu, W., Evans, D., Qi, Y., 2017. Feature squeezing: detecting adversarial examples in deep neural networks. arXiv preprint:1704.01155.
Xu, H., Ma, Y., Liu, H., Deb, D., Liu, H., Tang, J., Jain, A.K., 2019. Adversarial attacks and defenses in images, graphs and text: a review. arXiv:1909.08072.
Chiang*, P.-y., Ni*, R., Abdelkader, A., Zhu, C., Studor, C., Goldstein, T., 2020. Certified defenses for adversarial patches. In: International Conference on Learning Representations. https://openreview.net/forum?id=HyeaSkrYPH.
Zhang, H., Wang, J., 2019. Towards adversarially robust object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 421–430.
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O., 2018a. The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595.
Zhang, Y., Xiang, T., Hospedales, T.M., Lu, H., 2018b. Deep mutual learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4320–4328.
Zhang, H., Yu, Y., Jiao, J., Xing, E., El Ghaoui, L., Jordan, M., 2019a. Theoretically principled trade-off between robustness and accuracy. In: International Conference on Machine Learning, pp. 7472–7482.
Zhang, H., Yu, Y., Jiao, J., Xing, E.P., Ghaoui, L.E., Jordan, M.I., 2019b. Theoretically principled trade-off between robustness and accuracy. In: Advances in Neural Information Processing Systems.
Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J., 2017. Pyramid scene parsing network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890.
Zhao, Y., Yan, H., Wei, X., 2020. Object hider: adversarial patch attack against object detectors. arXiv preprint:2010.14974.
Wu, J., Zhang, Q., Xu, G., 2017. Tiny ImageNet challenge.

Chapter 4

Deep metric learning for computer vision: A brief overview

Deen Dayal Mohan, Bhavin Jawade, Srirangaraj Setlur, and Venu Govindaraju∗
University at Buffalo, Buffalo, NY, United States
∗ Corresponding author: e-mail: [email protected]

Abstract

Objective functions that optimize deep neural networks play a vital role in creating an enhanced feature representation of the input data. Although cross-entropy-based loss formulations have been extensively used in a variety of supervised deep-learning applications, these methods tend to be less adequate when there is large intraclass variance and low interclass variance in input data distribution. Deep metric learning seeks to develop methods that aim to measure the similarity between data samples by learning a representation function that maps these data samples into a representative embedding space. It leverages carefully designed sampling strategies and loss functions that aid in optimizing the generation of a discriminative embedding space even for distributions having low interclass and high intraclass variances. In this chapter, we will provide an overview of recent progress in this area and discuss state-of-the-art deep metric learning approaches.

Keywords: Deep metric learning, Triplet loss, Image retrieval, Face verification, Person re-identification

1 Introduction

The field of metric learning is currently an active area of research. Traditionally, metric learning was used to create an optimal distance measure that accounts for the specific properties and distribution of the data points (for example, Mahalanobis distance). Subsequently, the methods evolved to focus on creating representations from data that are optimized for a given distance measure, such as Euclidean or cosine distance. Following the advent of deep learning, these feature representations have been learned end to end using complex nonlinear transformations. This has led the primary research in deep metric learning (DML) to focus on creating loss/objective functions that are used to train deep neural networks.

Handbook of Statistics, Vol. 48. https://doi.org/10.1016/bs.host.2023.01.003 Copyright © 2023 Elsevier B.V. All rights reserved.

One of the key areas that extensively uses DML approaches is computer vision. This is because most computer vision applications deal with scenarios in which there is a large variance in the visual features of data samples belonging to the same class. Additionally, samples belonging to different classes might share many visual features. Recent advances in convolutional neural networks (CNNs) have helped in creating good feature representations from images. When CNNs are used as feature extractors in a supervised learning setting, Softmax-based objective functions are often used. Although these loss formulations work well in many applications, they tend to suffer when modeling input data that has high intraclass and low interclass variances. These properties of data are present in a variety of real-world applications such as Face Recognition (Sankaran et al., 2019), Fingerprint Recognition (Jawade et al., 2021, 2022), Image Retrieval (Mohan et al., 2020), Person Re-Identification (Lee et al., 2022), and Cross-Modal Retrieval (Wei et al., 2020; Jawade et al., 2023). In such scenarios, DML-based loss formulations are used to create highly discriminable embedding spaces. These embedding spaces are designed so that the feature representations of samples belonging to the same class are clustered together and well separated from the clusters of other classes in the manifold.

In this chapter, we discuss popular DML formulations from the literature. We restrict our discussion to methods that have found applications in different computer vision tasks. We have organized the different loss formulations into three categories based on the type of overall objective formulation, as shown in Fig. 1.
The first category consists of pair-based formulations, which formulate the overall objectives based on direct pair-based interactions between samples in the dataset. The second is a group of methods that use a pseudo class representative known as a proxy to formulate the final optimization. Finally, we also discuss regularization methods that try to incorporate auxiliary information that aids in creating more optimal feature representations.

FIG. 1 An illustration describing various types of deep metric learning losses.

Deep metric learning for computer vision Chapter 4

2 Background

In this section, we will introduce the mathematical notations and assumptions that are commonly used in DML literature. Consider a dataset $X = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, consisting of a set of images and their corresponding class labels. Let $\phi$ be a function parameterized by $\theta_\phi$ that maps each image $x_i$ into an embedding space of $d$ dimensions, i.e.,

$$f_i = \phi(x_i; \theta_\phi), \quad \forall i \in \{1, \dots, n\} \tag{1}$$

where $f_i \in \mathbb{R}^d$ is also referred to as the feature representation of the image $x_i$. Typically, a standard CNN is employed as the feature extractor $\phi$ that produces these feature representations. The overall objective of the feature extractor $\phi$ is to project each image $x_i$ onto a highly separable embedding space, in which all the feature embeddings belonging to a particular class are close to each other and are well separated from the other classes, i.e.,

$$D(f_i, f_j) < D(f_i, f_k), \quad \forall i, j \in y_l;\ \forall k \notin y_l \tag{2}$$

where $D$ is a well-defined distance metric in the embedding space and $y_l$ indicates the class label associated with the images. For example, if the distance metric under consideration is Euclidean, Eq. (2) can be rewritten as

$$\|f_i - f_j\|_2 < \|f_i - f_k\|_2, \quad \forall i, j \in y_l;\ \forall k \notin y_l \tag{3}$$

Recently, many feature extractors constrain the final feature representation to have a unit norm. This constraint forces the embedding manifold to be a unit hypersphere in the $d$-dimensional embedding space (as shown in the figure). When the feature representations lie on a unit hypersphere, angular separation is used as the metric to measure the similarity and dissimilarity between them. Given two feature representations $f_i$ and $f_j$, one can compute the cosine similarity between the two representations, i.e.,

$$S = \frac{f_i^T f_j}{\|f_i\| \, \|f_j\|}, \quad i, j \in \{1, \dots, n\} \tag{4}$$

Since the magnitude of each feature representation is 1, the dot product of the two representations directly gives the cosine of the angle between the two feature vectors. The range of $S$ is between $-1$ and $1$, where $1$ represents a 0 degree angle of separation between the feature representations and $-1$ represents a 180 degree separation. Given a large dataset $X$, the objective in Eq. (2) is most often enforced in every mini-batch $B$. Throughout this chapter, we will assume the optimization of deep neural networks using mini-batches.
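The relationship between unit-norm embeddings, cosine similarity, and angular separation can be sketched in a few lines of NumPy. This is only an illustrative sketch: the helper names `l2_normalize` and `cosine_similarity_matrix` and the toy embedding values are assumptions made for this example and do not come from the chapter.

```python
import numpy as np

def l2_normalize(features, eps=1e-12):
    """Scale each row to unit Euclidean norm (eps guards against division by zero)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)

def cosine_similarity_matrix(features):
    """Return S with S[i, j] = cosine similarity of rows i and j, as in Eq. (4)."""
    unit = l2_normalize(features)
    return unit @ unit.T

# Toy embeddings (hypothetical values): row 1 points opposite to row 0,
# and row 2 is orthogonal to row 0.
features = np.array([[3.0, 4.0], [-3.0, -4.0], [4.0, -3.0]])
S = cosine_similarity_matrix(features)

# Clip before arccos so rounding error cannot push values outside [-1, 1].
angles_deg = np.degrees(np.arccos(np.clip(S, -1.0, 1.0)))
```

Here `S[0, 1]` is close to $-1$ (a 180 degree separation) and `S[0, 2]` is close to $0$ (a 90 degree separation), matching the range described above.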

3 Pair-based formulation

In this section, we will look at methods that rely on the sampling of informative pairs for better optimization. We will restrict our discussion to a few methods that are widely used in computer vision-related applications such as Face Recognition (Sankaran et al., 2019), Fingerprint Recognition (Jawade et al., 2021, 2022), Image Retrieval (Mohan et al., 2020), Person Re-Identification (Lee et al., 2022), and Cross-Modal Retrieval (Jawade et al., 2023).

3.1 Contrastive loss

As discussed in the previous section, the primary objective of a standard metric learning loss formulation is given by Eq. (2). One way to achieve this is to enforce the same constraint in the objective function while training a deep neural network. Given two feature representations $f_i$ and $f_j$ belonging to the same class, the objective is to reduce the distance between the representations. If the samples belong to different classes, then the objective is to increase the distance between the feature representations. This can be written mathematically as:

$$L_{con} = \begin{cases} \|f_i - f_j\|_2, & \text{if } y_i = y_j \\ \left[\alpha - \|f_i - f_j\|_2\right]_+, & \text{if } y_i \neq y_j \end{cases}$$

where $y_i$ and $y_j$ are the classes associated with $f_i$ and $f_j$, $[\cdot]_+$ denotes $\max(\cdot, 0)$, and $\alpha$ is the margin. We will discuss $\alpha$ in detail in the next section.
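As a rough illustration of the two cases of the contrastive loss, the following NumPy sketch evaluates the loss for a single pair. It assumes the plain (unsquared) Euclidean distance; the function name `contrastive_loss` and the example vectors are hypothetical.

```python
import numpy as np

def contrastive_loss(f_i, f_j, same_class, alpha=1.0):
    """Contrastive loss for one pair of embeddings.

    Assumption for this sketch: the plain (unsquared) Euclidean distance is used.
    """
    dist = np.linalg.norm(f_i - f_j)
    if same_class:
        # Same class: penalize any distance between the embeddings.
        return float(dist)
    # Different classes: penalize only pairs that are closer than the margin.
    return float(max(alpha - dist, 0.0))

f_i = np.array([0.0, 0.0])
f_j = np.array([3.0, 4.0])  # Euclidean distance to f_i is 5

loss_same = contrastive_loss(f_i, f_j, same_class=True)
loss_diff_far = contrastive_loss(f_i, f_j, same_class=False, alpha=1.0)
loss_diff_near = contrastive_loss(f_i, f_j, same_class=False, alpha=6.0)
```

A negative pair that is already farther apart than the margin (`loss_diff_far`) contributes nothing, while one inside the margin (`loss_diff_near`) is penalized by the amount of the violation.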

3.2 Triplet loss

Triplet loss (Schroff et al., 2015), an improvement over the contrastive loss formulation, is a widely used metric learning objective function for creating separable embedding spaces. Consider three feature representations $f_a$, $f_p$, and $f_n$ corresponding to three images in the dataset. Let $f_a$ and $f_p$ be the feature representations of two images belonging to the same class, which we denote as the Anchor and Positive samples, respectively. Let $f_n$, belonging to a different class in the dataset, be denoted as the Negative sample (Fig. 2). Triplet loss minimizes the distance between the feature embeddings of the Anchor and the Positive, while maximizing the distance between the Anchor and the Negative. When considering Euclidean distance, the triplet loss is defined as below:

$$L = \sum_{a, p, n \in N} \left[ \|f_a - f_p\|_2^2 - \|f_a - f_n\|_2^2 + \alpha \right]_+ \tag{5}$$

FIG. 2 Illustration of Triplet Loss. The Positive sample is attracted toward the Anchor, whereas the Negative sample is repelled from the Anchor. Here and in all diagrams in this chapter, the anchor is enclosed in a brown circle. Red arrows represent repulsion and green arrows represent attraction.

The terms $f_a$, $f_p$, $f_n$ correspond to feature embeddings for the anchor, positive, and negative samples, where $a$, $p$, $n$ are sampled from the training dataset $N$. $\alpha$ defines the margin enforced between the Anchor-Negative distance and the Anchor-Positive distance. $[\cdot]_+$ represents a $\max(\cdot, 0)$ function. This mathematical formulation can be thought of as an extension of the Contrastive Loss formulation, as it explicitly enforces Anchor-Positive similarity while repelling the Negative. Additionally, it is also interesting to perform gradient analysis to get a geometric picture of the directions in which these attractive and repulsive forces act. For this, we compute the derivatives of the loss (Eq. 5), for a triplet that violates the margin, with respect to these feature representations as follows:

$$\frac{\partial L}{\partial f_a} = 2(f_n - f_p), \qquad \frac{\partial L}{\partial f_p} = 2(f_p - f_a), \qquad \frac{\partial L}{\partial f_n} = 2(f_a - f_n) \tag{6}$$

The above equations define the vectors used for updating the embeddings as illustrated in Fig. 3. As seen in the figure, for this formulation during gradient descent, the negative sample experiences a force in the direction of $f_n - f_a$, which pushes it radially outward with respect to $f_a$, while the positive sample is pulled toward $f_a$ with a force of $f_a - f_p$. It is interesting to note that even though the Anchor is radially pushed away from the Negative, there is no such radially outward push experienced by the Positive. This is due to the fact that the triplet-based formulations do not explicitly enforce the positive–negative separation.

FIG. 3 Illustration of direction of gradient forces acting on Anchor, Positive, and Negative under a triplet formulation.

Another important aspect of the Triplet Loss formulation is the margin $\alpha$. One can think of $\alpha$ as the minimum separation that needs to be achieved between anchor-positive and anchor-negative distances. If the separation between the pairs is more than $\alpha$, the loss term goes to zero and the triplet contributes no gradient. Often $\alpha$ is treated as a hyperparameter, which is selected based on different factors such as dataset, network architecture, etc. Therefore, in order to create a highly separable embedding space, selecting informative triplets of samples is crucial. In this context, informative triplets refer to those triplets in the dataset whose pairwise distances violate the margin. The process of finding such informative samples to provide better optimization is often referred to as sample mining or sampling. A trivial way of identifying such informative samples is to compute the similarity of feature representations for the entire dataset using the current network model. One can easily create informative triplets with such an exhaustive offline process. Although straightforward, this process often becomes computationally infeasible as the size of the dataset increases. Alternatively, sample mining is often done in the mini-batch $B$ by computing similarities only with features of samples present in the mini-batch, thereby restricting the computational cost.

More recently, triplet-based formulations are used with unit-normed feature representations. In this case, the final loss formulation uses cosine similarity instead of Euclidean distance and can be written as:

$$L = \left[ f_a \cdot f_n - f_a \cdot f_p + \alpha \right]_+ \tag{7}$$
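The following NumPy sketch illustrates the triplet loss with squared Euclidean distances, the gradients of an active triplet, and a brute-force search for margin-violating (informative) triplets inside a mini-batch. It is a minimal sketch under stated assumptions: the function names, the toy embeddings, and the exhaustive triple loop are illustrative choices, not the mining procedure of any particular paper.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Hinge on squared Euclidean distances for a single triplet (Eq. 5)."""
    d_ap = np.sum((f_a - f_p) ** 2)
    d_an = np.sum((f_a - f_n) ** 2)
    return float(max(d_ap - d_an + alpha, 0.0))

def triplet_gradients(f_a, f_p, f_n):
    """Gradients of an active (margin-violating) triplet, as in Eq. (6)."""
    return 2 * (f_n - f_p), 2 * (f_p - f_a), 2 * (f_a - f_n)

def mine_violating_triplets(features, labels, alpha=0.2):
    """Brute-force in-batch mining: keep (a, p, n) index triplets with non-zero loss."""
    diff = features[:, None, :] - features[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)  # pairwise squared distances
    triplets = []
    n_samples = len(labels)
    for a in range(n_samples):
        for p in range(n_samples):
            if p == a or labels[p] != labels[a]:
                continue  # p must be a different sample of the anchor's class
            for n in range(n_samples):
                if labels[n] == labels[a]:
                    continue  # n must come from another class
                if sq_dist[a, p] - sq_dist[a, n] + alpha > 0:
                    triplets.append((a, p, n))
    return triplets

# Toy mini-batch (hypothetical values): sample 2 sits between the two class-0 samples.
features = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.0], [5.0, 5.0]])
labels = [0, 0, 1, 1]
informative = mine_violating_triplets(features, labels)

f_a, f_p, f_n = features[0], features[1], features[2]
g_a, g_p, g_n = triplet_gradients(f_a, f_p, f_n)
```

In this toy batch, the triplet `(0, 1, 2)` is informative because the negative lies closer to the anchor than the positive, whereas `(0, 1, 3)` already satisfies the margin and would contribute no gradient. A gradient descent step moves `f_n` along `-g_n`, that is, along $f_n - f_a$, away from the anchor.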

3.3 N-pair loss

When analyzing the formulation of Triplet Loss, one can note that during one update, only one negative sample belonging to an arbitrary negative class is chosen. This might be suboptimal, as the update focuses on separating the feature representations of the Anchor and Positive from just this Negative sample representation, whereas we would like the Anchor and Positive to be well separated from all the other negative classes. Although it is possible that, over a large number of mini-batches across multiple epochs, the Anchor and Positive get well separated from all the other classes, this is not guaranteed. As a result, the Triplet-based formulation most often leads to slower convergence. Additionally, due to the need for mining informative pairs to improve the optimization as mentioned in the last section, most of the randomly sampled triplets are not as useful after the initial phase of training. So, can there be an improved formulation that can help create a better separable feature space?

One of the potential ways to solve the slow convergence problem is to incorporate multiple negatives into the formulation in Eq. (5). Consider the set of feature representations $F_s = \{f_a^i, f_p^i, f_n^1, f_n^2, \dots, f_n^j\}$, where $f_a^i$ is a sample belonging to class $i$, $f_p^i$ is a positive sample belonging to the same class, and $f_n^1, f_n^2, \dots, f_n^j$ represent negative samples belonging to classes 1 through $j$ (excluding class $i$). Given this set of feature representations, Eq. (5) can be modified to

$$L = \log\left[ 1 + \sum_{k=1}^{j} \exp\left( f_a^i \cdot f_n^k - f_a^i \cdot f_p^i \right) \right] \tag{8}$$

One can note that in this formulation, if a single negative sample is considered, the loss term in effect reduces to a smooth (softplus) counterpart of Eq. (7) without the margin term (Fig. 4).

FIG. 4 N-Pair loss incorporates a larger number of negative samples compared to triplet loss.

Although the above formulation solves the problem of slow convergence, it is highly inefficient. If we consider $N$ to be the total number of classes in the dataset, for each update, a set of features of size $(N + 1)$ needs to be created, as mentioned above. Given a batch size $B$, this would require $B \times (N + 1)$ samples to update the parameters of the network in one gradient step. A more efficient method to achieve the same objective was proposed by Sohn (2016), in which the batch is constructed so as to reduce the computational cost. Each mini-batch is constructed in such a way that it consists of two samples from each class in the dataset. The new feature set $F_s$ can be written as $F_s = \{(f_a^1, f_p^1), (f_a^2, f_p^2), \dots, (f_a^N, f_p^N)\}$, which has a size of $2N$, where $N$ is the number of classes in the dataset. Given the feature set $F_s$, the final optimization objective can be formulated as follows:

$$L = \frac{1}{B} \sum_{i=1}^{B} \log\left[ 1 + \sum_{j \neq i} \exp\left( f_a^i \cdot f_p^j - f_a^i \cdot f_p^i \right) \right] \tag{9}$$

One can note that only $2N$ embeddings are used to create the necessary feature sets for the optimization, which is far more efficient than using $B \times (N + 1)$ embeddings. In practice, the number of classes in the dataset is often larger than the size of the mini-batch. In this scenario, a subset of classes $C < B$ is sampled randomly and used to construct the mini-batch.
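The batch construction above can be sketched in NumPy as follows. The sketch assumes that Eq. (9) contains an exponential inside the sum, as in the softmax-style formulation of Sohn (2016), and that row `i` of `anchors` and row `i` of `positives` hold the two samples of the same class; the function name `n_pair_loss` is illustrative.

```python
import numpy as np

def n_pair_loss(anchors, positives):
    """N-pair loss over a batch with one (anchor, positive) pair per class.

    Assumption for this sketch: anchors[i] and positives[i] share a class,
    and positives[j] with j != i acts as a negative for anchors[i].
    """
    logits = anchors @ positives.T                # logits[i, j] = f_a^i . f_p^j
    diff = logits - np.diag(logits)[:, None]      # f_a^i . f_p^j - f_a^i . f_p^i
    np.fill_diagonal(diff, -np.inf)               # drop the j == i term (exp(-inf) = 0)
    return float(np.mean(np.log1p(np.sum(np.exp(diff), axis=1))))

# Toy batch (hypothetical values): three classes with orthogonal unit embeddings.
identity_batch = np.eye(3)
loss_identity = n_pair_loss(identity_batch, identity_batch)
```

Because $\log(1 + \sum_{j \neq i} e^{l_{ij} - l_{ii}}) = -\log\left(e^{l_{ii}} / \sum_j e^{l_{ij}}\right)$, this loss coincides with a softmax cross-entropy over the anchor-positive similarity matrix, which is one way to see why only $2N$ embeddings are needed per update.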

3.4 Multi-Similarity loss

As discussed in the previous sections, one of the important aspects of pair-based metric learning is the mining of informative samples. Triplet loss uses the margin $\alpha$ as a measure to mine informative triplets, whereas, in the case of N-pair loss, a larger number of negative samples is used to create more informative pairs for optimization. But is there a better way to sample more informative pairs? In order to improve the sampling process, Wang et al. (2019) proposed analyzing the sample similarities. According to Wang et al. (2019), sample-level similarity can be divided into three different categories.

The first is self-similarity, which is restricted to the similarity between a pair of samples. For example, if the feature representations $f_i$ and $f_j$ of two samples in the dataset belonging to two different classes are highly similar, then this pair of samples becomes highly informative. Here, $f_i$ and $f_j$ are called hard negatives with respect to each other. These pairs are said to have high self-similarity and are often identified by using a margin $\alpha$ similar to the one employed by Triplet Loss.

The second type of similarity is called negative relative similarity. This measures the similarity of a pair of samples relative to other negative pairs. If the self-similarity of other negative pairs in the neighborhood of the sample pair is also high, then the negative relative similarity of the pair is low. The idea behind negative relative similarity is to estimate how unique the sample pair is when compared to other negative pairs in the neighborhood (as shown in Fig. 5).

FIG. 5 MS Loss: Multi-Similarity loss computes three different types of similarities (i) Left: Self-Similarity. (ii) Middle: Negative Relative Similarity. (iii) Right: Positive Relative Similarity. Anchor embeddings are enclosed in a brown circle. Panel labels: Self-Similarity (self-similarity increases), Similarity-N (relative similarity decreases), Similarity-P (relative similarity decreases).

So, if there are other negative pairs with equally high self-similarity, then the unique value that this particular sampled pair adds is marginal. This information is captured by the negative relative similarity value.

Similarly, we can define a positive relative similarity. Let Anchor $f_a$ and Positive $f_p$ be the feature representations of two samples belonging to the same class. The positive relative similarity measures whether other positive samples in the neighborhood of $f_p$ also have high self-similarity with $f_a$. In other words, it quantifies how similar a given pair is to other pairs constructed using the same Anchor and other positives of the same class. If they are highly similar, then the current pair under consideration is said to have a low positive relative similarity score.

Based on these three similarity types, Wang et al. (2019) devise a sampling strategy incorporating all three sample similarity measures. Consider a mini-batch of size $B$. Given the feature representation $f_i$ of each sample, the similarity between all the pairs of features is obtained as

$$S_{ij} = \frac{f_i^T f_j}{\|f_i\| \, \|f_j\|}, \quad \forall i, j \in [1, B] \tag{10}$$

where the $i$th row of the similarity matrix defines the similarity of the $i$th sample with all the other samples in the batch. In order to sample informative pairs, the negative pairs are first filtered using the positive relative similarity. For this, the positive pair with the lowest similarity value in the $i$th row is identified. Let us assume $S_{ik}$ corresponds to this value. The filtering is then done by keeping only the negative pairs whose similarity $S_{in}$ is greater than $S_{ik}$, up to a small margin $\epsilon$. Mathematically,

$$S_{in} > S_{ik} - \epsilon \tag{11}$$

Similarly, positive pairs are filtered by considering the negative pair with the most similarity to the $i$th sample. Let this similarity be given by $S_{ij}$. A positive pair with similarity $S_{ip}$ is kept if

$$S_{ip} < S_{ij} + \epsilon \tag{12}$$

where $S_{ij}$ refers to the highest negative similarity to the $i$th sample. Given the sampled informative pairs, the final formulation of Multi-Similarity loss aims to weight these pairs based on their importance. The final formulation of the loss is given as

$$L_{MS} = \frac{1}{B} \sum_{i=1}^{B} \left\{ \frac{1}{\alpha} \log\left[ 1 + \sum_{p \in \mathcal{P}_i} e^{-\alpha (S_{ip} - \lambda)} \right] + \frac{1}{\beta} \log\left[ 1 + \sum_{n \in \mathcal{N}_i} e^{\beta (S_{in} - \lambda)} \right] \right\} \tag{13}$$

where $\mathcal{P}_i$ and $\mathcal{N}_i$ denote the filtered positive and negative samples of the $i$th anchor. The first log term deals with the cosine similarity scores $S_{ip}$ for the filtered positive samples corresponding to the $i$th anchor. The second log term analogously deals with those of the filtered negative samples. $\alpha$ and $\beta$ are hyperparameters used to weight the similarity terms $S_{ip}$ and $S_{in}$, respectively. $\lambda$ acts as the margin term, similar to that of Triplet Loss.
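The mining rules of Eqs. (11) and (12) and the weighting of Eq. (13) can be put together in a short NumPy sketch. The default hyperparameter values, the function name `multi_similarity_loss`, and the choice to skip anchors that have no positive or no negative in the batch are assumptions made for illustration only.

```python
import numpy as np

def multi_similarity_loss(features, labels, alpha=2.0, beta=50.0, lam=0.5, eps=0.1):
    """Multi-Similarity loss with pair mining; hyperparameter defaults are illustrative."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    S = unit @ unit.T                                   # cosine similarities, Eq. (10)
    labels = np.asarray(labels)
    batch_size = len(labels)
    total = 0.0
    for i in range(batch_size):
        pos = labels == labels[i]
        pos[i] = False                                  # exclude the anchor itself
        neg = labels != labels[i]
        if not pos.any() or not neg.any():
            continue                                    # assumption: skip anchors lacking pairs
        # Eq. (11): keep negatives more similar than the hardest positive (minus eps).
        hard_neg = neg & (S[i] > S[i][pos].min() - eps)
        # Eq. (12): keep positives less similar than the hardest negative (plus eps).
        hard_pos = pos & (S[i] < S[i][neg].max() + eps)
        pos_term = np.log1p(np.sum(np.exp(-alpha * (S[i][hard_pos] - lam)))) / alpha
        neg_term = np.log1p(np.sum(np.exp(beta * (S[i][hard_neg] - lam)))) / beta
        total += pos_term + neg_term
    return total / batch_size

# Toy batch (hypothetical values): two tight, nearly orthogonal clusters.
features = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
loss_separated = multi_similarity_loss(features, [0, 0, 1, 1])  # labels match the clusters
loss_mixed = multi_similarity_loss(features, [0, 1, 0, 1])      # labels cut across the clusters
```

When the labels agree with the clusters, no pair survives the mining step and the loss is zero; when the labels cut across the clusters, both hard positives and hard negatives are selected and the loss is positive.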

4 Proxy-based methods

We have thus far discussed a common approach to DML which is based on optimizing a sample-to-sample similarity-based objective (such as triplet loss). These objectives are defined in terms of triplets of samples, where a triplet consists of an anchor sample, a positive sample (that is similar to the anchor), and a negative sample (that is dissimilar to the anchor). The objective function is then defined in terms of the distances between the anchor and positive samples, and between the anchor and negative samples. We have referred to these methods as pair-based formulations. Given a dataset $D$ with $n$ samples, the number of possible triplets with a matching sample and a nonmatching sample could be in the order of $O(n^3)$. During each optimization step of stochastic gradient descent, a mini-batch (with $B$ samples) would consist of only a subset of the total number of possible triplets (in the order of $O(B^3)$). Thus, to see all triplets during training, the complexity would be in the order of $O(n^3/B^3)$. This impacts the convergence rate, since it is highly dependent on how efficiently these triplets are used. This further leads to one of the main challenges in optimizing such a sample-to-sample similarity-based objective, which is the need to find informative triplets that are effective at driving the learning process. To address this challenge, a variety of tricks have been utilized, such as increasing the batch size, using hard or semihard triplet mining, utilizing online feature memory banks, and other techniques. These approaches aim to improve the quality of the triplets used to optimize the objective function, and can help to improve the performance of the DML model.

Proxy-based methods have been proposed to overcome this informative pair mining bottleneck of traditional pair-based methods. In particular, these proxy-based methods utilize a set of learnable shared embeddings that act as class representatives during training. Each data point in the training set is approximated to at least one of the proxies. Unlike pair-based methods, these class representative proxies do not vary with each batch and are shared across samples. Since these proxies learn from all samples in each batch, they eliminate the need for explicitly mining informative pairs. In the following sections, we will discuss three recent proxy-based loss formulations, namely (i) Proxy-NCA (Movshovitz-Attias et al., 2017) and Proxy-NCA++ (Teh et al., 2020), (ii) Proxy Anchor Loss (Kim et al., 2020), and (iii) ProxyGML Loss (Zhu et al., 2020).

4.1 Proxy-NCA and Proxy-NCA++

Proxy-NCA is motivated by the seminal work of Goldberger et al. (2004) on neighborhood component analysis (NCA). Let $p_{ij}$ be the assignment probability or neighborhood probability of $f_i$ to $f_j$, where $f_i$, $f_j$ are two data points. We can define this probability as:

$$p_{ij} = \frac{e^{-D_e(f_i, f_j)}}{\sum_{k \neq i} e^{-D_e(f_i, f_k)}} \tag{14}$$

where $D_e(f_i, f_k)$ is the squared Euclidean distance computed on some learned embeddings. The fundamental purpose of NCA is to increase the probability that points belonging to the same class are close to one another, while simultaneously reducing the probability that points in different classes are near each other. This is formulated as follows:

$$L_{NCA} = -\log\left( \frac{\sum_{j \in C_i} e^{-D_e(f_i, f_j)}}{\sum_{k \notin C_i} e^{-D_e(f_i, f_k)}} \right) \tag{15}$$

The primary challenge in directly optimizing the NCA objective is the computation cost, which increases polynomially as the number of samples in the dataset increases. Proxy-NCA attempts to address this computation bottleneck of NCA by introducing proxies. Proxies can be interpreted as class representatives or class prototypes. These proxies are implemented as learnable parameters that train along with the feature encoder. During training, instead of computing pairwise distances, whose number grows quadratically with the batch size and whose usefulness is highly dependent on the quality of pairs, Proxy-NCA computes the distance between the learnable set of class proxies and the feature representation of the respective samples in the batch. Based on this and derived from Eq. (15), Proxy-NCA loss (Movshovitz-Attias et al., 2017) can be formulated as:

$$L_{ProxyNCA} = -\log\left( \frac{e^{-D_e(\hat{f}_i, \hat{P}_i)}}{\sum_{k \notin C_i} e^{-D_e(\hat{f}_i, \hat{P}_k)}} \right) \tag{16}$$

where $P_i$ is the proxy vector corresponding to the data point $f_i$, $\hat{f}$ and $\hat{P}$ denote normalized embeddings, and $C_i$ is the set of all data points belonging to the same class as sample $f_i$, so that the sum in the denominator runs over the proxies of the other classes. The above loss function aims to maximize the distance between noncorresponding proxy and feature pairs and minimize the distance between corresponding proxy and feature vectors (Fig. 6).

FIG. 6 Illustration of Proxy-NCA loss: Unlike pairwise losses that require informative pair mining, Proxy-NCA injects learnable proxies denoted by star in the above diagram and pulls them toward the anchored sample embeddings.

As can be observed in Eq. (16), Proxy-NCA does not directly optimize the proxy assignment probability; rather, it optimizes for a suboptimal objective. Proxy-NCA++ (Teh et al., 2020) was proposed to overcome this issue. Proxy-NCA++ computes the proxy assignment probability explicitly, which gives the loss:

$$L_{ProxyNCA++} = -\log\left( \frac{e^{-D_e(\hat{f}_i, \hat{P}_i) \cdot \frac{1}{T}}}{\sum_{k \in A} e^{-D_e(\hat{f}_i, \hat{P}_k) \cdot \frac{1}{T}}} \right) \tag{17}$$

where $A$ is the set of all proxies. The subtle difference between the Proxy-NCA and Proxy-NCA++ formulations is the explicit computation of the proxy assignment probability by performing a softmax with respect to the distance from all proxies. Another important contribution of Proxy-NCA++ is the introduction of the temperature scaling parameter $T$. As $T$ gets larger, the output of the softmax function that is used to compute the proxy assignment probability tends toward a uniform distribution. With a smaller temperature parameter, the softmax function leads to a sharper distribution. A lower temperature increases the gap between the probabilities of different classes, thereby helping to produce more refined class boundaries.
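The effect of the temperature in Eq. (17) can be explored with a small NumPy sketch. It assumes one proxy per class (row `c` of `proxies` is the proxy of class `c`), integer class labels, and a simple mean over the batch; these choices and the function name `proxy_nca_pp_loss` are assumptions made for illustration.

```python
import numpy as np

def proxy_nca_pp_loss(features, labels, proxies, T=1.0):
    """Proxy-NCA++ style loss: temperature-scaled softmax over distances to all proxies.

    Assumptions for this sketch: proxies[c] is the single proxy of class c,
    and the per-sample losses are averaged over the batch.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = proxies / np.linalg.norm(proxies, axis=1, keepdims=True)
    sq_dist = np.sum((f[:, None, :] - p[None, :, :]) ** 2, axis=-1)  # shape (B, C)
    logits = -sq_dist / T
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilize the softmax
    log_prob = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
    labels = np.asarray(labels)
    return float(-np.mean(log_prob[np.arange(len(labels)), labels]))

# Toy setup (hypothetical values): two orthogonal proxies, samples lying on their proxies.
proxies = np.eye(2)
features = np.eye(2)
labels = [0, 1]
loss_sharp = proxy_nca_pp_loss(features, labels, proxies, T=0.1)
loss_unit = proxy_nca_pp_loss(features, labels, proxies, T=1.0)
loss_flat = proxy_nca_pp_loss(features, labels, proxies, T=10.0)
```

For correctly placed samples the loss shrinks as `T` decreases, since the assignment probability of the correct proxy becomes sharper, and it approaches $\log 2$ (a uniform distribution over the two proxies) as `T` grows.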

4.2 Proxy Anchor loss

The introduction of proxies in Proxy-NCA loss overcame the limitations imposed by the need for informative pair mining strategies. However, the objective in the Proxy-NCA formulation does not explicitly optimize for fine-grained sample-to-sample similarity, since it represents it only indirectly through a proxy-to-sample similarity. Proxy Anchor Loss (Kim et al., 2020) investigated this bottleneck and proposed a novel formulation that takes advantage of both pairwise and proxy-based formulations and provides a more explicit sample-to-sample similarity-based supervision along with proxy-based supervision. Similar to Proxy-NCA, Proxy Anchor defines proxies as learnable vectors in the embedding space. Unlike Proxy-NCA, the Proxy Anchor formulation represents each proxy as an anchor and associates it with the entire data in a batch. This allows the samples within the batch to interact with one another through their interaction with the proxy anchor (Fig. 7). Let $s(f, p)$ be the similarity between a sample $f$ and a proxy $p$. Similar to the Multi-Similarity loss formulation (Eq. 13), the proxy anchor loss is given as:

$$L = \frac{1}{|P^+|} \sum_{p \in P^+} \log\left( 1 + \sum_{f \in F_p^+} e^{-\alpha (s(f, p) - \delta)} \right) + \frac{1}{|P|} \sum_{p \in P} \log\left( 1 + \sum_{f \in F_p^-} e^{\alpha (s(f, p) + \delta)} \right) \tag{18}$$

where $\delta > 0$ is a margin, $\alpha > 0$ is a scaling factor, $P$ indicates the set of all proxies, $P^+$ denotes the set of positive proxies of data in the batch, and $F_p^+$ and $F_p^-$ are the sets of batch embeddings that are positive and negative with respect to proxy $p$.

FIG. 7 Illustration of Proxy Anchor (Kim et al., 2020) Loss: Unlike Proxy-NCA, Proxy Anchor (Kim et al., 2020) utilizes a proxy embedding as an anchor and pulls same class sample embeddings closer to it.

As can be noticed in Eq. (18), Proxy Anchor utilizes a Log-Sum-Exponent formulation to pull $p$ and its most dissimilar positive example together and to push $p$ and its most similar negative example apart. To understand how Proxy Anchor provides a more explicit sample-to-sample similarity-based supervision, let us compare the gradients of the loss functions for Proxy-NCA and Proxy Anchor. The gradient of the loss with respect to $s(f, p)$ for both Proxy-NCA and Proxy Anchor can be given as:

$$\frac{\partial L_{ProxyNCA}}{\partial s(f, p)} = \begin{cases} -1, & \text{if } p = p^+ \\ \dfrac{e^{s(f, p)}}{\sum_{p' \in P^-} e^{s(f, p')}}, & \text{otherwise} \end{cases} \tag{19}$$

$$\frac{\partial L_{ProxyAnchor}}{\partial s(f, p)} = \begin{cases} \dfrac{1}{|P^+|} \dfrac{-\alpha h_p^+(f)}{1 + \sum_{f' \in F_p^+} h_p^+(f')}, & \forall f \in F_p^+ \\ \dfrac{1}{|P|} \dfrac{\alpha h_p^-(f)}{1 + \sum_{f' \in F_p^-} h_p^-(f')}, & \forall f \in F_p^- \end{cases} \tag{20}$$

where $h_p^+(f) = e^{-\alpha (s(f, p) - \delta)}$ and $h_p^-(f) = e^{\alpha (s(f, p) + \delta)}$ are the positive and negative hardness metrics for embedding vector $f$ given proxy $p$, respectively. The derivation of the gradients in Eqs. (19) and (20) is omitted for brevity; please refer to Kim et al. (2020) for the details. Based on Eqs. (19) and (20), it can be observed that the gradient of the Proxy-Anchor loss function with respect to the similarity measure $s(f, p)$ depends not just on the feature vector $f$, but also on the other samples present in the batch. This effectively reflects the relative difficulty of the samples within the batch. This is a major advantage of the Proxy-Anchor loss over the Proxy-NCA loss (Eq. 19), which only considers a limited number of proxies when calculating the scale of the gradient for negative examples, and maintains a constant scale for positive examples. On the other hand, the Proxy-Anchor loss determines the scale of the gradient based on the relative difficulty of both positive and negative examples. Additionally, the inclusion of a margin in the loss formulation of the Proxy-Anchor loss leads to improved intraclass compactness and interclass separability.
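A compact NumPy sketch of the loss in Eq. (18) is given below. It assumes cosine similarity for $s(f, p)$, one proxy per class with integer labels, and illustrative values for $\alpha$ and $\delta$; the function name `proxy_anchor_loss` and the toy data are hypothetical.

```python
import numpy as np

def proxy_anchor_loss(features, labels, proxies, alpha=32.0, delta=0.1):
    """Proxy Anchor style loss; alpha and delta defaults are illustrative.

    Assumptions for this sketch: s(f, p) is cosine similarity, proxies[c] is the
    proxy of class c, and the batch is non-empty.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = proxies / np.linalg.norm(proxies, axis=1, keepdims=True)
    s = f @ p.T                                              # s[i, c] = s(f_i, p_c)
    labels = np.asarray(labels)
    num_classes = proxies.shape[0]
    pos_mask = labels[:, None] == np.arange(num_classes)[None, :]
    neg_mask = ~pos_mask
    h_pos = np.where(pos_mask, np.exp(-alpha * (s - delta)), 0.0)  # positive hardness
    h_neg = np.where(neg_mask, np.exp(alpha * (s + delta)), 0.0)   # negative hardness
    has_pos = pos_mask.any(axis=0)                           # proxies with a positive in the batch
    pos_term = np.sum(np.log1p(h_pos.sum(axis=0))[has_pos]) / has_pos.sum()
    neg_term = np.sum(np.log1p(h_neg.sum(axis=0))) / num_classes
    return float(pos_term + neg_term)

# Toy setup (hypothetical values): samples sit exactly on two orthogonal proxies.
proxies = np.eye(2)
features = np.eye(2)
loss_aligned = proxy_anchor_loss(features, [0, 1], proxies)  # each sample on its own proxy
loss_swapped = proxy_anchor_loss(features, [1, 0], proxies)  # each sample on the wrong proxy
```

Each proxy aggregates the hardness of all batch samples inside its log term, which is how the samples influence one another's gradient scale, as described around Eq. (20).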

4.3 ProxyGML Loss

We have discussed two proxy-based methods, namely Proxy-NCA and Proxy Anchor. Despite their differences, one key common aspect of both these methods is the use of a single global proxy per class as a representative prototype. In DML, we aim to match features from samples belonging to the same class that might be visually very distinct, while also distinguishing features from samples that belong to different classes yet are visually very similar. Based on this, a single global proxy as a representation for all samples in a class might not be the best choice for proxy-based optimization. Proxy-based deep Graph Metric Learning (ProxyGML) (Zhu et al., 2020) overcame this issue by introducing the notion of multiple trainable proxies for each class that could better represent the local intraclass variations (Fig. 8). Let a dataset $D$ consist of $C$ classes. ProxyGML assigns $M > 1$ trainable subproxies to each of the $C$ classes, so the total number of subproxies in the embedding space is $M \times C$. ProxyGML models the global and local relationships between samples and proxies in a graph-like fashion. Here, the directed similarity graph represents the global similarity between all proxies and the samples within the batch. For a sample $f_i$ and proxy $P_j$, the cosine similarity in the global similarity matrix can be defined as:

$$S^p_{ij} = f_i \cdot P_j \tag{21}$$

So the global similarity matrix $S^p$ will have a dimension of $B \times (M \times C)$, since $B$ is the total number of samples in the mini-batch and $M \times C$ is the total number of proxies. For the $i$th row in $S^p$, belonging to sample $f_i$, ProxyGML selects the $K$ most similar proxies, where $K$ is a hyperparameter. To select the $K$ most similar proxies, the method enforces that the $M$ proxies belonging to the positive class $P^+$ are explicitly selected. The remaining $K - M$ proxies are selected based on their similarity to $f_i$. This results in a new subsimilarity matrix (denoted by $S'$) of dimension $B \times K$.

FIG. 8 Illustration of ProxyGML loss (Zhu et al., 2020): Small stars represent the subproxies. Unlike Proxy-NCA (Movshovitz-Attias et al., 2017) and Proxy Anchor (Kim et al., 2020), ProxyGML (Zhu et al., 2020) uses multiple proxies per class. Proxies interact with samples and also among themselves in the final loss formulation. Proxies belonging to the same class come closer to one another and the samples belonging to that class, and move away from proxies and samples of other classes.

Next, a subproxy aggregation on $S'$ is performed by summing the cosine similarities of all proxies that belong to the same class. This is carried out for each sample in $S'$. Note that, for a sample $f_i$, if no proxy belonging to some class $c$ is present in $S'$, then the class is assigned a similarity of zero. This subproxy aggregation strategy results in a final similarity matrix $S$ of dimensions $B \times C$. For small values of $K$, the similarity matrix $S$ can be highly sparse with many zero entries. With a traditional softmax operation, each zero entry would still contribute $e^0 = 1$, leading to an inflated denominator. Keeping this in mind, a masked softmax operation is used, given by:

$$P_{ij} = \frac{M_{ij} \, e^{S_{ij}}}{\sum_{c=1}^{C} M_{ic} \, e^{S_{ic}}} \tag{22}$$

where $M$ denotes the zero mask computed as $M_{ij} = 0$ if $S_{ij} = 0$ and $M_{ij} = 1$ if $S_{ij} \neq 0$. Here, $P_{ij}$ denotes the softmaxed probability of the $j$th class for the $i$th sample. Finally, the cross entropy loss is computed over $P_{ij}$ as:

$$L_{CE} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{j=1}^{C} y_{ij} \log(P_{ij}) \tag{23}$$

where $y_{ij}$ encodes the ground truth label of the $i$th sample: it is 1 if class $j$ is the label of the $i$th sample and 0 otherwise. Since each class contains multiple subproxies which act as class-centers, a third hard constraint is imposed over the similarity between the proxies belonging to the same class. Let $S^P$ be the similarity between all $M \times C$ proxies, given as:

$$S^P_{ij} = P_i \cdot P_j \tag{24}$$

A proxy-to-proxy regularization constraint is imposed using Eq. (24) by computing the softmax probabilities over $S^P$ and then calculating a cross entropy loss as follows:

$$P^P_{ij} = \frac{e^{S^P_{ij}}}{\sum_{k} e^{S^P_{ik}}} \tag{25}$$

$$L_{reg} = -\frac{1}{M \times C} \sum_{i=1}^{M \times C} \sum_{j=1}^{C} y^p_{ij} \log(P^P_{ij}) \tag{26}$$

where $P^P$ represents the softmax probabilities of proxy-to-proxy similarities and $y^p$ represents the ground truth proxy to class label mappings. The final loss is computed as:

$$L = L_{CE} + \lambda \cdot L_{reg} \tag{27}$$

Here, λ denotes the weightage of the regularization term. This approach demonstrates that end-to-end training by minimizing the final objective L yields a more discriminative metric space.
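The sample-to-proxy branch of ProxyGML (top-$K$ proxy selection, subproxy aggregation, the masked softmax of Eq. 22, and the cross entropy of Eq. 23) can be sketched in NumPy as below. This is a simplified sketch under stated assumptions, not the reference implementation: positives are forced into the top-$K$ by ranking them first, the zero mask is named `mask` to avoid a clash with the subproxy count $M$, and the function names and toy proxies are hypothetical.

```python
import numpy as np

def proxygml_class_probs(features, labels, proxies, proxy_labels, num_classes, K):
    """Masked-softmax class probabilities from the K selected proxies of each sample."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = proxies / np.linalg.norm(proxies, axis=1, keepdims=True)
    S_p = f @ p.T                                        # (B, M*C) sample-to-proxy similarity
    labels = np.asarray(labels)
    proxy_labels = np.asarray(proxy_labels)
    is_positive = proxy_labels[None, :] == labels[:, None]
    # Rank positive proxies first so they are always among the K selected proxies.
    rank_score = np.where(is_positive, np.inf, S_p)
    top_k = np.argsort(-rank_score, axis=1)[:, :K]
    # Subproxy aggregation: sum the selected similarities per class.
    S = np.zeros((len(labels), num_classes))
    for i in range(len(labels)):
        for j in top_k[i]:
            S[i, proxy_labels[j]] += S_p[i, j]
    mask = (S != 0).astype(float)                        # zero mask of Eq. (22)
    weighted = mask * np.exp(S)
    return weighted / weighted.sum(axis=1, keepdims=True)

def proxygml_ce_loss(P, labels):
    """Cross entropy of Eq. (23) for integer class labels."""
    labels = np.asarray(labels)
    return float(-np.mean(np.log(P[np.arange(len(labels)), labels])))

# Toy setup (hypothetical values): C = 2 classes with M = 2 subproxies each.
proxies = np.array([[1.0, 0.0], [1.0, 0.2], [0.0, 1.0], [0.2, 1.0]])
proxy_labels = [0, 0, 1, 1]
features = np.array([[1.0, 0.1]])
labels = [0]
P_k3 = proxygml_class_probs(features, labels, proxies, proxy_labels, num_classes=2, K=3)
P_k2 = proxygml_class_probs(features, labels, proxies, proxy_labels, num_classes=2, K=2)
```

With `K = 2` only the two positive subproxies are selected, so class 1 receives a masked (zero) probability instead of the spurious $e^0$ mass that a plain softmax would give it. The proxy-to-proxy regularizer of Eqs. (24) to (26) is omitted from this sketch.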

5 Regularizations

In the previous sections, we explored pair-based and proxy-based loss formulations for creating enhanced feature representations. A majority of these methods have focused on creating new formulations based on sample–sample interaction or sample–proxy interactions. Is there any other information that one can leverage in order to create a more robust embedding space? In this section, we will look at two such sources of information, language and direction, and see how these can be integrated into existing DML formulations.

5.1 Language guidance

Large language models (LLMs) such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), etc. have been highly successful in modeling and representing content present in the form of natural language. Often these models are trained on a massive corpus of data, whereby they learn to encode proper semantic context and model semantic relationships correctly. On the other hand, training a model with just the DML objectives mentioned in the previous sections, using a dataset $X = \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}$ in the form of image–label pairs, might not encode all the semantic context associated with the data points. So, is there a way to leverage the semantic context in the representations of LLMs to improve the DML objective? Roth et al. (2022) proposed a method to incorporate the knowledge from representations of LLMs into a DML objective. Following the discussion in the Multi-Similarity loss formulation, a similarity matrix $S^I$ of the feature representations of the data samples is computed using Eq. (10). In order to incorporate the language-based regularization factor, prompts are generated from the class labels. Each class label $y_i$ in the batch is used to create a prompt $T_i$, which is represented by “A photo of ⟨$y_i$⟩.” Let the set of these prompts corresponding to the batch be $T$, so $T = \{T_1, T_2, \dots, T_C\}$, where $C$ is the size of the batch. These prompts are passed through a large language model $\Omega$ to get a lower dimensional representation. More formally:

$$f_i^{lang} = \Omega(T_i; \theta_\omega), \quad \forall i \in \{1, \dots, m\} \tag{28}$$

Here, $\Omega$ is a pretrained language model such as BERT or RoBERTa. Once these language representations $f_i^{lang}$ are obtained from the prompts, the similarity matrix $S^L$ is constructed in the same way as $S^I$ using Eq. (10). Given the two similarity matrices $S^I$ and $S^L$, a new distillation loss is proposed to incorporate knowledge from the language modality into the visual modality. The loss is formulated as follows:

$$L_{Lang} = \frac{1}{C}\sum_{i=1}^{C} \sigma(S_i^I)\,\log\left(\frac{\sigma(S_i^I)}{\sigma(S_i^L + \gamma_L)}\right) \tag{29}$$


where $\sigma(S_i^I)$ and $\sigma(S_i^L)$ denote the softmax along row $i$ of the similarity matrices $S^I$ and $S^L$, respectively. Here, $\gamma_L$ is a language constant set to 1. In essence, this distillation loss sums (and scales by $1/C$) the KL divergences between corresponding rows of $S^I$ and $S^L$ after each row has been converted into a probability distribution by the softmax function. The final loss combines any prior pair-based DML loss $L_{DML}$ (discussed in the previous section) with $L_{Lang}$, i.e.,

$$L = L_{DML} + \omega L_{Lang} \tag{30}$$

where ω is a hyperparameter that decides the influence of the language regularization term on the final metric learning objective.
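The language distillation term in Eq. (29) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation of Roth et al. (2022): Eq. (10) is not reproduced in this section, so the sketch assumes cosine similarity for both matrices, and the embeddings, the placeholder value for $L_{DML}$, and the value of $\omega$ are all hypothetical.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax with max-subtraction for numerical stability."""
    Z = S - S.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def cosine_similarity_matrix(F):
    """Pairwise cosine similarities; stands in for Eq. (10) (assumption)."""
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    return F @ F.T

def language_guidance_loss(f_img, f_lang, gamma_l=1.0):
    """Mean over rows of KL(softmax(S_I[i]) || softmax(S_L[i] + gamma_l)), Eq. (29).

    A constant shift does not change a row-wise softmax, so gamma_l is kept
    here only to mirror the notation of Eq. (29).
    """
    P = softmax_rows(cosine_similarity_matrix(f_img))
    Q = softmax_rows(cosine_similarity_matrix(f_lang) + gamma_l)
    return float(np.mean(np.sum(P * (np.log(P) - np.log(Q)), axis=1)))

rng = np.random.default_rng(0)
f_img = rng.normal(size=(4, 8))     # hypothetical image embeddings (batch of 4)
f_lang = rng.normal(size=(4, 16))   # hypothetical prompt embeddings from an LLM

loss_same = language_guidance_loss(f_img, f_img)    # identical similarity structure
loss_diff = language_guidance_loss(f_img, f_lang)   # mismatched similarity structure

# Eq. (30) with a placeholder DML loss of 0.5 and omega = 0.1 (both hypothetical)
total = 0.5 + 0.1 * loss_diff
```

The term vanishes when the visual and language similarity matrices induce the same row-wise distributions and grows as the two structures diverge, which is what lets it act as a regularizer on top of any pair-based DML loss.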

5.2 Direction regularization

The DML methods introduced so far rely on carefully designed sampling strategies and loss formulations to optimize a discriminable embedding space. While much work has gone into improving sampling and loss formulations, relatively little attention has been given to the relative interactions between pairs and to the forces exerted on these pairs that direct their displacement in the embedding space. To understand the need for optimal displacement, we revisit the triplet loss and the directions in which its forces act. During an update, the negative sample is pushed radially away from the anchor. Although this intuitively makes sense, it might be a suboptimal direction because other positive samples may lie in that direction. Note also that the positive sample in the triplet has no influence on the direction of the force acting on the negative sample. So is there a better direction in which the forces should act? Fig. 3 shows the direction in which the gradient forces act on each sample. In such a situation, we would additionally like the negative sample to move in the direction orthogonal to the class cluster center of $a$ and $p$, which we approximate as $f_c = \frac{f_a + f_p}{2}$, as shown in Fig. 9. Mathematically, we need to enforce the following constraint:

$$NC \perp PA \implies \frac{NC \cdot PA}{\|NC\|\,\|PA\|} = 0 \tag{31}$$

Analyzing this in more detail, it turns out that incorporating the cosine similarity between the anchor–positive and anchor–negative vectors helps achieve this objective. (We skip the proof for brevity; please refer to Mohan et al. (2020) for details.) For unit-normalized embeddings that satisfy Eq. (31), we have $f_n \cdot f_p = f_n \cdot f_a$, so this cosine similarity can be written as

$$\mathrm{Cos}(AN, AP) = \frac{1 - f_p \cdot f_a}{\|f_n - f_a\|\,\|f_p - f_a\|} \tag{32}$$


FIG. 9 Geometric illustration of the layout of the anchor, positive, and negative samples. The lines OA, OP, and ON represent the unit-normalized embedding vectors for the anchor ( fa), positive ( fp), and negative ( fn), respectively. C is the midpoint of PA and OC represents the average embedding vector fc (not unit-normalized).

We find that minimizing this cosine similarity term helps to satisfy the condition in Eq. (31). Since current metric learning losses do not explicitly enforce orthogonality of the negative sample with respect to the anchor–positive pair, incorporating this term as a regularizer helps create a much more robust embedding space. The following equations show how easily this regularization term can be adapted to any standard metric learning loss function. The modified triplet loss can be written as:

$$L_{apn} = \|f_a - f_p\|^2 - \|f_a - f_n\|^2 + \alpha - \gamma\,\mathrm{Cos}(f_n - f_a,\; f_p - f_a) \tag{33}$$

Similarly, the modified Multi-Similarity loss can be written as:

$$L = \frac{1}{B}\sum_{i=1}^{B}\left\{\frac{1}{\alpha}\log\left[1 + \sum_{p \in P_i} e^{-\alpha(S_{ip} - \lambda)}\right] + \frac{1}{\beta}\log\left[1 + \sum_{n \in N_i} e^{\beta\left(S_{in} - \lambda - \gamma\,\mathrm{Cos}(f_n - f_a,\; f_p - f_a)\right)}\right]\right\} \tag{34}$$

In the case of the Multi-Similarity loss, the hardest positive to the anchor is used in the regularization term. In the case of Proxy-NCA (Movshovitz-Attias et al., 2017), the direction of the forces is modified to push negative samples in the direction of their corresponding proxies. The loss is given by:

$$L = \sum_{a \in N} -\log\left(\frac{e^{-\|f_a - p(a)\|^2}}{\sum_{n} e^{-\left[\|f_a - p(n)\|^2 - \gamma\,\mathrm{Cos}(p(n) - f_a,\; p(a) - f_a)\right]}}\right) \tag{35}$$
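To make the direction regularizer concrete, the following NumPy sketch evaluates the modified triplet loss of Eq. (33) on a hand-built triplet and checks the closed form of Eq. (32). It is a minimal illustration rather than the implementation of Mohan et al. (2020); the margin `alpha`, the weight `gamma`, and the three example vectors are hypothetical values chosen so that the negative is equidistant from the anchor and the positive, which is exactly the orthogonality condition of Eq. (31).

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def cos_between(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def direction_regularized_triplet(f_a, f_p, f_n, alpha=0.2, gamma=0.1):
    """Eq. (33): ||fa-fp||^2 - ||fa-fn||^2 + alpha - gamma * Cos(fn-fa, fp-fa)."""
    d_ap = np.sum((f_a - f_p) ** 2)
    d_an = np.sum((f_a - f_n) ** 2)
    return float(d_ap - d_an + alpha - gamma * cos_between(f_n - f_a, f_p - f_a))

# Unit-normalized triplet in which f_n . f_a == f_n . f_p (hypothetical values)
f_a = unit(np.array([1.0, 0.5, 0.0]))
f_p = unit(np.array([1.0, -0.5, 0.0]))
f_n = np.array([0.0, 0.0, 1.0])

f_c = 0.5 * (f_a + f_p)                      # approximate cluster center
orthogonality = float((f_c - f_n) @ (f_a - f_p))   # NC . PA from Eq. (31)

cos_direct = cos_between(f_n - f_a, f_p - f_a)
cos_closed_form = (1.0 - f_p @ f_a) / (
    np.linalg.norm(f_n - f_a) * np.linalg.norm(f_p - f_a)
)                                            # right-hand side of Eq. (32)

loss = direction_regularized_triplet(f_a, f_p, f_n)
```

For this triplet the directly computed cosine and the closed form of Eq. (32) agree, which reflects that the closed form holds when the constraint of Eq. (31) is met.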


6 Conclusion

DML approaches provide objective functions that help create highly discriminable embedding spaces. In this chapter, we have provided a brief review of some of the important DML methods and described what distinguishes each approach from the others. Owing to recent advances in cross-modal retrieval methods such as CLIP (Radford et al., 2021), DML methods that primarily focused on a single modality are now being adapted to multimodal representation learning. Recent works, such as Wei et al. (2020) and Jawade et al. (2023), show the ability of these methods to create more robust feature representations for multimodal tasks such as text–image retrieval.

References

Devlin, J., Chang, M., Lee, K., Toutanova, K., 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. http://arxiv.org/abs/1810.04805.
Goldberger, J., Hinton, G.E., Roweis, S., Salakhutdinov, R.R., 2004. Neighbourhood components analysis. In: Saul, L., Weiss, Y., Bottou, L. (Eds.), Advances in Neural Information Processing Systems. vol. 17. MIT Press. https://proceedings.neurips.cc/paper/2004/file/42fe880812925e520249e808937738d2-Paper.pdf.
Jawade, B., Agarwal, A., Setlur, S., Ratha, N., 2021. Multi loss fusion for matching smartphone captured contactless finger images. In: 2021 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–6.
Jawade, B., Dayal, D., Setlur, S., Govindaraju, V., 2022. RidgeBase: a cross-sensor multi-finger contactless fingerprint dataset. In: 2022 International Joint Conference on Biometrics (IJCB).
Jawade, B., Dayal, D., Ali, N.M., Setlur, S., Govindaraju, V., 2023. NAPReg: nouns as proxies regularization for semantically aware cross-modal embeddings. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).
Kim, S., Kim, D., Cho, M., Kwak, S., 2020. Proxy anchor loss for deep metric learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3238–3247.
Lee, K.W., Jawade, B., Mohan, D., Setlur, S., Govindaraju, V., 2022. Attribute de-biased vision transformer (AD-ViT) for long-term person re-identification. In: 2022 18th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–8.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V., 2019. RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692. http://arxiv.org/abs/1907.11692.
Mohan, D.D., Sankaran, N., Fedorishin, D., Setlur, S., Govindaraju, V., 2020. Moving in the right direction: a regularization for deep metric learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June.
Movshovitz-Attias, Y., Toshev, A., Leung, T.K., Ioffe, S., Singh, S., 2017. No fuss distance metric learning using proxies. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 360–368.
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I., 2021. Learning transferable visual models from natural language supervision. CoRR abs/2103.00020. https://arxiv.org/abs/2103.00020.


Roth, K., Vinyals, O., Akata, Z., 2022. Integrating language guidance into vision-based deep metric learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16177–16189.
Sankaran, N., Mohan, D.D., Setlur, S., Govindaraju, V., Fedorishin, D., 2019. Representation learning through cross-modality supervision. In: 2019 14th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2019), pp. 1–8.
Schroff, F., Kalenichenko, D., Philbin, J., 2015. FaceNet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
Sohn, K., 2016. Improved deep metric learning with multi-class n-pair loss objective. In: Advances in Neural Information Processing Systems, vol. 29.
Teh, E.W., DeVries, T., Taylor, G.W., 2020. ProxyNCA++: revisiting and revitalizing proxy neighborhood component analysis. In: European Conference on Computer Vision, pp. 448–464.
Wang, X., Han, X., Huang, W., Dong, D., Scott, M.R., 2019. Multi-similarity loss with general pair weighting for deep metric learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5022–5030.
Wei, J., Xu, X., Yang, Y., Ji, Y., Wang, Z., Shen, H.T., 2020. Universal weighting metric learning for cross-modal matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13005–13014.
Zhu, Y., Yang, M., Deng, C., Liu, W., 2020. Fewer is more: a deep graph metric learning perspective using fewer proxies. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (Eds.), Advances in Neural Information Processing Systems. vol. 33. Curran Associates, Inc, pp. 17792–17803. https://proceedings.neurips.cc/paper/2020/file/ce016f59ecc2366a43e1c96a4774d167-Paper.pdf.


Chapter 5

Source distribution weighted multisource domain adaptation without access to source data

Sk Miraj Ahmed*, Dripta S. Raychaudhuri, Samet Oymak, and Amit K. Roy-Chowdhury
Department of Electrical and Computer Engineering, University of California Riverside, Riverside, CA, United States
*Corresponding author

Abstract

Unsupervised domain adaptation (UDA) aims to learn a predictive model for an unlabeled domain by transferring knowledge from a separate labeled source domain. Conventional UDA approaches make the strong assumption that the source data is accessible during training, which may be impractical because of privacy, security, and storage concerns. A recent line of work addressed this problem with an algorithm that transfers knowledge to the unlabeled target domain from a single source model without requiring access to the source data. However, when multiple trained source models are available, this method has to adapt each model individually to find the best source. A better question to ask is the following: can we find the optimal combination of source models, with no source data and no target labels, whose performance is no worse than that of the single best source? An answer is given by a recent efficient algorithm (Ahmed et al., 2021), which automatically combines the source models with suitable weights so that the combination performs at least as well as the best source model. That work provided intuitive theoretical insights to justify its claim, supported by extensive experiments on several benchmark datasets. In this chapter, we first review this work on multisource source-free unsupervised domain adaptation and then analyze a new algorithm that we propose by relaxing some of the assumptions of this prior work. More specifically, instead of naively assuming that the source data distribution is uniform, we estimate it with a multilayer perceptron (MLP) and use this information for effective aggregation of source models.

Keywords: Source free adaptation, Multi source adaptation, Unsupervised domain adaptation

Handbook of Statistics, Vol. 48. https://doi.org/10.1016/bs.host.2022.12.001 Copyright © 2023 Elsevier B.V. All rights reserved.


1 Introduction

Deep neural networks have achieved proficiency in a wide array of vision tasks (He et al., 2016; Kirillov et al., 2019; Long et al., 2015; Redmon et al., 2016); however, these models have consistently fallen short in adapting to visual distributional shifts (Luo et al., 2019). Human recognition, on the other hand, is robust to such shifts, for example when reading text in a new font or recognizing objects in unseen environments. Imparting such robustness toward distributional shifts to deep models is fundamental to applying these models in practical scenarios. Unsupervised domain adaptation (UDA) (Ben-David et al., 2010; Saenko et al., 2010) seeks to bridge the performance gap caused by domain shift by adapting the model on small amounts of unlabeled data from the target domain. The majority of current approaches (Ganin et al., 2016; Hoffman et al., 2018b) optimize a two-fold objective: (i) minimize the empirical risk on the source data and (ii) make the target and source features indistinguishable from each other. Minimizing the distribution divergence between domains by matching statistical moments of different orders has also been explored extensively (Peng et al., 2019; Sun et al., 2016). A shortcoming of all the above approaches is the transductive scenario in which they operate, i.e., the source data is required for adaptation. In a real-world setting, source data may not be available for a variety of reasons. Privacy and security are the primary concerns, since the data may contain sensitive information. Another crucial reason is storage: source datasets may contain videos or high-resolution images, and it might not be practical to transfer or store them on different platforms. Consequently, it is imperative to develop unsupervised adaptation approaches that can adapt the source models to the target domain without access to the source data (Fig. 1).

[Figure 1: two panels, "Standard UDA" and "Source data-free UDA". Each panel shows the source datasets $D_{S_1}, \ldots, D_{S_n}$ with their trained models $\theta_{S_1}, \ldots, \theta_{S_n}$, the unlabeled target dataset $D_T$, and the adapted target model $\theta_T$.]

FIG. 1 Problem setup. Standard unsupervised multisource domain adaptation (UDA) utilizes the source data, along with the models trained on the source, to perform adaptation on a target domain. In contrast, Ahmed et al. (2021) introduced a setting which adapts multiple models without requiring access to the source data.


Recent works (Li et al., 2020; Liang et al., 2020) attempt this by adapting a single source model to a target domain without accessing the source data. However, an underlying assumption of these methods is that the most correlated source model is provided by an oracle for adaptation. A more challenging and practical scenario entails adaptation from a bag of source models: each source domain is correlated with the target to a different degree, and adaptation involves not only incorporating the combined prior knowledge from multiple models but also preventing negative transfer. In this chapter, we first discuss the problem of unsupervised multisource adaptation without access to source data proposed by Ahmed et al. (2021), and then analyze a new algorithm that we design under a less restrictive assumption than that of Ahmed et al. (2021). The proposed algorithm is based on the principles of pseudo-labeling and information maximization, and we provide intuitive theoretical insights showing that this framework guarantees performance no worse than that of the best available source while minimizing the effect of negative transfer. To solve this problem of multiple source model adaptation without accessing the source data, we deploy the Information Maximization (IM) loss (Liang et al., 2020) on the weighted combination of target soft labels from all the source models. We also use a pseudo-labeling strategy inspired by the DeepCluster method (Caron et al., 2018), together with the IM loss, to reduce noisy cluster assignments of the features. Unlike Ahmed et al. (2021), which assigns a fixed weight to each source model irrespective of the target sample, the newly proposed optimization assigns different source weights to each target sample based on estimated source distributions. Since we do not have access to the source data, our framework uses multilayer perceptrons (MLPs) to estimate these distributions in order to minimize the empirical target risk. As in Ahmed et al. (2021), this framework jointly adapts the feature encoders of the sources and the corresponding source weights (generated by passing target samples through the MLPs), which are combined to obtain the target model.

1.1 Main contributions

We address the problem of multiple source UDA with no access to the source data. Toward solving the problem, we make the following contributions:

• We propose a novel UDA algorithm that operates without requiring access to the source data. We do so by training MLPs to estimate the source distributions, which help model the target distribution under a stronger, theoretically sound combination rule. We term the method source distribution (estimated by MLPs) weighted Data free multisourCE unsupervISed domain adaptatiON (DECISION-mlp). Our algorithm automatically identifies the optimal blend of source models (which is also a function of the target samples) to generate the target model by optimizing a carefully designed unsupervised loss.
• Under intuitive assumptions, we establish theoretical guarantees on the performance of the target model, showing that it is consistently at least as good as deploying the single best source model, thus minimizing negative transfer.
• We validate our claim with numerical experiments, demonstrating the practical benefits of our approach.

2 Related works

In this section, we present a brief overview of the literature on UDA in both the single-source and multisource scenarios, as well as the closely related setting of hypothesis transfer learning.

2.1 Unsupervised domain adaptation UDA methods have been used for a variety of tasks, including image classification (Tzeng et al., 2017), semantic segmentation (Paul et al., 2020), and object detection (Hsu et al., 2020). Besides the feature space adaptation methods based on the paradigms of moment matching (Peng et al., 2019; Sun et al., 2016) and adversarial learning (Ganin et al., 2016; Tzeng et al., 2017), recent works have explored pixel space adaptation via image translation (Hoffman et al., 2018b). All existing UDA methods require access to labeled source data, which may not be available in many applications.

2.2 Hypothesis transfer learning

Similar to our objective, hypothesis transfer learning (HTL) (Ahmed et al., 2020; Perrot and Habrard, 2015; Singh et al., 2018) aims to transfer learned source hypotheses to a target domain without access to source data. However, in contrast to our scenario, data in the target domain is assumed to be labeled, which limits the applicability of HTL to real-world settings. Recently, Li et al. (2020) and Liang et al. (2020) extended the standard HTL setting to unsupervised target data (U-HTL) by adapting single-source hypotheses via pseudo-labeling. This chapter takes this one step further by considering multiple source models, which may or may not be positively correlated with the target domain.

2.3 Multisource domain adaptation

Multisource domain adaptation (MSDA) extends the standard UDA setting by incorporating knowledge from multiple source models. Latent space transformation methods (Zhao et al., 2020) aim to align the features of different domains by optimizing a discrepancy measure or an adversarial loss. Discrepancy-based methods seek to align the domains by minimizing measures such as maximum mean discrepancy (Guo et al., 2018; Zhao et al., 2020) and Rényi divergence (Hoffman et al., 2018a). Adversarial methods aim to make features from multiple domains indistinguishable to a domain discriminator by optimizing a GAN loss (Xu et al., 2018), the $\mathcal{H}$-divergence (Zhao et al., 2018), or the Wasserstein distance (Li et al., 2018; Wang et al., 2019). Domain generative methods (Lin et al., 2020; Russo et al., 2019) use some form of domain translation, such as CycleGAN (Zhu et al., 2017), to perform adaptation at the pixel level. All these methods assume access to the source data during adaptation.

2.4 Source-free multisource UDA

This line of work is very recent; the problem was first proposed and addressed by Ahmed et al. (2021). The key idea of that paper, which we discuss in detail in the next section, is to transfer knowledge from multiple source models to an unlabeled target domain by optimizing suitable source weights. Several subsequent works also address this problem. Shen et al. (2022) show the benefits of selective pseudo-labeling on target domain data from theoretical and empirical perspectives, whereas Dong et al. (2021) design a confident-anchor-induced pseudo-label generator to mine confident pseudo-labels for the target domain in a multisource-free setting. Closely related to these works, we focus on a novel model aggregation scheme that is generic enough to be used on top of any existing method for improved performance. In this chapter, we only analyze the effect of this model aggregation scheme on the algorithm of Ahmed et al. (2021).

3 Problem setting

We address the problem of jointly adapting multiple models, trained on a variety of domains, to a new target domain for which only unannotated samples are available. In this work, we consider the adaptation of classification models with $K$ categories and input space $\mathcal{X}$. Formally, consider a set of source models $\{\theta_S^j\}_{j=1}^{n}$, where the $j$th model $\theta_S^j : \mathcal{X} \to \mathbb{R}^K$ is a classification model learned using the source dataset $D_S^j = \{x_{S_j}^i, y_{S_j}^i\}_{i=1}^{N_j}$ with $N_j$ data points, where $x_{S_j}^i$ and $y_{S_j}^i$ denote the $i$th source image and the corresponding label, respectively. Given an unlabeled target dataset $D_T = \{x_T^i\}_{i=1}^{N_T}$, the problem is to learn a classification model $\theta_T : \mathcal{X} \to \mathbb{R}^K$ using only the learned source models, without any access to the source datasets. Note that this differs from most multisource domain adaptation methods in the literature, which also use the source data while learning the target model $\theta_T$. Since our proposed framework is built upon the first work on source-free multisource UDA (DECISION) (Ahmed et al., 2021), we first discuss DECISION in detail and then move on to the analysis of the modification we made.


4 Practical motivation

To illustrate the practicality of this problem setting, consider a real application. Suppose we want to classify target images that belong to the sketches domain, and we have access to source models trained on domains such as Art, Clipart, Product, and Real. Images from the Art domain are visually and semantically more similar to sketches than images from the other sources, so intuitively most target images should obtain their highest adaptation performance from the Art model. In practice, however, the source images are not available, so there is no way to directly compute the similarities between source and target images that are crucial for weighting the sources given a target image. This is where the present problem setting comes into play: the source models are adapted to the target images without accessing the source data. During adaptation, each target image should weight the source models accordingly, based only on the source model parameters.

5 Overall framework of DECISION (Ahmed et al., 2021): A review

Each of the source models can be decomposed into two modules: the feature extractor $\phi_S^i : \mathcal{X} \to \mathbb{R}^{d_i}$ and the classifier $\psi_S^i : \mathbb{R}^{d_i} \to \mathbb{R}^K$. Here, $d_i$ refers to the feature dimension of the $i$th model, while $K$ refers to the number of categories. The overall aim is to estimate the target model $\theta_T$ (Fig. 2) by combining knowledge only from the given source models in a manner that automatically rejects poor source models, i.e., those that are irrelevant for the target domain.

FIG. 2 Overall framework of DECISION (Ahmed et al., 2021): Final classification layers of all the sources are frozen and the source feature encoders are jointly optimized along with their corresponding weights to get the target predictor by combining those.


At the core of this framework lies a model aggregation scheme (Hoffman et al., 2018a; Mansour et al., 2009), wherein the key idea is to learn a set of weights $\{\alpha_i\}_{i=1}^{n}$, one for each source model, such that $\alpha_k \ge 0$ and $\sum_{k=1}^{n} \alpha_k = 1$. These weights represent a probability mass function over the source domains, with a higher value implying higher transferability from that particular domain, and are used to combine the source hypotheses accordingly. However, unlike previous works, this work jointly adapts each individual model and simultaneously learns these weights using solely the unlabeled target instances. In what follows, we describe in detail the training strategy used to achieve this.

5.1 Weighted information maximization

Since one does not have access to labeled source or target data, the authors propose to fix the source classifiers $\{\psi_S^i\}_{i=1}^{n}$, since they contain the class distribution information of the source domains, and to adapt only the feature maps $\{\phi_S^i\}_{i=1}^{n}$ via the principle of information maximization (Bridle et al., 1992; Krause et al., 2010; Liang et al., 2020; Oymak and Gulcu, 2020). The motivation behind the adaptation process stems from the cluster assumption (Chapelle et al., 2009) in semisupervised learning, which hypothesizes that the decision boundaries of discriminative models should be located in regions of the input space that are not densely populated. To achieve this, the authors minimize a conditional entropy term (i.e., for a given input example) (Grandvalet and Bengio, 2005) as follows:

$$L_{ent} = -\mathbb{E}_{x_T \in D_T}\left[\sum_{j=1}^{K} \delta_j(\theta_T(x_T)) \log \delta_j(\theta_T(x_T))\right] \tag{1}$$

where $\theta_T(x_T) = \sum_{j=1}^{n} \alpha_j \theta_S^j(x_T)$, and $\delta(\cdot)$ denotes the softmax operation with $\delta_j(v) = \frac{e^{v_j}}{\sum_{i=1}^{K} e^{v_i}}$ for $v \in \mathbb{R}^K$. Intuitively, if a source $\theta_S^j$ has good transferability on the target and consequently a smaller value of the conditional entropy, optimizing the term (1) over $\{\theta_S^j, \alpha_j\}$ will result in a higher value of $\alpha_j$ than the rest of the weights. While entropy minimization effectively captures the cluster assumption when training with partial labels, in an unsupervised setting it may lead to degenerate solutions, such as always predicting a single class in an attempt to minimize the conditional entropy. To control such degenerate solutions, the authors incorporate the idea of class diversity: configurations in which class labels are assigned evenly across the dataset are preferred. A simple way to encode the preference toward class balance is to maximize the entropy of the empirical label distribution (Bridle et al., 1992) as follows:


$$L_{div} = -\sum_{j=1}^{K} \bar{p}_j \log \bar{p}_j \tag{2}$$

where $\bar{p} = \mathbb{E}_{x_T \in D_T}[\delta(\theta_T(x_T))]$. Combining the terms (1) and (2), we arrive at

$$L_{IM} = L_{div} - L_{ent} \tag{3}$$

which is the empirical estimate of the mutual information between the target data and the labels under the aggregate model $\theta_T$. Although maximizing this quantity makes the predictions on the target data more confident and globally diverse, it may sometimes still fail to prevent erroneous label assignments. Inspired by Liang et al. (2020), the authors propose a pseudo-labeling strategy in an effort to contain this mislabeling.
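The three quantities in Eqs. (1)–(3) can be computed directly from the source model outputs on a batch of target samples. The NumPy sketch below is a minimal illustration, not the DECISION code: the array layout `(n, N, K)` for the source logits, the helper names, and the small constant `eps` inside the logarithm are assumptions made for the example.

```python
import numpy as np

def softmax(V):
    """Row-wise softmax, delta(v) in the text."""
    Z = V - V.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def weighted_im_terms(source_logits, alpha, eps=1e-12):
    """Return (L_ent, L_div, L_IM) following Eqs. (1)-(3).

    source_logits: (n, N, K) array holding theta_S^j(x_T) for n sources,
                   N target samples, and K classes (layout is an assumption).
    alpha:         (n,) aggregation weights on the probability simplex.
    """
    logits = np.tensordot(alpha, source_logits, axes=1)   # (N, K): sum_j alpha_j theta_S^j(x_T)
    probs = softmax(logits)
    l_ent = float(-np.mean(np.sum(probs * np.log(probs + eps), axis=1)))  # Eq. (1)
    p_bar = probs.mean(axis=0)                                            # mean prediction
    l_div = float(-np.sum(p_bar * np.log(p_bar + eps)))                   # Eq. (2)
    return l_ent, l_div, l_div - l_ent                                    # Eq. (3)

alpha = np.array([0.7, 0.3])

# Confident predictions that are balanced across two classes
balanced = np.array([[20.0, -20.0], [20.0, -20.0], [-20.0, 20.0], [-20.0, 20.0]])
ent_b, div_b, im_b = weighted_im_terms(np.stack([balanced, balanced]), alpha)

# Degenerate solution: every sample is confidently assigned to class 0
collapsed = np.tile(np.array([20.0, -20.0]), (4, 1))
ent_c, div_c, im_c = weighted_im_terms(np.stack([collapsed, collapsed]), alpha)
```

Both configurations have near-zero conditional entropy, but only the balanced one attains a diversity term close to $\log 2$; this is why the diversity term is needed to rule out the collapsed solution.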

5.2 Weighted pseudo-labeling

As a result of domain shift, information maximization may cause some instances to be grouped with the wrong class cluster. These wrong predictions get reinforced over the course of training and lead to a phenomenon termed confirmation bias (Tarvainen and Valpola, 2017). To contain this effect, the authors adopt a self-supervised clustering strategy (Liang et al., 2020) inspired by the DeepCluster technique (Caron et al., 2018). First, the cluster centroids induced by each source model over the whole target dataset are computed as follows:

$$\mu_{k_j}^{(0)} = \frac{\sum_{x_T \in D_T} \delta_k(\hat{\theta}_S^j(x_T))\,\hat{\phi}_S^j(x_T)}{\sum_{x_T \in D_T} \delta_k(\hat{\theta}_S^j(x_T))} \tag{4}$$

where the cluster centroid of class $k$ obtained from source $j$ at iteration $i$ is denoted by $\mu_{k_j}^{(i)}$, and $\hat{\theta}_S^j = (\psi_S^j \circ \hat{\phi}_S^j)$ denotes the source model from the previous iteration. Assuming that the feature dimensions are the same across all sources, these source-specific centroids are combined in accordance with the current aggregation weights on each source model as follows:

$$\mu_k^{(0)} = \sum_{j=1}^{n} \alpha_j \mu_{k_j}^{(0)} \tag{5}$$

Next, the pseudo-label of each sample is computed by assigning it to its nearest cluster centroid in the feature space:

$$\hat{y}_T^{(0)} = \arg\min_{k} \|\hat{\phi}_T(x_T) - \mu_k^{(0)}\|_2^2 \tag{6}$$

where $\hat{\phi}_T(x_T) = \sum_{j=1}^{n} \alpha_j \hat{\phi}_S^j(x_T)$. This process is then repeated to obtain the updated centroids and pseudo-labels as follows:

$$\mu_{k_j}^{(1)} = \frac{\sum_{x_T \in D_T} \mathbb{1}\{\hat{y}_T^{(0)} = k\}\,\hat{\phi}_S^j(x_T)}{\sum_{x_T \in D_T} \mathbb{1}\{\hat{y}_T^{(0)} = k\}} \tag{7}$$

$$\mu_k^{(1)} = \sum_{j=1}^{n} \alpha_j \mu_{k_j}^{(1)} \tag{8}$$

$$\hat{y}_T^{(1)} = \arg\min_{k} \|\hat{\phi}_T(x_T) - \mu_k^{(1)}\|_2^2 \tag{9}$$

where $\mathbb{1}\{\cdot\}$ is an indicator function that takes the value 1 when its argument is true. While this alternating process of computing cluster centroids and pseudo-labels can be repeated multiple times to obtain stationary pseudo-labels, one round is sufficient for all practical purposes. One can then obtain the cross-entropy loss with respect to these pseudo-labels as follows:

$$L_{pl}(Q_T, \theta_T) = -\mathbb{E}_{x_T \in D_T} \sum_{k=1}^{K} \mathbb{1}\{\hat{y}_T = k\} \log \delta_k(\theta_T(x_T)). \tag{10}$$

Note that the pseudo-labels are updated regularly after a certain number of iterations, as discussed in Section 9.
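The alternating centroid and label computation of Eqs. (4)–(9) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions rather than the authors' code: the array layouts, the function names, and the guard against empty clusters are choices made for the example, and the toy features and predictions are hypothetical.

```python
import numpy as np

def nearest(f, mu):
    """Index of the closest centroid (squared Euclidean distance) for each row of f."""
    d2 = ((f[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def weighted_pseudo_labels(feats, probs, alpha):
    """One refinement round of weighted pseudo-labeling.

    feats: (n, N, d) features phi_S^j(x_T) from n sources (same d for all sources)
    probs: (n, N, K) softmax outputs delta(theta_S^j(x_T))
    alpha: (n,) aggregation weights
    """
    # Eq. (4): soft, per-source centroids, shape (n, K, d)
    mu0 = np.einsum('jnk,jnd->jkd', probs, feats) / probs.sum(axis=1)[:, :, None]
    mu0 = np.tensordot(alpha, mu0, axes=1)        # Eq. (5): (K, d)
    f_t = np.tensordot(alpha, feats, axes=1)      # aggregated target features, (N, d)
    y0 = nearest(f_t, mu0)                        # Eq. (6)

    # Eq. (7): hard centroids from the first-round labels
    onehot = np.eye(probs.shape[2])[y0]           # (N, K)
    counts = np.maximum(onehot.sum(axis=0), 1.0)  # avoid division by zero (assumption)
    mu1 = np.einsum('nk,jnd->jkd', onehot, feats) / counts[None, :, None]
    mu1 = np.tensordot(alpha, mu1, axes=1)        # Eq. (8)
    return nearest(f_t, mu1)                      # Eq. (9)

# Two well-separated groups of target features; the third sample is mispredicted
x = np.array([[0., 0.], [1., 0.], [0., 1.], [10., 10.], [11., 10.], [10., 11.]])
p = np.array([[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.1, 0.9], [0.2, 0.8], [0.3, 0.7]])
labels = weighted_pseudo_labels(np.stack([x, x]), np.stack([p, p]), np.array([0.5, 0.5]))
```

In this toy example the third sample, whose soft prediction favors the wrong class, is reassigned to class 0 because it lies closest to that class centroid in feature space.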

5.3 Optimization

In summary, given $n$ source hypotheses $\{\theta_S^j\}_{j=1}^{n} = \{\psi_S^j \circ \phi_S^j\}_{j=1}^{n}$ and target data $D_T = \{x_T^i\}_{i=1}^{N_T}$, the classifier of each source is fixed, and the parameters of $\{\phi_S^j\}_{j=1}^{n}$ and the aggregation weights $\{\alpha_j\}_{j=1}^{n}$ are optimized jointly in an end-to-end manner. The final objective is given by

$$L_{tot} = L_{ent} - L_{div} + \lambda L_{pl} \tag{11}$$

The above objective is used to solve the following optimization problem:

$$\begin{aligned} \underset{\{\phi_S^j\}_{j=1}^{n},\, \{\alpha_j\}_{j=1}^{n}}{\text{minimize}} \quad & L_{tot} \\ \text{subject to} \quad & \alpha_j \ge 0, \;\; \forall j \in \{1, 2, \ldots, n\}, \\ & \sum_{j=1}^{n} \alpha_j = 1 \end{aligned} \tag{12}$$

After obtaining the optimal $\phi_S^{j*}$ and $\alpha_j^*$, the optimal target hypothesis is computed as $\theta_T = \sum_{j=1}^{n} \alpha_j^* (\psi_S^j \circ \phi_S^{j*})$. To solve the optimization problem (12), the steps of Algorithm 1 are followed, as stated below.


ALGORITHM 1 Overview of DECISION.
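One simple way to keep the aggregation weights feasible for the constraints in (12) during gradient-based training is to parameterize them through a softmax over unconstrained parameters. The sketch below illustrates that idea together with the aggregated target hypothesis and the objective of Eq. (11). The softmax parameterization, the function names, and the numbers are assumptions made for illustration; the text above does not specify how the constraints are enforced.

```python
import numpy as np

def simplex_weights(w):
    """Map unconstrained parameters w to alpha with alpha_j >= 0 and sum_j alpha_j = 1."""
    e = np.exp(w - np.max(w))
    return e / e.sum()

def target_logits(source_logits, w):
    """theta_T(x) = sum_j alpha_j (psi_S^j o phi_S^j)(x) for logits of shape (n, N, K)."""
    return np.tensordot(simplex_weights(w), source_logits, axes=1)

def total_loss(l_ent, l_div, l_pl, lam):
    """Eq. (11): L_tot = L_ent - L_div + lambda * L_pl."""
    return l_ent - l_div + lam * l_pl

w = np.array([2.0, 0.0, -1.0])                  # hypothetical unconstrained parameters
alpha = simplex_weights(w)                      # feasible for the constraints in (12)
logits = target_logits(np.ones((3, 5, 4)), w)   # 3 sources, 5 samples, 4 classes
l_tot = total_loss(0.2, 0.6, 0.5, lam=0.3)      # placeholder loss values
```

Because the softmax output always lies on the probability simplex, any unconstrained update of `w` yields valid weights, so no explicit projection step is needed in this sketch.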

6 Theoretical insights

6.1 Theoretical motivation behind DECISION

DECISION aims to find the optimal weights $\{\alpha_j\}_{j=1}^{n}$ for each source and takes a convex combination of the source predictors to obtain the target predictor. Ahmed et al. (2021) show that, under intuitive assumptions on the source and target distributions, there exists a simple choice of target predictor that performs at least as well as the best source model applied directly to the target data. Following the notation in Ahmed et al. (2021), $L$ is the loss function that maps a pair consisting of the model-predicted label and the ground-truth label to a scalar. The $k$th source distribution is denoted by $Q_S^k$, and the expected loss over it for a predictor $\theta$ is $\mathcal{L}(Q_S^k, \theta) = \mathbb{E}_x[L(\theta(x), y)] = \int_x L(\theta(x), y)\, Q_S^k(x)\, dx$. The optimal source predictor is given by $\theta_S^k = \arg\min_{\theta} \mathcal{L}(Q_S^k, \theta)$ for all $1 \le k \le n$. Based on the assumption in Ahmed et al. (2021), the target distribution is in the span of the source distributions, which can be formalized by expressing the target distribution as an affine combination of source distributions, i.e., $Q_T(x) = \sum_{k=1}^{n} \lambda_k Q_S^k(x)$ with $\lambda_k \ge 0$ and $\sum_{k=1}^{n} \lambda_k = 1$. Under this assumption, the target predictor can be expressed as $\theta_T(x) = \sum_{k=1}^{n} \frac{\lambda_k Q_S^k(x)}{\sum_{j=1}^{n} \lambda_j Q_S^j(x)}\, \theta_S^k(x)$, and the theoretical claim is then established as stated in Lemma 1.

Source distribution weighted multisource domain adaptation Chapter 5

Lemma 1. Assume that the loss $L(\theta(x), y)$ is convex in its first argument and that there exists a $\lambda \in \mathbb{R}^n$, where $\lambda \geq 0$ and $\lambda^{\top} \mathbf{1} = 1$, such that the target distribution is exactly equal to the mixture of source distributions, i.e., $Q_T = \sum_{i=1}^{n} \lambda_i Q_S^i$. Set the target predictor as the following convex combination of the optimal source predictors:

$$\theta_T(x) = \sum_{k=1}^{n} \frac{\lambda_k Q_S^k(x)}{\sum_{j=1}^{n} \lambda_j Q_S^j(x)}\, \theta_S^k(x).$$

Recall the pseudo-labeling loss (10). Then, for this target predictor, over the target distribution, the unsupervised loss induced by the pseudo-labels and the supervised loss are both less than or equal to the loss induced by the best source predictor. In particular,

$$\mathcal{L}(Q_T, \theta_T) \leq \min_{1 \leq j \leq n} \mathcal{L}(Q_T, \theta_S^j).$$

Let $\beta = \arg\min_{1 \leq j \leq n} \mathcal{L}(Q_T, \theta_S^j)$. Additionally, this inequality is strict if the entries of $\lambda$ are strictly positive and there exists a source $i$ for which the strict inequality $\mathcal{L}(Q_S^i, \theta_S^i) < \mathcal{L}(Q_S^i, \theta_S^{\beta})$ holds.

Before going into the proof, we discuss how this lemma translates to DECISION and how it motivated us to analyze DECISION-mlp. DECISION differs from the predictor in Lemma 1 in two ways. First, unlike Lemma 1, DECISION fine-tunes the feature extractor of each source model. Second, DECISION assigns each source a single weight that does not depend on the input, whereas Lemma 1 uses different weights per input instance. The intuitive justification for choosing this input-agnostic weighting strategy is provided below.

Since the source distributions are unknown (due to the unavailability of source data), a good choice is to consider the least informative of all distributions, i.e., the uniform distribution for the sources, by the Principle of Maximum Entropy (Jaynes, 1957). This uniformity is assumed over the target support set $\mathcal{X}$. In what follows, the algorithm considers the restrictions of the source distributions to the target support $\mathcal{X}$. Mathematically, the assumption is $Q_S^k(x) = c_k\, \mathcal{U}(x)$ when restricted to the support set $x \in \mathcal{X}$, where $c_k$ is a scaling factor which captures the relative contribution of source $k$ and $\mathcal{U}(x)$ has value 1. Plugging this value of the distribution into the combination rule in Lemma 1, we get $\theta_T(x) = \sum_{k=1}^{n} \frac{\lambda_k c_k}{\sum_{j=1}^{n} \lambda_j c_j}\, \theta_S^k(x)$ by the following steps:

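$$\begin{aligned} \theta_T(x) &= \sum_{k=1}^{n} \frac{\lambda_k Q_S^k(x)}{\sum_{j=1}^{n} \lambda_j Q_S^j(x)}\, \theta_S^k(x) \\ &= \sum_{k=1}^{n} \frac{\lambda_k c_k\, \mathcal{U}(x)}{\sum_{j=1}^{n} \lambda_j c_j\, \mathcal{U}(x)}\, \theta_S^k(x) \\ &= \sum_{k=1}^{n} \frac{\lambda_k c_k}{\sum_{j=1}^{n} \lambda_j c_j}\, \theta_S^k(x). \end{aligned} \tag{13}$$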

The term $\lambda_k c_k / \sum_{j=1}^{n} \lambda_j c_j$ essentially becomes the weighting term $\alpha_k$ in DECISION. The next step is to substitute this expression for $\theta_T$ and solve the optimization (12) jointly with respect to $\alpha_k$ and $\phi_S^k$. Thus, DECISION returns a favorable combination of source hypotheses, satisfying the bound in Lemma 1, under the uniformity assumption on the source distributions. Without the uniformity assumption, one can instead try to estimate the $Q_S^k$ in order to minimize the target risk; we discuss this in the next section as the proposed framework DECISION-mlp (Fig. 3).
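The cancellation in (13) can be checked numerically with a short sketch. The values of $\lambda$ and $c$ below are made up for illustration, and `instance_weights` is a hypothetical helper name. The text fixes $\mathcal{U}(x) = 1$; the loop varies a common positive factor only to show that it cancels.

```python
def instance_weights(lam, q_at_x):
    """Per-instance weights lambda_k * Q_S^k(x) / sum_j lambda_j * Q_S^j(x)."""
    numerators = [l * q for l, q in zip(lam, q_at_x)]
    total = sum(numerators)
    return [v / total for v in numerators]

lam = [0.5, 0.3, 0.2]   # mixture coefficients (illustrative)
c = [2.0, 1.0, 1.0]     # per-source scaling factors c_k (illustrative)

# Input-agnostic weights alpha_k = lambda_k c_k / sum_j lambda_j c_j.
alpha = instance_weights(lam, c)

# Under Q_S^k(x) = c_k * U(x), any common factor U(x) cancels, so the
# per-instance weights coincide with alpha at every x.
weights_at_x = [instance_weights(lam, [ck * u for ck in c]) for u in (1.0, 0.25, 7.0)]
```

Every entry of `weights_at_x` matches `alpha` up to floating-point rounding, which is why a single trainable weight per source suffices under the uniformity assumption.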

FIG. 3 Weight calculation in the modified framework with MLPs: instead of setting the α's as free trainable parameters, they can be expressed as a function of the source distributions via multilayer perceptrons (MLPs). In end-to-end training, the source feature encoders and the MLPs are trained jointly.

7 Source distribution dependent weights (DECISION-mlp)

The mixing weights in the model aggregation scheme in Lemma 1, which also stems from Mansour et al. (2009) and Hoffman et al. (2018a), are functions of the source distributions and the target samples. In the absence of source data, assuming uniform source distributions leads to Algorithm 1. However, without assuming anything about the source distributions, it is possible to estimate them via MLPs. Assume there are $n$ MLPs $M_S^k$, $k = 1, 2, \ldots, n$, to estimate the $n$ source distributions (Fig. 3). Target features are obtained by passing each sample through the feature encoders and concatenating the outputs to get a vector of size $d$, where $d = \sum_{i=1}^{n} d_i$. If the target sample is $x_T^i$, then the concatenated vector $\phi(x_T^i)$ can be expressed as $\phi(x_T^i) = [\phi_S^1(x_T^i), \phi_S^2(x_T^i), \ldots, \phi_S^n(x_T^i)]^{\top} \in \mathbb{R}^d$. Next, $\phi(x_T^i)$ is fed through all the MLPs to get $n$ scalars $\{M_S^k(\phi(x_T^i))\}_{k=1}^{n}$, which are analogous to the $\lambda_k Q_S^k(x)$ in the expression of the target predictor in Lemma 1. Upon normalization, these scalars emulate the α's as a function of both the source distributions and the target samples. In short, if the $i$th sample has a weight $\alpha_j^i$ corresponding to the $j$th source, then

$$\alpha_j^i = \frac{M_S^j(\phi(x_T^i))}{\sum_{k=1}^{n} M_S^k(\phi(x_T^i))}. \tag{14}$$
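A minimal sketch of the per-sample weighting in Eq. (14) follows. It is not the authors' architecture: each MLP here has one ReLU hidden layer with random weights, and a softplus is applied to the output so that every score is strictly positive and the normalized weights are valid. That output activation is an assumption, since the text does not say how positivity is enforced. All function names are hypothetical.

```python
import math
import random

def linear(weights, bias, vec):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]

def softplus(x):
    """log(1 + exp(x)), computed stably; strictly positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def make_mlp(d_in, d_hidden, rng):
    """Randomly initialized one-hidden-layer MLP with a scalar output."""
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(d_in)] for _ in range(d_hidden)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(d_hidden)]]
    return w1, [0.0] * d_hidden, w2, [0.0]

def mlp_score(mlp, vec):
    """Positive scalar score M_S^k(phi(x)) (softplus output is an assumption)."""
    w1, b1, w2, b2 = mlp
    hidden = [max(0.0, h) for h in linear(w1, b1, vec)]  # ReLU
    return softplus(linear(w2, b2, hidden)[0])

def sample_weights(mlps, features_per_source):
    """Eq. (14): normalize the n MLP scores computed on the concatenated features."""
    phi = [x for feats in features_per_source for x in feats]  # concatenation
    scores = [mlp_score(m, phi) for m in mlps]
    total = sum(scores)
    return [s / total for s in scores]

rng = random.Random(0)
# Three source encoders with feature sizes 2, 3, and 1, so d = 6 (toy values).
features = [[0.2, -0.1], [1.0, 0.3, 0.5], [0.0]]
mlps = [make_mlp(d_in=6, d_hidden=4, rng=rng) for _ in range(3)]
alpha_i = sample_weights(mlps, features)
```

Because the weights are recomputed for every target sample, two samples with different features generally receive different mixtures of the source models, unlike the single shared weight vector in DECISION.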

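In this new setting, the loss functions in our objective are modified slightly. Let us denote the modified entropy, diversity, and pseudo-label losses as $\mathcal{L}'_{ent}$, $\mathcal{L}'_{div}$, and $\mathcal{L}'_{pl}$, respectively. Then,

$$\mathcal{L}'_{ent} = -\mathbb{E}_{x_T^i \in \mathcal{D}_T} \left[ \sum_{j=1}^{K} \delta_j(\theta_T(x_T^i)) \log(\delta_j(\theta_T(x_T^i))) \right], \tag{15}$$

where $\theta_T(x_T^i) = \sum_{j=1}^{n} \alpha_j^i\, \theta_S^j(x_T^i)$,

$$\mathcal{L}'_{div} = -\sum_{j=1}^{K} p'_j \log p'_j, \tag{16}$$

where $p' = \mathbb{E}_{x_T^i \in \mathcal{D}_T}[\delta(\theta_T(x_T^i))]$, and

$$\mathcal{L}'_{pl}(Q_T, \theta_T) = -\mathbb{E}_{x_T^i \in \mathcal{D}_T} \sum_{k=1}^{K} \mathbb{1}\{\hat{y}'_T = k\} \log \delta_k(\theta_T(x_T^i)), \tag{17}$$

where $\hat{y}'_T$ is calculated in the same manner as $\hat{y}_T$ but with the modified MLP weights. The overall modified objective function in this scenario becomes

$$\mathcal{L}'_{tot} = \mathcal{L}'_{ent} - \mathcal{L}'_{div} + \lambda \mathcal{L}'_{pl}, \tag{18}$$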

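which gives rise to a new optimization problem as follows:

$$\begin{aligned} \underset{\{\phi_S^j\}_{j=1}^{n},\ \{M_S^j\}_{j=1}^{n}}{\text{minimize}} \quad & \mathcal{L}'_{tot} \\ \text{subject to} \quad & \alpha_j^i = \frac{M_S^j(\phi(x_T^i))}{\sum_{k=1}^{n} M_S^k(\phi(x_T^i))}, \\ & \alpha_j^i \geq 0, \quad \sum_{j=1}^{n} \alpha_j^i = 1, \quad \forall j \in \{1, 2, \ldots, n\},\ \forall i \in \{1, 2, \ldots, n_T\}. \end{aligned} \tag{19}$$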

To solve this optimization problem, we follow the steps of Algorithm 2. We name the algorithm DECISION-mlp; it is a variant of DECISION. Note that in Algorithm 2 we pass a batch of target samples, instead of a single sample, to the MLPs in order to reduce computational cost. The loss functions above are written for a batch size of one and extend easily to a batch size greater than one.

ALGORITHM 2 Overview of DECISION-mlp.

8 Proof of Lemma 1

Lemma 1 holds true for both DECISION and DECISION-mlp, the former being a special case and the latter being the more general one, since DECISION-mlp is designed to mimic Lemma 1 completely without any assumption on the source distributions. We now prove Lemma 1.

Proof. The left-hand side of the inequality in Lemma 1 can be upper-bounded as follows:

$$\begin{aligned} \mathcal{L}(Q_T, \theta_T) &= \int_x Q_T(x)\, L(\theta_T(x), y)\, dx \\ &= \int_x Q_T(x)\, L\!\left( \sum_{i=1}^{n} \frac{\lambda_i Q_S^i(x)}{\sum_{j=1}^{n} \lambda_j Q_S^j(x)}\, \theta_S^i(x),\ y \right) dx \\ &\leq \int_x Q_T(x) \sum_{i=1}^{n} \frac{\lambda_i Q_S^i(x)}{\sum_{j=1}^{n} \lambda_j Q_S^j(x)}\, L(\theta_S^i(x), y)\, dx && \text{(Jensen's inequality)} \\ &= \int_x Q_T(x) \sum_{i=1}^{n} \frac{\lambda_i Q_S^i(x)}{Q_T(x)}\, L(\theta_S^i(x), y)\, dx && \text{(distribution assumption)} \\ &= \sum_{i=1}^{n} \lambda_i \int_x Q_S^i(x)\, L(\theta_S^i(x), y)\, dx && \text{(changing the order of summation)} \\ &= \sum_{i=1}^{n} \lambda_i\, \mathcal{L}(Q_S^i, \theta_S^i). \end{aligned} \tag{20}$$

For the right-hand side, we can write the loss as follows:

$$\begin{aligned} \mathcal{L}(Q_T, \theta_S^j) &= \int_x Q_T(x)\, L(\theta_S^j(x), y)\, dx \\ &= \int_x \sum_{i=1}^{n} \lambda_i Q_S^i(x)\, L(\theta_S^j(x), y)\, dx \\ &= \sum_{i=1}^{n} \lambda_i \int_x Q_S^i(x)\, L(\theta_S^j(x), y)\, dx \\ &= \sum_{i=1}^{n} \lambda_i\, \mathcal{L}(Q_S^i, \theta_S^j). \end{aligned}$$

Now recall that

$$\theta_S^k = \arg\min_{\theta} \mathcal{L}(Q_S^k, \theta) \quad \text{for } 1 \leq k \leq n. \tag{21}$$

This means $\theta_S^i$ is the best predictor for the source $i$, which has distribution $Q_S^i$. Thus, we find that $\mathcal{L}(Q_S^i, \theta_S^i) \leq \mathcal{L}(Q_S^i, \theta_S^j)$ for all $j$, which implies $\sum_i \lambda_i\, \mathcal{L}(Q_S^i, \theta_S^i) \leq \sum_i \lambda_i\, \mathcal{L}(Q_S^i, \theta_S^j)$. This further implies that $\mathcal{L}(Q_T, \theta_T) \leq \mathcal{L}(Q_T, \theta_S^j)$ for all $j$, which in turn concludes the proof of $\mathcal{L}(Q_T, \theta_T) \leq \min_{1 \leq j \leq n} \mathcal{L}(Q_T, \theta_S^j)$.

Finally, suppose the entries of $\lambda$ are strictly positive and let $\beta = \arg\min_j \mathcal{L}(Q_T, \theta_S^j)$. Observe that, if there is a source $i$ such that the strict inequality $\mathcal{L}(Q_S^i, \theta_S^i) < \mathcal{L}(Q_S^i, \theta_S^{\beta})$ holds, then the main claim of the lemma also becomes strict, as we find

$$\mathcal{L}(Q_T, \theta_T) \leq \sum_i \lambda_i\, \mathcal{L}(Q_S^i, \theta_S^i) < \sum_i \lambda_i\, \mathcal{L}(Q_S^i, \theta_S^{\beta}) \leq \min_j \mathcal{L}(Q_T, \theta_S^j).$$

Verbally, this strict inequality has a natural meaning: the model $\beta$ is strictly worse than model $i$ on the source data $i$. □

Observe that the expected loss $\mathcal{L}$ defined in Lemma 1 is the supervised loss, where one does have the label information. Our proposed target predictor $\theta_T$ achieves a supervised loss at least as good as the best individual source model. Importantly, the inequality is strict under a natural mild condition: the best individual source model $\beta$ (for the target $Q_T$) is strictly worse than some source model $i$ on the source distribution $Q_S^i$.
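The chain of inequalities in the proof can be checked on a small discrete example. The setup below is illustrative only: a four-point input space, a deterministic label $y(x)$, squared loss (convex in its first argument), and constant predictors as the hypothesis class, so that each optimal source predictor is the mean label under its source distribution. None of these choices come from the chapter.

```python
def expected_sq_loss(q, predict, y):
    """Sum over x of q(x) * (predict(x) - y(x))^2 on a finite input space."""
    return sum(q[x] * (predict(x) - y[x]) ** 2 for x in range(len(q)))

y = [0.0, 0.0, 1.0, 1.0]                 # labels on x = 0..3 (toy)
Q = [[0.4, 0.4, 0.1, 0.1],               # source distribution 1
     [0.1, 0.1, 0.4, 0.4]]               # source distribution 2
lam = [0.5, 0.5]                         # mixture coefficients
Q_T = [sum(l * q[x] for l, q in zip(lam, Q)) for x in range(4)]

# Best constant predictor under squared loss = mean label under Q_S^k.
theta_S = [sum(q[x] * y[x] for x in range(4)) for q in Q]

def theta_T(x):
    """Combination rule of Lemma 1 with per-instance weights."""
    return sum(lam[k] * Q[k][x] / Q_T[x] * theta_S[k] for k in range(len(Q)))

lhs = expected_sq_loss(Q_T, theta_T, y)                      # about 0.1024
middle = sum(lam[k] * expected_sq_loss(Q[k], lambda x, k=k: theta_S[k], y)
             for k in range(len(Q)))                         # about 0.16
rhs = min(expected_sq_loss(Q_T, lambda x, j=j: theta_S[j], y)
          for j in range(len(Q)))                            # about 0.34
```

Here `lhs`, `middle`, and `rhs` correspond to $\mathcal{L}(Q_T, \theta_T)$, $\sum_i \lambda_i \mathcal{L}(Q_S^i, \theta_S^i)$, and $\min_j \mathcal{L}(Q_T, \theta_S^j)$, and they are ordered as the proof predicts. The inequality is strict in this example because each source predictor is strictly better on its own source than the other one.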

9 Experiments

In this section, we first report and discuss the performance of DECISION and subsequently analyze DECISION-mlp.

9.1 Experiments on DECISION

9.1.1 Datasets

To test the effectiveness of DECISION, experiments were run on various visual benchmarks for digit and object recognition; this chapter shows results only on the following two object recognition datasets.

• Office (Hoffman et al., 2018b): In this benchmark DA dataset, there are three domains under the office environment, namely Amazon (A), DSLR (D), and Webcam (W), with a total of 31 object classes in each domain.
• Office–Home (Venkateswara et al., 2017): Office–Home consists of four domains, namely Art (AR), Clipart (CL), Product (PR), and Real-world (RW). Each of these domains contains 65 object classes.

In all of the experiments, each domain in turn is fixed as the target and the rest are used as source domains. The source data was discarded after training the source models.

9.1.2 Baseline methods

DECISION is compared against a wide array of baselines. Similar to this setting, SHOT (Liang et al., 2020) attempts unsupervised adaptation without source data. However, it adapts a single source at a time. Comparison against a multisource extension of SHOT is done via ensembling: the target data is passed through each adapted source model, and the soft predictions are averaged to obtain the test label. In the comparisons, this method is referred to as SHOT-Ens. Comparisons are also made against the single-source baselines SHOT-best and SHOT-worst, which refer to the best and the worst adapted source model, respectively, learned using SHOT.

9.2 Implementation details

9.2.1 Network architecture

For the object recognition tasks, a pretrained ResNet-50 (He et al., 2016) is used as the feature extractor backbone, similar to Peng et al. (2019) and Xu et al. (2019). Following Liang et al. (2020) and Ganin and Lempitsky (2015), the penultimate fully connected layer is replaced with a bottleneck layer and a task-specific classifier layer. Batch normalization (Ioffe and Szegedy, 2015) is utilized after the bottleneck layer, along with weight normalization (Salimans and Kingma, 2016) in the final layer. For the digit recognition task, a variant of LeNet (LeCun et al., 1998) is used, similar to Liang et al. (2020).

9.2.2 Source model training

Following Liang et al. (2020), the source models are trained using smooth labels instead of the usual one-hot encoded labels. This increases the robustness of the model and helps in the adaptation process by encouraging features to lie in tight, evenly separated clusters (Müller et al., 2019). The maximum number of epochs for Office and Office–Home is set to 100 and 50, respectively.

9.2.3 Hyper-parameters

The entire framework is trained in an end-to-end fashion via backpropagation. Specifically, stochastic gradient descent is utilized with momentum 0.9 and weight decay $10^{-3}$. The learning rate is set to $10^{-2}$ for the bottleneck and classifier layers, while the backbone is trained at a rate of $10^{-3}$. In addition, the learning rate scheduling strategy from Ganin and Lempitsky (2015) is used, where the initial rate is decayed as learning progresses. The batch size is set to 32. We use $\lambda = 0.3$ for all the object recognition tasks and $\lambda = 0.1$ for the digits benchmark. For adaptation, the maximum number of epochs is set to 15, with the pseudo-labels updated at the start of every epoch.
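The per-layer learning rates and their decay can be sketched as below. The schedule in Ganin and Lempitsky (2015) has the form $\eta_p = \eta_0 / (1 + \gamma p)^{\beta}$, where $p \in [0, 1]$ is the training progress; the constants $\gamma = 10$ and $\beta = 0.75$ are the values from that paper and are assumed here, since this chapter does not state them. The dictionary keys and function names are illustrative.

```python
def scheduled_lr(base_lr, step, total_steps, gamma=10.0, power=0.75):
    """Decay base_lr as training progresses: base_lr / (1 + gamma * p) ** power."""
    progress = step / total_steps  # p goes from 0 to 1 over training
    return base_lr / (1.0 + gamma * progress) ** power

# Base rates taken from the hyper-parameter description above.
BASE_LRS = {"backbone": 1e-3, "bottleneck": 1e-2, "classifier": 1e-2}

def learning_rates(step, total_steps):
    """Current learning rate for each parameter group."""
    return {name: scheduled_lr(lr, step, total_steps) for name, lr in BASE_LRS.items()}

start = learning_rates(0, 1000)      # equals the base rates
halfway = learning_rates(500, 1000)
end = learning_rates(1000, 1000)     # base rate divided by 11 ** 0.75
```

In a framework such as PyTorch, the same effect is usually obtained by giving the optimizer one parameter group per entry and updating each group's rate before every step.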

9.3 Object recognition

9.3.1 Office

The results for the three adaptation tasks on the Office dataset are shown in Table 1. DECISION performs on par with the best adapted source models on all the tasks and obtains an average increase of 1.8 percentage points over SHOT-Ens (91.1 vs. 89.3), and of 5.2 points over SHOT-worst. In the task of adapting to the Webcam (W) domain, negative transfer from the Amazon (A) model brings the ensemble performance down. DECISION is able to prevent this: it not only outperforms the ensemble by 3.5% but also achieves higher performance than the best adapted source.

TABLE 1 Results on Office.

| Source | Method | A,D → W | A,W → D | D,W → A | Avg. |
|---|---|---|---|---|---|
| Single | Source-best | 96.3 | 98.4 | 62.5 | 85.7 |
| Single | Source-worst | 75.6 | 80.9 | 62.0 | 72.8 |
| Single | SHOT-best (Liang et al., 2020) | 98.2 | 99.6 | 75.1 | 90.9 |
| Single | SHOT-worst (Liang et al., 2020) | 90.6 | 94.2 | 72.9 | 85.9 |
| Multiple | SHOT-Ens (Liang et al., 2020) | 94.9 | 97.8 | 75.0 | 89.3 |
| Multiple | DECISION (Ours) | 98.4 | 99.6 | 75.4 | 91.1 |

A, D, and W are abbreviations of Amazon, DSLR, and Webcam. For single-source methods, Source-best and Source-worst denote the best and worst unadapted source models, whereas SHOT-best and SHOT-worst are the best and worst accuracies of adapted source models.

9.3.2 Office–Home

On the Office–Home dataset, the algorithm outperforms all baselines, as shown in Table 2. Across all tasks, this method achieves a mean increase in accuracy of 2% over the respective best adapted source models. This can be attributed to the relatively small performance gap between the best and worst adapted sources in comparison to other datasets. This suggests that, as the performance gap between the best and worst performing sources gets smaller, or outlier sources are removed, DECISION can generalize even better to the target.

TABLE 2 Results on Office–Home.

| Source | Method | AR,CL,PR → RW | AR,CL,RW → PR | AR,PR,RW → CL | CL,PR,RW → AR | Avg. |
|---|---|---|---|---|---|---|
| Single (w/o) | Source-best | 74.1 | 78.3 | 46.2 | 65.8 | 66.1 |
| Single (w/o) | Source-worst | 64.8 | 62.8 | 40.9 | 53.3 | 55.5 |
| Single (w/o) | SHOT-best (Liang et al., 2020) | 81.3 | 83.4 | 57.2 | 72.1 | 73.5 |
| Single (w/o) | SHOT-worst (Liang et al., 2020) | 80.8 | 77.9 | 53.8 | 66.6 | 69.8 |
| Multiple (w/o) | SHOT-Ens (Liang et al., 2020) | 82.9 | 82.8 | 59.3 | 72.2 | 74.3 |
| Multiple (w/o) | DECISION (Ours) | 83.6 | 84.4 | 59.4 | 74.5 | 75.5 |

AR, CL, RW, and PR are abbreviations of Art, Clipart, Real-world, and Product. Our method outperforms all the baselines, including the best source accuracy as well as the ensemble method. The abbreviations under the columns Source and Method are the same as described in Table 1.

9.4 Ablation study

9.4.1 Contribution of each loss

DECISION is trained using a combination of three distinct losses: $\mathcal{L}_{div}$, $\mathcal{L}_{ent}$, and $\mathcal{L}_{pl}$. The contribution of each component of our framework to the adaptation task is shown in Table 3. First, both the diversity loss and the pseudo-labeling loss are removed, and the model is trained using only $\mathcal{L}_{ent}$. Next, $\mathcal{L}_{div}$ is added and weighted information maximization is performed. Finally, results are reported using $\mathcal{L}_{pl}$ alone.

TABLE 3 Loss-wise ablation.

| Method | A,D → W | A,W → D | D,W → A | Avg. |
|---|---|---|---|---|
| $\mathcal{L}_{pl}$ | 97.6 | 98.5 | 75.3 | 90.5 |
| $\mathcal{L}_{ent}$ | 96.6 | 99.0 | 68.5 | 88.0 |
| $\mathcal{L}_{ent} + \mathcal{L}_{div}$ | 95.9 | 99.0 | 71.6 | 88.9 |
| $\mathcal{L}_{ent} + \mathcal{L}_{div} + \lambda \mathcal{L}_{pl}$ | 98.4 | 99.6 | 75.4 | 91.1 |

Contribution of each component in adaptation on the Office dataset.

9.4.2 Analysis on the learned weights

DECISION jointly adapts the source models and learns the weights on each such source. To understand the impact of the weights, the feature extractors are frozen and optimization is performed solely over the weights $\{\alpha_j\}_{j=1}^{n}$. Naturally, this setup yields better performance compared to trivially assigning equal weights to all source models, as shown in Table 4. More interestingly, the learned weights correctly indicate which source model performs better on the target and could serve as a proxy indicator in a model selection framework. See Fig. 4.

TABLE 4 Performance on freezing backbone network on Office–Home.

| Method | AR,CL,PR → RW | AR,CL,RW → PR | AR,PR,RW → CL | CL,PR,RW → AR | Avg. |
|---|---|---|---|---|---|
| Source-Ens | 67.6 | 51.4 | 77.7 | 80.1 | 69.2 |
| DECISION-weights | 68.8 | 52.3 | 79.2 | 80.4 | 70.2 |

DECISION-weights is optimized solely over the source weights and consistently performs better than uniform weighting.

FIG. 4 Weights as model selection proxy. The weights learned by DECISION on Office–Home correlate positively with the unadapted source model performance. That is, an unadapted source model that performs well on the target is more likely to receive a higher weight than unadapted source models with lower target accuracy, which is theoretically expected.

9.4.3 Distillation into a single model

Since this algorithm deals with multiple source models, inference time is of the order $O(m)$, where $m$ is the number of source models. If $m$ is large, this can make inference quite time consuming. To ameliorate this overhead, a knowledge distillation (Hinton et al., 2015) strategy is followed to obtain a single target model. Teacher supervision is obtained by linearly combining the adapted models via the learned weights. These annotations are subsequently used to train the single student model via vanilla cross-entropy loss. Results obtained using this distillation strategy are shown in Table 5. Despite the model compression, the performance remains consistent.

9.5 Results and analyses of DECISION-mlp

For all the MLPs in this alternative approach (Algorithm 2), we use architectures with one or two hidden layers, each with a configurable number of nodes followed by a ReLU activation. We show results on Office–Home and Office following the same setup as DECISION. We compare it with DECISION on the Office–Home dataset in Table 6 and on Office in Table 7. DECISION-mlp outperforms DECISION for all the targets of Office–Home except for "Product (PR)."

TABLE 5 Distillation results on object recognition tasks.

| Method | Office–Home: RW | Office–Home: PR | Office–Home: CL | Office–Home: AR | Office: A | Office: D | Office: W |
|---|---|---|---|---|---|---|---|
| DECISION (original) | 83.6 | 84.4 | 59.4 | 74.5 | 75.4 | 99.6 | 98.4 |
| DECISION (distillation) | 83.7 | 84.4 | 59.1 | 74.4 | 75.4 | 99.6 | 98.1 |

Performance remains consistent across all datasets despite distilling into a single target model.
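The distillation step behind Table 5 can be sketched as follows. This is one possible reading of the description in Section 9.4.3, not a confirmed implementation: the teacher is the $\alpha$-weighted sum of the adapted models' logits, its annotations are taken to be hard labels (argmax), and the student is scored with standard cross-entropy. The function names and toy numbers are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_labels(adapted_logits, alpha):
    """Hard labels from the alpha-weighted combination of adapted source models.

    adapted_logits[j][i] is the logit vector of adapted model j on sample i.
    """
    n_models = len(adapted_logits)
    n_samples = len(adapted_logits[0])
    n_classes = len(adapted_logits[0][0])
    labels = []
    for i in range(n_samples):
        mixed = [sum(alpha[j] * adapted_logits[j][i][k] for j in range(n_models))
                 for k in range(n_classes)]
        labels.append(max(range(n_classes), key=lambda k: mixed[k]))
    return labels

def cross_entropy(student_logits, labels):
    """Average cross-entropy of the student against the teacher's labels."""
    return -sum(math.log(softmax(z)[y])
                for z, y in zip(student_logits, labels)) / len(labels)

# Two adapted models, two target samples, three classes (toy values).
adapted_logits = [
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    [[0.0, 2.0, 0.0], [0.0, 3.0, 0.0]],
]
labels = teacher_labels(adapted_logits, alpha=[0.7, 0.3])
student_loss = cross_entropy([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]], labels)
```

After training, only the student is evaluated at inference time, so the cost no longer grows with the number of source models.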

TABLE 6 Comparison of DECISION and DECISION-mlp on Office–Home.

| Method | AR,CL,PR → RW | AR,CL,RW → PR | AR,PR,RW → CL | CL,PR,RW → AR |
|---|---|---|---|---|
| DECISION | 83.6 | 84.4 | 59.4 | 74.5 |
| DECISION-mlp | 83.9 | 83 | 60.1 | 74.8 |

TABLE 7 Comparison of DECISION and DECISION-mlp on Office.

| Method | A,D → W | A,W → D | D,W → A |
|---|---|---|---|
| DECISION | 98.4 | 99.6 | 75.4 |
| DECISION-mlp | 97.6 | 99.6 | 74.6 |

TABLE 8 Ablation of MLP architectures.

| Layers \ Nodes | 256 | 512 | 1024 | 2048 |
|---|---|---|---|---|
| Single | 73.7 | 74.1 | 74.2 | 74.5 |
| Double | 74 | 74.1 | 74.8 | 74.7 |
| Triple | 73.1 | 72.4 | 74 | 73.5 |

We conduct experiments on target "Art" from Office–Home by varying the number of nodes and layers of the MLPs. We find that there is a sweet-spot architecture that gives the best result.

However, on the Office dataset we do not see any improvement over DECISION. For Office–Home, two hidden layers with 1024 nodes each are used, whereas for Office, one hidden layer with 8192 nodes gives the best performance after an exhaustive grid search. We therefore examine the effect of the MLP architecture in Table 8. We find that the choice of network architecture plays an important role in getting the best possible result, which suggests a possible direction for future work.

10 Conclusions and future work

In this chapter, we discussed the first UDA algorithm that can learn from and optimally combine multiple source models without requiring source data. We analyzed a modification of this algorithm obtained by dropping some assumptions. From the results and ablations of this modified algorithm, we find that there is some potential in estimating the source distributions via MLPs instead of naively assuming them to be uniform. However, the hard part is to search for the optimal MLP architecture for a particular target. A neural architecture search (NAS) style approach could be a future direction for finding MLPs that estimate the source distributions as accurately as possible. Moreover, there are multiple other exciting directions to pursue.

First, we suspect that our algorithm’s performance can be further boosted by incorporating data augmentation techniques during training. Second, when there are many source models to utilize, it would be interesting to study whether we can automatically select an optimal subset of the source models without requiring source data in an unsupervised fashion.

References

Ahmed, S.M., Lejbolle, A.R., Panda, R., Roy-Chowdhury, A.K., 2020. Camera on-boarding for person re-identification using hypothesis transfer learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12144–12153.
Ahmed, S.M., Raychaudhuri, D.S., Paul, S., Oymak, S., Roy-Chowdhury, A.K., 2021. Unsupervised multi-source domain adaptation without access to source data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10103–10112.
Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., Vaughan, J.W., 2010. A theory of learning from different domains. Mach. Learn. 79 (1-2), 151–175.
Bridle, J.S., Heading, A.J.R., MacKay, D.J.C., 1992. Unsupervised classifiers, mutual information and 'phantom targets'. In: Advances in Neural Information Processing Systems, pp. 1096–1101.
Caron, M., Bojanowski, P., Joulin, A., Douze, M., 2018. Deep clustering for unsupervised learning of visual features. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149.
Chapelle, O., Scholkopf, B., Zien, A., 2009. Semi-supervised learning (chapelle, o. et al. eds.; 2006)[book reviews]. IEEE Trans. Neural Netw. 20 (3), 542–542.
Dong, J., Fang, Z., Liu, A., Sun, G., Liu, T., 2021. Confident anchor-induced multi-source free domain adaptation. In: Advances in Neural Information Processing Systems, vol. 34, pp. 2848–2860.
Ganin, Y., Lempitsky, V., 2015. Unsupervised domain adaptation by backpropagation. In: International Conference on Machine Learning, pp. 1180–1189.
Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., Lempitsky, V., 2016. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17 (1), 2096–2030.
Grandvalet, Y., Bengio, Y., 2005. Semi-supervised learning by entropy minimization. In: Advances in Neural Information Processing Systems, pp. 529–536.
Guo, J., Shah, D.J., Barzilay, R., 2018. Multi-source domain adaptation with mixture of experts. arXiv preprint:1809.02256.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
Hinton, G., Vinyals, O., Dean, J., 2015. Distilling the knowledge in a neural network. arXiv preprint:1503.02531.
Hoffman, J., Mohri, M., Zhang, N., 2018a. Algorithms and theory for multiple-source adaptation. In: Advances in Neural Information Processing Systems, pp. 8246–8256.
Hoffman, J., Tzeng, E., Park, T., Zhu, J.-Y., Isola, P., Saenko, K., Efros, A., Darrell, T., 2018b. Cycada: cycle-consistent adversarial domain adaptation. In: International Conference on Machine Learning, pp. 1989–1998.
Hsu, H.-K., Yao, C.-H., Tsai, Y.-H., Hung, W.-C., Tseng, H.-Y., Singh, M., Yang, M.-H., 2020. Progressive domain adaptation for object detection. In: The IEEE Winter Conference on Applications of Computer Vision, pp. 749–757.
Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint:1502.03167.
Jaynes, E.T., 1957. Information theory and statistical mechanics. Phys. Rev. 106 (4), 620.
Kirillov, A., He, K., Girshick, R., Rother, C., Dollár, P., 2019. Panoptic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9404–9413.
Krause, A., Perona, P., Gomes, R.G., 2010. Discriminative clustering by regularized information maximization. In: Advances in Neural Information Processing Systems, pp. 775–783.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86 (11), 2278–2324.
Li, Y., Carlson, D.E., et al., 2018. Extracting relationships by multi-domain matching. In: Advances in Neural Information Processing Systems, pp. 6798–6809.
Li, R., Jiao, Q., Cao, W., Wong, H.-S., Wu, S., 2020. Model adaptation: unsupervised domain adaptation without source data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9641–9650.
Liang, J., Hu, D., Feng, J., 2020. Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation. arXiv preprint:2002.08546.
Lin, C., Zhao, S., Meng, L., Chua, T.-S., 2020. Multi-source domain adaptation for visual sentiment classification. In: AAAI, pp. 2661–2668.
Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440.
Luo, Y., Zheng, L., Guan, T., Yu, J., Yang, Y., 2019. Taking a closer look at domain shift: category-level adversaries for semantics consistent domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2507–2516.
Mansour, Y., Mohri, M., Rostamizadeh, A., 2009. Domain adaptation with multiple sources. In: Advances in Neural Information Processing Systems, pp. 1041–1048.
Müller, R., Kornblith, S., Hinton, G.E., 2019. When does label smoothing help? In: Advances in Neural Information Processing Systems, pp. 4694–4703.
Oymak, S., Gulcu, T.C., 2020. Statistical and algorithmic insights for semi-supervised learning with self-training. arXiv preprint:2006.11006.
Paul, S., Tsai, Y.-H., Schulter, S., Roy-Chowdhury, A.K., Chandraker, M., 2020. Domain adaptive semantic segmentation using weak labels. arXiv preprint:2007.15176.
Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., Wang, B., 2019. Moment matching for multisource domain adaptation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1406–1415.
Perrot, M., Habrard, A., 2015. A theoretical analysis of metric hypothesis transfer learning. In: International Conference on Machine Learning, pp. 1708–1717.
Redmon, J., Divvala, S., Girshick, R., Farhadi, A., 2016. You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788.
Russo, P., Tommasi, T., Caputo, B., 2019. Towards multi-source adaptive semantic segmentation. In: International Conference on Image Analysis and Processing, pp. 292–301.
Saenko, K., Kulis, B., Fritz, M., Darrell, T., 2010. Adapting visual category models to new domains. In: European Conference on Computer Vision, pp. 213–226.
Salimans, T., Kingma, D.P., 2016. Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In: Advances in Neural Information Processing Systems, pp. 901–909.
Shen, M., Bu, Y., Wornell, G., 2022. On the benefits of selectivity in pseudo-labeling for unsupervised multi-source-free domain adaptation. arXiv preprint:2202.00796.
Singh, S., Uppal, A., Li, B., Li, C.-L., Zaheer, M., Póczos, B., 2018. Nonparametric density estimation under adversarial losses. In: Advances in Neural Information Processing Systems, pp. 10225–10236.
Sun, B., Feng, J., Saenko, K., 2016. Return of frustratingly easy domain adaptation. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 2058–2065.
Tarvainen, A., Valpola, H., 2017. Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In: Advances in Neural Information Processing Systems, pp. 1195–1204.
Tzeng, E., Hoffman, J., Saenko, K., Darrell, T., 2017. Adversarial discriminative domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167–7176.
Venkateswara, H., Eusebio, J., Chakraborty, S., Panchanathan, S., 2017. Deep hashing network for unsupervised domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5018–5027.
Wang, H., Yang, W., Lin, Z., Yu, Y., 2019. Tmda: task-specific multi-source domain adaptation via clustering embedded adversarial training. In: 2019 IEEE International Conference on Data Mining (ICDM), pp. 1372–1377.
Xu, R., Chen, Z., Zuo, W., Yan, J., Lin, L., 2018. Deep cocktail network: multi-source unsupervised domain adaptation with category shift. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3964–3973.
Xu, R., Li, G., Yang, J., Lin, L., 2019. Larger norm more transferable: an adaptive feature norm approach for unsupervised domain adaptation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1426–1435.
Zhao, H., Zhang, S., Wu, G., Moura, J.M.F., Costeira, J.P., Gordon, G.J., 2018. Adversarial multiple source domain adaptation. In: Advances in Neural Information Processing Systems, pp. 8559–8570.
Zhao, S., Li, B., Xu, P., Keutzer, K., 2020. Multi-source domain adaptation in the deep learning Era: a systematic survey. arXiv preprint:2002.12169.
Zhu, J.-Y., Park, T., Isola, P., Efros, A.A., 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232.

Chapter 6

Deep learning methods for scientific and industrial research

G.K. Patra (a,e,*), Kantha Rao Bhimala (a,e), Ashapurna Marndi (a,e), Saikat Chowdhury (b), Jarjish Rahaman (b), Sutanu Nandi (b), Ram Rup Sarkar (b,e), K.C. Gouda (a,e), K.V. Ramesh (a,e), Rajesh P. Barnwal (c,e), Siddhartha Raj (c), and Anil Saini (d,e)

a CSIR-Fourth Paradigm Institute, Bangalore, Karnataka, India
b CSIR-National Chemical Laboratory, Pune, Maharashtra, India
c CSIR-Central Mechanical Engineering Research Institute, Durgapur, West Bengal, India
d CSIR-Central Electronics Engineering Research Institute, Pilani, Rajasthan, India
e Academy of Scientific & Innovative Research (AcSIR), Ghaziabad, India
* Corresponding author: e-mail: [email protected]

Abstract

Deep learning (DL) is a very powerful computational tool for various applications in scientific and industrial research and can be implemented in real time for societal benefit. Several factors affect the development of optimized DL models for better prediction, including the amount of quality sample data, domain-specific knowledge, and the architecture of the model used to extract useful features and patterns from the data. The present chapter demonstrates the state-of-the-art DL methodologies used by researchers from different laboratories under the Council of Scientific and Industrial Research (CSIR), India, to address important research problems across several sectors such as medicine, healthcare, agriculture, and energy. Convolutional Neural Network (CNN) techniques are utilized for tumor diagnosis, classifying molecular subtypes of glioma tissues, and predicting driver gene mutations in glioma. Similarly, the Long Short-Term Memory (LSTM) model is applied to the assessment of crop production, and transfer learning is used for the classification of tea leaves. Further, an ensemble LSTM methodology is implemented for short-term prediction of wind speed to support the renewable energy sector. Finally, multivariate LSTM models were developed by integrating weather parameters for the prediction of COVID-19 spread over different states in India, which is an input for policy planning and supply chain management during the pandemic. All the use cases are validated, and the results are satisfactory and provide confidence for the real-time application of DL in scientific and industrial research and for societal benefit.

Handbook of Statistics, Vol. 48. https://doi.org/10.1016/bs.host.2022.12.002 Copyright © 2023 Elsevier B.V. All rights reserved.

1 Introduction

Deep Learning (DL) techniques have gradually become a powerful and crucial tool for various applications in scientific and industrial research (Benjamens et al., 2020; Bhalla and Lagana, 2022). The present chapter focuses on the challenges and real-time applications of DL techniques in different sectors, including medicine, healthcare, agriculture, and renewable energy. In the medical and healthcare sector, the Food and Drug Administration (FDA) of the United States has recently approved several DL tools for patients with terminal diseases (Bhalla and Lagana, 2022; Spadaccini et al., 2022; Tariq et al., 2020). These DL tools are being used to identify complex patterns in clinical and biological data of life-threatening diseases and to assess their correlation with disease diagnosis, prognosis, and treatment plans (Bhalla and Lagana, 2022). Several clinical DL tools have also been employed for solid malignant tumors (Bhalla and Lagana, 2022). Glioma, a type of solid brain tumor formed in the central nervous system (CNS) of humans, is a highly complex, aggressive, and lethal disease (Aguilar-Morante et al., 2022; Nicholson and Fine, 2021; Ushio, 1991). Hoping to unravel the intricate complexities of this disease using high-throughput molecular profiling and high-resolution imaging, researchers have discovered complex patterns in the voluminous raw data, which make the goal of understanding this tumor more challenging (Cao et al., 2017; Yi et al., 2021). A plethora of molecular data, histopathological images, and MRI scans of large numbers of low- and high-grade glioma patients have been deposited in The Cancer Genome Atlas (TCGA) for further research and understanding (Bakas et al., 2017; Bolouri et al., 2016; Kong et al., 2013).
Undoubtedly, the deluge of big volumes of data generated by TCGA can potentially transform conventional tumor biology research into effective translational research for millions of patients predisposed to or already carrying the burden of several kinds of cancers including brain tumors (Suwinski et al., 2019). More significantly, these data sets have brought excellent opportunities to data scientists and bioinformaticians to apply state-of-the-art artificial intelligence (AI) techniques and make valuable predictions about diagnosis and prognosis of these diseases, potentially helping oncologists to decide the right course of treatment and disease maintenance (Calabrese et al., 2020; Chen et al., 2020a,b; Fan et al., 2019; Zhao et al., 2020). Interpretation of high dimensional, multi-omics molecular data sets (e.g., gene expressions, copy number variations, methylations) of glioma tumor tissues for translational or clinical research is a tremendous task and resource-intensive (Tran et al., 2021a). However, highly characterized tumor molecular data sets available in TCGA and other similar resources could be utilized to train multi-layered artificial neural networks (e.g., Sequential Neural Networks or SNN) for diagnosing tumor tissues more accurately,

Deep learning methods for scientific and industrial research, Chapter 6

efficiently, and most importantly, with limited resources and workforce. Accurate diagnosis of glioma tumors, for example, identifying their grades (i.e., low vs high), cellular subtypes (e.g., astrocytoma, oligoastrocytoma, oligodendroglioma, etc.), and molecular subtypes (i.e., classical, mesenchymal, proneural, neural), is important for assessing a patient’s prognosis and choosing the right treatment plan (Claus et al., 2015; Masui et al., 2012; Zhang et al., 2020). DL can delineate hidden complexities and non-linear patterns in multi-omics numerical data sets and thereby unlock their potential use in correctly diagnosing tumor tissue samples (Cesselli et al., 2019; Mousavi et al., 2015). Several research works have also used supervised machine learning algorithms to predict tumor grades, survival probability, and treatment outcomes of cancer patients (Nuechterlein et al., 2021; Zhang et al., 2016). Hence, the present chapter also demonstrates two widely used Machine Learning (ML) algorithms, Support Vector Machine (SVM) and Random Forest (RF), for the diagnosis of glioma tissue samples using multi-omics data (Shboul et al., 2021; Tran et al., 2021a,b).

When we discuss proactive healthcare, COVID-19 (coronavirus disease 2019) cannot be ignored. Globally, the pandemic has infected about 524 million people and caused more than 6 million deaths. Reliable prediction of COVID-19 outbreaks over different states in a large country like India is very difficult because of the complex mechanisms involved in disease transmission, such as different virus variants, socio-economic conditions, population density, migration, and climatic conditions. Daily caseload prediction for different states in India is nevertheless very important for managing medical infrastructure, logistic support, and proactive health advisories. Machine learning (ML) and DL techniques have proved their capability in time series forecasting of such non-linear problems.
Numerous studies have already shown that AI-based DL and ML techniques are useful for predicting COVID-19 cases over different regions of the world. Recently, researchers developed a COVID-19 prediction model using the Deep Sequential Prediction Model (DSPM) and Non-parametric Regression Model (NRM) and demonstrated the capability of DL techniques to predict disease spread over different countries and their states or provinces (Ayris et al., 2022). A COVID-19 prediction model using the Autoregressive Integrated Moving Average (ARIMA), Seasonal Autoregressive Integrated Moving Average (SARIMA), and Prophet models was developed and evaluated for the USA, Brazil, and India (Wang et al., 2022). That study found that the Prophet model was more skillful than the ARIMA model in predicting new daily cases over the USA, whereas the ARIMA model was superior in predicting cumulative cases over Brazil and India. In addition, integrating weather parameters into Long Short-Term Memory (LSTM) DL models has improved COVID-19 prediction skill in India (Bhimala et al., 2022; Wathore et al., 2022).

Federated Learning (FL) is an emerging concept in DL with considerable impact on the development of ML and DL tools in healthcare, the Internet of Things (IoT), computer vision, and other areas. Traditionally, ML and DL tools need huge datasets to train a model, and assembling such datasets is privacy-sensitive and cumbersome. Data privacy is a very important issue, especially in a medical context, where it is paramount to abide by existing privacy regulations and preserve patients’ anonymity. With the rapid development of software and hardware such as wearable devices, smartphones, and IoT systems, more and more data are generated and collected for processing. Healthcare data, on the other hand, are usually fragmented and private, which makes it difficult to obtain robust results. FL targets this problem because it can address privacy, ownership, and regulatory restrictions. In FL, multiple clients jointly solve a problem while each keeps its own data without sharing it, so the data remain private. FL can address data-sharing and security problems in applications such as the Industrial Internet of Things (IIoT), IoT-connected vehicles, and many more. FL enables organizations to collaborate in a “closed-loop system” to build a common, powerful machine learning model without actually exchanging data.

In the domain of atmospheric science, DL plays a pivotal role in solving complex problems such as the prediction of different atmospheric variables with higher accuracy. The atmosphere consists of multiple interdependent parameters whose variations cause significant changes in weather patterns. Each of these parameters has a significant impact on the environment. Wind speed, one of the major atmospheric variables, helps in generating green energy, which in turn reduces pollution. A significant amount of work has been carried out on predicting wind speed using various statistical, dynamical, and AI-based modeling systems.
Forecasting wind speed at the station level is a big challenge for dynamical models, as they give only macro-level information. In this respect, statistical models are preferred over dynamical models. In the last few decades, researchers have used statistical methods such as Auto-Regressive (AR) and ARIMA models for predicting wind speed. However, statistical models also have limitations: most of them are too lightweight to handle the nonlinear characteristics of the data. With recent developments in technology and the phenomenal growth of AI applications, it is also feasible to use data-driven models for the short-range forecast of wind speed at the station level. A few attempts have been made to predict wind speed using DL techniques such as Feed-Forward Neural Networks (Bhaskar & Singh, 2012), Recurrent Neural Networks (Barbounis et al., 2006; Cao et al., 2012), and Stacked Denoising Auto Encoders (Chen et al., 2019). However, there is scope to improve the prediction capability. Time Division LSTM, a DL approach that improves on Long Short-Term Memory, enhances the prediction of short-term wind speed

(Marndi et al., 2020). In this chapter, we present a detailed description and demonstration of Time Division LSTM for short-range wind speed prediction.

In the agricultural sector, various sophisticated DL techniques are also used to improve crop yield prediction because of their capability for self-learning and for extracting nonlinear relationships from huge data sets (Muruganantham et al., 2022). A few researchers have attempted to use different DL approaches to predict rice yield in different countries (Fernandez-Beltran et al., 2021; Jeong et al., 2022). Designing and developing a robust and accurate model to predict a country's crop production in advance remains challenging.

A digital image comprises a matrix of pixels, each cell of which is a basic unit of information for a DL algorithm. These digital image data in turn form the basis of computer vision algorithms. DL relies on training a deep neural network with a very large dataset to achieve a high level of accuracy. In computer vision in particular, one needs to acquire the data by capturing images, curating them, and then labeling them to perform object classification, object detection, and image segmentation tasks. However, there are many application domains, such as medicine, where it is difficult to capture enough images to build a huge curated dataset for training, testing, and validating a model. Moreover, training a deep neural network from scratch and tuning its hyper-parameters on a large image dataset requires heavy computational resources and long computation times. Sometimes, the labeling of such a large dataset itself becomes a challenging task.
In this situation, it is prudent to take a model pre-trained on a similar dataset and re-train it on the new dataset. This approach, transfer learning (TL), provides methodologies for achieving fairly high accuracy even with a smaller dataset. The only requirement is to identify the proper model class and tune its parameters to fit the new problem domain. In this chapter, we describe the method of TL and present a use case for it. Labeling data and training a model require a special skill set, and the most difficult part of the supervised machine-learning process is obtaining vast amounts of tagged data (Torrey and Shavlik, 2010). The requirement for enormous amounts of labeled data can make the development of powerful models impractical. Such models are therefore more likely to be created centrally by organizations that have access to massive amounts of labeled data, and these well-trained models can then be used by other researchers and organizations. VGG16 is one such model architecture, proposed by the Visual Geometry Group (VGG) at Oxford University (Simonyan and Zisserman, 2014). The architecture is 16 weight layers deep; its convolutional layers use 3 × 3 filters with a fixed stride and padding of one, whereas the max-pooling layers use a stride of two. The stack of convolutional layers is followed by fully-connected layers and an output layer with a

softmax activation function. The hidden layers are equipped with the Rectified Linear Unit (ReLU) activation function. This model has been used with the transfer learning approach to solve the specific task of classifying tea leaves into four classes: one leaf one bud, two leaves one bud, and so on up to four leaves one bud.

Convolutional neural networks (CNN/ConvNet) are a type of deep neural network frequently used to analyze visual imagery. They employ an operation known as convolution, a mathematical operation on two functions that produces a third function expressing how the shape of one is modified by the other. The ConvNet’s purpose is to reduce images to a form that is easier to process while preserving the features that are important for a successful prediction. A ConvNet comprises multiple artificial neurons. Artificial neurons are mathematical functions that calculate the weighted sum of their inputs and output an activation value, similar to their biological counterparts. When an image is fed into a ConvNet, each layer produces several activation maps that are passed on to the next layer. Basic features such as horizontal or diagonal edges are usually extracted by the first layer. This information is passed on to the next layer, which is responsible for detecting more complicated characteristics such as corners and combinations of edges. Deeper in the network, even more complex features such as objects and faces can be recognized. Based on its activation function, the output layer generates a confidence score for each class, indicating how likely the input is to belong to that class.

Whole-slide histopathological images (WSI) of glioma tissue samples and magnetic resonance image (MRI) scans of glioma patients could be labeled and classified based on tumor grades, molecular subtypes, overall survival, and mutation status of cancer hallmark genes (Mousavi et al., 2015; Rathore et al., 2020).
These labeled image data sets could be further employed to train a convolutional neural network (CNN) for the diagnosis and prognosis of the disease (Pei et al., 2021). Histopathological images of tumor samples capture dense cellular structures, including their interactions with the microenvironment, information on the invasion of blood vessels in tumor tissues, etc. (Liu et al., 2020). Histopathologists assess these features to assign tumor grades and cellular subtypes, a task that may be accomplished by training a deep CNN (Bahar et al., 2022). However, DL is not meant to replace human judgment in cancer diagnosis but to assist histopathologists by automating the time-consuming manual inspection of slides and helping them study the tissue more accurately. Similarly, MRI scans of the human brain could be used to train a deep CNN to detect glioma tumors and classify them into different grades, assess the prognosis of patients, or identify cancer hallmark gene mutations in the tumor (Bahar et al., 2022). In the following sections, we discuss the image processing and labeling methods, the architecture and training procedures of deep CNNs, and applications of the trained models in tumor diagnosis, prognosis, and radio-genomics studies.

The present chapter emphasizes the potential use of DL methodologies in various case studies: tumor diagnosis, classification of molecular subtypes of glioma tissues, prediction of driver gene mutations in glioma, short-range wind speed prediction, estimation of crop production, classification of tea leaves, and weather-integrated deep learning to predict the spread of COVID-19 in India. The following section describes the DL techniques used in the present chapter, and Section 3 presents the results of the case studies.

2 Data and methods

This section describes the data and the methodologies developed or adopted for the different research problems.

2.1 Different types of data for deep learning

2.1.1 Numerical data

Multi-omics datasets of human glioma samples were used to train three different sequential neural network (SNN) models and two machine learning models (SVM and RF) to diagnose histopathological grades of tumor tissues. Gene expression (GE), DNA methylation (DM), and copy number variation (CNV) datasets of low-grade (TCGA-LGG) and high-grade (TCGA-GBM) glioma tumors were obtained from The Cancer Genome Atlas (TCGA) data portal (https://portal.gdc.cancer.gov/). Three different data matrices of GE, DM, and CNV were prepared for model training and testing. The GE matrix comprised expression values of 20,177 genes of 608 patients from the TCGA-LGG and TCGA-GBM cohorts. Similarly, the DM and CNV matrices had DNA methylations (β-values) and gene copy numbers (gain and loss) of 1331 and 88 chromosomal locations, respectively, obtained from 608 and 550 patients. In each data matrix, gene expressions, methylation, and CNV were considered features and patients were considered samples. Tumor samples of high- and low-grade glioma patients were classified into four classes based on their histopathological properties: (i) Astrocytoma (Grade-II/III), (ii) Astrocytoma (Grade-IV), (iii) Oligoastrocytoma (Grade-II/III), and (iv) Oligodendroglioma (Grade-II/III) (Dong et al., 2016). In the GE and CNV data matrices, tissue samples from non-malignant patients (Normal) were also included as a separate class. The inclusion of normal tissue samples in these data sets made the GE and CNV models capable of distinguishing malignant from non-malignant samples using transcriptome and CNV data. Different bioinformatics pipelines were used to prepare the input data matrices (GE, DM, CNV) to train individual SNN models. In the TCGA-LGG and TCGA-GBM cohorts, high-throughput mRNA sequencing was performed on normal and tumor samples to generate the transcriptome profile of each patient.
Raw sequencing reads obtained from RNA-seq were mapped to the human reference genome GRCh38 using the STAR aligner algorithm

(Dobin et al., 2013). After alignment, whole transcriptome reads were quantified using the HTSeq-count algorithm to measure gene expressions in terms of transcript counts (Anders et al., 2015). Following this, raw counts of all transcripts were further transformed to Transcripts per Million (TPM) values (Eq. 1). The GE data matrix was populated by TPM values of all genes corresponding to each tissue sample. By definition, the summation of TPM values of all genes in a given tissue sample should be 1 million. The TPM values are calculated as

$$\mathrm{TPM}_g = \frac{C_g \times 10^{3} / L_g}{\sum_{g=1}^{N} \left( C_g \times 10^{3} / L_g \right)} \times 10^{6} \qquad (1)$$

where $g$ denotes the gene; $C_g$ denotes the total counts of reads aligned to gene $g$; $L_g$ denotes the union length of all exons of gene $g$; and $N$ is the total number of genes. Illumina Infinium Human Methylation 27 (HM27) and HumanMethylation450 (HM450) arrays were utilized to measure the methylation at known CpG sites of the genome in tumor tissue samples. Methylation values (aka β-values) of known CpG sites are calculated from the probe intensities of the array using Eq. (2).

$$\beta = \frac{M}{M + U}, \quad \text{where } M > 0,\ U > 0. \qquad (2)$$

Here β is bounded between 0 and 1. Variables M and U represent the intensities of the methylated and unmethylated probes in the array (Weinhold et al., 2016). β values of CpG sites of the genome of 608 low- and high-grade glioma patients were considered in the DM matrix. The Affymetrix SNP 6.0 (SNP6) array data were used to identify repeated genomic regions in the genome and calculate copy numbers of these repeats in the TCGA-LGG and TCGA-GBM cohorts. Numeric copy number variation (CNV) of genes was estimated using the GISTIC2 pipeline (Mermel et al., 2011). CNVs were estimated for protein-coding genes only. The numeric CNVs were then thresholded using a noise cutoff value of 0.3. The following criteria were applied to categorize the numeric CNV value of a given gene as copy number gain, loss, or neutral:
(i) Genes with CNV < −0.3 are categorized as copy number loss and denoted by the numeric value −1.
(ii) Genes with CNV > +0.3 are categorized as copy number gain and denoted by the numeric value +1.
(iii) Genes with −0.3 ≤ CNV ≤ +0.3 are categorized as neutral and denoted by the numeric value 0.
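The three numerical transformations described in this subsection can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the pipeline used for the TCGA data: the function names and the toy counts and lengths are hypothetical, and the direction of the CNV inequalities (loss below −0.3, gain above +0.3, neutral in between) is an assumption based on the 0.3 noise cutoff stated above.

```python
import numpy as np

def tpm(counts, lengths_bp):
    """Eq. (1): convert the raw read counts of one sample to TPM."""
    counts = np.asarray(counts, dtype=float)
    lengths_bp = np.asarray(lengths_bp, dtype=float)
    rate = counts * 1e3 / lengths_bp       # reads per kilobase of exon length
    return rate / rate.sum() * 1e6         # rescale so the sample sums to 1 million

def beta_value(m, u):
    """Eq. (2): methylation beta-value from probe intensities M > 0, U > 0."""
    m = np.asarray(m, dtype=float)
    u = np.asarray(u, dtype=float)
    return m / (m + u)

def cnv_state(cnv, cutoff=0.3):
    """Discretize numeric CNV into -1 (loss), 0 (neutral), +1 (gain).

    Assumption: loss is CNV < -cutoff and gain is CNV > +cutoff.
    """
    cnv = np.asarray(cnv, dtype=float)
    return np.where(cnv < -cutoff, -1, np.where(cnv > cutoff, 1, 0))

# Hypothetical toy sample with three genes.
counts = [100, 300, 50]
lengths = [1000, 3000, 500]
tpm_values = tpm(counts, lengths)
```

By construction, the TPM values of a sample always sum to one million, which gives a quick sanity check on any GE matrix built this way.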

2.1.2 COVID-19 data

The present study used state-level daily COVID-19 case data collected from the Ministry of Health and Family Welfare (MoHFW), Government of India, from 1st April 2020 to 31st March 2022. In India, COVID-19 tests were conducted using Rapid Antigen Tests (RATs) and the Reverse-Transcriptase Polymerase-Chain-Reaction (RT-PCR) methodology. The RT-PCR test is more sensitive than RATs for detecting the COVID-19 virus in the human body, whereas RATs are more useful for understanding disease spread and for proactive health measures because they are faster and less expensive than the RT-PCR test.

2.1.3 Meteorological data

2.1.3.1 Gridded meteorological data

The state-level daily meteorological parameters, namely specific humidity, average temperature, and maximum and minimum temperature, were extracted from NCEP/NCAR reanalysis gridded data (Kalnay et al., 1996) (https://psl.noaa.gov/). These meteorological parameters are used to train the multivariate LSTM model to predict COVID-19 cases in India. The rainfall data used for the prediction of crop yield were collected from the India Meteorological Department (IMD) gridded data for the period 1961–2016.

2.1.3.2 Station-level meteorological data

The Council of Scientific and Industrial Research (CSIR), India has established several meteorological towers in different parts of the country for collecting some of the major meteorological parameters. These towers collect data at three heights: 2, 20, and 30 m. We predicted wind speed at the New Delhi and Bengaluru stations set up by CSIR. Four meteorological parameters, temperature (T), pressure (P), humidity (H), and wind speed (V), averaged over 30 min intervals, were collected at 20 m height from 2010 to 2013 for the wind speed prediction study. Fig. 1 shows sample data of daily wind speed (averaged from 2010 to 2013) measured at the New Delhi and Bengaluru stations in India.

2.1.3.3 Crop production data

The rice production data for different Asian countries from 1961 to 2016 were collected from the Food and Agriculture Organization (FAO) to predict rice production in India.

2.1.4 Image data

The magnetic resonance images (MRI) of the human brain and histopathology images of glioma tissue samples (stained by Hematoxylin and Eosin) of

FIG. 1 Annual Cycle of wind speed (m/s) at New Delhi and Bengaluru stations in India. The daily wind speed is averaged for the period 2010–13 and presented in the figure.

TCGA-LGG and TCGA-GBM cohorts were used to construct deep convolutional neural network (CNN) models for predicting cellular/histopathological grades, molecular subtypes, mutation status of driver genes, and survival of glioma patients. Pre- and post-contrast T1-weighted MR images and whole-slide histopathology images (WSI) of glioma patients were collected from The Cancer Image Archive (TCIA) (https://www.cancerimagingarchive.net/). Metadata (e.g., tumor grades, molecular subtypes, survival) and somatic mutations of the selected cancer driver genes EGFR and TP53 of all patients were also obtained from the TCGA data portal (https://portal.gdc.cancer.gov/). MRI and histopathology images of glioma patients with tumor grades, molecular subtypes, mutation status of EGFR and TP53 genes, and overall survival (OS) data were used for training the CNN models. Given these criteria, MR and WS images of 1152 patients were finally considered for training and testing of the CNN models. MR images were available in Digital Imaging and Communications in Medicine (DICOM) format (Guthrie et al., 2001). DICOM images could not be fed directly to the image-processing and training pipeline; hence these images were converted to PNG format and then passed through several filtering steps to eliminate potential noise. De-noising MR images is a crucial step for enhancing

image quality without compromising the integrity of the whole image (Coupe et al., 2008). Different de-noising algorithms are available, such as Gaussian blur, Median blur, Anisotropic Diffusion Filter, Total Variation, etc. (Coupe et al., 2006). Literature evidence suggests that the non-local means (NL means) algorithm, based on the neighborhood filtering technique, outperforms conventional filtering algorithms and increases the signal-to-noise ratio (SNR) of the MR image (Coupe et al., 2006). Following the de-noising step using the Fast NL means algorithm on the converted PNG images, all images were resized to 128 × 128 pixels and labeled with tumor grades, molecular subtypes, EGFR and TP53 gene mutation status, and overall survival in months.

Whole-slide histopathology images (WSI) were available in SVS format from TCIA. SVS is a digital file format created by the Aperio ScanScope slide scanner (Arunachalam et al., 2017). It follows a “pyramid” structure that stacks stained tissue images at various resolutions. Typically, SVS images are too large to be used directly for model training. WS images also often contain backdrops, shadows, wetness, smudges, and pen traces, which must be filtered out before the images are used as input to a CNN. Therefore, a segment or region of interest (ROI) must be identified within each WSI to reduce such issues. First, whole-slide tissue images at 20× magnification were extracted from the SVS files. Otsu’s threshold method was then applied for segmentation, which helped to extract a better HVS (Human Visual System) image from the WSI. This might lead to more accurate and faster model training. After that, the segmented image (ROI) of a given slide was divided into distinct tiles of 512 × 512 pixels, provided that at least 90% of each tile was covered by tissue. To select the best tiles for model training, the top 10 tiles with the maximum number of nuclei were selected.
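The thresholding and tiling steps described above can be illustrated with a minimal NumPy sketch. It is not the pipeline used in this study: the function names are hypothetical, the sketch assumes an 8-bit grayscale slide image in which stained tissue is darker than the bright background, and it covers only Otsu's threshold and the 90% tissue-coverage filter. The nuclei-based ranking of tiles is not included.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the 8-bit level that maximizes the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    w0 = np.cumsum(prob)                     # weight of the class "<= t"
    w1 = 1.0 - w0                            # weight of the class "> t"
    mu = np.cumsum(prob * np.arange(256))    # cumulative first moment
    mu_total = mu[-1]
    valid = (w0 > 0) & (w1 > 0)              # skip levels with an empty class
    between = np.zeros(256)
    between[valid] = (mu_total * w0[valid] - mu[valid]) ** 2 / (w0[valid] * w1[valid])
    return int(np.argmax(between))

def tissue_tiles(gray, tile=512, min_tissue=0.9):
    """List (row, col, tissue_fraction) of tiles with enough tissue coverage."""
    tissue = gray <= otsu_threshold(gray)    # assumption: tissue is darker than background
    kept = []
    for r in range(0, gray.shape[0] - tile + 1, tile):
        for c in range(0, gray.shape[1] - tile + 1, tile):
            frac = tissue[r:r + tile, c:c + tile].mean()
            if frac >= min_tissue:
                kept.append((r, c, float(frac)))
    return kept

# Hypothetical 8 x 8 "slide": dark tissue on the left, bright background on the right.
demo = np.full((8, 8), 220, dtype=np.uint8)
demo[:, :4] = 50
demo_tiles = tissue_tiles(demo, tile=4)      # small tile size for the toy image only
```

On a real slide the same function would be called with `tile=512`; the small tile size here only keeps the toy example readable.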
A nuclear segmentation algorithm was used to count the total number of nuclei in a given tile of a WSI. Finally, 5000 tiles (512 × 512 pixels) with maximum nuclei were labeled with distinct classes and used as input for model training.

In Deep Learning, a digital image is an important source of information because it provides a large set of pixel data. For a computer, each image is just a set of matrices containing numeric values. A digital image is a two-dimensional or three-dimensional array of picture elements called pixels. Fig. 2 shows an image of 8 × 8 pixels, where each box denotes the numerical value corresponding to a particular color. The size of a digital image is given by its resolution, normally defined as its width and height in pixels. A binary (black and white) or grayscale image is represented by a single two-dimensional rectangular array, whereas a color image contains several such arrays of pixels, called channels. Thus, a binary or grayscale image can be stored in a matrix of size M × N. Another important parameter to define a digital image is the

FIG. 2 Sample image of 8 × 8-pixel resolution, where each box denotes the numerical value corresponding to a particular color.

intensity value or brightness of each pixel in the image. Intensity is the distinguishing feature of each pixel; it helps in differentiating an object from the background or one object from another in an image. For example, if every pixel of a black and white or grayscale image has the same intensity value, the image appears as a uniform shade: all white, all gray, or all black. Color images are formed by combining three two-dimensional matrices that store the intensities of three different colors, Red, Green, and Blue, abbreviated as RGB. Fig. 3 depicts a 16 × 16 pixel image in grayscale as well as in RGB. The figure shows that although the grayscale image is a two-dimensional matrix of numeric values, the RGB image is a three-dimensional matrix that can be split into Red, Green, and Blue channels containing the color information. The color quality of an image depends on the number of bits used to store the intensity value of each pixel. Normally, a grayscale image or each color channel of an RGB image stores the intensity value in an 8-bit variable that can hold 256 (i.e., 2^8) distinguishable values.

DL algorithms work by automatically detecting the prominent features of an object in an image. Thus, for such an algorithm, each pixel of the image and the specific patterns of consecutive pixels in the sample images provide the raw input for learning. Different mathematical and computing operations

FIG. 3 Schematic diagram of Grayscale and RGB Images along with their underlying numerical data matrices.

like convolution, pooling, etc., on pixel data are then applied to identify and filter the most common features present across the sample images. These features are formed by different combinations of consecutive pixels in the training images. The weights and biases of the neural network are then adjusted so that the learned features in an image are first highlighted by the algorithm; classification or object detection then takes place based on how many of these features match those learned from the training data.
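The array view of images and the two operations named above can be made concrete with a short NumPy sketch. The image, the kernel, and the function names are hypothetical and chosen only for illustration: the edge kernel is hand-picked, whereas a CNN learns its kernels during training, and the sliding-window filter is written as a plain loop for readability rather than speed.

```python
import numpy as np

# A 16 x 16 grayscale image is an M x N matrix of 8-bit intensities (0..255).
gray = np.zeros((16, 16), dtype=np.uint8)
gray[:, 8:] = 255                             # left half black, right half white
# An RGB image stacks three such matrices (channels) into an M x N x 3 array.
rgb = np.stack([gray, gray, gray], axis=-1)

def conv2d(image, kernel):
    """Valid-mode sliding-window filter (cross-correlation, as in CNN layers)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature, size=2):
    """Non-overlapping max pooling over size x size windows."""
    h = feature.shape[0] // size * size
    w = feature.shape[1] // size * size
    blocks = feature[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Hand-picked 3 x 3 kernel that responds to dark-to-bright vertical edges.
vertical_edge = np.array([[-1.0, 0.0, 1.0]] * 3)
edges = conv2d(gray.astype(float), vertical_edge)   # feature map, shape (14, 14)
pooled = max_pool(edges)                             # downsampled map, shape (7, 7)
```

The feature map is non-zero only in the columns where the window straddles the black-to-white boundary, and pooling keeps that response while halving each spatial dimension.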

2.2 Methodology

This section describes the various DL methodologies developed and adopted for the present case studies.

2.2.1 Transfer learning

Deep learning relies on the availability of a huge dataset to achieve the desired level of accuracy. However, in some problem domains it is often not possible to collect a labeled, curated dataset of that size. Transfer Learning (TL) is a deep learning method designed for this situation. In TL, an existing deep learning model pre-trained on a similar dataset is used as the starting point and is fine-tuned for the related problem. The features already learned by the different layers of the neural network, such as lines, curves, and other patterns, make it easier to learn higher-level concepts or features for the new but similar problem domain.


The traditional DL approach develops a model from scratch based on the problem statement or domain application. Before developing a domain-specific model from scratch, we should consider two main problems associated with it. First, this kind of ML process is computationally expensive; second, it requires a huge dataset for the model to learn the features associated with the domain. Moreover, dataset creation and data labeling are themselves time- and resource-intensive processes. These problems can be mitigated by applying knowledge from another domain to the task domain, a method known as knowledge transfer or transfer learning. The transfer learning approach reduces the need for a large set of labeled training data and makes the training process more efficient in terms of resource utilization. Rather than starting from scratch each time, past knowledge can be leveraged from an existing model, and a new model can then be built iteratively through hyper-parameter optimization and fine-tuning of the existing model. Classification, regression, clustering, and certain other problems are solved using learning algorithms. As mentioned earlier, when given a problem to tackle, we use any of the known algorithms or concepts to create a model tailored to the task at hand. We train a separate model for each task, regardless of how similar the tasks, datasets, or target variables are. This differs from how we, as human beings, use the knowledge gained from one task to address a related problem. TL is a notion that allows trained models to share their previously gained knowledge to achieve better results (Pan and Yang, 2009). It provides a framework for extending the capabilities of existing systems to comprehend and share information across tasks.
Assume we need to create a classifier model that can distinguish between defective machine components, machine components that require some machining, and non-defective machine components. For this type of problem, we first need a component dataset to train the model by giving the deep learning algorithm labeled examples of the expected appearance and features of such components. An experienced machine learning practitioner might use computer vision methods such as convolutional neural networks (CNNs) to solve this classification problem. The problem statement, however, has a twist, as it does in most real-world instances. We are given a dataset that is not only small (a few thousand items) but also has an imbalanced class distribution (the number of instances per class varies widely; for example, the first class may have far fewer samples than the other classes). If we had enough data samples to adequately represent each class, constructing a CNN-based classifier would have worked well. However, with a small, unbalanced dataset, the classifier is bound to perform poorly. Consider the standard configuration of training samples and a CNN model.


Because the dataset is small and unbalanced, the results are not promising. In such situations, TL can utilize the knowledge gained from other similar domains as a starting point for re-training the model with the real but small dataset. Instead of developing a new convolutional network from scratch every time, we can use one of the many state-of-the-art models trained on a similar dataset, such as ImageNet. The ImageNet dataset has millions of images associated with different categories, which also include industrial images, and the pre-trained models (EfficientNet-B0, VGG16, ResNet50, and others) have been tuned to perform well on ImageNet. This suggests that these models have filters that have been well trained to recognize various features of an image. It would be simpler to adapt such pre-trained models to a different but comparable domain and train a model that performs better despite a smaller target dataset and imbalanced class distribution. As a result, we use a pre-trained model, such as EfficientNet-B0, that was trained on ImageNet as the source dataset to transfer learning to the component class dataset. In comparison to a CNN trained from scratch, this allows us to obtain a performance boost. Fig. 4 represents the conceptual difference between deep learning and transfer learning. We may categorize transfer learning into three major types based on the approach used: inductive, transductive, and unsupervised transfer learning, as shown in Fig. 5. In inductive transfer learning, some labeled data in the target domain is necessary to generate an objective predictive model for use in the target domain. In this scenario, the inductive transfer learning setting is analogous to a multitask learning environment.
Inductive transfer learning, on the other hand, simply strives to improve performance in the target task by transferring knowledge from the source task, whereas multitask learning tries to learn both the target and source tasks at the same time. There are instances when there are no labeled data present in

FIG. 4 Schematic diagram of the conceptual difference between conventional deep learning and transfer learning.


FIG. 5 Schematic diagram of transfer learning methodology based on source and target domain availability.

the source domain. In this case, a self-taught learning approach applies, which implies that the source domain's side information cannot be used directly. As a result, it is similar to inductive transfer learning in the absence of labeled data in the source domain. Self-taught learning has fewer limitations on the sort of unlabeled data that may be used, and it is considerably easier to use in many practical applications (such as image or text categorization) than traditional semi-supervised or transfer learning approaches. In transductive learning, the learned model cannot be reused for future data, and one has access to unlabeled testing data at training time (Lu et al., 2015). As a result, when new test data arrives, it must be categorized together with the old data. We can use the patterns and additional information included in the data during the learning process even if we do not know the labels of the testing datasets. Transduction does not create a predictive model. If a new data point is added to the testing dataset, we must restart the procedure, train the model, and then use it to predict labels. When an input stream introduces fresh data points, transductive learning can become expensive, because everything must be re-run every time a new data point arrives. This limits the use of transductive learning in scenarios where data instances are revised at regular intervals.


Unsupervised transfer learning, on the other hand, focuses on performing unsupervised learning tasks in the target domain, including clustering, dimensionality reduction, and density estimation (Zhuang et al., 2020). There are no labeled data in the source and target domains in the unsupervised transfer learning environment. There has been little research done on this configuration so far.

2.2.2 Federated learning

Machine learning and deep learning methods are widely used to extract knowledge from health data, whether EHR or image datasets, which improves the decision-making process in healthcare. ML-based approaches require a learning model trained to find patterns and extract the desired knowledge from raw data, and the accuracy of the ensuing analytic approach depends on this training stage. Such approaches need high-quality learning models trained on broad and extensive datasets, which are difficult to obtain because of the sensitive nature of patients' medical data. To ensure the quality of the ML model, training must be performed on a big dataset with a diverse set of samples. The quantity of data items, the variety of the samples, and the quality of the dataset annotation indicating the expected classification all influence the training outcomes. Two problems arise. First, obtaining or creating such a dataset is frequently a time-consuming and costly process: in the traditional approach, several parties acquire data, send it to a central data repository, and then combine it to create a model. Second, healthcare data is very sensitive, as it contains patient information such as ID, card details, type of disease, and treatment. Traditional centralized machine/deep learning can therefore violate laws such as the General Data Protection Regulation (GDPR) of the European Union, the California Consumer Privacy Act (CCPA), and the Health Insurance Portability and Accountability Act (HIPAA). The difficulties and limits of medical data sharing are a key barrier to the implementation of modern machine learning algorithms for healthcare. So, the question is how to implement a machine/deep learning model without sharing healthcare data. Methods like distributed machine learning cannot guarantee that data sharing is restricted, and they need more bandwidth to implement. We need a sophisticated method that can control data sharing and also reduce the overhead in the network.
It should also be able to manage non-IID (not independent and identically distributed) data. Combining FL with health prediction is one of the most effective ways to break down analytical barriers between hospitals. Federated learning (FL) avoids the problem of sharing data and maintains the privacy of healthcare records. The learning process is managed by a central entity, which sends the training algorithm (the initial global model) to each participating data holder. Each participant builds a local model using its own data and shares the parameters with the central


organization. Finally, the central entity uses an aggregation process to integrate the parameters of all local models into a single global model. The electronic health record (EHR), referred to below as the electronic medical record (EMR), is one of the main datasets and contains many meaningful clinical concepts. It is used to predict disease incidents, patient response time to treatment, and other healthcare events. Classic machine-learning mechanisms have been applied to EMRs, and while they have been effective, they rest on the misleading assumption that EMRs can simply be stored and shared in centralized locations. Because EMRs are created by patients in numerous healthcare institutions and clinics, this assumption is erroneous. Because of the sensitivity of EMRs, typical machine learning processes are impractical: there are issues with EMR storage, security, privacy, cost, and the availability of medical data exchange. Huang et al. (2019) address the problem of healthcare data collection and the privacy concerns of these datasets. In practice, healthcare data are not identically distributed, and it is challenging to train on such data with traditional machine learning methods. The authors address this problem by introducing a community-based federated machine learning (CBFL) algorithm and evaluating it on non-IID ICU EMRs. Their federated framework trains the model across different geographical locations and captures similar diagnoses, learning a model for each community. COVID-19 identification relies heavily on artificial intelligence. With chest X-ray images, computer vision and deep learning algorithms can help determine COVID-19 infection. Protection and privacy of these types of data are necessary, and a hospital's sensitive medical information must not be leaked or shared without consent. It is therefore difficult to obtain such training data, and because of the lack of data, deep learning models do not perform well in real-world applications.
This problem can be addressed by applying federated learning without sharing the local data. Yan et al. (2021) use federated learning to classify COVID-19 patients from chest X-ray images. The authors use COVIDx, an open-access dataset compiled from various authenticated databases that contains the largest number of COVID-19 pneumonia CXR images. They distributed the dataset among four nodes in both IID and non-IID patterns. MobileNet, COVID-Net, and ResNet are used to train these models. The results show that ResNet-18 gives the highest accuracy, 96%, in the IID pattern at the 90th communication round, with a fraction of 0.5 of the clients participating in each round, whereas traditional centralized learning takes 88 epochs to produce a model for inference. The FL model takes more time than the traditional model, but the resulting global model generalizes more broadly and does not overfit at inference. Feki et al. (2021), on the other hand, proposed a similar method that uses federated learning on a COVID-19 dataset. This platform is decentralized and collaborative, allowing doctors all across the world to benefit from rich


private medical data exchange while maintaining their privacy. The authors used the same dataset as above, available at https://github.com/ieee8023/covid-chestxray-dataset; however, this dataset has two classes, COVID and non-COVID. They randomly split the dataset to make it non-identically distributed and, since the dataset is small, applied data augmentation operations to artificially expand the size of the training and test subsets. Four clients are chosen, and the data are distributed among them. Li et al. (2019) use a federated learning technique with differential privacy methods to protect the model training updates during communication. The authors implement and evaluate realistic federated learning systems for brain tumor segmentation. The BraTS 2018 dataset, which comprises multi-parametric pre-operative MRI images of 285 brain tumor patients, is used. To offer robust protection against indirect data leakage, they use a selective parameter update and the sparse vector technique (SVT). The scans came from 13 different institutions, each with its own set of equipment and imaging techniques, resulting in a wide range of image feature distributions. To make the federated setup more realistic, the authors divided the training set into thirteen distinct subsets based on the provenance of the image data and assigned each to a federated client. This makes the setup more robust and realistic from a medical perspective. This type of setup is challenging compared to the data-centric approach because of severe domain shift and overfitting issues, and it also makes the data more imbalanced. The authors obtained accuracy close to that of the data-centric model on an unbalanced dataset. Lee et al. (2018) proposed a federated patient hashing framework to detect similar patients scattered across different hospitals without sharing patient-level information.
This patient-matching method could help doctors summarize general characteristics and guide them to treat patients with more experience. The goal of patient similarity learning is to create computer algorithms for identifying and locating patients who are clinically comparable to a query patient in a given clinical situation. This type of study is beneficial to the biomedical field. The authors designed a cross-institutional framework to select such data, which enables large-scale data gathering without sharing data. A federated framework supports disease surveillance and clinical trial recruitment: a patient similarity search can be performed across all clinical institutions to find out where the relevant patients are, and recruitment can then concentrate on the appropriate healthcare facilities. Chen et al. (2020a,b) propose the first federated transfer learning framework, FedHealth, to solve the problem of data sharing with personalization. The proposed method addresses wearable activity recognition, where many wearable devices join the learning with the data held on the devices themselves. FedHealth uses federated learning to aggregate data and then uses transfer learning to create personalized models. It is capable of providing precise and individualized treatment while maintaining privacy and security. The method solves the data island and personalization problems. This makes a strong model for


the wearable devices dataset without compromising the sensitive information generated by wearable devices. This framework is constantly updated to accommodate new user data. When faced with fresh user data, FedHealth may update both the user model and the cloud model at the same time. As a result, the more time a consumer spends with the product, the more personalized the model might become (Table 1).
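The federated workflow described in this section (a central entity distributes the global model, participants train locally on private data, and only parameters are aggregated) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any of the cited studies: a single scalar weight stands in for the network parameters, the four clients hold made-up non-IID data generated from y = 2x, and the aggregation is a sample-size-weighted average in the style of federated averaging.

```python
import random


def local_update(w, data, lr=0.05, epochs=5):
    """One client's local training on its private (x, y) pairs for y = w * x."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w, len(data)  # only the parameter and sample count leave the client


def aggregate(updates):
    """Sample-size-weighted average of the client parameters."""
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


def federated_training(clients, rounds=50, fraction=0.5, seed=0):
    """Run communication rounds with a fraction of clients taking part in each."""
    rng = random.Random(seed)
    global_w = 0.0  # initial global model sent out by the central entity
    k = max(1, int(fraction * len(clients)))
    for _ in range(rounds):
        chosen = rng.sample(clients, k)
        updates = [local_update(global_w, data) for data in chosen]
        global_w = aggregate(updates)
    return global_w


# Four hypothetical clients; each sees a different range of x (non-IID inputs).
clients = [
    [(x, 2.0 * x) for x in xs]
    for xs in ([0.0, 0.5, 1.0], [1.0, 1.5, 2.0], [2.0, 2.5, 3.0], [3.0, 3.5, 4.0])
]
global_weight = federated_training(clients)
```

The raw `(x, y)` pairs never leave `local_update`; the central loop sees only the returned weight and sample count, which is the property that makes the approach attractive for hospital data.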

2.2.3 Long short-term memory (LSTM)

LSTM is popular for efficiently handling time series data. LSTM has two states, i.e., cell state and hidden state, which are responsible for updating the information at every timestamp. There are three gates, i.e., forget gate, input gate, and output gate, as shown in Fig. 6, which help the cell state and hidden state to update and propagate information to the next timestamp. The forget gate is responsible for removing unwanted information, whereas the input gate is responsible for adding useful information to the cell state. The hidden state is updated with the help of the output gate.

$$i_n = \sigma\left(W_i I_n + U_i h_{n-1} + b_i\right) \tag{3}$$

$$f_n = \sigma\left(W_f I_n + U_f h_{n-1} + b_f\right) \tag{4}$$

$$o_n = \sigma\left(W_o I_n + U_o h_{n-1} + b_o\right) \tag{5}$$

$$C_n = f_n * C_{n-1} + i_n * \tanh\left(W_c I_n + U_c h_{n-1} + b_c\right) \tag{6}$$

$$h_n = \tanh\left(C_n\right) * o_n \tag{7}$$
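Equations (3) to (7) amount to one update step of an LSTM cell. The sketch below writes that step in plain Python with list-based vectors and matrices; the function names, the dictionary keys for the W, U, and b parameters, and the all-zero parameter initializer are assumptions made for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-based matrices and vectors."""
    return [
        sum(w * xj for w, xj in zip(W[r], x))
        + sum(u * hj for u, hj in zip(U[r], h))
        + b[r]
        for r in range(len(b))
    ]


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step: returns the new hidden state and cell state."""
    i = [sigmoid(z) for z in affine(p["Wi"], x, p["Ui"], h_prev, p["bi"])]    # Eq. (3)
    f = [sigmoid(z) for z in affine(p["Wf"], x, p["Uf"], h_prev, p["bf"])]    # Eq. (4)
    o = [sigmoid(z) for z in affine(p["Wo"], x, p["Uo"], h_prev, p["bo"])]    # Eq. (5)
    g = [math.tanh(z) for z in affine(p["Wc"], x, p["Uc"], h_prev, p["bc"])]  # candidate
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]        # Eq. (6)
    h = [math.tanh(ck) * ok for ck, ok in zip(c, o)]                          # Eq. (7)
    return h, c


def zero_params(n_in, n_hidden):
    """Hypothetical all-zero parameters, handy for checking the equations by hand."""
    p = {}
    for gate in "ifoc":
        p["W" + gate] = [[0.0] * n_in for _ in range(n_hidden)]
        p["U" + gate] = [[0.0] * n_hidden for _ in range(n_hidden)]
        p["b" + gate] = [0.0] * n_hidden
    return p


# With zero parameters every gate equals 0.5 and the candidate is 0,
# so the cell state is simply halved.
h, c = lstm_step([1.0], [0.0], [2.0], zero_params(1, 1))
```

The products in Eqs. (6) and (7) are element-wise, which is why they appear here as per-element loops rather than matrix products.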

Weight matrices of the forget gate, input gate, output gate, and cell state at the current and previous timestamps are represented by $W_f$, $W_i$, $W_o$, $W_c$ and $U_f$, $U_i$, $U_o$, $U_c$, respectively. Additionally, $b_f$, $b_i$, $b_o$, $b_c$ denote the bias vectors of the forget gate, input gate, output gate, and cell state, respectively. $C_n$, $h_{n-1}$, $\sigma$, and $\tanh$ represent the cell state, the hidden state of the previous timestamp, and the sigmoid and hyperbolic tangent activation functions, respectively. The input to the LSTM network is represented by $I_n$.

2.2.3.1 Time division LSTM

It is observed that not only the data but also the way the data are fed into the LSTM affects the overall outcome. This approach is based on different arrangements of the input data sequence, which enable the discovery of patterns with different frequencies and thereby yield an additional set of patterns that improves the overall results. Generally, in machine learning, multiple models with different algorithms on the same dataset, or the same algorithm on different datasets with the same objective, are combined to form a standard ensemble methodology whose aim is to reduce variance and bias and to enhance prediction capability. In this approach, LSTM is used as the core algorithm, with different input sequences, which imply different input characteristics, used to build each ensemble member.

TABLE 1 Summary of the methods and datasets used in the literature discussed.

| Author | Dataset used | Method | Application domain | Pros |
|---|---|---|---|---|
| Huang, Shea | eICU collaborative research database containing ICU data of 200,859 patients; 50 hospitals with 590 patients in each hospital | Community-based federated learning for each hospital to train its data using encoder-decoder methods | EHR healthcare | In comparison to the baseline FL model, it converged to better prediction accuracy in fewer communication cycles. Also, FL makes different geographical models |
| Boyi Liu | X-ray radiography dataset containing 15,872 images. Federated learning framework for Covid dataset | Cross-silo FL on edge devices using ResNet, CNN, and MobileNet. Horizontal federated learning is used | X-ray image (Covid) classification | Federated model gives accuracy close to that of the classical method while preserving privacy |
| Ines Feki | https://github.com/ieee8023/covid-chestxray-dataset | Federated transfer learning is used with ImageNet and ResNet | Covid chest X-ray image classification | FL method gives better performance on the non-IID pattern compared to the centralized one. This is useful in medical applications where data is distributed in an unbalanced manner |
| Wenqi Li | BraTS brain tumor test dataset | Federated learning with differential privacy | Brain tumor segmentation | Accuracy close to that of the data-centric approach, with differential privacy to secure the weight sharing |
| Lee et al. | EHR | Privacy preserving using hashing and federated learning | Patient similarity matching | Solves cross-institute disease surveillance using the FL mechanism |
| Yiqiang Chen | UCI public human activity recognition dataset | Federated transfer learning with homomorphic encryption | Wearable devices with activity recognition | Personalization of wearable devices. FedHealth combines data from several companies without compromising privacy protection, and uses knowledge transfer to create individualized model learning |


FIG. 6 Architecture of Long Short-Term Memory network.

FIG. 7 Architectural diagram of Variant Input Sequence model network.

As shown in Fig. 7, the model consists of two levels of LSTMs arranged hierarchically. In the first level, there are multiple LSTMs, each of which picks up one part of a differently arranged input data sequence and is trained to give an optimized output. In the second level, these outputs are fed into the second-level LSTM to produce the final output. In the Time Division type, the whole input dataset is divided into multiple data chunks of subsequences based on variant interval values among these subsequences. The following notation is used to describe the different elements of this time series sequence.

p = number of different inter-dependent parameters considered as input
n = number of timestamps in the input sequence
m = number of timestamps ahead at which the prediction is made

An input parameter value can be represented as $V_i$, where $1 \le i \le p$. An input sequence can be represented as $I = (I_0, I_1, I_2, \ldots, I_n)$. In the case of the univariate model, the number of input parameters is 1, whereas in the case of the multivariate model, each $I_j$ can contain a


sequence of multiple parameters arranged in an order. Thus, it can be expressed as $I_j = (V_{j1}, V_{j2}, V_{j3}, \ldots, V_{jp})$. Data are captured at regular intervals, and the consecutive timestamps can be represented as $\ldots, t_{i-1}, t_i, t_{i+1}, \ldots$

Different modes of LSTM are categorized in terms of the different time sequences of the input data. With the hyper-parameters, such as the number of hidden layers and the number of neurons in each hidden layer, kept the same in each of these LSTM models, the models have been compared using different measures. In the Time Division LSTM modeling approach, multiple basic non-ensemble modes of LSTM are used in the first layer, and their outputs are then fed into another LSTM in the second layer. So, before explaining the details of the different techniques for distributing input data, a few basic building blocks of this approach are discussed below. The model is designed so that it can also serve as a multivariate time-series predictive model by considering a list of interdependent variables. The algorithm is presented in four modes: the first two are non-ensemble modes with different input characteristics, and the other two are ensemble modes, as discussed below.

LSTM1: The first model is a simple LSTM in which the input parameters are arranged one after the other. Single-instance values of all these parameters are input to give an output at a given advanced time step. This first mode is non-ensemble and non-temporal, as it contains the input of a single timestamp. This model is named LSTM1 and is shown in Fig. 8A. Although it is not used directly, it is treated, for explanation purposes, as the base for the further LSTM variations discussed later.

LSTM2: In the second non-ensemble mode, the input sequence is $I = (I_0, I_1, I_2, \ldots, I_n)$, where n is the number of timestamps and each $I_j$ contains a subsequence of the data of all participating parameters at a particular timestamp, arranged in a fixed order, which can be expressed as $I_j = (V_{j1}, V_{j2}, V_{j3}, \ldots, V_{jp})$. So, in total, an input sequence of length p × n goes as input to this LSTM. This model is named LSTM2 and is shown in Fig. 8B.
Time Division Ensemble Technique is also a robust methodology in which several distinct input sequences are formed from different arrangements of the input sequence by varying the intervals among consecutive data. Let us assume that we have q such data sequences; the steps below are followed to build them:
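The input arrangements used in this subsection can be made concrete with a small sketch. It assumes a hypothetical series of seven timestamps with p = 2 parameters each; `flatten_sequence` illustrates the LSTM2-style layout, and `strided_subsequence` illustrates a variant-interval subsequence of the kind used by the Time Division type. Ordering the strided result from the newest timestamp backwards is an assumption of this sketch.

```python
def flatten_sequence(steps):
    """LSTM2-style input: the p values of each timestamp, placed one after another."""
    return [value for step in steps for value in step]


def strided_subsequence(steps, stride):
    """Keep every `stride`-th timestamp, counting back from the newest one."""
    return steps[::-1][::stride]


# Hypothetical data: timestamps 0..6, each holding p = 2 parameter values.
steps = [(t, 10 * t) for t in range(7)]

flat = flatten_sequence(steps)            # length = p * number of timestamps
every_second = strided_subsequence(steps, 2)
every_third = strided_subsequence(steps, 3)
```

Each stride yields a different view of the same series, so first-level LSTMs trained on these views can pick up patterns at different frequencies.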


FIG. 8 Standard LSTM (A) LSTM1: Multiple variables at single time step (B) LSTM2: Multiple variables at multiple time steps.

1. The first LSTM can directly take up the data sequence used in LSTM2, i.e., the data of the p input parameters at the same timestamp are placed consecutively to make one sub-sequence. Then the sub-sequences corresponding to consecutive timestamps are placed one by one to form the input sequence. This input sequence for the first LSTM can be expressed as $I^{(1)} = (I_n, I_{n-1}, I_{n-2}, \ldots, I_0)$.
2. The second LSTM was built by considering alternate sub-sequences of those used in the first LSTM. The aim is to create different interval values among sub-sequences, leading to a completely different input sequence. This input sequence can be represented as $I^{(2)} = (I_n, I_{n-2}, I_{n-4}, \ldots, I_0)$ (ending at $I_1$ instead when n is odd).
3. The third LSTM was designed by skipping two sub-sequences at a time, which leads to another form of the input sequence. This input sequence can be represented as $I^{(3)} = (I_n, I_{n-3}, I_{n-6}, \ldots, I_0)$ (or equivalent).
4. The process was continued, widening the interval relative to the first LSTM, until the q-th sequence was formed. This input sequence can be expressed as $I^{(q)} = (I_n, I_{n-q}, I_{n-2q}, \ldots, I_0)$ (or equivalent). This input sequence is for the last LSTM in this series.

Since the aim is to predict at, say, time T ahead, which may translate to, say, m timesteps ahead, considering the current timestamp as $t_n$, all these LSTMs

132 Handbook of Statistics

need to predict at timestamp $t_{n+m}$. In a generic way, the predicted value can be expressed as a function of LSTM2 as follows (Eq. 8):

$$V^{(i)}_{n+m} = \mathrm{LSTM2}^{(i)}\left(I_n, I_{n-i}, I_{n-2i}, \ldots, I_{(n \% i)}\right), \quad \text{where } i = 1, 2, 3, \ldots, q \tag{8}$$

Here, i denotes the ith ensemble member of the LSTM. In this study, the upper value of i is q, i.e., the total number of different ensemble datasets. In this mode, the values obtained from all these individual LSTMs are averaged to determine the final prediction value at m timesteps ahead ($V^{f}_{n+m}$), which can be expressed as in Eq. (9):

$$V^{f}_{n+m} = \frac{1}{q} \sum_{i=1}^{q} V^{(i)}_{n+m} \tag{9}$$

This mode of LSTM is named LSTM3 and is shown in Fig. 9A. In LSTM3, the first-level outputs are averaged with the intention of giving equal weightage to all the LSTMs, which in practice may not be appropriate. However, it is easy to implement and can give a rough estimate for simple use cases. Thus, LSTM3 is limited in this regard as it is ignoring the

FIG. 9 Architecture of LSTM3 (A) and LSTM4 (B).


contribution from strong ensemble candidates. To improve this mode of LSTM, a weighted average was considered, in which the contribution of each ensemble member is decided based on its individual capability. The challenge, however, lies in determining the correct weights. This problem was resolved by introducing a new LSTM at the second layer that automatically determines the values of the weights across the hidden links for producing the final outcome. This makes the whole network a hierarchical structure with q LSTMs in the first layer, whose outputs are fed into the second-layer LSTM. This mode of LSTM is referred to as LSTM4. The schematic diagram of LSTM4 is shown in Fig. 9B, and it is represented by Eq. (10):

$$V^{f}_{n+m} = \mathrm{LSTM}\left(V^{(1)}_{n+m}, V^{(2)}_{n+m}, \ldots, V^{(q)}_{n+m}\right) \tag{10}$$

The developed algorithm has been implemented for predicting short-term wind speed at the station level.

2.2.3.2 Multivariate LSTM model for COVID-19 prediction

The Long Short-Term Memory (LSTM) model for the prediction of COVID-19 was built using the vanilla LSTM framework. The LSTM is a very useful tool for solving sequential data prediction problems. It is a special kind of RNN (recurrent neural network) that is capable of learning long-term dependencies. In the present study, training data for the LSTM were created by splitting the long-term (685 days) time series of COVID-19 confirmed cases and weather parameters. The major portion of the data (640 days) was used to train the model, and 45 days of data were used to test it. The control experiment (CTL) was designed with a simple univariate LSTM model trained on the daily COVID-19 case data, and four sensitivity experiments were designed using the multivariate LSTM model, in which training was carried out with daily COVID-19 case data and meteorological parameters, namely specific humidity (CTL_SH), mean temperature (CTL_Tmean), maximum temperature (CTL_Tmax), and minimum temperature (CTL_Tmin). The minimum error method was used to optimize both LSTM models by considering different hyper-parameters (hidden layers), and the optimized models were used for forecasting. State-level daily COVID-19 caseload forecasts (1-day forecast window) were generated for the 13 states considered in the present study using the univariate and multivariate LSTM models for 1st January to 14th February 2022 (45 days) with different leads (lead time: 1–14 days) and were evaluated against laboratory-confirmed case data.
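The data preparation described above can be sketched as a chronological split followed by the construction of (input window, target) pairs for a given lead time. The sketch uses a placeholder series of 685 values, and the window length `n_lags` is a hypothetical choice, since the number of past days fed to the model is not stated here.

```python
def chronological_split(series, n_train):
    """Split a time series without shuffling: earliest part trains, the rest tests."""
    return series[:n_train], series[n_train:]


def make_windows(series, n_lags, lead):
    """Pair each window of n_lags values with the value `lead` steps after it."""
    pairs = []
    for end in range(n_lags, len(series) - lead + 1):
        pairs.append((series[end - n_lags:end], series[end + lead - 1]))
    return pairs


# Placeholder for 685 daily records; a multivariate series would hold tuples
# such as (cases, specific_humidity) in place of the single numbers used here.
series = list(range(685))
train, test = chronological_split(series, 640)

one_day_ahead = make_windows(train, n_lags=7, lead=1)
two_weeks_ahead = make_windows(train, n_lags=7, lead=14)
```

Keeping the split chronological matters for forecasting: shuffling before splitting would let the model train on days that come after the ones it is tested on.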

2.2.4 SNN and CNN

A modified SNN (sequential neural network) model was trained on multi-omics data of low- and high-grade glioma tissues to predict their


histopathological grades. At first, three individual SNN models were created for the transcriptome/gene expression (RNA-Seq.ai), DNA methylation (Meth.ai), and copy number variation (CNV.ai) datasets. Later, a unified SNN model (ALL.ai) was created by combining the molecular data of all three multi-omics datasets. The network architectures, optimum hyperparameters, and most suitable activation functions used in all four SNN models are provided in Table 2. To evaluate the performance of the SNN models, several performance metrics, such as TPR (true positive rate), FPR (false positive rate), F1-score, AUC (area under the curve), and minimum loss, were considered. Models with higher accuracy or precision and minimum loss function values were selected as the optimized models. During each SNN model training, 80% of the samples were randomly chosen for training, and the remaining 20% were kept aside from the primary dataset and were not used for training. These 20% samples, also called blind datasets, were used to evaluate the accuracy and performance of the trained SNN models. Samples of the multi-omics datasets were labeled with multiple classes (tumor grades), which was taken into account by calculating the loss function using categorical cross-entropy for the multi-class classification model.
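The evaluation ingredients named in this paragraph can be sketched in a few short functions: a random 80/20 split into training and blind sets, categorical cross-entropy for one-hot labels, and one-vs-rest TPR, FPR, and F1 for a chosen class. The function names, the seed, and the grade labels "G2", "G3", and "G4" are hypothetical and serve only as an illustration.

```python
import math
import random


def train_blind_split(samples, blind_fraction=0.2, seed=0):
    """Randomly hold out a blind set that is never used for training."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_blind = int(round(blind_fraction * len(samples)))
    blind = [samples[i] for i in indices[:n_blind]]
    train = [samples[i] for i in indices[n_blind:]]
    return train, blind


def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean over samples of -sum_k y_k * log(p_k); y_true rows are one-hot."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        total -= sum(tk * math.log(max(pk, eps)) for tk, pk in zip(t, p))
    return total / len(y_true)


def one_vs_rest_metrics(labels, preds, positive):
    """TPR, FPR and F1 when `positive` is treated as the class of interest."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    tn = sum(1 for y, p in pairs if y != positive and p != positive)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    return {"TPR": tpr, "FPR": fpr, "F1": f1}


labels = ["G2", "G2", "G3", "G3", "G4", "G4"]
preds = ["G2", "G2", "G3", "G4", "G4", "G4"]
metrics_g4 = one_vs_rest_metrics(labels, preds, positive="G4")
train_set, blind_set = train_blind_split(list(range(10)))
```

For a multi-class problem such as tumor grading, these one-vs-rest metrics are computed once per class and can then be averaged.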

TABLE 2 The optimum hyperparameters chosen for several deep learning models using SNN.

| Parameters | Range/diff. function varied | RNA-Seq.ai | Meth.ai | CNV.ai | ALL.ai |
|---|---|---|---|---|---|
| Number of hidden layers | 2–15 | 10 | 12 | 9 | 13 |
| Number of dropout layers | 1–5 | 3 | 2 | 2 | 3 |
| Maximum epochs | 10–100 | 75 | 50 | 30 | 80 |
| Number of fully connected layers | 2–15 | 10 | 12 | 9 | 13 |
| Activation functions | ReLU, Tanh, Sigmoid | ReLU | ReLU | Tanh | ReLU |
| Optimizers | SGD, Adam, RMSprop | SGD | Adam | Adam | RMSprop |
| Dropout rate | | 0.2, 0.2, 0.3 | 0.4, 0.5 | 0.3, 0.2 | 0.1, 0.3, 0.5 |
| Initial learning rate | 0.0001–0.01 | 0.0001 | 0.0001 | 0.001 | 0.0001 |
| Mini-batch size | 8–64 | 16 | 8 | 8 | 32 |

The columns RNA-Seq.ai to ALL.ai give the best parameters for each SNN architecture.

Deep learning methods for scientific and industrial research Chapter

6 135

Deep learning using a convolutional neural network (CNN) was applied to MR and whole-slide (WS) histopathological images of tumor samples to predict cellular subtypes/histopathological grades, molecular subtypes, mutation status, and overall survival of glioma patients. In total, five distinct CNN models were trained for these predictions. Pre-processed MR images were used to construct CNN models for predicting histopathological grades/cellular subtypes (Astrocytoma, Oligoastrocytoma, Oligodendroglioma), the mutation status of the EGFR and TP53 genes, and the overall survival of glioma patients. Similarly, CNN models were trained on pre-processed histopathological WS images to predict cellular subtypes and molecular subtypes (i.e., Classical, Mesenchymal, Neural, Proneural). The network architectures, optimized hyperparameters, and activation functions used in these CNN models are shown in Table 3. Dropout and batch normalization layers were also included in these models: dropout randomly removes connections between some layers, and batch normalization normalizes the inputs to a layer, which together help address the overfitting and underfitting problems of the overall networks.
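To make the roles of the two regularization layers concrete, the sketch below applies batch normalization and then inverted dropout to a small batch of activations. It is a simplified illustration under stated assumptions: the learnable scale and shift of batch normalization are omitted, the dropout rate of 0.25 is just one of the values listed in Table 3, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def batch_norm(values: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Normalize a batch of activations to zero mean and (near) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def dropout(values: Sequence[float], rate: float, rng: random.Random) -> List[float]:
    """Inverted dropout: zero each activation with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)
dropped = dropout(normalized, rate=0.25, rng=random.Random(0))
print(normalized)
print(dropped)
```

Dropout is active only during training; at inference time the activations pass through unchanged, which is why the surviving values are rescaled by 1 / (1 − rate) during training.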

3 Applications of DL techniques for multi-disciplinary studies

This section presents the results of various case studies carried out by the authors using the DL techniques at different CSIR laboratories in India.

3.1 Applications of DL models in tumor diagnosis

The overall workflow and applications of the two machine learning models, four SNN models, and five CNN models developed for glioma patients are described in Fig. 10. SNN models of multi-omics data sets (transcriptome, DNA methylation, and copy number variations) and CNN models of MR and histopathological images were applied for diagnosis (determining tumor grades and molecular subtypes), prognosis (assessing overall survival), and prediction of cancer gene mutations of glioma patients. After training and validation, these models were integrated into a cloud-based computation framework named “Oncomechanics,” which is scalable and could be employed in multiple centers, such as cancer diagnostic labs, hospitals, and remote health centers. Appropriate diagnosis (e.g., histopathological grades or cellular subtypes) is vital for deciding the right treatment plans for glioma patients (Chen et al., 2021; Colman, 2020; Schiff, 2017). We have developed ML and DL models for tumor diagnosis using large-scale molecular data, whole-slide histopathological images, and MRI scans of glioma patients. In total, six different deep learning models (using sequential and convolutional neural networks) and two machine learning (ML) models (using the support vector machine (SVM) and random forest (RF) algorithms) were developed (Pellegrino et al., 2021; Sengupta et al., 2019). SVM and RF algorithms were applied only on molecular data sets (transcriptome, methylation, and CNV) to diagnose tumor

TABLE 3 The optimum hyperparameters used for training different deep learning models using CNN.
The five model columns list the optimum parameters used for training the CNN models.

| Parameters | Range/diff. function varied | Cellular subtypes (MRI) | Cellular subtypes (WSI) | Radiomics (MRI) | Molecular subtypes (WSI) | Survival (MRI) |
|---|---|---|---|---|---|---|
| No. of convolutional + ReLU layers | 5–30 | 17 | 13 | 19 | 14 | 15 |
| No. of dropout layers | 2–8 | 4 | 3 | 4 | 3 | 5 |
| Maximum epochs | 10–100 | 75 | 55 | 69 | 92 | 71 |
| No. of fully connected layers | 3 | 3 | 3 | 3 | 3 | 3 |
| No. of convolutional kernels | 8–256 | 16, 32, 64 | 16, 32, 64 | 8, 16, 32, 64, 128 | 8, 16, 32, 64, 128 | 16, 32, 64, 128 |
| Kernel size | (3 × 3)…(11 × 11) | 3 × 3, 5 × 5, 7 × 7, 11 × 11 | 3 × 3, 5 × 5, 7 × 7, 11 × 11 | 3 × 3, 7 × 7, 9 × 9 | 3 × 3, 5 × 5, 7 × 7, 9 × 9 | 3 × 3, 5 × 5, 7 × 7 |
| Pooling layer | Max, Min, Average | Max | Average | Max | Max | Max |
| Pooling layer window size | (2 × 2), (3 × 3), (4 × 4) | 2 × 2, 3 × 3, 4 × 4 | 3 × 3, 4 × 4 | 3 × 3, 4 × 4, 2 × 2 | 3 × 3, 4 × 4, 2 × 2 | 3 × 3, 4 × 4, 2 × 2 |
| Optimizers | SGD, Adam, RMSprop, Adagrad | Adam | SGD | SGD | Adam | SGD |
| Dropout rate | 0.2–0.5 | 0.25, 0.5, 0.3, 0.25 | 0.25, 0.5, 0.3 | 0.3, 0.3, 0.4, 0.5 | 0.3, 0.2, 0.4 | 0.3, 0.5, 0.3, 0.3, 0.4 |
| Initial learning rates | 1e−1,…,1e−5 | 1e−2 | 1e−3 | 1e−1 | 1e−2 | 1e−5 |
| Mini-batch size | 8–256 | 128 | 16 | 32 | 64 | 16 |

FIG. 10 Applications of sequential and convolutional neural networks for diagnosis, prognosis, and radio-genomics predictions of glioma tumors.


grades/cellular subtypes but were not used on image data sets. The performance of all six trained models on the training and blind data sets was measured using the “Area Under the Curve” (AUC) values of the “Receiver Operating Characteristic” (ROC) curves of the respective models (Table 4).
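For readers unfamiliar with the metric, the AUC of a ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way for a binary problem; the function name `roc_auc` and the toy scores are assumptions for illustration. The chapter does not state how the multi-class case was handled, so averaging one-vs-rest AUCs would be only one possible choice.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.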

3.1.1 Performance evaluations of all models trained by numerical data sets for tumor diagnosis

We have evaluated the applicability of all three types of numeric molecular data sets of tumor tissues and the performances of the SNN and two machine learning models for diagnosing glioma tumor tissues. The SVM, RF, and SNN models trained on transcriptome profiles of glioma patients for predicting tumor grades/cellular subtypes had comparable AUCs on the training data and comparable accuracies (ACC) on the blind data. Furthermore, the scores of all three models decreased substantially on the blind data set (ACC: SVM = 0.698, RF = 0.632, SNN = 0.670) compared to the training data set (AUC: SVM = 0.883, RF = 0.893, SNN = 0.758) of gene expression profiles of tumor tissues. The performance of the RF model for predicting tumor grades on the training CNV data set was relatively good (AUC = 0.927) and higher than that of the SVM (AUC = 0.768) and SNN (AUC = 0.805) models. However, the RF model trained on the CNV data set was less accurate (ACC = 0.769) in predicting tumor grades when applied to the blind data set. The performances of the SVM, RF, and SNN models on the training data set of DNA methylation profiles of glioma tumor tissues were also comparable. However, on the blind data set, SVM (ACC = 0.653) and RF (ACC = 0.602) failed to achieve the desired accuracies, whereas the SNN model performed well (ACC = 0.857). These results suggest that SVM, RF, and SNN models trained independently on whole transcriptome and CNV profiles of glioma tumor tissues have limited predictive potential for diagnosing tumor grades of glioma patients. In contrast, the deep SNN model trained on DNA methylation profiles achieved moderate accuracies on both the training and blind data sets and outperformed the SVM and RF models on the blind data (Table 4). Finally, the accuracies of all three models were measured on a combined multi-omics data set created by merging the transcriptomics, CNV, and methylation data sets into a single data set. It was observed that the SNN model performed moderately better than the SVM and RF models in both training (AUC = 0.787) and blind (ACC = 0.846) multi-omics (ALL) data sets (Table 4). These observations suggest that the deep SNN model trained on multi-omics molecular profiles of tumor tissues can outperform widely used ML models (SVM and RF) in predicting tumor grades or cellular subtypes of glioma patients.

3.1.2 Performance evaluations of all models trained by image data sets for tumor diagnosis

We have also assessed the application and performances of deep CNN models for classifying histopathological grades/cellular subtypes of glioma tumors

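TABLE 4 Training accuracies measured by AUC of each model for predicting tumor grades.
Columns 2–5 give the AUC on the training data sets (TCGA); columns 6–9 give the ACC on the blind data sets (TCGA).

| Performance metrics | AUC: SVM | AUC: RF | AUC: SNN | AUC: CNN | ACC: SVM | ACC: RF | ACC: SNN | ACC: CNN |
|---|---|---|---|---|---|---|---|---|
| Gene expression | 0.88 | 0.89 | 0.76 | NA | 0.70 | 0.63 | 0.67 | NA |
| Copy number variation | 0.76 | 0.93 | 0.81 | NA | 0.65 | 0.77 | 0.69 | NA |
| DNA-methylation | 0.89 | 0.83 | 0.87 | NA | 0.65 | 0.60 | 0.86 | NA |
| ALL | 0.82 | 0.80 | 0.79 | NA | 0.58 | 0.70 | 0.85 | NA |
| MRI | NA | NA | NA | 0.87 | NA | NA | NA | 0.89 |
| Histopathology image | NA | NA | NA | 0.90 | NA | NA | NA | 0.81 |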


using MRI scans of brain tumors and whole-slide (WS) histopathology images of tumor tissues (Table 4). The CNN model trained on MR images performed well on both the training (AUC = 0.869) and blind (ACC = 0.890) data sets. Similarly, CNN models of histopathological images showed high accuracies for diagnosing tumor grades in both training (AUC = 0.900) and blind (ACC = 0.810) data sets. The confidence interval (CI) of the AUC of the deep CNN models (95% CI: 0.87–0.94) trained on MR and WS image data sets was better than that of the SNN models (95% CI: 0.77–0.91) trained on numeric multi-omics molecular data sets, suggesting that the complex tumor morphologies captured in MR and histopathological images can be used to train deep neural networks to diagnose glioma tumors.

3.2 Application of DL model for classifying molecular subtypes of glioma tissues

Whole-slide images (WSI) of tumor tissues stained with Hematoxylin and Eosin (H&E) are used by histopathologists to understand the cellular and tissue structures of tumor samples. Histopathologists manually inspect the WS images of tumor samples and routinely determine tumor grades (Grades I/II/III/IV) and other cellular phenotypes (astrocytoma, oligoastrocytoma, oligodendroglioma, etc.) of tumor biopsy samples of glioma patients. However, histopathologists cannot determine the molecular subtypes of glioma tissue by manually inspecting the WSI. The four molecular subtypes of glioma (classical, mesenchymal, neural, and proneural), which are derived from molecular profiles of glioma patients, help in understanding tumor biology and prognosis (Kong et al., 2011; Lin et al., 2014). In the TCGA-LGG and TCGA-GBM cohorts, WS images of patients are available along with their molecular subtypes. Here, we have used WS images labeled with molecular subtypes to train a deep-learning CNN model and assessed its performance by evaluating the AUC. During model training, the AUC value for classifying the multi-class dataset was 0.82. When applied to a blind data set of WS histopathology images, the trained DL model achieved 76% accuracy in classifying molecular subtypes of glioma tissues (Table 5). This result suggests that subtle morphological differences across the molecular subtypes of tumor cells, which cannot be identified in WSI under the microscope, can be classified by utilizing the power of deep learning models.

3.3 Application of the deep learning model for the prognosis of glioma patients

For oncologists, accurate assessment of the prognosis of glioma patients, for example, determining the median overall survival (OS), is a critical task (Pei et al., 2020; Sun et al., 2019). We have trained a CNN-based deep learning model on MR images labeled with OS (in months) data of low- and high-grade glioma patients from the TCGA-LGG and TCGA-GBM cohorts. During training, the CNN architecture (Table 5) achieved a loss function value of 0.32 and an accuracy of 0.87 in 67 epochs. The low value of the loss function suggests limited overfitting of the model to the respective labels during training. The deep learning model with optimized parameters, when applied to the blind dataset, achieved a high level of accuracy (85%) in determining the median OS of glioma patients.

3.4 Applications of DL model for predicting driver gene mutations in glioma

Identifying the mutational landscape of low- and high-grade glioma tumors is highly beneficial for oncologists in deciding on the right course of diagnosis and prognosis for the patients (Chang et al., 2018; Zinn et al., 2017). We have developed a radio-genomics deep learning model for identifying the mutation status of the cancer hallmark genes EGFR and TP53. The overall performance of this model on both the training and blind datasets is shown in Table 5. With this model, we were able to classify MR images of brain tumors into three mutational classes: TP53-mutant/EGFR-wild-type, TP53-wild-type/EGFR-mutant, and TP53-mutant/EGFR-mutant. The model achieved good accuracies and related performance metrics on both the training (TPR = 0.885, FPR = 0.1671, Precision = 0.902, Recall = 0.885, F-measure = 0.890, AUC = 0.924) and blind data sets (86% accuracy), as presented in Fig. 11. The 95% confidence interval of the AUC value was 0.91–0.97, suggesting that MR images could provide better accuracies for identifying the mutation status of cancer hallmark genes. The model showed a high level of accuracy and minimal values of the loss function in both the training and validation data sets.
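The metrics reported above are all derived from a confusion matrix. The sketch below shows one common way to compute them for a single class of a multi-class problem (one-vs-rest); it also makes clear why TPR and Recall take identical values. The function name `one_vs_rest_metrics` and the counts are hypothetical and are not the chapter's data.

```python
from typing import Dict, List


def one_vs_rest_metrics(confusion: List[List[int]], cls: int) -> Dict[str, float]:
    """Metrics for class `cls`; rows are true classes, columns are predictions."""
    total = sum(sum(row) for row in confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp
    fp = sum(row[cls] for row in confusion) - tp
    tn = total - tp - fn - fp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall is the same quantity as TPR
    return {
        "TPR": recall,
        "FPR": fp / (fp + tn),
        "precision": precision,
        "recall": recall,
        "F": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for three mutation classes (illustration only).
confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
metrics = one_vs_rest_metrics(confusion, cls=0)
print(metrics)
```

Averaging these per-class values over all classes gives macro-averaged figures; whether the chapter's values are macro- or weighted averages is not stated.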

3.5 Application of Time Division LSTM for short-term prediction of wind speed

The atmosphere consists of multiple interdependent parameters whose variations cause significant changes in weather patterns. Each of these parameters has a significant impact on the environment. Wind speed, one of the major atmospheric variables, helps in generating green energy, which contributes to reducing pollution. Considering the importance of wind in various sectors (environmental, societal, economic, etc.), a detailed study has been carried out in which the “Time Division Ensemble” algorithm is used for short-range wind speed prediction, i.e., 3 h ahead at station scale, for two locations, New Delhi and Bangalore, with atmospheric temperature (T), humidity (H), pressure (P), and wind speed (V) as inputs (Marndi et al., 2020). Hence, the input can be represented as the quadruple I_n = (T_n, P_n, V_n, H_n), where n is the index of the current timestamp, n + 1 for the future, and

TABLE 5 Training performances of the model for predicting tumor molecular characteristics.

| Data | TPR | FPR | Precision | Recall | F-measure | AUC | Blind data (accuracy %) |
|---|---|---|---|---|---|---|---|
| Histopathology | 0.70 | 0.25 | 0.71 | 0.70 | 0.71 | 0.82 | 76% |
| MRI (survival) | 0.82 | 0.27 | 0.82 | 0.82 | 0.81 | 0.80 | 85% |
| MRI (mutations) | 0.89 | 0.18 | 0.90 | 0.89 | 0.89 | 0.92 | 86% |

FIG. 11 Application of deep learning model in radiogenomics study of glioma tumors.


n − 1 for past timestamps. As data were collected at 30 min intervals, there is a 30 min gap between two consecutive inputs. To achieve a 3 h advance prediction, LSTM1 is required to predict the wind speed six timestamps ahead, i.e., V_{n+6}. The capability of LSTM is further utilized by considering data from multiple timestamps. In the second non-ensemble mode (LSTM2), the input (I_{n−2}, I_{n−1}, I_n) is formed from three consecutive timestamps, and to achieve a 3 h ahead prediction the wind speed is again predicted at the (n + 6)th step, i.e., V_{n+6}. The generalized versions of LSTM1 and LSTM2 (Fig. 8A and B) can be modified as shown in Fig. 12A and B.
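The difference between the two non-ensemble input modes can be sketched as follows: LSTM1 sees only I_n, LSTM2 sees the last three records, and both target V_{n+6}. The helper `build_samples` and the synthetic records are assumptions made for this illustration, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Record = Tuple[float, float, float, float]  # (T, P, V, H) at one timestamp


def build_samples(
    records: Sequence[Record], n_lags: int, horizon: int = 6
) -> Tuple[List[List[Record]], List[float]]:
    """Pair the last `n_lags` records up to index n with the wind speed V at n + horizon."""
    inputs, targets = [], []
    for n in range(n_lags - 1, len(records) - horizon):
        inputs.append(list(records[n - n_lags + 1 : n + 1]))
        targets.append(records[n + horizon][2])  # index 2 holds V
    return inputs, targets


# Synthetic half-hourly records; V equals the timestamp index for easy checking.
records = [(25.0, 1000.0, float(i), 50.0) for i in range(20)]

x1, y1 = build_samples(records, n_lags=1)  # LSTM1: I_n only
x2, y2 = build_samples(records, n_lags=3)  # LSTM2: (I_{n-2}, I_{n-1}, I_n)
print(len(x1), y1[0], len(x2), y2[0])
```

With half-hourly data, a horizon of six steps corresponds to the 3 h lead time used in the study.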

3.5.1 Performance evaluations of Time Division LSTM for short term wind speed prediction

The Time Division LSTM model is compared with two data-driven models and one classical model for predicting short-term wind speed. The training and validation losses are calculated to determine the number of epochs to be used for prediction. It is observed that the training and validation losses converge after 60 epochs for the New Delhi station, whilst in the case of Bengaluru the loss converges at 30 epochs. Hence, for the New Delhi and Bengaluru stations, the number of epochs is fixed at 60 and 30, respectively. The same configuration is maintained for the second-level LSTM of LSTM4, which is hierarchical in nature.

FIG. 12 Standard LSTM (A) LSTM1 (Single time step Multiple variables) and (B) LSTM2 (Multiple variables at Multiple time steps).


TABLE 6 Summary of MAE, RMSE and CC for predicted vs observed wind speed for stations at New Delhi and Bengaluru.

| Measures | New Delhi MAE | New Delhi RMSE | New Delhi CC | Bengaluru MAE | Bengaluru RMSE | Bengaluru CC |
|---|---|---|---|---|---|---|
| SVM | 0.824 | 1.091 | 0.510 | 1.757 | 2.287 | 0.614 |
| ELM | 0.773 | 1.027 | 0.547 | 0.919 | 1.173 | 0.651 |
| LSTM1 | 0.179 | 0.210 | 0.044 | 0.154 | 0.189 | 0.167 |
| LSTM2 | 0.100 | 0.128 | 0.100 | 0.136 | 0.161 | 0.170 |
| LSTM3 | 0.078 | 0.107 | 0.489 | 0.133 | 0.160 | 0.716 |
| LSTM4 | 0.077 | 0.105 | 0.585 | 0.129 | 0.156 | 0.760 |

Prediction capability was evaluated using different performance measures such as Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Pearson Correlation Coefficient (CC). Table 6 summarizes the results of two data-driven approaches, i.e., support vector machine (SVM) and extreme learning machine (ELM), and four modes of LSTM with respect to these performance measures. Among the LSTMs, LSTM3 and LSTM4, which are ensemble-based methods, perform better than the non-ensemble LSTMs, with lower MAE and RMSE and a higher correlation coefficient for both the New Delhi and Bengaluru stations. Plotting all 15,456 data points in a single graph for all the models creates a hurdle, as such a large number of points is hard to read. Therefore, the prediction capability of the time division LSTM compared to the data-driven models is depicted in Figs. 13 and 14 on a small dataset of 192 timesteps from 1st February 2014. As shown in Figs. 13 and 14, the ensemble method performs better than the SVM and ELM models, and their performance differs slightly between the two locations. However, none of them could predict lower wind speeds satisfactorily. As shown in Figs. 15 and 16, a unit line is drawn in the scatter plots of observed versus predicted wind speed at both locations to show model performance more clearly in terms of bias and the extent of errors. From the scatter plot for New Delhi presented in Fig. 15, it can be seen that all four models are comparable and have a slight positive bias for wind prediction. However, the data-driven models (SVM and ELM) are unable to predict low wind speeds.
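The three measures can be computed directly from paired observed and predicted values. The sketch below uses their standard definitions; the function names and the toy series are assumptions for illustration.

```python
import math
from typing import Sequence


def mae(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Mean Absolute Error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def rmse(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Root Mean Square Error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def pearson_cc(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Pearson correlation coefficient between observed and predicted values."""
    mo, mp = sum(obs) / len(obs), sum(pred) / len(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (so * sp)


observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 2.5, 4.5]
print(mae(observed, predicted), rmse(observed, predicted), pearson_cc(observed, predicted))
```

MAE and RMSE measure the size of the errors, while CC measures how well the predictions follow the variations of the observations, so a model can score well on one and poorly on the other.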


FIG. 13 Comparison between model simulated (SVM, ELM, and Time division LSTM) and observed half-hourly wind speed (m/s) at New Delhi station for the period 1st February to 4th February 2014.

FIG. 14 Comparison between model simulated (SVM, ELM, and Time division LSTM) and observed half-hourly wind speed (m/s) at Bengaluru station for the period 1st February to 4th February 2014.

For the Bengaluru location, as depicted in Fig. 16, there is a large negative bias for the AR method. Both the data-driven models and the time division LSTM show better performance than the statistical model. SVM could predict high wind speeds, i.e., greater than 6 m/s, but was unable to predict very low wind speeds, i.e., less than 1 m/s. ELM is successful in predicting medium wind speeds but is unable to predict high and low wind speeds. Time division LSTM (LSTM4) could predict all categories except very high wind speeds of more than 7 m/s.

FIG. 15 Scatter plot shows the comparison between the observed and predicted wind speed (m/s) from various models ((A) AR, (B) SVM, (C) ELM, (D) Time Division LSTM) at New Delhi during the study period.

FIG. 16 Scatter plot shows the comparison between the observed and predicted wind speed (m/s) from various models ((A) AR, (B) SVM, (C) ELM, (D) Time Division LSTM) at Bengaluru during the study period.

3.6 Application of LSTM for the estimation of crop production

Adequate crop production helps meet the food requirements of a country and generates foreign exchange. Reliable advance prediction of crop production helps policymakers make trading decisions that strengthen food security. Thus, there is a need for a robust and accurate model to predict the crop production of a country in advance. An attempt is made to design an AI model for predicting the rice production of India, using as input the rice production of neighboring Asian countries that are part of the South Asian Monsoon system and the rainfall of India. The results show that the prediction capability for future years is improved by combining local and regional scale parameters. Import-export crop production data of India and neighboring countries are used to validate the capability of this predictive model (Marndi et al., 2021). As LSTM is efficient in handling time series data, a stacked LSTM is used to design a sensitivity analysis model that identifies the correct input set for a predictive model of rice production in India.

3.6.1 Automated model for selection of optimal input data set for designing crop prediction model

India’s rice production may not be affected by the rice production of all neighboring countries, although the influence of some countries is significant. The accuracy of an AI-based predictive model depends heavily on a proper input data set. An automated model has been designed to identify the right input for a more accurate predictive model by evaluating mathematical combinations. As shown in Table 7, all possible combinations, i.e., 2^20 − 22, of the rice production of 20 neighboring countries are computed. Identifying the best combination from a total of 2^20 − 22 combinations is complex and computationally intensive. The detailed procedure to determine the optimal input set is as follows. To identify the correct input data set, an individual LSTM is trained with each combination on the 1960–90 data and subsequently tested with the testing data set of 1991–2016 and the training data set of 1960–90. Model efficiency for each combination is measured using different performance metrics such as Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Correlation Coefficient (CC). When the trained model is tested with the training data and the testing data, it generates training and testing errors, respectively. The ratio between these training and testing errors is also calculated. The optimal


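TABLE 7 Different mathematical combinations of rice production of different Asian countries.
C(20, k) denotes the number of ways of choosing k of the 20 countries.

| C(20, 2) = 190 | C(20, 8) = 125,970 | C(20, 14) = 38,760 |
|---|---|---|
| C(20, 3) = 1140 | C(20, 9) = 167,960 | C(20, 15) = 15,504 |
| C(20, 4) = 4845 | C(20, 10) = 184,756 | C(20, 16) = 4845 |
| C(20, 5) = 15,504 | C(20, 11) = 167,960 | C(20, 17) = 1140 |
| C(20, 6) = 38,760 | C(20, 12) = 125,970 | C(20, 18) = 190 |
| C(20, 7) = 77,520 | C(20, 13) = 77,520 | C(20, 19) = 20 |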

FIG. 17 Model for identifying the best combination of rice production.

combination of rice production is determined by considering the minimal value of the error measures and the maximum value of the correlation coefficient. The block diagram of the automated model for identifying the optimal input is presented in Fig. 17. The percentage of events, i.e., errors in this case, was calculated in each error bin for the different combinations of rice production. Error bins for both MAE and RMSE are computed by dividing the range of error equally. Figs. 18 and 19 present the error distributions across combinations, which indicate that the RMSE of most combinations of countries’ rice production lies between 0.1 and 0.2. The Mean Absolute Error (MAE) distribution for each combination of countries is presented in Fig. 19, which indicates that the MAE of most combinations of countries’ rice production lies in the error bin of 0.1. The optimal input set is identified by performing a rigorous sensitivity analysis using mathematical combinations of the rice production of different neighboring countries. The model capability was assessed by calculating error metrics such as MAE and RMSE on the training and testing data. MAE and RMSE for the training and testing data are shown in Fig. 20A and B, respectively, which show that both error metrics are minimum for combination 8. Therefore, combination 8 of rice production is selected as the input for designing a predictive model for rice production. The rice production of 8 countries, i.e., Bangladesh, Sri Lanka, Nepal, Myanmar, Pakistan, Thailand, Philippines, and Iran, is selected as the optimal input set for designing the predictive model.

FIG. 18 Error distribution of LSTM with different combinations of rice production for identifying the minimum error model. The X-axis represents the normalized RMSE and Y-axis represents the percentage of simulations that fall within the error range.

FIG. 19 Error distribution of LSTM simulations with different combinations for identifying the minimum error model. The X-axis represents the normalized MAE and Y-axis represents the percentage of simulations that fall within the error range.
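The size of the search space in Table 7 and the exhaustive selection procedure can be illustrated with a short sketch. It is a simplified stand-in: the real procedure trains an LSTM for every subset and also considers the correlation coefficient and the ratio of training to testing error, whereas here a hypothetical `toy_error` function replaces model training, and `best_subset` is an assumed helper name.

```python
from itertools import combinations
from math import comb
from typing import Callable, Sequence, Tuple

# Number of subsets with 2 to 19 members out of 20 countries (equals 2^20 - 22).
total = sum(comb(20, k) for k in range(2, 20))


def best_subset(
    items: Sequence[str],
    error_of: Callable[[Tuple[str, ...]], float],
    min_size: int = 2,
) -> Tuple[Tuple[str, ...], float]:
    """Score every subset of size min_size..len(items)-1 and keep the lowest error."""
    best: Tuple[str, ...] = ()
    best_err = float("inf")
    for k in range(min_size, len(items)):
        for subset in combinations(items, k):
            err = error_of(subset)
            if err < best_err:
                best, best_err = subset, err
    return best, best_err


def toy_error(subset: Tuple[str, ...]) -> float:
    """Stand-in for 'train an LSTM on this subset and return its test RMSE'."""
    useful = {"A", "C"}
    return len(useful.symmetric_difference(subset)) * 0.1


countries = ["A", "B", "C", "D", "E"]
subset, err = best_subset(countries, toy_error)
print(total, subset, err)
```

Because the number of subsets grows exponentially with the number of countries, each additional candidate country roughly doubles the number of models to train, which is why the text calls the search computationally intensive.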

3.6.2 Design and performance evaluation of crop prediction model

The model is trained with the optimal input data set obtained from the sensitivity analysis performed in the previous phase. It is well known that weather parameters, mainly rainfall, have a significant influence on crop growth (Carfagna and Gallego, 2005; Shin et al., 2017). Hence, rainfall data are considered along with rice production data from neighboring countries and India’s own rice production to design a predictive model. Predictive models are built in three different ways: the first model is trained with only rainfall data, the second model is trained with the rice production data of neighboring countries obtained from the first phase, and the third model is trained with a combination of the rice production data of neighboring countries and the rainfall data of India. It was observed that the third model performed better than the other two models with respect to different performance measures, as shown in Fig. 21A–C, and that the third model could predict the drought years, i.e., 2002 and 2004, by incorporating rainfall data. Finally, the proposed model is validated with net flow, i.e., export value minus import value, as shown in Fig. 22. Net flow follows the trend of predicted as well as observed rice production from 1991 to 1993 and from 2005 to 2015. However, the trends of net flow and rice production contrast with each other from 1993 to 2004 for unknown reasons. Model performance was evaluated using different performance measures such as MAE, RMSE, and CC. A summary of MAE, RMSE, and CC between predicted and observed crop production is shown in Table 8, which shows that the third model provides better predictability than the other two models in terms of RMSE, MAE, and CC.

FIG. 20 Optimal Combinations of neighboring countries rice production for identifying the minimum error model. The X-axis represents different combinations of countries rice production and Y-axis represent normalized MAE in (A) and RMSE in (B).

FIG. 21 Comparison between the observed and model predicted rice production in India.

FIG. 22 Relationship between crop production (estimated, observed) and net flow during the study period.

TABLE 8 Summary of MAE, RMSE, and CC between observed and predicted rice at different scenarios.

| Training scenarios with different set of input data | MAE | RMSE | CC |
|---|---|---|---|
| With only rainfall data | 0.370 | 0.425 | 0.0027 |
| With only rice production data | 0.157 | 0.291 | 0.844 |
| With rice production and rainfall data | 0.099 | 0.138 | 0.92 |

3.7 Classification of tea leaves

The task at hand is to classify tea leaves, and the dataset consists of only 965 labeled images belonging to four different classes. This dataset has to be further divided into training and testing sets in an 80:20 ratio, which results in an even smaller training dataset. So, the best approach here is to use transfer learning instead of developing a model from scratch. However, once divided into training and testing sets, the dataset becomes too small even for transfer learning. This problem has been resolved by applying data augmentation methods (image cropping, flipping, brightness alteration, and so on), which increase the dataset size. Augmentation also addresses issues associated with class imbalance in the dataset and overfitting of the model. Once the training dataset has been prepared, a custom classification layer is needed, as the original VGG 16 classifier was developed for the ImageNet dataset (Banerjee et al., 2022). In this work, transfer learning provided a good starting point for training the model and enabled efficient use of the knowledge learned by a model trained for a different task. Transfer learning methods use two main steps: (1) feature extraction and (2) network fine-tuning. In our work, we used the popular VGG 16 model as the backbone for developing our model. We use the pre-trained VGG 16 model to study the performance of the algorithms for the classification of the collected tea leaves data.

The convolutional neural network accepts fixed input dimensions of 224 × 224. In VGG 16, the training images are passed to a stack of convolutional layers followed by an additional 3 × 3 convolutional layer, which works as a feature extraction layer. After all the convolution layers, the data is passed to the dense layer. A global average pooling layer is applied to the output of the convolutions, and the final layer performs the 4-way tea-leaf classification and thus contains 4 channels, one for each class, as shown in the image. As depicted in Fig. 23, during the training of VGG 16 the original ImageNet weights were used and only the top layers were trained for the specific task. The model was evaluated by measuring the accuracy achieved during training and testing, which was found to be 83% and 80.51%, respectively. To further improve the performance, the model was fine-tuned, which included hyper-parameter optimization and unfreezing some of the layers of the base model (VGG 16). This increased the training and testing accuracy to 92.79% and 89.29%, as can be seen in Fig. 24.
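The augmentation operations named above (cropping, flipping, brightness alteration) can be sketched on a tiny grayscale image stored as nested lists. This is an illustrative toy under stated assumptions: the function names, the crop geometry, and the brightness factor of 1.2 are hypothetical, and a real pipeline would operate on RGB arrays and resize crops back to the 224 × 224 input size.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels in 0..255, row-major


def flip_horizontal(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def crop(img: Image, top: int, left: int, height: int, width: int) -> Image:
    """Cut out a height x width window starting at (top, left)."""
    return [row[left : left + width] for row in img[top : top + height]]


def adjust_brightness(img: Image, factor: float) -> Image:
    """Scale pixel values and clip them to the valid range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]


def augment(img: Image) -> List[Image]:
    """Return the original image plus three augmented variants."""
    h, w = len(img), len(img[0])
    return [
        img,
        flip_horizontal(img),
        crop(img, 0, 0, h - 1, w - 1),
        adjust_brightness(img, 1.2),
    ]


image = [[10, 20, 30], [40, 50, 60], [200, 220, 240]]
variants = augment(image)
print(len(variants))
```

Applying more augmentations to under-represented classes than to frequent ones is one way such a scheme can also reduce class imbalance.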

FIG. 23 Schematic flow diagram of the model which uses a custom classifier for the tea leaves classification.

FIG. 24 Performance curve for VGG 16 with transfer learning.


FIG. 25 Confusion matrix of tea-leaf classification based on different classes.

As Fig. 25 shows, the correctly classified samples lie along the principal diagonal of the matrix. The model performs well on the first two classes, whereas it underperforms on the third and fourth classes of tea leaf in comparison. However, considering overall performance, we achieved an average precision, recall, and F1 score of 0.85, 0.76, and 0.9, respectively.

3.8 Weather integrated deep learning techniques to predict the COVID-19 cases over states in India The time series data of the laboratory-confirmed COVID-19 cases over some of the states where the maximum number of cases was reported during the first, second, and third waves in India (Fig. 26). Even though the first case was reported on 30th January 2020, the maximum number (up to 24,886 cases/day in Maharashtra) of cases was reported in September (2020) during the first wave of the pandemic in India. However, the COVID-19 transmission dynamics depend on several parameters including the variant of the virus, population density and immunity, socio-economic conditions and climate parameters. The second wave impacted most of the states very severely (up to 68,631 cases/day in Maharashtra) due to the most virulent variant like delta (B.1.617.2) during April and May 2021 (summer season) in India. The third wave in India was mostly dominated by the highly infectious virus variant of Omicron (B.1.1.529) which happened during the winter (February) period in the year 2022. However, the greatest number of daily cases were reported in the state of Maharashtra, the states located in South India (Andhra Pradesh, Karnataka, Kerala, and Tamil Nadu), and north India (Uttar Pradesh) also contributed the major share of cases in India. The data also shows that the

156 Handbook of Statistics

FIG. 26 Time series analysis of daily COVID-19 cases over different states in India for the period 1st April 2020 to 31st December 2021.

COVID-19 transmission was prolonged up to the end of the monsoon season during the second wave in the highly humid region of Kerala. According to the Centers for Disease Control and Prevention (CDC), people are infected with the SARS-CoV-2 virus primarily through inhalation of virus-laden microscopic respiratory droplets and aerosols, deposition of virus-laden respiratory droplets on mucous membranes (eye, mouth, and nose) by direct sprays and splashes during close contact with an infected person, and touching the mucous membranes with virus-soiled hands (source: https://www.cdc.gov/coronavirus/2019-ncov/science/science-briefs/sars-cov-2/transmission.html). In the case of airborne transmission, the infection risk depends on the concentration and viability of the virus cloud in the air, which in turn depend mainly on surrounding environmental conditions such as temperature and humidity. The safe distance from the source is therefore determined by the weather parameters prevailing in indoor and outdoor areas. To understand the role of weather parameters in COVID-19 transmission over different states in India, a correlation analysis was carried out between the daily COVID-19 cases and the meteorological parameters, namely specific humidity (SH) and mean, maximum, and minimum temperature, for the period 1st April 2020 to 31st December 2021. The analysis shows that specific humidity has a significant (at the 99% level) positive correlation with COVID-19 cases over Andhra Pradesh, Assam, Kerala, Odisha, and Tamil Nadu, and a significant negative correlation over Chhattisgarh, Madhya Pradesh, and Maharashtra. In the case of temperature, most of the states show a significant positive correlation with COVID-19 cases, except Kerala (Fig. 27).
The time series of SH shows that COVID-19 cases were positively correlated with SH where the atmospheric water vapor content was high and its seasonal variability was low, whereas the dry regions show a negative correlation with COVID-19 cases (Fig. 27).
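The kind of correlation analysis described above can be sketched as follows. This is an illustrative example only: the text does not state which correlation coefficient or significance test was used, so the Pearson coefficient and the t statistic are assumptions, and the daily values are invented.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r = 0 with n paired samples (requires |r| < 1)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical daily values for one state (illustration only).
daily_cases = [120, 150, 170, 210, 260, 240, 300, 340]
specific_humidity = [12.1, 12.8, 13.0, 14.2, 15.1, 14.9, 16.0, 16.4]

r = pearson_r(daily_cases, specific_humidity)
t = t_statistic(r, len(daily_cases))
# For long series, |t| above roughly 2.58 corresponds to the two-sided 99% level
# under a normal approximation; short series need the exact t distribution.
print(f"r = {r:.3f}, t = {t:.2f}")
```

Daily case counts are strongly autocorrelated, so a test that treats the days as independent samples can overstate significance; the sketch ignores this.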

FIG. 27 (A) Correlation between the daily COVID-19 cases and the meteorological parameters Specific Humidity (SH), Maximum Temperature (Tmax), Minimum Temperature (Tmin), and Mean Temperature (Tmean) for the period 1st April 2020 to 31st December 2021. (B) Seasonal variability of specific humidity over some of the dry and wet regions in India (bottom figure).

Some earlier studies also reported that environmental conditions play a considerable role in the spread of the disease and the death rate. For example, Faruk et al. (2022) studied the environmental effect on COVID-19 transmission in different continents and found that, in Asian countries, temperature has a significant positive association with the spread of the disease, whereas relative humidity plays an insignificant role. Manik et al. (2022) studied the impact of temperature and humidity on reproduction numbers using the SIRD (Susceptible-Infected-Recovered-Deceased) and SEIRD (Susceptible-Exposed-Infected-Recovered-Deceased) mathematical models and found a positive association with temperature and a negative association with relative humidity over some of the states in India. To check the capability of the univariate and multivariate LSTM models for short- and medium-range forecasting, state-level daily COVID-19 forecasts were generated for the third-wave period by initializing the models with lead times of 1 to 14 days for all the selected states in India. Fig. 28 compares the COVID-19 cases forecasted by the univariate and multivariate models (lead 1) with the observed cases over different states in India for the period 1st January 2022 to 14th February 2022. The time series plot shows that the models performed very well in predicting the COVID-19 cases with

FIG. 28 Comparison between the laboratory-confirmed COVID-19 cases and the cases forecasted by the univariate and multivariate models over some of the states in India during the period 1st January 2022 to 14th February 2022 (45 days).

a 1-day lead time during the test period (third wave) over most of the seriously affected states, including Maharashtra, Kerala, Andhra Pradesh, Karnataka, and Tamil Nadu. The south Indian states were the most affected by the COVID-19 third wave in India. The maximum numbers of daily cases were reported in Kerala (55,475), Karnataka (50,210), Maharashtra (48,270), and Tamil Nadu (30,744) during the last week of January 2022. The developed LSTM model captured the peak with a 1-day lead time for most of the states in India. The relative error (RE) analysis also shows that the REs were less than 25% for the lead-1 forecasts over most of the selected states, except Assam (Table 9). The performance of the univariate and multivariate models was quite comparable for short-range (lead 1 to lead 3) forecasts for most of the states. For medium-range forecasts (lead 4 to lead 14), the errors of the multivariate model (optimized with the weather parameters) were lower than those of the univariate model for most of the selected states (Fig. 29). During the third wave, the highly humid states located in south India were impacted more than the other states, and the specific humidity in this region has a positive association with COVID-19 cases. Our modeling results also confirmed that the multivariate LSTM model integrated with specific humidity had a lower relative error for the medium-range forecasts than the models integrated with the other weather parameters. For example, the average relative errors of the lead-7 (one-week) forecasts for the CTL and CTL_SH experiments were 59.6% and 52%, respectively, which shows that integrating the specific humidity data into the LSTM model reduced the error by more than 7 percentage points. Similarly, the integration of minimum temperature data reduced the model errors for Uttar Pradesh, Telangana, Chhattisgarh, and Karnataka, and mean temperature data reduced the model errors over the West Bengal and Assam regions.
In the case of the Rajasthan and Madhya Pradesh regions, integrating the maximum temperature data reduced the model errors more than the other meteorological data did.
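To make the univariate versus multivariate setup and the error measure concrete, the sketch below builds lagged input windows for a given lead time and computes an average relative error. It is an illustration under stated assumptions, not the authors' pipeline: the 14-day lookback, the function names, and the definition RE = |forecast - observed| / observed are assumptions, since the text does not spell them out.

```python
def make_windows(cases, weather=None, lookback=14, lead=1):
    """Build (input_window, target) pairs for a forecast `lead` days ahead.

    cases   : list of daily case counts
    weather : optional list with one weather value per day; None gives the
              univariate setup, a list gives a two-feature multivariate setup
    """
    samples = []
    last_start = len(cases) - lookback - lead
    for start in range(last_start + 1):
        end = start + lookback  # exclusive end of the input window
        if weather is None:
            window = [[c] for c in cases[start:end]]
        else:
            window = [[c, w] for c, w in zip(cases[start:end], weather[start:end])]
        target = cases[end + lead - 1]  # value `lead` days after the window ends
        samples.append((window, target))
    return samples


def mean_relative_error(observed, forecast):
    """Average of |forecast - observed| / observed, in percent (assumed definition)."""
    errors = [abs(f - o) / o * 100.0 for o, f in zip(observed, forecast) if o != 0]
    return sum(errors) / len(errors)


# Hypothetical series (illustration only).
cases = [100, 110, 125, 140, 160, 185, 210, 240, 270, 300]
humidity = [14.0, 14.2, 14.1, 14.8, 15.0, 15.3, 15.1, 15.6, 15.9, 16.0]

univariate = make_windows(cases, lookback=3, lead=2)
multivariate = make_windows(cases, weather=humidity, lookback=3, lead=2)
print(len(univariate), univariate[0])
print(multivariate[0])
print(mean_relative_error([100, 200], [110, 180]))
```

Each window is a list of per-day feature vectors, which is the (time steps, features) layout that recurrent models such as LSTMs typically expect.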

4 Discussion and future prospects

This chapter presented a collaborative effort by multi-disciplinary researchers from CSIR India in which robust deep learning tools and methodologies were applied to industrial and scientific research. State-of-the-art methodologies, along with real-time data in multiple formats, are being analyzed for major sectors such as medicine, healthcare, agriculture, and energy. Deep learning has great potential to bring disruptive changes to the current clinical care of cancer patients and to revolutionize the growth of precision medicine (Ntakolia et al., 2020). In the medical sector, we have demonstrated the use of DL models to analyze multi-omics data for the diagnosis and prognosis of glioma patients (Bhalla and Lagana, 2022). Histopathology and MRI scans are mainly used for tumor diagnosis and prognosis, and

TABLE 9 Average relative error (%) in prediction of daily COVID-19 cases over different states in India for the period 1st January to 14th February 2022.

States            CTL     CTL_SH   CTL_Tmax   CTL_Tmean   CTL_Tmin
Andhra Pradesh    22.07   20.91    22.1       21.87       22.57
Assam             42.98   38.07    38.11      37.22       37.28
Chhattisgarh      22.25   23.22    22.49      23.52       25.33
Karnataka         21.11   19.83    22.39      22.08       20.03
Kerala            16.63   17.08    17.17      16.66       16.96
Madhya Pradesh    20.15   19.25    23.09      22.64       22.61
Maharashtra       18.07   17.41    19.17      18.85       17.76
Rajasthan         23.96   23.91    26.57      25.83       24.95
Tamil Nadu        13.32   13.61    14.63      15.63       14.18
Telangana         18.54   18.05    19.5       20.64       19.8
Uttar Pradesh     20.53   22.82    20.95      20.93       25.87
West Bengal       20.85   19.54    20.71      21.08       20.34

FIG. 29 Skill (average relative error) of the univariate (CTL) and multivariate (CTL_SH, CTL_Tmax, CTL_Tmin, CTL_Tmean) LSTM models during the third wave (1st January 2022 to 14th February 2022) in India. Here, L1 to L14 represent the model lead time (initialization) used to generate the day-1 COVID-19 case forecast for the selected state.

the DL models are applied to determine molecular subtypes and predict somatic gene mutation status (Lotlikar et al., 2021). We have also shown that, for certain numerical molecular data sets, machine learning algorithms such as SVM and RF could also predict histopathological grades of glioma.

However, DL models showed better performance than SVM and RF models in terms of the accuracy required for clinical application. DL can also automate and accelerate clinical decisions for a cancer patient by examining the patient’s molecular data and/or histopathological and radiological images. In this chapter, we have also demonstrated the application of the LSTM-based DL model for the assessment of crop production and of transfer learning for the classification of tea leaves. Similarly, the Ensemble LSTM model was applied for the short-term prediction of wind speed at two stations in India, and the results indicated that it can be very useful for short-term weather prediction. We have also presented the capability of the LSTM model (deep learning/machine learning) for forecasting the non-linear behavior of COVID-19 transmission over different states in India. Some earlier studies also suggested that meteorological parameters have a significant impact on disease spread in most regions, which motivated us to integrate the weather data into the LSTM models and to check the model capability over states located in different geographical regions of India. The model forecasts were generated for the third wave (1st January to 14th February 2022) by training the algorithm with the confirmed COVID-19 case data of the first and second waves (1st April 2020 to 31st December 2021) over India. Our results suggest that the skill of the LSTM model is very good (RE

500 M    >10 M    50
Facebook (Taigman et al., 2014)    2014    4.4 M    4K    800/1100/1200

The data is taken from existing work (Wang and Deng, 2021).

On bias and fairness in deep learning-based facial analysis Chapter

7 177

• The Cross-Age Celebrity Dataset (CACD) proposed by Chen et al. (2015) contains images of celebrities with ages ranging from 16 to 62 years.
• The LFWA (Huang et al., 2008) and CelebA (Liu et al., 2015) databases contain 40 annotated facial attributes and are commonly used for attribute prediction tasks. These databases are used in recent research that focuses on analyzing the performance of attribute prediction across protected attributes (male and young), followed by developing algorithms for bias mitigation (Tan et al., 2020).
• The AgeDb (Moschoglou et al., 2017) and All-Age Faces (AAF) (Cheng et al., 2019) databases are used for analyzing the performance of algorithms across age subgroups.
• The IJB-C database (Maze et al., 2018) is one of the largest databases with skin-tone information on the Fitzpatrick scale used for studying bias.

With the increased attention on understanding different aspects of bias in model prediction and developing fair algorithms, multiple databases have been proposed specifically for studying bias. Some of these databases are listed below.

• Zhang et al. (2017) proposed the UTKFace database with more than 20K images having variations in pose, illumination, expression, occlusion, and resolution.
• Buolamwini and Gebru (2018) proposed the Pilot Parliaments Benchmark (PPB) database for studying the effect of bias in gender classifiers with respect to different skin tones. The PPB database consists of 1270 subjects from three African and three European countries. The authors labeled the skin tones of the subjects using the Fitzpatrick six-point labeling system.
• The Racial Faces in the Wild (RFW) database proposed by Wang et al. (2019a) is an unconstrained testing database for studying racial bias in face recognition. The database consists of Caucasian, Indian, Asian, and African racial subgroups.
• Wang and Deng (2020) proposed four different training databases for studying the effect of race on face recognition algorithms, namely BUPT-Balancedface (race-balanced), BUPT-Globalface (racial distribution approximately equal to the real distribution of the world’s population), BUPT-Transferface containing labeled data for Caucasian and unlabeled data for other subgroups (created for unsupervised domain adaptation), and MS1M-wo-RFW containing non-overlapping subjects of the MS-Celeb-1M database.
• Morales et al. (2020) proposed the DiveFace database with annotations for gender and ethnicity (Caucasian, European, Asian) subgroups.
• The FairFace database, proposed by Karkkainen and Joo (2021), is balanced across seven race subgroups: White, Black, East Asian, Middle Eastern, Southeast Asian, Indian, and Latino.

• The DemogPairs database proposed by Hupont and Fernández (2019) consists of 58.3M verification pairs with demographically balanced images across the Asian, Black, and White ethnicities, as well as across males and females.
• Alvi et al. (2018) propose the LAOFW database along with an Age and Gender dataset. The LAOFW dataset is collected from the web, whereas the Age and Gender dataset corrects the annotations for the images in the large-scale IMDB dataset.
• The BFW database (Robinson et al., 2020) is inspired by the DemogPairs database (Hupont and Fernández, 2019) and comprises evenly split folds across ethnicities and gender, with the addition of Indian females and males in the dataset.

The different datasets for studying the effect of bias are summarized in Table 4. Samples from some of these databases are shown in Fig. 3. These databases have accelerated research toward understanding bias and boosted the development of algorithms for bias mitigation.

TABLE 4 Details of the databases used for studying fairness in the literature.

Database                      Identity  Gender  Race  Age  No. of Subjects  No. of Images
MORPH [38]                    ✓         ✓       ✓     ✓    13,000+          55,000+
IMFDB [48]                    ✓         ✓       ✗     ✓    100              34,512
Adience [49]                  ✓         ✓       ✗     ✓    2,284            26,580
CACD [50]                     ✓         ✗       ✗          2,000            163,446
LFWA [20]                     ✓         ✓       ✗     ✗    5,749            13,233
CelebA [51]                   ✗         ✓       ✗          10,000+          202,599
AgeDb [53]                    ✓         ✗       ✗          568              16,488
AAF [54]                      ✗         ✓       ✗          -                13,298
IJB-C [55]                    ✓         ✓       ✗     ✗    3500             33,000
UTKFace [56]                  ✗         ✓       ✓     ✓    -                20,000+
RFW [19]                      ✓         ✗       ✓     ✗    11,430           40,607
BUPT-Balancedface,            ✓         ✗       ✓     ✗    28,000,          1.3M,
BUPT-Globalface,                                           38,000,          2M,
BUPT-Transferface [57]                                     10,000           0.6M+
DiveFace [58]                 ✓         ✓       ✓     ✗    24,000           72,000
FairFace [27]                 ✗         ✓       ✓     ✓    -                108,501
Demogpairs [31]               ✓         ✓       ✓     ✗    600              10,800

TABLE 4 Details of the databases used for studying fairness in the literature (continued).

Database                      Identity  Gender  Race  Age  No. of Subjects  No. of Images
LAOFW [59]                              ✗       ✓     ✗    -                14000
Age and Gender Dataset [59]             ✓       ✗     ✓    -                6000 and 8000
BFW [60]                      ✓         ✓       ✓     ✗    800              20,000

FIG. 3 Samples from the (A) FairFace (Ntoutsi et al., 2020) (B) RFW (Policy and Division, n.d.) (C) UTKFace (Liu et al., 2015), and (D) MORPH datasets (Rawls and Ricanek, 2009).

4 Evaluation metrics

In the literature, a wide variety of evaluation metrics have been employed (Beutel et al., 2019; Du et al., 2020; Garg et al., 2020). While some metrics look for similar performance on similar samples (individual fairness) (Dwork et al., 2012), others focus on similar performance across groups (group fairness). Group fairness metrics are commonly employed in facial analysis tasks, where the performance of the same model is compared across different demographic subgroups. A large number of metrics are evaluated by comparing the predictions (denoted here by $\hat{y}$) with the ground-truth outcomes $y$ for input $x$. The confusion matrix is commonly used in classification parity-based metrics. Some metrics also utilize the score $s$ produced by a classifier during prediction. Research is also being conducted on proposing new evaluation metrics for fairness. We define the following terms for an input $x$ belonging to the demographic group $g$ for use in the subsequent sections.

• True positive rate (TPR)

\[ \mathrm{TPR} = p(\hat{y} = 1 \mid y = 1, G = g) = TP/(TP + FN) \tag{1} \]

• False positive rate (FPR)

\[ \mathrm{FPR} = p(\hat{y} = 1 \mid y = 0, G = g) = FP/(FP + TN) \tag{2} \]

• False negative rate (FNR)

\[ \mathrm{FNR} = p(\hat{y} = 0 \mid y = 1, G = g) = FN/(FN + TP) \tag{3} \]

• True negative rate (TNR)

\[ \mathrm{TNR} = p(\hat{y} = 0 \mid y = 0, G = g) = TN/(TN + FP) \tag{4} \]

• Positive predictive value (PPV)

\[ \mathrm{PPV} = p(y = 1 \mid \hat{y} = 1, G = g) = TP/(TP + FP) \tag{5} \]

• Negative predictive value (NPV)

\[ \mathrm{NPV} = p(y = 0 \mid \hat{y} = 0, G = g) = TN/(TN + FN) \tag{6} \]
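The six group-conditional quantities defined above can be computed directly from labels, predictions, and group membership. The following is a minimal sketch with hypothetical function names and toy data; undefined ratios (empty denominators) are returned as NaN, which is a design choice of this example.

```python
def confusion_counts(y_true, y_pred, groups, g):
    """Count TP, FP, TN, FN among the samples that belong to group g."""
    tp = fp = tn = fn = 0
    for y, y_hat, grp in zip(y_true, y_pred, groups):
        if grp != g:
            continue
        if y == 1 and y_hat == 1:
            tp += 1
        elif y == 0 and y_hat == 1:
            fp += 1
        elif y == 0 and y_hat == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def _ratio(num, den):
    return num / den if den else float("nan")


def group_rates(y_true, y_pred, groups, g):
    """TPR, FPR, FNR, TNR, PPV, and NPV for group g."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, groups, g)
    return {
        "TPR": _ratio(tp, tp + fn),
        "FPR": _ratio(fp, fp + tn),
        "FNR": _ratio(fn, fn + tp),
        "TNR": _ratio(tn, tn + fp),
        "PPV": _ratio(tp, tp + fp),
        "NPV": _ratio(tn, tn + fn),
    }


# Toy data (illustration only): binary labels, predictions, and two groups.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 1]
groups = [0, 0, 0, 0, 1, 1, 1, 1]

for g in (0, 1):
    print(g, group_rates(y_true, y_pred, groups, g))
```

The fairness metrics in the following subsections compare these per-group quantities across subgroups.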

4.1 Classification parity-based metrics

By utilizing the above definitions, we provide the following fairness metrics. They are described in the context of binary classification, considering only two demographic subgroups, i.e., $G = 0$ or $G = 1$ (such as male and female). These metrics can be extended to multiple demographic subgroups and can be applied to any classification-based application.

4.1.1 Statistical parity

One of the earliest metrics proposed for ensuring fairness is statistical parity (Chouldechova, 2017; Verma and Rubin, 2018). Statistical parity asserts that the probability of a positive prediction should be equal for each demographic subgroup. It is also known as Demographic Parity (Calders and Verwer, 2010) or Group Fairness (Dwork et al., 2012). Mathematically, it is depicted as

\[ p(\hat{y} = 1 \mid G = 0) = p(\hat{y} = 1 \mid G = 1) \tag{7} \]

This metric does not consider the ground-truth outcome $y$.

4.1.2 Disparate impact (DI)

Disparate impact is the deviation from statistical parity, measured as the ratio of the probabilities of a positive classification for the two subgroups of a demographic group (Feldman et al., 2015). Mathematically, it is described as

\[ \mathrm{DI} = \frac{p(\hat{y} = 1 \mid G = 0)}{p(\hat{y} = 1 \mid G = 1)} \tag{8} \]

A lower value of disparate impact indicates higher bias in the predictions.
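Statistical parity and disparate impact both depend only on the rate of positive predictions per subgroup. The sketch below is illustrative: the function names and toy data are hypothetical, and the absolute difference used as a "parity gap" is one common way to quantify a violation of Eq. 7 rather than something defined in the text.

```python
def positive_rate(y_pred, groups, g):
    """Estimate p(y_hat = 1 | G = g) from binary predictions."""
    preds = [y_hat for y_hat, grp in zip(y_pred, groups) if grp == g]
    return sum(preds) / len(preds)


def statistical_parity_gap(y_pred, groups):
    """Absolute difference between the positive rates of groups 0 and 1 (assumed measure)."""
    return abs(positive_rate(y_pred, groups, 0) - positive_rate(y_pred, groups, 1))


def disparate_impact(y_pred, groups):
    """Ratio of the positive rates of group 0 and group 1, following Eq. 8."""
    return positive_rate(y_pred, groups, 0) / positive_rate(y_pred, groups, 1)


# Toy predictions (illustration only).
y_pred = [1, 0, 0, 1, 1, 0, 1, 1]
groups = [0, 0, 0, 0, 1, 1, 1, 1]

print(statistical_parity_gap(y_pred, groups))
print(disparate_impact(y_pred, groups))
```

Neither function looks at the ground truth, which mirrors the remark that statistical parity ignores the outcome y. The ratio is undefined when group 1 receives no positive predictions.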

4.1.3 Equalized odds and equality of opportunity

Equalized odds (Hardt et al., 2016) is said to be satisfied if both the true positive rate (TPR) and the false positive rate (FPR) are the same across different demographic subgroups (Eqs. 9 and 10, respectively). Mathematically, this is represented as

\[ p(\hat{y} = 1 \mid y = 1, G = 0) = p(\hat{y} = 1 \mid y = 1, G = 1) \tag{9} \]

\[ p(\hat{y} = 1 \mid y = 0, G = 0) = p(\hat{y} = 1 \mid y = 0, G = 1) \tag{10} \]

Equalized odds is also described as “error rate balance” (Chouldechova, 2017). Similar to equalized odds, equality of opportunity requires the same TPR across groups (Eq. 9) but imposes no such restriction on the FPR.
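In practice, exact equality in Eqs. 9 and 10 rarely holds on finite data, so the two conditions are usually checked through gaps. The sketch below is an illustration under assumptions: the function names are hypothetical, and summarizing equalized odds by the larger of the TPR and FPR gaps is one possible convention, not one given in the text. Each group is assumed to contain both positive and negative samples.

```python
def conditional_positive_rate(y_true, y_pred, groups, g, label):
    """Estimate p(y_hat = 1 | y = label, G = g)."""
    preds = [y_hat for y, y_hat, grp in zip(y_true, y_pred, groups)
             if grp == g and y == label]
    return sum(preds) / len(preds)


def equal_opportunity_gap(y_true, y_pred, groups):
    """Absolute TPR difference between groups 0 and 1 (violation of Eq. 9)."""
    return abs(conditional_positive_rate(y_true, y_pred, groups, 0, 1)
               - conditional_positive_rate(y_true, y_pred, groups, 1, 1))


def equalized_odds_gap(y_true, y_pred, groups):
    """Larger of the TPR gap (Eq. 9) and the FPR gap (Eq. 10)."""
    fpr_gap = abs(conditional_positive_rate(y_true, y_pred, groups, 0, 0)
                  - conditional_positive_rate(y_true, y_pred, groups, 1, 0))
    return max(equal_opportunity_gap(y_true, y_pred, groups), fpr_gap)


# Toy data (illustration only).
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 1]
groups = [0, 0, 0, 0, 1, 1, 1, 1]

print(equal_opportunity_gap(y_true, y_pred, groups))
print(equalized_odds_gap(y_true, y_pred, groups))
```

A gap of zero in both rates corresponds to equalized odds; a zero TPR gap alone corresponds to equality of opportunity.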

4.1.4 Predictive parity

Predictive parity is satisfied when the positive predictive value is equal across different demographic subgroups (Chouldechova, 2017; Verma and Rubin, 2018; MacCarthy, 2017). The positive predictive value is the probability that individuals predicted to be in the positive class truly belong to that class. Mathematically,

\[ p(y = 1 \mid \hat{y} = 1, G = 0) = p(y = 1 \mid \hat{y} = 1, G = 1) \tag{11} \]

Some works require both the positive and the negative predictive values to be equal across groups (Mayson, 2018). Mathematically, the equality in negative predictive value can be written as

\[ p(y = 0 \mid \hat{y} = 0, G = 0) = p(y = 0 \mid \hat{y} = 0, G = 1) \tag{12} \]

4.2 Score-based metrics

In the previous section, a binary output $\hat{y}$ was employed to compute the value of the fairness metric. The metrics described in this section employ the continuous score $s$ obtained as the output of a model. These scores can be thresholded to obtain $\hat{y}$, which can then be used to apply the preceding metrics (Garg et al., 2020).

4.2.1 Calibration

A model is calibrated if, for all scores $s$, the probability of belonging to the positive class is the same irrespective of the demographic subgroup for the same $s$ (Chouldechova, 2017; Verma and Rubin, 2018; Mehrabi et al., 2021; Hardt et al., 2016). Mathematically,

\[ p(y = 1 \mid S = s, G = 0) = p(y = 1 \mid S = s, G = 1) \tag{13} \]
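With continuous scores, conditioning on an exact value S = s is not possible on finite data, so calibration is usually checked within score bins. The sketch below makes that approximation explicit; the equal-width binning, the assumption that scores lie in [0, 1], and all names and data are hypothetical.

```python
def calibration_table(scores, y_true, groups, n_bins=5):
    """Empirical p(y = 1 | score bin, group), with scores assumed to lie in [0, 1].

    Returns {bin_index: {group: fraction of positives in that bin}}.
    """
    counts = {}
    for s, y, g in zip(scores, y_true, groups):
        b = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 falls in the last bin
        cell = counts.setdefault(b, {}).setdefault(g, [0, 0])
        cell[0] += y  # positives seen in this bin for this group
        cell[1] += 1  # samples seen in this bin for this group
    return {b: {g: pos / tot for g, (pos, tot) in by_group.items()}
            for b, by_group in counts.items()}


def calibration_gap(table, g0=0, g1=1):
    """Largest per-bin difference between two groups, over bins containing both."""
    gaps = [abs(by_group[g0] - by_group[g1])
            for by_group in table.values()
            if g0 in by_group and g1 in by_group]
    return max(gaps, default=0.0)


# Toy data (illustration only).
scores = [0.9, 0.95, 0.85, 0.92, 0.1, 0.15, 0.12, 0.18]
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
groups = [0, 0, 1, 1, 0, 0, 1, 1]

table = calibration_table(scores, y_true, groups)
print(table)
print(calibration_gap(table))
```

The number of bins trades resolution against noise: narrow bins approximate Eq. 13 more closely but leave fewer samples per bin and group.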

4.2.2 Balance for positive/negative class

If the average scores $s$ for the positive classes of the demographic subgroups are the same, a balance exists for the positive class (Kleinberg et al., 2017). Similarly, a balance for the negative class exists when the average scores of the negative classes of the subgroups are the same. The positive and negative balance are denoted in Eqs. 14 and 15 below, respectively.

\[ E[s \mid y = 1, G = 0] = E[s \mid y = 1, G = 1] \tag{14} \]

\[ E[s \mid y = 0, G = 0] = E[s \mid y = 0, G = 1] \tag{15} \]
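Eqs. 14 and 15 compare mean scores within each true class. A minimal sketch follows; the function names and toy scores are hypothetical, and reporting the absolute differences as "gaps" is an assumption of this example.

```python
def mean_score(scores, y_true, groups, g, label):
    """Estimate E[s | y = label, G = g] as a sample mean."""
    values = [s for s, y, grp in zip(scores, y_true, groups)
              if grp == g and y == label]
    return sum(values) / len(values)


def balance_gaps(scores, y_true, groups):
    """Return (positive-class gap, negative-class gap) between groups 0 and 1."""
    pos_gap = abs(mean_score(scores, y_true, groups, 0, 1)
                  - mean_score(scores, y_true, groups, 1, 1))
    neg_gap = abs(mean_score(scores, y_true, groups, 0, 0)
                  - mean_score(scores, y_true, groups, 1, 0))
    return pos_gap, neg_gap


# Toy data (illustration only).
scores = [0.9, 0.7, 0.2, 0.4, 0.8, 0.6, 0.1, 0.3]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
groups = [0, 0, 0, 0, 1, 1, 1, 1]

print(balance_gaps(scores, y_true, groups))
```

Both gaps are zero when balance holds for the positive and the negative class.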

4.3 Facial analysis-specific metrics

The difference or standard deviation between different statistical measures of performance across subgroups is often used as a measure of fairness or bias. Different research works over the years have discussed the shortcomings of existing metrics. For classification parity-based metrics, the problem of infra-marginality has been observed (Corbett-Davies et al., 2018). Due to possible differences in risk distributions across subgroups, these metrics lead to taste-based discrimination (Thijssen, 2016). The search for an optimal fairness metric remains an open problem in the research community. Metrics have been proposed specifically for fairness in face recognition and attribute prediction algorithms (Howard et al., 2022). Some of them are described below.

4.3.1 Fairness discrepancy rate (FDR)

The FDR metric is defined for face verification systems. It is computed from the maximum differences in the false match rate (FMR/FPR) and the false non-match rate (FNMR/FNR) between two demographic subgroups $g_1$ and $g_2$ at a given threshold $t$ (de Freitas Pereira and Marcel, 2021). The differences are weighted by a parameter $\alpha$. Mathematically,

\[
\begin{aligned}
A(t) &= \max_{g_1, g_2} \lvert \mathrm{FMR}_{g_1}(t) - \mathrm{FMR}_{g_2}(t) \rvert \\
B(t) &= \max_{g_1, g_2} \lvert \mathrm{FNMR}_{g_1}(t) - \mathrm{FNMR}_{g_2}(t) \rvert \\
\mathrm{FDR}(t) &= 1 - (\alpha A(t) + (1 - \alpha) B(t))
\end{aligned}
\tag{16}
\]

The metric ranges from 0 to 1, with 1 being most fair and 0 being most unfair.
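Eq. 16 can be evaluated from per-subgroup genuine and impostor similarity scores. The following is a sketch under assumptions: a pair is treated as a match when its score is at least t, and the function names, the data layout, and the toy scores are hypothetical.

```python
from itertools import combinations


def error_rates(genuine, impostor, t):
    """Return (FMR, FNMR) at threshold t, treating score >= t as a match."""
    fmr = sum(s >= t for s in impostor) / len(impostor)
    fnmr = sum(s < t for s in genuine) / len(genuine)
    return fmr, fnmr


def fairness_discrepancy_rate(scores_by_group, t, alpha=0.5):
    """FDR(t) = 1 - (alpha * A(t) + (1 - alpha) * B(t)), following Eq. 16."""
    rates = {g: error_rates(d["genuine"], d["impostor"], t)
             for g, d in scores_by_group.items()}
    pairs = list(combinations(rates, 2))
    a = max((abs(rates[g1][0] - rates[g2][0]) for g1, g2 in pairs), default=0.0)
    b = max((abs(rates[g1][1] - rates[g2][1]) for g1, g2 in pairs), default=0.0)
    return 1.0 - (alpha * a + (1.0 - alpha) * b)


# Toy similarity scores for two subgroups (illustration only).
scores_by_group = {
    "g1": {"genuine": [0.9, 0.8, 0.7, 0.4], "impostor": [0.1, 0.2, 0.3, 0.6]},
    "g2": {"genuine": [0.9, 0.85, 0.8, 0.75], "impostor": [0.1, 0.2, 0.55, 0.6]},
}

print(fairness_discrepancy_rate(scores_by_group, t=0.5))
```

Taking the maximum over all pairs lets the same function handle more than two subgroups, with alpha setting the relative weight of false matches and false non-matches.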

4.3.2 Inequity rate (IR)

The IR metric was proposed by NIST in 2021 (Grother, 2022). Instead of taking differences into account, IR considers the ratio of the maximum to the minimum of the FMR and of the FNMR over demographic subgroups $g_1$ and $g_2$ at threshold $t$. The metric is described mathematically below (Howard et al., 2022), where $\alpha$ is used as a weighting factor.

\[
\begin{aligned}
A(t) &= \max_{g_1} \mathrm{FMR}_{g_1}(t) / \min_{g_2} \mathrm{FMR}_{g_2}(t) \\
B(t) &= \max_{g_1} \mathrm{FNMR}_{g_1}(t) / \min_{g_2} \mathrm{FNMR}_{g_2}(t) \\
\mathrm{IR}(t) &= A(t)^{\alpha} B(t)^{1 - \alpha}
\end{aligned}
\tag{17}
\]
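Given per-subgroup error rates at a fixed threshold, Eq. 17 reduces to two max/min ratios combined with exponents. The sketch below is illustrative; the function name, the dictionary layout, and the rates are hypothetical.

```python
def inequity_rate(fmr_by_group, fnmr_by_group, alpha=0.5):
    """IR = A**alpha * B**(1 - alpha), with A and B the max/min ratios of Eq. 17.

    The ratios are undefined if any subgroup has a zero error rate; this
    sketch does not handle that case and would raise ZeroDivisionError.
    """
    a = max(fmr_by_group.values()) / min(fmr_by_group.values())
    b = max(fnmr_by_group.values()) / min(fnmr_by_group.values())
    return (a ** alpha) * (b ** (1.0 - alpha))


# Hypothetical error rates at one threshold (illustration only).
fmr = {"g1": 0.01, "g2": 0.04}
fnmr = {"g1": 0.05, "g2": 0.05}

print(inequity_rate(fmr, fnmr, alpha=0.5))
```

An IR of 1 means all subgroups have identical error rates; larger values indicate a larger worst-case ratio between subgroups.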

4.3.3 Degree of bias

The recent Degree of Bias (DoB) metric (Gong et al., 2020) measures the standard deviation of the (verification/identification) accuracy (Acc) across different subgroups. It is calculated as

\[ \mathrm{DoB} = \mathrm{std}(\mathrm{Acc}_{g_j}) \quad \forall j \tag{18} \]

where $g_j$ represents a subgroup of a demographic group $G$. A large performance gap of the model across different subgroups results in a high DoB, indicating higher bias in the model prediction. DoB is commonly used for evaluating bias in face recognition models (Gong et al., 2020; Wang and Deng, 2021).
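Computing DoB only requires the per-subgroup accuracies. In the sketch below, the population standard deviation is an assumption, since the text does not say whether the population or the sample form is intended; the accuracies are invented.

```python
import statistics


def degree_of_bias(acc_by_group):
    """Standard deviation of accuracy across subgroups (population form, assumed)."""
    return statistics.pstdev(list(acc_by_group.values()))


# Hypothetical verification accuracies per subgroup (illustration only).
accuracy = {"subgroup_a": 0.90, "subgroup_b": 0.80, "subgroup_c": 0.85}

print(degree_of_bias(accuracy))
```

A model with identical accuracy on every subgroup has a DoB of zero, whatever its overall accuracy.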

4.3.4 Precise subgroup equivalence (PSE)

Majumdar et al. (2021a) propose a bias estimation metric termed Precise Subgroup Equivalence (PSE). PSE is a combination of the bias in the model prediction and the overall model performance. It is defined as

\[
\begin{aligned}
\mathrm{AFR} &= (\mathrm{FPR} + \mathrm{FNR})/2 \\
\mathrm{PSE} &= ((1 - \mathrm{DI}) + \mathrm{AFR} + \mathrm{DoB})/3
\end{aligned}
\tag{19}
\]

5 Fairness estimation and analysis

The prevalence of bias has adverse effects on modern technology, and various attempts have been made to understand and detect the presence of bias.

FIG. 4 Sources of bias in a facial analytics system pipeline. (A) Dataset bias, (B) bias in the pre-processing step, (C) bias in feature extraction and model training, and (D) bias in prediction.

Since bias can occur at various points in a typical facial analysis pipeline (Fig. 4), research efforts have been made to understand bias from different perspectives. A large body of work is dedicated to analyzing biased predictions of models for protected attributes such as gender and ethnic subgroups. In the following sections, we discuss the research toward understanding and evaluating bias in face detection and recognition (Section 5.1), as well as in facial attribute prediction (Section 5.2).

5.1 Fairness in face detection and recognition

The performance of face recognition systems has been observed to be inconsistent across different demographic groups.

5.1.1 Discovery

An early observation regarding fairness was made by Klare et al. (2012), who demonstrated the difference in recognition performance for different commercial and non-trainable algorithms. They observed consistently low performance for darker-skinned individuals and recommended either using balanced datasets for training algorithms or using separate algorithms for different subgroups. These observations were made in the pre-deep learning era, for face recognition algorithms using LBP and Gabor filters (Klare et al., 2012).

5.1.2 Disparate impact

With the onset of the deep learning era, demographic bias has been observed to remain an ongoing problem. The National Institute of Standards and Technology (NIST) released a report on demographic effects that evaluates the performance of face verification and identification systems across demographic groups (Grother et al., 2019). It reported the highest false positive rates for West and East African and East Asian people and the lowest for Eastern European individuals. Research has also shown that females tend to have a higher false match rate and a higher false non-match rate than males in face verification applications. To study this phenomenon, the score distributions obtained for genuine as well as impostor pairs across different demographic subgroups have been analyzed. Albiero et al. (2020a) show that, for females, the impostor distribution has higher similarity scores

while the genuine distribution has lower similarity scores. They also observed a skew in the data across gender with respect to facial expressions and identified it as a cause of the difference in verification performance. Further, they show through principal component analysis that images of two different females are inherently more similar than those of two different males (Albiero and Bowyer, 2020; Albiero et al., 2022). Robinson et al. (2020) analyzed the decision threshold for the face verification task and observed that different thresholds are optimal for different demographic subgroups. They highlighted how learning a single global threshold for matching introduces bias into the system. Similarly, for race, Vangara et al. (2019) used the MORPH dataset and observed that the genuine and impostor distributions are significantly different across the Caucasian and African-American subgroups. In another work focusing on bias in face recognition, Krishnapriya et al. (2020) made similar observations for face verification decision thresholds. They further explored optimal decision thresholds for one-to-many identification search. Some research has also been conducted on the presence of gender bias in face recognition for children. Srinivas et al. (2019a) compared the performance of multiple commercial-off-the-shelf (COTS) face recognition systems on the faces of adults vs children. They used publicly available databases and observed a negative bias for each system on children. On evaluating the performance of eight COTS systems on only the faces of children, Srinivas et al. (2019b) observed a substantial gender bias, as female faces exhibit lower identification rates than male faces.
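The effect of a single global decision threshold can be illustrated with a small sketch. It is not the procedure of Robinson et al. (2020) or Krishnapriya et al. (2020): the function names, the target false match rate, and the impostor scores are hypothetical, and a pair is assumed to match when its score is at least the threshold.

```python
def false_match_rate(impostor, t):
    """Fraction of impostor scores accepted at threshold t (score >= t is a match)."""
    return sum(s >= t for s in impostor) / len(impostor)


def threshold_for_fmr(impostor, target):
    """Smallest candidate threshold whose FMR does not exceed `target`.

    Candidates are the observed scores plus infinity, which always qualifies.
    """
    for t in sorted(set(impostor)) + [float("inf")]:
        if false_match_rate(impostor, t) <= target:
            return t


# Hypothetical impostor scores for two subgroups (illustration only).
impostor_a = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55]
impostor_b = [0.20, 0.30, 0.40, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80]

target = 0.1
global_t = threshold_for_fmr(impostor_a + impostor_b, target)
print("global threshold:", global_t)
print("FMR per group at global threshold:",
      false_match_rate(impostor_a, global_t),
      false_match_rate(impostor_b, global_t))
print("group-specific thresholds:",
      threshold_for_fmr(impostor_a, target),
      threshold_for_fmr(impostor_b, target))
```

When the impostor distributions differ, the pooled threshold meets the target on average but not for each subgroup, whereas group-specific thresholds equalize the false match rate at the cost of requiring group labels at decision time.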

5.1.3 Incorporation of demographic information during model training

The presence of gender and ethnic subgroup information in face recognition technology clearly indicates that deep models embed the aforementioned information and utilize it for predictions. In this spirit, Acien et al. (2018) attempted to infer gender and ethnic group information from pre-trained deep models. They observed that these models classify gender and ethnicity with nearly 95% accuracy on the LFW database. To further understand how deep models incorporate demographic information, Serna et al. (2020) used feature space visualizations along with class activation maps (CAMs), which form a popular technique for inspecting relevant pixels in the input image. They further comment on how the over-representation of certain demographic groups in popular face databases (dataset bias, Fig. 4A) has led to popular pre-trained deep face models being biased. In another attempt to understand the cause of bias in deep models, Nagpal et al. (2019) observed the presence of own-race and own-age bias in popular deep networks. They observed that deep models have a tendency to focus on selected facial regions for a particular ethnic subgroup, with these regions varying across different subgroups (see Fig. 5). In another work focusing on the manifestation of the other-race effect in humans,

FIG. 5 Feature visualizations showcasing models pre-trained, fine-tuned and trained from scratch focusing on different facial regions for faces belonging to different races (Nagpal et al., 2019).

Garcia et al. (2019) demonstrated the security hazards of demographic bias in real-world applications. They showed how demographic bias could be exploited by attackers to bypass automated biometric control mechanisms.

5.1.4 Dataset distribution during model training

To study the impact of dataset bias, Albiero et al. (2020b) studied how the distribution of males and females in the training set influenced the performance of face recognition models. Interestingly, they found that gender balance in the training data does not translate into gender balance in the test accuracy. On the other hand, Gwilliam et al. (2021) analyzed facial recognition performance by training on various imbalanced distributions across races. They observed less biased model predictions after training on a specific subgroup than after training on a balanced distribution. Further, the addition of more samples for existing identities in the database improved performance across racial subgroups. Many face recognition databases collected in the wild lack annotation information for protected attributes such as race and gender. This leads to incomplete information about a model’s ability to generalize across different subgroups (dataset bias, Fig. 4A). In Kortylewski et al. (2019), the authors used synthetically generated images for their study and observed a significant influence of pose variation on the generalization performance. The authors leveraged synthetic data for analysis and showcased how facial pose and facial identity cannot be completely disentangled by deep networks (bias in model training, Fig. 4C).

5.1.5 Role of latent factors during model training

Celis and Rao (2019) analyzed the latent representations of faces to understand potential sources of bias. They observed that the brightness and darkness of an image play an important role in the overall activation values. Images became brighter with increasing latent values and darker as the values decreased, highlighting the importance of skin color in latent representations. The low-dimensional latent feature representations of faces were generated using a variational autoencoder. To understand where bias is encoded in face recognition, certain works used CAMs and showed how the activated regions vary across different demographic subgroups. Albiero and Bowyer (2020) and Albiero et al. (2022) studied the influence of gendered hairstyles on male and female faces and how female hairstyles leave fewer pixels in the face region, which influences the performance of the model. In a similar direction, Terhörst et al. (2021) studied the influence of non-demographic factors such as accessories, hairstyles and colors, face shapes, and facial anomalies and observed that these factors strongly impacted recognition performance. Many face recognition systems have a mandatory face quality assessment step during the enrolment of an individual. The assessment step ensures that the face image meets a certain quality threshold, thereby providing a high-quality image for comparison at query time. Terhörst et al. (2020) studied the correlation between face quality estimation and demographic bias in face recognition (bias in the pre-processing step, Fig. 4B). On evaluating four face quality assessment algorithms for biases with respect to pose, ethnicity, and age, they observed bias toward frontal poses, against Asian and African-American ethnicities, and toward face images of individuals below 7 years. Joshi et al. (2022) performed a sensitivity analysis in the presence of perturbations and observed that increasing image exposure results in the model favoring dark features such as black hair and disfavoring features such as pale skin. Similarly, Majumdar et al. (2021b) studied the incorporation of bias in model predictions in the presence of real-world distortions (Fig. 6). In a recent study, Yucer et al. (2022) annotated popular datasets based on racial phenotypes such as Fitzpatrick skin types, nose shape, eyelid type, lip shape, and hair type. They discovered significantly more performance differences across racial phenotypes than across racial groups.

5.2 Fairness in attribute prediction

It has been observed in the literature that many deep learning-based systems are biased in their predictions when measured across different subgroups.

5.2.1 Discovery Buolamwini and Gebru (2018) evaluated the performance of three commercial gender-classification systems across faces with different skin tones. They observed a huge disparity in classification error rates for darker females vs lighter males.


188 Handbook of Statistics

[Figure 6 panels: similarity score (top row) and verification accuracy at 0.01 FAR (bottom row) against blur sigma (0.0, 2.0, 2.4) for (a) gender subgroups and (b) race subgroups.]

FIG. 6 Effect of Gaussian blur on the performance across different (A) gender and (B) race subgroups. Extracted features of the blurred images are matched with the corresponding clear image using cosine similarity (1.0 is a perfect match). Variation in similarity score is shown in the top row. The bottom row shows the verification performance (Majumdar et al., 2021b).
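The matching step described in the caption of Fig. 6 can be illustrated with a minimal sketch. The feature vectors below are hypothetical four-dimensional stand-ins (real face embeddings are much larger), and the function name is an assumption made for illustration; only the use of cosine similarity comes from the caption.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (1.0 is a perfect match)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical features of a clear image and of its blurred version.
clear_features = [0.9, 0.1, 0.3, 0.5]
blurred_features = [0.7, 0.3, 0.2, 0.6]

score_same = cosine_similarity(clear_features, clear_features)
score_blur = cosine_similarity(clear_features, blurred_features)
print(f"clear vs clear:   {score_same:.2f}")
print(f"clear vs blurred: {score_blur:.2f}")
```

A drop in the score as blur increases, compared across subgroups, is the kind of variation the top row of the figure reports.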

5.2.2 Disparate impact Deuschel et al. (2020) studied the impact of gender and skin tone on facial expression detection and used classification accuracy and heatmaps for quantitative and qualitative evaluations. Krishnan et al. (2020) investigated the impact of different deep models and training set imbalances on gender classification across different gender-race groups. They utilized the UTKFace and FairFace databases and highlighted how training set imbalance widens the gap in performance accuracy. The authors further studied facial morphology for different ethnic subgroups using facial landmark detection and obtained interesting insights about probable causes of disparity. To further investigate the impact of different skin tones on gender classification, Muthukumar (2019) used luminance mode-shift and optimal transport techniques to vary the skin tones. They reported that skin tone alone is not the driving factor for observed bias, and broader differences in ethnicity must be considered. A novel idea of adversarial fairness in the context of facial attribute prediction is proposed by Nanda et al. (2020). In adversarial fairness, a classifier is considered to be unbiased if it is equally robust against adversarial attacks for all subgroups. They further illustrated the presence of said unfairness on the Adience and UTKFace databases, among others. Recently, Jain et al. (2022) showcased the biased generation of facial images using popular GANs. They further highlighted how the synthetic faces of engineering professors are generated with more masculine facial features and lighter skin tones (Fig. 7). Joo and Karkkainen (2020) showcased a sharp difference in the performance of classifiers along the gender attribute with images of people in occupations such as nurse or engineer.
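The notion of adversarial fairness described above (equal robustness for all subgroups) can be probed with a simple per-subgroup comparison. The sketch below is an illustrative proxy, not the measure used by Nanda et al. (2020): it assumes that, for each test sample, we already know its subgroup and whether the classifier stayed correct under some attack. The record format and function names are hypothetical.

```python
from collections import defaultdict

def robust_accuracy_by_subgroup(records):
    """records: iterable of (subgroup, correct_under_attack) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for subgroup, correct in records:
        totals[subgroup] += 1
        hits[subgroup] += int(correct)
    return {g: hits[g] / totals[g] for g in totals}

def robustness_gap(records):
    """Largest difference in robust accuracy between any two subgroups."""
    acc = robust_accuracy_by_subgroup(records)
    return max(acc.values()) - min(acc.values())

# Toy outcomes: G1 survives 3 of 4 attacks, G2 survives 1 of 4.
records = [("G1", True), ("G1", True), ("G1", True), ("G1", False),
           ("G2", True), ("G2", False), ("G2", False), ("G2", False)]
per_group = robust_accuracy_by_subgroup(records)
gap = robustness_gap(records)
print(per_group, gap)
```

A gap near zero would be consistent with adversarial fairness in this simplified sense; a large gap indicates that one subgroup is easier to attack.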

5.2.3 Counterfactual analysis Joo and Karkkainen (2020) proposed an approach for understanding bias that involves using counterfactuals where they synthesized counterfactual face images with varying gender and ethnic groups keeping the other signals constant. Using these samples, they analyzed performance on different downstream tasks and commented on different hidden biases in the system. Similarly, Denton et al. (2019) performed sensitivity analysis based on the performance of deep models using generated counterfactuals. The counterfactuals are generated by the manipulation of latent vectors obtained using a progressive GAN on the CelebA database (Liu et al., 2015). While counterfactual approaches provide explicit control over the manipulation of different attributes such as gender and skin tone, generative samples may not be representative of real-world distributions. Quadrianto et al. (2019) translated the data from the input domain to a fair target domain. They observed interesting outcomes where the model adjusts eyes and lips regions in males to enforce fairness in predictions.

FIG. 7 Illustrative test set of transformations on female celebrity faces. These examples highlight how the generator learns to transform the feminine facial features to appear more masculine so as to fool the discriminator into thinking the transformed celebrity face portrays an engineering professor (Jain et al., 2022).


Further studying the interdependence between factors leading to biased model predictions, Barlas et al. (2021) analyzed the correlation between context and gender in five proprietary image tagging algorithms.

5.2.4 Role of latent factors during model training The fairness of models is generally attributed to the difference in performance across subgroups. In a different direction, Serna et al. (2021) proposed InsideBias, in which they studied how the model represents the information instead of how it performs. They analyzed the features learned by the models while training using an unbalanced dataset in the context of bias. Serna et al. (2022b) showed how analyzing model weights provides insights into its biased behavior. Li and Xu (2021) have proposed a variation loss that optimizes the hyperplane in the latent space to obtain biased attribute information. In an interesting analysis, Qiu et al. (2021) studied the relationship between gender classification and face verification performance. They observe that for impostor image pairs, when both images in the pair result in a gender classification error, the face recognition model provides a high similarity score for the pair, leading to a false match error. Researchers continue to study the influence of bias in predictions (Fig. 4D) and predominantly how it might have been incorporated at the data and algorithm level (Fig. 4A and C). A large body of work focuses on analyzing popular deep model architectures and COTS algorithms. Popular tools include feature map visualizations, fairness and performance evaluation metrics across subgroups, generated counterfactuals, and skewed training. However, identifying bias in an automated manner remains an open problem and requires the attention of the research community. A limited number of studies have been performed to detect bias-inducing factors in this direction, such as UDIS (Krishnakumar et al., 2021). UDIS utilizes a hierarchical clustering technique to cluster similar dataset embeddings. Based on the clusters’ silhouette score, possibly biased models are identified. Results showcasing the effectiveness of UDIS are depicted in Fig. 8. The techniques for analysis are summarized in Table 5.
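The silhouette score mentioned for UDIS can be made concrete with a small sketch. This is not the UDIS pipeline: the hierarchical clustering step is omitted, cluster labels are assumed to be given, at least two clusters are assumed, and the two-dimensional points are hypothetical stand-ins for dataset embeddings.

```python
import math

def mean_silhouette(points, labels):
    """Mean silhouette coefficient; labels must be a list with >= 2 distinct values."""
    scores = []
    for i, (p, lp) in enumerate(zip(points, labels)):
        # Distances to the other members of the point's own cluster.
        same = [math.dist(p, q)
                for j, (q, lq) in enumerate(zip(points, labels))
                if lq == lp and j != i]
        if not same:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(same) / len(same)
        # Mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q, lq in zip(points, labels) if lq == other)
            / labels.count(other)
            for other in set(labels) if other != lp
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
tight = mean_silhouette(points, [0, 0, 1, 1])   # well-separated clusters
mixed = mean_silhouette(points, [0, 1, 0, 1])   # clusters cut across the gap
print(f"{tight:.2f} {mixed:.2f}")
```

Scores close to 1 indicate compact, well-separated clusters, while scores near or below 0 indicate overlapping ones; how UDIS turns such scores into a decision about a model is described in Krishnakumar et al. (2021) and is not reproduced here.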

FIG. 8 Heatmaps showcasing the regions utilized by the model for prediction of the “smiling” attribute through the UDIS algorithm. The heatmaps highlight highly relevant “mouth” region being used for prediction (Krishnakumar et al., 2021).


TABLE 5 Different works analyzing bias in facial analysis tasks of face recognition and attribute prediction.

Face detection and recognition | Attribute prediction
Klare et al. (2012), Albiero et al. (2020a), Robinson et al. (2020), Vangara et al. (2019), Krishnapriya et al. (2020), Srinivas et al. (2019a,b), Acien et al. (2018), Serna et al. (2020), Nagpal et al. (2019), Garcia et al. (2019), Kortylewski et al. (2019), Gwilliam et al. (2021), Celis and Rao (2019), Terhörst et al. (2020), Majumdar et al. (2021b) | Buolamwini and Gebru (2018), Deuschel et al. (2020), Krishnan et al. (2020), Muthukumar (2019), Nanda et al. (2020), Joo and Karkkainen (2020), Denton et al. (2019), Quadrianto et al. (2019), Barlas et al. (2021), Serna et al. (2021, 2022b)

6 Fair algorithms and bias mitigation

In this section, we provide a taxonomy of the bias mitigation techniques proposed for various face analysis tasks (see Fig. 9). We discuss the mitigation techniques proposed for face detection and recognition in Section 6.1, followed by mitigation techniques for attribute prediction in Section 6.2. The proposed techniques are primarily constituted by deep learning techniques, which are subdivided into popular classes of algorithms. These classes include Adversarial learning-based algorithms, Generative algorithms, and Black-box learning algorithms. The techniques which do not fall into any of these classes are labeled as Bias-aware deep learning techniques.

6.1 Face detection and recognition The majority of the algorithms developed for mitigating bias in face detection and recognition are based on deep learning. Based on the taxonomy presented in Fig. 9, the following face detection and recognition approaches have been proposed in the literature for the development of fairer algorithms. The approaches have also been summarized in Table 6.

6.1.1 Adversarial learning approaches Some researchers have used adversarial learning-based approaches for bias mitigation. Alasadi et al. (2019) presented a framework for matching low-resolution and high-resolution facial images. The aim is to mitigate bias in cross-domain face recognition. The proposed framework consists of two parts: the first maximizes the face-matching quality, and the second minimizes the prediction of demographic properties to reduce bias in model prediction. A novel de-biasing adversarial network is proposed by Gong et al. (2020) that adversarially learns to generate disentangled representations for unbiased face and demographic recognition. The proposed network consists of four classifiers to distinguish the identity, gender, race, and age of the facial images. Dhar et al. (2020) presented a novel Adversarial Gender Debiasing algorithm to reduce the gender prediction ability of face descriptors. The proposed algorithm unlearns the gender information in descriptors while training them for classification.

[Figure 9 diagram: approaches for face detection and recognition and for attribute prediction, each divided into adversarial learning, black-box/pre-trained, generative, and other deep learning.]

FIG. 9 Different bias mitigation techniques classified by the taxonomy proposed for tasks of face recognition and attribute prediction.

TABLE 6 Taxonomy of different approaches for bias mitigation in facial analysis tasks of face recognition and attribute prediction.

Approaches | Face detection and recognition | Attribute prediction
Adversarial | Alasadi et al. (2019), Gong et al. (2020), Dhar et al. (2020) | Wang et al. (2019b), Adeli et al. (2021), Dash et al. (2022), Dullerud et al. (2022), Wang et al. (2022a)
Black-box/Pre-trained | Serna et al. (2022a), Terhorst et al. (2020a) | Dwork et al. (2018), Kim et al. (2019a), Nagpal et al. (2020a), Roh et al. (2021), Majumdar et al. (2021a, 2021c), Jang et al. (2022), Du et al. (2021)
Generative | McDuff et al. (2019), Yucer et al. (2020), Conti et al. (2022) | Mullick et al. (2019), Choi et al. (2020), Ramaswamy et al. (2021), Tan et al. (2020), Georgopoulos et al. (2021), Kolling et al. (2022)
Bias-aware deep learning | Amini et al. (2019), Huang et al. (2019), Vera-Rodriguez et al. (2019), Wang et al. (2019a), Terhorst et al. (2020b), Bruveris et al. (2020), Gong et al. (2021), Yang et al. (2021), Franco et al. (2022), Wang et al. (2022), Liu et al. (2022) | Ryu et al. (2018), Das et al. (2018), Alvi et al. (2018), Kim et al. (2019b), Nagpal et al. (2020b), Wang et al. (2020), Chuang and Mroueh (2021), Park et al. (2021), Agarwal et al. (2022), Cao et al. (2022)

6.1.2 Pre-trained and black box approaches Researchers have also proposed techniques that could be combined with pre-trained models to improve their performance and reduce bias in model prediction. Serna et al. (2022a) proposed a discrimination-aware learning method termed Sensitive loss for bias mitigation. The proposed loss function is based on the triplet loss function and a sensitive triplet generator to improve the performance of pre-trained models, as shown in Fig. 10. A novel unsupervised fair score normalization approach based on individual fairness is proposed by Terhorst et al. (2020a). The proposed solution can be easily integrated into existing systems, where it reduces bias and improves the overall recognition performance.

6.1.3 Generative approaches Generative approaches are adopted by researchers to synthesize images of the under-represented class for bias mitigation. McDuff et al. (2019) proposed a simulation-based approach using generative adversarial models for synthesizing facial images to mitigate gender and racial biases in commercial systems. A novel data augmentation methodology is proposed by Yucer et al. (2020) to balance the training database at a per-subject level. The authors used image-to-image transformation to transfer facial features with sensitive racial characteristics while preserving the identity-related features to mitigate racial bias. Recently, Conti et al. (2022) proposed a Fair von Mises-Fisher loss for gender bias mitigation in face recognition. A shallow network is trained on top of the pre-trained embeddings using the proposed loss function.
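The Sensitive loss discussed in Section 6.1.2 is described as building on the triplet loss. As background, a plain triplet loss can be sketched as follows. This is the generic formulation with Euclidean distances, not the Sensitive loss itself: the sensitive triplet generator is not modeled, the margin value is an arbitrary assumption, and some formulations use squared distances instead.

```python
import math

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Hypothetical 2-D embeddings.
anchor = (0.0, 0.0)
positive = (0.1, 0.0)          # same identity, close to the anchor
easy_negative = (1.0, 0.0)     # different identity, far away
hard_negative = (0.2, 0.0)     # different identity, almost as close as the positive

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
print(easy, hard)
```

Only triplets whose negative lies within the margin of the positive contribute to the loss, which is why the way triplets are selected matters for what the embedding learns.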

FIG. 10 The discrimination-aware learning method termed as Sensitive loss for bias mitigation (Serna et al. 2022a).


6.1.4 Bias-aware deep learning approaches A wide variety of approaches fall under this category. These approaches include designing novel loss functions, custom networks, and discrimination-aware learning methods. Amini et al. (2019) proposed a novel algorithm for mitigating hidden bias in face detection algorithms. The proposed algorithm uses a variational autoencoder to learn the latent structure within the database. The learned latent distributions are used to re-weight the importance of certain data points during training. In the literature, it is shown that imbalanced class distribution leads to biased predictions of deep models. To handle the problem of imbalanced class distribution, Huang et al. (2019) proposed to learn Cluster-based Large Margin Local Embedding, and combined it with the k-nearest cluster algorithm for improved recognition performance. A deep information maximization adaptation network is proposed by Wang et al. (2019a) for bias mitigation using deep unsupervised domain adaptation techniques. The authors considered Caucasian as the source domain and other races as target domains to decrease the race gap at the domain level. Vera-Rodriguez et al. (2019) proposed to train gender-specific Deep Convolutional Neural Networks using triplet loss to mitigate gender bias and improve the feature representation for both genders. Terhorst et al. (2020b) proposed a fairness-driven neural network classifier that works on the comparison level of a biometric system. A novel penalization term in the loss function is used to train the proposed classifier to reduce the intra-ethnic performance differences and introduce both group and individual fairness to the decision process. Bruveris et al. (2020) proposed to mitigate bias in face recognition due to imbalanced data distribution by employing sampling strategies that balance the training procedure. Attention mechanisms are also used to enhance the fairness of recognition models. In this direction, Gong et al. (2021) proposed a group adaptive classifier that uses adaptive convolution kernels and attention mechanisms for bias mitigation. The adaptive module and the attention maps help to activate different facial regions to learn discriminative features for each demographic group. Yang et al. (2021) proposed the race-adaptive margin-based face recognition (RamFace) model to enhance the discriminability of the features. The authors also proposed a race adaptive margin loss function to automatically derive different optimal margins to mitigate the effect of racial bias. Franco et al. (2022) impose the constraint of Demographic Parity as a regularizer at different layers of the deep network and showcase explainability analysis (see Fig. 11). Wang et al. (2022) propose a Meta Balanced Network (MBN) to adaptively learn margins in margin-based losses for improved fairness. Liu et al. (2022) propose a Multi-variation Cosine Margin (MvCoM) loss function to simultaneously consider factors such as head pose, occlusion, and blur in addition to ethnicity to train the model. Using a meta-learning setup, they show improved fairness on the RFW test set.
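Demographic Parity, which Franco et al. (2022) are described as imposing through a regularizer, asks that the rate of positive predictions be the same for every demographic group. The sketch below only measures the violation of that constraint on binary predictions; it is not the differentiable regularizer used inside a network, and the data and function names are hypothetical.

```python
def positive_rate(predictions, groups, group):
    """Fraction of positive (1) predictions within one group."""
    selected = [p for p, g in zip(predictions, groups) if g == group]
    return sum(selected) / len(selected)

def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rate between any two groups."""
    rates = {g: positive_rate(predictions, groups, g) for g in set(groups)}
    return max(rates.values()) - min(rates.values())

# Toy binary predictions for two groups of four samples each.
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

dp_gap = demographic_parity_gap(predictions, groups)
print(dp_gap)
```

A gap of zero means the constraint holds exactly on this sample; a training-time regularizer would typically penalize a smooth surrogate of this quantity.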


FIG. 11 Heatmap visualization showcasing the presence of distinctive facial regions for age classification. The fairness regularizer indicates an emphasis on the eye region for younger people (age less than 30), whereas an emphasis on the skin portion below the nose, on the cheeks, and around the mouth for older people (Franco et al., 2022).

Recently, a reinforcement learning-based approach has been used by Wang and Deng (2020) to learn balanced features and remove racial bias in face recognition using the idea of adaptive margin. It is observed that the majority of the algorithms for mitigating bias in face recognition are designed to mitigate bias for a specific demographic group. Therefore, these algorithms may not generalize for other demographic groups (Xu et al., 2021).

6.2 Attribute prediction Several algorithms have been proposed for bias mitigation in attribute prediction.

6.2.1 Adversarial approaches Wang et al. (2019b) proposed an adversarial approach to remove bias from intermediate representations of a deep neural network. The proposed approach reduces gender bias amplification and maintains the overall model performance. Adeli et al. (2021) used adversarial training to maximize the discriminative power of the learned features with respect to the main task and minimize the statistical mean dependence on the bias variable. By minimizing the dependency on the bias variable, the authors have shown a reduced effect of bias in model prediction. Dash et al. (2022) generate counterfactuals using inference in a GAN-like framework, which are then used to mitigate bias with respect to sensitive attributes such as skin color. Dullerud et al. (2022) study bias in deep metric learning and observe that the bias induced in the initial training of the network propagates to downstream tasks. The authors further propose Partial Attribute Decorrelation (PARADE) during training, which employs adversarial separation for mitigation of the aforementioned bias in the downstream tasks. Wang et al. (2022a) proposed a fairness-aware adversarial perturbation (FAAP) approach to mitigate bias in a black-box setting. They employ a discriminator to identify fairness-related features and then use a perturbation generator such that no such fairness information can be extracted.

6.2.2 Pre-trained and black-box approaches Dwork et al. (2018) proposed a decoupling technique for bias mitigation of black-box ML algorithms. The proposed technique learns separate classifiers for different groups to increase fairness and accuracy in classification systems. Another algorithm for mitigating bias in black-box models is proposed by Kim et al. (2019a). The algorithm is termed MULTIACCURACY BOOST, which post-processes a pre-trained model to improve the performance across identifiable subgroups. Nagpal et al. (2020a) proposed diversity blocks to de-bias existing models. The diversity block is trained using small training data and can be added to any black-box model. Roh et al. (2021) addressed the problem of bias mitigation using bi-level optimization. They proposed to adaptively select mini-batches for improving model fairness. A bias mitigation algorithm based on adversarial perturbation is proposed by Majumdar et al. (2021a). The proposed algorithm learns a subgroup invariant perturbation to be added to the input database to generate a transformed database. The transformed database, when given as input to a model, produces unbiased outcomes. The proposed algorithm is able to mitigate bias in pre-trained model prediction without re-training. Majumdar et al. (2021c) proposed an algorithm based on an attention mechanism for mitigating bias in pre-trained models. Recently, Jang et al. (2022) proposed a post-processing method for improving fairness which involves adaptive learning of decision thresholds for different subgroups. Du et al. (2021) proposed a Representation Neutralization for Fairness (RNF) algorithm which trains only the classification head of the model to debias. The performance of the model is compared using Demographic Parity and Equalized Odds.

6.2.3 Generative approaches Oversampling the minority class is one of the popular techniques for handling the class imbalance problem. An imbalance in class distribution leads to biased predictions, and multiple generative approaches have been proposed for mitigation. Mullick et al. (2019) proposed a generative approach that uses a convex generator to generate images of the minority class near the peripheries of the classes. The classifier boundaries are thus adjusted in such a way that reduces the misclassification of the minority classes. Ramaswamy et al. (2021) used generative adversarial networks (GANs) to generate images for data augmentation. The generated images are perturbed in the latent space to generate training data that is balanced with respect to the protected attribute for bias mitigation. A weakly-supervised algorithm is proposed by Choi et al. (2020) to overcome database bias for deep generative models. An additional unlabeled database is required by the proposed approach for bias detection in existing databases. Tan et al. (2020) proposed an effective method for improving the fairness of image generation for a pre-trained GAN model without re-training. Images generated using the proposed method are applied for bias quantification in commercial face classifiers. A multi-attribute framework is proposed by Georgopoulos et al. (2021) to transfer facial patterns even for the underrepresented subgroups. The proposed method helps to mitigate dataset bias by data augmentation. Kolling et al. (2022) generate new annotations from existing human-annotated data and utilize them for training different models on different annotations. This label diversity improves the overall fairness of the models.
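Section 6.2.3 opens by naming minority-class oversampling as a common remedy for class imbalance. The simplest form, random oversampling by duplication, can be sketched as below. This is a baseline for illustration only: the generative approaches surveyed above synthesize new images rather than repeating existing ones, and the sample identifiers and function name are hypothetical.

```python
import random
from collections import Counter

def oversample_minority(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(label)
    return out_samples, out_labels

# Toy dataset: six majority-class images and two minority-class images.
samples = [f"img{i}" for i in range(8)]
labels = ["major"] * 6 + ["minor"] * 2

balanced_samples, balanced_labels = oversample_minority(samples, labels)
print(Counter(balanced_labels))
```

Because duplication adds no new visual variation, it can encourage overfitting to the few minority samples, which is one motivation for the generative alternatives.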

6.2.4 Bias-aware deep learning approaches Among other deep learning approaches, Ryu et al. (2018) proposed InclusiveFaceNet for detecting facial attributes by transferring race and gender representations. The aim is to improve the detection accuracy across different race and gender subgroups. A Multi-Task Convolution Neural Network is proposed by Das et al. (2018) with joint dynamic loss weight adjustment. The aim is to mitigate soft biometrics-related bias by joint classification of race, gender, and age. A joint learning and unlearning algorithm is proposed by Alvi et al. (2018) to remove bias from the feature representation of a network. The proposed algorithm is able to remove bias when the network is trained on an extremely biased database. One set of algorithms aims to unlearn the model’s dependency on sensitive attributes. Kim et al. (2019b) proposed a regularization loss to minimize the mutual information between feature embedding and bias to unlearn the bias information. Attribute aware filter drop is proposed by Nagpal et al. (2020b), which performs the primary attribute classification task while unlearning the dependency of the model on sensitive attributes. The approach used is shown in Fig. 12. Tartaglione et al. (2021) proposed a regularization strategy to disentangle the biased features while entangling the features belonging to the same target class. Apart from this, some techniques are proposed that use feature distillation (Jung et al., 2021) and mutual information between the learned representation (Ragonesi et al., 2021). To understand different bias mitigation algorithms, Wang et al. (2020) provided a thorough analysis of the existing bias mitigation techniques. They further designed a domain-independent training technique for bias mitigation. An interesting data augmentation strategy, fair mixup, is proposed by Chuang and Mroueh (2021) to optimize group fairness constraints. The authors proposed to regularize the model on interpolated distributions between different subgroups of a demographic group. Recently, Park et al. (2021) argued that removing the information of sensitive attributes in the decision process has the limitation of eliminating beneficial information for target tasks. To overcome this, the authors proposed Fairness-aware Disentangling Variational Auto-Encoder for disentangling data representation into three latent subspaces. A decorrelation loss is proposed to align the overall information into each subspace instead of removing the information of sensitive attributes. Agarwal et al. (2022) study contextual information and select a fair subset of data corresponding to the co-occurrence with other attributes for gender in the CelebA dataset. By selecting a fairer subset of data for training, they achieve fairer performance for gender classification. Similarly, for fairer and more accurate age prediction, authors have proposed distribution-aware data curation methods which employ out-of-distribution techniques for data selection (Cao et al., 2022). Recently, Park et al. (2021) analyzed bias caused by supervised contrastive learning and proposed a Fair Supervised Contrastive Loss (FSCL) for fairer learning in the contrastive setting. The majority of the algorithms are focused on alleviating the influence of demographics on model predictions for enhancing fairness. However, Terhorst et al. (2020c) demonstrated that face templates store non-demographic characteristics as well. The biased prediction of models could be due to the non-demographic characteristics stored in the face images. Therefore, considering the non-demographic attributes during bias mitigation can prove to be important for designing robust systems.

FIG. 12 A filter-drop approach that performs the primary attribute classification task while unlearning the dependency of the model on sensitive attributes (Nagpal et al., 2020b).
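The fair mixup strategy of Section 6.2.4 is described as regularizing the model on interpolated distributions between subgroups. The interpolation step alone can be sketched as below. This is a simplified illustration, not the method of Chuang and Mroueh (2021): the fairness regularizer computed along the interpolation path is omitted, and the feature vectors are hypothetical.

```python
def interpolate(x_a, x_b, t):
    """Convex combination of two feature vectors; t=1 returns x_a, t=0 returns x_b."""
    return [t * a + (1 - t) * b for a, b in zip(x_a, x_b)]

# Hypothetical features of one sample from each of two subgroups.
sample_group_a = [0.0, 2.0]
sample_group_b = [4.0, 6.0]

# Points along the path between the two subgroups.
path = [interpolate(sample_group_a, sample_group_b, t) for t in (1.0, 0.5, 0.0)]
print(path)
```

Evaluating the model at such intermediate points gives it training signal in the region between subgroups, where no real samples exist.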

7 Meta-analysis of algorithms

The trend in deep learning models has been toward improving fairness for different demographic subgroups. As illustrated in Fig. 13, newer algorithms have shown improved fairness across different racial subgroups. In Fig. 13, the verification accuracy across different subgroups of the RFW dataset is depicted for several research works. Over time, the gap between the performance across subgroups has reduced substantially. Table 7 further demonstrates the performance of recent face recognition algorithms on the RFW dataset. The last column depicts the bias in the system and is calculated as the standard deviation of accuracy reported across the four ethnicity subsets. For the task of facial attribute prediction, a reduction in bias has also been observed over the years. Using the UTKFace, MORPH, and LFWA datasets, in Table 8, the performance of different techniques is compared in terms of error rate as well as Degree of Bias and Precise Subgroup Equivalence. The models are trained using one dataset for gender prediction, and bias estimation is performed for the other datasets. Overall, the bias in the performance of face recognition and attribute prediction has reduced substantially across different ethnic and gender subgroups.
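The bias column of Table 7 is described as the standard deviation of accuracy across the four ethnicity subsets. A minimal sketch, using the ArcFace accuracies reported in Table 7, is given below. The choice of the sample standard deviation (dividing by n - 1) is an assumption; it is used here because it appears to reproduce the tabulated value of 0.84 for this row.

```python
import statistics

# ArcFace verification accuracy (%) on the four RFW subsets, from Table 7.
arcface = {"CA": 98.80, "AA": 97.48, "EA": 96.80, "IN": 97.38}

accuracies = list(arcface.values())
average = statistics.mean(accuracies)
bias = statistics.stdev(accuracies)  # sample standard deviation (assumed)

print(f"average = {average:.3f}, bias = {bias:.3f}")
```

A lower value means the four subsets are served more evenly, independently of how high the average accuracy is.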

[Figure 13 plot: accuracy (%), on an axis from 80 to 100, for the Caucasian, Indian, Asian, and African subgroups; horizontal-axis entries: Wang et al. (2020) ×3, Gong et al. (2020), Wang et al. (2020) ×2, Faraki et al. (2021), Yang et al. (2021), Gong et al. (2021), Xu et al. (2021), Li et al. (2021), Chrysos et al. (2021).]

FIG. 13 Meta-analysis of the verification accuracy reported on the RFW dataset across the four racial subgroups.

TABLE 7 Performance of existing techniques on the challenging RFW face recognition benchmark. CA, AA, EA, and IN denote the Caucasian, African American, East Asian, and Indian subsets of RFW, respectively.

Method | CA | AA | EA | IN | Avg. ↑ | Bias ↓
ArcFace (Deng et al., 2019) | 98.80 | 97.48 | 96.80 | 97.38 | 97.62 | 0.84
URFace (Shi et al., 2020) | 98.35 | 96.76 | 96.10 | 96.63 | 96.96 | 0.96
DebFace (Gong et al., 2020) | 95.95 | 93.67 | 94.33 | 94.78 | 94.68 | 0.96
RL-RBN (Wang and Deng, 2020) | 97.08 | 94.87 | 95.57 | 95.63 | 95.79 | 0.93
CIFP (Xu et al., 2021) | 97.08 | 94.87 | 95.57 | 95.63 | 95.79 | 0.93
GAC (Gong et al., 2021) | 97.60 | 97.03 | 95.65 | 96.82 | 96.78 | 0.82
DAM-L (Liu et al., 2021) | 96.13 | 93.95 | 93.75 | 94.70 | 94.63 | 1.08
DAM-R (Liu et al., 2021) | 96.30 | 94.51 | 94.31 | 95.20 | 95.08 | 0.90
CosFace (Wang et al., 2018b) | 99.01 | 97.62 | 97.20 | 97.96 | 97.95 | 0.77
CB-CosFace (Ren et al., 2018) | 99.03 | 98.23 | 97.36 | 97.83 | 98.11 | 0.70
LDAM-CosFace (Cao et al., 2019) | 98.93 | 97.80 | 97.23 | 97.50 | 97.87 | 0.75
RamFace (Yang et al., 2021) | 97.40 | 96.25 | 95.50 | 96.58 | 96.43 | 0.78
MetaCW (Jamal et al., 2020) | 99.13 | 97.86 | 97.73 | 98.11 | 98.21 | 0.63
Sensitive Loss (Serna et al., 2022a) | 97.23 | 95.82 | 96.50 | 96.95 | 96.62 | 0.61
MvCoM-URFace (Liu et al., 2022) | 98.85 | 97.18 | 97.15 | 96.98 | 97.54 | 0.88
MvCoM-CosFace (Liu et al., 2022) | 99.16 | 98.06 | 97.78 | 98.28 | 98.32 | 0.60


TABLE 8 Performance comparison of gender prediction models (%) on the UTKFace, MORPH and LFWA datasets using the Multi-task learning (Das et al., 2018), Filter-drop (Nagpal et al., 2020b), and SIP (Majumdar et al., 2021a) algorithms for bias mitigation.

Model trained on | Bias estimated on | Method | Error G1 | Error G2 | DoB | PSE
MORPH | UTKFace | Pre-trained | 22.64 | 41.37 | 9.36 | 22.02
MORPH | UTKFace | Fine-tuned | 15.61 | 29.30 | 6.84 | 15.17
MORPH | UTKFace | Multi-task | 23.65 | 37.87 | 7.11 | 18.83
MORPH | UTKFace | Filter Drop | 28.08 | 34.99 | 3.46 | 14.86
MORPH | UTKFace | SIP | 10.18 | 22.30 | 6.06 | 11.93
MORPH | LFWA | Pre-trained | 18.25 | 51.74 | 16.74 | 30.90
MORPH | LFWA | Fine-tuned | 15.00 | 31.67 | 8.33 | 17.09
MORPH | LFWA | Multi-task | 24.01 | 38.52 | 7.25 | 19.20
MORPH | LFWA | Filter Drop | 29.77 | 33.33 | 1.78 | 12.79
MORPH | LFWA | SIP | 11.91 | 23.15 | 5.62 | 11.97
UTKFace | MORPH | Pre-trained | 41.27 | 19.02 | 11.12 | 22.91
UTKFace | MORPH | Fine-tuned | 5.84 | 28.52 | 11.34 | 17.53
UTKFace | MORPH | Multi-task | 10.00 | 22.80 | 6.40 | 12.34
UTKFace | MORPH | Filter Drop | 8.88 | 21.67 | 6.40 | 11.90
UTKFace | MORPH | SIP | 14.11 | 3.59 | 5.26 | 8.34
UTKFace | LFWA | Pre-trained | 53.80 | 20.58 | 16.61 | 31.88
UTKFace | LFWA | Fine-tuned | 8.69 | 18.75 | 5.03 | 9.92
UTKFace | LFWA | Multi-task | 9.63 | 23.07 | 6.72 | 12.64
UTKFace | LFWA | Filter Drop | 9.88 | 20.94 | 5.53 | 11.06
UTKFace | LFWA | SIP | 15.92 | 4.75 | 5.58 | 9.21
LFWA | UTKFace | Pre-trained | 29.88 | 39.39 | 4.75 | 17.65
LFWA | UTKFace | Fine-tuned | 3.99 | 54.18 | 25.09 | 35.48
LFWA | UTKFace | Multi-task | 27.46 | 36.70 | 4.62 | 16.47
LFWA | UTKFace | Filter Drop | 30.16 | 34.05 | 1.95 | 13.20
LFWA | UTKFace | SIP | 18.12 | 26.58 | 4.23 | 12.30
LFWA | MORPH | Pre-trained | 19.12 | 58.50 | 19.69 | 35.56
LFWA | MORPH | Fine-tuned | 5.50 | 50.42 | 22.46 | 32.65
LFWA | MORPH | Multi-task | 39.67 | 28.74 | 5.47 | 18.34
LFWA | MORPH | Filter Drop | 34.43 | 37.32 | 1.45 | 13.90
LFWA | MORPH | SIP | 15.95 | 27.33 | 5.69 | 13.62

The bias is quantified using the Degree of Bias (DoB) and Precise Subgroup Equivalence (PSE) metrics.
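The DoB column of Table 8 can be checked against the two error columns. The sketch below assumes that DoB is the population standard deviation of the subgroup error rates (for two subgroups, half the absolute gap between G1 and G2). The table does not state this definition, but the assumption reproduces the tabulated DoB values up to rounding. The function name `degree_of_bias` is illustrative and does not come from the cited works.

```python
import math


def degree_of_bias(errors):
    """Population standard deviation of per-subgroup error rates.

    Assumed definition of DoB; it matches the Table 8 values up to rounding.
    """
    mean = sum(errors) / len(errors)
    return math.sqrt(sum((e - mean) ** 2 for e in errors) / len(errors))


# (G1 error, G2 error) pairs copied from two rows of Table 8
filter_drop = (29.77, 33.33)  # tabulated DoB: 1.78
sip = (10.18, 22.30)          # tabulated DoB: 6.06

print(f"Filter Drop DoB: {degree_of_bias(filter_drop):.2f}")
print(f"SIP DoB:         {degree_of_bias(sip):.2f}")
```

Under this reading, a lower DoB only says that the two subgroup errors are close to each other; it does not say that either error is low, which is why the error columns are reported alongside it.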

8 Topography of commercial systems and patents

Face-based technologies are widely patented all over the world. A patent search for the keywords “face recognition” returns more than 450k results (https://www.lens.org/lens/search/patent/list?preview=true&q=face%20recognition). The number of patents in face recognition technology has grown considerably since the early 2000s. This trend is presented in Fig. 14, where the number of patents grows from approximately 5k in the year 2000 to 50k in the year 2021. The top owners of these patents are tech giants such as Google, Microsoft, and Apple (see Fig. 14, right). In the last 2 years, over 35k patents have been filed and published on face recognition technology. A small selection of patents published in 2022 is listed in Table 9. While facial recognition technology is heavily patented, the face recognition systems of renowned companies were openly criticized in 2018 for their biased behavior and unacceptable performance toward certain demographic subgroups (Buolamwini and Gebru, 2018). Table 1 summarizes the face verification performance of commercial off-the-shelf (COTS) systems across different ethnicities. The research highlighting this bias was published in 2019. In a follow-up to their 2018 work (Raji and Buolamwini, 2019), the authors observed that the companies had significantly reduced their error margins for the female and darker-skinned subgroups. However, research published in 2021 has showcased a disparity in the face detection performance of commercial systems (Dooley et al., 2022).


FIG. 14 (Left) Graph depicting the number of patents filed, published, and granted under “face recognition” since 1950. (Right) The top owners of “face recognition” patents.

TABLE 9 List of some of the patents related to facial recognition technology published in 2022.

Application No. | Title | Assignees | Inventors
JP 2022001988 A | Face Recognition Management System And Face Recognition Management Server | Almex Inc | Inoue Susumu
TW M623082 U | Face Recognition Card Transaction System | Taiwan Cooperative Bank Co Ltd | Chen Yu-Wei
ZA 202108983 B | A Face Recognition System Based On Convolutional Neural Network | Univ Southwest | Dong Tao
WO 2022/041263 A1 | Face Recognition-based Access Control System | Suzhou Sdc Tech Co Ltd | Xia Zeyu
US 11281922 B2 | Face Recognition System, Method For Establishing Data Of Face Recognition, And Face Recognizing Method Thereof | Pegatron Corp | Tseng Yu-Hung
KR 20220043905 A | Face Recognition System For Training Face Recognition Model Using Frequency Components | Posco ICT Co Ltd | Kim Seong Uk
US 11301669 B2 | Face Recognition System And Method For Enhancing Face Recognition | Pegatron Corp | Tseng Yu-Hung
WO 2022/078572 A1 | Access Control With Face Recognition And Heterogeneous Information | Assa Abloy AB | Chen Jianbo
US 11315360 B2 | Live Facial Recognition System And Method | Wistron Corp | Chang Yao-Tsung
US 11328152 B2 | Recognition System Employing Thermal Sensor | Pixart Imaging Inc | Chen Nien-Tse
US 11335128 B2 | Methods And Systems For Evaluating A Face Recognition System Using A Face Mountable Device | Visa Int Service Ass | Arora Sunpreet Singh
WO 2022/105015 A1 | Face Recognition Method And System, Terminal, And Storage Medium | Shenzhen Inst Of Adv Tech CAS | Qian Jing
US 11354936 B1 | Incremental Clustering For Face Recognition Systems | Amazon Tech Inc | Chandarana Dharmil Satishbhai

Recently, fairness in machine learning has gained attention, and patents addressing it have been filed. Bank of America obtained a patent for fairer predictions by an AI-based system that generates expansion systems based on synthetic data (Eren, n.d.). By using novel cost and loss functions, it refines expansion systems and selects fair models. Ghosh et al. (Joydeep et al., n.d.) employed counterfactuals for explainability in black-box classifiers. The fairness of the system is evaluated through the predictions obtained for the generated counterfactuals. Kamkar et al. (Javad et al., n.d.) trained fairer models using adversarial training. The adversarial classifier is trained


such that the prediction error for sensitive attributes such as race, ethnicity, age, sex, national origin, sexual orientation, demographics, and military status is minimized, whereas the predictive classifier is trained to maximize the prediction error. Fair Isaac Corp (Michael and Shafi, n.d.) developed a system for analyzing bias and explanations for multi-dimensional datasets. They also visualize a trivariate heatmap to identify distinct patterns in the dataset. Jialin et al. (2021) developed a fair machine-learning model which incorporates an evolutionary learning algorithm. The system simultaneously optimizes for multiple fairness objectives instead of focusing on a single metric. Microsoft Technology Licensing patented a system that provides multiple rankings for a given query for mitigating machine learning model bias (Krishnaram et al., n.d.-a, n.d.-b). Based on the aforementioned rankings, another system calculated a skew metric between the rankings for quantifying the bias in the system (Krishnaram et al., n.d.-c). Chaloulos et al. (Georgios et al., n.d.) developed a reinforcement learning algorithm that selects hyperparameters of supervised learning algorithms for improved fairness. Zhu et al. (Fei et al., n.d.) developed a system in which the classifier takes into account the fairness of the predictions by deciding whether to explore further or adhere to best-known solutions. Golding et al. (Paul, n.d.-a; Paul, n.d.-b) developed a system using supervised machine learning algorithms to verify the bias against a particular group by using training data specific to the group. Zhang et al. (Yunfeng et al., n.d.) designed a system that evaluates a machine learning model at different decision thresholds in order to obtain fairer results and analysis. Castiglione et al. (Antonio et al., n.d.) developed a system that tests the fairness of a machine learning model based on a fAux criterion. 
fAux is an individual fairness metric that compares the gradients of the models predicting sensitive attribute information such as gender. Das et al. (Sanjiv et al., n.d.) developed a system that monitors the performance of machine learning models by evaluating different bias metrics, such as differences in positive predictions and KL divergence. Li et al. (Yancheng et al., n.d.) developed a system that utilizes latent features extracted from a model to generate de-biased training data. This is achieved by optimizing for both group and individual bias through two different loss terms. Miroshnikov et al. (Alexey et al., n.d.) developed a system that detects bias through the distance between the distributions of subgroups in the data. The system also mitigates bias through a score function during postprocessing. Karthikeyan et al. (Karthikeyan et al., n.d.) patented a system that utilizes a transfer learning mechanism for ensuring fairness in the absence of protected attributes such as gender. The pretext model is trained on unlabeled data with protected attributes and leads to desirable covariate shifts during downstream training. Srinivasan et al. (Ramya and Ajay, n.d.) developed a user interface (UI) that queries the user about a set of biases and then suggests steps to mitigate the biases identified. Hacmon et al. (Amit et al., n.d.)


developed a system that performs a set of tests using given data and models to assess model fairness using a fairness score aggregation module. Chalamalasetti et al. (Rahul et al., n.d.) patented a system that measures the bias in a model and, if that bias exceeds a certain threshold, re-trains the model to mitigate it. Wei et al. (Dennis et al., n.d.) developed a system that optimizes Lagrange multipliers via low-dimensional convex optimization for fairer classification. The Lagrange multipliers are derived using the probabilistic scores obtained from the supervised machine learning model. Few fairness patents explicitly cover image-based data. Dalli et al. (Angelo and Mauro, n.d.) developed a system that studies white-box machine learning models and the bias in their performance through the coefficients of the trained models. The system further provides various bias mitigation methods for different types of data, including images. Zhang et al. (Yi et al., n.d.) use a deep convolutional neural network and debias the representations obtained from the network. The loss function promotes decorrelation, separating the features into biased and unbiased information. Morales et al. (Aythami et al., n.d.) developed a system for removing bias from deep learning models in face recognition. The inventors use an extension of the popular triplet loss with an additional bias mitigation term. The system is tested on the popular LFW dataset. Very few patents claim a fair and unbiased facial analysis system.
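Two of the monitoring metrics named above, the difference in positive predictions and the KL divergence, can be illustrated on toy data. The sketch below is not taken from any of the cited patents: the function names, the two-group setup, and the choice to compare the groups' predicted-outcome distributions are assumptions made for illustration.

```python
import math


def positive_rate(preds):
    """Fraction of predictions equal to 1."""
    return sum(preds) / len(preds)


def diff_positive_predictions(preds_a, preds_b):
    """Difference in the proportion of positive predictions between two groups."""
    return positive_rate(preds_a) - positive_rate(preds_b)


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical binary predictions for two demographic groups
preds_group_a = [1, 1, 1, 0]  # 75% positive
preds_group_b = [1, 0, 0, 0]  # 25% positive

gap = diff_positive_predictions(preds_group_a, preds_group_b)

# Compare the two groups' predicted-outcome distributions [P(1), P(0)]
pa, pb = positive_rate(preds_group_a), positive_rate(preds_group_b)
kl = kl_divergence([pa, 1 - pa], [pb, 1 - pb])

print(f"difference in positive predictions: {gap:.2f}")
print(f"KL divergence between outcome distributions: {kl:.4f}")
```

Both quantities are zero when the two groups receive positive predictions at the same rate, so a monitoring system could track either one against a tolerance chosen for the application.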

9 Open challenges

Research on bias and fairness has advanced significantly, and several solutions have been proposed to improve the trustworthiness and dependability of facial analysis systems. Despite the progress achieved in designing fair solutions, various open challenges remain. Here, we discuss some of the challenges that require more attention and focused research efforts.

9.1 Fairness in presence of occlusion

Wearing face masks became mandatory in public places worldwide due to the COVID-19 pandemic. Thus, face recognition algorithms are required to recognize faces in the presence of masks. Masks occlude a major portion of the facial region, which poses challenges to face recognition algorithms. To facilitate research in this direction, researchers have proposed multiple masked face databases. However, these databases contain limited demographic information. In a real-world scenario, it is important that face recognition algorithms perform equally well across different demographic groups in the presence of occlusion. Little research has examined the effect of bias in the presence of occlusion (Majumdar et al., 2021b), and more attention is required to develop fair algorithms (see Fig. 15).


FIG. 15 (Left) Face images of males and females with different facial regions occluded. (Right) Verification accuracy obtained for males and females after occlusion (Majumdar et al., 2021b).

9.2 Fairness across intersectional subgroups

Most research addresses bias with respect to a single demographic attribute. Less attention is paid to bias mitigation across intersectional subgroups. For instance, a model that is fair across gender subgroups may be biased against old darker-skinned females. To ensure trustworthiness, it is important that the output predictions of a model are fair across individual as well as intersectional demographic subgroups. Thus, more research is required to identify and mitigate bias across intersectional subgroups.
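The point about intersectional subgroups can be made concrete with a small synthetic example. In the sketch below, all records, attribute values, and the helper names `accuracy_by` and `make_records` are hypothetical; the data are constructed so that accuracy is identical across gender and across age group, yet differs sharply across their intersections.

```python
from collections import defaultdict


def accuracy_by(records, keys):
    """Accuracy per subgroup, where a subgroup is the tuple of values for `keys`."""
    totals = defaultdict(lambda: [0, 0])  # subgroup -> [num_correct, num_total]
    for rec in records:
        group = tuple(rec[k] for k in keys)
        totals[group][0] += rec["correct"]
        totals[group][1] += 1
    return {group: c / n for group, (c, n) in totals.items()}


def make_records(gender, age, n_correct, n_total=10):
    """Synthetic predictions: the first n_correct of n_total are marked correct."""
    return [
        {"gender": gender, "age": age, "correct": int(i < n_correct)}
        for i in range(n_total)
    ]


records = (
    make_records("F", "old", 6)
    + make_records("F", "young", 10)
    + make_records("M", "old", 10)
    + make_records("M", "young", 6)
)

by_gender = accuracy_by(records, ["gender"])
by_age = accuracy_by(records, ["age"])
by_both = accuracy_by(records, ["gender", "age"])

print("by gender:      ", by_gender)
print("by age:         ", by_age)
print("intersectional: ", by_both)
```

Auditing only `by_gender` or only `by_age` would report no gap here, while the intersectional breakdown shows two subgroups at 60% accuracy and two at 100%.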

9.3 Trade-off between fairness and model performance

Bias mitigation may affect the overall model performance. While mitigation algorithms increase fairness, and the trained models achieve equal performance across different demographic subgroups, the model performance on over-represented subgroups may decrease. Zietlow et al. (2022) found that bias mitigation algorithms improve fairness by degrading performance across the groups. Therefore, it is challenging to reduce the effect of bias without hampering the overall model performance.

9.4 Lack of benchmark databases

Multiple databases have been proposed in recent years for studying bias, as shown in Table 4. Only a very limited set of databases provides identity information along with demographic information. Even among databases that provide both kinds of information, there is a clear lack of consistency in the demographic information provided. For example, different databases segregate data into a different number of ethnic subgroups. We need to account for intra-class variations within demographic groups such as Indian and Asian, which have huge diversity with respect to skin tone and facial appearance. None of the existing databases provides the distribution of subgroups (based on skin tone or facial appearance) within an ethnicity. This poses a challenge regarding the interpretability of model predictions. Further, an imbalance in the database with respect to the unlabeled demographic groups may introduce bias in model prediction.


In such scenarios, it becomes difficult to interpret the source of bias. Therefore, we believe that the construction of large-scale benchmark databases with details of demographic information will help to decipher the cause of bias in model prediction and develop algorithms for bias mitigation.

9.5 Variation in evaluation protocols

A wide variety of algorithms have been proposed for bias mitigation in the literature. However, there is still a lack of consistent evaluation protocols for many tasks, especially in facial attribute prediction. To compare different techniques in a fair manner, it is imperative to develop a consistent protocol that the community follows. Recently, Shrestha et al. (2022) investigated existing bias mitigation techniques and provided various recommendations for better evaluation of bias mitigation algorithms.

9.6 Unavailability of complete information

Existing algorithms are based on the assumption that demographic information is available during training. However, due to privacy concerns and regulations in the real world, the collection of demographic information or its use during training is often precluded. Further, the protected attribute information can be noisy. This severely limits the applicability of existing bias mitigation algorithms and calls for the development of algorithms that do not require demographic information for bias mitigation. This line of research is very new, and more research is required (Ardeshir et al., 2022; Jeon et al., 2022; Jung et al., 2022; Seo et al., 2022).

9.7 Identification of bias in models

As demonstrated in this chapter, a wide variety of algorithms have been proposed for mitigating biased behavior in models. However, the unfair behavior is captured post hoc through the model outcome. The success of a mitigation algorithm is evaluated through performance outcomes across subgroups. If the performance gap between subgroups reduces, we consider the algorithm to be successful. However, there is limited research into determining whether bias exists in a model without considering the performance across subgroups. Can we detect the presence of unfair learning in the model? Li and Xu (2021) have proposed a variation loss that optimizes the hyperplane in the latent space to obtain biased attribute information. Overall, limited research has been conducted to detect bias-inducing factors in deep learning models.

9.8 Quantification of fairness in datasets

Recently, Wang et al. provided a tool for measuring the bias in visual datasets (Wang et al., 2022b). The tool provides an interface for studying the dataset across different axes depending on the annotations present in it. While a tool


is helpful in analyzing the different properties of data, it does not provide a single quantitative value to represent the degree of bias in a dataset. Such quantification of bias for datasets can prove useful in understanding and comparing the overall model performance.

10 Discussion

Bias in deep learning is concerned with unequal performance across different attributes. In this chapter, the emphasis lies on facial analysis, where we consider demographic subgroups such as those based on gender, ethnicity, or age. The chapter covers different algorithms, databases and metrics proposed in the literature for evaluating and mitigating biases, along with open challenges. Here, we briefly discuss other notions surrounding bias and fairness, along with possible future directions.

● Factors beyond Demography: In addition to demographic factors such as gender, ethnicity, or age, there are other non-demographic attributes in a dataset that can be a cause of concern. For example, it is a concern if face-based models perform worse for people wearing sunglasses than for those who are not. These factors may be known or unknown while training a deep-learning model (see Fig. 16). Such issues often result from a distribution shift. Since machine learning and deep learning algorithms operate under the assumption of independent and identically distributed (IID) data, i.e., we assume that the training and test data will belong to the same underlying distribution, any changes in the distribution lead to a performance drop. Bias has an interesting intersection with out-of-distribution

FIG. 16 A wide variety of factors can influence the performance of deep learning models. Some of these factors which are subject-related, model-related or environmental are shown here. Subject-related factors include demographic information such as gender, age, ethnicity, skin-color as well as dynamic non-demographic factors such as hairstyle, facial hair and pose. Model-related factors include choice of architecture, learning algorithm and loss functions. Finally, the predictions may be influenced by sensing-related factors such as illumination, camera properties and image quality.


(OOD) generalization. Based on recent surveys (Ye et al., 2022), a part of OOD generalization concerns correlation shift. While some datasets constitute a diversity shift (same object in different domains such as art or photograph), a dataset like Color-MNIST has a correlation shift. The problems of bias and out-of-distribution generalization dealing with face- and object-based datasets are related. This chapter focuses on dealing with bias for applications in facial analysis, specifically on approaches based on deep learning.

● Role of Dataset Distribution and Annotations: Since deep learning approaches are data-driven, any imperfection or imbalance in the dataset has been observed to translate into the results (Fig. 4A: Dataset bias as a source of bias). While models trained on balanced datasets provide fairer results, a disparity in model performance persists (Karkkainen and Joo, 2021; Wang et al., 2022). There are studies that showcase a correlation between face quality and demographic bias in face recognition (Terhörst et al., 2020). Similarly, another study has shown the strong influence of non-demographic factors such as accessories, hairstyles, colors, face shapes, or facial anomalies on recognition performance (Terhorst et al., 2021), revealing that the performance of models is influenced by factors beyond just demographics. However, it is extremely difficult to obtain a “perfect” dataset that is balanced across all possible variations of different factors. Ideally, equal performance across every attribute should be equally important, but the implications are far greater when looked at from a social lens (such as gender and ethnicity). Since we might not always know what biases the data holds, it is extremely important that we develop algorithms that are cognizant of such biases. The algorithms proposed for bias mitigation in the past years (Section 6) have shown that algorithms can make models fairer. Still, a lot of open challenges remain (Section 9). For example, we only evaluate bias today by quantifying model performance across different subgroups post hoc. It is possible that we might not always have the annotations for such sensitive data. Therefore, it becomes crucial to develop algorithms that perform well even when we do not have access to these sensitive annotations (gender, ethnicity, etc.).

● Explainability through Feature Engineering: Explainability in deep learning approaches is limited, which hinders our ability to understand biased decisions. Feature engineering approaches offer greater explainability. Some of these approaches can handle skin color differences in the preprocessing stage and capture key features related to eye regions, nose, chin, ears, etc., in the context of faces. Hybrid approaches that combine engineered hand-crafted features with deep learning models may be explored for more explainable learning models.

● Real-world Requirements for Fairness: Deep learning algorithms have a major impact on our day-to-day lives, and this impact will surely grow in the coming years. A lot of applications today work specifically on


face-based data, such as Instagram and Snapchat filters. There have been instances where these applications have advertently or inadvertently led to issues relating to bias (Li, 2020; Noone, 2016; Ryan-Mosley, 2021). The Twitter face-cropping algorithm has been shown to favor light-skinned faces (BBC News, 2021). Similarly, Zoom, the popular video-calling platform, has been shown to be biased against darker skin tones (Dickey, 2020). Depending on the task at hand, different solutions may be offered for applications like Zoom, Snapchat and Instagram such as locating specific features in the face. For automatic tagging of uploaded images, disentanglement mechanisms may be employed where a human-in-the-loop verifies a label. Some of these solutions may help in alleviating bias incorporated in the systems. While good quality datasets do play a major role in the development of fairer algorithms, it is our responsibility to ensure that these algorithms are performing as fairly as we would like them to.

Acknowledgment

We thank the reviewers for providing valuable comments and suggestions. Their continuous feedback helped us improve the quality of the manuscript. The author S. Mittal is partially supported by the UGC-Net JRF Fellowship and IBM Fellowship, and M. Vatsa is partially supported through the Swarnajayanti Fellowship.

References

Acien, A., et al., 2018. Measuring the gender and ethnicity bias in deep models for face recognition. In: CIARP, pp. 584–593.
Adeli, E., et al., 2021. Representation learning with statistical independence to mitigate bias. In: WACV, pp. 2513–2523.
Agarwal, S., Muku, S., Anand, S., Arora, C., 2022. Does data repair lead to fair models? curating contextually fair data to reduce model bias. In: IEEE/CVF WACV, pp. 3298–3307.
Alasadi, J., Al Hilli, A., Singh, V.K., 2019. Toward fairness in face matching algorithms. In: FAT/MM Workshops, pp. 19–25.
Albiero, V., Bowyer, K.W., 2020. Is face recognition sexist? no, gendered hairstyles and biology are. In: BMVC.
Albiero, V., et al., 2020a. Analysis of gender inequality in face recognition accuracy. In: WACVW, pp. 81–89.
Albiero, V., Zhang, K., Bowyer, K.W., 2020b. How does gender balance in training data affect face recognition accuracy? In: 2020 IEEE International Joint Conference on Biometrics (IJCB). IEEE, pp. 1–10.
Albiero, V., Zhang, K., King, M.C., Bowyer, K.W., 2022. Gendered differences in face recognition accuracy explained by hairstyles, makeup, and facial morphology. IEEE Trans. Inf. Forensics Secur. 17, 127–137. https://doi.org/10.1109/TIFS.2021.3135750.
Alexey, M., Kostandinos, K., Ravi, K.A., Raghu, K., Steven D. System and Method for Mitigating Bias in Classification Scores Generated by Machine Learning Models. https://lens.org/005-765-417-127-16X.

Alvi, M., Zisserman, A., Nellaker, C., 2018. Turning a blind eye: explicit removal of biases and variation from deep neural network embeddings. In: ECCVW.
Amini, A., et al., 2019. Uncovering and mitigating algorithmic bias through learned latent structure. In: AIES, pp. 289–295.
Amit, H., Yuval, E., Asaf, S., Edita, G., Oleg, B., Sebastian, F., Ronald, F. A System and a Method for Assessment of Robustness and Fairness of Artificial Intelligence Based Models. https://lens.org/101-492-277-604-292.
Angelo, D., Mauro, P., Method for Detecting and Mitigating Bias and Weakness in Artificial Intelligence Training Data and Models. https://lens.org/123-614-478-265-750.
Antonio, C.G.M., Damion, P.S.J., Cote, S.C., System and Method for Machine Learning Fairness Test. https://lens.org/038-677-262-509-266.
Ardeshir, S., Segalin, C., Kallus, N., 2022. Estimating structural disparities for face models. In: IEEE/CVF CVPR, pp. 10358–10367.
Aythami, M.M., Javier, O.G., Julián, F.A., Ruben V.R. Method for Removing Bias in Biometric Recognition Systems. https://lens.org/093-132-108-045-04X.
Bansal, A., Castillo, C., Ranjan, R., Chellappa, R., 2017. The do’s and don’ts for cnn-based face verification. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 2545–2554.
Barlas, P., et al., 2021. To “see” is to stereotype: image tagging algorithms, gender recognition, and the accuracy-fairness trade-off. ACM HCI 4 (CSCW3), 1–31.
BBC News, 2021. Twitter Algorithm Prefers Slimmer, Younger, Light-Skinned Faces. BBC News. Online; accessed 20 December 2022. https://www.bbc.com/news/technology-58159723.
Beutel, A., Chen, J., Doshi, T., Qian, H., Woodruff, A., Luu, C., Kreitmann, P., Bischof, J., Chi, E.H., 2019. Putting fairness principles into practice: challenges, metrics, and improvements. In: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 453–459.
Bruveris, M., Gietema, J., Mortazavian, P., Mahadevan, M., 2020. Reducing geographic performance differentials for face recognition. In: WACVW, pp. 98–106.
Buolamwini, J., Gebru, T., 2018. Gender shades: intersectional accuracy disparities in commercial gender classification. In: FAT, PMLR, pp. 77–91.
Cao, Q., Shen, L., Xie, W., Parkhi, O.M., Zisserman, A., 2018. Vggface2: a dataset for recognising faces across pose and age. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE, pp. 67–74.
Cao, K., Wei, C., Gaidon, A., Arechiga, N., Ma, T., 2019. Learning imbalanced datasets with label-distribution-aware margin loss. Adv. Neural Inf. Process. Syst. 32.
Calders, T., Verwer, S., 2010. Three naive bayes approaches for discrimination-free classification. Data Min. Knowl. Discov. 21 (2), 277–292.
Cao, Y., Berend, D., Tolmach, P., Amit, G., Levy, M., Liu, Y., Shabtai, A., Elovici, Y., 2022. Fair and accurate age prediction using distribution aware data curation and augmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3551–3561.
Castelvecchi, D., 2016. Can we open the black box of ai? Nat. News 538 (7623), 20.
Celis, D., Rao, M., 2019. Learning facial recognition biases through vae latent representations. In: FAT/MM, pp. 26–32.
Chen, B.C., Chen, C.S., Hsu, W.H., 2015. Face recognition and retrieval using cross-age reference coding with cross-age celebrity dataset. IEEE Trans. Multimed. 17 (6), 804–815.
Cheng, J., et al., 2019. Exploiting effective facial patches for robust gender recognition. Tsinghua Sci. Technol. 24 (3), 333–345.


Choi, K., et al., 2020. Fair generative modeling via weak supervision. In: ICML, pp. 1887–1898.
Chouldechova, A., 2017. Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big Data 5 (2), 153–163.
Chrysos, G.G., Moschoglou, S., Bouritsas, G., Panagakis, Y., Deng, J., Zafeiriou, S., 2020. P-nets: deep polynomial neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7325–7335.
Chuang, C.-Y., Mroueh, Y., 2021. Fair mixup: fairness via interpolation. In: ICLR.
Conger, K., Fausset, R., Kovaleski, S.F., 2019. San Francisco Bans Facial Recognition Technology. https://tinyurl.com/y4x6wbos, online; accessed 19 February 2021.
Conti, J.-R., Noiry, N., Clemencon, S., Despiegel, V., Gentric, S., 2022. Mitigating gender bias in face recognition using the von mises-fisher mixture model. In: International Conference on Machine Learning, PMLR, pp. 4344–4369.
Corbett-Davies, S., Goel, S., Morgenstern, J., Cummings, R., 2018. Defining and designing fair algorithms. In: Proceedings of the 2018 ACM Conference on Economics and Computation, EC ’18, Association for Computing Machinery, New York, NY, USA, p. 705. https://doi.org/10.1145/3219166.3277556.
Das, A., Dantcheva, A., Bremond, F., 2018. Mitigating bias in gender, age and ethnicity classification: a multi-task convolution neural network approach. In: ECCVW.
Dash, S., Balasubramanian, V.N., Sharma, A., 2022. Evaluating and mitigating bias in image classifiers: a causal perspective using counterfactuals. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 915–924.
Dass, R.K., Petersen, N., Omori, M., Lave, T.R., Visser, U., 2022. Detecting racial inequalities in criminal justice: towards an equitable deep learning approach for generating and interpreting racial categories using mugshots. AI Soc., 1–22.
de Freitas Pereira, T., Marcel, S., 2021. Fairness in biometrics: a figure of merit to assess biometric verification systems. IEEE Trans. Biom. Behav. Identity Sci. 4 (1), 19–29.
Deng, J., Guo, J., Xue, N., Zafeiriou, S., 2019. Arcface: additive angular margin loss for deep face recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4690–4699.
Dennis, W., Karthikeyan, N.R., Pin, C.F.D. Optimized Score Transformation for Fair Classification. https://lens.org/035-855-869-011-536.
Denton, E., Hutchinson, B., Mitchell, M., Gebru, T., 2019. Detecting bias with generative counterfactual face attribute augmentation. arXiv preprint arXiv:1906.06439.
Deuschel, J., Finzel, B., Rieger, I., 2020. Uncovering the bias in facial expressions. CoRR, abs/2011.11311. https://arxiv.org/abs/2011.11311.
Dhar, P., et al., 2020. An adversarial learning algorithm for mitigating gender bias in face recognition. CoRR, abs/2006.07845. https://arxiv.org/abs/2006.07845.
Dickey, M.R., 2020. Twitter and Zoom’s Algorithmic Bias Issues. https://techcrunch.com/2020/09/21/twitter-and-zoom-algorithmic-bias-issues/, online; accessed 20 December 2022.
Du, M., Yang, F., Zou, N., Hu, X., 2020. Fairness in deep learning: a computational perspective. IEEE Intell. Syst. 36 (4), 25–34.
Dooley, S., Wei, G.Z., Goldstein, T., Dickerson, J.P., 2022. Robustness disparities in commercial face detection. In: Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
Du, M., Mukherjee, S., Wang, G., Tang, R., Awadallah, A., Hu, X., 2021. Fairness via representation neutralization. Advances in Neural Information Processing Systems 34, 12091–12103.
Dullerud, N., Roth, K., Hamidieh, K., Papernot, N., Ghassemi, M., 2022. Is fairness only metric deep? evaluating and addressing subgroup gaps in deep metric learning. In: International Conference on Learning Representations.

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R., 2012. Fairness through awareness. In: Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226.
Dwork, C., Immorlica, N., Kalai, A.T., Leiserson, M., 2018. Decoupled classifiers for group-fair and efficient machine learning. In: FAT, pp. 119–133.
Eidinger, E., Enbar, R., Hassner, T., 2014. Age and gender estimation of unfiltered faces. TIFS 9 (12), 2170–2179.
Eren, K., Method and System for Fairness in Artificial Intelligence Based Decision Making Engines. https://lens.org/018-652-546-632-286.
Fei, Z., Xiaofei, L., Yuchen, F., Shan, Z. Fairness-Balanced Result Prediction Classifier for Context Perceptual Learning. https://lens.org/156-983-8-16-308-567.
Feldman, M., Friedler, S.A., Moeller, J., Scheidegger, C., Venkatasubramanian, S., 2015. Certifying and removing disparate impact. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 259–268.
Franco, D., Navarin, N., Donini, M., Anguita, D., Oneto, L., 2022. Deep fair models for complex data: graphs labeling and explainable face recognition. Neurocomputing 470, 318–334.
Garcia, R.V., Wandzik, L., Grabner, L., Krueger, J., 2019. The harms of demographic bias in deep face recognition research. In: ICB, pp. 1–6.
Garg, P., Villasenor, J., Foggo, V., 2020. Fairness metrics: a comparative analysis. In: 2020 IEEE International Conference on Big Data (Big Data). IEEE, pp. 3662–3666.
Garvie, C., Bedoya, A., Frankle, J., 2016. Unregulated Police Face Recognition in America. https://www.perpetuallineup.org/. Technical report, Georgetown Law Center on Privacy & Technology.
Georgios, C., Frank, F.F., Florian, G., Patrick, L., Stefan, R., Eric, S. Fairness Improvement Through Reinforcement Learning. https://lens.org/105-936-349-416-988.
Georgopoulos, M., et al., 2021. Mitigating demographic bias in facial datasets with style-based multi-attribute transfer. IJCV 129 (7), 2288–2307.
Gong, S., Liu, X., Jain, A., 2020. Jointly de-biasing face recognition and demographic attribute estimation. ECCV, 330–347.
Gong, S., Liu, X., Jain, A.K., 2021. Mitigating face recognition bias via group adaptive classifier. In: CVPR, pp. 3414–3424.
Grother, P., 2022. Face recognition vendor test (FRVT). Part 8: summarizing demographic differentials.
Grother, P., Ngan, M., Hanaoka, K., 2019. Face Recognition Vendor Test (FRVT) Part 3: Demographic Effects. https://nvlpubs.nist.gov/nistpubs/ir/2019/NIST.IR.8280.pdf.
Guo, Y., et al., 2016. Ms-celeb-1m: a dataset and benchmark for large-scale face recognition. ECCV, pp. 87–102.
Gwilliam, M., Hegde, S., Tinubu, L., Hanson, A., 2021. Rethinking common assumptions to mitigate racial bias in face recognition datasets. In: ICCVW, pp. 4123–4132.
Harvey, J., 2021. Adam. LaPlace, Exposing.ai. https://exposing.ai.
Huang, G.B., Mattar, M., Berg, T., Learned-Miller, E., 2008. Labeled faces in the wild: a database for studying face recognition in unconstrained environments. In: Workshop on Faces in ‘Real-Life’ Images: Detection, Alignment, and Recognition.
Hardt, M., Price, E., Srebro, N., 2016. Equality of opportunity in supervised learning. In: Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems, December 5–10, Barcelona, Spain, pp. 3315–3323.
Howard, J.J., Laird, E.J., Sirotin, Y.B., Rubin, R.E., Tipton, J.L., Vemury, A.R., 2022. Evaluating proposed fairness models for face recognition algorithms. CoRR abs/2203.05051. https://doi.org/10.48550/arXiv.2203.05051.


Huang, C., Li, Y., Loy, C.C., Tang, X., 2019. Deep imbalanced learning for face recognition and attribute prediction. T-PAMI 42 (11), 2781–2794. Hupont, I., Ferna´ndez, C., 2019. Demogpairs: quantifying the impact of demographic imbalance in deep face recognition. In: 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019). IEEE, pp. 1–7. Ignatov, A., Timofte, R., Kulik, A., Yang, S., Wang, K., Baum, F., Wu, M., Xu, L., Van Gool, L., 2019. Ai benchmark: all about deep learning on smartphones in 2019. In: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW). IEEE, pp. 3617–3635. Jain, N., Olmo, A., Sengupta, S., Manikonda, L., Kambhampati, S., 2022. Imperfect imagination: implications of GANs exacerbating biases on facial data augmentation and snapchat face lenses. Artif. Intell. 304, 103652. Jamal, M.A., Brown, M., Yang, M.-H., Wang, L., Gong, B., 2020. Rethinking class-balanced methods for long-tailed visual recognition from a domain adaptation perspective. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7610–7619. Jang, T., Shi, P., Wang, X., 2022. Group-aware threshold adaptation for fair classification. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36, pp. 6988–6995. Javad, K.S., Egan, V. V. M., Feng, L., Frederick, E.M., Efrain, V.J., Louis, B.J., Merrill Douglas, C., Merrill John Wickens Lamb, Systems and Methods for Model Fairness. https://lens.org/ 090-317-283-984-10X. Jeon, M., Kim, D., Lee, W., Kang, M., Lee, J., 2022. A conservative approach for unbiased learning on unknown biases. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16752–16760. Jialin, L., Qingquan, Z., Xin, Y., Zeqi, Z., Bifei, M., 2021 Fair Machine Learning Model Training Method Based on Multi-Objective Evolutionary Algorithm. https://lens.org/ 081-168-921-524-608. Joo, J., Karkkainen, K., 2020. 
In: Gender Slopes: Counterfactual Fairness for Computer Vision Models by Attribute Manipulation. FATE/MM, pp. 1–5. Joshi, A.R., Cuadros, X.S., Sivakumar, N., Zappella, L., Apostoloff, N., 2022. Fair SA: sensitivity analysis for fairness in face recognition. In: Algorithmic Fairness Through the Lens of Causality and Robustness Workshop, PMLR, pp. 40–58. Joydeep, G., Shubham, S., Jessica, H., Matthew, S., Framework for Explainability With Recourse of Black-Box Trained Classifiers and Assessment of Fairness and Robustness of Black-Box Trained Classifiers. https://lens.org/105-202-376-330-255. Jung, S., Lee, D., Park, T., Moon, T., 2021. Fair feature distillation for visual recognition. In: CVPR, pp. 12115–12124. Jung, S., Chun, S., Moon, T., 2022. Learning fair classifiers with partially annotated group labels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10348–10357. Karkkainen, K., Joo, J., 2021. In: Fairface: Face Attribute Dataset for Balanced Race, Gender, and Age for Bias Measurement and Mitigation. WACV, pp. 1548–1558. Natesan Ramamurthy Karthikeyan, Coston Amanda, Wei Dennis, Varshney Kush Raj, Speakman Skyler, Mustahsan Zairah, and Chakraborty Supriyo, Enhancing Fairness in Transfer Learning for Machine Learning Models With Missing Protected Attributes in Source or Target Domains. URL https://lens.org/036-532-122-659-131. Kemelmacher-Shlizerman, I., Seitz, S.M., Miller, D., Brossard, E., 2016. The megaface benchmark: 1 million faces for recognition at scale. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4873–4882.

216 Handbook of Statistics Kim, M.P., Ghorbani, A., Zou, J., 2019a. Multiaccuracy: black-box post- processing for fairness in classification. In: AIES, pp. 247–254. Kim, B., et al., 2019b. Learning not to learn: training deep neural networks with biased data. In: CVPR, pp. 9012–9020. Klare, B.F., et al., 2012. Face recognition performance: role of demographic information. TIFS 7 (6), 1789–1801. Klare, B.F., Klein, B., Taborsky, E., Blanton, A., Cheney, J., Allen, K., Grother, P., Mah, A., Jain, A.K., 2015. Pushing the frontiers of unconstrained face detection and recognition: Iarpa janus benchmark a. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1931–1939. Kleinberg, J.M., Mullainathan, S., Raghavan, M., 2017. Inherent trade-offs in the fair determination of risk scores. In: Papadimitriou, C.H. (Ed.), 8th Innovations in Theoretical Computer Science Conference (ITCS), January 9–11, Berkeley, CA, USA. In: LIPIcs, vol. 67. Schloss Dagstuhl – Leibniz-Zentrum f€ur Informatik, pp. 43:1–43:23. Kolling, C., Araujo, V., Veloso, A., Musse, S.R., 2022. Mitigating bias in facial analysis systems by incorporating label diversity. IEEE Trans. Image Process. Kortylewski, A., et al., 2019. Analyzing and reducing the damage of dataset bias to face recognition with synthetic data. In: CVPRW. Krishnakumar, A., Prabhu, V., Sudhakar, S., Hoffman, J., 2021. Udis: unsupervised discovery of bias in deep visual recognition models. In: British Machine Vision Conference (BMVC). vol. 1, p. 3. Krishnan, A., Almadan, A., Rattani, A., 2020. Understanding fairness of gender classification algorithms across gender-race groups. arXiv. preprint arXiv:2009.11491. Krishnapriya, K., et al., 2020. Issues related to face recognition accuracy varying based on race and skin tone. TTS 1 (1), 8–20. Krishnaram, K., Geyik Sahin, C., Ambler Stuart, M., Multi-level Ranking for Mitigating Machine Learning Model Bias. https://lens.org/088-467-431-989-474. 
Krishnaram, K., Geyik Sahin, C., Ambler Stuart M., Achieving fairness Across Multiple Attributes in Rankings. https://lens.org/082-077-388-231-200. Krishnaram K., Geyik Sahin C., Ambler Stuart M., Quantifying Bias in Machine Learning Models. https://lens.org/104-451-702-614-60X. Kumar, N., Belhumeur, P., Nayar, S., 2008. Facetracer: a search engine for large collections of images with faces. In: European Conference on Computer Vision. Springer, pp. 340–353. Kumar, N., Berg, A.C., Belhumeur, P.N., Nayar, S.K., 2009. Attribute and simile classifiers for face verification. In: 2009 IEEE 12th International Conference on Computer Vision. IEEE, pp. 365–372. Li, S., 2020. The Problems With Instagram’s Most Popular Beauty Filters, From Augmentation to Eurocentrism. https://www.nylon.com/beauty/instagrams-beauty-filters-perpetuate-theindustrys-ongoing-racism, online; accessed 20 December 2022. Li, Z., Xu, C., 2021. Discover the unknown biased attribute of an image classifier. In: ICCV. Liu, Z., Luo, P., Wang, X., Tang, X., 2015. Deep learning face attributes in the wild. In: ICCV, pp. 3730–3738. Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L., 2017. Sphereface: deep hypersphere embedding for face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 212–220. Liu, J., Wu, Y., Wu, Y., Li, C., Hu, X., Liang, D., Wang, M., 2021. Dam: discrepancy alignment metric for face recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3814–3823.


Liu, C., Yu, X., Tsai, Y.-H., Faraki, M., Moslemi, R., Chandraker, M., Fu, Y., 2022. Learning to learn across diverse data biases in deep face recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4072–4082. MacCarthy, M., 2017. Standards of fairness for disparate impact assessment of big data algorithms. Cumb. L. Rev. 48, 67. Majumdar, P., Chhabra, S., Singh, R., Vatsa, M., 2021a. Subgroup invariant perturbation for unbiased pre-trained model prediction. Front. Big Data 3, 590296. Majumdar, P., Mittal, S., Singh, R., Vatsa, M., 2021b. Unravelling the effect of image distortions for biased prediction of pre-trained face recognition models. In: ICCVW, pp. 3786–3795. Majumdar, P., Singh, R., Vatsa, M., 2021c. Attention aware debiasing for unbiased model prediction. In: ICCVW, pp. 4133–4141. Masi, I., Wu, Y., Hassner, T., Natarajan, P., 2018. Deep face recognition: a survey. In: 2018 31st SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI). IEEE, pp. 471–478. Mayson, S.G., 2018. Bias in, bias out. YAle. lJ 128, 2218. Maze, B., et al., 2018. IARPA janus benchmark-c: face dataset and protocol. In: ICB, pp. 158–165. McDuff, D., Ma, S., Song, Y., Kapoor, A., 2019. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alche-Buc, F., Fox, E.A., Garnett, R. (Eds.), Characterizing bias in classifiers using generative models. Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8–14, Vancouver, BC, Canada, pp. 5404–5415. Michael, Z.S., Shafi, R., Method and Apparatus for Analyzing Coverage, Bias, and Model Explanations in Large Dimensional Modeling Data. https://lens.org/060-576-187-685-319. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A., 2021. A survey on bias and fairness in machine learning. ACM Comput. Surv. 54 (6), 1–35. Morales, A., Fierrez, J., Vera-Rodriguez, R., Tolosana, R., 2020. 
Sensitivenets: learning agnostic representations with application to face images. In: TPAMI. Moschoglou, S., et al., 2017. Agedb: the first manually collected, in-the-wild age database. In: CVPRW, pp. 51–59. Mullick, S.S., Datta, S., Das, S., 2019. Generative adversarial minority oversampling. In: ICCV, pp. 1695–1704. Muthukumar, V., 2019. Color-theoretic experiments to understand unequal gender classification accuracy from face images. In: CVPRW. Nagpal, S., Singh, M., Singh, R., Vatsa, M., 2019. Deep learning for face recognition: pride or prejudiced? arXiv. preprint arXiv:1904.01219. Nagpal, S., Singh, M., Singh, R., Vatsa, M., 2020a. Diversity Blocks for De-biasing Classification Models. IJCB, pp. 1–9. Nagpal, S., Singh, M., Singh, R., Vatsa, M., 2020b. Attribute aware filter-drop for bias invariant classification. In: CVPRW, pp. 32–33. Nanda, V., et al., 2020. Fairness through robustness: investigating robustness disparity in deep learning. arXiv. preprint arXiv:2006.12621. Nech, A., Kemelmacher-Shlizerman, I., 2017. Level playing field for million scale face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7044–7053. Noone, G., 2016. Are Snapchat’s Filters Making Everyone Look Whiter? https://www.thecut.com/ 2016/05/are-snapchats-filters-making-users-look-whiter.html, online; accessed 20 December 2022.

218 Handbook of Statistics Ntoutsi, E., et al., 2020. Bias in data-driven artificial intelligence systems—an introductory survey, WIREs. Data Min. Knowl. Discov. 10 (3), e1356. Osoba, O.A., Welser IV, W., 2017. An Intelligence in Our Image: The Risks of Bias and Errors in Artificial Intelligence. Rand Corporation. Paolini-Subramanya, M., 2018. Facial Recognition, and Bias. https://tinyurl.com/y7rat8vb, online; accessed 19 February 2021. Park, S., Hwang, S., Kim, D., Byun, H., 2021. Learning disentangled representation for fair facial attribute classification via fairness-aware information alignment. In: AAAI. vol. 35, pp. 2403–2411. Parkhi, O.M., Vedaldi, A., Zisserman, A., 2015. Deep face recognition. In: Xie, X., Jones, M.W., Tam, G.K.L. (Eds.), Proceedings of the British Machine Vision Conference 2015, Swansea, UK, September 7–10. BMVA Press, pp. 41.1–41.12, https://doi.org/10.5244/C.29.41. Paul, G., Method for Tracking Lack of Bias of Deep Learning AI Systems. https://lens.org/ 011-055-851-973-327. Paul, G., Method for Verifying Lack of Bias of Deep Learning AI Systems. https://lens.org/ 143-453-690-303-153. Qiu, Y., Albiero, V., King, M.C., Bowyer, K.W., 2021. Does face recognition error echo gender classification error? In: 2021 IEEE International Joint Conference on Biometrics (IJCB). IEEE, pp. 1–8. Quadrianto, N., Sharmanska, V., Thomas, O., 2019. Discovering fair representations in the data domain. In: CVPR, pp. 8227–8236. Ragonesi, R., Volpi, R., Cavazza, J., Murino, V., 2021. Learning unbiased representations via mutual information backpropagation. In: CVPR, pp. 2729–2738. Policy and Division, Introduction to Library of Congress Demographic Group Terms, https:// www.loc.gov/aba/publications/FreeLCDGT/dgtintro.pdf, online; accessed 7 July 2022. Rahul, C.S., Milojicic, D.S., Sergey, S., n.d., Machine Learning Model Bias Detection and Mitigation. URL https://lens.org/156-800-361-450-553. Raji, I.D., Buolamwini, J., 2019. 
Actionable auditing: investigating the impact of publicly naming biased performance results of commercial ai products. In: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 429–435. Ramaswamy, V.V., Kim, S.S., Russakovsky, O., 2021. Fair attribute classification through latent space de-biasing. In: CVPR, pp. 9301–9310. Ramya, M.S., Ajay, C., Bias Mitigation in Machine Learning Pipeline. https://lens.org/ 021-284-315-959-371. Rawls, A.W., Ricanek, K., 2009. Morph: development and optimization of a longitudinal age progression database. In: European Workshop, BioID, pp. 17–24. Ren, M., Zeng, W., Yang, B., Urtasun, R., 2018. Learning to reweight examples for robust deep learning. In: International Conference on Machine Learning, PMLR, pp. 4334–4343. Robinson, J.P., et al., 2020. Face recognition: too bias, or not too bias? In: CVPRW. Roh, Y., Lee, K., Whang, S.E., Suh, C., 2021. Fairbatch: batch selection for model fairness. In: 9th International Conference on Learning Representations, Virtual Event, Austria, May 3–7. OpenReview.net. Ryan-Mosley, T., 2021. How Digital Beauty Filters Perpetuate Colorism. MIT Technology Review. online; accessed 20 December 2022 https://www.technologyreview.com/2021/08/ 15/1031804/digital-beauty-filters-photoshop-photo-editing-colorism-racism/. Ryu, H.J., Adam, H., Mitchell, M., 2018. Inclusivefacenet: improving face attribute detection with race and gender diversity. In: FAT ML Workshops.


Sanjiv, D., Michele, D., Lawrence, G.J., Kevin, H., Stephen, H.T., Krishnaram, K., Altin, Y.P., Bilal, Z.M., Larroy Pedro, L., Monitoring Bias Metrics and Feature Attribution for Trained Machine Learning Models. https://lens.org/003-111-246-876-16X. Schroff, F., Kalenichenko, D., Philbin, J., 2015. Facenet: a unified embedding for face recognition and clustering. In: CVPR, pp. 815–823. Seo, S., Lee, J.-Y., Han, B., 2022. Unsupervised learning of debiased representations with pseudo-attributes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16742–16751. Serna, I., Peña, A., Morales, A., Fierrez, J., 2021. Insidebias: measuring bias in deep networks and application to face gender biometrics. In: ICPR, pp. 3720–3727. Serna, I., Morales, A., Fierrez, J., Obradovich, N., 2022a. Sensitive loss: improving accuracy and fairness of face representations with discrimination-aware deep learning. Artif. Intell. 305, 103682. Serna, I., Morales, A., Fierrez, J., Ortega-Garcia, J., 2022b. IFBiD: inference-free bias detection. In: Pedroza, G., Hernández-Orallo, J., Chen, X.C., Huang, X., Espinoza, H.J., Castillo-Effen, M., McDermid, J., Mallah, R., Ó hÉigeartaigh, S. (Eds.), Proceedings of the Workshop on Artificial Intelligence Safety 2022 (SafeAI 2022) co-located with the Thirty-Sixth AAAI Conference on Artificial Intelligence, Virtual, February. CEUR Workshop Proceedings. vol. 3087. CEUR-WS.org. Serna, I., Morales, A., Fierrez, J., Cebrian, M., Obradovich, N., Rahwan, I., 2020. Algorithmic discrimination: formulation and exploration in deep learning-based face biometrics. In: Espinoza, H., Hernández-Orallo, J., Chen, X.C., Ó hÉigeartaigh, S.S., Huang, X., Castillo-Effen, M., Mallah, R., McDermid, J.A. (Eds.), Proceedings of the Workshop on Artificial Intelligence Safety, co-located with 34th AAAI Conference on Artificial Intelligence, SafeAI@AAAI 2020, New York City, NY, USA, February 7. CEUR Workshop Proceedings. vol. 2560, pp. 146–152. CEUR-WS.org. Setty, S., et al., 2013. Indian movie face database: a benchmark for face recognition under wide variations. In: NCVPRIPG, pp. 1–5. Shi, Y., Yu, X., Sohn, K., Chandraker, M., Jain, A.K., 2020. Towards universal representation learning for deep face recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6817–6826. Shrestha, R., Kafle, K., Kanan, C., 2022. An investigation of critical issues in bias mitigation techniques. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1943–1954. Singh, R., Majumdar, P., Mittal, S., Vatsa, M., 2022. Anatomizing bias in facial analysis. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36, pp. 12351–12358. Srinivas, N., et al., 2019a. Face recognition algorithm bias: performance differences on images of children and adults. In: CVPRW. Srinivas, N., et al., 2019b. Exploring automatic face recognition on match performance and gender bias for children. In: WACVW, pp. 107–115. Sun, Y., Chen, Y., Wang, X., Tang, X., 2014. Deep learning face representation by joint identification-verification. Advances in Neural Information Processing Systems. vol. 27. Taigman, Y., Yang, M., Ranzato, M., Wolf, L., 2014. Deepface: closing the gap to human-level performance in face verification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708. Tan, S., Shen, Y., Zhou, B., 2020. Improving the fairness of deep generative models without retraining. CoRR. abs/2012.04842.

220 Handbook of Statistics Tartaglione, E., Barbano, C.A., Grangetto, M., 2021. EnD: entangling and disentangling deep representations for bias correction. In: CVPR, pp. 13508–13517. Team, S.N.U., 2015. Google Photo App Labels Black Couple Gorillas. https://tinyurl.com/ 3npuwwbn, online; accessed 19 February 2021. Terh¨orst, P., et al., 2020. Face quality estimation and its correlation to demographic and non-demographic bias in face recognition. In: IJCB. Terhorst, P., et al., 2020a. Post-comparison mitigation of demographic bias in face recognition using fair score normalization. PRL 140, 332–338. Terhorst, P., et al., 2020b. Comparison-level mitigation of ethnic bias in face recognition. In: IWBF, pp. 1–6. Terhorst, P., et al., 2020c. Beyond identity: what information is stored in biometric face templates? In: IJCB, pp. 1–10. Terhorst, P., Kolf, J.N., Huber, M., Kirchbuchner, F., Damer, N., Moreno, A.M., Fierrez, J., Kuijper, A., 2021. A comprehensive study on face recognition biases beyond demographics. IEEE Trans. Technol. Soc. 3 (1), 16–30. Thijssen, L., 2016. Taste-Based Versus Statistical Discrimination: Placing the Debate into Context. GEMM Project. Vangara, K., et al., 2019. Characterizing the variability in face recognition accuracy relative to race. In: CVPRW. 0–0. Vera-Rodriguez, R., et al., 2019. Facegenderid: exploiting gender information in dcnns face recognition systems. In: CVPRW. 0–0. Wang, M., Deng, W., 2020. Mitigating bias in face recognition using skewness-aware reinforcement learning. In: CVPR, pp. 9322–9331. Wang, M., Deng, W., 2021. Deep face recognition: a survey. Neurocomputing 429, 215–244. Wang, F., Chen, L., Li, C., Huang, S., Chen, Y., Qian, C., Loy, C.C., 2018a. The devil of face recognition is in the noise. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 765–780. Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., Liu, W., 2018b. Cosface: large margin cosine loss for deep face recognition. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5265–5274. Verma, S., Rubin, J., 2018. Fairness definitions explained. In: 2018 IEEE/ACM International Workshop on Software Fairness (Fairware). IEEE, pp. 1–7. Wang, M., et al., 2019a. In: Racial Faces in the Wild: Reducing Racial Bias by Information Maximization Adaptation Network. ICCV, pp. 692–702. Wang, T., et al., 2019b. Balanced datasets are not enough: estimating and mitigating gender bias in deep image representations. In: ICCV, pp. 5310–5319. Wang, Z., et al., 2020. Towards fairness in visual recognition: effective strategies for bias mitigation. In: CVPR, pp. 8919–8928. Wang, Z., Dong, X., Xue, H., Zhang, Z., Chiu, W., Wei, T., Ren, K., 2022a. Fairness-aware adversarial perturbation towards bias mitigation for deployed deep models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10379–10388. Wang, A., Liu, A., Zhang, R., Kleiman, A., Kim, L., Zhao, D., Shirai, I., Narayanan, A., Russakovsky, O., 2022b. Revise: a tool for measuring and mitigating bias in visual datasets. Int. J. Comput. Vision, 1–21. Wang, M., Zhang, Y., Deng, W., 2022. Meta balanced network for fair face recognition. IEEE Trans. Pattern Anal. Mach. Intell. 44 (11), 8433–8448. https://doi.org/10.1109/ TPAMI.2021.3103191.


Washington, A.L., 2018. How to argue with an algorithm: lessons from the compas-propublica debate. Colo. Tech. L.J. 17, 131. Wen, Y., Zhang, K., Li, Z., Qiao, Y., 2016. A discriminative feature learning approach for deep face recognition. In: European Conference on Computer Vision. Springer, pp. 499–515. Xu, X., et al., 2021. Consistent instance false positive improves fairness in face recognition. In: CVPR, pp. 578–586. Yancheng, L., Moumita, S., Haichun, C., Facilitating Online Resource Access With Bias Corrected Training Data Generated for Fairness-Aware Predictive Models. https://lens.org/ 167-586-688-178-365. Yang, Z., et al., 2021. Ramface: race adaptive margin based face recognition for racial bias mitigation. In: IJCB, pp. 1–8. Ye, N., Li, K., Bai, H., Yu, R., Hong, L., Zhou, F., Li, Z., Zhu, J., 2022. OoD-Bench: quantifying and understanding two dimensions of out-of-distribution generalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7947–7958. Yi, D., Lei, Z., Liao, S., Li, S.Z., 2014. Learning face representation from scratch. CoRR. abs/1411.7923. Yi Z., Jitao S., Zunqi H., Jian Y., Zhongyuan Z., Zesong L., Method for Carrying Out Unbiased Classification on Image Data. https://lens.org/181-988-702-119-150. Yucer, S., Akcay, S., Al-Moubayed, N., Breckon, T.P., 2020. Exploring racial bias within face recognition via per-subject adversarially-enabled data augmentation. In: CVPRW, pp. 18–19. Yucer, S., Tektas, F., Al Moubayed, N., Breckon, T.P., 2022. Measuring hidden bias within face recognition via racial phenotypes. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 995–1004. Yunfeng, Z., Emma, B.R.K., Raj, V.K., Mitigating Statistical Bias in Artificial Intelligence Models. https://lens.org/075-081-335-679-106. Zhang, Z., Song, Y., Qi, H., 2017. Progression/regression by conditional adversarial autoencoder. In: CVPR, pp. 5810–5818. 
Zhang, Y., Deng, W., Wang, M., Hu, J., Li, X., Zhao, D., Wen, D., 2020. Globallocal gcn: large-scale label noise cleansing for face recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7731–7740. Zheng, X., Guo, Y., Huang, H., Li, Y., He, R., 2020. A survey of deep facial attribute analysis. Int. J. Comput. Vision 128 (8), 2002–2034. Zietlow, D., Lohaus, M., Balakrishnan, G., Kleindessner, M., Locatello, F., Scholkopf, B., Russell, C., 2022. Leveling down in computer vision: pareto inefficiencies in fair deep classifiers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10410–10421.


Chapter 8

Manipulating faces for identity theft via morphing and deepfake: Digital privacy

Akshay Agarwal(a) and Nalini Ratha(b,*)

(a) Department of Data Science and Engineering, IISER Bhopal, Bhopal, Madhya Pradesh, India
(b) Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY, United States
(*) Corresponding author: e-mail: [email protected]

Abstract

Digital face images can be easily manipulated to obfuscate or impersonate an identity. Several techniques are used for face manipulation, both traditional computer vision based, such as morphing, and modern deep learning based, such as deepfake. Morphing and deepfake techniques have become advanced enough to create photorealistic face images. As a result, these techniques pose a serious identity theft threat and can cause significant harm at the personal level, such as loss of reputation and money, and at the national level, such as election interference. In this chapter, we review (i) different stealthy techniques for generating identity threats, (ii) popular databases used in this research direction, and (iii) defense algorithms built to detect these manipulated images. We further present key open challenges that need to be addressed to make the defense algorithms robust, generalizable, and able to handle the adaptive nature of the attacks.

Keywords: Deepfake, Identity swap, Digital threats, Vulnerability of deep face recognition, Privacy and security

1 Introduction

Identification of the correct individual is crucial in several applications, including automatic border access, secure digital payment, and communication of a correct message on social media platforms. One of the most popular and extensively explored biometric modalities for identity verification is face recognition, owing to its nonintrusive nature and the easy interpretability of facial expressions, which is useful for communication. Another reason the face modality has been explored so extensively is that it is linked not only to identity but also to other attributes such as race, ethnicity, and gender. Like humans, who are good at identifying such attributes, deep learning algorithms have also shown tremendous success at this task (Wang and Deng, 2021).

Because of the popularity and effectiveness of face recognition systems, manipulation of face images has recently become prominent, and several techniques have evolved that make it easy (Majumdar et al., 2019, 2022; Singh et al., 2020) and that can fool both humans and machine learning algorithms (Köbis et al., 2021). The techniques most popular for generating identity-manipulated images are (i) morphing, (ii) swapping, and (iii) deepfake. The impact of each method is severe, yet so far no research has focused on tackling these different methods simultaneously. Moreover, the impact of these identity manipulation methods is not always the same across individuals of different ethnicity, race, or gender (Trinh and Liu, 2021). For instance, deepfake threats are more prevalent against females than males (Dunn, 2021; Vaas, 2019). We assert that, since threats can come in any form, awareness of the different possible aspects of threats is essential to develop effective security mechanisms that not only detect manipulated videos but also ban them or mitigate their impact (Editor, 2019; Flint, 2020; Hao, 2021). The abundance of identity manipulation methods undermines the notion that "seeing is believing." Although manipulation technology is not limited to causing harm and can be used effectively to generate entertainment material or to improve model performance through data augmentation, it is heavily used for negative purposes. Given the severity of the issue, several researchers have developed approaches for detecting these identity-aware alterations. For instance, research on detecting deepfakes, as well as on improving the performance of morphing and swapping detection, has recently received the desired attention (Agarwal et al., 2021b). Fig. 1 demonstrates the importance of addressing this arduous research problem. As can be seen, both face morphing and deepfake produce stealthy, high-quality images that can fool face recognition systems and humans (footnote a).

This chapter provides a comprehensive survey of identity manipulation, covering the generation techniques used to create morphing, swapping, and deepfake images, the existing datasets developed using these methods, and the defense algorithms that counter them. We also highlight the impact of different identity manipulation threats and argue that addressing each threat simultaneously is critical. Finally, we present a few open research challenges, including the evolution of adaptive intelligent attacks and the gaps that persist because of the limitations of existing defense techniques, such as generalizability and robustness.

Handbook of Statistics, Vol. 48. https://doi.org/10.1016/bs.host.2022.12.003 Copyright © 2023 Elsevier B.V. All rights reserved.

a The third column contains the images generated using face morphing, and the fourth column contains the images generated using the deepfake technique.


FIG. 1 Identity-manipulated images are generated using two completely different eras of methods. Can you identify the technique (morphing or deepfake) used in the generation of these images? The first and second column images are the source and target face images that are used for manipulated image generation.
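Classical face morphing is commonly described as a geometric warp toward shared landmark positions followed by a pixel-wise cross-dissolve of the two faces. The sketch below illustrates only the landmark interpolation and the cross-dissolve, assuming two grayscale images that are already aligned and stored as nested lists of 0–255 intensities. The function names `blend_landmarks` and `cross_dissolve` and the blending weight `alpha` are hypothetical choices for illustration; they are not taken from the chapter or from any tool it discusses, and the warping step is omitted.

```python
def blend_landmarks(pts_a, pts_b, alpha=0.5):
    """Linearly interpolate two lists of (x, y) landmarks.

    alpha = 0 returns the landmarks of face A, alpha = 1 those of face B.
    In a full morphing pipeline both faces would be warped to these
    intermediate positions before blending (warping is not shown here).
    """
    if len(pts_a) != len(pts_b):
        raise ValueError("landmark lists must have the same length")
    return [((1 - alpha) * xa + alpha * xb, (1 - alpha) * ya + alpha * yb)
            for (xa, ya), (xb, yb) in zip(pts_a, pts_b)]


def cross_dissolve(img_a, img_b, alpha=0.5):
    """Pixel-wise blend of two aligned grayscale images of equal size.

    Images are nested lists (rows of integer intensities in 0..255).
    Each output pixel is (1 - alpha) * A + alpha * B, rounded and clipped.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(img_a) != len(img_b) or any(
            len(ra) != len(rb) for ra, rb in zip(img_a, img_b)):
        raise ValueError("images must have the same shape")
    return [[min(255, max(0, round((1 - alpha) * pa + alpha * pb)))
             for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(img_a, img_b)]


# Toy 2x2 "faces": an equal-weight morph averages the two contributors.
face_a = [[100, 100], [0, 255]]
face_b = [[200, 200], [100, 255]]
morph = cross_dissolve(face_a, face_b, alpha=0.5)
mid_points = blend_landmarks([(0, 0), (10, 10)], [(10, 0), (20, 30)], alpha=0.5)
```

With `alpha = 0.5` both source identities contribute equally to the result, which is one intuitive reason a single morphed image might match two different people; other values of `alpha` shift the result toward one contributor.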

2 Identity manipulation techniques

Although the first available deepfake video was released by a Reddit user in 2017, face manipulation was already well established in the form of face morphing (Wolberg, 1996, 1998). Earlier, however, such techniques were rarely used to perform identity attacks. In one of the earliest works, Ferrara et al. (2014) used free image manipulation software, namely the GNU Image Manipulation Program v2.8 (GIMP) and the GIMP Animation Package v2.6 (GAP), to perform an identity attack. The authors showed that the generated morphed images succeed against commercial face recognition systems. Since then, several works in the research community have examined the vulnerability of face recognition algorithms to face morphing and swapping manipulation (Agarwal et al., 2017; Ferrara et al., 2014; Gomez-Barrero et al., 2017; Hildebrandt et al., 2017). The majority of these works show that no complex system is required to generate effective identity-manipulated images. Whether generated using face manipulation software or swapping techniques, these identity-manipulated images are highly popular in the digital domain, including messaging platforms. A few research works have also explored the potential of morphing and swapping attacks in the physical world. A morph attack in the physical world is performed by first generating the morphed images digitally and later scanning the images back into the system (Scherhag et al., 2018a). Table 1 lists a few publicly available face morphing tools that have been extensively explored for generating identity-manipulated face images. The popularity of these tools and their easy availability to the wider public raise serious concerns. The number of papers on face morphing also reflects the growth of the field and the importance of tackling the problem effectively (Venkatesh et al., 2021). The morphing toolboxes differ sharply from current deepfake-generation algorithms, which rely heavily on deep neural networks or generative networks. Fig. 2 shows the progression of both face morphing and deepfake-generated images. Although the toolboxes behind the two threats are entirely different, both can generate highly realistic attack images.

Another class of identity manipulation tools that has not received significant attention is social media applications, which are heavily used for online communication and media sharing. Most of these applications are now equipped with image manipulation features popularly referred to as filters. Notable examples of social media platforms widely used for image modification are Snapchat and Instagram. Therefore, the limited

TABLE 1 A few publicly available and explored face morphing tools.

| Method | Software | Type |
|---|---|---|
| Landmark based | Photo Morpher by Morpheus | Commercial |
| | FantaMorph by Abrosoft | |
| | FaceMorph by Luxand Inc. | |
| | Photoshop + Morph Animation by Adobe | |
| | FaceFusion by Moment Media | |
| Landmark + triangulation | GIMP + GAP by GIMP | Open source |
| | OpenCV | |
| | Face Morpher by Alyssa Quek | |


FIG. 2 Progression of the deepfake (first row) and morph images (second row). Starting from low quality with clearly perceptible artifacts, both generation approaches can now achieve high quality with barely perceptible artifacts. The deepfake images are taken from three different datasets, namely FF++ (Rossler et al., 2019), Celeb-DF (Li et al., 2020b), and DF (Ciftci et al., 2020), to show the progression.

exploration of these public and easily available mediums is a serious hurdle in developing a universal defense algorithm. Thankfully, a few research works have started to study the impact of social media application filters in generating face morph or swap images. Agarwal et al. (2017) generated one of the largest face morph datasets, which, unlike its image-based counterparts, consists of videos. The authors generated high-quality face morph videos using the Snapchat mobile application and demonstrated the vulnerability of both a commercial face matcher and mobile face unlocking software. Recently, the authors extended their work to cover several other social media applications and websites that can be used freely, without any restriction, to generate face morph images (Agarwal et al., 2021b). Table 2 lists social media platforms that can be used for effective face morph image generation; among these, Agarwal et al. (2021b) explored Snapchat, FaceApp, and morphthing.com and demonstrated their stealthy nature in fooling several face recognition frameworks. Interestingly, the authors also explored the morphing and swapping of more than two faces, which is rarely explored in identity manipulation research and hence needs attention.

The availability of these commercial mobile applications and Internet websites allows general netizens to create fake images and videos effortlessly. It significantly increases the presence of manipulated media on social media platforms, which have no boundaries. Hence, ignoring these mediums and failing to detect the media generated with them could be fatal not only for the security of machine learning algorithms but also for a healthy society. Apart from these freely available and easy-to-use mediums, which even a non-tech-savvy person can use effectively, deep neural network-based models have also received significant attention. The probable reason might be the advanced infrastructure of machine learning platforms, including publicly available ones that can be used by financially limited researchers either freely or by paying a


TABLE 2 Social media applications and websites that can be used for face morph image generation.

| Method | Software | Type |
|---|---|---|
| Keypoints + 3D Mesh | Snapchat by Snap Inc. | Mobile Applications |
| | FaceApp by FaceApp Tech. Pvt. Ltd. | |
| | ReFace by Neocortext Inc. | |
| | Instagram by Meta | |
| | Zao by Changsha Shenduronghe Network Tech. Co., Ltd. | |
| | Face Morph by Hamsoft | |
| | https://www.morphthing.com/ | Website |
| | https://faceswaponline.com/morph | |
| | https://facemorph.me/ | |
| | https://3dthis.com/facemorph.htm | |

minimal amount. The use of deep learning techniques for generating face-swap videos started in 2017, when deepfake videos posted on the Internet by a Reddit user shook the world. The posted face-swapped videos were generated using an algorithm similar to the deep neural network architecture proposed by Korshunova et al. (2017). Later, several deep neural network-based face swap and style transfer algorithms were proposed for the effective manipulation of faces. The worrisome part is that, like the open social media applications, the majority of these deep neural network-based face manipulation techniques are freely available. For instance, DeepFaceLab (Perov et al., 2020) provides several face manipulation options, such as replacing the face and head, de-aging the person, and manipulating the lips. On top of that, the authors claim that the toolbox can easily be used without any professional knowledge; i.e., essentially anyone with Internet access and computing resources can generate and spread fake data. Table 3 lists some of the popular deep learning-based face manipulation algorithms. Fig. 3 shows the growth in the number of papers published on these interrelated topics, which are dealt with independently the majority of the time. The statistics are obtained by searching for the keywords “face morphing,” “face-swapping,” and “deepfake” on the “https://app.dimensions.ai/discover/


TABLE 3 Deepfake generation techniques based on deep neural network architectures.

| Algorithm | Architecture | Link |
|---|---|---|
| Face Swap | Encoder-Decoder | https://github.com/deepfakes/faceswap |
| DeepFaceLab | | https://github.com/iperov/DeepFaceLab |
| Face Swap-GAN | | https://github.com/shaoanlu/faceswap-GAN |
| Face Shifter | Encoder-Decoder + Generator | https://lingzhili.com/FaceShifterPage/ |
| FSGAN | Recurrent Neural Network | https://github.com/YuvalNirkin/fsgan |
| Style GAN | Generative Networks | https://github.com/NVlabs/stylegan |
| Face2Face | Shape Deformation Model | https://justusthies.github.io/posts/face2face/ |
| Neural Texture | Convolutional Encoder Decoder | https://github.com/SSRSGJYD/NeuralTexture |

FIG. 3 Progression in the number of papers appearing with the different identity manipulation technique names (morphing, swapping, and deepfake); the plot shows the number of publications per year from 2010 to 2022. The number of papers for the year 2022 reflects the published papers listed so far (through March) on the platform.


publication” platform. It is clear that as the threat of deepfake technology becomes more prominent, the number of papers is growing exponentially as well. The parallel fields covering similar domains, such as face morphing and face swapping, also have a significant footprint in the literature. We also note that techniques developed for face morphing detection can be applied to deepfake detection (Agarwal et al., 2021b); however, their exploration across different manipulation settings has been limited so far.

3 Identity manipulation datasets

The advancement of identity manipulation research has been made possible by significant contributions from various researchers in creating large-scale face morphing, swapping, and deepfake datasets. Table 4 lists various deepfake datasets generated using deep neural network architectures; the majority of these datasets contain images of the target person swapped with the source identity. The table presents one of the first comprehensive surveys of datasets, which is missing from the majority of recent survey papers (Mirsky and Lee, 2021; Venkatesh et al., 2021). The earliest datasets, namely UADFV and TIMIT, were developed in 2018. Their primary limitations are that they contain few real and deepfake videos and that the quality of these videos is poor. Later, several high-quality and large-scale datasets were proposed to advance deepfake detection research. FF++ is one of the earliest large-scale datasets; it contains four different manipulations, two of which, face swap and deepfake, belong to identity manipulation. The dataset provides videos at three quality levels to make the deepfake detection problem more challenging.

Given the popularity of deepfake detection research and its impact on social media platforms, several companies have also started deepfake detection competitions, announced along with novel datasets. One such competition, organized by Facebook together with industry and academic partners, came with the deepfake detection challenge dataset (DFDC) (Dolhansky et al., 2020). The majority of the above datasets focus on the modification of a single modality, the face. However, a few recent datasets include manipulation of both face and audio (Khalid et al., 2021). Lip-sync is an important part of the movement of facial features; therefore, the presence of such a wide variety of manipulations can significantly boost identity manipulation research. Apart from that, instead of capturing deepfake videos generated in constrained environments, researchers have started to acquire or generate videos reflecting several real-world artifacts, including variations in resolution, compression, illumination, aspect ratio, frame rate, motion, pose, cosmetics, and occlusion. Datasets covering such artifacts are popularly referred to as “in-the-wild” datasets; examples are DF (Ciftci et al., 2020) and Wilddeepfake (Zi et al., 2020).

TABLE 4 Comparison of various benchmark datasets widely used for deepfake detection research. A dash marks a cell that is empty in the source table.

| Dataset | Real videos | Fake videos | Total videos | Total subjects | Methods | Real audio | Deepfake audio | Multi ethnicity |
|---|---|---|---|---|---|---|---|---|
| UADFV (Yang et al., 2019) | 49 | 49 | 98 | 49 | 1 | No | No | No |
| TIMIT (Korshunov and Marcel, 2018) | 640 | 320 | 960 | 32 | 2 | No | Yes | No |
| FF++ (Rossler et al., 2019) | 1000 | 4000 | 5000 | N/A | 4 | No | No | No |
| Celeb-DF (Li et al., 2020b) | 590 | 5639 | 6229 | 59 | 1 | No | No | No |
| Google DFD (Dufour and Gully, 2019) | 363 | 3068 | 3431 | 28 | 5 | No | No | No |
| DeeperForensics (Jiang et al., 2020) | 50,000 | 10,000 | 60,000 | 100 | 1 | No | No | No |
| DFDC (Dolhansky et al., 2020) | 23,654 | 104,500 | 128,154 | 960 | 8 | Yes | Yes | No |
| KoDF (Kwon et al., 2021) | 62,166 | 175,776 | 237,942 | 403 | 6 | No | Yes | No |
| FakeAVCeleb (Khalid et al., 2021) | 490+ | 20,000+ | 20,000+ | 600+ | 5 | Yes | Yes | No |
| Indian Forensics (Mehra et al., 2021, 2023) | 200 | 248 | 448 | 149 | 1 | Yes | No | No |
| DF (Ciftci et al., 2020) | – | – | 142 | – | – | No | No | No |
| Wilddeepfake (Zi et al., 2020) | 3805 | 3509 | 7314 | – | – | No | No | No |
| DF-W (Pu et al., 2021) | – | 1869 | – | – | 5 | No | No | No |


Table 5 describes the datasets developed for face morph and swap attack detection. The majority of the datasets are prepared using landmark detection of the facial region: facial landmarks are first detected and then aligned in both the source and target images to make the blending of the faces effective. These datasets are prepared using free software such as OpenCV or GIMP and are generally image-based. The first large-scale morph dataset, containing more than 750 videos, was prepared by Agarwal et al. (2017) using the social media application Snapchat. To further advance morph detection research, the authors (Agarwal et al., 2021b) recently prepared multiple datasets using several social media applications and online tools. These mediums reflect various methods of face morphing, ranging from face swapping to neural attribute transfer. In contrast to the many morph and deepfake datasets that combine two faces to perform the attack, the Identity Morphing dataset (Agarwal et al., 2021b) also contains morph images generated by combining up to four faces. The morph generation methods used in the dataset are entirely different from deepfake generation algorithms, and it has been observed that the majority of algorithms developed for one type of manipulation might not be effective against another type of attack (Du et al., 2020). Therefore, simultaneous attention to

TABLE 5 Comparison of various benchmark datasets proposed for face morph and swap attack detection research.

| Database | Tool | Author |
|---|---|---|
| Face Morph | GIMP/GAP | Ferrara et al. (2016) |
| SWAPPED | Snapchat | Agarwal et al. (2017) |
| FRGC-v2-Morphs | OpenCV | Scherhag et al. (2018b) |
| VISAPP17 | MATLAB | Makrushin et al. (2017) |
| MorGAN | GAN | Debiasi et al. (2019) |
| Snapchat | Snapchat | Agarwal et al. (2021b) |
| Identity Morphing | morphthing | Agarwal et al. (2021b) |
| FaceApp | FaceApp | Agarwal et al. (2021b) |
| FRGC-Morphs | OpenCV, FaceMorpher, StyleGAN2 | Sarkar et al. (2020) |
| FERET-Morphs | | Sarkar et al. (2020) |
| FRLL-Morphs | | Sarkar et al. (2020) |

FIG. 4 Vulnerability of face recognition algorithms to face swapping (right) and deepfake images (left); the plots show accuracy (%) against rank. The experiments represent two identity threat scenarios: impersonation using swapping and obfuscation using deepfake. In obfuscation, the accuracy on both the source attack and the target attack drops significantly, whereas in impersonation, an attacker can obtain the desired identity through face swapping more than 90% of the time.

both threats is necessary to build a unified defense approach. The need for a unified defense also follows from the fact that face morphing attacks are as prevalent as deepfake attacks and can generate highly realistic images (Figs. 1, 2, and 4). Therefore, handling one attack while ignoring the others cannot yield a defense that protects face recognition algorithms and society.
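The landmark-based morph generation described in this section (detect landmarks, align them in the source and target images, then blend) can be illustrated with a minimal sketch. It assumes the landmarks are already detected and in correspondence, and that both images have already been warped to the averaged landmark positions; the warping step itself (for example, a triangulation-based warp) is omitted. The function names, the blending weight `alpha`, and the toy inputs are hypothetical and are not taken from any of the tools or datasets cited above.

```python
import numpy as np

def morph_landmarks(src_pts, dst_pts, alpha=0.5):
    """Linearly interpolate two sets of corresponding facial landmarks."""
    src_pts = np.asarray(src_pts, dtype=float)
    dst_pts = np.asarray(dst_pts, dtype=float)
    if src_pts.shape != dst_pts.shape:
        raise ValueError("landmark sets must have the same shape")
    return (1.0 - alpha) * src_pts + alpha * dst_pts

def blend_aligned_faces(src_img, dst_img, alpha=0.5):
    """Cross-dissolve two images that are assumed to be already aligned."""
    src = np.asarray(src_img, dtype=float)
    dst = np.asarray(dst_img, dtype=float)
    if src.shape != dst.shape:
        raise ValueError("images must have the same shape after alignment")
    out = (1.0 - alpha) * src + alpha * dst
    # Round and clip back to the 8-bit pixel range.
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

# Toy inputs (hypothetical): three landmarks and two flat grayscale patches.
src_pts = [[30, 40], [70, 40], [50, 80]]
dst_pts = [[34, 44], [74, 36], [50, 90]]
mid_pts = morph_landmarks(src_pts, dst_pts, alpha=0.5)

src_img = np.full((4, 4), 100, dtype=np.uint8)
dst_img = np.full((4, 4), 200, dtype=np.uint8)
morph = blend_aligned_faces(src_img, dst_img, alpha=0.5)
```

With `alpha=0.5` both identities contribute equally; moving `alpha` toward 0 or 1 biases the result toward the source or the target face.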

4 Identity attack detection algorithms

While face morph, swap, and deepfake detection are handled independently in the literature, the developed algorithms can be broadly grouped into those based on traditional handcrafted image features and those based on deep neural networks. Additionally, since the majority of deepfake datasets are video-based, temporal artifacts are also utilized in deepfake detection. We first briefly describe the algorithms used for face morph and swap attack detection, followed by deepfake detection algorithms.

Several image features, such as local binary patterns (LBP), binarized statistical image features (BSIF), histogram of oriented gradients (HOG), scale-invariant feature transform (SIFT), and speeded-up robust features (SURF), have been extensively explored for detecting morph and swap images (Agarwal et al., 2017, 2021b; Makrushin et al., 2019; Scherhag et al., 2018a, 2019). The significant advantage of image feature-based algorithms is their low computational complexity; however, many experiments show that these algorithms generalize poorly and are not robust to unseen datasets or manipulation types. Therefore, given the tremendous success of deep neural networks in image classification, researchers have started to explore them for face morphing and swapping detection (Seibold et al., 2017; Venkatesh et al., 2020). Raja et al. (2017) used a feature fusion strategy over two different deep neural networks for the detection of face morph images. Agarwal et al. (2021b) combined the power of a deep neural network with image features to learn an effective face swap attack detector.

Similar to face morph and swap detection, work on deepfake detection has also explored handcrafted features such as bag of words, head pose, audiovisual features, steganalysis features, and eye blinks. However, these methods were not found to be effective, and hence the majority of today’s research utilizes deep learning architectures. For example, Rossler et al. (2019) studied several deep neural network architectures for deepfake detection. Recently, newer deep architectures such as capsule networks, vision transformers, attention networks, and 3D networks have also been explored for deepfake detection (Nguyen et al., 2019c; Wang et al., 2021; Zhao et al., 2021). As mentioned earlier, the deepfake datasets are video-based, and it is assumed that local face manipulations are not preserved consistently across the time dimension and therefore yield inconsistencies among the frames. Based on this assumption, several researchers have explored recurrent networks to model the inconsistencies along the time dimension (Güera and Delp, 2018; Sabir et al., 2019).

We want to highlight that this chapter does not aim to provide a detailed review of each existing algorithm developed in such a vast research area. Rather, the aim is to provide the missing connections between identity manipulation attacks and to show, through comprehensive knowledge of the different attacks, how the independent handling of these threats can be combined for better security. We point readers toward the existing survey papers for further detailed study (Majumdar et al., 2022; Mirsky and Lee, 2021; Singh et al., 2020; Venkatesh et al., 2021). However, as mentioned, the majority of these survey papers tackle only one category of attack and hence are limited in scope.
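As an illustration of the handcrafted-feature detectors discussed in this section, the following minimal sketch computes a basic 3x3 local binary pattern (LBP) histogram with NumPy. It is a generic textbook variant, not the specific descriptor of any cited work; the function names and toy images are hypothetical. In a detection pipeline, such a histogram would typically be fed to a classifier of one's choice, which is outside the scope of this sketch.

```python
import numpy as np

def lbp_codes(gray):
    """Basic 3x3 LBP: threshold the 8 neighbours against the centre pixel."""
    g = np.asarray(gray, dtype=float)
    h, w = g.shape
    center = g[1:-1, 1:-1]
    codes = np.zeros(center.shape, dtype=np.uint8)
    # Neighbour offsets in clockwise order, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Set this bit wherever the neighbour is at least as bright as the centre.
        codes |= (neighbour >= center).astype(np.uint8) << np.uint8(bit)
    return codes

def lbp_histogram(gray):
    """Normalised 256-bin histogram of LBP codes, usable as a feature vector."""
    codes = lbp_codes(gray)
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

# Toy inputs (hypothetical): a flat patch and a patch with one bright centre pixel.
flat_patch = np.full((5, 5), 10)
flat_hist = lbp_histogram(flat_patch)
peak_patch = np.array([[0, 0, 0], [0, 9, 0], [0, 0, 0]])
peak_code = int(lbp_codes(peak_patch)[0, 0])
```

The feature costs only a handful of comparisons per pixel, which reflects the low computational complexity noted above for feature-based detectors.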

5 Open challenges

Apart from handling these different and powerful identity evasion attacks, several open challenges have not been effectively dealt with so far.

Generalization: One of the biggest challenges in fake identity detection is generalizability across manipulation types and datasets. Fig. 5 demonstrates the robustness of several state-of-the-art deepfake detection algorithms, namely Xception (Rossler et al., 2019), Face X-ray (Li et al., 2020a), and Multiscale (Luo et al., 2021), under same-attack and cross-attack evaluation. The experimental results are reported on the deepfake (DF) and face swap (FS) manipulations. When the algorithms are evaluated on the same attack on which they are trained, they yield an AUC of at least 0.98. However, each algorithm suffers drastically as soon as unseen attack types are presented for testing. For instance, the multiscale algorithm, proposed with the claim of being generalized, shows a drop of 30.7% in DF detection AUC when trained on FS compared with when trained on DF. A similar observation holds for another algorithm claimed to be “more generalized,” namely Face X-ray, whose performance on DF testing shows a

FIG. 5 Cross attack generalizability of the deepfake detectors (Xception, Face X-Ray, and Multi-scale). Two bar charts plot AUC against the training manipulation (DF or FS), one for testing on DF and one for testing on FS. For both face swap and deepfake detection, the algorithms show poor generalizability when tested on manipulation methods not seen at the time of training, for example, when they are trained on FS and evaluated on DF, and vice versa.


drop of more than 52% when trained on FS compared with training on the same attack images/videos (i.e., DF videos).

Another way of evaluating the generalizability of deepfake detection algorithms is cross-dataset evaluation, i.e., training on one dataset and testing on an unseen one. For instance, one popular evaluation protocol in the research community is training on the FF++ dataset (Rossler et al., 2019) and testing on the Celeb-DF dataset (Li et al., 2020b). Table 6 shows the lack of generalizability of several existing algorithms under this protocol. The results are a clear example of algorithms overfitting or memorizing the training set and failing to generalize to the variations present in unseen testing datasets. SPSL, the algorithm with the highest generalizability, still performs approximately 20% worse on the unseen dataset. Moreover, the majority of existing algorithms yield more than 30% lower AUC on Celeb-DF than on FF++.

The above generalizability issue is reported on high-quality deepfake videos, and it is not the only problem. Another threat to deepfake detection algorithms is their vulnerability to compression artifacts: when the same algorithms are evaluated on low-quality or highly compressed videos, their performance degrades significantly further.
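Since the comparisons above are reported in terms of AUC, a minimal sketch of how the metric can be computed may help. It uses the pairwise-ranking definition: the AUC is the fraction of (fake, real) pairs in which the fake sample receives the higher score, with ties counted as one half. The function name and the toy scores are hypothetical; they only mimic a detector that separates a seen manipulation well and an unseen one poorly.

```python
import numpy as np

def auc_score(labels, scores):
    """AUC as the probability that a random fake outscores a random real sample."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]  # scores of fake (positive) samples
    neg = scores[labels == 0]  # scores of real (negative) samples
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one real and one fake sample")
    diff = pos[:, None] - neg[None, :]  # all fake-vs-real score differences
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins) / (pos.size * neg.size)

# Toy detector outputs (hypothetical): label 1 = fake, label 0 = real.
labels = [0, 0, 0, 1, 1, 1]
seen_scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]     # manipulation seen in training
unseen_scores = [0.4, 0.6, 0.5, 0.45, 0.7, 0.3]  # manipulation unseen in training

auc_seen = auc_score(labels, seen_scores)
auc_unseen = auc_score(labels, unseen_scores)
auc_drop = auc_seen - auc_unseen
```

An AUC near 0.5 means the detector ranks fakes no better than chance, which is the regime several detectors approach in the cross-attack setting of Fig. 5.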

TABLE 6 Cross-dataset evaluation (AUC) on Celeb-DF.

| Method | FF++ | Celeb-DF |
|---|---|---|
| Xception-raw (Rossler et al., 2019) | 0.9970 | 0.4820 |
| Multi-task (Nguyen et al., 2019a) | 0.7630 | 0.5430 |
| Capsule (Nguyen et al., 2019b) | 0.9660 | 0.5750 |
| DSP-FWA (Li and Lyu, 2019) | 0.9300 | 0.6460 |
| Face-XRay (Li et al., 2020a) | 0.9912 | 0.7420 |
| F3-Net (Qian et al., 2020) | 0.9797 | 0.6517 |
| Two-branch (Masi et al., 2020) | 0.9318 | 0.7341 |
| EfficientNet-B4 (Tan and Le, 2019) | 0.9970 | 0.6429 |
| SPSL (Liu et al., 2021) | 0.9691 | 0.7688 |
| Nirkin et al. (Nirkin et al., 2022) | 0.9900 | 0.6600 |
| MD-CSDNetwork (Agarwal et al., 2021a) | 0.9970 | 0.6877 |

The model is trained on FF++ and tested on the Celeb-DF dataset. The first column reports the AUC values when tested only on FF++. The results are taken from the recent research paper (Agarwal et al., 2021a).
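The drops quoted in the text can be checked directly against Table 6. The short sketch below assumes that the percentages in the text refer to absolute differences in AUC (for example, 0.9691 - 0.7688 is roughly 0.20 for SPSL); this reading is an assumption, chosen because it matches the tabulated numbers. The variable names are illustrative.

```python
# (FF++ AUC, Celeb-DF AUC) pairs copied from Table 6.
table6 = {
    "Xception-raw": (0.9970, 0.4820),
    "Multi-task": (0.7630, 0.5430),
    "Capsule": (0.9660, 0.5750),
    "DSP-FWA": (0.9300, 0.6460),
    "Face-XRay": (0.9912, 0.7420),
    "F3-Net": (0.9797, 0.6517),
    "Two-branch": (0.9318, 0.7341),
    "EfficientNet-B4": (0.9970, 0.6429),
    "SPSL": (0.9691, 0.7688),
    "Nirkin et al.": (0.9900, 0.6600),
    "MD-CSDNetwork": (0.9970, 0.6877),
}

# Absolute AUC drop when moving from the seen (FF++) to the unseen (Celeb-DF) dataset.
drops = {name: round(ff - cdf, 4) for name, (ff, cdf) in table6.items()}

# Method with the highest AUC on the unseen dataset.
best_unseen = max(table6, key=lambda name: table6[name][1])

# Methods that lose more than 0.30 AUC on the unseen dataset.
large_drops = [name for name, drop in drops.items() if drop > 0.30]

for name, drop in sorted(drops.items(), key=lambda item: item[1]):
    print(f"{name:16s} drop = {drop:.4f}")
```

Under this reading, SPSL has the highest Celeb-DF AUC with a drop of about 0.20, and six of the eleven methods lose more than 0.30 AUC, which agrees with the statements in the text.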


Coverage of multiple demographics: As observed from the deepfake datasets in Table 4, the majority of the datasets are heavily inclined toward a single ethnicity. The Caucasian ethnicity is the most explored for the generation of deepfake videos, although a few recent research works have started generating deepfake videos for other ethnicities as well, such as Indian (Mehra et al., 2021, 2023) and Korean (Kwon et al., 2021). The impact of deepfake, face swap, and face morphing is not limited to any particular demographic. The impact on one demographic may be higher than on others, but we cannot neglect even a smaller effect in such a sensitive issue. Therefore, the collection of identity manipulation videos covering different demographic identities is critical. As the saying goes, “data is the new electricity”; without ample data covering varied attack distributions, how can an effective defense be built to protect every demographic present in our society?

Large-scale morph and swap datasets: In contrast to the deepfake datasets, face swap and morph datasets are not only image-based but also limited in size. For instance, to the best of our knowledge, the largest face swap/morph dataset, namely “IDAgender” prepared by Agarwal et al. (2021b), contains approximately 750 swap videos, 1700 morph images, and 600 neural attribute swap images. Another issue with face morph and swap datasets is their availability in the public domain. The majority of face morph and swap datasets developed in research are collected privately and not released publicly due to privacy issues.

Simultaneous handling of different threats: As mentioned earlier, the impact of each identity theft attack is severe, and handling the attacks individually cannot help in building a better defense. As the proverb says, “the best defense is the good defense”; however, if the best defense is not aware of the different threats, it cannot be a good defense. Face morph and deepfake attacks can both be used effectively for identity impersonation and obfuscation, based on the manipulation of the source identity using the target identity. The attack generation processes also exhibit several similarities; hence, tackling the two attacks individually can be fatal for building a defense and protecting the cyber world.

Novel or adaptive threats: Another significant challenge that needs attention is novel and adaptive attacks. A few such cases are partial tampering or swapping of face regions (Majumdar et al., 2019), vulnerability against the features of a classifier (Agarwal et al., 2019b), and adversarial modification of the input images (Agarwal et al., 2019a; Hussain et al., 2021). In contrast to deepfake or face morph generation techniques, which alter the full face region, Majumdar et al. (2019) proposed the generation of face swap images in which only partial regions of the source identities are swapped with those of the target identities. The authors showed that several deep face recognition networks are vulnerable even to such low-level modifications. Similarly, robustness issues also arise from feature-level and image-level perturbations, which have not received significant attention so far.


6 Conclusion

Identity theft has grown manyfold and affects several aspects of our day-to-day life, ranging from trust in social media content to the use of automated face recognition algorithms. The face, being one of the quickest mediums of communication, has become one of the prime victims of such attacks. Several forms of identity theft attacks have been proposed in the literature, including face morphing, swapping, and deepfake. Moreover, each attack, even though it works on different principles, has proven effective in fooling both humans and face recognition algorithms (Köbis et al., 2021; Korshunov and Marcel, 2020; Robertson et al., 2017). However, the research works handling these threats are very specific and treat the attacks independently. This chapter provides a comprehensive overview of popular and stealthy identity theft attacks. We also presented popular face manipulation datasets and detection techniques that have been proposed in the literature. Several open research challenges are also highlighted; these need to be tackled efficiently, by developing unified defense algorithms, to build a secure social media environment and enable the trustworthy use of face recognition algorithms in the future.

References

Agarwal, A., Singh, R., Vatsa, M., Noore, A., 2017. Swapped! digital face presentation attack detection via weighted local magnitude pattern. In: IEEE IJCB, pp. 659–665.
Agarwal, A., Sehwag, A., Singh, R., Vatsa, M., 2019a. Deceiving face presentation attack detection via image transforms. In: IEEE BigMM, pp. 373–382.
Agarwal, A., Sehwag, A., Vatsa, M., Singh, R., 2019b. Deceiving the protector: fooling face presentation attack detection algorithms. In: ICB, pp. 1–6.
Agarwal, A., Agarwal, A., Sinha, S., Vatsa, M., Singh, R., 2021a. MD-CSDNetwork: multidomain cross stitched network for deepfake detection. In: IEEE F&G, pp. 1–8.
Agarwal, A., Singh, R., Vatsa, M., Noore, A., 2021b. MagNet: detecting digital presentation attacks on face recognition. Front. Artif. Intell. 4, 643424.
Ciftci, U.A., Demir, I., Yin, L., 2020. Fakecatcher: detection of synthetic portrait videos using biological signals. IEEE TPAMI. https://doi.org/10.1109/TPAMI.2020.3009287.
Debiasi, L., Damer, N., Saladie, A.M., Rathgeb, C., Scherhag, U., Busch, C., Kirchbuchner, F., Uhl, A., 2019. On the detection of gan-based face morphs using established morph detectors. In: ICIAP, Springer, pp. 345–356.
Dolhansky, B., Bitton, J., Pflaum, B., Lu, J., Howes, R., Wang, M., Ferrer, C.C., 2020. The deepfake detection challenge (dfdc) dataset. arXiv preprint:2006.07397.
Du, M., Pentyala, S., Li, Y., Hu, X., 2020. Towards generalizable deepfake detection with locality-aware autoencoder. In: ACM CIKM, pp. 325–334.
Dufour, N., Gully, A., 2019. Contributing data to deepfake detection research. Google AI Blog 1 (2), 3.
Dunn, S., 2021. Women, not politicians, are targeted most often by Deepfake videos. https://www.cigionline.org/articles/women-not-politicians-are-targeted-most-often-deepfake-videos/.
Editor, 2019. DEEPFAKE: app that can remove women’s clothes from images shut down. https://bebasnews.my/2019/06/29/app-that-can-remove-womens-clothes-from-images-shut-down/.


Ferrara, M., Franco, A., Maltoni, D., 2014. The magic passport. In: IEEE IJCB, pp. 1–7.
Ferrara, M., Franco, A., Maltoni, D., 2016. On the effects of image alterations on face recognition accuracy. In: Face Recognition Across the Imaging Spectrum, Springer, pp. 195–222.
Flint, P., 2020. Ai deepfake digitally removes clothes of over 100,000 women. https://www.techspot.com/news/87219-report-shows-over-100k-women-virtually-disrobed-through.html/.
Gomez-Barrero, M., Rathgeb, C., Scherhag, U., Busch, C., 2017. Is your biometric system robust to morphing attacks? In: IEEE IWBF, pp. 1–6.
Güera, D., Delp, E.J., 2018. Deepfake video detection using recurrent neural networks. In: IEEE AVSS, pp. 1–6.
Hao, K., 2021. Deepfake porn is ruining women’s lives. Now the law may finally ban it. https://www.technologyreview.com/2021/02/12/1018222/deepfake-revenge-porn-coming-ban/.
Hildebrandt, M., Neubert, T., Makrushin, A., Dittmann, J., 2017. Benchmarking face morphing forgery detection: application of stirtrace for impact simulation of different processing steps. In: IEEE IWBF, pp. 1–6.
Hussain, S., Neekhara, P., Jere, M., Koushanfar, F., McAuley, J., 2021. Adversarial deepfakes: evaluating vulnerability of deepfake detectors to adversarial examples. In: WACV, pp. 3348–3357.
Jiang, L., Li, R., Wu, W., Qian, C., Loy, C.C., 2020. Deeperforensics-1.0: a large-scale dataset for real-world face forgery detection. In: CVPR, pp. 2889–2898.
Khalid, H., Tariq, S., Kim, M., Woo, S.S., 2021. FakeAVCeleb: a novel audio-video multimodal deepfake dataset. arXiv preprint:2108.05080.
Köbis, N.C., Doležalová, B., Soraperra, I., 2021. Fooled twice: people cannot detect deepfakes but think they can. Iscience 24 (11), 103364.
Korshunov, P., Marcel, S., 2018. Deepfakes: a new threat to face recognition? Assessment and detection. arXiv preprint:1812.08685.
Korshunov, P., Marcel, S., 2020. Deepfake detection: humans vs. machines. arXiv preprint:2009.03155.
Korshunova, I., Shi, W., Dambre, J., Theis, L., 2017. Fast face-swap using convolutional neural networks. In: ICCV, pp. 3677–3685.
Kwon, P., You, J., Nam, G., Park, S., Chae, G., 2021. Kodf: a large-scale Korean deepfake detection dataset. In: ICCV, pp. 10744–10753.
Li, Y., Lyu, S., 2019. Exposing DeepFake videos by detecting face warping artifacts. In: IEEE CVPRW.
Li, L., Bao, J., Zhang, T., Yang, H., Chen, D., Wen, F., Guo, B., 2020a. Face x-ray for more general face forgery detection. In: CVPR, pp. 5001–5010.
Li, Y., Yang, X., Sun, P., Qi, H., Lyu, S., 2020b. Celeb-df: a large-scale challenging dataset for deepfake forensics. In: CVPR, pp. 3207–3216.
Liu, H., Li, X., Zhou, W., Chen, Y., He, Y., Xue, H., Zhang, W., Yu, N., 2021. Spatial-phase shallow learning: rethinking face forgery detection in frequency domain. In: IEEE/CVF CVPR.
Luo, Y., Zhang, Y., Yan, J., Liu, W., 2021. Generalizing face forgery detection with high-frequency features. In: CVPR, pp. 16317–16326.
Majumdar, P., Agarwal, A., Singh, R., Vatsa, M., 2019. Evading face recognition via partial tampering of faces. In: CVPRW, pp. 11–20.
Majumdar, P., Agarwal, A., Vatsa, M., Singh, R., 2022. Facial retouching and alteration detection. In: Handbook of Digital Face Manipulation and Detection, Springer, Cham, pp. 367–387.
Makrushin, A., Neubert, T., Dittmann, J., 2017. Automatic generation and detection of visually faultless facial morphs. In: ICCVTA, vol. 7. SciTePress, pp. 39–50.

240 Handbook of Statistics Makrushin, A., Kraetzer, C., Dittmann, J., Seibold, C., Hilsmann, A., Eisert, P., 2019. DempsterShafer theory for fusing face morphing detectors. In: IEEE EUSIPCO, pp. 1–5. Masi, I., Killekar, A., Mascarenhas, R.M., Gurudatt, S.P., AbdAlmageed, W., 2020. Two-branch recurrent network for isolating deepfakes in videos. In: ECCV, pp. 667–684. Mehra, A., Agarwal, A., Vatsa, M., Singh, R., 2021. Detection of digital manipulation in facial images (student abstract). AAAI. 35 (18), 15845–15846. Mehra, A., Agarwal, A., Vatsa, M., Singh, R., 2023. Motion magnified 3D residual-in-dense network for DeepFake detection. IEEE Trans. Biometrics Behav. Identity Sci. 5 (1), 39–52. https://doi.org/10.1109/TBIOM.2022.3201887. Mirsky, Y., Lee, W., 2021. The creation and detection of deepfakes: a survey. ACM CSUR 54 (1), 1–41. Nguyen, H.H., Fang, F., Yamagishi, J., Echizen, I., 2019a. Multi-task learning for detecting and segmenting manipulated facial images and videos. In: IEEE BTAS, pp. 1–8. Nguyen, H.H., Yamagishi, J., Echizen, I., 2019b. Capsule-forensics: using capsule networks to detect forged images and videos. In: IEEE ICASSP, pp. 2307–2311. Nguyen, H.H., Yamagishi, J., Echizen, I., 2019c. Use of a capsule network to detect fake images and videos. arXiv preprint:1910.12467. Nirkin, Y., Wolf, L., Keller, Y., Hassner, T., 2022. Deepfake detection based on discrepancies between faces and their context. IEEE TPAMI 44 (10), 6111–6121. https://doi.org/10.1109/ TPAMI.2021.3093446. Perov, I., Gao, D., Chervoniy, N., Liu, K., Marangonda, S., Ume, C., Dpfks, M., Facenheim, C.S., RP, L., Jiang, J., et al., 2020. Deepfacelab: integrated, flexible and extensible face-swapping framework. arXiv preprint:2005.05535. Pu, J., Mangaokar, N., Kelly, L., Bhattacharya, P., Sundaram, K., Javed, M., Wang, B., Viswanath, B., 2021. Deepfake videos in the wild: analysis and detection. In: Web Conference, pp. 981–992. Qian, Y., Yin, G., Sheng, L., Chen, Z., Shao, J., 2020. 
Thinking in frequency: face forgery detection by mining frequency-aware clues. In: ECCV, Springer, pp. 86–103. Raja, K., Venkatesh, S., Christoph Busch, R.B., et al., 2017. Transferable deep-cnn features for detecting digital and print-scanned morphed face images. In: CVPRW, pp. 10–18. Robertson, D.J., Kramer, R.S.S., Burton, A.M., 2017. Fraudulent ID using face morphs: experiments on human and automatic recognition. PLoS One 12 (3), e0173319. Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M., 2019. Faceforensics++: learning to detect manipulated facial images. In: ICCV, pp. 1–11. Sabir, E., Cheng, J., Jaiswal, A., AbdAlmageed, W., Masi, I., Natarajan, P., 2019. Recurrent convolutional strategies for face manipulation detection in videos. Interfaces (GUI) 3 (1), 80–87. Sarkar, E., Korshunov, P., Colbois, L., Marcel, S., 2020. Vulnerability analysis of face morphing attacks from landmarks and generative adversarial networks. arXiv preprint:2012.05344. Scherhag, U., Budhrani, D., Gomez-Barrero, M., Busch, C., 2018a. Detecting morphed face images using facial landmarks. In: ICISP, pp. 444–452. Scherhag, U., Rathgeb, C., Busch, C., 2018b. Morph deterction from single face image: a multialgorithm fusion approach. In: ICBEA, pp. 6–12. Scherhag, U., Rathgeb, C., Merkle, J., Breithaupt, R., Busch, C., 2019. Face recognition systems under morphing attacks: a survey. IEEE Access 7, 23012–23026. Seibold, C., Samek, W., Hilsmann, A., Eisert, P., 2017. Detection of face morphing attacks by deep learning. In: IWDW, pp. 107–120. Singh, R., Agarwal, A., Singh, M., Nagpal, S., Vatsa, M., 2020. On the robustness of face recognition algorithms against attacks and bias. In: AAAI, vol. 34, pp. 13583–13589. 9.

Identity theft via morphing and deepfake Chapter

8 241

Tan, M., Le, Q.V., 2019. EfficientNet: rethinking model scaling for convolutional neural networks. In: Chaudhuri, K., Salakhutdinov, R. (Eds.), Proceedings of the International Conference on Machine Learning, vol. 97, pp. 6105–6114. http://proceedings.mlr.press/v97/ tan19a.html. Trinh, L., Liu, Y., 2021. An examination of fairness of AI models for deepfake detection. arXiv preprint:2105.00558. Vaas, L., 2019. Deepfakes have doubled, overwhelmingly targeting women. https://nakedsecurity. sophos.com/2019/10/09/deepfakes-have-doubled-overwhelmingly-targeting-women/. Venkatesh, S., Ramachandra, R., Raja, K., Spreeuwers, L., Veldhuis, R., Busch, C., 2020. Detecting morphed face attacks using residual noise from deep multi-scale context aggregation network. In: WACV, pp. 280–289. Venkatesh, S., Ramachandra, R., Raja, K., Busch, C., 2021. Face morphing attack generation & detection: a comprehensive survey. IEEE TTS 2 (3), 128–145. https://doi.org/10.1109/ TTS.2021.3066254. Wang, M., Deng, W., 2021. Deep face recognition: a survey. Neurocomputing 429, 215–244. Wang, J., Wu, Z., Chen, J., Jiang, Y.-G., 2021. M2tr: Multi-modal multi-scale transformers for deepfake detection. arXiv preprint:2104.09770. Wolberg, G., 1996. Recent advances in image morphing. In: IEEE CG International, pp. 64–71. Wolberg, G., 1998. Image morphing: a survey. Vis. Comput. 14 (8), 360–372. Yang, X., Li, Y., Lyu, S., 2019. Exposing deep fakes using inconsistent head poses. In: ICASSP, pp. 8261–8265. Zhao, H., Zhou, W., Chen, D., Wei, T., Zhang, W., Yu, N., 2021. Multi-attentional deepfake detection. In: CVPR, pp. 2185–2194. Zi, B., Chang, M., Chen, J., Ma, X., Jiang, Y.-G., 2020. Wilddeepfake: a challenging real-world dataset for deepfake detection. In: ACM MM, pp. 2382–2390.

Index

Note: Page numbers followed by “f” indicate figures, “t” indicate tables, and “b” indicate boxes.

A ABC model, 10 ACT. See Adversarial concurrent training (ACT) Adience database, 175 Adversarial attacks adversarial examples, 30, 31f adversarial patch, 30, 32 black-box attack, 30 Carlini–Wagner (CW) attack, 30, 32 DeepFool (DF), 30, 32 Elastic, 33 fast gradient sign method (FGSM), 30, 32 Fog, 33 Gabor, 33 gray-box attack, 30 JPEG, 33 knowledge distillation (KD) based defenses adversarial concurrent training (ACT), 41, 43t adversarially robust distillation (ARD), 41, 43t defensive distillation (DD), 41, 43t mutual adversarial training (MAT), 43–44, 43t, 43f offline KD, 41 online KD, 41 object detector, defenses for, 44–46 on-manifold robustness Defense-GAN, 33–34 dual manifold adversarial training (DMAT) (see Dual manifold adversarial training (DMAT)) projected gradient descent (PGD), 32 reverse engineering of deceptions via residual learning (REDRL), 46 adversarial perturbation estimation, 47–52, 51f evaluation, 52, 53t ResNet18 network, confusion matrix of, 46, 50f Snow, 33 white-box attack, 30 Adversarial concurrent training (ACT), 41, 43t

Adversarial learning-based algorithms, 192t attribute prediction, 195–196 face detection and recognition, 191–193 Adversarially robust distillation (ARD), 41, 43t Adversarial patch, 30, 32 AgeDb database, 177 All-Age Faces (AAF) database, 177 ARD. See Adversarially robust distillation (ARD) Area under the curve (AUC), 135–140, 139t Artificial intelligence (AI), 29–30 Automatic regressive integrated moving average (ARIMA) model COVID-19 prediction, 109 wind speed prediction, 110–111 Auto-regressive (AR), 110–111

B Bag of words (BOW) representation, 12 BFW database, 178 Bias-aware deep learning approaches, 192t attribute prediction, 197–198 face detection and recognition, 194–195 Biased facial analysis against certain demographic subgroups, 170–173, 170f in databases, distribution across gender and ethnicity, 170–173, 172t, 175–179, 184f in face recognition and attribute prediction, 190, 191t mitigation techniques, 173, 191, 192f attribute prediction, 195–198 face detection and recognition, 191–195, 192t in popular COTS and deep learning methods, 170–173, 171t racial bias, 170–173 sources of, 183–184, 184f Bidirectional encoder representations from transformers (BERT), 13, 75 Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT), 10–11, 13

Bidirectional language model, 12 Bidirectional LSTM, 13–17, 19 Binarized statistical image features (BSIF), 233–234 BioCreative/OHNLP STS dataset, 19, 22t BioSentVec, 10–11, 15 BIOSSES, 19, 22t BioWordVec, 10–14, 14t, 21–22 Black-box attack, 30 Black-box deep learning models, 169–170, 173 attribute prediction, 196 face detection and recognition, 193, 193f BraTS 2018 dataset, 125 BUPT-Balancedface database, 177 BUPT-Globalface database, 177

C The Cancer Genome Atlas (TCGA), 108, 113–114 Carlini–Wagner (CW) attack, 30, 32 CBOW. See Continuous bag-of-words (CBOW) CelebA database, 177, 189–190 Celeb-DF dataset, 236, 236t Class activation maps (CAMs), 185–186 Commercial-off-the-shelf (COTS) face recognition systems, 170–173, 171t, 184–185, 202 Community-based federated machine learning (CBFL) algorithm, 124 Composite embedding, 13–14, 14t, 24–25 Confirmation bias, 88 Context embedding cosine similarity score, 13–14, 14t qualitative analysis, 23–24, 24t Context2Vec, 14–15, 19–20 Context vector-based methods, 13 Continuous bag-of-words (CBOW), 12–13, 15, 16f, 19 Contrastive loss, 62 Convolution, 112 Convolutional neural networks (CNN/ConvNet), 60–62, 112, 115–116, 120–121, 135, 136t, 138–140 Co-occurrence methods, 10 Copy number variation (CNV) data, 113–115 Coronavirus disease 2019 (COVID-19), DL techniques for automatic regressive integrated moving average (ARIMA) model, 109 data, 115 deep sequential prediction model (DSPM), 109

federated learning (FL), 124–125 long short-term memory (LSTM) models, 109 multivariate LSTM model, 133, 157–159, 158f, 161f univariate LSTM model, 157–159, 158f, 161f non-parametric regression model (NRM), 109 Prophet models, 109 seasonal automatic regressive integrated moving average (SARIMA) model, 109 weather integrated deep learning techniques, 155–159 Correlation coefficient (CC) rice production, 152, 153t short term wind speed prediction, 145, 145t Cosine similarity, 13–14, 14t, 17–18, 20, 64–65, 68, 76–77 Counterfactual analysis, 189–190 Crop production, DL techniques for, 111 crop prediction model, LSTM for, 148–152 design and performance evolution, 151–152 optimal input data set selection, automated model for, 148–151 gridded meteorological data, 115 rice production data, 115 Cross-Age Celebrity Dataset (CACD), 177 Cross entropy loss, 74

D DDI. See Drug–drug interaction (DDI) DECISION ablation study, 98–101 baseline methods, 97 datasets, 96 vs. DECISION-mlp, 101–102t distillation into single model, 101, 101t hyper-parameters, 97 learned weights analysis, 100 loss-wise ablation, 98–99, 99t network architecture, 97 object recognition, 98 office dataset, 98, 98t office-home dataset, 98, 99t optimization, 89 overall framework, 86–89 overview of, 90b proposed framework, 92, 92f source model training, 97 theoretical motivation, 90–92 weighted information maximization, 87–88 weighted pseudo-labeling, 88–89

DECISION-mlp ablation, 101–102, 102t expected loss, 95 overview of, 94b results and analyses, 101–102 source distribution dependent weights, 93–94 DeepCluster technique, 88 Deepfake detection challenge dataset (DFDC), 230 Deepfake (DF) technique challenges, in attack detection algorithms coverage of multiple demographics, 237 generalization, 234–236, 235f novel and adaptive attacks, 237 simultaneous handling of different threats, 237 datasets, for attack detection research, 230, 231t, 232–233 deep neural network architectures, 227–228, 229t defense algorithms, 224, 234 in digital domain, 225–226 face recognition algorithms, vulnerability of, 232–233, 233f females, prevalence in, 224 impact, 224 manipulated image generation, 224, 225f number of publications, progression in, 228–230, 229f progression, of deepfake images, 225–226, 227f tools, 225–226 DeepFool (DF), 30, 32 Deep learning (DL), 29–30 adversarial perturbations (see Adversarial attacks) vs. classical methods, 29–30 computer vision problems, 29–30 deep neural networks-based face manipulation techniques, 227–228, 229t facial analysis, bias and fairness in (see Facial analysis systems, deep learning) metric learning, for computer vision applications (see Deep metric learning (DML), for computer vision) scientific and industrial research, applications in (see Scientific and industrial research, DL techniques for) Deep metric learning (DML), for computer vision, 60 convolution neural networks (CNNs), 60–62 loss formulations, types of, 60–61, 60f

mathematical notations and assumptions, 61–62 pair-based formulation, 60–61, 60f contrastive loss, 62 multisimilarity loss, 66–68, 67f N-pair loss, 65–66, 65f triplet loss, 62–64, 63f proxy-based methods, 60–61, 60f learnable shared embeddings, 69 proxy anchor loss, 71–72, 71f ProxyGML loss, 72–74, 73f ProxyNCA and ProxyNCA++, 69–70, 70f regularization methods, 60–61, 60f direction regularization, 76–77, 77f language guidance, 75–76 sample-to-sample similarity-based objective, 68 softmax-based objective functions, 60 Deep neural networks, 82, 111 face manipulation algorithms, 227–228, 229t Deep sequential prediction model (DSPM), 109 Defense-GAN, 33–34 Defensive distillation (DD), 41, 43t Degree of bias (DoB) metric, 183 DemogPairs database, 178 Demographic parity. See Statistical parity DetectorGuard (DG), 45 Digital image data, 111, 117–118, 118–119f Direction regularization, DML, 76–77, 77f Discrimination-aware learning method, 193f Disparate impact (DI), 181 Distributed machine learning, 123 DiveFace database, 177 DMAT. See Dual manifold adversarial training (DMAT) DNA methylation (DM) data, 113–114 Downstream application task, 20–22 Drug–drug interaction (DDI) dataset, 17–19, 21–22 extraction evaluation results, 21–22, 22t prediction, GNN for, 11–12 Drug rediscovery test, 22–23, 23t Dual manifold adversarial training (DMAT) improves generalization and robustness, 39–40 improves robustness to unseen attacks, 40 off-manifold AT, 35 OM-ImageNet dataset (see On-Manifold ImageNet (OM-ImageNet) dataset) on-manifold AT, 35–38, 35f, 38f proposed method, 39 TRADES for, 40–41

E EDLM. See Exact deep learning machine (EDLM) Electronic health record (EHR), 124 ELM. See Extreme learning machines (ELM) Equalized odds, 181 Error analysis, 24 Euclidean distance, 62–63 Exact deep learning machine (EDLM), 2 advantage of, 6 probability one, object detection with, 2–5 for real-world situations, 6 Extreme learning machines (ELM), 145–148, 146–147f

F FaceApp, 226–227 Face detection algorithms, 174, 174f Face manipulation, for identity threat. See Identity manipulation techniques Face morphing technique challenges, in attack detection algorithms coverage of multiple demographics, 237 large-scale morph datasets, 237 novel and adaptive attacks, 237 simultaneous handling of different threats, 237 datasets, for attack detection research, 232–233, 232t deep neural network-based algorithms, 227–228 defense algorithms, 224, 233–234 in digital domain, 225–226 face recognition algorithms, 225–226 GIMP Animation Package v2.6 (GAP), 225–226 GNU Image Manipulation Program v2.8 (GIMP), 225–226 impact, 224 manipulated image generation, 224, 225f number of publications, progression in, 228–230, 229f progression, of morph images, 225–226, 227f social media applications and websites, 226–227, 228t tools, 225–226, 226t Face swap (FS) challenges, in attack detection algorithms coverage of multiple demographics, 237 generalization, 234–236, 235f large-scale swap datasets, 237 novel and adaptive attacks, 237

datasets, for attack detection research, 232–233, 232t deep neural network-based algorithms, 227–228 defense algorithms, 224, 233–234 in digital domain, 225–226 face recognition algorithms, vulnerability of, 232–233, 233f impact, 224 number of publications, progression in, 228–230, 229f social media applications, 226–227 Facial analysis systems, deep learning, 209f authentication-based applications, 169–170 bias against certain demographic subgroups, 170–173, 170f in databases, distribution across gender and ethnicity, 170–173, 172t, 175–179, 184f in face recognition and attribute prediction, 190, 191t mitigation techniques, 173, 191–198, 192f, 192t in popular COTS and deep learning methods, 170–173, 171t racial bias, 170–173 sources of, 183–184, 184f challenges, 206–209 classification parity-based and score-based metrics disparate impact (DI), 181 equalized odds and equality of opportunity, 181 predictive parity, 181 statistical parity, 180–181 commercial systems and patents, topography of, 202–206, 203f, 203–204t daily-life applications, 169–170 dataset distribution and annotations, role of, 210 decision-making tasks, role in, 169–170 facial analysis specific metrics, 182 degree of bias (DoB), 183 fairness discrepancy rate (FDR), 182–183 inequity rate (IR), 183 precise subgroup equivalence (PSE), 183 fairness estimation and analysis (see Fairness, in facial analysis) false negative rate (FNR), 180 false positive rate (FPR), 180 feature engineering approaches, 210 independent and identically distributed (IID) data, 209

meta-analysis, of algorithms, 180 negative predictive parity, 180 out-of-distribution (OOD) generalization, 209 positive predictive parity, 180 score-based metrics calibration, 182 positive/negative class, balance for, 182 tasks in, 169–170, 173, 174f attribute prediction, 175 face detection, 174 face verification and identification, 174–175 true negative rate (TNR), 180 true positive rate (TPR), 180 FairFace database, 177, 179f, 188–189 Fair mixup, 197–198 Fairness-aware adversarial perturbation (FAAP) approach, 195–196 Fairness-aware disentangling variational auto-encoder, 197–198 Fairness discrepancy rate (FDR), 182–183 Fairness, in facial analysis, 173 attribute prediction adversarial learning-based algorithms, 195–196 bias-aware deep learning approaches, 197–198 counterfactual analysis, 189–190 disparate impact, 188–189 disparity, 187 generative algorithms, 196–197 latent factors, role of, 190 pre-trained and black box approaches, 196 challenges, 206–209 databases, 178, 178–179t face detection and recognition adversarial learning-based algorithms, 191–193 bias-aware deep learning approaches, 194–195 dataset distribution during model training, 186 demographic information, incorporation of, 185–186 discovery, 184 disparate impact, 184–185 generative algorithms, 193, 193f latent representations during model training, 187 pre-trained and black box approaches, 193, 193f machine learning (ML), 204–206 real-world requirements, 210

Fair supervised contrastive loss (FSCL), 197–198 False match rate (FMR), 182–183 False negative rate (FNR), 180 False non-match rate (FNMR), 182–183 False positive rate (FPR), 174, 180–181 Fast gradient sign method (FGSM), 30, 32, 34 fastText, 12–15 FDR. See Fairness discrepancy rate (FDR) Feature visualizations, 185–186, 186f, 190 Federated learning (FL), 110, 123–126, 127–128t Federated patient hashing framework, 125 Federated transfer learning framework, 125–126 FedHealth, 125–126 Feed-forward neural networks, 110–111 FF++, 230, 236, 236t FGSM. See Fast gradient sign method (FGSM) Filter-drop approach, 197–198, 198f, 201–202t Filters, 226–227 FNR. See False negative rate (FNR) Fog, 33 FPR. See False positive rate (FPR) FS. See Face swap (FS)

G Gabor, 33 Gene expressions (GE) data, 113–114 Generative algorithms, 192t attribute prediction, 196–197 face detection and recognition, 193, 193f GIMP Animation Package v2.6 (GAP), 225–226 Glioma tumors, deep learning applications in diagnosis area under the curve (AUC) values, 135–138, 139t image data sets, 112, 115–117, 135–140 molecular data sets, 108–109, 135–138 numerical data sets, 108–109, 113–114, 138 sequential and convolutional neural networks, 135, 137f gene mutations, 141 prognosis, of glioma patients, 140–141, 142t radiogenomics study, 141, 143f random forest (RF), 135–138 support vector machine (SVM), 135–138 tumor tissues, molecular subtype classification of, 140

GNU Image Manipulation Program v2.8 (GIMP), 225–226 Gradient masking, 30–32 Graph neural network (GNN), 11–12 Gray-box attack, 30 Gridded meteorological data, 115

H Hierarchical clustering technique, 190 Histogram of oriented gradients (HOG), 233–234 Hypothesis transfer learning (HTL), 84

I IDAgender, 237 Identity manipulation techniques deepfake (DF) technique challenges, in attack detection algorithms, 234–237 datasets, for attack detection research, 230, 231t, 232–233 deep neural network architectures, 227–228, 229t defense algorithms, 224, 234 in digital domain, 225–226 face recognition algorithms, vulnerability of, 232–233, 233f females, prevalence in, 224 impact, 224 manipulated image generation, 224, 225f number of publications, progression in, 228–230, 229f progression, of deepfake images, 225–226, 227f tools, 225–226 face morphing technique challenges, in attack detection algorithms, 234–237 datasets, for attack detection research, 232–233, 232t deep neural network-based algorithms, 227–228 defense algorithms, 224, 233–234 in digital domain, 225–226 face recognition algorithms, 225–226 GIMP Animation Package v2.6 (GAP), 225–226 GNU Image Manipulation Program v2.8 (GIMP), 225–226 impact, 224 manipulated image generation, 224, 225f

number of publications, progression in, 228–230, 229f progression, of morph images, 225–226, 227f social media applications and websites, 226–227, 228t tools, 225–226, 226t face swap (FS) challenges, in attack detection algorithms, 234–237 datasets, for attack detection research, 232–233, 232t deep neural network-based algorithms, 227–228 defense algorithms, 224, 233–234 in digital domain, 225–226 face recognition algorithms, vulnerability of, 232–233, 233f impact, 224 number of publications, progression in, 228–230, 229f social media applications, 226–227 use case of, 224 Identity Morphing dataset, 232–233 IJB-C database, 177 Image data, 118–119 digital image, 111, 117–118, 118–119f magnetic resonance images (MRI), 112, 115–117, 135–140 whole-slide histopathological images (WSI), 112, 115–117, 135–140 Image feature-based algorithm, 233–234 ImageNet, 121, 152–154 Independent and identically distributed (IID) data, 209 Indian Movie Face Database (IMFDB), 175 Inductive transfer learning, 121–122, 122f Industrial internet of things (IIoT), 110 Inequity rate (IR) metrics, 183 Instagram, 210, 226–227 Internet of things (IoT), 110

J Jacobian-based saliency map attack (JSMA), 30 JPEG, 33

K Knowledge distillation (KD) based defenses adversarial concurrent training (ACT), 41, 43t adversarially robust distillation (ARD), 41, 43t

defensive distillation (DD), 41, 43t mutual adversarial training (MAT), 43–44, 43t, 43f offline KD, 41 online KD, 41 Knowledge GNN (KGNN), 11–12 Knowledge graph embeddings, 11–12

L Labeled image data sets, 111–112 Lagrange multipliers, 205–206 Language regularization, 75–76 LAOFW database, 178 Large language models (LLMs), 75 LBD. See Literature-based discovery (LBD) LBP. See Local binary pattern (LBP) Learned perceptual image patch similarity (LPIPS), 36 LFWA database, 177, 199, 201–202t Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS), 30 Literature-based discovery (LBD), 9–10 Local binary pattern (LBP), 233–234 Local context embedding, 15 Long short-term memory (LSTM) architecture of, 126, 129f bidirectional LSTM, 13–17, 19 cell and hidden state, 126 COVID-19 prediction, 109 multivariate LSTM model for, 133, 157–159, 158f, 161f univariate LSTM model for, 157–159, 158f, 161f crop prediction model, 148–152 design and performance evolution, 151–152 optimal input data set selection, automated model for, 148–151 crop production, estimation of, 148–152 time division LSTM model, 110–111, 126–133, 141–148 LSTM. See Long short-term memory (LSTM)

M Machine learning (ML), 2, 110, 124 fairness in, 204–206 MAE. See Mean absolute error (MAE) Magnetic resonance images (MRI), 112, 115–117, 135–140 Maximum temperature (Tmax), 156, 157f

Mean absolute error (MAE) rice production, 148–152, 150–151f, 153t short term wind speed prediction, 145, 145t Mean temperature (Tmean), 156, 157f Medical Subject Headings (MeSH), 10–12 MEDLINE, 10–11 MeSH2Vec, 10–11 Meteorological data crop production data, 115 gridded meteorological data, 115 station-level meteorological data, 115, 116f Metric learning, 59–60 Minimum temperature (Tmin), 156, 157f Mobile-net, 124 Momentum iterative attack (MIA), 39–40 MORPH datasets, 175, 179f, 184–185, 199, 201–202t Morphing. See Face morphing MRI. See Magnetic resonance images (MRI) MS-Celeb-1M database, 177 MS1M-wo-RFW database, 177 MULTIACCURACY BOOST algorithm, 196 Multilayer perceptron (MLP), 15–17 Multiscale representation, for biomedical analysis. See Representation learning, for biomedical analysis Multisimilarity (MS) loss, 66–68, 67f, 77 Multisource domain adaptation (MSDA), 84–85 Multi steepest descent (MSD), 44 Multi-task convolution neural network, 197 Multivariate LSTM model, for COVID-19 prediction, 129–130, 133, 157–159, 158f, 161f Multi-variate time-series predictive model, 130 Mutual adversarial training (MAT), 30–32, 43–44, 43t, 43f

N Named entity recognition (NER), 10–12 Negative predictive parity, 180 Negative relative similarity, 66–67, 67f Neighborhood component analysis (NCA), 69–70 Non-parametric regression model (NRM), 109 N-pair loss, 65–66, 65f

O Object detection defenses for, 44–46 with probability one, 2–5

Oncomechanics, 135 On-Manifold ImageNet (OM-ImageNet) dataset, 35 learned perceptual image patch similarity (LPIPS), 36 PGD-50 and OM-PGD-50 attacks, classification accuracy for, 39–41, 39t, 42t sample on-manifold images, 36, 37f StyleGAN, 35–36 unseen attacks, 40–41, 40t, 42t Ontology-based methods, 13 Out-of-distribution (OOD) generalization, 209

P Pair-based DML loss, 60–61, 60f contrastive loss, 62 multisimilarity loss, 66–68, 67f N-pair loss, 65–66, 65f triplet loss, 62–64, 63f Pairwise similarity, 13, 17 Partial attribute decorrelation (PARADE), 195–196 PGD-5 threat model, 36–38 Pilot Parliaments Benchmark (PPB) database, 177 Positive predictive parity, 180 Positive relative similarity, 67–68, 67f Precise subgroup equivalence (PSE), 183 Predictive parity, 181 Pre-trained and black box approaches, 192t attribute prediction, 196 face detection and recognition, 193, 193f Pre-trained deep face models, 185–186 Principal component analysis (PCA), 14–15, 17 Principle of Maximum Entropy, 91–92 Probability one, 2–5 Projected gradient descent (PGD), 30, 32, 34 Prophet models, 109 Proxy anchor loss, 71–72, 71f Proxy-based deep graph metric learning (ProxyGML) loss, 72–74, 73f Proxy-based DML loss, 60–61, 60f learnable shared embeddings, 69 proxy anchor loss, 71–72, 71f ProxyGML loss, 72–74, 73f ProxyNCA and ProxyNCA++, 69–70, 70f ProxyNCA++, 70 ProxyNCA loss, 69–70, 70f, 77–78 PSE. See Precise subgroup equivalence (PSE) PubMed, 9–12, 18–19 PubMed Central Open Access (PMC OA), 12

R Racial bias, in facial analysis, 170–173 Racial Faces in the Wild (RFW) database, 177, 179f, 199, 199f, 200t Radio-genomics deep learning model, 141, 143f Random forest (RF), 108–109, 113, 135–138 Rapid antigen tests (RATs), 115 Rectified Linear Unit (ReLU), 15–17 Recurrent neural network (RNN), 110–111, 133 Reinforcement learning algorithm, 205–206 Relative error (RE) analysis, for COVID-19 prediction, 159, 160t Representation learning, for biomedical analysis ABC model, 10 bag of words (BOW) representation, 12 BioBERT, 10–11, 13 context vector-based methods, 13 co-occurrence methods, 10 cosine similarity score, 13–14, 14t experimental results error analysis, 24 intrinsic and extrinsic evaluation dataset, 18–19 qualitative analysis, 23–24 quantitative evaluation, 20–23 wide context embedding, 19–20 knowledge graph embeddings, 11–12 literature-based discovery (LBD), 9–10 ontology-based methods, 13 semantic sentences, 13 skip-gram and CBOW, 13 theoretical framework intrinsic and extrinsic evaluation, 17–18 local context embedding, 15 multiscale embedding, 17 qualitative evaluation, 17–18 wide context embedding, 15–16 word similarity task, postprocessing and inference for, 17 word embedding techniques (see Word embeddings, in biomedical domain) Representation neutralization for fairness (RNF) algorithm, 196 Resnet, 124 ResNet18 network, 30, 31f, 46, 50f Reverse engineering of deceptions via residual learning (REDRL), 46 adversarial perturbation estimation, 47–49, 51f end-to-end training, 52

feature reconstruction, 51 image classification, 51–52 image reconstruction, 51 residual recognition, 52 evaluation, 52, 53t ResNet18 network, confusion matrix of, 46, 50f Reverse-transcriptase polymerase-chain-reaction (RT-PCR) test, 115 RMSE. See Root mean square error (RMSE) RoBERTa, 75 Robust optimization, 30–32 Root mean square error (RMSE) rice production, 148–152, 150–151f, 153t short term wind speed prediction, 145, 145t

S Scale-invariant features (SIFT), 233–234 Scientific and industrial research, DL techniques for, 108 convolutional neural networks (CNN/ConvNet), 112, 115–116, 120–121, 135, 136t, 138–140 COVID-19 prediction (see Coronavirus disease 2019 (COVID-19), DL techniques for) crop production, predictive model for (see Crop production, DL techniques for) data types COVID-19 data, 115 image data, 111–112, 115–119 meteorological data, 115 numerical data, 108–109, 113–114 federated learning (FL), 110, 123–126, 127–128t glioma patients, applications in (see Glioma tumors, deep learning applications in) long short-term memory (LSTM) (see Long short-term memory (LSTM)) sequential neural networks (SNN), 108, 133–135, 134t, 137f, 138 short term wind speed prediction (see Wind speed prediction) tea leaves classification, 111–112, 152–155 transfer learning (TL), 111–112, 119–123, 121f Seasonal automatic regressive integrated moving average (SARIMA) model, 109 Segment and complete (SAC) approach, 45–46, 46f, 47t, 48–49f Self-similarity, 66–67, 67f

Self-taught learning, 121–122 Semantic sentences, 13 Semantic word embeddings, 13 SemaTyP, 22–23 SemEval semantic textual similarity (SemEval STS), 19–21 SemMedDB, 13 Sensitive loss, for bias mitigation, 193f Sentence pair similarity, 10–11, 20–21, 22t Sequential neural networks (SNN), 108, 133–135, 134t, 137f, 138 SGNS. See Skip-gram with negative sampling (SGNS) Similarity task, 20 Single variant model, LSTM, 129–130 Skip-gram with negative sampling (SGNS), 15, 16f Sliding window approach, 10–12, 15 Snapchat, 210, 226–227, 232–233 SNN. See Sequential neural networks (SNN) Snow, 33 SNP 6.0 (SNP6) array data, 114–115 Sparse vector technique (SVT), 125 Spearman’s coefficient, 17, 20 Specific humidity (SH), 156, 157f, 159 Speed-up robust features (SURF), 233–234 Stacked denoising auto encoder, 110–111 Station-level meteorological data, 115, 116f Statistical parity, 180–181 StyleGAN, 35–36 Supervised learning algorithms, 205–206 Supervised machine learning algorithms, 108 Support vector machine (SVM) glioma diagnosis, 108–109, 113, 135–138 short term wind speed prediction, 145–148, 146–147f Susceptible-exposed-infected-recovered-deceased (SEIRD), 157 SVM. See Support vector machine (SVM)

T Tea leaves classification, DL applications for, 111–112, 152–155 Therapeutic target database (TTD), 22–23 Time division ensemble technique, 130–131 Time division LSTM model, 110–111, 126–133, 141–148 Time series analysis, of COVID-19 cases, 155–156, 156f TIMIT, 230 TRADES, for DMAT, 40–41 Transductive learning, 121–122

Transfer learning (TL), 111, 119–120, 205–206 bidirectional language model for, 12 vs. conventional deep learning, 120, 121f convolutional neural networks (CNN), 120–121 inductive transfer learning, 121–122, 122f tea leaf classification, 111–112 transductive learning, 121–122 unsupervised transfer learning, 121–123 Triplet loss, 62–64, 63f True negative rate (TNR), 180 True positive rate (TPR), 174, 180–181 Twitter face-cropping algorithm, 210

U UADFV, 230 UMNSRS-Rel datasets, 17–18, 20, 21t UMNSRS Sim datasets, 15, 16t, 17–18 Univariate LSTM model, for COVID-19 prediction, 157–159, 158f, 161f Unsupervised domain adaptation (UDA) adaptation approach, 82 contributions, 83–84 hypothesis transfer learning (HTL), 84 image classification, 84 information maximization (IM) loss, 83 multisource domain adaptation (MSDA), 84–85 object detection, 84 practical motivation, 86 problem setting, 85 semantic segmentation, 84 source-free multisource, 85 Unsupervised transfer learning, 121–123 User interface (UI), 205–206 UTKFace database, 177, 179f, 188–189, 199, 201–202t

V Vanilla Word2Vec model, 20 Variant input sequence model network, 127–129, 129f Visual geometry group 16 (VGG16), 111–112, 121, 152–154, 154f

W Weather integrated deep learning techniques, 155–159 White-box attack, 30 White-box machine learning models, 206 Whole-slide histopathological images (WSI), 112, 115–117, 135–140 Wide context embedding, 15–16, 19–20 Wilddeepfake, 230 Wind speed prediction data-driven models for, 110–111 dynamical models for, 110–111 feed-forward neural networks, 110–111 recurrent neural networks, 110–111 stacked denoising auto encoder, 110–111 station-level meteorological data, 115, 116f statistical models for, 110–111 time division LSTM, for short-term prediction, 110–111, 141–148 Word embeddings, in biomedical domain, 24t BioSentVec, 10–11, 15 BioWordVec, 10–15, 14t challenges, 10 continuous bag-of-words (CBOW), 12–13, 15, 19 dense word/sentence embeddings, 10–12 downstream application task, 20–22 extrinsic evaluation, 19 fastText, 12–15 Medical Subject Headings (MeSH), 10–12 MeSH2Vec, 10–11 nearest neighbors for, 23, 24t neural network, 12 PubMed, 10–12 semantic word embeddings, 13 similarity/relatedness tasks, 10–11 wide context embedding (context2vec), 19–20 Word2Vec, 12, 14–15 Word2Vec model, 12, 14–15, 16f architecture, 15, 16f vanilla Word2Vec model, 20 WSI. See Whole-slide histopathological images (WSI)

Z Zoom, 210