Introduction to Transfer Learning: Algorithms and Practice (ISBN 9789811975837, 9789811975844)



English · 333 pages · 2023


Table of contents:
Preface
Acknowledgments
Contents
Acronyms
Symbols
Part I Foundations
1 Introduction
1.1 Transfer Learning
1.2 Related Research Fields
1.3 Why Transfer Learning?
1.3.1 Big Data vs. Less Annotation
1.3.2 Big Data vs. Poor Computation
1.3.3 Limited Data vs. Generalization Requirements
1.3.4 Pervasive Model vs. Personal Need
1.3.5 For Specific Applications
1.4 Taxonomy of Transfer Learning
1.4.1 Taxonomy by Feature Space
1.4.2 Taxonomy by Target Domain Labels
1.4.3 Taxonomy by Learning Methodology
1.4.4 Taxonomy by Online or Offline Learning
1.5 Transfer Learning in Academia and Industry
1.6 Overview of Transfer Learning Applications
1.6.1 Computer Vision
1.6.2 Natural Language Processing
1.6.3 Speech
1.6.4 Ubiquitous Computing and Human–Computer Interaction
1.6.5 Healthcare
1.6.6 Other Applications
References
2 From Machine Learning to Transfer Learning
2.1 Machine Learning Basics
2.1.1 Machine Learning
2.1.2 Structural Risk Minimization
2.1.3 Probability Distribution
2.2 Definition of Transfer Learning
2.2.1 Domains
2.2.2 Formal Definition
2.3 Fundamental Problems in Transfer Learning
2.3.1 When to Transfer
2.3.2 Where to Transfer
2.3.3 How to Transfer
2.4 Negative Transfer Learning
2.5 A Complete Transfer Learning Process
References
3 Overview of Transfer Learning Algorithms
3.1 Measuring Distribution Divergence
3.2 Unified Representation for Distribution Divergence
3.2.1 Estimation of Balance Factor μ
3.3 A Unified Framework for Transfer Learning
3.3.1 Instance Weighting Methods
3.3.2 Feature Transformation Methods
3.3.3 Model Pre-training
3.3.4 Summary
3.4 Practice
3.4.1 Data Preparation
3.4.2 Baseline Model: K-Nearest Neighbors
References
4 Instance Weighting Methods
4.1 Problem Definition
4.2 Instance Selection Methods
4.2.1 Non-reinforcement Learning-Based Methods
4.2.2 Reinforcement Learning-Based Methods
4.3 Weight Adaptation Methods
4.4 Practice
4.5 Summary
References
5 Statistical Feature Transformation Methods
5.1 Problem Definition
5.2 Maximum Mean Discrepancy-Based Methods
5.2.1 The Basics of MMD
5.2.2 MMD-Based Transfer Learning
5.2.3 Computation and Optimization
5.2.4 Extensions of MMD-Based Transfer Learning
5.3 Metric Learning-Based Methods
5.3.1 Metric Learning
5.3.2 Metric Learning for Transfer Learning
5.4 Practice
5.5 Summary
References
6 Geometrical Feature Transformation Methods
6.1 Subspace Learning Methods
6.1.1 Subspace Alignment
6.1.2 Correlation Alignment
6.2 Manifold Learning Methods
6.2.1 Manifold Learning
6.2.2 Manifold Learning for Transfer Learning
6.3 Optimal Transport Methods
6.3.1 Optimal Transport
6.3.2 Optimal Transport for Transfer Learning
6.4 Practice
6.5 Summary
References
7 Theory, Evaluation, and Model Selection
7.1 Transfer Learning Theory
7.1.1 Theory Based on H-Divergence
7.1.2 Theory Based on HΔH-Distance
7.1.3 Theory Based on Discrepancy Distance
7.1.4 Theory Based on Labeling Function Difference
7.2 Metric and Evaluation
7.3 Model Selection
7.3.1 Importance Weighted Cross Validation
7.3.2 Transfer Cross Validation
7.4 Summary
References
Part II Modern Transfer Learning
8 Pre-Training and Fine-Tuning
8.1 How Transferable Are Deep Networks
8.2 Pre-Training and Fine-Tuning
8.2.1 Benefits of Pre-Training and Fine-Tuning
8.3 Regularization for Fine-Tuning
8.4 Pre-Trained Models for Feature Extraction
8.5 Learn to Pre-Training and Fine-Tuning
8.6 Practice
8.7 Summary
References
9 Deep Transfer Learning
9.1 Overview
9.2 Network Architectures for Deep Transfer Learning
9.2.1 Single-Stream Architecture
9.2.2 Two-Stream Architecture
9.3 Distribution Adaptation in Deep Transfer Learning
9.4 Structure Adaptation for Deep Transfer Learning
9.4.1 Batch Normalization
9.4.2 Multi-view Structure
9.4.3 Disentanglement
9.5 Knowledge Distillation
9.6 Practice
9.6.1 Network Structure
9.6.2 Loss
9.6.3 Train and Test
9.7 Summary
References
10 Adversarial Transfer Learning
10.1 Generative Adversarial Networks
10.2 Distribution Adaptation for Adversarial Transfer Learning
10.3 Maximum Classifier Discrepancy for Adversarial Transfer Learning
10.4 Data Generation for Adversarial Transfer Learning
10.5 Practice
10.5.1 Domain Discriminator
10.5.2 Measuring Distribution Divergence
10.5.3 Gradient Reversal Layer
10.6 Summary
References
11 Generalization in Transfer Learning
11.1 Domain Generalization
11.2 Data Manipulation
11.2.1 Data Augmentation and Generation
11.2.2 Mixup-Based Domain Generalization
11.3 Domain-Invariant Representation Learning
11.3.1 Domain-Invariant Component Analysis
11.3.2 Deep Domain Generalization
11.3.3 Disentanglement
11.4 Other Learning Paradigms for Domain Generalization
11.4.1 Ensemble Learning
11.4.2 Meta-Learning for Domain Generalization
11.4.3 Other Learning Paradigms
11.5 Domain Generalization Theory
11.5.1 Average Risk Estimation Error Bound
11.5.2 Generalization Risk Bound
11.6 Practice
11.6.1 Dataloader in Domain Generalization
11.6.2 Training and Testing
11.6.3 Examples: ERM and CORAL
11.7 Summary
References
12 Safe and Robust Transfer Learning
12.1 Safe Transfer Learning
12.1.1 Can Transfer Learning Models Be Attacked?
12.1.2 Reducing Defect Inheritance
12.1.3 ReMoS: Relevant Model Slicing
12.2 Federated Transfer Learning
12.2.1 Federated Learning
12.2.2 Personalized Federated Learning for Non-I.I.D. Data
12.2.2.1 Model Adaptation for Personalized Federated Learning
12.2.2.2 Similarity-Guided Personalized Federated Learning
12.3 Data-Free Transfer Learning
12.3.1 Information Maximization Methods
12.3.2 Feature Matching Methods
12.4 Causal Transfer Learning
12.4.1 What is Causal Relation?
12.4.2 Causal Relation for Transfer Learning
12.5 Summary
References
13 Transfer Learning in Complex Environments
13.1 Imbalanced Transfer Learning
13.2 Multi-Source Transfer Learning
13.3 Open Set Transfer Learning
13.4 Time Series Transfer Learning
13.4.1 AdaRNN for Time Series Forecasting
13.4.2 DIVERSIFY for Time Series Classification
13.5 Online Transfer Learning
13.6 Summary
References
14 Low-Resource Learning
14.1 Compressing Transfer Learning Models
14.2 Semi-supervised Learning
14.2.1 Consistency Regularization Methods
14.2.2 Pseudo Labeling and Thresholding Methods
14.3 Meta-learning
14.3.1 Model-Based Meta-learning
14.3.2 Metric-Based Meta-learning
14.3.3 Optimization-Based Meta-learning
14.4 Self-supervised Learning
14.4.1 Constructing Pretext Tasks
14.4.2 Contrastive Self-supervised Learning
14.5 Summary
References
Part III Applications of Transfer Learning
15 Transfer Learning for Computer Vision
15.1 Object Detection
15.1.1 Task and Dataset
15.1.2 Load Data
15.1.3 Model
15.1.4 Train and Test
15.2 Neural Style Transfer
15.2.1 Load Data
15.2.2 Model
15.2.3 Train
References
16 Transfer Learning for Natural Language Processing
16.1 Emotion Classification
16.2 Model
16.3 Train and Test
16.4 Pre-training and Fine-tuning
References
17 Transfer Learning for Speech Recognition
17.1 Cross-Domain Speech Recognition
17.1.1 MMD and CORAL for ASR
17.1.2 CMatch Algorithm
17.1.3 Experiments and Results
17.2 Cross-Lingual Speech Recognition
17.2.1 Adapter
17.2.2 Cross-Lingual Adaptation with Adapters
17.2.3 Advanced Algorithm: MetaAdapter and SimAdapter
17.2.4 Results and Discussion
References
18 Transfer Learning for Activity Recognition
18.1 Task and Dataset
18.2 Feature Extraction
18.3 Source Selection
18.4 Activity Recognition Using TCA
18.5 Activity Recognition Using Deep Transfer Learning
References
19 Federated Learning for Personalized Healthcare
19.1 Task and Dataset
19.1.1 Dataset
19.1.2 Data Splits
19.1.3 Model Architecture
19.2 FedAvg: Baseline Algorithm
19.2.1 Clients Update
19.2.2 Communication on the Server
19.2.3 Results
19.3 AdaFed: Adaptive Batchnorm for Federated Learning
19.3.1 Similarity Matrix Computation
19.3.2 Communication on the Server
19.3.3 Results
References
20 Concluding Remarks
References
A Useful Distance Metrics
A.1 Euclidean Distance
A.2 Minkowski Distance
A.3 Mahalanobis Distance
A.4 Cosine Similarity
A.5 Mutual Information
A.6 Pearson Correlation
A.7 Jaccard Index
A.8 KL and JS Divergence
A.9 Maximum Mean Discrepancy
A.10 A-distance
A.11 Hilbert–Schmidt Independence Criterion
B Popular Datasets in Transfer Learning
B.1 Digit Recognition Datasets
B.2 Object Recognition and Image Classification Datasets
B.3 Text Classification Datasets
B.4 Activity Recognition Datasets
C Venues Related to Transfer Learning
C.1 Machine Learning and AI
C.2 Computer Vision and Multimedia
C.3 Natural Language Processing and Speech
C.4 Ubiquitous Computing and Human–Computer Interaction
C.5 Data Mining
Reference

Machine Learning: Foundations, Methodologies, and Applications

Jindong Wang Yiqiang Chen

Introduction to Transfer Learning Algorithms and Practice

Machine Learning: Foundations, Methodologies, and Applications

Series Editors:
Kay Chen Tan, Department of Computing, Hong Kong Polytechnic University, Hong Kong, China
Dacheng Tao, University of Technology, Sydney, Australia

Books published in this series focus on the theory and computational foundations, advanced methodologies, and practical applications of machine learning, ideally combining mathematically rigorous treatments of contemporary topics in machine learning with specific illustrations in relevant algorithm designs and demonstrations in real-world applications. The intended readership includes research students and researchers in computer science, computer engineering, electrical engineering, data science, and related areas seeking a convenient medium to track the progress made in the foundations, methodologies, and applications of machine learning. Topics considered include all areas of machine learning, including but not limited to:

• Decision trees
• Artificial neural networks
• Kernel learning
• Bayesian learning
• Ensemble methods
• Dimension reduction and metric learning
• Reinforcement learning
• Meta learning and learning to learn
• Imitation learning
• Computational learning theory
• Probabilistic graphical models
• Transfer learning
• Multi-view and multi-task learning
• Graph neural networks
• Generative adversarial networks
• Federated learning

This series includes monographs, introductory and advanced textbooks, and stateof-the-art collections. Furthermore, it supports Open Access publication mode.

Jindong Wang • Yiqiang Chen

Introduction to Transfer Learning Algorithms and Practice

Jindong Wang
Microsoft Research Asia (China)
Beijing, China

Yiqiang Chen
Institute of Computing Technology
Beijing, China

ISSN 2730-9908  ISSN 2730-9916 (electronic)
Machine Learning: Foundations, Methodologies, and Applications
ISBN 978-981-19-7583-7  ISBN 978-981-19-7584-4 (eBook)
https://doi.org/10.1007/978-981-19-7584-4

Jointly published with Publishing House of Electronics Industry, Beijing, China. The print edition is not for sale in China (Mainland). Customers from China (Mainland) please order the print book from: Publishing House of Electronics Industry.

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publishers, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publishers nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publishers remain neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

This book is dedicated to our families. Your love and support are all the sources of our strength.

Preface

Machine learning, as an important branch of artificial intelligence, is becoming increasingly popular. Machine learning makes it possible to learn from massive training data and experience and then apply the resulting model to new problems. Transfer learning is an important machine learning paradigm that studies how to apply existing knowledge, models, and parameters to new problems. In recent years, algorithms, theories, and models for transfer learning have been extensively studied. Given the vast literature, it is frustratingly challenging for a beginning researcher to take the first step and then make a difference in this area. There is a growing need for a book that gradually introduces the essence of existing work in a learner-friendly manner.

In April 2018, we open-sourced the first version of this book on GitHub and called it Transfer Learning Tutorial. Accompanying the tutorial, we also open-sourced the most popular transfer learning GitHub repository, which contains tutorials, code, datasets, papers, applications, and many other materials. Our primary purpose is to let readers easily tune into this area and learn it quickly. The open-source tutorial gained much appreciation from readers, and the GitHub repo received over 8.8K stars. You can find almost everything related to transfer learning at https://github.com/jindongwang/transferlearning.

In May 2021, we rewrote it and added much new content, which was then published as a Chinese textbook. That textbook is based on our experience teaching a ubiquitous computing class at the University of Chinese Academy of Sciences, through which we have gained much understanding of how to better prepare a book that can benefit everyone, especially new learners. Now we take a step further and write this English version, with much new content and reorganization, to serve learners who read in English.
In this book, our main purpose is not to introduce a particular algorithm or a handful of papers but to introduce the basic concepts of transfer learning, its problems, general methods, extensions, and applications, from shallow to deep. We paid a great deal of attention to ensuring that the book starts from a new learner's perspective, so that it is much easier to tune in, step by step. Additionally, this is a textbook, not a survey that must cover all the literature. We hope that this


textbook will help interested readers quickly learn this area and, more importantly, use it in their own research or applications. Finally, we hope this book can be a friend who shares experience with readers and accelerates their success.

In 2020, Cambridge University Press published the first transfer learning book, by Qiang Yang's group, which gives a comprehensive overview of this area. Compared to that book, our work provides a more detailed introduction to the latest progress, with tutorial-style descriptions, practice code, and datasets, which enable easy and fluent learning for readers.

This book consists of three parts: Foundations, Modern Transfer Learning, and Applications of Transfer Learning.

Part I is Foundations, composed of Chaps. 1–7. Chapter 1 is an introduction that overviews the basic concepts of transfer learning, related research areas, problems, applications, and its necessity. Chapter 2 transitions from general machine learning to transfer learning and then introduces the fundamental problems in transfer learning. In Chap. 3, we unify the high-level idea of most transfer learning algorithms; this chapter should be seen as the starting point for the rest of the chapters. Chapters 4–6 present two categories of methods: instance weighting methods in Chap. 4 and statistical and geometrical feature transformation methods in Chaps. 5 and 6. Then, Chap. 7 presents theory, model evaluation, and model selection techniques for transfer learning.

Part II is Modern Transfer Learning, composed of Chaps. 8–14. Chapter 8 introduces the third major category of transfer learning methods: pre-training and fine-tuning, which belongs to the model-based methods. Chapters 9 and 10 cover deep and adversarial transfer learning methods, which also belong to the three basic types of methods above but offer more algorithms, especially in deep learning. Chapter 11 introduces generalization problems in transfer learning. Then, Chap. 12 discusses safety and privacy issues in transfer learning. Chapter 13 introduces how to deal with complex environments in transfer learning. Then, Chap. 14 introduces low-resource learning, for when labeled data are extremely rare or even not accessible; specifically, we introduce semi-supervised learning, meta-learning, and self-supervised learning.

Part III is Applications of Transfer Learning, which consists of Chaps. 15–19. These chapters present code practice showing how to apply transfer learning to applications including computer vision (Chap. 15), natural language processing (Chap. 16), speech recognition (Chap. 17), activity recognition (Chap. 18), and federated medical healthcare (Chap. 19). We show readers how transfer learning is adopted in different applications to address their different challenges.

Chapter 20 is the last chapter of this book, and it presents several frontiers. Additionally, we provide some useful materials in the appendix.

Of course, this book is not perfect, and we are aware of our own limitations. In case of any errors or suggestions, please do not hesitate to contact us.

Beijing, China
March 2022

Jindong Wang Yiqiang Chen

Acknowledgments

This book would not have been possible without the help of our friends, to whom we would like to show our sincere gratitude.

We thank the following friends for their help in organizing and writing content: Ph.D. candidate Mr. Wang Lu from the Institute of Computing Technology, Chinese Academy of Sciences, for the section "Federated Transfer Learning for Healthcare"; Dr. Chang Liu from Microsoft Research Asia for the section "Causal Transfer Learning"; Ph.D. candidate Mr. Yuntao Du from Nanjing University for the sections "Transfer Learning Theory" and "Online Transfer Learning"; Ph.D. candidate Mr. Yongchun Zhu from the Institute of Computing Technology, Chinese Academy of Sciences, for the section "Multi-source Transfer Learning"; and Ph.D. candidates from the Institute of Computing Technology, Chinese Academy of Sciences, for the chapter "Federated Learning for Personalized Healthcare."

We thank the following friends for their valuable comments: Chief AI Officer Qiang Yang from WeBank, Prof. Zhi-Hua Zhou from Nanjing University, Dr. Wenjie Feng from the National University of Singapore, Dr. Ran Duan from Xidian University, Dr. Baochen Sun and Lei Huang from Microsoft, Dr. Wei Wang from Dalian University of Technology, and Ph.D. candidates Wang Lu, Xin Qin, Xiaohai Li, and Yuxin Zhang from the Institute of Computing Technology, Chinese Academy of Sciences.

We would also like to thank the publishing house for their help in publishing this book. Finally, we leave our deepest gratitude to our families for their support during the entire writing process.



Adversarial Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.1 Generative Adversarial Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.2 Distribution Adaptation for Adversarial Transfer Learning . . . . . . . 10.3 Maximum Classifier Discrepancy for Adversarial Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.4 Data Generation for Adversarial Transfer Learning . . . . . . . . . . . . . . . 10.5 Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.5.1 Domain Discriminator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.5.2 Measuring Distribution Divergence . . . . . . . . . . . . . . . . . . . . . . . 10.5.3 Gradient Reversal Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

163 163 165

Generalization in Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Domain Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Data Manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2.1 Data Augmentation and Generation. . . . . . . . . . . . . . . . . . . . . . . 11.2.2 Mixup-Based Domain Generalization . . . . . . . . . . . . . . . . . . . . 11.3 Domain-Invariant Representation Learning . . . . . . . . . . . . . . . . . . . . . . . . 11.3.1 Domain-Invariant Component Analysis . . . . . . . . . . . . . . . . . . 11.3.2 Deep Domain Generalization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.3.3 Disentanglement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.4 Other Learning Paradigms for Domain Generalization . . . . . . . . . . . . 11.4.1 Ensemble Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.4.2 Meta-Learning for Domain Generalization . . . . . . . . . . . . . . . 11.4.3 Other Learning Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.5 Domain Generalization Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.5.1 Average Risk Estimation Error Bound . . . . . . . . . . . . . . . . . . . . 11.5.2 Generalization Risk Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.6 Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.6.1 Dataloader in Domain Generalization . . . . . . . . . . . . . . . . . . . . 11.6.2 Training and Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

175 175 178 178 180 181 181 183 185 187 187 188 190 190 190 192 193 193 195

9.3 9.4

10

11

167 169 170 171 171 172 173 173

Contents

xv

11.6.3 Examples: ERM and CORAL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 11.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199 12

Safe and Robust Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.1 Safe Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.1.1 Can Transfer Learning Models Be Attacked? . . . . . . . . . . . . 12.1.2 Reducing Defect Inheritance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.1.3 ReMoS: Relevant Model Slicing . . . . . . . . . . . . . . . . . . . . . . . . . . 12.2 Federated Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.2.1 Federated Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.2.2 Personalized Federated Learning for Non-I.I.D. Data . . . 12.3 Data-Free Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.3.1 Information Maximization Methods . . . . . . . . . . . . . . . . . . . . . . 12.3.2 Feature Matching Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.4 Causal Transfer Learning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.4.1 What is Causal Relation? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.4.2 Causal Relation for Transfer Learning . . . . . . . . . . . . . . . . . . . . 12.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

203 203 204 205 206 208 209 210 214 215 217 217 217 218 220 220

13

Transfer Learning in Complex Environments. . . . . . . . . . . . . . . . . . . . . . . . . . . 13.1 Imbalanced Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13.2 Multi-Source Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13.3 Open Set Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13.4 Time Series Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13.4.1 AdaRNN for Time Series Forecasting . . . . . . . . . . . . . . . . . . . . 13.4.2 DIVERSIFY for Time Series Classification . . . . . . . . . . . . . . 13.5 Online Transfer Learning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

225 225 227 229 231 232 234 236 237 238

14

Low-Resource Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.1 Compressing Transfer Learning Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.2 Semi-supervised Learning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.2.1 Consistency Regularization Methods . . . . . . . . . . . . . . . . . . . . . 14.2.2 Pseudo Labeling and Thresholding Methods . . . . . . . . . . . . . 14.3 Meta-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.3.1 Model-Based Meta-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.3.2 Metric-Based Meta-learning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.3.3 Optimization-Based Meta-learning . . . . . . . . . . . . . . . . . . . . . . . 14.4 Self-supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.4.1 Constructing Pretext Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14.4.2 Contrastive Self-supervised Learning . . . . . . . . . . . . . . . . . . . . . 14.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

241 241 244 245 247 249 250 252 253 255 255 257 258 258

xvi

Contents

Part III Applications of Transfer Learning 15

Transfer Learning for Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15.1 Objection Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15.1.1 Task and Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15.1.2 Load Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15.1.3 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15.1.4 Train and Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15.2 Neural Style Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15.2.1 Load Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15.2.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15.2.3 Train . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

265 265 265 266 268 269 270 270 271 271 273

16

Transfer Learning for Natural Language Processing . . . . . . . . . . . . . . . . . . 16.1 Emotion Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16.2 Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16.3 Train and Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16.4 Pre-training and Fine-tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

275 275 277 278 279 279

17

Transfer Learning for Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17.1 Cross-Domain Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17.1.1 MMD and CORAL for ASR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17.1.2 CMatch Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17.1.3 Experiments and Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17.2 Cross-Lingual Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17.2.1 Adapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17.2.2 Cross-Lingual Adaptation with Adapters . . . . . . . . . . . . . . . . . 17.2.3 Advanced Algorithm: MetaAdapter and SimAdapter . . . . 17.2.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

281 281 282 282 286 287 287 288 289 290 292

18

Transfer Learning for Activity Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18.1 Task and Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18.3 Source Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18.4 Activity Recognition Using TCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18.5 Activity Recognition Using Deep Transfer Learning . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

293 293 294 295 297 297 301

19

Federated Learning for Personalized Healthcare . . . . . . . . . . . . . . . . . . . . . . . 19.1 Task and Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.1.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.1.2 Data Splits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.1.3 Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.2 FedAvg: Baseline Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

303 303 304 305 306 307

Contents

19.2.1 Clients Update. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.2.2 Communication on the Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.3 AdaFed: Adaptive Batchnorm for Federated Learning . . . . . . . . . . . . 19.3.1 Similarity Matrix Computation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.3.2 Communication on the Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

xvii

308 308 309 309 310 311 312 312

20

Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317

A

Useful Distance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.1 Euclidean Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.2 Minkowski Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.3 Mahalanobis Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.4 Cosine Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.5 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.6 Pearson Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.7 Jaccard Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.8 KL and JS Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.9 Maximum Mean Discrepancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.10 A-distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.11 Hilbert–Schmidt Independence Criterion. . . . . . . . . . . . . . . . . . . . . . . . . . .

319 319 319 320 320 320 320 321 321 321 322 322

B

Popular Datasets in Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.1 Digit Recognition Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.2 Object Recognition and Image Classification Datasets . . . . . . . . . . . . B.3 Text Classification Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.4 Activity Recognition Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

323 323 323 325 325

C

Venues Related to Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C.1 Machine Learning and AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C.2 Computer Vision and Multimedia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C.3 Natural Language Processing and Speech . . . . . . . . . . . . . . . . . . . . . . . . . . C.4 Ubiquitous Computing and Human–Computer Interaction. . . . . . . . C.5 Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

327 327 327 328 328 329 329

Acronyms

AutoML: Automated Machine Learning
BN: Batch Normalization
CNN: Convolutional Neural Networks
CV: Computer Vision
DA: Domain Adaptation
DG: Domain Generalization
EM: Expectation Maximization
ERM: Empirical Risk Minimization
GAN: Generative Adversarial Networks
KD: Knowledge Distillation
MAP: Maximum A Posteriori
ML: Machine Learning
MLE: Maximum Likelihood Estimation
MLP: Multi-layer Perceptron
MMD: Maximum Mean Discrepancy
NLP: Natural Language Processing
NMT: Neural Machine Translation
NT: Negative Transfer
OT: Optimal Transport
PTM: Pre-trained Model
RKHS: Reproducing Kernel Hilbert Space
RNN: Recurrent Neural Networks
SGD: Stochastic Gradient Descent
SRM: Structural Risk Minimization
SVM: Support Vector Machine
TL: Transfer Learning
TTS: Text-to-Speech

Symbols

x : Variable
x (bold) : Vector
A (bold) : Matrix
I : Identity matrix
X : Input space
Y : Output space
D : Domain or dataset
N : Normal distribution
H : Hypothesis space
P(·) : Probability
P(·|·) : Conditional probability
k(·, ·) : Kernel function
E_{·∼D}[f(·)] : Expectation of f(·) on dataset D
ℓ(·, ·) : Loss function
I(·) : Indicator function
{···} : Set
A^T : The transpose of matrix A
tr(A) : Trace of matrix A
max f(·), min f(·) : The maximum and minimum value of function f(·)
arg max f(·), arg min f(·) : Parameter value at which f(·) attains its maximum (minimum)
||·||_p : p-norm
Σ_{i=1}^{n} : Sum over i = 1 to n

Part I Foundations

Chapter 1 Introduction

In this chapter, we introduce the background of transfer learning to give an overview of the area. The chapter is self-contained and can be read as a broad introduction for readers who have never encountered transfer learning; experienced readers can skip it with no harm. The organization of this chapter is as follows. First, we introduce the informal concept of transfer learning in natural language in Sect. 1.1. Then, we introduce other machine learning areas that are closely related to transfer learning in Sect. 1.2. Next, we answer the question of why we use transfer learning in Sect. 1.3. After that, we familiarize readers with the categorization of transfer learning research scenarios in Sect. 1.4. Subsequently, we summarize the historical development of transfer learning in both academia and industry in Sect. 1.5. Finally, we present an overview of transfer learning applications in different areas in Sect. 1.6.

1.1 Transfer Learning

There is an old saying in the ancient Chinese book The Analects of Confucius (Waley et al., 2005): "If a man keeps cherishing his old knowledge, so as continually to be acquiring new, he may be a teacher of others."

This saying implies that we can often gain new understanding of old knowledge if we review it before learning the new, and that by relying on this ability we may become teachers of others. Moreover, it tells us that people's new knowledge and abilities often evolve from old ones. Therefore, if we can find the connections and similarities between the old and the new, the process of learning new knowledge can become much more effective and efficient. On the other hand, there is also an old story from another ancient Chinese book, The Book of Chuang Tzu (Tzu, 2006):

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. J. Wang, Y. Chen, Introduction to Transfer Learning, Machine Learning: Foundations, Methodologies, and Applications, https://doi.org/10.1007/978-981-19-7584-4_1


Xi Shi was a famous beauty of the State of Yue during the Spring and Autumn Period. Among Xi Shi's neighbors there was a very ugly woman whom everyone disliked and called "Dong Shi." Dong Shi was very jealous of Xi Shi's beauty, but there was nothing she could do about it; she could only imitate Xi Shi in every way. One day, Xi Shi felt pain in her heart and called a doctor for treatment. Having taken the medicine, Xi Shi went out of her house for a stroll. After taking a few steps, she felt a spasm of pain in her heart. Quickly pressing her bosom, she sat down with knitted brows in a chair in the garden. Dong Shi saw Xi Shi after long expectation, but she did not know Xi Shi was ill; she only felt that Xi Shi looked exceptionally beautiful that day. Therefore, she imitated Xi Shi by pressing her bosom and deliberately knitting her brows to attract people's attention. Seeing her ugly and grotesque appearance, all the passers-by gave her a wide berth.

This is the well-known story of Dong Shi Knits Her Brows in Imitation. If old knowledge can be transferred to the new, why did Dong Shi fail after learning from Xi Shi? The key to this question is similarity. Why can reviewing old knowledge help us learn the new? Because similarity exists between the new and old knowledge, and it serves as a bridge for better consolidation. Dong Shi, on the other hand, was born ugly and thus shares few similarities with Xi Shi, a famous beauty. Therefore, even though she could imitate Xi Shi's actions, she ultimately failed.

So, how can we effectively utilize the similarity between things to help us solve new problems or acquire new abilities? This brings us to the topic of this book: Transfer Learning, which, as its name indicates, is the process of effectively and efficiently achieving our goals by transferring knowledge from existing fields.
The concept of transfer learning was originally born in psychology and pedagogy (Bray, 1928). It is also called "learning transfer" by psychologists, denoting the influence of one learning process on another. This even happens naturally in our daily lives. For instance, if we can play badminton, then we can learn to play tennis, since the two sports share similar strategies and techniques; if we can play Chinese chess, then we can learn to play international chess by borrowing the similar rules between them; and if we can ride a bicycle, then we can learn to ride a motorcycle, since the two are very similar. By leveraging the similarity between two things, we can build a bridge through which old experience or knowledge can be transferred to the new, benefiting the learning of new knowledge. In fact, humans are born with the ability of transfer learning. For instance, we can draw inferences about other cases from one instance, and we can learn by analogy. There is also an old saying: "Jade can be polished by stones from other hills."

Figure 1.1 shows some examples of transfer learning in real life.1 In Fig. 1.1a, because of the high similarity between badminton and tennis, the two sports can be learned by analogy. Likewise, in Fig. 1.1b, the rules of Chinese and international chess are similar, so they, too, can be learned by analogy.

1. These free images are from https://pixabay.com/.

Fig. 1.1 Real examples of transfer learning. (a) Badminton vs. tennis. (b) Chinese chess vs. international chess

Fig. 1.2 The relationship between transfer learning and machine learning

In the field of artificial intelligence, transfer learning is a specific learning paradigm. Machine learning, an important methodology of artificial intelligence, has proliferated greatly over the past decades and makes it possible to learn knowledge from data. Transfer learning, as an important branch of machine learning, focuses on leveraging previously learned knowledge to facilitate the learning of new abilities, which increases both effectiveness and efficiency. Figure 1.2 shows the relationship between transfer learning and machine learning. Concretely, in the field of machine learning, transfer learning can be informally defined as follows:

Transfer learning aims to solve a new problem by leveraging the similarity of data (tasks, or models) between an old problem and the new one to transfer knowledge (experience, rules, etc.).

Fitting this definition, Fig. 1.3 shows another real-life example of transfer learning in the field of sensor-based human activity recognition (HAR). The goal of HAR is to leverage different sensors attached to the human body to capture readings while people perform certain activities. These readings are then used as training data to build machine learning models. From Fig. 1.3, we see that different sensor signals are generated by different users, on different devices, and at different wearing positions. It is nearly impossible to build one model that can recognize human activities across all possible users, devices, and positions. How can we better leverage their similarities to build a generalized model such that the daily activities of each user can be recognized more accurately?

Fig. 1.3 Different sensor signals generated by different users, devices, and positions

Fig. 1.4 An illustration of transfer learning for cross-device human activity recognition

For the above cross-device/user/position HAR, Fig. 1.4 briefly illustrates a transfer learning process. Take different devices as an example. First, we learn the common knowledge from the models built on devices A and B. Then, we apply this common knowledge to a new device C to build a model more effectively, which avoids re-training from scratch on the data of C. With the common knowledge from A and B, the model for C can be trained more efficiently. Similar examples apply to different users and wearing positions.
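The cross-device process above can be sketched in a few lines of code. The toy below is entirely hypothetical (a synthetic 1-D "sensor" feature, a nearest-centroid "model," and a simple mean-alignment transfer step); it is meant only to make the A/B-to-C idea concrete, not to represent an actual HAR pipeline.

```python
import random

random.seed(0)

def make_device_data(offset, n=100):
    """Hypothetical 1-D sensor readings for two activities.
    Activity 0 centers at 0 + offset, activity 1 at 3 + offset."""
    data = []
    for _ in range(n):
        y = random.randint(0, 1)
        x = 3.0 * y + offset + random.gauss(0, 0.5)
        data.append((x, y))
    return data

def fit_centroids(data):
    """A minimal 'model': the per-class mean of the feature."""
    sums, counts = {0: 0.0, 1: 0.0}, {0: 0, 1: 0}
    for x, y in data:
        sums[y] += x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in (0, 1)}

def predict(centroids, x):
    return min(centroids, key=lambda y: abs(x - centroids[y]))

def accuracy(centroids, data):
    return sum(predict(centroids, x) == y for x, y in data) / len(data)

# Source devices A and B (similar sensor offsets); target device C is shifted.
src = make_device_data(0.0) + make_device_data(0.2)
tgt = make_device_data(2.0)

model = fit_centroids(src)
acc_before = accuracy(model, tgt)  # source model applied directly to C

# "Transfer": align the source model to device C using only a handful of
# unlabeled C readings, by shifting centroids with the difference of means.
src_mean = sum(x for x, _ in src) / len(src)
tgt_mean = sum(x for x, _ in tgt[:10]) / 10
shift = tgt_mean - src_mean
adapted = {y: c + shift for y, c in model.items()}
acc_after = accuracy(adapted, tgt)  # adapted model on C
```

In this sketch, `acc_after` exceeds `acc_before` because the ten target readings were enough to reuse the class structure learned on A and B, rather than retraining on C from scratch.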

Table 1.1 Difference between traditional ML and transfer learning

Aspect            | Traditional machine learning      | Transfer learning
Data distribution | Training and test data are i.i.d. | Training and test data are non-i.i.d.
Data annotation   | Huge amount of annotations        | Fewer annotations
Model             | Train from scratch for every task | Model transfer between tasks

1.2 Related Research Fields

Transfer learning did not emerge out of nowhere. It has connections to, and differences from, several existing concepts. In this section, we summarize their similarities and differences (Table 1.1).

First, traditional machine learning typically assumes that the training and test data are independently and identically distributed (i.i.d.), i.e., that the training and test distributions are the same. In reality, this assumption often does not hold. Transfer learning mainly deals with situations where the training and test data are not i.i.d. (non-i.i.d.), which better matches real-world applications.

Second, the success of traditional machine learning relies heavily on massive labeled training data, which can be difficult to obtain since most real-world data is unlabeled. Transfer learning relaxes this requirement by transferring knowledge from existing models or tasks, so it does not need large-scale labeled datasets for training.

Third, in traditional machine learning, we often train a new model for each new task from scratch, which can be extremely time-consuming when facing many tasks. Transfer learning makes it possible to transfer models between tasks, so we do not need to train each model from scratch.

Apart from the above, there are other concepts in machine learning that are similar to transfer learning. We summarize their differences in Table 1.2.
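The non-i.i.d. setting discussed above can be made concrete with a tiny, self-contained simulation. Everything here is hypothetical (Gaussian domains and a deliberately crude majority-class "model"); it illustrates covariate shift, where the labeling rule P(Y|X) is identical across domains but the marginal P(X) differs, so a rule fitted on the source marginal degrades on the target.

```python
import random

random.seed(1)

# The same labeling rule in both domains: y = 1 iff x > 1.0 (hypothetical).
def label(x):
    return int(x > 1.0)

# Source marginal P(X) = N(0, 1); target marginal P(X) = N(2, 1).
train_x = [random.gauss(0.0, 1.0) for _ in range(5000)]
test_x = [random.gauss(2.0, 1.0) for _ in range(5000)]

# A crude "model": always predict the majority class seen on the source.
majority = max((0, 1), key=lambda c: sum(label(x) == c for x in train_x))

def accuracy(pred, xs):
    return sum(pred == label(x) for x in xs) / len(xs)

acc_source = accuracy(majority, train_x)  # decent on the source domain
acc_target = accuracy(majority, test_x)   # collapses under covariate shift
```

Because most source points fall below the threshold while most target points fall above it, the same fixed prediction scores well on the source and poorly on the target, even though the underlying concept never changed.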

1.3 Why Transfer Learning?

Having introduced transfer learning in general terms, we now answer another important question: why do we use transfer learning? We present five reasons in the following subsections.

1.3.1 Big Data vs. Less Annotation Today, we are in a big data era. We are facing the flood of data from every aspect of our environment: social network, intelligent transportation, video surveillance,

8

1 Introduction

Table 1.2 Similarity and difference between transfer learning (TL) and other fields Difference Similarity Share knowledge between dif- MTL aims to improve all tasks, while TL ferent tasks to improve them all focuses on the target task; data in MTL often has enough labels in each task, while TL focuses on low-data modeling Lifelong learning They all target future tasks Lifelong learning focuses on continuous update of models, while TL focuses on a certain stage of learning Learning under non-i.i.d. situa- Covariate shift means the marginal Covariate shift tion distribution (.P (X)) is different, while TL focuses on more general distribution shift Learning under non-i.i.d. situa- DA is a special case of transfer learning Domain Adaptation (DA) tion Meta-learning Improve the performance of Meta-learning learns general knowledge new tasks by summarizing historical tasks; TL has a certain source and target data whose objective is more clear Build models in low-data situa- Few-shot learning can be seen as a special Few-shot tion learning case of transfer learning Field Multi-task Learning (MTL)

logistics, human activities, and so on. Huge amounts of data are generated every second in many modalities: images, videos, texts, speech, and so on. Our machine learning and artificial intelligence models can be continuously trained and updated on these massive data to serve people all around the world. However, big data introduces another critical issue: we often lack sufficient annotations for these data. Machine learning models rely heavily on sufficient annotations to make accurate predictions, yet while we can obtain huge amounts of data, most of it is in raw form and lacks (correct) labels. Data annotation is an extremely expensive and time-consuming process, so it is challenging to build effective models from data with very limited annotations. To solve this problem, one intuitive idea comes up: can we use transfer learning to transfer knowledge from existing models and reduce the need for massive data? Moreover, can we use transfer learning to bridge different domains, so as to reuse high-quality annotated data from other fields?

1.3.2 Big Data vs. Poor Computation

It is common knowledge that storing and processing such big data requires very powerful computation devices. Unfortunately, such computing power is often restricted to "super-rich" companies such as Google, Meta, Microsoft, Amazon, etc., which ordinary researchers cannot


Fig. 1.5 Evolution of the sizes of pre-trained language models (adapted from Liu et al., 2021). The figure plots models such as BERT, RoBERTa, ALBERT, UniLM, XLM, XNLG, BART, MASS, mT5, mT6, T5, GPT-2, GPT-3, Codex, ERNIE, M6, CPM, and PanGu-α, whose sizes grow from well under 1 to roughly 200 billion parameters

afford. These tech giants have much more computation power to train very large models (Devlin et al., 2018; Radford et al., 2019) for their products and services. For instance, in the computer vision field, training on ImageNet (Deng et al., 2009) is extremely time-consuming on normal hardware, and in natural language processing, training a BERT (Devlin et al., 2018) model from scratch is not affordable for most researchers. Scientific research is a long journey; how can we expect general researchers to benefit from such a resource-hungry research topic? Figure 1.5 shows the ever-growing sizes of pre-trained language models over the past years. We can clearly see that these models are getting bigger and bigger, requiring ever more powerful hardware to train. In this case, how can ordinary researchers and students make good use of these technological breakthroughs (e.g., pre-trained models) in their own research? One probable answer is again transfer learning, which enables us to use pre-trained models to facilitate our own tasks and benefit our own research.

1.3.3 Limited Data vs. Generalization Requirements

Our key expectation of machine learning is to train models that make correct predictions on unseen datasets, environments, and applications, i.e., generalization or out-of-distribution (OOD) generalization (Quiñonero-Candela et al., 2009). This raises another challenge: limited data vs. generalization requirements. Although we can always collect more data, it remains finite and limited compared to the vast ocean of big data, and a model may still fail in new environments. Take medical machine learning as an example. Medical data is extremely limited for several reasons, such as the limited number of patients, complicated experiments and operations, and the unaffordable cost of data


collection. In this case, if we can take advantage of transfer learning to build more generalizable models from limited datasets, we may end up with a model that predicts well on unseen data based on old models. From the big picture, the generalization ability of machine learning has always been a hot research topic in the community. Luckily, transfer learning is one of the most important steps toward resolving this issue. For this problem, there are several research fields in the transfer learning family, such as domain adaptation (Pan et al., 2011), domain generalization (Blanchard et al., 2011; Wang et al., 2021), and the related topic of meta-learning (Vanschoren, 2018). Most of the content in this book centers on domain adaptation algorithms, while it is natural and intuitive to extend them to domain generalization and meta-learning research. Specifically, we will introduce domain generalization in Chap. 11 and meta-learning in Chap. 14.

1.3.4 Pervasive Model vs. Personal Need

One purpose of machine learning is to build a general model that satisfies different users, devices, environments, and needs for most tasks; this is our anticipation of machine learning making the best of its generalization ability. However, when it comes to each individual, it is surprisingly difficult to satisfy everyone's personal needs. A general, or pervasive, model cannot satisfy everyone. For instance, different people have different habits when using an autopilot model in a car: some prefer highways, while others prefer side roads through small villages to enjoy more scenery. Moreover, different people often have very different needs for privacy. As shown in Fig. 1.6, for the same cloud model, different users (e.g., women, men, and the elderly) often have different needs and habits, which are key elements when building machine learning models. How, then, can we make general machine learning models adapt to different personal needs? Training a separate model for each person is far too expensive. Instead, we can use transfer learning to perform model adaptation or fine-tuning based on the general models. We can also leverage transfer learning to perform online knowledge transfer so that models evolve according to the needs of each person.

1.3.5 For Specific Applications

Some applications face specific challenges such as cold start. For an online shopping system that relies on recommendation technology, how can it make accurate recommendations to users when it does not have enough user data? How can a new


Fig. 1.6 Different personal needs for machine learning (the same cloud model serves users whose favorites differ: makeup, good food, and fashion; digital media, fitness, and games; well-being, education, and health)

Table 1.3 Summarization of the necessity of transfer learning

Contradiction: Big data vs. less annotation
  Traditional ML: Expensive human labeling
  Transfer learning: Transfer knowledge from existing fields

Contradiction: Big data vs. poor computation
  Traditional ML: Exclusive computation devices
  Transfer learning: Model transfer

Contradiction: Limited data vs. generalization ability
  Traditional ML: Poor performance on OOD data
  Transfer learning: Domain generalization, meta-learning, etc.

Contradiction: Pervasive model vs. personal need
  Traditional ML: Cannot satisfy personal needs
  Transfer learning: Adapt to every person

Contradiction: Specific applications
  Traditional ML: Cold start cannot be solved
  Transfer learning: Data transfer from other fields

image tagging service perform accurate tagging for new users? All these problems can be solved by using transfer learning technologies. To sum up, we show the necessity of transfer learning in Table 1.3.

1.4 Taxonomy of Transfer Learning

According to a popular taxonomy in traditional machine learning (setting reinforcement learning aside), machine learning can be categorized into three classes: supervised, semi-supervised, and unsupervised learning. Similarly, transfer learning can be categorized into the same three classes. Note that other categorization standards exist and there is no single unified one; any categorization that summarizes the research well is reasonable. Here, we categorize transfer learning using several different taxonomies, as shown in Fig. 1.7. Generally speaking, the taxonomy of transfer learning follows four criteria:


Fig. 1.7 Taxonomy of transfer learning research (by feature space: homogeneous vs. heterogeneous TL; by supervision: supervised, semi-supervised, and unsupervised TL; by method: instance-based, feature-based, model-based, and relation-based TL; by learning scheme: online vs. offline TL)

• Whether the target domain has labels
• Whether the feature spaces are the same between domains
• Which learning strategy is used
• Whether the learning is online or offline

These different classes represent different research scenarios, and two different taxonomies may overlap with each other. In the following, we briefly describe these research areas.

1.4.1 Taxonomy by Feature Space

A common approach to categorizing transfer learning research is by feature space, a taxonomy introduced by Weiss et al. (2016). Under it, transfer learning falls into two main classes:
1. Homogeneous transfer learning
2. Heterogeneous transfer learning
This is a rather intuitive taxonomy: if the feature semantics and dimensions of the domains are the same, the problem is homogeneous transfer learning; otherwise, it is heterogeneous transfer learning. For instance, transfer learning between different image domains is homogeneous transfer, while transfer from images to text is a heterogeneous task.


1.4.2 Taxonomy by Target Domain Labels

Similar to the machine learning categorization, transfer learning can be divided into three classes according to label availability in the target domain:
1. Supervised transfer learning
2. Semi-supervised transfer learning
3. Unsupervised transfer learning
Obviously, the low-resource situations (i.e., semi-supervised and unsupervised transfer learning) are more challenging and thus among the hottest research topics in this area (Chap. 14). They are also the focus of this book.

1.4.3 Taxonomy by Learning Methodology

Since our focus is the learning technology behind transfer learning, a natural classification scheme is to categorize existing transfer learning algorithms by their learning methodology (Pan and Yang, 2010):
1. Instance-based transfer learning
2. Feature-based transfer learning
3. Model-based transfer learning
4. Relation-based transfer learning

This categorization follows the natural sequence of data, feature, and model; in addition, the higher-level relations between domains form an important category of their own.

Instance-based transfer learning performs knowledge transfer by instance reweighting or re-sampling, giving different weights to different samples. For instance, if an instance is more similar to the target domain data, we give it a larger weight than the other data. This idea is natural and simple.

Feature-based transfer learning performs knowledge transfer using feature transformation or representation learning. Suppose the features of the source and target domains do not lie in the same feature space; we can then leverage representation learning to transform them into a common space where their feature representations are similar. This category of methods is popular in both academia and industry.

Model-based transfer learning performs knowledge transfer by sharing parameters between models from different domains. The weight parameters of a support vector machine can be shared; so can the weights and biases of a neural network. For a neural network in particular, the weights can be easily transferred using the popular pre-train–fine-tune scheme.
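As a toy sketch of the instance-based idea, the snippet below weights each source sample by its kernel similarity to the target samples, so that target-like instances would count more during training. The synthetic data, the bandwidth, and the RBF weighting rule are illustrative assumptions; principled estimators such as kernel mean matching are covered in Chap. 4:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D domains: the target cloud is shifted away from the source.
source = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
target = rng.normal(loc=[1.5, 1.5], scale=1.0, size=(200, 2))

def instance_weights(source, target, bandwidth=1.0):
    """Weight each source instance by its mean RBF similarity to the
    target samples, normalized so the weights sum to one."""
    # Pairwise squared Euclidean distances, shape (n_source, n_target).
    d2 = ((source[:, None, :] - target[None, :, :]) ** 2).sum(axis=-1)
    sim = np.exp(-d2 / (2.0 * bandwidth ** 2)).mean(axis=1)
    return sim / sim.sum()

weights = instance_weights(source, target)
```

A downstream model would then be trained with a per-instance weighted loss, e.g., `loss = (weights * sample_losses).sum()`, so that the source samples resembling the target domain dominate the fit.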


Relation-based transfer learning is less popular in the recent literature. This category focuses on the shared logic between domains. For instance, the situation of students having a class in a classroom is similar to that of employees having a morning meeting in a company. Exploiting such relations can assist tasks in these specific applications.

1.4.4 Taxonomy by Online or Offline Learning

According to the learning scheme, transfer learning can be categorized into two classes:
1. Offline transfer learning
2. Online transfer learning
Currently, most transfer learning algorithms and applications adopt the offline scheme, in which the source and target domains are given in advance. This scheme lacks the flexibility to learn from streaming data; we believe the future lies in models that perform online updates as new data arrive. Most of this book covers offline transfer learning; online transfer learning is introduced in Sect. 13.5.

The main part of this book introduces these different kinds of learning algorithms. According to our own understanding of these algorithms, we present our categorization:
1. Instance re-weighting transfer learning, which refers to the instance-based methods in Chap. 4
2. Feature transformation transfer learning, which refers to the feature-based methods in Chaps. 5 and 6
3. Pre-training and fine-tuning, which refers to the model-based methods in Chap. 8
Based on these methods, we will introduce deep and adversarial transfer learning in Chaps. 9 and 10. This book does not cover relation-based transfer learning.
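The offline/online distinction can be sketched with a toy one-parameter model. All data below is synthetic and the learning rate is an arbitrary assumption: the offline stage fits a "pre-trained" source model on data given in advance, while the online stage keeps updating it as target samples stream in one at a time:

```python
import numpy as np

rng = np.random.default_rng(0)

# Source domain: y is roughly 1.0 * x.  Target domain: y is roughly 3.0 * x.
x_src = rng.uniform(-1.0, 1.0, size=500)
y_src = 1.0 * x_src + 0.05 * rng.normal(size=500)

# Offline stage: the "pre-trained" source model, fit by least squares.
w = float(x_src @ y_src) / float(x_src @ x_src)   # close to 1.0

# Online stage: target samples arrive one at a time; after each
# arrival we take a single SGD step on the squared error.
lr = 0.5
for _ in range(200):
    x = rng.uniform(-1.0, 1.0)
    y = 3.0 * x + 0.05 * rng.normal()
    w -= lr * (w * x - y) * x   # gradient of 0.5 * (w*x - y)**2

# w has drifted from the source solution toward the target's.
```

An offline method would instead see the whole target set at once and solve for w in a single batch; the online variant trades some accuracy for the ability to adapt continuously.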

1.5 Transfer Learning in Academia and Industry

Transfer learning has long been a hot research topic at leading machine learning conferences such as ICML (International Conference on Machine Learning),2 NeurIPS (Advances in Neural Information Processing Systems),3 and ICLR (Inter-

2 https://icml.cc/. 3 https://neurips.cc/.


Fig. 1.8 History of the transfer learning academic community: the 1995 NIPS workshop "Learning to Learn: Knowledge Consolidation and Transfer in Inductive Systems"; the 2005 DARPA program on the ability of a system to recognize and apply knowledge and skills learned in previous tasks to novel tasks; the AAAI 2008 workshop "Transfer Learning for Complex Tasks"; the 2011 ICML workshop and challenge "Unsupervised and Transfer Learning"; the 2013 NIPS workshop "New Directions in Transfer and Multi-Task Learning"; the 2017–2019 ICCV/ECCV workshops "Transferring and Adapting Source Knowledge in Computer Vision"; and the 2019 ICLR workshop "Learning from Limited Labeled Data"

national Conference on Learning Representations).4 There have been several workshops and tutorials at these conferences, and Figs. 1.8 and 1.9 briefly show the development of transfer learning in the academic community. From the figures, we can clearly observe that transfer learning has remained an important direction over the years. The definition, research topics, and boundaries of transfer learning have kept evolving. Over a century ago, long before computer science existed, the international psychology community started research on how individuals transfer their behaviors from one context to another (Woodworth and Thorndike, 1901). With the proliferation of computer science, NIPS (now renamed NeurIPS) held a workshop on learning to learn: knowledge transfer in inductive learning systems5 in 1995. Later, in 2005, DARPA in the USA started a research project on transfer learning, with the purpose of studying the recognition and application of existing knowledge to new tasks.6 The next year, ICML held the structural knowledge transfer7 workshop. In 2008, researchers organized a workshop called Transfer learning for complex tasks8 at the AAAI conference on artificial intelligence. At ICML 2011, there was also a workshop on Unsupervised and transfer learning.9 In the same year, IJCNN held a challenge with a similar name.10 Later, NIPS hosted a workshop on transfer learning and multi-task learning11 in 2013. From

4 https://iclr.cc/. 5 http://plato.acadiau.ca/courses/comp/dsilver/NIPS95_LTL/transfer.workshop.1995.html. 6 http://logic.stanford.edu/tl/TransferLearningPIP.pdf. 7 http://orca.st.usm.edu/~banerjee/icmlws06/. 8 https://eecs.wsu.edu/~taylorm/AAAI08TL/index.htm. 9 http://clopinet.com/isabelle/Projects/ICML2011/home.html. 10 http://www.causality.inf.ethz.ch/unsupervised-learning.php. 11 https://sites.google.com/site/learningacross/.


Fig. 1.9 Transfer learning in recent academic fields

2017 to 2019, ICCV and ECCV held several workshops and challenges on transfer learning.12 In 2019, ICLR also held a workshop on learning from limited data.13 Moreover, transfer learning has helped win top honors at international contests. In 2007, the first prize of the ICDM indoor location challenge went to a model built with transfer learning.14 In 2018, the paper Taskonomy: disentangling task transfer learning (Zamir et al., 2018), which aims to disentangle the relationships between tasks, won a best paper award. In the same year, the IJCAI Ads challenge was also won by a solution powered by transfer learning.15 A paper titled Parameter transfer unit for deep neural networks (Zhang et al., 2019c) won a best paper award at PAKDD 2019. In 2019, Ming Zhou, the president of ACL, emphasized the importance of pre-training and fine-tuning in natural language processing in the opening keynote of ACL.16 A year later, the paper Don't stop pretraining: adapt language models to domains and tasks (Gururangan et al., 2020) was a best paper nominee at ACL. In 2021, Turing award winners Yoshua Bengio, Yann LeCun, and Geoffrey Hinton stated that current systems are not as robust to changes in distribution as humans, who can quickly adapt to such changes with very few examples (Bengio et al., 2021). We can see that transfer learning has achieved great progress in academic research. On the other hand, transfer learning has always been a key technology behind industrial successes. In 2020, Microsoft research teams

12 http://adas.cvc.uab.es/task-cv2017/,

https://sites.google.com/view/task-cv2019/home.

13 https://lld-workshop.github.io/. 14 https://www.kdnuggets.com/news/2007/n15/7i.html. 15 https://zhuanlan.zhihu.com/p/40631601. 16 https://www.msra.cn/zh-cn/news/features/acl-2019-ming-zhou.


used sim-to-real transfer learning to train drones.17 Later, Microsoft released what was then the world's largest pre-trained language model, Turing-NLG.18 OpenAI initiated a challenge on reinforcement transfer learning, in which competitors were asked to develop algorithms that enable transfer between different environments.19 Other companies such as Google, Meta (formerly Facebook), and OpenAI have also released pre-trained models such as BERT (Devlin et al., 2018), T5 (Raffel et al., 2019), and GPT (Brown et al., 2020). NVIDIA released a transfer learning toolkit20 for training deep learning models. Alibaba used transfer learning and meta-learning to assist few-shot learning and protect its systems.21 Alexa from Amazon used transfer learning for fast learning of a second language,22 which also reduced the need for massive training data. The examples in this section are only a few of many; we hope to see more advances in transfer learning in the future.

1.6 Overview of Transfer Learning Applications

Generally speaking, transfer learning is an important branch of machine learning and thus is not restricted to specific applications. In fact, we can use transfer learning in any situation that satisfies its problem definition, including but not limited to computer vision, natural language processing, activity recognition, indoor location, video surveillance, sentiment understanding, human–computer interaction, etc. Figure 1.10 briefly illustrates some application areas of transfer learning. In the following, we use some examples to show these rich applications. Note that the examples in this section only illustrate the usage of transfer learning, to inform readers of its potential application areas. For hands-on coding practice in these applications, please refer to Chaps. 15–19.

1.6.1 Computer Vision

Transfer learning has been widely adopted in computer vision research, for tasks such as image classification and style transfer. Figure 1.11 shows two different transfer learning-based image classification tasks. Even for images that belong to the same classes, the feature distributions can change because of different viewing

17 https://tech.sina.cn/2020-03-26/detail-iimxyqwa3399986.d.html. 18 https://cloud.tencent.com/developer/article/1586451. 19 https://www.leiphone.com/news/201804/vKH66rt5xW0dycQL.html. 20 https://www.leiphone.com/news/201812/Cy3HiAmUh6J7P3gB.html. 21 https://www.jiqizhixin.com/articles/2019-12-10-7. 22 https://zhidx.com/p/144766.html.


Fig. 1.10 Wide applications of transfer learning: low-resource translation for speech recognition; cross-view/background/illumination image classification; cross-person/device/position activity recognition; machine translation across different domains; HCI across different persons/interfaces/contexts; and indoor location across different environments/devices/time

Fig. 1.11 Transfer learning applications in computer vision. (a) Cross-domain digit recognition (digits dataset 1 → digits dataset 2). (b) Cross-domain image classification (image dataset 1 → image dataset 2)

angles, different illumination, and different backgrounds. Thus, it is of great importance to build robust and adaptable classifiers using transfer learning. Figure 1.11a shows digit datasets from MNIST and USPS, while the images in Fig. 1.11b come from a popular transfer learning dataset called Office-Home (Venkateswara et al., 2017). As an example closer to real life, the research of Xie et al. (2016) leveraged transfer learning to help survey poverty in Africa for the United Nations. Researchers obtained nighttime satellite images of Africa, whose illumination was then mapped to class labels in the ImageNet (Deng et al., 2009) dataset. After careful training and fine-tuning, they showed that it becomes possible to use the prediction results on ImageNet to predict poverty (which is positively related to illumination intensity). This research implies that with transfer learning alone, we can obtain accurate results comparable to onsite surveys. Of course, this is a positive example of transfer learning.


In object detection, the works of Chen et al. (2018e), Inoue et al. (2018), Sun and Saenko (2014), and Raj et al. (2015) addressed the distribution adaptation problem between source and target domains. These works applied domain adaptation and weakly supervised learning to object detection to align the different distributions, bringing great performance improvements on the test data. As for the training data scarcity issue, Lim et al. (2011) proposed to transfer from related tasks to the target tasks. Shi et al. (2017) proposed ranking-based transfer learning for object detection. Chen et al. (2018b) proposed an adaptation solution for few-shot object detection applications.

There are also applications in semantic segmentation. Zhang et al. (2017), Tsai et al. (2018), and Li et al. (2019b) proposed different cross-domain adaptation algorithms for city street view segmentation. Kamnitsas et al. (2017) and Zou et al. (2018) addressed the medical data scarcity problem with transfer learning for brain images. Sankaranarayanan et al. (2017) proposed a generative adversarial network-based transfer learning algorithm. Luo et al. (2019) proposed fine-grained semantic segmentation using class-wise adaptation algorithms.

Transfer learning is also applied in video understanding tasks. For instance, we know that car status can be influenced by weather conditions in self-driving cars. For this problem, Wenzel et al. (2018) proposed a generative adversarial network-based transfer learning algorithm for cross-weather control. For video classification, researchers have proposed different distillation and transfer learning algorithms to improve classification accuracy (Diba et al., 2017; Zhang and Peng, 2018). For human action recognition in videos, Liu et al. (2011) proposed a cross-view transfer learning algorithm that achieves great performance across camera angles. Rahmani and Mian (2015) also proposed a cross-view algorithm for action recognition. Jia et al. (2014) proposed a latent tensor transfer learning algorithm for RGB-D action recognition. In addition to cross-view settings, Bian et al. (2011) proposed a cross-domain action recognition algorithm. Sargano et al. (2017) proposed a deep representation-based transfer learning algorithm. Zheng et al. (2016) and Wu et al. (2013) proposed dictionary learning and heterogeneous action recognition algorithms. Giel and Diaz (2015) proposed a transfer learning algorithm based on temporal relations. To sum up, video understanding tasks are similar to image classification, and there are also cross-domain, cross-view, and cross-status tasks that need transfer learning.

Scene text recognition is also an important vision task. Zhang et al. (2019b) proposed an attention-based robust scene text recognition method for cross-domain tasks. Tang et al. (2016) explored the application of convolutional neural networks in Chinese text recognition. Goussies et al. (2014) used a transfer decision forest algorithm for optical character recognition.

For image generation tasks, Zhang et al. (2019a) started from disentanglement and constructed a GAN-based makeup transfer learning algorithm. Image generation has a more common name, style transfer, as in (Gatys et al., 2016; Luan


et al., 2017; Huang and Belongie, 2017; Li et al., 2017b), where almost every style transfer work adopts transfer learning ideas. There are still other computer vision tasks that we cannot cover here. We see that almost every task can apply transfer learning to solve its own problems, such as cross-domain or cross-view tasks: whenever the distributions change for some reason, transfer learning can be applied. In Chap. 15, we offer implementations of transfer learning for object detection and neural style transfer.

1.6.2 Natural Language Processing

Language is fundamental to humans and a key difference between humans and animals. Computer vision enables machines to see the world, while language makes it possible to communicate: between people, between people and animals, and between people and machines. Natural language processing (NLP) brings together knowledge from linguistics, artificial intelligence, and computer science. What is language? In a narrow sense, any text-related material is language. But if we understand language beyond texts, then, in a wide sense, any medium is a language as long as it conveys messages, including images, speech, and data. Hence, the definition of language keeps changing. This section focuses on text-based language; we introduce the speech area in the next section.

Natural language processing is a big research area, mainly composed of two categories of tasks: natural language understanding (NLU) and natural language generation (NLG). NLU enables machines to understand human language, i.e., extract important information from text; NLG enables machines to produce human language, i.e., transform internal data into language that humans can understand. Corresponding tasks include text classification, information retrieval, information extraction, chatbots, dialogue systems, machine translation, and sequence tagging.

The non-i.i.d. issue also exists in natural language processing tasks. Thus, many efforts have applied transfer learning, domain adaptation, and multi-task learning algorithms for cross-domain learning, such as sequence tagging (Grave et al., 2013; Peng and Dredze, 2016; Yang et al., 2017), semantic parsing (McClosky et al., 2010), sentiment classification (Cui et al., 2019; Ruder and Plank, 2017; Wu and Huang, 2016), text classification (Wang et al., 2019; Liu et al., 2019), relation classification (Feng et al., 2018), and text mining problems (Qu et al., 2019). In these areas, transfer learning is used to solve the non-i.i.d. problems and has achieved great progress. Take text classification as an example. Since the texts in one domain are specific to its domain knowledge, a classifier trained on that domain cannot be directly applied to another, which indicates the need for transfer learning. Figure 1.12 shows an example of transferring from electronic devices to DVDs. In such a problem, the classifier trained on electronic device data is specific to that domain and thus needs to be transferred and adapted to the DVD domain.


Fig. 1.12 Transfer learning applications in text classification

In machine translation, transfer learning is even more important. First, translation involves at least two languages, source and target, and there exists distribution divergence between them. Second, paired data is scarce and expensive to collect, requiring much annotation effort by experts. Thus, improving performance in low-resource settings is an important problem. Early work (Blitzer et al., 2006) utilized pivot features as bridges for cross-domain translation. Bertoldi and Federico (2009) proposed a hidden Markov model-driven domain adaptation translation system. Other works (Chen et al., 2017a; Jiang and Zhai, 2007; Poncelas et al., 2019; Wang et al., 2017) utilized instance weighting methods for translation. Chen and Huang (2016) proposed a semi-supervised domain adaptation algorithm for low-resource translation.

In recent years, there has been growing interest in pre-training large language models. Researchers have found that if we can collect large-scale datasets, labeled or not, we can pre-train on them using self-supervised learning algorithms. Similar to computer vision models, NLP pre-trained models are general models that can be fine-tuned on downstream tasks to improve performance. One important milestone is BERT (Devlin et al., 2018), which is built on Transformers and pre-trained on Wikipedia texts. BERT achieved great performance gains on many NLP tasks, and many works followed it with fine-tuning, adaptation, and transfer on different tasks and applications. BERT-based transfer learning methods dramatically increased performance on text classification, sentiment analysis, natural language generation, and other tasks. However, BERT is costly to train, and other works try to compress it for easier pre-training. Apart from BERT, the generative pre-training (GPT) series by OpenAI has set new frontiers for large-scale language pre-training (Radford et al., 2018, 2019; Brown et al., 2020). In Chap. 16, we implement a transfer learning-based emotion recognition task.


Fig. 1.13 Illustration of the cross-lingual speech recognition task

1.6.3 Speech Voice is also a kind of human language and the medium for communication. Automatic speech recognition (ASR) refers to the process of transforming human audio data to texts. On the other hand, speech synthesis is an inverse process which makes computers generate human voice. Text to speech (TTS) is a popular field of speech synthesis that takes texts as inputs and then generates human voice based on the texts. There are other research fields in speech, such as speaker verification and voice conversion. Speech is a multidisciplinary area that brings together computer science, artificial intelligence, signal processing, linguistics, statistics, and probability theory. It has a long research history. With the development of deep learning, ASR and TTS now widely take deep neural networks as their backend and achieved outstanding performance. The success of deep learning relies on massive training data. Speech data, compared to image data, is more time-consuming and difficult to collect. On the other hand, speech is not images or texts that are more objective. Instead, speech is coming from humans which is more uncertain and unstable. The sound that we can hear, including tone, frequencies, and pitch, is unique for each person. Different people have different pitch, talking styles, and accents. Therefore, the data we can collect often have the non-i.i.d. and scarcity problems. Transfer learning remains the useful technology for speech. For TTS, we have different perspectives for speech data. We know that the inputs to TTS system are natural texts and the outputs are voice data. But the voice that “sounds good” is really a subjective and vague evaluation metric compared to other metrics such as accuracy and error in classification, regression, and detection tasks. Therefore, the evaluation brings another challenge for TTS. On the other hand, the data collection for TTS is to let people say a few words, and thus, it is also a few-shot learning problem. 
For instance, as shown in Fig. 1.13, given three rich-resource source languages (Italian, Welsh, and Russian), how can we learn transferable knowledge from them to build a cross-lingual ASR model for the low-resource target language, Romanian?

1.6 Overview of Transfer Learning Applications


To solve the few-shot problem in speech recognition, the work of Li et al. (2019a) transferred an ASR model from Mandarin to Cantonese. The works of Yao et al. (2012), Yu et al. (2013), and Gupta and Raghavan (2004) explored different adaptation technologies for different languages. Xue et al. (2014) proposed a fast deep adaptation model based on speaker identification codes. The work of Wu and Gales (2015) proposed a multi-basis adaptive network for speaker adaptation. Sun et al. (2017) proposed a deep domain adaptation approach for ASR. The works of Hsu et al. (2017) and Abdel-Hamid and Jiang (2013) designed different encoders and convolutional networks for robust speech recognition. For speech recognition on specific groups, the work of Shivakumar et al. (2014) proposed a voice adaptation method based on speech similarity for children's speech recognition. Kim et al. (2017) designed a KL-HMM adaptation method for speakers with dysarthria. The work of Liao (2013) proposed a context-dependent speaker adaptation method, and Huang et al. (2016) designed a unified speaker adaptation approach.

In TTS systems, transfer learning also plays an important role. Cooper et al. (2020) proposed a zero-shot transfer learning algorithm that transfers from multiple speakers to a new speaker. Chen et al. (2018d) adopted speaker embeddings to design an adaptive TTS system. The work of Jia et al. (2018) learned general features from speaker verification to capture the similarity between voices. Daher et al. (2019) proposed a song-to-song conversion system based on generative adversarial networks.

In voice conversion, transfer learning is also widely adopted. Sun et al. (2016) proposed a method that converts multi-speaker speech to a single speaker's voice without paired data. Liu et al. (2018) proposed a voice conversion system based on disentanglement. The work of Chen et al. (2018c) adopted adversarial training to convert unpaired speech data. Chou et al. (2019) used instance normalization to disentangle speaker and content features for one-shot voice conversion.

In addition to the challenges of data collection and non-i.i.d. data, there are also different accents within one language, which brings another challenge for speech. For example, many countries adopt English as an official language, but their English accents are diverse. Thus, how to perform good recognition and synthesis across different accents is a key research problem. Recently, Hou et al. (2021) designed a character-level distribution matching algorithm for cross-accent and cross-environment speech recognition to address this challenge. Hou et al. (2022) also designed an adapter-based transfer learning framework for cross-lingual low-resource speech recognition, exploring the implicit relations among languages via meta-learning and the explicit relations via adapter fusion. We hope there will be more research in this area in the future. In Chap. 17, we present code implementations of cross-domain and cross-lingual speech recognition.
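Distribution matching methods like the one mentioned above need a way to measure how far apart two domains are. A classic choice is the Maximum Mean Discrepancy (MMD). The character-level method of Hou et al. (2021) is far more elaborate; the following is only a generic numpy sketch on synthetic "feature" arrays (all variable names and data are illustrative assumptions, not their algorithm):

```python
import numpy as np

def mmd_rbf(X, Y, gamma=0.5):
    """Biased estimate of squared Maximum Mean Discrepancy with an RBF kernel.

    A small MMD suggests the two sample sets (e.g., acoustic features from
    two accents or recording environments) follow similar distributions.
    """
    def rbf(A, B):
        # Pairwise squared Euclidean distances, then the Gaussian kernel.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return rbf(X, X).mean() + rbf(Y, Y).mean() - 2 * rbf(X, Y).mean()

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(200, 8))  # stand-in for source-domain features
near = rng.normal(0.0, 1.0, size=(200, 8))    # same distribution as the source
far = rng.normal(1.5, 1.0, size=(200, 8))     # shifted distribution (a "new accent")

print(mmd_rbf(source, near))  # small
print(mmd_rbf(source, far))   # clearly larger
```

A transfer learning model can minimize such a divergence between source and target features during training, which is the core idea behind many feature-transformation methods discussed in Chap. 3.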


1 Introduction

1.6.4 Ubiquitous Computing and Human–Computer Interaction

Ubiquitous computing has gained increasing attention in recent years. Smartphones, smart watches, wearable devices, edge computing devices, etc. have greatly enriched people's lives. Ubiquitous computing has been widely applied to many areas, such as wearable activity recognition (Wang et al., 2018a), indoor location (Zou et al., 2017), and emotion recognition (Nguyen et al., 2018).

Ubiquitous computing focuses on daily life, which is full of dynamic change, and this dynamic change brings challenges to current machine learning algorithms. As stated by Mark Weiser, a pioneer of ubiquitous computing, ubiquitous computing means computing that exists anywhere and can run in any environment at any time. This requires machine learning models to be updated as times, locations, and lifestyles change. For instance, suppose a user has trained an activity recognition model using a smartwatch; this model will likely perform poorly on activity data collected from a smartphone, because the distributions of the signals from the two devices are different, resulting in model shift (Huang et al., 2007). Even on the same device, different users, wearing positions, and activity patterns also influence activity recognition. Thus, we need transfer learning algorithms to solve the non-i.i.d. issues in ubiquitous computing.

In wearable activity recognition, a series of works (Wang et al., 2018b; Wang et al., 2018a; Zhao et al., 2011; Khan and Roy, 2017) proposed methods for cross-person application scenarios to reduce the recognition error on new users. The work of Wang et al. (2018a) and Wang et al. (2018b) pointed out that even for the same user, when the sensor is placed on different body positions, the activity recognition model will shift and the accuracy will drop.
These works proposed stratified and deep learning algorithms to solve these problems. The devices used for activity recognition also differ, including accelerometers, gyroscopes, and magnetometers, and the differences among these devices also cause model shift. Figure 1.14a shows the signal difference for the same user at different body positions. For this problem, Morales and Roggen (2016) designed several experiments to study cross-sensor activity recognition. The work of Hu and Yang (2011) proposed mapping between sensors for better transferable activity recognition. Chen et al. (2017b) proposed cross-modal methods for activity recognition. For the class shift problem, Hu et al. (2011) proposed modeling the similarity between classes for cross-category activity recognition. In summary, for wearable activity recognition, transfer learning can help build cross-user, cross-device, and cross-position models.

Indoor location detects user positions from changes in signals such as Wi-Fi, which are highly dependent on the structure of the environment. Thus, building cross-room or cross-environment location models is important. To this end, the work of Pan et al. (2008), Zou et al. (2017), and Sun et al. (2008) proposed Wi-Fi-based transfer learning methods for indoor location, and Liu et al. (2010) proposed a 3D location method based on transfer learning. We can see that cross-environment adaptation is the key point in indoor location. Figure 1.14b shows that when a location device (AP, access point) is placed in different environments (e.g., research lab, hall, and corridors), its readings differ, leading to different model performance.

Fig. 1.14 Transfer learning applications in ubiquitous computing. (a) Signal difference in different positions (Wang et al., 2018a). (b) Performance change due to different indoor positions (Li et al., 2017a)

In human–computer interaction, transfer learning can also help build robust human–machine integration models for different users, interfaces, and contexts. For instance, the works of Nguyen et al. (2018), Nguyen et al. (2020), and Tu et al. (2019) designed a series of transferable emotion recognition methods for different users. Rathi (2018) proposed a transfer learning optimization solution for emotion recognition. He and Wu (2019) proposed aligning distributions in Euclidean space for brain–computer interfaces. Chao et al. (2019) adopted transfer learning for cross-view gait recognition.

To sum up, transfer learning can be applied to the dynamically changing environments of ubiquitous computing. When designing pervasive computing applications, we should pay attention to the changing factors, upon which we can design transfer learning algorithms for better adaptation and generalization performance. In Chap. 18, we implement feature extraction, source domain selection, and traditional and deep transfer learning for cross-domain human activity recognition.
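One simple way to reduce such device- or position-induced shift is to align the second-order statistics of source and target features. The sketch below implements the classic CORAL (correlation alignment) transformation on synthetic sensor-like features; it is a generic illustration under assumed data, not the specific stratified or deep methods cited above:

```python
import numpy as np

def matrix_power(C, p):
    """Fractional power of a symmetric positive-definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(w ** p) @ V.T

def coral(Xs, Xt, eps=1e-6):
    """CORrelation ALignment: whiten the source features, then re-color them
    with the target covariance so that second-order statistics match."""
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])
    return Xs @ matrix_power(Cs, -0.5) @ matrix_power(Ct, 0.5)

rng = np.random.default_rng(0)
# Source: accelerometer-like features from one wearing position (synthetic).
Xs = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
# Target: same user, different position -> differently correlated features.
Xt = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))

Xs_aligned = coral(Xs, Xt)
# After alignment, the source covariance matches the target covariance, so a
# classifier trained on the aligned source features transfers better to Xt.
```

The appeal of this kind of method in ubiquitous computing is that it needs no target labels and runs in closed form, which suits on-device adaptation.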

1.6.5 Healthcare

Medical healthcare is an important research field that is closely related to everyone. It is also a multidisciplinary area that connects computer science, medicine, biology, mathematics, statistics, chemistry, nursing science, pharmacology, and physiology. There is no doubt that doctors and nurses have made considerable contributions to global health during very difficult times such as SARS and COVID-19. The development of machine learning technology can also do much to advance medical healthcare.

Transfer learning has wide applications in medical healthcare. For brevity, we categorize the applications into the following areas: medical data analysis, medical process and patient management, drug discovery, and daily medical care. In each area, transfer learning plays a critical role.

The medical area contains multiple forms of data: images, videos, texts, voice, speech, tables, etc. Medical data also has its own characteristics. First, medical data is scarce. Unlike natural images or texts, medical data must be collected from patients with a certain disease and is therefore typically in few-shot form. Even for a specific disease, since patients differ in condition, nutrition status, and lifestyle, the symptoms of the same disease also vary. This brings a huge challenge for machine learning. Second, the data is non-renewable. For natural image classification, we can always create new samples using generative models when samples are few. But for medical healthcare, this is entangled in contradictions among ethics, medical research, and technology: can generated medical data be as meaningful for training as generated natural images? Will it be accepted by medical experts? These are all tough questions. Third, the privacy of patient data should be well protected. Today, we face increasingly strict privacy regulations, and medical data is among the most sensitive. Therefore, we should pay special attention to protecting patient data while handling it; for instance, we should not train models on such data without permission. Privacy regulations have made it more difficult to use the data for building models. Fourth, interpretation of the results should not be overlooked. In the medical area, a well-explained model can help doctors make more intelligent decisions.
Obviously, we should also pay attention to model explanation. Most work on medical data analysis addresses the first two challenges above, since almost all diseases face them. Medical image data is one of the most popular data types. For instance, as shown in Fig. 1.15, labeled samples of COVID-19 chest X-rays are rare. How can we leverage existing ordinary pneumonia samples to help build models for COVID-19 diagnosis?

Fig. 1.15 The labeled samples for COVID-19 chest X-ray are rare. How to leverage existing normal pneumonia samples to help build models for COVID-19 diagnosis? Images taken from an open-sourced dataset COVID-DA (Zhang et al., 2020)

The work of Manakov et al. (2019) used image transformation to denoise medical images, Prodanova et al. (2018) used transfer learning to classify cornea images, and the works of Cao et al. (2018) and Hu et al. (2019) used transfer learning for breast cancer classification. There are more efforts in this direction, such as cancer slice and survival analysis (Cabezas et al., 2018), medical image analysis by self-ensembling (Perone et al., 2019), cardiac slice analysis using domain adaptation (Dou et al., 2018), prostate data analysis (Ren et al., 2018), brain MRI analysis (Giacomello et al., 2019), chest X-ray analysis (Chen et al., 2018a), sclerosis (Valverde et al., 2019), 3D image analysis (Chen et al., 2019b), and retinal images (Yu et al., 2019). Recently, the journal Cell reported that joint research teams from Guangzhou Women and Children's Medical Center and the University of California, San Diego, led by Professor Kang Zhang, developed an artificial intelligence system that can diagnose eye disease and pneumonia with accuracy comparable to doctors (Kermany et al., 2018). This research is a successful case of transfer learning in medicine and healthcare. Notably, some researchers have applied transfer learning to COVID-19 data. For instance, Zhang et al. (2020) used domain adaptation methods to transfer from ordinary pneumonia to COVID-19 to help diagnose the disease when labeled data is insufficient. Yu et al. (2020) also leveraged such data to perform automatic transfer learning for unsupervised domain adaptation. Apart from image data, Gupta et al. (2020) used transfer learning for medical time series analysis. Kachuee et al. (2018) proposed transfer learning methods for electrocardiogram (ECG) data. Similarly, the work of Salem et al. (2018) adopted transfer learning for the diagnosis of arrhythmia.

For medical process and patient management, Yu et al. (2018) started from small datasets and designed an operation stage detection method for managing surgical processes and guidance. Suresh et al. (2018) proposed a multi-task network structure for patient data management. The work of Newman-Griffis and Zirikly (2018) explored medical named entity recognition in low-resource settings. The work of Rezaei et al. (2018) explored the challenges of imbalanced medical data. Researchers have also designed domain adaptation systems for pandemic prediction based on context information (Rehman et al., 2018). For drug discovery, Ye et al. (2018) used ensemble transfer learning and multi-task learning to explore parameter prediction in drug discovery. Some chronic diseases, such as Parkinson's disease, Alzheimer's disease, and small vessel disease, require careful daily care. Marinescu et al. (2019) proposed deep transfer learning algorithms for chronic diseases. Phan et al. (2019) proposed a deep transfer learning-based sleep stage detection solution. The work of Venkataramani et al. (2018) proposed a continuous domain adaptation algorithm for healthcare to perform better in complex environments. For data privacy protection, Chen et al. (2020) proposed a Parkinson's disease healthcare system based on federated learning. In Chap. 19, we show how to use federated learning and transfer learning for healthcare applications in practice.
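Many of the medical studies above follow a pre-train-then-fine-tune recipe: learn a model on abundant source data (e.g., ordinary pneumonia) and warm-start training on a few target samples (e.g., COVID-19). The following is only a toy sketch of that recipe with a logistic regression on synthetic features; all names and data are illustrative assumptions, not any cited method:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(X, y, w):
    p = np.clip(sigmoid(X @ w), 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def train_logreg(X, y, w=None, lr=0.1, steps=200):
    """Logistic regression by gradient descent; `w` warm-starts from a source model."""
    w = np.zeros(X.shape[1]) if w is None else w.copy()
    for _ in range(steps):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

rng = np.random.default_rng(0)

# Source domain: plenty of labeled samples (a stand-in for ordinary pneumonia features).
Xs = rng.normal(size=(500, 10))
w_true_s = rng.normal(size=10)
ys = (Xs @ w_true_s > 0).astype(float)

# Target domain: only 20 labeled samples with a related but shifted decision rule
# (a stand-in for scarce COVID-19 cases).
Xt = rng.normal(size=(20, 10))
w_true_t = w_true_s + 0.3 * rng.normal(size=10)
yt = (Xt @ w_true_t > 0).astype(float)

w_source = train_logreg(Xs, ys)                          # pre-train on the source domain
w_transfer = train_logreg(Xt, yt, w=w_source, steps=50)  # fine-tune on few target samples

print(log_loss(Xt, yt, w_source), log_loss(Xt, yt, w_transfer))
```

Because the source and target decision rules are related, the warm start gives fine-tuning a much better initialization than random weights, which is exactly why pre-trained models are so valuable in few-shot medical settings.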


1.6.6 Other Applications

There are other areas that use transfer learning to achieve their own goals. More physics research has begun to adopt transfer learning for cross-domain problems. Baalouch et al. (2019) designed a sim-to-real transfer learning system for high-energy physics experiments. Mari et al. (2019) systematically studied the application of transfer learning in hybrid classical–quantum neural networks. The work of Humbird et al. (2019) used transfer learning to model inertial confinement fusion experiments.

In finance, Zhu et al. (2020) modeled the order of user activities to build a hierarchical and interpretable system for cross-domain fraud detection. In transportation, Mallick et al. (2020) combined graph networks and transfer learning for accurate traffic prediction. The work of Xu et al. (2019) designed a target-oriented transfer learning algorithm to better learn traffic light rules. Bai et al. (2019) used a multi-task convolutional recurrent network to forecast passenger demand. The work of Milhomem et al. (2019) used deep weighted transfer learning to detect driver fatigue. In energy, Covas (2020) adopted transfer learning for spatial–temporal forecasting of the solar magnetic field. The work of Li et al. (2020) used deep transfer learning for thermal error modeling across devices. In recommendation systems, Saito (2019) introduced domain adaptation methods for offline recommendation to increase robustness on cross-domain data. The work of Chen et al. (2019a) designed a multi-task learning method for cold-start problems. In pandemic control, the work of Appelgren et al. (2019) built a transfer learning model using users' online tweets to train an early-warning model for epidemics. In community management, researchers proposed a transfer learning method from monitored to unmonitored areas for parking space prediction (Ionita et al., 2019). In agriculture, Nguyen et al. (2019) proposed a temporal–spatial transfer learning algorithm to predict cotton yield, and Sun and Wei (2020) utilized transfer learning to help detect corn diseases. In communication, Ahmed et al. (2019) proposed a transfer and meta-learning method for user churn prediction. In astronomy, Vilalta et al. (2019) designed similarity-based transfer learning algorithms for supernova classification and Mars landform identification, and the work of Ackermann et al. (2018) used transfer learning to detect galaxy mergers. In software engineering, Chen (2018) proposed a transfer learning-based malware detection algorithm.

Transfer learning also has wide applications in reinforcement learning. Carr et al. (2018) applied domain adaptation to Atari games in reinforcement learning. Taylor and Stone (2009) gave a systematic overview of transfer reinforcement learning. Parisotto et al. (2015) proposed an actor-mimic deep multi-task and transfer reinforcement learning method. The work of Boutsioukis et al. (2011) studied transfer learning in multi-agent reinforcement learning systems. The work of Gamrian and Goldberg (2019) used image-to-image translation to transfer between related reinforcement learning tasks. Gupta et al. (2017) proposed using transfer learning to learn latent invariant representations and then perform knowledge transfer.

In online education, Ding et al. (2019) used representation transfer learning to analyze student activities in massive open online courses. In bank security, Oliveira et al. (2020) proposed a cross-domain deep facial matching algorithm to enhance the security of bank facial recognition systems. There are also applications in graph data mining (Lee et al., 2017). He et al. (2016) studied a transfer learning framework based on graph neural networks. Tang et al. (2019) implemented defenses against adversarial attacks on graph neural networks. Dai et al. (2019) used adversarial domain adaptation and graph convolutions to implement graph transfer learning. Other researchers used transfer learning to implement rule mining for knowledge graphs (Omran et al., 2019) and used active learning for graph data (Hu et al., 2019).

In this section, we have shown only some cases of transfer learning in several areas. Transfer learning has been successfully applied to a wide range of areas, and we hope there will be more successful applications in the future.

References

Abdel-Hamid, O. and Jiang, H. (2013). Rapid and effective speaker adaptation of convolutional neural network based models for speech recognition. In INTERSPEECH, pages 1248–1252. Ackermann, S., Schawinski, K., Zhang, C., Weigel, A. K., and Turp, M. D. (2018). Using transfer learning to detect galaxy mergers. Monthly Notices of the Royal Astronomical Society, 479(1):415–425. Ahmed, U., Khan, A., Khan, S. H., Basit, A., Haq, I. U., and Lee, Y. S. (2019). Transfer learning and meta classification based deep churn prediction system for telecom industry. arXiv preprint arXiv:1901.06091. Appelgren, M., Schrempf, P., Falis, M., Ikeda, S., and O’Neil, A. Q. (2019). Language transfer for early warning of epidemics from social media. arXiv preprint arXiv:1910.04519. Baalouch, M., Defurne, M., Poli, J.-P., and Cherrier, N. (2019). Sim-to-real domain adaptation for high energy physics. arXiv preprint arXiv:1912.08001. Bai, L., Yao, L., Kanhere, S. S., Yang, Z., Chu, J., and Wang, X. (2019). Passenger demand forecasting with multi-task convolutional recurrent neural networks. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 29–42. Springer. Bengio, Y., Lecun, Y., and Hinton, G. (2021). Deep learning for AI. Communications of the ACM, 64(7):58–65. Bertoldi, N. and Federico, M. (2009). Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of the fourth workshop on statistical machine translation, pages 182–189. Bian, W., Tao, D., and Rui, Y. (2011). Cross-domain human action recognition. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(2):298–307. Blanchard, G., Lee, G., and Scott, C. (2011). Generalizing from several related classification tasks to a new unlabeled sample. Advances in neural information processing systems, 24.


Blitzer, J., McDonald, R., and Pereira, F. (2006). Domain adaptation with structural correspondence learning. In EMNLP, pages 120–128. Boutsioukis, G., Partalas, I., and Vlahavas, I. (2011). Transfer learning in multi-agent reinforcement learning domains. In European Workshop on Reinforcement Learning, pages 249–260. Springer. Bray, C. W. (1928). Transfer of learning. Journal of Experimental Psychology, 11(6):443. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. In NeurIPS. Cabezas, M., Valverde, S., González-Villà, S., Clérigues, A., Salem, M., Kushibar, K., Bernal, J., Oliver, A., and Lladó, X. (2018). Survival prediction using ensemble tumor segmentation and transfer learning. arXiv preprint arXiv:1810.04274. Cao, H., Bernard, S., Heutte, L., and Sabourin, R. (2018). Improve the performance of transfer learning without fine-tuning using dissimilarity-based multi-view learning for breast cancer histology images. In International conference image analysis and recognition, pages 779–787. Springer. Carr, T., Chli, M., and Vogiatzis, G. (2018). Domain adaptation for reinforcement learning on the Atari. arXiv preprint arXiv:1812.07452. Chao, H., He, Y., Zhang, J., and Feng, J. (2019). GaitSet: Regarding gait as a set for cross-view gait recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8126–8133. Chen, B., Cherry, C., Foster, G., and Larkin, S. (2017a). Cost weighting for neural machine translation domain adaptation. In Proceedings of the First Workshop on Neural Machine Translation, pages 40–46. Chen, B. and Huang, F. (2016). Semi-supervised convolutional networks for translation adaptation with tiny amount of in-domain data. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 314–323. Chen, C., Dou, Q., Chen, H., and Heng, P.-A. (2018a). 
Semantic-aware generative adversarial nets for unsupervised domain adaptation in chest x-ray segmentation. In International workshop on machine learning in medical imaging, pages 143–151. Springer. Chen, D., Ong, C. S., and Menon, A. K. (2019a). Cold-start playlist recommendation with multitask learning. arXiv preprint arXiv:1901.06125. Chen, H., Cui, S., and Li, S. (2017b). Application of transfer learning approaches in multimodal wearable human activity recognition. arXiv preprint arXiv:1707.02412. Chen, H., Wang, Y., Wang, G., and Qiao, Y. (2018b). LSTD: A low-shot transfer detector for object detection. In Thirty-Second AAAI Conference on Artificial Intelligence. Chen, L. (2018). Deep transfer learning for static malware classification. arXiv preprint arXiv:1812.07606. Chen, L.-W., Lee, H.-Y., and Tsao, Y. (2018c). Generative adversarial networks for unpaired voice transformation on impaired speech. arXiv preprint arXiv:1810.12656. Chen, S., Ma, K., and Zheng, Y. (2019b). Med3d: Transfer learning for 3d medical image analysis. arXiv preprint arXiv:1904.00625. Chen, Y., Assael, Y., Shillingford, B., Budden, D., Reed, S., Zen, H., Wang, Q., Cobo, L. C., Trask, A., Laurie, B., et al. (2018d). Sample efficient adaptive text-to-speech. arXiv preprint arXiv:1809.10460. Chen, Y., Li, W., Sakaridis, C., Dai, D., and Van Gool, L. (2018e). Domain adaptive faster R-CNN for object detection in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3339–3348. Chen, Y., Qin, X., Wang, J., Yu, C., and Gao, W. (2020). FedHealth: A federated transfer learning framework for wearable healthcare. IEEE Intelligent Systems, 35(4):83–93. Chou, J.-c., Yeh, C.-c., and Lee, H.-y. (2019). One-shot voice conversion by separating speaker and content representations with instance normalization. arXiv preprint arXiv:1904.05742. Cooper, E., Lai, C.-I., Yasuda, Y., Fang, F., Wang, X., Chen, N., and Yamagishi, J. (2020). 
Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings. In ICASSP


2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6184–6188. IEEE. Covas, E. (2020). Transfer learning in spatial–temporal forecasting of the solar magnetic field. Astronomische Nachrichten. Cui, W., Zheng, G., Shen, Z., Jiang, S., and Wang, W. (2019). Transfer learning for sequences via learning to collocate. arXiv preprint arXiv:1902.09092. Daher, R., Zein, M. K., Zini, J. E., Awad, M., and Asmar, D. (2019). Change your singer: a transfer learning generative adversarial framework for song to song conversion. arXiv preprint arXiv:1911.02933. Dai, Q., Shen, X., Wu, X.-M., and Wang, D. (2019). Network transfer learning via adversarial domain adaptation with graph convolution. arXiv preprint arXiv:1909.01541. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A largescale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL. Diba, A., Fayyaz, M., Sharma, V., Karami, A. H., Arzani, M. M., Yousefzadeh, R., and Van Gool, L. (2017). Temporal 3D ConvNets: New architecture and transfer learning for video classification. arXiv preprint arXiv:1711.08200. Ding, M., Wang, Y., Hemberg, E., and O’Reilly, U.-M. (2019). Transfer learning using representation learning in massive open online courses. In Proceedings of the 9th International Conference on Learning Analytics & Knowledge, pages 145–154. Dou, Q., Ouyang, C., Chen, C., Chen, H., Glocker, B., Zhuang, X., and Heng, P.-A. (2018). PnP-AdaNet: Plug-and-play adversarial domain adaptation network with a benchmark at crossmodality cardiac segmentation. arXiv preprint arXiv:1812.07907. Feng, J., Huang, M., Zhao, L., Yang, Y., and Zhu, X. (2018). Reinforcement learning for relation classification from noisy data. 
In Thirty-Second AAAI Conference on Artificial Intelligence. Gamrian, S. and Goldberg, Y. (2019). Transfer learning for related reinforcement learning tasks via image-to-image translation. In International Conference on Machine Learning, pages 2063– 2072. Gatys, L. A., Ecker, A. S., and Bethge, M. (2016). Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2414–2423. Giacomello, E., Loiacono, D., and Mainardi, L. (2019). Transfer brain MRI tumor segmentation models across modalities with adversarial networks. arXiv preprint arXiv:1910.02717. Giel, A. and Diaz, R. (2015). Recurrent neural networks and transfer learning for action recognition. Goussies, N. A., Ubalde, S., Fernández, F. G., and Mejail, M. E. (2014). Optical character recognition using transfer learning decision forests. In 2014 IEEE International Conference on Image Processing (ICIP), pages 4309–4313. IEEE. Grave, E., Obozinski, G., and Bach, F. (2013). Domain adaptation for sequence labeling using hidden Markov models. arXiv preprint arXiv:1312.4092. Gupta, A., Devin, C., Liu, Y., Abbeel, P., and Levine, S. (2017). Learning invariant feature spaces to transfer skills with reinforcement learning. arXiv preprint arXiv:1703.02949. Gupta, P., Malhotra, P., Narwariya, J., Vig, L., and Shroff, G. (2020). Transfer learning for clinical time series analysis using deep neural networks. Journal of Healthcare Informatics Research, 4(2):112–137. Gupta, S. and Raghavan, P. (2004). Adaptation of speech models in speech recognition. US Patent App. 10/447,906. Gururangan, S., Marasovi´c, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., and Smith, N. A. (2020). Don’t stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964. He, H. and Wu, D. (2019). Transfer learning for brain–computer interfaces: A Euclidean space data alignment approach. 
IEEE Transactions on Biomedical Engineering, 67(2):399–410.


He, J., Lawrence, R. D., and Liu, Y. (2016). Graph-based transfer learning. US Patent 9,477,929. Hou, W., Wang, J., Tan, X., Qin, T., and Shinozaki, T. (2021). Cross-domain speech recognition with unsupervised character-level distribution matching. In Interspeech. Hou, W., Zhu, H., Wang, Y., Wang, J., Qin, T., Xu, R., and Shinozaki, T. (2022). Exploiting adapters for cross-lingual low-resource speech recognition. IEEE Transactions on Audio, Speech and Language Processing (TASLP). Hsu, W.-N., Zhang, Y., and Glass, J. (2017). Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 16–23. IEEE. Hu, D. H. and Yang, Q. (2011). Transfer learning for activity recognition via sensor mapping. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, volume 22, page 1962, Barcelona, Catalonia, Spain. IJCAI. Hu, D. H., Zheng, V. W., and Yang, Q. (2011). Cross-domain activity recognition via transfer learning. Pervasive and Mobile Computing, 7(3):344–358. Hu, Q., Whitney, H. M., and Giger, M. L. (2019). Transfer learning in 4d for breast cancer diagnosis using dynamic contrast-enhanced magnetic resonance imaging. arXiv preprint arXiv:1911.03022. Huang, J., Smola, A. J., Gretton, A., Borgwardt, K. M., Schölkopf, B., et al. (2007). Correcting sample selection bias by unlabeled data. Advances in neural information processing systems, 19:601. Huang, X. and Belongie, S. (2017). Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, pages 1501–1510. Huang, Z., Siniscalchi, S. M., and Lee, C.-H. (2016). A unified approach to transfer learning of deep neural networks with applications to speaker adaptation in automatic speech recognition. Neurocomputing, 218:448–459. Humbird, K. D., Peterson, J. L., Spears, B., and McClarren, R. (2019). 
Transfer learning to model inertial confinement fusion experiments. IEEE Transactions on Plasma Science, 48(1):61–70. Inoue, N., Furuta, R., Yamasaki, T., and Aizawa, K. (2018). Cross-domain weakly-supervised object detection through progressive domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5001–5009. Ionita, A., Pomp, A., Cochez, M., Meisen, T., and Decker, S. (2019). Transferring knowledge from monitored to unmonitored areas for forecasting parking spaces. International Journal on Artificial Intelligence Tools, 28(06):1960003. Jia, C., Kong, Y., Ding, Z., and Fu, Y. R. (2014). Latent tensor transfer learning for RGB-D action recognition. In Proceedings of the 22nd ACM international conference on Multimedia, pages 87–96. Jia, Y., Zhang, Y., Weiss, R., Wang, Q., Shen, J., Ren, F., Nguyen, P., Pang, R., Moreno, I. L., Wu, Y., et al. (2018). Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Advances in neural information processing systems, pages 4480–4490. Jiang, J. and Zhai, C. (2007). Instance weighting for domain adaptation in nlp. In Proceedings of the 45th annual meeting of the association of computational linguistics, pages 264–271. Kachuee, M., Fazeli, S., and Sarrafzadeh, M. (2018). ECG heartbeat classification: A deep transferable representation. In 2018 IEEE International Conference on Healthcare Informatics (ICHI), pages 443–444. IEEE. Kamnitsas, K., Baumgartner, C., Ledig, C., Newcombe, V., Simpson, J., Kane, A., Menon, D., Nori, A., Criminisi, A., Rueckert, D., et al. (2017). Unsupervised domain adaptation in brain lesion segmentation with adversarial networks. In International conference on information processing in medical imaging, pages 597–609. Springer. Kermany, D. S., Goldbaum, M., Cai, W., Valentim, C. C., Liang, H., Baxter, S. L., McKeown, A., Yang, G., Wu, X., Yan, F., et al. (2018). 
Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell, 172(5):1122–1131. Khan, M. A. A. H. and Roy, N. (2017). Transact: Transfer learning enabled activity recognition. In 2017 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops), pages 545–550. IEEE.


Kim, M., Kim, Y., Yoo, J., Wang, J., and Kim, H. (2017). Regularized speaker adaptation of KL-HMM for dysarthric speech recognition. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 25(9):1581–1591. Lee, J., Kim, H., Lee, J., and Yoon, S. (2017). Transfer learning for deep learning on graphstructured data. In AAAI, pages 2154–2160. Li, B., Wang, X., and Beigi, H. (2019a). Cantonese automatic speech recognition using transfer learning from mandarin. arXiv preprint arXiv:1911.09271. Li, P., Lou, P., Yan, J., and Liu, N. (2020). The thermal error modeling with deep transfer learning. In Journal of Physics: Conference Series, volume 1576, page 012003. IOP Publishing. Li, X., Chen, Y., Wu, Z., Peng, X., Wang, J., Hu, L., and Yu, D. (2017a). Weak multipath effect identification for indoor distance estimation. In UIC, pages 1–8. IEEE. Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., and Yang, M.-H. (2017b). Universal style transfer via feature transforms. In Advances in neural information processing systems, pages 386–396. Li, Y., Yuan, L., and Vasconcelos, N. (2019b). Bidirectional learning for domain adaptation of semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6936–6945. Liao, H. (2013). Speaker adaptation of context dependent deep neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 7947–7951. IEEE. Lim, J. J., Salakhutdinov, R. R., and Torralba, A. (2011). Transfer learning by borrowing examples for multiclass object detection. In Advances in neural information processing systems, pages 118–126. Liu, J., Chen, Y., and Zhang, Y. (2010). Transfer regression model for indoor 3d location estimation. In International Conference on Multimedia Modeling, pages 603–613. Springer. Liu, J., Shah, M., Kuipers, B., and Savarese, S. (2011). Cross-view action recognition via view knowledge transfer. 
In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3209–3216, Colorado Springs, CO, USA. IEEE. Liu, M., Song, Y., Zou, H., and Zhang, T. (2019). Reinforced training data selection for domain adaptation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1957–1968. Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig, G. (2021). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586. Liu, S., Zhong, J., Sun, L., Wu, X., Liu, X., and Meng, H. (2018). Voice conversion across arbitrary speakers based on a single target-speaker utterance. In Interspeech, pages 496–500. Luan, F., Paris, S., Shechtman, E., and Bala, K. (2017). Deep photo style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4990–4998. Luo, Y., Zheng, L., Guan, T., Yu, J., and Yang, Y. (2019). Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2507–2516. Mallick, T., Balaprakash, P., Rask, E., and Macfarlane, J. (2020). Transfer learning with graph neural networks for short-term highway traffic forecasting. arXiv preprint arXiv:2004.08038. Manakov, I., Rohm, M., Kern, C., Schworm, B., Kortuem, K., and Tresp, V. (2019). Noise as domain shift: Denoising medical images by unpaired image translation. In Domain Adaptation and Representation Transfer and Medical Image Learning with Less Labels and Imperfect Data, pages 3–10. Springer. Mari, A., Bromley, T. R., Izaac, J., Schuld, M., and Killoran, N. (2019). Transfer learning in hybrid classical-quantum neural networks. arXiv preprint arXiv:1912.08278. Marinescu, R. V., Lorenzi, M., Blumberg, S. B., Young, A. L., Planell-Morell, P., Oxtoby, N. P., Eshaghi, A., Yong, K. X., Crutch, S. 
J., Golland, P., et al. (2019). Disease knowledge transfer across neurodegenerative diseases. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 860–868. Springer. McClosky, D., Charniak, E., and Johnson, M. (2010). Automatic domain adaptation for parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter


of the Association for Computational Linguistics, pages 28–36. Association for Computational Linguistics. Milhomem, S., Almeida, T. d. S., da Silva, W. G., da Silva, E. M., and de Carvalho, R. L. (2019). Weightless neural network with transfer learning to detect distress in asphalt. arXiv preprint arXiv:1901.03660. Morales, F. J. O. and Roggen, D. (2016). Deep convolutional feature transfer across mobile activity recognition domains, sensor modalities and locations. In Proceedings of the 2016 ACM International Symposium on Wearable Computers, pages 92–99. Newman-Griffis, D. and Zirikly, A. (2018). Embedding transfer for low-resource medical named entity recognition: a case study on patient mobility. arXiv preprint arXiv:1806.02814. Nguyen, D., Nguyen, K., Sridharan, S., Abbasnejad, I., Dean, D., and Fookes, C. (2018). Meta transfer learning for facial emotion recognition. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 3543–3548. IEEE. Nguyen, D., Sridharan, S., Nguyen, D. T., Denman, S., Tran, S. N., Zeng, R., and Fookes, C. (2020). Joint deep cross-domain transfer learning for emotion recognition. arXiv preprint arXiv:2003.11136. Nguyen, L. H., Zhu, J., Lin, Z., Du, H., Yang, Z., Guo, W., and Jin, F. (2019). Spatial-temporal multi-task learning for within-field cotton yield prediction. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 343–354. Springer. Oliveira, J. S., Souza, G. B., Rocha, A. R., Deus, F. E., and Marana, A. N. (2020). Crossdomain deep face matching for real banking security systems. In 2020 Seventh International Conference on eDemocracy & eGovernment (ICEDEG), pages 21–28. IEEE. Omran, P. G., Wang, Z., and Wang, K. (2019). Knowledge graph rule mining via transfer learning. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 489– 500. Springer. Pan, S. J., Kwok, J. T., and Yang, Q. (2008). Transfer learning via dimensionality reduction. 
In Proceedings of the 23rd AAAI conference on Artificial intelligence, volume 8, pages 677–682. Pan, S. J., Tsang, I. W., Kwok, J. T., and Yang, Q. (2011). Domain adaptation via transfer component analysis. IEEE TNN, 22(2):199–210. Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE TKDE, 22(10):1345–1359. Parisotto, E., Ba, J. L., and Salakhutdinov, R. (2015). Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342. Peng, N. and Dredze, M. (2016). Multi-task domain adaptation for sequence tagging. arXiv preprint arXiv:1608.02689. Perone, C. S., Ballester, P., Barros, R. C., and Cohen-Adad, J. (2019). Unsupervised domain adaptation for medical imaging segmentation with self-ensembling. NeuroImage, 194:1–11. Phan, H., Chén, O. Y., Koch, P., Mertins, A., and De Vos, M. (2019). Deep transfer learning for single-channel automatic sleep staging with channel mismatch. In 2019 27th European Signal Processing Conference (EUSIPCO), pages 1–5. IEEE. Poncelas, A., Wenniger, G. M. d. B., and Way, A. (2019). Transductive data-selection algorithms for fine-tuning neural machine translation. arXiv preprint arXiv:1908.09532. Prodanova, N., Stegmaier, J., Allgeier, S., Bohn, S., Stachs, O., Köhler, B., Mikut, R., and Bartschat, A. (2018). Transfer learning with human corneal tissues: An analysis of optimal cut-off layer. arXiv preprint arXiv:1806.07073. Qu, C., Ji, F., Qiu, M., Yang, L., Min, Z., Chen, H., Huang, J., and Croft, W. B. (2019). Learning to selectively transfer: Reinforced transfer learning for deep text matching. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 699–707. Quiñonero-Candela, J., Sugiyama, M., Lawrence, N. D., and Schwaighofer, A. (2009). Dataset shift in machine learning. MIT Press. Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving language understanding by generative pre-training. 
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.


Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. (2019). Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. Rahmani, H. and Mian, A. (2015). Learning a non-linear knowledge transfer model for cross-view action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2458–2466. Raj, A., Namboodiri, V. P., and Tuytelaars, T. (2015). Subspace alignment based domain adaptation for RCNN detector. arXiv preprint arXiv:1507.05578. Rathi, D. (2018). Optimization of transfer learning for sign language recognition targeting mobile platform. arXiv preprint arXiv:1805.06618. Rehman, N. A., Aliapoulios, M. M., Umarwani, D., and Chunara, R. (2018). Domain adaptation for infection prediction from symptoms based on data from different study designs and contexts. arXiv preprint arXiv:1806.08835. Ren, J., Hacihaliloglu, I., Singer, E. A., Foran, D. J., and Qi, X. (2018). Adversarial domain adaptation for classification of prostate histopathology whole-slide images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 201– 209. Springer. Rezaei, M., Yang, H., and Meinel, C. (2018). Multi-task generative adversarial network for handling imbalanced clinical data. arXiv preprint arXiv:1811.10419. Ruder, S. and Plank, B. (2017). Learning to select data for transfer learning with Bayesian optimization. arXiv preprint arXiv:1707.05246. Saito, Y. (2019). Unsupervised domain adaptation meets offline recommender learning. arXiv preprint arXiv:1910.07295. Salem, M., Taheri, S., and Yuan, J.-S. (2018). ECG arrhythmia classification using transfer learning from 2-dimensional deep CNN features. In 2018 IEEE Biomedical Circuits and Systems Conference (BioCAS), pages 1–4. IEEE. Sankaranarayanan, S., Balaji, Y., Jain, A., Lim, S. N., and Chellappa, R. (2017). 
Unsupervised domain adaptation for semantic segmentation with GANs. arXiv preprint arXiv:1711.06969, 2:2. Sargano, A. B., Wang, X., Angelov, P., and Habib, Z. (2017). Human action recognition using transfer learning with deep representations. In 2017 International joint conference on neural networks (IJCNN), pages 463–469. IEEE. Shi, Z., Siva, P., and Xiang, T. (2017). Transfer learning by ranking for weakly supervised object annotation. arXiv preprint arXiv:1705.00873. Shivakumar, P. G., Potamianos, A., Lee, S., and Narayanan, S. S. (2014). Improving speech recognition for children using acoustic adaptation and pronunciation modeling. In WOCCI, pages 15–19. Sun, B. and Saenko, K. (2014). From virtual to reality: Fast adaptation of virtual object detectors to real domains. In BMVC, volume 1, page 3. Sun, L., Li, K., Wang, H., Kang, S., and Meng, H. (2016). Phonetic posteriorgrams for many-toone voice conversion without parallel data training. In 2016 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE. Sun, S., Zhang, B., Xie, L., and Zhang, Y. (2017). An unsupervised deep domain adaptation approach for robust speech recognition. Neurocomputing, 257:79–87. Sun, X. and Wei, J. (2020). Identification of maize disease based on transfer learning. In Journal of Physics: Conference Series, volume 1437, page 012080. IOP Publishing. Sun, Z., Chen, Y., Qi, J., and Liu, J. (2008). Adaptive localization through transfer learning in indoor Wi-Fi environment. In 2008 Seventh International Conference on Machine Learning and Applications, pages 331–336. IEEE. Suresh, H., Gong, J. J., and Guttag, J. V. (2018). Learning tasks for multitask learning: Heterogenous patient populations in the ICU. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 802–810. Tang, X., Li, Y., Sun, Y., Yao, H., Mitra, P., and Wang, S. (2019). Robust graph neural network against poisoning attacks via transfer learning. 
arXiv preprint arXiv:1908.07558.


Tang, Y., Peng, L., Xu, Q., Wang, Y., and Furuhata, A. (2016). CNN based transfer learning for historical Chinese character recognition. In 2016 12th IAPR Workshop on Document Analysis Systems (DAS), pages 25–29. IEEE. Taylor, M. E. and Stone, P. (2009). Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(Jul):1633–1685. Tsai, Y.-H., Hung, W.-C., Schulter, S., Sohn, K., Yang, M.-H., and Chandraker, M. (2018). Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7472–7481. Tu, G., Fu, Y., Li, B., Gao, J., Jiang, Y.-G., and Xue, X. (2019). A multi-task neural approach for emotion attribution, classification, and summarization. IEEE Transactions on Multimedia, 22(1):148–159. Tzu, C. (2006). The Book of Chuang Tzu. Penguin UK. Valverde, S., Salem, M., Cabezas, M., Pareto, D., Vilanova, J. C., Ramió-Torrentà, L., Rovira, À., Salvi, J., Oliver, A., and Lladó, X. (2019). One-shot domain adaptation in multiple sclerosis lesion segmentation using convolutional neural networks. NeuroImage: Clinical, 21:101638. Vanschoren, J. (2018). Meta-learning: A survey. arXiv preprint arXiv:1810.03548. Venkataramani, R., Ravishankar, H., and Anamandra, S. (2018). Towards continuous domain adaptation for healthcare. arXiv preprint arXiv:1812.01281. Venkateswara, H., Eusebio, J., Chakraborty, S., and Panchanathan, S. (2017). Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027. Vilalta, R., Gupta, K. D., Boumber, D., and Meskhi, M. M. (2019). A general approach to domain adaptation with applications in astronomy. Publications of the Astronomical Society of the Pacific, 131(1004):108008. Waley, A. et al. (2005). The analects of Confucius, volume 28. Psychology Press. Wang, J., Chen, Y., Hu, L., Peng, X., and Yu, P. S. (2018a). 
Stratified transfer learning for crossdomain activity recognition. In 2018 IEEE International Conference on Pervasive Computing and Communications (PerCom). Wang, J., Lan, C., Liu, C., Ouyang, Y., Zeng, W., and Qin, T. (2021). Generalizing to unseen domains: A survey on domain generalization. In IJCAI Survey Track. Wang, J., Zheng, V. W., Chen, Y., and Huang, M. (2018b). Deep transfer learning for cross-domain activity recognition. In proceedings of the 3rd International Conference on Crowd Science and Engineering, pages 1–8. Wang, R., Utiyama, M., Liu, L., Chen, K., and Sumita, E. (2017). Instance weighting for neural machine translation domain adaptation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1482–1488. Wang, Z., Bi, W., Wang, Y., and Liu, X. (2019). Better fine-tuning via instance weighting for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7241–7248. Weiss, K., Khoshgoftaar, T. M., and Wang, D. (2016). A survey of transfer learning. Journal of Big data, 3(1):1–40. Wenzel, P., Khan, Q., Cremers, D., and Leal-Taixé, L. (2018). Modular vehicle control for transferring semantic information between weather conditions using GANs. arXiv preprint arXiv:1807.01001. Woodworth, R. S. and Thorndike, E. (1901). The influence of improvement in one mental function upon the efficiency of other functions.(i). Psychological review, 8(3):247. Wu, C. and Gales, M. J. (2015). Multi-basis adaptive neural network for rapid adaptation in speech recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4315–4319. IEEE. Wu, F. and Huang, Y. (2016). Sentiment domain adaptation with multiple sources. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 301–310.


Wu, X., Wang, H., Liu, C., and Jia, Y. (2013). Cross-view action recognition over heterogeneous feature spaces. In Proceedings of the IEEE International Conference on Computer Vision, pages 609–616. Xie, M., Jean, N., Burke, M., Lobell, D., and Ermon, S. (2016). Transfer learning from deep features for remote sensing and poverty mapping. In Thirtieth AAAI Conference on Artificial Intelligence. Xu, N., Zheng, G., Xu, K., Zhu, Y., and Li, Z. (2019). Targeted knowledge transfer for learning traffic signal plans. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 175–187. Springer. Xue, S., Abdel-Hamid, O., Jiang, H., Dai, L., and Liu, Q. (2014). Fast adaptation of deep neural network based on discriminant codes for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):1713–1725. Yang, Z., Salakhutdinov, R., and Cohen, W. W. (2017). Transfer learning for sequence tagging with hierarchical recurrent networks. arXiv preprint arXiv:1703.06345. Yao, K., Yu, D., Seide, F., Su, H., Deng, L., and Gong, Y. (2012). Adaptation of context-dependent deep neural networks for automatic speech recognition. In 2012 IEEE Spoken Language Technology Workshop (SLT), pages 366–369. IEEE. Ye, Z., Yang, Y., Li, X., Cao, D., and Ouyang, D. (2018). An integrated transfer learning and multitask learning approach for pharmacokinetic parameter prediction. Molecular pharmaceutics, 16(2):533–541. Yu, C., Wang, J., Liu, C., Qin, T., Xu, R., Feng, W., Chen, Y., and Liu, T.-Y. (2020). Learning to match distributions for domain adaptation. arXiv preprint arXiv:2007.10791. Yu, D., Yao, K., Su, H., Li, G., and Seide, F. (2013). KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 7893–7897. IEEE. Yu, F., Zhao, J., Gong, Y., Wang, Z., Li, Y., Yang, F., Dong, B., Li, Q., and Zhang, L. (2019). 
Annotation-free cardiac vessel segmentation via knowledge transfer from retinal images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 714–722. Springer. Yu, T., Mutter, D., Marescaux, J., and Padoy, N. (2018). Learning from a tiny dataset of manual annotations: a teacher/student approach for surgical phase recognition. arXiv preprint arXiv:1812.00033. Zamir, A. R., Sax, A., Shen, W., Guibas, L. J., Malik, J., and Savarese, S. (2018). Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3712–3722. Zhang, C. and Peng, Y. (2018). Better and faster: knowledge transfer from multiple self-supervised learning tasks via graph distillation for video classification. arXiv preprint arXiv:1804.10069. Zhang, H., Chen, W., He, H., and Jin, Y. (2019a). Disentangled makeup transfer with generative adversarial network. arXiv preprint arXiv:1907.01144. Zhang, Y., David, P., and Gong, B. (2017). Curriculum domain adaptation for semantic segmentation of urban scenes. In Proceedings of the IEEE International Conference on Computer Vision, pages 2020–2030. Zhang, Y., Nie, S., Liu, W., Xu, X., Zhang, D., and Shen, H. T. (2019b). Sequence-to-sequence domain adaptation network for robust text image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2740–2749. Zhang, Y., Niu, S., Qiu, Z., Wei, Y., Zhao, P., Yao, J., Huang, J., Wu, Q., and Tan, M. (2020). COVID-DA: Deep domain adaptation from typical pneumonia to covid-19. arXiv preprint arXiv:2005.01577. Zhang, Y., Zhang, Y., and Yang, Q. (2019c). Parameter transfer unit for deep neural networks. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Zhao, Z., Chen, Y., Liu, J., Shen, Z., and Liu, M. (2011). Cross-people mobile-phone based activity recognition. 
In Proceedings of the Twenty-Second international joint conference on Artificial Intelligence (IJCAI), volume 11, pages 2545–2550. Citeseer.


Zheng, J., Jiang, Z., and Chellappa, R. (2016). Cross-view action recognition via transferable dictionary learning. IEEE Transactions on Image Processing, 25(6):2542–2556. Zhu, Y., Xi, D., Song, B., Zhuang, F., Chen, S., Gu, X., and He, Q. (2020). Modeling users’ behavior sequences with hierarchical explainable network for cross-domain fraud detection. In Proceedings of The Web Conference 2020, pages 928–938. Zou, H., Zhou, Y., Jiang, H., Huang, B., Xie, L., and Spanos, C. (2017). Adaptive localization in dynamic indoor environments by transfer kernel learning. In 2017 IEEE wireless communications and networking conference (WCNC), pages 1–6. IEEE. Zou, Y., Yu, Z., Vijaya Kumar, B., and Wang, J. (2018). Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European conference on computer vision (ECCV), pages 289–305.

Chapter 2

From Machine Learning to Transfer Learning

Transfer learning is an important branch of machine learning, and the two are tightly connected. We therefore begin with the basics of machine learning, which provide the foundation for a deeper understanding of transfer learning. Having briefly introduced the background and concepts of transfer learning in the last chapter, we now dive into the area: we first introduce machine learning and probability distributions, then present the definition of transfer learning together with its fundamental problems and the case of negative transfer, and finally describe a complete transfer learning process.

This chapter is organized as follows. Section 2.1 introduces the basic concepts of machine learning, mainly machine learning itself, structural risk minimization, and probability distributions. Section 2.2 presents the formal definition of transfer learning. Section 2.3 discusses its three fundamental problems. Section 2.4 explains negative transfer, i.e., the failure case of transfer learning. Finally, assuming the three problems are resolved, Sect. 2.5 shows a complete transfer learning process.

2.1 Machine Learning Basics

2.1.1 Machine Learning

Machine learning (ML) has gone through rapid development in recent years. It is rooted in computer science and connects to many research areas, including but not limited to statistics, probability theory, convex optimization, and programming languages. It is important to note that there is actually no rigorous definition of ML. Generally speaking, machine learning is the process of

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 J. Wang, Y. Chen, Introduction to Transfer Learning, Machine Learning: Foundations, Methodologies, and Applications, https://doi.org/10.1007/978-981-19-7584-4_2


making a computer induce a general model from existing data, which can then be used for prediction on new data. Professor Tom Mitchell from Carnegie Mellon University gave a widely adopted definition of machine learning (Mitchell et al., 1997):

Definition 2.1 (Machine Learning (Mitchell et al., 1997)) Assume we use P to evaluate the performance of a computer program on a class of tasks T. If the program can utilize experience E to improve its performance on T as measured by P, then we say that the program learns from experience E with respect to T and P.

According to the above definition, we give a more formal definition of machine learning:

Definition 2.2 (Machine Learning) Let $\mathcal{X}$ and $\mathcal{Y}$ be the input and output spaces, respectively. Let $D = \{(\boldsymbol{x}_1, y_1), (\boldsymbol{x}_2, y_2), \cdots, (\boldsymbol{x}_n, y_n)\}$ be the training data, where $\boldsymbol{x}_i \in \mathcal{X}$ is the $i$-th sample and $y_i \in \mathcal{Y}$ is its associated label. We use $f \in \mathcal{H}$ to denote the objective function, with $\mathcal{H}$ its hypothesis space. Then, the goal of machine learning can be formulated as

$$f^* = \arg\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \ell(f(\boldsymbol{x}_i), y_i), \qquad (2.1)$$

where $\ell(\cdot, \cdot)$ is a specific loss function.
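To make Eq. (2.1) concrete, the following is a minimal sketch of empirical risk minimization (our illustration, not from the book): a linear hypothesis $f(x) = wx + b$ fitted under the squared loss by gradient descent on made-up data.

```python
import numpy as np

# A minimal instance of Eq. (2.1): fit a linear hypothesis f(x) = w*x + b
# by directly minimizing the empirical risk under the squared loss with
# gradient descent. The data are made up for illustration.

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=50)
y = 2.0 * X + 0.5 + rng.normal(0, 0.05, size=50)  # noisy samples of 2x + 0.5

w, b = 0.0, 0.0
lr = 0.5
for _ in range(500):
    pred = w * X + b
    # gradients of (1/n) * sum_i (f(x_i) - y_i)^2 w.r.t. w and b
    grad_w = 2 * np.mean((pred - y) * X)
    grad_b = 2 * np.mean(pred - y)
    w -= lr * grad_w
    b -= lr * grad_b

risk = np.mean((w * X + b - y) ** 2)
print(w, b, risk)  # w and b end up close to the true values 2.0 and 0.5
```

The loop minimizes exactly the averaged loss of Eq. (2.1) over the hypothesis space of linear functions; swapping in another differentiable loss only changes the two gradient lines.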

Generally speaking, we use the cross-entropy loss for classification problems and the mean squared error for regression problems; in real applications, we may need to choose or design more suitable loss functions. The above formulation can also be written in different forms. For instance, with Maximum Likelihood Estimation (MLE), the learning process can be represented as

$$\theta^* = \arg\max_{\theta} L(\theta \mid \boldsymbol{x}_1, \boldsymbol{x}_2, \cdots, \boldsymbol{x}_n), \qquad (2.2)$$

where $\theta$ denotes the model parameters to be learned and $L(\theta \mid \boldsymbol{x})$ is the likelihood, defined as

$$L(\theta \mid \boldsymbol{x}_1, \boldsymbol{x}_2, \cdots, \boldsymbol{x}_n) = f_\theta(\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_n). \qquad (2.3)$$
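A worked instance of Eqs. (2.2)–(2.3) for a Gaussian model (our example, not the book's): for i.i.d. Gaussian data, the sample mean and the biased sample variance are exactly the parameters that maximize the likelihood.

```python
import numpy as np

# MLE for a Gaussian: mu_hat = sample mean, sigma2_hat = biased sample
# variance. These jointly maximize L(theta | x_1, ..., x_n) in Eq. (2.2).

def gaussian_log_likelihood(x, mu, sigma2):
    """log L((mu, sigma^2) | x) for i.i.d. Gaussian samples."""
    n = len(x)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1000)

mu_hat = x.mean()        # MLE of the mean
sigma2_hat = x.var()     # MLE of the variance (divides by n, not n - 1)

best = gaussian_log_likelihood(x, mu_hat, sigma2_hat)
# any other parameter setting has a strictly lower likelihood
print(best > gaussian_log_likelihood(x, mu_hat + 0.5, sigma2_hat))  # True
print(best > gaussian_log_likelihood(x, mu_hat, 2.0 * sigma2_hat))  # True
```

Working with the log-likelihood rather than the raw product $f_\theta(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ is the standard trick: the logarithm is monotonic, so the maximizer is unchanged, but sums are numerically far better behaved than products of small densities.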

For more on machine learning algorithms, please refer to monographs such as Bishop (2006) and Zhou (2016).


2.1.2 Structural Risk Minimization

Equation (2.1) states that we can derive a predictive function by minimizing the loss on the training data. This process is called Empirical Risk Minimization (ERM), and its loss function is called the empirical risk. Is ERM enough to build a good machine learning model? In fact, a good model should not only fit the training data well but also predict future data well. Hence we introduce another concept: Structural Risk Minimization (SRM), which is an important concept in statistical machine learning. Under the SRM principle, in addition to fitting the training data, a model should also be simple (i.e., have a low VC dimension). The VC (Vapnik–Chervonenkis) dimension is often used to describe the complexity of a machine learning system and to evaluate learnability given datasets and models; more details can be found in Valiant (1984). In real tasks, we often use regularization to control the complexity of a model. Structural risk minimization can then be formulated as

$$f^* = \arg\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \ell(f(\boldsymbol{x}_i), y_i) + \lambda R(f), \qquad (2.4)$$

where $R(f)$ is the regularization term, i.e., the measure of model complexity: more complex models tend to have a larger $R(f)$, and vice versa. $\lambda$ is the regularization hyperparameter. Therefore, we often use SRM instead of ERM to learn a model that both fits the training data and generalizes to new data. Common choices of the regularizer $R(f)$ include the L1 and L2 regularizations.
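A sketch of Eq. (2.4) with the L2 regularizer $R(f) = \|\boldsymbol{w}\|^2$, i.e., ridge regression, which admits a closed-form solution. The data and the degree-9 polynomial features below are made up; the high-degree features deliberately give the pure ERM fit ($\lambda = 0$) room to become overly complex.

```python
import numpy as np

# Structural risk minimization with R(f) = ||w||^2: ridge regression,
# solved in closed form as w = (X^T X / n + lambda*I)^{-1} X^T y / n.

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=15)
y = np.sin(3 * x) + rng.normal(0, 0.1, size=15)

X = np.vander(x, N=10, increasing=True)  # columns x^0 ... x^9

def fit(X, y, lam):
    """Minimize (1/n) * ||Xw - y||^2 + lam * ||w||^2 in closed form."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

w_erm = fit(X, y, lam=0.0)  # empirical risk minimization only
w_srm = fit(X, y, lam=0.1)  # structural risk: loss + complexity penalty

# the regularized model is "simpler": its weight norm R(f) is smaller
print(np.linalg.norm(w_erm) > np.linalg.norm(w_srm))  # True
```

Increasing $\lambda$ monotonically shrinks the weight norm, trading a slightly larger empirical risk for a simpler, better-generalizing model, which is exactly the balance Eq. (2.4) expresses.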

2.1.3 Probability Distribution

Probability distribution is a basic concept in statistical machine learning. Data distribution refers to the shape, or histogram, of data. For example, if a class of 50 students consists of 30 boys and 20 girls, then the numbers 30 and 20 describe the data distribution of sex in this class. Probability distributions are grounded in probability theory and are more rigorous than such simple data distributions. Before introducing them, we should first understand what a random variable is. A random variable is a function that quantifies random events by mapping each possible outcome to a number. There are basically two types: discrete and continuous random variables. For instance, consider the proposition whether it snows tomorrow. There are only two possible answers, "yes" and "no," so this is a discrete random variable. On the other hand, if we ask what the probability of snowing tomorrow is, the value can lie anywhere in $[0, 100\%]$, which makes it a continuous random variable.


If we combine probability, distributions, and random events, we get a probability distribution. Popular probability distributions include the Bernoulli, Gaussian, Poisson, and uniform distributions, among others; they can be found in machine learning monographs such as Bishop (2006) and Zhou (2016). We often use $P(x)$ to denote the probability distribution of a random variable $X$.

Why do we study probability distributions? Machine learning is a subject that centers on data, and in the real world data change dynamically. Statistical machine learning generally assumes that data are generated by one (or a mixture of several) probability distribution(s), which can be formulated as $\boldsymbol{x} \sim P(X)$, where $\mathcal{X}$ is the sample space. Traditional machine learning further assumes that training and test data come from the same distribution. Let $D_{\text{train}} = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^{n}$ denote the training data and $D_{\text{test}} = \{(\boldsymbol{x}_j, y_j)\}_{j=1}^{m}$ the test data. Then, the assumption of traditional machine learning can be formulated as

$$P_{\text{train}}(\boldsymbol{x}, y) = P_{\text{test}}(\boldsymbol{x}, y). \qquad (2.5)$$

However, the two distributions are often not the same in real applications:

$$P_{\text{train}}(\boldsymbol{x}, y) \neq P_{\text{test}}(\boldsymbol{x}, y). \qquad (2.6)$$

Before introducing the definition of transfer learning, we have to really understand what different distributions means. Figure 2.1 shows three different Gaussian distributions: $N_1(0, 5)$, $N_2(0, 7)$, and $N_3(0, 10)$. They are obviously different since their variances $\sigma^2$ differ. In traditional machine learning, if the training data follow the distribution $N_1(0, 5)$, the test data also follow $N_1(0, 5)$; in transfer learning, when the training data follow $N_1(0, 5)$, the test data may follow a different distribution such as $N_2(0, 7)$ or $N_3(0, 10)$. The main situation studied in transfer learning is the case of Eq. (2.6), which is visualized in Fig. 2.2. It is also worth noting that even though the probability distribution $P$ has many standard forms such as the Gaussian or Poisson, the real distribution of data can be very complex, so we often do not (and cannot) write out its exact form.

Fig. 2.1 Three different Gaussian distributions

(with variances $\sigma^2 = 5$, $\sigma^2 = 7$, and $\sigma^2 = 10$)
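The situation of Fig. 2.1 and Eq. (2.6) can be simulated directly. Below is a toy sketch of ours (not from the book): a simple rule calibrated on training data drawn from $N(0, 5)$ is applied to test data drawn from $N(0, 10)$, and its behavior degrades because the two distributions differ.

```python
import numpy as np

# Train/test distribution shift: same mean, different variances, as in
# Fig. 2.1. A "typical sample" detector flags points within 2 training
# standard deviations; calibrated on the training distribution, it covers
# about 95% of training data but noticeably less of the shifted test data.

rng = np.random.default_rng(0)
train = rng.normal(0, np.sqrt(5), size=5000)   # sigma^2 = 5
test = rng.normal(0, np.sqrt(10), size=5000)   # sigma^2 = 10

threshold = 2 * train.std()  # rule tuned on the training distribution

cover_train = np.mean(np.abs(train) <= threshold)
cover_test = np.mean(np.abs(test) <= threshold)
print(round(cover_train, 2), round(cover_test, 2))  # coverage drops on test
```

This is the essential failure mode that transfer learning addresses: any quantity estimated on $P_{\text{train}}$ (thresholds, decision boundaries, learned weights) is only guaranteed to behave as expected when $P_{\text{test}}$ is the same distribution.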


Fig. 2.2 Training and test data follow different distributions


2.2 Definition of Transfer Learning

2.2.1 Domains

After introducing the concept of probability distribution, we introduce an important concept in transfer learning: the domain. Based on it, we will articulate the formal definition of transfer learning.

A domain is the subject on which learning is performed. It consists of two parts: the data and the distribution that generates the data. We often use $\mathcal{D}$ to denote a domain, whose sample is denoted by input $\boldsymbol{x}$ and output $y$, and whose probability distribution is denoted by $P(\boldsymbol{x}, y)$, indicating that data from $\mathcal{D}$ follow this distribution: $(\boldsymbol{x}, y) \sim P(\boldsymbol{x}, y)$. We use $\mathcal{X}$ and $\mathcal{Y}$ to denote the feature and label spaces, respectively, so that for any sample $(\boldsymbol{x}_i, y_i)$ we have $\boldsymbol{x}_i \in \mathcal{X}, y_i \in \mathcal{Y}$. A domain can then be defined as $\mathcal{D} = \{\mathcal{X}, \mathcal{Y}, P(\boldsymbol{x}, y)\}$.

In transfer learning, there are at least two domains: the domain with rich knowledge that we transfer from, and the domain that we want to learn. The former is called the source domain and the latter the target domain, denoted with subscripts $s$ and $t$: $\mathcal{D}_s$ is a source domain and $\mathcal{D}_t$ is a target domain. When $\mathcal{D}_s \neq \mathcal{D}_t$, we have $\mathcal{X}_s \neq \mathcal{X}_t$, $\mathcal{Y}_s \neq \mathcal{Y}_t$, or $P_s(\boldsymbol{x}, y) \neq P_t(\boldsymbol{x}, y)$.¹

¹ The definition in this book is different from the definitions in Pan and Yang (2010) and Yang et al. (2020): they defined a domain as $\mathcal{D} = (\mathcal{X}, P(\boldsymbol{x}))$ and further introduced a new concept called the task: $\mathcal{T} = (\mathcal{Y}, f)$. Since this book focuses on algorithms for domain adaptation, our definition of a domain stems from the natural data-generating process ($(\boldsymbol{x}, y) \sim P(\boldsymbol{x}, y)$), which also includes the joint probability distribution. Interested readers will find that our definitions and theirs differ only in form, not in spirit.


2 From Machine Learning to Transfer Learning

2.2.2 Formal Definition

After introducing the domain, we now give the formal definition of transfer learning.

Definition 2.3 (Transfer Learning) Given a source domain D_s = {(x_i, y_i)}_{i=1}^{N_s} and a target domain D_t = {(x_j, y_j)}_{j=1}^{N_t}, where x ∈ X, y ∈ Y, the goal of transfer learning is to use the source domain data to learn a predictive function f : x_t → y_t such that f reaches the minimum prediction risk on the target domain (evaluated by a loss ℓ):

f* = arg min_f E_{(x,y)∈D_t} ℓ(f(x), y),    (2.7)

when one of the following three conditions holds:

1. Different feature spaces, i.e., X_s ≠ X_t
2. Different label spaces, i.e., Y_s ≠ Y_t
3. Different probability distributions with the same feature and label spaces, i.e., P_s(x, y) ≠ P_t(x, y)

Concretely speaking, different feature spaces, i.e., X_s ≠ X_t, refer to the case where the two domains have different features or feature dimensions. For instance, when the source domain is a collection of RGB images and the target domain is a collection of grayscale images, they have different feature spaces. Different label spaces, i.e., Y_s ≠ Y_t, refer to the case where the two domains have different tasks. For instance, in a classification problem, the source domain aims to classify cats and dogs, while the target domain wants to classify flowers. Different probability distributions, i.e., P_s(x, y) ≠ P_t(x, y), refer to the case where, even if the feature and label spaces are the same, the probability distributions can still be different. For instance, when the source domain is object recognition on sketch images and the target domain is a collection of art paintings, even though their feature spaces (both in RGB format) and label spaces (both with the same categories) are the same, their probability distributions are different.

Each of the above situations has spawned a lot of research. Luckily, despite these different situations, we can still find commonness in their key algorithms. Hence, we will not introduce each situation in detail but instead give a detailed introduction to a hot research topic: domain adaptation (DA), which corresponds to the last situation. The essence of most transfer learning algorithms can be grasped from our introduction of domain adaptation algorithms. Thus, you can learn to extend them to your own problems. The formal definition of domain adaptation is as follows:


Definition 2.4 (Domain Adaptation) Given a source domain D_s = {(x_i, y_i)}_{i=1}^{N_s} and a target domain D_t = {(x_j, y_j)}_{j=1}^{N_t}, their feature and label spaces are the same, X_s = X_t, Y_s = Y_t, but their joint distributions are different, P_s(x, y) ≠ P_t(x, y). The goal of domain adaptation is to leverage the source domain data to learn a target predictive function f : x_t → y_t such that f reaches the minimum prediction risk on the target domain (evaluated by a loss ℓ):

f* = arg min_f E_{(x,y)∈D_t} ℓ(f(x), y).    (2.8)

Recalling the taxonomy of transfer learning in Sect. 1.4, we can categorize domain adaptation into three situations:

1. Supervised Domain Adaptation (SDA), i.e., the target domain is fully labeled: D_t = {(x_j, y_j)}_{j=1}^{N_t}.
2. Semi-supervised Domain Adaptation (SSDA), i.e., the target domain is partially labeled: D_t = {(x_j, y_j)}_{j=1}^{N_tl} ∪ {x_j}_{j=1}^{N_tu}, in which N_tl and N_tu denote the numbers of labeled and unlabeled samples, respectively.
3. Unsupervised Domain Adaptation (UDA), i.e., the target domain is totally unlabeled: D_t = {x_j}_{j=1}^{N_t}.

Obviously, unsupervised domain adaptation is the most challenging case. Therefore, we will focus on this setting when introducing transfer learning algorithms, which can then be adapted to semi-supervised and supervised domain adaptation problems. In particular, when multiple domains all have labels, multi-task learning can be used, which is beyond the scope of this book.

Finally, we want to emphasize that the above definition is not universal and fixed. Readers are encouraged to give more customized definitions for their specific research and applications based on the above definitions.

2.3 Fundamental Problems in Transfer Learning

In this section, we introduce the fundamental problems in transfer learning. The goal of this section is to provide an overview of these problems so that readers can see through the mist of complex real tasks and grasp their nature. Then, you may be able to develop suitable solutions. According to the survey of Pan and Yang (2010), there are three fundamental problems in transfer learning, which form a complete transfer learning life cycle:

1. When to transfer, which corresponds to the possibility of applying transfer learning, i.e., when it can be used. This should be the first step when we want to use


transfer learning in our applications: given a target task, we should first examine whether it is suitable for transfer or not.
2. What/where to transfer, which corresponds to the process of determining which sources or samples to transfer from. Here, what refers to the transferable knowledge, such as neural network weights, feature transformation matrices, and other useful parameters, and where refers to the place where transfer happens, such as a source domain, a neuron, or a tree in a random forest.
3. How to transfer, which corresponds to the process of conducting transfer learning and is the key part of most existing literature. The goal of this step is to develop good algorithms to achieve the best performance.

The whole life cycle of transfer learning is all about these three fundamental problems. According to current research progress, when to transfer involves theories and bounds that give us theoretical guarantees before using transfer learning. What/where to transfer emphasizes the dynamic process of transfer learning: in the big data era, we need to select the most suitable data, domains, networks, or distributions to transfer from. Finally, how to transfer focuses on building more powerful models. Moreover, these three problems are not opposed to one another but can be united. For instance, what/where to transfer often comes with dynamically changing distributions of the representations, which is closely related to how to transfer since we can change the distributions of data by building models. Under certain conditions, they can be unified to benefit each other.

2.3.1 When to Transfer

When to transfer refers to the theoretical guarantees of transfer learning: how can we say that our transfer learning is successful, rather than achieving worse results than no transfer at all (i.e., negative transfer, introduced in the next section)? Due to the scarcity of theoretical work, we only answer one question: why can knowledge be transferred between two domains with different distributions? Or, under what conditions can knowledge be transferred with little error? This is an important question. In addition to experimental evaluation, we often hope that our algorithm comes with some theoretical guarantee, such as: within what bound can our algorithm be successful? We leave the introduction of theory to Sect. 7.1. Of course, the theory is highly recommended but not required for all readers.


2.3.2 Where to Transfer

Where to transfer concerns the subjects of transfer in our learning process. This contains questions at two levels:

1. At the level of datasets and domains: selecting the most appropriate source domains or datasets to perform transfer learning
2. At the level of samples: selecting the most informative samples from a dataset or domain to achieve the best performance

The reason why we use "most appropriate" to describe the selected domains or samples is that performance (e.g., accuracy) may not be the only metric to evaluate the success of transfer learning. We may have different evaluation metrics since we are constrained by environments, algorithms, complexity, and devices.

In fact, the above two problems are identical to some extent: samples are the basic elements that construct a domain or a dataset. Thus, if we master sample selection methods, we also know how to select domains. We will introduce more algorithms for sample selection in Sect. 4.2 and thus do not describe them here. We mainly introduce methods for domain or dataset selection, which is also called source domain selection.

Researchers from the Hong Kong University of Science and Technology proposed a source-selection-free transfer learning method that does not need to explicitly appoint a source domain (Xiang et al., 2011). Their method is based on the semantic information between different categories: they utilized a social tagging network called "Delicious" to connect different classes and then constructed a Laplacian graph to represent user interest. Based on that, they can discover the semantic relationships between the target domain and possible source domains, thus performing automatic source domain selection. Later, Lu et al. (2014) extended this work to text classification. In deep neural networks, Collier et al. (2018) systematically explored the transferability of each hidden layer by grid search. Then, Bhatt et al. (2016) proposed a greedy algorithm for multi-source domain selection. In manifold learning, Gong et al. (2012) proposed a domain similarity metric using principal angles: their method performs greedy computation of similarity angles between domains to obtain the optimal results. Another popular method is to compute the A-distance (Ben-David et al., 2007) between domains. Specifically, we can construct a linear classifier to classify whether a sample comes from one domain or the other and use the classifier error as an indicator of domain similarity. This method has been widely adopted in recent algorithms. For instance, the MEDA algorithm (Wang et al., 2018a) utilized this distance to compute the distribution similarity between two domains.

Source domain selection has also been applied in different applications. In Chen et al. (2019), the authors proposed a stratified transfer learning method for human activity recognition. Their algorithm computed the fine-grained maximum mean

48

2 From Machine Learning to Transfer Learning

Fig. 2.3 Source domain selection in human activity recognition: which body part is most similar to the one marked with the red star? Images taken from Wang et al. (2018b)

discrepancy (MMD) between domains and achieved better accuracy. Later, they proposed a source domain selection algorithm based on semantic and kinetic similarity (Wang et al., 2018b), as shown in Fig. 2.3. Their method merges body-part similarity and sensor-reading similarity and then constructs a neural network for transfer learning.

The readers may have noticed that "where to transfer" is often highly related to "how to transfer": evaluating the most appropriate domains often depends on the transfer learning results. Therefore, we should not treat these two problems separately; researchers have tried to represent them in one unified framework. In real applications, they are also highly related.
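The proxy A-distance idea described above (train a classifier to separate the two domains and use its error as a similarity indicator) can be sketched as follows. For simplicity, this illustration of ours uses 1-D data and a threshold classifier instead of a general linear one; all names and parameter values are illustrative:

```python
import random

random.seed(0)

def domain_classifier_error(xs, xt):
    """Best error of a 1-D threshold classifier separating the two
    domains (a crude stand-in for the linear classifier in the text)."""
    data = [(x, 0) for x in xs] + [(x, 1) for x in xt]
    n = len(data)
    best = 0.5  # error of a random classifier
    for t, _ in data:
        # rule: predict domain 1 when x > t
        err = sum((x > t) != bool(d) for x, d in data) / n
        best = min(best, err, 1 - err)  # also try the flipped rule
    return best

def proxy_a_distance(xs, xt):
    # d_A = 2 * (1 - 2 * err): 0 for indistinguishable domains, at most 2
    return 2 * (1 - 2 * domain_classifier_error(xs, xt))

src = [random.gauss(0, 1) for _ in range(100)]
tgt_near = [random.gauss(0.2, 1) for _ in range(100)]
tgt_far = [random.gauss(3, 1) for _ in range(100)]

a_near = proxy_a_distance(src, tgt_near)
a_far = proxy_a_distance(src, tgt_far)
print(f"A-distance to a similar domain:    {a_near:.2f}")
print(f"A-distance to a dissimilar domain: {a_far:.2f}")
```

A small A-distance means the domain classifier can barely beat chance, so the domains are hard to tell apart, which would make the similar domain a better candidate source.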


Fig. 2.4 Research density of the three fundamental problems (estimated amount of literature: how to transfer, 85%; where/what to transfer, 10%; when to transfer, 5%)

2.3.3 How to Transfer

After we identify when and where to transfer, the next step is how to transfer, which is also the most popular topic in the transfer learning literature. This step is the major part of this book, so we do not introduce it in this chapter. The relative amounts of literature on "when to transfer," "where to transfer," and "how to transfer" can be observed in Fig. 2.4. Note that the numbers in the figure are not accurate and only show the trend of this research. We need to point out that although "how to transfer" has long been the hottest topic, the other two problems deserve more attention and deeper exploration.

2.4 Negative Transfer Learning

We always hope that transfer learning runs well and gives satisfying results. However, things do not always go well. In this section, we introduce negative transfer learning, the failure case of transfer, illustrated by the Chinese fable Dong Shi Knits Her Brows in Imitation.

We must understand what transfer learning is before attempting to understand negative transfer. The core of transfer learning is to find and exploit the similarity between two domains. Therefore, if we fail to find such similarity, or there is actually no similarity between the two domains, then we may never be able to build a successful transfer learning model. For instance, we can learn to ride a motorbike by analogy once we know how to ride a bicycle. But we cannot learn how to drive a car by borrowing knowledge from riding bicycles; that is when negative transfer happens. We can now understand why Dong Shi becomes uglier even though she acts the same as Xi Shi: there is actually little similarity between them.

Generally speaking, negative transfer refers to the case where the knowledge borrowed from the source domain has negative effects on learning in the target domain. This can be formalized as follows:


Definition 2.5 (Negative Transfer Learning) Let R(A(D_s, D_t)) denote the error of performing transfer learning from source domain D_s to target domain D_t using algorithm A, and let ∅ denote the empty set. Then, negative transfer happens when the following holds:

R(A(D_s, D_t)) > R(A′(∅, D_t)),    (2.9)

where A′ denotes another algorithm and R(A′(∅, D_t)) denotes the error of learning without transfer.
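As an illustration of Eq. (2.9), the following toy construction of ours builds a deliberately dissimilar source domain with inverted labels; pooling it with the scarce target data makes a 1-nearest-neighbor classifier worse than using the target data alone:

```python
import random

random.seed(1)

def nn_error(train, test):
    """Error rate of a 1-nearest-neighbor classifier on 1-D points."""
    def predict(x):
        return min(train, key=lambda p: abs(p[0] - x))[1]
    return sum(predict(x) != y for x, y in test) / len(test)

def make_domain(n, flip=False):
    # True concept: y = 1 iff x > 0; a "dissimilar" domain inverts it.
    pts = []
    for _ in range(n):
        x = random.gauss(0, 1)
        y = int(x > 0)
        pts.append((x, 1 - y if flip else y))
    return pts

target_train = make_domain(20)             # scarce labeled target data
target_test = make_domain(500)
bad_source = make_domain(200, flip=True)   # dissimilar source domain

err_no_transfer = nn_error(target_train, target_test)
err_transfer = nn_error(target_train + bad_source, target_test)
print(f"target-only error: {err_no_transfer:.2f}")
print(f"with dissimilar source pooled in: {err_transfer:.2f}")
# The pooled model is worse than no transfer at all: Eq. (2.9) holds.
```

Here the "algorithm" is naive data pooling; the point is only that borrowing from a dissimilar source can raise, not lower, the target error.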

There are several reasons that trigger negative transfer:

• Data problem: the source and target domains are simply not similar.
• Algorithm problem: the domains are similar, but the algorithm is too poor to exploit the similarity, which also results in failure.

Negative transfer has a negative impact on transfer learning research and applications. In real problems, we need to overcome it by finding more similar domains and developing more powerful algorithms. There have also been research efforts to overcome negative transfer in recent years. Qiang Yang's team at the Hong Kong University of Science and Technology proposed transitive transfer learning (Tan et al., 2015) and, later, distant domain transfer learning (Tan et al., 2017). In the latter work, even a model trained on human faces can help recognize airplanes. Their research made it possible to perform transfer learning between two domains with little similarity, which further broadens the boundary of transfer learning.

We use the example of the frog crossing the river in Fig. 2.5 to show the spirit of transitive transfer learning. In the normal case where the river is not too wide, the little frog can directly jump to the other side. In the abnormal case where the river is too wide to jump across, the clever frog can leverage some leaves as intermediate stepping points. The analogy to transfer learning is direct: the two banks of the river are the source and target domains; the frog reaching the other side denotes successful transfer; a narrow river denotes large similarity between the domains, and vice versa; and the leaves in the river are the intermediate domains, through which we can perform successful transfer learning.

Wang et al. (2019) proposed theoretical analyses of negative transfer learning and respective solutions. Then, Zhang et al. (2020) wrote a survey on negative transfer, discussing its causes, solutions, and possible applications. Their research pointed out that negative transfer may occur when the source or target domain has low-quality data, when the domain discrepancy is too large, or when the algorithm is not good enough.

Fig. 2.5 The frog crossing the river and negative transfer learning. (a) Normal case: success (the river is not wide: small domain gap). (b) Abnormal case: failure (the river is wide: large domain gap). (c) Abnormal case: success (the river is wide: large domain gap)

2.5 A Complete Transfer Learning Process

After introducing the fundamental problems of transfer learning, we can summarize a complete transfer learning process in Fig. 2.6. After obtaining the dataset, we need to analyze its transferability, which corresponds to when and where to transfer. Then comes the transfer learning process itself, which will be introduced in later chapters. As in machine learning, when the transfer learning process ends, we need to perform model selection to pick the best models and hyperparameters. Note that transferability analysis, the transfer process, and model selection are not separate steps but can be used to benefit one another. Finally, we deploy the best model we obtained.

The main part of this book (Chaps. 3 to 14) introduces different kinds of transfer learning algorithms. Additionally, we present theory, evaluation, and model selection methods in Chap. 7. Finally, transfer learning application practice (i.e., code) is covered in Chaps. 15 to 19.

Fig. 2.6 A complete transfer learning process: dataset → problem definition and transferability analysis (Chaps. 2–3) → transfer process (Chaps. 4–6, 8–14) → model selection (Chap. 7) → application and practice (Chaps. 15–19)


References

Ben-David, S., Blitzer, J., Crammer, K., and Pereira, F. (2007). Analysis of representations for domain adaptation. In NIPS, volume 19.
Bhatt, H. S., Rajkumar, A., and Roy, S. (2016). Multi-source iterative adaptation for cross-domain classification. In IJCAI, pages 3691–3697.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Chen, Y., Wang, J., Huang, M., and Yu, H. (2019). Cross-position activity recognition with stratified transfer learning. Pervasive and Mobile Computing, 57:1–13.
Collier, E., DiBiano, R., and Mukhopadhyay, S. (2018). CactusNets: Layer applicability as a metric for transfer learning. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE.
Gong, B., Shi, Y., Sha, F., and Grauman, K. (2012). Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pages 2066–2073.
Lu, Z., Zhu, Y., Pan, S. J., Xiang, E. W., Wang, Y., and Yang, Q. (2014). Source free transfer learning for text classification. In Twenty-Eighth AAAI Conference on Artificial Intelligence.
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.
Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE TKDE, 22(10):1345–1359.
Tan, B., Song, Y., Zhong, E., and Yang, Q. (2015). Transitive transfer learning. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1155–1164. ACM.
Tan, B., Zhang, Y., Pan, S. J., and Yang, Q. (2017). Distant domain transfer learning. In Thirty-First AAAI Conference on Artificial Intelligence.
Valiant, L. (1984). A theory of the learnable. Communications of the ACM, 27:1134–1142.
Wang, J., Feng, W., Chen, Y., Yu, H., Huang, M., and Yu, P. S. (2018a). Visual domain adaptation with manifold embedded distribution alignment. In ACMMM, pages 402–410.
Wang, J., Zheng, V. W., Chen, Y., and Huang, M. (2018b). Deep transfer learning for cross-domain activity recognition. In Proceedings of the 3rd International Conference on Crowd Science and Engineering, pages 1–8.
Wang, Z., Dai, Z., Póczos, B., and Carbonell, J. (2019). Characterizing and avoiding negative transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11293–11302.
Xiang, E. W., Pan, S. J., Pan, W., Su, J., and Yang, Q. (2011). Source-selection-free transfer learning. In Twenty-Second International Joint Conference on Artificial Intelligence.
Yang, Q., Zhang, Y., Dai, W., and Pan, S. J. (2020). Transfer Learning. Cambridge University Press.
Zhang, W., Deng, L., and Wu, D. (2020). Overcoming negative transfer: A survey. arXiv preprint arXiv:2009.00909.
Zhou, Z.-H. (2016). Machine Learning. Tsinghua University Press.

Chapter 3

Overview of Transfer Learning Algorithms

This chapter gives an overview of transfer learning algorithms so that readers can learn and understand the detailed algorithms in the other chapters from a thorough view. To facilitate such an understanding, we establish a unified representation framework from which most existing methods can be derived. The other chapters then introduce each kind of algorithm in more detail. We encourage you to build this kind of overview whenever you learn new material.

This chapter is organized as follows. In Sect. 3.1, we introduce the most important concept in transfer learning algorithms: the measurement of distribution divergence. Then, in Sect. 3.2, we introduce the unified representation for distribution divergence. In Sect. 3.3, we give the unified learning framework for transfer learning algorithms. Finally, in Sect. 3.4, we offer code guidance for establishing an environment for transfer learning research.

3.1 Measuring Distribution Divergence

After the formal definition of transfer learning in the last chapter, we now start to learn its algorithms. The ultimate goal of transfer learning is to develop algorithms that better harness existing knowledge to facilitate learning in the target domain. In order to leverage existing knowledge, the key is to find the similarity between two domains, which can be computed by measuring the distribution divergence between them. This measurement has two benefits: first, it not only tells us qualitatively whether the two domains are similar but also quantifies their divergence concretely; second, we can build learning algorithms on top of the measurement that increase the cross-domain distribution similarity and thereby complete the transfer learning task.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 J. Wang, Y. Chen, Introduction to Transfer Learning, Machine Learning: Foundations, Methodologies, and Applications, https://doi.org/10.1007/978-981-19-7584-4_3


In transfer learning, the joint distributions of the two domains are different, i.e.,

P_s(x, y) ≠ P_t(x, y).    (3.1)

Then, how do we measure such joint distribution divergence? We have a fully labeled source domain D_s = {(x_i, y_i)}_{i=1}^{N_s} and an unlabeled target domain D_t = {x_j}_{j=1}^{N_t}. In fact, since the target domain labels are unknown, it is impossible to directly measure its joint distribution divergence from the source domain. Thus, most existing literature adopts certain assumptions in their algorithms. According to basic probability theory, there exist relations between the joint, marginal, and conditional distributions:

P(x, y) = P(x)P(y|x) = P(y)P(x|y).    (3.2)
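Equation (3.2) can be verified mechanically on a small discrete example (the toy probability table below is our own choice, used only for illustration):

```python
from fractions import Fraction as F

# A toy joint distribution P(x, y) over x, y in {0, 1}
P = {(0, 0): F(1, 8), (0, 1): F(3, 8), (1, 0): F(1, 4), (1, 1): F(1, 4)}

def p_x(x):
    """Marginal P(x) obtained by summing the joint over y."""
    return sum(p for (xi, _), p in P.items() if xi == x)

def p_y_given_x(y, x):
    """Conditional P(y|x) = P(x, y) / P(x)."""
    return P[(x, y)] / p_x(x)

# Check Eq. (3.2) cell by cell: P(x, y) = P(x) P(y|x).
for (x, y), p in P.items():
    assert p == p_x(x) * p_y_given_x(y, x)
print("Eq. (3.2) holds for every (x, y)")
```

Exact fractions are used so the identity holds with no floating-point slack; the same factorization is what lets the methods below substitute marginal and conditional divergences for the unmeasurable joint one.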

We can leverage the above relation to transform the problem and derive solutions. According to Eq. (3.2), we can classify existing transfer learning algorithms into the following categories by the taxonomy of probability distribution matching:

• Marginal Distribution Adaptation (MDA)
• Conditional Distribution Adaptation (CDA)
• Joint Distribution Adaptation (JDA)
• Dynamic Distribution Adaptation (DDA)

We illustrate marginal, conditional, and joint distributions in Fig. 3.1 to give readers an intuitive understanding of them. Obviously, when the target domain is case I in Fig. 3.1, we should consider matching the marginal distributions more, since the domains are dissimilar as a whole; when the target domain is case II, we should instead consider the conditional distributions more, since the domains are

Fig. 3.1 Different distribution cases for source and target domains. Different shapes denote different classes. We show two cases of target domain distributions (Target I and II) and, in the last panel, the more general and complicated case of an unknown target. Note that we only illustrate the major differences without strictly satisfying the assumptions

3.1 Measuring Distribution Divergence

55

the same as a whole and differ only in the distribution of each class.

Marginal distribution adaptation (MDA) methods (Pan et al., 2011) aim to solve the covariate shift problem, which refers to the case where the marginal distributions are different, i.e., P_s(x) ≠ P_t(x), while the conditional distributions are assumed to be the same, i.e., P_s(y|x) ≈ P_t(y|x). Under this assumption, MDA accomplishes transfer learning tasks by reducing the marginal distribution distance. Equivalently, we can interpret it as using the marginal distribution divergence to estimate the joint distribution divergence:

D(P_s(x, y), P_t(x, y)) ≈ D(P_s(x), P_t(x)).    (3.3)
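One concrete choice of the measurement D(·, ·) for marginal distributions, the maximum mean discrepancy (MMD) mentioned in Chap. 2, compares kernel means of the two samples. A minimal 1-D sketch of ours (kernel bandwidth and sample sizes are illustrative):

```python
import math
import random

def rbf(a, b, gamma=0.5):
    # Gaussian (RBF) kernel on scalars
    return math.exp(-gamma * (a - b) ** 2)

def mmd2(xs, xt, gamma=0.5):
    """Biased estimate of squared MMD between two 1-D samples:
    the squared RKHS distance between their empirical mean embeddings."""
    k_ss = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    k_tt = sum(rbf(a, b, gamma) for a in xt for b in xt) / len(xt) ** 2
    k_st = sum(rbf(a, b, gamma) for a in xs for b in xt) / (len(xs) * len(xt))
    return k_ss + k_tt - 2 * k_st

random.seed(0)
source = [random.gauss(0, 1) for _ in range(200)]
target_near = [random.gauss(0, 1) for _ in range(200)]   # same marginal
target_far = [random.gauss(3, 1) for _ in range(200)]    # shifted marginal

mmd_near = mmd2(source, target_near)
mmd_far = mmd2(source, target_far)
print(f"MMD^2 to a similar domain:    {mmd_near:.4f}")
print(f"MMD^2 to a dissimilar domain: {mmd_far:.4f}")
```

A near-zero value indicates similar marginals; MDA-style methods minimize exactly this kind of quantity over a learned feature transformation.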

Zhao et al. (2019) theoretically proved that performing only marginal distribution adaptation is insufficient, and we should also consider conditional distribution adaptation methods (Wang et al., 2018a; Zhu et al., 2020). The goal of conditional distribution adaptation (CDA) is to reduce the conditional distribution divergence between two domains. Its assumption is the opposite of MDA's, i.e., P_s(x) ≈ P_t(x), P_s(y|x) ≠ P_t(y|x). In this situation, CDA methods use the conditional distribution divergence to estimate the joint distribution divergence:

D(P_s(x, y), P_t(x, y)) ≈ D(P_s(y|x), P_t(y|x)).    (3.4)

Joint distribution adaptation (JDA) (Long et al., 2013) adopts a more general assumption; its goal is to reduce the joint distribution divergence for better transfer learning. Specifically, since we cannot directly measure the joint distribution divergence, JDA methods use the sum of the marginal and conditional distribution divergences to estimate it:

D(P_s(x, y), P_t(x, y)) ≈ D(P_s(y|x), P_t(y|x)) + D(P_s(x), P_t(x)).    (3.5)

Finally, dynamic distribution adaptation (DDA) methods (Wang et al., 2020, 2018b) observe that the marginal and conditional distribution divergences are not equally important in transfer learning. This kind of method dynamically and adaptively adjusts the relative importance of the two distributions. Concretely, DDA uses a balance factor μ for the adjustment:

D(D_s, D_t) ≈ (1 − μ)D(P_s(x), P_t(x)) + μD(P_s(y|x), P_t(y|x)),    (3.6)

where μ ∈ [0, 1] denotes the balance factor. When μ → 0, the marginal distribution divergence dominates; when μ → 1, the conditional distribution divergence dominates, which means that the source and target domains are globally similar. The balance factor μ can thus dynamically adjust the importance of each distribution, leading to better adaptation performance.
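Equation (3.6) is simple enough to state directly in code; the divergence values below are made-up placeholders, just to show how μ trades the two terms off:

```python
def dynamic_divergence(d_marginal, d_conditional, mu):
    """Eq. (3.6): (1 - mu) * D(Ps(x), Pt(x)) + mu * D(Ps(y|x), Pt(y|x))."""
    assert 0.0 <= mu <= 1.0
    return (1 - mu) * d_marginal + mu * d_conditional

d_m, d_c = 0.8, 0.2  # placeholder divergence values
print(dynamic_divergence(d_m, d_c, 0.0))  # mu = 0: marginal only (MDA)
print(dynamic_divergence(d_m, d_c, 1.0))  # mu = 1: conditional only (CDA)
print(dynamic_divergence(d_m, d_c, 0.5))  # mu = 0.5: equal weight (JDA)
```

The three calls make explicit how MDA, CDA, and JDA fall out of DDA as special values of μ.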

56

3 Overview of Transfer Learning Algorithms

3.2 Unified Representation for Distribution Divergence

We summarize the assumptions and learning problems of these distribution matching methods in Table 3.1, where D(·, ·) denotes a distribution divergence measurement function, which we treat as a pre-defined function for now. From this table, we can clearly see that the problem definitions change with the assumptions, and that from marginal distribution adaptation to dynamic distribution adaptation, researchers have obtained progressively deeper insights into this problem. Obviously, dynamic distribution adaptation is the most general case, which reduces to the other cases by changing the value of the balance factor μ:

1. When μ = 0, it becomes marginal distribution adaptation.
2. When μ = 1, it becomes conditional distribution adaptation.
3. When μ = 0.5, it becomes joint distribution adaptation.

We draw the results of five different transfer learning tasks (U → M, B → E, etc.) in Fig. 3.2a to show the importance of μ. We observe that the optimal transfer learning performance does not always come at a fixed value of μ, which implies that the optimal μ differs across tasks; thus we need to tune its value to better estimate the weights of the marginal and conditional distributions. In addition, there is no unified trend for μ, so we need effective algorithms to estimate it. Then, another challenge arises: how do we estimate the value of μ?

Table 3.1 Distribution measurement in transfer learning

| Method | Assumption | Learning problem |
|---|---|---|
| Marginal distribution adaptation | P_s(y|x) = P_t(y|x) | min D(P_s(x), P_t(x)) |
| Conditional distribution adaptation | P_s(x) = P_t(x) | min D(P_s(y|x), P_t(y|x)) |
| Joint distribution adaptation | P_s(x, y) ≠ P_t(x, y) | min D(P_s(x), P_t(x)) + D(P_s(y|x), P_t(y|x)) |
| Dynamic distribution adaptation | P_s(x, y) ≠ P_t(x, y) | min (1 − μ)D(P_s(x), P_t(x)) + μD(P_s(y|x), P_t(y|x)) |

Fig. 3.2 Balance factor μ. (a) The importance of the balance factor μ: accuracy (%) as μ varies from 0 to 1 on the tasks U → M, B → E, W → A, C → P, and Cl → Pr. (b) Estimation of μ: error of the Random, Average, and DDA estimations on each task

3.2.1 Estimation of the Balance Factor μ

We could treat μ as a hyperparameter and tune its value by cross-validation (denoting the optimally tuned value by μ_opt). However, in the most challenging setting, unsupervised domain adaptation, there are no labels in the target domain, which makes this approach impossible. There are two other means to estimate its value: random guess and min–max averaging. Random guess is inspired by random search in neural network hyperparameter selection: we randomly choose a value of μ from [0, 1] and perform dynamic distribution adaptation with it. If we repeat this process t times and denote the transfer performance of the i-th trial by r_i, then the final performance of random guess is r_rand = (1/t) Σ_{i=1}^{t} r_i. Min–max averaging is similar: we enumerate the values of μ in [0, 1] with a step of 0.1, obtaining the candidate set [0, 0.1, ..., 0.9, 1.0]; we then perform dynamic distribution adaptation with each value and report the final performance r_minmax = (1/11) Σ_{i=1}^{11} r_i.

However, these two methods require intense computation and are neither well interpreted nor theoretically guaranteed. Wang et al. (2018b) and Wang et al. (2020) proposed the dynamic distribution adaptation approach for transfer learning and, more importantly, the first accurate estimation of μ. They leveraged the global and local properties of the distributions, adopting the A-distance (Ben-David et al., 2007) as the basic measurement. The A-distance uses the error of a binary classifier trained to distinguish the two domains. Formally, denoting by ε(h) the error of using classifier h to separate the two domains D_s and D_t, the A-distance is defined as

d_A(D_s, D_t) = 2(1 − 2ε(h)).    (3.7)

58

3 Overview of Transfer Learning Algorithms

We can directly use the above equation to compute the A-distance between the marginal distributions of the two domains, denoted by d_M. For the A-distance between the conditional distributions, we use d_c to denote the A-distance on class c, computed as d_c = d_A(D_s^(c), D_t^(c)), where D_s^(c) and D_t^(c) are the samples of the c-th class. Finally, we can compute the value of μ as

μ̂ = 1 − d_M / (d_M + Σ_{c=1}^{C} d_c).    (3.8)

The results in Fig. 3.2b show that the proposed estimation of $\mu$ is better than the other two estimations, which leads to better transfer performance. Since the representations change dynamically and progressively during training, this estimation should be performed at each iteration. The estimation of marginal and conditional distributions is of great importance to transfer learning. Of course, there could be more effective approaches for the computation of $\mu$ in the future. In later works, dynamic distribution adaptation was applied to deep neural networks (Wang et al., 2020), adversarial training (Yu et al., 2019), and human activity recognition (Qin et al., 2019) and achieved satisfying performance.
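As a concrete illustration, the $\mathcal{A}$-distance of Eq. (3.7) and the estimator of Eq. (3.8) can be sketched with a linear domain classifier. This is a minimal sketch and not the authors' implementation: the function names, the choice of logistic regression as the domain classifier, and the use of pseudo-labels for the target domain are our assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def a_distance(Xs, Xt):
    """Proxy A-distance: 2 * (1 - 2 * err), where err is the error of a
    binary classifier trained to separate source from target samples."""
    X = np.vstack([Xs, Xt])
    y = np.hstack([np.zeros(len(Xs)), np.ones(len(Xt))])
    # cross-validated accuracy of a linear domain classifier
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3).mean()
    err = 1.0 - acc
    return 2.0 * (1.0 - 2.0 * err)

def estimate_mu(Xs, Ys, Xt, Yt_pseudo):
    """Estimate the balance factor via Eq. (3.8): mu = 1 - d_M / (d_M + sum_c d_c).
    In unsupervised adaptation, the target labels are usually pseudo-labels."""
    d_m = a_distance(Xs, Xt)                      # marginal A-distance
    d_cs = []
    for c in np.unique(Ys):
        Xs_c, Xt_c = Xs[Ys == c], Xt[Yt_pseudo == c]
        if len(Xt_c) >= 3:                        # skip near-empty pseudo-classes
            d_cs.append(a_distance(Xs_c, Xt_c))   # per-class (conditional) A-distance
    return 1.0 - d_m / (d_m + sum(d_cs))
```

In the actual dynamic distribution adaptation methods, this estimate is recomputed at every iteration as the features evolve.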

3.3 A Unified Framework for Transfer Learning

After obtaining the unified representation of distribution matching, we now establish a unified framework to represent most transfer learning algorithms. Recalling the popularity of structural risk minimization in machine learning, we use it as the foundation of our unified framework. Our expectation is that readers can get an overview of the idea behind all the fancy algorithms so that they can easily understand and use existing algorithms and then develop their own. We establish a unified framework for transfer learning based on the principle of SRM in Eq. (2.4) as follows.

Definition 3.1 (A Unified Framework for Transfer Learning) Given a labeled source domain $\mathcal{D}_s = \{(x_i, y_i)\}_{i=1}^{N_s}$ and an unlabeled target domain $\mathcal{D}_t = \{x_j\}_{j=1}^{N_t}$, their joint distributions are different, i.e., $P_s(x, y) \neq P_t(x, y)$. A unified framework for transfer learning has the following form:

$$f^* = \arg\min_{f \in \mathcal{H}} \frac{1}{N_s} \sum_{i=1}^{N_s} \ell(f(v_i x_i), y_i) + \lambda R(T(\mathcal{D}_s), T(\mathcal{D}_t)), \tag{3.9}$$


where:

• $v \in \mathbb{R}^{N_s}$ denotes the weights for the source samples with $v_i \in [0, 1]$, and $N_s$ is the number of source domain samples.
• $T$ is the feature transformation function applied to both domains.
• Note that when introducing the weights $v_i$, we may replace the plain average with a weighted average.

We use $R(T(\mathcal{D}_s), T(\mathcal{D}_t))$ to replace the regularization term $R(f)$ of SRM to better fit transfer learning problems. In fact, since regularization is widely adopted in machine learning, we can often add such a term to our objective function. We call this term the transfer regularization term.

Under this unified framework, transfer learning problems can be summarized as finding the optimal regularization functional. This also implies that, compared to traditional machine learning, transfer learning focuses more on the relationship between the source and target domains (since the transfer regularization term is all about this relationship).

Then, can this unified framework summarize most transfer learning algorithms? Our answer is: yes. Specifically, we can give $v_i$ and $T$ different values or functionals in Eq. (3.9), which yields three basic classes of transfer learning algorithms:

1. Instance weighting methods. This kind of method focuses on learning the sample weights $v_i$.
2. Feature transformation methods. This kind of method corresponds to the case $v_i = 1, \forall i$, whose target is to learn a feature transformation functional $T$ to reduce the regularization loss $R(\cdot, \cdot)$.
3. Model pre-training methods. This kind of method corresponds to $v_i = 1, \forall i$ and $R(T(\mathcal{D}_s), T(\mathcal{D}_t)) := R(\mathcal{D}_t; f_s)$. Here, our goal is to design strategies to regularize the source function $f_s$ and fine-tune it on the target domain.

Obviously, different settings can occur at the same time. For instance, if we learn the values of $v_i$ and $T$ simultaneously, then we face a joint instance weighting and feature transformation problem. This can be seen as an extension of the above definition. These three classes can summarize most of the popular transfer learning methods in the literature. We will introduce their details in later chapters and only give a brief introduction to each of them here.

3.3.1 Instance Weighting Methods

The motivation of instance weighting methods is intuitive: the key to transfer learning is the similarity between the source and target domains. Hence, we can learn to select a subset $\mathcal{D}'_s \subset \mathcal{D}_s$ that retains the information of the source domain but, more importantly, is similar to the target domain. This can be done by learning $v_i$. In this case, we do not need to explicitly solve for the feature transformation functional $T$, since we can directly solve for $v_i$ without going through any feature transformation. We will introduce more details of such methods in Chap. 4.
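Once the weights $v_i$ are obtained, the weighted source risk of Eq. (3.9) can be minimized with the `sample_weight` mechanism that most Scikit-learn estimators support. The snippet below is an illustrative sketch with synthetic data and placeholder weights; in a real instance weighting method, $v_i$ would come from a learned selection or weighting rule.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
Xs = rng.randn(100, 5)                  # synthetic source features
Ys = (Xs[:, 0] > 0).astype(int)         # synthetic source labels

# v_i in [0, 1]: placeholder weights for illustration; a real method would
# set them according to each sample's similarity to the target domain
v = rng.uniform(0.0, 1.0, size=len(Xs))

clf = LogisticRegression(max_iter=1000)
clf.fit(Xs, Ys, sample_weight=v)        # weighted empirical risk minimization
```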

3.3.2 Feature Transformation Methods

Feature transformation methods are more closely tied to probability distribution measurements. If we treat all samples in both domains as equally important (i.e., $v_i = 1, \forall i$), then the goal of transfer learning becomes learning the feature transformation functional $T$ that minimizes the distribution discrepancy. How do we solve for $T$? We categorize feature transformation methods into two types: statistical transformation and geometrical transformation. The goal of statistical transformation is to explicitly reduce the distribution divergence, while the goal of geometrical transformation is to reduce it implicitly.

What are explicit and implicit reduction of distribution distance? "Explicit" means we can directly use an existing measurement to compute the distribution divergence, such as Euclidean distance, cosine similarity, Mahalanobis distance, Kullback–Leibler divergence, Jensen–Shannon divergence, and mutual information. On the other hand, if we view distribution measurement as a metric learning problem, then the above-mentioned distances can be seen as pre-defined distances that work in most cases. However, such pre-defined distances may not be enough to capture the divergence, since the features change dynamically. How can we exploit this dynamic property to derive a more proper distance measurement? For instance, a generative adversarial network (Goodfellow et al., 2014) uses a discriminator to classify whether a sample is a real or a generated image. Such a data-driven distance can be seen as an "implicit" distance. Geometrical transformation can also be seen as inducing an implicit distance, since we do not seek an explicit distance function. We will introduce more details on feature transformation methods in Chaps. 5 and 6.
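As an example of an "explicit" measurement, the empirical maximum mean discrepancy with a linear kernel reduces to the squared distance between the two domain means, which a statistical transformation method would try to minimize after applying $T$. This is our own minimal helper, not code from the book:

```python
import numpy as np

def linear_mmd(Xs, Xt):
    # linear-kernel MMD: squared Euclidean distance between the domain means
    delta = Xs.mean(axis=0) - Xt.mean(axis=0)
    return float(delta @ delta)
```

Given a candidate transformation `T`, one would compare `linear_mmd(T(Xs), T(Xt))` against the untransformed value; a smaller value indicates a smaller explicitly measured divergence.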

3.3.3 Model Pre-training

The third kind of method is called model pre-training. If we already have a pre-trained model $f_s$ on the source domain and there are some labeled samples in the target domain, then we can directly apply $f_s$ to the target domain and fine-tune it. In this case, we focus on the target domain during the fine-tuning process without further considering the transfer regularization term. This pre-training and fine-tuning paradigm has been widely applied to computer vision (e.g., models pre-trained on the ImageNet dataset (Deng et al., 2009)) and natural language processing (e.g., the Transformer (Vaswani et al., 2017) and BERT (Devlin et al., 2018) models). We will introduce more details of pre-training methods in Chap. 8. Based on deep learning, we will also introduce deep transfer learning and adversarial transfer learning in Chaps. 9 and 10, respectively.

Table 3.2 Three basic transfer learning methods, as special cases of
$f^* = \arg\min_{f \in \mathcal{H}} \frac{1}{N_s} \sum_{i=1}^{N_s} \ell(f(v_i x_i), y_i) + \lambda R(T(\mathcal{D}_s), T(\mathcal{D}_t))$

Method                  | Setup                                                                                   | Objective
Instance weighting      | $T(\mathcal{D}_s), T(\mathcal{D}_t) = \mathcal{D}_s, \mathcal{D}_t$                      | $v_i$
Feature transformation  | $v_i = 1, \forall i$                                                                     | $T$
Model pre-training      | $v_i = 1, \forall i$, $R(T(\mathcal{D}_s), T(\mathcal{D}_t)) := R(\mathcal{D}_t; f_s)$   | SRM
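The pre-training and fine-tuning paradigm can be mimicked with a shallow model: train on the plentiful source data first, then continue training on a few labeled target samples. This sketch uses Scikit-learn's incremental `SGDClassifier` as a stand-in for a deep network; the data and the number of fine-tuning passes are synthetic assumptions of ours.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
# a large labeled source set, and a small labeled target set from a shifted distribution
Xs = rng.randn(500, 10)
Ys = (Xs[:, 0] > 0).astype(int)
Xt = rng.randn(20, 10) + 0.5
Yt = (Xt[:, 0] > 0.5).astype(int)

clf = SGDClassifier(random_state=0)
clf.partial_fit(Xs, Ys, classes=[0, 1])   # "pre-train" f_s on the source domain
for _ in range(10):                       # "fine-tune" on the labeled target samples
    clf.partial_fit(Xt, Yt)
```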

3.3.4 Summary We can see from the above introduction that our unified framework for transfer learning can cover most of the existing methods. We now summarize them in Table 3.2. Note that each category of methods is not separate from others, and they can be combined to benefit each other. Also, such a categorization can be naturally extended into deep learning, which will be introduced in later chapters.

3.4 Practice

We overviewed transfer learning algorithms at a high level in the previous sections. In this section, we will write code to establish a baseline framework for transfer learning practice and introduce some datasets used in this tutorial. This section can be seen as the foundation for the other practice sections. The complete code of this section can be found at: https://github.com/jindongwang/tlbook-code/tree/main/chap03_knn.

Our main programming language is Python,1 one of the most popular programming languages in artificial intelligence and machine learning research, especially in deep learning. There are some popular data science libraries such

1 https://www.python.org/.


as NumPy, Pandas, Scikit-learn, and SciPy that make Python convenient for machine learning research; moreover, there are also popular deep learning frameworks with great Python support, such as PyTorch, TensorFlow, and MXNet. Note that in the deep learning chapters of this book, we will use PyTorch as our main deep learning framework. In the rest of the practice sections, we assume that readers have basic coding skills and Python knowledge.

3.4.1 Data Preparation

ImageNet (Deng et al., 2009) has been the gold-standard benchmark for computer vision research for more than 10 years. Just as ImageNet serves computer vision, there are also standard datasets and benchmarks in transfer learning research, including:

• Object recognition datasets, such as Office-31 and Office-Home
• Handwritten digit datasets, such as MNIST, USPS, and SVHN
• Sentiment analysis datasets, such as the Amazon Review dataset, 20Newsgroups, and Reuters-21578
• Face recognition datasets, such as CMU-PIE
• Human activity recognition datasets, such as DSADS and Opportunity

We will not introduce all the datasets in detail in this section. In fact, we can construct transfer learning datasets in any area as long as they are suitable for transfer learning settings. For instance, in natural language processing

2 https://numpy.org/.
3 https://pandas.pydata.org/.
4 https://scikit-learn.org/.
5 https://scipy.org/.
6 https://pytorch.org/.
7 https://www.tensorflow.org/.
8 https://mxnet.apache.org/versions/1.9.0/.
9 https://faculty.cc.gatech.edu/~judy/domainadapt/.
10 https://www.hemanthdv.org/OfficeHome-Dataset/.
11 http://yann.lecun.com/exdb/mnist/.
12 https://git-disl.github.io/GTDLBench/datasets/usps_dataset/.
13 http://ufldl.stanford.edu/housenumbers/.
14 https://jmcauley.ucsd.edu/data/amazon/.
15 http://qwone.com/~jason/20Newsgroups/.
16 https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection.
17 https://www.cs.cmu.edu/afs/cs/project/PIE/MultiPie/Multi-Pie/Home.html.
18 https://archive.ics.uci.edu/ml/datasets/daily+and+sports+activities.
19 https://www.wes.org/fund/opportunity-challenge/.


tasks, cross-lingual tasks (Hou et al., 2022) are natural transfer learning tasks. We also show detailed information on several popular transfer learning datasets in Appendix B of this book.

We adopt Office-31 as the benchmark dataset for transfer learning in this book for fair comparison. For other datasets, readers can easily perform the corresponding data processing steps by following different protocols. On the other hand, readers should also note that we are only testing algorithms on certain public datasets, and the results do not guarantee that one algorithm consistently performs the best in all applications. In real-world tasks, readers will still need to design and adapt the algorithms that best fit their tasks.

Office-31 (Saenko et al., 2010) is a popular benchmark for visual recognition, which includes three domains: Amazon (online retailer images), Webcam (low-resolution images captured by web cameras), and DSLR (high-resolution images taken by professional cameras). There are 4110 samples and 31 classes in total. Since each domain contains images from a different distribution, we can construct transfer learning tasks using any two domains, leading to 6 tasks in total: $A \rightarrow D, A \rightarrow W, \ldots, W \rightarrow A$. The data samples in these three domains are shown in Fig. 3.3, which clearly shows that even if two images from different domains belong to the same category, they still differ in distribution (e.g., lighting, background, and view angle).

The samples in the Office-31 dataset (Saenko et al., 2010) are raw images, which can be directly used as inputs for deep learning algorithms without further feature extraction. However, we still need to perform feature extraction for traditional learning methods. Hence, we often extract ResNet-50 features for the Office-31 dataset (i.e., using ResNet-50 (He et al., 2016) as the feature extraction backbone network). Readers do not need to worry about this feature extraction method for now; you only need to understand that we should extract features for traditional methods before feeding them into the model. The download link for the raw dataset and the extracted features is: https://github.com/jindongwang/transferlearning/tree/master/data#office-31.

Fig. 3.3 Samples from the Office-31 dataset (Saenko et al., 2010): the Amazon, DSLR, and Webcam domains


Fig. 3.4 Office-31 dataset

After downloading, you should unzip them to the corresponding folders as shown in Fig. 3.4, where each folder corresponds to samples belonging to the same class.

3.4.2 Baseline Model: K-Nearest Neighbors

In order to show the necessity of transfer learning algorithms, we use a KNN (K-nearest neighbors) classifier as the traditional baseline. We construct a KNN classifier and use it to classify the Office-31 dataset. We load the samples from a domain using the following function, which returns the features and labels.

Load Office-31 data:

```python
import numpy as np

def load_csv(folder, src_domain, tar_domain):
    data_s = np.loadtxt(f'{folder}/amazon_{src_domain}.csv', delimiter=',')
    data_t = np.loadtxt(f'{folder}/amazon_{tar_domain}.csv', delimiter=',')
    Xs, Ys = data_s[:, :-1], data_s[:, -1]  # source features and labels
    Xt, Yt = data_t[:, :-1], data_t[:, -1]  # target features and labels
    return Xs, Ys, Xt, Yt
```

Then, we build a KNN classifier with the help of the Scikit-learn package, which takes the features (X) and labels (Y) of the source and target domains and classifies the target samples.

KNN classifier:

```python
def knn_classify(Xs, Ys, Xt, Yt, k=1):
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score
    model = KNeighborsClassifier(n_neighbors=k)
    Ys = Ys.ravel()
    Yt = Yt.ravel()
    model.fit(Xs, Ys)                 # train on the labeled source domain
    Yt_pred = model.predict(Xt)       # predict on the target domain
    acc = accuracy_score(Yt, Yt_pred)
    print('Accuracy using kNN: {:.2f}%'.format(acc * 100))
```


Fig. 3.5 KNN classification results

Finally, we call this function in the main function to complete the transfer learning classification task. In this example, the source domain is amazon, and the target domain is webcam. Readers are encouraged to try other settings.

Main function:

```python
if __name__ == "__main__":
    folder = './office31-decaf'
    src_domain = 'amazon'
    tar_domain = 'webcam'
    # load both domains at once with the load_csv function defined above
    Xs, Ys, Xt, Yt = load_csv(folder, src_domain, tar_domain)
    print('Source:', src_domain, Xs.shape, Ys.shape)
    print('Target:', tar_domain, Xt.shape, Yt.shape)
    knn_classify(Xs, Ys, Xt, Yt)
```

The outputs are shown in Fig. 3.5. There are 2817 samples in the source domain and 795 samples in the target domain. The accuracy of transfer learning from amazon to webcam is 74.59%. We will not introduce deep learning-based methods in this chapter; interested readers may refer to Sect. 9.6.

References

Ben-David, S., Blitzer, J., Crammer, K., Pereira, F., et al. (2007). Analysis of representations for domain adaptation. In NIPS, volume 19.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial networks. In NIPS.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
Hou, W., Zhu, H., Wang, Y., Wang, J., Qin, T., Xu, R., and Shinozaki, T. (2022). Exploiting adapters for cross-lingual low-resource speech recognition. IEEE Transactions on Audio, Speech, and Language Processing (TASLP).
Long, M., Wang, J., et al. (2013). Transfer feature learning with joint distribution adaptation. In ICCV, pages 2200–2207.
Pan, S. J., Tsang, I. W., Kwok, J. T., and Yang, Q. (2011). Domain adaptation via transfer component analysis. IEEE TNN, 22(2):199–210.
Qin, X., Chen, Y., Wang, J., and Yu, C. (2019). Cross-dataset activity recognition via adaptive spatial-temporal transfer learning. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 3(4):1–25.
Saenko, K., Kulis, B., Fritz, M., and Darrell, T. (2010). Adapting visual category models to new domains. In ECCV, pages 213–226. Springer.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, 30.
Wang, J., Chen, Y., Feng, W., Yu, H., Huang, M., and Yang, Q. (2020). Transfer learning with dynamic distribution adaptation. ACM TIST, 11(1):1–25.
Wang, J., Chen, Y., Hu, L., Peng, X., and Yu, P. S. (2018a). Stratified transfer learning for cross-domain activity recognition. In 2018 IEEE International Conference on Pervasive Computing and Communications (PerCom).
Wang, J., Feng, W., Chen, Y., Yu, H., Huang, M., and Yu, P. S. (2018b). Visual domain adaptation with manifold embedded distribution alignment. In ACMMM, pages 402–410.
Yu, C., Wang, J., Chen, Y., and Huang, M. (2019). Transfer learning with dynamic adversarial adaptation network. In The IEEE International Conference on Data Mining (ICDM).
Zhao, H., Des Combes, R. T., Zhang, K., and Gordon, G. (2019). On learning invariant representations for domain adaptation. In International Conference on Machine Learning, pages 7523–7532.
Zhu, Y., Zhuang, F., Wang, J., Ke, G., Chen, J., Bian, J., Xiong, H., and He, Q. (2020). Deep subdomain adaptation network for image classification. IEEE Transactions on Neural Networks and Learning Systems.

Chapter 4

Instance Weighting Methods

Instance weighting methods are among the most effective methods for transfer learning. Technically speaking, any weighting method can be used to evaluate the importance of each instance. In this chapter, we mainly focus on two basic approaches: instance selection and instance weight adaptation. These two kinds of methods are widely adopted in existing transfer learning research and also act as basic modules for more complicated systems.

The organization of this chapter is as follows. In Sect. 4.1, we present the basic problem formulation of instance weighting methods. Then, we introduce instance selection methods in Sect. 4.2. After that, we describe weight adaptation methods in Sect. 4.3. Section 4.4 provides some actionable code. Finally, we give a summary of this chapter in Sect. 4.5.

4.1 Problem Definition

Recall that we pointed out in Chap. 3 that the core idea of transfer learning is to reduce the distribution discrepancy between two domains. Then, what can instance weighting methods do toward achieving this goal? Since the dimension and the number of samples in transfer learning are often very large, it is impossible to directly estimate $P_s(x)$ and $P_t(x)$. Instead, we can select some samples from the labeled source domain such that the distribution of the selected subset is similar to that of the target domain. Then, we can establish models using traditional machine learning methods. The key to this kind of method is how to design the selection criterion. On the other hand, designing a data selection scheme is equivalent to designing sample weighting rules (data selection can be seen as a special case of weighting: we can simply use the weights 1 and 0 to indicate whether a sample is selected or not).

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 J. Wang, Y. Chen, Introduction to Transfer Learning, Machine Learning: Foundations, Methodologies, and Applications, https://doi.org/10.1007/978-981-19-7584-4_4


Fig. 4.1 Illustration of instance weighting methods: the source domain contains multiple kinds of animals, while the target domain mainly contains dogs, so the weights of the dog samples are increased

Figure 4.1 shows the basic idea of instance weighting methods. The source domain contains animals of different categories, such as dogs, cats, and birds, but there is only one major category (dog) in the target domain. For transfer learning, in order to make the source and target domains more similar, we can design a weighting strategy to increase the weights of the dog class, i.e., giving it larger weights in training.

Many research efforts (Khan and Heisterkamp, 2016; Zadrozny, 2004; Cortes et al., 2008; Dai et al., 2007) focused on estimating the distribution ratio $P_s(x)/P_t(x)$ of the source and target domains, which leads to the sample weight $v_i$. These methods often assume $\frac{P_s(x)}{P_t(x)} < \infty$ and that the conditional distributions of the two domains are the same (i.e., $P(y|x_s) = P(y|x_t)$). Specifically, Dai et al. (2007) proposed a method called TrAdaboost that applies the boosting strategy to transfer learning in order to increase the weights of the samples that are useful for the target domain and decrease the weights of the less useful samples. They also developed a generalization error bound based on PAC theory (Valiant, 1984). TrAdaboost is one of the most classic methods for instance weighting. In the same year, Huang et al. (2007) proposed a kernel mean matching (KMM) approach that estimates probability distributions so that the weighted distributions become closer.

Definition 4.1 (Instance Weighting for Transfer Learning) Given a labeled source domain $\mathcal{D}_s = \{(x_i, y_i)\}_{i=1}^{N_s}$ and an unlabeled target domain $\mathcal{D}_t = \{x_j\}_{j=1}^{N_t}$, their joint distributions are different, i.e., $P_s(x, y) \neq P_t(x, y)$. Let the vector $v \in \mathbb{R}^{N_s}$ denote the weight of every sample in the source domain; then, the goal of instance weighting methods is to learn an optimal weighting vector $v^*$ such that the distribution discrepancy is reduced: $D(P_s(x, y|v), P_t(x, y)) < D(P_s(x, y), P_t(x, y))$. Then, the risk on the target domain can be minimized:

$$f^* = \arg\min_{f \in \mathcal{H}} \frac{1}{N_s} \sum_{i=1}^{N_s} \ell(v_i f(x_i), y_i) + \lambda R(\mathcal{D}_s, \mathcal{D}_t), \tag{4.1}$$

where the vector $v$ is our learning target. In the next sections, we will introduce instance selection methods ($v_i \in \{0, 1\}$) and weight adaptation methods ($v_i \in [0, 1]$).

4.2 Instance Selection Methods

Instance selection methods generally assume that the marginal distributions of the source and target domains are the same, i.e., $P_s(x) \approx P_t(x)$. When their conditional distributions are different, we should design a selection strategy to select the appropriate samples. In fact, if we treat the whole selection process as a decision process, it can be illustrated as in Fig. 4.2. This process mainly consists of the following modules:

• Instance selector $f$, which selects a subset of samples from the source domain whose distribution is similar to that of the target domain
• Performance evaluator $g$, which evaluates the distribution divergence between the selected subset and the target domain
• Reward $r$, which provides feedback on the selected subsets so that the selector can adjust its selection process

It is easy to see that the above process can be viewed as a Markov decision process (MDP) in reinforcement learning (Sutton and Barto, 2018). Therefore, a natural idea arises: we can design reinforcement learning algorithms to select the best samples from the source domain by designing appropriate selector, reward,

Fig. 4.2 Instance selection methods for transfer learning: an instance selector picks a subset of the source data, a performance evaluator compares it with the target data, and a reward signal is fed back to the selector

and performance evaluator. For instance, we can leverage the classic REINFORCE algorithm (Sutton and Barto, 2018) to learn a selection policy, or we can leverage deep Q-learning algorithms for this task.

In this section, we categorize instance selection methods into two categories based on whether they adopt reinforcement learning algorithms: (1) non-reinforcement learning-based methods and (2) reinforcement learning-based methods.

4.2.1 Non-reinforcement Learning-Based Methods

Before deep learning, researchers often leveraged traditional learning methods for instance selection. Existing research can be further categorized into three sub-classes: (a) distance measurement-based methods, (b) meta-learning-based methods, and (c) other methods.

Distance measurement-based methods are very straightforward. They rely on measurements designed by humans to ensure that the selected subset has the minimum distribution divergence with the target domain. Popular measurements include cross-entropy, maximum mean discrepancy (Borgwardt et al., 2006), and KL divergence, which can be found in Appendix A. The process of these methods is also intuitive and consists of two stages. The first stage exploits the measurement to select the best subset from the source domain. At the second stage, we perform traditional machine learning on the subset and the target domain. Note that we may not want to reverse the order of the two stages, since once a sample is selected, it will not be selected again in the subsequent process. These methods are widely adopted in natural language processing applications (Axelrod et al., 2011; Song et al., 2012; Murthy et al., 2018; Moore and Lewis, 2010; Duh et al., 2013; Chatterjee et al., 2016; Mirkin and Besacier, 2014; Plank and Van Noord, 2011; Ruder et al., 2017; Søgaard, 2011; Van Asch and Daelemans, 2010; Poncelas et al., 2019).

Meta-learning-based methods design an extra network (i.e., a meta-network) to learn the selection strategy. The meta-network can interact with the main task to adjust the selection process. Thus, the process is not two-stage but end-to-end and interactive. For instance, Shu et al. (2019) exploited curriculum learning (Bengio et al., 2009) to formulate the sample selection process for weakly supervised learning as a meta-learning task. They can then optimize the two networks (the main network and the meta-network) at the same time. Similar ideas are also adopted in other literature such as (Chen and Huang, 2016; Wang et al., 2017; Coleman et al., 2019; Loshchilov and Hutter, 2015; Wu and Huang, 2016; Ren et al., 2018).

In addition to distance measurement-based and meta-learning-based methods, other methods such as Bayesian-based strategies (Tsvetkov et al., 2016; Ruder and Plank, 2017) can also be used for instance selection. We will not cover their literature in depth here.
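A toy version of the two-stage, distance measurement-based pipeline: stage one scores every source sample and keeps the closest $k$ (i.e., sets $v_i = 1$ for them); stage two would then train any traditional model on the selected subset. The centroid-distance score used below is a deliberately crude stand-in of ours for richer measurements such as MMD or KL divergence.

```python
import numpy as np

def select_instances(Xs, Xt, k):
    # stage 1: score each source sample by its distance to the target centroid
    center = Xt.mean(axis=0)
    dist = np.linalg.norm(Xs - center, axis=1)
    # keep the k closest source samples: their indices get weight v_i = 1
    return np.argsort(dist)[:k]
```

Stage 2 would then fit a traditional learner on `Xs[idx], Ys[idx]` for the returned indices `idx`.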


In particular, Tsvetkov et al. (2016) proposed three important factors for data selection:

1. Simplicity: The data selection process should be simple and should not introduce much computational burden to the model.
2. Diversity: The selected instances must be diverse enough for better generalization.
3. Prototypicality: The selected instances must be representative so that they truly bring extra knowledge and patterns to the model.

Readers should also pay attention to the connection between curriculum learning (Bengio et al., 2009) and instance selection. Since curriculum learning emphasizes learning from easy to hard samples, which is similar to human learning, it can certainly help with data selection in transfer learning in many ways.

4.2.2 Reinforcement Learning-Based Methods

With the success of deep learning (Krizhevsky et al., 2012; Deng et al., 2009; He et al., 2016), reinforcement learning, especially deep reinforcement learning, has become popular in recent years. For instance, Google DeepMind's AlphaGo series (Silver et al., 2016, 2017) beat human experts in the game of Go, which dramatically accelerated the development of deep reinforcement learning. Instance selection methods can also take advantage of reinforcement learning for better selection. Feng et al. (2018) proposed a data selection method based on reinforcement learning that can learn from noisy data. Patel et al. (2018) leveraged deep Q-learning to design a sampling strategy for domain adaptation problems. Later, Liu et al. (2019) leveraged the REINFORCE algorithm (Sutton and Barto, 2018) for instance selection in domain adaptation. Their method divides the source domain into several batches and then learns the weight of each sample in each batch. It is important to note that, in order to measure the distribution divergence between domains, they first select some labeled samples as a guidance set. Then, in batch-level training, the guidance set guides the weight learning and feature learning processes.

When applying reinforcement learning methods, it is important to define the key concepts: state, action, and reward. For instance, these concepts in Liu et al. (2019) are defined as follows:

• The state is constructed from the weight vector of the current batch and the parameters of the feature extractor.
• The action is implemented as the selection process, and thus it is a binary vector: 0 denotes not selecting a sample and 1 denotes selecting it.


• The reward is implemented as the distribution divergence between the source and target domains.

Specifically, the reward function is the key in reinforcement learning-based methods. In the above literature, it is implemented as

$$r(s, a, s') = d\left(\phi^s_{B_{j-1}}, \phi_t\right) - \gamma\, d\left(\phi^s_{B_j}, \phi_t\right), \tag{4.2}$$

where the subscripts $s$ and $t$ denote the source and target domains, respectively; $d(\cdot, \cdot)$ is a distribution measurement function, implemented as the MMD and Rényi distances in their experiments; $(s, a, s')$ denotes the transition from state $s$ to state $s'$ after taking action $a$; $B_{j-1}$ and $B_j$ denote the batches at the $(j-1)$-th and $j$-th iterations, respectively; and $\phi$ denotes the feature representation. The optimal solution can be obtained by standard deep learning optimization.

Other literature (Dong and Xing, 2018; Wang et al., 2019b,a; Guo et al., 2019; Qu et al., 2019) also applied reinforcement learning to selection, integrating it into the transfer learning process. Finally, it is important to note that instance selection and feature learning are actually two complementary processes; thus, their combination is conducive to better performance. For instance, the works of Qu et al. (2019) and Wang et al. (2019a) combined the two learning processes in one network and achieved great performance.
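The reward of Eq. (4.2) can be made concrete with a simple choice of $d(\cdot, \cdot)$: a linear-kernel MMD between batch features and target features. This is our own illustrative sketch, not the implementation of Liu et al. (2019).

```python
import numpy as np

def mmd(X, Y):
    # linear-kernel MMD between two feature sets: one possible choice for d(., .)
    diff = X.mean(axis=0) - Y.mean(axis=0)
    return float(diff @ diff)

def reward(phi_prev_batch, phi_batch, phi_target, gamma=0.9):
    # Eq. (4.2): the reward is positive when the current batch lies closer
    # to the target domain than the previous batch did
    return mmd(phi_prev_batch, phi_target) - gamma * mmd(phi_batch, phi_target)
```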

4.3 Weight Adaptation Methods

Different from instance selection, weight adaptation methods assume that the conditional distributions of the two domains are the same while their marginal distributions are different: $P_s(y|x) \approx P_t(y|x)$, but $P_s(x) \neq P_t(x)$. Inspired by the classic work of Jiang and Zhai (2007), we solve the weight adaptation problem using maximum likelihood estimation. Let $\theta$ denote the learnable parameters of the model; then the optimal parameters on the target domain can be represented as

$$\theta_t^* = \arg\max_{\theta} \int_x \sum_{y \in \mathcal{Y}} P_t(x, y) \log P(y|x; \theta)\, dx. \tag{4.3}$$

By leveraging Bayes' theorem, the above equation can be computed as

$$\theta_t^* = \arg\max_{\theta} \int_x P_t(x) \sum_{y \in \mathcal{Y}} P_t(y|x) \log P(y|x; \theta)\, dx. \tag{4.4}$$


Note that there is only one unknown term, $P_t(y|x)$, which is exactly our learning objective. However, we can only leverage the distribution $P_s(x, y)$. Then, can we perform some transformation that leverages $P_s(x, y)$ to circumvent the computation of the conditional distribution $P_t(y|x)$ when learning $\theta_t^*$? The answer is yes. We can construct the relationship between the two probabilities and then perform the following transformation by leveraging the assumption that their conditional distributions are almost the same ($P_s(y|x) \approx P_t(y|x)$):

$$\begin{aligned}
\theta_t^* &\approx \arg\max_{\theta} \int_x P_s(x) \frac{P_t(x)}{P_s(x)} \sum_{y \in \mathcal{Y}} P_s(y|x) \log P(y|x; \theta)\, dx \\
&\approx \arg\max_{\theta} \int_x \tilde{P}_s(x) \frac{P_t(x)}{P_s(x)} \sum_{y \in \mathcal{Y}} \tilde{P}_s(y|x) \log P(y|x; \theta)\, dx \\
&\approx \arg\max_{\theta} \frac{1}{N_s} \sum_{i=1}^{N_s} \frac{P_t\left(x_i^s\right)}{P_s\left(x_i^s\right)} \log P\left(y_i^s | x_i^s; \theta\right),
\end{aligned} \quad (4.5)$$

where $\frac{P_t(x_i^s)}{P_s(x_i^s)}$ is called the density ratio, which can guide the instance weighting process. We can build the relationship between the source and target domains by leveraging the density ratio. In this way, the parameters on the target domain can be formulated as

$$\theta_t^* \approx \arg\max_{\theta} \frac{1}{N_s} \sum_{i=1}^{N_s} \frac{P_t\left(x_i^s\right)}{P_s\left(x_i^s\right)} \log P\left(y_i^s | x_i^s; \theta\right), \quad (4.6)$$

where every term can be estimated. Thus, the problem can be solved.

We know from the above analysis that the probability density ratio can help build the relationship between the source and target distributions. For simplicity, we denote the density ratio as

$$\beta_i := \frac{P_t\left(x_i^s\right)}{P_s\left(x_i^s\right)}. \quad (4.7)$$

Thus, the vector $\boldsymbol{\beta}$ can be used to denote the probability density ratios. Then, how does the density ratio play a role in the learning process of a certain algorithm? Recalling the transfer learning framework in Sect. 3.3, we can reformulate the prediction function on the target domain as

$$f^* = \arg\min_{f \in \mathcal{H}} \sum_{i=1}^{N_s} \beta_i\, \ell(f(x_i), y_i) + \lambda R(\mathcal{D}_s, \mathcal{D}_t). \quad (4.8)$$


The above formulation is a general representation, which can be instantiated in specific algorithms. For instance, we can reformulate it in logistic regression as

$$\min_{\theta} \sum_{i=1}^{m} -\beta_i \log P(y_i | x_i; \theta) + \frac{\lambda}{2} \|\theta\|^2, \quad (4.9)$$

while in SVM, it becomes

$$\min_{\theta, \xi}\ \frac{1}{2} \|\theta\|^2 + C \sum_{i=1}^{m} \beta_i \xi_i. \quad (4.10)$$
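As a concrete illustration of Eq. (4.9), the weighted negative log-likelihood can be minimized directly by gradient descent; the following is a minimal sketch with synthetic data and synthetic density ratios (the function name and all values are ours for illustration; in practice, the $\beta_i$ would come from a density ratio estimator such as KMM).

```python
import numpy as np


def weighted_logreg(X, y, beta, lamb=0.1, lr=0.1, iters=1000):
    """Gradient descent for Eq. (4.9):
    min_theta  sum_i -beta_i * log P(y_i | x_i; theta) + (lamb / 2) * ||theta||^2,
    where beta_i = P_t(x_i) / P_s(x_i) is the density ratio of sample i."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))          # P(y=1 | x; theta)
        grad = X.T @ (beta * (p - y)) + lamb * theta  # weighted NLL gradient
        theta -= lr * grad / n
    return theta


# Toy usage: two Gaussian blobs as the "source" data
rng = np.random.default_rng(0)
X = np.vstack((rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))))
y = np.array([0] * 50 + [1] * 50)
beta = rng.uniform(0.5, 2.0, size=100)  # synthetic density ratios
theta = weighted_logreg(X, y, beta)
```

The only difference from plain logistic regression is that each sample's gradient contribution is scaled by its density ratio $\beta_i$, which up-weights source samples that look like the target domain.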

Specifically, weight adaptation methods can also be integrated with the feature transformation methods (which are introduced in the next chapters). If we combine the learning of the density ratio with the MMD distance, then it can be formulated as

$$\mathrm{MMD}(\mathcal{D}_s, \mathcal{D}_t) = \sup_{f} \mathbb{E}_P \left[ \frac{1}{N_s} \sum_{i=1}^{N_s} \beta_i f(x_i) - \frac{1}{N_t} \sum_{j=1}^{N_t} f(x_j) \right] = \frac{1}{N_s^2} \boldsymbol{\beta}^T K \boldsymbol{\beta} - \frac{2}{N_s N_t} \boldsymbol{\kappa}^T \boldsymbol{\beta} + \mathrm{const}. \quad (4.11)$$

By adopting the kernel trick, the above problem can be formulated as

$$\min_{\boldsymbol{\beta}}\ \frac{1}{2} \boldsymbol{\beta}^T K \boldsymbol{\beta} - \boldsymbol{\kappa}^T \boldsymbol{\beta} \quad \text{s.t.}\ \beta_i \in [0, B]\ \text{and}\ \left| \sum_{i=1}^{N_s} \beta_i - N_s \right| \leq N_s \epsilon, \quad (4.12)$$

which is the classic algorithm called kernel mean matching (KMM) (Huang et al., 2007), where $\epsilon$ and B are pre-defined thresholds. More details on the KMM algorithm can be found in its original publication.

Based on the above analysis, more instance weight adaptation methods have been developed. It is important to note that this category of methods can be directly integrated into deep learning to learn the sample weights. For instance, the works of Wang et al. (2019c) and Wang et al. (2019d) learned the weighting scheme in the transfer learning and fine-tuning process. Moraffah et al. (2019) added causality to deep learning to learn better representations.


4.4 Practice

In this section, we implement the kernel mean matching (KMM) algorithm (Huang et al., 2007) for instance weighting in transfer learning. The core of KMM is to learn the weight ratio between the source and target domains by solving a quadratic programming problem. Then, we can use the KNN classifier for classification on the target domain. The complete code can be found at: https://github.com/jindongwang/tlbook-code/tree/main/chap04_instance.

The code of the KMM algorithm is shown as follows. The key is to implement the fit function in scikit-learn style. Then, we can use the cvxopt package for quadratic programming. We pack the algorithm into a Python package for better reuse.

The KMM algorithm:

```python
import numpy as np
from cvxopt import matrix, solvers


class KMM:
    def __init__(self, kernel_type='linear', gamma=1.0, B=1.0, eps=None):
        '''
        Initialization function
        :param kernel_type: 'linear' | 'rbf'
        :param gamma: kernel bandwidth for the rbf kernel
        :param B: bound for beta
        :param eps: bound for sigma_beta
        '''
        self.kernel_type = kernel_type
        self.gamma = gamma
        self.B = B
        self.eps = eps

    def fit(self, Xs, Xt):
        '''
        Fit source and target using KMM (compute the coefficients)
        :param Xs: ns * dim
        :param Xt: nt * dim
        :return: coefficient (Pt / Ps) vector (beta in the paper)
        '''
        ns = Xs.shape[0]
        nt = Xt.shape[0]
        if self.eps is None:
            self.eps = self.B / np.sqrt(ns)
        # kernel(...) is a helper function provided in the accompanying repository
        K = kernel(self.kernel_type, Xs, None, self.gamma)
        kappa = np.sum(kernel(self.kernel_type, Xs, Xt, self.gamma) *
                       float(ns) / float(nt), axis=1)

        K = matrix(K.astype(np.double))
        kappa = matrix(kappa.astype(np.double))
        # Constraints of Eq. (4.12): |sum(beta) - ns| <= ns * eps and 0 <= beta <= B
        G = matrix(np.r_[np.ones((1, ns)), -np.ones((1, ns)),
                         np.eye(ns), -np.eye(ns)])
        h = matrix(np.r_[ns * (1 + self.eps), ns * (self.eps - 1),
                         self.B * np.ones((ns,)), np.zeros((ns,))])

        # Quadratic program: min 0.5 * beta^T K beta - kappa^T beta
        sol = solvers.qp(K, -kappa, G, h)
        beta = np.array(sol['x'])
        return beta
```

Then, we can use the KNN classifier to obtain the classification results on the target domain. Note that we need to transform the source features using the KMM weights, as shown in the following function.

The KMM main function:

```python
# KMM main function: run KMM on Office-31 features, then classify with KNN
if __name__ == "__main__":
    folder = '../../office31_resnet50'
    src_domain = 'amazon'
    tar_domain = 'webcam'
    # load_csv and knn_classify are helper functions from the accompanying repository
    Xs, Ys, Xt, Yt = load_csv(folder, src_domain, tar_domain)
    print('Source:', src_domain, Xs.shape, Ys.shape)
    print('Target:', tar_domain, Xt.shape, Yt.shape)

    kmm = KMM(kernel_type='rbf', B=18)
    beta = kmm.fit(Xs, Xt)
    Xs_new = beta * Xs  # reweight the source features by the learned beta
    knn_classify(Xs_new, Ys, Xt, Yt, k=1, norm=args.norm)  # args comes from the script's argparse options
```

As shown in Fig. 4.3, the classification accuracy for the task amazon → webcam on the Office-31 dataset is 74.72%, which is better than the plain KNN result (74.59%) from the last chapter. This indicates the effectiveness of the KMM method. Of course, we can further improve the results by hyperparameter tuning in real applications.

Fig. 4.3 Results of the KMM method (Huang et al., 2007)


4.5 Summary

In this chapter, we introduced two categories of instance weighting methods: instance selection and weight adaptation. Instance selection is a general and basic problem that also arises in traditional machine learning and deep learning, which inspires us to draw more insights from the literature of those areas. On the other hand, the instance weighting methods developed for transfer learning can also be applied to applications beyond transfer learning problems.

References

Axelrod, A., He, X., and Gao, J. (2011). Domain adaptation via pseudo in-domain data selection. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 355–362. Association for Computational Linguistics.
Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48.
Borgwardt, K. M., Gretton, A., Rasch, M. J., Kriegel, H.-P., Schölkopf, B., and Smola, A. J. (2006). Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57.
Chatterjee, R., Arcan, M., Negri, M., and Turchi, M. (2016). Instance selection for online automatic post-editing in a multi-domain scenario. In The Twelfth Conference of the Association for Machine Translation in the Americas, pages 1–15.
Chen, B. and Huang, F. (2016). Semi-supervised convolutional networks for translation adaptation with tiny amount of in-domain data. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 314–323.
Coleman, C., Yeh, C., Mussmann, S., Mirzasoleiman, B., Bailis, P., Liang, P., Leskovec, J., and Zaharia, M. (2019). Selection via proxy: Efficient data selection for deep learning. arXiv preprint arXiv:1906.11829.
Cortes, C., Mohri, M., Riley, M., and Rostamizadeh, A. (2008). Sample selection bias correction theory. In International Conference on Algorithmic Learning Theory, pages 38–53, Budapest, Hungary. Springer.
Dai, W., Yang, Q., Xue, G.-R., and Yu, Y. (2007). Boosting for transfer learning. In ICML, pages 193–200. ACM.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE.
Dong, N. and Xing, E. P. (2018). Domain adaption in one-shot learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 573–588. Springer.
Duh, K., Neubig, G., Sudoh, K., and Tsukada, H. (2013). Adaptation data selection using neural language models: Experiments in machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 678–683.
Feng, J., Huang, M., Zhao, L., Yang, Y., and Zhu, X. (2018). Reinforcement learning for relation classification from noisy data. In Thirty-Second AAAI Conference on Artificial Intelligence.
Guo, H., Pasunuru, R., and Bansal, M. (2019). AutoSeM: Automatic task selection and mixing in multi-task learning. arXiv preprint arXiv:1904.04153.


He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
Huang, J., Smola, A. J., Gretton, A., Borgwardt, K. M., Schölkopf, B., et al. (2007). Correcting sample selection bias by unlabeled data. Advances in Neural Information Processing Systems, 19:601.
Jiang, J. and Zhai, C. (2007). Instance weighting for domain adaptation in NLP. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 264–271.
Khan, M. N. A. and Heisterkamp, D. R. (2016). Adapting instance weights for unsupervised domain adaptation using quadratic mutual information and subspace learning. In Pattern Recognition (ICPR), 2016 23rd International Conference on, pages 1560–1565. IEEE.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.
Liu, M., Song, Y., Zou, H., and Zhang, T. (2019). Reinforced training data selection for domain adaptation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1957–1968.
Loshchilov, I. and Hutter, F. (2015). Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343.
Mirkin, S. and Besacier, L. (2014). Data selection for compact adapted SMT models.
Moore, R. C. and Lewis, W. (2010). Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, pages 220–224. Association for Computational Linguistics.
Moraffah, R., Shu, K., Raglin, A., and Liu, H. (2019). Deep causal representation learning for unsupervised domain adaptation. arXiv preprint arXiv:1910.12417.
Murthy, R., Kunchukuttan, A., and Bhattacharyya, P. (2018). Judicious selection of training data in assisting language for multilingual neural NER. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 401–406.
Patel, Y., Chitta, K., and Jasani, B. (2018). Learning sampling policies for domain adaptation. arXiv preprint arXiv:1805.07641.
Plank, B. and Van Noord, G. (2011). Effective measures of domain similarity for parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 1566–1576. Association for Computational Linguistics.
Poncelas, A., Wenniger, G. M. d. B., and Way, A. (2019). Transductive data-selection algorithms for fine-tuning neural machine translation. arXiv preprint arXiv:1908.09532.
Qu, C., Ji, F., Qiu, M., Yang, L., Min, Z., Chen, H., Huang, J., and Croft, W. B. (2019). Learning to selectively transfer: Reinforced transfer learning for deep text matching. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 699–707.
Ren, M., Zeng, W., Yang, B., and Urtasun, R. (2018). Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050.
Ruder, S. and Plank, B. (2017). Learning to select data for transfer learning with Bayesian optimization. arXiv preprint arXiv:1707.05246.
Ruder, S., Ghaffari, P., and Breslin, J. G. (2017). Data selection strategies for multi-domain sentiment analysis. arXiv preprint arXiv:1702.02426.
Shu, J., Xie, Q., Yi, L., Zhao, Q., Zhou, S., Xu, Z., and Meng, D. (2019). Meta-Weight-Net: Learning an explicit mapping for sample weighting. In Advances in Neural Information Processing Systems, pages 1917–1928.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484.


Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676):354.
Søgaard, A. (2011). Data point selection for cross-language adaptation of dependency parsers. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2, pages 682–686. Association for Computational Linguistics.
Song, Y., Klassen, P., Xia, F., and Kit, C. (2012). Entropy-based training data selection for domain adaptation. In Proceedings of COLING 2012: Posters, pages 1191–1200.
Sugiyama, M., Krauledat, M., and Müller, K.-R. (2007). Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(May):985–1005.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
Tsvetkov, Y., Faruqui, M., Ling, W., MacWhinney, B., and Dyer, C. (2016). Learning the curriculum with Bayesian optimization for task-specific word representation learning. arXiv preprint arXiv:1605.03852.
Valiant, L. (1984). A theory of the learnable. Commun. ACM, 27:1134–1142.
Van Asch, V. and Daelemans, W. (2010). Using domain similarity for performance estimation. In Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, pages 31–36. Association for Computational Linguistics.
Wang, R., Utiyama, M., Liu, L., Chen, K., and Sumita, E. (2017). Instance weighting for neural machine translation domain adaptation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1482–1488.
Wang, B., Qiu, M., Wang, X., Li, Y., Gong, Y., Zeng, X., Huang, J., Zheng, B., Cai, D., and Zhou, J. (2019a). A minimax game for instance based selective transfer learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 34–43.
Wang, J., Chen, Y., Feng, W., Yu, H., Huang, M., and Yang, Q. (2019b). Transfer learning with dynamic distribution adaptation. ACM Transactions on Intelligent Systems and Technology (TIST).
Wang, Y., Zhao, D., Li, Y., Chen, K., and Xue, H. (2019c). The most related knowledge first: A progressive domain adaptation method. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 90–102. Springer.
Wang, Z., Bi, W., Wang, Y., and Liu, X. (2019d). Better fine-tuning via instance weighting for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7241–7248.
Wu, F. and Huang, Y. (2016). Sentiment domain adaptation with multiple sources. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 301–310.
Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. In Proceedings of the Twenty-First International Conference on Machine Learning, page 114, Alberta, Canada. ACM.

Chapter 5

Statistical Feature Transformation Methods

In this chapter, we introduce statistical feature transformation methods for transfer learning. This kind of approach is extremely popular in the existing literature and achieves good results. In particular, such methods are often implemented in deep neural networks in recent research, demonstrating remarkable performance. Thus, it is important to understand their basics. Note that we will focus on the basics and will not introduce the deep learning extensions, which are covered in later chapters.

The organization of this chapter is as follows. In Sect. 5.1, we describe the problem definition of statistical feature transformation methods. In Sect. 5.2, we introduce maximum mean discrepancy-based approaches. Then, in Sect. 5.3, we present metric learning-based approaches. Later, we provide the code practice in Sect. 5.4. Finally, we summarize this chapter in Sect. 5.5.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. J. Wang, Y. Chen, Introduction to Transfer Learning, Machine Learning: Foundations, Methodologies, and Applications, https://doi.org/10.1007/978-981-19-7584-4_5

5.1 Problem Definition

Technically, there are thousands of statistical features, such as the mean, variance, higher-order statistics, and hypothesis testing statistics. We cannot introduce each of them in detail. Therefore, we focus on several widely adopted statistical features to show the advances of transfer learning.

Definition 5.1 (Feature Transformation for Transfer Learning) Given a labeled source domain $\mathcal{D}_s = \{(x_i, y_i)\}_{i=1}^{N_s}$ and an unlabeled target domain $\mathcal{D}_t = \{x_j\}_{j=1}^{N_t}$, their joint distributions are different, i.e., $P_s(x, y) \neq P_t(x, y)$. The core of feature transformation methods is to learn the feature transformation functional T to obtain the optimal predictive function f:

$$f^* = \arg\min_{f \in \mathcal{H}} \frac{1}{N_s} \sum_{i=1}^{N_s} \ell(f(x_i), y_i) + \lambda R(T(\mathcal{D}_s), T(\mathcal{D}_t)). \quad (5.1)$$

Based on the property of the distribution divergence measurement function, we now define two classes of feature transformation:

Definition 5.2 (Explicit Feature Transformation) If we adopt some pre-defined or existing distribution divergence measurement function $D(\cdot, \cdot)$ to measure the divergence between two distributions and then perform feature transformation, we say that the feature transformation is an explicit feature transformation:

$$f^* = \arg\min_{f \in \mathcal{H}} \frac{1}{N_s} \sum_{i=1}^{N_s} \ell(f(x_i), y_i) + \lambda D(\mathcal{D}_s, \mathcal{D}_t). \quad (5.2)$$

Popular measurements include the Euclidean distance, cosine similarity, KL divergence, and maximum mean discrepancy (MMD), which we introduce in the next section. In Appendix A, we provide some popular distance and similarity measurements, which can be used for measuring distribution divergence under certain conditions.

Definition 5.3 (Implicit Feature Transformation) If the distribution divergence measurement function is learned by the model instead of being pre-defined, we say that the feature transformation is an implicit feature transformation with a learnable metric function $\mathrm{Metric}(\cdot, \cdot)$:

$$f^* = \arg\min_{f \in \mathcal{H}} \frac{1}{N_s} \sum_{i=1}^{N_s} \ell(f(x_i), y_i) + \lambda\, \mathrm{Metric}(\mathcal{D}_s, \mathcal{D}_t). \quad (5.3)$$

Such implicit feature transformations include metric learning, geometrical feature alignment (see the next section), and adversarial learning (see Chap. 10). Note that although the implicit feature transformation methods seem more general than the explicit ones, there is no proof that the former is definitely better than the latter. This is also verified by our experiments in Sect. 10.5, where the results of adversarial transfer learning (a type of implicit feature transformation) are no better than those of MMD-based methods (explicit feature transformations). In real applications, we should still design appropriate methods for the task at hand.


5.2 Maximum Mean Discrepancy-Based Methods

Of all the statistical distance measurements, maximum mean discrepancy (MMD) (Borgwardt et al., 2006) may be one of the most popular choices. In this section, we introduce the details of this discrepancy and articulate how to use it for transfer learning.

5.2.1 The Basics of MMD

Maximum mean discrepancy was originally used for the two-sample test in statistics: for two probability distributions p and q, we can determine whether to accept the hypothesis p = q or not. MMD is an effective two-sample test approach. We use $\mathcal{H}_k$ to denote a reproducing kernel Hilbert space (RKHS) defined by a characteristic kernel k. In this space, the mean embedding of a probability distribution p is denoted $\mu_k(p)$. Then, $\mu_k(p)$ is the unique element of the space $\mathcal{H}_k$ such that, for any function $f \in \mathcal{H}_k$, we have

$$\mathbb{E}_{x \sim p} f(x) = \langle f, \mu_k(p) \rangle_{\mathcal{H}_k}. \quad (5.4)$$

We use $d_k(p, q)$ to denote the maximum mean discrepancy between two probability distributions p and q. Then, its square equals the distance between the two mean embeddings in the RKHS:

$$d_k^2(p, q) \triangleq \left\| \mathbb{E}_{x \sim p}[\phi(x)] - \mathbb{E}_{x \sim q}[\phi(x)] \right\|_{\mathcal{H}_k}^2, \quad (5.5)$$

where the mapping function $\phi(\cdot)$ defines a map from the original space to the RKHS and the kernel function is defined as the inner product of the mapping function:

$$k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle, \quad (5.6)$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product operation. If $d_k(p, q) = 0$, then $p = q$, and vice versa.

The kernel functions in MMD are the same as those we learned in support vector machines (SVMs) and other traditional machine learning algorithms (Zhou, 2016; Bishop, 2006):

• Linear kernel: $k(x_i, x_j) = \langle x_i, x_j \rangle$.
• Polynomial kernel: $k(x_i, x_j) = \langle x_i, x_j \rangle^d$, where d is the order of the polynomial.


• Gaussian (RBF) kernel: $k(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right)$, where $\sigma$ is the bandwidth of the kernel function.

Now let us go back to the definition of MMD: what does MMD do? In a nutshell, MMD first maps two distributions into another space and then computes the difference of their means. Hence, we can leverage the above equations to compute the divergence between two distributions.

The story of MMD is not over yet. Gretton et al. (2012) further extended MMD to the multiple-kernel version, i.e., multiple-kernel MMD (MK-MMD). We know that MMD requires the selection of a kernel function k. If we choose different kernel functions, then we compute different MMDs. However, which kernel function is the most suitable one that computes the optimal MMD distance? That remains a challenging problem. MK-MMD resolves this challenge by utilizing the optimal combination of multiple kernels to compute the value of MMD given a set of kernels. Specifically, the kernel in MK-MMD can be seen as a combination of several semi-definite kernels $\{k_u\}$:

$$\mathcal{K} \triangleq \left\{ k = \sum_{u=1}^{m} \beta_u k_u : \sum_{u=1}^{m} \beta_u = 1,\ \beta_u \geq 0,\ \forall u \right\}, \quad (5.7)$$

where $\beta_u$ is the weight of each kernel. MK-MMD is therefore widely adopted in today's transfer learning literature.
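To make Eq. (5.5) concrete, the following is a minimal sketch of the (biased) empirical estimate of squared MMD with an RBF kernel; the function names and the default bandwidth are our own choices for illustration.

```python
import numpy as np


def rbf_kernel(X, Y, sigma=1.0):
    """Gaussian kernel matrix: k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 * sigma^2))."""
    sq = (np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-sq / (2.0 * sigma**2))


def mmd2(X, Y, sigma=1.0):
    """Biased empirical estimate of d_k^2(p, q) in Eq. (5.5), with X ~ p and Y ~ q."""
    return (rbf_kernel(X, X, sigma).mean()
            + rbf_kernel(Y, Y, sigma).mean()
            - 2.0 * rbf_kernel(X, Y, sigma).mean())
```

For samples drawn from the same distribution the estimate stays close to zero, and it grows as the two distributions move apart.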

5.2.2 MMD-Based Transfer Learning

We now introduce MMD-based transfer learning methods. Recall the following equation:

$$f^* = \arg\min_{f \in \mathcal{H}} \frac{1}{N_s} \sum_{i=1}^{N_s} \ell(f(x_i), y_i) + \lambda R(T(\mathcal{D}_s), T(\mathcal{D}_t)). \quad (5.8)$$

How do we build the connection between MMD and the feature transformation functional T? Recall that in Sect. 3.2, the general form of distribution divergence is represented as

$$D(\mathcal{D}_s, \mathcal{D}_t) \approx (1 - \mu) D(P_s(x), P_t(x)) + \mu D(P_s(y|x), P_t(y|x)). \quad (5.9)$$


We can see that MMD can be directly used to compute the marginal distribution divergence $D(P_s(x), P_t(x))$, which corresponds to the classic transfer learning method transfer component analysis (TCA) (Pan et al., 2011). For notational brevity, we use a semi-definite matrix A to denote the feature transformation matrix computed using MMD. The matrix A then becomes our learning objective in MMD-based transfer learning methods. In this way, the MMD between two marginal distributions (i.e., the core idea of TCA (Pan et al., 2011)) is represented as

$$\mathrm{MMD}(P_s(x), P_t(x)) = \left\| \frac{1}{N_s} \sum_{i=1}^{N_s} A^T x_i - \frac{1}{N_t} \sum_{j=1}^{N_t} A^T x_j \right\|_{\mathcal{H}}^2. \quad (5.10)$$

Then, the problem is how to compute the conditional distribution $P_t(y|x)$ on the target domain without labels, since the joint, balanced, and dynamic distribution adaptation methods (recall Sect. 3.2) all require computing the conditional distributions. We can turn to the class-conditional probability $P_t(x|y)$: according to Bayes' theorem, $P_t(y|x) \propto P_t(y) P_t(x|y)$, so if we ignore $P_t(y)$, then we can use $P_t(x|y)$ to approximate $P_t(y|x)$. Is this reasonable? There is a concept called sufficient statistics, which suggests that when the sample sizes are large, we can choose some statistics to approximate our goal even if many factors are unknown.

However, it is still difficult to perform such an approximation since we do not have $y_t$. To solve this problem, we often adopt an iteration-based training strategy: first, we use $(x_s, y_s)$ to train a classifier (e.g., KNN or logistic regression) to obtain pseudo labels $\hat{y}_t$ for the unlabeled target domain. These pseudo labels can then be used for transfer learning; after feature transformation, they can be updated or refined in later iterations. The iteration-based training strategy thus makes such computation possible. Similar to Eq. (5.10), the MMD between two conditional distributions is represented as

$$\mathrm{MMD}(P_s(y|x), P_t(y|x)) = \sum_{c=1}^{C} \left\| \frac{1}{N_s^{(c)}} \sum_{x_i \in \mathcal{D}_s^{(c)}} A^T x_i - \frac{1}{N_t^{(c)}} \sum_{x_j \in \mathcal{D}_t^{(c)}} A^T x_j \right\|_{\mathcal{H}}^2, \quad (5.11)$$

where $N_s^{(c)}$ and $N_t^{(c)}$ denote the numbers of samples of the c-th class in the source and target domains, with $\mathcal{D}_s^{(c)}$ and $\mathcal{D}_t^{(c)}$ the corresponding sample sets, and C is the total number of classes.


However, it seems we cannot directly solve these equations! We will first show the final form and then take Eq. (5.10) as an example to show the detailed derivation. The final form of MMD-based transfer learning is derived as

$$\min_A\ \mathrm{tr}(A^T X M X^T A), \quad (5.12)$$

where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix, X is the matrix of combined source and target features, and M is the MMD matrix computed as

$$M = (1 - \mu) M_0 + \mu \sum_{c=1}^{C} M_c, \quad (5.13)$$

where the marginal and conditional MMD matrices are computed as

$$(M_0)_{ij} = \begin{cases} \dfrac{1}{N_s^2}, & x_i, x_j \in \mathcal{D}_s \\ \dfrac{1}{N_t^2}, & x_i, x_j \in \mathcal{D}_t \\ -\dfrac{1}{N_s N_t}, & \text{otherwise,} \end{cases} \quad (5.14)$$

$$(M_c)_{ij} = \begin{cases} \dfrac{1}{(N_s^{(c)})^2}, & x_i, x_j \in \mathcal{D}_s^{(c)} \\ \dfrac{1}{(N_t^{(c)})^2}, & x_i, x_j \in \mathcal{D}_t^{(c)} \\ -\dfrac{1}{N_s^{(c)} N_t^{(c)}}, & x_i \in \mathcal{D}_s^{(c)}, x_j \in \mathcal{D}_t^{(c)}\ \text{or}\ x_i \in \mathcal{D}_t^{(c)}, x_j \in \mathcal{D}_s^{(c)} \\ 0, & \text{otherwise.} \end{cases} \quad (5.15)$$

Specifically, when we set the balance factor .μ = 0.5, the above equation will become joint distribution adaptation (Long et al., 2013). But the more general form is dynamic distribution adaptation (Wang et al., 2017, 2018b, 2020).
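Constructing the MMD matrices of Eqs. (5.14) and (5.15) in code is mechanical. Below is a sketch (function names are ours); note that the class-wise matrices are built with the pseudo labels of the target domain, as discussed above.

```python
import numpy as np


def marginal_mmd_matrix(ns, nt):
    """M_0 from Eq. (5.14), built as an outer product e @ e.T."""
    e = np.vstack((np.ones((ns, 1)) / ns, -np.ones((nt, 1)) / nt))
    return e @ e.T


def conditional_mmd_matrix(ys, yt_pseudo, classes):
    """Sum of M_c over classes, from Eq. (5.15), using target pseudo labels."""
    ns, nt = len(ys), len(yt_pseudo)
    M = np.zeros((ns + nt, ns + nt))
    for c in classes:
        e = np.zeros((ns + nt, 1))
        idx_s = np.where(ys == c)[0]
        idx_t = np.where(yt_pseudo == c)[0] + ns  # target samples come after source
        if len(idx_s) == 0 or len(idx_t) == 0:
            continue  # skip classes missing in either domain
        e[idx_s] = 1.0 / len(idx_s)
        e[idx_t] = -1.0 / len(idx_t)
        M += e @ e.T
    return M
```

Writing $M_0 = e e^T$ with $e_i = 1/N_s$ for source samples and $-1/N_t$ for target samples reproduces exactly the three cases of Eq. (5.14): $1/N_s^2$, $1/N_t^2$, and $-1/(N_s N_t)$.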

Now, we explain the computing process of the marginal MMD:

$$\begin{aligned}
&\left\| \frac{1}{N_s} \sum_{i=1}^{N_s} A^T x_i - \frac{1}{N_t} \sum_{j=1}^{N_t} A^T x_j \right\|^2 \\
&\quad = \left\| \frac{1}{N_s} A^T X_s \mathbf{1}_{N_s} - \frac{1}{N_t} A^T X_t \mathbf{1}_{N_t} \right\|^2 \\
&\quad = \mathrm{tr}\left( \frac{1}{N_s^2} A^T X_s \mathbf{1} \mathbf{1}^T X_s^T A + \frac{1}{N_t^2} A^T X_t \mathbf{1} \mathbf{1}^T X_t^T A - \frac{1}{N_s N_t} A^T X_s \mathbf{1} \mathbf{1}^T X_t^T A - \frac{1}{N_s N_t} A^T X_t \mathbf{1} \mathbf{1}^T X_s^T A \right) \\
&\quad = \mathrm{tr}\left( A^T \begin{bmatrix} X_s & X_t \end{bmatrix} \begin{bmatrix} \frac{1}{N_s^2} \mathbf{1} \mathbf{1}^T & -\frac{1}{N_s N_t} \mathbf{1} \mathbf{1}^T \\ -\frac{1}{N_s N_t} \mathbf{1} \mathbf{1}^T & \frac{1}{N_t^2} \mathbf{1} \mathbf{1}^T \end{bmatrix} \begin{bmatrix} X_s^T \\ X_t^T \end{bmatrix} A \right) \\
&\quad = \mathrm{tr}\left( A^T X M X^T A \right),
\end{aligned}$$

where $X_s = [x_1, x_2, \ldots, x_{N_s}]$ and $X_t = [x_1, x_2, \ldots, x_{N_t}]$ are the source and target feature matrices and $\mathbf{1}_N$ denotes the all-ones column vector of length N.

The above derivation leverages two important matrix properties:

1. $\|A\|^2 = \mathrm{tr}(A A^T)$, used when turning the squared norm into a trace.
2. $\mathrm{tr}(AB) = \mathrm{tr}(BA)$, used when rearranging the factors inside the trace.

For other MMD-based approaches, we can perform a similar derivation.


5.2.3 Computation and Optimization

Now, all we need is to minimize Eq. (5.12). But is it that simple? Any optimization problem involves some constraints; otherwise, we could simply set all the elements of A to zero and the problem would be trivially solved. In fact, we should also preserve the scatter of the features after feature transformation. Given a set of samples $\{x_j\}$, we can compute its scatter (variance) matrix S as

$$\begin{aligned}
S &= \sum_{j=1}^{n} \left( x_j - \bar{x} \right) \left( x_j - \bar{x} \right)^T \\
&= \sum_{j=1}^{n} \left( x_j - \bar{x} \right) \otimes \left( x_j - \bar{x} \right) \\
&= \left( \sum_{j=1}^{n} x_j x_j^T \right) - n \bar{x} \bar{x}^T,
\end{aligned} \quad (5.16)$$

where $\bar{x} = \frac{1}{n} \sum_{j=1}^{n} x_j$ denotes the sample mean and $\otimes$ denotes the outer product. We use $H = I - (1/n) \mathbf{1}$ to denote the centering matrix, where $I \in \mathbb{R}^{(n+m) \times (n+m)}$ denotes the identity matrix and $\mathbf{1}$ the all-ones matrix. Then, the scatter matrix can be formulated as

$$S = X H X^T. \quad (5.17)$$

Putting A into the above equation, the variance maximization can be formulated as

$$\max_A\ \mathrm{tr}\left( (A^T X) H (A^T X)^T \right). \quad (5.18)$$

We combine Eqs. (5.12) and (5.18) to get the final form of MMD-based transfer learning:

$$\min_A\ \frac{\mathrm{tr}\left( A^T X M X^T A \right)}{\mathrm{tr}\left( A^T X H X^T A \right)}. \quad (5.19)$$

We need to minimize the numerator and maximize the denominator of Eq. (5.19) to solve for A, which is extremely hard. We note that the transformation matrix A is a Hermitian matrix, i.e., $A^H = A$, where $\cdot^H$ denotes the conjugate transpose. Therefore, Eq. (5.19) can be solved using the Rayleigh quotient (Parlett, 1974). Finally, we transform Eq. (5.19) into

$$\min_A\ \mathrm{tr}\left( A^T X M X^T A \right) + \lambda \|A\|_F^2, \quad \text{s.t.}\ A^T X H X^T A = I, \quad (5.20)$$


where the regularization term $\lambda \|A\|_F^2$ ensures that the problem is well-defined, and $\lambda > 0$ is its hyperparameter. We often utilize the Lagrange method for optimization:

$$L = \mathrm{tr}\left( A^T \left( X M X^T + \lambda I \right) A \right) + \mathrm{tr}\left( \left( I - A^T X H X^T A \right) \Phi \right). \quad (5.21)$$

Setting $\partial L / \partial A = 0$, we get

$$\left( X M X^T + \lambda I \right) A = X H X^T A \Phi, \quad (5.22)$$

where $\Phi$ is the Lagrange multiplier. Solving this generalized eigendecomposition problem in Python or MATLAB is easy (e.g., eigs in MATLAB or scipy.linalg.eig in Python); the columns of A are the eigenvectors corresponding to the smallest eigenvalues. Eventually, we can use multiple iterations to make the pseudo labels of the target domain more accurate and refine the final results. Similarly, by setting different values of $\mu$, we can get the final forms of the marginal, conditional, and joint distribution adaptation methods. We leave this to the readers.
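Putting the pieces together, here is a minimal primal (linear) sketch of marginal distribution adaptation in the style of TCA, solving the generalized eigenproblem of Eq. (5.22) with scipy.linalg.eig. The function name and default values are ours, and note that the original TCA operates on kernel matrices rather than raw features as done here.

```python
import numpy as np
import scipy.linalg


def tca_transform(Xs, Xt, dim=5, lamb=1.0):
    """Solve (X M X^T + lamb * I) A = X H X^T A Phi for the eigenvectors with
    the smallest eigenvalues, then project both domains.

    Xs: (ns, d) source features; Xt: (nt, d) target features.
    Returns the transformed source and target features of dimension `dim` (dim <= d).
    """
    X = np.vstack((Xs, Xt)).T              # d x (ns + nt)
    ns, nt = len(Xs), len(Xt)
    n = ns + nt
    # Marginal MMD matrix M_0 of Eq. (5.14) as an outer product
    e = np.vstack((np.ones((ns, 1)) / ns, -np.ones((nt, 1)) / nt))
    M = e @ e.T
    # Centering matrix H of Eq. (5.17)
    H = np.eye(n) - np.ones((n, n)) / n
    # Generalized eigenproblem of Eq. (5.22)
    a = X @ M @ X.T + lamb * np.eye(X.shape[0])
    b = X @ H @ X.T
    w, V = scipy.linalg.eig(a, b)
    idx = np.argsort(np.abs(w))            # smallest eigenvalues first
    A = np.real(V[:, idx[:dim]])
    Z = A.T @ X
    return Z[:, :ns].T, Z[:, ns:].T
```

The returned features can then be fed to a simple classifier such as KNN, optionally with the iterative pseudo-label refinement described above.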

5.2.4 Extensions of MMD-Based Transfer Learning

We review MMD-based transfer learning approaches here. The classic transfer component analysis (TCA) (Pan et al., 2011) proposed to compute the marginal distribution divergence using MMD. Then, Long et al. (2013) proposed joint distribution adaptation (JDA) to compute both the conditional and marginal distribution divergences. Later, Wang et al. (2017), Wang et al. (2018b), and Wang et al. (2020) proposed dynamic distribution adaptation (DDA) to dynamically compute the distribution discrepancy between two domains in a unified framework. At the same time, the work of Wang et al. (2018a) applied conditional distribution adaptation to cross-domain human activity recognition and obtained great performance.

Here are the complete steps to use MMD for transfer learning. Input two feature matrices. First, use a simple classifier (such as KNN) to compute the pseudo labels for the target domain. Then, compute M and H. Next, select some common kernel function to compute the kernel K. Then, solve Eq. (5.20) to obtain the transformed features of the two domains and take the first m components, which form our answer for A. We can further perform multiple iterations to make the pseudo labels more accurate.

MMD-based transfer learning methods have been heavily extended. ACA (Adapting Component Analysis) (Dorri and Ghodsi, 2012) added the Hilbert–Schmidt independence criterion to TCA. DTMKL (Domain Transfer Multiple-Kernel Learning) (Duan et al., 2012) added multiple-kernel MMD to TCA and proposed a new way of optimization. VDA (Tahmoresnezhad and Hashemi, 2016) added the intra-class distance. The work of Hsiao et al. (2016) added a structure preservation loss. The work of Hou et al. (2015) added data selection for the target domain.

Fig. 5.1 Manifold embedded distribution alignment (MEDA) (Wang et al., 2018b)

JGSA (Joint Geometrical and Statistical Alignment) (Zhang et al., 2017) added intra- and inter-class distances. BDA (Balanced Distribution Adaptation) (Wang et al., 2017) and MEDA (Manifold Embedded Distribution Alignment) (Wang et al., 2018b) proposed dynamic distribution adaptation and used the representer theorem (Schölkopf et al., 2001) to learn the optimal classifier. Among the traditional methods, MEDA (Wang et al., 2018b) is a more generic MMD-based method, shown in Fig. 5.1.

The MMD-based transfer learning approaches have also been heavily extended in deep learning. Deep domain confusion (DDC) (Tzeng et al., 2014) added MMD to the loss of deep networks. Deep adaptation network (DAN) (Long et al., 2015) added the MK-MMD loss to deep networks and extended single-layer adaptation to multi-layer adaptation. Central moment discrepancy (CMD) (Zellinger et al., 2017) extended MMD to higher-order moments. Yu et al. (2019) extended dynamic distribution adaptation into adversarial networks, showing that the imbalance between marginal and conditional distributions also exists in adversarial learning; the authors proposed a dynamic adversarial adaptation network (DAAN) to solve this problem. Later, DSAN (Deep Subdomain Adaptation Network) (Zhu et al., 2020) performed conditional distribution adaptation in deep learning, and DDAN (Deep Dynamic Adaptation Network) (Wang et al., 2020) presented deep learning-based dynamic adaptation and achieved better performance.

5.3 Metric Learning-Based Methods

In this section, we introduce how to learn the transformation $T$ using metric learning, which can be seen as the learnable distance $\mathrm{Metric}(\cdot, \cdot)$ in Eq. (5.3). Metric learning has long been an important research topic in machine learning. Measuring the distance between two samples seems simple, but it underlies classification, regression, clustering, and other important problems in machine learning. If we can find a good metric, then we can use it to construct better feature representations and finally build a better model.


5.3.1 Metric Learning

What is a metric? The Euclidean distance, Mahalanobis distance, cosine similarity, and MMD are all examples of metrics, which are pre-defined distances. However, in certain applications, these distances do not guarantee the best performance, and we need to turn to metric learning. The basic process of metric learning is to compute a better distance metric for a given set of samples such that the metric reflects the important properties of the dataset. The samples often come with some prior knowledge, such as which pairs of samples must be close and which must be farther apart. The learning algorithm can then build an objective function based on this prior knowledge to learn a better metric for these samples. From this point of view, metric learning can be seen as an optimization problem under certain constraints. Metric learning has wide applications in areas such as computer vision, text mining, and bioinformatics. We can say that without a proper metric, there will be no good models in machine learning.

The core of metric learning is the cluster assumption: data that belong to the same cluster are highly likely to belong to the same class. Therefore, metric learning focuses on learning pair-wise distances, which is different from the distribution distance in the previous section. In addition, metric learning treats the intra-class and inter-class distances as especially important. In order to evaluate the similarity between samples, metric learning borrows from linear discriminant analysis (LDA) (Izenman, 2013) to compute the intra- and inter-class distances. Its goal is to make the inter-class distance larger and the intra-class distance smaller. If we use $S_c^{(M)}$ and $S_b^{(M)}$ to denote the intra-class and inter-class distances, respectively, then they can be computed as

$$S_c^{(M)} = \frac{1}{N k_1} \sum_{i=1}^{N} \sum_{j=1}^{N} P_{ij} \, d^2\!\left(x_i, x_j\right), \qquad
S_b^{(M)} = \frac{1}{N k_2} \sum_{i=1}^{N} \sum_{j=1}^{N} Q_{ij} \, d^2\!\left(x_i, x_j\right), \tag{5.23}$$

where $P_{ij}$ and $Q_{ij}$ are the intra-class and inter-class indicator weights, respectively: $P_{ij} = 1$ when $x_i$ is among the $k_1$ neighbors of $x_j$ and 0 otherwise; similarly, $Q_{ij} = 1$ when $x_i$ is among the $k_2$ neighbors of $x_j$ and 0 otherwise. These quantities can be added to existing deep learning methods to perform deep metric learning. The computation of $d(\cdot, \cdot)$ will be introduced later.

Based on the above assumption, metric learning has greatly evolved in recent years, and several new metric losses have been proposed, such as the triplet loss, contrastive loss, N-pair loss, and InfoNCE loss. Interested readers can refer to the surveys on metric learning (Kulis et al., 2013; Yang, 2007) for more details.
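As an illustration of Eq. (5.23), the sketch below computes the two scatter values with NumPy. We additionally assume that $P_{ij}$ marks same-class $k_1$-neighbors and $Q_{ij}$ marks different-class $k_2$-neighbors, and we use the squared Euclidean distance for $d^2(\cdot, \cdot)$; this is an illustrative sketch, not the book's reference code.

```python
import numpy as np

def scatter_values(X, y, k1=3, k2=3):
    # Pairwise squared Euclidean distances d^2(x_i, x_j)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    N = len(X)
    order = np.argsort(d2, axis=1)  # each row: indices sorted by distance, self first
    S_c = S_b = 0.0
    for i in range(N):
        for j in order[i, 1:k1 + 1]:      # k1 nearest neighbors of x_i
            if y[i] == y[j]:              # P_ij = 1: same-class neighbor
                S_c += d2[i, j]
        for j in order[i, 1:k2 + 1]:      # k2 nearest neighbors of x_i
            if y[i] != y[j]:              # Q_ij = 1: different-class neighbor
                S_b += d2[i, j]
    return S_c / (N * k1), S_b / (N * k2)
```

A good metric should make the intra-class value small and the inter-class value large.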

92

5 Statistical Feature Transformation Methods

5.3.2 Metric Learning for Transfer Learning

Existing metric learning might fail in the face of different distributions. We can therefore combine the power of metric learning and transfer learning for better distance metric learning. On the other hand, existing transfer learning methods are mostly based on a pre-defined distance function such as MMD, which may not generalize to new applications. If we use metric learning to contribute to transfer learning, then we can learn better distance functions for better representation learning. In addition, adding metric learning to transfer learning is also consistent with Eq. (5.3). We integrate MMD into metric learning to get the following optimization objective (with trade-off hyperparameters $\alpha$ and $\beta$):

$$J = S_c^{(M)} - \alpha S_b^{(M)} + \beta D_{\mathrm{MMD}}(X_s, X_t). \tag{5.24}$$

How do we solve the above equation after introducing metric learning? The key is to use the Mahalanobis distance to formalize metric learning. Let $M \in \mathbb{R}^{d \times d}$ be a positive semi-definite matrix; then the Mahalanobis distance between samples $x_i$ and $x_j$ can be defined as

$$d_{ij} = \sqrt{\left(x_i - x_j\right)^T M \left(x_i - x_j\right)}. \tag{5.25}$$

Since $M$ is positive semi-definite, it can always be decomposed as $M = A^T A$, where $A \in \mathbb{R}^{d \times d}$. The Mahalanobis distance then becomes

$$d_{ij} = \sqrt{(x_i - x_j)^T M (x_i - x_j)} = \sqrt{(A x_i - A x_j)^T (A x_i - A x_j)}. \tag{5.26}$$

This indicates that seeking the Mahalanobis distance matrix $M$ is equivalent to finding a linear feature transformation $A$ between the source and target domains. Thus, we do not compute $M$ directly, but instead use a linear transformation (or kernel methods for a nonlinear transformation) to compute these values. The optimization process then follows the same routine as the previous methods.
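The equivalence between Eqs. (5.25) and (5.26) is easy to check numerically; below is a minimal sketch with a randomly chosen (not learned) transformation $A$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))   # a linear map (would be learned in practice)
M = A.T @ A                        # induced positive semi-definite Mahalanobis matrix

x_i, x_j = rng.standard_normal(d), rng.standard_normal(d)
diff = x_i - x_j

d_mahalanobis = np.sqrt(diff @ M @ diff)           # Eq. (5.25)
d_transformed = np.linalg.norm(A @ x_i - A @ x_j)  # Eq. (5.26)
```

The two values agree up to floating-point error, confirming that learning $M$ and learning $A$ are interchangeable.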

5.4 Practice

In this section, we implement the classic transfer learning method: transfer component analysis (TCA) (Pan et al., 2011). Other algorithms such as JDA and BDA can be implemented in the same manner. The complete code for this section can be found at this link: https://github.com/jindongwang/tlbook-code/tree/main/chap05_statistical. We mainly use Python as our programming language. The core is to implement the eigendecomposition of TCA using a Python package. We first define the kernel function as follows.

Kernel function

    import numpy as np
    import sklearn.metrics

    def kernel(ker, X1, X2, gamma):
        K = None
        if not ker or ker == 'primal':
            K = X1
        elif ker == 'linear':
            if X2 is not None:
                K = sklearn.metrics.pairwise.linear_kernel(np.asarray(X1).T, np.asarray(X2).T)
            else:
                K = sklearn.metrics.pairwise.linear_kernel(np.asarray(X1).T)
        elif ker == 'rbf':
            if X2 is not None:
                K = sklearn.metrics.pairwise.rbf_kernel(np.asarray(X1).T, np.asarray(X2).T, gamma)
            else:
                K = sklearn.metrics.pairwise.rbf_kernel(np.asarray(X1).T, None, gamma)
        return K

Then, we can implement TCA using the scipy.linalg.eig function. Similar to the KMM code in the last chapter, we package TCA as a class for easier use.

Transfer component analysis

    import numpy as np
    import scipy.io
    import scipy.linalg
    import sklearn.metrics
    from sklearn.neighbors import KNeighborsClassifier

    class TCA:
        def __init__(self, kernel_type='primal', dim=30, lamb=1, gamma=1):
            '''
            Init func
            :param kernel_type: kernel, values: 'primal' | 'linear' | 'rbf'
            :param dim: dimension after transfer
            :param lamb: lambda value in equation
            :param gamma: kernel bandwidth for rbf kernel
            '''
            self.kernel_type = kernel_type
            self.dim = dim
            self.lamb = lamb
            self.gamma = gamma

        def fit(self, Xs, Xt):
            '''
            Transform Xs and Xt
            :param Xs: ns * n_feature, source feature
            :param Xt: nt * n_feature, target feature
            :return: Xs_new and Xt_new after TCA
            '''
            X = np.hstack((Xs.T, Xt.T))
            X /= np.linalg.norm(X, axis=0)
            m, n = X.shape
            ns, nt = len(Xs), len(Xt)
            e = np.vstack((1 / ns * np.ones((ns, 1)), -1 / nt * np.ones((nt, 1))))
            M = e * e.T
            M = M / np.linalg.norm(M, 'fro')
            H = np.eye(n) - 1 / n * np.ones((n, n))
            K = kernel(self.kernel_type, X, None, gamma=self.gamma)
            n_eye = m if self.kernel_type == 'primal' else n
            a, b = K @ M @ K.T + self.lamb * np.eye(n_eye), K @ H @ K.T
            w, V = scipy.linalg.eig(a, b)
            ind = np.argsort(w)
            A = V[:, ind[:self.dim]]
            Z = A.T @ K
            Z /= np.linalg.norm(Z, axis=0)
            Xs_new, Xt_new = Z[:, :ns].T, Z[:, ns:].T
            return Xs_new, Xt_new

        def fit_predict(self, Xs, Ys, Xt, Yt):
            '''
            Transform Xs and Xt, then make predictions on target using 1NN
            :param Xs: ns * n_feature, source feature
            :param Ys: ns * 1, source label
            :param Xt: nt * n_feature, target feature
            :param Yt: nt * 1, target label
            :return: Accuracy and predicted_labels on the target domain
            '''
            Xs_new, Xt_new = self.fit(Xs, Xt)
            clf = KNeighborsClassifier(n_neighbors=1)
            clf.fit(Xs_new, Ys.ravel())
            y_pred = clf.predict(Xt_new)
            acc = sklearn.metrics.accuracy_score(Yt, y_pred)
            return acc, y_pred

Finally, we write the main function to load data and apply TCA to them for transfer learning:

Main function for TCA

    if __name__ == "__main__":
        folder = '../../office31_resnet50'
        src_domain = 'amazon'
        tar_domain = 'webcam'
        # load_csv is provided in the accompanying code repository
        Xs, Ys, Xt, Yt = load_csv(folder, src_domain, tar_domain)
        print('Source:', src_domain, Xs.shape, Ys.shape)
        print('Target:', tar_domain, Xt.shape, Yt.shape)
        tca = TCA(kernel_type='primal', dim=40, lamb=0.1, gamma=1)
        acc, ypre = tca.fit_predict(Xs, Ys, Xt, Yt)
        print(acc)

After running TCA, as shown in Fig. 5.2, the transfer learning accuracy on the amazon to webcam task is 76.10%, which is higher than the methods in the previous sections.

Fig. 5.2 TCA classification results

The complete code for other methods such as BDA, JDA, and MEDA can be found at this link: https://github.com/jindongwang/transferlearning/tree/master/code/traditional.

5.5 Summary

In this chapter, we introduced the basic methods for statistical feature transformation, which can be combined with the marginal, conditional, joint, and dynamic distribution adaptation methods. They can also be implemented with deep learning methods for end-to-end training and better performance.

References

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
Borgwardt, K. M., Gretton, A., Rasch, M. J., Kriegel, H.-P., Schölkopf, B., and Smola, A. J. (2006). Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57.
Dorri, F. and Ghodsi, A. (2012). Adapting component analysis. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pages 846–851. IEEE.
Duan, L., Tsang, I. W., and Xu, D. (2012). Domain transfer multiple kernel learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):465–479.
Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., and Sriperumbudur, B. K. (2012). Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems, pages 1205–1213.
Hou, C.-A., Yeh, Y.-R., and Wang, Y.-C. F. (2015). An unsupervised domain adaptation approach for cross-domain visual classification. In Advanced Video and Signal Based Surveillance (AVSS), 2015 12th IEEE International Conference on, pages 1–6. IEEE.
Hsiao, P.-H., Chang, F.-J., and Lin, Y.-Y. (2016). Learning discriminatively reconstructed source data for object recognition with few examples. IEEE Transactions on Image Processing, 25(8):3518–3532.
Izenman, A. J. (2013). Linear discriminant analysis. In Modern Multivariate Statistical Techniques, pages 237–280. Springer.
Kulis, B. et al. (2013). Metric learning: A survey. Foundations and Trends® in Machine Learning, 5(4):287–364.
Long, M., Wang, J., et al. (2013). Transfer feature learning with joint distribution adaptation. In ICCV, pages 2200–2207.
Long, M., Cao, Y., Wang, J., and Jordan, M. (2015). Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97–105.


Pan, S. J., Tsang, I. W., Kwok, J. T., and Yang, Q. (2011). Domain adaptation via transfer component analysis. IEEE TNN, 22(2):199–210.
Parlett, B. N. (1974). The Rayleigh quotient iteration and some generalizations for nonnormal matrices. Mathematics of Computation, 28(127):679–693.
Schölkopf, B., Herbrich, R., and Smola, A. J. (2001). A generalized representer theorem. In International Conference on Computational Learning Theory, pages 416–426. Springer.
Tahmoresnezhad, J. and Hashemi, S. (2016). Visual domain adaptation via transfer feature learning. Knowledge and Information Systems, pages 1–21.
Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., and Darrell, T. (2014). Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474.
Wang, J., Chen, Y., Feng, W., Yu, H., Huang, M., and Yang, Q. (2020). Transfer learning with dynamic distribution adaptation. ACM TIST, 11(1):1–25.
Wang, J., Chen, Y., Hao, S., et al. (2017). Balanced distribution adaptation for transfer learning. In ICDM, pages 1129–1134.
Wang, J., Chen, Y., Hu, L., Peng, X., and Yu, P. S. (2018a). Stratified transfer learning for cross-domain activity recognition. In 2018 IEEE International Conference on Pervasive Computing and Communications (PerCom).
Wang, J., Feng, W., Chen, Y., Yu, H., Huang, M., and Yu, P. S. (2018b). Visual domain adaptation with manifold embedded distribution alignment. In ACMMM, pages 402–410.
Yang, L. (2007). An overview of distance metric learning. In Proceedings of the Computer Vision and Pattern Recognition Conference.
Yu, C., Wang, J., Chen, Y., and Huang, M. (2019). Transfer learning with dynamic adversarial adaptation network. In The IEEE International Conference on Data Mining (ICDM).
Zellinger, W., Grubinger, T., Lughofer, E., Natschläger, T., and Saminger-Platz, S. (2017). Central moment discrepancy (CMD) for domain-invariant representation learning. arXiv preprint arXiv:1702.08811.
Zhang, J., Li, W., and Ogunbona, P. (2017). Joint geometrical and statistical alignment for visual domain adaptation. In CVPR.
Zhou, Z.-H. (2016). Machine learning. Tsinghua University Press.
Zhu, Y., Zhuang, F., Wang, J., Ke, G., Chen, J., Bian, J., Xiong, H., and He, Q. (2020). Deep subdomain adaptation network for image classification. IEEE Transactions on Neural Networks and Learning Systems.

Chapter 6

Geometrical Feature Transformation Methods

In this chapter, we introduce geometrical feature transformation methods for transfer learning, which differ from the statistical feature transformation methods in the last chapter. Geometrical features can exploit the underlying geometrical structure to obtain clean and effective representations with remarkable performance. As with statistical features, there are many kinds of geometrical features. We mainly introduce three types of geometrical feature transformation methods: subspace learning, manifold learning, and optimal transport. These methods differ in methodology, and they are all important in transfer learning.

The organization of this chapter is as follows. We present subspace learning in Sect. 6.1. After that, we introduce manifold learning methods in Sect. 6.2. Section 6.3 mainly describes optimal transport methods. The code and practice can be found in Sect. 6.4. Finally, Sect. 6.5 concludes this chapter.

6.1 Subspace Learning Methods

Geometrical feature transformation methods belong to the implicit feature transformation approaches: although we do not explicitly measure the distribution divergence, it can still be reduced by applying a geometrical feature transformation. Subspace learning often assumes that the source and target data have similar distributions in a subspace after feature transformation. In that subspace, we can perform distribution alignment and use traditional machine learning methods to build models. The concept of alignment carries intuitive geometrical meaning: if the data from two domains are aligned, we consider their distribution divergence to be minimized. Thus, subspace learning can be used for distribution alignment.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 J. Wang, Y. Chen, Introduction to Transfer Learning, Machine Learning: Foundations, Methodologies, and Applications, https://doi.org/10.1007/978-981-19-7584-4_6


6.1.1 Subspace Alignment

Subspace Alignment (SA) (Fernando et al., 2013) is a classic subspace learning approach. The goal of SA is to find a linear feature transformation $M$ with which domain alignment can be performed. Let $X_s$ and $X_t$ denote the $d$-dimensional feature matrices (called subspaces) obtained by PCA transformation of the source and target features $S$ and $T$, respectively. Then, the objective of SA can be formulated as

$$F(M) = \|X_s M - X_t\|_F^2. \tag{6.1}$$

The matrix $M$ can be directly computed as

$$M^* = \arg\min_M F(M). \tag{6.2}$$

Due to the orthogonality of subspace learning, i.e., $X_s^T X_s = I$, we can directly obtain the closed-form solution for the above problem:

$$F(M) = \|X_s^T X_s M - X_s^T X_t\|_F^2 = \|M - X_s^T X_t\|_F^2. \tag{6.3}$$

Therefore, the optimal feature transformation matrix is $M^* = X_s^T X_t$. This indicates that when the source and target domains are the same (i.e., $X_s = X_t$), $M^*$ is the identity matrix. We call $X_a = X_s X_s^T X_t$ the target aligned source coordinate system, which transforms the source domain into a new subspace by

$$S_a = S X_a. \tag{6.4}$$

Similarly, the target domain can be transformed as $T_t = T X_t$. Finally, we use $S_a$ and $T_t$ instead of the original features $S$ and $T$ to build machine learning models. This makes SA quite simple to implement in practice.

Based on SA, Sun and Saenko (2015) proposed Subspace Distribution Alignment (SDA), which adds probability distribution adaptation to subspace learning. Concretely speaking, SDA argues that a distribution transformation matrix $A$ should be added in addition to the subspace alignment matrix $G$. The transformation of SDA is formulated as

$$M = X_s G A X_t^T. \tag{6.5}$$

Then, we can obtain the transformed features for the two domains and build models following similar steps as in SA.
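Putting Eqs. (6.1) to (6.4) together, SA can be implemented in a few lines. The following is a minimal sketch (the function name and the PCA-via-SVD choice are ours, not the book's reference code):

```python
import numpy as np

def subspace_alignment(S, T, d=20):
    # PCA subspaces: top-d principal directions of the centered data
    Xs = np.linalg.svd(S - S.mean(0), full_matrices=False)[2][:d].T  # (D, d)
    Xt = np.linalg.svd(T - T.mean(0), full_matrices=False)[2][:d].T  # (D, d)
    M = Xs.T @ Xt        # closed-form solution M* = Xs^T Xt, Eq. (6.3)
    Sa = S @ (Xs @ M)    # source mapped by the aligned basis Xa = Xs M, Eq. (6.4)
    Tt = T @ Xt          # target mapped by its own subspace
    return Sa, Tt
```

When the two domains are identical, $M^*$ reduces to the identity and both domains land on the same coordinates.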


6.1.2 Correlation Alignment

Different from SA and SDA, which only perform first-order alignment, Sun et al. (2016) proposed CORrelation ALignment (CORAL) to conduct second-order alignment. Assume $C_s$ and $C_t$ are the covariance matrices of the source and target domains; then, CORAL learns a second-order feature transformation matrix $A$ to reduce their distribution divergence:

$$\min_A \|A^T C_s A - C_t\|_F^2. \tag{6.6}$$

In this way, the original source and target features can be transformed via

$$z_r = \begin{cases} x_r \cdot (C_s + E_s)^{-\frac{1}{2}} \cdot (C_t + E_t)^{\frac{1}{2}} & \text{if } r = s, \\ x_r & \text{if } r = t, \end{cases} \tag{6.7}$$

where $E_s$ and $E_t$ are identity matrices with sizes equal to those of the source and target domains, respectively. We can treat this step as a re-coloring process of each subspace (Sun et al., 2016), where Eq. (6.7) aligns the two distributions by re-coloring the whitened source features with the covariance of the target distribution.

CORAL was later extended to deep neural networks, which is called DCORAL (Deep CORAL) (Sun and Saenko, 2016). In DCORAL, CORAL is used to construct an adaptation loss in the network that can replace the existing MMD loss. The CORAL loss in deep learning is defined as

$$\ell_{\mathrm{CORAL}} = \frac{1}{4d^2} \|C_s - C_t\|_F^2, \tag{6.8}$$

where $d$ is the number of feature dimensions. The computation of CORAL is very easy, and it does not need any hyperparameter tuning. CORAL has also achieved great performance in both domain adaptation and domain generalization (Wang et al., 2021). In the practice section of this chapter, we will show that CORAL achieves great performance while being simpler than other methods.
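Classic CORAL admits a closed-form implementation that follows Eq. (6.7) directly. Below is a minimal sketch (the helper names are ours; `eps = 1` corresponds to the identity matrices $E_s$ and $E_t$):

```python
import numpy as np

def coral(Xs, Xt, eps=1.0):
    # Covariances regularized by eps * I, playing the role of E_s and E_t
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])

    def mat_pow(C, p):
        # Matrix power of a symmetric positive-definite matrix via eigendecomposition
        w, V = np.linalg.eigh(C)
        return (V * w ** p) @ V.T

    # Whiten the source features, then re-color them with the target covariance
    return Xs @ mat_pow(Cs, -0.5) @ mat_pow(Ct, 0.5)
```

After the transformation, the source covariance approximately matches the target covariance, which is exactly the second-order alignment of Eq. (6.6).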

6.2 Manifold Learning Methods

6.2.1 Manifold Learning

Since its first proposal in the Science journal in 2000 (Seung and Lee, 2000), manifold learning has become a hot research topic in machine learning and data mining. Manifold learning generally assumes that the current data are sampled from a high-dimensional space in which they have a low-dimensional manifold structure. A manifold is a type of geometrical object (i.e., an object that is observable). Generally speaking, we cannot directly observe hidden structures in the raw data, but we can imagine that the data lie in a high-dimensional space where they have a certain observable shape. A good example is the shape of constellations in the sky: to describe all the stars, we imagine that they form certain shapes in the sky, which gives us the popular constellations such as Lyra and Orion. Classic manifold learning methods include Isomap, locally linear embedding, and Laplacian eigenmaps (Zhou, 2016; Bishop, 2006). The core of manifold learning is to exploit the geometrical structure to simplify the problem.

Distance measurement is also important in manifold learning, since we can exploit the geometrical structure to acquire a better distance. So, what is the shortest path between two points in manifold learning? In a two-dimensional space, the shortest path between two points is a line segment. But what about three-, four-, or n-dimensional spaces (n > 4)? In fact, on the surface of the Earth, the shortest path between two points is not a straight line but a curve, which is called the geodesic. Generally speaking, the geodesic distance is the shortest path between any two points in a space. For instance, Fig. 6.1 shows that the shortest path between two points A and B on a ball is a curve in the three-dimensional space.

The Euclidean space that we are familiar with is also a manifold structure. In fact, the Whitney embedding theorem (Greene and Jacobowitz, 1971) shows that any manifold can be embedded into a high-dimensional Euclidean space, which makes computation through manifolds possible. Manifold learning methods often adopt the manifold assumption (Belkin et al., 2006): data in the manifold embedding space often share similar geometrical properties with their neighbors.

Fig. 6.1 Geodesic distance between two points


6.2.2 Manifold Learning for Transfer Learning

Since data in a manifold space often have good geometrical properties that overcome feature distortion, manifold learning can be adopted in transfer learning. Among the many existing manifolds, the Grassmann manifold $\mathcal{G}(d)$ treats $d$-dimensional subspaces as its basic elements, thus facilitating classifier learning. On the Grassmann manifold, feature transformation and distribution adaptation often have effective numerical forms that can be easily solved (Hamm and Lee, 2008). Thus, we can use the Grassmann manifold for transfer learning (Gopalan et al., 2011; Baktashmotlagh et al., 2014).

Manifold learning-based transfer learning takes inspiration from incremental learning: if a human wants to walk from the current point to another point, he needs to take steps one by one. If we treat the source and target domains as two points in a high-dimensional space, then we can mimic this walking process to perform feature transformation step by step until we reach the target domain. Figure 6.2 briefly shows such a process: the source domain goes through the feature transformation $\Phi(\cdot)$ from the starting point to the ending point to complete manifold learning.

Early methods formulate this problem as sampling d points in the manifold space and then constructing a geodesic flow; we only need to find the transformation for each point. This kind of approach is called sampling geodesic flow (SGF) (Gopalan et al., 2011), which is the first work that proposed such an idea. However, SGF has several limitations: how many intermediate points should we find, and how can the computation be done efficiently? Later, Gong et al. (2012) proposed the geodesic flow kernel (GFK) to solve this problem. In a nutshell, to handle the intermediate points, GFK adopts a kernel method that exploits the infinitely many points on the geodesic path. We use $P_s$ and $P_t$ to denote the subspaces after PCA transformation.
Then, $\mathcal{G}$ can be seen as the collection of all $d$-dimensional subspaces, and every $d$-dimensional subspace can be seen as a point on $\mathcal{G}$. The geodesic between two points $\{\Phi(t) : 0 \le t \le 1\}$ is the shortest path between them. If we let $P_s = \Phi(0)$ and $P_t = \Phi(1)$, then seeking a geodesic path

Fig. 6.2 Illustration for manifold transfer learning


from $\Phi(0)$ to $\Phi(1)$ is equivalent to transforming the original features into an infinite-dimensional space, which eventually reduces the domain shift. Specifically, features in the manifold space can be represented as $z = \Phi(t)^T x$. The transformed features $z_i$ and $z_j$ define a positive semi-definite geodesic flow kernel:

$$\langle z_i, z_j \rangle = \int_0^1 \left(\Phi(t)^T x_i\right)^T \left(\Phi(t)^T x_j\right) dt = x_i^T G x_j. \tag{6.9}$$

The geodesic flow is computed as

$$\Phi(t) = P_s U_1 \Gamma(t) - R_s U_2 \Sigma(t)
= \begin{bmatrix} P_s & R_s \end{bmatrix}
\begin{bmatrix} U_1 & 0 \\ 0 & U_2 \end{bmatrix}
\begin{bmatrix} \Gamma(t) \\ -\Sigma(t) \end{bmatrix}, \tag{6.10}$$

where $R_s \in \mathbb{R}^{D \times (D-d)}$ is the orthogonal complement of $P_s$, $\Gamma(t)$ and $\Sigma(t)$ are diagonal matrices with elements $\cos(t\theta_i)$ and $\sin(t\theta_i)$ for the principal angles $\theta_i$, and $U_1 \in \mathbb{R}^{d \times d}$ and $U_2 \in \mathbb{R}^{(D-d) \times d}$ are two orthogonal matrices given by the singular value decompositions

$$P_s^T P_t = U_1 \Gamma V^T, \qquad R_s^T P_t = -U_2 \Sigma V^T. \tag{6.11}$$

The kernel $G$ can then be computed as

$$G = \begin{bmatrix} P_s U_1 & R_s U_2 \end{bmatrix}
\begin{bmatrix} \Lambda_1 & \Lambda_2 \\ \Lambda_2 & \Lambda_3 \end{bmatrix}
\begin{bmatrix} U_1^T P_s^T \\ U_2^T R_s^T \end{bmatrix}, \tag{6.12}$$

where $\Lambda_1$, $\Lambda_2$, $\Lambda_3$ are three diagonal matrices whose elements are computed from the angles $\theta_i$ given by the SVD:

$$\begin{aligned}
\lambda_{1i} &= \int_0^1 \cos^2(t\theta_i)\, dt = 1 + \frac{\sin(2\theta_i)}{2\theta_i},\\
\lambda_{2i} &= -\int_0^1 \cos(t\theta_i)\sin(t\theta_i)\, dt = \frac{\cos(2\theta_i) - 1}{2\theta_i},\\
\lambda_{3i} &= \int_0^1 \sin^2(t\theta_i)\, dt = 1 - \frac{\sin(2\theta_i)}{2\theta_i}.
\end{aligned} \tag{6.13}$$

Then, the features in the original space can be mapped into the Grassmann manifold by $z = \sqrt{G}\, x$. The kernel $G$ can be easily computed using SVD. The implementation of GFK is quite easy, and it can serve as a feature processing step for many methods. For instance, in manifold embedded distribution alignment (MEDA) (Wang et al., 2018), the authors proposed to use GFK for feature extraction before applying distribution alignment. As shown in Fig. 6.3, integrating GFK increases both the accuracy and the robustness of the algorithm.

In later research, Qin et al. (2019) proposed the temporally adaptive GFK, which extends the original GFK by adding a time variable to handle the cross-domain activity recognition problem. They argued that the "walking" process of GFK is a Markov process: the current transformation depends only on the previous time step. As GFK gradually transforms the source domain into the subspace of the target domain, the later transformations have greater influence. For time $t_1 < t_2$,


Fig. 6.3 Accuracy and variance of MEDA by adding manifold learning methods (Wang et al., 2018)

the transformation at $t_2$ should have a larger influence on the final result than that at $t_1$. The temporally adaptive GFK adds the time variable to Eq. (6.13):

$$\begin{aligned}
\lambda_{1i} &= \int_0^1 t\cos^2(t\theta_i)\, dt = \frac{1}{4} - \frac{1}{4\theta_i^2}\sin^2\theta_i + \frac{1}{4\theta_i}\sin 2\theta_i,\\
\lambda_{2i} &= -\int_0^1 t\cos(t\theta_i)\sin(t\theta_i)\, dt = \frac{\cos(2\theta_i)}{4\theta_i} - \frac{\sin(2\theta_i)}{8\theta_i^2},\\
\lambda_{3i} &= \int_0^1 t\sin^2(t\theta_i)\, dt = \frac{1}{4} + \frac{1}{4\theta_i^2}\sin^2\theta_i - \frac{1}{4\theta_i}\sin 2\theta_i.
\end{aligned} \tag{6.14}$$

The temporally adaptive GFK achieved even better performance than the original GFK. In addition, there are other manifold transfer learning methods. For instance, Baktashmotlagh et al. (2014) proposed to use the Hellinger distance on the Riemannian manifold to compute the source-to-target domain transformation, and Guerrero et al. (2014) proposed a joint manifold adaptation method.
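To make the construction above concrete, the following is a minimal sketch of computing the GFK kernel $G$ from Eqs. (6.11) to (6.13). Note one implementation shortcut: instead of a full generalized SVD, $U_2$ is recovered from the shared right singular vectors $V$; this shortcut and the function name are our own choices, not the book's reference code.

```python
import numpy as np
import scipy.linalg

def gfk_kernel(Ps, Pt):
    # Ps, Pt: D x d orthonormal subspace bases (e.g., from PCA)
    D, d = Ps.shape
    Rs = scipy.linalg.null_space(Ps.T)            # D x (D-d), complement of Ps
    # Eq. (6.11): P_s^T P_t = U1 Gamma V^T and R_s^T P_t = -U2 Sigma V^T
    U1, gamma, Vt = np.linalg.svd(Ps.T @ Pt)
    V = Vt.T
    theta = np.arccos(np.clip(gamma, 0.0, 1.0))   # principal angles
    sigma = np.sqrt(np.clip(1.0 - gamma ** 2, 1e-12, None))
    U2 = -Rs.T @ Pt @ V / sigma                   # reuse V instead of a full GSVD
    # Eq. (6.13): closed-form terms along the geodesic
    lam1 = 1 + np.sin(2 * theta) / (2 * theta)
    lam2 = (np.cos(2 * theta) - 1) / (2 * theta)
    lam3 = 1 - np.sin(2 * theta) / (2 * theta)
    # Eq. (6.12): assemble G from the block structure
    B = np.hstack((Ps @ U1, Rs @ U2))             # D x 2d
    Lam = np.block([[np.diag(lam1), np.diag(lam2)],
                    [np.diag(lam2), np.diag(lam3)]])
    return B @ Lam @ B.T
```

The resulting $G$ is a symmetric positive semi-definite $D \times D$ kernel, as required by Eq. (6.9).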

6.3 Optimal Transport Methods

In this section, we introduce optimal transport-based transfer learning methods, which present another perspective on geometrical feature transformation. Optimal transport is a classic research area that has been studied for a long time. It has beautiful theoretical foundations, providing unique research value for many applications in mathematics, computer science, and economics.


6.3.1 Optimal Transport

Optimal transport (OT) (Villani, 2008) was originally proposed by Gaspard Monge, a French mathematician of the eighteenth century. Around World War II, Kantorovich, a Russian mathematician and economist, paid great attention to it and laid the foundations of linear programming; in 1975, Kantorovich was awarded the Nobel Memorial Prize in Economic Sciences for his contribution to optimal resource allocation. We often call the classic OT problem the Monge problem.

Optimal transport has a solid practical background, which we illustrate with an example. Jack and Rose grew up together, and their families make a living by running warehouses. One day, Rose's house is on fire, and she is desperately in need of emergency kits. Now, Jack must step up to help her! Assume that Jack has $N$ different warehouses at locations $\{x_i\}_{i=1}^N$, where the $i$-th warehouse holds $G_i$ kits. Similarly, Rose has $M$ different warehouses at locations $\{y_j\}_{j=1}^M$, and her $j$-th warehouse needs $H_j$ kits. We use $\{c(x_i, y_j)\}_{i,j=1}^{N,M}$ to denote the distance-based cost of shipping from Jack's warehouse $i$ to Rose's warehouse $j$; the transportation cost increases with distance. Then, our problem is: how can Jack help Rose with minimum cost?

We use a matrix $T \in \mathbb{R}^{N \times M}$ to denote the transportation plan, where each element $T_{ij}$ indicates the number of kits shipped from Jack's warehouse $i$ to Rose's warehouse $j$. The problem can be formulated as

$$\min \sum_{i,j=1}^{N,M} T_{ij}\, c\!\left(x_i, y_j\right) \quad \text{s.t.} \quad \sum_j T_{ij} = G_i, \quad \sum_i T_{ij} = H_j. \tag{6.15}$$
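Eq. (6.15) is a linear program and can be solved directly. Below is a minimal sketch with made-up supplies, demands, and costs, using scipy.optimize.linprog:

```python
import numpy as np
from scipy.optimize import linprog

# Jack has N = 2 warehouses with supplies G; Rose needs H at M = 3 warehouses.
G = np.array([8.0, 5.0])            # supplies, sum = 13
H = np.array([4.0, 6.0, 3.0])       # demands,  sum = 13
C = np.array([[1.0, 2.0, 3.0],      # C[i, j] = transport cost per kit
              [4.0, 1.0, 2.0]])

N, M = C.shape
# Flatten T into a vector; equality constraints encode the row and column sums
A_eq, b_eq = [], []
for i in range(N):                  # sum_j T[i, j] = G[i]
    row = np.zeros(N * M)
    row[i * M:(i + 1) * M] = 1
    A_eq.append(row)
    b_eq.append(G[i])
for j in range(M):                  # sum_i T[i, j] = H[j]
    col = np.zeros(N * M)
    col[j::M] = 1
    A_eq.append(col)
    b_eq.append(H[j])

res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=(0, None))
T = res.x.reshape(N, M)             # optimal transport plan
```

The optimal value `res.fun` is the minimum transport cost, and `res.x` reshaped to $N \times M$ recovers the plan $T$.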

This is an application of optimal transport. We can view the warehouses and kits as probability distributions and random variables, respectively. Then, the general formulation of optimal transport is to determine the minimum cost of transforming a distribution $P(x)$ into $Q(y)$:

$$L = \arg\min_\pi \int_x \int_y \pi(x, y)\, c(x, y)\, dx\, dy, \tag{6.16}$$

with the following constraints ($\pi(x, y)$ is their joint distribution):

$$\int_y \pi(x, y)\, dy = P(x), \qquad \int_x \pi(x, y)\, dx = Q(y). \tag{6.17}$$

The above equations indicate that optimal transport is essentially about connecting probability distributions, which is clearly related to transfer learning.

6.3.2 Optimal Transport for Transfer Learning

In order to use optimal transport for transfer learning, we modify Eq. (6.16) to obtain the distribution divergence defined by optimal transport:

$$D(P, Q) = \inf_{\pi} \int_{\mathcal{X} \times \mathcal{Y}} \pi(x, y)\, c(x, y)\, dx\, dy. \tag{6.18}$$

We often use the squared $L_2$ distance for computation, i.e.,

$$c(x, y) = \|x - y\|_2^2. \tag{6.19}$$

By adopting the squared $L_2$ distance, Eq. (6.18) becomes the second-order (squared 2-) Wasserstein distance:

$$W_2^2(P, Q) = \inf_{\pi} \int_{\mathcal{X} \times \mathcal{Y}} \pi(x, y)\, \|x - y\|_2^2\, dx\, dy. \tag{6.20}$$

Different from traditional feature transformation methods, optimal transport studies the coupling matrix $T$ of the points. After the mapping induced by $T$, the source distribution can be transported to the target domain with minimum cost. For a data distribution $\mu$, applying the barycentric mapping with coupling matrix $T$ yields the transported distribution, whose new feature vectors are

$$\hat{x}_i = \arg\min_{x \in \mathbb{R}^d} \sum_{j} T(i, j)\, c(x, x_j). \tag{6.21}$$

Then, how do we determine the coupling matrix $T$? This is related to the cost. In optimal transport, we use the transformation cost, denoted $C(T)$, to evaluate a transport map. The cost of $T$ with respect to a probability measure $\mu$ is defined as

$$C(T) = \int_{\Omega_s} c(x, T(x))\, d\mu(x), \tag{6.22}$$

where $c(x, T(x))$ is the cost function, which can also be understood as a distance function. We can use the following coupling to transform the source distribution into the target distribution:

$$\gamma_0 = \arg\min_{\gamma \in \Pi(\mu_s, \mu_t)} \int_{\Omega_s \times \Omega_t} c\left(x^s, x^t\right)\, d\gamma\left(x^s, x^t\right). \tag{6.23}$$
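On finite samples, the coupling of Eq. (6.23) becomes a matrix and is commonly computed with an entropy-regularized (Sinkhorn) solver. The following plain-numpy sketch computes a coupling between toy Gaussian domains and then applies the barycentric mapping of Eq. (6.21); the toy data, regularization strength, and iteration count are illustrative assumptions, and production code would use a library such as POT:

```python
import numpy as np

def sinkhorn(a, b, C, reg=0.2, iters=2000):
    """Entropy-regularized OT: returns a coupling T whose row sums are a
    and whose column sums converge to b."""
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(30, 2))               # source samples
Xt = rng.normal(3.0, 1.0, size=(40, 2))               # target samples (shifted)
a = np.full(30, 1 / 30)                                # uniform sample weights
b = np.full(40, 1 / 40)
C = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)   # squared L2 cost, Eq. (6.19)
T = sinkhorn(a, b, C / C.max())                        # normalize cost for stability
# Barycentric mapping (Eq. (6.21)): for the squared L2 cost, the minimizer
# is the T-weighted average of the target points
Xs_mapped = (T @ Xt) / T.sum(axis=1, keepdims=True)
```

After this mapping, a classifier trained on `Xs_mapped` with the source labels can be applied directly to the target samples.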


For distribution adaptation, we need to perform marginal, conditional, and dynamic adaptation. Courty et al. (2016) and Courty et al. (2014) proposed to use optimal transport to learn a feature transformation $T$ that reduces the marginal distribution distance. The authors then proposed joint distribution optimal transport (JDOT) (Courty et al., 2017) to add conditional distribution adaptation. The core of JDOT is formulated as

$$\gamma_0 = \arg\min_{\gamma \in \Pi(P_s, P_t)} \int_{(\Omega \times \mathcal{C})^2} \mathcal{D}\left(\boldsymbol{x}_1, y_1; \boldsymbol{x}_2, y_2\right)\, d\gamma\left(\boldsymbol{x}_1, y_1; \boldsymbol{x}_2, y_2\right), \tag{6.24}$$

whose cost function is formulated as the weighted sum of marginal distribution divergence and conditional distribution divergence:

$$\mathcal{D} = \alpha\, d\left(\boldsymbol{x}_i^s, \boldsymbol{x}_j^t\right) + L\left(y_i^s, f\left(\boldsymbol{x}_j^t\right)\right), \tag{6.25}$$

which is quite similar to the dynamic distribution adaptation (Eq. (3.6)) in Sect. 3.1. Optimal transport problems can be solved using popular tools such as POT (PythonOT).¹ Optimal transport can also be applied to deep learning, e.g., Xu et al. (2020b), Xu et al. (2020a), Bhushan Damodaran et al. (2018), and Lee et al. (2019). Recently, Lu et al. (2021) proposed to apply optimal transport-based domain adaptation to cross-domain human activity recognition and achieved great performance. Their approach is called substructural optimal transport (SOT). They argued that domain- and class-level optimal transport are so coarse that they may result in under-adaptation, while sample-level matching may be seriously affected by noise and eventually cause over-adaptation. SOT exploits the locality information of activities by obtaining the substructures of activities via clustering and seeks the coupling of the weighted substructures between different domains. Thus, it can be seen as a more fine-grained optimal transport and achieves better results than traditional domain-level optimal transport.

6.4 Practice

In this section, we implement CORrelation ALignment (CORAL) (Sun et al., 2016) to perform geometrical feature transformation. The complete code can be found at: https://github.com/jindongwang/tlbook-code/tree/main/chap06_geometrical. The implementation of CORAL is simple since we only need to compute and align the covariance matrices. We write a function fit that takes the source and target features

1 https://pythonot.github.io/.


Xs and Xt and then returns the transformed source domain features after CORAL. The code is as follows.

CORAL method:

```python
import numpy as np
import scipy.linalg

def fit(self, Xs, Xt):
    '''
    Perform CORAL on the source domain features
    :param Xs: ns * n_feature, source feature
    :param Xt: nt * n_feature, target feature
    :return: New source domain features
    '''
    cov_src = np.cov(Xs.T) + np.eye(Xs.shape[1])
    cov_tar = np.cov(Xt.T) + np.eye(Xt.shape[1])
    A_coral = np.dot(
        scipy.linalg.fractional_matrix_power(cov_src, -0.5),
        scipy.linalg.fractional_matrix_power(cov_tar, 0.5))
    Xs_new = np.real(np.dot(Xs, A_coral))
    return Xs_new
```

After CORAL, we use scikit-learn to build a KNN classifier to compute the accuracy on the target domain. This can be implemented using the following function:

```python
import sklearn.metrics
import sklearn.neighbors

def fit_predict(self, Xs, Ys, Xt, Yt):
    '''
    Perform CORAL, then predict using 1NN classifier
    :param Xs: ns * n_feature, source feature
    :param Ys: ns * 1, source label
    :param Xt: nt * n_feature, target feature
    :param Yt: nt * 1, target label
    :return: Accuracy and predicted labels of target domain
    '''
    Xs_new = self.fit(Xs, Xt)
    clf = sklearn.neighbors.KNeighborsClassifier(n_neighbors=1)
    clf.fit(Xs_new, Ys.ravel())
    y_pred = clf.predict(Xt)
    acc = sklearn.metrics.accuracy_score(Yt, y_pred)
    return acc, y_pred
```

We package CORAL as a class and apply it to the same task as TCA in the last chapter. After CORAL, as shown in Fig. 6.4, the accuracy from amazon to webcam is 76.35%, which is better than TCA.

Fig. 6.4 CORAL classification results
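For illustration, the two methods above can be assembled into a small class and exercised on synthetic domains. The class assembly and toy data below are illustrative assumptions, not the book's Office dataset experiment:

```python
import numpy as np
import scipy.linalg
import sklearn.metrics
import sklearn.neighbors

class CORAL:
    def fit(self, Xs, Xt):
        # Align second-order statistics of the source to the target
        cov_src = np.cov(Xs.T) + np.eye(Xs.shape[1])
        cov_tar = np.cov(Xt.T) + np.eye(Xt.shape[1])
        A_coral = np.dot(
            scipy.linalg.fractional_matrix_power(cov_src, -0.5),
            scipy.linalg.fractional_matrix_power(cov_tar, 0.5))
        return np.real(np.dot(Xs, A_coral))

    def fit_predict(self, Xs, Ys, Xt, Yt):
        Xs_new = self.fit(Xs, Xt)
        clf = sklearn.neighbors.KNeighborsClassifier(n_neighbors=1)
        clf.fit(Xs_new, Ys.ravel())
        y_pred = clf.predict(Xt)
        return sklearn.metrics.accuracy_score(Yt, y_pred), y_pred

# Synthetic domains: the target is a scaled copy of source-like data, i.e.,
# the same class structure with a different covariance
rng = np.random.default_rng(0)
Xs = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(3, 1, (100, 4))])
Ys = np.array([0] * 100 + [1] * 100)
Xt = 2.0 * np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(3, 1, (100, 4))])
Yt = np.array([0] * 100 + [1] * 100)
acc, _ = CORAL().fit_predict(Xs, Ys, Xt, Yt)
```

Because CORAL rescales the source features to match the target covariance, the 1NN classifier recovers the class structure despite the covariance mismatch.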


6.5 Summary

In this chapter, we introduced subspace learning, manifold learning, and optimal transport methods for transfer learning. Geometrical representation learning is also an important direction in machine learning. Note that the methods in this chapter can be combined with the methods introduced earlier to further boost the performance of transfer learning.

References

Baktashmotlagh, M., Harandi, M. T., Lovell, B. C., and Salzmann, M. (2014). Domain adaptation on the statistical manifold. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2481–2488.
Belkin, M., Niyogi, P., and Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399–2434.
Bhushan Damodaran, B., Kellenberger, B., Flamary, R., Tuia, D., and Courty, N. (2018). DeepJDOT: Deep joint distribution optimal transport for unsupervised domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 447–463.
Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
Courty, N., Flamary, R., and Tuia, D. (2014). Domain adaptation with regularized optimal transport. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 274–289. Springer.
Courty, N., Flamary, R., Tuia, D., and Rakotomamonjy, A. (2016). Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Courty, N., Flamary, R., Habrard, A., and Rakotomamonjy, A. (2017). Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems, pages 3730–3739.
Fernando, B., Habrard, A., Sebban, M., and Tuytelaars, T. (2013). Unsupervised visual domain adaptation using subspace alignment. In ICCV, pages 2960–2967.
Gong, B., Shi, Y., Sha, F., and Grauman, K. (2012). Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pages 2066–2073.
Gopalan, R., Li, R., and Chellappa, R. (2011). Domain adaptation for object recognition: An unsupervised approach. In ICCV, pages 999–1006. IEEE.
Greene, R. E. and Jacobowitz, H. (1971). Analytic isometric embeddings. Annals of Mathematics, pages 189–204.
Guerrero, R., Ledig, C., and Rueckert, D. (2014). Manifold alignment and transfer learning for classification of Alzheimer's disease. In International Workshop on Machine Learning in Medical Imaging, pages 77–84. Springer.
Hamm, J. and Lee, D. D. (2008). Grassmann discriminant analysis: a unifying view on subspace-based learning. In ICML, pages 376–383. ACM.
Lee, C.-Y., Batra, T., Baig, M. H., and Ulbricht, D. (2019). Sliced Wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10285–10295.
Lu, W., Chen, Y., Wang, J., and Qin, X. (2021). Cross-domain activity recognition via substructural optimal transport. Neurocomputing, 454:65–75.
Qin, X., Chen, Y., Wang, J., and Yu, C. (2019). Cross-dataset activity recognition via adaptive spatial-temporal transfer learning. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 3(4):1–25.


Seung, H. S. and Lee, D. D. (2000). The manifold ways of perception. Science, 290(5500):2268–2269.
Sun, B. and Saenko, K. (2015). Subspace distribution alignment for unsupervised domain adaptation. In BMVC, pages 24–1.
Sun, B. and Saenko, K. (2016). Deep CORAL: Correlation alignment for deep domain adaptation. In ECCV, pages 443–450.
Sun, B., Feng, J., and Saenko, K. (2016). Return of frustratingly easy domain adaptation. In AAAI.
Villani, C. (2008). Optimal transport: old and new, volume 338. Springer Science & Business Media.
Wang, J., Feng, W., Chen, Y., Yu, H., Huang, M., and Yu, P. S. (2018). Visual domain adaptation with manifold embedded distribution alignment. In ACMMM, pages 402–410.
Wang, J., Lan, C., Liu, C., Ouyang, Y., Zeng, W., and Qin, T. (2021). Generalizing to unseen domains: A survey on domain generalization. In IJCAI Survey Track.
Xu, R., Liu, P., Wang, L., Chen, C., and Wang, J. (2020a). Reliable weighted optimal transport for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4394–4403.
Xu, R., Liu, P., Zhang, Y., Cai, F., Wang, J., Liang, S., Ying, H., and Yin, J. (2020b). Joint partial optimal transport for open set domain adaptation. In International Joint Conference on Artificial Intelligence, pages 2540–2546.
Zhou, Z.-H. (2016). Machine learning. Tsinghua University Press.

Chapter 7

Theory, Evaluation, and Model Selection

We have introduced several basic algorithms for transfer learning. However, we did not show how to select models and tune hyperparameters, which will be covered in this chapter. Moreover, we will also introduce theories behind existing approaches that act as the foundation for designing new algorithms. The organization of this chapter is as follows. We introduce the transfer learning theory in Sect. 7.1. Then, we describe the metrics and evaluations in Sect. 7.2. After that, we show how to perform model selection in Sect. 7.3. Finally, we summarize this chapter in Sect. 7.4.

7.1 Transfer Learning Theory

Traditional machine learning generally adopts the i.i.d. assumption that the training and test data are sampled from the same distribution; statistical machine learning theory, such as probably approximately correct (PAC) learning (Valiant, 1984), was established on this basis. These theories indicate that a model's generalization error can be bounded in terms of the model complexity and the number of training samples: roughly speaking, the error decreases as the number of training samples grows. In transfer learning, however, the source and target domains come from different distributions, making it challenging to apply the existing learning theory directly to our problem. In this section, we present some theories of transfer learning; more concretely, theories for the challenging unsupervised domain adaptation setting. Over the past two decades, researchers have proposed new theories and algorithms in this setting, including distribution divergence measurements such as the $\mathcal{H}$-divergence (Ben-David et al., 2007) and the $\mathcal{H}\Delta\mathcal{H}$-distance (Ben-David et al., 2010), together with the corresponding learning theory. Inspired by these theories,

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 J. Wang, Y. Chen, Introduction to Transfer Learning, Machine Learning: Foundations, Methodologies, and Applications, https://doi.org/10.1007/978-981-19-7584-4_7


researchers proposed different algorithms to improve transfer learning performance. Redko et al. (2020) categorized existing domain adaptation theory into three classes:

1. Discrepancy-based error bounds: For the binary classification problem, Ben-David et al. (2007) proposed the first transfer learning and domain adaptation theory based on the 0–1 loss function and the $\mathcal{H}$-divergence. Mansour et al. (2009) extended that theory to loss functions satisfying the triangle inequality.
2. Integral probability metric-based error bounds: This kind of theory includes optimal transport (Courty et al., 2017; Dhouib et al., 2020; Redko et al., 2017) and maximum mean discrepancy (Borgwardt et al., 2006). The former adopts the Wasserstein distance to measure the distribution divergence, while the latter uses maximum mean discrepancy; researchers proposed corresponding theory for both measurements.
3. PAC-Bayesian error bounds: This kind of theory (Germain et al., 2013, 2015) considers models that predict by weighted voting over classifiers; the generalization error is then bounded via the consistency of the classifiers.

In this section, we focus on the theory for discrepancy measurements. We denote the two different distributions of the source and target domains by $P$ and $Q$. These are joint distributions over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} \in \mathbb{R}^d$; $\mathcal{Y} = \{0, 1\}$ for binary classification and $\mathcal{Y} = \{1, 2, \ldots, K\}$ for the K-class classification problem. We use $\hat{\mathcal{D}}$ to denote a sample set from domain $\mathcal{D}$. In unsupervised problems, there exists a labeled sample set $\hat{P} = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$ and an unlabeled set $\hat{Q} = \{x_i^t\}_{i=1}^{N_t}$.

In binary classification, we define the true labeling function $f: \mathcal{X} \to [0, 1]$ for domain $\mathcal{D}$. For any classifier $h: \mathcal{X} \to [0, 1]$, the classification error is defined as:

$$\epsilon(h, f) = \mathbb{E}_{x \sim \mathcal{D}}\left[h(x) \neq f(x)\right] = \mathbb{E}_{x \sim \mathcal{D}}\left[|h(x) - f(x)|\right]. \tag{7.1}$$

Then, the errors of the classifier $h$ on the source and target domains can be formulated as:

$$\epsilon_s(h) = \epsilon_s(h, f_s), \qquad \epsilon_t(h) = \epsilon_t(h, f_t). \tag{7.2}$$

We denote by $\hat{\epsilon}_s(h)$ and $\hat{\epsilon}_t(h)$ the empirical errors on the source and target domains, respectively.


7.1.1 Theory Based on H-Divergence

The theory of the $\mathcal{H}$-divergence (Ben-David et al., 2007) was proposed in 2007 and then extended in the Machine Learning journal in 2010. In this theory, the authors considered the binary classification setting and derived the corresponding error bound based on the 0–1 loss function.

Definition 7.1 ($\mathcal{H}$-Divergence) Given two distributions $P$ and $Q$, let $\mathcal{H}$ be the hypothesis space and $\mathbb{I}(h)$ the indicator set of $h \in \mathcal{H}$, i.e., $x \in \mathbb{I}(h) \Leftrightarrow h(x) = 1$. Then, the $\mathcal{H}$-divergence is defined as:

$$d_{\mathcal{H}}(P, Q) = 2 \sup_{h \in \mathcal{H}} \left| \Pr_P[\mathbb{I}(h)] - \Pr_Q[\mathbb{I}(h)] \right|. \tag{7.3}$$

In a finite sample set, we often adopt the empirical $\mathcal{H}$-divergence to measure distribution divergence. For a symmetric hypothesis class $\mathcal{H}$ and two m-sized sample sets $\hat{P}$ and $\hat{Q}$, the empirical $\mathcal{H}$-divergence can be formulated as:

$$\hat{d}_{\mathcal{H}}(\hat{P}, \hat{Q}) = 2\left(1 - \min_{h \in \mathcal{H}}\left[\frac{1}{m}\sum_{x:\, h(x)=0} \mathbb{I}[x \in \hat{P}] + \frac{1}{m}\sum_{x:\, h(x)=1} \mathbb{I}[x \in \hat{Q}]\right]\right). \tag{7.4}$$

Basically, Eq. (7.4) tells us that the divergence between two distributions can be computed empirically via a classifier loss: we train a classifier to decide which distribution a sample belongs to (i.e., a binary classification), take its loss as the $\min[\cdot, \cdot]$ term in Eq. (7.4), and then compute $\hat{d}_{\mathcal{H}}(\hat{P}, \hat{Q})$. This is exactly the A-distance we introduced in Sect. 3.2.

Based on the $\mathcal{H}$-divergence, the authors proposed a new theorem:

Theorem 7.1 (Target Error Bound Based on $\mathcal{H}$-Divergence) Let $\mathcal{H}$ be a hypothesis space with VC dimension $d$. Given a sample set of size $m$ drawn i.i.d. from the source domain, then, with probability at least $1 - \delta$, for any $h \in \mathcal{H}$, we have:

$$\epsilon_t(h) \le \hat{\epsilon}_s(h) + d_{\mathcal{H}}(\hat{\mathcal{D}}_s, \hat{\mathcal{D}}_t) + \lambda + \sqrt{\frac{4}{m}\left(d \log\frac{2em}{d} + \log\frac{4}{\delta}\right)}, \tag{7.5}$$

where $e$ is the base of the natural logarithm, $\lambda = \epsilon_s(h^*) + \epsilon_t(h^*)$ is the ideal joint risk, and $h^* = \arg\min_{h \in \mathcal{H}} \epsilon_s(h) + \epsilon_t(h)$ is the optimal classifier on the source and target domains.

Theorem 7.1 indicates that the error on the target domain is bounded by four terms: (1) the source empirical error, (2) the distribution discrepancy between the source and target domains, (3) the ideal joint error, and (4) a constant related to the sample size and VC dimension.


The ideal joint error $\lambda$ cannot be computed directly since it requires the true labels on the target domain. In many cases, we assume $\lambda$ is extremely small, i.e., there exists a classifier that achieves small error on both the source and target domains, so that transfer learning between them is feasible. Under this assumption, only two terms truly affect the target error bound: the source generalization error and the distribution divergence. Interestingly, the unified transfer learning framework of Eq. (3.9) matches this theory in spirit: its first term is the source error and its second is the cross-domain distribution divergence. Therefore, the theory supports the correctness of our induced algorithm framework. Inspired by Theorem 7.1, Ganin and Lempitsky (2015) and Ganin et al. (2016) proposed the domain-adversarial neural networks (DANN, which will be introduced in Chap. 10) that estimate the distribution divergence between domains with a domain discriminator.
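The empirical computation behind Eq. (7.4) is often implemented as the proxy A-distance, $d_A = 2(1 - 2\,\mathrm{err})$, where err is the test error of a domain classifier. The following sketch illustrates this; the logistic-regression classifier and the Gaussian toy domains are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def proxy_a_distance(Xs, Xt):
    """Train a classifier to separate the two domains; its held-out error err
    gives the proxy A-distance d_A = 2 * (1 - 2 * err)."""
    X = np.vstack([Xs, Xt])
    y = np.hstack([np.zeros(len(Xs)), np.ones(len(Xt))])
    Xtr, Xte, ytr, yte = train_test_split(
        X, y, test_size=0.5, random_state=0, stratify=y)
    err = 1.0 - LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)
    return 2.0 * (1.0 - 2.0 * err)

rng = np.random.default_rng(0)
d_same = proxy_a_distance(rng.normal(0, 1, (200, 5)), rng.normal(0, 1, (200, 5)))
d_far = proxy_a_distance(rng.normal(0, 1, (200, 5)), rng.normal(4, 1, (200, 5)))
# Identically distributed domains give a value near 0; well-separated
# domains approach the maximum of 2
```

A large proxy A-distance signals a large cross-domain divergence, and hence, by Theorem 7.1, a potentially loose target error bound.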

7.1.2 Theory Based on HΔH-Distance

Based on the $\mathcal{H}$-divergence, the authors of the previous theory extended their work in Ben-David et al. (2010).

Definition 7.2 (Symmetric Difference Hypothesis Space $\mathcal{H}\Delta\mathcal{H}$) For a hypothesis space $\mathcal{H}$, the symmetric difference hypothesis space $\mathcal{H}\Delta\mathcal{H}$ is the set satisfying the following condition:

$$g \in \mathcal{H}\Delta\mathcal{H} \iff g(x) = h(x) \oplus h'(x), \quad \forall h, h' \in \mathcal{H}, \tag{7.6}$$

where $\oplus$ denotes the XOR operation. Based on $\mathcal{H}\Delta\mathcal{H}$, the $\mathcal{H}\Delta\mathcal{H}$-distance is defined as:

Definition 7.3 ($\mathcal{H}\Delta\mathcal{H}$-Distance) For any $h, h' \in \mathcal{H}$,

$$d_{\mathcal{H}\Delta\mathcal{H}}(P, Q) = 2 \sup_{h, h' \in \mathcal{H}} \left| \Pr_{x \sim P}\left[h(x) \neq h'(x)\right] - \Pr_{x \sim Q}\left[h(x) \neq h'(x)\right] \right|. \tag{7.7}$$

Based on the $\mathcal{H}\Delta\mathcal{H}$-distance, the authors proposed a new target error bound:

Theorem 7.2 (Target Error Bound Based on $\mathcal{H}\Delta\mathcal{H}$-Distance) Let $\mathcal{H}$ be a hypothesis space with VC dimension $d$, and let $\hat{P}, \hat{Q}$ be sample sets of size $m$ drawn from the source and target distributions. Then for any $\delta \in (0, 1)$ and any $h \in \mathcal{H}$, with probability at least $1 - \delta$, we have:

$$\epsilon_t(h) \le \hat{\epsilon}_s(h) + \frac{1}{2}\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\hat{P}, \hat{Q}) + 4\sqrt{\frac{2d \log(2m) + \log(2/\delta)}{m}} + \lambda. \tag{7.8}$$


For better understanding, we give the following derivation:

$$\begin{aligned}
\epsilon_t(h) &= \epsilon_t(h, f_t) \\
&\le \epsilon_t(h^*) + \epsilon_t(h, h^*) \\
&\le \epsilon_t(h^*) + \epsilon_s(h, h^*) + \epsilon_t(h, h^*) - \epsilon_s(h, h^*) \\
&\le \epsilon_t(h^*) + \epsilon_s(h, h^*) + \left|\epsilon_t(h, h^*) - \epsilon_s(h, h^*)\right| \\
&\le \epsilon_t(h^*) + \epsilon_s(h, h^*) + \frac{1}{2}\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\hat{P}, \hat{Q}) \\
&\le \epsilon_t(h^*) + \epsilon_s(h^*) + \epsilon_s(h) + \frac{1}{2}\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\hat{P}, \hat{Q}) \\
&\le \epsilon_s(h) + \frac{1}{2}\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\hat{P}, \hat{Q}) + \lambda \\
&\le \hat{\epsilon}_s(h) + \frac{1}{2}\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\hat{P}, \hat{Q}) + 4\sqrt{\frac{2d \log(2m) + \log(2/\delta)}{m}} + \lambda.
\end{aligned} \tag{7.9}$$

Lines 4 and 5 of the above derivation find an upper bound for $|\epsilon_t(h, h^*) - \epsilon_s(h, h^*)|$; the $\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\hat{P}, \hat{Q})$ distance is exactly that upper bound. Comparing the $\mathcal{H}$-distance with $\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\hat{P}, \hat{Q})$, we find that the latter is the special case in which the hypothesis space equals $\mathcal{H}\Delta\mathcal{H}$. Based on Theorem 7.2, Saito et al. (2018) proposed an algorithm called maximum classifier discrepancy (MCD, which will be introduced in Sect. 10.3) that estimates the $\mathcal{H}\Delta\mathcal{H}$-distance via the discrepancy of two classifiers, thus reducing the cross-domain distribution divergence.

7.1.3 Theory Based on Discrepancy Distance

The $\mathcal{H}$-divergence and $\mathcal{H}\Delta\mathcal{H}$-distance only consider the case when the loss function is the 0–1 loss. Mansour et al. (2009) extended the analysis to any loss function satisfying the triangle inequality. They first define the discrepancy distance:

Definition 7.4 (Discrepancy Distance) Let $\mathcal{H}$ denote a hypothesis space and $L: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ a loss function on $\mathcal{Y}$. The discrepancy distance $\mathrm{disc}_L$ between two distributions $P$ and $Q$ is defined as:

$$\mathrm{disc}_L(P, Q) = \max_{h, h' \in \mathcal{H}} \left| \mathcal{L}_P(h, h') - \mathcal{L}_Q(h, h') \right|. \tag{7.10}$$

The discrepancy distance satisfies the triangle inequality $\mathrm{disc}_L(P, Q) \le \mathrm{disc}_L(P, M) + \mathrm{disc}_L(M, Q)$. Define $h_Q^* \in \arg\min_{h \in \mathcal{H}} \mathcal{L}_Q(h, f_Q)$, where $f_Q$ is the labeling function of distribution $Q$; similarly, define $h_P^*$ as the minimizer of $\mathcal{L}_P(h, f_P)$. To conduct transfer learning, the authors assumed that the loss $\mathcal{L}_Q(h_Q^*, h_P^*)$ between the two optimal classifiers is extremely small. Different from Theorems 7.1 and 7.2, which assume there exists one classifier that is optimal for both domains, this theory assumes there exists one optimal classifier for each domain and that their discrepancy is small.


two optimal classifiers is extremely small. Different from Theories 7.1 and 7.2 that assume there exists one optimal classifier for both two domains, this theory assumes there exists one optimal classifier for each domain and their discrepancy is small. Theorem 7.3 (Target Error Bound Based on Discrepancy Distance) Assuming loss function L is symmetric and satisfies triangle inequalities. Then, for any h ∈ H, we have: LQ (h, fQ ) ≤ LQ (h∗Q , fQ ) + LP (h, h∗P ) + discL (P , Q) + LP (h∗P , h∗Q ).

.

(7.11)

Compared to Theory 7.2, authors also did some analysis. Assume h∗Q = h∗P = h∗Q . At the same time, Theory 7.3 becomes + LP (h, h∗ ) + disc(P , Q), and Theory 7.2 becomes + LP (h, fP ) + LP (h∗ , fP ) + disc(P , Q). Based on triangular inequality, we have LP (h, h∗ ) ≤ LP (h, fP ) + LP (h∗ , fP ). Thus, under such condition, Theory 7.3 is a tighter bound than Theory 7.2. h∗P , then we have h∗ = LQ (h, fQ ) ≤ LQ (h∗ , fQ ) LQ (h, fQ ) ≤ LQ (h∗ , fQ )

7.1.4 Theory Based on Labeling Function Difference

Theorems 7.1 and 7.2 have been widely used for many years. Based on them, many algorithms were developed with the purpose of minimizing the source empirical error while learning domain-invariant representations that also minimize the distribution discrepancy. However, Zhao et al. (2019) constructed a counterexample showing that even when the distribution discrepancy between the two domains is 0, the sum of the errors on the source and target domains can still be 1. In this extreme case, if we keep minimizing the source error, we increase the target error. Zhao et al. (2019) proposed a new theory for this problem.

Theorem 7.4 (Target Error Bound Based on Labeling Function Discrepancy) Let $f_s, f_t$ be the labeling functions on the source and target domains, and let $\hat{P}, \hat{Q}$ denote their samples, each of size $m$. $\mathrm{Rad}(\cdot)$ denotes the Rademacher complexity. Then, for any $\mathcal{H} \subseteq [0, 1]^{\mathcal{X}}$ and $h \in \mathcal{H}$, we have:

$$\begin{aligned}
\epsilon_t(h) \le\ & \hat{\epsilon}_s(h) + d_{\tilde{\mathcal{H}}}(\hat{P}, \hat{Q}) + \min\left\{\mathbb{E}_P[|f_s - f_t|],\ \mathbb{E}_Q[|f_s - f_t|]\right\} \\
& + 2\,\mathrm{Rad}(\mathcal{H}) + 4\,\mathrm{Rad}(\tilde{\mathcal{H}}) \\
& + O\left(\sqrt{\log(1/\delta)/m}\right),
\end{aligned} \tag{7.12}$$

where $\tilde{\mathcal{H}} = \{\mathrm{sgn}(|h(x) - h'(x)| - t) \mid h, h' \in \mathcal{H}, t \in [0, 1]\}$.

This error bound can be decomposed into three parts. The first part (line 1) is the domain adaptation part, including the source error risk, the empirical $\tilde{\mathcal{H}}$-distance, and the labeling function difference. The second part (line 2) corresponds to the complexity


measure of the hypothesis spaces $\mathcal{H}$ and $\tilde{\mathcal{H}}$, and the third part (line 3) describes the error caused by finite samples. Comparing Theorems 7.4 and 7.2, the major difference is the term $\min\{\mathbb{E}_P[|f_s - f_t|], \mathbb{E}_Q[|f_s - f_t|]\}$ in Theorem 7.4 versus the term $\lambda$ in Theorem 7.2: the latter relies on the choice of the hypothesis space $\mathcal{H}$, while the former does not. Theorem 7.4 also reveals the effect of conditional shift. We hope that readers can better understand these theories through our introduction and, more importantly, learn to use them in their own research. It is also important to note that there are other theories beyond those in this section, and research in this area continues to evolve; interested readers are encouraged to follow the latest publications.

7.2 Metric and Evaluation

In this section, we introduce how to evaluate a transfer learning task. Generally speaking, transfer learning is successful on a task if it improves the task's own metrics. For instance, classification tasks often use accuracy, F1 score, or AUROC for model evaluation; regression tasks often use RMSE or MAE; machine translation often uses the BLEU score. In a nutshell, there is no unified evaluation metric for transfer learning. Formally, let $U_0$ be the metric value on a task without transfer learning, and let $U$ be the value on the same task using transfer learning. Then, we say the transfer learning task is successful if $U \succ U_0$, where $\succ$ denotes an order relation: $A \succ B$ means that performance $A$ is better than $B$. Intuitively, when a transfer learning algorithm improves a given evaluation metric on a given task, we call it a successful transfer learning application. Note that sometimes we care not only about the transfer performance but also about catastrophic interference, which evaluates the performance on the original (source) tasks after transferring to a target task, i.e., it measures catastrophic forgetting (French, 1999). Therefore, we can also apply the transferred model directly to the original data to test its backward transfer (BWT) performance (Parisi et al., 2019). Clearly, the ideal value of BWT is 0.

7.3 Model Selection We need to tune hyperparameters and select models when designing algorithms. Since the test data cannot be used for training, we often adopt hold-out and k-fold cross validation for selecting models and tuning hyperparameters. Hold-out method


splits the dataset into two parts (a training set and a validation set). Cross validation splits the dataset into $k$ subsets; each time, $k - 1$ subsets are used for training and the remaining one for validation, which reduces the variance of the validation estimate (Zhou, 2016). Formally speaking, if we use $\mathcal{T} = \{\mathcal{T}_j\}_{j=1}^{k}$ to denote the $k$ subsets of data and $f_{\mathcal{T}_j}(x)$ denotes the model learned on the training data $\mathcal{T}_{i \neq j}$, then k-fold cross validation is represented as:

$$\hat{R}_{kCV} \equiv \frac{1}{k}\sum_{j=1}^{k} \frac{1}{|\mathcal{T}_j|} \sum_{(x, y) \in \mathcal{T}_j} \ell\left(x, y, f_{\mathcal{T}_j}(x)\right), \tag{7.13}$$

where $\ell(\cdot)$ is the loss function. k-fold cross validation is popular for model selection. Coming back to transfer learning: what are the training and validation data in a domain adaptation problem? Can we use the target domain directly for model selection and hyperparameter tuning? No: we cannot use the test data for validation. How can we solve this problem? There are two possible avenues. First, if there are some labeled data in the target domain, we can use them directly as the validation set; however, this is impossible for unsupervised domain adaptation problems. Second, we can fix the same hyperparameters for all tasks, as in Wang et al. (2017) and Tzeng et al. (2017). Obviously, neither method is general. How, then, can we perform model selection for unsupervised domain adaptation?

7.3.1 Importance Weighted Cross Validation

Sugiyama et al. (2007) proposed importance weighted cross validation (IWCV) for marginal distribution adaptation ($P_s(x) \neq P_t(x)$). The core of IWCV is to exploit the density ratio we introduced in Sect. 4.3 for model selection:

$$\theta_t^* \approx \arg\max_{\theta} \frac{1}{N_s} \sum_{i=1}^{N_s} \frac{P_t\left(x_i^s\right)}{P_s\left(x_i^s\right)} \log P\left(y_i^s \mid x_i^s; \theta\right). \tag{7.14}$$

After introducing the density ratio, the target training loss of IWCV can be represented as:

$$\hat{R}_{kIWCV} \equiv \frac{1}{k}\sum_{j=1}^{k} \frac{1}{|\mathcal{T}_j|} \sum_{(x, y) \in \mathcal{T}_j} \frac{P_t(x)}{P_s(x)}\, \ell\left(x, y, f_{\mathcal{T}_j}(x)\right). \tag{7.15}$$


If we use $\mathcal{D}_s$ to denote the source domain data, then IWCV can be reformulated as:

$$\hat{R}_{kIWCV} \equiv \frac{1}{k}\sum_{j=1}^{k} \frac{1}{|\mathcal{D}_s^j|} \sum_{(x, y) \in \mathcal{D}_s^j} \frac{P_t(x)}{P_s(x)}\, \ell\left(x, y, f_{\mathcal{D}_s^j}(x)\right). \tag{7.16}$$

It can be proved that Eq. (7.15) is an unbiased estimate of the target domain error. Moreover, we can see from Eq. (7.16) that IWCV does not need the target domain labels. Thus, we can use IWCV for model selection in domain adaptation.
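The density ratio $P_t(x)/P_s(x)$ in Eqs. (7.15) and (7.16) must itself be estimated. One common estimator uses a probabilistic domain classifier, since $P_t(x)/P_s(x) \propto p(\text{target} \mid x)/p(\text{source} \mid x)$. Below is a sketch; the logistic-regression estimator and the 1-D Gaussian domains are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_weights(Xs, Xt):
    """Estimate w(x) = Pt(x) / Ps(x) on the source samples via a domain
    classifier: w(x) ~ (ns / nt) * p(target | x) / p(source | x)."""
    X = np.vstack([Xs, Xt])
    y = np.hstack([np.zeros(len(Xs)), np.ones(len(Xt))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p = clf.predict_proba(Xs)[:, 1]
    return (len(Xs) / len(Xt)) * p / (1.0 - p)

def iwcv_risk(losses, weights):
    """Importance-weighted validation risk of Eq. (7.15) for per-sample
    losses computed on held-out source folds."""
    return float(np.mean(weights * losses))

rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, (500, 1))    # source: N(0, 1)
Xt = rng.normal(1.0, 1.0, (500, 1))    # target: N(1, 1)
w = importance_weights(Xs, Xt)
# Source samples that look like target samples receive larger weights
```

Plugging these weights into the fold losses of standard cross validation turns Eq. (7.13) into the IWCV estimate of Eq. (7.15).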

7.3.2 Transfer Cross Validation

The basic assumption of IWCV is that the source and target domains share the same conditional distribution but have different marginal distributions. What if both distributions differ? Zhong et al. (2009) considered this case and proposed transfer cross validation (TrCV), which introduces an estimate of the target conditional probability. To introduce the conditional probability, we first reformulate IWCV as:

$$\hat{R}_{kIWCV} = \arg\min_{f} \frac{1}{k}\sum_{j=1}^{k} \sum_{(x, y) \in S_j} \frac{P_t(x)}{P_s(x)} \left| P_s(y \mid x) - P\left(y \mid x, f_j\right) \right|. \tag{7.17}$$

Then, TrCV can be represented as:

$$\hat{R}_{TrCV} = \arg\min_{f} \frac{1}{k}\sum_{j=1}^{k} \sum_{(x, y) \in S_j} \frac{P_t(x)}{P_s(x)} \left| P_t(y \mid x) - P(y \mid x, f) \right|. \tag{7.18}$$

We can clearly see that TrCV utilizes target domain labels (via $P_t(y \mid x)$) in the estimation. Thus, TrCV can be used when there are labels in the target domain. Based on Eq. (7.18), the authors also constructed other validation frameworks, which we do not introduce here; readers are encouraged to consult their publications. We summarize the model selection methods of this section in Table 7.1. Model selection for transfer learning remains an open research problem, and we hope to see more efforts in this area.


Table 7.1 Comparison of different model selection methods in transfer learning

Model selection method             Target domain labels    Covariate shift
Source domain model selection      No                      Cannot handle
Target domain model selection      Yes                     Can handle
IWCV                               No                      Can handle
TrCV                               Yes                     Can handle

7.4 Summary

In this chapter, we introduced some fundamental aspects of transfer learning: theory, evaluation, and model selection. They are important in transfer learning research: algorithms should be designed with theoretical grounding and then evaluated properly. We hope that readers will use these tools when needed.

References

Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. (2010). A theory of learning from different domains. Machine Learning, 79(1–2):151–175.
Ben-David, S., Blitzer, J., Crammer, K., Pereira, F., et al. (2007). Analysis of representations for domain adaptation. In NIPS, volume 19.
Borgwardt, K. M., Gretton, A., Rasch, M. J., Kriegel, H.-P., Schölkopf, B., and Smola, A. J. (2006). Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57.
Courty, N., Flamary, R., Habrard, A., and Rakotomamonjy, A. (2017). Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems, pages 3730–3739.
Dhouib, S., Redko, I., and Lartizien, C. (2020). Margin-aware adversarial domain adaptation with optimal transport. In Thirty-seventh International Conference on Machine Learning.
French, R. M. (1999). Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135.
Ganin, Y. and Lempitsky, V. (2015). Unsupervised domain adaptation by backpropagation. In ICML, pages 1180–1189.
Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. (2016). Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35.
Germain, P., Habrard, A., Laviolette, F., and Morvant, E. (2013). A PAC-Bayesian approach for domain adaptation with specialization to linear classifiers. In ICML.
Germain, P., Habrard, A., Laviolette, F., and Morvant, E. (2015). A new PAC-Bayesian perspective on domain adaptation. In ICML 2016.
Mansour, Y., Mohri, M., and Rostamizadeh, A. (2009). Domain adaptation with multiple sources. In NeurIPS, pages 1041–1048.
Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and Wermter, S. (2019). Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71.
Redko, I., Habrard, A., and Sebban, M. (2017). Theoretical analysis of domain adaptation with optimal transport. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 737–753. Springer.


Redko, I., Morvant, E., Habrard, A., Sebban, M., and Bennani, Y. (2020). A survey on domain adaptation theory. arXiv preprint arXiv:2004.11829.
Saito, K., Watanabe, K., Ushiku, Y., and Harada, T. (2018). Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3723–3732.
Sugiyama, M., Krauledat, M., and Müller, K.-R. (2007). Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(May):985–1005.
Tzeng, E., Hoffman, J., Saenko, K., and Darrell, T. (2017). Adversarial discriminative domain adaptation. In CVPR, pages 2962–2971.
Valiant, L. (1984). A theory of the learnable. Commun. ACM, 27:1134–1142.
Wang, J., Chen, Y., Hao, S., et al. (2017). Balanced distribution adaptation for transfer learning. In ICDM, pages 1129–1134.
Zhao, H., Des Combes, R. T., Zhang, K., and Gordon, G. (2019). On learning invariant representations for domain adaptation. In International Conference on Machine Learning, pages 7523–7532.
Zhong, E., Fan, W., Yang, Q., Verscheure, O., and Ren, J. (2009). Cross validation framework to choose amongst models and datasets for transfer learning. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), pages 1027–1036. ACM.
Zhou, Z.-H. (2016). Machine learning. Tsinghua University Press.

Part II

Modern Transfer Learning

Chapter 8

Pre-Training and Fine-Tuning

In this chapter, we focus on modern parameter-based methods: the pre-training and fine-tuning approach. We also step into deep transfer learning starting from this chapter. In the next chapters, the deep transfer learning methods focus on designing better network architectures and loss functions on top of a pre-trained network; thus, this chapter can be seen as the foundation of the chapters that follow. Pre-training and fine-tuning belongs to the category of parameter/model-based transfer learning methods, which perform knowledge transfer by sharing important information of the model structures. The basic assumption is that there exists some common information between the source and target structures that can be shared. Figure 8.1 shows the basic idea of model-based transfer learning methods. The organization of this chapter is as follows. We first introduce the transferability of deep neural networks in Sect. 8.1, which answers why deep neural networks can be transferred. Then, we introduce how to perform pre-training and fine-tuning in Sect. 8.2. In Sect. 8.3, we introduce several regularization techniques for better fine-tuning performance. In Sects. 8.4 and 8.5, we introduce different ways of pre-training and their extensions and applications. In Sect. 8.6, we provide code practice. Finally, Sect. 8.7 concludes this chapter.

8.1 How Transferable Are Deep Networks

We must first understand deep networks before conducting transfer learning with them. Since AlexNet (Krizhevsky et al., 2012) won the ILSVRC challenge (Deng et al., 2009) in 2012, deep learning has become increasingly popular. In general, the deeper the network, the better the learned representations and results. However, the layer-wise deep structure makes interpretability challenging (Chen et al., 2019a). One popular approach is to visualize the activations of each layer to better understand the information they contain.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 J. Wang, Y. Chen, Introduction to Transfer Learning, Machine Learning: Foundations, Methodologies, and Applications, https://doi.org/10.1007/978-981-19-7584-4_8


Fig. 8.1 Illustration of model-based transfer learning methods: the upper decision nodes (e.g., "Has feet?", "Has eyes?") are shared between the source domain (Chihuahua) and the target domain (Shepherd), while the task-specific nodes ("Has a tail?", "Long hair?") differ

Fig. 8.2 Visualization of CNN features

Figure 8.2¹ shows how features are extracted in a deep network. Assume the input is a lovely dog. During forward propagation, the network learns only some edge information of the dog at the shallow layers, which we call low-level features. Then the network begins to detect lines and shapes at the middle layers, which are more distinctive than the low-level features and which we call middle-level features. Finally,

¹ The visualizations are made using the tool from https://poloclub.github.io/cnn-explainer/ and the complete image can be found at this link: https://github.com/jindongwang/tlbook-code/tree/main/chap08_pretrain_finetune.

Fig. 8.3 How deep networks extract features: low-level features at the shallow layers, middle-level features, then high-level features feeding a classifier that predicts "Dog"

networks can detect semantic features of the dog, such as the dog's face or nose, which are high-level features. How can we leverage this observation to understand the transferability of deep networks? A popular explanation is as follows. In a deep neural network, the shallow layers learn general features (e.g., edge and shape information) while the deeper layers learn specific features (e.g., legs and face). As the network goes deeper, the features transition from general to specific. This explanation is intuitive and easy to understand. If we knew exactly which layers of a network learn general features and which learn specific features, we could use this property for transfer learning. Since the layers that learn general features are not tied to a specific task, we can reuse them for common base tasks. For instance, in Fig. 8.3, we can directly reuse the shallow layers that learn low-level features for cats, since these features are also general to cats; in this way, we reduce the training cost of a cat classifier.

Then, how can we determine which layers learn low-level features and which layers learn specific features? Yosinski et al. (2014) proposed the first study of how transferable deep neural networks are, based mainly on ImageNet (Deng et al., 2009) experiments. They split the 1000 classes into two sets (A and B), each containing 500 classes. They then used the Caffe framework (Jia et al., 2014) to train an AlexNet (Krizhevsky et al., 2012) with 8 layers. Apart from the eighth layer, which serves the classification task, they performed fine-tuning experiments from the first to the seventh layer to explore how transferable this deep network is. The authors proposed several settings to interpret the fine-tuning results: AnB, BnB, AnB+, and BnB+:

• AnB tests the performance of transferring the first n layers from network A to network B. They first took the first n layers of the pre-trained network A and copied their parameters from A to B. They then froze these first n layers of B and trained only the remaining 8 − n layers.
• BnB tests the performance of network B itself: they froze its first n layers and trained the other 8 − n layers. This serves as a comparison to AnB.
• AnB+ and BnB+ denote the settings where the first n layers of B are not frozen but fine-tuned.


Here are some interesting conclusions:

• For BnB, the first three layers can be reused directly for transfer learning. At the fourth and fifth layers, the accuracy drops; the explanation is that the features of these layers become more and more specific. For BnB+, fine-tuning actually helps to obtain better performance.
• For AnB, directly copying the first three layers from A to B does not hurt the performance of B, which indicates that the first three layers indeed learn general features. The accuracy drops from the fourth and fifth layers on, which means those features are no longer general.
• AnB+ achieves the best performance when fine-tuning is added. This implies that pre-training and fine-tuning really boosts the performance of deep learning.

Similar results were obtained by Neyshabur et al. (2020) in pre-training and fine-tuning experiments on the DomainNet (Peng et al., 2019) dataset. These experiments show that the lower layers of a deep network mainly extract general features while the higher layers extract specific features. More importantly, researchers have pointed out that the similarity of domains plays a critical role in deep transfer learning: the more similar the domains are, the better the transfer performance. To sum up, we have the following conclusions about the transferability of deep networks:

• The lower layers of a deep network extract general features that can be used for transfer learning.
• Fine-tuning really helps the network achieve better performance.
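The "freeze the first n layers, train the rest" protocol behind the BnB/AnB experiments can be sketched in PyTorch. This is a hedged, minimal sketch: a tiny nn.Sequential stands in for the 8-layer AlexNet used by Yosinski et al. (2014), and the helper name freeze_first_n is our own.

```python
import torch.nn as nn

def freeze_first_n(model: nn.Sequential, n: int) -> nn.Sequential:
    """Freeze the parameters of the first n layers; the rest stay trainable."""
    for i, layer in enumerate(model):
        if i < n:
            for p in layer.parameters():
                p.requires_grad = False  # general layers, copied and frozen
    return model

# A tiny stand-in network; in practice the first layers would come from network A.
net = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 10),
)
net = freeze_first_n(net, n=3)  # first three layers fixed, as in "A3B"

# Only the still-trainable parameters are given to the optimizer.
trainable = [p for p in net.parameters() if p.requires_grad]
```

In the AnB+ / BnB+ settings, one would simply skip the freezing and pass all parameters to the optimizer.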

8.2 Pre-Training and Fine-Tuning

We reformulate traditional machine learning as learning an objective function $f$ on a given dataset $\mathcal{D}$ such that $f$ has the minimum risk. If we use $\theta$ to denote the parameters of $f$ to be learned and $\mathcal{L}$ is the cost function, then a general machine learning task is formulated as:

$\theta^* = \arg\min_{\theta} \mathcal{L}(\mathcal{D}; \theta), \qquad (8.1)$

where $\theta^*$ denotes the optimal parameters. Now we formally define pre-training and fine-tuning.


Definition 8.1 (Pre-Training and Fine-Tuning) Given a target dataset $\mathcal{D}$ with limited labeled data, pre-training and fine-tuning aims at learning a function $f$ parameterized by $\theta$ using the previous knowledge (parameters) $\theta_0$ from historical tasks:

$\theta^* = \arg\min_{\theta} \mathcal{L}(\theta \mid \theta_0, \mathcal{D}). \qquad (8.2)$
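Definition 8.1 can be illustrated with a toy sketch (ours, not the book's): $\theta$ is initialized from the "previous knowledge" $\theta_0$ and then adapted to the target dataset by gradient descent.

```python
import torch

theta0 = torch.tensor([5.0])                 # knowledge from a historical task
D = torch.tensor([1.0, 2.0, 3.0])            # small target dataset

theta = theta0.clone().requires_grad_(True)  # initialize from theta0
opt = torch.optim.SGD([theta], lr=0.1)
for _ in range(200):
    loss = ((D - theta) ** 2).mean()         # L(theta | theta0, D)
    opt.zero_grad()
    loss.backward()
    opt.step()
# theta moves from theta0 toward the target-optimal value (the mean of D)
```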

Pre-training and fine-tuning differs from domain adaptation in that it does not require the source and target domains to have identical categories; in fact, in most pre-training and fine-tuning applications, the categories are not the same. In real situations, we seldom train from scratch for a new task, which is tedious and expensive, especially when the labeled training data is not large enough. Instead, we use pre-training and fine-tuning. Figure 8.4 shows a simple pre-training and fine-tuning process, which raises some open questions:

• How many layers should we fix or fine-tune?
• How can we add regularization to improve performance on downstream tasks with limited data?

These two questions will be answered in Sects. 8.5 and 8.3, respectively.

Fig. 8.4 An illustration of pre-training and fine-tuning: the lower layers (Layer 1, Layer 2, ...) are copied from the source network to the target network and either fixed or fine-tuned, while the high-level layer is randomly initialized and retrained for the target label space


8.2.1 Benefits of Pre-Training and Fine-Tuning

Why do we need fine-tuning? Because models trained by others may not be suitable for our own tasks: the datasets may follow different distributions; the network of another task may be over-complicated for ours; or its structure may be too heavy. For instance, if we want to train a network to classify cats and dogs, we can start from a network pre-trained on CIFAR-100. However, CIFAR-100 has 100 classes and we only need two. Therefore, we should fix the general layers of the pre-trained network and modify only the output layer to fit our task. Generally speaking, fine-tuning has the following advantages:

• We do not need to train from scratch, which saves time and cost.
• Pre-trained models are often trained on large-scale datasets, which improves the generalization ability of our model in downstream fine-tuning.
• Fine-tuning is easy to implement, so we can focus on our own tasks.

Later, researchers studied the effectiveness of pre-training and fine-tuning. He et al. (2019) conducted multiple experiments on ImageNet pre-training and concluded that pre-trained models reduce training time compared to training from scratch, but bring only marginal performance improvement over it. Another study (Kornblith et al., 2019) examined how well pre-trained models transfer to downstream tasks. Their conclusions are:

• The pre-training performance on large-scale datasets bounds the final results on downstream tasks, i.e., the better the pre-trained model, the better the downstream results.
• Pre-trained models do not significantly improve performance on fine-grained tasks.
• Compared to random initialization, the gain from pre-training and fine-tuning shrinks as the downstream dataset grows, i.e., pre-training and fine-tuning brings large improvements mainly on small-scale downstream tasks.

Other researchers studied the effect of pre-trained models on model robustness (Hendrycks et al., 2019). They found that pre-trained models can improve robustness in several situations:

• For label corruption, pre-trained models can improve the final AUC.
• For imbalanced classification, pre-trained models can improve accuracy.
• For adversarial perturbation, pre-trained models can improve accuracy.
• For out-of-distribution tasks, pre-trained models can bring huge improvements.
• For calibration tasks, pre-trained models can improve model reliability.


8.3 Regularization for Fine-Tuning

It is interesting to note that fine-tuning, typically the most straightforward way to perform transfer learning in modern deep networks, is also an ERM process that undoubtedly needs explicit regularization. More formally, if we use $w$ to denote the weights of a network to be fine-tuned, then the regularized transfer learning process can be represented as:

$\min \mathcal{L}_{\text{cls}} + R(w). \qquad (8.3)$

As we summarized in Sect. 3.3, the most important module in transfer learning is the transfer regularization term $R(\cdot)$. Then, how can such regularization play a role in the fine-tuning process to obtain better performance? Such regularization is also called explicit inductive bias (Li et al., 2018). Moreover, Li et al. (2018) designed extensive experiments to evaluate the performance of different popular regularization terms for deep transfer learning. We now elaborate on their details.

The L2 regularization is widely used in deep learning for its simplicity. In this case, it is formulated as:

$R(w) = \frac{\alpha}{2} \|w\|_2^2, \qquad (8.4)$

where $\alpha$ is a hyperparameter. Furthermore, borrowing knowledge from lifelong learning (Biesialska et al., 2020), researchers have found that explicitly regularizing the distance between the current weights and the starting point (SP, i.e., the pre-trained weights) yields better fine-tuning performance. This was first introduced by Grachten and Chacón (2017), who called it L2-SP regularization:

$R(w) = \frac{\alpha}{2} \|w - w^0\|_2^2, \qquad (8.5)$

where $w^0$ denotes the weights at the starting point. Note that when the student and teacher networks have different architectures, we often split the above equation into two parts: the shared structure uses exactly the above equation, and we add plain L2 regularization on the remaining student weights. This extension is quite natural. Additionally, it is straightforward to develop L1 and L1-SP regularization by replacing the distance in L2 and L2-SP with the Manhattan distance.

Kirkpatrick et al. (2017) proposed elastic weight consolidation (EWC), a regularization that can also be used in deep transfer learning. It is also called L2-SP-Fisher, since it uses the Fisher matrix to encode the importance of the network weights:

$R(w) = \frac{\alpha}{2} \sum_{j \in S} \hat{F}_{jj} (w_j - w_j^0)^2, \qquad (8.6)$

where $\hat{F}_{jj}$ is the diagonal Fisher matrix, estimated as an average over the target dataset using the pre-trained network:

$\hat{F}_{jj} = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} f_k(x_i; w^0) \left[ \frac{\partial}{\partial w_j} \log f_k(x_i; w^0) \right]^2. \qquad (8.7)$
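The L2-SP penalty of Eq. (8.5), and its Fisher-weighted variant of Eq. (8.6), can be sketched as follows. This is a hedged sketch (the helper name and the fallback to plain L2 for new parameters are our own choices, not the authors' code): snapshot the pre-trained weights once, then penalize the squared distance to them during fine-tuning.

```python
import torch
import torch.nn as nn

def l2_sp_penalty(model, start_point, alpha=0.01, fisher=None):
    """L2-SP (Eq. 8.5); with `fisher`, the weighted L2-SP-Fisher of Eq. (8.6).

    start_point: dict of parameter name -> detached pre-trained tensor (w^0).
    fisher: optional dict of per-parameter importance weights (F_hat_jj).
    """
    penalty = torch.zeros(())
    for name, w in model.named_parameters():
        if name in start_point:
            diff2 = (w - start_point[name]) ** 2
            if fisher is not None:
                diff2 = fisher[name] * diff2   # importance-weighted (EWC-style)
            penalty = penalty + diff2.sum()
        else:                                   # new parameters: plain L2
            penalty = penalty + (w ** 2).sum()
    return alpha / 2 * penalty

model = nn.Linear(4, 2)  # stand-in for a pre-trained network
w0 = {n: p.detach().clone() for n, p in model.named_parameters()}
# During fine-tuning: loss = task_loss + l2_sp_penalty(model, w0)
```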

Beyond explicit distance regularization, researchers have sought other techniques. Li et al. (2019) proposed DELTA (DEep Learning TrAnsfer) to regularize the feature maps of different layers between the pre-trained and the fine-tuned network:

$R(w) = \sum_{j=1}^{N} \|FM_j(w, x) - FM_j(w^0, x)\|_2^2, \qquad (8.8)$

where $FM_j(w, x)$ denotes the feature map of input $x$ at layer $j$. Chen et al. (2019b) proposed the batch spectral shrinkage (BSS) regularization to constrain the $k$ smallest singular values of the features in the fine-tuned network:

$R_{\text{BSS}} = \sum_{i=1}^{k} \sigma_{-i}^2, \qquad (8.9)$

where $\sigma_{-i}$ denotes the $i$-th smallest singular value. Wan et al. (2019) then proposed a training strategy for fine-tuning based on gradient distance: if we constrain the difference between the cross-entropy loss gradient and the L2 gradient, the model can be fine-tuned better. Later, Li et al. (2020) showed that simply re-initializing the fully connected classification layers during fine-tuning yields better results. It is often the case that the teacher and student datasets have totally different categories, i.e., different semantic spaces. To explicitly regularize the different semantics between the pre-trained network and the downstream task, You et al. (2020) proposed a simple strategy called co-tuning to bridge the relation between the semantic spaces:

$R_{\text{co-tuning}} = \ell(\mathcal{L}_{\text{CE}}, P(y_t \mid y_s)), \qquad (8.10)$

where $y_s$ and $y_t$ denote the student and teacher labels, respectively; the goal is to explicitly regularize their relation using a loss function $\ell$.
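Among these regularizers, BSS (Eq. (8.9)) is easy to sketch concretely (a hedged sketch; the helper name is ours): take the singular values of the batch feature matrix and penalize the squares of the $k$ smallest.

```python
import torch

def bss_penalty(features, k=1):
    """features: (batch, dim) feature matrix from the fine-tuned backbone."""
    s = torch.linalg.svdvals(features)   # singular values, descending order
    return (s[-k:] ** 2).sum()           # shrink the k smallest

feats = torch.diag(torch.tensor([3.0, 2.0, 1.0]))
penalty = bss_penalty(feats, k=1)        # smallest singular value is 1.0
```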


Recently, Li and Zhang (2021) proposed to regularize the distance between the pre-trained and fine-tuned networks at each layer with a different importance. They argued that each layer of a deep network has a different importance (refer to the previous sections of this chapter), so each needs a different distance threshold $\gamma^{i-1}$:

$\hat{W} \leftarrow \arg\min \hat{\mathcal{L}}^{(t)}(f_w) \quad \text{s.t.} \quad \|w^i - w_0^i\|_F \le D \cdot \gamma^{i-1}, \ \forall i = 1, \dots, L. \qquad (8.11)$

Technically speaking, there is no unified conclusion about which regularization is best for transfer learning. In real experiments, we need to test several algorithms and select the most appropriate one for our application.

8.4 Pre-Trained Models for Feature Extraction

Fine-tuning helps not only deep neural networks but also traditional machine learning models. We can use pre-trained networks to extract better features than heuristic feature engineering, and then use these features to train machine learning models. For instance, traditional image features such as SIFT and SURF were prevalent before deep learning. In 2014, researchers proposed the DeCAF method (Donahue et al., 2014) to use deep networks directly for feature extraction, which led to better performance than traditional image features. Other researchers used convolutional neural networks for feature extraction and then fed these features to support vector machines (Razavian et al., 2014), which also improved performance.

Surprisingly, the combination of deep features and traditional classifiers can sometimes obtain even better performance than end-to-end deep transfer learning (introduced in the next chapter). Wang et al. (2019) proposed such an idea, called easy transfer learning (EasyTL). EasyTL first uses a network fine-tuned on the source domain to extract high-level features, then performs feature transformation and classifier building. Compared to deep transfer learning, EasyTL is a two-stage approach: extract deep features using an existing deep network such as ResNet (He et al., 2016), then build its own classifier. For classifier building, EasyTL exploits the intra-domain structure of the two domains and formulates label assignment as a linear program, which is very efficient in practice. EasyTL achieved remarkable performance, as shown in Fig. 8.5. Following Wang et al. (2019), we summarize how to use pre-trained models in modern transfer learning applications:

• Option 1: directly apply the pre-trained model to the new task.
• Option 2: pre-train and fine-tune.


Fig. 8.5 Comparison between EasyTL and other methods (Wang et al., 2019)

• Option 3: use the pre-trained model as a feature extractor and then build classifiers on these features.

These options provide guidance on how to better harness the power of pre-trained models.
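Option 3 can be sketched as follows, in the spirit of Razavian et al. (2014). This is a hedged sketch: a tiny MLP stands in for a pre-trained backbone such as ResNet, and the data is random; the point is the pattern of freezing the backbone, extracting features without gradients, and fitting a traditional classifier on top.

```python
import torch
import torch.nn as nn
from sklearn.svm import SVC

torch.manual_seed(0)

backbone = nn.Sequential(nn.Linear(8, 16), nn.ReLU())  # stand-in backbone
backbone.eval()                                        # feature-extraction mode

x = torch.randn(64, 8)                 # stand-in inputs
y = torch.randint(0, 2, (64,))         # stand-in labels
with torch.no_grad():                  # no gradients: the backbone is frozen
    feats = backbone(x).numpy()        # deep features

clf = SVC().fit(feats, y.numpy())      # traditional classifier on deep features
preds = clf.predict(feats)
```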

8.5 Learning to Pre-Train and Fine-Tune

Until now, an open problem remains: how many layers should we transfer for general tasks and how many layers should we fine-tune for downstream tasks? These correspond to two fundamental problems: what to transfer and where to transfer. "What" answers which layer of the source network transfers to which layer of the target network; "where" answers how much knowledge should be transferred. Currently, we can only run manual experiments to find the answers. Zhang et al. (2019) studied these problems for convolutional and recurrent neural networks, providing a fine-grained analysis of when to fix, fine-tune, or randomly initialize the layers of a deep network. Later, Jang et al. (2019) further studied this problem with the help of meta-learning (meta-learning will be introduced in Sect. 14.3). We introduce their work in detail.

Let $S^m(x)$ denote the $m$-th representation of the pre-trained source network and $T_\theta$ the target network to be learned, so that $T_\theta^n(x)$ denotes the $n$-th feature representation of the target network, where $\theta$ is the learnable parameters. The objective to learn what and where to transfer is formulated as:

$\left\| r_\theta(T_\theta^n(x)) - S^m(x) \right\|_2^2, \qquad (8.12)$

where $r_\theta$ is a linear transformation.


Generally speaking, the above equation shows how to transfer from the $m$-th layer of the source network to the $n$-th layer of the target network. The authors designed two weight matrices and a meta-network to learn this process. Considering that not all features of the source network benefit the target task, they designed a channel-level objective:

$\mathcal{L}_{\text{channel}}^{m,n}(\theta \mid x, w^{m,n}) = \frac{1}{HW} \sum_c w_c^{m,n} \sum_{i,j} \left( r_\theta(T_\theta^n(x))_{c,i,j} - S^m(x)_{c,i,j} \right)^2, \qquad (8.13)$

where $H \times W$ is the size of a feature channel and $w_c^{m,n}$ is the channel weight to be learned, with $\sum_c w_c^{m,n} = 1$. They then determined whether transferring from the $m$-th layer to the $n$-th layer is beneficial, designing a weight $\lambda^{m,n} > 0$ as an indicator for this transfer:

$\lambda^{m,n} = g_\phi^{m,n}\left(S^m(x)\right). \qquad (8.14)$

Integrating the above two equations gives the transfer loss:

$\mathcal{L}_{\text{channel}}(\theta \mid x, \phi) = \sum_{(m,n) \in \mathcal{C}} \lambda^{m,n} \mathcal{L}_{\text{channel}}^{m,n}(\theta \mid x, w^{m,n}). \qquad (8.15)$

The network training loss is:

$\mathcal{L}_{\text{total}}(\theta \mid x, y, \phi) = \mathcal{L}_{\text{ce}}(\theta \mid x, y) + \beta \mathcal{L}_{\text{channel}}(\theta \mid x, \phi), \qquad (8.16)$

where $\beta > 0$ is a trade-off hyperparameter. The authors designed three transfer strategies:

• Single: transfer from the last layer of the source network to a specific layer of the target network.
• One-to-one: each layer before a pooling layer transfers to a specific layer of the target network.
• All-to-all: the $m$-th layers of the source network transfer to the $n$-th layers of the target network.

Their experiments on these settings show that their method is significantly better than existing methods, with about 10% improvement. Determining which layers to fix or fine-tune is still an open research question, and we hope there will be more efforts on this topic. It is also an open question how to determine the best learning hyperparameters, such as the learning rate, for the fine-tuned layers.
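The layer-matching objective of Eq. (8.12) can be sketched as follows (a hedged sketch with made-up dimensions, not the authors' code): a learnable linear map $r_\theta$ aligns the target network's layer-$n$ features with the source network's layer-$m$ features.

```python
import torch
import torch.nn as nn

class LayerMatch(nn.Module):
    """|| r_theta(T^n(x)) - S^m(x) ||_2^2, averaged over the batch."""
    def __init__(self, n_dim, m_dim):
        super().__init__()
        self.r = nn.Linear(n_dim, m_dim, bias=False)  # linear transformation r_theta

    def forward(self, t_feat, s_feat):
        return ((self.r(t_feat) - s_feat) ** 2).sum(dim=1).mean()

match = LayerMatch(n_dim=64, m_dim=32)
t_feat = torch.randn(8, 64)    # layer-n features of the target network
s_feat = torch.randn(8, 32)    # layer-m features of the source network
loss = match(t_feat, s_feat)   # add to the task loss during fine-tuning
```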


8.6 Practice

In this section, we implement pre-training and fine-tuning using the PyTorch framework. We define a function called finetune, which takes a model and performs fine-tuning on the target domain. The code is as follows; the complete code can be found at this link: https://github.com/jindongwang/tlbook-code/tree/main/chap08_pretrain_finetune.

Pre-training and fine-tuning function

```python
def finetune(model, dataloaders, optimizer):
    since = time.time()
    best_acc = 0
    criterion = nn.CrossEntropyLoss()
    stop = 0
    for epoch in range(1, args.n_epoch + 1):
        stop += 1
        # You can uncomment this line for scheduling the learning rate
        # lr_schedule(optimizer, epoch)
        for phase in ['src', 'val', 'tar']:
            if phase == 'src':
                model.train()
            else:
                model.eval()
            total_loss, correct = 0, 0
            for inputs, labels in dataloaders[phase]:
                inputs, labels = inputs.to(DEVICE), labels.to(DEVICE)
                optimizer.zero_grad()
                with torch.set_grad_enabled(phase == 'src'):
                    outputs = model(inputs)
                    loss = criterion(outputs, labels)
                preds = torch.max(outputs, 1)[1]
                if phase == 'src':
                    loss.backward()
                    optimizer.step()
                total_loss += loss.item() * inputs.size(0)
                correct += torch.sum(preds == labels.data)
            epoch_loss = total_loss / len(dataloaders[phase].dataset)
            epoch_acc = correct.double() / len(dataloaders[phase].dataset)
            print(f'Epoch: [{epoch:02d}/{args.n_epoch:02d}]---{phase}, '
                  f'loss: {epoch_loss:.6f}, acc: {epoch_acc:.4f}')
            if phase == 'val' and epoch_acc > best_acc:
                stop = 0
                best_acc = epoch_acc
                torch.save(model.state_dict(), 'model.pkl')
        if stop >= args.early_stop:
            break
        print()
    model.load_state_dict(torch.load('model.pkl'))
    acc_test = test(model, dataloaders['tar'])
    time_pass = time.time() - since
    print(f'Training complete in {time_pass // 60:.0f}m {time_pass % 60:.0f}s')
    return model, acc_test
```

Fig. 8.6 Results of pre-training and fine-tuning

Note that the model can be any pre-trained model, such as AlexNet or ResNet. Figure 8.6 shows the results of pre-training and fine-tuning. We see that the result on the amazon to webcam task is 72.8%. Note that this result is just another transfer learning baseline comparable to those of previous sections, since they all use the pre-trained features from this section. We can also extract features using the following code:

Extract features using deep learning

```python
class FeatureExtractor(nn.Module):
    def __init__(self, model, extracted_layers):
        super(FeatureExtractor, self).__init__()
        self.model = model._modules['module'] if type(model) == torch.nn.DataParallel else model
        self.extracted_layers = extracted_layers

    def forward(self, x):
        outputs = []
        for name, module in self.model._modules.items():
            if name == "fc":  # flatten before the fully connected layer
                x = x.view(x.size(0), -1)
            x = module(x)
            if name in self.extracted_layers:
                outputs.append(x)
        return outputs


def extract_feature(model, dataloader, save_path, load_from_disk=True, model_path=''):
    if load_from_disk:
        model = models.Network(base_net=args.model_name, n_class=args.num_class)
        model.load_state_dict(torch.load(model_path))
        model = model.to(DEVICE)
    model.eval()
    correct = 0
    fea_all = torch.zeros(1, 1 + model.base_network.output_num()).to(DEVICE)
    with torch.no_grad():
        for inputs, labels in dataloader:
            inputs, labels = inputs.to(DEVICE), labels.to(DEVICE)
            feas = model.get_features(inputs)
            # Keep the labels as an extra column next to the features.
            labels_col = labels.view(labels.size(0), 1).float()
            x = torch.cat((feas, labels_col), dim=1)
            fea_all = torch.cat((fea_all, x), dim=0)
            outputs = model(inputs)
            preds = torch.max(outputs, 1)[1]
            correct += torch.sum(preds == labels)
    test_acc = correct.double() / len(dataloader.dataset)
    fea_numpy = fea_all.cpu().numpy()
    np.savetxt(save_path, fea_numpy[1:], fmt='%.6f', delimiter=',')
    print(f'Test acc: {test_acc:.4f}')
```

Note that fine-tuning is only a baseline method; in the next chapters, we will improve its results using different strategies. Additionally, fine-tuning provides a flexible way to extract deep features, which can also serve as inputs to traditional (non-deep) learning methods, just as in the previous chapters.

8.7 Summary

Pre-training methods have been widely applied to computer vision, natural language processing, and speech recognition tasks. For instance, we often adopt an ImageNet pre-trained network as the backbone for computer vision tasks and then fine-tune it on our downstream data; for machine translation tasks, we often adopt BERT as the pre-trained model and then fine-tune it on our own task. Pre-training is extremely easy to understand and use, with remarkable performance.

Note that pre-training methods are not the only type of model-based transfer learning. Before deep learning, there were several works on model-based transfer. Zhao et al. (2011) proposed TransEMDT, which builds a decision tree on labeled data for robust activity recognition and then uses K-means on the unlabeled data to find the most appropriate parameters. Deng et al. (2014) used an extreme learning machine for a similar job. Pan et al. (2008) used hidden Markov models to study indoor localization across different devices and environments. Other researchers studied support vector machines (Nater et al., 2011; Li et al., 2012) for model-based transfer learning. These methods assume that the SVM weight vector $w$ can be split into two parts: $w = w_0 + v$, where $w_0$ is the common part and $v$ is the task-specific part. Interested readers are encouraged to read more articles in this area.

References

Biesialska, M., Biesialska, K., and Costa-jussà, M. R. (2020). Continual lifelong learning in natural language processing: A survey. arXiv preprint arXiv:2012.09823.
Chen, C., Li, O., Tao, D., Barnett, A., Rudin, C., and Su, J. K. (2019a). This looks like that: deep learning for interpretable image recognition. In Advances in Neural Information Processing Systems, pages 8930–8941.
Chen, X., Wang, S., Fu, B., Long, M., and Wang, J. (2019b). Catastrophic forgetting meets negative transfer: Batch spectral shrinkage for safe transfer learning. Advances in Neural Information Processing Systems, 32:1908–1918.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE.
Deng, W., Zheng, Q., and Wang, Z. (2014). Cross-person activity recognition using reduced kernel extreme learning machine. Neural Networks, 53:1–7.
Donahue, J., Jia, Y., et al. (2014). DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, pages 647–655.
Grachten, M. and Chacón, C. E. C. (2017). Strategies for conceptual change in convolutional neural networks. arXiv preprint arXiv:1711.01634.
He, K., Girshick, R., and Dollár, P. (2019). Rethinking ImageNet pre-training. In Proceedings of the IEEE International Conference on Computer Vision, pages 4918–4927.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
Hendrycks, D., Lee, K., and Mazeika, M. (2019). Using pre-training can improve model robustness and uncertainty. In ICML.
Jang, Y., Lee, H., Hwang, S. J., and Shin, J. (2019). Learning what and where to transfer. arXiv preprint arXiv:1905.05901.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678.
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526.
Kornblith, S., Shlens, J., and Le, Q. V. (2019). Do better ImageNet models transfer better? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2661–2671.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.
Li, D. and Zhang, H. (2021). Improved regularization and robustness for fine-tuning in neural networks. Advances in Neural Information Processing Systems, 34.
Li, H., Shi, Y., Liu, Y., Hauptmann, A. G., and Xiong, Z. (2012). Cross-domain video concept detection: A joint discriminative and generative active learning approach. Expert Systems with Applications, 39(15):12220–12228.
Li, X., Grandvalet, Y., and Davoine, F. (2018). Explicit inductive bias for transfer learning with convolutional networks. In International Conference on Machine Learning, pages 2825–2834. PMLR.
Li, X., Xiong, H., An, H., Xu, C.-Z., and Dou, D. (2020). RIFLE: Backpropagation in depth for deep transfer learning through re-initializing the fully-connected layer. In International Conference on Machine Learning, pages 6010–6019. PMLR.
Li, X., Xiong, H., Wang, H., Rao, Y., Liu, L., and Huan, J. (2019). DELTA: Deep learning transfer using feature map with attention for convolutional networks. arXiv preprint arXiv:1901.09229.
Nater, F., Tommasi, T., Grabner, H., Van Gool, L., and Caputo, B. (2011). Transferring activities: Updating human behavior analysis. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 1737–1744, Barcelona, Spain. IEEE.
Neyshabur, B., Sedghi, H., and Zhang, C. (2020). What is being transferred in transfer learning? arXiv preprint arXiv:2008.11687.
Pan, S. J., Kwok, J. T., and Yang, Q. (2008). Transfer learning via dimensionality reduction. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, volume 8, pages 677–682.
Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., and Wang, B. (2019). Moment matching for multi-source domain adaptation. In ICCV, pages 1406–1415.
Razavian, A. S., Azizpour, H., Sullivan, J., and Carlsson, S. (2014). CNN features off-the-shelf: An astounding baseline for recognition. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on, pages 512–519. IEEE.
Wan, R., Xiong, H., Li, X., Zhu, Z., and Huan, J. (2019). Towards making deep transfer learning never hurt. In 2019 IEEE International Conference on Data Mining (ICDM), pages 578–587. IEEE.
Wang, J., Chen, Y., Yu, H., Huang, M., and Yang, Q. (2019). Easy transfer learning by exploiting intra-domain structures. In 2019 IEEE International Conference on Multimedia and Expo (ICME), pages 1210–1215. IEEE.
Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328.
You, K., Kou, Z., Long, M., and Wang, J. (2020). Co-tuning for transfer learning. Advances in Neural Information Processing Systems, 33.
Zhang, Y., Zhang, Y., and Yang, Q. (2019). Parameter transfer unit for deep neural networks. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD).
Zhao, Z., Chen, Y., Liu, J., Shen, Z., and Liu, M. (2011). Cross-people mobile-phone based activity recognition. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI), volume 11, pages 2545–2550.

Chapter 9

Deep Transfer Learning

With the development of deep learning, more and more researchers adopt deep neural networks for transfer learning. Compared to traditional machine learning, deep transfer learning improves performance on various tasks. In addition, since deep networks can take raw data as inputs, deep transfer learning offers two further benefits: automatic feature extraction and end-to-end training. This chapter introduces the basics of deep transfer learning, including network architectures for deep transfer learning, distribution adaptation, structure adaptation, knowledge distillation, and practice. Figure 9.1 shows the performance of several competitive methods on two public datasets (dataset 1 is Office-Home (Venkateswara et al., 2017) and dataset 2 is ImageCLEF-DA1 ). We can see that the deep transfer learning methods (DAN (Long et al., 2015), DANN (Ganin and Lempitsky, 2015), and DDAN (Wang et al., 2020)) achieve better performance than the non-deep methods (TCA (Pan et al., 2011), GFK (Gong et al., 2012), and CORAL (Sun et al., 2016)). We assume that readers have basic knowledge of deep learning, so we do not introduce it at length here. Note that the pre-training and fine-tuning of the last chapter also belong to deep transfer learning; our main task in this chapter is to introduce methods beyond them. The organization of this chapter is as follows. In Sect. 9.1, we give an overview of deep transfer learning methods. Then, we discuss classic architectures for deep transfer learning in Sect. 9.2. In Sect. 9.3, we introduce deep transfer learning using distribution adaptation. In Sect. 9.4, we introduce structure adaptation for deep transfer learning. Knowledge distillation is introduced in Sect. 9.5. We provide code practice in Sect. 9.6. Finally, we conclude this chapter in Sect. 9.7.

1 https://www.imageclef.org/2014/adaptation.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 J. Wang, Y. Chen, Introduction to Transfer Learning, Machine Learning: Foundations, Methodologies, and Applications, https://doi.org/10.1007/978-981-19-7584-4_9


Fig. 9.1 Results comparison of deep and non-deep transfer learning (accuracy of TCA, GFK, CORAL, DAN, DANN, and DDAN on Dataset 1 and Dataset 2)

9.1 Overview

While pre-training and fine-tuning can save training cost and improve performance, they have disadvantages: they cannot directly handle the situation where training and testing data have different distributions, and they require labels in the target domain to perform fine-tuning, which is often unrealistic in practice. Thus, we need more effective methods for an unlabeled target domain.

We want to emphasize that the deep transfer learning methods in this chapter are not separate from the methods introduced in previous chapters. The main difference is that the learning backbone becomes a deep neural network, while previous chapters did not use deep architectures. This implies that we can borrow knowledge from previous chapters and apply it to deep transfer learning. Take the distribution adaptation methods as an example: many deep learning methods also add an adaptation layer to bridge the distribution gap between the source and target domains (Tzeng et al., 2014; Ganin and Lempitsky, 2015; Wang et al., 2020). With the good representations learned by deep neural networks, these methods achieve even better performance.

Equation (3.9) in Sect. 3.3 is also useful for deep transfer learning. More importantly, we can modify it to use batch data in place of all the training data:

$$f^* = \arg\min_{f \in \mathcal{H}} \frac{1}{B} \sum_{i=1}^{B} \ell(f(\boldsymbol{x}_i), y_i) + \lambda R(B_s, B_t), \qquad (9.1)$$

where $B$ is the batch size in deep learning, $B_s$ and $B_t$ denote the source and target domain batch samples, respectively, and $\lambda$ is the trade-off hyperparameter.


We often adopt mini-batch stochastic gradient descent (SGD) for optimization. The gradient can then be computed as:

$$\nabla_{\Theta} = \frac{\partial \ell(f(\boldsymbol{x}_i), y_i)}{\partial \Theta} + \lambda \frac{\partial R(B_s, B_t)}{\partial \Theta}, \qquad (9.2)$$

where $\Theta$ is the parameter to be learned, which contains the weights and biases of a network: $\Theta = \{W, b\}$.
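To make Eqs. (9.1)–(9.2) concrete, here is a toy sketch of SGD on the combined batch objective, with a one-parameter linear model, a squared-error task loss, and a squared difference between the domain output means standing in for $R$. All of these stand-ins (and the name `sgd_step`) are our own illustrative choices, not the book's:

```python
import numpy as np

def sgd_step(theta, xs, ys, xt, lam=0.1, lr=0.01):
    """One SGD step on (1/B) * sum l(f(x_i), y_i) + lam * R(Bs, Bt).

    Here f(x) = theta * x, l is the squared error, and
    R = (mean f(xs) - mean f(xt))^2 is a toy transfer regularizer.
    """
    # gradient of the supervised term: d/dtheta mean (theta*x - y)^2
    grad_task = np.mean(2 * (theta * xs - ys) * xs)
    # gradient of the regularizer R = theta^2 * (mean xs - mean xt)^2
    mean_gap = np.mean(xs) - np.mean(xt)
    grad_reg = 2 * theta * mean_gap ** 2
    return theta - lr * (grad_task + lam * grad_reg)

xs = np.array([1.0, 2.0, 3.0])   # source batch
ys = np.array([2.0, 4.0, 6.0])   # labels: y = 2x
xt = np.array([1.5, 2.5, 3.5])   # unlabeled target batch
theta = 0.0
for _ in range(200):
    theta = sgd_step(theta, xs, ys, xt)
```

The regularizer slightly pulls `theta` away from the purely supervised optimum of 2.0, illustrating the trade-off controlled by $\lambda$.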

9.2 Network Architectures for Deep Transfer Learning

A classic deep neural network has the structure shown in Fig. 9.2: the input data, three-dimensional here, is mapped to the classification results after feature transformations through two layers. It is not surprising that the network in Fig. 9.2 cannot be directly used for deep transfer learning, for the following reasons:

• The network has only one input source, so there is no way to feed in the target domain data.
• It is difficult to compute $R(\cdot, \cdot)$ of Eq. (9.1) in this network.

Therefore, we need to modify such a network structure to develop transfer learning architectures.

9.2.1 Single-Stream Architecture

A single-stream architecture refers to a network with only one branch, which can be used directly for transfer learning. In fact, the pre-training and fine-tuning methods introduced in the last chapter are based on the single-stream architecture. This is reasonable: when performing fine-tuning, we only have data from the downstream task, so one stream in the network is sufficient.

Fig. 9.2 Illustration of deep networks (input layer, hidden layers, output layer)


Since we often treat the pre-trained network as a black-box model, important structural information and possible operations for transfer learning are constrained. Therefore, Hinton et al. (2015) proposed knowledge distillation to further exploit structural information and perform cross-architecture transfer learning. Knowledge distillation designs a teacher–student network to perform transfer learning, giving the network more flexibility and better performance. We will introduce knowledge distillation in Sect. 9.5.

9.2.2 Two-Stream Architecture

In addition to the single-stream architecture, we often adopt the two-stream architecture. It is not difficult to design such an architecture, as shown in Fig. 9.3. We call it two-stream because it takes inputs from both the source and target domains. In Fig. 9.3, the input consists of a batch of samples from both domains. After L layers with shared weights, the features are fed into the higher layers, which are the core layers for performing transfer learning. Finally, the network reaches the output layer to finish a forward pass.

In fact, such a two-stream architecture admits different configurations and operation logics. For instance, for the first L layers, we can choose to share some of the layers and fine-tune the rest, depending on the experimental results. We have mentioned several related works for this part in Chap. 8. On the other hand, the biggest difference among most works is the design of the transfer regularization

Fig. 9.3 Two-stream architecture for deep transfer learning

term. Finally, according to whether the target domain has labels, we can design different regularizations; we focus on the most challenging case, where the target domain has no labels.

When performing back-propagation, the network updates its parameters according to Eq. (9.2). Specifically, the network uses the labeled source domain and the predicted labels to compute the supervised loss, and uses the bottleneck layer to compute the transfer loss, then optimizes Eq. (9.2). In real applications, we often adopt different learning rates for the shared and transfer layers, or simply freeze the shared layers and update only the transfer layers. This is because the shared layers often extract low-level features that are common to many tasks and are therefore highly transferable (Tzeng et al., 2014; Ganin and Lempitsky, 2015).

In addition, designing a network from scratch is non-trivial, since we have to determine the structure, the number of layers, the neurons of each layer, the activation functions, etc. Luckily, for general tasks we can borrow well-established structures from existing networks to replace the backbone in Fig. 9.3. For instance, the authors of deep domain confusion (DDC) (Tzeng et al., 2014) utilized AlexNet as the backbone to perform transfer learning. Generally speaking, we can adopt classic networks such as AlexNet, ResNet, VGG, and DenseNet for most computer vision tasks. For natural language processing tasks, we can use RNN/LSTM (Hochreiter and Schmidhuber, 1997; Du et al., 2021) and Transformer (Vaswani et al., 2017) structures.

9.3 Distribution Adaptation in Deep Transfer Learning

Deep networks provide a feasible framework for distribution adaptation. Early researchers proposed a domain adaptive neural network that directly integrates the MMD (Borgwardt et al., 2006) loss into a three-layer neural network (Ghifary et al., 2014). However, that network was rather shallow and thus did not achieve remarkable performance. Tzeng et al. (2014) proposed deep domain confusion (DDC) to handle distribution adaptation in deep networks. DDC adopted AlexNet (Krizhevsky et al., 2012) for adaptation: it fixed the first seven layers of AlexNet and added an adaptation layer after the feature extraction layers. DDC plugged in the widely adopted MMD to measure the distribution divergence:

$$\mathcal{L}_{\mathrm{ddc}} = \mathcal{L}_c(\mathcal{D}_s) + \lambda \mathcal{L}_{\mathrm{mmd}}(\mathcal{D}_s, \mathcal{D}_t), \qquad (9.3)$$

where $\mathcal{L}_c$ is the supervised classification loss on the source domain and $\mathcal{L}_{\mathrm{mmd}}$ is the MMD loss measuring the distribution divergence, calculated according to Sect. 5.2. Note that this equation also corresponds to the unified framework in Sect. 3.3.
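Eq. (9.3) uses MMD as the transfer term. With a linear kernel, MMD reduces to the squared distance between the mean feature embeddings of the two batches; a minimal sketch of that special case (DDC itself uses a kernelized MMD estimator, and `linear_mmd` is our illustrative helper, not DDC's code):

```python
import numpy as np

def linear_mmd(Xs, Xt):
    """Linear-kernel MMD: squared distance between batch feature means."""
    delta = Xs.mean(axis=0) - Xt.mean(axis=0)
    return float(delta @ delta)

rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(64, 16))   # source batch features
Xt = rng.normal(0.5, 1.0, size=(64, 16))   # target batch, shifted distribution
same = linear_mmd(Xs, Xs[::-1])            # same samples -> ~0
shifted = linear_mmd(Xs, Xt)               # shifted domains -> clearly positive
```

Minimizing such a term on adaptation-layer features pushes the source and target feature distributions toward each other.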


Then, Long et al. (2015) proposed deep adaptation networks (DAN), which replaced the original single-kernel MMD with multi-kernel MMD and adapted the features of three layers instead of only the last feature extraction layer. Wang et al. (2018) applied such an algorithm to cross-domain activity recognition. In recent years, other works on conditional, joint, and dynamic distribution adaptation have achieved even better performance. Zhu et al. (2020) proposed a deep subdomain adaptation network (DSAN) that exploits the predicted probability of each sample to construct soft labels for class-wise adaptation. DSAN proposed a local MMD distance that averages the weighted MMD distances over the C classes:

$$\hat{d}_{\mathrm{LMMD}}(p, q) = \frac{1}{C} \sum_{c=1}^{C} \left\| \sum_{\boldsymbol{x}_i \in \mathcal{D}_s} w_i^{sc} \phi(\boldsymbol{x}_i) - \sum_{\boldsymbol{x}_j \in \mathcal{D}_t} w_j^{tc} \phi(\boldsymbol{x}_j) \right\|_{\mathcal{H}}^2, \qquad (9.4)$$

where $w$ is the class-wise adaptation weight, with $w_i^{sc}$ and $w_j^{tc}$ the weights of class c for the source and target domains, respectively, and $\phi$ is the mapping function following the definition in Sect. 5.2. In fact, the definition of $w$ is quite flexible and easy to extend; for instance, we can use multiple classes to define a subdomain. The weight $w$ is defined as:

$$w_i^{c} = \frac{y_{ic}}{\sum_{(\boldsymbol{x}_j, y_j) \in \mathcal{D}} y_{jc}}, \qquad (9.5)$$

where $y_{ic}$ is the c-th entry of the prediction label vector $\boldsymbol{y}_i$ (i.e., the one-hot encoding vector). The structure of DSAN is shown in Fig. 9.4. The network takes an end-to-end training scheme for optimization:

$$\min_{f} \; \frac{1}{N_s} \sum_{i=1}^{N_s} \mathcal{L}_{\mathrm{cls}}\left(f\left(\boldsymbol{x}_i^s\right), y_i^s\right) + \lambda \hat{d}_{\mathrm{LMMD}}(p, q). \qquad (9.6)$$
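The weights of Eq. (9.5) simply normalize each sample's label (or predicted probability) vector column-wise over the batch, so that the weights of every class present in the batch sum to one. A minimal sketch (`class_weights` is a hypothetical helper name):

```python
import numpy as np

def class_weights(label_vecs):
    """w_i^c = y_ic / sum_j y_jc, per Eq. (9.5).

    label_vecs: (B, C) one-hot labels (source) or predicted
    probabilities (target). Returns a (B, C) weight matrix whose
    columns sum to 1 for every class present in the batch.
    """
    col_sum = label_vecs.sum(axis=0, keepdims=True)
    col_sum[col_sum == 0] = 1.0  # avoid division by zero for absent classes
    return label_vecs / col_sum

labels = np.array([0, 0, 1, 2])
onehot = np.eye(3)[labels]
W = class_weights(onehot)  # the two class-0 samples each get weight 0.5
```

For the target domain, where labels are unavailable, DSAN feeds the softmax probabilities in place of one-hot labels, which is exactly what `cal_weight` does in the practice code of Sect. 9.6.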

DSAN is easy to implement and achieves great performance on public datasets. On the other hand, researchers proposed the joint adaptation network (JAN) (Long et al., 2017), which uses the tensor products of multiple layers to define

Fig. 9.4 Deep subdomain adaptation network (DSAN) (Zhu et al., 2020)


Fig. 9.5 Deep dynamic adaptation network (DDAN) (Wang et al., 2020)

the embedding representation of joint probabilities. Wang et al. (2020) proposed a deep dynamic adaptation network (DDAN) that applies dynamic distribution adaptation to deep networks, as shown in Fig. 9.5. DDAN adopted the same architecture as DDC and JAN and improved the performance of deep transfer learning.

9.4 Structure Adaptation for Deep Transfer Learning

While the distribution adaptation methods in deep transfer learning have made great progress, we can also adapt the structure of deep networks for knowledge transfer.

9.4.1 Batch Normalization

Batch normalization (BN) (Ioffe and Szegedy, 2015) is widely adopted in deep neural networks. BN normalizes the input data to have zero mean and unit variance, which reduces internal covariate shift and speeds up convergence. Let $\mu$ and $\sigma^2$ denote the mean and variance of the inputs. For a batch of data $B = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^{m}$, the BN transform is:

$$\hat{x}^{(j)} = \frac{x^{(j)} - \mu^{(j)}}{\sqrt{\sigma^{2(j)} + \epsilon}}, \qquad y^{(j)} = \gamma^{(j)} \hat{x}^{(j)} + \beta^{(j)}, \qquad (9.7)$$

where $j$ is the channel subscript and $\epsilon$ is a small constant for numerical stability. The mean and variance are computed as:

$$\mu^{(j)} = \frac{1}{m} \sum_{i=1}^{m} x_i^{(j)}, \qquad \sigma^{2(j)} = \frac{1}{m} \sum_{i=1}^{m} \left(x_i^{(j)} - \mu^{(j)}\right)^2, \qquad (9.8)$$

where $\beta^{(j)}$ and $\gamma^{(j)}$ are learnable parameters.
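Eqs. (9.7)–(9.8) can be written down directly; a minimal per-channel sketch with fixed $\gamma = 1$ and $\beta = 0$ (in practice they are learned):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (m, C) batch; normalize each channel, then scale and shift."""
    mu = x.mean(axis=0)                    # Eq. (9.8), per channel
    var = ((x - mu) ** 2).mean(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)  # Eq. (9.7)
    return gamma * x_hat + beta

rng = np.random.default_rng(1)
x = rng.normal(3.0, 2.0, size=(128, 4))  # batch with mean 3, std 2
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

After the transform, each channel has (approximately) zero mean and unit variance regardless of the input statistics.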


Batch normalization can be used for transfer learning. Li et al. (2018) proposed adaptive batch normalization (AdaBN) to extend BN to domain adaptation problems. AdaBN first performs BN on the source domain and then recomputes the BN statistics on the target domain, making batch normalization domain-adaptive. AdaBN achieves great performance on domain adaptation tasks. Maria Carlucci et al. (2017) designed an automatic domain alignment layer (AutoDIAL) with a similar idea to AdaBN, which uses a trade-off hyperparameter $\alpha$ to control the alignment results. The normalization operation makes domain adaptation simple and effective. For instance, Cui et al. (2020) proposed the batch nuclear norm (BNM) method to boost the performance of domain adaptation by simply regularizing the batch nuclear norm of the data.
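The essence of AdaBN is that the network weights stay fixed and only the BN statistics are recomputed from target data. A minimal numpy sketch of that idea (the data, the `bn_apply` helper, and all values are our own illustration):

```python
import numpy as np

def bn_apply(x, mu, var, gamma, beta, eps=1e-5):
    """Apply the BN transform of Eq. (9.7) with given statistics."""
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(2)
# BN statistics estimated on the source domain during training
source = rng.normal(0.0, 1.0, size=(256, 8))
mu_s, var_s = source.mean(axis=0), source.var(axis=0)

# Target domain has shifted statistics
target = rng.normal(2.0, 3.0, size=(256, 8))

# Plain BN at test time: reuse source statistics -> activations stay shifted
shifted = bn_apply(target, mu_s, var_s, 1.0, 0.0)

# AdaBN: recompute the statistics on the target domain -> re-centered
mu_t, var_t = target.mean(axis=0), target.var(axis=0)
aligned = bn_apply(target, mu_t, var_t, 1.0, 0.0)
```

The `aligned` activations are zero-mean again, while `shifted` retains the domain gap, which is precisely what AdaBN removes.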

9.4.2 Multi-view Structure

Researchers have also explored other structure adaptation methods. Zhu et al. (2019) proposed a multi-representation adaptation network (MRAN). They pointed out that most domain adaptation methods adopt a single architecture, which cannot extract multi-view information. For instance, consider the raw image in Fig. 9.6a: if we extract features with a single architecture, we can only obtain partial information, such as in Fig. 9.6b–d. To capture more information for transfer learning, MRAN uses a mixed structure to capture different levels of information and then performs alignment to reduce the distribution divergence. The structure of MRAN is illustrated in Fig. 9.7. MRAN adopts a CMMD loss for distribution matching, computed as:

$$\hat{d}_{\mathrm{CMMD}}(X_s, X_t) = \frac{1}{C} \sum_{c=1}^{C} \left\| \frac{1}{n_s^{(c)}} \sum_{\boldsymbol{x}_i^{s(c)} \in \mathcal{D}_s^{(c)}} \phi\left(\boldsymbol{x}_i^{s(c)}\right) - \frac{1}{n_t^{(c)}} \sum_{\boldsymbol{x}_j^{t(c)} \in \mathcal{D}_t^{(c)}} \phi\left(\boldsymbol{x}_j^{t(c)}\right) \right\|_{\mathcal{H}}^2, \qquad (9.9)$$

Fig. 9.6 Different features of one image. (a) Original. (b) Saturation. (c) Brightness. (d) Hue


Fig. 9.7 Multi-representation adaptation networks (MRAN) (Zhu et al., 2019)

where $\mathcal{D}_s^{(c)}$ and $\mathcal{D}_t^{(c)}$ denote the samples belonging to class c in the source and target domains, respectively, and $n_s^{(c)}$ and $n_t^{(c)}$ are their respective numbers of samples. By capturing different views of the features, MRAN achieved great performance on different benchmarks.
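The class-wise matching of Eq. (9.9) can be sketched with an identity feature map $\phi(x) = x$ standing in for the learned deep features; in practice, the target labels would be pseudo-labels predicted by the network (the `cmmd` helper and the toy data are our own illustration):

```python
import numpy as np

def cmmd(Xs, ys, Xt, yt, num_class):
    """Class-conditional MMD of Eq. (9.9) with phi(x) = x as a placeholder."""
    total = 0.0
    for c in range(num_class):
        # mean embedding of class c in each domain
        ms = Xs[ys == c].mean(axis=0)
        mt = Xt[yt == c].mean(axis=0)
        total += float((ms - mt) @ (ms - mt))
    return total / num_class

rng = np.random.default_rng(3)
# two classes per domain; the target classes are slightly shifted
Xs = np.concatenate([rng.normal(0, 1, (50, 4)), rng.normal(5, 1, (50, 4))])
ys = np.array([0] * 50 + [1] * 50)
Xt = np.concatenate([rng.normal(0.5, 1, (50, 4)), rng.normal(5.5, 1, (50, 4))])
yt = np.array([0] * 50 + [1] * 50)
d = cmmd(Xs, ys, Xt, yt, num_class=2)
```

Identical class-conditional distributions give zero loss, while per-class shifts produce a positive value that training can then minimize.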

9.4.3 Disentanglement

We now introduce disentanglement methods for deep transfer learning. Disentanglement decomposes complex features so that we can better understand the structure and importance of each part. Bousmalis et al. (2016) proposed domain separation networks (DSN) to perform feature disentanglement in transfer learning. The authors argued that the source and target features are composed of two parts: common parts, which learn general, domain-invariant features, and specific parts, which are specific to certain domains. The loss function of DSN is:

$$\mathcal{L} = \mathcal{L}_{\mathrm{task}} + \alpha \mathcal{L}_{\mathrm{recon}} + \beta \mathcal{L}_{\mathrm{diff}} + \gamma \mathcal{L}_{\mathrm{sim}}, \qquad (9.10)$$

In addition to the supervised loss $\mathcal{L}_{\mathrm{task}}$ on the source domain, there are three other losses:

• $\mathcal{L}_{\mathrm{recon}}$: the reconstruction loss, ensuring that the features remain useful after encoding
• $\mathcal{L}_{\mathrm{diff}}$: the difference loss between the common and specific features
• $\mathcal{L}_{\mathrm{sim}}$: the similarity loss between the source and target common features

Figure 9.8 shows the structure of DSN. Through disentanglement, DSN learns more common features across domains to ensure the success of transfer learning; at the same time, it can leverage the specific features to enhance performance on specific tasks.
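To our understanding, the DSN paper instantiates the difference loss as a soft orthogonality penalty between the common (shared) and specific (private) feature matrices, $\|H_c^\top H_p\|_F^2$; a minimal sketch under that reading, with made-up feature values:

```python
import numpy as np

def diff_loss(Hc, Hp):
    """Soft orthogonality penalty ||Hc^T Hp||_F^2 between the
    common and specific feature matrices (rows are samples)."""
    M = Hc.T @ Hp
    return float((M ** 2).sum())

# Common features live in one direction, private features in an orthogonal one
Hc = np.array([[1.0, 0.0], [1.0, 0.0]])
Hp = np.array([[0.0, 1.0], [0.0, -1.0]])
orthogonal = diff_loss(Hc, Hp)   # orthogonal subspaces incur no penalty
redundant = diff_loss(Hc, Hc)    # identical features are penalized
```

Driving this penalty to zero encourages the two encoders to capture non-overlapping information, which is the point of the disentanglement.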


Fig. 9.8 Illustration of domain separation networks (DSN) (Bousmalis et al., 2016)

Fig. 9.9 Illustration of knowledge distillation: a teacher network distills knowledge from the data into a student network through feature- or output-based distillation methods

9.5 Knowledge Distillation

Knowledge distillation (KD) was proposed by Hinton et al. (2015) for model compression and cross-network knowledge transfer. The core idea of KD is shown in Fig. 9.9. The idea is that the knowledge contained in a complex model (the teacher network) can be distilled into a small network (the student network). The student network is simpler than the teacher network but achieves very close performance; thus, knowledge distillation can be used for model compression. Knowledge distillation first trains the complex teacher network and then trains the student network so that their outputs are close. Let p and q denote the predictions of the two networks; the objective of knowledge distillation can be formulated as:

$$\mathcal{L}_{\mathrm{KD}} = \mathcal{L}(y, p) + \lambda \mathcal{L}(p, q), \qquad (9.11)$$


where y is the actual label and $\mathcal{L}(\cdot, \cdot)$ is a loss function such as the cross-entropy loss. The first term denotes the training error of the student network and the second term denotes the closeness between the student and teacher outputs. If we directly treat the network output (i.e., the softmax probability) as the distillation target, the network may not be easy to train. Thus, the authors proposed a temperature softmax function:

$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}, \qquad (9.12)$$

where $z_i$ is the network logit and T is the temperature. When $T = 1$, the equation reduces to the original softmax function; a larger T makes the outputs smoother.

Knowledge distillation is simple and effective, and thus has been widely used in different tasks. Researchers have extended knowledge distillation to machine translation (Zhou et al., 2020), natural language processing (Hu et al., 2016), graph neural networks (Yang et al., 2020), multi-task learning (Kundu et al., 2019), and zero-shot learning (Nayak et al., 2019). Interested readers can refer to Gou et al. (2020) for more related works.
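A quick sketch of Eq. (9.12), showing how a larger temperature T smooths the output distribution (the logits are made-up values):

```python
import math

def softmax_T(logits, T=1.0):
    """Temperature softmax of Eq. (9.12)."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [5.0, 2.0, 1.0]
p1 = softmax_T(logits, T=1.0)  # sharp, close to a one-hot distribution
p5 = softmax_T(logits, T=5.0)  # smoother: the "dark knowledge" in the
                               # small logits becomes visible
```

This is why distillation typically uses T > 1: the smoothed teacher outputs carry more information about inter-class similarity than hard labels do.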

9.6 Practice

In this practice section, we implement three deep transfer learning methods using the same architecture: deep domain confusion (DDC) (Tzeng et al., 2014), deep CORAL (DCORAL) (Sun and Saenko, 2016), and the deep subdomain adaptation network (DSAN) (Zhu et al., 2020). The complete code can be found at this link: https://github.com/jindongwang/tlbook-code/tree/main/chap09_deeptransfer.

The core of the implementation is correctly computing the MMD, CORAL, and LMMD losses, which we then integrate with the deep network. This is the basis for implementing deep domain adaptation, and we encourage readers to pay special attention to this section. Note that the code in this section is only for demonstration: we do not add any tricks or training skills, so the results are not optimal. We have built an all-in-one package called DeepDA2 for easy and unified usage of deep transfer learning methods.

2 https://github.com/jindongwang/transferlearning/tree/master/code/DeepDA.


9.6.1 Network Structure

We first define the network structure, which inherits from an existing pre-trained model such as AlexNet or ResNet. Different from fine-tuning, we need to modify the output layer and the training objective to obtain a two-stream architecture. Our network is constructed as follows.

Deep transfer learning network

import torch
import torch.nn as nn
import backbone
from coral import CORAL
from mmd import MMDLoss
from lmmd import LMMDLoss
from adv import AdversarialLoss

class TransferNet(nn.Module):
    def __init__(self, num_class, base_net='resnet50', transfer_loss='mmd',
                 use_bottleneck=True, bottleneck_width=256, width=1024, n_class=31):
        super(TransferNet, self).__init__()
        self.base_network = backbone.network_dict[base_net]()
        self.use_bottleneck = use_bottleneck
        self.transfer_loss = transfer_loss
        self.n_class = num_class
        bottleneck_list = [nn.Linear(self.base_network.output_num(), bottleneck_width),
                           nn.BatchNorm1d(bottleneck_width), nn.ReLU(), nn.Dropout(0.5)]
        self.bottleneck_layer = nn.Sequential(*bottleneck_list)
        classifier_layer_list = [nn.Linear(self.base_network.output_num(), width),
                                 nn.ReLU(), nn.Dropout(0.5), nn.Linear(width, num_class)]
        self.classifier_layer = nn.Sequential(*classifier_layer_list)
        self.bottleneck_layer[0].weight.data.normal_(0, 0.005)
        self.bottleneck_layer[0].bias.data.fill_(0.1)
        for i in range(2):
            self.classifier_layer[i * 3].weight.data.normal_(0, 0.01)
            self.classifier_layer[i * 3].bias.data.fill_(0.0)
        if self.transfer_loss == 'dann':
            self.adv = AdversarialLoss()

The TransferNet class is the network definition, which takes the following inputs:

• num_class: number of target domain classes
• base_net: backbone network such as ResNet
• transfer_loss: transfer loss, such as MMD or CORAL
• use_bottleneck: whether to use a bottleneck layer
• bottleneck_width: bottleneck width
• width: width of the classification layer


In order to compute the transfer loss, we write a forward function that takes the source and target data as inputs and computes the transfer loss. This function, along with the prediction function, is as follows:

Forward function

def forward(self, source, target, source_label):
    source = self.base_network(source)
    target = self.base_network(target)
    source_clf = self.classifier_layer(source)
    target_clf = self.classifier_layer(target)
    if self.use_bottleneck:
        source = self.bottleneck_layer(source)
        target = self.bottleneck_layer(target)
    kwargs = {}
    kwargs['source_label'] = source_label
    kwargs['target_logits'] = torch.nn.functional.softmax(target_clf, dim=1)
    transfer_loss = self.adapt_loss(source, target, self.transfer_loss, **kwargs)
    return source_clf, transfer_loss

def predict(self, x):
    features = self.base_network(x)
    clf = self.classifier_layer(features)
    return clf

def adapt_loss(self, X, Y, adapt_loss, **kwargs):
    """Compute adaptation loss; currently we support mmd, coral, dsan, and dann.

    Arguments:
        X {tensor} -- source matrix
        Y {tensor} -- target matrix
        adapt_loss {string} -- loss type: 'mmd' | 'coral' | 'dsan' | 'dann'

    Returns:
        [tensor] -- adaptation loss tensor
    """
    if adapt_loss == 'mmd':
        loss = MMDLoss()(X, Y)
    elif adapt_loss == 'coral':
        loss = CORAL(X, Y)
    elif adapt_loss == 'dsan':
        loss = LMMDLoss(self.n_class)(
            X, Y, kwargs['source_label'], kwargs['target_logits'])
    elif adapt_loss == 'dann':
        loss = self.adv(X, Y)
    else:
        loss = 0
    return loss


Note that we need to set different learning rates for different parts of the model as follows:

Get parameters

def get_optimizer(self, args):
    params = [
        {'params': self.base_network.parameters()},
        {'params': self.bottleneck_layer.parameters(), 'lr': 10 * args.lr},
        {'params': self.classifier_layer.parameters(), 'lr': 10 * args.lr},
    ]
    if self.transfer_loss == 'dann':
        params.append(
            {'params': self.adv.domain_classifier.parameters(), 'lr': 10 * args.lr})
    optimizer = torch.optim.SGD(
        params, lr=args.lr, momentum=args.momentum, weight_decay=args.decay)
    return optimizer

9.6.2 Loss

We show the computation of the MMD, CORAL, and LMMD losses below.

MMD loss

import torch
import torch.nn as nn

class MMDLoss(nn.Module):
    def __init__(self, kernel_type='rbf', kernel_mul=2.0, kernel_num=5):
        super(MMDLoss, self).__init__()
        self.kernel_num = kernel_num
        self.kernel_mul = kernel_mul
        self.fix_sigma = None
        self.kernel_type = kernel_type

    def guassian_kernel(self, source, target, kernel_mul=2.0, kernel_num=5,
                        fix_sigma=None):
        n_samples = int(source.size()[0]) + int(target.size()[0])
        total = torch.cat([source, target], dim=0)
        total0 = total.unsqueeze(0).expand(
            int(total.size(0)), int(total.size(0)), int(total.size(1)))
        total1 = total.unsqueeze(1).expand(
            int(total.size(0)), int(total.size(0)), int(total.size(1)))
        L2_distance = ((total0 - total1) ** 2).sum(2)
        if fix_sigma:
            bandwidth = fix_sigma
        else:
            bandwidth = torch.sum(L2_distance.data) / (n_samples ** 2 - n_samples)
        bandwidth /= kernel_mul ** (kernel_num // 2)
        bandwidth_list = [bandwidth * (kernel_mul ** i) for i in range(kernel_num)]
        kernel_val = [torch.exp(-L2_distance / bandwidth_temp)
                      for bandwidth_temp in bandwidth_list]
        return sum(kernel_val)

    def linear_mmd2(self, f_of_X, f_of_Y):
        delta = f_of_X.float().mean(0) - f_of_Y.float().mean(0)
        loss = delta.dot(delta.T)
        return loss

    def forward(self, source, target):
        if self.kernel_type == 'linear':
            return self.linear_mmd2(source, target)
        elif self.kernel_type == 'rbf':
            batch_size = int(source.size()[0])
            kernels = self.guassian_kernel(
                source, target, kernel_mul=self.kernel_mul,
                kernel_num=self.kernel_num, fix_sigma=self.fix_sigma)
            XX = torch.mean(kernels[:batch_size, :batch_size])
            YY = torch.mean(kernels[batch_size:, batch_size:])
            XY = torch.mean(kernels[:batch_size, batch_size:])
            YX = torch.mean(kernels[batch_size:, :batch_size])
            loss = torch.mean(XX + YY - XY - YX)
            return loss

CORAL loss

import torch

# DEVICE is the global torch.device used throughout this chapter's code
def CORAL(source, target):
    d = source.size(1)
    ns, nt = source.size(0), target.size(0)
    # source covariance
    tmp_s = torch.ones((1, ns)).to(DEVICE) @ source
    cs = (source.t() @ source - (tmp_s.t() @ tmp_s) / ns) / (ns - 1)
    # target covariance
    tmp_t = torch.ones((1, nt)).to(DEVICE) @ target
    ct = (target.t() @ target - (tmp_t.t() @ tmp_t) / nt) / (nt - 1)
    # frobenius norm
    loss = (cs - ct).pow(2).sum().sqrt()
    loss = loss / (4 * d * d)
    return loss

LMMD loss

import numpy as np
import torch

import mmd

class LMMDLoss(mmd.MMDLoss):
    def __init__(self, num_class, kernel_type='rbf', kernel_mul=2.0,
                 kernel_num=5, gamma=1.0, max_iter=1000, **kwargs):
        '''Local MMD'''
        super(LMMDLoss, self).__init__(kernel_type, kernel_mul, kernel_num, **kwargs)
        self.kernel_num = kernel_num
        self.kernel_mul = kernel_mul
        self.fix_sigma = None
        self.kernel_type = kernel_type
        self.num_class = num_class

    def forward(self, source, target, source_label, target_logits):
        if self.kernel_type == 'linear':
            raise NotImplementedError("Linear kernel is not supported yet.")
        elif self.kernel_type == 'rbf':
            batch_size = source.size()[0]
            weight_ss, weight_tt, weight_st = self.cal_weight(
                source_label, target_logits)
            weight_ss = torch.from_numpy(weight_ss).cuda()  # B, B
            weight_tt = torch.from_numpy(weight_tt).cuda()
            weight_st = torch.from_numpy(weight_st).cuda()
            kernels = self.guassian_kernel(
                source, target, kernel_mul=self.kernel_mul,
                kernel_num=self.kernel_num, fix_sigma=self.fix_sigma)
            loss = torch.Tensor([0]).cuda()
            if torch.sum(torch.isnan(sum(kernels))):
                return loss
            SS = kernels[:batch_size, :batch_size]
            TT = kernels[batch_size:, batch_size:]
            ST = kernels[:batch_size, batch_size:]
            loss += torch.sum(weight_ss * SS + weight_tt * TT - 2 * weight_st * ST)
            return loss

    def cal_weight(self, source_label, target_logits):
        batch_size = source_label.size()[0]
        source_label = source_label.cpu().data.numpy()
        source_label_onehot = np.eye(self.num_class)[source_label]  # one hot
        source_label_sum = np.sum(
            source_label_onehot, axis=0).reshape(1, self.num_class)
        source_label_sum[source_label_sum == 0] = 100
        source_label_onehot = source_label_onehot / source_label_sum  # label ratio
        # Pseudo label
        target_label = target_logits.cpu().data.max(1)[1].numpy()
        target_logits = target_logits.cpu().data.numpy()
        target_logits_sum = np.sum(
            target_logits, axis=0).reshape(1, self.num_class)
        target_logits_sum[target_logits_sum == 0] = 100
        target_logits = target_logits / target_logits_sum
        weight_ss = np.zeros((batch_size, batch_size))
        weight_tt = np.zeros((batch_size, batch_size))
        weight_st = np.zeros((batch_size, batch_size))
        set_s = set(source_label)
        set_t = set(target_label)
        count = 0
        for i in range(self.num_class):  # (B, C)
            if i in set_s and i in set_t:
                s_tvec = source_label_onehot[:, i].reshape(batch_size, -1)  # (B, 1)
                t_tvec = target_logits[:, i].reshape(batch_size, -1)  # (B, 1)
                ss = np.dot(s_tvec, s_tvec.T)  # (B, B)
                weight_ss = weight_ss + ss
                tt = np.dot(t_tvec, t_tvec.T)
                weight_tt = weight_tt + tt
                st = np.dot(s_tvec, t_tvec.T)
                weight_st = weight_st + st
                count += 1
        length = count
        if length != 0:
            weight_ss = weight_ss / length
            weight_tt = weight_tt / length
            weight_st = weight_st / length
        else:
            weight_ss = np.array([0])
            weight_tt = np.array([0])
            weight_st = np.array([0])
        return (weight_ss.astype('float32'), weight_tt.astype('float32'),
                weight_st.astype('float32'))

9.6.3 Train and Test

During training, we feed in batches of source and target domain data; the code is as follows:

Training of deep transfer learning

def train(source_loader, target_train_loader, target_test_loader, model, optimizer):
    len_source_loader = len(source_loader)
    len_target_loader = len(target_train_loader)
    best_acc = 0
    stop = 0
    for e in range(args.n_epoch):
        stop += 1
        train_loss_clf = utils.AverageMeter()
        train_loss_transfer = utils.AverageMeter()
        train_loss_total = utils.AverageMeter()
        model.train()
        iter_source, iter_target = iter(source_loader), iter(target_train_loader)
        n_batch = min(len_source_loader, len_target_loader)
        criterion = torch.nn.CrossEntropyLoss()
        for _ in range(n_batch):
            data_source, label_source = iter_source.next()
            data_target, _ = iter_target.next()
            data_source, label_source = data_source.to(DEVICE), label_source.to(DEVICE)
            data_target = data_target.to(DEVICE)
            optimizer.zero_grad()
            label_source_pred, transfer_loss = model(
                data_source, data_target, label_source)
            clf_loss = criterion(label_source_pred, label_source)
            loss = clf_loss + args.lamb * transfer_loss
            loss.backward()
            optimizer.step()
            train_loss_clf.update(clf_loss.item())
            train_loss_transfer.update(transfer_loss.item())
            train_loss_total.update(loss.item())
        # Test
        acc = test(model, target_test_loader)
        log.append([e, train_loss_clf.avg, train_loss_transfer.avg,
                    train_loss_total.avg, acc.cpu().numpy()])
        pd.DataFrame.from_dict(log).to_csv('train_log.csv', header=[
            'Epoch', 'Cls_loss', 'Transfer_loss', 'Total_loss', 'Tar_acc'])
        print(f'Epoch: [{e:2d}/{args.n_epoch}], cls_loss: {train_loss_clf.avg:.4f}, '
              f'transfer_loss: {train_loss_transfer.avg:.4f}, '
              f'total_Loss: {train_loss_total.avg:.4f}, acc: {acc:.4f}')
        if best_acc < acc:
            best_acc = acc
            stop = 0
        if stop >= args.early_stop:
            break
    print('Transfer result: {:.4f}'.format(best_acc))

The test code is as follows:


Test deep transfer learning

def test(model, target_test_loader):
    model.eval()
    test_loss = utils.AverageMeter()
    correct = 0
    criterion = torch.nn.CrossEntropyLoss()
    len_target_dataset = len(target_test_loader.dataset)
    with torch.no_grad():
        for data, target in target_test_loader:
            data, target = data.to(DEVICE), target.to(DEVICE)
            s_output = model.predict(data)
            loss = criterion(s_output, target)
            test_loss.update(loss.item())
            pred = torch.max(s_output, 1)[1]
            correct += torch.sum(pred == target)
    acc = 100. * correct / len_target_dataset
    return acc

Fig. 9.10 Results of DDC (Tzeng et al., 2014)

You can simply run python main.py or follow the instructions in our provided code. Figures 9.10, 9.11, and 9.12 show the results of DDC, DCORAL, and DSAN on the amazon to webcam task, respectively. Their accuracies are 78.24%, 79.00%, and 79.25%, which are clearly better than the results obtained by KMM, TCA, and CORAL in previous chapters. This demonstrates the effectiveness of deep transfer learning methods. The steady performance increment also reflects the development of these algorithms over the years (DDC was published in 2014, while DSAN appeared in 2020). Again, note that we do not add any tricks or training skills, so these results are not optimal. We have built an all-in-one package called DeepDA3 for easy and unified usage of all deep transfer learning methods. For instance, with it, DSAN can achieve 94.34%, which is far better than the 79.25% in this section.

3 https://github.com/jindongwang/transferlearning/tree/master/code/DeepDA.


Fig. 9.11 Results of DCORAL (Sun and Saenko, 2016)

Fig. 9.12 Results of DSAN (Zhu et al., 2020)



Table 9.1 Popular research fields in transfer learning

Fields                  | Source domain           | Target domain | Data distribution | Output space
Pre-train and fine-tune | Only need source models | Yes           | Different         | Often different
Knowledge distillation  | Only source models      | Yes           | Different         | Often different
Domain adaptation       | Need source data        | Yes           | Different         | Same
Multi-task learning     | Need source data        | No            | Same              | Often different
Domain generalization   | Need source data        | No            | Different         | Same

9.7 Summary

In this chapter, we introduced some popular deep transfer learning methods, specifically distribution adaptation, structure adaptation, and knowledge distillation. Due to space limits and the rapid growth of deep transfer learning research, we cannot list all methods here, but most of them adopt the same framework discussed in this chapter. Readers are encouraged to follow the recent progress. So far, we have introduced most of the popular research areas of transfer learning. For better understanding, we summarize the key information in Table 9.1. Note that domain generalization will be introduced in Chap. 11.

References

Borgwardt, K. M., Gretton, A., Rasch, M. J., Kriegel, H.-P., Schölkopf, B., and Smola, A. J. (2006). Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57.
Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., and Erhan, D. (2016). Domain separation networks. In Advances in Neural Information Processing Systems, pages 343–351.
Cui, S., Wang, S., Zhuo, J., Li, L., Huang, Q., and Tian, Q. (2020). Towards discriminability and diversity: Batch nuclear-norm maximization under label insufficient situations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3941–3950.
Du, Y., Wang, J., Feng, W., Pan, S., Qin, T., Xu, R., and Wang, C. (2021). AdaRNN: Adaptive learning and forecasting of time series. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 402–411.
Ganin, Y. and Lempitsky, V. (2015). Unsupervised domain adaptation by backpropagation. In ICML, pages 1180–1189.
Ghifary, M., Kleijn, W. B., and Zhang, M. (2014). Domain adaptive neural networks for object recognition. In PRICAI, pages 898–904.
Gong, B., Shi, Y., Sha, F., and Grauman, K. (2012). Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pages 2066–2073.
Gou, J., Yu, B., Maybank, S. J., and Tao, D. (2020). Knowledge distillation: A survey. arXiv preprint arXiv:2006.05525.
Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network. In Advances in Neural Information Processing Systems (NIPS).
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–1780.
Hu, Z., Ma, X., Liu, Z., Hovy, E., and Xing, E. (2016). Harnessing deep neural networks with logic rules. In ACL.
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105.
Kundu, J. N., Lakkakula, N., and Babu, R. V. (2019). UM-Adapt: Unsupervised multi-task adaptation using adversarial cross-task distillation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1436–1445.
Li, Y., Wang, N., Shi, J., Hou, X., and Liu, J. (2018). Adaptive batch normalization for practical domain adaptation. Pattern Recognition, 80:109–117.
Long, M., Cao, Y., Wang, J., and Jordan, M. (2015). Learning transferable features with deep adaptation networks. In International conference on machine learning, pages 97–105.
Long, M., Zhu, H., Wang, J., and Jordan, M. I. (2017). Deep transfer learning with joint adaptation networks. In ICML, pages 2208–2217.
Maria Carlucci, F., Porzi, L., Caputo, B., et al. (2017). AutoDIAL: Automatic domain alignment layers. In ICCV, pages 5067–5075.
Nayak, G. K., Mopuri, K. R., Shaj, V., Babu, R. V., and Chakraborty, A. (2019). Zero-shot knowledge distillation in deep networks. arXiv preprint arXiv:1905.08114.
Pan, S. J., Tsang, I. W., Kwok, J. T., and Yang, Q. (2011). Domain adaptation via transfer component analysis. IEEE TNN, 22(2):199–210.
Sun, B., Feng, J., and Saenko, K. (2016). Return of frustratingly easy domain adaptation. In AAAI.
Sun, B. and Saenko, K. (2016). Deep CORAL: Correlation alignment for deep domain adaptation. In ECCV, pages 443–450.
Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., and Darrell, T. (2014). Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
Venkateswara, H., Eusebio, J., Chakraborty, S., and Panchanathan, S. (2017). Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027.
Wang, J., Chen, Y., Feng, W., Yu, H., Huang, M., and Yang, Q. (2020). Transfer learning with dynamic distribution adaptation. ACM TIST, 11(1):1–25.
Wang, J., Zheng, V. W., Chen, Y., and Huang, M. (2018). Deep transfer learning for cross-domain activity recognition. In Proceedings of the 3rd International Conference on Crowd Science and Engineering, pages 1–8.
Yang, Y., Qiu, J., Song, M., Tao, D., and Wang, X. (2020). Distilling knowledge from graph convolutional networks. arXiv preprint arXiv:2003.10477.
Zhou, C., Neubig, G., and Gu, J. (2020). Understanding knowledge distillation in non-autoregressive machine translation. In ICLR.
Zhu, Y., Zhuang, F., Wang, J., Chen, J., Shi, Z., Wu, W., and He, Q. (2019). Multi-representation adaptation network for cross-domain image classification. Neural Networks, 119:214–221.
Zhu, Y., Zhuang, F., Wang, J., Ke, G., Chen, J., Bian, J., Xiong, H., and He, Q. (2020). Deep subdomain adaptation network for image classification. IEEE Transactions on Neural Networks and Learning Systems.

Chapter 10

Adversarial Transfer Learning

Generative adversarial networks (GANs) are one of the most popular research topics of recent years. In this chapter, we introduce adversarial transfer learning methods, which belong to the implicit feature transformation methods. Specifically, we introduce GAN-based transfer learning with applications to distribution adaptation and maximum classifier discrepancy. Data generation is another important research topic in adversarial transfer learning. Finally, we present some code practice. The organization of this chapter is as follows. Section 10.1 introduces the basics of generative adversarial networks and analyzes the possibility of using GANs for transfer learning. Then, we introduce distribution adaptation methods for adversarial transfer learning in Sect. 10.2. Different from distribution adaptation, we introduce maximum classifier discrepancy with adversarial training in Sect. 10.3. Section 10.4 introduces adversarial data generation methods. We offer code practice in Sect. 10.5. Finally, we conclude this chapter in Sect. 10.6.

10.1 Generative Adversarial Networks

The generative adversarial network (GAN) (Goodfellow et al., 2014) was proposed by taking inspiration from the two-player game. It consists of two parts: a generator and a discriminator. The generator is a generative network that produces realistic examples from random noise, and the discriminator is a discriminative network that judges whether a sample is real or generated by the generator. In training, the two networks play a min-max game. Figure 10.1 shows the process of the generative adversarial network. The learning objective of GAN can be formulated as:

\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_{noise}}[\log(1 - D(G(z)))],    (10.1)

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
J. Wang, Y. Chen, Introduction to Transfer Learning, Machine Learning: Foundations, Methodologies, and Applications, https://doi.org/10.1007/978-981-19-7584-4_10



Fig. 10.1 Generative adversarial networks

Table 10.1 Generative adversarial networks vs. transfer learning

Generative adversarial networks | Transfer learning
Real examples                   | Source domain
Random noise                    | Target domain
Generator                       | Feature extractor
Discriminator                   | Distribution divergence measurement

where D and G are the discriminator and the generator, respectively. The distribution p_data is the real data distribution and p_noise is the noise distribution, which is often taken to be Gaussian. We often adopt a two-stage training process for GAN: on the one hand, we minimize the objective with respect to the generator so that it generates more realistic examples; on the other hand, we maximize the objective with respect to the discriminator so that it can better judge where the samples come from, until the generator finally fools it. How can we exploit generative adversarial networks for transfer learning? Let us go back to the definition of GAN. A vanilla GAN consists of at least four parts:

• Real data: the real samples.
• Random noise: noisy data that is used to generate fake samples.
• Generator: takes the random noise as input to generate fake samples.
• Discriminator: takes a sample and judges whether it comes from the real data or the generator.
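To make the two-stage process concrete, here is a minimal PyTorch sketch of GAN training on toy data. It is only an illustration under our own assumptions (toy 2-D "real" data, small fully connected networks, Adam optimizers), not the original GAN implementation:

```python
import torch
import torch.nn as nn

# Toy generator and discriminator; all sizes here are illustrative assumptions
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

real = torch.randn(64, 2) + 3.0  # stand-in for samples from p_data

for step in range(100):
    z = torch.randn(64, 8)  # random noise z ~ p_noise
    # Stage 1: update D to maximize log D(x) + log(1 - D(G(z)))
    opt_d.zero_grad()
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
    d_loss.backward()
    opt_d.step()
    # Stage 2: update G to fool D; the non-saturating form maximizes log D(G(z))
    opt_g.zero_grad()
    g_loss = bce(D(G(z)), torch.ones(64, 1))
    g_loss.backward()
    opt_g.step()
```

Note that the generator step uses the common non-saturating loss (maximizing log D(G(z))) rather than directly minimizing log(1 - D(G(z))); both follow the min-max objective of Eq. (10.1).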

We need to rethink the nature of transfer learning in order to use GAN for it. We know from Eq. (3.9) that the transfer regularization term is used to reduce the distribution divergence between domains. Thus, can we use the discriminator as a distribution divergence measurement to implicitly learn their divergence? The core parts of GAN-based transfer learning correspond to the parts of GAN, as shown in Table 10.1.


Obviously, we should associate the real samples and random noise in GAN with the source and target domains in transfer learning, since the source domain is labeled and the target domain is unlabeled. The generator can be treated as a feature extractor: since transfer learning does not focus on generating new samples as GAN does, we use the feature extractor to extract features. Similar to deep transfer learning, the training loss for adversarial transfer learning is also composed of two modules: the supervised training loss L_c and the domain discriminator loss L_adv:

\mathcal{L} = \mathcal{L}_c(\mathcal{D}_s) + \lambda \mathcal{L}_{adv}(\mathcal{D}_s, \mathcal{D}_t).    (10.2)

10.2 Distribution Adaptation for Adversarial Transfer Learning

Ganin and Lempitsky (2015) and Ganin et al. (2016) used adversarial training in neural networks for domain adaptation tasks, proposing domain-adversarial neural networks (DANN). DANN directly utilizes the generative adversarial framework to perform adversarial training that fools the discriminator so that it cannot distinguish between source and target domains. Thus, DANN can learn domain-invariant features. The structure of DANN is shown in Fig. 10.2. DANN is composed of the following three modules, which are similar to our discussion in the last section:

• Feature extractor G_f(θ_f), which takes source or target domain data as input and extracts features.
• Classifier G_y(θ_y), which performs classification based on the features.
• Domain discriminator G_d(θ_d), which judges whether a sample is from the source or the target domain.


Fig. 10.2 Illustration of domain-adversarial neural networks (DANN) (Ganin and Lempitsky, 2015)


The objective of DANN is formulated as:

E(\theta_f, \theta_y, \theta_d) = \sum_{x_i \in \mathcal{D}_s} L_y\left(G_y(G_f(x_i)), y_i\right) - \lambda \sum_{x_i \in \mathcal{D}_s \cup \mathcal{D}_t} L_d\left(G_d(G_f(x_i)), d_i\right),    (10.3)

where L_y is the classification loss, L_d is the loss of the discriminator, and d_i is the domain label: d_i = 0 when the data comes from the source domain, and d_i = 1 otherwise. For training, DANN first optimizes the classification loss and the feature extractor loss to learn the parameters of the feature extractor and the classifier (θ_f and θ_y, respectively):

(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \theta_d).    (10.4)

Then, DANN maximizes the loss of the discriminator G_d to optimize its parameter θ_d:

\hat{\theta}_d = \arg\max_{\theta_d} E(\theta_f, \theta_y, \theta_d).    (10.5)

These two steps are iterated repeatedly until convergence. For efficient implementation, the authors introduced the gradient reversal layer (GRL) in the backward propagation process. In forward propagation, GRL is an identity map:

R_\lambda(x) = x.    (10.6)

In backward propagation, the gradient is multiplied by a negative scaled identity matrix I to reverse the gradient:

\frac{dR_\lambda}{dx} = -\lambda I.    (10.7)
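We can verify Eqs. (10.6) and (10.7) with a few lines of autograd code. The following is our own minimal re-implementation of the GRL for illustration (a fuller version appears in Sect. 10.5.3):

```python
import torch
from torch.autograd import Function

class GRL(Function):
    """Identity in the forward pass (Eq. 10.6); gradient multiplied by -lambda in the backward pass (Eq. 10.7)."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

x = torch.ones(3, requires_grad=True)
GRL.apply(x, 0.5).sum().backward()
print(x.grad)  # tensor([-0.5000, -0.5000, -0.5000]): the gradient of sum() is 1, reversed and scaled by lambda
```

Because of this layer, the discriminator's maximization step can be folded into a single backward pass of a standard minimization.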

If we view DANN from the perspective of distribution adaptation, we find that DANN actually matches the marginal distributions of the two domains. This is because the discriminator takes the whole data from both domains, which is equivalent to matching P_s(x) and P_t(x). In recent years, GAN-based transfer learning methods have been widely used. Tzeng et al. (2017) proposed adversarial discriminative domain adaptation (ADDA) as a general form of adversarial domain adaptation. Researchers also applied the Wasserstein distance to adversarial training (Shen et al., 2018). Liu and Tuzel (2016) proposed coupled GANs for better transfer learning. Can we also use adversarial training for dynamic distribution adaptation?


Fig. 10.3 Dynamic adversarial adaptation network (DAAN) (Yu et al., 2019)

Recently, Yu et al. (2019) proposed the dynamic adversarial adaptation network (DAAN) to apply dynamic distribution adaptation to adversarial networks. They showed that the marginal and conditional distributions still have different importance in adversarial training and proposed DAAN for better adaptation. The framework of DAAN is shown in Fig. 10.3, where ω denotes the adversarial balance factor that evaluates the relative importance of these two distributions. DAAN mainly consists of three components: the label classifier, the global domain discriminator, and the local subdomain discriminator. Integrating all components, the learning objective of DAAN can finally be formulated as:

L(\theta_f, \theta_y, \theta_d, \theta_d^c|_{c=1}^{C}) = L_y - \lambda\left((1 - \omega)L_g + \omega L_l\right),    (10.8)

where λ is a trade-off parameter, ω is the balance factor (the same as μ in Sect. 3.2), and L_y, L_g, and L_l are the classification loss, global (marginal) loss, and conditional loss, respectively. GAN-based transfer learning is often used to learn domain-invariant representations, which are useful in fine-tuning, domain adaptation, and domain generalization alike. We will cover more literature in the domain generalization chapter.
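As a simple illustration of Eq. (10.8), assuming the three loss terms have already been computed as scalars, they might be combined as follows. The function name and the default values of λ and ω are our own placeholders; DAAN actually estimates ω dynamically during training:

```python
def daan_objective(loss_y, loss_g, loss_l, lamb=1.0, omega=0.5):
    """Eq. (10.8): L = L_y - lambda * ((1 - omega) * L_g + omega * L_l).

    loss_y: label classifier loss; loss_g: global (marginal) discriminator loss;
    loss_l: local subdomain (conditional) discriminator loss.
    """
    return loss_y - lamb * ((1 - omega) * loss_g + omega * loss_l)
```

When ω = 0, the objective degenerates to purely marginal (global) adversarial adaptation as in DANN; when ω = 1, only the subdomain (conditional) discriminators matter.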

10.3 Maximum Classifier Discrepancy for Adversarial Transfer Learning

In this section, we introduce a different approach to adversarial transfer learning: maximum classifier discrepancy (MCD) (Saito et al., 2018). MCD provides another perspective on using adversarial training for transfer learning to obtain domain-invariant representations.


Fig. 10.4 Training process of Maximum classifier discrepancy (MCD) (Saito et al., 2018). (a) Train classifiers. (b) Maximize discrepancy. (c) Minimize. (d) Final result

Due to the domain discrepancy, a pre-trained network has different results on different target samples: on some samples the adaptation performance is good, while on others it is not satisfying. MCD does not pay much attention to the former; instead, it takes care of the samples with unsatisfying performance. Once all samples generate good results, the knowledge transfer is successful. Therefore, the core of MCD is to find these samples and then adapt the model to generate good performance on them. MCD introduces two independent classifiers F_1 and F_2. The discrepancy between them is used to denote the confidence of a sample: if F_1 and F_2 disagree on one target sample, then there is a discrepancy and further procedures are needed. Figure 10.4 shows the training process of MCD:

1. At the first stage (a), there are some shadows (discrepancy regions). The objective of MCD is to maximize the shadow area such that the two classifiers have the maximum discrepancy.
2. If we maximize the discrepancy of the two classifiers, we get subplot (b), which has a larger discrepancy region. At this time, the goal of MCD is to extract better features to reduce the discrepancy of the two classifiers.
3. If we minimize the discrepancy between the two classifiers, we get subplot (c), where there is little discrepancy region. However, the classification boundary may still be a little tight, so we go back to the first two steps for more iterations.
4. Finally, MCD generates subplot (d), where there are no shadows and the classification boundary is more robust.

MCD uses the source domain data to train the two classifiers F_1 and F_2:

\min L(X_s, Y_s) = -\mathbb{E}_{(x_s, y_s) \sim (X_s, Y_s)} \sum_{k=1}^{K} \mathbb{1}_{[k = y_s]} \log p(y \mid x_s).    (10.9)


Then, MCD fixes the feature extractor G and trains F_1 and F_2 to have the maximum discrepancy using adversarial training:

\min_{F_1, F_2} L(X_s, Y_s) - L_{adv}(X_t),    (10.10)

where the adversarial loss is defined as the distance between the two predicted distributions:

L_{adv}(X_t) = \mathbb{E}_{x_t \sim X_t}\left[d\left(p_1(y \mid x_t), p_2(y \mid x_t)\right)\right],    (10.11)

and MCD adopts the L1 distance to compute d(p_1, p_2). Then, MCD fixes the two classifiers and optimizes the generator G to minimize their discrepancy:

\min_G L_{adv}(X_t).    (10.12)

The computation process of MCD is easy to understand, and it achieves even better performance than DANN. From the theory side, MCD actually optimizes the $\mathcal{H}\Delta\mathcal{H}$-distance (Ben-David et al., 2010) that we discussed in Sect. 7.1: the maximization and minimization processes iteratively search for the worst case between the two distributions and then optimize it. Thus, MCD is well motivated in theory. In addition, the maximization and minimization process is also similar to the exploration and exploitation paradigm in reinforcement learning (Sutton and Barto, 2018): both carefully take a step that generates some uncertain results and then solve them to learn better models.
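The three optimization steps of Eqs. (10.9), (10.10), and (10.12) can be sketched for one iteration as follows. This is our own simplified sketch on random toy data: the network dimensions are illustrative assumptions, and the L1 discrepancy of Eq. (10.11) is computed on softmax outputs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def discrepancy(out1, out2):
    # L1 distance between the two classifiers' predicted distributions, Eq. (10.11)
    return torch.mean(torch.abs(F.softmax(out1, dim=1) - F.softmax(out2, dim=1)))

# Toy feature extractor G and two classifier heads F1, F2 (sizes are assumptions)
G = nn.Sequential(nn.Linear(10, 16), nn.ReLU())
F1, F2 = nn.Linear(16, 3), nn.Linear(16, 3)
opt_g = torch.optim.SGD(G.parameters(), lr=1e-2)
opt_f = torch.optim.SGD(list(F1.parameters()) + list(F2.parameters()), lr=1e-2)
ce = nn.CrossEntropyLoss()

xs, ys = torch.randn(32, 10), torch.randint(0, 3, (32,))  # labeled source batch
xt = torch.randn(32, 10)                                  # unlabeled target batch

# Step A: train G, F1, F2 on the labeled source domain, Eq. (10.9)
opt_g.zero_grad(); opt_f.zero_grad()
feat = G(xs)
loss = ce(F1(feat), ys) + ce(F2(feat), ys)
loss.backward(); opt_g.step(); opt_f.step()

# Step B: fix G and train F1, F2 to maximize their discrepancy on the target, Eq. (10.10)
opt_f.zero_grad()
feat_s, feat_t = G(xs).detach(), G(xt).detach()  # detaching fixes the feature extractor
loss_b = ce(F1(feat_s), ys) + ce(F2(feat_s), ys) - discrepancy(F1(feat_t), F2(feat_t))
loss_b.backward(); opt_f.step()

# Step C: fix F1, F2 and train G to minimize the discrepancy, Eq. (10.12)
opt_g.zero_grad(); opt_f.zero_grad()  # only G is updated below
loss_c = discrepancy(F1(G(xt)), F2(G(xt)))
loss_c.backward(); opt_g.step()
```

In practice, steps A, B, and C are repeated for many iterations, and step C is often run several times per iteration.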

10.4 Data Generation for Adversarial Transfer Learning

Let us go back to the original goal of the generative adversarial network: to generate realistic examples. A natural idea, then, is to use GAN to generate data for transfer learning. Before using the generated samples, we should first have a clear understanding of their benefit: why do we use generated data for transfer learning? When generating data with GAN, labels for the samples are not necessary. Thus, for transfer learning tasks, whether we have a labeled source domain or an unlabeled target domain, we can always use GAN for generation, which augments the training data. On the other hand, we can leverage the generated data to help learn the distribution discrepancy and achieve better transfer learning performance. Note that we will not introduce the GAN zoo here, which is out of the scope of this chapter. Zhu et al. (2017) proposed CycleGAN to match the different distributions between generated and real samples. CycleGAN first maps the source domain into the target domain, and then uses another mapping function to map the data back to the source domain. In this process, they used the divergence


between the raw source data and the re-mapped source data to compute the loss. We use G and F to denote the mapping functions for the source and target domains, respectively. Then, the training objective of CycleGAN can be formulated as:

\mathcal{L}_{cyc} = \mathbb{E}_{x_1 \sim p_{data}(x_1)}\left[\|F(G(x_1)) - x_1\|_1\right] + \mathbb{E}_{x_2 \sim p_{data}(x_2)}\left[\|G(F(x_2)) - x_2\|_1\right].    (10.13)
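The cycle-consistency term of Eq. (10.13) is straightforward to express in PyTorch. In this sketch (our own, with G and F as placeholder mapping networks), a batch mean replaces the expectations:

```python
import torch

def cycle_consistency_loss(G, F, x1, x2):
    """L_cyc of Eq. (10.13): the L1 reconstruction errors of the two cycles,
    averaged over the batch in place of the expectations."""
    forward_cycle = torch.mean(torch.abs(F(G(x1)) - x1))   # x1 -> target -> back to source
    backward_cycle = torch.mean(torch.abs(G(F(x2)) - x2))  # x2 -> source -> back to target
    return forward_cycle + backward_cycle
```

If the two mappings were perfect inverses of each other, the loss would be exactly zero; this is only the cycle-consistency term, and the full CycleGAN objective also includes adversarial losses for each mapping.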

Similar ideas were later used in the literature, such as CoGAN (Liu and Tuzel, 2016) and DiscoGAN (Kim et al., 2017). Bousmalis et al. (2016) proposed a pixel-to-pixel translation framework called PixelDA to learn fine-grained image translation models. We will not discuss these works here. In addition to direct data generation, other works also considered combining data generation with the transfer learning process. The work of Sankaranarayanan et al. (2018) proposed Generate to Adapt (G2A), which leveraged the generated data together with the source and target domains to learn domain-invariant features by adversarial training. Xu et al. (2020) proposed to use Mixup for adversarial transfer learning, which can learn better domain-invariant features. Beyond adaptation, GAN-based data generation is also a popular technique for generating diverse and rich data to boost the generalization capability of a model. Rahman et al. (2019) leveraged ComboGAN (Anoosheh et al., 2018) to generate new data and then applied a domain discrepancy measure such as MMD (Gretton et al., 2012) to minimize the distribution divergence between real and generated images to learn general representations. Qiao et al. (2020) leveraged adversarial training to create "fictitious" yet "challenging" populations, where a Wasserstein auto-encoder (Tolstikhin et al., 2017) was used to generate samples that preserve the semantics while having a large domain transportation cost. Zhou et al. (2020) generated novel distributions under semantic consistency and then maximized the difference between the source and the novel distributions. Somavarapu et al. (2020) introduced a simple transformation based on image stylization to explore cross-source variability.

10.5 Practice

In this section, we implement the adversarial transfer learning method domain-adversarial neural networks (DANN) (Ganin and Lempitsky, 2015) using PyTorch. Note that DANN shares the same main functions as the deep transfer learning code in the last chapter, since DANN can be seen as an implicit distribution divergence measurement. Thus, we can replace the MMD or CORAL loss from the last chapter with the DANN loss. The complete code can be found at Sect. 9.6.


10.5.1 Domain Discriminator

First, we implement the domain discriminator, which classifies the input samples as coming from the source or the target domain. It is a simple neural network, implemented as:


Discriminator

class Discriminator(nn.Module):
    def __init__(self, input_dim=256, hidden_dim=256):
        super(Discriminator, self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.dis1 = nn.Linear(input_dim, hidden_dim)
        self.dis2 = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        x = F.relu(self.dis1(x))
        x = self.dis2(x)
        x = torch.sigmoid(x)
        return x

10.5.2 Measuring Distribution Divergence

The distribution divergence based on the domain discriminator is computed as follows.


Loss for discriminator

def adv(source, target, input_dim=256, hidden_dim=512):
    domain_loss = nn.BCELoss()
    adv_net = Discriminator(input_dim, hidden_dim).cuda()
    domain_src = torch.ones(len(source)).cuda()
    domain_tar = torch.zeros(len(target)).cuda()
    domain_src, domain_tar = domain_src.view(
        domain_src.shape[0], 1), domain_tar.view(domain_tar.shape[0], 1)
    reverse_src = ReverseLayerF.apply(source, 1)
    reverse_tar = ReverseLayerF.apply(target, 1)
    pred_src = adv_net(reverse_src)
    pred_tar = adv_net(reverse_tar)
    loss_s, loss_t = domain_loss(pred_src, domain_src), domain_loss(pred_tar, domain_tar)
    loss = loss_s + loss_t
    return loss

Note that the core of the above code is to initialize the source domain labels as 1 and the target domain labels as 0, which serve as the ground truth for the domain discriminator. Then, we compute the binary cross-entropy loss of the domain discriminator on these labels to obtain its loss.


10.5.3 Gradient Reversal Layer

The gradient reversal layer reverses gradients during backpropagation, which makes the training of DANN simpler.


Gradient reversal layer

import torch.nn.functional as F
from torch.autograd import Function

class ReverseLayerF(Function):
    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        output = grad_output.neg() * ctx.alpha
        return output, None

Note that for a fair comparison, we use the same code as in the last chapter. As shown in Fig. 10.5, the result of DANN on the amazon to webcam task is 78.87%.

Fig. 10.5 Results of DANN (Ganin and Lempitsky, 2015)


Again, note that we do not add any tricks or training skills, so the results are not optimal. We have built an all-in-one package called DeepDA1 for easy and unified usage of all deep transfer learning methods.

10.6 Summary

In this chapter, we introduced adversarial transfer learning methods. Compared to non-adversarial methods, adversarial methods can learn richer representations that may be more domain-invariant. In recent years, adversarial transfer learning methods have become increasingly popular. We can also develop other kinds of methods by taking advantage of adversarial training, such as data generation and disentanglement.

References

Anoosheh, A., Agustsson, E., Timofte, R., and Van Gool, L. (2018). ComboGAN: Unrestrained scalability for image domain translation. In CVPR Workshop, pages 783–790.
Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. (2010). A theory of learning from different domains. Machine learning, 79(1–2):151–175.
Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., and Erhan, D. (2016). Domain separation networks. In Advances in Neural Information Processing Systems, pages 343–351.
Ganin, Y. and Lempitsky, V. (2015). Unsupervised domain adaptation by backpropagation. In ICML, pages 1180–1189.
Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. (2016). Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial networks. In NIPS.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773.
Kim, T., Cha, M., Kim, H., Lee, J. K., and Kim, J. (2017). Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1857–1865. JMLR.org.
Liu, M.-Y. and Tuzel, O. (2016). Coupled generative adversarial networks. In Advances in neural information processing systems, pages 469–477.
Qiao, F., Zhao, L., and Peng, X. (2020). Learning to learn single domain generalization. In CVPR, pages 12556–12565.
Rahman, M. M., Fookes, C., Baktashmotlagh, M., and Sridharan, S. (2019). Multi-component image translation for deep domain generalization. In WACV, pages 579–588. IEEE.
Saito, K., Watanabe, K., Ushiku, Y., and Harada, T. (2018). Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3723–3732.

1 https://github.com/jindongwang/transferlearning/tree/master/code/DeepDA.


Sankaranarayanan, S., Balaji, Y., Castillo, C. D., and Chellappa, R. (2018). Generate to adapt: Aligning domains using generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8503–8512.
Shen, J., Qu, Y., Zhang, W., and Yu, Y. (2018). Wasserstein distance guided representation learning for domain adaptation. In AAAI.
Somavarapu, N., Ma, C.-Y., and Kira, Z. (2020). Frustratingly simple domain generalization via image stylization. arXiv:2006.11207.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. (2017). Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558.
Tzeng, E., Hoffman, J., Saenko, K., and Darrell, T. (2017). Adversarial discriminative domain adaptation. In CVPR, pages 2962–2971.
Xu, M., Zhang, J., Ni, B., Li, T., Wang, C., Tian, Q., and Zhang, W. (2020). Adversarial domain adaptation with domain mixup. In AAAI.
Yu, C., Wang, J., Chen, Y., and Huang, M. (2019). Transfer learning with dynamic adversarial adaptation network. In The IEEE International Conference on Data Mining (ICDM).
Zhou, K., Yang, Y., Hospedales, T. M., and Xiang, T. (2020). Learning to generate novel domains for domain generalization. In ECCV.
Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232.

Chapter 11

Generalization in Transfer Learning

While fine-tuning and domain adaptation focus on performance on the target domain, in this chapter we discuss the generalization side of transfer learning. Specifically, we introduce the problem definition, main algorithms, and theory of domain generalization. The organization of this chapter is as follows. We first present the background and problem definition of domain generalization in Sect. 11.1. Then, we introduce the data manipulation methods in Sect. 11.2. Next, we present the domain-invariant representation learning methods in Sect. 11.3. Then, we introduce meta-learning, ensemble learning, and other learning paradigms in Sect. 11.4. In Sect. 11.5, we introduce some theory for domain generalization. After that, we present some code practice in Sect. 11.6. Finally, we conclude this chapter in Sect. 11.7.

11.1 Domain Generalization

We introduce domain generalization (DG) (Wang et al., 2021b) in this chapter. DG requires learning a model from several domains that can generalize to unseen distributions. We can also treat DG as an extension of DA, but with the goal of generalizing well to unseen domains (Fig. 11.1). Domain generalization has strong motivation in daily life. For instance, in some medical healthcare applications we cannot collect all the required data because collection is expensive and time-consuming; in fall detection for the elderly, labeled fall data in real environments are extremely rare, not to mention collecting fall data for elderly people of all age groups; in activity detection, we may not be able to collect enough data for all the different datasets (Lu et al., 2022a). These practical applications motivate us to learn a generalized model that performs well in all environments.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 J. Wang, Y. Chen, Introduction to Transfer Learning, Machine Learning: Foundations, Methodologies, and Applications, https://doi.org/10.1007/978-981-19-7584-4_11


Fig. 11.1 Examples from the PACS dataset (Li et al., 2017) for domain generalization. The training set is composed of images belonging to domains of sketch, cartoon, and art paintings. DG aims to learn a generalized model that performs well on the unseen target domain of photos

Fig. 11.2 Illustration of domain generalization

Figure 11.2 illustrates the problem of domain generalization. In DG, we often assume that the training data come from several domains with different distributions (i.e., multiple source domains), and these data are all we have for training. Different from domain adaptation, which provides a target domain during training, DG has no access to the target domain during training. The goal of DG is to utilize all of these data to learn a generalized model that performs well on unseen test distributions.

Definition 11.1 (Domain Generalization) As shown in Fig. 11.2, in domain generalization, we are given M training (source) domains $\mathcal{S}_{train} = \{\mathcal{S}_i \mid i = 1, \cdots, M\}$, where $\mathcal{S}_i = \{(x_j^i, y_j^i)\}_{j=1}^{n_i}$ denotes the i-th domain. The joint distributions between each pair of domains are different: $P_{XY}^i \neq P_{XY}^j$, $1 \leq i \neq j \leq M$. The goal of domain generalization is to learn a robust and generalizable predictive function $h : \mathcal{X} \rightarrow \mathcal{Y}$ from the M training domains that achieves a minimum prediction error on an unseen test domain $\mathcal{S}_{test}$ (i.e., $\mathcal{S}_{test}$ cannot be accessed in training and $P_{XY}^{test} \neq P_{XY}^i$ for $i \in \{1, \cdots, M\}$):

$$\min_h \mathbb{E}_{(x,y) \in \mathcal{S}_{test}}\left[\ell(h(x), y)\right], \tag{11.1}$$

where $\mathbb{E}$ is the expectation and $\ell(\cdot, \cdot)$ is the loss function.

A very naive approach to solving DG is to combine all the domains into one and then perform ERM training on it, which can be implemented as:

$$f^* = \arg\min_f \frac{1}{M}\sum_{i=1}^{M} \frac{1}{n_i}\sum_{j=1}^{n_i} \ell\left(f(x_j^i), y_j^i\right). \tag{11.2}$$

Another intuitive approach is the opposite: we can train a model for each domain $\mathcal{S}_i$ and then let these models vote on the test set, which is implemented as:

$$f_i^* = \arg\min_f \frac{1}{n_i}\sum_{j=1}^{n_i} \ell\left(f(x_j^i), y_j^i\right), \quad \forall i \in \{1, 2, \cdots, M\}. \tag{11.3}$$

The prediction on any test input $x$ is then:

$$\hat{y} = \text{Vote}\left[f_i(x)\right], \tag{11.4}$$

where $\text{Vote}(\cdot)$ denotes a voting operation such as average or weighted-average voting. The voting can also be learned through a model, which we call ensemble learning-based domain generalization; it will be introduced in Sect. 11.4.

Obviously, we need to handle the different distributions of these domains to maximally exploit their knowledge. On the other hand, we should also develop models that generalize well, since the test distribution is unknown. Following the first survey on domain generalization (Wang et al., 2021b), we summarize domain generalization algorithms into the following categories:

1. Data manipulation: This category of methods focuses on manipulating the inputs to assist learning general representations. Along this line, there are two kinds of popular techniques: (a) Data augmentation, which is mainly based on augmentation, randomization, and transformation of input data; (b) Data generation, which generates diverse samples to help generalization.


2. Representation learning: This category of methods is the most popular in domain generalization. There are two representative techniques: (a) Domain-invariant representation learning, which performs kernel methods, adversarial training, explicit feature alignment between domains, or invariant risk minimization to learn domain-invariant representations; (b) Feature disentanglement, which tries to disentangle the features into domain-shared and domain-specific parts for better generalization.
3. Learning paradigm: This category of methods focuses on exploiting general learning paradigms to enhance the generalization capability, mainly including three kinds of methods: (a) Ensemble learning, which relies on the power of ensembles to learn a unified and generalized predictive function; (b) Meta-learning, which is based on the learning-to-learn mechanism to learn general knowledge by constructing meta-learning tasks that simulate domain shift; (c) Other paradigms, which utilize mechanisms such as gradient operations or self-supervised learning for generalization.

In the rest of this chapter, we will introduce each of these categories of algorithms. Note that DG is an active research area, so there may be other methods that do not belong to the above categories.
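As a concrete illustration of the two naive baselines in Eqs. (11.2)-(11.4), the following sketch uses nearest-centroid classifiers as a stand-in for f. The toy domains, the classifier choice, and the average vote are illustrative assumptions, not part of any published method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three toy source domains: the same two classes, with shifted distributions.
domains = []
for shift in (0.0, 0.5, 1.0):
    X0 = rng.normal(loc=0.0 + shift, scale=0.3, size=(50, 2))  # class 0
    X1 = rng.normal(loc=2.0 + shift, scale=0.3, size=(50, 2))  # class 1
    domains.append((np.vstack([X0, X1]), np.array([0] * 50 + [1] * 50)))

def fit_centroids(X, y):
    """A stand-in for f: one centroid per class (nearest-centroid classifier)."""
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(centroids, X):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

# Baseline 1 (Eq. 11.2): pool all domains and train one model (ERM).
X_all = np.vstack([X for X, _ in domains])
y_all = np.concatenate([y for _, y in domains])
pooled = fit_centroids(X_all, y_all)

# Baseline 2 (Eqs. 11.3-11.4): one model per domain, then average voting.
models = [fit_centroids(X, y) for X, y in domains]

def vote(X):
    preds = np.stack([predict(m, X) for m in models])  # (M, n) predictions
    return (preds.mean(axis=0) > 0.5).astype(int)      # average vote

# An "unseen" test domain with a further shift.
X_test = np.vstack([rng.normal(1.0, 0.3, size=(50, 2)),
                    rng.normal(3.0, 0.3, size=(50, 2))])
y_test = np.array([0] * 50 + [1] * 50)
acc_pooled = (predict(pooled, X_test) == y_test).mean()
acc_vote = (vote(X_test) == y_test).mean()
```

On this toy problem the shift is mild, so both baselines can perform well; neither explicitly handles the distribution differences discussed above.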

11.2 Data Manipulation

The generalization of a model often relies on the quantity and diversity of the training data. Given a limited set of training data, data manipulation is one of the cheapest and simplest ways to generate samples and thereby enhance the generalization capability of the model. The main objective of data manipulation-based DG is to increase the diversity of the existing training data (and, at the same time, its quantity) using different data manipulation methods. Data manipulation-based DG is formulated as:

$$\min_h \mathbb{E}_{x,y}\left[\ell(h(x), y)\right] + \mathbb{E}_{x',y}\left[\ell(h(x'), y)\right], \tag{11.5}$$

where $x' = \text{mani}(x)$ denotes the manipulated data produced by a function $\text{mani}(\cdot)$. Based on this function, we further categorize existing work into two types: data augmentation and data generation. In particular, Mixup-based methods are an important type of data generation method that we would like to highlight.

11.2.1 Data Augmentation and Generation

Augmentation is one of the most useful techniques for training machine learning models. Typical augmentation operations include flipping, rotation, scaling, cropping, adding noise, and so on (note that the dataloader code in this book mostly uses some augmentation tricks). They have been widely used in supervised learning to enhance the generalization performance of a model by reducing overfitting. Without exception, they can also be adopted for DG, where $\text{mani}(\cdot)$ can be instantiated as these data augmentation functions.

Domain randomization is a simple approach to augmenting data. Its goal is to generate a diversified dataset that mimics new environments and tasks, so that the model becomes robust to unseen distributions. Domain randomization mainly applies the following transformations when generating data:

• Shape and volume of objects in the training data
• Position and texture of objects in the training data
• Viewing angle and illumination of the camera
• Lighting condition and position
• Random noise added to the data

As the dataset becomes more complex and diverse, the model's generalization ability can be enhanced. Here, the $\text{mani}(\cdot)$ function is implemented as several manual transformations (commonly used on image data), such as altering the location and texture of objects, changing the number and shape of objects, modifying the illumination and camera view, and adding different types of random noise to the data. Khirodkar et al. (2019) used domain randomization to generate more training data in simulated environments in order to generalize to real test environments. Prakash et al. (2019) further took into account the structure of the scene when randomly placing objects for data generation, which enables the neural network to utilize context information when detecting objects.

Adversarial data augmentation. While domain randomization methods are simple and effective for DG, they do not explicitly model the information contained in each domain and thus may lack domain-specific features. Adversarial data augmentation aims to guide the augmentation to optimize the generalization capability by enhancing the diversity of the data while assuring its reliability. Shankar et al. (2018) used a Bayesian network to model the dependence between label, domain, and input instance, and proposed CrossGrad, a cautious data augmentation strategy that perturbs the input along the direction of greatest domain change while changing the class label as little as possible. Volpi et al. (2018) proposed an iterative procedure that augments the source dataset with examples from a fictitious target domain that is "hard" under the current model, where adversarial examples are appended at each iteration to enable adaptive data augmentation. Rather than directly updating the inputs by gradient ascent, Zhou et al. (2020b) adversarially trained a data augmentation network to generate samples that can fool the feature extractor, eventually learning general representations.
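The gradient-ascent idea behind these methods can be sketched for a binary logistic model; the model, parameters, and step size below are illustrative assumptions, not CrossGrad or Volpi et al.'s actual procedure:

```python
import numpy as np

def adversarial_augment(x, y, w, b, step=0.1):
    """Perturb input x by one gradient-ascent step to increase the logistic loss.

    x: (d,) input; y: 0/1 label; w, b: parameters of a linear logistic model.
    Returns the perturbed ("hard") example to append to the training set.
    """
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(y=1 | x)
    grad_x = (p - y) * w                     # gradient of the log-loss w.r.t. x
    return x + step * grad_x                 # move x toward higher loss
```

Real implementations compute this gradient through a neural network via backpropagation and append the perturbed examples to the source set at each iteration.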


11.2.2 Mixup-Based Domain Generalization

In addition to the above methods, Mixup (Zhang et al., 2018) is also a popular technique for data generation. Mixup generates new data by linearly interpolating between any two instances and between their labels, with a weight sampled from a Beta distribution; it does not require training generative models. It constructs virtual training examples from two random data points:

$$\lambda \sim \text{Beta}(\alpha, \alpha), \quad \tilde{x} = \lambda x_i + (1 - \lambda) x_j, \quad \tilde{y} = \lambda y_i + (1 - \lambda) y_j, \tag{11.6}$$

where $\text{Beta}(\alpha, \alpha)$ is the Beta distribution and $\alpha \in (0, \infty)$ is a hyperparameter that controls the strength of interpolation between feature-target pairs, recovering the ERM principle as $\alpha \rightarrow 0$. Several methods use Mixup for DG, performing Mixup either in the original space (Wang et al., 2021c, 2020a) to generate new samples or in the feature space (Zhou et al., 2021), which does not explicitly generate raw training samples. These methods achieve promising performance on popular benchmarks while remaining conceptually and computationally simple.

Recently, Lu et al. (2022c) proposed a theory to analyze why vanilla Mixup fails to improve generalization. Their theory shows that Mixup can easily generate noisy virtual labels for the classification task. Moreover, Mixup does not discern the label and domain information, which should be handled more carefully. To perform disentanglement for Mixup-based domain generalization, they proposed a simple domain-invariant feature Mixup method called FIXED, which performs Mixup in the domain-invariant representations. Let $z$ be the domain-invariant feature; their Mixup approach is formulated as:

$$\lambda \sim \text{Beta}(\alpha, \alpha), \quad \tilde{z} = \lambda z_i + (1 - \lambda) z_j, \quad \tilde{y} = \lambda y_i + (1 - \lambda) y_j. \tag{11.7}$$

The domain-invariant features $z$ can be obtained by different domain-invariant learning approaches such as DANN (Ganin and Lempitsky, 2015) and CORAL (Sun and Saenko, 2016). FIXED is thus quite simple and general. Their theory showed that domain-invariant feature Mixup covers a larger distribution range than the original, as shown in Fig. 11.3. They further added a large margin loss to enhance the discrimination of the classifier. FIXED achieved remarkable performance across image classification and time series classification experiments while remaining extremely simple.
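A minimal NumPy sketch of Eq. (11.6), assuming one-hot labels (the default α value is an illustrative choice):

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=None):
    """Construct one virtual example by linear interpolation (Eq. 11.6)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)           # interpolation weight
    x_tilde = lam * x_i + (1 - lam) * x_j  # mixed input
    y_tilde = lam * y_i + (1 - lam) * y_j  # mixed (soft) label
    return x_tilde, y_tilde
```

For FIXED (Eq. 11.7), the same interpolation would be applied to domain-invariant features z extracted by, e.g., a DANN-style network, instead of the raw inputs.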


Fig. 11.3 Toy examples of theoretical insights. (a) After the implicit shrinkage via Mixup, classes (denoted by different colors) mix together, bringing difficulty to classification. Different shapes denote domains. (b) FIXED enlarges the distribution cover range. Vertices represent distributions, while colored areas represent the possible distribution ranges

Furthermore, to apply their algorithm to a real activity recognition application, they proposed semantic-discriminative Mixup (SDMix) (Lu et al., 2022b). Firstly, they introduced semantic-aware Mixup that considers the activity semantic ranges to overcome the semantic inconsistency brought by domain differences. Secondly, they introduced the large margin loss to enhance the discrimination of Mixup to prevent misclassification brought by noisy virtual labels. SDMix remains quite simple while being effective for cross-domain/dataset/position human activity recognition.

11.3 Domain-Invariant Representation Learning

Representation learning has been the focus of machine learning for decades and is also one of the keys to the success of domain generalization. We decompose the prediction function h as $h = f \circ g$, where g is a representation learning function and f is the classifier function. The goal of representation learning is formulated as:

$$\min_{f,g} \mathbb{E}_{x,y}\, \ell(f(g(x)), y) + \lambda \ell_{reg}, \tag{11.8}$$

where $\ell_{reg}$ denotes a regularization term and $\lambda$ is the trade-off parameter. Many methods are designed to better learn the feature extraction function g with a corresponding $\ell_{reg}$.

11.3.1 Domain-Invariant Component Analysis

Transfer Component Analysis (TCA) is a classic method for domain adaptation. In DG, similarly, Domain-Invariant Component Analysis (DICA) (Muandet et al., 2013) is one of the classic methods.


The goal of DICA is to find a feature transformation that minimizes the distribution divergence. Specifically, DICA treats this distribution gap as the distributional variance, defined as:

$$V_{\mathcal{H}}(\mathcal{P}) := \frac{1}{N}\operatorname{tr}(\Sigma) = \frac{1}{N}\operatorname{tr}(G) - \frac{1}{N^2}\sum_{i,j=1}^{N} G_{ij}, \tag{11.9}$$

where N denotes the number of distributions and $\Sigma$ is the covariance operator of the probability distributions $\mathcal{P}$, defined as:

$$\Sigma := G - \mathbf{1}_N G - G\mathbf{1}_N + \mathbf{1}_N G \mathbf{1}_N, \tag{11.10}$$

where $\mathbf{1}_N$ is the all-one matrix of size N and G is the Gram matrix:

$$G_{ij} := \left\langle \mu_{\mathbb{P}_i}, \mu_{\mathbb{P}_j} \right\rangle_{\mathcal{H}} = \iint k(x, z)\, d\mathbb{P}_i(x)\, d\mathbb{P}_j(z). \tag{11.11}$$

Here, $\mu_{\mathbb{P}_i}$ in the Gram matrix denotes the kernel embedding of a distribution in the RKHS, and $k(\cdot, \cdot)$ is a kernel function. Similar to TCA, the distributional variance $V_{\mathcal{H}}(\mathcal{P})$ in Eq. (11.9) can be computed empirically as:

$$\hat{V}_{\mathcal{H}} = \frac{1}{N}\operatorname{tr}(\hat{\Sigma}) = \operatorname{tr}(KQ), \tag{11.12}$$

where K and Q are the kernel matrix and the distribution discrepancy matrix, similar to TCA:

$$K = \begin{pmatrix} K_{1,1} & \cdots & K_{1,N} \\ \vdots & \ddots & \vdots \\ K_{N,1} & \cdots & K_{N,N} \end{pmatrix} \in \mathbb{R}^{n \times n}, \tag{11.13}$$

$$Q = \begin{pmatrix} Q_{1,1} & \cdots & Q_{1,N} \\ \vdots & \ddots & \vdots \\ Q_{N,1} & \cdots & Q_{N,N} \end{pmatrix} \in \mathbb{R}^{n \times n}. \tag{11.14}$$

With these notations, we seek a feature transformation matrix B such that the distributional variance becomes smaller after the transformation:

$$\min_B \operatorname{tr}\left(B^\top K Q K B\right). \tag{11.15}$$

In addition to minimizing the distributional variance, DICA also preserves structure information via extra computations, which will not be discussed here. By combining these two objectives and smoothing the outputs by $\varepsilon$, the final formulation of DICA is:

$$\max_B \frac{\frac{1}{n}\operatorname{tr}\left(B^\top L (L + n\varepsilon I_n)^{-1} K^2 B\right)}{\operatorname{tr}\left(B^\top K Q K B + B^\top K B\right)}. \tag{11.16}$$

Recalling the previous chapters, this objective is very similar to the formulations of TCA, BDA, MEDA, etc. (i.e., Eq. (5.19)). Their optimization is also similar: the method of Lagrange multipliers. The Lagrange objective is formulated as:

$$\mathcal{L} = \frac{1}{n}\operatorname{tr}\left(B^\top L (L + n\varepsilon I_n)^{-1} K^2 B\right) - \operatorname{tr}\left(\left(B^\top K Q K B + B^\top K B - I_m\right)\Gamma\right), \tag{11.17}$$

where $\Gamma$ contains the Lagrange multipliers. Setting the derivative to zero, we obtain the following eigen-decomposition problem:

$$\frac{1}{n} L (L + n\varepsilon I_n)^{-1} K^2 B = (KQK + K)B\Gamma. \tag{11.18}$$

By solving this problem, we obtain the matrix B, which is the learning objective of DICA. DICA can be seen as minimizing the marginal distribution divergence. Therefore, other researchers proposed to minimize the conditional distribution divergence via conditional-invariant domain generalization (CIDG) (Li et al., 2018c) and scatter component analysis (SCA) (Ghifary et al., 2017), which maximize the inter-class discrepancy and minimize the intra-class discrepancy. Moreover, the work of Erfani et al. (2016) mapped each source domain and each class to an ellipse and exploited the domain information to minimize the distribution divergence. Hu et al. (2019) proposed a similar method with theoretical analysis.
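Equation (11.18) can be solved numerically as a generalized eigen-decomposition. The following sketch returns the leading m directions; the matrix contents, the ridge term, and the NumPy-based solver are illustrative assumptions, not the authors' reference implementation:

```python
import numpy as np

def dica_projection(K, L, Q, m=2, eps=1e-3):
    """Sketch of Eq. (11.18): B solves a generalized eigenproblem.

    K: (n, n) kernel matrix; L: (n, n) label kernel; Q: (n, n) distribution
    discrepancy matrix. Returns B with the m leading eigenvectors as columns.
    """
    n = K.shape[0]
    lhs = (1.0 / n) * L @ np.linalg.inv(L + n * eps * np.eye(n)) @ K @ K
    rhs = K @ Q @ K + K + 1e-8 * np.eye(n)   # small ridge for invertibility
    vals, vecs = np.linalg.eig(np.linalg.solve(rhs, lhs))
    order = np.argsort(-vals.real)            # largest eigenvalues first
    return vecs[:, order[:m]].real            # (n, m) projection coefficients
```

As in TCA, the projected features of the samples are then given by the product of the kernel matrix and B.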

11.3.2 Deep Domain Generalization

Many DG methods can be extended to the deep learning setting. The Multi-task Auto-encoder (MTAE) (Ghifary et al., 2015) pioneered the application of auto-encoders to deep domain generalization. The core of MTAE is to exploit the encoder to reconstruct each sample in each domain. Since the encoder is shared, MTAE learns general features for all samples, which reduces the distribution divergence. The structure of MTAE is shown in Fig. 11.4.


Fig. 11.4 Illustration of multi-task auto-encoder (MTAE) (Ghifary et al., 2015)

The training of MTAE is similar to existing deep learning methods:

$$h_i = \sigma_{enc}\left(W^\top x_i\right), \quad f_{(l)}(x_i) = \sigma_{dec}\left(V^{(l)\top} h_i\right), \tag{11.19}$$

where $\Theta^{(l)} = \left\{W, V^{(l)}\right\}$ denotes the learnable parameters, W the shared encoder parameters, and $V^{(l)}, l = 1, 2, \cdots, N$ the domain-specific parameters of the N decoders. For each domain, the objective of MTAE is:

$$J\left(\Theta^{(l)}\right) = \frac{1}{N_i} \sum_{j=1}^{N_i} L\left(f^{(l)}(x_j), x_j^l\right). \tag{11.20}$$
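The shared-encoder, per-domain-decoder computation of Eqs. (11.19) and (11.20) can be sketched as follows; the dimensions, sigmoid activations, and random initialization are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 8, 4, 3                                  # input dim, code dim, domains
W = rng.normal(scale=0.1, size=(d, k))             # shared encoder (Eq. 11.19)
V = [rng.normal(scale=0.1, size=(k, d)) for _ in range(N)]  # N decoders

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mtae_objective(x, l):
    """Reconstruction loss of x through the decoder of domain l (Eq. 11.20)."""
    h = sigmoid(x @ W)                 # shared encoding
    x_rec = sigmoid(h @ V[l])          # domain-l specific decoding
    return np.mean((x_rec - x) ** 2)

x = rng.random(d)                      # a toy sample with values in [0, 1]
losses = [mtae_objective(x, l) for l in range(N)]
```

In the full method, each sample is reconstructed through every domain's decoder while the encoder weights W are shared, and all losses are minimized jointly by gradient descent.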

It is easy to see that MTAE provides a unified framework for most deep learning-based domain generalization methods. In later work, Li et al. (2018c) proposed the MMD-AAE algorithm, which not only learns an auto-encoder but also reduces the distribution divergence between domains using the maximum mean discrepancy (Gretton et al., 2012). MMD-AAE takes the data of N domains as inputs to the network and reconstructs them with the decoder. It optimizes the MMD distance between domains and also introduces adversarial training, adopting a min-max optimization process:

$$\min \max\, \mathcal{L}_{ae} + \lambda_1 \mathcal{L}_{mmd} + \lambda_2 \mathcal{L}_{gan}, \tag{11.21}$$

where $\mathcal{L}_{ae}$ is the auto-encoder loss, $\mathcal{L}_{mmd}$ is the MMD loss, and $\mathcal{L}_{gan}$ is the adversarial loss.

Adopting a similar idea, other methods explicitly minimized the feature distribution divergence between domains, for either adaptation or generalization, by minimizing the maximum mean discrepancy (MMD) (Li et al., 2018c), second-order correlation (Sun and Saenko, 2016; Peng and Saenko, 2018), both mean and variance (moment matching) (Peng et al., 2019a), the Wasserstein distance (Zhou et al., 2020a), etc. Zhou et al. (2020a) aligned the marginal distributions of different source domains via optimal transport by minimizing the Wasserstein distance to achieve a domain-invariant feature space. Lu et al. (2022a) proposed a local and global (LAG) alignment method for generalizable activity recognition that used the correlation in signals to learn domain-invariant features.

On the other hand, domain-adversarial training is widely used for learning domain-invariant features. Li et al. (2018b) adopted this idea for DG. Gong et al. (2019) used adversarial training by gradually reducing the domain discrepancy in a manifold space. Li et al. (2018d) proposed a conditional invariant adversarial network (CIAN) to learn class-wise adversarial networks for DG. Similar ideas were also used in Shao et al. (2019), Rahman et al. (2020), and Wang et al. (2020b). Jia et al. (2020) used single-side adversarial learning and an asymmetric triplet loss to make sure that only the real faces from different domains were indistinguishable, but not the fake ones. In addition to adversarial domain classification, Zhao et al. (2020) introduced additional entropy regularization by minimizing the KL divergence between the conditional distributions of different training domains to push the network to learn domain-invariant features. Some other GAN-based methods (Garg et al., 2021; Sicilia et al., 2021; Albuquerque et al., 2019) were also proposed with theoretically guaranteed generalization bounds.
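For a linear kernel, the MMD of Gretton et al. (2012) reduces to the squared distance between feature means; a minimal sketch (practical methods such as MMD-AAE typically use RBF kernels instead, which are omitted here):

```python
import numpy as np

def linear_mmd2(Xs, Xt):
    """Squared MMD with a linear kernel: ||mean(Xs) - mean(Xt)||^2.

    Xs: (ns, d) features of one domain; Xt: (nt, d) features of another.
    """
    delta = Xs.mean(axis=0) - Xt.mean(axis=0)
    return float(delta @ delta)
```

In training, this quantity would be computed on hidden features of each pair of source domains and added to the total loss as the alignment term.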

11.3.3 Disentanglement

Disentangled representation learning learns a function that maps a sample to a feature vector containing all the information about different factors of variation, with each dimension (or subset of dimensions) containing information about only some factor(s). Disentanglement-based DG decomposes a feature representation into understandable compositions/sub-features, with one part being the domain-shared/invariant part and the other the domain-specific part, formulated as:

$$\min_{g_c, g_s, f} \mathbb{E}_{x,y}\, \ell(f(g_c(x)), y) + \lambda \ell_{reg} + \mu \ell_{recon}\left(\left[g_c(x), g_s(x)\right], x\right), \tag{11.22}$$

where $g_c$ and $g_s$ denote the domain-shared and domain-specific feature representations, respectively, and $\lambda, \mu$ are trade-off parameters. The loss $\ell_{reg}$ is a regularization term that explicitly encourages the separation of the domain-shared and domain-specific features, and $\ell_{recon}$ denotes a reconstruction loss that prevents the loss of information. Note that $[g_c(x), g_s(x)]$ denotes the combination/integration of the two kinds of features (which is not limited to concatenation).

Looking back at earlier works, researchers proposed UndoBias (Khosla et al., 2012) for domain generalization based on SVM. SVM is a linear large-margin model whose core is to learn the weights $w$ for a linear transformation. UndoBias therefore assumes that we can learn a general weight $w_0$ from the training data and a domain-specific bias $\Delta_i$ for each domain. The weight $w_i$ on domain i is then represented as:

$$w_i = w_0 + \Delta_i. \tag{11.23}$$

Then, we obtain a disentanglement model:

$$\min\, \frac{1}{2}\|w_0\|^2 + \frac{\lambda}{2}\sum_{i=1}^{n}\|\Delta_i\|^2 + \text{Constraints}, \tag{11.24}$$

where Constraints denotes the other constraints of SVM.

Later works share a similar idea with UndoBias. The works of Niu et al. (2015) and Xu et al. (2014) exploited multi-view learning for domain generalization and adopted low-rank SVM for computation. Ding and Fu (2017) treated the network as two modules: a domain-invariant module and a domain-specific module. Li et al. (2017) proposed deeper and broader domain generalization with Tucker decomposition to reduce the computation cost. Fang et al. (2013) proposed to learn an unbiased distance metric that generalizes to any domain. In other works, Truong et al. (2019) proposed to define the environment as a Gaussian mixture model of the training data: they first learned embedding representations for each class, then maximized the inter-class distance to learn the distribution information of each class, and finally generated the most dissimilar data for each class for generalization. Shao et al. (2019) proposed feature extraction with adversarial learning. The work of Zunino et al. (2021) proposed to discover the disentanglement relation by observing heat maps. The domain-invariant variational auto-encoder (DIVA) (Ilse et al., 2020) disentangles the features into domain information, category information, and other information, learned in the VAE framework. Peng et al. (2019b) disentangled the fine-grained domain information and category information learned in VAEs. Qiao et al. (2020) also used VAEs for disentanglement, proposing a Unified Feature Disentanglement Network (UFDN) that treats both data domains and image attributes of interest as latent factors to be disentangled. Similarly, Zhang et al. (2021) disentangled the semantic and variational parts of the samples. Similar spirits are also found in Li et al. (2021) and Choi et al. (2021). Nam et al. (2021) proposed to disentangle the style and other information using generative models, so that their method works for both adaptation and generalization. Generative models can not only improve OOD performance but can also be used for generation tasks, which we believe is useful for many potential applications.


11.4 Other Learning Paradigms for Domain Generalization

Other than data manipulation and domain-invariant representation learning, there are other methods using different learning paradigms for domain generalization. In this section, we categorize them into three types: ensemble learning, meta-learning, and other paradigms.

11.4.1 Ensemble Learning

Different from other methods, ensemble learning does not explicitly handle the distribution gap; instead, it designs the network architecture and training objectives to learn an ensemble over the multiple source domains for generalization. We can train a domain-specific model $f_i$ for each domain; any sample can then be represented as an ensemble over the N domains, i.e., the N domains act as a set of basis representations for all the data. The core question is how to ensemble them.

A very intuitive manner is output ensembling. Mancini et al. (2018) proposed a dynamic weighting method. They built a network for each domain, where the feature network is shared and the classifier is domain-specific (i.e., there are N classifiers for N source domains). The learning objective can be seen as a weighted combination of all predictions:

$$f = \sum_{i=1}^{N} w_i f_i(x), \tag{11.25}$$

where the domain weight $w_i$ is learned by another network, which we call the domain classification network. Its loss is composed of two modules:

$$\mathcal{L} = \mathcal{L}_c + \lambda \mathcal{L}_{dom}, \tag{11.26}$$

where $\mathcal{L}_{dom}$ is the domain classification loss computed by cross-entropy and $\mathcal{L}_c$ is the supervised loss on the training data. He et al. (2018) adopted the attention mechanism and proposed ensemble learning for object recognition, which they call the domain-attention model. D'Innocente and Caputo (2018) adopted a similar idea by leveraging domain-specific layer aggregation for better ensemble learning. Recently, Qin et al. (2022) applied ensemble learning-based domain generalization to human activity recognition. Their algorithm, adaptive feature fusion for activity recognition (AFFAR), leverages a domain-specific module and a domain-invariant module for generalization on activity data. As shown in Fig. 11.5, the domain-specific module ($\mathcal{L}_{dsr}$) is an ensemble learning paradigm that learns the specific features for each domain, while domain-invariant learning ($\mathcal{L}_{dir}$) is based on MMD or adversarial learning.

Fig. 11.5 Illustration of the AFFAR algorithm proposed in Qin et al. (2022)

The total loss for AFFAR is computed as:

$$\mathcal{L} = \mathcal{L}_{cls} + \lambda \mathcal{L}_{dsr} + \beta \mathcal{L}_{dir}. \tag{11.27}$$

AFFAR is just an application of ensemble learning-based domain generalization and it achieved competitive performance compared with other state-of-the-art methods.
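The weighted combination of Eq. (11.25) can be sketched as follows; the softmax normalization of the raw domain scores is an illustrative assumption rather than the exact scheme of Mancini et al. (2018):

```python
import numpy as np

def weighted_ensemble(probs, w):
    """Combine per-domain predictions with domain weights (Eq. 11.25).

    probs: (N, C) array; row i is the class-probability output of f_i(x).
    w:     (N,) raw domain scores, e.g. from a domain classification network.
    """
    w = np.exp(w - w.max())
    w = w / w.sum()                    # softmax-normalized domain weights
    return w @ probs                   # (C,) combined prediction
```

When the domain scores strongly favor one source domain, the combined prediction approaches that domain's classifier output; uniform scores recover plain average voting.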

11.4.2 Meta-Learning for Domain Generalization

The key idea of meta-learning is to learn a general model from multiple tasks by either optimization-based methods (Finn et al., 2017a), metric-based learning (Snell et al., 2017), or model-based methods (Santoro et al., 2016). This idea has also been exploited for domain generalization; more information on meta-learning can be found in Sect. 14.3. These methods divide the data from the multiple source domains into meta-train and meta-test sets to simulate domain shift. Denoting by $\theta$ the model parameters to be learned, meta-learning can be formulated as:

$$\theta^* = \text{Learn}(\mathcal{S}_{mte}; \phi^*) = \text{Learn}\left(\mathcal{S}_{mte}; \text{MetaLearn}(\mathcal{S}_{mtrn})\right), \tag{11.28}$$

where $\phi^* = \text{MetaLearn}(\mathcal{S}_{mtrn})$ denotes the meta-learned parameters from the meta-train set $\mathcal{S}_{mtrn}$, which are then used to learn the model parameters $\theta^*$ on the meta-test set $\mathcal{S}_{mte}$. The two functions $\text{Learn}(\cdot)$ and $\text{MetaLearn}(\cdot)$ are to be designed and implemented by different meta-learning algorithms, which corresponds to a bi-level optimization problem. The gradient update can be formulated as:

$$\theta = \theta - \eta\, \frac{\partial\left(\ell(\mathcal{S}_{mte}; \theta) + \beta\, \ell(\mathcal{S}_{mtrn}; \phi)\right)}{\partial \theta}, \tag{11.29}$$

where $\eta$ and $\beta$ are the learning rates for the outer and inner loops, respectively.

Inspired by MAML (Finn et al., 2017b), Li et al. (2018a) proposed MLDG (meta-learning for domain generalization). MLDG splits the data from the source domains into meta-train and meta-test sets to simulate the domain shift situation and learn general representations. Balaji et al. (2018) proposed to learn a meta regularizer (MetaReg) for the classifier. Li et al. (2019) proposed feature-critic training for the feature extractor by designing a meta optimizer. Dou et al. (2019) used a similar idea to MLDG and additionally introduced two complementary losses to explicitly regularize the semantic structure of the feature space. Du et al. (2020) proposed an extended version of the information bottleneck named Meta Variational Information Bottleneck (MetaVIB). They regularize the Kullback-Leibler (KL) divergence between the distributions of latent encodings of samples that share the same category but come from different domains, and learn to generate weights using stochastic neural networks. Recently, some works also adopted meta-learning for semi-supervised DG or discriminative DG (Chen et al., 2022; Sharifi-Noghabi et al., 2020; Wang et al., 2021a; Zhao et al., 2021a,b). Meta-learning is widely adopted in DG research and can be incorporated into several paradigms such as disentanglement (Bui et al., 2021). Meta-learning performs well with many domains since it can seek transferable knowledge from multiple tasks.

Li et al. (2019) adopted meta-learning and reinforcement learning ideas for domain generalization. They proposed a feature-critic method, which solves for a feature extractor $f_\theta$ and a classifier $g_\phi$ based on meta-learning. After feature extraction, the loss is divided into two parts: the first is the normal classification loss, and the other is the augment loss, which governs how the feature extractor $f_\theta$ handles the distribution divergence. The augment loss is denoted by $h_\omega$. To learn it, they adopted the delayed reward idea from reinforcement learning: they relate the old parameters $\theta^{(OLD)}$ and the new ones $\theta^{(NEW)}$ and propose the following decision rule:

$$\max_\omega \sum_{\mathcal{D}_j \in \mathcal{D}_{val}} \sum_{d_j \in \mathcal{D}_j} \tanh\left(\gamma\left(\theta^{(NEW)}, \phi_j, x^{(j)}, y^{(j)}\right) - \gamma\left(\theta^{(OLD)}, \phi_j, x^{(j)}, y^{(j)}\right)\right). \tag{11.30}$$


11 Generalization in Transfer Learning

In order to formulate the $\gamma$ function, they proposed to compare the loss for the old and new parameters as the objective. The final objective is:

$$\min_{\omega} \sum_{D_j \in \mathcal{D}_{val}} \sum_{d_j \in D_j} \tanh\Big(\ell^{(\mathrm{CE})}\big(g_{\phi_j}(f_{\theta^{(\mathrm{NEW})}}(x^{(j)})), y^{(j)}\big) - \ell^{(\mathrm{CE})}\big(g_{\phi_j}(f_{\theta^{(\mathrm{OLD})}}(x^{(j)})), y^{(j)}\big)\Big) \tag{11.31}$$

There are more works that used meta-learning for domain generalization. In fact, meta-learning is a general framework and more ideas can be integrated into it for better performance.
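To make the meta-train/meta-test recipe shared by these methods concrete, the following sketch runs a first-order variant of MLDG (the second-order term of the meta-gradient is dropped) on a toy linear-regression problem with three synthetic source domains. The data, model, and hyperparameters are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy source domains sharing the labeling rule y = w_true . x,
# but with domain-specific input shifts (simulated domain shift)
w_true = np.array([1.0, -2.0])

def make_domain(shift, n=200):
    x = rng.normal(size=(n, 2)) + shift
    y = x @ w_true + 0.1 * rng.normal(size=n)
    return x, y

domains = [make_domain(s) for s in (-1.0, 0.0, 1.0)]

def mse_grad(w, x, y):
    """MSE loss gradient for a linear model."""
    err = x @ w - y
    return 2 * x.T @ err / len(y)

def mldg_step(w, train_doms, test_dom, alpha=0.01, beta=1.0, eta=0.01):
    # Meta-train: average gradient over the meta-train domains
    g_tr = np.zeros_like(w)
    for x, y in train_doms:
        g_tr += mse_grad(w, x, y) / len(train_doms)
    # Virtual inner update, then meta-test gradient at the updated point
    w_inner = w - alpha * g_tr
    g_te = mse_grad(w_inner, *test_dom)
    # First-order MLDG update: both meta-train and meta-test must improve
    return w - eta * (g_tr + beta * g_te)

w = np.zeros(2)
for step in range(500):
    # Randomly split the source domains into meta-train / meta-test
    i = rng.integers(len(domains))
    test_dom = domains[i]
    train_doms = [d for j, d in enumerate(domains) if j != i]
    w = mldg_step(w, train_doms, test_dom)
# w should now approach w_true despite the per-domain input shifts
```

Because every update also rewards performance on a held-out "virtual" domain, the learned parameters are pushed toward solutions that transfer across domain shifts rather than overfitting any single source.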

11.4.3 Other Learning Paradigms

There are many other learning paradigms in machine learning that could be employed for domain generalization. For instance, self-supervised learning is a recently popular learning paradigm that builds supervised tasks from large-scale unlabeled data (Jing and Tian, 2020). Inspired by this, Carlucci et al. (2019) introduced self-supervision tasks that learn generalized representations by solving jigsaw puzzles. Additionally, Huang et al. (2020) proposed a self-challenging training algorithm that aims to learn general representations by manipulating gradients. Ryu et al. (2019) used random forests to improve the generalization ability of convolutional neural networks (CNNs). They sampled triplets based on the probability mass function of the split results given by a random forest, which are used for updating the CNN parameters via a triplet loss. In the future, there could be more works on DG that use other learning paradigms.
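To make the jigsaw pretext task concrete, the sketch below builds a self-supervised (input, label) pair from an unlabeled image: the image is cut into a 2×2 patch grid, the patches are shuffled by one of a fixed set of permutations, and the permutation index becomes the label. This is only an illustration of the idea, not the code of Carlucci et al. (2019).

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Fixed set of permutations of a 2x2 patch grid; the permutation
# index serves as the self-supervised classification label.
PERMS = list(itertools.permutations(range(4)))  # 24 permutations

def to_patches(img, grid=2):
    """Split an HxW image into grid*grid patches in row-major order."""
    h, w = img.shape[0] // grid, img.shape[1] // grid
    return [img[i * h:(i + 1) * h, j * w:(j + 1) * w]
            for i in range(grid) for j in range(grid)]

def jigsaw_example(img):
    """Build one supervised pair (shuffled patches, permutation index)
    from an unlabeled image."""
    label = rng.integers(len(PERMS))
    patches = to_patches(img)
    shuffled = [patches[k] for k in PERMS[label]]
    return np.stack(shuffled), label

img = rng.normal(size=(32, 32))   # stands in for an unlabeled image
x, y = jigsaw_example(img)        # x: (4, 16, 16) patches, y in [0, 24)
```

A network trained to predict `y` from the shuffled patches must learn spatial structure that is useful across domains, which is the intuition behind using such pretext tasks for DG.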

11.5 Domain Generalization Theory

11.5.1 Average Risk Estimation Error Bound

The first line of domain generalization theory considers the case where the target domain is totally unknown (not even unsupervised data is available) and measures the average risk over all possible target domains. Assume that all possible target distributions follow an underlying hyper-distribution $\mathcal{P}$ on $(x, y)$ distributions: $P_{XY}^{t} \sim \mathcal{P}$, and that the source distributions also follow the same hyper-distribution: $P_{XY}^{1}, \cdots, P_{XY}^{M} \sim \mathcal{P}$. For generalization to any possible target domain, the classifier to be learned in this case also includes the domain information $P_X$ in its input, so prediction takes the form $y = h(P_X, x)$ on the domain with distribution $P_{XY}$. For such a classifier $h$, its average risk over all possible target domains is then given by:

$$\mathcal{E}(h) := \mathbb{E}_{P_{XY} \sim \mathcal{P}}\, \mathbb{E}_{(x, y) \sim P_{XY}} \big[\ell(h(P_X, x), y)\big], \tag{11.32}$$

where $\ell$ is a loss function on $\mathcal{Y}$. Exactly evaluating the expectations is impossible, but we can estimate them using finite domains/distributions following $\mathcal{P}$, and finite $(x, y)$ samples following each distribution. As we have assumed $P_{XY}^{1}, \cdots, P_{XY}^{M} \sim \mathcal{P}$, the source domains and their supervised data can serve for this estimation:

$$\hat{\mathcal{E}}(h) := \frac{1}{M} \sum_{i=1}^{M} \frac{1}{n_i} \sum_{j=1}^{n_i} \ell\big(h(U_i, x_j^i), y_j^i\big), \tag{11.33}$$

where we use the supervised dataset $U_i := \{x_j^i \mid (x_j^i, y_j^i) \in S_i\}$ from domain $i$ as an empirical estimation for $P_X^i$.

The first problem to consider is how well such an estimate approximates the target $\mathcal{E}(h)$. This can be measured by the largest difference between $\mathcal{E}(h)$ and $\hat{\mathcal{E}}(h)$ over some space of $h$. To our knowledge, this was first analyzed by Blanchard et al. (2011), where the space of $h$ is taken as a reproducing kernel Hilbert space (RKHS). However, different from the common treatment, the classifier $h$ here also depends on the distribution $P_X$, so the kernel defining the RKHS should be of the form $\bar{k}((P_X^1, x_1), (P_X^2, x_2))$. Blanchard et al. (2011) construct such a kernel using kernels $k_X$, $k'_X$ on $\mathcal{X}$ and a kernel $\kappa$ on the RKHS $\mathcal{H}_{k'_X}$ of kernel $k'_X$:

$$\bar{k}((P_X^1, x_1), (P_X^2, x_2)) := \kappa\big(\Phi_{k'_X}(P_X^1), \Phi_{k'_X}(P_X^2)\big)\, k_X(x_1, x_2),$$

where $\Phi_{k'_X}(P_X) := \mathbb{E}_{x \sim P_X}[k'_X(x, \cdot)] \in \mathcal{H}_{k'_X}$ is the kernel embedding of the distribution $P_X$ via the kernel $k'_X$. The result is given in the following theorem, which bounds the largest average risk estimation error within an origin-centered closed ball $B_{\mathcal{H}_{\bar{k}}}(r)$ of radius $r$ in the RKHS $\mathcal{H}_{\bar{k}}$ of kernel $\bar{k}$, in a slightly simplified case where $n_1 = \cdots = n_M =: n$.

Theorem 11.1 (Average Risk Estimation Error Bound for Binary Classification (Blanchard et al., 2011)) Assume that the loss function $\ell$ is $L_\ell$-Lipschitz in its first argument and is bounded by $B_\ell$. Assume also that the kernels $k_X$, $k'_X$, and $\kappa$ are bounded by $B_k^2$, $B_{k'}^2 \ge 1$, and $B_\kappa^2$, respectively, and that the canonical feature map $\Phi_\kappa : v \in \mathcal{H}_{k'_X} \mapsto \kappa(v, \cdot) \in \mathcal{H}_\kappa$ of $\kappa$ is $L_\kappa$-Hölder of order $\alpha \in (0, 1]$ on the closed ball $B_{\mathcal{H}_{k'_X}}(B_{k'})$.¹ Then for any $r > 0$ and $\delta \in (0, 1)$, with probability at least $1 - \delta$, it holds that:

$$\sup_{h \in B_{\mathcal{H}_{\bar{k}}}(r)} \big|\hat{\mathcal{E}}(h) - \mathcal{E}(h)\big| \le C \Big( B_\ell \sqrt{-M^{-1} \log \delta} + r B_k L_\ell \big( B_{k'} L_\kappa (n^{-1} \log(M/\delta))^{\alpha/2} + B_\kappa / \sqrt{M} \big) \Big), \tag{11.34–11.35}$$

where $C$ is a constant.

¹ This means that for any $u, v \in B_{\mathcal{H}_{k'_X}}(B_{k'})$, it holds that $\|\Phi_\kappa(u) - \Phi_\kappa(v)\| \le L_\kappa \|u - v\|^\alpha$, where the norms are those of the respective RKHSs.

The bound becomes larger in general if $(M, n)$ is replaced with $(1, Mn)$. This indicates that using domain-wise datasets is better than just pooling them into one mixed dataset, so the domain information plays a role. This result was later extended in Muandet et al. (2013), and Deshmukh et al. (2019) give a bound for multi-class classification in a similar form.
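The empirical average risk of Eq. (11.33) is itself just a double average and can be computed directly. The sketch below uses toy Gaussian domains and a hypothetical domain-aware classifier `h(U, x)` that thresholds on the empirical mean of the unlabeled set `U` (standing in for $h(P_X, x)$); all names and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# M source domains, each with n supervised samples (toy data)
M, n = 3, 50
domains = []
for i in range(M):
    x = rng.normal(loc=i, size=(n, 2))
    y = (x.sum(axis=1) > 2 * i).astype(int)  # true labeling rule per domain
    domains.append((x, y))

def h(U, x):
    """Toy domain-aware classifier: the empirical mean of the unlabeled
    set U supplies a per-domain decision threshold."""
    return (x.sum(axis=1) > U.sum(axis=1).mean()).astype(int)

def zero_one(pred, y):
    return (pred != y).astype(float)

# Eq. (11.33): average over domains of the per-domain average loss,
# with the inputs of S_i serving as the empirical estimate U_i of P_X^i
risk_hat = np.mean([zero_one(h(x, x), y).mean() for x, y in domains])
# risk_hat is small here because the threshold estimate is accurate
```

Because the classifier receives domain information, it can adapt its decision rule per domain, which is exactly what the average-risk formulation requires.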

11.5.2 Generalization Risk Bound

Another line of DG theory considers the risk on a specific target domain, under the assumption of covariate shift (i.e., the labeling function $h^*$ or $P_{Y|X}$ is the same over all domains). This measurement is similar to what is considered in domain adaptation theory in Sect. 7.1, so we adopt the same definitions for the source risks $\ell^1, \cdots, \ell^M$ and the target risk $\ell^t$. With the covariate shift assumption, each domain is characterized by its distribution on $\mathcal{X}$. Albuquerque et al. (2019) then consider approximating the target domain distribution $P_X^t$ within the convex hull of the source domain distributions: $\Lambda := \{\sum_{i=1}^{M} \pi_i P_X^i \mid \pi \in \Delta_M\}$, where $\Delta_M$ is the $(M-1)$-dimensional simplex, so that each $\pi$ represents normalized mixing weights. Similar to the domain adaptation case, the distribution difference is measured by the $\mathcal{H}$-divergence to include the influence of the classifier class.

Theorem 11.2 (Domain Generalization Error Bound (Albuquerque et al., 2019)) Let $\gamma := \min_{\pi \in \Delta_M} d_{\mathcal{H}}(P_X^t, \sum_{i=1}^{M} \pi_i P_X^i)$, with minimizer $\pi^*$,² be the distance of $P_X^t$ from the convex hull $\Lambda$, and let $P_X^* := \sum_{i=1}^{M} \pi_i^* P_X^i$ be the best approximator within $\Lambda$. Let $\rho := \sup_{P'_X, P''_X \in \Lambda} d_{\mathcal{H}}(P'_X, P''_X)$ be the diameter of $\Lambda$. Then it holds that:

$$\ell^t(h) \le \sum_{i=1}^{M} \pi_i^* \ell^i(h) + \frac{\gamma + \rho}{2} + \lambda_{\mathcal{H}, (P_X^t, P_X^*)}, \tag{11.36}$$

where $\lambda_{\mathcal{H}, (P_X^t, P_X^*)}$ is the ideal joint risk across the target domain and the domain with the best approximator distribution $P_X^*$.

² The original presentation does not mention that $\pi^*$ is the minimizer, but the proof indicates so.

The result can be seen as a generalization of the domain adaptation bounds in Chap. 7 to the case of multiple source domains. Again similar to the domain adaptation case, this bound motivates DG methods based on domain-invariant representations, which simultaneously minimize the risks over all source domains (corresponding to the first term of the bound) as well as the representation distribution differences among the source and target domains, in the hope of reducing $\gamma$ and $\rho$ in the representation space. Recently, Lu et al. (2022c) proposed a new bound based on Mixup and domain-invariant learning; their algorithm showed that domain-invariant Mixup indeed enlarges the distribution coverage provided by the training domains. To sum up, generalization theory is an active research area, and other researchers have derived different DG bounds using informativeness (Ye et al., 2021) and adversarial training (Albuquerque et al., 2019; Sicilia et al., 2021; Ye et al., 2021; Deshmukh et al., 2019).
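To build intuition for $\gamma$ in Theorem 11.2, the sketch below approximates a one-dimensional target distribution within the convex hull of two source distributions, using histograms and the total variation distance as a simple stand-in for the $\mathcal{H}$-divergence (an assumption made only for illustration; the theorem itself uses the classifier-dependent divergence).

```python
import numpy as np

rng = np.random.default_rng(0)

def hist(samples, bins):
    """Normalized histogram as a discrete distribution estimate."""
    h, _ = np.histogram(samples, bins=bins)
    return h / h.sum()

def tv(p, q):
    """Total variation distance: illustrative stand-in for d_H."""
    return 0.5 * np.abs(p - q).sum()

bins = np.linspace(-6, 6, 25)
p1 = hist(rng.normal(-2, 1, 20000), bins)   # source domain 1
p2 = hist(rng.normal(+2, 1, 20000), bins)   # source domain 2
pt = hist(rng.normal(0.5, 1, 20000), bins)  # target domain

# gamma = min over mixing weights pi of d(P_t, sum_i pi_i P_X^i);
# for M = 2 the simplex is a line, so a grid search suffices
grid = np.linspace(0, 1, 101)
dists = [tv(pt, a * p1 + (1 - a) * p2) for a in grid]
gamma = min(dists)
# mixing the sources approximates the target at least as well
# as either source alone, since both endpoints lie in the hull
```

When the target falls near the convex hull of the sources, $\gamma$ is small and the bound in Eq. (11.36) is dominated by the weighted source risks, which is the regime DG methods hope to operate in.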

11.6 Practice

In this section, we utilize PyTorch to implement ERM and CORAL (Sun and Saenko, 2016) for domain generalization. We focus on the entire implementation process, which consists of three parts: the dataloader, training, and testing. For the complete code and more methods, please refer to: https://github.com/jindongwang/transferlearning/tree/master/code/DeepDG.

11.6.1 Dataloader in Domain Generalization

First, an ImageDataset for domain generalization needs to be defined. Compared to a common dataset, ImageDataset provides one more item, the domain information, which may be used by some algorithms.

Domain generalization dataloader

```python
import numpy as np
from torchvision.datasets import ImageFolder
from torchvision.datasets.folder import default_loader

class ImageDataset(object):
    def __init__(self, root_dir, domain_name, domain_label=-1,
                 transform=None, target_transform=None):
        self.imgs = ImageFolder(root_dir + domain_name).imgs
        imgs = [item[0] for item in self.imgs]
        labels = [item[1] for item in self.imgs]
        self.labels = np.array(labels)
        self.x = imgs
        self.transform = transform
        self.target_transform = target_transform
        self.loader = default_loader
        self.dlabels = np.ones(self.labels.shape) * domain_label
        # Not in the book's listing: __getitem__ and __len__ below use
        # self.indices, so it must be initialized
        self.indices = np.arange(len(self.labels))

    def target_trans(self, y):
        if self.target_transform is not None:
            return self.target_transform(y)
        else:
            return y

    def input_trans(self, x):
        if self.transform is not None:
            return self.transform(x)
        else:
            return x

    def __getitem__(self, index):
        index = self.indices[index]
        img = self.input_trans(self.loader(self.x[index]))
        ctarget = self.target_trans(self.labels[index])
        dtarget = self.target_trans(self.dlabels[index])
        return img, ctarget, dtarget

    def __len__(self):
        return len(self.indices)
```

ImageDataset extracts data from the original files. Its parameters are defined as follows:
• root_dir: file path.
• domain_name: each domain usually has its own name, such as Real_World in Office-Home.
• domain_label: you can assign domain labels yourself.
• transform: transformations of x.
• target_transform: transformations of y.
Once the ImageDataset is obtained, we utilize InfiniteDataLoader to generate batch data endlessly.

Infinite dataloader

```python
import torch

class _InfiniteSampler(torch.utils.data.Sampler):
    def __init__(self, sampler):
        self.sampler = sampler

    def __iter__(self):
        while True:
            for batch in self.sampler:
                yield batch

class InfiniteDataLoader:
    def __init__(self, dataset, batch_size, num_workers):
        super().__init__()
        sampler = torch.utils.data.RandomSampler(dataset, replacement=True)
        batch_sampler = torch.utils.data.BatchSampler(
            sampler, batch_size=batch_size, drop_last=True)
        self._infinite_iterator = iter(torch.utils.data.DataLoader(
            dataset,
            num_workers=num_workers,
            batch_sampler=_InfiniteSampler(batch_sampler)))

    def __iter__(self):
        while True:
            yield next(self._infinite_iterator)
```
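Stripped of the PyTorch machinery, the endless-batch behavior amounts to re-iterating a finite batch sampler forever. A framework-free sketch of the same idea (the `infinite_batches` helper is hypothetical, written only for illustration):

```python
import itertools

def infinite_batches(indices, batch_size):
    """Yield fixed-size batches forever by cycling a finite index list,
    mirroring what _InfiniteSampler does for a BatchSampler
    (drop_last behavior: a trailing short batch is skipped)."""
    while True:
        for start in range(0, len(indices) - batch_size + 1, batch_size):
            yield indices[start:start + batch_size]

gen = infinite_batches(list(range(6)), batch_size=2)
# Take more batches than one pass over the data provides:
batches = list(itertools.islice(gen, 5))
# batches == [[0, 1], [2, 3], [4, 5], [0, 1], [2, 3]]
```

This is why the training loop below can simply call `next(...)` for a fixed number of steps per epoch without ever handling a `StopIteration`.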

11.6.2 Training and Testing

After completing the data preparation process, we begin to train the generalization model. During training, each source domain corresponds to one InfiniteDataLoader and generates one batch at each step. The final model is selected according to the accuracy on the validation data split from the sources.

Domain generalization training

```python
for epoch in range(args.max_epoch):
    for iter_num in range(args.steps_per_epoch):
        minibatches_device = [(data) for data in next(train_minibatches_iterator)]
        step_vals = algorithm.update(minibatches_device, opt, sch)

    if (epoch in [int(args.max_epoch * 0.7), int(args.max_epoch * 0.9)]) and (not args.schuse):
        print('manually decrease lr')
        for params in opt.param_groups:
            params['lr'] = params['lr'] * 0.1

    if (epoch == (args.max_epoch - 1)) or (epoch % args.checkpoint_freq == 0):
        for item in acc_type_list:
            acc_record[item] = np.mean(np.array([modelopera.accuracy(
                algorithm, eval_loaders[i]) for i in eval_name_dict[item]]))
        if acc_record['valid'] > best_valid_acc:
            best_valid_acc = acc_record['valid']
            target_acc = acc_record['target']
```


Finally, we give the code to evaluate the model.

Domain generalization testing

```python
def accuracy(network, loader):
    correct = 0
    total = 0
    network.eval()
    with torch.no_grad():
        for data in loader:
            x = data[0].cuda().float()
            y = data[1].cuda().long()
            p = network.predict(x)
            if p.size(1) == 1:
                correct += (p.gt(0).eq(y).float()).sum().item()
            else:
                correct += (p.argmax(1).eq(y).float()).sum().item()
            total += len(x)
    network.train()
    return correct / total
```

11.6.3 Examples: ERM and CORAL

Now, we give an example, which implements ERM.

ERM

```python
class ERM(Algorithm):
    def __init__(self, args):
        super(ERM, self).__init__(args)
        self.featurizer = get_fea(args)
        self.classifier = common_network.feat_classifier(
            args.num_classes, self.featurizer.in_features, args.classifier)
        self.network = nn.Sequential(
            self.featurizer, self.classifier)

    def update(self, minibatches, opt, sch):
        all_x = torch.cat([data[0].cuda().float() for data in minibatches])
        all_y = torch.cat([data[1].cuda().long() for data in minibatches])
        loss = F.cross_entropy(self.predict(all_x), all_y)

        opt.zero_grad()
        loss.backward()
        opt.step()
        if sch:
            sch.step()
        return {'class': loss.item()}

    def predict(self, x):
        return self.network(x)
```

Fig. 11.6 Training process of ERM

The function get_fea obtains the feature network, which can be ResNet-18, ResNet-50, or others. common_network.feat_classifier usually contains one fully connected linear layer. As we can see, ERM simply combines all data from the different sources and utilizes cross_entropy to optimize the model. Figure 11.6 shows the training process of ERM on Office-Home with Art as the target. As we can see, target_acc may decrease while train_acc and valid_acc increase. Since we cannot access the target, we can only select the model via valid_acc, which may not give the best model. The final target_acc is 57.77%. Although a model with 58.22% accuracy appears during training, we cannot choose it since there is no access to the target during training. Next, we give another example, which implements CORAL.

CORAL

```python
class CORAL(ERM):
    def __init__(self, args):
        super(CORAL, self).__init__(args)
        self.args = args
        self.kernel_type = "mean_cov"

    def coral(self, x, y):
        mean_x = x.mean(0, keepdim=True)
        mean_y = y.mean(0, keepdim=True)
        cent_x = x - mean_x
        cent_y = y - mean_y
        cova_x = (cent_x.t() @ cent_x) / (len(x) - 1)
        cova_y = (cent_y.t() @ cent_y) / (len(y) - 1)

        mean_diff = (mean_x - mean_y).pow(2).mean()
        cova_diff = (cova_x - cova_y).pow(2).mean()
        return mean_diff + cova_diff

    def update(self, minibatches, opt, sch):
        objective = 0
        penalty = 0
        nmb = len(minibatches)
        features = [self.featurizer(
            data[0].cuda().float()) for data in minibatches]
        classifs = [self.classifier(fi) for fi in features]
        targets = [data[1].cuda().long() for data in minibatches]
        for i in range(nmb):
            objective += F.cross_entropy(classifs[i], targets[i])
            for j in range(i + 1, nmb):
                penalty += self.coral(features[i], features[j])
        objective /= nmb
        if nmb > 1:
            penalty /= (nmb * (nmb - 1) / 2)

        opt.zero_grad()
        (objective + (self.args.mmd_gamma * penalty)).backward()
        opt.step()
        if sch:
            sch.step()
        if torch.is_tensor(penalty):
            penalty = penalty.item()
        return {'class': objective.item(), 'coral': penalty,
                'total': (objective.item() + (self.args.mmd_gamma * penalty))}
```
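To sanity-check the coral penalty, here is a NumPy replica (an illustrative re-implementation for verification, not part of DeepDG): identical feature sets should incur zero penalty, while a mean-shifted "domain" should be penalized.

```python
import numpy as np

def coral_np(x, y):
    """NumPy version of the CORAL penalty: squared differences of the
    feature means and of the feature covariance matrices."""
    mean_x, mean_y = x.mean(0, keepdims=True), y.mean(0, keepdims=True)
    cx, cy = x - mean_x, y - mean_y
    cova_x = cx.T @ cx / (len(x) - 1)
    cova_y = cy.T @ cy / (len(y) - 1)
    return ((mean_x - mean_y) ** 2).mean() + ((cova_x - cova_y) ** 2).mean()

rng = np.random.default_rng(0)
a = rng.normal(size=(100, 8))             # features from "domain" A
b = rng.normal(loc=2.0, size=(100, 8))    # mean-shifted "domain" B

# identical features -> zero penalty; shifted features -> positive penalty
print(coral_np(a, a), coral_np(a, b))
```

Minimizing this penalty between every pair of source-domain feature batches is what pushes the featurizer toward domain-invariant first- and second-order statistics.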

Figure 11.7 shows the training process of CORAL. CORAL finally achieves 59.29% accuracy on the target. Compared to ERM, CORAL achieves better performance on the same task in the same environment. The results demonstrate that learning domain-invariant features may bring better generalization.

11.7 Summary

In this chapter, we introduced the recently popular domain generalization research area for transfer learning. Seeking generalized models is always the ultimate goal of machine learning, and we hope there could be more works in this area.

Fig. 11.7 Training process of CORAL

References

Albuquerque, I., Monteiro, J., Falk, T. H., and Mitliagkas, I. (2019). Adversarial target-invariant representation learning for domain generalization. arXiv preprint arXiv:1911.00804.
Balaji, Y., Sankaranarayanan, S., and Chellappa, R. (2018). MetaReg: Towards domain generalization using meta-regularization. In NeurIPS, pages 998–1008.
Blanchard, G., Lee, G., and Scott, C. (2011). Generalizing from several related classification tasks to a new unlabeled sample. Advances in Neural Information Processing Systems, 24.
Bui, H., Tran, T., Tran, A. T., and Phung, D. (2021). Exploiting domain-specific features to enhance domain generalization. In Thirty-Fifth Conference on Neural Information Processing Systems.
Carlucci, F. M., D'Innocente, A., Bucci, S., Caputo, B., and Tommasi, T. (2019). Domain generalization by solving jigsaw puzzles. In CVPR, pages 2229–2238.
Chen, K., Zhuang, D., and Chang, J. M. (2022). Discriminative adversarial domain generalization with meta-learning based cross-domain validation. Neurocomputing, 467:418–426.
Choi, S., Jung, S., Yun, H., Kim, J. T., Kim, S., and Choo, J. (2021). RobustNet: Improving domain generalization in urban-scene segmentation via instance selective whitening. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11580–11590.
Deshmukh, A. A., Lei, Y., Sharma, S., Dogan, U., Cutler, J. W., and Scott, C. (2019). A generalization error bound for multi-class domain generalization. arXiv:1905.10392.
Ding, Z. and Fu, Y. (2017). Deep domain generalization with structured low-rank constraint. IEEE TIP, 27(1):304–313.
Dou, Q., de Castro, D. C., Kamnitsas, K., and Glocker, B. (2019). Domain generalization via model-agnostic learning of semantic features. In NeurIPS.
Du, Y., Xu, J., Xiong, H., Qiu, Q., Zhen, X., Snoek, C. G. M., and Shao, L. (2020). Learning to learn with variational information bottleneck for domain generalization. In ECCV.
D'Innocente, A. and Caputo, B. (2018). Domain generalization with domain-specific aggregation modules. In German Conference on Pattern Recognition, pages 187–198. Springer.
Erfani, S., Baktashmotlagh, M., Moshtaghi, M., Nguyen, V., Leckie, C., Bailey, J., and Kotagiri, R. (2016). Robust domain generalisation by enforcing distribution invariance. In AAAI, pages 1455–1461.
Fang, C., Xu, Y., and Rockmore, D. N. (2013). Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias. In ICCV, pages 1657–1664.
Finn, C., Abbeel, P., and Levine, S. (2017a). Model-agnostic meta-learning for fast adaptation of deep networks. In ICML.


Finn, C., Abbeel, P., and Levine, S. (2017b). Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1126–1135. JMLR.org.
Ganin, Y. and Lempitsky, V. (2015). Unsupervised domain adaptation by backpropagation. In ICML, pages 1180–1189.
Garg, V. K., Kalai, A., Ligett, K., and Wu, Z. S. (2021). Learn to expect the unexpected: Probably approximately correct domain generalization. In International Conference on Artificial Intelligence and Statistics.
Ghifary, M., Balduzzi, D., Kleijn, W. B., and Zhang, M. (2017). Scatter component analysis: A unified framework for domain adaptation and domain generalization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(7):1414–1430.
Ghifary, M., Bastiaan Kleijn, W., Zhang, M., and Balduzzi, D. (2015). Domain generalization for object recognition with multi-task autoencoders. In CVPR, pages 2551–2559.
Gong, R., Li, W., Chen, Y., and Gool, L. V. (2019). DLOW: Domain flow for adaptation and generalization. In CVPR, pages 2477–2486.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773.
He, W., Zheng, H., and Lai, J. (2018). Domain attention model for domain generalization in object detection. In PRCV, pages 27–39.
Hu, S., Zhang, K., Chen, Z., and Chan, L. (2019). Domain generalization via multidomain discriminant analysis. In UAI, volume 35.
Huang, Z., Wang, H., Xing, E. P., and Huang, D. (2020). Self-challenging improves cross-domain generalization. In ECCV, volume 2.
Ilse, M., Tomczak, J. M., Louizos, C., and Welling, M. (2020). DIVA: Domain invariant variational autoencoders. In Proceedings of the Third Conference on Medical Imaging with Deep Learning.
Jia, Y., Zhang, J., Shan, S., and Chen, X. (2020). Single-side domain generalization for face anti-spoofing. In CVPR, pages 8484–8493.
Jing, L. and Tian, Y. (2020). Self-supervised visual feature learning with deep neural networks: A survey. IEEE TPAMI.
Khirodkar, R., Yoo, D., and Kitani, K. (2019). Domain randomization for scene-specific car detection and pose estimation. In WACV, pages 1932–1940. IEEE.
Khosla, A., Zhou, T., Malisiewicz, T., Efros, A. A., and Torralba, A. (2012). Undoing the damage of dataset bias. In ECCV, pages 158–171. Springer.
Li, D., Yang, J., Kreis, K., Torralba, A., and Fidler, S. (2021). Semantic segmentation with generative models: Semi-supervised learning and strong out-of-domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8300–8311.
Li, D., Yang, Y., Song, Y.-Z., and Hospedales, T. M. (2017). Deeper, broader and artier domain generalization. In ICCV, pages 5542–5550.
Li, D., Yang, Y., Song, Y.-Z., and Hospedales, T. M. (2018a). Learning to generalize: Meta-learning for domain generalization. In AAAI.
Li, H., Pan, S. J., Wang, S., and Kot, A. (2018b). Domain generalization with adversarial feature learning. In CVPR, pages 5400–5409.
Li, Y., Gong, M., Tian, X., Liu, T., and Tao, D. (2018c). Domain generalization via conditional invariant representations. In AAAI.
Li, Y., Tian, X., Gong, M., Liu, Y., Liu, T., Zhang, K., and Tao, D. (2018d). Deep domain generalization via conditional invariant adversarial networks. In ECCV, pages 624–639.
Li, Y., Yang, Y., Zhou, W., and Hospedales, T. M. (2019). Feature-critic networks for heterogeneous domain generalization. In ICML.
Lu, W., Wang, J., and Chen, Y. (2022a). Local and global alignments for generalizable sensor-based human activity recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).


Lu, W., Wang, J., Chen, Y., Pan, S., Hu, C., and Qin, X. (2022b). Semantic-discriminative mixup for generalizable sensor-based cross-domain activity recognition. Proceedings of the ACM on Interactive, Mobile, Wearable, and Ubiquitous Technologies.
Lu, W., Wang, J., Qin, X., and Chen, Y. (2022c). Exploiting mixup for domain generalization. In International Conference on Machine Learning.
Mancini, M., Bulò, S. R., Caputo, B., and Ricci, E. (2018). Best sources forward: domain generalization through source-specific nets. In ICIP, pages 1353–1357.
Muandet, K., Balduzzi, D., and Schölkopf, B. (2013). Domain generalization via invariant feature representation. In ICML, pages 10–18.
Nam, H., Lee, H., Park, J., Yoon, W., and Yoo, D. (2021). Reducing domain gap by reducing style bias. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8690–8699.
Niu, L., Li, W., and Xu, D. (2015). Multi-view domain generalization for visual recognition. In ICCV, pages 4193–4201.
Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., and Wang, B. (2019a). Moment matching for multi-source domain adaptation. In ICCV, pages 1406–1415.
Peng, X., Huang, Z., Sun, X., and Saenko, K. (2019b). Domain agnostic learning with disentangled representations. In ICML.
Peng, X. and Saenko, K. (2018). Synthetic to real adaptation with generative correlation alignment networks. In WACV, pages 1982–1991. IEEE.
Prakash, A., Boochoon, S., Brophy, M., Acuna, D., Cameracci, E., State, G., Shapira, O., and Birchfield, S. (2019). Structured domain randomization: Bridging the reality gap by context-aware synthetic data. In ICRA, pages 7249–7255. IEEE.
Qiao, F., Zhao, L., and Peng, X. (2020). Learning to learn single domain generalization. In CVPR, pages 12556–12565.
Qin, X., Wang, J., Chen, Y., Lu, W., and Jiang, X. (2022). Domain generalization for activity recognition via adaptive feature fusion.
Rahman, M. M., Fookes, C., Baktashmotlagh, M., and Sridharan, S. (2020). Correlation-aware adversarial domain adaptation and generalization. Pattern Recognition, 100:107124.
Ryu, J., Kwon, G., Yang, M.-H., and Lim, J. (2019). Generalized convolutional forest networks for domain generalization and visual recognition. In ICLR.
Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. (2016). Meta-learning with memory-augmented neural networks. In ICML, pages 1842–1850.
Shankar, S., Piratla, V., Chakrabarti, S., Chaudhuri, S., Jyothi, P., and Sarawagi, S. (2018). Generalizing across domains via cross-gradient training. In ICLR.
Shao, R., Lan, X., Li, J., and Yuen, P. C. (2019). Multi-adversarial discriminative deep domain generalization for face presentation attack detection. In CVPR, pages 10023–10031.
Sharifi-Noghabi, H., Asghari, H., Mehrasa, N., and Ester, M. (2020). Domain generalization via semi-supervised meta learning. arXiv preprint arXiv:2009.12658.
Sicilia, A., Zhao, X., and Hwang, S. J. (2021). Domain adversarial neural networks for domain generalization: When it works and how to improve. arXiv preprint arXiv:2102.03924.
Snell, J., Swersky, K., and Zemel, R. S. (2017). Prototypical networks for few-shot learning. In NeurIPS.
Sun, B. and Saenko, K. (2016). Deep CORAL: Correlation alignment for deep domain adaptation. In ECCV, pages 443–450.
Truong, T.-D., Duong, C. N., Luu, K., and Tran, M.-T. (2019). Recognition in unseen domains: Domain generalization via universal non-volume preserving models. arXiv preprint:1905.13040.
Volpi, R., Namkoong, H., Sener, O., Duchi, J. C., Murino, V., and Savarese, S. (2018). Generalizing to unseen domains via adversarial data augmentation. In NeurIPS, pages 5334–5344.
Wang, B., Lapata, M., and Titov, I. (2021a). Meta-learning for domain generalization in semantic parsing. In NAACL.
Wang, J., Lan, C., Liu, C., Ouyang, Y., Zeng, W., and Qin, T. (2021b). Generalizing to unseen domains: A survey on domain generalization. In IJCAI Survey Track.


Wang, W., Liao, S., Zhao, F., Kang, C., and Shao, L. (2021c). DomainMix: Learning generalizable person re-identification without human annotations. In BMVC.
Wang, Y., Li, H., and Kot, A. C. (2020a). Heterogeneous domain generalization via domain mixup. In ICASSP, pages 3622–3626.
Wang, Z., Wang, Q., Lv, C., Cao, X., and Fu, G. (2020b). Unseen target stance detection with adversarial domain generalization. In IJCNN, pages 1–8.
Xu, Z., Li, W., Niu, L., and Xu, D. (2014). Exploiting low-rank structure from latent domains for domain generalization. In ECCV, pages 628–643. Springer.
Ye, H., Xie, C., Cai, T., Li, R., Li, Z., and Wang, L. (2021). Towards a theoretical framework of out-of-distribution generalization. In NeurIPS.
Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. (2018). mixup: Beyond empirical risk minimization. In ICLR.
Zhang, H., Zhang, Y.-F., Liu, W., Weller, A., Schölkopf, B., and Xing, E. P. (2021). Towards principled disentanglement for domain generalization. In ICML 2021 Machine Learning for Data Workshop.
Zhao, S., Gong, M., Liu, T., Fu, H., and Tao, D. (2020). Domain generalization via entropy regularization. In NeurIPS, volume 33.
Zhao, Y., Zhong, Z., Yang, F., Luo, Z., Lin, Y., Li, S., and Sebe, N. (2021a). Learning to generalize unseen domains via memory-based multi-source meta-learning for person re-identification. In CVPR.
Zhao, Y., Zhong, Z., Yang, F., Luo, Z., Lin, Y., Li, S., and Sebe, N. (2021b). Learning to generalize unseen domains via memory-based multi-source meta-learning for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6277–6286.
Zhou, F., Jiang, Z., Shui, C., Wang, B., and Chaib-draa, B. (2020a). Domain generalization with optimal transport and metric learning. ArXiv, abs/2007.10573.
Zhou, K., Yang, Y., Hospedales, T., and Xiang, T. (2020b). Deep domain-adversarial image generation for domain generalisation. In AAAI.
Zhou, K., Yang, Y., Qiao, Y., and Xiang, T. (2021). Domain generalization with MixStyle. In ICLR.
Zunino, A., Bargal, S. A., Volpi, R., Sameki, M., Zhang, J., Sclaroff, S., Murino, V., and Saenko, K. (2021). Explainable deep classification models for domain generalization. In CVPR.

Chapter 12

Safe and Robust Transfer Learning

In this chapter, we discuss the safety and robustness of transfer learning. By safety, we refer to the defense against attacks and the protection of data privacy. By robustness, we mean transfer mechanisms that prevent the model from learning spurious relations. Therefore, we introduce the related topics at three levels: (1) the framework level, i.e., a safe fine-tuning process against defect inheritance; (2) the data level, i.e., transfer learning systems that prevent data privacy leakage; and (3) the mechanism level, i.e., causal learning. The organization of this chapter is as follows. In Sect. 12.1, we introduce safe fine-tuning. In Sect. 12.2, we introduce federated transfer learning, which protects data privacy. Then, we introduce data-free transfer learning in Sect. 12.3 for adaptation without source data. Next, for robust transfer mechanisms, causal transfer learning is presented in Sect. 12.4. Finally, we conclude this chapter in Sect. 12.5.

12.1 Safe Transfer Learning

Transfer learning has been widely adopted in people's daily life (see the applications in Sect. 1.6). However, every coin has two sides. While great progress has been made, there lies a dark side: transfer learning models are not 100% safe, i.e., they can be attacked. For instance, there are thousands of pre-trained models publicly available on the Internet, such as PyTorch hub,1 TensorFlow hub,2 and the HuggingFace model hub.3 While they provide convenience for people to conduct research or

1 https://pytorch.org/hub/
2 https://www.tensorflow.org/hub
3 https://huggingface.co/models

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 J. Wang, Y. Chen, Introduction to Transfer Learning, Machine Learning: Foundations, Methodologies, and Applications, https://doi.org/10.1007/978-981-19-7584-4_12


complete downstream tasks, these models can also be swords for those with malignant will. Since the architectures and parameters of pre-trained models are open to everyone, anyone can leverage this knowledge to perform an attack. For example, the Google Cloud ML service suggests using the Inception model (Szegedy et al., 2015) as the base model for image classification. People with malignant will can then design attack algorithms against this model since Inception's architecture and parameters are open. Eventually, all services and products that exploit such a pre-trained model will be in danger. In this section, we introduce safe transfer learning, which may be used to overcome this drawback. In particular, "transfer learning" here mainly refers to the fine-tuning process, which is the most popular paradigm.

12.1.1 Can Transfer Learning Models Be Attacked?

We first answer the question: can transfer learning models be attacked? The answer is yes. This is due to the defect inheritance of transfer learning: if a pre-trained model contains some weaknesses that make it easy to attack, then the downstream models fine-tuned from it inevitably inherit those weaknesses. Such defect inheritance has also been verified by researchers through empirical studies. Back in 2018, Wang et al. (2018) performed the first study on attacking transfer learning models. They showed that, given access to the target data, one can attack transfer learning models by input perturbation. Then, Ji et al. (2018) verified these experiments and proposed an attack method that leverages the semantic consistency between labels. Later, the experiments in Rezaei and Liu (2020) implied that by simply giving extreme values to the softmax layer of a neural network, one can train adversarial attack models that classify an input into any label desired. According to recent statistics by Zhang et al. (2022), in computer vision and natural language processing tasks, the defect inheritance rates from the pre-trained models to the fine-tuned (downstream) models range from 52% to 97% for adversarial attacks and backdoor attacks. In a nutshell, if we use a pre-trained model in our task, then there is at least a 52% chance that the fine-tuned model can be attacked. Generally speaking, transfer learning is a certain type of software reuse technique. Software reuse is an important process in software engineering, where attackers can easily design attacks against a system that relies on public software, as shown in Fig. 12.1.

12.1 Safe Transfer Learning


Fig. 12.1 Transfer learning is similar to software reuse; both are easily attacked

12.1.2 Reducing Defect Inheritance

How, then, can we reduce the defect inheritance of the fine-tuned model to prevent attacks? There exist two naive approaches. The first is training from scratch on the downstream tasks. This method clearly inherits no defects since it abandons the pre-trained models. Obviously, this is not our goal: training from scratch is slow, and training on low-resource downstream data is likely to overfit. The second is the opposite: simply fine-tune the pre-trained models. This clearly inherits the defects, though it retains the benefits of pre-trained models (which we discussed in Sect. 8.2). There are two more methods in addition to these naive approaches. The third is called fix after transfer, which introduces mature defense methods after fine-tuning. This kind of method is extremely time-consuming and expensive; moreover, performing defense on limited downstream data is likely to fail. The last type of method is called fix before transfer, which is the opposite of the third type. This kind of approach trains on the downstream tasks and then performs knowledge distillation using the pre-trained models. A recent work in this category is Renofeation (Chin et al., 2021), which adds several regularization techniques to suppress defect inheritance: dropout, feature normalization, and stochastic weight averaging. However, such a method is rather complicated and not easy to train. What kinds of methods, then, do we need for safe transfer learning? There are at least three criteria:
• Effectiveness: the method should be effective in reducing defects.
• Generality: the method should be general and not specific to a certain type of neural network.
• Efficiency: the method should not bring heavy computation to the fine-tuning process.


12 Safe and Robust Transfer Learning

Fig. 12.2 The goal of safe transfer learning is to reduce the malignant knowledge (the red cross) and retain the benign knowledge (the blue circle)

Figure 12.2 illustrates the objective of safe transfer learning: to reduce the defect inheritance from the pre-trained models while preserving the benefits of transfer learning as much as possible. As the figure shows, we need to eliminate the malignant knowledge (the red cross) while retaining the benign knowledge (the blue circle).

12.1.3 ReMoS: Relevant Model Slicing

Recently, Zhang et al. (2022) proposed an effective approach for safe transfer learning called ReMoS (Relevant Model Slicing), whose name is borrowed from traditional software slicing. We introduce this method in detail.

Definition 12.1 (Safe Transfer Learning Against Defect Inheritance) Let $\mathcal{D}_S$ denote the student (downstream) dataset and $T(\cdot)$ the transfer learning process. The learning objective of safe transfer learning is formulated as

$$\max_{w \subset w_T} \sum_{(x,y) \in \mathcal{D}_S} \mathbb{I}[f(x; T(w)) = y] + \mathbb{I}[f(\tilde{x}; T(w)) = y], \tag{12.1}$$

which means selecting the benign weights $w$ from the teacher network weights $w_T$ such that the performance on both normal inputs $x$ and abnormal inputs $\tilde{x}$ is maximized.

Fig. 12.3 The complete procedure of ReMoS (Zhang et al., 2022)

As shown in Fig. 12.3, ReMoS consists of four main steps:
1. Coverage frequency profiling: for the pre-trained model, this step leverages the student dataset to select the neurons and weights that are highly correlated with the student task.
2. Ordinal score computation: this step computes ordinal scores by combining the coverage frequency and the magnitude of the weights in the teacher network.
3. Relevant slice generation: this step generates the set of weights to preserve according to the ordinal scores from the previous step.
4. Fine-tuning: finally, we retain the relevant slice weights and retrain the other weights.

Coverage Frequency Profiling If the activation value of a neuron is greater than a pre-defined threshold $\alpha$, we consider it useful for the student task; this is called neuron coverage. Concretely, for a network with $K$ layers (denoting the activation of layer $i$ by $h_i$), the neuron coverage (Cov) is computed as

$$\mathrm{Cov}(x) = \mathrm{Cov}(\{h_1, \ldots, h_K\}) = \{v_i \mid v_i = \mathbb{I}(v_i > \alpha)\}. \tag{12.2}$$

Then, the neuron coverage for a whole dataset $\mathcal{D}_S$ is computed as

$$\mathrm{Cov}(\mathcal{D}_S) = \left\{ \sum_{x \in \mathcal{D}_S} \mathrm{Cov}(x)_i \;\middle|\; i = 1, \ldots, K \right\}. \tag{12.3}$$

The weight coverage is computed as the sum of the coverage of the two neurons connected by a weight:

$$\mathrm{CovW}(\mathcal{D}_S)_{k,i,j} = \mathrm{Cov}(\mathcal{D}_S)_{k-1,i} + \mathrm{Cov}(\mathcal{D}_S)_{k,j}. \tag{12.4}$$

Ordinal Score Computation The downstream task accuracy minus the magnitudes of the teacher network weights can be used to evaluate defect inheritance, since the larger the weights inherited from the teacher network, the higher the defect inheritance rate. This can be formulated as

$$w^{\mathrm{ReMoS}} = \arg\max_{w \subset w_T} \mathrm{ACC}(T(w), \mathcal{D}_S) - \sum_{w \in w} |w|. \tag{12.5}$$

However, the ranges of the two terms are not comparable. For instance, the coverage of ResNet-18 on the MIT Indoor Scenes dataset lies in $[0, 5374]$, while the weight range for ResNet-18 is $[10^{-12}, 1.02]$. This makes direct computation difficult. To unify the ranges of the two terms, Zhang et al. (2022) introduced ordinal scores to replace the direct computation. They rank the values from small to large and represent the actual values via a $\mathrm{rank}(\cdot)$ function: the magnitude rank of the pre-trained weights is $\mathrm{ord\_mag} = \mathrm{rank}(|w|_{k,i,j})$ and the coverage rank is $\mathrm{ord\_cov} = \mathrm{rank}(\mathrm{CovW}(\mathcal{D}_S)_{k,i,j})$. The ordinal score is then computed as

$$\mathrm{ord}_{k,i,j} = \mathrm{ord\_cov}_{k,i,j} - \mathrm{ord\_mag}_{k,i,j}. \tag{12.6}$$

Relevant Slice Generation The ordinal scores are ranked to select the weights closely related to the student task using a threshold $t_\theta$:

$$\mathrm{slice}(\mathcal{D}_S) = \{ w_{k,i,j} \mid \mathrm{ord}_{k,i,j} > t_\theta \}. \tag{12.7}$$

For instance, the top 9 weights in Fig. 12.3 are selected as the slice, and the rest are randomly initialized.

Fine-tuning Weights in $\mathrm{slice}(\mathcal{D}_S)$ are inherited from the pre-trained model; the others are re-initialized for retraining. Notably, ReMoS does not need backward propagation (BP) to compute the slice, which makes it very efficient. ReMoS achieved considerable results on both computer vision and natural language processing tasks using ResNet and BERT models. Specifically, ReMoS reduced the defect inheritance rate by 63–86% on CV tasks and 40–61% on NLP tasks, with at most a 3% drop in accuracy. We expect more works in this area in the future.
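To make the ordinal-score mechanism concrete, here is a minimal NumPy sketch of Eqs. (12.5)–(12.7). This is not the authors' implementation: the function names, the flattened-weight representation, and choosing the threshold $t_\theta$ via a `keep_ratio` quantile are our own simplifying assumptions.

```python
import numpy as np

def ordinal_scores(weight_mag, weight_cov):
    # Eq. (12.6): rank magnitudes and coverages from small to large,
    # then subtract the ranks. High coverage + low magnitude -> high score.
    ord_mag = np.argsort(np.argsort(weight_mag))
    ord_cov = np.argsort(np.argsort(weight_cov))
    return ord_cov - ord_mag

def relevant_slice_mask(weight_mag, weight_cov, keep_ratio=0.5):
    # Eq. (12.7): keep weights whose ordinal score exceeds t_theta; here
    # t_theta is chosen so that about `keep_ratio` of the weights survive.
    scores = ordinal_scores(weight_mag, weight_cov)
    t_theta = np.quantile(scores, 1.0 - keep_ratio)
    return scores > t_theta  # True -> inherit, False -> re-initialize

# Toy example: 4 flattened weights with magnitudes |w| and coverages CovW.
mag = np.array([0.50, 0.01, 0.30, 0.02])
cov = np.array([10.0, 50.0, 5.0, 40.0])
mask = relevant_slice_mask(mag, cov)  # keeps low-|w|, high-coverage weights
```

Weights where `mask` is `False` would be randomly re-initialized before fine-tuning, which is exactly the step that removes the inherited defects.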

12.2 Federated Transfer Learning

Privacy protection has become increasingly important in recent years. In 2018, the European Union released the General Data Protection Regulation (GDPR), which is extremely strict about data protection. The regulation emphasizes that machine learning models must be interpretable and that the collection of user data must be open and transparent. The USA and China have since released similar regulations to protect user privacy. This situation is illustrated in Fig. 12.4, where organization A and organization B do not share data. Under such circumstances, how do we build effective transfer learning models without compromising privacy?


Fig. 12.4 Organization A and organization B do not share data


12.2.1 Federated Learning

Federated learning (FL) (Yang et al., 2019) is a recently proposed technology that enables decentralized training of machine learning models. FL makes it possible to build models without explicitly accessing the user data. FL was first proposed by Google (McMahan et al., 2017), who introduced FedAvg to train machine learning models by aggregating information from distributed mobile phones without exchanging data. The key idea is to replace direct data exchange with the exchange of model-parameter-related information. FedAvg is able to resolve the data islanding problem.

Definition 12.2 (Federated Learning) In federated learning, there are $N$ different clients (organizations or users), denoted as $\{C_1, C_2, \cdots, C_N\}$. Each client has its own dataset $\{\mathcal{D}_1, \mathcal{D}_2, \cdots, \mathcal{D}_N\}$. Conventionally, we could aggregate all the data to train a model $M_{ALL}$ that achieves accuracy $A_{ALL}$ on the test set. Federated learning aims to train a model $M_{FED}$ without exposing the data of any client to the others. If we denote the accuracy of $M_{FED}$ as $A_{FED}$, then the goal of federated learning is to keep the gap between the two accuracies small (measured by $\epsilon > 0$):

$$|A_{FED} - A_{ALL}| < \epsilon. \tag{12.8}$$

This is the basic problem definition of federated learning. Among many possible implementations, FedAvg is one of the most popular algorithms. Instead of centralized training using the private data of all clients, FedAvg does not access personal data but aggregates key information: the gradients or features of local clients. The basic idea of FedAvg is shown in Fig. 12.5. Formally, each client $C_k$ submits its gradient $g_k$ to the central server. The central server then aggregates the gradients to generate a new model, whose weights are updated as

$$w^{t+1} \leftarrow w^t - \eta \nabla f(w^t) = w^t - \eta \sum_{k=1}^{K} \frac{n_k}{n} g_k, \tag{12.9}$$

where $\eta$ is the learning rate of the server model and $n = \sum_{k=1}^{K} n_k$ is the total number of samples from all clients. This is called gradient aggregation for federated learning. Another possible updating manner is weight aggregation, which, in contrast to gradient aggregation, averages the model weights $w_k$ from each client


Fig. 12.5 Illustration of FedAvg

to update the server model:

$$w^{t+1} \leftarrow \sum_{k=1}^{K} \frac{n_k}{n} w_k^{t+1}. \tag{12.10}$$

We see from the above equations that FedAvg is simple in computation, but it remains very effective in real applications.
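The weight-aggregation step of Eq. (12.10) takes only a few lines. The sketch below is a simplified single-round simulation; the function name and the list-of-arrays model representation are our own assumptions, not the FedAvg reference implementation.

```python
import numpy as np

def fedavg_aggregate(client_models, client_sizes):
    """Eq. (12.10): server weights = data-size-weighted average of client
    weights. Each model is a list of numpy arrays, one per layer."""
    n = float(sum(client_sizes))
    return [
        sum((n_k / n) * model[layer]
            for model, n_k in zip(client_models, client_sizes))
        for layer in range(len(client_models[0]))
    ]

# Two clients with one-layer "models"; client B holds 3x more data.
client_a = [np.array([0.0, 2.0])]
client_b = [np.array([4.0, 2.0])]
server = fedavg_aggregate([client_a, client_b], client_sizes=[1, 3])
# server[0] == [3.0, 2.0]: the average is pulled toward the larger client
```

Gradient aggregation (Eq. 12.9) is analogous: average the gradients instead of the weights, then apply one server-side gradient step.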

12.2.2 Personalized Federated Learning for Non-I.I.D. Data

Personalization is extremely important in healthcare applications, since different individuals, hospitals, or countries usually have different demographics, lifestyles, and other health-related characteristics, i.e., the data are non-i.i.d. (Xu et al., 2021; see also Sect. 1.3 for personalized transfer learning). Therefore, we are interested in better personalized healthcare, i.e., building a model for each client that preserves its specific information while harnessing the commonalities via federated learning. Though FedAvg works well in many situations, it may still suffer on non-i.i.d. data and fail to build personalized models for each client (Smith et al., 2017; Khodak et al., 2019). FedProx (Li et al., 2020) tackled non-i.i.d. data by allowing partial information aggregation and adding a proximal term to FedAvg. Yeganeh et al. (2020) aggregated client models with weights computed via the $L_1$ distance between client model parameters. These works focus on a common model shared by all clients, while other works try to obtain a unique model


for each client. Arivazhagan et al. (2019) exchanged the information of base layers and preserved a personalization layer to combat the ill effects of non-i.i.d. data. T. Dinh et al. (2020) used Moreau envelopes as the clients' regularized loss functions and decoupled personalized model optimization from global model learning in a bi-level problem stylized for personalized FL. Yu et al. (2020) evaluated three techniques for local adaptation of federated models: fine-tuning, multi-task learning, and knowledge distillation.

Definition 12.3 (Personalized Federated Learning) In federated learning, there are $N$ different clients (organizations or users), denoted as $\{C_1, C_2, \cdots, C_N\}$. Each client has its own dataset $\{\mathcal{D}_1, \mathcal{D}_2, \cdots, \mathcal{D}_N\}$. Each dataset $\mathcal{D}_i = \{(x_{i,j}, y_{i,j})\}_{j=1}^{n_i}$ contains two parts, i.e., a training part $\mathcal{D}_i^{tr} = \{(x_{i,j}^{tr}, y_{i,j}^{tr})\}_{j=1}^{n_i^{tr}}$ and a test part $\mathcal{D}_i^{te} = \{(x_{i,j}^{te}, y_{i,j}^{te})\}_{j=1}^{n_i^{te}}$. Obviously, $n_i = n_i^{tr} + n_i^{te}$ and $\mathcal{D}_i = \mathcal{D}_i^{tr} \cup \mathcal{D}_i^{te}$. The datasets have different distributions, i.e., $P(\mathcal{D}_i) \neq P(\mathcal{D}_j)$. Each client has its own model, denoted as $\{f_i\}_{i=1}^{N}$. Personalized federated learning aggregates information from all clients to learn a good model $f_i$ for each client on its local dataset $\mathcal{D}_i$ without private data leakage:

$$\min_{\{f_k\}_{k=1}^{N}} \frac{1}{N} \sum_{i=1}^{N} \frac{1}{n_i^{te}} \sum_{j=1}^{n_i^{te}} \ell\left(f_i(x_{i,j}^{te}), y_{i,j}^{te}\right), \tag{12.11}$$

where $\ell$ is a loss function.

In this section, we introduce two algorithms that achieve personalized federated learning while dealing with the non-i.i.d. issue. The two approaches address the two sides of federated learning: (1) model adaptation on the local client side and (2) similarity-guided model aggregation on the server side. Thus, they can be flexibly applied to different systems as needed.

12.2.2.1 Model Adaptation for Personalized Federated Learning

To handle the personalization issue in federated learning, Chen et al. (2020) proposed FedHealth, a framework for healthcare that combines the power of transfer learning and federated learning (Fig. 12.6). FedHealth mainly consists of four steps:
1. Each client trains its own model using local data.
2. Each client uploads its model to the central server, and the server performs model aggregation.
3. The server distributes the aggregated model to each client.
4. Each client adapts the server model using its local data.
Notice that in the adaptation step, each client can use different training strategies: it can either perform fine-tuning (since local


Fig. 12.6 FedHealth for wearable healthcare

data is fully labeled) or distribution adaptation using distribution alignment methods such as the MMD (Borgwardt et al., 2006) or CORAL (Sun and Saenko, 2016) loss. For instance, if the local client uses CORAL as the alignment loss, the loss for client $k$ is computed as

$$\arg\min_{\theta_k} \mathcal{L}_k = \sum_{i=1}^{n_k} \ell\left(y_i^k, f_k(x_i^k)\right) + \eta \ell_{CORAL}, \tag{12.12}$$

where $\ell_{CORAL}$ is the CORAL alignment loss of Eq. (6.8) and $f_k(\cdot)$ is the model on client $k$ with parameters $\theta_k$. The authors applied this system to different federated application settings. Comprehensive experiments on five healthcare benchmarks demonstrate that FedHealth achieves better accuracy than state-of-the-art methods (e.g., over 10% accuracy improvement on PAMAP2) with faster convergence.
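As a concrete example of the alignment term in Eq. (12.12), here is a minimal NumPy version of the CORAL loss: the squared Frobenius distance between the feature covariance matrices of the two domains, normalized as in Sun and Saenko (2016). This sketch is ours, not FedHealth's code, and it operates on raw feature matrices rather than network activations.

```python
import numpy as np

def coral_loss(source_feats, target_feats):
    # CORAL: squared Frobenius distance between the covariance matrices
    # of source and target features, normalized by 4 d^2 (d = feature dim).
    d = source_feats.shape[1]
    c_s = np.cov(source_feats, rowvar=False)
    c_t = np.cov(target_feats, rowvar=False)
    return np.sum((c_s - c_t) ** 2) / (4.0 * d * d)

# Each client k would minimize: classification loss + eta * coral_loss(...)
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 8))
tgt = rng.normal(scale=2.0, size=(100, 8))  # target with different spread
loss = coral_loss(src, tgt)
```

The loss is zero when the two feature sets have identical second-order statistics, which is what drives the adaptation step of FedHealth toward domain alignment.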

12.2.2.2 Similarity-Guided Personalized Federated Learning

On the server side, AdaFed (Chen et al., 2021), a federated learning algorithm based on adaptive batch normalization for personalized healthcare, can aggregate information from different clients without compromising privacy and security while learning a personalized model for each client. Specifically, AdaFed learns the similarities among clients with the help of a pre-trained model, which is easy to obtain. The similarities are determined by the distances between the data distributions, which can be calculated from the statistics of the layer outputs of the pre-trained network. After obtaining the similarities, the server averages the model parameters in a personalized way and generates a unique model for each client. Each client preserves its own batch normalization layers and updates its model with a momentum



Fig. 12.7 The computing process of AdaFed (Chen et al., 2021)

method. In this way, AdaFed can cope with the non-i.i.d. issue in federated learning. The computing process of AdaFed is shown in Fig. 12.7. Different from FedAvg, AdaFed uses the following updating strategy:

$$\phi_i^{t+1} = \phi_i^{t*}, \qquad \psi_i^{t+1} = \sum_{j=1}^{N} w_{ij} \psi_j^{t*}, \tag{12.13}$$

where $\phi_i$ denotes the parameters of the BN layers specific to each client, $\psi_i$ denotes the parameters of the other layers, $w_{ij} \in [0, 1]$ represents the similarity between client $i$ and client $j$, and $t$ denotes the $t$-th round. In addition to preserving the local normalization parameters, another key feature of AdaFed is computing the weight matrix $W$. In Chen et al. (2021), AdaFed utilizes the Wasserstein distance ($W_2(\cdot, \cdot)$; see Sect. 6.3) to compute the distance between the BN statistics of different clients:

$$d_{i,j} = \sum_{l=1}^{L} W_2\left(\mathcal{N}(\mu_{i,l}, \sigma_{i,l}), \mathcal{N}(\mu_{j,l}, \sigma_{j,l})\right) = \sum_{l=1}^{L} \left( \|\mu_{i,l} - \mu_{j,l}\|_2^2 + \|\sigma_{i,l}^{1/2} - \sigma_{j,l}^{1/2}\|_2^2 \right)^{1/2}, \tag{12.14}$$

where $(\mu_{i,l}, \sigma_{i,l})$ are the BN statistics of the $l$-th layer in the $i$-th client. A large $d_{i,j}$ means the distribution distance between the $i$-th and $j$-th clients is large; the larger $d_{i,j}$ is, the less similar the two clients are, and hence the smaller $w_{i,j}$ is. AdaFed sets $\tilde{w}_{i,j}$ as the inverse of $d_{i,j}$, i.e., $\tilde{w}_{i,j} = 1/d_{i,j},\ j \neq i$. Then, AdaFed normalizes $\tilde{w}_{i}$ to obtain $\hat{w}_{i,j}$:

$$\hat{w}_{i,j} = \frac{\tilde{w}_{i,j}}{\sum_{j=1, j \neq i}^{N} \tilde{w}_{i,j}}, \quad \text{where } j \neq i. \tag{12.15}$$

For stability, AdaFed takes $\psi^{t*}$ into account for $\psi^{t+1}$ and updates $\psi^{t+1}$ in a moving-average style. Therefore, it sets

$$w_{i,j} = \begin{cases} \lambda, & i = j, \\ (1 - \lambda) \times \hat{w}_{i,j}, & i \neq j. \end{cases} \tag{12.16}$$

Personalized federated learning can deal with non-i.i.d. datasets from different clients, and thus it is of great importance in federated learning systems. We can borrow knowledge from existing transfer learning techniques when dealing with non-i.i.d. datasets; conversely, this can be seen as another type of knowledge transfer between different clients.

12.3 Data-Free Transfer Learning

In the last section, we introduced federated transfer learning, which builds models without directly accessing the user data. In this section, we tackle another challenging scenario: what if the source data is totally inaccessible due to stricter privacy constraints? In this case, traditional domain adaptation algorithms may not work, since they typically use both the source and target domain data. This scenario is called data-free or source-free transfer learning.

There is another reason that data-free transfer learning is reasonable. In real applications, it is often impossible to store all the source domain data for possible downstream target tasks. For instance, the VisDA-17 classification dataset (Peng et al., 2017), used as source domain data, has a storage size of 7884.8 MB, whereas its pre-trained model takes only 172.6 MB, significantly smaller than the original dataset. More importantly, the pre-trained model contains important implicit knowledge about the source dataset even when we no longer have the original data. Therefore, it is feasible to perform transfer learning using only the pre-trained model from the source domain. Additionally, standard pre-training and fine-tuning (Chap. 8) commonly involves no source domain but only the target domain. The only difference is that fine-tuning methods operate on a labeled low-resource target domain, while data-free methods assume the target domain is fully unlabeled.

Definition 12.4 (Source-Free Transfer Learning) As shown in Fig. 12.8, in source-free transfer learning, we are given only the target domain $\mathcal{D}_t = \{x_t^i\}_{i=1}^{N_t}$ and a pre-trained model $\mathcal{M}$ from the source domain. Our objective is to learn an adaptation model with minimum risk on the target domain.


Fig. 12.8 Illustration of source-free transfer learning/domain adaptation

Note that there is prior work that utilized minimal target supervision (e.g., a weakly supervised setting) (Chidlovskii et al., 2016), which is not the focus of this section. Our setting is more challenging since we require no labels on the target domain, and distribution alignment between the two domains is impossible because the source data is unavailable. There are mainly two types of algorithms for solving this problem:
1. Information maximization, which tries to maximize the prediction information on the target domain.
2. Feature matching, which matches the features of the source and target domains; the source data is often generated by generative models.

12.3.1 Information Maximization Methods

What is the ideal state of the target domain predictions once the domain gap is reduced? On a balanced classification dataset, the predictions should resemble one-hot encodings that differ across samples. Hence, the predictions on the target domain samples should be both diverse and discriminative: diverse so that all possible classes are predicted, and determinant so that each individual sample receives a clear judgment. This is implemented by information maximization. Information maximization (IM) was first introduced by Bridle et al. (1991). IM maximizes the diversity of the predictions while minimizing their uncertainty:

$$\mathrm{IM}(x; c) = -\sum_{i=1}^{N_c} \bar{y}_i \log \bar{y}_i + \frac{1}{N_t} \sum_{i=1}^{N_t} y_i \log y_i = H(\bar{y}) - \overline{H(y)}, \tag{12.17}$$


where $\bar{y}$ denotes the average of the outputs over all target samples and $N_c$ is the number of classes. The first term $H(\bar{y})$ is the entropy of the average of the outputs, and the second term $\overline{H(y)}$ is the average of the entropies of the outputs. Maximizing the first term pushes the predictions to their most diverse state, while minimizing the second term makes each prediction as determinant as possible.

Several algorithms are based on information maximization. The source hypothesis transfer (SHOT) algorithm (Liang et al., 2020) adopted IM for source-free domain adaptation. In addition, SHOT introduced pseudo-labeling on the target domain to enhance the prediction ability of the network. First, the class centroid of each class is obtained by an approach similar to weighted K-means:

$$c_k^{(0)} = \frac{\sum_{x_t \in \mathcal{X}_t} \delta_k\left(\hat{f}_t(x_t)\right) \hat{g}_t(x_t)}{\sum_{x_t \in \mathcal{X}_t} \delta_k\left(\hat{f}_t(x_t)\right)}, \tag{12.18}$$

where $\delta_k$ denotes the $k$-th element of the softmax output, $f$ denotes the hypothesis, $g$ is the feature encoder, and $c_k^{(0)}$ is the $k$-th class centroid at the 0-th iteration. Then, the pseudo labels can be obtained by the nearest centroid:

$$\hat{y}_t = \arg\min_k D\left(\hat{g}_t(x_t), c_k^{(0)}\right), \tag{12.19}$$

where $D(\cdot, \cdot)$ is a distance function such as the Euclidean distance. The target centroids can then be updated iteratively using the above equations to obtain better results. SHOT was later extended by further exploiting the unlabeled data through semi-supervised learning: the target data is split into two sets with low and high entropy, and the goal is to seamlessly transfer knowledge from the low-entropy samples to the high-entropy ones. To this end, the authors used MixMatch (Berthelot et al., 2019), a semi-supervised method; the improved algorithm is called SHOT++ (Liang et al., 2021b). Ahmed et al. (2021) extended the single-source setting to the multi-source setting; to better harness the knowledge from multiple training domains, they applied weighting techniques to the sources and class centers as suggested in Chap. 4. Taufique et al. (2021) extended this setting to a continual manner, designing a buffer memory to store some target data, which is then used along with information maximization for optimization. Agarwal et al. (2022) extended this setting by adding adversarial attacks to the model so that the model stays robust.
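The IM objective of Eq. (12.17) and SHOT's centroid-based pseudo-labeling of Eqs. (12.18)–(12.19) are easy to prototype. The sketch below is our own simplified version, not the official SHOT code: it operates directly on softmax outputs and feature matrices and uses the Euclidean distance for $D(\cdot, \cdot)$.

```python
import numpy as np

def information_maximization(probs, eps=1e-12):
    # Eq. (12.17): entropy of the mean prediction (diversity) minus the
    # mean per-sample entropy (determinacy). Training maximizes this value.
    mean_pred = probs.mean(axis=0)
    h_mean = -np.sum(mean_pred * np.log(mean_pred + eps))
    mean_h = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    return h_mean - mean_h

def shot_pseudo_labels(features, probs):
    # Eq. (12.18): class centroids as softmax-weighted means of features.
    centroids = (probs.T @ features) / probs.sum(axis=0)[:, None]
    # Eq. (12.19): assign each sample to its nearest centroid.
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Diverse, confident predictions maximize IM; uniform ones score ~zero.
confident = np.eye(4)
uniform = np.full((4, 4), 0.25)

# Two feature clusters with noisy (but roughly correct) soft predictions.
feats = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
soft = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
labels = shot_pseudo_labels(feats, soft)  # -> [0, 0, 1, 1]
```

Note that `information_maximization(confident)` reaches its maximum of $\log N_c$, while the uniform predictions score approximately zero, matching the two-term analysis of Eq. (12.17).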


12.3.2 Feature Matching Methods

Different from information maximization, feature matching methods attempt to perform distribution alignment between the two domains. What about the absent source data? Data generation is an intuitive answer: we can always generate some source data by exploiting the source model and then perform distribution alignment. To ensure the generated data follows the source distribution, the pre-trained source model is typically used for feature matching. Tian et al. (2021) designed a virtual domain modeling (VDM) method that uses the weights of pre-trained models to generate source data and then performs adaptation using the generated virtual domains. Kurmi et al. (2021) proposed domain impression to generate source data from the pre-trained model weights and then added adversarial training to learn domain-invariant features. Hou and Zheng (2020) performed feature alignment between generated source samples and target samples via the feature statistics of the last feature layer. Yeh et al. (2021) inferred the latent features from the predictions on the target data. Source-free domain adaptation was extended in Kundu et al. (2020) to universal domain adaptation, where the source and target domains do not have identical label spaces. Other than data generation, Liang et al. (2021a) formulated source-free domain adaptation as a heterogeneous knowledge distillation problem and proposed Distill and Fine-tune (Dis-tune), which structurally distills knowledge from the source model into a customized target model and then fine-tunes the distilled model to fit the target domain in an unsupervised manner. Feng et al. (2021) proposed the KD3A framework, which uses the statistics of the batch normalization layers to perform batch MMD for distribution alignment since the source information is not available.

12.4 Causal Transfer Learning

In this section, we focus on the transfer mechanism: what determines the relation between two variables, and what constrains their transferability? To learn a robust transfer mechanism, we introduce causal relations for transfer learning.

12.4.1 What Is Causal Relation?

People may have different understandings of causal relation. Causal inference and causal discovery are the two subjects studying causal relation. In these areas, the causal relation between two variables is defined as follows: if intervening on the cause variable changes the effect variable, but intervening on the effect variable does not change the cause variable, we say that the two variables have a causal relation (Pearl, 2009; Peters et al., 2017). Such an intervention refers to using a mechanism outside of


the current system to change a variable. The joint distribution of the variables will, of course, change. We can use the following example (Peters et al., 2017) to understand this definition. By observing the altitude and average temperature of all cities in the world, people learn the observational correlation of the two variables: higher altitude comes with lower average temperature. But is there any causal relation between them? We can consider interventions on altitude and average temperature. Imagine that we burn a huge stove in a city to intervene on the temperature; the physical world tells us that this intervention will not lower the altitude of the city. On the other hand, imagine a large lifting machine under the city to intervene on the altitude, which could indeed lower the temperature. Thus, altitude is a cause of average temperature, which is consistent with our intuition.

Two observations follow from this example: (1) Causal relation contains more information than observational data (Pearl, 2009; Pearl et al., 2009; Peters et al., 2017). Whether the altitude causes the average temperature or vice versa, both can be explained by "higher altitude comes with lower temperature" in the observational data; we need the physical world or human society to provide the extra information about what happens under an intervention. Mathematically, we can represent "x causes y" as $p(x, y) = p(x)p(y|x)$, where $p(y|x)$ describes the causal relation. The observational data only contains information about the joint distribution $p(x, y)$ but cannot determine whether this joint distribution should be factorized as $p(x)p(y|x)$ or $p(y)p(x|y)$. Causal relation goes further and determines this.
For instance, altitude will cause the temperature, which is a natural law. Thus, we say that these variables have autonomy and modularity, so that they are independent. This is the principle of independent mechanism (Schölkopf et al., 2012; Peters et al., 2017; Schölkopf, 2019). If the environment and domains are changed, we can use causal relation for machine learning. It gives more information than joint distribution. We can think that the cause of the change of environment and domains are intervention. The principle of causal relation means that the un-intervened relations are not changing, which we call causal invariance. If we know the causal relation and intervention mechanism, then we can exploit the causal invariance to bridge the two variables.

12.4.2 Causal Relation for Transfer Learning

Exploiting Causal Relations for Observational Data Schölkopf et al. (2011, 2012) first proposed the principles of causal relation and causal invariance. They pointed out that the causal relation between x and y corresponds to different transfer


learning situations. When x causes y, we can assume that $p(y|x)$ remains unchanged across domains while the domain change originates from the change of $p(x)$: this is the covariate shift problem in transfer learning. Similarly, when y is the cause of x, $p(x|y)$ is unchanged, which corresponds to the target shift problem. They proposed algorithms based on additive noise models. Zhang et al. (2013) further used kernel functions for covariate shift and also allowed location-scale transformations of $p(x|y)$, i.e., conditional shift and generalized target shift. Gong et al. (2016, 2018) considered the more general case where only part of x is available. Rojas-Carulla et al. (2018) proposed algorithms to exploit causal relations in domain generalization and multi-task learning. Bahadori et al. (2017) used a pre-trained model to predict causal effects and trained weights to enlarge the predictive power of x. Shen et al. (2018) considered weighting all samples so that the weighted data reflects the average causal effects. Other works extended this line to the nonlinear case (Bahadori et al., 2017; He et al., 2019).

Learning by Exploiting Latent Representations For sensor data such as images or voice, there may be no causal relation in the raw data, but some latent factors (Lopez-Paz et al., 2017; Besserve et al., 2018; Kilbertus et al., 2018) may carry causal relations. Thus, we introduce latent variables to model these factors. Domain-invariant representation learning tries to learn representations whose marginal distribution is invariant across domains. Such representations can be considered as the commonality of domains, which can be used for transfer learning, and great progress has been achieved.⁴ Arjovsky et al. (2019) proposed invariant risk minimization to learn a classifier that remains invariant across domains, instead of learning representations. Some methods use causal invariance for transfer learning.
Most of them are based on generative models. According to the principle of causal invariance, the data generation process $p(x, y|z)$ does not change across domains, and the difference between domains mainly comes from the marginal distribution of the latent variable $p(z)$. Teshima et al. (2020) adopted causal relations for few-shot supervised domain adaptation. Cai et al. (2019) and Ilse et al. (2020) proposed the causal relation shown in Fig. 12.9 (left). They categorized the latent factors into two classes: the background factor $z_d$ caused by the domain $d$ (such as background, color, and style) and the category factor $z_y$ caused by the label $y$ (such as shape). This benefits generalization. Atzmon et al. (2020) extended their work, arguing that the change of domain comes from the label $y$ and other attributes. More recently, Liu et al. (2020) adjusted the prior for new domains and derived inference and prediction rules based on the generative mechanism. Their causal relation (Fig. 12.9, right) disentangles the latent factors into a semantic factor $s$ and a non-semantic factor $v$, where only the semantic factor is of concern. Sun et al. (2020) applied this algorithm to domain generalization, explicitly modeling the changing rule of the prior to identify $s$ and $v$.

⁴ Other works (Johansson et al., 2019; Zhao et al., 2019; Chuang et al., 2020) have also pointed out problems with domain-invariant learning.


12 Safe and Robust Transfer Learning

Fig. 12.9 Left: causal relation adopted by Cai et al. (2019), Ilse et al. (2020), and Atzmon et al. (2020), where d is the domain and z_d and z_y are the latent factors for domain and class. Right: causal relation adopted by Liu et al. (2020) and Sun et al. (2020), where s is the semantic factor for y and v is the non-causal factor

12.5 Summary

In this chapter, we introduced another important aspect of transfer learning: its safety and robustness. The scope of safety and robustness is quite broad, and we could only cover part of it. We analyzed it at three levels: the fine-tuning framework against defect inheritance, data and privacy protection via federated and data-free transfer learning, and the transfer mechanism via causal learning. All three levels are important for building a safe and robust transfer learning system, and all remain active research directions. There are also other related topics, such as attack transferability and neural style transfer attacks, which interested readers can investigate further.

References Agarwal, P., Paudel, D. P., Zaech, J.-N., and Van Gool, L. (2022). Unsupervised robust domain adaptation without source data. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2009–2018. Ahmed, S. M., Raychaudhuri, D. S., Paul, S., Oymak, S., and Roy-Chowdhury, A. K. (2021). Unsupervised multi-source domain adaptation without access to source data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10103–10112. Arivazhagan, M. G., Aggarwal, V., Singh, A. K., and Choudhary, S. (2019). Federated learning with personalization layers. arXiv preprint arXiv:1912.00818. Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. (2019). Invariant risk minimization. arXiv preprint arXiv:1907.02893. Atzmon, Y., Kreuk, F., Shalit, U., and Chechik, G. (2020). A causal view of compositional zeroshot recognition. Advances in Neural Information Processing Systems, 33. Bahadori, M. T., Chalupka, K., Choi, E., Chen, R., Stewart, W. F., and Sun, J. (2017). Causal regularization. arXiv preprint arXiv:1702.02604. Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., and Raffel, C. (2019). Mixmatch: A holistic approach to semi-supervised learning. arXiv preprint arXiv:1905.02249. Besserve, M., Shajarisales, N., Schölkopf, B., and Janzing, D. (2018). Group invariance principles for causal generative models. In International Conference on Artificial Intelligence and Statistics, pages 557–565. PMLR.


Borgwardt, K. M., Gretton, A., Rasch, M. J., Kriegel, H.-P., Schölkopf, B., and Smola, A. J. (2006). Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57. Bridle, J. S., Heading, A. J., and MacKay, D. J. (1991). Unsupervised classifiers, mutual information and ’phantom targets’. In Advances in neural information processing systems (NIPS). Cai, R., Li, Z., Wei, P., Qiao, J., Zhang, K., and Hao, Z. (2019). Learning disentangled semantic representation for domain adaptation. In Proceedings of the Conference of IJCAI, volume 2019, page 2060. NIH Public Access. Chen, Y., Lu, W., Wang, J., Qin, X., and Qin, T. (2021). Federated learning with adaptive batchnorm for personalized healthcare. arXiv preprint arXiv:2112.00734. Chen, Y., Qin, X., Wang, J., Yu, C., and Gao, W. (2020). Fedhealth: A federated transfer learning framework for wearable healthcare. IEEE Intelligent Systems, 35(4):83–93. Chidlovskii, B., Clinchant, S., and Csurka, G. (2016). Domain adaptation in the absence of source domain data. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 451–460. Chin, T.-W., Zhang, C., and Marculescu, D. (2021). Renofeation: A simple transfer learning method for improved adversarial robustness. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition workshops, pages 3243–3252. Chuang, C.-Y., Torralba, A., and Jegelka, S. (2020). Estimating generalization under distribution shifts via domain-invariant representations. In International Conference on Machine Learning, pages 1984–1994. PMLR. Feng, H.-Z., You, Z., Chen, M., Zhang, T., Zhu, M., Wu, F., Wu, C., and Chen, W. (2021). Kd3a: Unsupervised multi-source decentralized domain adaptation via knowledge distillation. In International conference on machine learning (ICML). Gong, M., Zhang, K., Huang, B., Glymour, C., Tao, D., and Batmanghelich, K. (2018). 
Causal generative domain adaptation networks. arXiv preprint arXiv:1804.04333. Gong, M., Zhang, K., Liu, T., Tao, D., Glymour, C., and Schölkopf, B. (2016). Domain adaptation with conditional transferable components. In International Conference on Machine Learning, pages 2839–2848. He, Y., Shen, Z., and Cui, P. (2019). Towards non-i.i.d. image classification: A dataset and baselines. arXiv preprint arXiv:1906.02899. Hou, Y. and Zheng, L. (2020). Source free domain adaptation with image translation. arXiv preprint arXiv:2008.07514. Ilse, M., Tomczak, J. M., Louizos, C., and Welling, M. (2020). DIVA: Domain invariant variational autoencoders. In Medical Imaging with Deep Learning, pages 322–348. PMLR. Ji, Y., Zhang, X., Ji, S., Luo, X., and Wang, T. (2018). Model-reuse attacks on deep learning systems. In Proceedings of the 2018 ACM SIGSAC conference on computer and communications security, pages 349–363. Johansson, F. D., Sontag, D., and Ranganath, R. (2019). Support and invertibility in domaininvariant representations. In AISTAS, pages 527–536. Khodak, M., Balcan, M.-F. F., and Talwalkar, A. S. (2019). Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems, volume 32, pages 5917– 5928. Kilbertus, N., Parascandolo, G., and Schölkopf, B. (2018). Generalization in anti-causal learning. arXiv preprint arXiv:1812.00524. Kundu, J. N., Venkat, N., Babu, R. V., et al. (2020). Universal source-free domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4544–4553. Kurmi, V. K., Subramanian, V. K., and Namboodiri, V. P. (2021). Domain impression: A source data free domain adaptation method. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 615–625.


Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. (2020). Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems 2020, MLSys 2020, Austin, TX, USA, March 2–4, 2020. mlsys.org. Liang, J., Hu, D., and Feng, J. (2020). Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In International Conference on Machine Learning, pages 6028–6039. PMLR. Liang, J., Hu, D., He, R., and Feng, J. (2021a). Distill and fine-tune: Effective adaptation from a black-box source model. arXiv preprint arXiv:2104.01539. Liang, J., Hu, D., Wang, Y., He, R., and Feng, J. (2021b). Source data-absent unsupervised domain adaptation through hypothesis transfer and labeling transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence. Liu, C., Sun, X., Wang, J., Li, T., Qin, T., Chen, W., and Liu, T.-Y. (2020). Learning causal semantic representation for out-of-distribution prediction. arXiv preprint arXiv:2011.01681. Lopez-Paz, D., Nishihara, R., Chintala, S., Schölkopf, B., and Bottou, L. (2017). Discovering causal signals in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6979–6987. McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. (2017). Communicationefficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR. Pearl, J. (2009). Causality. Cambridge university press. Pearl, J. et al. (2009). Causal inference in statistics: An overview. Statistics surveys, 3:96–146. Peng, X., Usman, B., Kaushik, N., Hoffman, J., Wang, D., and Saenko, K. (2017). VisDA: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924. Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of causal inference: foundations and learning algorithms. MIT press. Rezaei, S. and Liu, X. (2020). 
A target-agnostic attack on deep models: Exploiting security vulnerabilities of transfer learning. In International conference on learning representations (ICLR). Rojas-Carulla, M., Schölkopf, B., Turner, R., and Peters, J. (2018). Invariant models for causal transfer learning. The Journal of Machine Learning Research, 19(1):1309–1342. Schölkopf, B. (2019). Causality for machine learning. arXiv preprint arXiv:1911.10500. Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. M. (2012). On causal and anticausal learning. In International Conference on Machine Learning (ICML 2012), pages 1255–1262. International Machine Learning Society. Schölkopf, B., Janzing, D., Peters, J., and Zhang, K. (2011). Robust learning via cause-effect models. arXiv preprint arXiv:1112.2738. Shen, Z., Cui, P., Kuang, K., Li, B., and Chen, P. (2018). Causally regularized learning with agnostic data selection bias. In 2018 ACM Multimedia Conference on Multimedia Conference, pages 411–419. ACM. Smith, V., Chiang, C.-K., Sanjabi, M., and Talwalkar, A. (2017). Federated multi-task learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 4427–4437. Sun, B. and Saenko, K. (2016). Deep coral: Correlation alignment for deep domain adaptation. In ECCV, pages 443–450. Sun, X., Wu, B., Liu, C., Zheng, X., Chen, W., Qin, T., and Liu, T.-y. (2020). Latent causal invariant model. arXiv preprint arXiv:2011.02203. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9. T Dinh, C., Tran, N., and Nguyen, T. D. (2020). Personalized federated learning with Moreau envelopes. In Advances in Neural Information Processing Systems, volume 33. Taufique, A. M. N., Jahan, C. S., and Savakis, A. (2021). ConDA: Continual unsupervised domain adaptation. 
arXiv preprint arXiv:2103.11056.


Teshima, T., Sato, I., and Sugiyama, M. (2020). Few-shot domain adaptation by causal mechanism transfer. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13–18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 9458–9469. Tian, J., Zhang, J., Li, W., and Xu, D. (2021). VDM-DA: Virtual domain modeling for source data-free domain adaptation. arXiv preprint arXiv:2103.14357. Wang, B., Yao, Y., Viswanath, B., Zheng, H., and Zhao, B. Y. (2018). With great training comes great vulnerability: Practical attacks against transfer learning. In 27th {USENIX} Security Symposium ({USENIX} Security 18), pages 1281–1297. Xu, J., Glicksberg, B. S., Su, C., Walker, P., Bian, J., and Wang, F. (2021). Federated learning for healthcare informatics. Journal of Healthcare Informatics Research, 5(1):1–19. Yang, Q., Liu, Y., Cheng, Y., Kang, Y., Chen, T., and Yu, H. (2019). Federated learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 13(3):1–207. Yeganeh, Y., Farshad, A., Navab, N., and Albarqouni, S. (2020). Inverse distance aggregation for federated learning with non-iid data. In Domain Adaptation and Representation Transfer, and Distributed and Collaborative Learning, pages 150–159. Springer. Yeh, H.-W., Yang, B., Yuen, P. C., and Harada, T. (2021). Sofa: Source-data-free feature alignment for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 474–483. Yu, T., Bagdasaryan, E., and Shmatikov, V. (2020). Salvaging federated learning by local adaptation. arXiv preprint arXiv:2002.04758. Zhang, K., Schölkopf, B., Muandet, K., and Wang, Z. (2013). Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pages 819–827. Zhang, Z., Li, Y., Wang, J., Liu, B., Li, D., Chen, X., Guo, Y., and Liu, Y. (2022). 
ReMos: Reducing defect inheritance in transfer learning via relevant model slicing. In The 44th International Conference on Software Engineering. Zhao, H., Des Combes, R. T., Zhang, K., and Gordon, G. (2019). On learning invariant representations for domain adaptation. In International Conference on Machine Learning, pages 7523–7532.

Chapter 13

Transfer Learning in Complex Environments

Application environments change dynamically, and algorithms must adapt with them. To cope with changing environments, several new transfer learning algorithms have been developed. We define the complex environment mainly by the traits of the training data, since data is the key to modern transfer learning. In this chapter, we briefly introduce some of these complex environments and show how transfer learning algorithms can be adapted to such new challenges. Note that there are many new settings in the recent literature and we cannot cover them all. The organization of this chapter is as follows. In Sect. 13.1, we introduce imbalanced transfer learning, where the training data is highly imbalanced. In Sect. 13.2, we introduce multi-source transfer learning, where the training data comes from more than one source distribution. We describe open set transfer learning in Sect. 13.3, where the training and test data may not share identical categories. Section 13.4 discusses transfer learning for time series. Finally, in Sect. 13.5 we introduce online transfer learning, where the training data arrives in an online manner.

13.1 Imbalanced Transfer Learning

Imbalanced datasets exist in many applications. For instance, fall detection is a binary classification task, as shown in Fig. 13.1a. When the fall and non-fall samples account for 90% and 10% of the data (we call them the majority and minority classes, respectively), traditional machine learning models tend to overlook the minority class since its proportion is small. Thus, when such imbalance exists, the results are dramatically affected. Many works in traditional machine learning have aimed to solve this problem (Chawla et al., 2002; Sun et al., 2007; Liu et al., 2008; He and Garcia, 2009; Tang et al., 2009; Sun et al., 2009; Li et al., 2011; Ganganwar, 2012; Kriminger et al., 2012; Huang et al., 2016).

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. J. Wang, Y. Chen, Introduction to Transfer Learning, Machine Learning: Foundations, Methodologies, and Applications, https://doi.org/10.1007/978-981-19-7584-4_13
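Before any transfer is attempted, a common baseline for such imbalance is simply rebalancing the training set. The sketch below randomly oversamples the minority class until the classes match in size; the 90%/10% split mirrors the fall-detection example above, and the function name is ours:

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    # Randomly duplicate samples of each under-represented class until
    # every class reaches the size of the majority class.
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    Xs, ys = [], []
    for c, n in zip(classes, counts):
        idx = np.where(y == c)[0]
        extra = rng.choice(idx, size=n_max - n, replace=True)
        take = np.concatenate([idx, extra])
        Xs.append(X[take])
        ys.append(y[take])
    return np.concatenate(Xs), np.concatenate(ys)

# 90% non-fall (label 0) vs. 10% fall (label 1), as in the example above.
X = np.arange(100).reshape(-1, 1).astype(float)
y = np.array([0] * 90 + [1] * 10)
Xb, yb = oversample_minority(X, y)
# After oversampling, both classes contribute 90 samples each.
```

More sophisticated schemes such as SMOTE (Chawla et al., 2002) synthesize new minority samples instead of duplicating existing ones.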



Fig. 13.1 Imbalanced transfer learning. (a) Fall detection. (b) Imbalanced transfer learning

The imbalance problem also exists in transfer learning. Researchers have shown that when classes are imbalanced in domain adaptation tasks, accuracy is impacted (Weiss and Khoshgoftaar, 2016). Since transfer learning aims at learning knowledge across different distributions, and the class distribution is an important part of the whole distribution, this problem deserves attention. As shown in Fig. 13.1b, the second class of the source domain (green; the dots in the upper distribution figures) has a lower proportion than the same class in the target domain, while this is not the case for the first class. Because of this mismatch in class proportions, negative transfer may happen. Wang et al. (2017) proposed balanced distribution adaptation (BDA) for imbalanced transfer learning. BDA considers the importance of each class while performing distribution adaptation. Inspired by TCA (Pan et al., 2011), BDA is formulated as the minimization of

\sum_{c=1}^{C} \beta_c \, tr(A^T X M_c X^T A),    (13.1)

where A denotes the feature transformation matrix and \beta_c is the class ratio

\beta_c = P_t(y_t = c; \theta) / P_s(y_s = c; \theta).    (13.2)


We construct the MMD matrices using the prior \beta_c:

(M_c)_{ij} =
    1 / (N_s^{(c)})^2,                if x_i, x_j ∈ D_s^{(c)},
    1 / (N_t^{(c)})^2,                if x_i, x_j ∈ D_t^{(c)},
    −\beta_c / (N_s^{(c)} N_t^{(c)}), if x_i ∈ D_s^{(c)}, x_j ∈ D_t^{(c)} or x_i ∈ D_t^{(c)}, x_j ∈ D_s^{(c)},
    0,                                otherwise,                              (13.3)

where N_s^{(c)} and N_t^{(c)} denote the numbers of source and target samples of class c, and D_s^{(c)} and D_t^{(c)} are the corresponding sample sets.
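Equation (13.3) can be assembled directly with array indexing. The sketch below builds M_c for one class c given the source and target index sets; variable names are ours, not from any BDA reference implementation:

```python
import numpy as np

def mmd_matrix_for_class(n, src_idx, tgt_idx, beta_c):
    # n: total number of (source + target) samples in X.
    # src_idx / tgt_idx: indices of the source/target samples of class c.
    # Implements the four cases of Eq. (13.3).
    M = np.zeros((n, n))
    ns, nt = len(src_idx), len(tgt_idx)
    M[np.ix_(src_idx, src_idx)] = 1.0 / ns**2          # both source
    M[np.ix_(tgt_idx, tgt_idx)] = 1.0 / nt**2          # both target
    M[np.ix_(src_idx, tgt_idx)] = -beta_c / (ns * nt)  # cross terms
    M[np.ix_(tgt_idx, src_idx)] = -beta_c / (ns * nt)
    return M

M = mmd_matrix_for_class(6, src_idx=[0, 1, 2], tgt_idx=[3, 4], beta_c=0.5)
# tr(A^T X M X^T A) then measures the class-c source/target discrepancy.
```

Summing these matrices over all classes, each weighted by its \beta_c, yields the full objective of Eq. (13.1).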

BDA is easy to implement and performs well in imbalanced settings. Compared with existing research, BDA has several advantages. Previous sample reweighting methods (Ando and Huang, 2017) only learned weights for specific samples while ignoring the weight balance across classes. Ming Harry Hsu et al. (2015) developed Closest Common Space Learning (CCSL) to adapt cross-domain weights; CCSL is an instance selection method, while BDA is a feature-based approach. Multiset feature learning was proposed by Wu et al. (2017a) to learn discriminative features. Yan et al. (2017) proposed a weighted maximum mean discrepancy to construct a source reference collection on the target domain, but it only adapts the prior of the source domain, while BDA adapts the priors of both source and target domains. Hsiao et al. (2016) tackled the imbalance issue when the target domain has some labels, whereas BDA assumes an unlabeled target domain. Li et al. (2016) adjusted the weights of individual samples according to their predictions, while BDA focuses on adjusting the weight of each class. We hope to see more work in this area.

13.2 Multi-Source Transfer Learning

In this book we mainly handle single-source transfer learning problems, where there is only one source and one target domain. Beyond this setting, how do we perform transfer learning when multiple source domains are available? This is called multi-source transfer learning. Single-source transfer learning only considers the domain shift between two domains, while multi-source transfer learning has to consider the shift among multiple domains. As shown in Fig. 13.2, the distributions in multi-source transfer learning are more complicated, and we cannot obtain optimal results by simply merging all sources into one domain. There is also some theory for multi-source transfer learning: Crammer et al. (2008) derived error bounds for multi-source transfer, and Mansour et al. (2009) proved that a target hypothesis can be represented by a weighted combination of the source distributions.

Fig. 13.2 Single-source vs. multi-source transfer learning


Definition 13.1 (Multi-Source Transfer Learning) Given N source domains D = {D_i}_{i=1}^{N}, where each source domain D_i = {(x_j^i, y_j^i)}_{j=1}^{N_i}, and a target domain D_t = D_t^l ∪ D_t^u, where D_t^l = {(x_j^l, y_j^l)}_{j=1}^{N_l} is the labeled data and D_t^u = {x_j^u}_{j=1}^{N_u} is the unlabeled data, with N_l and N_u denoting the numbers of labeled and unlabeled samples. Multi-source transfer learning aims at learning a predictive function f that has minimum risk on the unlabeled target domain.

Depending on whether the target domain has labels, multi-source transfer learning can be categorized into semi-supervised multi-source transfer learning (Schweikert et al., 2009) and unsupervised multi-source transfer learning (Zhu et al., 2019). The goal is to leverage the N source domains to help learning in the target domain; when N = 1, it reduces to single-source transfer learning. There are mainly two types of methods: feature-based methods and classifier ensembling methods. Feature-based methods change the representation of each domain to align them better. Such methods bring the multi-source distributions close to the target distribution in two ways: (1) removing samples with large distribution gaps between domains (Sun et al., 2011) and (2) mapping the features into a common space where good representations bridge the gap (Duan et al., 2009). Classifier ensembling methods train classifiers for the source and target domains and then ensemble them according to their similarities. Schweikert et al. (2009) used a simple ensemble that gives the same weight to every source classifier, and Sun and Shi (2013) designed a Bayesian weighting method for multi-source transfer learning. In recent years, deep learning has further advanced multi-source transfer learning. Zhao et al. (2018) proposed a multi-domain adversarial network (MDAN) that aligns the distributions between each source domain and the target domain using multiple domain discriminators. Xu et al. (2018) designed the deep cocktail network (DCTN), which uses a domain discriminator and a classifier for each source-target pair: the discriminator aligns the distributions while the classifier predicts the class probabilities. Peng et al. (2019) proposed M³SDA, which considers not only the source-target distributions but also aligns the source-source distributions. Zhu et al. (2019) proposed a new framework, MFSAN, which

13.3 Open Set Transfer Learning

229

extracted multi-source features into different spaces. They also proposed a consistency regularization term to ensure that the multiple classifiers produce consistent outputs for the same target sample. To sum up, multi-source transfer learning can be formulated as the minimization of

L_cls + L_da + L_reg,    (13.4)

where L_cls denotes the classification loss, L_da denotes the distribution adaptation loss that covers both source-source and source-target adaptation, and L_reg is a regularization term such as consistency regularization (Zhu et al., 2019). Multi-source domains contain more knowledge than a single source domain, so better results can be obtained; when designing algorithms in this area, we should still take care to avoid negative transfer.
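Equation (13.4) leaves the concrete losses open. As an illustrative (not canonical) instantiation, the sketch below uses a linear-kernel MMD estimate between each source batch and the target batch for L_da, and an output-consistency penalty across the per-source classifiers for L_reg; the supervised L_cls would be computed on labeled source data elsewhere:

```python
import numpy as np

def linear_mmd(a, b):
    # Squared distance between batch means: a linear-kernel MMD estimate.
    return float(np.sum((a.mean(axis=0) - b.mean(axis=0)) ** 2))

def multi_source_loss(src_feats, tgt_feat, src_logits, lam=1.0, gamma=0.1):
    # src_feats: list of per-source feature batches; tgt_feat: target batch.
    # src_logits: per-source classifier outputs on the SAME target batch,
    # penalized for disagreement (the consistency term of Eq. (13.4)).
    l_da = sum(linear_mmd(f, tgt_feat) for f in src_feats) / len(src_feats)
    mean_logit = np.mean(src_logits, axis=0)
    l_reg = float(np.mean([(np.asarray(p) - mean_logit) ** 2 for p in src_logits]))
    # L_cls on labeled source data is computed elsewhere; here we return
    # only the transfer-specific terms lam * L_da + gamma * L_reg.
    return lam * l_da + gamma * l_reg

loss = multi_source_loss([np.zeros((4, 3)), np.ones((4, 3))], np.zeros((4, 3)),
                         [np.zeros((4, 2)), np.zeros((4, 2))])
# Only the second source is misaligned with the target in this toy call.
```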

13.3 Open Set Transfer Learning

We now introduce another active research topic in transfer learning: open set transfer learning. Most current tasks are closed set tasks, i.e., the source and target domains share the same set of categories. This is quite restrictive, since in the real world the two domains may have different category sets. We therefore call the former paradigm closed set transfer learning, while open set refers to the case where the source and target domains share some, but not all, classes. Moreover, full-open set refers to the case where the source and target domains share no classes at all. From closed set to open set to full-open set, the situation becomes more and more challenging. Figure 13.3 illustrates these three situations, which differ mainly in the label spaces of the two domains. We use Y_s and Y_t to denote the label spaces of the source and target domains, respectively:

Fig. 13.3 Different settings of transfer learning: closed set (same label space; domain adaptation), open set (intersecting label spaces; open set domain adaptation), and full-open set (disjoint label spaces; cross-category transfer learning)


• Closed set: Y_s = Y_t
• Open set: Y_s ≠ Y_t and Y_s ∩ Y_t ≠ ∅, where ∅ denotes the empty set
• Full-open set: Y_s ≠ Y_t and Y_s ∩ Y_t = ∅

Note that these definitions are not fully unified in the literature. For instance, our definition matches that of Panareda Busto and Gall (2017), but the open set definition in Saito et al. (2018) does not allow access to samples belonging to the non-shared classes. The core of open set transfer learning is to identify the semantic relations between source and target classes: since the domains share only some common classes, identifying the corresponding categories remains the key challenge. Panareda Busto and Gall (2017) proposed an outlier detection approach to identify the common classes. Saito et al. (2018) adopted class-wise probabilities as the weights for each class. Fang et al. (2019) gave a theoretical analysis of open set domain adaptation. Recently, Xu et al. (2020) argued that existing open set approaches are susceptible to misclassification when unknown target samples lie near the decision boundary learned from the labeled source domain. To overcome this, they proposed Joint Partial Optimal Transport (JPOT) (Fig. 13.4), which exploits not only the labeled source domain but also a discriminative representation of the unknown class in the target domain. Their joint discriminative prototypical compactness loss achieves intra-class compactness and inter-class separability and can estimate the mean and variance of the unknown class through backpropagation, which remained intractable for previous methods because the structure of the unknown classes was unobserved. JPOT is the first optimal transport model for open set domain adaptation. Research in this area is ongoing, and we expect more work to come.
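A simple first step shared by many open set methods is to reject low-confidence target predictions as "unknown". The confidence-threshold rule below is a generic sketch (the 0.5 threshold is arbitrary), not the specific mechanism of any paper cited above:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def predict_open_set(logits, threshold=0.5, unknown=-1):
    # Confident samples get their argmax label; samples whose maximum
    # class probability is below the threshold are rejected as the
    # open-set "unknown" class.
    p = softmax(np.asarray(logits, dtype=float))
    labels = p.argmax(axis=1)
    labels[p.max(axis=1) < threshold] = unknown
    return labels

logits = np.array([[4.0, 0.0, 0.0],   # confident -> class 0
                   [0.1, 0.0, 0.1]])  # near-uniform -> unknown
pred = predict_open_set(logits)
```

Methods such as Saito et al. (2018) learn this rejection behavior during training instead of applying a fixed post hoc threshold.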

Fig. 13.4 Illustration of the joint partial optimal transport (JPOT) (Xu et al., 2020) for open set domain adaptation


13.4 Time Series Transfer Learning

While most of this book focuses on general transfer learning algorithms evaluated mainly on image classification tasks, this section focuses on a specific type of data: time series. Research in this area is relatively scarce compared to general algorithms, and existing transfer learning algorithms are not fully suitable for time series data. Time series (TS) have wide real-life applications such as weather forecasting, health data analysis, and transportation prediction. A time series is a sequence collected over time, space, or other pre-defined orderings. Since a time series evolves continuously, its statistical properties can change dynamically over time; such a series is non-stationary. Over the years, various research efforts have been devoted to building reliable and accurate models for non-stationary TS. Traditional approaches made great progress with hidden Markov models (HMMs), dynamic Bayesian networks, Kalman filters, and other statistical models (e.g., ARIMA). More recently, better performance has been achieved by recurrent neural networks (RNNs) (Salinas et al., 2020; Vincent and Thome, 2019), which make no assumptions about the temporal structure and can capture highly nonlinear and complex dependencies in TS.

Definition 13.2 (Time Series Forecasting) Given a time series of n segments D = {x_i, y_i}_{i=1}^{n}, where x_i = {x_i^1, ..., x_i^{m_i}} ∈ R^{p×m_i} is a p-variate segment of length m_i and y_i = (y_i^1, ..., y_i^c) ∈ R^c is the corresponding label, we need to learn a precise prediction model M : x_i → y_i for the future r ∈ N+ steps, i.e., for segments {x_j}_{j=n+1}^{n+r} of the same time series.

Are there transfer learning problems in time series data? If so, how can we handle them? First, we show that transfer learning problems do exist in time series data. The statistical properties of a non-stationary time series change dynamically, which means its distribution changes as well. In this case, although RNN models can capture some locality, they still fail to model future unseen data: an RNN is likely to suffer from model shift when facing unseen distributions. We need to build a temporally invariant model to predict unseen data. This section discusses possible solutions for transfer learning-based time series forecasting and classification.

(In general, m_i may be different for different time segments. When r = 1, the task becomes the one-step prediction problem.)
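Definition 13.2 presumes the series has already been cut into (segment, label) pairs. A minimal sliding-window construction for the univariate, one-step case (r = 1; the window length m = 3 is arbitrary) could look like:

```python
import numpy as np

def make_segments(series, m=3):
    # Each segment x_i holds m consecutive observations; its label y_i is
    # the next value (one-step-ahead forecasting, r = 1).
    X = np.array([series[i:i + m] for i in range(len(series) - m)])
    y = np.array(series[m:])
    return X, y

series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
X, y = make_segments(series)
# X[0] = [1, 2, 3] is paired with y[0] = 4, and so on.
```

For multi-step forecasting (r > 1), the label simply becomes a vector of the next r values instead of a single scalar.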


13.4.1 AdaRNN for Time Series Forecasting

As shown in Fig. 13.5, the data distributions P(x) differ across intervals A, B, and C, where x and y are the samples and predictions, respectively. In particular, the distribution of the test data, which is unseen during training, also differs from that of the training data, making prediction even harder. The conditional distribution P(y|x), however, is usually considered unchanged in this scenario, which is reasonable in many real applications. For instance, in stock prediction, the market fluctuates and changes the financial factors (P(x)), while the underlying economic laws (P(y|x)) remain unchanged. It is thus inevitable that the aforementioned methods show inferior performance and poor generalization (Kuznetsov and Mohri, 2014), since the distribution shift violates their basic i.i.d. assumption. How, then, can we build models for time series data under such circumstances? Du et al. (2021) proposed Adaptive RNN (AdaRNN) to handle the different distributions in time series data. AdaRNN first formalizes the temporal covariate shift (TCS) problem in time series and then builds a temporally invariant model to solve it.

Definition 13.3 (Temporal Covariate Shift) Given a time series D with n labeled segments, suppose it can be split into K periods (intervals), i.e., D = {D_1, ..., D_K}, where D_k = {x_i, y_i}_{i=n_k+1}^{n_{k+1}}, with n_1 = 0 and n_{K+1} = n. Temporal covariate shift (TCS) refers to the case in which all segments within the same period i follow the same data distribution P_{D_i}(x, y), while for different periods 1 ≤ i ≠ j ≤ K, P_{D_i}(x) ≠ P_{D_j}(x) and P_{D_i}(y|x) = P_{D_j}(y|x).

To handle the TCS problem, AdaRNN consists of two important steps, as shown in Fig. 13.6:

1. Temporal distribution characterization (TDC): characterize the distribution information in the TS.
2. Temporal distribution matching (TDM): match the distributions of the discovered periods to build a time series prediction model.

Fig. 13.5 Temporal covariate shift (TCS) in time series



Fig. 13.6 The framework of AdaRNN (Du et al., 2021). (a) Overview of AdaRNN. (b) Temporal distribution characterization (TDC). (c) Temporal distribution matching (TDM)

The formulation of TDC is a maximization over the number of periods K and the split points n_1, ..., n_K of the averaged distribution distance d(D_i, D_j) between the resulting periods (with 0 < K ≤ K_0 for a pre-defined bound K_0): TDC seeks the split whose periods are most dissimilar, so that the discovered periods characterize the distribution diversity of the series. TDM then trains a shared RNN whose hidden representations are matched across the discovered periods, yielding a temporally invariant prediction model. We omit the details and refer interested readers to Du et al. (2021).

13.5 Online Transfer Learning

In many applications, the target data arrive as a stream rather than as a fixed dataset, which calls for online transfer learning. The online transfer learning (OTL) framework of Zhao and Hoi (2010) builds on the passive-aggressive (PA) algorithm (Crammer et al., 2006). At step t, the learner receives x_t, predicts with the current weight vector w_t, observes the true label y_t, and computes the hinge loss l_t = max{0, 1 − y_t w_t^T x_t}. If l_t > 0, the weight is updated as

w_{t+1} = w_t + τ_t y_t x_t,  τ_t = min{C, l_t / ||x_t||^2},    (13.10)

where C is a hyperparameter. Beyond this homogeneous setting, OTL also designed algorithms for the heterogeneous setting and provided theoretical analysis.
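The passive-aggressive update of Eq. (13.10) takes only a few lines. The sketch below runs it over a toy labeled stream with hinge loss l_t = max{0, 1 − y_t w^T x_t}; the data are synthetic:

```python
import numpy as np

def pa_online(stream, dim, C=1.0):
    # Passive-aggressive online learner, Eq. (13.10): passive when the
    # hinge loss is zero, aggressive (updates) otherwise.
    w = np.zeros(dim)
    for x, y in stream:
        loss = max(0.0, 1.0 - y * (w @ x))
        if loss > 0:
            tau = min(C, loss / (x @ x))
            w = w + tau * y * x
    return w

# Linearly separable toy stream: the label is the sign of the first coordinate.
stream = [(np.array([1.0, 0.0]), 1),
          (np.array([-1.0, 0.5]), -1),
          (np.array([2.0, -0.5]), 1)]
w = pa_online(stream, dim=2)
# After the first (aggressive) step, the remaining examples already have
# positive margin, so the learner stays passive.
```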


Wu et al. (2017b) extended this single-source setting to the multi-source setting. Other research focuses on applications of online transfer learning, such as online feature selection (Wang et al., 2013), object tracking (Gao et al., 2012), reinforcement transfer learning (Zhan and Taylor, 2015), cross-domain searching (Yan et al., 2016), and concept drift (McKay et al., 2019).

Du et al. (2020) proposed an online learning algorithm to handle the distribution gap between two domains in online transfer learning. Assume there are $n$ source domains $\mathcal{D}_{s_1}, \mathcal{D}_{s_2}, \ldots, \mathcal{D}_{s_n}$, where the $i$-th domain is $\mathcal{D}_{s_i} = \{(x_j, y_j)\}_{j=1}^{n_{s_i}}$. In the target domain, they assumed there are two types of data: the first type is the unlabeled data $\mathcal{D}_{t_u} = \{x_i\}_{i=1}^{n_{t_u}}$ collected in advance, while the other is the labeled data $\mathcal{D}_{t_l} = \{(x_j, y_j)\}_{j=1}^{T}$ that arrives online. Different from Zhao and Hoi (2010), they considered multi-class classification. Let $K$ denote the total number of classes. They want to learn multiple feature transformation matrices $A_i, i = 1, 2, \ldots, n$, while learning the target classifier; the matrices transform the data into a new space to reduce the distribution divergence.

This algorithm consists of two stages. In the offline stage, they used joint distribution adaptation to give the matrices a good initialization. After feature mapping, they use the PA algorithm to learn $n$ source classifiers $f_{s_i}, i = 1, 2, \ldots, n$. At the online stage, they transform the data into the new space: for the $i$-th matrix, the mapped target data is $x_t^{p_i} = A_i x_t$. The predicted output is

$$F_t = \sum_{i=1}^{n} \left( u_i f_{s_i}^t\big(x_t^{p_i}\big) + v_i f_{t_i}^t\big(x_t^{p_i}\big) \right), \qquad \hat{y}_t = \arg\max_k F_t^k, \qquad (13.11)$$

where $F_t$ is a $K$-dimensional vector and $F_t^k$ denotes its output in the $k$-th dimension. Du et al. (2020) also proposed to update the mapping matrices in an online manner. Online transfer learning has strong practical demand, and we hope more work will emerge in the future.
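The ensemble prediction of Eq. (13.11) can be sketched as follows (a NumPy toy; all names are illustrative, and the actual implementation of Du et al. (2020) may differ):

```python
import numpy as np

def predict_multi_source(x, A_list, f_s_list, f_t_list, u, v):
    """Eq. (13.11): map x with each A_i, then combine the K-dimensional
    outputs of the i-th source classifier f_s and target classifier f_t
    with weights u_i and v_i; predict the argmax class."""
    K = len(f_s_list[0](A_list[0] @ x))
    F = np.zeros(K)
    for A_i, f_s, f_t, u_i, v_i in zip(A_list, f_s_list, f_t_list, u, v):
        x_p = A_i @ x                       # mapped target data x_t^{p_i}
        F += u_i * f_s(x_p) + v_i * f_t(x_p)
    return int(np.argmax(F)), F             # predicted class and score vector
```

In practice the weights $u_i, v_i$ are themselves adjusted online based on each classifier's mistakes.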

13.6 Summary

In this chapter, we reviewed some research areas related to transfer learning in complex environments. We presented five complex settings: data imbalance, multiple source domains, category openness, time series, and online learning. Each of these topics is vast and has developed rapidly in recent years. Due to space limitations, we only introduced their key ideas in this chapter. Please refer to the recent literature for a thorough understanding of each category.


13 Transfer Learning in Complex Environments

References

Ando, S. and Huang, C. Y. (2017). Deep over-sampling framework for classifying imbalanced data. arXiv preprint arXiv:1704.07515.
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357.
Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., and Singer, Y. (2006). Online passive-aggressive algorithms. Journal of Machine Learning Research, 7(Mar):551–585.
Crammer, K., Kearns, M., and Wortman, J. (2008). Learning from multiple sources. JMLR, 9(Aug):1757–1774.
Du, Y., Tan, Z., Chen, Q., Zhang, Y., and Wang, C. (2020). Homogeneous online transfer learning with online distribution discrepancy minimization. In ECAI.
Du, Y., Wang, J., Feng, W., Pan, S., Qin, T., Xu, R., and Wang, C. (2021). AdaRNN: Adaptive learning and forecasting of time series. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 402–411.
Duan, L., Tsang, I. W., Xu, D., and Chua, T.-S. (2009). Domain adaptation from multiple sources via auxiliary classifiers. In ICML, pages 289–296.
Fang, Z., Lu, J., Liu, F., Xuan, J., and Zhang, G. (2019). Open set domain adaptation: Theoretical bound and algorithm. arXiv preprint arXiv:1907.08375.
Ganganwar, V. (2012). An overview of classification algorithms for imbalanced datasets. International Journal of Emerging Technology and Advanced Engineering, 2(4):42–47.
Gao, C., Sang, N., and Huang, R. (2012). Online transfer boosting for object tracking. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 906–909. IEEE.
He, H. and Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284.
Hoi, S. C., Sahoo, D., Lu, J., and Zhao, P. (2018). Online learning: A comprehensive survey. arXiv preprint arXiv:1802.02871.
Hsiao, P.-H., Chang, F.-J., and Lin, Y.-Y. (2016). Learning discriminatively reconstructed source data for object recognition with few examples. IEEE Transactions on Image Processing, 25(8):3518–3532.
Huang, C., Li, Y., Change Loy, C., and Tang, X. (2016). Learning deep representation for imbalanced classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5375–5384.
Kriminger, E., Principe, J. C., and Lakshminarayan, C. (2012). Nearest neighbor distributions for imbalanced classification. In The 2012 International Joint Conference on Neural Networks (IJCNN), pages 1–5. IEEE.
Kuznetsov, V. and Mohri, M. (2014). Generalization bounds for time series prediction with non-stationary processes. In ALT, pages 260–274. Springer.
Li, S., Song, S., and Huang, G. (2016). Prediction reweighting for domain adaptation. IEEE Transactions on Neural Networks and Learning Systems, (99):1–14.
Li, S., Wang, Z., Zhou, G., and Lee, S. Y. M. (2011). Semi-supervised learning for imbalanced sentiment classification. In Twenty-Second International Joint Conference on Artificial Intelligence.
Liu, X.-Y., Wu, J., and Zhou, Z.-H. (2008). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2):539–550.
Lu, W., Wang, J., Chen, Y., and Sun, X. (2022). Generalized representation learning for time series classification. In International Conference on Machine Learning.
Mansour, Y., Mohri, M., and Rostamizadeh, A. (2009). Domain adaptation with multiple sources. In NeurIPS, pages 1041–1048.
McKay, H., Griffiths, N., Taylor, P., Damoulas, T., and Xu, Z. (2019). Online transfer learning for concept drifting data streams. In BigMine@KDD.


Ming Harry Hsu, T., Yu Chen, W., Hou, C.-A., Hubert Tsai, Y.-H., Yeh, Y.-R., and Frank Wang, Y.-C. (2015). Unsupervised domain adaptation with imbalanced cross-domain data. In Proceedings of the IEEE International Conference on Computer Vision, pages 4121–4129.
Pan, S. J., Tsang, I. W., Kwok, J. T., and Yang, Q. (2011). Domain adaptation via transfer component analysis. IEEE TNN, 22(2):199–210.
Panareda Busto, P. and Gall, J. (2017). Open set domain adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pages 754–763.
Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., and Wang, B. (2019). Moment matching for multi-source domain adaptation. In ICCV, pages 1406–1415.
Saito, K., Yamamoto, S., Ushiku, Y., and Harada, T. (2018). Open set domain adaptation by backpropagation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 153–168.
Salinas, D., Flunkert, V., Gasthaus, J., and Januschowski, T. (2020). DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191.
Schweikert, G., Rätsch, G., Widmer, C., and Schölkopf, B. (2009). An empirical analysis of domain adaptation algorithms for genomic sequence analysis. In NeurIPS, pages 1433–1440.
Sun, Q., Chattopadhyay, R., Panchanathan, S., and Ye, J. (2011). A two-stage weighting framework for multi-source domain adaptation. In NeurIPS, pages 505–513.
Sun, S.-L. and Shi, H.-L. (2013). Bayesian multi-source domain adaptation. In 2013 International Conference on Machine Learning and Cybernetics, volume 1, pages 24–28. IEEE.
Sun, Y., Kamel, M. S., Wong, A. K., and Wang, Y. (2007). Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40(12):3358–3378.
Sun, Y., Wong, A. K., and Kamel, M. S. (2009). Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence, 23(04):687–719.
Tang, Y., Zhang, Y.-Q., Chawla, N. V., and Krasser, S. (2009). SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(1):281–288.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Vincent, L. and Thome, N. (2019). Shape and time distortion loss for training deep time series forecasting models. In NeurIPS, pages 4189–4201.
Wang, J., Chen, Y., Hao, S., et al. (2017). Balanced distribution adaptation for transfer learning. In ICDM, pages 1129–1134.
Wang, J., Zhao, P., Hoi, S. C., and Jin, R. (2013). Online feature selection and its applications. IEEE Transactions on Knowledge and Data Engineering, 26(3):698–710.
Weiss, K. R. and Khoshgoftaar, T. M. (2016). Investigating transfer learners for robustness to domain class imbalance. In 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 207–213. IEEE.
Wu, F., Jing, X.-Y., Shan, S., Zuo, W., and Yang, J.-Y. (2017a). Multiset feature learning for highly imbalanced data classification. In Thirty-First AAAI Conference on Artificial Intelligence.
Wu, Q., Zhou, X., Yan, Y., Wu, H., and Min, H. (2017b). Online transfer learning by leveraging multiple source domains. Knowledge and Information Systems, 52(3):687–707.
Xu, R., Chen, Z., Zuo, W., Yan, J., and Lin, L. (2018). Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In CVPR, pages 3964–3973.
Xu, R., Liu, P., Zhang, Y., Cai, F., Wang, J., Liang, S., Ying, H., and Yin, J. (2020). Joint partial optimal transport for open set domain adaptation. In International Joint Conference on Artificial Intelligence, pages 2540–2546.
Yan, H., Ding, Y., Li, P., Wang, Q., Xu, Y., and Zuo, W. (2017). Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation. arXiv preprint arXiv:1705.00609.
Yan, Y., Wu, Q., Tan, M., and Min, H. (2016). Online heterogeneous transfer learning by weighted offline and online classifiers. In European Conference on Computer Vision, pages 467–474. Springer.


Zhan, Y. and Taylor, M. E. (2015). Online transfer learning in reinforcement learning domains. arXiv preprint arXiv:1507.00436.
Zhao, H., Zhang, S., Wu, G., Moura, J. M., Costeira, J. P., and Gordon, G. J. (2018). Adversarial multiple source domain adaptation. In NeurIPS, pages 8559–8570.
Zhao, P. and Hoi, S. C. (2010). OTL: A framework of online transfer learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1231–1238.
Zhu, Y., Zhuang, F., and Wang, D. (2019). Aligning domain-specific distribution and classifier for cross-domain classification from multiple sources. In AAAI, volume 33, pages 5989–5996.

Chapter 14

Low-Resource Learning

This chapter discusses another important topic that is closely related to transfer learning: low-resource learning. Low-resource learning refers to situations where there are insufficient hardware resources or labeled data to train a model, or even to perform fine-tuning. It has wide applications in the real world. We will first introduce how to compress transfer learning models. Then, we introduce three low-resource learning paradigms: semi-supervised learning, meta-learning, and self-supervised learning. While transfer learning has achieved consistent success in multiple fields, its combination with other learning paradigms can often generate a bigger impact than transfer learning alone. We will introduce their basic problem definitions, algorithms, and possible applications.

The organization of this chapter is as follows. We first introduce how to compress transfer learning models in a resource-constrained environment in Sect. 14.1. We introduce semi-supervised learning in Sect. 14.2. Then, we introduce meta-learning in Sect. 14.3. Next, we introduce self-supervised learning in Sect. 14.4. Finally, Sect. 14.5 concludes this chapter.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. J. Wang, Y. Chen, Introduction to Transfer Learning, Machine Learning: Foundations, Methodologies, and Applications, https://doi.org/10.1007/978-981-19-7584-4_14

14.1 Compressing Transfer Learning Models

Today, machine learning models are becoming increasingly large, and so are transfer learning models. In a low-resource scenario where there is little labeled data available, computation resources are often limited as well. Unfortunately, it remains challenging to deploy transfer learning models to resource-constrained devices such as mobile phones because of the huge computational cost these methods require. In order to reduce the resource requirements and accelerate the inference process, a common solution is network compression. Network compression methods mainly include network quantization (Rastegari et al., 2016; Zhou et al., 2017), weight pruning (Han et al., 2015b; He et al., 2017;


Molchanov et al., 2017), and low-rank approximation (Denton et al., 2014; Han et al., 2015a). In particular, channel pruning (He et al., 2017; Molchanov et al., 2017), a type of weight pruning, does not need special hardware or software implementations, unlike other methods. In addition, it can reduce negative transfer by pruning redundant channels. Therefore, it is a good choice for compressing deep transfer learning models.

However, we cannot simply use channel pruning for model compression in transfer learning, for two reasons. First, these compression methods were proposed for supervised learning problems, which makes them unsuitable for transfer settings where there are no labels in the target domain. Second, even if we can acquire some labels manually, applying these compression methods directly will result in negative transfer (Sect. 2.4).

In this section, we introduce the first attempt to compress transfer learning models: transfer channel pruning (TCP) proposed by Yu et al. (2019a,b). TCP compresses deep transfer learning models by pruning less important channels while simultaneously learning transferable features by reducing the cross-domain distribution divergence. Therefore, TCP reduces the impact of negative transfer and maintains competitive performance on the target task.

As shown in Fig. 14.1, TCP first learns the base deep transfer model through base model building. The base model is fine-tuned with the standard domain adaptation criteria. Second, TCP evaluates the importance of the channels in all layers with the transfer channel evaluation and performs further fine-tuning. Specifically, the convolutional layers, which usually dominate the computation complexity, are pruned in this step. Third, TCP iteratively refines the pruning results and stops after reaching a trade-off between accuracy and FLOPs (i.e., computational cost) or parameter size.
Technically speaking, TCP aims to preserve useful knowledge while pruning the $K$ least important channels. Let $\mathcal{L}(\mathcal{D}_s, \mathcal{D}_t, W)$ be the cost function and denote $W'$ as the final pruned weights; at the beginning, $W = W'$. We need to minimize the loss change after pruning a channel $a_{l,i}$ at layer $l$, which is computed as:

$$\Delta \mathcal{L}(a_{l,i}) = \mathcal{L}(\mathcal{D}_s, \mathcal{D}_t, a_{l,i}) - \mathcal{L}(\mathcal{D}_s, \mathcal{D}_t, a_{l,i} = 0). \qquad (14.1)$$

Fig. 14.1 Illustration of the transfer channel pruning (TCP) approach (Yu et al., 2019b)


To minimize $\Delta \mathcal{L}(a_{l,i})$, TCP proposes the transfer channel evaluation to compute the importance of channels. According to Taylor's theorem, the Taylor expansion of a function $f(x)$ at a point $x = a$ can be written as:

$$f(x) = \sum_{p=0}^{P} \frac{f^{(p)}(a)}{p!} (x - a)^p + R_p(x), \qquad (14.2)$$

where $f^{(p)}(a)$ denotes the $p$-th derivative of $f(x)$ at the point $x = a$ and the last term $R_p(x)$ represents the $p$-th order remainder. To approximate $\Delta \mathcal{L}(a_{l,i})$, we can use the first-order Taylor expansion near $a_{l,i} = 0$, which gives the loss change after removing $a_{l,i}$:

$$f(a_{l,i} = 0) = f(a_{l,i}) - f'(a_{l,i}) \cdot a_{l,i} + \frac{a_{l,i}^2}{2} \cdot f''(\xi), \qquad (14.3)$$

where $\xi \in [0, 1]$ and $\frac{|a_{l,i}|^2}{2} \cdot f''(\xi)$ is a Lagrange-form remainder that requires too much computation; therefore, we drop this term to accelerate the pruning process. Then, going back to Eq. (14.1), we get

$$\mathcal{L}(\mathcal{D}_s, \mathcal{D}_t, a_{l,i} = 0) = \mathcal{L}(\mathcal{D}_s, \mathcal{D}_t, a_{l,i}) - \frac{\partial \mathcal{L}}{\partial a_{l,i}} \cdot a_{l,i}. \qquad (14.4)$$

We obtain the criterion $\mathcal{G}$ of TCP by combining Eqs. (14.1) and (14.4):

$$\mathcal{G}(a_{l,i}) = \left| \Delta \mathcal{L}(a_{l,i}) \right| = \left| \frac{\partial \mathcal{L}}{\partial a_{l,i}} \cdot a_{l,i} \right|, \qquad (14.5)$$

which is the absolute value of the product of the activation and the gradient of the cost function. The activation $a_{l,i}$ over an $N$-sample batch is computed as:

$$a_{l,i} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{h_l \times w_l} \sum_{p=1}^{h_l} \sum_{q=1}^{w_l} a_{l,i}^{p,q}, \qquad (14.6)$$

where $h_l$ and $w_l$ denote the height and width of the feature map at layer $l$.

Considering a deep transfer learning method such as DDC (Tzeng et al., 2014) described in Chap. 9, we use $\mathcal{L}_{cls}$ and $\mathcal{L}_{mmd}$ to denote the classification loss and the transfer (MMD) loss. Then, $\mathcal{G}$ can finally be computed as:

$$\mathcal{G}(a_{l,i}) = \left| \frac{\partial \mathcal{L}_{cls}(\mathcal{D}_s, W)}{\partial a_{s_{l,i}}} \cdot a_{s_{l,i}} + \beta \cdot \frac{\partial \mathcal{L}_{mmd}(\mathcal{D}_s, \mathcal{D}_t, W)}{\partial a_{t_{l,i}}} \cdot a_{t_{l,i}} \right|, \qquad (14.7)$$

where $a_{s_{l,i}}$ and $a_{t_{l,i}}$ denote the activations with source data and target data, respectively.


In short, TCP is an accurate, generic, and efficient compression method that can be easily implemented by most deep learning libraries. We expect that more model compression algorithms can be developed for resource-constrained situations.
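In essence, the transfer channel evaluation scores each channel by the absolute product of its averaged activation and gradient, and prunes the lowest-scoring channels. A minimal NumPy sketch of this scoring step follows (our own simplification; the real TCP operates inside a deep network during domain-adaptive fine-tuning):

```python
import numpy as np

def channel_importance(acts, grads):
    """Score each channel as |mean(activation) * mean(gradient)|, a
    first-order Taylor estimate of the loss change if the channel
    were removed (cf. Eqs. (14.5)-(14.6)).
    acts, grads: arrays of shape (N, C, H, W)."""
    a = acts.mean(axis=(0, 2, 3))    # batch- and spatially-averaged activation, Eq. (14.6)
    g = grads.mean(axis=(0, 2, 3))   # same averaging for the gradients
    return np.abs(a * g)             # criterion of Eq. (14.5)

def prune_k_least(scores, k):
    """Indices of the K least important channels to prune."""
    return np.argsort(scores)[:k]
```

For the transfer setting of Eq. (14.7), one would accumulate two such scores, one from the classification loss on source data and one from the MMD loss, weighted by $\beta$.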

14.2 Semi-supervised Learning

We have discussed much research on unsupervised domain adaptation, which typically deals with an unlabeled target domain by transferring knowledge from a labeled source domain. In fact, this paradigm is closely related to a fundamental machine learning paradigm: semi-supervised learning (Zhou, 2016; Bishop, 2006). Over the years, semi-supervised learning has undergone fast development. In this section, we introduce the basic ideas and modern deep algorithms for semi-supervised learning. Our purpose is to bridge these two areas so that some of their key technologies can be applied to each other to boost performance.

As its name states, semi-supervised learning (SSL) aims at training a machine learning model using both labeled data and massive unlabeled data. As shown in Fig. 14.2, the total amount of unlabeled data is often far greater than that of the labeled data.

Definition 14.1 (Semi-supervised Learning) In semi-supervised learning, the training data consist of labeled and unlabeled data. Denote $[N] := \{1, 2, \ldots, N\}$. Let $\mathcal{D}_l = \{(x_b, y_b) : b \in [N_l]\}$ and $\mathcal{D}_u = \{u_b : b \in [N_u]\}$ be the labeled and unlabeled data, where $N_l$ and $N_u$ are their numbers of samples, respectively. Semi-supervised learning aims to learn a classifier $f_\theta$ that predicts well on a test dataset drawn from the distributions of both the labeled and unlabeled data.

The training objective for semi-supervised learning can generally be represented as:

$$\mathcal{L}_{ssl} = \mathcal{L}_s + w \mathcal{L}_u, \qquad (14.8)$$

Fig. 14.2 Illustration of the semi-supervised learning setting with binary classification


where $\mathcal{L}_s$ and $\mathcal{L}_u$ denote the supervised loss on the labeled data and the unsupervised loss on the unlabeled data, respectively, and $w$ is a trade-off hyperparameter. The most challenging part is to design the unsupervised loss $\mathcal{L}_u$. In this section, we focus on modern, i.e., deep learning-based, semi-supervised learning algorithms. According to a literature survey (Ouali et al., 2020), there are several kinds of SSL algorithms:

• Consistency regularization: learn generalized representations by regularizing the perturbations of the networks
• Pseudo labeling: a self-training approach that trains models using the pseudo labels on the unlabeled data
• Generative models: use a generative model to generate similar distributions that can be transferred to the unlabeled data
• Graph-based methods: perform label propagation by treating the labeled and unlabeled data as nodes on a graph

We will only introduce the consistency regularization and pseudo labeling algorithms since they have been the most popular types in recent years.

Semi-supervised learning vs. transfer learning Readers may notice that Eq. (14.8) is extremely similar to the transfer learning framework we introduced in Sect. 3.3. This again illustrates their close relationship. In fact, several recent efforts have made connections between unsupervised domain adaptation and semi-supervised learning. Berthelot et al. (2021) argued that a single framework would benefit semi-supervised learning, unsupervised domain adaptation, and semi-supervised domain adaptation alike. Their algorithm, AdaMatch, extended the popular semi-supervised learning algorithm FixMatch (Sohn et al., 2020) by addressing the distributional shift and adjusting the pseudo label confidence. AdaMatch works surprisingly well for all three kinds of tasks. Zhang et al. (2021b) stated that semi-supervised learning can be cast as unsupervised domain adaptation and argued that we can use a semi-supervised algorithm to solve UDA. Later, Liu et al. (2021) used a modified version of self-training for domain adaptation and achieved good performance.

14.2.1 Consistency Regularization Methods

Consistency regularization is also called consistency training. It is based on the cluster assumption that the learned decision boundary must lie in a low-density region. Therefore, if an unlabeled sample and its perturbations are not too distinct, their labels should also be the same. Figure 14.3 illustrates the basic process of consistency regularization using a representative approach called the $\Pi$-model (Diba et al., 2017). Given an input $x \in \mathcal{D}_u$, we apply a perturbation to it, such as augmenting the input data or adding dropout to the network $f_\theta$. Then, consistency regularization regularizes the two outputs to be similar. Denote $\tilde{y}_1 = f_\theta(x)$ and $\tilde{y}_2 = f_\theta(\tilde{x})$, where $\tilde{x}$ denotes the perturbed input, as the original and the augmented


Fig. 14.3 Illustration of consistency regularization, taking the $\Pi$-model as an example

outputs, respectively. Then, the unsupervised loss for consistency regularization is represented as:

$$\mathcal{L}_u = \frac{1}{|\mathcal{D}_u|} \sum_{x \in \mathcal{D}_u} d_{MSE}(\tilde{y}_1, \tilde{y}_2), \qquad (14.9)$$

where $d_{MSE}$ is the mean squared error; it is natural to replace it with other distance measures. Later work shares a similar framework with the $\Pi$-model with further modifications. Diba et al. (2017) proposed an extension of the $\Pi$-model called temporal ensembling. They argued that the original $\Pi$-model is not efficient since every raw sample $x$ and its augmentation must go through the network twice. Additionally, it is not stable to perform updates based only on a single output pair. Temporal ensembling aggregates all the predictions of samples through time. A target $\tilde{y}$ is aggregated with the exponential moving average (EMA) $y_{ema}$ of previous predictions:

$$y_{ema} = \alpha y_{ema} + (1 - \alpha) \tilde{y}, \qquad (14.10)$$

where $\alpha$ is the EMA parameter. In this way, each sample only needs one forward pass, which is more efficient than the $\Pi$-model. Similar to temporal ensembling, Mean Teacher (Tarvainen and Valpola, 2017) performs the EMA not on the outputs but on the model parameters across time:

$$\theta'_t = \alpha \theta'_{t-1} + (1 - \alpha) \theta_t, \qquad (14.11)$$

where $\theta_t$ is the model parameter at time $t$. Miyato et al. (2018) proposed virtual adversarial training (VAT), which borrows ideas from adversarial attacks and uses input perturbations to learn a generalized model. Specifically, they add a perturbation $r_{adv}$ to the input and then learn to minimize the difference from the prediction on the original input:

$$\mathcal{L}_u = \frac{1}{|\mathcal{D}_u|} \sum_{x \in \mathcal{D}_u} d_{MSE}\big(f_\theta(x), f_\theta(x + r_{adv})\big), \qquad (14.12)$$

where the perturbation $r_{adv}$ is computed by gradient-based optimization over the input noise.
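The shared mechanics of these methods — penalize disagreement between two perturbed predictions, and optionally keep an EMA copy of the model — can be sketched in a few lines (a toy NumPy version; the "model" and the perturbation scheme are placeholders, not the original implementations):

```python
import numpy as np

rng = np.random.default_rng(0)

def consistency_loss(f, x, noise=0.1):
    """Pi-model style loss (Eq. (14.9)): MSE between the prediction on
    the input and the prediction on a perturbed copy of the input."""
    y1 = f(x)
    y2 = f(x + noise * rng.standard_normal(x.shape))
    return float(np.mean((y1 - y2) ** 2))

def ema_update(theta_ema, theta, alpha=0.99):
    """Mean Teacher update (Eq. (14.11)): EMA of model parameters."""
    return alpha * theta_ema + (1 - alpha) * theta
```

Temporal ensembling applies the same `ema_update` rule to the per-sample predictions (Eq. (14.10)) instead of the parameters.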


Park et al. (2018) proposed to use different dropouts of the network as regularization. Xie et al. (2020) proposed unsupervised data augmentation (UDA), which performs augmentation on the inputs. Verma et al. (2022) proposed interpolation consistency training (ICT) to regularize the difference between the outputs on Mixup samples and the Mixup of the outputs:

$$\mathcal{L}_u = \frac{1}{|\mathcal{D}_u|} \sum_{x_i, x_j \in \mathcal{D}_u} d_{MSE}\big(f_\theta(\text{Mix}_\lambda(x_i, x_j)), \ \text{Mix}_\lambda(f_\theta(x_i), f_\theta(x_j))\big), \qquad (14.13)$$

where $\text{Mix}_\lambda$ is the Mixup function (Zhang et al., 2018) with parameter $\lambda$. Consistency regularization is a quite general framework for solving semi-supervised learning problems.
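As a concrete illustration of Eq. (14.13), here is a toy version of the ICT regularizer (the function names are ours). Note that for a linear model the loss is exactly zero, which shows that ICT only constrains the model's non-linear behavior between data points:

```python
import numpy as np

def mix(lam, a, b):
    """Mixup interpolation with coefficient lam (Zhang et al., 2018)."""
    return lam * a + (1 - lam) * b

def ict_loss(f, xi, xj, lam=0.3):
    """Eq. (14.13): the prediction on mixed inputs should match
    the mixture of the individual predictions."""
    pred_of_mix = f(mix(lam, xi, xj))
    mix_of_pred = mix(lam, f(xi), f(xj))
    return float(np.mean((pred_of_mix - mix_of_pred) ** 2))
```

In practice, $\lambda$ is sampled from a Beta distribution and $x_i, x_j$ are random pairs from the unlabeled batch.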

14.2.2 Pseudo Labeling and Thresholding Methods

Pseudo labeling (Lee et al., 2013) is a classic self-training method that directly uses the probability outputs on the unlabeled data as their pseudo labels, so that the unlabeled data can join training. The key is to generate highly confident pseudo labels such that only unlabeled data with high confidence are selected. Figure 14.4 shows an illustration of pseudo labeling and thresholding methods using FlexMatch (Zhang et al., 2021a) as an example.

A prior work called FixMatch (Sohn et al., 2020) utilizes consistency regularization with strong augmentation to achieve competitive performance. For unlabeled data, FixMatch first uses weak augmentation to generate artificial labels. These labels are then used as the targets of strongly augmented data. The unsupervised loss term in FixMatch thereby has the form:

$$\frac{1}{\mu B} \sum_{b=1}^{\mu B} \mathbb{1}\big(\max(p_m(y \mid \omega(u_b))) > \tau\big) \, H\big(\hat{p}_m(y \mid \omega(u_b)), \ p_m(y \mid \Omega(u_b))\big), \qquad (14.14)$$

Fig. 14.4 Illustration of pseudo labeling and thresholding methods using FlexMatch (Zhang et al., 2021a) as an example


where $B$ is the batch size of labeled data, $\mu$ is the ratio of unlabeled data to labeled data, and $u_b$ denotes a piece of unlabeled data. $\Omega$ is a strong augmentation function, in contrast to the weak augmentation $\omega$; $p_m(\cdot)$ is the model's prediction, $\hat{p}_m(\cdot)$ is the pseudo label (one-hot encoding or probabilities), and $H(\cdot, \cdot)$ is the cross-entropy. Here, $\tau$ is a pre-defined threshold used to mask the unlabeled data.

Later, Zhang et al. (2021a) proposed FlexMatch with curriculum pseudo labeling, taking inspiration from curriculum learning (Bengio et al., 2009). They argued that different classes should have different thresholds and proposed a flexible threshold:

$$\sigma_t(c) = \sum_{n=1}^{N} \mathbb{1}\big(\max(p_{m,t}(y \mid u_n)) > \tau\big) \cdot \mathbb{1}\big(\arg\max(p_{m,t}(y \mid u_n)) = c\big), \qquad (14.15)$$

where $\sigma_t(c)$ reflects the learning effect of class $c$ at time step $t$, $p_{m,t}(y \mid u_n)$ is the model's prediction for unlabeled data $u_n$ at time step $t$, and $N$ is the total number of unlabeled data. Then, $\sigma_t(c)$ is used to compute the flexible threshold:

$$T_t(c) = \frac{\sigma_t(c)}{\max_{c'} \sigma_t(c')} \cdot \tau. \qquad (14.16)$$

Compared with FixMatch, this is more flexible and achieves better results. Remarkably, FlexMatch can be added to other thresholding methods such as UDA (Xie et al., 2020) and pseudo labeling (Lee et al., 2013) to boost their performance, and it also converges much faster than FixMatch. In addition, the authors of FlexMatch developed a unified PyTorch-based semi-supervised learning library called TorchSSL¹ for the research community.

Recently, Wang et al. (2022) argued that existing methods need either a pre-defined fixed threshold, as in FixMatch, or an ad hoc threshold-adjusting scheme, which may not truly reflect the unlabeled data distribution. They proposed FreeMatch to adjust the threshold for unlabeled data in a self-adaptive manner, tuning both a global (dataset-level) threshold and local (class-specific) thresholds. The global threshold $\tau_t$ is defined and adjusted as:

$$\tau_t = \begin{cases} \frac{1}{C}, & \text{if } t = 0, \\ \lambda \tau_{t-1} + (1 - \lambda) \frac{1}{\mu B} \sum_{b=1}^{\mu B} \max(q_b), & \text{otherwise}, \end{cases} \qquad (14.17)$$

where $C$ is the number of classes, $\lambda \in (0, 1)$ is the momentum decay of the EMA, and $q_b$ denotes the model's prediction on the unlabeled sample $u_b$. The local thresholds modulate the global threshold in a class-specific fashion to account for the intra-class diversity. FreeMatch computes the expectation of the model's predictions on each class $c$ to estimate the class-specific learning status:

$$\tilde{p}_t(c) = \begin{cases} \frac{1}{C}, & \text{if } t = 0, \\ \lambda \tilde{p}_{t-1}(c) + (1 - \lambda) \frac{1}{\mu B} \sum_{b=1}^{\mu B} q_b(c), & \text{otherwise}, \end{cases} \qquad (14.18)$$

¹ https://github.com/torchssl/torchssl.


where $\tilde{p}_t = [\tilde{p}_t(1), \tilde{p}_t(2), \ldots, \tilde{p}_t(C)]$ is the list containing all $\tilde{p}_t(c)$. Integrating the global and local thresholds, we obtain the final self-adaptive threshold $\tau_t(c)$ as:

$$\tau_t(c) = \text{MaxNorm}(\tilde{p}_t(c)) \cdot \tau_t = \frac{\tilde{p}_t(c)}{\max\{\tilde{p}_t(c') : c' \in [C]\}} \cdot \tau_t, \qquad (14.19)$$

where $\text{MaxNorm}$ is maximum normalization (i.e., $x' = \frac{x}{\max(x)}$). Finally, the unsupervised training objective $\mathcal{L}_u$ at the $t$-th iteration is:

$$\mathcal{L}_u = \frac{1}{\mu B} \sum_{b=1}^{\mu B} \mathbb{1}\big(\max(q_b) > \tau_t(\arg\max(q_b))\big) \cdot H(\hat{q}_b, Q_b), \qquad (14.20)$$

where $\hat{q}_b$ is the hard pseudo label derived from $q_b$ and $Q_b$ is the model's prediction on the strongly augmented version of $u_b$.

FreeMatch achieved better results than both FixMatch and FlexMatch. Most importantly, it no longer needs any pre-defined thresholds, which points to a promising direction for future research. There are many other algorithms for semi-supervised learning; interested readers can refer to the survey (Ouali et al., 2020) for more methods.
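The thresholding logic of FixMatch and FlexMatch can be sketched as follows (NumPy; `q` holds softmax predictions on unlabeled data, and the helper names are ours, not from the official TorchSSL code):

```python
import numpy as np

def fixmatch_mask(q, tau=0.95):
    """FixMatch (Eq. (14.14)): keep samples whose max confidence
    exceeds a single fixed threshold tau."""
    return q.max(axis=1) > tau

def flexmatch_thresholds(q, tau=0.95):
    """FlexMatch (Eqs. (14.15)-(14.16)): per-class thresholds scaled by
    how many confident predictions each class has received so far
    (its 'learning effect' sigma_t(c))."""
    confident = q.max(axis=1) > tau
    pseudo = q.argmax(axis=1)
    C = q.shape[1]
    sigma = np.array([(confident & (pseudo == c)).sum() for c in range(C)])
    return sigma / max(sigma.max(), 1) * tau   # T_t(c), Eq. (14.16)
```

Classes the model has rarely predicted confidently get a lower threshold, so more of their samples enter training, which is the curriculum effect FlexMatch exploits.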

14.3 Meta-learning

In this section, we introduce another learning paradigm that is closely related to transfer learning: meta-learning. While transfer learning emphasizes transferring knowledge from one domain to another, meta-learning learns a general model from multiple tasks. Their goals are thus similar but with different problem definitions and learning methods. Recall the formulation of pre-training and fine-tuning in Sect. 8.2. Transfer learning is formulated as:

$$\theta^* = \arg\min_\theta \mathcal{L}(\theta \mid \theta_0, \mathcal{D}), \qquad (14.21)$$

where $\theta_0$ denotes the model parameters from historical tasks. Meta-learning is a different way to learn from historical tasks. Meta-learning is also called learning to learn, and it aims to learn knowledge from multiple tasks. The major difference between meta-learning and transfer learning lies in the acquisition of meta-knowledge. In order to acquire the meta-knowledge, meta-learning assumes we can access some tasks sampled from $P(\mathcal{T})$. Assume that we can sample $M$ tasks, denoted as $\mathcal{D}_{src} = \{(\mathcal{D}_{src}^{train,(i)}, \mathcal{D}_{src}^{val,(i)})\}_{i=1}^{M}$, where the two terms denote the training and validation sets of one task. Generally speaking, they are called the support set and the query set.


We call the process of acquiring meta-knowledge the meta-train process, which is formulated as:

$$\phi^* = \arg\max_\phi \log P(\phi \mid \mathcal{D}_{src}), \qquad (14.22)$$

where $\phi$ denotes the learnable parameters for obtaining meta-knowledge. To validate the effectiveness of the meta-knowledge, meta-learning defines a meta-test process: sample $Q$ tasks to form a test set $\mathcal{D}_{tar} = \{(\mathcal{D}_{tar}^{train,(i)}, \mathcal{D}_{tar}^{test,(i)})\}_{i=1}^{Q}$:

$$\theta^{*(i)} = \arg\max_\theta \log P\big(\theta \mid \phi^*, \mathcal{D}_{tar}^{train,(i)}\big). \qquad (14.23)$$

We use $P_\theta(y \mid x, S)$ to denote the meta-knowledge obtained from training data $S$. Then, meta-learning can be categorized into the following three types according to the different representations of meta-knowledge:

1. Model-based meta-learning, which uses another network to learn meta-knowledge, i.e., $P_\theta(y \mid x, S) = f_\theta(x, S)$
2. Metric-based meta-learning, which assumes the meta-knowledge can be obtained by some metric, i.e., $P_\theta(y \mid x, S) = \sum_{(x_i, y_i) \in S} k_\theta(x, x_i) \, y_i$, where $k(\cdot, \cdot)$ is a similarity measure
3. Optimization-based meta-learning, which performs gradient descent to progressively learn meta-knowledge from multiple tasks, i.e., $P_\theta(y \mid x, S) = f_{\theta(S)}(x)$

Meta-learning vs. Transfer Learning Can we use meta-learning for transfer learning or domain adaptation tasks? Yes. For instance, Sun et al. (2019) proposed a meta-transfer learning framework that combines the two paradigms for few-shot learning. Yu et al. (2020) proposed a learning algorithm called learning to match (L2M), which samples data using meta-learning and pseudo labeling to asymptotically match the distributions of the source and target domains. L2M samples different tasks from the source and target domains. Then, a positive linear layer computes the pair-wise distances between samples. After iterations, the distribution divergence can be approximated without any pre-defined distances such as MMD or adversarial training. Their experiments also showed that L2M can achieve even better performance with the help of existing pre-defined distances.

14.3.1 Model-Based Meta-learning

In this section, we introduce model-based meta-learning. This kind of method assumes that the meta-knowledge can naturally be represented by another neural network, and thus it is also called black-box or memory-based meta-learning. The


Fig. 14.5 Illustration of model-based meta-learning methods

core idea is to exploit the strategy of meta-learning: how can we learn experience from historical tasks? A natural answer is to use a network to learn such experience. This kind of method represents the meta-knowledge in $S$ as

$$P_\theta(y \mid x, S) = f_\theta(x, S), \qquad (14.24)$$

where $\theta$ is the meta-parameter to be learned by another network $f$. Model-based meta-learning is shown in Fig. 14.5: it learns the hyperparameter $\phi$ from historical tasks and then applies it to the meta-test set; this process is done iteratively. A classic model-based meta-learning method is the memory-augmented neural network (MANN) (Santoro et al., 2016). MANN learns general knowledge directly from historical tasks: when training on the current task, MANN also inputs the labels of the last task, which builds a direct connection through neural network training. A similar idea was also proposed in meta-networks (Munkhdalai and Yu, 2017). In addition to learning from historical tasks, Ravi and Larochelle (2016) proposed to learn from the optimization of historical tasks, exploiting an LSTM for such knowledge representation. The core of using a network for optimization is gradient descent; thus, their work learns hyperparameters from historical data. Ha et al. (2016) proposed hypernetworks to automatically design the hyperparameters. On the other hand, can we also learn the gradient descent process itself? Andrychowicz et al. (2016) proposed to learn the rules of gradient descent from historical tasks. There are also other works using model-based meta-learning; a network can be designed for many learning objectives: labels, loss functions, gradient descent, and optimizers. For instance, meta pseudo labels (Pham et al., 2020) assumed that the cross-entropy loss can also be learned by a network, which achieved great progress in semi-supervised learning.


14 Low-Resource Learning

14.3.2 Metric-Based Meta-learning

In this section, we introduce metric-based meta-learning. This kind of method assumes that the meta-knowledge can be learned through the similarity of historical tasks, and thus it is also called similarity-based meta-learning. It represents the meta-knowledge of the training data S as

$P_\theta(y|x, S) = \sum_{(x_i, y_i) \in S} k_\theta(x, x_i)\, y_i$,    (14.25)

where $k(\cdot, \cdot)$ is a similarity measure such as cosine similarity. These methods can also be interpreted from the perspective of kernel functions. The network is mainly designed to learn similarity from the training data so that it can predict new data based on their similarity to old data. Normally, the historical tasks are constructed from few-shot samples, and the labels of new data are also given by similarity. Figure 14.6 shows the main idea of metric-based meta-learning, where arrows of different thickness denote different similarities.

Vinyals et al. (2016) proposed matching networks to learn the similarity between validation data and the few-shot samples. In matching networks, the label $\hat{y}$ of the test data $\hat{x}$ is given by its similarity to the k support samples through a function $a(\cdot, \cdot)$:

$\hat{y} = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i$,    (14.26)

where the similarity function is defined as a softmax over cosine similarity:

$a(\hat{x}, x_i) = \dfrac{\exp(\cos(f(\hat{x}), g(x_i)))}{\sum_{j=1}^{k} \exp(\cos(f(\hat{x}), g(x_j)))}$.    (14.27)

Fig. 14.6 Metric-based meta-learning: samples in the training set and the test set are connected by similarity
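The matching-network prediction in Eqs. (14.26)–(14.27) can be sketched in a few lines; for illustration we assume the embedding functions f and g are the identity, so the attention reduces to a cosine softmax over the support set:

```python
import torch
import torch.nn.functional as F

def matching_predict(x_hat, support_x, support_y, n_classes):
    """Label a query x_hat by attention a(x_hat, x_i) over k support samples,
    with f = g = identity purely for illustration (Eqs. 14.26-14.27)."""
    cos = F.cosine_similarity(x_hat.unsqueeze(0), support_x, dim=1)  # cos(f(x_hat), g(x_i))
    a = torch.softmax(cos, dim=0)                                    # Eq. (14.27)
    y_onehot = F.one_hot(support_y, n_classes).float()
    return a @ y_onehot                                              # Eq. (14.26): sum_i a_i * y_i

support_x = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
support_y = torch.tensor([0, 1])
probs = matching_predict(torch.tensor([0.9, 0.1]), support_x, support_y, n_classes=2)
print(probs.argmax().item())  # 0: the query is closer to the class-0 support sample
```

In practice, f and g are learned deep encoders, and the softmax attention lets the prediction remain differentiable end to end.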


A similar idea was then developed in prototypical networks (ProtoNet) (Snell et al., 2017). The idea of ProtoNet is easy to understand: it learns the mean embedding of each class, and the label for a new sample is computed by its distance to these class mean embeddings. If we use $f_\theta$ to denote the feature learning step, the mean embedding (prototype) of class c is

$v_c = \dfrac{1}{|S_c|} \sum_{(x_i, y_i) \in S_c} f_\theta(x_i)$,    (14.28)

where $S_c$ denotes the samples belonging to class c. Then, the label for a new input x is computed as

$P(y = c|x) = \mathrm{softmax}(-d_\varphi(f_\theta(x), v_c)) = \dfrac{\exp(-d_\varphi(f_\theta(x), v_c))}{\sum_{c' \in C} \exp(-d_\varphi(f_\theta(x), v_{c'}))}$,    (14.29)

where $d_\varphi(\cdot, \cdot)$ denotes any distance function, such as the Euclidean distance. Later, Sung et al. (2018) proposed a similar relation network to learn the distance metric between samples, and Chen et al. (2019) proposed metric learning using multiple tasks. There are other works in this area that we will not discuss further.
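Eqs. (14.28)–(14.29) can be sketched as follows; for illustration, the encoder f_θ is omitted (identity features) and d_φ is assumed to be the Euclidean distance:

```python
import torch

def class_prototypes(feats, labels, n_classes):
    # v_c: mean embedding of each class, Eq. (14.28)
    return torch.stack([feats[labels == c].mean(dim=0) for c in range(n_classes)])

def proto_probs(query, prototypes):
    # P(y = c | x): softmax over negative distances to the prototypes, Eq. (14.29)
    d = torch.cdist(query, prototypes)   # Euclidean distance to each v_c
    return torch.softmax(-d, dim=1)

feats = torch.tensor([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
labels = torch.tensor([0, 0, 1, 1])
protos = class_prototypes(feats, labels, n_classes=2)
probs = proto_probs(torch.tensor([[0.3, 0.1]]), protos)
print(probs.argmax(dim=1).item())  # 0: the query is nearest to the class-0 prototype
```

In a real few-shot setting, feats would be the output of the learned encoder f_θ on the support set, recomputed per episode.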

14.3.3 Optimization-Based Meta-learning

In this section, we introduce optimization-based meta-learning. Different from model- and metric-based methods, optimization-based meta-learning assumes that we can learn general knowledge from multiple tasks by gradient descent; it has made great progress in recent years. This kind of method represents the meta-knowledge as

$P_\theta(y|x, S) = f_{\theta(S)}(x)$.    (14.30)

Note that $f_{\theta(S)}(x)$ here can easily be confused with $f_\theta(x, S)$ in model-based methods. The difference is that $f_\theta(x, S)$ is computed by another network in model-based methods, while in optimization-based meta-learning, the parameter $\theta(S)$ appears in the subscript, denoting that it is optimized directly.

Finn et al. (2017) proposed the classic model-agnostic meta-learning (MAML). MAML tries to learn general knowledge from multiple training tasks by gradient descent, as shown in Fig. 14.7. We use $\phi$ to denote the meta-parameter to be learned and $\theta_i$ to denote the parameter for task i. MAML first uses learning rate $\gamma$ to take a gradient step on each of the n sampled tasks:

$\theta_i = \phi - \gamma \nabla_\phi l_i(\phi)$.    (14.31)

Fig. 14.7 Illustration of the process of MAML (Finn et al., 2017): from the meta-parameter $\theta$, the task gradients $\nabla L_1$, $\nabla L_2$, $\nabla L_3$ adapt toward the task-specific optima $\theta_1^*$, $\theta_2^*$, $\theta_3^*$

Then, MAML computes the loss for each task on the query set:

$L(\phi) = \sum_{i=1}^{n} l_i(\theta_i)$.    (14.32)

Finally, MAML optimizes the meta-parameter $\phi$ with learning rate $\eta$:

$\phi \leftarrow \phi - \eta \nabla_\phi L(\phi)$.    (14.33)

The above process learns general knowledge: MAML does not care about the performance on any single task, but about the average performance over all tasks. It can be interpreted as learning an average state that benefits all tasks. MAML then updates the learning status based on the query set; these steps are repeated so that MAML learns general meta-knowledge from most of the tasks. Later, more works extended MAML. For instance, Reptile (Nichol et al., 2018) repeats MAML's gradient descent process many times. Rajeswaran et al. (2019) argued that one drawback of MAML is that it requires a lot of memory to store the gradient information of each task; they proposed implicit MAML (iMAML), which avoids differentiating through the inner computation by adding an $l_2$ regularizer to constrain the distance between the old and new parameters. Li et al. (2017) extended MAML with Meta-SGD, which also updates the initial parameters, learning rates, and gradient directions. Other works adopted MAML to solve specific problems: Hsu et al. (2018) extended MAML to unsupervised representation learning, and Na et al. (2019) integrated MAML with a Bayesian framework to solve imbalanced and biased classification. In Sect. 11.4, we introduced meta-learning-based domain generalization algorithms such as MLDG (Li et al., 2018), which are also successful applications of MAML.
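The MAML updates in Eqs. (14.31)–(14.33) can be sketched on a toy regression problem; the tasks, model size, and learning rates below are illustrative assumptions, and `create_graph=True` keeps the second-order terms that full MAML requires:

```python
import torch

torch.manual_seed(0)
phi = torch.randn(2, requires_grad=True)   # meta-parameter phi
gamma, eta = 0.1, 0.01                     # inner / outer learning rates (illustrative)

def task_loss(params, w):
    # toy task i: regress the parameters onto the task target w
    return ((params - w) ** 2).sum()

tasks = [torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])]  # n = 2 sampled tasks
meta_opt = torch.optim.SGD([phi], lr=eta)
for step in range(100):
    meta_opt.zero_grad()
    meta_loss = 0.0
    for w in tasks:
        # Eq. (14.31): inner gradient step, keeping the graph for second-order terms
        g = torch.autograd.grad(task_loss(phi, w), phi, create_graph=True)[0]
        theta_i = phi - gamma * g
        # Eq. (14.32): accumulate each task's query loss l_i(theta_i)
        meta_loss = meta_loss + task_loss(theta_i, w)
    meta_loss.backward()  # gradient of L(phi) w.r.t. phi
    meta_opt.step()       # Eq. (14.33): phi <- phi - eta * grad L(phi)

print(phi.detach())  # approaches the average state between the two tasks
```

Replacing `create_graph=True` with a plain detached gradient gives the cheaper first-order approximation used by methods such as Reptile.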


14.4 Self-supervised Learning

Both semi-supervised learning and meta-learning tackle the original task by leveraging auxiliary tasks or unlabeled data. In this section, we introduce another important topic that is closely related to transfer learning: self-supervised learning (Jing and Tian, 2020). In recent years, self-supervised learning has been widely applied in computer vision, time series analysis, natural language processing, and speech recognition. For instance, the seminal work BERT (Devlin et al., 2018) is based on self-supervised training of NLP tasks, and the Wav2Vec models (Schneider et al., 2019; Baevski et al., 2020) are also based on constructing self-supervised tasks.

Why do we use self-supervised learning? Because labeled data are always limited, and collecting enough labeled data is expensive and time-consuming. Thus, we want to explore the potential of training with only unlabeled data. Another advantage of self-supervised learning is pre-training. Since large datasets often lack enough labels for supervised training, we can perform self-supervised pre-training to learn general representations on them; the pre-trained model can then be transferred to downstream tasks. This is the case in current research in CV, NLP, and speech, where fine-tuning or further processing is performed on self-supervised pre-trained models such as BERT (Devlin et al., 2018) and Wav2Vec (Schneider et al., 2019; Baevski et al., 2020).

Technically speaking, there is no unified definition of self-supervised learning. Typically, if we can perform representation learning on massive unlabeled data by constructing supervision tasks, it can be called self-supervision. We introduce self-supervised learning methods in the following two categories:

1. Constructing pretext tasks: we can construct several different pretext tasks to facilitate the learning of the main task, which is simple and effective.
2. Contrastive self-supervised learning: learning representations by measuring the differences between anchor, positive, and negative points.

14.4.1 Constructing Pretext Tasks

Self-supervised learning focuses on learning better representations by constructing different self-supervised tasks, also called pretext tasks. These can be considered auxiliary tasks that differ from the original one. Formally, the objective of self-supervised learning is

$L_{\mathrm{self\text{-}super}} = L_{\mathrm{main}} + L_{\mathrm{aux}}$,    (14.34)

where $L_{\mathrm{main}}$ denotes the loss of the main task we want to complete and $L_{\mathrm{aux}}$ denotes the loss of the auxiliary task constructed to help learn the main task. Note that the two tasks are often not the same, and the labels for the auxiliary task are generated according to its construction principle. We often call these generated labels pseudo labels (pay attention to the difference between these pseudo labels and those in semi-supervised learning). This process is illustrated in Fig. 14.8.

Fig. 14.8 Illustration of self-supervised learning

Therefore, the main challenge of self-supervised learning is how to design the auxiliary tasks. Why can it succeed? Usually, self-supervised learning assumes that both tasks are related and share their representation learning modules. Consider this as a multi-task case: two or more tasks share some commonalities and structures, and thus they are likely to benefit each other. The difference between self-supervised learning and multi-task learning is that the tasks are given explicitly in the latter, while they are constructed by the users in the former.

In fact, there are no specific rules or guidelines for constructing pretext tasks; we have to start from our own applications, and the process is somewhat ad hoc. Gidaris et al. (2018) constructed pretext tasks by predicting image rotations: an unlabeled image can easily be rotated to different angles (0°, 90°, 180°, and 270°), and the pretext task becomes a classification problem of predicting which rotation was applied. Training on this task yields generalized representations that benefit the main task. Doersch et al. (2015) constructed a pretext task by predicting the relative position of image patches, obtained by dividing images into different patches. Similarly, Noroozi and Favaro (2016) constructed jigsaw puzzles from image patches. For non-image tasks, we can also leverage data properties for task construction. Zhang et al. (2022) constructed self-supervised tasks for multivariate time series data by applying different transformations to the data, such as noise, permutation, reversal, and scaling, as shown in Fig. 14.9.
They then constructed a classification task of predicting which transformation a time series has undergone. This dramatically helped learn generalized representations for their anomaly detection applications. There are many other pretext tasks, such as colorization, video generation, and clustering, which we cannot cover here; interested readers can refer to (Jing and Tian, 2020) for more details.
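The rotation pretext task described above can be sketched by generating rotated copies of an unlabeled batch together with pseudo labels; tensor shapes are illustrative, and a classifier trained on this output would provide the auxiliary loss L_aux of Eq. (14.34):

```python
import torch

def rotation_pretext_batch(images):
    """Build a 4-way classification pretext task from unlabeled images:
    rotate each image by 0/90/180/270 degrees and use the rotation index
    as the pseudo label (images: tensor of shape (N, C, H, W))."""
    rotated, pseudo_labels = [], []
    for k in range(4):  # k * 90 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        pseudo_labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(pseudo_labels)

imgs = torch.randn(8, 3, 32, 32)  # an unlabeled batch
x_aux, y_aux = rotation_pretext_batch(imgs)
print(x_aux.shape, y_aux.shape)  # torch.Size([32, 3, 32, 32]) torch.Size([32])
```

No human annotation is needed: the labels fall out of the construction itself, which is exactly what makes pretext tasks "self"-supervised.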


Fig. 14.9 Illustration of self-supervised task constructed in (Zhang et al., 2022)

14.4.2 Contrastive Self-supervised Learning

Contrastive learning learns representations by constructing anchor, positive, and negative points. The key idea is that we can learn better features by constraining the distance between the anchor and positive samples to be smaller than that between the anchor and negative samples. Formally, for any instance x, we denote $x^+$ and $x^-$ as its positive and negative samples, i.e., points similar and dissimilar to x, respectively. Then, the learning objective is

$\mathrm{Score}(f(x), f(x^+)) \gg \mathrm{Score}(f(x), f(x^-))$,    (14.35)

where $\mathrm{Score}(\cdot, \cdot)$ is a similarity function such as mutual information, cosine similarity, or simply the inner product. For instance, a popular contrastive objective is

$L_N = -\mathbb{E}_X \left[ \log \dfrac{\exp(f(x)^\top f(x^+))}{\exp(f(x)^\top f(x^+)) + \sum_{j=1}^{N-1} \exp(f(x)^\top f(x_j))} \right]$,    (14.36)

where a softmax-like loss is adopted and the inner product acts as the score function. Hjelm et al. (2018) proposed Deep InfoMax to predict whether a pair of global and local features come from the same image: the anchor sample is the global feature, the positive samples are local features from the same image, and the negative samples are local features from different images. Oord et al. (2018) proposed contrastive predictive coding (CPC) for speech data. CPC predicts future data from past data: the positive samples are the strictly time-ordered data, and the negative samples are randomly chosen data that are not time-ordered. They also showed that this objective can be derived from mutual information maximization.
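Eq. (14.36) can be sketched directly, using the inner product as the score function; the batch construction below is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def info_nce_loss(anchor, positive, negatives):
    # Eq. (14.36) with the inner product as the score function.
    # anchor, positive: (B, d); negatives: (B, N-1, d)
    pos = (anchor * positive).sum(dim=1, keepdim=True)          # f(x)^T f(x^+)
    neg = torch.bmm(negatives, anchor.unsqueeze(2)).squeeze(2)  # f(x)^T f(x_j)
    logits = torch.cat([pos, neg], dim=1)
    # -log softmax at index 0 is exactly the term inside the expectation
    return F.cross_entropy(logits, torch.zeros(anchor.size(0), dtype=torch.long))

anchor = torch.randn(4, 16)
loss_aligned = info_nce_loss(anchor, anchor.clone(), torch.randn(4, 7, 16))
loss_random = info_nce_loss(anchor, torch.randn(4, 16), torch.randn(4, 7, 16))
print(loss_aligned.item(), loss_random.item())  # aligned positives yield the lower loss
```

Writing the loss as a cross-entropy over one positive and N−1 negative scores makes clear why it is often described as an (N-way) instance classification problem.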


Self-supervised Learning vs. Transfer Learning

Kang et al. (2019) and Thota and Leontidis (2021) applied contrastive learning to domain adaptation and improved adaptation performance, using contrastive learning between the source and target domains to enhance the representation learning ability of the model. Lu et al. (2021) boosted the performance of contrastive learning with transfer learning on image classification tasks. We believe there will be more works combining self-supervised learning with transfer learning or domain adaptation, with each benefiting the other. For more details on contrastive and self-supervised learning, readers can refer to surveys such as (Jing and Tian, 2020; Jaiswal et al., 2021).

14.5 Summary

In this chapter, we introduced low-resource learning, which is closely related to transfer learning. We first introduced model compression for transfer learning in low-resource settings. Then, for the case where labeled data are extremely rare, we introduced three learning paradigms: semi-supervised learning, meta-learning, and self-supervised learning. It is important to note that no research area is independent of the others. We are seeing increasing collaboration between transfer learning and other fields: we can either use knowledge from other fields to help transfer learning or use transfer learning to solve challenges in other fields. Discovering the connections between them is key to generating bigger impacts.

References Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., Shillingford, B., and De Freitas, N. (2016). Learning to learn by gradient descent by gradient descent. In Advances in neural information processing systems, pages 3981–3989. Baevski, A., Zhou, H., Mohamed, A., and Auli, M. (2020). wav2vec 2.0: A framework for selfsupervised learning of speech representations. arXiv preprint arXiv:2006.11477. Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. Berthelot, D., Roelofs, R., Sohn, K., Carlini, N., and Kurakin, A. (2021). Adamatch: A unified approach to semi-supervised learning and domain adaptation. arXiv preprint arXiv:2106.04732. Bishop, C. M. (2006). Pattern recognition and machine learning. springer. Chen, G., Zhang, T., Lu, J., and Zhou, J. (2019). Deep meta metric learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 9547–9556. Denton, E. L., Zaremba, W., Bruna, J., LeCun, Y., and Fergus, R. (2014). Exploiting linear structure within convolutional networks for efficient evaluation. Advances in neural information processing systems, 27. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL.


Diba, A., Fayyaz, M., Sharma, V., Karami, A. H., Arzani, M. M., Yousefzadeh, R., and Van Gool, L. (2017). Temporal 3d convnets: New architecture and transfer learning for video classification. arXiv preprint arXiv:1711.08200. Doersch, C., Gupta, A., and Efros, A. A. (2015). Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE international conference on computer vision, pages 1422–1430. Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 1126–1135. JMLR. org. Gidaris, S., Singh, P., and Komodakis, N. (2018). Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728. Ha, D., Dai, A., and Le, Q. V. (2016). Hypernetworks. arXiv preprint arXiv:1609.09106. Han, S., Mao, H., and Dally, W. J. (2015a). Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149. Han, S., Pool, J., Tran, J., and Dally, W. (2015b). Learning both weights and connections for efficient neural network. Advances in neural information processing systems, 28. He, Y., Zhang, X., and Sun, J. (2017). Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE international conference on computer vision, pages 1389–1397. Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. (2018). Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations. Hsu, K., Levine, S., and Finn, C. (2018). Unsupervised learning via meta-learning. arXiv preprint arXiv:1810.02334. Jaiswal, A., Babu, A. R., Zadeh, M. Z., Banerjee, D., and Makedon, F. (2021). A survey on contrastive self-supervised learning. Technologies, 9(1):2. Jing, L. and Tian, Y. (2020). 
Self-supervised visual feature learning with deep neural networks: A survey. IEEE TPAMI. Kang, G., Jiang, L., Yang, Y., and Hauptmann, A. G. (2019). Contrastive adaptation network for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4893–4902. Lee, D.-H. et al. (2013). Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, page 896. Li, D., Yang, Y., Song, Y.-Z., and Hospedales, T. M. (2018). Learning to generalize: Meta-learning for domain generalization. In Thirty-Second AAAI Conference on Artificial Intelligence. Li, Z., Zhou, F., Chen, F., and Li, H. (2017). Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835. Liu, H., Wang, J., and Long, M. (2021). Cycle self-training for domain adaptation. arXiv preprint arXiv:2103.03571. Lu, Y., Jha, A., and Huo, Y. (2021). Contrastive learning meets transfer learning: A case study in medical image analysis. arXiv preprint arXiv:2103.03166. Miyato, T., Maeda, S.-i., Koyama, M., and Ishii, S. (2018). Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993. Molchanov, P., Tyree, S., Karras, T., Aila, T., and Kautz, J. (2017). Pruning convolutional neural networks for resource efficient inference. In International conference on learning representations (ICLR). Munkhdalai, T. and Yu, H. (2017). Meta networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2554–2563. JMLR. org. Na, D., Lee, H. B., Kim, S., Park, M., Yang, E., and Hwang, S. J. (2019). Learning to balance: Bayesian meta-learning for imbalanced and out-of-distribution tasks. arXiv preprint arXiv:1905.12917. Nichol, A., Achiam, J., and Schulman, J. (2018). 
On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999.


Noroozi, M. and Favaro, P. (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision, pages 69–84. Springer. Oord, A. v. d., Li, Y., and Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Ouali, Y., Hudelot, C., and Tami, M. (2020). An overview of deep semi-supervised learning. arXiv preprint arXiv:2006.05278. Park, S., Park, J., Shin, S.-J., and Moon, I.-C. (2018). Adversarial dropout for supervised and semisupervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence. Pham, H., Xie, Q., Dai, Z., and Le, Q. V. (2020). Meta pseudo labels. arXiv preprint arXiv:2003.10580. Rajeswaran, A., Finn, C., Kakade, S. M., and Levine, S. (2019). Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems, pages 113–124. Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. (2016). XNOR-Net: ImageNet classification using binary convolutional neural networks. In European conference on computer vision, pages 525–542. Springer. Ravi, S. and Larochelle, H. (2016). Optimization as a model for few-shot learning. In ICLR. Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. (2016). Meta-learning with memory-augmented neural networks. In ICML, pages 1842–1850. Schneider, S., Baevski, A., Collobert, R., and Auli, M. (2019). wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862. Snell, J., Swersky, K., and Zemel, R. S. (2017). Prototypical networks for few-shot learning. In NeurIPS. Sohn, K., Berthelot, D., Li, C.-L., Zhang, Z., Carlini, N., Cubuk, E. D., Kurakin, A., Zhang, H., and Raffel, C. (2020). FixMatch: Simplifying semi-supervised learning with consistency and confidence. In NeurIPS. Sun, Q., Liu, Y., Chua, T.-S., and Schiele, B. (2019). Meta-transfer learning for few-shot learning. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 403–412. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H., and Hospedales, T. M. (2018). Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208. Tarvainen, A. and Valpola, H. (2017). Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 1195–1204. Thota, M. and Leontidis, G. (2021). Contrastive domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2209–2218. Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., and Darrell, T. (2014). Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474. Verma, V., Kawaguchi, K., Lamb, A., Kannala, J., Solin, A., Bengio, Y., and Lopez-Paz, D. (2022). Interpolation consistency training for semi-supervised learning. Neural Networks, 145:90–106. Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. (2016). Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638. Wang, Y., Chen, H., Heng, Q., Hou, W., Savvides, M., Shinozaki, T., Wu, Z., Bhiksha, R., and Wang, J. (2022). Freematch: self-adaptive thresholding for semi-supervised learning. In Technical report. Xie, Q., Dai, Z., Hovy, E., Luong, T., and Le, Q. (2020). Unsupervised data augmentation for consistency training. Advances in Neural Information Processing Systems, 33. Yu, C., Wang, J., Chen, Y., and Qin, X. (2019a). Transfer channel pruning for compressing deep domain adaptation models. International Journal of Machine Learning and Cybernetics, 10(11):3129–3144. Yu, C., Wang, J., Chen, Y., and Wu, Z. (2019b). 
Accelerating deep unsupervised domain adaptation with transfer channel pruning. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE.


Yu, C., Wang, J., Liu, C., Qin, T., Xu, R., Feng, W., Chen, Y., and Liu, T.-Y. (2020). Learning to match distributions for domain adaptation. arXiv preprint arXiv:2007.10791. Zhang, B., Wang, Y., Hou, W., Wu, H., Wang, J., Okumura, M., and Shinozaki, T. (2021a). Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling. Advances in Neural Information Processing Systems, 34. Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. (2018). mixup: Beyond empirical risk minimization. In ICLR. Zhang, Y., Wang, J., Chen, Y., Yu, H., and Qin, T. (2022). Adaptive memory networks with selfsupervised learning for unsupervised anomaly detection. IEEE Transactions on Knowledge and Data Engineering. Zhang, Y., Zhang, H., Deng, B., Li, S., Jia, K., and Zhang, L. (2021b). Semi-supervised models are strong unsupervised domain adaptation learners. arXiv preprint arXiv:2106.00417. Zhou, A., Yao, A., Guo, Y., Xu, L., and Chen, Y. (2017). Incremental network quantization: Towards lossless CNNs with low-precision weights. In International conference on learning representations (ICLR). Zhou, Z.-h. (2016). Machine learning. Tsinghua University Press.

Part III

Applications of Transfer Learning

Chapter 15

Transfer Learning for Computer Vision

Today, most deep learning algorithms, tutorials, and talks use computer vision tasks as benchmarks. For instance, the common "Hello world" example of deep learning tutorials is MNIST digit classification, and the ImageNet challenge dramatically boosted the rapid development of deep learning. To this day, ImageNet is still a common benchmark in many areas. Additionally, since classification tasks do not require as much domain knowledge as speech recognition or machine translation, we can focus more on the methods and algorithms, which makes learning and understanding easier. That is why classification on computer vision tasks is a great experimental field in many areas, and why this book follows the same convention: readers will find that not only this book but also most machine learning books use image classification tasks as examples. In this chapter, we will not demonstrate how to use transfer learning for image classification, since that has already been implemented in previous sections. Instead, we implement transfer learning for object detection and neural style transfer. In particular, we show the power of pre-training and fine-tuning in these tasks. The complete code can be found at: https://github.com/jindongwang/tlbookcode/tree/main/chap15_app_cv.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 J. Wang, Y. Chen, Introduction to Transfer Learning, Machine Learning: Foundations, Methodologies, and Applications, https://doi.org/10.1007/978-981-19-7584-4_15

15.1 Object Detection

15.1.1 Task and Dataset

Image classification is a standard computer vision task: it requires classifying an image into one of several pre-defined categories. There are two tasks beyond classification: object detection and instance segmentation.

Fig. 15.1 Illustration of image classification, object detection, and segmentation. The image is taken from the PASCAL-VOC 2012 dataset

Object

detection requires the model not only to classify the object but also to detect its position and extent in the image (i.e., a bounding box). Furthermore, since the bounding box only delimits the range of the object, instance segmentation goes further and segments the whole object at the pixel level. Figure 15.1 shows the difference between these three tasks; from classification to segmentation, the tasks become progressively harder. For brevity, we only show how to use transfer learning for object detection. We use the classic PASCAL-VOC 2012 dataset1 for our task. This dataset comes from the PASCAL-VOC object detection challenge in 2012 and is mainly designed for object detection and instance segmentation. Different from image classification datasets, it contains annotations for each image indicating the object classes and bounding boxes.

15.1.2 Load Data

Although PyTorch has its own VOC dataset class that can load this dataset, it cannot easily handle the bounding boxes and areas. Thus, for better customization, we write our own dataset class as follows. Note that for training efficiency, we only load 200 images as a demonstration.

Dataset class

class VOC(torch.utils.data.Dataset):
    def __init__(self, root_path, transforms):
        self.root = root_path
        self.transforms = transforms
        self.imgs = list(sorted(os.listdir(os.path.join(self.root, "JPEGImages"))))[:200]
        self.masks = list(sorted(os.listdir(os.path.join(self.root, "Annotations"))))[:200]

    def __getitem__(self, idx):
        img_path = os.path.join(self.root, "JPEGImages", self.imgs[idx])
        annot_path = os.path.join(self.root, "Annotations", self.masks[idx])
        img = Image.open(img_path).convert("RGB")
        tree = ET.parse(annot_path)
        root = tree.getroot()
        boxes = []
        for neighbor in root.iter('bndbox'):
            xmin = int(neighbor.find('xmin').text)
            ymin = int(neighbor.find('ymin').text)
            xmax = int(neighbor.find('xmax').text)
            ymax = int(neighbor.find('ymax').text)
            boxes.append([xmin, ymin, xmax, ymax])
        num_objs = len(boxes)

        # convert everything into a torch.Tensor
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        # there is only one class
        labels = torch.ones((num_objs,), dtype=torch.int64)
        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)

        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["image_id"] = image_id
        target["area"] = area
        target['iscrowd'] = iscrowd
        if self.transforms is not None:
            img, target = self.transforms(img, target)
        return img, target

    def __len__(self):
        return len(self.imgs)

1 http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html.

Then, we use PyTorch's DataLoader to wrap the datasets into dataloader objects:

Load data function

def get_transform(train):
    transforms = []
    transforms.append(T.ToTensor())
    if train:
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)

def load_data(root_path):
    dataset = VOC(root_path, get_transform(train=True))
    dataset_test = VOC(root_path, get_transform(train=False))
    indices = torch.randperm(len(dataset)).tolist()
    dataset = torch.utils.data.Subset(dataset, indices[:-50])
    dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])
    loader_tr = torch.utils.data.DataLoader(
        dataset, batch_size=args.batchsize, shuffle=True,
        num_workers=4, collate_fn=utils.collate_fn)
    loader_te = torch.utils.data.DataLoader(
        dataset_test, batch_size=args.batchsize * 2, shuffle=False,
        num_workers=4, collate_fn=utils.collate_fn)
    return loader_tr, loader_te

15.1.3 Model

There are plenty of object detection models in torchvision. We simply use the Faster R-CNN (Ren et al., 2015) architecture with ResNet-50 (He et al., 2016) as the backbone. For comparison, we load models with or without ImageNet pre-training, i.e., we perform transfer learning from ImageNet to the VOC 2012 dataset. The code is as follows:

Load pre-trained model

def get_model_detection(n_class, pretrain=True):
    from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
    # load a pre-trained model
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=pretrain)
    num_classes = n_class
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

Then, we can set the parameter pretrain to True or False to determine whether to use the ImageNet pre-trained weights.


15.1.4 Train and Test

For simplicity, we reuse some training and evaluation code from PyTorch: the train_one_epoch and evaluate functions. Then, our training code is:

Train and test

def train(loaders, model, device):
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=args.lr, momentum=args.momentum,
                                weight_decay=args.weight_decay)
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)

    for epoch in range(args.nepoch):
        # train for one epoch (using pytorch's own function)
        train_one_epoch(model, optimizer, loaders['tr'], device, epoch, print_freq=10)
        # update the learning rate
        lr_scheduler.step()
        # evaluation (also using pytorch's own function)
        evaluate(model, loaders['te'], device=device)

We run the code with and without pre-trained models. The results are shown in Figs. 15.2 and 15.3, respectively. The performance (AP: average precision) without pre-training is only slightly better than 0, while the pre-trained model reaches 0.368. This indicates the effectiveness of transfer learning. Note that we only train for 10 epochs; better results can be obtained by training for more epochs and adopting better backbone models.

Fig. 15.2 Results of object detection without pre-trained models


Fig. 15.3 Results of object detection with pre-trained models

15.2 Neural Style Transfer

Different from classification, detection, and segmentation tasks, neural style transfer is a generation task that learns to generate images in the same style as a given target image. Thus, it is a transfer learning task that transfers the style from one image to another. This task has wide applications in the real world, such as photo editing, video editing, and art creation. In this section, we will implement a simple neural style transfer example. The key to implementing neural style transfer is to compute the content loss and the style loss. The content loss is the feature difference between the generated image and the source image. The style loss is the distance between the Gram matrices (in other words, the style representations) of the generated image and the style reference image.
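To make the style representation concrete: for a feature map F with C channels flattened to H·W positions, the Gram matrix is G = F·Fᵀ, whose entries measure channel-to-channel correlations. A minimal plain-Python sketch (the tiny 2×2 feature map is made up for illustration):

```python
def gram(features):
    # features: C lists of length H*W; returns the C x C matrix G = F F^T
    c = len(features)
    return [[sum(fa * fb for fa, fb in zip(features[i], features[j]))
             for j in range(c)] for i in range(c)]

f = [[1.0, 2.0],   # channel 0 over 2 spatial positions
     [3.0, 4.0]]   # channel 1
print(gram(f))  # [[5.0, 11.0], [11.0, 25.0]]
```

The style loss then compares the Gram matrices of the generated and style images, which discards spatial layout and keeps only the correlation statistics, i.e., the "style."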

15.2.1 Load Data

We will load two images: the source image and the target image. The source image is also called the content image, while the target image contains the style that we want to transfer to the content image. The code is as follows:

Load data
def load_images(src_path, tar_path):
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.485, 0.456, 0.406),
                             std=(0.229, 0.224, 0.225))])
    img_src, img_tar = Image.open(src_path), Image.open(tar_path)
    img_src = img_src.resize((224, 224))
    img_tar = img_tar.resize((224, 224))
    img_src = transform(img_src).unsqueeze(0).cuda()
    img_tar = transform(img_tar).unsqueeze(0).cuda()
    return img_src, img_tar

We use the following content and style images (Fig. 15.4):

Fig. 15.4 Content and style images downloaded from https://pixabay.com/

15.2.2 Model

We use the VGG19 model as our base model. Then, we select several of its convolutional layers to extract features, which are used to compute the content and style losses.

Load model
class VGGNet(nn.Module):
    def __init__(self):
        super(VGGNet, self).__init__()
        self.vgg = torchvision.models.vgg19(
            pretrained=not args.no_pretrain).features
        self.conv_layers = ['0', '5', '10', '19', '28']

    def forward(self, x):
        features = []
        for name, layer in self.vgg._modules.items():
            x = layer(x)
            if name in self.conv_layers:
                features.append(x)
        return features

15.2.3 Train

The training code is presented as follows. Note that we save the generated image every 500 steps.


Model training
def train(model, imgs):
    img_src, img_tar = imgs['src'], imgs['tar']
    # Initialize a target image with the content image
    target = img_src.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([target], lr=args.lr, betas=[0.5, 0.999])
    for epoch in range(args.nepoch):
        fea_tar, fea_cont, fea_style = model(target), model(img_src), \
            model(img_tar)
        loss_sty, loss_con = 0, 0
        for f_tar, f_con, f_sty in zip(fea_tar, fea_cont, fea_style):
            loss_con += torch.mean((f_tar - f_con)**2)
            _, c, h, w = f_tar.size()
            f_tar = f_tar.view(c, h * w)
            f_sty = f_sty.view(c, h * w)
            f_tar = torch.mm(f_tar, f_tar.t())
            f_sty = torch.mm(f_sty, f_sty.t())
            loss_sty += torch.mean((f_tar - f_sty)**2) / (c * h * w)
        loss = loss_con + args.w_style * loss_sty
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if (epoch+1) % args.log_interval == 0:
            print(f'Epoch [{epoch+1}/{args.nepoch}], '
                  f'loss_con: {loss_con.item():.4f}, '
                  f'loss_sty: {loss_sty.item():.4f}')
        if (epoch+1) % args.sample_step == 0:
            # Save the generated image
            denorm = transforms.Normalize((-2.12, -2.04, -1.80),
                                          (4.37, 4.46, 4.44))
            img = target.clone().squeeze()
            img = denorm(img).clamp_(0, 1)
            torchvision.utils.save_image(
                img, 'pretrain-output-{}.png'.format(epoch+1))
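The constants in the denorm transform above are not arbitrary: the inverse of Normalize(mean, std) is another Normalize with mean −mean/std and std 1/std, so they follow directly from the ImageNet statistics used in load_images. A quick arithmetic check:

```python
# x_norm = (x - mean) / std  =>  x = (x_norm - (-mean/std)) * std
# i.e. the inverse is Normalize(mean=-mean/std, std=1/std)
mean = (0.485, 0.456, 0.406)
std = (0.229, 0.224, 0.225)
inv_mean = tuple(round(-m / s, 2) for m, s in zip(mean, std))
inv_std = tuple(round(1 / s, 2) for s in std)
print(inv_mean)  # (-2.12, -2.04, -1.8)
print(inv_std)   # (4.37, 4.46, 4.44)
```

These are exactly the values hard-coded in the training loop.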

The results with and without pre-trained models are shown in Fig. 15.5. We can see that with pre-training, the generated image is closer to the target image style. Of course, the result is not perfect, as some noise remains in the image; further effort is needed to improve it.


Fig. 15.5 Neural style transfer with and without pre-trained models

References

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28.

Chapter 16

Transfer Learning for Natural Language Processing

Recent years have witnessed the fast development of natural language processing (NLP). In particular, the pre-training technique has been playing a key role in common NLP tasks. In this chapter, we show how to perform fine-tuning with a pre-trained language model on a sentence classification task. To save space, we only introduce the important code snippets in this chapter. For the complete code, please refer to: https://github.com/jindongwang/tlbook-code/tree/main/chap15_application/nlp. In addition to PyTorch, Huggingface's Transformers1 library is popular for handling NLP tasks, so this library is necessary in our example. Readers can install it easily by running pip install transformers datasets in a terminal.

16.1 Emotion Classification

We use the TweetEval dataset (Barbieri et al., 2020) to perform an emotion classification task. TweetEval contains seven heterogeneous Twitter tasks, all framed as multi-class tweet classification. We use its emotion subset, which contains four emotions: anger, joy, optimism, and sadness. Figure 16.1 shows the dataset overview.

1 https://huggingface.co/.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 J. Wang, Y. Chen, Introduction to Transfer Learning, Machine Learning: Foundations, Methodologies, and Applications, https://doi.org/10.1007/978-981-19-7584-4_16


Fig. 16.1 Illustration of the emotion subset of TweetEval dataset

To load and tokenize this dataset, we write two functions as follows:

Load and tokenize
def load_data():
    dataset = load_dataset('tweet_eval', 'emotion')
    return dataset

def tokenizer(data):
    tokenizer = AutoTokenizer.from_pretrained(args.model)
    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length",
                         truncation=True)
    tok_data = data.map(tokenize_function, batched=True)
    return tok_data

Then, we can call these two functions to load and tokenize the dataset:

Load and tokenize
dataset = load_data()
tok_data = tokenizer(dataset)
print(tok_data)

This gives the dataset statistics shown in Fig. 16.2. We see that the training, validation, and test sets contain 3257, 1421, and 374 sentences, respectively. Additionally, to make the data suitable for PyTorch dataloaders, we need some further post-processing to get the final dataloaders:


Fig. 16.2 Dataset statistics of TweetEval emotion subset

Post-processing
def post_process(tok_data):
    tok_data = tok_data.remove_columns(["text"])
    tok_data = tok_data.rename_column("label", "labels")
    tok_data.set_format("torch")
    loader_tr = DataLoader(tok_data["train"], shuffle=True,
                           batch_size=args.batchsize)
    loader_eval = DataLoader(tok_data["validation"],
                             batch_size=args.batchsize)
    loader_te = DataLoader(tok_data['test'], batch_size=args.batchsize)
    loaders = {"train": loader_tr, 'eval': loader_eval, 'test': loader_te}
    return loaders

16.2 Model

There are thousands of pre-trained NLP models for sentence classification. In this chapter, we use the widely adopted BERT model (Devlin et al., 2018). Specifically, we use the "bert-base-cased" model from Huggingface's model hub. In addition, to compare the performance with and without pre-trained weights, we write the code to load the model as follows:

Load BERT model
def load_model(pretrain=True):
    if pretrain:
        model = AutoModelForSequenceClassification.from_pretrained(
            args.model, num_labels=args.nclass)
    else:
        from transformers import BertConfig, BertForSequenceClassification
        config = BertConfig.from_pretrained(args.model,
                                            num_labels=args.nclass)
        model = BertForSequenceClassification(config)
    return model


16.3 Train and Test

The training and test procedure for fine-tuning BERT is similar to that of image classification. We tune the model on the training split, then evaluate its performance on the validation set to select the best model. Finally, we test it on the test set.

Train and test
def train(model, optimizer, loaders):
    num_epochs = args.nepochs
    num_training_steps = num_epochs * len(loaders['train'])
    lr_scheduler = get_scheduler(
        name="linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps
    )
    best_acc = 0
    for e in range(num_epochs):
        model.train()
        for batch in loaders['train']:
            batch = {k: v.cuda() for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
        eval_acc = eval(model, loaders['eval'])
        print(f'Epoch: [{e}/{num_epochs}] loss: {loss:.4f}, '
              f'eval_acc: {eval_acc:.4f}')
        if eval_acc > best_acc:
            best_acc = eval_acc
            torch.save(model.state_dict(), 'bestmodel.pkl')
    # final test on test loader
    test_acc = eval(model, loaders['test'], 'bestmodel.pkl')
    print(f'Test accuracy: {test_acc:.4f}')

def eval(model, dataloader, model_path=None):
    metric = load_metric('accuracy')
    if model_path:
        model.load_state_dict(torch.load(model_path))
    model.eval()
    for batch in dataloader:
        batch = {k: v.cuda() for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        metric.add_batch(predictions=predictions,
                         references=batch["labels"])
    res = metric.compute()
    return res['accuracy']
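The eval function relies on Huggingface's 'accuracy' metric; conceptually it just counts matching predictions, as in this plain-Python stand-in (the example labels are made up):

```python
def accuracy(predictions, references):
    # fraction of predictions equal to their reference labels
    correct = sum(int(p == r) for p, r in zip(predictions, references))
    return correct / len(references)

print(accuracy([1, 0, 2, 2], [1, 0, 1, 2]))  # 0.75
```

Collecting predictions batch by batch and computing the metric at the end, as the real code does, gives the same result as one global pass.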


Fig. 16.3 Real examples of transfer learning. (a) Results w./o pre-training. (b) Results w./ pre-training

16.4 Pre-training and Fine-tuning

To directly compare the performance with and without pre-trained weights, we run the code with and without pre-trained BERT. The results are shown in Fig. 16.3. We see that the accuracy on the test set is 78.11% with pre-training, while it is only 39.27% without pre-training. This clearly demonstrates the effectiveness of pre-training in this task. Similarly, we can perform more experiments to fine-tune different models on different tasks. In most cases, fine-tuning a pre-trained model improves the performance on a downstream task.

References

Barbieri, F., Camacho-Collados, J., Anke, L. E., and Neves, L. (2020). TweetEval: Unified benchmark and comparative evaluation for tweet classification. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1644–1650.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

Chapter 17

Transfer Learning for Speech Recognition

Speech recognition is also an important research area for transfer learning. It has two main transfer scenarios: cross-domain ASR and cross-lingual ASR. In this chapter, we introduce how to implement these two applications using PyTorch and EspNet.1 Note that for easy installation of EspNet, we provide a docker environment for users: jindongwang/espnet:all11. For easy usage of EspNet, we also provide a wrapper for it: https://github.com/jindongwang/EasyEspnet.

17.1 Cross-Domain Speech Recognition

Cross-domain speech recognition has many practical application scenarios in the real world: recording devices, speaking environments, speaker ages, voicing styles, accents, etc. can all be different. It is always expensive and time-consuming to collect speech data from all possible domains. Thus, for robust speech recognition, we need to build a model that can adapt to a new domain. In this section, we implement the work CMatch (Hou et al., 2021) for cross-domain speech recognition. Specifically, we implement not only MMD- and adversarial-based ASR but also character-level distribution matching (i.e., the CMatch algorithm). For brevity, we only present some core code here; the complete code can be found at: https://github.com/jindongwang/transferlearning/blob/master/code/ASR/CMatch. Generally speaking, the loss for ASR can be represented as:

L_ASR = (1 − λ) L_ATT + λ L_CTC,    (17.1)

1 https://espnet.github.io/espnet/.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 J. Wang, Y. Chen, Introduction to Transfer Learning, Machine Learning: Foundations, Methodologies, and Applications, https://doi.org/10.1007/978-981-19-7584-4_17


where L_ATT and L_CTC denote the attention loss and CTC (connectionist temporal classification) loss, respectively. These two losses can be easily implemented with EspNet's Transformer code, so we omit them. When incorporating transfer losses, the total loss can be formulated as:

L_Total = L_ASR + γ L_Transfer,    (17.2)

where L_Transfer is the transfer loss. The transfer loss is computed on the features of the last encoder layer of the Transformer; thus, we adapt these features to perform domain adaptation.

17.1.1 MMD and CORAL for ASR

We omit the implementations of the MMD and CORAL losses (see Sect. 9.6 for their implementations). What we care about most is how to embed them into the Transformer network in EspNet. The important part is the "forward" function, which should take data not from a single domain but from both the source and target domains. For instance, given the outputs "src_hs_pad" and "tgt_hs_pad" denoting the source and target features, we implement the adversarial loss as:

Adversarial loss
def adversarial_loss(self, src_hs_pad, tgt_hs_pad, alpha=1.0):
    loss_fn = torch.nn.BCELoss()
    src_hs_pad = ReverseLayerF.apply(src_hs_pad, alpha)
    tgt_hs_pad = ReverseLayerF.apply(tgt_hs_pad, alpha)
    src_domain = self.domain_classifier(src_hs_pad).view(-1, 1)  # B, T, 1
    tgt_domain = self.domain_classifier(tgt_hs_pad).view(-1, 1)  # B, T, 1
    device = src_hs_pad.device
    src_label = torch.ones(len(src_domain)).long().to(device)
    tgt_label = torch.zeros(len(tgt_domain)).long().to(device)
    domain_pred = torch.cat([src_domain, tgt_domain], dim=0)
    domain_label = torch.cat([src_label, tgt_label], dim=0)
    uda_loss = loss_fn(domain_pred, domain_label[:, None].float())  # B, 1
    return uda_loss

We can also implement CORAL and MMD losses in the same manner.

17.1.2 CMatch Algorithm

Computing P(Y|X) requires obtaining the labels for each input, which in ASR means frame-level labels, since the inputs are frames. The feature of an input x extracted by the Transformer encoder f is f(x) ∈ R^{N×D}, where N denotes the number


Fig. 17.1 Three strategies for frame-level label assignment. The symbol "−" represents the blank token. Dotted labels are filtered out based on their predicted confidence scores. (a) CTC forced alignment. (b) Dynamic frame average. (c) Pseudo CTC prediction

of frames and D denotes the feature dimension. The Transformer decoder outputs M labels using CTC, which are not mapped one-to-one to the N frames, i.e., N ≠ M. Thus, it is challenging to acquire frame-level labels y for computing the conditional distribution P(Y|X). Note that this challenge exists for both the source and target domains. This is also significantly different from image classification, where we can easily get the labels for each sample (Zhu et al., 2020). It is effective to use CTC forced alignment (Sak et al., 2015; Kürzinger et al., 2020) to take the labels from the most probable path selected by CTC as the frame-level assignment (Fig. 17.1a). However, this process is computationally expensive. It is also feasible to use a dynamic window average strategy (Fig. 17.1b) that assigns labels to each frame by averaging, which only works under the strict condition that the character output is uniformly distributed. In our work, to obtain the frame-level labels, we propose to use CTC pseudo labels (Fig. 17.1c), which guarantee both efficiency and correctness. We assume that an ideal CTC model should predict the label assignment with high accuracy and propose to directly utilize the CTC predictions, since the CTC module naturally predicts labels frame by frame, including the blank symbol as the null label. More formally, the pseudo label for the n-th frame X_n can be obtained by:

Ŷ_n = arg max_{Y_n} P_CTC(Y_n | X_n),    1 ≤ n ≤ N.    (17.3)

We further filter out the CTC predictions with a threshold on their softmax scores to improve the accuracy: only labels with a softmax score over 0.9 are used. The code for frame-level label assignment is as follows.

Frame-level label assignment
def get_enc_repr(self, src_hs_pad, src_hlens, tgt_hs_pad, tgt_hlens,
                 src_ys_pad, tgt_ys_pad, method,
                 src_ctc_softmax=None, tgt_ctc_softmax=None):
    src_ys = [y[y != self.ignore_id] for y in src_ys_pad]
    tgt_ys = [y[y != self.ignore_id] for y in tgt_ys_pad]
    if method == "frame_average":
        def frame_average(hidden_states, num):
            # hs_i, B T F
            hidden_states = hidden_states.permute(0, 2, 1)
            downsampled_states = torch.nn.functional.adaptive_avg_pool1d(
                hidden_states, num)
            downsampled_states = downsampled_states.permute(0, 2, 1)
            assert downsampled_states.shape[1] == num, \
                f"{downsampled_states.shape[1]}, {num}"
            return downsampled_states
        src_hs_downsampled = frame_average(src_hs_pad, num=src_ys_pad.size(1))
        tgt_hs_downsampled = frame_average(tgt_hs_pad, num=tgt_ys_pad.size(1))
        src_hs_flatten = src_hs_downsampled.contiguous().view(-1, self.adim)
        tgt_hs_flatten = tgt_hs_downsampled.contiguous().view(-1, self.adim)
        src_ys_flatten = src_ys_pad.contiguous().view(-1)
        tgt_ys_flatten = tgt_ys_pad.contiguous().view(-1)
    elif method == "ctc_align":
        src_ys = [y[y != -1] for y in src_ys_pad]
        src_logits = self.ctc.ctc_lo(src_hs_pad)
        src_align_pad = self.ctc_aligner(src_logits, src_hlens, src_ys)
        src_ys_flatten = torch.cat([src_align_pad[i, :src_hlens[i]].view(-1)
                                    for i in range(len(src_align_pad))])
        src_hs_flatten = torch.cat(
            [src_hs_pad[i, :src_hlens[i], :].view(-1, self.adim)
             for i in range(len(src_hs_pad))])  # hs_pad: B, T, F
        tgt_ys = [y[y != -1] for y in tgt_ys_pad]
        tgt_logits = self.ctc.ctc_lo(tgt_hs_pad)
        tgt_align_pad = self.ctc_aligner(tgt_logits, tgt_hlens, tgt_ys)
        tgt_ys_flatten = torch.cat([tgt_align_pad[i, :tgt_hlens[i]].view(-1)
                                    for i in range(len(tgt_align_pad))])
        tgt_hs_flatten = torch.cat(
            [tgt_hs_pad[i, :tgt_hlens[i], :].view(-1, self.adim)
             for i in range(len(tgt_hs_pad))])  # hs_pad: B, T, F
    elif method == "pseudo_ctc_pred":
        assert src_ctc_softmax is not None
        src_hs_flatten = torch.cat(
            [src_hs_pad[i, :src_hlens[i], :].view(-1, self.adim)
             for i in range(len(src_hs_pad))])  # hs_pad: B * T, F
        src_hs_flatten_size = src_hs_flatten.shape[0]
        src_confidence, src_ctc_ys = torch.max(src_ctc_softmax, dim=1)
        src_confidence_mask = (src_confidence > self.pseudo_ctc_confidence_thr)
        src_ys_flatten = src_ctc_ys[src_confidence_mask]
        src_hs_flatten = src_hs_flatten[src_confidence_mask]
        assert tgt_ctc_softmax is not None
        tgt_hs_flatten = torch.cat(
            [tgt_hs_pad[i, :tgt_hlens[i], :].view(-1, self.adim)
             for i in range(len(tgt_hs_pad))])  # hs_pad: B * T, F
        tgt_hs_flatten_size = tgt_hs_flatten.shape[0]
        tgt_confidence, tgt_ctc_ys = torch.max(tgt_ctc_softmax, dim=1)
        tgt_confidence_mask = (tgt_confidence > self.pseudo_ctc_confidence_thr)
        tgt_ys_flatten = tgt_ctc_ys[tgt_confidence_mask]
        tgt_hs_flatten = tgt_hs_flatten[tgt_confidence_mask]
        # logging.warning(f"Source pseudo CTC ratio: "
        #                 f"{src_hs_flatten.shape[0] / src_hs_flatten_size:.2f}; "
        #                 f"Target pseudo CTC ratio: "
        #                 f"{tgt_hs_flatten.shape[0] / tgt_hs_flatten_size:.2f}")
    return src_hs_flatten, src_ys_flatten, tgt_hs_flatten, tgt_ys_flatten

The character-level distribution matching is computed as:

L_cmatch = (1 / |C|) Σ_{c ∈ C} MMD(H_k, X_S^c, X_T^c),    (17.4)

where X_S^c and X_T^c denote the encoder features of the source and target samples of class c, and C denotes the character set. Note that we use CTC pseudo labels for both the source and target domains, instead of using ground-truth labels. In the actual implementation, we feed the features extracted by the Transformer encoder into MMD instead of the raw inputs X. It is implemented as:

CMatch loss function
def cmatch_loss_func(self, n_classes, src_features, src_labels,
                     tgt_features, tgt_labels):
    assert src_features.shape[0] == src_labels.shape[0]
    assert tgt_features.shape[0] == tgt_labels.shape[0]
    classes = torch.arange(n_classes)
    def src_token_idxs(c):
        return src_labels.eq(c).nonzero().squeeze(1)
    src_token_idxs = list(map(src_token_idxs, classes))
    def tgt_token_idxs(c):
        return tgt_labels.eq(c).nonzero().squeeze(1)
    tgt_token_idxs = list(map(tgt_token_idxs, classes))
    assert len(src_token_idxs) == n_classes
    assert len(tgt_token_idxs) == n_classes
    loss = torch.tensor(0.0).cuda()
    count = 0
    for c in classes:
        if c in self.non_char_symbols or src_token_idxs[c].shape[0] < 5 \
                or tgt_token_idxs[c].shape[0] < 5:
            continue
        loss = loss + adapt_loss(src_features[src_token_idxs[c]],
                                 tgt_features[tgt_token_idxs[c]],
                                 adapt_loss='mmd_linear')
        count = count + 1
    loss = loss / count if count > 0 else loss
    return loss
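The mmd_linear variant used above reduces, in its simplest form, to the squared distance between the source and target feature means of each character, which CMatch then averages over characters. A toy plain-Python sketch (the 2-D features and the two "characters" are made-up illustration data):

```python
def linear_mmd(xs, xt):
    # squared Euclidean distance between the two sample means
    d = len(xs[0])
    mean_s = [sum(x[j] for x in xs) / len(xs) for j in range(d)]
    mean_t = [sum(x[j] for x in xt) / len(xt) for j in range(d)]
    return sum((ms - mt) ** 2 for ms, mt in zip(mean_s, mean_t))

# group frame features by (pseudo) character label, then average per-class MMDs
src = {'a': [[0.0, 0.0], [2.0, 2.0]], 'b': [[5.0, 5.0]]}
tgt = {'a': [[1.0, 1.0], [1.0, 1.0]], 'b': [[6.0, 6.0]]}
loss = sum(linear_mmd(src[c], tgt[c]) for c in src) / len(src)
print(loss)  # 1.0: class 'a' already matches (0), class 'b' differs (2)
```

Matching per character rather than globally is what makes the alignment conditional on the label, mirroring Eq. (17.4).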

17.1.3 Experiments and Results

We employ the Libri-Adapt dataset (Mathur et al., 2020) for the experiments. The Libri-Adapt dataset is designed for unsupervised domain adaptation tasks and is built on top of the Librispeech-clean-100 corpus, recorded using 6 microphones under 4 synthesized background noise conditions (Clean, Rain, Wind, Laughter) in 3 different accents (en-us, en-gb, and en-in). Since the cross-accent data has not been fully released by the authors, in this work we use the US accent (en-us) as the main accent language and keep it unchanged to isolate the influence brought by accents. We use cross-device ASR as the demonstration. After running, the results of MMD, Adversarial, and CMatch are shown in Table 17.1 (we omit the training process since the training time is extremely long). We see that all transfer learning methods outperform the baseline, indicating their effectiveness. Among them, CMatch achieves the best performance. Furthermore, we can add feature visualizations to see what CMatch really does. As shown in Fig. 17.2, some obvious bad cases in (a): {a, b, c, g, h, l, r, v, w, y} are aligned well in (b). This demonstrates the effectiveness of CMatch in aligning different characters.

Table 17.1 WER on cross-device ASR in a clean environment

Task      Source-only   MMD     ADV     CMatch
M→P       23.87         20.87   21.11   20.38
M→R       25.21         22.21   22.27   21.77
P→M       31.15         27.22   28.29   26.17
P→R       23.99         21.90   21.74   20.43
R→M       32.45         28.27   29.95   27.77
R→P       23.48         21.09   21.23   20.58
Average   26.69         23.59   24.10   22.85


Fig. 17.2 Feature centers of all characters in two domains are closer after CMatch. Red and blue dots denote the source and target domains, respectively. (a) Before CMatch. (b) After CMatch

17.2 Cross-Lingual Speech Recognition

Cross-lingual application is another challenging scenario for speech recognition. Compared with cross-domain applications, cross-lingual speech recognition poses more difficulties. For example, different languages often come with different voicing styles, speakers, etc., as well as totally different vocabularies. Moreover, there are about 7000 languages in the world, and due to the varying numbers of speakers, many languages face a data scarcity issue. Such languages are called low-resource languages. To handle such low-resource languages and mitigate the data scarcity issue, transfer learning plays a critical role. A standard protocol is to pre-train a multilingual model and then fine-tune it on the target languages. In this section, we introduce adapter-based transfer learning approaches, an alternative way to achieve this while retaining high parameter efficiency. We only show the critical part of the code since the speech recognition code is very long. The complete code of this section can be found at https://github.com/jindongwang/transferlearning/blob/master/code/ASR/Adapter.

17.2.1 Adapter

An adapter is a simple add-on module to the encoder and decoder layers in Transformer, mainly composed of layer normalization and fully connected layers. During fine-tuning, we can freeze the backbone of the pre-trained model and train only the adapters, which have a small number of task-specific parameters. As shown in Fig. 17.3, a commonly used adapter structure includes layer normalization, a down-projection layer, a nonlinear activation function, and an up-projection layer. There is


Fig. 17.3 Architecture of the adapter module


also a residual connection that allows the adapter to keep the original representation unchanged. Thus, the adapter function is formulated as:

a_l = Adapter(z_l) = z_l + W_l^u ReLU( W_l^d LN(z_l) ),    (17.5)

where z_l represents the output of layer l, LN denotes layer normalization, and W^u, W^d are the weight parameters for the up projection and down projection. The code of the adapter is presented below.

Adapter
class Adapter(torch.nn.Module):
    def __init__(self, adapter_dim, embed_dim):
        super().__init__()
        self.layer_norm = LayerNorm(embed_dim)
        self.down_project = torch.nn.Linear(embed_dim, adapter_dim, False)
        self.up_project = torch.nn.Linear(adapter_dim, embed_dim, False)

    def forward(self, z):
        normalized_z = self.layer_norm(z)
        h = torch.nn.functional.relu(self.down_project(normalized_z))
        return self.up_project(h) + z
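The parameter efficiency of this structure is easy to quantify: with embedding dimension D and bottleneck dimension d, one adapter adds 2D parameters for LayerNorm (scale and bias) plus 2·D·d for the two bias-free projections. The sizes below are hypothetical, chosen only to show the order of magnitude:

```python
def adapter_params(embed_dim, adapter_dim):
    # LayerNorm (gamma + beta) + down projection + up projection (no biases)
    return 2 * embed_dim + 2 * embed_dim * adapter_dim

print(adapter_params(256, 64))   # 33280 parameters per adapter
# compare with a single 256 x 1024 feed-forward weight matrix in the backbone:
print(256 * 1024)                # 262144
```

Since only the adapters (and the language head) are trained, the trainable parameter count stays a small fraction of the full model.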

17.2.2 Cross-Lingual Adaptation with Adapters

The training process of adapters is presented in Algorithm 1. It is worth noting that, unlike other applications of adapters, cross-lingual speech recognition requires training a language-specific head for the


unseen target language. However, training the adapters together with the language heads may result in insufficient learning of semantic information in the adapters. Therefore, a two-stage training strategy is needed: we first train the language-specific heads, after which we train the language-specific adapters.

Algorithm 1 Learning algorithm of adapter-based cross-lingual adaptation
Input: Pre-trained model M, target language L
1: Freeze the backbone of M.
2: Stage 1:
3: Randomly initialize the target language head HL.
4: Optimize HL with the speech recognition loss.
5: Stage 2:
6: Initialize language-specific adapters AL and inject them into the backbone layers.
7: while not converge do
8:   Optimize AL with the speech recognition loss.
9: end while
10: return Original model M, target language head HL, and adapters AL.

17.2.3 Advanced Algorithms: MetaAdapter and SimAdapter

We further introduce two adapter-based algorithms that leverage the knowledge of multiple low-resource languages implicitly or explicitly and show better cross-lingual adaptation performance: the MetaAdapter and the SimAdapter. The MetaAdapter is inspired by the idea of Model-Agnostic Meta-Learning (MAML), which aims to extract general latent knowledge from existing source training tasks and then adapt that knowledge to the target task. Specifically, we optimize the meta-optimization objective through gradient descent as:

θ_a = θ_a − μ Σ_{i=1}^{N} ∇_{θ_a} L_{S_i^val}( f_{ θ_a − ε ∇_{θ_a} L_{S_i^tr}(f_{θ_a}) } ),    (17.6)

where μ is the meta step size, ε is the fast adaptation learning rate, and L denotes the speech recognition loss. As opposed to the MetaAdapter, the SimAdapter uses the explicit relationship between different languages (i.e., their linear similarity). SimAdapter uses an attention mechanism to compute the similarity:

SimAdapter(z, a_{S_1, S_2, ..., S_N}) = Σ_{i=1}^{N} Attn(z, a_{S_i}) · (a_{S_i} W_V),    (17.7)


Algorithm 2 Learning algorithm of the MetaAdapter
Input: Pre-trained model M, source languages {S_1, ..., S_N}, target language L_T.
1: Train language-specific heads H_i on the source languages.
2: Initialize the MetaAdapter A_L.
3: while meta-learning not done do
4:   Optimize A_L using Eq. (17.6).
5: end while
6: Train the target head H_L on the target language L_T.
7: Fine-tune the MetaAdapter A_L using the speech recognition loss.
8: return Original model M, target language head H_L, and adapters A_L.

where SimAdapter(·) and Attn(·) denote the SimAdapter and attention operations, respectively. Specifically, the attention operation is computed as:

Attn(z, a) = Softmax( (z W_Q)(a W_K)^T / τ ),    (17.8)

where τ is the temperature coefficient and W_Q, W_K, W_V are attention matrices. Note that while W_Q, W_K are initialized randomly, W_V is initialized with a diagonal of ones and the rest of the matrix with small weights (1e−6) to retain the adapter representations.

Algorithm 3 Learning algorithm of the SimAdapter
Input: Pre-trained model M, source languages {S_1, ..., S_N}, target language L_T.
1: Train adapters A_S.
2: Initialize the SimAdapter layers.
3: while not done do
4:   Optimize the SimAdapter layers using the attention equations.
5: end while
6: return Original model M, target language head H_L, and adapters {A_S}.
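Eqs. (17.7) and (17.8) can be sketched in plain Python with two simplifications that are ours, for illustration only: W_V is fixed to the identity, and the query/key projections are folded into precomputed similarity scores:

```python
import math

def softmax(xs):
    # numerically stable softmax
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sim_adapter(scores, adapter_outputs, tau=1.0):
    # attention-weighted combination of per-source-language adapter outputs
    weights = softmax([s / tau for s in scores])
    dim = len(adapter_outputs[0])
    return [sum(w * a[j] for w, a in zip(weights, adapter_outputs))
            for j in range(dim)]

# two source languages; a higher score gives that adapter's output more weight
out = sim_adapter([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print([round(v, 3) for v in out])  # [0.881, 0.119]
```

Raising τ flattens the attention weights toward a uniform mixture, which is why it is called a temperature coefficient.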

17.2.4 Results and Discussion

We adopt the Common Voice 5.1 (Ardila et al., 2020) corpus for our experiments. The results in Table 17.2 show that all adapter-based methods improve the performance, and their combination (SimAdapter+) achieves the best performance.

Table 17.2 Word error rates (WER) on the cross-lingual ASR tasks

Target          DNN/HMM  Trans.(B)  Trans.(S)  Head   Full-FT  Full-FT+L2  Part-FT  Adapter  SimAdapter  MetaAdapter  SimAdapter+
Romanian (ro)   70.14    97.25      94.72      63.98  53.90    52.74       52.92    48.34    47.37       44.59        47.29
Czech (cs)      63.15    48.87      51.68      75.12  34.75    35.80       54.66    37.93    35.86       37.13        34.72
Breton (br)     –        97.88      92.05      82.80  61.71    61.75       66.24    58.77    58.19       58.47        59.14
Arabic (ar)     69.31    75.32      74.88      81.70  47.63    50.09       58.49    47.31    47.23       46.82        46.39
Ukrainian (uk)  77.76    64.09      67.89      82.71  45.62    46.45       66.12    50.84    48.73       49.36        47.41
AVG             –        76.68      76.24      77.26  48.72    49.37       59.69    48.64    47.48       47.27        46.99
Weighted AVG    –        72.28      72.50      77.54  46.72    47.50       59.43    47.38    46.08       46.12        45.45

292

17 Transfer Learning for Speech Recognition

References

Ardila, R., Branson, M., Davis, K., Kohler, M., Meyer, J., Henretty, M., Morais, R., Saunders, L., Tyers, F., and Weber, G. (2020). Common voice: A massively-multilingual speech corpus. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4218–4222.
Hou, W., Wang, J., Tan, X., Qin, T., and Shinozaki, T. (2021). Cross-domain speech recognition with unsupervised character-level distribution matching. In Interspeech.
Kürzinger, L., Winkelbauer, D., Li, L., Watzel, T., and Rigoll, G. (2020). CTC-segmentation of large corpora for German end-to-end speech recognition. In International Conference on Speech and Computer, pages 267–278. Springer.
Mathur, A., Kawsar, F., Berthouze, N., and Lane, N. D. (2020). Libri-adapt: A new speech dataset for unsupervised domain adaptation. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7439–7443. IEEE.
Sak, H., Senior, A., Rao, K., Irsoy, O., Graves, A., Beaufays, F., and Schalkwyk, J. (2015). Learning acoustic frame labeling for speech recognition with recurrent neural networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4280–4284. IEEE.
Zhu, Y., Zhuang, F., Wang, J., Ke, G., Chen, J., Bian, J., Xiong, H., and He, Q. (2020). Deep subdomain adaptation network for image classification. IEEE Transactions on Neural Networks and Learning Systems.

Chapter 18

Transfer Learning for Activity Recognition

Sensor-based human activity recognition (HAR) (Wang et al., 2019) plays an important role in daily life. HAR makes it possible to recognize people's daily activities and thus monitor their health status in a pervasive manner. In this chapter, we show how to use transfer learning for cross-domain human activity recognition on a public dataset. The complete code of this chapter can be found at: https://github.com/jindongwang/tlbook-code/tree/main/chap18_app_activity.

18.1 Task and Dataset

Activity recognition can be cast as a classification problem. To construct a cross-domain HAR problem, we consider cross-position activity recognition following Wang et al. (2018b). Cross-position HAR refers to the situation where sensors placed at different body positions generate different signals while a person performs the same activities. This setting matters in real applications: different body positions exhibit different activity patterns, which may help to diagnose certain diseases, and we can never build one model that recognizes activities from all body positions. It therefore becomes important to conduct transfer learning, so that knowledge learned from one position can help annotate the activities of another position.

In this chapter, we use the UCI Daily and Sports Activity dataset (DSADS) (Barshan and Yüksek, 2014) for our demonstration. It consists of 19 activities collected from 8 subjects wearing body-worn sensors on 5 body parts. The sensors are accelerometer, gyroscope, and magnetometer. The body positions are right arm (RA), left arm (LA), torso (T), right leg (RL), and left leg (LL). The 19 activities correspond to 19 different classes.

Cross-position activity recognition uses the data from position A as the source domain and the data from position B as the target domain. More importantly,

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
J. Wang, Y. Chen, Introduction to Transfer Learning, Machine Learning: Foundations, Methodologies, and Applications, https://doi.org/10.1007/978-981-19-7584-4_18


Table 18.1 Features extracted per sensor on each body part

ID      Feature                   Description
1       Mean                      Average value of samples in window
2       STD                       Standard deviation
3       Minimum                   Minimum
4       Maximum                   Maximum
5       Mode                      The value with the largest frequency
6       Range                     Maximum minus minimum
7       Mean crossing rate        Rate of times signal crosses the mean value
8       DC                        Direct component
9–13    Spectrum peak position    First 5 peaks after FFT
14–18   Frequency                 Frequencies corresponding to the 5 peaks
19      Energy                    Square of the norm
20–23   Four shape features       Mean, STD, skewness, kurtosis
24–27   Four amplitude features   Mean, STD, skewness, kurtosis

we consider a more practical application scenario: given data from a specific body position as the target domain, we need to select the most similar source position to perform transfer learning. Thus, our task is composed of two processes: source selection and transfer learning.

18.2 Feature Extraction

Before diving into the details of transfer learning, we need to perform feature extraction on the raw activity data. In our experiments, we use the data from all three sensors on each body part, since most information can be retained in this way. For one sensor, we combine the data from its 3 axes using a = √(x² + y² + z²). Then, we exploit the sliding window technique to extract features (window length is 5 s). The feature extraction procedure mainly follows existing work (Wang et al., 2018a). In total, 27 features from both the time and frequency domains are extracted for a single sensor. Since there are three sensors (i.e., accelerometer, gyroscope, and magnetometer) on one body part, we extract 81 features per position. The 27 features are shown in Table 18.1.

Since feature extraction is rather time-consuming and is not our focus, we do not show the code here; it can be found at this GitHub repo.¹ We also provide the extracted features of the dataset at this link.² Readers can download them for later use in this section.

1 https://github.com/jindongwang/activityrecognition/tree/master/code/python/feature_extraction.
2 https://www.kaggle.com/jindongwang92/crossposition-activity-recognition.
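The magnitude-and-window step described above can be sketched as follows. The sampling rate fs and the small feature subset are illustrative assumptions; the actual pipeline extracts all 27 features of Table 18.1:

```python
import numpy as np

def sliding_window_features(x, y, z, fs=25, win_sec=5):
    # fs (sampling rate) and this 5-feature subset are illustrative only
    a = np.sqrt(x ** 2 + y ** 2 + z ** 2)          # combine the 3 axes
    win = fs * win_sec                             # samples per 5-s window
    feats = []
    for start in range(0, len(a) - win + 1, win):  # non-overlapping windows
        seg = a[start:start + win]
        feats.append([seg.mean(), seg.std(), seg.min(), seg.max(),
                      seg.max() - seg.min()])      # mean, STD, min, max, range
    return np.array(feats)

t = np.linspace(0, 10, 250)                        # 10 s of toy 25-Hz data
feats = sliding_window_features(np.sin(t), np.cos(t), np.sin(2 * t))
print(feats.shape)                                 # (2, 5): 2 windows, 5 features
```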


18.3 Source Selection

First, we perform source selection. We implement it based on two metrics: the A-distance and the cosine distance, which correspond to kinetic and semantic metrics, respectively, as described in Wang et al. (2018b). To compute the A-distance, we construct a linear classifier that predicts whether a sample comes from the source or the target domain. The following code takes the features from the two domains and outputs their distance.

A-distance

import numpy as np
from sklearn import svm

def proxy_a_distance(source_X, target_X, verbose=False):
    """Compute the Proxy-A-Distance of a source/target representation."""
    nb_source = np.shape(source_X)[0]
    nb_target = np.shape(target_X)[0]
    if verbose:
        print('PAD on', (nb_source, nb_target), 'examples')

    C_list = np.logspace(-5, 4, 10)
    half_source, half_target = int(nb_source / 2), int(nb_target / 2)
    train_X = np.vstack(
        (source_X[0:half_source, :], target_X[0:half_target, :]))
    train_Y = np.hstack((np.zeros(half_source, dtype=int),
                         np.ones(half_target, dtype=int)))
    test_X = np.vstack(
        (source_X[half_source:, :], target_X[half_target:, :]))
    test_Y = np.hstack((np.zeros(nb_source - half_source, dtype=int),
                        np.ones(nb_target - half_target, dtype=int)))

    best_risk = 1.0
    for C in C_list:
        clf = svm.SVC(C=C, kernel='linear', verbose=False)
        clf.fit(train_X, train_Y)
        train_risk = np.mean(clf.predict(train_X) != train_Y)
        test_risk = np.mean(clf.predict(test_X) != test_Y)
        if verbose:
            print('[ PAD C = %f ] train risk: %f  test risk: %f'
                  % (C, train_risk, test_risk))
        if test_risk > .5:
            test_risk = 1. - test_risk
        best_risk = min(best_risk, test_risk)

    return 2 * (1. - 2 * best_risk)
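To see the proxy A-distance behave as expected, one can run a compact variant on synthetic domains. This self-contained pad helper is our simplification (a single C value, training risk only), and the Gaussian data are made up:

```python
import numpy as np
from sklearn import svm

def pad(a, b):
    # simplified proxy A-distance: one linear SVM, training risk only
    X = np.vstack((a, b))
    y = np.hstack((np.zeros(len(a)), np.ones(len(b))))
    clf = svm.SVC(C=1.0, kernel='linear').fit(X, y)
    eps = np.mean(clf.predict(X) != y)       # domain-classification error
    return 2 * (1 - 2 * eps)

rng = np.random.default_rng(0)
same = pad(rng.normal(0, 1, (100, 5)), rng.normal(0, 1, (100, 5)))
far = pad(rng.normal(0, 1, (100, 5)), rng.normal(5, 1, (100, 5)))
print(same < far)   # True: indistinguishable domains score lower
```

Intuitively, domains that a linear classifier cannot separate (error near 0.5) get a PAD near 0, while perfectly separable domains get a PAD near 2.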

To compute the cosine similarity, we use the library from sklearn:

Cosine similarity

from sklearn.metrics import pairwise

def cosine_sim(source_X, target_X, w_source=1, w_target=1):
    return pairwise.cosine_similarity(source_X * w_source,
                                      target_X * w_target).mean()

We write a function that performs source selection by taking the semantic similarity as a weight on the distances:

Source selection function

def source_selection(target_pos='RA'):
    # weights given by human, for the semantic similarity
    weights = [.2, .5, .15, .15] if target_pos == 'RA' else [.5, .2, .15, .15]
    d_tar = get_data_by_position('dsads', target_pos)
    x_tar = d_tar[0]
    t = body_parts['dsads'].copy()
    t.remove(target_pos)
    d_src_list = [get_data_by_position('dsads', item) for item in t]
    print('Source candidates:', [item for item in t])
    a_dist = [calc_dist.proxy_a_distance(x_tar, item[0])
              for item in d_src_list]
    cos_dist = [1 - (calc_dist.cosine_sim(x_tar, d_src_list[i][0]) * weights[i])
                for i in range(len(d_src_list))]
    total_dist = np.array(a_dist) + np.array(cos_dist)
    print('Distance to target:', total_dist)
    return total_dist, d_src_list, d_tar, t

After writing some code for data loading and the main procedure, we obtain the results of source selection, as shown in Fig. 18.1. It can be seen that the most similar body position to the right arm is the left arm, which fits our assumption that the two arms are the most similar. The torso is also similar to the arms, while the legs are the most dissimilar.
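The selection rule itself is simply an argmin over the combined distances. A toy illustration with made-up numbers (not the measured values behind Fig. 18.1):

```python
import numpy as np

# candidate sources for target RA and their combined distances (made up)
candidates = ['LA', 'T', 'RL', 'LL']
total_dist = np.array([0.8, 1.1, 1.6, 1.7])
best = candidates[int(np.argmin(total_dist))]
print('Selected source position:', best)   # LA (left arm)
```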

Fig. 18.1 Results for source selection


18.4 Activity Recognition Using TCA

After selecting the best source, we can perform transfer learning. As a baseline, we use the traditional KNN classifier and compare it with TCA. The code for TCA can be found in Sect. 5.4, so we omit it here. For KNN, we define a method to perform classification:

KNN classifier

def knn(self, verbose=False):
    from sklearn.neighbors import KNeighborsClassifier
    best_acc, best_k = 0, 0
    for k in [1, 3, 5]:
        clf = KNeighborsClassifier(n_neighbors=k)
        clf.fit(self.x_src, self.y_src)
        ypred = clf.predict(self.x_tar)
        acc = accuracy_score(ypred, self.y_tar)
        if verbose:
            print(f'K: {k}, acc: {acc}')
        if acc > best_acc:
            best_acc = acc
            best_k = k
    print(f'Best acc: {best_acc}, K: {best_k}')

Figure 18.2 shows the results of KNN and TCA. We see that KNN gives an accuracy of 66.07%, while TCA achieves 70.49%. This demonstrates the effectiveness of TCA for transfer learning compared to traditional learning algorithms.

18.5 Activity Recognition Using Deep Transfer Learning

Beyond TCA, we also apply a deep transfer learning method to cross-position activity recognition. Note that for comparison, we still use the extracted features as inputs to the network. Normally, we would not perform manual feature extraction but would feed the raw data directly to the network; a version of the code using raw data can be found in the later sections on personalized federated learning.

Fig. 18.2 Results of KNN and TCA for activity recognition


First of all, we need to load the data using PyTorch. Note that the standard PyTorch dataloaders are built for image data and thus are not directly applicable to activity data. We need to inherit from the base Dataset class and write our own dataset class; then we can wrap it with a dataloader.

Load data using PyTorch

class DSADS27(torch.utils.data.Dataset):
    def __init__(self, data):
        self.samples = data[:, :405]
        self.labels = data[:, -2]

    def __getitem__(self, index):
        sample, target = self.samples[index], self.labels[index]
        # from sklearn.preprocessing import StandardScaler
        # sample = StandardScaler().fit_transform(sample)
        return sample, target

    def __len__(self):
        return len(self.samples)


def load_27data(batch_size=100):
    root_path = '/D_data/jindwang/Dataset_PerCom18_STL'
    data = io.loadmat(os.path.join(root_path, 'dsads'))['data_dsads']
    from sklearn.model_selection import train_test_split
    data_train, data_test = train_test_split(data, test_size=.1)
    data_train, data_val = train_test_split(data_train, test_size=.2)
    train_set, test_set, val_set = DSADS27(data_train), DSADS27(data_test), \
        DSADS27(data_val)
    train_loader = torch.utils.data.DataLoader(
        train_set, batch_size=batch_size, shuffle=True, drop_last=True)
    test_loader = torch.utils.data.DataLoader(
        test_set, batch_size=batch_size * 2, shuffle=False, drop_last=False)
    val_loader = torch.utils.data.DataLoader(
        val_set, batch_size=batch_size, shuffle=False, drop_last=False)
    return train_loader, val_loader, test_loader
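The custom-Dataset pattern above can be checked end to end with synthetic data standing in for the DSADS .mat file; the class name and numbers below are illustrative only:

```python
import numpy as np
import torch

class ToyActivityDataset(torch.utils.data.Dataset):
    # same pattern as above: first 405 columns are features
    # (81 features x 5 positions), second-to-last column is the label
    def __init__(self, data):
        self.samples = data[:, :405]
        self.labels = data[:, -2]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx], self.labels[idx]

rng = np.random.default_rng(0)
fake = np.hstack([rng.normal(size=(50, 405)),                 # feature columns
                  rng.integers(1, 20, (50, 1)).astype(float), # label column
                  np.ones((50, 1))])                          # trailing column
loader = torch.utils.data.DataLoader(ToyActivityDataset(fake), batch_size=10)
x, y = next(iter(loader))
print(x.shape, y.shape)   # torch.Size([10, 405]) torch.Size([10])
```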

After that, we define a CNN for activity recognition. As pointed out by Wang et al. (2018b), a CNN is well suited to recognizing repetitive activities.

Network architecture

import torch.nn as nn

class TNNAR(nn.Module):
    def __init__(self, n_class=19):
        super(TNNAR, self).__init__()
        self.n_class = n_class
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels=9, out_channels=16, kernel_size=(1, 1)),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1))
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(in_channels=16, out_channels=32, kernel_size=(1, 1)),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1))
        )
        self.fc1 = nn.Sequential(
            nn.Linear(32 * 2, 100),
            nn.ReLU()
        )
        self.fc2 = nn.Sequential(
            nn.Linear(100, self.n_class)
        )

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = x.reshape(-1, 32 * 2)
        x = self.fc1(x)
        fea = x
        out = self.fc2(x)
        return fea, out

    def predict(self, x):
        _, out = self.forward(x)
        return out
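It is worth sanity-checking the flatten size used by fc1: each 81-dimensional position vector is viewed as a (9, 9, 1) "image", and the two (2, 1) max-pools shrink the height 9 → 4 → 2, leaving 32 × 2 = 64 features. A quick check with an equivalent conv stack:

```python
import torch
import torch.nn as nn

# same conv/pool structure as the network above, built inline
conv = nn.Sequential(
    nn.Conv2d(9, 16, kernel_size=(1, 1)), nn.BatchNorm2d(16), nn.ReLU(),
    nn.MaxPool2d(kernel_size=(2, 1)),
    nn.Conv2d(16, 32, kernel_size=(1, 1)), nn.BatchNorm2d(32), nn.ReLU(),
    nn.MaxPool2d(kernel_size=(2, 1)),
)
x = torch.randn(8, 81).view(-1, 9, 9, 1)   # 81-d vector as a (9, 9, 1) "image"
out = conv(x)
print(out.shape)   # torch.Size([8, 32, 2, 1]) -> flattened to 32 * 2 = 64
```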

Finally, similar to the code of DDC and DCORAL, we can add the MMD or CORAL loss to the network to perform training:

Train and test

def train_da(model, loaders, optimizer, mode='ratola'):
    best_acc = 0
    criterion = nn.CrossEntropyLoss()
    for epoch in range(args.nepochs):
        model.train()
        total_loss = 0
        correct = 0
        for (src, tar) in zip(loaders[0][0], loaders[1][0]):
            xs, ys = src
            xt, yt = tar
            # slice out the features of the chosen source/target positions
            xs = xs[:, 81:162] if mode == 'ratola' else xs[:, 243:324]
            xt = xt[:, 162:243] if mode == 'ratola' else xt[:, 324:405]
            xs, ys = xs.float().cuda(), ys.long().cuda() - 1
            xt, yt = xt.float().cuda(), yt.long().cuda() - 1
            xs, xt = xs.view(-1, 9, 9, 1), xt.view(-1, 9, 9, 1)
            fs, outs = model(xs)
            ft, _ = model(xt)
            loss_cls = criterion(outs, ys)
            mmd = MMD_loss(kernel_type='rbf')(fs, ft) \
                if args.loss == 'mmd' else CORAL_loss(fs, ft)
            loss = loss_cls + args.lamb * mmd
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            _, predicted = torch.max(outs.data, 1)
            correct += (predicted == ys).sum()
        train_acc = float(correct) / len(loaders[0][0].dataset)
        train_loss = total_loss / len(loaders[0][0])
        val_acc = test(model, loaders[1][1])
        test_acc = test(model, loaders[1][2])
        if best_acc < val_acc:
            best_acc = val_acc
        print(f'Epoch: [{epoch:2d}/{args.nepochs}] loss: {train_loss:.4f}, '
              f'train_acc: {train_acc:.4f}, val_acc: {val_acc:.4f}, '
              f'test_acc: {test_acc:.4f}')
    print(f'Best acc: {best_acc}')


def test(model, loader, model_path=None):
    if model_path:
        model.load_state_dict(torch.load(model_path))
    model.eval()
    correct = 0
    with torch.no_grad():
        for data, label in loader:
            data, label = data.float().cuda(), label.long().cuda() - 1
            data = data[:, 162:243] if args.mode == 'ratola' else data[:, 324:405]
            data = data.view(-1, 9, 9, 1)
            pred = model.predict(data)
            _, predicted = torch.max(pred.data, 1)
            correct += (predicted == label).sum()
    acc = float(correct) / len(loader.dataset)
    return acc

We run the code and obtain the following results: deep transfer learning gives an accuracy of 76.10%, which is much better than the previous KNN and TCA results. This again demonstrates the effectiveness of deep transfer learning methods (Fig. 18.3).


Fig. 18.3 Results of deep transfer learning for cross-domain activity recognition

References

Barshan, B. and Yüksek, M. C. (2014). Recognizing daily and sports activities in two open source machine learning environments using body-worn sensor units. The Computer Journal, 57(11):1649–1667.
Wang, J., Chen, Y., Hao, S., Peng, X., and Hu, L. (2019). Deep learning for sensor-based activity recognition: A survey. Pattern Recognition Letters, 119:3–11.
Wang, J., Chen, Y., Hu, L., Peng, X., and Yu, P. S. (2018a). Stratified transfer learning for cross-domain activity recognition. In 2018 IEEE International Conference on Pervasive Computing and Communications (PerCom).
Wang, J., Zheng, V. W., Chen, Y., and Huang, M. (2018b). Deep transfer learning for cross-domain activity recognition. In Proceedings of the 3rd International Conference on Crowd Science and Engineering, pages 1–8.

Chapter 19

Federated Learning for Personalized Healthcare

Federated learning aims at building machine learning models without compromising the data privacy of clients. Since different clients naturally have different data distributions (i.e., the non-i.i.d. issue), it is intuitive to embed transfer learning technology into the federated learning system. In this chapter, we first introduce a healthcare task and show how to prepare data for federated learning. Then, we implement the most famous federated learning algorithm, FedAvg (McMahan et al., 2017), on a medical healthcare problem. After that, we implement the AdaFed algorithm proposed in Chen et al. (2021) to cope with the non-i.i.d. issue and enhance personalization. We only show some of the key code; the complete code can be found at this link: https://github.com/jindongwang/tlbook-code/tree/main/chap19_fl.
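Before diving into the code, recall FedAvg's aggregation step: the server replaces the global parameters with a sample-count-weighted average of the client parameters. A minimal sketch, where the helper name and list-of-arrays model representation are our assumptions:

```python
import numpy as np

def fedavg(client_weights, n_samples):
    # server-side FedAvg: average each parameter array,
    # weighted by the number of samples each client holds
    total = sum(n_samples)
    return [sum(w[i] * n / total for w, n in zip(client_weights, n_samples))
            for i in range(len(client_weights[0]))]

w1 = [np.ones((2, 2)), np.zeros(3)]    # client 1's parameters
w2 = [np.zeros((2, 2)), np.ones(3)]    # client 2's parameters
avg = fedavg([w1, w2], n_samples=[30, 10])
print(avg[0][0, 0], avg[1][0])         # 0.75 0.25
```

With 30 of the 40 total samples, client 1 contributes weight 0.75 to every parameter, which is exactly what the averaged arrays show.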

19.1 Task and Dataset

MedMNIST (Yang et al., 2021a,b) is a large-scale MNIST-like collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D. These datasets cover primary data modalities (e.g., X-ray, OCT, ultrasound, CT, electron microscope), diverse classification tasks (binary/multi-class, ordinal regression, and multi-label), and data scales (from 100 to 100,000 samples). All images are 28 × 28 (2D) or 28 × 28 × 28 (3D). In this chapter, we take OrganCMNIST (Bilic et al., 2019; Xu et al., 2019) as an example. OrganCMNIST consists of abdominal CT images and contains 11 classes with 23,660 instances. An overview of OrganCMNIST is shown in Fig. 19.1.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 J. Wang, Y. Chen, Introduction to Transfer Learning, Machine Learning: Foundations, Methodologies, and Applications, https://doi.org/10.1007/978-981-19-7584-4_19


Fig. 19.1 An Overview of OrganCMNIST

19.1.1 Dataset

Although OrganCMNIST is an image dataset, it is saved in the "npz" format. For simplicity, we do not convert the files to images but load them directly. The following shows how to construct a dataset for OrganCMNIST.

MedMnist dataloader

import numpy as np
import torch
from torch.utils.data import Dataset

def get_data_medmnist(filename):
    data = np.load(filename)
    train_data = np.vstack((data['train_images'], data['val_images'],
                            data['test_images']))
    y = np.hstack((np.squeeze(data['train_labels']),
                   np.squeeze(data['val_labels']),
                   np.squeeze(data['test_labels'])))
    return train_data, y


class MedMnistDataset(Dataset):
    def __init__(self, filename):
        self.data, self.targets = get_data_medmnist(filename)
        self.targets = np.squeeze(self.targets)
        self.data = torch.Tensor(self.data)
        self.data = torch.unsqueeze(self.data, dim=1)

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        return self.data[idx], self.targets[idx]

19.1.2 Data Splits

Since OrganCMNIST comes from a single data source, we need to split it into several parts to simulate clients in federated learning. The simplest way is to randomly split all data into n_clients parts, where n_clients is the number of clients. However, clients produced by such a random split share the same distribution, which is in contrast to reality: in real applications, different clients, such as hospitals in different locations, often hold non-i.i.d. samples. In the following, we show a common way to split data into non-i.i.d. parts in federated learning (Yurochkin et al., 2019).

Non-i.i.d. Split by Dirichlet

def build_non_iid_by_dirichlet(random_state, indices2targets, non_iid_alpha,
                               num_classes, num_indices, n_workers):
    n_auxi_workers = 10
    assert n_auxi_workers
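Since the listing above is excerpted, here is a compact, self-contained version of the same idea: for each class, draw client proportions from a Dirichlet(α) distribution and deal that class's indices out accordingly. Smaller α yields more skewed (more non-i.i.d.) clients. The function name and details are ours, not the book's:

```python
import numpy as np

def dirichlet_split(targets, n_clients, alpha, seed=0):
    # deal each class's indices to clients with Dirichlet(alpha) proportions
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(targets):
        idx = np.where(targets == c)[0]
        rng.shuffle(idx)
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            client_idx[k].extend(part.tolist())
    return client_idx

targets = np.repeat(np.arange(11), 100)   # 11 classes, as in OrganCMNIST
parts = dirichlet_split(targets, n_clients=4, alpha=0.5)
print(sum(len(p) for p in parts))         # 1100: every sample assigned once
```

With α = 0.5 each client ends up with a markedly different class mixture; raising α toward infinity recovers the uniform i.i.d. split.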