Studies in Computational Intelligence 965
Muhammad Habib ur Rehman Mohamed Medhat Gaber Editors
Federated Learning Systems Towards Next-Generation AI
Studies in Computational Intelligence Volume 965
Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.
More information about this series at http://www.springer.com/series/7092
Muhammad Habib ur Rehman • Mohamed Medhat Gaber
Editors

Federated Learning Systems
Towards Next-Generation AI
Editors Muhammad Habib ur Rehman Center for Cyber-Physical Systems Khalifa University of Science and Technology Abu Dhabi, United Arab Emirates
Mohamed Medhat Gaber School of Computing and Digital Technology Birmingham City University Birmingham, UK
ISSN 1860-949X ISSN 1860-9503 (electronic) Studies in Computational Intelligence ISBN 978-3-030-70603-6 ISBN 978-3-030-70604-3 (eBook) https://doi.org/10.1007/978-3-030-70604-3 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
“A learned individual can benefit the rest of the world.”
Preface
Businesses and governments are collecting massive amounts of data about their customers and citizens. These organizations store the data in centralized cloud storage infrastructures and run large-scale training on it to make complex and timely decisions. This data collection is continuous; however, customers and citizens remain concerned about how their data is managed and processed. Considering this, several regulatory measures have been taken and various laws have been approved, such as the GDPR in the European Union, HIPAA and the CCPA in the United States of America, and PIPEDA in Canada. Still, customers and citizens raise concerns over the protection of their privacy and over incorrect decisions made by machine learning systems. The resulting inability to collect fresh data poses a serious threat to well-informed and realistic decision making. Google introduced the term federated learning (FL) to enable machine learning models to be trained initially on the customers' or citizens' devices and systems, with the model updates later aggregated at centralized cloud servers. Building on this notion of FL, a plethora of research activities has been carried out by researchers and practitioners in academia and industry, and numerous research publications address the active research issues in terms of privacy, security, data and model synchronization, model development and deployment, personalization, incentivization, and heterogeneity across FL systems.

This book aims to study the FL ecosystem from a broad perspective, covering the theoretical as well as the applied aspects of FL systems. The book is structured into eight chapters. In the first chapter, Farooq et al. perform a thorough bibliometric analysis of the field of FL. The authors systematically searched the Scopus database to uncover publication trends, found 476 scholarly documents in total, and analyzed the dataset to identify the growth trends in FL research. They also studied subject areas and ranked them by the number of publications, outlined the top-10 cited papers, top-10 authors, top-10 institutions, and top-10 countries, categorized the documents into various types, and uncovered the top-10 sources of these documents. Finally, the authors performed a domain profiling of the FL research area
and identified five hot domains, namely the internet of things (IoT), wireless communication, privacy and security, data analytics, and learning and optimization, in which most FL research is creating impact.

Briggs et al., in Chap. 2, review FL as an approach for performing machine learning on distributed data in order to protect the privacy of user-generated data. They highlight pertinent challenges in an IoT context, such as reducing the communication costs associated with data transmission, learning from data under heterogeneous conditions, and applying additional privacy protections to FL. Throughout this review, they identify the strengths and weaknesses of different methods applied to FL, and finally they outline future directions for privacy-preserving FL research, focusing particularly on IoT applications.

To effectively prevent information leakage, K. Wei et al. (in Chap. 3) investigate a differential privacy mechanism in which, at the clients' side, artificial noise is added to the parameters before uploading. Moreover, they propose a K-client random scheduling policy, in which K clients are randomly selected from a total of N clients to participate in each communication round. Furthermore, a theoretical convergence bound is derived for the loss function of the trained FL model. In detail, for a fixed privacy level, the theoretical bound reveals that there exists an optimal number of clients K that achieves the best convergence performance, owing to the tradeoff between the volume of user data and the variance of the aggregated artificial noise. To optimize this tradeoff, they further provide a differentially private FL-based client selection (DP-FedCS) algorithm, which can dynamically select the number of training clients. Their experimental results validate their theoretical conclusions and also show that the proposed algorithm can effectively improve both the FL training efficiency and the FL model quality for a given privacy protection level.

FL provides privacy by design. It trains a machine learning model collaboratively over several distributed clients (ranging from two to millions), such as mobile phones, without sharing their raw data with any other participant. In practical scenarios, not all clients have sufficient computing resources (e.g., Internet of Things devices), the machine learning model may have millions of parameters, and keeping the model private between the server and the clients during training and testing is a prime concern (e.g., among rival parties). In these respects, FL alone is not sufficient, so split learning (SL) is introduced in Chap. 4 by C. Thapa et al. SL is well suited to these scenarios, as it splits a model into multiple portions, distributes them among the clients and the server, and trains/tests the respective model portions to accomplish the full model training/testing. In SL, the participants share neither their data nor their model portions with any other party, and usually a smaller network portion is assigned to the clients, where the data resides. Recently, a hybrid of FL and SL, called SplitFed learning, was introduced to combine the benefits of both FL (faster training/testing time) and SL (model splitting and training). Following the developments from FL to SL, and considering the importance of SL, this chapter provides extensive coverage of SL and its variants, including fundamentals, existing findings, integration with privacy measures such as differential privacy, open problems, and code implementation.
Chapter 5 presents a practitioner's view of FL research, whereby a group of researchers from the PySyft community elaborates on the key features of their FL tool. PySyft is an open-source multi-language library enabling secure and private machine learning by wrapping and extending popular deep learning frameworks such as PyTorch in a transparent, lightweight, and user-friendly manner. Its aim is both to help popularize privacy-preserving techniques in machine learning by making them as accessible as possible, via Python bindings and tools familiar to researchers and data scientists, and to be extensible, so that new federated learning, multi-party computation, or differential privacy methods can be flexibly and simply implemented and integrated. The chapter introduces the methods available within the PySyft library and describes their implementations. The authors also provide a proof-of-concept demonstration of an FL workflow, using an example of how to train a convolutional neural network. Next, they review the use of PySyft in the academic literature to date and discuss future use cases and development plans. Most importantly, they introduce Duet, their tool for easier federated learning for scientists and data owners.

In the medical and healthcare industry, where the available information or data is never sufficient, decisions can be supported with the help of FL by empowering AI models to learn on private data without conceding privacy. The primary objective of Chap. 6 is to highlight the adaptability and working of FL techniques in the healthcare system, especially in drug development, clinical diagnosis, digital health monitoring, and various disease prediction and detection systems. The first section of the chapter comprises the motivation, FL for healthcare, the FL working model in healthcare, and various benefits of FL. The next section describes the reported work, highlighting the contributions of different researchers who used the FL model. The final section presents a comparative analysis of different FL algorithms for different health sectors using parameters such as accuracy, area under the curve, precision, recall, and F-score.

Dirir et al., in Chap. 7, envision the idea of a fully decentralized FL system. They emphasize the use of blockchain-empowered smart contract technologies to enable fairness and trust among FL participants over underlying peer-to-peer networks. Finally, Enthoven and Al-Ars, in Chap. 8, analyze existing vulnerabilities of FL and subsequently perform a literature review of the possible attack methods targeting FL privacy protection capabilities. These attack methods are then categorized by a basic taxonomy. Additionally, they provide a literature study of the most recent defensive strategies and algorithms for FL aimed at overcoming these attacks. These defensive strategies are categorized by their respective underlying defense principles. The chapter argues that the application of a single defensive strategy is not enough to provide adequate protection against all available attack methods.

Although this book primarily targets the computer science, information technology, and data science communities, its generalized content will be equally helpful for students, researchers, and practitioners from all walks of life. Since this is the first contributed research book on the topic of FL, we aim to include more important and active research topics in future editions of
this book. Finally, we would like to thank all contributors, including authors, reviewers, and editorial staff, for their untiring efforts in making this project a success.

Abu Dhabi, United Arab Emirates
Birmingham, UK
December 2020
Muhammad Habib ur Rehman Mohamed Medhat Gaber
Contents
1 Federated Learning Research: Trends and Bibliometric Analysis . . . 1
  Ali Farooq, Ali Feizollah, and Muhammad Habib ur Rehman
2 A Review of Privacy-Preserving Federated Learning for the Internet-of-Things . . . 21
  Christopher Briggs, Zhong Fan, and Peter Andras
3 Differentially Private Federated Learning: Algorithm, Analysis and Optimization . . . 51
  Kang Wei, Jun Li, Chuan Ma, Ming Ding, and H. Vincent Poor
4 Advancements of Federated Learning Towards Privacy Preservation: From Federated Learning to Split Learning . . . 79
  Chandra Thapa, M. A. P. Chamikara, and Seyit A. Camtepe
5 PySyft: A Library for Easy Federated Learning . . . 111
  Alexander Ziller, Andrew Trask, Antonio Lopardo, Benjamin Szymkow, Bobby Wagner, Emma Bluemke, Jean-Mickael Nounahon, Jonathan Passerat-Palmbach, Kritika Prakash, Nick Rose, Théo Ryffel, Zarreen Naowal Reza, and Georgios Kaissis
6 Federated Learning Systems for Healthcare: Perspective and Recent Progress . . . 141
  Yogesh Kumar and Ruchi Singla
7 Towards Blockchain-Based Fair and Trustworthy Federated Learning Systems . . . 157
  Ahmed Mukhtar Dirir, Khaled Salah, and Davor Svetinovic
8 An Overview of Federated Deep Learning Privacy Attacks and Defensive Strategies . . . 173
  David Enthoven and Zaid Al-Ars
Contributors
Zaid Al-Ars: Delft University of Technology, Delft, Netherlands
Peter Andras: Keele University, Staffordshire, UK
Emma Bluemke: University of Oxford, Oxford, UK; OpenMined, Oxford, UK
Christopher Briggs: Keele University, Staffordshire, UK
Seyit A. Camtepe: CSIRO Data61, Sydney, Australia
M. A. P. Chamikara: CSIRO Data61, Sydney, Australia
Ming Ding: Data61, CSIRO, Sydney, Australia
Ahmed Mukhtar Dirir: Center for Cyber-Physical Systems, Electrical Engineering and Computer Science, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates
David Enthoven: Delft University of Technology, Delft, Netherlands
Zhong Fan: Keele University, Staffordshire, UK
Ali Farooq: Department of Computing, University of Turku, Turku, Finland
Ali Feizollah: University of Malaya Halal Research Centre (UMHRC), University of Malaya, Kuala Lumpur, Malaysia
Georgios Kaissis: OpenMined, Oxford, UK; Technical University of Munich, Imperial College London, Munich, Germany
Yogesh Kumar: Chandigarh Group of Colleges, Sahibzada Ajit Singh Nagar, India
Jun Li: Nanjing University of Science and Technology, Nanjing, China
Antonio Lopardo: ETH Zurich, Zurich, Switzerland; OpenMined, Oxford, UK
Chuan Ma: Nanjing University of Science and Technology, Nanjing, China
Jean-Mickael Nounahon: OpenMined, Oxford, UK; De Vinci Research Centre, Paris, France
Jonathan Passerat-Palmbach: OpenMined, Oxford, UK; Imperial College London, Consensys Health, London, UK
H. Vincent Poor: Princeton University, Princeton, NJ, USA
Kritika Prakash: OpenMined, Oxford, UK; IIIT Hyderabad, Hyderabad, India
Muhammad Habib ur Rehman: Center for Cyber-Physical Systems, Khalifa University of Science and Technology, Abu Dhabi, UAE
Zarreen Naowal Reza: OpenMined, Oxford, UK; Thales Canada Inc., Quebec, Canada
Nick Rose: OpenMined, Oxford, UK
Théo Ryffel: OpenMined, Oxford, UK; INRIA, ENS, PSL University Paris, Paris, France
Khaled Salah: Center for Cyber-Physical Systems, Electrical Engineering and Computer Science, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates
Ruchi Singla: Chandigarh Group of Colleges, Sahibzada Ajit Singh Nagar, India
Davor Svetinovic: Center for Cyber-Physical Systems, Electrical Engineering and Computer Science, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates
Benjamin Szymkow: OpenMined, Oxford, UK
Chandra Thapa: CSIRO Data61, Sydney, Australia
Andrew Trask: University of Oxford, Oxford, UK; OpenMined, Oxford, UK
Bobby Wagner: OpenMined, Oxford, UK
Kang Wei: Nanjing University of Science and Technology, Nanjing, China
Alexander Ziller: Technical University of Munich, Munich, Germany; OpenMined, Oxford, UK
Acronyms
AI  Artificial Intelligence
AS  Attack Surface
CNN  Convolutional Neural Network
CPU  Central Processing Unit
CSV  Comma-Separated Values
DCML  Distributed Collaborative Machine Learning
DLG  Deep Leakage from Gradients
DNN  Deep Neural Network
DP  Differential Privacy
DSGD  Distributed Stochastic Gradient Descent
FedSGD  Federated Stochastic Gradient Descent
FHE  Fully Homomorphic Encryption
FL  Federated Learning
GAN  Generative Adversarial Network
GDPR  General Data Protection Regulation
GPS  Global Positioning System
GPU  Graphics Processing Unit
HE  Homomorphic Encryption
HIPAA  Health Insurance Portability and Accountability Act
iDLG  Improved DLG
IIoT  Industrial Internet of Things
IoT  Internet of Things
LDP  Local Differential Privacy
LSTM  Long Short-Term Memory
mGAN-AI  Multitask GAN for Auxiliary Identification
MIA  Model Inversion Attack
ML  Machine Learning
MLP  Multi-Layer Perceptron
MPI  Message Passing Interface
MUD  Manufacturer Usage Description
NbAFL  Noising before model Aggregation FL
non-IID  Non-Independent Identically Distributed
PATE  Private Aggregation of Teacher Ensembles
PHE  Partially Homomorphic Encryption
PIPEDA  Personal Information Protection and Electronic Documents Act
ReLU  Rectified Linear Unit
SFL  SplitFed Learning
SGD  Stochastic Gradient Descent
SL  Split Learning
SMC  Secure Multi-party Computation
STD  Standard Deviation
SVM  Support Vector Machine
SWHE  Somewhat Homomorphic Encryption
Chapter 1
Federated Learning Research: Trends and Bibliometric Analysis Ali Farooq, Ali Feizollah, and Muhammad Habib ur Rehman
Abstract Federated learning (FL) allows machine learning algorithms to gain insights from a broad range of datasets located at different locations, enabling privacy-preserving model development. Since its announcement in 2016, FL has gained interest from a variety of entities, both in academia and industry. To understand the research trends in this area, a bibliometric analysis was conducted to objectively describe the research profile of the FL area. In this regard, 476 documents written in English were collected through a systematic search in the Scopus database and examined from several perspectives (e.g., growth trends, top-cited papers, subject areas) as well as productivity measures of authors, institutions, and countries. Further, a co-word analysis using VOSviewer was carried out to identify the evolving research themes in FL. The FL literature has grown exponentially since 2018. Five research themes, namely the internet of things, wireless communication, privacy and security, data analytics, and learning and optimization, surfaced in the analysis. We also found that most of the documents related to FL were published in computer science, followed by engineering disciplines. It was also observed that China is at the forefront in terms of the number of documents in this area, followed by the United States of America and Australia. Keywords Federated learning · Bibliometric · Research · Academia · Industry
A. Farooq (B) Department of Computing, University of Turku, 20500 Turku, Finland e-mail: [email protected] A. Feizollah University of Malaya Halal Research Centre (UMHRC), University of Malaya, 50603 Kuala Lumpur, Malaysia e-mail: [email protected] M. H. ur Rehman Center for Cyber-Physical Systems, Khalifa University of Science and Technology, 127788 Abu Dhabi, UAE e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 M. H. ur. Rehman and M. M. Gaber (eds.), Federated Learning Systems, Studies in Computational Intelligence 965, https://doi.org/10.1007/978-3-030-70604-3_1
1.1 Introduction

Machine learning has become very popular recently due to the increase in computing power and the abundance of data [1]. Data is considered the fuel for machine learning algorithms, and it has a great impact on a model's performance. Although many datasets are publicly available, the hunger for more data persists. However, users' data is normally inaccessible due to organizational policies, and direct sharing of personal data raises privacy concerns. Considering this, various governments are enacting privacy laws such as the GDPR (EU), HIPAA (USA), and PIPEDA (Canada), to name a few. Conventionally, data is collected from users and a machine learning model is trained on that data, but this exposes personal data to people and devices that do not own it. To effectively address this issue, Google announced federated learning (FL) in 2016 [2]. FL is the concept of training an algorithm by running it locally on a user's device: it brings the algorithm to the data, rather than the data to the algorithm. This operation is performed under the supervision of a central server. Upon completion of the training, only the updated parameters of the model are sent to the central server. The server aggregates all the parameters received from the various devices to produce a global model [3, 19].

FL has many applications in sales, finance, and other industries where data cannot be collected directly because of intellectual property rights, privacy requirements, and data security policies. For example, in the retail industry, creating a personalized experience for users is the ultimate goal. To achieve this goal, an organization needs access to users' data and shopping history to understand their behavior and shopping patterns. However, for data privacy reasons, conventional methods are not considered secure. FL solves this issue by bringing the algorithm to users' data, performing training on it, and collecting the resultant model updates. This way, the data stays on users' devices while the retailers get the required model updates.

Although it has only been four years since the introduction of FL, it has attracted the attention of many researchers. Therefore, it is important to examine the research progress and trends to identify research gaps and suggest future directions. To this end, a bibliometric analysis can provide a macroscopic view of the entire field in the global context of related and neighboring fields. This understanding of the bigger picture allows an individual to rationally choose a specific starting point for more detailed investigations in their areas of interest [4]. This analysis technique makes it possible to examine the evolution of a research domain, both topic-wise and authorship-wise [5]. Bibliometrics is the use of scholarly data to analyze publication patterns according to the author(s), topics (including keywords, subject index, or classification codes), affiliations of the author(s), the location where the research was conducted, sources of publication such as the journal or conference, the location where the research was published, date of publication, and so on [6]. In recent years, this technique has been used in various fields (for example, knowledge management [7], medicine [8], and business [5]) and domains (for example, strategic management [9],
information security [10], corporate social responsibility [11], the application of AI in marketing [12], and COVID-19 [13]) for examining research evolution. Therefore, we present a thorough bibliometric analysis of research on FL and perform domain profiling to better understand this research area.
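To make the training procedure sketched above concrete, the following minimal Python example illustrates one possible realization of the FL loop in the style of federated averaging. The linear model, synthetic client data, and learning rate are illustrative assumptions; the snippet is a sketch of the general idea rather than the implementation of any specific system discussed in this chapter.

import numpy as np

def local_update(global_weights, local_data, lr=0.01, epochs=1):
    # Client side: start from the global weights and fit a simple linear
    # model to the local data; raw data never leaves the device.
    w = global_weights.copy()
    X, y = local_data
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # mean-squared-error gradient
        w -= lr * grad
    return w, len(y)

def federated_round(global_weights, clients):
    # Server side: collect the updated weights and aggregate them,
    # weighting each client by its number of local samples.
    updates = [local_update(global_weights, data) for data in clients]
    total = sum(n for _, n in updates)
    return sum(w * (n / total) for w, n in updates)

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(4)]
weights = np.zeros(3)
for _ in range(10):                          # ten communication rounds
    weights = federated_round(weights, clients)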
1.2 Material and Method

In this section, data collection and analysis are discussed.
1.2.1 Data Collection

Since the purpose of the study was to examine the current research on FL, we first identified the keywords that could be used to search the relevant literature. For this purpose, we examined the current literature and sought the opinion of two subject-matter experts. We used the following keywords: "federated learning", "federated machine learning", "federated ML", "federated artificial intelligence", "federated AI", "federated intelligence", and "federated training". The Boolean operator OR was used between the keywords to form a search string. These broad keywords allowed us to cover a wide range of studies with a focus on FL.

The selection of data sources is the next important step in data collection. Researchers have used a variety of sources in bibliometric studies such as ours; most often, Web of Science, Scopus, and Google Scholar are selected as data sources due to their coverage, citations, accuracy, and consistency. For example, Web of Science has better coverage of journals [14]; however, Scopus has wider coverage of publications compared to Web of Science [15]. Google Scholar has better citation coverage than the other two, but it has been criticized for inconsistency [15]. All in all, Scopus has been regarded as a better data source due to its citation counts, coverage of disciplines, and consistency [10]. Further, Scopus has been found to be stronger than Web of Science in the area of science and technology [16], which is particularly relevant to our study. In line with this, we selected Scopus as the data source. To obtain a broader set of publications, we searched the titles, abstracts, and keywords of publications such as journal articles, conference proceedings, book chapters, reviews, short papers, and editorials. We refer to these publications as documents in the rest of the paper. The search was conducted on October 4, 2020, without any lower time limit. We included documents that were published in English. In this way, we identified 476 documents from the database.
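As a minimal illustration of how such a search string can be assembled, the keywords can be joined with the Boolean OR operator and wrapped in a title-abstract-keyword field restriction. The field code below follows Scopus advanced-search syntax; the snippet is an illustrative sketch, not a record of the exact query submitted by the authors.

keywords = [
    "federated learning", "federated machine learning", "federated ML",
    "federated artificial intelligence", "federated AI",
    "federated intelligence", "federated training",
]
# Quote each phrase, join with OR, and restrict the search to titles,
# abstracts, and keywords.
query = "TITLE-ABS-KEY(" + " OR ".join('"{}"'.format(k) for k in keywords) + ")"
print(query)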
1.2.2 Data Analysis

Bibliometric data for the 476 documents were downloaded from Scopus as a comma-separated values (CSV) file, which can be used for further analysis in a variety of software tools for scientific mapping and profile analysis [17]. The data file was checked for missing values such as authors, title, publication year, title of the source, authors' affiliations, keywords, and citation information. No significant deficiency was found; however, we examined the keywords, changed abbreviations to full terms (for example, from IoT or iot to internet of things), used standard terminology (for example, internet of thing (iot) was changed to internet of things), and removed duplicate terms in unique records. Since the purpose of the study was to examine the profile and conceptual evolution of the research domain, the analysis was carried out in two phases. In Phase I, Microsoft Excel was used for performance analysis, where essential characteristics of publications, such as year of publication, type of document, most prolific document sources, authors, and countries, were analyzed. VOSviewer [18], a visualization tool, was used for identifying conceptual themes based on the co-occurrence of keywords, also known as co-word analysis. Co-word analysis identifies topic themes based on the majority of documents in a dataset and is considered better than co-citation analysis [10]. During the co-word analysis, VOSviewer identifies noun phrases (terms) in the documents' titles, abstracts, and keywords and creates a similarity matrix. This matrix shows the frequency with which two terms appear together in the dataset. Thereafter, the terms are clustered and mapped based on the "visualization of similarities" (VOS) mapping technique [18]. It is pertinent to mention that we removed the keywords used in the data search (that is, "federated learning", "federated machine learning", "federated ML", "federated artificial intelligence", "federated AI", "federated intelligence", and "federated training"), as these keywords would otherwise overshadow other terms in the visualization. The parameters listed in Table 1.1 were used for preparing the visualization shown in the results section; a simplified sketch of the keyword clean-up and co-word counting follows the table.
Table 1.1 Parameters used for visualization in VOSviewer
  Counting method: Binary
  Threshold (minimum number of occurrences of a term): 5
  Number of terms (most relevant): 183 (100%)
  Normalization method: Factorization
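The following Python sketch shows, in simplified form, the kind of keyword normalization and per-document co-occurrence counting described above. The synonym map and the threshold of five occurrences are illustrative; VOSviewer performs the actual term extraction and clustering internally.

from collections import Counter
from itertools import combinations

# Hypothetical normalization map: abbreviations and variants are mapped
# to one standard term before counting.
SYNONYMS = {"iot": "internet of things", "internet of thing (iot)": "internet of things"}

def normalize(keywords):
    cleaned = {SYNONYMS.get(k.strip().lower(), k.strip().lower()) for k in keywords}
    return sorted(cleaned)                      # the set removes duplicate terms

def coword_counts(documents, min_occurrences=5):
    term_counts = Counter(t for doc in documents for t in normalize(doc))
    kept = {t for t, c in term_counts.items() if c >= min_occurrences}
    pairs = Counter()
    for doc in documents:
        terms = [t for t in normalize(doc) if t in kept]
        pairs.update(combinations(terms, 2))    # binary counting: once per document
    return pairs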
1.3 Results and Discussion

This section presents the results of the bibliometric analysis of FL. This research domain is still new, and we found 476 documents in the Scopus database. These findings are important since they map the current research and unravel potential gaps in the research domain.
1.3.1 Growth Pattern Over the Years

Figure 1.1 shows the growth of the FL domain. As can be seen, after the introduction of the concept in 2016, there has been massive growth in this topic; after 2018 in particular, there is a considerable jump in published articles. The main idea of FL proposed by Google was to train machine learning models using distributed datasets scattered across multiple devices. This idea has been refined as the popularity of FL has risen. Recently, research has focused on personalizing FL [20, 21], the security aspects of the domain [22, 23], and addressing statistical challenges [21]. Due to the challenging nature of FL, which includes the reliability of users' devices and unbalanced data distributions, research has also been moving towards on-device FL [25].

Most of the publications on FL were in the areas of computer science and engineering, but the topic has generated interest in other subjects as well. For example, as shown in Table 1.2, a significant number of papers were published in the decision sciences (which is close to business and economics), physics and astronomy, materials science, and medicine. The list shown in Table 1.2 is taken directly from Scopus. The sum of the numbers of papers may exceed the total number of documents used for analysis in this chapter, as one paper can belong to more than one subject area.
Fig. 1.1 Growth pattern over the years
Table 1.2 Subject areas with the highest number of papers
  Computer science: 422
  Engineering: 204
  Mathematics: 97
  Decision sciences: 77
  Physics and astronomy: 23
  Materials science: 22
  Medicine: 22
  Social sciences: 12
  Energy: 8
  Biochemistry, genetics and molecular biology: 6
  Health professions: 5
  Business, management and accounting: 4
  Chemistry: 3
  Agricultural and biological sciences: 2
  Chemical engineering: 2
  Multidisciplinary: 2
  Dentistry: 1
  Economics, econometrics and finance: 1
  Neuroscience: 1
  Pharmacology, toxicology and pharmaceutics: 1
1.3.2 Top Cited Papers

As we can see in Fig. 1.1, publications in the area of FL have grown steadily since 2017. We identified the top 10 most cited papers in the area, which are shown in Table 1.3. The citation counts were taken from the Scopus database and may differ from those of Web of Science or Google Scholar. We did not normalize the citations by year. The paper titled "Communication-efficient learning of deep networks from decentralized data" [24] is the most cited paper in the research area, followed by the papers titled "Practical secure aggregation for privacy-preserving machine learning" [26] and "Federated machine learning: Concept and applications" [27].
1.3.3 Productivity Measures

In this sub-section, we describe the domain profile in terms of productivity measures such as prolific authors, institutions with the highest research production, countries with the most papers produced, and types of sources.
Table 1.3 Ten top cited papers
  1. Communication-efficient learning of deep networks from decentralized data (2017), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 194 citations
  2. Practical secure aggregation for privacy-preserving machine learning (2017), Proceedings of the ACM Conference on Computer and Communications Security, 147 citations
  3. Federated machine learning: Concept and applications (2019), ACM Transactions on Intelligent Systems and Technology, 102 citations
  4. Federated multi-task learning (2017), Advances in Neural Information Processing Systems, 84 citations
  5. Adaptive Federated Learning in Resource Constrained Edge Computing Systems (2019), IEEE Journal on Selected Areas in Communications, 53 citations
  6. In-edge AI: Intelligentizing mobile edge computing, caching and communication by federated learning (2019), IEEE Network, 42 citations
  7. Federated Learning over Wireless Networks: Optimization Model Design and Analysis (2019), Proceedings - IEEE INFOCOM, 33 citations
  8. Client Selection for Federated Learning with Heterogeneous Resources in Mobile Edge (2019), IEEE International Conference on Communications, 29 citations
  8. Federated learning of predictive models from federated Electronic Health Records (2018), International Journal of Medical Informatics, 29 citations
  8. Personalizing access to learning networks (2018), ACM Transactions on Internet Technology, 29 citations
  9. Federated Learning for Ultra-Reliable Low-Latency V2V Communications (2018), 2018 IEEE Global Communications Conference (GLOBECOM 2018) Proceedings, 26 citations
  10. VerifyNet: Secure and Verifiable Federated Learning (2020), IEEE Transactions on Information Forensics and Security, 22 citations
Table 1.4 Top 10 authors with the highest number of papers
  1. Niyato, D.: 12
  2. Bennis, M.: 9
  3. Saad, W.: 9
  4. Yu, S.: 9
  5. Chen, M.: 8
  6. Yang, Q.: 8
  7. Yu, H.: 8
  8. Zhang, J.: 8
  9. Hong, C. S.: 7
  10*. Li, H.: 7
  10*. Tran, N. H.: 7
  10*. Xu, G.: 7
  * Published equal number of articles

1.3.3.1 Authors
We show the top 10 authors with the highest number of published papers in Table 1.4. As can be seen, Niyato, D. published 12 papers, the highest number among all authors [28–39]. His papers are mostly about FL in mobile networks as well as mobile edge networks, and he also published a paper relating 5G and FL. Three authors published 9 papers each, namely Bennis, M., Saad, W., and Yu, S. Similarly, four authors published 8 papers each, namely Chen, M., Yang, Q., Yu, H., and Zhang, J. The last four authors, Hong, C. S., Li, H., Tran, N. H., and Xu, G., published 7 papers each.
1.3.3.2 Institutes
This section provides a list of the top 10 institutions that published work on FL. Table 1.5 lists these institutions in descending order. As can be seen, the University of Electronic Science and Technology of China and the Beijing University of Posts and Telecommunications each published 20 papers, the highest number among the institutions. Third place goes to Nanyang Technological University with 19 published papers. Princeton University and the University of Technology Sydney are ranked fourth and fifth with 17 and 16 published papers, respectively. The next two institutions, the Hong Kong University of Science and Technology and Tsinghua University, published 15 papers each. Fourteen papers were published by the School of Computer Science and Engineering, and thirteen papers were published by the IBM Thomas J. Watson Research Center. The last three institutions, Kyung Hee University, Imperial College London, and the Chinese Academy of Sciences, published 11 papers each.

Table 1.5 Top 10 institutions with the highest number of published papers
  University of Electronic Science and Technology of China: 20
  Beijing University of Posts and Telecommunications: 20
  Nanyang Technological University: 19
  Princeton University: 17
  University of Technology Sydney: 16
  Hong Kong University of Science and Technology: 15
  Tsinghua University: 15
  School of Computer Science and Engineering: 14
  IBM Thomas J. Watson Research Center: 13
  Kyung Hee University: 11
  Imperial College London: 11
  Chinese Academy of Sciences: 11
  Note: The last three institutions have an equal number of publications
1.3.3.3 Countries
In the previous section, we saw the top 10 institutions producing research in the area of FL. Table 1.6 lists the top 10 countries that have produced research on FL. Most documents were produced in China, followed by the United States, with Australia in third position. In the top 10 list, four countries are from Asia, three from Europe, and two from North America.
1.3.3.4 Sources
In terms of publication types, out of 476 documents, 300 were conference proceedings, followed by 138 journal articles. Table 1.7 lists the different document types and the corresponding numbers of documents. In terms of sources, most documents (36) were published in Springer's lecture notes series, followed by the IEEE International Conference on Communications. The top 10 sources contain three journals: IEEE Access, IEEE Intelligent Systems, and the IEEE Internet of Things Journal. Table 1.8 lists the top sources and the corresponding numbers of documents published in them.
Table 1.6 Top 10 countries in terms of publications
  China: 149
  United States: 128
  Australia: 36
  United Kingdom: 35
  Singapore: 24
  South Korea: 23
  Hong Kong: 16
  Canada: 14
  Finland: 11
  France: 10

Table 1.7 Number of different document types
  Conference paper: 300
  Article: 138
  Conference review: 25
  Review: 7
  Book chapter: 3
  Editorial: 1
  Letter: 1
  Short survey: 1
  Total: 476
1.3.4 Domain Profile

Based on the documents we found in our search, we clustered them according to the keywords related to each paper. Figure 1.2 shows the retrieved documents grouped by cluster. We identified five clusters in the figure. By clustering the documents, we get a clearer picture of the ongoing research in the FL domain. As can be seen, the clusters in red and blue are larger and therefore include more research papers than the other clusters. Table 1.9 shows the keywords in each cluster and our suggested name for each cluster. In each of the following sub-sections, we describe the cluster domain and mention some papers published in the cluster.
Table 1.8 Top 10 document sources in descending order
  Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics): 36
  IEEE International Conference on Communications: 20
  ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings: 13
  ACM International Conference Proceeding Series: 12
  CEUR Workshop Proceedings: 11
  IEEE Access: 11
  IEEE Intelligent Systems: 11
  IEEE Internet of Things Journal: 10
  Communications in Computer and Information Science: 7
  Proceedings - IEEE INFOCOM: 7

Fig. 1.2 Research themes in FL

Table 1.9 Different themes in FL research
  Red (1), theme: IoT. Top 20 most occurring terms: machine learning, edge computing, blockchain, internet of things, machine learning models, artificial intelligence, data sharing, distributed machine learning, reinforcement learning, network security, distributed computer systems, IoT, transfer learning, security and privacy, network architecture, 5G mobile communication systems, decision making, quality of service, computation theory
  Green (2), theme: Wireless communication. Top 20 most occurring terms: stochastic systems, communication overheads, gradient methods, wireless networks, optimization problems, economic and social effects, optimization, iterative methods, stochastic gradient descent, communication efficiency, energy utilization, incentive mechanism, bandwidth, communication rounds, efficiency, mobile telecommunication systems, resource allocation, signal processing, numerical results, energy efficiency
  Blue (3), theme: Privacy and security. Top 20 most occurring terms: learning systems, data privacy, privacy preserving, privacy, privacy preservation, neural networks, learning models, model parameters, state of the art, adversarial networks, privacy concerns, mobile computing, privacy-preserving, data mining, learning methods, privacy leakages, sensitive data, training process, benchmark datasets, centralized server
  Yellow (4), theme: Data analytics. Top 20 most occurring terms: deep learning, learning frameworks, deep neural networks, communication cost, global modeling, classification (of information), fog computing, poisoning attacks, anomaly detection, central servers, intrusion detection, real-world datasets, benchmarking, data distribution, training data, computer aided instruction, data handling, automation, collaborative training, computation costs
  Purple (5), theme: Learning and optimization. Top 20 most occurring terms: learning algorithms, cryptography, differential privacy, big data, distributed learning, privacy protection, digital storage, collaborative learning, homomorphic encryptions, forecasting, human, e-learning, large dataset, cloud computing, clustering algorithms, sensitive information, diagnosis, distributed data, medical imaging, secure multi-party computation
1.3.4.1 Cluster 1: Internet of Things
The keywords in the first cluster suggest that this cluster deals with the IoT and FL. The IoT consists of collections of sensors, making it a suitable application area for machine learning to improve the efficiency of the network. At the same time, the distributed nature of the IoT and its privacy issues make this domain a great candidate for applying FL. Among the many studies on IoT and FL, Duan et al. [40] focused on internet video service providers (IVSPs) that recommend videos to users in a geo-distributed manner. Due to its distributed nature, the authors applied FL to their network. They found that, because the agents in the distributed network need to receive federated training instructions and to send back the weights of the trained model, the network traffic is high. Therefore, they combined low-rank matrix factorization with an 8-bit quantization method to reduce the network bandwidth used. Feraudo et al. [41] considered edge computing as a candidate for FL. Edge computing is designed to speed up IoT networks, and edge devices are an important part of an IoT network since they transfer data from IoT devices to the cloud. Applications such as intrusion detection are therefore crucial in this type of network, and FL is a great option for training a model without compromising the security of the edge devices. The authors proposed a system based on the open-source Manufacturer Usage Description (MUD) standard and the PySyft framework. Lu et al. [42] examined the use of FL in a blockchain architecture for the industrial internet of things (IIoT). They first designed a blockchain system for secure data sharing among distributed users and then introduced FL to securely share the data model in the network. These are only a few of the research works on IoT and FL; there is ample opportunity for further research in this domain, since FL is well suited to distributed IoT agents that handle sensitive and personal data.
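The bandwidth saving from the 8-bit quantization mentioned above can be illustrated with a small sketch. This is a generic min-max quantizer written for illustration only; it is not the exact scheme used by Duan et al.

import numpy as np

def quantize_8bit(weights):
    # Map float32 weights to uint8 plus a scale and offset, shrinking the
    # upload to roughly a quarter of its original size.
    lo, hi = float(weights.min()), float(weights.max())
    scale = (hi - lo) / 255.0 or 1.0
    q = np.round((weights - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize_8bit(q, scale, lo):
    # Server-side reconstruction of the (approximate) weights.
    return q.astype(np.float32) * scale + lo

w = np.random.randn(1000).astype(np.float32)
q, scale, lo = quantize_8bit(w)
w_hat = dequantize_8bit(q, scale, lo)
print(np.abs(w - w_hat).max())   # quantization error is bounded by about scale / 2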
1.3.4.2 Cluster 2: Wireless Communication
The second cluster relates to wireless communication, specifically to communication costs and data transfer over slow and expensive networks. Wu et al. [43] studied a common application, mobile keyboard suggestion, which aims to predict the next word or phrase to ease the user's interaction with the device. The problem is that users' input data are needed to train the model, and since these data are personal and sensitive, the authors proposed an FL approach to preserve the privacy of users and to reduce communication costs. Their solution uses adaptive aggregation of weights during model training, a mediation incentive scheme, and a Top-K strategy. They conducted experiments on three datasets, Penn Treebank, WikiText-2, and Yelp, and reported robust performance, with results outperforming baseline approaches.
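A Top-K strategy of the kind mentioned above is commonly realized by transmitting only the largest-magnitude entries of a model update. The following sketch shows this generic idea; it is an illustration, not a reproduction of the implementation by Wu et al.

import numpy as np

def top_k_sparsify(update, k):
    # Keep only the k largest-magnitude entries of the update; the rest
    # are zeroed out and never leave the device.
    idx = np.argpartition(np.abs(update), -k)[-k:]
    values = update[idx]
    sparse = np.zeros_like(update)
    sparse[idx] = values
    return idx, values, sparse

update = np.random.randn(10_000)
idx, values, sparse = top_k_sparsify(update, k=100)
# Only (idx, values) need to be uploaded: 100 index-value pairs
# instead of 10,000 floating-point numbers.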
Luping et al. [44] focused specifically on communication overhead in FL. The problem arises with mobile devices whose data plans are limited and whose network connections to the central server are slow. The authors note that existing work focuses on compressing the data to reduce the number of bits transferred. They therefore proposed an orthogonal approach that identifies irrelevant updates on the client and excludes them from uploading, thereby reducing the network traffic. The authors evaluated their proposed method against traditional FL and reported that it improves communication efficiency by a factor of 5.7 with 4% higher prediction accuracy.
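One simple way to realize the idea of excluding irrelevant updates is to upload a local update only when it is sufficiently aligned with the most recent global update. The cosine-similarity test and threshold below are illustrative simplifications, not the criterion used by Luping et al.

import numpy as np

def is_relevant(local_update, global_update, threshold=0.0):
    # Cosine similarity between the local update and the last global update;
    # clients whose updates disagree strongly skip the upload this round.
    denom = np.linalg.norm(local_update) * np.linalg.norm(global_update)
    if denom == 0.0:
        return True
    return float(local_update @ global_update) / denom >= threshold

local = np.random.randn(256)
last_global = np.random.randn(256)
if is_relevant(local, last_global):
    pass  # upload the update; otherwise keep it local and save bandwidth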
1.3.4.3 Cluster 3: Privacy and Security
This cluster includes research works related to privacy and security, which is a defining characteristic of FL. Chandiramani et al. [45] compared basic machine learning, distributed machine learning, and FL to investigate how they perform in terms of data privacy and security. Their results showed that the FL method maintained privacy and performed well, with fast deployment of models on mobile and low-compute devices.
1.3.4.4 Cluster 4: Data Analytics
This cluster is focused on data analytics, an essential part of data science, and the role of FL here is significant: the data is stored on users' devices and, using FL, the analysis is performed while preserving the privacy of the users. Zhao et al. [48] researched intrusion detection. Because data analytics needs to take data from many users into account, privacy issues arise: it is not practical to train a model on a single user, yet accessing and collecting users' data for model training violates their privacy. The authors proposed an intrusion detection model based on FL and the long short-term memory (LSTM) algorithm. First, the model from the central server is sent to all users. Then, each user trains the algorithm on their own data and uploads the model's parameters to the central server. Finally, the central server aggregates the model parameters and updates the global model. The authors evaluated their proposed method and achieved higher accuracy and better consistency than conventional models. Schneible and Lu [49] considered anomaly detection in edge computing, where there are many small devices and sensors with intermittent connectivity. The authors proposed an autoencoder (a specialized deep neural network) model to detect anomalies on edge devices. The model is deployed to the devices to perform data analytics and anomaly detection in a distributed fashion. Once the devices have network connectivity, they upload their observations to the central server. The central server then aggregates all the results from the devices, updates the global model, and sends the updated global model back to the devices. This ensures that
the devices have updated models. This method reduces bandwidth and connectivity requirements for the edge devices.
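As a simplified stand-in for the autoencoder-based scoring described above, the sketch below uses a linear autoencoder (equivalent to PCA) and flags observations with a large reconstruction error. The component count and the synthetic data are illustrative assumptions, and the snippet is not drawn from Schneible and Lu's system.

import numpy as np

def fit_linear_autoencoder(X, n_components=2):
    # A linear autoencoder reduces to PCA: keep the top principal directions
    # as the encoder and reconstruct the input from them.
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def anomaly_score(x, mean, components):
    # Reconstruction error: observations the local model cannot explain
    # receive a high score and are flagged as potential anomalies.
    z = (x - mean) @ components.T
    x_hat = z @ components + mean
    return float(np.linalg.norm(x - x_hat))

X = np.random.randn(500, 8)                    # synthetic "normal" traffic features
mean, components = fit_linear_autoencoder(X, n_components=3)
score = anomaly_score(np.random.randn(8) * 5, mean, components)
print(score)                                   # large scores indicate anomalies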
1.3.4.5 Cluster 5: Learning and Optimization
This final cluster is about learning algorithms and their optimization for particular applications. Balachandar et al. [46] researched medical imaging. Patients' data is very sensitive, and sharing this data to train algorithms is risky; the authors therefore used FL so that models could be trained on patients' data without that data being shared. They mention that "optimizations included proportional local training iterations, cyclical learning rate, locally weighted minibatch sampling, and cyclically weighted loss. We evaluated our optimizations on simulated distributed diabetic retinopathy detection and chest radiograph classification." Their results showed an accuracy of 98.6%. In another application, Doku et al. [47] researched big data and distributed data sources with the goal of disrupting data centralization. They note that the majority of data is in the possession of a few big companies, and they proposed an approach combining blockchain and FL to store data in a decentralized manner.
1.4 Related Work

To establish the position of this chapter among similar works, and to further clarify its contributions, we examine related research in this section. The available literature suggests that current review papers fall into two categories. In the first category, papers focus on a general overview of the FL domain. As an example, Yang et al. [27] published a work that discusses the concept and applications of FL. The authors propose a secure FL framework that supplements the original proposal by Google in 2016, introducing horizontal FL, vertical FL, and federated transfer learning, and then survey existing works that fall into the proposed framework. Li et al. [50] discussed the general characteristics of FL and its challenges and provided an overview of current research directions. The authors identified four challenges of FL: expensive communication, systems heterogeneity, statistical heterogeneity, and privacy concerns. Furthermore, they classified current research works based on their objectives with respect to these challenges and outlined future directions in the FL research domain. Another work that reviews the general aspects of FL was published by Lo et al. [51]. In their paper, the authors systematically reviewed FL from a software engineering perspective, examining 231 primary studies. The results suggest that most of the published works are motivated by the core FL challenges, including data privacy, communication efficiency, and statistical heterogeneity.
The second category comprises papers that focus on a specific aspect of FL. For instance, Lim et al. [37] focused on the applications of FL in mobile edge networks; they first discussed the general concept of FL and then narrowed the discussion down to mobile edge networks. Security and privacy is another niche area, discussed by Mothukuri et al. [52]. Their paper aimed to provide a comprehensive study of the security and privacy aspects of FL. The authors reviewed various styles of implementation of security and privacy frameworks in FL and concluded that there are fewer privacy-specific threats associated with FL than security threats. The most significant security threats currently are communication bottlenecks, poisoning, and backdoor attacks, while inference-based attacks are the most critical to the privacy of FL.

This research work falls into the first category, providing a general overview of current research. However, we focus on a bibliometric study of FL, which fills a gap in the FL research domain: it is important to be aware of the prominent authors and influential institutions in this field. Additionally, we grouped all the published works into five categories and discussed each group, which helps in understanding which areas need more attention from the research community.
1.5 Conclusion and Future Research Directions

The purpose of this study was to examine and describe the intellectual profile of research focused on federated learning (FL). Through a systematic search, 476 relevant documents published in English were extracted from Scopus and examined. As a research area, FL has witnessed exponential growth since 2018. Most FL research is published in the form of conference proceedings in the subject areas of computer science and engineering. China has published the largest number of documents. In terms of universities, the University of Electronic Science and Technology of China and the Beijing University of Posts and Telecommunications from China, and Nanyang Technological University from Singapore, were the top three producers of research on FL. The current research in FL can be divided into five thematic areas: the internet of things, wireless communication, privacy and security, data analytics, and learning and optimization.

While this study provides a state-of-the-art picture of the research area, it is also affected by some limitations. For example, only one source was used to identify related documents, so publications that are not indexed in the Scopus database are left out. Secondly, we took a quantitative approach based on bibliometric data and did not perform a content analysis of the publications.
References
1. E. Alpaydin, Introduction to Machine Learning (MIT Press, Cambridge, 2020)
2. H.B. McMahan, E. Moore, D. Ramage, S. Hampson, B. Agüera y Arcas, Communication-Efficient Learning of Deep Networks from Decentralized Data (2016). arXiv:1602.05629
3. C. Dinh, N.H. Tran, M.N.H. Nguyen, C.S. Hong, W. Bao, A. Zomaya, V. Gramoli, Federated Learning over Wireless Networks: Convergence Analysis and Resource Allocation (2019). arXiv:1910.13067
4. R.N. Kostoff, D.R. Toothman, H.J. Eberhart, J.A. Humenik, Text mining using database tomography and bibliometrics: a review. Technol. Forecast. Soc. Change 68(3), 223–253 (2001)
5. N. Donthu, S. Kumar, D. Pattnaik, Forty-five years of journal of business research: a bibliometric analysis. J. Bus. Res. 109, 1–14 (2020)
6. M.K. McBurney, P.L. Novak, What is bibliometrics and why should you care?, in Proceedings. IEEE International Professional Communication Conference (IEEE, 2002), pp. 108–114
7. M. Gaviria-Marin, J.M. Merigó, H. Baier-Fuentes, Knowledge management: a global examination based on bibliometric analysis. Technol. Forecast. Soc. Change 140, 194–220 (2019)
8. H. Liao, M. Tang, L. Luo, C. Li, F. Chiclana, X.J. Zeng, A bibliometric analysis and visualization of medical big data research. Sustainability 10(1), 166 (2018)
9. J.J.M. Ferreira, C.I. Fernandes, V. Ratten, A co-citation bibliometric analysis of strategic management research. Scientometrics 109(1), 1–32 (2016)
10. N.V. Olijnyk, A quantitative examination of the intellectual profile and evolution of information security from 1965 to 2015. Scientometrics 105(2), 883–904 (2015)
11. S.S. Bhattacharyya, S. Verma, The intellectual contours of corporate social responsibility literature. Int. J. Sociol. Soc. Policy (2020)
12. M. Mekhail, J. Salminen, L. Ple, J. Wirtz, Artificial intelligence in marketing: bibliometric analysis, topic modeling and research agenda. J. Bus. Res. forthcoming (2021)
13. S. Verma, A. Gustafsson, Investigating the emerging COVID-19 research trends in the field of business and management: a bibliometric analysis approach. J. Bus. Res. 118, 253–261 (2020)
14. L.S. Adriaanse, C. Rensleigh, Web of science, scopus and Google Scholar. The Electronic Library (2013)
15. M.E. Falagas, E.I. Pitsouni, G.A. Malietzis, G. Pappas, Comparison of PubMed, Scopus, web of science, and Google scholar: strengths and weaknesses. FASEB J. 22(2), 338–342 (2008)
16. F. de Moya-Anegón, Z. Chinchilla-Rodríguez, B. Vargas-Quesada, E. Corera-Álvarez, F. Muñoz-Fernández, A. González-Molina, V. Herrero-Solana, Coverage analysis of Scopus: a journal metric approach. Scientometrics 73(1), 53–78 (2007)
17. M.J. Cobo, A.G. López-Herrera, E. Herrera-Viedma, F. Herrera, Science mapping software tools: review, analysis, and cooperative study among tools. J. Amer. Soc. Inf. Sci. Technol. 62, 1382–1402 (2011)
18. N.J. Van Eck, L. Waltman, Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics 84, 523–538 (2010)
19. P. Kairouz, H.B. McMahan, B. Avent, A. Bellet, M. Bennis, A.N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, R.G.L. D'Oliveira, S. El Rouayheb, D. Evans, J. Gardner, Z. Garrett, A. Gascón, B. Ghazi, P.B. Gibbons, M. Gruteser, Z. Harchaoui, C. He, L. He, Z. Huo, B. Hutchinson, J. Hsu, M. Jaggi, T. Javidi, G. Joshi, M. Khodak, J. Konečný, A. Korolova, F. Koushanfar, S. Koyejo, T. de Lepoint, Y. Liu, P. Mittal, M. Mohri, R. Nock, A. Özgür, R. Pagh, M. Raykova, H. Qi, D. Ramage, R. Raskar, D. Song, W. Song, S.U. Stich, Z. Sun, A.T. Suresh, F. Tramèr, P. Vepakomma, J. Wang, L. Xiong, Z. Xu, Q. Yang, F.X. Yu, H. Yu, S. Zhao, Advances and Open Problems in Federated Learning (2019). arXiv:1912.04977
20. F. Chen, M. Luo, Z. Dong, Z. Li, X. He, Federated Meta-Learning with Fast Convergence and Efficient Communication (2019). arXiv:1802.07876
21. V. Smith, C.-K. Chiang, M. Sanjabi, A.S. Talwalkar, Federated multi-task learning, in Advances in Neural Information Processing Systems, vol. 30, ed. by I. Guyon, U.V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Curran Associates, Inc., New York, 2017), pp. 4424–4434, http://papers.nips.cc/paper/7029-federated-multi-task-learning.pdf
18
A. Farooq et al.
22. R.C. Geyer, T. Klein, M. Nabi, Differentially private federated learning: a client level perspective (2017). arxiv:1712.07557 23. K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H.B. McMahan, S. Patel, D. Ramage, A. Segal, K. Seth, Practical secure aggregation for privacy-preserving machine learning, in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS’17) (ACM, New York, 2017), pp. 1175–1191. https://doi.org/10.1145/3133956.3133982 24. B. McMahan, E. Moore, D. Ramage, S. Hampson, B.A. y Arcas, Communication-efficient learning of deep networks from decentralized data, in Artificial Intelligence and Statistics (PMLR, 2017), pp. 1273–1282 25. M.H. Rehman, K. Salah, E. Damiani, D. Svetinovic, Towards blockchain-based reputationaware federated learning, in IEEE INFOCOM 2020 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Toronto, ON, Canada (2020), pp. 183–188. https://doi.org/10.1109/INFOCOMWKSHPS50562.2020.9163027 26. K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H.B. McMahan, S. Patel, D. Ramage, A. Segal, K. Seth, Practical secure aggregation for privacy-preserving machine learning, in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (2017), pp. 1175–1191 27. Q. Yang, Y. Liu, T. Chen, Y. Tong, Federated machine learning: concept and applications. ACM Trans. Intell. Syst. Technol. (TIST) 10(2), 1–19 (2019) 28. Y. Zou, S. Feng, D. Niyato, Y. Jiao, S. Gong, W. Cheng, Mobile device training strategies in federated learning: an evolutionary game approach, in 2019 International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData) (IEEE, 2019), pp. 874–879 29. S. Feng, D. Niyato, P. Wang, D.I. Kim, Y.C. Liang, Joint service pricing and cooperative relay communication for federated learning, in 2019 International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData) (IEEE, 2019), pp. 815–820 30. J. Kang, Z. Xiong, D. Niyato, H. Yu, Y.C. Liang, D.I. Kim, Incentive design for efficient federated learning in mobile networks: a contract theory approach, in 2019 IEEE VTS Asia Pacific Wireless Communications Symposium (APWCS) (IEEE, 2019), pp. 1–5 31. Y. Zou, S. Feng, J. Xu, S. Gong, D. Niyato, W. Cheng, Dynamic games in federated learning training service market, in 2019 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM) (IEEE, 2019), pp. 1–6 32. T.T. Anh, N.C. Luong, D. Niyato, D.I. Kim, L.C. Wang, Efficient training management for mobile crowd-machine learning: a deep reinforcement learning approach. IEEE Wireless Commun. Lett. 8(5), 1345–1348 (2019) 33. J. Kang, Z. Xiong, D. Niyato, S. Xie, J. Zhang, Incentive mechanism for reliable federated learning: a joint optimization approach to combining reputation and contract theory. IEEE Internet Things J. 6(6), 10700–10714 (2019) 34. H. Yu, Z. Liu, Y. Liu, T. Chen, M. Cong, X. Weng, D. Niyato, Q. Yang, A fairness-aware incentive scheme for federated learning, in Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (2020), pp. 393–399 35. J. Kang, Z. Xiong, D. Niyato, Y. Zou, Y. Zhang, M. Guizani, Reliable federated learning for mobile networks. IEEE Wireless Commun. 27(2), 72–80 (2020) 36. H. Yu, Z. Liu, Y. Liu, T. Chen, M. Cong, X. 
Weng, D. Niyato, Q. Yang, A sustainable incentive scheme for federated learning. IEEE Intell. Syst. (2020) 37. W.Y.B. Lim, N.C. Luong, D.T. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang, D. Niyato, C. Miao, Federated learning in mobile edge networks: a comprehensive survey. IEEE Commun. Surv. & Tutor. (2020) 38. Y. Liu, J.Q. James, J. Kang, D. Niyato, S. Zhang, Privacy-preserving traffic flow prediction: a federated learning approach. IEEE Internet Things J. (2020)
1 Federated Learning Research: Trends and Bibliometric Analysis
19
39. Y. Liu, J. Peng, J. Kang, A.M. Iliyasu, D. Niyato, A.A.A. El-Latif, A Secure Federated Learning Framework for 5G Networks (2020). arXiv:2005.05752 40. S. Duan, D. Zhang, Y. Wang, L. Li, Y. Zhang, JointRec: a deep learning-based joint cloud video recommendation framework for mobile IoTs. IEEE Internet Things J. 1 (2019). https:// doi.org/10.1109/jiot.2019.2944889 41. A. Feraudo, P. Yadav, V. Safronov, D.A. Popescu, R. Mortier, S. Wang, P. Bellavista, J. Crowcroft, CoLearn: enabling federated learning in MUD-compliant IoT edge networks, in Proceedings of the Third ACM International Workshop on Edge Systems, Analytics and Networking (2020), pp. 25–30 42. Y. Lu, X. Huang, Y. Dai, S. Maharjan, Y. Zhang, Blockchain and federated learning for privacypreserved data sharing in industrial IoT. IEEE Trans. Ind. Inf. 16(6), 4177–4186 (2019) 43. X. Wu, Z. Liang, J. Wang, Fedmed: a federated learning framework for language modeling. Sensors 20(14), 4048 (2020) 44. W.A.N.G. Luping, W.A.N.G. Wei, L.I. Bo, Cmfl: mitigating communication overhead for federated learning, in 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS) (IEEE, 2019), pp. 954–964 45. K. Chandiramani, D. Garg, N. Maheswari, Performance analysis of distributed and federated learning models on private data. Procedia Comput. Sci. 165, 349–355 (2019) 46. N. Balachandar, K. Chang, J. Kalpathy-Cramer, D.L. Rubin, Accounting for data variability in multi-institutional distributed deep learning for medical imaging. J. Amer. Med. Inf. Ass. 27(5), 700–708 (2020) 47. R. Doku, D.B. Rawat, C. Liu, Towards federated learning approach to determine data relevance in big data, in 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI) (IEEE, 2019), pp. 184–192 48. R. Zhao, Y. Yin, Y. Shi, Z. Xue, Intelligent intrusion detection based on federated learning aided long short-term memory. Phys. Commun. 42 (2020) 49. J. Schneible, A. Lu, Anomaly detection on the edge, in MILCOM 2017-2017 IEEE Military Communications Conference (MILCOM) (IEEE, 2017), pp. 678–682 50. T. Li, A.K. Sahu, A. Talwalkar, V. Smith, Federated learning: challenges, methods, and future directions. IEEE Signal Proc. Mag. 37(3), 50–60 (2020) 51. S.K. Lo, Q. Lu, C. Wang, H. Paik, L. Zhu, A systematic literature review on federated machine learning: from a software engineering perspective (2020). arXiv:2007.11354 52. V. Mothukuri, R.M. Parizi, S. Pouriyeh, Y. Huang, A. Dehghantanha, G. Srivastava, A survey on security and privacy of federated learning. Future Gener. Comput. Syst. 115, 619–640 (2020)
Chapter 2
A Review of Privacy-Preserving Federated Learning for the Internet-of-Things Christopher Briggs, Zhong Fan, and Peter Andras
Abstract The Internet-of-Things (IoT) generates vast quantities of data. Much of this data is attributable to human activities and behavior. Collecting personal data and executing machine learning tasks on this data in a central location presents a significant privacy risk to individuals as well as challenges with communicating this data to the cloud (e.g. where data is particularly large or updated very frequently). Analytics based on machine learning and in particular deep learning benefit greatly from large amounts of data to develop high-performance predictive models. This work reviews federated learning (FL) as an approach for performing machine learning on distributed data to protect the privacy of user-generated data. We highlight pertinent challenges in an IoT context such as reducing communication costs associated with data transmission, learning from data under heterogeneous conditions, and applying additional privacy protections to FL. Throughout this review, we identify the strengths and weaknesses of different methods applied to FL, and finally, we outline future directions for privacy-preserving FL research, particularly focusing on IoT applications. Keywords Distributed machine learning · Privacy · Federated learning · Internet of things · Heterogeneity
2.1 Introduction The Internet-of-Things (IoT) is represented by network-connected machines, often small embedded computers that provide physical objects with digital capabilities such as identification, inventory tracking, and sensing and actuator control. Smartphones
and other mobile communication devices are an example of particularly complex IoT devices that generate a significant volume of data collected from sensors such as accelerometers, GPS chips, and cameras. The applications that drive analytical insights in the IoT are often powered by machine learning and deep learning. Gartner [1] predicts that 25 billion IoT devices will be in use by 2021, forecasting a bright future for IoT applications. However, this poses a challenge for traditional cloud-based IoT computing. The volume, velocity, and variety of data streaming from billions of these devices require vast amounts of bandwidth which can become extremely cost-prohibitive. Additionally, many IoT applications require very low-latency or near real-time analytics and decision-making capabilities. The round-trip delay from devices to the cloud and back again is unacceptable for such applications. Finally, transmitting sensitive data collected by IoT devices to the cloud poses security and privacy concerns [2]. Edge computing, and more recently, fog computing [3] have been proposed as a solution to these problems. Edge computing (and its variants: mobile edge computing, multi-access edge computing) restricts analytics processing to devices at the edge of the network [4]. However, storage and compute power may be severely limited and coordination between multiple devices may be non-existent in the edge computing paradigm. Fog computing [3] offers an alternative to cloud computing or edge computing alone for many analytics tasks but significantly increases the complexity of an IoT network. Fog computing is generally described as a continuum of computing, storage, and networking capabilities to power applications and services in one or more tiers that bridge the gap between the cloud and the edge [5, 6]. Fog computing enables highly scalable, low-latency, geo-distributed applications, supporting location awareness and mobility [7]. Despite rising interest in fog-based computing, much research is still focused on the deployment of analytics applications (including deep learning applications) directly to edge devices. Performing computationally expensive tasks such as training deep learning models on edge devices poses a challenge due to limited energy budgets and compute capabilities [8]. In cloud environments, massively powerful and scalable servers making use of parallelization are typically employed for deep learning tasks [9]. In edge computing environments, alternative methods for distributing training are required. Additionally, as limited bandwidth is a key constraint in computing near/at the edge, the challenge of reducing network data transfer is also important. Federated learning (FL) [10] has been proposed as a method for distributed machine learning suited to edge computing environments that addresses many of the issues discussed above, namely compute power, data transfer, and privacy preservation. This review provides a comprehensive survey of privacy-preserving FL. We show how FL is ideally suited for data analytics in the IoT and review research addressing privacy concerns [11], bandwidth limitations [12], and power/compute limitations [13].
2.2 Distributed Machine Learning FL was preceded by much work in distributed machine learning in the data-center [9, 14, 15]. This section gives a brief history of distributed machine learning, paying particular attention to distributed deep learning training via stochastic gradient descent (SGD). Deep learning is concerned with machine learning problems based on artificial neural networks comprised of many layers and has been used with great success in the fields of computer vision, speech recognition, and translation, as well as many other areas [16]. In these fields, most other machine learning methods have been surpassed by deep learning methods due to the very complex functions they can compute, which can both approximate training labels and generalize well to unseen samples. Deep neural networks (DNNs) are composed of multiple connected units (also known as neurons) organized into layers through which the training data flow [17]. Each unit computes a weighted sum of its input values (including a bias term) composed with a non-linear activation function g(WX + b) and returns the result to the next connected layer. Passing data through the network and performing a prediction is known as the forward pass. To train the network, a backward pass operation is specified to compute updates to the weights and biases to better approximate the labels associated with the training data. An algorithm known as backpropagation [18] is used to propagate the error back through each layer of the network by calculating gradients of the weights and biases with respect to the error. DNNs perform best when trained on very large datasets and often incorporate millions if not billions of parameters to express weights between neurons (for example, the AlexNet DNN achieved state-of-the-art performance on the ImageNet dataset in 2012 using 60 million parameters [19]). Both of these factors require large amounts of memory and compute capability. Scaling complex DNNs trained on large amounts of data requires concurrency across multiple CPUs or, more commonly, GPUs (most often in a local cluster). GPUs are optimized to perform matrix calculations and are well suited for the operations required to compute activations across a DNN. Concurrency can be achieved in a variety of ways as discussed below.
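As a minimal, hypothetical illustration of the forward and backward passes described above (not code from the chapter), the following NumPy sketch computes g(WX + b) for a single sigmoid unit and applies one gradient-descent update to its weights and bias:

```python
import numpy as np

def forward(X, W, b):
    # Forward pass for one layer: weighted sum of inputs plus bias, passed through a
    # non-linear activation g (here the sigmoid), i.e. g(XW + b).
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def sgd_step(X, y, W, b, lr=0.1):
    # Backward pass: gradients of a squared-error loss with respect to W and b are
    # computed via the chain rule (backpropagation through the single layer), then
    # the parameters are moved a small step against the gradient.
    y_hat = forward(X, W, b)
    grad_z = (y_hat - y) * y_hat * (1.0 - y_hat)   # dL/dz through the sigmoid
    grad_W = X.T @ grad_z / len(X)
    grad_b = grad_z.mean(axis=0)
    return W - lr * grad_W, b - lr * grad_b

# Toy usage: fit a single unit to a noisy thresholded linear function.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
y = (X @ np.array([[1.0], [-2.0], [0.5]]) > 0).astype(float)
W, b = rng.normal(size=(3, 1)), np.zeros(1)
for _ in range(200):
    W, b = sgd_step(X, y, W, b)
```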
2.2.1 Concurrency To train a large DNN efficiently across multiple nodes, the calculations required in the forward and backward passes need to be parallelized. One method to achieve this is model parallelism, which distributes collections of neurons among the available compute nodes [14]. Each node then only needs to compute the activations of its own neurons, but must communicate regularly with nodes computing on connected neurons. The calculations on all nodes must occur synchronously and therefore computation proceeds at the speed of the slowest node in the cluster. Another drawback of the model parallelism approach is that the current mini-batch must be copied to
all nodes in the compute cluster, further increasing communication costs within the cluster. A second method resolves some of the issues of excessive communication between nodes by distributing one or more layers on each node. This ensures that each worker node only needs to communicate with one other node (a different node depending on whether the computation is part of the forward pass or the backward pass) [9]. However, this method still requires that data in the mini-batch be copied to all nodes in the cluster. The final method to achieve parallelism in training a large DNN is termed data parallelism. This method partitions the training dataset and copies the subsets to each compute node in the cluster. Each node computes forward and backward passes over the same model but using mini-batches drawn from its subset of the training data. The results of the weight updates are then reduced on each iteration via MapReduce or, more commonly today, via the message passing interface (MPI) [9]. Data parallelism is particularly effective as most operations over mini-batches in SGD are independent. Therefore, scaling the problem via sharding the data to many nodes is relatively simple compared to the methods mentioned above. This method solves the issue of training with large amounts of data but requires that the model (and its parameters) fit in memory on each node. Hybrid parallelism combines two or all three of the concurrency schemes mentioned above to mitigate the drawbacks associated with each and best support parallelism on the underlying hardware. DistBelief [14] achieves this by distributing the data, network layers, and neurons within the same layer among the available compute nodes, making use of all three concurrency schemes. Similarly, Project Adam [20] employs all three concurrency schemes but much more efficiently than DistBelief (using significantly fewer nodes to achieve high accuracy on the ImageNet 22k dataset, http://www.image-net.org/).
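The data-parallel scheme is easy to simulate on a single machine. The sketch below (an illustrative assumption, not the DistBelief or Project Adam implementation) splits a least-squares problem into four shards, lets each simulated worker compute a gradient on its own shard, and averages the gradients before updating the shared parameters, mirroring the reduce step described above:

```python
import numpy as np

def local_gradient(w, X_shard, y_shard):
    # Gradient of the least-squares loss computed by one worker on its own shard.
    return 2.0 * X_shard.T @ (X_shard @ w - y_shard) / len(X_shard)

def data_parallel_step(w, shards, lr=0.05):
    # In a real cluster each worker computes its gradient concurrently; here the
    # gradients are collected and averaged (the all-reduce step) before one update.
    grads = [local_gradient(w, X, y) for X, y in shards]
    return w - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, 2.0, -1.0, 0.5, 3.0]) + 0.01 * rng.normal(size=1000)
shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))  # four "workers"
w = np.zeros(5)
for _ in range(300):
    w = data_parallel_step(w, shards)
print(np.round(w, 2))  # approaches the true coefficients
```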
2.2.2 Model Consistency Model consistency refers to the state of a model when trained in a distributed manner [9]—a consistent model should reflect the same parameter values among compute nodes before each training iteration (or set of training iterations, sometimes referred to as a communication round). To maintain model consistency, individual compute nodes need to write updates to a global parameter server [15]. The parameter server performs some form of aggregation on the updates to synchronize a global model and the parameters (for example, weights in a neural network) are then shared with the individual compute nodes for the next iteration/round of training. There are several broad methods by which to train, update, and share a distributed deep learning model. Synchronous updates occur when the parameter server waits for all compute nodes to return parameters for aggregation. This method provides
high consistency between iterations/rounds of training as each node always receives up-to-date parameters but is not hardware performant due to the delays caused by the slowest communicating node. For example, a set of parameters $w_t$ (which is simply a vector representation of the weights for a given model) at time $t$ is shared among $n_c$ compute nodes. The compute nodes each perform some number of forward and backward passes over the data available to them and compute the parameter gradients $\Delta w_c$. These gradients are communicated to the parameter server, which in turn averages the gradients from all workers and then updates the parameters for time $t + 1$:
$$\Delta w_t = \frac{1}{n_c} \sum_{c=1}^{n_c} \Delta w_c, \qquad w_{t+1} = w_t - \eta \, \Delta w_t. \qquad (2.1)$$
Asynchronous updates occur when the parameter server shares the latest parameters without waiting for all nodes to return parameter updates. This reduces model consistency as parameters can be overwritten and become stale due to slow communicating nodes. This method is hardware performant, however, as optimization can proceed without waiting for all nodes to send parameter updates. The HOGWILD! algorithm [21] takes advantage of sparsity within the parameter update matrix to asynchronously update gradients in shared memory, resulting in faster convergence. Downpour SGD [14] describes asynchronous updates as an additional mechanism to add stochasticity to the optimization process, resulting in greater prediction performance. In order to improve consistency using hardware performant asynchronous updates, the concept of parameter 'staleness' has been tackled by several works. The stale synchronous parallel (SSP) model [22] synchronises the global model once a maximum staleness threshold has been reached but still allows workers to compute updates on stale values between global model syncs. The impact of staleness in asynchronous SGD can also be mitigated by adapting the learning rate as a function of the parameter staleness [23, 24]. As an example, a worker pushes an update computed at $t = j$ to the parameter server at $t = i$. The parameters in the global model at $t = i$ are the most up-to-date available. To limit the impact of a stale parameter update, a staleness parameter $\tau_k$ for the $k$th parameter is calculated as $\tau_k = i - j$. The learning rate used in Eq. 2.1 is modified as:
$$\eta_k = \begin{cases} \eta / \tau_k & \text{if } \tau_k \neq 0 \\ \eta & \text{otherwise.} \end{cases} \qquad (2.2)$$
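A small, self-contained illustration of the staleness-adjusted learning rate in Eq. 2.2 (an assumed helper function, not code from [23, 24]):

```python
def staleness_adjusted_lr(eta, update_step, server_step):
    # Eq. 2.2: the staleness tau_k = i - j is the gap between the server clock i and
    # the step j at which the worker read the parameters; stale updates are applied
    # with a proportionally smaller learning rate.
    tau = server_step - update_step
    return eta / tau if tau != 0 else eta

# An update computed 4 steps ago is applied at a quarter of the base rate.
assert staleness_adjusted_lr(0.1, update_step=6, server_step=10) == 0.1 / 4
assert staleness_adjusted_lr(0.1, update_step=10, server_step=10) == 0.1
```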
2.2.3 Centralized Versus Decentralized Learning Centralized distribution of the model updates requires a parameter server (which may be a single machine or sharded across multiple machines as in [14]). The global model tracks the averaged parameters aggregated from all the compute nodes that perform training (see Eq. 2.1). The downside to this distribution method is the high communication cost between compute nodes and the parameter server. Multiple shards can relieve this bottleneck to some extent, such that different workers read and write parameter updates to specific shards [14, 20]. Heterogeneity of worker resources is handled well in centralized distribution models. Distributed compute nodes introduce varying amounts of latency (especially when distributed geographically as in [25]), yet training can proceed via asynchronous, or more efficiently, stale-synchronous methods [26]. Heterogeneity is an inherent feature of FL. Decentralized distribution of DNN training does not rely on a parameter server to aggregate updates from workers but instead allows workers to communicate with one another, resulting in each worker performing aggregation on the parameters it receives. Gossip algorithms that share updates between a fixed number of neighboring nodes have been applied to distributed SGD [27–29] to efficiently communicate/aggregate updates between all nodes in an exponential fashion, similar to how disease spreads during an epidemic. Communication can be avoided completely during training, resulting in many individual models represented by very different parameters. These models can be combined (as an ensemble [16]); however, averaging the predictions from many models can slow down inference on new data. To tackle this, a process known as knowledge distillation can be used to train a single DNN (known as the mimic network) to emulate the predictions of an ensemble model [30–32]. Unlabelled data is passed through the ensemble network to obtain labels on which the mimic network can be trained.
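As a rough sketch of the gossip-style decentralized averaging mentioned above (illustrative only; the fan-out and number of rounds are arbitrary choices, not values from [27–29]):

```python
import numpy as np

def gossip_round(params, rng, fanout=2):
    # One gossip round: every node picks `fanout` random neighbours and replaces its
    # parameters with the mean of its own and the neighbours' parameters. Repeated
    # rounds spread information through the network without any central server.
    n = len(params)
    new_params = []
    for i in range(n):
        neighbours = rng.choice([j for j in range(n) if j != i], size=fanout, replace=False)
        group = [params[i]] + [params[j] for j in neighbours]
        new_params.append(np.mean(group, axis=0))
    return new_params

rng = np.random.default_rng(2)
params = [rng.normal(size=4) for _ in range(8)]   # 8 nodes with divergent models
for _ in range(10):
    params = gossip_round(params, rng)
# After a few rounds every node holds nearly the same (averaged) parameters.
print(np.std(params, axis=0))
```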
2.3 Federated Learning 2.3.1 Overview FL extends the idea of distributed machine learning, making use of data parallelism. However, rather than randomly partitioning a centralized dataset to many compute nodes, training occurs in the user domain on distributed data owned by the individual users (often referred to as clients) [10]. The consequence of this is that user data is never shared directly with a third party orchestrating the training procedure. This greatly benefits users where the data might be considered sensitive. Where data needs to be observed (for example, during the training operation), the processing is handled on the device where the data resides (for example a smartphone). Once a round of
training is completed on the device, the model parameters are communicated to an aggregating server, such as a parameter server provided by a third party. Although the training data itself is never disclosed to the third party, it is a reasonable concern that something about an individual's training data might be inferred from the parameter updates; this is discussed further in Sect. 2.5. FL is vastly more distributed than traditional approaches for training machine learning models via data parallelism. Some of the key differences are [10]:
1. Many contributing clients—FL needs to be scalable to many millions of clients.
2. Varying quantity of data owned by each user—some clients may train on only a few samples; others may have thousands.
3. Often very different data distributions between users—user data is highly personal to individuals and therefore the model trained by each client represents non-IID (not independent and identically distributed) data.
4. High latency between clients and aggregating service—updates are commonly communicated via the internet, introducing significant latency between communication rounds.
5. Unstable communication between clients and aggregating service—client devices are likely to become unavailable during training due to their mobility, battery life, or other reasons.
These distinguishing features of FL pose challenges above and beyond standard distributed learning. Although this review focuses on deep learning in particular, many other ML algorithms can be trained via FL. Any ML algorithm designed to minimise an objective function of the form
$$\min_{w \in \mathbb{R}^d} \frac{1}{m} \sum_{i=1}^{m} f_i(w) \qquad (2.3)$$
is well suited to training via many clients (for example, linear regression and logistic regression). Some non-gradient-based algorithms can also be trained in this way, such as principal component analysis and k-means clustering [33]. Federated optimization was first suggested as a new setting for vastly and unevenly distributed machine learning by Konečný et al. [34] in 2016. In their work, the authors first describe the nature of the federated setting (non-IID data, the varying quantity of data per client, etc.). Additionally, the authors test a simple application of distributed gradient descent against a federated modification of SVRG (a variance-reducing variant of SGD [35]) over distributed data. Federated SVRG calculates gradients and performs parameter updates on each of the K nodes over the available data on each node and obtains a weighted average of the parameters from all clients. The performance of these algorithms is verified on a logistic regression language model using Google+ data to determine whether a post will receive at least one comment. As logistic regression is a convex problem, the algorithms can be benchmarked against a known optimum. Federated SVRG is shown to outperform gradient descent by converging to the optimum within 30 rounds of communication.
Algorithm 2.1 Federated Averaging (FedAvg) algorithm [10]. C is the fraction of clients selected to participate in each communication round. The K clients are indexed by k; B is the local mini-batch size, P_k is the dataset available to client k, E is the number of local epochs, and η is the learning rate.

Run on server:
  procedure FedAvg
    initialise w_0
    for each round t = 1, 2, ... do
      m ← max(C · K, 1)
      S_t ← (random set of m clients)
      for each client k ∈ S_t in parallel do
        w_{t+1}^k ← ClientUpdate(k, w_t)
      end for
      w_{t+1} ← Σ_{k=1}^{K} (n_k / n) · w_{t+1}^k
    end for
  end procedure

Run on client k:
  procedure ClientUpdate(k, w)
    B ← (split P_k into mini-batches of size B)
    for each local epoch i from 1 to E do
      for batch b ∈ B do
        w ← w − η ∇L(w; b)
      end for
    end for
    return w to server
  end procedure
FL (as described in [10]) simplifies the federated SVRG approach in [34] by modifying SGD for the federated setting. McMahan et al. [10] provide two distributed SGD scenarios for their experiments: FedSGD and FedAvg. FedSGD performs a single step of gradient descent on all the clients and averages the gradients on the server. The FedAvg algorithm (shown in Algorithm 2.1) randomly selects a fraction of the clients to participate in each round of training. Each client k computes the gradients on the current state of the global model $w_t$ and updates the parameters $w_{t+1}^k$ in the standard fashion in gradient descent:
$$\forall k, \quad w_{t+1}^k \leftarrow w_t - \eta \nabla f(w_t). \qquad (2.4)$$
All clients communicate their updates to the aggregating server, which then calculates a weighted average of the contributions from each client to update the global model:
$$w_{t+1} \leftarrow \sum_{k=1}^{K} \frac{n_k}{n} w_{t+1}^k. \qquad (2.5)$$
Here, $n_k/n$ is the fraction of data available to the client compared to the data available to all participating clients. Clients can perform one or multiple steps of gradient
descent before sending weight updates as orchestrated by the federated algorithm. A diagram describing how FL proceeds in the FedAvg scenario is provided in Fig. 2.1.
Fig. 2.1 Schematic diagram showing how communication proceeds between the aggregating server and individual clients according to the FedAvg protocol. This procedure is iterated until the model converges or the model reaches some desired target metric (e.g. elapsed time, accuracy)
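To make the FedAvg procedure concrete, the following is a minimal single-machine simulation using plain NumPy and a least-squares model. The client sampling fraction C, local epoch count E and the weighting by n_k/n follow Algorithm 2.1 and Eqs. 2.4–2.5, but the function names and toy dataset are illustrative assumptions rather than code from [10]:

```python
import numpy as np

def client_update(w, X, y, epochs=5, batch_size=10, lr=0.01):
    # ClientUpdate(k, w): E local epochs of mini-batch gradient descent on the
    # client's own data, starting from the current global parameters w.
    w = w.copy()
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w

def fedavg(clients, rounds=50, C=0.5):
    # Server loop: each round, sample a fraction C of the K clients, collect their
    # locally updated parameters and combine them weighted by n_k / n (Eq. 2.5).
    dim = clients[0][0].shape[1]
    w = np.zeros(dim)
    K = len(clients)
    for _ in range(rounds):
        selected = np.random.choice(K, size=max(int(C * K), 1), replace=False)
        updates, sizes = [], []
        for k in selected:
            X, y = clients[k]
            updates.append(client_update(w, X, y))
            sizes.append(len(X))
        weights = np.array(sizes) / sum(sizes)
        w = sum(wk * u for wk, u in zip(weights, updates))
    return w

# Toy federation: 10 clients holding differently sized shards of the same regression task.
rng = np.random.default_rng(3)
true_w = np.array([1.0, -2.0, 0.5])
clients = []
for _ in range(10):
    n_k = int(rng.integers(20, 200))
    X = rng.normal(size=(n_k, 3))
    clients.append((X, X @ true_w + 0.1 * rng.normal(size=n_k)))
print(np.round(fedavg(clients), 2))   # approaches true_w
```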
2.3.2 Specific Challenges for FL in IoT Context Some of the major research avenues in FL that are pertinent in the IoT context are discussed in this sub-section. These include understanding how FL behaves when data distributions are skewed among heterogeneous clients and how to improve performance under such conditions. We highlight research demonstrating attacks on the FL protocol and methods to mitigate these attacks. Finally, we explore methods for vastly decreasing communication in the FL protocol, which is vitally necessary in the IoT context (Table 2.1).
2.3.2.1 FL in the Presence of Heterogeneous Clients
Centralized machine learning (and distributed learning in the data center) benefits from training under the assumption that data can be shuffled and is independent and identically distributed (IID). This assumption is generally invalid in FL (particularly so in the IoT) as the training data is decentralized, with significantly different distributions and numbers of samples between participating clients. Training using non-IID data has been shown to converge much more slowly than IID data in a FL setting using the MNIST dataset (for handwritten digit recognition, http://yann.lecun.com/exdb/mnist/), distributed between clients after having been sorted by the target label [10].
Table 2.1 A summary of important contributions to FL research
Ref | Research focus | Year | Major contribution
[34] | Optimization | 2016 | First description of federated optimisation and its application to a convex problem (logistic regression)
[10] | Optimization | 2016 | Description of federated averaging (FedAvg) algorithm to improve the performance of the global model and reduce communication between the clients and server
[36] | Communication | 2016 | Methods for compressing weight updates and reducing the bandwidth required to perform FL
[37] | Multi-task FL | 2017 | Application of multi-task learning in a federated setting and discussion of system challenges relevant to using FL on resource-constrained devices
[38] | FL attacks | 2018 | A demonstration of poisoning the shared global model in a FL setting
[39] | FL attacks | 2018 | A method to recognise adversarial clients and combat model poisoning in a FL setting
[40] | Application | 2018 | Application of FL in a commercial setting (next word keyboard prediction in Android Gboard)
[41] | Optimization | 2018 | Application of per-coordinate averaging (based on Adam) to FL to achieve faster convergence (in fewer communication rounds)
[42] | Application | 2018 | Applied FL to a healthcare application including further training after FL on client data (transfer learning)
[43] | Client selection | 2018 | A method of FL selecting clients with faster communication/greater resources to participate in each communication round, achieving faster convergence
[13] | Communication | 2018 | Description of adaptive FL method suitable for deployment on resource-constrained devices to optimally learn a shared model while maintaining a fixed energy budget
[44] | Non-IID | 2018 | Characterisation of how non-IID data reduces the model performance of FL and method for improving model performance
[45] | Optimization | 2018 | Adds a tunable regularising term to FedAvg to stabilise training on skewed, non-IID data, limiting the influence of client models on the global model
[46] | Multi-task FL | 2019 | Training pluralistic models that are tailored to subsets of clients that belong to the same timezones
[47] | Optimization | 2020 | Applies a variance reduction method for improving convergence speed on non-IID data compared to FedAvg
The overall accuracy achieved by a DNN trained via FL can be significantly reduced when trained on highly skewed non-IID data [44]. Zhao et al. [44] show that accuracy can be improved by sharing a small subset of non-private data between all the clients to reduce the variance between weight updates of the clients involved in each communication round. The FedProx algorithm [45] encompasses FedAvg as a special case and adds a regularising term to the local optimization objective. This has the effect of limiting the distance between the local model and global model during each communication round and stabilizes training overall. Karimireddy et al. [47] take a similar approach with SCAFFOLD by accounting for client drift (the estimated difference between the global and local model directions) and correcting for this in the model update step. This can be understood as a variance reduction method and significantly outperforms FedAvg by reducing the number of rounds of communication and improving the final model accuracy on highly skewed non-IID data. A different approach for federated optimization over many nodes is proposed by Smith et al. [37] under strongly non-IID assumptions. In this work, each client's data distribution is modeled as a single task as part of a multi-task learning objective. In multi-task learning, all tasks are assumed to be similar and therefore each task can benefit from learning derived from all the other tasks. On three different problems (based on human activity recognition and computer vision), the federated multi-task learning setting outperforms a centralized global setting and a strictly localized setting, with lower average prediction errors. As part of this work [37], the authors also show that federated multi-task learning is robust to nodes temporarily dropping out during learning and when communication is reduced (by simulating more iterations on the client per communication round). Eichner et al. [46] propose a pluralistic approach to tackle the issue of training only when devices are available (generally overnight for mobile phones). Multiple models are trained according to the timezone in which the device is available, resulting in better language models targeted at each timezone. To specifically tackle the issue of model degradation due to the presence of non-IID data [44], Sattler et al. [48] propose splitting the shared model by determining the cosine similarity of updates from different clients during training. Similarly, Briggs et al. [49] use a hierarchical clustering algorithm to judge client update similarity and produce models tailored to clients with similarly-distributed data.
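As a simplified reading of how FedProx modifies the local objective (a sketch under assumed names and a hypothetical penalty strength mu, not the authors' implementation of [45]):

```python
import numpy as np

def fedprox_local_step(w_local, w_global, X, y, mu=0.1, lr=0.01):
    # FedProx-style local objective: f_k(w) + (mu / 2) * ||w - w_global||^2.
    # The proximal gradient term mu * (w_local - w_global) pulls the client model
    # back towards the current global model, limiting client drift on non-IID data.
    grad_loss = 2.0 * X.T @ (X @ w_local - y) / len(X)
    grad_prox = mu * (w_local - w_global)
    return w_local - lr * (grad_loss + grad_prox)
```

Setting mu = 0 recovers the plain FedAvg local step.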
2.3.2.2 FL Attacks
The IoT presents a large attack surface due to the many system components involved. Moreover, due to the nature of distributed client participation required for FL, the protocol is susceptible to adversarial attacks. Multiple works [38, 50] present methods for poisoning the global model with an adversary acting as a client in the FL setting. The adversary constructs an update such that it survives the averaging procedure and heavily influences or replaces the global model. In this way, an adversary can poison the model to return predictions specified by the attacker given certain input features. Fung et al. [39] describe a method to defend against Sybil-based adversarial attacks
by measuring the similarity between client contributions during model averaging and filtering the attackers' updates out. These kinds of attacks might be inadvertently mitigated by some of the modifications to FedAvg outlined above (for example, FedProx [45] or SCAFFOLD [47]), which limit the effect of individual client updates.
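A rough illustration of similarity-based filtering of client updates (an assumed threshold and simplified logic, not the defence from [39] itself): colluding Sybil clients tend to submit near-identical update directions, which honest clients with different data rarely do.

```python
import numpy as np

def filter_suspicious_updates(updates, threshold=0.95):
    # Normalise each flattened client update and compute pairwise cosine similarity;
    # drop any client whose update is almost colinear with another client's update.
    U = np.array([u / (np.linalg.norm(u) + 1e-12) for u in updates])
    sim = U @ U.T
    np.fill_diagonal(sim, 0.0)
    return [i for i in range(len(updates)) if sim[i].max() < threshold]

updates = [np.random.default_rng(i).normal(size=100) for i in range(8)]
updates.append(updates[0] * 1.01)            # a cloned, Sybil-like update
print(filter_suspicious_updates(updates))    # indices 1..7 survive; the near-duplicates are dropped
```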
2.3.2.3 Communication-Efficient FL
As highlighted above, distributed learning and FL, in particular, suffer from high latency between communication rounds. Additionally, given a sufficiently large DNN, the number of parameters that need to be communicated in each communication round from possibly many thousands of clients becomes problematic concerning data transmission and traffic arriving at the aggregating server. There are several approaches to mitigating these issues as discussed in this subsection. In an IoT context where data transmission is likely to be costly, communication-efficient algorithms for performing FL are of the utmost importance. The simplest method to reduce bandwidth use in communicating model updates is simply to communicate less often. In [10], researchers experimented with the mini-batch size and number of epochs while training a convolutional neural network (CNN) on the MNIST dataset as well as an LSTM trained on the complete works of William Shakespeare3 to predict the next text character after some input of characters. Using FedAvg, the authors [10] showed that increasing computation on the client between communication rounds significantly reduced the number of communication rounds required to converge to a threshold test accuracy compared to a single epoch trained on all available data on the client (a single iteration of gradient descent). The greatest reduction in communication rounds was achieved using a mini-batch size of 10 and 20 epochs on the client using the CNN model (34.8x) and a mini-batch size of 10 and 5 epochs using the LSTM model (95.3x). As this method eliminates many of the communication rounds, it should be preferred over (or combined with) the compression methods discussed next. As the network connection used to communicate between the clients and the aggregating server is generally asymmetric (download speed is generally significantly faster than upload speed), downloading the updated model from the aggregating server is less of a concern than uploading the updates from the clients. Despite this, compression methods exist to reduce the size of deep learning models themselves [51, 52]. The compression of the parameter updates on the client before transmission is important to reduce the size of the overall update but should still maintain a stable statistical mean when updates from all clients are averaged in the global shared model. The following compression methods are not specific to FL but have been experimented with in several works related to FL.
3 https://www.gutenberg.org/ebooks/100.
2 A Review of Privacy-Preserving Federated Learning …
33
The individual weights that represent the parameters of a DNN are generally encoded using a 32-bit floating-point number. Multiple works explore the effect of lossy compression of the weights (or gradients) to 16-bit [53], 8-bit [54], or even 1-bit [55], employing stochastic rounding to maintain the expected value. The results of these experiments show that as long as the quantization error is carried forward between mini-batch computations, the overall accuracy of the model isn't significantly impacted. Another method to compress the weight matrix itself is to convert the matrix from a dense representation to a sparse one. This can be achieved by applying a random mask to the matrix and only communicating the resulting non-zero values along with the seed used to generate the random mask [36]. Using this approach combined with FedAvg on the CIFAR-10 image recognition dataset (https://www.cs.toronto.edu/~kriz/cifar.html), it has been shown [36] that neither the rate of convergence nor the overall test accuracy is significantly impacted, even when only 6.25% of the weights are transmitted during each communication round. Also described in [36] is a matrix factorization method whereby the weight matrix is approximated by the product of a randomly generated matrix, A, and another matrix optimized during training, B. Only the matrix B (plus the random seed to generate A) needs to be transmitted in each communication round. The authors show, however, that this method performs significantly worse than the random mask method as the compression ratio is increased. Shokri and Shmatikov [12] propose an alternative sparsification method implemented in their "Selective SGD" (SSGD) procedure. This method transfers only a fraction of randomly selected weights to each client from the global shared model and only shares a fraction of updated weights back to the aggregating service. The updated weights selected to be communicated are determined by either weight size (largest unsigned magnitude updates) or a random subset of values above a certain threshold. The authors [12] show that a CNN trained on the MNIST and Street View House Numbers (SVHN, http://ufldl.stanford.edu/housenumbers/) datasets can achieve similar levels of accuracy sharing only 10% of the updated weights and only a slight drop (1–2%) in accuracy by sharing only 1% of the updated weights. The paper also shows that the greater the number of users participating in SSGD, the greater the overall accuracy. Hardy et al. [56] take a similar approach to selective SGD but select the largest gradients in each layer rather than the whole weight matrix to better reflect changes throughout the DNN. Their algorithm, "AdaComp", also uses an adaptive learning rate per parameter based on the staleness of the parameter to improve overall test accuracy on the MNIST dataset using a CNN. Most recently, Lin et al. [57] apply gradient sparsification along with gradient clipping and momentum correction during training to reduce communication bandwidth by 270x and 600x without a significant drop in prediction performance on various ML problems in computer vision, language modeling, and speech recognition.
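A minimal sketch of the random-mask sparsification idea from [36] (the 6.25% keep fraction matches the figure quoted above; the function names and seed handling are illustrative assumptions):

```python
import numpy as np

def compress_update(update, keep_fraction=0.0625, seed=0):
    # Keep a random subset of the update's entries and transmit only those values
    # together with the seed; the mask itself never needs to be sent.
    rng = np.random.default_rng(seed)
    mask = rng.random(update.shape) < keep_fraction
    return update[mask], seed, update.shape

def decompress_update(values, seed, shape, keep_fraction=0.0625):
    # The receiver regenerates the identical mask from the seed and scatters the
    # received values back into a (sparse) full-sized update.
    rng = np.random.default_rng(seed)
    mask = rng.random(shape) < keep_fraction
    restored = np.zeros(shape)
    restored[mask] = values
    return restored

update = np.random.default_rng(4).normal(size=(256, 128))
values, seed, shape = compress_update(update)
restored = decompress_update(values, seed, shape)   # sparse approximation of `update`
print(values.size / update.size)                    # roughly 0.0625 of the entries sent
```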
Leroy et al. [41] experiment with using moment-based averaging inspired by the Adam optimization procedure, in place of a standard weighted global average in FedAvg. The authors train a CNN to detect a "wake word" (similar to "Hey Siri" to initialize the Siri program on iOS devices). The moment-based averaging method achieves a targeted recall of almost 95% within 100 rounds of communication over 1400 clients, compared to only 30% using global averaging. By selecting clients based on client resource constraints in a mobile edge computing environment, Nishio and Yonetani [43] show that FL can be sped up considerably. As FL proceeds in a synchronous fashion, the slowest communicating node is a limiting factor in the speed at which training can progress. In this work [43], target accuracies on the CIFAR-10 and Fashion-MNIST (https://www.kaggle.com/zalando-research/fashionmnist) datasets are achieved in significantly less time than by using the FedAvg algorithm in [10]. In a similar vein, Wang et al. [13] aim to take into account client resources during FL training. In this work, an algorithm is designed to control the tradeoff between local gradient updates and global averaging to optimally minimize the loss function under a fixed energy budget—an important problem for FL in the IoT (especially for battery-powered devices).
2.3.3 Applied FL Research FL is particularly well suited as a solution for distributed learning in the IoT setting. As such, FL research is flourishing in various applications associated with the IoT. FL has been applied in robotics to aid multiple robots to share imitation learning strategies [58] and more generally for protecting privacy-sensitive robotics tasks [59]. In mobile edge computing environments, FL has been demonstrated for predicting demand in edge deployed applications [60] and for improving proactive edge content caching mechanisms [61]. For vehicular edge computing, Lu et al. [62] propose a framework to tackle issues of intermittent vehicle connectivity and an untrusted aggregating entity and Ye et al. [63] propose a system using FL for intelligent connected vehicle image classification tasks. Energy demand in electric vehicle charging networks has also been addressed with a FL strategy by Saputra et al. [64]. For anomaly detection in IoT environments, FL has been applied to detect intrusions and attacks by Nguyen et al. [65]. More novel applications of FL include learning to detect jamming in drone networks [66], predicting breaks in presence by users of virtual reality environments [67] and human activity recognition using wearable devices [68]. For supervised problems, user data needs to be labeled by the user to be useful for training. This is demonstrated in [40] where a long short-term memory neural network (LSTM) is trained via many clients on words typed on a mobile keyboard to predict the next word. However, this data is highly sensitive and should not be sent to a central server directly and would benefit from training via FL. The training data for this model is automatically labeled when the user types the next word. In cases where data is stored locally and already labeled such as medical health records, privacy is of
great concern, and even sharing of data between hospitals may be prohibited [69]. FL can be applied in these settings to improve a shared global model that is more accurate than a model trained by each hospital separately. In [42], electronic health records from 58 hospitals are used to train a simple neural network in a federated setting to predict patient mortality. The authors found that partially training the network using FL, followed by freezing the first layer and training only on the data available to each client resulted in better performing models for each hospital.
2.4 Privacy Preservation Data collection to learn something about a population (for example in machine learning to discover a function for mapping the data to target labels) can expose sensitive information about individual users. In machine learning, this is often not the primary concern of the developer or researcher creating the model, yet is extremely important for circumstances where personally sensitive data is collected and disseminated in some form (e.g. via a trained model). Privacy has become even more important in the age of big data (data which is characterized by its large volume, variety, and velocity [70]). As businesses gather increasing amounts of data about users, the risk of privacy breaches via controlled data releases grows. This review focuses on the protection of an individual's privacy via controlled data releases (such as from personal data used to train a machine learning model) and does not consider privacy breaches via hacking and theft, which is a separate issue related to data security. Privacy is upheld as a human right in many countries [71]. In Europe, rigorous legislation concerning data protection via the General Data Protection Regulation [72] safeguards data privacy such that users are given the facts about how and what data is collected about them and how it is used and by whom. Despite these rights and regulations, data privacy is difficult to maintain and breaches of privacy via controlled data releases occur often. Privacy can be preserved in several ways, yet it is important to maintain a balance between the level of privacy and utility of the data (along with some consideration for the computational complexity required to preserve privacy). A privacy mechanism augments the original data to prevent a breach of personal privacy (i.e. an individual should not be able to be recognized in the data). For example, a privacy mechanism might use noise to augment the result of a query on the data [73]. Adding too much noise to a result might render it meaningless and adding too little noise might leak sensitive information. The privacy/utility tradeoff is a primary concern of the privacy mechanisms to be discussed in the next subsection.
2.4.1 Privacy Preserving Methods The privacy-preserving methods discussed in this section can be described as either suppressive or perturbative [74]. Suppressive methods include removal of attributes in the data, restricting queries via the privacy mechanism, aggregation/generalization of data attributes, and returning a sampled version of the original data. Perturbative methods include noise addition to the result of a query on the data or rounding of values in the dataset.
2.4.1.1 Anonymisation
Anonymisation or de-identification is achieved by removing any information that might identify an individual within a dataset. Ad-hoc anonymization might reasonably remove names, addresses, phone numbers, etc., and replace each user's record(s) with a pseudonym value to act as an identifier under the assumption that individuals cannot be identified within the altered dataset. However, this leaves the data open to privacy attacks known as linkage attacks [74]. In the presence of auxiliary information, linkage attacks allow an adversary to re-identify individuals in the otherwise anonymous dataset. Several famous examples of such linkage attacks exist. An MIT graduate, Latanya Sweeney, purchased voter registration records for the town of Cambridge, Massachusetts, and was able to use combinations of attributes (ZIP code, gender, and date of birth) known as a quasi-identifier to identify the then governor of Massachusetts, William Weld. When combined with state-released anonymized medical records, Sweeney was able to identify his medical information from the data release [75]. As part of a machine learning competition known as the Netflix Prize (https://www.netflixprize.com/index.html) launched in 2006, Netflix released a random sample of pseudo-anonymized movie rating data. Narayanan and Shmatikov [76] were able to show that using relatively few publicly published ratings on IMDb (https://www.imdb.com/), all the ratings in the Netflix data for the same user could be revealed. Lastly, in 2014, celebrities in New York were able to be tracked by combining taxi route data released via freedom of information requests, a de-hashing of the taxi license numbers (which were based on md5) and geo-tagged photos of the celebrities entering/exiting certain taxis [77]. k-anonymity was proposed by Sweeney [78] to tackle the challenge of linkage attacks on anonymized datasets. Using k-anonymity, data is suppressed such that k − 1 or more individuals possess the same attributes used to create a quasi-identifier. Therefore, an identifiable record in an auxiliary dataset would link to multiple records in the anonymous dataset. However, k-anonymity cannot defend against linkage attacks where a sensitive attribute is shared among a group of individuals with the same quasi-identifier. l-diversity builds on k-anonymity to ensure that there is diversity within any group of individuals sharing the same quasi-identifier [79].
t-closeness builds on both these methods to preserve the distribution of sensitive attributes among any group of individuals sharing the same quasi-identifier [80]. All these methods suffer, however, when an adversary possesses some knowledge about the sensitive attribute. Research related to improving k-anonymity-based methods has mostly been abandoned in the literature in favour of methods that offer more rigorous privacy guarantees (such as differential privacy, Sect. 2.4.1.3).
2.4.1.2 Encryption
Anonymisation presents several difficult challenges to provide statistics about data without disclosing sensitive information. Encrypting data provides better privacy protection but the ability to perform useful statistical analysis on encrypted data requires specialist methods. Homomorphic encryption [81] allows for processing of data in its encrypted form. Earlier efforts (termed "Somewhat Homomorphic Encryption") allowed for simple addition and multiplication operations on encrypted data [82], but were shortly followed by Fully Homomorphic Encryption allowing for any arbitrary function to be applied to data in ciphertext form to yield an encrypted result [81]. Despite the apparent advantages of homomorphic encryption to provide privacy to individuals over their data whilst allowing a third party to perform analytics on it, the computational overhead required to perform such operations is very large [83, 84]. IBM's homomorphic library implementation (https://github.com/shaih/HElib) runs some 50 million times slower than performing calculations on plaintext data [85]. Due to this computational overhead, applying homomorphic encryption to training on large-scale machine learning data is currently impractical [86]. Several projects make use of homomorphic encryption for inference on encrypted private data [83, 87]. Secure multi-party computation (SMC) [88] can also be adopted to compute a function on private data owned by many parties such that no party learns anything about others' data—only the output of the function. Many SMC protocols are based on Shamir's secret sharing [89], which splits data into n pieces in such a way that at least k pieces are required to reconstruct the original data (k − 1 pieces reveal nothing about the original data). For example, a value x is shared with multiple servers (as x_A, x_B, ...) via an SMC protocol such that the data can only be reconstructed if the shared pieces on k servers are known [90]. Various protocols exist to compute some function over the data held on the different servers via rounds of communication; however, the servers involved are assumed to be trustworthy.
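To illustrate the flavour of secret sharing (a simple additive n-of-n scheme rather than Shamir's k-of-n polynomial scheme, and purely illustrative):

```python
import random

PRIME = 2_147_483_647  # all share arithmetic is performed modulo this prime

def share(secret, n):
    # Split `secret` into n additive shares: n - 1 uniformly random values plus one
    # final share chosen so that all shares sum to the secret modulo PRIME. Any
    # subset of fewer than n shares reveals nothing about the secret on its own.
    shares = [random.randrange(PRIME) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

x = 42
pieces = share(x, 5)              # e.g. one share handed to each of five servers
assert reconstruct(pieces) == x   # only the full set of shares reconstructs x
```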
2.4.1.3 Differential Privacy
Differential privacy provides an elegant and rigorous mathematical measure of the level of privacy afforded by a privacy-preserving mechanism. A differentially private
privacy-preserving mechanism acting on very similar datasets will return statistically indistinguishable results. More formally: Given some privacy mechanism $M$ that maps inputs from domain $D$ to outputs in the range $R$, it is "almost" equally likely (by some multiplicative factor $\epsilon$) for any subset of outputs $S \subseteq R$ to occur, regardless of the presence or absence of a single individual in two neighboring datasets $d$ and $d'$ drawn from $D$ (differing by a single individual) [73]:
$$\Pr[M(d) \in S] \leq e^{\epsilon} \Pr[M(d') \in S]. \qquad (2.6)$$
Here, $d$ and $d'$ are interchangeable with the same outcome. This privacy guarantee protects individuals from being identified within the dataset as the result from the mechanism should be essentially the same regardless of whether the individual appeared in the original dataset or not. Differential privacy is an example of a perturbative privacy-preserving method, as the privacy guarantee is achieved by the addition of noise to the true output. This noise is commonly drawn from a Laplacian distribution [73] but can also be drawn from an exponential distribution [91] or via the novel staircase mechanism [92] that provides greater utility compared to Laplacian noise for the same $\epsilon$. The above description of differential privacy is often termed $\epsilon$-differential privacy or strict differential privacy. The amount of noise required to satisfy $\epsilon$-differential privacy is governed by $\epsilon$ and the sensitivity of the statistic function $Q$, defined by [73]:
$$\Delta Q = \max(\|Q(d) - Q(d')\|_1). \qquad (2.7)$$
This maximum is evaluated over all neighboring datasets in the set $D$ differing by a single individual. The output of the mechanism using noise drawn from the Laplacian distribution is then:
$$M(d) = Q(d) + \text{Laplace}\left(0, \frac{\Delta Q}{\epsilon}\right). \qquad (2.8)$$
A relaxed version of differential privacy known as $(\epsilon, \delta)$-differential privacy [93] provides greater flexibility in designing privacy-preserving mechanisms and greater resistance to attacks making use of auxiliary information [91]:
$$\Pr[M(d) \in S] \leq e^{\epsilon} \Pr[M(d') \in S] + \delta. \qquad (2.9)$$
Whereas $\epsilon$-differential privacy provides a privacy guarantee even for results with extremely small probabilities, the $\delta$ in $(\epsilon, \delta)$-differential privacy accounts for the small probability that the privacy guarantee of ordinary $\epsilon$-differential privacy is broken. The Gaussian mechanism is commonly used to add noise to satisfy $(\epsilon, \delta)$-differential privacy [94], but instead of the $L_1$ norm used in Eq. 2.7, the noise is scaled to the $L_2$ norm:
$$\Delta_2 Q = \max(\|Q(d) - Q(d')\|_2). \qquad (2.10)$$
The following mechanism then satisfies (ε, δ)-differential privacy (given ε, δ ∈ (0, 1)):

M(d) = Q(d) + N(0, 2 ln(1.25/δ) · (Δ_2 Q)² / ε²).    (2.11)
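To make Eqs. (2.8) and (2.11) concrete, the sketch below adds calibrated Laplace and Gaussian noise to a simple counting query. The dataset, the query, and the parameter values (ε, δ, and a sensitivity of 1) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(true_value, sensitivity, epsilon):
    """epsilon-DP: add Laplace(0, sensitivity / epsilon) noise (Eq. 2.8)."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

def gaussian_mechanism(true_value, l2_sensitivity, epsilon, delta):
    """(epsilon, delta)-DP: add N(0, 2 ln(1.25/delta) * sens^2 / eps^2) noise (Eq. 2.11)."""
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * l2_sensitivity / epsilon
    return true_value + rng.normal(loc=0.0, scale=sigma)

# Counting query over a toy dataset: adding or removing one person changes the count by at most 1.
ages = np.array([23, 35, 41, 29, 52, 61, 38])
true_count = np.sum(ages > 30)                       # sensitivity = 1
print(laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5))
print(gaussian_mechanism(true_count, l2_sensitivity=1.0, epsilon=0.5, delta=1e-5))
```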
ε is additive across multiple queries [91], and therefore an ε-budget should be designed to protect private data that is queried multiple times. Practically, this means that any differential-privacy-based system must keep track of who queries what, and how often, to ensure that some predefined ε-budget is not surpassed. In a machine learning setting, a method of accounting for the accumulated privacy loss over training iterations [11] needs to be employed to maintain an ε-budget.

Accumulated knowledge as described above is one of the weaknesses of differential privacy for keeping sensitive data private [91]. Another is collusion. If multiple users collude in querying the data (sharing the results of queries with one another), the ε-budget for any single user might be breached. Finally, suppose an ε-budget is assigned per query; a user making queries on correlated data will use only the budget for each query, yet may be able to gain more information because the two quantities are correlated (e.g. income and rent). Large values of ε (or large ε-budgets) introduce a greater risk of privacy breach than small ones, but selecting an appropriate ε is a non-trivial issue. Lee and Clifton [95] discuss how ε might be selected for a given problem, but identify that to do so, the dataset and the queries on the dataset should be known ahead of time.

Noise addition can be applied in two separate scenarios. Given a trusted data curator, noise can be added to queries on a static dataset, introducing only minimal noise per query. This can be considered a global privacy setting. Conversely, in a local privacy setting, no such trusted curator exists. Local differential privacy applies when noise is added to each sample before collection/aggregation. For example, the randomized response technique [96] allows participants to answer a question truthfully or randomly based on the flip of a coin. Each participant therefore has plausible deniability for their answer, yet statistics can still be estimated on the population given enough data (and a fair coin flip). Each individual sample is extremely noisy in the local case, due to the high vulnerability of a single sample being leaked; however, aggregated in volume, the noise can be filtered out to an extent to reveal population statistics. FL is another example of where local differential privacy is useful [97]. Adding noise to the updates during training rounds on local user data, before aggregation by an untrusted parameter server, provides greater privacy to the user and their contributions to a global model (discussed further in Sect. 2.5).

Limited examples of practical applications using differential privacy exist outside of academia. Apple implemented differential privacy in its iOS 10 operating system for the iPhone [98] to collect statistics on emoji suggestions and Safari crash reports [99]. Google also collects usage statistics for the Chrome internet browser using differential privacy, via multiple rounds of the randomized response technique coupled with a Bloom filter to encode the domain names of sites a user has visited [100]. Both
these applications use local differential privacy to protect the individual’s privacy but rely on large numbers of participating users to determine accurate overall statistics. Future research and applications of differential privacy are likely to focus on improving utility whilst retaining good privacy guarantees for greater adoption by the IT industry.
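The randomized response technique mentioned above can be illustrated with a short simulation; the coin probabilities (a fair coin for truth-telling and a fair coin for the random answer) and the population rate are illustrative assumptions.

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Answer truthfully with probability 1/2, otherwise answer at random."""
    if random.random() < 0.5:
        return true_answer            # truthful with probability 1/2
    return random.random() < 0.5      # random "yes"/"no" otherwise

def estimate_true_rate(noisy_answers):
    """Invert the known noise process: P(yes) = 0.5 * p_true + 0.25."""
    observed = sum(noisy_answers) / len(noisy_answers)
    return max(0.0, min(1.0, (observed - 0.25) / 0.5))

# Simulate 100,000 respondents where 30% truly hold the sensitive attribute.
population = [random.random() < 0.3 for _ in range(100_000)]
responses = [randomized_response(a) for a in population]
print(f"estimated rate: {estimate_true_rate(responses):.3f}")   # close to 0.30
```

Each individual response is highly noisy (plausible deniability), yet the population rate can be recovered accurately from a large enough sample, which is exactly the local-privacy trade-off described above.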
2.5 Privacy Preservation in FL

FL already increases the level of privacy afforded to an individual compared with traditional machine learning on a static dataset. It avoids the storage of sensitive personal data by a third party and prevents a third party from performing learning tasks on the data for which the individual had not initially given permission. Additionally, inference does not require that further sensitive data be sent to a third party, as the global model is available to the individual on their private device [12]. Despite these privacy improvements, the weight/gradient updates uploaded by individuals may still reveal information about the user's data, especially if certain weights in the weight matrix are sensitive to specific features or values in the individual's data (for example, specific words in a language prediction model [101]). These updates are available to any client participating in FL as well as to the aggregating server.

Bonawitz et al. [102] show that devices participating in FL can also act as parties in an SMC protocol to protect the privacy of all users' updates. In their "Secure Aggregation" protocol, the aggregating server only learns about client updates in aggregate. Similarly, the ∝MDL protocol described in [104] uses SMC but also encrypts the gradients on the client using homomorphic encryption. The summation of the encrypted gradients over all participating clients gives an encrypted global gradient; however, this summation result can only be decrypted once a threshold number of clients have shared their gradients. Therefore, again, the server can only learn about client updates in aggregate, preserving the privacy of individual contributions (Table 2.2).

Researchers at Google have recently described the high-level design of a production-ready FL framework [105] based on TensorFlow (https://www.tensorflow.org/). This framework includes Secure Aggregation [102] as an option during training. Applying SMC to FL incurs increased communication and greater computational complexity in the aggregation process (both for the client and the server). Additionally, the fully trained model available to clients after the FL procedure may still leak sensitive data about specific individuals, as described earlier. Adversarial attacks on FL models can be mitigated by inspecting and filtering out malicious client updates [39]. However, the Secure Aggregation protocol [102] prevents the inspection of individual updates and therefore cannot defend against such poisoning attacks [38] in this way.

While SMC achieves privacy at the cost of increased computational complexity, differential privacy trades off model utility for increased privacy.
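The intuition behind aggregate-only visibility in Secure Aggregation [102] can be sketched with pairwise additive masks that cancel in the sum. The sketch below is a heavily simplified illustration: it assumes the pairwise seeds have already been agreed (in the real protocol this is done via key exchange) and it omits dropout handling and the other machinery of [102].

```python
import numpy as np

rng = np.random.default_rng(1)

def mask_update(client_id, update, num_clients, seed_matrix):
    """Add pairwise masks: +mask towards peers with a larger id, -mask towards smaller ids."""
    masked = update.copy()
    for peer in range(num_clients):
        if peer == client_id:
            continue
        # Both members of the pair derive the same mask from their shared seed.
        pair_rng = np.random.default_rng(seed_matrix[min(client_id, peer), max(client_id, peer)])
        mask = pair_rng.normal(size=update.shape)
        masked += mask if client_id < peer else -mask
    return masked

num_clients, dim = 4, 3
updates = [rng.normal(size=dim) for _ in range(num_clients)]
# Symmetric pairwise seeds (assumed agreed out of band for this toy example).
seeds = rng.integers(0, 2**31, size=(num_clients, num_clients))

masked = [mask_update(i, updates[i], num_clients, seeds) for i in range(num_clients)]
# Each masked update looks random on its own, yet the masks cancel in the aggregate.
assert np.allclose(sum(masked), sum(updates))
```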
Table 2.2 A summary of important contributions to FL research with a focus on privacy-enhancing mechanisms (DP = differential privacy, HE = homomorphic encryption, SMC = secure multi-party computation)

[12] (2015, DP): Description of a selective distributed gradient descent method to reduce communication, and the application of differential privacy to protect the model parameter updates. Privacy details: batch-level DP, ε-DP (Laplace mechanism).

[11] (2016, DP): Description of an efficient accounting method for accumulating privacy losses while training a DNN with differential privacy. Privacy details: batch-level DP, (ε, δ)-DP (Gaussian mechanism).

[102] (2017, SMC): New method to provide secure multi-party computation specifically tailored towards FL. Privacy details: Secure Aggregation protocol evaluates the average gradients of clients only when a sufficient number send updates.

[97] (2017, DP): Method for providing user-level differential privacy for FL with only a small loss in model utility. Privacy details: user-level DP, (ε, δ)-DP (Gaussian mechanism).

[101] (2017, DP): Method for providing user-level differential privacy for FL without degrading model utility. Privacy details: user-level DP, (ε, δ)-DP (Gaussian mechanism).

[103] (2017, DP): Demonstration of an attack method on the global model using a generative adversarial network, effective even against record/batch-level DP. Privacy details: attack tested against record/batch-level DP (implemented using [12]).

[104] (2017, HE, SMC): Method for encrypting user updates during distributed training, decryptable only when many clients have participated in the distributed learning objective. Privacy details: gradient updates are encrypted using homomorphic encryption; the aggregating server obtains the average gradient over all workers but can only decrypt this result once a certain number of updates have been aggregated.

[105] (2019, SMC): Description of a full-scale production-ready FL system (focusing on mobile devices). Privacy details: optionally makes use of the Secure Aggregation protocol in [102].
Additionally, differential privacy protects an individual's contributions to the model during training and once the model is fully trained. Differential privacy has been applied in multiple cases to mitigate the issue of publishing sensitive weight updates during communication rounds in an FL setting.

Shokri and Shmatikov [12] describe a communication-efficient method for FL of a deep learning model, tested on the MNIST and SVHN datasets. They select only a fraction of the local gradient updates to share with a central server, but also experiment with adding noise to the updates to satisfy differential privacy and protect the contributions of individuals to the global model. An ε-budget is divided and spent partly on selecting gradients above a certain threshold and partly on publishing the gradients. The sensitivity of SGD is handled by bounding the gradients to the range [−γ, γ] (γ is set to some small number). Laplacian noise is generated using this sensitivity and added to the updates before selection/publishing. The authors show that their differentially private method outperforms standalone training (training performed by each client on their data alone) and approaches the performance of SGD on a non-private static dataset, given that enough clients participate in each communication round.

Abadi et al. [11] apply a differentially private SGD mechanism to train on the MNIST and CIFAR-10 image datasets. They show that they can achieve 97% accuracy on MNIST (1.3% worse than the non-differentially private baseline) and 73% accuracy on CIFAR-10 (7% worse than the non-differentially private baseline) using a modest neural network and principal component analysis to reduce the dimensionality of the input space. This is achieved with an (ε, δ)-differential privacy of (8, 10⁻⁵). The authors also introduce a privacy accountant to monitor the accumulated privacy loss over all training operations, based on moments of the privacy loss random variable. The authors point out that the privacy loss is minimal for such a large number of parameters and training examples.

Geyer et al. [97] make use of the moments privacy accountant from [11] and evaluate the accumulated δ during training. Once the accumulated δ reaches a given threshold, training is halted. Intuitively, training is halted once the risk of the privacy guarantee being broken becomes too probable. This method of FL protects the privacy of an individual's participation in training over their entire local dataset, as opposed to a single data point during training as in [11]. The authors show that with a sufficiently large number of clients participating in the federated optimization, only a minor drop in performance is recorded whilst maintaining a good level of privacy over the individual's data. Similarly, McMahan et al. [101] apply user-level differential privacy (noise is added using sensitivity measured at the user level rather than the sample or mini-batch level) via the moments privacy accountant introduced in [11]; a sketch of this clipping-and-noising pattern is given below.

A method for attacking deep learning models trained via FL has been proposed in [103]. This approach involves a malicious user participating in federated training whose alternative objective is to train a generative adversarial network (GAN) to generate realistic examples from the globally shared model during training. The authors show that even a model trained with differentially private updates is susceptible to the attack, but that it could be defended against with user-level or device-level differential privacy such as that described in [97, 101].
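The following sketch illustrates the clipping-and-noising pattern common to the differentially private SGD approaches discussed above [11, 12]; the toy gradients and hyperparameter values are illustrative assumptions, and a real implementation would also track the accumulated privacy loss with an accountant such as the one in [11].

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(params, per_example_grads, clip_norm, noise_multiplier, lr):
    """One differentially private SGD step: clip each example's gradient to
    `clip_norm`, average, then add Gaussian noise scaled to the clipping bound."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    avg = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(per_example_grads),
                       size=avg.shape)
    return params - lr * (avg + noise)

# Toy gradients for a batch of 8 examples and a 5-parameter model.
params = np.zeros(5)
grads = [rng.normal(size=5) for _ in range(8)]
params = dp_sgd_step(params, grads, clip_norm=1.0, noise_multiplier=1.1, lr=0.1)
```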
An alternative method to perform machine learning on private data is via a knowledge-distillation-like approach. Private Aggregation of Teacher Ensembles (PATE) [106] trains a student model (which is published and used for inference) using many teacher models in an ensemble. Neither the sensitive data available to the teacher models nor the teacher models themselves are ever published. The teacher models, once trained on the sensitive data, are used to label public data in a semi-supervised fashion by voting for the predicted class. The votes cast by the teachers have noise drawn from a Laplacian distribution added to them to preserve the privacy of their predictions. This approach requires that public data is available to train the student model; however, it shows better performance than [11, 12] whilst maintaining privacy guarantees ((ε, δ)-differential privacy of (2.04, 10⁻⁵) and (8.19, 10⁻⁶) on the MNIST and SVHN datasets respectively). Further improvements to PATE show that the method can scale to large multi-class problems [107].
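A minimal sketch of the PATE-style noisy voting step is shown below; the number of teachers, the number of classes, and the per-query ε are illustrative assumptions, and a full PATE pipeline would additionally train the student model on the resulting labels and account for the total privacy spent.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_teacher_vote(teacher_predictions, num_classes, epsilon_per_query):
    """Aggregate teacher votes for one unlabeled public example and add Laplace noise
    (sensitivity 1, since changing one teacher changes at most one count) before argmax."""
    counts = np.bincount(teacher_predictions, minlength=num_classes).astype(float)
    counts += rng.laplace(0.0, 1.0 / epsilon_per_query, size=num_classes)
    return int(np.argmax(counts))

# 50 teachers label a public example from 10 classes; most predict class 3.
teacher_preds = np.array([3] * 40 + list(rng.integers(0, 10, size=10)))
label_for_student = noisy_teacher_vote(teacher_preds, num_classes=10, epsilon_per_query=0.5)
print(label_for_student)
```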
2.6 Challenges in Applying Privacy-Preserving FL to the IoT

In this section, we identify and outline some promising areas to develop privacy-preserving FL research, particularly focused on IoT environments.
2.6.1 Optimal Model Architecture/Hyperparameters

FL precludes seeing the data that a model is trained on. With a traditional centralized dataset, a deep learning architecture and its hyperparameters can be selected via a validation strategy. Following the same approach in FL to find an optimal architecture or to tune hyperparameters would require training many models on user devices, possibly consuming unacceptable amounts of battery power and bandwidth. Novel research is therefore required to tackle this problem, which is unique to FL.
2.6.2 Continual Learning

Training a machine learning model is an expensive and time-consuming task, and this can be significantly worse in the FL setting. As data distributions evolve, a trained model's performance deteriorates. To avoid the cost of repeating federated training many times over, research into methods for improving how a model learns over time is well aligned with the FL objective. Methods such as meta-learning, online learning, and continual learning will be important here, and each will face specific challenges unique to the distributed nature of FL.
2.6.3 Better Privacy-Preserving Methods

As seen in this review, there is an observable tradeoff between the performance of a model and the privacy afforded to a user. Research is ongoing into differential privacy accounting methods that introduce less noise into the model (thus improving utility) for the same level of privacy (as judged by the ε parameter). Likewise, further research is required to vastly reduce the computational burden of methods such as homomorphic encryption and secure multi-party computation before they can become common-use methods for preserving privacy in large-scale machine learning tasks.
2.6.4 FL Combined with Fog Computing

Reducing the latency between rounds of training in FL is desirable to train models quickly. Fog computing nodes could feasibly be leveraged as aggregating servers to remove the round-trip communication between clients and cloud servers in the aggregation step of FL. Fog computing could also bring other benefits, such as sharing the computational burden by hierarchically aggregating many large client models.
2.6.5 FL on Low Power Devices

Training deep networks on resource-constrained and low-power devices poses specific challenges for FL. Much of the research into FL focuses on mobile devices such as smartphones, which have relatively abundant compute, storage, and power capabilities. As such, new methods are required to reduce the amount of work individual devices need to do to contribute to training (perhaps using the model parallelism approach seen in [14], or training only certain deep network layers on subsets of devices).
2.7 Conclusion

Deep learning has shown impressive successes in the fields of computer vision, speech recognition, and language modeling. With the exploding increase in deployments of IoT devices, deep learning is naturally starting to be applied at the edge of the network, on mobile and resource-limited embedded devices. This environment,
however, presents difficult challenges for training deep models due to their energy, compute, and memory requirements. Beyond this, a model's utility is strictly limited by the data available to the edge device. Allowing machines close to the edge of the network to train on data produced by edge devices (as in fog computing) risks privacy breaches of such data. Federated learning (FL) is a good solution for improving deep learning models while maintaining the privacy of the raw data.

FL presents a new field of research but has great potential for improving the privacy of training data and giving users control over how their data is used by third parties. Combining FL with privacy mechanisms such as differential privacy further secures user data from adversaries with the inclination and means to reverse-engineer parameter updates in distributed SGD procedures. Differential privacy as applied to machine learning is also in its infancy, and challenges remain to provide good privacy guarantees whilst simultaneously limiting the required communication costs in a federated setting.

The intersection of FL, differential privacy, and IoT data represents a fruitful area of research. Performing deep learning efficiently on resource-constrained devices while preserving privacy and utility poses a real challenge. Additionally, the nature of IoT data, as opposed to internet data, in private FL deserves more attention from the research community. IoT data is often highly skewed, non-IID, and highly variable over time. This is a challenge that needs to be overcome for FL to flourish in edge environments.

Acknowledgements This work is partly supported by the SEND project (grant ref. 32R16P00706) funded by ERDF and BEIS.
References 1. Gartner, Gartner identifies top 10 strategic IoT technologies and trends (2018). https://www. gartner.com/en/newsroom/press-releases/2018-11-07-gartner-identifies-top-10-strategiciot-technologies-and-trends 2. J.A. Stankovic, Research directions for the internet of things. IEEE Internet Things J. 1(1), 3–9 (2014) 3. F. Bonomi, R. Milito, J. Zhu, S. Addepalli, Fog computing and its role in the internet of things, in MCC Workshop on SIGCOMM 2012, August (ACM, New York, USA, 2012), pp. 13–16 4. Y. Ai, M. Peng, K. Zhang, Edge computing technologies for internet of things: a primer. Digit. Commun. Netw. 4(2), 77–86 (2018) 5. L. Bittencourt, R. Immich, R. Sakellariou, N. Fonseca, E. Madeira, M. Curado, L. Villas, L. DaSilva, C. Lee, O. Rana, The internet of things, fog and cloud continuum: integration and challenges. Internet Things 3–4, 134–155 (2018) 6. OpenFog Consortium, OpenFog reference architecture for fog computing (2017). https:// www.openfogconsortium.org/wp-content/uploads/OpenFog_Reference_Architecture_2_ 09_17-FINAL.pdf 7. F. Bonomi, R. Milito, P. Natarajan, J. Zhu, Fog computing: a platform for internet of things and analytics in Big Data and Internet of Things: A Roadmap for Smart Environments (Springer, Cham, 2014), pp. 169–186 8. H. Li, K. Ota, M. Dong, Learning IoT in edge: deep learning for the internet of things with edge computing. IEEE Netw. 32(1), 96–101 (2018)
9. T. Ben-Nun, T. Hoefler, Demystifying parallel and distributed deep learning. ACM Comput. Surv. (CSUR) 52(4), 1–43 (2019) 10. B. McMahan, E. Moore, D. Ramage, S. Hampson, B.A. y Arcas, Communication-efficient learning of deep networks from decentralized data, in Artificial Intelligence and Statistics (2017), pp. 1273–1282 11. M. Abadi, A. Chu, I. Goodfellow, H.B. McMahan, I. Mironov, K. Talwar, L. Zhang, Deep learning with differential privacy, in ICLR, October (ACM, 2016), pp. 308–318 12. R. Shokri, V. Shmatikov, Privacy-preserving deep learning, in The 22nd ACM SIGSAC Conference, October (ACM, 2015), pp. 1310–1321 13. S. Wang, T. Tuor, T. Salonidis, K.K. Leung, C. Makaya, T. He, K. Chan, Adaptive federated learning in resource constrained edge computing systems. IEEE J. Sel. Areas Commun. 37(6), 1205–1221 (2019) 14. J. Dean, G.S. Corrado, R. Monga, K. Chen, M. Devin, Q.V. Le, M.Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, A.Y. Ng, Large scale distributed deep networks, in Advances in Neural Information Processing Systems, December (Curran Associates Inc., 2012), pp. 1223–1231 15. M. Li, D.G. Andersen, J.W. Park, A.J. Smola, A. Ahmed, Scaling distributed machine learning with the parameter server, in OSDI, vol. 14, pp. 583–598 (2014) 16. Y. Lecun, Y. Bengio, G.E. Hinton, Deep learning. Nature 521(7), 436–444 (2015) 17. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, 1st edn. (MIT Press, Cambridge, 2016) 18. D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representations by error propagation, in Parallel Distributed Processing Explorations in the Microstructure of Cognition (MIT Press, Cambridge, 1986), pp. 318–362 19. A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017) 20. T.M. Chilimbi, Y. Suzue, J. Apacible, K. Kalyanaraman, Project Adam: building an efficient and scalable deep learning training system, in OSDI (2014), pp. 571–582 21. B. Recht, C. Ré, S. Wright, F. Niu, Hogwild: a lock-free approach to parallelizing stochastic gradient descent, in Advances in Neural Information Processing Systems (2011), pp. 693–701 22. Q. Ho, J. Cipar, H. Cui, S. Lee, J.K. Kim, P.B. Gibbons, G.A. Gibson, G. Ganger, E.P. Xing, More effective distributed ML via a stale synchronous parallel parameter server, in Advances in Neural Information Processing Systems (2013), pp. 1223–1231 23. A. Odena, Faster asynchronous SGD (2016), arXiv:1601.04033 24. W. Zhang, S. Gupta, X. Lian, J. Liu, Staleness-aware async-SGD for distributed deep learning, in Proceedings of the 25th International Joint Conference on Artificial Intelligence, November (2015), pp. 2350–2356 25. K. Hsieh, A. Harlap, N. Vijaykumar, D. Konomis, G.R. Ganger, P.B. Gibbons, Gaia: geodistributed machine learning approaching LAN speeds, in NSDI (2017), pp. 629–647 26. J. Jiang, B. Cui, C. Zhang, L. Yu, Heterogeneity-aware distributed parameter servers, in 2017 ACM International Conference (ACM Press, New York, USA, 2017), pp. 463–478 27. J. Daily, A. Vishnu, C. Siegel, T. Warfel, V. Amatya, GossipGraD: scalable deep learning using gossip communication based asynchronous gradient descent (2018), arXiv:1803.05880 28. P.H. Jin, Q. Yuan, F. Iandola, K. Keutzer, How to scale distributed deep learning? (2016), arXiv:1611.04581 29. S. Sundhar Ram, A. Nedic, V.V. Veeravalli, Asynchronous gossip algorithms for stochastic optimization, in 2009 International Conference on Game Theory for Networks (GameNets) (IEEE, 2009), pp. 80–81 30. J. Ba, R. 
Caruana, Do deep nets really need to be deep? in Advances in Neural Information Processing Systems (2014), pp. 2654–2662 31. Y. Chebotar, A. Waters, Distilling knowledge from ensembles of neural networks for speech recognition, in Interspeech 2016 (ISCA, 2016), pp. 3439–3443 32. G.E. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network (2015), arXiv:1503.02531
33. Y. Liang, M.F. Balcan, V. Kanchanapally, Distributed PCA and k-means clustering, in The Big Learning Workshop at NIPS (2013) 34. J. Koneˇcný, H.B. McMahan, D. Ramage, P. Richtárik, Federated optimization: distributed machine learning for on-device intelligence (2016), arXiv:1610.02527 35. R. Johnson, T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, in Proceedings of the 26th International Conference on Neural Information Processing Systems, vol. 1 (Curran Associates Inc., USA, 2013), pp. 315–323 36. J. Koneˇcný, H.B. McMahan, F.X. Yu, P. Richtárik, A.T. Suresh, D. Bacon, Federated learning: strategies for improving communication efficiency, in NIPS Workshop on Private Multi-party Machine Learning (2016) 37. V. Smith, C.-K. Chiang, M. Sanjabi, A. Talwalkar, Federated multi-task learning, in Advances in Neural Information Processing Systems (2017), arXiv:1705.10467 38. E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, V. Shmatikov, How to backdoor federated learning (2018), arXiv:1807.00459 39. C. Fung, C.J.M. Yoon, I. Beschastnikh, Mitigating sybils in federated learning poisoning (2018), arXiv:1808.04866 40. A. Hard, K. Rao, R. Mathews, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, D. Ramage, Federated learning for mobile keyboard prediction (2018), http://arxiv.org 41. D. Leroy, A. Coucke, T. Lavril, T. Gisselbrecht, J. Dureau, Federated learning for keyword spotting (2018), arXiv:1810.05512 42. D. Liu, T. Miller, R. Sayeed, K.D. Mandl, FADL: federated-autonomous deep learning for distributed electronic health record (2018), arXiv:1811.11400 43. T. Nishio, R. Yonetani, Client selection for federated learning with heterogeneous resources in mobile edge (2018), arXiv:1804.08333 44. Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, V. Chandra, Federated learning with non-IID data (2018), arXiv:1806.00582 45. I. Dhillon, D. Papailiopoulos, V. Sze (eds.), Federated optimization in heterogeneous networks (2020) 46. H. Eichner, T. Koren, H.B. McMahan, N. Srebro, K. Talwar, Semi-cyclic stochastic gradient descent, in International Conference on Machine Learning, April (2019), arXiv:1904.10120 47. S.P. Karimireddy, S. Kale, M. Mohri, S.J. Reddi, S.U. Stich, A.T. Suresh, Scaffold: stochastic controlled averaging for federated learning (2020). arXiv preprint arXiv:1910.06378 48. F. Sattler, S. Wiedemann, K.-R. Müller, W. Samek, Robust and communication-efficient federated learning from non-IID data (2019), arXiv:1903.02891 49. C. Briggs, Z. Fan, P. Andras, Federated learning with hierarchical clustering of local updates to improve training on non-IID data, in 2020 International Joint Conference on Neural Networks (IJCNN) (IEEE, 2020) 50. Analyzing federated Learning through an adversarial lens, in PMLR, May (2019) 51. S. Han, H. Mao, W.J. Dally, Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding (2015), arXiv:1510.00149 52. S. Han, J. Pool, J. Tran, W. Dally, Learning both weights and connections for efficient neural network, in Advances in Neural Information Processing Systems (2015), pp. 1135–1143 53. S. Gupta, A. Agrawal, K. Gopalakrishnan, P. Narayanan, Deep learning with limited numerical precision (2015), arXiv:1502.02551 54. T. Dettmers, 8-bit approximations for parallelism in deep learning (2015), arXiv:1511.04561 55. F. Seide, H. Fu, J. Droppo, G. Li, D. 
Yu, 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs, in 15th Annual Conference of the International Speech Communication Association (2014) 56. C. Hardy, E. Le Merrer, B. Sericola, Distributed deep learning on edge-devices: feasibility via adaptive compression, in 2017 IEEE 16th International Symposium on Network Computing and Applications (NCA) (IEEE), pp. 1–8 57. Y. Lin, S. Han, H. Mao, Y. Wang, W.J. Dally, Deep gradient compression: reducing the communication bandwidth for distributed training (2017), arXiv:1712.01887
58. B. Liu, L. Wang, M. Liu, C.-Z. Xu, Federated imitation learning: a novel framework for cloud robotic systems with heterogeneous sensor data. IEEE Robot. Autom. Lett. 5(2), 3509–3516 (2020) 59. W. Zhou, Y. Li, S. Chen, B. Ding, Real-time data processing architecture for multi-robots based on differential federated learning, in 2018 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI) (IEEE, 2018), pp. 462–471 60. R. Fantacci, B. Picano, Federated learning framework for mobile edge computing networks. CAAI Trans. Intell. Technol. 5(1), 15–21 (2020) 61. Z. Yu, J. Hu, G. Min, H. Lu, Z. Zhao, H. Wang, N. Georgalas, Federated learning based proactive content caching in edge computing, in GLOBECOM 2018–2018 IEEE Global Communications Conference (IEEE, 2018), pp. 1–6 62. Y. Lu, X. Huang, Y. Dai, S. Maharjan, Y. Zhang, Differentially private asynchronous federated learning for mobile edge computing in urban informatics. IEEE Trans. Ind. Inform. 16(3), 2134–2143 (2019) 63. D. Ye, R. Yu, M. Pan, Z. Han, Federated learning in vehicular edge computing: a selective model aggregation approach. IEEE Access 8, 23 920–23 935 (2020) 64. Y.M. Saputra, D.T. Hoang, D.N. Nguyen, E. Dutkiewicz, M.D. Mueck, S. Srikanteswara, Energy demand prediction with federated learning for electric vehicle networks, in GLOBECOM 2019–2019 IEEE Global Communications Conference (IEEE, 2019), pp. 1–6 65. T.D. Nguyen, S. Marchal, M. Miettinen, H. Fereidooni, N. Asokan, A.-R. Sadeghi, “D I¨oT: a federated self-learning anomaly detection system for IoT, in 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS) (IEEE, 2019), pp. 756–767 66. N.I. Mowla, N.H. Tran, I. Doh, K. Chae, Federated learning-based cognitive detection of jamming attack in flying ad-hoc network. IEEE Access 8, 4338–4350 (2020) 67. M. Chen, O. Semiari, W. Saad, X. Liu, C. Yin, Federated echo state learning for minimizing breaks in presence in wireless virtual reality networks. IEEE Trans. Wirel. Commun. 19(1), 177–191 (2020) 68. K. Sozinov, V. Vlassov, S. Girdzijauskas, Human activity recognition using federated learning, in 2018 IEEE International Conference on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom) (IEEE, 2018), pp. 1103–1111 69. R. Miotto, F. Wang, S. Wang, X. Jiang, J.T. Dudley, Deep learning for healthcare: review, opportunities and challenges. Brief. Bioinform. 19(6), 1236–1246 (2017) 70. A. Gandomi, M. Haider, Beyond the hype: big data concepts, methods, and analytics. Int. J. Inf. Manag. 35(2), 137–144 (2015) 71. UN General Assembly, Universal Declaration of Human Rights (2015) 72. European Commision, Regulation (EU) 2016/679 of the European parliament and of the council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing directive 95/46/EC (General Data Protection Regulation). Off. J. Eur. Union L119, 1–88 (2016) 73. C. Dwork, Differential privacy, in Automata, Languages and Programming (Springer, Berlin, 2006), pp. 1–12 74. B.C.M. Fung, K. Wang, R. Chen, P.S. Yu, Privacy-preserving data publishing: a survey of recent developments. ACM Comput. Surv. 
(CSUR) 42(4), 14–53 (2010) 75. H.T. Greely, The uneasy ethical and legal underpinnings of large-scale genomic biobanks. Annu. Rev. Genomics Hum. Genet. 8(1), 343–364 (2007) 76. A. Narayanan, V. Shmatikov, Robust de-anonymization of large sparse datasets, in 2008 IEEE Symposium on Security and Privacy (SP 2008) (IEEE, 2008), pp. 111–125 77. A. Tockar riding with the stars: passenger privacy in the NYC taxicab dataset (2014), https://research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-inthe-nyc-taxicab-dataset/
78. L. Sweeney, K-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.Based Syst. 10(5), 557–570 (2002) 79. A. Machanavajjhala, J. Gehrke, D. Kifer, M. Venkitasubramaniam, L-diversity: privacy beyond k-anonymity, in 22nd International Conference on Data Engineering (IEEE, 2006), pp. 24–24 80. N. Li, T. Li, S. Venkatasubramanian, t-closeness: privacy beyond k-anonymity and l-diversity, in 2007 IEEE 23rd International Conference on Data Engineering (IEEE, 2007), pp. 106–115 81. C. Gentry, Computing arbitrary functions of encrypted data. Commun. ACM 53(3), 97–105 (2010) 82. A. Acar, H. Aksu, A.S. Uluagac, M. Conti, A survey on homomorphic encryption schemes: theory and implementation. ACM Comput. Surv. (CSUR) 51(4), 1–35 (2018) 83. R. Gilad-Bachrach, N. Dowlin, K. Laine, K. Lauter, M. Naehrig, J. Wernsing, Cryptonets: applying neural networks to encrypted data with high throughput and accuracy, in International Conference on Machine Learning (2016), pp. 201–210 84. E. Hesamifard, H. Takabi, M. Ghasemi, CryptoDL: deep neural networks over encrypted data (2017), arXiv:1711.05189 85. L. Rist, Encrypt your machine learning (2018), https://medium.com/corti-ai/encrypt-yourmachine-learning-12b113c879d6 86. Y. Du, L. Gustafson, D. Huang, K. Peterson, Implementing ML algorithms with HE. MIT Course 6.857: Computer and Network Security (2017) 87. E. Chou, J. Beal, D. Levy, S. Yeung, A. Haque, L. Fei-Fei, Faster cryptoNets: leveraging sparsity for real-world encrypted inference (2018), arXiv:1811.09953 88. O. Goldreich, Secure multi-party computation. Manuscript. Preliminary version, vol. 78 (1998) 89. A. Shamir, How to share a secret. Commun. ACM 22(11), 612–613 (1979) 90. J. Launchbury, D. Archer, T. DuBuisson, E. Mertens, Application-scale secure multiparty computation, in Programming Languages and Systems, April (Springer, Berlin, 2014), pp. 8–26 91. C. Dwork, A. Roth, The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9(3–4), 211–407 (2014) 92. Q. Geng, P. Kairouz, S. Oh, P. Viswanath, The staircase mechanism in differential privacy. IEEE J. Sel. Top. Signal Process. 9(7), 1176–1184 (2015) 93. S.P. Kasiviswanathan, A. Smith, On the ‘semantics’ of differential privacy: a bayesian formulation. J. Priv. Confid. 6(1) (2014) 94. T. Zhu, G. Li, W. Zhou, P.S. Yu, Preliminary of differential privacy, in Differential Privacy and Applications (Springer International Publishing, Cham, 2017), pp. 7–16 95. J. Lee, C. Clifton, How much is enough? Choosing ε for differential privacy, in Information Security, October (Springer, Berlin, 2011), pp. 325–340 96. S.L. Warner, Randomized response: a survey technique for eliminating evasive answer bias. J. Am. Stat. Assoc. 60(309), 63 (1965) 97. R.C. Geyer, T. Klein, M. Nabi, Differentially private federated learning: a client level perspective (2017), arXiv:1712.07557 98. Apple, Apple differential privacy technical overview (2017). https://www.apple.com/privacy/ docs/Differential_Privacy_Overview.pdf 99. A.G. Thakurta, Differential privacy: from theory to deployment, in USENIX Association (USENIX Association, Vancouver, 2017) 100. Ú. Erlingsson, V. Pihur, A. Korolova, “RAPPOR: randomized aggregatable privacy-preserving ordinal response, in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, November (ACM, 2014), pp. 1054–1067 101. H.B. McMahan, D. Ramage, K. Talwar, L. Zhang, learning differentially private recurrent language models, in ICLR (2018) 102. K. Bonawitz, V. Ivanov, B. 
Kreuter, A. Marcedone, H.B. McMahan, S. Patel, D. Ramage, A. Segal, K. Seth, Practical secure aggregation for privacy-preserving machine learning, in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, October (ACM, 2017), pp. 1175–1191
103. B. Hitaj, G. Ateniese, F. Perez-Cruz, Deep models under the GAN, in 2017ACM SIGSAC Conference (ACM Press, New York, USA, 2017), pp. 603–618 104. X. Zhang, S. Ji, H. Wang, T. Wang, Private, yet practical, multiparty deep learning, in 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS) (IEEE, 2017), pp. 1442–1452 105. K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Koneˇcný, S. Mazzocchi, H.B. McMahan, T. Van Overveldt, D. Petrou, D. Ramage, J. Roselander, Towards federated learning at scale: system design (2019), arXiv:1902.01046 106. N. Papernot, M. Abadi, Ú. Erlingsson, I. Goodfellow, K. Talwar, Semi-supervised knowledge transfer for deep learning from private training data (2016), arXiv:1610.05755 107. N. Papernot, S. Song, I. Mironov, A. Raghunathan, K. Talwar, Ú. Erlingsson, Scalable private learning with PATE (2018), arXiv:1802.08908
Chapter 3
Differentially Private Federated Learning: Algorithm, Analysis and Optimization Kang Wei, Jun Li, Chuan Ma, Ming Ding, and H. Vincent Poor
Abstract Federated learning (FL), a type of collaborative machine learning framework, is capable of helping protect users' private data while training the data into useful models. Nevertheless, privacy leakage may still happen by analyzing the parameters exchanged between the central server and the clients, e.g., weights and biases in deep neural networks. In this chapter, to effectively prevent information leakage, we investigate a differential privacy mechanism in which, at the clients' side, artificial noises are added to the parameters before uploading. Moreover, we propose a K-client random scheduling policy, in which K clients are randomly selected from a total of N clients to participate in each communication round. Furthermore, a theoretical convergence bound is derived on the loss function of the trained FL model. In detail, considering a fixed privacy level, the theoretical bound reveals that there exists an optimal number of clients K that can achieve the best convergence performance due to the tradeoff between the volume of user data and the variance of the aggregated artificial noises. To optimize this tradeoff, we further provide a differentially private FL based client selection (DP-FedCS) algorithm, which can dynamically select the number of training clients. Our experimental results validate our theoretical conclusions and also show that the proposed algorithm can effectively improve both the FL training efficiency and the FL model quality for a given privacy protection level.

K. Wei, J. Li (corresponding author), C. Ma: Nanjing University of Science and Technology, Nanjing, China
M. Ding: Data61, CSIRO, Sydney, Australia
H. V. Poor: Princeton University, Princeton, NJ, USA
Keywords Federated learning · Differential privacy · Convergence performance · Client selection
3.1 Introduction

It is widely expected that big-data-driven artificial intelligence (AI) will soon be applied in many aspects of our daily life, including health care [4], communications [40], transportation [17, 19], etc. At the same time, the rapid growth of Internet-of-Things (IoT) applications calls for data mining and model learning to be carried out securely and reliably in distributed systems [12, 15, 21]. As AI is integrated into a variety of IoT applications, distributed machine learning (ML) systems are preferred for many data processing tasks, defining parametrized functions from inputs to outputs as compositions of basic building blocks [24, 30]. Federated learning (FL) [31, 35, 43, 44] was proposed as a recent advance in distributed ML, in which data are acquired and processed locally at the client side, and then the trained ML model parameters are transmitted to a central server for aggregation. Nevertheless, several key challenges also exist in FL, e.g., private information leakage, expensive communication costs between the server and clients, and device variability [9, 16, 27, 42].

Generally, distributed stochastic gradient descent (SGD) is a common optimization algorithm adopted for training ML models, including FL. Existing works [2, 41] developed upper bounds on FL convergence performance based on distributed SGD, with a one-step local update before global aggregation. The work in [18] considered partial global aggregation, where after each local update step, parameter aggregation is performed over a non-empty subset of the client set. To analyze the convergence, federated proximal (FedProx) was proposed [14] by adding a regularization term to each local loss function. The work in [36] obtained a convergence bound for SGD-based FL that incorporates non-independent-and-identically-distributed (non-i.i.d.) data distributions among clients.

Meanwhile, privacy preservation has become a worldwide and significant issue for big data applications and distributed learning systems, with the ever-increasing awareness of the security of personal information [20, 29]. One prominent advantage of FL is that the whole training process only needs local training, with no personal data exchanged between the server and clients, thereby protecting clients' data from being eavesdropped by potential adversaries. Nevertheless, private information can still be divulged by analyzing the differences between parameters trained and uploaded by the clients, for example via inference attacks [23, 25, 33] and reconstruction attacks [10]. Therefore, differential privacy (DP), a principled privacy framework that helps mitigate the trade-off between sharing statistical information and protecting privacy, has drawn great attention [3, 5].

Existing works on DP-based learning algorithms include local DP (LDP) [6, 34, 37], DP-based distributed SGD [1, 39] and DP meta-learning [13]. In LDP, each client perturbs its information locally and only sends a randomized version to the central server, thereby protecting both the clients and the server against private information leakage. The work in [34] proposed
solutions for building an LDP-compliant SGD, which powers a variety of important ML tasks. The work in [37] considered distributed estimation at the server over data uploaded from clients, while protecting these data with LDP. The work in [22] introduced an algorithm for user-level differentially private training of large neural networks, in particular a complex sequence model for next-word prediction. The work in [28] developed a chain abstraction model on tensors to efficiently override operations (or encode new ones), such as sending or sharing a tensor between workers, and then provided the elements needed to implement recently proposed DP and multi-party computation protocols using this framework. The work in [1] improved the computational efficiency of DP-based SGD by tracking detailed information about the privacy loss and obtained accurate estimates of the overall privacy loss. The work in [39] proposed novel DP-based SGD algorithms and analyzed their performance bounds, which were shown to be related to privacy levels and the sizes of the datasets. Also, the work in [13] focused on the class of gradient-based parameter-transfer methods and developed a DP-based meta-learning algorithm that not only satisfies the privacy requirement but also retains provable learning performance in convex settings.

More specifically, DP-based FL approaches are usually devoted to capturing the tradeoff between privacy and convergence performance in the training process. The work in [8] proposed an FL algorithm with the consideration of preserving clients' privacy. This algorithm can achieve good training performance at a given privacy level, especially when there is a sufficiently large number of participating clients. The work in [32] presented an alternative approach that utilizes both DP and secure multi-party computation (SMC) to prevent differential attacks. However, these two works on DP-based FL design did not take into account privacy protection during the parameter uploading stage, i.e., the clients' private information can potentially be intercepted by adversaries when the training results are uploaded to the server. Moreover, these two works only showed empirical results and lacked a theoretical analysis of the FL system, such as the tradeoffs between privacy, convergence performance, and convergence rate.

To the authors' knowledge, a theoretical analysis of the convergence behavior of FL with privacy-preserving noise perturbations has not yet been considered in existing studies, which is the major focus of this work. Compared with previous works, such as [8, 32], which focus mainly on simulation results, our theoretical performance analysis is more efficient for finding the optimal parameters related to system performance, e.g., the number of chosen clients K and the maximum number of aggregation times T, to achieve the minimum loss function.

In this chapter, we propose a novel framework to effectively prevent information leakage. We use the concept of DP, where each client perturbs its trained parameters locally by purposely adding noise before uploading them to the server for aggregation, namely, noising before model aggregation FL (NbAFL). Then, we analyze the convergence properties of differentially private FL algorithms and design a differentially private FL-based client selection (DP-FedCS) algorithm, which can dynamically select the number of training clients. The main contributions of this chapter are summarized as follows:
• We prove that the proposed NbAFL scheme satisfies the requirement of global DP, in terms of the information leakage from uplink and downlink channels, under a certain Gaussian noise perturbation level obtained by properly adapting the noise variance.
• We develop a convergence bound on the loss function of the trained FL model in NbAFL. Our developed bound reveals the following three key properties: (1) there is a tradeoff between the convergence performance and the privacy protection level, i.e., better convergence performance leads to a lower protection level; (2) increasing the number N of overall clients participating in FL can improve the convergence performance, given a fixed privacy protection level; and (3) there is an optimal maximum number of aggregation times in terms of convergence performance for a given protection level.
• We propose a K-client random scheduling strategy, where K (1 ≤ K < N) clients are randomly selected from the N overall clients to participate in each aggregation. We also develop a corresponding convergence bound on the loss function in this case. From our analysis, the K-client random scheduling strategy retains the above three properties. Also, we find that there exists an optimal value of K that achieves the best convergence performance at a fixed privacy level.
• To optimize the tradeoff between convergence performance and privacy guarantee, we further propose a DP-FedCS algorithm, which reduces the value of K to decrease the noise variance at a certain privacy level when the training process stops improving. Correspondingly, varying variances of additive Gaussian noises are applied in this algorithm.
• We conduct extensive simulations based on real-world datasets to validate the properties of our theoretical bound in NbAFL. Evaluations demonstrate that our theoretical results are consistent with the simulations. Moreover, our experimental results show that the proposed DP-FedCS algorithm achieves a better tradeoff between convergence performance and privacy level compared with the NbAFL algorithm.

A summary of basic concepts and notations is provided in Table 3.1.
3.2 Preliminaries

First, we will present preliminaries and related background knowledge on FL and DP. Then, we introduce the threat model that will be discussed in our following analysis.
3.2.1 Federated Learning

Let us consider a general FL system consisting of one server and N clients. Let Di denote the local database owned by client Ci (i ∈ {1, 2, . . . , N}). The goal of FL is to learn a model over the data that reside at the N associated clients.
Table 3.1 Summary of main notations

Ci: the i-th client
N: total number of all clients
D: the dataset held by all the clients
Di: the dataset held by the owner Ci
Di′: the adjacent dataset of Di
K: the number of chosen clients
T: the number of aggregation times
t: the index of the t-th aggregation
|·|: the cardinality of a set
||·||: the norm of a vector
B: the dissimilarity of various clients
pi: the ratio of the data size for the i-th client, i.e. pi = |Di|/|D|
L: the exposure times of uploaded parameters from each client
C: the clipping threshold used to bound the norm of the uploaded parameters
M: a randomized mechanism for DP
ε, δ: the parameters related to the DP
X: the domain of the dataset
R: the domain of the output M(Di)
N: the Gaussian distribution
m: the minimum size of all local datasets
σ: the standard deviation of additive Gaussian noise
σU: the standard deviation of the additive Gaussian noise for the uplink channel
σD: the standard deviation of the additive Gaussian noise for the downlink channel
c: the constant used to adapt σ to satisfy (ε, δ)-DP
s: the function used to preprocess the data before publishing or sharing
Δs: the sensitivity of the function s
nD: the additive noise for the downlink channel
l: the constant of the Polyak–Lojasiewicz condition
β: the constant of Lipschitz continuity
ρ: the constant of Lipschitz smoothness
V(t): the test accuracy at the t-th aggregation time
ζ: the threshold to decide whether the FL stops improving
εi: the divergence metric between the i-th local gradient and the global gradient
ε^cost: the cumulative privacy budget
w: the vector of model parameters
w*: the optimal parameters that minimize F(w)
w(0): initial parameters of the global model
w(t): global parameters generated from all local parameters
w̃(t): global parameters generated from all perturbed local parameters
wi(t): local training parameters of the i-th client
w̃i(t): local training parameters after adding noises
v(t): global parameters under K-client random scheduling
Fi(w): local loss function of the i-th client
∇Fi(w): the gradient of the i-th local loss function
F(w): global loss function
sU^Di: the local training process for the i-th client
ΔsU^Di: the sensitivity of the local training process
sD^Di: the aggregation process for the i-th client
ΔsD^Di: the sensitivity of the aggregation process
(symbol illegible in the source): the difference between the original loss and the optimal loss
An active client Ci, participating in the local training, needs to find a parameter vector wi of an AI model to minimize a certain loss function Fi(·). Then, the server aggregates the parameters received from the N clients as

w = Σ_{i=1}^{N} pi wi,    (3.1)

where wi is the parameter vector trained at the i-th client, w is the parameter vector after aggregation at the server, N is the number of clients, pi = |Di|/|D| ≥ 0 with Σ_{i=1}^{N} pi = 1, and |D| = Σ_{i=1}^{N} |Di| is the total size of all data samples. Formally, such an optimization problem can be formulated as

w* = arg min_w Σ_{i=1}^{N} pi Fi(w, Di),    (3.2)

where Fi(·) is the local loss function of the i-th client. Generally, the local loss function Fi(·) is given by local empirical risks. The training process of such an FL system usually consists of the following four steps:

• Step 1 (Local training): All active clients locally train parameters based on their own datasets and send the trained ML model parameters to the server;
• Step 2 (Model aggregating): The server performs secure aggregation over the uploaded model parameters from the N clients without learning the clients' information;
• Step 3 (Parameters broadcasting): The server broadcasts the aggregated model parameters to all N clients;
• Step 4 (Model updating): All clients test the performance of the aggregated model and update their respective models with the aggregated model parameters using their own datasets.
In the FL process, all N clients should own the same data structure and collaboratively learn an ML model with the help of a central server. After a sufficient number of local training and update exchanges between the server and its associated clients, the solution to the optimization problem (3.2) can converge to that of the global optimal learning model.
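A minimal sketch of the server-side aggregation in Eq. (3.1) is shown below; the client parameters and dataset sizes are illustrative assumptions.

```python
import numpy as np

def aggregate(client_params, client_sizes):
    """Server-side aggregation of Eq. (3.1): w = sum_i p_i * w_i with p_i = |D_i| / |D|."""
    total = float(sum(client_sizes))
    weights = [size / total for size in client_sizes]
    return sum(p * w for p, w in zip(weights, client_params))

# Three clients with different amounts of local data upload their trained parameters.
client_params = [np.array([0.2, 1.0]), np.array([0.4, 0.8]), np.array([0.1, 1.2])]
client_sizes = [100, 300, 600]
global_params = aggregate(client_params, client_sizes)
print(global_params)
```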
3.2.2 Differential Privacy

A strong criterion for privacy preservation in distributed data processing systems is provided by (ε, δ)-DP. Here, ε > 0 represents the distinguishable bound on all outputs for neighboring datasets Di, Di′ in a database, and δ represents the probability that the ratio of the probabilities of the outputs on the two adjacent datasets Di, Di′ cannot be bounded by e^ε after applying a privacy-preserving mechanism M. With an
arbitrarily given δ, a DP mechanism with a larger ε gives a clearer distinguishability of neighboring datasets and thus a higher risk of privacy violation. Now, the concept of DP can be defined as follows.

Definition 3.1 ((ε, δ)-DP [5]) A randomized mechanism M : X → R with domain X and range R satisfies (ε, δ)-DP, if for all measurable sets S ⊆ R and for any two adjacent databases Di, Di′ ∈ X,

Pr[M(Di) ∈ S] ≤ e^ε Pr[M(Di′) ∈ S] + δ.    (3.3)

A Gaussian mechanism defined in [5] can be used to guarantee (ε, δ)-DP for numerical data. According to [5], the following DP mechanism based on adding artificial Gaussian noise has been presented. In order to ensure that artificial Gaussian noise following the distribution n ∼ N(0, σ²) preserves (ε, δ)-DP, where N represents the Gaussian distribution, we choose the noise scale σ ≥ cΔs/ε with the constant c ≥ √(2 ln(1.25/δ)) for ε ∈ (0, 1). In this result, n is the value of an additive noise sample for a data point in the dataset, Δs = max_{Di,Di′} ||s(Di) − s(Di′)|| is the sensitivity governed by the function s, and s is a real-valued function. Considering the concept of the DP mechanism, choosing an appropriate level of noise remains a significant research problem, since it affects both the privacy guarantee of the clients and the convergence rate of the FL process.
3.2.3 Threat Model

In this chapter, we assume the server is honest. However, there exist external adversaries targeting clients' private information, as depicted in Fig. 3.1. Although the individual dataset Di is kept locally at the i-th client in FL, the intermediate parameter wi, which needs to be shared with the server, may reveal the clients' private information. For example, the authors in [7] demonstrated a model-inversion attack that recovers original images from the ML model of a facial recognition system. Also, privacy leakage can happen in the broadcasting phase (through downlink channels) by utilizing and analyzing the global parameter w [10, 33]. We also assume that uplink channels are more secure than downlink broadcasting channels, since clients can be assigned to different channels (e.g., time slots, frequency bands) dynamically at each uploading time, while downlink channels are broadcast. Hence, we assume that there are at most L (L ≤ T) exposures of uploaded parameters from each client in the uplink (here we assume that the adversary cannot know where the parameters come from) and T exposures of aggregated parameters in the downlink, where T is the number of aggregation times.
Fig. 3.1 An FL training model with a hidden adversary who can eavesdrop on the trained parameters from both the clients and the server
3.3 Federated Learning with Differential Privacy

In this section, we will first introduce the concept of global DP and analyze the DP performance in the context of FL. Then, we will propose the NbAFL algorithm, which can satisfy the global DP requirement by adding proper noise perturbations to the interactive parameters at both the clients' and the server's sides.
3.3.1 Global Differential Privacy

Here, a global (ε, δ)-DP requirement is defined for both uplink and downlink channels. From the uplink perspective, using a clipping technique with a clipping threshold C, we can ensure that ||wi|| ≤ C, where wi denotes the training parameters from the i-th client without perturbation. We assume that in the local training, the batch size is equal to the number of training samples. Then, the local training process at the i-th client can be defined by

sU^Di: wi = arg min_w Fi(w, Di) = arg min_w (1/|Di|) Σ_{j=1}^{|Di|} Fi(w, Di,j),    (3.4)
where Di is the i-th client's dataset and Di,j is the j-th sample in Di. Therefore, the sensitivity of sU^Di can be expressed as

ΔsU^Di = max_{Di,Di′} ||sU^Di − sU^Di′||
       = max_{Di,Di′} || arg min_w (1/|Di|) Σ_{j=1}^{|Di|} Fi(w, Di,j) − arg min_w (1/|Di′|) Σ_{j=1}^{|Di′|} Fi(w, D′i,j) ||
       = 2C/|Di|,    (3.5)

where Di′ is a dataset adjacent to Di, which has the same size but differs by only one sample, and D′i,j is the j-th sample in Di′. From the above result, we can define a global sensitivity for the uplink channel as

ΔsU ≜ max ΔsU^Di, ∀i.    (3.6)

We note that if all the clients use sufficiently large local datasets for training, we can obtain a small global sensitivity. Hence, the minimum size of the local datasets can be defined as m, and then we obtain ΔsU = 2C/m. To ensure (ε, δ)-DP for each client in the uplink for one exposure, we set the standard deviation of the additive Gaussian noise as σU = cΔsU/ε. Considering L exposures of local parameters, we need to set σU = cLΔsU/ε because of the linear relation between ε and σU in the Gaussian mechanism.

From the downlink perspective, for Di, the aggregation operation can be expressed as

sD^Di: w = p1 w1 + · · · + pi wi + · · · + pN wN, 1 ≤ i ≤ N,    (3.7)

where w is the aggregated parameter vector at the server to be broadcast to the clients. Regarding the sensitivity of sD^Di, i.e., ΔsD^Di, we can obtain the following lemma.

Lemma 3.2 (Sensitivity of the aggregation operation) The sensitivity for Di after the aggregation operation sD^Di in the FL training process can be given by

ΔsD^Di = 2C pi / m.    (3.8)

Proof We first analyze the difference between sD^Di and sD^Di′. Then, we derive the upper bound of this difference. For details, see Appendix A in [38].

Remark 3.3 From Lemma 3.2, the ideal condition for achieving a small global sensitivity in the downlink channel, which is defined by

ΔsD ≜ max ΔsD^Di = max 2C pi / m, ∀i,    (3.9)
is that all the clients should use the same size of local datasets for training, i.e., p_i = 1/N. From the above remark, we can obtain the optimal value of the sensitivity Δs_D by setting p_i = 1/N, ∀i. So here we should add noise at the client side first and then decide whether or not to add noise at the server to satisfy the (ε, δ)-DP criterion in the downlink channel.

Theorem 3.4 (DP guarantee for downlink channels) To ensure (ε, δ)-DP in the downlink channels with T aggregations, the standard deviation (STD) of the Gaussian noises n_D that are added to the aggregated parameter w by the server can be given as

σ_D = { 2cC√(T² − L²N) / (mNε),   T > L√N,
        0,                          T ≤ L√N.    (3.10)

Proof We first derive the STD of aggregated Gaussian noises due to uplink protection. Then we compute the required STD of noises considering downlink channels. For details, see Appendix B in [38].

Theorem 3.4 shows that to satisfy an (ε, δ)-DP requirement for the downlink channels, additional noises n_D need to be added by the server. With a certain L, the standard deviation of the additional noises depends on the relationship between the number of aggregation times T and the number of clients N. The intuition is that a larger T leads to a higher chance of information leakage, while a larger number of clients helps hide their private information. This theorem also provides the variance of the noises that should be added to the aggregated parameters. Based on the results in Theorem 3.4, we propose the following NbAFL algorithm.
3.3.2 Proposed NbAFL Here, we will introduce the proposed NbAFL algorithm. Algorithm 3.1 outlines our NbAFL, which aims to train an effective model with a global (ε, δ)-DP requirement. In this algorithm, we use μ and w^(0) to represent the preset constant of the proximal term and the initial global parameter, respectively. At the beginning of this algorithm, the required privacy level parameters (ε, δ) and the initial global parameter w^(0) are sent to the clients by the server. In the tth aggregation, N active clients respectively train the parameters using their local datasets with preset termination conditions. After completing the local training, the ith client, ∀i, adds noise to the trained parameters w_i^(t) and uploads the noised parameters w̃_i^(t) to the server for aggregation. Then the server updates the global parameters w^(t) by aggregating the local parameters with their respective weights. Additive noises n_D^(t) are added to this w^(t) according to Theorem 3.4 before it is broadcast to the clients.
Algorithm 3.1: Noising before Aggregation FL (NbAFL)
Data: T, w^(0), μ, ε and δ
Initialization: t = 1 and w_i^(0) = w^(0), ∀i
while t ≤ T do
    Local training process:
    for C_i ∈ {C_1, C_2, . . . , C_N} do
        Update the local parameters w_i^(t) as
            w_i^(t) = arg min_{w_i} ( F_i(w_i) + (μ/2)‖w_i − w^(t−1)‖² )
        Clip the local parameters
            w_i^(t) = w_i^(t) / max(1, ‖w_i^(t)‖/C)
        Add noise and upload parameters w̃_i^(t) = w_i^(t) + n_i^(t)
    Model aggregating process:
    Update the global parameters w^(t) as
        w^(t) = Σ_{i=1}^N p_i w̃_i^(t)
    The server broadcasts the global noised parameters
        w̃^(t) = w^(t) + n_D^(t)
    Local testing process:
    for C_i ∈ {C_1, C_2, . . . , C_N} do
        Test the aggregated parameters w̃^(t) using the local dataset
    t ← t + 1
Result: w̃^(T)
Based on the received global parameters w̃^(t), each client will estimate the accuracy using its local testing dataset and start the next round of the training process based on these received parameters. The FL process completes after the number of aggregations reaches the preset number T, and the algorithm returns w̃^(T). Now, let us focus on the privacy preservation performance of the NbAFL. First, the set of all local parameters is received by the server. Owing to the local perturbations in the NbAFL, it will be difficult for malicious adversaries to infer the information of the ith client from its uploaded parameters w̃_i. After the model aggregation, the aggregated parameters w will be sent back to the clients via broadcast channels. This poses a threat to the clients' privacy, as potential adversaries may reveal sensitive information about individual clients from w. In this case, additive noises are imposed on w based on Theorem 3.4.
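The following Python sketch illustrates, under simplifying assumptions, one NbAFL aggregation round: client-side clipping and uplink noising, server-side weighted averaging, and downlink noising with the STD of (3.10). The helper names (nbafl_round, local_train, sigma_downlink), the NumPy representation of the parameters, and the uniform weights p_i = 1/N are assumptions of this illustration, not the authors' reference implementation.

import numpy as np

def sigma_downlink(c, C, T, L, N, m, eps):
    # Server-side noise STD from (3.10): zero when T <= L*sqrt(N).
    if T <= L * np.sqrt(N):
        return 0.0
    return 2 * c * C * np.sqrt(T**2 - L**2 * N) / (m * N * eps)

def nbafl_round(server_w, client_datasets, C, sigma_u, sigma_d, local_train):
    # One NbAFL round: each client trains locally (the proximal term is assumed
    # to be handled inside local_train), clips its parameters to norm C and adds
    # uplink Gaussian noise; the server averages the noisy updates and adds
    # downlink noise before broadcasting.
    rng = np.random.default_rng()
    noisy_updates = []
    for data in client_datasets:
        w_i = local_train(server_w, data)                 # local optimisation
        w_i = w_i / max(1.0, np.linalg.norm(w_i) / C)     # norm clipping
        noisy_updates.append(w_i + rng.normal(0.0, sigma_u, w_i.shape))
    w_agg = np.mean(noisy_updates, axis=0)                # p_i = 1/N
    return w_agg + rng.normal(0.0, sigma_d, w_agg.shape)  # downlink noise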
3.4 Convergence Analysis on NbAFL In this section, the convergence performance of the proposed NbAFL is analyzed. First, we analyze the upper bound of the expected increment in the loss function between adjacent aggregations. Then, we focus on deriving the convergence property for the whole training process under the global (ε, δ)-DP requirement.
For the convenience of the analysis, we make the following assumptions on the loss function and network parameters.

Assumption 3.5 We make the following assumptions on the global loss function F(·), defined by F(·) ≜ Σ_{i=1}^N p_i F_i(·), and on the ith local loss function F_i(·): (1) F_i(w) is convex; (2) F_i(w) satisfies the Polyak–Lojasiewicz condition with a positive parameter l, which implies that F(w) − F(w*) ≤ (1/2l)‖∇F(w)‖², where w* is the optimal result; (3) F(w^(0)) − F(w*) = Θ; (4) F_i(w) is β-Lipschitz, i.e., ‖F_i(w) − F_i(w')‖ ≤ β‖w − w'‖, for any w, w'; (5) F_i(w) is ρ-Lipschitz smooth, i.e., ‖∇F_i(w) − ∇F_i(w')‖ ≤ ρ‖w − w'‖, for any w, w', where ρ is a constant determined by the practical loss function; (6) for any i and w, ‖∇F_i(w) − ∇F(w)‖ ≤ ε_i, where ε_i is the divergence metric.

Similar to the gradient divergence, the divergence metric ε_i captures the divergence between the gradients of the local loss functions and that of the aggregated loss function, which is essential for analyzing SGD. The divergence is related to how the data is distributed across different nodes. Using Assumption 3.5 and assuming ∇F(w) to be uniformly away from zero, we have the following lemma.

Lemma 3.6 (B-dissimilarity of various clients) For a given ML parameter w, there exists a B satisfying

E{‖∇F_i(w)‖²} ≤ ‖∇F(w)‖² B², ∀i.    (3.11)

Proof Via the divergence metric in Assumption 3.5, we can bound E{‖∇F_i(w)‖²}. For details, see Appendix C in [38].

Lemma 3.6 follows from the assumption on the divergence metric and reflects the statistical heterogeneity of all clients. As mentioned earlier, the values of ρ and B are determined by the specific global loss function F(w) in practice and the training parameters w. With the above preparation, we are now ready to analyze the convergence property of NbAFL. First, we present the following lemma to derive an expected increment bound on the loss function during each iteration of the parameters with artificial noises.

Lemma 3.7 (Expected increment on the loss function) After receiving updates, the expected difference in the loss function from the tth to the (t + 1)th aggregation can be upper-bounded by
E{F(w̃^(t+1)) − F(w̃^(t))} ≤ λ₂ E{‖∇F(w̃^(t))‖²} + λ₁ E{‖n^(t+1)‖ ‖∇F(w̃^(t))‖} + λ₀ E{‖n^(t+1)‖²},    (3.12)

where λ₀ = ρ/2, λ₁ = 1/μ + ρB/μ, λ₂ = −1/μ + ρB/μ² + ρB²/(2μ²), and n^(t) is the equivalent noise imposed on the parameters after the tth aggregation, given by n^(t) = Σ_{i=1}^N p_i n_i^(t) + n_D^(t).
Proof First, we can use the Taylor expansion to bound E{F(w̃^(t+1)) − F(w̃^(t))}. Then, we can derive this upper bound by our assumptions. For details, see Appendix D in [38]. In this lemma, the value of an additive noise sample n in the vector n^(t) satisfies the
following Gaussian distribution N(0, σ_A²). Also, we can obtain σ_A = √(σ_D² + σ_U²/N) from Sect. 3.3. From the right-hand side (RHS) of the above inequality, we can see that it is crucial to select a proper proximal term μ to achieve a low upper bound. It is clear that artificial noises with a large σ_A may improve the DP performance in terms of privacy protection. However, from the RHS of (3.12), a large σ_A may enlarge the expected difference of the loss function between two consecutive aggregations, leading to a deterioration of the convergence performance. Furthermore, to satisfy the global (ε, δ)-DP, by using Theorem 3.4, we have

σ_A = { cTΔs_D/ε,        T > L√N,
        cLΔs_U/(√N ε),   T ≤ L√N.    (3.13)
Next, we will analyze the convergence property of NbAFL with the (ε, δ)-DP requirement.

Theorem 3.8 (Convergence upper bound of the NbAFL) With required protection level ε, the convergence upper bound of Algorithm 3.1 after T aggregations is given by

E{F(w̃^(T)) − F(w*)} ≤ Θ P^T + (κ₁T/ε + κ₀T²/ε²)(1 − P^T),    (3.14)

where P = 1 + 2lλ₂, κ₁ = λ₁βcC√(2/(Nπ)) / (m(1 − P)), and κ₀ = λ₀c²C² / (m²(1 − P)N).    (3.15)
Proof Here, we can derive this upper bound with the Polyak–Lojasiewicz condition. For details, see Appendix E in [38].

Theorem 3.8 reveals an important relationship between privacy and utility by taking into account the protection level ε and the number of aggregation times T. As the number of aggregation times T increases, the first term of the upper bound decreases but the second term increases. Furthermore, by viewing T as a continuous variable and writing the RHS of (3.14) as h(T), we have

d²h(T)/dT² = (Θ − κ₁T/ε − κ₀T²/ε²) P^T ln²P − 2(κ₁/ε + 2κ₀T/ε²) P^T ln P + (2κ₀/ε²)(1 − P^T).    (3.16)
It can be seen that the second and third terms on the RHS of (3.16) are always positive. When N and ε are set to be large enough, κ₁ and κ₀ are small, and thus the first term can also be positive. In this case, we have d²h(T)/dT² > 0 and the upper bound is convex in T.

Remark 3.9 As can be seen from this theorem, the expected gap between the achieved loss function F(w̃^(T)) and the minimum one F(w*) is a decreasing function of ε. By increasing ε, i.e., relaxing the privacy protection level, the performance of the NbAFL algorithm will improve. This is reasonable because the variance of the artificial noises decreases, thereby improving the convergence performance.

Remark 3.10 The number of clients N also affects the iterative convergence performance, i.e., a larger N achieves a better convergence performance. This is because a larger N leads to a lower variance of the artificial noises.

Remark 3.11 There is an optimal number of maximum aggregation times T in terms of convergence performance for given ε and N. In more detail, a larger T may lead to a higher variance of artificial noises and thus have a negative impact on convergence performance. On the other hand, more iterations can generally boost convergence performance if the noises are not too large. In this sense, there is a tradeoff in choosing a proper T.
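As a rough numerical illustration of Remark 3.11, the sketch below minimises the reconstructed bound h(T) of (3.14) over T for one set of made-up constants. The constant values are purely illustrative assumptions, since P, κ₀, and κ₁ depend on the loss function and the DP noise level (Theorem 3.8).

import numpy as np

def h(T, theta, P, kappa1, kappa0, eps):
    # RHS of the convergence bound (3.14), treated as a function of T.
    return theta * P**T + (kappa1 * T / eps + kappa0 * T**2 / eps**2) * (1 - P**T)

# Illustrative constants only.
theta, P, kappa1, kappa0, eps = 2.0, 0.9, 0.05, 0.002, 60.0
Ts = np.arange(1, 101)
T_opt = Ts[np.argmin(h(Ts, theta, P, kappa1, kappa0, eps))]
print("bound-minimising number of aggregations:", T_opt)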
3.5 K-Client Random Scheduling Policy
In this section, we consider the case where only K (K < N) clients are selected to participate in the aggregation process, namely K-client random scheduling. We now discuss how to add artificial noises in the K-client random scheduling to satisfy a global (ε, δ)-DP. It is obvious that in the uplink channels, each of the K scheduled clients should add noises with scale σ_U = cLΔs_U/ε for achieving (ε, δ)-DP. This is equivalent to the noise scale in the all-clients selection case in Sect. 3.3, since each client only considers its own privacy for the uplink channels in both cases. However, the derivation of the noise scale in the downlink will be different for the K-client random scheduling. As an extension of Theorem 3.4, we present the following lemma on how to obtain σ_D in the case of K-client random scheduling.

Lemma 3.12 (DP guarantee for K-client random scheduling) In the NbAFL algorithm with K-client random scheduling, to satisfy a global (ε, δ)-DP, the standard deviation σ_D of the additive Gaussian noises for downlink channels should be set as

σ_D = { 2cC√(T²/b² − L²K) / (mKε),   T > ε/γ,
        0,                            T ≤ ε/γ,    (3.17)

where b = −(T/ε) ln(1 − N/K + (N/K)e^{−ε/T}) and γ = −ln(1 − K/N + (K/N)e^{−ε/(L√K)}).
Proof We derive the standard deviation of the aggregated Gaussian noises considering sampling. For details, see Appendix F in [38].

Lemma 3.12 recalculates σ_D by considering the number of chosen clients K. Generally, the number of clients N is fixed; we thus focus on the effect of K. Based on the DP analysis in Lemma 3.12, we can obtain the following theorem.

Theorem 3.13 (Convergence under K-client random scheduling) With required protection level ε and the number of chosen clients K, for any ε > 0, the convergence upper bound after T aggregation times is given by
E{F(v^(T)) − F(w*)} ≤ Q^T Θ + (1 − Q^T)/(1 − Q) ( √(2/π) cCα₁β / (−mKε ln(1 − K/N + (K/N)e^{−ε/T})) + c²C²α₀ / (m²K²ε² ln²(1 − K/N + (K/N)e^{−ε/T})) ),    (3.18)

where Q, α₁ and α₀ are constants determined by ρ, B, l, μ, K and N (their exact expressions follow from the derivation in Appendix G of [38]), and v^(T) = Σ_{i=1}^K p_i (w_i^(T) + n_i^(T)) + n_D^(T).
Proof Here, we derive the convergence upper bound considering sampling. For details, see Appendix G in [38].

The above theorem provides the convergence upper bound between F(v^(T)) and F(w*) under K-client random scheduling. Using K-client random scheduling, we can capture an important relationship between privacy and learning performance by taking into account the protection level ε, the number of chosen clients K, and the number of aggregation times T.

Remark 3.14 From the bound derived in Theorem 3.13, we can conclude that there exists an optimal K (0 < K < N) that achieves the optimal convergence performance. That is, by finding a proper K, the K-client random scheduling policy is superior to the policy in which all N clients participate in each FL aggregation.
3.6 Differentially Private FL Based Client Selection Based on the above analysis, we reveal that there exists an optimal K (0 < K < N) that achieves the optimal convergence performance. In more detail, a larger K degrades the model quality by increasing the amount of noise added in each communication round for a given ε (in line with Lemma 3.12), but it also has a positive impact on the convergence performance because it reduces the variance of all local models. Therefore, it is valuable to propose a client selection algorithm to improve the convergence performance in the training process. We thus provide a DP-FedCS algorithm, which can adjust the number of participants dynamically. Moreover, we analyze the STDs of the additive noises in the case of varying K during the FL training.
3.6.1 Algorithm Description Based on the above analysis, we can note that an improper value of K will damage the performance of the NbAFL algorithm. Hence, if we reduce the value of K slightly when the training performance stops improving, we can obtain a smaller STD as well as improving the training performance. Therefore, we design a DP-FedCS algorithm by adjusting the number of participants with a discounting method during the training process to achieve a better convergence performance. The training process of the DP-FedCS algorithm in the NbAFL contains the following steps:
• Step 1: Initializing: The server broadcasts the initial parameters w^(0) and N;
• Step 2: Local Training: All active clients locally compute training parameters with their local datasets and the global parameter;
• Step 3: Norm Clipping: In order to provide the DP guarantee, the influence of each individual example on the local parameters should be bounded with a clipping threshold C. We can remark that parameter clipping of this form is a popular ingredient of SGD and ML for non-privacy reasons;
• Step 4: Noise Adding for Uplink Channels: Artificial Gaussian noises with a certain STD σ_U, using our noise recalculation (introduced in the following subsection), will be added to the locally trained parameters;
• Step 5: Parameter Uploading: All active clients upload the noised parameters to the server for aggregation;
• Step 6: Model Aggregation: The server performs aggregation over the uploaded parameters from the clients;
• Step 7: Noise Adding for Downlink Channels: The server adds Gaussian noises with a certain STD σ_D^(t) to the aggregated parameter. We can note that σ_D^(t) varies because of the changing K;
• Step 8: Model Broadcasting: The server broadcasts the aggregated parameters to the active clients;
• Step 9: Model Updating: All clients update their respective models with the aggregated parameters, then test the performance of the updated models and upload the performance to the server;
• Step 10: Participant Number Discounting: When the convergence performance stops improving according to the decision rule V^(t+1) − V^(t) < ζ, where V^(t) is the test accuracy at the tth aggregation time and ζ is a preset threshold, the server will obtain a smaller K than the previous one with a linear discounting factor α by K = max{α · K, 1}. This factor controls the decaying speed of K.
The FL process is completed when the aggregation time reaches the preset T. In this method, the value of K is determined iteratively to ensure a high convergence performance in FL training. When the value of K is adjusted, we must recalculate a new STD of the additive noises in terms of the previous training process. Therefore, we develop a noise recalculation method to update the STD of the additive noises and K alternately in the following subsection.
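A compact Python sketch of the participant-number discounting (Step 10) might look as follows. Here train_round and test_accuracy stand in for Steps 2–9 of one aggregation and are assumptions of this illustration, as is the integer rounding of α·K.

import numpy as np

def dp_fedcs(train_round, test_accuracy, N, T, alpha, zeta):
    # Whenever test accuracy stops improving by more than zeta, shrink the
    # number of scheduled clients K by the factor alpha (Step 10); the downlink
    # noise STD is then recalculated for the remaining rounds (Sect. 3.6.2).
    K, prev_acc = N, -np.inf
    for t in range(T):
        model = train_round(t, K)          # Steps 2-9 for round t
        acc = test_accuracy(model)
        if acc - prev_acc < zeta:          # Step 10: discount K
            K = max(int(alpha * K), 1)
        prev_acc = acc
    return K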
3.6.2 Noise Recalculation for Varying K From our analysis of σ_U and σ_D, we can note that a varying K will affect the value of σ_D, which means that in the proposed DP-FedCS algorithm there may exist different σ_D for different aggregation times. Therefore, with a new K, we need to recalculate the STD of the noises and add them to the aggregated parameters in the following aggregation times. In order to analyse the noise recalculation method, let t be the index of the aggregation time and σ_D^(τ) (0 ≤ τ ≤ t − 1) be the STD of the additive noises at the τth aggregation time. Then, we can obtain the following theorem.

Theorem 3.15 After t (0 ≤ t < T) aggregation times and with a new K, the STD of the additive noises for downlink channels to guarantee a global (ε, δ)-DP can be given as

σ_D = { 2cC√((T − t)²/b² − L²K) / (mK(ε − ε_cost)),   T > (ε − ε_cost)/γ + t,
        0,                                              T ≤ (ε − ε_cost)/γ + t,    (3.19)

where

ε_cost = − Σ_{τ=0}^{t−1} ln( q e^{−cΔs_D/σ_A^(τ)} − q + 1 ),    (3.20)

and

σ_A^(τ) = √( (σ_D^(τ))² + (σ_U^(τ))²/N ).    (3.21)
Proof We define the sampling parameter q ≜ K/N to represent the probability of each client being selected by the server in an aggregation. Let M_{1:T} denote (M_1, . . . , M_T) and similarly let o_{1:T} denote a sequence of outcomes (o_1, . . . , o_T). Considering the global (ε, δ)-DP requirement in the downlink channels for the ith client, we use σ_A^(t) to represent the standard deviation of the aggregated Gaussian noises at the tth aggregation time. With neighboring datasets D_i and D_i', we are looking at

ln( Pr[M_{1:T}(D_{i,1:T}) = o_{1:T}] / Pr[M_{1:T}(D'_{i,1:T}) = o_{1:T}] )
  = Σ_{t=1}^T ln( ( (1 − q) e^{−(n^(t))²/(2(σ_A^(t))²)} + q e^{−(n^(t)+Δs_D)²/(2(σ_A^(t))²)} ) / e^{−(n^(t))²/(2(σ_A^(t))²)} )
  = Σ_{t=1}^T ln( 1 − q + q e^{−(2n^(t)Δs_D + Δs_D²)/(2(σ_A^(t))²)} ) ≤ ε.    (3.22)

Equation (3.22) implies that the privacy cost after T aggregations should be bounded by ε. Therefore, after t aggregations with varying noise STDs σ_A^(τ) (0 ≤ τ ≤ t − 1)
and a new noise STD σ_A^(t) for the following T − t aggregations, we know

Σ_{τ=0}^{t−1} ln( 1 − q + q e^{−(2n^(τ)Δs_D + Δs_D²)/(2(σ_A^(τ))²)} ) + (T − t) ln( 1 − q + q e^{−(2nΔs_D + Δs_D²)/(2(σ_A^(t))²)} ) ≥ −ε.    (3.23)

This Eq. (3.23) can be expressed as

(T − t) ln( 1 − q + q e^{−(2nΔs_D + Δs_D²)/(2(σ_A^(t))²)} ) ≥ −(ε − ε_cost),    (3.24)
where ε_cost = Σ_{τ=0}^{t−1} ε^(τ), and ε^(τ) represents the privacy level such that

Pr( −ln( 1 − q + q e^{−(2n^(τ)Δs_D + Δs_D²)/(2(σ_A^(τ))²)} ) ≤ ε^(τ) ) ≥ 1 − δ.    (3.25)
Because of [38], we know that if we set σ_A^(τ) = cΔs_D/b^(τ), we can achieve the (ε^(τ), δ)-DP requirement. In other words, for a certain σ_A^(τ), we have

ε^(τ) ≤ −ln( q e^{−cΔs_D/σ_A^(τ)} − q + 1 )    (3.26)

and

ε_cost = − Σ_{τ=0}^{t−1} ln( q e^{−cΔs_D/σ_A^(τ)} − q + 1 ).    (3.27)
Therefore, the STD of the downlink noise can be given by

σ_D^(t) = { 2cC√((T − t)²/b² − L²K) / (mK(ε − ε_cost)),   T > (ε − ε_cost)/γ + t,
            0,                                              T ≤ (ε − ε_cost)/γ + t.    (3.28)
In Theorem 3.15, we can obtain a proper STD of the additive noises based on the previous training process and varying σD(τ ) . From this result, we can find that if we have large STDs (strong privacy guarantee) in the previous t − 1 training processes, i.e., σD(τ ) is large, the recalculated STD will be small (weak privacy guarantee), i.e., σD(t) is small.
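The following hedged Python sketch shows how the consumed privacy cost of (3.20) and the recalculated downlink STD following the structure of (3.28) could be computed. The quantities b and gamma are taken as given from Lemma 3.12 for the new K, and the function names are illustrative assumptions only.

import numpy as np

def epsilon_cost(q, c, s_D, sigma_A_history):
    # Privacy budget already consumed after the first t aggregations, per (3.20).
    sigma_A = np.asarray(sigma_A_history, dtype=float)
    return -np.sum(np.log(q * np.exp(-c * s_D / sigma_A) - q + 1.0))

def sigma_downlink_recalc(c, C, m, K, L, T, t, eps, eps_cost, b, gamma):
    # Downlink noise STD for the remaining T - t rounds after K changes,
    # following the structure of (3.28).
    eps_rem = eps - eps_cost
    if T <= eps_rem / gamma + t:
        return 0.0
    return 2 * c * C * np.sqrt((T - t)**2 / b**2 - L**2 * K) / (m * K * eps_rem)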
3.7 Experimental Results In this section, we evaluate the proposed NbAFL by using a multi-layer perceptron (MLP) and real-world federated datasets. We conduct experiments by varying the protection level ε, the number of clients N, the number of chosen clients K, and the number of maximum aggregation times T to characterize the convergence property of NbAFL. We conduct experiments on the standard MNIST dataset for handwritten digit recognition, consisting of 60000 training examples and 10000 testing examples [11], where each example is a 28 × 28 gray-level image. Our baseline ML model uses an MLP network with a single hidden layer consisting of 256 hidden units. Also, ReLU units and a softmax over 10 classes (corresponding to the 10 digits) are applied in this feed-forward neural network. For the optimizer of the networks, we use the SGD optimizer and set the learning rate to 0.002. Then, we evaluate this MLP for the multi-class classification task with the standard MNIST dataset, i.e., recognizing digits from 0 to 9, where each client has 100 training samples locally with large privacy levels such as ε = 50, ε = 60 and ε = 100. We also consider the scenario in which each client has 512 training samples locally and choose small protection levels, i.e., ε = 6, ε = 8, and ε = 10, for this experiment. This setting of training samples is in line with the ideal condition in Remark 3.3. We can note that the parameter clipping technique is a popular ingredient of SGD-based ML for non-privacy reasons. A proper value of the clipping threshold C should be considered for the DP-based FL framework. Hence, we utilize the method in [10] and choose C by taking the median of the norms of the unclipped parameters throughout training in the following experiments (except Sect. 3.7.3). The values of ρ, β, l, and B are governed by the specific loss function, and thus we will use estimated values in our experiments [36].
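For reference, a minimal stand-in for the baseline classifier described above (one hidden layer of 256 ReLU units, a 10-way output, and SGD with learning rate 0.002) is sketched below in PyTorch. The use of PyTorch and of CrossEntropyLoss (which folds the softmax into the loss) is an assumption of this sketch, as the chapter does not prescribe a framework.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),            # 28 x 28 gray-level image -> 784 features
    nn.Linear(784, 256),     # single hidden layer of 256 units
    nn.ReLU(),
    nn.Linear(256, 10),      # 10 classes, one per digit
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.002)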
3.7.1 Performance Evaluation on Protection Levels In Fig. 3.2, we choose various protection levels ε = 50, ε = 60 and ε = 100 to show the results of the loss function in NbAFL. Furthermore, we also include a non-private approach to compare with our NbAFL. In this experiment, we set N = 50, T = 25 and δ = 0.01, and compute the values of the loss function as a function of the aggregation time t. As shown in Fig. 3.2, values of the loss function in NbAFL decrease as we relax the privacy guarantees (increasing ε). These observations are in line with Remark 3.9. In Fig. 3.3, we also choose high protection levels ε = 6, ε = 8 and ε = 10 for this experiment, where each client has 512 training samples locally. We set N = 50, T = 25 and δ = 0.01. From Fig. 3.3, we can draw a similar conclusion as in Remark 3.9, namely that values of the loss function in NbAFL decrease as we relax the privacy guarantees.
Fig. 3.2 The comparison of training loss for 50 clients with various protection levels, i.e., ε = 50, ε = 60 and ε = 100, respectively, together with the non-private approach (loss function value versus aggregation time t). Values of the loss function in NbAFL decrease as we relax the privacy guarantees (increasing ε), in line with Remark 3.9

Fig. 3.3 The comparison of training loss for 50 clients with various protection levels, i.e., ε = 6, ε = 8 and ε = 10, respectively, together with the non-private approach (loss function value versus aggregation time t). Values of the loss function in NbAFL decrease with increasing ε, in line with Remark 3.9
3.7.2 Impact of the Number of Chosen Clients K Considering the K-client random scheduling, we evaluate the performance with various protection levels ε = 50, ε = 60 and ε = 100 in Fig. 3.4. For the experimental parameters, we set N = 50, K = 20, T = 25, and δ = 0.01. From Fig. 3.4, we can note that under the K-client random scheduling, the convergence performance is improved with increasing ε.
Fig. 3.4 The comparison of training loss with various privacy levels for 50 clients using ε = 50, ε = 60 and ε = 100, respectively, together with the non-private approach (loss function value versus aggregation time t). The convergence performance under the K-client random scheduling is also improved with increasing ε

Fig. 3.5 The comparison of training loss with various clipping thresholds C = 10, 15, 20 and 25 for 50 clients using ε = 60 (loss function value versus aggregation time t). Limiting the parameter norm has two opposing effects: a small clipping threshold C may destroy the intended gradient direction of the model parameters, while increasing the norm bound C forces us to add more noise to the parameters because of its effect on the sensitivity
3.7.3 Impact of the Clipping Threshold In this subsection, we choose various clipping thresholds C = 10, 15, 20 and 25 to show the results of the loss function for 50 clients using ε = 60 in NbAFL. As shown in Fig. 3.5, the convergence performance of NbAFL is best when C = 20. This is because limiting the parameter norm has two opposing effects. On the one hand, clipping destroys the intended gradient direction of the parameters when the clipping threshold C is too small. On the other hand, as the norm bound C increases, we must add more noise to the parameters because of its effect on the sensitivity.
Fig. 3.6 The value of the loss function with various numbers of clients N = 50, 60, 80 and 100 under ε = 60 under the NbAFL algorithm (loss function value versus aggregation time t). The performance among different numbers of clients is governed by Remark 3.10: more clients not only provide larger global datasets for training but also bring down the standard deviation of the additive noises due to the aggregation
3.7.4 Impact of the Number of Clients N We also compare the convergence performance of NbAFL under the required protection level ε = 60 and δ = 0.01 as a function of the number of clients N in Fig. 3.6. In this experiment, we set N = 50, N = 60, N = 80 and N = 100. We notice that the performance among different numbers of clients is governed by Remark 3.10. This is because, for FL training, more clients not only provide larger global datasets but also bring down the standard deviation of the additive noises due to the aggregation.
3.7.5 Impact of the Number of Maximum Aggregation Times T In Fig. 3.7, we show the experimental results of the training loss and the theoretical results as a function of the number of maximum aggregation times with various privacy levels ε = 60 and ε = 100 under the NbAFL algorithm. This observation is in line with Remark 3.11, and the reason comes from the fact that a lower privacy level decreases the standard deviation of the additive noises, so the server can obtain better-quality ML model parameters from the clients. Figure 3.7 also implies that the optimal number of maximum aggregation times increases with increasing ε. In Fig. 3.8, we plot the values of the loss function in the normalized NbAFL using solid lines and in the K-random scheduling based NbAFL using dotted lines with various numbers of maximum aggregation times. This figure shows that the value of the loss function is a convex function of the number of maximum aggregation times for a given protection level ε under the NbAFL algorithm, which validates Remark 3.11. From Fig. 3.8, we can also see that for a given ε, the K-client random scheduling based NbAFL algorithm has
Fig. 3.7 The convergence upper bounds (theoretical and experimental results) with various privacy levels ε = 50, 60 and 100 under the 50-client NbAFL algorithm (loss function value versus number of maximum aggregation times T). This observation is in line with Remark 3.11; the figure also implies that the optimal number of maximum aggregation times increases with increasing ε

Fig. 3.8 The value of the loss function with various privacy levels ε = 60 and ε = 100 under the NbAFL algorithm with 50 clients, for K = 50 and K = 30, including the non-private approach (loss function value versus number of maximum aggregation times T). The value of the loss function is a convex function of T for a given protection level ε, which validates Remark 3.11; furthermore, for a given ε, K-client random scheduling has a better convergence performance than the normalized NbAFL algorithm for a larger T
a better convergence performance than the normalized NbAFL algorithm for a larger T. This is because K-client random scheduling can bring down the variance of the artificial noises with little performance loss.
3.7.6 Impact of the Number of Chosen Clients K In Fig. 3.9, considering the random scheduling policy in NbAFL, we evaluate the values of the loss function with various numbers of chosen clients K. The number of clients is set to N = 50, and K clients are randomly chosen to participate in local training and aggregation in each iteration. In addition, we also set ε = 50, ε = 60, ε = 100
Fig. 3.9 The value of the loss function with various numbers of chosen clients K under ε = 50, 60, 100 under the NbAFL algorithm and the non-private approach with 50 clients (loss function value versus number of chosen clients K). An optimal K which further improves the convergence performance exists for various protection levels, due to a trade-off between enhanced privacy protection and involving larger global training datasets in each model updating round
and δ = 0.01 in this experiment. Meanwhile, we also exhibit the performance of the non-private approach with various numbers of chosen clients K. Note that an optimal K which further improves the convergence performance exists for various protection levels, due to a trade-off between enhanced privacy protection and involving larger global training datasets in each model updating round. This observation is in line with Remark 3.14. As shown in Fig. 3.9, for a given protection level ε, the K-client random scheduling can achieve a better tradeoff than the normal selection policy.
3.7.7 Performance of DP-FedCS Algorithm In Fig. 3.10, we plot the values of the loss function with various privacy levels under the DP-FedCS algorithm, the NbAFL algorithm and the non-private approach with 50 clients. The number of clients is N = 50, and K clients are randomly chosen to participate in training and aggregation in each iteration. In this experiment, we set T = 50, ε = 6, 8, 10, 12, 14, 16, δ = 0.01, α = 0.8 and ζ = 0.001, respectively. Note that the proposed DP-FedCS further improves the convergence performance compared with the NbAFL algorithm for various protection levels, due to the dynamic number of chosen clients K.
3.8 Conclusion In this chapter, we have focused on information leakage in SGD-based FL. We have first defined a global (ε, δ)-DP requirement for both uplink and downlink channels, and developed the variances of the artificial noises at both the client and server sides. Then, we
Fig. 3.10 The value of the loss function with various privacy levels ε under the DP-FedCS algorithm, the NbAFL algorithm, and the non-private approach with 50 clients (loss function value versus privacy level ε). The proposed DP-FedCS further improves the convergence performance compared with the NbAFL algorithm for various protection levels, due to the dynamic number of chosen clients K
have proposed a novel framework based on the concept of global (ε, δ)-DP, named NbAFL. We have theoretically developed a convergence upper bound on the loss function of the trained FL model in the NbAFL. Then, from the theoretical convergence bounds, we have shown the following properties: (1) there exists a tradeoff between the convergence performance and privacy protection levels, i.e., a lower protection level leads to better convergence performance; (2) we can improve the convergence performance by increasing the number N of overall clients participating in FL, with a given privacy protection level; and (3) there exists an optimal number of maximum aggregation times in terms of convergence performance for a given protection level. Furthermore, we have proposed a K-client random scheduling strategy and also developed a corresponding convergence bound on the loss function in this case. In addition to the above three properties, we have seen that with a fixed privacy level, there exists an optimal value of K that achieves the best convergence performance. Based on this property of the optimal K, we have designed a DP-FedCS algorithm, which reduces the value of K to decrease the noise variance with a certain privacy level when the training process stops improving. Extensive experimental results confirm the effectiveness of our analysis. Therefore, our analytical results are helpful for the design of privacy-preserving FL architectures with different tradeoff requirements on convergence performance and privacy levels.
References 1. M. Abadi, A. Chu, I. Goodfellow, H.B. McMahan, I. Mironov, K. Talwar, L. Zhang, Deep learning with differential privacy, in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), Vienna, Austria (2016), pp. 308–318 2. A. Alekh, D.J.C, Distributed delayed stochastic optimization, in Proceedings of the IEEE Conference on Decision and Control (CDC), Maui, HI, USA (2012)
3. A. Blum, C. Dwork, F. McSherry, K. Nissim, Practical privacy: the SuLQ framework, in Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), Baltimore, Maryland, USA (2005), pp. 128–138 4. Y. Deng, F. Bao, Q. Dai, L.F. Wu, S.J. Altschuler, Scalable analysis of cell-type composition from single-cell transcriptomics using deep recurrent learning. Nat. Methods 16, 311–314 (2019) 5. C. Dwork, A. Roth, The algorithmic foundations of differential privacy. Found. Trends R Theor. Comput. Sci. 9(3–4), 211–407 (2014) 6. U. Erlingsson, V. Pihur, A. Korolova, RAPPOR: randomized aggregatable privacy-preserving Ordinal Response, in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), Scottsdale, Arizona, USA (2014), pp. 1054–1067 7. M. Fredrikson, S. Jha, T. Ristenpart, Model inversion attacks that exploit confidence information and basic countermeasures, in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), Denver, Colorado, USA (2015), pp. 1322–1333 8. R.C. Geyer, T. Klein, M. Nabi, Differentially private federated learning: a client level perspective (2017). arXiv:1712.07557 9. M. Hao, H. Li, X. Luo, G. Xu, H. Yang, S. Liu, Efficient and privacy-enhanced federated learning for industrial artificial intelligence. IEEE Trans. Ind. Inf. 16(10), 6532–6542 (2020) 10. B. Hitaj, G. Ateniese, F. Perez-Cruz, Deep models under the GAN: information leakage from collaborative deep learning, in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), Dallas, Texas, USA (2017), pp. 603–618 11. Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998) 12. H. Lee, S.H. Lee, T.Q.S. Quek, Deep learning for distributed optimization: applications to wireless resource management. IEEE J. Sel. Areas Commun. 37(10), 2251–2266 (2019) 13. J. Li, M. Khodak, S. Caldas, A. Talwalkar, Differentially private meta-learning (2019). arXiv:1909.05830 14. T. Li, A. Kumar Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, V. Smith, Federated optimization in heterogeneous networks, in Proceedings of the Third Conference on Machine Learning and Systems (MLSys), Austin, TX, USA (2020) 15. J. Li, S. Chu, F. Shu, J. Wu, D.N.K. Jayakody, Contract-based small-cell caching for data disseminations in ultra-dense cellular networks. IEEE Trans. Mobile Comput. 18(5), 1042– 1053 (2019) 16. T. Li, A.K. Sahu, A. Talwalkar, V. Smith, Federated learning: Challenges, methods, and future directions. IEEE Signal Proc. Mag. 37(3), 50–60 (2020) 17. J. Li, Z. Xing, W. Zhang, Y. Lin, F. Shu, Vehicle tracking in wireless sensor networks via deep reinforcement learning. IEEE Sens. Lett. 4(3), 1–4 (2020) 18. X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, J. Liu, Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent, in Proceedings of the ACM Neural Information Processing Systems (NIPS), Long Beach, California, USA (2017), pp. 5336–5346 19. Q. Liu, L. Shi, L. Sun, J. Li, M. Ding, F. Shu, Path planning for UAV-mounted mobile edge computing with deep reinforcement learning. IEEE Trans. Veh. Technol. 69(5), 5723–5728 (2020) 20. C. Ma, J. Li, M. Ding, H. Hao Yang, F. Shu, T.Q.S. Quek, H.V. Poor, On safeguarding privacy and security in the framework of federated learning. IEEE Netw. 34(4), 242–248 (2020) 21. Z. Ma, M. Xiao, Y. Xiao, Z. Pang, H.V. Poor, B. 
Vucetic, High-reliability and low-latency wireless communication for internet of things: challenges, fundamentals, and enabling technologies. IEEE Internet Things J. 6(5), 7946–7970 (2019) 22. H.B. McMahan, D. Ramage, K. Talwar, L. Zhang, Learning differentially private language models without losing accuracy (2018). arXiv:1710.06963 23. L. Melis, C. Song, E. De Cristofaro, V. Shmatikov, Exploiting unintended feature leakage in collaborative learning, in Proceedings of the IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA (2019), pp. 691–706
24. M. Mohammadi, A. Al-Fuqaha, S. Sorour, M. Guizani, Deep learning for IoT big data and streaming analytics: a survey. IEEE Commun. Surv. Tutor. 20(4), 2923–2960 (2018) 25. M. Nasr, R. Shokri, A. Houmansadr, Comprehensive privacy analysis of deep learning: passive and active white-box inference attacks against centralized and federated learning, in Proceedings of the IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA (2019), pp. 739–753 26. Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, 1st edn. (Springer, Boston, 2014) 27. Y. Qiang, L. Yang, C. Tianjian, T. Yongxin, Federated machine learning: concept and applications. ACM Trans. Intell. Syst. Technol. 10(2), 12:1–12:19 (2019) 28. T. Ryffel, A. Trask, M. Dahl, B. Wagner, J. Mancuso, D. Rueckert, J. Passerat-Palmbach, A generic framework for privacy preserving deep learning (2018). arXiv:1811.04017 29. R. Shokri, V. Shmatikov, Privacy-preserving deep learning, in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), Denver, Colorado, USA (2015), pp. 1310–1321 30. W. Sun, J. Liu, Y. Yue, AI-enhanced offloading in edge computing: When machine learning meets industrial IoT. IEEE Netw. 33(5), 68–74 (2019) 31. N.H. Tran, W. Bao, A. Zomaya, N.H.N. Minh, C. S. Hong, Federated learning over wireless networks: Optimization model design and analysis, in Proceedings of the IEEE Conference on Computer Communications (INFOCOM) (2019), pp. 1387–1395 32. S. Truex, N. Baracaldo, A. Anwar, T. Steinke, H. Ludwig, R. Zhang, Y. Zhou, A hybrid approach to privacy-preserving federated learning, in Proceedings of the ACM Workshop on Artificial Intelligence and Security (AISec), London, UK (2019), pp. 1–11 33. Z. Wang, M. Song, Z. Zhang, Y. Song, Q. Wang, H. Qi, Beyond inferring class representatives: User-level privacy leakage from federated learning, in Proceedings of the IEEE Conference on Computer Communications (INFOCOM), Paris, France (2019), pp. 2512–2520 34. N. Wang, X. Xiao, Y. Yang, J. Zhao, S.C. Hui, H. Shin, J. Shin, G. Yu, Collecting and analyzing multidimensional data with local differential privacy, in Proceedings of the IEEE International Conference on Data Engineering (ICDE), Macao, China (2019), pp. 638–649 35. X. Wang, Y. Han, C. Wang, Q. Zhao, X. Chen, M. Chen, In-edge AI: Intelligentizing mobile edge computing, caching and communication by federated learning. IEEE Netw. 33(5), 156– 165 (2019) 36. S. Wang, T. Tuor, T. Salonidis, K.K. Leung, C. Makaya, T. He, K. Chan, Adaptive federated learning in resource constrained edge computing systems. IEEE J. Sel. Areas Commun. 37(6), 1205–1221 (2019) 37. S. Wang, L. Huang, Y. Nie, X. Zhang, P. Wang, H. Xu, W. Yang, Local differential private data aggregation for discrete distribution estimation. IEEE Trans. Parallel Distrib. Syst. 30(9), 2046–2059 (2019) 38. K. Wei, J. Li, M. Ding, C. Ma, H.H. Yang, F. Farokhi, S. Jin, T.Q.S. Quek, H. Vincent Poor, Federated learning with differential privacy: algorithms and performance analysis. IEEE Trans. Inf. Forens. Secur. 15, 3454–3469 (2020) 39. N. Wu, F. Farokhi, D. Smith, M.A. Kaafar, The value of collaboration in convex machine learning with differential privacy, in Proceedings of the IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA (2020), pp. 304–317 40. P. Wu, J. Li, L. Shi, M. Ding, K. Cai, F. Yang, Dynamic content update for wireless edge caching via deep reinforcement learning. IEEE Commun. Lett. 23(10), 1773–1777 (2019) 41. L. Xiangru, H. Yijun, L. 
Yuncheng, L. Ji, Asynchronous parallel stochastic gradient for nonconvex optimization, in Proceedings of the ACM Neural Information Processing Systems (NIPS), Montreal, Canada (2015), pp. 2737–2745 42. G. Xu, H. Li, S. Liu, K. Yang, X. Lin, VerifyNet: secure and verifiable federated learning. IEEE Trans. Inf. Forens. Secur. 15, 911–926 (2020)
43. H.H. Yang, A. Arafa, T.Q.S. Quek, H.V. Poor, Age-based scheduling policy for federated learning in mobile edge networks, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain (2020), pp. 8743–8747 44. H.H. Yang, Z. Liu, T.Q.S. Quek, H.V. Poor, Scheduling policies for federated learning in wireless networks. IEEE Trans. Commun. 68(1), 317–333 (2020)
Chapter 4
Advancements of Federated Learning Towards Privacy Preservation: From Federated Learning to Split Learning
Chandra Thapa, M. A. P. Chamikara, and Seyit A. Camtepe
Abstract In the distributed collaborative machine learning (DCML) paradigm, federated learning (FL) recently attracted much attention due to its applications in health, finance, and the latest innovations such as industry 4.0 and smart vehicles. FL provides privacy-by-design. It trains a machine learning model collaboratively over several distributed clients (ranging from two to millions) such as mobile phones, without sharing their raw data with any other participant. In practical scenarios, all clients do not have sufficient computing resources (e.g., Internet of Things), the machine learning model has millions of parameters, and its privacy between the server and the clients while training/testing is a prime concern (e.g., rival parties). In this regard, FL is not sufficient, so split learning (SL) is introduced. SL is reliable in these scenarios as it splits a model into multiple portions, distributes them among clients and server, and trains/tests their respective model portions to accomplish the full model training/testing. In SL, the participants do not share both data and their model portions to any other parties, and usually, a smaller network portion is assigned to the clients where data resides. Recently, a hybrid of FL and SL, called splitfed learning, is introduced to elevate the benefits of both FL (faster training/testing time) and SL (model split and training). Following the developments from FL to SL, and considering the importance of SL, this chapter is designed to provide extensive coverage in SL and its variants. The coverage includes fundamentals, existing findings, integration with privacy measures such as differential privacy, open problems, and code implementation. Keywords Federated learning · Split learning · Splitfed learning · Distributed machine learning · Privacy · Security · Code examples C. Thapa (B) · M. A. P. Chamikara · S. A. Camtepe CSIRO Data61, Sydney, Australia e-mail: [email protected] M. A. P. Chamikara e-mail: [email protected] S. A. Camtepe e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 M. H. ur. Rehman and M. M. Gaber (eds.), Federated Learning Systems, Studies in Computational Intelligence 965, https://doi.org/10.1007/978-3-030-70604-3_4
4.1 Introduction In today’s world, machine learning (ML) has become an integral part in various domains, including health [27, 50], finance [40] and transportation [52]. As data are usually distributed and stored among different locations (e.g., data centers and hospitals), distributed collaborative machine learning (DCML) is used over conventional (centralized) machine learning to access them without centrally pooling all data. Moreover, additional advantages of DCML over conventional machine learning includes improved privacy of data by limiting it within its primary locations and distributed computations. Overall, DCML has become an essential tool for training/testing ML/AI models in a wide scale of distributed environments ranging from few to millions of devices [28]. DCML introduces collaborative architectures that enable reliable solutions for data sharing and model training [32]. Federated learning (FL) is a DCML mechanism that provides a widely accepted approach for ML/AI. In FL, several distributed clients (e.g., mobile phones) train a global ML/AI model collaboratively in an efficient and privacy-preserving manner by keeping data within the local boundaries (i.e., data custodians/clients) while sharing no data among participating entities [60]. Moreover, in FL, each client keeps a copy of the entire model and trains it with the local data, and in a subsequent step, the local models (trained locally) are forwarded to a coordinating server for the aggregation, which produces a global model. However, in a resource-constrained environment (e.g., the internet of things environments) where the clients have limited computing capability, FL is not feasible. As the server is assumed to have high computing resources than clients, FL is not leveraging that as the sever performs low computing jobs such as model aggregation and coordination. In contrast, the server can compute a larger portion of the ML model and reduce the computation at the client-side. As such, split learning (SL) is introduced. SL is a DCML approach that introduces a split in executing a model training/testing that is shared among clients and the server [57]. The server and clients have access only to their portion of the whole model. Thus, it provides model privacy while training/testing in contrast to FL; in addition to keeping input data in the local bounds [57]. Besides, SL can be communication efficient and achieve faster convergence than FL [48]. Due to these inherent features of SL, it is gaining much research attention. There have been various research works performed to address its multiple aspects, including performance enhancement by introducing a variant of SL, called splitfed learning (SFL) [51], leakage reduction [56], and communication efficiency [48]. However, there are many open problems, including efficient leakage reductions, handling non-IID data distributions among clients, and reducing communication costs. An extensive understanding of SL is required to address these problems. This chapter provides comprehensive knowledge on SL and its variants, recent advancements, and integration with privacy measures such as differential privacy and leakage reduction techniques. In this regard, this chapter is divided into four main sections: (4.2) FL to split learning, (4.3) data privacy and privacy-enhancing
techniques, (4.4) applications and implementation, and (4.5) challenge and open problems. Firstly, Sect. 4.2 presents the fundamentals, dynamics, crucial results produced under FL, and detailed coverage on SL and SFL. The primary focus is given to SL. Detailed coverage is done on the key results that include performance analysis, effects of the number of clients on the performance, communication efficiency, data privacy leakage, and methods to perform SL over partitioned datasets (e.g., vertical partitioned) among clients. Secondly, Sect. 4.3 explores the dynamics of privacy offered by FL, SL, and SFL in their default settings. Afterward, this chapter discusses privacy measures such as encryption-based techniques and provides details on the provable-privacy mechanism, called differential privacy, which are employed further to enhance the data privacy with these DCML techniques. Thirdly, Sect. 4.4 presents the significance of SL and SFL by highlighting their applications in various domains, including health. Besides, it provides a technical background on implementing the related concepts highlighting programmatic techniques with code examples. Finally, open challenges and potential future research directions are discussed in Sect. 4.5.
4.2 Federated Learning to Split Learning In this section, firstly, FL is revisited to highlight its relevant results for completeness. However, this chapter does not intend to conduct an in-depth analysis of FL, as its primary goal is to investigate the architectural improvements of FL towards SL and SFL. Hence, the discussion of FL acts as a base for the rest of the chapter. Consequently, the discussion of FL is followed by an in-depth analysis of SL and its variants, covering their existing results and recent developments.
4.2.1 Overview of Federated Learning and Key Results The primary concept behind FL is training an ML/AI model without exposing the raw data to any participant, including the coordinating server, enabling the collaborative training/testing of the model between distributed data owners/custodians [60]. Figure 4.1 illustrates FL with K clients. In FL, the model knowledge orchestration is done usually through federated averaging (FedAvg) [37]. Due to the promising aspects of the underlying architecture, FL is gaining much attention in privacy-preserving data analytics. Google and Apple are two major companies that utilize FL capabilities in training ML/AI models [24, 28]. Besides, it is gaining attention in transportation (e.g., self-driving cars) and Industry 4.0 [10, 36]. Several implementation-based frameworks are already supporting FL. Some of these frameworks include PySyft [45], Leaf [12], and PaddleFL [25]. Based on different requirements, such as the feature space distribution, different FL configurations can be employed. These configurations include (1) horizontal FL, (2) vertical FL, and (3) federated transfer learning [60]. Horizontal FL is used when
Fig. 4.1 FL with K clients
all distributed clients share the same feature space but different data samples. Vertical FL is used when the distributed clients have different feature spaces that belong to the same data samples, and federated transfer learning is used when the distributed clients have different datasets that have different feature spaces belong to different data samples [60]. However, one of the fundamental issues in FL is the communication bottleneck. When clients need to connect through the Internet, communications become potentially expensive and oftentimes unreliable. Besides, the limited computational capabilities of distributed clients often make the whole process slower in model training. FL requires a particular client to hold the entire model. Hence, in a situation where a heavy deep learning model needs to be trained, an FL client should have enough computing power, which avoids resource-constrained devices being employed [35]. Besides, the requirement of continuous communications with the server also limits FL’s use in a conventional environment that does not possess powerful communication channels. Moreover, in real-world scenarios, distributed nodes/clients can face different failures, directly affecting the global model generalization. Hence, the participation of an extensively large number of clients can make FL unreliable. Different compression mechanisms such as gradient compression, model broadcast compression, and local computation reduction are being experimented with to maintain an acceptable FL efficiency [23, 62]. It has also been shown that unbalanced and non-IID data partitioning across unreliable devices drastically affects model convergence [28]. Data leakage introduces another layer of complexity for FL in addition to such data communication and data convergence issues. It has been shown that the parameters transferred between the clients and the server can leak information [8]. The participation of malicious clients or servers can introduce many security and privacy vulnerabilities—the backdoor attacks is an example where malicious entities can deploy backdoor to learn others’ data [11]. Besides, privacy attacks, such as membership inference, can exploit this vulnerability [46]. Many existing works try to employ third-party solutions such as fully homomorphic encryption and differential privacy to limit such unanticipated privacy leaks from FL. However, the performance of these approaches is always governed by the limitations of these approaches (e.g., high computational complex-
ity [7]). In a distributed setup where resource-constrained devices are used, this can be a challenging task.
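As a point of reference for the aggregation step mentioned above (FedAvg [37]), a minimal sketch of weighted model averaging is given below. The function name and the NumPy representation of the model parameters are assumptions of this illustration, not a specific library API.

import numpy as np

def fedavg(client_weights, client_sizes):
    # FedAvg-style aggregation: the coordinating server combines the locally
    # trained parameter vectors in proportion to the local dataset sizes.
    sizes = np.asarray(client_sizes, dtype=float)
    p = sizes / sizes.sum()
    return sum(p_k * w_k for p_k, w_k in zip(p, client_weights))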
4.2.2 Split Learning and Key Results In this section, firstly, SL is introduced along with its algorithm and configurations. Secondly, an extensive coverage on its existing results, including its performance in model development, data leakage analysis, and countermeasures, are presented.
4.2.2.1 Split Learning
SL [22, 57] is a collaborative distributed machine learning approach in which an ML network (model) is split into multiple portions and executed in sequence at different clients and servers. In a simple setup, the full model W, which includes weights, biases, and hyperparameters, is split into two portions W^C and W^S; W^C is called the client-side network portion and W^S is called the server-side network portion, as illustrated in Fig. 4.2. In SL, the client is responsible only for the training/testing of the client-side network, and the server only for the training/testing of the server-side network. The training and testing of the full model are done by executing a sequential (forward/backward) propagation between a client and the server. In the simplest form, firstly, the forward propagation takes place as follows: a client forward propagates over the raw data until a specific layer of the network, called the cut layer, and then the cut layer's activations, called smashed data, are transmitted to the server. The smashed data matrix is represented as A. Afterward, the server considers the smashed data (received from the client) as its input and performs the forward propagation on the remaining layers. So far, a single forward propagation on the full model is completed. Now the back-propagation takes place as follows: after calculating the loss, the server starts back-propagation, where it computes the gradients of the weights and activations of the layers until the cut layer, and then transmits the smashed data's gradients back to the client.

Fig. 4.2 Split learning in a simple setup, showing the input layer, cut layer, smashed data, client-side and server-side model portions, and output layer of the full model W
Fig. 4.3 Split learning with multiple clients (Client 1 to Client K, each with a client-side model portion, and a main server holding the server-side model portion). At one round, one client k ∈ {1, 2, . . . , K} engages with the server for the training/testing. The control is passed to client k + 1 after client k completes its operations with the server, until all K clients are served
With the received gradients, the client executes its back-propagation on its client-side network. So far, a single pass of the back-propagation between a client and the server is completed. In ML/AI model training, the (forward and back) propagation continues until the model is trained on all the participating clients and reaches a decent convergence point (e.g., high prediction accuracy). For details on SL, refer to Algorithm 4.1 (which is extracted from [51]). By limiting the client-side model portion to a few layers, SL reduces the client-side computation compared to FL, where each client has to train a full-sized model. Besides, during machine learning training/testing, the server and clients are limited to their designated portions of the full model, and the full model is not accessible to them. To recover the missing model portion, a semi-honest¹ client or server would have to predict all the parameters of that portion, and the probability of doing so decreases as the number of parameters in the portion increases. As a result, SL provides a certain level of privacy to the trained model from honest-but-curious clients and the server. There is no such model privacy between the clients and the server in FL, as both of them have white-box access (i.e., full access) to the full model, or the server can easily reconstruct the full model from the gradients of the locally trained models that the clients transfer for model aggregation.

Split Learning with Multiple Clients
SL with multiple clients is illustrated in Fig. 4.3. With multiple clients, SL takes place in two ways, namely centralized distributed training and peer-to-peer distributed training [22]. In centralized distributed training, after a client completes its training with the main server, it uploads an encrypted version of the weights of the client-side network either to the server or to a third-party server. When a new client initiates its training with the server, it first downloads the encrypted weights, decrypts them, and loads them to its client-side network.

¹ A semi-honest entity in a collaboration among multiple entities performs its job as specified, but it can be curious about the details of the information present in the other participating entities. It is also called an honest-but-curious entity.
Algorithm 4.1: Split learning with label sharing [51, 57].

Notations:
• St is a set of nt clients at time instance t.
• Ak,t is the smashed data of client k at t.
• Yk and Ŷk are the ground truth and predicted labels for client k.
• ∇ℓ is the gradient of the loss ℓ.
• η is the learning rate.

/* Runs on Server */
EnsureMainServer executes at time instance t ≥ 0:
  for a request from client k ∈ St with new data do
    (Ak,t, Yk) ← ClientUpdate(W^C_{k,t})            ▷ Ak,t and Yk are from client k
    Forward propagation with Ak,t on W^S_t           ▷ W^S_t is the server-side part of the model Wt
    Loss calculation with Yk and Ŷk
    Back-propagation and model update with learning rate η: W^S_{t+1} ← W^S_t − η ∇ℓ(W^S_t; A^S_t)
    Send dAk,t := ∇ℓ(A^S_t; W^S_t) (i.e., the gradient of Ak,t) to client k for its ClientBackprop(dAk,t)
  end

/* Runs on Client k ∈ {1, . . . , K} */
EnsureClientUpdate(W^C_{k,t}):
  Set Ak,t = φ
  if client k is the first client to start the training then
    W^C_{k,t} ← randomly initialize (using Xavier or Gaussian initializer)
  else
    W^C_{k,t} ← ClientBackprop(dA_{k−1,t−1})        ▷ k − 1 is the last client trained with the main server
  end
  for each local epoch e from 1 to E do
    for batch b ∈ B do
      Forward propagation on W^C_{k,b,t} of batch b  ▷ W^C_{k,t} is the client-side part of the model Wt
      Concatenate the activations of its final layer to Ak,t
      Concatenate the respective true labels to Yk
    end
  end
  Send Ak,t and Yk to the server

/* Runs on Client k */
EnsureClientBackprop(dAk,t):
  for batch b ∈ B do
    Back-propagation with dA_{k,b,t}                 ▷ dAk,t of batch b
    Model update: W^C_{t+1} ← W^C_t − η dA_{k,b,t}
  end
  Send W^C_{t+1} to the next client ready to train with the main server
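To make the sequence of Algorithm 4.1 concrete, the following is a minimal single-process sketch of one training step between a client and the server. It is not the reference implementation of [51]; the model definitions, optimizer settings, and the use of detach()/requires_grad_() to emulate the network boundary are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical client-side and server-side model portions (cut after one hidden layer).
client_model = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU())
server_model = nn.Sequential(nn.Linear(128, 10))
opt_client = torch.optim.SGD(client_model.parameters(), lr=0.01)
opt_server = torch.optim.SGD(server_model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def split_training_step(x, y):
    # Client: forward propagate up to the cut layer and "send" the smashed data A_{k,t}.
    smashed = client_model(x)
    # Detaching emulates transmission to the server; gradients are requested again on the
    # server side so that the smashed data's gradients dA_{k,t} can be computed and returned.
    smashed_server = smashed.detach().requires_grad_()

    # Server: forward on the remaining layers, compute the loss, back-propagate, update W^S.
    out = server_model(smashed_server)
    loss = loss_fn(out, y)
    opt_server.zero_grad()
    loss.backward()
    opt_server.step()

    # Server "sends" dA_{k,t} back; the client completes its back-propagation and updates W^C.
    opt_client.zero_grad()
    smashed.backward(smashed_server.grad)
    opt_client.step()
    return loss.item()

# Example: one step over a random FMNIST-sized batch.
x, y = torch.randn(32, 1, 28, 28), torch.randint(0, 10, (32,))
print(split_training_step(x, y))
```

In a real deployment, the detach/re-attach step above is replaced by the actual transmission of the smashed data and of its gradients over the network (see Sect. 4.4.2).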
In peer-to-peer distributed training, the server sends the address of the last trained client to the new client, which then downloads the encrypted weights directly from that client and loads them to its client-side network after decryption. Besides, training can be carried out in two ways: one with client-side model synchronization (client k trains, then the model is passed to client k + 1), and the other without weight synchronization, where clients take turns with alternating epochs in working with the server. However, there is no convergence guarantee if model training is done without weight synchronization.
Fig. 4.4 Split learning configurations: a simple vanilla, b extended vanilla, c without label sharing, and d vertically partitioned data (each panel shows the clients with their input data, the server, and the labels, smashed data, and smashed data's gradients exchanged between them)
Different Configurations in Split Learning
Due to the flexibility of splitting the model while training/testing, SL has several possible configurations, namely vanilla split learning, extended vanilla split learning, split learning without label sharing, split learning for vertically partitioned data, split learning for multi-task output with vertically partitioned input, and 'Tor'-like multi-hop split learning [57]. Refer to Fig. 4.4 for illustrations. Vanilla split learning is the most straightforward configuration, in which the client shares the smashed data and labels with the server (Fig. 4.4a). In extended vanilla split learning, some other workers process some intermediate layers of the ML network before passing it to the main server (Fig. 4.4b). An analogous extension of the vanilla configuration provides the 'Tor'-like multi-hop split learning configuration, where one client has data, and multiple clients/servers train a portion of the ML network in a sequence. In the SL without label sharing configuration, the client shares only the smashed data with the server (unlike vanilla split learning); the server completes the forward propagation up to some layers of the network, called the server cut layer, and then sends the activations of its server cut layer back to the client, which completes the forward propagation up to the output layer. Afterward, the client starts back-propagation and sends the gradients of the activations of the server cut layer to the server. Then the server carries out its back-propagation and sends the smashed data's gradients to the client, which then completes its back-propagation. Here the forward propagation and back-propagation happen in a U-shape (client ⇐⇒ server ⇐⇒ client); thus it is also called the U-shaped configuration (Fig. 4.4c). In SL with the vertically partitioned data configuration (Fig. 4.4d), clients hold a vertically partitioned dataset, and each carries out the forward propagation on its local client-side ML network portion. Then they transfer their smashed data to the server. Afterward, the server concatenates all the smashed data and carries on the forward propagation on the single server-side model. The back-propagation proceeds from the output layer up to the concatenation layer at the server, which then transmits the respective smashed data's gradients to the clients. Next, the clients perform their back-propagation on their client-side ML network portions. In SL for multi-task output, multi-modal data from different clients train their client-side ML networks up to their corresponding cut layers and then transfer their smashed data to an intermediate agent, which concatenates the smashed data from all clients. Afterward, it sends the concatenated smashed data to multiple servers, where each server trains its own server-side model. All these configurations of SL are useful depending on the requirements. For example, if the labels are sensitive to the clients, then there is no need to send them to the server; instead, U-shaped SL can be used as the learning configuration. If there is a need to keep the identity of the clients confidential from the server, then it can be done with the extended vanilla configuration or the Tor-like configuration.
4.2.2.2 Key Results in Split Learning
There are several lines of work in the SL literature. Broadly, these works are divided into five main categories, namely convergence, computation, communication, dataset partition, and information leakage.

Convergence
This category explores and addresses the research questions related to convergence in SL, i.e., the effect of the data distribution among the participating clients and of their number on model convergence. The data distribution is of two types: independent and identically distributed (IID) and non-IID. In the IID case, the training datasets distributed among the clients are IID sampled from the total dataset, where each sample has the same probability distribution and is mutually independent of the others. The non-IID dataset distribution in a distributed setup with multiple clients refers to differences in the distribution and any dependencies of the local datasets among the clients. Non-IID distributions due to quantity skew (i.e., different numbers of samples in different clients) and label distribution skew (i.e., different clients can have samples belonging only to particular labels) are explored in SL. With a single client and server setup, SL has the same results as centralized learning, where the ML network is not split and is trained as a whole [22]. Under IID data configurations, SL shows higher validation/test accuracy and faster convergence than FL when considering a large number of clients [20, 22]. The experiments are carried out with models including VGG² [47] and ResNet³ [26] over several datasets, including the MNIST dataset (a database of 70,000 handwritten digit samples with ten labels), the CIFAR-10 dataset (60,000 tiny images with ten labels), ILSVRC-12 (1.2 million image samples with 1000 object categories), the speech command (SC) dataset (20,827 samples of single spoken English words with ten classes), and the electrocardiogram (ECG) dataset (26,490 samples representing five heartbeat types via ECG signals). The number of clients in SL affects the convergence curve.

² VGG16 has sixteen network layers; its input dimension for an image dataset is 224 × 224 × 3, and 3 × 3 sized kernels are used.
³ ResNet18 has eighteen network layers, and 3 × 3 and 7 × 7 sized kernels are implemented in its layers.
It fluctuates more as the number of clients goes up [20]. For AlexNet⁴ [33] on HAM10000 [54] (a medical image dataset with 10,015 samples of seven types of skin lesion), SL even fails to converge for 100 clients [51]. For imbalanced data distributions (different sample sizes in different clients, ranging from 48 to 3855), the convergence of FL slows down as the number of clients increases (50 and 100 clients). In contrast, for the same setup, SL converges fast and shows less sensitivity towards imbalanced data distributions. However, the model performance (e.g., accuracy) decreases as the number of clients increases under imbalanced data distributions [20]. SL is more sensitive to non-IID data distributions than FL. For the ECG and SC datasets, SL converges slowly, and does not converge at all if each client has samples from only one class [20]. In the experiments performed in [20], FL outperforms SL in accuracy and convergence under a non-IID setting. Overall, it is required to carefully choose an appropriate DCML based on the dataset distribution and the number of clients.

Computation
This category explores and addresses the research questions related to the client-side computational efficiency of SL and the techniques for improvement while maintaining data privacy. Due to the inherent characteristic of SL, i.e., splitting the network and training the portions separately on the clients and the server, the computation on the client side is reduced significantly [22]. It is shown that for a setup with 100 and 500 clients, when training CIFAR10 over VGG, SL requires 0.1548 TFlops⁵ and 0.03 TFlops of computation, respectively, on the client side. On the other hand, FL requires 29.4 TFlops and 5.89 TFlops for the same setups, respectively [57]. The computation at the client side depends on the size of the client-side network; however, how the position of the split layer affects the performance, and what the optimal position of the cut layer in a given model is, are still open problems. Moreover, the possible answers depend on other factors such as information leakage.

Communication
This category explores and addresses the research questions related to communication efficiency in SL and the techniques for further improvements. The communication cost in SL can be significantly less than in FL by limiting the client-side network to a few layers and by having few or compressed activations at the cut layer (e.g., a max pooling layer) [22]. In contrast, FL requires gradient updates of the full network from all clients to the main server, and the global weights are forwarded from the server to all clients. For the same total dataset, a communication bandwidth of 6 GB and 1.2 GB per client is required for SL with 100 and 500 clients, respectively, with ResNet over CIFAR100.

⁴ AlexNet has eight network layers, and its layers have 3 × 3, 5 × 5, and 11 × 11 sized kernels. The dimension of the input image is 227 × 227 × 3. VGG16, ResNet18, and AlexNet are convolutional neural networks with multiple layers. The layers comprise convolution, ReLU (activation function), max-pooling (downsampling), and fully connected layers.
⁵ Floating point operations per second (Flops) is a measure of the computation of instructions per second, and one Tera Flops (TFlops) is 10¹² Flops.
Table 4.1 Communication efficiency [48]

Method                                       | Communication per client | Total communication
Split learning with client weight sharing    | 2(p/K)q + ηN             | 2pq + ηNK
Split learning with no client weight sharing | 2(p/K)q                  | 2pq
FL                                           | 2N                       | 2KN
In contrast, FL requires 3 GB and 2.4 GB with 100 and 500 clients, respectively, for the same setup [57]. A more detailed analysis of the communication efficiency of SL and FL is done in [48]. The analytical results are presented in Table 4.1, where K is the number of clients, N is the number of model parameters, p is the total dataset size, q is the size of the smashed layer, and η is the fraction of the whole model parameters that belong to a client. Usually, the data are distributed, and the size of the data at each client is p/K, which decreases as K increases. In a separate work [51], for ResNet18 on the HAM10000 dataset, SL is shown to have efficient communication when the number of clients is twenty, whereas, for AlexNet on MNIST, it is efficient after five clients. These studies reconfirm the following result: in terms of communication efficiency, if the model size and the number of clients are large, then SL is more efficient than FL; otherwise (for a small model and few clients), FL is more efficient.
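As a quick illustration of the expressions in Table 4.1, the following sketch evaluates the per-client and total communication for SL (with client weight sharing) and FL. The numerical values chosen for K, N, p, q, and η are arbitrary placeholders, not the settings used in [48] or [57].

```python
def sl_comm(p, q, K, eta, N):
    """Split learning with client weight sharing (Table 4.1)."""
    per_client = 2 * (p / K) * q + eta * N
    total = 2 * p * q + eta * N * K
    return per_client, total

def fl_comm(N, K):
    """Federated learning (Table 4.1): the full model goes up and down once per client."""
    per_client = 2 * N
    total = 2 * K * N
    return per_client, total

# Placeholder values: 100 clients, a 10M-parameter model, 50k samples in total,
# 8,192 values of smashed data per sample, and 5% of the parameters on the client side.
K, N, p, q, eta = 100, 10_000_000, 50_000, 8_192, 0.05
print("SL (per client, total):", sl_comm(p, q, K, eta, N))
print("FL (per client, total):", fl_comm(N, K))
```

Varying K and N in this sketch reproduces the qualitative trend stated above: SL wins for large models and many clients, while FL wins for small models with few clients.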
Note point: in SL, the communication cost also depends on the type of cut layer. If the cut layer is a max pool layer, it compresses the outputs of the previous layer (e.g., a convolutional layer) from the input dimension n_H × n_W down to ((n_H − f)/s + 1) × ((n_W − f)/s + 1), where f is the filter size of the pooling layer and s is the stride size. This is not the case if the cut layer is a convolutional layer whose output dimension remains n_H × n_W.

Dataset Partition and Split Learning
This category explores and addresses the research questions related to carrying out SL under various partitions of the datasets among the clients. Most of the works in SL consider the horizontal partition of the dataset, where different clients have different samples but all with the same feature space. However, in fields like finance, multiple clients (financial institutes) can have the same samples (related to the same set of data owners) but with different features. The dataset, in this case, is referred to as a vertically partitioned dataset. SL with vertical partitioning of datasets is performed in [13]. This work evaluates several aggregating configurations to merge the outputs of the partial networks from multiple clients and analyzes the performance and resource efficiency. The configurations include element-wise average, element-wise maximum, element-wise sum, element-wise multiplication, and concatenation. Considering three financial datasets, namely Bank Marketing, Give Me Some Credit (Kaggle), and Financial PhraseBank, the experiments show that the performance depends on the dataset and the merging technique.
Element-wise average pooling outperforms the other techniques for Financial PhraseBank, whereas element-wise max pooling and concatenation outperform the other approaches for Bank Marketing and Give Me Some Credit, respectively. In other experiments, performance degradation is observed when an increasing number of clients drop out during training and testing. Moreover, the communication and computation costs are measured for this setup, and their dependence on the dataset and model architecture is observed. A similar technique of concatenating the smashed data from the clients before feeding them to the server-side model is used for multiple classifications with horizontal partitioning of the dataset in SL [30]. This is done primarily for privacy reasons, as this setup does not require the client-side models to synchronize. However, a trusted worker to concatenate the smashed data is added to the setup.

Information Leakage and Countermeasures
This category explores and addresses the research questions related to privacy and information leakage and their countermeasures. Information leakage from an ML model is defined as the ability to reconstruct the original raw data from model parameters or intermediate activations. In SL, possible information leakage is investigated based on the data communicated between the clients and the server. The server receives the smashed data from the clients in each epoch, and the smashed data can leak information about private/sensitive raw data, as it possesses a certain level of correlation to the raw data. So far, in the literature, two methods have been implemented to reduce information leakage in SL; one uses differential privacy (refer to Sect. 4.3.1 for details), and the other maps the smashed data so as to reduce its distance correlation [49] to the raw input data. The latter is known to maintain accuracy, unlike differential privacy, which degrades accuracy with increased privacy. Moreover, it has a relatively low computational complexity of O(n log n) and O(nK log n) for univariate and multivariate settings, respectively, with O(max(n, K)) memory requirements, where K is the number of random projections [58]. Information leakage can be measured using the Kullback–Leibler (KL) divergence between the raw data and the smashed data [56, 58]. The KL divergence provides a measure of the invertibility of the smashed data. It is also known as relative entropy, and it is a non-symmetric measure of the difference between two probability distributions X and Z. In other words, it is a measure of the information loss when Z is used to approximate X, and it is expressed as follows:

D_KL(X||Z) = Σ_{i=1}^{N} X(x_i) ln( X(x_i) / Z(x_i) ).    (4.1)

The KL divergence can also be written as:

D_KL(X||Z) = H(X, Z) − H(X),    (4.2)
where H(X, Z) and H(X) are the cross-entropy and entropy, respectively. The distance correlation is minimized by introducing an additional regularization term in the loss function:

Loss = α₁ DCOR(X_n, Ẑ) + α₂ CCE(Y_n, Ŷ),    (4.3)
where DCOR refers to the distance correlation, CCE refers to the categorical cross-entropy, and α₁ and α₂ are scalar weights. The combined optimization of the resulting loss term reduces the leakage from the smashed data without degrading accuracy. Furthermore, the same approach of distance correlation is used for attribute privacy, where the distance correlation is minimized between the smashed data and certain attributes (e.g., age, race, and gender) of the raw data. For more details, refer to [58].
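A sketch of how the loss in Eq. 4.3 can be assembled on the client side is given below. The (biased, batch-wise) distance-correlation estimator and the weights α₁ and α₂ are illustrative; the exact estimator and hyperparameters of [58] may differ.

```python
import torch
import torch.nn.functional as F

def distance_correlation(x, z):
    """Biased batch estimate of the distance correlation between x and z (rows = samples)."""
    x = x.flatten(1).float()
    z = z.flatten(1).float()
    a = torch.cdist(x, x)                      # pairwise Euclidean distances of raw data
    b = torch.cdist(z, z)                      # pairwise Euclidean distances of smashed data
    A = a - a.mean(dim=0, keepdim=True) - a.mean(dim=1, keepdim=True) + a.mean()
    B = b - b.mean(dim=0, keepdim=True) - b.mean(dim=1, keepdim=True) + b.mean()
    dcov2_xz = (A * B).mean()                  # squared distance covariance
    dcov2_xx = (A * A).mean()
    dcov2_zz = (B * B).mean()
    dcor2 = dcov2_xz / (dcov2_xx * dcov2_zz).sqrt().clamp_min(1e-12)
    return dcor2.clamp_min(0.0).sqrt()

def split_learning_loss(raw_x, smashed, labels, logits, alpha1=0.1, alpha2=1.0):
    # Loss = alpha1 * DCOR(raw data, smashed data) + alpha2 * CCE(labels, predictions), as in Eq. 4.3.
    return alpha1 * distance_correlation(raw_x, smashed) + alpha2 * F.cross_entropy(logits, labels)
```

Because the distance-correlation term is differentiable, it can simply be added to the usual task loss before back-propagation, so the client-side network learns to produce smashed data that is less correlated with the raw inputs.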
4.2.3 Splitfed Learning and Key Results

SL can be considered a better contender than FL for edge computing due to its applicability in resource-constrained environments and its model privacy. However, the client-side model synchronization between the clients is done sequentially, as described in Sect. 4.2.2.1. This results in a higher latency in model training/testing in SL. Consequently, this latency makes it challenging to leverage the advantages of SL in a resource-constrained environment where fast model training/testing is required to periodically update the model with continually updated datasets. Such environments characterize fields such as health and finance, where, in a distributed setting, frequent model updates are required to incorporate newly emerging threats in activities including real-time anomaly detection and fraud detection. In this regard, to utilize the benefits of both FL and SL, splitfed learning is introduced.
4.2.3.1 Splitfed Learning
SFL is a hybrid form of FL and SL. It performs client-side model training/testing in parallel, as in FL, and trains/tests the full model by splitting it into client-side and server-side portions for privacy and computation benefits, as in SL. Its architecture is divided into two main parts: the client-side part and the main-server part. In the client-side part, unlike SL, SFL introduces an extra worker, called the fed server, that is dedicated to performing the synchronization of the client-side model (see Fig. 4.5). The main server side remains precisely the same as the server-side module in SL. SFL operates in the following way: firstly, the fed server sends the initial global client-side model to all clients. Then, all clients (e.g., different hospitals or IoMTs with limited computing power) proceed with their forward propagation on their local data over the client-side model in parallel. Secondly, the smashed data are transmitted to the main server, which usually has enough computing power (e.g., a cloud computing platform or a research institution with a high-performance computing platform).
Fig. 4.5 Splitfed learning architecture with K clients and two servers, the fed server and the main server (each client holds a client-side model portion, and the main server holds the server-side portion of the full model)
Once the main server receives the smashed data from the clients, it starts the forward propagation over the server-side model. The computations (forward and back-propagation) at the server side associated with each client's smashed data can be done on a per-client basis in parallel, owing to the server's high computing resources and the independence of the operations related to different clients. The main server computes the gradients of the smashed data with respect to the loss function in its back-propagation and sends the gradients to the respective client. Thirdly, each client completes its back-propagation on its client-side model. The forward propagation and back-propagation between the clients and the server proceed for some rounds without the fed server. Afterward, the clients transmit the client-side networks' updates in the form of gradients to the fed server. Then, the fed server aggregates the updates and makes a global client-side model, which is sent back to all clients. This way, the client-side model synchronization happens in splitfed learning. Averaging is a simple method widely used for model aggregation at the fed server; thus, the computation in it is not costly, making it suitable to operate within the local edge boundaries. There are two ways to do the server-side model synchronization: firstly, train the server-side model separately over the smashed data from each client, and later aggregate (e.g., by a weighted average) all the resulting server-side models to make the global server-side model (this is known as splitfedv1 in [51]); and secondly, keep training the same server-side model over the smashed data from different clients (this is known as splitfedv2 in [51]). Splitfedv1 is illustrated in Algorithm 4.2, which is extracted from [51].
4.2.3.2 Key Results in Splitfed Learning
There are three aspects that have been investigated in SFL: performance, training time latency, and privacy. Splitfed shows the same communication efficiency as SL, with significantly less training time than SL [51]. Empirical results considering ResNet18, AlexNet, and LeNet over the HAM10000, FMNIST, CIFAR10, and MNIST datasets show that SFL (both splitfedv1 and splitfedv2) has comparative performance⁶ to SL and FL.

⁶ Comparative performance refers to the case where two results are close to each other, and either result can be slightly higher or lower than the other.
Algorithm 4.2: Splitfed learning with label sharing [51]. The notations are the same as in Algorithm 4.1.

/* Runs on Main Server */
EnsureMainServer executes at round t ≥ 0:
  for each client k ∈ St in parallel do
    (Ak,t, Yk) ← ClientUpdate(W^C_{k,t})
    Forward propagation with Ak,t on W^S_t, compute Ŷk
    Loss calculation with Yk and Ŷk
    Back-propagation: calculate ∇ℓ_k(W^S_t; A^S_t)
    Send dAk,t := ∇ℓ_k(A^S_t; W^S_t) (i.e., the gradient of Ak,t) to client k for ClientBackprop(dAk,t)
  end
  Server-side model update: W^S_{t+1} ← W^S_t − η Σ_{i=1}^{K} (n_i/n) ∇ℓ_i(W^S_t; A^S_t)

/* Runs on Client k */
EnsureClientUpdate(W^C_{k,t}):
  Model update: W^C_{k,t} ← FedServer(W^C_{t−1})
  Set Ak,t = φ
  for each local epoch e from 1 to E do
    for batch b ∈ B do
      Forward propagation on W^C_{k,b,t}
      Concatenate the activations of its final layer to Ak,t
      Concatenate the respective true labels to Yk
    end
  end
  Send Ak,t and Yk to the main server

/* Runs on Client k */
EnsureClientBackprop(dAk,t):
  for batch b ∈ B do
    Back-propagation: calculate the gradients ∇ℓ_k(W^C_{k,b,t})
    W^C_{k,t} ← W^C_{k,t} − η ∇ℓ_k(W^C_{k,b,t})
  end
  Send W^C_{k,t} to the fed server

/* Runs on Fed Server */
EnsureFedServer executes:
  for each client k ∈ St in parallel do
    W^C_{k,t} ← ClientBackprop(dAk,t)
  end
  Client-side global model update: W^C_{t+1} ← Σ_{k=1}^{K} (n_k/n) W^C_{k,t}
  Send W^C_{t+1} to all K clients for ClientUpdate(W^C_{k,t})
Moreover, even for multiple clients ranging from 1 to 100 with ResNet18 over the HAM10000 dataset, the performance patterns are close to each other. However, this result does not hold in general, specifically for a higher number of clients: LeNet5 on FMNIST with 100 users converges more slowly in splitfedv1 than in the other approaches, and AlexNet on HAM10000 with 100 users fails to converge in splitfedv2. For uniformly distributed HAM10000 and MNIST datasets with multiple clients over ResNet18 and AlexNet, SFL is shown to reduce the training time by four to six times compared to SL. The same communication cost for both SL and SFL is observed [51]. The privacy aspects of SFL are covered in Sect. 4.3.4.
4.3 Data Privacy and Privacy-Enhancing Techniques

FL, SL, and SFL provide a certain level of privacy to the raw data, called default privacy, in their vanilla settings, as the raw data are always within the control of the data custodians and no raw data are shared among the participants. This setup works perfectly well in an environment where the participating entities are semi-honest adversaries.⁷ However, if the adversaries' capabilities extend beyond the merely curious, for example to membership inference attacks [46] and model memorization attacks [34], then the default privacy is not sufficient, and other approaches are also needed. In the literature, commonly used approaches include secure multiparty computation [39], homomorphic encryption [44], and differential privacy (DP) [21]. In secure multiparty computation (SMC), multiple entities jointly compute a function on their inputs without sharing their inputs with any other party, which provides privacy-preserving computation [61]. With this approach, each entity involved in the computation can make sure its input privacy is preserved against the other parties involved in the computation. This differs from traditional cryptographic approaches, which try to preserve data security against third-party adversaries who are external to the system. However, SMC is recommended for lower security requirements. Besides, SMC introduces high computational complexity and a high communication cost [15]. Homomorphic encryption performs the computation of arbitrary functions on encrypted data without decryption. Consequently, the party that conducts computation over homomorphically encrypted data does not have to access the original data/information. Thus, an untrusted third-party entity can conduct heavy computations while maintaining the privacy of the input data/model [7, 55]. For example, a server can conduct all the heavy computation over encrypted data and send the encrypted results to the data owner, who can decrypt them to view the final results in their non-encrypted format. However, the high computational complexity of the computations in homomorphic encryption reduces the efficiency of the overall system. DP is a privacy definition that can be achieved by adding noise to data or ML models so that adversaries cannot extract private information while conducting analytics [16]. For example, an original value x appears in the perturbed dataset as x + n, where n is the noise. Consequently, adversaries will find it hard to extract the exact value x from the noisy data due to the high randomness of the added noise. However, due to the noise addition, differential privacy introduces a certain level of utility loss. By observing a side-by-side comparison of these approaches, including computational efficiency and flexible scalability, differential privacy is considered the most favorable concept for improving data privacy [21]. Furthermore, its properties, such as immunity to post-processing, its privacy guarantee, and composite differential privacy, further signify its importance [17]. Differential privacy is formally defined in the following section.

⁷ A semi-honest adversary in a collaborative environment with multiple entities executes its assigned task as expected, but it can be curious about the information of the other entities.
4.3.1 Differential Privacy and Its Application

Differential privacy (DP) is a privacy definition that guarantees a strong privacy level upon data [14, 16]. A randomized algorithm K provides differential privacy (for δ ≥ 0) if, for all adjacent datasets D1 and D2 (an adjacent dataset D2 of D1 differs from D1 in at most one element) and all S ⊆ Range(K),

Pr[K(D1) ∈ S] ≤ exp(ε) × Pr[K(D2) ∈ S] + δ.    (4.4)
In the above equation, ε represents the privacy budget, which provides a measure of the level of privacy leakage from a certain function or algorithm K that satisfies differential privacy [16, 18]. As the definition states, the value of ε should be maintained at a low level (e.g., 0.1 to 9) to make sure that K does not leak an unacceptable level of information. δ provides a certain relaxation to the definition by allowing a precalculated chance of failure. However, δ should be maintained at extremely low values (e.g., 1/(100 ∗ N), with N the number of instances in the database) to guarantee that there is only a meager chance (1%) of privacy violation. Applying noise to output results/queries to generate differentially private query/ML outputs is called global differential privacy. Applying noise to the input data to generate differentially private datasets is called local differential privacy. Due to its strong privacy guarantee, DP has been applied to deep learning [38] applications in DCML in areas such as healthcare [10]. Moreover, with DP, deep learning can guarantee robust resistance to privacy attacks such as membership inference attacks and model memorization attacks [34, 46]. Differentially private solutions for deep learning can be categorized into two types: (1) approaches based on global differential privacy [5], and (2) approaches based on local differential privacy [9]. Moreover, the existing architectural configurations of differentially private deep learning are depicted in Fig. 4.6. Global differential privacy applies noise based on the training algorithm. For example, it adds calibrated noise to the gradients of the model at each step of the stochastic gradient descent [5]. In contrast, local differential privacy applies noise to the data transferred between entities. For example, the noise can be added by introducing an intermediate layer of randomization between the convolutional layers and the fully connected layers of a convolutional neural network [9]. Global differential privacy has been the most popular approach for deep learning, as the amount of noise added to the learning process is lower than in local differential privacy, which often adds overly conservative noise to maintain DP. Also, global differential privacy provides more flexibility in calibrating the noise during the training process of a deep learning model. However, due to the higher noise levels, local differential privacy provides higher privacy levels than global differential privacy [29]. The literature shows solutions with high accuracy under both types of differential privacy; however, local differential privacy solutions offer higher privacy guarantees.
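As a small illustration of the two settings, the sketch below applies the Laplace mechanism once to a query output (global DP) and once to individual records before they leave the data owner (local DP). The sensitivity values, the attribute range, and ε are placeholders chosen for illustration only.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon):
    """Return value perturbed with Laplace noise of scale sensitivity / epsilon."""
    return value + np.random.laplace(loc=0.0, scale=sensitivity / epsilon, size=np.shape(value))

ages = np.array([34, 29, 41, 57, 23], dtype=float)

# Global DP: noise on the query output (a counting query has sensitivity 1).
noisy_count = laplace_mechanism(float(len(ages)), sensitivity=1.0, epsilon=0.5)

# Local DP: noise on each record itself, with the sensitivity taken here as the
# assumed range of the attribute (100 years).
noisy_ages = laplace_mechanism(ages, sensitivity=100.0, epsilon=0.5)
print(noisy_count, noisy_ages)
```

The difference in noise magnitude between the two calls mirrors the discussion above: perturbing raw records (local DP) requires far more noise than perturbing an aggregate (global DP) for the same ε.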
Fig. 4.6 Different configurations of differentially private deep learning under global and local settings. DP: Differential Privacy/Differentially Private, DL: Deep Learning
4.3.2 Differential Privacy in Federated Learning

Differential privacy is commonly integrated with FL, and it can be identified in two different forms: (1) adding differentially private noise to the parameter updates at the clients [59], and (2) adding differentially private noise to the sum of all parameter updates at the server [21]. In the first approach, calibrated noise is added to the local weight updates (at the client side), whereas in the second approach, calibrated noise is added to the global weight updates (at the server side). Equation 4.5 represents the differentially private parameter update at the server model (the second form above). In this equation, w_{k,t} is client k's parameter update at time instant t, K is the number of clients, S is the clipping threshold (sensitivity), and N is the noise scaled to S. In the first form above, the noise addition mechanism follows a similar format; however, since the noise is added to each client model separately, K becomes 1, and the parameter update considers only the current client's model weights.

w_{t+1} = w_t + (1/K) ( Σ_{k=0}^{K} w_{k,t} / max(1, ||w_k||₂ / S) + N )    (4.5)
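A sketch of the server-side variant of Eq. 4.5 is shown below. Representing each client update as a flattened parameter vector, using Gaussian noise, and the particular noise multiplier are assumptions made for illustration; they are not prescribed by [21].

```python
import torch

def dp_fedavg_update(w_t, client_updates, S, noise_std):
    """Server-side DP aggregation (Eq. 4.5): clip each client update to norm S,
    sum the clipped updates, add noise scaled to S, and average over the K clients."""
    K = len(client_updates)
    clipped = [u / max(1.0, u.norm(p=2).item() / S) for u in client_updates]
    noise = torch.normal(0.0, noise_std * S, size=w_t.shape)
    return w_t + (torch.stack(clipped).sum(dim=0) + noise) / K

# Example with flattened parameter vectors of a hypothetical 1000-parameter model.
w_t = torch.zeros(1000)
updates = [torch.randn(1000) * 0.1 for _ in range(10)]
w_next = dp_fedavg_update(w_t, updates, S=1.0, noise_std=0.8)
```

For the first form (client-side noise), the same routine would be applied per client with K = 1, as noted in the text above.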
4.3.3 Privacy in Split Learning

In contrast to FL, SL goes one step further by splitting the full model and its execution between the clients and the server. Consequently, SL introduces an additional level of privacy to the full model during training/testing against semi-honest clients and the server [22, 28]. This is because the main server has access only to the smashed data rather than the whole client-side model updates, and it is highly unlikely to invert all the client-side model parameters up to the raw data if the client-side model portion has a fully connected layer with a sufficiently large number of nodes [22]. In other cases, some inversion possibilities exist, but this problem can be addressed by modifying the loss function at the client side as in Eq. 4.3 [58]. For the same reason, a client, which has access only to the gradients (of the smashed data) from the server side, is not able to infer the server-side model portion. Besides, the SL configuration that trains a model without sharing labels with the server (refer to Sect. 4.2.2.1) enhances the inherent privacy by leaving the server clueless about the classification results. However, the smashed data transferred from the client to the server still reveal a certain level of information about the underlying data. In this regard, differential privacy [6, 51] and distance correlation techniques (see Sect. 4.2.2.2) have been used in SL.
4.3.4 Privacy in Splitfed Learning

SFL aims to introduce a more efficient approach with a higher level of privacy by utilizing the advantages of both FL and SL [51]. The introduction of a local model federation server within the local bounds of the SFL architecture not only improves efficiency (e.g., training time) but also enhances privacy during the local model parameter synchronization [51]. The enhanced privacy is obtained because a client has access only to the aggregated client-side model updates rather than another client's untouched model updates as in SL. To further limit privacy leaks, SFL investigates the use of differential privacy during local model training based on the differentially private deep learning approach developed in [5], which we call ADPDL. In SFL, ADPDL for local model training is applied according to the following equation,

g̃_{k,t} ← (1/n_k) ( Σ_i ḡ_{k,t}(x_i) + N ),    (4.6)

which adds calibrated noise N to the average gradient (where ḡ_{k,t} represents the ℓ₂-norm clipped gradients) calculated in each step of the SGD algorithm. Next, the client-side model parameters are updated according to W^C_{k,t+1} ← W^C_{k,t} − η_t g̃_{k,t} (where η_t represents the learning rate) by taking a step in the opposite direction of the gradient. As a client holds only a portion of the full model, ADPDL will only have its full effect on the client-side model portion when the total privacy budget is utilized.
Hence, in the initial steps, the effect of the noise on the activations and smashed data is minimal. To avoid any privacy leak due to this aspect, SFL adds a noise layer after the client-side model's cut layer. This layer adds calibrated noise to the smashed data in a utility-preserving manner based on the Laplace mechanism [41]. For this, the bounds (max A_{k,i}, min A_{k,i}) of the smashed data and a vector of intervals I_i = max A_{k,i} − min A_{k,i} are calculated. Next, Laplacian noise with scale I_i/ε is applied to randomize the smashed data according to the following equation,

A^P_{k,i} = A_{k,i} + Lap(I_i/ε),    (4.7)

where ε is the privacy budget used for the Laplacian noise.
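A sketch of such a noise layer is given below. Computing the interval I_i per feature of the smashed data from the current batch and the chosen value of ε are assumptions made for illustration; the exact bound estimation in [51] may differ.

```python
import torch

def laplace_noise_layer(smashed, epsilon):
    """Randomize the smashed data as in Eq. 4.7: A^P = A + Lap(I_i / epsilon),
    where I_i is the interval (max - min) of each smashed-data feature in the batch."""
    intervals = smashed.max(dim=0).values - smashed.min(dim=0).values   # I_i per feature
    scale = (intervals / epsilon).clamp_min(1e-6)                       # Laplace needs scale > 0
    noise = torch.distributions.Laplace(torch.zeros_like(scale), scale).sample((smashed.shape[0],))
    return smashed + noise

# Example: a batch of 32 smashed-data vectors with 128 features each.
noisy_smashed = laplace_noise_layer(torch.randn(32, 128), epsilon=1.0)
```

Only the noisy activations A^P are transmitted to the main server, so the utility/privacy trade-off is controlled entirely by ε on the client side.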
4.4 Applications and Implementation

Applications of privacy-preserving machine learning approaches such as SL are widespread (e.g., finance and health) due to their inherent data privacy mechanisms. In this section, focusing on SL, the applications are presented first, followed by implementations from a programming perspective.
4.4.1 Applications of Split Learning

Health data analytics are becoming an integral part of healthcare nowadays. The analytics are powered by (deep) machine learning/artificial intelligence models. However, to train these models, a considerable amount of data is usually required. Though the health domain has sufficient data for ML training/testing, the data are generally curated in silos due to privacy concerns. Thus, distributed machine learning approaches with privacy-preservation properties, such as SL, provide better alternatives for such domains. Figure 4.7 illustrates a simple setup where collaborative ML/AI training/testing is carried out among Hospital A, a Research Centre, and Hospital B without sharing their raw data. This setup directly reflects the architectural configurations of SL. The primary benefits of SL over FL in this setting are model privacy and fewer client-side computations. SL has been proposed for various settings with practical scenarios [57]. Techniques such as differential privacy and other noise integration methods are combined with SL to reduce the distance correlation of the smashed data with the raw input data for robust privacy. An improved SL with the KL divergence technique (Sect. 4.2.2.2) is tested over colorectal histology images without any data augmentation [56]. In a distributed setting with up to fifty clients, the proposed approach is used to analyze the diabetic retinopathy dataset over ResNet34 and the chest X-ray dataset over DenseNet121. The results show that it provides better accuracy than non-collaborative techniques [42].
Fig. 4.7 An application of split learning in a cross-siloed health environment (a coordinating server and model owner collaborating with Hospital A, a Research Centre, and Hospital B, each holding its own data)
In a separate work, ECG data privacy is increased by integrating SL with differential privacy measures and by increasing the number of layers in the client-side model [6]. SL has also been implemented in an edge-device machine learning configuration that can effectively handle internet of things (IoT) gateways. In a setup with five Raspberry Pi kits acting as IoT gateways, SL is evaluated against FL. SL shows significant benefits by reducing the training time (2.5 h to run MobileNet on CIFAR10, whereas FL takes 8 h) and by reducing the heat dissipated by the kit [20]. This has practical importance, as a kit can break down due to the excessive heat generated by the computations. SL has also been successfully implemented in wireless communication [31]. Precisely, an SL-based approach is proposed to integrate the image and the received radio-frequency signal, which is later used to predict the received power of millimeter-wave radio-frequency signals. This approach is communication-efficient and privacy-preserving, as it compresses the communication payload by adjusting the pooling size of the machine learning network architecture.
4.4.2 Implementation of Split Learning

In this section, vanilla SL with LeNet5 on the FMNIST dataset is implemented for illustration purposes. To this end, one client and one server are considered; however, the program can handle multiple clients. The programs for multiple clients are made by simply changing the device identity, indicated by the variable idx, in the same client program for each respective client. The network architecture, i.e., LeNet5, is split to form the client-side network portion and the server-side network portion, as shown in Fig. 4.8. For multiple clients, the dataset is uniformly, identically, and independently distributed among them (the IID setting).
Fig. 4.8 A code snippet of an example network architecture for the a client-side and b server-side portions of LeNet5, where the network split is done at the second layer after the max pool layer of the LeNet5 architecture
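Since Fig. 4.8 is reproduced only as an image, a sketch of the kind of split it describes is shown below. The exact layer sizes follow a common LeNet5 variant for 28 × 28 inputs and the placement of the cut after the second convolution/max-pool block is an assumption made for illustration; the figure's snippet may differ in detail.

```python
import torch.nn as nn

class LeNet5Client(nn.Module):
    """Client-side portion of LeNet5 (input layer up to the assumed cut layer)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(),
            nn.MaxPool2d(2),
        )

    def forward(self, x):
        return self.features(x)          # smashed data sent to the server

class LeNet5Server(nn.Module):
    """Server-side portion of LeNet5 (from the cut layer to the output layer)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, smashed):
        return self.classifier(smashed)
```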
One user-defined function iid_datadistribution(dataset, numusers) and one class get_data(Dataset) are implemented for this purpose, and their code snippets are provided in Fig. 4.9. The next step is to create a data loader, which combines a dataset and a sampler to iterate over the given dataset in PyTorch. This process is depicted in Fig. 4.10. Each client's program and the server program run separately. Socket programming is used to simulate the communication between the clients and the server. A Python socket is used along with some helper functions for handling messages while transmitting and receiving; refer to Fig. 4.11 for a code snippet. All programs can be run on the same localhost or on different hosts. If different hosts are used, then the address of the host needs to be provided. For the initial setup, the server program runs first; then, the client programs are started sequentially. The complete program is available on GitHub,⁸ and the accuracy curves for training and testing are provided in Fig. 4.12. This result is obtained at the server side after ten global epochs,⁹ where each global epoch has only one local epoch.¹⁰ Our complete implementation is available at [3].
⁸ Some implementations by other sources: for extended SL and vertically partitioned SL: https://github.com/nin-ed/Split-Learning; for a TensorFlow implementation: https://colab.research.google.com/drive/1GG5HctuRoaQF1Yp6ko3WrJFtCDoQxT5_#scrollTo=esvT5OgzG6Fd. Now also available on PySyft.
⁹ One global epoch occurs when the (forward and back) propagation is completed over all active clients' datasets for one cycle.
¹⁰ In one local epoch of a client, one forward propagation and its respective back-propagation are completed over the entire local dataset of the client.
Fig. 4.9 A code snippet of a function and a class implemented for IID data distribution among clients: a returns a dictionary whose keys are the clients' idx and whose values are lists of random indices of the samples, and b a class that extracts each data point (i.e., x) and its label (i.e., y) from the whole Dataset; its object is used as an input when creating a PyTorch data loader
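The two pieces described in the caption of Fig. 4.9 might look as follows. The function and class names match those used in the text, while their bodies are a sketch of one straightforward implementation rather than the figure's exact code.

```python
import numpy as np
from torch.utils.data import Dataset

def iid_datadistribution(dataset, numusers):
    """Return {client_idx: list of sample indices}, an IID split of the dataset."""
    num_items = len(dataset) // numusers
    all_idxs = list(range(len(dataset)))
    dict_users = {}
    for i in range(numusers):
        dict_users[i] = [int(j) for j in np.random.choice(all_idxs, num_items, replace=False)]
        all_idxs = list(set(all_idxs) - set(dict_users[i]))
    return dict_users

class get_data(Dataset):
    """Wrap a dataset restricted to the given indices, yielding (x, y) pairs."""
    def __init__(self, dataset, idxs):
        self.dataset = dataset
        self.idxs = [int(i) for i in idxs]

    def __len__(self):
        return len(self.idxs)

    def __getitem__(self, item):
        x, y = self.dataset[self.idxs[item]]
        return x, y
```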
Fig. 4.10 A code snippet depicting the implementation of the data loader in PyTorch utilizing iid_datadistribution(dataset, numusers) and get_data(Dataset)
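Combining the two pieces sketched above into a data loader for client idx could then look like the following; the batch size and shuffling are arbitrary choices, and the FMNIST download path is a placeholder.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_set = datasets.FashionMNIST("./data", train=True, download=True,
                                  transform=transforms.ToTensor())
dict_users = iid_datadistribution(train_set, numusers=1)
idx = 0   # device identity of this client
train_loader = DataLoader(get_data(train_set, dict_users[idx]),
                          batch_size=64, shuffle=True)
```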
Split Learning Implementation in Raspberry Pi
Raspberry Pi kits are IoT gateways for low-end IoT devices. They support the operating systems (OS) needed to compute and process the ML/AI model on the client side. A Raspberry Pi 3 Model B V1.2 can run PyTorch version 1.0.0 on Raspbian GNU/Linux 10 with Python version 3.7.3 and no CUDA support. A useful manual for installing PyTorch on a Raspberry Pi is available at [2]. In one of our works, a laptop with an i7-7700HQ CPU, a GTX 1050 GPU, PyTorch version 1.0.0, Windows 10, Python version 3.6.8, and CUDA version 10.1 was considered as the server.
Fig. 4.11 A code snippet of the socket implementation on a the client side and b the server side; recv_msg(connection) and send_msg(connection, msg) are user-defined helper functions
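The helper functions named in the caption of Fig. 4.11 could be sketched as follows. The length-prefixed pickle framing is one common way to exchange Python objects (tensors, state dicts) over a socket and is an assumption here, not necessarily the scheme used in the chapter's repository.

```python
import pickle
import socket
import struct

def send_msg(connection, msg):
    """Serialize msg and send it with a 4-byte big-endian length prefix."""
    payload = pickle.dumps(msg)
    connection.sendall(struct.pack(">I", len(payload)) + payload)

def recv_msg(connection):
    """Receive one length-prefixed message and deserialize it."""
    header = _recv_exact(connection, 4)
    (length,) = struct.unpack(">I", header)
    return pickle.loads(_recv_exact(connection, length))

def _recv_exact(connection, n):
    data = b""
    while len(data) < n:
        chunk = connection.recv(n - len(data))
        if not chunk:
            raise ConnectionError("socket closed while receiving")
        data += chunk
    return data

# Client-side usage (the server address and payload keys are placeholders):
# sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# sock.connect(("127.0.0.1", 8080))
# send_msg(sock, {"activations": smashed, "labels": y})
# grads = recv_msg(sock)
```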
Fig. 4.12 Accuracy curves of training and testing of LeNet5 on the FMNIST dataset up to ten global epochs, each with one local epoch per client
Fig. 4.13 a Peak power and b temperature measurements [20]. A plug-in power meter is used to measure the power consumption in kWh. The temperature of the Raspberry Pi CPU is measured using the Python library CPUTemperature and its CPUTemperature() function. The experiment is performed with a 1D CNN model (two CNN layers at the client side, and one CNN and two fully connected layers) over the ECG dataset
For simplicity, a simple model architecture with a 1D CNN on sequential time-series data is trained in a setup with five Raspberry Pi kits and a laptop. The kits and the laptop were connected by a dedicated 10 Gbit/s LAN. Measurements of power and temperature per kit are depicted in Fig. 4.13. For the complete code and more results, refer to [1] and [20], respectively.
4.4.3 Implementation of Splitfed Learning with a Code Example

In this section, vanilla SFL with LeNet5 on the FMNIST dataset is implemented. The setup has three clients and two servers (the main server and the fed server). There are three separate programs: one for the client, one for the fed server, and one for the main server. The server program is capable of handling any number of clients. Clients are labeled with a unique identity indicated by an integer starting from zero. The program for a client is obtained by simply assigning the variable idx to the client's label in the program. The LeNet5 architecture is split as shown in Fig. 4.8. The dataset distribution part and the socket connection part are kept the same as in the SL implementation (refer to Sect. 4.4.2). The server programs run first; then the client programs are started, in the following sequence: the fed server program, then the main server program, and after that all client programs. The complete program code is available at [4]. The model aggregation at the fed server and the main server is done by applying weighted averaging on all locally trained model portions. The function that performs the aggregation is given as a code snippet in Fig. 4.14a.
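Since Fig. 4.14a is reproduced only as an image, a sketch of such a weighted-averaging routine is given below. It assumes the client models arrive as PyTorch state dicts collected in a list w, with matching sample counts for the n_k/n weights; the figure's own function may differ in detail.

```python
import copy

def weighted_fedavg(w, n_samples):
    """Weighted average of the model state dicts collected in the list w.

    n_samples[k] is the number of local samples at client k, so the weights are
    n_k / n as in Algorithm 4.2. Floating-point parameters are assumed.
    """
    total = float(sum(n_samples))
    w_avg = copy.deepcopy(w[0])
    for key in w_avg.keys():
        w_avg[key] = sum(sd[key] * (n_k / total) for sd, n_k in zip(w, n_samples))
    return w_avg

# Usage at the fed server (or at the main server for splitfedv1):
# global_state = weighted_fedavg([m.state_dict() for m in client_models],
#                                [len(d) for d in client_datasets])
# client_model.load_state_dict(global_state)
```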
Fig. 4.14 a Code snippet of a user-defined function to compute the weighted average of the models' weights received from multiple clients and collected in a list w, and b output snapshot while training at the main server at global epoch 28

Fig. 4.15 Average accuracy curves while training LeNet5 on the FMNIST dataset up to 30 global epochs, each with one local epoch, over three clients
A snapshot of the outputs at the server during training with respect to the global epoch is provided in Fig. 4.14b, while Fig. 4.15 depicts the accuracy convergence during the training process. We use the cross-entropy loss as our loss function; it measures the difference between the probability distributions of the ground truths and the outputs of the model.
4.5 Challenges and Open Problems

Keeping the focus on SL, this section presents and discusses the challenges and open problems in FL, SL, and SFL. This helps to shape possible future research avenues.
4.5.1 Challenges and Open Problems in Federated Learning

In FL, most of the vulnerabilities or privacy challenges come from untrusted entities. Such an entity can be a server or a client. Moreover, these entities can initiate various attacks, including model inversion attacks [19], reconstruction attacks, and membership-inference attacks [53], in an FL environment. To overcome these vulnerabilities, the literature shows the application of privacy- and security-enhancing techniques such as DP, secure multi-party computation, and homomorphic encryption [21, 39, 44]. However, these are not enough in a general setting, as they suffer from performance degradation (as in DP) or high computation overhead (as in homomorphic encryption). Thus, finding an optimal and practically feasible solution is still open. Another prominent issue in FL is the need for constant communication with the central server. However, the distributed clients (most often) might have limited bandwidth and connectivity with the server. Hence, maintaining secure channels, maintaining a steady connection, and maintaining trust in clients are extremely complex open challenges [43]. FL is primarily based on the SGD optimization algorithm, which is used to train deep neural networks [28, 62]. To guarantee an unbiased estimate of the full gradients, the data should follow IID properties. However, in real-world scenarios, it is unlikely that the
datasets follow IID properties; real-world datasets often have non-IID properties. Consequently, with skewed non-IID data, FL can show poor performance [62]. Thus, how to perform FL under a highly skewed non-IID setting is still an active field of research.
4.5.2 Challenges and Open Problems in Split Learning

Due to the similarities in their distributed learning architectures, SL shares similar issues with FL. As its fundamental definition is based on splitting a network, SL's current configurations are limited to neural network-based model architectures [28]. For non-neural architectures such as support vector machines, and even for some neural network-based architectures such as recurrent neural networks, an efficient way to split the network and perform SL over it is yet to be found. Similar to FL with non-IID data distributions, as mentioned in [28], SL is also susceptible to data skewness. In terms of label distribution skew, where each client has samples belonging only to some class labels, SL can also show poor performance. Besides, SL does not learn under a one-class setup. Compared to FL, SL may need more communication due to the additional step of local model synchronization (peer-to-peer or client-server). SL is also sensitive to data communication challenges. As in FL, maintaining secure channels, maintaining steady connections, and maintaining trust among clients are additional challenging problems in SL. Moreover, the unpredictability of a client's status (e.g., connected, dropped, sleeping, or terminated) plays a significant role in model convergence, as the client model synchronization in SL is done through a passive sequential parameter update. The feasibility of any approach in a resource-constrained environment (e.g., a smart city using IoTs) is driven by the communication efficiency of that approach; herein, SL needs further attention. Compressing the smashed data and reducing the number of communications necessary for convergence in SL are two important research ventures. Another main problem that is yet to be fully solved relates to information leakage from the cut layer and its countermeasures. Though there have been some works in this regard, further investigation is needed to provide a practically feasible solution that can limit the information leakage and maintain the utility of the model at the same time in a general setup. Client vulnerabilities (e.g., the participation of vulnerable clients), server vulnerabilities (e.g., malicious servers), and communication channel issues (e.g., spoofing) are three architectural vulnerabilities of SL that need further investigation. Moreover, the efficient incorporation of homomorphic encryption, secure multi-party computation, and DP to address such vulnerabilities is still an open problem.
4.5.3 Challenges and Open Problems in Splitfed Learning

SFL is devised by amalgamating SL and FL. However, the central concept of SFL is based on SL, whereas FL is used as an effective solution for the local model synchronization. Hence, apart from the local model synchronization issues of SL, all other challenges and open problems related to SL are also common to SFL.
4.6 Conclusion

This chapter presented an analytical picture of the advancement in distributed learning paradigms from federated learning (FL) to split learning (SL), specifically from SL's perspective. One of the fundamental features common to FL and SL is that they both keep the data within the control of the data custodians/owners and do not require access to the raw data. In addition to this feature, SL provides the further capability of enabling ML model training/testing in resource-constrained client-side environments by splitting the model and allowing the clients to perform computation only on a small portion of the ML model. Besides, full model privacy is achieved in SL while training/testing is done with a curious server or curious clients. FL and SL were each shown to outperform the other in different setups and configurations. For example, SL has faster convergence, but it fails to converge if the data distribution is highly skewed, whereas FL shows some resilience in this case, possibly due to its model aggregation technique, which is weighted averaging. Thus, in general, SL, FL, and a hybrid approach like splitfed learning each have their importance. As SL is in its initial development phase, there are multiple open problems related to several aspects, including communication efficiency, model convergence, device-based vulnerabilities, and information leakage. With research inputs over time, SL will become a mature privacy-preserving distributed collaborative machine learning approach.
References

1. https://github.com/Minki-Kim95/Federated-Learning-and-Split-Learning-with-raspberry-pi
2. https://github.com/Minki-Kim95/Install-pytorch-on-RaspberryPi
3. https://github.com/chandra2thapa/Vanilla-split-learning
4. https://github.com/chandra2thapa/Vanilla-SplitFed-learning
5. M. Abadi, A. Chu, I. Goodfellow, H.B. McMahan, I. Mironov, K. Talwar, L. Zhang, Deep learning with differential privacy, in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (2016), pp. 308–318
6. S. Abuadbba, K. Kim, M. Kim, C. Thapa, S.A. Camtepe, Y. Gao, H. Kim, S. Nepal, Can we use split learning on 1D CNN models for privacy preserving training?, in Proceedings of the ACM AsiaCCS (2020). arXiv:2003.12365
7. A. Acar, H. Aksu, A.S. Uluagac, M. Conti, A survey on homomorphic encryption schemes: theory and implementation. ACM Comput. Surv. 51(4), 79:1–79:35 (2018)
8. Y. Aono, T. Hayashi, L. Wang, S. Moriai et al., Privacy-preserving deep learning via additively homomorphic encryption. IEEE Trans. Inf. Forensics Secur. 13(5), 1333–1345 (2017)
9. P.C.M. Arachchige, P. Bertok, I. Khalil, D. Liu, S. Camtepe, M. Atiquzzaman, Local differential privacy for deep learning. IEEE Internet Things J. (2019)
10. P.C.M. Arachchige, P. Bertok, I. Khalil, D. Liu, S. Camtepe, M. Atiquzzaman, A trustworthy privacy preserving framework for machine learning in industrial IoT systems. IEEE Trans. Ind. Inf. 16(9), 6092–6102 (2020)
11. E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, V. Shmatikov, How to backdoor federated learning, in International Conference on Artificial Intelligence and Statistics (PMLR, 2020), pp. 2938–2948
12. S. Caldas, S.M.K. Duddu, P. Wu, T. Li, J. Konečný, H. Brendan McMahan, V. Smith, A. Talwalkar, Leaf: a benchmark for federated settings (2019). arXiv:1812.01097
13. I. Ceballos, V. Sharma, E. Mugica, A. Singh, P. Vepakomma, R. Raskar, A. Roman, SplitNN-driven vertical partitioning (2020). arXiv:2008.04137
14. M.A.P. Chamikara, P. Bertók, D. Liu, S. Camtepe, I. Khalil, An efficient and scalable privacy preserving algorithm for big data and data streams. Comput. Secur. 87 (2019)
15. W. Du, Y.S. Han, S. Chen, Privacy-preserving multivariate statistical analysis: linear regression and classification, in Proceedings of the 2004 SIAM International Conference on Data Mining (SIAM, 2004), pp. 222–233
16. C. Dwork, Differential privacy: a survey of results, in International Conference on Theory and Applications of Models of Computation (Springer, 2008), pp. 1–19
17. C. Dwork, A. Roth, The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9(3–4), 211–407 (2014)
18. C. Dwork, A. Roth et al., The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9(3–4), 211–407 (2014)
19. M. Fredrikson, S. Jha, T. Ristenpart, Model inversion attacks that exploit confidence information and basic countermeasures, in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (2015), pp. 1322–1333
20. Y. Gao, M. Kim, S. Abuadbba, Y. Kim, C. Thapa, K. Kim, S.A. Camtepe, H. Kim, S. Nepal, End-to-end evaluation of federated learning and split learning for internet of things, in Proceedings of the SRDS (2020). arXiv:2003.13376
21. R.C. Geyer, T. Klein, M. Nabi, Differentially private federated learning: a client level perspective (2017). arXiv:1712.07557
22. O. Gupta, R. Raskar, Distributed learning of deep neural network over multiple agents. J. Netw. Comput. Appl. 116, 1–8 (2018)
23. F. Haddadpour, M.M. Kamani, A. Mokhtari, M. Mahdavi, Federated learning with compression: unified analysis and sharp guarantees (2020). arXiv:2007.01154
24. A. Hard, K. Rao, R. Mathews, S. Ramaswamy, F. Beaufays, S. Augenstein, C. Kiddon, D. Ramage, H. Eichner, Federated learning for mobile keyboard prediction (2018). arXiv:1811.03604
25. C. He, S. Li, J. So, M. Zhang, H. Wang, X. Wang, P. Vepakomma, A. Singh, H. Qiu, L. Shen, et al., FedML: a research library and benchmark for federated machine learning (2020). arXiv:2007.13518
26. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE CVPR (2016), pp. 770–778
27. F. Jiang et al., Artificial intelligence in healthcare: past, present and future. Stroke Vasc. Neurol. 21, 230–243 (2017)
28. P. Kairouz, H.B. McMahan, B. Avent, A. Bellet, M. Bennis, A.N. Bhagoji, K. Bonawitz, et al., Advances and open problems in federated learning (2019). arXiv:1912.04977
29. P. Kairouz, S. Oh, P. Viswanath, Extremal mechanisms for local differential privacy, in Advances in Neural Information Processing Systems (2014), pp. 2879–2887
30. J. Kim, S. Shin, J. Lee, K. Lee, Y. Yu, Multiple classification with split learning (2020). arXiv:2008.09874
31. Y. Koda, J. Park, M. Bennis, K. Yamamoto, T. Nishio, M. Morikura, One pixel image and RF signal based split learning for mmWave received power prediction, in Proceedings of the 15th International Conference on Emerging Networking Experiments and Technologies (2019)
32. T. Kraska, A. Talwalkar, J.C. Duchi, R. Griffith, M.J. Franklin, M.I. Jordan, MLbase: a distributed machine-learning system. CIDR 1, 1–7 (2013)
33. A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in Proceedings of NIPS'12, Vol. 1, USA (2012), pp. 1097–1105
34. K. Leino, M. Fredrikson, Stolen memories: leveraging model memorization for calibrated white-box membership inference, in 29th USENIX Security Symposium (2020), pp. 1605–1622
35. T. Li, A.K. Sahu, A. Talwalkar, V. Smith, Federated learning: challenges, methods, and future directions. IEEE Signal Proc. Mag. 37(3), 50–60 (2020)
36. Y. Liu, J.Q. James, J. Kang, D. Niyato, S. Zhang, Privacy-preserving traffic flow prediction: a federated learning approach. IEEE Internet Things J. (2020)
37. H.B. McMahan, E. Moore, D. Ramage, S. Hampson, B.A. Arcas, Communication-efficient learning of deep networks from decentralized data, in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54 of JMLR: W&CP (2017), pp. 1–10
38. N. Mohammed, R. Chen, B.C.M. Fung, P.S. Yu, Differentially private data release for data mining, in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2011), pp. 493–501
39. P. Mohassel, Y. Zhang, SecureML: a system for scalable privacy-preserving machine learning, in 2017 IEEE Symposium on Security and Privacy (SP) (IEEE, 2017), pp. 19–38
40. A.M. Ozbayoglu, M.U. Gudelek, O.B. Sezer, Deep learning for financial applications: a survey (2020). arXiv:2002.05786
41. N.H. Phan, X. Wu, H. Hu, D. Dou, Adaptive Laplace mechanism: differential privacy preservation in deep learning, in 2017 IEEE International Conference on Data Mining (ICDM) (IEEE, 2017), pp. 385–394
42. M.G. Poirot, P. Vepakomma, K. Chang, J. Kalpathy-Cramer, R. Gupta, R. Raskar, Split learning for collaborative deep learning in healthcare (2019). arXiv:1912.12115
43. L. Reyzin, A.D. Smith, S. Yakoubov, Turning hate into love: homomorphic ad hoc threshold encryption for scalable MPC. IACR Cryptol. ePrint Arch. 2018, 997 (2018)
44. R.L. Rivest, L. Adleman, M.L. Dertouzos et al., On data banks and privacy homomorphisms. Found. Secure Comput. 4(11), 169–180 (1978)
45. T. Ryffel, A. Trask, M. Dahl, B. Wagner, J. Mancuso, D. Rueckert, J. Passerat-Palmbach, A generic framework for privacy preserving deep learning (2018). arXiv:1811.04017
46. R. Shokri, M. Stronati, C. Song, V. Shmatikov, Membership inference attacks against machine learning models, in 2017 IEEE Symposium on Security and Privacy (SP) (IEEE, 2017), pp. 3–18
47. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in Proceedings of the 3rd ICLR (2015)
48. A. Singh, P. Vepakomma, O. Gupta, R. Raskar, Detailed comparison of communication efficiency of split learning and federated learning (2019). arXiv:1909.09145
49. G.J. Székely, M.L. Rizzo, N.K. Bakirov et al., Measuring and testing dependence by correlation of distances. Ann. Stat. 35(6), 2769–2794 (2007)
50. C. Thapa, S. Camtepe, Precision health data: requirements, challenges and existing techniques for data security and privacy (2020). arXiv:2008.10733
51. C. Thapa, M.A.P. Chamikara, S. Camtepe, SplitFed: when federated learning meets split learning (2020). arXiv:2004.12088
52. A. Tizghadam, H. Khazaei, M.H.Y. Moghaddam, Y. Hassan, Machine learning in transportation. J. Adv. Transp. (2019)
53. A. Triastcyn, B. Faltings, Federated generative privacy. IEEE Intell. Syst. (2020)
54. P. Tschandl, The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions (2018). https://doi.org/10.7910/DVN/DBW86T
55. M. Van Dijk, C. Gentry, S. Halevi, V. Vaikuntanathan, Fully homomorphic encryption over the integers, in Annual International Conference on the Theory and Applications of Cryptographic Techniques (Springer, Berlin, 2010), pp. 24–43
56. P. Vepakomma, O. Gupta, A. Dubey, R. Raskar, Reducing leakage in distributed deep learning for sensitive health data, in Proceedings of the ICLR (2019)
57. P. Vepakomma, O. Gupta, T. Swedish, R. Raskar, Split learning for health: distributed deep learning without sharing raw patient data (2018). arXiv:1812.00564
58. P. Vepakomma, A. Singh, O. Gupta, R. Raskar, NoPeek: information leakage reduction to share activations in distributed deep learning (2020). arXiv:2008.09161
59. K. Wei, J. Li, M. Ding, C. Ma, H.H. Yang, F. Farokhi, S. Jin, T.Q.S. Quek, H.V. Poor, Federated learning with differential privacy: algorithms and performance analysis. IEEE Trans. Inf. Forensics Secur. (2020)
60. Q. Yang, Y. Liu, T. Chen, Y. Tong, Federated machine learning: concept and applications. ACM Trans. Intell. Syst. Technol. (TIST) 10(2), 1–19 (2019)
61. A.C. Yao, Protocols for secure computations, in Proceedings of the 23rd Annual Symposium on Foundations of Computer Science (FOCS '82) (1982), pp. 160–164
62. Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, V. Chandra, Federated learning with non-IID data (2018). arXiv:1806.00582
Chapter 5
PySyft: A Library for Easy Federated Learning
Alexander Ziller, Andrew Trask, Antonio Lopardo, Benjamin Szymkow, Bobby Wagner, Emma Bluemke, Jean-Mickael Nounahon, Jonathan Passerat-Palmbach, Kritika Prakash, Nick Rose, Théo Ryffel, Zarreen Naowal Reza, and Georgios Kaissis
Abstract PySyft is an open-source multi-language library enabling secure and private machine learning by wrapping and extending popular deep learning frameworks such as PyTorch in a transparent, lightweight, and user-friendly manner. Its aim is to both help popularize privacy-preserving techniques in machine learning by making them as accessible as possible via Python bindings and common tools familiar to researchers and data scientists, as well as to be extensible such that new Federated Learning (FL), Multi-Party Computation, or Differential Privacy methods can be flexibly and simply implemented and integrated. This chapter will introduce the methods available within the PySyft library and describe their implementations. We will then provide a proof-of-concept demonstration of a FL workflow using an example of how to train a convolutional neural network. Next, we review the use of PySyft in academic literature to date and discuss future use-cases and development plans. Most importantly, we introduce Duet: our tool for easier FL for scientists and data owners.

Keywords Privacy · Federated learning · Differential privacy · Multi-party computation

We thank the OpenMined community and contributors for their work making PySyft possible. For more information about OpenMined, find us on GitHub or Slack: https://www.openmined.org/.

A. Ziller: Technical University of Munich, Munich, Germany
A. Trask, E. Bluemke: University of Oxford, Oxford, UK
A. Lopardo: ETH Zurich, Zurich, Switzerland
A. Ziller, A. Trask, A. Lopardo, B. Szymkow, B. Wagner, E. Bluemke, J.-M. Nounahon, J. Passerat-Palmbach, K. Prakash, N. Rose, T. Ryffel, Z. N. Reza, G. Kaissis: OpenMined, Oxford, UK
J.-M. Nounahon: De Vinci Research Centre, Paris, France
J. Passerat-Palmbach: Imperial College London, Consensys Health, London, UK
K. Prakash: IIIT Hyderabad, Hyderabad, India
T. Ryffel: INRIA, ENS, PSL University Paris, Paris, France
Z. N. Reza: Thales Canada Inc., Quebec, Canada
G. Kaissis (corresponding author): Technical University of Munich, Imperial College London, Munich, Germany. e-mail: [email protected]
5.1 Introduction to PySyft Modern machine learning requires large datasets to achieve state-of-the-art performance. Furthermore, validating that the trained models are fair, unbiased, and robust often requires even more data. However, many of the most useful datasets include confidential or private data that needs to be protected for regulatory, contractual, or ethical reasons. Additionally, much of this data is generated and stored in a decentralized fashion, for example on edge computing hardware such as mobile phones or wearable health-tracking devices. Research using large, private, decentralized datasets requires novel technical solutions so that this data can be used securely. These technical solutions must enable training on data that is neither locally available nor directly accessible, without leaking that data. Decentralized computing techniques collectively referred to as Federated Learning (FL) allow training on non-local data. In FL, training is performed at the location where the data resides, and only the machine learning (ML) algorithm (or updates to it) is transferred. Secure machine learning, on the other hand, represents a collection of techniques that allow ML models to be trained without direct access to the data while protecting the models themselves from theft, manipulation, or misuse. Examples of secure computation include encryption techniques such as homomorphic encryption (HE) and performing computation on fragments of data distributed over a computational network (secure multi-party computation, SMPC). Differentially Private (DP) machine learning aims to prevent trained models from inadvertently storing personally identifiable information about the dataset in the model itself. One common way of preventing this memorization is by perturbing the dataset (for example by noise injection) in a way that allows statistical reasoning about the entire dataset while protecting the identity of any single individual contained in it.
For secure and private ML to attain widespread acceptance, the availability of well-maintained, high-quality, easily deployed open-source code libraries incorporating state-of-the-art techniques from the above-mentioned fields is essential. OpenMined, an open-source community, built the PySyft library to make these techniques as accessible and easy to implement as possible.
5.1.1 The PySyft Library PySyft is an open-source, multi-language library that enables secure and private machine learning. PySyft aims to popularize privacy-preserving techniques in machine learning by making them as accessible as possible via Python bindings and an interface reminiscent of common machine learning frameworks, familiar to researchers and data scientists. To ease future development in the field of privacy-preserving machine learning, PySyft aims to be extensible such that new FL, Multi-Party Computation, or Differential Privacy methods can be flexibly and simply implemented and integrated. This chapter will introduce the methods available within the PySyft library and describe their implementations. We will then provide a proof-of-concept demonstration of a FL workflow using an example of how to train a convolutional neural network. Next, we review the use of PySyft in academic literature to date and discuss future use-cases and development plans. Most importantly, we introduce Duet: our tool for easier FL for scientists and data owners. Duet is the way to use PySyft for data science in a seamless, intuitive way.
5.1.2 Privacy and FL FL is a collection of computational techniques allowing the distributed training of algorithms on remote datasets. It relies on distributing copies of the learning algorithm to where the data is, instead of centrally collecting the data itself. For example, in a healthcare setting, patient data can remain on the hospital's servers, thus retaining data ownership while still allowing the training of algorithms on the data. FL can be set up in several ways; a common configuration has decentralized nodes perform the training and send their updates back to a central server for aggregation. FL itself is not always a sufficient privacy mechanism. Deep learning models tend to unintentionally memorize aspects of their training data, i.e. to store dataset attributes in their parameters [11]. This unintentional memorization makes FL settings without additional privacy-preserving techniques vulnerable to various attacks, such as attribute or membership inference attacks. In addition to the various attack vectors, FL typically has substantial network transfer requirements as well as usually reduced algorithm performance compared to centralized training. Techniques for efficient training and for reducing network transfer requirements are being actively
researched [23]. These concerns aside, FL is an important component of many privacy-preserving machine learning systems.
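To make the central aggregation step mentioned above concrete, the following is a minimal sketch of weighted averaging of client updates (the scheme popularized as FedAvg). It is plain Python/NumPy rather than PySyft code, and the client_updates structure is an assumption made only for illustration.

    import numpy as np

    def federated_average(client_updates):
        # client_updates: list of (weights, n_samples) pairs, where weights is a
        # list of NumPy arrays holding one client's locally trained parameters
        total = sum(n for _, n in client_updates)
        n_layers = len(client_updates[0][0])
        # weight each client's parameters by its share of the total training data
        return [
            sum(w[i] * (n / total) for w, n in client_updates)
            for i in range(n_layers)
        ]

In this picture, each node trains locally, ships only its parameters (or parameter updates) to the server, and receives the averaged model back for the next round; the raw data never leaves the node.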
5.1.3 Differential Privacy Differential privacy is a randomized system for publicly sharing meaningful information about a dataset by describing the patterns within the dataset while withholding information about individuals in the dataset. It is a process of perturbing the data with minimal noise to preserve the privacy of the users while minimizing the loss in the utility of the data and the computational complexity of operating on it. Uncertainty in the process means uncertainty for the attacker, which means better privacy. Consider the example of a simple private database that stores the names and ages of the citizens of a small town. This dataset could be useful for various applications and help better the lives of the citizens of that town, but it is important that the database remains private and does not get into the wrong hands: the private data could be misused! If we want to publish statistics on the database, such as the average, minimum, and maximum age of the citizens, we need to be careful. If we report the true average, minimum, or maximum, we might compromise the privacy of some specific citizens. So we publicly report a slightly noisy average, minimum, and maximum, so that no single citizen's privacy is breached. This gives all citizens a certain degree of plausible deniability as to whether they were part of the database. ε-Differential Privacy provides us with a universal privacy guarantee: given a small privacy budget ε, we can ensure that the output distribution of the process changes by at most a factor of e^ε under the change or removal of a single person from the database. Differential privacy is more robust than naive anonymization in that it quantifies the risk of data leakage and provides theoretical privacy guarantees within its scope of application. The amount of noise that needs to be added to the query result depends on:
• the sensitivity of the query,
• the distribution of the noise source, and
• the privacy budget.
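As a minimal sketch of the noisy-statistics idea above, the following applies the Laplace mechanism to the average-age query. The toy data, the assumed bound on ages, and the privacy budget are all illustrative assumptions, not values taken from the chapter.

    import numpy as np

    ages = [23, 35, 41, 29, 67, 52, 38]   # toy data for a small town
    epsilon = 0.5                          # privacy budget
    age_range = 100                        # assumed bound on any single age

    # replacing one person can shift the mean by at most age_range / len(ages),
    # so the Laplace noise is scaled to that sensitivity divided by epsilon
    sensitivity = age_range / len(ages)
    noisy_mean = np.mean(ages) + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    print(noisy_mean)

A smaller epsilon means more noise and therefore stronger plausible deniability for any single citizen, at the cost of a less accurate published statistic.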
5.1.3.1 Differentially Private Machine Learning
In the context of data-driven approaches like machine learning, Differential Privacy is very useful, as their objectives align. Both focus on recognizing general meaningful patterns instead of individual user data. However, it is quite challenging to generalize various DP techniques to the broad set of techniques in Machine Learning. Consider an artificial neural network—it is a complex function involving a large number of parameters and inputs. Finding the query sensitivity of this neural network is quite
tricky, as it involves many non-linear computations. This acts as a barrier to applying Differential Privacy more generally to machine learning. However, despite the challenges, there has been some interesting and useful research in this area. Works focusing on Differential Privacy applied to learning algorithms consider the following key factors:
• The scale of the noise added relative to the data (norm).
• The right stage in the learning process for adding noise; this is often strongly related to the model of computation and data flow. In the case of data coming in from multiple (distributed) sources, we might consider adding noise early on (a small sketch follows this list).
• Pre-processing to obtain an upper bound on the sensitivity of the data.
• Adversarial attacks specific to learning methods, such as model inversion, linkage, and data reconstruction attacks; we are interested in how their behavior changes when Differential Privacy is used.
• The utility-privacy-computation speed trade-off in the learning process. Since Differential Privacy provides a quantitative method to measure privacy, it supports a rigorous analysis of the decrease in utility that we accept at the cost of data privacy.
• The effect of Differential Privacy on the learning model's ability to generalize. It has been a surprising result that adding noise to ensure privacy can improve the utility of the learning model, as it can generalize better on larger unseen data.
• A similar effect has been observed in improving the stability of the learning algorithm.
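As a minimal sketch of the first two factors, bounding each example's contribution (an upper bound on sensitivity) and choosing where in training the noise is added, consider the following illustrative gradient step. The clipping norm and noise scale are assumptions made for illustration; this is not PySyft's implementation and not a recommended configuration.

    import numpy as np

    def dp_gradient_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1):
        # bound each example's contribution by clipping its gradient norm ...
        clipped = []
        for g in per_example_grads:
            norm = np.linalg.norm(g)
            clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
        # ... then add calibrated Gaussian noise at the aggregation stage
        noisy_sum = np.sum(clipped, axis=0) + np.random.normal(
            scale=noise_multiplier * clip_norm, size=clipped[0].shape)
        return noisy_sum / len(per_example_grads)

The clipping supplies the sensitivity bound that the noise calibration relies on, which is exactly the kind of pre-processing the list above refers to.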
5.1.3.2 Differential Privacy in PySyft
While the use of FL enables distributed computation by giving access to remote data, it does not by itself guarantee the privacy of users, as inferences about the users can be made from the trained model, thus risking the exposure of sensitive information. PySyft creates and uses automatic Differential Privacy to provide users with strong privacy guarantees, independent of the machine learning architecture and the data itself. PySyft achieves this by adding automatic query sensitivity tracking and privacy budgeting to the private tensor. The tensor is made up of a huge matrix of private scalar values, which are all bounded. This way, we can dynamically track the sensitivity of the query function as well as the amount of privacy budget we have spent. This makes it easy to employ the various techniques for making machine learning automatically differentially private. Thus, PySyft is general enough to support any kind of deep learning architecture, as the key components of Differential Privacy are part of PySyft's building blocks.
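The idea of tracking sensitivity and budget can be pictured with a toy accountant for bounded values. This is only an illustrative sketch; the class and method names are invented for this example and are not PySyft's actual API.

    import numpy as np

    class PrivacyAccountant:
        # illustrative only: tracks a simple epsilon budget for bounded-value queries
        def __init__(self, budget):
            self.budget = budget   # total epsilon available
            self.spent = 0.0

        def noisy_sum(self, values, lower, upper, epsilon):
            # every value is assumed to lie in [lower, upper]; replacing one
            # record therefore changes the sum by at most (upper - lower)
            if self.spent + epsilon > self.budget:
                raise RuntimeError("privacy budget exhausted")
            self.spent += epsilon
            sensitivity = upper - lower
            return sum(values) + np.random.laplace(scale=sensitivity / epsilon)

Because every scalar is bounded, the sensitivity of each query can be derived automatically and subtracted from the remaining budget, which is the behavior the paragraph above describes at the tensor level.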
Fig. 5.1 Illustration of splitting a single integer into multiple shares, which can then be held by other parties
5.1.4 Secure Multi-party Computation FL and differential privacy are not sufficient to prevent attacks against machine learning models or datasets by the participants in the training. For example, it is very easy for a data owner to steal a model in classic FL, as they are sent the model in plain text. Methods [34] need to be developed to safeguard both the data and the algorithms, while still permitting training and inference. Secure multi-party computation (SMPC) provides a framework allowing multiple parties to jointly perform computations over a set of inputs and to receive the resulting outputs without exposing any party's sensitive input. It thus allows models to be trained or applied to data without disclosing the training data items or the model's weights. SMPC relies on splitting data into 'shares', as seen in Fig. 5.1, which, when summed, yield the original value. Evaluating or training a model can be decomposed into basic computations for which SMPC protocols exist, which allows for end-to-end secure procedures. In the case of ML, during training the model's gradients can be shared, while in inference the entire model function can be shared. SMPC incurs a significant communication overhead but has the advantage that, unless a majority of the parties are malicious and coordinate to reveal an input, the data will remain private even if sought after with unlimited time and resources. SMPC can protect both a model's parameters and its training/inference data. One of the conceptually simplest implementations of secret sharing in SMPC is 'additive secret sharing'. In this paradigm a number, for example x = 5, can be split into several shares (in this example two shares), share1 = 2 and share2 = 3, managed independently by two participants. Applying any number of additions to these shares individually and then summing the results yields the same answer as applying the same additions directly to x = 5. In practice, the shares are often taken randomly in a large finite field, which implies that having access to one share doesn't reveal anything about the secret value.
    from random import randint  # to generate random integers

    Q = 121639451781281043402593  # a large prime number

    def encrypt(x, n_shares=2):
        shares = list()
        for i in range(n_shares - 1):
            # randint returns a random integer between 0 and Q
            shares.append(randint(0, Q))
        final_share = Q - (sum(shares) % Q) + x
        shares.append(final_share)
        return tuple(shares)

    def decrypt(shares):
        return sum(shares) % Q

    def add(a, b):
        # both shared values must consist of the same number of shares,
        # otherwise an AssertionError is raised
        assert len(a) == len(b)
        c = list()
        for i in range(len(a)):
            c.append((a[i] + b[i]) % Q)
        return tuple(c)
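A quick round trip with the functions above (the concrete values are arbitrary):

    x_shares = encrypt(5)
    y_shares = encrypt(3)
    print(x_shares)                          # two large, random-looking shares
    print(decrypt(x_shares))                 # 5
    print(decrypt(add(x_shares, y_shares)))  # 8

Neither share on its own reveals anything about the value 5; only summing all shares modulo Q recovers it.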
For multiplication between encrypted numbers, PySyft implements the SPDZ protocol (pronounced "Speedz"), an extension of additive secret sharing: encrypt, decrypt, and add are the same, but it enables more complex operations than addition. Multiplication in SPDZ uses externally generated triples of numbers to maintain privacy during the computation. In the example below we use a crypto provider that is not otherwise involved in the computation to generate these triples, but they could also be generated with HE.
    import random

    def sub(a, b):
        # share-wise subtraction (mod Q) of two shared values
        return tuple((a[i] - b[i]) % Q for i in range(len(a)))

    def scale(c, a):
        # multiply every share of a by the public constant c
        return tuple((c * s) % Q for s in a)

    def shift(c, a):
        # add a public constant c to a shared value (only one share is shifted)
        return ((a[0] + c) % Q,) + a[1:]

    def generate_mul_triple():
        a = random.randrange(Q)
        b = random.randrange(Q)
        a_mul_b = (a * b) % Q
        return encrypt(a), encrypt(b), encrypt(a_mul_b)

    # we also assume that the crypto provider distributes the shares

    def mul(x, y):
        a, b, a_mul_b = generate_mul_triple()
        alpha = decrypt(sub(x, a))  # x remains hidden because a is random
        beta = decrypt(sub(y, b))   # y remains hidden because b is random
        # local re-combination: x*y == alpha*beta + alpha*b + a*beta + a*b
        return shift((alpha * beta) % Q,
                     add(scale(alpha, b), add(scale(beta, a), a_mul_b)))

Taking a closer look at the operations, we can see that since alpha*beta == xy - xb - ay + ab, b*alpha == bx - ab, and a*beta == ay - ab, adding them all together and then adding a*b effectively returns a privately shared version of xy.
This scheme appears quite simple, but it already permits all operations that are combinations of additions and multiplications between two secretly shared numbers. Similar to the more complex homomorphic encryption schemes that work with a single party, SPDZ allows computation on ciphertexts that generates an encrypted result which, when decrypted, matches the result of the operations as if they had been performed on the plaintext. In this case, splitting the data into shares is the encryption, adding the shares back together is the decryption, while the shares are the ciphertext on which to operate. This technique is adequate for integers, covering the encryption of things like the values of the pixels in images or the counts of entries in a database. The parameters of many machine learning models like neural networks, however, are floats, so how can we use additive secret sharing in machine learning? We need to introduce a new ingredient, Fixed Precision Encoding, an intuitive technique that enables computation to be performed on floats encoded as integer values. In base 10, the encoding is as simple as removing the decimal point while keeping as many decimal places as indicated by the precision. In PySyft, you can take any floating-point number and turn it into fixed precision by calling my_tensor.fix_precision().
    BASE = 10
    PRECISION = 4

    def encode(x):
        return int((x * (BASE ** PRECISION)) % Q)

    def decode(x):
        # values above Q/2 are interpreted as negative numbers
        return (x if x <= Q // 2 else x - Q) / (BASE ** PRECISION)
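Assuming decode maps values above Q/2 back to negative numbers as sketched above, a simple round trip (with arbitrary values) looks like this:

    v = encode(3.5)
    print(v)          # 35000, i.e. 3.5 with the decimal point removed at precision 4
    print(decode(v))  # 3.5

Once floats are encoded this way, the additive secret sharing and SPDZ operations above can be applied to them as ordinary integers.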