English Pages 281 [282] Year 2023
Wireless Networks
Qingguo Lü · Xiaofeng Liao · Huaqing Li · Shaojiang Deng · Shanfu Gao
Distributed Optimization in Networked Systems: Algorithms and Applications
Wireless Networks Series Editor Xuemin Sherman Shen, University of Waterloo, Waterloo, ON, Canada
The purpose of Springer’s Wireless Networks book series is to establish the state of the art and set the course for future research and development in wireless communication networks. The scope of this series includes not only all aspects of wireless networks (including cellular networks, WiFi, sensor networks, and vehicular networks), but related areas such as cloud computing and big data. The series serves as a central source of references for wireless networks research and development. It aims to publish thorough and cohesive overviews on specific topics in wireless networks, as well as works that are larger in scope than survey articles and that contain more detailed background information. The series also provides coverage of advanced and timely topics worthy of monographs, contributed volumes, textbooks and handbooks. ** Indexing: Wireless Networks is indexed in EBSCO databases and DPLB **
Qingguo Lü College of Computer Science Chongqing University Chongqing, China
Xiaofeng Liao College of Computer Science Chongqing University Chongqing, China
Huaqing Li College of Electronic and Information Engineering Southwest University Chongqing, China
Shaojiang Deng College of Computer Science Chongqing University Chongqing, China
Shanfu Gao College of Computer Science Chongqing University Chongqing, China
ISSN 2366-1186 ISSN 2366-1445 (electronic) Wireless Networks ISBN 978-981-19-8558-4 ISBN 978-981-19-8559-1 (eBook) https://doi.org/10.1007/978-981-19-8559-1 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
To my family
Q. Lü
To my family
X. Liao
To my family
H. Li
To my family
S. Deng
To my family
S. Gao
Preface
In recent years, the Internet of Things (IoT) and big data have become widely and deeply interconnected through the sensing, computing, communication, and control of intelligent information. Networked systems play an increasingly important role in this interconnected information environment and profoundly affect computer science, artificial intelligence, and other related fields. The core idea of such systems, which are composed of many nodes, is that the nodes collaborate to accomplish a global goal efficiently while each makes its own decisions based on its own preferences. In this way they solve large-scale complex problems that individual nodes could not handle alone, with strong resistance to interference and good environmental adaptability. In addition, such systems require participating nodes to access only their own local information. This may be due to security and privacy concerns in the network, or simply because the network is so large that aggregating global information at a central node is practically impossible or very inefficient. Distributed optimization of networked systems is currently an active research topic with wide applicability and considerable application value across multiple disciplines, and it has laid an important foundation for frontier developments in computer science and artificial intelligence. However, networked systems cover a large number of intelligent devices (nodes), and the network environment is often dynamic, which makes these systems extremely hard to optimize and analyze. Existing theories and methods struggle to address the new optimization needs and challenges brought about by the rapid development of technologies related to networked systems. Hence, new theories and methods of distributed optimization over networks are urgently needed.
Analysis and synthesis topics, including distributed unconstrained optimization, distributed constrained optimization, distributed nonsmooth optimization, distributed online optimization, distributed economic dispatch in smart grids, undirected networks, directed networks, time-varying networks, consensus control protocols, the gradient tracking technique, event-triggered communication strategies, Nesterov and heavy-ball accelerated mechanisms, the variance-reduction technique, differential privacy strategies, gradient descent algorithms, accelerated algorithms, stochastic gradient algorithms, and online algorithms, are all thoroughly studied. This monograph mainly investigates distributed optimization algorithms and applications in networked control systems. In general, the following problems are investigated in this monograph: (1) accelerated algorithms for distributed convex optimization; (2) projection algorithms for distributed stochastic optimization; (3) proximal algorithms for distributed coupled optimization; (4) event-triggered algorithms for distributed convex optimization; (5) event-triggered acceleration algorithms for distributed stochastic optimization; (6) accelerated algorithms for distributed economic dispatch; (7) primal-dual algorithms for distributed economic dispatch; (8) event-triggered algorithms for distributed economic dispatch; and (9) privacy-preserving algorithms for distributed online learning. Across these topics, simulation results, including some typical real applications, are presented to illustrate the effectiveness and practicability of the proposed distributed optimization algorithms. This book is appropriate as a college course textbook for undergraduate and graduate students majoring in computer science, automation, artificial intelligence, and electrical engineering, and as reference material for researchers and technologists in related fields.
Chongqing, China
Qingguo Lü Xiaofeng Liao Huaqing Li Shaojiang Deng Shanfu Gao
Acknowledgments
This book was supported in part by the Natural Science Foundation of Chongqing under Grant CSTB2022NSCQ-MSX1627, in part by the Chongqing Postdoctoral Science Foundation under Grant 2021XM1006, in part by the China Postdoctoral Science Foundation under Grant 2021M700588, in part by the National Natural Science Foundation of China under Grant 62173278, in part by the Science and Technology Research Program of Chongqing Municipal Education Commission under Grant KJQN202100228, in part by the project of Key Laboratory of Industrial Internet of Things & Networked Control, Ministry of Education under Grant 2021FF09, in part by the project funded by Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System (Wuhan University of Science and Technology) under Grant ZNXX2022004, in part by the project funded by Hubei Key Laboratory of Intelligent Robot (Wuhan Institute of Technology) under Grant HBIR202205, and in part by the National Key R&D Program of China under Grant 2018AAA0100101. We would like to begin by acknowledging Yingjue Chen and Keke Zhang, who have unselfishly given their valuable time in arranging raw materials. Their assistance has been invaluable to the completion of this book. The authors are especially grateful to their families for their encouragement and never-ending support when it was most required. Finally, we would like to thank the editors at Springer for their professional and efficient handling of this book.
Contents
1 Accelerated Algorithms for Distributed Convex Optimization
  1.1 Introduction
  1.2 Preliminaries
    1.2.1 Notation
    1.2.2 Model of Optimization Problem
    1.2.3 Communication Network
  1.3 Algorithm Development
    1.3.1 Centralized Nesterov Gradient Descent Method (CNGD)
    1.3.2 Directed Distributed Nesterov-Like Gradient Tracking (D-DNGT) Algorithm
    1.3.3 Related Methods
  1.4 Convergence Analysis
    1.4.1 Auxiliary Results
    1.4.2 Supporting Lemmas
    1.4.3 Main Results
    1.4.4 Discussion
  1.5 Numerical Examples
  1.6 Conclusion
  References
2 Projection Algorithms for Distributed Stochastic Optimization
  2.1 Introduction
  2.2 Preliminaries
    2.2.1 Notation
    2.2.2 Model of Optimization Problem
    2.2.3 Communication Network
  2.3 Algorithm Development
    2.3.1 Problem Reformulation
    2.3.2 Computation-Efficient Distributed Stochastic Gradient Algorithm
  2.4 Convergence Analysis
    2.4.1 Auxiliary Results
    2.4.2 Supporting Lemmas
    2.4.3 Main Results
    2.4.4 Discussion
  2.5 Numerical Examples
    2.5.1 Example 1: Performance Examination
    2.5.2 Example 2: Application Behavior
  2.6 Conclusion
  References
3 Proximal Algorithms for Distributed Coupled Optimization
  3.1 Introduction
  3.2 Preliminaries
    3.2.1 Notation
    3.2.2 Model of Optimization Problem
    3.2.3 Motivating Examples
    3.2.4 The Saddle-Point Reformulation
  3.3 Algorithm Development
    3.3.1 Unbiased Stochastic Average Gradient (SAGA)
    3.3.2 Distributed Stochastic Algorithm (VR-DPPD)
  3.4 Convergence Analysis
    3.4.1 Auxiliary Results
    3.4.2 Main Results
  3.5 Numerical Examples
    3.5.1 Example 1: Simulation on General Real Data
    3.5.2 Example 2: Simulation on Large-Scale Real Data
  3.6 Conclusion
  References
4 Event-Triggered Algorithms for Distributed Convex Optimization
  4.1 Introduction
  4.2 Preliminaries
    4.2.1 Notation
    4.2.2 Model of Optimization Problem
    4.2.3 Communication Networks
  4.3 Algorithm Development
    4.3.1 Distributed Subgradient Algorithm
  4.4 Convergence Analysis
    4.4.1 Auxiliary Results
    4.4.2 Main Results
  4.5 Numerical Examples
  4.6 Conclusion
  References
5 Event-Triggered Acceleration Algorithms for Distributed Stochastic Optimization
  5.1 Introduction
  5.2 Preliminaries
    5.2.1 Notation
    5.2.2 Model of Optimization Problem
    5.2.3 Communication Network
  5.3 Algorithm Development
    5.3.1 Event-Triggered Communication Strategy
    5.3.2 Event-Triggered Distributed Accelerated Stochastic Gradient Algorithm (ET-DASG)
  5.4 Convergence Analysis
    5.4.1 Auxiliary Results
    5.4.2 Supporting Lemmas
    5.4.3 Main Results
  5.5 Numerical Examples
    5.5.1 Example 1: Logistic Regression
    5.5.2 Example 2: Energy-based Source Localization
  5.6 Conclusion
  References
6 Accelerated Algorithms for Distributed Economic Dispatch
  6.1 Introduction
  6.2 Preliminaries
    6.2.1 Notation
    6.2.2 Model of Optimization Problem
    6.2.3 Communication Network
    6.2.4 Centralized Lagrangian Method
  6.3 Algorithm Development
    6.3.1 Directed Distributed Lagrangian Momentum Algorithm
    6.3.2 Related Methods
  6.4 Convergence Analysis
    6.4.1 Auxiliary Results
    6.4.2 Main Results
  6.5 Numerical Examples
    6.5.1 Case Study 1: EDP on IEEE 14-Bus Test Systems
    6.5.2 Case Study 2: EDP on IEEE 118-Bus Test Systems
    6.5.3 Case Study 3: The Application to Dynamical EDPs
    6.5.4 Case Study 4: Comparison with Related Methods
  6.6 Conclusion
  References
7 Primal–Dual Algorithms for Distributed Economic Dispatch
  7.1 Introduction
  7.2 Preliminaries
    7.2.1 Notation
    7.2.2 Model of Optimization Problem
    7.2.3 Communication Network
  7.3 Algorithm Development
    7.3.1 Dual Problem
    7.3.2 Distributed Primal–Dual Gradient Algorithm
  7.4 Convergence Analysis
    7.4.1 Small Gain Theorem
    7.4.2 Supporting Lemmas
    7.4.3 Main Results
  7.5 Numerical Examples
    7.5.1 Example 1: EDP on the IEEE 14-Bus Test Systems
    7.5.2 Example 2: Demand Response for Time-Varying Supplies
  7.6 Conclusion
  References
8 Event-Triggered Algorithms for Distributed Economic Dispatch
  8.1 Introduction
  8.2 Preliminaries
    8.2.1 Notation
    8.2.2 Nomenclature
    8.2.3 Model of Optimization Problem
    8.2.4 Communication Network
  8.3 Algorithm Development
    8.3.1 Problem Reformulation
    8.3.2 Event-Triggered Control Scheme
  8.4 Convergence Analysis
    8.4.1 Supporting Lemmas
    8.4.2 Main Results
    8.4.3 The Exclusion of Zeno-Like Behavior
  8.5 Numerical Examples
    8.5.1 Example 1: EDP on the IEEE 14-Bus System
    8.5.2 Example 2: EDP on Large-Scale Networks
    8.5.3 Example 3: Comparison with Related Methods
  8.6 Conclusion
  References
9 Privacy Preserving Algorithms for Distributed Online Learning
  9.1 Introduction
  9.2 Preliminaries
    9.2.1 Notation
    9.2.2 Model of Optimization Problem
    9.2.3 Communication Network
  9.3 Algorithm Development
    9.3.1 Differential Privacy Strategy
    9.3.2 Differentially Private Distributed Online Algorithm
  9.4 Main Results
    9.4.1 Differential Privacy Analysis
    9.4.2 Logarithmic Regret
    9.4.3 Square-Root Regret
    9.4.4 Robustness to Communication Delays
  9.5 Numerical Examples
  9.6 Conclusion
  References
List of Figures
Fig. 1.1 A directed and strongly connected network
Fig. 1.2 Performance comparisons between D-DNGT and the methods without momentum terms
Fig. 1.3 Performance comparisons between D-DNGT and the methods with momentum terms
Fig. 1.4 Performance comparisons between the extensions of D-DNGT and their closely related methods
Fig. 2.1 Convergence of x̂ for solving the optimization problem in Example 1
Fig. 2.2 Comparison (a): x-axis is the iteration
Fig. 2.3 Comparison (b): x-axis is the number of gradient evaluations
Fig. 2.4 Comparison (a): x-axis is the iteration
Fig. 2.5 Comparison (b): x-axis is the number of gradient evaluations
Fig. 2.6 Comparison (a): x-axis is the iteration
Fig. 2.7 Comparison (b): x-axis is the number of gradient evaluations
Fig. 3.1 (a) Random network with a connection probability p = 0.8. (b) Complete network. (c) Cycle network. (d) Star network
Fig. 3.2 The transient behaviors of the second dimension of each primal variable x_i
Fig. 3.3 Comparisons between VR-DPPD and other algorithms. (a) The x-axis is the iterations. (b) The x-axis is the number of gradient evaluations
Fig. 3.4 Evolution of residuals under different networks
Fig. 3.5 Comparisons between VR-DPPD and other algorithms. (a) The x-axis is the iterations. (b) The x-axis is the number of gradient evaluations
Fig. 4.1 All nodes’ states x_i(t)
Fig. 4.2 Evolutions of all nodes' control inputs $u_i(t)$
Fig. 4.3 All nodes' sampling time instant sequences $\{t_k^i\}$
Fig. 4.4 Evolutions of measurement error and threshold for node 3
Fig. 5.1 Four undirected and connected network topologies composed of 10 nodes. (a) Random network with a connection probability $p_c = 0.4$. (b) Complete network. (c) Cycle network. (d) Star network
Fig. 5.2 Convergence: (a) The transient behaviors of three dimensions (randomly selected) of state estimator x. (b) The testing accuracy of ET-DASG
Fig. 5.3 The triggering times for the neighbors when 5 nodes run ET-DASG under different event-triggered parameters
Fig. 5.4 Evolution of residuals under different constant step-sizes or momentum coefficients
Fig. 5.5 Evolution of residuals under different networks
Fig. 5.6 Comparisons between ET-DASG and other methods
Fig. 5.7 The randomly selected 7 paths displayed on top of contours of log-likelihood function
Fig. 6.1 The IEEE 14-bus test system
Fig. 6.2 EDP on IEEE 14-bus test system
Fig. 6.3 The IEEE 118-bus test system
Fig. 6.4 EDP on IEEE 118-bus test system
Fig. 6.5 Dynamical EDPs on IEEE 14-bus test system
Fig. 6.6 Performance comparison
Fig. 7.1 The IEEE 14-bus test system [43]
Fig. 7.2 Power allocation at generators
Fig. 7.3 Consensus of Lagrange multipliers
Fig. 7.4 Optimal energy schedule of each household
Fig. 7.5 Predicted price on the time-varying demands
Fig. 8.1 EDP on the IEEE 14-bus system
Fig. 8.2 EDP on the IEEE 118-bus test system
Fig. 8.3 Comparison with related methods in which the residual E(t) as the comparison metric
Fig. 8.4 Comparison with related methods in which the obtained cost of EDP as the comparison metric
Fig. 9.1 (a) Estimations of five nodes without communication delay. (b) The maximum and minimum pseudo-individual average regrets ($R_j(T)/T$) without communication delays
Fig. 9.2 (a) Estimations of five nodes with communication delays. (b) The maximum and minimum pseudo-individual average regrets ($R_j(T)/T$) with communication delays
Fig. 9.3 Estimations of five nodes for DTS with communication delays. (a) Node's estimate (z) (DP-DSSP). (b) Node's estimate (z) (the method in [37])
Fig. 9.4 (a) The pseudo-individual average subregrets ($R_{j,av}^{DTS}(k)$) between k and k + 1 for DTS with communication delays. (b) The pseudo-individual average regrets ($R_j(T)/T$) for DTS with communication delays
Fig. 9.5 The outputs xi (t) and xi (t) related to the adjacent relations fit and fit
Chapter 1
Accelerated Algorithms for Distributed Convex Optimization
Abstract In this chapter, we introduce and solve distributed optimization problems over directed networks, where each node holds its own convex cost function and interacts according to the network connectivity structure. The principal target is to minimize the global cost function, formulated as the average of all local cost functions. Most existing methods, such as those based on the push-sum strategy, eliminate the unbalancedness caused by directed networks with the help of column-stochastic weights, but these methods may be infeasible in a distributed implementation because they require each node to know (at least) its out-degree. In contrast, using row-stochastic weights over a directed network, we propose a new directed distributed Nesterov-like gradient tracking algorithm, named D-DNGT, that incorporates gradient tracking into the distributed Nesterov method with momentum terms and employs non-uniform step-sizes. This approach overcomes the above-mentioned implementation limitations of column-stochastic weights over directed networks. The implementation of D-DNGT is straightforward: each node locally chooses a suitable step-size and privately regulates the weights on the information that it acquires from its in-neighbors. If the largest step-size and the maximum momentum coefficient are positive and sufficiently small, we prove that D-DNGT converges linearly to the optimal solution provided that the cost functions are smooth and strongly convex. We provide numerical experiments to confirm the findings in this chapter and to contrast D-DNGT with recently proposed distributed optimization approaches.
Keywords Distributed convex optimization · Nesterov-like algorithm · Gradient tracking · Directed network · Linear convergence
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Q. Lü et al., Distributed Optimization in Networked Systems, Wireless Networks, https://doi.org/10.1007/978-981-19-8559-1_1
1.1 Introduction
In the past decades, with the development of artificial intelligence and the emergence of 5G, distributed optimization has attracted growing research interest. This chapter considers a widely studied class of distributed optimization problems in which the nodes cooperatively attempt to optimize a global cost function through local interactions and local computations [1]. Such formulations, characterized by distributed computing, have important and widespread applications in various fields, including wireless sensor networks for decision-making and information processing [2], distributed resource allocation in smart grids [3], distributed learning in robust control [4], and time-varying formation control [5, 6], among many others [7–13]. Unlike traditional centralized optimization, distributed optimization involves multiple nodes that access only their private local information over the network, and typically no central coordinator (node) can acquire the entire information of the network. Recently, an increasing number of distributed algorithms have emerged, based on various local computation schemes for the individual nodes. Some well-known approaches for different networks rely on distributed (sub)gradient descent, with extensions that handle interaction delays, asynchronous updates, stochastic (sub)gradient scenarios, etc. [14–22]. These algorithms are intuitive and flexible with respect to the cost functions and networks; however, their convergence is considerably slow owing to the diminishing step-size that is required to guarantee convergence to an exact optimal solution [14]. Even for strongly convex functions, the convergence rate of these algorithms is only sublinear [15]. With a constant step-size, the convergence rate becomes linear, but only at the cost of converging to a sub-optimal solution [20]. Methods that resolve this exactness-speed dilemma, such as the distributed alternating direction method of multipliers (ADMM) [23, 24] and distributed dual decomposition [25], are based on the Lagrangian dual and have provable convergence rates (a linear rate for strongly convex functions) [26].
In addition, extensions that account for various real-world factors, including stochastic errors [27] and privacy preservation [28], as well as techniques including proximal (sub)gradient [29] and formation-containment control [30], have been extensively studied. However, because sub-problems must be solved in each iteration, the computational complexity of these methods is considerably high. To overcome these difficulties, quite a few approaches have been proposed that achieve linear convergence for smooth and strongly convex cost functions [31–38]. Nonetheless, these approaches [31–38] are only suitable for undirected networks. Distributed optimization over directed networks was first studied in [39], where the (sub)gradient-push (SP) method was employed to eliminate the requirement of network balancing, i.e., with column-stochastic weights. Since SP is built on (sub)gradient descent with a diminishing step-size, it also suffers from a slow sublinear convergence rate. To accelerate convergence, Xi and Khan [40] proposed a linearly convergent distributed method (DEXTRA) with a constant step-size by combining the push-sum strategy with the protocol (EXTRA) in [31]. Further, Xi et al. [41] (fixed directed network) and Nedic et al. [42] (time-varying directed networks) combined the push-sum strategy with distributed inexact gradient tracking with a constant step-size (ADD-OPT [41] and Push-DIGing [42]) to acquire linear convergence to the exact optimal solution. Then, Lü et al. [43, 44] extended the work of [42] to non-uniform step-sizes and showed linear convergence. A different class of approaches that do not utilize the push-sum mechanism has recently been proposed in [45–50], where both row- and column-stochastic weights are adopted simultaneously to acquire linear convergence over directed networks. It is noteworthy that although these approaches [39–50] avoid the construction of doubly-stochastic weights, they still require each node to know (at least) its own out-degree exactly, so that all the nodes in the networks [39–50] can adjust their outgoing weights to ensure that each column of the weight matrix sums to one. This requirement, however, is likely to be unrealistic in broadcast-based interaction schemes (i.e., the node neither accesses its out-neighbors nor regulates its outgoing weights).
In this chapter, the algorithm that we construct depends crucially on gradient tracking and is a variation of the methods that appeared in [47–55]. To be specific, Qu and Li [54] combined gradient tracking with the distributed Nesterov gradient descent (DNGD) method [55] and thereby investigated two accelerated distributed Nesterov methods, i.e., Acc-DNGD-SC and Acc-DNGD-NSC, which exhibited a faster convergence rate than the centralized gradient descent (CGD) method for different cost functions. Note that although the convergence rates are improved, the two approaches in [54] assume that the interaction networks are undirected, which limits the applicability of the methods in many fields, such as wireless sensor networks. To remove this deficiency, Xin et al. [48] established an acceleration and generalization of first-order methods with gradient tracking and a momentum term, i.e., ABm, which overcame the conservatism (eigenvector estimation or doubly-stochastic weights) in the related work by implementing both row- and column-stochastic weights. In this setting, some interesting generalized methods [46] (random link failures) and [49] (interaction delays) were proposed.
Regrettably, the construction of column-stochastic weights demands that each node know at least its out-degree, which is difficult to implement, for example, in broadcast-based interaction scenarios. In light of this challenge, Xin et al. [52] investigated the case of a row-stochastic weight matrix, which restricts the global information required on the network, and proposed a fast distributed method (FROST) with non-uniform step-sizes motivated by the idea of [51]. Related works also address demand response and economic scheduling in power systems [53, 56]. However, these methods [51–53, 56] do not adopt momentum terms [54, 55, 57], with which nodes acquire more information from in-neighbors in the network for fast convergence. Moreover, two accelerated methods based on Nesterov's momentum for distributed optimization over arbitrary networks were presented in [50]. Unfortunately, the related work [50] does not consider non-uniform step-sizes and lacks a rigorous theoretical analysis of the methods. Hence, it is of great significance to discuss such a challenging issue due to its practicality.
The main interest of this chapter is to study the distributed convex optimization problem over a directed network. To solve this problem, a linearly convergent algorithm is designed, in which non-uniform step-sizes, momentum terms, and a row-stochastic matrix are utilized. We hope to develop a broad theory of distributed convex optimization, and the underlying purpose of designing a distributed optimization algorithm is to adapt to and promote real scenarios. To conclude, this chapter makes the following four contributions:
(i) We design and discuss a novel directed distributed Nesterov-like gradient tracking algorithm, named D-DNGT, with a row-stochastic matrix to solve distributed convex optimization problems over a directed network. Specifically, D-DNGT incorporates gradient tracking into the distributed Nesterov method, which adds two types of momentum terms to ensure that nodes acquire more information from in-neighbors in the network than in the existing methods [51–53] and thereby achieve fast convergence. More importantly, a consensus iteration step [50–52] is exploited in the design of D-DNGT to counteract the unbalancedness induced by the directed network.
(ii) In comparison with Acc-DNGD-SC and Acc-DNGD-NSC proposed in [54], D-DNGT extends the centralized Nesterov gradient descent method (CNGD) [57] to a distributed form and is suitable for directed networks. In contrast to [54] (doubly-stochastic matrix) and [39–50] (column-stochastic matrix), D-DNGT with a row-stochastic matrix is relatively easy to implement in a distributed way, since each node can privately regulate the weights on the information that it acquires from its in-neighbors. This is unavoidable in some real applications such as ad hoc and peer-to-peer networks.
(iii) D-DNGT adopts non-uniform step-sizes, which allows a more relaxed selection of step-sizes than most existing methods proposed in [41, 42, 49], etc. If the cost functions are smooth and strongly convex, D-DNGT attains linear convergence to the exact optimal solution when the non-uniform step-sizes and the momentum coefficients are constrained by specific upper bounds. In addition, some extensions of D-DNGT are discussed for two other types of weight matrices (only column-stochastic [41, 42] or both row- and column-stochastic [45, 46]) and for network interaction delays (arbitrary but uniformly bounded) [49, 58].
(iv) The provided bounds on the largest step-size depend only on the network topology and the cost functions, and each node can choose its step-size from a relatively wide range. This is in contrast to the earlier work on non-uniform step-sizes within the framework of gradient tracking [33, 43, 44, 53], which relies on the heterogeneity of the step-sizes. Moreover, the bounds on the non-uniform step-sizes in this chapter allow some (but not all) of the nodes to use zero step-sizes.
1.2 Preliminaries
1.2.1 Notation
Unless otherwise stated, the vectors mentioned in this chapter are column vectors. Let $\mathbb{R}$ and $\mathbb{R}^p$ denote the set of real numbers and the set of p-dimensional real column vectors, respectively. The subscripts i and j denote the indices of the nodes and the superscript t denotes the iteration index of an algorithm; e.g., $x_i^t$ denotes the variable of node i at time t. The notations $1_n$ and $0_n$ denote the column vectors with all entries equal to one and zero, respectively. Let $I_n$ and $z_{ij}$ denote the identity matrix of size n and the entry of matrix Z in its i-th row and j-th column, respectively. The Euclidean norm for vectors and the induced 2-norm for matrices are represented by the symbol $\|\cdot\|_2$. The notation $Z = \mathrm{diag}\{y\}$ represents the diagonal matrix of the vector $y = [y_1, y_2, \ldots, y_n]^T$, i.e., $z_{ii} = y_i$, $\forall i = 1, \ldots, n$, and $z_{ij} = 0$, $\forall i \neq j$. We define $\mathrm{diag}\{Z\}$ as the diagonal matrix whose diagonal elements are the same as those of the matrix Z. The transposes of a vector z and a matrix W are denoted by $z^T$ and $W^T$, respectively. Let $e_i = [0, \ldots, 1_i, \ldots, 0]^T$, i.e., the vector whose i-th entry is one and whose other entries are zero. The gradient of a differentiable function f(z) at z is denoted by $\nabla f(z): \mathbb{R}^p \to \mathbb{R}^p$. A non-negative square matrix $Z \in \mathbb{R}^{n \times n}$ is row-stochastic if $Z 1_n = 1_n$, column-stochastic if $Z^T 1_n = 1_n$, and doubly stochastic if $Z 1_n = 1_n$ and $Z^T 1_n = 1_n$.
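The three stochasticity notions defined above can be checked mechanically. The following is a minimal sketch in plain Python; the helper names and the example matrix `R` are illustrative assumptions, not part of the chapter.

```python
def is_row_stochastic(Z, tol=1e-12):
    """True if Z is non-negative and every row sums to one (Z 1_n = 1_n)."""
    return all(
        all(v >= 0 for v in row) and abs(sum(row) - 1.0) <= tol
        for row in Z
    )


def transpose(Z):
    """Return Z^T as a list of lists."""
    return [list(col) for col in zip(*Z)]


def is_column_stochastic(Z, tol=1e-12):
    """True if Z^T 1_n = 1_n, i.e., Z^T is row-stochastic."""
    return is_row_stochastic(transpose(Z), tol)


def is_doubly_stochastic(Z, tol=1e-12):
    """True if Z is both row- and column-stochastic."""
    return is_row_stochastic(Z, tol) and is_column_stochastic(Z, tol)


# Hypothetical 3-node example: rows sum to one, columns do not.
R = [[1 / 2, 0.0, 1 / 2],
     [1 / 2, 1 / 2, 0.0],
     [1 / 3, 1 / 3, 1 / 3]]

print(is_row_stochastic(R), is_column_stochastic(R))
```

A matrix such as `R` is the kind of weight matrix a node can build from in-neighbor information alone, which is why the distinction matters later in the chapter.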
1.2.2 Model of Optimization Problem
Consider a set of n nodes connected over a directed network. The global objective is to find $x \in \mathbb{R}^p$ that minimizes the average of all local cost functions, i.e.,
$$\min_{x \in \mathbb{R}^p} f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x), \tag{1.1}$$
where each $f_i: \mathbb{R}^p \to \mathbb{R}$ is a convex function that we view as the local cost of node i. Let $f^* = f(x^*)$ and $x^*$ denote the optimal value and an optimal solution of (1.1), respectively. The set of optimal solutions of (1.1) is denoted by $X^*$, where $X^* = \{x \in \mathbb{R}^p \mid (1/n) \sum_{i=1}^{n} f_i(x) = f^*\}$.
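To make problem (1.1) concrete, the sketch below builds a small scalar instance (p = 1) with quadratic local costs $f_i(x) = \frac{1}{2} a_i (x - b_i)^2$. The data `a` and `b` are hypothetical and chosen only for illustration; for this family the minimizer of the average has the closed form $x^* = \sum_i a_i b_i / \sum_i a_i$.

```python
# Hypothetical local data: f_i(x) = 0.5 * a[i] * (x - b[i])**2 with p = 1.
a = [1.0, 2.0, 0.5]   # local curvatures
b = [1.0, -1.0, 4.0]  # local minimizers
n = len(a)


def local_cost(i, x):
    """Local cost f_i(x) held privately by node i."""
    return 0.5 * a[i] * (x - b[i]) ** 2


def local_grad(i, x):
    """Gradient of f_i at x."""
    return a[i] * (x - b[i])


def global_cost(x):
    """Global cost f(x) = (1/n) * sum_i f_i(x), as in (1.1)."""
    return sum(local_cost(i, x) for i in range(n)) / n


def global_grad(x):
    """Gradient of the global cost."""
    return sum(local_grad(i, x) for i in range(n)) / n


# Closed-form minimizer of the average of these quadratics.
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)
f_star = global_cost(x_star)
print(x_star, f_star, global_grad(x_star))
```

No single node can compute `x_star` from its own `a[i]` and `b[i]`, which is the reason the nodes must cooperate.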
1.2.3 Communication Network
In this chapter, we consider a group of n nodes communicating over a directed network $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ with node set $\mathcal{V} = \{1, \ldots, n\}$ and edge set $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$. If $(i, j) \in \mathcal{E}$, node i can directly transmit data to node j; in this case i is an in-neighbor of j and, conversely, j is an out-neighbor of i. Let $\mathcal{N}_i^{in} = \{j \in \mathcal{V} \mid (j, i) \in \mathcal{E}\}$ and $\mathcal{N}_i^{out} = \{j \in \mathcal{V} \mid (i, j) \in \mathcal{E}\}$ be the in-neighbor and out-neighbor sets of i, respectively. If $|\mathcal{N}_i^{in}| \neq |\mathcal{N}_i^{out}|$, the network is said to be unbalanced, where $|\cdot|$ denotes the cardinality of a set. For the directed network $\mathcal{G}$, a path of length b from node $i_1$ to node $i_{b+1}$ is a sequence of b + 1 distinct nodes $i_1, \ldots, i_{b+1}$ such that $(i_k, i_{k+1}) \in \mathcal{E}$ for $k = 1, \ldots, b$. If there is a path between any two nodes, then $\mathcal{G}$ is said to be strongly connected. In addition, the following assumptions are adopted.
Assumption 1.1 ([51]) The network $\mathcal{G}$ corresponding to the set of nodes is directed and strongly connected.
Remark 1.1 Assumption 1.1 is fundamental to ensure that nodes in the network can always affect others directly or indirectly when studying distributed optimization problems [39–53].
Assumption 1.2 ([42]) Each local cost function $f_i$, $i \in \mathcal{V}$, is $L_i$-smooth. Mathematically, there exists $L_i > 0$ such that for any $x, y \in \mathbb{R}^p$, one has
$$\|\nabla f_i(x) - \nabla f_i(y)\|_2 \leq L_i \|x - y\|_2. \tag{1.2}$$
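The network notions of Sect. 1.2.3 (in- and out-neighbor sets, and the strong connectivity required by Assumption 1.1) can be illustrated with a short sketch. The edge list below is a hypothetical 3-node example, and strong connectivity is tested by a forward and a backward breadth-first search from one node, which is one standard way to check it.

```python
from collections import deque


def in_neighbors(edges, i):
    """N_i^in: nodes j with an edge (j, i), i.e., j transmits to i."""
    return {j for (j, k) in edges if k == i}


def out_neighbors(edges, i):
    """N_i^out: nodes j with an edge (i, j), i.e., i transmits to j."""
    return {k for (j, k) in edges if j == i}


def reachable_from(source, edges, reverse=False):
    """Nodes reachable from `source` along (optionally reversed) edges."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for (i, j) in edges:
            src, dst = (j, i) if reverse else (i, j)
            if src == u and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen


def is_strongly_connected(nodes, edges):
    """Every node reaches, and is reached from, an arbitrary root node."""
    root = nodes[0]
    everyone = set(nodes)
    return (reachable_from(root, edges) == everyone
            and reachable_from(root, edges, reverse=True) == everyone)


# Hypothetical directed network: 1 -> 2 -> 3 -> 1 plus the extra edge 1 -> 3.
V = [1, 2, 3]
E = {(1, 2), (2, 3), (3, 1), (1, 3)}

print(in_neighbors(E, 3), out_neighbors(E, 3), is_strongly_connected(V, E))
```

In this example node 3 has two in-neighbors but a single out-neighbor, so the network is unbalanced in the sense defined above while still satisfying Assumption 1.1.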
Assumption 1.3 ([42]) Each local cost function $f_i$, $i \in \mathcal{V}$, is $\mu_i$-strongly convex. Mathematically, there exists $\mu_i \geq 0$ such that for any $x, y \in \mathbb{R}^p$, one has
$$f_i(x) \geq f_i(y) + \nabla f_i(y)^T (x - y) + \frac{\mu_i}{2} \|x - y\|_2^2, \tag{1.3}$$
where $\mu_i \in [0, +\infty)$ and $\sum_{i=1}^{n} \mu_i > 0$.
Remark 1.2 It is worth emphasizing that Assumptions 1.2 and 1.3 are two standard assumptions for achieving linear convergence when first-order methods [41–53] are employed. Assumption 1.3 only requires that each $f_i$ be convex and that at least one of them be strongly convex. Moreover, under Assumption 1.3, problem (1.1) possesses a unique optimal solution. In the following analysis, we denote $\bar{L} = (1/n) \sum_{i=1}^{n} L_i$ and $\bar{\mu} = (1/n) \sum_{i=1}^{n} \mu_i$ as the Lipschitz continuity and strong convexity constants, respectively, for the global cost function f. Denote $\hat{L} = \max_i \{L_i\}$.
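As a worked illustration of Assumptions 1.2 and 1.3 (an example added here, not taken from the chapter), consider scalar quadratic local costs with hypothetical coefficients $a_i \geq 0$ and $b_i$:

$$f_i(x) = \frac{a_i}{2}(x - b_i)^2, \qquad \nabla f_i(x) - \nabla f_i(y) = a_i (x - y), \qquad f_i(x) - f_i(y) - \nabla f_i(y)(x - y) = \frac{a_i}{2}(x - y)^2 .$$

Hence (1.2) holds with $L_i = a_i$ whenever $a_i > 0$ (and with any $L_i > 0$ when $a_i = 0$), while (1.3) holds with equality for $\mu_i = a_i$. For instance, with $a = (1, 0, 2)$ the second cost is constant and therefore only convex, yet $\sum_{i} \mu_i = 3 > 0$, so Assumption 1.3 is still satisfied and the average cost has a unique minimizer.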
1.3 Algorithm Development
Building on the preliminaries above, we first review the centralized Nesterov gradient descent method (CNGD) and then propose the directed distributed Nesterov-like gradient tracking algorithm, named D-DNGT, to solve problem (1.1).
1.3.1 Centralized Nesterov Gradient Descent Method (CNGD)
Here, CNGD, derived from [57], is briefly introduced for an $\bar{L}$-smooth and $\bar{\mu}$-strongly convex cost function. At each time $t \geq 0$, CNGD keeps three vectors $y^t, x^t, v^t \in \mathbb{R}^p$ and implements the following three steps:
$$\begin{cases} y^t = \dfrac{\alpha \gamma v^t + \gamma x^t}{\alpha \bar{\mu} + \gamma} \\ x^{t+1} = y^t - \dfrac{1}{\bar{L}} \nabla f(y^t) \\ v^{t+1} = (1 - \alpha) v^t + \dfrac{\alpha \bar{\mu}}{\gamma} y^t - \dfrac{\alpha}{\gamma} \nabla f(y^t), \end{cases} \tag{1.4}$$
with the initial states $y^0 = x^0 = v^0 \in \mathbb{R}^p$, where $\alpha$ and $\gamma$ are constants related to the parameters ($\bar{L}$ and $\bar{\mu}$) of the cost function f. Nesterov [57] specified the requirement that $\gamma = (1 - \alpha)\gamma + \alpha \bar{\mu}$. If $\alpha = \sqrt{\bar{\mu}/\bar{L}}$, then $\gamma$ must satisfy $\gamma = (1 - \sqrt{\bar{\mu}/\bar{L}})\gamma + \bar{\mu}\sqrt{\bar{\mu}/\bar{L}}$, i.e., $\gamma = \bar{\mu}$. After a series of transformations (see [59] for the specific transformation), the equivalent form of CNGD (1.4) is given by
$$\begin{cases} x^{t+1} = y^t - \dfrac{1}{\bar{L}} \nabla f(y^t) \\ y^{t+1} = x^{t+1} + \beta (x^{t+1} - x^t), \end{cases} \tag{1.5}$$
where $\beta = (\sqrt{\bar{L}} - \sqrt{\bar{\mu}})/(\sqrt{\bar{L}} + \sqrt{\bar{\mu}})$. It is well known that, among all centralized gradient approaches, CNGD [57] achieves the optimal convergence rate in terms of first-order oracle complexity. Under Assumptions 1.2 and 1.3, the convergence rate of CNGD (1.5) is $O((1 - \sqrt{\bar{\mu}/\bar{L}})^t)$, whose dependence on the condition number $\bar{L}/\bar{\mu}$ improves over CGD's rate $O((1 - \bar{\mu}/\bar{L})^t)$ in the large $\bar{L}/\bar{\mu}$ regime. In this chapter, we devote ourselves to the study of a directed distributed Nesterov-like gradient tracking (D-DNGT) algorithm, which is not only suitable for a directed network but also converges linearly and exactly to the optimal solution of (1.1). To the best of our knowledge, such an algorithm has not yet been studied and is worth investigating.
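The difference between CGD and the two-step form (1.5) of CNGD can be seen on a small ill-conditioned quadratic. The sketch below is illustrative only: the test function $f(x) = \frac{1}{2}(\bar{\mu} x_1^2 + \bar{L} x_2^2)$, the constants, and the function names are assumptions made for this example.

```python
import math

# Hypothetical constants with condition number L_BAR / MU_BAR = 100.
L_BAR, MU_BAR = 100.0, 1.0


def grad_f(x):
    """Gradient of f(x) = 0.5 * (MU_BAR * x[0]**2 + L_BAR * x[1]**2); x* = 0."""
    return [MU_BAR * x[0], L_BAR * x[1]]


def norm(x):
    return math.sqrt(sum(v * v for v in x))


def cgd(x0, steps):
    """Centralized gradient descent with step-size 1 / L_BAR."""
    x = list(x0)
    for _ in range(steps):
        g = grad_f(x)
        x = [xi - gi / L_BAR for xi, gi in zip(x, g)]
    return x


def cngd(x0, steps):
    """CNGD in the two-step form (1.5), started from y^0 = x^0."""
    sqrt_l, sqrt_mu = math.sqrt(L_BAR), math.sqrt(MU_BAR)
    beta = (sqrt_l - sqrt_mu) / (sqrt_l + sqrt_mu)
    x = list(x0)
    y = list(x0)
    for _ in range(steps):
        g = grad_f(y)
        x_next = [yi - gi / L_BAR for yi, gi in zip(y, g)]       # gradient step at y^t
        y = [xn + beta * (xn - xo) for xn, xo in zip(x_next, x)]  # momentum step
        x = x_next
    return x


err_cgd = norm(cgd([1.0, 1.0], 200))
err_cngd = norm(cngd([1.0, 1.0], 200))
print(err_cgd, err_cngd)
```

On this instance the slow coordinate contracts by a factor $1 - \bar{\mu}/\bar{L}$ per CGD step, whereas the momentum recursion contracts at roughly $1 - \sqrt{\bar{\mu}/\bar{L}}$, so the CNGD error after the same number of steps is several orders of magnitude smaller.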
1.3.2 Directed Distributed Nesterov-Like Gradient Tracking (D-DNGT) Algorithm
We now describe D-DNGT, which solves problem (1.1) in a distributed manner. Each node $i \in \mathcal{V}$ at time $t \geq 0$ stores four variables: $x_i^t \in \mathbb{R}^p$, $y_i^t \in \mathbb{R}^p$, $s_i^t \in \mathbb{R}^n$, and $z_i^t \in \mathbb{R}^p$. For $t > 0$, node $i \in \mathcal{V}$ updates its variables as follows:
$$\begin{cases} x_i^{t+1} = \sum_{j=1}^{n} r_{ij} y_j^t + \beta_i (x_i^t - x_i^{t-1}) - \alpha_i z_i^t \\ y_i^{t+1} = x_i^{t+1} + \beta_i (x_i^{t+1} - x_i^t) \\ s_i^{t+1} = \sum_{j=1}^{n} r_{ij} s_j^t \\ z_i^{t+1} = \sum_{j=1}^{n} r_{ij} z_j^t + \dfrac{\nabla f_i(y_i^{t+1})}{[s_i^{t+1}]_i} - \dfrac{\nabla f_i(y_i^t)}{[s_i^t]_i}, \end{cases} \tag{1.6}$$
where $\alpha_i > 0$ and $\beta_i \geq 0$ are, respectively, the constant (non-uniform) step-size and the (non-uniform) momentum coefficient (heavy-ball momentum and Nesterov momentum) locally chosen at each node i. The notations $[s_i^t]_i$ and $\nabla f_i(y_i^t)$ (a vector), respectively, denote the i-th entry of $s_i^t$ and the gradient of $f_i(y)$ at $y = y_i^t$.
The weights $r_{ij}$, $i, j \in \mathcal{V}$, associated with the network $\mathcal{G}$ obey the following conditions:
$$r_{ij} \begin{cases} > \epsilon, & j \in \mathcal{N}_i^{in}, \\ = 0, & \text{otherwise}, \end{cases} \qquad \sum_{j=1}^{n} r_{ij} = 1, \ \forall i, \tag{1.7}$$
and $r_{ii} = 1 - \sum_{j \in \mathcal{N}_i^{in}} r_{ij} > \epsilon$, $\forall i$, where $0 < \epsilon < 1$ (footnote 1). Each node $i \in \mathcal{V}$ starts with the initial states $x_i^0 = y_i^0 \in \mathbb{R}^p$, $s_i^0 = e_i$, and $z_i^0 = \nabla f_i(y_i^0)$ (footnote 2). Denote by $R = [r_{ij}] \in \mathbb{R}^{n \times n}$ the collection of the weights $r_{ij}$, $i, j \in \mathcal{V}$, in (1.7), which is obviously row-stochastic. In essence, the update of $z_i^t$ in (1.6) is a distributed inexact gradient tracking step, where each local cost function's gradient is scaled by $[s_i^t]_i$, which is generated by the third update in (1.6). Actually, the update of $s_i^t$ in (1.6) is a consensus iteration aiming to estimate the Perron eigenvector $w = [w_1, \ldots, w_n]^T$ (associated with the eigenvalue 1) of the weight matrix R satisfying $1_n^T w = 1$. This iteration is similar to that employed in [51–53]. To sum up, D-DNGT (1.6) transforms CNGD (1.5) into a distributed form via gradient tracking and can be applied to a directed network.
Remark 1.3 For the sake of brevity, we mainly concentrate on the one-dimensional case, i.e., p = 1; the multi-dimensional case can be proven similarly.
Define $x^t = [x_1^t, \ldots, x_n^t]^T \in \mathbb{R}^n$, $y^t = [y_1^t, \ldots, y_n^t]^T \in \mathbb{R}^n$, $z^t = [z_1^t, \ldots, z_n^t]^T \in \mathbb{R}^n$, $S^t = [s_1^t, \ldots, s_n^t]^T \in \mathbb{R}^{n \times n}$, $\nabla F(y^t) = [\nabla f_1(y_1^t), \ldots, \nabla f_n(y_n^t)]^T \in \mathbb{R}^n$, and $\tilde{S}^t = \mathrm{diag}\{S^t\}$. Therefore, the aggregated form of D-DNGT (1.6) can be written as follows:
$$\begin{cases} x^{t+1} = R y^t + D_\beta (x^t - x^{t-1}) - D_\alpha z^t \\ y^{t+1} = x^{t+1} + D_\beta (x^{t+1} - x^t) \\ S^{t+1} = R S^t \\ z^{t+1} = R z^t + [\tilde{S}^{t+1}]^{-1} \nabla F(y^{t+1}) - [\tilde{S}^t]^{-1} \nabla F(y^t), \end{cases} \tag{1.8}$$
where $D_\alpha = \mathrm{diag}\{\alpha\} \in \mathbb{R}^{n \times n}$ and $D_\beta = \mathrm{diag}\{\beta\} \in \mathbb{R}^{n \times n}$, with $\alpha = [\alpha_1, \ldots, \alpha_n]^T$ and $\beta = [\beta_1, \ldots, \beta_n]^T$. The initial states are $x^0 = y^0 \in \mathbb{R}^n$, $S^0 = I_n$, and $z^0 = \nabla F(y^0) \in \mathbb{R}^n$. It is worth emphasizing that D-DNGT (1.6) does not need the out-degree information of the nodes (only a row-stochastic matrix is adopted in D-DNGT), which makes it easier to implement.
Footnote 1: It is worth noticing that the conditions on the weights $r_{ij}$, $i, j \in \mathcal{V}$, associated with the network $\mathcal{G}$ given in (1.7) are valid. For all $i \in \mathcal{V}$, the conditions in (1.7) can be satisfied when we set $r_{ij} = 1/|\mathcal{N}_i^{in}|$, $\forall j \in \mathcal{N}_i^{in}$, and $r_{ij} = 0$ otherwise.
Footnote 2: Suppose that each node possesses and knows its unique identifier in the network, e.g., $1, \ldots, n$ [45–50].
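The aggregated recursion (1.8) can be exercised on a toy instance. The following is a minimal sketch for p = 1 in plain Python, not the authors' implementation: the 3-node network, the quadratic costs $f_i(x) = \frac{1}{2} a_i (x - b_i)^2$, the step-sizes, and the momentum coefficients are all hypothetical choices. Two further assumptions are made explicit in the comments: each node's in-neighbor set is augmented with the node itself so that $r_{ii} > 0$, and $x^{-1} = x^0$ is used for the first momentum term.

```python
def build_row_stochastic(n, edges):
    """Uniform weights over in-neighbors plus a self-loop (assumed choice)."""
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        support = {j for (j, k) in edges if k == i} | {i}  # self-loop gives r_ii > 0
        for j in support:
            R[i][j] = 1.0 / len(support)
    return R


def matvec(R, v):
    return [sum(rij * vj for rij, vj in zip(row, v)) for row in R]


def d_dngt(R, grads, alpha, beta, x0, steps):
    """Aggregated D-DNGT iteration (1.8) for scalar decision variables."""
    n = len(R)
    x_prev, x, y = list(x0), list(x0), list(x0)  # assumes x^{-1} = x^0 = y^0
    S = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # S^0 = I_n
    g = [grads[i](y[i]) for i in range(n)]
    z = list(g)  # z^0 = grad F(y^0), since [s_i^0]_i = 1
    for _ in range(steps):
        Ry, Rz = matvec(R, y), matvec(R, z)
        x_new = [Ry[i] + beta[i] * (x[i] - x_prev[i]) - alpha[i] * z[i]
                 for i in range(n)]
        y_new = [x_new[i] + beta[i] * (x_new[i] - x[i]) for i in range(n)]
        # Row i of S is s_i^t, so the consensus update reads S^{t+1} = R S^t.
        S_new = [[sum(R[i][k] * S[k][j] for k in range(n)) for j in range(n)]
                 for i in range(n)]
        g_new = [grads[i](y_new[i]) for i in range(n)]
        # Gradient tracking with gradients rescaled by the diagonal of S.
        z = [Rz[i] + g_new[i] / S_new[i][i] - g[i] / S[i][i] for i in range(n)]
        x_prev, x, y, S, g = x, x_new, y_new, S_new, g_new
    return x


# Hypothetical instance: 3 nodes, edges (j, i) meaning j transmits to i.
edges = {(0, 1), (1, 2), (2, 0), (0, 2)}
a, b = [1.0, 2.0, 0.5], [1.0, -1.0, 4.0]
grads = [lambda v, ai=ai, bi=bi: ai * (v - bi) for ai, bi in zip(a, b)]
R = build_row_stochastic(3, edges)
alpha = [0.01, 0.015, 0.02]  # non-uniform step-sizes (illustrative)
beta = [0.05, 0.02, 0.04]    # non-uniform momentum coefficients (illustrative)

x_final = d_dngt(R, grads, alpha, beta, x0=[0.0, 0.0, 0.0], steps=3000)
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)
print(x_final, x_star)
```

Note that each node only needs its own row of `R`, that is, the weights it places on incoming information; no out-degree is used anywhere in the sketch. The step-sizes here were picked conservatively by hand and are not the bounds derived in the chapter.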
1.3.3 Related Methods
In this subsection, some distributed optimization methods that are not only suitable for directed networks but also related to D-DNGT (1.6) are discussed on an intuitive level. In particular, we consider ADD-OPT/Push-DIGing [41, 42], FROST [52], and ABm [48] (footnote 3).
(a) Relation to ADD-OPT/Push-DIGing ADD-OPT [41] (in comparison with ADD-OPT, Push-DIGing [42] is also suitable for time-varying networks) keeps updating four variables $x_i^t$, $s_i^t$, $y_i^t$, and $z_i^t \in \mathbb{R}$ for each node i. Starting from the initial states $s_i^0 = 1$, $z_i^0 = \nabla f_i(y_i^0)$, and an arbitrary $x_i^0$, the updating rule of ADD-OPT is given by
$$\begin{cases} x_i^{t+1} = \sum_{j=1}^{n} c_{ij} x_j^t - \alpha z_i^t \\ s_i^{t+1} = \sum_{j=1}^{n} c_{ij} s_j^t, \quad y_i^{t+1} = x_i^{t+1} / s_i^{t+1} \\ z_i^{t+1} = \sum_{j=1}^{n} c_{ij} z_j^t + \nabla f_i(y_i^{t+1}) - \nabla f_i(y_i^t), \end{cases} \tag{1.9}$$
where $C = [c_{ij}] \in \mathbb{R}^{n \times n}$ is column-stochastic and $\alpha > 0$ is the step-size. Under Assumptions 1.1–1.3, ADD-OPT converges linearly to the optimal solution over a directed network using a uniform constant step-size. Besides, ADD-OPT/Push-DIGing applies the push-sum strategy (column-stochastic weights) to overcome the unbalancedness induced by directed networks, which may be infeasible in a distributed implementation because it requires each node to know (at least) its out-degree. We emphasize here that row-stochastic weights are relatively easy to achieve in a distributed setting, and the implementation is straightforward since each node can privately regulate the weights on the information that it acquires from its in-neighbors.
Footnote 3: Notice that some notations involved in the related methods may conflict with the notations used for the distributed optimization problem/algorithm/analysis throughout the chapter. Therefore, we declare here that the symbols in this subsection should not be applied to other parts.
(b) Relation to FROST The method FROST, proposed in [52], serves as a basis for the development of D-DNGT (1.6). FROST maintains over time $t \geq 0$ at each node i the solution estimate $x_i^t \in \mathbb{R}$ and two auxiliary variables $s_i^t \in \mathbb{R}^n$ and $z_i^t \in \mathbb{R}$.
Mathematically, the updating rule is as follows:
$$\begin{cases} x_i^{t+1} = \sum_{j=1}^{n} r_{ij} x_j^t - \alpha_i z_i^t \\ s_i^{t+1} = \sum_{j=1}^{n} r_{ij} s_j^t \\ z_i^{t+1} = \sum_{j=1}^{n} r_{ij} z_j^t + \dfrac{\nabla f_i(x_i^{t+1})}{[s_i^{t+1}]_i} - \dfrac{\nabla f_i(x_i^t)}{[s_i^t]_i}, \end{cases} \tag{1.10}$$
where $\alpha_i > 0$ is a step-size locally chosen at each node i and the row-stochastic weights $R = [r_{ij}] \in \mathbb{R}^{n \times n}$ comply with (1.7); the initialization is $x_i^0 \in \mathbb{R}$, $s_i^0 = e_i$, and $z_i^0 = \nabla f_i(x_i^0)$. FROST utilizes row-stochastic weights with non-uniform step-sizes among the nodes and exhibits fast convergence over a directed network: it converges at a linear rate to the optimal solution under Assumptions 1.1–1.3.
(c) Relation to ABm The ABm method, investigated in [48], combines gradient tracking with a momentum term and utilizes non-uniform step-sizes. It is described as follows:
$$\begin{cases} x_i^{t+1} = \sum_{j=1}^{n} r_{ij} x_j^t - \alpha_i z_i^t + \beta_i (x_i^t - x_i^{t-1}) \\ z_i^{t+1} = \sum_{j=1}^{n} c_{ij} z_j^t + \nabla f_i(x_i^{t+1}) - \nabla f_i(x_i^t), \end{cases} \tag{1.11}$$
initialized with $z_i^0 = \nabla f_i(x_i^0)$ and an arbitrary $x_i^0$ at each node i, where, as before, $\alpha_i > 0$ and $\beta_i \geq 0$ represent the local step-size and the momentum coefficient of node i. ABm simultaneously implements both row-stochastic ($R = [r_{ij}] \in \mathbb{R}^{n \times n}$) and column-stochastic ($C = [c_{ij}] \in \mathbb{R}^{n \times n}$) weights. It is deduced from [48] that ABm reduces to AB [45] when $\beta_i = 0$, $\forall i$, and AB lies at the heart of existing methods that employ gradient tracking [42, 43, 48]. Notice that ADD-OPT/Push-DIGing, FROST, and D-DNGT, described above, have a non-linear term that is derived from the division by the eigenvector learning term (see (1.6), (1.9), and (1.10)). ABm eliminates this non-linear calculation and is still suitable for directed networks. However, ABm requires each node to know its out-degree in order to build the column-stochastic weights, which, as explained earlier, is difficult to establish directly in a distributed manner. It is worth highlighting that our algorithm, D-DNGT, extends CNGD to a distributed form and, in comparison with CNGD [57] and Acc-DNGD-SC/Acc-DNGD-NSC [54], is suitable for directed networks. In addition, D-DNGT combines FROST with two kinds of momentum terms (heavy-ball momentum and Nesterov momentum), which ensures that nodes acquire more information from in-neighbors in the network than in FROST and thereby achieve much faster convergence.
1.4 Convergence Analysis
In this section, we prove that D-DNGT (1.6) converges at a linear rate to the optimal solution $x^*$ provided that the coefficients (non-uniform step-sizes and momentum coefficients) are bounded by properly chosen constants. The following notations and relations are employed. Recall that, under Assumption 1.1, R is irreducible and row-stochastic with positive diagonal entries, so there exists a normalized left Perron eigenvector $w = [w_1, \ldots, w_n]^T \in \mathbb{R}^n$ ($w_i > 0$, $\forall i$) of R such that
$$\lim_{t \to \infty} (R)^t = (R)^\infty = 1_n w^T, \qquad w^T R = w^T, \qquad w^T 1_n = 1.$$
Also, define $S^\infty = \lim_{t \to \infty} S^t$ (we obtain $S^\infty = (R)^\infty$ because $S^0 = I_n$), $\tilde{S}^\infty = \mathrm{diag}\{S^\infty\}$, $\hat{s} = \sup_{t \geq 0} \|S^t\|_2$, $\tilde{s} = \sup_{t \geq 0} \|[\tilde{S}^t]^{-1}\|_2$ (footnote 4), $\bar{x}^t = w^T x^t$, $\nabla F(1_n \bar{x}^t) = [\nabla f_1(\bar{x}^t), \ldots, \nabla f_n(\bar{x}^t)]^T$, $\hat{\alpha} = \max_{i \in \mathcal{V}} \{\alpha_i\}$, and $\hat{\beta} = \max_{i \in \mathcal{V}} \{\beta_i\}$. Since R is primitive and $S^0 = I_n$, the sequence $\{S^t\}$ is convergent [39, 51], and therefore the diagonal elements of $S^t$ are positive and bounded for all $t \geq 0$. Thus, $\hat{s}$ and $\tilde{s}$ are two finite constants. In addition, we employ $\|\cdot\|$ to indicate either a particular matrix norm or a vector norm such that $\|Rz\| \leq \|R\| \|z\|$ for all matrices R and vectors z. Since all vector norms are equivalent in a finite-dimensional vector space, we have $\|\cdot\|_2 \leq d_1 \|\cdot\|$ and $\|\cdot\| \leq d_2 \|\cdot\|_2$, where $d_1$ and $d_2$ are some positive constants.
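The limit $(R)^\infty = 1_n w^T$ and the role of the consensus iteration $S^{t+1} = R S^t$ can be observed numerically. The sketch below uses a hypothetical 3-node row-stochastic matrix (an assumption for illustration); iterating from $S^0 = I_n$ makes every row of $S^t$ approach $w^T$, so the diagonal of $S^t$ provides each node with an estimate of its own entry $w_i$.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


# Hypothetical row-stochastic weights with positive diagonal entries.
R = [[1 / 2, 0.0, 1 / 2],
     [1 / 2, 1 / 2, 0.0],
     [1 / 3, 1 / 3, 1 / 3]]
n = len(R)

# Consensus iteration S^{t+1} = R S^t started from S^0 = I_n, so S^t = (R)^t.
S = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
for _ in range(60):
    S = matmul(R, S)

# Node i reads its own Perron entry from the i-th diagonal element of S^t.
w = [S[i][i] for i in range(n)]

# Left-eigenvector check: w^T R should reproduce w^T.
w_times_R = [sum(w[i] * R[i][j] for i in range(n)) for j in range(n)]
print(w, w_times_R)
```

For this particular matrix the left Perron vector can also be solved by hand from $w^T R = w^T$ and $w^T 1_n = 1$, giving $w = [4/9, 2/9, 1/3]^T$, which the iteration reproduces.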
1.4.1 Auxiliary Results
Before showing the main results, we introduce some auxiliary results. First, the following crucial lemma is given, which is a direct implication of Assumption 1.1 and (1.7) (see Section II-B in [32]).
Lemma 1.4 ([32]) Suppose that Assumption 1.1 holds and that the weight matrix $R = [r_{ij}] \in \mathbb{R}^{n \times n}$ satisfies (1.7). Then, there exist a norm $\|\cdot\|$ and a constant $0 < \rho < 1$ such that $\|Rx - (R)^\infty x\| \leq \rho \|x - (R)^\infty x\|$ for all $x \in \mathbb{R}^n$.
Building on the result established in Lemma 1.4, we next present an additional lemma from Markov chain and consensus theory [60].
4 Throughout the chapter, for any matrix/vector/scalar $Z$, we use the symbol $(Z)^t$ to denote the $t$-th power of $Z$, in order to distinguish it from the iteration index of variables.
1 Accelerated Algorithms for Distributed Convex Optimization
Lemma 1.5 ([60]) Let $S^t$ be generated by (1.8). Then, there exist $0 < \theta < \infty$ and $0 < \lambda < 1$ such that $\|S^t - S^\infty\|_2 \le \theta (\lambda)^t$, $\forall t \ge 0$.

The next lemma, a direct consequence of Lemma 1.5, is used to establish the linear convergence of the sequences $\{\|[\tilde{S}^t]^{-1} - [\tilde{S}^\infty]^{-1}\|_2\}$ and $\{\|[\tilde{S}^{t+1}]^{-1} - [\tilde{S}^t]^{-1}\|_2\}$ (for a detailed proof, see [52] or [51]).

Lemma 1.6 ([51]) Let $S^t$ be generated by (1.8). For all $t \ge 0$, it holds that
(a) $\|[\tilde{S}^t]^{-1} - [\tilde{S}^\infty]^{-1}\|_2 \le \theta (\tilde{s})^2 (\lambda)^t$,
(b) $\|[\tilde{S}^{t+1}]^{-1} - [\tilde{S}^t]^{-1}\|_2 \le 2\theta (\tilde{s})^2 (\lambda)^t$.

The next lemma derives the dynamics that govern the evolution of the weighted sum of $z^t$.

Lemma 1.7 ([51]) Let $z^t$ be generated by (1.8). Recall that $z^0 = \nabla F(y^0)$. Then, for all $t \ge 0$, it holds that $(R)^\infty z^t = (R)^\infty [\tilde{S}^t]^{-1} \nabla F(y^t)$.

For convenience of the convergence analysis, we will make frequent use of the following well-known lemma (see, for example, [32] for a proof).

Lemma 1.8 ([32]) Suppose that Assumptions 1.2–1.3 hold. Since the global cost function $f$ is $\bar{\mu}$-strongly convex and $\bar{L}$-smooth, for all $x \in \mathbb{R}$ and $0 < \varepsilon < 2/\bar{L}$, we get

$$\|x - \varepsilon \nabla f(x) - x^*\|_2 \le l \|x - x^*\|_2,$$

where $l = \max\{|1 - \bar{L}\varepsilon|, |1 - \bar{\mu}\varepsilon|\}$, $x^*$ is the optimal solution to (1.1) and $\nabla f(x)$ is the gradient of $f(x)$ at $x$.
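The contraction in Lemma 1.8 can be illustrated on a scalar function. The sketch below is only a numerical sanity check under stated assumptions: the test function $f(x) = \tfrac{1}{2}\bar{\mu}x^2 + \ln(1 + e^x)$, the constants `MU` and `L_BAR`, and the sample points are hypothetical choices, not taken from the chapter.

```python
import math

MU = 1.0            # assumed strong-convexity constant of the test function
L_BAR = MU + 0.25   # smoothness constant: f''(x) = MU + s(1 - s) <= MU + 1/4


def grad_f(x):
    """Derivative of f(x) = 0.5*MU*x**2 + log(1 + exp(x))."""
    return MU * x + 1.0 / (1.0 + math.exp(-x))


def minimizer(lo=-10.0, hi=10.0, iters=200):
    """Locate x* with grad_f(x*) = 0 by bisection (grad_f is increasing)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if grad_f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


x_star = minimizer()


def contraction_holds(x, eps, tol=1e-9):
    """Check |x - eps*f'(x) - x*| <= l*|x - x*| with l as in Lemma 1.8."""
    l = max(abs(1.0 - L_BAR * eps), abs(1.0 - MU * eps))
    lhs = abs(x - eps * grad_f(x) - x_star)
    return lhs <= l * abs(x - x_star) + tol


# All tested step lengths satisfy 0 < eps < 2 / L_BAR = 1.6.
checks = [contraction_holds(x, eps)
          for x in (-5.0, -1.0, 0.0, 0.3, 2.0, 5.0)
          for eps in (0.1, 0.8, 1.0, 1.5)]
```

The inequality follows from the mean value theorem: $x - \varepsilon \nabla f(x) - x^* = (1 - \varepsilon f''(\xi))(x - x^*)$ for some $\xi$, and $f''(\xi) \in [\bar{\mu}, \bar{L}]$.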
1.4.2 Supporting Lemmas

In this subsection, we begin the convergence analysis of D-DNGT by investigating the evolution of $\|x^{t+1} - (R)^\infty x^{t+1}\|$, $\|(R)^\infty x^{t+1} - 1_n x^*\|_2$, $\|x^{t+1} - x^t\|$ and $\|z^{t+1} - (R)^\infty z^{t+1}\|$. Our goal is to bound these four quantities by linear combinations of their past values and $\nabla F(y^t)$, which yields a linear system of inequalities. In what follows, the bound on the consensus violation, $\|x^{t+1} - (R)^\infty x^{t+1}\|$, is provided first.
Lemma 1.9 Suppose that Assumption 1.1 holds. Then, for all $t > 0$, we have the following inequality:

$$\|x^{t+1} - (R)^\infty x^{t+1}\| \le \rho \|x^t - (R)^\infty x^t\| + \kappa_1 \hat{\beta} \|x^t - x^{t-1}\| + \kappa_2 \hat{\alpha} \|z^t\|_2, \tag{1.12}$$

where $\rho$ is given in Lemma 1.4, $\kappa_1 = d_2(\rho + 1)\|I_n - (R)^\infty\|$ and $\kappa_2 = (d_2)^2 \|I_n - (R)^\infty\|$; $\hat{\alpha}$ and $\hat{\beta}$ are the largest step-size and the maximum momentum coefficient among the nodes, respectively.

Proof According to the updates of $x^t$, $y^t$ in D-DNGT (1.8), it holds that

$$\|x^{t+1} - (R)^\infty x^{t+1}\| \le \rho \|(I_n - (R)^\infty)(x^t + D_\beta(x^t - x^{t-1}))\| + \|(I_n - (R)^\infty) D_\beta (x^t - x^{t-1})\| + \kappa_2 \hat{\alpha} \|z^t\|_2, \tag{1.13}$$

where the inequality in (1.13) is obtained from Lemma 1.4 and the fact that $(R)^\infty R = (R)^\infty$. The desired result of Lemma 1.9 is then acquired.

The next lemma presents the bound of the optimality residual associated with the weighted average, $\|(R)^\infty x^{t+1} - 1_n x^*\|_2$ (notice that $(R)^\infty x^{t+1} = 1_n \bar{x}^{t+1}$).

Lemma 1.10 Suppose that Assumptions 1.2 and 1.3 hold. If $0 < n(w^T \alpha) < 2/\bar{L}$, then the following inequality holds for all $t > 0$:

$$\begin{aligned} \|(R)^\infty x^{t+1} - 1_n x^*\|_2 \le{} & d_1 n \hat{L} \hat{\alpha} \|(R)^\infty x^t - x^t\| + l_1 \|(R)^\infty x^t - 1_n x^*\|_2 + \hat{s} (\tilde{s})^2 \theta \hat{\alpha} \|\nabla F(y^t)\|_2 (\lambda)^t \\ & + (2\kappa_3 + d_1 \hat{s} \tilde{s} \hat{L} \hat{\alpha}) \hat{\beta} \|x^t - x^{t-1}\| + \kappa_3 \hat{\alpha} \|z^t - (R)^\infty z^t\|, \end{aligned} \tag{1.14}$$

where $\kappa_3 = d_1 \|(R)^\infty\|_2$ and $l_1 = \max\{|1 - \bar{L} n (w^T \alpha)|, |1 - \bar{\mu} n (w^T \alpha)|\}$; $\theta$ and $\lambda$ are introduced in Lemma 1.5.

Proof Notice that $(R)^\infty R = (R)^\infty$. Recalling the updates of $x^t$ and $y^t$ in D-DNGT (1.8), we get from Lemma 1.7 that

$$\begin{aligned} \|(R)^\infty x^{t+1} - 1_n x^*\|_2 ={} & \|(R)^\infty (x^t + 2 D_\beta (x^t - x^{t-1}) - D_\alpha z^t + (D_\alpha - D_\alpha)(R)^\infty z^t) - 1_n x^*\|_2 \\ \le{} & \|(R)^\infty x^t - (R)^\infty D_\alpha (R)^\infty [\tilde{S}^t]^{-1} \nabla F(y^t) - 1_n x^*\|_2 \\ & + 2\kappa_3 \hat{\beta} \|x^t - x^{t-1}\| + \kappa_3 \hat{\alpha} \|z^t - (R)^\infty z^t\|. \end{aligned} \tag{1.15}$$
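We now discuss the first term in the inequality of (1.15). Note that $(R)^\infty = 1_n w^T$ and $\nabla F(1_n \bar{x}^t) = [\nabla f_1(\bar{x}^t), \ldots, \nabla f_n(\bar{x}^t)]^T$. By utilizing $1_n w^T D_\alpha 1_n w^T = (w^T \alpha) 1_n w^T$, one obtains

$$\begin{aligned} & \|(R)^\infty x^t - (R)^\infty D_\alpha (R)^\infty [\tilde{S}^t]^{-1} \nabla F(y^t) - 1_n x^*\|_2 \\ & \quad \le \|1_n (w^T x^t - x^* - n(w^T \alpha) \nabla f(\bar{x}^t))\|_2 + (w^T \alpha) \|n 1_n \nabla f(\bar{x}^t) - 1_n w^T [\tilde{S}^t]^{-1} \nabla F(y^t)\|_2 \\ & \quad = \Lambda_1 + \Lambda_2, \end{aligned} \tag{1.16}$$

where $\nabla f(\bar{x}^t) = (1/n) 1_n^T \nabla F(1_n \bar{x}^t)$. By Lemma 1.8, when $0 < n(w^T \alpha) < 2/\bar{L}$, $\Lambda_1$ is bounded by

$$\Lambda_1 \le l_1 \sqrt{n} \|w^T x^t - x^*\|_2 = l_1 \|(R)^\infty x^t - 1_n x^*\|_2, \tag{1.17}$$

where $l_1 = \max\{|1 - \bar{L} n (w^T \alpha)|, |1 - \bar{\mu} n (w^T \alpha)|\}$. Then, $\Lambda_2$ can be bounded in the following way:

$$\begin{aligned} \Lambda_2 \le{} & (w^T \alpha) \|n 1_n \nabla f(\bar{x}^t) - 1_n 1_n^T \nabla F(x^t)\|_2 + (w^T \alpha) \|1_n 1_n^T \nabla F(x^t) - 1_n w^T [\tilde{S}^t]^{-1} \nabla F(y^t)\|_2 \\ ={} & \Lambda_3 + \Lambda_4, \end{aligned} \tag{1.18}$$

where $\nabla F(x^t) = [\nabla f_1(x_1^t), \ldots, \nabla f_n(x_n^t)]^T$. Since $\nabla f(\bar{x}^t) = (1/n) 1_n^T \nabla F(1_n \bar{x}^t)$, it follows from Assumption 1.2 that

$$\Lambda_3 \le n \hat{L} \hat{\alpha} \|(R)^\infty x^t - x^t\|_2. \tag{1.19}$$

Next, by employing Lemma 1.6 and the relation $S^\infty [\tilde{S}^\infty]^{-1} = 1_n 1_n^T$, we have

$$\begin{aligned} \Lambda_4 ={} & (w^T \alpha) \|S^\infty [\tilde{S}^\infty]^{-1} \nabla F(x^t) - S^\infty [\tilde{S}^t]^{-1} \nabla F(y^t)\|_2 \\ \le{} & \hat{s} \tilde{s} \hat{L} \hat{\alpha} \hat{\beta} \|x^t - x^{t-1}\|_2 + \hat{s} (\tilde{s})^2 \theta \hat{\alpha} \|\nabla F(y^t)\|_2 (\lambda)^t, \end{aligned} \tag{1.20}$$

where $\hat{s} = \sup_{t \ge 0} \|S^t\|_2$ and $\tilde{s} = \sup_{t \ge 0} \|[\tilde{S}^t]^{-1}\|_2$. The lemma follows by plugging (1.16)–(1.20) into (1.15).

For the bound of the estimate difference $\|x^{t+1} - x^t\|$, the following lemma is shown.

Lemma 1.11 Suppose that Assumption 1.2 holds. For all $t > 0$, it holds that

$$\|x^{t+1} - x^t\| \le \kappa_4 \|x^t - (R)^\infty x^t\| + \kappa_5 \hat{\beta} \|x^t - x^{t-1}\| + d_2 \hat{\alpha} \|z^t\|_2, \tag{1.21}$$

where $\kappa_4 = \|R - I_n\|$ and $\kappa_5 = d_2 + d_2 \|R\|$.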
Proof Recalling that $(R)^\infty R = (R)^\infty$, we obtain from the updates of $x^t$ and $y^t$ in D-DNGT (1.8) that

$$\|x^{t+1} - x^t\| \le \|R - I_n\| \|x^t - (R)^\infty x^t\| + d_2 \hat{\alpha} \|z^t\|_2 + (d_2 + d_2 \|R\|) \hat{\beta} \|x^t - x^{t-1}\|, \tag{1.22}$$

and the lemma follows.

The next lemma establishes the inequality that bounds the error term corresponding to the gradient estimation, $\|z^{t+1} - (R)^\infty z^{t+1}\|$.

Lemma 1.12 Suppose that Assumptions 1.1–1.3 hold. For all $t > 0$, we get the following estimate:

$$\begin{aligned} \|z^{t+1} - (R)^\infty z^{t+1}\| \le{} & \kappa_4 \kappa_6 (1 + \hat{\beta}) \|x^t - (R)^\infty x^t\| + \kappa_6 d_2 (1 + \hat{\beta}) \hat{\alpha} \|z^t\|_2 \\ & + \kappa_6 (1 + \kappa_5 + \kappa_5 \hat{\beta}) \hat{\beta} \|x^t - x^{t-1}\| + \rho \|z^t - (R)^\infty z^t\| \\ & + 2 \|I_n - (R)^\infty\| d_2 (\tilde{s})^2 \theta \|\nabla F(y^t)\|_2 (\lambda)^t, \end{aligned} \tag{1.23}$$

where $\kappa_6 = \|I_n - (R)^\infty\| d_1 d_2 \hat{L} \tilde{s}$.

Proof It is immediately obtained from the update of $z^t$ in D-DNGT (1.8) that

$$\|z^{t+1} - (R)^\infty z^{t+1}\| \le \|I_n - (R)^\infty\| \|[\tilde{S}^{t+1}]^{-1} \nabla F(y^{t+1}) - [\tilde{S}^t]^{-1} \nabla F(y^t)\| + \rho \|z^t - (R)^\infty z^t\|, \tag{1.24}$$

where we employ the triangle inequality and Lemma 1.4 to deduce the inequality. As for the first term of the inequality in (1.24), we apply the update of $y^t$ in D-DNGT (1.8) and the result in Lemma 1.6 to obtain

$$\begin{aligned} & \|[\tilde{S}^{t+1}]^{-1} \nabla F(y^{t+1}) - [\tilde{S}^t]^{-1} \nabla F(y^t)\| \\ & \quad \le \|[\tilde{S}^{t+1}]^{-1} \nabla F(y^{t+1}) - [\tilde{S}^{t+1}]^{-1} \nabla F(y^t)\| + \|[\tilde{S}^{t+1}]^{-1} \nabla F(y^t) - [\tilde{S}^t]^{-1} \nabla F(y^t)\| \\ & \quad \le d_2 \hat{L} \tilde{s} \|y^{t+1} - y^t\|_2 + 2 d_2 (\tilde{s})^2 \theta \|\nabla F(y^t)\|_2 (\lambda)^t \\ & \quad \le d_1 d_2 \hat{L} \tilde{s} (1 + \hat{\beta}) \|x^{t+1} - x^t\| + d_1 d_2 \hat{L} \tilde{s} \hat{\beta} \|x^t - x^{t-1}\| + 2 d_2 (\tilde{s})^2 \theta \|\nabla F(y^t)\|_2 (\lambda)^t. \end{aligned} \tag{1.25}$$

Combining Lemma 1.11 with (1.25), the result in Lemma 1.12 is obtained.
The final lemma provides the bound on the estimate $\|z^t\|_2$ that is needed for deriving the aforementioned linear system.

Lemma 1.13 Assume that Assumption 1.2 holds. Then, the following inequality can be established for all $t > 0$:

$$\begin{aligned} \|z^t\|_2 \le{} & d_1 \|z^t - (R)^\infty z^t\| + d_1 n \hat{L} \|x^t - (R)^\infty x^t\| + n \hat{L} \|(R)^\infty x^t - 1_n x^*\|_2 \\ & + d_1 n \hat{L} \hat{\beta} \|x^t - x^{t-1}\| + \hat{s} (\tilde{s})^2 \theta (\lambda)^t \|\nabla F(y^t)\|_2. \end{aligned} \tag{1.26}$$

Proof Note that

$$\|z^t\|_2 \le \|z^t - (R)^\infty z^t\|_2 + \|(R)^\infty z^t\|_2. \tag{1.27}$$

In view of Lemma 1.7, using $S^\infty [\tilde{S}^\infty]^{-1} = 1_n 1_n^T$ and $(R)^\infty = S^\infty$, it follows that

$$\begin{aligned} \|(R)^\infty z^t\|_2 \le{} & \|S^\infty [\tilde{S}^t]^{-1} \nabla F(y^t) - S^\infty [\tilde{S}^\infty]^{-1} \nabla F(y^t)\|_2 \\ & + \|S^\infty [\tilde{S}^\infty]^{-1} \nabla F(y^t) - S^\infty [\tilde{S}^\infty]^{-1} \nabla F(1_n x^*)\|_2 \\ \le{} & \hat{s} (\tilde{s})^2 \theta (\lambda)^t \|\nabla F(y^t)\|_2 + n \hat{L} \|y^t - 1_n x^*\|_2. \end{aligned} \tag{1.28}$$

By the update of $y^t$ in D-DNGT (1.8), one gets

$$\|y^t - 1_n x^*\|_2 \le \|x^t - (R)^\infty x^t\|_2 + \|(R)^\infty x^t - 1_n x^*\|_2 + \hat{\beta} \|x^t - x^{t-1}\|_2. \tag{1.29}$$

Substituting (1.28) and (1.29) into (1.27) yields the desired result in Lemma 1.13. The proof is completed.
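1.4.3 Main Results

With the supporting relationships, i.e., Lemmas 1.9–1.13, in hand, the main convergence results of D-DNGT are now established as follows. For the sake of convenience, we define $w_{\min} = \min_{i \in V}\{w_i\}$ and

$\nu_1 = \kappa_2 d_1 n \hat{L}$, $\nu_2 = \kappa_2 n \hat{L}$, $\nu_3 = \kappa_2 d_1$, $\nu_4 = d_1 n \hat{L}$, $\nu_5 = d_1 \hat{s} \tilde{s} \hat{L}$, $\nu_6 = d_2 d_1 n \hat{L}$, $\nu_7 = d_2 n \hat{L}$, $\nu_8 = d_2 d_1$,
$\nu_9 = \kappa_4 \kappa_6$, $\nu_{10} = \kappa_6 d_2 d_1 n \hat{L}$, $\nu_{11} = \kappa_6 d_2 n \hat{L}$, $\nu_{12} = \kappa_6 + \kappa_5 \kappa_6$, $\nu_{13} = \kappa_5 \kappa_6$, $\nu_{14} = \kappa_6 d_2 d_1$,
$\nu_{15} = \kappa_2 \hat{\alpha} \hat{s} (\tilde{s})^2 \theta$, $\nu_{16} = \hat{s} (\tilde{s})^2 \theta \hat{\alpha}$, $\nu_{17} = d_2 \hat{\alpha} \hat{s} (\tilde{s})^2 \theta$, $\nu_{18} = (2\|I_n - (R)^\infty\| + \kappa_6 (1 + \hat{\beta}) \hat{\alpha} \hat{s}) (\tilde{s})^2 \theta d_2$,
$\nu_{19} = \nu_{13} \eta_3 + \nu_{10} \eta_3 \hat{\alpha}$, $\nu_{20} = \nu_9 \eta_1 + \nu_{10} \eta_1 \hat{\alpha} + \nu_{11} \eta_2 \hat{\alpha} + \nu_{12} \eta_3 + \nu_{10} \eta_3 \hat{\alpha} + \nu_{14} \eta_4 \hat{\alpha}$ and $\nu_{21} = \eta_4 (1 - \rho) - \nu_9 \eta_1 - (\nu_{10} \eta_1 + \nu_{11} \eta_2 + \nu_{14} \eta_4) \hat{\alpha}$.

Then, the first result, i.e., Theorem 1.14, is introduced below.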
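Theorem 1.14 Suppose that Assumptions 1.1–1.3 hold. Consider that D-DNGT (1.8) updates the sequences $\{x^t\}$, $\{y^t\}$, $\{S^t\}$ and $\{z^t\}$. Then, if $0 < n(w^T \alpha) < 2/\bar{L}$, one gets the following linear system of inequalities:

$$\begin{bmatrix} \|x^{t+1} - (R)^\infty x^{t+1}\| \\ \|(R)^\infty x^{t+1} - 1_n x^*\|_2 \\ \|x^{t+1} - x^t\| \\ \|z^{t+1} - (R)^\infty z^{t+1}\| \end{bmatrix} \le \Gamma \begin{bmatrix} \|x^t - (R)^\infty x^t\| \\ \|(R)^\infty x^t - 1_n x^*\|_2 \\ \|x^t - x^{t-1}\| \\ \|z^t - (R)^\infty z^t\| \end{bmatrix} + \phi^t, \tag{1.30}$$

where the inequality is understood component-wise. The elements of the matrix $\Gamma = [\gamma_{ij}] \in \mathbb{R}^{4 \times 4}$ and the vector $\phi^t = [\phi_1^t, \phi_2^t, \phi_3^t, \phi_4^t]^T \in \mathbb{R}^4$ are respectively given by

$$\Gamma = \begin{bmatrix} \rho + \nu_1 \hat{\alpha} & \nu_2 \hat{\alpha} & \kappa_1 \hat{\beta} + \nu_1 \hat{\alpha} \hat{\beta} & \nu_3 \hat{\alpha} \\ \nu_4 \hat{\alpha} & l_1 & 2\kappa_2 \hat{\beta} + \nu_5 \hat{\alpha} \hat{\beta} & \kappa_3 \hat{\alpha} \\ \kappa_4 + \nu_6 \hat{\alpha} & \nu_7 \hat{\alpha} & \kappa_5 \hat{\beta} + \nu_6 \hat{\alpha} \hat{\beta} & \nu_8 \hat{\alpha} \\ \gamma_{41} & \gamma_{42} & \gamma_{43} & \gamma_{44} \end{bmatrix},$$

and $\gamma_{41} = \nu_9 + \nu_{10} \hat{\alpha} + \nu_9 \hat{\beta} + \nu_{10} \hat{\alpha} \hat{\beta}$, $\gamma_{42} = \nu_{11} \hat{\alpha} + \nu_{11} \hat{\alpha} \hat{\beta}$, $\gamma_{43} = \nu_{12} \hat{\beta} + \nu_{13} \hat{\beta}^2 + \nu_{10} \hat{\alpha} \hat{\beta} + \nu_{10} \hat{\alpha} \hat{\beta}^2$ and $\gamma_{44} = \rho + \nu_{14} \hat{\alpha} + \nu_{14} \hat{\alpha} \hat{\beta}$; $\phi_1^t = \nu_{15} (\lambda)^t \|\nabla F(y^t)\|_2$, $\phi_2^t = \nu_{16} (\lambda)^t \|\nabla F(y^t)\|_2$, $\phi_3^t = \nu_{17} (\lambda)^t \|\nabla F(y^t)\|_2$ and $\phi_4^t = \nu_{18} (\lambda)^t \|\nabla F(y^t)\|_2$. If, in addition, the largest step-size satisfies

$$0 < \hat{\alpha} < \min\left\{ \frac{1}{n \bar{L}}, \; \frac{\eta_1 (1 - \rho)}{\nu_1 \eta_1 + \nu_2 \eta_2 + \nu_3 \eta_4}, \; \frac{\eta_3 - \kappa_4 \eta_1}{\nu_6 \eta_1 + \nu_7 \eta_2 + \nu_8 \eta_4}, \; \frac{\eta_4 (1 - \rho) - \nu_9 \eta_1}{\nu_{10} \eta_1 + \nu_{11} \eta_2 + \nu_{14} \eta_4} \right\}, \tag{1.31}$$

and the maximum momentum coefficient satisfies

$$0 \le \hat{\beta} < \min\left\{ \frac{\eta_1 (1 - \rho) - (\nu_1 \eta_1 + \nu_2 \eta_2 + \nu_3 \eta_4) \hat{\alpha}}{\kappa_1 \eta_3 + \nu_1 \eta_3 \hat{\alpha}}, \; \frac{\eta_2 (1 - l_1) - (\nu_4 \eta_1 + \kappa_3 \eta_4) \hat{\alpha}}{2\kappa_2 \eta_3 + \nu_5 \eta_3 \hat{\alpha}}, \; \frac{-\nu_{20} + \sqrt{(\nu_{20})^2 + 4 \nu_{19} \nu_{21}}}{2 \nu_{19}}, \; \frac{\eta_3 - \kappa_4 \eta_1 - (\nu_6 \eta_1 + \nu_7 \eta_2 + \nu_8 \eta_4) \hat{\alpha}}{\kappa_5 \eta_3 + \nu_6 \eta_3 \hat{\alpha}} \right\}, \tag{1.32}$$

then the spectral radius of $\Gamma$, defined as $\rho(\Gamma)$, is strictly less than 1, where $\eta_1$, $\eta_2$, $\eta_3$, and $\eta_4$ are arbitrary constants such that

$$\eta_1 > 0, \quad \eta_2 > \frac{\nu_4 \eta_1 + \kappa_3 \eta_4}{\bar{\mu} n w_{\min}}, \quad \eta_3 > \kappa_4 \eta_1, \quad \eta_4 > \frac{\nu_9 \eta_1}{1 - \rho}. \tag{1.33}$$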
Proof First, plugging Lemma 1.13 into Lemmas 1.9–1.12 and rearranging the resulting inequalities, it is immediate to verify (1.30). Next, we provide sufficient conditions under which $\rho(\Gamma) < 1$ holds. According to Theorem 8.1.29 in [60], for a positive vector $\eta = [\eta_1, \ldots, \eta_4]^T \in \mathbb{R}^4$, if $\Gamma \eta < \eta$, then $\rho(\Gamma) < 1$ holds. By the definition of $\Gamma$, the inequality $\Gamma \eta < \eta$ is equivalent to

$$\begin{cases} (\kappa_1 \eta_3 + \nu_1 \eta_3 \hat{\alpha}) \hat{\beta} < \eta_1 (1 - \rho) - (\nu_1 \eta_1 + \nu_2 \eta_2 + \nu_3 \eta_4) \hat{\alpha} \\ (2\kappa_2 \eta_3 + \nu_5 \eta_3 \hat{\alpha}) \hat{\beta} < \eta_2 (1 - l_1) - (\nu_4 \eta_1 + \kappa_3 \eta_4) \hat{\alpha} \\ (\kappa_5 \eta_3 + \nu_6 \eta_3 \hat{\alpha}) \hat{\beta} < \eta_3 - \kappa_4 \eta_1 - (\nu_6 \eta_1 + \nu_7 \eta_2 + \nu_8 \eta_4) \hat{\alpha} \\ 2 \nu_{19} \hat{\beta} < -\nu_{20} + \sqrt{(\nu_{20})^2 + 4 \nu_{19} \nu_{21}}. \end{cases} \tag{1.34}$$

When $0 < \hat{\alpha} < 1/(n\bar{L})$, it follows from Lemma 1.10 that $l_1 = 1 - \bar{\mu} n (w^T \alpha) \le 1 - \bar{\mu} n w_{\min} \hat{\alpha}$. To ensure the positivity of $\hat{\beta}$ (that is, the right-hand sides of (1.34) must be positive), (1.34) further implies that

$$\begin{cases} \hat{\alpha} < \dfrac{\eta_1 (1 - \rho)}{\nu_1 \eta_1 + \nu_2 \eta_2 + \nu_3 \eta_4} \\ \eta_2 > \dfrac{\nu_4 \eta_1 + \kappa_3 \eta_4}{\bar{\mu} n w_{\min}} \\ \hat{\alpha} < \dfrac{\eta_3 - \kappa_4 \eta_1}{\nu_6 \eta_1 + \nu_7 \eta_2 + \nu_8 \eta_4}, \quad \eta_3 > \kappa_4 \eta_1 \\ \hat{\alpha} < \dfrac{\eta_4 (1 - \rho) - \nu_9 \eta_1}{\nu_{10} \eta_1 + \nu_{11} \eta_2 + \nu_{14} \eta_4}, \quad \eta_4 > \dfrac{\nu_9 \eta_1}{1 - \rho}. \end{cases} \tag{1.35}$$

Now, we are in a position to select the vector $\eta = [\eta_1, \ldots, \eta_4]^T$ to ensure the solvability of $\hat{\alpha}$. Since $\rho < 1$, we first pick an arbitrary positive constant $\eta_1$, then choose $\eta_3$ and $\eta_4$ in accordance with the third and fourth conditions in (1.35), respectively, and finally select $\eta_2$ satisfying the second condition in (1.35). Hence, (1.35) yields the upper bounds on the largest step-size $\hat{\alpha}$ in (1.31), considering the requirement that $0 < n(w^T \alpha) < 2/\bar{L}$. In addition, we obtain the upper bounds on the maximum momentum coefficient $\hat{\beta}$ according to (1.34) and the largest step-size $\hat{\alpha}$. This finishes the proof.

Remark 1.15 It is worth emphasizing that $\eta_1$, $\eta_2$, $\eta_3$, and $\eta_4$ in Theorem 1.14 are adjustable parameters, which rely only on the network topology and the cost functions. Thus, the choices of the largest step-size, $\hat{\alpha}$, and the maximum momentum coefficient, $\hat{\beta}$, can be calculated without much effort as long as other parameters, such as $\lambda$, $\eta$, etc., are properly selected. Furthermore, to design the step-sizes and the momentum coefficients, some global parameters, such as $\hat{L}$, $\bar{L}$, $\bar{\mu}$, and $w_{\min}$, are needed. We note that the preprocessing cost of calculating these global parameters is almost negligible compared to the worst-case runtime of D-DNGT (see [42] for a specific analysis).

Before presenting the linear convergence of D-DNGT to the global optimal solution, the following supermartingale convergence result is first introduced, which will be crucial for the convergence analysis.
Lemma 1.16 ([39]) Let $\{v^t\}$, $\{u^t\}$, $\{a^t\}$, and $\{b^t\}$ be non-negative sequences such that for all $t \ge 0$,

$$v^{t+1} \le (1 + a^t) v^t - u^t + b^t.$$

Also, let $\sum_{t=0}^{\infty} a^t < \infty$ and $\sum_{t=0}^{\infty} b^t < \infty$. Then, we get $\lim_{t \to \infty} v^t = v$ for a certain variable $v \ge 0$, and $\sum_{t=0}^{\infty} u^t < \infty$.

Now, we are ready to state the main convergence result.

Theorem 1.17 Suppose that Assumptions 1.1–1.3 hold. Consider that the sequences $\{x^t\}$, $\{y^t\}$, $\{S^t\}$, and $\{z^t\}$ are updated in D-DNGT (1.8). If $\hat{\alpha}$ and $\hat{\beta}$ satisfy the conditions in Theorem 1.14, then the sequence $\{x^t\}$ converges to $1_n x^*$ at a linear rate of $O((\delta)^t)$, where $\lambda < \delta < 1$ is a constant.

Proof Define

$$\varphi^t = \begin{bmatrix} \|x^t - (R)^\infty x^t\| \\ \|(R)^\infty x^t - 1_n x^*\|_2 \\ \|x^t - x^{t-1}\| \\ \|z^t - (R)^\infty z^t\| \end{bmatrix}, \quad P^t = \begin{bmatrix} \nu_{15} (\lambda)^t & 0 & 0 & 0 \\ \nu_{16} (\lambda)^t & 0 & 0 & 0 \\ \nu_{17} (\lambda)^t & 0 & 0 & 0 \\ \nu_{18} (\lambda)^t & 0 & 0 & 0 \end{bmatrix}, \quad Q^t = \begin{bmatrix} \|\nabla F(y^t)\|_2 \\ 0 \\ 0 \\ 0 \end{bmatrix}.$$

Then, inequality (1.30) is equivalent to

$$\varphi^{t+1} \le \Gamma \varphi^t + P^t Q^t. \tag{1.36}$$

By iterating (1.36) recursively, for all $t > 0$, we can see that

$$\varphi^t \le (\Gamma)^t \varphi^0 + \sum_{k=0}^{t-1} (\Gamma)^{t-k-1} P^k Q^k. \tag{1.37}$$

Since the spectral radius of $\Gamma$ is strictly less than 1, it can be concluded from Lemma 1.16 in [52] that $\|(\Gamma)^t\|_2 \le \vartheta (\delta_0)^t$ and $\|(\Gamma)^{t-k-1} P^k\|_2 \le \vartheta (\delta_0)^t$ for some $\vartheta > 0$ and $\lambda < \delta_0 < 1$. Taking the 2-norm on both sides of (1.37) yields

$$\|\varphi^t\|_2 \le \|(\Gamma)^t\|_2 \|\varphi^0\|_2 + \sum_{k=0}^{t-1} \|(\Gamma)^{t-k-1} P^k\|_2 \|Q^k\|_2 \le \vartheta \|\varphi^0\|_2 (\delta_0)^t + \sum_{k=0}^{t-1} \vartheta (\delta_0)^t \|Q^k\|_2. \tag{1.38}$$

Further, for all $k = 0, \ldots, t - 1$,

$$\begin{aligned} \|Q^k\|_2 \le{} & \|\nabla F(y^k) - \nabla F(1_n x^*)\|_2 + \|\nabla F(1_n x^*)\|_2 \\ \le{} & \hat{L} \|x^k - (R)^\infty x^k\|_2 + \hat{L} \|(R)^\infty x^k - 1_n x^*\|_2 + \hat{L} \hat{\beta} \|x^k - x^{k-1}\|_2 + \|\nabla F(1_n x^*)\|_2 \\ \le{} & (1 + d_1)(1 + \hat{\beta}) \hat{L} \|\varphi^k\|_2 + \|\nabla F(1_n x^*)\|_2, \end{aligned} \tag{1.39}$$

where $\nabla F(1_n x^*) = [\nabla f_1(x^*), \ldots, \nabla f_n(x^*)]^T$. Thus, by combining (1.38) and (1.39), we deduce that for all $t > 0$,

$$\|\varphi^t\|_2 \le \left( \vartheta \|\varphi^0\|_2 + (1 + d_1)(1 + \hat{\beta}) \hat{L} \vartheta \sum_{k=0}^{t-1} \|\varphi^k\|_2 + \vartheta t \|\nabla F(1_n x^*)\|_2 \right) (\delta_0)^t. \tag{1.40}$$

Define $v^t = \sum_{k=0}^{t-1} \|\varphi^k\|_2$, $\nu_{22} = (1 + d_1)(1 + \hat{\beta}) \hat{L} \vartheta$ and $p^t = \vartheta \|\varphi^0\|_2 + \vartheta t \|\nabla F(1_n x^*)\|_2$. Then (1.40) implies that

$$\|\varphi^t\|_2 = v^{t+1} - v^t \le (\nu_{22} v^t + p^t)(\delta_0)^t, \tag{1.41}$$

which is equivalent to

$$v^{t+1} \le (1 + \nu_{22} (\delta_0)^t) v^t + p^t (\delta_0)^t. \tag{1.42}$$

Since $\lambda < \delta_0 < 1$, it holds that $\sum_{t=0}^{\infty} \nu_{22} (\delta_0)^t < \infty$ and $\sum_{t=0}^{\infty} p^t (\delta_0)^t < \infty$. Thus, all the conditions in Lemma 1.16 are satisfied (with $u^t = 0$, $t \ge 0$), and we conclude that $v^t$ converges and is therefore bounded. It follows from (1.41) that $\lim_{t \to \infty} \|\varphi^t\|_2 / (\delta_1)^t \le \lim_{t \to \infty} (\nu_{22} v^t + p^t)(\delta_0)^t / (\delta_1)^t = 0$ for all $\delta_0 < \delta_1 < 1$, and thus there exist a positive constant $m$ and an arbitrarily small constant $\tau$ such that for all $t \ge 0$,

$$\|x^t - 1_n x^*\|_2 \le \|x^t - (R)^\infty x^t\|_2 + \|(R)^\infty x^t - 1_n x^*\|_2 \le (1 + d_1) \|\varphi^t\|_2 \le m (\delta_0 + \tau)^t, \tag{1.43}$$

where we define $\delta = \delta_0 + \tau$. This completes the proof.
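The two ingredients of the argument, the test $\Gamma \eta < \eta$ and the geometric decay of a recursion of the form (1.36), can be illustrated numerically. The sketch below rests on simplifying assumptions that differ from the proof: the entries of `GAMMA`, the vector `ETA`, the rate `LAM` and the forcing vector `B` are hypothetical numbers, and the gradient norm in $Q^t$ is replaced by a constant (in the proof it depends on $\varphi^t$, which is why Lemma 1.16 is needed).

```python
def weighted_max_norm(v, eta):
    """Weighted max norm: max_i |v_i| / eta_i for a positive vector eta."""
    return max(abs(vi) / ei for vi, ei in zip(v, eta))


def contraction_factor(G, eta):
    """Smallest q with G @ eta <= q * eta component-wise (G nonnegative)."""
    return max(sum(gij * ej for gij, ej in zip(row, eta)) / ei
               for row, ei in zip(G, eta))


# Hypothetical nonnegative 4x4 matrix; its second row sums to 1.4, so the
# plain row-sum test (eta = ones) fails while a weighted eta succeeds.
GAMMA = [[0.5, 0.10, 0.1, 0.05],
         [0.4, 0.70, 0.2, 0.10],
         [0.3, 0.05, 0.2, 0.10],
         [0.2, 0.05, 0.1, 0.60]]
ETA = [1.0, 3.0, 1.0, 1.5]
LAM = 0.9                   # stands in for lambda in (1.36)
B = [1.0, 1.0, 1.0, 1.0]    # stands in for nu_15..nu_18 times a bounded gradient

q = contraction_factor(GAMMA, ETA)   # q < 1 certifies rho(GAMMA) < 1

phi = [1.0, 1.0, 1.0, 1.0]
bound = weighted_max_norm(phi, ETA)
history = []
for t in range(2000):
    history.append((weighted_max_norm(phi, ETA), bound))
    forcing = LAM ** t
    # phi^{t+1} = GAMMA phi^t + lambda^t B (equality used as the worst case)
    phi = [sum(g * p for g, p in zip(row, phi)) + forcing * b
           for row, b in zip(GAMMA, B)]
    # Scalar comparison recursion in the weighted norm
    bound = q * bound + forcing * weighted_max_norm(B, ETA)

final_norm = weighted_max_norm(phi, ETA)
```

Because $\Gamma$ is nonnegative and $\Gamma\eta \le q\eta$, the weighted max norm of $\Gamma\varphi$ is at most $q$ times that of $\varphi$, so the scalar recursion dominates the vector one and both decay geometrically at rate $\max\{q, \lambda\}$.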
Remark 1.18 Theorem 1.17 establishes that D-DNGT converges linearly to the global optimal solution provided that the largest step-size, $\hat{\alpha}$, and the maximum momentum coefficient, $\hat{\beta}$, respectively obey the upper bounds given in Theorem 1.14. Many existing works (the gradient tracking methods) [33, 35] and our previous works [43, 44] adopted non-uniform step-sizes and converged at a linear rate. Compared with [33, 35, 43, 44], this chapter has three advantages. First, D-DNGT incorporates gradient tracking into the distributed Nesterov method, which adds two types of momentum terms to improve information exchange and ensure fast convergence. Second, since the provided bounds on the largest step-size, $\hat{\alpha}$, in Theorem 1.14 depend only on the network topology and the cost functions,
each node can choose a relatively wider step-size. This is in contrast to the earlier work on non-uniform step-sizes within the framework of gradient tracking [33, 35, 43, 44], whose bounds depend on the heterogeneity of the step-sizes ($\|(I_n - W)\alpha\|_2 / \|W\alpha\|_2$, where $W$ is the weight matrix, in [35], and $\hat{\alpha}/\tilde{\alpha}$, with $\tilde{\alpha} = \min_{i \in V}\{\alpha_i\}$, in [33, 43, 44]). Besides, the analysis showed that the algorithms in [33, 35, 43, 44] could linearly converge to the optimal solution if and only if the heterogeneity and the largest step-size are small. However, the largest step-size follows a bound which is a function of the heterogeneity, and there is a trade-off between the tolerance of heterogeneity and the largest step-size that can be achieved. Finally, the bounds on the non-uniform step-sizes in this chapter allow some (but not all) of the step-sizes among the nodes to be zero, provided that the largest step-size is positive and sufficiently small.
1.4.4 Discussion

The idea of D-DNGT can be applied to other directed distributed gradient tracking methods to relax the condition that the weight matrices be only column-stochastic [41, 42] or both row- and column-stochastic [45, 46]. Next, three possible Nesterov-like optimization algorithms are presented. In this chapter, we only highlight and verify their feasibility by means of simulations. A rigorous theoretical analysis of the three possible algorithms is left for future work.

(a) D-DNGT with Only Column-Stochastic Weights [41, 42]. Here, we present an extended algorithm, named D-DNGT-C, obtained by applying the momentum terms to ADD-OPT [41]/Push-DIGing [42] (the weight matrices are only column-stochastic). Specifically, the updates of D-DNGT-C are stated as follows:

$$\begin{cases} x_i^{t+1} = \sum_{j=1}^{n} c_{ij} h_j^t + \beta_i (x_i^t - x_i^{t-1}) - \alpha_i z_i^t \\ h_i^{t+1} = x_i^{t+1} + \beta_i (x_i^{t+1} - x_i^t) \\ s_i^{t+1} = \sum_{j=1}^{n} c_{ij} s_j^t, \quad y_i^{t+1} = h_i^{t+1} / s_i^{t+1} \\ z_i^{t+1} = \sum_{j=1}^{n} c_{ij} z_j^t + \nabla f_i(y_i^{t+1}) - \nabla f_i(y_i^t), \end{cases} \tag{1.44}$$

initialized with $x_i^0 = h_i^0 = y_i^0 \in \mathbb{R}$, $s_i^0 = 1$, and $z_i^0 = \nabla f_i(y_i^0)$, where as before $C = [c_{ij}] \in \mathbb{R}^{n \times n}$ is column-stochastic, and $\alpha_i > 0$ and $\beta_i \ge 0$ represent the local step-size and the momentum coefficient of node $i$. Unlike ADD-OPT [41]/Push-DIGing [42], D-DNGT-C, by means of column-stochastic weights, adds two types of momentum terms (heavy-ball momentum and Nesterov momentum) to ensure that nodes acquire more information from in-neighbors in the network to achieve fast convergence.
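The updates (1.44) can be sketched for scalar quadratic costs. This is an illustrative sketch only, not the authors' simulation code: the costs $f_i(x) = \tfrac{1}{2} a_i (x - b_i)^2$, the 3-node matrix `C_MAT`, the step-sizes, the momentum coefficients, and the convention $x^{-1} = x^0$ are all assumptions made here.

```python
def run_ddngt_c(C, a, b, alpha, beta, x0, iters):
    """Iterate (1.44) for scalar costs f_i(x) = 0.5 * a_i * (x - b_i)**2."""
    n = len(a)

    def grad(i, y):
        return a[i] * (y - b[i])

    x_prev = list(x0)          # assumption: x^{-1} = x^0
    x, h, y = list(x0), list(x0), list(x0)
    s = [1.0] * n
    z = [grad(i, y[i]) for i in range(n)]
    for _ in range(iters):
        x_new = [sum(C[i][j] * h[j] for j in range(n))
                 + beta[i] * (x[i] - x_prev[i]) - alpha[i] * z[i]
                 for i in range(n)]
        h_new = [x_new[i] + beta[i] * (x_new[i] - x[i]) for i in range(n)]
        s_new = [sum(C[i][j] * s[j] for j in range(n)) for i in range(n)]
        y_new = [h_new[i] / s_new[i] for i in range(n)]
        z_new = [sum(C[i][j] * z[j] for j in range(n))
                 + grad(i, y_new[i]) - grad(i, y[i]) for i in range(n)]
        x_prev, x, h, s, y, z = x, x_new, h_new, s_new, y_new, z_new
    return y, s, z


# Hypothetical column-stochastic weights (each column sums to 1).
C_MAT = [[0.5, 0.0, 0.7],
         [0.5, 0.6, 0.0],
         [0.0, 0.4, 0.3]]
A_COEF = [1.0, 1.5, 2.0]
B_COEF = [1.0, -2.0, 3.0]
ALPHA = [0.05, 0.04, 0.03]     # non-uniform step-sizes (illustrative)
BETA = [0.05, 0.03, 0.04]      # momentum coefficients (illustrative)

y_final, s_final, z_final = run_ddngt_c(
    C_MAT, A_COEF, B_COEF, ALPHA, BETA, [0.0, 1.0, -1.0], 2000)

# Minimizer of sum_i f_i for these quadratics.
x_star = sum(ai * bi for ai, bi in zip(A_COEF, B_COEF)) / sum(A_COEF)
grad_sum = sum(A_COEF[i] * (y_final[i] - B_COEF[i]) for i in range(3))
```

Two properties follow directly from column-stochasticity and are easy to check: the auxiliary variables keep $\sum_i s_i^t = n$, and the trackers satisfy $\sum_i z_i^t = \sum_i \nabla f_i(y_i^t)$ at every iteration.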
(b) D-DNGT with Both Row- and Column-Stochastic Weights [45, 46]. Note that D-DNGT with both row- and column-stochastic weights does not need the eigenvector estimation used in D-DNGT (1.6) or D-DNGT-C (1.44). Hence, an extended algorithm (named D-DNGT-RC), which utilizes both row-stochastic ($R = [r_{ij}] \in \mathbb{R}^{n \times n}$) and column-stochastic ($C = [c_{ij}] \in \mathbb{R}^{n \times n}$) weights, is presented as follows:

$$\begin{cases} x_i^{t+1} = \sum_{j=1}^{n} r_{ij} y_j^t + \beta_i (x_i^t - x_i^{t-1}) - \alpha_i z_i^t \\ y_i^{t+1} = x_i^{t+1} + \beta_i (x_i^{t+1} - x_i^t) \\ z_i^{t+1} = \sum_{j=1}^{n} c_{ij} z_j^t + \nabla f_i(y_i^{t+1}) - \nabla f_i(y_i^t), \end{cases} \tag{1.45}$$

where $x_i^0 = y_i^0 \in \mathbb{R}$ and $z_i^0 = \nabla f_i(y_i^0)$, and $\alpha_i > 0$ and $\beta_i \ge 0$ represent the local step-size and the momentum coefficient of node $i$. D-DNGT-RC not only removes the additional iterations of eigenvector learning but also guarantees that nodes obtain more information from in-neighbors, which may yield faster convergence than [45] and [46].

(c) D-DNGT-RC with Interaction Delays [49]. Note that nodes may confront arbitrary but uniformly bounded interaction delays in the process of gaining information from in-neighbors [49]. Specifically, to solve problem (1.1), we denote by $\varsigma_{ij}^t$ (see footnote 5) an arbitrary, a priori unknown delay induced by the interaction link $(j, i)$ at time $t \ge 0$. Then, the updates of D-DNGT-RC with delays (D-DNGT-RC-D) become

$$\begin{cases} x_i^{t+1} = \sum_{j=1}^{n} r_{ij} y_j^{t - \varsigma_{ij}^t} + \beta_i (x_i^t - x_i^{t-1}) - \alpha_i z_i^t \\ y_i^{t+1} = x_i^{t+1} + \beta_i (x_i^{t+1} - x_i^t) \\ z_i^{t+1} = \sum_{j=1}^{n} c_{ij} z_j^{t - \varsigma_{ij}^t} + \nabla f_i(y_i^{t+1}) - \nabla f_i(y_i^t). \end{cases} \tag{1.46}$$

Remark 1.19 The time-varying implementation of D-DNGT is more straightforward on a broadcast-based mechanism or a random network, such as the related work in [46]. The asynchronous scheme can also follow the method in [19, 26, 35, 37]. In addition, it is also concluded from [61] that when D-DNGT is employed for optimizing more complex problems, such as deep neural networks, the gradient is usually replaced with the stochastic gradient, which yields the stochastic version of D-DNGT.

5 For all $t > 0$, the interaction delays $\varsigma_{ij}^t$ are assumed to be uniformly bounded. That is, there exists some finite $\hat{\varsigma} > 0$ such that $0 \le \varsigma_{ij}^t \le \hat{\varsigma}$. In addition, each node has access to its own estimate without delays, i.e., $\varsigma_{ii}^t = 0$, $\forall i \in V$ and $t > 0$.
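The updates (1.45) and (1.46) can be sketched in one routine, where a maximum delay of zero recovers D-DNGT-RC. This is a hedged illustration under assumptions made here, not the authors' implementation: the quadratic costs, the matrices `R_MAT` and `C_MAT`, the coefficients, the convention $x^{-1} = x^0$, the uniform random delay model, and the clamping of stale indices at time 0 are all hypothetical. No convergence claim is made for the delayed case.

```python
import math
import random


def run_ddngt_rc_d(R, C, a, b, alpha, beta, x0, iters, max_delay=0, seed=0):
    """Iterate (1.46) for f_i(x) = 0.5*a_i*(x - b_i)**2; max_delay=0 gives (1.45)."""
    n = len(a)
    rng = random.Random(seed)

    def grad(i, y):
        return a[i] * (y - b[i])

    x_prev, x = list(x0), list(x0)      # assumption: x^{-1} = x^0
    y_hist = [list(x0)]                 # y_hist[t] stores y^t
    z_hist = [[grad(i, x0[i]) for i in range(n)]]
    for t in range(iters):
        y, z = y_hist[t], z_hist[t]
        x_new, z_mix = [], []
        for i in range(n):
            acc_y = acc_z = 0.0
            for j in range(n):
                # Own values are never delayed; stale indices are clamped at 0.
                d = 0 if j == i else rng.randint(0, max_delay)
                k = max(0, t - d)
                acc_y += R[i][j] * y_hist[k][j]
                acc_z += C[i][j] * z_hist[k][j]
            x_new.append(acc_y + beta[i] * (x[i] - x_prev[i]) - alpha[i] * z[i])
            z_mix.append(acc_z)
        y_new = [x_new[i] + beta[i] * (x_new[i] - x[i]) for i in range(n)]
        z_new = [z_mix[i] + grad(i, y_new[i]) - grad(i, y[i]) for i in range(n)]
        x_prev, x = x, x_new
        y_hist.append(y_new)
        z_hist.append(z_new)
    return y_hist[-1], z_hist[-1]


# Hypothetical weights on the same 3-node directed graph.
R_MAT = [[0.5, 0.5, 0.0], [0.0, 0.6, 0.4], [0.7, 0.0, 0.3]]   # rows sum to 1
C_MAT = [[0.5, 0.4, 0.0], [0.0, 0.6, 0.7], [0.5, 0.0, 0.3]]   # columns sum to 1
A_COEF, B_COEF = [1.0, 1.5, 2.0], [1.0, -2.0, 3.0]
ALPHA, BETA = [0.05, 0.04, 0.03], [0.05, 0.03, 0.04]
X0 = [0.0, 1.0, -1.0]

x_star = sum(ai * bi for ai, bi in zip(A_COEF, B_COEF)) / sum(A_COEF)
y_rc, z_rc = run_ddngt_rc_d(R_MAT, C_MAT, A_COEF, B_COEF, ALPHA, BETA, X0, 2000)
y_del, z_del = run_ddngt_rc_d(R_MAT, C_MAT, A_COEF, B_COEF, ALPHA, BETA, X0,
                              2000, max_delay=2, seed=1)
delayed_is_finite = all(math.isfinite(v) for v in y_del + z_del)
```

The same delay draw is applied to both $y_j$ and $z_j$ on a link, matching the single symbol $\varsigma_{ij}^t$ in (1.46). Keeping the full history is the simplest way to index $t - \varsigma_{ij}^t$; a buffer of length $\hat{\varsigma} + 1$ would suffice in practice.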
1.5 Numerical Examples

This section provides a variety of numerical experiments to illustrate the application and performance of D-DNGT. The numerical experiments are divided into three parts.

(i) Without momentum terms: the convergence of D-DNGT is first compared with that of the methods without momentum terms, including FROST [52], AB [45], and ADD-OPT/Push-DIGing [41, 42].
(ii) With momentum terms: the convergence of D-DNGT is also compared with that of the methods with momentum terms, including ABm [48], ABN [50], and FROZEN [50].
(iii) Extensions of D-DNGT: in the final scenario, we compare the convergence of the extensions of D-DNGT (including D-DNGT-C, D-DNGT-RC, and D-DNGT-RC-D) with that of their closely related methods (including ADD-OPT/Push-DIGing [41, 42], AB [45], and AB with delays (AB-D) [49]). For the comparison with delays, let $\hat{\varsigma} = 6$ be the upper bound of the time-varying delays. At each time $t > 0$, the interaction delays imposed on each interaction link are randomly and uniformly selected from $\{0, 1, \ldots, 6\}$.

For parts (i)–(iii), we plot the residual $\log_{10}\left(\sum_{i=1}^{n} \|x_i^t - x^*\|_2 / \|x_i^0 - x^*\|_2\right)$ ($t$ is the discrete-time iteration) for comparison.

In the experiment, we consider a distributed binary classification problem using regularized logistic regression [48]. Specifically, the application of D-DNGT to the distributed logistic regression problem is considered over a directed network:

$$\min f(x, v) = \sum_{i=1}^{n} f_i(x, v),$$

where $x \in \mathbb{R}^p$ and $v \in \mathbb{R}$ are the optimization variables for learning the separating hyperplane. Here, the local cost function $f_i$ is given by

$$f_i(x, v) = \frac{\omega}{2} \left(\|x\|_2^2 + v^2\right) + \sum_{j=1}^{m_i} \ln\left(1 + \exp\left(-(c_{ij}^T x + v) b_{ij}\right)\right),$$

where each node $i \in \{1, \ldots, n\}$ privately knows $m_i$ training examples $(c_{ij}, b_{ij}) \in \mathbb{R}^p \times \{-1, +1\}$; $c_{ij}$ is the $p$-dimensional feature vector of the $j$-th training sample at the $i$-th node, drawn from a Gaussian distribution with zero mean, and $b_{ij}$ is the label, drawn from a Bernoulli distribution. In terms of parameter design, we choose $n = 10$, $m_i = 10$ for all $i$, and $p = 2$. The directed and strongly connected network topology is depicted in Fig. 1.1. In addition, we utilize a simple uniform weighting strategy, $r_{ij} = 1/|N_i^{in}|$, $\forall i$, to generate the row-stochastic weights.

The simulation results are plotted in Figs. 1.2, 1.3 and 1.4. Figure 1.2 indicates that D-DNGT with momentum terms accelerates convergence in comparison with the applicable algorithms without momentum terms. Figure 1.3 shows that D-DNGT with two momentum terms (heavy-ball momentum [48] and Nesterov momentum [50, 54, 55]) improves the convergence when compared with the applicable algorithms with a single momentum term. We note that although the eigenvector learning in D-DNGT may slow down convergence, D-DNGT is more suitable for broadcast-based protocols than the other optimization methods (AB, ADD-OPT/Push-DIGing, ABm, and ABN) because it only requires row-stochastic weights. Finally, it is concluded from Fig. 1.4 that the algorithms with momentum terms can successfully accelerate convergence regardless of whether the interaction links undergo interaction delays or the weight matrices are only column-stochastic or both row- and column-stochastic.

Fig. 1.1 A directed and strongly connected network

Fig. 1.2 Performance comparisons between D-DNGT and the methods without momentum terms (panel title: Comparison (i); vertical axis: Residual; horizontal axis: Time [step])
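The local logistic cost $f_i(x, v)$ and its gradient can be written down directly from the formula in this section. The following is a minimal sketch under assumptions made here: the regularization weight `OMEGA`, the random seed, the synthetic data, and all function names are hypothetical (the chapter does not report them); only $m_i = 10$ and $p = 2$ follow the text.

```python
import math
import random


def sigmoid(u):
    """Numerically stable logistic function 1 / (1 + exp(-u))."""
    if u >= 0.0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def local_cost(x, v, feats, labels, omega):
    """f_i(x, v) = omega/2 (|x|^2 + v^2) + sum_j ln(1 + exp(-(c_j^T x + v) b_j))."""
    value = 0.5 * omega * (sum(xk * xk for xk in x) + v * v)
    for c, b in zip(feats, labels):
        m = (sum(ck * xk for ck, xk in zip(c, x)) + v) * b
        # ln(1 + exp(-m)) evaluated without overflow
        value += max(-m, 0.0) + math.log1p(math.exp(-abs(m)))
    return value


def local_grad(x, v, feats, labels, omega):
    """Gradient of local_cost with respect to (x, v)."""
    gx = [omega * xk for xk in x]
    gv = omega * v
    for c, b in zip(feats, labels):
        m = (sum(ck * xk for ck, xk in zip(c, x)) + v) * b
        coef = -b * sigmoid(-m)     # derivative of ln(1 + exp(-m)) times b
        gx = [g + coef * ck for g, ck in zip(gx, c)]
        gv += coef
    return gx, gv


# Synthetic data for one node: m_i = 10 samples, p = 2 features.
rng = random.Random(0)
FEATS = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(10)]
LABELS = [rng.choice([-1.0, 1.0]) for _ in range(10)]
OMEGA = 0.1                          # assumed regularization weight

x_pt, v_pt, step = [0.3, -0.2], 0.1, 1e-6
gx, gv = local_grad(x_pt, v_pt, FEATS, LABELS, OMEGA)

# Central finite differences as an independent check of the gradient.
fd = []
for k in range(2):
    xp = list(x_pt); xp[k] += step
    xm = list(x_pt); xm[k] -= step
    fd.append((local_cost(xp, v_pt, FEATS, LABELS, OMEGA)
               - local_cost(xm, v_pt, FEATS, LABELS, OMEGA)) / (2 * step))
fd.append((local_cost(x_pt, v_pt + step, FEATS, LABELS, OMEGA)
           - local_cost(x_pt, v_pt - step, FEATS, LABELS, OMEGA)) / (2 * step))
fd_gap = max(abs(g - f) for g, f in zip(gx + [gv], fd))
```

A node running D-DNGT would evaluate `local_grad` at its own iterate $y_i^t$; stacking $(x, v)$ into one vector turns the problem into the form analyzed in this chapter.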
Fig. 1.3 Performance comparisons between D-DNGT and the methods with momentum terms (panel title: Comparison (ii); vertical axis: Residual; horizontal axis: Time [step])

Fig. 1.4 Performance comparisons between the extensions of D-DNGT and their closely related methods (panel title: Comparison (iii); vertical axis: Residual; horizontal axis: Time [step])
1.6 Conclusion

In this chapter, we have considered a general distributed optimization problem in which nodes aim to collectively optimize the average of all local cost functions. To solve this problem, a generalized directed distributed Nesterov-like gradient tracking algorithm, named D-DNGT, has been proposed and analyzed in detail. D-DNGT extends the distributed gradient tracking method with heavy-ball momentum and Nesterov momentum, allows nodes to select non-uniform step-sizes in a distributed manner, and only requires the weight matrix to be row-stochastic, which makes it suitable for a directed network. In particular, the directed network was assumed to be strongly connected. When the largest step-size and the maximum momentum coefficient are subject to some upper bounds (which rely only on the network topology and the cost functions), we have established a global linear convergence rate for D-DNGT at the expense of eigenvector learning, assuming strongly convex and smooth cost functions. In addition, some extensions of D-DNGT have also been explored. Simulation results further verified our theoretical analysis. However, D-DNGT is not flawless, and more in-depth research is needed to improve it. For example, D-DNGT is not suitable for dynamical networks, stochastic noises, or networks with random link failures and quantization effects. As future work, it would be valuable to extend D-DNGT to deal with a number of such problems, e.g., time-varying directed networks, stochastic noises, as well as networks with random link failures and quantization effects. Moreover, more complex optimization problems (asynchronous interaction, inequality constraints, etc.) are also worthy of study.
References 1. S. Yang, Q. Liu, J. Wang, Distributed optimization based on a multiagent system in the presence of communication delays. IEEE Trans. Syst., Man, Cybern., Syst. 47(5), 717–728 (2017) 2. J. Chen, A. Sayed, Diffusion adaptation strategies for distributed optimization and learning over networks. IEEE Trans. Signal Process. 60(8), 4289–4305 (2012) 3. K. Li, Q. Liu, S. Yang, J. Cao, G. Lu, Cooperative optimization of dual multiagent system for optimal resource allocation. IEEE Trans. Syst., Man, Cybern., Syst. 50(11), 4676–4687 (2020) 4. S. Wang, C. Li, Distributed robust optimization in networked system. IEEE Trans. Cybern. 47(8), 2321–2333 (2017) 5. X. Dong, G. Hu, Time-varying formation tracking for linear multi-agent systems with multiple leaders. IEEE Trans. Autom. Control 62(7), 3658–3664 (2017) 6. X. Dong, G. Hu, Time-varying formation control for general linear multi-agent systems with switching directed topologies. Automatica 73, 47–55 (2016) 7. C. Shi, G. Yang, Augmented Lagrange algorithms for distributed optimization over multi-agent networks via edge-based method. Automatica 94, 55–62 (2018) 8. S. Zhu, C. Chen, W. Li, B. Yang, X. Guan, Distributed state estimation of sensor-network systems subject to Markovian channel switching with application to a chemical process. IEEE Trans. Syst. Man Cybern. Syst. 48(6), 864–874 (2018)
9. D. Jakovetic, A unification and generalization of exact distributed first order methods. IEEE Trans. Signal Inform. Process. Over Netw. 5(1), 31–46 (2019) 10. Z. Wu, Z. Li, Z. Ding, Z. Li, Distributed continuous-time optimization with scalable adaptive event-based mechanisms. IEEE Trans. Syst. Man Cybern. Syst. 50(9), 3252–3257 (2020) 11. K. Scaman, F. Back, S. Bubeck, Y. Lee, L. Massoulie, Optimal algorithms for smooth and strongly convex distributed optimization in networks, in Proceedings of the 34th International Conference on Machine Learning (PMLR), vol. 70 (2017), pp. 3027–3036 12. X. He, T. Huang, J. Yu, C. Li, Y. Zhang, A continuous-time algorithm for distributed optimization based on multiagent networks. IEEE Trans. Syst. Man Cybern. Syst. 49(12), 2700–2709 (2019) 13. Y. Zhu, W. Ren, W. Yu, G. Wen, Distributed resource allocation over directed graphs via continuous-time algorithms. IEEE Trans. Syst. Man Cybern. Syst. 51(2), 1097–1106 (2021) 14. A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control 54(1), 48–61 (2009) 15. A. Nedic, A. Ozdaglar, P. Parrilo, Constrained consensus and optimization in multi-agent networks. IEEE Trans. Autom. Control 55(4), 922–938 (2010) 16. H. Li, S. Liu, Y. Soh, L. Xie, Event-triggered communication and data rate constraint for distributed optimization of multiagent systems. IEEE Trans. Syst. Man Cybern. Syst. 48(11), 1908–1919 (2018) 17. D. Yuan, Y. Hong, D. Ho, G. Jiang, Optimal distributed stochastic mirror descent for strongly convex optimization. Automatica 90, 196–203 (2018) 18. I. Matei, J. Baras, Performance evaluation of the consensus-based distributed subgradient method under random communication topologies. IEEE J. Sel. Topics Signal Process. 5(4), 754–771 (2011) 19. C. Xi, U. Khan, Distributed subgradient projection algorithm over directed graphs. IEEE Trans. Autom. Control 62(8), 3986–3992 (2017) 20. D. Yuan, D. Ho, G. 
Jiang, An adaptive primal-dual subgradient algorithm for online distributed constrained optimization. IEEE Trans. Cybern. 48(11), 3045–3055 (2018) 21. C. Li, P. Zhou, L. Xiong, Q. Wang, T. Wang, Differentially private distributed online learning, IEEE Trans. Knowl. Data Eng. 30(8), 1440–1453 (2018) 22. J. Zhu, C. Xu, J. Guan, D. Wu, Differentially private distributed online algorithms over timevarying directed networks. IEEE Trans. Signal Inform. Process. Over Netw. 4(1), 4–17 (2018) 23. W. Shi, Q. Ling, K. Yuan, G. Wu, W. Yin, On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Trans. Signal Process. 62(7), 1750–1761 (2014) 24. J. Mota, J. Xavier, P. Aguiar, M. Puschel, D-ADMM: a communication-efficient distributed algorithm for separable optimization. IEEE Trans. Signal Process. 61(10), 2718–2723 (2013) 25. H. Terelius, U. Topcu, R. Murray, Decentralized multi-agent optimization via dual decomposition. IFAC Proc. Volumes 44(1), 11245–11251 (2011) 26. E. Wei, A. Ozdaglar, On the O(1/k) convergence of asynchronous distributed alternating direction method of multipliers, in 2013 IEEE Global Conference on Signal and Information Processing (2013). https://doi.org/10.1109/GlobalSIP.2013.6736937 27. M. Hong, T. Chang, Stochastic proximal gradient consensus over random networks. IEEE Trans. Signal Process. 65(11), 2933–2948 (2017) 28. H. Xiao, Y. Yu, S. Devadas, On privacy-preserving decentralized optimization through alternating direction method of multipliers (2019). Preprint arXiv:1902.06101 29. A. Chen, A. Ozdaglar, A fast distributed proximal-gradient method, in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton) (2012). https:// doi.org/10.1109/Allerton.2012.6483273 30. X. Dong, Y. Hua, Y. Zhou, Z. Ren, Y. Zhong, Theory and experiment on formation-containment control of multiple multirotor unmanned aerial vehicle systems. IEEE Trans. Autom. Sci. Eng. 16(1), 229–240 (2019) 31. W. Shi, Q. Ling, G. 
Wu, W Yin, EXTRA: an exact first-order algorithm for decentralized consensus optimization. SIAM J. Optimi. 25(2), 944–966 (2015)
32. G. Qu, N. Li, Harnessing smoothness to accelerate distributed optimization. IEEE Trans. Control Netw. Syst. 5(3), 1245–1260 (2018) 33. A. Nedic, A. Olshevsky, W. Shi, C. Uribe, Geometrically convergent distributed optimization with uncoordinated step-sizes, in 2017 American Control Conference (ACC) (2017). https:// doi.org/10.23919/ACC.2017.7963560 34. M. Maros, J. Jalden, A geometrically converging dual method for distributed optimization over time-varying graphs. IEEE Trans. Autom. Control 66(6), 2465–2479 (2021) 35. J. Xu, S. Zhu, Y. Soh, L. Xie, Convergence of asynchronous distributed gradient methods over stochastic networks. IEEE Trans. Autom. Control 63(2), 434–448 (2018) 36. S. Pu, A. Nedic, Distributed stochastic gradient tracking methods. Math. Program. 187(1), 409–457 (2021) 37. Y. Tian, Y. Sun, B. Du, G. Scutari, ASY-SONATA: Achieving geometric convergence for distributed asynchronous optimization, in 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton) (2018). https://doi.org/10.1109/ALLERTON.2018. 8636055 38. M. Maros, J. Jalden, Panda: A dual linearly converging method for distributed optimization over time-varying undirected graphs, in 2018 IEEE Conference on Decision and Control (CDC) (2018). https://doi.org/10.1109/CDC.2018.8619626 39. A. Nedic, A. Olshevsky, Distributed optimization over time-varying directed graphs. IEEE Trans. Autom. Control 60(3), 601–615 (2015) 40. C. Xi, U. Khan, DEXTRA: a fast algorithm for optimization over directed graphs. IEEE Trans. Autom. Control 62(10), 4980–4993 (2017) 41. C. Xi, R. Xin, U. Khan, ADD-OPT: accelerated distributed directed optimization. IEEE Trans. Autom. Control 63(5), 1329–1339 (2018) 42. A. Nedic, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J. Optimi. 27(4), 2597–2633 (2017) 43. Q. Lü, H. Li, D. 
Xia, Geometrical convergence rate for distributed optimization with timevarying directed graphs and uncoordinated step-sizes. Inf. Sci. 422, 516–530 (2018) 44. Q. Lü, H. Li, Z. Wang, Q. Han, W. Ge, Performing linear convergence for distributed constrained optimisation over time-varying directed unbalanced networks. IET Control Theory Appl. 13(17), 2800–2810 (2019) 45. R. Xin, U. Khan, A linear algorithm for optimization over directed graphs with geometric convergence. IEEE Control Syst. Lett. 2(3), 315–320 (2018) 46. S. Pu, W. Shi, J. Xu, A. Nedic, Push-pull gradient methods for distributed optimization in networks. IEEE Trans. Autom. Control 66(1), 1–16 (2021) 47. F. Saadatniaki, R. Xin, U. Khan, Decentralized optimization over time-varying directed graphs with row and column-stochastic matrices. IEEE Trans. Autom. Control 65(11), 4769–4780 (2020) 48. R. Xin, U. Khan, Distributed heavy-ball: a generalization and acceleration of first-order methods with gradient tracking. IEEE Trans. Autom. Control 65(6), 2627–2633 (2020) 49. C. Zhao, X. Duan, Y. Shi, Analysis of consensus-based economic dispatch algorithm under time delays. IEEE Trans. Syst. Man Cybern. Syst. 50(8), 2978–2988 (2020) 50. R. Xin, D. Jakovetic, U. Khan, Distributed Nesterov gradient methods over arbitrary graphs. IEEE Signal Process. Lett. 26(8), 1247–1251 (2019) 51. C. Xi, V. Mai, E. Abed, U. Khan, Linear convergence in optimization over directed graphs with row-stochastic matrices. IEEE Trans. Autom. Control 63(10), 3558–3565 (2018) 52. R. Xin, C. Xi, U. Khan, FROST-Fast row-stochastic optimization with uncoordinated stepsizes. EURASIP J. Advanc. Signal Process. 2019(1), 1–14 (2019) 53. H. Li, Q. Lü, T. Huang, Convergence analysis of a distributed optimization algorithm with a general unbalanced directed communication network. IEEE Trans. Netw. Sci. Eng. 6(3), 237– 248 (2019) 54. G. Qu, N. Li, Accelerated distributed Nesterov gradient descent. IEEE Trans. Autom. Control 65(6), 2566–2581 (2020)
55. D. Jakovetic, J. Xavier, J. Moura, Fast distributed gradient methods. IEEE Trans. Autom. Control 59(5), 1131–1146 (2014) 56. H. Li, Q. Lü, X. Liao, T. Huang, Accelerated convergence algorithm for distributed constrained optimization under time-varying general directed graphs. IEEE Trans. Syst. Man Cybern. Syst. 50(7), 2612–2622 (2020) 57. Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (Springer Science & Business Media, Berlin, 2013) 58. H. Wang, X. Liao, T. Huang, C. Li, Cooperative distributed optimization in multiagent networks with delays. IEEE Trans. Syst. Man Cybern. Syst. 45(2), 363–369 (2015) 59. A. Defazio, On the curved geometry of accelerated optimization (2018). Preprint arXiv:1812.04634 60. R. Horn, C. Johnson, Matrix Analysis (Cambridge University Press, New York, 2013) 61. T. Yang, Q. Lin, Z. Li, Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization (2016). Preprint arXiv:1604.03257
Chapter 2
Projection Algorithms for Distributed Stochastic Optimization
Abstract This chapter introduces and solves a composite constrained convex optimization problem whose objective is a sum of smooth convex functions and non-smooth regularization terms ($\ell_1$ norm), subject to locally general constraints. Each smooth objective function is further regarded as the average of several constituent functions, a structure motivated by modern large-scale information processing problems in machine learning (the samples of a training dataset are randomly distributed across multiple computing nodes). We present a novel computation-efficient distributed stochastic gradient algorithm that combines the variance-reduction methodology with the distributed stochastic gradient projection method with constant step-size to solve the problem in a distributed manner. Theoretical analysis shows that the proposed algorithm finds the exact optimal solution in expectation when each (smooth) constituent function is strongly convex, provided that the constant step-size is less than an explicitly calculated upper bound. Compared with existing distributed methods, the proposed algorithm not only has a low computation cost in terms of the total number of local gradient evaluations but is also suited to general constrained optimization problems. Finally, numerical experiments are presented to demonstrate the attractive performance of the proposed algorithm.
Keywords Composite constrained optimization · Distributed stochastic algorithm · Computation-efficient · Variance reduction · Non-smooth term
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. Q. Lü et al., Distributed Optimization in Networked Systems, Wireless Networks, https://doi.org/10.1007/978-981-19-8559-1_2

2.1 Introduction

Given the limited computational and storage capacity of nodes, it has become unrealistic to handle large-scale tasks centrally on a single compute node [1]. Distributed optimization is a classic topic [2–9], yet it has recently attracted considerable interest in many emerging applications (large-scale tasks), such as parameter estimation [3, 4], network attacks [5], machine learning [6], and IoT networks [7]. At least two facts [8] have contributed to this resurgence of interest: (a) recent developments in high-performance computing platforms
have enabled us to employ distributed resources to significantly improve computational efficiency and (b) the size of datasets often far exceeds the storage capacity of a single machine, requiring coordination across multiple machines. In distributed optimization (without centralized coordination), each node is only allowed to interact with its neighbors through a locally connected network. In general, designing effective distributed algorithms for a wide range of optimization problems is more challenging [8–10].

Distributed optimization methods that depend only on gradient information have become a core interest in processing large-scale tasks due to their excellent scalability. Many known methods, including distributed gradient descent (DGD) [11, 12], dual averaging [13], EXTRA [14, 15], ADMM [16, 17], adaptive diffusion [18], gradient tracking [19–21], and methods for constrained optimization problems [22], have been studied in the literature. Moreover, quite a few efficient methods for dealing with various practical problems such as complex networks [23], privacy security [24], machine learning [25], online optimization [26], and power system operation [27, 28] have emerged.

More recently, significant effort has been made to design distributed methods for composite non-smooth optimization. For the composite optimization problem with global non-smooth terms, a fast distributed proximal gradient method that adopts Nesterov's acceleration is proposed in [29], which achieves accelerated convergence. Other work focuses on the situation where each node has a local non-smooth term that may differ from those of other nodes. For example, a proximal distributed linearized ADMM (DL-ADMM) method is provided in [30] to solve such composite problems, and its convergence is guaranteed. By extending EXTRA [14] to deal with local non-smooth terms, PG-EXTRA is proposed in [31], and an improved convergence result is established.
In addition, in comparison with PG-EXTRA [31], the NIDS method proposed in [32] can employ larger step-sizes while possessing the same convergence properties. In terms of convergence rate, however, the above distributed proximal gradient methods still exhibit a clear gap compared with centralized methods. Based on this, a linearly convergent proximal gradient method is proposed in [33], which successfully closes this gap.

With the advent of the big data era, the amount of data that nodes in the network need to process is becoming larger and more complicated [34]. Therefore, the above methods can be computationally very demanding, because each iteration of the algorithm requires a full gradient evaluation of the local objective functions [10–19, 21–24, 26, 29–33]. This may make these methods practically infeasible when dealing with large-scale tasks, mainly because the nodes in the network need to cope with large amounts of varied data. To avoid extensive computation and retain computational simplicity, one natural solution is to use the stochastic gradient (evaluated on a random subset of the local data) to approximate the true gradient, and thus distributed stochastic optimization methods have emerged [35]. Considerable work has been done on distributed stochastic optimization methods, including distributed stochastic gradient descent [35], stochastic gradient push [36], stochastic mirror descent [37], and stochastic
gradient tracking [38]. However, in practice, these methods converge slowly due to the large variance of the stochastic gradient and the adoption of a carefully tuned sequence of decaying step-sizes. To address this deficiency, various variance-reduction techniques have been leveraged in developing stochastic gradient descent methods, giving rise to representative centralized methods such as S2GD [39], SAG [40], SAGA [41], SVRG [42, 43], and SARAH [44]. The idea of the variance-reduction technique is to reduce the variance of the stochastic gradient and thereby substantially improve the convergence. Motivated by the centralized variance-reduced methods, distributed variance-reduced methods have been extensively studied, which outperform their centralized counterparts in handling large-scale tasks.

Of relevance to our work are the recent developments in [45] and [46]. The distributed stochastic averaging gradient method (DSA) proposed in [45] incorporates the variance-reduction technique of SAGA [41] into the algorithm design ideas of EXTRA [14], which not only achieves linear convergence in expectation for distributed stochastic optimization for the first time but also performs better than the previous works [14, 35] in dealing with machine learning problems. Similar works also include DSBA [47], diffusion-AVRG [48], ADFS [49], SAL-Edge [50], GT-SAGA/GT-SVRG [2, 51, 52], and Network-DANE [8], which utilize various strategies. However, to the best knowledge of the authors, no existing methods focus on solving general composite constrained convex optimization problems. Recently, the distributed neurodynamic-based consensus algorithm proposed in [46] was developed to solve the problem of a sum of smooth convex functions and $\ell_1$ norms subject to locally general constraints (linear equality, convex inequality, and bounded constraints), which generalizes the work in [53] to a wider class of objective functions and constraint conditions.
In particular, based on the Lyapunov stability theory, the method in [46] can achieve consensus at the global optimal solution with constant step-size. The work in [46] is insightful, but unfortunately, the algorithm does not take into account the high computational cost of evaluating the full gradient of the local objective function at each iteration.

In this chapter, we are concerned with solving the composite constrained convex optimization problem with a sum of smooth convex functions and non-smooth regularization terms ($\ell_1$ norm), where the smooth objective functions are further composed of the average of several constituent functions and the locally general constraints consist of linear equality, convex inequality, and bounded constraints. To this end, a computation-efficient distributed stochastic gradient algorithm is proposed, which is adaptable and facilitates real-world applications. In general, the novelties of the present work are summarized as follows:

(i) We propose and analyze a novel computation-efficient distributed stochastic gradient algorithm by leveraging the variance-reduction technique and the distributed stochastic gradient projection method with constant step-size. In contrast with most existing distributed methods [29–33, 45, 47–51, 53], the
proposed algorithm is capable of solving a class of composite non-smooth optimization problems subject to the locally general constraints.

(ii) The proposed algorithm outperforms the existing distributed methods [29–33, 46, 53] in terms of the total number of local gradient evaluations. In particular, at each iteration, the proposed algorithm only evaluates the gradient of one randomly selected constituent function and employs the unbiased stochastic average gradient (obtained by averaging the most recent stochastic gradients) to estimate the local gradients. Thus, the proposed algorithm greatly reduces the expense of full gradient evaluations.

(iii) If the constant step-size is less than an explicitly estimated upper bound, the proposed algorithm is proven to converge to the exact optimal solution in expectation when each constituent function is smooth and strongly convex. In the unconstrained case, we also propose a distributed stochastic proximal gradient algorithm by using the variance-reduction technique and study its convergence rate.
2.2 Preliminaries

2.2.1 Notation

Unless otherwise stated, the vectors mentioned in this chapter are column vectors. Let $\mathbb{R}$, $\mathbb{R}^n$, and $\mathbb{R}^{m \times n}$ denote the sets of real numbers, $n$-dimensional real column vectors, and $m \times n$ real matrices, respectively. The $n \times n$ identity matrix is denoted as $I_n$, and the column vectors of all ones and all zeros (of appropriate dimensions) are denoted as $\mathbf{1}$ and $\mathbf{0}$, respectively. A quantity (possibly a vector) of node $i$ is indexed by a subscript $i$; e.g., $x_i^k$ is the estimate of node $i$ at time $k$. We use $\chi_{\max}(A)$ and $\chi_{\min}(A)$ to represent the largest and the smallest eigenvalues of a real symmetric matrix $A$, respectively. The symbols $x^T$ and $A^T$ denote the transposes of a vector $x$ and a matrix $A$. The Euclidean norm (for vectors) and the $\ell_1$ norm are denoted as $\|\cdot\|$ and $\|\cdot\|_1$, respectively. We let $\|x\|_A = \sqrt{x^T A x}$, where $A \in \mathbb{R}^{n \times n}$ is a positive semi-definite matrix. The Kronecker product and the Cartesian product are represented by the symbols $\otimes$ and $\prod$, respectively. Given a random estimator $x$, the probability and expectation are represented by $\mathbb{P}[x]$ and $\mathbb{E}[x]$, respectively. We utilize $Z = \mathrm{diag}\{x\}$ to represent the diagonal matrix of the vector $x = [x_1, x_2, \dots, x_n]^T$, which satisfies $z_{ii} = x_i$, $\forall i = 1, \dots, n$, and $z_{ij} = 0$, $\forall i \neq j$. Denote $(\cdot)^+ = \max\{0, \cdot\}$.

For a set $\Omega \subseteq \mathbb{R}^d$, the projection of a vector $x \in \mathbb{R}^d$ onto $\Omega$ is denoted by $P_\Omega(x)$, i.e., $P_\Omega(x) = \arg\min_{y \in \Omega} \|y - x\|^2$. Notice that this projection always exists and is unique if $\Omega$ is nonempty, closed, and convex [53]. Moreover, let $\Omega$ be a nonempty closed convex set; then the projection operator $P_\Omega(\cdot)$ has the following properties: (a) $(y - P_\Omega(y))^T (P_\Omega(y) - x) \ge 0$ for any $x \in \Omega$ and $y \in \mathbb{R}^d$, and (b) $\|P_\Omega(y) - P_\Omega(x)\| \le \|y - x\|$ for any $x, y \in \mathbb{R}^d$.
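The two projection properties (a) and (b) can be checked numerically on a simple set. The following is a minimal sketch, assuming the set is a box $[-1, 1]^5$ (an illustrative choice, not one made in this chapter), for which the Euclidean projection reduces to componentwise clipping; the function name `project_box` and the random test points are hypothetical.

```python
import numpy as np

def project_box(x, lower, upper):
    """Euclidean projection of x onto the box {y : lower <= y <= upper}."""
    return np.clip(x, lower, upper)

rng = np.random.default_rng(0)
lower, upper = -1.0, 1.0                 # assumed box bounds (illustrative)
y = rng.normal(scale=3.0, size=5)        # arbitrary point in R^d
z = rng.normal(scale=3.0, size=5)        # second arbitrary point in R^d
x = rng.uniform(lower, upper, size=5)    # a point inside the box

py = project_box(y, lower, upper)
pz = project_box(z, lower, upper)

# Property (a): (y - P(y))^T (P(y) - x) >= 0 for every x in the set.
prop_a = float((y - py) @ (py - x))

# Property (b): ||P(y) - P(z)|| <= ||y - z|| (non-expansiveness).
lhs_b = float(np.linalg.norm(py - pz))
rhs_b = float(np.linalg.norm(y - z))
```

For a general closed convex set the projection usually has no closed form and must be computed by solving the minimization problem in the definition; the box is used here only because its projection is explicit.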
For any given vector $z \in \mathbb{R}^d$, denote $\phi(z)$ as another form of projection of the vector $z$, which satisfies $\phi(z) = [\psi(z_1), \dots, \psi(z_d)]^T \in \mathbb{R}^d$ with the elements such that, $\forall i = 1, \dots, d$,

$$\psi(z_i) = \begin{cases} 1, & \text{if } z_i > 0, \\ [-1, 1], & \text{if } z_i = 0, \\ -1, & \text{if } z_i < 0. \end{cases}$$
2.2.2 Model of Optimization Problem

Consider the composite constrained convex optimization problem in the following form:

$$\min_{\hat{x} \in \cap_{i=1}^{n} \Omega_i} J(\hat{x}) = \sum_{i=1}^{n} \left( f_i(\hat{x}) + \|P_i \hat{x} - q_i\|_1 \right), \quad \text{s.t. } B_i \hat{x} = c_i, \; D_i \hat{x} \le s_i, \; i = 1, \dots, n, \tag{2.1}$$

where $\hat{x} \in \mathbb{R}^d$ is the optimization estimator, $f_i(\hat{x})$ is the local objective function of node $i$, and $\|P_i \hat{x} - q_i\|_1$ is a non-smooth $\ell_1$-regularization term of node $i$; $P_i \in \mathbb{R}^{m_i \times d}$ ($m_i \ge 0$), $q_i \in \mathbb{R}^{m_i}$, $B_i \in \mathbb{R}^{w_i \times d}$ ($0 \le w_i < d$) has full row rank, $c_i \in \mathbb{R}^{w_i}$, $D_i \in \mathbb{R}^{a_i \times d}$ ($a_i \ge 0$), $s_i \in \mathbb{R}^{a_i}$, and $\Omega_i \subseteq \mathbb{R}^d$ is a nonempty and closed convex set. Moreover, each $f_i(\hat{x})$ is represented as the average of $e_i$ constituent functions, that is,

$$f_i(\hat{x}) = \frac{1}{e_i} \sum_{j=1}^{e_i} f_{i,j}(\hat{x}), \quad i = 1, \dots, n. \tag{2.2}$$
In addition, the main results of this chapter are based on the following assumptions.

Assumption 2.1 ([46]) The network $\mathcal{G}$ corresponding to the set of nodes is undirected and connected.

Assumption 2.2 ([45]) Each local constituent function $f_{i,j}$, $i \in \mathcal{V}$, $j \in \{1, \dots, e_i\}$, is $\nu$-smooth and $\mu$-strongly convex, where $\nu > \mu > 0$.

Remark 2.1 The formulated problem (2.1) with (2.2) is frequently encountered in machine learning (such as modern large-scale information processing problems, reinforcement learning problems, etc.), where large-scale training samples are randomly distributed across multiple computing nodes that collectively train a model $\hat{x} \in \mathbb{R}^d$ by utilizing the neighboring nodes' data. However, performing full gradient evaluation becomes prohibitively expensive when the local data batch at a
single computing node is very large, i.e., $e_i \gg 1$. Thus, designing a computation-efficient algorithm will have far-reaching implications.
2.2.3 Communication Network

In this chapter, we consider a group of $n$ nodes communicating over an undirected graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}, \mathcal{A}\}$ involving the node set $\mathcal{V} = \{1, 2, \dots, n\}$, the edge set $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$, and the adjacency matrix $\mathcal{A} = [a_{ij}] \in \mathbb{R}^{n \times n}$. If $(i, j) \in \mathcal{E}$, then node $i$ can directly exchange data with node $j$, and nodes $i$ and $j$ are neighbors of each other. The connection weight between nodes $i$ and $j$ in graph $\mathcal{G}$ satisfies $a_{ij} = a_{ji} > 0$ if $(i, j) \in \mathcal{E}$ and $a_{ij} = a_{ji} = 0$ otherwise. Without loss of generality, we assume that there is no self-connection in the graph, i.e., $a_{ii} = 0$ for all $i$. The degree of node $i \in \mathcal{V}$ is $d_i = \sum_{j=1, j \neq i}^{n} a_{ij}$, and the degree matrix $D_{\mathcal{G}} = \mathrm{diag}\{d_1, d_2, \dots, d_n\}$ is a diagonal matrix. The Laplacian matrix of graph $\mathcal{G}$ is $L_{\mathcal{G}} = D_{\mathcal{G}} - \mathcal{A}$, which is symmetric and positive semi-definite if the graph $\mathcal{G}$ is undirected. A path is a series of consecutive edges. If there is a path between any two nodes, then $\mathcal{G}$ is said to be connected.
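The construction of the degree and Laplacian matrices described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical 4-node path graph with unit weights (the chapter does not prescribe a topology). The connectivity check relies on the standard fact that an undirected graph is connected exactly when the second-smallest Laplacian eigenvalue is strictly positive.

```python
import numpy as np

# Hypothetical 4-node path graph 1-2-3-4 with unit weights (illustrative only).
A = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])

degrees = A.sum(axis=1)      # d_i = sum_{j != i} a_ij, since a_ii = 0
D_G = np.diag(degrees)       # degree matrix
L_G = D_G - A                # graph Laplacian

# eigvalsh returns real eigenvalues in ascending order for symmetric input.
eigvals = np.linalg.eigvalsh(L_G)
is_symmetric = bool(np.allclose(L_G, L_G.T))
is_psd = bool(eigvals[0] > -1e-10)
is_connected = bool(eigvals[1] > 1e-10)
```

Each row of the Laplacian sums to zero, so the all-ones vector always lies in its null space; this is the property that the consensus constraint in the reformulated problem builds on.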
2.3 Algorithm Development

In this section, a reformulation of problem (2.1) is first presented. Then, a computation-efficient distributed stochastic gradient algorithm is developed to solve the reformulated problem.
2.3.1 Problem Reformulation

Define $x$ as a vector that stacks all the local estimators $x_i$, $i \in \mathcal{V}$ (i.e., $x = \mathrm{vec}[x_1, \dots, x_n] \in \mathbb{R}^{nd}$). Let $P$, $B$, and $D$ be the block diagonal matrices of $P_1$ to $P_n$ (i.e., $P = \mathrm{blkdiag}\{P_1, \dots, P_n\} \in \mathbb{R}^{m \times nd}$), $B_1$ to $B_n$ (i.e., $B = \mathrm{blkdiag}\{B_1, \dots, B_n\} \in \mathbb{R}^{w \times nd}$), and $D_1$ to $D_n$ (i.e., $D = \mathrm{blkdiag}\{D_1, \dots, D_n\} \in \mathbb{R}^{a \times nd}$), respectively, where $m = \sum_{i=1}^{n} m_i$, $w = \sum_{i=1}^{n} w_i$, and $a = \sum_{i=1}^{n} a_i$. Denote $q = [q_1^T, \dots, q_n^T]^T \in \mathbb{R}^m$, $c = [c_1^T, \dots, c_n^T]^T \in \mathbb{R}^w$, $s = [s_1^T, \dots, s_n^T]^T \in \mathbb{R}^a$, $\Omega = \prod_{i=1}^{n} \Omega_i$, and $L = L_{\mathcal{G}} \otimes I_d$. Under Assumption 2.1, problem (2.1) can be equivalently reformulated as follows:

$$\min_{x \in \Omega} J(x) = \sum_{i=1}^{n} f_i(x_i) + \|P x - q\|_1, \quad \text{s.t. } Bx = c, \; Dx \le s, \; Lx = 0, \tag{2.3}$$

where $f_i(x_i) = (1/e_i) \sum_{j=1}^{e_i} f_{i,j}(x_i)$.
Remark 2.2 Note that the equality constraint $Lx = 0$ in (2.3) is equivalent to the condition $x_1 = x_2 = \dots = x_n$ if the undirected network is connected. It is worth highlighting that if $\hat{x}^*$ is the optimal solution to problem (2.1), then $x^* = \mathbf{1}_n \otimes \hat{x}^*$ is the optimal solution to problem (2.3). According to this observation, the main motivation of this chapter is to construct a computation-efficient algorithm that searches for the optimal solution to problem (2.3) over undirected and connected networks.

We notice that under Assumption 2.2, problem (2.1) has a unique optimal solution, denoted as $\hat{x}^*$. Therefore, problem (2.3) also has a unique optimal solution $x^* = \mathbf{1}_n \otimes \hat{x}^*$ under Assumption 2.1. By utilizing the Lagrangian function, the necessary and sufficient conditions for the optimality of problem (2.3) are given in the following lemma, whose proof follows directly from [46, 53].

Lemma 2.3 ([46, 53]) Let Assumptions 2.1 and 2.2 hold, and let $\eta > 0$ be a given scalar. Then, $x^*$ is an optimal solution to (2.3) if and only if there exist $\alpha^* \in \mathbb{R}^m$, $\beta^* \in \mathbb{R}^w$, $\lambda^* \in \mathbb{R}^a$, and $\gamma^* \in \mathbb{R}^{nd}$ such that $(x^*, \alpha^*, \beta^*, \lambda^*, \gamma^*)$ satisfies the following relations:

$$\begin{cases} x^* = P_\Omega\left[x^* - \eta\left(\nabla f(x^*) + P^T \alpha^* + B^T \beta^* + D^T \lambda^* + L \gamma^*\right)\right] \\ \alpha^* = \phi(\alpha^* + P x^* - q) \\ B x^* = c, \quad L x^* = 0 \\ \lambda^* = (\lambda^* + D x^* - s)^+, \end{cases} \tag{2.4}$$

where $P_\Omega : \mathbb{R}^{nd} \to \Omega$ and $\phi : \mathbb{R}^m \to [-1, 1]^m$ are two projection operators.

Remark 2.4 In view of Lemma 2.3, the first condition means that $x^*$ is a fixed point of the mapping induced by a projected gradient descent step, which keeps $x^*$ within the bounded constraint set. The second condition is a projection step for the non-smooth regularization term. The third and fourth conditions indicate primal feasibility, that is, satisfying the two equalities and one inequality in (2.3). Therefore, if the designed algorithm can reach a solution to (2.4), the optimal solution to (2.3) is obtained accordingly.
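The equivalence noted in Remark 2.2 between $Lx = 0$ and consensus can be illustrated numerically. The sketch below assumes a hypothetical fully connected 3-node graph with unit weights and $d = 2$; the values of `x_hat` are arbitrary and serve only to build a stacked consensus vector $\mathbf{1}_n \otimes \hat{x}$.

```python
import numpy as np

n, d = 3, 2
# Hypothetical triangle graph with unit weights (illustrative only).
A = np.ones((n, n)) - np.eye(n)
L_G = np.diag(A.sum(axis=1)) - A
L = np.kron(L_G, np.eye(d))                 # L = L_G (x) I_d, shape (n*d, n*d)

x_hat = np.array([0.5, -2.0])               # an arbitrary common value in R^d
x_consensus = np.kron(np.ones(n), x_hat)    # stacked vector 1_n (x) x_hat
x_disagree = x_consensus.copy()
x_disagree[0] += 1.0                        # perturb the first entry of node 1

res_consensus = float(np.linalg.norm(L @ x_consensus))   # zero at consensus
res_disagree = float(np.linalg.norm(L @ x_disagree))     # nonzero otherwise
```

The stacking order used by `np.kron(np.ones(n), x_hat)` places node 1's copy first, then node 2's, and so on, which matches $x = \mathrm{vec}[x_1, \dots, x_n]$ and the block structure of $L_{\mathcal{G}} \otimes I_d$.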
2.3.2 Computation-Efficient Distributed Stochastic Gradient Algorithm

In this subsection, inspired by the algorithm design ideas of the methods in [41, 46, 53], we introduce the proposed algorithm, named the computation-efficient distributed stochastic gradient algorithm, for solving problem (2.3). To motivate our algorithm design, we note that the existing distributed algorithms have suffered
from the high computation cost when evaluating the local full gradients in machine learning applications. This observation inspires us to explore a method that significantly improves computation efficiency. Therefore, the proposed algorithm leverages the variance-reduction technique of SAGA [41] and the distributed stochastic gradient projection method with constant step-size, which effectively alleviates the computational burden of local full gradient evaluation. The computation-efficient distributed stochastic gradient algorithm at each node $i$ is formally described in Algorithm 1.

To locally implement the estimators of Algorithm 1, each node $i$ must maintain a gradient table that stores all local constituent gradients $\nabla f_{i,j}(t_{i,j})$, where $t_{i,j}$ is the most recent estimator at which the constituent gradient $\nabla f_{i,j}$ was evaluated. At each iteration $k \ge 0$, each node $i$ selects uniformly at random one constituent function, indexed by $\chi_i^k \in \{1, \dots, e_i\}$, from its own local data batch and then generates the local stochastic gradient $g_i^k$ as in step 4 of Algorithm 1. After generating $g_i^k$, the entry $\nabla f_{i,\chi_i^k}(t_{i,\chi_i^k}^k)$ is replaced by the new constituent gradient $\nabla f_{i,\chi_i^k}(x_i^k)$, while the other entries remain the same. Then, the projection step of the estimator $x_i^k$ is implemented with the local stochastic gradient $g_i^k$, and the updates of the other estimators, $\alpha_i^k$, $\beta_i^k$, $\lambda_i^k$, and $\tilde{\gamma}_i^k$, are implemented subsequently.

Define $x^k = \mathrm{vec}[x_1^k, \dots, x_n^k]$, $\alpha^k = \mathrm{vec}[\alpha_1^k, \dots, \alpha_n^k]$, $\beta^k = \mathrm{vec}[\beta_1^k, \dots, \beta_n^k]$, $\lambda^k = \mathrm{vec}[\lambda_1^k, \dots, \lambda_n^k]$, $\tilde{\gamma}^k = \mathrm{vec}[\tilde{\gamma}_1^k, \dots, \tilde{\gamma}_n^k]$, and $g^k = \mathrm{vec}[g_1^k, \dots, g_n^k]$. Let $\tilde{\gamma}^k = L \gamma^k$, where $\gamma^k = \mathrm{vec}[\gamma_1^k, \dots, \gamma_n^k]$ and the initial condition satisfies $\tilde{\gamma}^0 = L \gamma^0$ [46, 53]. Then, we write Algorithm 1 in the following compact matrix form for the convenience of analysis:

$$\begin{cases} x^{k+1} = P_\Omega\left[x^k - \eta\left(g^k + P^T \phi(\alpha^k + P x^k - q) + L(\gamma^k + x^k) + B^T(\beta^k + B x^k - c) + D^T(\lambda^k + D x^k - s)^+\right)\right] \\ \alpha^{k+1} = \phi(\alpha^k + P x^{k+1} - q) \\ \beta^{k+1} = \beta^k + B x^{k+1} - c \\ \lambda^{k+1} = (\lambda^k + D x^{k+1} - s)^+ \\ \gamma^{k+1} = \gamma^k + x^{k+1}. \end{cases} \tag{2.5}$$
We notice here that the randomness of Algorithm 1 rests with the set of independent random variables $\{\chi_i^k\}_{i \in \{1, \dots, n\}}^{k \ge 0}$ used for calculating the local stochastic gradient $g_i^k$. Based on this, we utilize $\mathcal{F}^k$ to indicate the entire history of the dynamic system constructed by $\{\chi_i^{\tilde{k}}\}_{i \in \{1, \dots, n\}}^{\tilde{k} \le k-1}$. Therefore, from prior results in [41, 45, 51], we know that the local stochastic gradient $g_i^k$ calculated in step 4 of Algorithm 1 is an unbiased estimator of the local batch gradient $\nabla f_i(x_i^k)$. Specifically, when $\mathcal{F}^k$ is given, we have

$$\mathbb{E}\left[g_i^k \mid \mathcal{F}^k\right] = \nabla f_i\left(x_i^k\right). \tag{2.6}$$
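The unbiasedness property (2.6) can be verified exactly for a single node by averaging the step-4 gradient over all $e_i$ equally likely indices. The sketch below assumes hypothetical least-squares constituent functions $f_{i,j}(x) = \tfrac{1}{2}(a_j^T x - b_j)^2$ with random data; these are not strongly convex individually and are used only because unbiasedness does not depend on Assumption 2.2. All names (`A_data`, `b_data`, `grad_constituent`, `saga_gradient`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
e_i, d = 6, 3                        # number of constituent functions, dimension
A_data = rng.normal(size=(e_i, d))   # hypothetical feature vectors a_j
b_data = rng.normal(size=e_i)        # hypothetical targets b_j

def grad_constituent(j, x):
    """Gradient of the assumed f_{i,j}(x) = 0.5 * (a_j^T x - b_j)^2."""
    return (A_data[j] @ x - b_data[j]) * A_data[j]

x = rng.normal(size=d)               # current estimate x_i^k
t = rng.normal(size=(e_i, d))        # table points t_{i,j}^k (arbitrary here)
table = np.array([grad_constituent(j, t[j]) for j in range(e_i)])
table_mean = table.mean(axis=0)      # (1/e_i) * sum_j grad f_{i,j}(t_{i,j}^k)

def saga_gradient(j):
    """Step 4 of Algorithm 1 for a fixed index j playing the role of chi_i^k."""
    return grad_constituent(j, x) - table[j] + table_mean

# Exact conditional expectation: average over the e_i equally likely indices.
expected_g = np.mean([saga_gradient(j) for j in range(e_i)], axis=0)
full_grad = np.mean([grad_constituent(j, x) for j in range(e_i)], axis=0)
```

The two table-dependent terms cancel in expectation, which is why `expected_g` coincides with `full_grad` regardless of how stale the table points are.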
Algorithm 1 Computation-efficient distributed stochastic gradient algorithm at each node $i \in \mathcal{V}$
1: Initialization: Each node $i$ starts with $x_i^0 \in \mathbb{R}^d$, $\alpha_i^0 \in \mathbb{R}^{m_i}$, $\beta_i^0 \in \mathbb{R}^{w_i}$, $\lambda_i^0 \in \mathbb{R}^{a_i}$, $\tilde{\gamma}_i^0 \in \mathbb{R}^d$, and $t_{i,j}^0 \in \mathbb{R}^d$ for all $j \in \{1, \dots, e_i\}$.
2: for $k = 0, 1, 2, \dots$ do
3: Choose $\chi_i^k$ uniformly at random from $\{1, \dots, e_i\}$;
4: Calculate the local stochastic gradient as:
$$g_i^k = \nabla f_{i,\chi_i^k}\left(x_i^k\right) - \nabla f_{i,\chi_i^k}\left(t_{i,\chi_i^k}^k\right) + \frac{1}{e_i} \sum_{j=1}^{e_i} \nabla f_{i,j}\left(t_{i,j}^k\right).$$
5: If $j = \chi_i^k$, then store $\nabla f_{i,j}(t_{i,j}^{k+1}) = \nabla f_{i,j}(x_i^k)$ in the $\chi_i^k$-th gradient table position; else $\nabla f_{i,j}(t_{i,j}^{k+1}) = \nabla f_{i,j}(t_{i,j}^k)$.
6: Update estimator $x_i^{k+1}$ according to:
$$x_i^{k+1} = P_{\Omega_i}\left[x_i^k - \eta\left(g_i^k + P_i^T \phi_i(\alpha_i^k + P_i x_i^k - q_i) + B_i^T(\beta_i^k + B_i x_i^k - c_i) + D_i^T(\lambda_i^k + D_i x_i^k - s_i)^+ + \tilde{\gamma}_i^k + \sum_{j=1, j \neq i}^{n} a_{ij}(x_i^k - x_j^k)\right)\right],$$
where the step-size $\eta > 0$.
7: Update estimators $\alpha_i^{k+1}$, $\beta_i^{k+1}$, $\lambda_i^{k+1}$, and $\tilde{\gamma}_i^{k+1}$ as follows:
$$\begin{cases} \alpha_i^{k+1} = \phi_i(\alpha_i^k + P_i x_i^{k+1} - q_i) \\ \beta_i^{k+1} = \beta_i^k + B_i x_i^{k+1} - c_i \\ \lambda_i^{k+1} = (\lambda_i^k + D_i x_i^{k+1} - s_i)^+ \\ \tilde{\gamma}_i^{k+1} = \tilde{\gamma}_i^k + \sum_{j=1, j \neq i}^{n} a_{ij}(x_i^{k+1} - x_j^{k+1}). \end{cases}$$
8: end for

Remark 2.5 Algorithm 1 adopts the variance-reduction technique of SAGA [41] for gradient evaluation to avoid extensive computation. Although several existing distributed variance-reduced methods such as DSA [45], GT-SAGA/GT-SVRG [51], etc. have been investigated for various kinds of problems, it is worth highlighting that, by employing the variance-reduction technique, Algorithm 1 applies to a class of composite optimization problems that include the $\ell_1$-norm subject to locally linear equality, convex inequality, and bounded constraints.
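To make the data flow of Algorithm 1 concrete, the following is a minimal simulation sketch of a simplified special case, assuming $m_i = w_i = a_i = 0$ for every node (no $\ell_1$ term and no equality or inequality constraints), so that only the gradient table, the box projection, and the $\tilde{\gamma}_i$ update remain. The ring graph, the ridge-regularized least-squares constituent functions, the box $[-5, 5]^d$, and the step-size value are all hypothetical choices for illustration; in particular, the step-size is not derived from the bound in the analysis, and no convergence claim is made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, e = 4, 2, 5                  # nodes, dimension, constituent functions per node
eta, num_iters = 0.02, 200         # illustrative step-size and horizon (assumptions)
lo, hi = -5.0, 5.0                 # assumed box constraint Omega_i = [lo, hi]^d
mu = 0.1                           # assumed ridge weight

# Hypothetical ring graph with unit weights.
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

# Hypothetical local data; f_{i,j}(x) = 0.5*(a^T x - b)^2 + 0.5*mu*||x||^2.
feats = rng.normal(size=(n, e, d))
labels = rng.normal(size=(n, e))

def grad(i, j, x):
    """Gradient of the assumed constituent function f_{i,j} at x."""
    return (feats[i, j] @ x - labels[i, j]) * feats[i, j] + mu * x

def laplacian_term(x):
    """Row i equals sum_j a_ij * (x_i - x_j)."""
    return A.sum(axis=1, keepdims=True) * x - A @ x

x = np.zeros((n, d))               # x_i^0
gamma_tilde = np.zeros((n, d))     # tilde-gamma_i^0 = 0, consistent with gamma^0 = 0
# Gradient table initialised at t_{i,j}^0 = x_i^0; shape (n, e, d).
table = np.array([[grad(i, j, x[i]) for j in range(e)] for i in range(n)])

for k in range(num_iters):
    g = np.empty((n, d))
    for i in range(n):
        j = rng.integers(e)                                     # step 3
        new_grad = grad(i, j, x[i])
        g[i] = new_grad - table[i, j] + table[i].mean(axis=0)   # step 4
        table[i, j] = new_grad                                  # step 5
    # Step 6 without the P_i, B_i, D_i terms, followed by the box projection.
    x = np.clip(x - eta * (g + gamma_tilde + laplacian_term(x)), lo, hi)
    # Step 7, tilde-gamma update only (the other multipliers are absent here).
    gamma_tilde = gamma_tilde + laplacian_term(x)

consensus_error = float(np.linalg.norm(x - x.mean(axis=0)))
```

Each iteration evaluates one constituent gradient per node instead of all `e` of them, which is the computational saving the algorithm is designed for. Tracking `consensus_error` across iterations is one way to inspect the behavior empirically for a given step-size.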
2.4 Convergence Analysis

In this section, we first introduce several auxiliary results related to the stochastic gradient $g^k$, $k \ge 0$. Then, we design a Lyapunov function and derive upper bounds on two parts of the Lyapunov function to support the main results. Subsequently, we provide theoretical guarantees for the convergence behavior of the computation-efficient distributed stochastic gradient algorithm described in Algorithm 1 by using the Lyapunov method. Finally, for some special cases, we propose a distributed stochastic proximal gradient algorithm by using the variance-reduction technique and study its convergence rate.
2.4.1 Auxiliary Results

To establish the auxiliary results, we first define the auxiliary sequence $r^k = \sum_{i=1}^{n} r_i^k \in \mathbb{R}$, where

$$r_i^k = \frac{1}{e_i} \sum_{j=1}^{e_i} \left( f_{i,j}\left(t_{i,j}^k\right) - f_{i,j}(\hat{x}^*) - \nabla f_{i,j}(\hat{x}^*)^T \left(t_{i,j}^k - \hat{x}^*\right) \right). \tag{2.7}$$

Here, note that $f_{i,j}(t_{i,j}^k) - f_{i,j}(\hat{x}^*) - \nabla f_{i,j}(\hat{x}^*)^T (t_{i,j}^k - \hat{x}^*)$, $\forall k \ge 0$, is non-negative under Assumption 2.2, and consequently $r^k$, $\forall k \ge 0$, is non-negative as well. Based on the above, an expected upper bound for the distance between the stochastic gradient $g^k$ and the gradient $\nabla f(x^*)$ is introduced in the following part, which is deduced in many existing works [45, 47–51]. To simplify the notation, we denote $\mathbb{E}[\cdot] = \mathbb{E}[\cdot \mid \mathcal{F}^k]$ as the conditional expectation on $\mathcal{F}^k$ in the subsequent analysis.
Lemma 2.6 ([45]) Under Assumption 2.2 and the definition of $r^k$ in (2.7), the sequence generated by Algorithm 1 satisfies the following: $\forall k \ge 0$,

$$\mathbb{E}\left[\|g^k - \nabla f(x^*)\|^2\right] \le 4\nu r^k + 2(2\nu - \mu)\left(f(x^k) - f(x^*) - \nabla f(x^*)^T (x^k - x^*)\right). \tag{2.8}$$
Notice that if the sequence of iterates $x^k$ tends to the optimal solution $x^*$, then the auxiliary variables $t_{i,j}^k$ approach $\hat{x}^*$, which yields that $r^k$ converges to zero. This fact combined with the result in Lemma 2.6 indicates that the expected value of the distance between the stochastic gradient $g^k$ and the gradient $\nabla f(x^*)$ diminishes when $x^k$ tends to $x^*$.
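The claim that $r^k$ vanishes as the table points $t_{i,j}^k$ approach $\hat{x}^*$ can be made explicit with the standard two-sided bound implied by Assumption 2.2. The following worked step is a sketch using only $\nu$-smoothness and $\mu$-strong convexity of each $f_{i,j}$; it is included as a reading aid and is not taken from the cited proofs.

```latex
% For every constituent function and every t in R^d, Assumption 2.2 gives
\frac{\mu}{2}\,\|t - \hat{x}^*\|^2
  \;\le\; f_{i,j}(t) - f_{i,j}(\hat{x}^*) - \nabla f_{i,j}(\hat{x}^*)^T (t - \hat{x}^*)
  \;\le\; \frac{\nu}{2}\,\|t - \hat{x}^*\|^2 .

% Setting t = t_{i,j}^k, averaging over j and summing over i as in (2.7):
0 \;\le\; \sum_{i=1}^{n} \frac{\mu}{2 e_i} \sum_{j=1}^{e_i} \|t_{i,j}^k - \hat{x}^*\|^2
  \;\le\; r^k
  \;\le\; \sum_{i=1}^{n} \frac{\nu}{2 e_i} \sum_{j=1}^{e_i} \|t_{i,j}^k - \hat{x}^*\|^2 .
```

The left inequality shows the non-negativity of $r^k$, and the right inequality shows that $r^k \to 0$ whenever every $t_{i,j}^k \to \hat{x}^*$.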
2.4.2 Supporting Lemmas

In this subsection, we proceed to analyze the performance of Algorithm 1. To this end, we first define a Lyapunov function as

$$V(x^k, \alpha^k, \beta^k, \lambda^k, \gamma^k, r^k) = V_1(x^k) + \eta\left[V_2(\alpha^k) + V_3(\beta^k) + V_4(\lambda^k) + V_5(\gamma^k)\right] + b r^k, \tag{2.9}$$

where $V_1(x^k) = \|x^k - x^*\|^2$, $V_2(\alpha^k) = \|\alpha^k - \alpha^*\|^2$, $V_3(\beta^k) = \|\beta^k - \beta^*\|^2$, $V_4(\lambda^k) = \|\lambda^k - \lambda^*\|^2$, and $V_5(\gamma^k) = \|\gamma^k - \gamma^*\|_L^2$ for all $k \ge 0$; $\eta$ is the step-size
and $b$ is a positive constant which will be specified in the subsequent analysis. To simplify notation, we denote $V^k = V(x^k, \alpha^k, \beta^k, \lambda^k, \gamma^k, r^k)$ for all $k \ge 0$. Then, we give two crucial lemmas that provide upper bounds on two parts of the Lyapunov function to support the main results. The following lemma, which gives an expected upper bound for $r^k$, is essential to the subsequent convergence analysis. The concrete proof can be found in [45, 47–51].

Lemma 2.7 ([45]) Consider the sequence $\{r^k\}$ generated by Algorithm 1. Under Assumptions 2.1 and 2.2, we have the following inequality: $\forall k \ge 0$,

$$\mathbb{E}[r^{k+1}] \le \left(1 - \frac{1}{\hat{e}}\right) r^k + \frac{1}{\check{e}} \left(f(x^k) - f(x^*) - \nabla f(x^*)^T (x^k - x^*)\right), \tag{2.10}$$

where $\hat{e} = \max_{i \in \mathcal{V}}\{e_i\}$ and $\check{e} = \min_{i \in \mathcal{V}}\{e_i\}$.

Recalling the results in Lemma 2.3, we find that there exist $\alpha^* \in \mathbb{R}^m$, $\beta^* \in \mathbb{R}^w$, $\lambda^* \in \mathbb{R}^a$, and $\gamma^* \in \mathbb{R}^{nd}$ such that $(x^*, \alpha^*, \beta^*, \lambda^*, \gamma^*)$ satisfies the relations in (2.4). Based on this, the following key lemmas characterize the dynamics of the above five functions.
Lemma 2.8 Consider the sequence of iterations x k , α k , β k , λk , and γ k generated by Algorithm 1. Under Assumptions 2.1 and 2.2, ∀k ≥ 0, the following inequality holds for V1 (x k ): E V1 (x k+1 ) − V1 (x k ) ≤
4ην k 4η(ν − μ) r + f (x k ) − f (x ∗ ) − ∇f (x ∗ )T (x k − x ∗ ) a a k+1 − x k )T (∇f (x k+1 ) − ∇f (x k )) −(1 − aη)E ||x k+1 − x k ||2 +2ηE (x −2ηE (P x k+1 − q)T (φ(α k + P x k − q) − α ∗ ) −2ηE (Bx k+1 − c)T (β k − β ∗ + Bx k − c) − 2ηE (x k+1 )T L(γ k − γ ∗ + x k ) (2.11) −2ηE (Dx k+1 − s)T ((λk + Dx k − s)+ − λ∗ ) ,
where $a > 0$ is a tunable parameter.

Proof Denote $\psi_1^k = x^{k+1} = P_\Omega[x^k - \upsilon_1^k]$, where $\upsilon_1^k = \eta(g^k + P^T\phi(\alpha^k + Px^k - q) + B^T(\beta^k + Bx^k - c) + D^T(\lambda^k + Dx^k - s)_+ + L(\gamma^k + x^k))$. Recalling the definition of $V_1(x^k)$, one has

$$
\begin{aligned}
V_1(x^{k+1}) - V_1(x^k) &= \|\psi_1^k - x^*\|^2 - \|x^k - x^*\|^2 \\
&= -\|\psi_1^k - x^k\|^2 + 2(\psi_1^k - x^k)^T(\psi_1^k - x^*) \\
&= -\|\psi_1^k - x^k\|^2 + 2(\psi_1^k - x^k + \upsilon_1^k)^T(\psi_1^k - x^*) - 2(\upsilon_1^k)^T(\psi_1^k - x^*).
\end{aligned} \tag{2.12}
$$

From the projection property, we obtain that $(\psi_1^k - x^k + \upsilon_1^k)^T(\psi_1^k - x^*) = (P_\Omega[x^k - \upsilon_1^k] - (x^k - \upsilon_1^k))^T(P_\Omega[x^k - \upsilon_1^k] - x^*) \le 0$. Combining this with (2.12) yields

$$
\begin{aligned}
V_1(x^{k+1}) - V_1(x^k) \le{}& -\|x^{k+1} - x^k\|^2 - 2\eta(x^{k+1} - x^*)^T P^T\phi(\alpha^k + Px^k - q) \\
& - 2\eta(x^{k+1} - x^*)^T L(\gamma^k + x^k) - 2\eta(x^{k+1} - x^*)^T D^T(\lambda^k + Dx^k - s)_+ \\
& - 2\eta(x^{k+1} - x^*)^T B^T(\beta^k + Bx^k - c) - 2\eta(x^{k+1} - x^*)^T g^k.
\end{aligned} \tag{2.13}
$$
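The projection property invoked before (2.13) can be checked numerically. The following is a minimal sketch that assumes, purely for illustration, that $\Omega$ is a box (as in the bound constraints of Example 1); the helper names `project_box` and `projection_gap` are hypothetical and are not part of Algorithm 1.

```python
import numpy as np

def project_box(y, lo, hi):
    """Euclidean projection of y onto the box {x : lo <= x <= hi} (assumed form of Omega)."""
    return np.clip(y, lo, hi)

def projection_gap(y, x_star, lo, hi):
    """Return (P[y] - y)^T (P[y] - x_star); it is <= 0 whenever x_star lies in the box."""
    p = project_box(y, lo, hi)
    return float((p - y) @ (p - x_star))

rng = np.random.default_rng(0)
lo, hi = -2.0, 2.0
gaps = []
for _ in range(1000):
    y = rng.normal(scale=5.0, size=4)      # arbitrary point, plays the role of x^k - v_1^k
    x_star = rng.uniform(lo, hi, size=4)   # any point of the box, plays the role of x^*
    gaps.append(projection_gap(y, x_star, lo, hi))
max_gap = max(gaps)  # largest observed value of the inner product
```

Every sampled inner product is non-positive, which is exactly the sign condition that lets the middle term of (2.12) be dropped.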
Next, we analyze each cross-term in (2.13). Notice that

$$
\begin{aligned}
(x^{k+1} - x^*)^T g^k ={}& (x^{k+1} - x^k)^T(g^k - \nabla f(x^k)) + (x^{k+1} - x^*)^T\nabla f(x^k) + (x^k - x^*)^T(g^k - \nabla f(x^k)) \\
\ge{}& -\frac{a}{2}\|x^{k+1} - x^k\|^2 - \frac{1}{2a}\|g^k - \nabla f(x^k)\|^2 \\
& + (x^k - x^*)^T(g^k - \nabla f(x^k)) + (x^{k+1} - x^*)^T\nabla f(x^k),
\end{aligned} \tag{2.14}
$$

where $a > 0$ is an adjustable parameter. Recall from the convexity of $f(x)$ that $f(x^k) - f(x^*) \le (x^k - x^*)^T\nabla f(x^k)$ and $f(x^{k+1}) - f(x^k) \le (x^{k+1} - x^k)^T\nabla f(x^{k+1})$. Then, we get

$$
\begin{aligned}
(x^{k+1} - x^*)^T\nabla f(x^k) ={}& (x^k - x^*)^T\nabla f(x^k) + (x^{k+1} - x^k)^T\nabla f(x^k) \\
\ge{}& f(x^k) - f(x^*) + (x^{k+1} - x^k)^T(\nabla f(x^k) - \nabla f(x^{k+1})) + (x^{k+1} - x^k)^T\nabla f(x^{k+1}) \\
\ge{}& f(x^{k+1}) - f(x^*) + (x^{k+1} - x^k)^T(\nabla f(x^k) - \nabla f(x^{k+1})).
\end{aligned} \tag{2.15}
$$

Then, substituting (2.14) and (2.15) into (2.13) gives

$$
\begin{aligned}
V_1(x^{k+1}) - V_1(x^k) \le{}& -(1 - a\eta)\|x^{k+1} - x^k\|^2 - 2\eta(x^k - x^*)^T(g^k - \nabla f(x^k)) \\
& + \frac{\eta}{a}\|g^k - \nabla f(x^k)\|^2 - 2\eta(f(x^{k+1}) - f(x^*)) \\
& - 2\eta(x^{k+1} - x^*)^T P^T\phi(\alpha^k + Px^k - q) - 2\eta(x^{k+1} - x^*)^T L(\gamma^k + x^k) \\
& - 2\eta(x^{k+1} - x^*)^T B^T(\beta^k + Bx^k - c) - 2\eta(x^{k+1} - x^*)^T D^T(\lambda^k + Dx^k - s)_+ \\
& + 2\eta(x^{k+1} - x^k)^T(\nabla f(x^{k+1}) - \nabla f(x^k)).
\end{aligned} \tag{2.16}
$$
Construct a Lagrangian function as follows:

$$\Psi(x, \beta, \lambda, \gamma) = f(x) + \|Px - q\|_1 + \beta^T(Bx - c) + \lambda^T(Dx - s) + \gamma^T Lx. \tag{2.17}$$
According to the Saddle-Point Theorem [54], $(x^*, \beta^*, \lambda^*, \gamma^*)$ is a saddle point of $\Psi(x, \beta, \lambda, \gamma)$ in (2.17), that is, $\Psi(x^*, \beta, \lambda, \gamma) \le \Psi(x^*, \beta^*, \lambda^*, \gamma^*) \le \Psi(x, \beta^*, \lambda^*, \gamma^*)$. Thus, with regard to $x$, the point $x^*$ minimizes $\Psi(x, \beta^*, \lambda^*, \gamma^*)$, which further means that the variational inequality $(x - x^*)^T\varsigma^* \ge 0$ holds for all $x \in \Omega$, where $\varsigma^* \in \partial_x\Psi(x^*, \beta^*, \lambda^*, \gamma^*)$ is a subgradient of the Lagrangian function $\Psi$. Since $x^{k+1} \in \Omega$, one obtains that $(x^{k+1} - x^*)^T(\nabla f(x^*) + P^T\alpha^* + B^T\beta^* + D^T\lambda^* + L\gamma^*) \ge 0$. Then, by the convexity of $f(x)$, we further acquire

$$f(x^{k+1}) - f(x^*) + (x^{k+1} - x^*)^T(P^T\alpha^* + B^T\beta^* + D^T\lambda^* + L\gamma^*) \ge 0. \tag{2.18}$$

Notice that $Bx^* = c$ and $Lx^* = 0$. Then, combining (2.16) and (2.18) yields

$$
\begin{aligned}
V_1(x^{k+1}) - V_1(x^k) \le{}& -(1 - a\eta)\|x^{k+1} - x^k\|^2 - 2\eta(x^k - x^*)^T(g^k - \nabla f(x^k)) \\
& + \frac{\eta}{a}\|g^k - \nabla f(x^k)\|^2 - 2\eta(x^{k+1})^T L(\gamma^k - \gamma^* + x^k) \\
& - 2\eta(x^{k+1} - x^*)^T P^T(\phi(\alpha^k + Px^k - q) - \alpha^*) - 2\eta(Bx^{k+1} - c)^T(\beta^k - \beta^* + Bx^k - c) \\
& - 2\eta(x^{k+1} - x^*)^T D^T((\lambda^k + Dx^k - s)_+ - \lambda^*) \\
& + 2\eta(x^{k+1} - x^k)^T(\nabla f(x^{k+1}) - \nabla f(x^k)).
\end{aligned} \tag{2.19}
$$
Recall that $\alpha^* = \phi(\alpha^* + Px^* - q)$, where $\alpha^*$ is a solution to the variational inequality $(\alpha - \alpha^*)^T(-Px^* + q) \ge 0$, $\forall \alpha \in [-1, 1]^m$. Thus, $(\phi(\alpha^k + Px^k - q) - \alpha^*)^T(Px^* - q) \le 0$ holds because $\phi(\alpha^k + Px^k - q) \in [-1, 1]^m$. Then, we have

$$
\begin{aligned}
(x^{k+1} - x^*)^T P^T(\phi(\alpha^k + Px^k - q) - \alpha^*) ={}& (Px^{k+1} - q)^T(\phi(\alpha^k + Px^k - q) - \alpha^*) \\
& - (Px^* - q)^T(\phi(\alpha^k + Px^k - q) - \alpha^*) \\
\ge{}& (Px^{k+1} - q)^T(\phi(\alpha^k + Px^k - q) - \alpha^*).
\end{aligned} \tag{2.20}
$$
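Similarly, one has $((\lambda^k + Dx^k - s)_+ - \lambda^*)^T(Dx^* - s) \le 0$ owing to $(\lambda^k + Dx^k - s)_+ \ge 0$. Thus, we get

$$
\begin{aligned}
(x^{k+1} - x^*)^T D^T((\lambda^k + Dx^k - s)_+ - \lambda^*) ={}& (Dx^{k+1} - s)^T((\lambda^k + Dx^k - s)_+ - \lambda^*) \\
& - (Dx^* - s)^T((\lambda^k + Dx^k - s)_+ - \lambda^*) \\
\ge{}& (Dx^{k+1} - s)^T((\lambda^k + Dx^k - s)_+ - \lambda^*).
\end{aligned} \tag{2.21}
$$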
Moreover, by utilizing the standard variance decomposition $\mathbb{E}[\|a - \mathbb{E}[a]\|^2] = \mathbb{E}[\|a\|^2] - \|\mathbb{E}[a]\|^2$ with $a = g^k - \nabla f(x^*)$ together with (2.6), we obtain $\mathbb{E}[\|g^k - \nabla f(x^k)\|^2] = \mathbb{E}[\|g^k - \nabla f(x^*)\|^2] - \|\nabla f(x^k) - \nabla f(x^*)\|^2$. According to the strong convexity of the global function $f(x)$, it holds that $\|\nabla f(x^k) - \nabla f(x^*)\|^2 \ge 2\mu(f(x^k) - f(x^*) - \nabla f(x^*)^T(x^k - x^*))$. Thus, it follows from (2.8) that

$$\mathbb{E}[\|g^k - \nabla f(x^k)\|^2] \le 4\nu r^k + 4(\nu - \mu)\left(f(x^k) - f(x^*) - \nabla f(x^*)^T(x^k - x^*)\right). \tag{2.22}$$
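The variance decomposition used above holds exactly for any finite set of equally likely realizations, which is the situation when a component index is drawn uniformly from a finite sum. A small numerical sketch follows; the sample matrix is synthetic and its rows merely stand in for the realizations of a random vector $a$.

```python
import numpy as np

rng = np.random.default_rng(1)
# Each row is one equally likely realization of a random vector a (synthetic data).
samples = rng.normal(loc=[1.0, -2.0, 0.5], scale=[0.3, 1.0, 2.0], size=(5000, 3))

mean = samples.mean(axis=0)                                       # E[a]
lhs = np.mean(np.sum((samples - mean) ** 2, axis=1))              # E||a - E[a]||^2
rhs = np.mean(np.sum(samples ** 2, axis=1)) - np.sum(mean ** 2)   # E||a||^2 - ||E[a]||^2
```

The two quantities agree up to floating-point rounding, so subtracting the squared norm of the mean is always a valid way to pass from a second moment to a variance.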
Substituting (2.20)–(2.21) into (2.19), taking the conditional expectation on $\mathcal{F}^k$, and then combining with (2.22), we complete the proof of Lemma 2.8.

Lemma 2.9 Consider the sequences $x^k$, $\alpha^k$, $\beta^k$, $\lambda^k$, and $\gamma^k$ generated by Algorithm 1. Under Assumptions 2.1 and 2.2, the following inequality holds for $V_2(\alpha^k)$ for all $k \ge 0$:

$$\mathbb{E}[V_2(\alpha^{k+1})] - V_2(\alpha^k) \le 2\mathbb{E}[(Px^{k+1} - q)^T(\alpha^{k+1} - \alpha^*)] - \mathbb{E}[\|\alpha^{k+1} - \alpha^k\|^2]. \tag{2.23}$$
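Proof Denote $\psi_2^k = \alpha^{k+1} = \phi(\upsilon_2^k)$, where $\upsilon_2^k = \alpha^k + Px^{k+1} - q$. From the definition of $V_2(\alpha^k)$, one has

$$
\begin{aligned}
V_2(\alpha^{k+1}) - V_2(\alpha^k) &= \|\alpha^{k+1} - \alpha^*\|^2 - \|\alpha^k - \alpha^*\|^2 \\
&= -\|\psi_2^k - \alpha^k\|^2 + 2(\psi_2^k - \alpha^k)^T(\psi_2^k - \alpha^*) \\
&= -\|\psi_2^k - \alpha^k\|^2 + 2(\psi_2^k - \alpha^k - Px^{k+1} + q)^T(\psi_2^k - \alpha^*) + 2(Px^{k+1} - q)^T(\psi_2^k - \alpha^*).
\end{aligned} \tag{2.24}
$$

From the projection property, we obtain that $(\psi_2^k - \alpha^k - Px^{k+1} + q)^T(\psi_2^k - \alpha^*) = (\phi(\upsilon_2^k) - \upsilon_2^k)^T(\phi(\upsilon_2^k) - \alpha^*) \le 0$. Combining this with (2.24) and then taking the conditional expectation on $\mathcal{F}^k$, we complete the proof of Lemma 2.9.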
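The second equality in (2.24), like the corresponding step in the proofs for $V_1$ and $V_4$, rests on one elementary identity. As a worked step, with generic vectors $u$, $v$, $w$ (symbols introduced here only for illustration):

$$
\begin{aligned}
\|u - w\|^2 - \|v - w\|^2 &= \|u - w\|^2 - \|(u - w) - (u - v)\|^2 \\
&= \|u - w\|^2 - \left(\|u - w\|^2 - 2(u - v)^T(u - w) + \|u - v\|^2\right) \\
&= -\|u - v\|^2 + 2(u - v)^T(u - w).
\end{aligned}
$$

Taking $u = \psi_2^k$, $v = \alpha^k$, and $w = \alpha^*$ recovers the second line of (2.24).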
Lemma 2.10 Consider the sequences $x^k$, $\alpha^k$, $\beta^k$, $\lambda^k$, and $\gamma^k$ generated by Algorithm 1. Under Assumptions 2.1 and 2.2, the following inequality holds for $V_3(\beta^k)$ for all $k \ge 0$:

$$\mathbb{E}[V_3(\beta^{k+1})] - V_3(\beta^k) - \mathbb{E}[\|Bx^{k+1} - Bx^k\|^2] \le 2\mathbb{E}[(Bx^{k+1} - c)^T(\beta^k - \beta^* + Bx^k - c)] - \|Bx^k - c\|^2. \tag{2.25}$$
Proof Since $\beta^{k+1} = \beta^k + Bx^{k+1} - c$, from the definition of $V_3(\beta^k)$, it follows that

$$
\begin{aligned}
V_3(\beta^{k+1}) - V_3(\beta^k) &= \|\beta^{k+1} - \beta^*\|^2 - \|\beta^k - \beta^*\|^2 \\
&= \|\beta^k - \beta^* + Bx^{k+1} - c\|^2 - \|\beta^k - \beta^*\|^2 \\
&= (Bx^{k+1} - c)^T(2(\beta^k - \beta^*) + Bx^{k+1} - c) \\
&= 2(Bx^{k+1} - c)^T(\beta^k - \beta^*) + \|Bx^{k+1} - Bx^k\|^2 \\
&\quad + 2(Bx^{k+1} - c + c - Bx^k)^T(Bx^k - c) + \|Bx^k - c\|^2 \\
&= 2(Bx^{k+1} - c)^T(\beta^k - \beta^* + Bx^k - c) + \|Bx^{k+1} - Bx^k\|^2 - \|Bx^k - c\|^2.
\end{aligned} \tag{2.26}
$$

Furthermore, taking the conditional expectation on $\mathcal{F}^k$ yields the result of Lemma 2.10. This completes the proof.

Lemma 2.11 Consider the sequences $x^k$, $\alpha^k$, $\beta^k$, $\lambda^k$, and $\gamma^k$ generated by Algorithm 1. Under Assumptions 2.1 and 2.2, the following inequality holds for $V_4(\lambda^k)$ for all $k \ge 0$:

$$\mathbb{E}[V_4(\lambda^{k+1})] - V_4(\lambda^k) \le 2\mathbb{E}[(Dx^{k+1} - s)^T(\lambda^{k+1} - \lambda^*)] - \mathbb{E}[\|\lambda^{k+1} - \lambda^k\|^2]. \tag{2.27}$$
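Proof Denote $\psi_3^k = \lambda^{k+1} = (\upsilon_3^k)_+$, where $\upsilon_3^k = \lambda^k + Dx^{k+1} - s$. From the definition of $V_4(\lambda^k)$, one has

$$
\begin{aligned}
V_4(\lambda^{k+1}) - V_4(\lambda^k) &= \|\lambda^{k+1} - \lambda^*\|^2 - \|\lambda^k - \lambda^*\|^2 \\
&= \|\psi_3^k - \lambda^*\|^2 - \|\lambda^k - \lambda^*\|^2 \\
&= -\|\psi_3^k - \lambda^k\|^2 + 2(\psi_3^k - \lambda^k)^T(\psi_3^k - \lambda^*) \\
&= -\|\psi_3^k - \lambda^k\|^2 + 2(\psi_3^k - \lambda^k - Dx^{k+1} + s)^T(\psi_3^k - \lambda^*) + 2(Dx^{k+1} - s)^T(\psi_3^k - \lambda^*).
\end{aligned} \tag{2.28}
$$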
From the projection property, we obtain that $(\psi_3^k - \lambda^k - Dx^{k+1} + s)^T(\psi_3^k - \lambda^*) = ((\upsilon_3^k)_+ - \upsilon_3^k)^T((\upsilon_3^k)_+ - \lambda^*) \le 0$. Combining this with (2.28) and then taking the conditional expectation on $\mathcal{F}^k$, we complete the proof of Lemma 2.11.

Lemma 2.12 Consider the sequences $x^k$, $\alpha^k$, $\beta^k$, $\lambda^k$, and $\gamma^k$ generated by Algorithm 1. Under Assumptions 2.1 and 2.2, the following inequality holds for $V_5(\gamma^k)$ for all $k \ge 0$:

$$\mathbb{E}[V_5(\gamma^{k+1})] - V_5(\gamma^k) - \mathbb{E}[(x^{k+1} - x^k)^T L(x^{k+1} - x^k)] \le 2\mathbb{E}[(x^{k+1})^T L(\gamma^k - \gamma^* + x^k)] - (x^k)^T L x^k. \tag{2.29}$$
Proof Since $\gamma^{k+1} = \gamma^k + x^{k+1}$, from the definition of $V_5(\gamma^k)$, it follows that

$$
\begin{aligned}
V_5(\gamma^{k+1}) - V_5(\gamma^k) &= (\gamma^{k+1} - \gamma^*)^T L(\gamma^{k+1} - \gamma^*) - (\gamma^k - \gamma^*)^T L(\gamma^k - \gamma^*) \\
&= (\gamma^k - \gamma^* + x^{k+1})^T L(\gamma^k - \gamma^* + x^{k+1}) - (\gamma^k - \gamma^*)^T L(\gamma^k - \gamma^*) \\
&= 2(x^{k+1})^T L(\gamma^k - \gamma^*) + (x^{k+1})^T L x^{k+1} \\
&= 2(x^{k+1})^T L(\gamma^k - \gamma^*) + (x^{k+1} - x^k)^T L(x^{k+1} - x^k) + 2(x^{k+1})^T L x^k - (x^k)^T L x^k \\
&= 2(x^{k+1})^T L(\gamma^k - \gamma^* + x^k) + (x^{k+1} - x^k)^T L(x^{k+1} - x^k) - (x^k)^T L x^k.
\end{aligned} \tag{2.30}
$$

Then, taking the conditional expectation on $\mathcal{F}^k$, we get the result of Lemma 2.12. This completes the proof.
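The chain of equalities in (2.30) is purely algebraic and relies only on the symmetry of $L$, so it can be sanity-checked numerically. The sketch below assumes a ring graph with scalar local variables ($d = 1$) purely for illustration; `ring_laplacian` and `v5` are hypothetical helper names.

```python
import numpy as np

def ring_laplacian(n):
    """Graph Laplacian of an undirected ring with n >= 3 nodes (assumed topology)."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = 1.0
        A[(i + 1) % n, i] = 1.0
    return np.diag(A.sum(axis=1)) - A

def v5(gamma, gamma_star, L):
    """V_5(gamma) = (gamma - gamma*)^T L (gamma - gamma*)."""
    diff = gamma - gamma_star
    return float(diff @ L @ diff)

rng = np.random.default_rng(2)
n = 6
L = ring_laplacian(n)
gamma_k, gamma_star, x_k, x_next = (rng.normal(size=n) for _ in range(4))
gamma_next = gamma_k + x_next  # dual update used in the proof of Lemma 2.12

lhs = v5(gamma_next, gamma_star, L) - v5(gamma_k, gamma_star, L)
dx = x_next - x_k
rhs = 2 * x_next @ L @ (gamma_k - gamma_star + x_k) + dx @ L @ dx - x_k @ L @ x_k
```

Because the identity holds for arbitrary vectors, taking the conditional expectation afterwards does not require any property of the stochastic gradient.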
2.4.3 Main Results

In Theorem 2.13, we show that Algorithm 1 converges under an appropriate step-size $\eta$ and constant $b$ by combining the results in Lemmas 2.7–2.12. Before presenting Theorem 2.13, we first set the constant

$$b \in \left(\frac{4\eta\hat{e}\nu}{a},\ \frac{2\hat{e}\eta\chi_{\min}(B^TB)}{\nu} - \frac{4\hat{e}\nu\eta(\nu - \mu)}{a\nu}\right)$$

and the tunable parameter

$$a \in \left(\frac{4\eta\hat{e}\nu^2 + 4\hat{e}\nu\eta(\nu - \mu)}{2\hat{e}\eta\chi_{\min}(B^TB)},\ +\infty\right).$$

Theorem 2.13 Suppose that Assumptions 2.1 and 2.2 hold. Considering the computation-efficient distributed stochastic gradient algorithm described in Algorithm 1, if the constant step-size $\eta$ is selected from the interval

$$\left(0,\ \frac{1}{\chi_{\max}(aI_{nd} + 2\nu I_{nd} + B^TB + L + 2D^TD + 2P^TP)}\right),$$
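then the estimator $x_i^k$, $i \in \mathcal{V}$, converges to the global optimal solution $\hat{x}^*$ in expectation.

Proof By the Lyapunov function (2.9) and the results in Lemmas 2.8–2.12, we obtain that

$$\mathbb{E}[V^{k+1}] - V^k \le \Xi_1 + \Xi_2 + \Xi_3 + \Xi_4 + \Xi_5, \tag{2.31}$$

where

$$
\begin{aligned}
\Xi_1 &= 2\eta\mathbb{E}[(x^{k+1} - x^k)^T(\nabla f(x^{k+1}) - \nabla f(x^k))] - \eta\|Bx^k - c\|^2 - (1 - a\eta)\mathbb{E}[\|x^{k+1} - x^k\|^2], \\
\Xi_2 &= 2\eta\mathbb{E}[(Px^{k+1} - q)^T(\alpha^{k+1} - \phi(\alpha^k + Px^k - q))] - \eta\mathbb{E}[\|\alpha^{k+1} - \alpha^k\|^2], \\
\Xi_3 &= 2\eta\mathbb{E}[(Dx^{k+1} - s)^T(\lambda^{k+1} - (\lambda^k + Dx^k - s)_+)] - \eta\mathbb{E}[\|\lambda^{k+1} - \lambda^k\|^2], \\
\Xi_4 &= b\left(\mathbb{E}[r^{k+1}] - r^k\right) + \eta\mathbb{E}[\|x^{k+1} - x^k\|^2_{B^TB}], \\
\Xi_5 &= \frac{4\eta\nu}{a} r^k + \frac{4\eta(\nu - \mu)}{a}\left(f(x^k) - f(x^*) - \nabla f(x^*)^T(x^k - x^*)\right) + \eta\mathbb{E}[\|x^{k+1} - x^k\|^2_L] - \eta\|x^k\|^2_L.
\end{aligned}
$$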
Next, we derive an upper bound for each term on the right side of inequality (2.31). From the smoothness of the global objective function $f(x)$, one obtains that $(x^{k+1} - x^k)^T(\nabla f(x^{k+1}) - \nabla f(x^k)) \le \nu\|x^{k+1} - x^k\|^2$. Thus, the first term $\Xi_1$ is bounded by

$$\Xi_1 \le -(1 - a\eta - 2\eta\nu)\mathbb{E}[\|x^{k+1} - x^k\|^2] - \eta\|x^k - x^*\|^2_{B^TB}, \tag{2.32}$$

where inequality (2.32) follows from the equality $\|Bx^k - c\|^2 = \|Bx^k - Bx^*\|^2 = (x^k - x^*)^TB^TB(x^k - x^*) = \|x^k - x^*\|^2_{B^TB}$. From the projection property, we have that $\mathbb{E}[(Px^{k+1} - q)^T(\alpha^{k+1} - \phi(\alpha^k + Px^k - q))] \le \mathbb{E}[\|x^{k+1} - x^k\|^2_{P^TP}] + \mathbb{E}[(Px^k - q)^T(\alpha^{k+1} - \phi(\alpha^k + Px^k - q))]$. Therefore, the second term $\Xi_2$ is bounded by

$$
\begin{aligned}
\Xi_2 \le{}& 2\eta\mathbb{E}[(\alpha^{k+1} - \alpha^k)^T(\alpha^{k+1} - \phi(\alpha^k + Px^k - q))] \\
& - 2\eta\mathbb{E}[(\alpha^{k+1} - (\alpha^k + Px^k - q))^T(\alpha^{k+1} - \phi(\alpha^k + Px^k - q))] \\
& - \eta\mathbb{E}[\|\alpha^{k+1} - \alpha^k\|^2] + 2\eta\mathbb{E}[\|x^{k+1} - x^k\|^2_{P^TP}].
\end{aligned} \tag{2.33}
$$

Consider that $\mathbb{E}[(\alpha^{k+1} - (\alpha^k + Px^k - q))^T(\alpha^{k+1} - \phi(\alpha^k + Px^k - q))] \ge \mathbb{E}[\|\alpha^{k+1} - \phi(\alpha^k + Px^k - q)\|^2]$. In conjunction with (2.33), it follows that

$$
\begin{aligned}
\Xi_2 \le{}& 2\eta\mathbb{E}[\|x^{k+1} - x^k\|^2_{P^TP}] - \eta\|\alpha^k - \phi(\alpha^k + Px^k - q)\|^2 - \eta\mathbb{E}[\|\alpha^{k+1} - \phi(\alpha^k + Px^k - q)\|^2] \\
\le{}& 2\eta\mathbb{E}[\|x^{k+1} - x^k\|^2_{P^TP}] - \eta\|\alpha^k - \phi(\alpha^k + Px^k - q)\|^2.
\end{aligned} \tag{2.34}
$$
Following the same procedure as for $\Xi_2$, we obtain the bound of $\Xi_3$ as

$$
\begin{aligned}
\Xi_3 \le{}& 2\eta\mathbb{E}[\|x^{k+1} - x^k\|^2_{D^TD}] - \eta\|\lambda^k - (\lambda^k + Dx^k - s)_+\|^2 - \eta\mathbb{E}[\|\lambda^{k+1} - (\lambda^k + Dx^k - s)_+\|^2] \\
\le{}& 2\eta\mathbb{E}[\|x^{k+1} - x^k\|^2_{D^TD}] - \eta\|\lambda^k - (\lambda^k + Dx^k - s)_+\|^2.
\end{aligned} \tag{2.35}
$$

Then, with the result presented in Lemma 2.7, we further get

$$\Xi_4 \le -\frac{b}{\hat{e}} r^k + \frac{b}{\check{e}}\left(f(x^k) - f(x^*) - \nabla f(x^*)^T(x^k - x^*)\right) + \eta\mathbb{E}[\|x^{k+1} - x^k\|^2_{B^TB}]. \tag{2.36}$$

Moreover, according to the smoothness of the global objective function $f(x)$, it follows that $f(x^k) - f(x^*) - \nabla f(x^*)^T(x^k - x^*) \le \frac{\nu}{2}\|x^k - x^*\|^2$. Hence, we have

$$\Xi_4 + \Xi_5 \le -\left(\frac{b}{\hat{e}} - \frac{4\eta\nu}{a}\right) r^k + \left(\frac{\nu b}{2\check{e}} + \frac{2\nu\eta(\nu - \mu)}{a}\right)\|x^k - x^*\|^2 + \eta\mathbb{E}[\|x^{k+1} - x^k\|^2_{B^TB + L}] - \eta\|x^k\|^2_L. \tag{2.37}$$
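Substituting (2.32), (2.34), (2.35), and (2.37) into the right side of (2.31) and rearranging the obtained terms, one gets

$$
\begin{aligned}
\mathbb{E}[V^{k+1}] - V^k \le{}& -\mathbb{E}\left[\|x^{k+1} - x^k\|^2_{I_{nd} - \eta(aI_{nd} + 2\nu I_{nd} + B^TB + L + 2D^TD + 2P^TP)}\right] \\
& - \|x^k - x^*\|^2_{\eta B^TB - \left(\frac{\nu b}{2\check{e}} + \frac{2\nu\eta(\nu - \mu)}{a}\right)I_{nd}} - \left(\frac{b}{\hat{e}} - \frac{4\eta\nu}{a}\right) r^k \\
& - \eta\|\lambda^k - (\lambda^k + Dx^k - s)_+\|^2 - \eta\|x^k\|^2_L - \eta\|\alpha^k - \phi(\alpha^k + Px^k - q)\|^2.
\end{aligned} \tag{2.38}
$$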
To ensure that the Lyapunov function $V^k$ decreases monotonically over the iterations $k$, it suffices to prove that

$$\mathbb{E}[V^{k+1}] - V^k < 0. \tag{2.39}$$

The inequality (2.39) holds provided that the terms subtracted on the right side of (2.38) are non-negative and not all zero. Therefore, the following conditions should be satisfied: (a) the matrices $I_{nd} - \eta(aI_{nd} + 2\nu I_{nd} + B^TB + L + 2D^TD + 2P^TP)$ and $\eta B^TB - \left(\frac{\nu b}{2\hat{e}} + \frac{2\nu\eta(\nu - \mu)}{a}\right)I_{nd}$ should be positive definite and (b) $(b/\hat{e}) - (4\eta\nu/a) > 0$. The results of Theorem 2.13 are then achieved.

Remark 2.14 Theorem 2.13 indicates that, even when the stochastic gradient $g^k$ is used, Algorithm 1 is guaranteed to solve the composite constrained convex optimization problem (2.1), provided that the conditions on $a$, $b$, and $\eta$ are satisfied and the assumptions on the objective functions and the communication network hold. However, an explicit convergence rate is not established in Theorem 2.13 because of the general local constraints, which consist of linear equality, convex inequality, and bounded constraints. For composite nonsmooth problems, although global linear convergence of distributed proximal gradient methods has been proved in the recent work [33], there is still no analysis of the global linear convergence of primal–dual methods similar to the proposed algorithm. This issue requires further study. We therefore rely on simulations to explore possible results.
2.4.4 Discussion

In this subsection, by adopting the proximal operator, we show that a distributed variance-reduction algorithm performs similarly to the centralized method, in terms of the convergence rate, on a special case of problem (2.1) in which no constraints are involved and the nonsmooth terms are identical [33, 55]. In particular, this special case of problem (2.1) can be equivalently reformulated as follows:

$$\min_{x} J(x) = \sum_{i=1}^{n}\left(f_i(x_i) + \|\tilde{P}x_i - \tilde{q}\|_1\right), \quad \text{s.t. } W^{\frac{1}{2}}x = 0, \tag{2.40}$$
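where $x = \mathrm{vec}[x_1, \ldots, x_n] \in \mathbb{R}^{nd}$, $\tilde{P} \in \mathbb{R}^{\tilde{m} \times d}$ ($\tilde{m} > 0$), $\tilde{q} \in \mathbb{R}^{\tilde{m}}$, $f_i(x_i) = (1/e_i)\sum_{j=1}^{e_i} f_{i,j}(x_i)$, $W = (1/2)(I_{nd} - A \otimes I_d)$, and $A$ is a primitive, symmetric, and doubly-stochastic matrix. For convenience, we define $h(x) = \sum_{i=1}^{n}\|\tilde{P}x_i - \tilde{q}\|_1$. To solve problem (2.40), we propose the following recursion, inspired by the design ideas of the methods in [33, 55]:

$$
\begin{cases}
z^{k+1} = x^k - \eta g^k - W^{\frac{1}{2}}x^k - W^{\frac{1}{2}}\gamma^k \\
\gamma^{k+1} = \gamma^k + W^{\frac{1}{2}}z^{k+1} \\
x^{k+1} = \mathrm{prox}_{\eta h}(z^{k+1}),
\end{cases} \tag{2.41}
$$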
where $\mathrm{prox}_{\eta h}(z^{k+1}) = \arg\min_y\{h(y) + (1/2\eta)\|y - z^{k+1}\|^2\}$ is the proximal operator at the auxiliary variable $z^{k+1} \in \mathbb{R}^{nd}$. The next proposition establishes the linear convergence of algorithm (2.41).

Proposition 2.15 Suppose that Assumptions 2.1 and 2.2 hold. If $\gamma^0 = 0_{nd}$ and the constant step-size $\eta$ is small enough, algorithm (2.41) guarantees that the estimator $x^k$ converges linearly to the global optimal solution $x^*$ in expectation.

Proof It follows from [33] that a fixed point of algorithm (2.41) satisfies

$$
\begin{cases}
z^* = x^* - \eta\nabla f(x^*) - W^{\frac{1}{2}}x^* - W^{\frac{1}{2}}\gamma^* \\
0 = W^{\frac{1}{2}}z^* \\
x^* = \mathrm{prox}_{\eta h}(z^*),
\end{cases} \tag{2.42}
$$

where $(x^*, \gamma^*, z^*)$ is a fixed point of (2.41). Define $\hat{x}^k = x^k - x^*$, $\hat{\gamma}^k = \gamma^k - \gamma^*$, and $\hat{z}^k = z^k - z^*$. Then, subtracting (2.42) from (2.41), we obtain the following error recursions:

$$
\begin{cases}
\hat{z}^{k+1} = \hat{x}^k - \eta(g^k - \nabla f(x^*)) - W^{\frac{1}{2}}\hat{x}^k - W^{\frac{1}{2}}\hat{\gamma}^k \\
\hat{\gamma}^{k+1} = \hat{\gamma}^k + W^{\frac{1}{2}}\hat{z}^{k+1} \\
\hat{x}^{k+1} = \mathrm{prox}_{\eta h}(z^{k+1}) - \mathrm{prox}_{\eta h}(z^*).
\end{cases} \tag{2.43}
$$

Note from the definition of the matrix $W$ that $W$ is symmetric and its singular values are in $[0, 1)$ (i.e., $\rho_{\min} = 0 < \rho \le \rho_{\max} < 1$, where $\rho$ is the minimum nonzero singular value of $W$). From the first equality in (2.43), one gets

$$
\begin{aligned}
\|\hat{z}^{k+1}\|^2 ={}& \|\hat{x}^k - \eta(g^k - \nabla f(x^*)) - W^{\frac{1}{2}}\hat{x}^k\|^2 + \|W^{\frac{1}{2}}\hat{\gamma}^k\|^2 \\
& - 2(\hat{\gamma}^k)^T W^{\frac{1}{2}}\left(\hat{x}^k - \eta(g^k - \nabla f(x^*)) - W^{\frac{1}{2}}\hat{x}^k\right).
\end{aligned} \tag{2.44}
$$
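From the second equality in (2.43), one gets

$$
\begin{aligned}
\|\hat{\gamma}^{k+1}\|^2 ={}& \|\hat{\gamma}^k\|^2 + \|W^{\frac{1}{2}}\hat{z}^{k+1}\|^2 - 2\|W^{\frac{1}{2}}\hat{\gamma}^k\|^2 \\
& + 2(\hat{\gamma}^k)^T W^{\frac{1}{2}}\left(\hat{x}^k - \eta(g^k - \nabla f(x^*)) - W^{\frac{1}{2}}\hat{x}^k\right).
\end{aligned} \tag{2.45}
$$

Combining (2.45) with (2.44) and noting that $\|W^{\frac{1}{2}}\hat{\gamma}^k\|^2 \ge \rho\|\hat{\gamma}^k\|^2$, we have

$$\|\hat{z}^{k+1}\|^2_{I_{nd} - W} + \|\hat{\gamma}^{k+1}\|^2 \le (1 - \rho)\|\hat{\gamma}^k\|^2 + \|\hat{x}^k - \eta(g^k - \nabla f(x^*)) - W^{\frac{1}{2}}\hat{x}^k\|^2. \tag{2.46}$$

Then, we further obtain that

$$
\begin{aligned}
\|\hat{x}^k - \eta(g^k - \nabla f(x^*)) - W^{\frac{1}{2}}\hat{x}^k\|^2 \le{}& \left(1 - 2\eta\frac{\nu\mu}{\nu + \mu}\right)\|\hat{x}^k\|^2 + \frac{\eta^2}{1 - \rho_{\max}}\|g^k - \nabla f(x^*)\|^2 \\
& - \frac{4\eta\mu}{\nu + \mu}\left(f(x^k) - f(x^*) - \nabla f^T(x^*)\hat{x}^k\right) - 2\eta(\hat{x}^k)^T(g^k - \nabla f(x^k)),
\end{aligned} \tag{2.47}
$$

where we have used the inequalities $(\hat{x}^k)^T(\nabla f(x^k) - \nabla f(x^*)) \ge \frac{\nu\mu}{\nu + \mu}\|\hat{x}^k\|^2 + \frac{1}{\nu + \mu}\|\nabla f(x^k) - \nabla f(x^*)\|^2$, $\|\nabla f(x^k) - \nabla f(x^*)\|^2 \ge 2\mu(f(x^k) - f(x^*) - \nabla f^T(x^*)\hat{x}^k)$, and $\|\eta(g^k - \nabla f(x^*)) - W^{\frac{1}{2}}\hat{x}^k\|^2 \le \frac{\eta^2}{1 - \rho_{\max}}\|g^k - \nabla f(x^*)\|^2 + (\hat{x}^k)^T W^{\frac{1}{2}}\hat{x}^k$ to derive (2.47). Substituting (2.47) into (2.46) and taking the conditional expectation, we obtain that

$$
\begin{aligned}
\mathbb{E}[\|\hat{z}^{k+1}\|^2_{I_{nd} - W}] + \mathbb{E}[\|\hat{\gamma}^{k+1}\|^2] \le{}& \left(1 - 2\eta\frac{\nu\mu}{\nu + \mu}\right)\|\hat{x}^k\|^2 + \frac{\eta^2}{1 - \rho_{\max}}\mathbb{E}[\|g^k - \nabla f(x^*)\|^2] \\
& - \frac{4\eta\mu}{\nu + \mu}\left(f(x^k) - f(x^*) - \nabla f^T(x^*)\hat{x}^k\right) + (1 - \rho)\|\hat{\gamma}^k\|^2.
\end{aligned} \tag{2.48}
$$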
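According to Lemmas 2.6 and 2.7, we can further get

$$
\begin{aligned}
\mathbb{E}[\|\hat{z}^{k+1}\|^2_{I_{nd} - W}] + \mathbb{E}[\|\hat{\gamma}^{k+1}\|^2] + b_1\eta\mathbb{E}[r^{k+1}] \le{}& \left(1 - 2\eta\frac{\nu\mu}{\nu + \mu}\right)\|\hat{x}^k\|^2 + \left(1 - \frac{b_1}{\hat{e}} + \frac{4\nu\eta}{1 - \rho_{\max}}\right)\eta r^k \\
& - \left(\frac{4\mu}{\nu + \mu} - \frac{2(2\nu - \mu)\eta}{1 - \rho_{\max}} - \frac{b_1}{\check{e}}\right)\eta s^k + (1 - \rho)\|\hat{\gamma}^k\|^2,
\end{aligned} \tag{2.49}
$$

where $b_1$ is a positive constant that will be specified next and we set $s^k = f(x^k) - f(x^*) - \nabla f(x^*)^T(x^k - x^*)$. If $0 < \eta \le \frac{(4\mu\check{e} - b_1(\nu + \mu))(1 - \rho_{\max})}{2\check{e}(\nu + \mu)(2\nu - \mu)}$ and $b_1 < \frac{4\mu\check{e}}{\nu + \mu}$, (2.49) further implies that

$$
\begin{aligned}
\mathbb{E}[\|\hat{z}^{k+1}\|^2_{I_{nd} - W}] + \mathbb{E}[\|\hat{\gamma}^{k+1}\|^2] + b_1\eta\mathbb{E}[r^{k+1}] \le{}& \left(1 - 2\eta\frac{\nu\mu}{\nu + \mu}\right)\|\hat{x}^k\|^2 + \left(1 - \frac{b_1}{\hat{e}} + \frac{4\nu\eta}{1 - \rho_{\max}}\right)\eta r^k \\
& + (1 - \rho)\|\hat{\gamma}^k\|^2.
\end{aligned} \tag{2.50}
$$

Furthermore, it follows from the third equality in (2.43) and the nonexpansiveness of the proximal operator that

$$\|\hat{x}^{k+1}\| = \|\mathrm{prox}_{\eta h}(z^{k+1}) - \mathrm{prox}_{\eta h}(z^*)\| \le \|\hat{z}^{k+1}\|, \tag{2.51}$$

and therefore

$$
\begin{aligned}
(1 - \rho_{\max})\mathbb{E}[\|\hat{x}^{k+1}\|^2] + \mathbb{E}[\|\hat{\gamma}^{k+1}\|^2] + b_1\eta\mathbb{E}[r^{k+1}] \le{}& \left(\frac{1}{1 - \rho_{\max}} - \frac{2\eta\nu\mu}{(\nu + \mu)(1 - \rho_{\max})}\right)(1 - \rho_{\max})\|\hat{x}^k\|^2 \\
& + \left(\frac{1}{b_1} - \frac{1}{\hat{e}} + \frac{4\nu\eta}{(1 - \rho_{\max})b_1}\right)b_1\eta r^k + (1 - \rho)\|\hat{\gamma}^k\|^2.
\end{aligned} \tag{2.52}
$$

Then, if $0 < \eta \le \min\left\{\frac{\nu + \mu}{2\nu\mu}, \frac{(b_1 + b_1\hat{e} - \hat{e})(1 - \rho_{\max})}{4\nu\hat{e}}\right\}$ and $b_1 > \frac{\hat{e}}{1 + \hat{e}}$, there must exist a positive constant $\theta = \max\left\{\frac{1}{1 - \rho_{\max}} - \frac{2\eta\nu\mu}{(\nu + \mu)(1 - \rho_{\max})},\ \frac{1}{b_1} - \frac{1}{\hat{e}} + \frac{4\nu\eta}{(1 - \rho_{\max})b_1},\ 1 - \rho\right\}$ such that

$$(1 - \rho_{\max})\mathbb{E}[\|\hat{x}^{k+1}\|^2] + \mathbb{E}[\|\hat{\gamma}^{k+1}\|^2] + b_1\eta\mathbb{E}[r^{k+1}] \le \theta\left((1 - \rho_{\max})\|\hat{x}^k\|^2 + b_1\eta r^k + \|\hat{\gamma}^k\|^2\right), \tag{2.53}$$

where $0 < \theta < 1$. From here, the proof is similar to that of Theorem 1 in [33], which suffices to obtain the desired result.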
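The proximal step in recursion (2.41) and the nonexpansiveness used in (2.51) can be illustrated in the simplest setting. The sketch below assumes $\tilde{P} = I_d$ and $\tilde{q} = 0$, so that $h$ reduces to a weighted $\ell_1$ norm and its proximal operator is componentwise soft-thresholding; for a general $\tilde{P}$ no such closed form is claimed here. The function name `prox_l1` and the weight `kappa` are hypothetical.

```python
import numpy as np

def prox_l1(z, eta, kappa=1.0):
    """Proximal operator of eta * kappa * ||.||_1 (soft-thresholding).

    Assumes the nonsmooth term is a plain weighted l1 norm (P~ = I, q~ = 0).
    """
    return np.sign(z) * np.maximum(np.abs(z) - eta * kappa, 0.0)

rng = np.random.default_rng(3)
eta = 0.1
worst_ratio = 0.0
for _ in range(1000):
    u = rng.normal(scale=2.0, size=8)
    v = rng.normal(scale=2.0, size=8)
    num = np.linalg.norm(prox_l1(u, eta) - prox_l1(v, eta))
    den = np.linalg.norm(u - v)
    worst_ratio = max(worst_ratio, num / den)  # never exceeds 1 for a nonexpansive map
```

The observed ratio stays at or below one, which is the property that lets the error in $x$ be bounded by the error in $z$.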
2.5 Numerical Examples

In this section, two numerical examples are provided to examine the convergence and the practical behavior of the proposed algorithms. All the simulations are carried out in MATLAB on an HP 288 Pro G4 MT Business PC with an Intel(R) Core(TM) i7-8700 processor (3.2 GHz) and 8 GB of memory. For the sake of comparison, the optimal solutions to the following examples are obtained by running a centralized method with proper step-sizes for a sufficiently long time.
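2.5.1 Example 1: Performance Examination

First, the proposed algorithms are applied to solve a general distributed minimization problem, which is described as follows:

$$
\begin{aligned}
\min\ & \sum_{i=1}^{n}\left(\frac{1}{e_i}\sum_{j=1}^{e_i}\|C_{i,j}\hat{x} - b_{i,j}\|^2 + \|P_i\hat{x} - q_i\|_1\right), \\
\text{s.t.}\ & \hat{x}_1 + \hat{x}_2 + \hat{x}_3 + \hat{x}_4 = 3, \\
& \hat{x}_1 - \hat{x}_2 + \hat{x}_3 - \hat{x}_4 \le 2, \\
& -2 \le \hat{x}_i \le 2,\ i = 1, \ldots, 4,
\end{aligned} \tag{2.54}
$$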
where $\hat{x} \in \mathbb{R}^4$, $C_{i,j} \in \mathbb{R}^{1 \times 4}$, $P_i \in \mathbb{R}^{1 \times 4}$, $b_{i,j} \in \mathbb{R}$, and $q_i \in \mathbb{R}$ for all $i, j$. Let $n = 10$ and $e_i = 10$ for all $i$. The components of $C_{i,j}$, $b_{i,j}$, $P_i$, and $q_i$ are randomly selected in $[0, 2]$, $[-4, 4]$, $[0, 2]$, and $[-4, 4]$, respectively. The communication among the 10 nodes is modeled as a ring network. Node $i$ is assigned the $i$th objective function $f_i(\hat{x}) + \|P_i\hat{x} - q_i\|_1$, where $f_i(\hat{x}) = (1/e_i)\sum_{j=1}^{e_i}\|C_{i,j}\hat{x} - b_{i,j}\|^2$ with the constituent functions $f_{i,j}(\hat{x}) = \|C_{i,j}\hat{x} - b_{i,j}\|^2$, $j = 1, \ldots, e_i$. In the simulation, the constant step-size $\eta$ is set to 0.04 and the initial conditions ($x_i^0$, $\alpha_i^0$, $\beta_i^0$, $\lambda_i^0$, and $\gamma_i^0$) are randomly generated in Algorithm 1. The simulation results are as follows:

(1) Figure 2.1 depicts the transient behaviors of all dimensions of the state estimator $\hat{x}$. Figure 2.1 indicates that the state estimator $\hat{x}$ in Algorithm 1 successfully achieves consensus at the globally optimal solution in expectation.

(2) Figures 2.2 and 2.3 adopt the residual $(1/n)\sum_{i=1}^{n}\|x_i^k - \hat{x}^*\|$ to show a numerical comparison between the proposed algorithm (2.4) and the distributed method in [46], with the x-axis being the iteration and the number of gradient evaluations, respectively. Figure 2.2 shows that the proposed algorithm (2.4) achieves a linear convergence rate with a constant step-size and that its performance does not degrade even when stochastic gradients are used. Figure 2.3 indicates that, compared with the method in [46], the proposed algorithm (2.4) requires fewer gradient evaluations, which largely reduces the computational cost.

Fig. 2.1 Convergence of $\hat{x}$ for solving the optimization problem in Example 1 [plot not reproduced: "The transient behaviors of all dimensions of state vector" versus iteration (0 to 1000)]

Fig. 2.2 Comparison (a): x-axis is the iteration [plot not reproduced: log-scale residual versus iteration (0 to 1000) for the proposed algorithm (5) and the method in [46]]
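The setup of Example 1 can be reproduced in a few lines. The following sketch only generates problem data in the stated ranges and defines the residual used in the comparison; it does not implement Algorithm 1. The random seed and all helper names (`local_objective`, `is_feasible`, `residual`) are hypothetical, and the original experiments were run in MATLAB, so the numbers will not match the reported figures.

```python
import numpy as np

rng = np.random.default_rng(0)
n, e, d = 10, 10, 4  # nodes, local samples per node, dimension of x

# Local data drawn in the ranges stated for Example 1.
C = rng.uniform(0.0, 2.0, size=(n, e, d))
b = rng.uniform(-4.0, 4.0, size=(n, e))
P = rng.uniform(0.0, 2.0, size=(n, d))
q = rng.uniform(-4.0, 4.0, size=n)

# Constraints of (2.54): one equality, one inequality, and a box.
B_eq, c_eq = np.array([[1.0, 1.0, 1.0, 1.0]]), np.array([3.0])
D_in, s_in = np.array([[1.0, -1.0, 1.0, -1.0]]), np.array([2.0])
lower, upper = -2.0, 2.0

# Ring network: node i is linked to nodes i-1 and i+1 (mod n).
ring = np.zeros((n, n))
for i in range(n):
    ring[i, (i + 1) % n] = ring[(i + 1) % n, i] = 1.0

def local_objective(i, x):
    """f_i(x) + |P_i x - q_i| for node i."""
    smooth = np.mean((C[i] @ x - b[i]) ** 2)
    return smooth + abs(P[i] @ x - q[i])

def is_feasible(x, tol=1e-9):
    """Check the equality, inequality, and box constraints of (2.54)."""
    eq_ok = np.all(np.abs(B_eq @ x - c_eq) <= tol)
    in_ok = np.all(D_in @ x <= s_in + tol)
    box_ok = np.all(x >= lower - tol) and np.all(x <= upper + tol)
    return bool(eq_ok and in_ok and box_ok)

def residual(X, x_opt):
    """(1/n) * sum_i ||x_i - x_opt|| for stacked local estimates X of shape (n, d)."""
    return float(np.mean(np.linalg.norm(X - x_opt, axis=1)))
```

For instance, `is_feasible(np.full(4, 0.75))` returns `True`, since the entries sum to 3, the alternating sum is 0, and every entry lies in $[-2, 2]$.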
Fig. 2.3 Comparison (b): x-axis is the number of gradient evaluations [plot not reproduced: log-scale residual versus number of gradient evaluations (0 to 8 × 10^4) for the proposed algorithm (5) and the method in [46]]

2.5.2 Example 2: Application Behavior

Second, we further verify the application behavior of the proposed algorithm with numerical simulations on real datasets. We consider the distributed sparse logistic regression problem using the breast cancer Wisconsin (diagnostic) dataset provided in the UCI Machine Learning Repository [56]. From this dataset, we adopt $N = 200$ samples as training data, where each training sample has dimension $d = 9$. All the features have been preprocessed and normalized to unit vectors. For the network, we generate a randomly connected network with $n = 20$ nodes utilizing an Erdős–Rényi network with probability $p = 0.4$. The distributed sparse logistic regression problem can be formally described as

$$\min_{\hat{x} \in \mathbb{R}^d} \sum_{i=1}^{n} f_i(\hat{x}) + \kappa_1\|\hat{x}\|_1, \tag{2.55}$$

with the local objective function $f_i(\hat{x})$ being

$$f_i(\hat{x}) = \frac{1}{e_i}\sum_{j=1}^{e_i}\ln\left(1 + \exp(-b_{i,j}c_{i,j}^T\hat{x})\right) + \frac{\kappa_2}{2}\|\hat{x}\|_2^2,$$
where $b_{i,j} \in \{-1, 1\}$ and $c_{i,j} \in \mathbb{R}^d$ are local data kept by node $i$ for $j \in \{1, \ldots, e_i\}$; the regularization term $\kappa_1\|\hat{x}\|_1$ is applied to impose sparsity on the optimal solution, and $(\kappa_2/2)\|\hat{x}\|_2^2$ is added to avoid overfitting. In the simulation, we assign the data randomly to the local nodes such that $\sum_{i=1}^{n} e_i = N$. We set the regularization parameters $\kappa_1 = 0.05$ and $\kappa_2 = 10$. Then, we compare the proposed algorithms with existing distributed methods that can deal with the composite non-smooth optimization problem, including DL-ADMM [30], PG-EXTRA [31], NIDS [32], and P2D2 [33]. When $\kappa_1 = 0$, we also compare the proposed algorithms with existing distributed methods that use the variance-reduction technique, including DSA [31] and GT-SAGA [51]. The simulation results are described as follows:

(1) Figure 2.4 shows that, on the real training set, the proposed algorithms achieve a linear convergence rate, as do the existing distributed methods [30–33] that can deal with the composite non-smooth optimization problem. Figure 2.5 indicates that, compared with the existing distributed methods [30–33], which do not adopt the variance-reduction technique, the proposed algorithms require fewer gradient evaluations and are therefore cheaper in terms of computational cost.
Fig. 2.4 Comparison (a): x-axis is the iteration [plot not reproduced: "Residual: Real dataset", log-scale residual versus iteration (0 to 400) for the proposed algorithm (5), PG-EXTRA, NIDS, P2D2, DL-ADMM, and the proposed algorithm (22)]

Fig. 2.5 Comparison (b): x-axis is the number of gradient evaluations [plot not reproduced: "Residual: Real dataset", log-scale residual versus number of gradient evaluations (0 to 4 × 10^4) for the proposed algorithm (5), PG-EXTRA, NIDS, P2D2, DL-ADMM, and the proposed algorithm (22)]

Fig. 2.6 Comparison (a): x-axis is the iteration [plot not reproduced: "Residual: Real dataset", log-scale residual versus iteration (0 to 700) for the proposed algorithm (5), DSA, and GT-SAGA]

Fig. 2.7 Comparison (b): x-axis is the number of gradient evaluations [plot not reproduced: "Residual: Real dataset", log-scale residual versus number of gradient evaluations (0 to 7,000) for the proposed algorithm (5), DSA, and GT-SAGA]
(2) When $\kappa_1 = 0$, Figs. 2.6 and 2.7 show that the proposed algorithms perform similarly to the existing distributed variance-reduced methods [46, 51].
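The local objective of Example 2 and its gradient are short enough to write out. The sketch below uses synthetic unit-normalized data rather than the UCI dataset, and it only evaluates the objective of (2.55); it is not an implementation of the proposed algorithms. The helper names `local_loss`, `local_grad`, and `composite_objective` are hypothetical.

```python
import numpy as np

def local_loss(x, c, b, kappa2):
    """(1/e_i) * sum_j ln(1 + exp(-b_j c_j^T x)) + (kappa2 / 2) * ||x||_2^2."""
    margins = -b * (c @ x)
    # logaddexp(0, m) = ln(1 + exp(m)), evaluated without overflow.
    return float(np.mean(np.logaddexp(0.0, margins)) + 0.5 * kappa2 * x @ x)

def local_grad(x, c, b, kappa2):
    """Gradient of local_loss with respect to x."""
    margins = -b * (c @ x)
    sigma = 0.5 * (1.0 + np.tanh(0.5 * margins))  # logistic sigmoid of the margins
    return (c * (-b * sigma)[:, None]).mean(axis=0) + kappa2 * x

def composite_objective(x, local_data, kappa1, kappa2):
    """Sum of local losses plus the l1 regularizer, as in (2.55)."""
    smooth = sum(local_loss(x, c_i, b_i, kappa2) for c_i, b_i in local_data)
    return smooth + kappa1 * np.abs(x).sum()

# Synthetic stand-in for one node's data (not the UCI dataset).
rng = np.random.default_rng(0)
e_i, d = 12, 9
c = rng.normal(size=(e_i, d))
c /= np.linalg.norm(c, axis=1, keepdims=True)  # rows normalized to unit vectors
b = rng.choice([-1.0, 1.0], size=e_i)
x = rng.normal(size=d)
grad = local_grad(x, c, b, kappa2=10.0)
```

With $\kappa_2 = 10$ the quadratic term dominates the curvature, which makes each local objective strongly convex; the $\ell_1$ term is kept outside `local_loss` because it is handled by the nonsmooth part of the problem.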
2.6 Conclusion

In this chapter, we have designed a novel computation-efficient distributed stochastic gradient algorithm for solving a class of strongly convex composite constrained optimization problems over networks. The proposed algorithm leverages the variance-reduction technique, which greatly reduces the expense of full gradient evaluation. By constructing an appropriate Lyapunov function, we proved that the proposed algorithm converges in expectation to the optimal solution with a suitably selected constant step-size. Furthermore, the privacy properties of the proposed algorithm have also been explored via a differential privacy strategy. Extensive numerical experiments have been conducted to verify the superior performance of the proposed algorithm. However, some nontrivial issues still deserve further study. For example, the convergence rate of the proposed algorithm for the composite constrained optimization problem needs to be studied in depth, and general nonsmooth terms as well as more complex networks still demand further consideration. In the future, we will further investigate the convergence rate of the proposed algorithm and extend the algorithm to more complex directed networks. Extensions of the current algorithm to general non-smooth terms and to distributed non-convex stochastic optimization are also two promising research directions.
References

1. X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, J. Liu, Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent, in Advances in Neural Information Processing Systems (NIPS), vol. 30 (2017), pp. 1–11
2. R. Xin, S. Kar, U.A. Khan, Decentralized stochastic optimization and machine learning: a unified variance-reduction framework for robust performance and fast convergence. IEEE Signal Process. Mag. 37(3), 102–113 (2020)
3. S. Khobahi, M. Soltanalian, F. Jiang, A.L. Swindlehurst, Optimized transmission for parameter estimation in wireless sensor networks. IEEE Trans. Signal Inf. Proc. Netw. 6, 35–47 (2019)
4. A. Nedic, J. Liu, Distributed optimization for control. Ann. Rev. Control Robot. Auton. Syst. 1, 77–103 (2018)
5. J. Li, W. Abbas, X. Koutsoukos, Resilient distributed diffusion in networks with adversaries. IEEE Trans. Signal Inf. Proc. Netw. 6, 1–17 (2019)
6. A. Nedic, Distributed gradient methods for convex machine learning problems in networks: distributed optimization. IEEE Signal Process. Mag. 37(3), 92–101 (2020)
7. M. Rossi, M. Centenaro, A. Ba, S. Eleuch, T. Erseghe, M. Zorzi, Distributed learning algorithms for optimal data routing in IoT networks. IEEE Trans. Signal Inf. Proc. Netw. 6, 175–195 (2020)
8. B. Li, S. Cen, Y. Chen, Y. Chi, Communication-efficient distributed optimization in networks with gradient tracking and variance reduction, in Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics (PMLR), vol. 108 (2020), pp. 1662–1672
9. H. Li, C. Huang, Z. Wang, G. Chen, H. Umar, Computation-efficient distributed algorithm for convex optimization over time-varying networks with limited bandwidth communication. IEEE Trans. Signal Inf. Proc. Netw. 6, 140–151 (2020)
10. T. Yang, X. Yi, J. Wu, Y. Yuan, D. Wu, Z. Meng, Y. Hong, H. Wang, Z. Lin, K. Johansson, A survey of distributed optimization. Annu. Rev. Control 47, 278–305 (2019)
11. A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control 54(1), 48–61 (2009)
12. S. Ram, A. Nedic, V. Veeravalli, Distributed stochastic subgradient projection algorithms for convex optimization. J. Optim. Theory Appl. 147, 516–545 (2010)
13. J. Duchi, A. Agarwal, M. Wainwright, Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Trans. Autom. Control 57(1), 151–164 (2012)
14. W. Shi, Q. Ling, G. Wu, W. Yin, EXTRA: an exact first-order algorithm for decentralized consensus optimization. SIAM J. Optim. 25(2), 944–966 (2015)
15. C. Xi, U. Khan, DEXTRA: a fast algorithm for optimization over directed graphs. IEEE Trans. Autom. Control 62(10), 4980–4993 (2017)
16. M. Maros, J. Jalden, On the Q-linear convergence of distributed generalized ADMM under non-strongly convex function components. IEEE Trans. Signal Inf. Proc. Netw. 5(3), 442–453 (2019)
17. C. Zhang, H. Gao, Y. Wang, Privacy-preserving decentralized optimization via decomposition (2018). Preprint. arXiv:1808.09566
18. J. Chen, S. Liu, P. Chen, Zeroth-order diffusion adaptation over networks, in Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018). https://doi.org/10.1109/ICASSP.2018.8461448
19. J. Xu, S. Zhu, Y.C. Soh, L. Xie, Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes, in Proceedings of the IEEE 54th Annual Conference on Decision and Control (2015). https://doi.org/10.1109/CDC.2015.7402509
20. A. Nedic, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)
21. R. Xin, U. Khan, Distributed heavy-ball: a generalization and acceleration of first-order methods with gradient tracking. IEEE Trans. Autom. Control 65(6), 2627–2633 (2020)
22. V.S. Mai, E.H. Abed, Distributed optimization over directed graphs with row stochasticity and constraint regularity. Automatica 102(102), 94–104 (2019)
23. B. Huang, Y. Zou, Z. Meng, W. Ren, Distributed time-varying convex optimization for a class of nonlinear multiagent systems. IEEE Trans. Autom. Control 65(2), 801–808 (2020)
24. J. Zhu, C. Xu, J. Guan, D. Wu, Differentially private distributed online algorithms over time-varying directed networks. IEEE Trans. Signal Inf. Proc. Netw. 4(1), 4–17 (2018)
25. M. Hong, D. Hajinezhad, M. Zhao, Prox-PDA: the proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks, in Proceedings of the 34th International Conference on Machine Learning (ICML), vol. 70 (2017), pp. 1529–1538
26. F. Hua, R. Nassif, C. Richard, H. Wang, A.H. Sayed, Online distributed learning over graphs with multitask graph-filter models. IEEE Trans. Signal Inf. Proc. Netw. 6, 63–77 (2020)
27. T. Yang, J. Lu, D. Wu, J. Wu, G. Shi, Z. Meng, K. Johansson, A distributed algorithm for economic dispatch over time-varying directed networks with delays. IEEE Trans. Ind. Electron. 64(6), 5095–5106 (2017)
28. T. Yang, D. Wu, H. Fang, W. Ren, H. Wang, Y. Hong, K. Johansson, Distributed energy resource coordination over time-varying directed communication networks. IEEE Trans. Control Netw. Syst. 6(3), 1124–1134 (2019)
29. A.I. Chen, A. Ozdaglar, A fast distributed proximal-gradient method, in Proceedings of the 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton) (2012). https://doi.org/10.1109/Allerton.2012.6483273
30. T.-H. Chang, M. Hong, X. Wang, Multi-agent distributed optimization via inexact consensus ADMM. IEEE Trans. Signal Process. 63(2), 482–497 (2015)
31. W. Shi, Q. Ling, G. Wu, W. Yin, A proximal gradient algorithm for decentralized composite optimization. IEEE Trans. Signal Process. 63(22), 6013–6023 (2015)
32. Z. Li, W. Shi, M. Yan, A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. IEEE Trans. Signal Process. 67(17), 4494–4506 (2019)
33. S. Alghunaim, K. Yuan, A.H. Sayed, A linearly convergent proximal gradient algorithm for decentralized optimization, in Advances in Neural Information Processing Systems (NIPS), vol. 32 (2019), pp. 1–11
34. K. Zheng, Z. Yang, K. Zhang, P. Chatzimisios, K. Yang, W. Xiang, Big data-driven optimization for mobile networks toward 5G. IEEE Netw. 30(1), 44–51 (2016)
60
2 Projection Algorithms for Distributed Stochastic Optimization
35. B. Swenson, R. Murray, S. Kar, H. Poor, Distributed stochastic gradient descent and convergence to local minima (2020). Preprint. arXiv:2003.02818v1 36. M. Assran, N. Loizou, N. Ballas, M. Rabbat, Stochastic gradient push for distributed deep learning, in Proceedings of the 36th International Conference on Machine Learning (ICML) (2019). https://doi.org/10.48550/arxiv.1811.10792 37. D. Yuan, Y. Hong, D. Ho, G. Jiang, Optimal distributed stochastic mirror descent for strongly convex optimization. Automatica 90, 196–203 (2018) 38. R. Xin, A. Sahu, U. Khan, S. Kar, Distributed stochastic optimization with gradient tracking over strongly-connected networks, in Proceedings of the 2019 IEEE 58th Conference on Decision and Control (CDC) (2019). https://doi.org/10.1109/CDC40024.2019.9029217 39. J. Konecny, J. Liu, P. Richtarik, M. Takac, Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE J. Sel. Top. Signal Process. 10(2), 242–255 (2016) 40. M. Schmidt, N. Roux, F. Bach, Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1), 83–112 (2017) 41. A. Defazio, F. Bach, S. Lacoste-Julien, Saga: a fast incremental gradient method with support for non-strongly convex composite objectives, in Advances in Neural Information Processing Systems (NIPS), vol. 27 (2014), pp. 1–9 42. R. Johnson, T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, in Advances in Neural Information Processing Systems (NIPS), vol. 26 (2013), pp. 1–9 43. C. Tan, S. Ma, Y. Dai, Y. Qian, Barzilai-borwein step size for stochastic average gradient, in Advances in Neural Information Processing Systems, vol. 29 (2016), pp. 1–9 44. L. Nguyen, J. Liu, K. Scheinberg, M. Takac, SARAH: a novel method for machine learning problems using stochastic recursive gradient, in Proceedings of the 34th International Conference on Machine Learning (ICML) (2017), pp. 2613–2621 45. A. Mokhtari, A. 
Ribeiro, DSA: decentralized double stochastic averaging gradient algorithm. J. Mach. Learn. Res. 17(1), 2165–2199 (2016) 46. Y. Zhao, Q. Liu, A consensus algorithm based on collective neurodynamic system for distributed optimization with linear and bound constraints. Math. Program. 122, 144–151 (2020) 47. Z. Shen, A. Mokhtari, T. Zhou, P. Zhao, H. Qian, Towards more efficient stochastic decentralized learning: faster convergence and sparse communication, in Proceedings of the 35th International Conference on Machine Learning (PMLR), vol. 80 (2018), pp. 4624–4633 48. K. Yuan, B. Ying, J. Liu, A. Sayed, Variance-reduced stochastic learning by networked agents under random reshuffling. IEEE Trans. Signal Process. 67(2), 1–11 (2019) 49. H. Hendrikx, F. Bach, L. Massoulie, An accelerated decentralized stochastic proximal algorithm for finite sums, in Advances in Neural Information Processing Systems, vol. 32 (2019), pp. 4624–4633 50. Z. Wang, H. Li, Edge-based stochastic gradient algorithm for distributed optimization. IEEE Trans. Netw. Sci. Eng. 7(3), 1421–1430 (2020) 51. R. Xin, U. Khan, S. Kar, Variance-reduced decentralized stochastic optimization with accelerated convergence. IEEE Trans. Signal Process. 68, 6255–6271 (2020) 52. R. Xin, A. Sahu, S. Kar, U.A. Khan, Distributed empirical risk minimization over directed graphs, in Proceedings of the 53rd Asilomar Conference on Signals, Systems, and Computers (2019). https://doi.org/10.1109/IEEECONF44664.2019.9049065 53. Q. Liu, S. Yang, Y. Hong, Constrained consensus algorithms with fixed step size for distributed convex optimization over multi-agent networks, IEEE Trans. Autom. Control 62(8), 4259– 4265 (2017) 54. M. Bazaraa, H. Sherali, C. Shetty, Nonlinear Programming: Theory and Algorithms, 3rd edn. (John Wiley & Sons, Hoboken, 2006) 55. S. Alghunaim, E.K. Ryu, K. Yuan, A.H. Sayed, Decentralized proximal gradient algorithms with linear convergence rates. IEEE Trans. Autom. Control 66(6), 2787–2794 (2021) 56. D. 
Dua, C. Graff, UCI machine learning repository, Dept. School Inf. Comput. Sci., Univ. California, Irvine, CA, USA (2019)
Chapter 3
Proximal Algorithms for Distributed Coupled Optimization
Abstract In this chapter, we consider a multi-node sharing problem, where each node possesses a local smooth function that is itself the average of several constituent functions, and the network aims to minimize the finite sum of all local functions plus a (possibly non-smooth) coupling function. Owing to its scalability, robustness, and flexibility, distributed optimization has been a significant focus of engineering research on this problem. To this end, an equivalent saddle-point problem that is amenable to distributed solutions is first formulated. Then, a novel distributed stochastic algorithm called VR-DPPD is proposed, which combines the variance-reduction technique of SAGA with the distributed proximal primal–dual method. We present a convergence analysis and demonstrate that if the smooth local functions are strongly convex, VR-DPPD converges linearly to the exact optimal solution in expectation. By providing a linearly convergent algorithm with low computation cost, our work advances efforts to solve a general composite optimization problem with a convex (possibly non-smooth) coupling function. The viability and performance of VR-DPPD are demonstrated numerically.

Keywords Distributed optimization · Non-smooth coupling function · Proximal primal–dual method · Variance reduction · Linear convergence
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Q. Lü et al., Distributed Optimization in Networked Systems, Wireless Networks, https://doi.org/10.1007/978-981-19-8559-1_3

3.1 Introduction

Over the last decade, distributed optimization over networks has become a research hotspot along with the rapid development of network technologies. In this setting, nodes aim to minimize the sum of local functions (one owned by each node) through local communication [1, 2]. Traditional centralized approaches to solving optimization problems usually require a central entity to collect essential information from all nodes, which is costly, prone to a single point of failure, and not robust to new environments. Compared to centralized algorithms, distributed algorithms avoid long-distance communication and have greater flexibility and scalability because they can decompose large-scale problems into a series of smaller
problems [3, 4]. Consequently, distributed algorithms offer better robustness, less communication, and good privacy protection in many applications [5–14], including but not limited to machine learning [5, 6], online optimization [7, 8], privacy masking [9, 10], resource allocation [11, 12], and data processing [13, 14]. Recently, researchers have devoted significant effort to distributed approaches for solving optimization problems [15–24]. Distributed approaches that depend only on gradient information account for the majority of the reported literature because of their good performance and excellent scalability [1, 2]. Well-known approaches include distributed gradient descent (DGD) [15, 16], distributed dual averaging (DDA) [17], EXTRA [18, 19], distributed ADMM [20], distributed adaptive diffusion [21], and distributed gradient tracking [22–24]. Building on these approaches [15–24], many efficient distributed approaches have been proposed to handle specific factors present in the problem or to achieve desired targets, such as transmission delays [25], complex constraints [26], computation efficiency [27], communication efficiency [28], and privacy security [29]. Besides the aforementioned works on discrete-time iterations, distributed continuous-time approaches have also been well investigated [30], and they find flexible applications in continuous-time physical systems and hardware implementations [31]. In addition to the aforementioned works, which handle problems with a single objective, composite optimization problems with smooth plus non-smooth objectives have sparked considerable interest in the distributed optimization community due to their broad applications. Approaches that solve composite optimization problems include the (fast) distributed proximal gradient method [32], the distributed linearized ADMM (DL-ADMM) [33], PG-ADMM [34], PG-EXTRA [35], and NIDS [36].
From the perspective of convergence rate, the aforementioned approaches achieve only a sublinear convergence rate, which leaves a clear gap compared with their centralized counterparts. Only recently have distributed linearly convergent approaches been investigated to fill this gap [37]. In particular, the authors in [37] proposed a distributed proximal gradient algorithm based on a general primal–dual algorithmic framework, which not only attained a linear convergence rate but also unified many existing related approaches. Then, based on the gradient tracking mechanism, the authors in [38, 39] introduced the NEXT/SONATA algorithm with linear convergence. Subsequently, the authors in [40] gave a unified distributed algorithmic framework that obtains similar results on the basis of operator splitting theory. For the case where the non-smooth function couples all nodes, the authors in [41, 42] first designed a distributed proximal primal–dual algorithm from the novel perspective of transforming the saddle-point problem and theoretically established its linear convergence. Under a much weaker condition than the strong convexity assumed in [37–41], the authors in [43] developed a distributed randomized block-coordinate proximal algorithm, which achieved asymptotic linear convergence. However, all of the aforementioned approaches carry a heavy computational burden, because they evaluate full local gradients.
Recently, some distributed stochastic gradient methods have emerged [44–47]. Meanwhile, many representative centralized stochastic gradient approaches, including S2GD [48], SAG [49], SAGA [50], SVRG [51], and SARAH [52], have adopted various variance-reduction techniques to decrease the variance of the stochastic gradient and improve convergence. Inspired by Konecny et al. [48], Schmidt et al. [49], Defazio et al. [50], Johnson and Zhang [51], and Nguyen et al. [52], many distributed variance-reduced approaches [28, 53–58] have been extensively investigated. In particular, the authors in [56] proposed two novel distributed algorithms, namely GT-SAGA and GT-SVRG, to achieve optimal rates for distributed stochastic convex problems. Then, the authors in [57, 58] combined gradient tracking with variance reduction (SAGA-type and SARAH-type) to address the stochastic and distributed nature of non-convex problems. In addition, the authors in [59] investigated a novel stochastic and distributed algorithm to solve the non-convex and non-smooth constrained optimization problem. However, we notice that there are currently few linearly convergent distributed stochastic algorithms dedicated to multi-node sharing problems (composite optimization problems) in which the non-smooth function couples all nodes. Thus, the main motivation of this chapter is to investigate a distributed stochastic algorithm that not only converges linearly to the exact optimal solution but also reduces the computational burden. In this chapter, we consider a general multi-node sharing problem that subsumes several real applications [41, 42]. We do not solve the primal problem directly but transform it into an equivalent saddle-point problem that is amenable to distributed solutions. Based on this, we propose a novel distributed stochastic algorithm that is adaptable to real-world applications.
Relative to the existing works, the novelties of the present work are summarized as follows:
(i) A novel distributed stochastic algorithm (named VR-DPPD), which combines the variance-reduction technique of SAGA with the distributed proximal primal–dual method, is proposed. Different from [15–24], VR-DPPD can solve a general composite optimization problem [41, 42] with a (possibly non-smooth) coupling function in a distributed fashion.
(ii) Unlike the existing stochastic gradient methods [44–47], whose gradient estimates retain a non-vanishing variance, VR-DPPD leverages the unbiased stochastic average gradient (SAGA) to estimate the local gradients, which reduces the variance of the stochastic gradient and improves convergence substantially. By using SAGA, VR-DPPD incurs a lower local gradient computation cost than the distributed approaches in [32–36, 38–42].
(iii) The convergence and convergence rate of VR-DPPD are rigorously analyzed. In particular, we demonstrate that VR-DPPD converges linearly to the exact optimal solution in expectation if the smooth local functions are strongly convex. As far as we know, this is a novel (not yet investigated in other works) linearly convergent distributed stochastic algorithm that focuses on solving the general multi-node sharing problem exactly.
3.2 Preliminaries

3.2.1 Notation

Unless otherwise stated, the vectors in this chapter are column vectors. The symbol $\|\cdot\|$ denotes the 2-norm. Given a vector $x$ and a positive semi-definite matrix $W$, we denote $\|x\|_W^2 = x^T W x$. The Kronecker product is denoted by $\otimes$. The vector that stacks $x_1, \cdots, x_n$ on top of each other is written as $\mathrm{col}\{x_i\}_{i=1}^n$. The symbol $\mathrm{blkdiag}\{X_i\}_{i=1}^n$ denotes the block diagonal matrix with diagonal blocks $\{X_i\}$. The proximal operator of a function $f: \mathbb{R}^n \to \mathbb{R}$ at $x$ is defined as $\mathrm{prox}_{\eta f}(x) = \arg\min_{y \in \mathbb{R}^n} \{f(y) + \frac{1}{2\eta}\|x - y\|^2\}$, where $\eta > 0$ is a parameter (step-size). We denote the conjugate function of a function $f$ at $x \in \mathbb{R}^n$ as $f^\dagger(x) = \sup_{y \in \mathbb{R}^n} \{x^T y - f(y)\}$. The gradient of a function $f$ at $x \in \mathbb{R}^n$ is denoted as $\nabla f(x)$. The subdifferential $\partial f(x)$ of a function $f$ at $x \in \mathbb{R}^n$ is the set of all subgradients.
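As a concrete illustration of the proximal operator defined above, the following minimal sketch takes $f(y) = \|y\|_1$, for which the proximal operator is the well-known soft-thresholding map. The function names `prox_l1` and `prox_objective` and the sample values are illustrative assumptions, not notation from this chapter.

```python
import numpy as np

def prox_l1(x, eta):
    """prox_{eta f}(x) for f(y) = ||y||_1 (componentwise soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - eta, 0.0)

def prox_objective(y, x, eta):
    """The function minimized in the definition: f(y) + ||x - y||^2 / (2 * eta)."""
    return np.sum(np.abs(y)) + np.sum((x - y) ** 2) / (2.0 * eta)

# Hypothetical sample point and step-size.
x = np.array([1.5, -0.2, 0.7])
eta = 0.5
p = prox_l1(x, eta)
print(p)
```

Entries whose magnitude is below `eta` are set to zero, and the remaining entries are shrunk toward zero by `eta`; `prox_objective` can be used to compare `p` against any other candidate point.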
3.2.2 Model of Optimization Problem

This chapter focuses on cooperatively solving a general composite optimization problem, which is defined over an undirected and connected network of $n$ nodes, as follows:
$$\min_{x_1, \cdots, x_n} \ \sum_{i=1}^n f_i(x_i) + g\Big(\sum_{i=1}^n A_i x_i\Big), \tag{3.1}$$
where $f_i: \mathbb{R}^{q_i} \to \mathbb{R}$ is a convex function that we view as the private cost of node $i$, $g: \mathbb{R}^p \to \mathbb{R} \cup \{+\infty\}$ is a convex, possibly non-smooth, global cost function known by all nodes,$^1$ and the matrix $A_i \in \mathbb{R}^{p \times q_i}$ (full row rank) is a linear transform known only by node $i$. Furthermore, each $f_i$ is described by
$$f_i(x_i) = \frac{1}{m_i} \sum_{h=1}^{m_i} f_{i,h}(x_i), \quad i = 1, \cdots, n.$$

$^1$ Here, it is also worth noticing that the non-smooth function $g$ may be an indicator function for inequality constraints or equality constraints. For example, in a distributed resource allocation problem [60], this non-smooth term may be an indicator function of the equality constraints. In a distributed ridge regression problem [42], this non-smooth term may be an indicator function of the inequality constraints. In addition, the non-smooth function $g$ may represent a regularization term; see, e.g., [3].
Here, several quantities that will support the problem reformulation are defined below:
$$x = \mathrm{col}\{x_i\}_{i=1}^n \in \mathbb{R}^q, \quad q = \sum_{i=1}^n q_i, \quad f(x) = \sum_{i=1}^n f_i(x_i), \quad A = [A_1, \cdots, A_n] \in \mathbb{R}^{p \times q}.$$
Then, problem (3.1) can be rewritten as$^2$
$$\min_{x} \ f(x) + g(Ax). \tag{3.2}$$
Moreover, we make the following assumptions on the constituent functions $f_{i,h}$ and the global cost function $g$.

Assumption 3.1
(i) Each local constituent function $f_{i,h}$, $i \in \{1, \ldots, n\}$, $h \in \{1, \ldots, m_i\}$, is $\beta$-smooth and $\alpha$-strongly convex.
(ii) The function $g: \mathbb{R}^p \to \mathbb{R} \cup \{+\infty\}$ is proper, lower semi-continuous, and convex.
(iii) There exists $x \in \mathbb{R}^q$ such that $Ax$ belongs to the relative interior of the domain of $g$.
Notice from Assumption 3.1 that the function $f: \mathbb{R}^q \to \mathbb{R}$ is also $\alpha$-strongly convex and $\beta$-smooth, where $0 < \alpha \le \beta$, and problem (3.2) possesses a unique optimal solution $x^* = \mathrm{col}\{x_1^*, \cdots, x_n^*\} \in \mathbb{R}^q$, which achieves the minimum of this problem.
3.2.3 Motivating Examples

Problem (3.1) is the sharing problem, where the individual variables held by different nodes are coupled through a function $g$. Problems of the form (3.1) appear in many engineering applications [1], including smart grids, basis pursuit, and resource allocation in wireless networks. They also appear in machine learning applications [1], such as regression over distributed features. Here, we provide two motivating applications that fit (3.1).

Example 1 A well-known example of problem (3.1) is the distributed resource allocation problem. Inspired by Xu et al. [60] and Scaman et al. [62], we set the distributed resource allocation problem as follows:
$$\min_{x_1, \cdots, x_n} \ C(x) = \sum_{i=1}^n C_i(x_i), \quad \text{s.t.} \ \sum_{i=1}^n (x_i - r_i) = 0, \tag{3.3}$$
2 If g = 0 and A is the Laplacian matrix of the graph for instance (or a square root of the Laplacian matrix), then problem (3.2) will become a general consensus optimization problem that can be well solved by distributed primal–dual algorithms, such as [31, 40, 59, 61]. From this perspective, problem (3.2) is more general.
where $x = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^n$ is the decision vector and $x_i \in \mathbb{R}$ is the amount of resources allocated to node $i$. The function $C_i: \mathbb{R} \to \mathbb{R}$ is convex and represents the cost incurred by the resource $x_i$. The equality constraint $\sum_{i=1}^n (x_i - r_i) = 0$ couples the nodes' decisions, where $r_i \in \mathbb{R}$ is a local virtual demand of resource for node $i$. Problem (3.3) can be transformed into problem (3.1) by defining an indicator function $g(\cdot): \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ such that
$$g(\check{x}) = \begin{cases} 0, & \text{if } \check{x} = a, \\ +\infty, & \text{otherwise}, \end{cases}$$
which is non-smooth in terms of $\check{x}$. Here, $\check{x}$ and $a$ represent the coupled terms $\sum_{i=1}^n x_i$ and $\sum_{i=1}^n r_i$, respectively. Based on this, problem (3.3) can be equivalently transformed into the following composite non-smooth problem:
$$\min_{x_1, \cdots, x_n} \ \sum_{i=1}^n \frac{1}{m_i} \sum_{h=1}^{m_i} C_{i,h}(x_i) + g\Big(\sum_{i=1}^n x_i\Big),$$
which has the same form as problem (3.1) with $C_i(x_i) = (1/m_i) \sum_{h=1}^{m_i} C_{i,h}(x_i)$ and $A_i = 1$ for all $i$. Here, we note that the above transformation is reasonable because, in an actual resource allocation problem, we want to minimize the cost of the entire network while coordinating the optimization variables of all nodes, and the cost incurred by each node is composed of many different costs.

Example 2 Another notable example of problem (3.1) is the distributed logistic
regression problem, which has important applications in machine learning [42, 53, 56–58] and can be described as follows:
$$\min_{x_1, \cdots, x_n} \ \sum_{i=1}^n \frac{1}{m_i} \sum_{h=1}^{m_i} \ln\big(1 + \exp(-b_{i,h} c_{i,h}^T x_i)\big), \quad \text{s.t.} \ x_i = x_j, \ \forall j \in \mathcal{N}_i,$$
where $\mathcal{N}_i$ is the neighbor set of node $i$. Here, $b_{i,h} \in \{-1, 1\}$ and $c_{i,h} \in \mathbb{R}^{q_i}$, $h \in \{1, \ldots, m_i\}$, are the local data kept by node $i$. Similar to Example 1, the above problem can be transformed into problem (3.1) if we define the indicator function $g(\cdot)$ such that
$$g(\cdot) = \begin{cases} 0, & \text{if } x_1 = \cdots = x_n, \\ +\infty, & \text{otherwise}. \end{cases}$$
In this case, the matrix $A = [A_1, \cdots, A_n]$ is sparse and encodes the communication network between nodes. It is worth noticing that the matrix $A$ in problem (3.1) is not necessarily sparse, and $A_i$ is a private matrix known only by node $i$. Therefore, the general distributed logistic regression problem can be transformed into a special case of problem (3.1) if we utilize the above indicator function $g$.
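To make the constituent functions of Example 2 concrete, the sketch below evaluates one logistic constituent loss, its gradient, and the local full gradient of $f_i$ as the average over the $m_i$ samples of node $i$. The helper names, the random data, and the sizes `m_i = 6`, `q_i = 3` are illustrative assumptions only.

```python
import numpy as np

def constituent_loss(x, b, c):
    """f_{i,h}(x) = ln(1 + exp(-b * c^T x)) for a single sample (b, c)."""
    return np.log1p(np.exp(-b * (c @ x)))

def constituent_grad(x, b, c):
    """Gradient of the constituent loss: -b * c / (1 + exp(b * c^T x))."""
    return -b * c / (1.0 + np.exp(b * (c @ x)))

def local_full_grad(x, labels, features):
    """Gradient of f_i(x) = (1/m_i) * sum_h f_{i,h}(x), computed in one pass."""
    margins = labels * (features @ x)
    coeff = -labels / (1.0 + np.exp(margins))
    return (coeff[:, None] * features).mean(axis=0)

# Hypothetical local data set of one node.
rng = np.random.default_rng(0)
m_i, q_i = 6, 3
features = rng.normal(size=(m_i, q_i))       # rows are c_{i,h}
labels = rng.choice([-1.0, 1.0], size=m_i)   # entries are b_{i,h}
x = rng.normal(size=q_i)
print(local_full_grad(x, labels, features))
```

Computing `local_full_grad` touches all $m_i$ samples, whereas `constituent_grad` touches only one; this difference is what stochastic gradient estimators exploit.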
3.2.4 The Saddle-Point Reformulation

Notice from Assumption 3.1 that strong duality holds [63, Corollary 31.2.1]. Similar to the existing work [41], the saddle-point reformulation of problem (3.2) is given by [64, Proposition 19.18]:
$$\min_{x} \max_{y} \ f(x) + y^T A x - g^\dagger(y), \tag{3.4}$$
where $y \in \mathbb{R}^p$ is the dual variable associated with the coupled term $Ax$. Furthermore, $(x^*, y^*)$ is an optimal solution of (3.4) if and only if it satisfies the following optimality conditions [64, Proposition 19.18]:
$$-A^T y^* = \nabla f(x^*), \quad A x^* \in \partial g^\dagger(y^*), \tag{3.5}$$
where $\nabla f(x^*) = [\nabla f_1(x_1^*)^T, \cdots, \nabla f_n(x_n^*)^T]^T$. Notice that the dual variable $y$ in (3.4) is multiplied by $A$, which couples all nodes. Thus, problem (3.4) cannot be solved directly in a distributed fashion, because the dual update would need to be calculated by a central coordinator. A further reformulation is required to arrive at a distributed solution. To this end, we reformulate problem (3.4) into another equivalent saddle-point problem that avoids this dilemma. First, let $y_i$ be a local copy of $y$ at node $i$, and define the following quantities:
$$\hat{y} = \mathrm{col}\{y_i\}_{i=1}^n \in \mathbb{R}^{pn}, \quad G^\dagger(\hat{y}) = \frac{1}{n} \sum_{i=1}^n g^\dagger(y_i), \quad A_c = \mathrm{blkdiag}\{A_i\}_{i=1}^n \in \mathbb{R}^{pn \times q},$$
and let the symmetric matrix $L \in \mathbb{R}^{pn \times pn}$ satisfy
$$L\hat{y} = 0 \iff y_1 = y_2 = \cdots = y_n. \tag{3.6}$$
Then, another saddle-point reformulation is
$$\min_{x, z} \max_{\hat{y}} \ f(x) + \hat{y}^T A_c x + \hat{y}^T L z - G^\dagger(\hat{y}), \tag{3.7}$$
where $z = \mathrm{col}\{z_i\}_{i=1}^n \in \mathbb{R}^{pn}$. Since the matrix $A_c$ is block diagonal and the matrix $L$ encodes the network sparsity structure, problem (3.7) can be solved in a distributed manner. Then, the optimality conditions of problem (3.7) are given as follows [41]:
$$\begin{cases} -A_c^T \hat{y}^* = \nabla f(x^*), \\ L \hat{y}^* = 0, \\ A_c x^* + L z^* \in \partial G^\dagger(\hat{y}^*), \end{cases} \tag{3.8}$$
where $(x^*, z^*, \hat{y}^*)$ is an optimal solution of (3.7).
According to (3.8), the following lemma shows that problems (3.4) and (3.7) share the same optimal solution in terms of $x$. Since the saddle-point reformulation from (3.4) to (3.7) is the same as in [41], this result follows directly from [41], and we state it here only for completeness.

Lemma 3.1 (Adapted from Lemma 1 in [41]) If $(x^*, z^*, \hat{y}^*)$ fulfills the optimality condition (3.8), then $\hat{y}^* = 1_n \otimes y^*$ holds with $(x^*, y^*)$ satisfying (3.5).

Lemma 3.1 indicates that if an algorithm designed for problem (3.7) reaches the optimal solution $(x^*, z^*, \hat{y}^*)$, then the corresponding primal–dual pair $(x^*, y^*)$ is the optimal solution to problem (3.4). That is to say, the designed algorithm solves problem (3.4) indirectly, and problems (3.1), (3.4), and (3.7) share the same optimal solution in terms of $x$. However, unlike problem (3.4), problem (3.7) can be solved in a distributed manner because $A_c$ is block diagonal and $L$ encodes the network sparsity structure. Therefore, based on Lemma 3.1, we have the opportunity to design distributed algorithms that finally solve problem (3.1).
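The link between (3.7) and (3.4) can be made plausible by a short calculation. This is only a sketch under the definitions above (with $G^\dagger(\hat{y}) = \frac{1}{n}\sum_{i=1}^n g^\dagger(y_i)$), not a substitute for the proof in [41]: if $\hat{y}$ satisfies the consensus condition (3.6), i.e., $\hat{y} = 1_n \otimes y$, then each term of the objective in (3.7) collapses to the corresponding term in (3.4).

```latex
\begin{align*}
\hat{y}^T A_c x &= \sum_{i=1}^n y^T A_i x_i
                 = y^T \Big( \sum_{i=1}^n A_i x_i \Big) = y^T A x, \\
\hat{y}^T L z   &= (L \hat{y})^T z = 0
                 \quad \text{(since $L$ is symmetric and $L\hat{y} = 0$)}, \\
G^\dagger(\hat{y}) &= \frac{1}{n} \sum_{i=1}^n g^\dagger(y) = g^\dagger(y).
\end{align*}
```

Hence, on the consensus set the objective of (3.7) equals $f(x) + y^T A x - g^\dagger(y)$, which is the objective of (3.4); the term $\hat{y}^T L z$ is what enforces consensus through the second condition in (3.8).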
3.3 Algorithm Development

Based on the saddle-point reformulation in the previous section, we construct the algorithm in this section. First, the unbiased stochastic average gradient (SAGA) is introduced.

3.3.1 Unbiased Stochastic Average Gradient (SAGA)

Since the local function $f_i(x_i)$ at each node $i$ is the average of $m_i$ local constituent functions $f_{i,h}(x_i)$, the implementations of most existing distributed primal–dual algorithms [41, 43] require each node $i$ at time $t \ge 0$ to calculate the local full gradient of $f_i$ at $x_i^t$ as
$$\nabla f_i(x_i^t) = \frac{1}{m_i} \sum_{h=1}^{m_i} \nabla f_{i,h}(x_i^t), \quad i = 1, \cdots, n, \tag{3.9}$$
which may result in a high computation cost when the number of constituent functions $m_i$ is large. This issue motivates us to investigate an effective technique that can improve computation efficiency significantly. Fortunately, the unbiased stochastic average gradient (SAGA) can be substituted for the local full gradients to resolve this issue. The idea is to keep a table of the gradients of all constituent functions, in which one randomly selected entry is replaced at each time, and to use the average of the table entries for gradient approximation. Specifically, denote by $\chi_i^t \in \{1, \cdots, m_i\}$ the function index of node $i$, which is uniformly and randomly
selected at time $t$. Then, let $e_{i,h}^t$ be the auxiliary variable that records the point at which the constituent gradient of the function $f_{i,h}$ was last evaluated. The recursive updates of the variables $e_{i,h}$ are
$$e_{i,h}^{t+1} = x_i^t, \ \text{if } h = \chi_i^t; \qquad e_{i,h}^{t+1} = e_{i,h}^t, \ \text{if } h \neq \chi_i^t.$$
At time $t$, the stochastic averaging gradient at node $i$ is represented by
$$s_i^t = \nabla f_{i,\chi_i^t}(x_i^t) - \nabla f_{i,\chi_i^t}(e_{i,\chi_i^t}^t) + \frac{1}{m_i} \sum_{h=1}^{m_i} \nabla f_{i,h}(e_{i,h}^t), \tag{3.10}$$
where the gradients $\nabla f_{i,h}(e_{i,h}^t)$ need to be stored in a table so that (3.10) can be implemented. Define $\mathcal{F}^t$ as the history of the system up until time $t$. Then, from the prior results of [28, 53], conditioned on $\mathcal{F}^t$, one obtains that
$$\mathbb{E}[s_i^t \,|\, \mathcal{F}^t] = \nabla f_i(x_i^t). \tag{3.11}$$
In (3.10), we note that, at each time, the computation of $s_i^t$, $i \in \{1, \cdots, n\}$, is costly because of the calculation of $\sum_{h=1}^{m_i} \nabla f_{i,h}(e_{i,h}^t)$. That is, we face an $O(m_i)$ computational cost if we naively implement the update in (3.10). In contrast, when we implement the following recursive formulation:
$$\sum_{h=1}^{m_i} \nabla f_{i,h}(e_{i,h}^t) = \sum_{h=1}^{m_i} \nabla f_{i,h}(e_{i,h}^{t-1}) + \nabla f_{i,\chi_i^{t-1}}(x_i^{t-1}) - \nabla f_{i,\chi_i^{t-1}}(e_{i,\chi_i^{t-1}}^{t-1}), \tag{3.12}$$
this cost is avoided and we can calculate $\sum_{h=1}^{m_i} \nabla f_{i,h}(e_{i,h}^t)$ in a computationally efficient way. In addition, we point out that the $O(m_i)$ computational cost cannot be avoided in the existing methods [35–37, 40, 41, 43] that use deterministic gradient information.
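The table bookkeeping behind (3.10) and the running-sum recursion (3.12) can be sketched for a single node as follows. The class name `SagaTable`, the method names, and the quadratic constituent functions $f_h(x) = \frac{1}{2}\|x - c_h\|^2$ are illustrative assumptions chosen so that the gradients are trivial to evaluate; the step-size 0.1 is arbitrary.

```python
import numpy as np

class SagaTable:
    """Gradient table of one node with m constituent functions (hypothetical helper)."""

    def __init__(self, grad_fn, m, x0):
        self.grad_fn = grad_fn  # grad_fn(h, x) returns the gradient of f_h at x
        self.m = m
        # Initially e_h = x0 for all h, so the table stores grad f_h(x0).
        self.table = np.stack([grad_fn(h, x0) for h in range(m)])
        self.table_sum = self.table.sum(axis=0)

    def peek(self, x, h):
        """Estimator (3.10) for sampled index h, without modifying the table."""
        return self.grad_fn(h, x) - self.table[h] + self.table_sum / self.m

    def step(self, x, h):
        """Return the estimator and refresh entry h; the sum is updated as in (3.12)."""
        s = self.peek(x, h)
        g_new = self.grad_fn(h, x)
        self.table_sum = self.table_sum + g_new - self.table[h]
        self.table[h] = g_new
        return s

# Hypothetical data: f_h(x) = 0.5 * ||x - c_h||^2, so grad f_h(x) = x - c_h.
rng = np.random.default_rng(0)
m, dim = 5, 3
centers = rng.normal(size=(m, dim))
grad_fn = lambda h, x: x - centers[h]
node = SagaTable(grad_fn, m, x0=np.zeros(dim))
x = rng.normal(size=dim)
for _ in range(10):
    h = rng.integers(m)       # uniformly sampled index
    s = node.step(x, h)
    x = x - 0.1 * s           # plain stochastic step, only to move the iterate
```

Averaging `node.peek(x, h)` over all indices `h` reproduces the full gradient at `x`, which is the unbiasedness property (3.11); each call to `step` costs one constituent gradient plus vector additions, independent of `m`.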
3.3.2 Distributed Stochastic Algorithm (VR-DPPD)

Inspired by the variance-reduction technique of SAGA [55, 56, 58] and the distributed proximal primal–dual approach [41], we now propose a novel distributed stochastic algorithm (VR-DPPD) to solve problem (3.7), followed by its distributed implementation. Define the auxiliary variable $w = \mathrm{col}\{w_i\}_{i=1}^n \in \mathbb{R}^{pn}$ and the stochastic gradient $s = \mathrm{col}\{s_i\}_{i=1}^n \in \mathbb{R}^q$. Let $x^0$ and $\hat{y}^0$ be arbitrary, $z^0 = 0_{pn}$, and $s^0 = \nabla f(x^0)$. Then,
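the general matrix form of VR-DPPD at $t \ge 0$ is
$$\begin{cases} x^{t+1} = x^t - \eta_x s^t - \eta_x A_c^T \hat{y}^t, \\ w^{t+1} = \hat{y}^t + \eta_y A_c x^{t+1} + L z^t, \\ z^{t+1} = z^t - L w^{t+1}, \\ \hat{y}^{t+1} = \mathrm{prox}_{\eta_y G^\dagger}(B_c w^{t+1}), \end{cases} \tag{3.13}$$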
where $\eta_x$ and $\eta_y$ are two tunable constant step-sizes and $B_c = B \otimes I_p$ with $B \in \mathbb{R}^{n \times n}$ satisfying Assumption 3.2 below. VR-DPPD (3.13) can be viewed as a stochastic version of algorithm (3.12) in [41], obtained by using the variance-reduction technique of SAGA [55, 56, 58].

Assumption 3.2 Suppose that the network is undirected and connected. Moreover, the matrix $B$ is symmetric, doubly stochastic, and primitive. In addition, suppose that the matrix $L$ satisfies condition (3.6) and
$$0 < I_{pn} - L^2, \quad B_c^2 \le I_{pn} - L^2. \tag{3.14}$$
Remark 3.2 Requiring the matrix $B$ to satisfy the condition in Assumption 3.2 is necessary to ensure that all nodes converge to the same optimal variable, and such a matrix is not difficult to construct over an undirected connected network; see, e.g., the lazy Metropolis matrix designed in [23]. Moreover, the eigenvalues of $B_c$ belong to $(-1, 1]$. Then, given $B_c$, there are many choices for the matrix $L$. For example, we can set $L^2 = I_{pn} - B_c^2$ and check whether the condition $0 < I_{pn} - L^2$ holds. If it does, then Assumption 3.2 is satisfied. If not, we can let $L^2 = d(I_{pn} - B_c^2)$ for any $d \in (0, 1)$. Although there are many choices for the matrices $B_c$ and $L$, we keep only one choice to simplify the following analysis and presentation.
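One possible way to build matrices that meet Assumption 3.2 is sketched below. It assumes standard Metropolis weights for the base matrix (the exact weights used in [23] may differ), forms the lazy matrix $B = (I_n + \tilde{B})/2$, and takes $L$ as the symmetric square root of $I_{pn} - B_c^2$. The function names and the 4-node ring are illustrative assumptions.

```python
import numpy as np

def metropolis_weights(adj):
    """Symmetric doubly stochastic weights from a 0/1 adjacency matrix (assumed rule)."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()   # diagonal entry makes the row sum to one
    return W

def build_B_and_L(adj, p):
    """Return B, B_c = B (x) I_p, and a symmetric L with L^2 = I - B_c^2."""
    n = adj.shape[0]
    B = 0.5 * (np.eye(n) + metropolis_weights(adj))   # lazy version, eigenvalues in (0, 1]
    lam, V = np.linalg.eigh(B)
    # Symmetric square root of I - B^2; clipping guards against tiny negative round-off.
    L_small = V @ np.diag(np.sqrt(np.clip(1.0 - lam**2, 0.0, None))) @ V.T
    return B, np.kron(B, np.eye(p)), np.kron(L_small, np.eye(p))

# Hypothetical network: undirected ring with 4 nodes, dual dimension p = 2.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]])
B, Bc, L = build_B_and_L(adj, p=2)
print(np.linalg.eigvalsh(B))
```

With this choice, $I_{pn} - L^2 = B_c^2$ is positive definite because the lazy matrix has strictly positive eigenvalues, and $L\hat{y}$ vanishes (up to round-off) exactly on consensus vectors because the eigenvalue 1 of $B$ is simple for a connected graph.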
Following the design idea of the lazy Metropolis matrix (for more details, please refer to [23]), a primitive symmetric doubly stochastic matrix $\tilde{B} = [\tilde{b}_{ij}] \in \mathbb{R}^{n \times n}$ is first constructed, which satisfies $\tilde{b}_{ij} > 0$ if nodes $i$ and $j$ are connected by an edge of the undirected connected network, and $\tilde{b}_{ij} = 0$ otherwise. Then, let $B = (I_n + \tilde{B})/2 = [b_{ij}] \in \mathbb{R}^{n \times n}$, which satisfies Assumption 3.2. Based on this, we can set $L^2 = I_{pn} - B_c^2$ to satisfy the related conditions in Assumption 3.2. With the above choices of matrices in hand, we now show the distributed implementation of (3.13). Specifically, it follows from the updates of $w^t$ and $z^t$ in (3.13) that, for all $t \ge 1$,
$$w^{t+1} = (I_{pn} - L^2) w^t + \hat{y}^t - \hat{y}^{t-1} + \eta_y A_c (x^{t+1} - x^t), \tag{3.15}$$
which eliminates the auxiliary variable $z^t$, and algorithm (3.13) can be rewritten as follows:
$$\begin{cases} x^{t+1} = x^t - \eta_x s^t - \eta_x A_c^T \hat{y}^t, \\ w^{t+1} = (I_{pn} - L^2) w^t + \hat{y}^t - \hat{y}^{t-1} + \eta_y A_c (x^{t+1} - x^t), \\ \phi^{t+1} = B_c w^{t+1}, \\ \hat{y}^{t+1} = \mathrm{prox}_{\eta_y G^\dagger}(\phi^{t+1}), \end{cases}$$
where $\phi = \mathrm{col}\{\phi_i\}_{i=1}^n \in \mathbb{R}^{pn}$ is an auxiliary variable. With the previous choices of the matrices $B_c$ and $L$, the $i$-th block of each of the vectors $\{x, w, \phi, \hat{y}\}$ can be updated by node $i$, $i = 1, \cdots, n$. For any $i$, we let $x_i^0$, $y_i^0$ be any values, $\phi_i^0 = \varphi_i^0$ ($\varphi = \mathrm{col}\{\varphi_i\}_{i=1}^n \in \mathbb{R}^{pn}$ is an auxiliary variable), $e_{i,h}^0 = x_i^0$, $h = 1, \cdots, m_i$, and $s_i^0 = \nabla f_i(x_i^0)$. From Moreau's decomposition [64] (i.e., $a = \mathrm{prox}_{\eta g}(a) + \eta\, \mathrm{prox}_{\eta^{-1} g^\dagger}(\eta^{-1} a)$ for all $a \in \mathbb{R}^p$ and $\eta > 0$) and letting $\eta = \eta_y / n$, the updates of VR-DPPD at each node $i$ are given as follows:$^3$ for all $t \ge 1$,
$$\begin{cases} x_i^{t+1} = x_i^t - \eta_x s_i^t - \eta_x A_i^T y_i^t, \\ \varphi_i^{t+1} = y_i^t + \eta_y A_i x_i^{t+1}, \\ w_i^{t+1} = \phi_i^t + \varphi_i^{t+1} - \varphi_i^t, \\ \phi_i^{t+1} = \sum_{j=1}^n b_{ij} w_j^{t+1}, \\ y_i^{t+1} = \phi_i^{t+1} - \eta\, \mathrm{prox}_{\eta^{-1} g}(\eta^{-1} \phi_i^{t+1}). \end{cases} \tag{3.16}$$
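The elimination of $z^t$ that leads from (3.13) to (3.15) is a purely algebraic identity, so it can be checked numerically on arbitrary data. The sketch below runs the matrix form (3.13) for a few steps and measures the residual of (3.15) for $t \ge 1$. Everything in it is an illustrative assumption: the 3-node complete graph, the random blocks $A_i$, the exact gradient of $f(x) = \frac{1}{2}\|x - c\|^2$ used as a stand-in for the stochastic gradient $s^t$, and $g = \|\cdot\|_1$, for which the proximal operator of $G^\dagger$ is a projection onto $[-1, 1]$ in each coordinate.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, qi = 3, 1, 2

# Lazy weights on a complete 3-node graph and L with L^2 = I - B_c^2 (assumed choices).
B = 0.5 * np.eye(n) + np.full((n, n), 1.0 / 6.0)
Bc = np.kron(B, np.eye(p))
lam, V = np.linalg.eigh(Bc)
L = V @ np.diag(np.sqrt(np.clip(1.0 - lam**2, 0.0, None))) @ V.T

# Block-diagonal A_c built from random private blocks A_i.
Ac = np.zeros((n * p, n * qi))
for i in range(n):
    Ac[i * p:(i + 1) * p, i * qi:(i + 1) * qi] = rng.normal(size=(p, qi))

c = rng.normal(size=n * qi)
eta_x, eta_y = 0.1, 0.1

x = np.zeros(n * qi)
y_hat = rng.normal(size=n * p)
z = np.zeros(n * p)                      # z^0 = 0
xs, ys, ws = [x], [y_hat], [None]        # ws[t] holds w^t (w^0 is never used)
for t in range(6):
    s = xs[t] - c                        # stand-in for s^t (exact gradient)
    x_next = xs[t] - eta_x * s - eta_x * (Ac.T @ ys[t])
    w_next = ys[t] + eta_y * (Ac @ x_next) + L @ z
    z = z - L @ w_next
    y_next = np.clip(Bc @ w_next, -1.0, 1.0)   # prox of G^dagger for g = l1-norm
    xs.append(x_next); ys.append(y_next); ws.append(w_next)

I = np.eye(n * p)
residuals = [
    np.linalg.norm(ws[t + 1] - ((I - L @ L) @ ws[t] + ys[t] - ys[t - 1]
                                + eta_y * (Ac @ (xs[t + 1] - xs[t]))))
    for t in range(1, 6)
]
print(max(residuals))
```

The residual is at the level of floating-point round-off regardless of which gradient estimate or proximal map is plugged in, since only the $w$ and $z$ updates of (3.13) enter the identity.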
Here, the pseudo-code of the VR-DPPD algorithm is outlined in Algorithm 2. To locally implement Algorithm 2, suppose that each node $i \in \mathcal{V}$ keeps a gradient table containing all gradients $\nabla f_{i,h}$, $\forall h \in \{1, \ldots, m_i\}$, evaluated at the local primal variable $x_i$ or at the local auxiliary variables $e_{i,h}$. At each iteration $t + 1$, each node $i$ first chooses an index $\chi_i^t \in \{1, \ldots, m_i\}$ uniformly at random from its own data batch and then updates the local stochastic gradient $s_i^t$ via (3.10). After updating $s_i^t$, the local auxiliary variable $e_{i,\chi_i^t}^{t+1}$ is set to the local variable $x_i^t$, and the table entry at position $\chi_i^t$ is replaced by the newly computed constituent gradient $\nabla f_{i,\chi_i^t}(x_i^t)$, while the other entries remain unchanged. Then, each node $i$ sequentially updates the local variables $x_i^{t+1}$, $\varphi_i^{t+1}$, and $w_i^{t+1}$. Subsequently, each node $i$ transmits $b_{ji} w_i^{t+1}$ to its neighbors $j \in \mathcal{N}_i$, where $\mathcal{N}_i$ is the neighbor set of node $i$, receives $b_{ij} w_j^{t+1}$ from its neighbors $j$, and updates $\phi_i^{t+1}$. Finally, each node $i$ updates $y_i^{t+1}$. Since $g$ is a global cost function known by all nodes, Algorithm 2 can be implemented in a distributed manner based on the above process.
$^3$ By using Moreau's decomposition [64], the update of $y_i^{t+1}$ can be executed in a convenient fashion that avoids computing the proximal mapping of the conjugate function $g^\dagger$ at each time.
Algorithm 2 Distributed stochastic algorithm (VR-DPPD), from the view of node $i$
1: Initialization: Each node $i$ initializes $x_i^0 \in \mathbb{R}^{q_i}$, $y_i^0 \in \mathbb{R}^p$, $\phi_i^0 = \varphi_i^0 \in \mathbb{R}^p$, $e_{i,h}^0 = x_i^0$, $h = 1, \cdots, m_i$, and $s_i^0 = \nabla f_i(x_i^0)$. Each node $i$ knows the step-sizes $\eta_x$ and $\eta_y$.
2: for $t = 0, 1, 2, \ldots$ do
3:   if $t = 0$ then
4:     Compute and store the sum of the local gradients $\sum_{h=1}^{m_i} \nabla f_{i,h}(e_{i,h}^0)$, which is actually $\nabla f_i(x_i^0)$, and let $s_i^0 = \nabla f_i(x_i^0)$.
5:   else
6:     Choose $\chi_i^t$ uniformly and randomly from $\{1, \ldots, m_i\}$.
7:     Compute and store the summation term $\sum_{h=1}^{m_i} \nabla f_{i,h}(e_{i,h}^t)$ according to (3.12).
8:     Compute the stochastic averaging gradient $s_i^t$ via (3.10).
9:     Take $e_{i,\chi_i^t}^{t+1} = x_i^t$ and replace $\nabla f_{i,\chi_i^t}(e_{i,\chi_i^t}^t)$ by $\nabla f_{i,\chi_i^t}(x_i^t)$ in the corresponding $\chi_i^t$-th gradient table position. All other $e_{i,h}^{t+1}$, $\forall h \neq \chi_i^t$, and gradient elements in the table remain unchanged, i.e., $e_{i,h}^{t+1} = e_{i,h}^t$ and $\nabla f_{i,h}(e_{i,h}^{t+1}) = \nabla f_{i,h}(e_{i,h}^t)$ for all $h \neq \chi_i^t$.
10:   end if
11:   Update $x_i^{t+1}$, $\varphi_i^{t+1}$ and $w_i^{t+1}$ sequentially via (3.16).
12:   Transmit $b_{ji} w_i^{t+1}$ to its neighbors $j \in \mathcal{N}_i$.
13:   Receive $b_{ij} w_j^{t+1}$ from its neighbors $j \in \mathcal{N}_i$, and update $\phi_i^{t+1}$ via (3.16).
14:   Update $y_i^{t+1}$ via (3.16).
15: end for
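The gradient-table bookkeeping in steps 6 to 9 of Algorithm 2 can be sketched as follows. Equations (3.10) and (3.12) are not reproduced in this part of the text, so the sketch assumes the standard SAGA estimator (new constituent gradient minus stored gradient plus table average) and $f_i = \frac{1}{m_i}\sum_h f_{i,h}$. The class `SagaTable` and the quadratic constituent functions are hypothetical and serve only as an illustration.

```python
import numpy as np

class SagaTable:
    """Gradient table of one node (hypothetical helper, SAGA-style)."""

    def __init__(self, grad_fns, x0):
        # grad_fns[h](x) returns the gradient of the h-th constituent function.
        self.grad_fns = grad_fns
        self.m = len(grad_fns)
        # Table entries grad f_h(e_h), initialised with e_h = x0 for all h.
        self.table = np.array([g(x0) for g in grad_fns])
        self.table_sum = self.table.sum(axis=0)

    def estimate(self, x, chi):
        """Assumed SAGA estimator for a given index chi (table is not modified)."""
        return self.grad_fns[chi](x) - self.table[chi] + self.table_sum / self.m

    def step(self, x, rng):
        """Draw chi uniformly, return the estimate, then refresh entry chi only."""
        chi = int(rng.integers(self.m))
        new_grad = self.grad_fns[chi](x)
        s = new_grad - self.table[chi] + self.table_sum / self.m
        # Only position chi changes, so the stored sum is updated incrementally.
        self.table_sum += new_grad - self.table[chi]
        self.table[chi] = new_grad
        return s, chi

# Assumed toy constituents f_h(x) = 0.5 * ||x - c_h||^2 with gradient x - c_h.
centers = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
grad_fns = [lambda x, c=c: x - c for c in centers]
node = SagaTable(grad_fns, np.zeros(2))

x = np.array([0.5, -1.0])
full_grad = np.mean([g(x) for g in grad_fns], axis=0)
# Averaging the estimator over all indices recovers the full local gradient.
mean_estimate = np.mean([node.estimate(x, h) for h in range(node.m)], axis=0)

s, chi = node.step(x, np.random.default_rng(0))
```

Each call to `step` evaluates a single constituent gradient, which is the source of the cost saving compared with evaluating the full local gradient at every iteration.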
Remark 3.3 The classical proximal primal–dual fixed-point method (PDFP) [65] and the later distributed methods, including the proximal exact dual diffusion method (PED$^2$) [41], the dual consensus proximal algorithm (DCPA) [42], and the primal–dual hybrid gradient method (PDHG) [66], solve the minimization problem by transforming (3.1) into a saddle-point formulation. When dealing with large-scale tasks, the above methods [41, 66, 67] may suffer from high computation costs. Compared with [41, 42, 66, 67], Algorithm 2 leverages the variance-reduction technique of SAGA [55, 58] to estimate the local full gradient in a more cost-efficient way. In addition, via Moreau's decomposition [64], the update of $y_i^{t+1}$ in Algorithm 2 does not need to evaluate the conjugate function $g^\dagger$ at each iteration, which makes it convenient to implement.
3.4 Convergence Analysis

In this section, we show the convergence behavior of VR-DPPD (3.13). First, some auxiliary results related to the stochastic gradient and the fixed point of (3.13) are provided for the subsequent convergence analysis.
3.4.1 Auxiliary Results

Before introducing the results, we define an auxiliary sequence $v^t = \sum_{i=1}^{n} v_i^t \in \mathbb{R}$, $\forall t \ge 0$, where
$$
v_i^t = \frac{1}{m_i} \sum_{h=1}^{m_i} \left( f_{i,h}(e_{i,h}^t) - f_{i,h}(x_i^*) - \nabla f_{i,h}(x_i^*)^T (e_{i,h}^t - x_i^*) \right).
$$
Here $f_{i,h}(e_{i,h}^t) - f_{i,h}(x_i^*) - \nabla f_{i,h}(x_i^*)^T (e_{i,h}^t - x_i^*)$, $\forall t \ge 0$, is non-negative by the strong convexity of the local constituent function $f_{i,h}$, and thus $v^t$, $\forall t \ge 0$, is also non-negative. To simplify notation, we denote $\mathbb{E}[\cdot \mid \mathcal{F}^t] = \mathbb{E}_t$, $\forall t \ge 1$, in the following analysis. Moreover, we define $\nabla f(x^t) = [\nabla f_1(x_1^t)^T, \cdots, \nabla f_n(x_n^t)^T]^T$.
Lemma 3.4 (Adapted from Lemma 6 in [53]) Consider the definition of $v^t$. Under Assumption 3.1, the following recursive relation holds: $\forall t \ge 0$,
$$
\mathbb{E}_t[v^{t+1}] \le \frac{1}{\check m}\left( f(x^t) - f(x^*) - \nabla f(x^*)^T (x^t - x^*) \right) + \left(1 - \frac{1}{\hat m}\right) v^t, \tag{3.17}
$$
where $\check m$ and $\hat m$ are the smallest and largest numbers of local constituent functions over the whole network, respectively, i.e., $\check m = \min_{i \in \{1,\cdots,n\}} \{m_i\}$ and $\hat m = \max_{i \in \{1,\cdots,n\}} \{m_i\}$.
In addition, an upper bound is given for the mean-squared distance between the stochastic gradient $s^t$ and the gradient $\nabla f(x^*)$, i.e., $\mathbb{E}_t[\|s^t - \nabla f(x^*)\|^2]$; its proof can be found in [53].

Lemma 3.5 (Adapted from Lemma 4 in [53]) Under Assumption 3.1, the following recursive relation holds: $\forall t \ge 0$,
$$
\mathbb{E}_t[\|s^t - \nabla f(x^*)\|^2] \le 4\beta v^t + 2(2\beta - \alpha)\left( f(x^t) - f(x^*) - \nabla f(x^*)^T (x^t - x^*) \right). \tag{3.18}
$$

From Lemma 3.5, we can deduce that for each node $i = 1, \cdots, n$, when $x_i^t$ approaches $x_i^*$, the points $e_{i,h}^t$, $h \in \{1, \cdots, m_i\}$, tend to $x_i^*$ as well, which indicates that the mean-squared distance between the stochastic gradient $s^t$ and the gradient $\nabla f(x^*)$ vanishes. Subsequently, we show the existence and optimality of the fixed points of (3.13), following [41].
Lemma 3.6 (Adapted from Lemma 2 in [41]) A point $(x^*, z^*, \hat y^*, w^*)$ is a fixed point of (3.13) if and only if the following conditions hold:
$$
\begin{cases}
0 = \nabla f(x^*) + A_c^T \hat y^* \\
w^* = \hat y^* + \eta_y A_c x^* + L z^* \\
0 = L w^* \\
\hat y^* = \mathrm{prox}_{\eta_y G^\dagger}(B_c w^*).
\end{cases} \tag{3.19}
$$
In addition, if a fixed point $(x^*, z^*, \hat y^*, w^*)$ satisfies (3.19), then $\hat y^* = 1_n \otimes y^*$ holds with $(x^*, y^*)$ satisfying (3.5).

Lemma 3.6 shows that the fixed point of (3.13) is optimal. According to Lemma 3.6, the stochastic gradient $s^t$ is fixed to $\nabla f(x^*)$ if $(x^*, z^*, \hat y^*, w^*)$ is the fixed point of (3.13). This fact makes it easy to prove the sufficient and necessary conditions for fixed-point optimality, since VR-DPPD (3.13) is a stochastic version of the algorithm (3.12) in [41]. In addition, it follows from the update of $z^t$ in (3.13) that if $z^0 = 0_{pn}$, then $z^1 = L w^1$, which lies in the range space of $L$. Accordingly, $\{z^t\}_{t \ge 0}$ always belongs to the range space of $L$. Using arguments similar to those in [41], one can always suppose that $z^*$ lies in the range space of $L$ and that $(x^*, z^*, \hat y^*, w^*)$ is a fixed point, since adding a vector in the null space of $L$ to $z^*$ does not affect the optimality conditions.
3.4.2 Main Results

First, we give a crucial lemma that plays an important role in supporting the convergence results. Define the following error terms:
$$
\tilde x^t = x^t - x^*, \quad \tilde y^t = \hat y^t - \hat y^*, \quad \tilde w^t = w^t - w^*, \quad \tilde z^t = z^t - z^*.
$$
Then, it follows from (3.13) and (3.19) that
$$
\begin{cases}
\tilde x^{t+1} = \tilde x^t - \eta_x (s^t - \nabla f(x^*)) - \eta_x A_c^T \tilde y^t \\
\tilde w^{t+1} = \tilde y^t + \eta_y A_c \tilde x^{t+1} + L \tilde z^t \\
\tilde z^{t+1} = \tilde z^t - L \tilde w^{t+1} \\
\tilde y^{t+1} = \mathrm{prox}_{\eta_y G^\dagger}(B_c w^{t+1}) - \mathrm{prox}_{\eta_y G^\dagger}(B_c w^*).
\end{cases} \tag{3.20}
$$
Based on (3.20), we next establish a critical equality to support the main results.
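Lemma 3.7 Suppose that $\eta_x$ and $\eta_y$ are strictly positive. Under Assumption 3.1, the following recursive relation holds: $\forall t \ge 0$,
$$
\begin{aligned}
&\|\tilde x^{t+1}\|^2_{I_q - \eta_x \eta_y A_c^T A_c} + \kappa \|\tilde w^{t+1}\|^2_{I_{pn} - L^2} + \kappa \|\tilde z^{t+1}\|^2 \\
&\quad = \|\tilde x^t - \eta_x (s^t - \nabla f(x^*))\|^2 + \kappa \|\tilde y^t\|^2_{I_{pn} - \eta_x \eta_y A_c A_c^T} + \kappa \|\tilde z^t\|^2_{I_{pn} - L^2},
\end{aligned} \tag{3.21}
$$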
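where $\kappa = \eta_x / \eta_y$.

Proof First, it follows from the update of the error term $\tilde x^t$ in (3.20) that
$$
\begin{aligned}
\|\tilde x^{t+1}\|^2 &= \|\tilde x^t - \eta_x (s^t - \nabla f(x^*)) - \eta_x A_c^T \tilde y^t\|^2 \\
&= \|\tilde x^t - \eta_x (s^t - \nabla f(x^*))\|^2 + \eta_x^2 \|\tilde y^t\|^2_{A_c A_c^T} - 2\eta_x (\tilde y^t)^T A_c (\tilde x^t - \eta_x (s^t - \nabla f(x^*))).
\end{aligned} \tag{3.22}
$$
Then, it follows from the update of the error term $\tilde w^t$ in (3.20) that
$$
\begin{aligned}
\kappa \|\tilde w^{t+1}\|^2 &= \kappa \|\tilde y^t + \eta_y A_c \tilde x^{t+1} + L \tilde z^t\|^2 \\
&= \kappa \|\tilde y^t\|^2 + \kappa \|\eta_y A_c \tilde x^{t+1} + L \tilde z^t\|^2 + 2\kappa \eta_y (\tilde y^t)^T A_c \tilde x^{t+1} + 2\kappa (\tilde y^t)^T L \tilde z^t \\
&= \kappa \|\tilde y^t\|^2 + \kappa \|\eta_y A_c \tilde x^{t+1} + L \tilde z^t\|^2 + 2\eta_x (\tilde y^t)^T A_c \tilde x^{t+1} + 2\kappa (\tilde y^t)^T L \tilde z^t \\
&= \kappa \|\tilde y^t\|^2 + \eta_x \eta_y \|\tilde x^{t+1}\|^2_{A_c^T A_c} + \kappa \|\tilde z^t\|^2_{L^2} + 2\eta_x (\tilde z^t)^T L A_c \tilde x^{t+1} \\
&\quad + 2\eta_x (\tilde y^t)^T A_c \tilde x^{t+1} + 2\kappa (\tilde y^t)^T L \tilde z^t.
\end{aligned} \tag{3.23}
$$
Similarly, it follows from the update of the error term $\tilde z^t$ in (3.20) that
$$
\begin{aligned}
\kappa \|\tilde z^{t+1}\|^2 &= \kappa \|\tilde z^t - L \tilde w^{t+1}\|^2 \\
&= \kappa \|\tilde z^t\|^2 + \kappa \|\tilde w^{t+1}\|^2_{L^2} - 2\kappa (\tilde z^t)^T L (\tilde y^t + \eta_y A_c \tilde x^{t+1} + L \tilde z^t) \\
&= \kappa \|\tilde z^t\|^2 + \kappa \|\tilde w^{t+1}\|^2_{L^2} - 2\kappa (\tilde z^t)^T L \tilde y^t - 2\eta_x (\tilde z^t)^T L A_c \tilde x^{t+1} - 2\kappa \|\tilde z^t\|^2_{L^2}.
\end{aligned} \tag{3.24}
$$
Combining the three results (3.22)–(3.24), one can deduce that
$$
\begin{aligned}
&\|\tilde x^{t+1}\|^2 + \kappa \|\tilde w^{t+1}\|^2 + \kappa \|\tilde z^{t+1}\|^2 \\
&\quad = 2\eta_x (\tilde y^t)^T A_c \tilde x^{t+1} - 2\eta_x (\tilde y^t)^T A_c (\tilde x^t - \eta_x (s^t - \nabla f(x^*))) + \|\tilde x^t - \eta_x (s^t - \nabla f(x^*))\|^2 \\
&\qquad + \kappa \|\tilde z^t\|^2_{I_{pn} - L^2} + \kappa \|\tilde y^t\|^2_{I_{pn} + \eta_x \eta_y A_c A_c^T} + \eta_x \eta_y \|\tilde x^{t+1}\|^2_{A_c^T A_c} + \kappa \|\tilde w^{t+1}\|^2_{L^2}.
\end{aligned} \tag{3.25}
$$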
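Rearranging equation (3.25), we have that
$$
\begin{aligned}
&\|\tilde x^{t+1}\|^2_{I_q - \eta_x \eta_y A_c^T A_c} + \kappa \|\tilde w^{t+1}\|^2_{I_{pn} - L^2} + \kappa \|\tilde z^{t+1}\|^2 \\
&\quad = \|\tilde x^t - \eta_x (s^t - \nabla f(x^*))\|^2 + \kappa \|\tilde y^t\|^2_{I_{pn} + \eta_x \eta_y A_c A_c^T} - 2\eta_x (\tilde y^t)^T A_c (\tilde x^t - \eta_x (s^t - \nabla f(x^*))) \\
&\qquad + 2\eta_x (\tilde y^t)^T A_c \tilde x^{t+1} + \kappa \|\tilde z^t\|^2_{I_{pn} - L^2}.
\end{aligned} \tag{3.26}
$$
Substituting the update of the error term $\tilde x^t$ in (3.20) into (3.26) yields the result of Lemma 3.7. The proof is completed.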
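The equality of Lemma 3.7 is purely algebraic, so it can be sanity-checked numerically. The sketch below draws random data (the dimensions, step-sizes and matrices are arbitrary assumptions, not values from the text), applies the first three lines of the error recursion (3.20), and compares both sides of (3.21). The vector `u` stands for $\tilde x^t - \eta_x(s^t - \nabla f(x^*))$, and $L$ is taken symmetric, as the weighted norms $\|\cdot\|_{L^2}$ in the proof require.

```python
import numpy as np

rng = np.random.default_rng(0)
q, pn = 4, 6                        # assumed toy dimensions
eta_x, eta_y = 0.05, 0.2            # assumed step-sizes
kappa = eta_x / eta_y

A_c = rng.standard_normal((pn, q))
S = rng.standard_normal((pn, pn))
L = 0.5 * (S + S.T)
L /= 2.0 * np.linalg.norm(L, 2)     # symmetric, spectral norm 0.5, so L^2 < I

def wnorm2(v, M):
    """Weighted squared norm ||v||_M^2 = v^T M v."""
    return float(v @ M @ v)

# Arbitrary error vectors; u plays the role of x~^t - eta_x (s^t - grad f(x*)).
u = rng.standard_normal(q)
y_t = rng.standard_normal(pn)
z_t = rng.standard_normal(pn)

# First three lines of the error recursion (3.20).
x_next = u - eta_x * A_c.T @ y_t
w_next = y_t + eta_y * A_c @ x_next + L @ z_t
z_next = z_t - L @ w_next

I_q, I_pn = np.eye(q), np.eye(pn)
lhs = (wnorm2(x_next, I_q - eta_x * eta_y * A_c.T @ A_c)
       + kappa * wnorm2(w_next, I_pn - L @ L)
       + kappa * float(z_next @ z_next))
rhs = (float(u @ u)
       + kappa * wnorm2(y_t, I_pn - eta_x * eta_y * A_c @ A_c.T)
       + kappa * wnorm2(z_t, I_pn - L @ L))
```

The two sides agree up to floating-point error for any draw, which matches the fact that (3.21) uses only the linear recursions and no property of the proximal step.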
Let $\rho_{\min}(\cdot)$, $\rho_{\max}(\cdot)$, and $\lambda_{\min}(\cdot)$ be the smallest non-zero singular value, the largest singular value, and the smallest eigenvalue of the argument, respectively. Notice from the condition (3.14) that $0 \le L^2 < I_{pn}$, which further implies that $0 < \rho_{\min}(L) < 1$. Denote a tunable parameter $\tau = \eta_x \tau_1$, where $\tau_1 < 4\alpha \check m / (\alpha + \beta)$ is a constant. Then, we deduce the convergence results of VR-DPPD (3.13) by using the result in Lemma 3.7.

Theorem 3.8 Consider VR-DPPD (3.13) and let Assumption 3.1 hold. If the step-sizes $\eta_x$ and $\eta_y$ satisfy
$$
0 < \eta_x < \min\left\{ \frac{2}{\alpha + \beta}, \frac{\tau_1}{4\beta \hat m}, \frac{1}{2(2\beta - \alpha)} \right\},
$$
0 < ηy
0 is the network control gain; the positive scalars $g(t) > 0$ are step-sizes, and the vector $\nabla f_i(x_i(t))$ is a subgradient of the objective function $f_i(x)$ of node $i$ at $x = x_i(t)$. In the following, we give an event-triggered control scheme that reduces not only the communication between neighboring nodes but also the energy consumption of event detection at each node, while retaining the asymptotic consensus property. Suppose that the event-triggered instant sequence of node $i$ is $t_k^i$, $k = 0, 1, \ldots$, at which node $i$ samples its state $x_i(t_k^i)$ and broadcasts it to its neighboring nodes. Furthermore, node $j$ can send its latest sampled state $x_j(t_{k'}^j)$ to node $i$ if $(i, j) \in E$. Therefore, for $t \in [t_k^i, t_{k+1}^i)$, the following distributed control input can be designed as
$$
u_i(t) = -l \sum_{j \in \mathcal{N}_i} w_{ij} \left( x_i(t_k^i) - x_j(t_{k'}^j) \right), \tag{4.3}
$$
where $l > 0$ is the constant control parameter, $w_{ij}$ are non-negative weights, $t_k^i$ denotes the instant when the $k$th event happens for node $i$, and $t_{k'}^j$, with $k' = \arg\min_{m \in \bar{k},\, t \ge t_m^j} \{t - t_m^j\}$, is the latest event-triggered instant of node $j$. Moreover, $t_{k+1}^i$ denotes the next event-triggered time instant of node $i$ after $t_k^i$, which is determined by
$$
t_{k+1}^i = \inf\left\{ t : t > t_k^i \text{ and } y_i(t) > 0 \right\}, \tag{4.4}
$$
in which $y_i(t)$ is the event-triggered function, defined by
$$
y_i(t) = \|e_i(t)\| - \beta_1 \Big\| \sum_{j \in \mathcal{N}_i} a_{ij} \left( x_i(t_k^i) - x_j(t_{k'}^j) \right) \Big\| - \beta_2 \mu^t, \tag{4.5}
$$
where $\beta_1 > 0$, $\beta_2 > 0$, $\mu > 0$, and the measurement error is designed as $e_i(t) = x_i(t_k^i) - x_i(t)$.

Let $\nabla F(X(t)) = [\nabla f_1(x_1(t))^T, \nabla f_2(x_2(t))^T, \cdots, \nabla f_N(x_N(t))^T]^T \in \mathbb{R}^{Nn}$, $\bar x(t) = (1/N) \sum_{i=1}^{N} x_i(t)$, $\delta_i(t) = x_i(t) - \bar x(t)$, $i = 1, 2, \ldots, N$, and $X(t) = [x_1^T(t), x_2^T(t), \cdots, x_N^T(t)]^T \in \mathbb{R}^{Nn}$. Then the distributed subgradient algorithm
4.3 Algorithm Development
(4.2) with distributed control input can be rewritten in a compact matrix-vector form as follows:
$$
X(t+1) = [(I_N - hlL) \otimes I_n] X(t) - hl (L \otimes I_n) e(t) - h g(t) \nabla F(X(t)). \tag{4.6}
$$
By the definition of $\delta_i(t)$, the consensus error satisfies $\delta(t) = [(I_N - J_N) \otimes I_n] X(t)$, where $J_N = \frac{1}{N} 1_N 1_N^T$. It follows that
$$
X(t) = \delta(t) + (J_N \otimes I_n) X(t). \tag{4.7}
$$
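The event-triggered input (4.3), the trigger function (4.5) and the state update behind (4.6) can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the same weight matrix `A` is used for both $w_{ij}$ and $a_{ij}$, states are stored row-wise as an $N \times n$ array, and `x_hat` holds the latest broadcast state of every node, so that $e_i(t)$ equals row $i$ of `x_hat - x`.

```python
import numpy as np

def laplacian(A):
    """Graph Laplacian of a symmetric weight matrix A with zero diagonal."""
    return np.diag(A.sum(axis=1)) - A

def control_inputs(A, x_hat, l):
    """u_i = -l * sum_j a_ij (x_hat_i - x_hat_j), one row per node, as in (4.3)."""
    return -l * laplacian(A) @ x_hat

def trigger_values(A, x, x_hat, t, beta1, beta2, mu):
    """y_i(t) as in (4.5); node i fires an event when y_i(t) > 0."""
    e = x_hat - x                              # measurement errors e_i(t)
    disagreement = laplacian(A) @ x_hat        # sum_j a_ij (x_hat_i - x_hat_j)
    return (np.linalg.norm(e, axis=1)
            - beta1 * np.linalg.norm(disagreement, axis=1)
            - beta2 * mu ** t)

def step(A, x, x_hat, t, subgrads, h, l, g_t, beta1, beta2, mu):
    """One iteration: state update as in (4.6), then event check at time t + 1."""
    x_next = x + h * control_inputs(A, x_hat, l) - h * g_t * subgrads
    fired = trigger_values(A, x_next, x_hat, t + 1, beta1, beta2, mu) > 0
    # Nodes that fire broadcast their new state; the others keep the old sample.
    x_hat_next = np.where(fired[:, None], x_next, x_hat)
    return x_next, x_hat_next, fired

# Two-node toy example with scalar states (all values are assumptions).
A = np.array([[0.0, 1.0], [1.0, 0.0]])
x_hat = np.array([[1.0], [3.0]])
u = control_inputs(A, x_hat, l=2.0)
y = trigger_values(A, np.array([[1.5], [3.0]]), x_hat, t=0,
                   beta1=0.1, beta2=0.0, mu=0.9)
x_next, x_hat_next, fired = step(A, x_hat.copy(), x_hat, 0, np.zeros((2, 1)),
                                 h=0.1, l=2.0, g_t=1.0,
                                 beta1=0.1, beta2=0.0, mu=0.9)
```

Writing the update with `x_hat = x + e` reproduces the two terms $-hl(L \otimes I_n)X(t)$ and $-hl(L \otimes I_n)e(t)$ of (4.6) in a single product.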
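It is obtained from (4.7) that
$$
\begin{aligned}
&[(I_N - J_N) \otimes I_n][(I_N - hlL) \otimes I_n] X(t) \\
&\quad = [(I_N - J_N) \otimes I_n][(I_N - hlL) \otimes I_n][\delta(t) + (J_N \otimes I_n) X(t)] \\
&\quad = [(I_N - hlL - J_N) \otimes I_n] \delta(t) + [(J_N - J_N) \otimes I_n] X(t) \\
&\quad = [(I_N - hlL - J_N) \otimes I_n] \delta(t) \\
&\quad = [(I_N - hlL) \otimes I_n] \delta(t).
\end{aligned} \tag{4.8}
$$
In the above derivation, we have used the equalities $J_N J_N = J_N$ and $[(I_N - J_N) \otimes I_n](L \otimes I_n) = L \otimes I_n$; the latter holds because $J_N L = L J_N = 0$ for the symmetric Laplacian of an undirected graph. With (4.6) and (4.8), we have
$$
\delta(t+1) = [(I_N - hlL) \otimes I_n] \delta(t) - hl (L \otimes I_n) e(t) - h g(t) [(I_N - J_N) \otimes I_n] \nabla F(X(t)). \tag{4.9}
$$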
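The passage from (4.6) to (4.9) can be checked numerically. The sketch below builds a random undirected weighted graph (sizes, gains and data are arbitrary assumptions), applies the projection $(I_N - J_N) \otimes I_n$ to the update (4.6), and compares the result with the right-hand side of (4.9).

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 5, 3                      # assumed toy sizes
h, l, g_t = 0.01, 1.0, 0.5       # assumed parameters

# Random symmetric weights with zero diagonal and the resulting Laplacian.
W = np.triu(rng.uniform(0.1, 1.0, (N, N)), k=1)
A = W + W.T
Lap = np.diag(A.sum(axis=1)) - A
J = np.ones((N, N)) / N
I_N, I_n = np.eye(N), np.eye(n)

X = rng.standard_normal(N * n)
e = rng.standard_normal(N * n)
gradF = rng.standard_normal(N * n)

P = np.kron(I_N - J, I_n)                 # projection onto the consensus error
M = np.kron(I_N - h * l * Lap, I_n)
Lk = np.kron(Lap, I_n)

delta = P @ X
X_next = M @ X - h * l * Lk @ e - h * g_t * gradF               # (4.6)
delta_next = M @ delta - h * l * Lk @ e - h * g_t * P @ gradF   # (4.9)
```

Projecting `X_next` gives `delta_next`, and the two matrix identities used in the derivation hold for this Laplacian as well.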
Since the undirected graph network is connected, we can conclude that the eigenvalues of the matrix $L$ are $0, \lambda_2, \ldots, \lambda_N$. Then we can take an orthogonal matrix $T = [\zeta, \varphi_2, \ldots, \varphi_N] \in \mathbb{R}^{N \times N}$ by using Schmidt's orthogonalization method, where $\zeta = \frac{1}{N} 1_N$ is an eigenvector of the matrix $L$ with respect to the eigenvalue 0 and $\varphi_i$ is the eigenvector of the matrix $L$ with respect to the eigenvalue $\lambda_i$ ($\varphi_i^T L = \lambda_i \varphi_i^T$, $i = 2, 3, \ldots, N$). Letting $\tilde\delta(t) = (T^{-1} \otimes I_n) \delta(t)$, $\tilde e(t) = (T^{-1} \otimes I_n) e(t)$, and $\nabla \tilde F(X(t)) = (T^{-1} \otimes I_n) \nabla F(X(t))$, we have that
$$
\tilde\delta(t+1) = [T^{-1} (I_N - hlL) T \otimes I_n] \tilde\delta(t) - hl (T^{-1} L T \otimes I_n) \tilde e(t) - h g(t) [T^{-1} (I_N - J_N) T \otimes I_n] \nabla \tilde F(X(t)). \tag{4.10}
$$
4 Event-Triggered Algorithms for Distributed Convex Optimization
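Decomposing $\tilde\delta(t) = \begin{bmatrix} \tilde\delta_1(t) \\ \tilde\delta_2(t) \end{bmatrix}$, $\tilde e(t) = \begin{bmatrix} \tilde e_1(t) \\ \tilde e_2(t) \end{bmatrix}$, $\nabla \tilde F(X(t)) = \begin{bmatrix} \nabla \tilde F_1(X(t)) \\ \nabla \tilde F_2(X(t)) \end{bmatrix}$, and in view of (4.9) and (4.10), it follows that
$$
\begin{aligned}
\begin{bmatrix} \tilde\delta_1(t+1) \\ \tilde\delta_2(t+1) \end{bmatrix}
&= \begin{bmatrix} I_n & 0 \\ 0 & (I_{N-1} - hl\tilde L) \otimes I_n \end{bmatrix}
\begin{bmatrix} \tilde\delta_1(t) \\ \tilde\delta_2(t) \end{bmatrix}
- hl \begin{bmatrix} 0 & 0 \\ 0 & \tilde L \otimes I_n \end{bmatrix}
\begin{bmatrix} \tilde e_1(t) \\ \tilde e_2(t) \end{bmatrix} \\
&\quad - h g(t) \begin{bmatrix} I_n & 0 \\ 0 & \hat L \otimes I_n \end{bmatrix}
\begin{bmatrix} \nabla \tilde F_1(X(t)) \\ \nabla \tilde F_2(X(t)) \end{bmatrix},
\end{aligned} \tag{4.11}
$$
where
$$
\tilde L = \begin{pmatrix} \lambda_2 & & 0 \\ & \ddots & \\ 0 & & \lambda_N \end{pmatrix}, \qquad
\hat L = \begin{bmatrix} \varphi_2 \\ \vdots \\ \varphi_N \end{bmatrix} J_N [\varphi_2 \cdots \varphi_N].
$$
On the other hand, from $(T^{-1} \otimes I_n) \delta(t) = (T^T \otimes I_n) \delta(t) = \tilde\delta(t)$, one obtains $\tilde\delta_1(t) = (\zeta^T \otimes I_n) \delta(t)$. Noting that $\zeta^T 1_N = 1$, it is easy to get
$$
\begin{aligned}
\tilde\delta_1(t) &= \frac{1}{\|\zeta\|} (\zeta^T \otimes I_n)[X(t) - (J_N \otimes I_n) X(t)] \\
&= \frac{1}{\|\zeta\|} (\zeta^T \otimes I_n) X(t) - \frac{1}{\|\zeta\|} (\zeta^T \otimes I_n) X(t) = 0.
\end{aligned} \tag{4.12}
$$
Then one obtains
$$
\tilde\delta_2(t+1) = [(I_{N-1} - hl\tilde L) \otimes I_n] \tilde\delta_2(t) - hl (\tilde L \otimes I_n) \tilde e_2(t) - h g(t) (\hat L \otimes I_n) \nabla \tilde F_2(X(t)). \tag{4.13}
$$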
Before giving a key definition, we state the following assumption on the step-sizes.

Assumption 4.3 The step-sizes $\{g(t)\}$ form a positive sequence that satisfies
$$
\sup_{t \ge 0} g(t) \le 1, \quad \sum_{t=0}^{\infty} g(t) = +\infty \quad \text{and} \quad \sum_{t=0}^{\infty} (g(t))^2 < +\infty.
$$

Remark 4.2 On the one hand, the step-size function $g(t)$ increases the accuracy of the state estimation of each node. On the other hand, to achieve the optimization, the step-size function $g(t)$ ought to decrease to zero as $t$ increases to infinity, i.e., $\lim_{t \to \infty} g(t) = 0$. By the properties of the $p$-series, we can conclude that $\sum_{t=0}^{\infty} 1/(t+1)^k = +\infty$ and $\sum_{t=0}^{\infty} (1/(t+1)^k)^2 < +\infty$ for all $0.5 < k \le 1$. Since $0 < \sup_{t \ge 0} 1/(t+1)^k \le 1$ for $0.5 < k \le 1$, we can apply $g(t) = 1/(t+1)^k$, $0.5 < k \le 1$, to satisfy Assumption 4.3 in the forthcoming analysis.

Definition 4.3 Under some distributed control input $u_i(t)$, the distributed subgradient algorithm (4.2) is said to achieve consensus if $\lim_{t \to \infty} \|x_i(t) - x_j(t)\| = 0$, $\forall i, j = 1, 2, \ldots, N$, holds for any initial values.
4.4 Convergence Analysis

In this section, we first provide some supporting lemmas. Then we prove the convergence property of the optimization algorithm (4.2).
4.4.1 Auxiliary Results
Lemma 4.4 ([45]) All the eigenvalues of the matrix $\tilde L$ are positive if and only if the undirected graph network of all nodes is connected.

In the following, we first assume that all the eigenvalues of the matrix $\tilde L$ have arbitrary imaginary parts and positive real parts, which extends Lemma 4.4 to a more general situation.

Lemma 4.5 ([43]) Suppose that the undirected graph network is connected and $0 < h < 1$, $0 < l < \frac{2\alpha}{(\alpha^2 + \beta^2) h}$, where $\alpha$ and $\beta$ represent the real and imaginary parts of the eigenvalue of $\tilde L$. Then one has $\rho(I_{N-1} - hl\tilde L) < 1$, where $\rho(I_{N-1} - hl\tilde L)$ stands for the spectral radius of the matrix $I_{N-1} - hl\tilde L$.

Proof Letting $\lambda$ be any eigenvalue of $I_{N-1} - hl\tilde L$, then
$$
0 = |(\lambda - 1) I_{N-1} + hl\tilde L| = \left| \frac{\lambda - 1}{lh} I_{N-1} + \tilde L \right|. \tag{4.14}
$$
Let $\sigma = \alpha + \beta\sqrt{-1}$ be the eigenvalue of the matrix $\tilde L$. By (4.14), we have
$$
\frac{\lambda - 1}{lh} + \sigma = 0.
$$
Denote
$$
d(\lambda) = \lambda + lh\alpha + lh\beta\sqrt{-1} - 1 = a_1 \lambda + a_0. \tag{4.15}
$$
By Lemma 4.4, we have $\alpha > 0$. Noticing that $0 < h < 1$ and $0 < l < \frac{2\alpha}{(\alpha^2 + \beta^2) h}$, one can therefore compute that
$$
\Delta_1 = \begin{vmatrix} a_0 & a_1 \\ \bar a_1 & \bar a_0 \end{vmatrix} = lh (lh\alpha^2 - 2\alpha + lh\beta^2) < 0, \tag{4.16}
$$
where $\bar a_0$ and $\bar a_1$ are the complex conjugates of $a_0$ and $a_1$, respectively. Therefore, we have $|\lambda| < 1$ by applying the Schur-Cohn stability test [51]. Thus, $\rho(I_{N-1} - hl\tilde L) < 1$. The proof is therefore completed.

Lemma 4.6 If $\rho(I_{N-1} - hl\tilde L) < 1$, there exist positive constants $M \ge 1$ and $0 < \gamma < 1$ such that
$$
\|I_{N-1} - hl\tilde L\|^t \le M\gamma^t, \quad t \ge 0.
$$

Proof The proof follows that of [47].
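The gain condition of Lemma 4.5 can be illustrated numerically. For an undirected graph the eigenvalues of $\tilde L$ are real ($\beta = 0$), so the bound $2\alpha / ((\alpha^2 + \beta^2) h)$ reduces to $2 / (\alpha h)$ and the largest eigenvalue gives the binding case. The sketch below uses an assumed toy graph (a three-node path with unit weights) and hypothetical helper names.

```python
import numpy as np

def consensus_rate(Lap, h, l):
    """Spectral radius of I - h*l*L restricted to the non-zero eigenvalues of L."""
    eig = np.linalg.eigvalsh(Lap)          # ascending; eig[0] is numerically 0
    return float(np.max(np.abs(1.0 - h * l * eig[1:])))

def gain_upper_bound(Lap, h):
    """Bound 2 / (alpha * h) evaluated at the largest Laplacian eigenvalue."""
    return 2.0 / (np.linalg.eigvalsh(Lap)[-1] * h)

# Assumed toy graph: path on three nodes, Laplacian eigenvalues 0, 1, 3.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
Lap = np.diag(A.sum(axis=1)) - A
h = 0.5
l_max = gain_upper_bound(Lap, h)           # 2 / (3 * 0.5)
rho_ok = consensus_rate(Lap, h, l=1.0)     # gain inside the bound
rho_bad = consensus_rate(Lap, h, l=1.4)    # gain outside the bound
```

With the gain inside the bound the spectral radius is below one, so a pair $(M, \gamma)$ as in Lemma 4.6 exists; with the gain outside the bound the mode associated with the largest eigenvalue is no longer contractive.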
4.4.2 Main Results

We first show that the nodes reach a consensus asymptotically, which means that the node estimates $x_i(t)$ converge to the same point as $t$ goes to infinity.

Theorem 4.7 (Consensus) Let the connectivity Assumption 4.1, the subgradient boundedness Assumption 4.2, and the step-size Assumption 4.3 hold. Consider the distributed subgradient algorithm (4.2) with control input (4.3), where the triggering time sequence is determined by (4.4) with $\beta_1 \in (0, \frac{1}{N\|L\|})$, $\beta_2 \in (0, \infty)$. Then, the consensus for (4.2) can be achieved for $l \in (0, \frac{2\alpha}{(\alpha^2 + \beta^2) h})$ and $\mu \in (\gamma, 1)$.
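Proof It is immediately obtained from (4.13) that
$$
\begin{aligned}
\tilde\delta_2(t) &= [(I_{N-1} - hl\tilde L) \otimes I_n]^t \tilde\delta_2(0) - hl \sum_{s=0}^{t-1} [(I_{N-1} - hl\tilde L) \otimes I_n]^{t-s-1} (\tilde L \otimes I_n) \tilde e_2(s) \\
&\quad - h \sum_{s=0}^{t-1} g(s) [(I_{N-1} - hl\tilde L) \otimes I_n]^{t-s-1} (\hat L \otimes I_n) \nabla \tilde F_2(X(s)).
\end{aligned} \tag{4.17}
$$
Since the undirected graph network of these nodes is connected, by Lemma 4.5, we have $\rho(I_{N-1} - hl\tilde L) < 1$. Then it follows from Lemma 4.6 that there exist positive constants $M \ge 1$ and $0 < \gamma < 1$ such that $\|I_{N-1} - hl\tilde L\|^t \le M\gamma^t$, $t \ge 0$. Thus, from (4.17), we can get
$$
\begin{aligned}
\|\tilde\delta_2(t)\| &\le M\gamma^t \|\tilde\delta_2(0)\| + hl \|\tilde L\| \sum_{s=0}^{t-1} M\gamma^{t-s-1} \|\tilde e_2(s)\| \\
&\quad + h \|\hat L\| \sum_{s=0}^{t-1} \|g(s)\| M\gamma^{t-s-1} \|\nabla \tilde F_2(X(s))\|.
\end{aligned} \tag{4.18}
$$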
It is obvious that
$$
\|\delta(t+1)\| = \|\tilde\delta(t+1)\| = \|\tilde\delta_2(t+1)\|, \tag{4.19}
$$
$$
\|\tilde e_2(t)\| \le \|\tilde e(t)\| = \|e(t)\|, \tag{4.20}
$$
and
$$
\|\nabla \tilde F_2(X(t))\| \le \|\nabla \tilde F(X(t))\| = \|\nabla F(X(t))\|. \tag{4.21}
$$
Together with (4.19), (4.20), and (4.21), (4.18) further implies
$$
\begin{aligned}
\|\delta(t)\| &\le M\gamma^t \|\delta(0)\| + hl \|\tilde L\| \sum_{s=0}^{t-1} M\gamma^{t-s-1} \|e(s)\| \\
&\quad + h \|\hat L\| \sum_{s=0}^{t-1} \|g(s)\| M\gamma^{t-s-1} \|\nabla F(X(s))\|.
\end{aligned} \tag{4.22}
$$
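In addition, by the definition of the triggering time sequence, we deduce that $y_i(t) \le 0$, i.e.,
$$
\begin{aligned}
\|e_i(t)\| &\le \beta_1 \Big\| \sum_{j \in \mathcal{N}_i} a_{ij} \left( x_i(t_k^i) - x_j(t_{k'}^j) \right) \Big\| + \beta_2 \mu^t \\
&\le \beta_1 \Big\| \sum_{j \in \mathcal{N}_i} a_{ij} (x_i(t) - x_j(t)) \Big\| + \beta_1 \Big\| \sum_{j \in \mathcal{N}_i} a_{ij} (e_i(t) - e_j(t)) \Big\| + \beta_2 \mu^t \\
&\le \beta_1 \|L \otimes I_n\| \|\delta(t)\| + \beta_1 \|L \otimes I_n\| \|e(t)\| + \beta_2 \mu^t \\
&\le \beta_1 \|L\| \|\delta(t)\| + \beta_1 \|L\| \|e(t)\| + \beta_2 \mu^t.
\end{aligned} \tag{4.23}
$$
Then, we have
$$
\|e(t)\| \le N\beta_1 \|L\| \|\delta(t)\| + N\beta_1 \|L\| \|e(t)\| + N\beta_2 \mu^t.
$$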
Finally, for all 0 < β1
1,
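$$
\|\delta(t)\| < \eta Z \mu^t, \quad t \ge 0. \tag{4.27}
$$
Assuming (4.27) is not true, there must exist a $t^* > 0$ such that $\|\delta(t^*)\| \ge \eta Z \mu^{t^*}$ and $\|\delta(t)\| < \eta Z \mu^t$ for $t \in (0, t^*)$. Then, by (4.22), it is obtained that
$$
\begin{aligned}
\eta Z \mu^{t^*} &\le \|\delta(t^*)\| \\
&\le M\gamma^{t^*} \|\delta(0)\| + hlM \|\tilde L\| \sum_{s=0}^{t^*-1} \gamma^{t^*-s-1} \left( \frac{N\beta_1 \|L\|}{1 - N\beta_1 \|L\|} \|\delta(s)\| + \frac{N\beta_2}{1 - N\beta_1 \|L\|} \mu^s \right) \\
&\quad + hM \|\hat L\| \sum_{s=0}^{t^*-1} \|g(s)\| \gamma^{t^*-s-1} \|\nabla F(X(s))\| \\
&\le \eta M \gamma^{t^*} \left\{ \|\delta(0)\| + \frac{hl \|\tilde L\| N (\beta_1 \|L\| Z + \beta_2)}{1 - N\beta_1 \|L\|} \sum_{s=0}^{t^*-1} \frac{1}{\gamma} \frac{\mu^s}{\gamma^s} + h \|\hat L\| \sqrt{N} H \sum_{s=0}^{t^*-1} \frac{1}{\gamma} \frac{1}{\gamma^s} \right\} \\
&\le \eta M \left\{ \left[ \|\delta(0)\| - \frac{hl \|\tilde L\| N (\beta_1 \|L\| Z + \beta_2)}{(\mu - \gamma)(1 - N\beta_1 \|L\|)} - \frac{h \|\hat L\| \sqrt{N} H}{1 - \gamma} \right] \gamma^{t^*} + \frac{hl \|\tilde L\| N (\beta_1 \|L\| Z + \beta_2)}{(\mu - \gamma)(1 - N\beta_1 \|L\|)} \mu^{t^*} \right\}.
\end{aligned} \tag{4.28}
$$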
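In the above derivation, we have used the subgradient boundedness Assumption 4.2, $H = \max_i \{H_i\}$, the step-size Assumption 4.3, $\sup_{t \ge 0} g(t) \le 1$, and the sum formula of the geometric sequence. Then, we present the following two cases:

Case 1: $Z = M \left[ \|\delta(0)\| - \frac{h \|\hat L\| \sqrt{N} H}{1 - \gamma} \right]$, which implies that $\|\delta(0)\| - \frac{h \|\hat L\| \sqrt{N} H}{1 - \gamma} \ge \frac{hl \|\tilde L\| N (\beta_1 \|L\| Z + \beta_2)}{(\mu - \gamma)(1 - N\beta_1 \|L\|)}$. Then by (4.28), we can achieve that
$$
\begin{aligned}
\eta Z \mu^{t^*} &\le \|\delta(t^*)\| \\
&< \eta M \left\{ \left[ \|\delta(0)\| - \frac{hl \|\tilde L\| N (\beta_1 \|L\| Z + \beta_2)}{(\mu - \gamma)(1 - N\beta_1 \|L\|)} - \frac{h \|\hat L\| \sqrt{N} H}{1 - \gamma} \right] \mu^{t^*} + \frac{hl \|\tilde L\| N (\beta_1 \|L\| Z + \beta_2)}{(\mu - \gamma)(1 - N\beta_1 \|L\|)} \mu^{t^*} \right\} \\
&= \eta M \left[ \|\delta(0)\| - \frac{h \|\hat L\| \sqrt{N} H}{1 - \gamma} \right] \mu^{t^*} = \eta Z \mu^{t^*}.
\end{aligned} \tag{4.29}
$$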
Case 2: $Z = \frac{Mhl \|\tilde L\| N \beta_2}{(\mu - \gamma)(1 - N\beta_1 \|L\|) - Mhl \|\tilde L\| N \beta_1 \|L\|}$, which implies that $\|\delta(0)\| - \frac{h \|\hat L\| \sqrt{N} H}{1 - \gamma} < \frac{hl \|\tilde L\| N (\beta_1 \|L\| Z + \beta_2)}{(\mu - \gamma)(1 - N\beta_1 \|L\|)}$. Then, we have
$$
\eta Z \mu^{t^*} \le \|\delta(t^*)\| < \eta M \frac{hl \|\tilde L\| N (\beta_1 \|L\| Z + \beta_2)}{(\mu - \gamma)(1 - N\beta_1 \|L\|)} \mu^{t^*} = \eta Z \mu^{t^*}. \tag{4.30}
$$
The contradictions in (4.29) and (4.30) demonstrate that (4.27) is valid for any $\eta > 1$. Then, letting $\eta \to 1$, we obtain that inequality (4.26) holds, which further implies that the consensus of (4.2) can be achieved asymptotically. The proof is thus completed.
We now introduce an indispensable convergence result, which is stated in Lemma 4.8.

Lemma 4.8 ([52]) Let $\{\varphi(t)\}$ be a non-negative scalar sequence such that
$$
\varphi(t+1) \le (1 + v(t)) \varphi(t) - \phi(t) + u(t), \quad \text{for all } t \ge 0,
$$
where $v(t) \ge 0$, $\phi(t) \ge 0$ and $u(t) \ge 0$ for all $t \ge 0$ with $\sum_{t=0}^{\infty} v(t) < \infty$ and $\sum_{t=0}^{\infty} u(t) < \infty$. Then, the sequence $\{\varphi(t)\}$ converges to some $\varphi \ge 0$ and $\sum_{t=0}^{\infty} \phi(t) < \infty$.

In what follows, we present a lemma that is crucial in the analysis of the distributed optimization algorithm (4.2). Thereafter, we investigate the convergence behavior of the algorithm, where the optimal solution can be achieved asymptotically.

Lemma 4.9 ([48]) Consider an optimization problem $\min_{p \in \mathbb{R}^n} h(p)$, where $h : \mathbb{R}^n \to \mathbb{R}$ is a continuous objective function. Suppose that the optimal solution set $P^*$ of the above optimization problem is nonempty. Let $\{p(t)\}$ be a sequence such that, for all $p \in P^*$ and all $t \ge 0$,
$$
\|p(t+1) - p\|^2 \le (1 + v(t)) \|p(t) - p\|^2 - \alpha(t) (h(p(t)) - h(p)) + u(t),
$$
where $v(t) \ge 0$, $\alpha(t) \ge 0$ and $u(t) \ge 0$ for all $t \ge 0$ with $\sum_{t=0}^{\infty} v(t) < \infty$, $\sum_{t=0}^{\infty} \alpha(t) = \infty$ and $\sum_{t=0}^{\infty} u(t) < \infty$. Then the sequence $\{p(t)\}$ converges to a certain optimal solution $p^* \in P^*$.

Proof Letting $p = p^*$ for any $p^* \in P^*$ and denoting $h^* = \min_{p \in \mathbb{R}^n} h(p)$, it follows that for all $t \ge 0$,
$$
\|p(t+1) - p^*\|^2 \le (1 + v(t)) \|p(t) - p^*\|^2 - \alpha(t) (h(p(t)) - h^*) + u(t).
$$
Note that all the conditions of Lemma 4.8 hold. With the help of Lemma 4.8, we obtain the following statements:
$$
\{\|p(t) - p^*\|^2\} \text{ converges for each } p^* \in P^*, \tag{4.31}
$$
and
$$
\sum_{t=0}^{\infty} \alpha(t) (h(p(t)) - h^*) < \infty. \tag{4.32}
$$
Since $\sum_{t=0}^{\infty} \alpha(t) = \infty$, it is obtained from (4.32) that
$$
\liminf_{t \to \infty} h(p(t)) = h^*.
$$
Denoting a subsequence of $\{p(t)\}$ by $\{p_\ell(t)\}$, then
$$
\lim_{\ell \to \infty} h(p_\ell(t)) = \liminf_{t \to \infty} h(p(t)) = h^*. \tag{4.33}
$$
Recalling (4.31), it is clear that the sequence $\{p(t)\}$ is bounded. Without loss of generality, we can assume that $\{p_\ell(t)\}$ converges to some $\tilde p$. By continuity of $h$, we therefore obtain
$$
\lim_{\ell \to \infty} h(p_\ell(t)) = h(\tilde p). \tag{4.34}
$$
Thus, (4.31) and (4.34) jointly imply $\tilde p \in P^*$. By substituting $p^*$ in (4.31) with $\tilde p$, we obtain that $\{p(t)\}$ converges to $\tilde p$. The proof is thus completed.

Theorem 4.10 (Convergence Properties) Let the connectivity Assumption 4.1, the subgradient boundedness Assumption 4.2, and the step-size Assumption 4.3 hold. For the problem (4.1), consider the distributed subgradient algorithm (4.2) with distributed control input (4.3), where the triggering time sequence is decided by (4.4). Then, there exists an optimal solution $x^* \in X^*$ such that
$$
\lim_{t \to \infty} \|x_i(t) - x^*\| = 0, \quad \forall i \in \{1, \ldots, N\}.
$$
Proof Taking the average of (4.2) over all nodes, we can conclude that
$$
\frac{1}{N} \sum_{i=1}^{N} x_i(t+1) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} x_j(t) - hl \frac{1}{N} \sum_{i=1}^{N} \sum_{j \in \mathcal{N}_i} a_{ij} (e_i(t) - e_j(t)) - h \frac{1}{N} \sum_{i=1}^{N} g(t) \nabla f_i(x_i(t)). \tag{4.35}
$$
Since $\bar x(t) = \frac{1}{N} \sum_{i=1}^{N} x_i(t)$, the control law (4.35) can be rewritten as
$$
\bar x(t+1) = \bar x(t) - \frac{h g(t)}{N} \sum_{i=1}^{N} \nabla f_i(x_i(t)). \tag{4.36}
$$
Consider the sequence (4.36). Letting $x \in \mathbb{R}^n$ be an arbitrary vector, we have for all $t \ge 0$,
$$
\begin{aligned}
\|\bar x(t+1) - x\|^2 &= \Big\| \bar x(t) - \frac{h g(t)}{N} \sum_{i=1}^{N} \nabla f_i(x_i(t)) - x \Big\|^2 \\
&= \|\bar x(t) - x\|^2 - \frac{2 h g(t)}{N} \sum_{i=1}^{N} \nabla f_i(x_i(t))^T (\bar x(t) - x) + \frac{h^2 g^2(t)}{N^2} \Big\| \sum_{i=1}^{N} \nabla f_i(x_i(t)) \Big\|^2.
\end{aligned} \tag{4.37}
$$
Recalling that, under Assumption 4.2, the subgradients are bounded by $H_i$ and that $H = \max_i \{H_i\}$, we can derive that
$$
\|\bar x(t+1) - x\|^2 \le \|\bar x(t) - x\|^2 + h^2 H^2 g^2(t) - \frac{2 h g(t)}{N} \sum_{i=1}^{N} \nabla f_i(x_i(t))^T (\bar x(t) - x). \tag{4.38}
$$
Next, we analyze the cross-term $\nabla f_i(x_i(t))^T (\bar x(t) - x)$ in (4.38). Firstly, we write
$$
[\nabla f_i(x_i(t))]^T (\bar x(t) - x) = [\nabla f_i(x_i(t))]^T (\bar x(t) - x_i(t)) + [\nabla f_i(x_i(t))]^T (x_i(t) - x). \tag{4.39}
$$
Using the subgradient boundedness, we can take a lower bound on the first term $[\nabla f_i(x_i(t))]^T (\bar x(t) - x_i(t))$ as follows:
$$
[\nabla f_i(x_i(t))]^T (\bar x(t) - x_i(t)) \ge -\|\nabla f_i(x_i(t))\| \|\bar x(t) - x_i(t)\|. \tag{4.40}
$$
As for the second term $[\nabla f_i(x_i(t))]^T (x_i(t) - x)$, we apply the convexity of $f_i$ to obtain
$$
[\nabla f_i(x_i(t))]^T (x_i(t) - x) \ge f_i(x_i(t)) - f_i(x), \tag{4.41}
$$
from which, by applying the Lipschitz continuity of $f_i$ (implied by Assumption 4.2) and adding and subtracting $f_i(\bar x(t))$, we can further achieve
$$
\begin{aligned}
[\nabla f_i(x_i(t))]^T (\bar x(t) - x)
&\ge -\|\nabla f_i(x_i(t))\| \|\bar x(t) - x_i(t)\| + [f_i(x_i(t)) - f_i(\bar x(t))] + [f_i(\bar x(t)) - f_i(x)] \\
&\ge -\|\nabla f_i(x_i(t))\| \|\bar x(t) - x_i(t)\| + [\nabla f_i(\bar x(t))]^T (x_i(t) - \bar x(t)) + f_i(\bar x(t)) - f_i(x) \\
&\ge -\|\nabla f_i(x_i(t))\| \|\bar x(t) - x_i(t)\| - \|\nabla f_i(\bar x(t))\| \|x_i(t) - \bar x(t)\| + f_i(\bar x(t)) - f_i(x) \\
&= -(\|\nabla f_i(x_i(t))\| + \|\nabla f_i(\bar x(t))\|) \|\bar x(t) - x_i(t)\| + f_i(\bar x(t)) - f_i(x).
\end{aligned} \tag{4.42}
$$
Substituting (4.42) into (4.38) yields
$$
\begin{aligned}
\|\bar x(t+1) - x\|^2 &\le \frac{2 h g(t)}{N} \sum_{i=1}^{N} \left[ (\|\nabla f_i(x_i(t))\| + \|\nabla f_i(\bar x(t))\|) \|\bar x(t) - x_i(t)\| + f_i(x) - f_i(\bar x(t)) \right] \\
&\quad + \|\bar x(t) - x\|^2 + h^2 H^2 g^2(t) \\
&\le \|\bar x(t) - x\|^2 + \frac{4 h H g(t)}{N} \sum_{i=1}^{N} \|\bar x(t) - x_i(t)\| - \frac{2 h g(t)}{N} \sum_{i=1}^{N} (f_i(\bar x(t)) - f_i(x)) + h^2 H^2 g^2(t),
\end{aligned} \tag{4.43}
$$
where in the second inequality we use the subgradient boundedness of $f_i$. Now, we can apply (4.43) with $x = x^*$ for any $x^* \in X^*$ to obtain
$$
\|\bar x(t+1) - x^*\|^2 \le \|\bar x(t) - x^*\|^2 + \frac{4 h H g(t)}{N} \sum_{i=1}^{N} \|\bar x(t) - x_i(t)\| - \frac{2 h g(t)}{N} \sum_{i=1}^{N} (f_i(\bar x(t)) - f_i(x^*)) + h^2 H^2 g^2(t). \tag{4.44}
$$
Rearranging the above formula and applying $f(x) = \sum_{i=1}^{N} f_i(x)$, it follows that
$$
\frac{2h}{N} g(t) (f(\bar x(t)) - f^*) \le \|\bar x(t) - x^*\|^2 - \|\bar x(t+1) - x^*\|^2 + \frac{4 h H g(t)}{N} \sum_{i=1}^{N} \|\bar x(t) - x_i(t)\| + h^2 H^2 g^2(t), \tag{4.45}
$$
where $f^*$ is the optimal value. Summing (4.45) over $[0, \infty)$, dropping the negative term on the right-hand side and multiplying both sides by $N$, we obtain
$$
2h \sum_{t=0}^{\infty} g(t) (f(\bar x(t)) - f^*) \le N \|\bar x(0) - x^*\|^2 + N h^2 H^2 \sum_{t=0}^{\infty} g^2(t) + 4 h H \sum_{t=0}^{\infty} g(t) \sum_{i=1}^{N} \|\bar x(t) - x_i(t)\|. \tag{4.46}
$$
We are now in a position to study inequality (4.46). The right-hand side of (4.46) consists of three terms. For the first term, it is clear that
$$
N \|\bar x(0) - x^*\|^2 < \infty. \tag{4.47}
$$
Similarly, in view of the step-size Assumption 4.3, it follows that
$$
N h^2 H^2 \sum_{t=0}^{\infty} g^2(t) < \infty. \tag{4.48}
$$
For the last term of (4.46), it is obtained from the consensus result of Theorem 4.7 that
$$
\|x_i(t) - \bar x(t)\| \le Z \mu^t. \tag{4.49}
$$
Multiplying the above relation by $g(t)$ and summing from $t = 0$ to $\ell$ yields
$$
\sum_{t=0}^{\ell} g(t) \sum_{i=1}^{N} \|\bar x(t) - x_i(t)\| \le N Z \sum_{t=0}^{\ell} g(t) \mu^t. \tag{4.50}
$$
By using $g(t) \mu^t \le g^2(t) + \mu^{2t}$, we immediately have
$$
\sum_{t=0}^{\ell} g(t) \sum_{i=1}^{N} \|\bar x(t) - x_i(t)\| \le N Z \sum_{t=0}^{\ell} (g^2(t) + \mu^{2t}). \tag{4.51}
$$
By the geometric series formula and the step-size Assumption 4.3, we obtain $\sum_{t=0}^{\infty} \mu^{2t} = \frac{1}{1 - \mu^2}$ and $\sum_{t=0}^{\infty} g^2(t) < \infty$. Substituting (4.47), (4.48), and (4.51) back into (4.46) yields
$$
\sum_{t=0}^{\infty} g(t) (f(\bar x(t)) - f^*) < \infty. \tag{4.52}
$$
Since $\sum_{t=0}^{\infty} g(t) = \infty$ and $f(\bar x(t)) - f^* \ge 0$, it yields that
$$
\liminf_{t \to \infty} (f(\bar x(t)) - f^*) = 0. \tag{4.53}
$$
Thus, from (4.44) and $\sum_{t=0}^{\infty} g^2(t) < \infty$, we can derive that all the conditions of Lemma 4.9 are satisfied. With this lemma in hand, we can deduce that the average sequence $\{\bar x(t)\}$ asymptotically converges to an optimal solution $x^* \in X^*$. Recalling Theorem 4.7, each sequence $\{x_i(t)\}$, $i = 1, \ldots, N$, converges to the same optimal solution $x^*$. The proof is thus completed.
Remark 4.11 In this chapter, we only consider the diminishing step-size rule (Assumption 4.3), which makes the algorithm (4.2) converge to a common optimal solution. It is worth noting that the algorithm (4.2) and other similar algorithms can converge faster with a fixed (constant) step-size, but they then only converge to a neighborhood of the optimal solution set.
4.5 Numerical Examples

In this section, a numerical example is given to validate the practicability of the proposed algorithm and the correctness of the theoretical analysis in this chapter. Consider the general undirected graph $G = \{V, E, W = [w_{ij}]_{5 \times 5}\}$, where $E = \{(1, 2), (2, 1), (1, 3), (3, 1), (2, 3), (3, 2), (2, 4), (4, 2), (2, 5), (5, 2), (3, 4), (4, 3), (3, 5), (5, 3), (4, 5), (5, 4)\}$, $w_{12} = w_{21} = 0.54$, $w_{13} = w_{31} = 0.81$, $w_{23} = w_{32} = 0.72$, $w_{24} = w_{42} = 0.36$, $w_{25} = w_{52} = 0.72$, $w_{34} = w_{43} = 0.9$, $w_{35} = w_{53} = 0.36$, $w_{45} = w_{54} = 0.36$, and $w_{ij} = 0$ if $(i, j) \notin E$. In the following example, the undirected graph $G$ described above is employed.

Consider the optimization problem (4.1) with $f_i(x) = \sum_{j=1}^{3} H_i(x_{ij}, u_{ij})$ ($i = 1, 2, 3, 4, 5$), where $H_i(x_{ij}, u_{ij}) = (x_{ij} - u_{ij})^2 / 2$ if $|x_{ij} - u_{ij}| \le \epsilon$ and $H_i(x_{ij}, u_{ij}) = \epsilon (|x_{ij} - u_{ij}| - (1/2)\epsilon)$ otherwise. Here, $x_{ij}$ is the $j$-th entry of the vector $x_i$ and $u_{ij}$ is the corresponding element of the matrix $U = [u_{ij}]_{5 \times 3}$. Note that the local objective function $f_i(x)$ is not differentiable, and $x \in \mathbb{R}^3$ is a global decision vector. Moreover, we discuss the distributed subgradient algorithm (4.2) with control input (4.3), where the triggering time sequence is determined by (4.4). In the simulation, we choose the design parameters $\epsilon = 2$, $\beta_1 = e^{-2.3}$, $\beta_2 = 1/(1 - e^{-0.223})$, $h = 0.01$, $l = 1$, and the step-sizes $g(t) = 0.005/(t + 1)$, and we generate a random matrix $U = [u_{ij}]_{5 \times 3}$ such that the means of its columns are $-2$, $0$, $2$, respectively, and the variance of each column is 1. The simulation results of the algorithm (4.2) are shown in Figs. 4.1, 4.2, 4.3, and 4.4. The state evolutions of all nodes are shown in Fig. 4.1, from which one can see that all the nodes asymptotically reach the optimal solution within 3000 iterations. When the nodes achieve consensus, the distributed control input $u_i(t)$ tends to 0, which can be seen from Fig. 4.2.
The event-triggered sampling time instants of each node are shown in Fig. 4.3, from which we can observe that the updates of the control inputs are asynchronous. According to the statistics, the numbers of sampling times for the 5 nodes are [22, 28, 31, 41, 53], and the average number of sampling times is 35. Thus, the average update rate of the control inputs is 35/3000 = 1.17%. Figure 4.4 shows, for node 3, that the norm of the measurement error $\|e_3(t)\|$ asymptotically decreases to zero.

Fig. 4.1 All nodes' states $x_i(t)$
Fig. 4.2 Evolutions of all nodes' control inputs $u_i(t)$
Fig. 4.3 All nodes' sampling time instant sequences $\{t_k^i\}$
Fig. 4.4 Evolutions of measurement error and threshold for node 3
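The local objectives of this example can be sketched in code as follows, assuming they are the Huber-type functions described above with threshold value 2. The parameter name `eps` and the function names are chosen for this sketch and are assumptions; the closed-form (sub)gradient used here is the residual clipped to $[-\text{eps}, \text{eps}]$.

```python
import numpy as np

def huber_objective(x, u_i, eps=2.0):
    """f_i(x) = sum_j H(x_j, u_ij): quadratic near u_ij, linear in the tails."""
    r = np.abs(x - u_i)
    quad = 0.5 * r ** 2
    lin = eps * (r - 0.5 * eps)
    return float(np.sum(np.where(r <= eps, quad, lin)))

def huber_gradient(x, u_i, eps=2.0):
    """Entry-wise (sub)gradient: the residual x - u_i clipped to [-eps, eps]."""
    return np.clip(x - u_i, -eps, eps)

# Hand-checkable example (values are arbitrary, not the matrix U of the text).
x = np.array([0.0, 0.0, 5.0])
u_i = np.array([1.0, -3.0, 2.0])
value = huber_objective(x, u_i)      # 0.5 + 4.0 + 4.0
grad = huber_gradient(x, u_i)        # residuals (-1, 3, 3) clipped at 2
```

Because every gradient entry is bounded by `eps` in absolute value, these objectives satisfy a subgradient bound of the kind required by Assumption 4.2.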
4.6 Conclusion

In this chapter, a consensus-based first-order discrete-time multi-node system for solving the distributed convex optimization problem with event-triggered communication has been studied in detail. Based on the designed distributed event-triggered function and the triggering condition, the distributed control input has been constructed. It has been proven that the algorithm makes all nodes asymptotically converge to an optimal point. Moreover, the theoretical results have been demonstrated through a numerical example. Although we do not prove in this chapter that Zeno-like behavior of the triggering time sequence is excluded, this crucial problem will be addressed in future study. Future work should also include more complex constrained convex optimization problems and event-triggered communication among nodes in dynamic networks.
References

1. M. Porfiri, D. Roberson, D. Stilwell, Tracking and formation control of multiple autonomous agents: a two-level consensus approach. Automatica 43(8), 1318–1328 (2007)
2. L. Cheng, Y. Wang, W. Ren, Z.-G. Hou, M. Tan, Containment control of multi-agent systems with dynamic leaders based on a PIn-type approach. IEEE Trans. Cybern. 46(12), 3004–3017 (2016)
3. R. Olfati-Saber, R. Murray, Consensus problems in networks of agents with switching topology and time delays. IEEE Trans. Autom. Control 49(9), 1520–1533 (2004)
4. H. Li, G. Chen, T. Huang, Z. Dong, High performance consensus control in networked systems with limited bandwidth communication and time-varying directed topologies. IEEE Trans. Neural Netw. Learn. Syst. 28(5), 1043–1054 (2017)
5. H. Geng, Z. Chen, Z. Liu, Q. Zhang, Consensus of a heterogeneous multi-agent system with input saturation. Neurocomputing 166, 382–388 (2015)
6. X. Wu, Y. Tang, W. Zhang, Input-to-state stability of impulsive stochastic delayed systems under linear assumptions. Automatica 66, 195–204 (2016)
7. H. Chu, Y. Cai, W. Zhang, Consensus tracking for multi-agent systems with directed graph via distributed adaptive protocol. Neurocomputing 166, 8–13 (2015)
8. H. Li, G. Chen, X. Liao, T. Huang, Quantized data-based leader-following consensus of general discrete-time multi-agent systems. IEEE Trans. Circuits Syst. Express Briefs 63(4), 401–405 (2016)
9. G. Miao, Q. Ma, Group consensus of the first-order multi-agent systems with nonlinear input constraints. Neurocomputing 161, 113–119 (2015)
10. C. Huang, H. Li, D. Xia, L. Xiao, Distributed consensus of multi-agent systems over general directed networks with limited bandwidth communication. Neurocomputing 174, 681–688 (2016)
11. H. Li, X. Liao, T. Huang, W. Zhu, Y. Liu, Second-order global consensus in multiagent networks with random directional link failure. IEEE Trans. Neural Netw. Learn. Syst. 26(3), 565–575 (2015)
12. Y. Kang, D.-H. Zhai, G.-P. Liu, Y.-B. Zhao, P. Zhao, Stability analysis of a class of hybrid stochastic retarded systems under asynchronous switching. IEEE Trans. Autom. Control 59(6), 1511–1523 (2014)
13. A. Jadbabaie, J. Lin, A. Morse, Coordination of groups of mobile autonomous agents using nearest neighbor rules. IEEE Trans. Autom. Control 48(6), 988–1001 (2003)
14. T. Schetter, M. Campbell, D. Surka, Multiple agent-based autonomy for satellite constellations. Artif. Intell. 145(1), 147–180 (2003)
15. S. Xie, Y. Wang, Construction of tree network with limited delivery latency in homogeneous wireless sensor networks. Wirel. Pers. Commun. 78(1), 231–246 (2014)
16. S. Kar, J. Moura, Distributed average consensus in sensor networks with random link failures, in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing – ICASSP '07 (2007). https://doi.org/10.1109/ICASSP.2007.366410
17. S. Pereira, Z. Pages, Consensus in correlated random wireless sensor networks. IEEE Trans. Signal Process. 59(12), 6279–6284 (2011)
18. S. Jian, H.W. Tan, J. Wang, J.W. Wang, S.Y. Lee, A novel routing protocol providing good transmission reliability in underwater sensor networks. J. Int. Technol. 16(1), 171–178 (2015)
19. P. Guo, J. Wang, X.H. Geng, C.S. Kim, J.U. Kim, A variable threshold-value authentication architecture for wireless mesh networks. J. Int. Technol. 15(6), 929–936 (2014)
20. R. Olfati-Saber, J.S. Shamma, Consensus filters for sensor networks and distributed sensor fusion, in Proceedings of 44th IEEE Conference on Decision and Control (2005). https://doi.org/10.1109/CDC.2005.1583238
21. Z. Fu, K. Ren, J. Shu, X. Sun, F. Huang, Enabling personalized search over encrypted outsourced data with efficiency improvement. IEEE Trans. Parallel Distrib. Syst. 27(9), 2546–2559 (2016)
22. Y.J. Ren, J. Shen, J. Wang, J. Han, S.Y. Lee, Mutual verifiable provable data auditing in public cloud storage. J. Int. Technol. 16(2), 317–323 (2015)
23. Z. Xia, X. Wang, X. Sun, Q. Wang, A secure and dynamic multi-keyword ranked search scheme over encrypted cloud data. IEEE Trans. Parallel Distrib. Syst. 27(2), 340–352 (2016)
24. Z. Fu, X. Sun, Q. Liu, L. Zhou, J. Shu, Achieving efficient cloud search services: multikeyword ranked search over encrypted cloud data supporting parallel computing. IEICE Trans. Commun. 98(1), 190–200 (2015)
25. M. Cao, A. Morse, B. Anderson, Reaching a consensus in a dynamically changing environment: convergence rates, measurement delays, and asynchronous events. SIAM J. Control Optim. 47(2), 575–600 (2008)
26. Y. Fan, G. Feng, Y. Wang, C. Song, Distributed event-triggered control of multi-agent systems with combinational measurements. Automatica 49(2), 671–675 (2013)
27. K. Hamada, N. Hayashi, S. Takai, Event-triggered and self-triggered control for discrete-time average consensus problems. SICE J. Control Meas. Syst. Integr. 7(5), 297–303 (2014)
28. Y.P. Tian, Stability analysis and design of the second order congestion control for networks with heterogeneous delays. IEEE/ACM Trans. Netw. 13(5), 1082–1093 (2005)
29. B. Gu, V. Sheng, A robust regularization path algorithm for v-support vector classification. IEEE Trans. Neural Netw. Learn. Syst. 28(5), 1241–1248 (2017)
30. Z. Xia, X. Wang, X. Sun, Q. Liu, N. Xiong, Steganalysis of LSB matching using differences between nonadjacent pixels. Multimed. Tools Appl. 75(5), 1947–1962 (2016)
31. X. Wen, S. Ling, X. Yu, F. Wei, A rapid learning algorithm for vehicle classification. Inf. Sci. 295(1), 395–406 (2015)
32. T. Ma, J. Zhou, M. Tang, Y. Tian, A. Al-Dhelaan, M. Al-Rodhaan, S. Lee, Social network and tag sources based augmenting collaborative recommender system. IEICE Trans. Inf. Syst. 98(4), 902–910 (2015)
33. G. Chen, L. Luo, H. Shu, B. Chen, H. Zhang, Color image analysis by quaternion-type moments. J. Math. Imaging Vis. 51(1), 124–144 (2015)
34. J. Hu, X. Hu, Nonlinear filtering in target tracking using cooperative mobile sensors. Automatica 46(12), 2041–2046 (2010)
35. P. Tabuada, Event-triggered real-time scheduling of stabilizing control tasks. IEEE Trans. Autom. Control 52(9), 1680–1685 (2007) 36. H. Li, X. Liao, G. Chen, D.J. Hill, Z. Dong, T. Huang, Event-triggered asynchronous intermittent communication strategy for synchronization in complex dynamical networks. Neural Netw. 66, 1–10 (2015) 37. D. Dimarogonas, E. Frazzoli, K. Johansson, Distributed event-triggered control for multi-agent systems. IEEE Trans. Autom. Control 57(5), 1291–1297 (2012) 38. G. Seyboth, D. Dimarogonas, K. Johansson, Event-based broadcasting for multi-agent average consensus. Automatica 49(1), 245–252 (2013) 39. H. Li, G. Chen, T. Huang, Z. Dong, W. Zhu, L. Gao, Event-triggered distributed consensus over directed digital networks with limited bandwidth. IEEE Tran. Cybern. 46(12), 3098–3110 (2016) 40. W. Zhu, Z. Jiang, G. Feng, Event-based consensus of multi-agent systems with general linear models. Automatica 50(2), 552–558 (2014) 41. X. Chen, F. Hao, Event-triggered average consensus control for discrete-time multi-agent systems. IET Contr. Theory Appl. 16(6), 2493–2498 (2012) 42. D. Ding, Z. Wang, B. Shen, Event-triggered consensus control for a class of discrete-time stochastic multi-agent systems, in Proceeding of the 11th World Congress on Intelligent Control and Automation (2014). https://doi.org/10.1109/WCICA.2014.7052731 43. H. Pu, W. Zhu, D. Wang, Consensus analysis of first-order discrete-time multi-agent systems with time delay: an event-based approach, in 2016 35th Chinese Control Conference (CCC) (2016). https://doi.org/10.1109/ChiCC.2016.7554623 44. X. Yin, D. Yue, S. Hu, Distributed event-triggered control of discrete-time heterogeneous multi-agent systems. J. Frankl. Inst. 350(3), 651–669 (2013) 45. W. Zhu, Z. Tian, Event-based consensus of first-order discrete time multi-agent systems, in 2016 12th World Congress on Intelligent Control and Automation (WCICA) (2016). https:// doi.org/10.1109/WCICA.2016.7578796 46. W. 
Du, Sunney Y. Leung, Y. Tang, A. Vasilakos, Differential evolution with event-triggered impulsive control. IEEE Trans. Cybern. 47(1), 244–257 (2016) 47. D. Yang, X. Liu, W. Chen, Periodic event/self-triggered consensus for general continuous-time linear multi-agent systems under general directed graphs. IET Control Theory Appl. 9(3), 428– 440 (2015) 48. A. Nedic, A. Ozdaglar, Distributed optimization over time-varying directed graphs. IEEE Trans. Autom. Control 60(3), 601–615 (2015) 49. Y. Lou, G. Shi, K. H. Johansson, Y. Hong, Approximate projected consensus for convex intersection computation: convergence analysis and critical error angle. IEEE Trans. Autom. Control 59(7), 1722–1736 (2014) 50. M. Zhu, S. Martínez, On distributed convex optimization under inequality and equality constraints via primal-dual subgradient methods (2010). Preprint. arXiv: 1001.2612 51. B.C. Kuo, Discrete-Data Control System (Prentice Hall, Hoboken, 1970) 52. I. Lobel, A. Ozdaglar, D. Feijer, Distributed multi-agent optimization with state-dependent communication. Math. Program. 129(2), 255–284 (2011) 53. H. Li, C. Guo, T. Huang, Z. Wei, X. Li, Event-triggered consensus in nonlinear multi-agent systems with nonlinear dynamics and directed network topology. Neurocomputing 185, 105– 112 (2016)
Chapter 5
Event-Triggered Acceleration Algorithms for Distributed Stochastic Optimization
Abstract In this chapter, we explore how to improve both the communication and the computational efficiency of distributed optimization. The problem under study remains that of minimizing a finite sum of convex cost functions over the nodes of a network, where each cost function is further the average of several constituent functions. To the best of our knowledge, no existing method improves communication efficiency and computational efficiency simultaneously. To achieve this goal, we introduce an event-triggered distributed accelerated stochastic gradient algorithm, namely ET-DASG. ET-DASG improves communication efficiency through an event-triggered strategy, improves computational efficiency by using SAGA’s variance-reduction technique, and accelerates convergence by using Nesterov’s acceleration mechanism. Furthermore, we provide a convergence analysis showing that ET-DASG converges in the mean to the exact optimal solution with a well-selected constant step-size. Also, thanks to the gradient tracking scheme, the algorithm achieves a linear convergence rate when each constituent function is strongly convex and smooth. Moreover, under certain conditions, we prove that the time interval between two successive triggering instants is larger than the iteration interval for each node. Finally, we confirm the performance of ET-DASG through simulation results.
Keywords Distributed optimization · Stochastic algorithm · Event-triggered · Variance reduction · Nesterov’s acceleration
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Q. Lü et al., Distributed Optimization in Networked Systems, Wireless Networks, https://doi.org/10.1007/978-981-19-8559-1_5
5.1 Introduction
The emergence of networked control systems has brought with it an urgent requirement for efficient communication and computing technologies. Distributed optimization can tackle the interaction of multiple nodes on a network and has a broad application in machine learning [1, 2], resource allocation [3, 4], data analysis
[5], privacy masking [6], and signal processing [7] due to its ability to parallelize computation and to spare nodes from sharing private data. Distributed algorithms usually follow an iterative process in which nodes in the network store estimates of the decision vector, exchange this information with neighboring nodes, and update their estimates based on the information received. The literature on such distributed schemes includes the early work on distributed gradient descent (DGD) [8] and its various extensions in achieving efficiency [9, 10], handling constraints [11, 12], applying to complex networks [13, 14], or performing acceleration [15, 16]. These optimization methods demonstrated their effectiveness in dealing with problems in a distributed manner [8–16]. Nonetheless, although these methods are intuitive and flexible with respect to cost functions and networks, their convergence rates are particularly slow in comparison with those of their centralized counterparts. Besides, with constant step-sizes, DGD-based methods attain linear convergence rates only to a sub-optimal point [17]. Therefore, from an optimization point of view, it is always a priority to propose and analyze methods that are comparable in performance to centralized counterparts in terms of convergence rate.
In a recent stream of literature, distributed gradient methods that overcome this exactness-rate dilemma have been proposed, which achieve exact linear convergence for smooth and strongly convex cost functions. Instances of such methods, including methods based on gradient tracking [18–28], methods based on Lagrangian multipliers [29–32], and methods based on dual decomposition [33–36], are characterized by various mechanisms. Toward practical optimization models, momentum acceleration approaches have been successfully and widely used in optimization techniques, and they are conducive to the convergence of DGD-based methods [16, 37–42].
First-order optimization methods based on momentum acceleration have been of significance in the machine learning community because of their better scalability for large-scale tasks (including deep learning, federated learning, etc.) and good performance in practice. When solving convex or strongly convex optimization problems, many momentum approaches have emerged, e.g., the Nesterov’s acceleration mechanism in [16, 37–39] and the heavy-ball mechanism in [40–42], which both ensure that nodes obtain more information from neighbors in the network than methods with no momentum, and have been proven to largely improve the convergence rate of gradient-based methods. Despite their theoretical advantages, momentum acceleration mechanisms do not fully exploit the performance of related methods in terms of efficiency, e.g., communication and computation. For example, in machine learning, the accuracy of a model can be improved by increasing the parameter scale and the training dataset. However, this leads to a substantial increase in training time, which results in low communication and computation efficiency. Therefore, exploiting valid techniques to achieve provable efficiency becomes a new challenge for researchers.
To improve communication efficiency while maintaining the desired performance of the network, various types of strategies have recently been proposed and have gained popularity in the existing works, e.g., [43–46]. The emergence of the event-triggered strategy provides a new perspective for collecting and transmitting
information. The main idea behind the event-triggered strategy is that nodes only take actions when necessary, that is, only when a measurement of the local node’s state error reaches a specified threshold. Its advantage is that the desired properties of the network can still be maintained efficiently. There are many works on distributed event-triggered methods over networks, which can successfully solve various practical problems and achieve the expected results [43–46]. For example, the distributed event-triggered algorithms proposed in [43] have been utilized to resolve constrained optimization problems, and the event-triggered distributed gradient tracking algorithms in [44] have been proven to converge linearly to the optimal solution and have further been extended to the distributed energy management problem of smart grids in [46].
In the era of big data, nodes in the network may process large and complex data in the process of information sharing and calculation [5]. Thus, the above methods bear a heavy computational burden. To reduce this burden while keeping the computation simple, approximating the true gradient with a stochastic gradient is a common solution at present. Based on this scheme, related stochastic gradient methods have emerged [47–49]. However, because of the large variance of the stochastic gradient, these approaches exhibit weaker convergence. By decreasing the variance of the stochastic gradient, many centralized stochastic methods [50–54] adopt various variance-reduction techniques to surmount this shortcoming and improve convergence. Inspired by Konecny et al. [50], Schmidt et al. [51], Defazio et al. [52], Tan et al. [53], and Nguyen et al. [54], many distributed variance-reduced approaches [2, 55–60] have been extensively investigated, and their performance in processing machine learning tasks is better than that of their centralized counterparts.
In this chapter, we focus on promoting the execution (i.e., communication and computation) efficiency and accelerating the convergence of distributed optimization in machine learning tasks. As far as the authors know, no existing work achieves these targets simultaneously. Specifically, the main contributions of this work are as follows:
(i) A novel event-triggered distributed accelerated stochastic gradient algorithm, namely ET-DASG, is proposed to solve machine learning tasks. ET-DASG with a well-selected constant step-size converges linearly in the mean to the exact optimal solution if each constituent function is strongly convex and smooth.
(ii) Unlike the time-triggered methods [38–42], ET-DASG utilizes the event-triggered strategy, which effectively avoids frequent real-time communication, reduces the communication load, and thus improves communication efficiency. Furthermore, for each node, we prove that the time interval between two successive triggering instants is larger than the iteration interval.
(iii) Compared with the existing methods [38–46], ET-DASG achieves higher computation efficiency by means of the variance-reduction technique. In particular, at each iteration, ET-DASG only employs the gradient of one randomly selected constituent function and adopts the unbiased stochastic average gradient (SAGA) to estimate the local gradients, which greatly reduces the expense of full gradient evaluations.
(iv) In comparison with the existing methods without a momentum acceleration mechanism [19–22, 44, 45], ET-DASG achieves accelerated convergence with the help of the Nesterov’s acceleration mechanism. Moreover, simulation results verify that the convergence rate of ET-DASG improves as the momentum coefficient increases.
5.2 Preliminaries
5.2.1 Notation
If not particularly stated, the vectors mentioned in this chapter are column vectors. Let $\mathbb{R}^p$ and $\mathbb{R}^{p \times q}$ denote the real Euclidean spaces with dimensions $p$ and $p \times q$, respectively. The identity matrix and the spectral radius of a matrix are represented as $I_p \in \mathbb{R}^{p \times p}$ and $\rho(\cdot)$, respectively. The Euclidean norm and the Kronecker product are represented as $\|\cdot\|$ and $\otimes$, respectively. Let $\mathbb{E}[\cdot]$ denote the expectation of a random variable. Let $\nabla f(\cdot)$ and $(\cdot)^T$ denote the gradient of a function $f$ and the transpose of vectors (matrices), respectively. The $p$-dimensional vectors with all ones and all zeros are represented as $1_p$ and $0_p$, respectively.
5.2.2 Model of Optimization Problem
This chapter focuses on optimizing a finite-sum cost function in machine learning, which can be described as
$$
\min_{x \in \mathbb{R}^p} f(x) = \frac{1}{m} \sum_{i=1}^{m} f^i(x), \quad f^i(x) = \frac{1}{n^i} \sum_{h=1}^{n^i} f^{i,h}(x), \tag{5.1}
$$
where $x \in \mathbb{R}^p$ is the optimization estimator (decision vector) and $f^i : \mathbb{R}^p \to \mathbb{R}$ is a convex function that we view as the private cost of node $i$, which is represented as the average of $n^i$ constituent functions $f^{i,h}$. In addition, we make the following assumption regarding the constituent functions.
Assumption 5.1 Each local constituent function $f^{i,h}$, $i \in \{1, \ldots, m\}$, $h \in \{1, \ldots, n^i\}$, is $\kappa_1$-strongly convex and $\kappa_2$-smooth, i.e., for any $a, b \in \mathbb{R}^p$:
(i) $f^{i,h}(a + b) \ge f^{i,h}(a) + \nabla f^{i,h}(a)^T b + \frac{\kappa_1}{2} \|b\|^2$;
(ii) $\|\nabla f^{i,h}(a + b) - \nabla f^{i,h}(a)\| \le \kappa_2 \|b\|$.
Notice from Assumption 5.1 that problem (5.1) possesses a unique optimal solution $x^*$ and that the global cost function $f$ is also $\kappa_1$-strongly convex and $\kappa_2$-smooth, where $0 < \kappa_1 \le \kappa_2$. In addition, the condition number of the global cost function $f$ is defined as $\gamma = \kappa_2 / \kappa_1$.
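As a concrete illustration of Assumption 5.1 and the condition number, consider ridge-regularized least-squares constituents $f^{i,h}(x) = \frac{1}{2}(a^T x - b)^2 + \frac{\lambda}{2}\|x\|^2$. This family is an assumed example and is not prescribed by the chapter. Each Hessian $a a^T + \lambda I_p$ has eigenvalues no smaller than $\lambda$ and no larger than $\lambda + \|a\|^2$, so one admissible choice of constants is $\kappa_1 = \lambda$ and $\kappa_2 = \lambda + \max \|a\|^2$. The helper name `ridge_constants` below is hypothetical.

```python
import numpy as np

def ridge_constants(features, lam):
    """Return (kappa1, kappa2, gamma) for f(x) = 0.5*(a^T x - b)^2 + 0.5*lam*||x||^2.

    features: array of shape (num_samples, p), one row `a` per constituent function.
    lam: positive regularization weight (an assumption of this example).
    """
    sq_norms = np.sum(np.asarray(features, dtype=float) ** 2, axis=1)
    kappa1 = lam                      # every Hessian a a^T + lam*I is at least lam*I
    kappa2 = lam + sq_norms.max()     # largest Hessian eigenvalue over all constituents
    return kappa1, kappa2, kappa2 / kappa1
```

A large ratio $\gamma$ signals an ill-conditioned problem, for which the admissible step-size range in the later analysis becomes narrower.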
5.2.3 Communication Network
We aim to solve (5.1) over an undirected network (graph) $\mathcal{G} = \{\mathcal{V}, \mathcal{E}, \hat{A}\}$, where $\mathcal{V} = \{1, 2, \ldots, m\}$ is the set of nodes, $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of edges containing the interactions among all nodes, and $\hat{A} = [a_{ij}] \in \mathbb{R}^{m \times m}$ is the (symmetric) weight matrix. The edge $(i, j) \in \mathcal{E}$ if node $j$ can directly exchange data with node $i$. If $(i, j) \in \mathcal{E}$, then $a_{ij} > 0$; otherwise, $a_{ij} = 0$. Let $\mathcal{N}^i = \{j \mid a_{ij} > 0\}$ denote the set of neighbors of node $i$. Furthermore, we make the following assumption regarding the network.
Assumption 5.2 (i) $\mathcal{G}$ is undirected and connected; (ii) $\hat{A} = [a_{ij}] \in \mathbb{R}^{m \times m}$ is primitive and doubly stochastic.
Assumption 5.2 indicates that the second largest singular value $\kappa_3$ of $\hat{A}$ is less than 1, i.e., $\kappa_3 = \|\hat{A} - (1/m) 1_m 1_m^T\| < 1$ [19, 20, 22, 25, 44].
Remark 5.1 Assumptions 5.1 and 5.2 are very general and easily satisfied in many machine learning tasks that can usually be expressed as problem (5.1). These two assumptions allow us to design a distributed linearly convergent algorithm that solves practical applications exactly. In addition, when training a machine learning model, a single computing node may need to hold a large amount of local data ($n^i \gg 1$), but the limited memory of the computing node causes a significant increase in training time as well as in the amount of communication and calculation. However, it is expensive to improve the computing and communication capabilities of a single piece of hardware. Hence, designing a novel event-triggered distributed accelerated stochastic gradient algorithm is of great significance.
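One common way to obtain a weight matrix satisfying Assumption 5.2(ii) on a connected undirected graph is the Metropolis rule. The chapter does not prescribe this rule, so the sketch below is an assumed choice, and the function names `metropolis_weights` and `kappa3` are hypothetical. The second function evaluates $\kappa_3 = \|\hat{A} - (1/m) 1_m 1_m^T\|$ numerically.

```python
import numpy as np

def metropolis_weights(edges, m):
    """Symmetric doubly stochastic weights for an undirected graph on nodes 0..m-1.

    edges: list of pairs (i, j) with i != j, each undirected edge listed once.
    """
    deg = np.zeros(m, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    W = np.zeros((m, m))
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    # Self-weights are chosen so that every row (and, by symmetry, column) sums to 1.
    W[np.diag_indices(m)] = 1.0 - W.sum(axis=1)
    return W

def kappa3(W):
    """Spectral norm of W - (1/m) 1 1^T, i.e. the quantity called kappa_3 above."""
    m = W.shape[0]
    return np.linalg.norm(W - np.ones((m, m)) / m, 2)
```

For a connected graph, the Metropolis self-weights are strictly positive, which is why the resulting matrix is primitive and `kappa3` returns a value below 1.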
5.3 Algorithm Development
In this section, the event-triggered communication strategy is introduced. Then, a novel event-triggered distributed accelerated stochastic gradient algorithm (ET-DASG) is developed.
5.3.1 Event-Triggered Communication Strategy
In this subsection, we focus on designing an event-triggered strategy, where each node can determine online when to broadcast its current estimators to its neighbors by testing a certain triggering condition. Before introducing the event-triggered strategy, we first denote by $t_k^i$ the $k$-th triggering time of the $i$-th node, where $i \in \mathcal{V}$. Assume that $\hat{x}_t^i$ and $\hat{y}_t^i$ are the estimators¹ that node $i$ transmits to its neighbors at the latest triggering time before time $t$, i.e.,
$$
\begin{cases}
\hat{x}_t^i = x^i_{t^i_{k(i,t)}} & \text{for } t^i_{k(i,t)} \le t < t^i_{k(i,t)+1}, \\
\hat{y}_t^i = y^i_{t^i_{k(i,t)}} & \text{for } t^i_{k(i,t)} \le t < t^i_{k(i,t)+1},
\end{cases}
$$
where $x_t^i$ and $y_t^i$ are two estimators of node $i$. Moreover, we suppose that all the nodes broadcast their estimators $x_t^i$ and $y_t^i$ at the initial time, i.e., $\hat{x}_0^i = x_0^i$ and $\hat{y}_0^i = y_0^i$ for all $i \in \mathcal{V}$. In addition, the next triggering time $t^i_{k(i,t)+1}$ after $t^i_{k(i,t)}$ for node $i \in \mathcal{V}$ is determined by
$$
t^i_{k(i,t)+1} = \inf \left\{ t \,\middle|\, t > t^i_{k(i,t)},\ \|\varepsilon_t^{i,x}\|^2 + \|\varepsilon_t^{i,y}\|^2 > C \kappa_4^t \right\}, \tag{5.2}
$$
where $C \kappa_4^t$ is the event-triggered threshold with parameters $C > 0$ and $0 < \kappa_4 < 1$, and $\varepsilon_t^{i,x}$, $\varepsilon_t^{i,y}$ are the measurement errors, which are defined by
$$
\varepsilon_t^{i,x} = \hat{x}_t^i - x_t^i, \quad \varepsilon_t^{i,y} = \hat{y}_t^i - y_t^i. \tag{5.3}
$$
Remark 5.2 The emergence of the event-triggered strategy provides a new perspective for information sampling and transmission. In particular, once node $i$ receives the transmitted estimators $(\hat{x}_t^j, \hat{y}_t^j)$ from its neighbors $j \in \mathcal{N}^i$, node $i$ first replaces the neighbor’s estimators stored in the local memory, and then updates its next-time estimators $(x_{t+1}^i, y_{t+1}^i)$ based on its current information $(x_t^i, y_t^i)$. Note that only when an event occurs at node $i$, i.e., the triggering condition in (5.2) is met, are the estimators $x_t^i$ and $y_t^i$ broadcast to its neighbors. From the communication perspective, the event-triggered strategy avoids real-time communication with neighbors, which plays an essential role in reducing the communication load and realizing better communication efficiency compared with other time-triggered methods [38–42, 55–60].
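The triggering test in (5.2) with the measurement errors (5.3) can be sketched for a single node as follows. The function name `should_trigger` and the default values of `C` and `kappa4` are illustrative assumptions and are not taken from the chapter.

```python
import numpy as np

def should_trigger(x, y, x_hat, y_hat, t, C=1.0, kappa4=0.9):
    """Return True when node i must broadcast at time t according to (5.2).

    x, y: current local estimators (1-D arrays).
    x_hat, y_hat: estimators sent at the latest triggering time (1-D arrays).
    """
    eps_x = np.asarray(x_hat, dtype=float) - np.asarray(x, dtype=float)   # (5.3)
    eps_y = np.asarray(y_hat, dtype=float) - np.asarray(y, dtype=float)   # (5.3)
    # The threshold C * kappa4**t decays geometrically, so the test tightens over time.
    return float(eps_x @ eps_x + eps_y @ eps_y) > C * kappa4 ** t
```

Because the threshold shrinks with $t$, a fixed discrepancy that is tolerated early on eventually forces a broadcast.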
¹ In the methods based on the event-triggered strategy, the local estimators $\hat{x}_t^i$ and $\hat{y}_t^i$ are determined by the node’s own estimators and the latest information sent from its neighbors $j \in \mathcal{N}^i$ (at the latest triggering time of node $j$ before $t$).
5.3.2 Event-Triggered Distributed Accelerated Stochastic Gradient Algorithm (ET-DASG)
In this subsection, we introduce the event-triggered distributed accelerated stochastic gradient algorithm, named ET-DASG, to solve problem (5.1). To motivate the design of ET-DASG, recall that methods based on the event-triggered communication strategy can achieve better communication efficiency. On the other hand, the variance-reduction technique adopted in the existing approaches can effectively alleviate the computation burden of local full gradient evaluations and promote computation efficiency. In addition, accelerated linear convergence is also an important requirement for machine learning methods. Therefore, we propose ET-DASG, which applies the event-triggered strategy to realize better communication efficiency, applies the variance-reduction technique of SAGA to achieve higher computation efficiency, and meanwhile leverages the Nesterov’s acceleration mechanism and the gradient tracking technique to implement accelerated linear convergence. The details of ET-DASG are given in Algorithm 3.²
To implement Algorithm 3 locally, we first assume that each node $i \in \mathcal{V}$ has a gradient table containing all gradients $\nabla f^{i,h}$, $\forall h \in \{1, \ldots, n^i\}$, with respect to some estimators (such as the local accelerated estimator $s^i$ or the local auxiliary estimator $e^{i,h}$). At each iteration $t + 1$, each node $i$ first updates the local decision estimator $x_{t+1}^i$ and the local accelerated estimator $s_{t+1}^i$ as in step 3. Then, each node $i$ uniformly and randomly chooses one label, indexed by $\chi_{t+1}^i \in \{1, \ldots, n^i\}$, from its own data batch and updates the local stochastic gradient $g_{t+1}^i$ as in step 5. After updating $g_{t+1}^i$, the local auxiliary estimator $e_{t+2}^{i,\chi_{t+1}^i}$ is assigned the value of the local accelerated estimator $s_{t+1}^i$, and the entry in position $\chi_{t+1}^i$ of the gradient table is replaced by the newly computed constituent gradient $\nabla f^{i,\chi_{t+1}^i}(s_{t+1}^i)$, while the other entries are kept unchanged. Then, the local auxiliary estimator $y_{t+1}^i$ is updated to track the variance-reduced gradient. Subsequently, each node $i$ calculates the measurement errors $\varepsilon_t^{i,x}$, $\varepsilon_t^{i,y}$ in (5.3) and then tests the triggering condition in (5.2). Finally, node $i$ broadcasts $x_{t+1}^i$ and $y_{t+1}^i$ to its neighbors $j \in \mathcal{N}^i$ and updates the latest triggering time if the condition is satisfied, i.e., the event is triggered; otherwise, it keeps the estimators $\hat{x}^i$ and $\hat{y}^i$ unchanged.
2 Assume that there is no information transmission and calculation at time t = 0, which further means that the event-triggered and stochastic gradient process are not executed.
Algorithm 3 Event-triggered distributed accelerated stochastic gradient algorithm (ET-DASG) for node $i \in \mathcal{V}$
1: Initialization: Each node $i$ initializes $\hat{x}_0^i = x_0^i \in \mathbb{R}^p$, $e_1^{i,h} = s_0^i \in \mathbb{R}^p$, $\forall h \in \{1, \ldots, n^i\}$, and $\hat{y}_0^i = y_0^i = g_0^i = \nabla f^i(s_0^i) \in \mathbb{R}^p$.
2: for $t = 0, 1, 2, \ldots$ do
3: Update estimators $x_{t+1}^i$ and $s_{t+1}^i$ according to:
$$
x_{t+1}^i = x_t^i + \sum_{j=1}^{m} a_{ij} \left( \hat{x}_t^j - \hat{x}_t^i \right) - \eta y_t^i, \tag{5.4a}
$$
$$
s_{t+1}^i = x_{t+1}^i + \alpha \left( x_{t+1}^i - x_t^i \right), \tag{5.4b}
$$
where $a_{ij}$ is the weight between $i$ and $j$, the step-size $\eta > 0$, and the momentum coefficient $0 < \alpha < 1$.
4: Choose $\chi_{t+1}^i$ uniformly and randomly from $\{1, \ldots, n^i\}$.
5: Update $g_{t+1}^i$ according to:
$$
g_{t+1}^i = \nabla f^{i,\chi_{t+1}^i}\left( s_{t+1}^i \right) - \nabla f^{i,\chi_{t+1}^i}\left( e_{t+1}^{i,\chi_{t+1}^i} \right) + \frac{1}{n^i} \sum_{h=1}^{n^i} \nabla f^{i,h}\left( e_{t+1}^{i,h} \right).
$$
6: Take $e_{t+2}^{i,\chi_{t+1}^i} = s_{t+1}^i$ and replace $\nabla f^{i,\chi_{t+1}^i}(e_{t+1}^{i,\chi_{t+1}^i})$ by $\nabla f^{i,\chi_{t+1}^i}(s_{t+1}^i)$ in the $\chi_{t+1}^i$-th gradient table position. All other estimators $e_{t+2}^{i,h}$, $\forall h \ne \chi_{t+1}^i$, and gradient entries in the table are kept unchanged, i.e., $e_{t+2}^{i,h} = e_{t+1}^{i,h}$ and $\nabla f^{i,h}(e_{t+2}^{i,h}) = \nabla f^{i,h}(e_{t+1}^{i,h})$ for all $h \ne \chi_{t+1}^i$.
7: Update estimator $y_{t+1}^i$ according to:
$$
y_{t+1}^i = y_t^i + \sum_{j=1}^{m} a_{ij} \left( \hat{y}_t^j - \hat{y}_t^i \right) + g_{t+1}^i - g_t^i. \tag{5.4c}
$$
8: Calculate the measurement errors $\varepsilon_t^{i,x}$, $\varepsilon_t^{i,y}$ in (5.3), and then test the triggering condition in (5.2).
9: if the triggering condition in (5.2) is satisfied then
10: Broadcast $x_{t+1}^i$ and $y_{t+1}^i$ to its neighbors $j \in \mathcal{N}^i$, and update the latest triggering time.
11: end if
12: end for
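The steps of Algorithm 3 can be sketched in NumPy for all nodes at once. This is a minimal sketch under stated assumptions, not the authors' implementation: the constituent functions are taken to be ridge-regularized least squares, every node holds the same number $n$ of samples, $s_0^i = x_0^i = 0$, the trigger test is evaluated at the new time index $t + 1$, and all parameter defaults and the name `et_dasg` are hypothetical.

```python
import numpy as np

def et_dasg(feats, targets, W, eta=0.05, alpha=0.1, C=1.0, kappa4=0.95,
            lam=0.1, T=500, seed=0):
    """Sketch of Algorithm 3 for f^{i,h}(z) = 0.5*(a^T z - b)^2 + 0.5*lam*||z||^2.

    feats: (m, n, p) array, targets: (m, n) array, W: (m, m) doubly stochastic.
    Returns the final x (m, p), y (m, p), g (m, p) and the number of broadcasts.
    """
    rng = np.random.default_rng(seed)
    m, n, p = feats.shape

    def grad(i, h, z):
        a = feats[i, h]
        return a * (a @ z - targets[i, h]) + lam * z

    x = np.zeros((m, p))                       # x_0^i (assumed zero start)
    # Gradient table: entry (i, h) stores grad f^{i,h}(e^{i,h}); e_1^{i,h} = s_0^i = x_0^i.
    table = np.array([[grad(i, h, x[i]) for h in range(n)] for i in range(m)])
    g = table.mean(axis=1)                     # g_0^i = grad f^i(s_0^i)
    y = g.copy()                               # y_0^i = g_0^i
    x_hat, y_hat = x.copy(), y.copy()          # last broadcast values
    broadcasts = 0
    for t in range(T):
        x_new = x + W @ x_hat - x_hat - eta * y          # (5.4a); rows of W sum to 1
        s = x_new + alpha * (x_new - x)                  # (5.4b)
        g_new = np.empty_like(g)
        for i in range(m):
            h = rng.integers(n)                          # step 4: uniform random label
            fresh = grad(i, h, s[i])
            g_new[i] = fresh - table[i, h] + table[i].mean(axis=0)   # step 5
            table[i, h] = fresh                          # step 6: overwrite one entry
        y = y + W @ y_hat - y_hat + g_new - g            # (5.4c)
        x, g = x_new, g_new
        # Steps 8-11: per-node trigger test (5.2), here at time index t + 1.
        err = np.sum((x_hat - x) ** 2, axis=1) + np.sum((y_hat - y) ** 2, axis=1)
        fire = err > C * kappa4 ** (t + 1)
        x_hat[fire], y_hat[fire] = x[fire], y[fire]
        broadcasts += int(fire.sum())
    return x, y, g, broadcasts
```

The sketch recomputes the table average in step 5 naively, which costs $O(n)$ per node and iteration. The returned `broadcasts` count can be compared with $mT$, the number of transmissions a time-triggered scheme would use over the same horizon.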
According to the measurement errors (5.3) and Assumption 5.2(ii), the updates (5.4a)–(5.4c) of Algorithm 3 can be represented as follows:
$$
\begin{cases}
x_{t+1}^i = \sum_{j=1}^{m} a_{ij} x_t^j + \hat{\varepsilon}_t^{i,x} - \eta y_t^i, \\
s_{t+1}^i = x_{t+1}^i + \alpha (x_{t+1}^i - x_t^i), \\
y_{t+1}^i = \sum_{j=1}^{m} a_{ij} y_t^j + \hat{\varepsilon}_t^{i,y} + g_{t+1}^i - g_t^i,
\end{cases} \tag{5.5}
$$
where $\hat{\varepsilon}_t^{i,x} = \sum_{j=1}^{m} a_{ij} (\varepsilon_t^{j,x} - \varepsilon_t^{i,x})$ and $\hat{\varepsilon}_t^{i,y} = \sum_{j=1}^{m} a_{ij} (\varepsilon_t^{j,y} - \varepsilon_t^{i,y})$.
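The passage from (5.4a) to the first line of (5.5) can be spelled out as follows. It uses only $\hat{x}_t^j = x_t^j + \varepsilon_t^{j,x}$ from (5.3) and $\sum_{j=1}^{m} a_{ij} = 1$ from Assumption 5.2(ii); the same argument applied to (5.4c) gives the third line.

$$
\begin{aligned}
x_{t+1}^i &= x_t^i + \sum_{j=1}^{m} a_{ij} \left( \hat{x}_t^j - \hat{x}_t^i \right) - \eta y_t^i \\
&= x_t^i + \sum_{j=1}^{m} a_{ij} \left( x_t^j - x_t^i \right) + \sum_{j=1}^{m} a_{ij} \left( \varepsilon_t^{j,x} - \varepsilon_t^{i,x} \right) - \eta y_t^i \\
&= \sum_{j=1}^{m} a_{ij} x_t^j + \hat{\varepsilon}_t^{i,x} - \eta y_t^i .
\end{aligned}
$$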
Remark 5.3 It is worth noticing that, as calculated in step 5 of Algorithm 3, computation of the local stochastic gradient $g_t^i$, $i \in \mathcal{V}$, is costly in that this step needs to calculate the summation term $\sum_{h=1}^{n^i} \nabla f^{i,h}(e_{t+1}^{i,h})$ at each iteration. Specifically, if we naively implement the update in step 5, then at each iteration we face an $O(n^i)$-order computational cost when calculating $\sum_{h=1}^{n^i} \nabla f^{i,h}(e_{t+1}^{i,h})$. When the following recursive formula is used at each iteration to update the summation term:
$$
\sum_{h=1}^{n^i} \nabla f^{i,h}\left( e_{t+1}^{i,h} \right) = \sum_{h=1}^{n^i} \nabla f^{i,h}\left( e_t^{i,h} \right) + \nabla f^{i,\chi_t^i}\left( s_t^i \right) - \nabla f^{i,\chi_t^i}\left( e_t^{i,\chi_t^i} \right),
$$
the above cost can be avoided and the summation term $\sum_{h=1}^{n^i} \nabla f^{i,h}(e_{t+1}^{i,h})$ can be calculated in a computationally efficient way. However, we also point out that the $O(n^i)$-order computational cost cannot be overcome in the existing methods [19, 20, 22, 23, 25] using the deterministic gradient tracking technique. From this point of view, the per-iteration computation cost of ET-DASG is independent of $n^i$, and ET-DASG can improve computation efficiency.
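The recursive formula in Remark 5.3 amounts to keeping a running sum next to the gradient table. The following is a minimal sketch for one node; the class name `SagaTable` and its interface are hypothetical.

```python
import numpy as np

class SagaTable:
    """Gradient table with a running sum, so each update costs O(p) instead of O(n*p)."""

    def __init__(self, grads):
        self.table = np.array(grads, dtype=float)   # shape (n, p): grad f^{h}(e^{h})
        self.total = self.table.sum(axis=0)         # running sum over h

    def step(self, h, fresh):
        """Return the SAGA estimate for label h and overwrite the h-th table entry.

        fresh: newly evaluated gradient of constituent h at the accelerated estimator.
        """
        fresh = np.asarray(fresh, dtype=float)
        old = self.table[h].copy()
        estimate = fresh - old + self.total / len(self.table)   # step 5 of Algorithm 3
        self.total += fresh - old                               # recursive sum update
        self.table[h] = fresh                                   # step 6 of Algorithm 3
        return estimate
```

The estimate uses the sum before it is refreshed, matching the order of steps 5 and 6. Storing the table still requires $O(n^i p)$ memory per node, which is the usual price of SAGA-type methods.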
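Define $\hat{\varepsilon}_t^x = [(\hat{\varepsilon}_t^{1,x})^T, \ldots, (\hat{\varepsilon}_t^{m,x})^T]^T$, $\hat{\varepsilon}_t^y = [(\hat{\varepsilon}_t^{1,y})^T, \ldots, (\hat{\varepsilon}_t^{m,y})^T]^T$, $x_t = [(x_t^1)^T, \ldots, (x_t^m)^T]^T$, $y_t = [(y_t^1)^T, \ldots, (y_t^m)^T]^T$, $s_t = [(s_t^1)^T, \ldots, (s_t^m)^T]^T$, $g_t = [(g_t^1)^T, \ldots, (g_t^m)^T]^T$, and $A = \hat{A} \otimes I_p$. Then, Algorithm 3 can be rewritten in a compact form,
$$
\begin{cases}
x_{t+1} = A x_t + \hat{\varepsilon}_t^x - \eta y_t, \\
s_{t+1} = x_{t+1} + \alpha (x_{t+1} - x_t), \\
y_{t+1} = A y_t + \hat{\varepsilon}_t^y + g_{t+1} - g_t.
\end{cases} \tag{5.6}
$$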
In what follows, we use $\mathcal{F}_t$ to denote the history of the system up until time $t$. Then, from the prior results of [2, 55, 57], it is clear that the stochastic averaging gradient is unbiased. In particular, given $\mathcal{F}_t$, one gets
$$
\mathbb{E}\left[ g_t^i \mid \mathcal{F}_t \right] = \nabla f^i(s_t^i). \tag{5.7}
$$
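The unbiasedness in (5.7) can be checked directly from step 5 of Algorithm 3. The short derivation below reads $\mathcal{F}_t$ as fixing $s_t^i$ and the table points $e_t^{i,h}$ but not the draw $\chi_t^i$, which is uniform on $\{1, \ldots, n^i\}$; this reading is our assumption about the conditioning.

$$
\begin{aligned}
\mathbb{E}\left[ g_t^i \mid \mathcal{F}_t \right]
&= \frac{1}{n^i} \sum_{h=1}^{n^i} \left( \nabla f^{i,h}(s_t^i) - \nabla f^{i,h}(e_t^{i,h}) \right) + \frac{1}{n^i} \sum_{h=1}^{n^i} \nabla f^{i,h}(e_t^{i,h}) \\
&= \frac{1}{n^i} \sum_{h=1}^{n^i} \nabla f^{i,h}(s_t^i) = \nabla f^i(s_t^i).
\end{aligned}
$$

The two table sums cancel exactly, so the correction terms reduce the variance of the estimate without introducing bias.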
Remark 5.4 In comparison with the existing distributed accelerated approaches [38–42], distributed event-triggered methods [43–46], and variance-reduced distributed stochastic approaches (including GT-SAGA/GT-SVRG [59], DSA [55], etc.), it is worth highlighting that Algorithm 3 achieves the above targets when processing machine learning tasks. That is to say, Algorithm 3 not only accelerates the convergence but also promotes the execution (communication and computation) efficiency. On this basis, this chapter develops a novel distributed optimization algorithm adapted to real scenarios.
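5.4 Convergence Analysis
In this section, we show the theoretical guarantees for the convergence of ET-DASG.
5.4.1 Auxiliary Results
To proceed, several auxiliary variables that will support the subsequent analysis are defined below:
$$
\bar{x}_t = \frac{1}{m} \left( 1_m^T \otimes I_p \right) x_t, \quad \bar{y}_t = \frac{1}{m} \left( 1_m^T \otimes I_p \right) y_t, \quad A_\infty = \frac{1_m 1_m^T}{m} \otimes I_p,
$$
$$
\nabla F(s_t) = \left[ \nabla f^1(s_t^1)^T, \ldots, \nabla f^m(s_t^m)^T \right]^T, \quad \nabla \bar{F}(s_t) = \frac{1}{m} \left( 1_m^T \otimes I_p \right) \nabla F(s_t), \quad \bar{g}_t = \frac{1}{m} \left( 1_m^T \otimes I_p \right) g_t.
$$
Then, note that (5.6) is a stochastic gradient tracking method [55, 59, 60]. Under the given initial conditions, the measurement errors (5.3), and Assumption 5.2(ii), the following conclusions can be drawn by induction:
$$
\bar{y}_t = \bar{g}_t, \quad \frac{1}{m} \left( 1_m^T \otimes I_p \right) \hat{\varepsilon}_t^x = 0, \quad \frac{1}{m} \left( 1_m^T \otimes I_p \right) \hat{\varepsilon}_t^y = 0, \quad \forall t \ge 0.
$$
Recalling (5.7), it is straightforward to verify that $\mathbb{E}[g_t \mid \mathcal{F}_t] = \nabla F(s_t)$ and $\mathbb{E}[\bar{y}_t \mid \mathcal{F}_t] = \mathbb{E}[\bar{g}_t \mid \mathcal{F}_t] = \nabla \bar{F}(s_t)$, $\forall t \ge 0$. Moreover, for each node $i \in \mathcal{V}$, the average optimality gap between the auxiliary variables $e_t^{i,h}$, $h \in \{1, \ldots, n^i\}$, and the optimal solution $x^*$ is defined as follows:
$$
v_t^i = \frac{1}{n^i} \sum_{h=1}^{n^i} \left\| e_t^{i,h} - x^* \right\|^2, \quad v_t = \sum_{i=1}^{m} v_t^i, \quad \forall t \ge 0. \tag{5.8}
$$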
Before establishing the auxiliary results, we state a few useful results in the following lemma, the proofs of which can be found in [40, 41, 46, 59, 61].
Lemma 5.5 Under Assumptions 5.1–5.2, the following hold:
(i) For all $x \in \mathbb{R}^{mp}$, one has $\|A x - A_\infty x\| \le \kappa_3 \|x - A_\infty x\|$, where $0 < \kappa_3 < 1$.
(ii) For all $x \in \mathbb{R}^p$ and if $0 < \eta \le 1/\kappa_2$, one gets $\|x - \eta \nabla f(x) - x^*\| \le (1 - \kappa_1 \eta) \|x - x^*\|$.
(iii) Assume that $\Gamma \in \mathbb{R}^{p \times p}$ is a non-negative matrix and $w \in \mathbb{R}^p$ is a positive vector. If $\Gamma w < w$, then $\rho(\Gamma) < 1$.
(iv) For the sequences $\{x_t\}$ and $\{s_t\}$ generated by the algorithm (5.6), one obtains that $\|\nabla \bar{F}(s_t) - \nabla f(\bar{x}_t)\| \le (\kappa_2 / \sqrt{m}) \|s_t - A_\infty x_t\|$ for all $t \ge 0$.
In addition, an upper bound of $\mathbb{E}[\|g_t - \nabla F(s_t)\|^2]$ is introduced in the following lemma, whose proof can be found in [55, 59].
Lemma 5.6 If Assumptions 5.1–5.2 hold, we possess the following recursive relation: $\forall t \ge 1$,
$$
\mathbb{E}\left[ \|g_t - \nabla F(s_t)\|^2 \right] \le 4 \kappa_2^2 \mathbb{E}\left[ \|x_t - A_\infty x_t\|^2 \right] + 8 \kappa_2^2 \mathbb{E}\left[ m \|\bar{x}_t - x^*\|^2 \right] + 8 \alpha^2 \kappa_2^2 \mathbb{E}\left[ \|x_t - x_{t-1}\|^2 \right] + 2 \kappa_2^2 \mathbb{E}[v_t]. \tag{5.9}
$$
From Lemma 5.6, we can find that when $x_t^i$, $i \in \mathcal{V}$, approach $x^*$, then $e_t^{i,h}$, $i \in \mathcal{V}$, $h \in \{1, \ldots, n^i\}$, tend to $x^*$, which indicates that $\mathbb{E}[\|g_t - \nabla F(s_t)\|^2]$ diminishes to zero.
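Lemma 5.5(iii) gives a sufficient test for a non-negative matrix to have spectral radius below one, which is the kind of certificate that coupled recursions such as the ones in this section typically call for. The following sketch checks the premise and the conclusion numerically on a small matrix; the function names and the example matrix are illustrative assumptions.

```python
import numpy as np

def spectral_radius(G):
    """Largest eigenvalue modulus of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(np.asarray(G, dtype=float)))))

def certifies_contraction(G, w):
    """Check the premise of Lemma 5.5(iii): G >= 0, w > 0 and G w < w entrywise."""
    G = np.asarray(G, dtype=float)
    w = np.asarray(w, dtype=float)
    return bool(np.all(G >= 0) and np.all(w > 0) and np.all(G @ w < w))
```

For example, with $\Gamma = \begin{bmatrix} 0.5 & 0.2 \\ 0.3 & 0.4 \end{bmatrix}$ and $w = [1, 1]^T$, both entries of $\Gamma w$ equal $0.7$, so the premise holds, and the eigenvalues of $\Gamma$ are $0.7$ and $0.2$.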
5.4.2 Supporting Lemmas
In this subsection, we start to analyze the performance of Algorithm 3 by establishing the interactions among the following sequences for all $t \ge 1$: (i) the mean-squared consensus error $\mathbb{E}[\|x_t - A_\infty x_t\|^2]$; (ii) the mean-squared network optimality gap $\mathbb{E}[\|\bar{x}_t - x^*\|^2]$; (iii) the mean state optimality gap $\mathbb{E}[v_t]$; (iv) the mean-squared state difference $\mathbb{E}[\|x_t - x_{t-1}\|^2]$; and (v) the mean-squared stochastic gradient tracking error $\mathbb{E}[\|y_t - A_\infty y_t\|^2]$. The first result, concerning the upper bound of $\mathbb{E}[v_t]$, is given in the following lemma. The proof of this result can be found in [59].
Lemma 5.7 The sequence $\{v_t\}$, $\forall t \ge 1$, is upper bounded by
$$
\mathbb{E}[v_{t+1}] \le \left( 1 - \frac{1}{\hat{n}} \right) \mathbb{E}[v_t] + \frac{2}{\check{n}} \mathbb{E}\left[ \|x_t - A_\infty x_t\|^2 \right] + \frac{2}{\check{n}} \mathbb{E}\left[ m \|\bar{x}_t - x^*\|^2 \right], \tag{5.10}
$$
where $\hat{n} = \max_{i \in \mathcal{V}} \{n^i\}$ and $\check{n} = \min_{i \in \mathcal{V}} \{n^i\}$.
In the next lemma, we give the bound of the mean-squared state difference $\mathbb{E}[\|x_t - x_{t-1}\|^2]$.
Lemma 5.8 If Assumptions 5.1–5.2 hold, we possess the following recursive relation: $\forall t \ge 1$,
$$
\begin{aligned}
\mathbb{E}\left[ \|x_{t+1} - x_t\|^2 \right] \le{} & (96 \eta^2 \kappa_2^2 + 8) \mathbb{E}\left[ \|x_t - A_\infty x_t\|^2 \right] + 32 \eta^2 \kappa_2^2 \mathbb{E}[v_t] + 144 \eta^2 \kappa_2^2 \mathbb{E}\left[ m \|\bar{x}_t - x^*\|^2 \right] \\
& + 16 \eta^2 \mathbb{E}\left[ \|y_t - A_\infty y_t\|^2 \right] + 160 \eta^2 \kappa_2^2 \alpha^2 \mathbb{E}\left[ \|x_t - x_{t-1}\|^2 \right] + 4 \mathbb{E}\left[ \|\hat{\varepsilon}_t^x\|^2 \right].
\end{aligned} \tag{5.11}
$$
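Proof Following from (5.6) and the fact that $\|A - I_{mp}\|^2 \le 4$, one gets
$$
\begin{aligned}
\|x_{t+1} - x_t\|^2 &= \|(A - I_{mp}) x_t + \hat{\varepsilon}_t^x - \eta y_t\|^2 \\
&= \|(A - I_{mp})(x_t - A_\infty x_t) + \hat{\varepsilon}_t^x - \eta y_t\|^2 \\
&\le 8 \|x_t - A_\infty x_t\|^2 + 4 \|\hat{\varepsilon}_t^x\|^2 + 4 \eta^2 \|y_t\|^2.
\end{aligned} \tag{5.12}
$$
Define $\nabla F(x^*) = [\nabla f^1(x^*)^T, \ldots, \nabla f^m(x^*)^T]^T$. Then, we further get
$$
\begin{aligned}
\|y_t\| &= \|y_t - A_\infty y_t + A_\infty y_t - A_\infty \nabla F(s_t) + A_\infty \nabla F(s_t)\| \\
&\le \|y_t - A_\infty y_t\| + \|g_t - \nabla F(s_t)\| + \kappa_2 \|s_t - (1_m \otimes I_p) x^*\| \\
&\le \|y_t - A_\infty y_t\| + \|g_t - \nabla F(s_t)\| + \kappa_2 \|s_t - A_\infty x_t\| + \sqrt{m} \kappa_2 \|\bar{x}_t - x^*\|,
\end{aligned} \tag{5.13}
$$
where the first inequality has exploited the facts that $\bar{y}_t = \bar{g}_t$, $\forall t \ge 0$, and $(1_m^T \otimes I_p) \nabla F(x^*) = 0$. In view of (5.12) and (5.13), one has
$$
\begin{aligned}
\mathbb{E}\left[ \|x_{t+1} - x_t\|^2 \right] \le{} & 8 \mathbb{E}\left[ \|x_t - A_\infty x_t\|^2 \right] + 16 \eta^2 \kappa_2^2 \mathbb{E}\left[ \|s_t - A_\infty x_t\|^2 \right] + 16 \eta^2 \mathbb{E}\left[ \|g_t - \nabla F(s_t)\|^2 \right] \\
& + 16 \eta^2 \mathbb{E}\left[ \|y_t - A_\infty y_t\|^2 \right] + 16 \eta^2 \kappa_2^2 \mathbb{E}\left[ m \|\bar{x}_t - x^*\|^2 \right] + 4 \mathbb{E}\left[ \|\hat{\varepsilon}_t^x\|^2 \right].
\end{aligned} \tag{5.14}
$$
Following from (5.14) and with reference to (5.6), it can be obtained that
$$
\begin{aligned}
\mathbb{E}\left[ \|x_{t+1} - x_t\|^2 \right] \le{} & 16 \eta^2 \mathbb{E}\left[ \|g_t - \nabla F(s_t)\|^2 \right] + 16 \eta^2 \kappa_2^2 \mathbb{E}\left[ m \|\bar{x}_t - x^*\|^2 \right] + 16 \eta^2 \mathbb{E}\left[ \|y_t - A_\infty y_t\|^2 \right] \\
& + 32 \alpha^2 \eta^2 \kappa_2^2 \mathbb{E}\left[ \|x_t - x_{t-1}\|^2 \right] + (32 \eta^2 \kappa_2^2 + 8) \mathbb{E}\left[ \|x_t - A_\infty x_t\|^2 \right] + 4 \mathbb{E}\left[ \|\hat{\varepsilon}_t^x\|^2 \right],
\end{aligned} \tag{5.15}
$$
which together with Lemma 5.6 finishes the proof.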
Then, we derive a bound for the mean-squared consensus error E[||xt −A∞ xt ||2 ].
5.4 Convergence Analysis
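Lemma 5.9 If Assumptions 5.1 and 5.2 hold, the following recursive relation holds for all $t \ge 1$:
$$\mathbb{E}[\|x_{t+1} - A_\infty x_{t+1}\|^2] \le \frac{1 + \kappa_3^2}{2}\,\mathbb{E}[\|x_t - A_\infty x_t\|^2] + \frac{4\eta^2}{1 - \kappa_3^2}\,\mathbb{E}[\|y_t - A_\infty y_t\|^2] + \frac{4}{1 - \kappa_3^2}\,\mathbb{E}[\|\hat{\varepsilon}_t^x\|^2]. \tag{5.16}$$

Proof From the update rule (5.6), it follows that
$$\begin{aligned}
\|x_{t+1} - A_\infty x_{t+1}\|^2 &= \|A x_t - \hat{\varepsilon}_t^x - \eta y_t - A_\infty(A x_t - \hat{\varepsilon}_t^x - \eta y_t)\|^2 \\
&= \|A x_t - A_\infty x_t - \hat{\varepsilon}_t^x - \eta y_t + \eta A_\infty y_t\|^2,
\end{aligned} \tag{5.17}$$
where the second equality has employed the facts that $A_\infty A = A_\infty$ and $A_\infty \hat{\varepsilon}_t^x = 0$, $\forall t \ge 0$. Recalling the well-known inequality $\|c + d\|^2 \le (1 + a)\|c\|^2 + (1 + 1/a)\|d\|^2$, $\forall c, d \in \mathbb{R}^{mp}$, for any $a > 0$, it follows from Lemma 5.5(i) that
$$\|x_{t+1} - A_\infty x_{t+1}\|^2 \le (1 + a)\kappa_3^2\|x_t - A_\infty x_t\|^2 + 2(1 + a^{-1})\eta^2\|y_t - A_\infty y_t\|^2 + 2(1 + a^{-1})\|\hat{\varepsilon}_t^x\|^2. \tag{5.18}$$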
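The effect of the particular choice $a = (1 - \kappa_3^2)/(2\kappa_3^2)$ in (5.18) can be checked numerically. The sketch below is only an illustration: the value of `kappa3` and the random test vectors are assumptions, not quantities from the chapter. It confirms that the consensus coefficient becomes $(1 + \kappa_3^2)/2$, that the remaining coefficient is at most $4/(1 - \kappa_3^2)$, and that the underlying inequality holds for one pair of vectors.

```python
import numpy as np


def young_bound(c, d, a):
    """Right-hand side of ||c + d||^2 <= (1 + a)||c||^2 + (1 + 1/a)||d||^2."""
    return (1 + a) * np.dot(c, c) + (1 + 1 / a) * np.dot(d, d)


kappa3 = 0.8  # hypothetical contraction factor with 0 < kappa3 < 1
a = (1 - kappa3**2) / (2 * kappa3**2)

# Coefficients that this choice of a produces in (5.18).
consensus_coeff = (1 + a) * kappa3**2   # multiplies ||x_t - A_inf x_t||^2
other_coeff = 2 * (1 + 1 / a)           # multiplies the eta^2 ||y_t - ...||^2 and error terms

# One random instance of the inequality itself (illustrative vectors).
rng = np.random.default_rng(0)
c, d = rng.normal(size=6), rng.normal(size=6)
lhs = np.dot(c + d, c + d)
rhs = young_bound(c, d, a)
```

Here `other_coeff` equals $2(1 + \kappa_3^2)/(1 - \kappa_3^2)$, which is bounded by $4/(1 - \kappa_3^2)$ because $\kappa_3 < 1$.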
Substituting $a = (1 - \kappa_3^2)/(2\kappa_3^2)$ (recall that $a$ is an arbitrary positive constant) into (5.18) and taking the total expectation leads to (5.16) in Lemma 5.9.

Next, we establish a bound for the mean-squared network optimality gap $\mathbb{E}[\|\bar{x}_t - x^*\|^2]$.

Lemma 5.10 If $0 < \eta \le \kappa_1/(16\kappa_2)$ and Assumptions 5.1 and 5.2 hold, the following recursive relation holds for all $t \ge 1$:
$$\mathbb{E}[m\|\bar{x}_{t+1} - x^*\|^2] \le \Big(1 - \frac{\eta\kappa_1}{2}\Big)\mathbb{E}[m\|\bar{x}_t - x^*\|^2] + \frac{4\kappa_2^2\eta}{\kappa_1}\,\mathbb{E}[\|x_t - A_\infty x_t\|^2] + \frac{4\alpha^2\kappa_2^2\eta}{\kappa_1}\,\mathbb{E}[\|x_t - x_{t-1}\|^2] + \frac{2\eta^2\kappa_2^2}{m}\,\mathbb{E}[v_t]. \tag{5.19}$$
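Proof Multiplying the update of $x_t$ in (5.6) by $(1_m^T \otimes I_p)/m$ and using the fact that $((1_m^T \otimes I_p)/m)\hat{\varepsilon}_t^x = 0$, one has
$$\begin{aligned}
\mathbb{E}[\|\bar{x}_{t+1} - x^*\|^2 \mid \mathcal{F}_t] &= \mathbb{E}[\|\bar{x}_t - \eta\bar{y}_t - x^*\|^2 \mid \mathcal{F}_t] \\
&= \mathbb{E}[\|\bar{x}_t - \eta\nabla f(\bar{x}_t) - x^* + \eta\nabla f(\bar{x}_t) - \eta\bar{y}_t\|^2 \mid \mathcal{F}_t] \\
&= \|\bar{x}_t - \eta\nabla f(\bar{x}_t) - x^*\|^2 + \eta^2\,\mathbb{E}[\|\nabla f(\bar{x}_t) - \bar{g}_t\|^2] \\
&\quad + 2\eta\,\langle \bar{x}_t - \eta\nabla f(\bar{x}_t) - x^*,\ \nabla f(\bar{x}_t) - \nabla\bar{F}(s_t)\rangle,
\end{aligned} \tag{5.20}$$
where the last equality has leveraged the facts that $\bar{y}_t = \bar{g}_t$ and $\mathbb{E}[\bar{g}_t \mid \mathcal{F}_t] = \nabla\bar{F}(s_t)$. Considering that $\langle \nabla f(\bar{x}_t) - \nabla\bar{F}(s_t),\ \mathbb{E}[\nabla\bar{F}(s_t) - \bar{g}_t \mid \mathcal{F}_t]\rangle = 0$, we have
$$\mathbb{E}[\|\nabla f(\bar{x}_t) - \bar{g}_t\|^2 \mid \mathcal{F}_t] = \mathbb{E}[\|\nabla f(\bar{x}_t) - \nabla\bar{F}(s_t)\|^2 \mid \mathcal{F}_t] + \mathbb{E}[\|\nabla\bar{F}(s_t) - \bar{g}_t\|^2 \mid \mathcal{F}_t]. \tag{5.21}$$
Notice that the $\{g_t^i\}$ are independent of each other given $\mathcal{F}_t$. Then, $\mathbb{E}[\sum_{i \ne j}\langle g_t^i - \nabla f^i(s_t^i),\ g_t^j - \nabla f^j(s_t^j)\rangle \mid \mathcal{F}_t] = 0$ holds. Thus, the last term in (5.21) is equal to
$$\mathbb{E}[\|\nabla\bar{F}(s_t) - \bar{g}_t\|^2 \mid \mathcal{F}_t] = \frac{1}{m^2}\,\mathbb{E}\Big[\sum_{i=1}^{m}\|g_t^i - \nabla f^i(s_t^i)\|^2 \,\Big|\, \mathcal{F}_t\Big] = \frac{1}{m^2}\,\mathbb{E}[\|g_t - \nabla F(s_t)\|^2 \mid \mathcal{F}_t], \tag{5.22}$$
which together with (5.21) and Lemma 5.5(ii) in (5.20) leads to
$$\begin{aligned}
\mathbb{E}[\|\bar{x}_{t+1} - x^*\|^2 \mid \mathcal{F}_t] \le{}& (1 - \eta\kappa_1)^2\|\bar{x}_t - x^*\|^2 + \eta^2\|\nabla f(\bar{x}_t) - \nabla\bar{F}(s_t)\|^2 \\
&+ 2\eta(1 - \eta\kappa_1)\|\bar{x}_t - x^*\|\,\|\nabla f(\bar{x}_t) - \nabla\bar{F}(s_t)\| + \frac{\eta^2}{m^2}\,\mathbb{E}[\|g_t - \nabla F(s_t)\|^2 \mid \mathcal{F}_t].
\end{aligned} \tag{5.23}$$
By Lemma 5.5(iv) and the update of $s_t$ in (5.6), we further have
$$\begin{aligned}
\mathbb{E}[\|\bar{x}_{t+1} - x^*\|^2 \mid \mathcal{F}_t] \le{}& \frac{2\eta^2\kappa_2^2}{m}\|x_t - A_\infty x_t\|^2 + \frac{\eta(1 - \eta\kappa_1)}{b}\|\nabla f(\bar{x}_t) - \nabla\bar{F}(s_t)\|^2 \\
&+ \frac{2\alpha^2\eta^2\kappa_2^2}{m}\|x_t - x_{t-1}\|^2 + \frac{\eta^2}{m^2}\,\mathbb{E}[\|g_t - \nabla F(s_t)\|^2 \mid \mathcal{F}_t] \\
&+ \eta(1 - \eta\kappa_1)b\,\|\bar{x}_t - x^*\|^2 + (1 - \eta\kappa_1)^2\|\bar{x}_t - x^*\|^2,
\end{aligned} \tag{5.24}$$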
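where $b > 0$ is an arbitrary constant. Inserting $b = \kappa_1$ into (5.24), taking the total expectation, and multiplying by $m$ gives
$$\mathbb{E}[m\|\bar{x}_{t+1} - x^*\|^2] \le (1 - \eta\kappa_1)\,\mathbb{E}[m\|\bar{x}_t - x^*\|^2] + \frac{2\kappa_2^2\eta}{\kappa_1}\,\mathbb{E}[\|x_t - A_\infty x_t\|^2] + \frac{2\alpha^2\kappa_2^2\eta}{\kappa_1}\,\mathbb{E}[\|x_t - x_{t-1}\|^2] + \frac{\eta^2}{m}\,\mathbb{E}[\|g_t - \nabla F(s_t)\|^2]. \tag{5.25}$$
Substituting Lemma 5.6 into (5.25) yields
$$\begin{aligned}
\mathbb{E}[m\|\bar{x}_{t+1} - x^*\|^2] \le{}& \Big(1 - \eta\kappa_1 + \frac{8\eta^2\kappa_2^2}{m}\Big)\mathbb{E}[m\|\bar{x}_t - x^*\|^2] + \frac{2\eta^2\kappa_2^2}{m}\,\mathbb{E}[v_t] \\
&+ \alpha^2\kappa_2^2\eta\Big(\frac{8\eta}{m} + \frac{2}{\kappa_1}\Big)\mathbb{E}[\|x_t - x_{t-1}\|^2] + \kappa_2^2\eta\Big(\frac{4\eta}{m} + \frac{2}{\kappa_1}\Big)\mathbb{E}[\|x_t - A_\infty x_t\|^2].
\end{aligned} \tag{5.26}$$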
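We note here that $1 - \eta\kappa_1 + (8\eta^2\kappa_2^2)/m \le 1 - (\eta\kappa_1)/2$ if $0 < \eta \le (\kappa_1 m)/(16\kappa_2^2)$, $4\eta/m + 2\kappa_1^{-1} \le 4\kappa_1^{-1}$ if $0 < \eta \le 1/(2\kappa_1)$, and $8\eta/m + 2\kappa_1^{-1} \le 4\kappa_1^{-1}$ if $0 < \eta \le 1/(4\kappa_1)$. Therefore, (5.19) in Lemma 5.10 follows if $0 < \eta \le \kappa_1/(16\kappa_2)$.

Finally, we establish a bound for the mean-squared stochastic gradient tracking error $\mathbb{E}[\|y_t - A_\infty y_t\|^2]$.

Lemma 5.11 If $0 < \eta \le (1 - \kappa_3^2)/(99\kappa_2)$ and Assumptions 5.1 and 5.2 hold, the following recursive relation holds for all $t \ge 1$:
$$\begin{aligned}
\mathbb{E}[\|y_{t+1} - A_\infty y_{t+1}\|^2] \le{}& \frac{1240\kappa_2^2}{1 - \kappa_3^2}\,\mathbb{E}[\|x_t - A_\infty x_t\|^2] + \frac{266\kappa_2^2}{1 - \kappa_3^2}\,\mathbb{E}[m\|\bar{x}_t - x^*\|^2] \\
&+ \frac{3 + \kappa_3^2}{4}\,\mathbb{E}[\|y_t - A_\infty y_t\|^2] + \frac{188\kappa_2^2}{1 - \kappa_3^2}\,\mathbb{E}[\|x_t - x_{t-1}\|^2] \\
&+ \frac{42\kappa_2^2}{1 - \kappa_3^2}\,\mathbb{E}[v_t] + \frac{640\kappa_2^2}{1 - \kappa_3^2}\,\mathbb{E}[\|\hat{\varepsilon}_t^x\|^2] + \frac{4}{1 - \kappa_3^2}\,\mathbb{E}[\|\hat{\varepsilon}_t^y\|^2].
\end{aligned} \tag{5.27}$$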
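Proof Utilizing the update of $y_t$ in (5.6) and the fact that $A_\infty A = A_\infty$, one acquires
$$\begin{aligned}
\|y_{t+1} - A_\infty y_{t+1}\|^2 &= \|A y_t - A_\infty y_t + (I_{mp} - A_\infty)(g_{t+1} - g_t) - \hat{\varepsilon}_t^y\|^2 \\
&\le (1 + c)\|A y_t - A_\infty y_t\|^2 + 2(1 + c^{-1})\|g_{t+1} - g_t\|^2 + 2(1 + c^{-1})\|\hat{\varepsilon}_t^y\|^2,
\end{aligned} \tag{5.28}$$
where $A_\infty \hat{\varepsilon}_t^y = 0$, $\forall t \ge 0$, and the inequality $\|a + b\|^2 \le (1 + c)\|a\|^2 + (1 + 1/c)\|b\|^2$, $\forall a, b \in \mathbb{R}^{mp}$, for any $c > 0$, have been employed to derive (5.28). Selecting $c = (1 - \kappa_3^2)/(2\kappa_3^2)$ in (5.28) and then taking the total expectation, we have
$$\mathbb{E}[\|y_{t+1} - A_\infty y_{t+1}\|^2] \le \frac{1 + \kappa_3^2}{2}\,\mathbb{E}[\|y_t - A_\infty y_t\|^2] + \frac{4}{1 - \kappa_3^2}\,\mathbb{E}[\|g_{t+1} - g_t\|^2] + \frac{4}{1 - \kappa_3^2}\,\mathbb{E}[\|\hat{\varepsilon}_t^y\|^2], \tag{5.29}$$
where $\|I_{mp} - A_\infty\| = 1$ and Lemma 5.5(i) have been applied to acquire (5.29). Next, we proceed to analyze $\mathbb{E}[\|g_{t+1} - g_t\|^2]$. First, we have
$$\begin{aligned}
\mathbb{E}[\|g_{t+1} - g_t\|^2] \le{}& 2\,\mathbb{E}[\|g_{t+1} - g_t - (\nabla F(s_{t+1}) - \nabla F(s_t))\|^2] + 2\,\mathbb{E}[\|\nabla F(s_{t+1}) - \nabla F(s_t)\|^2] \\
\le{}& 2\,\mathbb{E}[\|g_{t+1} - \nabla F(s_{t+1})\|^2] + 2\,\mathbb{E}[\|g_t - \nabla F(s_t)\|^2] \\
&+ 2\kappa_2^2\,\mathbb{E}[\|x_{t+1} - x_t + \alpha(x_{t+1} - x_t) - \alpha(x_t - x_{t-1})\|^2] \\
\le{}& 4\kappa_2^2\alpha^2\,\mathbb{E}[\|x_t - x_{t-1}\|^2] + 2\,\mathbb{E}[\|g_t - \nabla F(s_t)\|^2] \\
&+ 16\kappa_2^2\,\mathbb{E}[\|x_{t+1} - x_t\|^2] + 2\,\mathbb{E}[\|g_{t+1} - \nabla F(s_{t+1})\|^2],
\end{aligned} \tag{5.30}$$
where the third and second inequalities have exploited $0 < \alpha < 1$ and $\mathbb{E}[\langle g_{t+1} - \nabla F(s_{t+1}),\ g_t - \nabla F(s_t)\rangle] = \mathbb{E}[\mathbb{E}[\langle g_{t+1} - \nabla F(s_{t+1}),\ g_t - \nabla F(s_t)\rangle \mid \mathcal{F}_{t+1}]] = 0$, respectively. Using (5.15) with the requirement that $0 < \eta \le 1/(32\kappa_2)$, it can be deduced that
$$\begin{aligned}
\mathbb{E}[\|x_{t+1} - x_t\|^2] \le{}& 16\eta^2\,\mathbb{E}[\|y_t - A_\infty y_t\|^2] + 0.03125\,\alpha^2\,\mathbb{E}[\|x_t - x_{t-1}\|^2] + 8.03125\,\mathbb{E}[\|x_t - A_\infty x_t\|^2] \\
&+ 16\eta^2\,\mathbb{E}[\|g_t - \nabla F(s_t)\|^2] + 0.015625\,\mathbb{E}[m\|\bar{x}_t - x^*\|^2] + 4\,\mathbb{E}[\|\hat{\varepsilon}_t^x\|^2],
\end{aligned} \tag{5.31}$$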
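which together with (5.30) yields that
$$\begin{aligned}
\mathbb{E}[\|g_{t+1} - g_t\|^2] \le{}& 256\kappa_2^2\eta^2\,\mathbb{E}[\|y_t - A_\infty y_t\|^2] + 2\,\mathbb{E}[\|g_{t+1} - \nabla F(s_{t+1})\|^2] + 4.5\kappa_2^2\alpha^2\,\mathbb{E}[\|x_t - x_{t-1}\|^2] \\
&+ 0.25\kappa_2^2\,\mathbb{E}[m\|\bar{x}_t - x^*\|^2] + 128.5\kappa_2^2\,\mathbb{E}[\|x_t - A_\infty x_t\|^2] + 64\kappa_2^2\,\mathbb{E}[\|\hat{\varepsilon}_t^x\|^2] \\
&+ 2.25\,\mathbb{E}[\|g_t - \nabla F(s_t)\|^2].
\end{aligned} \tag{5.32}$$
We next bound $\mathbb{E}[\|g_{t+1} - \nabla F(s_{t+1})\|^2]$. Following directly from Lemma 5.6, one has
$$\mathbb{E}[\|g_{t+1} - \nabla F(s_{t+1})\|^2] \le 4\kappa_2^2\,\mathbb{E}[\|x_{t+1} - A_\infty x_{t+1}\|^2] + 8\kappa_2^2\,\mathbb{E}[m\|\bar{x}_{t+1} - x^*\|^2] + 8\alpha^2\kappa_2^2\,\mathbb{E}[\|x_{t+1} - x_t\|^2] + 2\kappa_2^2\,\mathbb{E}[v_{t+1}]. \tag{5.33}$$
Note that $\mathbb{E}[\|x_{t+1} - A_\infty x_{t+1}\|^2] \le 4\eta^2\,\mathbb{E}[\|y_t - A_\infty y_t\|^2] + 2\kappa_3^2\,\mathbb{E}[\|x_t - A_\infty x_t\|^2] + 4\,\mathbb{E}[\|\hat{\varepsilon}_t^x\|^2]$ if we select $a = 1$ in (5.18), and $\mathbb{E}[m\|\bar{x}_{t+1} - x^*\|^2] \le \alpha^2\,\mathbb{E}[\|x_t - x_{t-1}\|^2] + (\eta^2/m)\,\mathbb{E}[\|g_t - \nabla F(s_t)\|^2] + \mathbb{E}[\|x_t - A_\infty x_t\|^2] + 2\,\mathbb{E}[m\|\bar{x}_t - x^*\|^2]$ if we select $b = 1/\eta$ in (5.24) for $0 < \eta < 2/\kappa_2$. Substituting the above results and (5.10) into (5.33) leads to the following: if $0 < \eta \le 1/(32\kappa_2)$,
$$\begin{aligned}
\mathbb{E}[\|g_{t+1} - \nabla F(s_{t+1})\|^2] \le{}& 84.25\kappa_2^2\,\mathbb{E}[\|x_t - A_\infty x_t\|^2] + 144\eta^2\kappa_2^2\,\mathbb{E}[\|y_t - A_\infty y_t\|^2] \\
&+ \Big(\frac{8\eta^2\kappa_2^2}{m} + 128\eta^2\kappa_2^2\Big)\mathbb{E}[\|g_t - \nabla F(s_t)\|^2] + 2\kappa_2^2\,\mathbb{E}[v_t] \\
&+ 20.125\kappa_2^2\,\mathbb{E}[m\|\bar{x}_t - x^*\|^2] + 48\kappa_2^2\,\mathbb{E}[\|\hat{\varepsilon}_t^x\|^2] + 8.25\alpha^2\kappa_2^2\,\mathbb{E}[\|x_t - x_{t-1}\|^2].
\end{aligned} \tag{5.34}$$
If $0 < \eta \le 1/(16\kappa_2)$, we further get from (5.9) that
$$\begin{aligned}
\mathbb{E}[\|g_{t+1} - \nabla F(s_{t+1})\|^2] \le{}& 86.25\kappa_2^2\,\mathbb{E}[\|x_t - A_\infty x_t\|^2] + 144\eta^2\kappa_2^2\,\mathbb{E}[\|y_t - A_\infty y_t\|^2] + 24.125\kappa_2^2\,\mathbb{E}[m\|\bar{x}_t - x^*\|^2] \\
&+ 3\kappa_2^2\,\mathbb{E}[v_t] + 12.25\alpha^2\kappa_2^2\,\mathbb{E}[\|x_t - x_{t-1}\|^2] + 48\kappa_2^2\,\mathbb{E}[\|\hat{\varepsilon}_t^x\|^2].
\end{aligned} \tag{5.35}$$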
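Combining (5.35) with (5.32), it follows from (5.29) that
$$\begin{aligned}
\mathbb{E}[\|y_{t+1} - A_\infty y_{t+1}\|^2] \le{}& \frac{266\kappa_2^2}{1 - \kappa_3^2}\,\mathbb{E}[m\|\bar{x}_t - x^*\|^2] + \frac{188\kappa_2^2}{1 - \kappa_3^2}\,\mathbb{E}[\|x_t - x_{t-1}\|^2] + \frac{4}{1 - \kappa_3^2}\,\mathbb{E}[\|\hat{\varepsilon}_t^y\|^2] \\
&+ \frac{42\kappa_2^2}{1 - \kappa_3^2}\,\mathbb{E}[v_t] + \Big(\frac{1 + \kappa_3^2}{2} + \frac{2176\kappa_2^2\eta^2}{1 - \kappa_3^2}\Big)\mathbb{E}[\|y_t - A_\infty y_t\|^2] \\
&+ \frac{640\kappa_2^2}{1 - \kappa_3^2}\,\mathbb{E}[\|\hat{\varepsilon}_t^x\|^2] + \frac{1240\kappa_2^2}{1 - \kappa_3^2}\,\mathbb{E}[\|x_t - A_\infty x_t\|^2].
\end{aligned} \tag{5.36}$$
Hence, choosing $0 < \eta \le (1 - \kappa_3^2)/(99\kappa_2)$ yields (5.27) in Lemma 5.11.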
5.4.3 Main Results

Next, we show that ET-DASG converges linearly in the mean. We first construct a linear matrix inequality based on the results in Lemmas 5.7–5.11. Then, similar to other works [20, 22, 24, 25], we prove that the spectral radius of the coefficient matrix is less than 1. Iterating the linear matrix inequality then shows, without further complex operations, that ET-DASG converges linearly in the mean.

Theorem 5.12 Consider ET-DASG in Algorithm 3 under Assumptions 5.1–5.2. If the step-size $\eta$ is chosen from the interval
$$0 < \eta \le \min\left\{ \frac{1 - \kappa_3^2}{12\gamma\kappa_2}\sqrt{\frac{\check{n}}{\hat{n}}},\ \frac{m\check{n}(1 - 9\alpha^2)}{34\hat{n}\kappa_1\gamma^2},\ \frac{1 - \kappa_3^2}{99\kappa_2},\ \frac{1}{\kappa_2\,\Lambda(\gamma, \kappa_3, \hat{n}, \check{n})} \right\},$$
the estimator $x_t^i$, $i \in \mathcal{V}$, converges in the mean to the exact optimal solution to problem (5.1) with a linear convergence rate $O(\lambda^t)$, where $0 < \lambda < 1$, $0 < \alpha < 1/3$, and $\Lambda(\gamma, \kappa_3, \hat{n}, \check{n})$ is a constant specified in the proof.

Proof To begin with, we jointly write (5.10), (5.11), (5.16), (5.19), and (5.27) as a linear matrix inequality, i.e.,
$$\theta_{t+1} \le \Gamma\theta_t + \nu_t, \tag{5.37}$$
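where $\theta_t \in \mathbb{R}^5$, $\nu_t \in \mathbb{R}^5$, and $\Gamma \in \mathbb{R}^{5 \times 5}$ are represented by
$$\theta_t = \begin{bmatrix} \mathbb{E}[\|x_t - A_\infty x_t\|^2] \\ \mathbb{E}[m\|\bar{x}_t - x^*\|^2] \\ \mathbb{E}[\|x_t - x_{t-1}\|^2] \\ \mathbb{E}[v_t] \\ \mathbb{E}[\|y_t - A_\infty y_t\|^2]/\kappa_2^2 \end{bmatrix}, \qquad \nu_t = \begin{bmatrix} \Delta_1 \\ 0 \\ \Delta_2 \\ 0 \\ \Delta_3 \end{bmatrix},$$
with $\Delta_1 = \frac{4}{1 - \kappa_3^2}\,\mathbb{E}[\|\hat{\varepsilon}_t^x\|^2]$, $\Delta_2 = 4\,\mathbb{E}[\|\hat{\varepsilon}_t^x\|^2]$, and $\Delta_3 = \frac{640\kappa_2^2}{1 - \kappa_3^2}\,\mathbb{E}[\|\hat{\varepsilon}_t^x\|^2] + \frac{4}{1 - \kappa_3^2}\,\mathbb{E}[\|\hat{\varepsilon}_t^y\|^2]$, as well as
$$\Gamma = \begin{bmatrix}
\frac{1 + \kappa_3^2}{2} & 0 & 0 & 0 & \frac{4\kappa_2^2\eta^2}{1 - \kappa_3^2} \\
\frac{4\kappa_2^2\eta}{\kappa_1} & 1 - \frac{\eta\kappa_1}{2} & \frac{4\kappa_2^2\alpha^2\eta}{\kappa_1} & \frac{2\eta^2\kappa_2^2}{m} & 0 \\
96\eta^2\kappa_2^2 + 8 & 144\eta^2\kappa_2^2 & 160\eta^2\kappa_2^2 & 32\eta^2\kappa_2^2 & 16\kappa_2^2\eta^2 \\
\frac{2}{\check{n}} & \frac{2}{\check{n}} & 0 & 1 - \frac{1}{\hat{n}} & 0 \\
\frac{1240}{1 - \kappa_3^2} & \frac{266}{1 - \kappa_3^2} & \frac{188}{1 - \kappa_3^2} & \frac{42}{1 - \kappa_3^2} & \frac{3 + \kappa_3^2}{4}
\end{bmatrix}.$$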
Then, as in the existing works [20, 22, 25, 40], we first infer the range of $\eta$ satisfying $\rho(\Gamma) < 1$ to establish the linear convergence of ET-DASG. Recalling Lemma 5.5(iii), to ensure $\rho(\Gamma) < 1$, we derive the range of $\eta$ and a positive vector $w = [w_1, w_2, w_3, w_4, w_5]^T$ such that $\Gamma w < w$ holds, which equivalently indicates that
$$\begin{cases}
\eta^2\kappa_2^2 < \dfrac{(1 - \kappa_3^2)^2}{8}\,\dfrac{w_1}{w_4}, \\[2mm]
\eta\kappa_2^2 < \dfrac{m\kappa_1}{8}\,\dfrac{w_2}{w_4} - \dfrac{m\kappa_2^2}{\kappa_1}\,\dfrac{w_1}{w_4} - \dfrac{m\kappa_2^2\alpha^2}{\kappa_1}\,\dfrac{w_3}{w_4}, \\[2mm]
\eta^2\kappa_2^2 < \dfrac{w_3 - 8w_1}{160w_3 + 96w_1 + 32w_4 + 16w_5}, \\[2mm]
\dfrac{2\hat{n}w_1}{\check{n}} + \dfrac{2\hat{n}w_2}{\check{n}} < w_4, \\[2mm]
\dfrac{4960w_1}{(1 - \kappa_3^2)^2} + \dfrac{1064w_2}{(1 - \kappa_3^2)^2} + \dfrac{752w_3}{(1 - \kappa_3^2)^2} + \dfrac{168w_4}{(1 - \kappa_3^2)^2} < w_5.
\end{cases} \tag{5.38}$$
Obviously, to obtain a feasible range of $\eta$, we must ensure that the right-hand sides of the first three conditions in (5.38), which involve the step-size $\eta$, are positive. According to this observation, it suffices to derive another two relations among the elements of $w$, i.e.,
$$\begin{cases}
w_3 > 8w_1, \\[1mm]
w_2 > \dfrac{8\kappa_2^2 w_1}{\kappa_1^2} + \dfrac{8\kappa_2^2\alpha^2 w_3}{\kappa_1^2}.
\end{cases} \tag{5.39}$$
Based on this, we are now ready to choose values of $w_1, w_2, w_3, w_4, w_5$ that are independent of $\eta$. First, according to the first condition in (5.39), we can set $w_1 = 1$ and $w_3 = 9$. Second, using $w_1$ and $w_3$, the second condition in (5.39) is equivalent to
$$w_2 > \frac{8\kappa_2^2}{\kappa_1^2}(1 + 9\alpha^2).$$
Here, we assume that $0 < \alpha < 1/3$, and we thus set $w_2 = 16\gamma^2$, where $\gamma = \kappa_2/\kappa_1$. Third, we note that the fourth condition in (5.38) can be written as
$$w_4 > \frac{2\hat{n}}{\check{n}} + \frac{32\hat{n}\gamma^2}{\check{n}} = \frac{2\hat{n}}{\check{n}}(1 + 16\gamma^2).$$
Since $\gamma > 1$, we therefore set $w_4 = (34\hat{n}\gamma^2)/\check{n}$. Finally, in order to ensure that the fifth condition in (5.38) is true, $w_5$ should satisfy
$$w_5 > \frac{8}{(1 - \kappa_3^2)^2}(620w_1 + 133w_2 + 94w_3 + 21w_4) = \frac{16}{(1 - \kappa_3^2)^2}\Big(733 + 1064\gamma^2 + \frac{357\hat{n}\gamma^2}{\check{n}}\Big).$$
Noticing that
$$733 + 1064\gamma^2 + \frac{357\hat{n}\gamma^2}{\check{n}} \le \frac{2154\hat{n}\gamma^2}{\check{n}},$$
we thus set $w_5 = \dfrac{34464}{(1 - \kappa_3^2)^2}\,\dfrac{\hat{n}\gamma^2}{\check{n}}$.
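The construction of $w$ can be checked numerically. The sketch below is only an illustration under stated assumptions: the numerical values of $\kappa_1$, $\kappa_2$, $\kappa_3$, $m$, $\hat{n}$, $\check{n}$, and $\alpha$ are hypothetical (chosen so that $\gamma > 1$ and $\alpha < 1/3$), the entries of $\Gamma$ are typed in as read from (5.37), and the step-size is found by simple halving rather than from the closed-form interval in Theorem 5.12. It confirms, for one instance, that a step-size with $\Gamma w < w$ exists and that the spectral radius is then below 1, as Lemma 5.5(iii) predicts.

```python
import numpy as np

# Hypothetical constants (not taken from the chapter's experiments).
kappa1, kappa2, kappa3 = 1.0, 2.0, 0.8   # strong convexity, smoothness, mixing factor
m, n_hat, n_chk = 10, 25, 15             # nodes, largest and smallest local sample counts
alpha = 0.2                              # momentum coefficient, below 1/3
gamma = kappa2 / kappa1                  # condition number, here greater than 1


def build_gamma(eta):
    """Coefficient matrix of the recursion (5.37) for a given step-size eta."""
    k = 1 - kappa3**2
    e2k2 = (eta * kappa2) ** 2
    return np.array([
        [(1 + kappa3**2) / 2, 0.0, 0.0, 0.0, 4 * e2k2 / k],
        [4 * kappa2**2 * eta / kappa1, 1 - eta * kappa1 / 2,
         4 * kappa2**2 * alpha**2 * eta / kappa1, 2 * e2k2 / m, 0.0],
        [96 * e2k2 + 8, 144 * e2k2, 160 * e2k2, 32 * e2k2, 16 * e2k2],
        [2 / n_chk, 2 / n_chk, 0.0, 1 - 1 / n_hat, 0.0],
        [1240 / k, 266 / k, 188 / k, 42 / k, (3 + kappa3**2) / 4],
    ])


# Positive vector w chosen as in the proof (w1 = 1, w3 = 9, ...).
w = np.array([
    1.0,
    16 * gamma**2,
    9.0,
    34 * n_hat * gamma**2 / n_chk,
    34464 * n_hat * gamma**2 / ((1 - kappa3**2) ** 2 * n_chk),
])

# Halve eta until the componentwise condition Gamma w < w holds.
eta = 1.0
for _ in range(60):
    G = build_gamma(eta)
    if np.all(G @ w < w):
        break
    eta /= 2

rho = max(abs(np.linalg.eigvals(G)))  # spectral radius of Gamma at the found eta
```

The halving search returns a feasible step-size for these particular constants only; it says nothing about how tight the interval in Theorem 5.12 is.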
We now solve for the range of $\eta$ from the first three conditions in (5.38), given the previously fixed $w_1, w_2, w_3, w_4, w_5$. According to the first condition in (5.38), we obtain
$$\eta < \frac{1 - \kappa_3^2}{2\kappa_2}\sqrt{\frac{w_1}{2w_4}} = \frac{1 - \kappa_3^2}{12\gamma\kappa_2}\sqrt{\frac{\check{n}}{\hat{n}}}.$$
Moreover, we can conclude that if η meets η
0 and $0 < \max\{\kappa_4, \rho(\Gamma)\} \le \kappa_5 < 1$. Therefore, it can be deduced from (5.40) that
$$\|\theta_t\| \le \|\Gamma^t\|\,\|\theta_0\| + \sum_{l=0}^{t-1}\|\Gamma^{t-l-1}\nu_l\| \le Q\|\theta_0\|\kappa_5^t + \sum_{l=0}^{t-1} Q\kappa_5^t = \omega_t\kappa_5^t, \tag{5.41}$$
where $\omega_t = Q\|\theta_0\| + Qt$. Following from (5.41), it holds that $\lim_{t \to \infty}\|\theta_t\|/\kappa_6^t \le \lim_{t \to \infty}\omega_t(\kappa_5/\kappa_6)^t = 0$ for all $\kappa_5 < \kappa_6 < 1$. Hence, there exist a positive constant $Q_1$ and an arbitrarily small constant $\nu$ such that $\kappa_6 = \kappa_5 + \nu$ and $\|\theta_t\| \le Q_1\kappa_6^t$ for all $t \ge 0$. Then,
$$\mathbb{E}[\|x_t - (1_m \otimes I_p)x^*\|^2] \le 2\,\mathbb{E}[\|x_t - A_\infty x_t\|^2] + 2\,\mathbb{E}[\|A_\infty x_t - (1_m \otimes I_p)x^*\|^2] \le 2Q_1(\kappa_5 + \nu)^t, \tag{5.42}$$
where we define $\lambda = \sqrt{\kappa_5 + \nu} = \sqrt{\kappa_6}$. This completes the proof of Theorem 5.12.

Remark 5.13 Theorem 5.12 means that, even when adopting the event-triggered communication strategy and the stochastic gradient $g_t$, ET-DASG is guaranteed to solve the machine learning task (5.1) with a linear convergence rate, provided that the conditions on the parameters (such as $\eta$ and $\alpha$) and the assumptions on the communication network and the cost functions hold. In addition, it is worth emphasizing that, owing to several factors in this chapter (such as the convergence analysis method, the event-triggered strategy, and the variance-reduction technique), it is difficult to obtain the optimal momentum coefficient explicitly, as in the
existing work [16]. We will study this issue in detail in future work.

To verify the effectiveness of the designed event-triggered communication strategy, we next prove that, for each node, the time interval between two successive triggering instants is larger than the iteration interval, i.e., $t^i_{k(i,t)+1} - t^i_{k(i,t)} \ge 2$, $\forall i \in \mathcal{V}$, $t \ge 1$.

Theorem 5.14 Suppose that Assumptions 5.1 and 5.2 hold. Considering Algorithm 3, if the momentum coefficient $\alpha$ and the constant step-size $\eta$ are selected to satisfy Theorem 5.12, then $t^i_{k(i,t)+1} - t^i_{k(i,t)} \ge 2$, $\forall i \in \mathcal{V}$, $t \ge 1$, provided that the event-triggered parameters $C$ and $\kappa_4$ satisfy $C > ((12Q_1 + 128Q_1\kappa_2^2)(1 + \kappa_4^2))/\kappa_4^2$.

Proof It is deduced from (5.3) that
$$\begin{aligned}
\|\varepsilon_t^{i,x}\|^2 + \|\varepsilon_t^{i,y}\|^2 ={}& \|x^i_{t^i_{k(i,t)}} - x_t^i\|^2 + \|y^i_{t^i_{k(i,t)}} - y_t^i\|^2 \\
\le{}& 4\|x^i_{t^i_{k(i,t)}} - \bar{x}_{t^i_{k(i,t)}}\|^2 + 4\|\bar{x}_{t^i_{k(i,t)}} - x^*\|^2 + 4\|x^* - \bar{x}_t\|^2 \\
&+ 4\|\bar{x}_t - x_t^i\|^2 + 4\|y^i_{t^i_{k(i,t)}} - \bar{y}_{t^i_{k(i,t)}}\|^2 + 4\|\bar{y}_t - y_t^i\|^2 \\
&+ 4\|\bar{g}_{t^i_{k(i,t)}} - \nabla f(x^*)\|^2 + 4\|\nabla f(x^*) - \bar{g}_t\|^2,
\end{aligned} \tag{5.43}$$
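where we have employed $\bar{y}_t = \bar{g}_t$. Then, we further have that
$$\begin{aligned}
\mathbb{E}[\|\nabla f(x^*) - \bar{g}_t\|^2 \mid \mathcal{F}_t] ={}& \mathbb{E}[\|\nabla f(x^*) - \nabla\bar{F}(s_t)\|^2 \mid \mathcal{F}_t] + \mathbb{E}[\|\nabla\bar{F}(s_t) - \bar{g}_t\|^2 \mid \mathcal{F}_t] \\
\le{}& \frac{1}{m^2}\,\mathbb{E}[\|\nabla F(s_t) - g_t\|^2 \mid \mathcal{F}_t] + \frac{2\kappa_2^2}{m}\,\mathbb{E}[\|x_t - x_{t-1}\|^2 \mid \mathcal{F}_t] \\
&+ 4\kappa_2^2\,\mathbb{E}[\|\bar{x}_t - x^*\|^2 \mid \mathcal{F}_t] + \frac{4\kappa_2^2}{m}\,\mathbb{E}[\|x_t - A_\infty x_t\|^2 \mid \mathcal{F}_t],
\end{aligned} \tag{5.44}$$
where in the equality we have used that $\mathbb{E}[\bar{g}_t \mid \mathcal{F}_t] = \nabla\bar{F}(s_t)$, and in the inequality we have applied $\mathbb{E}[\|\nabla\bar{F}(s_t) - \bar{g}_t\|^2 \mid \mathcal{F}_t] = (1/m^2)\,\mathbb{E}[\|\nabla F(s_t) - g_t\|^2 \mid \mathcal{F}_t]$ and Assumption 5.1(ii). Similarly, one can get
$$\begin{aligned}
\mathbb{E}[\|\bar{g}_{t^i_{k(i,t)}} - \nabla f(x^*)\|^2 \mid \mathcal{F}_t] \le{}& \frac{4\kappa_2^2}{m}\,\mathbb{E}[\|x_{t^i_{k(i,t)}} - A_\infty x_{t^i_{k(i,t)}}\|^2 \mid \mathcal{F}_t] + \frac{2\kappa_2^2}{m}\,\mathbb{E}[\|x_{t^i_{k(i,t)}} - x_{t^i_{k(i,t)} - 1}\|^2 \mid \mathcal{F}_t] \\
&+ 4\kappa_2^2\,\mathbb{E}[\|\bar{x}_{t^i_{k(i,t)}} - x^*\|^2 \mid \mathcal{F}_t] + \frac{1}{m^2}\,\mathbb{E}[\|\nabla F(s_{t^i_{k(i,t)}}) - g_{t^i_{k(i,t)}}\|^2 \mid \mathcal{F}_t].
\end{aligned} \tag{5.45}$$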
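Recalling $\|\theta_t\| \le Q_1\kappa_6^t$ in Theorem 5.12, we infer from the definition of $\theta_t$ that $\mathbb{E}[\|x_t - A_\infty x_t\|^2] \le Q_1\kappa_6^t$, $\mathbb{E}[m\|\bar{x}_t - x^*\|^2] \le Q_1\kappa_6^t$, $\mathbb{E}[\|x_t - x_{t-1}\|^2] \le Q_1\kappa_6^t$, $\mathbb{E}[v_t] \le Q_1\kappa_6^t$, and $\mathbb{E}[\|y_t - A_\infty y_t\|^2] \le Q_1\kappa_6^t$. Next, we apply (5.9), (5.43), (5.44), and (5.45) to proceed. If $0 < \alpha < 1$, we have
$$\|\varepsilon_t^{i,x}\|^2 + \|\varepsilon_t^{i,y}\|^2 \le (12Q_1 + 128Q_1\kappa_2^2)\,\kappa_6^{t} + (12Q_1 + 128Q_1\kappa_2^2)\,\kappa_6^{t^i_{k(i,t)}}. \tag{5.46}$$
It is also worth noting that as long as the triggering condition is not satisfied, the next event does not happen, that is, $\|\varepsilon_t^{i,x}\|^2 + \|\varepsilon_t^{i,y}\|^2 \le C\kappa_4^t$, $\forall t \ge 0$, $i \in \mathcal{V}$. Thus, when $t = t^i_{k(i,t)+1}$, the following inequality must be satisfied:
$$C\kappa_4^{t^i_{k(i,t)+1}} \le \|\varepsilon^{i,x}_{t^i_{k(i,t)+1}}\|^2 + \|\varepsilon^{i,y}_{t^i_{k(i,t)+1}}\|^2, \tag{5.47}$$
which indicates that
$$C\kappa_4^{t^i_{k(i,t)+1}} \le (12Q_1 + 128Q_1\kappa_2^2)\,\kappa_6^{t^i_{k(i,t)+1}} + (12Q_1 + 128Q_1\kappa_2^2)\,\kappa_6^{t^i_{k(i,t)}}. \tag{5.48}$$
Since $\kappa_4 < \kappa_6$, (5.48) must hold if there is a constant $C$ satisfying
$$t^i_{k(i,t)+1} - t^i_{k(i,t)} \ge \ln\left(\frac{12Q_1 + 128Q_1\kappa_2^2}{C - (12Q_1 + 128Q_1\kappa_2^2)}\right) \Big/ \ln\kappa_4. \tag{5.49}$$
Here, it suffices to verify that $t^i_{k(i,t)+1} - t^i_{k(i,t)} \ge 2$, $\forall i \in \mathcal{V}$, $t \ge 1$, when $C$ and $\kappa_4$ are chosen as in Theorem 5.14. Indeed, the condition on $C$ gives $(12Q_1 + 128Q_1\kappa_2^2)/(C - (12Q_1 + 128Q_1\kappa_2^2)) < \kappa_4^2$, and since $\ln\kappa_4 < 0$, the right-hand side of (5.49) is larger than 2.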
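The last step of the proof can be illustrated numerically. The following sketch evaluates the right-hand side of (5.49); the values of `Q1`, `kappa2`, and `kappa4` are hypothetical placeholders (the constant $Q_1$ is not given numerically in the chapter), and the function name is introduced only for this illustration. At the threshold value of $C$ from Theorem 5.14 the bound equals 2, and it grows as $C$ increases.

```python
import math


def min_trigger_gap(C, kappa4, Q1, kappa2):
    """Right-hand side of (5.49): lower bound on t_{k+1} - t_k."""
    K = 12 * Q1 + 128 * Q1 * kappa2**2
    if C <= K:
        raise ValueError("C must exceed 12*Q1 + 128*Q1*kappa2**2")
    return math.log(K / (C - K)) / math.log(kappa4)


# Hypothetical constants for illustration only.
Q1, kappa2, kappa4 = 1.0, 2.0, 0.985
K = 12 * Q1 + 128 * Q1 * kappa2**2

# Threshold on C from Theorem 5.14.
C_threshold = K * (1 + kappa4**2) / kappa4**2

gap_at_threshold = min_trigger_gap(C_threshold, kappa4, Q1, kappa2)
gap_above = min_trigger_gap(1.5 * C_threshold, kappa4, Q1, kappa2)
```

Because $\ln\kappa_4 < 0$, enlarging $C$ makes the logarithm in the numerator more negative and therefore enlarges the guaranteed gap.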
Remark 5.15 The communication cost of ET-DASG depends strongly on the event-triggered parameters $C$ and $\kappa_4$. When the event-triggered parameters satisfy the condition $C > ((12Q_1 + 128Q_1\kappa_2^2)(1 + \kappa_4^2))/\kappa_4^2$ in Theorem 5.14 (and the other parameters, such as $\alpha$ and $\eta$, satisfy Theorem 5.12 to achieve convergence), the time interval between two successive triggering instants for each node is at least 2, i.e., $t^i_{k(i,t)+1} - t^i_{k(i,t)} \ge 2$, $\forall i \in \mathcal{V}$, $t \ge 1$. This means that, with the event-triggered communication strategy, the per-iteration communication cost of ET-DASG is at most half of that of the related time-triggered distributed methods [18–22, 40, 41]. From this point of view, the communication cost of ET-DASG is related to the event-triggered parameters $C$ and $\kappa_4$, and ET-DASG can improve communication efficiency.

Remark 5.16 Although the existing gradient tracking methods [19–29], including our previous methods [21, 29], also converge linearly to the optimal solution, it is worth noting that ET-DASG enjoys three appealing features: (i) Communication efficiency with event-triggered communication strategy; (ii) Computation efficiency
with variance-reduction technique; (iii) Accelerated convergence with Nesterov’s acceleration mechanism.
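The event-triggered rule discussed in Theorem 5.14 and Remark 5.15 can be sketched as follows. This is a toy illustration, not Algorithm 3: the trajectories are synthetic geometric sequences, the function names are hypothetical, and the only element taken from the text is the threshold test, namely that a node broadcasts when $\|\varepsilon_t^{i,x}\|^2 + \|\varepsilon_t^{i,y}\|^2$ exceeds $C\kappa_4^t$.

```python
import numpy as np


def should_trigger(x_now, x_sent, y_now, y_sent, t, C, kappa4):
    """True when the measurement error exceeds the decaying threshold C * kappa4**t."""
    err = np.sum((x_sent - x_now) ** 2) + np.sum((y_sent - y_now) ** 2)
    return err > C * kappa4**t


def triggering_instants(xs, ys, C, kappa4):
    """Return the iterations at which one node would broadcast its state."""
    x_sent, y_sent = xs[0], ys[0]
    events = [0]                      # the initial state is always sent
    for t in range(1, len(xs)):
        if should_trigger(xs[t], x_sent, ys[t], y_sent, t, C, kappa4):
            x_sent, y_sent = xs[t], ys[t]   # neighbors now hold the fresh values
            events.append(t)
    return events


# Synthetic, geometrically converging trajectories (illustrative only).
T = 200
decay = 0.97 ** np.arange(T)
xs = decay[:, None] * np.ones(3)
ys = decay[:, None] * np.ones(3)

events = triggering_instants(xs, ys, C=5.0, kappa4=0.985)
gaps = np.diff(events)  # intervals between successive triggering instants
```

In this toy run the node broadcasts at only a subset of the 200 iterations; the actual savings of ET-DASG depend on the algorithm's own trajectories and on $C$ and $\kappa_4$.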
5.5 Numerical Examples

In this section, two numerical examples, logistic regression in machine learning [2, 59] and energy-based source localization in sensor networks [62], are provided to examine the effectiveness of ET-DASG. All examples are implemented in MATLAB R2014a on a 2017 MacBook Pro with an Intel(R) Core(TM) i5 CPU (2.3 GHz) and 8 GB RAM. To facilitate verification and comparison, we adopt the YALMIP toolbox in MATLAB to obtain the optimal solutions of the following examples.

5.5.1 Example 1: Logistic Regression

In this subsection, ET-DASG is used to solve a binary classification problem by logistic regression on two real datasets from the UCI Machine Learning Repository [63]: the breast cancer Wisconsin (diagnostic) dataset (dataset 1) and the mushroom dataset (dataset 2). For the networks, a randomly connected network with $m = 10$ nodes generated by the Erdős–Rényi model [10, 60] with connection probability $p_c = 0.4$ is plotted in Fig. 5.1a, and three further categories of network (complete network, cycle network, and star network) are plotted in Fig. 5.1b–d. The motivation for adopting the two real datasets and the Erdős–Rényi network model is as follows:

(a) This subsection applies ET-DASG to the logistic regression problem in machine learning, which is a binary classification problem. On the one hand, both datasets 1 and 2 can be used for training binary classifiers (diseased vs. non-diseased and toxic vs. non-toxic). On the other hand, the scales of the two datasets are different. Dataset 1 can be used to examine the performance of ET-DASG, while dataset 2 can be employed for comparison with other related distributed methods to verify the advantages of ET-DASG in processing large-scale problems. In addition, many recent works [64, 65] have evaluated algorithms on these two datasets for machine learning.

(b) The Erdős–Rényi network model employed in this subsection yields an undirected and connected graph that satisfies the conditions in Assumption 5.2. In addition, many works [10, 60] simulate algorithms on the Erdős–Rényi network model. Because of the randomness of the network, the simulation results are more reliable and can verify the performance of ET-DASG well.

In dataset 1, we use $n = 200$ samples as training data and each sample has dimension $p = 9$. In dataset 2, we use $n = 6000$ samples as training data and each
Fig. 5.1 Four undirected and connected network topologies composed of 10 nodes. (a) Random network with a connection probability pc = 0.4. (b) Complete network. (c) Cycle network. (d) Star network
data has dimension $p = 112$. For each dataset, all features have been preprocessed and normalized to the unit vector. The distributed logistic regression problem takes the form
$$\min_{x \in \mathbb{R}^p}\ \frac{1}{m}\sum_{i=1}^{m}\frac{1}{n_i}\sum_{h=1}^{n_i}\ln\Big(1 + \exp\big(-b^{i,h}(c^{i,h})^T x\big)\Big) + \frac{\pi}{2}\|x\|_2^2,$$
where the local objective function $f^i(x)$ is
$$f^i(x) = \frac{1}{n_i}\sum_{h=1}^{n_i}\ln\Big(1 + \exp\big(-b^{i,h}(c^{i,h})^T x\big)\Big) + \frac{\pi}{2}\|x\|_2^2,$$
with $b^{i,h} \in \{-1, 1\}$ and $c^{i,h} \in \mathbb{R}^p$ being local data kept by node $i$ for $h \in \{1, \ldots, n_i\}$; $(\pi/2)\|x\|_2^2$ is the regularization term for avoiding overfitting. In the simulation, we randomly divide the data among the local nodes, i.e., $\sum_{i=1}^{10} n_i = n$. The simulation results are described in the following five parts, (i)–(v). Dataset 1 is adopted in (i)–(iv), and datasets 1 and 2 are adopted in (v). Moreover, the networks in Fig. 5.1a–d are used in part (iv), while the network in Fig. 5.1a is used in all other parts. For the comparisons in parts (iii)–(v), the residual $(1/m)\log_{10}(\sum_{i=1}^{m}\|x_t^i - x^*\|)$ is treated as the comparison metric.

(i) Convergence: In this simulation, we set $\eta = 0.0035$, $\alpha = 0.2$, $C = 5$, $\kappa_4 = 0.985$, $\pi = 4$, and $\sum_{i=1}^{10} n_i = 200$. The transient behaviors of three (randomly selected) dimensions of the state estimator $x$ and the testing accuracy are shown in Fig. 5.2, which illustrates that the testing accuracy is 97.1% and that the state estimator $x$ in ET-DASG achieves consensus in the mean at the global optimal solution.

(ii) Triggering times: In this simulation (other parameters are the same as in part (i)), we discuss the triggering times for the neighbors when 5 nodes run ET-DASG under different event-triggered parameters. The simulation results are shown in Fig. 5.3, which, combined with Fig. 5.2, implies that ET-DASG with the event-triggered communication strategy achieves the expected results with fewer communications. In addition, Fig. 5.3 shows that the number of triggering times decreases (that is, the communication cost becomes smaller) as the parameter $\kappa_4$ increases. Here, since $C$ is related to $\kappa_4$ (Theorem 5.14) and the admissible range of $C$ becomes wider as $\kappa_4$ becomes larger, it is reasonable to discuss the influence of different $\kappa_4$ on the triggering times of the nodes while $C$ remains unchanged.

(iii) Impacts of the constant step-size and the momentum coefficient: In this simulation (other parameters are the same as in part (i)), we discuss the impacts of the constant step-size $\eta$ and the momentum coefficient $\alpha$ in ET-DASG on the convergence results. The simulation results are shown in Fig. 5.4. Figure 5.4a indicates that increasing the step-size $\eta$ speeds up the convergence of ET-DASG. However, when the step-size $\eta$ exceeds an upper bound (around 0.012), the performance of ET-DASG degrades. Figure 5.4b shows that the convergence rate of ET-DASG improves as the momentum coefficient $\alpha$ increases ($\eta = 0.012$). However, this is subject to the upper bound on $\alpha$ (Theorem 5.12).

(iv) Impacts of network sparsity: In this simulation ($\eta = 0.012$ and other parameters are the same as in part (i)), we discuss the impacts of the networks on the convergence results of ET-DASG. The simulation results are shown in Fig. 5.5, which shows that ET-DASG converges faster as the network becomes denser.

(v) Comparison: In this simulation (the related parameters are the same as in part (i)), we compare ET-DASG with other existing methods [20, 38, 59] to show the appealing features of ET-DASG. The simulation results are shown in Fig. 5.6 and Table 5.1. Figure 5.6 means that ET-DASG can achieve the same
[Figure 5.2: panel (a) plots three dimensions of $x$ ($x_t^{i,1}$, $x_t^{i,2}$, $x_t^{i,3}$) against the iteration, 0 to 1000; panel (b) plots the testing accuracy of ET-DASG against the iteration, 0 to 20, with a marked point at iteration 19, accuracy 0.971.]
Fig. 5.2 Convergence: (a) The transient behaviors of three dimensions (randomly selected) of state estimator x. (b) The testing accuracy of ET-DASG
convergence as other existing methods [20, 38, 59] even though it uses both the event-triggered strategy and the variance-reduction technique at the same time. Then, Table 5.1 summarizes the convergence time in seconds and the number of local gradient evaluations of ET-DASG and the method in [38] for a specific residual $10^{-7}$ under the two real training datasets. Table 5.1 shows that ET-DASG requires less computation time and fewer local gradient evaluations, so it reaches the target quickly and greatly reduces the computation cost in machine learning tasks. Moreover, Table 5.1 also shows that when the number $n$ and the dimension $p$ of the datasets are large, the computation time and the number of local gradient evaluations of ET-DASG are far smaller than those of the
[Figure 5.3: two panels, each showing the triggering times of 5 nodes against the iteration, 0 to 200.]
Fig. 5.3 The triggering times for the neighbors when 5 nodes run ET-DASG under different event-triggered parameters
method in [38]. Hence, by adopting the stochastic gradient, ET-DASG is well suited to large-scale machine learning tasks.

Remark 5.17 Owing to the simultaneous use of event-triggered communication and stochastic gradients, ET-DASG requires more iterations to achieve a specific residual, i.e., the convergence rate of ET-DASG is slightly slower than that of other related methods [20, 38, 58]. It is worth highlighting that, for a specific termination condition, ET-DASG has advantages in the number of communications, the number of local gradient evaluations, and the running time, which demonstrates good efficiency in communication and computation.
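The local logistic objective $f^i(x)$ of this example and its gradient can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' MATLAB code: the array names, the random features, and the helper `sample_gradient` are assumptions, and the single-sample gradient shown here is the plain stochastic gradient, not the variance-reduced estimator $g_t$ of Algorithm 3.

```python
import numpy as np


def local_objective(x, feats, labels, reg):
    """f^i(x) = mean_h ln(1 + exp(-b_h c_h^T x)) + (reg / 2) ||x||^2."""
    margins = -labels * (feats @ x)
    return np.mean(np.logaddexp(0.0, margins)) + 0.5 * reg * (x @ x)


def local_gradient(x, feats, labels, reg):
    """Gradient of the local objective over all samples held by one node."""
    margins = -labels * (feats @ x)
    sig = 0.5 * (1.0 + np.tanh(0.5 * margins))   # numerically stable sigmoid
    return feats.T @ (-labels * sig) / len(labels) + reg * x


def sample_gradient(x, feats, labels, reg, h):
    """Gradient of the h-th local component function (one data sample)."""
    return local_gradient(x, feats[h:h + 1], labels[h:h + 1], reg)


# Synthetic local data for one node (illustrative sizes; reg plays the role of pi).
rng = np.random.default_rng(0)
n_i, p, reg = 20, 9, 4.0
feats = rng.normal(size=(n_i, p))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)   # unit-norm features
labels = rng.choice([-1.0, 1.0], size=n_i)
x = rng.normal(size=p)

full_grad = local_gradient(x, feats, labels, reg)
mean_of_samples = np.mean(
    [sample_gradient(x, feats, labels, reg, h) for h in range(n_i)], axis=0
)
```

Averaging the single-sample gradients over all local samples recovers the full local gradient, which is why sampling an index uniformly gives an unbiased estimate.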
[Figure 5.4: residual against the iteration, 0 to 2000; one panel for step-sizes η = 0.002, 0.004, 0.006, 0.010, 0.012, 0.016 and one panel for momentum coefficients α = 0.1, 0.2, 0.3.]
Fig. 5.4 Evolution of residuals under different constant step-sizes or momentum coefficients
5.5.2 Example 2: Energy-based Source Localization

In this example, ET-DASG is applied to the energy-based source localization problem [62] over a network of $m$ sensors (a randomly connected network with $p_c = 0.8$). Sensor $i$ is placed at a random spatial location $a^i \in \mathbb{R}^2$, $i = 1, \ldots, m$, which is known only to the sensor itself, and each sensor collects $n_i$ measurements. An isotropic energy propagation model is used for the $h$-th received signal strength at sensor $i$, which is represented by $s^{i,h} = c/\|\hat{x} - a^i\|^d + b^{i,h}$, where $c > 0$ is a constant and $d \ge 1$ is an attenuation characteristic; $\|\hat{x} - a^i\| > 1$, and $b^{i,h}$ is an independent and identically distributed sample noise following the zero-mean Gaussian distribution with variance $\sigma^2$.
[Figure 5.5: residual against the iteration, 0 to 2000, for random networks with pc = 0.2, 0.4, and 0.6, and for the complete, cycle, and star networks.]
Fig. 5.5 Evolution of residuals under different networks
[Figure 5.6: residual against the iteration, 0 to 1000, for ET-DASG and the methods in [59], [38], and [20].]
Fig. 5.6 Comparisons between ET-DASG and other methods
The maximum-likelihood estimator for the source's location is found by solving the following problem:
$$\min_{x \in \mathbb{R}^2}\ \frac{1}{m}\sum_{i=1}^{m}\left(\frac{1}{n_i}\sum_{h=1}^{n_i}\Big(s^{i,h} - \frac{c}{\|x - a^i\|^d}\Big)^2\right).$$
Here, we consider that $m = 50$ sensors are uniformly distributed in a $100 \times 100$ square and the source location is randomly chosen from the square. The source emits a signal with strength $c = 100$ and each sensor has $n_i = 100$ measurements. In this simulation, we set $\eta = 0.008$, $\alpha = 0.1$, $C = 5$, and $\kappa_4 = 0.92$. Assume
Table 5.1 The convergence time in seconds and the number of local gradient evaluations of ET-DASG and the method in [38] for a specific residual $10^{-7}$ under two real datasets

Datasets     ET-DASG                  The method in [38]
             Time (s)     Number      Time (s)     Number
Dataset 1    4.9287       1665        5.8607       29360
Dataset 2    22.8123      3552        132.9638     7.956 × 10^5
[Figure 5.7: decentralized energy-based source localization; sensor locations and the source location on the 100 × 100 square, both axes from 10 to 100.]
Fig. 5.7 The randomly selected 7 paths displayed on top of contours of log-likelihood function
that there is a stationary acoustic source $x^* \in \mathbb{R}^2$ located at (55, 55) (a location unknown to every sensor), which we aim to locate in the sensor network. Based on the above, 7 randomly selected paths taken by ET-DASG are shown in Fig. 5.7, plotted on top of contours of the log-likelihood. Figure 5.7 illustrates that ET-DASG successfully finds the exact source location, like the other verified effective algorithm [62], and is therefore suitable for the practical energy-based source localization problem.
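The measurement model and the least-squares objective of this example can be sketched as follows. This is an illustration under stated assumptions: the attenuation exponent `d = 2.0` and the noise level `sigma` are hypothetical (the chapter only requires $d \ge 1$ and does not report $\sigma$), all sensors hold the same number of measurements, and the function names are introduced only for this sketch.

```python
import numpy as np


def received_strength(x, sensor, c, d):
    """Noise-free isotropic energy model c / ||x - a||^d."""
    return c / np.linalg.norm(x - sensor) ** d


def simulate_measurements(source, sensors, n_i, c, d, sigma, rng):
    """Each row holds the n_i noisy measurements s^{i,h} of one sensor."""
    clean = np.array([received_strength(source, a, c, d) for a in sensors])
    return clean[:, None] + sigma * rng.normal(size=(len(sensors), n_i))


def global_cost(x, sensors, meas, c, d):
    """Average over sensors of the mean squared residual (equal n_i assumed)."""
    model = np.array([received_strength(x, a, c, d) for a in sensors])
    return np.mean((meas - model[:, None]) ** 2)


rng = np.random.default_rng(1)
m, n_i, c, d = 50, 100, 100.0, 2.0          # d is a hypothetical choice
source = np.array([55.0, 55.0])
sensors = rng.uniform(0.0, 100.0, size=(m, 2))

# Noise-free data: the true source then attains zero cost.
meas = simulate_measurements(source, sensors, n_i, c, d, sigma=0.0, rng=rng)
cost_at_source = global_cost(source, sensors, meas, c, d)
cost_elsewhere = global_cost(source + np.array([5.0, -3.0]), sensors, meas, c, d)
```

With noisy measurements (`sigma > 0`) the minimizer only approximates the source location, and the objective is nonconvex, so this sketch evaluates the cost rather than claiming how any particular solver behaves on it.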
5.6 Conclusion

In this chapter, we have proposed a novel event-triggered distributed accelerated stochastic gradient algorithm, namely ET-DASG, for solving machine learning tasks over networks. ET-DASG realizes better communication efficiency, achieves higher computation efficiency, and implements accelerated convergence. With the help of linear matrix inequality theory, we proved that ET-DASG with a suitably selected constant step-size converges linearly in the mean to the optimal solution if each constituent function is strongly convex and smooth. Furthermore, the time interval for each node between two successive triggering instants has been proven to be larger than the iteration interval. Simulation results have confirmed the appealing performance of ET-DASG. However, ET-DASG
is not suitable for the network with random link failures or the problems with constraints. In addition, the privacy issues are also not considered in ET-DASG. Future work will further focus on investigating the privacy protection of ET-DASG and extending the algorithm to be appropriate for the directed networks and the distributed constrained optimization problems. The asynchronous implementation of ET-DASG over broadcast-based mechanism or gossip-based mechanism with random link failures is also a promising research direction.
References

1. S. Pu, A. Olshevsky, I.C. Paschalidis, Asymptotic network independence in distributed stochastic optimization for machine learning: examining distributed and centralized stochastic gradient descent. IEEE Signal Process. Mag. 37(3), 114–122 (2020)
2. Z. Wang, H. Li, Edge-based stochastic gradient algorithm for distributed optimization. IEEE Trans. Netw. Sci. Eng. 7(3), 1421–1430 (2020)
3. C. Li, X. Yu, T. Huang, X. He, Distributed optimal consensus over resource allocation network and its application to dynamical economic dispatch. IEEE Trans. Neural Netw. Learn. Syst. 29(6), 2407–2418 (2018)
4. N. Heydaribeni, A. Anastasopoulos, Distributed mechanism design for network resource allocation problems. IEEE Trans. Netw. Sci. Eng. 7(2), 621–636 (2020)
5. M. Rossi, M. Centenaro, A. Ba, S. Eleuch, T. Erseghe, M. Zorzi, Distributed learning algorithms for optimal data routing in IoT networks. IEEE Trans. Signal Inf. Proc. Netw. 6, 175–195 (2020)
6. D. Nunez, J. Cortes, Efficient privacy-preserving machine learning in hierarchical distributed system. IEEE Trans. Netw. Sci. Eng. 6(4), 599–612 (2019)
7. A. Nedic, Distributed gradient methods for convex machine learning problems in networks: distributed optimization. IEEE Signal Process. Mag. 37(3), 92–101 (2020)
8. A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control 54(1), 48–61 (2009)
9. D. Nunez, J. Cortes, Distributed online convex optimization over jointly connected digraphs. IEEE Trans. Netw. Sci. Eng. 1(1), 23–37 (2014)
10. C. Li, X. Yu, W. Yu, G. Chen, J. Wang, Efficient computation for sparse load shifting in demand side management. IEEE Trans. Smart Grid 8(1), 250–261 (2017)
11. M.O. Sayin, N.D. Vanli, S.S. Kozat, T. Basar, Stochastic subgradient algorithms for strongly convex optimization over distributed networks. IEEE Trans. Netw. Sci. Eng. 4(4), 248–260 (2017)
12. H. Li, Q. Lü, G. Chen, T. Huang, Z. Dong, Distributed constrained optimization over unbalanced directed networks using asynchronous broadcast-based algorithm. IEEE Trans. Autom. Control 66(3), 1102–1115 (2021)
13. Z. Wang, D. Wang, D. Gu, Distributed optimal state consensus for multiple circuit systems with disturbance rejection. IEEE Trans. Netw. Sci. Eng. 7(4), 2926–2939 (2020)
14. X. He, J. Yu, T. Huang, C. Li, Distributed power management for dynamic economic dispatch in the multimicrogrids environment. IEEE Trans. Control Syst. Technol. 27(4), 1651–1658 (2019)
15. N. Loizou, M. Rabbat, P. Richtárik, Provably accelerated randomized gossip algorithms, in Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2019). https://doi.org/10.1109/ICASSP.2019.8683847
16. D. Jakovetic, J. Xavier, J. Moura, Fast distributed gradient methods. IEEE Trans. Autom. Control 59(5), 1131–1146 (2014)
17. A. Nedic, J. Liu, Distributed optimization for control. Ann. Rev. Control Robot. Auton. Syst. 1, 77–103 (2018) 18. H. Li, Q. Lü, T. Huang, Convergence analysis of a distributed optimization algorithm with a general unbalanced directed communication network. IEEE Trans. Netw. Sci. Eng. 6(3), 237–248 (2019) 19. A. Nedic, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017) 20. G. Qu, N. Li, Harnessing smoothness to accelerate distributed optimization. IEEE Trans. Control Netw. Syst. 5(3), 1245–1260 (2018) 21. Q. Lü, H. Li, D. Xia, Geometrical convergence rate for distributed optimization with time-varying directed graphs and uncoordinated step-sizes. Inf. Sci. 422, 516–530 (2018) 22. R. Xin, U.A. Khan, A linear algorithm for optimization over directed graphs with geometric convergence. IEEE Control Syst. Lett. 2(3), 315–320 (2018) 23. W. Shi, Q. Ling, G. Wu, W. Yin, EXTRA: an exact first-order algorithm for decentralized consensus optimization. SIAM J. Optim. 25(2), 944–966 (2015) 24. Y. Sun, A. Daneshmand, G. Scutari, Convergence rate of distributed optimization algorithms based on gradient tracking (2019). Preprint. arXiv:1905.02637 25. S. Pu, W. Shi, J. Xu, A. Nedic, Push-pull gradient methods for distributed optimization in networks. IEEE Trans. Autom. Control 66(1), 1–16 (2021) 26. M. Bin, I. Notarnicola, L. Marconi, G. Notarstefano, A system theoretical perspective to gradient-tracking algorithms for distributed quadratic optimization, in Proceedings of the 2019 IEEE 58th Conference on Decision and Control (CDC) (2019). https://doi.org/10.1109/CDC40024.2019.9029824 27. G. Scutari, Y. Sun, Distributed nonconvex constrained optimization over time-varying digraphs. Math. Program. 176(1), 497–544 (2019) 28. M.I. Qureshi, R. Xin, S. Kar, U.A. Khan, S-ADDOPT: decentralized stochastic first-order optimization over directed graphs. IEEE Control Syst. Lett. 
5(3), 953–958 (2021) 29. Q. Lü, H. Li, Z. Wang, Q. Han, W. Ge, Performing linear convergence for distributed constrained optimisation over time-varying directed unbalanced networks. IET Control Theory Appl. 13(17), 2800–2810 (2019) 30. X. He, X. Fang, J. Yu, Distributed energy management strategy for reaching cost-driven optimal operation integrated with wind forecasting in multimicrogrids system. IEEE Trans. Syst. Man Cybern. Syst. 49(8), 1643–1651 (2019) 31. J. Zhang, K. You, K. Cai, Distributed dual gradient tracking for resource allocation in unbalanced networks. IEEE Trans. Signal Process. 68, 2186–2198 (2020) 32. H. Xiao, Y. Yu, S. Devadas, On privacy-preserving decentralized optimization through alternating direction method of multipliers (2019). Preprint. arXiv:1902.06101 33. M. Maros, J. Jalden, ECO-PANDA: a computationally economic, geometrically converging dual optimization method on time-varying undirected graphs, in Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2019). https://doi.org/10.1109/ICASSP.2019.8683797 34. K. Scaman, F. Bach, S. Bubeck, Y. Lee, L. Massoulie, Optimal convergence rates for convex distributed optimization in networks. J. Mach. Learn. Res. 20(159), 1–31 (2019) 35. C.A. Uribe, S. Lee, A. Gasnikov, A. Nedic, A dual approach for optimal algorithms in distributed optimization over networks, in Proceedings of the 2020 Information Theory and Applications Workshop (ITA) (2020). https://doi.org/10.1109/ITA50056.2020.9244951 36. S.A. Alghunaim, E. Ryu, K. Yuan, A.H. Sayed, Decentralized proximal gradient algorithms with linear convergence rates. IEEE Trans. Autom. Control 66(6), 2787–2794 (2021) 37. Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (Springer Science & Business Media, Berlin, 2013) 38. R. Xin, D. Jakovetic, U.A. Khan, Distributed nesterov gradient methods over arbitrary graphs. IEEE Signal Process. Lett. 26(8), 1247–1251 (2019)
39. Q. Lü, X. Liao, H. Li, T. Huang, A nesterov-like gradient tracking algorithm for distributed optimization over directed networks. IEEE Trans. Syst. Man Cybern. Syst. 51(10), 6258–6270 (2021) 40. R. Xin, U.A. Khan, Distributed heavy-ball: a generalization and acceleration of first-order methods with gradient tracking. IEEE Trans. Autom. Control 65(6), 2627–2633 (2020) 41. Q. Lü, X. Liao, H. Li, T. Huang, Achieving acceleration for distributed economic dispatch in smart grids over directed networks. IEEE Trans. Netw. Sci. Eng. 7(3), 1988–1999 (2020) 42. Y. Zhou, Z. Wang, K. Ji, Y. Liang, Proximal gradient algorithm with momentum and flexible parameter restart for nonconvex optimization (2020). Preprint. arXiv:2002.11582v1 43. C. Liu, H. Li, Y. Shi, D. Xu, Distributed event-triggered gradient method for constrained convex minimization. IEEE Trans. Autom. Control 65(2), 778–785 (2020) 44. N. Hayashi, T. Sugiura, Y. Kajiyama, S. Takai, Event-triggered consensus-based optimization algorithm for smooth and strongly convex cost functions, in Proceedings of the 2018 IEEE Conference on Decision and Control (CDC) (2018). https://doi.org/10.1109/CDC.2018.8618863 45. C. Li, X. Yu, W. Yu, T. Huang, Z-W. Liu, Distributed event-triggered scheme for economic dispatch in smart grids. IEEE Trans. Ind. Inform. 12(5), 1775–1785 (2016) 46. K. Zhang, J. Xiong, X. Dai, Q. Lü, On the convergence of event-triggered distributed algorithm for economic dispatch problem. Int. J. Electr. Power Energy Syst. 122, 1–10 (2020) 47. B. Swenson, R. Murray, S. Kar, H. Poor, Distributed stochastic gradient descent and convergence to local minima (2020). Preprint. arXiv:2003.02818v1 48. M. Assran, N. Loizou, N. Ballas, M. Rabbat, Stochastic gradient push for distributed deep learning, in Proceedings of the 36th International Conference on Machine Learning (ICML) (2019), pp. 344–353 49. D. Yuan, Y. Hong, D. Ho, G. Jiang, Optimal distributed stochastic mirror descent for strongly convex optimization. 
Automatica 90, 196–203 (2018) 50. J. Konecny, J. Liu, P. Richtarik, M. Takac, Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE J. Sel. Top. Signal Process. 10(2), 242–255 (2016) 51. M. Schmidt, N. Roux, F. Bach, Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1), 83–112 (2017) 52. A. Defazio, F. Bach, S. Lacoste-Julien, SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives, in Advances in Neural Information Processing Systems (NIPS), vol. 27 (2014), pp. 1–9 53. C. Tan, S. Ma, Y. Dai, Y. Qian, Barzilai-borwein step size for stochastic average gradient, in Advances in Neural Information Processing Systems, vol. 29 (2016), pp. 1–9 54. L. Nguyen, J. Liu, K. Scheinberg, M. Takac, SARAH: a novel method for machine learning problems using stochastic recursive gradient, in Proceedings of the 34th International Conference on Machine Learning (ICML) (2017), pp. 2613–2621 55. A. Mokhtari, A. Ribeiro, DSA: decentralized double stochastic averaging gradient algorithm. J. Mach. Learn. Res. 17(1), 2165–2199 (2016) 56. Z. Shen, A. Mokhtari, T. Zhou, P. Zhao, H. Qian, Towards more efficient stochastic decentralized learning: faster convergence and sparse communication, in Proceedings of the 35th International Conference on Machine Learning (PMLR), vol. 80 (2018), pp. 4624–4633 57. K. Yuan, B. Ying, J. Liu, A.H. Sayed, Variance-reduced stochastic learning by networked agents under random reshuffling. IEEE Trans. Signal Process. 67(2), 351–366 (2019) 58. H. Hendrikx, F. Bach, L. Massoulie, An accelerated decentralized stochastic proximal algorithm for finite sums, in Advances in Neural Information Processing Systems, vol. 32 (2019), pp. 4624–4633 59. R. Xin, S. Kar, U.A. Khan, Decentralized stochastic optimization and machine learning: a unified variance-reduction framework for robust performance and fast convergence. IEEE Signal Process. Mag. 37(3), 102–113 (2020) 60. B. Li, S. 
Cen, Y. Chen, Y. Chi, Communication-efficient distributed optimization in networks with gradient tracking and variance reduction, in Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) (2020), pp. 1662–1672
61. R.A. Horn, C.R. Johnson, Matrix Analysis (Cambridge University Press, New York, 2013) 62. D. Blatt, A. Hero, Energy-based sensor network source localization via projection onto convex sets. IEEE Trans. Signal Process. 54(9), 3614–3619 (2006) 63. D. Dua, C. Graff, UCI machine learning repository, Dept. School Inf. Comput. Sci., Univ. California, Irvine, CA, USA (2019) 64. R.M. Gower, M. Schmidt, F. Bach, P. Richtarik, Variance-reduced methods for machine learning. Proc. IEEE 108(11), 1968–1983 (2020) 65. S. Horvath, L. Lei, P. Richtarik, M.I. Jordan, Adaptivity of stochastic gradient methods for nonconvex optimization (2020). Preprint. arXiv:2002.05359
Chapter 6
Accelerated Algorithms for Distributed Economic Dispatch
Abstract In this chapter, we introduce accelerated distributed optimization algorithms for the economic dispatch problem (EDP) in smart grids. The EDP asks how to allocate the generation power among generators so that the load demand is met at minimum total generation cost while all constraints on the local generation capacity are satisfied. Each generator possesses its own local generation cost, and the total generation cost is the sum of all local generation costs. For the EDP, most existing methods, such as push-sum-based strategies, overcome the unbalancedness induced by directed networks by employing column-stochastic weights, which may be infeasible in distributed implementations. In contrast, to handle directed networks with row-stochastic weights, we develop a new directed distributed Lagrangian momentum algorithm, D-DLM, which integrates a distributed gradient tracking method with two momentum terms and non-uniform step-sizes in the update of the Lagrangian multipliers. Next, we prove that if the maximum step-size and the maximum momentum coefficient are positive and sufficiently small, D-DLM converges linearly to the optimal dispatch under smooth and strongly convex generation costs. Finally, various case studies of EDP in smart grids are simulated.
Keywords Distributed economic dispatch · Smart grids · Directed network · Distributed Lagrangian momentum algorithm · Linear convergence
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Q. Lü et al., Distributed Optimization in Networked Systems, Wireless Networks, https://doi.org/10.1007/978-981-19-8559-1_6
6.1 Introduction
The economic dispatch problem (EDP) is one of the fundamental issues in energy management during the practical operation of smart grids. The goal of EDP is to allocate the generation power among the generators to meet the load demands with minimal total operation cost (i.e., the sum of the local generation costs) while satisfying all constraints on local generation capacity. In a certain sense, EDP can be formulated as a constrained optimization problem, which has attracted many researchers in recent years [1–7]. To tackle EDP, many basic methods [8, 9] have been implemented in a centralized manner. However, these centralized methods
require a powerful centralized controller to collect global information about the entire grid and process large amounts of data, which is often computationally and communicationally expensive and prone to single-point failure. With the growing integration of renewable resources, energy storage devices, plug-in hybrid vehicles, and potential consumers, the smart grid of the future will be highly distributed, which makes traditional centralized methods impractical [10].
Recently, many distributed algorithms have been proposed for the optimization problem described above, with considerable applications to EDP. Known approaches for different networks usually rely on the distributed consensus protocol (local calculation and local information exchange), with extensions that handle delays and asynchronous scenarios [11–25]. For the smart grid with generator constraints, Zhang and Chow [11] first proposed an incremental cost consensus algorithm, which fed the collective mismatch between demand and supply back into the algorithm, allowing the incremental cost to converge to the optimal value. However, to ensure energy balance, a leader was defined in [11], which prevented the algorithm from operating in a fully distributed manner. To remove the dependency on the leader, a two-level incremental cost consensus algorithm was proposed in [12]. Then, Kar and Hug [13] proposed a distributed algorithm based on the consensus+innovations framework, which avoided the requirement of a leader. Meanwhile, extensions to various real-world factors and techniques, including vehicle-to-grid [14], event-triggered communication [15], delays [16], transmission losses [17], and security [18], have been extensively considered in the study of EDP. It is noteworthy that all of these works involved only undirected networks [11–25].
Distributed optimization for solving EDP over directed networks was recently studied in [26], where the (sub)gradient-push (SP) method was employed to eliminate the requirement of network balancing, i.e., it worked with column-stochastic matrices. Since SP was based on (sub)gradient descent with a diminishing step-size, it exhibited a slow sublinear convergence rate. To accelerate convergence, Li et al. [27] proposed a linearly convergent distributed algorithm with a constant step-size to solve the EDP by incorporating the push-sum strategy into the distributed inexact gradient tracking method. Then, Lü et al. [28] extended the work of [27] to non-uniform step-sizes and showed linear convergence. In addition, a different class of approaches for EDP that did not utilize the push-sum strategy was recently proposed in [29], where a row-stochastic and a column-stochastic matrix were adopted simultaneously to acquire linear convergence over directed networks. It is noteworthy that although these approaches [26–29] were appropriate for directed networks, they all required each generator to know (at least) its own out-degree exactly, so that all the generators in the networks [26–29] could adjust their outgoing weights and ensure that each column of the weight matrix sums to one. This requirement, however, is likely to be unrealistic in the practical operation of smart grids.
In this chapter, the algorithm that we present mainly depends on the distributed gradient tracking method and is a variation of the methods that appeared in [30] for EDP and in [31, 32, 35] for classic convex optimization problems. To be
specific, Li et al. [30] developed a distributed primal–dual augmented (sub)gradient algorithm to solve the EDP over a directed network. The algorithm in [30] employed a row-stochastic matrix and non-uniform step-sizes and still converged linearly to the optimal solution for smooth and strongly convex cost functions. However, the algorithm in [30] did not adopt momentum terms [31–33], with which generators acquire more information from in-neighbors in the network for faster convergence. Among the methods with momentum terms, Qu and Li [31] adopted Nesterov momentum and investigated two accelerated distributed Nesterov algorithms, which exhibited faster convergence rates than the centralized gradient descent (CGD) method for different cost functions. Note that, although the convergence rate was improved, the two algorithms in [31] were only suitable for undirected networks, which limited the applicability of the methods in smart grids. To overcome this deficiency, Xin et al. [32] established a generalization and acceleration of first-order methods with heavy-ball momentum, i.e., ABm, which removed the conservatism (doubly-stochastic weights or eigenvector estimation) of related works by implementing both row- and column-stochastic weights. In this setting, some interesting generalized methods, [34] (random link failures) and [29] (delays), were proposed. Unfortunately, the construction of column-stochastic weights required out-degree information, which is difficult to obtain, for example, in broadcast-based or gossip scenarios. On the other hand, the works [31–34] did not consider the distributed constrained optimization problem, which is an indispensable issue when studying EDP in smart grids. The related work [35] did not consider the distributed constrained optimization problem or non-uniform step-sizes, and it also lacked a rigorous theoretical analysis of the algorithm.
Hence, it is of great significance to develop effective distributed algorithms to deal with the more practical EDP in smart grids.
The main interest of this chapter is to study the EDP over a directed network. To solve this problem, a linearly convergent algorithm is constructed, which utilizes non-uniform step-sizes, two types of momentum terms, and a row-stochastic matrix. We hope to develop a broad theory of distributed convex optimization, and the underlying purpose of designing a distributed algorithm is to adapt to and promote the practical operation of smart grids. To conclude, the key contributions of the presented work can be summarized in the following four aspects:
(i) We propose a novel directed distributed Lagrangian momentum algorithm, named D-DLM, with a row-stochastic matrix to solve the EDP over a directed network. In contrast to [31] (doubly-stochastic matrix) and [26–29] (column-stochastic matrix) for the EDP, D-DLM with a row-stochastic matrix is relatively easy to implement in a distributed manner. Specifically, the implementation of D-DLM is straightforward if each generator can privately regulate the weights on information acquired from in-neighbors. This is more appropriate for the practical operation of smart grids.
(ii) For the updates of Lagrangian multipliers, D-DLM extends the centralized Nesterov gradient descent method (CNGD) [36] to a distributed form and is
appropriate for the distributed constrained optimization problem over directed networks, in contrast to the work in [31]. In particular, D-DLM extends the distributed gradient tracking method with two types of momentum terms to ensure that generators acquire more information from in-neighbors in the network than in the existing methods [29, 30], for faster convergence. More importantly, a consensus iteration step is exploited in the design of D-DLM to counteract the effect of the unbalancedness induced by the directed networks.
(iii) D-DLM utilizes non-uniform step-sizes, which allow a wider range of step-size selection than most of the existing methods, e.g., those investigated in [27, 29]. More importantly, provided that the non-uniform step-sizes and the momentum coefficients are constrained by some specific upper bounds, D-DLM converges linearly to the optimal dispatch if the cost functions are smooth and strongly convex. In addition, in comparison with [32] and [35], we also establish explicit estimates for the convergence rate of D-DLM.
(iv) The provided bounds on the largest step-size depend only on the cost functions and the network topology. This is superior to the earlier works on non-uniform step-sizes within the framework of gradient tracking [27, 28, 37, 38], whose bounds rely not only on the cost functions and the network topology but also on the heterogeneity of the step-sizes (there is a trade-off between the heterogeneity and the largest achievable step-size). More importantly, the bound on the non-uniform step-sizes in this chapter allows some (but not all) of the generators to use zero step-sizes.
6.2 Preliminaries
6.2.1 Notation
Unless otherwise stated, the vectors mentioned in this chapter are column vectors. We let the subscript $i$ denote the generator index and the superscript $t$ denote the time index; e.g., $x_i^t$ is generator $i$'s variable at time $t$. The sets of real numbers, $n$-dimensional real column vectors, and $n$-dimensional real square matrices are represented as $\mathbb{R}$, $\mathbb{R}^n$, and $\mathbb{R}^{n \times n}$, respectively. The symbol $z_{ij}$ denotes the entry of matrix $Z$ in its $i$-th row and $j$-th column, and $I_n$ denotes the identity matrix of size $n$. Given a vector $y = [y_1, y_2, \ldots, y_n]^T$, $Z = \mathrm{diag}\{y\}$ represents the diagonal matrix that satisfies $z_{ii} = y_i$, $\forall i = 1, \ldots, n$, and $z_{ij} = 0$, $\forall i \neq j$. The diagonal matrix consisting of the corresponding diagonal elements of matrix $Z$ is represented as $\mathrm{diag}\{Z\}$. The column vectors with all entries equal to one and to zero are denoted as $1_n$ and $0_n$, respectively. The 2-norm of vectors and matrices is denoted as $\|\cdot\|_2$. The symbols $z^T$ and $W^T$ are the transposes of a vector $z$ and a matrix $W$, respectively. Given two vectors $v = [v_1, \ldots, v_n]^T$ and $u = [u_1, \ldots, u_n]^T$, the notation $v \preceq u$ implies that $v_i \le u_i$ for any $i \in \{1, \ldots, n\}$. The vector $\nabla f(z) : \mathbb{R}^p \to \mathbb{R}^p$ denotes the gradient of a differentiable function $f$ at $z$.
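Define $e_i = [0, \ldots, 1_i, \ldots, 0]^T$, i.e., the vector whose $i$-th entry is one and whose other entries are zero. A non-negative square matrix $Z \in \mathbb{R}^{n \times n}$ is row-stochastic if $Z 1_n = 1_n$.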
6.2.2 Model of Optimization Problem
Consider a set of $n$ generators connected over a smart grid. The global objective of EDP is to find the optimal allocation that meets the expected load demands while respecting the limits on generator capacity, thereby minimizing the total generation cost, i.e.,
$$\min_{x \in \mathbb{R}^n} C(x) = \sum_{i=1}^{n} C_i(x_i), \quad \text{s.t.} \quad \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} d_i, \quad x_i^{\min} \le x_i \le x_i^{\max}, \quad \forall i = 1, \ldots, n, \tag{6.1}$$
where $C_i : \mathbb{R} \to \mathbb{R}$ is the local cost function known privately by generator $i$. Denote by $x_i \in \mathbb{R}$ the power generation of generator $i$ and $x = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^n$. For each generator $i$, the minimum and maximum capacities of power generation are indicated by $x_i^{\min}$ and $x_i^{\max}$, respectively. The set of capacity limits of generator $i$ is defined by $\mathcal{X}_i = \{x^{\dagger} \in \mathbb{R} \mid x_i^{\min} \le x^{\dagger} \le x_i^{\max}\}$ for $i = 1, \ldots, n$, and their Cartesian product is denoted as $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$. To guarantee the feasibility of problem (6.1), the total expected power demand satisfies $\sum_{i=1}^{n} x_i^{\min} \le \sum_{i=1}^{n} d_i \le \sum_{i=1}^{n} x_i^{\max}$. We denote by $x^* = [x_1^*, x_2^*, \ldots, x_n^*]^T \in \mathbb{R}^n$ the optimal solution to problem (6.1). Then, the following assumptions are made in the sequel.
Assumption 6.1 (Strong Convexity [30]) Each local cost function $C_i$, $i = 1, \ldots, n$, is $\mu_i$-strongly convex. Mathematically, there exists $\mu_i > 0$ such that for any $x, y \in \mathbb{R}$,
$$\theta C_i(x) + (1 - \theta) C_i(y) \ge C_i(\theta x + (1 - \theta) y) + \frac{\mu_i}{2} \theta (1 - \theta) (x - y)^2,$$
where $\theta \in [0, 1]$ is an arbitrary constant.
Assumption 6.2 (Smoothness [30]) Each local cost function $C_i$, $i = 1, \ldots, n$, is continuously differentiable and has a Lipschitz-continuous gradient. Mathematically, there exists $l_i > 0$ such that for any $x, y \in \mathbb{R}$, $|\nabla C_i(x) - \nabla C_i(y)| \le l_i |x - y|$.
Due to Assumption 6.1, problem (6.1) has a unique optimal solution. It is worth emphasizing that Assumptions 6.1 and 6.2 are two standard assumptions to achieve linear convergence when employing first-order methods [27–35].
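To make problem (6.1) and Assumptions 6.1 and 6.2 concrete, the following sketch builds a small EDP instance and checks its feasibility condition. The quadratic cost form $C_i(x) = a_i x^2 + b_i x$ (for which $\mu_i = l_i = 2a_i$), the class name `Generator`, the helper names, and all numerical values are assumptions chosen purely for illustration; they are not taken from this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Generator:
    """One generator with a hypothetical quadratic cost C(x) = a*x^2 + b*x."""
    a: float      # curvature coefficient, a > 0 (illustrative value)
    b: float      # linear coefficient (illustrative value)
    x_min: float  # minimum generation capacity
    x_max: float  # maximum generation capacity
    d: float      # local load demand

    def cost(self, x: float) -> float:
        return self.a * x * x + self.b * x

    def grad(self, x: float) -> float:
        return 2.0 * self.a * x + self.b

    @property
    def mu(self) -> float:
        # Strong-convexity modulus of a quadratic: its second derivative 2a.
        return 2.0 * self.a

    @property
    def lipschitz(self) -> float:
        # Lipschitz constant of the gradient; for a quadratic this is also 2a.
        return 2.0 * self.a


def is_feasible(gens) -> bool:
    """Check sum(x_min) <= sum(d) <= sum(x_max), the feasibility condition of (6.1)."""
    total_demand = sum(g.d for g in gens)
    return sum(g.x_min for g in gens) <= total_demand <= sum(g.x_max for g in gens)


def total_cost(gens, x) -> float:
    """Global objective C(x) = sum_i C_i(x_i)."""
    return sum(g.cost(xi) for g, xi in zip(gens, x))


# Hypothetical four-generator instance.
GENS = [
    Generator(a=0.5, b=2.0, x_min=0.0, x_max=30.0, d=10.0),
    Generator(a=0.4, b=3.0, x_min=0.0, x_max=30.0, d=12.0),
    Generator(a=0.6, b=4.0, x_min=0.0, x_max=30.0, d=8.0),
    Generator(a=0.5, b=2.5, x_min=0.0, x_max=30.0, d=10.0),
]
```

For instance, `is_feasible(GENS)` holds here because the total demand of 40 lies between the total minimum capacity 0 and the total maximum capacity 120.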
6.2.3 Communication Network
In this chapter, we consider a group of $n$ generators communicating over a directed network $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ consisting of the vertex (generator) set $\mathcal{V} = \{1, \ldots, n\}$ and the edge set $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ of ordered pairs of vertices. If $(i, j) \in \mathcal{E}$, generator $i$ can directly transmit data to generator $j$, where $i$ is viewed as an in-neighbor of $j$ and, conversely, $j$ is regarded as an out-neighbor of $i$. Let $\mathcal{N}_i^{\mathrm{in}} = \{j \in \mathcal{V} \mid (j, i) \in \mathcal{E}\}$ and $\mathcal{N}_i^{\mathrm{out}} = \{j \in \mathcal{V} \mid (i, j) \in \mathcal{E}\}$ be the in-neighbor and out-neighbor sets of $i$, respectively. If $|\mathcal{N}_i^{\mathrm{in}}| \neq |\mathcal{N}_i^{\mathrm{out}}|$ for some generator $i$, the network is said to be unbalanced, where $|\cdot|$ denotes the cardinality of a set. For the directed network $\mathcal{G}$, a path of length $b$ from generator $i_1$ to generator $i_{b+1}$ is a sequence of $b + 1$ distinct generators $i_1, \ldots, i_{b+1}$ such that $(i_k, i_{k+1}) \in \mathcal{E}$ for $k = 1, \ldots, b$. If there exists a directed path from any generator to any other generator, $\mathcal{G}$ is strongly connected. Then, we consider the following assumption.
Assumption 6.3 ([32]) The network $\mathcal{G}$ corresponding to the set of generators is directed and strongly connected.
Assumption 6.3 forces each generator in the network to directly or indirectly have an impact on the others. In comparison with the uniformly and strongly connected assumption [27, 28, 39], Assumption 6.3, although more restrictive, is still relatively common [30–35].
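The graph notions above can be checked mechanically. The following is a minimal sketch, assuming generators are indexed $0, \ldots, n-1$ (rather than $1, \ldots, n$ as in the text) and that an edge `(i, j)` means generator `i` can send data to generator `j`; the function names and the example edge list are hypothetical.

```python
from collections import deque


def neighbor_sets(n, edges):
    """Return (in_nb, out_nb): the in- and out-neighbor sets of every generator."""
    in_nb = {i: set() for i in range(n)}
    out_nb = {i: set() for i in range(n)}
    for i, j in edges:          # (i, j): i transmits to j
        out_nb[i].add(j)        # j is an out-neighbor of i
        in_nb[j].add(i)         # i is an in-neighbor of j
    return in_nb, out_nb


def reachable(start, adjacency):
    """Breadth-first search: all vertices reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def is_strongly_connected(n, edges):
    """Vertex 0 must reach every vertex and be reached from every vertex."""
    in_nb, out_nb = neighbor_sets(n, edges)
    return len(reachable(0, out_nb)) == n and len(reachable(0, in_nb)) == n


def is_unbalanced(n, edges):
    """True if some generator has different in-degree and out-degree."""
    in_nb, out_nb = neighbor_sets(n, edges)
    return any(len(in_nb[i]) != len(out_nb[i]) for i in range(n))


# Hypothetical network: a directed ring 0 -> 1 -> 2 -> 3 -> 0 plus the edge 0 -> 2.
N = 4
EDGES = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
```

In this example the network satisfies Assumption 6.3 and is unbalanced, since generator 0 has one in-neighbor but two out-neighbors.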
6.2.4 Centralized Lagrangian Method
First, we construct the dual problem of (6.1). To this end, we introduce the Lagrangian function $L : \mathcal{X} \times \mathbb{R} \to \mathbb{R}$ defined as
$$L(x, y) = \sum_{i=1}^{n} C_i(x_i) - y \sum_{i=1}^{n} (x_i - d_i), \tag{6.2}$$
where $y$ is the Lagrangian multiplier associated with the equality constraint in (6.1). For a convex function $C_i$, $i \in \mathcal{V}$, the conjugate function $C_i^{\perp}$ of $C_i$ is given by
$$C_i^{\perp}(y) = \sup_{x_i \in \mathcal{X}_i} \left( y x_i - C_i(x_i) \right) \tag{6.3}$$
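for all $y \in \mathbb{R}$. With the above, the dual problem of (6.1) is described as
$$\max_{y \in \mathbb{R}} \Phi(y) = \sum_{i=1}^{n} \Phi_i(y), \tag{6.4}$$
where the dual function $\Phi_i(y)$ for each $i \in \mathcal{V}$ is given by
$$\Phi_i(y) = \psi_i(y) + y d_i, \tag{6.5}$$
where
$$\psi_i(y) = -C_i^{\perp}(y) = \min_{x_i \in \mathcal{X}_i} \left( C_i(x_i) - y x_i \right). \tag{6.6}$$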
By Assumption 6.1, it can be demonstrated that strong duality between (6.1) and (6.4) holds [40], i.e., there exists at least one dual optimal solution $y^*$ to (6.4) such that $C(x^*) = \Phi(y^*)$, where $x^*$ is the primal optimal solution to (6.1); in particular, the set of dual optimal solutions to (6.4) is nonempty [40]. Note from Assumption 6.1 that for each $i \in \mathcal{V}$ and any given $y \in \mathbb{R}$, (6.6) has a unique minimizer, given by
$$\tilde{x}_i(y) = \begin{cases} x_i^{\max}, & \text{if } \nabla C_i^{-1}(y) \ge x_i^{\max}, \\ \nabla C_i^{-1}(y), & \text{if } x_i^{\min} < \nabla C_i^{-1}(y) < x_i^{\max}, \\ x_i^{\min}, & \text{if } \nabla C_i^{-1}(y) \le x_i^{\min}, \end{cases} \tag{6.7}$$
where $\nabla C_i^{-1}$ is the inverse function of the gradient $\nabla C_i$, and according to Assumption 6.2, $\nabla C_i^{-1}(y)$ exists in the interval $[x_i^{\min}, x_i^{\max}]$. Since the set of dual optimal solutions to (6.4) is nonempty, the (unique) primal optimal solution to (6.1) satisfies $x_i^* = \tilde{x}_i(y^*)$ for each $i \in \mathcal{V}$, where $y^*$ is any dual optimal solution. Then, for any given $y \in \mathbb{R}$, the dual function $\Phi(y)$ is differentiable at $y$ [26] (because of the uniqueness of $\tilde{x}_i(y)$) and its gradient is
$$\nabla \Phi(y) = -\sum_{i=1}^{n} \left( \tilde{x}_i(y) - d_i \right), \tag{6.8}$$
and further $\nabla \Phi_i(y) = -(\tilde{x}_i(y) - d_i)$, $\forall i \in \mathcal{V}$. Thus, the dual problem (6.4) can be solved by the standard gradient ascent method as follows:
$$y^{t+1} = y^t - \tilde{\alpha} \sum_{i=1}^{n} \left( \tilde{x}_i(y^t) - d_i \right), \tag{6.9}$$
where $\tilde{\alpha} > 0$ is an appropriately selected step-size. It has been proven that method (6.9) converges to the dual optimal solution, i.e., $y^t$ converges to $y^*$, under some
minor assumptions on the cost functions and for an arbitrary $y^0 \in \mathbb{R}$, and then $x^t$ converges to $x^*$ [40]. However, it is worth noting that the update of $y$ in (6.9) requires information from every generator, which prevents a distributed implementation of the method. To eliminate this deficiency, each generator must update $y$ by interacting only with its in-neighbors. Hence, we require an approach that not only calculates or approximates $\sum_{i=1}^{n} (\tilde{x}_i(y^t) - d_i)$ in (6.9) in a distributed manner but is also suitable for the directed networks considered in this chapter. This is the main motivation for D-DLM presented in the next section.
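The centralized iteration (6.9), together with the clipped minimizer (6.7), can be sketched in a few lines. This is only an illustration under stated assumptions: the costs are taken to be quadratic, $C_i(x) = a_i x^2 + b_i x$, so that $\nabla C_i^{-1}(y) = (y - b_i)/(2a_i)$; the function names, the step-size 0.2, and all data values are hypothetical.

```python
def x_tilde(y, a, b, x_min, x_max):
    """Minimizer (6.7) for the assumed quadratic cost C(x) = a*x^2 + b*x."""
    unconstrained = (y - b) / (2.0 * a)   # inverse of the gradient 2*a*x + b
    return min(max(unconstrained, x_min), x_max)


def dual_gradient_ascent(a, b, x_min, x_max, d, step, iters, y0=0.0):
    """Iterate (6.9): y <- y - step * sum_i (x_tilde_i(y) - d_i)."""
    n = len(d)
    y = y0
    for _ in range(iters):
        mismatch = sum(
            x_tilde(y, a[i], b[i], x_min[i], x_max[i]) - d[i] for i in range(n)
        )
        y = y - step * mismatch
    x = [x_tilde(y, a[i], b[i], x_min[i], x_max[i]) for i in range(n)]
    return y, x


# Hypothetical four-generator data (not from the chapter).
A = [0.5, 0.4, 0.6, 0.5]
B = [2.0, 3.0, 4.0, 2.5]
X_MIN = [0.0, 0.0, 0.0, 0.0]
X_MAX = [30.0, 30.0, 30.0, 30.0]
D = [10.0, 12.0, 8.0, 10.0]

Y_STAR, X_STAR = dual_gradient_ascent(A, B, X_MIN, X_MAX, D, step=0.2, iters=200)
```

With these numbers no capacity bound is active at the optimum, so the multiplier solves $\sum_i (y - b_i)/(2a_i) = \sum_i d_i$, which gives $y^* = 619/49$. Note that every iteration sums the mismatch over all generators, which is exactly the global information that a distributed method has to avoid.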
6.3 Algorithm Development
In this section, D-DLM, which is capable of solving EDP in a distributed manner over a directed network, is presented. First, D-DLM is constructed for the EDP by integrating the distributed gradient tracking method with two types of momentum terms (for the Lagrangian multiplier). Then, two distributed optimization methods [28, 30] that are suitable for directed networks and related to D-DLM are discussed with an intuitive explanation.
6.3.1 Directed Distributed Lagrangian Momentum Algorithm
As discussed in the previous section, the centralized Lagrangian method requires global information from all generators to update the Lagrangian multiplier. To overcome this deficiency, a distributed algorithm is needed for each generator to estimate this quantity. Specifically, we devote ourselves to the study of a novel directed distributed Lagrangian momentum algorithm (D-DLM), which is not only suitable for a directed network but also converges linearly and exactly to the optimal solution to (6.1). To the best of our knowledge, such an algorithm has not yet been studied and is worth investigating. First, the dual problem (6.4) can be converted to the following minimization form:
$$\min_{y \in \mathbb{R}} q(y) = \sum_{i=1}^{n} q_i(y), \tag{6.10}$$
where $q_i(y) = -\Phi_i(y) = C_i^{\perp}(y) - y d_i$. It is worth highlighting that problem (6.10) shares the same set of dual optimal solutions with problem (6.4), and it also has a formulation similar to the distributed convex problems in [31, 32, 37]. Based on (6.10), we now describe D-DLM to deal with problem (6.1) in a distributed manner. In D-DLM, each generator $i \in \mathcal{V}$ at time $t \ge 0$ stores five variables: $x_i^t \in \mathbb{R}$, $y_i^t \in \mathbb{R}$, $h_i^t \in \mathbb{R}$, $s_i^t \in \mathbb{R}^n$, and $z_i^t \in \mathbb{R}$. For $t \ge 0$, generator $i \in \mathcal{V}$ updates its variables as
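follows:
$$\begin{cases} x_i^{t+1} = \min\{\max\{\nabla C_i^{-1}(y_i^t), x_i^{\min}\}, x_i^{\max}\}, \\ h_i^{t+1} = \sum_{j=1}^{n} r_{ij} y_j^t + \beta_i^t (h_i^t - h_i^{t-1}) - \alpha_i z_i^t, \\ y_i^{t+1} = h_i^{t+1} + \beta_i^{t+1} (h_i^{t+1} - h_i^t), \\ s_i^{t+1} = \sum_{j=1}^{n} r_{ij} s_j^t, \\ z_i^{t+1} = \sum_{j=1}^{n} r_{ij} z_j^t + \dfrac{x_i^{t+1} - d_i}{[s_i^{t+1}]_i} - \dfrac{x_i^t - d_i}{[s_i^t]_i}, \end{cases} \tag{6.11}$$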
where αi > 0 refers to the constant step-size (non-uniform) locally chosen at each generator i, and βit is the momentum (heavy-ball momentum and Nesterov momentum) coefficient (non-uniform) at time t ≥ 0. The symbol [sti ]i denotes the i-th entry of vector sti . The weights, rij , i, j ∈ V, associated with the network G obey the following conditions:
rij = and rii = 1 −
⎧ n ⎨ > , j ∈ Niin , ⎩
j ∈Niin rij
0, otherwise,
rij = 1, ∀i ∈ V,
(6.12)
j =1
> , ∀i ∈ V, where 0 < < 1. Each generator i ∈ V
starts with many initial states xi0 ∈ Xi , yi0 = h0i ∈ R, s0i = ei , and zi0 = xi0 − di .1 Denote by R = [rij ] ∈ Rn×n the collection of weights rij , i, j ∈ V, in (6.11), which is obviously row-stochastic. In essence, (xit − di ) in the update of zit in (6.11) is the gradient of the function qi (y) at y = yit , i.e., ∇qi (yit ) = xit − di , i ∈ V. Furthermore, the update of zit in (6.11) is a distributedly inexact gradient tracking step, where each function’s gradient is scaled by [sti ]i , which is generated by the update of sit in (6.11). Actually, the update of zit in (6.11) a consensus iteration step aiming to overcome the unbalancedness brought about the left normalized Perron eigenvector w = [w1 , . . . , wn ]T , corresponding to the eigenvalue 1, of the weight matrix R, i.e., the left eigenvector w satisfying 1Tn w = 1. This iteration resembles those employed in [30, 41] and [42]. To sum up, D-DLM (6.11) transforms the centralized method (6.9) into the distributed ones via gradient tracking method and can be applied to a directed network. Define x t = [x1t , . . . , xnt ]T ∈ Rn , y t = [y1t , . . . , ynt ]T ∈ Rn , ht = [ht1 , . . . , htn ]T ∈ Rn , zt = [z1t , . . . , znt ]T ∈ Rn , S t = [st1 , . . . , stn ]T ∈ Rn×n , S˜ t = diag{S t }, and d = [d1, . . . , dn ]T ∈ Rn . Therefore, D-DLM (6.11) can be rewritten in the following aggregated form:
¹ Suppose that each generator has, and knows, its unique identifier in the network, e.g., 1, ..., n [27–35].
\[
\begin{cases}
x_i^{t+1} = \min\{\max\{\nabla C_i^{-1}(y_i^t),\, x_i^{\min}\},\, x_i^{\max}\} \\
h^{t+1} = R y^t + D_\beta^t (h^t - h^{t-1}) - D_\alpha z^t \\
y^{t+1} = h^{t+1} + D_\beta^{t+1} (h^{t+1} - h^t) \\
S^{t+1} = R S^t \\
z^{t+1} = R z^t + [\tilde{S}^{t+1}]^{-1} (x^{t+1} - d) - [\tilde{S}^t]^{-1} (x^t - d),
\end{cases}
\tag{6.13}
\]
where $D_\alpha = \mathrm{diag}\{\alpha\} \in \mathbb{R}^{n \times n}$ and $D_\beta^t = \mathrm{diag}\{\beta^t\} \in \mathbb{R}^{n \times n}$, $t \ge 0$, with $\alpha = [\alpha_1, \dots, \alpha_n]^T$ and $\beta^t = [\beta_1^t, \dots, \beta_n^t]^T$. The initial states are $x^0 \in X$, $y^0 = h^0 \in \mathbb{R}^n$, $S^0 = I_n$, and $z^0 = (x^0 - d) \in \mathbb{R}^n$. It is worth emphasizing that D-DLM (6.13) does not need the out-degree information of generators (only a row-stochastic matrix is adopted in D-DLM), which makes it easier to implement in a distributed manner.
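As a rough illustration of the updates in (6.11), the following Python sketch runs D-DLM for quadratic costs $C_i(x) = a_i x^2 + b_i x$, for which $\nabla C_i^{-1}(y) = (y - b_i)/(2 a_i)$. It is a minimal sketch under stated assumptions, not the implementation used for the simulations in this chapter: the function name `run_ddlm`, the uniform weights over in-neighbors plus a self-loop, the five-node directed ring, the equal split of a 380 MW demand, and the constant values $\alpha_i = 0.001$ and $\beta_i^t = 0.1$ are all illustrative choices. The cost coefficients are those of Table 6.1.

```python
def run_ddlm(a, b, x_min, x_max, d, in_neighbors, alpha, beta, num_iters):
    """Run num_iters iterations of the updates in (6.11); returns (x, y, z, s)."""
    n = len(a)
    # Row-stochastic weights: uniform over in-neighbors plus a self-loop (assumption).
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        members = set(in_neighbors[i]) | {i}
        for j in members:
            R[i][j] = 1.0 / len(members)

    def primal(i, price):
        # Inverse marginal cost of C_i(x) = a_i x^2 + b_i x, projected onto [x_min, x_max].
        return min(max((price - b[i]) / (2.0 * a[i]), x_min[i]), x_max[i])

    x = list(x_min)                                    # x^0 in X_i (illustrative choice)
    y = [0.0] * n                                      # y^0 = h^0
    h, h_prev = y[:], y[:]                             # convention h^0 - h^{-1} = 0
    s = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # s_i^0 = e_i
    z = [x[i] - d[i] for i in range(n)]                # z^0 = x^0 - d

    for _ in range(num_iters):
        x_new = [primal(i, y[i]) for i in range(n)]
        h_new = [sum(R[i][j] * y[j] for j in range(n))
                 + beta[i] * (h[i] - h_prev[i]) - alpha[i] * z[i] for i in range(n)]
        y_new = [h_new[i] + beta[i] * (h_new[i] - h[i]) for i in range(n)]
        s_new = [[sum(R[i][k] * s[k][j] for k in range(n)) for j in range(n)]
                 for i in range(n)]
        z_new = [sum(R[i][j] * z[j] for j in range(n))
                 + (x_new[i] - d[i]) / s_new[i][i] - (x[i] - d[i]) / s[i][i]
                 for i in range(n)]
        x, h_prev, h, y, s, z = x_new, h, h_new, y_new, s_new, z_new
    return x, y, z, s


# Cost data from Table 6.1; demand split, graph, and step-sizes are hypothetical.
a = [0.04, 0.03, 0.035, 0.03, 0.04]
b = [2.0, 3.0, 4.0, 4.0, 2.5]
x_min = [0.0] * 5
x_max = [80.0, 90.0, 70.0, 70.0, 80.0]
d = [76.0] * 5                                 # 380 MW split equally (assumption)
ring = [[(i - 1) % 5] for i in range(5)]       # node i listens to node i - 1
x, y, z, s = run_ddlm(a, b, x_min, x_max, d, ring, [0.001] * 5, [0.1] * 5, 2000)
```

Each node only uses its own row of $R$ and the quantities received from its in-neighbors, which is why no out-degree information appears anywhere in the loop. With these illustrative values the multipliers are expected to agree on a common incremental cost and the total generation to match the total demand; larger step-sizes or momentum coefficients may require retuning.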
6.3.2 Related Methods
Relation to Method with Column-Stochastic Weights [28] Under Assumptions 6.1–6.3, DPD-PD proposed in [28] converged linearly to the optimal solution over a directed network using non-uniform step-sizes. However, DPD-PD applied the push-sum strategy (column-stochastic weights) to overcome the unbalancedness induced by directed networks, which may be infeasible in a distributed implementation because it requires each generator to know (at least) its out-degree. We emphasize that row-stochastic weights are easier to obtain in a distributed setting: the implementation is straightforward if each generator can privately set the weights on the information acquired from its in-neighbors.
Relation to Method with Row-Stochastic Weights [30] The distributed primal–dual method proposed in [30] served as a basis for the development of D-DLM (6.11). The method in [30] utilized row-stochastic weights with non-uniform step-sizes among the generators and converged at a linear rate to the optimal solution over a directed network under Assumptions 6.1–6.3. Notice that D-DLM (6.11) combines the method in [30] with two kinds of momentum terms (heavy-ball momentum and Nesterov momentum), which ensures that generators acquire more information from in-neighbors in the network than in the method proposed in [30], for faster convergence.
6.4 Convergence Analysis
In this section, the convergence properties of D-DLM (6.11) are rigorously analyzed. Before showing the main results, some auxiliary results (borrowed from the literature) are introduced for completeness.
6.4.1 Auxiliary Results
First, the following lemma shows that $C_i^{\perp}$, $i \in V$, is strongly convex and smooth [22, 27, 28, 30].
Lemma 6.1 Suppose that Assumptions 6.1 and 6.2 hold. Then, for each $i \in V$, $C_i^{\perp}$ is strongly convex with constant $\vartheta_i$ and Lipschitz differentiable with constant $\varrho_i$, where $\vartheta_i = 1/l_i$ and $\varrho_i = 1/\mu_i$.
By Lemma 6.1, the global function $(-\Phi)$ is strongly convex with parameter $\vartheta = \sum_{i=1}^{n} \vartheta_i$ and has Lipschitz continuous gradient with parameter $\varrho = \sum_{i=1}^{n} \varrho_i$. In addition, we define $\hat{\varrho} = \max_{i \in V}\{\varrho_i\}$. Considering the sequence $\{\tilde{v}^t\}_{t=0}^{\infty}$ and $\gamma \in (0, 1)$, for any positive integer $T > 0$ and norm $\|\cdot\|_c$ (in this chapter, the norm $\|\cdot\|_c$ may be the 2-norm or a particular norm), let us further define
\[
\|\tilde{v}\|_c^{\gamma,T} = \sup_{t = 0, 1, \dots, T} \{\|\tilde{v}^t\|_c / (\gamma)^t\}
\quad \text{and} \quad
\|\tilde{v}\|_c^{\gamma} = \sup_{t \ge 0} \{\|\tilde{v}^t\|_c / (\gamma)^t\}.
\]
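To make the weighted norm concrete, the short Python sketch below evaluates $\sup_{t \le T} \|\tilde{v}^t\|_c / (\gamma)^t$ for a finite list of norms. The helper name `weighted_sup_norm` and the geometric test sequence are hypothetical; the point is only that the weighted norm stays bounded as $T$ grows exactly when the sequence decays at least as fast as $(\gamma)^t$.

```python
def weighted_sup_norm(norms, gamma):
    """Return max over t = 0..T of norms[t] / gamma**t, where norms[t] = ||v^t||."""
    return max(value / gamma ** t for t, value in enumerate(norms))


# Hypothetical sequence decaying like 3 * 0.5^t, for t = 0, ..., 20.
norms = [3.0 * 0.5 ** t for t in range(21)]

bounded = weighted_sup_norm(norms, 0.6)   # 0.5 < 0.6: the ratio is largest at t = 0
growing = weighted_sup_norm(norms, 0.4)   # 0.5 > 0.4: the ratio keeps growing with T
```

A finite bound on the weighted norm for all $T$ is therefore another way of saying that the sequence converges linearly at rate $O((\gamma)^t)$.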
Then, the following additional lemma from the generalized small gain theorem [43] is presented.
Lemma 6.2 (Generalized Small Gain Theorem) Suppose that the non-negative vector sequences $\{\tilde{v}_i^t\}_{t=0}^{\infty}$, $i = 1, \dots, m$, a non-negative matrix $\tilde{\Gamma} \in \mathbb{R}^{m \times m}$, a vector $\tilde{u} \in \mathbb{R}^m$, and $\gamma \in (0, 1)$ satisfy, for all $T > 0$,
\[
\tilde{v}^{\gamma,T} \preceq \tilde{\Gamma} \tilde{v}^{\gamma,T} + \tilde{u},
\tag{6.14}
\]
where $\tilde{v}^{\gamma,T} = [\|\tilde{v}_1\|_c^{\gamma,T}, \dots, \|\tilde{v}_m\|_c^{\gamma,T}]^T$. If $\rho(\tilde{\Gamma}) < 1$, then $\|\tilde{v}_i\|_c^{\gamma} < B$, where $B < +\infty$ and $\rho(\tilde{\Gamma})$ is the spectral radius of the matrix $\tilde{\Gamma}$. Hence, each $\|\tilde{v}_i^t\|_c$, $i \in \{1, \dots, m\}$, converges to zero at the linear rate $O((\gamma)^t)$.
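The condition $\rho(\tilde{\Gamma}) < 1$ can be certified without computing eigenvalues: for a non-negative matrix and any positive vector $\eta$, the spectral radius is at most $\max_i (\tilde{\Gamma} \eta)_i / \eta_i$, so $\tilde{\Gamma} \eta < \eta$ suffices. The Python sketch below checks this on a small example; the $2 \times 2$ matrix, the vector `eta`, and the helper names are hypothetical and are not taken from this chapter.

```python
def row_ratio_bound(G, eta):
    """max_i (G eta)_i / eta_i: an upper bound on rho(G) for non-negative G, positive eta."""
    n = len(G)
    return max(sum(G[i][j] * eta[j] for j in range(n)) / eta[i] for i in range(n))


def spectral_radius(G, iters=200):
    """Power-iteration estimate of rho(G) for a non-negative primitive matrix G."""
    n = len(G)
    v = [1.0 + i for i in range(n)]          # positive starting vector
    rho = 0.0
    for _ in range(iters):
        w = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
        rho = max(w) / max(v)                # growth factor of the dominant direction
        v = [wi / max(w) for wi in w]        # renormalize so that max(v) = 1
    return rho


G = [[0.2, 0.3], [0.4, 0.1]]                 # hypothetical gain matrix, eigenvalues 0.5 and -0.2
eta = [1.0, 1.2]
bound = row_ratio_bound(G, eta)              # certificate: G eta < eta componentwise
rho = spectral_radius(G)
```

Here the certificate gives a bound of 0.56, while the power iteration approaches the true spectral radius 0.5, so the small-gain condition holds with margin.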
Recall that $R$ is irreducible and row-stochastic with positive diagonal entries. Under Assumption 6.3, there exists a normalized left Perron eigenvector $w = [w_1, \dots, w_n]^T \in \mathbb{R}^n$ ($w_i > 0$, $\forall i$) of $R$ such that
\[
\lim_{t \to \infty} (R)^t = (R)^{\infty} = 1_n w^T, \quad w^T R = w^T, \quad \text{and} \quad w^T 1_n = 1.
\]
Also, we define $S^{\infty} = \lim_{t \to \infty} S^t$ (we have $S^{\infty} = (R)^{\infty}$ due to $S^0 = I_n$), $\tilde{S}^{\infty} = \mathrm{diag}\{S^{\infty}\}$, $\hat{s} = \sup_{t \ge 0} \|S^t\|_2$, $\tilde{s} = \sup_{t \ge 0} \|[\tilde{S}^t]^{-1}\|_2$,² $\bar{y}^t = w^T y^t$, $\nabla Q(1_n \bar{y}^t) = [\nabla q_1(\bar{y}^t), \dots, \nabla q_n(\bar{y}^t)]^T$, $\nabla Q(y^t) = [\nabla q_1(y_1^t), \dots, \nabla q_n(y_n^t)]^T$, $\hat{\alpha} = \max_{i \in V}\{\alpha_i\}$, and $\hat{\beta} = \sup_{t \ge 0} \max_{i \in V}\{\beta_i^t\}$. Since $R$ is primitive and $S^0 = I_n$, the sequence $\{S^t\}$ is convergent [30, 41, 42], and therefore the diagonal elements of $S^t$ are positive and bounded for all $t \ge 0$. Thus, $\hat{s}$ and $\tilde{s}$ are two finite constants. In addition, we employ $\|\cdot\|$ to indicate either a particular matrix norm or a particular vector norm such that $\|Ry\| \le \|R\| \|y\|$ for all matrices $R$ and vectors $y$. Since all vector norms on a finite-dimensional vector space are equivalent, we have $\|\cdot\|_2 \le p_1 \|\cdot\|$ and $\|\cdot\| \le p_2 \|\cdot\|_2$, where $p_1$ and $p_2$ are some positive constants [45].
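The eigenvector-learning recursion $S^{t+1} = R S^t$ with $S^0 = I_n$ is easy to check numerically. The Python sketch below uses a hypothetical four-node strongly connected digraph with uniform weights over in-neighbors plus a self-loop (an assumption); every row of $S^t$ then approaches $w^T$, so node $i$ can read its own entry $w_i$ from $[s_i^t]_i$.

```python
def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


# Hypothetical digraph: in_neighbors[i] lists the nodes that send to node i.
in_neighbors = [[3], [0], [0, 1], [2]]
n = len(in_neighbors)

# Row-stochastic weights: uniform over in-neighbors plus a self-loop (assumption).
R = [[0.0] * n for _ in range(n)]
for i, senders in enumerate(in_neighbors):
    members = senders + [i]
    for j in members:
        R[i][j] = 1.0 / len(members)

S = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]   # S^0 = I_n
for _ in range(100):
    S = mat_mul(R, S)        # row i of S is s_i^{t+1} = sum_j r_ij s_j^t

w = S[0]                     # any row approximates the left Perron eigenvector w^T
```

For this graph the limit is $w = [4, 2, 3, 4]^T / 13$, which is not uniform; this imbalance is what the scaling by $[s_i^t]_i$ in the update of $z_i^t$ compensates for.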
6.4.2 Main Results
In this subsection, the linear convergence rate of D-DLM is established by employing Lemma 6.2. First, we cast four inequalities into a linear system of the form (6.14) and then investigate the spectral properties of the resulting coefficient matrix. To this aim, some essential notation is introduced to simplify the main results. Denote $v_1^t = y^t - (R)^{\infty} y^t$, $\forall t \ge 0$, $v_2^t = (R)^{\infty} y^t - 1_n y^*$, $\forall t \ge 0$, and $v_3^t = h^t - h^{t-1}$, $\forall t > 0$, with the convention that $v_3^0 = 0_n$, and $v_4^t = z^t - (R)^{\infty} z^t$, $\forall t \ge 0$. The first lemma provides a bound on $\|z^t\|_2$ that is needed for deriving the aforementioned linear system.
Lemma 6.3 Under Assumptions 6.1 and 6.2, the following inequality holds for all $t \ge 0$:
\[
\|z^t\|_2 \le n \hat{\varrho} p_1 \|v_1^t\| + n \hat{\varrho} \|v_2^t\|_2 + p_1 \|v_4^t\| + \hat{s} (\tilde{s})^2 \theta (\lambda)^t \|\nabla Q(y^t)\|_2,
\tag{6.15}
\]
where $0 < \theta < \infty$ and $0 < \lambda < 1$ are constants.
Proof Note that
\[
\|z^t\|_2 \le \|z^t - (R)^{\infty} z^t\|_2 + \|(R)^{\infty} z^t\|_2.
\tag{6.16}
\]
Recalling that $z^0 = x^0 - d$ and $\nabla q_i(y_i^t) = x_i^t - d_i$, $i \in V$, then, for all $t \ge 0$, we have $(R)^{\infty} z^t = (R)^{\infty} [\tilde{S}^t]^{-1} \nabla Q(y^t)$ (see reference [30] for a proof). Thus, using
2
Throughout the chapter, for any arbitrary matrix (scalar or variable) Z, we utilize the symbol (Z)t to represent the t-th power of Z to distinguish the iteration of variables.
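$S^{\infty} [\tilde{S}^{\infty}]^{-1} = 1_n 1_n^T$ and $(R)^{\infty} = S^{\infty}$, it follows that
\[
\begin{aligned}
\|(R)^{\infty} z^t\|_2 &\le \|S^{\infty} [\tilde{S}^t]^{-1} \nabla Q(y^t) - S^{\infty} [\tilde{S}^{\infty}]^{-1} \nabla Q(y^t)\|_2 \\
&\quad + \|S^{\infty} [\tilde{S}^{\infty}]^{-1} \nabla Q(y^t) - S^{\infty} [\tilde{S}^{\infty}]^{-1} \nabla Q(1_n y^*)\|_2 \\
&\le \hat{s} (\tilde{s})^2 \theta (\lambda)^t \|\nabla Q(y^t)\|_2 + n \hat{\varrho} \|y^t - 1_n y^*\|_2,
\end{aligned}
\tag{6.17}
\]
where $\hat{s} = \sup_{t \ge 0} \|S^t\|_2$, $\tilde{s} = \sup_{t \ge 0} \|[\tilde{S}^t]^{-1}\|_2$, and the last inequality follows from the fact that $\|[\tilde{S}^t]^{-1} - [\tilde{S}^{\infty}]^{-1}\|_2 \le \theta (\tilde{s})^2 (\lambda)^t$, where $0 < \theta < \infty$ and $0 < \lambda < 1$ are constants (see [41, 42] for more details). Then, one gets
\[
\|y^t - 1_n y^*\|_2 \le \|y^t - (R)^{\infty} y^t\|_2 + \|(R)^{\infty} y^t - 1_n y^*\|_2.
\tag{6.18}
\]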
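Substituting (6.17) and (6.18) into (6.16) yields the desired result in Lemma 6.3.
In what follows, the bound of the consensus violation $\|v_1\|^{\gamma,T}$ of the Lagrangian multiplier is provided.
Lemma 6.4 Suppose that Assumptions 6.1 and 6.2 hold. Then, for all $T > 0$, we have the following inequality:
\[
\|v_1\|^{\gamma,T} \le \frac{\hat{\alpha} \kappa_1 n \hat{\varrho}}{\gamma - \rho - \hat{\alpha} \kappa_1 n \hat{\varrho} p_1} \|v_2\|_2^{\gamma,T}
+ \frac{2 \hat{\beta} \kappa_1}{\gamma - \rho - \hat{\alpha} \kappa_1 n \hat{\varrho} p_1} \|v_3\|^{\gamma,T}
+ \frac{\hat{\alpha} \kappa_1 p_1}{\gamma - \rho - \hat{\alpha} \kappa_1 n \hat{\varrho} p_1} \|v_4\|^{\gamma,T} + u_1,
\tag{6.19}
\]
for all $\max\{\lambda, \rho + \hat{\alpha} \kappa_1 n \hat{\varrho} p_1\} < \gamma < 1$, where $0 < \rho < 1$ is a constant, $\kappa_1 = p_2 \|I_n - (R)^{\infty}\|$, and $u_1 = (\|v_1^0\| + \hat{\alpha} \kappa_1 \hat{s} (\tilde{s})^2 \theta \sup_{t = 0, \dots, T} \|\nabla Q(y^t)\|_2) / (\gamma - \rho - \hat{\alpha} \kappa_1 n \hat{\varrho} p_1)$.
Proof According to the updates of $h^t$ and $y^t$ of D-DLM (6.13), it holds that
\[
\begin{aligned}
\|v_1^{t+1}\| &\le \|(R - (R)^{\infty}) v_1^t\| + \|(I_n - (R)^{\infty}) D_\beta^{t+1} v_3^{t+1}\| \\
&\quad + \|(I_n - (R)^{\infty}) D_\beta^t v_3^t\| + \|(I_n - (R)^{\infty}) D_\alpha z^t\|,
\end{aligned}
\tag{6.20}
\]
where the inequality in (6.20) is obtained from the facts that $(R)^{\infty} R = (R)^{\infty}$ and $(R - (R)^{\infty})(I_n - (R)^{\infty}) = R - (R)^{\infty}$. Considering the weight matrix $R = [r_{ij}] \in \mathbb{R}^{n \times n}$ in (6.12), there are a norm $\|\cdot\|$ and a constant $0 < \rho < 1$ such that $\|Ry - (R)^{\infty} y\| \le \rho \|y - (R)^{\infty} y\|$ for all $y \in \mathbb{R}^n$ (see [46], Lemma 5.3). Thus, (6.20) further implies that
\[
\|v_1^{t+1}\| \le \rho \|v_1^t\| + \hat{\alpha} \kappa_1 \|z^t\|_2 + \hat{\beta} \kappa_1 \|v_3^{t+1}\| + \hat{\beta} \kappa_1 \|v_3^t\|,
\tag{6.21}
\]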
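where $\hat{\alpha} = \|D_\alpha\|_2$ and $\|D_\beta^t\|_2 \le \hat{\beta}$. By Lemma 6.3, we have that
\[
\begin{aligned}
\|v_1^{t+1}\| &\le (\rho + \hat{\alpha} \kappa_1 n \hat{\varrho} p_1) \|v_1^t\| + \hat{\alpha} \kappa_1 n \hat{\varrho} \|v_2^t\|_2 + \hat{\beta} \kappa_1 \|v_3^t\| \\
&\quad + \hat{\beta} \kappa_1 \|v_3^{t+1}\| + \hat{\alpha} \kappa_1 p_1 \|v_4^t\| + \hat{\alpha} \kappa_1 \hat{s} (\tilde{s})^2 \theta (\lambda)^t \|\nabla Q(y^t)\|_2.
\end{aligned}
\tag{6.22}
\]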
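From here, the procedure is similar to that in the proof of Lemma 8 in [30]. We include the proof for completeness. By multiplying both sides of (6.22) by $(\gamma)^{-(t+1)}$ and then taking the supremum over $t = 0, \dots, T - 1$, one has
\[
\begin{aligned}
\sup_{t = 0, \dots, T-1} \frac{\|v_1^{t+1}\|}{(\gamma)^{t+1}}
&\le \frac{\rho + \hat{\alpha} \kappa_1 n \hat{\varrho} p_1}{\gamma} \sup_{t = 0, \dots, T-1} \frac{\|v_1^t\|}{(\gamma)^t}
+ \frac{\hat{\alpha} \kappa_1 n \hat{\varrho}}{\gamma} \sup_{t = 0, \dots, T-1} \frac{\|v_2^t\|_2}{(\gamma)^t} \\
&\quad + \hat{\beta} \kappa_1 \sup_{t = 0, \dots, T-1} \frac{\|v_3^t\| + \|v_3^{t+1}\|}{(\gamma)^{t+1}}
+ \frac{\hat{\alpha} \kappa_1 p_1}{\gamma} \sup_{t = 0, \dots, T-1} \frac{\|v_4^t\|}{(\gamma)^t} \\
&\quad + \sup_{t = 0, \dots, T-1} \frac{\hat{\alpha} \kappa_1 \hat{s} (\tilde{s})^2 \theta (\lambda)^t \|\nabla Q(y^t)\|_2}{(\gamma)^{t+1}}.
\end{aligned}
\tag{6.23}
\]
Also, noting that $\|v_1^0\| / (\gamma)^0 \le \|v_1^0\|$ and $\lambda < \gamma < 1$, it follows that
\[
\begin{aligned}
\|v_1\|^{\gamma,T} &\le \frac{\rho + \hat{\alpha} \kappa_1 n \hat{\varrho} p_1}{\gamma} \|v_1\|^{\gamma,T} + \frac{\hat{\alpha} \kappa_1 n \hat{\varrho}}{\gamma} \|v_2\|_2^{\gamma,T}
+ \frac{2 \hat{\beta} \kappa_1}{\gamma} \|v_3\|^{\gamma,T} + \frac{\hat{\alpha} \kappa_1 p_1}{\gamma} \|v_4\|^{\gamma,T} \\
&\quad + \|v_1^0\| + \frac{\hat{\alpha} \kappa_1 \hat{s} (\tilde{s})^2 \theta}{\gamma} \sup_{t = 0, \dots, T-1} \|\nabla Q(y^t)\|_2,
\end{aligned}
\tag{6.24}
\]
which after some algebraic manipulations yields the desired result.
The next lemma presents the bound of the dual optimality residual $\|v_2\|_2^{\gamma,T}$ associated with the weighted average (notice that $(R)^{\infty} y^t = 1_n \bar{y}^t$).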
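Lemma 6.5 Suppose that Assumptions 6.1 and 6.2 hold. If $\max\{\lambda, l_1\} < \gamma < 1$ and $0 < w^T \alpha < 2/\varrho$, then the following inequality holds for all $T > 0$:
\[
\|v_2\|_2^{\gamma,T} \le \frac{\hat{\alpha} n \hat{\varrho} p_1}{\gamma - l_1} \|v_1\|^{\gamma,T} + \frac{2 \hat{\beta} \kappa_2}{\gamma - l_1} \|v_3\|^{\gamma,T} + \frac{\hat{\alpha} \kappa_2}{\gamma - l_1} \|v_4\|^{\gamma,T} + u_2,
\tag{6.25}
\]
where $\kappa_2 = p_1 \|(R)^{\infty}\|_2$, $l_1 = \max\{|1 - \varrho w^T \alpha|, |1 - \vartheta w^T \alpha|\}$, and $u_2 = (\|v_2^0\|_2 + \hat{\alpha} \hat{s} (\tilde{s})^2 \theta \sup_{t = 0, \dots, T} \|\nabla Q(y^t)\|_2) / (\gamma - l_1)$.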
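Proof Notice that $(R)^{\infty} R = (R)^{\infty}$. Following from (6.13), we have
\[
\begin{aligned}
\|(R)^{\infty} y^{t+1} - 1_n y^*\|_2
&= \|(R)^{\infty} (y^t + D_\beta^t (h^t - h^{t-1}) + D_\beta^{t+1} (h^{t+1} - h^t) - D_\alpha z^t \\
&\qquad + (D_\alpha - D_\alpha)(R)^{\infty} z^t) - 1_n y^*\|_2 \\
&\le \|(R)^{\infty} y^t - (R)^{\infty} D_\alpha (R)^{\infty} [\tilde{S}^t]^{-1} \nabla Q(y^t) - 1_n y^*\|_2 \\
&\quad + \hat{\beta} \kappa_2 [\|h^t - h^{t-1}\| + \|h^{t+1} - h^t\|] + \hat{\alpha} \kappa_2 \|z^t - (R)^{\infty} z^t\|.
\end{aligned}
\tag{6.26}
\]
We now discuss the first term on the right-hand side of (6.26). Note that $(R)^{\infty} = 1_n w^T$. Utilizing $1_n w^T D_\alpha 1_n w^T = (w^T \alpha) 1_n w^T$, one obtains
\[
\begin{aligned}
&\|(R)^{\infty} y^t - (R)^{\infty} D_\alpha (R)^{\infty} [\tilde{S}^t]^{-1} \nabla Q(y^t) - 1_n y^*\|_2 \\
&\quad \le \|1_n (w^T y^t - y^* - (w^T \alpha) \nabla q(\bar{y}^t))\|_2 + (w^T \alpha) \|1_n \nabla q(\bar{y}^t) - 1_n w^T [\tilde{S}^t]^{-1} \nabla Q(y^t)\|_2 \\
&\quad = \Lambda_1 + \Lambda_2,
\end{aligned}
\tag{6.27}
\]
where $\nabla q(\bar{y}^t) = 1_n^T \nabla Q(1_n \bar{y}^t)$ and $\nabla Q(1_n \bar{y}^t) = [\nabla q_1(\bar{y}^t), \dots, \nabla q_n(\bar{y}^t)]^T$. Since the global function $q$ is strongly convex and smooth (see Lemma 6.1), if $0 < w^T \alpha < 2/\varrho$, $\Lambda_1$ is bounded by
\[
\Lambda_1 \le l_1 \sqrt{n} \|w^T y^t - y^*\|_2 = l_1 \|(R)^{\infty} y^t - 1_n y^*\|_2,
\tag{6.28}
\]
where $l_1 = \max\{|1 - \varrho w^T \alpha|, |1 - \vartheta w^T \alpha|\}$ (see reference [42] for a proof). Then, $\Lambda_2$ can be bounded in the following way:
\[
\begin{aligned}
\Lambda_2 &\le (w^T \alpha) \|1_n \nabla q(\bar{y}^t) - 1_n 1_n^T \nabla Q(y^t)\|_2 + (w^T \alpha) \|1_n 1_n^T \nabla Q(y^t) - 1_n w^T [\tilde{S}^t]^{-1} \nabla Q(y^t)\|_2 \\
&= \Lambda_3 + \Lambda_4.
\end{aligned}
\tag{6.29}
\]
Since $\nabla q(\bar{y}^t) = 1_n^T \nabla Q(1_n \bar{y}^t)$, it holds from Lemma 6.1 that
\[
\Lambda_3 \le \hat{\alpha} n \hat{\varrho} p_1 \|(R)^{\infty} y^t - y^t\|.
\tag{6.30}
\]
Next, by employing the fact $\|[\tilde{S}^t]^{-1} - [\tilde{S}^{\infty}]^{-1}\|_2 \le \theta (\tilde{s})^2 (\lambda)^t$ and the relationship $S^{\infty} [\tilde{S}^{\infty}]^{-1} = 1_n 1_n^T$, we get
\[
\Lambda_4 = (w^T \alpha) \|S^{\infty} [\tilde{S}^{\infty}]^{-1} \nabla Q(y^t) - S^{\infty} [\tilde{S}^t]^{-1} \nabla Q(y^t)\|_2 \le \hat{\alpha} \hat{s} (\tilde{s})^2 \theta (\lambda)^t \|\nabla Q(y^t)\|_2,
\tag{6.31}
\]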
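where $\hat{s} = \sup_{t \ge 0} \|S^t\|_2$ and $\tilde{s} = \sup_{t \ge 0} \|[\tilde{S}^t]^{-1}\|_2$. Plugging (6.27)–(6.31) into (6.26) yields
\[
\begin{aligned}
\|v_2^{t+1}\|_2 &\le l_1 \|v_2^t\|_2 + \hat{\alpha} n \hat{\varrho} p_1 \|v_1^t\| + \hat{\beta} \kappa_2 [\|v_3^{t+1}\| + \|v_3^t\|] \\
&\quad + \hat{\alpha} \kappa_2 \|v_4^t\| + \hat{\alpha} \hat{s} (\tilde{s})^2 \theta (\lambda)^t \|\nabla Q(y^t)\|_2.
\end{aligned}
\tag{6.32}
\]
Here, we can identify the terms in (6.32) with the terms in (6.22). Hence, in order to establish this lemma, we proceed as in the proof of Lemma 6.4 from (6.22).
For the bound of the estimate difference $\|v_3\|^{\gamma,T}$, the following lemma is shown.
Lemma 6.6 Suppose that Assumptions 6.1 and 6.2 hold. If $\max\{\lambda, 2 p_2 \hat{\beta}\} < \gamma < 1$, it holds that for all $T > 0$,
\[
\|v_3\|^{\gamma,T} \le \frac{\kappa_3 + \hat{\alpha} p_2 n \hat{\varrho} p_1}{\gamma - 2 p_2 \hat{\beta}} \|v_1\|^{\gamma,T} + \frac{\hat{\alpha} p_2 n \hat{\varrho}}{\gamma - 2 p_2 \hat{\beta}} \|v_2\|_2^{\gamma,T} + \frac{\hat{\alpha} p_2 p_1}{\gamma - 2 p_2 \hat{\beta}} \|v_4\|^{\gamma,T} + u_3,
\tag{6.33}
\]
where $u_3 = (\hat{\alpha} p_2 \hat{s} (\tilde{s})^2 \theta \sup_{t = 0, \dots, T} \|\nabla Q(y^t)\|_2) / (\gamma - 2 p_2 \hat{\beta})$ and $\kappa_3 = \|R - I_n\|$.
Proof Recalling that $(R)^{\infty} R = (R)^{\infty}$, we obtain from (6.13) that
\[
\begin{aligned}
\|h^{t+1} - h^t\| &= \|(R - I_n) y^t + 2 D_\beta^t (h^t - h^{t-1}) - D_\alpha z^t\| \\
&\le \|(R - I_n)(y^t - (R)^{\infty} y^t)\| + 2 p_2 \hat{\beta} \|h^t - h^{t-1}\| + p_2 \hat{\alpha} \|z^t\|_2,
\end{aligned}
\tag{6.34}
\]
where the inequality in (6.34) is obtained from the fact $(R - I_n)(I_n - (R)^{\infty}) = R - I_n$. Now, apply Lemma 6.3 to deduce that
\[
\begin{aligned}
\|v_3^{t+1}\| &\le 2 p_2 \hat{\beta} \|v_3^t\| + (\kappa_3 + \hat{\alpha} p_2 n \hat{\varrho} p_1) \|v_1^t\| + \hat{\alpha} p_2 n \hat{\varrho} \|v_2^t\|_2 \\
&\quad + \hat{\alpha} p_2 p_1 \|v_4^t\| + \hat{\alpha} p_2 \hat{s} (\tilde{s})^2 \theta (\lambda)^t \|\nabla Q(y^t)\|_2.
\end{aligned}
\tag{6.35}
\]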
Following the same procedure as after (6.22), the desired result is obtained.
The next lemma establishes the inequality that bounds the error term $\|v_4\|^{\gamma,T}$ corresponding to gradient estimation.
Lemma 6.7 Suppose that Assumptions 6.1–6.3 hold. If $\max\{\lambda, \rho\} < \gamma < 1$, for all $T > 0$, we have the following estimate:
\[
\|v_4\|^{\gamma,T} \le \frac{\kappa_4 + 2 \hat{\beta} \kappa_4}{\gamma - \rho} \|v_3\|^{\gamma,T} + u_4,
\tag{6.36}
\]
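where $u_4 = 2 \|I_n - (R)^{\infty}\| p_2 (\tilde{s})^2 \theta \sup_{t = 0, \dots, T} \|\nabla Q(y^t)\|_2 / (\gamma - \rho) + \|v_4^0\| / (\gamma - \rho)$ and $\kappa_4 = \|I_n - (R)^{\infty}\| p_1 p_2 \hat{\varrho} \tilde{s}$.
Proof It is immediately obtained from (6.13) that
\[
\begin{aligned}
\|z^{t+1} - (R)^{\infty} z^{t+1}\|
&= \|(R)^{\infty} (R z^t + [\tilde{S}^{t+1}]^{-1} \nabla Q(y^{t+1}) - [\tilde{S}^t]^{-1} \nabla Q(y^t)) \\
&\qquad - (R z^t + [\tilde{S}^{t+1}]^{-1} \nabla Q(y^{t+1}) - [\tilde{S}^t]^{-1} \nabla Q(y^t))\| \\
&\le \|I_n - (R)^{\infty}\| \, \|[\tilde{S}^{t+1}]^{-1} \nabla Q(y^{t+1}) - [\tilde{S}^t]^{-1} \nabla Q(y^t)\| + \rho \|z^t - (R)^{\infty} z^t\|,
\end{aligned}
\tag{6.37}
\]
where we employ the triangle inequality and the fact $\|Ry - (R)^{\infty} y\| \le \rho \|y - (R)^{\infty} y\|$ to deduce the inequality. As for the first term on the right-hand side of (6.37), we apply (6.13) to obtain
\[
\begin{aligned}
&\|[\tilde{S}^{t+1}]^{-1} \nabla Q(y^{t+1}) - [\tilde{S}^t]^{-1} \nabla Q(y^t)\| \\
&\quad \le \|[\tilde{S}^{t+1}]^{-1} \nabla Q(y^{t+1}) - [\tilde{S}^{t+1}]^{-1} \nabla Q(y^t)\| + \|[\tilde{S}^{t+1}]^{-1} \nabla Q(y^t) - [\tilde{S}^t]^{-1} \nabla Q(y^t)\| \\
&\quad \le p_2 \hat{\varrho} \tilde{s} \|y^{t+1} - y^t\|_2 + 2 p_2 (\tilde{s})^2 \theta (\lambda)^t \|\nabla Q(y^t)\|_2 \\
&\quad \le p_1 p_2 \hat{\varrho} \tilde{s} (1 + \hat{\beta}) \|h^{t+1} - h^t\| + p_1 p_2 \hat{\varrho} \tilde{s} \hat{\beta} \|h^t - h^{t-1}\| + 2 p_2 (\tilde{s})^2 \theta (\lambda)^t \|\nabla Q(y^t)\|_2.
\end{aligned}
\tag{6.38}
\]
Combining (6.37) and (6.38) leads to
\[
\|v_4^{t+1}\| \le \rho \|v_4^t\| + \kappa_4 (1 + \hat{\beta}) \|v_3^{t+1}\| + \kappa_4 \hat{\beta} \|v_3^t\| + 2 \|I_n - (R)^{\infty}\| p_2 (\tilde{s})^2 \theta (\lambda)^t \|\nabla Q(y^t)\|_2,
\tag{6.39}
\]
which further gives
\[
\begin{aligned}
\frac{\|v_4^{t+1}\|}{(\gamma)^{t+1}} &\le \frac{\rho}{\gamma} \frac{\|v_4^t\|}{(\gamma)^t} + \kappa_4 (1 + \hat{\beta}) \frac{\|v_3^{t+1}\|}{(\gamma)^{t+1}} + \frac{\kappa_4 \hat{\beta}}{\gamma} \frac{\|v_3^t\|}{(\gamma)^t} \\
&\quad + \frac{2 \|I_n - (R)^{\infty}\| p_2 (\tilde{s})^2 \theta \|\nabla Q(y^t)\|_2}{\gamma} \left(\frac{\lambda}{\gamma}\right)^t.
\end{aligned}
\tag{6.40}
\]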
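If $\max\{\lambda, \rho\} < \gamma < 1$, we can conclude the desired result by taking $\sup_{t = 0, \dots, T}$ on both sides of (6.40) and rearranging the resulting expressions.
With the aid of the auxiliary relationships, i.e., the above Lemmas 6.4–6.7, the main convergence results of D-DLM are now established. For the sake of convenience, we define $w_{\min} = \min_{i \in V}\{w_i\}$ and $\kappa_5 = \kappa_1 n \hat{\varrho} p_1$. Then, the first result, i.e., Theorem 6.8, is introduced as follows.
Theorem 6.8 Suppose that Assumptions 6.1–6.3 hold. Consider the sequences $\{x^t\}$, $\{h^t\}$, $\{y^t\}$, $\{S^t\}$, and $\{z^t\}$ generated by D-DLM (6.13). Then, if $0 < w^T \alpha < 2/\varrho$, we obtain the following linear inequality for all $T > 0$:
\[
v^{\gamma,T} \preceq \Gamma v^{\gamma,T} + u,
\tag{6.41}
\]
where $v^{\gamma,T} = [\|v_1\|^{\gamma,T}, \|v_2\|_2^{\gamma,T}, \|v_3\|^{\gamma,T}, \|v_4\|^{\gamma,T}]^T$, $u = [u_1, u_2, u_3, u_4]^T$, and the elements of the matrix $\Gamma = [\gamma_{ij}] \in \mathbb{R}^{4 \times 4}$ are given by
\[
\Gamma = \begin{bmatrix}
0 & \dfrac{\hat{\alpha} \kappa_1 n \hat{\varrho}}{\gamma - \rho - \hat{\alpha} \kappa_5} & \dfrac{2 \hat{\beta} \kappa_1}{\gamma - \rho - \hat{\alpha} \kappa_5} & \dfrac{\hat{\alpha} \kappa_1 p_1}{\gamma - \rho - \hat{\alpha} \kappa_5} \\
\dfrac{\hat{\alpha} n \hat{\varrho} p_1}{\gamma - l_1} & 0 & \dfrac{2 \hat{\beta} \kappa_2}{\gamma - l_1} & \dfrac{\hat{\alpha} \kappa_2}{\gamma - l_1} \\
\dfrac{\kappa_3 + \hat{\alpha} p_2 n \hat{\varrho} p_1}{\gamma - 2 p_2 \hat{\beta}} & \dfrac{\hat{\alpha} p_2 n \hat{\varrho}}{\gamma - 2 p_2 \hat{\beta}} & 0 & \dfrac{\hat{\alpha} p_2 p_1}{\gamma - 2 p_2 \hat{\beta}} \\
0 & 0 & \dfrac{\kappa_4 + 2 \hat{\beta} \kappa_4}{\gamma - \rho} & 0
\end{bmatrix}.
\]
Assume in addition that the largest step-size satisfies
\[
0 < \hat{\alpha} < \min\left\{ \frac{1}{\varrho},\ \frac{\eta_1 (1 - \rho)}{\kappa_5 \eta_1 + \kappa_1 n \hat{\varrho} \eta_2 + \kappa_1 p_1 \eta_4},\ \frac{\eta_3 - \kappa_3 \eta_1}{p_2 n \hat{\varrho} p_1 \eta_1 + p_2 n \hat{\varrho} \eta_2 + p_2 p_1 \eta_4} \right\},
\tag{6.42}
\]
and the maximum momentum coefficient satisfies
\[
\begin{aligned}
0 \le \hat{\beta} < \min\Bigg\{ &\frac{\eta_1 (1 - \rho) - (\kappa_5 \eta_1 + \kappa_1 n \hat{\varrho} \eta_2 + \kappa_1 p_1 \eta_4) \hat{\alpha}}{2 \kappa_1 \eta_3},\ \frac{\eta_4 - \eta_4 \rho - \kappa_4 \eta_3}{2 \kappa_4 \eta_3}, \\
&\frac{(\vartheta w_{\min} \eta_2 - n \hat{\varrho} p_1 \eta_1 - \kappa_2 \eta_4) \hat{\alpha}}{2 \kappa_2 \eta_3},\ \frac{\eta_3 - \kappa_3 \eta_1 - (p_2 n \hat{\varrho} p_1 \eta_1 + p_2 n \hat{\varrho} \eta_2 + p_2 p_1 \eta_4) \hat{\alpha}}{2 p_2 \eta_3} \Bigg\}.
\end{aligned}
\tag{6.43}
\]
Then the sequence $\{y^t\}$ converges to $1_n y^*$ at a linear rate of $O((\gamma)^t)$, where $0 < \gamma < 1$ is a constant such that
\[
\begin{aligned}
\gamma = \max\Bigg\{ &\lambda,\ \rho + \frac{2 \hat{\beta} \kappa_1 \eta_3 + (\kappa_5 \eta_1 + \kappa_1 n \hat{\varrho} \eta_2 + \kappa_1 p_1 \eta_4) \hat{\alpha}}{\eta_1},\ l_1 + \frac{2 \hat{\beta} \kappa_2 \eta_3 + (n \hat{\varrho} p_1 \eta_1 + \kappa_2 \eta_4) \hat{\alpha}}{\eta_2}, \\
&\rho + \frac{2 \hat{\beta} \kappa_4 \eta_3 + \kappa_4 \eta_3}{\eta_4},\ 2 p_2 \hat{\beta} + \frac{\kappa_3 \eta_1 + (p_2 n \hat{\varrho} p_1 \eta_1 + p_2 n \hat{\varrho} \eta_2 + p_2 p_1 \eta_4) \hat{\alpha}}{\eta_3} \Bigg\},
\end{aligned}
\tag{6.44}
\]
where $\eta_1$, $\eta_2$, $\eta_3$, and $\eta_4$ are positive constants such that
\[
\eta_1 > 0, \quad \eta_2 > \frac{n \hat{\varrho} p_1 \eta_1 + \kappa_2 \eta_4}{\vartheta w_{\min}}, \quad \eta_3 > \kappa_3 \eta_1, \quad \eta_4 > \frac{\kappa_4 \eta_3}{1 - \rho}.
\tag{6.45}
\]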
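Proof First, summarizing the results of Lemmas 6.4–6.7, we can conclude (6.41) immediately. Next, we provide some sufficient conditions to make the spectral radius of $\Gamma$, denoted by $\rho(\Gamma)$, strictly less than 1, i.e., $\rho(\Gamma) < 1$. According to Theorem 8.1.29 in [45], we know that, for a positive vector $\eta = [\eta_1, \dots, \eta_4]^T \in \mathbb{R}^4$, if $\Gamma \eta < \eta$, then $\rho(\Gamma) < 1$ holds. By the definition of $\Gamma$, it is deduced that the inequality $\Gamma \eta < \eta$ is equivalent to
\[
\begin{cases}
2 \hat{\beta} \kappa_1 \eta_3 < \eta_1 \gamma - \eta_1 \rho - (\kappa_5 \eta_1 + \kappa_1 n \hat{\varrho} \eta_2 + \kappa_1 p_1 \eta_4) \hat{\alpha} \\
2 \hat{\beta} \kappa_2 \eta_3 < \eta_2 \gamma - \eta_2 l_1 - (n \hat{\varrho} p_1 \eta_1 + \kappa_2 \eta_4) \hat{\alpha} \\
2 p_2 \hat{\beta} \eta_3 < \eta_3 \gamma - \kappa_3 \eta_1 - (p_2 n \hat{\varrho} p_1 \eta_1 + p_2 n \hat{\varrho} \eta_2 + p_2 p_1 \eta_4) \hat{\alpha} \\
2 \hat{\beta} \kappa_4 \eta_3 < \eta_4 \gamma - \eta_4 \rho - \kappa_4 \eta_3,
\end{cases}
\tag{6.46}
\]
which further implies that
\[
\begin{cases}
\gamma > \dfrac{2 \hat{\beta} \kappa_1 \eta_3 + \rho \eta_1 + (\kappa_5 \eta_1 + \kappa_1 n \hat{\varrho} \eta_2 + \kappa_1 p_1 \eta_4) \hat{\alpha}}{\eta_1} \\[2ex]
\gamma > \dfrac{2 \hat{\beta} \kappa_2 \eta_3 + \eta_2 l_1 + (n \hat{\varrho} p_1 \eta_1 + \kappa_2 \eta_4) \hat{\alpha}}{\eta_2} \\[2ex]
\gamma > \dfrac{2 p_2 \hat{\beta} \eta_3 + \kappa_3 \eta_1 + (p_2 n \hat{\varrho} p_1 \eta_1 + p_2 n \hat{\varrho} \eta_2 + p_2 p_1 \eta_4) \hat{\alpha}}{\eta_3} \\[2ex]
\gamma > \dfrac{2 \hat{\beta} \kappa_4 \eta_3 + \rho \eta_4 + \kappa_4 \eta_3}{\eta_4}.
\end{cases}
\tag{6.47}
\]
Recalling $l_1$ in Lemma 6.5, if $\hat{\alpha} < 1/\varrho$, it yields that $l_1 = 1 - \vartheta w^T \alpha \le 1 - \vartheta w_{\min} \hat{\alpha}$. To ensure the positivity of $\hat{\beta}$ (the right-hand sides of (6.46) must be positive), if $0 < \gamma < 1$, (6.46) further gives
\[
\begin{cases}
\hat{\alpha} < \dfrac{\eta_1 (1 - \rho)}{\kappa_5 \eta_1 + \kappa_1 n \hat{\varrho} \eta_2 + \kappa_1 p_1 \eta_4} \\[2ex]
\eta_2 > \dfrac{n \hat{\varrho} p_1 \eta_1 + \kappa_2 \eta_4}{\vartheta w_{\min}} \\[2ex]
\hat{\alpha} < \dfrac{\eta_3 - \kappa_3 \eta_1}{p_2 n \hat{\varrho} p_1 \eta_1 + p_2 n \hat{\varrho} \eta_2 + p_2 p_1 \eta_4}; \quad \eta_3 > \kappa_3 \eta_1 \\[2ex]
\eta_4 > \dfrac{\kappa_4 \eta_3}{1 - \rho}.
\end{cases}
\tag{6.48}
\]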
Now, we are in the position of selecting the vector $\eta = [\eta_1, \dots, \eta_4]^T$ to ensure the solvability of $\hat{\alpha}$. Since $\rho < 1$, we first pick an arbitrary positive constant $\eta_1$, then choose $\eta_3$ and $\eta_4$ in accordance with the third and the fourth conditions in (6.48), respectively, and finally select $\eta_2$ satisfying the second condition in (6.48). Hence, (6.48) yields the upper bounds on the largest step-size $\hat{\alpha}$ in (6.42), considering the requirement that $\hat{\alpha} < 1/\varrho$. If $0 < \gamma < 1$, we then obtain the upper bounds on the maximum momentum coefficient $\hat{\beta}$ according to (6.46) and the upper bounds on $\hat{\alpha}$. Recalling that $\nabla Q(y^t) = [\nabla q_1(y_1^t), \dots, \nabla q_n(y_n^t)]^T$ and $\nabla q_i(y_i^t) = x_i^t - d_i$, $i \in V$, it follows that $\|\nabla Q(y^t)\|_2$ is bounded by a positive constant, because each $x_i^t$ lies in $[x_i^{\min}, x_i^{\max}]$. Consequently, all the elements ($u_1$, $u_2$, $u_3$, and $u_4$) of the vector $u$ are uniformly bounded. Therefore, all the conditions of Lemma 6.2 are satisfied. By Lemma 6.2, we can deduce that the sequence $\{y^t\}$ converges to $1_n y^*$ at a linear rate of $O((\gamma)^t)$, where $\gamma$ satisfies (6.44). This finishes the proof.
Remark 6.9 It is worth emphasizing that $\eta_1$, $\eta_2$, $\eta_3$, and $\eta_4$ in Theorem 6.8 are tunable parameters, which only depend on the network topology and the cost functions. Thus, the choices of the largest step-size $\hat{\alpha}$ and the maximum momentum coefficient $\hat{\beta}$ can be calculated without much effort as long as other parameters, such as $\lambda$, $\rho$, etc., are properly selected. Furthermore, in order to design the step-sizes and the momentum coefficients, some global parameters, such as $\vartheta$, $\varrho$, $\hat{\varrho}$, and $w_{\min}$, are needed. We note that the amount of preprocessing in calculating the global parameters is negligible compared with the worst-case running time of D-DLM (see [46] for a specific analysis).
Based on Theorem 6.8, below we show that the sequence $\{x^t\}$ linearly converges to the optimal solution to (6.1). Similar to many distributed Lagrangian methods [22, 27, 28, 30], we accomplish this by relating the primal variables to the Lagrangian multipliers.
Theorem 6.10 Suppose that Assumptions 6.1–6.3 hold. Consider the sequences $\{x^t\}$, $\{h^t\}$, $\{y^t\}$, $\{S^t\}$, and $\{z^t\}$ generated by D-DLM (6.13).
If $\hat{\alpha}$ and $\hat{\beta}$ satisfy the conditions of Theorem 6.8, then the sequence $\{x^t\}$ converges to $x^*$ at a linear rate of $O((\gamma/2)^t)$.
Proof Here, the approach is similar to that in the proof of Theorem 3 in [30]. We briefly give the proof here for completeness. To demonstrate Theorem 6.10, we ought to show the following relation between the primal variables and the Lagrangian multipliers:
\[
\sum_{i=1}^{n} \frac{\mu_i}{2} (x_i^t - x_i^*)^2 \le \sum_{i=1}^{n} \left[ \nabla q_i(y^*)(y_i^t - y^*) + \frac{\varrho_i}{2} (y_i^t - y^*)^2 + y_i^t (x_i^* - d_i) \right].
\tag{6.49}
\]
Recall that $x^*$ is the primal optimal solution to (6.1) and $x^* = [x_1^*, x_2^*, \dots, x_n^*]^T$. Then, we have that $\sum_{i=1}^{n} (x_i^* - d_i) = 0$. It also follows that each Lagrangian
multiplier $y_i^t$, $i \in V$, converges to $y^*$ (the dual optimal solution to (6.10)) if the largest step-size $\hat{\alpha}$ and the maximum momentum coefficient $\hat{\beta}$ satisfy the corresponding upper bounds in Theorem 6.8. Thus, the right-hand side of (6.49) goes to zero as $t \to \infty$. This immediately indicates that the sequence $\{x^t\}$ converges to $x^*$ at a linear rate of $O((\gamma/2)^t)$ if the sequence $\{y^t\}$ converges to $1_n y^*$ at the linear rate of $O((\gamma)^t)$ given in Theorem 6.8. It remains to establish inequality (6.49). We note that various variants of (6.49) are shown in detail in [22, 27, 28, 30]. Hence, in order to establish inequality (6.49), we perform the remaining analysis according to the proofs given in [22, 27, 28, 30]. This completes the proof.
Remark 6.11 Theorem 6.10 establishes that D-DLM linearly converges to the global optimal solution provided that the largest step-size $\hat{\alpha}$ and the maximum momentum coefficient $\hat{\beta}$ obey the respective upper bounds given in Theorem 6.8. Although most existing works (the distributed gradient tracking methods) [37, 47] and our previous works [28, 38] adopted non-uniform step-sizes and converged at a linear rate, this chapter still has three advantages compared with [28, 37, 38, 47]. First, D-DLM extends the distributed gradient tracking method with heavy-ball momentum and Nesterov momentum, which improves information exchange to ensure faster convergence. Second, since the provided bounds on the largest step-size $\hat{\alpha}$ in Theorem 6.8 only depend on the network topology and the cost functions, each generator can choose its step-size from a relatively wider range. This is in contrast to the earlier works on non-uniform step-sizes within the framework of gradient tracking [28, 37, 38, 47], whose bounds rely not only on the cost functions and the network topology but also on the heterogeneity of the step-sizes ($\|(I_n - W)\alpha\|_2 / \|W\alpha\|_2$, where $W$ is the weight matrix, in [47], and $\hat{\alpha}/\tilde{\alpha}$ with $\tilde{\alpha} = \min_{i \in V}\{\alpha_i\}$ in [28, 37, 38]).
Besides, the analysis has also shown that the algorithms in [28, 37, 38, 47] could linearly converge to the optimal solution if and only if the heterogeneity and the largest step-size are small. However, the largest step-size follows a bound which is a function of the heterogeneity, and there is a trade-off between the tolerance of heterogeneity and the largest step-size that can be achieved. Finally, the bound on the non-uniform step-sizes in this chapter allows some (but not all) of the generators to use zero step-sizes, given that the largest step-size $\hat{\alpha}$ is positive and sufficiently small.
Remark 6.12 Recently, many works have been devoted to the study of distributed methods with momentum terms, such as the method with heavy-ball momentum [32] and the methods with Nesterov momentum [31, 35]. It should be noted that the analysis method, i.e., the generalized small gain theorem [43] employed in this chapter, is significantly different from the methods used in [31, 32, 35]. In comparison with [32, 35], this chapter also establishes explicit estimates for the convergence rate of D-DLM. It is further straightforward to conceive a time-varying implementation of D-DLM over broadcast-based mechanisms or random networks, e.g., the related work in [34]. Asynchronous schemes may also be derived from the methodologies in [43, 47]. In addition, it is also concluded from [48] that if D-DLM is employed for optimizing more complex problems, such as deep neural networks, the gradient of the dual function is usually calculated with different kinds of stochastic noises, which yields the stochastic version of D-DLM.
6.5 Numerical Examples
In this section, several case studies on the EDP in smart grids are provided to verify the effectiveness of D-DLM and the correctness of the theoretical analysis. All the simulations are carried out in MATLAB on an HP desktop with a 3.20 GHz Intel i7-8700 processor (6 cores, 12 threads) and 8 GB of memory.
6.5.1 Case Study 1: EDP on IEEE 14-Bus Test Systems
First, we study the EDP on the IEEE 14-bus test system [22] shown in Fig. 6.1, where {1, 2, 3, 6, 8} are generator buses and {2, 3, 4, 5, 6, 9, 10, 11, 12, 13, 14} are load buses. Suppose that each generator $i$ has a quadratic cost function, i.e., $C_i(x_i) = a_i (x_i)^2 + b_i x_i$ (known privately by generator $i$), where the generator parameters are summarized in Table 6.1 [26]. Note that the power generation is zero if a bus does not possess generators. The total demand is $\sum_{i=1}^{14} d_i = 380$ MW, where the local demands on the buses are d1 = 0 MW, d2 = 9 MW, d3 = 56 MW, d4 = 55 MW, d5 = 27 MW, d6 = 27 MW, d7 = 0 MW, d8 = 0 MW, d9 = 8 MW, d10 = 24 MW, d11 = 53 MW, d12 = 46 MW, d13 = 16 MW, and d14 = 40 MW. The goal is to minimize the total power generation cost of the generators while meeting the load demands.

Fig. 6.1 The IEEE 14-bus test system

Table 6.1 Generator parameters in IEEE 14-bus test system
Bus   a_i ($/MW^2)   b_i ($/MW)   [x_i^min, x_i^max] (MW)
1     0.04           2.0          [0, 80]
2     0.03           3.0          [0, 90]
3     0.035          4.0          [0, 70]
6     0.03           4.0          [0, 70]
8     0.04           2.5          [0, 80]

Suppose that the communication between generators is represented by a directed and strongly connected circle network [44], and the weighting strategy $r_{ij} = 1/|N_i^{\text{in}}|$, $\forall i$, is utilized to construct the row-stochastic weights. The non-uniform step-sizes and the momentum coefficients are, respectively, selected as $\alpha_i = 0.004 \theta_i$ and $\beta_i^t = 0.3 \tilde{\theta}_i^t$, $t \ge 0$, where $\theta_i$ and $\tilde{\theta}_i^t$ are uniformly distributed over the interval (0, 1). The simulation results are shown in Fig. 6.2. Figure 6.2a and b show that the optimal incremental cost is $y^* = 8.527$ $/MW and the optimal power generation is $x_1^* = 80$ MW, $x_2^* = 90$ MW, $x_3^* = 64.67$ MW, $x_6^* = 70$ MW, and $x_8^* = 75.33$ MW. When $y_i^t$ and $x_i^t$ converge to the optimal solutions, the total generation satisfies the total demand, as depicted in Fig. 6.2c.
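The reported optimum can be cross-checked with a simple centralized computation: for a common incremental cost $y$, each generator produces $\min\{\max\{(y - b_i)/(2 a_i), x_i^{\min}\}, x_i^{\max}\}$, and the total generation is non-decreasing in $y$, so the price that meets a 380 MW demand can be found by bisection. The Python sketch below is only such a sanity check using the data of Table 6.1; it is not D-DLM, and the helper names `dispatch` and `solve_price` are hypothetical.

```python
def dispatch(price, a, b, x_min, x_max):
    """Generation of each unit at a common incremental cost, clipped to its capacity."""
    return [min(max((price - bi) / (2.0 * ai), lo), hi)
            for ai, bi, lo, hi in zip(a, b, x_min, x_max)]


def solve_price(total_demand, a, b, x_min, x_max, iters=100):
    """Bisection on the incremental cost until total generation meets total demand."""
    lo = min(2.0 * ai * xi + bi for ai, bi, xi in zip(a, b, x_min))   # all units at x_min
    hi = max(2.0 * ai * xi + bi for ai, bi, xi in zip(a, b, x_max))   # all units at x_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(dispatch(mid, a, b, x_min, x_max)) < total_demand:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Generator data from Table 6.1 (buses 1, 2, 3, 6, 8).
a = [0.04, 0.03, 0.035, 0.03, 0.04]
b = [2.0, 3.0, 4.0, 4.0, 2.5]
x_min = [0.0] * 5
x_max = [80.0, 90.0, 70.0, 70.0, 80.0]

y_star = solve_price(380.0, a, b, x_min, x_max)
x_star = dispatch(y_star, a, b, x_min, x_max)
```

The bisection gives an incremental cost of about 8.527 $/MW, with the generators at buses 1, 2, and 6 at their upper limits and the generators at buses 3 and 8 producing about 64.67 MW and 75.33 MW, in agreement with the values read from Fig. 6.2.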
Fig. 6.2 EDP on IEEE 14-bus test system. Panels: (a) power generation (MW) of Gen. 1, 2, 3, 6, and 8; (b) incremental cost ($/MW); (c) power balance (MW), total generation versus total demand; each plotted against the iteration count (0 to 300)

6.5.2 Case Study 2: EDP on IEEE 118-Bus Test Systems
In this case study, the EDP on the IEEE 118-bus test system [49] is considered to demonstrate the performance of D-DLM on a large-scale network. The IEEE 118-bus test system, as shown in Fig. 6.3 [49], contains 54 generators connected by bus lines, and the resulting network is assumed to be directed and strongly connected. Each generator $i$ has a quadratic cost function (known privately by generator $i$), i.e., $C_i(x_i) = a_i (x_i)^2 + b_i x_i + c_i$, where the coefficients $a_i \in [0.0024, 0.0697]$, $b_i \in [8.3391, 37.6968]$, and $c_i \in [6.78, 74.33]$ have units $/MW^2, $/MW, and $, respectively. Each $x_i$ is constrained to an interval $[x_i^{\min}, x_i^{\max}]$ (MW), where $x_i^{\max} \in [150, 400]$ and $x_i^{\min} \in [5, 150]$. Suppose that the total load required from the system is 4242 MW. In addition, the communication between generators adopts the bus data given in [22]. For convenience, we employ the same uniform weighting strategy as explained in Case Study 1. Moreover, the non-uniform step-sizes and the momentum coefficients are, respectively, selected as $\alpha_i = 0.0002 \theta_i$ and $\beta_i^t = 0.5 \tilde{\theta}_i^t$, $t \ge 0$. The numerical results are illustrated in Fig. 6.4, which demonstrates the convergence of D-DLM for the IEEE 118-bus test system. It implies that D-DLM successfully drives the variables to the optimal solutions within few iterations even for this large-scale network.

Fig. 6.3 The IEEE 118-bus test system
Fig. 6.4 EDP on IEEE 118-bus test system. Three panels plotted against the iteration count (0 to 1000), including (b) incremental cost ($/MW) and (c) power balance (MW), total generation versus total demand

6.5.3 Case Study 3: The Application to Dynamical EDPs
Since the demand does not remain constant in the practical operation of smart grids, this case study involves the application of D-DLM to dynamic economic dispatch problems (EDPs), i.e., EDPs with time-varying demands. In this case study, we simulate the performance of D-DLM utilizing the same IEEE 14-bus test system, the cost functions at the generators, and other related parameters as described in Case Study 1. In addition, we divide the total iteration time into 3 identical time intervals and assume that the total demand in each time interval is different. The numerical results are illustrated in Fig. 6.5. It shows that when the total demand changes, the generators alter the power generation accordingly to meet the current total demand, and D-DLM successfully achieves the optimal power generation after a short period of time.
6.5.4 Case Study 4: Comparison with Related Methods
Finally, D-DLM is compared with the existing centralized primal–dual method [40] and the distributed primal–dual method [30] in terms of convergence performance, convergence time, and computational complexity. To this end, we consider the following two scenarios:
(1) Comparison of convergence performance: in the first scenario, the convergence performance is compared on the IEEE 14-bus and 118-bus test systems, where the residual $E_t = \log_{10} \sum_{i=1}^{n} \|x_i^t - x_i^*\|$, $t \ge 0$, is used as the comparison metric. The required parameters (row-stochastic weights, non-uniform step-sizes, etc.) are the same as in Case Studies 1 and 2. Figure 6.6 shows that D-DLM achieves a linear convergence rate, and that increasing the momentum coefficients (up to an upper bound) speeds up convergence relative to the applicable methods [30, 40], which have no momentum terms. In addition, the acceleration provided by D-DLM remains evident even in large-scale directed networks.
(2) Comparison of convergence time and computational complexity: in the second scenario, the convergence time and computational complexity of D-DLM and the applicable methods [30, 40] are discussed on the IEEE 14-bus and 118-bus test systems. Here, we measure the convergence time by the time the algorithm takes, and the computational complexity by the number of calculations the algorithm requires, to reach the desired residual level $E_t = -15$. Table 6.2 indicates that D-DLM has a shorter convergence time than the distributed method [30], since the two momentum terms added in D-DLM improve the convergence rate. Besides, although the centralized method has a shorter convergence time than the distributed methods, it is worth highlighting that the local optimizations of the distributed methods are simulated in sequence here; if they were run in parallel (as in practice), the total time required would be far less than the total time of the local optimizations run in sequence. Table 6.2 also shows that the computational complexity of D-DLM and of the distributed method [30] grows slowly with the number of buses, whereas that of the centralized method [40] grows rapidly.
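The residual metric and the stopping level used in this comparison can be computed along the following lines. This is a minimal sketch rather than the simulation code behind Fig. 6.6 and Table 6.2: the function names `residual` and `first_iteration_below` are hypothetical, each generator output is treated as a scalar, and the optimum `x_star` is assumed to be available (for instance from a centralized solver).

```python
import math

def residual(x_t, x_star):
    """E_t = log10( sum_i |x_i^t - x_i^*| ) for scalar per-generator outputs."""
    total = sum(abs(xi - xs) for xi, xs in zip(x_t, x_star))
    # log10(0) is undefined, so report -inf when the iterate matches exactly.
    return math.log10(total) if total > 0.0 else float("-inf")

def first_iteration_below(trajectory, x_star, level=-15.0):
    """Index of the first iterate whose residual is <= level, or None."""
    for t, x_t in enumerate(trajectory):
        if residual(x_t, x_star) <= level:
            return t
    return None
```

Counting iterations until `first_iteration_below` returns gives an iteration-based analogue of the convergence time; wall-clock time additionally depends on the machine and on whether the nodes are simulated in sequence or in parallel.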
Fig. 6.5 Dynamical EDPs on IEEE 14-bus test system: (a) power generation (MW) of Gen. 1 to Gen. 5, (b) incremental cost ($/MW), (c) power balance (MW), total generation versus total demand; horizontal axes: iteration
Fig. 6.6 Performance comparison: (a) residual on the IEEE 14-bus test system, (b) residual on the IEEE 118-bus test system; horizontal axes: iteration
Table 6.2 Convergence time and computational complexity
Algorithm type   Bus system   Convergence time (s)   Computational complexity
D-DLM            14           0.0684                 1.704 × 10^4
D-DLM            118          3.1956                 2.106 × 10^5
Centralized      14           0.0156                 1.337 × 10^4
Centralized      118          0.0693                 2.229 × 10^6
Distributed      14           0.0726                 1.568 × 10^4
Distributed      118          4.8311                 2.619 × 10^5
6.6 Conclusion
In this chapter, we have considered the EDP in smart grids, where generators collectively minimize the total generation cost while satisfying the expected load demands and respecting the generator capacity limits. To solve the EDP, a novel directed distributed Lagrangian momentum algorithm, named D-DLM, has been presented and analyzed in detail. D-DLM extends the distributed gradient tracking method with heavy-ball momentum and Nesterov momentum, allows generators to select non-uniform step-sizes in a distributed manner, and only requires the weight matrix to be row-stochastic, which makes it suitable for directed networks. In particular, the directed network is assumed to be strongly connected. For smooth and strongly convex cost functions, we have proven that, if the largest step-size and the maximum momentum coefficient are below certain upper bounds (which depend only on the network topology and the cost functions), D-DLM converges linearly to the optimal dispatch at the expense of eigenvector learning. In addition, an explicit estimate of the convergence rate of D-DLM has been derived. The theoretical analysis has been further verified by simulations. In future work, we will consider several interesting problems (privacy masking, utility maximization, etc.) in smart grids with the aid of D-DLM and study robustness to time-varying networks, packet dropout, latency, random link failures, and transmission losses.
References
1. N. Heydaribeni, A. Anastasopoulos, Distributed mechanism design for network resource allocation problems. IEEE Trans. Netw. Sci. Eng. 7(2), 621–636 (2020)
2. C. Li, X. Yu, T. Huang, X. He, Distributed optimal consensus over resource allocation network and its application to dynamical economic dispatch. IEEE Trans. Neural Netw. Learn. Syst. 29(6), 2407–2418 (2018)
3. L. Wang, S. Cheng, Data-driven resource management for ultra-dense small cells: an affinity propagation clustering approach. IEEE Trans. Netw. Sci. Eng. 6(3), 267–279 (2019)
4. Y. Kawamoto, H. Takagi, H. Nishiyama, N. Kato, Efficient resource allocation utilizing Q-learning in multiple UA communications. IEEE Trans. Netw. Sci. Eng. 6(3), 293–302 (2019)
5. R. Wang, Q. Li, B. Zhang, L. Wang, Distributed consensus based algorithm for economic dispatch in a microgrid. IEEE Trans. Smart Grid 10(4), 3630–3640 (2019)
6. T. Kim, S. Wright, D. Bienstock, S. Harnett, Analyzing vulnerability of power systems with continuous optimization formulations. IEEE Trans. Netw. Sci. Eng. 3(3), 132–146 (2016)
7. S. D’Aronco, P. Frossard, Online resource inference in network utility maximization problems. IEEE Trans. Netw. Sci. Eng. 6(3), 432–444 (2019)
8. N. Li, L. Chen, S. Low, Optimal demand response based on utility maximization in power networks, in Proceedings of the 2011 IEEE Power and Energy Society General Meeting (PES). https://doi.org/10.1109/PES.2011.6039082
9. N. Li, L. Chen, M. Dahleh, Demand response using linear supply function bidding. IEEE Trans. Power Syst. 6(4), 1827–1838 (2015)
10. M. Ogura, V. Preciado, Stability of spreading processes over time-varying large-scale networks. IEEE Trans. Netw. Sci. Eng. 3(1), 44–57 (2016)
11. Z. Zhang, M. Chow, Convergence analysis of the incremental cost consensus algorithm under different communication network topologies in a smart grid. IEEE Trans. Power Syst. 27(4), 1761–1768 (2012)
12. S. Yang, S. Tan, J. Xu, Consensus based approach for economic dispatch problem in a smart grid. IEEE Trans. Power Syst. 28(4), 4416–4426 (2013)
13. S. Kar, G. Hug, Distributed robust economic dispatch in power systems: a consensus + innovations approach, in 2012 IEEE Power and Energy Society General Meeting. https://doi.org/10.1109/PESGM.2012.6345156
14. S. Xie, W. Zhong, K. Xie, R. Yu, Y. Zhang, Fair energy scheduling for vehicle-to-grid networks using adaptive dynamic programming. IEEE Trans. Neural Netw. Learn. Syst. 27(8), 1697–1707 (2016)
15. Y. Li, H. Zhang, X. Liang, B. Huang, Event-triggered based distributed cooperative energy management for multi-energy systems. IEEE Trans. Ind. Inform. 15(4), 2008–2022 (2019)
16. B. Huang, L. Liu, H. Zhang, Y. Li, Q. Sun, Distributed optimal economic dispatch for microgrids considering communication delays. IEEE Trans. Syst. Man Cybern. Syst. 49(8), 1634–1642 (2019)
17. G. Binetti, A. Davoudi, F. Lewis, D. Naso, B. Turchiano, Distributed consensus-based economic dispatch with transmission losses. IEEE Trans. Power Syst. 29(4), 1711–1720 (2014)
18. Z. Ni, S. Paul, A multistage game in smart grid security: a reinforcement learning solution. IEEE Trans. Neural Netw. Learn. Syst. 30(9), 2684–2695 (2019)
19. A. Nedic, A. Olshevsky, W. Shi, Improved convergence rates for distributed resource allocation (2017). arXiv preprint arXiv:1706.05441
20. T. Doan, C. Beck, Distributed Lagrangian methods for network resource allocation, in 2017 IEEE Conference on Control Technology and Applications (CCTA). https://doi.org/10.1109/CCTA.2017.8062536
21. Z. Tang, D. Hill, T. Liu, A novel consensus-based economic dispatch for microgrids. IEEE Trans. Smart Grid 9(4), 3920–3922 (2018)
22. T. Doan, A. Olshevsky, On the geometric convergence rate of distributed economic dispatch/demand response in power systems (2016). arXiv preprint arXiv:1609.06660
23. H. Pourbabak, J. Luo, T. Chen, W. Su, A novel consensus-based distributed algorithm for economic dispatch based on local estimation of power mismatch. IEEE Trans. Smart Grid 9(6), 5930–5942 (2018)
24. Q. Li, D. Gao, L. Cheng, F. Zhang, W. Yan, Fully distributed DC optimal power flow based on distributed economic dispatch and distributed state estimation (2019). arXiv preprint arXiv:1903.01128
25. Z. Deng, X. Nian, Distributed generalized Nash equilibrium seeking algorithm design for aggregative games over weight-balanced digraphs. IEEE Trans. Neural Netw. Learn. Syst. 30(3), 695–706 (2019)
26. T. Yang, J. Lu, D. Wu, J. Wu, G. Shi, Z. Meng, K. Johansson, A distributed algorithm for economic dispatch over time-varying directed networks with delays. IEEE Trans. Ind. Electron. 64(6), 5095–5106 (2017)
27. H. Li, Q. Lü, X. Liao, T. Huang, Accelerated convergence algorithm for distributed constrained optimization under time-varying general directed graphs. IEEE Trans. Syst. Man Cybern. Syst. 50(7), 2612–2622 (2020)
28. Q. Lü, H. Li, Z. Wang, Q. Han, W. Ge, Performing linear convergence for distributed constrained optimisation over time-varying directed unbalanced networks. IET Control Theory Appl. 13(17), 2800–2810 (2019)
29. C. Zhao, X. Duan, Y. Shi, Analysis of consensus-based economic dispatch algorithm under time delays. IEEE Trans. Syst. Man Cybern. Syst. 50(8), 2978–2988 (2020)
30. H. Li, Q. Lü, T. Huang, Convergence analysis of a distributed optimization algorithm with a general unbalanced directed communication network. IEEE Trans. Netw. Sci. Eng. 6(3), 237–248 (2019)
31. G. Qu, N. Li, Accelerated distributed Nesterov gradient descent. IEEE Trans. Autom. Control 65(6), 2566–2581 (2020)
32. R. Xin, U. Khan, Distributed heavy-ball: a generalization and acceleration of first-order methods with gradient tracking. IEEE Trans. Autom. Control 65(6), 2627–2633 (2020)
33. D. Jakovetic, J. Xavier, J. Moura, Fast distributed gradient methods. IEEE Trans. Autom. Control 59(5), 1131–1146 (2014)
34. S. Pu, W. Shi, J. Xu, A. Nedic, Push-pull gradient methods for distributed optimization in networks. IEEE Trans. Autom. Control 66(1), 1–16 (2021)
35. R. Xin, D. Jakovetic, U. Khan, Distributed Nesterov gradient methods over arbitrary graphs. IEEE Signal Process. Lett. 26(8), 1247–1251 (2018)
36. Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (Springer, Berlin, 2013)
37. A. Nedic, A. Olshevsky, W. Shi, C. Uribe, Geometrically convergent distributed optimization with uncoordinated step-sizes, in 2017 American Control Conference (ACC). https://doi.org/10.23919/ACC.2017.7963560
38. Q. Lü, H. Li, D. Xia, Geometrical convergence rate for distributed optimization with time-varying directed graphs and uncoordinated step-sizes. Inf. Sci. 422, 516–530 (2018)
39. D. Nunez, J. Cortes, Distributed online convex optimization over jointly connected digraphs. IEEE Trans. Netw. Sci. Eng. 1(1), 23–37 (2014)
40. D. Bertsekas, Nonlinear Programming, 2nd edn. (Athena Scientific, Cambridge, 1999)
41. R. Xin, C. Xi, U. Khan, FROST-Fast row-stochastic optimization with uncoordinated step-sizes. EURASIP J. Advanc. Signal Process. 2019(1), 1–14 (2019)
42. C. Xi, V. Mai, E. Abed, U. Khan, Linear convergence in optimization over directed graphs with row-stochastic matrices. IEEE Trans. Autom. Control 63(10), 3558–3565 (2018)
43. Y. Tian, Y. Sun, G. Scutari, Achieving linear convergence in distributed asynchronous multiagent optimization. IEEE Trans. Autom. Control 65(12), 5264–5279 (2020)
44. Z. Wang, H. Li, Edge-based stochastic gradient algorithm for distributed optimization. IEEE Trans. Netw. Sci. Eng. 7(3), 1421–1430 (2020)
45. R. Horn, C. Johnson, Matrix Analysis (Cambridge University Press, New York, 2013)
46. A. Nedic, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)
47. J. Xu, S. Zhu, Y. Soh, L. Xie, Convergence of asynchronous distributed gradient methods over stochastic networks. IEEE Trans. Autom. Control 63(2), 434–448 (2018)
48. T. Yang, Q. Lin, Z. Li, Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization (2016). arXiv preprint arXiv:1604.03257
49. Y. Fu, M. Shahidehpour, Z. Li, AC contingency dispatch based on security-constrained unit commitment. IEEE Trans. Power Syst. 21(2), 897–908 (2006)
Chapter 7
Primal–Dual Algorithms for Distributed Economic Dispatch
Abstract In this chapter, we study a distributed primal–dual gradient algorithm for a distributed economic dispatch problem in smart grids over a sequence of time-varying general directed networks, where each node has access only to its own local convex objective function and the node estimates are subject to a coupled linear constraint and individual box constraints. We assume that the sequence of communication networks between the nodes is uniformly strongly connected, and the algorithm uses a column-stochastic mixing matrix and a fixed step-size to drive all nodes asymptotically to a global, consistent optimal solution. Under standard assumptions of strong convexity and smoothness of the objective functions, we show that the algorithm drives the entire network to converge geometrically to an optimal solution of the convex optimization problem, provided that the step-size does not exceed some upper bound. We also give an explicit analysis of the convergence rate of the proposed optimization algorithm. Simulations of economic dispatch problems and demand response problems in power systems illustrate the effectiveness of the proposed optimization algorithm.
Keywords Distributed convex optimization · Multi-node systems · Time-varying directed networks · Primal–dual algorithm · Geometric convergence
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Q. Lü et al., Distributed Optimization in Networked Systems, Wireless Networks, https://doi.org/10.1007/978-981-19-8559-1_7
7.1 Introduction
In recent years, multi-node systems have attracted increasing attention in fields such as distributed sensor networks, multi-robot cooperation, UAV formation flight, and joint missile attack operations. Many difficulties still exist in the control and optimization of multi-node systems due to the complexity of node dynamics, network structure, and actual target tasks. As one of the most important research topics in the field of multi-node systems, the distributed optimization problem has attracted strong research interest from various scientific disciplines. This is mainly due to its broad application in engineering fields, including distributed formation control for resource allocation in peer-to-peer communication networks
[1], multiple autonomous vehicles [2], and distributed data fusion, information processing, and decision-making in wireless sensor networks [3–5], etc. The distributed optimization framework not only avoids the need to build long-distance communication systems or data fusion centers but also provides a better load balance for the network. In the existing literature, the (sub)gradient descent algorithm [6, 7], the primal–dual (sub)gradient algorithm [8], the fast (sub)gradient descent algorithm [9], and the (sub)gradient-push descent algorithm [10] have been proposed to solve distributed optimization problems. Nedic et al. [6] incorporated average consensus approaches into (sub)gradient methods to handle a multi-node unconstrained convex optimization problem. Theoretical analysis showed that the (sub)gradient descent algorithm in [6] converges at a rate of $O(1/\sqrt{t})$ for convex, Lipschitz, and possibly non-smooth objective functions. This coincides with the convergence rate of the centralized (sub)gradient descent algorithm. Then, Zhu et al. [8] devised two distributed primal–dual (sub)gradient algorithms to tackle a convex optimization problem where the nodes are restricted to a global inequality constraint, a global equality constraint, and a global constraint set. Nedic et al. [10] proposed a (sub)gradient-push descent algorithm that finds the exact optimal solution even without knowledge of the number of nodes or of the network sequence, under the assumption that the (sub)gradients of the objective functions are bounded. The (sub)gradient-push algorithm was proved to converge at a rate of $O(\ln t/\sqrt{t})$ with a diminishing step-size. This line of work has been extended to distributed optimization under various realistic conditions, such as stochastic (sub)gradient errors [11], directed [12] or random communication networks [13], linear scaling in network size [14], heterogeneous local constraints [15], and asynchrony and delays [16], to name just a few. Although these algorithms can solve different kinds of optimization problems, they are usually slow and still need a diminishing step-size to reach the optimal solution, even if the objective functions are differentiable and strongly convex [8–25]. Besides, the abovementioned algorithms all require the assumption of bounded (sub)gradients to reach the exact optimal solution, which is another drawback. Nonetheless, at the expense of inexact convergence (converging only to a neighborhood of the optimal solution), the methods described above can be accelerated to a rate of $O(1/t)$ by using a constant step-size. However, this is not the ultimate goal of solving the optimization problem. To address such issues, Xu et al. [24] studied a new augmented distributed gradient method (Aug-DGM) with uncoordinated constant step-sizes over a general linear time-invariant system by employing the so-called adapt-then-combine (ATC) scheme. Shi et al. [25] removed the steady-state error by introducing a difference structure into the distributed gradient descent algorithm, thereby proposing a novel distributed first-order algorithm (EXTRA). The algorithm achieves a convergence rate of $O(1/t)$ for convex objective functions and a linear rate of $O(C^{-t})$ ($C > 1$ is some constant) for strongly convex objective functions. Considering the large application domains of future smart grids, distributed algorithms [26, 27] are broadly employed to promote the energy optimization
efficiency in smart grids in recent years. He et al. [28] proposed two second-order continuous algorithms to exactly solve the economic power dispatch problem and showed that these algorithms converge faster than the first-order continuous-time algorithm. In addition, a push-sum consensus protocol was introduced in [29] to solve economic dispatch problems (EDPs) on fixed directed networks in smart grids. This line of work has been extended to a variety of realistic scenarios for distributed optimization. Li et al. [30] investigated a novel distributed event-triggered optimization algorithm to address the economic dispatch problems in smart grids. By running two consensus protocols in parallel, Binetti et al. [31] established a distributed consensus-based protocol for the optimization with transmission losses. The recent works [35, 36] are the most relevant to ours. Nedic et al. [35] concentrated on the analysis of distributed optimization problems with a coordinated step-size over time-varying undirected/directed networks. The algorithm in [35] drives the whole network to converge to a global and consensual minimizer at a geometric rate under the strong convexity and smoothness assumptions. Doan et al. [36] further took a coupling linear constraint and individual box constraints into consideration. Under the strong convexity and smoothness assumptions on the objective functions, Doan et al. conducted an explicit and detailed analysis of the convergence rate on an undirected network. It is noteworthy that [35] did not study the constrained optimization problem, while [36] could not provide a detailed analysis of the constrained optimization problem over general time-varying directed network topologies. Our work is also linked to a distributed optimization algorithm for network resource allocation [32]. However, in [32], the network was required to be undirected and the weight matrix doubly stochastic, which is quite restrictive in real applications. Other related works are [33, 34], where the demand response problems (DRPs) in power networks have been considered. Regrettably, these methods fail to update the Lagrangian multiplier in a distributed way. To sum up, when each node in the network is subject to certain constraints, distributed constrained optimization over time-varying general directed networks, the problem studied in this work, is largely open. Therefore, we develop and analyze a fully distributed primal–dual optimization algorithm for the problem with both a coupling linear constraint and individual box constraints. In particular, compared with the centralized approach, the proposed algorithm is more adaptable and practical due to its robustness to the variability of renewable resources and its flexibility with respect to the dynamic topology of networks. The main contributions of this chapter are the following four aspects:
(i) The push-sum protocol and a gradient tracking technique are incorporated into the distributed primal–dual optimization algorithm. It generalizes the work of [35], which neglected the constraints of each node in practical scenarios, and moreover, it allows a wider selection of step-sizes than most existing distributed optimization algorithms, such as [8–23].
(ii) We study a distributed constrained optimization problem over time-varying general directed networks, which is a main advantage over the work in [36]. Moreover, the underlying networks are only assumed to be uniformly and jointly strongly connected, which is considerably weaker than requiring each network to be strongly connected.
(iii) Unlike the centralized methods proposed in [33, 34], the proposed algorithm is fully distributed: each node estimates the Lagrangian multiplier based only on local interaction (in a distributed manner) and updates its primal variable using this estimate.
(iv) When the objective functions satisfy the strong convexity and smoothness assumptions, the proposed algorithm achieves a geometric convergence rate as long as the step-size is smaller than an explicit upper bound; no positive lower bound on the step-size is required. We also provide an explicit analysis of the convergence rate of the proposed algorithm.
7.2 Preliminaries
7.2.1 Notation
Unless otherwise stated, the vectors in this chapter are column vectors. Let $\mathbb{R}$, $\mathbb{R}^N$, and $\mathbb{R}^{N \times N}$ denote the set of real numbers, the set of $N$-dimensional real column vectors, and the set of $N \times N$ real matrices, respectively. Let $I_N$ denote the $N$-dimensional identity matrix. The symbol $\mathbf{1}$ denotes an all-ones column vector of appropriate dimension. Given a matrix $W$, $W^T$ and $W^{-1}$ (when $W$ is invertible) denote its transpose and inverse, respectively. The symbol $\langle \cdot, \cdot \rangle$ denotes the inner product of two vectors. For a vector $x \in \mathbb{R}^N$, $\bar{x} = (1/N)\mathbf{1}^T x$ denotes the average of its elements, and its consensus violation is written as $\breve{x} = x - (1/N)\mathbf{1}\mathbf{1}^T x = (I - J)x = \breve{J}x$, where $J = (1/N)\mathbf{1}\mathbf{1}^T$ and $\breve{J} = I - J$ are two symmetric matrices. Given a vector $x \in \mathbb{R}^N$, the standard Euclidean norm is $\|x\| = \sqrt{x^T x}$, and the infinity norm is denoted by $\|x\|_\infty$. Given a vector $x \in \mathbb{R}^N$, the $\breve{J}$-weighted (semi)norm is denoted by $\|x\|_{\breve{J}} = \sqrt{\langle x, \breve{J}x \rangle}$. Since $\breve{J} = \breve{J}^T \breve{J}$, we have $\|x\|_{\breve{J}} = \|\breve{J}x\|$. Let $\|W\|$ denote the spectral norm of a matrix $W$, and let $\nabla f(x)$ denote the (sub)gradient of $f$ at $x$. For a matrix $W$, we write $W_{ij}$ or $[W]_{ij}$ to denote its $(i, j)$th entry.
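The averaging notation can be made concrete in a few lines. The sketch below uses hypothetical helper names and plain Python lists in place of vectors; it computes the average of the entries and the consensus violation, and evaluates the weighted (semi)norm through the inner product so that it can be compared numerically with the Euclidean norm of the consensus violation.

```python
import math

def average(x):
    """(1/N) 1^T x: the average of the entries of x."""
    return sum(x) / len(x)

def consensus_violation(x):
    """x - (1/N) 1 1^T x: each entry minus the average of all entries."""
    m = average(x)
    return [xi - m for xi in x]

def weighted_seminorm(x):
    """sqrt(<x, (I - J) x>) with J = (1/N) 1 1^T."""
    inner = sum(xi * di for xi, di in zip(x, consensus_violation(x)))
    return math.sqrt(max(inner, 0.0))  # guard against tiny negative round-off

def euclidean_norm(x):
    return math.sqrt(sum(xi * xi for xi in x))
```

For x = [1, 2, 3] the consensus violation is [-1, 0, 1], and both `weighted_seminorm(x)` and `euclidean_norm(consensus_violation(x))` equal the square root of 2 up to rounding; a vector whose entries are all equal has zero consensus violation.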
7.2.2 Model of Optimization Problem
Consider a network of $N$ nodes labeled by $\mathcal{V} = \{1, 2, \ldots, N\}$, in which each node can only share information with its own neighbors via local communication. The goal of the multi-node group is to collectively solve the following distributed
economic dispatch problems in smart grids:
\[
\min_{x \in \mathbb{R}^N} f(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x_i), \quad \text{s.t.} \quad \frac{1}{N} \sum_{i=1}^{N} x_i = P, \quad \ell_i \le x_i \le u_i, \quad \forall i = 1, \ldots, N, \tag{7.1}
\]
where $x = [x_1, x_2, \ldots, x_N]^T$. For each node $i$, $f_i : \mathbb{R} \to \mathbb{R}$ is the local convex objective function. Assume that each node $i$ maintains a variable $x_i \in \mathbb{R}$ and knows the function $f_i$ privately. Throughout the chapter, we require that the average of all nodes’ variables equal a fixed positive constant, i.e., $(1/N) \sum_{i=1}^{N} x_i = P$, while each node variable is restricted to an interval, i.e., $\ell_i \le x_i \le u_i$, $i = 1, \ldots, N$. We denote the equality constraint in (7.1) as the coupling constraint $\mathcal{P}$, i.e., $\mathcal{P} = \{x \in \mathbb{R}^N \mid (1/N) \sum_{i=1}^{N} x_i = P\}$, and denote the individual constraints in (7.1) as the box constraints $\mathcal{X}_i$, i.e., $\mathcal{X}_i = \{\tilde{x} \in \mathbb{R} \mid \ell_i \le \tilde{x} \le u_i\}$, $\forall i = 1, \ldots, N$. We use $\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_N$ to denote the Cartesian product of the box constraints of all nodes. Moreover, we denote by $\mathcal{S} = \mathcal{X} \cap \mathcal{P}$ the feasible set of the optimization problem (7.1). In the chapter, we denote by $x^* = [x_1^*, \ldots, x_N^*]^T$ the optimal solution of problem (7.1).
Remark 7.1 (Economic Dispatch Problems) EDPs are studied over a network of $N$ generators. The goal of the generators is to minimize the total cost incurred, subject to individual capacity constraints, while finding the power generation that satisfies the average load of the network. Here we index the generators as $1, \ldots, N$. Based on the optimization model (7.1), we denote by $x_i$ the power generated at generator $i$. Upon generating $x_i$ units of power, generator $i$ incurs a convex cost denoted by $f_i(x_i)$. Thus, the average cost $f$ of the network is the average of the costs at the generators, i.e., $f(x) = (1/N) \sum_{i=1}^{N} f_i(x_i)$. The average power generated at the nodes must meet the average load of the network, represented by a positive constant $P$, i.e., $P = (1/N) \sum_{i=1}^{N} x_i$. In the EDPs, we focus not only on the power balance at individual generators but also on the balance of the entire network. Finally, we assume that each generator $i$ can only generate a limited amount of power, which is represented by the box constraints in (7.1), i.e., $\ell_i \le x_i \le u_i$, $\forall i = 1, \ldots, N$.
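To make model (7.1) concrete, the following sketch encodes a small EDP instance and checks feasibility of a candidate dispatch. It is only an illustration under stated assumptions: the quadratic cost a·x² + b·x, the class name `Generator`, and the helper names are hypothetical choices, not part of the chapter, which only requires convex local costs.

```python
from dataclasses import dataclass

@dataclass
class Generator:
    a: float   # quadratic cost coefficient (assumed > 0)
    b: float   # linear cost coefficient
    lo: float  # lower capacity bound
    hi: float  # upper capacity bound

    def cost(self, x):
        """Assumed local cost f_i(x) = a x^2 + b x."""
        return self.a * x * x + self.b * x

def average_cost(gens, x):
    """f(x) = (1/N) sum_i f_i(x_i)."""
    return sum(g.cost(xi) for g, xi in zip(gens, x)) / len(gens)

def is_feasible(gens, x, P, tol=1e-9):
    """Check the coupling constraint (1/N) sum_i x_i = P and every box constraint."""
    in_boxes = all(g.lo - tol <= xi <= g.hi + tol for g, xi in zip(gens, x))
    balanced = abs(sum(x) / len(x) - P) <= tol
    return in_boxes and balanced
```

For two generators with a = 1 and a = 2, b = 0, and capacity [0, 10], the dispatch x = [4, 2] meets an average load P = 3 with average cost 12, whereas x = [11, -5] also averages to 3 but violates both box constraints.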
7.2.3 Communication Network
In this chapter, we consider a group of $N$ nodes communicating over a general weighted and directed network $\mathcal{G}(k) = \{\mathcal{V}, \mathcal{E}(k), A(k)\}$ at time $k \ge 0$, where $\mathcal{V} = \{1, 2, \ldots, N\}$ is the set of nodes, with $i$ indicating the $i$th vertex, and $\mathcal{E}(k) \subseteq \mathcal{V} \times \mathcal{V}$ is the set of directed edges. The weighted adjacency matrix of $\mathcal{G}(k)$ is represented by $A(k) = [A_{ij}(k)] \in \mathbb{R}^{N \times N}$ with non-negative elements $A_{ij}(k)$. A directed edge $(j, i) \in \mathcal{E}(k)$ indicates that node $j$ can send information to node $i$ at time $k \ge 0$. The in-neighbor set and out-neighbor set of node $i$ at time $k$ are
denoted by $\mathcal{N}_i^{\text{in}}(k) = \{j \in \mathcal{V} \mid (j, i) \in \mathcal{E}(k)\}$ and $\mathcal{N}_i^{\text{out}}(k) = \{j \in \mathcal{V} \mid (i, j) \in \mathcal{E}(k)\}$, respectively. A directed path in the directed network $\mathcal{G}$ from node $j$ to node $i$ is a sequence of connected edges $(j, i_1), (i_1, i_2), \ldots, (i_m, i)$ with distinct nodes $i_k$, $k = 1, 2, \ldots, m$. A directed network is strongly connected if and only if for any two distinct nodes $j$ and $i$ in the set $\mathcal{V}$, there exists a directed path from node $j$ to node $i$. The following two standard assumptions about the communication network are adopted.
Assumption 7.1 ([6]) The time-varying general directed network sequence $\mathcal{G}(k)$ is $B_0$-strongly connected. Namely, there exists a positive integer $B_0$ such that the directed network with vertex set $\mathcal{V}$ and edge set $\mathcal{E}_{B_0}(k) = \bigcup_{s = kB_0}^{(k+1)B_0 - 1} \mathcal{E}(s)$ is strongly connected for any $k \ge 0$.
Remark 7.2 Under Assumption 7.1, through repeated communication with neighbors, all nodes can repeatedly interact with each other over the whole network sequence $\mathcal{G}(k)$. In particular, this assumption is considerably weaker than requiring each $\mathcal{G}(k)$ to be strongly connected for all $k \ge 0$.
Assumption 7.2 ([6]) For any $k = 0, 1, \ldots$, the mixing matrix $C(k) = [C_{ij}(k)] \in \mathbb{R}^{N \times N}$ is defined as
\[
C_{ij}(k) =
\begin{cases}
\dfrac{1}{d_j^{\text{out}}(k) + 1}, & j \in \mathcal{N}_i^{\text{in}}(k) \text{ or } j = i, \\
0, & \text{otherwise},
\end{cases}
\]
where $d_j^{\text{out}}(k) = |\mathcal{N}_j^{\text{out}}(k)|$ is the out-degree of node $j$ at time $k \ge 0$ ($\mathcal{N}_j^{\text{out}}(k) = \{i \in \mathcal{V} \mid (j, i) \in \mathcal{E}(k)\}$). It follows that $C(k)$ is a column-stochastic matrix, i.e., $\sum_{i=1}^{N} C_{ij}(k) = 1$ for any $j \in \mathcal{V}$ and $k \ge 0$.
Moreover, to facilitate the analysis of the optimization algorithm, we make the following two assumptions.
Assumption 7.3 ([36]) For each node $i = 1, \ldots, N$, its local objective function $f_i$ is differentiable everywhere and has Lipschitz continuous gradients. Namely, there exists a constant $\sigma_i \in (0, +\infty)$ such that $\|\nabla f_i(x) - \nabla f_i(y)\| \le \sigma_i \|x - y\|$ for any $x, y \in \mathbb{R}$, where we use $\hat{\sigma} = \min_i \{\sigma_i\}$ in the subsequent analysis.
Assumption 7.4 ([36]) For each node $i = 1, \ldots, N$, its objective function $f_i$ is proper, closed, and strongly convex with strong convexity parameter $\mu_i$. Namely, for any $x, y \in \mathbb{R}$, we have
\[
f_i(x) \ge f_i(y) + \langle \nabla f_i(y), x - y \rangle + \frac{\mu_i}{2} \|x - y\|^2,
\]
where $\mu_i \in (0, +\infty)$, and we employ $\hat{\mu} = \min_i \{\mu_i\}$ in the subsequent analysis.
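The two network assumptions can be illustrated as follows. This is a sketch under explicit assumptions, with hypothetical function names: each node j splits its weight equally among its out-neighbors and itself, giving every one of them the share 1/(d_j^out + 1), which is what makes each column of C(k) sum to one; joint connectivity over a window is checked on the union of the edge sets, with edges written as pairs (j, i) meaning that j can send to i.

```python
def mixing_matrix(n, edges):
    """Column-stochastic C with C[i][j] = 1/(d_j_out + 1) if j sends to i or j == i."""
    links = {(j, i) for (j, i) in edges if j != i}  # drop duplicates and self-loops
    out_degree = [0] * n
    for j, _ in links:
        out_degree[j] += 1
    C = [[0.0] * n for _ in range(n)]
    for j in range(n):
        C[j][j] = 1.0 / (out_degree[j] + 1)  # share node j keeps for itself
    for j, i in links:
        C[i][j] = 1.0 / (out_degree[j] + 1)  # share node j sends to out-neighbor i
    return C

def column_sums(C):
    n = len(C)
    return [sum(C[i][j] for i in range(n)) for j in range(n)]

def is_strongly_connected(n, edges):
    """True if node 0 reaches every node and every node reaches node 0."""
    forward = [[] for _ in range(n)]
    backward = [[] for _ in range(n)]
    for j, i in edges:
        forward[j].append(i)
        backward[i].append(j)

    def reaches_all(adjacency):
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for v in adjacency[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return len(seen) == n

    return reaches_all(forward) and reaches_all(backward)

def window_is_strongly_connected(n, edge_sets):
    """Strong connectivity of the union of the edge sets within one window."""
    return is_strongly_connected(n, set().union(*edge_sets))
```

For three nodes with edge sets {(0, 1)}, {(1, 2)}, {(2, 0)} at three consecutive times, no single network is strongly connected, yet the union over the window is. Note also that the matrix built here is column-stochastic but in general not row-stochastic, which is the reason the algorithm later needs the push-sum normalization.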
7.3 Algorithm Development
Building on the preliminaries above, we first analyze the dual problem of problem (7.1) and then design our main algorithm, namely the distributed primal–dual gradient algorithm, to solve the dual problem. At the end of this section, we give some lemmas that support the convergence analysis of the algorithm.
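7.3.1 Dual Problem
Consider the Lagrangian function $L : \mathcal{X} \times \mathbb{R} \to \mathbb{R}$ of problem (7.1) defined as
\[
L(x, w) = \frac{1}{N} \sum_{i=1}^{N} f_i(x_i) + w \left( \frac{1}{N} \sum_{i=1}^{N} x_i - P \right), \tag{7.2}
\]
where $w$ denotes the Lagrangian multiplier associated with the coupling constraint in (7.1). Given a proper, closed, and convex function $f : \mathcal{X} \to \mathbb{R}$, we let $f^*$ be its conjugate, i.e., $f^*(y) = \sup_{x \in \mathcal{X}} \{y^T x - f(x)\}$, $y \in \mathbb{R}^N$. For $w \in \mathbb{R}$, the dual function $d : \mathbb{R} \to \mathbb{R}$ of problem (7.1) can be written as
\[
\begin{aligned}
d(w) &= \min_{x \in \mathcal{X}} L(x, w) \\
&= \min_{x \in \mathcal{X}} \left\{ \frac{1}{N} \sum_{i=1}^{N} f_i(x_i) + w \left( \frac{1}{N} \sum_{i=1}^{N} x_i - P \right) \right\} \\
&= -wP + \frac{1}{N} \sum_{i=1}^{N} \left( - \sup_{x_i \in \mathcal{X}_i} \{ -w x_i - f_i(x_i) \} \right) \\
&= \frac{1}{N} \sum_{i=1}^{N} \left( -f_i^*(-w) - wP \right) \\
&= \frac{1}{N} \sum_{i=1}^{N} d_i(w),
\end{aligned}
\tag{7.3}
\]
where $d_i(w) = -f_i^*(-w) - wP$. The dual problem of (7.1), therefore, is given as
\[
\max_{w \in \mathbb{R}} d(w), \tag{7.4}
\]
where its gradient is obtained as
\[
\nabla d(w) = \frac{1}{N} \sum_{i=1}^{N} x_i - P. \tag{7.5}
\]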
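The dual function and its gradient can be explored numerically on a small instance. The sketch below is a centralized illustration only, not the distributed algorithm of this chapter: it assumes quadratic costs a·x² + b·x with a > 0, represents each generator by a hypothetical tuple (a, b, lo, hi), and exploits the fact that the dual gradient in (7.5) is non-increasing in w to locate its root by bisection; the bracket [-1000, 1000] is an arbitrary assumption.

```python
def local_minimizer(a, b, lo, hi, w):
    """argmin over [lo, hi] of a x^2 + b x + w x (a > 0): a clipped stationary point."""
    return min(max(-(b + w) / (2.0 * a), lo), hi)

def dual_gradient(gens, w, P):
    """(1/N) sum_i x_i(w) - P, with x_i(w) the local minimizer for multiplier w."""
    xs = [local_minimizer(a, b, lo, hi, w) for (a, b, lo, hi) in gens]
    return sum(xs) / len(xs) - P

def solve_dual_by_bisection(gens, P, w_lo=-1e3, w_hi=1e3, iters=100):
    """Find w with zero dual gradient; assumes the root lies in [w_lo, w_hi]."""
    for _ in range(iters):
        mid = 0.5 * (w_lo + w_hi)
        if dual_gradient(gens, mid, P) > 0.0:
            w_lo = mid  # generation still too high: increase the multiplier
        else:
            w_hi = mid
    w = 0.5 * (w_lo + w_hi)
    return w, [local_minimizer(a, b, lo, hi, w) for (a, b, lo, hi) in gens]
```

With generators (1, 0, 0, 10) and (2, 0, 0, 10) and P = 3, the stationarity conditions give w = -8 and x = [4, 2]. With P = 9 the first generator saturates at its upper bound, so the second must supply 8, which corresponds to w = -32. The bisection needs the states of all generators at once, which is exactly the centralized step that the distributed algorithm is designed to avoid.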
According to the analysis of the primal–dual gradient method in [36], the update of the Lagrangian multiplier depends on the information of every node, which prevents a distributed implementation. This shortcoming is reflected in [33, 34], where a centralized step is required to update $w$ and send it back to the nodes. The objective of this chapter is to solve the dual problem in a distributed manner, in which the nodes do not need to know the information of every other node but only share their own states with their neighbors at each step. Before giving the distributed primal–dual gradient algorithm, we conclude this part by considering some important properties of the dual function (7.3). The following three lemmas show some basic properties of the conjugate of a convex function $f$.
Lemma 7.3 ([37]) Let $f$ be a proper, closed, and convex function and denote by $f^*$ its conjugate. Then $f^*$ is also proper, closed, and convex. Moreover, $f^{**} = f$.
Lemma 7.4 ([38, 39]) For a proper, lower semi-continuous, convex function $f : \text{dom}(f) \to \mathbb{R}$ and a positive value $\chi$, the following two properties are equivalent: (a) $f^*$ is strongly convex with constant $\chi$; (b) $f$ is differentiable, and the gradient $\nabla f$ is Lipschitz continuous with constant $1/\chi$.
Built on the above two lemmas, the following lemma, which is needed in the analysis of the distributed primal–dual gradient algorithm, is obtained.
Lemma 7.5 ([36]) Let Assumptions 7.3 and 7.4 hold. Then the conjugate $f_i^*$ of $f_i$ satisfies: (i) $f_i^*$ is strongly convex with constant $1/\sigma_i$; (ii) $f_i^*$ is differentiable, and the gradient $\nabla f_i^*$ is Lipschitz continuous with constant $1/\mu_i$.
Denote the optimal solution of the Lagrangian dual problem by $w^*$. The following definition of a saddle point plays a critical role in the subsequent analysis.
Definition 7.6 (Saddle Point [8]) Consider the Lagrangian function $L : \mathcal{X} \times \mathbb{R} \to \mathbb{R}$. A pair of vectors $(x^*, \mathbf{1}w^*)$ is called a saddle point of $L(x, w)$ over the set $\mathcal{X} \times \mathbb{R}$ if $L(x^*, w) \le L(x^*, \mathbf{1}w^*) \le L(x, \mathbf{1}w^*)$ holds for all $x \in \mathcal{X}$ and $w \in \mathbb{R}$.
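As a sanity check on Lemma 7.5, the constants can be verified by hand for a quadratic cost. The worked example below is only illustrative: it assumes $f(x) = a x^2 + b x$ with $a > 0$ (so that $\mu = \sigma = 2a$ in Assumptions 7.3 and 7.4) and takes the conjugate over all of $\mathbb{R}$, ignoring the box constraint.

```latex
% Illustrative assumption: f(x) = a x^2 + b x on R with a > 0, hence mu = sigma = 2a.
\begin{align*}
f^*(y) &= \sup_{x \in \mathbb{R}} \{ yx - a x^2 - b x \} = \frac{(y - b)^2}{4a}
  \quad \text{(supremum attained at } x = \tfrac{y - b}{2a}\text{)}, \\
\nabla f^*(y) &= \frac{y - b}{2a}, \\
|\nabla f^*(y_1) - \nabla f^*(y_2)| &= \frac{1}{2a} |y_1 - y_2| = \frac{1}{\mu} |y_1 - y_2|, \\
(f^*)''(y) &= \frac{1}{2a} = \frac{1}{\sigma}.
\end{align*}
```

In this case $f^*$ is strongly convex with constant $1/\sigma$ and its gradient is Lipschitz continuous with constant $1/\mu$, in agreement with items (i) and (ii) of the lemma.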
7.3.2 Distributed Primal–Dual Gradient Algorithm

Since $w^*$ is the optimal solution of the dual problem (7.4), we can design a distributed algorithm to tackle the dual problem (7.4), which is equivalent to problem (7.1). Moreover, the dual problem (7.4) can be recast as the following minimization problem:

\[ \min_{w \in \mathbb{R}} q(w) = \frac{1}{N} \sum_{i=1}^{N} q_i(w), \tag{7.6} \]
where $q_i$ is defined as $q_i(w) = f_i^*(-w) + wP$. We therefore focus on the minimization problem (7.6), which has the same form as the problem studied in [35]. Note that problems (7.6) and (7.4) share the same set of solutions.

In our algorithm, each node $i$ maintains five variables $x_i(k)$, $\lambda_i(k)$, $v_i(k)$, $w_i(k)$, and $\beta_i(k)$ at each time instant $k = 0, 1, \ldots$, where $\lambda_i(k)$, $v_i(k)$, $\beta_i(k)$ are three auxiliary variables and $x_i(k)$, $w_i(k)$ are the two key variables used to estimate the optimal solutions $x_i^*$ and $w^*$, respectively. Here, recall that $x^* = [x_1^*, \ldots, x_N^*]^T$ is the primal optimal solution of (7.1) and $w^*$ is the dual optimal solution of (7.6). The algorithm is initialized with $x_i(0) \in X_i$, $w_i(0) = \lambda_i(0) \in \mathbb{R}$, $\beta_i(0) = \nabla q_i(w_i(0)) \in \mathbb{R}$, and $v_i(0) = 1$ for all $i = 1, \ldots, N$. Then, at each time instant $k \ge 0$, each node $i$ updates according to the following rules:

\[
\begin{cases}
x_i(k+1) \in \arg\min_{x_i \in X_i} \{ f_i(x_i) + w_i(k) x_i \} \\
\lambda_i(k+1) = \sum_{j \in N_i^{in}(k)} C_{ij}(k) \left( \lambda_j(k) - \alpha \beta_j(k) \right) \\
v_i(k+1) = \sum_{j \in N_i^{in}(k)} C_{ij}(k) v_j(k) \\
w_i(k+1) = \lambda_i(k+1) / v_i(k+1) \\
\beta_i(k+1) = \sum_{j \in N_i^{in}(k)} C_{ij}(k) \beta_j(k) + \left( \nabla q_i(w_i(k+1)) - \nabla q_i(w_i(k)) \right)
\end{cases}
\tag{7.7}
\]
where $\alpha > 0$ is the step-size and $\nabla q_i(w_i(k))$ is the gradient of node $i$'s objective function $q_i(w)$ at $w = w_i(k)$. From the definition of $q_i(w)$, we obtain

\[ \nabla q_i(w_i(k)) = -x_i(k) + P. \tag{7.8} \]
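The updates in (7.7) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the costs are the quadratics $f_i(x) = c_i x^2 + d_i x$ on $[0, P_i^{max}]$ with the coefficients of Table 7.1, the two directed graphs, the uniform column-stochastic weights, the step-size, and the iteration count are hypothetical choices, and the gradient of $q_i$ at $w$ is evaluated as $P$ minus the minimizer of $f_i(x) + wx$ over $X_i$.

```python
# Sketch of the updates (7.7) for quadratic costs f_i(x) = c_i*x**2 + d_i*x on X_i = [0, pmax_i].
# Graphs, weights, step-size, and iteration count are illustrative assumptions.

C_COEF = [0.04, 0.03, 0.035, 0.03, 0.04]
D_COEF = [2.0, 3.0, 4.0, 4.0, 2.5]
P_MAX = [80.0, 90.0, 70.0, 70.0, 80.0]
P_AVG = 60.0          # average demand P
N = len(C_COEF)

def best_response(i, w):
    # argmin over [0, pmax_i] of f_i(x) + w*x (clipped stationary point).
    x = -(D_COEF[i] + w) / (2.0 * C_COEF[i])
    return min(max(x, 0.0), P_MAX[i])

def grad_q(i, w):
    # Gradient of q_i(w) = f_i^*(-w) + w*P, i.e. P minus the minimizer above.
    return P_AVG - best_response(i, w)

def column_stochastic(out_neighbors):
    # C[i][j] > 0 iff j sends to i (or i == j); each column j sums to one.
    C = [[0.0] * N for _ in range(N)]
    for j, outs in enumerate(out_neighbors):
        targets = set(outs) | {j}
        for i in targets:
            C[i][j] = 1.0 / len(targets)
    return C

# Two strongly connected directed graphs used alternately (hypothetical topology).
GRAPH_A = [[1, 2], [2], [3], [4], [0]]
GRAPH_B = [[2], [3], [4], [0, 4], [1]]
MATRICES = [column_stochastic(GRAPH_A), column_stochastic(GRAPH_B)]

def matvec(C, u):
    return [sum(C[i][j] * u[j] for j in range(N)) for i in range(N)]

def run(alpha=0.001, steps=10000):
    w = [0.0] * N
    lam = list(w)                       # lambda_i(0) = w_i(0)
    v = [1.0] * N                       # v_i(0) = 1
    beta = [grad_q(i, w[i]) for i in range(N)]
    for k in range(steps):
        C = MATRICES[k % 2]
        lam = matvec(C, [lam[j] - alpha * beta[j] for j in range(N)])
        v = matvec(C, v)
        w_new = [lam[i] / v[i] for i in range(N)]
        mixed = matvec(C, beta)
        beta = [mixed[i] + grad_q(i, w_new[i]) - grad_q(i, w[i]) for i in range(N)]
        w = w_new
    x = [best_response(i, w[i]) for i in range(N)]
    return x, w, v, beta

x, w, v, beta = run()
print([round(xi, 2) for xi in x], [round(wi, 4) for wi in w])
```

The step-size is deliberately small relative to the largest gradient Lipschitz constant $1/(2c_i)$ of the $q_i$; larger values may converge faster but are not guaranteed to be stable on these graphs. Because the weight matrices are only column-stochastic, the ratio $\lambda_i/v_i$ rather than $\lambda_i$ itself is used as the multiplier estimate.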
Let $x(k) = [x_1(k), \ldots, x_N(k)]^T$, $w(k) = [w_1(k), \ldots, w_N(k)]^T$, $\lambda(k) = [\lambda_1(k), \ldots, \lambda_N(k)]^T$, $\beta(k) = [\beta_1(k), \ldots, \beta_N(k)]^T$, $v(k) = [v_1(k), \ldots, v_N(k)]^T$, $h(k) = [h_1(k), \ldots, h_N(k)]^T$, and $\nabla q(w(k)) = [\nabla q_1(w_1(k)), \ldots, \nabla q_N(w_N(k))]^T$, all in $\mathbb{R}^N$. With formula (7.8), the algorithm (7.7) can be rewritten, after a simple algebraic transformation, in the compact matrix form

\[
\begin{cases}
x(k+1) \in \arg\min_{x \in X} \sum_{i=1}^{N} \left( f_i(x_i) + w_i(k) x_i \right) \\
v(k+1) = C(k) v(k) \\
V(k+1) = \mathrm{diag}\{v(k+1)\} \\
w(k+1) = R(k) \left( w(k) - \alpha h(k) \right) \\
h(k+1) = R(k) h(k) + (V(k+1))^{-1} \left( \nabla q(w(k+1)) - \nabla q(w(k)) \right)
\end{cases}
\tag{7.9}
\]

where $R(k) = (V(k+1))^{-1} C(k) V(k)$ and $h(k) = (V(k))^{-1} \beta(k)$. Note that, under Assumptions 7.1 and 7.2, each matrix $V(k)$ is invertible, and we denote $\|V^{-1}\|_{\max} = \sup_{k \ge 0} \|V^{-1}(k)\|$, which is bounded. Moreover, $R(k)$ is a row-stochastic matrix (see Lemma 4 of [10]). In what follows, we use the notation $C_B(k) = C(k) C(k-1) \cdots C(k+1-B)$ for any $k = 0, 1, \ldots$ and $B = 0, 1, \ldots$ with $B \le k + 1$, together with the conventions $C_0(k) = I$ for any $k$ and $C_B(k) = I$ for any needed $k < 0$. An important property of the norm of $(I - (1/N)\mathbf{1}\mathbf{1}^T) R_B(k)$ is given in the following lemma, which follows from the properties of the distributed primal–dual gradient algorithm and can be derived from [33, 34].

Lemma 7.7 ([35]) Let Assumptions 7.1 and 7.2 hold, and let $B$ be an integer satisfying $B \ge B_0$. Then, for any $k = B - 1, B, \ldots$ and any vectors $\phi$ and $\varphi$ of appropriate dimensions, if $\phi = R_B(k)\varphi$, we have $\|\breve{\phi}\| \le \delta \|\breve{\varphi}\|$, where

\[
R_B(k) = (V(k+1))^{-1} C_B(k) V(k+1-B), \quad
\delta = Q_1 \left( 1 - \tau^{N B_0} \right)^{\frac{B-1}{N B_0}} < 1, \quad
Q_1 = 2N \frac{1 + \tau^{-N B_0}}{1 - \tau^{N B_0}}, \quad
\tau = \frac{1}{N^{2 + N B_0}}.
\]
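The row-stochasticity of $R(k) = (V(k+1))^{-1} C(k) V(k)$ mentioned before Lemma 7.7 can be checked directly: the $i$-th row sum is $\sum_j C_{ij}(k) v_j(k) / v_i(k+1)$, which equals one by the definition of $v(k+1)$. The short Python sketch below illustrates this on a hypothetical $4 \times 4$ column-stochastic matrix; the matrix entries and the function name `push_sum_row_matrix` are assumptions made for this example.

```python
# Check that R = V(k+1)^{-1} C V(k) has unit row sums when C is column-stochastic.
# The matrix C and the weights v below are illustrative assumptions.

def push_sum_row_matrix(C, v):
    n = len(v)
    v_next = [sum(C[i][j] * v[j] for j in range(n)) for i in range(n)]
    R = [[C[i][j] * v[j] / v_next[i] for j in range(n)] for i in range(n)]
    return R, v_next

C = [
    [0.5, 0.0, 0.0, 1.0 / 3.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.5, 1.0 / 3.0],
    [0.0, 0.0, 0.5, 1.0 / 3.0],
]
v = [1.0, 1.0, 1.0, 1.0]
R, v_next = push_sum_row_matrix(C, v)
row_sums = [sum(row) for row in R]
col_sums_C = [sum(C[i][j] for i in range(4)) for j in range(4)]
print(row_sums, col_sums_C, v_next)
```

Note that $C$ itself is not row-stochastic in this example (its row sums are the entries of `v_next`), which is exactly the imbalance that the scaling by $V(k)$ and $V(k+1)$ compensates for.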
7.4 Convergence Analysis

In this section, we first introduce the small gain theorem [35], followed by some supporting lemmas. Then, we present the main results of this chapter.

7.4.1 Small Gain Theorem

To establish the geometric convergence of the distributed primal–dual gradient algorithm, we first present a preliminary tool, namely, the small gain theorem. Before stating it, we introduce some notation. Denote by $s_i$ the infinite sequence $s_i = (s_i(0), s_i(1), s_i(2), \ldots)$, where $s_i(k) \in \mathbb{R}^N$ for all $i$. Furthermore, for any positive integer $K$, define

\[ \|s_i\|_{\gamma,K} = \max_{k=0,\ldots,K} \frac{1}{\gamma^k} \|s_i(k)\|, \tag{7.10} \]

\[ \|s_i\|_{\gamma} = \sup_{k \ge 0} \frac{1}{\gamma^k} \|s_i(k)\|, \tag{7.11} \]

where the parameter $\gamma \in (0, 1)$ will serve as the geometric convergence parameter in our later analysis. Based on the above definitions, the geometric convergence analysis of the sequence $\{\|w(k) - \mathbf{1}w^*\|\}$ is mainly built on the small gain theorem stated next.
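To make the weighted norm in (7.10) concrete, the following Python sketch evaluates it for a scalar sequence that decays like $0.5^k$. The sequence, the two values of $\gamma$, and the function name `weighted_norm` are assumptions chosen for illustration. The norm stays bounded when $\gamma$ is at least the true decay rate and grows with the horizon otherwise, which is the mechanism behind reading off a geometric rate from a bounded $\|s_i\|_{\gamma}$.

```python
# Weighted norm (7.10) for a scalar sequence; the sequence and rates are illustrative assumptions.

def weighted_norm(seq, gamma):
    # max over k = 0..K of |s(k)| / gamma**k, where K = len(seq) - 1
    return max(abs(s) / gamma ** k for k, s in enumerate(seq))

seq = [3.0 * 0.5 ** k for k in range(60)]   # decays like 0.5**k
bounded = weighted_norm(seq, 0.6)           # gamma above the decay rate
growing = weighted_norm(seq, 0.4)           # gamma below the decay rate
print(bounded, growing)
```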
Lemma 7.8 (The Small Gain Theorem [35]) Suppose that $s_1, \ldots, s_m$ are sequences such that for all $K > 0$ and each $i = 1, \ldots, m$,

\[ \|s_{(i \bmod m) + 1}\|_{\gamma,K} \le \eta_i \|s_i\|_{\gamma,K} + \omega_i, \tag{7.12} \]

where $\omega_i$ is some constant, and the gains (non-negative constants) $\eta_1, \ldots, \eta_m$ satisfy

\[ \eta_1 \eta_2 \cdots \eta_m < 1. \tag{7.13} \]

Then, we obtain

\[ \|s_1\|_{\gamma} \le \frac{1}{1 - \eta_1 \eta_2 \cdots \eta_m} \left( \eta_m \eta_{m-1} \cdots \eta_2 \omega_1 + \cdots + \eta_m \omega_{m-1} + \omega_m \right). \tag{7.14} \]
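The mechanism behind the bound (7.14) is easiest to see for $m = 2$. The short derivation below is a sketch that assumes the hypotheses take the cyclic form $\|s_2\|_{\gamma,K} \le \eta_1 \|s_1\|_{\gamma,K} + \omega_1$ and $\|s_1\|_{\gamma,K} \le \eta_2 \|s_2\|_{\gamma,K} + \omega_2$, which is the form used in items (i)–(iv) of the proof of Theorem 7.18.

```latex
\begin{align*}
\|s_1\|_{\gamma,K}
  &\le \eta_2 \|s_2\|_{\gamma,K} + \omega_2 \\
  &\le \eta_2 \left( \eta_1 \|s_1\|_{\gamma,K} + \omega_1 \right) + \omega_2 \\
  &=   \eta_1 \eta_2 \|s_1\|_{\gamma,K} + \eta_2 \omega_1 + \omega_2 .
\end{align*}
% Since \eta_1 \eta_2 < 1 and \|s_1\|_{\gamma,K} is finite for every finite K,
% the first term can be moved to the left-hand side:
\[
\|s_1\|_{\gamma,K} \le \frac{\eta_2 \omega_1 + \omega_2}{1 - \eta_1 \eta_2}
\quad \text{for all } K,
\qquad \text{hence} \qquad
\|s_1\|_{\gamma} \le \frac{\eta_2 \omega_1 + \omega_2}{1 - \eta_1 \eta_2}.
\]
```

The right-hand side does not depend on $K$, so taking the supremum over $K$ gives the bound on $\|s_1\|_{\gamma}$; this agrees with (7.14) for $m = 2$.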
Remark 7.9 The original version of the small gain theorem has been extensively studied and widely used in control theory [40]. Moreover, since the small gain theorem involves a cyclic structure $s_1 \to s_2 \to \cdots \to s_m \to s_1$, one can obtain similar bounds for $\|s_i\|_{\gamma}$ for all $i$.

Lemma 7.10 ([35]) For any matrix sequence $s_i$ and a positive constant $\gamma \in (0, 1)$, if $\|s_i\|_{\gamma}$ is bounded, then $\|s_i(k)\|$ converges to 0 at a geometric rate $O(\gamma^k)$.

Before carrying out the main proof, we define the following additional symbols, which are used frequently in the subsequent analysis:

\[ y(k) = w(k) - \mathbf{1}w^* \quad \text{for any } k = 0, 1, \ldots, \tag{7.15} \]

\[ z(k) = \nabla q(w(k)) - \nabla q(w(k-1)) \quad \text{for any } k = 1, 2, \ldots, \tag{7.16} \]

where $w^* \in \mathbb{R}$ is the optimal solution of problem (7.6), with the initialization $z(0) = 0$. In view of the small gain theorem, the geometric convergence of $\|w(k) - \mathbf{1}w^*\|$ will be established by applying Lemma 7.8 to the following circle of arrows:

\[ y \xrightarrow{4} z \xrightarrow{3} \breve{h} \xrightarrow{2} \breve{w} \xrightarrow{1} y. \tag{7.17} \]
Remark 7.11 Recall that $y$ is the difference between the estimate and the global optimizer of the Lagrangian multiplier $w$, $z$ is the successive difference of gradients, $\breve{h}$ is the consensus violation of the estimate of the gradient average across nodes, and $\breve{w}$ is the consensus violation of the Lagrangian multiplier. In a sense, as long as $y$ is small, the error $z$ is small since the gradients are close to zero in the vicinity of the optimal Lagrangian multiplier. Then, as long as $z$ is small, $\breve{h}$ is small by the structure of the algorithm (7.9). Furthermore, as long as $\breve{h}$ is small, the structure of algorithm (7.9) implies that $\breve{w}$ is close to zero. Finally, as long as $\breve{w}$ is close to zero, the algorithm drives $y$ to zero and thus completes the whole cycle.
Remark 7.12 After each arrow is established, we will apply Lemma 7.8 to conclude our main results. Specifically, we need to be aware of the prerequisite that the quantities $\|y\|_{\gamma,K}$, $\|z\|_{\gamma,K}$, $\|\breve{h}\|_{\gamma,K}$, and $\|\breve{w}\|_{\gamma,K}$ are proven to be bounded. We can then conclude that all quantities in the above circle of arrows converge at a geometric rate $O(\gamma^k)$. In addition, in order to apply the small gain theorem in the following analysis, we require that the product of the gains $\eta_i$ be less than one, which is achieved by choosing an appropriate step-size $\alpha$. We are now ready to establish each arrow in the circle (7.17). The following series of lemmas is mainly based on [35].
7.4.2 Supporting Lemmas

Before introducing Lemma 7.13, we make some definitions that are used only for this lemma and are distinct from the notation used in our distributed optimization problem, algorithm, and analysis. We restate problem (7.6) with different notation as

\[ \min_{p \in \mathbb{R}^N} g(p) = \frac{1}{N} \sum_{i=1}^{N} g_i(p), \tag{7.18} \]

where each function $g_i$ satisfies Assumptions 7.3–7.4. Consider the following inexact gradient descent on the function $g$:

\[ p_{k+1} = p_k - \theta \frac{1}{N} \sum_{i=1}^{N} \nabla g_i(s_i(k)), \tag{7.19} \]

where $\theta$ is the step-size. Let $p^*$ be the global optimal solution of $g$, and define

\[ r_k = \|p_k - p^*\| \quad \text{for any } k = 0, 1, \ldots \tag{7.20} \]
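The inexact gradient descent (7.19) can be illustrated with a small Python sketch. All data here are assumptions made for the example: scalar quadratics $g_i(p) = \frac{1}{2} a_i (p - b_i)^2$, evaluation points $s_i(k)$ that deviate from $p_k$ by $\pm\rho^k$, and a fixed step-size $\theta$. When the evaluation errors decay geometrically, the optimality gap $r_k$ does so as well, which is the behavior quantified by the lemma that follows.

```python
# Inexact gradient descent (7.19) on g(p) = (1/N) * sum_i 0.5*a_i*(p - b_i)**2 (scalar p).
# The data a_i, b_i, the perturbation model, and the step-size are illustrative assumptions.

A = [1.0, 2.0, 3.0, 4.0]
B = [0.0, 1.0, 2.0, 3.0]
N = len(A)
P_STAR = sum(a * b for a, b in zip(A, B)) / sum(A)   # minimizer of g

def inexact_gradient_descent(theta=0.2, rho=0.8, steps=200, p0=10.0):
    p = p0
    errors = [abs(p - P_STAR)]
    for k in range(steps):
        # Each gradient is evaluated at a perturbed point s_i(k) with |s_i(k) - p_k| = rho**k.
        s = [p + ((-1) ** i) * rho ** k for i in range(N)]
        grad = sum(A[i] * (s[i] - B[i]) for i in range(N)) / N
        p = p - theta * grad
        errors.append(abs(p - P_STAR))
    return p, errors

p_final, errors = inexact_gradient_descent()
print(p_final, P_STAR, errors[-1])
```

In the distributed algorithm the role of $s_i(k)$ is played by the local multiplier $w_i(k)$ and the role of $p_k$ by the network average, so the evaluation error is exactly the consensus violation.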
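On the basis of the above definitions, we introduce Lemma 7.13.

Lemma 7.13 ([35]) Suppose that $\sqrt{1 - \frac{\theta \tilde{\sigma} \vartheta}{\vartheta + 1}} \le \gamma < 1$ and $0 < \theta \le \min\left\{ \frac{\vartheta + 1}{\tilde{\sigma} \vartheta}, \frac{1}{\tilde{\mu}(1 + \upsilon)} \right\}$, where $\vartheta > 0$, $\upsilon > 0$, $\tilde{\sigma} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\sigma_i}$, and $\tilde{\mu} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mu_i}$. Let Assumptions 7.3–7.4 hold for every function $g_i$. For problem (7.18), let the sequences $\{r_k\}$ and $\{p_k\}$ be generated by the inexact gradient descent algorithm (7.19). Then, for $K = 0, 1, \ldots$, we have

\[ |r|_{\gamma,K} \le 2 r_0 + (\gamma \sqrt{N})^{-1} \sqrt{ \frac{1 + \upsilon}{\upsilon \hat{\mu} \tilde{\sigma}} + \frac{\vartheta}{\hat{\sigma} \tilde{\sigma}} } \sum_{i=1}^{N} \|p - s_i\|_{\gamma,K}, \tag{7.21} \]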
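where we recall that $\hat{\sigma} = \min_i \{\sigma_i\}$ and $\hat{\mu} = \min_i \{\mu_i\}$. In the following, we establish the first arrow of the circle (7.17), which is grounded on the error bound of the inexact gradient descent (IGD) algorithm in Lemma 7.13.

Lemma 7.14 ($\|\breve{w}\|_{\gamma,K} \to \|y\|_{\gamma,K}$ [35]) Let Assumptions 7.1, 7.3–7.4 hold. In addition, suppose that the step-size $\alpha$ and the parameter $\gamma$ satisfy

\[ \sqrt{1 - \frac{\alpha \tilde{\sigma} \vartheta}{\vartheta + 1}} \le \gamma < 1, \qquad 0 < \alpha \le \min\left\{ \frac{\vartheta + 1}{\tilde{\sigma} \vartheta}, \frac{1}{\tilde{\mu}(1 + \upsilon)} \right\}, \]

where $\vartheta > 0$, $\upsilon > 0$ are some adjustable parameters, $\tilde{\sigma} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\sigma_i}$, and $\tilde{\mu} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mu_i}$. Then, for all $K = 0, 1, \ldots$, we have that

\[ \|y\|_{\gamma,K} \le (1 + \sqrt{N}) \left( 1 + \frac{\sqrt{N}}{\gamma} \sqrt{ \frac{1 + \upsilon}{\upsilon \hat{\mu} \tilde{\sigma}} + \frac{\vartheta}{\hat{\sigma} \tilde{\sigma}} } \right) \|\breve{w}\|_{\gamma,K} + 2 \sqrt{N} \|\bar{w}(0) - w^*\|. \tag{7.22} \]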
Next, to establish the second arrow in the circle (7.17), we give the following Lemma 7.15.

Lemma 7.15 ($\|\breve{h}\|_{\gamma,K} \to \|\breve{w}\|_{\gamma,K}$ [35]) Let Assumptions 7.1–7.2 hold, and let $\gamma$ be a positive constant in $(\sqrt[B]{\delta}, 1)$, where $\delta$ and $B$ are the constants given in Lemma 7.7. Then, we get

\[ \|\breve{w}\|_{\gamma,K} \le \frac{\alpha}{\gamma^B - \delta} \left( \delta + Q_1 \frac{\gamma - \gamma^B}{1 - \gamma} \right) \|\breve{h}\|_{\gamma,K} + \frac{\gamma^B}{\gamma^B - \delta} \sum_{t=1}^{B} \gamma^{-(t-1)} \|\breve{w}(t-1)\|, \tag{7.23} \]

for all $K = 0, 1, \ldots$, where $Q_1$ is the constant defined in Lemma 7.7.

Now we present the third arrow of circle (7.17) in the following lemma.

Lemma 7.16 ($\|z\|_{\gamma,K} \to \|\breve{h}\|_{\gamma,K}$ [35]) Let Assumptions 7.1–7.4 hold, let the parameter $\delta$ be given in Lemma 7.7, and let $\gamma$ be a positive constant in $(\sqrt[B]{\delta}, 1)$. Then, we obtain for all $K = 0, 1, \ldots$ that

\[ \|\breve{h}\|_{\gamma,K} \le Q_1 \|V^{-1}\|_{\max} \frac{\gamma (1 - \gamma^B)}{(\gamma^B - \delta)(1 - \gamma)} \|z\|_{\gamma,K} + \frac{\gamma^B}{\gamma^B - \delta} \sum_{t=1}^{B} \gamma^{-(t-1)} \|\breve{h}(t-1)\|. \tag{7.24} \]
The last arrow in the circle (7.17), established in the following lemma, is a simple consequence of the fact that the gradient of $q$ is Lipschitz continuous with parameter $1/\hat{\mu}$.

Lemma 7.17 ($\|y\|_{\gamma,K} \to \|z\|_{\gamma,K}$ [35]) Under Assumption 7.3, we obtain that for all $K = 0, 1, \ldots$ and any $0 < \gamma < 1$,

\[ \|z\|_{\gamma,K} \le \frac{\gamma + 1}{\gamma \hat{\mu}} \|y\|_{\gamma,K}. \]
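The constant $(\gamma + 1)/(\gamma \hat{\mu})$ in Lemma 7.17 can be recovered in a few lines. The derivation below is a sketch that uses only the definitions (7.15) and (7.16), the convention $z(0) = 0$, and the $1/\hat{\mu}$-Lipschitz continuity of $\nabla q$; it is offered as a reading aid rather than as the proof given in [35].

```latex
% For k >= 1, use w(k) - w(k-1) = y(k) - y(k-1) and the triangle inequality:
\begin{align*}
\|z(k)\|
  &= \|\nabla q(w(k)) - \nabla q(w(k-1))\|
   \le \frac{1}{\hat{\mu}} \|w(k) - w(k-1)\|
   \le \frac{1}{\hat{\mu}} \left( \|y(k)\| + \|y(k-1)\| \right).
\end{align*}
% Multiply by \gamma^{-k} and bound each weighted term by \|y\|_{\gamma,K}:
\begin{align*}
\gamma^{-k} \|z(k)\|
  &\le \frac{1}{\hat{\mu}} \left( \gamma^{-k} \|y(k)\|
       + \gamma^{-1} \gamma^{-(k-1)} \|y(k-1)\| \right)
   \le \frac{1}{\hat{\mu}} \left( 1 + \frac{1}{\gamma} \right) \|y\|_{\gamma,K}
   = \frac{\gamma + 1}{\gamma \hat{\mu}} \|y\|_{\gamma,K}.
\end{align*}
% The case k = 0 is trivial because z(0) = 0; taking the maximum over k = 0, ..., K
% yields the stated bound on \|z\|_{\gamma,K}.
```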
7.4.3 Main Results

Based on the circle (7.17) established in the previous section, we now present the main result on the geometric convergence of $(x(k), w(k))$ to a saddle point $(x^*, \mathbf{1}w^*)$ of the Lagrangian function $L$ for the distributed primal–dual gradient algorithm over time-varying general directed networks. In what follows, we first prove, with the help of the small gain theorem, that the sequence $\{w(k)\}$ updated by the distributed primal–dual gradient algorithm (7.9) converges to $\mathbf{1}w^*$ at a global geometric rate $O(\gamma^k)$. Then, on the basis of this result, we prove that the sequence $\{x(k)\}$ converges to $x^*$ at a global geometric rate $O((\gamma/2)^k)$. Moreover, an explicit convergence rate $\gamma$ for the distributed primal–dual gradient algorithm is given along the way.

Theorem 7.18 Let Assumptions 7.1–7.4 and Lemmas 7.3–7.17 hold. Let $B$ be a large enough integer constant such that $\delta = Q_1 (1 - \tau^{N B_0})^{\frac{B-1}{N B_0}} < 1$. Then, for any step-size $\alpha \in \left( 0, \frac{1.5 (1 - \delta)^2}{\tilde{\sigma} J} \right]$, the sequence $\{w(k)\}$ generated by the distributed primal–dual gradient algorithm converges to $\mathbf{1}w^*$ at a global geometric rate $O(\gamma^k)$, where $\gamma \in (0, 1)$ is given by

\[ \gamma = \sqrt[2B]{1 - \frac{\alpha \tilde{\sigma}}{1.5}} \quad \text{if } \alpha \in \left( 0, \frac{1.5 \left( \sqrt{J^2 + (1 - \delta^2) J} - \delta J \right)^2}{\tilde{\sigma} J (1 + J)^2} \right], \]

and

\[ \gamma = \sqrt[B]{\delta + \sqrt{\frac{\alpha \tilde{\sigma} J}{1.5}}} \quad \text{if } \alpha \in \left( \frac{1.5 \left( \sqrt{J^2 + (1 - \delta^2) J} - \delta J \right)^2}{\tilde{\sigma} J (1 + J)^2}, \frac{1.5 (1 - \delta)^2}{\tilde{\sigma} J} \right], \]

where $J = 3 Q_1 \|V^{-1}\|_{\max} \kappa B (\delta + Q_1 (B - 1)) (1 + \sqrt{N}) (1 + 4 \sqrt{N} \sqrt{\kappa})$ and $\kappa = 1/(\tilde{\sigma} \hat{\mu})$.

Proof It follows immediately from Lemmas 7.13–7.17 that:

(i) $\|y\|_{\gamma,K} \le \eta_1 \|\breve{w}\|_{\gamma,K} + \omega_1$, where $\eta_1 = (1 + \sqrt{N}) \left( 1 + \frac{\sqrt{N}}{\gamma} \sqrt{ \frac{1 + \upsilon}{\upsilon \hat{\mu} \tilde{\sigma}} + \frac{\vartheta}{\hat{\sigma} \tilde{\sigma}} } \right)$ and $\omega_1 = 2 \sqrt{N} \|\bar{w}(0) - w^*\|$.
(ii) $\|\breve{w}\|_{\gamma,K} \le \eta_2 \|\breve{h}\|_{\gamma,K} + \omega_2$, where $\eta_2 = \frac{\alpha}{\gamma^B - \delta} \left( \delta + Q_1 \frac{\gamma - \gamma^B}{1 - \gamma} \right)$ and $\omega_2 = \frac{\gamma^B}{\gamma^B - \delta} \sum_{t=1}^{B} \gamma^{-(t-1)} \|\breve{w}(t-1)\|$.
(iii) $\|\breve{h}\|_{\gamma,K} \le \eta_3 \|z\|_{\gamma,K} + \omega_3$, where $\eta_3 = Q_1 \|V^{-1}\|_{\max} \frac{\gamma (1 - \gamma^B)}{(\gamma^B - \delta)(1 - \gamma)}$ and $\omega_3 = \frac{\gamma^B}{\gamma^B - \delta} \sum_{t=1}^{B} \gamma^{-(t-1)} \|\breve{h}(t-1)\|$.
(iv) $\|z\|_{\gamma,K} \le \eta_4 \|y\|_{\gamma,K} + \omega_4$, where $\eta_4 = \frac{\gamma + 1}{\gamma \hat{\mu}}$ and $\omega_4 = 0$.
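Moreover, to use the small gain theorem, we need to choose an appropriate step-size $\alpha$ such that

\[ \eta_1 \eta_2 \eta_3 \eta_4 < 1, \tag{7.25} \]

which means that

\[ (1 + \sqrt{N}) \left( 1 + \frac{\sqrt{N}}{\gamma} \sqrt{ \frac{1 + \upsilon}{\upsilon \hat{\mu} \tilde{\sigma}} + \frac{\vartheta}{\hat{\sigma} \tilde{\sigma}} } \right) \times \frac{\alpha}{\gamma^B - \delta} \left( \delta + Q_1 \frac{\gamma - \gamma^B}{1 - \gamma} \right) \times Q_1 \|V^{-1}\|_{\max} \frac{\gamma (1 - \gamma^B)}{(\gamma^B - \delta)(1 - \gamma)} \times \frac{\gamma + 1}{\gamma \hat{\mu}} < 1, \tag{7.26} \]

where $\vartheta > 0$, $\upsilon > 0$, and the other constraints on the parameters that appear in Lemmas 7.7, 7.14, and 7.15 are as follows:

\[ 0 < \alpha \le \min\left\{ \frac{\vartheta + 1}{\tilde{\sigma} \vartheta}, \frac{1}{\tilde{\mu}(1 + \upsilon)} \right\}, \tag{7.27} \]

\[ \sqrt{1 - \frac{\alpha \tilde{\sigma} \vartheta}{\vartheta + 1}} \le \gamma < 1, \tag{7.28} \]

\[ \sqrt[B]{\delta} < \gamma < 1. \tag{7.29} \]

Noting that $\vartheta > 0$ and $\upsilon > 0$, it follows from (7.27) that

\[ \sqrt[B]{\delta} < \gamma < 1. \tag{7.30} \]

We fix the two parameters as $\vartheta = 2\hat{\sigma}/\hat{\mu}$ and $\upsilon = 1$ in Lemma 7.14 to obtain a concrete (probably loose) bound on the convergence rate. Furthermore, by using $0.5 \le \gamma < 1$ and $(1 - \gamma^B)/(1 - \gamma) \le B$, from relation (7.26), we obtain

\[ \alpha \le \frac{\hat{\mu} (\gamma^B - \delta)^2}{2 Q_1 \|V^{-1}\|_{\max} B (\delta + Q_1 (B - 1)) (1 + \sqrt{N}) (1 + 4 \sqrt{N} \sqrt{\kappa})}, \tag{7.31} \]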
where $\kappa = 1/(\tilde{\sigma} \hat{\mu})$ is the condition number. Noting that $(\vartheta + 1)/\vartheta \ge 1.5$, it follows from (7.28) that

\[ \frac{1.5 (1 - \gamma^2)}{\tilde{\sigma}} \le \alpha. \tag{7.32} \]

Then, using relations (7.31) and (7.32), we conclude that there exists a $\gamma \in (\sqrt[B]{\delta}, 1)$ such that

\[ \left[ \frac{1.5 (1 - \gamma^2)}{\tilde{\sigma}}, \frac{1.5 (\gamma^B - \delta)^2}{\tilde{\sigma} J} \right] \ne \emptyset, \tag{7.33} \]

where $J = 3 Q_1 \|V^{-1}\|_{\max} \kappa B (\delta + Q_1 (B - 1)) (1 + 4 \sqrt{N} \sqrt{\kappa}) (1 + \sqrt{N})$. Here, we study a smaller interval by enlarging the left endpoint in (7.33). Since $B \ge 1$, we will prove that

\[ \left[ \frac{1.5 (1 - \gamma^{2B})}{\tilde{\sigma}}, \frac{1.5 (\gamma^B - \delta)^2}{\tilde{\sigma} J} \right] \ne \emptyset. \tag{7.34} \]

Note that as $\gamma$ increases from $\sqrt[B]{\delta}$ to 1, the left endpoint in (7.34) decreases monotonically from $\frac{1.5 (1 - \delta^2)}{\tilde{\sigma}}$ to 0, while the right endpoint increases monotonically from 0 to $\frac{1.5 (1 - \delta)^2}{\tilde{\sigma} J}$. Thus, the critical value of $\gamma$ at which the two endpoints of the interval in (7.34) coincide is

\[ \gamma = \sqrt[B]{ \frac{\delta + \sqrt{J^2 + (1 - \delta^2) J}}{1 + J} }. \tag{7.35} \]

Denote the value obtained in (7.35) by $\gamma_{mid}$. Thus, if we choose $\alpha \in \left( 0, \frac{1.5 \left( \sqrt{J^2 + (1 - \delta^2) J} - \delta J \right)^2}{\tilde{\sigma} J (1 + J)^2} \right]$, we can set $\gamma = \sqrt[2B]{1 - \frac{\alpha \tilde{\sigma}}{1.5}}$, while for $\alpha \in \left( \frac{1.5 \left( \sqrt{J^2 + (1 - \delta^2) J} - \delta J \right)^2}{\tilde{\sigma} J (1 + J)^2}, \frac{1.5 (1 - \delta)^2}{\tilde{\sigma} J} \right]$, we can set $\gamma = \sqrt[B]{\delta + \sqrt{\frac{\alpha \tilde{\sigma} J}{1.5}}}$. The proof is thus completed.

On the basis of the geometric convergence of the sequence $\{w(k)\}$ in Theorem 7.18, we next show that the sequence $\{x(k)\}$ converges to $x^*$ at a global geometric rate $O((\gamma/2)^k)$.

Theorem 7.19 Suppose Assumptions 7.1–7.4 hold. Let the sequences $\{x(k)\}$, $\{\lambda(k)\}$, $\{v(k)\}$, $\{w(k)\}$, and $\{\beta(k)\}$ be updated by the distributed primal–dual gradient algorithm (7.9). Choose $\alpha$ as in Theorem 7.18 such that the sequence $\{w(k)\}$ converges to $\mathbf{1}w^*$ at a geometric rate $O(\gamma^k)$. Then the sequence $\{x(k)\}$ converges to $x^*$ at a geometric rate $O((\gamma/2)^k)$.
Proof To prove Theorem 7.19, we first show that for any $k \ge 0$, the following inequality holds:

\[ \sum_{i=1}^{N} \frac{\mu_i}{2} (x_i(k+1) - x_i^*)^2 \le \sum_{i=1}^{N} \left( \nabla q_i(w^*)(w_i(k) - w^*) + \frac{1}{2\mu_i} (w_i(k) - w^*)^2 + w_i(k)(x_i^* - P) \right). \tag{7.36} \]

Noticing that $x^* \in S$, we have $\sum_{i=1}^{N} (x_i^* - P) = 0$. Moreover, since $w_i(k) \to w^*$ as $k \to \infty$, the last term in (7.36) converges to $w^* \sum_{i=1}^{N} (x_i^* - P) = 0$, and the first two terms vanish as well. Hence, the right-hand side of (7.36) tends to zero as $k \to \infty$. Therefore, we immediately conclude that as the sequence $\{w(k)\}$ converges to $\mathbf{1}w^*$ at a geometric rate of $O(\gamma^k)$, the sequence $\{x(k)\}$ converges to $x^*$ at a geometric rate of $O((\gamma/2)^k)$. Based on the above analysis, the remaining task is to derive inequality (7.36). To this end, we first define the local Lagrangian function $L_i : X_i \times \mathbb{R} \to \mathbb{R}$ as

\[ L_i(x_i, w_i) = f_i(x_i) + w_i (x_i - P). \tag{7.37} \]

Then, the global Lagrangian function $L : X \times \mathbb{R}^N \to \mathbb{R}$ can be defined as

\[ L(x, w) = \frac{1}{N} \sum_{i=1}^{N} L_i(x_i, w_i). \tag{7.38} \]

Since $f_i$ is strongly convex with parameter $\mu_i$, $L_i$ is strongly convex in $x_i$ with parameter $\mu_i$ for a fixed $w_i$. Therefore, we get for all $x_i \in X_i$ that

\[ \frac{\mu_i}{2} (x_i(k+1) - x_i)^2 \le L_i(x_i, w_i(k)) - L_i(x_i(k+1), w_i(k)) - \nabla L_i(x_i(k+1), w_i(k)) (x_i - x_i(k+1)). \tag{7.39} \]

Using $x_i(k+1) \in \arg\min_{x_i \in X_i} \{ f_i(x_i) + w_i(k) x_i \}$, the first-order optimality condition gives $\nabla L_i(x_i(k+1), w_i(k)) (x_i - x_i(k+1)) \ge 0$, and we also obtain for all $x_i \in X_i$ that

\[ \frac{\mu_i}{2} (x_i(k+1) - x_i)^2 \le L_i(x_i, w_i(k)) - L_i(x_i(k+1), w_i(k)). \tag{7.40} \]
Since (7.40) holds for all $x_i \in X_i$, replacing $x_i$ by $x_i^*$ and averaging (7.40) over $i$, we immediately have

\[ \frac{1}{N} \sum_{i=1}^{N} \frac{\mu_i}{2} (x_i(k+1) - x_i^*)^2 \le \frac{1}{N} \sum_{i=1}^{N} \left( L_i(x_i^*, w_i(k)) - L_i(x_i(k+1), w_i(k)) \right) = L(x^*, w(k)) - L(x(k+1), w(k)). \tag{7.41} \]

Recalling (7.3), (7.6), and (7.8), we therefore have

\[ q_i(w_i(k)) = -f_i(x_i(k+1)) - w_i(k)(x_i(k+1) - P). \tag{7.42} \]

Moreover, strong duality holds according to Assumption 3 in [41], i.e., $f^* = d^* = -q^*$. We therefore get

\[
\begin{aligned}
L_i(x_i^*, w_i(k)) - L_i(x_i(k+1), w_i(k))
&= f_i(x_i^*) + w_i(k)(x_i^* - P) - f_i(x_i(k+1)) - w_i(k)(x_i(k+1) - P) \\
&= f_i(x_i^*) + q_i(w_i(k)) + w_i(k)(x_i^* - P).
\end{aligned}
\tag{7.43}
\]

Averaging (7.43) over $i$ from 1 to $N$ gives

\[
\begin{aligned}
L(x^*, w(k)) - L(x(k+1), w(k))
&= f(x^*) + q(w(k)) + \frac{1}{N} \sum_{i=1}^{N} w_i(k)(x_i^* - P) \\
&= \frac{1}{N} \sum_{i=1}^{N} \left( q_i(w_i(k)) - q_i(w^*) + w_i(k)(x_i^* - P) \right).
\end{aligned}
\tag{7.44}
\]

Since $q_i$ has a Lipschitz continuous derivative with Lipschitz parameter $1/\mu_i$, it follows that

\[ L(x^*, w(k)) - L(x(k+1), w(k)) \le \frac{1}{N} \sum_{i=1}^{N} \left( \nabla q_i(w^*)(w_i(k) - w^*) + \frac{1}{2\mu_i} (w_i(k) - w^*)^2 + w_i(k)(x_i^* - P) \right). \tag{7.45} \]
Substituting (7.45) into (7.41), we obtain

\[ \sum_{i=1}^{N} \frac{\mu_i}{2} (x_i(k+1) - x_i^*)^2 \le \sum_{i=1}^{N} \left( \nabla q_i(w^*)(w_i(k) - w^*) + \frac{1}{2\mu_i} (w_i(k) - w^*)^2 + w_i(k)(x_i^* - P) \right), \]

which concludes our proof.
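Returning to the explicit rate in Theorem 7.18, the piecewise formula for $\gamma$ as a function of $\alpha$ can be evaluated numerically. The Python sketch below does this under hypothetical values of $\delta$, $J$, $\tilde{\sigma}$, and $B$ (in the theorem these are determined by the network and the cost functions and are typically far more conservative); the function names are likewise assumptions. It also checks that the two branches meet at the critical value (7.35).

```python
import math

# Piecewise rate gamma(alpha) from Theorem 7.18.
# The values of delta, J, sigma_tilde, and B are hypothetical placeholders.

def alpha_thresholds(delta, J, sigma_tilde):
    root = math.sqrt(J * J + (1.0 - delta ** 2) * J)
    alpha_mid = 1.5 * (root - delta * J) ** 2 / (sigma_tilde * J * (1.0 + J) ** 2)
    alpha_max = 1.5 * (1.0 - delta) ** 2 / (sigma_tilde * J)
    return alpha_mid, alpha_max

def rate(alpha, delta, J, sigma_tilde, B):
    alpha_mid, alpha_max = alpha_thresholds(delta, J, sigma_tilde)
    if not 0.0 < alpha <= alpha_max:
        raise ValueError("alpha outside the range covered by Theorem 7.18")
    if alpha <= alpha_mid:
        return (1.0 - alpha * sigma_tilde / 1.5) ** (1.0 / (2 * B))
    return (delta + math.sqrt(alpha * sigma_tilde * J / 1.5)) ** (1.0 / B)

def gamma_mid(delta, J, B):
    # Critical value (7.35).
    return ((delta + math.sqrt(J * J + (1.0 - delta ** 2) * J)) / (1.0 + J)) ** (1.0 / B)

delta, J, sigma_tilde, B = 0.5, 10.0, 1.0, 3
alpha_mid, alpha_max = alpha_thresholds(delta, J, sigma_tilde)
print(alpha_mid, alpha_max, rate(alpha_mid, delta, J, sigma_tilde, B), gamma_mid(delta, J, B))
```

On the first branch the rate improves as $\alpha$ grows, while on the second branch it deteriorates, so within this bound the best guaranteed rate is attained at the breakpoint. At the right endpoint of the admissible interval the second formula evaluates to one, so step-sizes close to that endpoint carry essentially no guaranteed contraction.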
7.5 Numerical Examples

In this section, two numerical examples on economic dispatch and demand response in power systems are presented to validate the practicability of the proposed algorithm and the theoretical analysis of this chapter.

7.5.1 Example 1: EDP on the IEEE 14-Bus Test Systems

In the first example, we consider the economic dispatch on the IEEE 14-bus system [43] illustrated in Fig. 7.1. Specifically, we study a class of problems in which some generators may be disconnected from the grid or cease to exchange power during their operation. This may occur due to generator faults or the variability of renewable energy, which limits the generating capacity of a generator within a specific period of time. To study this problem, we use a class of uniformly strongly connected time-varying general directed networks to model the variable connections between generators. We consider a system in which each generator $i$ incurs a quadratic cost as a function of the amount of its generated power $P_i$, i.e., $Y_i(P_i) = c_i P_i^2 + d_i P_i$, where $c_i$ and $d_i$ are the adjustable cost coefficients of generator $i$. Assume that the power generated by each generator $i$ is limited to the interval $[0, P_i^{max}]$. In the simulation, we choose the average demand (required from the system) $P = 60$, and the coefficients of each generator are shown in Table 7.1, which is also used in [42]. The simulation results of the algorithm (7.7) are shown in Figs. 7.2 and 7.3. The power allocation at each generator is shown in Fig. 7.2, from which one can see that the distributed primal–dual gradient algorithm (7.7) successfully allocates the optimal power to each generator by time step $k = 200$. The allocated optimal powers are $P_1^* = 66.25$, $P_2^* = 71.61$, $P_3^* = 47.15$, $P_4^* = 54.98$, and $P_5^* = 60.01$. Figure 7.3 shows that the Lagrangian multiplier of each generator converges to the dual optimal solution $w^*$.
Fig. 7.1 The IEEE 14-bus test system [43]

Table 7.1 Generator parameters (MU = Monetary units)

Gen   Bus   ci ($/MW²)   di ($/MW)   [0, Pimax] (MW)
1     1     0.04         2.0         [0, 80]
2     2     0.03         3.0         [0, 90]
3     3     0.035        4.0         [0, 70]
4     6     0.03         4.0         [0, 70]
5     8     0.04         2.5         [0, 80]
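The reported allocation can be cross-checked against the classical equal-incremental-cost condition. The Python sketch below is an independent check rather than the distributed algorithm: using the coefficients of Table 7.1 and the average demand $P = 60$, it solves $2 c_i x_i + d_i = \text{price}$ for all $i$ together with $\sum_i x_i = N P$ in closed form and verifies afterwards that the box constraints are inactive. Ignoring the box constraints during the solve is an assumption of this sketch, and the function name `unconstrained_dispatch` is hypothetical.

```python
# Equal-incremental-cost solution of the dispatch problem with the Table 7.1 coefficients.
# Box constraints are checked afterwards rather than enforced; this is an assumption of the sketch.

C_COEF = [0.04, 0.03, 0.035, 0.03, 0.04]
D_COEF = [2.0, 3.0, 4.0, 4.0, 2.5]
P_MAX = [80.0, 90.0, 70.0, 70.0, 80.0]
P_AVG = 60.0
N = len(C_COEF)

def unconstrained_dispatch():
    # Stationarity: 2*c_i*x_i + d_i = price for all i, with sum_i x_i = N * P_AVG.
    inv = [1.0 / (2.0 * c) for c in C_COEF]
    price = (N * P_AVG + sum(d * g for d, g in zip(D_COEF, inv))) / sum(inv)
    x = [(price - d) * g for d, g in zip(D_COEF, inv)]
    return price, x

price, x = unconstrained_dispatch()
feasible = all(0.0 <= xi <= pm for xi, pm in zip(x, P_MAX))
print(round(price, 4), [round(xi, 2) for xi in x], feasible)
```

The closed-form allocation agrees with the values reported for $k = 200$ to within about 0.05 MW per generator, and the common incremental cost of roughly 7.3 corresponds, under the sign convention of the Lagrangian used in this chapter, to a multiplier $w^*$ of about $-7.3$.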
Fig. 7.2 Power allocation at generators (power allocation xi(k) of generators P1 to P5 versus time, 0 to 200 steps)

Fig. 7.3 Consensus of Lagrange multipliers (Lagrange multipliers wi(k), w1 to w5, versus time, 0 to 200 steps)

7.5.2 Example 2: Demand Response for Time-Varying Supplies

Our second application concerns the demand response of 5 households served by a single generator in the summer. In particular, we consider time-varying supplies during a day. We assume that the generator can predict the average power to supply in each hour of the day based on the information collected over the past few days, and that each household incurs a cost for its use of power. Suppose that all households are willing to cooperate in arranging their loads to meet the supply while minimizing their total cost. According to [34], the demand response problem can be cast as the optimization problem studied in this chapter. Specifically, we consider only the energy consumption of air conditioning, which may account for most of the energy consumed during summer, and no other tunable device is involved in the demand response process. When the air conditioning uses an amount $x_i$ of power, we suppose that each household $i$ incurs a quadratic cost $Q_i(x_i) = a_i (x_i - b_i)^2$, where $a_i$ is the cost coefficient and $b_i$ is the initial energy of household $i$. Let $S(t)$, $t \in [7, 19]$, be the average power supplied by the generator from 7:00 to 19:00. Let $S(t)$, $a_i$, and $b_i$ be chosen randomly from $[0, 2000]$, $(0, 1)$, and $[0, 1000]$, respectively. The simulation results of the algorithm (7.7) are shown in Figs. 7.4 and 7.5. The optimal power schedule of each household is shown in Fig. 7.4, while the predicted price under the time-varying demand is shown in Fig. 7.5.
Fig. 7.4 Optimal energy schedule of each household (panel title: Optimal energy schedule of households; energy schedule versus time of day, 7:00 to 19:00; curves: supply 1, supply 2, supply 3, supply 4, supply 6, and average supply)

Fig. 7.5 Predicted price on the time-varying demands (panel title: Dynamic price on the time-varying demand; price w versus time of day, 7:00 to 19:00)
7.6 Conclusion

In this chapter, a fully distributed primal–dual gradient algorithm for solving a convex optimization problem with both a coupling linear constraint and individual box constraints has been studied in detail. It has been proven that, under fairly standard assumptions on the network connectivity and the objective functions, the algorithm achieves a geometric convergence rate over time-varying general directed networks. Based on the adjustment of the parameters $(\vartheta, \upsilon)$ and the small gain theorem, we derived an explicit convergence rate $\gamma$ for different ranges of $\alpha$. Furthermore, the correctness and effectiveness of the theoretical results have been demonstrated by applying the proposed algorithm to two problems in power systems. Several meaningful questions remain for future work. For example, it would be interesting to study the more general case in which the step-sizes are uncoordinated. It would also be meaningful to extend our work to event-triggered, asynchronous, and quantized communication among nodes over time-varying networks.
References 1. B. Johansson, M. Rabi, M. Johansson, A simple peer-to-peer algorithm for distributed optimization in sensor networks, in 2007 46th IEEE Conference on Decision and Control. https://doi.org/10.1109/CDC.2007.4434888 2. Z.-W. Liu, X. Yu, Z.-H. Guan, B. Hu, C. Li, Pulse-modulated intermittent control in consensus of multiagent systems. IEEE Trans. Syst. Man Cybern. Syst. 47(5), 783–793 (2017) 3. G. Wen, W. Yu, Y. Xia, X. Yu, J. Hu, Distributed tracking of nonlinear multiagent systems under directed switching topology: an observer-based protocol, IEEE Trans. Syst. Man Cybern. Syst. 47(5), 869–881 (2017) 4. S. Yang, Q. Liu, J. Wang, Distributed optimization based on a multiagent system in the presence of communication delays. IEEE Trans. Syst. Man Cybern. Syst. 47(5), 717–728 (2017) 5. D. Wang, N. Zhang, J. Wang, W. Wang, Cooperative containment control of multiagent systems based on follower observers with time delay. IEEE Trans. Syst. Man Cybern. Syst. 47(1), 13– 23 (2017) 6. A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control 55(1), 48–61 (2011) 7. D. Yuan, D.W.C. Ho, S. Xu, Regularized primal-dual subgradient method for distributed constrained optimization. IEEE Trans. Cybern. 46(9), 2109–2118 (2016) 8. M. Zhu, S. Martinez, On distributed convex optimization under inequality and equality constraints. IEEE Trans. Autom. Control 57(1), 151–164 (2012) 9. D. Jakovetic, J. Xavier, J.M.F. Moura, Fast distributed gradient methods. IEEE Trans. Autom. Control 60(3), 601–615 (2015) 10. A. Nedic, A. Olshevsky, Distributed optimization over time-varying directed graphs. IEEE Trans. Autom. Control 60(3), 601–615 (2015) 11. S.S. Ram, A. Nedic, V.V. Veeravalli, Distributed stochastic subgradient projection algorithms for convex optimization. J. Optim. Theory Appl. 147(3), 516–545 (2010) 12. Z. Feng, G. Hu, G. Wen, Distributed consensus tracking for multi-agent systems under two types of attacks. Int. J. 
Robust. Nonlinear Control 26(5), 896–918 (2016)
13. I. Matei, J.S. Baras, Performance evaluation of the consensus-based distributed subgradient method under random communication topologies. IEEE J. Sel. Topics. Signal Process. 5(4), 754–771 (2011) 14. A. Olshevsky, Linear time average consensus on fixed graphs and implications for decentralized optimization and multi-agent control (2014). arXiv preprint arXiv:1411.4186 15. A. Nedic, A. Ozdaglar, P. A. Parrilo, Constrained consensus and optimization in multi-agent networks. IEEE Trans. Autom. Control 55(4), 922–938 (2010) 16. T. Wu, K. Yuan, Q. Ling, W. Yin, A. H. Sayed, Decentralized consensus optimization with asynchrony and delays. IEEE Trans. Signal Inform. Process. Netw. 4(2), 293–307 (2018) 17. A. Nedic, Asynchronous broadcast-based convex optimization over a network. IEEE Trans. Autom. Control 56(6), 1337–1351 (2011) 18. I. Lobel, A. Ozdaglar, Convergence analysis of distributed subgradient methods over random networks, in 2008 46th Annual Allerton Conference on Communication, Control, and Computing. https://doi.org/10.1109/ALLERTON.2008.4797579 19. P. Yi, Y. Hong, Quantized subgradient algorithm and data-rate analysis for distributed optimization. IEEE Trans. Control Netw. Syst. 1(4), 380–392 (2014) 20. B. Johansson, T. Keviczky, M. Johansson, K. H. Johansson, Subgradient methods and consensus algorithms for solving convex optimization problems, in 2008 47th IEEE Conference on Decision and Control. https://doi.org/10.1109/CDC.2008.4739339 21. D. Yuan, S. Xu, H. Zhao, Distributed primal-dual subgradient method for multiagent optimization via consensus algorithms. IEEE Trans. Syst. Man Cybern. B: Cybern. 41(6), 1715–1724 (2011) 22. Z.J. Towfic, A.H. Sayed, Adaptive penalty-based distributed stochastic convex optimization. IEEE Trans. Signal. Process. 62(15), 3924–3938 (2014) 23. S.S. Ram, A. Nedic, V.V. Veeravalli, Incremental stochastic subgradient algorithms for convex optimization. SIAM J. Control Optim. 20(2), 691–717 (2009) 24. J. Xu, S. Zhu, Y. C. 
Soh, L. Xie, Augmented distributed gradient methods for multi-Agent optimization under uncoordinated constant stepsizes, in Proceedings of the IEEE 54th Annual Conference on Decision and Control. https://doi.org/10.1109/CDC.2015.7402509 25. W. Shi, Q. Ling, G. Wu, W. Yin, EXTRA: an exact first-order algorithm for decentralized consensus optimization. SIAM J. Optim. 25, 944–966 (2015) 26. C. Li, X. Yu, T. Huang, X. He, Distributed optimal consensus over resource allocation network and its application to dynamical economic dispatch. IEEE Trans. Neural Netw. Learn. Syst. 29(6), 2407–2418 (2018) 27. C. Li, X. Yu, W. Yu, G. Chen, J. Wang, Efficient computation for sparse load shifting in demand side management. IEEE Trans. Smart Grid 8(1), 250–261 (2017) 28. X. He, D.W.C. Ho, T. Huang, J. Yu, H. Abu-Rub, C. Li, Second-order continuous-time algorithms for economic power dispatch in smart grids. IEEE Trans. Syst. Man Cybern. Syst. 48(9), 1482–1492 (2018) 29. H. Xing, Y. Mou, M. Fu, Z. Lin, Distributed bisection method for economic power dispatch in smart grid. IEEE Trans. Power Syst. 30(6), 3024–3035 (2015) 30. C. Li, X. Yu, W. Yu, T. Huang, Z.-W. Liu, Distributed event-triggered scheme for economic dispatch in smart grids. IEEE Trans. Ind. Informat. 12(5), 1775–1785 (2016) 31. G. Binetti, A. Davoudi, F.L. Lewis, D. Naso, B. Turchiano, Distributed consensus-based economic dispatch with transmission losses. IEEE Trans. Power Syst. 29(4), 1711–1720 (2014) 32. A. Nedic, A. Olshevsky, W. Shi, Improved convergence rates for distributed resource allocation (2017). arXiv preprint arXiv:1706.05441 33. N. Li, L. Chen, M.A. Dahleh, Demand response using linear supply function bidding. IEEE Trans. Smart Grid 6(4), 1827–1838 (2015) 34. N. Li, L. Chen, S.H. Low, Optimal demand response based on utility maximization in power networks, in Proceedings of the 2011 IEEE Power and Energy Society General Meeting (PES). https://doi.org/10.1109/PES.2011.6039082 35. A. Nedic, A. Olshevsky, W. 
Shi, Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)
36. T. T. Doan, A. Olshevsky, On the geometric convergence rate of distributed economic dispatch/demand response in power networks (2016). arXiv preprint arXiv:1609.06660 37. D.P. Bertsekas, Convex Optimization Theory (Athena Scientific, Cambridge, MA, 2009) 38. H.H. Bauschke, P.L. Combettes, Convex Analysis and Monotone Operator Theory (Springer, Berlin, 2011) 39. R.T. Rockafellar, R.J.B. Wets, Variational Analysis (Springer, Berlin, 2009) 40. C.A. Desoer, M. Vidyasagar, Feedback Systems: Input-Output Properties (SIAM, 2009) 41. D.P. Bertsekas, A. Nedic, A.E. Ozdaglar, Convex Analysis and Optimization (Athena Scientific, Cambridge, MA, 2004) 42. S. Kar, G. Hug, Distributed robust economic dispatch in power systems: a consensus + innovations approach, in 2012 IEEE Power and Energy Society General Meeting. https://doi.org/10.1109/PESGM.2012.6345156 43. IEEE 14 bus system. http://icseg.iti.illinois.edu/ieee-14-bus-system/
Chapter 8
Event-Triggered Algorithms for Distributed Economic Dispatch
Abstract In this chapter, we continue to study the energy management problem in smart grid operations, namely the economic dispatch problem (EDP), i.e., the problem of minimizing a sum of local convex cost functions subject to both local interval constraints and a coupling linear constraint over an undirected network. We propose a new event-triggered distributed accelerated primal–dual algorithm, ET-DAPDA, that reduces computation and interaction when solving the EDP with uncoordinated step-sizes. ET-DAPDA (with respect to the dual updates) adds two momentum terms to the gradient tracking scheme and assumes that each node interacts independently with its neighbors only at the event-triggered sampling time instants. Assuming smoothness and strong convexity of the cost functions, the linear convergence of ET-DAPDA is analyzed using the generalized small gain theorem. In addition, ET-DAPDA strictly excludes Zeno-like behavior, which greatly reduces the interaction cost. ET-DAPDA is tested on 14-bus and 118-bus systems to evaluate its applicability. The simulated convergence rates are further compared with those of existing techniques to demonstrate the superiority of ET-DAPDA.

Keywords Economic dispatch · Event-triggered · Momentum terms · Accelerated algorithm · Linear convergence
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. Q. Lü et al., Distributed Optimization in Networked Systems, Wireless Networks, https://doi.org/10.1007/978-981-19-8559-1_8
8.1 Introduction
With the development of related industries in the engineering field, distributed optimization has attracted broad attention in smart grids [1], sensor networks [2], etc. Practical optimization problems are becoming increasingly complex owing to energy constraints [3], privacy concerns [4], dynamic resource requirements [5], and the large scale of networks [6]. Optimal solutions may not be easy to obtain with traditional techniques. Therefore, developing more efficient distributed optimization algorithms for general convex optimization problems is the focus of current research [7–12].
In the last few years, strategies based on distributed consensus [13–19] have become the mainstream optimization approaches. Commonly used approaches of this kind include the distributed (sub)gradient descent approach [13], the distributed primal–dual approach [14], and so on. Similar work has also been generalized to distributed optimization under various practical conditions, such as complex networks [16], communication constraints [17, 18], online optimization [19], and so on. Because the above approaches [13–19] all require a decreasing step-size to guide the states of all nodes to the optimal solution, they suffer from a slow convergence rate. To overcome this difficulty, many recent distributed optimization methods have been developed [20–23]. A distributed inexact gradient method with constant step-size for time-varying undirected/directed networks was developed in [20], which adopted the gradient tracking technique to handle the optimization problem. Furthermore, for uncoordinated step-sizes, Lü et al. in [21] and Nedic et al. in [22] each extended the work [20] and achieved linear convergence results.
Despite the better performance of the above methods, they still have disadvantages. For instance, the computational and interaction capabilities of the nodes are not considered, which makes the methods hard to apply in practice. To address this issue, some event-triggered distributed optimization algorithms have emerged, such as [24, 25]. Although the convergence rates were accelerated to linear and the capabilities of the nodes were considered, these methods [24, 25] could not be applied to the EDP. Motivated by the EDP in smart grids, Doan and Beck [26] developed a distributed primal–dual method to solve the EDP over undirected networks. The analysis of [26] revealed that the method finds the exact optimal solution with a linear convergence rate.
Unfortunately, the interactions among nodes in the network consume considerable resources. Thus, Wang et al. in [27] presented an event-triggered distributed primal–dual method for solving the EDP over time-varying networks and obtained the desired linear convergence results. Moreover, a number of effective methods have received increasing attention in recent years [28, 29]. To address the practical EDP comprehensively, a new hybrid evolutionary algorithm was discussed in [28]. Then, a reliable hybrid evolutionary algorithm was proposed to solve the multi-objective multi-area EDP [29]. Besides, several more practical energy management issues were considered recently, such as [30] for the complex optimal power flow problem and [31] for transmission expansion planning with wind farms.
It should be noted that the above methods [24–27] did not use momentum terms [32, 33] to achieve a faster convergence rate. Existing work has proven that both algorithms with Nesterov momentum [32] and algorithms with heavy-ball momentum [33] can improve the convergence rate to some extent. This motivates us to develop an event-triggered distributed optimization algorithm that uses Nesterov momentum and heavy-ball momentum to solve the EDP, so as to guarantee faster convergence while also saving computation and interaction resources.
In this chapter, we propose a new event-triggered distributed accelerated primal–dual algorithm, named ET-DAPDA, to solve the distributed constrained (local interval constraint and coupling linear constraint) convex optimization problem over an undirected network. ET-DAPDA guarantees that the interaction between two nodes in the network takes place in an event-triggered way. Specifically, the principal contributions of this work are as follows:
(i) A distributed event-triggered interaction scheme is considered in the primal–dual method. Compared with the recent work [20–23], ET-DAPDA reduces the energy consumption and the intensive calculations caused by interactions between nodes, which may extend the useful life of a particular network such as a power system or smart grid.
(ii) In comparison with [34, 35], ET-DAPDA does not demand centralized control to generate the Lagrangian multiplier. Specifically, ET-DAPDA incorporates gradient tracking into the distributed primal–dual method to achieve linear convergence and adds two types of momentum terms, which enable nodes to obtain more information from their neighbors than in the existing methods [26, 27] and thereby accelerate convergence.
(iii) The convergence of ET-DAPDA is analyzed by using the generalized small gain theorem, a standard tool in control theory for analyzing the stability of interconnected dynamical systems, which is expected to be broadly applicable to other accelerated algorithms. In addition, ET-DAPDA allows a more relaxed step-size choice than most existing distributed methods, such as those presented in [20, 26, 33].
(iv) Provided that the largest step-size and the maximum momentum coefficient are subject to certain upper bounds, ET-DAPDA converges linearly to the optimal solution for smooth and strongly convex cost functions. ET-DAPDA also rigorously excludes Zeno-like behavior [25, 27]. In addition, in comparison with [23, 33], explicit estimates for the convergence rate of ET-DAPDA are established.
8.2 Preliminaries
8.2.1 Notation
Unless otherwise stated, the vectors mentioned in this chapter are column vectors. We let $\langle \cdot, \cdot \rangle$ denote the inner product of two vectors, $\|\cdot\|$ denote the 2-norm for both vectors and matrices, $\rho(B)$ represent the spectral radius of a matrix $B$, and $\mathrm{diag}\{y\}$ be the diagonal matrix of the vector $y$, whose $i$-th diagonal element is $y_i$ and whose other elements are zero. The symbols $\mathbf{1}_m$, $I_m$, and $B^T$ denote the $m$-dimensional all-ones column vector, the $m \times m$ identity matrix, and the transpose of a matrix $B$ (possibly a vector), respectively. Given an infinite sequence $s_i = (s_i(0), s_i(1), s_i(2), \ldots)$, where $s_i(t) \in \mathbb{R}$, $\forall t$, we define
$$\|s_i\|^{\gamma,T} = \max_{t=0,1,\ldots,T} \frac{1}{\gamma^{t}} |s_i(t)| \quad \text{and} \quad \|s_i\|^{\gamma} = \sup_{t \ge 0} \frac{1}{\gamma^{t}} |s_i(t)|,$$
with $\gamma \in (0, 1)$; for vector-valued sequences, $|\cdot|$ is replaced by $\|\cdot\|$. Given two vectors $v = [v_1, \ldots, v_m]^T$ and $u = [u_1, \ldots, u_m]^T$, the notation $v \preceq u$ means that $v_i \le u_i$ for every $i$.
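The weighted norm above can be evaluated directly for a finite record of a scalar sequence. The following is a minimal sketch; the function name `weighted_norm` and the sample sequence are illustrative assumptions, not part of the chapter.

```python
def weighted_norm(seq, gamma, T=None):
    """Return max over t = 0..T of |seq[t]| / gamma**t.

    If T is None, the whole finite record is used, which only
    approximates the supremum over all t >= 0.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    horizon = len(seq) - 1 if T is None else T
    return max(abs(seq[t]) / gamma ** t for t in range(horizon + 1))


# Hypothetical sequence that decays exactly like 3 * gamma**t.
gamma = 0.5
s = [3.0 * gamma ** t for t in range(6)]
norm_s = weighted_norm(s, gamma, T=5)
print(norm_s)
```

A finite value of this norm for every horizon $T$ is exactly what it means for the sequence to decay at least as fast as $\gamma^t$, which is how the norm is used in the convergence analysis.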
8.2.2 Nomenclature
$f$ : The global cost function of EDP
$f_i$ : The local cost function of node $i$
$C$ : The dual function of EDP
$C_i$ : The local dual function of node $i$
$f^{\perp}$ : The conjugate function of function $f$
$x_i$ : Node $i$'s estimate of EDP
$d_i$ : The element of coupling linear constraint
$x_i^{\max}$ : The upper bound of node $i$'s estimate $x_i$
$x_i^{\min}$ : The lower bound of node $i$'s estimate $x_i$
$\lambda_i$ : The Lagrangian multiplier of node $i$
$\alpha_i$ : The constant step-size of node $i$
$\eta_i(t)$ : The momentum coefficient of node $i$ at time $t$
$z_i$, $y_i$ : The auxiliary variables of node $i$
$v_i^{\lambda}$, $v_i^{y}$ : Node $i$'s control inputs of the algorithm
$t^i_{k(t)}$ : The event-triggered time instant sequence of node $i$ at time $t$
$e_i^{\lambda}$, $e_i^{y}$ : The measurement errors of node $i$
$\sigma_i$ : The Lipschitz smoothness parameter of $f_i$
$\mu_i$ : The strong convexity parameter of $f_i$
8.2.3 Model of Optimization Problem
This chapter studies a general EDP: finding the optimal solution of the sum of $m$ local convex cost functions subject to both local interval constraints and a coupling linear constraint over an undirected network of $m$ nodes. The problem is described as follows:
$$\min_{x} f(x) = \sum_{i=1}^{m} f_i(x_i), \quad \text{s.t.} \ \sum_{i=1}^{m} x_i = \sum_{i=1}^{m} d_i, \quad x_i^{\min} \le x_i \le x_i^{\max}, \ \forall i = 1, \ldots, m, \tag{8.1}$$
where $x = [x_1, \ldots, x_m]^T \in \mathbb{R}^m$, the coupling linear constraint set is expressed by $\mathcal{M} = \{x \in \mathbb{R}^m \mid \sum_{i=1}^{m} x_i = \sum_{i=1}^{m} d_i\}$, and the local interval constraint set is denoted by $\mathcal{X}_i = \{\tilde{x} \in \mathbb{R} \mid x_i^{\min} \le \tilde{x} \le x_i^{\max}\}$ for all $i = 1, \ldots, m$. We write $\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_m$ for the Cartesian product of the local interval constraints. In addition, we denote by $\mathcal{P} = \mathcal{X} \cap \mathcal{M}$ the feasible set, by $x^* = [x_1^*, \ldots, x_m^*]^T$ the optimal solution to (8.1), and by $f(x^*) = f^*$ the optimal value. The (nonempty) optimal solution set is denoted by $\mathcal{X}^* = \{x \in \mathcal{P} \mid \sum_{i=1}^{m} f_i(x_i) = f^*\}$.
Remark 8.1 The optimization problem (8.1) is a general mathematical form of the EDP, a fundamental problem that arises in a variety of engineering application domains, such as smart grids [7, 12, 26, 27]. Important issues such as transmission line losses, valve-point effects, prohibited operating zones of generators, and generator ramp-rate limits could be incorporated into the considered problem, but this is not the focus of our work.
8.2.4 Communication Network
In this chapter, we consider a weighted undirected network $\mathcal{G}$ made up of $m$ nodes, where each node communicates only locally, that is, with a subset of the other nodes. The network is represented by $\mathcal{G} = \{\mathcal{V}, \mathcal{E}, A\}$, where $\mathcal{V} = \{1, \ldots, m\}$ is the node set, $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the edge set, and $A = [a_{ij}] \in \mathbb{R}^{m \times m}$ is the weighted adjacency matrix, whose weights satisfy $a_{ij} = a_{ji} > 0$ if $(i, j) \in \mathcal{E}$ and $a_{ij} = 0$ otherwise. Since $\mathcal{G}$ is simple, the diagonal elements satisfy $a_{ii} = 0$, and $A$ is symmetric. Nodes $i$ and $j$ can communicate directly with each other only if $(i, j) \in \mathcal{E}$. The neighboring set of node $i$ is denoted by $\mathcal{N}_i = \{j \in \mathcal{V} \mid (i, j) \in \mathcal{E},\ a_{ij} > 0\}$. The degree matrix $D = \mathrm{diag}\{\tilde{d}_1, \ldots, \tilde{d}_m\}$ is diagonal, with diagonal elements $\tilde{d}_i = \sum_{j=1}^{m} a_{ij}$, $\forall i$. Define $\hat{d} = \max_{i \in \mathcal{V}} \{\tilde{d}_i\}$. The Laplacian matrix is $L = [l_{ij}] \in \mathbb{R}^{m \times m}$, which satisfies $L = D - A$.
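To make the graph quantities concrete, the following sketch builds the degrees, the Laplacian $L = D - A$, the maximum degree $\hat{d}$, and the neighbor sets for a small network. The 3-node path graph with unit weights and the helper name `laplacian` are illustrative assumptions.

```python
def laplacian(A):
    """Return (L, degrees) for a symmetric weighted adjacency matrix A."""
    m = len(A)
    for i in range(m):
        if A[i][i] != 0:
            raise ValueError("a simple graph has zero diagonal weights")
        for j in range(m):
            if A[i][j] != A[j][i] or A[i][j] < 0:
                raise ValueError("weights must be symmetric and non-negative")
    degrees = [sum(row) for row in A]
    # L = D - A: degrees on the diagonal, negated weights elsewhere.
    L = [[(degrees[i] if i == j else 0.0) - A[i][j] for j in range(m)]
         for i in range(m)]
    return L, degrees


# Hypothetical path graph 0 - 1 - 2 with unit weights.
A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
L, degrees = laplacian(A)
d_hat = max(degrees)
neighbors = {i: [j for j in range(len(A)) if A[i][j] > 0] for i in range(len(A))}
print(L, d_hat, neighbors)
```

Every row of $L$ sums to zero, which is the property $L\mathbf{1}_m = 0$ that the later analysis relies on.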
8.3 Algorithm Development This section presents the algorithm development followed by some assumptions. First, the problem reformulation is shown.
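8.3.1 Problem Reformulation
The dual problem of (8.1) can be written as
$$\max_{\lambda \in \mathbb{R}} C(\lambda) = \sum_{i=1}^{m} C_i(\lambda), \tag{8.2}$$
where $C_i(\lambda) = -f_i^{\perp}(-\lambda) - \lambda d_i$ and $f_i^{\perp}(\lambda) = \sup_{x_i \in \mathcal{X}_i} \{\lambda x_i - f_i(x_i)\}$ is the conjugate of $f_i$ over $\mathcal{X}_i$. Then, we transform the dual problem (8.2) into the following minimization form:
$$\min_{\lambda \in \mathbb{R}} q(\lambda) = \sum_{i=1}^{m} q_i(\lambda), \tag{8.3}$$
where $q_i$ (convex) is given by
$$q_i(\lambda) = -C_i(\lambda) = f_i^{\perp}(-\lambda) + \lambda d_i. \tag{8.4}$$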
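For a quadratic cost of the form $f_i(x_i) = c_i x_i + b_i x_i^2$ with $b_i > 0$ (the form used in Example 1 of Sect. 8.5), the local dual function in (8.4) can be evaluated in closed form. The sketch below is illustrative only: the function names and the numerical values are assumptions. It computes the inner minimizer, $q_i$, and the gradient $d_i - x_i(\lambda)$, and compares the gradient with a finite difference.

```python
def primal_minimizer(lam, b, c, x_min, x_max):
    """argmin over [x_min, x_max] of c*x + b*x**2 + lam*x, assuming b > 0."""
    x_free = -(lam + c) / (2.0 * b)      # stationary point without bounds
    return min(max(x_free, x_min), x_max)


def q_local(lam, b, c, d, x_min, x_max):
    """q_i(lam) = sup_x {-lam*x - f_i(x)} + lam*d, with f_i(x) = c*x + b*x**2."""
    x = primal_minimizer(lam, b, c, x_min, x_max)
    return -(c * x + b * x * x + lam * x) + lam * d


def grad_q_local(lam, b, c, d, x_min, x_max):
    """Gradient of q_i: d minus the inner minimizer."""
    return d - primal_minimizer(lam, b, c, x_min, x_max)


# Hypothetical node data: f(x) = x + 0.5*x**2 on [0, 10], d = 2.
params = dict(b=0.5, c=1.0, d=2.0, x_min=0.0, x_max=10.0)
lam = -4.0
eps = 1e-4
finite_diff = (q_local(lam + eps, **params) - q_local(lam - eps, **params)) / (2 * eps)
print(grad_q_local(lam, **params), finite_diff)
```

When the bound is active (for instance at `lam = 5.0` here, where the unconstrained minimizer is negative), the minimizer is clipped to the interval and the gradient becomes the constant `d - x_min`.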
8.3.2 Event-Triggered Control Scheme
The event-triggered control scheme can effectively decrease the network burden in comparison with the traditional time-triggered control scheme [7, 26]. In the event-triggered control scheme, $t^i_{k(t)}$, $t = 0, 1, \ldots$, $i \in \mathcal{V}$, stands for the event-triggered time instant sequence of node $i$. At each event-triggered time instant $t^i_{k(t)}$, $t = 0, 1, \ldots$, node $i$ holds its estimates $\lambda_i(t^i_{k(t)})$ and $y_i(t^i_{k(t)})$ and then transmits them to its neighbors. Likewise, node $j$ disseminates its latest sampled states $\lambda_j(t^j_{k'(t)})$ and $y_j(t^j_{k'(t)})$ to node $i$ at $t^j_{k'(t)}$ when $(j, i) \in \mathcal{E}$, where $k'(t) = \arg\min_{m \in \bar{k}:\, t \ge t^j_m} \{t - t^j_m\}$, that is, $k'(t)$ indexes node $j$'s most recent triggering instant not later than $t$.
After $t^i_{k(t)}$, the next event-triggered time instant for node $i \in \mathcal{V}$ is denoted by $t^i_{k(t)+1}$ and is given by
$$t^i_{k(t)+1} = \inf\{t > t^i_{k(t)} \mid \|e_i^{\lambda}(t)\| + \|e_i^{y}(t)\| > Q\omega^{t}\}, \tag{8.5}$$
where $Q > 0$ and $0 < \omega < 1$ are the control parameters; the measurement errors $e_i^{\lambda}(t)$ and $e_i^{y}(t)$ are given as
$$e_i^{\lambda}(t) = \lambda_i(t^i_{k(t)}) - \lambda_i(t), \qquad e_i^{y}(t) = y_i(t^i_{k(t)}) - y_i(t). \tag{8.6}$$
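The triggering rule (8.5) with the errors (8.6) amounts to a single comparison per node and time step. A minimal sketch for scalar states follows; the function name `should_trigger` and the numbers are illustrative assumptions.

```python
def should_trigger(lam_held, y_held, lam_now, y_now, Q, omega, t):
    """Return True when |e_lambda(t)| + |e_y(t)| exceeds Q * omega**t.

    lam_held and y_held are the states sent at the last triggering instant;
    lam_now and y_now are the current local states.
    """
    error = abs(lam_held - lam_now) + abs(y_held - y_now)
    return error > Q * omega ** t


# Hypothetical parameters: the threshold at t = 2 is 1.0 * 0.5**2 = 0.25.
small_drift = should_trigger(1.0, 2.0, 1.1, 2.1, Q=1.0, omega=0.5, t=2)
large_drift = should_trigger(1.0, 2.0, 1.2, 2.1, Q=1.0, omega=0.5, t=2)
print(small_drift, large_drift)
```

Because the threshold $Q\omega^t$ shrinks geometrically, a drift that is tolerated early (about 0.2 at $t = 2$ above) triggers a transmission later, once the threshold has fallen below it.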
8.3.2.1 Event-Triggered Distributed Accelerated Primal–Dual Algorithm (ET-DAPDA)
Here, ET-DAPDA for solving (8.3) is formally described in Algorithm 4.

Algorithm 4 Event-triggered distributed accelerated primal–dual algorithm (ET-DAPDA) at each node $i$
1: Initialization: Choose the initial points $x_i(0) \in \mathbb{R}$, $z_i(0) \in \mathbb{R}$, $\lambda_i(0) \in \mathbb{R}$, $y_i(0) = d_i - x_i(0) \in \mathbb{R}$, $v_i^{\lambda}(0) = \lambda_i(0) \in \mathbb{R}$, and $v_i^{y}(0) = y_i(0) \in \mathbb{R}$.
2: Set $t = 0$.
3: Local information exchange: Each node $i$ independently interacts with its neighbors only at the event-triggered sampling time instants.
4: Control inputs updates: Based on the event-triggered control scheme, each node $i$ updates the auxiliary variables:
$$\begin{cases} v_i^{\lambda}(t) = \lambda_i(t) + h \sum_{j=1}^{m} a_{ij}\big(\lambda_j(t^j_{k'(t)}) - \lambda_i(t^i_{k(t)})\big) \\ v_i^{y}(t) = y_i(t) + h \sum_{j=1}^{m} a_{ij}\big(y_j(t^j_{k'(t)}) - y_i(t^i_{k(t)})\big) \end{cases}, \tag{8.7}$$
where $a_{ij} \ge 0$ is the weight between node $j$ and node $i$, and $h > 0$ is the tunable parameter.
5: Local variables updates: Each node $i$ updates the following variables according to (8.7):
$$\begin{cases} x_i(t+1) = \arg\min_{x_i \in \mathcal{X}_i} \{ f_i(x_i) + \lambda_i(t)(x_i - d_i) \} \\ z_i(t+1) = v_i^{\lambda}(t) + \eta_i(t)(z_i(t) - z_i(t-1)) - \alpha_i y_i(t) \\ \lambda_i(t+1) = z_i(t+1) + \eta_i(t)(z_i(t+1) - z_i(t)) \\ y_i(t+1) = v_i^{y}(t) - x_i(t+1) + x_i(t) \end{cases}, \tag{8.8}$$
where $\alpha_i > 0$ and $\eta_i(t) \ge 0$.
6: Set $t = t + 1$ and repeat.
7: Until $t > t_{\max}$ ($t_{\max}$ is a predefined maximum iteration number).

It follows from (8.4) that
$$\nabla q_i(\lambda_i(t)) = -x_i(t) + d_i, \tag{8.9}$$
where $\nabla q_i(\lambda_i(t))$ is the gradient of $q_i(\lambda)$ at $\lambda = \lambda_i(t)$, $t \ge 0$. It is worth mentioning that ET-DAPDA requires neither repeated exchange of information between nodes nor a significant computational cost. Hence, ET-DAPDA with the event-triggered control scheme may extend the lifespan of the system.
Define $\nabla Q(\lambda(t)) = [\nabla q_1(\lambda_1(t)), \ldots, \nabla q_m(\lambda_m(t))]^T \in \mathbb{R}^m$, $x(t) = [x_1(t), x_2(t), \ldots, x_m(t)]^T \in \mathbb{R}^m$, $z(t) = [z_1(t), z_2(t), \ldots, z_m(t)]^T \in \mathbb{R}^m$, $\lambda(t) = [\lambda_1(t), \lambda_2(t), \ldots, \lambda_m(t)]^T \in \mathbb{R}^m$, $y(t) = [y_1(t), \ldots, y_m(t)]^T \in \mathbb{R}^m$, $v^{\lambda}(t) = [v_1^{\lambda}(t), \ldots, v_m^{\lambda}(t)]^T \in \mathbb{R}^m$, $v^{y}(t) = [v_1^{y}(t), \ldots, v_m^{y}(t)]^T \in \mathbb{R}^m$, $\eta(t) = [\eta_1(t), \ldots, \eta_m(t)]^T \in \mathbb{R}^m$, and $\alpha = [\alpha_1, \ldots, \alpha_m]^T \in \mathbb{R}^m$. Then, we write the vector form of Algorithm 4 below:
$$\begin{cases} x(t+1) = \arg\min_{x \in \mathcal{X}} \sum_{i=1}^{m} \big(f_i(x_i) + \lambda_i(t)(x_i - d_i)\big) \\ z(t+1) = v^{\lambda}(t) + D_{\eta}(t)(z(t) - z(t-1)) - D_{\alpha} y(t) \\ \lambda(t+1) = z(t+1) + D_{\eta}(t)(z(t+1) - z(t)) \\ y(t+1) = v^{y}(t) - x(t+1) + x(t), \end{cases} \tag{8.10}$$
where $D_{\alpha} = \mathrm{diag}\{\alpha\}$, $D_{\eta}(t) = \mathrm{diag}\{\eta(t)\}$, $v^{\lambda}(t) = W\lambda(t) - hLe^{\lambda}(t)$, and $v^{y}(t) = Wy(t) - hLe^{y}(t)$. Here, $W = I_m - hL$ with $W = [w_{ij}] \in \mathbb{R}^{m \times m}$, where $0 < h < 1/\hat{d}$. The initialization is $y(0) = d - x(0)$, where $d = [d_1, \ldots, d_m]^T \in \mathbb{R}^m$.
Assumption 8.1 ([25]) The undirected network $\mathcal{G}$ is connected, and the matrix $W$ satisfies $\mathbf{1}_m^T W = \mathbf{1}_m^T$, $W\mathbf{1}_m = \mathbf{1}_m$, and $\delta = \rho(W - \mathbf{1}_m\mathbf{1}_m^T/m) < 1$.
Assumption 8.2 ([26]) Each cost function $f_i$, $i \in \mathcal{V}$, is $\mu_i$-strongly convex and has Lipschitz continuous gradient with parameter $\sigma_i$, where $\sigma_i, \mu_i \in (0, +\infty)$.
Remark 8.2 Assumption 8.1 describes the information exchange rules in the network. Assumption 8.2 ensures the nonemptiness of the optimal solution set.
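The updates (8.5)–(8.8) can be put together in a short simulation. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes quadratic costs $f_i(x_i) = c_i x_i + b_i x_i^2$ (so the argmin has a closed form), a constant momentum coefficient, the "+" sign of the multiplier update in the vector form (8.10), and a hypothetical 3-node path network; all parameter values are chosen for illustration only.

```python
def clip(value, lo, hi):
    return max(lo, min(hi, value))


def et_dapda(b, c, d, x_min, x_max, A, h, alpha, eta, Q, omega, steps):
    """Scalar-per-node sketch of ET-DAPDA for f_i(x) = c_i*x + b_i*x**2."""
    m = len(d)

    def argmin_local(i, lam_i):
        # Closed-form minimizer of f_i(x) + lam_i*(x - d_i) over the interval.
        return clip(-(lam_i + c[i]) / (2.0 * b[i]), x_min[i], x_max[i])

    lam = [0.0] * m
    z, z_prev = lam[:], lam[:]
    x = [argmin_local(i, lam[i]) for i in range(m)]
    y = [d[i] - x[i] for i in range(m)]            # y(0) = d - x(0)
    lam_hat, y_hat = lam[:], y[:]                  # last transmitted states
    events = [0] * m

    for t in range(steps):
        # Event check (8.5): retransmit when the error exceeds Q*omega**t.
        for i in range(m):
            error = abs(lam_hat[i] - lam[i]) + abs(y_hat[i] - y[i])
            if error > Q * omega ** t:
                lam_hat[i], y_hat[i] = lam[i], y[i]
                events[i] += 1
        # Control inputs (8.7), built only from transmitted states.
        v_lam = [lam[i] + h * sum(A[i][j] * (lam_hat[j] - lam_hat[i])
                                  for j in range(m)) for i in range(m)]
        v_y = [y[i] + h * sum(A[i][j] * (y_hat[j] - y_hat[i])
                              for j in range(m)) for i in range(m)]
        # Local updates (8.8), multiplier update with the sign used in (8.10).
        x_new = [argmin_local(i, lam[i]) for i in range(m)]
        z_new = [v_lam[i] + eta * (z[i] - z_prev[i]) - alpha[i] * y[i]
                 for i in range(m)]
        lam = [z_new[i] + eta * (z_new[i] - z[i]) for i in range(m)]
        y = [v_y[i] - x_new[i] + x[i] for i in range(m)]
        x, z_prev, z = x_new, z, z_new
    return x, lam, y, events


# Hypothetical 3-generator instance; its optimum is x* = [3, 2, 1], lambda* = -4.
b, c, d = [0.5, 0.5, 0.5], [1.0, 2.0, 3.0], [2.0, 2.0, 2.0]
x_min, x_max = [0.0] * 3, [10.0] * 3
A = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
x, lam, y, events = et_dapda(b, c, d, x_min, x_max, A, h=0.25,
                             alpha=[0.09, 0.10, 0.11], eta=0.1,
                             Q=1.0, omega=0.9, steps=600)
print(x, lam, events)
```

Because the weights are symmetric, the mixing step preserves the sum of the tracking variables, so $\sum_i (y_i(t) + x_i(t)) = \sum_i d_i$ holds at every step regardless of when events fire; this is a convenient sanity check on any implementation.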
8.4 Convergence Analysis
In this section, the linear convergence rate of ET-DAPDA is established by applying the generalized small gain theorem [36]. First, we combine four inequalities into a linear matrix inequality and then examine the spectral properties of its coefficient matrix. For this purpose, some elementary notation is introduced to simplify the convergence results. Denote $v_1(t) = \lambda(t) - (1/m)\mathbf{1}_m\mathbf{1}_m^T\lambda(t)$, $\forall t \ge 0$, $v_2(t) = (1/m)\mathbf{1}_m\mathbf{1}_m^T\lambda(t) - \mathbf{1}_m\lambda^*$, $\forall t \ge 0$, $v_3(t) = z(t) - z(t-1)$, $\forall t > 0$, with the convention $v_3(0) = 0_m$, and $v_4(t) = y(t) - (1/m)\mathbf{1}_m\mathbf{1}_m^T y(t)$, $\forall t \ge 0$. Besides, define $\nabla Q(\mathbf{1}_m\bar{\lambda}(t)) = [\nabla q_1(\bar{\lambda}(t)), \ldots, \nabla q_m(\bar{\lambda}(t))]^T$, $\hat{\alpha} = \max_{i \in \mathcal{V}}\{\alpha_i\}$, $\hat{\eta} = \sup_{t \ge 0}\max_{i \in \mathcal{V}}\{\eta_i(t)\}$, $\vartheta = \sum_{i=1}^{m}(1/\sigma_i)$, $\ell = \sum_{i=1}^{m}(1/\mu_i)$, $\hat{\vartheta} = \max_{i \in \mathcal{V}}\{1/\sigma_i\}$, and $\hat{\ell} = \max_{i \in \mathcal{V}}\{1/\mu_i\}$.
8.4.1 Supporting Lemmas
First, the convergence analysis of the Lagrangian multiplier relies primarily on the generalized small gain theorem [36] described below.
Lemma 8.3 ([36]) Assume that non-negative vector sequences $\{\tilde{v}_i(t)\}_{t=0}^{\infty}$, $i = 1, \ldots, m$, a non-negative matrix $\tilde{\Gamma} \in \mathbb{R}^{m \times m}$, $\tilde{u} \in \mathbb{R}^m$, and $\gamma \in (0, 1)$ satisfy
$$\tilde{v}^{\gamma,T} \preceq \tilde{\Gamma}\tilde{v}^{\gamma,T} + \tilde{u},$$
for all positive integers $T$, where $\tilde{v}^{\gamma,T} = [\|\tilde{v}_1\|^{\gamma,T}, \ldots, \|\tilde{v}_m\|^{\gamma,T}]^T$. Then, $\|\tilde{v}_i\|^{\gamma} < B_0$ if $\rho(\tilde{\Gamma}) < 1$, where $B_0 < \infty$. Hence, each $\|\tilde{v}_i(t)\|$, $i \in \{1, \ldots, m\}$, converges linearly to zero at a rate of $O(\gamma^t)$.
In the following, the bound of $\|v_1\|^{\gamma,T}$ is shown.
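Lemma 8.4 For all $T \ge 0$ and under Assumptions 8.1 and 8.2, we have the following inequality:
$$\|v_1\|^{\gamma,T} \le \frac{\hat{\alpha}\hat{\ell}}{\gamma - \delta - \hat{\alpha}\hat{\ell}}\|v_2\|^{\gamma,T} + \frac{2\hat{\eta}}{\gamma - \delta - \hat{\alpha}\hat{\ell}}\|v_3\|^{\gamma,T} + \frac{\hat{\alpha}}{\gamma - \delta - \hat{\alpha}\hat{\ell}}\|v_4\|^{\gamma,T} + \frac{h\|L\|}{\gamma - \delta - \hat{\alpha}\hat{\ell}}\|e^{\lambda}\|^{\gamma,T} + \frac{\|v_1(0)\|}{\gamma - \delta - \hat{\alpha}\hat{\ell}},$$
for all $\delta + \hat{\alpha}\hat{\ell} < \gamma < 1$, where $\delta$ is provided in Assumption 8.1.
Proof On the basis of the updates of $z(t)$ and $\lambda(t)$ of ET-DAPDA in (8.10), it holds that
$$\begin{aligned} \|v_1(t+1)\| \le{} & \Big\|\Big(W - \frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\Big)v_1(t)\Big\| + \Big\|\Big(I_m - \frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\Big)D_{\eta}(t)v_3(t)\Big\| \\ & + \Big\|\Big(I_m - \frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\Big)D_{\alpha}y(t)\Big\| + h\|L\|\|e^{\lambda}(t)\| \\ & + \Big\|\Big(I_m - \frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\Big)D_{\eta}(t+1)v_3(t+1)\Big\|, \end{aligned} \tag{8.11}$$
where the inequality in (8.11) follows from the fact that $(W - (1/m)\mathbf{1}_m\mathbf{1}_m^T)(I_m - (1/m)\mathbf{1}_m\mathbf{1}_m^T) = W - (1/m)\mathbf{1}_m\mathbf{1}_m^T$. Notice that $\hat{\alpha} = \|D_{\alpha}\|$ and $\|D_{\eta}(t)\| \le \hat{\eta}$. Based on Assumption 8.1, (8.11) further implies that
$$\|v_1(t+1)\| \le \delta\|v_1(t)\| + \hat{\alpha}\|y(t)\| + \hat{\eta}\|v_3(t+1)\| + \hat{\eta}\|v_3(t)\| + h\|L\|\|e^{\lambda}(t)\|. \tag{8.12}$$
Since $\|y(t)\| \le \hat{\ell}\|v_1(t)\| + \hat{\ell}\|v_2(t)\| + \|v_4(t)\|$ holds [12], we have that
$$\|v_1(t+1)\| \le (\delta + \hat{\alpha}\hat{\ell})\|v_1(t)\| + \hat{\alpha}\hat{\ell}\|v_2(t)\| + \hat{\eta}\|v_3(t+1)\| + \hat{\eta}\|v_3(t)\| + \hat{\alpha}\|v_4(t)\| + h\|L\|\|e^{\lambda}(t)\|. \tag{8.13}$$
From here, the process resembles that in the proof of Lemma 8 in [20]. We include the proof for completeness. Dividing both sides of (8.13) by $\gamma^{t+1}$ and taking the supremum over $t = 0, \ldots, T-1$, one has
$$\begin{aligned} \sup_{t=0,\ldots,T-1}\frac{\|v_1(t+1)\|}{\gamma^{t+1}} \le{} & \frac{\delta + \hat{\alpha}\hat{\ell}}{\gamma}\sup_{t=0,\ldots,T-1}\frac{\|v_1(t)\|}{\gamma^{t}} + \frac{\hat{\alpha}\hat{\ell}}{\gamma}\sup_{t=0,\ldots,T-1}\frac{\|v_2(t)\|}{\gamma^{t}} \\ & + \frac{\hat{\alpha}}{\gamma}\sup_{t=0,\ldots,T-1}\frac{\|v_4(t)\|}{\gamma^{t}} + \frac{h\|L\|}{\gamma}\sup_{t=0,\ldots,T-1}\frac{\|e^{\lambda}(t)\|}{\gamma^{t}} \\ & + \hat{\eta}\sup_{t=0,\ldots,T-1}\frac{\|v_3(t)\| + \|v_3(t+1)\|}{\gamma^{t+1}}. \end{aligned} \tag{8.14}$$
Also, noting that $\gamma^{-0}\|v_1(0)\| = \|v_1(0)\|$ and $\gamma < 1$, one has
$$\|v_1\|^{\gamma,T} \le \frac{\delta + \hat{\alpha}\hat{\ell}}{\gamma}\|v_1\|^{\gamma,T} + \frac{\hat{\alpha}\hat{\ell}}{\gamma}\|v_2\|^{\gamma,T} + \frac{2\hat{\eta}}{\gamma}\|v_3\|^{\gamma,T} + \frac{\hat{\alpha}}{\gamma}\|v_4\|^{\gamma,T} + \frac{h\|L\|}{\gamma}\|e^{\lambda}\|^{\gamma,T} + \|v_1(0)\|, \tag{8.15}$$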
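which after some algebraic manipulations yields the desired result. This completes the proof.
Notice that $(1/m)\mathbf{1}_m\mathbf{1}_m^T\lambda(t) = \mathbf{1}_m\bar{\lambda}(t)$. Then, the bound of $\|v_2\|^{\gamma,T}$ is given in the next lemma.
Lemma 8.5 Under Assumptions 8.1–8.2 and when $c_1 < \gamma < 1$ as well as $0 < (1/m^2)\mathbf{1}_m^T\alpha < 2/\ell$, one gets, $\forall T \ge 0$,
$$\|v_2\|^{\gamma,T} \le \frac{\hat{\alpha}\hat{\ell}}{\gamma - c_1}\|v_1\|^{\gamma,T} + \frac{2\hat{\eta}}{\gamma - c_1}\|v_3\|^{\gamma,T} + \frac{\hat{\alpha}}{\gamma - c_1}\|v_4\|^{\gamma,T} + \frac{\gamma}{\gamma - c_1}\|v_2(0)\|,$$
where $c_1 = \max\{|1 - \ell(1/m^2)\mathbf{1}_m^T\alpha|, |1 - \vartheta(1/m^2)\mathbf{1}_m^T\alpha|\}$.
Proof Notice that $\mathbf{1}_m^T W = \mathbf{1}_m^T$ and $\mathbf{1}_m^T L = 0$. Following from the updates of $z(t)$ and $\lambda(t)$ of ET-DAPDA in (8.10), we can utilize the fact $(1/m)\mathbf{1}_m\mathbf{1}_m^T y(t) = (1/m)\mathbf{1}_m\mathbf{1}_m^T\nabla Q(\lambda(t))$ to establish
$$\begin{aligned} \Big\|\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\lambda(t+1) - \mathbf{1}_m\lambda^*\Big\| \le{} & \Big\|\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\lambda(t) - \frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T D_{\alpha}\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T y(t) - \mathbf{1}_m\lambda^*\Big\| \\ & + \hat{\eta}\big[\|z(t) - z(t-1)\| + \|z(t+1) - z(t)\|\big] + \hat{\alpha}\Big\|y(t) - \frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T y(t)\Big\|. \end{aligned} \tag{8.16}$$
We now discuss the first term on the right-hand side of (8.16). Note that $(1/m)\mathbf{1}_m\mathbf{1}_m^T y(t) = (1/m)\mathbf{1}_m\mathbf{1}_m^T\nabla Q(\lambda(t))$. Utilizing $\mathbf{1}_m\mathbf{1}_m^T D_{\alpha}\mathbf{1}_m\mathbf{1}_m^T = (\mathbf{1}_m^T\alpha)\mathbf{1}_m\mathbf{1}_m^T$, one obtains
$$\begin{aligned} & \Big\|\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\lambda(t) - \frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T D_{\alpha}\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T y(t) - \mathbf{1}_m\lambda^*\Big\| \\ & \quad \le \Big\|\mathbf{1}_m\Big(\bar{\lambda}(t) - \frac{1}{m}\mathbf{1}_m^T\alpha\frac{1}{m}\nabla q(\bar{\lambda}(t)) - \lambda^*\Big)\Big\| + \frac{1}{m}\mathbf{1}_m^T\alpha\Big\|\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\nabla Q(\mathbf{1}_m\bar{\lambda}(t)) - \frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\nabla Q(\lambda(t))\Big\| \\ & \quad = \Lambda_1 + \Lambda_2, \end{aligned} \tag{8.17}$$
where $\nabla q(\bar{\lambda}(t)) = \mathbf{1}_m^T\nabla Q(\mathbf{1}_m\bar{\lambda}(t))$. By Lemma 3 in [33], if $0 < (1/m^2)\mathbf{1}_m^T\alpha < 2/\ell$, $\Lambda_1$ is bounded by
$$\Lambda_1 \le c_1\sqrt{m}\,\|\bar{\lambda}(t) - \lambda^*\| = c_1\Big\|\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\lambda(t) - \mathbf{1}_m\lambda^*\Big\|, \tag{8.18}$$
where $c_1 = \max\{|1 - \ell(1/m^2)\mathbf{1}_m^T\alpha|, |1 - \vartheta(1/m^2)\mathbf{1}_m^T\alpha|\}$. Then, $\Lambda_2$ can be bounded in the following way:
$$\Lambda_2 \le \frac{1}{m}\mathbf{1}_m^T\alpha\,\hat{\ell}\,\Big\|\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\lambda(t) - \lambda(t)\Big\|. \tag{8.19}$$
Plugging (8.17)–(8.19) into (8.16) yields
$$\|v_2(t+1)\| \le c_1\|v_2(t)\| + \hat{\alpha}\hat{\ell}\|v_1(t)\| + \hat{\eta}\|v_3(t+1)\| + \hat{\eta}\|v_3(t)\| + \hat{\alpha}\|v_4(t)\|. \tag{8.20}$$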
Here, we can identify the terms in (8.20) with the terms in (8.13). Therefore, to establish this lemma, we proceed as in the proof of Lemma 8.4 from (8.13). This completes the proof.
The following lemma gives the bound of $\|v_3\|^{\gamma,T}$.
Lemma 8.6 Under Assumptions 8.1 and 8.2, if $2\hat{\eta} < \gamma < 1$, one gets that
$$\|v_3\|^{\gamma,T} \le \frac{\kappa_1 + \hat{\alpha}\hat{\ell}}{\gamma - 2\hat{\eta}}\|v_1\|^{\gamma,T} + \frac{\hat{\alpha}\hat{\ell}}{\gamma - 2\hat{\eta}}\|v_2\|^{\gamma,T} + \frac{\hat{\alpha}}{\gamma - 2\hat{\eta}}\|v_4\|^{\gamma,T} + \frac{h\|L\|}{\gamma - 2\hat{\eta}}\|e^{\lambda}\|^{\gamma,T} + \frac{\gamma\|v_3(0)\|}{\gamma - 2\hat{\eta}},$$
$\forall T \ge 0$, where $\kappa_1 = \|W - I_m\|$.
Proof It is obtained from the updates of $z(t)$ and $\lambda(t)$ of ET-DAPDA in (8.10) that
$$\|z(t+1) - z(t)\| \le \kappa_1\Big\|\frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T\lambda(t) - \lambda(t)\Big\| + 2\hat{\eta}\|z(t) - z(t-1)\| + h\|L\|\|e^{\lambda}(t)\| + \hat{\alpha}\|y(t)\|, \tag{8.21}$$
where the inequality in (8.21) is obtained from the fact that $(W - I_m)(I_m - (1/m)\mathbf{1}_m\mathbf{1}_m^T) = W - I_m$. Then, one has
$$\|v_3(t+1)\| \le 2\hat{\eta}\|v_3(t)\| + (\kappa_1 + \hat{\alpha}\hat{\ell})\|v_1(t)\| + \hat{\alpha}\hat{\ell}\|v_2(t)\| + \hat{\alpha}\|v_4(t)\| + h\|L\|\|e^{\lambda}(t)\|. \tag{8.22}$$
Following the same procedure as after (8.13), the desired result is obtained. This completes the proof.
The next lemma establishes the inequality that bounds the gradient estimation error term $\|v_4\|^{\gamma,T}$.
Lemma 8.7 Under Assumptions 8.1 and 8.2, if $\delta < \gamma < 1$, one obtains that
$$\|v_4\|^{\gamma,T} \le \frac{\hat{\ell} + 2\hat{\eta}\hat{\ell}}{\gamma - \delta}\|v_3\|^{\gamma,T} + \frac{h\|L\|}{\gamma - \delta}\|e^{y}\|^{\gamma,T} + \frac{\gamma}{\gamma - \delta}\|v_4(0)\|,$$
for all $T \ge 0$.
Proof By the update of $y(t)$ of ET-DAPDA in (8.10), we have
$$\Big\|y(t+1) - \frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T y(t+1)\Big\| \le \delta\Big\|y(t) - \frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T y(t)\Big\| + h\|L\|\|e^{y}(t)\| + \|x(t+1) - x(t)\|, \tag{8.23}$$
where we utilize the triangle inequality and Assumption 8.1 to derive the inequality. With regard to the last term of the inequality in (8.23), we apply the update of $\lambda(t)$ of ET-DAPDA in (8.10) and the gradient $\nabla q_i(\lambda_i(t)) = -x_i(t) + d_i$ to obtain
$$\|x(t+1) - x(t)\| \le \hat{\ell}(1 + \hat{\eta})\|z(t+1) - z(t)\| + \hat{\ell}\hat{\eta}\|z(t) - z(t-1)\|. \tag{8.24}$$
Combining (8.23) and (8.24) gives
$$\|v_4(t+1)\| \le \delta\|v_4(t)\| + \hat{\ell}(1 + \hat{\eta})\|v_3(t+1)\| + \hat{\ell}\hat{\eta}\|v_3(t)\| + h\|L\|\|e^{y}(t)\|.$$
Following the same procedure as after (8.13), the desired result is obtained if $\delta < \gamma < 1$. This completes the proof.
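8.4.2 Main Results
Building on Lemmas 8.4–8.7, the convergence results of ET-DAPDA (8.10) are now established. For convenience, we define $u_1 = (h\|L\|\|e^{\lambda}\|^{\gamma,T} + \|v_1(0)\|)/(\gamma - \delta - \hat{\alpha}\hat{\ell})$, $u_2 = \gamma\|v_2(0)\|/(\gamma - c_1)$, $u_3 = (h\|L\|\|e^{\lambda}\|^{\gamma,T} + \gamma\|v_3(0)\|)/(\gamma - 2\hat{\eta})$, and $u_4 = (h\|L\|\|e^{y}\|^{\gamma,T} + \gamma\|v_4(0)\|)/(\gamma - \delta)$. Theorem 8.8 is introduced below.
Theorem 8.8 Consider ET-DAPDA (8.10), which updates the sequences $\{x(t)\}$, $\{z(t)\}$, $\{\lambda(t)\}$, and $\{y(t)\}$. Then, under Assumptions 8.1 and 8.2 and if $0 < (1/m^2)\mathbf{1}_m^T\alpha < 2/\ell$, we acquire the following linear system of inequalities:
$$v^{\gamma,T} \preceq \Gamma v^{\gamma,T} + u, \tag{8.25}$$
where $u = [u_1, u_2, u_3, u_4]^T$, $v^{\gamma,T} = [\|v_1\|^{\gamma,T}, \|v_2\|^{\gamma,T}, \|v_3\|^{\gamma,T}, \|v_4\|^{\gamma,T}]^T$, and the elements of the matrix $\Gamma = [\gamma_{ij}] \in \mathbb{R}^{4 \times 4}$ are given by
$$\Gamma = \begin{bmatrix} 0 & \dfrac{\hat{\alpha}\hat{\ell}}{\gamma - \delta - \hat{\alpha}\hat{\ell}} & \dfrac{2\hat{\eta}}{\gamma - \delta - \hat{\alpha}\hat{\ell}} & \dfrac{\hat{\alpha}}{\gamma - \delta - \hat{\alpha}\hat{\ell}} \\ \dfrac{\hat{\alpha}\hat{\ell}}{\gamma - c_1} & 0 & \dfrac{2\hat{\eta}}{\gamma - c_1} & \dfrac{\hat{\alpha}}{\gamma - c_1} \\ \dfrac{\kappa_1 + \hat{\alpha}\hat{\ell}}{\gamma - 2\hat{\eta}} & \dfrac{\hat{\alpha}\hat{\ell}}{\gamma - 2\hat{\eta}} & 0 & \dfrac{\hat{\alpha}}{\gamma - 2\hat{\eta}} \\ 0 & 0 & \dfrac{\hat{\ell} + 2\hat{\eta}\hat{\ell}}{\gamma - \delta} & 0 \end{bmatrix}.$$
Assume moreover that the largest step-size satisfies
$$0 < \hat{\alpha} < \min\left\{\frac{\beta_1(\gamma - \delta)}{\beta_1\hat{\ell} + \beta_2\hat{\ell} + \beta_4}, \ \frac{m}{\ell}, \ \frac{\beta_3 - \beta_1\kappa_1}{\beta_1\hat{\ell} + \beta_2\hat{\ell} + \beta_4}\right\}, \tag{8.26}$$
and the maximum momentum coefficient satisfies
$$0 \le \hat{\eta} < \min\left\{\frac{\beta_1(\gamma - \delta) - \hat{\alpha}(\beta_1\hat{\ell} + \beta_2\hat{\ell} + \beta_4)}{2\beta_3}, \ \frac{\beta_2(\gamma - c_1) - \hat{\alpha}(\beta_1\hat{\ell} + \beta_4)}{2\beta_3}, \ \frac{\beta_4(\gamma - \delta) - \beta_3\hat{\ell}}{2\hat{\ell}\beta_3}, \ \frac{\beta_3\gamma - \beta_1\kappa_1 - \hat{\alpha}(\beta_1\hat{\ell} + \beta_2\hat{\ell} + \beta_4)}{2\beta_3}\right\}. \tag{8.27}$$
Then, if $0 < \gamma < 1$ is a constant such that
$$\gamma = \max\left\{\frac{2\hat{\eta}\beta_3 + \beta_1\delta + \hat{\alpha}(\beta_1\hat{\ell} + \beta_2\hat{\ell} + \beta_4)}{\beta_1}, \ \frac{2\hat{\eta}\hat{\ell}\beta_3 + \beta_3\hat{\ell} + \beta_4\delta}{\beta_4}, \ \frac{2\hat{\eta}\beta_3 + \beta_2 c_1 + \hat{\alpha}(\beta_1\hat{\ell} + \beta_4)}{\beta_2}, \ \frac{2\hat{\eta}\beta_3 + \beta_1\kappa_1 + \hat{\alpha}(\beta_1\hat{\ell} + \beta_2\hat{\ell} + \beta_4)}{\beta_3}\right\},$$
the sequence $\{\lambda(t)\}$ converges linearly to $\mathbf{1}_m\lambda^*$ at a rate of $O(\gamma^t)$, where $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$ are arbitrary constants such that
$$\beta_1 > 0, \quad \beta_2 > \frac{m(\beta_1\hat{\ell} + \beta_4)}{\vartheta}, \quad \beta_3 > \beta_1\kappa_1, \quad \beta_4 > \frac{\beta_3\hat{\ell}}{1 - \delta}.$$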
Proof First, combining the results of Lemmas 8.4–8.7, we obtain inequality (8.25) immediately. Next, we give sufficient conditions that make the spectral radius of $\Gamma$, denoted $\rho(\Gamma)$, strictly less than 1, i.e., $\rho(\Gamma) < 1$. In accordance with Theorem 8.1.29 in [37], we know that, for a positive vector $\beta = [\beta_1, \ldots, \beta_4]^T \in \mathbb{R}^4$, if $\Gamma\beta < \beta$, then $\rho(\Gamma) < 1$ holds. The inequality $\Gamma\beta < \beta$ is equivalent to
$$\begin{cases} 2\hat{\eta}\beta_3 < \beta_1(\gamma - \delta) - \hat{\alpha}(\beta_1\hat{\ell} + \beta_2\hat{\ell} + \beta_4) \\ 2\hat{\eta}\beta_3 < \beta_2(\gamma - c_1) - \hat{\alpha}(\beta_1\hat{\ell} + \beta_4) \\ 2\hat{\eta}\beta_3 < \beta_3\gamma - \beta_1\kappa_1 - \hat{\alpha}(\beta_1\hat{\ell} + \beta_2\hat{\ell} + \beta_4) \\ 2\hat{\eta}\hat{\ell}\beta_3 < \beta_4(\gamma - \delta) - \beta_3\hat{\ell} \end{cases}, \tag{8.28}$$
which further implies that
$$\begin{cases} \gamma > \dfrac{2\hat{\eta}\beta_3 + \beta_1\delta + \hat{\alpha}(\beta_1\hat{\ell} + \beta_2\hat{\ell} + \beta_4)}{\beta_1} \\ \gamma > \dfrac{2\hat{\eta}\beta_3 + \beta_2 c_1 + \hat{\alpha}(\beta_1\hat{\ell} + \beta_4)}{\beta_2} \\ \gamma > \dfrac{2\hat{\eta}\beta_3 + \beta_1\kappa_1 + \hat{\alpha}(\beta_1\hat{\ell} + \beta_2\hat{\ell} + \beta_4)}{\beta_3} \\ \gamma > \dfrac{2\hat{\eta}\hat{\ell}\beta_3 + \beta_3\hat{\ell} + \beta_4\delta}{\beta_4} \end{cases}. \tag{8.29}$$
Recalling $c_1$ in Lemma 8.5, if $\hat{\alpha} < m/\ell$, we have $c_1 = 1 - \vartheta(1/m^2)\mathbf{1}_m^T\alpha \le 1 - \vartheta\hat{\alpha}/m$. To ensure that the upper bound on $\hat{\eta}$ is positive (the right-hand sides of (8.28) should be positive), if $0 < \gamma < 1$, (8.28) further gives
$$\begin{cases} \hat{\alpha} < \dfrac{\beta_1(\gamma - \delta)}{\beta_1\hat{\ell} + \beta_2\hat{\ell} + \beta_4} \\ \beta_2 > \dfrac{m(\beta_1\hat{\ell} + \beta_4)}{\vartheta} \\ \hat{\alpha} < \dfrac{\beta_3 - \beta_1\kappa_1}{\beta_1\hat{\ell} + \beta_2\hat{\ell} + \beta_4}, \quad \beta_3 > \beta_1\kappa_1 \\ \beta_4 > \dfrac{\beta_3\hat{\ell}}{1 - \delta} \end{cases}. \tag{8.30}$$
Now, we choose the vector $\beta = [\beta_1, \ldots, \beta_4]^T$ to ensure the solvability of $\hat{\alpha}$. Since $\delta < 1$, first let $\beta_1$ be an arbitrary positive constant. Then, choose $\beta_3$ and $\beta_4$ to satisfy the third and fourth conditions in (8.30), respectively. Finally, select $\beta_2$ according to the second condition in (8.30). Therefore, (8.30) together with $\hat{\alpha} < m/\ell$ yields the upper bounds on the largest step-size $\hat{\alpha}$ in (8.26). If $0 < \gamma < 1$, then the upper bounds on the maximum momentum coefficient $\hat{\eta}$ follow from (8.28) and the largest step-size $\hat{\alpha}$.
Since $\|e^{\lambda}(t)\| \le \sqrt{m}Q\omega^{t}$ and $\|e^{y}(t)\| \le \sqrt{m}Q\omega^{t}$, one obtains $\gamma^{-t}\|e^{\lambda}(t)\| \le \sqrt{m}Q$ and $\gamma^{-t}\|e^{y}(t)\| \le \sqrt{m}Q$ when we choose $\omega = \gamma$. Then, we can infer that all the elements ($u_1$, $u_2$, $u_3$, and $u_4$) of the vector $u$ are uniformly bounded. Hence, by Lemma 8.3 (whose conditions are all satisfied), the results of Theorem 8.8 follow. This completes the proof.
Remark 8.9 Theorem 8.8 indicates that, once a specific EDP is given, $\hat{\alpha}$ and $\hat{\eta}$ can be calculated without much effort by properly selecting other parameters such as $\omega$, $\delta$, $\mu$, etc. Notice that the selection of $\hat{\alpha}$ and $\hat{\eta}$ may require some global parameters, such as $\vartheta$, $\ell$, and $\hat{\ell}$. It is shown in [20] that the amount of preprocessing required to calculate these global values is much smaller than the (worst-case) running time of ET-DAPDA (see [20] for a specific analysis).
On the basis of Theorem 8.8, the linear convergence of the sequence $\{x(t)\}$ is shown in the following.
Theorem 8.10 Consider ET-DAPDA (8.10), which updates the sequences $\{x(t)\}$, $\{z(t)\}$, $\{\lambda(t)\}$, and $\{y(t)\}$. Under Assumptions 8.1–8.2, and with $\hat{\alpha}$ and $\hat{\eta}$ satisfying Theorem 8.8, the sequence $\{x(t)\}$ converges linearly to $x^*$ at a rate of $O((\gamma/2)^t)$.
Proof As in many distributed primal–dual methods [7, 12, 26, 27], we prove this theorem by relating the primal variables to the Lagrangian multipliers. Similar to the proof of Theorem 2 in [12, 26, 27], we obtain the following inequality:
$$\sum_{i=1}^{m}\frac{\mu_i}{2}(x_i(t+1) - x_i^*)^2 \le \sum_{i=1}^{m}\Big(\nabla q_i(\lambda^*)(\lambda_i(t) - \lambda^*) + \frac{1}{2\mu_i}(\lambda_i(t) - \lambda^*)^2 + \lambda_i(t)(x_i^* - d_i)\Big). \tag{8.31}$$
The fact $x^* \in \mathcal{P}$ yields that $\sum_{i=1}^{m}(x_i^* - d_i) = 0$. Additionally, as $t \to \infty$, $\lambda_i(t) \to \lambda^*$, $\forall i \in \mathcal{V}$. Hence, as $t \to \infty$, both sides of (8.31) tend to zero. This shows that $\|x(t) - x^*\|$ converges linearly, i.e., $\|x(t) - x^*\| \le O((\gamma/2)^t)$, if $\|\lambda(t) - \mathbf{1}_m\lambda^*\| \le O(\gamma^t)$ is achieved as in Theorem 8.8. This completes the proof of Theorem 8.10.
Remark 8.11 Recall the conditions on the largest step-size $\hat{\alpha}$ and the maximum momentum coefficient $\hat{\eta}$ in (8.26) and (8.27) of Theorem 8.8. The conditions on $\hat{\alpha}$ and $\hat{\eta}$ imposed by ET-DAPDA depend on the parameters $(\ell, \hat{\ell}, m)$ that involve the cost functions, $(\delta, \kappa_1)$ that involve the network topology, and $(\gamma, \beta_1, \beta_2, \beta_3, \beta_4)$. Moreover, the tunable parameters $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$ in Theorem 8.8 rely only on the parameters of the network and the cost functions. Hence, the design of $\hat{\alpha}$ and $\hat{\eta}$ depends on the complexity of the EDP. It is worth noting that when the EDP takes into consideration important issues such as nonconvex and/or discontinuous cost functions, transmission line losses, valve-point effects, etc., ET-DAPDA may not be suitable or may need to be modified, so the conditions on $\hat{\alpha}$ and $\hat{\eta}$ may take another form. This is an open problem and remains to be studied in the future.
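The proof of Theorem 8.8 rests on two facts about non-negative matrices: a positive vector $\beta$ with $\Gamma\beta < \beta$ certifies $\rho(\Gamma) < 1$, and $\rho(\Gamma) < 1$ turns $v \preceq \Gamma v + u$ into a finite bound. The sketch below illustrates both on a hypothetical $2 \times 2$ matrix; it is not the $4 \times 4$ matrix of Theorem 8.8, and all names and numbers are assumptions.

```python
import math


def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]


def certifies_contraction(M, beta):
    """Check the sufficient condition: beta > 0 and M @ beta < beta entrywise."""
    return all(b > 0 for b in beta) and all(
        r < b for r, b in zip(mat_vec(M, beta), beta))


def spectral_radius_estimate(M, iterations=200):
    """Estimate rho(M) for a non-negative M from the growth of M^k applied to ones."""
    v = [1.0] * len(M)
    log_growth = 0.0
    for _ in range(iterations):
        v = mat_vec(M, v)
        norm = max(abs(e) for e in v)
        if norm == 0.0:
            return 0.0
        log_growth += math.log(norm)
        v = [e / norm for e in v]        # renormalize to avoid underflow
    return math.exp(log_growth / iterations)


# Hypothetical gain matrix and offset vector.
Gamma = [[0.0, 0.5], [0.25, 0.0]]
beta = [1.0, 1.0]
u = [1.0, 1.0]

is_contraction = certifies_contraction(Gamma, beta)
rho = spectral_radius_estimate(Gamma)

# With rho < 1, iterating v <- Gamma v + u converges to (I - Gamma)^{-1} u,
# which upper-bounds every non-negative v satisfying v <= Gamma v + u.
bound = [0.0, 0.0]
for _ in range(200):
    bound = [g + ui for g, ui in zip(mat_vec(Gamma, bound), u)]
print(is_contraction, rho, bound)
```

For this matrix the estimate approaches $\sqrt{0.125} \approx 0.354$ and the bound approaches $[12/7, 10/7]$, which plays the role of the constant $B_0$ in Lemma 8.3.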
8.4.3 The Exclusion of Zeno-Like Behavior
In what follows, the Zeno-like behavior of the discrete-time system is excluded, i.e., $\inf_k\{t^i_{k(t)+1} - t^i_{k(t)}\} \ge 2$, $\forall i \in \mathcal{V}$ and $t \ge 0$.
Theorem 8.12 Consider ET-DAPDA (8.10), which updates the sequences $\{x(t)\}$, $\{z(t)\}$, $\{\lambda(t)\}$, and $\{y(t)\}$. Select $\hat{\alpha}$, $\hat{\eta}$, $\omega$, and $\gamma$ to satisfy Theorem 8.8. Assume that Assumptions 8.1–8.2 hold. Let the control gain $h$ obey
$$0 < h < \min\left\{\frac{1}{\hat{d}}, \ \frac{1}{w_1 w_2}\right\}, \tag{8.32}$$
where $w_1 = (1 + \gamma^2)\|L\|\sqrt{m}/\gamma^2$ and $w_2 = (1 + \hat{\ell})/(\gamma - \delta - \hat{\alpha}\hat{\ell}) + 1/(\gamma - \delta)$. Then, the event-triggered time sequence $\{t^i_{k(t)}\}$ exhibits no Zeno-like behavior if $Q$ satisfies
$$Q > \frac{1 + \gamma^2}{\gamma^2}(B_1 + B_2 + B_4 + \hat{\ell}B_1 + \hat{\ell}B_2), \tag{8.33}$$
where $B_1$, $B_2$, $B_3$, and $B_4$ are positive constants.
Proof Recalling that $\rho(\Gamma) < 1$, we obtain from Lemma 8.3 that $\|v_1\|^{\gamma} \le B_1$, $\|v_2\|^{\gamma} \le B_2$, $\|v_3\|^{\gamma} \le B_3$, and $\|v_4\|^{\gamma} \le B_4$, where $B_1$, $B_2$, $B_3$, and $B_4$ are positive constants. Utilizing Lemma 8.3, one has $\|v_1(t)\| \le B_1\gamma^t$, $\|v_2(t)\| \le B_2\gamma^t$, $\|v_3(t)\| \le B_3\gamma^t$, and $\|v_4(t)\| \le B_4\gamma^t$. Notice that the next event does not happen as long as the event-triggered condition $\|e_i^{\lambda}(t)\| + \|e_i^{y}(t)\| - Q\omega^t > 0$ is not met; thus, we have
$$\begin{aligned} \|e_i^{\lambda}(t)\| + \|e_i^{y}(t)\| \le{} & \|\lambda_i(t^i_{k(t)}) - \bar{\lambda}(t^i_{k(t)})\| + \|\bar{\lambda}(t^i_{k(t)}) - \bar{\lambda}(t)\| + \|\bar{\lambda}(t) - \lambda_i(t)\| \\ & + \|y_i(t^i_{k(t)}) - \bar{y}(t^i_{k(t)})\| + \|\bar{y}(t^i_{k(t)}) - \bar{y}(t)\| + \|\bar{y}(t) - y_i(t)\| \\ \le{} & (B_1 + B_2 + B_4 + \hat{\ell}B_1 + \hat{\ell}B_2)\gamma^{t^i_{k(t)}} + (B_1 + B_2 + B_4 + \hat{\ell}B_1 + \hat{\ell}B_2)\gamma^{t}. \end{aligned} \tag{8.34}$$
If $t^i_{k(t)+1}$ meets the condition (8.5), one has (recall that $\omega = \gamma$ is selected to satisfy Theorem 8.8)
$$Q\gamma^{t^i_{k(t)+1}} \le \|e_i^{\lambda}(t^i_{k(t)+1})\| + \|e_i^{y}(t^i_{k(t)+1})\|. \tag{8.35}$$
It is deduced from (8.34) and (8.35) that
$$t^i_{k(t)+1} - t^i_{k(t)} \ge \ln\left(\frac{B_1 + B_2 + B_4 + \hat{\ell}B_1 + \hat{\ell}B_2}{Q - (B_1 + B_2 + B_4 + \hat{\ell}B_1 + \hat{\ell}B_2)}\right) \Big/ \ln\gamma. \tag{8.36}$$
Selecting $h$ to satisfy (8.32), the set of admissible $Q$ in (8.33) is nonempty, and thus $t^i_{k(t)+1} - t^i_{k(t)} \ge 2$ in (8.36) is guaranteed. This completes the proof of Theorem 8.12.
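The link between the threshold gain $Q$ and the inter-event time can be checked numerically. Writing $B$ as shorthand (an assumption of this sketch) for the combined constant multiplying $\gamma^{t}$ in (8.34), inequalities (8.34) and (8.35) with $\omega = \gamma$ give $(Q - B)\gamma^{t_{k+1}} \le B\gamma^{t_k}$, and hence a gap of at least $\ln(B/(Q - B))/\ln\gamma$. The function names and numbers below are hypothetical.

```python
import math


def min_inter_event_gap(B, Q, gamma):
    """Lower bound ln(B / (Q - B)) / ln(gamma) on consecutive event spacing."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    if B <= 0.0 or Q <= B:
        raise ValueError("the bound needs Q > B > 0")
    return math.log(B / (Q - B)) / math.log(gamma)


def threshold_for_gap_two(B, gamma):
    """Value of Q at which the lower bound equals 2, i.e. B*(1 + gamma**2)/gamma**2."""
    return B * (1.0 + gamma ** 2) / gamma ** 2


# Hypothetical constants.
B, gamma = 1.0, 0.5
Q_limit = threshold_for_gap_two(B, gamma)        # 5.0 for these values
gap_at_limit = min_inter_event_gap(B, Q_limit, gamma)
gap_larger_Q = min_inter_event_gap(B, 9.0, gamma)
print(Q_limit, gap_at_limit, gap_larger_Q)
```

Any $Q$ strictly above `Q_limit` therefore yields a bound above 2, which matches the form of condition (8.33); a larger $Q$ spaces the events further apart at the price of a larger tolerated measurement error.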
Remark 8.13 Most metaheuristic algorithms (including the genetic algorithm, the ant colony optimization algorithm, the PSO algorithm, the artificial neural network algorithm, etc.) can successfully solve the EDP (8.1), both in terms of identifying the best solution and in terms of computation time. In the following, we summarize three advantages of applying ET-DAPDA to the EDP compared with applying metaheuristic algorithms: (1) ET-DAPDA guarantees the global optimal solution of the EDP, whereas metaheuristic algorithms are inherently random; (2) ET-DAPDA produces a fixed output for a fixed input, whereas a metaheuristic algorithm does not; (3) ET-DAPDA guarantees a fixed efficiency, whereas a metaheuristic algorithm cannot guarantee its efficiency because of the randomness.
8.5 Numerical Examples This section validates the theoretical results and demonstrates better performance of ET-DAPDA. Notice that all the simulations are carried out in MATLAB on a MacBook Pro 2017 with 8 GB memory, Intel Core i5 processors, 2 Cores, and 2.3 GHz.
8 Event-Triggered Algorithms for Distributed Economic Dispatch
8.5.1 Example 1: EDP on the IEEE 14-Bus System
Consider the EDP on the IEEE 14-bus system as described in [7], where {1, 2, 3, 6, 8} are generator buses. Each generator's cost function is \(f_i(x_i) = c_i x_i + b_i x_i^2\), where the cost coefficients \((c_i, b_i)\) of each generator i and the generators' generation capacities (\(x_i \in [x_i^{\min}, x_i^{\max}]\)) are summarized in Table 8.1 [7]. The total demand in this system is assumed to be \(\sum_{i=1}^{m} d_i = 380\) MW. Running ET-DAPDA (8.10), we see from Fig. 8.1a and b that the whole system successfully implements the economic dispatch, and the optimal power generations are \(x_1^* = 80\) MW, \(x_2^* = 90\) MW, \(x_3^* = 64.67\) MW, \(x_6^* = 70\) MW, and \(x_8^* = 75.33\) MW. The resulting total cost of the network is $2176. In addition, each generator i's event-triggered sampling time instants and one randomly selected measurement error are depicted in Fig. 8.1c and d, respectively. The statistics of Fig. 8.1c show that the numbers of sampling times for the 5 generators are [98 84 93 96 113], with an average of about 97. Thus, the average sampling rate is 97/600 = 16.17%, which indicates that Zeno-like behavior does not happen. Figure 8.1d shows that the measurement error decreases to zero asymptotically. Finally, the computation time of ET-DAPDA for solving the EDP on the IEEE 14-bus system is 0.2828 s.
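The dispatch reported above can be cross-checked with a small centralized calculation. The sketch below is not ET-DAPDA: it is a plain bisection on the common marginal price (a standard textbook approach, swapped in here only as a sanity check), using the coefficients and capacity limits of Table 8.1. The function names and the price bracket [0, 20] are assumptions chosen for illustration.

```python
# Generator data from Table 8.1: bus -> (b_i [$/MW^2], c_i [$/MW], x_min, x_max)
GENERATORS = {
    1: (0.04, 2.0, 0.0, 80.0),
    2: (0.03, 3.0, 0.0, 90.0),
    3: (0.035, 4.0, 0.0, 70.0),
    6: (0.03, 4.0, 0.0, 70.0),
    8: (0.04, 2.5, 0.0, 80.0),
}
DEMAND = 380.0  # total demand in MW


def dispatch_at_price(lam):
    """Cost-minimizing output of each generator for marginal price lam."""
    out = {}
    for bus, (b, c, lo, hi) in GENERATORS.items():
        x = (lam - c) / (2.0 * b)       # stationarity: c + 2*b*x = lam
        out[bus] = min(max(x, lo), hi)  # clip to the capacity limits
    return out


def solve_edp(demand, tol=1e-10):
    """Bisection on the marginal price until total generation meets demand."""
    lo, hi = 0.0, 20.0  # assumed bracket: zero output at 0, all at maximum at 20
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(dispatch_at_price(mid).values()) < demand:
            lo = mid
        else:
            hi = mid
    return dispatch_at_price(0.5 * (lo + hi))


def total_cost(x):
    """Sum of c_i * x_i + b_i * x_i^2 over all generators."""
    return sum(c * x[bus] + b * x[bus] ** 2
               for bus, (b, c, _, _) in GENERATORS.items())
```

Calling `solve_edp(DEMAND)` places generators 1, 2, and 6 at their upper limits and splits the remaining 140 MW between generators 3 and 8, which agrees with the reported dispatch and with a total cost of about $2176.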
Table 8.1 Generator parameters in the IEEE 14-bus system

Bus    b_i ($/MW^2)    c_i ($/MW)    [x_i^min, x_i^max] (MW)
1      0.04            2.0           [0, 80]
2      0.03            3.0           [0, 90]
3      0.035           4.0           [0, 70]
6      0.03            4.0           [0, 70]
8      0.04            2.5           [0, 80]

Fig. 8.1 EDP on the IEEE 14-bus system. (a) Power generation (MW) of Gen. 1 to Gen. 5 versus iteration. (b) Total generation and total demand (MW) versus iteration. (c) Event sampling time instants versus iteration. (d) Measurement error and threshold versus iteration.

8.5.2 Example 2: EDP on Large-Scale Networks
To demonstrate the applicability of ET-DAPDA to large-scale networks, the EDP on the IEEE 118-bus system [12] is considered in this example. The IEEE 118-bus system has 54 generators, which are connected by quite a few lines [12]. The cost function of each generator i is \(f_i(x_i) = c_i x_i^2 + b_i x_i + a_i\), where \(c_i \in [0.002, 0.071]\), \(b_i \in [8.335, 37.697]\), and \(a_i \in [6.78, 74.33]\) are adjustable coefficients with suitable units. The generating capacities differ from generator to generator, that is, \(x_i \in [x_i^{\min}, x_i^{\max}]\), where \(x_i^{\min} \in [0, 150]\) MW and \(x_i^{\max} \in [20, 450]\) MW. Notice that the overall system demands \(\sum_{i=1}^{m} d_i = 6000\) MW of power generation. Figure 8.2 shows that ET-DAPDA successfully achieves the desired results even in this kind of large-scale network. With this optimal schedule of power at each generator, the optimal power generations of the 54 generators are [5, 5, 5, 292.1342, 292.1348, 10, 55.5241, 5, 5, 292.1353, 350, 8, 8, 55.5245, 8, 55.5244, 8, 8, 55.5257, 250, 250, 55.5234, 55.5238, 200, 200, 55.5246, 420, 420, 292.1332, 41.0539, 10, 5, 5, 55.5241, 55.5252, 292.1356, 55.5244, 10, 300, 200, 8, 20, 292.1361, 292.1348, 292.1350, 8, 55.5244, 55.5251, 8, 25, 55.5249, 55.5259, 55.5245, 25] MW, and the resulting total cost of the system is $9.262 × 10^4. Finally, the computation time of ET-DAPDA for solving the EDP on the IEEE 118-bus system is 6.3838 s.

Fig. 8.2 EDP on the IEEE 118-bus test system. (a) Power generation (MW) versus iteration. (b) Total generation and total demand (MW) versus iteration. (c) Event sampling time instants of five randomly selected generators. (d) Measurement error and threshold versus iteration.
8.5.3 Example 3: Comparison with Related Methods
To verify the convergence performance of ET-DAPDA on the analyzed systems (the IEEE 14-bus and 118-bus systems), the results obtained in this chapter are compared with those obtained by the existing related methods [23, 26, 27, 34]. Here, the required parameters are the same as in Examples 1 and 2.
(a) First, the convergence performance is compared on each of the above two systems, where the residual \(E(t) = \log_{10} \sum_{i=1}^{m} \|x_i(t) - x_i^*\|\), t ≥ 0, is the comparison metric. Figure 8.3 shows the following four facts: (1) ET-DAPDA still achieves the same linear convergence rate as in [23, 26] even under event-triggered control; (2) ET-DAPDA improves the convergence rate compared with the applicable event-triggered method [27] without momentum terms; (3) when the largest step-size is around \(\hat{\alpha} = 0.0042\), the convergence performs well; (4) ET-DAPDA performs better than the centralized method [34] even in large-scale networks. Finally, the accuracy of the computed optimal power generation lies in \([10^{-30}, 10^{-25}]\) on the IEEE 14-bus system and in \([10^{-25}, 10^{-20}]\) on the IEEE 118-bus system, which means that the accuracy is not seriously affected even in large-scale networks.
(b) Second, the costs (corresponding to each analyzed system) obtained by ET-DAPDA and the other related methods [23, 26, 27, 34] are depicted in Fig. 8.4. Figure 8.4 shows the following two facts: (1) ET-DAPDA not only adopts two effective strategies (the event-triggered control strategy and the momentum acceleration strategy), which make it superior to the other methods [23, 26, 27, 34] in terms of computation and convergence rate, but also attains the same optimal cost of the EDP as the other proven methods [23, 26, 27, 34] for each analyzed system; (2) ET-DAPDA can quickly obtain the optimal cost in large-scale networks.
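For readers who wish to reproduce the comparison metric, a minimal Python sketch of the residual E(t) is given below. It assumes scalar generator outputs, so each norm reduces to an absolute value; the function name and the convention of returning negative infinity for an exact match are illustrative assumptions.

```python
import math


def residual(x, x_star):
    """E(t) = log10( sum_i |x_i(t) - x_i^*| ) for scalar generator outputs.

    x:      current power generations, one entry per generator.
    x_star: optimal power generations in the same order.
    """
    gap = sum(abs(a - b) for a, b in zip(x, x_star))
    # log10(0) is undefined; report -inf when the iterate matches exactly.
    return math.log10(gap) if gap > 0 else float("-inf")
```

On a plot of this quantity against the iteration index, a linear convergence rate appears as a straight line with negative slope.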
Fig. 8.3 Comparison with related methods, with the residual E(t) as the comparison metric. (a) Residual: IEEE 14 bus system [17], plotted against time (step). (b) Residual: IEEE 118 bus system [12], plotted against iteration.

Fig. 8.4 Comparison with related methods, with the obtained cost of the EDP as the comparison metric. (a) Cost: IEEE 14 bus system [17], plotted against iteration; marked data point X: 503, Y: 2176. (b) Cost: IEEE 118 bus system [12], plotted against iteration; marked data point X: 1877, Y: 9.262e+04.

8.6 Conclusion
This chapter develops and analyzes ET-DAPDA for handling the EDP over a connected undirected network. ET-DAPDA not only allows uncoordinated constant step-sizes but also, most importantly, integrates the gradient tracking strategy with two types of momentum terms for faster convergence. If the largest step-size and the maximum momentum coefficient are positive and sufficiently smaller than certain constants, it is proved that ET-DAPDA converges linearly to the exact optimal solution with an explicit linear convergence rate under the assumptions on the networks and the cost functions. Numerical experiments validate the theoretical results. Nevertheless, ET-DAPDA is not perfect. For instance, it may not be appropriate for complex networks and other practical scenarios in the EDP. Thus, future work can focus on designing similar algorithms for directed or time-varying directed networks, on considering the EDP with power loss, transmission line losses, valve-point effects, prohibited operating zones of generators, and ramp-rate limits of generators, and on investigating other methods (hybrid and PSO-oriented methods) for the EDP.
References
1. B. Jeddi, V. Vahidinasab, P. Ramezanpour, J. Aghaei, M. Shafie-khah, J. Catalao, Robust optimization framework for dynamic distributed energy resources planning in distribution networks. Int. J. Electr. Power Energy Syst. 110, 419–433 (2019)
2. X. He, W. Xue, H. Fang, Consistent distributed state estimation with global observability over sensor network. Automatica 92, 162–172 (2018)
3. X. Xing, L. Xie, H. Meng, Cooperative energy management optimization based on distributed MPC in grid-connected microgrids community. Int. J. Electr. Power Energy Syst. 107, 186–199 (2019)
4. Q. Lü, X. Liao, T. Xiang, H. Li, T. Huang, Privacy masking stochastic subgradient-push algorithm for distributed online optimization. IEEE Trans. Cybern. 51(6), 3224–3237 (2021)
5. X. Mao, W. Zhu, L. Wu, B. Zhou, Optimal allocation of dynamic VAR sources using zoning-based distributed optimization algorithm. Int. J. Electr. Power Energy Syst. 113, 952–962 (2019)
6. M. Ogura, V. Preciado, Stability of spreading processes over time-varying large-scale networks. IEEE Trans. Netw. Sci. Eng. 3(1), 44–57 (2016)
7. Y. Yuan, H. Li, J. Hu, Z. Wang, Stochastic gradient-push for economic dispatch on time-varying directed networks with delays. Int. J. Electr. Power Energy Syst. 113, 564–572 (2019)
8. Q. Lü, H. Li, Z. Wang, Q. Han, W. Ge, Performing linear convergence for distributed constrained optimisation over time-varying directed unbalanced networks. IET Contr. Theory Appl. 13(7), 2800–2810 (2019)
9. T. Liu, X. Tan, B. Sun, Y. Wu, D. Tsang, Energy management of cooperative microgrids: a distributed optimization approach. Int. J. Electr. Power Energy Syst. 96, 335–346 (2018)
10. H. Li, Q. Lü, X. Liao, T. Huang, Accelerated convergence algorithm for distributed constrained optimization under time-varying general directed graphs. IEEE Trans. Syst. Man Cybern. Syst. 50(7), 2612–2622 (2020)
11. W. Liu, M. Chi, Z. Liu, Z. Guan, J. Chen, J. Xiao, Distributed optimal active power dispatch with energy storage units and power flow limits in smart grids. Int. J. Electr. Power Energy Syst. 105, 420–428 (2019)
12. Q. Lü, X. Liao, H. Li, T. Huang, Achieving acceleration for distributed economic dispatch in smart grids over directed networks. IEEE Trans. Netw. Sci. Eng. 7(3), 1988–1999 (2020)
13. A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control 54(1), 48–61 (2009)
14. J. Duchi, A. Agarwal, M. Wainwright, Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Trans. Autom. Control 57(1), 151–164 (2012)
15. Q. Lü, X. Liao, H. Li, T. Huang, A Nesterov-like gradient tracking algorithm for distributed optimization over directed networks. IEEE Trans. Syst. Man Cybern. Syst. 51(10), 6258–6270 (2021)
16. A. Bedi, A. Koppel, K. Rajawat, Beyond consensus and synchrony in online network optimization via saddle point method (2017). Preprint arXiv:1707.05816
17. Q. Lü, H. Li, Event-triggered discrete-time distributed consensus optimization over time-varying graphs. Complexity 2017, 1–13 (2017)
18. Q. Lü, H. Li, D. Xia, Distributed optimization of first-order discrete-time multi-agent systems with event-triggered communication. Neurocomputing 235, 255–263 (2017)
19. S. Shahrampour, A. Jadbabaie, Distributed online optimization in dynamic environments using mirror descent. IEEE Trans. Autom. Control 63(3), 714–725 (2018)
20. A. Nedic, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)
21. Q. Lü, H. Li, D. Xia, Geometrical convergence rate for distributed optimization with time-varying directed graphs and uncoordinated step-sizes. Inf. Sci. 422, 516–530 (2018)
22. A. Nedic, A. Olshevsky, W. Shi, C. Uribe, Geometrically convergent distributed optimization with uncoordinated step-sizes, in 2017 American Control Conference (ACC) (2017). https://doi.org/10.23919/ACC.2017.7963560
23. C. Zhao, X. Duan, Y. Shi, Analysis of consensus-based economic dispatch algorithm under time delays. IEEE Trans. Syst. Man Cybern. Syst. 50(8), 2978–2988 (2020)
24. Y. Kajiyama, N. Hayashi, S. Takai, Distributed subgradient method with edge-based event-triggered communication. IEEE Trans. Autom. Control 63(7), 2248–2255 (2018)
25. Q. Lü, H. Li, X. Liao, H. Li, Geometrical convergence rate for distributed optimization with zero-like-free event-triggered communication scheme and uncoordinated step-sizes, in Proceedings of the 7th International Conference on Information Science and Technology (ICIST) (2017). https://doi.org/10.1109/ICIST.2017.7926783
26. T. Doan, A. Olshevsky, On the geometric convergence rate of distributed economic dispatch/demand response in power systems (2016). Preprint arXiv:1609.06660
27. J. Wang, H. Li, Z. Wang, Distributed event-triggered scheme for economic dispatch in power systems with uncoordinated step-sizes. IET Gener. Transm. Distrib. 13(16), 3612–3622 (2019)
28. E. Naderi, A. Azizivahed, H. Narimani, M. Fathi, M. Narimani, A comprehensive study of practical economic dispatch problems by a new hybrid evolutionary algorithm. Appl. Soft. Comput. 67, 1186–1206 (2017)
29. H. Narimani, S. Razavi, A. Azizivahed, E. Naderi, M. Fathi, M. Ataei, M. Narimani, A multiobjective framework for multi-area economic emission dispatch. Energy 154, 126–142 (2018)
30. E. Naderi, M. Pourakbari-Kasmaei, H. Abdi, An efficient particle swarm optimization algorithm to solve optimal power flow problem integrated with FACTS devices. Appl. Soft. Comput. 80, 243–262 (2019)
31. E. Naderi, M. Pourakbari-Kasmaei, M. Lehtonen, Transmission expansion planning integrated with wind farms: a review, comparative study, and a novel profound search approach. Int. J. Electr. Power Energy Syst. 115, 10546–0 (2020)
32. G. Qu, N. Li, Accelerated distributed Nesterov gradient descent. IEEE Trans. Autom. Control 65(6), 2566–2581 (2020)
33. R. Xin, U. Khan, Distributed heavy-ball: a generalization and acceleration of first-order methods with gradient tracking. IEEE Trans. Autom. Control 65(6), 2627–2633 (2020)
34. N. Li, L. Chen, S. Low, Optimal demand response based on utility maximization in power networks, in Proceedings of the 2011 IEEE Power and Energy Society General Meeting (PES) (2011). https://doi.org/10.1109/PES.2011.6039082
35. N. Li, L. Chen, M. Dahleh, Demand response using linear supply function bidding. IEEE Trans. Smart Grid 6(4), 1827–1838 (2015)
36. Y. Tian, Y. Sun, G. Scutari, Achieving linear convergence in distributed asynchronous multiagent optimization. IEEE Trans. Autom. Control 65(12), 5264–5279 (2020)
37. T. Yang, Q. Lin, Z. Li, Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization (2016). Preprint arXiv:1604.03257
Chapter 9
Privacy Preserving Algorithms for Distributed Online Learning
Abstract In this chapter, we study a distributed online optimization problem for a set of nodes communicating over a time-varying unbalanced directed network, while considering how to preserve the privacy of their local cost functions. The main goal of this set of nodes is to cooperatively minimize the sum of all locally known convex cost functions (the global cost function). We propose a differentially private distributed stochastic subgradient-push algorithm, named DP-DSSP, to solve such an optimization problem in a collaborative and distributed manner. The algorithm ensures that nodes interact with their in-neighbors and collectively optimize the global cost function. Unlike most existing distributed algorithms, which do not consider privacy issues, DP-DSSP protects the privacy of participating nodes through a differential privacy strategy, which is more practical in applications involving sensitive information, such as military affairs or medical treatment. An important feature of DP-DSSP is that it handles distributed online optimization problems on time-varying unbalanced directed networks. Theoretical analysis shows that DP-DSSP effectively preserves differential privacy and achieves sublinear regret. The tradeoff between the privacy level and the accuracy of DP-DSSP is also revealed. Moreover, DP-DSSP is able to handle arbitrarily large but uniformly bounded delays in the communication links. Finally, simulation experiments confirm the usefulness of DP-DSSP and the findings of this chapter.
Keywords Distributed online optimization · Differential privacy · Stochastic subgradient-push algorithm · Time-varying directed networks · Communication delays
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Q. Lü et al., Distributed Optimization in Networked Systems, Wireless Networks, https://doi.org/10.1007/978-981-19-8559-1_9
9.1 Introduction
The distributed convex optimization problem has attracted the interest of many researchers in the past few decades, driven by the tremendous advances in technology and low-cost devices. Numerous engineering applications can be viewed as distributed convex optimization problems, such as robust control [1], smart grids
[2], model prediction [3], and smart metering [4], among others [5–11]. This type of problem requires a fully distributed optimization algorithm. Unlike traditional centralized approaches, distributed algorithms involve multiple nodes that each have access only to their own local information, without a central coordinator (node) that collects all the information on the network [12]. A typical feature of many practical distributed optimization scenarios is that they must adapt to dynamic changes and uncertain environments. Such problems with uncertainties can be regarded as distributed online optimization, in which the cost function changes over time and a decision must be made at each time based only on previous information. For distributed online optimization problems, the aim of researchers is to develop distributed online algorithms that operate in a coordinated manner. To evaluate the performance of a distributed online algorithm, it is conventional to measure the difference between the cost incurred by the algorithm over the sequence of cost functions and the cost incurred by the best fixed decision in hindsight. This difference between the two costs is called regret. A distributed online algorithm is declared "good" when the achieved regret is sublinear [13]. Many significant results on distributed online optimization problems over networks have recently emerged [14–25]. Nunez and Cortes [14] designed a distributed online subgradient descent algorithm on time-varying balanced directed networks, which allowed proportional–integral feedback on the divergence among neighboring nodes. Akbari et al. [15] investigated a distributed online subgradient-push algorithm on time-varying unbalanced directed networks. Then, Bedi et al.
[16] proposed an asynchronous stochastic saddle-point method to solve the online expected risk minimization problem, while Pradhan et al. [17] considered the distributed online non-parametric optimization problem. Subsequently, with the development of distributed optimization, distributed online optimization was further explored. Notable approaches to distributed online optimization problems over networks include the distributed proximal gradient algorithm [18], the saddle-point algorithm [19], the distributed mirror descent algorithm [20], the distributed dual averaging algorithm [21], the distributed regression algorithm [22], etc. However, in the above algorithms, nodes have to accept that at least some of their private information will inevitably be disclosed during information sharing. From the perspective of preserving the privacy of nodes, differential privacy is one of the most insightful privacy strategies. It was initially proposed by Dwork [26] and has gained much attention because of its rigorous formulation and provable security properties. In general, differential privacy guarantees that a malicious node learns little sensitive information about any other node. Recently, several privacy-preserving optimization algorithms have been presented in the literature [27–31], where the main idea is to inject stochastic noises or offsets into the node communications and updates. Although the algorithms in [27–31] could solve distributed optimization problems with privacy considerations, they were only suitable for problems with time-invariant cost functions. On the other hand, stochastic optimization problems have also been studied in depth. The addition of noise
to optimization algorithms has a long history, originally arising from statistical mechanics (such as simulated annealing) [32], and was given significant treatment in [33], which presented a stochastic approximation method and included a detailed convergence analysis in the context of diverse noise models. Distributed implementation of stochastic gradient methods has received increasing attention in recent years [34–36]. Ram et al. [34] proposed a stochastic subgradient projection algorithm for solving convex optimization problems subject to a common convex constraint set. Then, Srivastava and Nedic [35] utilized two diminishing step-sizes to deal with communication noises and subgradient errors, respectively. The recent application of stochastic gradient methods to non-convex optimization in machine learning started with [36], which first ensured global convergence of the methods on non-convex functions with exponentially many local minima and saddle points. However, all of these works [32–36] neglected the preservation of nodes' privacy. Of significant relevance to our work are the recent developments in [37] and [38]. On time-varying balanced directed networks, Zhu et al. [37] established an intuitive method with a differential privacy strategy to solve the distributed online optimization problem. Then, Li et al. [39] studied privacy preservation in distributed online learning and discussed in detail the privacy levels of the proposed differentially private online algorithm. Note that, although the privacy of nodes was preserved, the two approaches in [37] and [39] assumed that the interaction networks were balanced, a requirement that is regarded as a major difficulty and challenge in implementing distributed algorithms. This limits the applicability of these algorithms in a number of real-world settings, such as peer-to-peer, ad hoc, and wireless sensor networks.
Concerning this limitation, Nedic and Olshevsky [38] designed a subgradient-push algorithm by incorporating distributed subgradient descent into the push-sum mechanism on time-varying unbalanced directed networks. In this setting, some interesting generalized algorithms, namely [40] (the distributed stochastic subgradient-push) and [15] (the distributed online subgradient-push), were developed. Regrettably, the privacy of nodes was ignored in the above algorithms [15, 38, 40]. To sum up, there has not yet been any prior work devoted to designing algorithms that both preserve the privacy of participating nodes and solve distributed online optimization problems on time-varying unbalanced directed networks. It is therefore of great significance to address this challenging issue, owing to its practicality. In this chapter, our focus is to initiate the study of differentially private distributed online convex optimization problems in dynamic environments. To this end, a fully distributed online algorithm with a differential privacy strategy is proposed, for which a desired privacy level and the expected regrets can be obtained even on time-varying unbalanced directed networks. We hope to contribute to a broad theory of distributed online convex optimization, and our motivation is to design a completely distributed algorithm that is adaptable and facilitates real-world applications. In general, the key contributions of the present work are as follows:
(i) We design and discuss a differentially private distributed stochastic subgradient-push algorithm, named DP-DSSP, by incorporating a randomized perturbation technique into the stochastic subgradient-push mechanism. More importantly, DP-DSSP is completely distributed: each node only needs to know its own out-degree, without any further knowledge of the number of nodes or the network topology. DP-DSSP is therefore easier to implement than the existing methods [14, 30, 41].
(ii) In comparison with the algorithms in [14] for online optimization problems and [37] for privacy issues, DP-DSSP not only solves differentially private distributed online convex optimization problems but also applies to more general interaction networks, i.e., time-varying unbalanced directed networks. To overcome the imbalance induced by time-varying directed networks, a push-sum mechanism [38, 40] is exploited in the design of DP-DSSP to counteract the effect of the network's imbalance.
(iii) DP-DSSP can be considered an extension of the algorithms in [37] and [40] that both preserves the privacy of participating nodes and solves distributed online convex optimization problems. Specifically, for a fixed privacy level, DP-DSSP not only guarantees the privacy of nodes' cost functions but also achieves sublinear (logarithmic/square-root) regrets for diverse (strongly/generally convex) cost functions. Therefore, the performance of DP-DSSP is not seriously affected while differential privacy is maintained.
(iv) A compromise between the desired level of privacy preservation and the accuracy of DP-DSSP is revealed. Namely, with the other parameters fixed, the upper bounds on the expected regrets of DP-DSSP are of the order \(O(1/\varepsilon^2)\). Furthermore, the robustness of DP-DSSP is investigated in the presence of arbitrary but uniformly bounded network communication delays [42]. This makes it more universal and practical in real applications.
9.2 Preliminaries
9.2.1 Notation
If not particularly stated, the vectors mentioned in this chapter are column vectors. Let \(\mathbb{R}_{\ge 0}\), \(\mathbb{Z}_{\ge 0}\), and \(\mathbb{Z}_{>0}\) be the set of non-negative real numbers, the set of non-negative integers, and the set of positive integers, respectively. A quantity (possibly a vector) allocated to node i is indexed by a subscript i; e.g., \(z_i(t)\) is the estimate of node i at time t. The n × n identity matrix is denoted as \(I_n\), and the column vectors of all ones and all zeros (of appropriate dimension) are denoted as 1 and 0, respectively. The transposes of a vector z and a matrix W are represented as \(z^T\) and \(W^T\), respectively. Let the symbol \(\langle \cdot, \cdot \rangle\) denote the inner product of two vectors. The Euclidean norm (of vectors) and the 1-norm are written as \(\|\cdot\|\) and \(\|\cdot\|_1\), respectively. Given a random/stochastic variable Z, P[Z] and E[Z] denote its probability and
expectation, respectively. We let \(W(t:k) = W(t) \cdots W(k)\), \(\forall t \ge k \in \mathbb{Z}_{\ge 0}\), represent the product of the time-varying matrices W(t), . . . , W(k). Also, denote \(W(t-1:t) = I\), \(\forall t \in \mathbb{Z}_{>0}\). Assume that C is a convex subset of \(\mathbb{R}^d\) (d is the dimension). The function \(f : C \to \mathbb{R}\) is convex if
\[
f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\kappa(x, y)}{2}\|y - x\|^2, \quad \forall x, y \in C,
\]
where \(\nabla f(x)\) is a subgradient of f at x, and \(\kappa : C \times C \to \mathbb{R}_{\ge 0}\). If \(\kappa(x, y) = \kappa > 0\), \(\forall x, y \in C\), in this relation, the function f is κ-strongly convex on C. For a convex function f and a vector \(x \in \mathbb{R}^d\), a vector \(\nabla f(x) \in \mathbb{R}^d\) is a subgradient of f at x if, for all \(y \in \mathbb{R}^d\), \(f(y) - f(x) \ge \langle \nabla f(x), y - x \rangle\).
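The product notation W(t : k), together with the convention W(t − 1 : t) = I, can be sketched in a few lines of Python. The helper names and the representation of the matrices as a list of row-lists indexed by time are assumptions made for illustration.

```python
def matmul(A, B):
    """Plain-Python product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][m] * B[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def backward_product(W, t, k):
    """Return W(t:k) = W(t) W(t-1) ... W(k) for t >= k.

    W is a sequence of square matrices indexed by time. For t == k - 1 the
    loop body never runs, so the identity is returned, which matches the
    convention W(t-1:t) = I.
    """
    n = len(W[k])
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for s in range(t, k - 1, -1):  # s = t, t-1, ..., k
        P = matmul(P, W[s])
    return P
```

Multiplying from the left by the most recent matrix mirrors how the network mixes information over successive time steps; in particular, a product of column-stochastic matrices remains column-stochastic.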
9.2.2 Model of Optimization Problem
In distributed online optimization problems, suppose that there is a sequence of time-varying global convex cost functions that are not known in advance and are gradually revealed over time. More specifically, at each time t ∈ {1, . . . , T} (\(T \in \mathbb{Z}_{>0}\) is the time horizon), each node i ∈ V selects an estimate \(z_i(t) \in \mathbb{R}^d\). After this, node i observes a local convex cost function \(f_i^t : \mathbb{R}^d \to \mathbb{R}\) and incurs the cost \(f_i^t(z_i(t))\). Therefore, the network cost at each time t is represented by
\[
f^t(z) = \sum_{i=1}^{n} f_i^t(z),
\]
where \(z \in \mathbb{R}^d\) is the decision vector. Then, at each time t, the nodes intend to solve the following optimization problem:
\[
\min_{z \in \mathbb{R}^d} f^t(z) = \sum_{i=1}^{n} f_i^t(z), \tag{9.1}
\]
where node i ∈ V is only aware of its individual convex cost function \(f_i^t\). Suppose that \(f^t\) is not known to any node collectively, nor is it available at any single location. In this chapter, we wish to design a class of distributed stochastic subgradient algorithms to reduce the total cost incurred by the nodes over a finite time horizon \(T \in \mathbb{Z}_{>0}\). Owing to the time-varying nature of distributed online optimization, the regret is an indispensable metric for evaluating the convergence of the designed algorithm. Hence, the following classical network regret and individual regret of node j ∈ V are introduced [14]:
\[
R(T) = \sum_{t=1}^{T} \sum_{i=1}^{n} f_i^t(z_i(t)) - \sum_{t=1}^{T} \sum_{i=1}^{n} f_i^t(z^*),
\]
and
\[
R_j(T) = \sum_{t=1}^{T} \sum_{i=1}^{n} f_i^t(z_j(t)) - \sum_{t=1}^{T} \sum_{i=1}^{n} f_i^t(z^*),
\]
where \(z^* = \arg\min_{z \in \mathbb{R}^d} f(z)\), with \(f(z) = \sum_{t=1}^{T} \sum_{i=1}^{n} f_i^t(z)\), is the best decision.
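The two regret notions can be made concrete with a toy computation. The sketch below assumes scalar decisions and hypothetical quadratic local costs f_i^t(z) = (z − a_i^t)^2, for which the best fixed decision in hindsight is simply the mean of all targets a_i^t; the function names and the cost model are illustrative assumptions, not part of the chapter.

```python
def best_fixed_decision(targets):
    """Minimizer of sum_t sum_i (z - a_i^t)^2, i.e. the mean of all targets."""
    flat = [a for row in targets for a in row]
    return sum(flat) / len(flat)


def network_regret(targets, decisions):
    """R(T): node i's cost is evaluated at node i's own decision z_i(t).

    targets[t][i] holds a_i^t and decisions[t][i] holds z_i(t).
    """
    z_star = best_fixed_decision(targets)
    incurred = sum((z - a) ** 2
                   for a_row, z_row in zip(targets, decisions)
                   for a, z in zip(a_row, z_row))
    best = sum((z_star - a) ** 2 for row in targets for a in row)
    return incurred - best


def individual_regret(targets, decisions, j):
    """R_j(T): every local cost is evaluated at node j's decision z_j(t)."""
    z_star = best_fixed_decision(targets)
    incurred = sum((z_row[j] - a) ** 2
                   for a_row, z_row in zip(targets, decisions)
                   for a in a_row)
    best = sum((z_star - a) ** 2 for row in targets for a in row)
    return incurred - best
```

The only difference between the two functions is where the costs are evaluated: the individual regret asks how well a single node's decisions would have served the whole network.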
Notice that the subgradients and other quantities such as \(\{z_i(t)\}_{t=1}^{T}\) and \(\{f_i^t(z_i(t))\}_{t=1}^{T}\), i ∈ V, become stochastic variables owing to the stochastic noises in calculating the subgradients. Hence, it may be impossible to obtain deterministic regrets like the classical regrets described above when adopting stochastic subgradients. Here, E[·] denotes the expectation of a stochastic variable. Based on this, the concepts of pseudo-network-regret and pseudo-individual-regret associated with node j ∈ V are utilized in this chapter¹ [43]:
\[
R(T) = E\left[\sum_{t=1}^{T} \sum_{i=1}^{n} f_i^t(z_i(t))\right] - E\left[\sum_{t=1}^{T} \sum_{i=1}^{n} f_i^t(z^*)\right], \tag{9.2}
\]
and
\[
R_j(T) = E\left[\sum_{t=1}^{T} \sum_{i=1}^{n} f_i^t(z_j(t))\right] - E\left[\sum_{t=1}^{T} \sum_{i=1}^{n} f_i^t(z^*)\right]. \tag{9.3}
\]
As a result, the main focus of this chapter is to investigate a distributed stochastic subgradient algorithm for each node i ∈ V on time-varying unbalanced directed networks such that the regrets in (9.2) and (9.3) are sublinear with respect to T, i.e., \(\lim_{T \to \infty} R(T)/T = 0\) and \(\lim_{T \to \infty} R_j(T)/T = 0\), j ∈ V. Intuitively, sublinearity implies that the average value of the global cost function over the time horizon T approaches the optimal value as T → ∞.
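To make the sublinearity requirement concrete, the following short derivation takes two hypothetical regret bounds and checks that the resulting upper bound on the time-averaged regret vanishes. The constants \(C_1, C_2 > 0\) are illustrative assumptions, not bounds established at this point of the chapter.

```latex
% Hypothetical square-root bound:
R(T) \le C_1 \sqrt{T}
\;\Longrightarrow\;
\frac{R(T)}{T} \le \frac{C_1}{\sqrt{T}} \xrightarrow[T \to \infty]{} 0 .

% Hypothetical logarithmic bound:
R(T) \le C_2 (1 + \ln T)
\;\Longrightarrow\;
\frac{R(T)}{T} \le \frac{C_2 (1 + \ln T)}{T} \xrightarrow[T \to \infty]{} 0 .

% By contrast, a linear bound R(T) \le C T only gives R(T)/T \le C,
% which does not force the average regret to zero.
```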
¹ The pseudo-network-regret and pseudo-individual-regret associated with node j ∈ V measure the difference between the expectation of the total cost incurred by the node's state estimation and the optimal expected total cost that could have been achieved by the best fixed decision in hindsight.

9.2.3 Communication Network
In this chapter, the intrinsic interconnections among nodes are represented by a time-varying unbalanced directed network G(t) = {V, E(t)} at time t, where V = {1, 2, . . . , n} is the node set and E(t) ⊆ V × V is the edge set. An edge (i, j) ∈ E(t) indicates that node i can directly send information to node j at time t, where i is regarded as an in-neighbor of j and, conversely, j is viewed as an out-neighbor of i. At time t, we define \(N_i^{in}(t) = \{j \in V \mid (j, i) \in E(t)\}\) and \(N_i^{out}(t) = \{j \in V \mid (i, j) \in E(t)\}\) as the in-neighbor set and out-neighbor set of node i, respectively. The network is unbalanced at time t if \(|N_i^{in}(t)| \ne |N_i^{out}(t)|\), where | · | is the cardinality of a set. The time-varying network is said to be unbalanced if it is unbalanced at all times. The out-degree of node i at time t is denoted by \(d_i^{out}(t) = |N_i^{out}(t)|\). A directed path is a sequence of consecutive directed edges. If there is at least one directed path between any pair of distinct nodes in the network, then the network is called strongly connected. The weighted adjacency matrix at time t corresponding to the network G(t) is denoted by \(W(t) = [w_{ij}(t)] \in \mathbb{R}^{n \times n}\), where \(w_{ij}(t) > 0\) if (j, i) ∈ E(t), and \(w_{ij}(t) = 0\) otherwise. In addition, at time t, W(t) is column-stochastic if \(\sum_{i=1}^{n} w_{ij}(t) = 1\), ∀j ∈ V.
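One common way to obtain a column-stochastic weight matrix from out-degrees alone is to let every node split its mass equally among itself and its out-neighbors. The Python sketch below follows that rule; the equal-split weights, the added self-loops, the 0-based node labels, and the function name are assumptions made for illustration, since the chapter has not yet fixed a specific weight rule at this point.

```python
def column_stochastic_weights(n, edges):
    """Build W with w_ij = 1 / (d_j_out + 1) if i == j or (j, i) is an edge.

    n:     number of nodes, labeled 0, ..., n-1 (assumed labeling).
    edges: directed pairs (j, i) meaning node j sends information to node i.
    """
    out_neighbors = {j: set() for j in range(n)}
    for j, i in edges:
        if i != j:
            out_neighbors[j].add(i)

    W = [[0.0] * n for _ in range(n)]
    for j in range(n):
        share = 1.0 / (len(out_neighbors[j]) + 1)  # +1 for the assumed self-loop
        W[j][j] = share
        for i in out_neighbors[j]:
            W[i][j] = share  # column j collects the weights node j hands out
    return W
```

Each column sums to one by construction, while the rows generally do not, which is exactly the imbalance that a push-sum style correction is meant to counteract.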
9.3 Algorithm Development

In this section, we first introduce the differential privacy strategy. Then, a distributed online algorithm with the differential privacy strategy is developed to solve the formulated problem.
9.3.1 Differential Privacy Strategy

The definition of differential privacy was first presented by Dwork in [26] and subsequently analyzed in detail in [27–31]. Differential privacy enables participating nodes to share their individual information without revealing sensitive information. In this chapter, differential privacy is employed to preserve the local cost function of each individual node. We first give the definition of the adjacent relation.

Definition 9.1 (Adjacent Relation) Consider two datasets $\mathcal{D} = \{f_i\}_{i=1}^{n}$ and $\mathcal{D}' = \{f_i'\}_{i=1}^{n}$. If there is $i \in \mathcal{V}$ such that $f_i \neq f_i'$ and $f_j = f_j'$, $\forall j \neq i$, then $\mathcal{D}$ and $\mathcal{D}'$ are called adjacent.

In other words, if the data of a single participant differs between the two datasets $\mathcal{D}$ and $\mathcal{D}'$ while the data of all other participants is the same, then the two datasets are adjacent. Informally, differential privacy requires that two adjacent datasets are nearly indistinguishable from the output of the randomized method acting on them. In view of this, we introduce the following definition.
Definition 9.2 (Differential Privacy) A randomized method $\mathcal{M}$ is $\varepsilon$-differentially private if, for any set of outputs $\Psi \subseteq \text{Range}(\mathcal{M})$ and any adjacent datasets $\mathcal{D}$ and $\mathcal{D}'$, it holds that

$$\mathbb{P}[\mathcal{M}(\mathcal{D}) \in \Psi] \le \exp(\varepsilon) \cdot \mathbb{P}[\mathcal{M}(\mathcal{D}') \in \Psi], \tag{9.4}$$

where $\varepsilon > 0$ and $\text{Range}(\mathcal{M})$ represents the output range of the randomized method $\mathcal{M}$.

Inequality (9.4) indicates that whether or not an individual node participates in the dataset makes no remarkable difference to the output of the randomized method $\mathcal{M}$. Hence, a malicious node gains little information about other nodes. The constant $\varepsilon$ measures the privacy level of $\mathcal{M}$: a smaller $\varepsilon$ means a higher level of privacy preserving. Thus, the constant $\varepsilon$ is a compromise between the desired level of privacy preserving and the accuracy of $\mathcal{M}$. To ensure differential privacy, we need to know the "sensitivity" of $\mathcal{M}$. It is deduced from [26] that the magnitude of the stochastic noise (perturbation) depends on the maximum change in the output of $\mathcal{M}$ caused by an individual entry in the data source. This amount is known as the sensitivity of the method, which is defined as follows.

Definition 9.3 (Sensitivity) At time $t$, the sensitivity of a method $\mathcal{M}$ is denoted as

$$\Delta(t) = \sup_{\mathcal{D}_t, \mathcal{D}_t' :\, \text{Adj}(\mathcal{D}_t, \mathcal{D}_t')} \|\mathcal{M}(\mathcal{D}_t) - \mathcal{M}(\mathcal{D}_t')\|_1, \tag{9.5}$$

where $\text{Adj}(\mathcal{D}_t, \mathcal{D}_t')$ denotes that the datasets $\mathcal{D}_t$ and $\mathcal{D}_t'$ are adjacent at time $t$.

Built on the concept of sensitivity, we observe that, to maintain the same level of privacy preserving, a higher sensitivity requires more stochastic noise. Thus, we can control the magnitude of the stochastic noise by bounding the sensitivity to ensure differential privacy.

Remark 9.4 With the development of artificial intelligence and the emergence of 5G, data from multiple real-world application areas, such as disease prevention/treatment and online ads recommendation, has become large, widely distributed, and rapidly changing. This requires data to be processed in real time (online) to respond rapidly to users, while a large amount of data related to personal information needs to be protected. Thus, designing an algorithm that not only preserves the privacy of participating nodes but also solves the distributed online optimization problem will have far-reaching implications.
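As a minimal sketch of how sensitivity and $\varepsilon$ determine the injected noise, the following code perturbs a vector with Laplace noise of scale $\Delta/\varepsilon$ and evaluates the density ratio that appears in (9.4). The function names and the sampling route (a difference of two exponential variates, which is Laplace distributed) are illustrative choices, not part of the chapter.

```python
import math
import random

def laplace_pdf(x, scale):
    """Density of the zero-mean Laplace distribution with the given scale."""
    return math.exp(-abs(x) / scale) / (2.0 * scale)

def sample_laplace(scale, rng):
    """Draw one Laplace(scale) sample as a difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Perturb each entry of `value` with Laplace noise of scale sensitivity/epsilon."""
    scale = sensitivity / epsilon
    return [v + sample_laplace(scale, rng) for v in value]

def density_ratio(output, x, x_adj, scale):
    """Ratio of the output densities under two adjacent inputs x and x_adj."""
    ratio = 1.0
    for o, a, b in zip(output, x, x_adj):
        ratio *= laplace_pdf(o - a, scale) / laplace_pdf(o - b, scale)
    return ratio

rng = random.Random(0)
noisy = laplace_mechanism([0.0, 1.0], sensitivity=0.5, epsilon=0.7, rng=rng)
```

When the two inputs differ by at most the sensitivity in the 1-norm, the triangle inequality keeps `density_ratio` below $\exp(\varepsilon)$ for every output, which is the pointwise form of (9.4).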
9.3.2 Differentially Private Distributed Online Algorithm

Recall that our goal is to solve problem (9.1) in a distributed manner while preserving the privacy of nodes. In the distributed setting, we assume that each node, at each time $t$, only acquires information transmitted by its in-neighbors and sends its own estimates to its out-neighbors through a time-varying unbalanced directed network $\mathcal{G}(t)$. Privacy preserving means that the sensitive information of each node $i \in \mathcal{V}$ is well protected. Suppose that a malicious node can eavesdrop by monitoring all communication channels and intercepting the messages exchanged among nodes. Traditional online approaches may then leak sensitive information of nodes. To address this issue, DP-DSSP is employed to ensure that the privacy of the participating nodes is not leaked. It is worth mentioning that the performance of DP-DSSP inevitably suffers from some degradation due to the stochastic noise. That is to say, the privacy level and the performance are inversely related.

We are now in a position to present DP-DSSP. In DP-DSSP, each node $i \in \mathcal{V}$ maintains five estimates: $x_i(t) \in \mathbb{R}^d$, $h_i(t) \in \mathbb{R}^d$, $y_i(t) \in \mathbb{R}^d$, $s_i(t) \in \mathbb{R}$, and $z_i(t) \in \mathbb{R}^d$. Then, each node $i \in \mathcal{V}$ implements the following updates at each time $t \in \{1, \ldots, T\}$:

$$\begin{cases}
h_i(t) = x_i(t) + \eta_i(t), \\
y_i(t+1) = \dfrac{h_i(t)}{d_i^{\text{out}}(t) + 1} + \sum_{j \in \mathcal{N}_i^{\text{in}}(t)} \dfrac{h_j(t)}{d_j^{\text{out}}(t) + 1}, \\
s_i(t+1) = \dfrac{s_i(t)}{d_i^{\text{out}}(t) + 1} + \sum_{j \in \mathcal{N}_i^{\text{in}}(t)} \dfrac{s_j(t)}{d_j^{\text{out}}(t) + 1}, \\
z_i(t+1) = y_i(t+1) / s_i(t+1), \\
x_i(t+1) = y_i(t+1) - \alpha(t+1) g_i(t+1),
\end{cases} \tag{9.6}$$

where $\eta_i(t) \in \mathbb{R}^d$, generated by node $i \in \mathcal{V}$, is an independent and identically distributed stochastic variable drawn from a Laplace distribution $\text{Lap}(\sigma(t))$, and $\alpha(t+1) > 0$ is the diminishing learning rate. The noisy subgradient of $f_i^{t+1}(z)$ at $z = z_i(t+1)$ is denoted by $g_i^{t+1}(z_i(t+1))$, abbreviated as $g_i(t+1)$, and is assumed to satisfy $g_i^{t+1}(z_i(t+1)) = \nabla f_i^{t+1}(z_i(t+1)) + \tau_i(t+1)$, where $\nabla f_i^{t+1}(z_i(t+1))$ represents the subgradient of $f_i^{t+1}(z)$ at $z = z_i(t+1)$, and $\tau_i(t+1) \in \mathbb{R}^d$ is an independent zero-mean stochastic noise term. The estimates of DP-DSSP (9.6) are initialized as $x_i(0) \in \mathbb{R}^d$ and $s_i(0) = 1$ for all $i$. In addition, for any $t \in \mathbb{Z}_{\ge 0}$ and $i, j \in \mathcal{V}$, the non-negative matrix $W(t) = [w_{ij}(t)] \in \mathbb{R}^{n \times n}$ is given by

$$w_{ij}(t) = \begin{cases} \dfrac{1}{d_j^{\text{out}}(t) + 1}, & j \in \mathcal{N}_i^{\text{in}}(t) \cup \{i\}, \\ 0, & \text{otherwise}. \end{cases} \tag{9.7}$$
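One round of the updates in (9.6) can be sketched as follows for scalar states ($d = 1$). This is an illustrative sketch only: nodes are indexed from 0, the noise vector and the subgradient oracle `noisy_subgrad` are supplied by the caller, and all names are hypothetical.

```python
def dp_dssp_step(x, s, out_nbrs, noise, noisy_subgrad, alpha_next):
    """One round of (9.6) for scalar states, nodes 0..n-1.

    out_nbrs[j] is the set of out-neighbors of node j in this round
    (self-loops excluded); noise[i] plays the role of eta_i(t).
    """
    n = len(x)
    h = [x[i] + noise[i] for i in range(n)]      # perturbed state that is shared
    y = [0.0] * n
    s_next = [0.0] * n
    for j in range(n):
        w = 1.0 / (len(out_nbrs[j]) + 1)         # weight 1/(d_j^out + 1) from (9.7)
        for i in set(out_nbrs[j]) | {j}:         # node j pushes to itself and its out-neighbors
            y[i] += w * h[j]
            s_next[i] += w * s[j]
    z = [y[i] / s_next[i] for i in range(n)]     # de-biased estimate
    x_next = [y[i] - alpha_next * noisy_subgrad(i, z[i]) for i in range(n)]
    return x_next, s_next, z

# Hypothetical 3-node round with no noise and a zero subgradient.
out_nbrs = {0: {1}, 1: {2}, 2: {0, 1}}
x_next, s_next, z = dp_dssp_step(
    x=[1.0, 2.0, 3.0], s=[1.0, 1.0, 1.0], out_nbrs=out_nbrs,
    noise=[0.0, 0.0, 0.0], noisy_subgrad=lambda i, zi: 0.0, alpha_next=0.1)
```

Because the weights in (9.7) are column-stochastic, the mixing step preserves the sums of the shared states and of the scaling variables, which the example can be used to confirm.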
In addition, the following assumptions are needed.

Assumption 9.1 (Bounded Subgradient [38]) For a sequence of local convex cost functions $\{f_1^t, \ldots, f_n^t\}_{t=1}^{T}$ and all $i \in \mathcal{V}$, the subgradient $\nabla f_i^t(z)$ of $f_i^t(z)$ at $z$ is $L_i$-bounded over $\mathbb{R}^d$, i.e., $\|\nabla f_i^t(z)\| \le L_i$, $\forall z \in \mathbb{R}^d$.

Denote by $\mathcal{F}_t$ the $\sigma$-field generated by all the information of DP-DSSP up to time $t$. Then, we give the conditions on the stochastic noise $\tau_i(t)$, $i \in \mathcal{V}$, $t \in \{1, \ldots, T\}$, as follows.

Assumption 9.2 (Stochastic Noise [40]) The stochastic noise $\tau_i(t)$, $i \in \mathcal{V}$, $t \in \{1, \ldots, T\}$, is an independent stochastic variable satisfying $\mathbb{E}[\tau_i(t) \mid \mathcal{F}_{t-1}] = 0$. Also, suppose that $\mathbb{E}[\|\tau_i(t)\|] \le \pi_i$ for a constant $\pi_i > 0$.

The stochastic noise $\tau_i(t)$, $i \in \mathcal{V}$, $t \in \{1, \ldots, T\}$, is also considered in [37] and [40] and plays a significant role in real applications. In addition, the next assumption is needed.

Assumption 9.3 ([42]) For the sequence $\{\mathcal{G}(t) = \{\mathcal{V}, \mathcal{E}(t)\}\}$ of time-varying unbalanced directed networks, there is an integer $B \in \mathbb{Z}_{>0}$ such that the aggregate directed network $\mathcal{G}_B(t) = \{\mathcal{V}, \cup_{\ell=t}^{t+B-1} \mathcal{E}(\ell)\}$ is strongly connected $\forall t \in \mathbb{Z}_{\ge 0}$.

The strong-connectivity bound $B$ introduced in Assumption 9.3 does not need to be known by any node and is only employed in the convergence analysis [44]. Assumption 9.3 is more general than requiring each $\mathcal{G}(t)$ to be strongly connected [37, 45].
9.4 Main Results

This section presents the main results of this chapter ($\varepsilon$-differential privacy and expected regrets).

Theorem 9.5 ($\varepsilon$-Differential Privacy) Suppose that Assumptions 9.1–9.3 hold. Let $\eta_i(t) \in \mathbb{R}^d$, $i \in \mathcal{V}$, $t \in \{1, \ldots, T\}$, be the noise introduced in DP-DSSP (9.6) with parameter $\sigma(t) = \Delta(t)/\varepsilon$, where $\varepsilon > 0$. Then, DP-DSSP (9.6) guarantees $\varepsilon$-differential privacy.

Before bounding the expected regrets of an individual node, some notation is introduced. Let $z^*$ and $Z^*$ be, respectively, an optimal solution and the nonempty optimal solution set of (9.1). Moreover, let $\bar{\mu} = (1/2n) \sum_{i=1}^{n} \mu_i$, $L_{\max} = \max_{i \in \mathcal{V}} L_i$, and $L = \sum_{i=1}^{n} L_i$, where $\mu_i > 0$, $\forall i \in \mathcal{V}$. With the above preparations, we are ready to present the expected regrets of this chapter.
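Theorem 9.6 (Logarithmic Regret) Suppose that Assumptions 9.1–9.3 hold. Assume that each local cost function $f_i^t$, $i \in \mathcal{V}$, is $\mu_i$-strongly convex for all $t \in \{1, \ldots, T\}$. Consider that DP-DSSP updates the sequence $\{z_1(t), \ldots, z_n(t)\}_{t=1}^{T}$ with $\alpha(t) = 1/(\hat{\mu} t)$, where $\hat{\mu} \in (0, \bar{\mu}]$. Then, the pseudo-individual-regret of node $j \in \mathcal{V}$ can be upper bounded as

$$\bar{R}_j(T) + \sum_{j=1}^{n} \mu_j \mathbb{E}[\|\hat{z}_j(T) - z^*\|^2] \le \Xi_1 + \Xi_2 (1 + \ln T),$$

where

$$\hat{z}_i(t) = \frac{2 \sum_{k=1}^{t} k z_i(k)}{t(t+1)},$$

$$\Xi_1 = n \hat{\mu} \mathbb{E}[\|\bar{x}(0) - z^*\|^2] + \frac{14CL}{\delta(1-\lambda)} \sum_{j=1}^{n} \|x_j(0)\|_1,$$

$$\Xi_2 = \frac{14 C n^{3/2} \hat{p} L}{\delta \lambda \hat{\mu} (1-\lambda)} + \frac{42\sqrt{2} C n^{3/2} d \hat{p} L}{\delta \varepsilon \hat{\mu} (1-\lambda)} + \frac{16 n d \hat{p}^2}{\varepsilon^2} + 2\varpi^2,$$

where $\delta > 0$ and $0 < \lambda < 1$ are related to the network topology, $\bar{x}(0) = (1/n) \sum_{i=1}^{n} x_i(0)$, $\hat{p} = \max_{i \in \mathcal{V}} \{L_i + \pi_i\}$, and $\varpi^2 = \sum_{i=1}^{n} (L_i + \nu_i)^2$.

For generally convex cost functions, a constant learning rate tuned to the horizon, such as $\alpha(t) = 1/\sqrt{T}$ used in the proof of Theorem 9.7, relies on the time horizon $T$. Thus, the Doubling Trick scheme (DTS) [41], which does not rely on $T$, is applied. In DTS, we pick $\alpha(t) = 1/\sqrt{2^k}$ in each period of $2^k$ rounds ($t = 2^k, \ldots, 2^{k+1} - 1$) for $k = 0, 1, 2, \ldots, \lfloor \log_2 T \rfloor$. By means of DTS, the pseudo-individual-subregret associated with node $j \in \mathcal{V}$ in period $k$ can be defined as follows:²

$$\bar{R}_j^{\text{DTS}}(k) = \mathbb{E}\left[\sum_{t=2^k}^{2^{k+1}-1} \sum_{i=1}^{n} f_i^t(z_j(t))\right] - \mathbb{E}\left[\sum_{t=2^k}^{2^{k+1}-1} \sum_{i=1}^{n} f_i^t(z^*)\right]. \tag{9.8}$$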
Based on the above definition, the following theorem, i.e., the square-root regret, is established for generally convex cost functions.

Theorem 9.7 (Square-Root Regret) Suppose that Assumptions 9.1–9.3 hold. Assume that each local cost function $f_i^t$, $i \in \mathcal{V}$, is convex for all $t \in \{1, \ldots, T\}$. Consider that DP-DSSP (9.6) generates the sequence $\{z_1(t), \ldots, z_n(t)\}_{t=1}^{T}$ with $\alpha(t)$ selected by DTS. Then, the pseudo-individual-regret of node $j \in \mathcal{V}$ can be upper bounded as

$$\bar{R}_j(T) \le \sum_{k=0}^{\lfloor \log_2 T \rfloor} \bar{R}_j^{\text{DTS}}(k) \le \frac{\sqrt{2}}{\sqrt{2} - 1} \Gamma_1 \sqrt{T},$$

where

$$\Gamma_1 = n \mathbb{E}[\|\bar{x}(0) - z^*\|^2] + \frac{14CL}{\delta(1-\lambda)} \sum_{j=1}^{n} \|x_j(0)\|_1 + 2\varpi^2 + \frac{14 C n^{3/2} \hat{p} L}{\delta \lambda (1-\lambda)} + \frac{42\sqrt{2} C n^{3/2} d \hat{p} L}{\delta \varepsilon (1-\lambda)} + \frac{16 n d \hat{p}^2}{\varepsilon^2}.$$

² Here, we do not give a specific definition of the pseudo-network-subregret $\bar{R}^{\text{DTS}}(k)$; interested readers can define it similarly with reference to (9.2) and (9.8).

As pointed out in Theorems 9.6 and 9.7, the desired regrets of DP-DSSP can be derived for strongly convex cost functions (logarithmic regret) and generally convex cost functions (square-root regret). In addition, the obtained regrets depend on the network size $n$ and the vector dimension $d$. Notice that, for a fixed $\varepsilon$, the achieved regrets possess the same orders $O(\ln T)$ and $O(\sqrt{T})$ as [14, 15] (without a differential privacy strategy), even though Laplace random noise is injected. Besides, for a fixed $T$, the regrets in Theorems 9.6 and 9.7 grow arbitrarily large as $\varepsilon \to 0$. Namely, the regrets possess the order $O(1/\varepsilon^2)$.

Remark 9.8 DTS in Theorem 9.7 divides the original time series into subseries of increasing size (except the last subseries) and runs DP-DSSP on each subseries. Specifically, DTS divides the original time series $t \in \{1, \ldots, T\}$ into $\lfloor \log_2 T \rfloor + 1$ subseries. In each subseries (except the last one), DP-DSSP performs $2^k$ rounds. Based on this, we can calculate the corresponding subregret in each subseries $k = 0, 1, 2, \ldots, \lfloor \log_2 T \rfloor$, and the total regret is bounded by the sum of the $\lfloor \log_2 T \rfloor + 1$ subregrets. Therefore, DP-DSSP still carries out all iterations $t = 1, \ldots, T$ under DTS. More importantly, DTS does not require access to the global information $T$ but still guarantees that DP-DSSP achieves $\sqrt{T}$ regret. Hence, DTS is meaningful when compared with the existing works that do not utilize DTS and still attain $\sqrt{T}$ regret growth [19, 20].
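The period structure of the Doubling Trick scheme can be sketched in a few lines. The sketch assumes that the periods are indexed by $k = \lfloor \log_2 t \rfloor$ and that the last period is truncated at $T$, which is how Remark 9.8 is read here; the function names are hypothetical.

```python
import math

def dts_learning_rate(t):
    """alpha(t) = 1/sqrt(2**k) for t in period k, i.e., 2**k <= t <= 2**(k+1) - 1."""
    k = t.bit_length() - 1            # floor(log2(t)) for an integer t >= 1
    return 1.0 / math.sqrt(2 ** k)

def dts_periods(T):
    """List (k, first round, last round) for every period needed to cover 1..T."""
    periods = []
    k = 0
    while 2 ** k <= T:
        periods.append((k, 2 ** k, min(2 ** (k + 1) - 1, T)))
        k += 1
    return periods

def sum_of_period_bounds(T):
    """Sum of sqrt(2**k) over all periods, the factor multiplying the per-period constant."""
    return sum(math.sqrt(2 ** k) for k, _, _ in dts_periods(T))

periods_for_10 = dts_periods(10)
```

Summing $\sqrt{2^k}$ over the periods gives a geometric series, which stays below $\frac{\sqrt{2}}{\sqrt{2}-1}\sqrt{T}$ because $2^{\lfloor \log_2 T \rfloor} \le T$.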
9.4.1 Differential Privacy Analysis

This subsection demonstrates that DP-DSSP guarantees $\varepsilon$-differential privacy. During information sharing, the sensitive information of an individual node may be intercepted by a malicious node, so the differential privacy strategy is applied to preserve the sensitive information of nodes. Specifically, a stochastic noise drawn from the Laplace distribution is injected into the estimate of each node at each time $t$. To achieve differential privacy, the sensitivity of DP-DSSP (Definition 9.3) is employed, which determines the amount of noise that needs to be added. Hence, we first bound the sensitivity of DP-DSSP as follows.

Lemma 9.9 Suppose that Assumptions 9.1 and 9.2 hold. The sensitivity of DP-DSSP (9.6), calculated as in (9.5), is bounded by

$$\Delta(t) \le 2 \hat{p} \sqrt{d}\, \alpha(t+1),$$

where $\hat{p} = \max_{i \in \mathcal{V}} \{L_i + \pi_i\}$.

Proof This chapter requires that the local cost function of each node is well preserved. Thus, the adjacent relation in this chapter implies that there is a node $i \in \mathcal{V}$ such that $f_i^t \neq f_i'^{\,t}$ and $f_j^t = f_j'^{\,t}$, $\forall j \neq i$. Let $x_i(t)$ and $x_i'(t)$ denote the implementations for $\mathcal{M}(\mathcal{D}_t)$ and $\mathcal{M}(\mathcal{D}_t')$, respectively. In light of the update of $x_i(t)$ in (9.6) and Definition 9.3, it holds that

$$\begin{aligned}
\|\mathcal{M}(\mathcal{D}_t) - \mathcal{M}(\mathcal{D}_t')\|_1 &= \|x_i(t+1) - x_i'(t+1)\|_1 \\
&\le \alpha(t+1) \left(\|g_i(t+1)\|_1 + \|g_i'(t+1)\|_1\right) \\
&\le \sqrt{d}\, \alpha(t+1) \left(\|g_i(t+1)\| + \|g_i'(t+1)\|\right),
\end{aligned}$$

where, by Assumptions 9.1 and 9.2,

$$\mathbb{E}[\|g_i(t+1)\| \mid \mathcal{F}_t] \le L_i + \pi_i \le \hat{p}. \tag{9.9}$$

Since the pair of adjacent datasets $\mathcal{D}_t$ and $\mathcal{D}_t'$ is arbitrary, Definition 9.3 yields

$$\Delta(t) \le \mathbb{E}[\|x_i(t+1) - x_i'(t+1)\|_1 \mid \mathcal{F}_t] \le 2 \hat{p} \sqrt{d}\, \alpha(t+1),$$

which implies the desired result.

Then, the proof of Theorem 9.5 is provided as follows.
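Proof of Theorem 9.5 Define $x(t) = [x_1(t)^T, \ldots, x_n(t)^T]^T \in \mathbb{R}^{nd}$ and $x'(t) = [x_1'(t)^T, \ldots, x_n'(t)^T]^T \in \mathbb{R}^{nd}$. Recalling Definition 9.3, one has

$$\|x(t) - x'(t)\|_1 \le \Delta(t). \tag{9.10}$$

Since $x'(t)$ and $x(t)$ belong to $\mathbb{R}^{nd}$, the property of the 1-norm gives

$$\sum_{i=1}^{n} \sum_{k=1}^{d} |x_{i,k}(t) - x_{i,k}'(t)| = \|x(t) - x'(t)\|_1 \le \Delta(t), \tag{9.11}$$

where $x_{i,k}'(t)$ and $x_{i,k}(t)$ are, respectively, the $k$-th elements of $x_i'(t)$ and $x_i(t)$. Then, according to the property of the Laplace distribution [46] and (9.11), it follows that

$$\begin{aligned}
\prod_{i=1}^{n} \prod_{k=1}^{d} \frac{\mathbb{P}[z_{i,k}(t) - x_{i,k}(t)]}{\mathbb{P}[z_{i,k}(t) - x_{i,k}'(t)]}
&= \prod_{i=1}^{n} \prod_{k=1}^{d} \frac{\exp\left(-\frac{|z_{i,k}(t) - x_{i,k}(t)|}{\sigma(t)}\right)}{\exp\left(-\frac{|z_{i,k}(t) - x_{i,k}'(t)|}{\sigma(t)}\right)} \\
&\le \prod_{i=1}^{n} \prod_{k=1}^{d} \exp\left(\frac{|z_{i,k}(t) - x_{i,k}'(t) - z_{i,k}(t) + x_{i,k}(t)|}{\sigma(t)}\right) \\
&= \exp\left(\frac{\sum_{i=1}^{n} \sum_{k=1}^{d} |x_{i,k}(t) - x_{i,k}'(t)|}{\sigma(t)}\right) \\
&\le \exp\left(\frac{\Delta(t)}{\sigma(t)}\right).
\end{aligned} \tag{9.12}$$

Furthermore, one has

$$\mathbb{P}[\mathcal{M}(\mathcal{D}) \in \Psi] = \prod_{t=1}^{T} \mathbb{P}[\mathcal{M}(\mathcal{D}_t) \in \Psi].$$

Integrating both sides of (9.12) and using this product form, we acquire

$$\prod_{t=1}^{T} \mathbb{P}[\mathcal{M}(\mathcal{D}_t) \in \Psi] \le \prod_{t=1}^{T} \exp\left(\frac{\Delta(t)}{\sigma(t)}\right) \cdot \mathbb{P}[\mathcal{M}(\mathcal{D}_t') \in \Psi],$$

which, according to $\Delta(t)/\sigma(t) = \varepsilon$, yields that

$$\mathbb{P}[\mathcal{M}(\mathcal{D}) \in \Psi] \le \exp(\varepsilon) \cdot \mathbb{P}[\mathcal{M}(\mathcal{D}') \in \Psi]. \tag{9.13}$$

By Definition 9.2, the proof of Theorem 9.5 is complete.

Remark 9.10 Since the sensitivity of DP-DSSP (9.6) depends on $\alpha(t)$, in light of Theorem 9.5 the sensitivity, and hence the noise parameter $\sigma(t) = \Delta(t)/\varepsilon$ for a fixed $\varepsilon$, decreases as DP-DSSP proceeds.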
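The observation in Remark 9.10 can be made concrete with a short sketch that evaluates the bound of Lemma 9.9 and the resulting Laplace parameter $\sigma(t) = \Delta(t)/\varepsilon$ along the learning-rate schedule $\alpha(t) = 1/(\hat{\mu} t)$. The numerical values of $\hat{\mu}$, $\hat{p}$, $d$, and $\varepsilon$ are arbitrary illustrative assumptions, and the sensitivity is replaced by its upper bound.

```python
import math

def sensitivity_bound(p_hat, d, alpha_next):
    """Upper bound 2 * p_hat * sqrt(d) * alpha(t+1) from Lemma 9.9."""
    return 2.0 * p_hat * math.sqrt(d) * alpha_next

def laplace_scale(p_hat, d, alpha_next, epsilon):
    """Noise parameter sigma(t) obtained by using the bound in place of Delta(t)."""
    return sensitivity_bound(p_hat, d, alpha_next) / epsilon

# Illustrative constants (assumptions, not values from the chapter).
mu_hat, p_hat, d, eps = 0.5, 2.0, 4, 1.0
scales = [laplace_scale(p_hat, d, 1.0 / (mu_hat * (t + 1)), eps) for t in range(1, 6)]
```

With a diminishing learning rate the list `scales` is strictly decreasing, so less noise is injected in later rounds for the same $\varepsilon$.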
9.4.2 Logarithmic Regret

This subsection establishes the logarithmic regret of DP-DSSP (9.6) presented in Theorem 9.6. For this purpose, some supporting lemmas that are indispensable in the proof of the main results are first presented. In light of the definition of matrices $W(t)$ and $W(t:k)$, $t \ge k \in \mathbb{Z}_{\ge 0}$, the following lemma is stated, which directly follows from Corollary 2 in [38].

Lemma 9.11 ([38]) Suppose that Assumption 9.3 holds. If matrix $W(t) = [w_{ij}(t)] \in \mathbb{R}^{n \times n}$ satisfies (9.7), there is a sequence of stochastic vectors $\phi(t) \in \mathbb{R}^n$ such that the matrix $W(t:k)$ satisfies, $\forall i, j \in \mathcal{V}$ and $t \ge k \in \mathbb{Z}_{\ge 0}$,

$$|[W(t:k)]_{ij} - \phi_i(t)| \le C \lambda^{t-k},$$

where $C \ge 1$ and $0 < \lambda < 1$.

Denote $\eta(t) = [\eta_1(t), \ldots, \eta_n(t)]^T$, $y(t) = [y_1(t), \ldots, y_n(t)]^T$, $s(t) = [s_1(t), \ldots, s_n(t)]^T$, $g(t) = [g_1(t), \ldots, g_n(t)]^T$, and $\bar{x}(t) = \frac{1}{n} \sum_{i=1}^{n} x_i(t)$. With the help of Lemma 9.11, we provide the upper bound of $\mathbb{E}[\|z_i(t+1) - \bar{x}(t)\|]$, $\forall t \in \mathbb{Z}_{\ge 0}$, below, which is obtained directly from Lemma 1(a) in [38].

Lemma 9.12 ([38]) Suppose that Assumptions 9.2 and 9.3 hold. Then, DP-DSSP (9.6) generates a sequence $\{z_1(t), \ldots, z_n(t)\}_{t=1}^{T}$ such that, $\forall i \in \mathcal{V}$, $t \in \mathbb{Z}_{\ge 0}$,

$$\begin{aligned}
\mathbb{E}[\|z_i(t+1) - \bar{x}(t)\|] \le{}& \frac{2C}{\delta} \lambda^t \sum_{j=1}^{n} \|x_j(0)\|_1 + \frac{3C\sqrt{n}}{\delta} \sum_{k=0}^{t} \lambda^{t-k} \sum_{j=1}^{n} \mathbb{E}[\|\eta_j(k)\|] \\
&+ \frac{2 C n^{3/2} \hat{p}}{\delta} \sum_{k=1}^{t} \lambda^{t-k} \alpha(k),
\end{aligned}$$

where $\delta \ge 1/n^{nB}$ weighs the imbalance of the network and $0 < \lambda \le (1 - 1/n^{nB})^{1/B}$ is a measure of connectivity (if each of the networks $\mathcal{G}(t)$ is regular, one may see [38] for more details on $\delta$ and $\lambda$); $\eta_i(t) \sim \text{Lap}(\sigma(t))$ with $\mathbb{E}[\eta_i(t)] = 0$ and $\mathbb{E}[\|\eta_i(t)\|^2] = 2\sigma^2(t)$.
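The next lemma establishes a relation that is needed for deriving the main results.

Lemma 9.13 Suppose that Assumptions 9.1–9.3 hold. Consider that DP-DSSP (9.6) generates a sequence $\{x_1(t), \ldots, x_n(t)\}_{t=1}^{T}$. Then, we have for any $v \in \mathbb{R}^d$ that, $\forall t \in \mathbb{Z}_{\ge 0}$,

$$\begin{aligned}
&\mathbb{E}[\|\bar{x}(t+1) - v\|^2] - \mathbb{E}[\|\bar{x}(t) - v\|^2] \\
&\le \frac{2}{n} \sum_{i=1}^{n} \mathbb{E}[\|\eta_i(t)\|^2] - \frac{2\alpha(t+1)}{n} \mathbb{E}[f^{t+1}(\bar{x}(t)) - f^{t+1}(v)] \\
&\quad + \frac{4\alpha(t+1)}{n} \mathbb{E}\left[\sum_{i=1}^{n} L_i \|\bar{x}(t) - z_i(t+1)\|\right] + \frac{2\alpha^2(t+1)}{n} \varpi^2 \\
&\quad - \frac{\alpha(t+1)}{n} \mathbb{E}\left[\sum_{i=1}^{n} \kappa_i^{t+1} \|z_i(t+1) - v\|^2\right],
\end{aligned}$$

where $\kappa_i^{t+1}$ is the strong convexity parameter of $f_i^{t+1}$ and $\varpi^2 = \sum_{i=1}^{n} (L_i + \nu_i)^2$ as in Theorem 9.6.

Proof Owing to the column-stochasticity of each matrix $W(t)$, it is concluded from the update of $h_i(t)$ in (9.6) that

$$\bar{x}(t+1) = \bar{x}(t) + \frac{1}{n} \sum_{i=1}^{n} \eta_i(t) + \frac{1}{n} \sum_{i=1}^{n} \chi_i(t+1), \tag{9.14}$$

where $\chi_i(t+1) = -\alpha(t+1) g_i(t+1)$, $t \in \mathbb{Z}_{\ge 0}$. Let $v \in \mathbb{R}^d$ be an arbitrary vector. It follows from (9.14) that

$$\begin{aligned}
\|\bar{x}(t+1) - v\|^2 - \|\bar{x}(t) - v\|^2 ={}& \frac{2}{n} \sum_{i=1}^{n} \langle \chi_i(t+1), \bar{x}(t) - v \rangle + \frac{2}{n} \sum_{i=1}^{n} \langle \eta_i(t), \bar{x}(t) - v \rangle \\
&+ \left\|\frac{1}{n} \sum_{i=1}^{n} (\eta_i(t) + \chi_i(t+1))\right\|^2.
\end{aligned} \tag{9.15}$$

Now, the cross-term $(2/n) \sum_{i=1}^{n} \langle \chi_i(t+1), \bar{x}(t) - v \rangle$ in (9.15) is first considered. For this purpose, we have that

$$\mathbb{E}[\langle \chi_i(t+1), \bar{x}(t) - v \rangle \mid \mathcal{F}_t] = -\alpha(t+1) \langle \nabla f_i^{t+1}(z_i(t+1)), \bar{x}(t) - v \rangle, \tag{9.16}$$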
where the cross-term $\langle \nabla f_i^{t+1}(z_i(t+1)), \bar{x}(t) - v \rangle$ is

$$\begin{aligned}
\langle \nabla f_i^{t+1}(z_i(t+1)), \bar{x}(t) - v \rangle ={}& \langle \nabla f_i^{t+1}(z_i(t+1)), \bar{x}(t) - z_i(t+1) \rangle \\
&+ \langle \nabla f_i^{t+1}(z_i(t+1)), z_i(t+1) - v \rangle.
\end{aligned} \tag{9.17}$$

Following from the Cauchy–Schwarz inequality, we further get

$$\langle \nabla f_i^{t+1}(z_i(t+1)), \bar{x}(t) - z_i(t+1) \rangle \ge -L_i \|z_i(t+1) - \bar{x}(t)\|. \tag{9.18}$$

For the cross-term $\langle \nabla f_i^{t+1}(z_i(t+1)), z_i(t+1) - v \rangle$, since each local cost function $f_i^{t+1}$ is strongly convex with parameter $\kappa_i^{t+1}$, it holds that

$$\langle \nabla f_i^{t+1}(z_i(t+1)), z_i(t+1) - v \rangle \ge f_i^{t+1}(z_i(t+1)) - f_i^{t+1}(v) + \frac{\kappa_i^{t+1}}{2} \|z_i(t+1) - v\|^2. \tag{9.19}$$

Since $f_i^{t+1}(z_i(t+1)) - f_i^{t+1}(v) = f_i^{t+1}(z_i(t+1)) - f_i^{t+1}(\bar{x}(t)) + f_i^{t+1}(\bar{x}(t)) - f_i^{t+1}(v)$ and each local cost function $f_i^{t+1}$ is convex, it is obtained that

$$f_i^{t+1}(z_i(t+1)) - f_i^{t+1}(v) \ge \langle \nabla f_i^{t+1}(\bar{x}(t)), z_i(t+1) - \bar{x}(t) \rangle + f_i^{t+1}(\bar{x}(t)) - f_i^{t+1}(v). \tag{9.20}$$

Then, we also acquire that

$$\langle \nabla f_i^{t+1}(\bar{x}(t)), z_i(t+1) - \bar{x}(t) \rangle \ge -L_i \|z_i(t+1) - \bar{x}(t)\|. \tag{9.21}$$

Therefore, it follows from (9.19)–(9.21) that

$$\begin{aligned}
\langle \nabla f_i^{t+1}(z_i(t+1)), z_i(t+1) - v \rangle \ge{}& -L_i \|z_i(t+1) - \bar{x}(t)\| + f_i^{t+1}(\bar{x}(t)) - f_i^{t+1}(v) \\
&+ \frac{\kappa_i^{t+1}}{2} \|z_i(t+1) - v\|^2.
\end{aligned} \tag{9.22}$$
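Using $f^{t+1}(z) = \sum_{i=1}^{n} f_i^{t+1}(z)$ and substituting (9.18) and (9.22) back into (9.17), one acquires

$$\begin{aligned}
\sum_{i=1}^{n} \langle \nabla f_i^{t+1}(z_i(t+1)), \bar{x}(t) - v \rangle \ge{}& f^{t+1}(\bar{x}(t)) - f^{t+1}(v) + \frac{1}{2} \sum_{i=1}^{n} \kappa_i^{t+1} \|z_i(t+1) - v\|^2 \\
&- 2 \sum_{i=1}^{n} L_i \|z_i(t+1) - \bar{x}(t)\|.
\end{aligned} \tag{9.23}$$

Recall that $\eta_i(t) \sim \text{Lap}(\sigma(t))$ with $\mathbb{E}[\eta_i(t)] = 0$. In light of (9.16) and (9.23), it follows that

$$\begin{aligned}
\frac{2}{n} \mathbb{E}\left[\sum_{i=1}^{n} \langle \chi_i(t+1) + \eta_i(t), \bar{x}(t) - v \rangle\right] \le{}& -\frac{2\alpha(t+1)}{n} \mathbb{E}[f^{t+1}(\bar{x}(t)) - f^{t+1}(v)] \\
&- \frac{\alpha(t+1)}{n} \mathbb{E}\left[\sum_{i=1}^{n} \kappa_i^{t+1} \|z_i(t+1) - v\|^2\right] \\
&+ \frac{4\alpha(t+1)}{n} \mathbb{E}\left[\sum_{i=1}^{n} L_i \|z_i(t+1) - \bar{x}(t)\|\right].
\end{aligned} \tag{9.24}$$

Furthermore, according to Assumptions 9.1 and 9.2, we also have that

$$\mathbb{E}\left[\left\|\frac{1}{n} \sum_{i=1}^{n} (\eta_i(t) + \chi_i(t+1))\right\|^2\right] \le \frac{2}{n} \sum_{i=1}^{n} \mathbb{E}[\|\eta_i(t)\|^2] + \frac{2\alpha^2(t+1)}{n} \varpi^2, \tag{9.25}$$

where (9.25) is built upon the inequality $(\sum_{i=1}^{n} a_i)^2 \le n \sum_{i=1}^{n} a_i^2$. Taking the total expectation of (9.15) and substituting (9.24) and (9.25) into it gives the result in Lemma 9.13.

For the pseudo-network-regret, the following lemma is shown.

Lemma 9.14 Suppose that Assumptions 9.1–9.3 hold. Consider that DP-DSSP (9.6) generates the sequence $\{z_1(t), \ldots, z_n(t)\}_{t=1}^{T}$. Then, it holds that, $\forall T \in \mathbb{Z}_{>0}$,

$$\begin{aligned}
&\bar{R}(T) + \mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{n} \kappa_i^t \|z_i(t) - z^*\|^2\right] \\
&\le \frac{n}{\alpha(1)} \mathbb{E}[\|\bar{x}(0) - z^*\|^2] + \frac{10 C n^{3/2} \hat{p} L}{\delta} \sum_{t=1}^{T} \sum_{k=1}^{t-1} \lambda^{t-k-1} \alpha(k) \\
&\quad + \frac{10CL}{\delta(1-\lambda)} \sum_{j=1}^{n} \|x_j(0)\|_1 + \left(\frac{16 n d \hat{p}^2}{\varepsilon^2} + 2\varpi^2\right) \sum_{t=1}^{T} \alpha(t) \\
&\quad + \frac{30\sqrt{2} C n^{3/2} d \hat{p} L}{\delta \varepsilon} \sum_{t=1}^{T} \sum_{k=0}^{t-1} \lambda^{t-k-1} \alpha(k+1) \\
&\quad + n \mathbb{E}\left[\sum_{t=2}^{T} \|\bar{x}(t-1) - z^*\|^2 \Delta_1(t)\right],
\end{aligned}$$

where $\Delta_1(t) = \frac{1}{\alpha(t)} - \frac{1}{\alpha(t-1)} - \frac{1}{2n} \sum_{i=1}^{n} \kappa_i^t$.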
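Proof Letting $v = z^*$ in Lemma 9.13, we achieve that

$$\begin{aligned}
&\mathbb{E}[\|\bar{x}(t+1) - z^*\|^2] - \mathbb{E}[\|\bar{x}(t) - z^*\|^2] \\
&\le \frac{2}{n} \sum_{i=1}^{n} \mathbb{E}[\|\eta_i(t)\|^2] - \frac{2\alpha(t+1)}{n} \mathbb{E}[f^{t+1}(\bar{x}(t)) - f^{t+1}(z^*)] \\
&\quad + \frac{4\alpha(t+1)}{n} \mathbb{E}\left[\sum_{i=1}^{n} L_i \|z_i(t+1) - \bar{x}(t)\|\right] + \frac{2\alpha^2(t+1)}{n} \varpi^2 \\
&\quad - \frac{\alpha(t+1)}{n} \mathbb{E}\left[\sum_{i=1}^{n} \kappa_i^{t+1} \|z_i(t+1) - z^*\|^2\right].
\end{aligned} \tag{9.26}$$

We then analyze the term $f^{t+1}(\bar{x}(t)) - f^{t+1}(z^*)$ in (9.26). First, since $f_i^{t+1}$ is strongly convex with parameter $\kappa_i^{t+1}$, it yields that

$$f^{t+1}(\bar{x}(t)) - f^{t+1}(z^*) \ge \frac{1}{2} \sum_{i=1}^{n} \kappa_i^{t+1} \|\bar{x}(t) - z^*\|^2. \tag{9.27}$$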
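On the other hand, by Assumptions 9.1 and 9.2, we get that

$$f^{t+1}(\bar{x}(t)) - f^{t+1}(z^*) \ge f^{t+1}(z_i(t+1)) - f^{t+1}(z^*) - L \|z_i(t+1) - \bar{x}(t)\|, \tag{9.28}$$

where $L = \sum_{i=1}^{n} L_i$. Thus, by combining (9.27) and (9.28), we conclude the following estimate:

$$\begin{aligned}
2\left(f^{t+1}(\bar{x}(t)) - f^{t+1}(z^*)\right) \ge{}& f^{t+1}(z_i(t+1)) - f^{t+1}(z^*) + \frac{1}{2} \sum_{i=1}^{n} \kappa_i^{t+1} \|\bar{x}(t) - z^*\|^2 \\
&- L \|z_i(t+1) - \bar{x}(t)\|.
\end{aligned} \tag{9.29}$$

Plugging (9.29) back into (9.26) and then summing the obtained inequality over $t$ from 1 to $T$, it holds (after some algebraic operations) that

$$\begin{aligned}
&\bar{R}(T) - 2 \sum_{t=1}^{T} \frac{1}{\alpha(t)} \sum_{i=1}^{n} \mathbb{E}[\|\eta_i(t-1)\|^2] \\
&\le L \mathbb{E}\left[\sum_{t=1}^{T} \|z_i(t) - \bar{x}(t-1)\|\right] - \mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{n} \kappa_i^t \|z_i(t) - z^*\|^2\right] \\
&\quad + n \mathbb{E}\left[\sum_{t=2}^{T} \|\bar{x}(t-1) - z^*\|^2 \Delta_1(t)\right] + \frac{n}{\alpha(1)} \mathbb{E}[\|\bar{x}(0) - z^*\|^2] \\
&\quad + 4 \mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{n} L_i \|z_i(t) - \bar{x}(t-1)\|\right] + 2\varpi^2 \sum_{t=1}^{T} \alpha(t).
\end{aligned} \tag{9.30}$$

Furthermore, from Lemma 9.12, one achieves

$$\begin{aligned}
\mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{n} L_i \|z_i(t) - \bar{x}(t-1)\|\right] \le{}& \frac{2C}{\delta} \sum_{t=1}^{T} \sum_{i=1}^{n} L_i \lambda^{t-1} \sum_{j=1}^{n} \|x_j(0)\|_1 \\
&+ \frac{3C\sqrt{n}}{\delta} \sum_{t=1}^{T} \sum_{i=1}^{n} L_i \sum_{k=0}^{t-1} \lambda^{t-k-1} \sum_{j=1}^{n} \mathbb{E}[\|\eta_j(k)\|] \\
&+ \frac{2 C n^{3/2} \hat{p}}{\delta} \sum_{t=1}^{T} \sum_{i=1}^{n} L_i \sum_{k=1}^{t-1} \lambda^{t-k-1} \alpha(k).
\end{aligned} \tag{9.31}$$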
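Since $0 < \lambda < 1$, we get

$$\sum_{t=1}^{T} \sum_{i=1}^{n} L_i \lambda^{t-1} \sum_{j=1}^{n} \|x_j(0)\|_1 \le \frac{L}{1-\lambda} \sum_{j=1}^{n} \|x_j(0)\|_1. \tag{9.32}$$

Furthermore, since $\sum_{i=1}^{n} \|\eta_i(t)\| = n \sqrt{d\, |\eta_{i,k}(t)|^2}$ ($\eta_{i,k}(t)$, $k = 1, \ldots, d$, is the $k$-th element of $\eta_i(t) \in \mathbb{R}^d$) and each $\eta_{i,k}(t)$, $k = 1, \ldots, d$, is drawn from $\text{Lap}(\sigma(t))$, we deduce that $\mathbb{E}[|\eta_{i,k}(t)|^2] = 2\sigma^2(t)$. In light of $\Delta(t)/\sigma(t) = \varepsilon$, it yields

$$\mathbb{E}\left[\sum_{i=1}^{n} \|\eta_i(t)\|\right] = \mathbb{E}\left[n \sqrt{d\, |\eta_{i,k}(t)|^2}\right] \le \frac{n \sqrt{2d}\, \Delta(t)}{\varepsilon} \le \frac{2\sqrt{2}\, n d \hat{p}\, \alpha(t+1)}{\varepsilon}. \tag{9.33}$$

Thus, by combining (9.31)–(9.33), we acquire that

$$\begin{aligned}
\mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{n} L_i \|z_i(t) - \bar{x}(t-1)\|\right] \le{}& \frac{2CL}{\delta(1-\lambda)} \sum_{j=1}^{n} \|x_j(0)\|_1 + \frac{2 C n^{3/2} \hat{p} L}{\delta} \sum_{t=1}^{T} \sum_{k=1}^{t-1} \lambda^{t-k-1} \alpha(k) \\
&+ \frac{6\sqrt{2} C n^{3/2} d \hat{p} L}{\delta \varepsilon} \sum_{t=1}^{T} \sum_{k=0}^{t-1} \lambda^{t-k-1} \alpha(k+1).
\end{aligned} \tag{9.34}$$

In addition, from (9.33), we also obtain

$$\sum_{i=1}^{n} \mathbb{E}[\|\eta_i(t)\|^2] \le \frac{8 n d \hat{p}^2 \alpha^2(t+1)}{\varepsilon^2}. \tag{9.35}$$

Combining (9.30)–(9.35) and Lemma 9.12 and rearranging the terms, the result in Lemma 9.14 follows immediately.

Finally, the next lemma presents the connection between the pseudo-individual-regret of node $j \in \mathcal{V}$ and the pseudo-network-regret.

Lemma 9.15 Suppose that Assumptions 9.1–9.3 hold. Consider that DP-DSSP (9.6) generates the sequence $\{z_1(t), \ldots, z_n(t)\}_{t=1}^{T}$. Then, it holds that, for each $j \in \mathcal{V}$ and $T \in \mathbb{Z}_{>0}$,

$$\begin{aligned}
\bar{R}_j(T) - \frac{4CL}{\delta(1-\lambda)} \sum_{j=1}^{n} \|x_j(0)\|_1 \le{}& \bar{R}(T) + \frac{12\sqrt{2} C n^{3/2} d \hat{p} L}{\delta \varepsilon} \sum_{t=1}^{T} \sum_{k=0}^{t-1} \lambda^{t-k-1} \alpha(k+1) \\
&+ \frac{4 C n^{3/2} \hat{p} L}{\delta} \sum_{t=1}^{T} \sum_{k=1}^{t-1} \lambda^{t-k-1} \alpha(k).
\end{aligned}$$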
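Proof First, notice that

$$\sum_{t=1}^{T} \sum_{i=1}^{n} f_i^t(z_j(t)) - \sum_{t=1}^{T} \sum_{i=1}^{n} f_i^t(z_i(t)) \le \sum_{t=1}^{T} \sum_{i=1}^{n} L_i \|z_j(t) - z_i(t)\|, \tag{9.36}$$

which is directly derived from the convexity of $f_i^t$ and Assumption 9.1. Moreover, the triangle inequality gives

$$\begin{aligned}
\|z_j(t+1) - z_i(t+1)\|^2 \le{}& \|z_j(t+1) - \bar{x}(t)\|^2 + \|z_i(t+1) - \bar{x}(t)\|^2 \\
&+ 2 \|z_j(t+1) - \bar{x}(t)\| \|z_i(t+1) - \bar{x}(t)\|.
\end{aligned} \tag{9.37}$$

Thus, we further get from (9.37) and Lemma 9.12 that

$$\begin{aligned}
\mathbb{E}[\|z_j(t+1) - z_i(t+1)\|] \le{}& \frac{4C}{\delta} \lambda^t \sum_{j=1}^{n} \|x_j(0)\|_1 + \frac{6C\sqrt{n}}{\delta} \sum_{k=0}^{t} \lambda^{t-k} \sum_{j=1}^{n} \mathbb{E}[\|\eta_j(k)\|] \\
&+ \frac{4 C n^{3/2} \hat{p}}{\delta} \sum_{k=1}^{t} \lambda^{t-k} \alpha(k).
\end{aligned} \tag{9.38}$$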
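Since $\eta_i(t) \sim \text{Lap}(\sigma(t))$ with $\mathbb{E}[\|\eta_i(t)\|^2] = 2\sigma^2(t)$, following from (9.33) and (9.38), it yields that

$$\begin{aligned}
\mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{n} L_i \|z_j(t) - z_i(t)\|\right] \le{}& \frac{4CL}{\delta(1-\lambda)} \sum_{j=1}^{n} \|x_j(0)\|_1 + \frac{4 C n^{3/2} \hat{p} L}{\delta} \sum_{t=1}^{T} \sum_{k=1}^{t-1} \lambda^{t-k-1} \alpha(k) \\
&+ \frac{12\sqrt{2} C n^{3/2} d \hat{p} L}{\delta \varepsilon} \sum_{t=1}^{T} \sum_{k=0}^{t-1} \lambda^{t-k-1} \alpha(k+1).
\end{aligned}$$

Hence, by the definitions of the pseudo-individual-regret and the pseudo-network-regret, the result in Lemma 9.15 is established.

We now present the proof of Theorem 9.6 based on the lemmas established above.

Proof of Theorem 9.6 Let $\kappa_i^t = \mu_i$, $\forall i \in \mathcal{V}$, $t \in \{1, \ldots, T\}$. Then, the local cost function $f_i^t$, $i \in \mathcal{V}$, is strongly convex with parameter $\mu_i$ for all $t \in \{1, \ldots, T\}$. Recalling that $\alpha(t) = 1/(\hat{\mu} t)$ for any $\hat{\mu} \in (0, \bar{\mu}]$ and $\bar{\mu} = (1/2n) \sum_{i=1}^{n} \mu_i$, we obtain

$$\Delta_1(t) = \hat{\mu} t - \hat{\mu}(t-1) - \bar{\mu} \le 0. \tag{9.39}$$

Since $\sum_{k=1}^{T} (1/k) = 1 + \sum_{k=2}^{T} (1/k) \le 1 + \int_{1}^{T} (1/k)\, dk = 1 + \ln T$, it can be deduced that

$$\sum_{t=1}^{T} \sum_{k=0}^{t-1} \lambda^{t-k-1} \alpha(k+1) \le \frac{1}{\hat{\mu}(1-\lambda)} (1 + \ln T), \tag{9.40}$$

and further

$$\sum_{t=1}^{T} \sum_{k=1}^{t-1} \lambda^{t-k-1} \alpha(k) \le \frac{1}{\lambda \hat{\mu}(1-\lambda)} (1 + \ln T). \tag{9.41}$$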
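Let $\theta(t) = t(t+1)/2$ and $\hat{z}_i(t) = \sum_{k=1}^{t} k z_i(k) / \theta(t)$ for all $t \in \{1, \ldots, T\}$. Since $t/\theta(T) \le 1$ and $\|\cdot\|^2$ is convex, it then yields that

$$\begin{aligned}
\mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{n} \mu_i \|z_i(t) - z^*\|^2\right] &\ge \mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{n} \frac{t}{\theta(T)} \mu_i \|z_i(t) - z^*\|^2\right] \\
&\ge \mathbb{E}\left[\sum_{i=1}^{n} \mu_i \|\hat{z}_i(T) - z^*\|^2\right].
\end{aligned} \tag{9.42}$$

Substituting (9.39)–(9.42) into Lemma 9.14, by computation, one has

$$\bar{R}(T) + \mathbb{E}\left[\sum_{j=1}^{n} \mu_j \|\hat{z}_j(T) - z^*\|^2\right] \le \Xi_3 + \Xi_4 (1 + \ln T), \tag{9.43}$$

where

$$\Xi_3 = n \mathbb{E}[\hat{\mu} \|\bar{x}(0) - z^*\|^2] + \frac{10CL}{\delta(1-\lambda)} \sum_{j=1}^{n} \|x_j(0)\|_1,$$

$$\Xi_4 = \frac{10 C n^{3/2} \hat{p} L}{\delta \lambda \hat{\mu} (1-\lambda)} + \frac{30\sqrt{2} C n^{3/2} d \hat{p} L}{\delta \varepsilon \hat{\mu} (1-\lambda)} + \frac{16 n d \hat{p}^2}{\varepsilon^2} + 2\varpi^2.$$

Thus, the result in Theorem 9.6 follows by combining (9.43) with Lemma 9.15. This completes the proof of Theorem 9.6.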
9.4.3 Square-Root Regret

In the previous subsection, the logarithmic regret of DP-DSSP (9.6) was proved under strong convexity of the cost functions. If the cost functions are only generally convex, it can be established that DP-DSSP (9.6) achieves a square-root regret.

Proof of Theorem 9.7 Since a constant learning rate of the form $1/\sqrt{T}$ relies on $T$, DTS, which does not depend on $T$, is employed to generate the execution process.
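By selecting $\alpha(t) = \zeta$, $\forall t \in \{1, \ldots, T\}$, it holds from Lemma 9.14 that

$$\begin{aligned}
&\mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{n} \left(f_i^t(z_i(t)) - f_i^t(z^*)\right)\right] - \frac{n}{\zeta} \mathbb{E}[\|\bar{x}(0) - z^*\|^2] \\
&\le \frac{30\sqrt{2} C n^{3/2} d \hat{p} L}{\delta \varepsilon (1-\lambda)} T\zeta + \frac{10 C n^{3/2} \hat{p} L}{\delta \lambda (1-\lambda)} T\zeta + \left(\frac{16 n d \hat{p}^2}{\varepsilon^2} + 2\varpi^2\right) T\zeta \\
&\quad + \frac{10CL}{\delta(1-\lambda)} \sum_{j=1}^{n} \|x_j(0)\|_1,
\end{aligned} \tag{9.44}$$

where we have utilized the fact that $(1/\zeta) - (1/\zeta) - (1/2n) \sum_{i=1}^{n} \kappa_i^t \le 0$. Then, letting $\zeta = 1/\sqrt{T}$ in (9.44) and using $\sqrt{T} \ge 1$, one achieves

$$\mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{n} \left(f_i^t(z_i(t)) - f_i^t(z^*)\right)\right] \le \Gamma_2 \sqrt{T}, \tag{9.45}$$

where

$$\Gamma_2 = n \mathbb{E}[\|\bar{x}(0) - z^*\|^2] + \frac{10CL}{\delta(1-\lambda)} \sum_{j=1}^{n} \|x_j(0)\|_1 + 2\varpi^2 + \frac{10 C n^{3/2} \hat{p} L}{\delta \lambda (1-\lambda)} + \frac{30\sqrt{2} C n^{3/2} d \hat{p} L}{\delta \varepsilon (1-\lambda)} + \frac{16 n d \hat{p}^2}{\varepsilon^2}.$$

By DTS, for $k = 0, 1, 2, \ldots, \lfloor \log_2 T \rfloor$, DP-DSSP (9.6) is performed in periods of $T = 2^k$ rounds, $t = 2^k, \ldots, 2^{k+1} - 1$ (in the last period $k = \lfloor \log_2 T \rfloor$, DP-DSSP (9.6) only updates between $t = 2^{\lfloor \log_2 T \rfloor}$ and $t = T$). In addition, the bound of (9.45) on each period is at most $\Gamma_2 \sqrt{2^k}$. Thus, the total bound is

$$\sum_{k=0}^{\lfloor \log_2 T \rfloor} \Gamma_2 \sqrt{2^k} = \Gamma_2 \frac{1 - \sqrt{2}^{\lfloor \log_2 T \rfloor + 1}}{1 - \sqrt{2}} \le \frac{\sqrt{2}}{\sqrt{2} - 1} \Gamma_2 \sqrt{T}.$$

Hence, combining (9.45) with Lemma 9.15, the result in Theorem 9.7 is obtained. The proof is complete.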
9.4.4 Robustness to Communication Delays

In this section, the robustness of DP-DSSP (9.6) in the presence of communication delays is discussed. Assume that nodes experience arbitrary but uniformly bounded communication delays when receiving information from in-neighbors. Specifically, at time $t$, let $\varsigma_{ij}(t)$ denote an arbitrary, a priori unknown delay induced by the communication link $(j, i)$. In the following, we formally state the boundedness assumption on the delay $\varsigma_{ij}(t)$.

Assumption 9.4 (Bounded Delays [42]) For all $t \in \mathbb{Z}_{>0}$, the communication delays $\varsigma_{ij}(t)$, $\forall i, j \in \mathcal{V}$, are uniformly bounded. Namely, there is a constant $\hat{\varsigma} \in \mathbb{Z}_{>0}$ such that $0 \le \varsigma_{ij}(t) \le \hat{\varsigma}$, $\forall i, j \in \mathcal{V}$, $t \in \mathbb{Z}_{>0}$. Furthermore, each node possesses its individual estimate without delays, i.e., $\varsigma_{ii}(t) = 0$, $\forall i \in \mathcal{V}$, $t \in \mathbb{Z}_{>0}$.

Notice that for each node $i \in \mathcal{V}$, the updates of both $y_i(t)$ and $s_i(t)$ in DP-DSSP (9.6) depend on the information acquired from in-neighbors, while the estimates $h_i(t)$, $z_i(t)$, and $x_i(t)$ are computed locally without further interactions. Therefore, at each time $t \in \{1, \ldots, T\}$, when undergoing communication delays, DP-DSSP (9.6) is executed as follows:

$$\begin{cases}
h_i(t) = x_i(t) + \eta_i(t), \\
y_i(t+1) = \dfrac{h_i(t)}{d_i^{\text{out}}(t) + 1} + \sum_{j \in \mathcal{N}_i^{\text{in}}(t)} \dfrac{h_j(t - \varsigma_{ij}(t))}{d_j^{\text{out}}(t) + 1}, \\
s_i(t+1) = \dfrac{s_i(t)}{d_i^{\text{out}}(t) + 1} + \sum_{j \in \mathcal{N}_i^{\text{in}}(t)} \dfrac{s_j(t - \varsigma_{ij}(t))}{d_j^{\text{out}}(t) + 1}, \\
z_i(t+1) = y_i(t+1) / s_i(t+1), \\
x_i(t+1) = y_i(t+1) - \alpha(t+1) g_i(t+1).
\end{cases} \tag{9.46}$$

The following theorem shows that algorithm (9.46) not only preserves differential privacy but also achieves the expected regrets when the interactions among nodes on time-varying unbalanced directed networks are subject to arbitrary but uniformly bounded communication delays.

Theorem 9.16 Suppose that Assumptions 9.1–9.4 hold, and $Z^*$ is non-empty. Consider that algorithm (9.46) updates the sequence $\{z_1(t), \ldots, z_n(t)\}_{t=1}^{T}$. Then, the following results are derived:

(a) Letting $\eta_i(t)$, $i \in \mathcal{V}$, satisfy Theorem 9.5, algorithm (9.46) preserves differential privacy.
(b) Letting $\alpha(t)$ satisfy Theorem 9.6, algorithm (9.46) achieves a logarithmic regret of order $O(\ln T)$ for strongly convex cost functions.
(c) Letting $\alpha(t)$ satisfy Theorem 9.7, algorithm (9.46) achieves a square-root regret of order $O(\sqrt{T})$ for generally convex cost functions.
Proof Note that algorithm (9.46) is obtained from DP-DSSP (9.6) through the model of communication delays established above. The proof of Theorem 9.16 can be divided into two steps [42]. We first convert algorithm (9.46) (with communication delays) to a delay-free algorithm by employing an augmented unbalanced directed network representation. Then, the performance of the obtained delay-free algorithm can be analyzed similarly to the above subsections.

For each node $i$ in the network, we introduce $\hat{\varsigma}$ imaginary nodes $i_{(1)}, i_{(2)}, \ldots, i_{(\hat{\varsigma})}$. At each time $t$, imaginary node $i_{(k)}$ holds the message that will reach node $i$ after $k$ time steps. According to Assumption 9.4, the total number of nodes in the augmented unbalanced directed network is $n(\hat{\varsigma} + 1)$. The first $n$ imaginary nodes model a delay of one time step for the $n$ nodes of the original unbalanced directed network, the next $n$ imaginary nodes model a delay of two time steps, and so on. For the interactions among these nodes on the augmented unbalanced directed network at each time $t$, we suppose that for each edge $(j, i)$ on the original unbalanced directed network, there always exist edges $(j, i_{(1)}), (j, i_{(2)}), \ldots, (j, i_{(\hat{\varsigma})})$ and edges $(i_{(1)}, i), (i_{(2)}, i_{(1)}), \ldots, (i_{(\hat{\varsigma})}, i_{(\hat{\varsigma}-1)})$ on the augmented unbalanced directed network.

Let $x_{i,(r)}$ be the estimate of imaginary node $i_{(r)}$, where $r \in \{1, \ldots, \hat{\varsigma}\}$. For convenience of analysis, we suppose $d = 1$ in this subsection. Define $\tilde{x}(t) = [x(t)^T, x_{(1)}(t)^T, \ldots, x_{(\hat{\varsigma})}(t)^T]^T \in \mathbb{R}^{n(\hat{\varsigma}+1)}$, where $x_{(r)}(t) = [x_{1,(r)}, \ldots, x_{n,(r)}]^T \in \mathbb{R}^n$, $r \in \{1, \ldots, \hat{\varsigma}\}$, is the column stack vector of $x_{i,(r)}$ for $i \in \mathcal{V}$. The notations $\tilde{h}(t)$, $\tilde{\eta}(t)$, $\tilde{y}(t)$, and $\tilde{s}(t)$ are defined similarly. Hence, at each time $t \in \{1, \ldots, T\}$, algorithm (9.46) can be written in the delay-free compact form as follows:

$$\begin{cases}
\tilde{h}(t) = \tilde{x}(t) + \tilde{\eta}(t), \\
\tilde{y}(t+1) = \tilde{W}(t) \tilde{h}(t), \\
\tilde{s}(t+1) = \tilde{W}(t) \tilde{s}(t), \\
z_i(t+1) = y_i(t+1) / s_i(t+1), \\
\tilde{x}(t+1) = \tilde{y}(t+1) - \alpha(t+1) [g(t+1)^T, 0^T]^T,
\end{cases} \tag{9.47}$$

where $g(t) = [g_1(t), \ldots, g_n(t)]^T$, $\tilde{\eta}(t) = [\eta(t)^T, 0^T]^T \in \mathbb{R}^{n(\hat{\varsigma}+1)}$, and $\eta(t) = [\eta_1(t), \ldots, \eta_n(t)]^T$. For the imaginary nodes, we pick $x_{(r)}(0) = 0$, $\eta_{(r)}(0) = 0$, and $s_{(r)}(0) = 0$ for all $r \in \{1, \ldots, \hat{\varsigma}\}$. Notice that the update of $z_i(t)$ in (9.47) is identical to the update of $z_i(t)$ in (9.6), which guarantees that $z_i(t)$ is a finite quantity even though $s_{(r)}(0) = 0$. In addition, the weighted matrix $\tilde{W}(t)$ associated with the augmented unbalanced directed network is given by

$$\tilde{W}(t) = \begin{bmatrix}
\tilde{W}_{(0)}(t) & I_{n \times n} & 0 & \cdots & 0 \\
\tilde{W}_{(1)}(t) & 0 & I_{n \times n} & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\tilde{W}_{(\hat{\varsigma}-1)}(t) & 0 & 0 & \cdots & I_{n \times n} \\
\tilde{W}_{(\hat{\varsigma})}(t) & 0 & 0 & \cdots & 0
\end{bmatrix},$$
where the non-negative matrices $\tilde{W}_{(0)}(t), \tilde{W}_{(1)}(t), \ldots, \tilde{W}_{(\hat{\varsigma})}(t)$ (suitably defined) depend on the communication delays encountered by the information exchange at time $t$. Specifically, the weight matrix $\tilde{W}_{(r)}(t) = [\tilde{w}_{ij,(r)}(t)] \in \mathbb{R}^{n \times n}$, $r \in \{1, \ldots, \hat{\varsigma}\}$, obeys the rule

\[
\tilde{w}_{ij,(r)}(t) =
\begin{cases}
w_{ij}(t), & \text{if } \varsigma_{ij}(t) = r,\ (j, i) \in \mathcal{E}(t), \\
0, & \text{otherwise},
\end{cases}
\]

where $w_{ij}(t)$ is introduced in (9.7). According to the definition of $\tilde{w}_{ij,(r)}(t)$, for each edge $(j, i) \in \mathcal{E}(t)$ at each time $t$, exactly one of $\tilde{w}_{ij,(1)}(t), \ldots, \tilde{w}_{ij,(\hat{\varsigma})}(t)$ is positive and equal to $w_{ij}(t)$, while the others are zero. Therefore, the transformation from (9.46) (with communication delays) to (9.47) (without communication delays) is achieved. From the definitions of $\tilde{W}(t)$ and $W(t)$, we can deduce that under Assumptions 9.3 and 9.4, the matrices $\tilde{W}(t)$ and $W(t)$ exhibit the same properties at each time $t$. Thus, the remainder of the proof of Theorem 9.16 follows by the same techniques as in the preceding subsection.
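The block structure of the augmented weight matrix can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the 3-node network, the weights (taken as 1/out-degree, which is one common column-stochastic choice and not necessarily the rule in (9.7)), the delay values, and the function name `augmented_weight_matrix` are all hypothetical. It places each original weight in the block row matching its delay, adds the identity blocks that forward messages from imaginary node $i^{(r+1)}$ to $i^{(r)}$, and leaves every column sum equal to one.

```python
def augmented_weight_matrix(W, delays, max_delay):
    """Build the n(max_delay+1) x n(max_delay+1) delay-free block matrix.

    W[i][j]      : weight on edge (j, i) of the original network (0 if no edge)
    delays[i][j] : delay in {0, ..., max_delay} on edge (j, i) at this time
    """
    n = len(W)
    size = n * (max_delay + 1)
    W_aug = [[0.0] * size for _ in range(size)]
    # First block column: the weight w_ij goes into block W_(r) with r = delay.
    for i in range(n):
        for j in range(n):
            if W[i][j] > 0.0:
                r = delays[i][j]
                W_aug[r * n + i][j] = W[i][j]
    # Identity blocks: imaginary node i^(r+1) passes its content to i^(r)
    # (and i^(1) passes it to the real node i).
    for r in range(max_delay):
        for i in range(n):
            W_aug[r * n + i][(r + 1) * n + i] = 1.0
    return W_aug


# Hypothetical unbalanced 3-node digraph; column j holds node j's outgoing
# weights, each equal to 1 / (out-degree of j, self-loop included).
W = [[1 / 3, 0.0, 1 / 2],
     [1 / 3, 1 / 2, 0.0],
     [1 / 3, 1 / 2, 1 / 2]]
# Hypothetical delays; a node's own value is assumed to be never delayed.
delays = [[0, 0, 2],
          [1, 0, 0],
          [2, 1, 0]]
W_aug = augmented_weight_matrix(W, delays, max_delay=2)

column_sums = [sum(W_aug[i][j] for i in range(len(W_aug))) for j in range(len(W_aug))]
print(column_sums)
```

Because the column sums are preserved, the total mass of the push-sum weights $\tilde{s}(t)$ is conserved by the augmented update, in the same way as for the original matrix.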
9.5 Numerical Examples

In this section, simulation experiments are presented to assess the effectiveness and universality of DP-DSSP. Inspired by Akbari et al. [15] and Hosseini et al. [21], we investigate the distributed collaborative localization problem. Each node in the network is employed to detect a vector $z \in \mathbb{R}^d$. At time $t \in \{1, \ldots, T\}$, each node $i \in \mathcal{V}$ acquires an uncertain and time-varying detection vector $p_i(t) \in \mathbb{R}^{d_i}$ (noise, such as jamming, may exist) and is assumed to follow a linear model $p_{i,z} = P_i z$, where $P_i \in \mathbb{R}^{d_i \times d}$ and $P_i z = 0$ if and only if $z = 0$. The goal is to estimate the vector $\hat{z} \in \mathbb{R}^d$ that minimizes the global cost function

\[
f(\hat{z}) = \sum_{t=1}^{T} \sum_{i=1}^{n} \frac{1}{2} \| q_i(t) - P_i \hat{z} \|^2,
\]
where the detection vector is represented as $p_i(t) = P_i z + \beta_i(t)$ and $\beta_i(t)$ is white noise. Notice that the characteristics of the noise are not known in advance and quite a few nodes may not work well in some situations. Therefore, it is necessary to use a distributed online algorithm to estimate the best choice of $z$.

In the simulation, we consider a large-scale network with $n = 100$ nodes and dimension $d = 1$. A total of $n$ nodes and $(n - 1)^2/4$ directed edges are randomly assigned in the unbalanced directed network $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$. Suppose that the randomly selected unbalanced directed network $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ is strongly connected. Then, by uniformly and randomly sampling $\mathcal{E}(t)$ from $\mathcal{E}$ at a rate of 80%, we generate the time-varying unbalanced directed networks $\mathcal{G}(t) = \{\mathcal{V}, \mathcal{E}(t)\}$, $t \in \mathbb{Z}_{\geq 0}$. Additionally, at time $t$, the local cost function $f_i^t : \mathbb{R} \to \mathbb{R}$ associated with node $i \in \mathcal{V}$ is given by $f_i^t(\hat{z}) = (1/2)(q_i(t) - P_i \hat{z})^2$, where $q_i(t) = a_i(t) z + b_i(t)$ and $P_i \in \mathbb{R}$. The cost coefficients $a_i(t)$ and $b_i(t)$ of each node $i$ are drawn at time $t$ from uniform distributions on $[a_{\min}, a_{\max}]$ and $[b_{\min}, b_{\max}]$, respectively. In this simulation, we employ DP-DSSP to estimate $z$ (for clarity, only a few randomly selected nodes are displayed in the following scenarios) and study different scenarios to validate the performance of DP-DSSP.

(1) Without Communication Delays In this scenario, suppose that there are no communication delays in the transmission of information between nodes on the network. At time $t \in \{1, \ldots, T\}$, let the coefficients $a_i(t) \in [0, 2]$ and $b_i(t) \in [-0.5 + ((i - 50)/100), 0.5 + ((i - 50)/100)]$ be drawn uniformly at random for each node $i$, and let the learning rate be $\alpha(t) = 5/t$. In addition, assume that $P_i \in (0, 1]$ for each node $i$. Based on the communication network designed above, preliminary results are displayed in Fig. 9.1. The estimates of five randomly displayed nodes over 100 time iterations are shown in Fig. 9.1a. When running DP-DSSP, the displayed nodes' estimates exhibit only small fluctuations around the optimal solution (although the fluctuations are larger than those of general online algorithms). Figure 9.1b depicts the maximum and minimum pseudo-individual average regrets over 100 time iterations, which exhibit the sublinear property.

(2) With Communication Delays In this scenario, we verify the robustness of DP-DSSP to communication delays. Specifically, let $\hat{\varsigma} = 4$ be the upper bound of the time-varying delays. At each time $t$, the communication delay imposed on each communication link is selected uniformly at random from $\{0, 1, \ldots, 4\}$. The other parameters are the same as in the first scenario. The simulation results are displayed in Fig. 9.2.
It is clearly observed that DP-DSSP still achieves the desired results when the communication links undergo communication delays.

Fig. 9.1 (a) Estimations of five nodes without communication delays (node's estimate (z) versus time). (b) The maximum and minimum pseudo-individual average regrets ($R_j(T)/T$) without communication delays (versus time)

(3) DTS and Communication Delays In this scenario ($T = 100$), we suppose that the learning rates are selected by DTS and that there exist communication delays (the upper bound of the time-varying delays is $\hat{\varsigma} = 4$) in the transmission of information between nodes on the network. Comparisons of DP-DSSP with the method in [37] are shown in Figs. 9.3 and 9.4. On the one hand, the x-axes in Figs. 9.3 and 9.4a are expressed in terms of $k$ because DTS is employed in DP-DSSP.³ On the other hand, the x-axis in Fig. 9.4b is expressed in terms of $t$ to better reflect the main contributions of this chapter. Figure 9.3 clearly shows that the displayed nodes' estimates computed by DP-DSSP undergo small fluctuations around the optimal solution, while the method in [37] does not perform well over time-varying unbalanced directed networks. In Fig. 9.4a, we simulate each pseudo-individual average subregret

\[
R_{j,av}^{DTS}(k) = \frac{E\left[\sum_{t=2^k}^{2^{k+1}} \sum_{i=1}^{n} f_i^t(z_j(t))\right] - E\left[\sum_{t=2^k}^{2^{k+1}} \sum_{i=1}^{n} f_i^t(z^*)\right]}{2^k}
\]

between $k$ and $k + 1$ for $k = 0, 1, 2, \ldots, \lfloor \log_2 T \rfloor - 1$. Under the same settings, Fig. 9.4a indicates that the difference between the maximum and the minimum pseudo-individual average subregrets of the method in [37] is larger than that of DP-DSSP. This is because the method in [37] is not applicable to time-varying unbalanced directed networks. Moreover, Fig. 9.4b implies that DP-DSSP achieves small expected regrets and exhibits the sublinear property, which is superior to the method in [37].

³ Here, it is worth highlighting that DTS actually divides the original time series $t \in \{1, \ldots, T\}$ into $\lfloor \log_2 T \rfloor + 1$ subseries and DP-DSSP needs to update $2^k$ rounds in each subseries (except the last subseries). Since DP-DSSP cannot update $2^k$ rounds in the last subseries, we only plot $k = 0, 1, \ldots, \lfloor \log_2 T \rfloor$ in Figs. 9.3 and 9.4a.
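The way DTS partitions the time horizon can be sketched in a few lines. This is a minimal illustration of the subseries structure only, assuming that subseries $k$ starts at $t = 2^k$ and contains $2^k$ rounds, with the last subseries truncated at $T$; the function name `dts_subseries` is hypothetical, and the learning-rate rule used inside each subseries is not reproduced here.

```python
def dts_subseries(T):
    """Split rounds 1..T into subseries k = 0, 1, ..., floor(log2 T).

    Subseries k covers rounds 2**k .. 2**(k+1) - 1 (that is, 2**k rounds),
    except the last one, which is cut off at T.
    """
    num_subseries = T.bit_length()  # equals floor(log2 T) + 1 for T >= 1
    subseries = []
    for k in range(num_subseries):
        start = 2 ** k
        end = min(2 ** (k + 1) - 1, T)
        subseries.append((k, start, end))
    return subseries


for k, start, end in dts_subseries(100):
    print(f"k={k}: rounds {start}..{end} ({end - start + 1} rounds)")
```

For $T = 100$ this yields seven subseries, and the last one (starting at round 64) has fewer than $2^6$ rounds, which is the truncation effect described in the footnote.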
Fig. 9.2 (a) Estimations of five nodes with communication delays (node's estimate (z) versus time). (b) The maximum and minimum pseudo-individual average regrets ($R_j(T)/T$) with communication delays (versus time)
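To make the delay-free setting of the first scenario more tangible, the following sketch runs a subgradient-push iteration with additive Laplace noise on a small random digraph. It is a simplified illustration and not the authors' simulation code. The following are all assumptions made here: the network size `n = 10`, the value `z_true`, the range of `P_i`, the weights 1/out-degree, keeping each edge independently with probability 0.8, and the decaying noise scale `noise_scale / t` (the noise condition of Theorem 9.5 is not reproduced).

```python
import math
import random


def laplace(rng, scale):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def simulate(n=10, T=100, z_true=0.3, noise_scale=0.01, keep_prob=0.8, seed=0):
    rng = random.Random(seed)
    # Base digraph: a directed ring (ensures strong connectivity) plus random edges.
    base = {(j, (j + 1) % n) for j in range(n)}
    target = max(n, (n - 1) ** 2 // 4)
    while len(base) < target:
        j, i = rng.randrange(n), rng.randrange(n)
        if i != j:
            base.add((j, i))
    base = sorted(base)
    P = [rng.uniform(0.5, 1.0) for _ in range(n)]  # assumed subset of (0, 1]
    x = [0.0] * n
    s = [1.0] * n
    history = []
    for t in range(1, T + 1):
        # Sample E(t); self-loops are always kept so that every s_i stays positive.
        edges = [e for e in base if rng.random() < keep_prob]
        edges += [(j, j) for j in range(n)]
        out_deg = [0] * n
        for j, _ in edges:
            out_deg[j] += 1
        # Perturb the state before sending: h = x + eta.
        h = [x[i] + laplace(rng, noise_scale / t) for i in range(n)]
        y = [0.0] * n
        s_next = [0.0] * n
        for j, i in edges:  # node j pushes equal shares to its out-neighbors
            y[i] += h[j] / out_deg[j]
            s_next[i] += s[j] / out_deg[j]
        s = s_next
        z = [y[i] / s[i] for i in range(n)]
        alpha = 5.0 / t
        for i in range(n):
            a = rng.uniform(0.0, 2.0)
            shift = (i - n / 2) / n  # scaled analogue of (i - 50)/100
            b = rng.uniform(-0.5 + shift, 0.5 + shift)
            q = a * z_true + b
            grad = -P[i] * (q - P[i] * z[i])  # gradient of (1/2)(q - P_i z)^2 at z_i
            x[i] = y[i] - alpha * grad
        history.append(z)
    return history, s


history, s_final = simulate()
print("spread of final estimates:", max(history[-1]) - min(history[-1]))
print("sum of push-sum weights:", math.fsum(s_final))
```

The sum of the push-sum weights stays equal to `n` because each node splits its weight into equal shares over its outgoing edges, which is the column-stochasticity property the analysis relies on.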
(4) Differential Privacy Properties [47] In this scenario, we verify the differential privacy properties of DP-DSSP with communication delays. Assume that the cost functions satisfy $f_i^t = f_i'^t$ for all $i \in \{2, \ldots, 100\}$, so that the two sets of cost functions are identical except for node 1's cost function, i.e., $f_1^t \neq f_1'^t$. Figure 9.5 (corresponding to the second scenario) plots the outputs $x_i(t)$ and $x_i'(t)$ of three randomly displayed nodes (always including node 1) for DP-DSSP under the adjacent relations $f_i^t$ and $f_i'^t$, respectively. Figure 9.5 shows that the two outputs nearly coincide, so they are almost indistinguishable to a malicious node.

Remark 9.17 The fourth scenario validates the differential privacy properties of DP-DSSP in the same way as many existing works, such as [4, 47]. DP-DSSP may be further applied to problems related to spy attacks on the network and dataset recovery, which are key issues in the modern era of big data. However, we note that the focus of this chapter is not spy attacks on the network or dataset recovery but the design of DP-DSSP to solve the differentially private distributed online optimization problem on time-varying unbalanced directed networks. In the future, further research on this kind of problem will be conducted.

Fig. 9.3 Estimations of five nodes for DTS with communication delays, plotted against $k$. (a) Node's estimate (z) (DP-DSSP). (b) Node's estimate (z) (the method in [37])
Fig. 9.4 (a) The pseudo-individual average subregrets ($R_{j,av}^{DTS}(k)$) between $k$ and $k + 1$ for DTS with communication delays. (b) The pseudo-individual average regrets ($R_j(T)/T$) for DTS with communication delays
Fig. 9.5 The outputs $x_i(t)$ and $x_i'(t)$ (shown for nodes 1, 41, and 81 versus time) related to the adjacent relations $f_i^t$ and $f_i'^t$

9.6 Conclusion

This chapter has investigated the differentially private distributed online convex optimization problem on time-varying unbalanced directed networks that are assumed to be uniformly strongly connected. To solve this optimization problem in a distributed manner, we have developed a new differentially private distributed stochastic subgradient-push algorithm, named DP-DSSP. Theoretical analysis has shown that DP-DSSP (with an appropriate learning rate) preserves differential privacy and achieves expected sublinear regrets for diverse cost functions. The trade-off between the desired level of privacy preservation and the accuracy of DP-DSSP has been revealed. Furthermore, the robustness of DP-DSSP to communication delays has also been explored. Finally, the performance of DP-DSSP has been demonstrated via simulation experiments. Our work is still open, and more in-depth research is needed to resolve the distributed online constrained problem. As future work, we will consider extending this work in a number of directions, namely spy attacks on the network, dataset recovery, complex constraints, non-convex and/or non-smooth functions, and networks with random link failures.
References

1. S. Wang, C. Li, Distributed robust optimization in networked system. IEEE Trans. Cybern. 47(8), 2321–2333 (2017)
2. H. Li, Q. Lü, X. Liao, T. Huang, Accelerated convergence algorithm for distributed constrained optimization under time-varying general directed graphs. IEEE Trans. Syst. Man Cybern. Syst. 50(7), 2612–2622 (2020)
3. X. Shi, J. Cao, W. Huang, Distributed parametric consensus optimization with an application to model predictive consensus problem. IEEE Trans. Cybern. 48(7), 2024–2035 (2018)
4. M. Hale, P. Barooah, K. Parker, K. Yazdani, Differentially private smart metering: Implementation, analytics, and billing, in UrbSys'19: Proceedings of the 1st ACM International Workshop on Urban Building Energy Sensing, Controls, Big Data Analysis, and Visualization (2019), pp. 33–42
5. Z. Deng, X. Nian, C. Hu, Distributed algorithm design for nonsmooth resource allocation problems. IEEE Trans. Cybern. 50(7), 3208–3217 (2020)
6. D. Yuan, Y. Hong, D. Ho, G. Jiang, Optimal distributed stochastic mirror descent for strongly convex optimization. Automatica 90, 196–203 (2018)
7. L. Ma, W. Bian, A novel multiagent neurodynamic approach to constrained distributed convex optimization. IEEE Trans. Cybern. 51(3), 1322–1333 (2021)
8. C. Shi, G. Yang, Augmented Lagrange algorithms for distributed optimization over multi-agent networks via edge-based method. Automatica 94, 55–62 (2018)
9. X. Wang, Y. Hong, P. Yi, H. Ji, Y. Kang, Distributed optimization design of continuous-time multiagent systems with unknown-frequency disturbances. IEEE Trans. Cybern. 47(8), 2058–2066 (2017)
10. B. Ning, L. Han, Z. Zuo, Distributed optimization of multiagent systems with preserved network connectivity. IEEE Trans. Cybern. 49(11), 3980–3990 (2019)
11. S. Yang, Q. Liu, J. Wang, A multi-agent system with a proportional-integral protocol for distributed constrained optimization. IEEE Trans. Autom. Control 62(7), 3461–3467 (2017)
12. Q. Lü, X. Liao, H. Li, T. Huang, Achieving acceleration for distributed economic dispatch in smart grids over directed networks. IEEE Trans. Netw. Sci. Eng. 7(3), 1988–1999 (2020)
13. D. Yuan, D. Ho, G. Jiang, An adaptive primal-dual subgradient algorithm for online distributed constrained optimization. IEEE Trans. Cybern. 48(11), 3045–3055 (2018)
14. D. Nunez, J. Cortes, Distributed online convex optimization over jointly connected digraphs. IEEE Trans. Netw. Sci. Eng. 1(1), 23–37 (2014)
15. M. Akbari, B. Gharesifard, T. Linder, Distributed online convex optimization on time-varying directed graphs. IEEE Trans. Control Netw. Syst. 4(3), 417–428 (2017)
16. A. Bedi, A. Koppel, K. Rajawat, Beyond consensus and synchrony in online network optimization via saddle point method (2017). Preprint arXiv:1707.05816
17. H. Pradhan, A. Bedi, A. Koppel, K. Rajawat, Exact decentralized online nonparametric optimization, in 2018 IEEE Global Conference on Signal and Information Processing (GlobalSIP) (2018). https://doi.org/10.1109/GlobalSIP.2018.8646689
18. R. Dixit, A. Bedi, K. Rajawat, Online learning over dynamic graphs via distributed proximal gradient algorithm. IEEE Trans. Autom. Control 66(11), 5065–5079 (2021)
19. A. Koppel, F. Jakubiec, A. Ribeiro, A saddle point algorithm for networked online convex optimization. IEEE Trans. Signal Process. 63(19), 5149–5164 (2015)
20. S. Shahrampour, A. Jadbabaie, Distributed online optimization in dynamic environments using mirror descent. IEEE Trans. Autom. Control 63(3), 714–725 (2018)
21. S. Hosseini, A. Chapman, M. Mesbahi, Online distributed convex optimization on dynamic networks. IEEE Trans. Autom. Control 61(11), 3545–3550 (2016)
22. D. Yuan, A. Proutiere, G. Shi, Distributed online linear regression. IEEE Trans. Inf. Theory 67(1), 616–639 (2021)
23. X. Zhou, E. Anese, L. Chen, A. Simonetto, An incentive-based online optimization framework for distribution grids. IEEE Trans. Autom. Control 63(7), 2019–2031 (2018)
24. A. Molina, M. Cervantes, E. Montes, M. Perez, Adaptive controller tuning method based on online multiobjective optimization: a case study of the four-bar mechanism. IEEE Trans. Cybern. 51(3), 1272–1285 (2021)
25. X. Li, X. Yi, L. Xie, Distributed online optimization for multi-agent networks with coupled inequality constraints. IEEE Trans. Autom. Control 66(8), 3575–3591 (2021)
26. C. Dwork, Differential privacy: A survey of results, in International Conference on Theory and Applications of Models of Computation (2008). https://doi.org/10.1007/978-3-540-79228-4_
27. Y. Wang, M. Hale, M. Egerstedt, G. Dullerud, Differentially private objective functions in distributed cloud-based optimization, in 2016 IEEE 55th Conference on Decision and Control (CDC) (2016). https://doi.org/10.1109/CDC.2016.7798824
28. E. Nozari, P. Tallapragada, J. Cortes, Differentially private distributed convex optimization via objective perturbation, in 2016 American Control Conference (ACC) (2016). https://doi.org/10.1109/ACC.2016.7525222
29. M. Hale, M. Egerstedt, Differentially private cloud-based multi-agent optimization with constraints, in 2015 American Control Conference (ACC) (2015). https://doi.org/10.1109/ACC.2015.7170902
30. Y. Lou, L. Yu, S. Wang, P. Yi, Privacy preservation in distributed subgradient optimization algorithms. IEEE Trans. Cybern. 48(7), 2154–2165 (2018)
31. C. Zhang, H. Gao, Y. Wang, Privacy-preserving decentralized optimization via decomposition (2018). Preprint arXiv:1808.09566
32. S. Gelfand, S. Mitter, Recursive stochastic algorithms for global optimization in R^d. SIAM J. Control Optim. 29(5), 999–1018 (1991)
33. K. Harold, G. Yin, Stochastic Approximation and Recursive Algorithms and Applications (Springer Science & Business Media, Berlin, 2003)
34. S. Ram, A. Nedic, V. Veeravalli, Distributed stochastic subgradient projection algorithms for convex optimization. J. Optim. Theory Appl. 147, 516–545 (2010)
35. K. Srivastava, A. Nedic, Distributed asynchronous constrained stochastic optimization. IEEE J. Sel. Topics Signal Process. 5(4), 772–790 (2011)
36. R. Ge, F. Huang, C. Jin, Y. Yuan, Escaping from saddle points-online stochastic gradient for tensor decomposition, in Proceedings of The 28th Conference on Learning Theory (PMLR), vol. 40 (2015), pp. 797–842
37. J. Zhu, C. Xu, J. Guan, D. Wu, Differentially private distributed online algorithms over time-varying directed networks. IEEE Trans. Signal Inform. Process. Over Netw. 4(1), 4–17 (2018)
38. A. Nedic, A. Olshevsky, Distributed optimization over time-varying directed graphs. IEEE Trans. Autom. Control 60(3), 601–615 (2015)
39. C. Li, P. Zhou, L. Xiong, Q. Wang, T. Wang, Differentially private distributed online learning. IEEE Trans. Knowl. Data Eng. 30(8), 1440–1453 (2018)
40. A. Nedic, A. Olshevsky, Stochastic gradient-push for strongly convex functions on time-varying directed graphs. IEEE Trans. Autom. Control 61(12), 3936–3947 (2016)
41. S. Shwartz, Online learning and online convex optimization. Found. Trends Mach. Learn. 4(2), 107–194 (2012)
42. T. Yang, J. Lu, D. Wu, J. Wu, G. Shi, Z. Meng, K. Johansson, A distributed algorithm for economic dispatch over time-varying directed networks with delays. IEEE Trans. Ind. Electron. 64(6), 5095–5106 (2017)
43. S. Lee, A. Nedic, M. Raginsky, Stochastic dual averaging for decentralized online optimization on time-varying communication graphs. IEEE Trans. Autom. Control 62(12), 6407–6414 (2017)
44. H. Li, C. Huang, G. Chen, X. Liao, T. Huang, Distributed consensus optimization in multiagent networks with time-varying directed topologies and quantized communication. IEEE Trans. Cybern. 47(8), 2044–2057 (2017)
45. Q. Lü, X. Liao, H. Li, T. Huang, A Nesterov-like gradient tracking algorithm for distributed optimization over directed networks. IEEE Trans. Syst. Man Cybern. Syst. 51(10), 6258–6270 (2021)
46. R. Durrett, Probability: Theory and Examples (Cambridge University Press, Cambridge, 2019)
47. L. Gao, S. Deng, W. Ren, Differentially private consensus with event-triggered mechanism. IEEE Trans. Control Netw. Syst. 6(1), 60–71 (2019)