Huaqing Li · Qingguo Lü · Zheng Wang · Xiaofeng Liao · Tingwen Huang
Distributed Optimization: Advances in Theories, Methods, and Applications
Huaqing Li, College of Electronic and Information Engineering, Southwest University, Chongqing, China
Qingguo Lü, College of Electronic and Information Engineering, Southwest University, Chongqing, China
Zheng Wang, College of Electronic and Information Engineering, Southwest University, Chongqing, China
Xiaofeng Liao, College of Computer Science, Chongqing University, Chongqing, China
Tingwen Huang, Science Program, Texas A&M University at Qatar, Doha, Qatar
ISBN 978-981-15-6108-5
ISBN 978-981-15-6109-2 (eBook)
https://doi.org/10.1007/978-981-15-6109-2
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2020

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Preface
With the continuous development of computer and network technology, and especially the emergence of artificial intelligence, quantum engineering, bioengineering, new energy, and other emerging fields, distributed optimization theory and applications based on networked control systems have attracted the attention of many researchers. They gradually play an important role in social services such as medical services, smart cities, and social welfare, and in engineering applications such as smart grids, bioengineering, and national defense. Due to the distributed nature and complexity of networked control systems, traditional centralized methods are not suitable for solving these optimization problems. Unlike centralized optimization strategies, distributed optimization methods require only information exchange between agents, and they are robust against the failure of individual agents or of links between agents in the network. Moreover, distributed optimization methods can handle large-scale optimization tasks very well and offer substantial improvements in computational complexity and processing speed over traditional centralized methods. Owing to these many advantages, distributed optimization has developed rapidly in the field of optimization and control in recent years, and is widely used in daily life and social development, mainly including social networks, information science, smart cities, signal processing, and sensor control.

Although distributed optimization has become a hot topic and has been applied in many fields, research on distributed optimization under realistic conditions is still weak. Designing and developing effective distributed optimization algorithms, and analyzing their convergence and complexity, are of great significance both for theoretical research and for engineering applications. Analysis and synthesis topics including consensus control, accelerated algorithms, asynchronous broadcast-based algorithms, random sleep schemes, stochastic gradient algorithms, distributed robust algorithms, distributed constrained optimization, economic dispatch in smart grids or power systems, event-triggered communication, quantized communication, unbalanced directed networks, and time-varying networks are all thoroughly studied. This monograph mainly investigates distributed optimization theory and applications in networked control systems. In general, the
following problems are investigated in this monograph: (1) Achieving linear convergence of distributed optimization over unbalanced directed networks with row-stochastic weight matrices; (2) Achieving linear convergence of distributed optimization over time-varying directed networks with column-stochastic weight matrices; (3) Asynchronous broadcast-based distributed optimization over unbalanced directed networks; (4) Quantized communication-based distributed optimization over time-varying directed networks; (5) Event-triggered scheme-based distributed optimization over digital networks with limited data rate; (6) Random sleep scheme-based distributed optimization over time-varying directed networks; (7) Distributed stochastic optimization: variance reduction and edge-based methods; (8) Distributed economic dispatch in smart grids: general unbalanced directed networks; and (9) Distributed economic dispatch in smart grids: event-triggered schemes. Throughout, simulation results, including some typical real applications, are presented to illustrate the effectiveness and practicability of the proposed distributed optimization methods. This book is appropriate as a textbook for undergraduate and graduate students majoring in computer science, automation, artificial intelligence, etc., and as a reference for researchers and technologists in related fields.

Huaqing Li, Chongqing, China
Qingguo Lü, Chongqing, China
Zheng Wang, Chongqing, China
Xiaofeng Liao, Chongqing, China
Tingwen Huang, Doha, Qatar
Acknowledgements
This book was supported in part by the National Natural Science Foundation of China under Grant 61773321, in part by the Fundamental Research Funds for the Central Universities under Grants XDJK2019AC001 and XDJK2020D008, and in part by the Innovation Support Program for Chongqing Overseas Returnees under Grant cx2019005. We would like to begin by acknowledging Xiangzhao Wu, Zuqing Zheng, Wentao Ding, Enbing Su, Jinhui Hu, and Huqiang Cheng, who unselfishly gave their valuable time to arranging the raw materials; their assistance has been invaluable to the completion of this book. The authors are especially grateful to their families for their encouragement and never-ending support when it was most required. Finally, we would like to thank the editors at Springer for their professional and efficient handling of this book.
Contents

1 Introduction
2 Achieving Linear Convergence of Distributed Optimization over Unbalanced Directed Networks with Row-Stochastic Weight Matrices
   2.1 Introduction
   2.2 Preliminaries
      2.2.1 Notation
      2.2.2 Model of Optimization Problem
      2.2.3 Communication Network
      2.2.4 Necessary Assumptions
   2.3 Main Results
      2.3.1 Distributed Accelerated Convergence Algorithm
      2.3.2 Supporting Lemmas
      2.3.3 Convergence of SGT-FROST
   2.4 Numerical Examples
   2.5 Conclusion
   References
3 Achieving Linear Convergence of Distributed Optimization over Time-Varying Directed Networks with Column-Stochastic Weight Matrices
   3.1 Introduction
   3.2 Preliminaries
      3.2.1 Notation
      3.2.2 Model of Optimization Problem
      3.2.3 Communication Network
   3.3 Optimization Algorithm
      3.3.1 Push-DIGing Algorithm
      3.3.2 Small Gain Theorem
      3.3.3 Supporting Lemmas
   3.4 Main Results
   3.5 Numerical Examples
   3.6 Conclusion
   References
4 Asynchronous Broadcast-Based Distributed Optimization over Unbalanced Directed Networks
   4.1 Introduction
   4.2 Preliminaries
      4.2.1 Notation
      4.2.2 Model of Optimization Problem
      4.2.3 Communication Network
   4.3 Broadcast-Based Optimization Algorithm
      4.3.1 Epigraph Form of Constrained Optimization
      4.3.2 Distributed Asynchronous Broadcast-Based Random Projection Algorithm
      4.3.3 Assumptions and Lemmas
   4.4 Convergence Analysis
      4.4.1 Supporting Lemmas
      4.4.2 Proof of Theorem 4.10
      4.4.3 Proof of Corollary 4.11
   4.5 Numerical Examples
   4.6 Conclusion
   References
5 Quantized Communication-Based Distributed Optimization over Time-Varying Directed Networks
   5.1 Introduction
   5.2 Preliminaries
      5.2.1 Notation
      5.2.2 Model of Optimization Problem
      5.2.3 Time-Varying Communication Networks
      5.2.4 Quantization, Encoding and Decoding Schemes
   5.3 Distributed Optimization Algorithm
      5.3.1 Quantized (Sub)gradient Algorithm
      5.3.2 Quantized Recursive Algorithm
   5.4 Main Results
      5.4.1 Reaching Consensus
      5.4.2 Reaching Optimal Solution
      5.4.3 Reaching Optimal Value
   5.5 Numerical Examples
   5.6 Conclusion
   References
6 Event-Triggered Scheme-Based Distributed Optimization over Digital Network with Limited Data Rate
   6.1 Introduction
   6.2 Preliminaries
      6.2.1 Notation
      6.2.2 Model of Optimization Problem
      6.2.3 Communication Network
      6.2.4 Quantization Rule and Encoding-Decoding Scheme
   6.3 Quantized Optimization Algorithm
   6.4 Main Results
   6.5 Numerical Examples
   6.6 Conclusion
   References
7 Random Sleep Scheme-Based Distributed Optimization over Time-Varying Directed Networks
   7.1 Introduction
   7.2 Preliminaries
      7.2.1 Notation
      7.2.2 Model of Optimization Problem
      7.2.3 Communication Network
   7.3 Distributed Algorithm
   7.4 Convergence Analysis
   7.5 Numerical Examples
      7.5.1 Case Study I
      7.5.2 Case Study II
   7.6 Conclusion
   References
8 Distributed Stochastic Optimization: Variance Reduction and Edge-Based Method
   8.1 Introduction
   8.2 Preliminaries
      8.2.1 Notation
      8.2.2 Model of Optimization Problem
      8.2.3 Communication Network
      8.2.4 Problem Reformulation
   8.3 Distributed Algorithm
      8.3.1 Unbiased Stochastic Averaging Gradient
      8.3.2 Algorithm Development
   8.4 Convergence Analysis
      8.4.1 Preliminary Results
      8.4.2 Main Results
   8.5 Numerical Examples
      8.5.1 Case Study I
      8.5.2 Case Study II
   8.6 Conclusion
   References
9 Distributed Economic Dispatch in Smart Grids: General Unbalanced Directed Network
   9.1 Introduction
   9.2 Preliminaries
      9.2.1 Notation
      9.2.2 Model of Optimization Problem
      9.2.3 Communication Network
      9.2.4 Lagrange Dual Problem of EDP
   9.3 Main Results
      9.3.1 Algorithm Development
      9.3.2 Convergence Analysis
   9.4 Robustness Against Communication Delays and Noisy Gradients
   9.5 Numerical Examples
      9.5.1 Case Study I
      9.5.2 Case Study II
   9.6 Conclusion
   References
10 Distributed Economic Dispatch in Smart Grids: Event-Triggered Scheme
   10.1 Introduction
   10.2 Preliminaries
      10.2.1 Notation
      10.2.2 Model of Optimization Problem
      10.2.3 Communication Network
   10.3 Algorithm Development
      10.3.1 Problem Reformulation
      10.3.2 Event-Triggered Scheme
      10.3.3 Event-Triggered Distributed Optimization Algorithm
   10.4 Convergence Analysis
   10.5 Numerical Examples
      10.5.1 Case Study I
      10.5.2 Case Study II
      10.5.3 Case Study III
      10.5.4 Case Study IV
   10.6 Conclusion
   References
Symbols and Acronyms

$\mathbb{R}$: Set of real numbers
$\mathbb{R}^n$: Set of n-dimensional real vectors
$\mathbb{R}^{m \times n}$: Set of m × n-dimensional real matrices
$\mathbb{N}$: Set of positive integers
$I_n$: n × n identity matrix
$\mathbf{0}_n$: n-dimensional column vector of all zeros
$\mathbf{1}_n$: n-dimensional column vector of all ones
$\otimes$: Kronecker product
$\odot$: Hadamard product
$(\cdot)^{\mathrm{T}}$: Transpose of a matrix or a vector
$A^{-1}$: Inverse matrix of a matrix A
$\langle \cdot, \cdot \rangle$: Inner product of two vectors
$O(\cdot)$: Order of magnitude
$\rho(A)$: Spectral radius of a matrix A
$\lambda_{\min}(A)$: Minimum eigenvalue of a real symmetric matrix A
$\lambda_{\max}(A)$: Maximum eigenvalue of a real symmetric matrix A
$\mathrm{diag}\{A\}$: Diagonal matrix formed from the diagonal components of matrix A
$\mathrm{diag}\{x\}$: Diagonal matrix with the elements of vector x on the main diagonal
$\mathrm{dist}(x, X)$: Distance of a vector x from a closed convex set X
$\mathrm{E}[x]$: Expectation of a random variable x
$P_X[x]$: Projection of a vector x onto the closed convex set X
$\|\cdot\|$: 2-norm of a vector or matrix in Euclidean space
$\|\cdot\|_1$: 1-norm of a vector or matrix in Euclidean space
$\|\cdot\|_\infty$: Infinity norm of a vector or matrix in Euclidean space
$\|x\|_W$: Induced norm of a vector in W-space
$\sup_{k \ge 0} y(k)$: Supremum of a function y(k) over k ≥ 0
$\inf_{k \ge 0} y(k)$: Infimum of a function y(k) over k ≥ 0
$e_i$: Vector of all zeros except a one at the i-th location
$\nabla f$: Gradient of a function f
$\mathrm{dom}(f)$: Domain of a function f
$|\cdot|$: Cardinality of a set
$(f(x))_+$: Non-negative part of a function f
row-stochastic matrix: Non-negative square matrix with each row summing up to one
column-stochastic matrix: Non-negative square matrix with each column summing up to one
doubly-stochastic matrix: Non-negative square matrix with both rows and columns summing up to one
List of Figures

Fig. 2.1 A directed and strongly connected network
Fig. 2.2 Performance comparisons among optimization methods for an unbalanced directed network
Fig. 3.1 A simple directed network topology
Fig. 3.2 Time evolution of $E(k) = \log_{10} \sum_{i=1}^{5} (\|x_i(k) - x^*\| / \|x_i(0) - x^*\|)$ among the Push-DIGing algorithm (coordinated step-size), the Push-DIGing algorithm (uncoordinated step-size), the Push-Sum algorithm and DGD
Fig. 4.1 A strongly connected and unbalanced directed network
Fig. 4.2 The evolutions of three randomly displayed agents' states
Fig. 4.3 Comparisons of the proposed algorithm, the push-sum method [19] and the asynchronous DDPG method [31]
Fig. 4.4 The evolutions of the residual $L(k)$ with $p_{ij} = 1$, $p_{ij} \in [0.5, 0.8]$ and $p_{ij} \in [0.1, 0.3]$ for all $i, j = 1, \ldots, 100$
Fig. 4.5 A special unbalanced directed network
Fig. 5.1 The optimal solution estimation trajectories of five agents in Example 5.1
Fig. 5.2 The optimal value estimation trajectories of five agents in Example 5.1
Fig. 5.3 The optimal value estimation trajectories of five agents in Example 5.2
Fig. 5.4 The optimal value estimation trajectories of five agents in Example 5.2
Fig. 6.1 All agents' states $x_i(t)$
Fig. 6.2 All agents' control inputs $u_i(t)$
Fig. 6.3 All agents' quantization errors $d_{ij}(t)$
Fig. 6.4 All agents' sampling time instant sequences
Fig. 7.1 The evolutions of all agents' $x_i(k)$
Fig. 7.2 Relative estimation error between agents 1 and 2
Fig. 7.3 The evolutions of all agents' $x_i(k)$
Fig. 7.4 The evolutions of $\sum_{i=1}^{5} \|x_i(k) - x^*\|$ for parameter values 0.3, 0.5, and 0.8
Fig. 8.1 Network topologies. a Star network. b Circle network. c Random network with $p_e = 0.5$. d Fully connected network
Fig. 8.2 Comparison across EXTRA, AL-Edge, DSA, and the proposed algorithm
Fig. 8.3 Evolution of residuals with different networks
Fig. 8.4 Evolution of residuals with different edge weights
Fig. 8.5 Residuals at the 300th iteration with different constant step-sizes
Fig. 8.6 Samples from the dataset
Fig. 8.7 A network of one hundred nodes
Fig. 8.8 Evolution of residuals with different edge weights
Fig. 9.1 Communication topology: $\mathcal{G}_1$ is ring-connection, $\mathcal{G}_2$ is random-connection, and $\mathcal{G}_3$ is for the anti-damage test
Fig. 9.2 Comparison between ring and random topologies
Fig. 9.3 Time-varying demand
Fig. 9.4 With communication delays and noisy gradient observations
Fig. 9.5 Non-quadratic cost functions test
Fig. 9.6 Anti-damage test
Fig. 9.7 Simulation on the IEEE 118-bus system
Fig. 10.1 Time-varying balanced communication network consisting of 3 fixed topologies, i.e., $\mathcal{G}(3k) = g_1$, $\mathcal{G}(3k+1) = g_2$, and $\mathcal{G}(3k+2) = g_3$ for all $k \ge 0$
Fig. 10.2 Power allocation at generators
Fig. 10.3 Consensus of Lagrange multipliers
Fig. 10.4 All generators' sampling time instant sequences
Fig. 10.5 Evolutions of $\|e_1(k)\|$ and $C_k$
Fig. 10.6 Power allocation at generators
Fig. 10.7 Consensus of Lagrange multipliers
Fig. 10.8 Power balance
Fig. 10.9 Power allocation at generators
Fig. 10.10 Performance comparison with other algorithms
Chapter 1
Introduction
Distributed systems play a core role in today's science and engineering and hold great development potential in large-scale distributed industrial control, multi-robot systems, smart grids, and beyond. How to optimally schedule distributed systems in various fields is therefore an important research subject. At present, the operating mechanisms of distributed optimization over complex networks and under digital communication are not yet well understood, and its application in various fields remains unsatisfactory. Establishing effective operating mechanisms for distributed optimization is thus an urgent problem. This book conducts thorough research on three aspects: distributed optimization over complex networks, distributed optimization under digital communication, and applications of distributed optimization in smart grids. These findings not only break through fundamental limitations of traditional distributed systems over static networks, but also overcome essential difficulties of cooperative control and distributed optimization under complex dynamic environments and limited system resources. In addition, the results reveal quantitative relationships among system performance, network structure, and node parameters, and present a number of local and global criteria for precise cooperation in distributed systems. In this book, we present a series of high-performance distributed optimization algorithms, which effectively promote the translation of distributed system analysis, control, and optimization theory into applications.

The research on distributed optimization of multi-agent systems in complex dynamic networks has received significant attention in many fields. As the theoretical basis and key technology for applications such as cooperative aerial vehicle operations, estimation and detection in sensor networks, and resource allocation, distributed optimization theory plays a critical role. With the dynamic changes of network systems and the increasing complexity of network structures, the existing distributed optimization algorithms struggle to satisfy practical requirements such as fast convergence, low computational complexity, strong scalability, and high agent autonomy. Intuitively, a complex dynamic environment makes it difficult for isolated agents to obtain global information, which poses a great challenge to reaching the global optimal solution in multi-agent networks. At the same time, the complex nonlinear dynamic behavior of agents makes consistency analysis of the system extremely hard. Traditional time-invariant system control theories cannot cope with these challenges. Therefore, it is of great significance to study distributed optimization in complex dynamic networks.

In digital communication, distributed optimization has many application scenarios, including multi-robot collaboration, transportation and vehicle management, satellite attitude adjustment, and so on. Compared with traditional analog communication, digital microprocessors and communication networks have significant advantages such as interference resistance and ease of integration and processing. However, the limited channel bandwidth of practical digital communication restricts the number of quantization levels, so that only a limited number of bits can be transmitted between agents, which in turn yields quantization errors. The quantization error has no convenient a priori characterization, and it is the key factor leading to steady-state error in multi-agent systems. Agents, communication networks, and optimization algorithms are the three major elements of the optimization and control of multi-agent systems. As the medium for information transmission between agents, the communication network is key to the effective implementation of optimization algorithms. However, the limited bandwidth of communication channels in practical environments greatly limits the efficiency of optimization and control of multi-agent systems, and the limited computing ability and energy supply of the agents may restrict performance further. Therefore, studying distributed optimization in digital communication environments is meaningful.

The smart grid is built on an integrated, high-speed, two-way communication network, and achieves reliable, safe, economical, efficient, and environmentally friendly use of the power grid through the application of advanced sensing and measurement technology, advanced equipment technology, advanced control methods, and advanced decision support systems. However, due to the extensive use of distributed power generation, the smart grid faces challenges such as the complexity of the grid structure and the intermittency and randomness of power output. Therefore, it is of great practical significance to research how to improve energy allocation, adjust power generation and pricing strategies, and maximize the optimal dispatch of the smart grid, bringing broader social and economic benefits.

Chapter 1 presents the research background, motivations, and research problems, which involve consensus control, accelerated algorithms, asynchronous broadcast-based algorithms, random sleep schemes, stochastic gradient algorithms, distributed robust algorithms, distributed constrained optimization, economic dispatch in smart grids or power systems, event-triggered communication, quantized communication, unbalanced directed networks, and time-varying networks; the outline of the monograph is then listed.
Chapter 2 investigates the problem of distributed convex optimization, where the target is to collectively minimize a sum of local convex functions over an unbalanced directed multi-agent network. Each agent in the network possesses only its private local objective function, and the sum of all local objective functions constitutes the global objective function. We particularly consider the scenario where the underlying interaction network is strongly connected and the relevant weight matrix is row-stochastic. To collectively solve the optimization problem, a distributed accelerated convergence algorithm in which agents utilize uncoordinated step-sizes is presented by incorporating consensus of multi-agent networks into a distributed inexact gradient tracking technique. Most of the existing methods require all agents to possess the out-degree information of their in-neighbors, which is impractical in many settings, as explained in the chapter. By utilizing the small-gain theorem, we prove that, if the maximum step-size is positive and sufficiently small (constrained by a specific upper bound), the proposed algorithm, termed SGT-FROST, converges linearly to the optimal solution given that the objective functions are smooth and strongly convex. A certain convergence rate is also shown. Simulations confirm the findings in the chapter.

Chapter 3 studies a class of distributed optimization algorithms run by a set of agents, where each agent has access only to its own local convex objective function, and the goal of the agents is to jointly minimize the sum of all the local functions. The communications among agents are described by a sequence of time-varying directed networks which are assumed to be uniformly strongly connected. Column-stochastic mixing matrices are employed in the algorithm, which exactly steer all the agents to asymptotically converge to a global and consensual optimal solution even when the step-sizes are uncoordinated. Two fairly standard conditions for achieving the linear convergence rate are established under the assumption that the objective functions are strongly convex and have Lipschitz continuous gradients. The theoretical analysis shows that the distributed algorithm is capable of driving the whole network to converge linearly to an optimal solution of the convex optimization problem as long as the uncoordinated step-sizes do not exceed some upper bound. We also give an explicit analysis of the convergence rate of our algorithm through a different approach. Finally, simulation results illustrate the feasibility of the proposed algorithm and the effectiveness of the theoretical analysis throughout the chapter.

Chapter 4 focuses on distributed convex optimization problems with inequality constraints over an unbalanced directed multi-agent network without a central coordinator. The goal is to cooperatively minimize the sum of all locally known convex cost functions. Every agent in the network knows only its local objective function and local inequality constraint, and is constrained to a privately known convex set. Furthermore, we particularly discuss the scenario in which the interactions among agents over the whole network are subject to possible link failures. To collaboratively solve the optimization problem, we mainly concentrate on an epigraph form of the original constrained optimization to overcome the unbalancedness of directed networks, and propose a new distributed asynchronous broadcast-based optimization algorithm. The algorithm allows not only the agents' updates to be asynchronous in a distributed fashion, but also the step-sizes of all agents to be uncoordinated. An important characteristic of the proposed algorithm is that it copes with the constrained optimization problem over unbalanced directed networks whose communications are subject to possible link failures. Under two standard assumptions, that the communication network is strongly connected and that the (sub)gradients of all local objective functions are bounded, we provide an explicit convergence analysis of the algorithm. Simulation results obtained from three numerical experiments substantiate the feasibility of the algorithm and validate the theoretical findings.

Chapter 5 considers solving a class of optimization problems which are modeled as the sum of all agents' convex cost functions, where each agent has access only to its individual function. Communication between agents in multi-agent networks is assumed to be limited: each agent can only exchange information with its neighbors through time-varying communication channels with limited capacities. A technique which overcomes this limitation is to apply a quantization process to the exchanged information. The quantized information is first encoded as a binary sequence at the side of each agent before sending. After the binary sequence is received by the neighboring agent, a corresponding decoding scheme is utilized to recover the original information, with a certain degree of error caused by the quantization process. With the availability of each agent's encoding states (associated with its out-channels) and decoding states (associated with its in-channels), we devise a set of distributed optimization algorithms that generate two iterative sequences, one of which converges to the optimal solution and the other of which converges to the optimal value. We prove that if the parameters satisfy some mild conditions, the quantization errors are bounded and consensus optimization can be achieved. How to minimize the number of quantization levels of each connected communication channel in fixed networks is also explored thoroughly. It is found that, by properly choosing system parameters, one-bit information exchange suffices to ensure consensus optimization. Finally, we present two numerical simulation experiments to illustrate the efficacy of the algorithms and to validate the theoretical findings.

Chapter 6 is concerned with solving a large category of convex optimization problems using a group of agents, each having access only to its individual convex cost function. The optimization problems are modeled as minimizing the sum of all the agents' cost functions. The communication process between agents is described by a sequence of time-varying yet balanced directed networks which are assumed to be uniformly strongly connected. Taking into account the fact that the communication channel bandwidth is limited, for each agent we introduce a vector-valued quantizer with finite quantization levels to preprocess the information to be exchanged. We exploit an event-triggered broadcasting technique to guide information exchange, further reducing the communication cost of the network. By jointly designing the dynamic event-triggered encoding-decoding schemes and the event-triggered sampling rules (to analytically determine the
sampling time instant sequence for each agent), a distributed (sub)gradient descent algorithm with constrained information exchange is proposed. By selecting appropriate quantization levels, all the agents' states asymptotically converge to a consensus value which is also the optimal solution to the optimization problem, without saturating any of the quantizers. We find that one bit of information exchange across each connected channel can guarantee that the optimization problem is exactly solved. Theoretical analysis shows that the event-triggered (sub)gradient descent algorithm with constrained network data rate converges at the rate of O(ln t/√t). We supply a numerical simulation experiment to demonstrate the effectiveness of the proposed algorithm and to validate the correctness of the theoretical results.

Chapter 7 considers a category of constrained convex optimization problems over multi-agent networks. The networked agents aim at collaboratively minimizing the sum of all locally known objective functions over a common convex set. Each agent possesses only its local convex function, and its state is constrained to a privately known convex set. A novel distributed algorithm is proposed over time-varying unbalanced directed networks based on an epigraph form of the original optimization problem and consensus theory. By incorporating a random sleep scheme, the proposed algorithm allows each agent to independently and randomly decide whether to calculate the (sub)gradient and take a projection at each iteration, which alleviates the cost of (sub)gradient observation. Besides, it resorts neither to doubly-stochastic weight matrices (only row-stochastic ones) nor to knowledge of the network sequence. Convergence of the algorithm is explicitly analyzed under the conditions that the sequence of time-varying directed networks is uniformly jointly strongly connected and the (sub)gradients of all local objective functions are bounded over the convex set. The optimization algorithm drives the expected distance between the estimate of each agent and the exact optimal solution to zero. Two simulation cases are presented to demonstrate the practicability of the algorithm and the correctness of the obtained theoretical results.

Chapter 8 investigates distributed optimization problems where a group of networked nodes collaboratively minimizes the sum of all local objective functions. The local objective function of each node is further set as an average of a finite set of subfunctions. This structure is motivated by machine learning problems with large training sets distributed to, and known privately by, individual computational nodes. An augmented Lagrange (AL) stochastic gradient algorithm is presented to address the distributed optimization problem, which is integrated with the factorization of the weighted Laplacian and a local unbiased stochastic averaging gradient method. At each iteration, only one randomly selected gradient of a subfunction is evaluated at each node, and a variance-reduced stochastic averaging gradient technique is applied to approximate the gradient of the local objective function. Strong convexity of the local subfunctions and Lipschitz continuity of their gradients are shown to ensure a linear convergence rate of the proposed algorithm in expectation. Numerical experiments on a logistic regression problem demonstrate the correctness of the theoretical results.
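Chapter 8's gradient estimator can be illustrated with a short sketch. The following SAGA-style unbiased stochastic averaging gradient is a generic variance-reduction pattern of the kind summarized above, shown here on synthetic least-squares subfunctions; the data, step-size, and dimensions are illustrative assumptions, not the book's experiments. Only one freshly sampled subfunction gradient is evaluated per iteration, while a stored gradient table keeps the estimator unbiased with shrinking variance:

```python
import numpy as np

# SAGA-style unbiased stochastic averaging gradient on synthetic
# least-squares subfunctions f_l(x) = 0.5 * (a_l @ x - b_l)**2.
# Everything here (data, step-size, iteration count) is illustrative.

rng = np.random.default_rng(0)
m, n = 20, 5                                   # subfunctions, dimension
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def sub_grad(l, x):
    """Gradient of the l-th subfunction at x."""
    return (A[l] @ x - b[l]) * A[l]

x = np.zeros(n)
table = np.array([sub_grad(l, x) for l in range(m)])   # stored gradients
table_avg = table.mean(axis=0)

for _ in range(2000):
    l = rng.integers(m)                        # evaluate ONE fresh subgradient
    g_new = sub_grad(l, x)
    g = g_new - table[l] + table_avg           # unbiased, variance-reduced
    table_avg += (g_new - table[l]) / m        # maintain the running average
    table[l] = g_new
    x -= 0.05 * g                              # plain gradient step

print(np.linalg.norm(A.T @ (A @ x - b)) / m)   # average-gradient norm: small
```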
Chapter 9 is centered on the economic dispatch problem (EDP) in smart grids, which aims at scheduling generators to meet the total demand at minimum cost. The chapter proposes a fully distributed algorithm to address the EDP over directed networks, taking into account communication delays and noisy gradient observations. In particular, the rescaling gradient technique is introduced in the algorithm design, and the implementation of the distributed algorithm resorts only to row-stochastic weight matrices, which allows each generator to locally allocate the weights on the messages received from its in-neighbors. It is proved that the optimal dispatch can be achieved under the assumptions that the non-identical constant communication delays afflicting each link are uniformly bounded and that the noises in every generator's gradient observations are zero-mean with bounded variance. Simulations are provided to validate and demonstrate the effectiveness of the presented algorithm.

Chapter 10 studies the economic dispatch problem (EDP) in time-varying balanced communication networks, aiming to minimize the total cost of electricity generation. This is equivalent to optimizing a sum of local functions where each generator possesses only its own local function. The variables of the generators, satisfying some local constraints, are coupled by a linear constraint. In order to solve the EDP, we design a fully distributed primal-dual optimization algorithm with time-varying uncoordinated step-sizes. To save computation and communication resources, an event-triggered scheme is introduced into the algorithm, under which each generator is only allowed to interact with its neighboring generators at some independent event-triggered sampling time instants. The proposed algorithm achieves a linear convergence rate under strong convexity and smoothness of the local objective functions. Zeno-like behavior is rigorously excluded, meaning that the interval between any two consecutive sampling time instants of each generator is at least two iterations. The effectiveness of the algorithm and the correctness of the theoretical analysis are verified by numerical experiments.
Chapter 2
Achieving Linear Convergence of Distributed Optimization over Unbalanced Directed Networks with Row-Stochastic Weight Matrices
2.1 Introduction

Over the past several years, with the great progress of multi-agent networks in emerging areas, an increasing number of investigators have conducted in-depth research and achieved remarkable results. With the spread of networked control systems, multi-agent networks not only provide a theoretical approach for modeling and analyzing dynamic systems, but also play a crucial role in the study of distributed artificial intelligence [1–6]. Distributed coordination and optimization of networked control systems, as a significant topic in the study of multi-agent networks, have gained considerable interest and attention. Specifically, this class of problems has found a number of engineering applications, e.g., distributed state estimation [7], resource allocation [8], regression [9, 10], and machine learning [11–13], among many others.

Distributed optimization problems over networks have recently yielded many important results [14–33]. Notable approaches that address distributed convex optimization problems include distributed (sub)gradient descent [17, 18], distributed dual averaging [19], and distributed Nesterov (sub)gradient methods [20, 21]. Nedic and Ozdaglar [17] first proposed a consensus-based distributed (sub)gradient descent method that uses a (sub)gradient descent strategy to accomplish distributed optimization. Then, Duchi et al. [19] devised a distributed dual averaging approach using a similar concept, which converged to the optimal solution. It is noteworthy that these two approaches, distributed (sub)gradient descent [17] and distributed dual averaging [19], are essentially based on distributed (sub)gradient descent, which is computationally simple and intuitive but usually slow owing to the adoption of a diminishing step-size. The convergence rates are demonstrated to be $O(\ln k/k)$ ($k$ is the discrete-time iteration index) for strongly convex functions and $O(\ln k/\sqrt{k})$ for arbitrary convex functions. In addition, extensions to various realistic factors and technologies including stochastic (sub)gradient errors [22, 23], switching topologies [24], uncoordinated local constraints [25, 26], random link failures [27] and distributed charging control [28] have been studied extensively. From the viewpoint of convergence rate, the abovementioned distributed methods can be improved with the help of a constant step-size; however, such changes expose the shortcoming of inexact convergence.

Furthermore, Nedic and Olshevsky [29] combined distributed (sub)gradient descent with the push-sum strategy and developed a distributed (sub)gradient-push method (SP) over time-varying unbalanced directed networks, i.e., with column-stochastic matrices. Motivated by the idea of [29], Xi et al. [30] constructed a directed-distributed (sub)gradient descent method (D-DGD) based on surplus consensus (which records the state updates of each agent) and distributed (sub)gradient descent. Both a column-stochastic matrix and a row-stochastic matrix were contained in the underlying weight matrix of [30] to satisfy the properties of a doubly-stochastic matrix. By integrating the weight-balancing mechanism [31] with distributed (sub)gradient descent, a weight-balancing distributed (sub)gradient descent method (WB-DGD) was investigated in [32] (doubly-stochastic matrix). The convergence rates of [29, 30, 32] are $O(\ln k/k)$ for strongly convex functions. To accelerate convergence, Xi and Khan [33] proposed a fast distributed gradient method (DEXTRA) with constant step-size by incorporating the push-sum technique into the protocol (EXTRA) of [14]. When the objective functions are strongly convex and each agent knows its local out-degree, DEXTRA converges linearly to the exact optimal solution at a rate of $O(\lambda^k)$ for $0 < \lambda < 1$. It is noteworthy that although the approaches of [29, 30, 32, 33] avoid the construction of a doubly-stochastic matrix, they all require every agent to know the out-degree information of its in-neighbors exactly. Therefore, all the agents in the networks of [29, 30, 32, 33] can adjust their outgoing weights and ensure that each column of the weight matrix sums to one, yielding a column-stochastic weight matrix. This demand, however, is likely to be unrealistic in, e.g., broadcast-based interaction schemes, where an agent neither knows its out-neighbors nor regulates its outgoing weights.

This chapter is related most closely to recent achievements in [16, 34–40]. To be specific, Xu et al. [34] developed the Aug-DGM algorithm based on gradient tracking, which employs uncoordinated step-sizes over undirected networks. Aug-DGM [34], for arbitrary convex functions, converges at an $O(1/k)$ rate, and the algorithm in [35], exactly the same as Aug-DGM, achieves an $O(\lambda^k)$, $0 < \lambda < 1$, convergence rate for strongly convex functions. Simultaneously, Nedic et al. [36] incorporated the gradient tracking strategy into a distributed inexact gradient method to present the algorithms DIGing/Push-DIGing, which utilize a coordinated step-size for time-varying undirected/unbalanced directed networks (Xi et al. [16] proposed a similar algorithm, ADD-OPT, for fixed unbalanced directed networks). After that, Lü et al. [39] (time-varying unbalanced directed networks) and Nedic et al. [40] (undirected networks) respectively extended the work of [36] to uncoordinated step-sizes. When the objective functions are smooth and strongly convex, the small-gain analysis shows that linear convergence can be achieved in [36, 39, 40]. However, the works [36, 39, 40] did not further analyze distributed optimization methods in the row-stochastic case.
Compared with a column-stochastic matrix, the implementation of a row-stochastic matrix (each agent can privately regulate the weights on the information it acquires from its in-neighbors) is straightforward and much easier to achieve in a distributed manner in practical applications. To conquer this deficiency, Xi et al. [37] investigated the case of a row-stochastic weight matrix (requiring limited global information on the network) and proposed a distributed inexact gradient tracking method under a coordinated step-size. Under two standard assumptions (smoothness and strong convexity), the linear convergence of [37] was obtained over an unbalanced directed network. Although the method developed in [37] considered a row-stochastic weight matrix, it did not take uncoordinated step-sizes into account, which holds back the deployment of distributed convex optimization in real applications. In view of this limitation, Xin et al. [38] employed insights from a matrix contraction argument to propose a fast algorithm (FROST), for which they proved exact linear convergence under certain conditions. Related work also includes a method based on distributed projected (sub)gradient descent with a row-stochastic weight matrix [41]. However, the convergence rate $O(\ln k/\sqrt{k})$ of [41] is relatively slow due to the adoption of a diminishing step-size. Other related work includes distributed methods for resource allocation and demand response in smart grids [42, 43].

The main interest of this chapter is to investigate distributed convex optimization problems over an unbalanced directed network. We aim at developing a broad theory of distributed convex optimization, and the main purpose of presenting a distributed optimization algorithm with more flexible (uncoordinated) step-sizes is to fit and promote real scenarios. More specifically, the key contributions of this chapter are as follows:

(I) A distributed accelerated convergence algorithm, termed SGT-FROST, with a row-stochastic matrix is developed to solve distributed convex optimization problems over an unbalanced directed network. Most importantly, unlike the matrix contraction argument employed in [37, 38], a different approach is presented to show convergence of SGT-FROST using the small-gain theorem, a practical tool first utilized in [36] for distributed optimization.

(II) Unlike the strategies investigated in [36, 39], SGT-FROST adopts a new distributed inexact gradient tracking mechanism to guarantee that all agents in the network eventually converge to the exact optimal solution of the distributed convex optimization problem. In comparison with [32, 33, 42], SGT-FROST is relatively easy to implement in a distributed setting since the implementation of a row-stochastic matrix is straightforward [38, 43]. This matters in quite a few practical scenarios such as wireless sensor networks, ad hoc networks, and peer-to-peer networks.

(III) Although a row-stochastic weight matrix is also investigated in [37, 41, 44], it is even more essential that SGT-FROST adopts uncoordinated step-sizes [34, 38, 39, 43], which offers a wider selection of step-sizes than most existing algorithms, e.g., those proposed in [36, 37, 41]. In addition, we develop conditions concerning the upper bound of the (uncoordinated) step-sizes.

(IV) SGT-FROST achieves faster convergence in comparison with some comparable distributed optimization methods [29, 30, 32]. Specifically, if the objective functions are smooth and strongly convex, SGT-FROST converges linearly to the optimal solution provided the uncoordinated step-sizes are constrained by a specific upper bound. Unlike some existing methods [33, 45], which demand a specific lower bound for the selection of step-sizes, SGT-FROST liberates this constraint on the step-sizes. A certain convergence rate is also shown.
2.2 Preliminaries

2.2.1 Notation

Unless otherwise stated, all vectors in this chapter are columns. For an infinite sequence $s_i = (s_i(0), s_i(1), s_i(2), \ldots)$ with $s_i(k) \in \mathbb{R}^n$ for all $k \in \mathbb{N}$, define $\|s_i\|^{\lambda,K} = \max_{k=0,1,\ldots,K} (1/\lambda^k)\|s_i(k)\|$ for all $K = 0, 1, \ldots$, and $\|s_i\|^{\lambda} = \sup_{k \ge 0} (1/\lambda^k)\|s_i(k)\|$, where $\lambda \in (0, 1)$.
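As a quick illustration of this weighted norm (with made-up numbers, not taken from the book): $\|s_i\|^{\lambda}$ is finite precisely when $\|s_i(k)\|$ decays at least as fast as $\lambda^k$, which is how the small-gain analysis later in the chapter certifies geometric convergence. A minimal numerical check:

```python
import numpy as np

# Tiny check of the lambda-weighted norm (illustrative values only): for a
# geometrically decaying sequence s(k) = 0.5**k, the weighted norm
# sup_k |s(k)| / lambda**k is finite whenever lambda > 0.5.
lam, K = 0.7, 50
s = 0.5 ** np.arange(K + 1)
weighted = np.abs(s) / lam ** np.arange(K + 1)
print(weighted.max())   # (0.5/0.7)**k is maximized at k = 0, so this is 1.0
```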
2.2.2 Model of Optimization Problem

In this chapter, we study a network of $N$ agents whose target is to collectively solve the distributed convex optimization problem of the form

$$\min_{x \in \mathbb{R}^n} \; f(x) = \sum_{i=1}^{N} f_i(x) \qquad (2.1)$$

where each local objective function $f_i : \mathbb{R}^n \to \mathbb{R}$ is held privately by agent $i$ for $i = 1, \ldots, N$, and $x \in \mathbb{R}^n$ is the global decision vector. Let $f^* = f(x^*)$ and $x^*$ denote the optimal value and an optimal solution of (2.1), respectively. The optimal solution set of (2.1) is denoted by $X^* = \{x \in \mathbb{R}^n \mid \sum_{i=1}^{N} f_i(x) = f^*\}$.

This chapter mainly focuses on solving problem (2.1) by presenting a distributed accelerated algorithm (SGT-FROST) applicable to an unbalanced directed network. At each time instant $k \ge 0$, each agent $i \in \{1, \ldots, N\}$ is only allowed to acquire information from its in-neighbors and locally updates a vector $x_i(k)$ so that each $x_i(k)$ converges linearly to $x^*$ as $k$ goes to infinity.
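To make the separable structure of (2.1) concrete, here is a minimal sketch with synthetic strongly convex quadratic local objectives (an illustrative assumption, not the chapter's experimental setup): each agent privately holds one term $f_i$, and only the sum defines the global problem.

```python
import numpy as np

# Minimal sketch of problem (2.1) with synthetic strongly convex quadratics
# f_i(x) = 0.5 * (x - c_i)^T Q_i (x - c_i); these are illustrative stand-ins,
# not the book's experiments.

rng = np.random.default_rng(1)
N, n = 5, 3                                    # agents, decision dimension
Q = [np.diag(rng.uniform(1.0, 3.0, n)) for _ in range(N)]  # SPD -> strongly convex
c = [rng.standard_normal(n) for _ in range(N)]

def grad_f(i, x):
    """Gradient of agent i's private objective at x."""
    return Q[i] @ (x - c[i])

# For quadratics, the minimizer of f = sum_i f_i has a closed form:
# x* = (sum Q_i)^{-1} (sum Q_i c_i), useful as ground truth in simulations.
x_star = np.linalg.solve(sum(Q), sum(Qi @ ci for Qi, ci in zip(Q, c)))
print(x_star)
```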
2.2.3 Communication Network The interactions among N agents are commonly expressed by an unbalanced directed network G, where G = {V, E, W}, V = {1, . . . ,N } is the agents set, E ⊆ V × V is the edges set, and the weight matrix W = wi j ∈ R N ×N is a non-negative matrix
2.2 Preliminaries
11
for adjacency weights of edges such that wi j > a for some a > 0 if ( j, i) ∈ E and wi j = 0 otherwise. Specifically, ∀i, j ∈ V, if ( j, i) ∈ E, it indicates that agent i can acquire message from agent i directly. We denote by Niin the set of in-neighbors of agent i, i.e., Niin = { j| ( j, i) ∈ E}. In directed network G, a path of length b from agent i 1 to agent i b+1 is a sequence of b + 1 distinct agent i 1 , . . . , i b+1 such that agents, then (i k , i k+1 ) ∈ E for k = 1, . . . , b. If there exists a path between any two G is said to be strongly connected. In addition, W is row-stochastic if Nj=1 wi j = 1, ∀i ∈ V.
2.2.4 Necessary Assumptions Assumption 2.1 [37] The underlying communication network G is directed and strongly connected. Remark 2.1 When investigating distributed optimization problems [37, 41], Assumption 2.1 is indispensable to guarantee that agents in the network can always influence others directly or indirectly. Assumption 2.2 (Smoothness [36]) Each local objective function f i , i ∈ V, is L i smooth, i.e., it is differentiable and has Lipschitz-continuous gradient. Specifically, there exists L i such that for any x, y ∈ Rn , one gets ||∇ f i (x) − ∇ f i (y)|| ≤ L i ||x − y||
(2.2)
where L i ∈ (0, +∞). We employ Lˆ = maxi∈V {L i }. Assumption 2.3 (Strong Convexity [36]) The local objective function f i , i ∈ V, is μi -strongly convex. Mathematically, there exists μi such that for any x, y ∈ Rn , f i (x) ≥ f i (y) + ∇ f i (y), x − y +
μi ||x − y||2 2
(2.3)
where μi ∈ [0, +∞). We write μˆ = maxi∈V {μi }. Remark 2.2 Based on Assumption 2.3, problem (2.1) has a unique optimal solution. It is worth emphasizing that Assumptions 2.2 and 2.3 are two standard assumptions to achieve linear convergence when utilizing the first-order methods [34, 36, 42, 43]. If Assumption 2.3 holds, it suffices that each f i is convex and at least one of them is strongly convex.
12
2 Achieving Linear Convergence of Distributed Optimization …
2.3 Main Results 2.3.1 Distributed Accelerated Convergence Algorithm In this section, a distributed accelerated convergence algorithm, termed as SGTFROST, is presented to distributedly work out convex optimization problem (2.1). Each agent i ∈ V at the k-th iteration stores three variables xi (k) ∈ Rn , yi (k) ∈ R N , and z i (k) ∈ Rn . For k ≥ 0, agent i ∈ V updates its variables as follows: xi (k + 1) =
N
wi j x j (k) − αi z i (k)
j=1
yi (k + 1) =
N
wi j y j (k)
(2.4)
j=1
z i (k + 1) =
N
wi j z j (k) +
j=1
∇ f i (xi (k + 1)) ∇ f i (xi (k)) − [yi (k + 1)]i [yi (k)]i
where wi j is the weight and αi > 0 is uncoordinated step-size at agent i. The variable [yi (k)]i indicates the i-th entry of yi (k) and ∇ f i (xi (k)) represents the gradient of f i (x) at x = xi (k). Moreover, the weight wi j obeys wi j =
⎧ N ⎨ > , j ∈ Niin , ⎩
0, other wise,
wi j = 1, ∀i ∈ V
(2.5)
j=1
and wii = 1 − j∈Niin wi j > , ∀i ∈ V, where 0 < < 1. Each agent i ∈ V starts with some initial vectors xi (0) ∈ Rn , yi (0) = ei ∈ R N , and z i (0) = ∇ f i (xi (0)) ∈ Rn . Remark 2.3 For brevity, suppose that the sequences {xi (k)}, and {z i (k)}, i ∈ V, generated by (2.4) possess only one-dimension, i.e., n = 1. For the cases of xi (k), z i (k) and ∇ f i (xi (k)) ∈ Rn , i ∈ V, by means of Kronecker product (⊗), the analysis strategy is similar to one-dimension. Let W = [wi j ] ∈ R N ×N denote the collection of weights wi j , i, j ∈ V, which is obviously row-stochastic. Define x(k) = [x1 (k) , . . . , x N (k)]T ∈ R N , Y (k) = [y1 (k), . . . , y N (k)]T ∈ R N ×N , ∇ F (k) = [∇ f 1 (x1 (k)) , . . . , ∇ f N (x N (k))]T ∈ R N , Y˜ (k) = diag{Y (k)} and z(k) = [z 1 (k), . . . , z N (k)]T ∈ R N . Thus, the compact matrix-vector form of SGT-FROST (2.4) can be written as follows: x(k + 1) = W x(k) − Dz(k) Y (k + 1) = WY (k)
(2.6)
2.3 Main Results
13
z(k + 1) = Wz(k) + Y˜ −1 (k + 1)∇ F(k + 1) − Y˜ −1 (k)∇ F(k) where Y (0) = I N and z(0) = ∇ F(0); D = diag{α} ∈ R N ×N , where α = [α1 , α2 , . . . , α N ]T . It is worth emphasizing that SGT-FROST (2.6) does not need all agents to possess the out-degree information of their in-neighbors (only row-stochastic matrix adopted in SGT-FROST).
2.3.2 Supporting Lemmas Before providing the supporting lemmas, a slice of inevitable notations are introduced, which are crucial to sustain the succeeding analysis. First, the definition of the Perron vector is given. Definition 2.4 (Perron vector [37]) Suppose that Assumption 2.1 holds. Assume that W in (2.6) is a the non-negative weight matrix. Then, there exists a Perron vector a = [a1 , . . . , a N ]T ∈ R N such that
lim W k = 1 N a T , a T W = a T , a T 1 N = 1, ρ W − 1 N a T < 1
k→∞
(2.7)
N ai vi , vˇ = 1 N a T v and The following notations are also needed. Denote v¯ = i=1 v˘ = v − 1 N v¯ = J˘v for any i, where v = [v1 , . . . , v N ]T , J˘ = I N − 1 N a T . Denote Y∞ = limk→∞ Y (k) = limk→∞ W k (Y (0) = I N ). It is worth emphasizing that the two symbols 1 N a T and Y∞ are actually the same. For the sake of analysis, we give different definitions here. Before proceeding the analysis, we provide two lemmas required for the convergence to hold. Lemma 2.5 [37] Let Y (k) be updated by (2.6). Denote the limit of Y (k) as Y∞ . Then, there exist 0 < C < ∞ and 0 < λ0 < 1 such that ||Y (k) − Y∞ || ≤ Cλk0 , ∀k ≥ 0 Lemma 2.6 [36] Under Assumption 2.1 and for any matrices (suitable dimensions) ˘ ≤ δ||B||, ˘ where C and B, if C = WB, where W satisfies Definition 2.4, we get ||C|| ˘ ˘ ˘ ˘ C = J C, B = J B, and 0 < δ < 1. In order to achieve a linear convergence of the Push-DIGing algorithm, we first introduce a preliminary result, namely, the small gain theorem. It is a somewhat unusual version derived from [36]. Moreover, the original version of the theorem has received a widely research and been extensively employed in control theory [46]. Lemma 2.7 (Small Gain Theorem [36]) Suppose that s1 , . . . , sm are vector sequences. Then, for each i = 1, . . . , m, we possess that si → s(i mod m)+1 , that is for all positive integers K ,
14
2 Achieving Linear Convergence of Distributed Optimization …
||s(i mod m)+1 ||λ,K ≤ γi ||si ||λ,K + i where 0 < λ < 1; the constants γ1 , . . . , γm and 1 , . . . , m are independent of K . Assume in addition that the constants γ1 , . . . , γm are non-negative and follow m γi < 1. It holds that 0 < i=1 ||s1 ||λ ≤ (1 −
1 m
m
γi )
i
i=1
m
γj
j=i+1
i=1
Since si → s(i mod m)+1 , i = 1, . . . , m, exists a loop structure, some similar bounds of the remaining sequences can be obtained. Moreover, the following lemma will be applied to conclude that the linear convergence of si , i = 1, . . . , m, given in Lemma 2.7 can be acquired. Lemma 2.8 (Linear Convergence [36]) For any matrix sequence si = (si (0), si (1), si (2), . . .) and λ ∈ (0, 1), if ||si ||λ < B0 , where B0 < ∞, then ||si (k)|| converges to zero linearly with a rate of O(λk ). Before setting up the main results, in the following, a slice of basic symbols are given to simplify the convergence analysis. Specifically, denote q (k) = x (k) − 1 N x ∗
(2.8)
h(k + 1) = Y˜ −1 (k + 1)∇ F(k + 1) − Y˜ −1 (k)∇ F(k)
(2.9)
∀k ≥ 0. For consistency in the proofs we define h (0) = 0 N . In order to directly apply Lemma 2.7 and establish the convergence of SGT-FROST, a slice of supporting relationships are needed to build the loop below: ⎧ ||x|| ˘ ⎪ ⎪ ⎨ ↑ ||q|| ← ||˘z || ← ||h|| ← ||q|| ⎪ ⎪ ⎩ ||z|| ← ||ˇz || ← ||q||
(2.10)
Remark 2.9 Notice that q is the difference between the estimate x and the global optimal solution 1 N x ∗ ; x˘ is the consensus violation of the estimate x; h is the difference between two successively corrected gradients; z˘ (ˇz ) is the consensus violation (average) of the estimate z; z is the average estimate of all agents’ gradients. Generally speaking, if q is small, h is small because the two corrected gradients are close to the same near the optimal solution. And then, if q and h are small, the structure of SGT-FROST (2.6) indicates that z is small. Moreover, if z is small, x˘ is small according to the structure of SGT-FROST (2.6). Finally, if z and x˘ are small, SGT-FROST will steer q to zero and therefore accomplish the whole loop.
2.3 Main Results
15
Remark 2.10 If each arrow in loop (2.10) is established, the small gain theorem (Lemma 2.7) is utilized to deduce the boundedness of the sequences {||q||λ,K , ||x|| ˘ λ,K , ||z||λ,K , ||˘z ||λ,K , ||ˇz ||λ,K , ||h||λ,K }. Thus, we can achieve that all quantities in loop (2.10) converge at a linear rate O(γ k ) through using Lemma 2.8. Notice that to realize the desired results via employing small gain theorem, the requirement of condition 0 < γ1 · · · γm < 1 is needed. This can be implemented by selecting a reasonable step-size matrix D. Indeed, after choosing a sufficiently small step-sizes, the small gain theorem can be employed effectively. Now, we are in the position of presenting the establishment of each relationship of loop (2.10). Before Lemma 2.11 is given, some notations are introduced, which distinguish the notations described in distributed optimization problem/algorithm/analysis throughout the chapter. With different notations, the optimization problem (2.1) is redefined as: min g ( p) =
N
gi ( p), p ∈ Rn
(2.11)
i=1
where gi , i = 1, . . . , N (the same as f i ), obeys Assumptions 2.2 and 2.3. To solve (2.11), the perturbed version of gradient descent method is presented as follows: p (k + 1) = p (k) − θ
N i=1
ai
∇gi (si (k)) + e (k) ϕi (k)
(2.12)
where θ > 0 is the constant step-size. Recall Definition 2.4 that ai > 0, ∀i ∈ V. The variable ϕi (k) > 0, ∀i ∈ V, ∀k ≥ 0, and e(k), ∀k ≥ 0, is a superimposed noise. Define r (k) = || p (k) − p ∗ ||, ∀k = 0, 1, . . ., where p ∗ is indicative of the optimal solution to (2.11). We now provide Lemma 2.11 which is useful in the establishment of the first relationship of loop (2.10). Lemma 2.11 Let each function gi , i = 1, . . . , N , in (2.11) satisfy Assumptions 2.2 and 2.3. Suppose in addition that
β+1 1 ¯ θϕ1 μβ ≤ λ < 1, 0 < θ ≤ min , 1− ¯ + η) 2(β + 1) ϕ1 μβ ¯ ϕ2 L(1
N ¯ where ϕ1 = inf k≥0 mini∈V {1/ϕi (k)}, ϕ2 = supk≥0 maxi∈V {1/ϕi (k)}, L= i=1 ai L i , N μ¯ = i=1 ai μi , β ≥ 2, and η > 0. Then, for any K = 0, 1, . . ., the sequences {r (k); p(k); s1 (k), . . . , s N (k)} generated by (2.12) achieves
16
2 Achieving Linear Convergence of Distributed Optimization …
|r |λ,K
N ˆ + η) ϕ2 μβ ϕ2 L(1 ˆ p − si λ,K + ϕ1 μη ¯ ϕ1 μ¯ i=1 √ 3 − θϕ1 μ¯ eλ,K + 2r (0) + λθϕ1 μ¯
1 ≤ λ
(2.13)
Proof Note that ||c||2 = ||c + d||2 − 2 c, d − ||d||2 , ∀c, d ∈ R N . Letting c = p(k + 1) − p ∗ and d = p(k) − p(k + 1), it deduces that r 2 (k + 1) = r 2 (k) + 2θ p ∗ − p(k),
N
ai
∇gi (si (k)) ϕi (k)
ai
∇gi (si (k)) ϕi (k)
i=1
− 2θ p(k + 1) − p(k),
N i=1
+ 2 p(k + 1) − p ∗ , e(k) − || p(k) − p(k + 1)||2
(2.14)
We first discuss the second term of (2.14) of the second equation. By Assumption 2.3, we get μi gi ( p ∗ ) ≥ gi (si (k)) + ∇gi (si (k)), p ∗ − si (k) + || p ∗ − si (k)||2 2
(2.15)
for any i ∈ V. Notice that ||si (k) − p ∗ ||2 ≥ (β/ (β + 1)) || p(k) − p ∗ ||2 − β|| p(k) − si (k)||2 , where β is a tunable parameter. One obtains from (2.15) that gi ( p ∗ ) ≥ gi (si (k)) + ∇gi (si (k)), p ∗ − p(k) + p(k) − si (k) β μi || p(k) − p ∗ ||2 − β|| p(k) − si (k)||2 ) + ( 2 β+1
(2.16)
which when multiplying both sides by ai /ϕi (k) , ∀i ∈ V, k ≥ 0, and summing up for i from 1 to N yields N i=1
≤
ai ∇gi (si (k)), p ∗ − p(k) ϕi (k)
N ai gi ( p ∗ ) i=1
−
ϕi (k)
ai gi (si (k)) μβϕ ¯ 1 || p(k) − p ∗ ||2 − 2(β + 1) ϕi (k) i=1 N
−
N N ai μi β|| p(k) − si (k)||2 ai ∇gi (si (k)) , p(k) − si (k) +
ϕi (k) 2ϕi (k) i=1 i=1
(2.17)
N where ϕ1 = inf k≥0 mini∈V {1/ϕi (k)} and μ¯ = i=1 ai μi . Then, we discuss the third term of (2.14) of the second equation. Similarly, following from Assumption 2.2, for
2.3 Main Results
17
any vector ∈ R N , we have gi ( p(k) + ) ≤ gi (si (k)) + ∇gi (si (k)), + ∇gi (si (k)), p(k) − si (k) L i (1 + η) L i (1 + η) (2.18) ||||2 + || p(k) − si (k)||2 + 2 2η where η > 0 is also a tunable parameter. Similar to the processing of formula (2.17), one has N ai − ∇gi (si (k)), ϕ i (k) i=1
≤−
N ai gi ( p(k) + ) i=1
+ +
ϕi (k)
+
¯ 2 (1 + η)||||2 Lϕ 2
N N ai gi (si (k)) ai ∇gi (si (k)) , p(k) − si (k) +
ϕi (k) ϕi (k) i=1 i=1 N ai L i (1 + η)|| p(k) − si (k)||2 i=1
2ηϕi (k)
(2.19)
N where ϕ2 = supk≥0 maxi∈V {1/ϕi (k)} and L¯ = i=1 ai L i . We next analyze the iteration between r (k + 1) and r (k). Substituting (2.17) and (2.19) with = p(k + 1) − p(k) into (2.14) gives us the following: ¯ + η))|| p(k + 1) − p(k)||2 r 2 (k + 1) + (1 − θϕ2 L(1 N ¯ ai (gi ( p(k + 1)) − gi ( p ∗ )) θϕ1 μβ r 2 (k) − 2θ ≤ 1− β+1 ϕi (k) i=1 N ˆ + η) ϕ2 θ L(1 ||e(k)||2 + + ϕ2 θμβ ˆ || p(k) − si (k)||2 + η ρ(k) i=1 + ρ(k)r 2 (k + 1)
(2.20)
maxi∈V {L i }, and ρ(k) is a sequence of positive paramwhere μˆ = maxi∈V {μi }, Lˆ = N ¯ + η) || p(k) − si (k)|| 2 . To guarantee 1 − θϕ2 L(1 eters (tunable). Let ε(k) = i=1
¯ + η) . Thus, the following in (2.20) non-negative, we suppose 0 < θ ≤ 1/ ϕ2 L(1 inequality is established: r 2 (k + 1) ≤
2 ˆ + η) + ηϕ2 θμβ (β + 1 − θϕ1 μβ)r ¯ (k) ϕ2 θ L(1 ˆ + ε(k) (β + 1)(1 − ρ(k)) η(1 − ρ(k))
18
2 Achieving Linear Convergence of Distributed Optimization …
ai (gi ( p(k + 1)) − gi ( p ∗ )) ||e(k)||2 2θ − (1 − ρ(k))ρ(k) 1 − ρ(k) i=1 ϕi (k) N
+
(2.21)
Now, we aim to discuss (2.21). At time k, two possibilities might occur. Possibility A is that ˆ + η) ϕ2 μβ ϕ2 L(1 ˆ ||e(k)||2 2 + (2.22) ε(k) + r (k + 1) ≥ ϕ1 μη ¯ ϕ1 μ¯ θϕ1 μρ(k) ¯ Possibility B is the opposite one, i.e., r (k + 1) < 2
ˆ + η) ϕ2 μβ ϕ2 L(1 ˆ + ϕ1 μη ¯ ϕ1 μ¯
ε(k) +
||e(k)||2 θϕ1 μρ(k) ¯
(2.23)
In Possibility A, we have N ai 2θ (gi ( p(k+1)) − gi ( p ∗ )) ϕ (k) i i=1
¯ p(k+1) − p ∗ ||2 ≥ θϕ1 μ|| = θϕ1 μr ¯ 2 (k + 1) ˆ + η) ϕ2 θ L(1 1 e(k)2 + ϕ2 θμβ ˆ ε(k) + ≥ η ρ(k)
(2.24)
which together with (2.21) yields θϕ1 μβ ¯ 1 1− r 2 (k) r (k + 1) ≤ 1 − ρ(k) β+1 2
(2.25)
By combining Possibilities A and B, we have r 2 (k) ≤
⎧ ⎨
¯ 1 1 μβ 1 − θϕβ+1 r 2 (k), 1−ρ(k) max ϕ L(1+η) ˆ ||e(k)||2 ⎩ 2ˆ ε(k) + θϕ + ϕϕ21μβ ϕ1 μη ¯ μ¯ ¯ 1 μρ(k)
⎫ ⎬ ⎭
(2.26)
Using (2.26) recursively and then taking square root on the obtained inequality, we get λ−(k+1)r (k + 1) k −(k+1) ≤λ t=0
1 1 − ρ(t)
1−
θϕ1 μβ ¯ β+1
k+1 2
r (0)
2.3 Main Results
19
⎧ ⎨ t−1
t ¯ θϕ1 μβ +λ max 1− s=0 t=0,...,k ⎩ β+1 ˆ + η) ϕ2 μβ e(k − t) ϕ2 L(1 ˆ × ε(k − t) + √ + ϕ1 μη ¯ ϕ1 μ¯ ρ(k − t)θϕ1 μ¯ −(k+1)
1 1 − ρ(k − s)
(2.27)
√
k+1 √ 1/ (1 − ρ(t)) 1 − θϕ1 μβ/ ¯ (β + 1) r (0) is √ finite. Moreover, for a large
enough k0 , ∀k > k0 , we have 1/ (1 − ρ(k)) √ ¯ (β + 1) < 1, which can√be implemented by fixing ρ(k) = (1/λ) 1 − θϕ1 μβ/ ¯ (2(β ¯ with λ > 1 − θϕ1 μβ/2(β ¯ + 1). Choosing ρ = θϕ1 μβ/ √ + 1) − θϕ1 μβ) ¯ + 1), β ≥ 2, and noting that λ−2 (1−θϕ1 μβ/2(β ¯ + ρ(k) = ρ, λ ≥ 1−θϕ1 μβ/2(β 1)) ≤ 1, inequality (2.27) is further rewritten as Note supk=0,1,... λ−(k+1)
k t=0
λ−(k+1)r (k + 1) − r (0) ⎧ ⎫ √ ⎨ ⎬ ˆ e(t) ϕ ϕ μβ ˆ 3 − θϕ μ ¯ L(1 + η) 2 2 1 + ≤ λ−1 max λ−t ε(t) + λ−t t=0,...,k ⎩ ⎭ ϕ1 μη ¯ ϕ1 μ¯ θϕ1 μ¯ (2.28) Applying
√ N p(k) − si (k) to (2.28), one obtains that ε(k) ≤ i=1
N ˆ + η) ϕ2 μβ ϕ2 L(1 ˆ + max λ−t p(t) − si (t) t=0,...,k ϕ1 μη ¯ ϕ1 μ¯ i=1 √ 3 − θϕ1 μ¯ (2.29) max {λ−t e(t)} + r (0) + λθϕ1 μ¯ t=0,...,k
1 λ−(k+1)r (k + 1) ≤ λ
Taking maxk=0,...,K −1 {·}, we get the desired results.
In what follows, relation {||x|| ˘ λ,K ; ||z||λ,K } → ||q||λ,K is provided, which is established on the basis of Lemma 2.11. Lemma 2.12 Under Assumptions 2.1–2.3 and if the parameters α and λ satisfy β+1 1 α yˆ μβ ¯ 0 < α ≤ min , 1− , ≤λ 0. Then, ∀K = 0, 1, . . ., we get
N i=1
ai L i , μ¯ =
N i=1
ai μi ,
20
2 Achieving Linear Convergence of Distributed Optimization …
⎛ ||q||λ,K
⎞ ˆ + η) y˜ L(1 y˜ μβ ˆ ⎠ + ||x|| ˘ λ,K yˆ μη ¯ yˆ μ¯ √ + C y,α ||z||λ,K + 2 N ||x(0) ¯ − x ∗ ||
N ≤ ⎝1 + λ
(2.30)
√ ¯ yˆ μ(1 ¯ − k −1 where C y,α = N 3 − αmax yˆ μ/λ D ) (positive constant) if α = αmax , and k D = αmax /αmin (αmax = maxi∈V {αi }, αmin = mini∈V {αi }). Proof Consider that W is row-stochastic. Then, multiplying both sides of the inequality (2.6) to a T , we have z¯ (k + 1) = z¯ (k) + a T (Y˜ −1 (k + 1)∇ F(k + 1) − Y˜ −1 (k)∇ F(k)) where z¯ =
N i=1
(2.31)
ai z i . With z i (0) = ∇ f i (xi (0)), ∀i ∈ V, and Y (0) = I N , we get that z¯ (k) =
N
ai
i=1
Now we exploit the evolution of x¯ = one obtains that x(k ¯ + 1) = x(k) ¯ −α
N i=1
ai
N i=1
∇ f i (xi (k)) [yi (k)]i
(2.32)
ai xi . In view of the first equation of (2.6),
∇ f i (xi (k)) + (αa T − a T D)z(k) [yi (k)]i
(2.33)
where α is a non-negative constant. Applying Lemma 2.11 to (2.33), it is sufficient that ⎞ √ ⎛ ˆ y ˜ μβ ˆ N y ˜ L(1 + η) ⎝ ⎠ ||x|| + ¯ − x ∗ || + ||x¯ − x ∗ ||λ,K ≤ 2||x(0) ˘ λ,K λ yˆ μη ¯ yˆ μ¯ # #λ,K # 3 − α yˆ μ¯ # T TD # a −a z# + (2.34) # # λ yˆ μ¯ α Noting q(k) = x(k) ˘ + 1 N (w T x(k) − x ∗ ), it follows that ˘ λ,K + ||q||λ,K ≤ ||x||
√
N ||x¯ − x ∗ ||λ,K
(2.35)
Plugging (2.34) into (2.35) and letting α = αmax (αmax = maxi∈V {αi }), it holds that ||q||λ,K
√
⎛
N ≤2 N ||x(0) ¯ − x ∗ || + ⎝1 + λ
⎞ ˆ + η) y˜ μβ ˆ ⎠ y˜ L(1 + ||x|| ˘ λ,K yˆ μη ¯ yˆ μ¯
2.3 Main Results
21
√ # #λ,K # # N 3 − α yˆ μ¯ D # a # I N − z# + λ yˆ μ¯ αmax # ⎛ ⎞ ˆ √ y˜ μβ ˆ ⎠ N y˜ L(1 + η) + ≤ 2 N ||x(0) ¯ − x ∗ || + ⎝1 + ||x|| ˘ λ,K λ yˆ μη ¯ yˆ μ¯ √ N 3 − α yˆ μ¯ λ,K (1 − k −1 + (2.36) D )||z|| λ yˆ μ¯ where we have used the relation ||I N − D/αmax || ≤ 1 − k −1 D to obtain the last inequality, and the parameter k D = αmax /αmin is the condition number of matrix D. The proof is completed. Then, relation ||z||λ,K → ||x|| ˘ λ,K will be presented in Lemma 2.18. Lemma 2.13 Under Assumptions 2.1–2.3 and for any λ ∈ (δ, 1) and K = 0, 1, . . ., it can be established that ||x|| ˘ λ,K ≤
αmax λ ||z||λ,K + ||x(0)|| ˘ λ−δ λ−δ
(2.37)
where recall that δ and αmax are respectively defined in Lemmas 2.6 and 2.12. Proof From SGT-FROST (2.6) and Lemma 2.6, we get ||x(k ˘ + 1)|| ≤ ||(W − 1 N a T ) J˘ x(k)|| + || J˘ Dz(k)|| ≤ δ||x(k)|| ˘ + αmax ||z(k)||
(2.38)
where we used (W − 1 N a T ) = (W − 1 N a T ) J˘ to derive the first inequality and used Lemma 1 and ||D|| ≤ αmax to derive the second inequality. Multiplying with λ−(k+1) on both sides of inequality (2.38), one deduces that ˘ + 1)|| ≤ λ−(k+1) ||x(k
δ −k αmax −k λ ||x(k)||+ λ ||z(k)|| ˘ λ λ
(2.39)
To employ notation || · ||λ,K , the initial condition of (2.39) is needed. Here, let ||x(0)|| ˘ ≤ ||x(0)|| ˘ hold. Combining the initial condition of (2.39) and taking maxk=0,...,K −1 {·} on both sides of (2.39), one yields ||x|| ˘ λ,K ≤
δ αmax ||x|| ˘ λ,K + ||z||λ,K + ||x(0)|| ˘ λ λ
(2.40)
The desire results of Lemma 2.13 is achieved by rearranging (2.40). This thus completes the proof. Finally, the following lemma shows the relationship of ||q||λ,K → ||z||λ,K .
22
2 Achieving Linear Convergence of Distributed Optimization …
Lemma 2.14 Under Assumptions 2.1–2.3 and for any λ ∈ (δ, 1) ∩ [λ0 , 1) and K = 0, 1, . . ., it can be inferred that (i) ||z||λ,K ≤ ||˘z ||λ,K + ||ˇz ||λ,K λ λ ||h||λ,K + ||˘z (0) ||λ,K (ii) ||˘z ||λ,K ≤ λ−δ λ−δ Lˆ y˜ (1 + λ) 2 y˜ 2 C||∇ F(1 N x ∗ )|| ||q||λ,K + (iii) ||h||λ,K ≤ λ λ λ,K 2 λ,K 2 ˆ ≤ L(y y˜ C + N )||q|| + y y˜ C||∇ F(1 N x ∗ )|| (iv) ||ˇz ||
(2.41a) (2.41b) (2.41c) (2.41d)
where ∇ F(1 N x ∗ ) = [∇ f 1 (x ∗ ), . . . , ∇ f N (x ∗ )]T ∈ R N and y = supk≥0 ||Y (k)||. The constants C and λ0 are given in Lemma 2.5; δ and y˜ are provided in Lemmas 2.6 and 2.12, respectively. Proof (i) ||˘z ||, ||ˇz || → ||z|| The proof of Lemma 2.14 (i) follows easily by using ||z (k) || ≤ ||˘z (k) || + ||ˇz (k) ||. (ii) ||h|| → ||˘z || According to the last equation of (2.6) and the definition (2.9), we acquire that z(k + 1) = Wz(k) + h(k + 1)
(2.42)
Recalling that z˘ = J˘ z, one gets (c.f. (2.38)) ||˘z (k + 1)|| ≤ δ||˘z (k)|| + ||h(k + 1)||
(2.43)
Similarly, let ||˘z (0)|| ≤ ||˘z (0)||, it yields that (c.f. (2.40)) ||˘z ||λ,K ≤
δ ||˘z ||λ,K + ||h||λ,K + ||˘z (0)|| λ
(2.44)
Rearranging (2.44), one obtains the inequality of Lemma 2.14 (ii). (iii) ||q|| → ||h|| From Assumptions 2.1 and 2.2, Lemma 2.7, and the definition of (2.9), it is sufficient that ˆ ||h(k + 1)|| ≤ y˜ L(||q(k + 1)|| + ||q(k)||) + 2 y˜ 2 Cλk0 ||∇ F(1 N x ∗ )||
(2.45)
where ∇ F(1 N x ∗ )= [∇ f 1 (x ∗ ), . . . , ∇ f N (x ∗ )]T , and we applied Assumption 2.2 and Lemma 2.14(b) in [37] to derive the inequality. Thus, for all λ ≥ λ0 , we acquire the inequality of Lemma 2.14 (iii) via a process similar to Lemma 2.13. (iv) ||q|| −→ ||z|| Considering a T z(k) − a T Y˜ −1 (k)∇ F(k) = · · · = a T z(0) − a T Y˜ −1 (0)∇ F(0) = 0 and Y (0) = I N , it holds
2.3 Main Results
23
||ˇz (k)|| = ||1 N a T Y˜ −1 (k)∇ F(k)|| −1 ≤ ||Y∞ Y˜ −1 (k)∇ F(k) − Y∞ Y˜∞ ∇ F(k)|| −1 −1 + ||Y∞ Y˜∞ ∇ F(k) − Y∞ Y˜∞ ∇ F(1 N x ∗ )||
≤y y˜ 2 Cλk0 ||∇ F(k)|| + N ||∇ F(k) − ∇ F(1 N x ∗ )||
(2.46)
where ∇ F(1 N x ∗ )= [∇ f 1 (x ∗ ), . . . , ∇ f N (x ∗ )]T and y = supk≥0 ||Y (k)||. We used the −1 = 1 N 1TN to obtain the first inequality and used the result of Lemma relation Y∞ Y˜∞ 2.14(a) in [37] to obtain the second inequality in (2.46). Then, by Assumption 2.2, we further have ˆ y˜ 2 C + N )||x(k) − 1 N x ∗ || + y y˜ 2 Cλk0 ||∇ F(1 N x ∗ )|| ||ˇz (k)|| ≤ L(y
(2.47)
The desired result of Lemma 2.14 (iv) (c.f. (2.40) or (2.44)) follows immediately. The proof of Lemma 2.14 is completed.
2.3.3 Convergence of SGT-FROST In the case of acquiring the supporting relationships, i.e., loop (2.10), the main convergence results of SGT-FROST are built as follows. Theorem 2.15 Consider that SGT-FROST (2.4) with uncoordinated step-sizes (D) updates the sequence {x(k)}. Assume that αmax = maxi∈V {αi } and k D = αmax /αmin (condition number of matrix D). Under Assumptions 2.1–2.3 and if
αmax
(1 − δ)((1 − δ) − J ) 1 , ∈ 0, min K 2 y˜ L¯
$ (2.48)
and kD < 1 +
yˆ (1 − δ)2 √ 2κ 3N (2 y˜ + y y˜ 2 C + N ) − yˆ (1 − δ)2
(2.49)
then the sequence {x(k)} converges to 1 N x ∗ at a global linear rate of O(λk ), where λ ∈ (0, 1) satisfies % λ = max δ +
J +
' & J 2 + 4αmax K αmax μ¯ yˆ , 1− , λ0 2 3
√ ˆ y˜ + y y˜ 2 C + N )(1 + ˆ , K = L(2 where = 2κ 3N (2 y˜ + y y˜ 2 C + N )(1 − k −1 D )/ y J√ ˆ μ¯ (condition number of f ). 4N y˜ / yˆ κ) and κ = L/
24
2 Achieving Linear Convergence of Distributed Optimization …
Proof According to Lemmas 2.12–2.14 we have the following estimates: ˘ λ,K + γ12 ||z||λ,K + 1 ||q||λ,K ≤ γ11 ||x|| ||x|| ˘ λ,K ≤ γ2 ||z||λ,K + 2 ||z||λ,K ≤ ||˘z ||λ,K + ||ˇz ||λ,K ||˘z ||λ,K ≤ γ31 ||h||λ,K + 31 ||h||λ,K ≤ γ32 ||q||λ,K + 32 ||ˇz ||λ,K ≤ γ33 ||q||λ,K + 33 ( ˆ + η)/ yˆ μη γ11 = 1 + (N /λ) y˜ L(1 ¯ + y˜ μβ/ ˆ yˆ μ, ¯ γ12 = 3 − αmax yˆ μ¯ √ (1 − k −1 ˆ μ, ¯ γ2 = αmax / (λ − δ), γ31 = λ/ (λ − δ) , γ32 = Lˆ y˜ (1 + λ)/λ, D ) N /λ y 2 ˆ γ33 = L(y y˜ C + N ), and 1 , 2 , 31 , 32 , 33 < ∞. To employ the small gain theorem, i.e., Lemma 2.7, the maximum step-size αmax should be chosen to satisfy where
(γ11 γ2 + γ12 ) (γ31 γ32 + γ33 ) < 1
(2.50)
which means that ⎤ ⎡ ⎛ ⎞ √ ˆ + η) α y ˜ μβ ˆ y ˜ L(1 N 3 − α y ˆ μ ¯ N max max ⎦ ⎝1 + ⎠ + + (1 − k −1 1 >⎣ D ) λ−δ λ yˆ μη ¯ yˆ μ¯ λ yˆ μ¯ (1 + λ) y˜ + y y˜ 2 C + N × Lˆ (2.51) λ−δ where two tunable parameters β ≥ 2 and η > 0 are given in Lemma 2.12, and other constants satisfy
0 < αmax
1 β+1 , ≤ min ¯ yˆ μβ ¯ y˜ L(1 + η)
(2.52)
αmax μ¯ yˆ β ≤λ 0, i = 1, . . . , N are step-sizes, the scalar Ai j (k) is non-negative weights, and the vector ∇ f i (xi (k)) is the gradient of the agent i’s objective function f i (x) at x = xi (k). Moreover, to facilitate the analysis of the optimization algorithm, we will make the following assumptions. Assumption 3.2 For any k = 0, 1, . . . , the mixing matrix A(k) = [Ai j (k)] ∈ R N ×N is defined by 1 , j ∈ Niin (k) out Ai j (k) = d j (k)+1 0, j∈ / Niin (k) or j = i out out where d out j (k) = |N j (k)|, (N j (k) = {i ∈ V|( j, i) ∈ E(k)}) is the out-degree of agent j at time k. Also, we can conclude that A is a column-stochastic matrix, i.e., N i=1 Ai j (k) = 1 for any j ∈ V.
Assumption 3.3 For each i = 1, . . . , N , the function f i is differentiable and has Lipschitz continuous gradients, i.e., there exists a constant L i ∈ (0, +∞) such that ||∇ f i (x) − ∇ f i (y)|| ≤ L i ||x − y|| for any x, y ∈ Rn
38
3 Achieving Linear Convergence of Distributed Optimization …
¯ As a consequence, f is L-smooth with L¯ = (1/N ) maxi∈V {L i } in the forthcoming analysis.
N i=1
L i . We will use Lˆ =
Assumption 3.4 Assumption 3.3: For each i = 1, . . . , N , its objective function f i : Rn → R satisfies f i (x) ≥ f i (y) + ∇ f i (y), x − y +
μi ||x − y||2 , for any x, y ∈ Rn 2
where μi ∈ [0, +∞) and at least one μi is nonzero. Moreover, we will use μˆ = N μi . maxi∈V {μi } and μ¯ = (1/N ) i=1 Remark 3.2 Throughout the chapter, we only study the case n = 1 since the analysis method in this chapter can be easily extended to multi-dimensional cases. Let ∇ F(x(k)) = [∇ f 1 (x1 (k)), ∇ f 2 (x2 (k)), . . . , ∇ f N (x N (k))]T , and x(k) = [x1 (k), x2 (k), . . . , x N (k)]T ∈ R N n . Then the Push-DIGing algorithm (3.2) can be rewritten as the following compact matrix-vector form: p(k + 1) = A(k)( p(k) − Dy(k)) s(k + 1) = A(k)s(k) S(k + 1) = diag{s(k + 1)}
(3.3)
−1
x(k + 1) = S (k + 1) p(k + 1) y(k + 1) = A(k)(y(k) + ∇ F(x(k + 1)) − ∇ F(x(k))) where D is a diagonal matrix and [D]ii = αi is the constant step-size of agent i. Here we make a simple algebraic transformation on the Push-DIGing algorithm (3.3), i.e. s(k + 1) = A(k)s(k) S(k + 1) = diag{s(k + 1)} x(k + 1) = R(k)(x(k) − Dh(k)) h(k + 1) = R(k)h(k) + S −1 (k + 1)A(k)(∇ F(x(k + 1)) − ∇ F(x(k)))
(3.4)
where R(k) = S −1 (k + 1)A(k)(S(k)), h(k) = S −1 (k)y(k). Noting that, under Assumptions 3.1 and 3.2, it is clear that each matrix S(k) is invertible, and denote ||S −1 ||max = supk≥0 ||S −1 (k)|| ≤ N N B0 , where B0 is the network connectivity constant defined in Assumption 3.1 and the relation follows from Corollary 2(b) of [31]. Also, we can prove that R(k) is actually a row-stochastic matrix (see Lemma 4 of [51]). In what follows, we will introduce the following notation A B (k) = A(k)A(k − 1) . . . A(k + 1 − B)
3.3 Optimization Algorithm
39
for any k = 0, 1, 2, . . . and B = 0, 1, 2, . . . with B ≤ k + 1 and the special case that A0 (k) = I N for any k and A B (k) = I N for any needed k < 0. A crucial property of the norm of I N − (1/N ) 1TN R B (k) is given in the following lemma, which comes from the properties of push-sum algorithm and can be obtained from references [52, 53]. Lemma 3.3 Let Assumptions 3.1 and 3.2 hold, and let B be an integer satisfying B ≥ B0 . Then, for any k = B − 1, B, . . . and any matrix x with appropriate dimensions, ˘ ≤ δ|| y˘ || where R B (k) = S −1 + 1 − if x = R B (k)y, we have ||x|| (k + 1)A B(k)(S(k N B0 (B−1)/N B0 ) < 1, Q 1 = 2N 1 + τ −N B0 / 1 − τ N B0 , B)) and δ = Q 1 (1 − τ τ = 1/N 2+N B0 . Proof We omit the proof of Lemma 3.3 since it is almost identical to that of Lemma 13 in [42].
3.3.2 Small Gain Theorem Before proceeding to the main proof idea, we need to define a few quantities which will use frequently in the analysis. Consider the following additional notations q(k) = x(k) − 1 N x ∗ , for any k = 0, 1, . . .
(3.5)
z(k) = ∇ F(x(k)) − ∇ F(x(k − 1)), for any k = 1, 2, . . .
(3.6)
where x ∗ ∈ R is the optimal solution of problem (3.1), and the initiation z(0) = 0 N . Consider the small gain theorem, the linear convergence of ||q(k)|| can be achieved by applying this theorem to the following circle of arrows: ||q|| → ||z|| → ||h|| →
||x|| ˘ → ||q|| ||y||
(3.7)
Remark 3.4 Recalling that q is the difference between local states and the global optimizer, z is a continuous difference of gradients, h is the intermediate variable obtained by a simple algebraic transformation of y, y is the estimation of gradient average across agents and x˘ is the consensus violation of local states. In a sense, as long as q is small, the continuous difference of gradients z is small since the gradients are close to zero in the vicinity of the optimal point. Then, as long as z is small, h is small by the framework of Push-DIGing algorithm (3.4). Furthermore, as long as h is small, the framework of Push-DIGing algorithm (3.4) means that y is small and x is close to consensual. Finally, as long as y is small and x is close to consensual, the Push-DIGing algorithm will drive q to zero and thus achieve the whole cycle.
40
3 Achieving Linear Convergence of Distributed Optimization …
Remark 3.5 We will use the small gain theorem based on the establishment of each arrow. Specially, we need to be aware of the prerequisite that the sequences ˘ λ,K , ||q||λ,K } are proved to be bounded. {||q||λ,K , ||z||λ,K , ||h||λ,K , ||y||λ,K , ||x|| Thus, we can conclude that all quantities in the above circle of arrows converge at an linear rate O(λk ). Furthermore, to apply the small gain theorem in the following analysis, we need to require that the product of gains γi is less than one, which will be achieved by seeking out an appropriate step-size matrix D. Now, we are ready to present the establishment of each arrow in the above circle (3.7).
3.3.3 Supporting Lemmas Before introducing the Lemma 3.6, we make some definitions only used in this lemma, which distinguish the notation used in the distributed optimization problem, algorithm and analysis. Problem (3.1) is redefined as follows with different notation, min g(x) =
N
gi (x), x ∈ Rn
(3.8)
i=1
where each function gi satisfies Assumptions 3.1 and 3.2. Consider the following inexact gradient descent on the function g: v(k + 1) = v(k) − θ
N
∇gi (u i (k)) + e(k)
(3.9)
i=1
where θ is the step-size and e(k) is an additive noise. Let v ∗ be the global optimal solution of g and define r (k) = ||v(k) − v ∗ ||, for any k = 0, 1, . . . Based on the above definitions, we next introduce Lemma 3.6. Lemma 3.6 Suppose that
β+1 1 θμβ ¯ 3 ≤ λ < 1, 0 < θ < min , 1− , ¯ + η) μ¯ 2(β + 1) μβ ¯ L(1
where β ≥ 2 and η > 0. Let Assumptions 3.1 and 3.2 hold for every function gi . For the problem (3.8), consider the sequences {r (k)} and {v(k)} be generated by the inexact gradient descent algorithm (3.9). Then, for any positive integer K , we have
3.3 Optimization Algorithm
|r |λ,K
41
√ N ˆ + η) μˆ 3 − θμ¯ L(1 1 ||e||λ,K + √ + β ≤ 2r0 + ||v − u i ||λ,K λθμ¯ η μ¯ μ¯ i=1 λ N (3.10)
Proof We refer the reader to the paper [49] for the proof of Lemma 3.6.
In the following lemma, we start with the first demonstration of the circle (3.7) which is grounded on the error bound of the inexact gradient descent algorithm in Lemma 3.6. Lemma 3.7 ({||x||, ˘ ||y||} → ||q||): Let Assumptions 3.2–3.4 hold. Also, assume that the parameters α and λ satisfy
1−
αμβ ¯ β+1 1 3 ≤ λ < 1, 0 < α < min , , ¯ + η) μ¯ 2(β + 1) μβ ¯ L(1
where β ≥ 2 and η > 0 are some adjustable parameters. Then, we have that for all K = 0, 1, . . . ,
√
¯ − x ∗ || + (1 + ||q||λ,K ≤ 2 N ||x(0)
√
⎞⎞ ⎛
ˆ + η) μˆ N L(1 ⎝ + β ⎠⎠ N ) ⎝1 + λ η μ¯ μ¯ ⎛
√
× ||x|| ˘ λ,K + C y,α ||y||λ,K
(3.11)
√ where the positive constant C y,α = 3 − αmax μ/λ ¯ μ¯ (1 − k −1 αmax /αmin D ), k D = √ (αmax = maxi∈V {αi }, αmin = mini∈V {αi } ) if α = αmax ; C y,α = 3 − α¯ μ/ ¯ √ N N 2 ¯ and α = α, ¯ α¯ = (1/N ) i=1 αi . α¯ N λμ¯ i=1 (αi − α) Proof Multiplying (1/N ) 1TN on the both sides of the last equation of (3.3) and noting that A(k) is a column-stochastic matrix, we then have y¯ (k + 1) −
1 T 1 1 ∇ F(x(k + 1)) = y¯ (k) − 1TN ∇ F(x(k)) N N N
(3.12)
where y¯ = (1/N ) 1TN y. Since yi (0) = ∇ f i (xi (0)), i ∈ V, it follows that y¯ (k) =
N 1 ∇ f i (xi (k)) N i=1
(3.13)
Let us consider the evolution of p(k). ¯ Multiplying (1/N ) 1TN at the both sides of the first equation of (3.3), one can obtain
42
3 Achieving Linear Convergence of Distributed Optimization …
p(k ¯ + 1) = p(k) ¯ −α
N 1 1 1 ∇ f i (xi (k)) + α 1TN − 1TN D y(k) N i=1 N N
(3.14)
Applying Lemma 3.6 to the recursion relation of (3.14), and 0 < α < 3/μ, ¯ we can achieve ⎞ ⎛
N ˆ μ ˆ L(1 + η) 1 + β⎠ || p¯ − x ∗ ||λ,K ≤ 2|| p(0) ¯ − x ∗ || + √ ⎝ || p¯ − xi ||λ,K η μ¯ μ¯ λ N i=1 √ λ,K 3 − αμ¯ 1T − 1T D y + (3.15) N N N λμ¯ α Let us analyze the summation in the second term of (3.15). Since x(k + 1) = S −1 (k + 1)s(k + 1) and x(0) = p(0), we obtain N
|| p¯ − xi ||λ,K =
i=1
N
||( p¯ − x) ¯ + (x¯ − xi )||λ,K
i=1
≤ N || p¯ − x|| ¯ λ,K +
N
||x¯ − xi ||λ,K
i=1
λ,K N 1 T 1 T 1 1 ≤ N Sx − x + ||x¯ − xi ||λ,K N N N N i=1 √ T λ,K λ,K = ||(1 N − s) x|| + N ||x|| ˘
(3.16)
¯ ¯ − 1 N p(k) ¯ + 1 N p(k) ¯ − 1N x ∗, Since q(k)=x(k) − 1 N x ∗ =x(k)−1 N x(k)+1 N x(k) it follows that q(k) = x(k) ˘ +
1 1 N (1 N − s(k))T x(k) + 1 N ( p(k) ¯ − x ∗) N
(3.17)
Together with (3.15) and (3.16), (3.17) further implies ⎞ √
ˆ + η) μˆ √ N L(1 + β ⎠ ||x|| ˘ λ,K + 2 N ||x(0) ≤ ⎝1 + ¯ − x ∗ || λ η μ¯ μ¯ ⎛ ⎞
ˆ + η) μˆ √ −1 1 L(1 λ,K + ⎝( N ) + + β ⎠ (1 N − s)T x λ η μ¯ μ¯ √ λ,K 3 − αμ¯ 1T − 1T D y + √ (3.18) N N α N λμ¯ ⎛
||q||λ,K
Finally, let us bound the third term in (3.18) as follows
3.3 Optimization Algorithm
43
1 T T (1 I − x(k) 1 ||(1 N − s(k))T x(k)|| = − s(k)) N N N ≤ N 2 − N ||x(k)|| ˘ ≤ N ||x(k)|| ˘
(3.19)
Substituting (3.19) into (3.18) yields the desired results. The proof is thus established. Next, to demonstrate the second arrows in the circle (3.7), we proceed by showing two lemmas. One is ||h|| → ||x|| ˘ and the other is ||h|| → ||y||. Lemma 3.8 √ (||h|| → ||x||): ˘ Let Assumptions 3.1 and 3.2 hold, and let λ be a positive B constant in ( δ, 1), where B is the constant given in Lemma 3.3. Then, we get ||x|| ˘ λ,K ≤
λ − λB ||D|| δ + Q ||h||λ,K 1 (λ B − δ) 1−λ λB λ−(t−1) ||x(t ˘ − 1)|| B (λ − δ) t=1 B
+
(3.20)
for all K = 0, 1, . . . , where Q 1 is the constant as defined in Lemma 3.3. Proof Note that x(k + 1) = R(k)(x(k) − Dh(k)). The results can be obtained by the same argument as that illustrated in the proof of Lemma 6 in [42], thus we can achieve (3.20). Lemma 3.9 (||h|| → ||y||): For any positive integer K , ||y||λ,K ≤ ||S||max ||h||λ,K holds, where ||S||max is the constant defined above. Proof Considering y(k) = S(k)h(k), we have ||y(k)|| = ||S(k)h(k)|| ≤ ||S||max ||h(k)||. The result immediately follows by applying the corresponding proprieties of Euclidean norm. The next lemma presents the establishment of the third arrows in the circle (3.7). Lemma 3.10 (||z||λ,K → ||h||λ,K ): Let Assumptions 3.1 and 3.2–3.4 hold,√let the B parameter δ be as given in Lemma 3.3, and let λ be a positive constant in ( δ, 1). Then, we have for all K = 0, 1, . . . , ˘ λ,K + ||h||λ,K ||h||λ,K ≤||h|| J ˘ λ,K ≤Q 1 ||S −1 ||max ||C||max ||h||
λ(1 − λ B )
z λ,K (λ B − δ)(1 − λ)
λB λ−(t−1) ||h(t − 1)|| (λ B − δ) t=1 B
+ ||h||λ,K ≤ J
||J R||max ||h||λ,K + ||S −1 ||max ||C||max ||z||λ,K λ
44
3 Achieving Linear Convergence of Distributed Optimization …
Specially, suppose that ||J R||max < λ < 1. Then, followed by above three items, we finally obtain ||h||λ,K ≤
||S −1 ||max ||A||max 1 +
+
λB (λ B −δ)
Q 1 λ(1−λ B ) (λ B −δ)(1−λ)
1 − ||J R||max /λ B λ−(t−1) ||h(t − 1)||
||z||λ,K
t=1
1 − ||J R||max /λ
(3.21)
˘ λ,K + ||h||λ,K Proof (i) ||h||λ,K ≤ ||h|| J . Since ||h(k)|| = ||(I N − (1/N )1TN + (1/N )1TN )h(k)||, it thus follows that 1 T 1 T ||h(k)|| ≤ (I N − 1 N )h(k) + 1 N h(k) N N ˘ ≤ ||h(k)|| + ||h(k)|| J
(3.22)
Multiplying the above relation with λ−k , k = 0, 1, . . ., it yields that ˘ + λ−k ||h(k)|| J λ−k ||h(k)|| ≤ λ−k ||h(k)||
(3.23)
Taking the maximum over k = 0, . . . , K on both side of (3.23), the desired result follows immediately. ˘ λ,K ≤ Q 1 ||S −1 ||max ||A||max λ(1 − λ B )/(λ B − δ)(1 − λ) z λ,K (ii) ||h|| B −(t−1) ˘ − 1)||. + λ B /(λ B − δ) ||h(t t=1 λ Using z(k) = ∇ F(x(k)) − ∇ F(x(k − 1)), the relation in Push-DIGing algorithm (3.4) is equivalent to h(k + 1) = R(k)h(k) + S −1 (k + 1)A(k)z(k + 1)
(3.24)
Then, using Lemma 3.3, for all k ≥ B − 1, we can achieve that ˘ + 1)|| ||h(k = || J˘h(k + 1)|| ≤ || J˘ R1 (k)S −1 (k)A(k − 1)z(k)|| + || J˘ R0 (k)S −1 (k + 1) A(k)z(k + 1)|| + || J˘ R B−1 (k)S −1 (k + 2 − B) A(k + 1 − B)z(k + 2 − B)|| + · · · + || J˘ R B (k)h(k + 1 − B)|| ≤ δ|| J˘h(k + 1 − B) || + Q 1 ||S −1 ||max ||A(k + 1 − B)z(k + 2 − B)|| + · · · + Q 1 ||S −1 ||max ||A(k − 1)z(k)|| + Q 1 ||S −1 ||max ||A(k)z(k + 1)||
3.3 Optimization Algorithm
45
≤ δ|| J˘h(k + 1 − B)|| + Q 1 ||S −1 ||max ||A||max
B
||z(k + 2 − t)||
(3.25)
t=1
where Q 1 is the constant defined in Lemma 3.3. Multiplying the above relation with λ−k , k = B − 1, B, . . . , we have λ
−(k+1)
˘ + 1)|| ≤ ||S −1 ||max ||A||max ||h(k +
B Q 1 −(k+2−t) λ ||z(k + 2 − t)|| t−1 λ t=1
δ −(k+1−B) ˘ λ ||h(k + 1 − B)|| λB
(3.26)
˘ + Here, we need to take maxk=0,1,...,K , which in turn requires a relation λ−(k+1) ||h(k 1)|| with k = −1, 1, . . . , B − 2. To get such a relation, without loss of generality, we assume the initial condition of (3.26) is ˘ + 1)|| ≤ λ−(k+1) ||h(k ˘ + 1)||, k = −1, 1, . . . , B λ−(k+1) ||h(k
(3.27)
Taking the maximum over k = −1, . . . , B − 2 on the both sides of (3.26) and the maximum over k = B − 1, . . . , K on the both sides of (3.27), by combining the obtained two relations, we immediately obtain ˘ λ,K ≤ ||h||
1 δ ˘ λ,K −B ||h|| + Q 1 ||S −1 ||max ||A||max ||z||λ,K +1−t B t−1 λ λ t=1 B
+
B
˘ − 1)|| λ−(t−1) ||h(t
t=1
≤
B 1 δ ˘ λ,K −1 λ,K || h|| + Q ||S || ||A|| ||z|| 1 max max t−1 λB λ t=1
+
B
˘ − 1)|| λ−(t−1) ||h(t
(3.28)
t=1
Therefore, we finally get ˘ λ,K ≤ Q 1 ||S −1 ||max ||A||max ||h||
λ(1 − λ B ) ||z||λ,K − δ)(1 − λ)
(λ B
λB ˘ − 1)|| λ−(t−1) ||h(t B (λ − δ) t=1 B
+
The desired result follows immediately.
(3.29)
46
3 Achieving Linear Convergence of Distributed Optimization …
(iii) ||h||λ,K ≤ (||J R||max /λ) ||h||λ,K + ||S −1 ||max ||A||max ||z||λ,K . J First, multiplying the relation (3.24) with J = (1/N ) 1TN and taking Euclidean norm at the both sides of the obtained equality and using the corresponding proprieties of Euclidean norm, we therefore have ||J h(k + 1)|| = ||J R(k)h(k) + J S −1 (k + 1)A(k)z(k + 1)|| ≤ ||J R(k)h(k)||+||J S −1 (k + 1)A(k)z(k + 1)|| ≤ ||J R||max ||h(k)||+||S −1 ||max ||A||max ||z(k + 1)||
(3.30)
where we have employed ||J || = || (1/N ) 1TN || = 1 to obtain the last inequality. Multiplying the above relation with λ−(k+1) , k = 0, 1, . . ., we immediately have λ−(k+1) ||h(k + 1)|| J ≤ ||S −1 ||max ||A||max λ−(k+1) ||z(k + 1)|| ||J R||max −k λ ||h(k)|| + λ
(3.31)
Taking maxk=0,1,...,K −1 {·} on the both sides of (3.31) gives ||h||λ,K ≤ J
||J R||max ||h||λ,K + ||S −1 ||max ||A||max ||z||λ,K λ
(3.32)
From the conditions (i), (ii) and (iii), we further to show that the rest proof of Lemma 3.10. By combining the preceding items (i), (ii) and (iii), we have for all K = 0, 1, . . . , and 0 < λ < 1 ||J R||max λB ˘ − 1)|| ||h||λ,K + B λ−(t−1) ||h(t λ (λ − δ) t=1 Q 1 λ(1 − λ B )
z λ,K + ||S −1 ||max ||A||max 1 + B (λ − δ)(1 − λ) B
||h||λ,K ≤
(3.33)
Rearranging the above formula and recalling ||J R||max < λ < 1, we finally have ||h||λ,K ≤
||S −1 ||max ||A||max 1 +
+ This completes the proof.
λB (λ B −δ)
Q 1 λ(1−λ B ) (λ B −δ)(1−λ)
1 − ||J R||max /λ B ˘ − 1)|| λ−(t−1) ||h(t
||z||λ,K
t=1
1 − ||J R||max /λ
The last arrow in the circle (3.7) demonstrated in the following Lemma is a simple consequence of the fact that the gradient of f is L-Lipschitz.
3.3 Optimization Algorithm
47
Lemma 3.11 (||q||λ,K → ||z||λ,K ): Let Assumption 3.3 holds. Then, we have for all K = 0, 1, . . . , and any 0 < λ < 1, ||z||
λ,K
1 ˆ ||q||λ,K ≤ L 1+ λ
Proof The proof procedure can imitate that of Lemma 5 in [42], and thus it is omitted.
3.4 Main Results Based on the circle (3.7) established in the previous section, next we will demonstrate a major result about the linear convergence rate estimate for the Push-DIGing algorithm with uncoordinated step-sizes over a time-varying directed graph sequence. Theorem 3.12 Let Assumptions 3.1 and 3.2–3.4, and Lemma 3.3 hold. Let αmax = maxi∈V {αi } (αmin = mini∈V {αi }) be the largest (smallest) positive entry element of the uncoordinated step-size matrix D such that ⎧⎡ ⎫⎞ ⎤ ⎛ (1 − δ) ⎪ ⎪ ⎪ ⎪ √ √ √ ⎪ ⎪ ⎢ 2 L(1 ⎟ ⎪ ⎪ ⎥ ⎜ ˆ ⎪ ⎪ + N )(1 + 4 N κ)(δ + Q (B − 1)) 1 ⎢ ⎟ ⎪ ⎪ ⎥ ⎜ ⎪ ⎪ ⎢ ⎟ ⎪ ⎪ ⎥ ⎜ ) (1 − δ)(1 − ||J R|| max ⎪ ⎪ ⎢× ⎟ ⎪ ⎪ ⎥ ⎜ ⎪ ⎪ −1 B ⎢ ⎟ ⎪ ⎪ ⎥ ⎜ ⎨⎢ ||S ||max ||A||max (B Q 1 + (1/2 − δ)) ⎥ 1 ⎬⎟ ⎜ (1 − δ) ⎢ ⎟ ⎥ ⎜ αmax ∈ ⎜0, min ⎢ − ⎥ , 2 L¯ ⎪⎟ √ √ √ ⎪ ⎢ ⎟ ⎪ ⎥ ⎜ ⎪ ˆ ⎪ ⎪ + Q 1 (B − 1)) ⎥ ⎢ 2 L(1 ⎟ ⎪ ⎜ ⎪ √+ N )(1−1+ 4 N κ)(δ ⎪ ⎪ −1 ⎢ ⎟ ⎪ ⎥ ⎜ ⎪ ⎪ ⎪ 4 3κ(1 − k D )||S||max ||S ||max | ⎢ ⎟ ⎪ ⎥ ⎜ ⎪ ⎪ ⎪ × ⎪ ⎪ ⎣ ⎠ ⎦ ⎝ −1 || B − δ)) ⎪ ⎪ ||S ||A|| (B Q + (1/2 ⎪ ⎪ max max 1 ⎩ ⎭ B ×||A||max (B Q 1 + 1/2 − δ) ˆ μ, where κ = L/ ¯ and k D = αmax /αmin is the condition number of the step-size matrix D. Suppose that the condition number k D is selected such that (λ B −δ)(1−||J R||max ) kD < 1 + √ 4 3κ||S||max ||S −1 ||max ||A||max (B Q 1 +(λ B − δ))−(λ B − δ)(1 − ||J R||max )
Then, the sequence x{k} generated by the Push-DIGing algorithm with uncoordinated step-sizes converges to 1 N x ∗ at a global linear rate O(λk ), where λ ∈ (0, 1) is given by ⎫ % & ⎪ & ⎪ 2 + 4 (C + H F) G K α B − H K + − H K Fα Fα ) ) (G (G max max max ⎪ ' ⎪ ,⎬ δ+ 2(C + H F) λ = max ( ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ μ¯ α ⎪ ⎪ ⎭ ⎩ 1 − max , J R max 3 ⎧ ⎪ ⎪ ⎪ ⎪ ⎨
48
3 Achieving Linear Convergence of Distributed Optimization …
with F = ||S −1 ||max √||A||max >√0 √ ˆ + N )(1 + 4 N κ)(δ + Q 1 (B − 1)) > 0 G = 2 L(1 C = 1 −√ ||J R||max > 0 H = −4 3κ(1 − k −1 D )||S||max < 0 K = B Q 1 ||S −1 ||max ||A||max > 0
(3.34)
Proof It is immediately obtained from Lemmas 3.7–3.11 that √ λ,K (i) ||q||λ,K ≤ γ ˘ λ,K + γ12||y|| + ω1 where ω1 = 2 N ||x(0) ¯ − x ∗ ||, 11 ||x|| √ √ ˆ + η)/η μ¯ + μβ/ γ11 = 1 + N 1+ N /λ L(1 ˆ μ¯ , and γ12 = √ 3 − αmax μ/λ ¯ μ¯ (1 − k −1 D ); (ii) ||x|| ˘ λ,K ≤ γ21 ||h||λ,K + ω21 where γ21 = δ + Q 1 λ − λ B / (1 − λ) × B −(t−1) ||D||/(λ B − δ) , ω21 = λ B /(λ B − δ) ||x(t ˘ − 1)||; t=1 λ (iii) ||y||λ,K ≤ γ22 ||h||λ,K + ω22 where γ22 = ||S||max and ω22 = 0; (iv) ||h||λ,K ≤ γ3 ||z||λ,K + ω3 where γ3 = ||S −1 ||max ||A||max (1 + Q 1 λ(1 − λ B )/(λ B − δ)(1 − λ))/ (1 − (||J R||max /λ)), (1 − ||J R||max /λ) / λ B /(λ B − δ) B −(t−1) ˘ − 1)||; = (1/ω3 ) t=1 λ ||h(t ˆ + 1/λ), ω4 = 0. (v) ||z||λ,K ≤ γ4 ||q||λ,K + ω4 where γ4 = L(1 Moreover, to use the small gain theorem, we must choose the step-size αmax such that (γ11 γ21 + γ12 γ22 )γ3 γ4 < 1 which means that ) ⎤ ⎡ √ * √ N ||D|| ˆL(1+η) μˆ + μ¯ β ⎥ ⎢ (1 + N ) 1 + η μ¯ λ (λ B − δ) ⎥ ⎢ √ ⎥ ⎢ ⎥ ⎢ 3 − αmax μ¯ Q 1 (λ − λ B ) −1 ⎢× δ + + (1 − k D )||S||max ) ⎥ ⎥ 0 and other constraint conditions on parameters that occur in Lemmas 3.7, 3.8, and 3.10 are stated as follows:
3.4 Main Results
49
0 < αmax < min
1−
3 1 β+1 , , ¯ + η) μ¯ μβ ¯ L(1
(3.37)
αμβ ¯ ≤λ 0 and u j (k) = u j and for + = 0, and the vector v j (k) ∈ ∂ f j y˜ j (k) . some u j = 0n if g j,ω j (k) x j (k − 1) Then, we address the optimization problem (without inequality constraints) by projecting z˜ j (k) onto a randomly selected component of their local constraint sets, and by using the standard distributed (sub)gradient descent algorithm asynchronously. h j (k) = PX j,Ω j (k) [˜z j (k)], p˜ j (k) = q j (k) − α j (k) 1 N , ∀ j ∈ i k ∪ Jk
(4.5)
where α j (k) is a diminishing step-size (random) determined by the update frequency of the agent, i.e., α j (k) = 1/Γ j (k), where Γ j (k) represents the number of updates performed by agent j until time k [16]; Ω j (k) is a random set extracted from the set
64
4 Asynchronous Broadcast-Based Distributed Optimization …
U j . Suppose that the random variable Ω j (k) is independent of the random broadcast process. However, agents other than i k and Jk do not generate their states for k ≥ 1, i.e., h j (k) = x j (k − 1) , p˜ j (k) = t j (k − 1) , ∀ j ∈ / i k ∪ Jk
(4.6)
Step 2: Based on the above scheme, we update xi (k) and ti (k) by using the dynamical averaging consensus process, i.e., xi (k) = (1 − Wiik )h i (k) + Wiik h ik (k) , ∀i ∈ Jk
(4.7a)
ti (k) = (1 − Wiik ) p˜ i (k) + Wiik p˜ ik (k) , ∀i ∈ Jk
(4.7b)
where Wiik > 0 is the weight from broadcast agent i k to agent i at time k ≥ 1. Besides, agents i k and i that do not acquire information from i k do not update their states, i.e., / Jk xi (k) = xi (k − 1) , ti (k) = ti (k − 1) , ∀i ∈
(4.8)
Once all agents reach an agreement, the state vectors (xi (k) , ti (k)) , i ∈ V at each agent asymptotically converges to a feasible point in . For a convenient expression of algorithm, for each k, we introduce a nonnegative matrix W (k) as / Jk Wii (k) = 1 − Wiik if i ∈ Jk , Wii (k) = 1 if i ∈ Wiik (k) = Wiik if i ∈ Jk , Wi j (k) = 0 if i ∈ / Jk , i = j
(4.9a) (4.9b)
Using W (k), for all k ≥ 1 and i, j ∈ V, the updates (4.6) are thus equivalently written as h j (k) = x j (k − 1) + hˆ j (k) χ{ j∈ik ∪Jk } p˜ j (k) = t j (k − 1) + pˆ j (k) χ{ j∈ik ∪Jk } xi (k) =
N
(4.10a) (4.10b)
Wi j (k) h j (k)
(4.10c)
Wi j (k) p˜ j (k)
(4.10d)
j=1
ti (k) =
N j=1
where
+ g x (k − 1) ˆh j (k) =PX j,Ω (k) x j (k − 1) − j,ω j (k) j u j (k) j ||u j (k) ||2
4.3 Broadcast-Based Optimization Algorithm
65
Algorithm 1 Distributed Asynchronous Broadcast-Based Optimization Algorithm over Unbalanced Directed Networks 1: Initialization: Each agent j ∈ V starts with x j (0) ∈ X j,l , ∀l ∈ U j , and t j (0) ∈ R N . 2: Set k = 0. 3: Local information exchange: At each time instant k, only one agent i k in the network broadcasts its own information and only a subset Jk ⊆ Niout receives the broadcast information because of k possible link failures. 4: Local variables updates: Each agent j ∈ i k ∪ Jk updates its intermediate state vectors as follows: + g j,ω j (k) (x j (k−1)) 5: y˜ j (k) = x j (k − 1) − u j (k), 2 u j (k) + f j ( y˜ j (k))−eTj t j (k−1) 6: z˜ j (k) = y˜ j (k) − v j (k), 2 1+v j (k) + T f j ( y˜ j (k))−e j t j (k−1) 7: z˜ j (k) = y˜ j (k) − v j (k), 2 1+v j (k) 8: h j (k) = PX j,Ω j (k) z˜ j (k) , p˜ j (k) = q j (k) − α j (k) 1 N , where ω j (k), u j (k), v j (k), Ω j (k) and α j (k) are respectively defined in (4.4) and (4.5). However, agents other than i k and Jk do not generate their states, i.e., 9: h j (k) = x j (k − 1) , p˜ j (k) = t j (k − 1) , ∀ j ∈ / i k ∪ Jk . 10: Dynamical averaging consensus: Each agent i ∈ Jk updates its main state vectors as follows: xi (k) = 1 − Wiik h i (k) + Wiik h ik (k), ti (k) = (1 − Wiik ) p˜ i (k) + Wiik p˜ ik (k). Besides, agents i k and i that do not acquire information from i k do not generate their states, i.e., 13: xi (k) = xi (k − 1) , ti (k) = ti (k − 1) , ∀i ∈ / Jk . 14: Set k = k + 1 and repeat. 15: Until a predefined stopping rule (e.g., a maximum iteration number) is satisfied.
11: 12:
−
fj
x j (k − 1) − g j,ω j (k)
−eTj t j (k − 1)
+
v j (k)
2 1 + v j (k)
+ u j (k) x j (k − 1)
u j (k) 2
− x j (k − 1)
and
pˆ j (k) =
fj
x j (k − 1) − g j,ω j (k)
−eTj t j (k − 1)
+
+ u j (k) x j (k − 1)
u j (k) 2
ej
2 − α j (k) 1 + v j (k)
Here, χP is the characteristic-event function, i.e., χP = 1 if condition P holds, and χP = 0, otherwise. Then, in the following, we summarize the proposed algorithms (4–10) in Algorithm 1.
66
4 Asynchronous Broadcast-Based Distributed Optimization …
Remark 4.1 According to the constraints in (4.3), we fully solve their inequality constraint function by using Polyak’s projection methods [26, 35], which is different from the constraint version of DGD in [7] because it directly uses Euclidean projection operators. Obviously, the projection is easy to perform only when the projection set has a relatively simple structure, such as a space or a half space. Otherwise, the sub-optimization problem of completing the Euclidean projection is inevitable. From this perspective, our algorithm requires less computational load per iteration. Remark 4.2 The updates (4)–(8) are related to the distributed random-fixed projection algorithm studied in [27]. Specifically, the work in [27] solved distributed convex optimization problem over time-varying unbalanced directed networks and addresses the unbalancedness by focusing on an epigraph form of the objective function. However, the algorithm in [27] performed synchronously and the step-size is coordinated. Unlike [27], the proposed algorithm concerns broadcast-based asynchronous strategies and possible link failures, and it is also applicable to uncoordinated step-sizes. In addition, our algorithm can still be extended to the time-varying unbalanced directed networks considered in [27]. It is worth mentioning that our method first executes optimization through a random projection step and a standard distributed (sub)gradient descent step, and then immediately follows the dynamical averaging. This is another difference compared with [27].
4.3.3 Assumptions and Lemmas In the following, the assumptions needed for convergence analysis of the optimization algorithm are presented. Assumption 4.1 [16] Suppose that W = [Wi j ] ∈ R N ×N is the weighted matrix of network G. Then, we possess that: (a) The unbalanced directed network G = {V, E} is strongly connected. And the possible link failures process in the network is i.i.d., that is, the link ( j, i) ∈ E is functioning independent of other links with probability pi j (0 < pi j < 1) at any time. (b) The weight Wi j > 0 when ( j, i) ∈ E and the diagonal entries Wii of W are positive for all i ∈ V. Remark 4.3 Assumption 4.1 is significant because it guarantees that each agent’s information reaches other agents frequently enough through broadcast strategy. This frequent information transmission ensures that the agent’s estimate can converge to a shared vector. Next, the assumption on two i.i.d. sequences are established. Assumption 4.2 The two sequences ω j (k) and Ω j (k) are i.i.d. and are uniformly distributed over {1, . . . , τ j } and {1, . . . , U j } for any j ∈ V, respectively. In
4.3 Broadcast-Based Optimization Algorithm
67
addition, the sequences ω j (k) and Ω j (k) are respectively independent of the index j. It is worth mentioning that matrix sequence {W (k)} is i.i.d. in light of Assumption 4.1. For any k ≥ 0, the random matrix W (k) is actually row-stochastic from the definition of (4.9). Based on this, the following lemma is then shown. Lemma 4.4 Each weighted matrix W (k) , k ≥ 0, of (4.9) is row-stochastic. Moreover, it can be described as (a) From Assumption 4.1, the expected weighted matrix W¯ = E [W (k)] is rowstochastic. (b) Denote W¯ = W¯ i j ∈ R N ×N . There is a constant κ > 0 such that ∀i, j ∈ V, if W¯ i j > 0, then W¯ i j > κ. Built on Lemma 4.4, the lemma which plays significant role in the convergence analysis of algorithm (4.10) is presented below. Lemma 4.5 (Perron vector [36]) Let Assumption 4.1 hold. Then, there is a left Perron eigenvector π = [π1 , . . . , π N ]T ∈ R N of W¯ satisfying lim W¯ k = 1 N π T , π T W¯ = π T , π T 1 N = 1, πi > 0, ∀i ∈ V
k→∞
After presenting the algorithm, we define Fk as the σ-algebra updated by the overall history of the algorithm up to time k, i.e., ∀k ≥ 1, Fk = {xi (0) ; i ∈ V} ∪ {i , J , ωi () , Ωi () ; i ∈ {i , J } , 1 ≤ ≤ k} with F0 = {xi (0) ; i ∈ V}. The following supermartingale convergence results for random variables are also employed. Lemma 4.6 ([37]) Suppose that {Ω, F, P} is a probability space and F0 ⊂ F1 ⊂ · · · is a sequence of σ-algebra of F. Let {v (k)}, {u (k)}, {a (k)} and {b (k)} be sequences of Fk -measurable nonnegative random variables such that for all k ≥ 0, E [v (k + 1) |Fk ] ≤ (1 + a (k)) v (k) − u (k) + b (k) , a.s. . . , u (k), a (0) , . . . , a (k), and where Fk is the collection v (0) , . . . , v (k), u (0) , . ∞ b (0) , . . . , b (k). Suppose that ∞ k=0 a (k) < ∞ and k=0 b (k) < ∞ a.s.. Then, we have limk→∞ v (k) = v for a random variable v ≥ 0 a.s., and ∞ k=0 u (k) < ∞ a.s..
4.4 Convergence Analysis This section is devoted to proving the convergence of distributed asynchronous broadcast-based random projection algorithm (4.10). To simplify notations, the fol-
68
4 Asynchronous Broadcast-Based Distributed Optimization …
lowing form of (4.3) is considered: min cT θ, θ ∈ , s.t. f i (θ) ≤ 0, g i (θ) 0τi
(4.11)
for i = 1, . . . , N , where θ = (x T , t T )T ∈ R N +n and c = (0nT , 1TN )T . Moreover, f i : R N +n → R is a convex function and gi : R N +n → Rτi is a vector of convex functions. T For any k ≥ 0, define θi (k) = (xi (k))T , (ti (k))T , i ∈ V. Then, algorithm (4.10) for solving (4.11) can be rewritten as two cases by the following: θi (k) =
N
Wi j (k) P j,Ω j (k) γ j (k) , i ∈ Jk
(4.12a)
Wi j (k) θ j (k − 1), i ∈ / Jk
(4.12b)
j=1
θi (k) =
N j=1
where i ∈ V and k ≥ 1. The states γ j (k) = ϕ j (k) − α j (k) c, ϕ j (k) = φ j (k) −
2 + + × f j φ j (k) j (k) / j (k) , φ j (k) = θ j (k − 1) − g j,ω j (k) θ j (k − 1)
2 N
ς j (k) / ς j (k) and j,Ω j (k) = X j,Ω j (k) × R for j ∈ i k ∪ Jk . In addition, the + + if f j φ j (k) > 0 and j (k) = j for some vector j (k) ∈ ∂ f j φ j (k) + + = 0. The vector ς j (k) ∈ ∂ g j,ω j (k) θ j (k − 1) if j = 0 N +n if f j φ j (k) + g θ (k − 1) > 0 and ς j (k) = ς j for some ς j = 0 N +n if j,ω j (k) j + g j,ω j (k) θ j (k − 1) = 0. Obviously, it is sufficient to demonstrate the convergence of (4.12). To present the convergence results, the following notations are introduced: j = {θ ∈ | f j (θ) ≤ 0, g j (θ) 0τi } 0 = 1 ∩ · · · ∩ N ∗ = {θ ∈ 0 |cT θ ≤ cT θ , ∀θ ∈ 0 }
(4.13)
Assumption 4.3 (Bounded (sub)gradients and solvability [27]) (a) The (sub)gradients j (k) and ς j (k) given in (4.12) are uniformly bounded over the set in (4.3). Namely, there exists a constant D > 0 such that
max j (k) , ς j (k) ≤ D, ∀ j ∈ V, ∀k ≥ 0 (b) The optimization problem in (4.11) is feasible and possesses a nonempty optimal solution set, i.e., 0 = ∅ and ∗ = ∅. Assumption 4.4 ([32]) There is a finite constant c˜ > 0 such that for all θ ∈ ∪ Nj=1 j , dist2 (θ, 0 ) ≤ c||[g ˜ i,ωi (θ)]+ ||2
4.4 Convergence Analysis
69
where ωi , i ∈ V, is a random variable that follows a positive probability distribution over the set {1, . . . , τi }. Remark 4.7 Note that Assumption 4.4 is satisfied when the set 0 is nonempty (For the case that the set 0 is empty, one gets dist2 (θ, 0 ) = +∞ and the scalar c˜ would not exist). This is a general result in distributed convex optimization and an extension result of [35] (Detailed analysis refers to [35]). When the set 0 is nonempty and the index set τi is finite for each agent i, Assumption 4.4 can be confirmed by the concept of convergence analysis using the alternative projection method investigated in [38]. Moreover, when gi,l (x) = dist(x, j,l,x ), l ∈ {1, . . . , τ j }, Assumption 4.4 can be verified if there exists a constant cˆ such that dist(x, 0 ) ≤ cˆ · max j∈V,l∈{1,...,τ j } dist(x, j,l,x ). For the consensus of algorithm (4.12), the following lemma is shown. Lemma 4.8 (Consensus [39]) Consider the following sequence: θi (k + 1) =
N
Ai j θ j (k) + i (k)
j=1
where the matrix A = [Ai j ] ∈ R N ×N has the same properties as the expected matrix N W¯ in Lemma 4.4. Further, let θ¯ (k) = i=1 πi θ i (k), where πi satisfies Lemma 4.5. If limk→∞ i (k) = 0, ∀i ∈ V, then limk→∞ θi (k) − θ¯ (k) = 0, ∀i ∈ V. Let E i (k) = {i ∈ Jk } be the event that agent i updates θi (k) at time k. The possible link failures process is i.i.d. and the sequence {i k } is i.i.d. with equidistribution over the set V. Thus, for each i ∈ V, the events E i (k) , k ≥ 0 are independent across time and possess the time-invariant probability distribution. Let ηi be defined as the probability of event E i (k), and note that ηi =
1 pi j , ∀i ∈ V j∈Niin N
(4.14)
where 0 < pi j < 1 is the probability given in Assumption 4.1(a). We next establish long term estimates for the step-size α j (k) in terms of η j . Lemma 4.9 ([16]) Recall that α j (k) = 1 Γ j (k) for all j and k. Define a scalar q ˆ satisfying 0 < q < 1/2. Then, there is a kˆ = k(q, N ) (large enough) such that a.s. for all k ≥ kˆ and j ∈ V (i) α j (k) ≤ (ii) α2j (k) ≤ (iii) α j (k) −
2 ; kη j 4N 2 k 2 pmin 2
; 1 ≤ kη j
2 k 3/2−q pmin2
,
where pmin = min(i, j)∈E pi j . The main convergence result for algorithm (4.12) is presented.
70
4 Asynchronous Broadcast-Based Distributed Optimization …
Theorem 4.10 Suppose that problem (4.1) exists a nonempty optimal set X ∗ and Assumptions 4.1–4.4 hold. The sequence {θi (k)} in (4.12) converges almost surely to the optimal solution in the set ∗ to problem (4.11). Proof See Sect. 4.4.2. Corollary 4.11 (Error bounds [25]) in Theorem 4.10 k Suppose that the conditions hold. Denote zˆ (k) = (1/υ (k)) l=1 (¯z (l) /l) and e (k) = cT zˆ (k) − θ∗ , where k υ (k) = l=1 (1/l). Then, it yields that 0 ≤ E [e (k)] ≤
D(k) , ηmin υ(k)
k
2 E θ¯ (k) − z¯ (l + 1) ≤
l=1
D(k) ϑ
where z¯ (l) is feasible and D (k) = exp
k l=1
a (l)
N i=1
N k
2 πi θi (1) − θ∗ + c2 b (l) πi ηi
+ cT z¯ (1) − θ∗ < ∞
l=1
i=1
Proof See Sect. 4.4.3. The next subsection is devoted to manifesting the convergence results in Theorem 4.10 and Corollary 4.11. Firstly, quite a few lemmas that are indispensable in the analysis of the distributed algorithm will be provided. Then, a detailed proof of Theorem 4.10 and Corollary 4.11 will be given.
4.4.1 Supporting Lemmas The following lemma relies on the performance of iterative projections, which means that the distance from a point to the intersection of convex sets monotonously decreases no matter how many times the projections are carried out. Lemma 4.12 (Iterative projection [33]) For any k ≥ 0, let {ρk } : R N → R be a sequence of convex functions and {Ω (k)} ⊆ R N be a sequence of convex closed sets. Suppose that {y (k)} ⊆ R N is updated by [ρk (y (k))]+ y (k + 1) = PΩ(k) y (k) − d (k) d (k)2 where the vector d (k) ∈ ∂ρk (y (k)) if ρk (y (k)) > 0 and d (k) = d, ∀d = 0 N , otherwise. We obtain that
2 [ρ0 (y (0))]+ 2
2
y (k) − zˆ ≤ y (0) − zˆ − d (0)2
4.4 Convergence Analysis
71
for any zˆ ∈ (Ω (0) ∩ · · · ∩ Ω (k − 1)) ∩ y|ρ j (y) ≤ 0, j = 0, . . . , k − 1 . The next lemma basically guarantees local feasibility. Lemma 4.13 (Feasibility [35]) Let γ j (k) be updated by Algorithm 1 with α j (k) ˆ satisfying Lemma 4.9. Then, for any 0 < q < 1/2, there is a kˆ = k(q, N ) (large ˆ z j ∈ R N +n and i ∈ V, the following inequality enough) such that a.s. for all k ≥ k, holds
+
2 g θ − 1) (k
j,ω j (k)
2
2
j
γ j (k) − ν ≤ θ j (k − 1) − ν −
2
ς j (k)
2 1 T − c z j − ν + a (k) z j − ν
kη j
1
ϕ j (k) − z j 2 + b (k) c2 + (4.15) 4υ 2 for any ν ∈ 0 , where 0 is defined in (4.13), a (k) = 2/(k 3/2−q pmin ) and b (k) = 2 2 2 4(1 + 4υ)N /(k pmin ) + a (k).
Proof Firstly, we capture the relation between γ j (k) − ν and ϕ j (k) −ν. For any z j ∈ R N +n , it follows from the definition of γ j (k) that
γ j (k) − ν 2 = ϕ j (k) − α j (k) c − ν 2
2 = ϕ j (k) − ν − 2α j (k) cT ϕ j (k) − cT z j + cT z j − cT ν + α2j (k) c2
(4.16)
where α j (k) , j ∈ V, are given in Assumption 4.2. Noticing that cT ϕ j (k) − cT z j + cT z j − cT ν ≥ − c ϕ j (k) − z j + cT z j − cT ν, we deduce that − 2α j (k) cT ϕ j (k) − cT z j + cT z j − cT ν
1
ϕ j (k) − z j 2 − 2α j (k) cT z j − cT ν ≤4υα2j (k) c2 + (4.17) 4υ
where ν > 0 and we use the basic inequality 2α j (k) c ϕ j (k) − z j ≤ 4υα2j (k)
2 c2 + ϕ j (k) − z j /4υ to obtain the above inequality. Substituting (4.17) into (4.16) yields that
γ j (k) − ν 2 ≤ ϕ j (k) − ν 2 − 2α j (k) cT z j − cT ν + 1 ϕ j (k) − z j 2 4υ + α2j (k) (1 + 4υ) c2 (4.18)
Second, we characterize the relation between ϕ j (k) − ν and θ j (k − 1) − ν . Define y (0) = θ j (k − 1), ρ0 (y) = g j,ω j (k) (y) and ρ1 (y) = f i (y), Ω (0) = R N and
72
4 Asynchronous Broadcast-Based Distributed Optimization …
Ω (1) = R N in Lemma 4.12. Since, ν ∈ 0 ⊆ (Ω (0) ∩ Ω (1)), both ρ0 (ν) ≤ 0 and ρ1 (ν) ≤ 0 are achieved. Under Lemma 4.12, it leads to
+
2 g θ − 1) (k
j,ω j (k) j
2
2
ϕ j (k) − ν ≤ θ j (k − 1) − ν −
2
ς j (k)
(4.19)
Substituting (4.19) into (4.18) gives us the following:
+
2 g θ − 1) (k
j,ω j (k) j
2
2
γ j (k) − ν ≤ θ j (k − 1) − ν −
2
ς j (k)
T T 2 − 2α j (k) c z j − c ν + α j (k) (1 + 4υ) c2
1
ϕ j (k) − z j 2 + 4υ
(4.20)
2 From Lemma 4.9(c), using a (k) = 2/(k 3/2−q pmin ), we have α j (k) = α j (k) − 1/ kη j + 1/ kη j and α j (k) − 1/kη j ≤ a (k). Then, for sufficiently large kˆ = ˆ k(q, N ), one achieves a.s. for all k ≥ kˆ that
− 2α j (k) cT z j − ν
2 1 T ≤− c z j − ν + a (k) z j − ν + a (k) c2 kη j
(4.21)
Replacing −2α j (k) cT z j − ν in (4.20) by (4.21) is exactly the inequality in Lemma 4.13. This completes the proof. The following lemma manifests that the algorithm (4.12) is stochastically decreasing. Lemma 4.14 (Stochastically decreasing [35]) Assume that Assumptions 4.1–4.3 hold. Let α j (k) satisfy Lemma 4.9 and z j (k) = P0 ϕ j (k) . Then, for any 0 < ˆ ˆ ν ∈ 0 q < 1/2, there is a kˆ = k(q, N ) (large enough) such that a.s. for all k ≥ k, and i ∈ V, the following inequality holds: E θi (k) − ν2 |Fk−1 ⎡ ⎤ N
2 ≤ (1 + a (k)) E ⎣ Wi j (k) θ j (k − 1) − ν |Fk−1 ⎦ j=1
⎤
+
2 g j,ω j (k) θ j (k − 1)
⎥ ⎢ − ηi E ⎣ Wi j (k) |Fk−1 ⎦
2
ς (k) j j=1 ⎡
N
4.4 Convergence Analysis
73
⎤ 1 − ηi E ⎣ Wi j (k) cT z j (k) − ν |Fk−1 ⎦ kη j j=1 ⎡ ⎤ N
η 2 i + ηi b (k) c2 + E⎣ Wi j (k) ϕ j (k) − z j (k) |Fk−1 ⎦ 4υ j=1 ⎡
N
(4.22)
Proof Let z j = z j (k) = P0 ϕ j (k) for j ∈ V and k ≥ 1. It follows from (4.15) that
+
2
2
2 g j,ω j (k) θ j (k − 1)
γ j (k) − ν ≤ θ j (k − 1) − ν −
ς j (k) 2
2 1 T − c z j (k) − ν + a (k) z j (k) − ν
kη j
1
ϕ j (k) − z j (k) 2 + b (k) c2 + (4.23) 4υ Since ν ∈ 0 , noticing (4.19), one gets
z j (k) − ν 2 ≤ θ j (k − 1) − ν 2
(4.24)
Substituting (4.24) into (4.23), we obtain that
+
2
2
2 g j,ω j (k) θ j (k − 1)
γ j (k) − ν ≤ (1 + a (k)) θ j (k − 1) − ν −
ς j (k) 2
1
ϕ j (k) − z j (k) 2 + b (k) c2 − 1 cT z j (k) − ν + 4υ kη j (4.25) Based on the convexity of ·2 , the row-stochastic matrix W (k) and the fact ν ∈ 0 ⊆ j,Ω j (k) , for i ∈ Jk , we derive θi (k) − ν2 ≤
N
2 Wi j (k) γ j (k) − ν
(4.26)
j=1
Combining with (4.25), and taking conditional expectation on Fk−1 and Jk jointly, for i ∈ Jk , it holds
74
4 Asynchronous Broadcast-Based Distributed Optimization …
E θi (k) − ν2 |Fk−1 , Jk ⎡ ⎤ N
2 ≤ (1 + a (k)) E ⎣ Wi j (k) θ j (k − 1) − ν |Fk−1 , Jk ⎦ j=1
⎡ ⎤
+
2 N g θ − 1) (k
j,ω j (k) j
⎢ ⎥ − E⎣ Wi j (k) |Fk−1 , Jk ⎦
2
ς (k) j j=1 ⎡ ⎤ N 1 Wi j (k) cT z j (k) − ν |Fk−1 , Jk ⎦ + b (k) c2 − E⎣ kη j j=1 ⎡ ⎤ N
2 1 ⎣ E + Wi j (k) ϕ j (k) − z j (k) |Fk−1 , Jk ⎦ 4υ j=1
(4.27)
For the case i ∈ / Jk , we can get E[θi (k) − ν2 |Fk−1 , Jk ] ⎡ ⎤ N
2 Wi j (k) θ j (k − 1) − ν |Fk−1 , Jk ⎦ ≤ (1 + a (k)) E ⎣
(4.28)
j=1
Combining the two cases (4.27) and (4.28), the result in Lemma 4.14 follows immediately. This completes the proof.
4.4.2 Proof of Theorem 4.10 Based on the lemmas established in the previous subsection, we now proceed to establish the proof of Theorem 4.10. By premultiplying πi on both sides of (4.22) and summing up on i from 1 to N , we obtain a.s. for any 0 < q < 1/2 and for all k ≥ kˆ that N
N πi E θi (k) − ν2 |Fk−1 − b (k) c2 πi ηi
i=1
≤ (1 + a (k))
N i=1
⎡ πi E ⎣
i=1
N j=1
⎤
2 Wi j (k) θ j (k − 1) − ν |Fk−1 ⎦
⎤
+
2 N g j,ω j (k) θ j (k − 1)
⎥ ⎢ − ηi πi E ⎣ Wi j (k) |Fk−1 ⎦
2
ς j (k) i=1 j=1 ⎡
N
4.4 Convergence Analysis
75
⎤ 1 − ηi πi E ⎣ Wi j (k) cT z j (k) − ν |Fk−1 ⎦ kη j i=1 j=1 ⎡ ⎤ N N
2 1 + ηi πi E ⎣ Wi j (k) ϕ j (k) − z j (k) |Fk−1 ⎦ 4υ i=1 j=1 ⎡
N
N
(4.29)
where ηi is given in (4.14). In the following, we will analyze some of conditional expectations in (4.29). From π T W¯ = π T (see Lemma 4.5), one can obtain N
⎡ πi E ⎣
i=1
=
N
⎤
2 Wi j (k) θ j (k − 1) − ν |Fk−1 ⎦
j=1
N N N
θ j (k − 1) − ν 2 πi W¯ i j = πi θi (k − 1) − ν2 j=1
i=1
(4.30)
i=1
2 From Assumptions 4.3(a) and 4.4, we have ς j (k) ≤ D 2 and dist2 θ j (k − 1) ,
+
2
0 ) ≤ c˜ g j,ω j (k) θ j (k − 1) , j ∈ V. Denoting ηmin = mini∈V {ηi }, it can be concluded that ⎤ ⎡
+
2 N N g j,ω j (k) θ j (k − 1)
⎥ ⎢ − ηi πi E ⎣ Wi j (k) |Fk−1 ⎦
2
ς j (k) i=1 j=1 ⎤ ⎡ N N 1 ≤− ηi πi E ⎣ Wi j (k) dist2 θ j (k − 1) , 0 |Fk−1 ⎦ cD ˜ 2 i=1 j=1 N ηmin ≤− πi θi (k − 1) − z i (k)2 cD ˜ 2 i=1
(4.31)
2 where we have utilized the fact that π T W¯ = π T . Noticing that ϕ j (k) − z j (k)
2 ≤ θ j (k − 1) − z j (k) , it follows that ⎡ ⎤ N N
2 1 ηi πi E ⎣ Wi j (k) ϕ j (k) − z j (k) |Fk−1 ⎦ 4υ i=1 j=1 ≤
N 1 πi θi (k − 1) − z i (k)2 4υ i=1
(4.32)
76
4 Asynchronous Broadcast-Based Distributed Optimization …
N Suppose that υ > cD ˜ 2 /4ηmin . Define ϑ = ηmin /cD ˜ 2 − 1/4υ and z¯ (k) = i=1 πi z i (k). Let ν = θ∗ for any arbitrary θ∗ ∈ ∗ . Combining (4.29)–(4.32), we have the following estimate: N
2 πi E θi (k) − θ∗ |Fk−1
i=1
≤ (1 + a (k)) −ϑ
N
N
2 ηmin T c z¯ (k) − θ∗ πi θi (k − 1) − θ∗ − k i=1
πi θi (k − 1) − z i (k)2 + b (k) c2
i=1
N
πi ηi
(4.33)
i=1
2 Since z i (k) ∈0 , we obtain cT (¯z (k) − θ∗ ) ≥ 0. Noticing that a (k) = 2/(k 3/2−q pmin ) 2 2 2 /(k p ) + a for 0 < q < 1/2, it can be verified that and b = 4(1 + 4υ)N (k) (k) min ∞ ∞ k=0 a (k) < ∞ and k=0 b (k) < ∞. Thus, in light of (4.33), all the conditions existed in Lemma and then we get the following two results: N 4.6 are established πi θi (k) − θ∗ 2 converges as k → ∞ for any θ∗ ∈ ∗ a.s.. Result 1: i=1 ∞ N T Result 2: 2Mηmin ∞ z (k) − θ∗ ) < ∞ a.s. and k=1 c (¯ k=1 i=1 πi −z i (k) +θi (k − 1)2 < ∞ a.s.. T In terms of Result 2, since ∞ k=1 (1/k) = ∞, we have lim k→∞ inf c z¯ (k) = T ∗ T ∗ c θ a.s. under the condition that c (¯z (k) − θ ) ≥ 0. As πi > 0 (see Lemma = 0 a.s. for i ∈ V. Letting θ¯ (k) = 4.5), N we obtain limk→∞ θi (k − 1) − z i (k) N π θ and recalling that z ¯ = (k) (k) i i i=1 πi z i (k), one gets lim k→∞
i=1
θ¯ (k − 1) − z¯ (k) = 0 a.s.. Then, {¯z (k) − θ∗ } converges a.s. as well according to Result 1. Moreover, it is immediately obtained from (4.12) that
θi (k) =
N
Wi j (k) θ j (k − 1) + ei (k) , i ∈ Jk
(4.34a)
Wi j (k) θ j (k − 1), i ∈ / Jk
(4.34b)
j=1
θi (k) =
N j=1
where ei (k) = Nj=1 Wi j (k) P j,Ω j (k) γ j (k) − θ j (k − 1) , i ∈ Jk . Taking conditional expectation on Fk−1 , we deduce for ∀i ∈ V that N E θi (k) |Fk−1 = W¯ i j θ j (k − 1) + ηi E ei (k) |Fk−1 j=1
By computation, one has
(4.35)
4.4 Convergence Analysis
77
ei (k) ≤
N
Wi j (k) P j,Ω j (k) ϕ j (k) − α j (k) c − z j (k) + z j (k) − θ j (k − 1)
j=1
≤2
N
Wi j (k) z j (k) − θ j (k − 1) + α j (k) c
(4.36)
j=1
where the first inequality is based on the convexity of || · ||, the nonexpansive property of the projection P j,Ω j (k) and the fact z j (k) ∈ 0 ⊆ j,Ω j (k) , and the last inequality follows from (4.19) for some ν = z j (k) ∈ 0 , ∀ j ∈ V. It yields from (4.36) and Result 2 that limk→∞ ei (k)
= 0 a.s.. Thus, we can directly rely on Lemma 4.8 to achieve that limk→∞ θi (k) − θ¯ (k) = 0 a.s. for with Result 1 and the row-stochasticity of W (k) yields that
i ∈ V. Together
θ¯ (k) − θ∗ is convergent a.s.. Considering limk→∞ inf cT z¯ (k) = cT θ∗ a.s., it implies that there exists a subse∗ ∗ quence {¯z (k )} that converges a.s. to some optimal ∈ point θ (0) , and it holds ∗ ∗ clearly that lim→∞ ¯z (k ) − θ (0) = 0. Since θ¯ (k) − θ (0) is a.s. convergent, it follows that limk→∞ ¯ (k) = θ∗ (0) a.s.. Finally, we notice that θi (k) −
z
∗ θ (0) ≤ θi (k) − θ¯ (k) + θ¯ (k) − z¯ (k) + ¯z (k) − θ∗ (0), where the three
terms θi (k) − θ¯ (k) , θ¯ (k) − z¯ (k) and ¯z (k) − θ∗ (0) converge to zero a.s.. Hence, there exists θ∗ (0) ∈ ∗ such that limk→∞ θi (k) − θ∗ = 0 a.s. for i ∈ V. This completes the proof.
4.4.3 Proof of Corollary 4.11 According to the results obtained in Theorem 4.10, the error bounds in Corollary 4.11 are then established. Taking expectation on both sides of (4.33) yields that E
N
2 πi θi (k) − θ∗
i=1
ηmin T 2 E c z¯ (k) − θ∗ ≤ (1 + a (k)) E πi θi (k − 1) − θ∗ − k i=1 + b (k) c
2
N
N
πi ηi
(4.37)
i=1
Denote Λt (k + 1) = (1 + a (k + 1)) Λt (k) and Λl (l) = 1 for any k ≥ l ≥ 1. Iteratively using (4.37) in that
78
4 Asynchronous Broadcast-Based Distributed Optimization …
E
N
N
2
2 ∗ ∗ πi θi (k) − θ − Λ1 (k) E πi θi (1) − θ
i=1
≤
k l=2
i=1
ηmin T E c z¯ (l) − θ∗ + b (l) c2 Λl (k) − πi ηi l i=1 N
(4.38)
which indicates that k ηmin
E cT z¯ (l) − θ∗
l N k N
∗ 2 2
≤Λ1 (k) πi θi (1) − θ + c πi ηi b (l) l=2
i=1
i=1
(4.39)
l=1
where (4.39) is derived from the fact that Λl (k) ≥ 1 and Λl (k) ≤ Λ1 (k). Moreover, k (1/l)E cT (¯z (l) − θ∗ ) ≤ D (k), since exp(x) ≥ 1 + x, ∀x ≥ 0, it holds that l=1
which easily brings about the first result in Corollary 4.11. Since z j (k) is feasible and 0 is convex, z¯ (k) is feasible as well. Then, we can similarly derive that N k l=1
D (k) πi E θi (l) − z i (l + 1)2 ≤ ϑ i=1
(4.40)
2 N We notice that E θ¯ (l) − z¯ (l + 1) ≤ i=1 πi E θi (l) − z i (l + 1)2 , which together with (4.40) yields the second result in Corollary 4.11. This completes the proof.
4.5 Numerical Examples In this section, three numerical experiments are presented to elucidate the effectiveness of the proposed distributed asynchronous broadcast-based random projection algorithm over unbalanced directed networks. Example 4.1 This numerical experiment originates from the facility location problem which is one of the classical problems in operation research. The corresponding convex optimization problem is [40]:
4.5 Numerical Examples
79
Fig. 4.1 A strongly connected and unbalanced directed network
min f (x) =
N
ai ||x − bi ||
i=1
s.t. x ∈ X, ||x − ci,1 || ≤ li,1 , ||x − ci,2 || ≤ li,2
(4.41)
N X i ⊆ Rn and each agent i has a local constraint set for i = 1, . . . , N , where X = ∩i=1 X i = X i,1 ∩ X i,2 . The parameters ai , bi , ci,1 , ci,2 , li,1 , li,2 and sets X i,1 , X i,2 in (4.41) are all properly selected, which ensure the randomness of the selection process and the solvability of (4.41). Assume that there are N = 5 agents connected by an unbalanced directed network manifested in Fig. 4.1. In addition, only one agent randomly wakes up at a time k (say agent i k ) and then broadcasts its own information to out-neighbors ) with probabilities 0 < pi j < 1 (possible link failures) for all (say agents j ∈ Niout k i, j = 1, . . . , 5. In this experiment, we set the step-size α j (k) = 1 Γ j (k) for all j = 1, . . . , 5 and k ≥ 0. Comparisons of the proposed algorithm, push-sum method [19] and asynchronous DDPG method [31] are shown in Figs. 4.2 and 4.3. It can be clearly obtained that the three randomly displayed agents’ states (see Fig. 4.2) converge to the same solution and the minimum cost (see Fig. 4.3) calculated by the proposed algorithm is the same as the other two proven effective methods. The results in Figs. 4.2 and 4.3 mean that the proposed distributed asynchronous broadcast-based optimization algorithm can successfully solve the constrained optimization problem even if the possible link failures exist in the unbalanced directed network.
Example 4.2 The second numerical experiment takes a distributed parameter estimation problem in large-scale wireless sensor networks into consideration [41]. To this aim, a set of N = 100 sensors is employed to address the following constrained optimization problem in a distributed manner for estimation:
4 Asynchronous Broadcast-Based Distributed Optimization …
Fig. 4.2 The evolutions of three randomly displayed agents’ states
1.8 The proposed algorithm Push-sum method AsynchronousDDPGmethod
1.6
Randomly displayed agent' states
80
1.4 1.2 1 0.8 0.6 0.4
0
500
1000 1500 2000 2500 3000 3500 4000 4500 5000
Time [step]
Fig. 4.3 Comparisons of the proposed algorithm, push-sum method [19] and asynchronous DDPG method [31]
3 The proposed algorithm Push-sum method Asynchronous DDPG method
Cost f (x i )
2.5
2
1.5
1
0.5
0
500
1000 1500 2000 2500 3000 3500 4000 4500 5000
Time [step]
N
min f (x) =
ψi j ∈Δi ϕi j x, δi j
1
i=1
s.t. g i (x) 0τi , x ∈ X =
N
Xi
(4.42)
i=1
for i = 1, . . . , N , where ϕi j (x, δi j ) is the Huber penalty function employed to eliminate the influence of outliers and is defined by ϕi j (x, δi j ) =
|x − δi j | − 1/2, |δi j − x| > 1 (1/2)(x − δi j )2 , |δi j − x| ≤ 1
(4.43)
and δi j is the observation. The notation N (ρ, σ) denotes a normal distribution with ρ is the mean value and σ is the covariance. The Huber penalty function are imple-
4.5 Numerical Examples
81
400 p ij =1
350
p ij
[0.5,0.8]
p ij
[0.1,0.3]
Residual L k
300 250 200 150 100 50 0
0.5
1
1.5
2
2.5
3
Time [step]
3.5
4
4.5 5 4 × 10
Fig. 4.4 The evolutions of the residual L (k) with pi j = 1, pi j ∈ [0.5, 0.8] and pi j ∈ [0.1, 0.3] for all i, j = 1, . . . , 100
mented by 100 measurements perturbed by additive white Gaussian noise. The first 90 measurements (denoted as ψi j , j = 1, . . . , 90) are set to be i.i.d. N (1, 1) while the remained 10 measurements (denoted as ψi j , j = 91, . . . , 100) are set to be i.i.d. N (1, 10). The collection of ψi j is represented by Δi = {ψi j , j = 1, . . . , 100}. In addition, assume that gi (x) = [gi,1 (x), . . . , gi,τi (x)]T ∈ Rτi (τi > 0 is a constant) is a vector of convex linear function and X i is a box constraint set, respectively. This experiment generates a strongly connected and unbalanced directed network with N = 100 agents using nearest-neighbor rules. The evolutions of the residual L (k) = f (x (k)) − f ∗ are shown in Fig. 4.4 with various probabilities pi j (0 < pi j < 1) for all i, j = 1, . . . , 100. Figure 4.4 indicates that the proposed algorithm converges fast when the probabilities pi j of the links in the network being activated are large. At the same time, a smaller probability may allow for a larger step-size (see Lemma 4.9(a)), which may accelerate the convergence of the algorithm to some extent after multiple iterations (see the evolutions of the red and green lines in Fig. 4.4). Example 4.3 The final experiment describes the scalability of the proposed algorithm and other methods to the directed networks, and presents the simulation results for a special unbalanced directed network which is depicted in Fig. 4.5 [42] (The special unbalanced directed network has been proven to be a pathological case of the push-sum method). Specifically, we simulate the proposed algorithm and two kinds of push-sum methods (The synchronous and broadcast-based asynchronous versions of push-sum methods) on this special unbalanced directed network for an increasing number of agents and summarize the number of iterations required to reach the approximate minimum value to the problem (4.42) in Example 4.2. In this experiment, we suppose that the Huber penalty function in (4.43) are implemented by 10 measurements which are set to be i.i.d. N (2, 1). Considering that the
82
4 Asynchronous Broadcast-Based Distributed Optimization …
Fig. 4.5 A special unbalanced directed network Table 4.1 The number of iterations required to reach the approximate minimum value ( f (x) − f ∗ ≤ 10−2 ) for the proposed algorithm and two kinds of push-sum methods Agents Our Asynchronous Push-sum push-sum 10 20 30 40
171 1154 6796 39742
112 1392 9789 168723
11 95 723 29604
push-sum method can only be applied to figure out the distributed unconstrained optimization problem, here we remove the inequality constraints and constraints set in the problem (4.42) to facilitate comparison. In Table 4.1, the number of iterations required to reach the approximate minimum value ( f (x) − f ∗ ≤ 10−2 ) to the problem (4.42) for the proposed algorithm and two kinds of push-sum methods (push-sum and asynchronous push-sum) are summarized. It can be seen that the number of iterations of two kinds of push-sum methods will increase significantly as the number of agents increases, which verifies that Fig. 4.5 does a pathological case of the pushsum method. More importantly, the proposed algorithm has good scalability for the directed network depicted in Fig. 4.5 when compared to the asynchronous push-sum method.
4.6 Conclusion In this chapter, we propose a new distributed asynchronous broadcast-based random projection algorithm that minimizes a sum of convex objective functions over an unbalanced directed network. Each agent in the network can only access its own objective function and local inequality constraint, and is constrained to a privately known convex set. Specifically, we discuss two general scenarios where the interactions over the whole network are confronted with possible link failures and the
4.6 Conclusion
83
step-sizes are uncoordinated. The almost sure convergence of the algorithm is rigorously established under some standard assumptions on agent’s weights and individual objectives. We further provide three extensive numerical experiments that verify the performance and convergence of the proposed algorithm.
References 1. A. Jadbabaie, J. Lin, A. Morse, Coordination of groups of mobile autonomous agents using nearest neighbor rules. IEEE Trans. Autom. Control 48(6), 988–1001 (2003) 2. A. Dimakis, S. Kar, M. Moura, A. Scaglione, Gossip algorithms for distributed signal processing. Proc. IEEE 98(11), 1847–1860 (2010) 3. S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends® Mach. Learn. 3(1), 1–122 (2011) 4. S. Zhu, C. Chen, W. Li, B. Yang, X. Guan, Distributed optimal consensus filter for target tracking in heterogeneous sensor networks. IEEE Trans. Cybern. 43(6), 806–811 (2013) 5. A. Abdessameud, I. Polushin, A. Tayebi, Distributed coordination of dynamical multi-agent systems under directed graphs and constrained information exchange. IEEE Trans. Autom. Control 62(4), 1668–1683 (2017) 6. P. Lin, W. Ren, Distributed subgradient projection algorithm for multi-agent optimization with nonidentical constraints and switching topologies, in Proceedings of the 51th IEEE Annual Conference on Decision and Control (2012), pp. 6813–6818 7. A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control 54(1), 48–61 (2009) 8. M. Zhu, S. Martinez, On distributed convex optimization under inequality and equality constraints. IEEE Trans. Autom. Control 57(1), 151–164 (2012) 9. N. Chatzipanagiotis, D. Dentcheva, M. Zavlanos, An augmented Lagrangian method for distributed optimization. Math. Program. 152(1–2), 405–434 (2015) 10. E. Wei, A. Ozdaglar, A. Jadbabaie, A distributed Newton method for network utility maximization-Part II: convergence. IEEE Trans. Autom. Control 58(9), 2176–2188 (2013) 11. A. Nedic, A. Ozdaglar, P. Parrilo, Constrained consensus and optimization in multi-agent networks. IEEE Trans. Autom. Control 55(4), 922–938 (2010) 12. S. Ram, A. Nedic, V. Veeravalli, Asynchronous gossip algorithms for stochastic optimization, in Proceedings of the 48th IEEE Conference on Decision and Control, https://doi.org/10.1109/ CDC.2009.5399485 13. K. Tsianos, S. Lawlor, M. Rabbat, Consensus-based distributed optimization: practical issues and applications in large-scale machine learning, in Proceedings of the 50th Allerton Conference on Communication, Control, and Computing, https://doi.org/10.1109/Allerton.2012.6483403 14. K. Yuan, Q. Ling, W. Yin, On the convergence of decentralized gradient descent. SIAM J. Optim. 26(3), 1835–1854 (2016) 15. J. Duchi, A. Agarwal, M. Wainwright, Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Trans. Autom. Control 57(1), 592–606 (2012) 16. A. Nedic, Asynchronous broadcast-based convex optimization over a network. IEEE Trans. Autom. Control 56(6), 1337–1351 (2011) 17. C. Xi, U. Khan, Distributed subgradient projection algorithm over directed graphs. IEEE Trans. Autom. Control 62(8), 3986–3992 (2017) 18. K. Cai, H. Ishii, Average consensus on general strongly connected digraphs. Automatica 48(11), 2750–2761 (2012) 19. A. Nedic, A. Olshevsky, Distributed optimization over time-varying directed graphs. IEEE Trans. Autom. Control 60(3), 601–615 (2015)
84
4 Asynchronous Broadcast-Based Distributed Optimization …
20. D. Kempe, A. Dobra, J. Gehrke, Gossip-based computation of aggregate information, in Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science, https:// doi.org/10.1109/SFCS.2003.1238221 21. I. Tsianos, S. Lawlor, M. Rabbat, Push-sum distributed dual averaging for convex optimization, in Proceedings of the 51st IEEE Conference on Decision and Control, https://doi.org/10.1109/ CDC.2012.6426375 22. S. Pu, W. Shi, J. Xu, A. Nedic, Push-pull gradient methods for distributed optimization in networks, in 2018 IEEE Conference on Decision and Control, https://doi.org/10.1109/CDC. 2018.8619047 23. F. Saadatniaki, R. Xin, U. Khan, Optimization over time-varying directed graphs with row and column-stochastic matrices, https://arxiv.xilesou.top/abs/1810.07393 24. C. Xi, V. Mai, E. Abed, U. Khan, Linear convergence in optimization over directed graphs with row-stochastic matrices. IEEE Trans. Autom. Control 63(10), 3558–3565 (2018) 25. K. You, R. Tempo, P. Xie, Distributed algorithms for robust convex optimization via the scenario approach. IEEE Trans. Autom. Control 64(3), 880–895 (2019) 26. B.T. Polyak, Random algorithms for solving convex inequalities. Stud. Comput. Math. 8, 409– 422 (2001) 27. P. Xie, K. You, R. Tempo, S. Song, C. Wu, Distributed convex optimization with inequality constraints over time-varying unbalanced digraphs. IEEE Trans. Autom. Control 63(12), 4331– 4337 (2018) 28. K. Srivastava, A. Nedic, Distributed asynchronous constrained stochastic optimization. IEEE J. Sel. Top. Signal Process. 5(4), 772–790 (2011) 29. F. Zanella, D. Varagnolo, A. Cenedese, G. Pillonetto, L. Schenato, Asynchronous NewtonRaphson consensus for distributed convex optimization. IFAC Proc. Vol. 45(26), 133–138 (2012) 30. S. Lee, A. Nedic, Asynchronous gossip-based random projection algorithms over networks. IEEE Trans. Autom. Control 61(4), 953–968 (2016) 31. I. Notarnicola, G. Notarstefano, Asynchronous distributed optimization via randomized dual proximal gradient. IEEE Trans. Autom. Control 62(5), 2095–2106 (2017) 32. E. Wei, A. Ozdaglar, On the O(1/k) convergence of asynchronous distributed alternating direction method of multipliers, in Proceedings of the 2013 IEEE Global Conference on Signal and Information Processing, https://doi.org/10.1109/GlobalSIP.2013.6736937 33. D. Bertsekas, Convex Optimization Algorithms (Athena Scientific Belmont, 2015) 34. S. Boyd, A. Ghosh, B. Prabhakar, D. Shah, Randomized gossip algorithms. IEEE Trans. Inf. Theory 52(6), 2508–2530 (2006) 35. A. Nedic, Random algorithms for convex minimization problems. Math. Program. 129(2), 225–253 (2011) 36. R. Horn, Matrix Analysis (Cambridge University Press, Cambridge, 1986) 37. B. Polyak, Introduction to Optimization (Optimization Software, Incorporation, Publications Division, New York, 1987) 38. L. Gubin, B. Polyak, E. Raik, The method of projections for finding the common point of convex sets. USSR Comput. Math. Math. Phys. 7(6), 1–24 (1967) 39. V. Mai, E. Abed, Distributed optimization over weighted directed graphs using row stochastic matrix, in 2016 American Control Conference, https://doi.org/10.1109/ACC.2016.7526803 40. F. Chudak, D. Shmoys, Improved approximation algorithms for the uncapacitated facility location problem. Math. Program. 1412(2), 180–194 (2004) 41. L. Xiao, S. Boyd, S. Lall, A scheme for robust distributed sensor fusion based on average consensus, in Fourth International Symposium on Information Processing in Sensor Networks, https://doi.org/10.1109/IPSN.2005.1440896 42. A. 
Nedic, A. Olshevsky, C. Uribe, Distributed Gaussian learning over time-varying directed graphs, in Proceedings of the 50th Asilomar Conference on Signals, Systems and Computers, https://doi.org/10.1109/ACSSC.2016.7869674
Chapter 5
Quantized Communication-Based Distributed Optimization over Time-Varying Directed Networks
5.1 Introduction Recent advances in networked control and distributed systems require the development of scalable algorithms that consider the decentralized characteristic of the problem and communication restrictions. The interest in solving distributed consensus optimization problems of multi-agent systems has been growing. Distributed consensus optimization problems are modeled as minimizing a global objective function by multiple agents over a network. The formulation of distributed consensus optimization has been paid extensive attention due to its widespread applications, e.g., large-scale machine learning [1, 2], model predictive control [3], cognitive networks [4, 5], source localization [6, 7], resource allocation or scheduling [8], message routing [9], distributed spectrum sensing [5], statistical inference and learning [6, 10–12]. In these applications, without need of putting all the parameters together which define the optimization problem, decentralized nodes that only have a local subcollection of such parameters, collaboratively achieve the minimizer of the global objective function. In light of commonly used distributed optimization algorithms, by introducing a large number of nodes with a certain ability of calculation and communication, a complex global optimization problem is spit and distributed to those nodes. Through locally calculating and communicating with neighboring nodes, global optimal solutions can be obtained. In the referred literature that have appeared so far, the (sub)gradient-based methods are broadly adopted to solve massive optimization problems in a distributed way. Tsitsiklis et al. in [13] first proposed a distributed optimization algorithm to minimize a common cost function, in which information exchange is asynchronous and computations are parallelized across multiple processors. In distributed computation, since consensus theory [14] allows agents to acquire global results by only taking local actions, it is very appropriate to implement distributed optimization algorithm. Based on the studies of consensus of multi-agent systems, Nedic and Ozdaglar in [15] proposed a consensus-based distributed (sub)gradient descent © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2020 H. Li et al., Distributed Optimization: Advances in Theories, Methods, and Applications, https://doi.org/10.1007/978-981-15-6109-2_5
85
86
5 Quantized Communication-Based Distributed Optimization …
method where each agent performed a consensus step and then a descent step along the local (sub)gradient direction. They showed that when using a diminishing step size, the method achieved the optimizer at a speed of O(1/k) for a class of convex Lipshitz and possibly nonsmooth objective functions [15]. The convergence rate of this algorithm matched that of the centralized (sub)gradient descent algorithm. Duchi et al. in [16] utilized a similar notion to develop a distributed dual averaging algorithm which steered the iterate to the optimal solution with a similar speed while (almost) independent of network size. Some extension works have been completed for cases under constraints [17, 18] and for cases in which the (sub)gradient is perturbed by noisy [19, 20]. However, most of the existing algorithms, such as[13–20], assume that the exchange of information between agents is bidirectional. Compared with undirected networks, distributed optimization problems over directed networks have wider applications. For example, agent may broadcast information at different power levels, which means that communication channel is in one direction, but not the other. Moreover, an algorithm should remain to converge when slow communication links are eliminated because of their decaying the whole network, which results in a directed network. Therefore, a directed topology is more flexible compared with an undirected topology. In order to achieve the overall goal, network topology and information transmitted in the network commonly play key roles. Therefore, we should consider their joint effects in the analysis as well as the construction of distributed optimization algorithms. Early consensus [14, 21–27] and optimization algorithms [13–19, 28–33] assume that agents can obtain the information flow from their neighboring agents without any distortion, that is, the exchange of information is not restricted in the link process. When state of agent is transmitted in a channel with the form of actual value, this is equivalent to the requirement that the channel has unlimited capacity (bandwidth) and the algorithms being executed have an infinite accuracy. Obviously, this is not the case as there are bandwidth and energy constraints in digital communication, which leads to unreliable communication channels, the limited capacity of the intelligent body and the constraints of the total cost. In practice, many applications may suffer from these communication limitations, e.g., to realize predictive control algorithms for large-scale systems by utilizing distributed optimization methods. Some systems may have very strong limitations in communication source such as underwater vehicle and low-cost unmanned aerial vehicles. In order to satisfy the limited communication data rate, the information to be exchanged between neighboring subsystems needs to be quantified first. The quantized information is encoded as a binary sequence and then transmitted along the digital communication channel at each time step. When a neighboring agent receives the binary sequence, it utilizes corresponding decoding scheme to recover the original information with a certain error. Li et al. [34] presented a distributed averaging consensus protocol by considering constraints of data transmission rate and information quantization, and provided an elegant method to reduce the number of transmitting bits along each channel. Zhang et al. 
[35] investigated a more complicated problem, i.e., distributed averaging consensus of multi-agent systems with time-varying directed balanced communication topologies and limited data transmission rate, and presented a rig-
5.1 Introduction
87
orous analysis for the consensus. A distributed (sub)gradient algorithm was first proposed by Yi and Hong in [36] with quantization information exchange over timevarying undirected networks. Note that quantization allows inexact iteration which affects the convergence and the gap between the algorithms’ final state and the true optimizer. From the above, it is more interesting and meaningful to design distributed consensus optimization algorithms by taking into consideration of data transmission rate constraints over time-varying directed networks. This chapter considers solving a class of optimization problems which are defined as minimizing the sum of all agents’ convex cost functions through a distributed computing platform. The communication networks with limited data transmission rate are assumed to be directed time-varying. At each time iteration, information with finite number of bits is transmitted between neighboring agents. The contribution of the chapter can be summarized as three folds. Firstly, a distributed (sub)gradient descent method is proposed to exactly solve unconstrained optimization problems by a progressive combination of quantization, dynamic encoding and decoding schemes. Secondly, we prove that all the quantizers are not saturated provided that the quantization levels at each time step satisfy certain mild conditions. How to characterize the minimal quantization level to ensure unsaturation of quantizers is also analyzed. Thirdly, exact consensus optimization analysis of multi-agent systems is presented. We find that one bit information exchange suffices to guarantee unsaturation of all quantizers and consensus optimization.
5.2 Preliminaries 5.2.1 Notation If there are no other statements, the vectors in this chapter are assumed to be columns. Given a real number a, |a| denotes its absolute value; a represents the maximum integer that is not larger than a while a the minimum integer that is not less than a. For a convex function f (x), its domain is denoted by dom( f ) and the set of its (sub)gradients at x is denoted by ∂ f (x), satisfying f (y) − f (x) ≥ g T (x)(y − x), ∀x, y ∈ dom( f ), g(x)∈ ∂ f (x).
5.2.2 Model of Optimization Problem The task of nodes in a network is to collectively minimize a convex objective function which is expressed as the sum of all agents’ individual (local) convex cost functions. Mathematically, the agents cooperate with each other to solve an unconstrained convex optimization problem expressed as follows:
88
5 Quantized Communication-Based Distributed Optimization …
min f (x) =
N
f i (x), x ∈ Rn
(5.1)
i=1
through local communication and local computation (in a distributed manner), where f i : Rn → R is a convex objective function privately held by agent i for i = 1, . . . , N , and x ∈ Rn is the global optimization variable. Let f ∗ indicate the ∗ optimal value of f , which is supposed to limited, and Nlet X be the∗set of all optif i (x) = f }. Throughout mizers (optimal solutions) of f , i.e.,X ∗ ={x ∈ Rn | i=1 the chapter, it is assumed that the optimal solution set X ∗ is compact and nonempty. Let x ∗ ∈ X ∗ denote an optimizer of problem (5.1).
5.2.3 Time-Varying Communication Networks Due to link failure or packet dropout, information transmission between agents in a network at some time instants may fail. In such cases, communication network used to describe information interaction is usually represented by a time-varying network sequence {G(k), k = 0, 1, . . .} which is based on an ideal directed network. We denote G = {V, EG , AG } as an ideal directed network where V = {1, . . . , N } is the agent (node, vertex) set, EG ⊆ V × V is the edge set and AG = [ai j ] ∈ R N ×N represents the adjacency matrix. Specifically, ai j > 0 if and only if there is a directed edge from agent j to agent i; ai j = 0, otherwise. We use the convention that the network G is simple, containing no self-loop or multiple edges, that is, aii = 0 for all i ∈ V. We indicate Ni+ and Ni− the set of in-neighbors and the set of out-neighbors of agent i, respectively, in other words, Ni+ = { j| j ∈ V, ai j > 0} and Ni− = { j| j ∈ V, a ji > 0}. A directed path from agent i 1 to agent i k in the network G is denoted by a sequence of edges (i 1 , i 2 ), (i 2 , i 3 ), . . . , (i t−1 , i t ) with (i j−1 , i j ) ∈ EG , j = 2, . . . , t. The network G is strongly connected if for any two distinct agents i, j ∈ V, there is a directed path from agent i to agent j. In this chapter, it is assumed that at each time instant k, the communication between agents is modeled by a time-varying directed network G(k) [35]. Similarly, the network G(k) ={V, EG (k) , AG (k) } is composed of the agent set V = {1, . . . , N }, the time-varying edge set EG(k) and the adjacency matrix AG(k) = [ai j (k)] ∈ R N ×N with ai j (k) ≥ 0. Here, ai j (k) > 0 if and only if ( j, i) ∈ EG(k) ; ai j (k) = 0, otherwise. It is also assumed that there is no self-loop or multiple edges in G(k), k = 0, 1, . . .. Similarly, at each time instant k, we respecEG(k) } and outtively define agent i ’s in-neighbor set Ni+ (k) = { j ∈ V, ( j, i) ∈ neighbor set Ni− (k) = { j ∈ V, (i, j) ∈ EG(k) } For agent i, di+ (k) = j∈Ni+ (k) ai j (k) and di− (k) = j∈Ni− (k) a ji (k) denote its in-degree and out-degree, respectively. If di+ (k) = di− (k), for all i ∈ V, k ≥ 0, the network G(k) is balanced. There exists a finite positive constant named the degree of the network G such that d ∗ = supk≥0 maxi∈V {di+ (k), di− (k)}. The Laplacian matrix LG(k) of the network G(k) is defined as LG(k) = DG(k) − AG(k) , where DG(k) = diag{d1+ (k), . . . , d N+ (k)} is a diag-
5.2 Preliminaries
89
onal of k directed networks {G(i), i = 1, 2, . . . , k} is denoted by k k matrix. The union k ˆ G(i) = {V, ∪ E , A ˆ , AG(k) ˆ )(V, EG(k) ˆ , i=1 i=1 G(k) }. Denote G(k) = (V, EG(k) i=1 G(i) ) the mirror network of the directed network G(k) = (V, E , A ) with the AG(k) G(k) G(k) ˆ ˜ ˜ edge set EG(k) = EG(k) ∪ EG(k) , where EG(k) is the reverse edge set of G(k) obtained ˆ ˆ by reversing the order of agents of all pairs in EG(k) . It is evident that G(k) is an N ×N = [aˆ i j (k)]∈ R is symmetric undirected network and the adjacency matrix AG(k) ˆ with aˆ i j (k) = aˆ ji (k) = (ai j (k) + a ji (k))/2. Let Ni+ = ∪k≥0 Ni+ (k).
5.2.4 Quantization, Encoding and Decoding Schemes It is noted that in digital communication networks, channel capacity is usually limited, that is, only information with finite data bits can be transmitted between agents at each time step. Therefore, constraints of data transmission rate in a communication network should be also considered when solving optimization problems through a distributed computing platform. Information needs to be quantized in advance and then the quantized information is encoded as a binary sequence at the side of each agent before sending. After the binary sequence is received by the neighboring agent, corresponding decoding scheme is utilized to resume the original information with a certain degree of error which is caused by the quantization process. Let us define the following uniform symmetric quantizer with finite but may be time-varying quantization levels as T Q kji (ζ) = q kji ζ 1 , q kji ζ 2 , . . . , q kji ζ n
(5.2)
where ζ = (ζ 1 , ζ 2 , . . . , ζ n )T ∈ Rn and ⎧ 0, ⎪ ⎪ ⎨ t, q tji (ζ m ) = T ji (k), ⎪ ⎪ ⎩ k −q ji (−ζ m ),
− 21 < ζ m < 21 t − 21 ≤ ζ m < t + 21 , t = 1, 2, . . . , T ji (k) ζ m ≥ T ji (k) + 21 ζ m ≤ − 21
(5.3)
It can be seen from (5.3) that the quantizer q kji (·) is a function which maps a real number to a finite set Γ , i.e., q kji (·) : R → Γ , where Γ = {0, ±t, t = 1, 2, . . . , T ji (k)}, and the number of quantization level of q kji (·) is 2T ji (k) + 1. If ζ∞ ≤ T ji (k) holds for all
( j, i) ∈ EG(t) and k ≥ 0, the quantization is not saturated, i.e., ||q kji (ζ) − ζ||∞ ≤ 1 2. Remark 5.1 To transmit quantized information using one-level quantizer Q kji (·), we need to send nlog 2 (2T ji (k) + 1) -bit information at each time step. At time k, for each connected communication channel ( j, i) ∈ EG(k) , the associated encoder Φ ji at the side of agent j is defined as
90
5 Quantized Communication-Based Distributed Optimization …
⎧ θ ξ (0) = 0 ⎪ ⎪
⎪ ji ⎨ s(k − 1)Δθji (k) + ξ θji (k − 1), if Δθji (k) is received by i at time k θ ξ ji (k) = otherwise ξ θji (k − 1), ⎪ ⎪ ⎪ ⎩ Δθ (k) = Q k s −1 (k − 1) θ (k) − ξ θ (k − 1) , k = 1, 2, . . . j ji ji ji (5.4) where ξ θji (k) and Δθji (k) are the internal state and the quantized output while θ j (k) is the input of encoder Φ ji at time k. s(k) is a common scaling function, which plays the role of zooming-in the difference signal θ j (k + 1) − ξ θji (k)) and will be specified in Sect. 5.3. At time k, at the receiver side of the communication channel ( j, i) ∈ EG(k) , each agent i ∈ N j− (k) obtains an appropriate estimation of agent j’s state x j (k) by decoder Ψi j which is designed as ⎧ ˆ ⎪ 0 ⎨ θ ji (0) =
s(k − 1)Δθji (k) + θˆ ji (k − 1), if Δθji (k) is received by i at time k otherwise, k = 1, 2, . . . θˆ ji (k − 1), (5.5) where θˆ ji (k) is the output of decoder Ψi j at time k. ˆ ⎪ ⎩ θ ji (k) =
Remark 5.2 Since we utilize s(k) to zoom-in the difference signal θ j (k + 1) − ξ θji (k) in encoder Φ ji to increase the accuracy of state estimation, the scaling function should gradually decay (limk→∞ s(k) = 0) to make the quantizer Q kji (·)being persistently activating such that each agent can receive more valuable (no-zero) information from its in-neighbors continuously. Furthermore, the scaling function should also be relatively large such that the quantizer would not be saturated. On the other hand, to exactly solve the optimization problem (5.1), the step-size s(k) needs to be non-summable and square summable, which will be stated in the following Assumption 5.1. From (5.4) and (5.5), we know that ξ θji (k) and θˆ ji (k) share the same dynamic at the two sides of each connected directed communication channel ( j, i) ∈ EG(k) for all k ≥ 0. This can ensure the following relation holds ξ θji (k) = θˆ ji (k), k = 0, 1, . . . ; i ∈ N j− (k), j ∈ V
(5.6)
Hence, for each connected communication channel, the same estimate for each agent’ state at the both sides of sender and receiver has been constructed. This is a key step in exactly solving the optimization problem (5.1) using distributed computing.
5.3 Distributed Optimization Algorithm This section is devoted to designing algorithms to exactly solve problem (5.1) in a distributed computing fashion. Two quantized algorithms are presented, one of which
5.3 Distributed Optimization Algorithm
91
cooperatively seeks an optimal solution x ∗ ∈ X ∗ and the other collectively converges to the optimal value f ∗ . In order to distinguish the two algorithms, we call them quantized (sub)gradient algorithm and quantized recursive algorithm, respectively.
5.3.1 Quantized (Sub)gradient Algorithm Each agent i ∈ V maintains a vector-value variable xi (k) ∈Rn , and the information to be exchanged between every connected channel is subjected to the dynamic encoding-decoding schemes (5.4) and (5.5) with the quantization rule (5.3). The dynamic of each agent i ∈ V is represented as xi (k + 1) = xi (k) + h [u i (k) − s (k) gi (xi (k))] , k = 0, 1, . . .
(5.7)
where 0 < h < 1/d ∗ is the control gain; s(k) is the step-size; u i (k) is the distributed control input of agent i ; gi (xi (k)) is the (sub)gradient of the local objective convex function f i (x) at x = xi (k). We construct a consensus protocol over a sequence of time-varying directed network {G(k), k = 0, 1, . . .}, as follows: Ni+ (k)
u i (k) =
Ni− (k)
ai j (k) xˆ ji (k) −
j=1
ai j (k) ξ xji (k),
j=1
k = 0, 1, . . . ; i ∈ V
(5.8)
Remark 5.3 Each agent i ∈ V needs to have knowledge of the following information to implement the distributed consensus protocol (5.8): the link weights associated with its in-neighbors and out-neighbors, and the outputs of decoders and the internal states of encoders are respectively associated with its in-neighbors and out-neighbors.
5.3.2 Quantized Recursive Algorithm To obtain the optimal value f ∗ of problem (5.1), we assign each agent i ∈ V a variable yi (k) ∈ R, which recursively updates using the dynamic encoding-decoding schemes (5.4) and (5.5) with the quantization rule (5.3) as follows: ⎧ y yi (k + 1) = vi (k)+ N ( f i (xi (k)) − f i (xi (k − 1))) ⎪ ⎪ ⎪ + − ⎨ N N i (k) i (k) y ai j (k) xˆ ji (k) − ai j (k) ξ xji (k) vi (k) = yi (k) + h ⎪ j=1 j=1 ⎪ ⎪ ⎩ yi (1) = N f i (xi (0))
(5.9)
92
5 Quantized Communication-Based Distributed Optimization …
where k = 1, 2, . . . . We utilize the one-step backward difference N ( f i (xi (k)) − f i (xi (k − 1))) as the local input in the purpose of steering each agent i to follow the dynamic sum of all agents’ local objective functions.
5.4 Main Results For the unconstrained convex optimization problem (5.1), we naturally put forward the following two questions: Can the multi-agent systems (5.7) and (5.9) over timevarying directed networks with limited data transmission rate cooperatively seek the optimal solution and the optimal value? If so, how to design parameters such that the quantizers associated with all the connected channels are not saturated. This section is dedicated to answer the two questions definitely. The following assumptions are necessary for the main results. Assumption 5.1 The step-size and the scaling function have the same definition and satisfy the following decaying conditions 0 < s(k + 1) < s(k),
∞
s(k) = ∞,
k=0
∞
s 2 (k) < ∞
k=0
Remark 5.4 Denote γ(k) = s(k + 1)/s(k) with γ0 = γ(0), s0 = s(0). Then, γ(k) satisfies 0 < γ(k) ≤ γ(k + 1) < 1 and 0 < γ −1 (k) ≤ γ0−1 . In light of s(k) being square summable, we have limk→∞ s(k) = 0. Assumption 5.2 The time-varying network sequence {G(k), k = 0, 1, . . .} is balanced. There exists a positive integer K 0 , such that j ∈ Ni+ (k) holds at least once in [k1 , k1 + K 0 ) for any time instant k1 ≥ 0 and for any agent j ∈ Ni+ , i ∈ V. Moreover, {G(k), k = 0, 1, . . .} is uniformly strongly connected, i.e., there exist a positive constant λ0 and a positive integer h 0 > 0 such that h0 ≥ λ0 > 0 inf λm k h0
m k ≥0
t+h 0 −1 where λth 0 = λ2 (LGˆ h0 ), Gˆth 0 is the minor network of Gth 0 , and Gth 0 = i=t G(i) t ,t = 0, 1, . . .. Namely, {G(k), k = 0, 1, . . .} is uniformly jointly connected over the intervals [m k h 0 + 1, (m k + 1)h 0 ], m k = 0, 1, . . .. that Assumption 5.3 There exist positive constants L , Cx and Cδ such supk≥0 LG(k) ≤ L , maxi∈V xi (0)∞ ≤ C x , and maxi, j∈V xi (0) − x j (0)∞ ≤ Cδ . Assumption 5.4 [36] The (sub)gradient of each local objective convex function f i (x) is uniformly bounded, i.e., there exists a positive constant C g such that sup max gi (xi )∞ ≤ C g , ∀xi ∈ dom ( f i ) k≥0
i∈V
5.4 Main Results
93
In the following subsections, we state the main results of the chapter.
5.4.1 Reaching Consensus Theorem 5.5 Let Assumptions 5.1–5.4 hold. Set a suitable control gain h > 0 such that ρh, ε1 ∈ (0, 1) and select a suitable parameter γ0 such that γ0−2h 0 ρh, ε1 ∈ (0, 1). For each agent i ∈ V and agent j ∈ Ni+ , if the number of quantization level T ji (k) of quantizer q ji (k) associated with the connected directed communication channel ( j, i) ∈ EG(k) at time t satisfies T ji (1) ≥
1 1 1 + 2hd ∗ C x + 2hd ∗ Cδ + hC g − s0 2
T ji (2) ≥
1 γ0
(5.10)
1 1 + 2hd ∗ [(1 − 2hd ∗ )Cδ + 2hd ∗ C x ] 2 s0 1 + h(1 + γ0−1 h)C g + γ0−1 hd ∗ − 2
T ji (k) ≥ T1 (h, s0 , ε1 , ε2 ), k = 3, 4, . . .
(5.11) (5.12)
then theconsensus of multi-agent systems (5.7) can be achieved, i.e., ∀i, j∈V, limk→∞ xi (k) − x j (k) = 0 and all the quantizers q ji (k) will never be saturated and the quantization errors are uniformly bounded. The expression of T1 (h, s0 , ε1 , ε2 ) is 1 T1 (h, s0 , ε1 , ε2 ) = M1 (h, s0 , ε1 , ε2 ) − +1 (5.13) 2 where M1 (h, s0 , ε1 , ε2 )
K 0 +1 1 = max 2γ1 0 , 21 γ0−K 0 −2 + M 2 1 + γ0−K 0 −1 + hC g γ0 −l + 2M (h, s0 , ε1 , ε2 )hd ∗ + 1 2
γ0−1 hd ∗
(5.14)
l=1
+ hC g
Here, the expression of M can be found in (5.41) and ρh,ε1 = 1 − 2hλ0 + 2h 0 l l l 2 l=2 h C 2h 0 L + ε1 h , ε1 > 0. Proof Noticing that G(k) is balanced (Assumption 5.2), according to the relation (5.6), the control input (5.8) can be equivalently rewritten as follows
94
5 Quantized Communication-Based Distributed Optimization …
u i (k) =
+ N i (k)
ai j (k)(x j (k) − xi (k)) i=1 + − N N i (k) i (k) ai j (k)λxji (k) − a ji (k)λixj (k) + j=1 j=1
(5.15)
where λxji (k) = xˆ ji (k) − x j (k) if j ∈ Ni+ , and λxji (k) = 0, otherwise. Noting λxji (k) ∈ Rn , ∀i, j ∈ V, we denote the diagonal matrix associated with λxji (k) by diag(λxji (k)) ∈ Rn×n . For the sake of simplicity, we give the following notations T X (k) = x1T (k) , x2T (k) , . . . , x NT (k) ∈ R N n G (k) = g1T (x1 (k)) , g2T (x2 (k)) , . . . , g TN (x N (k))T ∈ R N n Λ (k) = diag λxji (k) ∈ R N n×N n
(5.16)
By substituting control input (5.15) into (5.7), in line with the notations defined in (5.16), the iteration in matrix form is shown as follows X (k + 1) =(I N − hLG(k) ) ⊗ In X (k) − hs(k)G(k) − h[(LG(k) ⊗ In Λ(k)) − (LG(k) ⊗ In Λ(k))T ]1 N n
(5.17)
We know that 1TN LG(k) = 1TN LTG(k) = 0 (from Assumption 5.2). Then, we can derive (1TN ⊗ In )[(LG(k) ⊗ In Λ(k)) − (LG(k) ⊗ In Λ(k))T ]1 N n = (1TN LG(k) ⊗ In Λ(t))1 N n − (1TN LTG(k) ⊗ In ΛT (k))1 N n = 0
(5.18)
Let JN = (1/N )1 N 1TN , and define δ(k) = (I N − JN ) ⊗ In X (k), D = I N − JN . Using (5.18), it can be obtained from (5.17) that the closed-loop system with regard to δ(k) satisfies δ(k + 1) = (I N − hLG(k) ) ⊗ In δ(k) − hs(t)D ⊗ In G(k) − h [(LG(k) ⊗ In Λ(k)) − (LG(k) ⊗ In Λ(k))T ]1 N n
(5.19)
in which Λ(k) and G(k) are defined in (5.16). From (5.4), (5.17) and the definition of λxji (k), we have the following recursive expression ⎧ λ + k λ −1 ⎪ ⎨ M ji (k) − s(k)Q ji (s+ (k)M ji (k)), if j ∈ Ni (k + 1) i λxji (k + 1) = M λji (k), if j ∈ N +N(k+1) ⎪ i ⎩ 0, otherwise
(5.20)
where M λji (k) = λxji (k) + [hLG(k) ⊗ In δ(k) + hs(k)G(k) + h [(LG(k) ⊗ In Λ(k)) − (LG(k) ⊗ In Λ(k))T ]1 N n ]| j . Furthermore, let z ji (k) = s −1 (k)λxji (k), w(k) = s −1 (k)δ(k) and denote the diagonal matrix associated with z ji (k) by
5.4 Main Results
95
diag(z ji (k)) ∈ Rn×n . Similar to (5.16), we define Z (k) = diag z ji (k) ∈ R N n×N n This together with (5.20) gives ⎧ z + z k ⎪ ⎨ M ji (k) − Q ji (M ji (k)), if j ∈ Ni (k+ + 1) Ni z ji (k + 1) = γ −1 (k) M zji (k), if j ∈ N + (k+1) ⎪ i ⎩ 0, otherwise
(5.21)
where M zji (k) = z ji (k) + [hLG(k) ⊗ In w(k) + hG(k) + h [(LG(k) ⊗ In Z (k)) − (LG(k) ⊗ In Z (k))T ]1 N n ]| j . For any i ∈ V, j ∈ Ni+ (k), k ≥ 0, define the quantization error as Δ¯ ji (k) = Q kji [M zji (k)] − M zji (k) At first, we will prove that if relation in (5.10) holds, there is no quantizer that is saturated at the initial time instant. Evidently, for any k ≥ 0, we have ||LG(k) ||∞ ≤ 2d ∗ . Then, from Assumption 5.3, we get z M ji (0)
∞
≤ s0−1 x j (0)∞ + s0−1 h LG(0) ∞ δ(0)∞ + h g j (x j (0))∞ + h [(LG(0) ⊗ In Z (0)) −(LG(0) ⊗ In Z (0))T ]1 N n ∞ ≤ s0−1 C x + s0−1 h2d ∗ Cδ + hC g + s0−1 h2d ∗ C x = s0−1 [(1 + 2hd ∗ )C x + 2hd ∗ Cδ ] + hC g
Under inequality (5.10), one obtains ||M zji (0)||∞ < K ji (1) + 1/2, which combines with (5.3) further implies 1 max Δ¯ i j (0)∞ ≤ i, j∈V 2 Thus, at time k = 0, no quantizer is saturated. It is noted that [(LG(1) ⊗ In Z (1)) − (LG(1) ⊗ In Z (1))T ]1 N n ∞ d∗ ≤ diag ai j (1)z ji (1) ∞ + diag a ji (1)z i j (1) ∞ ≤ γ0
(5.22)
It follows from Assumption 5.2 and (5.19) that δ(1)∞ = I N − hLG(0) ∞ δ(0)∞ + hs0 G(0)∞ + h [(LG(0) ⊗ In Λ(0)) − (LG(0) ⊗ In Λ(0))T ]1 N n ∞ ∗
∗
≤(1 − 2hd )Cδ + hs0 C g + 2hd C x Using (5.22), (5.23) and Assumption 5.2, if relation in (5.11) holds, we have
(5.23)
96
5 Quantized Communication-Based Distributed Optimization …
z M ji (1)
∞
≤z ji (1)∞ + s0−1 γ0−1 h LG(1) ∞ δ(1)∞ + h g j (x j (1))∞ + h [(LG(1) ⊗ In Z (1)) − (LG(1) ⊗ In Z (1))T ]1 N n ∞ 1 ≤ γ0 + hC g + γ0−1 hd ∗ 2 + s0−1 γ0−1 h2d ∗ [(1 − 2hd ∗ )Cδ + hs0 C g + 2hd ∗ C x ]
1 + 2hd ∗ s0−1 [(1 − 2hd ∗ )Cδ + 2hd ∗ C x ] =γ0−1 2 + h(1 + γ0−1 h)C g + γ0−1 hd ∗ 1 ≤K ji (2) + 2
This together with (5.3) leads to 1 max Δ¯ i j (1)∞ ≤ i, j∈V 2 Hence, at time k = 1, no quantizer is saturated. Supposing that there is no quantizer is saturated at time t = 0, 1, . . . , k, we can show that, if relation (5.12) holds, at time k + 1, no quantizer is saturated by induction method. Namely, we have for any i ∈ V and j ∈ Ni+ , that z M ji (k + 1) ∞ = z ji (k + 1) + {hLG(k+1) ⊗ In w(k + 1) + hG(k + 1) +h[(LG(k+1) ⊗ In Z (k + 1)) − (LG(k+1) ⊗ In Z (k + 1))T ]1 N n }| j ∞ ≤ K ji (k + 2) + 21
Let z¯ i j (t) =
ai j (t)z ji (t), if j ∈ Ni+ (t), t = 1, . . . , k + 1 0, otherwise
Denoting diag(¯z i j (t)) ∈ Rn×n as the diagonal matrix of z¯ i j (t), by computation, one has L G(t) ⊗ In Z (t) = Z¯ (t) = diag z¯ i j (t) ∈ R N n×N n Then by Assumption 5.1, we can derive from (5.21) ||¯z i j (t)||∞ ≤
ai j (t) 2γ0
∗ (LG(t) ⊗ In Z (t))1 N n ≤ d ∞ 2γ0 ∗ (LG(t) ⊗ In Z (t))T 1 N n ≤ d , t = 1, 2, . . . , k + 1 ∞ 2γ0
(5.24)
5.4 Main Results
97
Moreover, by the definition of z ji (k + 1) and Assumption 5.2, we can obtain z ji (k + 1)
∞
1 ≤ max , s −1 (k + 1)xˆ ji (τ kji ) − x j (k + 1)∞ 2γ0
(5.25)
where τ kji = max{k1 ≤ t, j ∈ Ni+ (k1 )}, k − τ kji ≤ K 0 . After that, by the definition
of the decoder Ψi j , for any j ∈ Ni+ Ni+ (k + 1), we obtain −1 k s (τ − 1)(x j (τ k ) − xˆ ji (τ k )) ≤ 1 ji ji ji ∞ 2
(5.26)
Define w(k) = s −1 (k)δ(k). The proof of boundedness of w(k) as follows, which implies that the consensus of multi-agent systems (5.7) can be achieved, i.e., lim xi (k) − x j (k) = 0, ∀i, j ∈ V
k→∞
(5.27)
Proof Since w (k) = s −1 (k) δ (k), one can obtain from (5.19) that w((m k + 1)h 0 ) =Φ((m k + 1)h 0 , m k h 0 )w(m k h 0 ) +
(m k +1)h 0 −1
hγ −1 ( j)Φ((m k + 1)h 0 − 1, j)
j=m k h 0
×
L G( j) ⊗ In Z ( j)
T
− L G( j) ⊗ In Z ( j) 1 N n − D ⊗ In G ( j) (5.28)
where Φ(i, i) = I N n and Φ(n + 1, i) = γ −1 (n)(I N − hLG(n) ) ⊗ In Φ(n, i). It yields that w((m k + 1)h 0 )2 = I1 + I2 + h 2 Im k h (5.29) where I1 = w T (m k h 0 )Φ T ((m k + 1)h 0 , m k h 0 )Φ((m k + 1)h 0 , m k h 0 )w(m k h 0 ) I2 = 2w T (m k h 0 )Φ T ((m k + 1)h 0 , m k h 0 ) ⎫ ⎧ (m +1)h −1 k 0 ⎪ ⎪ −1 ⎪ ⎪ hγ ( j)Φ((m + 1)h − 1, j) ⎬ ⎨ k 0 j=m k h 0 × T ⎪ ⎪ ⎪ ⎭ ⎩ ×[((L G( j) ⊗ In Z ( j)) − (L G( j) ⊗ In Z ( j)))1 N n ⎪ −D ⊗ In G( j)] and
98
5 Quantized Communication-Based Distributed Optimization …
Im k h
⎡⎧ ⎫ ⎤ (m k +1)h 0 −1 −1 ⎪ ⎪ ⎪ ⎪ γ ( j)Φ((m k + 1)h 0 − 1, j) ⎬ ⎥ ⎢⎨ ⎥ ⎢ j=m k h 0 ⎥ ⎢⎪ T ⎪ ×[((L ⊗ I Z ( j)) − (L ⊗ I Z ( j)))1 ⎢⎪ G( j) n G( j) n N n ⎪ ⎭ ⎥ ⎥ ⎢⎩ ⎢ −D ⊗ I G( j)] ⎫⎥ = ⎢ ⎧ (m k +1)hn0 −1 ⎥ (5.30) ⎥ ⎢ ⎪ ⎪ −1 ⎪ ⎢ ⎪ γ ( j)Φ((m + 1)h − 1, j) ⎬⎥ k 0 ⎥ ⎢ ⎨ j=m k h 0 ⎥ ⎢× T ⎦ ⎣ ⎪ ⎪ ×[((L ⊗ I Z ( j)) − (L ⊗ I Z ( j)))1 G( j) n G( j) n N n ⎪ ⎪ ⎩ ⎭ −D ⊗ In G( j)]
It can be concluded that Φ((m k + 1)h 0 , m k h 0 ) =
() (m k +1)h 0 −1 k=m k h 0
γ
−1
* (k)(I N − hLG(t) ) ⊗ In
(5.31)
Based on Assumption 5.1, it follows that (m k +1)h 0 −1 k=m k h 0
(L G(k) + L TG(k) ) 2
(m k +1)h 0 −1
=
L G(k) + =L (m k +1)h 0 −1
k=m k h 0
=L G+(mk+1)h0 −1
+ G(k)
mk h0
k=m k h 0
which together with (A4) and Assumption 5.2 gives T Φ ((m k + 1)h 0 , m k h 0 )Φ((m k + 1)h 0 , m k h 0 )
2h 0 l l ≤ γ0 −2h 0 1 − 2hλ0 + h C2h 0 L l
(5.32)
l=2
In view of 2x T y ≤ εx T x + ε−1 y T y, x, y ∈ Rn letting ε = ε1 > 0, we have from (5.31) that , 2h 0 l h l C2h L l w(m k h 0 )2 (5.33) I1 ≤ γ0 −2h 0 1 − 2hλ0 + 0 l=2
and I2 ≤ ε1 h 2 γ0 −2h 0 w(m k h 0 )2 , 2h 0 −1 l l l + ε1 1 − 2hλ0 + h C2h 0 L Im k h
(5.34)
l=2
Due to D = I N − (1 N )1 N 1TN , we have ||D|| = 1. Then, in view of ||x||∞ ≤ ||x||, ∀x ∈ R N n by Assumption 5.3, one has D ⊗ In G(k)2∞ ≤ C g2 Combining (5.24), (5.30) and (5.35), we obtain
(5.35)
5.4 Main Results
99
Im k h ≤γ0 −2 N n(γ0−1 d ∗ + C g )2 ×
2h 0 −2
(h 0 − | j − (h 0 − 1)|)γ0 j−2
j=0
j
h l C lj L l
(5.36)
l=0
Therefore, we have the following inequality holds. w((m k + 1)h 0 )2 ≤(γ0−2h 0 ρh, ε1 )m k w(h 0 )2 mk 2h 0 1 − (γ0−2h 0 ρh, ε1 ) −1 −2 l l l 2 −2 + h C2h 0 L ) + h γ0 ε1 γ0 (1 − 2hλ0 + 1 − γ0−2h 0 ρh, ε1 l=2 × N n(γ0−1 d ∗ + C g )2 ×
2h 0 −2
(h 0 − | j − (h 0 − 1)|)γ0 j−2
j=0
j
h l C lj L l
(5.37)
l=0
2h 0 l l where ρh,ε1 = 1 − 2hλ0 + l=2 h C2h 0 L l + ε1 h 2 , ∀ε1 > 0. According to (5.19) and w(k) = s −1 (k)δ(k), it yields that w(k + 1)2 ρh,ε ≤ 22 w(k)2 + ε−1 2 γ0 × (LG(k) ⊗ In Z (k)) − (LG(k) ⊗ In Z (k))T 1 N n −D ⊗ In G(k)2
(5.38)
where ρh,ε2 = 1 + 2h L + h 2 L 2 + ε2 h 2 , ∀ε2 > 0. From ||LG(k) ⊗ In Z (k)||∞ ≤ s0−1 d ∗ C x and (LG(k) ⊗ In Z (k)) − (LG(k) ⊗ In Z (k))T 1 N n − D ⊗ In G(k)2 ≤N n(γ0−1 d ∗ + C g )2
(5.39)
thus we obtain w(h 0 )2 h 0 +1 −2 ≤(γ0 −2 ρh,ε2 )h 0 s0−2 N nCδ2 + ε−1 N n(4s0−2 d ∗ 2 C x2 + C g2 ) 2 (γ0 ρh, ε2 ) h γ0 −2 ρh, ε2 1 − (γ0 −2 ρh, ε2 ) 0 −1 ∗ 2 + ε−1 (5.40) 2 N n(γ0 d + C g ) 1 − γ0 −2 ρh, ε2
100
5 Quantized Communication-Based Distributed Optimization …
- . For any given k ≥ 0, by defining m¯ k = k h 0 , we have 0 ≤ k − m¯ k h 0 ≤ h 0 . Letting 1/2h γ0 ∈ (ρh,ε1 0 , 1), from (5.37), (5.38), (5.39) and (5.40), we have w(k + 1)2
2h N nCδ2 ρh,ε0 2 ≤ + 2 4h 0 s γ 0 0
2h +1
N nρh,ε0 ε2 γ0
2 4h 0 +2
4d ∗ 2 C x2 s02
+ C g2
h +1 h 2 N nρ 0 2 [1−(γ0 −2 ρh, ε2 ) 0 ] ρh, ε1 m¯ t −1 + C g ) ε2 γh,ε + 2h 0 +2 (1−γ −2 ρ 2h 0 γ0 0 0 h, ε2 ) −1 −2h 0 −2 −1 ∗ 0 2 + N nρh, ε1 ρhh,ε ε γ (γ d + C ) 0 g 0 2 1 j 2h m¯ −1 0 −2 1−(γ0 −2h 0 ρh, ε1 ) t j−2 × (h 0 − | j − (h 0 − 1)|)γ0 h l C lj L l 1−γ −2h 0 ρ 0 h, ε1 j=0 l=0 h N nρ [1−(γ −2 ρ ) 0] + ε2h,γε022 (1−γ00−2 ρh,h,εε2) (γ0−1 d ∗ + C g )2 (γ0−1 d ∗
2h
≤
N nCδ2 ρh,ε0
+
+
×
0 N nρh, ε1 ρh,ε
2
ε1 γ0 2 (γ0 2h 0 −ρh, ε1 2h 0 −2
+ Δ
(5.41)
2
h
+
2 2h +1
N nρh,ε0
4d ∗ 2 C x2 2 + C g2 ε2 γ0 4h 0 +2 s02 h +1 h N nρ 0 [1−(γ0 −2 ρh, ε2 ) 0 ] 2 (γ0−1 d ∗ + C g )2 ε2 γh,ε 2h 0 +2 (1−γ −2 ρ 0 0 h, ε )
s02 γ0 4h 0
2
(γ −1 d ∗ ) 0
+ Cg )
2
(h 0 − | j − (h 0 − 1)|)γ0 j−2
j=0 h N nρh, ε2 [1−(γ0 −2 ρh, ε2 ) 0 ] (γ0−1 d ∗ ε2 γ0 2 (1−γ0 −2 ρh, ε2 )
j l=0
h l C lj L l
+ C g )2
=M Furthermore, if s0 satisfies the conditions in Assumption 5.2, we have w(k + 1) ≤ 2
×
2h 0 −2
0 N nρh, ε1 ρhh,ε 2
ε1 γ0 2 (γ0 2h 0 − ρh, ε1 )
(h 0 − | j − (h 0 − 1)|)γ0
j=0
+
j−2
(γ0−1 d ∗ + C g )2
j
h l C lj L l
l=0 −2
N nρh, ε2 [1 − (γ0 ρh, ε2 ) ε2 γ0 2 (1 − γ0 −2 ρh, ε2 )
2h 0
]
(γ0−1 d ∗ + C g )2
Δ = M¯
The proof is thus completed.
(5.42)
From (5.41), we get √ w(k + 1)∞ = s −1 (k + 1)δ(k + 1)∞ ≤ M This implies
(5.43)
5.4 Main Results
101
√ x j (k + 1) − JN ⊗ In X (k + 1) ≤ Ms(k + 1) ∞ and
√ x j (τ k ) − JN ⊗ In X (τ k ) ≤ Ms(τ k ) ji ji ∞ ji
Furthermore, we have x j (k + 1) − x j (τ k ) ji
∞
k = (x j (k + 1) − JN ⊗ In X (k + 1)) + h s(k)JN ⊗ In G(t)
−(x j (τ kji ) − JN ⊗ In X (τ kji ))∞
t=τ kji
k ≤ x j (k + 1) − JN ⊗ In X (k + 1) ∞ + h s(k)JN ⊗ In G(t) t=τ k ji ∞ + x j (τ k ) − JN ⊗ In X (τ k ) ji
ji
∞
k √ ≤ M(s(k + 1) + s(τ kji )) + hC g s(k) t=τ kji
k h s(k)J ⊗ I G(t) ≤x j (k + 1) − JN ⊗ In X (k + 1)∞ + N n t=τ k ji ∞ + x j (τ kji ) − JN ⊗ In X (τ kji )∞ k √ ≤ M(s(k + 1) + s(τ kji )) + hC g s(t)
(5.44)
t=τ kji
Thus, by (5.26), (5.43), (5.44), Assumptions 5.1 and 5.2, we obtain s −1 (k + 1)xˆ ji (τ kji ) − x j (k + 1)∞ =s −1 (k + 1)(xˆ ji (τ k ) − x j (τ k )) − (x j (k + 1) − x j (τ k )) ji
ji
ji
∞
k √ 1 ≤s −1 (k + 1)[ s(τ kji − 1) + M(s(k + 1) + s(τ kji )) + hC g s(k)] 2 k t=τ ji
K 0 +1 √ 1 ≤ γ0 −K 0 −2 + M(1 + γ0 −K 0 −1 ) + hC g γ0 −l 2 l=1
(5.45)
Recalling that ||LG(k) ||∞ ≤ 2d ∗ , from (5.13), (5.14), (5.25), (5.43), (5.45) and the definition of M zji (k + 1), for all k = 2, 3, . . ., we have the following relation holds
102
5 Quantized Communication-Based Distributed Optimization …
z M ji (k + 1) ∞ ≤z ji (k + 1)∞ + h LG(k+1) ∞ w(k + 1)∞ + hgi (xi (k + 1))∞ + h[(LG(k+1) ⊗ In Z (k + 1)) − (LG(k+1) ⊗ In Z (k + 1))T ]1 N n ∞ , K 0 +1 1 1 −K 0 −2 √ −K 0 −1 −l ≤ max + M(1 + γ0 ) + hC g γ0 γ0 , γ0 2 2 l=1 √ + Mh2d ∗ + hγ0−1 d ∗ + hC g 1 1 +1+ < M1 (h, s0 , ε1 , ε2 ) − 2 2 1 =K 1 (h, s0 , ε1 , ε2 ) + 2 1 (5.46) ≤K ji (k + 2) + 2 Therefore, at time k = k + 1, all the quantizers are also unsaturated. Then by induction, we conclude that 1 sup max ||Δ¯ i j (k)||∞ ≤ (5.47) 2 k≥0 i, j∈V Hence, the quantizers will never be saturated and the quantization errors are bounded. Theorem 5.6 Let Assumptions 5.1–5.4 hold. Set a suitable control gain h > 0 such that ρh, ε1 ∈ (0, 1) and select a suitable γ0 such that γ0−2h 0 ρh, ε1 ∈ (0, 1). Let θˆ ji (t) be such that / , (1 + 2hd ∗ )C x + 2hd ∗ Cδ Θ1 , s0 ≥ max 1 Θ K + 2 − hC g 2 where −4h 0 0 Θ1 =N nρ2h (γ0 2h 0 − ρh, ε1 ) h,ε2 γ0 ∗2 2 2 2 −2 × (Cδ2 + ρh,ε2 ε−1 2 γ0 (4d C x + C g s0 ))
and 0 ε−1 γ0 −2 (d ∗ γ0 −1 + C g )2 Θ2 =N nρh,ε1 ρhh,ε 2 1
×
2h 0 −2 j=0
(h 0 − | j − (h 0 − 1)|)γ0
j−2
j l=0
. Let K 2 (h, s0 , ε1 , ε2 ) = M2 (h, s0 , ε1 , ε2 ) − 21 + 1 with
h l C lj L l
5.4 Main Results
103
M2 (h, s0 , ε1 , ε2 )
K √ 0 +1 −K 0 −1 1 1 −K 0 −2 −l ¯ = max 2γ0 , 2 γ0 + M(1 + γ0 ) + hC g γ0 l=1 √ ∗ ¯ + Mh2d + hγ0−1 d ∗ + hC g Supposing that the quantization level is fixed, i.e., K ji (k) = T , and T is a positive integer such that T ≥ T2 (h, s0 , ε1 , ε2 ) then under control input of multi-agent systems (5.7) can be (5.8), the consensus achieved, i.e., limk→∞ xi (k) − x j (k), and all the quantizers q ji (k) will never be saturated. Here 2h 0 l l h C2h 0 L l + ε1 h 2 , ∀ε1 > 0 ρh,ε1 = 1 − 2hλ0 + l=2 2 2 ρh,ε2 = 1 + 2h L + h L + ε2 h 2 , ∀ε2 > 0 and the expression of M¯ can be found in (5.42) Proof The proof procedure is omitted because it is similar to that of Theorem 5.5.
5.4.2 Reaching Optimal Solution Theorem 5.7 Satisfies Theorem 5.5 or Theorem 5.6, then the quantized (sub)gradient algorithm (5.7) with control input (5.8) has the following convergence property lim xi (k) = x ∗ , ∀ i ∈ V, x ∗ ∈ X ∗
k→∞
Proof Multiplying (1/N )1TN ⊗ In on the both sides of (5.17) and using relation (5.18), one has for all k ≥ 0 N N N 1 1 1 xi (k + 1) = xi (k) − hs(k) gi (xi (k)) N i=1 N i=1 N i=1
(5.48)
With availability of (5.19), we can obtain that the time averaging dynamic for all the N agents’ states, i.e., v(k) = (1/N ) i=1 xi (k), satisfies the following relation hs(k) gi (xi (k)) N i=1 N
v(k + 1) = v(k) − In view of x ≤
√
nx∞ , ∀ x ∈ Rn , we can get from Assumption 5.4 that
(5.49)
104
5 Quantized Communication-Based Distributed Optimization …
gi (xi (k)) ≤
√
nC g , ∀ xi (k) ∈ dom ( f i )
(5.50)
Using (5.50), it follows from (5.49) for any x ∗ ∈ X ∗ and k ≥ 0 that v(k + 1) − x ∗ 2
2 N N hs(k) 2 2hs(k) ∗ T ∗ = v(k) − x − gi (xi (k))(v(k) − x ) + gi (xi (k)) N N i=1 i=1 N 2 2hs(k) ≤ v(k) − x ∗ − g T (xi (k))(v(k) − x ∗ ) + nC g2 h 2 s 2 (k) N i=1 i
(5.51)
Again considering that the (sub)gradient of each f i is bounded (Assumption 5.4), we have the following two inequalities hold giT (xi (k))(v(k) − xi (k)) ≥ − giT (xi (k)) (v(k) − xi (k)) ≥ − gi (xi (k)) v(k) − xi (k) √ ≥ − nC g v(k) − xi (k)
(5.52)
and √ f i (xi (k)) − f i (v(k)) ≥giT (v(k))(xi (k) − v(k)) ≥ − nC g v(k) − xi (k) (5.53) Then, together with inequalities (5.52), (5.53) and the (sub)gradient inequality giT (xi (k))(xi (k) − x ∗ ) ≥ f i (xi (k)) − f i (x ∗ ), we further obtain for any x ∗ ∈ X ∗ and all k ≥ 0 that giT (xi (k))(v(k) − x ∗ ) = − giT (xi (k))(xi (k) − v(k)) + giT (xi (k))(xi (k) − x ∗ ) √ ≥ − nC g v(k) − xi (k) + f i (xi (k)) − f i (x ∗ ) √ ≥ − 2 nC g v(k) − xi (k) + f i (v(k)) − f i (x ∗ )
(5.54)
Substituting (5.54) into (5.51) yields v(k + 1) − x ∗ 2 N 2 2hs(k) ≤v(k) − x ∗ − ( f i (v(k)) − f i (x ∗ )) N i=1 √ N 4 nC g hs(k) v(k) − xi (k) + nC g2 h 2 s 2 (t) + N i=1
From inequality (5.55), we obtain for any x ∗ ∈ X ∗ and all k ≥ 0 that
(5.55)
5.4 Main Results
105
2hs(k) ( f i (v(k)) − f i (x ∗ )) N i=1 2 2 ≤v(k) − x ∗ − v(k + 1) − x ∗ + nC g2 h 2 s 2 (k) √ N 4 nC g hs(k) v(k) − xi (k) + N i=1 N
(5.56)
Changing k into p and summing up (5.56) over p from 0 to k, we further obtain that k N 2h s( p)( f i (v( p)) − f i (x ∗ )) N p=0 i=1
√ k k N 4 nC g h ∗ 2 2 2 2 + nC g h s ( p) + s( p) v( p) − xi ( p) ≤ v(0) − x N p=0 p=0 i=1 (5.57) From (5.41), one has ||w(k)|| = s −1 (k)||δ(k)|| =
√
M < ∞, which means that
xi (k) − v(k) 0 if and only if the edge ( j, i) ∈ E(k), which indicates that there is an active communication channel from agent j to agent i. The in-neighbor set and the out-neighbor set of agent i at time k are respectively represented by Ni+ (k) = { j ∈ V | ai j (k) > 0} and Ni− (k) = { j ∈ V | a ji (k) > 0}. The in-degree and out-degree of agent i at time k are respectively denoted by di+ (k) and di− (k), where di+ (k) = Nj=1 ai j (k) and di− (k) = Nj=1 a ji (k). Evidently, there exists a positive constant d ∗ such that d ∗ = sup max di+ (k), di− (k)
(6.3)
k≥0 i∈V
which is called the degree of the time-varying networks G(k). The Laplacian matrix L(k) associated with the network G(k) is defined as L(k) = D(k) − A(k), where D(k) = diag(d1+ (k), d2+ (k), . . . , d N+ (k)). A sequence of active communication channels (i 1 , i 2 ), (i 2 , i 3 ),. . ., (i k−1 , i k ) is called a path from agent i 1 to agent i k . We say that the network G(k) is strongly connected if for any two distinct agents i, j ∈ V, there is a path from agent i to agent j. Let Ni+ = ∪k≥0 Ni+ (k) and Ni− = ∪k≥0 Ni− (k). Assumption 6.3 [12] (i) Assume that {G(k), k = 0, 1, . . .} is a balanced directed network sequence, namely, L(k)1 N = 0 N and 1TN L(k) = 0TN , ∀k ≥ 0. In addition, there is a positive K 0 , such that for any time instant k1 ≥ 0 and any agent j ∈ Ni+ , i = 1, 2, . . . , N , j ∈ Ni+ (k) holds at least once in [k1 , k1 + K 0 ). (ii) Let G {V, E∞ } = ∪∞ k=0 G(k). There exists an integer B ≥ 1 such that for each edge (i, j) ∈ E∞ , agent i sends its information to the neighboring agent j at least every B consecutive time slots, i.e., at time k or at time k + 1 or . . . or (at least) at time k + B − 1 for any
120
6 Event-Triggered Scheme-Based Distributed Optimization …
k ≥ 0. That is to say, if an edge appears once, it will appear infinite times as time goes on. In addition, it is assumed that the network sequence {G(k)} is uniformly strongly connected.
6.2.4 Quantization Rule and Encoding-Decoding Scheme Owing to the constrained capacity or bandwidth of the channel, the agent must quantize the information before transmitting it to its neighboring agents. To this end, for a vector Z = (z 1 , z 2 , . . . , z n )T ∈ Rn , define a standard uniform quantizer Q T (Z ) with 2T + 1 quantization levels as follows: Q T (Z ) = [qT (z 1 ), qT (z 2 ), . . . , qT (z n )]T
(6.4)
where the quantization function qT (·) : R → Λ = {0, ±l; l = 1, 2, . . . , T } is defined by: ⎧ 0, − 21 < zl < 21 ⎪ ⎪ ⎨ 2l−1 l, ≤ zl < 2l+1 , l ∈ {1, 2, . . . , T } 2 2 qT (zl ) = 2T +1 ≥ T, z ⎪ l ⎪ 2 ⎩ −qT (−zl ), zl ≤ − 21
(6.5)
If Z ∞ ≤ T + (1/2) is true, the n-dimensional quantizer Q T (Z ) will not saturate. Under this case, the quantization error can be bounded by the quantization bin, i.e., Z − Q T (Z )∞ ≤ 1/2. Remark 6.1 It is evident that when T = 1, the minimum quantization levels are three due to the fact that the quantization set is Λ = {−1, 0, 1}. To send the quantized information which is processed by a (2T + 1)-level quantizer Q T (Z ) with Z ∈ Rn , it is necessary to transmit nlog2 (2T + 1) -bit information at each time step. In order to reduce the communication cost, in the next, we propose event-triggered dynamical encoding-decoding schemes for each agent to obtain estimates of the states being sent from its neighbors. At time k, the encoder φi j associated with agent i for the directed communication channel (i, j) is designed as: ⎧ ⎪ ⎨ ξi j (0) = 0 i ξi j (k) = g(k)I i j (k) + ξi j (kti )
i j (k)s ⎪ ⎩ si j (k) = Q T 1 xi (k) − ξi j (k i ) ti g(k)
(6.6)
where k = 1, 2, . . .; ξi j (k) is the internal state of φi j ; xi (k) and si j (k) are respectively the input and output of φi j ; ktii represents agent i’s latest sampling time instant before k, namely, ktii ≤ k − 1. The scaling function g(k) plays the role of zooming-in the
6.2 Preliminaries
121
difference signal xi (k) − ξi j (ktii ), which will be specified in the subsequent analysis. The index function Ii j (k) is defined as: Ii j (k) =
1, if a ji (k) > 0 0, if a ji (k) = 0
(6.7)
which plays the role of guiding whether the binary sequence si j (k) needs to be transmitted through the communication channel (i, j) at time k. To ensure that agent j can obtain an estimate (with certain degree of quantization error) of agent i’s state, the decoder ϕi j associated with agent j for the channel (i, j) is designed by:
xˆi j (0) = 0 xˆi j (k) = g(k)Ii j (k)si j (k) + xˆi j (ktii )
(6.8)
where
k = 1, 2, . . .; xˆi j (k) is the output of ϕi j ; si j (k) = 0 if Ii j (k) = 0, and si j (k) = 1 xi (k) − ξi j (ktii ) , otherwise. Q T g(k) Remark 6.2 When the channel (i, j) is activated at time k (Ii j (k) = 1), after the binary sequence si j (k) is received by agent j, the decoder ϕi j is activated to obtain an estimate xˆi j (k) of xi (k). Otherwise, ϕi j keeps the estimation xˆi j (k) unchanged. To this end, each agent i ∈ V needs |Ni− | encoders to pre-process the quantized information and needs |Ni+ | decoders to posk-process the received binary sequence. Since ξi j (k) and xˆi j (k) share the same dynamics at the event-triggered time instants {ktii }, we can get the following relation ξi j (ktii ) = xˆi j (ktii ), ti = 1, 2, . . .
(6.9)
That is to say, at the event-triggered time instants k = {ktii }, agent i grasps the value of ξi j (k) (encoding value at the sender side) which is consistent with agent j’s estimation of agent i’s encoded state (decoding value at the receiver side) along each connected channel (i, j) ∈ E(k). Most importantly, this relation plays a crucial role in the elimination of the quantization error for our later algorithm design. Furthermore, for the event-triggered encoder and decoder in this work, only those specific (at the event-triggered instants) data needs to be stored, which can save memory for each agent to some degree. The following three key problems need to be addressed. Problem 1: Based on the event-triggered communication scheme and the quantized information exchange with constrained data rate, design a distributed algorithm to guarantee that the problem (6.1) can be exactly solved, that is, limk→+∞ xi (k) = x ∗ , x ∗ ∈ X ∗ , ∀i ∈ {1, 2, . . . , N }. Problem 2: Seek the lowest possible number of quantization levels to reduce the communication cost while ensuring that all the quantizers are not saturated and exact consensus can be realized.
122
6 Event-Triggered Scheme-Based Distributed Optimization …
Problem 3: Characterize the convergence rate of the proposed distributed (sub)gradient algorithm which simultaneously considers the event-triggered communication scheme and limited data rate of communication networks.
6.3 Quantized Optimization Algorithm Considering that agent i is accessible to the (sub)gradient D f i (xi (k)) of its individual objective function f i (x), thus we let its dynamic be described by the following firskorder linear difference equation: xi (k + 1) = xi (k) + h [u i (k) − g(k)D f i (xi (k))]
(6.10)
where k = 0, 1, . . .; u i (k) ∈ Rn is the distributed control input of agent i to be designed; h > 0 is the networked control gain. Based on the above dynamic event-triggered encoding-decoding schemes (6.6) and (6.8), we construct the following control input u i (k) as: Ni+ (k)
u i (k) =
Ni− (k) j ai j (k)xˆ ji (kt j )
j=1
−
a ji (k)ξi j (ktii )
(6.11)
j=1
where k ∈ [ktii , ktii +1 ) and ktii +1 represents agent i’s next event-triggered time instant after ktii , which will be analytically determined by the individual sampling event, see below (6.12). For i = 1, 2, . . . , N and k ∈ [ktii , ktii +1 ), define agent i’s measurement error ei j (k) associated with channel (i, j) as: ei j (k) = ξi j (ktii ) − ξi j (k) if (i, j) ∈ E(k); ei j (k) = T T T 0, otherwise. Let ei (k) = ei1 (k), ei2 (k), . . . , eiTN (k) . Design a private sampling event for agent i to ascertain the next sampling time instant ktii +1 by the following condition: ktii +1 = min k > ktii , pi (ei (k)) ≤ 0 ∧ pi (ei (k + 1)) > 0
(6.12)
where pi (y) = y∞ − Hi with N + (k i )
t i σi g(k) i i i i ξ Hi = T g(k) + a (k ) (k ) − x (k ) i j i j i t t t i i i ∞ g(ktii ) j=1
(6.13)
We call σi > 0 the agent i’s event gain. Remark 6.3 It is needed to calculate ei (k + 1) when we implement the sampling event (6.12) at time k. Considering that u i (k) is available, it is possible to accomplish
6.3 Quantized Optimization Algorithm
123
this task since one can make use of u i (k) to produce xi (k + 1) one step ahead by using (6.10). Remark 6.4 Notice that ei j (k) = ξi j (ktii ) − ξi j (k) would be set to zero at all the event-triggering time instants k ∈ {ktii }. The inequality ei (k)∞ ≤ Hi would hold for all k ≥ 0 provided that all ei j (k + 1)∞ do not exceed Hi when k = ktii . This can be shown by ei j (k + 1) = ξi j (ktii ) − ξi j (k + 1) = g(k)Ii j (k + 1)si j (k + 1) and si j (k + 1)∞ ≤ T . Therefore, ei j (k + 1)∞ ≤ T g(k) ≤ Hi . The state estimation error e˜i j (k) for agent i corresponding to channel (i, j) is defined as: e˜i j (k) = ξi j (k) − xi (k) if (i, j) ∈ E(k); e˜i j (k) = 0, otherwise. Similarly, the quantization error associated with (i, j) is defined as: Δi j (k − 1) =Q T
1 1 i xi (k) − ξi j (kti ) − xi (k) − ξi j (ktii ) g(k) g(k)
(6.14)
In accordance with the definitions of e˜i j and Δi j , the following equality holds for all k ≥ 1. a ji (k)e˜i j (k) = a ji (k)g(k)Δi j (k − 1)
(6.15)
This is because if a ji (k) > 0 (Ii j (k) = 1), using the definition of (6.14), we can derive e˜i j (k) =g(k)Ii j (k)si j (k) + ξi j (ktii ) − xi (k) xi (k) − ξi j (ktii ) =g(k) si j (k) − g(k) =g(k)Δi j (k − 1)
(6.16)
If a ji (k) = 0, the equality (6.15) naturally holds. By making use of (6.14) and (6.15) and noticing Assumption 6.3, substituting (6.11) into (6.10) produces xi (k + 1) = xi (k) + h
+ N i (k)
ai j (k)(x j (k) − xi (k))
j=1
+h +h −h
+ N i (k)
j=1 + N i (k) j=1 − N i (k) j=1
ai j (k)e ji (k) −
− N i (k)
a ji (k)ei j (k)
j=1
ai j (k)g(k)Δ ji (k − 1) a ji (k)g(k)Δi j (k − 1)
−hg(k)D f i (xi (k))
(6.17)
124
6 Event-Triggered Scheme-Based Distributed Optimization …
For convenience and simplicity, we define the following notations: X = (x1T , x2T , . . . , x NT )
T
Σ1 = diag{A1,: , A2,: , . . . , A N ,: } Σ2 = diag{AT:,1 , AT:,2 , . . . , AT:,N } T T T , e21 , . . . , eTN 1 , e12 , . . . , eTN N ) eˆ = (e11
T
T T T T e¯ = (e11 , e12 , . . . , e1N , e21 , . . . , eTN N )
T
Δˆ = (ΔT11 , ΔT21 , . . . , ΔTN 1 , ΔT12 , . . . , ΔTN N )T Δ¯ = (ΔT11 , ΔT12 , . . . , ΔT1N , ΔT21 , . . . , ΔTN N )T ∇ F(X ) = (DT f (x1 ), DT f (x2 ), . . . , DT f (x N ))
T
The closed-loop system (6.16) can be equivalently rewritten in the compact matrixvector form as follows: X (k + 1) = [(I N− hL(k)) ⊗ In ] X (k) − hg(k)∇ F(X (k)) +h Σ1 (k) ⊗ In e(k) ˆ − Σ2 (k) ⊗ In e(k) ¯ ˆ − 1) +hg(k)[Σ1 (k) ⊗ In Δ(k ¯ − 1)] −Σ2 (k) ⊗ In Δ(k
(6.18)
Assumption 6.4 [12] Let h ∈ (0, 1/d ∗ ) and W(k) = I N − hL(k) = [ωi j (k)], k ≥ 0, where d ∗ is the uniformly maximal degree given in (6.3). There exists a scalar η with 0 < η < 1 such that for all i, j ∈ V and for all k ≥ 0, the following two inequalities hold. (a) ωii (k) ≥ η; (b) If ωi j (k) > 0, then ωi j (k) ≥ η. Lemma 6.5 [12] Let Assumptions 6.3 and 6.4 hold. Then, W(k) is doubly stochastic, namely, W(k)1 N = 1 N and 1TN W(k) = 1TN , k ≥ 0. Define Φ(k, s) = W(k)W(k − 1) . . . W(s) with k ≥ s ≥ 0, where Φ(k, k) = W(k). Then, letting B0 = (N − 1)B, we have the following results: The entries [Φ(k, s)]i j of the transmission matrice Φ(k, s) converge to 1/N as t → ∞ with a linear convergence rate with respect to all i and j, that is, for all i, j ∈ V, [Φ(k, s)]i j − 1 ≤ Cλk−s N 1 −B0 where C = 2 1+η and λ = 1 − η B0 B0 . 1−η B0 Assumption 6.5 There exist positive scalars σ, C x , Cδ and A∗ such that
6.3 Quantized Optimization Algorithm
125
max σi ≤ σ, i ∈ V max xi (0)∞ ≤ C x , i ∈ V max xi (0) − x j (0)∞ ≤ Cδ , i, j ∈ V sup max {Σ1 (k), Σ2 (k)} ≤ A∗ k≥0
Assumption 6.6 The stepsize {g(k)} is a positive sequence which satisfies the following relations: g(k + 1) ≤ g(k), sup k≥0 ∞
g(k) = ∞,
k=0
∞
g(k) ≤ μ, 0 < λμ < 1 g(k + 1)
g 2 (k) < ∞
k=0
where λ is given in Lemma 6.5.
6.4 Main Results Theorem 6.6 Let Assumptions 6.1–6.6 hold. With the event-triggered quantized algorithm (6.11), the consensus of multi-agent systems (6.10) can be achieved, i.e., maxi, j∈V xi (k) − x j (k) 1), σ and large positive g(0), T = 1 suffices to ensure the inequality (6.20). That is to say, under the event-triggered limited data rate algorithm (6.10), by selecting suitable parameters, the optimization problem (6.1) can be exactly solved based on only one bit information exchange along each connected communication channel at each time step.
In the following, we now turn to the derivation of convergence rate for the optimization algorithm (6.10). Since the stepsize g(k) needs to be nonsummable and square summable, we can hardly characterize the convergence rate in this case. However, if the stepsize g(k) takes a special case, which does not satisfy the square summable condition, we can approximately capture the algorithm’s convergence since the proof of which drops the assumption of stepsize g(k) being square summable. The results are stated in the following theorem, the proof of which drops the requirements that the stepsize g(k) are needed to be square summable. Theorem 6.12 Let Assumptions 6.1–6.6√hold, and the quantization level T satisfy (6.20). Take the stepsize as g(k) = g(0)/ k + 1, k ≥ 0 and x ∗ ∈ X ∗ . The distributed
6.4 Main Results
133
√ quantization (sub)gradient algorithm (6.10) converges at a rate of O(lnk/ k) (high √ order of lnk/ k) in terms of the running best of objective error. Specifically, letting x˜i (k) = (1/ ks=0 g(s)) ks=0 g(s)xi (s), we have f (x˜i (k)) − f (x ∗ ) ≤
g(0)N C 2f (1 + ln(k + 1)) 2g(0)C f (1 + lnk) ¯ − x ∗ 2 N x(0) + + √ √ √ 2g(0) 2 (1 − λ) k+1 k+1 k+1 √ 1 1 μ (1 + ln(k + 1)) + 2H + N C H g(0)N C f + N nC x C √ g(0) 21−λ k+1 (1 + lnk) NC H g(0)N C f √ (6.55) + 2(1 − λ) k+1
Proof From (6.49), we have for any x ∗ ∈ X ∗ and k ≥ 1, k 2g(s) s=0
N
f (x(s)) ¯ − f (x ∗ )
k k N 4C f g(s) ∗ 2 2 2 x(s) ¯ −x ¯ − xi (s) + g (s)C f + ≤ x(0) N s=0 s=0 i=1
From the preceding relation, letting S(k) = we obtain k
g(s) f (x(s)) ¯
s=0
S(k)
k s=0
(6.56)
g(s) and dividing by (2/N )S(k),
− f (x ∗ ) k
∗ 2
¯ −x N x(0) + ≤ 2 S(k)
g N C 2f s=0 2
2
N k
(s)
S(k)
+ 2C f
g(s) x(s) ¯ − xi (s)
s=0 i=1
S(k)
(6.57)
Using the convexity of f , one has ⎛
⎞ g(s) x(s) ¯ k ⎜ ⎟ 1 ⎜ s=0 ⎟ f⎜ g(s) f (x(s)) ¯ ⎟≤ ⎝ ⎠ S(k) s=0 S(k) k
(6.58)
For the convenience of representation, let M = g(0). We also have the following inequality holds
134
6 Event-Triggered Scheme-Based Distributed Optimization … k
√
s=0
1 s+1
≥
√ k + 1, k ≥ 1
(6.59)
which follows by k s=0
$
1
≥ √ s+1
k+1 0
√ du = 2( k + 2 − 1) √ u+1
(6.60)
√ √ and the relation 2( k + 2 − 1) ≥ k + 1 for all k ≥ 1. It follows from (6.59) and (6.60) that S(k) =
k
√
s=0
M s+1
√ ≥ M k+1
(6.61)
By computation, we can derive k
g (s)=M 2
s=0
2
k+1 1 s=1
s
=M
$ k+1 k+1 1 dx =M 2 (1+ln(k+1)) 1+ ≤M 2 1+ s x 1 s=2
2
(6.62) In addition, letting g(s) = 0 for s < 0, we have k s−1
g 2 (τ −1)λs−1−τ ≤ M 2 λ−1
s=0 τ =1
k−1 1 (λ − λk−τ +1 ) 1 ≤ M2 (1+lnk) τ 1−λ 1−λ τ =1
(6.63) where the last step follows by $ t k−1 t 1 1 dτ ≤1+ ≤1+ = 1 + lnk τ τ 1 τ τ =1 τ =2 Based on (6.63), we have k N
g(s) xi (s) − x(s) ¯
s=0 i=1
k 1 1 μ + 2N H + N 2C H ≤ N 2Cx C g 2 (s) M 21−λ s=0 1 2 N CH g 2 (τ − 1)λs−1−τ 2 s=0 τ =1 k
+
s−1
(6.64)
6.4 Main Results
135
1 1 μ + 2N H + N 2 C H M 2 (1 + ln(k + 1)) ≤ N 2Cx C M 21−λ 2 2 N CH M (1 + lnk) + 2 1−λ
(6.65)
Substituting (6.57) into (6.58) and using the relations (6.61), (6.62) and (6.63), we can derive ⎛ ⎞ k g(s) x(s) ¯ ⎜ ⎟ ⎜ s=0 ⎟ f⎜ ⎟ − f (x ∗ ) ⎝ ⎠ S(k)
≤
M N C 2f (1 + ln(k + 1)) 2MC f (1 + lnk) ¯ − x ∗ 2 N x(0) + + √ √ √ 2M 2 (1 − λ) k + 1 k+1 k+1
(6.66)
˜¯ = (1/ ks=0 g(s)) × ks=0 Let x˜i (k) = (1/ ks=0 g(s)) ks=0 g(s)xi (s) and x(k) g(s)x(s). ¯ Using the Lipschitz continuity of f (implied by the the (sub)gradient boundedness), we obtain N
˜¯ D f jT ( X˜ (k))(x(k) − x˜i (k)) ˜ ¯ − x˜i (k) ≥ −N C f x(k)
˜¯ f (x(k)) − f (x˜i (k)) ≥
j=1
(6.67)
Using the convexity of · , one has the following k k g(s) 1 ≤ (s)) g(s) x(s) ¯ − xi (s) x(s) ¯ − x ( i k k s=0 s=0 g(s) s=0 g(s) s=0
(6.68)
Substituting (6.68) into (6.67), it yields k NC f ˜ f (x˜i (k)) ≤ f (x(k)) ¯ + g(s) x(s) ¯ − xi (s) S(k) s=0
It follows from (6.66) and (6.69) that f (x˜i (k)) − f (x ∗ ) ≤
M N C 2f (1 + ln(k + 1)) 2MC f (1 + lnk) ¯ − x ∗ 2 N x(0) + + √ √ √ 2M 2 (1 − λ) k + 1 k+1 k+1
(6.69)
136
6 Event-Triggered Scheme-Based Distributed Optimization …
1 1 μ (1 + ln(k + 1)) + 2H + NC H M NC f + N Cx C √ g(0) 21−λ k+1 (1 + lnk) NC H M NC f √ (6.70) + 2(1 − λ) k+1 Therefore, under the condition that the selected stepsize the √ is square summable, √ algorithm converges at a rate of high order of ln(k)/ k, i.e., O(ln(k)/ k). The proof is thus completed. Remark 6.13 From Theorems 6.6, 6.10, and 6.12, we can see that the control parameters h and g(k) (g(0), and μ) are designed off-line. It is interesting to develop distributed optimization algorithms which are based on the on-line parameter design.
6.5 Numerical Examples In this section, we will make use of a numerical example to validate the proposed algorithm. A parameter estimation problem in wireless sensor networks is taken into consideration [27]. A set of N sensors want to cooperatively estimate some global decision parameter x with their individual data sets Θi and the local optimization objectives Ψi (x, Θi ) for i = 1, 2, . . . , N . Consider that five sensors estimate a decision parameter x ∈ R3 to minimize the global objective function described by min f (x) =
5
Ψi (x, Θi ), x ∈ R3
(6.71)
i=1
where Ψi (x; Θi ) =
θi j ∈Θi
ψi j (x; θi j )
(6.72)
1
and ψi j (x; θi j ) is the vector version of the Huber penalty function with the p-th element defined as follows [27]: p ψi j (x; θi j )
⎧ p p2 ⎨ (x −θi j ) , = 2 p ⎩ x p − θ − 1 , ij 2
p for θi j − x p ≤ 1 p for θi j − x p > 1
(6.73)
Assume that each agent has 100 measurements which are corrupted by additive white Gaussian noise. The first 90 measurements (indicted by θi j , j = 1, 2, . . . , 90) are taken to be i.i.d N([9, 2, −5]T , I3 ) and the remaining 10 ones (indicted by θi j , j = 91, 92, . . . , 100) are taken from the i.i.d N([9, 2, −5]T , 20I3 ) (due to wrong reading). Here, N(μ, Σ) represents the normal distribution where μ is the mean vector and Σ the covariance matrix. Then, the set of θi j is denoted by Θi = {θi j , j = 1, . . . , 100}.
6.5 Numerical Examples
137
The time-varying communication network sequence is taken as G(k) = {V, E(k), A(k) = [ai j ]3×3 }, where E(k) = {(1, 2), (1, 3), (2, 4), (3, 1), (4, 5), (5, 3)}, a12 (k) = 1.5, a13 (k) = 1.5, a24 (k) = 1.5, a31 (k) = 3, a45 (k) = 1.5, a53 (k) = 1.5, and ai j (k) = 0, if (i, j) ∈ / E(k) when k = 3k, k = 0, 1, . . .; E(k) = {(1, 3), (3, 1), (4, 5), (5, 4)}, a13 (k) = 0.9, a31 (k) = 0.9, a45 (k) = 0.9, a54 (k) = 0.9, and ai j (k) = 0, if (i, j) ∈ / E(k) when k = 3k + 1, k = 0, 1, . . .; E(k) = {(1, 2), (2, 1), (2, 4), (4, 2), (3, 5), (5, 3)}, a12 (k) = 1, a21 (k) = 1, a24 (k) = 0.85, a42 (k) = 0.85, a35 (k) = 0.85, / E(k) when k = 3k + 2, k = 0, 1, . . .. By coma53 = 0.85, and ai j (k) = 0, if (i, j) ∈ putation, we have η = 0.9, B = 3, B0 = 12, λ = 0.982. Let the initial condition of j xi (0) be i j/N , and σi = 0.1, i = 1, 2, . . . , 5, j = 1, 2, 3. We can calculate A∗ = 3, ∗ d = 3, C x = 3, Cδ = 2.4. Based on Theorem 6.6, we can take h = 0.002, T = 1 and g(k) = 200/(0.02k + 1)0.55 to satisfy the conditions in Theorem 6.6 (T1 = 0.0636, T2 = 0.041, and T3 = 0.0863). We solve the convex optimization problem (6.71) by utilizing the distributed optimization algorithm (6.10) over the above defined time-varying networks in which the communication channel allows the agents to transmit only one bit information each time step. The estimation trajectories of all agents are shown in Fig. 6.1, from which we can see that all agents collaboratively estimate the mean vector x ∗ = [9, 2, −5]T . The distributed control algorithms u i (k), i = 1, 2, . . . , 5 are shown in Fig. 6.2. The quantization errors δi j (k), i, j ∈ V, are depicted in Fig. 6.3, from which we can see that the upper bound of δi j (k) is not larger than 0.5. The event-triggered sampling time instants for each agent are described in Fig. 6.4, from which we can see the updates of the control inputs are asynchronous. According to the statistics, the sampling times for the five agents are [109, 95, 62, 79, 83], and the average sampling times is 86. Thus the average update rate of control inputs is 86/3000 = 2.87%.
30
Fig. 6.1 All agents’ states xi (t)
x1(t) x (t)
20
2
x (t)
States of agents
3
x4(t)
10
x (t) 5
0 −10 −20 −30
0
500
1000
1500
2000
Time [step]
2500
3000
100
50
i
Fig. 6.2 All agents’ control inputs u i (t)
6 Event-Triggered Scheme-Based Distributed Optimization …
Distributed control input u (t)
138
u1(t) u (t) 2
0
u (t) 3
u4(t) u5(t)
−50
−100
−150
0
500
2000
1500
1000
2500
3000
Time [step] 1
Fig. 6.3 All agents’ quantization errors δi j (t)
0.6
ij
Quantization error δ (t)
0.8
0.4 0.2 0 −0.2 −0.4 −0.6 −0.8 −1
0
500
1000
1500
2000
2500
Time [step] 5.5
Fig. 6.4 All agents’ sampling time instant sequences
5
Event time instants
4.5 4 3.5 3 2.5 2 1.5 1 0.5
0
500
1000
1500
2000
Time [step]
2500
3000
6.6 Conclusion
139
6.6 Conclusion We have introduced an event-triggered distributed (sub)gradient protocol for solving convex optimization problems which are modeled as the sum of all agents’ convex cost functions. The protocol is designed over a sequence of time-varying yet uniformly strongly connected directed balanced networks with information quantization and constrained data rate of the communication channel. It is shown that under some mild assumptions and a proper parameter configuration, the event-triggered quantized distributed (sub)gradient algorithm succeeds in driving all the agents’ states to a consensus value which is also in the set of optimal solutions of the convex optimization problems. We also find that only one-bit information exchange between each connected pair of agents suffices to exactly find the optimization solution meanwhile guarantee that all the quantizers are not saturated. Moreover, the proposed distributed √ optimization algorithm converges at a rate of O(lnk/ k), where the ratio depends on the initial vectors, the bounds on (sub)gradient norms, the network maximum degree, the number of network nodes, the sampling event gain, the initial condition of the scaling function, the consensus speed λ of the network sequence. The communication cost of the proposed distributed optimization algorithms is largely reduced compared with ones which are based on the traditional continuous communication schemes.
References 1. S. Yang, Q. Liu, J. Wang, Distributed optimization based on a multiagent system in the presence of communication delays. IEEE Trans. Syst. Man Cybern. Syst. 47(5), 717–728 (2017) 2. D. Wang, N. Zhang, J. Wang, W. Wang, Cooperative containment control of multiagent systems based on follower observers with time delay. IEEE Trans. Syst. Man Cybern. Syst. 47(1), 13–23 (2017) 3. G. Wen, W. Yu, Y. Xia, X. Yu, J. Hu, Distributed tracking of nonlinear multiagent systems under directed switching topology: an observer-based protocol. IEEE Trans. Syst. Man Cybern. Syst. 47(5), 869–881 (2017) 4. Z. Liu, X. Yu, Z. Guan, B. Hu, C. Li, Pulse-modulated intermittent control in consensus of multiagent systems. IEEE Trans. Syst. Man Cybern. Syst. 47(5), 783–793 (2017) 5. I. Lobel, A. Ozdaglar, Distributed subgradient methods for convex optimization over random networks. IEEE Trans. Autom. Control 56(6), 1291–1306 (2011) 6. I. Matei, J. Baras, Performance evaluation of the consensus-based distributed subgradient method under random communication topologies. IEEE J. Sel. Top. Signal Process. 5(4), 754–771 (2011) 7. V. Kekatos, G. Giannakis, Distributed robust power system state estimation. IEEE Trans. Power Syst. 28(2), 1617–1626 (2013) 8. U.A. Khan, S. Kar, J.M.F. Moura, Diland: an algorithm for distributed sensor localization with noisy distance measurements. IEEE Trans. Signal Process. 58(3), 1940–1947 (2010) 9. C. Li, Y. Li, A distributed multiple dimensional QoS constrained resource scheduling optimization policy in computational grid. J. Comput. Syst. Sci. 72(4), 706–726 (2006) 10. J. Predd, S. Kulkarni, H. Poor, A collaborative training algorithm for distributed learning. IEEE Trans. Inf. Theory 55(4), 1856–1871 (2009)
140
6 Event-Triggered Scheme-Based Distributed Optimization …
11. A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control 54(1), 48–61 (2009) 12. A. Nedic, A. Ozdaglar, P. Parrilo, Constrained consensus and optimization in multi-agent networks. IEEE Trans. Autom. Control 55(4), 922–938 (2010) 13. A. Nedic, A. Olshevsky, J. Tsitsiklis, Distributed subgradient methods and quantization effects, in 2008 47th IEEE Conference on Decision and Control, https://doi.org/10.1109/CDC.2008. 4738860 14. A. Nedic, A. Olshevsky, Distributed optimization over time-varying directed graphs. IEEE Trans. Autom. Control 60(3), 601–615 (2015) 15. J. Duchi, A. Agarwal, M. Wainwright, Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Trans. Autom. Control 57(1), 592–606 (2012) 16. E. Wei, A. Ozdaglar, Distributed alternating direction method of multipliers, in 2012 IEEE 51st IEEE Conference on Decision and Control, https://doi.org/10.1109/CDC.2012.6425904 17. X. Liu, Z. Guang, X. Shen, G. Feng, Consensus of multi-agent networks with aperiodic sampled communication via impulsive algorithms using position-only measurements. IEEE Trans. Autom. Control 57(10), 2639–2643 (2012) 18. Y. Li, S. Tong, Adaptive neural networks decentralized FTC design for nonstrict-feedback nonlinear interconnected large-scale systems against actuator faults. IEEE Trans. Neural Netw. Learn. Syst. 28(11), 2541–2554 (2017) 19. T. Li, M. Fu, L. Xie, J.-F. Zhang, Distributed consensus with limited communication data rate. IEEE Trans. Autom. Control 56(2), 279–292 (2011) 20. T. Li, L. Xie, Distributed consensus over digital networks with limited bandwidth and timevarying topologies. Automatica 47(9), 2006–2015 (2011) 21. D. Li, Q. Liu, X. Wang, Z. Lin, Consensus seeking over directed networks with limited information communication. Automatica 49(2), 610–618 (2013) 22. S. Liu, T. Li, L. Xie, Distributed consensus for multiagent systems with communication delays and limited data rate. SIAM J. Control Optim. 49(6), 2239–2262 (2011) 23. Q. Zhang, J.-F. Zhang, Quantized data-based distributed consensus under directed time-varying communication topology. SIAM J. Control Optim. 51(1), 332–352 (2013) 24. Y. Wang, T. Bian, J. Xiao, C. Wen, Global synchronization of complex dynamical networks through digital communication with limited data rate. IEEE Trans. Neural Netw. Learn. Syst. 26(10), 2487–2499 (2015) 25. M. Rabbat, R. Nowak, Quantized incremental algorithms for distributed optimization. IEEE J. Sel. Areas Commun. 23(4), 798–808 (2005) 26. D. Yuan, S. Xu, H. Zhao, L. Rong, Distributed dual averaging method for multi-agent optimization with quantized communication. Syst. Control Lett. 61(11), 1053–1061 (2012) 27. P. Yi, Y. Hong, Quantized subgradient algorithm and data-rate analysis for distributed optimization. IEEE Trans. Control Netw. Syst. 1(4), 380–392 (2014) 28. X. Wang, M. Lemmon, Event-triggered broadcasting across distributed networked control systems, in IEEE American Control Conference, https://doi.org/10.1109/ACC.2008.4586975 29. G. Seyboth, D. Dimarogonas, K. Johansson, Event-based broadcasting for multi-agent average consensus. Automatica 49(1), 245–252 (2013) 30. C. Li, X. Yu, W. Yu, T. Huang, Z. Liu, Distributed event-triggered scheme for economic dispatch in smart grids. IEEE Trans. Ind. Inf. 12(5), 1775–1785 (2016) 31. H. Li, Z. Chen, L. Wu, H. Lam, H. Du, Event-triggered fault detection of nonlinear networked systems. IEEE Trans. Cybern. 47(4), 1041–1052 (2017) 32. S. Liu, L. Xie, D.E. 
Quevedo, Event-triggered quantized communication based distributed convex optimization. IEEE Trans. Control Netw. Syst. 5(1), 167–178 (2018)
Chapter 7
Random Sleep Scheme-Based Distributed Optimization over Time-Varying Directed Networks
7.1 Introduction The problem of minimizing a sum of local convex functions which are only accessible to specific agents of a network is a significant issue. It naturally appears in the field of network resource allocation, motion planning, wireless networks, collaborative control, cognitive networks, statistical inference, estimation, and machine learning, etc., [1–10]. From the network viewpoint, the algorithms investigated for solving this category of optimization problems need to obey network structure, especially when agents are merely able to communicate with their local neighbors. That is, the optimization algorithms should be completely distributed and executed by agents with no central coordinators [11, 12]. In the seminal work of distributed optimization [13], the authors introduced a set of processors executing computations and exchanging information to achieve an optimal solution of a collective objective function. In the area of distributed computing, the implementation of distributed algorithms mostly relies on consensus theory. It permits agents taking local operations and sharing limited messages with their neighbors to find global solutions. The authors of [4] proposed a distributed gradient descent algorithm via developing communication protocols to achieve consensus among agents, and using (sub)gradients to individually minimize each local cost function. Afterwards, relevant distributed consensus-based optimization algorithms have been widely studied, see, [2–5, 7, 14, 15] and references therein. The major limitation of the above literature is the dependence of doubly-stochastic weighted matrices. This potentially leads to undirected or directed balanced communication networks, which indeed limit the practicabilities of these algorithms, especially in unbalanced time-varying networks. First, communication networks are not all undirected or balanced directed. For instance, information transmission at nonidentical capacity levels will cause asymmetric information exchange among agents. Moreover, some algorithms require disconnecting slow communication links while still remaining convergent; some procedures may lead to directed unbalanced networks. © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2020 H. Li et al., Distributed Optimization: Advances in Theories, Methods, and Applications, https://doi.org/10.1007/978-981-15-6109-2_7
141
142
7 Random Sleep Scheme-Based Distributed Optimization …
Second, constructing a doubly-stochastic weight matrix is usually difficult in practice, even when feasible, it requires computationally intensive and centralized algorithms. Third, doubly-stochastic weight matrices obstruct the utilization of some natural broadcast protocols where a specific agent may broadcast information without expecting any immediate feedback [16]. To address the optimization problem over unbalanced networks, a surplus-based method was recently developed in [17] where an additional vector is augmented for each agent to track its state updates. Such an idea comes from [18], which aims at average consensus on directed networks. However, this method can not be applied to time-varying unbalanced directed networks. For time-varying unbalanced directed networks, [19, 20] designed algorithms by incorporating the so-called push-sum protocol and gradient descent method. Yet, these push-sum-based algorithms only work for unconstrained optimization as the feasibility of the iterates is not considered. Moreover, it is complicated and involves multiple nonlinear iterations. For the constrained optimization, a distributed projected (sub)gradient method was presented in [6], where two cases of time-varying balanced networks with identical constraints and time-invariant balanced networks with nonidentical constraints are investigated. The authors in [21] proposed a distributed projected (sub)gradient algorithm over time-invariant unbalanced networks with identical constraints, where the unbalancedness is conquered by scaling the (sub)gradient with an estimation of consensus-based variable. The work [22] generalized [21] to time-varying unbalanced directed networks. In [23], we extended [20] to the cases with nonidentical constraints. Different with [6, 21, 22], which used the costly Euclidean projection operator, more computational inexpensive Polyak’s projection based algorithms were proposed in [11, 24]. The employment of Polyak’s projection allows agents to execute inaccurate projection and thus leads to less calculation per iteration. This idea was initially proposed in [25] for centralized constrained optimization problems, and extended to distributed constrained optimization problems over time-invariant unbalanced networks in [24]. Then, the results in [24] was further generalized to time-varying unbalanced networks in [11]. Especially, the authors in [11, 24] subtly eliminated the effect of unbalancedness by studying the epinetwork form of the original optimization problem. In all of the aforementioned distributed methods, an accurate local gradient information is required for each agent per iteration. The calculation of gradients leads to heavy computational load for each agent, which involves computation of certain subproblems, big data, and multiple function values. For constrained optimization, although the Polyak’s projection based algorithms [11, 24] avoid accurate projection, the calculations of projection are still time consuming. In this chapter, we introduce random sleep scheme into distributed constrained optimization to decrease calculations of (sub)gradient and projection. The random sleep scheme was initially introduced into wireless sensor networks to save energy and prolong the lifetime of the system [26], where each sensor periodically enters sleep mode (turn off its communication unit but remains its sensering unit active). 
Then, on the basis of projected consensus algorithm [6], the authors in [27, 28] introduced random sleep scheme to reduce the cost of projection, where each agent randomly decide whether to average
7.1 Introduction
143
with its neighbors and take projection on the set of optimal solutions. The insightful work [29] generalized the results of [27, 28] to distributed constrained (identical constraint) optimization, where each agent randomly and independently determines whether to calculate local (sub)gradient and take projection to perform an optimization step or only average with its neighbors (consensus). Thus, the computation of (sub)gradient and projection are simultaneously reduced. However, the result in [29] is only applicable to time-varying balanced networks. Up to now, we have noticed that there is no relevant work focusing on distributed constrained (nonidentical constraints) optimization problems over timevarying unbalanced directed networks with random sleep scheme. In order to overcome the deficiencies simultaneously, we present a distributed energy-efficient algorithm incorporating with random sleep scheme to address the constrained optimization problem over time-varying unbalanced directed networks. Each agent is independently characterised by two operation model at each iteration: activated and inactivated, through a simple local Bernoulli process. The optimization algorithm is divided into two parts to perform. In the first part, for those agents which are activated and their associated neighbor agents, two intermediate state vectors are obtained by adopting the Polyak’s projection idea [11, 24]. Then the dynamical averaging on two states are executed after a standard distributed (sub)gradient descent. In the second part, the intermediate state vectors would keep unchanged for the inactivated agents, and then the dynamical averaging on two states are executed without a distributed (sub)gradient descent. The contributions of this chapter are summarized as follows. (I) The proposed distributed optimization algorithm is still effective even over time-varying unbalanced directed networks that are only required to be uniformly jointly strongly connected, which is a relaxed long-term connectivity condition compared with most existing methods limited to strongly connected time-invariant networks [2, 3, 6, 7, 14, 17, 21, 24]. (II) The random sleep scheme is introduced in the distributed algorithm, which circumvents the requirements of (sub)gradient calculation and projection per iteration. This is the most distinctive feature of the proposed algorithm compared with the work on time-varying unbalanced networks [11, 20, 22, 23]. (III) Unlike the coordinated step-sizes used in [4, 6, 11, 17, 19, 21, 22, 24], the proposed algorithm is very flexible in the choice of the step-sizes since agents are allowed to use uncoordinated step-sizes. (IV) Compared with [29] and the work on constrained optimization [21, 22], the proposed algorithm also allows for nonidentical set constraints for each agent.
144
7 Random Sleep Scheme-Based Distributed Optimization …
7.2 Preliminaries 7.2.1 Notation If there are no other statements, the vectors in this chapter are assumed to be columns. Let ( f (x))+ = max{0, f (x)} take the nonnegative part of function f . We abbreviate with probability one as w.p.1.
7.2.2 Model of Optimization Problem Let X i ⊆ Rn be a nonempty closed convex set only accessed by agent i. The state of each agent xi (k) ∈ Rn is constrained in X i . The networked agents aim at collaboratively solving a convex optimization problem as below: min f (x) =
N
f i (x), X =
i=1
N
Xi , x ∈ X
(7.1)
i=1
where x ∈ Rn is a global decision vector, and each f i : Rn → R denotes the local continuous and convex function possessed by only agent i. For each function f i , the domain of f i is denoted by dom( f i ), where dom( f i ) = {x ∈ Rn | f i (x) < ∞}. In this chapter, we make no assumptions on the differentiability of the function f i . We utilize s fi (x) ∈ Rn to denote a (sub)gradient of the function f i at x ∈ dom( f i ) when the relation holds: f i (y) − f i (x) ≥ (s fi (x))T (y − x) for all y ∈ dom( f i ). The collection of all (sub)gradients of f i at x is defined as ∂ f i (x). This chapter aims at developing a distributed algorithm to address problem (7.1) over time-varying unbalanced networks. That is, at each iteration k ≥ 0, each agent i updates a vector xi (k) by local communication with its neighbors so that each xi (k) ultimately converges to some optimal solution x ∗ ∈ X ∗ , where X ∗ = {x| f (x) = f ∗ } is the collection of optimal solution and f ∗ is the optimal value of problem (7.1).
7.2.3 Communication Network Consider a network consisting of N agents. We denote by V = {1, . . . , N } the set of agents. At each time k ≥ 0, a directed network G(k) = {V, E(k), W(k)} is introduced to describe the underlying communication network. Here, E(k) ⊆ V × V is the collection of directed edges, and W(k) = [wi j (k)] ∈ R N ×N is the weight matrix where wi j (k) ≥ 0 is the weight appointed to the directed edge ( j, i). The collection of neighbors of agent i is defined as Ni (k) = { j|( j, i) ∈ E(k)}, i.e., agents j ∈ Ni (k)
7.2 Preliminaries
145
can directly send messages to agent i. We assume that there are no self-loops in the networks G(k), k ≥ 0.
7.3 Distributed Algorithm To circumvent the limitation of unbalancedness, we first convert the optimization problem (7.1) to its equivalent epinetwork form. Given a function h(x) : Rn → R, its epinetwork is denoted by epi h = {(x, t)|x ∈ dom(h), h(x) ≤ t}
(7.2)
With the definition of epinetwork, we have the same linear objective function for each agent. That is, problem (7.1) can be equivalently rewritten as min
N
1TN t, s.t. f i (x) − eiT t ≤ 0, i ∈ V, (x, t) ∈
(7.3)
i=1
where t = [t1 , . . . , t N ]T ∈ R N and = X × R N is the Cartesian product of X and R N . Denote i = {(x, t) f i (x) − eiT t ≤ 0, x ∈ X i }. Remark 7.1 Note that epi h ∈ Rn+1 is a convex set iff h is a convex function. Finding the minimal additional variable t within epi h is equal to minimizing h [30]. The local objective functions are transformed to the same linear function f 0 (y) = cT y, where y = [x T , t T ]T and c = [0nT , 1TN ], but with the additional constraint fi (x) ≤ ti , ∀i ∈ V. The epinetwork method subtly avoids the effect of different elements of the Perron vector, which is impractical to estimate in time-varying unbalancedness networks [11]. Let χi (k), k ≥ 0 be independent random variables to indicate at time k, whether agent i is activated or not. Here, χi (k), k ≥ 0 are i.i.d. Bernoulli random variables satisfying P(χi (k) = 1) = γi and P(χi (k) = 0) = 1 − γi , where 0 < γi < 1, i ∈ V. Particularly, we employ γmax = max{γi } and γmin = min{γi }, i ∈ V, in the subsequent analysis. We next present the distributed optimization algorithm with random sleep scheme (see Algorithm 2). In Algorithm 2, each agent locally updates its states and communicates with its neighbors such that the optimal solution to problem (7.3) (equivalently problem (7.1)) can be ultimately sought over time-varying unbalanced networks. Remark 7.2 In Algorithm 2, we apply the Polyak’s idea to cope with the constraint of (7.3) to drive the intermediated state vectors u j (k) and v j (k) toward the feasible set. As noted, the vector d j (k) ∈ ∂ f j (x j (k)) if f j (x j (k)) > eTj t j (k) and d j (k) = d j for some d j = 0n if f j (x j (k)) ≤ eTj t j (k). In fact d j (k) is a decreasing
146
7 Random Sleep Scheme-Based Distributed Optimization …
Algorithm 2 Random Sleep Scheme Based Distributed Optimization Algorithm over Time-Varying Unbalanced Networks Initialization: Each agent i ∈ V sets xi (0) ∈ Xi , ti (0) ∈ R N . Set k = 0. For i = 1 to N do Generating random variables: Each agent i independent and identically generates a Bernoulli random variable χi (k) at time k. 5: If χi (k) = 1 do 6: for j = 1 to N do 7: if j ∈ Ni (k) {i} then locally update for optimization:
1: 2: 3: 4:
8:
u j (k) = x j (k) −
9:
v j (k) = t j (k) +
( f j (x j (k))−eTj t j (k))+ d j (k), 1+d j (k)2 ( f j (x j (k))−eTj t j (k))+ ej, 1+d (k)2 j
where d j (k) ∈ ∂ f j (x j (k)) if f j (x j (k)) > eTj t j (k), and d j (k) = d j for some d j = 0n if f j (x j (k)) ≤ eTj t j (k). 10: else 11: u j (k) = x j (k), v j (k) = t j (k). 12: end if 13: end for 14: Dynamical averaging consensus by: N 15: xi (k + 1) = wi j (k) PX j [u j (k)], j=1
16:
ti (k + 1) =
N
wi j (k) (v j (k) − α j (k)1 N ).
j=1
17: Else 18: Dynamical averaging consensus by: N 19: xi (k + 1) = wi j (k) x j (k), j=1
20:
ti (k + 1) =
N
wi j (k) t j (k).
j=1
21: 22: 23: 24:
End if End for Set k = k + 1 and repeat. Until a terminator (e.g., a maximum iterations) is activated.
+ direction of f j (x j (k)) − eTj t j (k) which leads to that u j (k) asymptotically converges to the feasible set. Similarly, once agent i is activated at time, we have that dist ((xi (k + 1), ti (k + 1)), i ) ≤ dist ((xi (k), ti (k)), i ). When the agents achieve consensus, the state vector (xi (k), ti (k)), i ∈ V asymptotically converges to a feasible point. Let Fk denote the σ-algebra created by the total history of Algorithm 2 up to time k, i.e., for k ≥ 1, Fk = {((xiT (0), tiT (0))T , i ∈ V); χi (l); 1 ≤ l ≤ k} with F0 = {(xiT (0), tiT (0))T , i ∈ V}.
7.4 Convergence Analysis
147
7.4 Convergence Analysis We first present the following necessary assumptions. Assumption 7.1 [31] Each directed network G(k) and weight matrix W(k) satisfy the following conditions. s+B−1 (a) There exists an integer B ≥ 1 such that for all s ≥ 0, the joint network k=s G(k) is strongly connected. (b) Each W(k) associated with the network G(k) is row-stochastic, i.e., wi j (k) > 0 when ( j, i) ∈ E(k), for all k ≥ 0. (c) Each W(k) has positive diagonal entries, i.e., there exists a scalar β > 0 such that wii (k) = 1 − j∈Ni (k) wi j (k) > β for all k ≥ 0 and i ∈ V. Assumption 7.2 [4] The set of (sub)gradients of f i is uniformly bounded, i.e., for all i ∈ V, we can find a positive constant C f such that s fi (x) ≤ C f holds for all s fi (x) ∈ ∂ f i (x), where x ∈ dom( f i ). Assumption 7.3 For all i ∈ V and k ≥ 0, let the step-size αi (k) = 1/Γi (k), where Γi (k) denotes the number of times that agent i has been placed in activated mode until time k. Lemma 7.3 [32] Let a (k), b (k), v (k) and u (k) be sequences of Fk measurable nonnegative random variables such that for all k ≥ 0, where Fk represents the ∞b (0) , . . . , b (k), v (0) , . . . , v (k), u (0) , . . . , u (k). Let ∞set a (0) , . . . , a (k), a < ∞ and Then w.p.1, we have for some random (k) k=0 k=0 b (k) < ∞ a.s.. variable v ≥ 0, limk→∞ v (k) = v a.s. and ∞ k=0 u (k) < ∞ a.s.. Lemma 7.4 [11] Suppose Assumption 7.1 hold for G(k) and W(k). For s ≥ k, let s:k k:k = I. W s:k = W(s − 1) . . . W(k) and wis:k j be entries of W . Complement W T Then we can find a normalized vector π(k) satisfying 1 N π(k) = 1 for any k ≥ 0 such that s−k , ∀ i, j ∈ V, s ≥ (a) There exist L > 0, 0 < ξ < 1 such that |wis:k j − π j (k)| ≤ Lξ k ≥ 0. (b) We can find a constant η ≥ β (N −1)B such that πi (k) ≥ η, ∀ i ∈ V, k ≥ 0. (c) π T (k) = π T (k + 1)W(k).
In order to make this chapter self-contained, we introduce the following lemma which is originated from [7]. Lemma 7.5 Assume that the step-sizes αi (k), i ∈ V, satisfy Assumption 7.3. Then, there is a sufficiently large kˆ such that w.p.1 for all k ≥ kˆ and i ∈ V 2 αi (k) ≤ , kγi where 0 < q < 1/2 is a scalar.
αi (k) − 1 ≤ a(k) = 3 1 2 kγi k 2 −q γmin
(7.4)
148
7 Random Sleep Scheme-Based Distributed Optimization …
Proof Note that Γi (k) = kh=1 χi (h). Since χi (h) is i.i.d. with mean E[χi (h)] = γi , ∀i ∈ V. From the law of iterated logarithms (see Sect. 12.5 in [33]), it follows for any q > 0, we have that w.p.1 lim
|Γi (k) − kγ| k 2 +q 1
= 0, ∀i ∈ V, k → ∞
(7.5)
1 , ∀i ∈ V and k ≥ kˆ 2
(7.6)
ˆ we have Thus, for a large enough k, |Γi (k) − kγi | k
1 2 +q
≤
ˆ we have that w.p.1 That is, ∀i ∈ V and k ≥ k,
1 1 +q 1 1 1 −q 2 2 Γi (k) ≥ kγi − k k 2 +q = k γ− 2 2
(7.7)
Note that for 0 < q < 1/2, the term k (1/2)−q γi → ∞ as k increases. Then, choosing ˆ we have that w.p.1 a sufficiently larger kˆ (if required) such that ∀i ∈ V and k ≥ k, k 2 −q γi − 1
1 1 1 > k 2 −q γi 2 2
(7.8)
ˆ we have that w.p.1 Combining (7.6) and (7.7), we have ∀i ∈ V and k ≥ k, 1 2 ≤ Γi (k) kγi
(7.9)
where 0 < q < 1/2. Consider |αi (k) − (1/kγi )|, we obtain w.p.1 that, ∀i ∈ V and k ≥ kˆ αi (k)− 1 kγ
1 1 = 1 kγ Γ (k) |kγi − Γi (k)| ≤ 23 −q 2 i i i k γmin
(7.10)
This completes the proof. Lemma 7.6 [11] Consider the following sequence θi (k + 1) =
N
wi j (k)θ j (k) + i (k)
j=1
where the matrix sequence W(k) = [wi j (k)] ∈ R N ×N satisfies Assumption 7.1. N ¯ Define θ(k) = i=1 πi (k)θi (k) where πi (k) is given in Lemma 7.4. ¯ = 0. If limk→∞ i (k) = 0 for any i ∈ V, then we have limk→∞ θi (k) − θ(k)
7.4 Convergence Analysis
149
We now proceed to show the convergence analysis of Algorithm 2. Let us consider (7.3) with simplified notations as the following form min cT θ, s.t. gi (θ) ≤ 0, i ∈ V, θ ∈
(7.11)
where θ = (x T , t T )T , c = (0nT , 1TN )T and gi (θ)= f i (x) − eiT t. For all i ∈ V and k ≥ 0, define θi (k) = (xiT (k), tiT (k))T , pi (k) = (u iT (k), viT (k))T , φi (k) = (diT (k), eiT )T and gi (θi (k)) = f i (xi (k)) − eiT ti (k). Then, Algorithm 2 for solving (7.11) (equivalently for (7.1) or (7.3)) can be rewritten as two cases by the following ⎧ N ⎪ ⎪ ⎪ θ (k + 1) = wi j (k)PΘ j [ p j (k) − α j (k)c], χi (k) = 1 i ⎪ ⎪ ⎨ j=1 N ⎪ ⎪ ⎪ ⎪ θ (k + 1) = wi j (k)θ j (k), χi (k) = 0 ⎪ i ⎩
(7.12)
j=1
2 where i ∈ V, k ≥ 0, p j (k) = θ j (k) − (g j (θ j (k)))+ φ j (k)/φ j (k) and Θ j = X j × R N for j ∈ Ni (k) N {i}. Define Λ = i=1 Λi , where Λi = {θ|gi (θ) ≤ 0} for all i ∈ V, and further let Θ = Λ. Let Θ ∗ denote the set of optimal solutions to problem (7.11). Lemma 7.7 Assume that the step-sizes α j (k), j ∈ V, satisfy Assumption 7.3. For ˆ we have that w.p.1 any ϑ ∈ Θ, z j ∈ R N +n , and k > k, p j (k) − ϑ − α j (k)c2 (g j (θ j (k)))+ 2 2 T − c (z j − ϑ) + a(k)z j − ϑ2 φ j (k)2 kγ j 1 + a(k)c2 + p j (k) − z j 2 + α2j (k)(1 + 4η)c2 (7.13) 4η
≤θ j (k) − ϑ2 −
where η > 0 and 0 < q < 1/2. Proof First, we capture the relation between p j (k) − ϑ and θ j (k) − ϑ. Based on the definition of p j (k) and the (sub)gradient inequality, we have p j (k) − ϑ2 =θ j (k) − ϑ2 + ≤θ j (k) − ϑ2 +
(g j (θ j (k)))+ 2 (g j (θ j (k)))+ −2 (θ j (k) − ϑ)T φ j (k) 2 φ j (k) φ j (k)2 (g j (θ j (k)))+ 2 (g j (θ j (k)))+ − 2 ((g j (θ j (k)))+ − (g j (ϑ))+ ) φ j (k)2 φ j (k)2 (7.14)
150
7 Random Sleep Scheme-Based Distributed Optimization …
where the second inequality follows that φ j (k) is a (sub)gradient of (g j (θ j (k)))+ . This can be shown by (g j (θ j (k)))+ φ j (k) = (g j (θ j (k)))+ φ j (k)sign((g j (θ j (k)))+ ) and φ j (k)sign((g j (θ j (k)))+ ) is a (sub)gradient of (g j (θ j (k)))+ . Noticing g j (ϑ) ≤ 0 since ϑ ∈ Θ, we have (g j (ϑ))+ = 0. Therefore, one obtains from (7.14) that p j (k) − ϑ2 ≤ θ j (k) − ϑ2 −
(g j (θ j (k)))+ 2 φ j (k)2
(7.15)
Second, we characterize the relation between θ j (k)−ϑ and p j (k)−ϑ−α j (k)c. Through using the basic inequality 2α j (k)c p j (k)−z j ≤ 4ηα2j (k)c2 + p j (k) − z j 2 /4η, where η is a tunable parameter, for any z j ∈ R N +n , it follows that p j (k) − ϑ − α j (k)c2 ≤ p j (k) − ϑ2 − 2α j (k)cT (z j − ϑ) +
1 p j (k) − z j 2 + α2j (k)(1 + 4η)c2 4η (7.16)
Substituting (7.15) into (7.16), we obtain that p j (k) − ϑ − α j (k)c2 ≤θ j (k) − ϑ2 −
(g j (θ j (k)))+ 2 1 − 2α j (k)cT (z j − ϑ) + p j (k) − z j 2 2 φ j (k) 4η
+ α2j (k)(1 + 4η)c2
(7.17)
From Assumption 7.3, by using α j (k) − 1/kγ j ≤ a(k) (see Lemma 7.5), for kˆ sufficiently large, it follows w.p.1, ∀k ≥ kˆ and j ∈ V, that −2α j (k)cT (z j − ϑ) ≤ −
2 T c (z j − ϑ) + a(k)z j − ϑ2 + a(k)c2 kγ j
(7.18)
Replacing −2α j (k)cT (z j − ϑ) in (7.17) by (7.18) is exactly the inequality in Lemma 7.7. The proof is completed. Lemma 7.8 Assume that Assumptions 7.1 and 7.3 are satisfied. Let z j (k) = ˆ the following PΘ [ p j (k)] for all j ∈ V and k ≥ 0. For any ϑ ∈ Θ, i ∈ V, and k ≥ k, inequality holds E[θi (k + 1) − ϑ2 |Fk ] ≤(1 + a(k))
N j=1
wi j (k)θ j (k) − ϑ2 − γi
N j=1
wi j (k)
(g j (θ j (k)))+ 2 φ j (k)2
7.4 Convergence Analysis
−
151
N N 1 2γi γi wi j (k)cT (z j (k) − ϑ) + wi j (k) p j (k) − z j (k)2 k j=1 γj 4η j=1
+ γi a(k)c2 + γi (1 + 4η)c2
N
wi j (k)α2j (k)
(7.19)
j=1
ˆ let z j = z j (k) = PΘ [ p j (k)]. It follows from (7.13), we have Proof ∀ j ∈ V, k ≥ k, that w.p.1, p j (k) − ϑ − α j (k)c2 (g j (θ j (k)))+ 2 2 T − c (z j (k) − ϑ) + a(k)z j (k) φ j (k)2 kγ j 1 − ϑ2 + a(k)c2 + p j (k) − z j (k)2 + α2j (k)(1 + 4η)c2 (7.20) 4η
≤θ j (k) − ϑ2 −
Since ϑ ∈ Θ, noticing (7.15), we have the following z j (k)−ϑ2 ≤ θ j (k)−ϑ2 −
(g j (θ j (k)))+ 2 φ j (k)2
(7.21)
Substituting (7.21) into (7.20) gives us the following p j (k) − ϑ − α j (k)c2 ≤(1 + a(k))θ j (k) − ϑ2 − +
(g j (θ j (k)))+ 2 2 T − c (z j (k) − ϑ) + a(k)c2 2 φ j (k) kγ j
1 p j (k) − z j (k)2 + α2j (k)(1 + 4η)c2 4η
(7.22)
For the case χi (k) = 1, it follows that θi (k) − ϑ ≤ 2
N
wi j (k) p j (k) − ϑ − α j (k)c2
(7.23)
j=1
Substituting (7.22) into (7.23), for χi (k) = 1, we conclude θi (k + 1) − ϑ2 N N (g j (θ j (k)))+ 2 2 ≤ (1 + a(k))wi j (k)θ j (k) − ϑ − wi j (k) φ j (k)2 j=1 j=1
−
N N 2 1 1 wi j (k) cT (z j (k) − ϑ) + a(k)c2 + wi j (k) p j (k) − z j (k)2 k j=1 γj 4η j=1
152
7 Random Sleep Scheme-Based Distributed Optimization …
+ (1 + 4η)c2
N
wi j (k)α2j (k)
(7.24)
j=1
For the case χi (k) = 0, it follows that θi (k +1)−ϑ2 ≤ (1+a(k))
N
wi j (k)θ j (k)−ϑ2
(7.25)
j=1
Combining the two cases (7.24) and (7.25), the results in Lemma 7.8 follow immediately. The proof is completed. Theorem 7.9 Consider the sequence {θi (k)} in (7.12). Let Assumptions 7.1–7.3 hold. We have limk→∞ θi (k) = θ∗ ∈ Θ ∗ for all i ∈ V a.s.. Proof Let ϑ = θ∗ ∈ Θ ∗ . Noting z j (k) ∈ Θ, we can get cT (z j (k) − θ∗ ) ≥ 0. Multiplying πi (k + 1) at both sides of (7.19) and summing up for i from 1 to N , it follows, ∀k ≥ kˆ E
N
∗ 2
πi (k + 1)θi (k + 1) − θ |Fk
i=1
≤(1 + a(k))
N
πi (k + 1)
N
i=1
− γmin
N
πi (k + 1)
j=1 N
i=1
2γmin − kγmax +
γmax 4η
πi (k + 1)
i=1
N
wi j (k)
j=1
N
wi j (k)θ j (k) − θ∗ 2
N
(g j (θ j (k)))+ 2 φ j (k)2
wi j (k)cT (z j (k) − θ∗ )
j=1
πi (k + 1)
N
i=1
wi j (k) p j (k) − z j (k)2
j=1
+ γmax (1 + 4η)c2
N
πi (k + 1)
i=1
N
wi j (k)α2j (k) + γmax a(k)c2
(7.26)
j=1
Defining y(k) = [y1T (k), . . . , y NT (k)]T , we have N i=1
πi (k + 1)
N
wi j (k)y j (k) = π T (k + 1)W(k)y(k)
j=1
From π T (k) = π T (k + 1)W(k) (see Lemma 7.4), we obtain
(7.27)
7.4 Convergence Analysis N
153
πi (k + 1)
i=1
N
N
wi j (k)y j (k) =
j=1
πi (k)yi (k)
(7.28)
i=1
Therefore, we can rewrite inequality (7.26) as E
N
∗ 2
πi (k + 1)θi (k + 1) − θ |Fk
i=1
≤(1 + a(k))
N
πi (k)θi (k) − θ∗ 2 − γmin
i=1 N
2γmin kγmax
−
N
πi (k)
i=1
πi (k)cT (z i (k) − θ∗ ) +
i=1
(gi (θi (k)))+ 2 φi (k)2
N γmax πi (k) pi (k) − z i (k)2 4η i=1
+ γmax a(k)c2 + γmax (1 + 4η)γc2
N
αi2 (k)
(7.29)
i=1
N From [25], we can find a finite constant c > 0 such that for all θ ∈ i=1 Θi , the inequality dist2 (θ, Θ) ≤ c(gi (θ))+ 2 holds. Notice pi (k) − z i (k) = pi (k) − PΘ [ pi (k)] ≤ θi (k) − PΘ [ pi (k)] = θi (k) − z i (k). From Assumption 7.2, we have φi (k) ≤ C f + 1 for all i ∈ V. If we choose η > 1/4c(C f + 1)2 and let κ = γmax /(C f + 1)2 c − γmax /4η, then we get the following, ∀k ≥ kˆ E
N
πi (k + 1)θi (k + 1) − ϑ |Fk 2
i=1
≤(1 + a(k))
N
πi (k)θi (k) − ϑ2 −
i=1
−κ
N
N 2γmin πi (k)cT (z i (k) − ϑ) kγmax i=1
πi (k)θi (k) − z i (k)2 + γmax a(k)c2 + γmax (1 + 4η)c2
i=1
N
αi2 (k)
i=1
(7.30) Denoting z¯ (k) = E
N
N i=1
πi (k)z i (k), it follows, ∀k ≥ kˆ ∗ 2
πi (k + 1)θi (k + 1) − θ |Fk
i=1
≤(1 + a(k))
N i=1
πi (k)θi (k) − θ∗ 2 −
2γmin T c (¯z (k) − θ∗ ) kγmax
154
7 Random Sleep Scheme-Based Distributed Optimization …
−κ
N
πi (k)θi (k) − z i (k)2 + γmax a(k)c2 + γmax (1 + 4η)c2
i=1
N
αi2 (k)
i=1
(7.31) 2 a(k) < ∞ since a(k) = 1/k (3/2)−1 γmin with 0 < From Lemma 7.5, we have ∞ ∞ N 2 k=0 q < 1/2 and k=0 i=1 αi (k) < ∞ w.p.1. All the conditions in Lemma 7.3 are met and we have thefollowing results: N ∗ 2 ∗ ∗ Result 1: E[ i (k) − θ ] converges as k → ∞ for any θ ∈ Θ . ∞ i=1 πi (k)θ Result 2: k=0 (1/kγ)cT (¯z (k) − θ∗ ) < ∞ a.s.. N 2 Result 3: ∞ k=0 i=1 πi (k)θi (k) − z i (k) < ∞ a.s.. T Combing Result 2 and the fact c (¯z (k) − θ∗ ) ≥ 0, it follows limk→∞ inf cT ׯz (k) = cT θ∗ . Hence, there exists a subsequence {¯z (k) |k ∈ K } which converges to some Θˆ ∗ ∈ Θ ∗ a.s.. Since {¯z (k) − Θˆ ∗ }, we have that limk→∞ z¯ (k)− Θˆ ∗ = 0. From Result 3, as πi (k) ≥ η (see Lemma 7.4), we have limk→∞ θi (k) − z i (k) = 0 ¯ a.s. for all i ∈ V. Let θ(k) = π T (k)θ(k) and z¯ (k) = π T (k)z(k), where θ(k) = T T T (θ1 (k), . . . , θ N (k)) and z(k) = (z 1T (k), . . . , z TN (k))T . Therefore, one gets ¯ − z¯ (k) = 0 a.s.. limk→∞ θ(k) We have E[θi (k + 1)|Fk ] =
N
wi j (k)θ j (k) + γi ei (k)
(7.32)
j=1
where ei (k) = Nj=1 wi j (k)(PΘ j [ p j (k) − α j (k)c] − θ j (k)) for all i ∈ V. By computation, one has ei (k) N = wi j (k) · PΘ j p j (k) − α j (k)c − θ j (k) j=1
≤
N
N wi j (k) p j (k) − α j (k)c − z j (k) + wi j (k) z j (k) − θ j (k)
j=1
≤2
N j=1
j=1
wi j (k) z j (k) − θ j (k) + c
N
wi j (k)α j (k)
(7.33)
j=1
where we utilize the nonexpansive property of the projection PΘ j to obtain first inequality and the second inequality follows from (7.15) for some z j (k) ∈ Θ, ∀ j ∈ V. Consequently, limk→∞ ei (k) = 0, a.s. for all i ∈ V. Applying Lemma 7.6, we ¯ ¯ for all i ∈ V. Since θi (k) − Θˆ ∗ ≤ θi (k) − θ(k) + have limk→∞ θi (k) − θ(k), ∗ ∗ ¯ ˆ ˆ θ(k) − z¯ (k) + ¯z (k) − Θ , we conclude that limk→∞ θi (k) − Θ = 0 for all i ∈ V and some Θˆ ∗ ∈ Θ ∗ a.s.. The proof is completed.
7.5 Numerical Examples
155
7.5 Numerical Examples In this section, we report two case studies to verify the correctness of the theoretical results and practicability of Algorithm 2. The two simulations will be conducted by a networked system with five agents (N = 5) over time-varying unbalanced directed networks, which is described as: G(k) = {V, E(k), W(k)}, where w21 (3k + 1) = 0.3, w34 (3k + 1) = 0.4, w52 (3k + 1) = 0.3; w15 (3k + 2) = 0.2, w32 (3k + 2) = 0.3, w54 (3k + 2) = 0.4; w13 (3k + 3) = 0.4, w35 (3k + 3) = 0.3, / E(k) w42 (3k + 3) = 0.3; wii (k) = 1 − j∈Ni (k) wi j (k), and wi j (k) = 0 if ( j, i) ∈ for all k = 0, 1, . . .. It is shown that all the weight matrices W(k), k ≥ 0 are only row-stochastic but not column-stochastic, and the whole network is uniformly jointly strongly connected a certain bounded length time interval.
7.5.1 Case Study I In this case, we take into consideration a distributed parameter estimation problem in wireless-sensor networks [34]. It is too private and expensive to share the data set by all sensor in many situations. The sensors need to distributedly address the following optimization problem for the estimation: f (x) = min
N ψi j (x; υi j )1 , x ∈ X i=1
(7.34)
φi j ∈Φi
where ψi j (x; υi j ) is the Huber penalty function (to eliminate the influence of outliers) with the p-th element defined as [35]: ⎧ ⎨ 1 (x p − υ p )2 , 2 p i j ψi j (x; υi j ) = p p ⎩ x − υ − 1 , ij 2
p υi j − x p ≤ 1 p υi j − x p > 1
(7.35)
5 The constrained set X is formed by X = i=1 X i , where X i = [X i1 ; X i2 ; X i3 ]. In detail, the agents’ private set constraints are set as: X 11 = [−2, 9], X 12 = [−3, 3], X 13 = [−5, 3], X 21 = [−4, 8], X 22 = [−4, 2], X 23 =[−6, 2], X 31 = [−5, 9.5], X 32 = [−5, 5], X 33 = [−6, 5.5], X 41 = [−3.5, 8.5], X 42 = [−3.4, 4], X 43 = [−5, 4], X 51 = [−3, 9], X 52 = [−3, 5], X 53 = [−5.5, 5.5]. We use B(ρ, σ) to denote a normal distribution with ρ as the mean vector and σ as the covariance matrix. We implement the Huber penalty function with 50 measurements perturbed by white Gaussian noise. Because of wrong observation, the first 40 measurements (denoted as φi j , j = 1, . . . , 40) are set to be i.i.d. B([6, 2, −2]T , I ) while the remained 10 measurements (denoted as φi j , j = 41, . . . , 50) are set to be i.i.d. B([6, 2, −2]T , 10I ). Then, the collection of φi j is defined as fli = {φi j , j = 1, . . . , 50}. According to
156
7 Random Sleep Scheme-Based Distributed Optimization … 10
Fig. 7.1 The evolutions of all agents’ xi (k)
x (k)
8
1
x (k) 2
States of x i (k)
6
x 3 (k) x (k) 4
4
x (k) 5
2 0 −2 −4 −6
0
500
1000
1500
2000
Time [step] 10
Fig. 7.2 Relative estimation error between agents 1 and 2
x1− x1
8
2
2 2
3 1
3 2
x −x
4
x1(k) − x (k)
1 2
x1− x2
6 2 0 −2 −4 −6 −8 −10 0
500
1000
1500
2000
Time [step]
Assumption 7.3, we set γ1 = 0.8, γ2 = γ3 = γ4 = 0.5, and γ5 = 0.6. The evolutions of xi (k), i = 1, . . . 5 at each dimension is reported in Fig. 7.1. Figure 7.1 shows that the optimal solution to problem (7.34) can be sought over time-varying unbalanced directed networks. The relative estimation errors between agent 1 and 2 is shown in Fig. 7.2 that illustrates they reach consensus.
7.5.2 Case Study II In this case, we research a robust linear regression problem [16]. The observation of each agent i ∈ {1, . . . , N } is given by yi = Pi x + ei , where yi ∈ R pi and Pi ∈ R pi ×n are known data, x ∈ Rn is unknown, and ei ∈ R pi is the measurement noise. The goal of the problem is to make a distributed estimation on x. The optimization problem
7.5 Numerical Examples
157
is: f (x) = min
pi N
h c (Pi, j x − yi, j ), x ∈ X
(7.36)
i=1 j=1
where Pi, j and yi, j are the j-th row of matrix Pi and vector yi , respectively. Here, the Huber loss function h c is used to eliminate the influence of outliers as follows h c (r ) =
if |r | > c r 2, c(2|r | − c), if |r | ≤ c
(7.37)
In the experiment, we set N = 5, [ p1 , p2 , p3 , p4 , p5 ]T = [6, 8, 7, 9, 5]T , and n = 2. The entries of vector Pi, j is generated by an i.i.d.. Gaussian distribution, then normalized to be ||Pi, j || = 1. The measurement noise is defined by a Gaussian distribution with standard deviation σ = 0.03 and cut-off parameter c is set to be c = 10σ. the 5 X i , where X i = [X i1 ; X i2 ]. In detail, the The constrained set X is formed by X = i=1 1 agents’ private set constraints are set as: X 1 = [−0.8, 2], X 12 = [−2.4, 1.2], X 21 = [−1.6, 1.6], X 22 = [−1.6, 0.8], X 31 = [ −2, 1.8], X 32 = [−2, 2], X 41 = [−1.4, 2.2], X 42 = [−1.4, 1.6], X 51 = [−1.2, 2], X 52 = [−1.2, 2]. According to Assumption 7.3, we set γ1 = 0.8, γ2 = γ3 = γ4 = 0.5, and γ5 = 0.6. Figure 7.3 shows the evolution of xi (k), i = 1, . . . 5. We can see from Fig. 7.3 that the optimal solution to problem (7.36) can be sought over time-varying unbalanced directed networks. To further illustrate the influence of the random sleep scheme on the convergence of algorithm, we set all the agents adopt identical probabilities 0.8, γ5 = 0.3, 0.5, and xi (k) − x ∗ with respectively. The evolutions of the relative residual J (k) = i=1 different random sleep probabilities are reported in Fig. 7.4. As shown in Fig. 7.4, the larger probability chooses, the faster convergence the algorithm can achieve. But at the same time, a smaller probability leads to comparatively larger step-sizes (since the uncoordinated step-sizes are relevant to the frequency of the agent updates), which speeds up the convergence to some extent.
1.5
Fig. 7.3 The evolutions of all agents’ xi (k)
x (k) 1
x (k)
1
2
x (k)
0.5
x (k) 5
i
States of x (k)
3
x 4(k)
0 −0.5 −1 −1.5
100 200 300 400 500 600 700 800 900 1000
Time [step]
158
7 Random Sleep Scheme-Based Distributed Optimization …
Fig. 7.4 The evolutions of 5 ∗ i=1 x i (k) − x for γ = 0.3, 0.5, and 0.8
10
2
0
J (k)
10
−2
10
−4
10
0
10
1
10
10
2
10
3
10
4
Time [step]
(a) γ = 0.3 2
10
0
J (k)
10
−2
10
−4
10
0
10
1
10
2
10
3
10
4
10
Time [step]
(b) γ = 0.5 2
10
0
J (k)
10
−2
10
−4
10 0 10
1
10
2
10
Time [step]
(c) γ = 0.8
3
10
4
10
7.6 Conclusion
159
7.6 Conclusion This chapter has researched a distributed optimization problem defined on multiagent networks involving heterogeneous constraint sets. A completely distributed optimization algorithm over time-varying unbalanced directed networks has been proposed, the main characteristic of which includes incorporation of random sleep scheme, uncoordinated step-sizes, and nonidentical constraints. The almost sure convergence of Algorithm 2 has been rigorously established under some moderate conditions.
References 1. J. Duchi, A. Agarwal, M. Wainwright, Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Trans. Autom. Control 57(1), 592–606 (2012) 2. I. Lobel, A. Ozdaglar, D. Feijer, Distributed multi-agent optimization with state-dependent communication. Math. Program. 129(2), 255–284 (2011) 3. I. Matei, J. Baras, Performance evaluation of the consensus-based distributed subgradient method under random communication topologies. IEEE J. Sel. Top. Signal Process. 5(4), 754–771 (2011) 4. A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control 54(1), 48–61 (2009) 5. H. Li, G. Chen, T. Huang, Z. Dong, W. Zhu, L. Gao, Event-triggered distributed average consensus over directed digital networks with limited communication bandwidth. IEEE Trans. Cybern. 46(12), 3098–3110 (2016) 6. A. Nedic, A. Ozdaglar, P. Parrilo, Constrained consensus and optimization in multi-agent networks. IEEE Trans. Autom. Control 55(4), 922–938 (2010) 7. A. Nedic, Asynchronous broadcast-based convex optimization over a network. IEEE Trans. Autom. Control 56(6), 1337–1351 (2011) 8. C.K. Maurya, T. Durga, Large-Scale distributed sparse class-imbalance learning. Inf. Sci. 456, 1–12 (2018) 9. X. He, T. Huang, J. Yu, C. Li, Y. Zhang, A continuous-time algorithm for distributed optimization based on multiagent networks. IEEE Trans. Syst. Man Cybern. Syst. 49(12), 2700–2709 (2019) 10. H. Li, G. Chen, T. Huang, Z. Dong, High performance consensus control in networked systems with limited bandwidth communication and time-varying directed topologies. IEEE Trans. Neural Netw. Learn. Syst. 28(5), 1043–1054 (2017) 11. P. Xie, K. You, R. Tempo, S. Song, C. Wu, Distributed convex optimization with inequality constraints over time-varying unbalanced digraphs. IEEE Trans. Autom. Control 63(12), 4331– 4337 (2018) 12. W. Dong, W. Zhu, C. Ming, W. Wang, Distributed optimization for multi-agent systems with constraints set and communication time-delay over a directed graph. Inf. Sci. 438, 1–14 (2018) 13. J.N. Tsitsiklis, Problems in decentralized decision making and computation. PH. D. dissertation, Massachusetts Institute of Technology, Cambridge, MA (1984) 14. S. Yang, Q. Liu, J. Wang, Distributed optimization based on a multiagent system in the presence of communication delays. IEEE Trans. Syst. Man Cybern. Syst. 47(5), 717–728 (2017) 15. H. Li, S. Liu, Y.C. Soh, L. Xie, Event-triggered communication and data rate constraint for distributed optimization of multi-agent systems. IEEE Trans. Syst. Man Cybern. Syst. 48(11), 1908–1919 (2018)
160
7 Random Sleep Scheme-Based Distributed Optimization …
16. Y. Sun, G. Scutari, D. Palomar, Distributed nonconvex multiagent optimization over timevarying networks, in Conference Record - Asilomar Conference on Signals, Systems and Computers, https://doi.org/10.1109/ACSSC.2016.7869154 17. C. Xi, U. Khan, Distributed subgradient projection algorithm over directed graphs. IEEE Trans. Autom. Control 62(8), 3986–3992 (2017) 18. K. Cai, H. Ishii, Average consensus on general strongly connected digraphs. Automatica 48(11), 2750–2761 (2012) 19. A. Nedic, A. Olshevsky, Distributed optimization over time-varying directed graphs. IEEE Trans. Autom. Control 60(3), 601–615 (2015) 20. A. Nedic, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017) 21. V. Mai, E. Abed, Distributed optimization over weighted directed graphs using row stochastic matrix, in 2016 American Control Conference, https://doi.org/10.1109/ACC.2016.7526803 22. H. Li, Q. Lü, T. Huang, Distributed projection subgradient algorithm over time-varying general unbalanced directed graphs. IEEE Trans. Autom. Control 64(3), 1309–1316 (2019) 23. H. Li, Q. Lü, X. Liao, T. Huang, Accelerated convergence algorithm for distributed constrained optimization under time-varying general directed graphs. IEEE Trans. Syst. Man Cybern. Syst. https://doi.org/10.1109/TSMC.2018.2823901 24. K. You, R. Tempo, P. Xie, Distributed algorithms for robust convex optimization via the scenario approach. IEEE Trans. Autom. Control 64(3), 880–895 (2019) 25. A. Nedic, Random algorithms for convex minimization problems. Math. Program. 129(2), 225–253 (2011) 26. J. Liu, X. Jiang, T. Horiguchi, T. Lee, Analysis of random sleep scheme for wireless sensor networks. Int. J. Sens. Netw. 7(1), 71–84 (2010) 27. G. Shi, K.H. Johansson, Randomized optimal consensus of multi-agent systems. Automatica 48(12), 3018–3030 (2012) 28. Y. Lou, G. Shi, K.H. Johansson, Y. Hong, Convergence of random sleep algorithms for optimal consensus. Syst. Control Lett. 62(12), 1196–1202 (2013) 29. Y. Peng, Y. Hong, Stochastic sub-gradient algorithm for distributed optimization with random sleep scheme. Control Theory Technol. 13(4), 333–347 (2015) 30. D. Bertsekas, Convex Optimization Algorithms (Athena Scientific Belmont, 2015) 31. A. Nedic, J. Liu, On convergence rate of weighted-averaging dynamics for consensus problems. IEEE Trans. Autom. Control 62(2), 766–781 (2017) 32. H. Robbins, D. Siegmund, A convergence theorem for non-negative almost supermartingales and some applications, https://doi.org/10.1016/B978-0-12-604550-5.50015-8 33. R. Dudley, Real Analysis and Probability (Cambridge University Press, Cambridge, MA, 2002) 34. L. Xiao, S. Boyd, S. Lall, A scheme for robust distributed sensor fusion based on average consensus, in Fourth International Symposium on Information Processing in Sensor Networks, https://doi.org/10.1109/IPSN.2005.1440896 35. P. Yi, Y. Hong, Quantized subgradient algorithm and data-rate analysis for distributed optimization. IEEE Trans. Control Netw. Syst. 1(4), 380–392 (2014)
Chapter 8
Distributed Stochastic Optimization: Variance Reduction and Edge-Based Method
8.1 Introduction The abstraction of distributed optimization is to achieve optimal decision making or control by local manipulation with private data and diffusion of local information through a network of computational nodes. Due to the promising prospects in machine learning, statistical computation [1, 2], and extensive applications for power systems, sensor networks, and wireless communication networks [3, 4], distributed optimization has harvested many attentions over the years. Most of issues arisen in these fields are cast as distributed optimization problems, in which nodes of a network collaboratively optimize a global objective function through operating on their local objective functions and communicating with their neighbors only. The methodologies of distributed optimization chiefly comprise the primal domain methods, augmented Lagrangian methods, and network Newton methods. A few notable primal domain methods were distributed gradient-based methods, such as distributed (sub)gradient descent [5] and distributed dual averaging [6]. Extensions of these work were provided to cope with various situations e.g., time-varying or random networks [7–10], communication delays [11, 12] stochastic gradient errors [13], and constrained problems [14, 15]. The main advantage of gradient-based algorithms is intuitive and computationally simple. However, a necessary requirement of diminishing step-size to ensure exact convergence leads to slow convergence rates. An adoption of constant step-size could accelerate the rate but with low accuracy [16]. To bypass the limitation, a distributed Nesterov gradient algorithm was proposed in [17], which allowed the using of constant step-size to acquire faster rates and ensures exact convergence, whereas excessive consensus computations were performed at each iteration. Recently, a gradient tracking approach was employed to guarantee convergence for smooth convex functions [18] and shown to achieve linear convergence for strongly convex functions [19, 20]. Related work along this line included extensions over directed networks [21–23], stochastic or time-varying networks [24, 25]. © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2020 H. Li et al., Distributed Optimization: Advances in Theories, Methods, and Applications, https://doi.org/10.1007/978-981-15-6109-2_8
161
162
8 Distributed Stochastic Optimization: Variance Reduction …
Aside from primal domain methods, dual domain methods were widely studied on the basis of augmented Lagrangian (AL) functions [26]. A typical example was the decentralized alternating direction method of multipliers (ADMM) [27], based on which many distributed algorithms were presented [28–30]. Distributed ADMM showed linear convergence rates for strongly convex functions with constant stepsize, but suffered from heavy computation burden because each node had to optimize a local problem at each iteration. To realize computational simplicity, the decentralized linearized ADMM (LADMM) [31–33] and the exact first-order algorithm (EXTRA) [34] were developed, which were essentially first-order approximations of distributed ADMM. Especially, both decentralized LADMM and EXTRA retained linear convergence as distributed ADMM shows. It was worth mentioning that although the abovementioned methods [18–25, 31– 34] were shown to be able to achieve linear convergence and keep high accuracy, a full evaluation of the gradients of local objective functions was required at each iteration. In essence, the practicabilities of these methods were limited for the cases with large amounts of data and parameters. In centralized optimization, stochastic gradient (SG) was an efficient modification and widely adopted to alleviate the cost of full evaluation of gradients. At each iteration, gradient descent in standard SG methods only relied on substituting full gradient vector with one point value from the entire batch. However, a major deficiency was that only by using diminishing step-size, the variance of standard stochastic gradient update direction can converge to zero, thus failed to achieve linear convergence. This limitation in fact evoked the employment of variance-reduced techniques to reduce the variance. Several notable methods applied stochastic incremental averaging gradients, e.g., stochastic averaging gradient (SAG) [35], stochastic variance reduction gradient (SVRG) [36], and unbiased stochastic averaging gradient (SAGA) [37], which permitted the utilization of constant step-size for linear convergence. In these methods, only one randomly selected gradient was updated at each iteration. As for distributed optimization, there were considerable works on stochastic gradient algorithms where gradients of local objective functions were corrupted by stochastic noises [38–43]. Rather than inquiring exact gradients, these algorithms used less-computational noisy gradients and showed comparable performance with their deterministic counterparts. In distributed settings, individual nodes possess their private data (training samples or trainable parameters). Implementation of gradient based distributed algorithms necessitated massive computational requirements of gradients, especially when a large amounts of data was distributed to the nodes. Conventionally, each node could implement the noisy gradient algorithms and reduce the cost of full evaluation of local gradients by uniformly choosing only one data point from its local batch [43]. However, resembling the centralized standard stochastic gradient algorithm, this led to sublinear convergence rate. The bottleneck was overcome in [44] where a distributed double stochastic averaging gradient algorithm (DSA) was proposed on the basis of EXTRA. In [44], the local objective function was conceived to be an average of a number of subfunctions. 
This setup was inspired by machine learning problems with training samples distributed and known only to individual computational nodes, and provided the precondition for using a localized SAGA method. DSA worked well
8.1 Introduction
163
in large-scale issues and showed satisfactory convergence rate, but had strict sets on the weight matrices, i.e., the two weight matrices in DSA required to be symmetric and met a predefined spectral relationship. This motivated us to explore the role of the localized SAGA method applied in more general distributed gradient methods, not only for the purpose of reducing the cost of full evaluation of local gradients but also showing comparable performance with their deterministic counterparts. The insightful work [45] proposed a distributed AL algorithm by introducing the edgewise constraints, and a novel and concise description on a linear convergence rate was provided. This chapter presents a distributed stochastic gradient-based AL algorithm based on [45], but uses a different local objective function setup and a local SAGA method inspired by [44]. We now highlight main contributions of this chapter distinguishing from the existing literature. (I) A distributed stochastic gradient-based AL algorithm is proposed, which significantly reduces the cost of gradient evaluation. Unlike most of existing algorithm for distributed optimization [5, 6, 16–20, 24–26, 31–34, 45, 46], with the requirement of evaluating the full local gradients, we use an unbiased stochastic averaging gradient to approximate the full local gradient, which relies on the evaluation of gradient of only one randomly selected local subfunction. (II) An explicit linear convergence rate of the presented algorithm is established and an upper bound for the selection of constant step-sizes is provided. Motivated by [45], we use the techniques of the factorization of weighted Laplacian and the spectral decomposition to show a simplified convergence analysis process in contrast to related work [31, 34, 44]. (III) We establish the relationship between the convergence rate, constant step-size, and edge weights, which gives a diverse description of the convergence rate. It reveals the superiority of the proposed algorithm in flexible edge weight selections compared with gradient tracking methods [19, 20, 24, 25] and saddlepoint methods [34, 44]. One can accelerate the convergence by increasing the edge weights so long as the constant step-size respects its bound.
8.2 Preliminaries In this section, we formulate the distributed optimization problem, state the network model, then reformulate the distributed optimization problem by introducing the edge-based constraints.
8.2.1 Notation We abbreviate almost surely as a.s. and independent and identically distributed as i.i.d..
164
8 Distributed Stochastic Optimization: Variance Reduction …
8.2.2 Model of Optimization Problem Consider a network containing N nodes which aims at collaboratively solving the following optimization problem: min fˆ(x) ˆ =
N i=1
qi N 1 h f i (x) ˆ = f i (x), ˆ xˆ ∈ Rn q i=1 i h=1
(8.1)
where f i : Rn → R is the local objective function of node i. The local objective function f i is further defined as an average of qi local subfunctions f ih , h = 1, . . . , qi , known privately by node i. Define qmin = min{qi } and qmax = max{qi }, i = 1, . . . , N . The potential application scenarios of problem (8.1) are machine learning problems with large training sets involving training samples that are distributed among the nodes, such as distributed logistic regression and reinforcement learning problems.
8.2.3 Communication Network We consider nodes of the network communicating through a connected undirected network G = {V, E, A}, where V = {1, 2, . . . , N } denotes the collection of N nodes, E ⊆ V × V denotes the collection of M edges, and A = [li j ] ∈ R N ×N indicates the weighted adjacency matrix where the weight li j associated with edge (i, j) satisfies: li j > 0 if (i, j) ∈ E; and li j = 0, otherwise. The Laplacian matrix L = [Li j ] ∈ R N ×N for G is denoted by: Li j = −li j if i = j; and Lii = Nj=i li j . Thus, L has N −1 nonnegative eigenvalues and a zero eigenvalue. Let 0 = λ1 (L) ≤ λ2 (L) · · · ≤ λ N (L) represent the eigenvalues of L. Now, we provide a factorization of the Laplacian matrix L by introducing a diagonal matrix W and an edge-node incidence matrix C. Let eτ denote the edges (i, j) ∈ E, i.e., eτ = (i, j) ∈ E, τ = 1, 2, . . . , M, and impose on each edge eτ an arbitrary orientation. Define W = diag {w1 , w2 , . . . , w M } as an M × M diagonal matrix where wτ = ai j for (i, j) = eτ ∈ C, τ = 1, 2, . . . , M. Further, define C = [cκτ ] ∈ R N ×M as an edge-node incidence matrix in which each row of C is associated with a node and each column represents an edge in the network, respectively. We have cκτ = 1 if node κ is the head of eτ ; cκτ = −1 if node κ is the tail of eτ ; and cκτ = 0, otherwise. Note that 1TN C = 0. Based on the definitions of W and C, we can factorize the 1 1 Laplacian matrix as L = CWC T = CW 2 W 2 C T .
8.2 Preliminaries
165
8.2.4 Problem Reformulation Next, we reformulate problem (8.1) by introducing the edge-based constraints: min f (x) =
N i=1
s.t.
qi N 1 h f i (xi ) = f i (xi ), x ∈ Rn N q i i=1 h=1
li j (xi − x j ) = 0, ∀(i, j) ∈ E
(8.2)
xi . where x = [x1T , . . . , x NT ]T ∈ Rn N is a vector that concatenates the local iterates By the factorization of the Laplacian matrix, the edge-based constraints li j (xi − 1 x j ) = 0, (i, j) ∈ E, can be expressed as a vector form of (W 2 C T ⊗ In ) ×x = 0. Defining L = (L ⊗ In ), W = (W ⊗ In ), and C = (C ⊗ In ), problem (8.2) is equivalent to min f (x) =
N
f i (xi ) =
i=1 1 2
qi N 1 h f i (xi ), x ∈ Rn N q i=1 i h=1
s.t. W C x = 0 T
(8.3)
8.3 Distributed Algorithm In this section, a distributed stochastic gradient-based AL algorithm is proposed to solve optimization problem (8.3). In particular, to alleviate the computation cost of the evaluation of local gradients, we use local unbiased stochastic averaging gradient to substitute for the local objective function gradient.
8.3.1 Unbiased Stochastic Averaging Gradient Recall the definitions of the local objective function f i (xi ) and the local subfunctions {1, . . . , qi }, available at node i. The local objective function gradient f ih (xi ), h ∈ qi ∇ f i (xi ) = h=1 ∇ f ih (xi )/qi is costly to evaluate, especially when qi is large. To resolve this problem, we employ stochastic averaging gradient to replace the local gradient ∇ f i (xi (k)). Define yih (k) ∈ Rn the iteration values of subfunction f ih (k), h ∈ {1, . . . , qi }, at iteration k. Let ti (k) ∈ {1, . . . , qi } indicate one of the subfunctions selected uniformly at random at iteration k and update yih (k) as follows
yih (k + 1) = xi (k), if h = ti (k) , h ∈ {1, . . . , qi } yih (k + 1) = yih (k), if h = ti (k)
(8.4)
166
8 Distributed Stochastic Optimization: Variance Reduction …
At iteration k, the stochastic averaging gradient at node i is defined as gi (k) = ∇ f iti (k) (xi (k))−∇ f iti (k) (yiti (k) (k))+
qi 1 ∇ f ih (yih (k)) qi h=1
(8.5)
Let Fk denote the σ-algebra generated by entire history of the stochastic averaging gradients, i.e., for k ≥ 1, Fk = xi,0 ; gi (t), 1 ≤ t ≤ k, i ∈ V
Lemma 8.1 ([37, 44]) The local stochastic averaging gradient in (8.5) is an unbiased estimate of the local gradient ∇ f i (xi (k)), i ∈ V, i.e., E[gi (k)|Fk ] = ∇ f i (xi (k)). Remark 8.2 Observe that the evaluation of gi (k) is costly due to the computation qi ∇ f ih (yih (k)) is O(qi ). This cost can be reduced by updating the complexity of h=1 sum recursively as qi h=1 qi
=
∇ f ih (yih (k)) ∇ f ih (yih (k − 1))+∇ f iti (k−1) (xi (k − 1)) − ∇ f iti (k−1) (yiti (k−1) (k − 1)) (8.6)
h=1
Thus, we can update gi (k) in a computationally efficient manner. Remark 8.3 A common way to reduce the calculation of local objective function gradient ∇ f i (xi ) is by using a noisy surrogate which satisfies unbiased estimation with finite error variance [38–43]. For the problems where a large amount of data distributed and known privately to individual computational nodes (such as distributed logistic regression [1]), evaluating the local gradient requires each node to inquire each data point. This is costly especially when the scale of local dataset is large. The abovementioned distributed noisy gradient algorithm do not provide effective way to reduce this cost. Consider that these problems can be usually transformed into the structure of (8.1) [44]. It is possible to extend notable centralized stochastic methods SAG [35], SVRG [36], and SAGA [37] to distributed environments, which are known to have linear convergence rates. This motivates us to apply the localized version of SAGA (8.5) in the distributed AL algorithm [45].
8.3 Distributed Algorithm
167
8.3.2 Algorithm Development n Let λi j ∈ R be the Lagrange multiplier associated with the edge-based constraints li j (xi − x j ) = 0, ∀(i, j) ∈ E. Then problem (8.3) can be converted to a saddle point problem of the augmented Lagrangian function formed as: 1
1
1
Lα (x, λ) = f (x) + λT W 2 C T x + x T C W 2 W 2 C T x 1
= f (x) + λT W 2 C T x + x T L x
(8.7)
where λ ∈ Rn M is the vector form of the multipliers λi j , ∀(i, j) ∈ E. We develop the following AL algorithm to resolve the problem (8.3) by employing the unbiased stochastic averaging gradients: 1
x(k + 1) = x(k) − α(g(k) + C W 2 λ(k) + L x(k)) 1 2
λ(k + 1) = λ(k) + W C T x(k + 1)
(8.8a) (8.8b)
where g(k) = [g1T (k), . . . , g TN (k)]T ∈ R N n is the vector that concatenates all the local unbiased stochastic averaging gradients at iteration k and α is the constant step-size. Defining Pi = { j|(i, j) ∈E, i > j} and Si = { j|(i, j) ∈E, i < j}, we formally present algorithm (8.8) in a distributed form, as shown in Algorithm 3. The distributed algorithm (8.8) can be interpreted as a stochastic version of the saddle point algorithm in [45] that substitutes the local gradients for stochastic gradients. Algorithm (8.8) can effectively alleviate the cost by using the localized SAGA method, and is more suitable to large-scale issues. In essence, algorithm (8.8) can be transformed into other dual algorithms, e.g., distributed linearized ADMM algo1 rithms (DLM [32] and DPGA [33]) by redefining u(k) = C W 2 λ(k) and EXTRA [34] (can be interpreted as a saddle-point method √ [44]) by reforming the Lagrangian 1 function as Lα (x, λ) = f (x) + λT W 2 C T x/ 2α + x T L x/4α (cf. Sect. 2.4 in [45]). With the interpretation, our conjecture is that the unbiased stochastic averaging gradient (8.5) can be extended to the category of dual algorithms with linear convergence rates. The main challenge in convergence analysis is how to deal with the effects of the difference between the stochastic averaging gradient and the exact local gradient.
168
8 Distributed Stochastic Optimization: Variance Reduction …
Algorithm 3 Distributed AL Algorithm with Unbiased Stochastic Averaging Gradient 1: 2: 3: 4: 5:
Initialization: Let xi (0) ∈ Rn , yih (0) = xi (0) for i ∈ V . Let λi j (0) = 0 for j ∈ Si . Set k = 0. For i = 1 to N do Choose ti (k) uniformly at random from set {1, . . . , qi } Compute and store gi (k) as (8.5): t (k)
gi (k) = ∇ f i i
t (k)
(xi (k))−∇ f i i
t (k)
(yi i
(k))+
qi 1 ∇ f ih (yih (k)) qi h=1
6: Take yiti (k) (k + 1) = xi (k) and store ∇ f iti (k) (yiti (k) (k + 1)) = ∇ f iti (k) (xi (k)) in ti (k) gradient table position. All other entries in the table remain unchanged. 7: Update variable xi (k) according to: xi (k + 1) =xi (k)−α(gi (k)+
N
li j (xi (k) − x j (k))+
j=1
+ li j λi j (k)sgn( j −i))
li j λ ji (k)sgn( j −i) j∈Pi
j∈Si
where sgn(·) denotes the sign function. 8: Update variable λi j (k), j ∈ Si according to: λi j (k + 1) = λi j (k) +
li j (xi (k + 1) − x j (k + 1))
9: End for 10: Set k = k + 1 and repeat. 11: Until a termination condition is satisfied.
8.4 Convergence Analysis The convergence analysis process of algorithm (8.8) is shown in this section. First, we assume that each local subfunction f ih has the following specific properties which are necessary in our analysis. Assumption 8.1 Each local subfunction f ih is strongly convex and its gradient is Lipschitz continuous i.e., for all i ∈ V and h ∈ {1, . . . , qi }, we have (∇ f ih (a)−∇ f ih (b))T (a − b) ≥ μ a − b 2 and h ∇ f (a) − ∇ f h (b) ≤ L f a − b i i ∀a, b ∈ Rn , where L f > μ > 0.
8.4 Convergence Analysis
169
Remark 8.4 The condition imposed by Assumption N 8.1 implies that the local funcf i (xi ) are strongly convex and tions f i (xi ) and the global function f (x) = i=1 have Lipschitz continuous gradient. The strong convexity of each subfunction is a necessary assumption in this chapter, which seems too strict. However, the strong convexity of the global function induced by an additional quadratic regulariser is common in practice, such as distributed logistic regression problems. Hence, this assumption can be satisfied by splitting the quadratic regulariser amongst the local subfunctions evenly. This is a standard technique used in stochastic gradient methods [37, 44].
8.4.1 Preliminary Results Lemma 8.5 ([45]) Let Assumption 8.1 hold. Then, for augmented Lagrange function (8.7), there exists a saddle point (x ∗ , λ∗ ) ∈ Rn N × Rn N satisfying ∇ f (x ∗ ) + C W 2 λ∗ = 0
(8.9a)
W CTx∗ = 0
(8.9b)
1
1 2
where x ∗ = 1 N ⊗ x , with x being the optimal solution to problem (8.1). Next, we state an expected upper bound for the distance between the stochastic averaging gradient g(k) and the gradient at the optimal solution ∇ f (x ∗ ). Define the auxiliary sequence p(k) ∈ R as p(k) =
qi N 1 h h ( f i (yi (k))− f ih (x )−∇ f ih (x )T (yih (k)−x )) q i i=1 h=1
Note that each local subfunction f ih is strongly convex, then each term f ih (yih (k))− f ih (x )−∇ f ih (x )T (yih (k)−x ) is non-negative and thus p(k) is non-negative, ∀k ≥ 0. Lemma 8.6 ([44]) Consider algorithm (8.8) and the definition of the sequence p(k) in (8.10). Let Assumption 8.1 hold. Then, it follows 2
E g(k)−∇ f (x ∗ ) Fk
≤4L f p(k)+2 2L f − μ ( f (x(k))− f (x ∗ )−∇ f (x ∗ )T x(k)−x ∗ )
(8.10)
Remark 8.7 Note that all the variables yih (k) converge to x if x(k) approaches the optimal solution x ∗ . Then, we have p(k)
converges to zero. Combining with Lemma 8.6 implies that E[ g(k)−∇ f (x ∗ ) 2 Fk ] converges to zero as x(k) approaches the optimal solution x ∗ .
170
8 Distributed Stochastic Optimization: Variance Reduction …
The following lemma provides an upper bound of E [ p(k + 1)| Fk ] in terms of p(k). Lemma 8.8 ([44]) Consider algorithm (8.8) and the definition of the sequence p(k) in (8.10). Let Assumption 8.1 hold. Then, for all k ≥ 0, the sequence p(k) satisfies 1 1 p(k) + E [ p(k + 1)| Fk ] ≤ 1 − f (x(k)) − f (x ∗ ) qmax qmin
T − ∇ f (x ∗ ) x(k) − x ∗
8.4.2 Main Results Now, we are ready to show a linear convergence rate of x(k) generated by algorithm (8.8) to the optimal solution x ∗ . Define the vectors z(k), z ∗ ∈ R2N n , and matrix G ∈ R2N n×2N n as 1 ∗ x(k) x P +γ Q 0 2 (8.11) , G = z(k) = , z∗ = α λ∗ λ(k) I 0 2 where P = I − αL and Q = α(μ − φ/2)I with 0 < φ < 2μ and 0 < γ < 1. Note that z ∗ ∈ R2N n involves the optimal primal and dual solutions and z(k) ∈ R2N n contains primal and dual iterates at iteration k. In the following analysis, we work on the convergence of G-weighted norm z(k) − z ∗ 2G instead of z(k) − z ∗ 2 . Here, G is positive semi-definite when the step-size α is properly chosen and an exact range for the selection of step-size will be given latter. We target on demonstrating that z(k) − z ∗ 2G converges linearly to zero. To this end, we establish linear convergence of the Lyapunov function defined as z(k) − z ∗ 2G + cp(k) with c > 0 being a positive constant. Before proceeding to the main result, we establish an upper bound of E [ z(k) − z ∗ 2G |Fk ] in terms of z(k) − z ∗ 2G . Lemma 8.9 Consider algorithm (8.8). Further recall the definitions of the sequence p(k) in (8.10) and z(k), z ∗ , G in (8.11). Let Assumption 8.1 hold. If the step-size α is chosen from the interval (0, 1/(λmax (L) +
L 2f φ
+ η)), then we have
2 E z(k + 1) − z ∗ G |Fk 2 2 α ≤ z(k) − z ∗ G − E[ λ(k + 1) − λ(k) 2 |Fk ] − x(k) − x ∗ γ Q 2 2 − E x(k + 1) − x ∗ (1−γ)Q |Fk − E x(k + 1) − x(k) 21 P−R |Fk 2
2α 2α + L f p(k)+ (L f −μ)( f (x(k))− f (x ∗ )−∇ f (x ∗ )T (x(k) − x ∗ )) η η
(8.12)
8.4 Convergence Analysis
171
where R = α η/2+L 2f /2φ I , 0 < φ < 2μ, and η > 0. Proof According to ∇ f (x ∗ ) + C W 2 λ∗ = 0 (see Lemma 8.5), by substituting 1 ∇ f (x ∗ ) + C W 2 λ∗ into the right side of (8.8a) and multiplying the both sides of the equality by (x(k + 1) − x ∗ )T , we obtain 1
T
T
g(k)−∇ f (x ∗ ) 0 = x(k + 1)−x ∗ (x(k + 1)−x(k))+α x(k + 1)−x ∗
T
T 1 1 +α x(k +1)−x ∗ (C W 2 λ(k)−C W 2 λ∗ )+α x(k +1)−x ∗ L x(k) (8.13) Adding and subtracting ∇ f (x(k + 1)) − ∇ f (x(k)) to the term g(k)−∇ f (x ∗ ), x(k) to the term x(k + 1) − x ∗ , respectively, we have
T x(k + 1) − x ∗ g(k) − ∇ f (x ∗ )
η
T ∇ f (x(k + 1))−∇ f (x ∗ ) − x(k + 1) − x(k) 2 ≥ x(k + 1)−x ∗ 2 1 φ 1 2 g(k)−∇ f (x(k)) 2 − x(k +1)−x ∗ − ∇ f (x(k))−∇ f (x(k +1)) 2 − 2η 2 2φ
T + x(k)−x ∗ (g(k)−∇ f (x(k))) (8.14)
where η > 0 and φ > 0. Since the global function f (x) is strongly convex and its gradient ∇ f (x) is Lipschitz continuous, one can obtain (x(k + 1) − x ∗ )T (∇ f (x(k + 1)) − ∇ f (x ∗ )) ≥ μ x(k + 1) − x ∗ 2 and ∇ f (x(k))−∇ f (x(k +1)) 2 ≤ L 2f x(k + 1) − x(k) 2 . Substituting into the right side of (8.14), we obtain
T
x(k + 1) − x ∗ g(k) − ∇ f (x ∗ ) 2 1 φ η g(k) − ∇ f (x(k)) 2 − x(k + 1) − x ∗ ≥ − x(k + 1) − x(k) 2 − 2 2η 2 2 L 2f
T x(k + 1)−x(k) 2 + x(k)−x ∗ (g(k)−∇ f (x(k))) + μx(k + 1)−x ∗ − 2φ (8.15) The next step is to find a simplification for the inner product (x(k + 1) − x ∗ )T 1 1 (C W 2 λ(k) − C W 2 λ∗ ). In conjunction with (8.8b), it follows
T 1 1 x(k + 1) − x ∗ (C W 2 λ(k) − C W 2 λ∗ )
T
1 1 1 = x(k + 1) − x ∗ (C W 2 λ(k + 1) − λ∗ − C W 2 W 2 E T x(k + 1))
T =(λ(k + 1) − λ(k))T λ(k + 1) − λ∗ − x(k + 1) − x ∗ L x(k + 1)
(8.16)
172
8 Distributed Stochastic Optimization: Variance Reduction … 1
1
where the factorization of the Laplacian matrix L = C W 2 W 2 C T is used. By substituting the lower bound in (8.15) and the simplification in (8.16) into (8.13), we obtain
x(k + 1) − x ∗
T
T (x(k + 1) − x(k)) − α x(k + 1) − x ∗ L (x(k + 1) − x(k))
L 2f
T η + α λ(k + 1) − λ∗ (λ(k + 1) − λ(k)) − α( + ) x(k + 1) − x(k) 2 2 2φ
T α g(k) − ∇ f (x(k)) 2 + α x(k) − x ∗ (g(k) − ∇ f (x(k))) − 2η 2 φ (8.17) + α(μ − )x(k + 1) − x ∗ ≤ 0 2 Observing that for any a, b, c ∈ Rn and positive semi-definite matrix A ∈ Rn×n , the inner product (a − b)T A(a − c) can be simplified as 1/2 a − b 2A + 1/2 a − c 2A − 1/2 b − c 2A . Thus (8.17) can be reformulated as 1 x(k + 1) − x ∗ 2 + 1 x(k + 1) − x(k) 2 − 1 x(k) − x ∗ 2 2 2 2 2 α α α ∗ 2 x(k + 1) − x L − x(k + 1) − x(k) 2L + x(k) − x ∗ L − 2 2 2 2 α 2 α α + λ(k + 1) − λ∗ + λ(k + 1) − λ(k) 2 − λ(k) − λ∗ 2 2 2 2
φ T + α x(k) − x ∗ (g(k) − ∇ f (x(k))) + α(μ − )x(k + 1) − x ∗ 2 2 L η α f ) x(k + 1) − x(k) 2 − g(k)−∇ f (x(k)) 2 ≤ 0 − α( + (8.18) 2 2φ 2η Define P=I − αL , Q = α (μ − φ/2) I , and R = α η/2 + L 2f /2φ I . Suppose 0 < α < 1/λmax (L) + L 2f /φ + η and 0 < φ < 2μ such that P > 0, Q > 0, and P/2 − R > 0, then the inequality (8.18) equals to x(k + 1) − x ∗ 21
α λ(k + 1) − λ∗ 2 + x(k + 1) − x ∗ 2 (1−γ)Q 2 2 α 2 ∗ + x(k) − x γ Q + x(k + 1) − x(k) 1 P−R + λ(k + 1) − λ(k) 2 2 2
α 2 ∗ T g(k)−∇ f (x(k)) + α x(k) − x (g(k) − ∇ f (x(k))) − 2η 2 2 α ≤ x(k) − x ∗ 1 P+γ Q + λ(k) − λ∗ (8.19) 2 2 2
P+γ Q
+
where γ ∈ (0, 1). By computing the expectation conditioned on Fk and regrouping the terms, we obtain
8.4 Convergence Analysis
173
2 E z(k + 1)−z ∗ G |Fk 2 2 ≤ z(k)−z ∗ G − E x(k + 1)−x ∗ (1−γ)Q |Fk α − E x(k + 1) − x(k) 21 P−R |Fk − E λ(k + 1) − λ(k) 2 |Fk 2 2 2 α (8.20) + E g(k) − ∇ f (x(k)) 2 |Fk − x(k) − x ∗ γ Q 2η Observing the relation E[gi (k)|Fk ] = ∇ f i (xi (k)), it follows that the expectation E[ g(k) − ∇ f (x(k)) 2 |Fk ] can be simplified as E g(k) − ∇ f (x(k)) 2 |Fk 2 =E g(k) − ∇ f (x ∗ ) − ∇ f (x(k)) + ∇ f (x ∗ ) |Fk 2 2 =E g(k) − ∇ f (x ∗ ) |Fk − ∇ f (x(k)) − ∇ f (x ∗ )
(8.21)
where we have used the standard variance decomposition E[ a − E[a|Fk ] 2 |Fk ] = E[ a 2 |Fk ] − E [a|Fk ] 2 to obtain the second equality. According to the strong convexity of the global function f (x), it follows ∇ f (x(k)) − ∇ f (x ∗ ) 2 ≥ 2μ( f (x(k)) − f (x ∗ ) − [∇ f (x ∗ )]T (x(k) − x ∗ )). In conjunction with Lemma 8.6, it follows 2
E g(k) − ∇ f (x ∗ ) Fk ≤4L f p(k) + (4L f − 2μ)( f (x(k)) − f (x ∗ ) − ∇ f (x ∗ )
(8.22) T
x(k) − x ∗ )
Replacing the expectation E[ g(k) − ∇ f (x(k)) 2 |Fk ] in (8.20) with the upper bound in (8.22) implies 2 E z(k + 1)−z ∗ G |Fk α 2 2 ≤ z(k)−z ∗ G − E x(k + 1)−x ∗ (1−γ)Q |Fk − E λ(k + 1) − λ(k) 2 |Fk 2 2 2α − E x(k + 1) − x(k) 21 P−R |Fk − x(k) − x ∗ γ Q + L f p(k) 2 η
2α L f −μ ( f (x(k))− f (x ∗ )−∇ f (x ∗ )T x(k)−x ∗ ) (8.23) + η Therefore, the claim in Lemma 8.9 is valid.
Theorem 8.10 Consider the algorithm (8.8). Suppose that Assumption 8.1 holds. If the parameters η and c satisfy
174
8 Distributed Stochastic Optimization: Variance Reduction …
η∈ c∈
2qmin L f (L f − μ) + 2L 2f qmax γqmin (2μ − φ)
,∞
2αL f qmax αγqmin (2μ − φ) 2qmin α L f − μ − , η Lf η
(8.24)
and the step-size α is selected from the interval ⎞
⎛ α ∈ ⎝0,
1 λmax (L) +
L 2f φ
⎠ +η
(8.25)
where φ > 2μ and γ ∈ (0, 1), then there exists a positive constant δ such that 2 2 (1 + δ)E z(k + 1)−z ∗ G +cp(k + 1)|Fk ≤ z(k)−z ∗ G +cp(k)
(8.26)
Proof To obtain the result in (8.26), it is equivalent to prove 2 δE z(k + 1) − z ∗ G + cp(k + 1)|Fk 2 2 ≤ z(k)−z ∗ G +cp(k)−E z(k + 1)−z ∗ G |Fk −cE [ p(k + 1)|Fk ]
(8.27)
Let a1 denote the upper bound of the left side of (8.27) and a2 denote the lower bound of the right side of (8.27). Then, the inequality (8.27) holds only if 0 < a1 ≤ a2 . Adding cE [ p(k + 1)|Fk ] + c p(k) on the both sides of (8.23) and substituting the term E [ p(k + 1)|Fk ] by its upper bound as introduced in Lemma 8.8 lead to z(k)−z ∗ 2 +cp(k)−E z(k + 1)−z ∗ 2 |Fk −cE [ p(k + 1)|Fk ] G G 2 ∗ ≥E x(k + 1)−x (1−γ)Q |Fk + E x(k + 1) − x(k) 21 P−R |Fk 2 2α α c 2 2 ∗ p(k) Lf− + x(k)−x γ Q + E λ(k + 1)−λ(k) |Fk − 2 η qmax Δ Lf 2α c x(k) − x ∗ 2 = L f −μ + − a2 (8.28) η qmax 2 We proceed to find the upper bound of δE[ z(k + 1) − z ∗ 2G + cp(k + 1)|Fk ]. Considering (8.8a), together with Lemma 8.5, we have 1 1 0 = x(k + 1)−x(k)+α g(k)−∇ f (x ∗ )+C W 2 λ(k)−C W 2 λ∗ + L x(k)
1 = (I −αL)(x(k + 1)−x(k))+α g(k)−∇ f (x ∗ ) +αC W 2 (λ(k + 1)−λ∗ )
1 (8.29) = P(x(k + 1)−x(k))+α g(k)−∇ f (x ∗ ) +αC W 2 (λ(k + 1)−λ∗ )
8.4 Convergence Analysis
175 1
1
where we have used (8.8b) and the relation L = C W 2 W 2 C T to obtain the second equality. Applying the basic inequality a + b 2 ≥ (1 − θ) a 2 + (1 − 1/θ) b 2 , ∀a, b ∈ R N , θ > 1, it follows P (x(k + 1) − x(k)) 2
1 2 = α2 g(k) − ∇ f (x ∗ ) + C W 2 λ(k + 1) − λ∗ 2
1 2 ≥ α2 (1−θ) g(k)−∇ f (x ∗ ) +α2 (1− 1/θ)C W 2 λ(k + 1)−λ∗
(8.30)
Since both λ (k + 1) and λ∗ lie in the column space of W 2 C T , one can obtain 2 1 that C W 2 (λ(k + 1)−λ∗ ) ≥ λ2 (L) λ(k + 1) − λ∗ 2 , where λ2 (L) represents the second smallest nonzero eigenvalue of L. Then, (8.30) can be rewritten as 1
P (x(k + 1) − x(k)) 2 2 2 1 ≥α2 (1−θ) g(k)−∇ f (x ∗ ) +α2 (1− )λ2 (L)λ(k + 1)−λ∗ θ
(8.31)
Taking the expectation conditioned on Fk , replacing the expression E[ g(k)− ∇ f (x ∗ ) 2 |Fk ], and rearranging the terms imply 2 E λ(k + 1) − λ∗ |Fk 4θL f E P (x(k + 1) − x(k)) 2 |Fk + p(k) λ2 (L) −
2θ 2L f − μ f (x(k)) − f (x ∗ ) − ∇ f (x ∗ )T x(k) − x ∗ + λ2 (L) ≤
1
α2 (1
1 )λ2 (L) θ
(8.32)
Note that 2 δE z(k +1)−z ∗ G + cp(k + 1)|Fk αδ 2 2 =δE x(k +1)−x ∗ 1 P+γ Q |Fk + E λ(k +1)−λ∗ |Fk + δcE [ p(k +1)|Fk ] 2 2 (8.33) Substituting the terms E[ λ(k + 1) − λ∗ 2 |Fk ] and E[ p(k + 1)|Fk ] by their upper bounds as introduced in (8.32) and Lemma 8.8, respectively, we obtain the upper bound of δE[ z(k + 1) − z ∗ 2G + cp(k + 1)|Fk ] as
176
8 Distributed Stochastic Optimization: Variance Reduction …
2 δE z(k + 1) − z ∗ G + cp(k + 1)|Fk 2αδθL δc f ∗ 2 ≤δE x(k + 1)−x 1 P+γ Q |Fk + p(k) + δc − 2 λ2 (L) qmax δ
+ E P (x(k + 1) − x(k)) 2 |Fk 1 2α 1 − θ λ2 (L)
Δ αδθ 2L f − μ Lf δc x(k) − x ∗ 2 = + a1 (8.34) + λ2 (L) qmin 2 We emphasize that the inequality in (8.27) holds only if 0 < a1 ≤ a2 . Combining (8.28) with (8.34) yields a sufficient condition for the claim in (8.27) as 2 0 ≤E x(k + 1) − x ∗ (1−γ)Q−δ( 1 P+γ Q) |Fk 2 + δE x(k + 1)−x(k) 1
|Fk ( ) 2αδθL f c δc 2α Lf − − δc p(k) + + − qmax qmax η λ2 (L)
2 α Lf − Lfμ cL f (1 + δ) φ − + γα(μ − ) − 2 η 2qmin
2 αδθ 2L f − L f μ x(k) − x ∗ 2 − 2λ2 (L) 2
2
P−R−
δ PT P 2α 1− 1 λ2 (L) θ
(8.35)
It is known that the inequality (8.35) holds if each summand in the right side is non-negative. Therefore, the following conditions should be satisfied: 1 P + γQ > 0 (8.36a) (1 − γ) Q − δ 2 1 δ PT P
P−R− >0 (8.36b) 2 2α 1 − 1θ λ2 (L) 2αδθL f c δc 2α Lf − − δc > 0 (8.36c) + − qmax qmax η λ2 (L)
αL f L f − μ cL f (1 + δ) αδθL f 2L f − μ φ γα μ − − − − >0 2 η 2qmin 2λ2 (L) (8.36d)
All the inequalities in (8.36) hold for some δ > 0, when η, α and c are selected from the intervals shown in (8.24) and (8.25). Next, we study the quantitative description of the convergence rate δ, as presented below.
8.4 Convergence Analysis
177
Theorem 8.11 Let Assumption 8.1 hold. Consider algorithm (8.8), if the step-size α meets the condition (8.25), then δ in (8.26) can be described by
δ = min
⎧ ⎪ ⎨
L2 (θ−1) λ2 (L) α 1−αλl (L)−α η+ φf
(1 − γ) (2μ − φ) α , ⎪ ⎩ (1−αλl (L))+γα (2μ−φ) ⎫ L −μ 2qmin α Lγf μ − φ2 − ( fη ) − c ⎬ qmin αθ(2L f −μ) λ2 (L)
θ(1−αλl (L))2
(8.37)
⎭
+c
for l = 1, . . . , N , where θ > 1 and the definitions of c, φ, γ and η are given in Theorem 8.10. Especially, when the step-size is selected in α ∈ (0, 1/ (λmax (L) + 2(L 2f /φ + η) , we can get a more concise form of (8.37) as ⎧ ⎪ ⎨ (1 − γ) (2μ − φ) α (θ − 1) λ2 (L) α 1 − α η + , δ = min ⎪ θ ⎩ 1 + γα (2μ − φ) ⎫ L −μ 2qmin α Lγf μ − φ2 − ( fη ) − c ⎬ qmin αθ(2L f −μ) λ2 (L)
+c
L 2f φ
⎭
(8.38)
Proof In view of the proof of Theorem 8.10, if η, α and c satisfy conditions in (8.24) and (8.25), then there exists a δ > 0 such that the inequalities (8.36a)–(8.36d) are tenable. Recalling the definition of Q and P, (8.36a) can be rewritten as δ
φ 1 (I − αL) + γα(μ − )I 2 2
φ ≤ α (1 − γ) μ − I 2
(8.39)
Recall that the communication network is undirected and connected. By using the Schur’s lemma, (8.39) is equivalent to δ
φ 1 φ (1−αλl (L)) + γα μ− ≤ α (1 − γ) μ − 2 2 2
(8.40)
for l = 1, . . . , N . Similarly, we have an equivalent condition of (8.36b) L 2f δ(1 − αλl (L))2
≤ (1 − αλl (L)) − α η + φ α 1 − 1θ λ2 (L)
(8.41)
Since (8.36c) holds for any δ > 0 when c > 2αL f qmax /η, together with (8.40), (8.40) and (8.36d), the desired result in (8.37) is obtained. The first term of the right side of (8.38) follows the fact that 0 = λ1 (L) ≤ λ2 (L) · · · ≤ λ N (L). Consider a function regarding λ as
178
8 Distributed Stochastic Optimization: Variance Reduction …
Γ (λ) =
1 − αλ − α η +
L 2f φ
(1 − αλ)2
By derivation, its derivative is α 1 − αλ − 2α η + dΓ (λ) = d (λ) (1 − αλ)3
L 2f φ
When 1 < α < 1/ λmax (L) + 2(L 2f /φ + η) , it follows dΓ (λ)/d(λ) > 0. That is, arg minλl ,l=1,...,N Γ (λl ) = λl = 0. Based on this observation, the desired result in (8.38) is obtained. Remark 8.12 It is shown in (8.38) that Theorem 8.11 establishes a relation between convergence rate, constant step-size, and eigenvalue λ2 (L) of the Laplacian matrix L. It allows us to accelerate the convergence by increasing λ2 (L), which can be realized by increasing the edge weights li j , so long as the constant step-size respects its upper bound in Theorem 8.11. This shows the superiority of the proposed algorithm in flexible edge weight selections compared with gradient tracking methods [19, 20, 24, 25] DSA [44], and EXTRA [34]. It is shown in Theorem 8.10 that the expected value of the sequence z(k) − z ∗ 2G + cp(k) diminishes over time. Computing the expected value with the initial sigma field E[·|F0 ] = E[·] implies that the sequence z(k) − z ∗ 2G + cp(k) converges linearly to zero in expectation, i.e., 2 E[z(k)−z ∗ G + cp(k)|Fk ] ≤
1 z(0)−z ∗ 2 + cp(0) G (1+δ)k
(8.42)
Next, we employ this discovery to establish linear convergence of x(k) − x ∗ 2 in expectation. Corollary 8.13 Consider the algorithm (8.8). Supposing that the conditions of Theorem 8.10 hold, there exists a positive constant δ such that 2 E x(k) − x ∗ ≤
1 ( z(0) − z ∗ 2G + c p(0))
(1 + δ)k λmin 21 P + γ Q
(8.43)
Proof Recalling the definitions of z, G and p(k), one can obtain that x(k) − x ∗ 21 P+γ Q ≤ z(k) − z ∗ 2G + c p(k), where P/2 + γ Q and G are positive 2
definite matrices. Due to the fact that x(k) − x ∗ 2P/2+γ Q is lower bounded by λmin (P/2 + γ Q) x(k) − x ∗ 2 , we have λmin (P/2 + γ Q) x(k) − x ∗ 2 ≤ z(k)− z ∗ 2G + cp(k). Then, combining the result in (8.42) completes the proof.
8.4 Convergence Analysis
179
Theorem 8.14 Consider algorithm (8.8) and suppose that the conditions of Theorem 8.10 hold. Then, the sequences of the local variables xi (k) for all i ∈ V converge to the optimal solution x a.s.. That is lim xi (k) = x a.s. for all i ∈ V
k→∞
(8.44)
Proof The proof of Theorem 8.14 depends on the relation in (8.26) of Theorem 8.10 to construct a supermartingale sequence. To this end, define the stochastic processes ζ(k) and β(k) as 2 ζ(k) = z(k) − z ∗ G + c p(k), β(k) =
1 z(k) − z ∗ 2 + c p(k) (8.45) G 1+δ
Define Fk a sigma-algebra measuring ζ(k), β(k) and z(k). Considering the definitions of ζ(k) and β(k), form (8.26), we have E [ζ(k + 1)|Fk ] ≤ ζ(k) − β(k)
(8.46)
Since the sequences ζ(k) and β(k) are always non-negative, it follows that (8.46) satisfies the conditions of the supermartingale convergence theorem (Theorem 7.4, [46]). Therefore, it follows: (i) The sequence ζ(k) converges a.s. (ii) The sum ∞ k=0 β(k) < ∞ is finite a.s.. That is ∞ k=0
1 z(k) − z ∗ 2 + c p(k) < ∞ a.s. G 1+δ
(8.47)
Since x(k) − x ∗ 2P/2+γ Q ≤ z(k) − z ∗ 2G + cp(k) and x(k) − x ∗ 2P/2+γ Q is lower bounded by λmin (P/2 + γ Q) x(k) − x ∗ 2 , then we have limk→∞ z¯ (k) − Θˆ ∗ = 0. In conjunction with (8.47), we have ∞ k=0
1 λmin 1+δ
2 1 P + γ Q x(k) − x ∗ < ∞ a.s. 2
(8.48)
Noting the fact λmin (P/2 + γ Q) is positive, we conclude from (8.48) that the sequence x(k) − x ∗ 2 converges to zero a.s.. This concludes the proof.
8.5 Numerical Examples In this section, the performance of the proposed algorithm is tested through a binary classification problem using regularized logistic regression. Specifically, we consider nodes of a network collaboratively solving the following distributed logistic regression problem:
180
8 Distributed Stochastic Optimization: Variance Reduction …
qi N 2
λ T xˆ + x = argmin log 1+exp(−bi h ci h x) ˆ 2 x∈R ˆ n i=1 h=1
(8.49)
where ci h ∈ Rn is the h-th training sample kept by node i and bi h ∈ {±1} is the 2 corresponding label. The regulariser (λ/2) xˆ is introduced to eliminate overfitting. This problem can be formulated in the form of problem (8.1) with the local objective function f i defined as f i (x) ˆ =
qi
λ xˆ 2 + log 1 + exp(−bi h ciTh x) ˆ 2N h=1
(8.50)
Observe that the local function f i in (8.50) can be further formulated as the average of qi local subfunctions defined as f ih (x) ˆ =
λ xˆ 2 + qi log 1 + exp(−bi h c T x) ih ˆ 2N
(8.51)
for all h = 1, . . . , qi . Based on the definitions of the local subfunctions f ih (x) ˆ in (8.51), the distributed logistic regression problem (8.49) can be settled by our algorithm. We next present two case studies. The first case is performed on a synthetic dataset to show the comparison of the proposed algorithm with others, and to illuminate the effects of network sparsity, edge weights, and constant step-size. The second case uses a subset of MNIST dataset [47] to test the performance of the proposed algorithm in large-scale classification. In the following simulations, the residual is defined as N log10 (1/N ) i=1 xi (k) − x . We use CVX [48] to acquire the optimal solution x by solving the distributed problem in a centralized way.
8.5.1 Case Study I In this case, N = 10, qi = 30 for all i = 1, . . . , 10, n = 2, and λ = 0.0001. We use N (ρ, σ) to denote a normal distribution with ρ as the mean vector and σ as the covariance matrix. For each node i, half of the feature vectors ci h with label bi h = +1 are drawn by i.i.d. N ([2, −2]T , 2I ) while the others with label bi h = −1 are set to be i.i.d. N ([−2, 2]T , 2I ).
8.5.1.1
Comparison with Distributed Algorithms
We contrast the convergence results of algorithms in settlement of the distributed logistic regression problem, including EXTRA [34], the AL algorithm [45] (termed
8.5 Numerical Examples
181
Fig. 8.1 Network topologies. a Star network. b Circle network. c Random network with pe = 0.5. d Full connected network
as AL-Edge), DSA [44] and the proposed algorithm (termed as SAL-Edge). Note that DSA and SAL-Edge use stochastic averaging gradients, EXTRA and AL-Edge are their deterministic counterparts. The network topology is shown in Fig. 8.1c. For EXTRA and DSA, the associated doubly-stochastic weighting matrix is generated by the Metropolis rule [1]. The weights li j in our algorithm and AL-Edge are selected in [0.7, 1.0]. For comparison, the constant step-sizes for EXTRA and AL-Edge are 0.05, for DSA and SAL-Edge are 0.005. The simulation results in Fig. 8.2a show that the convergence rates of EXTRA, DSA, AL-Edge, and SAL-Edge are linear. Convergence rates of EXTRA and AL-Edge are comparatively faster than DSA and SAL-Edge in terms of number of iterations. Nevertheless, the complexities of each iteration for DSA and SAL-Edge are less than their deterministic counterparts. To illustrate this difference, we depict the evolutions of residuals respect to the number of gradient evaluations as shown in Fig. 8.2b.
8.5.1.2
Effects of Network Sparsity
In this section, we investigate the effects of network sparsity on the performance of the proposed algorithm. We consider four categories of network, G1 , G2 , G3 , and G4 shown in Fig. 8.1, with decreasing sparsity. The constant step-size and edge weights are same as Sect. 8.5.1.1. The performance of SAL-Edge under each category of
182
8 Distributed Stochastic Optimization: Variance Reduction … 1
Fig. 8.2 Comparison across EXTRA, AL-Edge, DSA, and the proposed algorithm
SAL−Edge AL−Edge EXTRA DSA
0
Residual
−1 −2 −3 −4 −5 −6 −7
100
200
300
400
500
600
700
800
900 1000
Iterations
(a) Evolution of residuals with number of iterations 1 SAL−Edge AL−Edge EXTRA DSA
0
Residual
−1 −2 −3 −4 −5 −6 −7
500
1000
1500
2000
2500
3000
3500
4000
Number of gradient evalutions
(b) Evolution of residuals with number of gradient evaluations
networks is shown in Fig. 8.3. It is verified that the algorithm converges faster as the network becomes dense.
8.5.1.3
Effects of Edge Weights
Next, we study the effects of edge weights on the convergence of the proposed algorithm to verify the statement in Remark 8.12. The edge weights li j are randomly selected in [0.2, 0.5], [0.5, 0.7], [0.7, 1.0], and [1.0, 1.5], respectively. The constant step-size α = 0.005 which meets the condition (8.25) in the four experiments. As
8.5 Numerical Examples
183
Fig. 8.3 Evolution of residuals with different networks
1
Star network Circle network Random network with p e=0.5
0
Full connected network
Residual
−1 −2 −3 −4 −5 −6 −7
100
200
300
400
500
600
700
800
900
1000
Iterations
Fig. 8.4 Evolution of residuals with different edge weights
1 lij∈ [0.2,0.5]
0
lij∈ [0.5,0.7] l ∈ [0.7,1.0] ij
Residual
−1
lij∈ [1.0,1.5]
−2 −3 −4 −5 −6 −7
100
200
300
400
500
600
700
800
900 1000
Iterations
shown in Fig. 8.4, it is verified that the performance of the proposed algorithm can be improved by increasing the edge weights.
8.5.1.4
Effects of Constant Step-Size
It is stated in Theorem 8.11 that the convergence rate of the proposed algorithm is also related to the constant step-size. We now compare the performance of the proposed algorithm in terms of stepsize selection. The simulation is performed on the network shown in Fig. 8.1c and the edge weights li j is set as 0.7. It is depicted in Fig. 8.5 that, in the given setting, the practical upper bound of the constant step-size is around α = 0.065, and the best performance is achieved when α = 0.01.
184
8 Distributed Stochastic Optimization: Variance Reduction … 0
Fig. 8.5 Residuals at the 300th iteration with different constant step-size
−0.5
Residual
−1 −1.5 −2 −2.5 −3 −3.5
0
0.01
0.02
0.03
0.04
Constant step−size
Fig. 8.6 Samples from the dataset
Fig. 8.7 A network of one hundred nodes
0.05
0.06
0.07
8.5 Numerical Examples
185 0
Fig. 8.8 Evolution of residuals with different edge weights
l ∈ [0.2,0.5] ij
lij∈ [0.5,0.7] l ∈ [0.7,1.0] ij
Residual
−0.5
lij∈ [1.0,1.5]
−1
−1.5
−2
0.2
0.4
0.6
0.8
1
Iterations
1.2
1.4
1.6
1.8
2 4
x 10
8.5.2 Case Study II In this case study, we test the performance of the proposed algorithm in large-scale classification. We consider the dataset comes from the MNIST handwritten digit database [47]. We choose a subset of 11339 digits, 6 and 7, from MNIST, labeled as +1 and −1, respectively. Figure 8.6 shows a part of the training samples. Each image can be transformed into a 784-dimensional vector. We consider the training samples are distributed over 100 nodes, and the network is depicted in Fig. 8.7. The constant step-size chooses 5 × 10−5 and the edge weights li j are randomly selected in [0.2, 0.5], [0.5, 0.7], [0.7, 1.0], and [1.0, 1.5], respectively. The simulation results in shown in Fig. 8.8, which match the observations in Fig. 8.4.
8.6 Conclusion In this chapter, a distributed augmented Lagrange algorithm was presented by using local unbiased stochastic averaging gradients. The local objective function was constructed as an average of a finite set of subfunctions, which was inspired by the machine learning problems with large training samples distributed and known privately to multiple computational nodes. Unlike most of existing work that required the evaluation of the full gradient at each iteration, the proposed algorithm resorted to the gradient of only one randomly selected subfunction at a node, which was more computationally efficient. An explicit linear convergence rate was provided by utilizing the factorization of weighted Laplacian and the spectral decomposition technique.
186
8 Distributed Stochastic Optimization: Variance Reduction …
References 1. A. Sayed, Adaptation, learning, and optimization over networks. Found. Trends® Mach. Learn. 7(4–5), 311–801 (2014) 2. Q. Jia, L. Guo, Y. Fang, G. Wang, Efficient privacy-preserving machine learning in hierarchical distributed system. IEEE Trans. Netw. Sci. Eng. https://doi.org/10.1109/TNSE.2018.2859420 3. T. Kim, S. Wright, D. Bienstock, S. Harnett, Analyzing vulnerability of power systems with continuous optimization formulations. IEEE Trans. Netw. Sci. Eng. 3(3), 132–146 (2016) 4. A. Yazicioglu, M. Egerstedt, J. Shamma, Formation of robust multi-agent networks through self-organizing random regular graphs. IEEE Trans. Netw. Sci. Eng. 2(4), 139–151 (2016) 5. A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control 54(1), 48–61 (2009) 6. J. Duchi, A. Agarwal, M. Wainwright, Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Trans. Autom. Control 57(1), 592–606 (2012) 7. I. Lobel, A. Ozdaglar, Convergence analysis of distributed subgradient methods over random networks, in 46th Annual Allerton Conference on Communication, Control, and Computing, https://doi.org/10.1109/ALLERTON.2008.4797579 8. H. Li, C. Huang, G. Chen, X. Liao, T. Huang, Distributed consensus optimization in multi-agent networks with time-varying directed topologies and quantized communication. IEEE Trans. Cybern. 47(8), 2044–2057 (2017) 9. A. Nedic, A. Olshevsky, Distributed optimization over time-varying directed graphs. IEEE Trans. Autom. Control 60(3), 601–615 (2015) 10. H. Li, Q. Lü, T. Huang, Distributed projection subgradient algorithm over time-varying general unbalanced directed graphs. IEEE Trans. Autom. Control 64(3), 1309–1316 (2019) 11. I. Tsianos, S. Lawlor, M. Rabbat, Push-sum distributed dual averaging for convex optimization, in Proceedings of the 51st IEEE Conference on Decision and Control, https://doi.org/10.1109/ CDC.2012.6426375 12. K. Tsianos, M. Rabbat, Distributed consensus and optimization under communication delays, in 49th Allerton Conference on Communication, Control, and Computing, https://doi.org/10. 1109/Allerton.2011.6120272 13. S. Sundhar Ram, A. Nedic, V. Veeravalli, Distributed stochastic subgradient projection algorithms for convex optimization. J. Optim. Theory Appl. 147(3), 516–545 (2010) 14. A. Nedic, A. Ozdaglar, P. Parrilo, Constrained consensus and optimization in multi-agent networks. IEEE Trans. Autom. Control 55(4), 922–938 (2010) 15. H. Li, Q. Lü, T. Huang, Convergence analysis of a distributed optimization algorithm with a general unbalanced directed communication network. IEEE Trans. Netw. Sci. Eng. 6(3), 237–248 (2019) 16. K. Yuan, Q. Ling, W. Yin, On the convergence of decentralized gradient descent. SIAM J. Optim. 26(3), 1835–1854 (2016) 17. D. Jakovetic, J. Xavier, J. Moura, Fast distributed gradient methods. IEEE Trans. Autom. Control 59(5), 1131–1146 (2014) 18. J. Xu, S. Zhu, Y. Soh, L. Xie, Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes, in 54th IEEE Conference on Decision and Control, https://doi.org/10.1109/CDC.2015.7402509 19. G. Qu, N. Li, Harnessing smoothness to accelerate distributed optimization. IEEE Trans. Control Netw. Syst. 5(3), 1245–1260 (2018) 20. A. Nedic, A. Olshevsky, W. Shi, C. Uribe, Geometrically convergent distributed optimization with uncoordinated step-sizes, in 2017 American Control Conference, https://doi.org/10. 23919/ACC.2017.7963560 21. C. Xi, X. Ran, U. 
Khan, ADD-OPT: accelerated distributed directed optimization. IEEE Trans. Autom. Control 63(5), 1329–1339 (2018) 22. C. Xi, V. Mai, E. Abed, U. Khan, Linear convergence in optimization over directed graphs with row-stochastic matrices. IEEE Trans. Autom. Control 63(10), 3558–3565 (2018)
3.5
187
23. R. Xin, U. Khan, A linear algorithm for optimization over directed graphs with geometric convergence. IEEE Control Syst. Lett. 2(3), 315–320 (2018) 24. J. Xu, S. Zhu, Y. Soh, L. Xie, Convergence of asynchronous distributed gradient methods over stochastic networks. IEEE Trans. Autom. Control 63(2), 434–448 (2018) 25. A. Nedic, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017) 26. H. Terelius, U. Topcu, R. Murray, Decentralized multi-agent optimization via dual decomposition, in 18th IFAC World Congress, vol. 44(1), pp. 7391–7397 (2011) 27. S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends® Mach. Learn. 3(1), 1–122 (2011) 28. E. Wei, A. Ozdaglar, Distributed alternating direction method of multipliers, in 2012 IEEE 51st IEEE Conference on Decision and Control, https://doi.org/10.1109/CDC.2012.6425904 29. F. Iutzeler, P. Bianchi, P. Ciblat, W. Hachem, Explicit convergence rate of a distributed alternating direction method of multipliers. IEEE Trans. Autom. Control 61(4), 892–904 (2016) 30. W. Shi, Q. Ling, K. Yuan, G. Wu, W. Yin, On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Trans. Signal Process. 62(7), 1750–1761 (2014) 31. Q. Ling, A. Ribeiro, Decentralized linearized alternating direction method of multipliers, in IEEE International Conference Acoustics. Speech and Signal Processing (2014), https://doi. org/10.1109/ICASSP.2014.6854644 32. Q. Ling, W. Shi, G. Wu, A. Ribeiro, DLM: decentralized linearized alternating direction method of multipliers. IEEE Trans. Signal Process. 63(15), 4051–4064 (2015) 33. N. Aybat, Z. Wang, T. Lin, S. Ma, Distributed linearized alternating direction method of multipliers for composite convex consensus optimization. IEEE Trans. Autom. Control 1(63), 5–20 (2018) 34. W. Shi, Q. Ling, G. Wu, W. Yin, EXTRA: an exact first-order algorithm for decentralized consensus optimization. SIAM J. Optim. 25(2), 944–966 (2015) 35. M. Schmidt, N. Roux, F. Bach, Minimizing finite sums with the stochastic average gradient. Math. Programm. 162(1–2), 83–112 (2017) 36. R. Johnson, T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Process. Syst. 26 (2013) 37. A. Defazio, F. Bach, S. Lacoste-Julien, SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives, https://hal.archives-ouvertes.fr/hal--01016843 38. J. Lei, H. Chen, H. Fang, Asymptotic properties of primal-dual algorithm for distributed stochastic optimization over random networks with imperfect communications. SIMA J. Control Optim. 56(3), 2159–2188 (2018) 39. M. Sayin, N. Vanli, S. Kozat, T. Basar, Stochastic subgradient algorithms for strongly convex optimization over distributed networks. IEEE Trans. Netw. Sci. Eng. 4(4), 248–260 (2017) 40. D. Yuan, Y. Hong, D. Ho, G. Jiang, Optimal distributed stochastic mirror descent for strongly convex optimization. Automatica 90, 196–203 (2018) 41. D. Yuan, D. Ho, Y. Hong, On convergence rate of distributed stochastic gradient algorithm for convex optimization with inequality constraints. SIMA J. Control Optim. 54(5), 2872–2892 (2016) 42. S. Pu, A. Nedi´c, A distributed stochastic gradient tracking method, in 2018 IEEE Conference on Decision and Control, https://doi.org/10.1109/CDC.2018.8618708 43. R. Xin, A. Sahu, U. Khan, S. 
Kar, Distributed stochastic optimization with gradient tracking over strongly-connected networks. arXiv:1903.07266 44. A. Mokhtari, A. Ribeiro, DSA: decentralized double stochastic averaging gradient algorithm. J. Mach. Learn. Res. 17, 1–35 (2016) 45. C. Shi, G. Yang, Augmented Lagrange algorithms for distributed optimization over multi-agent networks via edge-based method. Automatica 94, 55–62 (2018)
188
8 Distributed Stochastic Optimization: Variance Reduction …
46. V. Solo, X. Kong, Adaptive Signal Processing Algorithms: Stability And Performance (1995) 47. Y. LeCun, C. Cortes, C. Burges, MINIST Handwritten Digit Database. AT & T Labs, http:// yann.lecun.com/exdb/mnist 48. M. Grant, S. Boyd, CVX: Matlab software for disciplined convex programming, version 2.1. (2014), http://cvxr.com/cvx
Chapter 9
Distributed Economic Dispatch in Smart Grids: General Unbalanced Directed Network
9.1 Introduction Economic and environmental factors, such as energy crisis, distributed energy resources, communication network, etc., endow modern energy network with dynamic and distributed natures. This trend inevitably raises technical and theoretical challenges in energy management and smart grid operation. Recently, economic dispatch in smart grids has been attracting more and more attention [1–15]. Although some control and optimization algorithms are available, a large number of distributed generators in smart grids further increase the diversification of the communication networks. Hence, how to achieve the optimal dispatch in the dynamic and changeful environment is worthy of further analysis., In smart grids, economic dispatch aims at scheduling generators in order to achieve the lowest generation cost subjected to the required load demand. Traditional centralized methods need central controllers to gather global information from the system, which are computationally costly[1]. Besides, they are prone to single point failure due to the usage of central controllers [9]. On the contrary, distributed methods are scalable and robust to meet the optimization for large-scale systems with dynamic natures, contributing to the localized communication and operation. Recently, researchers are interested in introducing the consensus theory based distributed methods for multi-agent systems (MASs). In these methods, by choosing a proper measure as a consensus variable and exchanging information over communication networks, the conventional centralized EDP can be settled in a distributed fashion. In line with this way, researchers are devoted to the generalization of networks and the robustness of distributed algorithms. The static undirected networks are considered in [3–7]. By selecting the incremental costs as the consensus variables, Zhang and Chow [3] develop a leader-following method to solve the EDP, where a leader is accommodated to balance demand and generation. A leaderless distributed algorithm is developed in [4] by applying the consensus + innovations method [16]. Binetti et al. [6] present a consensus-based method to handle the EDP © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2020 H. Li et al., Distributed Optimization: Advances in Theories, Methods, and Applications, https://doi.org/10.1007/978-981-15-6109-2_9
189
190
9 Distributed Economic Dispatch in Smart Grids …
with transmission line losses. Loia and Vaccaro [5] put forward a distributed algorithm which takes into account transmission losses and variable load demands. In [7], the effect of communication delays is intensively analyzed. For the dynamic networks cases, Lakshmanan and Farias [11] propose a distributed algorithm for undirected time-varying networks, where penalty terms are employed in compensating for individual constraints. Motivated by distributed inexact gradient method, a distributed primal-dual algorithm is developed in [12]. Nevertheless, it is tough to meet undirected communication in practice, especially in a distributed situation. For instance, in wireless networks, the communication radius of units are related to their broadcasting power levels which are usually non-identical. This naturally leads to directed communication. Besides, in undirected cases where each units blocks until it receives a response, deadlocks can occur when the network has cycles. Moreover, directed networks show lower cost, higher scalability, and stronger robustness against interference compared with their undirected counterparts. In the theoretical level, the challenge of distributed algorithm design for directed networks lies in the construction of weight matrices that can not be doubly stochastic (which is required typically in undirected networks but not applicable to general unbalanced directed networks), but can only be either row-stochastic or column-stochastic. For the EDP over static directed networks, Dominguez-Garcia et al. [8] present a ratio consensus-based distributed method. A surplus based distributed consensus approach is designed in [9], where generators estimate the imbalance of the system in a collective sense to adjust current power generation. Relying on consensus theory and bisection method, Xing et al. [10] present a distributed algorithm, where the restriction of quadratic cost functions (e.g., [3–5, 8, 9]) can be relaxed to general convex functions. On the basis of [9], Zhao et al. [17] systematically study the effect of communication delays. Xu et al. [13] present a distributed method for time-varying networks, resorting to surplus averaging approach [18], where directed communication is counted as a part of the constraints to the problem. Inspired by gradient push-sum method [19], Yang et al. [14] develop a distributed algorithm to address the EDP and further show its effectiveness in the presence of communication delays. Then, Wu et al. [15] provide a robustified extension of [14] undergoing packet drops. In the prior distributed algorithms for the EDP over directed networks [8, 10, 13–15, 17], generators rely on the out-degree of their in-neighbors to construct column-stochastic weight matrices. Technically, such a requirement will hinder the independence of generators and the practicality of algorithms, since for each generator, the weights on the received messages are determined by its in-neighbors rather than locally allocated by itself. In addition, this requirement may not always practical in many situations, e.g., the broadcast-based communication protocol. By contrast, row-stochastic weight matrices have superiority in practice since each generator can locally decide the weights, and provide a broad range of communication protocols including the broadcast-based methods. However, the main obstacle of the implementation of row-stochastic weight matrices is how to eliminate the unbalancedness of the directed networks. In the field of decentralized optimization. 
Xie et al. [20] deal with the unbalancedness by simplifying the optimization problem to an epi-
9.1 Introduction
191
graph form. Mai and Abed [21] develop a rescaling gradient method to conquer the unbalancedness, where an additional consensus iteration is employed to dynamically estimate the normalized left Perron eigenvector of the row-stochastic weight matrix. This insightful work is most related to us. However, it can cope with distributed optimization problems with non-identical local convex constraint sets but is not applicable to solve the EDP. The aforementioned work drives us to design distributed algorithm with rowstochastic weight matrices to tackle the EDP over directed networks. However, most of them (see [3–5, 8–13, 20, 21]) neglect the network effects and assume that the messages are transmitted via ideal communication channels without delays. Moreover, the effect of uncertainties is a significant issue in networked systems, e.g., wireless sensor networks, multirobot systems, and smart grids, which is not considered in the abovementioned work. Thus, another aim of our chapter is to design robust algorithms to handle non-ideal communication networks (with communication delays) and uncertainties (noisy gradients). This chapter focuses on the practicality and robustness of algorithms for economic dispatch in smart grids. Since most existing algorithms for solving EDP over general unbalanced directed networks cannot work only relying on row-stochastic weight matrices. It is significant to consider this issue due to the superiority and practicality of row-stochastic weight matrices. We now describe the main contributions of this chapter distinguishing from the existing literature. (I) This chapter presents a fully distributed algorithm for economic dispatch in smart grids over general unbalanced directed networks and provides provable convergence results. Only the row-stochastic weight matrices are employed in the proposed algorithm, which allows generators to locally allocate the weights on the messages received from their in-neighbors. Compared with related distributed algorithms for the EDP over directed networks [8, 10, 13–15, 17], the algorithm presented in this chapter does not require generators to send their out-degree to their in-neighbors to construct column-stochastic weight matrices, and is thus easier to be implemented and applicable for more communication protocols, e.g., the broadcast-based methods. (II) The robustness of the presented algorithm is investigated under both nonidentical but uniformly bounded constant communication delays and noisy gradients with bounded variance zero-mean which are not considered in [8, 10, 13, 15]. We show that the algorithm can still solve the EDP in both cases. (III) Unlike most existing work (see [3–10, 17, 22]) which assume the cost functions to be quadratic, this chapter allows strictly convex and continuously differentiable cost functions which are more general than quadratic cost functions. (IV) Compared with the aforementioned chapter (see [14, 15, 19–21]), the algorithm proposed in this work is more flexible in the choice of the decaying stepsize since it allows stepsizes to be uncoordinated.
192
9 Distributed Economic Dispatch in Smart Grids …
9.2 Preliminaries This section introduces preliminaries and basic terminologies from network theory, and then formulates the EDP and its dual counterpart.
9.2.1 Notation If there are no other statements, the vectors in this chapter are assumed to be columns.
9.2.2 Model of Optimization Problem We consider a smart grid composed of N generators. For each generator i ∈ {1, . . . , N }, let Ci (xi ) represent the cost function where xi is the power generation. Mathematically, EDP can be formulated as min C (x) =
N
Ci (xi ), x1 , x2 , .., x N ∈ R
(9.1a)
i=1
respecting the following two constraints. Individual Generator Constraint: ximin ≤ xi ≤ ximax , ∀i ∈ V
(9.1b)
where ximin and ximax are the maximum and minimum capacities of power generation of generator i, respectively. The individual constraint is (9.1b) at each generator referred to as box constraint and denoted by X i = xi ximin ≤ xi ≤ ximax . Total Demand Constraint: N xi = Td (9.1c) i=1
N min N max where Td represents total demand meeting i=1 xi ≤ Td ≤ i=1 xi , i.e., there is at least an unsaturated generator when the total demand constraint is satisfied, which ensures the feasibility of problem (9.1).
9.2 Preliminaries
193
9.2.3 Communication Network To characterize the nature of information exchanging, networks are usually introduced to describe the networks. A directed network G = (V, E) is composited by a collection of nodes V = {1, 2, . . . , N } and a collection of edges E ⊆ V × V. An ordered pair ( j, i) ∈ E indicates the directed edge from node i to node j. Node i is reachable to node j when there exists a channel ( j, j1 ) , ( j1 , j2 ) , . . . , j p , i originating from node j and terminating at node i. Besides, G is strongly connected if each node is reachable to others. A weight matrix assigned to G is defined as W = [wi j ] ∈ R N ×N which satisfies that wi j > 0 if ( j, i) ∈ E, and wi j = 0, otherwise. For node i, the collection of its in-neighbors at iteration k is defined as Niin = { j |( j, i) ∈ E}. Similarly, Niout = { j |(i, j) ∈ E} denotes the collection of out-neighbors of node i. Also, we and out-degree of node i, respectively, i.e., denote by diin and diout the in-degree diin = Niin and diout = Niout .
9.2.4 Lagrange Dual Problem of EDP For the cost functions, we have the following assumption. Assumption 9.1 Each cost function Ci : X i → R+ is strictly convex and continuously differentiable. This assumption and the fact that the constraint (9.1c) is affine imply that Slater condition is tenable. Hence, there is zero duality gap, which allows us to convert the original problem (9.1) to its Lagrange dual form. Consider the Lagrangian function as follows: L(x, λ) =
N
Ci (xi ) − λ
i=1
N
xi − Td
i=1
where x = [x1 , . . . , x N ]T and λ is the Lagrangian multiplier corresponding to constraint (9.1c). Define the dual function D : R → R as D(λ) = min L(x, λ) = min
xi ∈X i
N
Ci (xi ) − λ
N
i=1
xi − Td
(9.2)
i=1
Then, the corresponding Lagrangian dual problem of (9.1) is defined as max
N i=1
Ψi (λ) + λTd , λ ∈ R
(9.3)
194
9 Distributed Economic Dispatch in Smart Grids …
where Ψi (λ) = min xi ∈X i Ci (xi ) − λxi . Accordingly, a unique minimizer of the righthand side of (9.2) is given by xi (λ) = ϕi (∇Ci−1 (λ))
(9.4)
where ⎧ max if ∇Ci−1 (λ) > ximax ⎨ xi , −1 −1 ϕi (∇Ci (λ)) = ∇Ci (λ), if ximin ≤ ∇Ci−1 (λ) ≤ ximax ⎩ min xi , if ∇Ci−1 (λ) < ximin with ∇Ci−1 being the inverse function of ∇Ci existing over ∇Ci−1 (ximin ),∇Ci−1 (ximax ) owing to the continuity and strict monotonicity of ∇Ci−1 . For any given λ ∈ R, the dual function D(λ) is differentiable at λ and its gradient N xi − Td ). The standard gradient ascent method to address the is ∇ D(λ) = −( i=1 dual problem (9.3) can be depicted as λ(k + 1) = λ(k) + γ(k)
N
−xi (λ(k)) + Td
(9.5)
i=1
N where γ(k) is the stepsize. Note that the global quantity i=1 xi (λ(k)) − Td will prevent us from distributedly solving the EDP. In order to avert this defect, each generator has to estimate the multiplier λ only by local communication with its in-neighbors. We first convert the dual problem (9.3) into min Φ(λ) =
N
Φi (λ), λ ∈ R
(9.6)
i=1
where Φi (λ) = min xi ∈X i Ci (xi ) − λ(xi − Ti ), and Ti is a virtual local demand N Ti = Td . assigned to generator i such that i=1 Remark 9.1 In fact, the virtual local demands Ti , i ∈ V, have no physical meanings. They are only introduced to design distributed algorithm by applying the gradient method on the basis of the dual problem (9.3). In the initialization of the algorithm, the virtual local demands can be randomly N assigned by the sensors in each generator Ti = Td , which is necessary to ensure the as long as they satisfy the condition i=1 power balance. The gradient of Φi (λ) is ∇Φi (λ) = xi (λ) − Ti , which is uniformly bounded, i.e., |∇Φi (λ)| = |xi (λ) − Ti | ≤ max ximax + max Ti = l i∈V
i∈V
(9.7)
9.2 Preliminaries
195
Furthermore, it follows that Φi (λ) is l-Lipschitz continuous, i.e., Φi (λ) satisfies |Φi (x) − Φi (y)| ≤ l |x − y| , ∀x, y ∈ R.
9.3 Main Results In this section, we develop a distributed algorithm to address the EDP over directed networks using only row-stochastic weight matrices.
9.3.1 Algorithm Development The communication network among N generators is depicted as an unbalanced directed network G. In the proposed algorithm, two scalar variables xi (k) and λi (k) are assigned to each generator i ∈ V, representing the primal and the dual variables, respectively. A surplus vector variable yi (k) = [yi1 (k), . . . 3, yi N (k)]T ∈ R N is introduced for each generator i ∈ V to conquer unbalancedness of the directed networks. The initialization of each generator i ∈ V chooses estimates xi (0) ∈ X i , λi (0) ∈ R and yi (0) = ei ∈ R N . The update fashion performs as follows: λi (k + 1) =
N
wi j λ j (k) − γi (k)
j=1
∇Φi (λi (k)) yii (k)
xi (k + 1) = ϕi (∇Ci−1 (λi (k + 1))) yi (k + 1) =
N
wi j y j (k)
(9.8a) (9.8b) (9.8c)
j=1
where wi j denotes the consensus weight and γi (k) represents the stepsize. Remark 9.2 In essence, (9.8a) is a modification of gradient method in which the gradient ∇Φi (λi (k)) is balanced by yii (k). In (9.8b), ∇Ci−1 (λi (k + 1)) may not have a closed-form solution for a general convex cost function, but we can obtain a numerical solution by utilizing the bisection method since ∇Ci−1 is monotonic continuous. Assumption 9.2 The directed network G is strongly connected. At this moment, the proposed algorithm (9.8) is still conceptual since we have not specified the parameters γi (k) and wi j . Assumption 9.3 For all i ∈ V and k ≥ 0, let the uncoordinated stepsizes γi (k) = 1/ (k + 1) + ai (k) > 0, where the uncoordinated term ai (k) satisfies |ai (k)| ≤ a(k) with ∞ k=0 a(k) < ∞.
196
9 Distributed Economic Dispatch in Smart Grids …
Remark 9.3 This assumption endows each generator with flexibility in the choice of decaying stepsize due to the uncoordinated term ai (k). For instance, γi (k) = Mi /(k + 1)ci with Mi > 0 and ci ∈ (0.5, 1], by letting ai (k) = Mi /(k + 1)ci − 1/ (k + 1). Assumption 9.4 For all i ∈ V and k ≥ 0, we can find a positive scalar η ∈ (0, 1) such that N > η, j ∈ Niin ∪ {i}, wi j = wi j = 1 (9.9) 0, otherwise, j=1 Remark 9.4 The lower bound η means that if generator j send messages to generator i, the interaction strength would not be too weak. Besides, Assumption 9.2 implies that the weight matrix W = [wi j ] ∈ R N ×N associated with the directed network G is row-stochastic, i.e., W 1 N = 1 N . It can be lightly fulfilled, for instance, by setting wi j = 1/(diin + 1), j ∈ Niin ∪ {i}, and wi j = 0, otherwise. In this way, Assumption 9.2 is satisfied with η = 1/N . Note that for each generator i ∈ V, only the local information diin is required to construct the row-stochastic weight matrix W . That is, each generator can locally decide the weights without resorting to any information form its in-neighbors. Remark 9.5 It is revealed in [20] that using the standard N gradient method, i.e., (9.8a) without yii (k), can only lead to a minimizer of i=1 πi Φi (λ) rather than N i=1 Φi (λ) due to the unbalancedness of directed networks. Here, πi ∈ (0, 1) is the i-th term of π = [π1 , . . . , π N ]T ∈ R N , which is the left normalized eigenvector (Perron vector) of W satisfying limk→∞ W k = 1 N π T and 1TN π = 1. If πi can be known by each generator in advance, this issue can be easily settled by scaling gradients with πi , ∀i ∈ V. However, the defect of precomputation of Perron vector is obvious, since the algorithm cannot perform until each generator learns its πi . To overcome the limitation, the variable yi (k) in (9.8c) is introduced to asymptotically estimate the Perron vector. This rescaling gradient technique is initially proposed by Mai and Abed [21]. It is worth mentioning that a mild prerequisite for this method is that each generator knows its global ordering over the system. This can be satisfied in practice since generators have different addresses and identifiers.
9.3.2 Convergence Analysis We start the convergence analysis with some supporting lemmas which play important role in our analysis. The following lemma reveals the relations on the directed network G and the associated weight matrix W . Lemma 9.6 ([21]) Suppose that Assumptions 9.2 and 9.4 hold for G and W . Then, there exist β > 0 and ξ ∈ (0, 1) such that ∀i, j ∈ V and k ≥ 0,
9.3 Main Results
197
k [W ] ji − πi ≤ βξ k , |yii (k) − πi | ≤ βξ k Moreover, there is a constant h > 0 such that h −1 ≤ yii (k) ≤ 1, ∀i, j ∈ V, k ≥ 0, The following lemma helps us to establish an asymptotic consensus result. Lemma 9.7 ([21]) Consider the sequence θi (k + 1) =
N
wi j θ j (k) + εi (k)
(9.10)
j=1
N ¯ where wi j satisfies Assumption 9.2. Let θ(k) = i=1 πi θi (k) where πi is defined in Lemma 9.6. If limk→∞ εi (k) = 0, then we have ¯ = 0, ∀i ∈ V lim ||θi (k) − θ(k)||
k→∞
(9.11)
We rely on a deterministic counterpart of the supermartingale convergence theorem to achieve asymptotic convergence of the dual variables λi (k), as stated in the following lemma. Lemma 9.8 ([19]) Let {s(k)} be a non-negative scalar sequence satisfying, ∀k ≥ 0, s(k + 1) ≤ (1 + v(k))s(k) − b(k) + c(k) ∞ where v(k) ≥ 0, b(k) ≥ 0, and c(k) ≥ 0 with ∞ k=0 v(k) < ∞ and k=0 c(k) < s(k) = σ where σ is a non-negative constant, and ∞. Then, we have lim k→∞ ∞ b(k) < ∞. k=0 Lemma 9.9 (Asymptotic consensus) N Consider algorithm (9.8) and suppose Assump¯ tions 9.1–9.4 hold. Let λ(k) = i=1 πi λi (k), and then we have limk→∞ ||λi (k) − ¯ λ(k)|| = 0, for any i ∈ V. Proof Note that yii (k) > 0, ∀k ≥ 0 and limk→∞ yii (k) = πi > 0 (cf. Lemma 9.6). It can be verified that {yii (k)} is a positive bounded sequence. Combining the fact that ∇Φi (λi (k)) is uniformly bounded and limk→∞ γi (k) = 0, we have limk→∞ γi (k) ∇Φi (λ) /yii (k) = 0. In light of Lemma 9.7, it follows limk→∞ ||λi (k) − ¯ λ(k)|| = 0. Lemma 9.10 Let Assumptions 9.1–9.4 hold. Then, for any u ∈ R and k ≥ 0, we have
198
9 Distributed Economic Dispatch in Smart Grids … N
πi λi (k + 1) − u2
i=1 N ≤ πi λi (k)−u2 − j=1
+
2 ¯ Φ(λ(k))−Φ(u) k+1
N N 6lh ¯ + 2lha(k) πi λi (k)− λ(k) πi λi (k)−u k + 1 i=1 i=1
+ h 2l 2
N
πi γi2 (k)+
i=1
Proof Denote qi (k) =
N j=1
2nhβξ k λ(k)−u ¯ k+1
(9.12)
wi j λ j (k). We have
∇Φi (λi (k))T (qi (k) − u) yii (k) 2 ∇Φi (λi (k)) (9.13) + γi2 (k) yii2 (k)
λi (k + 1) − u2 =qi (k) − u2 − 2γi (k)
Resorting to the row stochasticity of W and the convexity of ·2 , qi (k) − u2 can be bounded by qi (k) − u2 ≤
N
2 wi j λ j (k) − u
(9.14)
j=1
Considering the second term in (9.13), ignoring 2/yii (k), we obtain − γi (k)∇Φi (λi (k))T (qi (k) − u) 1 ∇Φi (λi (k))T (qi (k) − u) ≤ la(k) qi (k) − u − k+1 N 1 ¯ ≤ la(k) Φi (λ(k))−Φ wi j λ j (k)−u − i (u) k+1 j=1 N l ¯ + 2l λi (k)− λ(k) ¯ + wi j λ j (k)− λ(k) k +1 j=1 k +1
(9.15)
where the first inequality follows |γi (k) − 1/(k + 1)| ≤ a(k) and |∇Φi | ≤ l, and the second inequality holds due to the fact that Φi (λ) is convex and l-Lipschitz continuous. Substituting (9.14) and (9.15) into (9.13), we have
9.3 Main Results
199
λi (k + 1) − u2
¯ 2 2 Φi (λ(k)) − Φi (u) + h 2 l 2 γi2 (k) ≤ wi j λ j (k) − u − (k + 1)y (k) ii i=1 N
N 2lh ¯ + 4lh λi (k)− λ(k) ¯ + wi j λ j (k)− λ(k) k + 1 j=1 k+1
+ 2lha(k)
N
wi j λ j (k)−u
j=1
Multiplying πi at both sides and summing over i ∈ V, we obtain N
πi λi (k + 1) − u2
i=1 N ≤ πi λi (k)−u2 − j=1
+
2 πi ¯ (Φi (λ(k))−Φ i (u)) k + 1 i=1 yii (k) N
N N 6lh ¯ + 2lha(k) πi λi (k)− λ(k) πi λi (k)−u k + 1 i=1 i=1
+h l
2 2
N
πi γi2 (k)
(9.16)
i=1
where we use the fact π T W = π T . Considering the second term on the right side of (9.16), we have 2 πi ¯ Φi (λ(k)) − Φi (u) k + 1 i=1 yii (k) N
−
2 ¯ Φ(λ(k)) − Φ(u) k+1 N 2 yii (k) − πi ¯ + y (k) Φi (λ(k)) − Φi (u) k+1
≤−
i=1
≤−
ii
2nhβξ k 2 ¯ λ(k)−u ¯ Φ(λ(k))−Φ(u) + k+1 k+1
Substituting (9.17) into (9.16) yields (9.12) as desired. The convergence results of algorithm (9.8) are stated in Theorem 9.11.
(9.17)
200
9 Distributed Economic Dispatch in Smart Grids …
Theorem 9.11 Suppose Assumptions 9.1–9.4 hold. Algorithm (9.8) solves problem (9.1), i.e., as k → ∞, λi (k) → λ∗ and xi (k) → xi∗ , ∀i ∈ V, where λ∗ is the optimal incremental cost and xi∗ is the optimal power generation for each generator ∀i ∈ V. Proof Consider Lemma 9.10. Letting u = λ∗ , it follows N
2 πi λi (k + 1) − λ∗
i=1 N 2 ≤ πi λi (k)−λ∗ − j=1
+
2 ¯ Φ(λ(k))−Φ(λ∗ ) k+1
N N 6lh ¯ + 2lha(k) πi λi (k)− λ(k) πi λi (k)−λ∗ k + 1 i=1 i=1
+ h 2l 2
N
πi γi2 (k)+
i=1
2nhβξ k ∗ λ(k)−λ ¯ k+1
(9.18)
Define s(k) =
N
2 πi λi (k) − λ∗
i=1
2 ¯ Φ(λ(k))−Φ(λ∗ ) k+1 N k 6lh ∗ ¯ + 2nhβξ λ(k)−λ ¯ c(k) = πi λi (k)− λ(k) k + 1 i=1 k+1
b(k) =
+2lha(k)
N
N πi λi (k)−λ∗ + h 2 l 2 πi γi2 (k)
i=1
i=1
Then, we get a concise expression of (9.18) as s(k + 1) ≤ s(k) − b(k) + c(k) In order to apply Lemma 9.8 (with v(k) = 0), we start to prove holds. It follows from Lemma 9.9 that ∞ N 6lh ¯ 0, satisfying ∇ f i (xa ) − ∇ f i (xb ) ≤ σi xa − xb , ∀xa , xb ∈ Rn We will use σˆ = maxi∈V {σi } and σ˜ =
N
i=1 (1/σi )
in the subsequent analysis.
Assumption 10.3 For all i ∈ V, the function f i is strongly convex with parameter μi > 0. That is, for any xa , xb ∈ Rn , we have f i (xa ) ≥ f i (xb ) + ∇ f i (xb ) , xa − xb + We will use μˆ = maxi∈V {μi } and μ˜ =
N
i=1 (1/μi )
μi xa − xb 2 2
in the subsequent analysis.
Remark 10.1 For simplicity, suppose that each generator has only one dimension, i.e., xi ∈ R. It is worth noting that this chapter can be trivially extended to the highdimensional case by resorting to Kronecker product.
10.2.3 Communication Network The communication network is modeled as a series of time-varying balanced networks G(k) = (V, E(k), A(k)), where V = {1, 2, . . . , N } indicates the collection of generators and E(k) ⊆ V × V denotes the collection of edges. At time k, the weighted adjacency matrix is represented as A(k) = [ai j (k)] ∈ R N ×N , where ai j (k) > 0 if the edge ( j, i) ∈ E(k); ai j (k) = 0, otherwise. If (i, j) ∈ E(k), generator i can send information to generator j at time k ≥ 0. The in-neighbors and out-neighbors set of generator i are denoted by Niin (k) = { j ∈ V|( j, i) ∈ E(k)} and Niout (k) = { j ∈ V|(i, j) ∈ E(k)}, ∀k ≥ 0. Besides, the degree matrix D(k) =
10.2 Preliminaries
217
diag{d1 (k), d2 (k), . . . , d N (k)} is diagonal, where di (k) = Nj=1 ai j (k) for all i ∈ V and t ≥ 0. We denote d ∗ = supk≥0 maxi∈V {di (k)}. The Laplacian matrix L(k) = [li j (k)] ∈ R N ×N associated with the network G(k) at time t is denoted by li j (k) = −ai j (k), i = j; lii (k) = Nj=1, j=i ai j (k), which assures that Nj=1 li j (k) = 0. Note that L(k) = D(k) − A(k).
10.3 Algorithm Development In this section, we first turn the primal problem (10.1) into a dual problem for eventtriggered distributed optimization algorithm development. In addition, some knowledge related to the event-triggered scheme is introduced, based on which we put forward a distributed optimization algorithm to cope with the dual problem.
10.3.1 Problem Reformulation The Lagrangian function L : X × R → R can be constructed as L(x, λ) =
N
f i (xi ) + λ
N
i=1
xi − p
(10.2)
i=1
where λ ∈ R denotes the Lagrangian multiplier related to the equality constraint (10.1). Given a convex function f , we write the conjugate function of f as f ⊥ (y) = supx∈X {y T x − f (x)} for y ∈ Rn . To this end, we define the dual function of problem (1) as D(λ) = min
xi ∈Xi
N
N
f i (xi ) + λ
i=1
= −λ p − =
N
N
xi − p
i=1
sup (−λxi − f i (xi ))
i=1 xi ∈Xi
Di (λ)
(10.3)
i=1
− f i⊥ (−λ) − λ p N . We can get the gradient of the dual function where Di (λ) = N xi − p. Then the dual problem of (10.1) can be formulated as ∇D(λ) = i=1 max D(λ), λ ∈ R
(10.4)
218
10 Distributed Economic Dispatch in Smart Grids: Event-Triggered Scheme
which is equivalent to the following minimization problem min H(λ) =
N
Hi (λ), λ ∈ R
(10.5)
i=1
where Hi (λ) = f i⊥ (−λ) + λ p N . Lemma 10.2 ([45]) Suppose that Assumptions 10.2 and 10.3 hold. We have that for all i ∈ V the conjugate f i⊥ of f i satisfies the following properties: (1) The function f i⊥ is strongly convex with constant 1/σi ; (2) The function f i⊥ is differentiable and ∇ f i⊥ is Lipschitz continuous with constant 1/μi .
10.3.2 Event-Triggered Scheme Compared with the traditional time-triggered scheme [45], the event-triggered scheme can effectively decrease network burden. We design the following auxiliary variables based on the event-triggered communication scheme as follows viλ (k) = λi (k) + h
j
(10.6a)
j
(10.6b)
ai j (k) (λ j (kt (k) ) − λi (kti (k) ))
j∈Niin (k) y
vi (k) = yi (k) + h
ai j (k) (y j (kt (k) ) − yi (kti (k) ))
j∈Niin (k) j
j
where generator i obtains the states λ j (kt (k) ) and y j (kt (k) ) at last sampling time j instant kt (k) for generator j, j ∈ Niin (k), and spreads its information λi (kti (k) ) and yi (kti (k) ) to the generators j, j ∈ Niout (k). Let h > 0 be the control gain which will be specified in the subsequent analysis. Assume that a sequence of event-triggered time instants for each generator i ∈ y i λ V is {kti (k) }∞ t (k)=1 . Two measurement errors ei (k) = λi (kt (k) ) − λi (k) and ei (k) = yi (kti (k) ) − yi (k) are defined for all k ∈ [kti (k) , kti (k)+1 ). To meet conditions of the event-triggered scheme, the next event-triggered time instant kti (k)+1 is determinated by y i = inf k > kti (k) | eiλ (k) + ei (k) > Cγ k kt(k)+1
(10.7)
where parameters C > 0 and γ > 0 will be given in the subsequent analysis. For the sake of simplicity, we suppose that k1i = 0, ∀i ∈ V. y In light of the definitions of eiλ (k) and ei (k), Eq. (10.6a) can be transformed into
10.3 Algorithm Development
219
viλ (k) = λi (k) − h
N
li j (k)λ j (k) − h
j=1
N
li j (k)eλj (k)
(10.8)
j=1
At time k, define the time-varying matrix W (k) = [wi j (k)] ∈ R N ×N , where wii (k) = 1 − hlii (k), ∀i ∈ V, and wi j (k) = −hli j (k), i = j. Note that W (k) is a doubly stochastic matrix for all k ≥ 0. From (10.8), we get viλ (k)
=
N
wi j (k)λ j (k) − h
j=1
N
li j (k)eλj (k)
(10.9)
j=1
where 0 < h < 1/ supk≥0 maxi∈V {lii (k)} . Similarly, one has y
vi (k) =
N
wi j (k)y j (k) − h
j=1
N
y
li j (k)e j (k)
(10.10)
j=1
10.3.3 Event-Triggered Distributed Optimization Algorithm We are committed to coming up with an event-triggered distributed optimization algorithm to solve problem (10.5). In the proposed algorithm, each generator i ∈ V contains three main variables xi (k), λi (k) and yi (k). The optimal solution of primal problem (10.1) and dual problem (10.5) are estimated by xi (k) and λi (k), respectively. by [45], the auxiliary variable yi (k) is introduced to esti N Motivated ∇Hi (λ∗ ), where λ∗ represents the optimal solution to dual problem (1). mate i=1 The initialization xi (0) ∈ Xi , λi (0) ∈ R and yi (0) = ∇H(λi (0)). Specifically, the updates of distributed optimization algorithm are xi (k + 1) = arg min f i (xi ) + λi (k)xi
(10.11a)
λi (k + 1) = viλ (k) − αi (k)yi (k) y yi (k + 1) = vi (k) + ∇Hi (λi (k + 1)) − ∇Hi (λi (k))
(10.11b) (10.11c)
xi ∈Xi
where ∇Hi (λi (k)) = −xi (k) + p N and αi (k) denotes the time-varying step-size of generator i ∈ V. Define x(k) = [x1 (k),. . .,x N (k)]T , λ(k) = [λ1 (k),. . ., λ N (k)]T , y(k) = y y [y1 (k),. . .,y N (k)]T , v λ (k) = [v1λ (k),. . .,v λN (k)]T , v y (k) = [v1 (k),. . .,v N (k)]T , ∇H(λ(k)) = [∇H1 (λ1 (k)),. . .,∇H N (λ N (k))]T and diagonal matrix R(k) = diag{α1 (k), α2 (k), . . . , α N (k)}. Then, the distributed optimization algorithm (10.11a) can be transformed into the following vector form
220
10 Distributed Economic Dispatch in Smart Grids: Event-Triggered Scheme
x(k + 1) = arg min x∈X
N
f i (xi ) + λi (k)xi
(10.12a)
i=1
λ(k + 1) = v λ (k) − R(k)y(k) y(k + 1) = v y (k) + ∇H(λ(k + 1)) − ∇H(λ(k))
(10.12b) (10.12c)
where v λ (k) = W (k)λ(k) − h L(k)eλ (k) and v y (k) = W (k)y(k) − h L(k)e y (k).
10.4 Convergence Analysis For simplicity, we define L max = supk≥0 { L(k) } and k D = αmax /αmin , where αmax = supk≥0 maxi∈V {αi (k)} and αmin = inf k≥0 mini∈V {αi (k)}. Suppose that si (0) , si R N , ∀k ≥ 0. Define (1) , si (2) , . . . represents an infinite sequence, where si (k) ∈ γ γ,K −k −k = maxk=0,1,...,K {γ ||si (k)||} and si = supk≥0 γ si (k) . ||si || In the following, for some positive integer B ≥ 1, we utilize the notation W B (k) = W (k)W (k − 1) · · · W (k − B + 1) with the convention that W B (k) = I N , k < 0; W0 (k) = I N , k ≥ 0. Assumption 10.4 For all k ≥ 0, suppose that there exists some positive integer B ≥ 1 such that δ = sup {δ(k)} < 1 k≥B−1
where δ(k) = λmax W B (k) −
1 1 1T N N N
.
Lemma 10.3 ([46]) Under Assumption 10.4, for any k = B − 1, B, . . . , and any vector y with proper dimensions, if x = W B (k)y, we then have x ≤ δ(k) y , where δ(k) is given in Assumption 10.4. Some new notations will be defined before the next lemma is stated. For any k ≥ 1, let z(k) = ∇H(λ(k)) − ∇H(λ(k − 1)) accompany the initial value z(0) = 0, and q(k) = λ(k) − 1 N λ∗ . In addition, the dual problem (10.5) is further addressed by applying the inexact gradient descent method [47]. For ease of distinction, different notations will be used to restate the problem (10.5) min g (x) =
N
gi (x), x ∈ Rd
(10.13)
i=1
The inexact gradient algorithm for solving (10.13) is as follows p (k + 1) = p (k) − θ (k)
N i=1
∇gi (si (k)) + τ (k)
(10.14)
10.4 Convergence Analysis
221
where τ (k) denotes an additive noise and θ (k) is time-varying step-size. In particular, we apply p (k) instead of the value of p at time k. For any k = 0, 1, ... , define r (k) = p (k) − p ∗ , where p ∗ is an optimal solution to problem (10.13). Next, we present some necessary lemmas that play a significant role in the analysis of convergence. Lemma 10.4 Under Assumptions 10.2 and 10.3, if the parameters satisfy 1−
μ˜ θβ ≤ γ < 1, 0 < θ < 2σ(β ˜ + 1) 1+η
(10.15)
where η > 0 and β ≥ 2, then for any K ≥ 0, one has r γ,K
N ˆ 1 σβ σ(1 ˆ + η) ≤2 r (0) + p − si γ,K + γ σ˜ μη ˜ i=1
(3σ˜ − θ)σˆ τ γ,K + γθ
(10.16)
Proof First, utilizing equality a 2 = a + b 2 − 2a, b − b 2 and combining a = p (k + 1) − p ∗ and b = p (k) − p (k + 1) , we have p (k + 1) − p ∗ 2 = p (k) − p ∗ 2 − 2 < p (k + 1) − p ∗ , p (k) − p (k+1) > − p (k + 1) − p (k) 2 . Since r (k) = p (k) − p ∗ , then r (k + 1) 2 = r (k) 2 − 2 < p (k + 1) − p ∗ , p (k) − p (k + 1) > − p (k + 1) − p (k) 2 . Substituting (10.14) into the above equation, we deduce that r (k + 1) 2 = r (k) 2 + 2 p (k + 1) − p ∗ , τ (k) − p (k + 1) − p (k) 2 N ∗ + 2θ (k) ∇gi (si (k)), p − p (k) − 2θ (k)
i=1 N
∇gi (si (k)), p (k + 1) − p (k)
(10.17)
i=1
Next, we will focus on the last two items of (10.17). It can be derived from Assumption 10.3 and Lemma 10.2 that gi ( p ∗ ) ≥ gi (si (k)) + p ∗ − si (k) 2 /2σi + ∇gi (si (k)) , p ∗ − si (k) Then, we obtain gi ( p ∗ ) ≥gi (si (k)) + ∇gi (si (k)) , p (k) − si (k) + ∇gi (si (k)) , p ∗ − p (k) β 1 ∗ 2 2 (10.18) p (k) − p − β p (k) − si (k) + 2σi β + 1 where we use the inequality si (k) − p ∗ 2 ≥ β p (k) − p ∗ 2 / (β + 1) − β p (k) − si (k) 2 , and β > 0 is a tunable parameter. In the light of (10.18), we
222
10 Distributed Economic Dispatch in Smart Grids: Event-Triggered Scheme
establish that N
∇gi (si (k)) , p ∗ − p (k)
i=1
β p (k) − p ∗ 2 + gi (si (k)) 2σ(β ˜ + 1) i=1 N
≤g( p ∗ ) − −
N i=1
β ∇gi (si (k)) , p (k) − si (k) − p (k) − si (k) 2 2σi
(10.19)
Besides, for any vector ∈ Rd , we have gi ( p (k) + ) ≤gi (si (k)) + ∇gi (si (k)) , + ∇gi (si (k)) , p (k) − si (k) 1+η 1+η p (k) − si (k) 2 + 2 + (10.20) 2μi 2μi η where η > 0 is a tunable parameter. Therefore, −
N
∇gi (si (k)),
i=1
1+η 2 + gi (si (k)) 2μ˜ i=1 N
≤ − g( p (k) + ) +
N 1+η 2 ∇gi (si (k)) , p (k) − si (k) + p (k) − si (k) + 2μi η i=1
(10.21)
Substituting = p (k + 1) − p (k) into (10.21), it follows that −
N
∇gi (si (k)), p (k + 1) − p (k)
i=1
1+η p (k + 1) − p (k) 2 + gi (si (k)) 2μ˜ i=1 N
≤ − g( p (k + 1)) + +
N 1+η ∇gi (si (k)) , p (k) − si (k) + p (k) − si (k) 2 2μi η i=1
By (10.19) and (10.22), we further convert (10.17) into
(10.22)
10.4 Convergence Analysis
223
r (k + 1) 2
θ (k) β r (k) 2 − 2θ (k) g ( p (k + 1)) − g p ∗ ≤ 1− σ˜ (β + 1) N θ (k) β θ (k) (1 + η) 1 p (k) − si (k) 2 + τ (k) 2 + + σ μ η ρ (k) i i i=1 θ (k) (1 + η) p (k + 1) − p (k) 2 + ρ (k) r (k + 1) 2 − 1− (10.23) μ˜ where ρ (k) is a sequence of positive tunable parameters. Suppose that 0 < ρ (k) < 1 for all k ≥ 0. Let 0 < θ (k) < σ˜ (β + 1)/β such that 1 − θ (k) β/ (σ˜ (β + 1)) ≥ 0. N p (k) − si (k) 2 and selecting 0 < θ (k) < μ/ Defining ε (k) = i=1 ˜ (1 + η) such that 1 − θ (k) (1 + η)/μ˜ is non-negative, we have r (k + 1) 2
θ (k) β 1 2θ (k)
r (k) 2 − 1− g ( p (k + 1)) − g p ∗ ≤ 1 − ρ (k) σ(β ˜ + 1) 1 − ρ (k) θ (k) β θ (k) (1 + η) 1 1 τ (k) 2 + + ε (k) + 1 − ρ (k) σ˜ μη ˜ (1 − ρ (k)) ρ (k) (10.24) Considering (10.24), we need to be aware of the following two possibilities. One case is σˆ (1 + η) σˆ σβ ˆ 2 r (k + 1) ≥ τ (k) 2 + ε (k) + (10.25) σ˜ μη ˜ ρ (k) θ (k) and the other is r (k + 1) < 2
σβ ˆ σˆ (1 + η) σˆ τ (k) 2 + ε (k) + σ˜ μη ˜ ρ (k) θ (k)
(10.26)
Note that the function g( p) is strongly convex. If the first situation occurs, one obtains
2θ (k) g ( p (k + 1)) − g p ∗ ≥
θ (k) (1 + η) 1 θ (k) β τ (k) 2 + ε (k) + σ˜ μη ˜ ρ (k)
(10.27) Following from (10.27), we have r (k + 1) 2 ≤
θ (k) β 1 r (k) 2 1− 1 − ρ (k) σ˜ (β + 1)
(10.28)
224
10 Distributed Economic Dispatch in Smart Grids: Event-Triggered Scheme
Another possibility is opposite. Combining (10.27) with (10.28), we have the following ⎫ 1 θ (k) β 2 ⎪ r (k) ,⎪ 1− ⎬ 1 − ρ (k) σ˜ (β + 1) 2 (10.29) r (k + 1) ≤ max ⎪ σβ ˆ σˆ (1 + η) σˆ ⎪ ⎪ 2⎪ ⎭ ⎩ τ (k) + ε (k) + σ˜ μη ˜ ρ (k) θ (k) ⎧ ⎪ ⎪ ⎨
Defining θ = supk≥0 {θ (k)}, we obtain 0 < θ < μ/ ˜ (1 + η). Through continuous iterative updates, and multiplying γ −2(k+1) on the both sides of the inequality (10.29) and taking square root, we get the following γ −(k+1) r (k + 1) 21 k+1 k 2 θβ 1 −(k+1) r (0) 1− ≤γ s=0 1 − ρ (s) σ˜ (β + 1) 21 2s s−1 θβ 1 −(k+1) +γ 1− max t=0 s=0,...,k 1 − ρ (k − t) σ˜ (β + 1) ⎞⎫ ⎛⎛ ⎞ ⎬
σβ ˆ σ ˆ σ(1 ˆ + η) ⎠ ε (k − s) + τ (k − s) ⎠ (10.30) ×⎝⎝ + ⎭ σ˜ μη ˜ ρ (k − s) θ As√long as there is a large enough k > k0 such that 0 < × 1/ (1 − ρ (k))/γ < 1, then sup γ
−(k+1)
k
k=0,1,...
s=0
1 1 − ρ (s)
21 1−
1 − θβ/ (σ˜ (β + 1))
θβ σ˜ (β + 1)
k+1 2
r (0)
is finite. To this end,
we implement it by choosing ρ (k) = ρ = θβ/ (2σ˜ (β + 1) −θβ) with γ > 1 − θβ/ (σ˜ (β + 1)). Because ρ increases as β increases, then we will adopt ρmin =θ/
(3σ˜ − θ) in the following derivation process. Taking advantage of ρ (k) = ρ, γ > 1 − θβ/ (σ˜ (β + 1)) and β ≥ 2 to further simplify (10.30) yields γ −(k+1) r (k + 1)
⎧ ⎨
s ⎛ σ(1 ˆ + η) ˆ θβ ⎝ σβ + 1− ≤ r (0) + γ −(k+1) max s=0,...,k ⎩ 2σ(β ˜ + 1) σ˜ μη ˜ !
1 (10.31) (3σ˜ − θ)σˆ τ (k − s) × ε (k − s) + θ Defining c = γ −2 (1 − θβ/ (2σ˜ (β + 1))) ∈ (0, 1] and following from (10.31), we obtain
10.4 Convergence Analysis
225
⎧ ⎛ ⎨ σβ ˆ σ(1 ˆ + η) γ −(k+1) r (k + 1) ≤ r (0) + γ −1 max γ −s ⎝ ε (s) + s=0,...,k ⎩ σ˜ μη ˜ ! 1 (10.32) + (3σ˜ − θ)σˆ τ (s) θ where 0 < θ ≤ 3σ. ˜ In light of
√
ε (k) =
"
N i=1
p (k) − si (k) 2 , we further get
γ −(k+1) r (k + 1)
(3σ˜ − θ)σˆ ≤ r (0) + max γ −s τ (s) s=0,...,k γθ ⎛ ⎞ N 1 σ(1 ˆ + η) ⎠ σβ ˆ + ⎝ + max γ −s p (s) − si (s) s=0,...,k γ σ˜ μη ˜ i=1
(10.33)
Taking the maximum over k = 0, 1, . . . , K on the both sides of the inequality (10.33), we have
(3σ˜ − θ)σˆ r γ,K ≤ 2 r (0) + τ γ,K γθ ⎛ ⎞ σ(1 ˆ + η) ⎠ N ˆ 1 ⎝ σβ p − s γ,K + + i i=1 γ σ˜ μη ˜
which is the desired result.
Lemma 10.5 Under Assumptions 10.2 and
10.3, assume that the parameters αmax and γ satisfy 0 < αmax < μ/(1 ˜ + η) and 1 − αmax β/ (2σ˜ (β + 1)) ≤ γ < 1, where η > 0 and β ≥ 2 are tunable parameters, then we have # q γ, K ≤ 1 +
N γ
#"
+B y y γ, K where B y =
$$ ˆ λ γ, K + σ(1+η) μη ˜ √ ¯ + 2 N λ(0) − λ∗ σβ ˆ σ˜
(10.34)
(3σ˜ − αmax ) σˆ N − t R−1 /γ and t R = αmax /αmin .
Proof Multiplying 1TN on the both sides of (10.12c) yields 1TN y(k + 1) = 1TN y(k) + 1TN ∇H(λ(k + 1)) − 1TN ∇H(λ(k)) Since y(0) = ∇H(λ(0)), we have 1TN y(k) = 1TN ∇H(λ(k)) Furthermore, defining λ¯ (k) = 1TN λ (k) /N , we deduce
(10.35)
226
10 Distributed Economic Dispatch in Smart Grids: Event-Triggered Scheme
1 ¯ + 1) =λ(k) ¯ λ(k − 1TN R(k)y(k) N N 1 ¯ Hi (λi (k)) + (m(k)1TN − 1TN R(k))y(k) =λ(k) − m(k) N i=1
(10.36)
where m(k) > 0 is constant step-size of the gradient descent. Combining with Lemma 10.4, select supk≥0 {m(k)} = αmax , then we have ⎛ ⎞ σ(1 ˆ + η) ⎠ γ,K N ⎝ σβ ˆ + λ γ σ˜ μη ˜
γ,K (3σ˜ − αmax )σˆ R (k) N IN − y + (10.37) √ αmax Nγ
γ, K λ(k) ¯ ¯ − λ∗ ≤2 λ(0) − λ∗ +
√
√ " N N 2 λi ≤ N where (10.37) follows i=1 i=1 λi . Note that q (k) = λ (k) − 1 N λ∗ = Jˆλ (k) + 1 N λ¯ (k) − 1 N λ∗ Thus, it can be established that
q γ, K = λ γ, K +
√ γ, K N λ¯ − λ∗
(10.38)
Finally, combining (10.37) with (10.38), we obtain ⎛
q γ, K
⎞ √ σ(1 ˆ + η) σβ ˆ N ¯ ⎠ λ γ, K + 2 N λ(0) + ≤ ⎝1 + − λ∗ γ σ˜ μη ˜
(3σ˜ − αmax )σˆ (N − t R−1 ) y γ, K + (10.39) γ
The proof is completed.
Lemma 10.6 Suppose that Assumption 10.4 holds. Let the parameter γ satisfy γ ∈ √ B ( δ, 1). Then, for the sequences {λ(k)} and {y(k)} generated by the distributed optimization algorithm (10.12), we have that ∀K = 0, 1, . . . ,
λ γ,K ≤
αmax (1−γ B ) (γ B −δ)(1−γ)
+ γ Bγ−δ B
√
h L (1−γ B ) N C y γ,K + max (γ B −δ)(1−γ) % ! B −(t−1) γ λ(t − 1)
t=1
Proof According to (10.12c), we have
(10.40)
10.4 Convergence Analysis
227
λ(k + 1) = W (k)λ(k) − h L(k)eλ (k) − R(k)y(k) By recursion, we obtain λ(k + 1) = W B (k)λ(k − B + 1) − h
B−1
Wt (k)L(k − t)eλ (k − t)
t=0
−
B−1
Wt (k)R(k − t)y(k − t)
t=0
Using Lemma 10.3, for any K ≥ B − 1, it follows that
λ(k + 1) ≤δ λ(k − B + 1) + αmax
B
y(k − t + 1)
t=1
+ h L max
B λ e (k − t + 1)
(10.41)
t=1
Multiplying γ −(k+1) on the both sides of (10.41), we have ∀k = B − 1, B, . . . ,
γ −(k+1) λ(k + 1) δ ≤ B γ −(k+1−B) λ(k − B + 1) γ B 1 −(k−t+1) y(k − t + 1) γ + αmax t γ t=1 + h L max
B 1 −(k−t+1) eλ (k − t + 1) γ γt t=1
(10.42)
where γ B ∈ (δ, 1). For the sake of utilizing the norm · γ,K , we supplement the initial relation of −1 ≤ k ≤ B − 2 as follows
γ −(k+1) λ(k + 1) ≤ γ −(k+1) λ(k + 1)
(10.43)
Taking the maximum over k = B − 1, . . . , K on the both sides of (10.42) and the maximum over k = −1, . . . , B − 2 on the both sides of (10.43), we further obtain
228
10 Distributed Economic Dispatch in Smart Grids: Event-Triggered Scheme
λ γ, K ≤
B B 1 1 δ γ, K γ, K eλ γ, K y λ + α + h L max max B t t γ γ γ t=1 t=1 ! B % γ −(t−1) λ(t − 1) + t=1
Since eiλ (k) ≤ Cμk , μ ≤ γ, one obtains γ
−k
k λ √ e (k) ≤ N C μ γ
√ γ, K √ Selecting μ = γ such that γ −k eλ (k) ≤ N C, we have eλ ≤ N C. Therefore, we get
γ, K
λ
√ NC αmax 1−γ B h L max 1−γ B γ, K
y ≤ B + B γ −δ (1−γ) γ −δ (1−γ) % ! B γB −(t−1)
γ λ(t − 1) + B γ − δ t=1
which is the conclusion of Lemma 10.6.
Lemma 10.7 Suppose that Assumptions 10.1–10.4 hold. Let the parameter γ satisfy √ B γ ∈ ( δ, 1). Then, we have for any positive integer K that y γ,K ≤
#
(γ+1)(1−γ B ) μ˜ (γ B −δ )(1−γ)
+ γ Bγ−δ B
B
+
1 μ˜
$
q γ,K +
√ h L max (1−γ B ) N C B (γ −δ)(1−γ)
γ −(t−1) y¯ (t − 1)
(10.44)
t=1
Proof Considering y(k) ≤ y(k) Jˆ + y(k) J , we obtain γ −k y(k) ≤ γ −k y(k) Jˆ + γ −k y(k) J
(10.45)
Taking the maximum over k = 0, . . . , K on the both sides of (10.45), we get γ,K
y γ,K ≤ y γ,K + y J
(10.46)
Note that the function ∇H(λ) is Lipschitz continuous. Considering 1TN ∇H(λ∗ ) = 0 and applying (10.35), one has y(k) J ≤
1 λ(k) − 1 N λ∗ μ˜
(10.47)
10.4 Convergence Analysis
229
˜ Then, Therefore, it follows y (k) J ≤ q (k) /μ. γ −k y(k) ≤
1 −k γ q(k) μ˜
(10.48)
Taking the maximum over k = 0, . . . , K on the both sides of (10.48), we obtain γ,K
y J
≤
1 q γ,K μ˜
(10.49)
Based on (10.12c), similar with Lemma 10.6 to derive the following
γ, K
y
√ γ 1 − γB h L max (1 − γ B ) N C γ,K z ≤ B + B γ − δ (1 − γ) γ − δ (1 − γ) B ' γ B & −(t−1) γ y(t − 1) + B γ − δ t=1
(10.50)
Futhermore, under Assumption 10.2, according to the definition of z(k) and Lemma 10.2, we achieve z (k + 1) ≤
1 1 1 λ (k + 1) − λ (k) ≤ q (k + 1) + q (k) μ˜ μ˜ μ˜
which implies that γ −(k+1) z(k + 1) ≤
1 −(k+1) 1 −k q(k + 1) + γ γ q(k) μ˜ μγ ˜
(10.51)
Since z(0) = 0, taking the maximum over k = 0, . . . , K − 1 on the both sides of (10.51), we have z γ,K ≤
γ+1 q γ,K μγ ˜
(10.52)
Substituting (10.52) into (10.50) yields
γ, K
y
√
NC h L max 1 − γ B (γ + 1) 1 − γ B γ, K q ≤ B + B γ − δ (1 − γ) μ˜ γ − δ (1 − γ) B ' γ B & −(t−1) γ y(t − 1) + B γ − δ t=1
(10.53)
Integrating the inequalities (10.46), (10.49), and (10.53), the results of Lemma 10.7 can be obtained.
230
10 Distributed Economic Dispatch in Smart Grids: Event-Triggered Scheme
Theorem 10.8 Let Assumptions 10.1–10.4 hold. Select the largest element in matrix R(k) as follows αmax
(
B % γ − δ μ˜ γ B − δ − χ μ˜ ∈ 0, min , √ 1+η (1 + 2N )B
√
where = σ/ ˆ μ˜ and χ = 3σ˜ σˆ N − t R−1 . Then, the sequence {λ(k)} generated by the distributed optimization algorithm (10.12) linearly converges to 1 N λ∗ with the rate O(γ k ), where the parameter γ ∈ (0, 1) is given by ⎫ ⎧) "
* ⎪ −1 * ⎪ ⎪ χ N − t R − 4μα ˜ ⎪ ˜ max ζ B + 2μδ B χ + ⎪ ⎪ + ⎪ ⎪ ⎨ ,⎬ 2μ˜ γ = max , ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎭ ⎩ 1 − αmax 3σ˜ √ where ζ = 1 + 2N . Proof The conclusions of Lemmas 10.5–10.7 are summarized as follows
$$\|q\|^{\gamma,K} \le l_{11}\|\lambda\|^{\gamma,K} + l_{12}\|y\|^{\gamma,K} + \omega_{1}$$
$$\|\lambda\|^{\gamma,K} \le l_{2}\|y\|^{\gamma,K} + \omega_{2}$$
$$\|y\|^{\gamma,K} \le l_{3}\|q\|^{\gamma,K} + \omega_{3}$$
where
$$l_{11} = \frac{1}{\gamma}\left(1 + \frac{\sqrt{N}\hat{\sigma}\beta}{\tilde{\sigma}} + \frac{\hat{\sigma}(1+\eta)}{\tilde{\mu}\eta}\right),\qquad l_{12} = \frac{(3\tilde{\sigma}-\alpha_{\max})\hat{\sigma}}{\gamma}\sqrt{N - t_{R}^{-1}},\qquad \omega_{1} = 2\sqrt{N}\,\|\bar{\lambda}(0) - \lambda^{*}\|,$$
$$l_{2} = \frac{\alpha_{\max}(1-\gamma^{B})}{(\gamma^{B}-\delta)(1-\gamma)},\qquad l_{3} = \frac{1}{\tilde{\mu}} + \frac{(\gamma+1)(1-\gamma^{B})}{\tilde{\mu}(\gamma^{B}-\delta)(1-\gamma)},$$
$$\omega_{2} = \frac{hL_{\max}(1-\gamma^{B})\sqrt{N}C}{(\gamma^{B}-\delta)(1-\gamma)} + \frac{\gamma^{B}}{\gamma^{B}-\delta}\sum_{t=1}^{B}\gamma^{-(t-1)}\|\lambda(t-1)\|,$$
$$\omega_{3} = \frac{hL_{\max}(1-\gamma^{B})\sqrt{N}C}{(\gamma^{B}-\delta)(1-\gamma)} + \frac{\gamma^{B}}{\gamma^{B}-\delta}\sum_{t=1}^{B}\gamma^{-(t-1)}\|y(t-1)\|.$$
In order to use Lemma 2.7, we need the parameters to satisfy $0 < (l_{11}l_{2}+l_{12})l_{3} < 1$, i.e.,

$$\left[\frac{\alpha_{\max}(1-\gamma^{B})}{\gamma(\gamma^{B}-\delta)(1-\gamma)}\left(1 + \frac{\sqrt{N}\hat{\sigma}\beta}{\tilde{\sigma}} + \frac{\hat{\sigma}(1+\eta)}{\tilde{\mu}\eta}\right) + \frac{(3\tilde{\sigma}-\alpha_{\max})\hat{\sigma}}{\gamma^{2}}\sqrt{N - t_{R}^{-1}}\right]\left[\frac{1}{\tilde{\mu}} + \frac{(\gamma+1)(1-\gamma^{B})}{\tilde{\mu}(\gamma^{B}-\delta)(1-\gamma)}\right] < 1 \qquad (10.65)$$

which holds under the parameter selections in the statement of the theorem; applying Lemma 2.7 then establishes the claimed linear convergence.
Theorem 10.10 Let Assumptions 10.1–10.4 hold, let condition (10.65) be satisfied, and choose the parameter $h$ such that

$$0 < h < \min\left\{\frac{1}{d^{*}},\ \frac{1}{s_{m}}\right\} \qquad (10.66)$$

Then, Zeno-like behavior will not exist, i.e., $k_{t_{i}(k)+1} - k_{t_{i}(k)} \ge 2$, $\forall i \in \mathcal{V}$ and $\forall k \ge 0$.

Proof According to the relationships derived from Lemmas 10.5–10.7, we obtain

$$\|q\|^{\gamma,K} \le (l_{11}l_{2}+l_{12})\|y\|^{\gamma,K} + l_{11}\omega_{2} + \omega_{1},\qquad \|y\|^{\gamma,K} \le l_{3}\|q\|^{\gamma,K} + \omega_{3}$$
Applying Lemma 2.7, we achieve

$$\|q\|^{\gamma,K} \le \frac{(l_{11}l_{2}+l_{12})\omega_{3} + l_{11}\omega_{2} + \omega_{1}}{1 - (l_{11}l_{2}+l_{12})l_{3}} = U_{q},\qquad \|y\|^{\gamma,K} \le \frac{l_{3}l_{11}\omega_{2} + l_{3}\omega_{1} + \omega_{3}}{1 - (l_{11}l_{2}+l_{12})l_{3}} = U_{y}$$

Similarly, we have

$$\|\lambda\|^{\gamma,K} \le \frac{l_{2}l_{3}l_{11}\omega_{2} + l_{2}l_{3}\omega_{1} + l_{2}\omega_{3}}{1 - (l_{11}l_{2}+l_{12})l_{3}} + \omega_{2} = U_{\lambda},\qquad \|\bar{y}\|^{\gamma,K} \le \frac{l_{4}(l_{11}l_{2}+l_{12})\omega_{3} + l_{4}l_{11}\omega_{2} + l_{4}\omega_{1}}{1 - (l_{11}l_{2}+l_{12})l_{3}} + \omega_{3} = U_{\bar{y}}$$
Utilizing Lemma 2.7, one has $\|q(k)\| \le U_{q}\gamma^{k}$, $\|y(k)\| \le U_{y}\gamma^{k}$, $\|\lambda(k)\| \le U_{\lambda}\gamma^{k}$, and $\|\bar{y}(k)\| \le U_{\bar{y}}\gamma^{k}$ for all $k \ge 0$. Note that $\bar{v}^{\lambda}(k) = \frac{1}{N}\mathbf{1}_{N}^{\top}\big(W(k)\lambda(k) - hy(k) - L(k)e^{\lambda}(k)\big) = \bar{\lambda}(k)$. If the event-triggered condition $\|e_{i}^{\lambda}(k)\| + \|e_{i}^{y}(k)\| - C\gamma^{k} > 0$ is not satisfied, then the next event will not occur, which indicates that

$$\begin{aligned} \|e_{i}^{\lambda}(k)\| + \|e_{i}^{y}(k)\| &\le \|\lambda_{i}(k_{t_{i}(k)}) - \bar{v}_{i}^{\lambda}(k_{t_{i}(k)})\| + \|\bar{v}_{i}^{\lambda}(k) - \lambda_{i}(k)\| + \|y_{i}(k_{t_{i}(k)}) - \bar{y}_{i}(k_{t_{i}(k)})\| + \|\bar{y}_{i}(k) - y_{i}(k)\| \\ &\quad + \|\bar{v}_{i}^{\lambda}(k_{t_{i}(k)}) - \bar{v}_{i}^{\lambda}(k)\| + \|\bar{y}_{i}(k_{t_{i}(k)}) - \bar{y}_{i}(k)\| \\ &\le U_{\lambda}\gamma^{k_{t_{i}(k)}} + U_{\lambda}\gamma^{k} + U_{y}\gamma^{k_{t_{i}(k)}} + U_{y}\gamma^{k} + \|\bar{v}^{\lambda}(k_{t_{i}(k)}) - \bar{v}^{\lambda}(k)\| + \|\bar{y}(k_{t_{i}(k)}) - \bar{y}(k)\| \end{aligned} \qquad (10.67)$$
In addition, we deduce that

$$\|\bar{v}^{\lambda}(k) - \bar{v}^{\lambda}(k_{t_{i}(k)})\| \le U_{q}\gamma^{k} + U_{q}\gamma^{k_{t_{i}(k)}} \qquad (10.68)$$
Similarly, we achieve that

$$\|\bar{y}(k) - \bar{y}(k_{t_{i}(k)})\| \le \frac{1}{\tilde{\mu}}\big(U_{q}\gamma^{k} + U_{q}\gamma^{k_{t_{i}(k)}}\big) \qquad (10.69)$$
Substituting (10.68) and (10.69) into (10.67) yields

$$\|e_{i}^{\lambda}(k)\| + \|e_{i}^{y}(k)\| \le \Big(U_{\lambda}\gamma^{k} + U_{y} + U_{q} + \tfrac{1}{\tilde{\mu}}U_{q}\Big)\gamma^{k_{t_{i}(k)}} + \Big(U_{\lambda}\gamma^{k} + U_{y} + U_{q} + \tfrac{1}{\tilde{\mu}}U_{q}\Big)\gamma^{k} \qquad (10.70)$$
When the next trigger time $k_{t_{i}(k)+1}$ meets the condition (10.7), one has

$$C\gamma^{k_{t_{i}(k)+1}} \le \|e_{i}^{\lambda}(k_{t_{i}(k)+1})\| + \|e_{i}^{y}(k_{t_{i}(k)+1})\| \qquad (10.71)$$
It is deduced from (10.70) and (10.71) that

$$k_{t_{i}(k)+1} - k_{t_{i}(k)} \ge \ln\!\left[\frac{U_{\lambda}\gamma^{k} + U_{y} + U_{q} + \frac{1}{\tilde{\mu}}U_{q}}{C - \big(U_{\lambda}\gamma^{k} + U_{y} + U_{q} + \frac{1}{\tilde{\mu}}U_{q}\big)}\right]\frac{1}{\ln\gamma} \qquad (10.72)$$
Since we need to guarantee $k_{t_{i}(k)+1} - k_{t_{i}(k)} \ge 2$, the parameter $C$ needs to satisfy

$$C > \frac{1+\gamma^{2}}{\gamma^{2}}\Big(U_{\lambda}\gamma^{k} + U_{y} + U_{q} + \frac{1}{\tilde{\mu}}U_{q}\Big) \qquad (10.73)$$
Choosing $h$ such that (10.66) holds strictly guarantees that the set of $C$ satisfying (10.73) is non-empty. This completes the proof.
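To make the triggering mechanism concrete, the following minimal Python sketch implements the threshold test $\|e_{i}^{\lambda}(k)\| + \|e_{i}^{y}(k)\| > C\gamma^{k}$ analyzed above and estimates the empirical communication rate of a synthetic, geometrically decaying error signal. The function names and the error model are illustrative assumptions, not the book's simulation code.

```python
import numpy as np

def triggers(e_lambda_i, e_y_i, C, gamma, k):
    """Event test for agent i at step k: an event fires (agent i broadcasts
    fresh states) only when the combined measurement error exceeds the
    geometrically decaying threshold C * gamma**k."""
    err = (np.linalg.norm(np.atleast_1d(e_lambda_i))
           + np.linalg.norm(np.atleast_1d(e_y_i)))
    return err > C * gamma ** k

# Synthetic check with the Case Study I parameters C = 1.1, gamma = 0.988:
# count how often a randomly fluctuating, geometrically decaying error fires.
rng = np.random.default_rng(0)
C, gamma = 1.1, 0.988
fired = sum(triggers(rng.uniform(0.0, 1.2) * gamma ** k, 0.0, C, gamma, k)
            for k in range(400))
print(fired / 400)  # empirical communication rate of this synthetic error
```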
10.5 Numerical Examples

In this section, simulation results are provided to demonstrate the effectiveness of our algorithm.
10.5.1 Case Study I

We build a power system containing 5 generators, whose underlying communication network is time-varying, as shown in Fig. 10.1. The cost function of generator i is characterized as f_i(x_i) = a_i + b_i x_i + c_i x_i^2, i = 1, 2, 3, 4, 5, where the coefficients are a = [1.51, 1.32, 0.57, 1.39, 0.64], b = [2.0, 3.0, 4.0, 4.0, 2.5], and c = [0.04, 0.032, 0.035, 0.03, 0.06328]. According to the actual situation, the power generation x_i of each generator i is limited, which can be expressed as x_i ∈ [0, x_i^max]; we choose the upper limits of the five generators as x^max = [80, 90, 70, 70, 80]. The total demand of the power system is p = 300 kW. In the simulation, the time-varying uncoordinated step-size α_i(k) of generator i is randomly selected in the range [0.001, 0.003] at each time k. Set the initial conditions x_i(0) ∈ X_i and λ_i(0) ∈ R. Simultaneously, select the parameters h = 0.04, γ = 0.988, and C = 1.1 to satisfy Theorems 10.8 and 10.10. The objective of our algorithm is not only to allocate optimal power to each generator with limited capacity, but also to meet the total network demand.

Fig. 10.1 Time-varying balanced communication network consisting of 3 fixed topologies, i.e., G(3k) = g1, G(3k + 1) = g2, and G(3k + 2) = g3 for all k ≥ 0

Fig. 10.2 Power allocation at generators

The simulation results are shown in Figs. 10.2, 10.3, 10.4 and 10.5. Specifically, Fig. 10.2 presents the optimal power outputs of the generators: x1* = 71.19 kW, x2* = 73.36 kW, x3* = 52.79 kW, x4* = 61.60 kW, and x5* = 41.05 kW. In Fig. 10.3, it can be intuitively seen that the Lagrange multipliers of all generators asymptotically converge to the optimal solution λ* = −7.695. The event-triggered sampling time instants of each generator i are depicted in Fig. 10.4, where different colors indicate the sampling instant sequences of different generators; for example, the points whose vertical coordinate is 1 represent the event-triggered sampling instants of generator 1. According to the statistical analysis, the numbers of sampling times for the five generators are [65, 72, 78, 69, 73], and the average number of sampling times is 71. Therefore, the average communication rate of control inputs is 71/400 = 17.75%, which also shows that Zeno-like behavior is rigorously excluded. Figure 10.5 presents the evolution of the measurement error e1(k) = ||e1^λ(k)|| + ||e1^y(k)|| of generator 1, which asymptotically decreases to zero.

Fig. 10.3 Consensus of Lagrange multipliers

Fig. 10.4 All generators' sampling time instant sequences

Fig. 10.5 Evolutions of ||e1(k)|| and Cγ^k
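As a sanity check on these values — outside the distributed algorithm itself — the optimal dispatch of this quadratic-cost model can be computed centrally from the equal-incremental-cost condition b_i + 2c_i x_i = λ, clipped to the capacity limits. The short Python sketch below (illustrative code, not from the book) recovers the reported operating point by bisection on the incremental cost; small differences in the last digit are rounding effects.

```python
import numpy as np

# Generator data from Case Study I: f_i(x) = a_i + b_i*x + c_i*x^2, x in [0, x_max].
b = np.array([2.0, 3.0, 4.0, 4.0, 2.5])
c = np.array([0.04, 0.032, 0.035, 0.03, 0.06328])
x_max = np.array([80.0, 90.0, 70.0, 70.0, 80.0])
demand = 300.0

def dispatch(lam):
    # Equal-incremental-cost rule b_i + 2*c_i*x_i = lam, clipped to capacity.
    return np.clip((lam - b) / (2.0 * c), 0.0, x_max)

# Bisection on the incremental cost lam until total generation meets demand.
lo, hi = b.min(), (b + 2.0 * c * x_max).max()
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if dispatch(mid).sum() < demand else (lo, mid)

lam_star = 0.5 * (lo + hi)
print(np.round(dispatch(lam_star), 2))  # approx. [71.19, 73.36, 52.79, 61.58, 41.05]
print(round(-lam_star, 3))              # approx. -7.696; cf. lambda* = -7.695 in Fig. 10.3
```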
10.5.2 Case Study II

In order to show that the proposed algorithm is suitable for large-scale networks, we consider the EDP on the IEEE 118-bus system in this case study. The IEEE 118-bus system contains 3 zones, 54 generators, 186 branches, and 91 loads, which are connected by quite a few directed bus lines [48]. Time-varying directed communication networks are considered to simulate the connections between generators. Specifically, we follow the approach in [49] to model a time-varying topology {G(k)} that switches between two graphs G1 and G2, where G1 is obtained by disconnecting Zone 3 and G2 is obtained by disconnecting Zone 1. Each generator i possesses a quadratic cost function f_i(P_i) = c_i P_i^2 + b_i P_i + a_i, where c_i, b_i, and a_i are adjustable coefficients selected within the intervals c_i ∈ [0.002, 0.071] MBtu/MW^2, b_i ∈ [8.335, 37.697] MBtu/MW, and a_i ∈ [6.78, 74.33] MBtu. Considering that each generator has a different power generation capability, the power generation of generator i is limited to [P_i^min, P_i^max], where P_i^min is selected within [5, 150] MW and P_i^max within [20, 450] MW. Suppose that the total demand of the power system is p = 6000 MW, which is known to each generator. Set the initial conditions x_i(0) ∈ X_i and λ_i(0) ∈ R. The simulation results are shown in Figs. 10.6 and 10.7, which demonstrate the convergence of the proposed algorithm. From Fig. 10.6, it can be seen that the algorithm successfully drives each generator to its optimal power generation. At the same time, all Lagrange multipliers converge to the optimal solution λ* = −19.24.

Fig. 10.6 Power allocation at generators

Fig. 10.7 Consensus of Lagrange multipliers
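As an illustration of this switching model, the sketch below builds uniform-weight row-stochastic mixing matrices for two small toy digraphs that alternate over time. The toy adjacency matrices and the uniform weight rule are illustrative assumptions; the actual zonal graphs of the 118-bus system (54 generators) and the book's exact weighting are not reproduced here.

```python
import numpy as np

def mixing_matrix(adj):
    """Uniform-weight row-stochastic mixing matrix built from a digraph's
    adjacency matrix, with a self-loop kept for every generator."""
    a = adj + np.eye(adj.shape[0])
    return a / a.sum(axis=1, keepdims=True)

# Two small illustrative digraphs standing in for G1 (Zone 3 disconnected)
# and G2 (Zone 1 disconnected); their union is strongly connected.
G1 = np.array([[0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 1, 0]], float)
G2 = np.array([[0, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 1], [1, 0, 0, 0]], float)

def W(k):
    """Time-varying topology: alternate between the two graphs, so that joint
    connectivity holds over every window of two consecutive steps."""
    return mixing_matrix(G1 if k % 2 == 0 else G2)
```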
10.5.3 Case Study III

In this case study, we examine the effectiveness of the proposed algorithm in dealing with time-varying demand. Considering the case where the system consists of five generators, we change the total load demand during the simulation. Specifically, as shown in Fig. 10.8, we set the total load demand to p = 300 MW at the beginning, and change the total demand to 200 MW and 600 MW at the time steps k = 400 and k = 800, respectively. According to Fig. 10.9, it can be seen that with each change of the total load demand, every generator outputs the optimal power at each stage while satisfying the constraints. More importantly, Fig. 10.8 depicts that the total generation of all generators equals the total load demand of the system. Therefore, the proposed algorithm is suitable for the case where the load demand changes.

Fig. 10.8 Power balance between total demand and total generation
10.5.4 Case Study IV

In this case study, the convergence rate of the proposed algorithm is compared with other algorithms that address the EDP, namely the primal-dual algorithm [25] and the stochastic gradient algorithm [50]. Under the same network topologies and parameters, the simulation results are presented in Fig. 10.10, where the residual is defined as E(k) = log10(Σ_{i=1}^{5} ||x_i(k) − x_i*||). From Fig. 10.10, the proposed algorithm has a relatively fast convergence rate compared with the other algorithms. More importantly, the proposed algorithm reduces the network communication burden by at least one half.

Fig. 10.9 Power allocation at generators

Fig. 10.10 Performance comparison with other algorithms
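For reference, the residual curve of Fig. 10.10 can be produced from stored iterates with a one-line helper; the array names below are illustrative, not from the book's code.

```python
import numpy as np

def residual(x_hist, x_star):
    """E(k) = log10( sum_i |x_i(k) - x_i*| ), the error measure plotted in
    Fig. 10.10; x_hist has shape (steps, 5) and x_star has shape (5,)."""
    return np.log10(np.abs(x_hist - x_star).sum(axis=1))
```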
10.6 Conclusion

This chapter studied the EDP over time-varying balanced communication networks with time-varying uncoordinated step-sizes. We put forward an event-triggered primal-dual distributed optimization algorithm to deal with the EDP. Theoretical analysis and simulation results indicated that the proposed algorithm converges linearly to the optimal solution when the communication networks are uniformly jointly strongly connected, and an explicit convergence rate was obtained by utilizing the small gain theorem. Compared with the traditional time-triggered scheme, the algorithm proposed in this chapter effectively decreases the communication burden on the network.
References

1. E. Kötter, L. Schneider, F. Sehnke, K. Ohnmeiss, The future electric power system: impact of Power-to-Gas by interacting with other renewable energy components. J. Energy Storage 5, 113–119 (2016)
2. Y. Elsheakh, S. Zou, Z. Ma, B. Zhang, Decentralised gradient projection method for economic dispatch problem with valve point effect. IET Gener. Transm. Distrib. 12(16), 3844–3851 (2018)
3. A. Bakirtzis, V. Petridis, S. Kazarlis, Genetic algorithm solution to the economic dispatch problem. IEE Proc. Gener. Transm. Distrib. 141(4), 377–382 (1994)
4. D. Babazadeh, D.V. Hertemb, L. Nordström, Study of centralized and distributed coordination of power injection in multi-TSO HVDC grid with large off-shore wind integration. Electr. Power Syst. Res. 136, 281–288 (2016)
5. I.A. Farhat, M.E. El-Hawary, Dynamic adaptive bacterial foraging algorithm for optimum economic dispatch with valve-point effects and wind power. IET Gener. Transm. Distrib. 4(9), 989–999 (2010)
6. J.A.P. Lopes, C.L. Moreira, A.G. Madureira, Defining control strategies for microgrids islanded operation. IEEE Trans. Power Syst. 21(2), 916–924 (2006)
7. B.T. Polyak, Random algorithms for solving convex inequalities. Stud. Comput. Math. 8, 409–422 (2001)
8. B.L. Gorissen, İ. Yanıkoğlu, D. den Hertog, A practical guide to robust optimization. Omega 53(4), 124–137 (2015)
9. S. Yang, S. Tan, J. Xu, Consensus based approach for economic dispatch problem in a smart grid. IEEE Trans. Power Syst. 28(4), 4416–4426 (2013)
10. Z. Zhang, M. Chow, Convergence analysis of the incremental cost consensus algorithm under different communication network topologies in a smart grid. IEEE Trans. Power Syst. 27(4), 1761–1768 (2012)
11. A.D. Dominguez-Garcia, S.T. Cady, C.N. Hadjicostis, Decentralized optimal dispatch of distributed energy resources, in 51st IEEE Conference on Decision and Control (2012), https://doi.org/10.1109/CDC.2012.6426665
12. G. Binetti, A. Davoudi, F.L. Lewis, D. Naso, B. Turchiano, Distributed consensus-based economic dispatch with transmission losses. IEEE Trans. Power Syst. 29(4), 1711–1720 (2014)
13. G. Binetti, A. Davoudi, D. Naso, B. Turchiano, F.L. Lewis, A distributed auction-based algorithm for the nonconvex economic dispatch problem. IEEE Trans. Industr. Inf. 10(2), 1124–1132 (2014)
14. H. Li, X. Liao, T. Huang, W. Zhu, Y. Liu, Second-order global consensus in multiagent networks with random directional link failure. IEEE Trans. Neural Netw. Learn. Syst. 26(3), 565–575 (2015)
15. H. Li, G. Chen, T. Huang, Z. Dong, High performance consensus control in networked systems with limited bandwidth communication and time-varying directed topologies. IEEE Trans. Neural Netw. Learn. Syst. 28(5), 1043–1054 (2017)
16. A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control 54(1), 48–61 (2009)
17. A. Nedic, Asynchronous broadcast-based convex optimization over a network. IEEE Trans. Autom. Control 56(6), 1337–1351 (2011)
18. S. Sundhar Ram, A. Nedic, V. Veeravalli, Distributed stochastic subgradient projection algorithms for convex optimization. J. Optim. Theory Appl. 147(3), 516–545 (2010)
19. D. Yuan, Gossip-based gradient-free method for multi-agent optimization: constant step size analysis, in 33rd Chinese Control Conference (2014), https://doi.org/10.1109/ChiCC.2014.6896825
20. H. Li, X. Liao, G. Chen, D. Hill, Z. Dong, T. Huang, Event-triggered asynchronous intermittent communication strategy for synchronization in complex dynamical networks. Neural Netw. 66, 1–10 (2015)
21. H. Li, X. Liao, T. Huang, W. Zhu, Event-triggering sampling based leader-following consensus in second-order multi-agent systems. IEEE Trans. Autom. Control 60(7), 1998–2003 (2015)
22. G. Mateos, J.A. Bazerque, G.B. Giannakis, Distributed sparse linear regression. IEEE Trans. Signal Process. 58(10), 5262–5276 (2010)
23. W. Shi, Q. Ling, K. Yuan, G. Wu, W. Yin, On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Trans. Signal Process. 62(7), 1750–1761 (2014)
24. E. Wei, A. Ozdaglar, On the O(1/k) convergence of asynchronous distributed alternating direction method of multipliers, in Proceedings of the 2013 IEEE Global Conference on Signal and Information Processing, https://doi.org/10.1109/GlobalSIP.2013.6736937
25. P. Liu, H. Li, X. Dai, Q. Han, Distributed primal-dual optimisation method with uncoordinated time-varying step-sizes. Int. J. Syst. Sci. 49(6), 1256–1272 (2018)
26. H. Li, Q. Lü, T. Huang, Convergence analysis of a distributed optimization algorithm with a general unbalanced directed communication network. IEEE Trans. Netw. Sci. Eng. 6(3), 237–248 (2019)
27. H. Li, Q. Lü, X. Liao, T. Huang, Accelerated convergence algorithm for distributed constrained optimization under time-varying general directed graphs. IEEE Trans. Syst. Man Cybern. Syst. https://doi.org/10.1109/TSMC.2018.2823901
28. E. Ghadimi, A. Teixeira, I. Shames, M. Johansson, Optimal parameter selection for the alternating direction method of multipliers (ADMM): quadratic problems. IEEE Trans. Autom. Control 60(3), 644–658 (2015)
29. J. Guo, G. Hug, O. Tonguz, Asynchronous ADMM for distributed non-convex optimization in power systems, https://arxiv.org/abs/1710.08938
30. A. Cherukuri, J. Cortes, Initialization-free distributed coordination for economic dispatch under varying loads and generator commitment. Automatica 74(3), 183–193 (2016)
31. S. Mhanna, G. Verbič, A.C. Chapman, Accelerated methods for the SOCP-relaxed component-based distributed optimal power flow, in Proceedings of the 2018 Power Systems Computation Conference (PSCC), https://doi.org/10.23919/PSCC.2018.8442892
32. J. Guo, G. Hug, O.K. Tonguz, A case for nonconvex distributed optimization in large-scale power systems. IEEE Trans. Power Syst. 32(5), 3842–3851 (2017)
33. A. Engelmann, Y. Jiang, T. Mühlpfordt, B. Houska, T. Faulwasser, Toward distributed OPF using ALADIN. IEEE Trans. Power Syst. 34(1), 584–594 (2018)
34. T. Erseghe, Distributed optimal power flow using ADMM. IEEE Trans. Power Syst. 29(5), 2370–2380 (2014)
35. W. Lu, M. Liu, S. Lin, L. Li, Fully decentralized optimal power flow of multi-area interconnected power systems based on distributed interior point method. IEEE Trans. Power Syst. 33(1), 901–910 (2018)
36. P. Yi, Y. Hong, F. Liu, Initialization-free distributed algorithms for optimal resource allocation with feasibility constraints and application to economic dispatch of power systems. Automatica 74, 259–269 (2016)
37. F. Safdarian, O. Ciftci, A. Kargarian, A time decomposition and coordination strategy for power system multi-interval operation, in 2018 IEEE Power & Energy Society General Meeting, https://doi.org/10.1109/PESGM.2018.8585766
38. M. Chen, X. Xiao, Secondary voltage control in islanded microgrids using event-triggered control. IET Gener. Transm. Distrib. 12(8), 1872–1878 (2018)
39. C. Dou, B. Liu, J.M. Guerrero, Event-triggered hybrid control based on multi-agent system for microgrids. IET Gener. Transm. Distrib. 8(12), 1987–1997 (2014)
40. H. Li, S. Liu, Y.C. Soh, L. Xie, D. Xia, Achieving linear convergence for distributed optimization with Zeno-Like-Free event-triggered communication scheme, in 29th Chinese Control and Decision Conference (CCDC), https://doi.org/10.1109/CCDC.2017.7978291
41. Q. Lü, H. Li, X. Liao, H. Li, Geometrical convergence rate for distributed optimization with Zero-Like-Free event-triggered communication scheme and uncoordinated step-sizes, in 2017 Seventh International Conference on Information Science and Technology, https://doi.org/10.1109/ICIST.2017.7926783
42. S. Liu, L. Xie, D.E. Quevedo, Event-triggered quantized communication based distributed convex optimization. IEEE Trans. Control Netw. Syst. 5(1), 167–178 (2018)
43. S. Sahoo, S. Mishra, An adaptive event-triggered communication based distributed secondary control for DC microgrids. IEEE Trans. Smart Grid 9(6), 6674–6683 (2018)
44. H. Li, G. Chen, T. Huang, Z. Dong, W. Zhu, L. Gao, Event-triggered distributed average consensus over directed digital networks with limited communication bandwidth. IEEE Trans. Cybern. 46(12), 3098–3110 (2016)
45. T.T. Doan, A. Olshevsky, On the geometric convergence rate of distributed economic dispatch/demand response in power systems, arXiv:1609.06660
46. A. Nedic, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)
47. O. Devolder, F. Glineur, Y. Nesterov, First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146(1/2), 37–75 (2014)
48. IEEE 118 bus system [online]. http://motor.ece.iit.edu/data/JEA-S_IEEE118.doc
49. D. Zhang, S. Li, Optimal dispatch of competitive power markets by using PowerWorld simulator. Int. J. Emerg. Electr. Power Syst. 14(6), 535–547 (2013)
50. H. Zhang, H. Li, Y. Zhu, Z. Wang, D. Xia, A distributed stochastic gradient algorithm for economic dispatch over directed network with communication delays. Int. J. Electr. Power Energy Syst. 110, 759–771 (2019)