Synthesis Lectures on Learning, Networks, and Algorithms
Gauri Joshi
Optimization Algorithms for Distributed Machine Learning
Synthesis Lectures on Learning, Networks, and Algorithms Series Editor Lei Ying, ECE, University of Michigan–Ann Arbor, Ann Arbor, USA
The series publishes short books on the design, analysis, and management of complex networked systems using tools from control, communications, learning, optimization, and stochastic analysis. Each Lecture is a self-contained presentation of one topic by a leading expert. The topics include learning, networks, and algorithms, and cover a broad spectrum of applications to networked systems including communication networks, data-center networks, social, and transportation networks.
Gauri Joshi Carnegie Mellon University Pittsburgh, PA, USA
ISSN 2690-4306 ISSN 2690-4314 (electronic) Synthesis Lectures on Learning, Networks, and Algorithms ISBN 978-3-031-19066-7 ISBN 978-3-031-19067-4 (eBook) https://doi.org/10.1007/978-3-031-19067-4 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To my students and collaborators for their dedicated work and insightful discussions without which this book would not have been possible To my family for their unconditional support and encouragement
Preface
Stochastic gradient descent is the backbone of supervised machine learning training today. Classical SGD was designed to be run on a single computing node, and its error convergence with respect to the number of iterations has been extensively analyzed and improved in optimization and learning theory literature. However, due to the massive training datasets and models used today, running SGD at a single node can be prohibitively slow. This calls for distributed implementations of SGD, where gradient computation and aggregation are split across multiple worker nodes. Although parallelism boosts the amount of data processed per iteration, it exposes SGD to unpredictable node slowdown and communication delays stemming from variability in the computing infrastructure. Thus, there is a critical need to make distributed SGD fast, yet robust to system variability. In this book, we will discuss state-of-the-art algorithms in large-scale machine learning that improve the scalability of distributed SGD via techniques such as asynchronous aggregation, local updates, quantization and decentralized consensus. These methods reduce the communication cost in several different ways—asynchronous aggregation allows overlap between communication and local computation, local updates reduce the communication frequency thus amortizing the communication delay across several iterations, quantization and sparsification methods reduce the per-iteration communication time, and decentralized consensus offers spatial communication reduction by allowing different nodes in a network topology to train models and average them with neighbors in parallel. For each of the distributed SGD algorithms presented here, the book also provides an analysis of its convergence. However, unlike traditional optimization literature, we do not only focus on the error versus iterations convergence, or the iteration complexity. In distributed implementations, it is important to study the error versus wallclock time convergence because the wallclock time taken to complete each iteration is impacted by the synchronization and communication protocol. We model computation and communication delays as random variables and determine the expected wallclock runtime per iteration of the various distributed SGD algorithms presented in this book. By pairing this runtime analysis with the error convergence analysis, one can get a true comparison of the convergence speed of different algorithms. The book advocates a system-aware philosophy,
which is cognizant of computation, synchronization and communication delays, toward the design and analysis of distributed machine learning algorithms. This book would not have been possible without the wonderful research done by my students and collaborators. I thank them for helping me learn the material presented in this book. Our research was generously supported by several funding agencies including the National Science Foundation and research awards from IBM, Google and Meta. I was also inspired by the enthusiasm of the students who took my class on large-scale machine learning infrastructure over the past few years. Finally, I am immensely grateful to my family and friends for their constant support and encouragement. Pittsburgh, PA, USA August 2022
Gauri Joshi
Contents
1 Distributed Optimization in Machine Learning
   1.1 SGD in Supervised Machine Learning
      1.1.1 Training Data and Hypothesis
      1.1.2 Empirical Risk Minimization
      1.1.3 Gradient Descent
      1.1.4 Stochastic Gradient Descent
      1.1.5 Mini-batch SGD
      1.1.6 Linear Regression
      1.1.7 Logistic Regression
      1.1.8 Neural Networks
   1.2 Distributed Stochastic Gradient Descent
      1.2.1 The Parameter Server Framework
      1.2.2 The System-Aware Design Philosophy
   1.3 Scalable Distributed SGD Algorithms
      1.3.1 Straggler-Resilient and Asynchronous SGD
      1.3.2 Communication-Efficient Distributed SGD
      1.3.3 Decentralized SGD
   References

2 Calculus, Probability and Order Statistics Review
   2.1 Calculus and Linear Algebra
      2.1.1 Norms and Inner Products
      2.1.2 Lipschitz Continuity and Smoothness
      2.1.3 Strong Convexity
   2.2 Probability Review
      2.2.1 Random Variable
      2.2.2 Expectation and Variance
      2.2.3 Some Canonical Random Variables
      2.2.4 Bayes Rule and Conditional Probability
   2.3 Order Statistics
      2.3.1 Order Statistics of the Exponential Distribution
      2.3.2 Order Statistics of the Uniform Distribution
      2.3.3 Asymptotic Distribution of Quantiles

3 Convergence of SGD and Variance-Reduced Variants
   3.1 Gradient Descent (GD) Convergence
      3.1.1 Effect of Learning Rate and Other Parameters
      3.1.2 Iteration Complexity
   3.2 Convergence Analysis of Mini-batch SGD
      3.2.1 Effect of Learning Rate and Mini-batch Size
      3.2.2 Iteration Complexity
      3.2.3 Non-convex Objectives
   3.3 Variance-Reduced SGD Variants
      3.3.1 Dynamic Mini-batch Size Schedule
      3.3.2 Stochastic Average Gradient (SAG)
      3.3.3 Stochastic Variance Reduced Gradient (SVRG)
   References

4 Synchronous SGD and Straggler-Resilient Variants
   4.1 Parameter Server Framework
   4.2 Distributed Synchronous SGD Algorithm
   4.3 Convergence Analysis
      4.3.1 Iteration Complexity
   4.4 Runtime per Iteration
      4.4.1 Gradient Computation and Communication Time
      4.4.2 Expected Runtime per Iteration
      4.4.3 Error Versus Runtime Convergence
   4.5 Straggler-Resilient Variants
      4.5.1 K-Synchronous SGD
      4.5.2 K-Batch-Synchronous SGD
   References

5 Asynchronous SGD and Staleness-Reduced Variants
   5.1 The Asynchronous SGD Algorithm
      5.1.1 Comparison with Synchronous SGD
   5.2 Runtime Analysis
      5.2.1 Runtime Speed-Up Compared to Synchronous SGD
   5.3 Convergence Analysis
      5.3.1 Implications of the Asynchronous SGD Convergence Bound
   5.4 Staleness-Reduced Variants of Asynchronous SGD
      5.4.1 K-Asynchronous SGD
      5.4.2 K-Batch-Asynchronous SGD
   5.5 Adaptive Methods to Improve the Error-Runtime Trade-Off
      5.5.1 Adaptive Synchronization
      5.5.2 Adaptive Learning Rate Schedule to Compensate Staleness
   5.6 HogWild and Lock-Free Parallelism
   References

6 Local-Update and Overlap SGD
   6.1 Local-Update SGD Algorithm
      6.1.1 Convergence Analysis
      6.1.2 Runtime Analysis
      6.1.3 Adaptive Communication
   6.2 Elastic and Overlap SGD
      6.2.1 Elastic Averaging SGD
      6.2.2 Overlap Local SGD
   References

7 Quantized and Sparsified Distributed SGD
   7.1 Quantized SGD
      7.1.1 Uniform Stochastic Quantization
      7.1.2 Convergence Analysis
      7.1.3 Runtime Analysis
      7.1.4 Adaptive Quantization
   7.2 Sparsified SGD
      7.2.1 Rand-k Sparsification
      7.2.2 Top-k Sparsification
      7.2.3 Rand-k Sparsified Distributed SGD
      7.2.4 Error Feedback in Sparsified SGD
   References

8 Decentralized SGD and Its Variants
   8.1 Network Topology and Graph Notation
      8.1.1 Adjacency Matrix
      8.1.2 Laplacian Matrix
      8.1.3 Mixing Matrix
   8.2 Decentralized SGD
      8.2.1 The Algorithm
      8.2.2 Variants of Decentralized SGD
   8.3 Error Convergence Analysis
      8.3.1 Assumptions
      8.3.2 Convergence Analysis of Decentralized SGD
      8.3.3 Convergence Analysis of Decentralized Local-Update SGD
   8.4 Runtime Analysis
   References

9 Beyond Distributed Training in the Cloud
   References
Acronyms and Symbols
D: Training dataset
N: Number of samples in the training dataset
x: Feature vector corresponding to a training example
y: Label corresponding to a training example
w: Parameter vector of the machine learning model
d: Dimension of the model parameter vector w
GD: Gradient descent
SGD: Stochastic gradient descent
b: Mini-batch size
η: Learning rate or step size
m: Number of worker nodes
L: Lipschitz smoothness parameter
c: Strong convexity parameter
σ²: Stochastic gradient variance bound
τ: Number of local updates at each worker
M: Mixing matrix
1 Distributed Optimization in Machine Learning
Machine learning is revolutionizing data-driven decision-making in a myriad of applications including search and recommendations, self-driving cars, robotics, and medical diagnosis. And stochastic gradient descent (SGD) is the most common algorithm used for training such machine learning models. Due to the massive size of training datasets in modern applications, SGD is typically implemented in a distributed fashion, where the task of computing gradients is split across multiple computing nodes. The goal of this book is to introduce you to the state-of-the-art distributed optimization algorithms used in machine learning. For each distributed SGD algorithm, we will take a deep dive into analyzing its error versus runtime performance trade-off. These analyses will equip you with insights on designing algorithms that are best suited to the communication and computation constraints of the systems infrastructure. We begin this chapter by introducing how SGD is employed in supervised machine learning. Next, we will present the motivation for distributing the implementation of SGD across multiple computing nodes. In designing distributed SGD algorithms, this book advocates a system-aware philosophy. Unlike classic optimization theory which focuses on the sample or iteration complexity of SGD algorithms, the system-aware philosophy is cognizant of communication and synchronization delays in the underlying computing infrastructure. In the rest of the book, we will deep dive into various scalable distributed SGD algorithms, which are summarized in the last section of this chapter.
1.1 SGD in Supervised Machine Learning

1.1.1 Training Data and Hypothesis
Supervised learning is a class of machine learning, where the goal is to learn a prediction function or a hypothesis h(x) : X → Y that maps an input feature vector x ∈ X to an output label y ∈ Y. For example, if our goal is to identify spam emails, then the feature vector x
contains the number of occurrences of suspicious words in an email, and the function h maps x to a binary label y ∈ {0, 1} such that y = 1 indicates that the email is spam. How do we learn the hypothesis h that will achieve the goal of correctly mapping an input feature to an output label? We use a training dataset, consisting of known feature-label pairs (x1 , y1 ), (x2 , y2 ), . . . , (x N , y N ). The training dataset supervises the hypothesis h in learning to predict the label y of an unseen feature vector x .
1.1.2 Empirical Risk Minimization
In order to measure how good h(x) is at predicting the correct label, we define the sample loss ℓ(h(x_n), y_n) for the training sample (x_n, y_n), n ∈ {1, . . . , N}. For example, if our goal is to perform regression and match h(x) to the target value y, then a common choice is the square loss function ℓ(h(x), y) = (h(x) − y)². The training phase of machine learning seeks to minimize the average loss over the entire training dataset, which is also referred to as the empirical risk objective function [1, 2]:

$$F(h) = \frac{1}{N} \sum_{n=1}^{N} \ell(h(x_n), y_n) \tag{1.1}$$
The objective function defined in (1.1) is referred to as the empirical risk because the training dataset (x_1, y_1), (x_2, y_2), . . . , (x_N, y_N) is a finite sample of the joint probability distribution P(x, y) of the features and the labels. The goal of supervised training is to find the hypothesis h* from within a hypothesis class H that minimizes the empirical risk, that is,

$$h^* = \arg\min_{h \in \mathcal{H}} F(h). \tag{1.2}$$

1.1.3 Gradient Descent
For simple hypothesis classes H and sample losses ℓ(·) it may be possible to directly solve for h*. However, in general, we need to employ an iterative algorithm such as gradient descent (GD) to solve the optimization problem in (1.2). Consider a hypothesis class H for which each hypothesis h is specified by a d-dimensional parameter vector w, and the corresponding empirical risk objective F(h) can be written as F(w). Gradient descent starts with a randomly initialized set of parameters w_0. Then in each iteration, it computes the gradient ∇F(w_t) of the objective function at the current parameters and takes a small step in the direction of steepest descent as follows:
$$w_{t+1} = w_t - \eta \nabla F(w_t) \tag{1.3}$$
$$\;\;\;\;\;\;\;\;= w_t - \frac{\eta}{N} \sum_{n=1}^{N} \nabla \ell(h(x_n), y_n) \tag{1.4}$$
where η is the step size or the learning rate. Since each iteration of GD uses all the N samples in the training dataset, it is sometimes referred to as batch GD or full-batch GD.
1.1.4 Stochastic Gradient Descent
Observe that the computational complexity of each iteration of gradient descent is O(Nd), which can be prohibitively expensive for a large training dataset size N and for high-dimensional w (that is, large d). A computationally efficient alternative is to use stochastic gradient descent, first proposed in [3], where the gradient ∇F(w_t) is replaced by a noisy estimate ∇ℓ(h(x_n), y_n) computed using a single training example (x_n, y_n). In each iteration of stochastic gradient descent,

$$w_{t+1} = w_t - \eta \nabla \ell(h(x_n), y_n) \tag{1.5}$$
$$\;\;\;\;\;\;\;\;= w_t - \eta \nabla \ell(w_t, \xi_t) \tag{1.6}$$
The training example (x_n, y_n) is sampled uniformly at random with replacement from the training dataset. The symbol ξ_t is often used to denote the training example sampled in the (t+1)-th iteration when going from w_t to w_{t+1}, and thus the stochastic gradient is written as g(w_t) = ∇ℓ(w_t, ξ_t). Observe that the computational complexity of each iteration of SGD is O(d), which is independent of the size of the dataset. For large datasets it is significantly smaller than the O(Nd) complexity of gradient descent.
1.1.5 Mini-batch SGD
Since we are using a noisy estimate of the gradient, SGD typically takes more iterations to converge close to the optimal parameters w* that minimize the empirical risk objective function. For example, Fig. 1.1 shows the trajectories of GD and SGD for a convex loss function. To achieve a middle ground between these two extremes, most practical implementations use an algorithm called mini-batch SGD, where in each iteration we sample a batch of b training examples uniformly at random with replacement, and then use their averaged sample loss gradient g(w_t) = (1/b) Σ_{l=1}^{b} ∇ℓ(w_t, ξ_{t,l}) in place of ∇ℓ(w_t, ξ_t) in (1.6) to update the model parameters w:
Fig. 1.1 Illustration of the convergence of stochastic gradient descent (SGD) versus gradient descent (GD). SGD takes a noisy path towards the minimum because the gradients are noisy estimates of the full gradient
$$w_{t+1} = w_t - \frac{\eta}{b} \sum_{l=1}^{b} \nabla \ell(w_t, \xi_{t,l}) \tag{1.7}$$
Observe that mini-batch SGD reduces to GD if we set b = N and it reduces to SGD if we set b = 1. For a suitable intermediate value of b, mini-batch SGD is more computationally efficient than GD, but significantly less noisy than SGD. In Chap. 3 we will formally study the convergence of gradient descent and stochastic gradient descent to quantify the number of iterations T required until the distance between the expected value of w_T and w* falls below a target error. In Chap. 2 we review concepts from probability, linear algebra and calculus that are necessary to understand the proofs presented in Chap. 3. Although convergence to the optimal w* is guaranteed only for convex objectives, mini-batch SGD has been shown to perform well even on non-convex loss surfaces due to its ability to escape saddle points and local minima [4–7]. As a result, it is the dominant algorithm used in machine learning. Several works such as [8–15] focus on accelerating the convergence of mini-batch SGD in terms of error versus the number of iterations. In Chap. 3 we will study some of these variants, such as variance-reduction techniques for SGD. In Sects. 1.1.6, 1.1.7 and 1.1.8 below we review three examples of SGD in machine learning, namely linear regression, logistic regression and neural networks. For a more detailed treatment, please refer to an introductory machine learning textbook [2].
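To make the mini-batch update (1.7) concrete, the following is a minimal NumPy sketch of mini-batch SGD on a synthetic least-squares problem. The dataset, mini-batch size, learning rate, and number of iterations are arbitrary illustrative choices, not values prescribed in this book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = X w_true + noise (illustrative).
N, d = 1000, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=N)

def minibatch_gradient(w, idx):
    """Averaged square-loss gradient over the sampled mini-batch idx."""
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / len(idx)

w = np.zeros(d)              # initial model parameters w_0
eta, b, T = 0.05, 32, 500    # learning rate, mini-batch size, iterations

for t in range(T):
    idx = rng.integers(0, N, size=b)          # sample b examples with replacement
    w = w - eta * minibatch_gradient(w, idx)  # update rule (1.7)

print("distance to w*:", np.linalg.norm(w - w_true))
```

Setting b = N in this sketch recovers full-batch GD, while b = 1 recovers plain SGD, mirroring the discussion above.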
1.1.6 Linear Regression
Linear regression is the simplest form of machine learning, where the hypothesis class H is restricted to linear mappings from the features x = [x_1, . . . , x_d] ∈ ℝ^d to the label (also referred to as the target) y ∈ ℝ. Thus, a hypothesis h is given by:
$$h(x) = w_0 + w_1 x_1 + \cdots + w_d x_d \tag{1.8}$$
$$\;\;\;\;\;\;\;= [w_0, w_1, \ldots, w_d] \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} \tag{1.9}$$
$$\;\;\;\;\;\;\;= w^\top \tilde{x} \tag{1.10}$$
where w = [w_0, w_1, . . . , w_d] is the model parameter vector and w_0 is often referred to as the bias. The vector x̃ = [1, x^⊤]^⊤ is the augmented feature vector, which includes a 1 corresponding to the bias parameter w_0. The performance of the linear hypothesis function for a data sample is measured in terms of the square sample loss function:

$$\ell(h(x_n), y_n) = (y_n - h(x_n))^2. \tag{1.11}$$
For a training dataset (x_1, y_1), (x_2, y_2), . . . , (x_N, y_N), the empirical risk objective function becomes the least mean squares function given by:

$$F(w) = \frac{1}{N} \sum_{n=1}^{N} (y_n - w^\top \tilde{x}_n)^2 \tag{1.12}$$
$$\;\;\;\;\;\;\;\;= \frac{1}{N} \| y - X w \|^2 \tag{1.13}$$

where
$$X = \begin{bmatrix} \tilde{x}_1^\top \\ \tilde{x}_2^\top \\ \vdots \\ \tilde{x}_N^\top \end{bmatrix} \quad \text{and} \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \tag{1.14}$$
which are referred to as the design matrix and the target vector respectively. Due to the simple form of the empirical risk objective function in (1.13), we can directly get a closed-form expression for the model parameters w* that minimize F(w):

$$w^* = \arg\min_{w \in \mathbb{R}^{d+1}} \frac{1}{N} \| y - X w \|^2 = (X^\top X)^{-1} X^\top y, \tag{1.15}$$
referred to as the least mean squares solution. For a training dataset with a large number of training examples N and a high-dimensional model (large d), evaluating the least mean squares solution w* can be computationally expensive. The dominant terms in the computational complexity are the O(Nd²) operations to compute the matrix product X^⊤X, and the O(d³) operations to invert X^⊤X. Gradient descent and its variants described in (1.4), (1.6) and (1.7) above are computationally efficient alternatives to find the least squares solution w*.
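As a quick illustration of the two routes to w*, the following sketch compares the closed-form least mean squares solution (1.15) with a few hundred full-batch GD steps of the form (1.4) on a small synthetic dataset; the data and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 500, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # augmented features
y = X @ np.array([0.5, 1.0, -2.0, 3.0]) + 0.1 * rng.normal(size=N)

# Closed-form least mean squares solution, Eq. (1.15).
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Full-batch gradient descent on F(w) = ||y - Xw||^2 / N, as in Eq. (1.4).
w, eta = np.zeros(d + 1), 0.1
for t in range(300):
    w = w - eta * 2 * X.T @ (X @ w - y) / N

print(np.allclose(w, w_star, atol=1e-3))   # GD approaches the closed-form solution
```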
1.1.7 Logistic Regression
When the labels y take categorical values rather than real values, for example, binary values 0 or 1, the squared loss function (y − h(x))² defined in (1.11) is not suitable to measure the error in classifying feature vectors x into the categories 0 or 1. And instead of a linear hypothesis h(x) = w^⊤x, we need a non-linear function that swiftly transitions from 0 to 1. Logistic regression uses the sigmoid function σ(w^⊤x) = 1/(1 + e^{−w^⊤x}) to denote the probability that the predicted label y is 1. If w^⊤x < 0, then the predicted label is assigned as 0, and if w^⊤x > 0 the predicted label is set to 1. In order to train the parameters w that determine the decision boundary, we minimize the following cross entropy loss function for training sample (x_n, y_n):
$$\ell(\sigma(w^\top x_n), y_n) = -\left[ y_n \log \sigma(w^\top x_n) + (1 - y_n) \log(1 - \sigma(w^\top x_n)) \right] \tag{1.16}$$

It represents the negative log of the likelihood Pr(y_n | x_n; w) = σ(w^⊤x_n)^{y_n} (1 − σ(w^⊤x_n))^{1−y_n} of the sample. For a training dataset (x_1, y_1), (x_2, y_2), . . . , (x_N, y_N), the empirical risk objective function is:

$$F(w) = -\frac{1}{N} \sum_{n=1}^{N} \left[ y_n \log \sigma(w^\top x_n) + (1 - y_n) \log(1 - \sigma(w^\top x_n)) \right] \tag{1.17}$$
Unlike the least-squares solution for linear regression, we cannot get a closed-form expression for the optimal parameter vector w* that minimizes F(w). Thus, we have to resort to using iterative methods such as gradient descent to train the parameters. The (batch) gradient descent update rule for logistic regression is given by:

$$w_{t+1} = w_t - \frac{\eta}{N} \sum_{n=1}^{N} \left( \sigma(w_t^\top x_n) - y_n \right) x_n \tag{1.18}$$

which uses the property that the derivative of the sigmoid is σ′(a) = σ(a)(1 − σ(a)) for any a ∈ ℝ.
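The sketch below runs the batch update (1.18) on synthetic binary-labelled data; the data-generating process, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 400, 2
X = rng.normal(size=(N, d))
w_true = np.array([2.0, -1.0])
y = (rng.uniform(size=N) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

w, eta = np.zeros(d), 0.5
for t in range(200):
    # One batch gradient descent step of Eq. (1.18).
    w = w - eta * X.T @ (sigmoid(X @ w) - y) / N

preds = (sigmoid(X @ w) > 0.5).astype(float)
print("training accuracy:", (preds == y).mean())
```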
1.1.8 Neural Networks
Logistic regression enables us to learn linear decision boundaries, given by w^⊤x = 0. Neural networks are a flexible and efficient way to learn more complex hypothesis functions h(x) mapping a feature vector x to labels y. Neural networks consist of computation units called
Fig. 1.2 Neural network with one hidden layer
neurons, which are organized into layers, as illustrated in Fig. 1.2. Each neuron applies a non-linear activation function g(·) to a linear combination of its inputs. For example, the output y_j of the hidden layer neuron shown in Fig. 1.2 is y_j = g(Σ_i w_{ij} x_i). The objective function F(w) measures the error between the output z_k of the last layer and the target or label y_k. Stochastic gradient descent is the dominant algorithm used to minimize this objective function and tune the weights of each of the connections of the neural network. Due to the layer-wise structure of the network, the gradients of the loss F(w) with respect to each weight w_{ij} can be computed in a recursive fashion. This recursive algorithm to compute the gradients in an efficient manner is called the backpropagation algorithm, first introduced in [16]. Since the objective function of neural networks is highly non-convex, SGD is not guaranteed to converge to the global minimum. However, in practice, convergence to a local minimum is often sufficient to achieve high training and test accuracy. Since the focus of this book is on the convergence of distributed SGD algorithms, we will abstract the exact form of the objective function F(w). It could be the linear regression loss, logistic regression loss, a neural network loss function, or the loss function of any other machine learning model.
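As a concrete (and deliberately tiny) illustration of the forward pass described above, the sketch below computes the hidden-layer and output activations of a one-hidden-layer network like the one in Fig. 1.2, using randomly initialized weights and a sigmoid activation; the layer sizes and activation choice are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def g(a):
    """Non-linear activation applied by each neuron (sigmoid here)."""
    return 1 / (1 + np.exp(-a))

d_in, d_hidden, d_out = 4, 3, 2
W1 = rng.normal(size=(d_hidden, d_in))    # input-to-hidden weights w_ij
W2 = rng.normal(size=(d_out, d_hidden))   # hidden-to-output weights

x = rng.normal(size=d_in)    # one input feature vector
y_hidden = g(W1 @ x)         # y_j = g(sum_i w_ij x_i) for each hidden neuron
z = g(W2 @ y_hidden)         # outputs z_k of the last layer
print(z)
```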
1.2 Distributed Stochastic Gradient Descent
Classical SGD was designed to be run on a single computing node, and its error convergence with respect to the number of iterations has been extensively analyzed and improved in the optimization and learning theory literature. However, in modern machine learning applications, the training datasets and deep neural network architectures are massive. For example, the widely used ImageNet dataset [17], used to train image classification models, contains N ≈ 1.3 million images across 1000 classes. And the commonly used neural network architecture called ResNet-50 [18] for image classification has d = 23 million trainable parameters. Since SGD is an inherently sequential algorithm, for such large datasets and models, running SGD at a single node can be prohibitively slow. It may take days or even weeks to train the model parameters w. In order to cut down the training time, it is imperative to use distributed implementations of SGD, where gradient computation and aggregation are parallelized across multiple worker nodes.
1.2.1 The Parameter Server Framework
In Chap. 4, we introduce the parameter server framework that is used to run distributed SGD in most state-of-the-art industrial implementations. It consists of a central parameter server and m worker nodes, as shown in Fig. 5.1. The parameter server stores and updates the model parameters w. The training dataset is shuffled and split equally into partitions D_1, …, D_m, which are stored at the m workers. The most common distributed SGD algorithm in the parameter server framework is called synchronous SGD. In the (t+1)-th iteration of synchronous SGD, each of the m worker nodes reads the current version of w_t from the parameter server and computes a gradient g(w_t, ξ_i) using a mini-batch (denoted by ξ_i) of b samples drawn from its local dataset partition D_i. The parameter server then collects the gradients and updates the model w as per the following update rule:

$$w_{t+1} = w_t - \frac{\eta}{m} \sum_{i=1}^{m} g(w_t, \xi_i) \tag{1.19}$$
$$\;\;\;\;\;\;\;\;= w_t - \frac{\eta}{mb} \sum_{i=1}^{m} \sum_{l=1}^{b} \nabla \ell(w_t, \xi_{t,l}^{(i)}) \tag{1.20}$$
The synchronous SGD update rule defined above is, in fact, equivalent to mini-batch SGD with a mini-batch size of bm instead of b. By utilizing m worker nodes that compute gradients in parallel, we are able to process m times more data per iteration.
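To illustrate the aggregation rule (1.19), here is a minimal single-process simulation of a parameter server with m workers, each computing a mini-batch gradient on its own data partition D_i; the synthetic data, loss function, and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
m, b, d, eta = 4, 16, 5, 0.05

# Each worker holds a partition D_i of a synthetic least-squares dataset.
w_true = rng.normal(size=d)
partitions = []
for i in range(m):
    Xi = rng.normal(size=(250, d))
    yi = Xi @ w_true + 0.1 * rng.normal(size=250)
    partitions.append((Xi, yi))

def worker_gradient(w, Xi, yi):
    """Mini-batch gradient g(w_t, xi_i) computed at one worker."""
    idx = rng.integers(0, len(yi), size=b)
    Xb, yb = Xi[idx], yi[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / b

w = np.zeros(d)
for t in range(300):
    grads = [worker_gradient(w, Xi, yi) for Xi, yi in partitions]
    w = w - eta * np.mean(grads, axis=0)   # parameter server update, Eq. (1.19)

print("distance to w*:", np.linalg.norm(w - w_true))
```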
1.2.2 The System-Aware Design Philosophy
The main benefit of using more workers is that it reduces the noise in the averaged gradient used for updating w in every iteration and thus improves the error versus iterations convergence. The number of iterations required to achieve a target error, also referred to as the iteration complexity, has been extensively studied in optimization theory, and many SGD variants have been proposed to accelerate the convergence. However, when SGD is run in a distributed setting, it is not sufficient to focus on the error versus iterations convergence because the wall-clock time spent per iteration depends on the synchronization and communication delays stemming from the underlying computing infrastructure.
In this book, we emphasize a system-aware design philosophy that considers the true convergence speed of distributed SGD with respect to the wallclock runtime. It is a product of two factors: (1) the error in the trained model versus the number of iterations, and (2) the number of iterations completed per second. Traditional single-node SGD analysis focuses on optimizing the first factor, because the second factor is generally a constant when SGD is run on a single dedicated server. In distributed SGD, which is often run on shared cloud infrastructure, the second factor depends on several aspects such as the gradient computation and communication delays at the worker nodes, and the protocol (synchronous, asynchronous or periodic) used to aggregate their gradients. Hence, in order to achieve the fastest convergence speed we need: (1) optimization techniques (e.g. variable learning rate) to maximize the error-convergence rate with respect to iterations, and (2) scheduling techniques (e.g. straggler mitigation) to maximize the number of iterations completed per second. These directions are inter-dependent and need to be explored together rather than in isolation. While many works have advanced the first direction, the second is less explored from a theoretical point of view, and the juxtaposition of both is an unexplored problem. In Chap. 4, we adopt this philosophy to analyze the error versus runtime convergence of synchronous SGD. We quantify how the number of workers m affects the convergence speed by combining (1) the error-versus-iterations convergence analysis of mini-batch SGD derived in Chap. 3 and (2) the analysis of how m and the gradient computation delay distribution at each worker affect the expected wallclock runtime per iteration.
1.3 Scalable Distributed SGD Algorithms
Synchronous SGD, the simplest distributed SGD algorithm, does not scale well to a large number of worker nodes due to synchronization and communication delays and the bandwidth limitations at the central parameter server. In this book, we will study several scalable variants of distributed SGD that are robust to such inherent variability and constraints of the computing infrastructure.
1.3.1 Straggler-Resilient and Asynchronous SGD
In each iteration of synchronous SGD, the parameter server has to wait for all the m workers to send back gradients before it can update the model w and proceed to the next iteration. Worker nodes, being cloud servers, are susceptible to unpredictable slowdown or failure due to various reasons such as background workload, outages, etc. Such node straggling is the norm rather than the exception in data center computing [19]. As the number of workers m increases, even a small probability of straggling can cause an order-of-magnitude increase in the expected runtime per iteration. In Chap. 5 we propose straggler-resilient variants of synchronous SGD that are robust to straggling workers. These variants span different points on the trade-off between iteration complexity and runtime per iteration. Relaxing the synchronization barrier of synchronous SGD can reduce the runtime per iteration but it may increase the iteration complexity, that is, the number of iterations required to reach a target training loss. We present convergence analysis and runtime analysis to quantify this trade-off.
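A quick Monte Carlo sketch of this effect: if each worker's per-iteration time is exponentially distributed, fully synchronous SGD waits for the slowest of the m workers, whereas a K-synchronous variant that waits for only the fastest K avoids the straggler tail. The delay distribution and parameter values below are illustrative assumptions, not quantities derived in this book.

```python
import numpy as np

rng = np.random.default_rng(5)
m, K, trials = 16, 12, 100_000

# Per-worker computation times, assumed i.i.d. Exponential with rate 1.
delays = rng.exponential(scale=1.0, size=(trials, m))

t_sync = delays.max(axis=1)                   # wait for all m workers
t_ksync = np.sort(delays, axis=1)[:, K - 1]   # wait for the fastest K workers

print("E[runtime], fully synchronous:", t_sync.mean())   # grows like log(m)
print("E[runtime], K-synchronous    :", t_ksync.mean())
```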
1.3.2 Communication-Efficient Distributed SGD
Besides the time taken by workers to compute their mini-batch gradients, the runtime per iteration also includes the communication delay spent in sending the gradients to the parameter server and receiving the updated model parameters. When the model w is high-dimensional or when the communication link has high latency, this delay can dominate the runtime per iteration. This is especially true in emerging distributed machine learning frameworks such as federated learning, where the worker nodes are edge devices such as cell phones, which have unreliable and low-bandwidth wireless links to connect with the cloud, where the parameter server is present. Therefore there is a critical need to design communication-efficient distributed SGD algorithms, going beyond synchronous SGD which requires communication after every iteration. In this book we will study two orthogonal ways to reduce communication delay. The first approach is to reduce the communication frequency by using an algorithm called local-update SGD, where the workers perform multiple local SGD iterations instead of just computing gradients, and the resulting locally trained models are averaged by the parameter server. In Chap. 6 we will study the convergence and runtime properties of local-update SGD and its variant elastic-averaging SGD. Elastic-averaging SGD further improves the scalability of local-update SGD by facilitating the overlap of communication and computation. The second approach to make distributed SGD more communication-efficient is to reduce the number of bits transmitted from the workers to the server by quantizing or sparsifying the gradient or model parameters. In Chap. 7 we will study quantization and sparsification techniques. Since the compression is lossy, it can have an adverse effect on the error at convergence. We will study these convergence and runtime trade-offs in Chap. 7.
1.3.3 Decentralized SGD
As the number of worker nodes m grows, it may be prohibitively expensive to have a single central parameter server to aggregate the gradients or model updates and maintain the latest version of the model parameters w. Instead, training can be performed in a decentralized fashion where there is an arbitrary network topology connecting the workers, as shown in Fig. 1.3. Each worker makes local updates to the model parameters based on its local datasets and then averages the updates with its neighbors. Eventually, as long as the network topology is connected, the updates from a worker will reach all other workers. In Chap. 8
Fig. 1.3 Decentralized SGD topology consisting of 8 nodes
we will study decentralized SGD algorithms and their convergence analysis. The number of iterations required for the model to converge to its optimal value depends on the number of worker nodes, the topology connecting them and the inter-node communication delays. Decentralized SGD has many applications such as multi-agent networks of sensors or IoT devices and cross-silo federated learning.

Summary

Stochastic gradient descent (SGD) is at the core of state-of-the-art supervised learning, and due to the large datasets and models used today, it is imperative to implement it in a distributed manner. Thus, speeding up distributed SGD is arguably the single most impactful and transformative problem in the field of machine learning. This book takes a system-aware approach towards designing distributed SGD algorithms, which is cognizant of the synchronization and communication delays in the computing infrastructure. In this chapter, we outlined the various scalable distributed SGD algorithms that we will study in this book.
References

1. S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
2. S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. New York, NY, USA: Cambridge University Press, 2014.
3. H. Robbins and S. Monro, "A stochastic approximation method," The Annals of Mathematical Statistics, pp. 400–407, 1951.
4. C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, "Understanding deep learning requires rethinking generalization," 2017. [Online]. Available: https://arxiv.org/abs/1611.03530
5. B. Neyshabur, R. Tomioka, R. Salakhutdinov, and N. Srebro, "Geometry of optimization and implicit regularization in deep learning," CoRR, vol. abs/1705.03071, 2017. [Online]. Available: http://arxiv.org/abs/1705.03071
6. P. Chaudhari and S. Soatto, "Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks," CoRR, vol. abs/1710.11029, 2017. [Online]. Available: http://arxiv.org/abs/1710.11029
7. R. Shwartz-Ziv and N. Tishby, "Opening the black box of deep neural networks via information," CoRR, vol. abs/1703.00810, 2017. [Online]. Available: http://arxiv.org/abs/1703.00810
8. B. T. Polyak, "Some methods of speeding up the convergence of iteration methods," USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, 1964.
9. Y. Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k²)," Soviet Mathematics Doklady, vol. 27, no. 5, pp. 372–376, 1983.
10. N. L. Roux, M. Schmidt, and F. R. Bach, "A stochastic gradient method with an exponential convergence rate for finite training sets," in Advances in Neural Information Processing Systems, 2012, pp. 2663–2671.
11. R. Johnson and T. Zhang, "Accelerating stochastic gradient descent using predictive variance reduction," in Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2013, pp. 315–323. [Online]. Available: http://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf
12. Z. Allen-Zhu, "Katyusha: The first direct acceleration of stochastic gradient methods," in Proceedings of the Symposium on Theory of Computing (STOC), 2017, pp. 1200–1205.
13. L. Nguyen, J. Liu, K. Scheinberg, and M. Takáč, "SARAH: A novel method for machine learning problems using stochastic recursive gradient," arXiv preprint arXiv:1703.00102, 2017.
14. J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2011.
15. D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," International Conference on Learning Representations (ICLR), 2015.
16. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, 1986.
17. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
18. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
19. J. Dean and L. A. Barroso, "The tail at scale," Communications of the ACM, vol. 56, no. 2, pp. 74–80, 2013.
2 Calculus, Probability and Order Statistics Review
In this chapter, we will begin by reviewing some concepts from calculus and probability that are essential to understanding the error and runtime analyses of various distributed SGD algorithms that we will study in the rest of the book.
2.1 Calculus and Linear Algebra

2.1.1 Norms and Inner Products
In machine learning problems, we quantify the quality of a hypothesis h(x) by comparing it with the target y. However, h(x) and y can be vectors. For example, in binary classification problems, the two elements of h(x) correspond to the probability or confidence with which the hypothesis h predicts the label as 0 and 1 respectively. The y vector is constructed using one-hot encoding, where the i-th element is 1 if the true class is i, and otherwise it is zero. The difference h(x) − y between the predicted and the true labels is also a vector, and it cannot be directly used to quantify the error. Norms are functions that map a vector to a scalar value, and thus by taking a norm of h(x) − y we can concretely quantify the error in the prediction h(x). The norm of a vector is defined as follows.

Definition 2.1 (Vector norm) A vector norm is a function that maps a vector a ∈ ℝ^d to a scalar ‖a‖ such that it satisfies the following conditions:

1. Non-negativity: ‖a‖ ≥ 0, and ‖a‖ = 0 ⟺ a = 0
2. Scaling: ‖ca‖ = |c| ‖a‖ for any scalar c ∈ ℝ
3. Triangle inequality: ‖a + b‖ ≤ ‖a‖ + ‖b‖
Some commonly used vector norms are the ℓ₁ and ℓ₂ norms, $\|a\|_1 = \sum_{i=1}^{d} |a_i|$ and $\|a\|_2 = \sqrt{\sum_{i=1}^{d} a_i^2}$ respectively. Another concept from linear algebra that we will use in this book is that of inner products of two vectors. While norms offer a way to measure the length of a vector by mapping it to a scalar quantity, inner products associate a pair of vectors with a scalar quantity. Inner products allow us to measure the similarity of two vectors, and they are defined as follows.

Definition 2.2 (Inner product) The inner product of two vectors a₁, a₂ ∈ ℝ^d is denoted by ⟨a₁, a₂⟩ and it has to satisfy the following conditions:

1. ⟨c a₁, a₂⟩ = c ⟨a₁, a₂⟩ for any scalar c ∈ ℝ
2. ⟨a₁, a₂ + a₃⟩ = ⟨a₁, a₂⟩ + ⟨a₁, a₃⟩
3. ⟨a₁, a₂⟩ = ⟨a₂, a₁⟩
4. ⟨a, a⟩ > 0 if a ≠ 0

The dot product is a special case of inner products where we take the sum of the element-wise product of the two vectors, that is, $a_1 \cdot a_2 = \sum_{i=1}^{d} a_{1,i} a_{2,i}$. The Euclidean or ℓ₂ norm is also a special case of an inner product, where we take the dot product of a vector with itself, ‖a‖₂² = a · a.
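A quick numeric check of these definitions, using arbitrarily chosen vectors:

```python
import numpy as np

a = np.array([1.0, -2.0, 2.0])
b = np.array([3.0, 0.0, 4.0])

print(np.abs(a).sum(), np.linalg.norm(a))   # l1 norm = 5.0, l2 norm = 3.0
print(np.dot(a, b))                         # inner (dot) product = 11.0
print(np.dot(a, a))                         # ||a||_2^2 = 9.0
```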
2.1.2 Lipschitz Continuity and Smoothness
The next concept that we will repeatedly use in this book is that of Lipschitz continuity and smoothness. Lipschitz continuity is a stronger form of continuity that restricts how quickly a function F can change.

Definition 2.3 (Lipschitz continuity) A function F(x) : ℝ^d → ℝ is said to be K-Lipschitz continuous for some positive real constant K > 0 if

$$|F(x_1) - F(x_2)| \leq K \|x_1 - x_2\| \quad \text{for all } x_1, x_2 \in \mathbb{R}^d \tag{2.1}$$
For a 1-dimensional function, Lipschitz continuity means that if you draw two lines with slopes K and −K out of any point (x, F(x)), then the whole function lies between the two lines, as illustrated in Fig. 2.1. Instead of restricting how quickly a function F can change, if we restrict how quickly its gradient can change, then we get the concept of Lipschitz smoothness that is formally defined as follows.

Definition 2.4 (Lipschitz smoothness) A function F(x) : ℝ^d → ℝ is said to be L-Lipschitz smooth for some positive real constant L > 0 if and only if
Fig. 2.1 Illustration of Lipschitz continuity. If a scalar function F(x) is K-Lipschitz continuous, then for any point x, the function lies inside the region bounded by lines of slope K and −K that pass through the point (x, F(x))

Fig. 2.2 Illustration of Lipschitz smoothness. If a scalar function F(x) is L-Lipschitz smooth, then for any point x, the function lies inside the shaded region shown in the picture, as specified by (2.2)
$$\|\nabla F(x_1) - \nabla F(x_2)\| \leq L \|x_1 - x_2\| \quad \text{for all } x_1, x_2 \in \mathbb{R}^d \tag{2.2}$$
that is, the gradient ∇F(x) is L-Lipschitz continuous. The intuition behind Lipschitz smoothness is that the rate of change of the gradient of F is bounded by L. In most SGD convergence analyses, we assume that the objective function F is Lipschitz smooth to ensure that it does not change arbitrarily quickly. In particular, the SGD convergence analyses presented in this book will use the following inequality, which is a consequence of Lipschitz smoothness (Fig. 2.2).

Lemma 2.1 If a function is L-Lipschitz smooth, then for any x₁, x₂ ∈ ℝ^d it satisfies the following upper bound:
$$F(x_1) \leq F(x_2) + \nabla F(x_2)^\top (x_1 - x_2) + \frac{L}{2} \|x_1 - x_2\|_2^2 \tag{2.3}$$

Proof

$$F(x_1) = F(x_2) + \int_0^1 \frac{\partial F(x_2 + t(x_1 - x_2))}{\partial t} \, dt \tag{2.4}$$
$$\;\;\;\;\;\;\;\;\;= F(x_2) + \int_0^1 \nabla F(x_2 + t(x_1 - x_2))^\top (x_1 - x_2) \, dt \tag{2.5}$$
$$\;\;\;\;\;\;\;\;\;= F(x_2) + \nabla F(x_2)^\top (x_1 - x_2) + \int_0^1 \left[ \nabla F(x_2 + t(x_1 - x_2)) - \nabla F(x_2) \right]^\top (x_1 - x_2) \, dt \tag{2.6}$$
$$\;\;\;\;\;\;\;\;\;\leq F(x_2) + \nabla F(x_2)^\top (x_1 - x_2) + \frac{L}{2} \|x_1 - x_2\|_2^2 \tag{2.7}$$
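As a sanity check of Lemma 2.1, the short sketch below verifies the quadratic upper bound (2.3) numerically for F(x) = ‖x‖²/2, which is L-smooth with L = 1; the example function and the randomly drawn points are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)
L = 1.0

def F(x):
    return 0.5 * np.dot(x, x)   # F(x) = ||x||^2 / 2, with gradient x and L = 1

def grad_F(x):
    return x

for _ in range(5):
    x1, x2 = rng.normal(size=3), rng.normal(size=3)
    lhs = F(x1)
    rhs = F(x2) + grad_F(x2) @ (x1 - x2) + L / 2 * np.sum((x1 - x2) ** 2)
    print(lhs <= rhs + 1e-12)   # the bound (2.3) holds (with equality for this F)
```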
2.1.3 Strong Convexity
Gradient descent algorithms, which are ubiquitous in machine learning, are guaranteed to converge to the global optimum of the objective function F(x) only for convex functions. When analyzing the convergence of SGD algorithms, we will also use a property called strong convexity that helps us quantify the speed of convergence.

Definition 2.5 (Strong convexity) If a function is c-strongly convex, then for any x₁, x₂ ∈ ℝ^d it satisfies:

$$F(x_1) \geq F(x_2) + \nabla F(x_2)^\top (x_1 - x_2) + \frac{c}{2} \|x_1 - x_2\|_2^2 \tag{2.8}$$
Every convex function is 0-strongly convex, a special case of the class of strongly convex functions defined above. Figure 2.3 illustrates the difference between convexity and strong convexity for a one-dimensional function F(x). A convex function always lies above a tangent drawn at any point (x, F(x)). On the other hand, if F(x) is strongly convex, then it has to lie above a quadratic function drawn at that point. A consequence of strong convexity is the Polyak-Lojasiewicz (PL) inequality stated and proved below.

Lemma 2.2 (Polyak-Lojasiewicz (PL) inequality) If a function is c-strongly convex, then for any x ∈ ℝ^d it satisfies the following lower bound:
Fig. 2.3 Illustration of a convex function and a c-strongly convex function: (a) convex function, (b) strongly convex function
$$2c \left( F(x) - F(x^*) \right) \leq \|\nabla F(x)\|_2^2 \tag{2.9}$$
Proof The above result follows from Definition 2.5 by minimizing both sides of (2.8) with respect to x₁. The left-hand side is minimized by setting x₁ = x*. To minimize the right-hand side, we take the gradient with respect to x₁ and set it to zero:

$$\nabla_{x_1} \left[ F(x_2) + \nabla F(x_2)^\top (x_1 - x_2) + \frac{c}{2} \|x_1 - x_2\|_2^2 \right] = 0 \tag{2.10}$$
$$\nabla F(x_2) + c (x_1 - x_2) = 0 \tag{2.11}$$
$$x_1 = x_2 - \frac{1}{c} \nabla F(x_2) \tag{2.12}$$

Substituting this value of x₁ in the right-hand side of (2.8), we get:

$$F(x^*) \geq F(x_2) - \frac{1}{c} \|\nabla F(x_2)\|_2^2 + \frac{c}{2} \left\| -\frac{1}{c} \nabla F(x_2) \right\|_2^2 \tag{2.13}$$
The result follows from simplifying and rearranging the terms in the above equation.
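A quick numerical illustration of the PL inequality (2.9) for a c-strongly convex quadratic; the function, the value of c, and the test points are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
c = 2.0   # strong convexity parameter of F below

def F(x):
    return c / 2 * np.dot(x, x)   # minimized at x* = 0, with F(x*) = 0

def grad_F(x):
    return c * x

for _ in range(5):
    x = rng.normal(size=4)
    lhs = 2 * c * (F(x) - 0.0)        # 2c (F(x) - F(x*))
    rhs = np.sum(grad_F(x) ** 2)      # ||grad F(x)||_2^2
    print(lhs <= rhs + 1e-9)          # the PL inequality (2.9) holds
```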
2.2 Probability Review
Next, let us review some concepts from probability. We will use these in analyzing the runtime per iteration of the various distributed SGD algorithms covered in this book.
2.2.1 Random Variable
A probability space is represented by a tuple (Ω, F, P), where Ω is the sample space consisting of the set of possible outcomes, F is the event space consisting of all subsets of Ω, and P is a probability measure that maps events to probabilities. For example, for one roll of a fair die, the sample space is Ω = {1, 2, . . . , 6}, the event space F is all the 2^6 subsets of Ω, and the probability of an event A ∈ F is P(A) = |A|/6. The probability measure P must satisfy the following axioms for any events A and B:

1. 0 ≤ Pr(A) ≤ 1 for any event A
2. Pr(∅) = 0, where ∅ is the empty set
3. Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)

A real-valued random variable X is a function that maps Ω to the real line ℝ. If it takes a set of discrete values {x_1, x_2, . . . , x_R} ⊂ ℝ, then it is referred to as a discrete random variable, and if X takes a continuous set of values S ⊆ ℝ then it is referred to as a continuous random variable. Each random variable is associated with a cumulative distribution function (CDF) denoted by F_X(x). For each x ∈ ℝ, F_X(x) = Pr(X ≤ x), the probability that X is less than or equal to x. By definition F_X(−∞) = 0 and F_X(∞) = 1. A related function is the tail distribution function or the complementary cumulative distribution function (CCDF) denoted by F̄_X(x) = 1 − F_X(x) = Pr(X > x). To determine the probability that X takes a specific value x, we define its probability density function (PDF) f_X(x) = F′_X(x), the derivative of the cumulative distribution function F_X(x). For any set S ⊆ ℝ, we have $\Pr(X \in S) = \int_{x \in S} f_X(x) \, dx$, and $\int_{-\infty}^{\infty} f_X(x) \, dx = 1$. If X is discrete, then instead of the probability density function we use the probability mass function (PMF), where Pr(X = x_i) = p_i is the probability that X takes the value x_i and $\sum_{i=1}^{R} \Pr(X = x_i) = 1$.
2.2.2 Expectation and Variance
Instead of specifying the entire probability distribution of a random variable X, it is often convenient to characterize it in terms of its average value and the deviation around the average, which are captured by the expectation and variance respectively. The expected value or mean of a random variable is defined as

$$\mathbb{E}[X] = \int_{x \in \mathbb{R}} x f_X(x) \, dx \quad \text{for continuous } X \tag{2.14}$$
$$\mathbb{E}[X] = \sum_{x} x \Pr(X = x) \quad \text{for discrete } X \tag{2.15}$$
For a non-negative random variable X ≥ 0, it is often convenient to compute the expectation in terms of the tail distribution function F̄_X(x) = Pr(X > x):

$$\mathbb{E}[X] = \int_{0}^{\infty} \Pr(X > x) \, dx \quad \text{for non-negative } X \tag{2.16}$$
The deviation of a random variable around its expected value is captured by the variance Var[X], which is defined as:

$$\mathrm{Var}[X] = \int_{x \in \mathbb{R}} (x - \mathbb{E}[X])^2 f_X(x) \, dx \tag{2.17}$$
$$\;\;\;\;\;\;\;\;\;\;\;\;= \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \tag{2.18}$$

2.2.3 Some Canonical Random Variables
Here are some common random variables that we will use in the book:

1. Bernoulli B(p): The sample space is Ω = {0, 1}, representing two possible outcomes. A coin toss is the canonical example of the Bernoulli random variable. If X is Bernoulli with bias p, then

$$X = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{otherwise} \end{cases} \tag{2.19}$$

The mean and variance of the Bernoulli distribution are E[X] = p and Var[X] = p(1 − p) respectively.

2. Geometric Geom(p): The geometric random variable represents the first occurrence of success (or failure) in a sequence of independent Bernoulli trials. If the probability of success of each trial is p, then a geometric random variable X has the probability mass function

$$\Pr(X = k) = (1 - p)^{k-1} p \quad \text{for } k = 1, 2, \ldots \tag{2.20}$$

For the coin toss example, the number of tosses until the first heads occurs is geometrically distributed. The mean and variance of the geometric random variable are

$$\mathbb{E}[X] = \frac{1}{p} \quad \text{and} \quad \mathrm{Var}[X] = \frac{1-p}{p^2} \tag{2.21}$$
20
2 Calculus, Probability and Order Statistics Review
computer systems. We will use it heavily in this book when analyzing the runtime of distributed SGD algorithms. The probability density function (PDF) of the exponential random variable for rate λ > 0 is f X (x) = λe−λx for x ≥ 0
(2.22)
The mean and variance of the exponential random variable are E [X ] = λ1 and Var [X ] = 1 respectively. λ2 4. Gaussian N (μ, σ 2 ): The Gaussian random variable is used to model noise in any random phenomenon, and its probability density function (PDF) is given by: 2 1 − (x−μ) e 2σ 2 f X (x) = √ 2π σ 2
(2.23)
The mean and variance of the exponential random variable are E [X ] = μ and Var [X ] = 1 respectively. One of the reasons for its ubiquity is the Central Limit Theorem, which μ2 shows that the average of a large number of realization of a random variables with any distribution with finite mean and finite variance converges to the Gaussian distribution. 5. Pareto Pareto(xm , α): The Pareto distribution is a power-law distribution that is used to model many observable phenomenon such as cloud server delays, error or failure rates of hardware components, wealth distribution of a population, online file size, etc. Its probability density function (PDF) is given by: αx α m x ≥ xm α+1 (2.24) f X (x) = x 0 x < xm The parameters xm and α are referred to as scale and shape parameters respectively. Larger α implies a faster decay of the tail. The mean and variance of a Pareto random variable are given by: ∞ α≤1 (2.25) E [X ] = αx m α−1 α > 1 ⎧ ⎨∞ α≤2 Var [X ] = (2.26) 2 αx m ⎩ α>2 2 (α−1) (α−2)
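As a quick sanity check, the short Python sketch below (an illustrative addition assuming only NumPy; the parameter values are arbitrary choices, not taken from the text) draws a large number of samples from each of these canonical distributions and compares the empirical mean and variance against the formulas above.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
p, lam, mu, sigma, xm, alpha = 0.3, 2.0, 1.0, 0.5, 1.0, 3.0

samples = {
    "Bernoulli(p)": rng.binomial(1, p, n),            # mean p, variance p(1-p)
    "Geometric(p)": rng.geometric(p, n),              # mean 1/p, variance (1-p)/p^2
    "Exponential":  rng.exponential(1 / lam, n),      # mean 1/lam, variance 1/lam^2
    "Gaussian":     rng.normal(mu, sigma, n),         # mean mu, variance sigma^2
    "Pareto":       xm * (1 + rng.pareto(alpha, n)),  # mean alpha*xm/(alpha-1) for alpha > 1
}

for name, x in samples.items():
    print(f"{name:12s} empirical mean = {x.mean():.3f}, empirical variance = {x.var():.3f}")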
2.2.4 Bayes Rule and Conditional Probability
For two events A and B, the conditional probability of A given that event B has occurred is denoted as Pr(A|B). Bayes theorem connects the conditional probabilities Pr(A|B) and Pr(B|A) as follows:
Pr(A|B) = Pr(A ∩ B)/Pr(B) = Pr(B|A) Pr(A)/Pr(B)    (2.27)
A consequence of Bayes theorem is the concept of the residual time of a random variable, which we formally define as follows.

Definition 2.6 (Residual time of a random variable) For a non-negative random variable X ≥ 0 that denotes the time taken to complete a task, the residual time Y is the time remaining given that t units of time have elapsed. The random variable is Y = (X − t)|(X > t) and its tail distribution is:

F̄_Y(y) = F̄_X(t + y)/Pr(X > t)    (2.28)
Depending on how the tail distribution Pr(Y > y) of the residual time Y = (X − t)|(X > t) compares with the tail distribution Pr(X > x) of X itself, we can define two special classes of random variables called new-longer-than-used and new-shorter-than-used.

Definition 2.7 (New-longer-than-used and new-shorter-than-used) A random variable X is said to have a new-longer-than-used distribution if the following holds for all t, x ≥ 0:

Pr(X > x + t|X > t) ≤ Pr(X > x).    (2.29)

On the other hand, X is said to have a new-shorter-than-used distribution if the following holds for all t, x ≥ 0:

Pr(X > x + t|X > t) ≥ Pr(X > x).    (2.30)
To understand the intuition behind this notion, let a random variable X denote the computational time taken to perform a task. Suppose that the task has been running for t units of time but has not finished, and the scheduler needs to decide whether to keep the task running or abort it and launch a new copy. If X has a new-longer-than-used distribution, then a new copy is expected to take longer than waiting for the already running task to finish. Most of the continuous distributions we encounter, such as the normal, shifted-exponential, gamma, and beta distributions, are new-longer-than-used. On the other hand, the hyper-exponential distribution (a mixture of exponentials) is new-shorter-than-used. If the computation time X of a task is new-shorter-than-used, then launching a new copy is better than keeping an old copy of the task running. The exponential random variable is special: it is the only continuous random variable that is both new-longer-than-used and new-shorter-than-used. For the exponential distribution, (2.29) and (2.30) hold with equality. Thus, the exponential distribution
has the memoryless property, which means that the residual time Y = (X − t)|(X > t) is also an exponential random variable, as shown below:

F̄_Y(y) = e^{−λ(y+t)}/e^{−λt} = e^{−λy} for all y, t ≥ 0    (2.31)
This implies that a fresh realization of the exponential random variable X , and a residual version with time t elapsed are statistically identical. Among discrete random variables, the geometric random variable is the analogue of the exponential and it is the only discrete random variable to have the memoryless property.
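The memoryless property is easy to verify numerically. The following sketch (illustrative, assuming NumPy; the rate λ and elapsed time t are arbitrary choices) compares the residual time Y = (X − t)|(X > t) of an exponential with a fresh exponential sample.

import numpy as np

rng = np.random.default_rng(1)
lam, t, n = 1.5, 2.0, 2_000_000

x = rng.exponential(1 / lam, n)
residual = x[x > t] - t                      # residual time Y = (X - t) | (X > t)
fresh = rng.exponential(1 / lam, residual.size)

# Both means should be close to 1/lam, and the tail probabilities should match.
print("E[Y] =", residual.mean(), " E[X] =", fresh.mean())
print("Pr(Y > 1) =", (residual > 1).mean(), " Pr(X > 1) =", (fresh > 1).mean())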
2.3 Order Statistics
Consider n independent and identically distributed (i.i.d.) realizations X_1, . . . , X_n of a random variable X. The order statistics X_{1:n}, X_{2:n}, . . . , X_{n:n} denote the realizations sorted in increasing order, that is, X_{1:n} = min(X_1, . . . , X_n) and X_{n:n} = max(X_1, . . . , X_n). The order statistics X_{k:n} are themselves random variables, and we can express their cumulative distribution and probability density functions in terms of the distribution F_X(x) of X:

F_{X_{k:n}}(x) = Σ_{j=k}^{n} (n choose j) [F_X(x)]^j [1 − F_X(x)]^{n−j}    (2.32)
f_{X_{k:n}}(x) = (n!/((k − 1)!(n − k)!)) f_X(x) [F_X(x)]^{k−1} [1 − F_X(x)]^{n−k}    (2.33)
The two extreme special cases, the maximum order statistic X_{n:n} and the minimum order statistic X_{1:n}, have cumulative distribution functions:

F_{X_{n:n}}(x) = Pr(max(X_1, . . . , X_n) ≤ x) = (F_X(x))^n    (2.34)
F_{X_{1:n}}(x) = Pr(min(X_1, . . . , X_n) ≤ x) = 1 − (1 − F_X(x))^n    (2.35)

2.3.1 Order Statistics of the Exponential Distribution
Let us determine the distribution of the k-th order statistic X_{k:n} of the exponential random variable X ∼ Exp(λ). First observe that X_{1:n}, the minimum of n exponentials, is an exponential with rate nλ, as given by its tail distribution below:

Pr(X_{1:n} > x) = Pr(min(X_1, X_2, . . . , X_n) > x)    (2.36)
               = Π_{i=1}^{n} Pr(X_i > x)    (2.37)
               = e^{−nλx}    (2.38)
Fig. 2.4 Illustration of order statistics of exponential random variables
Due to the memoryless property of the exponential, after X_{1:n} time elapses, the residual time of the remaining n − 1 exponentials is also exponentially distributed. Therefore, X_{2:n} is X_{1:n} plus the minimum of n − 1 exponential random variables, as illustrated in Fig. 2.4. Continuing to use the memoryless property, we can show that the k-th order statistic is

X_{k:n} = Σ_{i=1}^{k} Z_i/(n − i + 1)    (2.39)
where the Z_i's are i.i.d. exponential random variables with rate λ. The expected value of X_{k:n} is given by

E[X_{k:n}] = Σ_{i=1}^{k} 1/(λ(n − i + 1))    (2.40)
           = (H_n − H_{n−k})/λ    (2.41)
where H_n = Σ_{i=1}^{n} 1/i is the n-th Harmonic number. For large n, H_n ≈ log n. Using the fact that the variance of a sum of independent random variables is the sum of their variances, the variance of X_{k:n} is given by:

Var[X_{k:n}] = Σ_{i=1}^{k} Var(Z_i/(n − i + 1))    (2.42)
             = Σ_{i=1}^{k} 1/(λ²(n − i + 1)²)    (2.43)
             = (H_n^{(2)} − H_{n−k}^{(2)})/λ²    (2.44)

where H_n^{(2)} = Σ_{i=1}^{n} 1/i² is the generalized Harmonic number. For the special case of the maximum order statistic X_{n:n}, the mean and variance are given by:
E[X_{n:n}] = H_n/λ ≈ (log n)/λ    (2.45)
Var[X_{n:n}] = H_n^{(2)}/λ² ≤ π²/(6λ²) = O(1)    (2.46)
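The expressions (2.41) and (2.44) can be checked by simulation. The sketch below (illustrative, assuming NumPy; n, k, and λ are arbitrary choices) sorts i.i.d. exponential samples and compares the empirical mean and variance of the k-th order statistic with the Harmonic-number formulas.

import numpy as np

rng = np.random.default_rng(2)
n, k, lam, trials = 10, 7, 2.0, 200_000

x = rng.exponential(1 / lam, size=(trials, n))
x_k = np.sort(x, axis=1)[:, k - 1]          # k-th smallest value in each trial

H  = lambda j: sum(1.0 / i for i in range(1, j + 1))      # Harmonic number H_j
H2 = lambda j: sum(1.0 / i**2 for i in range(1, j + 1))   # generalized Harmonic number H_j^(2)

print("E[X_{k:n}]   simulated:", x_k.mean(), " theory:", (H(n) - H(n - k)) / lam)
print("Var[X_{k:n}] simulated:", x_k.var(),  " theory:", (H2(n) - H2(n - k)) / lam**2)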
2.3.2 Order Statistics of the Uniform Distribution
Consider n i.i.d. realizations U_1, U_2, . . . , U_n of the uniform distribution

f_U(u) = 1 for u ∈ [0, 1], and 0 otherwise.    (2.47)

The k-th order statistic U_{k:n} follows the Beta distribution Beta(k, n + 1 − k) and its probability density function (PDF) is given by:

f_{U_{k:n}}(u) = (n!/((k − 1)!(n − k)!)) u^{k−1} (1 − u)^{n−k}    (2.48)

The mean of the k-th order statistic is given by E[U_{k:n}] = k/(n + 1).

2.3.3 Asymptotic Distribution of Quantiles
For a large number of samples n from a distribution F_X, it is interesting to understand the asymptotic distribution of the quantiles, that is, the order statistic X_{np:n} for p ∈ (0, 1). It can be shown that as n → ∞, X_{np:n} converges in distribution to a Gaussian (Normal) distribution:

X_{np:n} ∼ N( F^{−1}(p), p(1 − p)/(n [f(F^{−1}(p))]²) ).    (2.49)

Summary

In this chapter we reviewed calculus concepts, in particular, strong convexity and Lipschitz smoothness, which will be used in the convergence analysis of SGD and its variants in the upcoming chapters of the book. We also reviewed the basics of probability theory and discussed the concept of order statistics of random variables. Order statistics will be used to analyze the runtime per iteration of various distributed SGD algorithms that we will study in this book.
Problems
1. Identify whether the following functions are Lipschitz continuous, and if they are, then determine the smallest corresponding Lipschitz continuity parameter K: (1) F(x) = sin(x), and (2) F(x) = x².
2. Identify whether the following functions are Lipschitz smooth, and if they are, determine the smallest corresponding Lipschitz smoothness parameter L: (1) F(x) = (1/2)‖Ax − y‖², and (2) F(x) = x².
3. Consider the Pareto distribution, whose cumulative distribution function is given by:

F_X(x) = 1 − (x_m/x)^α for x ≥ x_m, and 0 for x < x_m    (2.50)

Compare the mean E[X] with the mean of the residual time E[(X − t)|X > t] for some elapsed time t ≥ 0. Is a new realization of the Pareto random variable expected to be longer than the expected residual time of an old realization?
4. Let X be a random variable with mean μ and variance σ² (assuming σ > 0). Let ε be a random variable drawn from the standard Gaussian N(0, 1) such that X and ε are independent. What is the variance of the random variable Y = X/σ + ε?
5. Find the probability distribution, expectation and variance of the k-th order statistic X_{k:n} of n i.i.d. realizations of a shifted exponential distributed X ∼ Δ + Exp(λ) for some Δ > 0.
6. Find the probability distribution, expectation and variance of the minimum order statistic X_{1:n} of n i.i.d. realizations of the Pareto distribution X ∼ Pareto(x_m, α).
3 Convergence of SGD and Variance-Reduced Variants
In this chapter we will analyze the convergence of gradient descent (GD) and stochastic gradient descent (SGD) to determine the number of iterations required to reach a target error. Because it uses noisy estimates of the true gradient, SGD has slower convergence than GD. Therefore, recent works have proposed variance-reduced versions of SGD that improve its convergence rate. We will review some of these variance-reduced variants of SGD.
3.1 Gradient Descent (GD) Convergence
Recall from Chap. 1 that GD is an iterative algorithm that minimizes the loss function F(w) with respect to the parameter vector w. It starts with a randomly initialized w_0 and uses the following update rule:

w_{t+1} = w_t − η∇F(w_t)
(3.1)
where η is the step size or the learning rate. The convergence speed of GD depends on the choice of η and the properties of the loss function F(w). As with most other gradient-based methods, for non-convex objectives, GD may get stuck in a local minimum. However, for strongly convex and smooth F(w) and small enough η, it is guaranteed to converge to the optimal w*.

Theorem 3.1 (Convergence of gradient descent (GD)) For a c-strongly convex and L-smooth function, if the learning rate η < 1/L and the starting point is w_0, then F(w_t) after t gradient descent iterations is bounded as

F(w_t) − F(w*) ≤ (1 − ηc)^t (F(w_0) − F(w*))    (3.2)
Proof Recall that if a function is L-Lipschitz smooth, then for any x_1, x_2 ∈ R^d it satisfies:

F(x_1) ≤ F(x_2) + ∇F(x_2)^⊤(x_1 − x_2) + (L/2)‖x_1 − x_2‖²    (3.3)

Replacing x_1 with w_{t+1} and x_2 with w_t respectively, we have

F(w_{t+1}) − F(w_t) ≤ ∇F(w_t)^⊤(w_{t+1} − w_t) + (L/2)‖w_{t+1} − w_t‖²    (3.4)
                    = ∇F(w_t)^⊤(−η∇F(w_t)) + (L/2)‖η∇F(w_t)‖²    (3.5)
                    = η(1 − (L/2)η)(−‖∇F(w_t)‖²)    (3.6)
Now using the Polyak-Lojasiewicz (PL) inequality 2c(F(w) − F(w*)) ≤ ‖∇F(w)‖², which is a consequence of the c-strong convexity of F(w), we have

F(w_{t+1}) − F(w_t) ≤ η(1 − (L/2)η)(−2c(F(w_t) − F(w*)))    (3.7)

Assume that η < 1/L. Then 1 − (L/2)η ≥ 1/2. Thus,

F(w_{t+1}) − F(w_t) ≤ −ηc(F(w_t) − F(w*))    (3.8)
F(w_{t+1}) − F(w*) + F(w*) − F(w_t) ≤ −ηc(F(w_t) − F(w*))    (3.9)
F(w_{t+1}) − F(w*) ≤ (1 − ηc)(F(w_t) − F(w*))    (3.10)
Continuing recursively, we have

F(w_{t+1}) − F(w*) ≤ (1 − ηc)²(F(w_{t−1}) − F(w*))    (3.11)
                   ⋮    (3.12)
                   ≤ (1 − ηc)^{t+1}(F(w_0) − F(w*))    (3.13)

3.1.1 Effect of Learning Rate and Other Parameters
The above analysis shows that as t → ∞, F(w_t) converges to the optimal value F(w*) and the optimality gap shrinks exponentially fast due to the multiplicative factor (1 − ηc) < 1. The rate of convergence of F(w_t) to its optimal value F(w*) increases with η and c, because a larger η or c results in a smaller value of the multiplicative factor (1 − ηc). Figure 3.1 illustrates the effect of η on the convergence of batch gradient descent for a simple linear regression problem. In general, η is a hyperparameter that requires manual tuning in order to set it to a value that is best suited to the dataset and the objective function in question. A smaller value of the Lipschitz constant L also results in faster convergence because η can be increased up to 1/L. Unlike the learning rate η, the Lipschitz constant L and the strong convexity parameter c are properties of the objective function and cannot be controlled by the optimization algorithm.

Fig. 3.1 Illustration of batch gradient descent convergence for a linear regression problem, where the y-axis is the residual sum of squares (RSS) error. When the learning rate η increases from η = 0.01 to η = 0.1 we get faster convergence, but increasing η further to η = 0.12 causes the error to diverge
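A figure like Fig. 3.1 can be reproduced qualitatively with a few lines of code. The sketch below (illustrative; the synthetic dataset, dimensions, and learning rates are my own choices, not the ones used to generate Fig. 3.1) runs batch gradient descent on a least-squares objective and reports the RSS, showing convergence for small η and divergence when η is too large.

import numpy as np

rng = np.random.default_rng(3)
N, d = 200, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=N)

def rss(w):
    return np.sum((y - X @ w) ** 2)          # residual sum of squares F(w)

for eta in (0.0005, 0.002, 0.01):            # the largest value exceeds 1/L and diverges
    w = np.zeros(d)
    for _ in range(50):
        grad = -2 * X.T @ (y - X @ w)        # full-batch gradient of the RSS objective
        w = w - eta * grad                   # GD update (3.1)
    print(f"eta = {eta:<7} RSS after 50 iterations: {rss(w):.4f}")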
3.1.2 Iteration Complexity
More concretely, let us determine how many iterations it takes to reach error F(w_t) − F(w*) = ε. We can evaluate this as follows:

(1 − ηc)^t (F(w_0) − F(w*)) ≤ ε    (3.14)
t log(1 − ηc) + log(F(w_0) − F(w*)) ≤ log(ε)    (3.15)
t log(1/(1 − ηc)) − log(F(w_0) − F(w*)) ≥ log(1/ε)    (3.16)
t = O(log(1/ε))    (3.17)
This convergence rate is often called linear convergence. It is considered to be a fast convergence speed because of the log term. For instance, to achieve an error ε = 10^{−m}, we only need O(m) iterations.
3.2 Convergence Analysis of Mini-batch SGD
Since many practical ML problems use the empirical risk function F(w) = (1/N) Σ_{n=1}^{N} ℓ(w, ξ_n), it is too expensive to compute the full gradient ∇F(w) of the objective function because it requires processing each of the N samples of a large training dataset in every iteration. Mini-batch stochastic gradient descent (SGD) is a computationally efficient alternative where the gradient ∇F(w_t) is replaced by a noisy estimate g(w_t, ξ_t) = (1/b) Σ_{l=1}^{b} ∇ℓ(w_t, ξ_{t,l}) computed
using a batch of b training samples chosen uniformly at random and with replacement from the training dataset. The update rule of mini-batch SGD is as follows: wt+1 = wt − ηg(wt , ξt )
(3.18)
Next, we analyze the convergence of mini-batch SGD and compare it with batch GD in order to understand how the stochasticity of the gradients affects the rate of convergence. To analyze the convergence of mini-batch SGD, we make the same assumptions of c-strong convexity and L-Lipschitz smoothness. In addition, we also need assumptions on the mini-batch stochastic gradients g(w; ξ):

• Unbiased Estimate: The mini-batch stochastic gradient g(w; ξ) is an unbiased estimate of ∇F(w), that is, E_ξ[g(w; ξ)] = ∇F(w).
• Bounded Variance: The gradient ∇ℓ(w, ξ_l) of the l-th sample in the mini-batch has bounded variance, that is,

Var(∇ℓ(w; ξ)) ≤ σ²
(3.19)
This implies the following bounds on the variance of the mini-batch stochastic gradient g(w; ξ):

E_ξ[‖g(w; ξ)‖²] − ‖E_ξ[g(w; ξ)]‖² ≤ σ²/b    (3.20)
E_ξ[‖g(w; ξ)‖²] ≤ ‖E_ξ[g(w; ξ)]‖² + σ²/b    (3.21)
                ≤ ‖∇F(w)‖² + σ²/b    (3.22)
Under these assumptions, we have the following result on the convergence of stochastic gradient descent.

Theorem 3.2 (Convergence of mini-batch SGD) For a c-strongly convex and L-smooth function satisfying the unbiased gradient estimate and bounded variance assumptions listed above, if the learning rate η < 1/L and the starting point is w_0, then E[F(w_t)] after t mini-batch SGD iterations is bounded as

E[F(w_t)] − F(w*) − ηLσ²/(2cb) ≤ (1 − ηc)^t (E[F(w_0)] − F(w*) − ηLσ²/(2cb))    (3.23)

Proof Starting with the Lipschitz smoothness of the objective function, we have
F(w_{t+1}) − F(w_t) ≤ ∇F(w_t)^⊤(w_{t+1} − w_t) + (L/2)‖w_{t+1} − w_t‖²    (3.24)
                    = ∇F(w_t)^⊤(−ηg(w_t, ξ_t)) + (L/2)‖ηg(w_t, ξ_t)‖²    (3.25)
Taking expectation on both sides with respect to the stochasticity ξ_t in the t-th iteration,

E[F(w_{t+1})] − F(w_t) ≤ ∇F(w_t)^⊤ E[−ηg(w_t, ξ_t)] + (L/2) E[‖ηg(w_t, ξ_t)‖²]    (3.26)
                       = −η∇F(w_t)^⊤ E[g(w_t, ξ_t)] + (η²L/2) E[‖g(w_t, ξ_t)‖²]    (3.27)
Using the unbiased gradient assumption E_ξ[g(w; ξ)] = ∇F(w) and the bounded variance assumption E_ξ[‖g(w; ξ)‖²] ≤ ‖∇F(w)‖² + σ²/b, we have

E[F(w_{t+1})] − F(w_t) ≤ −η‖∇F(w_t)‖² + (η²L/2)(‖∇F(w_t)‖² + σ²/b)    (3.28)
                       = η(1 − (L/2)η)(−‖∇F(w_t)‖²) + η²σ²L/(2b)    (3.29)

Now using the strong convexity property 2c(F(w) − F(w*)) ≤ ‖∇F(w)‖², we have

E[F(w_{t+1})] − F(w_t) ≤ η(1 − (L/2)η)(−2c(F(w_t) − F(w*))) + η²σ²L/(2b)    (3.30)
                       ≤ −ηc(F(w_t) − F(w*)) + η²σ²L/(2b)    (3.31)
where in (3.31) we use the assumption that η < 1/L, which implies that 1 − (L/2)η ≥ 1/2. Now taking total expectations and subtracting the optimal F(w*) from both sides, we have

E[F(w_{t+1})] − F(w*) + F(w*) − E[F(w_t)] ≤ −ηc(E[F(w_t)] − F(w*)) + η²σ²L/(2b)    (3.32)
E[F(w_{t+1})] − F(w*) ≤ (1 − ηc)(E[F(w_t)] − F(w*)) + η²σ²L/(2b)    (3.33)

Subtracting ηLσ²/(2cb) from both sides,

E[F(w_{t+1})] − F(w*) − ηLσ²/(2cb) ≤ (1 − ηc)(E[F(w_t)] − F(w*)) + η²σ²L/(2b) − ηLσ²/(2cb)
E[F(w_{t+1})] − F(w*) − ηLσ²/(2cb) ≤ (1 − ηc)(E[F(w_t)] − F(w*) − ηLσ²/(2cb))
Applying the above inequality recursively for t iterations starting from w0 , we have the result.
3.2.1 Effect of Learning Rate and Mini-batch Size
Similar to GD, the above analysis implies that the rate of convergence of F(w_t) to its optimal value F(w*) increases with η and c, because a larger η or c results in a smaller value of the multiplicative factor (1 − ηc). Figure 3.2 illustrates this phenomenon for the linear regression example. However, unlike GD, as t → ∞, the loss function value F(w_t) does not converge to the optimal F(w*). The stochasticity of the gradients results in an error floor ηLσ²/(2cb), which increases with η and the variance bound σ². Increasing the mini-batch size b reduces the variance and thus reduces the error floor. In the special case when b = N, mini-batch SGD reduces to batch GD and the learning curve has no stochasticity, as we saw in Fig. 3.1.
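The error floor can be seen directly in a small experiment. The sketch below (illustrative; the synthetic dataset and hyperparameters are arbitrary choices) runs mini-batch SGD on a least-squares problem with a fixed learning rate and reports the final mean squared error for several mini-batch sizes b; larger b gives a lower floor, consistent with the ηLσ²/(2cb) term.

import numpy as np

rng = np.random.default_rng(4)
N, d = 1000, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=N)

def run_minibatch_sgd(b, eta=0.01, iters=2000):
    w = np.zeros(d)
    for _ in range(iters):
        idx = rng.integers(0, N, size=b)                  # mini-batch sampled with replacement
        g = -2 * X[idx].T @ (y[idx] - X[idx] @ w) / b     # mini-batch gradient g(w_t, xi_t)
        w = w - eta * g                                   # update rule (3.18)
    return np.mean((y - X @ w) ** 2)

for b in (1, 10, 100):
    print(f"b = {b:<4} final mean squared error: {run_minibatch_sgd(b):.4f}")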
3.2.2 Iteration Complexity
Due to the non-zero error floor ηLσ²/(2cb), when we use a constant learning rate η the progress of the SGD algorithm towards the optimal model w* stalls after some iterations. In order to achieve a zero error floor, the learning rate η is gradually decayed during the training process. With such a decaying learning rate schedule, how many iterations do we need to reach error F(w_t) − F(w*) = ε? It can be shown that a learning rate schedule η_t = γ/(β + t) for some constants β, γ > 0 gives the following convergence result, from which we can infer the iteration complexity.
Fig. 3.2 Illustration of stochastic gradient descent convergence for a linear regression problem, where the y-axis is the residual sum of squares (RSS) error. A higher learning rate η yields faster convergence, but also results in a higher error floor
Theorem 3.3 (Convergence of SGD with decaying learning rate) For a c-strongly convex and L-smooth function, suppose the learning rate is η_t = γ/(β + t) in the t-th iteration for some β > 1/c and γ > 0 such that η_1 ≤ 1/L. With the starting point w_0, F(w_t) after t SGD iterations is bounded as

E[F(w_t)] − F(w*) ≤ ν/(γ + t)    (3.34)

where ν = max{ β²Lσ²/(2(βc − 1)), (γ + 1)(F(w_0) − F(w*)) }.
The proof can be found in [1]. Theorem 3.3 implies that the number of iterations required to reach error F(w_t) − F(w*) = ε is O(1/ε). For instance, to achieve an error ε = 10^{−m}, we need O(10^m) iterations. This convergence rate is much slower than the O(log(1/ε)) linear convergence rate achieved by batch gradient descent, as we showed in Sect. 3.1.
3.2.3 Non-convex Objectives
Many practical machine learning problems have non-convex objective functions, which do not satisfy the strong convexity assumption. We can extend the above convergence analysis to non-convex objectives, as we show below. However, this general convergence analysis can only guarantee convergence to a stationary point of the objective function (a local minimum or saddle point), because there is no unique global minimum as in the case of strongly convex objective functions. Instead, we assume that the sequence of function values F(w_k) is bounded below by a scalar F_inf.

Theorem 3.4 (Convergence of mini-batch SGD, non-convex objectives) For an L-smooth function satisfying the unbiased gradient estimate and bounded variance assumptions listed above, if the learning rate η < 1/L and the starting point is w_1, then after t mini-batch SGD iterations the expected average of squared gradients of F can be bounded as follows:

(1/t) Σ_{k=1}^{t} E[‖∇F(w_k)‖²] ≤ 2(F(w_1) − F_inf)/(tη) + ηLσ²/b    (3.35)
Proof We follow the same steps as the proof of Theorem 3.2, starting with the Lipschitz smoothness of the objective function, to get

E[F(w_{k+1})] − F(w_k) ≤ η(1 − (L/2)η)(−‖∇F(w_k)‖²) + η²σ²L/(2b)    (3.36)
                       ≤ −(η/2)‖∇F(w_k)‖² + η²σ²L/(2b)    (3.37)

where we use the assumption that η < 1/L, which implies that 1 − (L/2)η ≥ 1/2. Summing both sides for k = 1, . . . , t and dividing by t we get:

(E[F(w_{t+1})] − F(w_1))/t ≤ −(η/(2t)) Σ_{k=1}^{t} E[‖∇F(w_k)‖²] + η²σ²L/(2b)    (3.38)
(1/t) Σ_{k=1}^{t} E[‖∇F(w_k)‖²] ≤ −2(E[F(w_{t+1})] − F(w_1))/(ηt) + ησ²L/b    (3.39)
(1/t) Σ_{k=1}^{t} E[‖∇F(w_k)‖²] ≤ 2(F(w_1) − F_inf)/(ηt) + ησ²L/b    (3.40)
We can also extend this result to obtain a convergence analysis of SGD with diminishing step-size η for non-convex objective functions. However, we will skip stating that here and refer the readers to [1] for the details.
3.3 Variance-Reduced SGD Variants
In Sect. 3.2 we saw that mini-batch SGD, the computationally efficient version of batch GD that uses noisy stochastic gradients instead of full gradients, has a non-zero error floor due to the gradient noise. In order to achieve convergence to the optimum F(w*), we need to decay the learning rate during the course of training. Even with this decaying learning rate strategy, we need O(1/ε) iterations to reach an ε error, which is significantly larger than the O(log(1/ε)) iterations required with batch GD. To bridge this gap, several recent papers have proposed techniques to reduce the variance of the stochastic gradients and achieve the O(log(1/ε)) linear convergence rate. We will discuss three such techniques in this section, namely, dynamic mini-batch size schedules, Stochastic Average Gradient (SAG) [2], and Stochastic Variance Reduced Gradient (SVRG) [3].
3.3.1 Dynamic Mini-batch Size Schedule
Instead of decaying the learning rate, an alternate approach to gradually reducing the gradient noise in order to achieve a zero error floor is to dynamically increase the mini-batch size b during the course of training. In particular, consider the following dynamic mini-batch version of SGD, where

w_{t+1} = w_t − ηg(w_t, ξ_t), where    (3.41)
g(w_t, ξ_t) = (1/b_t) Σ_{j∈S_t} ∇ℓ(w_t, ξ_{t,j}) with b_t = |S_t| = γ^{t−1}    (3.42)
for some γ > 1. Thus, the mini-batch size b_t is increased geometrically at the rate γ. The variance of the gradient in the t-th iteration is bounded as Var(g(w_t; ξ)) ≤ M/b_t. With this dynamic mini-batch schedule, SGD can achieve a F(w_t) − F(w*) ≤ ε error in O(log(1/ε)) iterations, matching the linear convergence rate of batch GD. While the number of iterations required to reach an ε error reduces, note that the number of per-sample gradient computations increases geometrically in each iteration. Over t iterations, the number of gradient computations is proportional to γ^t. Thus, over O(log(1/ε)) iterations, the number of gradient computations is O(1/ε). This is exactly the iteration complexity of SGD, which only performs one gradient evaluation per iteration.
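A minimal sketch of this schedule (illustrative; the loss is least squares, and the function name and hyperparameters are my own choices) grows the mini-batch size geometrically by a factor γ each iteration, capped at the dataset size:

import numpy as np

def dynamic_batch_sgd(X, y, eta=0.01, gamma=1.1, iters=100, seed=5):
    """SGD with a geometrically increasing mini-batch size b_t = gamma**(t-1), capped at N."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for t in range(1, iters + 1):
        b_t = min(int(np.ceil(gamma ** (t - 1))), N)       # geometric batch-size schedule
        idx = rng.integers(0, N, size=b_t)
        g = -2 * X[idx].T @ (y[idx] - X[idx] @ w) / b_t    # averaged mini-batch gradient
        w = w - eta * g
    return w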
3.3.2 Stochastic Average Gradient (SAG)
In each iteration, stochastic gradient descent (SGD) chooses a data sample ξ_t uniformly at random from the N training samples and uses the gradient at that sample in lieu of the full gradient computed over the entire dataset. Stochastic Average Gradient (SAG), proposed in [2], maintains previously computed gradients in memory and uses them in place of the N − 1 missing gradients in each iteration. The SAG update rule is given by:

w_{t+1} = w_t − η (1/N) ( ∇ℓ(w_t, ξ_t) − v_{t,n} + Σ_{n=1}^{N} v_{t,n} )    (3.43)

where v_{t,n} is the (potentially outdated) gradient at the n-th sample that is stored in memory. After the t-th iteration, these vectors are updated as follows:

v_{t,n} = ∇ℓ(w_t, ξ_t) if ξ_t = ξ_n, and v_{t,n} = v_{t−1,n} otherwise    (3.44)

At the beginning of training, SAG evaluates the full batch gradient in order to initialize the vectors v_{0,n} = ∇ℓ(w_0, ξ_n) for all n. Using the gradients v_{t,n} stored in memory in place of the N − 1 missing stochastic gradients corresponding to the unsampled indices reduces the variance of the overall gradient. As a result, SAG can achieve an O(log(1/ε)) linear convergence rate. However, this convergence improvement comes at the cost of O(Nd) memory required to store previously computed gradients for all the N training samples, and a one-time O(N) cost of initializing v_{t,n} by evaluating the full gradient. Other, more efficient initialization techniques could be used in practice, such as setting the gradients to zero, or only using the gradients that are available in memory until that time.

A drawback of SAG is that the gradient (1/N)(∇ℓ(w_t, ξ_t) − v_{t,n} + Σ_{n=1}^{N} v_{t,n}) used in the update (3.43) is a biased estimate of the full gradient (1/N) Σ_{n=1}^{N} ∇ℓ(w_t, ξ_n). A variant of SAG called SAGA, proposed in [4], removes this bias by using the following modified version of the update rule:
w_{t+1} = w_t − η ( ∇ℓ(w_t, ξ_t) − v_{t,n} + (1/N) Σ_{n=1}^{N} v_{t,n} )    (3.45)

By moving the 1/N inside the brackets so that it multiplies only the sum of stored gradients, SAGA ensures that the expected value of the gradient used in every update is equal to the full batch gradient (1/N) Σ_{n=1}^{N} ∇ℓ(w_t, ξ_n), but it has a lower variance. While we present SAG and SAGA for the case where each stochastic gradient ∇ℓ(w_t, ξ_t) is computed over a single sample from the training dataset, the algorithms and their convergence analysis can easily be extended to the case where ∇ℓ(w_t, ξ_t) is computed over a mini-batch of b samples chosen uniformly at random with replacement from the training dataset.
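To make the bookkeeping concrete, here is a minimal SAGA sketch for the least-squares loss (illustrative; it stores the per-sample gradients in an N × d table, maintains their running average, and applies the update in (3.45); the variable names and step size are my own choices):

import numpy as np

def saga(X, y, eta=0.02, epochs=20, seed=6):
    """SAGA for the least-squares loss; keeps one stored gradient per training sample."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    v = np.array([-2 * (y[n] - X[n] @ w) * X[n] for n in range(N)])   # v_{0,n} at w_0 = 0
    v_avg = v.mean(axis=0)                       # running average (1/N) * sum_n v_{t,n}
    for _ in range(epochs * N):
        n = rng.integers(N)
        g_new = -2 * (y[n] - X[n] @ w) * X[n]    # fresh per-sample gradient at the sampled index
        w = w - eta * (g_new - v[n] + v_avg)     # SAGA update (3.45)
        v_avg = v_avg + (g_new - v[n]) / N       # keep the stored average consistent
        v[n] = g_new                             # the N x d table v is the O(Nd) memory cost
    return w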
3.3.3 Stochastic Variance Reduced Gradient (SVRG)
Unlike SAG and SAGA, the SVRG method proposed in [3] does not require additional memory to store previous gradients. The key idea behind SVRG is to periodically compute the full gradient, and use it to reduce the variance of the stochastic gradients in each iteration. SVRG operates in cycles, where each cycle consists of t_0 iterations. At the end of each cycle, it updates w̃, which denotes the snapshot of w, that is, w̃ is set to w_t if t mod t_0 = 0. After every update to the snapshot, SVRG also computes the full gradient ∇F(w̃) at the snapshot w̃. During the t_0 iterations of the next cycle, SVRG updates the model parameter vector w_t as follows:

w_{t+1} = w_t − η( ∇ℓ(w_t, ξ_t) − ∇ℓ(w̃, ξ_t) + ∇F(w̃) )    (3.46)

where the term in parentheses, denoted g̃(w_t), is an unbiased estimate of the full gradient ∇F(w_t), but its variance is smaller than that of the stochastic gradient ∇ℓ(w_t, ξ_t) at the sample chosen in the t-th iteration. SVRG gives an iteration complexity of O(log(1/ε)), matching that of batch GD. This faster convergence comes at an additional computation cost of periodically computing the full gradient. Amortizing the per-cycle cost across the t_0 iterations in that cycle, the per-iteration computation cost of SVRG is O(N/t_0), in contrast to the O(b) cost of computing a mini-batch gradient in each iteration of mini-batch SGD. Similar to SAG and SAGA, although we presented SVRG for the case where ∇ℓ(w_t, ξ_t) is computed over a single training sample, the algorithm and its analysis can easily be extended to the case where ∇ℓ(w_t, ξ_t) is computed over a mini-batch of b samples chosen uniformly at random with replacement from the training dataset.
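A corresponding SVRG sketch for the least-squares loss (illustrative; the cycle length t_0, step size, and names are my own choices) alternates between computing a full gradient at the snapshot and t_0 variance-reduced inner updates:

import numpy as np

def svrg(X, y, eta=0.02, t0=None, cycles=20, seed=7):
    """SVRG for the least-squares loss; recomputes the full gradient once per cycle."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    t0 = t0 or N                                          # cycle length, e.g. one pass over the data
    sample_grad = lambda w, n: -2 * (y[n] - X[n] @ w) * X[n]
    w = np.zeros(d)
    for _ in range(cycles):
        w_snap = w.copy()                                 # snapshot w~
        full_grad = -2 * X.T @ (y - X @ w_snap) / N       # full gradient at the snapshot
        for _ in range(t0):
            n = rng.integers(N)
            g = sample_grad(w, n) - sample_grad(w_snap, n) + full_grad   # variance-reduced gradient (3.46)
            w = w - eta * g
    return w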
Table 3.1 Comparison of the iteration complexity, computation per iteration, and memory cost of various SGD algorithms

Algorithm       | Iters. to reach ε error | Comp. per iter. | Memory
Batch GD        | O(log(1/ε))             | O(Nd)           | O(d)
SGD             | O(1/ε)                  | O(1)            | O(d)
Mini-batch SGD  | O(1/ε)                  | O(bd)           | O(d)
SAGA            | O(log(1/ε))             | O(d)            | O(Nd)
SVRG            | O(log(1/ε))             | O(Nd/t_0)       | O(d)
Summary

In this chapter, we presented the convergence analysis of batch GD and mini-batch SGD for strongly convex and Lipschitz smooth functions, and showed that they require O(log(1/ε)) and O(1/ε) iterations respectively to reach an ε error. In order to bridge the gap in their iteration complexity, methods such as SAGA and SVRG use previously computed gradients to reduce the variance of the stochastic gradient used to update the model parameters in each iteration. A comparison of all these SGD variants is summarized in Table 3.1. The strong convexity assumption used in the convergence analyses presented in this chapter can be removed to obtain a more general convergence analysis for Lipschitz smooth but non-convex objective functions. When the objective function is non-convex, GD and SGD only guarantee convergence to a stationary point (local minimum or saddle point) of the function, rather than the global optimum. Please refer to [1] for these general convergence results and their proofs. The proofs of the convergence of SVRG and SAGA are omitted here for brevity and can be found in [2–4].

Problems
1. Consider a linear regression problem, where the goal is to minimize the residual sum of squares error Σ_{i=1}^{N} (y_i − w^⊤x_i)² for a training dataset D = {(x_1, y_1), . . . , (x_N, y_N)}. Implement mini-batch SGD and plot the residual sum of squares (RSS) error versus the number of iterations for different mini-batch sizes for a fixed value of the learning rate η. How do the convergence rate and the error floor depend on the mini-batch size?
2. Prove Theorem 3.3 by following steps similar to the proof of Theorem 3.2. Please refer to [1] for a detailed solution.
3. For the same linear regression problem considered in Problem 1, implement SAG and SAGA and compare their performance with that of SGD (equivalent to mini-batch SGD with b = 1).
References

1. L. Bottou, F. E. Curtis, and J. Nocedal, "Optimization methods for large-scale machine learning," arXiv preprint arXiv:1606.04838, Feb. 2018.
2. N. L. Roux, M. Schmidt, and F. R. Bach, "A stochastic gradient method with an exponential convergence rate for finite training sets," in Advances in Neural Information Processing Systems, 2012, pp. 2663–2671.
3. R. Johnson and T. Zhang, "Accelerating stochastic gradient descent using predictive variance reduction," in Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2013, pp. 315–323. [Online]. Available: http://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf
4. A. Defazio, F. Bach, and S. Lacoste-Julien, "SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives," in Advances in Neural Information Processing Systems 27, 2014, pp. 1646–1654.
4 Synchronous SGD and Straggler-Resilient Variants
For large training datasets, it can be prohibitively slow to conduct sequential SGD training at a single node, as we described in Chap. 1. Therefore, most practical implementations run SGD in a distributed manner using multiple worker nodes that share the task of computing minibatch gradients. They communicate the gradients with a central server called the parameter server [1] that stores the current version of the model. In this chapter, we will introduce two types of distributed training in the parameter server framework, namely data-parallel and model-parallel training. Then the rest of the chapter will focus on the most commonly used distributed SGD algorithm, synchronous SGD. We will analyze its error convergence and runtime per iteration and study the trade-off between these two quantities. Finally, we will introduce straggler-resilient variants that strike a good balance between error and runtime.
4.1 Parameter Server Framework
To understand the need to distribute SGD across multiple computing nodes, let us evaluate the wallclock runtime taken to reach a target error for a simple delay model. Suppose the time taken to compute a sample gradient is y and the time taken to update model parameters is denoted by δ. Then for a mini-batch size of b, the runtime per iteration is (by + δ). From the mini-batch SGD convergence analysis presented in Chap. 3, we know that using a larger mini-batch size b results in a lower error. However, the wallclock runtime increases linearly in b. Instead of a sequential implementation of mini-batch SGD, if we use multiple worker nodes that compute gradients in parallel, then we can process large mini-batches without increasing the wallclock time per iteration. For example, if we consider a network of b workers that compute one sample gradient each in parallel, the runtime per iteration will reduce to (y + δ). This idea of processing more data per iteration using multiple worker nodes is called data-parallel distributed training. It is implemented using a parameter server
that stores the current version of the model w and collects gradients from the worker nodes. Each worker node stores a replica of the current version of the model and a partition of the training data. Besides data parallelism, an orthogonal way to reduce the runtime per iteration of mini-batch SGD is to reduce the per-sample gradient computation time y. The gradient computation time is proportional to the size of the gradient vector g(w), which is equal to the number of parameters in the model w that we want to train. In model-parallel training, the model itself is partitioned into shards and each worker only computes gradients for one of the model partitions. Thus, if we have m workers, the per-sample gradient computation time y reduces to y/m, and with the simple delay model considered above the runtime per iteration becomes by/m + δ. The data-parallel and model-parallel training paradigms within the parameter server framework were first introduced by [1]. While both are used in practice, model-parallel training is less common and is mainly used for training massive models because it requires additional communication to ensure synchronization of the model partitions. In most of this chapter, we will focus on an algorithm called synchronous SGD, which is the standard version of data-parallel distributed training used in most industrial implementations.
4.2 Distributed Synchronous SGD Algorithm
The synchronous distributed SGD algorithm operates within the parameter server framework that consists of a central parameter server and m worker nodes, as shown in Fig. 4.1. The training dataset D is shuffled and split equally into partitions D_1, . . . , D_m which are stored at the m workers. The parameter server stores the latest version of the model parameters w. Each iteration of the synchronous SGD algorithm proceeds as follows:

1. The parameter server sends the current version of the parameter vector w_t to all the m workers.
2. Each worker i computes a gradient g(w_t; ξ_i) = (1/b) Σ_{j=1}^{b} ∇ℓ(w_t; ξ_{i,j}) using a mini-batch of b samples drawn uniformly at random with replacement from its dataset partition D_i.
Fig. 4.1 Data-parallel training in the parameter server framework
3. The parameter server collects these m mini-batch gradients and updates the parameter vector as: w_{t+1} = w_t − η (1/m) Σ_{i=1}^{m} g(w_t; ξ_i).

It is easy to see that synchronous SGD with a mini-batch size b at each of the m workers processes m times more training data per iteration than mini-batch SGD run at a single node. However, does that imply that the error convergence is always better as the number of workers m increases? To answer this question we adopt our system-aware philosophy that analyzes two dimensions of error convergence: firstly, how the error versus iterations convergence depends on m (in Sect. 4.3 below), and secondly, how the runtime per iteration depends on m (in Sect. 4.4 below). By combining these two factors we can determine how the choice of the number of workers m affects the error versus runtime convergence of distributed synchronous SGD.
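The following sketch (an illustrative serial simulation in NumPy, not a real distributed implementation; the dataset, partitioning, and hyperparameters are my own choices) mimics synchronous SGD by having m logical workers each compute a mini-batch gradient on its own partition before the averaged update is applied:

import numpy as np

def synchronous_sgd(X, y, m=4, b=32, eta=0.01, iters=500, seed=8):
    """Serial simulation of synchronous SGD with m workers and a parameter server."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    partitions = np.array_split(rng.permutation(N), m)    # shuffle and partition the dataset
    w = np.zeros(d)
    for _ in range(iters):
        grads = []
        for i in range(m):                                # each worker computes one mini-batch gradient
            idx = rng.choice(partitions[i], size=b, replace=True)
            grads.append(-2 * X[idx].T @ (y[idx] - X[idx] @ w) / b)
        w = w - eta * np.mean(grads, axis=0)              # PS averages the m gradients and updates w
    return w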
4.3 Convergence Analysis
Let us first analyze the error versus iterations convergence of synchronous mini-batch SGD, that is, obtain a bound on E[F(w_{t+1})] − F(w*) in terms of the number of workers m and the mini-batch size b. Observe that the update rule w_{t+1} = w_t − η (1/m) Σ_{i=1}^{m} g(w_t; ξ_i) of synchronous SGD is equivalent to performing mini-batch SGD, but with a batch size of mb instead of just b samples! So, the convergence analysis of mini-batch SGD can be directly applied here. The only parameter that changes is the variance upper bound. More formally, the convergence analysis of synchronous SGD with m workers can be obtained as follows. Similar to the analysis of mini-batch SGD presented in Chap. 3, we assume that the objective function F(w) is c-strongly convex and L-Lipschitz smooth. In addition, we make the following assumptions about the stochastic gradients g(w; ξ) returned by the workers:

• Unbiased Estimate: The stochastic gradient g(w; ξ) is an unbiased estimate of ∇F(w), that is, E_ξ[g(w; ξ)] = ∇F(w). This assumption is true because we consider that in the parameter server framework, the training dataset is shuffled and then equally partitioned across the workers.
• Bounded Variance: The gradient ∇ℓ(w, ξ_l) of the l-th sample in each mini-batch has bounded variance, that is,

Var(∇ℓ(w; ξ)) ≤ σ²
(4.1)
This implies the following bounds on the variance of the average stochastic gradient (1/m) Σ_{i=1}^{m} g(w_t; ξ_i) used by the parameter server to update the model parameters w:

Var( (1/m) Σ_{i=1}^{m} g(w_t; ξ_i) ) ≤ σ²/(bm)    (4.2)
E_ξ[ ‖(1/m) Σ_{i=1}^{m} g(w_t; ξ_i)‖² ] ≤ ‖∇F(w)‖² + σ²/(bm)    (4.3)
Under these assumptions, we have the following result on the convergence of distributed synchronous SGD.

Theorem 4.1 (Convergence of synchronous distributed SGD) Consider a c-strongly convex and L-smooth function satisfying the unbiased gradient estimate and bounded variance assumptions listed above. If the learning rate η < 1/L and the starting point is w_0, then F(w_t) after t synchronous distributed SGD iterations is bounded as

E[F(w_t)] − F(w*) − ηLσ²/(2cbm) ≤ (1 − ηc)^t (E[F(w_0)] − F(w*) − ηLσ²/(2cbm))    (4.4)
4.3.1 Iteration Complexity
The main implication of Theorem 4.1 above is that the error floor ηLσ²/(2cbm) is inversely proportional to the number of workers m and the mini-batch size b at each worker. With a decaying learning rate schedule η_t = η_0/t, to reach an error floor ε, synchronous SGD would take O(1/(mε)) iterations. Thus, having m times more worker nodes always improves the error versus iterations convergence to a target error by a factor of m iterations.
4.4 Runtime per Iteration
In each iteration of synchronous SGD, the parameter server needs to wait for all m workers to return their gradients, update the parameter vector w, and broadcast it to all the workers. Assuming the ideal scenario where workers perform perfectly parallel computation, the wallclock time per iteration should be independent of m. However, in practice, the workers experience system-level variabilities in their gradient computation times and the communication delay in exchanging gradients and the updated model with the parameter server can also take a non-negligible time. As a result, having m workers would not necessarily result in an m-fold speed-up in the data processed per unit time. In this section, we seek to understand how all the gradient computation and communication times affect the runtime per iteration Tsync and its dependence on the number of workers m. Putting this runtime analysis together with the error convergence analysis presented in Sect. 4.3 will shed light on the true convergence speed of synchronous SGD in terms of the error versus wallclock runtime.
4.4.1 Gradient Computation and Communication Time
The time taken by a worker to compute and send its mini-batch gradient to the PS can vary randomly across workers and iterations due to several reasons, such as fluctuations in the worker's gradient computation speed, variation in the gradient computation time across mini-batches, and network latency and outages that can affect the time taken for the workers to communicate the gradients to the parameter server. We use the random variable X to denote the total time taken by each worker to compute and send its gradient to the parameter server. While the probability distribution F_X(x) of the random variable X depends on the computing/network infrastructure and communication protocols, a useful model is the exponential distribution Exp(λ), which we introduced in Chap. 2. Due to its memoryless property, the exponential distribution is especially well-suited for theoretical analysis and we will use it extensively in the runtime analysis of synchronous SGD and its variants. A practical variant of the exponential distribution is the shifted exponential distribution Δ + Exp(λ), where Δ represents a constant delay in the communication and local computation, and Exp(λ) captures the random fluctuations in this delay. Moreover, the assumption that the gradient computation times X are i.i.d. across workers and iterations is also required for tractability of the runtime analysis. However, in practice, slow (or fast) workers remain slow (or fast) for multiple iterations. Incorporating such memory can make the runtime analysis more realistic.
4.4.2 Expected Runtime per Iteration
Since the parameter server needs to wait for all m workers to return their gradients, the runtime per iteration T_sync of synchronous distributed SGD is the maximum of the gradient computation times X_1, X_2, . . . , X_m of the m workers, as shown in Fig. 4.2. Due to the synchronous nature of gradient aggregation, fast workers that finish their gradient computation early need to remain idle until the slowest worker returns its gradient to the parameter server. Only then can the parameter server compute the average of the m gradients and update the model parameters according to the update rule w_{t+1} = w_t − η (1/m) Σ_{i=1}^{m} g(w_t; ξ_i). As the number of workers m increases, the probability of at least one of the workers being slow also increases, and therefore the expected runtime per iteration E[T_sync] increases with m. This effect is referred to as tail latency and has been observed and studied [2] in the context of distributed computations in frameworks such as MapReduce [3]. Let us formally evaluate how E[T_sync] scales with m for different probability distributions F_X(x). Suppose the X_i are independent and identically distributed (i.i.d.) across workers i = 1, . . . , m. Then the expected runtime per iteration E[T_sync] is given by:
Fig. 4.2 The runtime per iteration of synchronous SGD is the maximum of the gradient computation times (illustrated by the length of each arrow) at each of the m workers. As the number of worker nodes increases, there will be more idle time wasted at the fast workers while the parameter server waits for the slowest workers to finish their gradient computation
E[T_sync] = E[max(X_1, X_2, . . . , X_m)] = E[X_{m:m}]

where X_{m:m} is the maximum order statistic of the m local gradient computation times. For example, suppose that the local gradient computation time at each worker is X ∼ Δ + Exp(λ), following a shifted exponential distribution, which has mean E[X] = Δ + 1/λ. The expected runtime per iteration of synchronous SGD with m workers is:

E[T_sync] = Δ + H_m/λ ∼ Δ + (log m)/λ
where H_m = Σ_{i=1}^{m} 1/i is the m-th Harmonic number. Thus, the expected runtime increases logarithmically with m. In Fig. 4.3a, we plot the expected runtime per iteration E[T_sync] and in Fig. 4.3b, we plot the data processed per unit time, both versus the number of workers, for different distributions of the gradient computation time X. All the distributions have the same mean, E[X] = 3, and the ideal scaling plot corresponds to X being constant and equal to 3. In Fig. 4.3a, we observe that with the ideal scaling where each worker takes a constant time X to return its gradient, the expected runtime stays constant irrespective of the number of workers. However, when X is random, as the variance of X increases, the expected runtime grows faster with the number of workers. For example, the shifted exponential distribution X ∼ 2 + Exp(1) has lower variance than the exponential distribution and the Pareto distribution with the same mean. In Fig. 4.3b, we plot the expected number of mini-batches processed per unit time, which is equal to m/E[T_sync]. When X is constant, E[T_sync] = X, and the data processed per unit time scales linearly with m, the ideal scaling that we desire from parallel computation at the m workers. However, in practice, X is random, due to which we get a sub-optimal scaling of the data processed per unit time.
Fig. 4.3 Expected runtime per iteration of synchronous SGD, E[T_sync], and data processed per unit time versus the number of workers m for different distributions of the gradient computation time X
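The scaling behavior in Fig. 4.3 can be reproduced in a few lines of simulation. The sketch below (illustrative, assuming NumPy; the distributions are chosen to have the same mean 3, similar in spirit to the figure) estimates E[T_sync] = E[max(X_1, . . . , X_m)] for increasing m:

import numpy as np

rng = np.random.default_rng(9)
trials = 100_000

# Three gradient-computation-time distributions with the same mean E[X] = 3
distributions = {
    "constant (ideal)": lambda m: np.full((trials, m), 3.0),
    "2 + Exp(1)":       lambda m: 2.0 + rng.exponential(1.0, (trials, m)),
    "Exp(1/3)":         lambda m: rng.exponential(3.0, (trials, m)),
}

for m in (1, 4, 16, 64):
    results = []
    for name, draw in distributions.items():
        t_sync = draw(m).max(axis=1).mean()    # E[T_sync] = E[max of the m worker times]
        results.append(f"{name}: {t_sync:5.2f}")
    print(f"m = {m:<3}", "   ".join(results))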
4.4.3 Error Versus Runtime Convergence
To understand the true convergence speed of distributed synchronous SGD, let us now determine a bound on the error at a wallclock time T by combining the error and runtime analyses given above. Assume that X ∼ Exp(λ), the exponential distribution with rate λ. For large T, the number of iterations t of SGD completed in T units of time is given by

lim_{T→∞} t/T = 1/E[X_{m:m}] ≈ λ/log m    (4.5)

Thus, if we use t ≈ Tλ/log m as an approximation for the number of completed iterations, then the error bound at time T is given by:

E[F(w_T)] − F(w*) − ηLσ²/(2cbm) ≤ (1 − ηc)^{Tλ/log m} (E[F(w_0)] − F(w*) − ηLσ²/(2cbm))    (4.6)
Observe from this bound that there is a trade-off between the speed of convergence and the error floor. A larger number of workers m results in slower convergence (a smaller exponent of (1 − ηc)) because the expected runtime per iteration is larger, but it achieves a smaller error floor. The wallclock time taken to achieve an error floor ε is O(log m/(λmε)), since the number of iterations is O(1/(mε)) and the expected runtime per iteration is O(log m/λ).
4.5 Straggler-Resilient Variants
A key takeaway from the runtime analysis of synchronous SGD presented in Sect. 4.4 above is that as the number of workers m increases, the tail latency of waiting for the slowest worker can
significantly slow down the error versus runtime convergence. To avoid waiting for straggling workers, we can design straggler-resilient variants of synchronous SGD that modify the gradient aggregation protocol so that the parameter server (PS) only needs to wait for a subset of, rather than all, the m workers. In this section, we present two such variants, proposed in [4], and their error and runtime analyses.
4.5.1 K-Synchronous SGD
The first straggler-resilient variant of synchronous SGD is called K-synchronous SGD, illustrated in Fig. 4.4. In each iteration of K-synchronous SGD:

1. The PS sends the current version of the model w_t to all m workers.
2. Each worker i computes a gradient g(w_t; ξ_i) using a mini-batch of b samples from its dataset partition D_i.
3. The parameter server waits until it receives gradients from any K of the m workers and discards the rest of the m − K gradients by canceling the gradient computation tasks at the slow workers. It uses these K gradients to update the model: w_{t+1} = w_t − η (1/K) Σ_{i=1}^{K} g(w_t; ξ_i).
PS
w0
w1
w2
PS
w0
w1
w2
PS
Worker 1
Worker 1
Worker 1
Worker 2
Worker 2
Worker 2
Worker 3
Worker 3
Worker 3
Fully Sync-SGD
K-Sync SGD
w0
w1
w2
K-Batch Sync SGD
Fig. 4.4 Illustration of synchronous SGD and its straggler-resilient variants K -synchronous SGD and K -batch-synchronous SGD
E[T_{K-sync}] = E[X_{K:m}]    (4.7)
             = 1/(mλ) + 1/((m − 1)λ) + · · · + 1/((m − K + 1)λ)    (4.8)
             = (1/λ)(H_m − H_{m−K})    (4.9)
             ≈ (1/λ) log(m/(m − K))    (4.10)
The proof follows from the memoryless property of the exponential distribution. The time taken for the fastest worker to finish is the minimum of m exponentials, which is also an exponential with rate mλ and mean 1/(mλ). Then by the memoryless property, after the fastest worker finishes, the residual time taken by the second fastest worker is also an exponential with rate (m − 1)λ and mean 1/((m − 1)λ). Continuing recursively, we get the expression in (4.9). Then using the approximation H_m ≈ log m, we get (4.10). From (4.10), observe that by setting K = Θ(m) we can cut down on the log m scaling of the runtime of synchronous SGD. After receiving the K gradients from the fastest workers, the PS cancels the computation at the other m − K workers. Thus only K mini-batches of data are processed in each iteration, which results in the averaged gradient having a larger variance than in fully synchronous distributed SGD. From the mini-batch SGD convergence analysis, it follows that the error after t iterations of K-synchronous SGD can be bounded as

E[F(w_t)] − F(w*) − ηLσ²/(2cbK) ≤ (1 − ηc)^t (E[F(w_0)] − F(w*) − ηLσ²/(2cbK)),    (4.11)

under the same assumptions as in Sect. 4.3. Thus, there is a trade-off between error and runtime as the parameter K changes. A smaller K will reduce the runtime per iteration but it will result in a higher error floor.
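The runtimes of K-synchronous and K-batch-synchronous SGD (the latter is introduced in the next subsection) can be compared by simulation. The sketch below (illustrative, assuming NumPy; m, λ, and the values of K are arbitrary choices) estimates E[T_K-sync] by waiting for the K-th fastest of m workers, and E[T_K-batch-sync] by letting every worker compute gradients back to back and stopping at the K-th completion overall:

import numpy as np

rng = np.random.default_rng(10)
m, lam, trials = 10, 1.0, 50_000
X = rng.exponential(1 / lam, size=(trials, m))           # one gradient time per worker per trial

for K in (2, 5, 10):
    # K-sync: wait for the K-th fastest of the m workers
    t_ksync = np.sort(X, axis=1)[:, K - 1].mean()
    # K-batch-sync: every worker computes gradients back to back;
    # the iteration ends at the K-th completion across all workers.
    completions = rng.exponential(1 / lam, size=(trials, m, K)).cumsum(axis=2)
    t_kbatch = np.sort(completions.reshape(trials, -1), axis=1)[:, K - 1].mean()
    print(f"K = {K:<3} E[T_K-sync] ~ {t_ksync:.3f}   "
          f"E[T_K-batch-sync] ~ {t_kbatch:.3f} (theory {K / (m * lam):.3f})")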
4.5.2 K-Batch-Synchronous SGD
While K-synchronous SGD reduces the straggling effect, there is still some idle time at the K − 1 fastest workers while the PS waits for the K-th fastest worker to finish its gradient computation. To overcome this residual straggling effect, the K-batch-synchronous variant of synchronous SGD allows fast workers to compute multiple mini-batch gradients per iteration, as illustrated in Fig. 4.4. Each iteration of K-batch-synchronous SGD proceeds as follows:

1. The PS sends the current version of the model w_t to all m workers.
2. Each worker i continuously computes one or more mini-batch gradients g(w_t; ξ_i) and sends them to the PS, until it receives a cancellation signal and/or the updated w_{t+1} from the PS.
3. The parameter server waits until it receives K gradients in total (no matter which workers send them). It then cancels all outstanding gradient computations at the workers, and updates the model: w_{t+1} = w_t − η (1/K) Σ_{i=1}^{K} g(w_t; ξ_i).

For a general probability distribution F_X, it is difficult to obtain a closed-form expression for the expected runtime of K-batch-sync SGD. However, for exponentially distributed gradient computation times X ∼ Exp(λ), we can show that the expected runtime per iteration of K-batch-synchronous SGD is given by

E[T_{K-batch-sync}] = K/(mλ).
(4.12)
Observe that in contrast to the runtime of synchronous SGD, which increases with m, E[T_{K-batch-sync}] decreases with the number of workers m. Thus, by enabling fast workers to steal work from slow workers, we can eliminate the tail latency due to straggling workers and achieve the ideal linear speed-up in the runtime as m increases. Under the same assumptions as Sect. 4.3, the error versus iterations convergence is the same as that of K-synchronous SGD given in (4.11). Putting the error and runtime analyses together, we get an error-runtime trade-off controlled by the parameter K, similar to the K-synchronous SGD variant. However, since the error versus iterations convergence is identical, but the expected runtime per iteration of K-batch-synchronous SGD is less than that of K-synchronous SGD, overall, K-batch-synchronous SGD has a better error versus runtime convergence.

Summary

In this chapter we introduced distributed implementations of SGD using the parameter server framework. Synchronous SGD is the standard approach to perform data-parallel distributed training. We analyzed how its error versus iterations convergence and expected runtime per iteration scale with the number of workers m. A key insight from this analysis is that the runtime per iteration increases with m due to straggling workers. To overcome this tail latency we presented two variants, K-synchronous and K-batch-synchronous SGD, that relax the synchronization protocol, allowing the PS to only wait for a subset of the workers' gradients in each iteration. By changing the parameter K we can control the degree of synchronization, and the resulting trade-off between error convergence and runtime.

Problems

1. For exponential gradient computation times X ∼ Exp(λ), derive the expression for the variance of the runtime per iteration T_sync and analyze how it scales with the number of workers m.
2. By simulating and plotting the expected runtime per iteration of synchronous SGD for Pareto distributed X ∼ Pareto(x_m, α), compare its scaling with m for different values of the shape parameter α.
3. For the simulation setup used above, implement K-sync and K-batch-synchronous SGD and compare their expected runtime per iteration versus K.
References

1. J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, "Large scale distributed deep networks," in Proceedings of the International Conference on Neural Information Processing Systems, 2012, pp. 1223–1231.
2. J. Dean and L. A. Barroso, "The tail at scale," Communications of the ACM, vol. 56, no. 2, pp. 74–80, 2013.
3. J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008.
4. S. Dutta, G. Joshi, S. Ghosh, P. Dube, and P. Nagpurkar, "Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD," International Conference on Artificial Intelligence and Statistics (AISTATS), Apr. 2018. [Online]. Available: https://arxiv.org/abs/1803.01113
5 Asynchronous SGD and Staleness-Reduced Variants
In Chap. 4 we saw that synchronous distributed SGD suffers from tail latency due to straggling workers. Its straggler-resilient variants K -sync and K -batch-sync SGD overcome the straggling delay, but they end up discarding partially completed gradient computation at one or more workers, thus reducing the statistical efficiency of the algorithm and adversely affecting the error convergence. In this chapter, we study the class of asynchronous SGD algorithms that relax the need for all workers to compute gradients for the same synchronized version of the model w. First popularized by [1, 2], asynchronous SGD has been extensively studied in recent distributed ML literature [3–8]. Allowing workers to have different versions of the model causes staleness in some of the gradients. In Sect. 5.3 we analyze how staleness affects error convergence, and in Sect. 5.2 we analyze the runtime per iteration of asynchronous SGD. To limit the gradient staleness while preserving the runtime benefits of asynchronous SGD, we study some staleness-reduced variants in Sect. 5.4.
5.1
The Asynchronous SGD Algorithm
Similar to synchronous distributed SGD, asynchronous SGD is a data-parallel distributed training algorithm that operates within the parameter server framework. The training dataset D is shuffled and uniformly partitioned into subsets D1 , . . . , Dm that are stored at worker nodes 1, 2, . . . , m respectively. In asynchronous SGD, each worker i, for i = 1, . . . , m operates independently and does the following:
Fig. 5.1 In asynchronous SGD, whenever a worker sends its gradient to the PS, the PS updates the model w and sends the new version only to that worker. Other workers continue computing gradients at older versions of the model
1. Pull the current version of the model w_t from the parameter server.
2. Compute the mini-batch gradient g(w_t; ξ_i) using a mini-batch of data sampled from its local dataset partition D_i.
3. Push the computed gradient to the parameter server.

As soon as the parameter server receives a gradient from one of the workers, it updates the model and sends the new version to that worker. The local gradient computation and communication time can fluctuate across mini-batches and workers. Thus, while worker i is completing its gradient computation, one or more other workers may update the model at the parameter server. As a result, worker i's gradient would be computed at a stale version of the model w (Fig. 5.1). More formally, if w_t is the current version of the model at the parameter server, it may receive from the worker a (potentially stale) gradient g(w_{τ(t)}) for some index τ(t) ≤ t. Thus, the asynchronous SGD update rule is:

w_{t+1} = w_t − η g(w_{τ(t)}; ξ).   (5.1)
The index τ (t) is a random variable that depends on the distribution of the gradient computation time X at each worker, and the number of workers.
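To make the role of the staleness index τ(t) concrete, the following short Python sketch (our own illustrative code, not from any implementation cited in this chapter) simulates asynchronous SGD with a parameter server on a toy least-squares problem, using exponentially distributed gradient computation times; the names simulate_async_sgd and stoch_grad are hypothetical.

import heapq
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 1000, 10, 4                          # samples, model dimension, workers
A = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = A @ w_true + 0.1 * rng.standard_normal(n)

def stoch_grad(w, batch=32):
    # mini-batch gradient of F(w) = (1/2b)||A_batch w - y_batch||^2
    idx = rng.integers(0, n, batch)
    return A[idx].T @ (A[idx] @ w - y[idx]) / batch

def simulate_async_sgd(total_iters=2000, eta=0.05, rate=1.0):
    w = np.zeros(d)
    t, uid = 0, 0                              # PS iteration counter, event tie-breaker
    events, staleness = [], []
    for i in range(m):                         # each worker reads w (version 0) and starts computing
        heapq.heappush(events, (rng.exponential(1 / rate), uid, i, t, stoch_grad(w))); uid += 1
    while t < total_iters:
        finish, _, i, version, g = heapq.heappop(events)
        staleness.append(t - version)          # this gradient was computed at w_{tau(t)} with tau(t) = version
        w = w - eta * g                        # PS update, Eq. (5.1)
        t += 1
        # worker i pulls the new model and starts its next gradient computation
        heapq.heappush(events, (finish + rng.exponential(1 / rate), uid, i, t, stoch_grad(w))); uid += 1
    return w, float(np.mean(staleness))

w_hat, avg_staleness = simulate_async_sgd()
print("average staleness:", avg_staleness)     # close to m - 1 for exponential compute times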
5.1.1
Comparison with Synchronous SGD
In each iteration of synchronous SGD, the parameter server waits for all m workers to send one mini-batch gradient each. Thus, m mini-batches of b samples each are processed in every iteration. In contrast, in asynchronous SGD, the parameter server only waits for one worker at a time to send its gradient and thus it can complete significantly more iterations in the same time. However, only one mini-batch is processed in each iteration and the gradient
used to update the model wt can also be a stale gradient computed at a previous version wτ (t) . In the rest of this chapter, we will take a deep dive into the error-runtime trade-off of asynchronous SGD and understand how it compares with synchronous SGD and its variants.
5.2
Runtime Analysis
Synchronous SGD and its variants suffer from inefficiency due to idle time at fast workers and discarded partial computation at slow workers. By removing the synchronization barrier, asynchronous SGD eliminates this runtime inefficiency, albeit at the cost of gradient staleness. For gradient computation time X ∼ F_X, the expected runtime per iteration is given by

E[T_async] = E[X]/m.   (5.2)
The proof follows from the elementary renewal theorem [9, Chap. 5]. For the i-th worker, let A_i(t) be the number of gradients it pushes to the PS in time t. The time between two consecutive gradient pushes is an independent realization X_i ∼ F_X, whose mean is E[X]. By the elementary renewal theorem we have:

lim_{t→∞} A_i(t)/t = 1/E[X]   (5.3)

Since the m workers push gradients independently, the total number of gradient pushes Σ_{i=1}^m A_i(t) in time t, which is also equal to the number of iterations completed in time t, is a superposition of m renewal processes and it satisfies:

lim_{t→∞} Σ_{i=1}^m A_i(t)/t = m/E[X]   (5.4)

Thus the expected runtime per iteration is equal to E[X]/m.

5.2.1
Runtime Speed-Up Compared to Synchronous SGD
For exponentially distributed gradient computation times X ∼ Exp(λ), the ratio of the expected runtimes per iteration of synchronous and asynchronous SGD is:

E[T_sync]/E[T_async] ≈ m log m   (5.5)

Thus, asynchronous SGD gives an O(m log m) runtime speed-up and it can be dramatically faster than synchronous SGD for large m.
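As a quick numerical check of this speed-up, the sketch below (our own, assuming i.i.d. Exp(λ) computation times) compares the simulated E[T_sync] = E[max of m exponentials] against E[T_async] = E[X]/m from (5.2); the exact ratio is m·H_m, which grows as m log m.

import numpy as np

rng = np.random.default_rng(1)
lam, trials = 1.0, 100_000
for m in (4, 16, 64):
    X = rng.exponential(1 / lam, size=(trials, m))
    T_sync = X.max(axis=1).mean()              # synchronous SGD waits for the slowest worker
    T_async = 1 / (m * lam)                    # renewal result E[X]/m from (5.2)
    H_m = np.sum(1 / np.arange(1, m + 1))      # m-th harmonic number
    print(f"m={m}: simulated ratio {T_sync / T_async:.1f}, m*H_m = {m * H_m:.1f}")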
5.3
Convergence Analysis
Next, let us analyze how the error versus iterations convergence of asynchronous SGD is affected by the staleness in gradients returned by the worker nodes. The convergence result presented in Theorem 5.1 requires the following assumptions. They are similar to the assumptions that we made in the single-node SGD and distributed synchronous SGD analyses, except that we need an additional bound on the staleness of gradients. 1. Lipschitz Smoothness: F(w) is an L-smooth function. Thus,

||∇F(w) − ∇F(w′)||₂ ≤ L||w − w′||₂  ∀ w, w′.   (5.6)
2. Strong Convexity: F(w) is strongly convex with parameter c. Thus,

2c(F(w) − F*) ≤ ||∇F(w)||₂²  ∀ w.   (5.7)
3. Unbiased Stochastic Gradients: The stochastic gradient is an unbiased estimate of the true gradient:

E_{ξ_j|w_k}[g(w_k, ξ_j)] = ∇F(w_k)  ∀ k ≤ j.   (5.8)

Observe that this is slightly different from the common assumption that says E_{ξ_j}[g(w, ξ_j)] = ∇F(w) for all w. Observe that w_j for j > k is actually not independent of the data ξ_j. We thus make the assumption more rigorous by conditioning on w_k for k ≤ j. Our requirement k ≤ j means that w_k is the value of the parameter at the PS before the data ξ_j was accessed and can thus be assumed to be independent of the data ξ_j.

4. Bounded Gradient Variance: We assume that the variance of the stochastic update given w_k at iteration k before the data point was accessed is also bounded as follows:

E_{ξ_j|w_k}[||g(w_k, ξ_j) − ∇F(w_k)||₂²] ≤ σ²/b  ∀ k ≤ j.   (5.9)
5. Bounded Staleness: We assume that for some γ ≤ 1, E[||∇F(w_j) − ∇F(w_{τ(j)})||₂²] ≤ γ E[||∇F(w_j)||₂²]. Recall that τ(j), the index of the model version at the worker that returns a gradient at iteration j, is a random variable. Its distribution implies a parameter p_0, a lower bound on the probability of receiving a fresh gradient in an iteration, as we show in Lemma 5.1 below. This parameter will feature in the error bound in Theorem 5.1.
Lemma 5.1 Suppose that p_0^{(j)} is the conditional probability that τ(j) = j given all the past delays and all the previous w, and p_0^{(j)} ≥ p_0 for all j. Then,

E[||∇F(w_{τ(j)})||₂²] ≥ p_0 E[||∇F(w_j)||₂²].   (5.10)
Proof By the law of total expectation,

E[||∇F(w_{τ(j)})||₂²] = p_0^{(j)} E[||∇F(w_{τ(j)})||₂² | τ(j) = j] + (1 − p_0^{(j)}) E[||∇F(w_{τ(j)})||₂² | τ(j) ≠ j]
≥ p_0 E[||∇F(w_j)||₂²].

If the gradient computation time X at each worker follows an exponential distribution, then we can show that τ(j) is geometrically distributed, and p_0 = 1/m. This is because, when the gradient computation times are exponentially distributed, the number of gradients received by the parameter server within a time window follows the Poisson distribution, and each gradient push is equally likely to come from any of the m worker nodes. With the above assumptions, we can show the following convergence bound on the error after t iterations of asynchronous SGD.

Theorem 5.1 (Error convergence of asynchronous SGD) Suppose the objective F(w) is c-strongly convex and the learning rate η ≤ 1/(2L). Also assume that for some γ ≤ 1, E[||∇F(w_j) − ∇F(w_{τ(j)})||₂²] ≤ γ E[||∇F(w_j)||₂²]. Then, the error after t iterations is

E[F(w_t)] − F* ≤ ηLσ²/(2cγ′b) + (1 − ηcγ′)^t (E[F(w_0)] − F* − ηLσ²/(2cγ′b))   (5.11)

where γ′ = 1 − γ + p_0/2.
Proof We start with the L-Lipschitz smoothness assumption, which implies the following:

F(w_{j+1}) ≤ F(w_j) + (w_{j+1} − w_j)ᵀ∇F(w_j) + (L/2)||w_{j+1} − w_j||₂²
= F(w_j) + (−η g(w_{τ(j)}))ᵀ∇F(w_j) + (Lη²/2)||g(w_{τ(j)})||₂²
= F(w_j) − (η/2)||∇F(w_j)||₂² − (η/2)||g(w_{τ(j)})||₂² + (η/2)||∇F(w_j) − g(w_{τ(j)})||₂² + (Lη²/2)||g(w_{τ(j)})||₂²,   (5.12)
where the last line follows from 2aᵀb = ||a||₂² + ||b||₂² − ||a − b||₂². Taking expectation on both sides we have

E[F(w_{j+1})] − E[F(w_j)]   (5.13)
≤ −(η/2)E[||∇F(w_j)||₂²] − (η/2)E[||g(w_{τ(j)})||₂²] + (η/2)E[||∇F(w_j) − g(w_{τ(j)})||₂²] + (Lη²/2)E[||g(w_{τ(j)})||₂²]   (5.14)
≤ −(η/2)E[||∇F(w_j)||₂²] − (η/2)E[||g(w_{τ(j)})||₂²] + (Lη²/2)E[||g(w_{τ(j)})||₂²] + (η/2)E[||∇F(w_j) − ∇F(w_{τ(j)}) + ∇F(w_{τ(j)}) − g(w_{τ(j)})||₂²]   (5.15)
≤ −(η/2)E[||∇F(w_j)||₂²] − (η/2)E[||g(w_{τ(j)})||₂²] + (Lη²/2)E[||g(w_{τ(j)})||₂²] + (η/2)E[||∇F(w_j) − ∇F(w_{τ(j)})||₂²] + (η/2)E[||∇F(w_{τ(j)}) − g(w_{τ(j)})||₂²]   (5.16)
≤ −(η/2)E[||∇F(w_j)||₂²] − (η/2)E[||g(w_{τ(j)})||₂²] + (Lη²/2)E[||g(w_{τ(j)})||₂²] + (η/2)E[||∇F(w_j) − ∇F(w_{τ(j)})||₂²] + (η/2)E[||g(w_{τ(j)})||₂²] − (η/2)E[||∇F(w_{τ(j)})||₂²]   (5.17)
≤ −(η/2)E[||∇F(w_j)||₂²] − (η/2)E[||∇F(w_{τ(j)})||₂²] + (Lη²/2)E[||g(w_{τ(j)})||₂²] + (η/2)E[||∇F(w_j) − ∇F(w_{τ(j)})||₂²]   (5.18)
where
• in (5.15) we add and subtract the full (potentially stale) gradient ∇F(w_{τ(j)}) in the term with the difference of the full fresh gradient ∇F(w_j) and the stochastic (potentially stale) gradient g(w_{τ(j)}),
• in (5.16) we open the norm-squared of the summation, and use the fact that the cross term is zero because E[∇F(w_{τ(j)}) − g(w_{τ(j)})] = 0 as a result of the unbiased gradient assumption,
• to get (5.17) we further expand the last term in (5.16) and use the unbiased gradient assumption, which implies that the cross term −2∇F(w_{τ(j)})ᵀE[g(w_{τ(j)})] = −2||∇F(w_{τ(j)})||₂², and
• finally, to get (5.18) we cancel out the two terms −(η/2)E[||g(w_{τ(j)})||₂²] and (η/2)E[||g(w_{τ(j)})||₂²].

Next we use the bounded variance and bounded staleness assumptions to further simplify the right hand side as follows
E[F(w_{j+1})] − E[F(w_j)]   (5.19)
≤ −(η/2)(1 − γ)E[||∇F(w_j)||₂²] + Lη²σ²/(2b) − (η/2)(1 − ηL)E[||∇F(w_{τ(j)})||₂²]   (5.20)
≤ −(η/2)(1 − γ)E[||∇F(w_j)||₂²] + Lη²σ²/(2b) − (η p_0/4)E[||∇F(w_j)||₂²],   (5.21)

where in (5.21) we use Lemma 5.1 together with η ≤ 1/(2L). Next, we will use the assumption that the objective function F(w) is c-strongly convex, which implies that

2c(F(w) − F*) ≤ ||∇F(w)||₂²  ∀ w   (5.22)

Using this result in (5.21), we obtain the following:

E[F(w_{j+1})] − F* ≤ η²Lσ²/(2b) + (1 − ηc(1 − γ + p_0/2))(E[F(w_j)] − F*).

Let us denote γ′ = (1 − γ + p_0/2). Then, from the above recursion we get

E[F(w_t)] − F* ≤ ηLσ²/(2cγ′b) + (1 − ηγ′c)^t (E[F(w_0)] − F* − ηLσ²/(2cγ′b)).

5.3.1
Implications of the Asynchronous SGD Convergence Bound
Recall that γ is a measure of the staleness of the gradients, with larger γ indicating more staleness. Similarly, a smaller p_0 also indicates more staleness because the probability of getting a fresh gradient is lower bounded by a smaller quantity. The convergence bound in (5.11) shows the effect of staleness on the convergence speed and the error floor. The parameter γ′ = (1 − γ + p_0/2) is smaller when there is higher staleness. Smaller γ′ will result in a slower convergence speed, that is, larger (1 − ηγ′c), and a higher error floor, that is, larger ηLσ²/(2cγ′b). If we use an appropriate O(1/t) learning rate decay, then the overall error convergence rate is O(1/t), that is, it takes O(1/ε) iterations to reach an error floor of ε. Although the error versus iterations convergence of asynchronous SGD may be slower than that of synchronous SGD, because the runtime per iteration is significantly smaller, the error versus wallclock runtime convergence may still favor asynchronous SGD, as shown in Fig. 5.2.
5.4
Staleness-Reduced Variants of Asynchronous SGD
Synchronous SGD and asynchronous SGD are at two extremes of the error-runtime tradeoff. Synchronous SGD has a lower error floor at convergence but suffers from tail latency and has a higher runtime per iteration. Asynchronous SGD sharply reduces the runtime
per iteration by removing the synchronization barrier at the PS, but it results in a higher error floor due to staleness in the gradients returned by workers to the PS. In this section, we propose two staleness-reduced variants of asynchronous SGD that span the spectrum between asynchronous and synchronous SGD.

Fig. 5.2 Theoretical error-runtime trade-off (log loss versus time) for synchronous and asynchronous SGD with the same η. Asynchronous SGD has faster decay with time but a higher error floor
5.4.1
K -Asynchronous SGD
The first staleness-reduced variant of asynchronous SGD is called K-asynchronous SGD, illustrated in Fig. 5.3, and suggested in [5, 7]. Similar to asynchronous SGD, each worker i, for i = 1, . . . , m, operates independently. After receiving the current version of the model w_t from the parameter server, it computes the mini-batch gradient g(w_t; ξ_i) using a mini-batch of data sampled from its local dataset partition D_i and sends it to the parameter server. The PS waits for any K out of m workers to send gradients and then updates the model w. However, it does not cancel the gradient computation at the remaining m − K workers. As a result, for every update the gradients returned by each worker might be computed at a stale or older value of the parameter w. The update rule is thus given by:

w_{j+1} = w_j − (η/K) Σ_{k=1}^{K} g(w_{τ(k,j)}, ξ_{k,j}).   (5.23)
Here k = 1, 2, . . . , K denotes the index of one of the K workers that contribute to the update at the corresponding iteration, ξ_{k,j} is the mini-batch of b samples used by the k-th worker at the j-th iteration, and τ(k,j) denotes the iteration index when the k-th worker last read from the central PS, where τ(k,j) ≤ j. Also, g(w_{τ(k,j)}, ξ_{k,j}) = (1/b) Σ_{ξ∈ξ_{k,j}} ∇f(w_{τ(k,j)}, ξ) is the average gradient of the loss function evaluated over the mini-batch ξ_{k,j} based on the stale value of the parameter w_{τ(k,j)}. Asynchronous SGD is a special case of K-asynchronous SGD when K = 1, and synchronous SGD is a special case when K = m.
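As a small illustration of the update rule (5.23), the PS step can be written as follows (a hypothetical helper of our own, not an implementation from [5, 7]):

import numpy as np

def k_async_step(w, gradients, eta):
    """One K-async PS update: 'gradients' holds the K mini-batch gradients that
    arrived first in this iteration, each possibly computed at a stale model."""
    K = len(gradients)
    return w - (eta / K) * np.sum(gradients, axis=0)

# usage: with K = 2 arrived gradients g1, g2, the PS applies w <- w - (eta/2)(g1 + g2)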
Fig. 5.3 Asynchronous SGD and its staleness-reduced variants K-asynchronous and K-batch-asynchronous SGD for K = 2 and m = 3
5.4.1.1 Error Convergence of K-Asynchronous SGD
The convergence analysis of K-asynchronous SGD is similar to that of asynchronous SGD, except that the staleness parameter γ is smaller for larger K, and the mini-batch size is Kb instead of b. Thus, after t iterations, the error of K-asynchronous SGD is bounded as:

E[F(w_t)] − F* ≤ ηLσ²/(2cγ′Kb) + (1 − ηcγ′)^t (E[F(w_0)] − F* − ηLσ²/(2cγ′Kb))   (5.24)

where γ′ = 1 − γ + p_0/2. The proof is a generalization of the proof of Theorem 5.1, and it can be found in [7]. From (5.24), we can see that the error floor can be reduced by increasing the value of K without affecting the convergence speed. However, increasing K can increase the expected runtime per iteration, as we see below.
5.4.1.2 Runtime per Iteration of K -Asynchronous SGD
Recall from (5.2) that the expected runtime per iteration of asynchronous SGD is E[T_async] = E[X]/m. The parameter K controls the number of gradients that the PS waits for before updating the model w. Increasing K increases the runtime per iteration. While it is difficult to find a closed-form expression for a general distribution F_X, we can evaluate it for exponentially distributed gradient computation times X ∼ Exp(λ). Due to the memoryless property of the exponential distribution, starting from the beginning of each iteration, the time taken by each worker to finish is X ∼ Exp(λ), irrespective of whether the worker is computing a fresh gradient on the updated model or a stale gradient on a previous model version. Thus,

E[T_{K-async}] = E[X_{K:m}] = (H_m − H_{m−K})/λ   (5.25)
where X_{K:m} is the K-th order statistic of m i.i.d. random variables X_1, X_2, . . . , X_m, and H_n = Σ_{i=1}^n 1/i is the n-th harmonic number. We can also show that if X has a new-longer-than-used distribution, then the expected runtime per iteration of K-async SGD is upper-bounded as E[T] ≤ E[X_{K:m}].
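The order-statistic expression above is easy to check numerically; the sketch below (our own, assuming the harmonic-number form of E[X_{K:m}] for exponential X) compares it with a simulation:

import numpy as np

rng = np.random.default_rng(2)
lam, m, trials = 1.0, 8, 200_000
H = lambda n: float(np.sum(1 / np.arange(1, n + 1))) if n > 0 else 0.0
X = rng.exponential(1 / lam, size=(trials, m))
for K in (1, 4, 8):
    T_sim = np.sort(X, axis=1)[:, K - 1].mean()     # K-th order statistic X_{K:m}
    T_formula = (H(m) - H(m - K)) / lam             # expression in (5.25)
    print(f"K={K}: simulated {T_sim:.3f}, formula {T_formula:.3f}")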
5.4.2
K -Batch-Asynchronous SGD
While K-asynchronous SGD reduces staleness, it introduces some idle time at the K − 1 fastest workers while the PS waits for the K-th fastest worker to finish its gradient computation. To overcome this residual straggling effect, the K-batch-asynchronous SGD variant allows fast workers to compute multiple mini-batch gradients per iteration, as illustrated in Fig. 5.3. In K-batch-asynchronous SGD, whenever any worker finishes one gradient computation, it pushes that gradient to the PS, fetches the current parameters from the PS and starts computing the gradient on the next mini-batch. The PS waits for K mini-batch gradients before updating itself, irrespective of which workers they come from. The update rule is similar to Eq. (5.23), except that now k denotes the indices of the K mini-batches that finish first instead of the workers, and w_{τ(k,j)} denotes the version of the parameter when the worker computing the k-th mini-batch last read from the PS. While the error convergence of K-batch-async is similar to K-async, it reduces the runtime per iteration as no worker is idle.

Lemma 5.2 (Runtime of K-batch-async SGD) The expected runtime per iteration for K-batch-async SGD in the limit of a large number of iterations is given by:

E[T] = K E[X]/m.   (5.26)
Proof (Proof of Lemma 5.2) For the i-th worker, let {N_i(t), t > 0} be the number of times the i-th worker pushes its gradient to the PS in time t. The time between two pushes is an independent realization of X_i. Thus, the inter-arrival times X_i^{(1)}, X_i^{(2)}, . . . are i.i.d. with mean inter-arrival time E[X_i]. Using the elementary renewal theorem [9, Chap. 5] we have,

lim_{t→∞} E[N_i(t)]/t = 1/E[X_i].   (5.27)

Thus, the rate of gradient pushes by the i-th worker is 1/E[X_i]. As there are m workers, we have a superposition of m renewal processes and thus the average rate of gradient pushes to the PS is

lim_{t→∞} Σ_{i=1}^m E[N_i(t)]/t = Σ_{i=1}^m 1/E[X_i] = m/E[X].   (5.28)

Every K pushes constitute one iteration. Thus, the expected runtime per iteration, or effectively the expected time for K pushes, is given by E[T] = K E[X]/m.
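Lemma 5.2 can also be verified with a short renewal-process simulation (an illustrative sketch assuming Exp(1) computation times; the limit K E[X]/m holds for any distribution):

import numpy as np

rng = np.random.default_rng(3)
m, K, n_pushes = 8, 4, 50_000
next_finish = rng.exponential(1.0, m)            # each worker is a renewal process
push_times = np.empty(n_pushes)
for p in range(n_pushes):
    i = int(np.argmin(next_finish))              # the next gradient push comes from worker i
    push_times[p] = next_finish[i]
    next_finish[i] += rng.exponential(1.0)       # worker i immediately starts its next mini-batch
iter_times = push_times[K - 1::K]                # every K-th push completes one iteration
print(np.diff(iter_times).mean(), "vs K*E[X]/m =", K * 1.0 / m)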
Fig. 5.4 Error at convergence versus wall clock time for asynchronous SGD and its staleness-reduced variants: K-async and K-batch-async SGD (K = 2, 3, 4), batch-synchronous, and fully synchronous SGD
5.5
Adaptive Methods to Improve the Error-Runtime Trade-Off
Together, the K-asynchronous and K-batch-asynchronous variants of asynchronous SGD span different points on the trade-off between the error floor at convergence and the wallclock runtime spent per iteration, as illustrated in Fig. 5.4. The trade-off is always better for the K-batch-asynchronous variants because they eliminate the idle time at the fast workers. By choosing a suitable value of K, we can balance the error floor and runtime.
5.5.1
Adaptive Synchronization
What if we do not have to keep K constant throughout the training process? By dynamically adapting K during training, we can potentially achieve the best of both worlds, a low error floor as well as a fast runtime per iteration. The paper [8] proposes AdaSync, a method that uses the theoretical characterization of the error-runtime trade-off to decide how to gradually increase K during the course of training. As a result, we gradually transition from fully asynchronous SGD (K = 1) to fully synchronous SGD (K = m). A similar method called AdaComm is proposed for local-update SGD in Chap. 6. To obtain a schedule for K, let us partition the training time into intervals of time T_0 each. The number of iterations performed within time t is assumed to be approximately N(t) ≈ t/E[T], where E[T] is the expected runtime per iteration of the chosen SGD variant. We can write a (heuristic) upper bound on the average of E[||∇F(w)||₂²] within each time interval T_0 as follows:

u(K) = 2(F(w_start) − F*)E[T]/(T_0ηγ′) + Lησ²/(Kbγ′),

where w_start denotes the value of the model w at the beginning of that time interval. Our goal is to minimize u(K) with respect to K for each time interval. Observe that,

∂u(K)/∂K = (2(F(w_start) − F*)/(T_0ηγ′)) ∂E[T]/∂K − Lησ²/(K²bγ′).   (5.29)
Setting ∂u(K)/∂K to 0 therefore provides a rough heuristic on how to choose the parameter K for each time interval, as long as ∂²u(K)/∂K² is positive. We approximate E[T] ≈ K/m + K log m/(mμ). This leads to

K² = Lη²σ²T_0 mμ / (2m F(w_start)(μ + log m)).   (5.30)
To actually solve this equation, we would need the values of the Lipschitz constant, the variance of the gradient, etc., which are not always available. We sidestep this issue by proposing a heuristic here that relies on the ratio of the parameter K at different instants. For K-async SGD, the schedule takes the following form:

K ∼ K_0 √(F(w_0)/F(w_t))   (5.31)

where the initial value K_0 is optimized by grid search. We evaluate the effectiveness of AdaSync for both K-sync SGD and K-async SGD algorithms. An exponential delay with mean 0.02 s is added to each worker node independently. We keep K fixed within each interval of 60 s (about 10 epochs). The initial values of K are fine-tuned and set to 2 and 4 for K-sync SGD and K-async SGD, respectively. As shown in Fig. 5.5, the adaptive strategy achieves the fastest convergence in terms of error-versus-time. The adaptive K-async algorithm can even achieve a better error-runtime trade-off than the K = 8 case (i.e., the fully synchronous case).
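A minimal sketch of this schedule (illustrative; the clipping of K to [1, m] is an assumption on top of Eq. (5.31) as reconstructed above) is:

import math

def adasync_K(K0, loss_init, loss_now, m):
    """Recompute K at the start of each time interval using Eq. (5.31)."""
    K = K0 * math.sqrt(loss_init / loss_now)   # K grows as the training loss decreases
    return max(1, min(m, round(K)))

# example: with K0 = 4 and m = 8 workers, a 4x drop in the loss doubles K
print(adasync_K(4, 2.0, 0.5, m=8))             # -> 8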
Fig. 5.5 Test error versus wall clock time (in minutes) of AdaSync SGD on CIFAR-10 with 8 worker nodes, for a adaptive K-sync SGD and b adaptive K-async SGD, compared with fixed K = 2, 4, 8. We add an exponential delay with mean 0.02 s on each worker. The value of K is changed after every 60 s
5.5.2
Adaptive Learning Rate Schedule to Compensate Staleness
The staleness of the gradient is random, and can vary across iterations. Intuitively, if the gradient is less stale, we want to weigh it more while updating the parameter w, and if it is more stale we want to scale down its contribution to the update. With this motivation, we propose the following condition on the learning rate at different iterations:

η_j E[||w_j − w_{τ(j)}||₂²] ≤ C   (5.32)

for a constant C. In our analysis of asynchronous SGD, we observe that the term (η/2)E[||∇F(w_j) − ∇F(w_{τ(j)})||₂²] is the most difficult to bound. For a fixed learning rate, we had assumed that E[||∇F(w_j) − ∇F(w_{τ(j)})||₂²] is bounded by γ E[||∇F(w_j)||₂²]. However, if we impose the condition in Eq. (5.32) on η, we do not require this assumption. Our proposed condition actually provides a bound for the staleness term as follows:

(η_j/2) E[||∇F(w_j) − ∇F(w_{τ(j)})||₂²] ≤ (η_j L²/2) E[||w_j − w_{τ(j)}||₂²] ≤ CL²/2.   (5.33)

Inspired by this analysis, we propose the learning rate schedule

η_j = min{ C/||w_j − w_{τ(j)}||₂², η_max }   (5.34)
where η_max is a suitably large ceiling on the learning rate. It ensures stability when the first term in (5.34) becomes large due to the staleness ||w_j − w_{τ(j)}||₂ being small. The constant C is chosen to be of the same order as the desired error floor. To implement this schedule, the PS needs to store the last-read model parameters for every worker. In Fig. 5.6 we illustrate how this schedule can stabilize asynchronous SGD. We also show simulation results that characterize the performance of this algorithm in comparison with naive asynchronous SGD with a fixed learning rate. The idea of a variable learning rate is related to the idea of momentum tuning in [4, 10] and may have a similar effect of stabilizing the convergence of asynchronous SGD. However, learning rate tuning is arguably more general since asynchrony results in a momentum term
in the gradient update (as shown in [4, 10]) only under the assumption that the staleness process is geometric and independent of w.

Fig. 5.6 Adaptive learning rate schedule
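A compact sketch of the schedule in Eq. (5.34) (illustrative names; the PS is assumed to keep, for each worker, the copy of the model it last sent to that worker) is:

import numpy as np

def staleness_aware_lr(w_current, w_last_read, C, eta_max):
    """Learning rate for a gradient computed at w_last_read when the PS model is w_current."""
    drift = float(np.sum((w_current - w_last_read) ** 2))   # ||w_j - w_{tau(j)}||_2^2
    if drift == 0.0:
        return eta_max                                      # fresh gradient: use the full step
    return min(C / drift, eta_max)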
5.6
HogWild and Lock-Free Parallelism
In the asynchronous SGD algorithms that we have considered in this chapter so far, the entire parameter vector w is updated at the same time using gradients from one or more workers. However, in some applications, the objective function F(w) may be a sum of sparse functions that each depend only on a subset of the parameters, that is,

F(w) = Σ_{e∈E} f_e(w_e)   (5.35)
where e represents a subset of the indices {1, 2, . . . , d} of the parameter vector w. The function induces a hypergraph (a graph where an edge can connect more than two vertices) where each hyperedge e connects the set of vertices w_e. For example, each hyperedge is represented by a different color in the illustration of the sparse functions in Fig. 5.7. Some practical applications where the objective function takes this sparse form include sparse support vector machines, matrix completion and graph-cut problems. The paper [2] proposes an algorithm called HogWild that lets workers update different parts of w in a lock-free and asynchronous fashion. The parameter vector is stored in shared memory and it can be accessed by all the workers. Each worker asynchronously does the following: 1. Sample e uniformly at random from the edge set E. 2. Read the subset of parameters w_e from the shared memory and evaluate the gradient g_e(w_e) = ∇f_e(w_e). 3. Update the indices in e: w_e ← w_e − η g_e(w_e).
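The following toy sketch (our own, not the code of [2]) illustrates these lock-free HogWild-style updates with Python threads on a synthetic sparse objective F(w) = Σ_e ½||w_e − c_e||²; each update touches only the coordinates in the sampled hyperedge e, and no locks are taken on the shared vector w.

import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(4)
d, n_edges, eta, steps = 100, 500, 0.1, 2000
edges = [rng.choice(d, size=3, replace=False) for _ in range(n_edges)]
targets = [rng.standard_normal(3) for _ in range(n_edges)]
w = np.zeros(d)                                  # shared parameter vector, no locking

def objective():
    return 0.5 * sum(np.sum((w[e] - c) ** 2) for e, c in zip(edges, targets))

def worker(seed):
    local = np.random.default_rng(seed)
    for _ in range(steps):
        k = int(local.integers(n_edges))         # 1. sample a hyperedge e uniformly at random
        e, c = edges[k], targets[k]
        g = w[e] - c                             # 2. gradient of f_e with respect to w_e
        w[e] -= eta * g                          # 3. lock-free update of only the coordinates in e

print("objective before:", objective())
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(worker, s) for s in range(4)]
    for f in futures:
        f.result()
print("objective after:", objective())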
Fig. 5.7 Illustration of the sparse components of the objective function F(w), e.g., f_1(w_1, w_2), f_2(w_2, w_4), f_3(w_{d−3}, w_{d−2}), f_4(w_d)
Observe that depending on the relative speeds of the worker nodes, different parts of the parameter vector w can have different levels of staleness. Suppose a worker reads a parameter w_i; then by the time it finishes its gradient computation and updates the parameter, one or more other workers may have already updated w_i to get w_j for some j > i. Thus, the gradient sent by that worker may be g(w_{τ(j)}), where τ(j) < j is the index of the weight at which that gradient is computed and j − τ(j) measures the amount of staleness. To prove the convergence of this algorithm, the authors assume that the staleness j − τ(j) of the parameter vector is bounded above by a value B. Under these assumptions, for c-strongly convex and L-Lipschitz smooth functions, they can show an O(1/t) rate of convergence, similar to the rate achieved by asynchronous SGD in Theorem 5.1. The constants in the convergence upper bound depend on the sparsity of the hypergraph induced by the objective function defined in (5.35).
Summary In this chapter we considered asynchronous SGD, which relaxes the synchronization barrier in synchronous SGD and allows the PS to move forward and update the model without waiting for all workers. Unlike the straggler-resilient variants of synchronous SGD, asynchronous SGD does not cancel the gradient computations at slow workers and instead allows them to send gradients computed at outdated versions of the model. While this reduces wasted computation at the workers, it can introduce staleness in the gradients that are used to update the model. We analyzed the convergence of asynchronous SGD by accounting for this staleness, to show that it can result in a higher error floor than synchronous SGD. We proposed staleness-reduced variants, K-asynchronous SGD and K-batch-asynchronous SGD, which span different points on the trade-off between the error floor and runtime per iteration. Finally, we presented two adaptive methods to overcome staleness: an adaptive synchronization method that dynamically adjusts K during the course of training, and a learning rate schedule that mitigates the adverse effect of stale gradients by scaling down the updates applied with them.
Synchronous and asynchronous SGD are the backbone of industrial ML implementations today. One of the earliest industrial implementations is Google's DistBelief [11]. DistBelief popularized the data-parallel and model-parallel parameter server framework. Their algorithm Downpour SGD is based on asynchronous SGD. More recently, specialized hardware such as Tensor Processing Units (TPUs) removes straggling and synchronization delays. Thus, industrial implementations are moving from asynchronous SGD back to synchronous SGD. But asynchronous SGD is still a useful paradigm when one is using cheap consumer hardware and low-bandwidth networks.
Problems 1. For exponential gradient computation times X ∼ Exp(λ), derive the expression for the expectation and variance of the runtime per iteration T_async and analyze how it scales with the number of workers m. 2. By simulating and plotting the expected runtime per iteration of asynchronous SGD for Pareto distributed X ∼ Pareto(x_m, α), compare its scaling with m for different values of the shape parameter α. 3. For the simulation setup used above, implement K-async and K-batch-async SGD and compare their expected runtime per iteration versus K.
References 1. J. Tsitsiklis, D. Bertsekas, and M. Athans, “Distributed asynchronous deterministic and stochastic gradient optimization algorithms,” IEEE Transactions on Automatic Control, vol. 31, no. 9, pp. 803–812, 1986. 2. B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A lock-free approach to parallelizing stochastic gradient descent,” in Proceedings of the International Conference on Neural Information Processing Systems, 2011, pp. 693–701. 3. X. Lian, Y. Huang, Y. Li, and J. Liu, “Asynchronous parallel stochastic gradient for nonconvex optimization,” in Proceedings of the International Conference on Neural Information Processing Systems, 2015, pp. 2737–2745. 4. I. Mitliagkas, C. Zhang, S. Hadjis, and C. Ré, “Asynchrony begets momentum, with an application to deep learning,” in 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2016, pp. 997–1004. 5. S. Gupta, W. Zhang, and F. Wang, “Model accuracy and runtime tradeoff in distributed deep learning: A systematic study,” in IEEE International Conference on Data Mining (ICDM). IEEE, 2016, pp. 171–180. 6. X. Lian, W. Zhang, C. Zhang, and J. Liu, “Asynchronous decentralized parallel stochastic gradient descent,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 80. PMLR, 10–15 Jul 2018, pp. 3043–3052. [Online]. Available: http://proceedings.mlr.press/v80/lian18a.html 7. S. Dutta, G. Joshi, S. Ghosh, P. Dube, and P. Nagpurkar, “Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD,” International Conference on Artificial Intelligence and Statistics (AISTATS), Apr. 2018. [Online]. Available: https://arxiv.org/abs/1803. 01113 8. S. Dutta, J. Wang, and G. Joshi, “Slow and Stale Gradients Can Win the Race,” IEEE Journal on Selected Areas of Information Theory (JSAIT), Aug. 2021. 9. R. Gallager, Stochastic Processes: Theory for Applications, 1st ed. Cambridge University Press, 2013. 10. J. Zhang, I. Mitliagkas, and C. Re, “Yellowfin and the art of momentum tuning,” CoRR, vol. arXiv:1706.03471, Jun. 2017. [Online]. Available: http://arxiv.org/abs/1706.03471 11. J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, “Large scale distributed deep networks,” in Proceedings of the International Conference on Neural Information Processing Systems, 2012, pp. 1223–1231.
6
Local-Update and Overlap SGD
In the last two chapters, we studied synchronous and asynchronous SGD, where the workers compute mini-batch gradients and the parameter server aggregates them and updates the model. Both these algorithms and their variants require constant communication between the workers and the parameter server after every iteration. However, when the network is bandwidth-limited and/or the size of the model is large, this communication delay can dominate the runtime per iteration. This calls for communication-efficient distributed SGD algorithms. In the upcoming chapters, we will study different methods of achieving communication efficiency. In this chapter, we reduce the frequency of communication by allowing workers to perform multiple SGD updates instead of just computing gradients. In Chap. 7, we will study quantization and sparsification methods to reduce the number of bits transmitted in every communication round, and in Chap. 8 we consider decentralized topologies in order to reduce the communication bottleneck at a central parameter server.
6.1
Local-Update SGD Algorithm
Local-update SGD uses a simple strategy to reduce the communication cost of distributed SGD. In local-update SGD, after receiving the latest model parameters w from the parameter server, instead of just computing and sending a mini-batch gradient, each worker node makes τ > 1 local updates to the model and then sends the resulting local version of the model back to the parameter server. This strategy reduces the frequency of communication between the parameter server and the workers. The local update SGD algorithm operates in 'communication rounds' or 'training rounds'. More formally, each round proceeds as follows: 1. Worker i fetches the current version of the model w_t from the parameter server 2. It performs τ local SGD updates using its local data D_i
3. Due to differences in D_i across workers, the resulting models w_{t+τ}^{(i)} will be different 4. The resulting models w_{t+τ}^{(i)} are aggregated by the parameter server as per the update rule

w_{t+τ} = (1/m) Σ_{i=1}^m w_{t+τ}^{(i)}.
The above steps repeat again in the next round, where the workers fetch the aggregated model w_{t+τ} from the parameter server. Thus, supposing all the local models are initialized to the same value w_1^{(i)} = w_1 for all i = 1, . . . , m, the local-update SGD update rule in the t-th iteration can be written formally as follows:

w_{t+1}^{(i)} = (1/m) Σ_{j=1}^m [w_t^{(j)} − η g(w_t^{(j)})]   if t mod τ = 0,
w_{t+1}^{(i)} = w_t^{(i)} − η g(w_t^{(i)})   otherwise,   (6.1)

where g(w_t^{(i)}) represents a mini-batch stochastic gradient of the objective function F, evaluated at w_t^{(i)} using a batch of b datapoints sampled uniformly at random with replacement from worker i's local dataset.

Beyond data-center-based training using the parameter server framework, local-update SGD is especially relevant and well-suited for the emerging framework of federated learning [1–4], where the model is trained using local computation capabilities and data at edge devices such as cell phones or sensors. Since these devices are typically connected to the central (parameter) server via limited bandwidth and high latency wireless communication links, using a communication-efficient algorithm is especially important to ensure fast and scalable distributed training.

Figures 6.1 and 6.2 present two different views of how local updates affect the performance of the algorithm in terms of the runtime and convergence respectively. The pink colored block in Fig. 6.1 represents the time taken by the parameter server to aggregate the
Fig. 6.1 Timeline of the local-update SGD, illustrating how performing more local updates (larger τ ) reduces the runtime per iteration. This is because the communication delay per round (represented by the cream colored rectangles) is amortized across the τ iterations within each round
Fig. 6.2 Local-update SGD illustration
local models wt+τ from workers i = 1, . . . , m. The blue arrows at each worker are the τ local updates made to the model wt that the workers receive from the parameter server at the beginning of the communication round. By communicating only once in τ iterations, the local-update SGD algorithm amortizes the communication delay across the τ iterations, thus reducing the runtime per iteration. In Sect. 6.1.2 we will formally analyze the runtime per iteration of local SGD to understand how it is affected by τ . While it has a faster runtime per iteration, local-update SGD has worse error convergence than synchronous distributed SGD because of infrequent consensus between the models at the m workers. This phenomenon is illustrated in Fig. 6.2. As τ increases, the m worker models drift further away from each other because of the differences in their local data partitions. This causes a higher variance in the aggregated model updates, which in turn results in a worse error floor at convergence. In Sect. 6.1.1 we will analyze the error convergence of local-update SGD and quantify how the rate of convergence and the error floor are affected by τ .
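To make the round structure described above concrete, here is a small self-contained sketch (our own illustrative code, not from the references of this chapter) of local-update SGD, i.e., periodic averaging, on a toy least-squares problem, with the data shuffled and split across m workers:

import numpy as np

rng = np.random.default_rng(5)
n, d, m, b, tau, eta, rounds = 4000, 20, 4, 32, 10, 0.05, 50
A = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = A @ w_true + 0.1 * rng.standard_normal(n)
partitions = np.array_split(rng.permutation(n), m)           # local datasets D_1, ..., D_m

def local_grad(w, idx):
    batch = rng.choice(idx, size=b)                           # mini-batch from the local partition
    return A[batch].T @ (A[batch] @ w - y[batch]) / b

w_global = np.zeros(d)
for _ in range(rounds):
    local_models = []
    for i in range(m):
        w_i = w_global.copy()                                 # worker i fetches the global model
        for _ in range(tau):                                  # tau local SGD updates
            w_i -= eta * local_grad(w_i, partitions[i])
        local_models.append(w_i)
    w_global = np.mean(local_models, axis=0)                  # PS averages the local models
print("final training loss:", 0.5 * np.mean((A @ w_global - y) ** 2))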
6.1.1
Convergence Analysis
Local-update SGD optimizes the same empirical risk objective function F(h) = (1/N) Σ_{n=1}^N ℓ(h(x_n), y_n) that we considered in previous chapters. We make the following assumptions about the objective function F(w) in order to analyze the convergence of local-update SGD.
1. Lipschitz Smoothness: The objective function F(w) is differentiable and L-Lipschitz smooth, that is, ||∇F(w) − ∇F(w′)|| ≤ L||w − w′||
2. Lower Bound F_inf: The objective function F(w) is bounded below by F_inf, that is, F(w) ≥ F_inf for all w
3. Unbiased Gradients: The stochastic gradient g(w; ξ) is an unbiased estimate of ∇F(w), that is, E_ξ[g(w; ξ)] = ∇F(w)
In order for this assumption to be true, the local datasets D_i at each of the workers need to have the same distribution as the global dataset D = ∪_{i=1}^m D_i. This can be achieved by uniformly shuffling and splitting the dataset D across the workers. However, when local-update SGD is used in the federated learning framework, the local datasets are collected by the edge clients and they inherently have different distributions. Therefore, when local-update SGD is analyzed in the federated learning setting, this assumption is replaced by a local unbiasedness assumption E_ξ[g_i(w; ξ)] = ∇F_i(w), along with an additional bounded heterogeneity assumption on the difference between the gradients across the edge workers.
4. Bounded Variance: The stochastic gradient g(w; ξ) has bounded variance, that is, Var(g(w; ξ)) ≤ σ²/b, which implies E_ξ[||g(w; ξ)||²] ≤ ||∇F(w)||² + σ²/b
Theorem 6.1 (Convergence of Local-update SGD) For an L-smooth function, ηL + η²L²τ(τ − 1) ≤ 1, and if the starting point is w̄_1, then after t iterations of local-update SGD we have

(1/t) Σ_{k=1}^t E[||∇F(w̄_k)||²] ≤ 2[F(w̄_1) − F_inf]/(ηt) + ηLσ²/(mb) + η²L²σ²(τ − 1)/b,

where w̄_k = Σ_{i=1}^m w_k^{(i)}/m denotes the averaged model at the k-th iteration. This is a virtual sequence of iterates that is equal to the global model after every τ iterations, when k mod τ = 0.
The version of the local-update SGD analysis presented in Theorem 6.1 above is based on [5], which does not assume strong convexity of the objective function F. That is why the left hand side is the average of the gradient norms rather than the optimality gap F(w) − F*, because for a non-convex objective, SGD can get stuck at a stationary point and it is not guaranteed to converge to the optimal value F*.
6.1.1.1 Implications of the Convergence Result
For the special case when the communication period τ = 1, this bound reduces to the convergence bound of mini-batch SGD for non-convex objective functions. The last term in the error floor grows linearly with τ, corroborating our intuition that the error floor increases with τ due to the increased discrepancy in the model versions at the workers after the local training iterations. While the error floor does increase if we consider a constant learning rate that is independent of the other parameters, if we set the learning rate as η = (1/L)√(m/t) and 10mτ² ≤ t, where t is the total number of iterations for which we plan to run the algorithm, then

(1/t) Σ_{k=1}^t E[||∇F(w̄_k)||²] ≤ (2L[F(w̄_1) − F_inf] + σ²/b)/√(mt) + m(τ − 1)σ²/(tb)   (6.2)
= O(1/√(mt)) + O(m(τ − 1)/t).   (6.3)

The second term in (6.3) decays faster with the number of iterations t than the first term. Thus, the overall convergence rate is O(1/√(mt)). Due to the m in the denominator, we can conclude that local SGD provides a linear speed-up in convergence with respect to the number of workers m. This means that using m workers to perform local updates in parallel results in convergence in m times fewer iterations than the single-node case.
6.1.1.2 Proof of Theorem 6.1
To prove the convergence bound for local-update SGD, let us define the following terms:
• G_k = (1/m) Σ_{i=1}^m g(w_k^{(i)}), the average of the stochastic gradients at the workers at iteration k, and
• H_k = (1/m) Σ_{i=1}^m ∇F(w_k^{(i)}), the average of the full gradients at the workers at iteration k.
For ease of writing, we define the notation E_k to denote the conditional expectation given the set {ξ_k^{(1)}, . . . , ξ_k^{(m)}} of mini-batches at the m workers in iteration k. By the unbiasedness of the gradients, observe that E_k[G_k] = H_k. Using the definitions of G_k and H_k, we can show that
E_k[||G_k − H_k||²] = E_k[||(1/m) Σ_{i=1}^m (g(w_k^{(i)}) − ∇F(w_k^{(i)}))||²]   (6.4), (6.5)
= (1/m²) [ Σ_{i=1}^m E_k[||g(w_k^{(i)}) − ∇F(w_k^{(i)})||²] + Σ_{j≠l} E_k[⟨g(w_k^{(j)}) − ∇F(w_k^{(j)}), g(w_k^{(l)}) − ∇F(w_k^{(l)})⟩] ]   (6.6)
≤ σ²/(bm),   (6.7)

where Eq. (6.7) is due to the fact that the {ξ_k^{(i)}} are independent random variables. By directly applying the unbiased gradient and bounded variance assumptions to (6.6), one can observe that all cross terms are zero. Using the result in (6.7), and the fact that E_k[G_k] = H_k, we can also show that:

E_k[||G_k||²] = E_k[||G_k − E_k[G_k]||²] + ||E_k[G_k]||²   (6.8)
= E_k[||G_k − H_k||²] + ||H_k||²   (6.9)
≤ σ²/(bm) + ||H_k||²   (6.10)
= σ²/(bm) + ||(1/m) Σ_{i=1}^m ∇F(w_k^{(i)})||²   (6.11)
≤ (1/m) Σ_{i=1}^m ||∇F(w_k^{(i)})||² + σ²/(bm),   (6.12)

where the last inequality follows from the convexity of the vector norm and Jensen's inequality. Another property of G_k required for the convergence analysis is that
E[⟨∇F(w̄_k), G_k⟩] = E[⟨∇F(w̄_k), (1/m) Σ_{i=1}^m g(w_k^{(i)})⟩]   (6.13)
= ⟨∇F(w̄_k), (1/m) Σ_{i=1}^m ∇F(w_k^{(i)})⟩   (6.14)
= (1/(2m)) Σ_{i=1}^m [ ||∇F(w_k^{(i)})||² + ||∇F(w̄_k)||² − ||∇F(w̄_k) − ∇F(w_k^{(i)})||² ]   (6.15)
= (1/2)||∇F(w̄_k)||² + (1/(2m)) Σ_{i=1}^m ||∇F(w_k^{(i)})||² − (1/(2m)) Σ_{i=1}^m ||∇F(w̄_k) − ∇F(w_k^{(i)})||²,   (6.16)
where we use the fact that 2wᵀw′ = ||w||² + ||w′||² − ||w − w′||². Now we will use the bounds derived in (6.12) and (6.16) to analyze the difference between two successive values E_k[F(w̄_{k+1})] and F(w̄_k) of the objective function, where E_k denotes the conditional expectation given all the mini-batches processed until iteration k. We start with the Lipschitz smoothness assumption, which implies the following.

E[F(w̄_{k+1})] − F(w̄_k) ≤ −η E[⟨∇F(w̄_k), G_k⟩] + (η²L/2) E[||G_k||²]   (6.17)
≤ −(η/2)||∇F(w̄_k)||² − (η/(2m)) Σ_{i=1}^m ||∇F(w_k^{(i)})||² + (η/(2m)) Σ_{i=1}^m ||∇F(w̄_k) − ∇F(w_k^{(i)})||² + η²Lσ²/(2bm) + (η²L/(2m)) Σ_{i=1}^m ||∇F(w_k^{(i)})||²   (6.18)
≤ −(η/2)||∇F(w̄_k)||² − (η(1 − ηL)/(2m)) Σ_{i=1}^m ||∇F(w_k^{(i)})||² + (ηL²/(2m)) Σ_{i=1}^m ||w̄_k − w_k^{(i)}||² + η²Lσ²/(2bm),   (6.19)

where (6.18) uses the result in (6.16), and (6.19) follows from the Lipschitz smoothness of F. Rearranging the terms to bring ||∇F(w̄_k)||² to the left hand side, we have

||∇F(w̄_k)||² ≤ 2(F(w̄_k) − E[F(w̄_{k+1})])/η − ((1 − ηL)/m) Σ_{i=1}^m ||∇F(w_k^{(i)})||² + (L²/m) Σ_{i=1}^m ||w̄_k − w_k^{(i)}||² + ηLσ²/(bm)   (6.20), (6.21)
Now taking total expectation and averaging over t iterations,
(1/t) Σ_{k=1}^t E[||∇F(w̄_k)||²] ≤ 2(F(w̄_1) − F_inf)/(tη) + ηLσ²/(bm)   [synchronous SGD error]
+ (L²/(mt)) Σ_{k=1}^t Σ_{i=1}^m E[||w̄_k − w_k^{(i)}||²] − ((1 − ηL)/(mt)) Σ_{k=1}^t Σ_{i=1}^m E[||∇F(w_k^{(i)})||²]   [additional error]   (6.22)
The additional error term arises because we only average after every τ local iterations, which causes a discrepancy between the averaged model w̄_k and the local model w_k^{(i)} at the i-th worker. The Cooperative SGD paper [5] gives an upper bound on this additional error term and shows that

(L²/(mt)) Σ_{k=1}^t Σ_{i=1}^m E[||w̄_k − w_k^{(i)}||²] − ((1 − ηL)/(mt)) Σ_{k=1}^t Σ_{i=1}^m E[||∇F(w_k^{(i)})||²] ≤ η²L²σ²(τ − 1)/b   (6.23)

when the learning rate is small enough. The proof of (6.23) is as follows. To write the above expression more compactly, define matrices W_k, G_k ∈ R^{d×m} that concatenate the models and the gradients at the m workers:

W_k = [w_k^{(1)}, . . . , w_k^{(m)}]   (6.24)
G_k = [g(w_k^{(1)}), . . . , g(w_k^{(m)})]   (6.25)

We also define the matrix J = 11ᵀ/(1ᵀ1), where 1 is the all-ones column vector. Thus, every element of the matrix J is equal to 1/m. As usual, I denotes the identity matrix of size m × m, where m is the number of workers. Moreover, for a matrix A, ||A||²_F denotes its squared Frobenius norm, which is the sum of the squares of each of the elements of the matrix, and the operator norm ||A||_op is defined as max_{x≠0} ||Ax||/||x|| = √(λ_max(AᵀA)). Let us analyze the term Σ_{i=1}^m E[||w̄_k − w_k^{(i)}||²] for different values of the iteration index k, by representing k = rτ + a, where r ≥ 0 is the communication round index, and 1 ≤ a ≤ τ is the index of a local step within that communication round.

Σ_{i=1}^m E[||w̄_{rτ+a} − w_{rτ+a}^{(i)}||²]   (6.26)
= E[||W_{rτ+a}(I − J)||²_F]   (6.27)
= E[||W_{rτ+1}(I − J) − η Σ_{l=1}^{a−1} G_{rτ+l}(I − J)||²_F]   (6.28)
= E[||−η Σ_{l=1}^{a−1} G_{rτ+l}(I − J)||²_F]   (6.29)
≤ η² E[||Σ_{l=1}^{a−1} G_{rτ+l}||²_F]   (6.30)
= η² Σ_{i=1}^m E[||Σ_{l=1}^{a−1} g(w_{rτ+l}^{(i)})||²]   (6.31)
≤ 2η² Σ_{i=1}^m E[||Σ_{l=1}^{a−1} (g(w_{rτ+l}^{(i)}) − ∇F(w_{rτ+l}^{(i)}))||²] + 2η² Σ_{i=1}^m E[||Σ_{l=1}^{a−1} ∇F(w_{rτ+l}^{(i)})||²]   (6.32)
≤ 2η²m(a − 1)σ²/b + 2η² Σ_{i=1}^m E[||Σ_{l=1}^{a−1} ∇F(w_{rτ+l}^{(i)})||²]   (6.33)
≤ 2η²m(a − 1)σ²/b + 2η²(a − 1) Σ_{i=1}^m Σ_{l=1}^{a−1} E[||∇F(w_{rτ+l}^{(i)})||²],   (6.34)

where in (6.30) we use the fact that the operator norm of (I − J) is less than or equal to 1, to get (6.32) we use the fact that (a + b)² ≤ 2a² + 2b², in (6.33) we use the bounded gradient variance assumption, and in (6.34) we use the inequality (a_1 + a_2 + · · · + a_m)² ≤ m(a_1² + · · · + a_m²). Summing over all a and r and multiplying by L²/(mt) we have:

(L²/(mt)) Σ_{k=1}^t Σ_{i=1}^m E[||w̄_k − w_k^{(i)}||²]   (6.35)
= (L²/(mt)) Σ_{r=0}^{t/τ} Σ_{a=1}^{min(τ, t−rτ)} Σ_{i=1}^m E[||w̄_{rτ+a} − w_{rτ+a}^{(i)}||²]   (6.36)
≤ (2η²L²/(mt)) Σ_{r=0}^{t/τ} Σ_{a=1}^{min(τ, t−rτ)} [ m(a − 1)σ²/b + (a − 1) Σ_{i=1}^m Σ_{l=1}^{a−1} E[||∇F(w_{rτ+l}^{(i)})||²] ]   (6.37), (6.38)
≤ η²L²(τ − 1)σ²/b + (η²L²τ(τ − 1)/(mt)) Σ_{i=1}^m Σ_{k=1}^t E[||∇F(w_k^{(i)})||²].   (6.39)

Substituting this in (6.22) we have:

(1/t) Σ_{k=1}^t E[||∇F(w̄_k)||²] ≤ 2(F(w̄_1) − F_inf)/(tη) + ηLσ²/(bm) + η²L²(τ − 1)σ²/b − ((1 − ηL − η²L²τ(τ − 1))/(mt)) Σ_{k=1}^t Σ_{i=1}^m E[||∇F(w_k^{(i)})||²]   (6.40)
≤ 2(F(w̄_1) − F_inf)/(tη) + ηLσ²/(bm) + η²L²(τ − 1)σ²/b   (6.41)
6.1.2
Runtime Analysis
In Sect. 6.1.1 we analyzed the error versus iterations convergence of local-update SGD and showed that the error versus iterations covergence degrades with τ because the discrepancy between the local models results in a higher error floor. In the section, we consider the other side of this story, and analyze the effect of τ on the runtime per iteration.
6.1.2.1 Delay Model
At the beginning of a communication round, all the m workers receive the current version of the model w and commence τ local updates each. We use the random variable Y ∼ F_Y to denote the time taken by a worker to perform one local update, which includes the time taken to compute one mini-batch gradient and update the model. We assume that these local update times are i.i.d. across workers and across the τ local updates at each worker. Thus, if Y_{i,j} is the time taken to finish the j-th local update of the round at the i-th worker, then Y_{i,j} for all i = 1, . . . , m and j = 1, 2, . . . , τ are i.i.d. realizations of the probability distribution F_Y. After all the workers finish their local updates, their models are sent to the parameter server, which averages them and sends the result back to the workers. We refer to the time window from the instant when the last worker finishes and sends its local updates, until all the workers receive the new model version, as the communication delay. It is modeled by a random variable D ∼ F_D, which is assumed to be i.i.d. across communication rounds.
6.1.2.2 Total Local Computation Time per Round
Using the delay model described above, let us compute the total local computation time T_comp of a round, which is defined as the time from the start of the local updates at the m workers until the last worker finishes its τ local updates:

T_comp = max_{i=1,...,m} (Y_{i,1} + Y_{i,2} + · · · + Y_{i,τ}).   (6.42)
In order to rewrite T_comp more compactly and analyze how its probability distribution depends on F_Y, we define a random variable Ȳ_i which denotes the average of the τ local update times at worker i, that is,

Ȳ_i = (Y_{i,1} + Y_{i,2} + · · · + Y_{i,τ})/τ.   (6.43)

Since the Y_{i,j}'s are i.i.d., the average local update time random variables Ȳ_i are also i.i.d. across workers. To understand the difference between the probability distributions F_Y and F_Ȳ, let us compare their expected values and variances:

E[Ȳ] = E[Y]   (6.44)
Var[Ȳ] = Var[Y]/τ   (6.45)

Thus, the random variable Ȳ has the same mean but τ times lower variance than Y. The expected computation time T_comp can thus be expressed in terms of Ȳ as follows:

E[T_comp] = τ E[Ȳ_{m:m}]   (6.46)

where Ȳ_{m:m} = max(Ȳ_1, . . . , Ȳ_m) is the maximum order statistic of the average local update times Ȳ_1, . . . , Ȳ_m at the m workers.
6.1.2.3 Expected Runtime Per Iteration
The time taken to complete one communication round is the total local computation time T_comp plus the communication delay D. Since τ iterations are completed in each round, the expected runtime per iteration is:

E[T_local] = (E[T_comp] + E[D])/τ   (6.47)
= E[Ȳ_{m:m}] + E[D]/τ   (6.48)

Let us compare this with the runtime of synchronous SGD,

E[T_sync] = E[Y_{m:m}] + E[D].   (6.49)
By comparing (6.48) and (6.49), we observe that performing multiple local updates at each worker and averaging the models periodically reduces the expected runtime in two ways. Firstly, the communication delay E [D] gets divided by τ because it gets amortized across τ iterations completed in each communication round. In contrast, synchronous SGD has to incur the communication delay after every iteration. This reduction in communication delay is a game-changer in bandwidth-limited settings such as federated learning, where the worker
nodes are wirelessly connected mobile devices. Also, the relative values of the means of Y and D depend on the size of the model being trained. Suppose Y and D are constants, and their ratio, the communication-to-computation time ratio, is α = D/Y. The value of α depends on the size of the model being trained, the available communication bandwidth and other system parameters. For experiments run on a university computing cluster, we observed that the value of α for the VGG-16 model is larger than that of ResNet-50 because VGG-16 is a larger neural network, as shown in Fig. 6.3a. The speed-up of local-update SGD over synchronous SGD is given by

E[T_sync]/E[T_local] = (Y + D)/(Y + D/τ) = (1 + α)/(1 + α/τ).   (6.50)

Fig. 6.3 a Wallclock time to finish 100 iterations in a cluster with 4 worker nodes. To achieve the same communication/computation ratio, VGG-16 requires a larger communication period τ than ResNet-50. b The speed-up offered by using periodic-averaging SGD increases with τ (the communication period) and with the communication/computation delay ratio α = D/Y, where D is the all-node broadcast delay and Y is the time taken for each local update at a worker
As α increases, local-update SGD gives an even larger speed-up over synchronous SGD. Figure 6.3b shows the speed-up for different values of α and τ. When D is comparable with Y (α = 0.9), local-update SGD can be almost twice as fast as fully synchronous SGD. Local SGD also reduces the runtime per iteration in another, more subtle way. When we perform multiple local updates, the variations across those updates are evened out as τ becomes larger, thus providing a straggler mitigation effect. This can be formally observed from the fact that the value of the maximum order statistic E[Ȳ_{m:m}] is less than E[Y_{m:m}] because Ȳ has lower variance than Y. This reduction is more drastic when the distribution of Y has a heavier tail and/or for a larger number of workers m. In Fig. 6.4, we demonstrate this phenomenon for exponentially distributed Y's. Local-update SGD not only gives an almost two-fold reduction in the expected runtime per iteration (shown by the dotted line) but also reduces the tail of the runtime distribution. In practice, the assumption that the Y's are i.i.d. across local updates at a worker may not be true because a worker that slows down may remain slow for several consecutive local iterations. As a result, the reduction in the expected runtime may not be as drastic as illustrated in Fig. 6.4. However, we still expect local-update SGD to provide a straggler mitigation benefit in addition to the communication delay amortization.
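The straggler-mitigation effect can be reproduced with a few lines (an illustrative sketch of our own matching the setting of Fig. 6.4: Y ∼ Exp(1), D = 1, m = 16, τ = 10):

import numpy as np

rng = np.random.default_rng(6)
m, tau, D, n_rounds = 16, 10, 1.0, 20_000
Y_sync = rng.exponential(1.0, size=(n_rounds, m))
T_sync = Y_sync.max(axis=1) + D                            # one iteration per round
Y_local = rng.exponential(1.0, size=(n_rounds, m, tau))
T_local = (Y_local.sum(axis=2).max(axis=1) + D) / tau      # tau iterations amortize D and the stragglers
print("mean runtime/iteration:", T_sync.mean(), "(sync) vs", T_local.mean(), "(local, tau=10)")
print("99th percentile:", np.quantile(T_sync, 0.99), "vs", np.quantile(T_local, 0.99))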
Fig. 6.4 Comparison of the probability distributions of the runtime per iteration of local update SGD (τ = 10), also called periodic averaging SGD (PASGD), with synchronous SGD for exponentially distributed local update times Y ∼ Exp(1) and constant communication delay D = 1 for a system of m = 16 workers. Local-update SGD not only reduces the mean (shown by the dotted line) but also reduces the tail latency
6.1.3
Adaptive Communication
The convergence and runtime analyses of local-update SGD in Sects. 6.1.1 and 6.1.2 show that there is an inherent error-runtime trade-off in terms of the number of local updates τ. Experimental results match this insight, as shown in Fig. 6.5. To obtain these results, we train the VGG-16 [7] neural network on the CIFAR-10 dataset [8] with 8 worker nodes in a university computing cluster. We start with a constant learning rate and then reduce it by a factor of 10 after 100 and 150 epochs (traversals of the entire dataset). If we look at the training loss versus epoch plots in Fig. 6.5 we see that increasing τ results in a higher error floor, as suggested by our convergence analysis in Sect. 6.1.1. However, if we account for the runtime per iteration and plot the training loss versus wallclock time, we observe that larger τ provides a drastic runtime reduction. These error versus wallclock time plots suggest a natural strategy to enjoy fast runtime per iteration as well as a small error floor at convergence: starting with a larger τ and gradually reducing it during the training process.
6.1.3.1 Derivation of the AdaComm Strategy
Next, let us develop this strategy, called AdaComm, which determines the best τ at each wallclock time T. Using this strategy, we will switch from one learning curve to another, as illustrated in Fig. 6.6. First, we need to determine the error at wallclock time T. From the convergence analysis in Sect. 6.1.1 we know that the error after t iterations is bounded by

(1/t) Σ_{k=1}^t E[||∇F(w̄_k)||²] ≤ 2[F(w̄_1) − F_inf]/(ηt) + ηLσ²/(mb) + η²L²σ²(τ − 1)/b   (6.51)
η2 L 2 σ 2 (τ − 1) 1 2 [F(w1 ) − Finf ] ηLσ 2 2 ∇ F(wk ) ≤ + + (6.51) E t ηt mb b k=1
80
6 Local-Update and Overlap SGD 100 Fully sync. SGD PSASGD, τ = 4 PSASG, τ = 24
10−1
10−2
Training loss
Training loss
100
10−3
10−1
10−2
10−3 0
25
50
75
100 125 150 175 200
0
Epoch (a) Learning rate equals to 0.04.
25
75
100 125 150 175 200
10−1
10−2
Training loss
100 Fully sync. SGD PSASGD, τ = 4 PSASG, τ = 24
10−3
Fully sync. SGD PSASGD, τ = 4 PSASGD, τ = 24
10−1
10−2
10−3 0.0
2.5
5.0
7.5
10.0
12.5
15.0
0.0
2.5
Wall-clock time / min
5.0
7.5
10.0
12.5
15.0
Wall-clock time / min
(c) Learning rate equals to 0.04.
(d) Learning rate equals to 0.4. 90
85 80 Fully sync. SGD PSASGD, τ = 4 PSASG, τ = 24
75
Test accuracy
90
Test accuracy
50
Epoch (b) Learning rate equals to 0.4.
100
Training loss
Fully sync. SGD PSASGD, τ = 4 PSASGD, τ = 24
85 80 Fully sync. SGD PSASGD, τ = 4 PSASGD, τ = 24
75 70
70 0.0
2.5
5.0
7.5
10.0
12.5
15.0
Wall-clock time / min (e) Learning rate equals to 0.04.
0.0
2.5
5.0
7.5
10.0
12.5
15.0
Wall-clock time / min (f) Learning rate equals to 0.4.
Fig. 6.5 Illustration of the error-convergence and communication-efficiency trade-off in local-update SGD. We train VGG-16 on CIFAR-10 with 8 worker nodes. Each configuration is trained for 200 epochs (traversals of the training dataset), and the learning rate is decayed by a factor of 10 at epochs 100 and 150. After the same number of epochs, a larger communication period leads to a higher training loss but costs much less wall-clock time
Suppose that the local update time Y and the communication delay D are constants, so that the runtime per iteration of local-update SGD is T_local = Y + D/τ. Substituting this in the error bound, we can show that the error at time T is bounded by

Error at time T \le \frac{2\,[F(w_1) - F_{\inf}]}{\eta T}\Big(Y + \frac{D}{\tau}\Big) + \frac{\eta L\sigma^2}{mb} + \frac{\eta^2 L^2 \sigma^2 (\tau - 1)}{b}    (6.52)

We derive a heuristic schedule for τ by minimizing this upper bound with respect to τ. Setting the derivative of the right-hand side with respect to τ to 0, we get

-\frac{2\,[F(w_1) - F_{\inf}]}{\eta T}\cdot\frac{D}{\tau^2} + \frac{\eta^2 L^2 \sigma^2}{b} = 0    (6.53)

\tau^* = \sqrt{\frac{2b\,[F(w_1) - F_{\inf}]\, D}{\eta^3 L^2 \sigma^2\, T}}    (6.54)
6.1.3.2 Modifying the Schedule to Accommodate Practical Constraints Can we directly use the τ schedule given by (6.54) in practice? Unfortunately not, because the parameters F(w_1) − F_inf, L, and σ², which are properties of the objective function and its stochastic gradients, are not known in practice. Also, it would be impossible to change τ at every time instant T. We accommodate these practical constraints by first dividing time into intervals of size T_0 and finding the best τ for each interval, as illustrated in Fig. 6.6. In the first interval, the best communication period τ_0^* is

\tau_0^* = \sqrt{\frac{2b\,[F(w_{T=0}) - F_{\inf}]\, D}{\eta^3 L^2 \sigma^2\, T_0}}    (6.55)

Then, we find τ_l^* for the l-th interval by considering that the workers start at the initial point w_{T=lT_0}:

\tau_l^* = \sqrt{\frac{2b\,[F(w_{T=lT_0}) - F_{\inf}]\, D}{\eta^3 L^2 \sigma^2\, T_0}}    (6.56)

Taking the ratio of τ_l^* and τ_0^*, we get

\tau_l^* = \sqrt{\frac{F(w_{T=lT_0}) - F_{\inf}}{F(w_{T=0}) - F_{\inf}}}\;\tau_0^* \approx \sqrt{\frac{F(w_{T=lT_0})}{F(w_{T=0})}}\;\tau_0^*    (6.57)

where the approximation assumes that F_inf = 0. Observe that the unknown terms L and σ² cancel out. The only remaining unknown is the initial value τ_0^*, which we can choose using a grid search over several possible values by treating it as a hyperparameter.
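A minimal sketch of this schedule is given below. It assumes F_inf = 0 and that we can evaluate (or estimate) the training loss at the start of each interval; the function and argument names are illustrative, not part of any specific library.

import math

def adacomm_tau(tau0: int, initial_loss: float, interval_start_loss: float) -> int:
    """Communication period for the l-th interval, following (6.57) with F_inf = 0."""
    return max(1, round(tau0 * math.sqrt(interval_start_loss / initial_loss)))

# Example: if the training loss has dropped from 2.3 to 0.9, tau shrinks from 20 to 13.
print(adacomm_tau(tau0=20, initial_loss=2.3, interval_start_loss=0.9))

In practice, τ_0^* is chosen by the grid search mentioned above, and the update is applied once at the start of every wall-clock interval of length T_0.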
[Fig. 6.6 panels (a) and (b): training loss versus wall-clock time, showing the switch points between large and small communication periods and the sequence τ_0^*, τ_1^*, τ_2^*, . . . , τ_l^* chosen over successive intervals of length T_0.]
Fig. 6.6 Illustration of adaptive communication strategies. The dashed line denotes the learning curve with adaptive communication
6.1.3.3 Incorporating a Variable Learning Rate Schedule When deriving (6.57), we assumed that the learning rate η is constant throughout training. However, in practice, the learning rate is generally decayed during training according to a variable learning rate schedule. We can incorporate such a schedule into our Adacomm strategy. Suppose that the learning rate in the l-th interval of size T_0 is η_l. Taking the ratio of τ_l^* and τ_0^*, we now get

\tau_l^* = \sqrt{\frac{\eta_0^3\,[F(w_{T=lT_0}) - F_{\inf}]}{\eta_l^3\,[F(w_{T=0}) - F_{\inf}]}}\;\tau_0^* \approx \sqrt{\frac{\eta_0^3\, F(w_{T=lT_0})}{\eta_l^3\, F(w_{T=0})}}\;\tau_0^*    (6.58)

where τ_0^* is found via a brute-force grid search. Observe that when the learning rate is fixed, τ reduces over time. However, for a variable learning rate, if the learning rate decays at some time instant, the corresponding τ increases. This is because, when each step is smaller, we can afford to take more local steps without causing too much discrepancy between the local models w_k^(i) across workers i = 1, . . . , m.

Figure 6.7 presents the results for VGG-16 for both fixed and variable learning rates. A large communication period τ initially results in a rapid drop in the error, but the error finally converges to a higher floor. By adapting τ, the proposed AdaComm scheme strikes the best error-runtime trade-off in all settings. In Fig. 6.7, while fully synchronous SGD takes 33.5 minutes to reach a training loss of 4 × 10^{-3}, AdaComm takes only 15.5 minutes, achieving more than a 2× speedup. Similarly, in Fig. 6.7, AdaComm takes 11.5 minutes to reach a training loss of 4.5 × 10^{-2}, achieving a 3.3× speedup over fully synchronous SGD (38.0 minutes). However, for ResNet-50, the communication overhead is no longer the bottleneck. For a fixed communication period, the negative effect of performing local updates becomes more pronounced and cancels the benefit of the lower communication delay. It is not surprising that fully synchronous SGD is nearly the best among all fixed-τ methods in the error-runtime plot. Even in this extreme case, adaptive communication still has competitive performance. When combined with learning rate decay, the adaptive scheme is about 1.3 times faster than fully synchronous SGD, taking 15.0 versus 21.5 minutes to achieve a training loss of 3 × 10^{-2}.
[Fig. 6.7 panels: training loss and communication period τ versus wall-clock time for VGG-16 (first column) and ResNet-50 (second column), under a fixed learning rate, a variable learning rate, and with block momentum, comparing fixed communication periods (τ = 1, 5, 20, 100) with AdaComm.]
Fig. 6.7 Experimental results showing the performance of Adacomm for the VGG16 (first column) and ResNet-50 (second column) neural networks trained on the CIFAR-10 dataset
6.1.3.4 Applying Momentum in the Local-SGD Setting Recall the addition of momentum γ in vanilla SGD:

y_t = \gamma\, y_{t-1} + \eta\,\nabla F(w_t)
w_{t+1} = w_t - y_t

In local-update SGD, we can use momentum for the local updates, where each worker maintains its own momentum buffer y_t^(i). But there will be a dramatic change in w_t^(i) every time the models are averaged; this can create spikes in momentum that can prevent convergence. The solution is to reset the momentum buffer y to zero at the workers after every averaging step. Besides local momentum, we can also add global momentum at the parameter server by treating the accumulated local updates from all workers as a single gradient step. In Fig. 6.7, we apply our adaptive communication strategy with block momentum and observe a significant performance gain on CIFAR-10. In particular, the adaptive communication scheme has the fastest convergence rate with respect to wall-clock time throughout the training process. While fully synchronous SGD gets stuck at a plateau before the first learning rate decay, the training loss of the adaptive method continuously decreases until convergence. For VGG-16, AdaComm is 3.5× faster (in terms of wall-clock time) than fully synchronous SGD in reaching a 3 × 10^{-3} training loss. For ResNet-50, AdaComm takes 15.8 minutes to reach a 2 × 10^{-2} training loss, which is 2 times faster than fully synchronous SGD (32.6 minutes).
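The sketch below shows the local momentum recursion together with the buffer reset; it is schematic pseudocode for a single training loop, and local_gradient and average_models are assumed placeholder callables rather than the API of a particular framework.

import numpy as np

def local_sgd_with_momentum(w, num_rounds, tau, eta, gamma, local_gradient, average_models):
    """Local-update SGD at one worker with local momentum that is reset after every averaging step."""
    y = np.zeros_like(w)                       # local momentum buffer
    for _ in range(num_rounds):
        for _ in range(tau):                   # tau local momentum-SGD updates
            y = gamma * y + eta * local_gradient(w)
            w = w - y
        w = average_models(w)                  # model averaging every tau iterations
        y = np.zeros_like(w)                   # reset: avoids momentum spikes after averaging
    return w

Global (block) momentum at the parameter server can be layered on top by treating the change in the averaged model between two consecutive averaging steps as a single pseudo-gradient and applying the same momentum recursion to it at the server.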
6.2 Elastic and Overlap SGD
In local-update SGD, there is a communication barrier after every round of τ local updates, since the updated global model needs to be communicated to the nodes before the next set of τ updates can commence. Moreover, the global model cannot be updated until the slowest of the m nodes finishes its τ local updates. Since this communication barrier is imposed by the algorithm rather than by the systems implementation, we need an algorithmic approach to remove it and allow communication to overlap with local computation. The asynchronous SGD algorithms [9–15] that we studied in Chap. 5 use asynchronous gradient aggregation to remove the synchronization barrier. However, asynchronous aggregation causes model staleness, that is, slow nodes can have arbitrarily outdated versions of the global model. Another approach is to allow faster nodes to compute more mini-batch gradients per iteration [15, 16]. However, this idea cannot be trivially extended to the τ > 1 case. In this section, we will study two variants of the local-update SGD algorithm, namely (1) elastic averaging SGD and (2) overlap local SGD, that allow overlapping communication and local computation in order to further reduce the runtime per iteration.
Fig. 6.8 Illustration of local-update SGD and EASGD in the model parameter space. Blue and black arrows denote local update SGD iterations and the update of auxiliary variables, respectively. Red arrows represent averaging local models with each other or with the auxiliary variable. In this toy example, the number of local updates τ is set to 3
6.2.1 Elastic Averaging SGD
In local-update SGD, at the averaging step which happens every τ iterations, all the m worker models are reset to the updated global model w_t = \frac{1}{m}\sum_{i=1}^{m} w_t^{(i)}. Instead, elastic averaging SGD [17] allows some slack between the worker models. At each averaging step, the parameter server pulls the worker models towards it, and the workers collectively pull the parameter server's model in the opposite direction, as illustrated in Fig. 6.8b. This averaging is analogous to having springs attached between the parameter server and each worker; these springs exert an elastic force in both directions, stopping the parameter server model and each worker model from straying too far away from each other. The elasticity of the spring is represented by the parameter α. To implement this idea, elastic averaging SGD defines an anchor model z, which is maintained by the parameter server. The update rules for the worker and anchor models are:

w_{k+1}^{(i)} = \begin{cases} w_k^{(i)} - \eta\, g_i(w_k^{(i)}) - \alpha\,(w_k^{(i)} - z_k), & k \bmod \tau = 0 \\ w_k^{(i)} - \eta\, g_i(w_k^{(i)}), & \text{otherwise} \end{cases}    (6.59)

z_{k+1} = \begin{cases} (1 - m\alpha)\, z_k + m\alpha\, \frac{1}{m}\sum_{i=1}^{m} w_k^{(i)}, & k \bmod \tau = 0 \\ z_k, & \text{otherwise} \end{cases}    (6.60)

From (6.60), we observe that the auxiliary variable z_k can be considered as a moving average of the averaged model \frac{1}{m}\sum_{i=1}^{m} w_k^{(i)}. A larger value of the parameter α forces more consensus between the locally trained models and improves stability, but it may reduce the convergence speed.
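A schematic implementation of one round of these updates is sketched below (placeholder gradient function and a single-process loop; in a real deployment the workers and the parameter server run in parallel and exchange w^(i) and z over the network).

import numpy as np

def easgd_round(workers, z, tau, eta, alpha, local_gradient):
    """One round of elastic averaging SGD following (6.59)-(6.60).

    workers: list of m worker models (numpy arrays); z: anchor model at the parameter server.
    """
    m = len(workers)
    # tau - 1 plain local SGD steps at every worker
    for i in range(m):
        for _ in range(tau - 1):
            workers[i] = workers[i] - eta * local_gradient(i, workers[i])
    # Elastic averaging step: both (6.59) and (6.60) use the current models w_k and anchor z_k.
    w_snapshot = [w.copy() for w in workers]
    for i in range(m):
        workers[i] = (w_snapshot[i] - eta * local_gradient(i, w_snapshot[i])
                      - alpha * (w_snapshot[i] - z))
    z = (1 - m * alpha) * z + m * alpha * np.mean(w_snapshot, axis=0)
    return workers, z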
Elastic averaging SGD in effect solves a modified optimization problem instead of the standard empirical risk objective. This modified problem includes a proximal term that penalizes the difference between the worker and the anchor models:

\min_{z,\, w^{(1)},\ldots,w^{(m)}} \; \sum_{i=1}^{m} \underbrace{\left[\, \sum_{n=1}^{N_i} f(w^{(i)}; \xi_{i,n}) + \frac{\rho}{2}\,\|w^{(i)} - z\|^2 \right]}_{\text{Worker } i\text{'s local objective}}    (6.61)
where ρ = α/η. We get the update rules above by taking the partial derivatives of this objective function with respect to w(i) and z respectively. Ideas similar to EASGD have been proposed in the context of federated learning, for example FedProx [18], which seeks to address the problem of data heterogeneity, and [19, 20], which seek to train personalized models at the worker nodes.
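To see the connection concretely, here is the block-wise gradient step on (6.61) written out (a sketch; g_i denotes the stochastic gradient of worker i's local objective, and we substitute α = ηρ):

\begin{align*}
w^{(i)}_{k+1} &= w^{(i)}_k - \eta\Big( g_i(w^{(i)}_k) + \rho\,(w^{(i)}_k - z_k) \Big)
              = w^{(i)}_k - \eta\, g_i(w^{(i)}_k) - \alpha\,(w^{(i)}_k - z_k), \\
z_{k+1}       &= z_k - \eta\,\rho \sum_{i=1}^{m} (z_k - w^{(i)}_k)
              = (1 - m\alpha)\, z_k + m\alpha \cdot \frac{1}{m}\sum_{i=1}^{m} w^{(i)}_k,
\end{align*}

which are exactly the elastic averaging updates (6.59)–(6.60) at the averaging steps.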
6.2.1.1 Convergence Analysis of Elastic Averaging SGD Now let us understand how the convergence of elastic-averaging SGD depends on α, τ, m, and other parameters. The original paper [17] only gives an analysis of EASGD with one local update and for quadratic objective functions. We present here a general analysis for non-convex objective functions, which follows from the Cooperative SGD framework proposed in [5]. Cooperative SGD gives a unified convergence analysis of local-update SGD, elastic averaging SGD, and other variants of distributed SGD.

Theorem 6.2 (Convergence of elastic-averaging SGD) For an L-smooth objective function with starting point w_1, after t iterations of elastic averaging SGD,

\frac{1}{t}\sum_{k=1}^{t} \mathbb{E}\big[\|\nabla F(y_k)\|^2\big] \le \frac{2(m+1)\,\big[F(w_1) - F_{\inf}\big]}{\eta m t} + \frac{\eta L \sigma^2}{(m+1)\, b} + \frac{\eta^2 L^2 \sigma^2}{b}\left(\frac{1+\zeta^2}{1-\zeta^2}\,\tau - 1\right)    (6.62)

where y_k = \big(\sum_{i=1}^{m} w_k^{(i)} + z_k\big)/(m+1) and ζ = max(|1 − α|, |1 − (m + 1)α|), which implies that

\zeta = \begin{cases} (m+1)\alpha - 1, & \alpha \ge \frac{2}{m+2} \\ 1 - \alpha, & \alpha < \frac{2}{m+2} \end{cases}
In the decentralized local-update SGD variant, each node performs τ > 1 local SGD steps between consecutive communication rounds; the communication and consensus steps remain the same. This algorithm was proposed and analyzed in [9].
8.2.2.2 Decentralized SGD with a Time-Varying Active Topology The basic decentralized SGD algorithm described above uses the entire network topology and the same mixing matrix in every iteration. Decentralized local-update SGD reduces the frequency of communication on each link so that the network topology is activated once after every round of τ local iterations. However, it uses the same frequency of communication on every link. Instead, [10] tunes the communication frequency according to the importance of each link in terms of maintaining the connectivity of the graph. This is achieved by decomposing the graph into disjoint matchings. Then the probability of activation of each matching is optimized such that the overall communication frequency is less than a budget, and the algebraic connectivity λ2 (L) of the expected graph Laplacian matrix is maximized.
Connectivity-critical links are activated more frequently, whereas links between nodes that are already well-connected are activated less frequently.
8.2.2.3 Decentralized SGD on Directed Graphs Considering a doubly stochastic and symmetric mixing matrix implies that for every pair of nodes i and j: (1) if i sends its updates to j, then it also receives an update from j, and (2) the weight assigned by node i to node j’s update is equal to the weight assigned by node j to node i. These assumptions can lead to deadlocks in the implementation due to straggling nodes. To reduce the coupling of inter-node messages, we can use a directed graph, which is represented by a column stochastic matrix, to perform consensus between nodes. Based on this idea, an algorithm called Stochastic Gradient Push-sum was recently proposed in [8].
8.2.2.4 Network Gradient Tracking and Variance Reduction Instead of having each node do one or more vanilla SGD updates, [11] proposes using variance-reduction techniques to improve the convergence of decentralized SGD. The paper proposes two algorithms, Network-SVRG (based on stochastic variance reduction) and Network-DANE (based on gradient tracking), that incorporate information about previously computed gradients into each local SGD update.
8.2.2.5 Decentralized Elastic Averaging SGD To increase consensus between a loosely connected set of workers, we can employ the idea of elastic averaging SGD, which we introduced in Chap. 6. We can do so by adding copies of an anchor model z, a vector of the same size as the model w, to each node in the network topology. The anchor model adds some inertia that prevents the models at different nodes from drifting away from each other, and reduces the effective ζ of the graph (see Theorem 8.1 for the definition). More details on this variant can be found in [9].
8.3 Error Convergence Analysis
Let us now analyze the effect of the network topology on the error convergence of decentralized SGD. Observe that the consensus step, which mixes information between neighboring nodes, will achieve faster consensus if the underlying graph G is densely connected. Thus, we expect the error convergence to become worse as the graph becomes sparser. In this section, we provide a convergence analysis that corroborates this intuition.
8.3.1 Assumptions
The convergence analysis requires the following standard assumptions, which we have seen in other convergence analyses in this book.

• Lipschitz Smoothness: The objective function F(w) is differentiable and L-Lipschitz smooth, that is,

\|\nabla F(w) - \nabla F(w')\| \le L\,\|w - w'\|    (8.8)

• Unbiased Gradients: The stochastic gradient g(w; ξ) is an unbiased estimate of ∇F(w), that is, E_ξ[g(w; ξ)] = ∇F(w).
• Bounded Variance: The stochastic gradient g(w; ξ) has bounded variance, that is,

\mathbb{E}_\xi\big[\|g(w; \xi)\|^2\big] \le \|\nabla F(w)\|^2 + \sigma^2    (8.9)

• Mixing Matrix: The mixing matrix M is doubly stochastic. This implies that the largest eigenvalue of the matrix is 1 and all the other eigenvalues have magnitude strictly less than 1.
8.3.2 Convergence Analysis of Decentralized SGD
Based on these assumptions, [9] presents the following convergence results for decentralized SGD and its variant, decentralized local-update SGD. The proofs of these theorems can be found in the appendix of [9].

Theorem 8.1 (Convergence of Decentralized SGD) For an L-smooth objective function, if the learning rate satisfies ηL + η²L²(2ζ²/(1 − ζ²) + 2ζ/(1 − ζ)²) ≤ 1, and if the starting point is w_1, then after t iterations of decentralized SGD,

\frac{1}{t}\sum_{k=1}^{t} \mathbb{E}\big[\|\nabla F(\overline{w}_k)\|^2\big] \le \frac{2\,\big[F(w_1) - F_{\inf}\big]}{\eta t} + \frac{\eta L\sigma^2}{m} + \eta^2 L^2 \sigma^2 \left(\frac{1+\zeta^2}{1-\zeta^2} - 1\right)    (8.10)

where \overline{w}_k = \frac{1}{m}\sum_{i=1}^{m} w_k^{(i)} denotes the averaged model at the k-th iteration and the parameter ζ = max(|λ_2(M)|, |λ_m(M)|). The first two terms in (8.10) are identical to the error bound for fully synchronous SGD. The last term is the network error term, which arises because of the imperfect consensus
between the nodes due to the decentralized topology and the fact that each node can only communicate with its neighbors. This term increases with the parameter ζ, the second largest (by magnitude) eigenvalue of the mixing matrix M. Sparser graphs have larger ζ, causing an increase in the error floor. Fully synchronous SGD corresponds to ζ = 0, resulting in the last error term being zero.

Proof The proof of Theorem 8.1 is as follows, and it is based on a more general version presented in the Cooperative SGD paper [9]. Similar to the proof of local-update SGD given in Chap. 6, we define the following terms:

• G_k = \frac{1}{m}\sum_{i=1}^{m} g(w_k^{(i)}), the average of the stochastic gradients at the workers at iteration k, and
• H_k = \frac{1}{m}\sum_{i=1}^{m} \nabla F(w_k^{(i)}), the average of the full gradients at the workers at iteration k,

where w_k^{(i)} denotes the model at the i-th worker in the network. Using these definitions, and following the same steps as the first part of the proof of Theorem 6.1, we obtain the following error decomposition:

\frac{1}{t}\sum_{k=1}^{t} \mathbb{E}\big[\|\nabla F(\overline{w}_k)\|^2\big] \le \underbrace{\frac{2(F(w_1) - F_{\inf})}{t\eta} + \frac{\eta L \sigma^2}{bm}}_{\text{synchronous SGD error}} + \underbrace{\frac{L^2}{mt}\sum_{k=1}^{t}\sum_{i=1}^{m} \mathbb{E}\big[\|\overline{w}_k - w_k^{(i)}\|^2\big] - \frac{(1-\eta L)}{mt}\sum_{k=1}^{t}\sum_{i=1}^{m} \mathbb{E}\big[\|\nabla F(w_k^{(i)})\|^2\big]}_{\text{additional network error}}    (8.11)

= \underbrace{\frac{2(F(w_1) - F_{\inf})}{t\eta} + \frac{\eta L \sigma^2}{bm}}_{\text{synchronous SGD error}} + \underbrace{\frac{L^2}{mt}\sum_{k=1}^{t} \mathbb{E}\big[\|W_k(I - J)\|_F^2\big] - \frac{(1-\eta L)}{mt}\sum_{k=1}^{t} \mathbb{E}\big[\|\nabla F(W_k)\|_F^2\big]}_{\text{additional network error}}    (8.12)

where W_k and G_k are as defined in (8.5) and (8.6) respectively. The matrix J = \mathbf{1}\mathbf{1}^\top/(\mathbf{1}^\top\mathbf{1}), where \mathbf{1} is the all-ones column vector; thus, every element of the matrix J is equal to 1/m. As usual, I denotes the identity matrix of size m × m, where m is the number of workers. For a matrix A, \|A\|_F^2 denotes its squared Frobenius norm, which is the sum of the squares of all the elements of the matrix.
Now let us analyze the term \mathbb{E}[\|W_k(I - J)\|_F^2] appearing in the additional error. Recursively expanding the term inside the norm, we have:

W_k(I - J) = (W_{k-1} - \eta G_{k-1})\, M (I - J)    (8.13)
= W_{k-1}(I - J)\, M - \eta G_{k-1}(M - J)    (8.14)
= W_1 (I - J)\, M^{k-1} - \eta \sum_{l=1}^{k-1} G_l (M^{k-l} - J)    (8.15)
= -\eta \sum_{l=1}^{k-1} G_l (M^{k-l} - J)    (8.16)

where (8.14) follows from the doubly stochastic property of the mixing matrix M, which implies that MJ = JM = J and hence (I − J)M = M(I − J). We get (8.16) because all the workers are initialized with the same model w_1, which implies that the term W_1(I − J) = 0. From this, we can bound the first term in the additional error as:

\mathbb{E}\big[\|W_k(I-J)\|_F^2\big] = \eta^2\, \mathbb{E}\Big[\big\|\sum_{l=1}^{k-1} G_l (M^{k-l} - J)\big\|_F^2\Big]    (8.17)
= \eta^2\, \mathbb{E}\Big[\big\|\sum_{l=1}^{k-1} (G_l - \nabla F(W_l))(M^{k-l} - J) + \sum_{l=1}^{k-1} \nabla F(W_l)(M^{k-l} - J)\big\|_F^2\Big]    (8.18)
\le 2\eta^2\, \mathbb{E}\Big[\big\|\sum_{l=1}^{k-1} (G_l - \nabla F(W_l))(M^{k-l} - J)\big\|_F^2\Big] + 2\eta^2\, \mathbb{E}\Big[\big\|\sum_{l=1}^{k-1} \nabla F(W_l)(M^{k-l} - J)\big\|_F^2\Big]    (8.19)
\le 2\eta^2 \sum_{l=1}^{k-1} \mathbb{E}\big[\|G_l - \nabla F(W_l)\|_F^2\big]\, \|M^{k-l} - J\|_{op}^2 + 2\eta^2\, \mathbb{E}\Big[\big\|\sum_{l=1}^{k-1} \nabla F(W_l)(M^{k-l} - J)\big\|_F^2\Big]    (8.20)
\le 2\eta^2\, \frac{m\sigma^2}{b} \sum_{l=1}^{k-1} \zeta^{2(k-l)} + 2\eta^2\, \mathbb{E}\Big[\big\|\sum_{l=1}^{k-1} \nabla F(W_l)(M^{k-l} - J)\big\|_F^2\Big]    (8.21)
\le 2\eta^2\, \frac{m\sigma^2}{b}\, \frac{\zeta^2}{1-\zeta^2} + 2\eta^2\, \mathbb{E}\Big[\big\|\sum_{l=1}^{k-1} \nabla F(W_l)(M^{k-l} - J)\big\|_F^2\Big]    (8.22)
where (8.19) follows from the fact that \|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2. Then we get (8.20) by observing that the cross terms are zero, and that for two real matrices A and B, if B is symmetric then \|AB\|_F \le \|B\|_{op}\,\|A\|_F (see [9] for the proof). To get (8.21) we use the result that the operator norm of M^{k-l} − J is bounded above by ζ^{k-l}, where ζ = max(|λ_2(M)|, |λ_m(M)|). Then, we obtain (8.22) by using the bound \sum_{l=1}^{k-1} \zeta^{2(k-l)} \le \sum_{l=-\infty}^{k-1} \zeta^{2(k-l)} = \zeta^2/(1 - \zeta^2). Next, let us upper bound the second term in (8.22):

\mathbb{E}\big[\|W_k(I-J)\|_F^2\big] - 2\eta^2\, \frac{m\sigma^2}{b}\, \frac{\zeta^2}{1-\zeta^2} \le 2\eta^2\, \mathbb{E}\Big[\big\|\sum_{l=1}^{k-1} \nabla F(W_l)(M^{k-l} - J)\big\|_F^2\Big]    (8.23)
\le 2\eta^2 \sum_{l=1}^{k-1} \mathbb{E}\big[\|\nabla F(W_l)\|_F^2\big]\, \|M^{k-l} - J\|_{op}^2 + \eta^2 \sum_{l=1}^{k-1} \sum_{h=1, h\ne l}^{k-1} \|M^{k-l} - J\|_{op}\, \|M^{k-h} - J\|_{op} \big(\mathbb{E}[\|\nabla F(W_l)\|_F^2] + \mathbb{E}[\|\nabla F(W_h)\|_F^2]\big)    (8.24)
\le 2\eta^2 \sum_{l=1}^{k-1} \mathbb{E}\big[\|\nabla F(W_l)\|_F^2\big]\, \zeta^{2(k-l)} + \eta^2 \sum_{l=1}^{k-1} \sum_{h=1, h\ne l}^{k-1} \zeta^{2k-l-h} \big(\mathbb{E}[\|\nabla F(W_l)\|_F^2] + \mathbb{E}[\|\nabla F(W_h)\|_F^2]\big)    (8.25)
\le 2\eta^2 \sum_{l=1}^{k-1} \mathbb{E}\big[\|\nabla F(W_l)\|_F^2\big]\, \zeta^{2(k-l)} + 2\eta^2 \sum_{l=1}^{k-1} \zeta^{k-l}\, \mathbb{E}\big[\|\nabla F(W_l)\|_F^2\big] \sum_{h=1, h\ne l}^{k-1} \zeta^{k-h}    (8.26)
\le 2\eta^2 \sum_{l=1}^{k-1} \zeta^{2(k-l)}\, \mathbb{E}\big[\|\nabla F(W_l)\|_F^2\big] + \frac{2\eta^2 \zeta}{1-\zeta} \sum_{l=1}^{k-1} \zeta^{k-l}\, \mathbb{E}\big[\|\nabla F(W_l)\|_F^2\big]    (8.27)
Now we sum over all k from 1 to t to get the following:

\frac{L^2}{mt}\sum_{k=1}^{t} \mathbb{E}\big[\|W_k(I-J)\|_F^2\big] \le \frac{2\eta^2 L^2\sigma^2}{b}\,\frac{\zeta^2}{1-\zeta^2} + \frac{2\eta^2 L^2}{mt\,(1-\zeta)} \sum_{k=1}^{t}\sum_{l=1}^{k-1} \zeta^{k-l}\, \mathbb{E}\big[\|\nabla F(W_l)\|_F^2\big]    (8.28)

\le \frac{2\eta^2 L^2\sigma^2}{b}\,\frac{\zeta^2}{1-\zeta^2} + \frac{2\eta^2 L^2}{mt}\,\frac{\zeta}{(1-\zeta)^2} \sum_{k=1}^{t} \mathbb{E}\big[\|\nabla F(W_k)\|_F^2\big]    (8.29)
Substituting this into (8.12), we have:

\frac{1}{t}\sum_{k=1}^{t} \mathbb{E}\big[\|\nabla F(\overline{w}_k)\|^2\big] \le \frac{2(F(w_1) - F_{\inf})}{t\eta} + \frac{\eta L\sigma^2}{bm} + \frac{2\eta^2 L^2\sigma^2}{b}\,\frac{\zeta^2}{1-\zeta^2} + \frac{2\eta^2 L^2}{mt}\,\frac{\zeta}{(1-\zeta)^2} \sum_{k=1}^{t} \mathbb{E}\big[\|\nabla F(W_k)\|_F^2\big] - \frac{(1-\eta L)}{mt}\sum_{k=1}^{t} \mathbb{E}\big[\|\nabla F(W_k)\|_F^2\big]    (8.30)

If ηL + η²L²(2ζ²/(1 − ζ²) + 2ζ/(1 − ζ)²) ≤ 1, then the last two terms can be dropped, and this completes the proof of Theorem 8.1.
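To make the role of ζ concrete, the sketch below builds a simple doubly stochastic mixing matrix for a ring of m nodes (uniform weight 1/3 on the node itself and each of its two neighbors, an illustrative choice) and computes ζ = max(|λ2(M)|, |λm(M)|):

import numpy as np

def ring_mixing_matrix(m: int) -> np.ndarray:
    """Doubly stochastic mixing matrix of a ring: weight 1/3 on self and on each neighbor."""
    M = np.zeros((m, m))
    for i in range(m):
        for j in (i - 1, i, i + 1):
            M[i, j % m] = 1.0 / 3.0
    return M

def zeta(M: np.ndarray) -> float:
    """Second largest eigenvalue magnitude, i.e., zeta = max(|lambda_2(M)|, |lambda_m(M)|)."""
    return float(np.sort(np.abs(np.linalg.eigvals(M)))[::-1][1])

for m in (8, 16, 32):
    print(m, round(zeta(ring_mixing_matrix(m)), 3))   # zeta approaches 1 as the ring grows

As the ring grows (and hence becomes relatively sparser), ζ approaches 1 and the network error term in (8.10) increases, which is exactly the behavior predicted by Theorem 8.1.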
8.3.3 Convergence Analysis of Decentralized Local-Update SGD
Theorem 8.1 can be generalized to the decentralized local-update SGD variant, where each node performs τ local updates before communicating with its neighbors, as follows.

Theorem 8.2 (Convergence of Decentralized Local-update SGD) Starting from w_1, for an L-smooth objective function, if the learning rate η satisfies

\eta L + \eta^2 L^2 \tau \left( \frac{2\zeta^2}{1-\zeta^2} + \frac{2\zeta}{1-\zeta} + \frac{\tau - 1}{\tau} \right) \le 1,

then after t iterations (t/τ communication rounds) of decentralized local-update SGD,

\frac{1}{t}\sum_{k=1}^{t} \mathbb{E}\big[\|\nabla F(\overline{w}_k)\|^2\big] \le \frac{2\,\big[F(w_1) - F_{\inf}\big]}{\eta t} + \frac{\eta L\sigma^2}{m} + \eta^2 L^2 \sigma^2 \left(\frac{1+\zeta^2}{1-\zeta^2}\,\tau - 1\right)    (8.31)

where \overline{w}_k denotes the averaged model at the k-th iteration. The proof of Theorem 8.2 is a combination of the proof of local-update SGD given in Chap. 6 and that of decentralized SGD derived above (see [9] for details). When the communication period τ = 1 and ζ = 0, this bound reduces to the convergence bound for synchronous SGD. As τ increases, the adverse effect of network sparsity (larger ζ) is magnified because τ multiplies the term (1 + ζ²)/(1 − ζ²). To illustrate the effect of τ and ζ, we plot the last term in (8.31), referred to as the network error bound (NEB) term, in Fig. 8.2a. The network becomes sparser as ζ increases along the x-axis. Each line corresponds to a different value of τ. The bottom left corner of the plot corresponds to fully synchronous SGD. Using Fig. 8.2a, a system designer can select the (ζ, τ) pair that achieves the target error and also has a smaller runtime per iteration than synchronous SGD. Figure 8.2b shows experimental results for a system of 8 worker nodes used to train the VGG-16 network on the CIFAR-10 dataset. We observe
[Fig. 8.2 panels compare fully synchronous SGD (τ = 1, ζ = 0) with the configurations (τ = 1, ζ = 1/3), (τ = 2, ζ = 0), (τ = 10, ζ = 0), and (τ = 1, ζ = 0.99); the x-axis of panel (a) is ζ, with sparser networks to the right, and each line corresponds to τ = 1, 2, or 10.]
(a) Numerical plot of the additional network error bound (NEB) term in (8.10).
(b) Experimental results on neural networks.
Fig. 8.2 a Illustration of how the additional network error term in (8.31) monotonically increases with τ and ζ ; b Experiments on CIFAR-10 with VGG-16 and 8 worker nodes. For the same learning rate, larger τ or larger ζ lead to a higher error floor at convergence. Each line in (b) corresponds to a circled point in (a)
that by using a sparser network, we can achieve almost the same error versus iterations convergence as synchronous SGD.
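The NEB term plotted in Fig. 8.2a is simple enough to tabulate directly; the sketch below evaluates the bound as stated in (8.31) with η = L = σ = 1, since only the relative comparison across (ζ, τ) pairs matters:

def network_error_bound(zeta: float, tau: int, eta: float = 1.0, L: float = 1.0, sigma: float = 1.0) -> float:
    """Last term of (8.31): eta^2 L^2 sigma^2 * ((1 + zeta^2) / (1 - zeta^2) * tau - 1)."""
    return eta**2 * L**2 * sigma**2 * ((1 + zeta**2) / (1 - zeta**2) * tau - 1)

for tau in (1, 2, 10):
    # The NEB grows with both tau and zeta; (tau = 1, zeta = 0) recovers fully synchronous SGD with NEB 0.
    print(tau, [round(network_error_bound(z, tau), 2) for z in (0.0, 0.33, 0.75, 0.99)])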
8.4 Runtime Analysis
In this section, we quantify the runtime advantage of using a sparse network topology to perform consensus between the m worker nodes. The runtime per iteration of decentralized SGD depends on the communication protocol used for inter-node information exchange. Below, we show the analysis for one such communication protocol. To obtain a simple analysis of how the runtime per iteration scales with the properties of the graph, we assume that the communication between any pair of nodes can happen in parallel. This can be implemented in practice using (1) time division multiplexing, where each edge is assigned a time slot in which the nodes connected by that edge can exchange information, or (2) frequency division multiplexing, where each edge is assigned a frequency band that it can use to communicate. More efficient communication protocols that reduce the number of time slots or frequencies required can be designed by decomposing the network into matchings. We refer the readers to [10] for a detailed description of such improved communication protocols.

Suppose that the local computation time to complete one SGD iteration at each node is Y, assumed to be constant here for simplicity. We also assume that the communication delay in exchanging models between nodes i and j is an exponentially distributed random variable T_{i,j} ∼ Exp(μ), independent and identically distributed across edges. Let us compute the expected time T_i taken by each node i to exchange (send and receive) model updates with its neighbors j ∈ N(i). The random variable T_i is the maximum of the communication times on each of node i's links:
T_i = \max_{j \in \mathcal{N}(i)} T_{i,j}    (8.32)

\mathbb{E}[T_i] \approx \frac{\log d_i}{\mu}    (8.33)

Since all inter-node communications occur in parallel, the expected runtime per iteration is the maximum of the T_i's over all nodes:

\mathbb{E}[T] = Y + \mathbb{E}\big[\max_{i \in \mathcal{V}} T_i\big]    (8.34)
\ge Y + \max_{i \in \mathcal{V}} \mathbb{E}[T_i]    (8.35)
\approx Y + \frac{\log(\max_i d_i)}{\mu}    (8.36)

where to get (8.35), we use the convexity of the maximum function to switch the order of the max and the expectation. Thus, the expected runtime is lower bounded by a function that increases with the maximum node degree in the graph. For exponential inter-node communication times, the function scales logarithmically with the maximum node degree. In contrast, for fully synchronous SGD, recall that the expected runtime per iteration is E[T] ≈ (log m)/μ. By using a sparse graph, we can ensure that the maximum degree max_i d_i is significantly smaller than m, and thus decentralized SGD can achieve a large speed-up over synchronous SGD. If we perform decentralized local-update SGD with τ local updates at each node, then the communication delay is amortized across the τ iterations to get

\mathbb{E}[T] = Y + \frac{\mathbb{E}\big[\max_{i \in \mathcal{V}} T_i\big]}{\tau}    (8.37)
\ge Y + \frac{\max_{i \in \mathcal{V}} \mathbb{E}[T_i]}{\tau}    (8.38)
\approx Y + \frac{\log(\max_i d_i)}{\mu \tau}    (8.39)
Table 8.1 Comparison between sparse averaging (decentralized averaging) and full synchronization (via a parameter server). When the latency to establish handshakes is dominant, sparse averaging can provide a significant reduction in communication time

Averaging protocol                      | # Handshakes | Transmitted data size
Decentralized                           | max_i d_i    | 2d · max_i d_i
Fully synchronized (parameter server)   | m            | 2dm
[Fig. 8.3 panels: training loss versus epochs (left) and versus wall-clock time (right) for (τ = 1, ζ = 0), (τ = 10, ζ = 0), (τ = 1, ζ = 0.75), and (τ = 10, ζ = 0.75).]
Fig. 8.3 Decentralized periodic averaging on CIFAR-10 with VGG-16. Fully synchronous SGD corresponds to (τ = 1, ζ = 0). Allowing more local updates (higher τ ) leads to slower convergence in terms of epochs. But it requires about 4x less wall-clock time to achieve a training loss of 0.1
Summary In this chapter we considered a variant of our distributed optimization problem where there is no central parameter server to coordinate between the worker nodes. Instead, the workers are connected via an arbitrary decentralized topology and each worker can only exchange updates with its neighbors. We described the decentralized SGD algorithm and several of its variants that combine decentralized consensus with previously proposed communication-efficiency and variance reduction strategies. We then presented an error convergence analysis and a runtime analysis of decentralized SGD, along with experimental results showing the trade-off between error and runtime.

Problems

1. Consider a network of 4 worker nodes that are connected by a ring topology, and they collaboratively train a machine learning model using decentralized SGD. At every averaging step, assume that each worker assigns weight α to each of its neighbors. Write down the mixing matrix of this system of nodes.
2. Please calculate the ζ value of the mixing matrix that you wrote down in Problem 1 (you can use numpy.linalg.eig to find the eigenvalues). Plot ζ for α taking values from the set {0, 0.005, 0.100, 0.105, 0.110, . . . , 0.500}. In order to achieve the lowest error floor at convergence, what is the best value of α? In order to guarantee convergence, what is the range of feasible α?
References 1. J. Tsitsiklis, D. Bertsekas, and M. Athans, “Distributed asynchronous deterministic and stochastic gradient optimization algorithms,” IEEE Transactions on Automatic Control, vol. 31, no. 9, pp. 803–812, 1986. 2. A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009. 3. K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,” SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, 2016. 4. X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent,” in Proceedings of the International Conference on Neural Information Processing Systems, 2017, pp. 5330–5340. 5. K. Scaman, F. Bach, S. Bubeck, L. Massoulié, and Y. T. Lee, “Optimal algorithms for non-smooth distributed optimization in networks,” in Advances in Neural Information Processing Systems, 2018, pp. 2740–2749. 6. A. Koloskova, S. U. Stich, and M. Jaggi, “Decentralized stochastic optimization and gossip algorithms with compressed communication,” arXiv preprint arXiv:1902.00340, 2019. 7. D. Jakovetic, D. Bajovic, A. K. Sahu, and S. Kar, “Convergence rates for distributed stochastic optimization over random networks,” in 2018 IEEE Conference on Decision and Control (CDC). IEEE, 2018, pp. 4238–4245. 8. M. Assran, N. Loizou, N. Ballas, and M. Rabbat, “Stochastic gradient push for distributed deep learning,” in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 97. PMLR, 09–15 Jun 2019, pp. 344–353. [Online]. Available: http://proceedings.mlr.press/v97/assran19a.html 9. J. Wang and G. Joshi, “Cooperative sgd: A unified framework for the design and analysis of local-update sgd algorithms,” Journal of Machine Learning Research, vol. 22, no. 213, pp. 1–50, 2021. [Online]. Available: http://jmlr.org/papers/v22/20-147.html 10. J. Wang, A. Sahu, G. Joshi, and S. Kar, “MATCHA: Speeding Up Decentralized SGD via Matching Decomposition Sampling,” preprint, May 2019. [Online]. Available: https://arxiv.org/ abs/1905.09435 11. B. Li, S. Cen, Y. Chen, and Y. Chi, “Communication-efficient distributed optimization in networks with gradient tracking and variance reduction,” in Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), ser. Proceedings of Machine Learning Research, vol. 108. PMLR, 26–28 Aug 2020, pp. 1662–1672. [Online]. Available: https://proceedings.mlr. press/v108/li20f.html
9 Beyond Distributed Training in the Cloud
Let us summarize the concepts that we learned through this book, and discuss the future of distributed machine learning beyond the cloud-based implementations that we studied in this book. We started off this book by analyzing the convergence of classic single-node SGD and studied some of its variance-reduced variants such as SAG and SAGA. To handle large training datasets in a fast and efficient manner, practical implementations require SGD to be run in a distributed manner. The main focus of this book was on studying distributed stochastic gradient descent (SGD) algorithms, which are used to perform machine learning training using worker nodes that are servers in the cloud. In Chap. 4, we introduced the first and the most common distributed SGD algorithm, synchronous SGD, which uses a parameter server and a system of worker nodes, and we analyzed its convergence and runtime per iteration. To overcome tail latency due to straggling workers in synchronous SGD, in Chap. 5 we introduced the asynchronous SGD algorithm, which removes the synchronization barrier and allows worker nodes to independently communicate with the parameter server whenever they finish their local computation. Then in Chaps. 6 and 7 we studied two ways to reduce the communication cost of sending and receiving model updates between the worker nodes and the parameter server. Finally, in Chap. 8 we discussed decentralized SGD using an arbitrary peer-to-peer topology connecting worker nodes, without having a central parameter server to aggregate their updates. A key insight from all these algorithms and their analyses was that the standard optimization-theoretic approach that seeks to speed up the error versus iterations convergence is insufficient to capture the true convergence speed with respect to wallclock time. The design of distributed SGD algorithms has to take into account how the synchronization and communication protocol affects the wallclock runtime spent per iteration. This book advocates such a system-aware philosophy in the design of distributed machine learning.

In most modern machine learning applications, edge nodes such as cellphones or sensors collect rich training data from their environment, which can be used for data-driven decision-making.
[Fig. 9.1 annotations: geographically distributed nodes; unreliable and bandwidth-limited communication; heterogeneity in the number of local model updates; heterogeneous data size and composition; decentralized communication.]
Fig. 9.1 Federated learning using communication-limited edge devices such as cellphones, smartwatches, IoT cameras, etc. Computational speeds and the size and composition of data can vary widely across nodes
If we wish to employ cloud-based training using the parameter server framework, then these raw training data will have to be transferred to the cloud, where they can be shuffled and partitioned across worker nodes. However, due to limited communication capabilities as well as privacy concerns, it is prohibitively expensive to send the data collected by edge nodes to the cloud for centralized processing. The nascent research field called federated learning [1, 2] considers a large number of resource-constrained edge clients such as cellphones or IoT sensors that collect training data from their environment. Instead of sending the training data to the cloud, the nodes locally perform a few iterations of training and only send the resulting model to the cloud, as illustrated in Fig. 9.1. While federated learning algorithms are similar to local-update SGD and other communication-efficient distributed SGD algorithms, there are some unique aspects of the large-scale federated setting that require innovations beyond cloud-based distributed training algorithms.

1. Data Heterogeneity. Unlike cloud-based training where a central training dataset is shuffled and then evenly partitioned across the worker nodes, in federated learning, the data is collected independently by each edge client. Thus the local datasets are highly heterogeneous across the clients, both in terms of size and distribution. This data heterogeneity, when combined with limited communication between the edge clients and the cloud, can result in a higher error floor. Recent works [3, 4] have proposed techniques to combat such data heterogeneity.

2. Computational Heterogeneity. In the cloud-based training setting, we considered homogeneous worker nodes that can experience short-term delay fluctuations. However, in the federated learning setting, the computational capabilities can vary widely across
edge clients due to different device types (phones versus sensors) and varying hardware speeds. The amount of local computation can also vary due to the heterogeneity in the local dataset sizes. For example, in federated learning, each node performs E epochs (traversals of its local dataset). If a client has n_i local data samples and the mini-batch size is b, the number of local SGD iterations is τ_i = E n_i/b, which can vary widely across nodes. Alternately, if we fix a time interval instead of fixing the number of local epochs, then faster nodes can perform more local updates than slower nodes within the same time interval. Finally, the clients may also use different local optimizers, which can result in heterogeneous local progress. Some recent works [5–7] propose new algorithms to tackle such computational heterogeneity.

3. Communication Constraints. Since edge clients are wirelessly connected to the cloud and to the other clients, typically with bandwidth-limited links, distributed training algorithms have to operate under much stricter communication constraints than in cloud-based training. Techniques such as local updates, quantization, and sparsification of updates are essential to enable federated training algorithms to scale to a large number of devices.

4. Intermittent Availability. Due to the sheer scale of federated learning, where thousands or even millions of geographically distributed edge clients can be involved in the training of a machine learning model, it is infeasible for all clients to be available in each training round. The availability of devices depends on their timezone, their battery status, and their network connectivity. Therefore, federated learning algorithms have to allow only a subset of clients to participate in each training round. Recent works such as [8] propose client selection strategies, and others such as [9] try to reduce the variance due to partial client participation.

5. Private and Secure Aggregation. Although the data stays on the edge client, the model updates that are communicated with the cloud can reveal private information about the users' data. Local and global differential privacy techniques [10] can be used to obfuscate the local updates sent by the clients. However, there is a trade-off between the convergence speed and the privacy offered by these methods. An alternative is to use cryptographic secure aggregation protocols [11]. However, such secure aggregation protocols require additional computation, which makes them hard to scale to a large number of edge clients.

6. Adversarial Nodes. In the data-center setting, all the worker nodes are under centralized control by the system administrator and it is difficult for malicious adversaries to take control of one or more worker nodes. However, with thousands of edge clients in the federated learning scenario, adversaries can control a small subset of devices without being identified by the central aggregating server. Some recent works such as [12, 13] show that even a single adversarial client can send poisonous updates and replace the trained global model by any model of its choice. Currently, federated learning algorithms employ clipping of clients' updates in order to mitigate the effect of adversarial clients [14].
Although we did not cover federated learning and the above aspects in detail in this book, the foundation of system-aware distributed algorithm design established here can be helpful in the design of federated optimization algorithms.
References 1. H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “CommunicationEfficient Learning of Deep Networks from Decentralized Data,” International Conference on Artificial Intelligenece and Statistics (AISTATS), Apr. 2017. [Online]. Available: https://arxiv. org/abs/1602.05629 2. P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, R. G. L. D’Oliveira, S. E. Rouayheb, D. Evans, J. Gardner, Z. Garrett, A. Gascon, B. Ghazi, P. B. Gibbons, M. Gruteser, Z. Harchaoui, C. He, L. He, Z. Huo, B. Hutchinson, J. Hsu, M. Jaggi, T. Javidi, G. Joshi, M. Khodak, J. Konecny, A. Korolova, F. Koushanfar, S. Koyejo, T. Lepoint, Y. Liu, P. Mittal, M. Mohri, R. Nock, A. Ozgur, R. Pagh, M. Raykova, H. Qi, D. Ramage, R. Raskar, D. Song, W. Song, S. U. Stich, Z. Sun, A. T. Suresh, F. Tramer, P. Vepakomma, J. Wang, L. Xiong, Z. Xu, Q. Yang, F. X. Yu, H. Yu, and S. Zhao, “Advances and open problems in federated learning,” Foundations and Trends in Machine Learning, Jun. 2021. [Online]. Available: https://arxiv.org/abs/1912.04977 3. X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of fedavg on non-iid data,” in International Conference on Learning Representations (ICLR), Jul. 2020. [Online]. Available: https://arxiv.org/abs/1907.02189 4. S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh, “SCAFFOLD: Stochastic controlled averaging for on-device federated learning,” arXiv preprint arXiv:1910.06378, 2019. 5. J. Wang, Q. Liu, H. Liang, G. Joshi, and H. V. Poor, “Tackling the Objective Inconsistency Problem in Heterogeneous Federated Optimization,” in Proceedings on Neural Information Processing Systems (NeurIPS), Dec. 2020. [Online]. Available: https://arxiv.org/abs/2007.07481 6. J. Wang, Z. Xu, Z. Garrett, Z. Charles, L. Liu, and G. Joshi, “Local adaptivity in federated learning: Convergence and consistency,” arXiv preprint arXiv:2106.02305, 2021. 7. F. Mansoori and E. Wei, “Flexpd: A flexible framework of first-order primal-dual algorithms for distributed optimization,” IEEE Transactions on Signal Processing, vol. 69, pp. 3500–3512, 2021. 8. Y. J. Cho, J. Wang, and G. Joshi, “Client selection in federated learning: Convergence analysis and power-of-choice selection strategies,” 2020. 9. X. Gu, K. Huang, J. Zhang, and L. Huang, “Fast federated learning in the presence of arbitrary device unavailability,” Advances in Neural Information Processing Systems, vol. 34, 2021. 10. K. Wei, J. Li, M. Ding, C. Ma, H. H. Yang, F. Farokhi, S. Jin, T. Q. S. Quek, and H. Vincent Poor, “Federated learning with differential privacy: Algorithms and performance analysis,” IEEE Transactions on Information Forensics and Security, vol. 15, pp. 3454–3469, 2020. 11. K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth, “Practical secure aggregation for federated learning on user-held data,” in NIPS Workshop on Private Multi-Party Machine Learning, 2016. 12. H. Wang, K. Sreenivasan, S. Rajput, H. Vishwakarma, S. Agarwal, J.-y. Sohn, K. Lee, and D. Papailiopoulos, “Attack of the tails: Yes, you really can backdoor federated learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 16 070–16 084, 2020.
13. E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, and V. Shmatikov, “How to backdoor federated learning,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 2938–2948. 14. Z. Sun, P. Kairouz, A. T. Suresh, and H. B. McMahan, “Can you really backdoor federated learning?” arXiv preprint arXiv:1911.07963, 2019.