Table of contents: Distributed Machine Learning Patterns

Copyright
contents
front matter
    preface
    acknowledgments
    about this book
        Who should read this book?
        How this book is organized: A roadmap
        About the code
        liveBook discussion forum
    about the author
    about the cover illustration

Part 1 Basic concepts and background

1 Introduction to distributed machine learning systems
    1.1 Large-scale machine learning
        1.1.1 The growing scale
        1.1.2 What can we do?
    1.2 Distributed systems
        1.2.1 What is a distributed system?
        1.2.2 The complexity and patterns
    1.3 Distributed machine learning systems
        1.3.1 What is a distributed machine learning system?
        1.3.2 Are there similar patterns?
        1.3.3 When should we use a distributed machine learning system?
        1.3.4 When should we not use a distributed machine learning system?
    1.4 What we will learn in this book
    Summary

Part 2 Patterns of distributed machine learning systems

2 Data ingestion patterns
    2.1 What is data ingestion?
    2.2 The Fashion-MNIST dataset
    2.3 Batching pattern
        2.3.1 The problem: Performing expensive operations on the Fashion-MNIST dataset with limited memory
        2.3.2 The solution
        2.3.3 Discussion
        2.3.4 Exercises
    2.4 Sharding pattern: Splitting extremely large datasets among multiple machines
        2.4.1 The problem
        2.4.2 The solution
        2.4.3 Discussion
        2.4.4 Exercises
    2.5 Caching pattern
        2.5.1 The problem: Re-accessing previously used data for efficient multi-epoch model training
        2.5.2 The solution
        2.5.3 Discussion
        2.5.4 Exercises
    2.6 Answers to exercises
        Section 2.3.4
        Section 2.4.4
        Section 2.5.4
    Summary

3 Distributed training patterns
    3.1 What is distributed training?
    3.2 Parameter server pattern: Tagging entities in 8 million YouTube videos
        3.2.1 The problem
        3.2.2 The solution
        3.2.3 Discussion
        3.2.4 Exercises
    3.3 Collective communication pattern
        3.3.1 The problem: Improving performance when parameter servers become a bottleneck
        3.3.2 The solution
        3.3.3 Discussion
        3.3.4 Exercises
    3.4 Elasticity and fault-tolerance pattern
        3.4.1 The problem: Handling unexpected failures when training with limited computational resources
        3.4.2 The solution
        3.4.3 Discussion
        3.4.4 Exercises
    3.5 Answers to exercises
        Section 3.2.4
        Section 3.3.4
        Section 3.4.4
    Summary

4 Model serving patterns
    4.1 What is model serving?
    4.2 Replicated services pattern: Handling the growing number of serving requests
        4.2.1 The problem
        4.2.2 The solution
        4.2.3 Discussion
        4.2.4 Exercises
    4.3 Sharded services pattern
        4.3.1 The problem: Processing large model serving requests with high-resolution videos
        4.3.2 The solution
        4.3.3 Discussion
        4.3.4 Exercises
    4.4 Event-driven processing pattern
        4.4.1 The problem: Responding to model serving requests based on events
        4.4.2 The solution
        4.4.3 Discussion
        4.4.4 Exercises
    4.5 Answers to exercises
        Section 4.2
        Section 4.3
        Section 4.4
    Summary

5 Workflow patterns
    5.1 What is a workflow?
    5.2 Fan-in and fan-out patterns: Composing complex machine learning workflows
        5.2.1 The problem
        5.2.2 The solution
        5.2.3 Discussion
        5.2.4 Exercises
    5.3 Synchronous and asynchronous patterns: Accelerating workflows with concurrency
        5.3.1 The problem
        5.3.2 The solution
        5.3.3 Discussion
        5.3.4 Exercises
    5.4 Step memoization pattern: Skipping redundant workloads via memoized steps
        5.4.1 The problem
        5.4.2 The solution
        5.4.3 Discussion
        5.4.4 Exercises
    5.5 Answers to exercises
        Section 5.2
        Section 5.3
        Section 5.4
    Summary

6 Operation patterns
    6.1 What are operations in machine learning systems?
    6.2 Scheduling patterns: Assigning resources effectively in a shared cluster
        6.2.1 The problem
        6.2.2 The solution
        6.2.3 Discussion
        6.2.4 Exercises
    6.3 Metadata pattern: Handling failures appropriately to minimize the negative effect on users
        6.3.1 The problem
        6.3.2 The solution
        6.3.3 Discussion
        6.3.4 Exercises
    6.4 Answers to exercises
        Section 6.2
        Section 6.3
    Summary

Part 3 Building a distributed machine learning workflow

7 Project overview and system architecture
    7.1 Project overview
        7.1.1 Project background
        7.1.2 System components
    7.2 Data ingestion
        7.2.1 The problem
        7.2.2 The solution
        7.2.3 Exercises
    7.3 Model training
        7.3.1 The problem
        7.3.2 The solution
        7.3.3 Exercises
    7.4 Model serving
        7.4.1 The problem
        7.4.2 The solution
        7.4.3 Exercises
    7.5 End-to-end workflow
        7.5.1 The problems
        7.5.2 The solutions
        7.5.3 Exercises
    7.6 Answers to exercises
        Section 7.2
        Section 7.3
        Section 7.4
        Section 7.5
    Summary

8 Overview of relevant technologies
    8.1 TensorFlow: The machine learning framework
        8.1.1 The basics
        8.1.2 Exercises
    8.2 Kubernetes: The distributed container orchestration system
        8.2.1 The basics
        8.2.2 Exercises
    8.3 Kubeflow: Machine learning workloads on Kubernetes
        8.3.1 The basics
        8.3.2 Exercises
    8.4 Argo Workflows: Container-native workflow engine
        8.4.1 The basics
        8.4.2 Exercises
    8.5 Answers to exercises
        Section 8.1
        Section 8.2
        Section 8.3
        Section 8.4
    Summary

9 A complete implementation
    9.1 Data ingestion
        9.1.1 Single-node data pipeline
        9.1.2 Distributed data pipeline
    9.2 Model training
        9.2.1 Model definition and single-node training
        9.2.2 Distributed model training
        9.2.3 Model selection
    9.3 Model serving
        9.3.1 Single-server model inference
        9.3.2 Replicated model servers
    9.4 The end-to-end workflow
        9.4.1 Sequential steps
        9.4.2 Step memoization
    Summary

index