Distributed Machine Learning Patterns 4189448097, 4155892859, 4139115240, 4239780954, 1071502578, 2208102039

Practical patterns for scaling machine learning from your laptop to a distributed cluster. In Distributed Machine Learn

English Pages 248 Year 2023

Table of contents:
Distributed Machine Learning Patterns
Copyright
contents
front matter
preface
acknowledgments
about this book
Who should read this book?
How this book is organized: A roadmap
About the code
liveBook discussion forum
about the author
about the cover illustration
Part 1 Basic concepts and background
1 Introduction to distributed machine learning systems
1.1 Large-scale machine learning
1.1.1 The growing scale
1.1.2 What can we do?
1.2 Distributed systems
1.2.1 What is a distributed system?
1.2.2 The complexity and patterns
1.3 Distributed machine learning systems
1.3.1 What is a distributed machine learning system?
1.3.2 Are there similar patterns?
1.3.3 When should we use a distributed machine learning system?
1.3.4 When should we not use a distributed machine learning system?
1.4 What we will learn in this book
Summary
Part 2 Patterns of distributed machine learning systems
2 Data ingestion patterns
2.1 What is data ingestion?
2.2 The Fashion-MNIST dataset
2.3 Batching pattern
2.3.1 The problem: Performing expensive operations for Fashion MNIST dataset with limited memory
2.3.2 The solution
2.3.3 Discussion
2.3.4 Exercises
2.4 Sharding pattern: Splitting extremely large datasets among multiple machines
2.4.1 The problem
2.4.2 The solution
2.4.3 Discussion
2.4.4 Exercises
2.5 Caching pattern
2.5.1 The problem: Re-accessing previously used data for efficient multi-epoch model training
2.5.2 The solution
2.5.3 Discussion
2.5.4 Exercises
2.6 Answers to exercises
Section 2.3.4
Section 2.4.4
Section 2.5.4
Summary
3 Distributed training patterns
3.1 What is distributed training?
3.2 Parameter server pattern: Tagging entities in 8 million YouTube videos
3.2.1 The problem
3.2.2 The solution
3.2.3 Discussion
3.2.4 Exercises
3.3 Collective communication pattern
3.3.1 The problem: Improving performance when parameter servers become a bottleneck
3.3.2 The solution
3.3.3 Discussion
3.3.4 Exercises
3.4 Elasticity and fault-tolerance pattern
3.4.1 The problem: Handling unexpected failures when training with limited computational resources
3.4.2 The solution
3.4.3 Discussion
3.4.4 Exercises
3.5 Answers to exercises
Section 3.2.4
Section 3.3.4
Section 3.4.4
Summary
4 Model serving patterns
4.1 What is model serving?
4.2 Replicated services pattern: Handling the growing number of serving requests
4.2.1 The problem
4.2.2 The solution
4.2.3 Discussion
4.2.4 Exercises
4.3 Sharded services pattern
4.3.1 The problem: Processing large model serving requests with high-resolution videos
4.3.2 The solution
4.3.3 Discussion
4.3.4 Exercises
4.4 The event-driven processing pattern
4.4.1 The problem: Responding to model serving requests based on events
4.4.2 The solution
4.4.3 Discussion
4.4.4 Exercises
4.5 Answers to exercises
Section 4.2
Section 4.3
Section 4.4
Summary
5 Workflow patterns
5.1 What is workflow?
5.2 Fan-in and fan-out patterns: Composing complex machine learning workflows
5.2.1 The problem
5.2.2 The solution
5.2.3 Discussion
5.2.4 Exercises
5.3 Synchronous and asynchronous patterns: Accelerating workflows with concurrency
5.3.1 The problem
5.3.2 The solution
5.3.3 Discussion
5.3.4 Exercises
5.4 Step memoization pattern: Skipping redundant workloads via memoized steps
5.4.1 The problem
5.4.2 The solution
5.4.3 Discussion
5.4.4 Exercises
5.5 Answers to exercises
Section 5.2
Section 5.3
Section 5.4
Summary
6 Operation patterns
6.1 What are operations in machine learning systems?
6.2 Scheduling patterns: Assigning resources effectively in a shared cluster
6.2.1 The problem
6.2.2 The solution
6.2.3 Discussion
6.2.4 Exercises
6.3 Metadata pattern: Handle failures appropriately to minimize the negative effect on users
6.3.1 The problem
6.3.2 The solution
6.3.3 Discussion
6.3.4 Exercises
6.4 Answers to exercises
Section 6.2
Section 6.3
Summary
Part 3 Building a distributed machine learning workflow
7 Project overview and system architecture
7.1 Project overview
7.1.1 Project background
7.1.2 System components
7.2 Data ingestion
7.2.1 The problem
7.2.2 The solution
7.2.3 Exercises
7.3 Model training
7.3.1 The problem
7.3.2 The solution
7.3.3 Exercises
7.4 Model serving
7.4.1 The problem
7.4.2 The solution
7.4.3 Exercises
7.5 End-to-end workflow
7.5.1 The problems
7.5.2 The solutions
7.5.3 Exercises
7.6 Answers to exercises
Section 7.2
Section 7.3
Section 7.4
Section 7.5
Summary
8 Overview of relevant technologies
8.1 TensorFlow: The machine learning framework
8.1.1 The basics
8.1.2 Exercises
8.2 Kubernetes: The distributed container orchestration system
8.2.1 The basics
8.2.2 Exercises
8.3 Kubeflow: Machine learning workloads on Kubernetes
8.3.1 The basics
8.3.2 Exercises
8.4 Argo Workflows: Container-native workflow engine
8.4.1 The basics
8.4.2 Exercises
8.5 Answers to exercises
Section 8.1
Section 8.2
Section 8.3
Section 8.4
Summary
9 A complete implementation
9.1 Data ingestion
9.1.1 Single-node data pipeline
9.1.2 Distributed data pipeline
9.2 Model training
9.2.1 Model definition and single-node training
9.2.2 Distributed model training
9.2.3 Model selection
9.3 Model serving
9.3.1 Single-server model inference
9.3.2 Replicated model servers
9.4 The end-to-end workflow
9.4.1 Sequential steps
9.4.2 Step memoization
Summary
index
