High Performance Computing in Clouds: Moving HPC Applications to a Scalable and Cost-Effective Environment 3031297687, 9783031297687

This book offers a thorough explanation of the path needed to use cloud computing technologies to run High-Performance Computing (HPC) applications.


English Pages 336 [337] Year 2023


Table of contents:
Foreword
Preface
Contents
1 Why Move HPC Applications to the Cloud?
1.1 Book Organization
References
Part I Foundations
2 What Is Cloud Computing?
2.1 First Look at the Cloud
2.1.1 Origin
2.1.2 Definition
2.2 Benefits and Drawbacks
2.2.1 Cost Savings
2.2.2 Elasticity
2.2.3 Drawbacks
2.3 Service and Delivery Models
2.3.1 Service Models
2.3.2 Delivery Models
2.4 Virtualization and Containers Technologies
2.4.1 Virtualization
2.4.2 Containers
2.5 Final Remarks
References
3 What Do HPC Applications Look Like?
3.1 About High-Performance Computing and Its Way So Far
3.1.1 Concept and Motivations
3.1.2 Evolution of HPC Systems
3.1.3 Graphical Programming Unit as the Main HPC Accelerator
3.1.4 Overview of Current HPC Systems and Associated Concerns
3.2 Design and Performance
3.2.1 Methodology for the Design of HPC Applications
3.2.2 Synopsis of HPC Programming
3.2.3 Critical Numerical and Performance Challenges
3.2.4 About Parallel Efficiency
3.3 Two Examples of HPC Applications
3.3.1 Lattice Quantum ChromoDynamics (LQCD)
3.3.2 High-Resolution Seismic Imaging
3.4 HPC and Cloud Computing
References
Part II Running HPC Applications in Cloud
4 Deploying and Configuring Infrastructure
4.1 Introduction
4.2 Key Infrastructure Elements
4.2.1 Virtual Machines
4.2.1.1 Virtual Machine Images
4.2.2 Regions, Availability Zones, and Placement Strategies
4.2.3 Tenancy
4.2.4 Storage Services
4.2.5 Virtual Private Cloud Networks
4.3 Overview of a Cloud-Based HPC Cluster
4.3.1 Cost and Performance of Cloud-Based HPC Clusters
4.4 Deploying Infrastructure on the IaaS Model
4.4.1 GUI and Command-Line Interface Tools
4.4.2 Infrastructure as Code
4.4.3 IaC Tools for Cloud HPC-Cluster-Like Environments
4.5 Considerations About Selecting Resources and Tools to Deploy HPC Systems on the Cloud
References
5 Executing Traditional HPC Application Code in Cloud with Containerized Job Schedulers
5.1 Introduction
5.1.1 Foreword
5.1.2 Chapter Organization
5.2 Change Nothing at the Application Level but a Little at the Cloud Orchestrator Level
5.2.1 Introduction
5.2.2 Elements of Vocabulary and Essential Definitions
5.2.2.1 Basic Vocabulary Regarding the Notion of HPC Jobs and HPC Job Schedulers
5.2.2.2 Overview of Containers and Cloud Orchestrator
5.2.2.3 Overview of Kubernetes, Slurm, OAR and OpenPBS
5.2.3 Related Works
5.2.4 Challenges, Issues, and Solutions
5.2.4.1 Motivation
5.2.4.2 Propositions
5.2.4.3 Containerized HPC Schedulers
5.2.4.4 Dynamic Containerization of HPC Clusters
5.2.4.5 Impact on Pending Jobs
5.2.4.6 Impact on Running Jobs
5.2.4.7 Towards a General Methodology to Containerize HPC Job Schedulers
5.2.5 Summary of the Discussion
5.3 Adding a Mechanism for Autoscaling for Containerized HPC Schedulers
5.3.1 Introduction
5.3.2 Related Works and Positioning
5.3.3 Challenge and Issues for Auto Scaling Mechanisms with OAR
5.3.4 Summary of the Discussion
5.4 Conclusion
References
6 Designing Cloud-Friendly HPC Applications
6.1 Introduction
6.2 Exploring Cloud Features and Capabilities Through the Lens of HPC Demands
6.3 Analyzing HPC Models to Write Cloud-Friendly Applications
6.4 Loosely-Coupled HPC Applications for Cloud
6.4.1 Bag-of-Tasks
6.4.2 Master-Slave
6.4.3 Pipeline
6.4.4 Divide-and-Conquer
6.5 Tightly-Coupled HPC Applications for Cloud
6.5.1 Bulk-Synchronous Parallel
6.6 Discussion and Open Challenges on HPC-Oriented Cloud Applications
6.7 Conclusion
References
7 Exploiting Hardware Accelerators in Clouds
7.1 Introduction
7.2 Accelerator Optimized Instances on the Cloud
7.2.1 GPUs: Graphic Processing Units
7.2.2 TPUs: Tensor Processing Units
7.2.3 FPGAs: Field-Programmable Gate Arrays
7.2.4 Other Cloud Providers Accelerators and AI Processors
7.3 Programming for Cloud Accelerators
7.3.1 Amazon Web Services (AWS)
7.3.2 Google Cloud Platform (GCP)
7.3.3 Microsoft Azure
7.4 Influence of Accelerators in IoT and Edge Computing
7.5 Final Remarks
References
Part III Cost and Performance Optimizations
8 Optimizing Infrastructure for MPI Applications
8.1 Fundamentals of MPI
8.2 Interconnection Networks for MPI Environments
8.3 Cloud Facilities for MPI Applications
8.4 Executing an MPI Job in the Cloud
8.5 Optimizing the Performance of MPI Applications on the Cloud
8.6 Conclusions
References
9 Harnessing Low-Cost Virtual Machines on the Spot
9.1 Introduction
9.2 Spot VMs
9.2.1 Using Hibernation-Prone Spot VMs in BoT Applications
9.3 Reducing Monetary Costs Within Markets
9.3.1 Instances Galore and the Paradox of Choice
9.3.2 Choosing the “Right” Instance May Not Be Enough
9.4 Burstable Virtual Machines
9.5 Conclusions and Future Directions
References
10 Ensuring Application Continuity with Fault Tolerance Techniques
10.1 Introduction
10.2 Fault Tolerance
10.2.1 Failure Detection
10.2.2 Checkpointing
10.2.3 Replication
10.2.4 Fault Tolerant MPI
10.2.5 Fault Tolerance in HPC Applications
10.3 Fault Tolerance in Clouds
10.3.1 Failure Detectors in Clouds
10.3.2 Implementing Checkpoints in Cloud
10.3.2.1 Bag-of-Tasks Applications
10.3.3 Reliable Cloud Storage Solutions
10.3.3.1 Choice of the Storage Service
10.3.4 Replication
10.3.5 Fault Tolerance and Preemptible VMs
10.4 Conclusion and Future Directions
References
11 Avoiding Resource Wastage
11.1 Introduction
11.2 HPC Workload Characteristics and Resource Wastage
11.2.1 Typical HPC Workloads
11.2.2 Sources of Resource Wastage in HPC Cloud
11.2.3 Resource Management
11.3 Strategies to Detect and Prevent Resource Wastage
11.3.1 Metrics to Detect Resource Wastage
11.3.2 Resource Optimisation Strategies
11.3.3 Research Challenges
11.4 Conclusions
References
Part IV Application Study Cases
12 Biological Sequence Comparison on Cloud-Based GPU Environment
12.1 Introduction
12.2 Amazon Web Services
12.2.1 Overview
12.2.2 GPU Instances on AWS
12.2.3 Application Execution on AWS
12.2.4 High-Performance Computing on AWS
12.2.4.1 Fault Tolerance
12.2.4.2 Application Isolation
12.3 Case Study: Biological Sequence Comparison Application
12.3.1 Overview
12.3.2 Reducing the Monetary Costs
12.3.3 Reducing the Execution Time
12.4 Experimental Results
12.4.1 Reducing the Monetary Costs
12.4.2 Reducing the Execution Time
12.4.3 Discussion
12.5 Conclusions
References
13 Reservoir Simulation in the Cloud
13.1 Introduction
13.2 Reservoir Simulation Overview
13.2.1 Reservoir Simulation Software
13.2.2 Reservoir Simulation Challenges
13.3 Cloud Advantages and Challenges for the O&G Industry
13.4 Cloud Deploy Case Study of Reservoir Simulation
13.5 Conclusions and Future Trends
References
14 Cost Effective Deep Learning on the Cloud
14.1 Introduction
14.2 Key Deep Learning Concepts
14.2.1 Training Deep Learning Models
14.2.2 Model Partitioning Strategies for Distributed Training
14.3 Training Deep Learning Models in the Cloud
14.3.1 Services for Deep Learning in the Cloud
14.3.2 Training with IaaS
14.3.3 Training with SageMaker
14.4 Optimizing Cost and Training Time
14.4.1 Study Case: Medical Image Segmentation with MONAI
14.4.2 Searching for a Cost-Efficient Infrastructure
14.4.3 Selecting Efficient VM Types on EC2 and SageMaker
14.4.4 Exploring Cost and Training Time with Distributed Training
14.4.5 Reducing the Cost with Preemptible VMs
14.5 Final Considerations
References
A Deploying an HPC Cluster on AWS
A.1 Deploying Infrastructure Using the Web Console
A.1.1 Creating the VPC Network
A.1.2 Creating a Shared File System Using the AWS Elastic File System (EFS)
A.1.3 Instantiating Virtual Machines
A.2 Deploying Infrastructure Using the AWS Command-Line Interface
A.3 Deploying Infrastructure Using Ansible
B Configuring a Cloud-Deployed HPC Cluster
B.1 Introduction
B.2 Configuring the Cluster Using the Command-Line Interface
B.2.1 Mounting the EFS File System
B.2.2 Configuring SSH for Password-Less Connections
B.2.3 Installing and Configuring MUNGE
B.2.4 Installing and Configuring SLURM
B.3 Configuring the Cluster Using Ansible
B.3.1 Creating the Playbook Inventory
B.3.2 Configuring the HPC Cluster
B.3.3 Executing the Playbook
B.4 Submitting Jobs on the HPC Cluster

Edson Borin · Lúcia Maria A. Drummond · Jean-Luc Gaudiot · Alba Melo · Maicon Melo Alves · Philippe Olivier Alexandre Navaux (Editors)

High Performance Computing in Clouds: Moving HPC Applications to a Scalable and Cost-Effective Environment


Editors

Edson Borin, University of Campinas, Campinas, Brazil
Lúcia Maria A. Drummond, Fluminense Federal University, Niterói, Brazil
Jean-Luc Gaudiot, University of California, Irvine, Irvine, CA, USA
Alba Melo, University of Brasília, Brasília, Brazil
Maicon Melo Alves, PETROBRAS S.A., Macaé, Brazil
Philippe Olivier Alexandre Navaux, Federal University of Rio Grande do Sul, Porto Alegre, Brazil

ISBN 978-3-031-29768-7
ISBN 978-3-031-29769-4 (eBook)
https://doi.org/10.1007/978-3-031-29769-4

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Foreword

The development of high-performance computing systems has been driven by the needs of applications, often viewed through the lens of system developers. When I was at Cray Research in the 1990s, end users were asked about their needs and desires, which were captured in terms of needed functionalities at a kernel or microkernel level as well as how these kernels interoperated outside of the CPU, e.g., via networking and I/O. These “models” of user needs were then used to design and simulate systems, which were then built, and handed off to the software team to build and port software components to make the system usable. Finally, users were brought in to test the systems, at a time when hardware changes could no longer be made and software changes were possible, but difficult and time-consuming to implement, with typically multiple years needed before they would appear.

This was recognized as a problem and led to an era of co-design, particularly in the work leading to exascale systems, where users/application developers, system software and library developers, and computer architects came together to design, simulate, and in some cases build reduced models of the various software and hardware components in order to perform analyses and optimizations. This allowed a small number of users (those seen as representing important applications, mostly scientific and engineering simulation and modeling) to have a more direct and more integrated role in the development of HPC systems, though often limited in practice to incremental changes on commercially planned components. These systems, as they have been developed, are typically oversubscribed, and usually focus on high utilization and throughput of large jobs, with users having grown used to this model of submitting and waiting, as their applications typically are not time-critical. These systems also are traditionally homogeneous, where developers port their application to each system individually and then can ideally use most or all of the system.

In parallel, large companies such as Amazon and Google were working to build systems to support their data analysis needs, with their workloads leading to a different set of choices for processors, connectivity, I/O, and other components, with increased heterogeneity to support multiple workloads. Additionally, the needs of these companies to run their essential and bursty operations in a timely manner caused them to build systems that were underutilized when larger applications were not running. This led them to sell this unused capacity to others, which they then developed into a profit center: clouds. For many external small-scale users, this capability can appear to be infinite and available on-demand. Many scientists and engineers saw this as tremendously appealing, particularly those who were focusing on data analysis initially and then later, deep learning, as the common view was that the hardware and software system choices made by cloud providers would not support HPC modeling and simulation applications. Additionally, the easy-to-use, automated model for gaining access to these cloud resources is very appealing to researchers who have been used to the long peer-review processes often used to determine allocations on HPC systems and the manual processes to actually implement these decisions, as is the idea of easy-to-port-to resources that are required based on underlying heterogeneity and enabled by container technologies.

Today, it’s clear that there are many HPC applications that do work well on both commercial and in-house clouds, as well as some that don’t, for a variety of reasons including changes in interconnects, virtualization systems, and optimal levels of numerical precision. Understanding this, and what changes could be made at the application, system software, and hardware level to increase the fraction that do, is the topic of this timely book, which has the promise of bridging the gap between user and large-scale system needs.

Champaign, IL, USA
December 2022

Daniel S. Katz

Preface

This book offers a thorough explanation of the path needed to use cloud computing technologies to run High-Performance Computing (HPC) applications. Besides presenting the motivation behind moving HPC applications to the cloud, it covers both essential and advanced issues on this topic, such as deploying HPC applications and infrastructures, designing cloud-friendly HPC applications, and optimizing a provisioned cloud infrastructure to run this sort of application. Additionally, this book also describes the best practices to maintain and keep running HPC applications in the cloud by employing fault-tolerance techniques and avoiding resource wastage.

To give practical meaning to the topics covered in this book, it presents some case studies where HPC applications used in relevant scientific areas, such as Bioinformatics and the Oil and Gas industry, were moved to the cloud. Moreover, it also discusses how to train deep learning models in the cloud, elucidating the key components and aspects necessary to train these models via different types of services offered by cloud providers.

Despite the vast bibliography about cloud computing and HPC, there is a lack of books covering these topics together, discussing the steps, methods, and strategies to execute HPC applications in clouds. Therefore, we believe this title is useful for IT professionals, students, and researchers interested in the cutting-edge technologies, concepts, and insights surrounding the use of cloud technologies to run HPC applications.

To produce a meaningful book that truly reaches its main objective, the editors initially defined its chapters and some essential contents. Only then were specialists invited to contribute to the chapters that matched their expertise. All chapters were also reviewed to ensure a coherent flow of the presented topics.


We are grateful to all authors who have contributed to this book by accepting our invitation and suggestions, and sharing their knowledge and experience in the written chapters.

Campinas, Brazil: Edson Borin
Niterói, Brazil: Lúcia Maria A. Drummond
Irvine, CA, USA: Jean-Luc Gaudiot
Brasília, Brazil: Alba Melo
Macaé, Brazil: Maicon Melo Alves
Porto Alegre, Brazil: Philippe Olivier Alexandre Navaux

November 2022

Contents

1 Why Move HPC Applications to the Cloud?
  Edson Borin, Lúcia Maria A. Drummond, Jean-Luc Gaudiot, Alba Melo, Maicon Melo, and Philippe O. A. Navaux

Part I Foundations

2 What Is Cloud Computing?
  Maicon Melo Alves

3 What Do HPC Applications Look Like?
  Claude Tadonki

Part II Running HPC Applications in Cloud

4 Deploying and Configuring Infrastructure
  Edson Borin and Otávio O. Napoli

5 Executing Traditional HPC Application Code in Cloud with Containerized Job Schedulers
  Christophe Cérin, Nicolas Grenèche, and Tarek Menouer

6 Designing Cloud-Friendly HPC Applications
  Rodrigo da Rosa Righi, Guilherme Galante, Vinicius Facco Rodrigues, Heonyoung Yeom, Harald Koestler, Madhusudan Singh, and Guann-Pyng Li

7 Exploiting Hardware Accelerators in Clouds
  Cristiano A. Künas, Matheus S. Serpa, and Philippe O. A. Navaux

Part III Cost and Performance Optimizations

8 Optimizing Infrastructure for MPI Applications
  José E. Moreira

9 Harnessing Low-Cost Virtual Machines on the Spot
  Alexandre C. Sena, Cristina Boeres, Luan Teylo, Lúcia Maria A. Drummond, and Vinod E. F. Rebello

10 Ensuring Application Continuity with Fault Tolerance Techniques
  Rafaela Brum, Luan Teylo, Luciana Arantes, and Pierre Sens

11 Avoiding Resource Wastage
  Altino M. Sampaio and Jorge G. Barbosa

Part IV Application Study Cases

12 Biological Sequence Comparison on Cloud-Based GPU Environment
  Walisson P. Sousa, Filipe M. Soares, Rafaela C. Brum, Marco Figueiredo, Alba C. M. A. Melo, Maria Clicia S. de Castro, and Cristiana Bentes

13 Reservoir Simulation in the Cloud
  Felipe Albuquerque Portella and Fabio Moreira de Souza

14 Cost Effective Deep Learning on the Cloud
  Otávio O. Napoli, Rafael K. Tesser, Daniel L. Fonseca, and Edson Borin

A Deploying an HPC Cluster on AWS
  Edson Borin and Otávio O. Napoli

B Configuring a Cloud-Deployed HPC Cluster
  Edson Borin and Otávio O. Napoli

Chapter 1

Why Move HPC Applications to the Cloud?
Edson Borin, Lúcia Maria A. Drummond, Jean-Luc Gaudiot, Alba Melo, Maicon Melo, and Philippe O. A. Navaux

From the invention of the first general-purpose electronic computers in the 1940s to the early 1990s, the supercomputers used in high-performance computing (HPC) were highly specialized parallel machines with vector processors specifically designed to accelerate the execution of niche scientific and engineering applications. With the evolution and popularization of processors for personal computers (PCs) and workstations, from the 1990s onwards, supercomputers became clusters of computers with microprocessors designed for general-purpose applications connected through high-speed networks. With economies of scale, this approach reduced the costs of designing and manufacturing HPC systems and still shapes the design of current supercomputers. In fact, the majority of the supercomputers listed on the Top 500 list (the list of the 500 most powerful supercomputers in the world: www.top500.org) are based on x86 microprocessors, and most of the accelerators used in these systems also comprise chips originally designed for PCs, e.g., Graphics Processing Units (GPUs).

With the rapid evolution of the internet in the 1990s and early 2000s, several researchers investigated how to execute high-performance computing applications on multiple PCs, workstations, or clusters of computers connected through the internet. This approach would allow computers located at distant sites (e.g., cities or countries) to be combined to support the execution of applications to solve very large computational problems. On the one hand, this approach, known as grid computing, was highly successful in supporting the execution of decoupled (or weakly-coupled) applications, which are not sensitive to the performance of the interconnecting network; on the other hand, strongly- or mildly-coupled applications could not be accelerated with this kind of system, mostly due to the poor performance of the network.

The evolution of the internet also gave rise to online businesses, which, in turn, led to the creation of several very large datacenters (some of these datacenters contain millions of computing cores) distributed across the world to support these businesses’ operations. These datacenters, originally designed to support the peak demand of specific online businesses, would be heavily underutilized during long periods of time if used only for this purpose. Nonetheless, the evolution of virtualization technologies allowed datacenter owners to rent the underutilized hardware to third parties, including ordinary consumers. This new business model, called cloud computing, allowed anyone to rent computing resources from these datacenters to host online services or execute computational workloads.

Similar to supercomputers, cloud computing datacenters are composed of several clusters of computers with microprocessors designed for general-purpose applications connected through high-speed networks. Even though the network performance is not as high as the one experienced on supercomputers, the total number of computers and microprocessors in a cloud datacenter may be much larger than on supercomputers. In this context, it is natural to expect users to rent fractions of the datacenter resources to use as a supercomputer. In fact, some supercomputers listed on the Top 500 list are systems composed of hardware rented from cloud computing datacenters.

While in the 1990s economies of scale favored the design of microprocessors for general-purpose applications, mostly due to the PC industry, current trends show that the demands of cloud datacenters will drive the design of these chips [2, 3, 7]. Hence, it would not be surprising if the hardware of future supercomputers was more similar to cloud datacenters than it is today. Although we have not reached this point yet, even now there are several advantages to executing HPC workloads on the cloud: (a) there is no need to acquire, install, and maintain expensive hardware nor to invest in physical facilities to host such equipment; (b) there are no energy or cooling constraints to hamper the increase of computing and storage capacity; (c) there is no need to waste effort on third-party obligations like contracting and maintaining IT professionals, managing software licenses, or addressing hardware obsolescence; moreover, (d) there are virtually no delays associated with system acquisition and installation, nor waiting times on job queues caused by high system utilization.

There have been several successful cases of users migrating HPC applications from on-premise supercomputers to the cloud [1, 4–6, 8–10]; nonetheless, there are still several challenges that must be addressed to ease this process on a wider scale. Finding the right infrastructure (i.e., set of cloud computing resources) for each application or adapting the application to take advantage of new features, such as cloud elasticity, are a few examples of these challenges.

The various chapters of this book discuss strategies and techniques to efficiently use the infrastructure provided by the cloud to execute HPC applications. Besides presenting the basic concepts related to cloud computing and a brief description of HPC applications, the authors have brought together comprehensive material covering fundamental aspects of provisioning a raw infrastructure to run those applications and advanced topics like the adoption of methods to ensure application continuity and avoid resource wastage. Thus, this material brings useful information to any stage or maturity level of projects related to running HPC applications in the cloud. We hope this book will help readers take their initial and main decisions for implementing their HPC applications in the cloud.
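To make the idea of renting a fraction of a cloud datacenter more concrete, the short sketch below shows how a handful of compute nodes could be requested programmatically from a public cloud. This is only an illustrative sketch, not an example taken from this book: it assumes Python with the boto3 AWS SDK and valid AWS credentials, and the machine image, key pair, and security group identifiers are placeholders that a reader would replace with their own values.

# Illustrative sketch (not from the book): requesting a small set of compute
# instances from AWS EC2 with boto3. The AMI ID, key pair, and security group
# below are placeholders and must be replaced with real values.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",            # placeholder machine image
    InstanceType="c5.xlarge",                    # compute-optimized instance type
    MinCount=4,                                  # request four nodes for a small cluster
    MaxCount=4,
    KeyName="my-hpc-key",                        # placeholder SSH key pair
    SecurityGroupIds=["sg-0123456789abcdef0"],   # placeholder security group
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "hpc-demo-node"}],
    }],
)

instance_ids = [i["InstanceId"] for i in response["Instances"]]

# Wait until the instances are running before configuring them as a cluster.
ec2.get_waiter("instance_running").wait(InstanceIds=instance_ids)
print("Running instances:", instance_ids)

# When the job is done, the resources can be released to stop billing.
ec2.terminate_instances(InstanceIds=instance_ids)

Releasing the instances as soon as the job finishes is what keeps the pay-as-you-go model cost-effective, which is precisely the kind of decision the later chapters of this book discuss in depth.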

1.1 Book Organization

This book is organized in four parts and two appendices. In Part I, Foundations, two chapters are presented: What is Cloud Computing? and What do HPC applications look like?. They introduce the fundamentals and key aspects of cloud computing and high-performance applications, respectively.

Part II, Running HPC Applications in Cloud, provides an overview of the key cloud infrastructure elements for HPC workloads and how they can be instantiated and managed (Deploying Infrastructure and Applications). It also presents strategies and techniques to smoothly execute HPC applications in the cloud, besides discussing how to adapt HPC workloads to incorporate and benefit from cloud advantages like elasticity (Executing Traditional HPC Application Code in the Cloud). In addition, this part also brings an overview of the cloud accelerators available on different cloud providers, showing how to program and instantiate them, and discussing a deployment workflow (Exploiting Hardware Accelerators in Clouds).

After presenting, in Parts I and II, the basic concepts and how to use clouds for executing HPC applications, Part III focuses on Cost and Performance Optimizations. The chapter Optimizing Infrastructure for MPI Applications discusses cloud infrastructure optimizations for the specific mode of operation represented by Message Passing Interface (MPI) applications, considering that the large number of compute nodes must be interconnected with fast networks and multi-level cloud schedulers must coordinate to assign and allocate those nodes to submitted MPI jobs. The chapter Harnessing Low-Cost Virtual Machines on the Spot provides an overview of how users might utilize and benefit from the variety of instances and different contract models, such as Spot and On-demand models, on offer from public cloud providers to reduce their financial outlays. Regarding the fault tolerance problem, crucial in large distributed systems such as cloud environments, the chapter Ensuring Application Continuity with Fault Tolerance Techniques presents an overview of the related literature about the fault tolerance techniques most used by clouds and the HPC applications that run on them, as well as fault detection approaches and existing reliable storage in clouds. That part of the book is concluded with the chapter Avoiding Resource Wastage, which introduces the resource wastage problem in the context of HPC cloud and also provides existing state-of-the-art solutions to tackle such situations.

Part IV describes Application Case Studies in three chapters. The first one, Biological Sequence Comparison on Cloud-based GPU Environment, explores the parallelism provided by cloud computing to execute a biological sequence comparison application in order to achieve high performance. It focuses on reducing both the monetary costs and the execution time of the application by taking advantage of cloud features such as Spot instances and parallel virtual clusters. The next chapter, Oil & Gas Reservoir Simulation in the Cloud, briefly introduces the reader to the aspects of the reservoir simulator and describes the experience of Petrobras, the Brazilian O&G company, in its use in an HPC cloud infrastructure. The chapter also details the concerns of big data movements and cloud-bursting issues, and raises some of the specific advantages for that industry, such as the cost system. The last chapter of this part, Cost Effective Deep Learning on the Cloud, addresses training deep learning models in the cloud, elucidating the key components and aspects necessary to train these models via different types of service offered by cloud providers. It also discusses important aspects the reader must observe when choosing services and instances to train their deep-learning models on the cloud.

Finally, the first appendix, Deploying an HPC cluster on AWS, provides detailed step-by-step guides to deploy an HPC cluster in the AWS cloud. The second one, Configuring a cloud-deployed HPC cluster, illustrates how to configure and use a cloud-deployed HPC cluster.

References

1. M Stordalen Flister and K Hopstaken. Running reservoir simulations in the public cloud; a case study of a cost-controlled method, running tNavigator and Eclipse in an Azure HPC environment. In EAGE/AAPG Digital Subsurface for Asia Pacific Conference, volume 2020, pages 1–4. European Association of Geoscientists & Engineers, 2020.
2. Qingye Jiang, Young Choon Lee, and Albert Y. Zomaya. The power of arm64 in public clouds. In 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), pages 459–468, 2020.


3. Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-datacenter performance analysis of a tensor processing unit. SIGARCH Comput. Archit. News, 45(2):1–12, June 2017. 4. Carsten Kutzner, Christian Kniep, Austin Cherian, Ludvig Nordstrom, Helmut Grubmüller, Bert L de Groot, and Vytautas Gapsys. GROMACS in the cloud: A global supercomputer to speed up alchemical drug design. Journal of Chemical Information and Modeling, 62(7):1691– 1711, 2022. 5. Marco A. S. Netto, Rodrigo N. Calheiros, Eduardo R. Rodrigues, Renato L. F. Cunha, and Rajkumar Buyya. HPC cloud for scientific and business applications: Taxonomy, vision, and research challenges. ACM Comput. Surv., 51(1), Jan 2018. 6. Masahito Ohue, Kento Aoyama, and Yutaka Akiyama. High-performance cloud computing for exhaustive protein–protein docking. In Advances in Parallel & Distributed Processing, and Applications, pages 737–746. Springer, 2021. 7. Daniel Reed, Dennis Gannon, and Jack Dongarra. Reinventing high performance computing: Challenges and opportunities, 2022. 8. Francesco Salvadore and Raffaele Ponzini. LincoSim: a web based HPC-cloud platform for automatic virtual towing tank analysis. Journal of Grid Computing, 17(4):771–795, 2019. 9. Philipp A Witte, Mathias Louboutin, Henryk Modzelewski, Charles Jones, James Selvage, and Felix J Herrmann. An event-driven approach to serverless seismic imaging in the cloud. IEEE Transactions on Parallel and Distributed Systems, 31(9):2032–2049, 2020. 10. Qie Zhang, George Iordanescu, Wee Hyong Tok, Sverre Brandsberg-Dahl, Hari Krishnan Srinivasan, Ranveer Chandra, Navjot Kukreja, and Gerard Gorman. Hyperwavve: A cloudnative solution for hyperscale seismic imaging on azure. In First International Meeting for Applied Geoscience & Energy, pages 782–786. Society of Exploration Geophysicists, 2021.

Part I

Foundations

Chapter 2

What Is Cloud Computing?
Maicon Melo Alves

2.1 First Look at the Cloud

We can undoubtedly state that Cloud Computing surrounds our entire society. After all, people use this technology daily when accessing an e-commerce site, sending emails, or watching a movie on their favorite streaming platform. The vast majority of these people do not realize that Cloud Computing is the technology behind those services and resources. Once just a promise or a vague idea, Cloud Computing is now a real and ongoing technology, present in a plethora of business segments and adopted by small to giant companies around the globe. This section gives a first look at the cloud (the terms “Cloud Computing” and “cloud” are used interchangeably from this point on) by describing its origin and presenting its most accepted definition.

2.1.1 Origin

In the 1990s, the research community turned its attention to the wastage of computing power available on workstations, desktops, and servers. They realized that many of these machines did not operate at their maximum processing capacity all the time, raising the opportunity to harness this idle computing time to execute

1 The terms “Cloud Computing” and “cloud” will be used interchangeably in this chapter from this point on.

M. M. Alves () Petróleo Brasileiro S.A., Petrobras, Rio de Janeiro, Brazil e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 E. Borin et al. (eds.), High Performance Computing in Clouds, https://doi.org/10.1007/978-3-031-29769-4_2


Fig. 2.1 Grid Computing and geographically distributed shared resources

batch tasks. Scientists materialized this idea in a new computing paradigm called Grid Computing [1]. Grid Computing allows users and organizations to share computing resources placed in the same organization or geographically spread across distinct regions, as shown in Fig. 2.1. The elementary goal of Grid Computing is to seek the optimal usage of resources, where partners collaborate with each other to consume idle computing cycles in a distributed and heterogeneous cluster of computers [3].

The possibility of improving the performance of distributed applications by executing them on the infrastructure provided by Grid Computing caught the attention of many researchers, technology companies, and enthusiasts in this area. Such colossal interest motivated the emergence of conferences, books, and scientific papers focused on describing and disseminating this computing model. By that time, the community had proposed many protocols, algorithms, tools, and software stacks to support the practical usage of Grid Computing.

The pioneer and one of the most notable projects related to Grid Computing was SETI@home, created at the University of California, Berkeley [2]. This project aimed to distribute the processing of radio signals to find signs of extraterrestrial intelligence. To contribute, anyone could download the program and start analyzing a piece of data on their computer. The SETI@home software was designed to run its analysis without competing for resources already allocated by the user. This project is meaningful for Grid Computing since it showed the viability of using this model in a real scenario.

Driven by the same need for more efficient resource usage, another proposal emerged at the beginning of the 2000s, sharing the spotlight with Grid Computing. Around 2006, companies like Amazon and Google started to offer on-demand and pay-as-you-go services in which people would be able to access


software, computing resources, and storage directly over the Internet [4]. Yes, we are now talking about Cloud Computing.

Companies’ motivation to design and offer those services relied on the opportunity to deliver IT infrastructure and software as a utility, a commodity [5]. In this way, organizations and end-users could have immediate and easy access to virtually infinite resources without worrying about energy and space constraints, cooling capacity, hardware obsolescence, and software licenses. As a computing utility, companies could stretch the resources to fit their demand in a given period, avoiding the problems of over- and underestimation of resources. Therefore, Cloud Computing, like Grid Computing, is a technology born from the necessity to optimize the usage of computing resources.

At first sight, we could say that Grid and Cloud Computing are similar technologies, especially considering some characteristics and aspects related to optimal usage of resources. However, these technologies have one substantial difference: the business model. Grid Computing was designed to offer distributed and shared computing services to a collaborative partner network. So, to take advantage of Grid, a company or institution had to be part of this network, offering its own resources to the pool controlled by the Grid in return. On the other hand, as shown in Fig. 2.2, Cloud Computing was created to be available to any user or company interested in paying to access online services available on the Internet.

Due to this open and commercial nature, Cloud Computing has attracted much more attention and interest than Grid. As years went by, the scientific community and technology companies left Grid behind, concentrating their energy and effort on improving Cloud Computing. Despite that, it is worth stating that many techniques and algorithms created to solve Grid problems have inspired solutions adopted by the Cloud so far.

Considering the revolution brought by Cloud Computing, we can say that it is not just a new technology; it is a novel computing paradigm. But what precisely is Cloud Computing? What are its most fundamental characteristics? These questions are answered in the next section, which introduces the most accepted definition of Cloud Computing.

2.1.2 Definition

When a new paradigm arises, it comes with many uncertainties related to scope, foundations, and goals. This lack of characterization usually motivates researchers to propose a precise definition of this new technology. In the case of Cloud Computing, it was no different. Many authors have tried to coin a formal definition of this model during its first years.

Among several definitions published in papers and books, the scientific community has unofficially elected the one offered by NIST, the U.S. National Institute of Standards and Technology, as the most complete and concise definition of Cloud Computing. Since then, the Cloud Computing description offered by NIST has


Fig. 2.2 Cloud Computing business model

turned into the de facto definition of the cloud, being extensively used and referenced in virtually any material on this topic. So, according to NIST:

Cloud Computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. [6]

This definition may sound a little wordy and confusing, especially for those reading it for the first time or who have recently started to learn this subject. Nevertheless, each word in this sentence matters and has a valuable meaning in Cloud Computing’s definition. To make things easier, we pick out some key concepts that better describe the dimension, scope, and purpose of this model.


• Model: the definition starts by saying that Cloud Computing is a model. Although this statement may sound irrelevant, it has great importance in the cloud’s definition. Recall that a model is an abstract representation of some system, aiming to describe its essential characteristics while putting aside unnecessary details. By affirming that Cloud Computing is a model, the definition explicitly states that Cloud Computing is not a tool, software, or technology but a computing paradigm with delimited and precise attributes, elements, and purpose.
• On-demand network access: users should be able to access cloud resources through the Internet from anywhere and whenever they want. This on-demand access through the Internet is one of the essential differences between Cloud and Grid and one of the main reasons for the success of the former.
• Shared pool of configurable computing resources: the cloud exists to provide computing resources to users; this is the primary purpose of this model. These computing resources, like virtual machines and storage capacity, are configurable, thus allowing customization according to users’ needs. Unlike on-premise infrastructure, computing resources in the cloud are shared among their users, which can raise performance and security issues.
• Rapid provisioning and releasing of resources: the cloud should allow rapid provisioning and release of resources. Otherwise, users would not be able to adjust their resources to fit the demand at a particular moment. After all, the idea behind the cloud relies on cost optimization and efficiency, which can only be achieved if users can avoid or minimize the over- and underestimation of resources.
• Minimal management effort or service provider interaction: the management of resources in the cloud must be fast and equally effortless. Users must be able to manage their resources by themselves with maximum autonomy, interacting with the service provider only in case of problems or unexpected behavior of computing resources.

The topics above overview the main aspects of Cloud Computing. However, in the real world, Cloud Computing providers, tools, and solutions incorporate other characteristics besides the ones presented in this chapter. So, it is important to state that the NIST definition aims to capture the essence of Cloud Computing and does not intend to cover all the practical facets of this computing model. The next section describes some of the most prominent advantages and noticeable drawbacks of Cloud Computing.

2.2 Benefits and Drawbacks

Any paper or book introducing Cloud Computing usually dedicates a couple of pages to convince the reader how advantageous it can be to move IT services from on-premise to the cloud. Indeed, we can easily list a dozen benefits related to cost, flexibility, collaboration, continuity, security, mobility, etc.


However, when looking closely, we realize that many of those benefits overlap in one aspect or another. Thus, to focus on what truly matters, we summarize the cloud benefits in the following two: cost savings and elasticity.

2.2.1 Cost Savings

Cloud Computing arose from the need to optimize the usage of resources, being motivated primarily by cost reduction. Thus, the first reason that comes to mind for adopting Cloud Computing is the possibility of reducing operational costs [8, 14]. What are those operational expenses? These costs are related to keeping an IT infrastructure up and running, such as energy consumption, property security, and maintenance services. Remark that, as depicted in Fig. 2.3, there is a set of charges not directly related to the IT environment but still essential to keep it operational.

Fig. 2.3 Traditional on-premise IT infrastructure


Besides that, the company needs to assume particular IT costs, like hiring specialized people to provision and support the IT infrastructure. To manage a usual midsize IT infrastructure, it is mandatory to have professionals with different skills, ranging from developers and administrators to network analysts and telecommunications engineers. This cost can increase significantly with the growing complexity of the IT environment [9].

In addition, the company has to buy and administer licenses for the software used in its environment. Despite the increased offering of open source and free software, companies still have to purchase software like operating systems, database managers, application servers, compilers, and other specific tools and applications. Besides the money needed to buy licenses, the company should also spend time and energy controlling this asset. Otherwise, it can suffer legal penalties for using unlicensed software.

Beyond costs, there are physical constraints that prevent a company from growing its IT infrastructure and services. Physical space and cooling capacity are examples of these restrictions. Even power capacity can be a problem. In many situations, companies do not have enough physical area to expand their data centers, being limited to the capacity already installed. In those cases, the enterprise cannot extend the IT infrastructure, regardless of its intention to invest money in this area.

Those costs, concerns, and restrictions vanish from the company’s mind when using Cloud Computing. Since the environment provided by the cloud is virtual and available on the Internet, companies do not have to worry about costs regarding energy, cooling, and facility fees, nor about physical restrictions and impediments, as pictorially described in Fig. 2.4. In the same way, the whole responsibility for the IT infrastructure moves to the cloud provider, releasing the company from the burden of contracting and managing IT people and software licenses, besides dealing with hardware and software obsolescence.

Fig. 2.4 IT services and infrastructure provided by the cloud

Therefore, cost savings is undoubtedly the main advantage of adopting the Cloud Computing model. However, the cloud is not a silver bullet. In several cases, as discussed throughout this book, on-premise infrastructure can still be the best solution for a given application or scenario.

2.2.2 Elasticity

As computing resources can be rapidly and easily managed, companies can scale their infrastructure up and down to satisfy the workload requirements at a given time. This elasticity brings a higher degree of flexibility to the company’s resource control [10, 14].

Such flexibility allows the company to deal with cyclic and seasonal demand. For example, on commemorative holidays like Christmas, an e-commerce website commonly experiences a higher number of accesses and transactions than usual because many more people tend to access the company’s website on those occasions. By using the cloud, this company can easily reorganize and adapt its infrastructure and services to satisfy the higher demand of those periods. Past that moment, the company can return to the previous amount of resources, which is enough to handle requests in an ordinary period.

In addition, cloud elasticity can be extremely interesting to companies that need to handle peak access to IT services. Peak access can be triggered for several reasons, like a sudden advertisement or an aggressive promotion of products and services. In those cases, the company’s IT infrastructure should be able to respond to a massive number of requests in a short slice of time. Due to the cloud’s ability to rapidly provision resources, the company can expand the infrastructure to bear the abrupt increase in demand.

More than handling cyclic and peak demands, elasticity can be employed to prevent, minimize, or avoid the problems of over- and underestimation of resources. Remark that an infrastructure, initially estimated to meet a particular demand, can have more or fewer resources than actually necessary. In the case of overestimation, idle servers, idle network equipment, and unused storage can lead to wasted energy, wasted physical space, and unnecessary costs related to IT professionals. On the other hand, underestimation is particularly harmful to the enterprise’s business because the company is unable to attend to clients’ requests with the limited resources available.

This effortless elasticity puts Cloud Computing miles ahead of on-premise infrastructure when the question is adjusting computing resources to match the demand required by a workload. In some cases, this advantage can be the primary reason to adopt the Cloud Computing model.


2.2.3 Drawbacks

As said at the beginning of this section, the cloud has many advantages. But, as with everything in life, nothing is perfect! The cloud also imposes risks and drawbacks, such as the following:

• Dependency on the Internet connection: without the Internet, the user does not have access to the cloud provider, being cut off from the IT infrastructure and services. On the other hand, the lack of an Internet connection does not impact internal services in an on-premise infrastructure. Curiously, one of the benefits of Cloud Computing can also be a notable disadvantage [16, 17].
• Downtime imposed by cloud providers: emergency and planned service interruptions to execute maintenance tasks and software updates can affect users against their will. Although cloud providers guarantee a high level of service availability, these interruptions can still occur, specifically in circumstances of unexpected problems [15, 16].
• Risk of data theft: in the case of a cyberattack directed at the cloud provider. Even adopting the most advanced security components and best practices, the cloud provider is not immune to attacks. No environment is. Consequently, there is a chance of data theft, loss, or corruption when moving data from on-premise infrastructure to the cloud [14].

People interested in using the cloud must carefully assess those disadvantages by considering the characteristics and requirements of their project or application. An aspect may not be worthy of concern in a given case, yet a challenging obstacle in another.

2.3 Service and Delivery Models

The previous section discussed benefits and disadvantages of using the cloud. This section moves forward and shows how cloud services are offered and delivered to their users and clients.

2.3.1 Service Models

Until now, we have been employing terms like infrastructure, software, applications, and storage to refer to resources provided by the cloud. But how exactly does the cloud provider offer services to users and clients? Cloud companies basically offer three types of services [7, 8], illustrated in Fig. 2.5 and described as follows:

Fig. 2.5 Cloud service models. (a) Software as a Service (SaaS). (b) Platform as a Service (PaaS). (c) Infrastructure as a Service (IaaS). (d) HPC as a Service (HPCaaS)

• Software as a Service (SaaS): in the SaaS service model, users can access applications deployed by cloud companies on the Internet. This service model is the simplest one because clients do not have to worry about anything else besides accessing the online application by using a browser, a particular desktop application, or a mobile app. Examples of SaaS comprise Gmail, Skype, OneDrive, Overleaf, and Office 365 [14].
• Platform as a Service (PaaS): whereas in SaaS cloud companies provide end-user applications, in PaaS cloud providers offer development platforms. So, enterprises can use cloud resources to create and build applications instead of provisioning a development environment in the on-premise infrastructure. With the easy and rapid scalability of PaaS, a company can size this environment to match its current needs. The Google App Engine and the Apprenda Cloud Platform are some examples of PaaS services [14].
• Infrastructure as a Service (IaaS): unlike SaaS and PaaS, in this service model users can deploy on their own a complete and functional IT infrastructure composed of servers, storage, and network elements, totally accessible from the Internet. This service model is ideal for companies that need to have more control over their IT infrastructure and services. In this case, consumers must create the infrastructure from scratch, deciding issues like the number and type of servers, operating systems, and network connectivity, among many other questions. Consequently, the company still has to manage things like the servers’ operating system and security updates. One example of IaaS is the Amazon Elastic Compute Cloud (Amazon EC2) [14].


By analyzing these services, we can clearly see that each model is suited to a distinct audience. IaaS is commonly targeted at IT professionals specialized in creating and maintaining an entire infrastructure. SaaS is aimed at end-users, while PaaS is used by IT professionals interested in developing applications and solutions.

Besides these classic service models, we can find new types created to satisfy specific demands. An example of a new service model is HPCaaS, which stands for HPC (High Performance Computing) as a Service. HPCaaS is a service model in which cloud providers offer services and special resources to execute HPC applications [11]. These special resources comprise accelerator devices, high-performance interconnects and servers, and tools to ease the deployment and execution of these applications. In Amazon Web Services (AWS), for example, customers can use AWS ParallelCluster to define and create a complete HPC infrastructure composed of computing machines, resource managers, and storage resources.

Another service model gaining attention is Function as a Service, or FaaS for short. Based on the concept of serverless computing, FaaS allows cloud customers to code functions without worrying about the server required to execute that operation. Thus, a company can focus exclusively on implementing the functionality it needs, leaving behind any concerns about server provisioning and configuration, software stack, and so forth [20, 21]. For instance, an e-commerce company can adopt FaaS to implement a function that pre-processes a product image before loading it to the website. The company can use this cloud capability to execute this operation instead of deploying an entire software stack and set of resources to support it. As examples of FaaS, we can cite AWS Lambda and Google Cloud Functions.

2.3.2 Delivery Models

Whereas service models specify how cloud computing services are offered to end-users and clients, delivery models determine where they are provided. Concerning delivery models, we have the following three main types [7, 8]:

• Public Cloud: this is the ordinary delivery model, where a cloud provider offers services upon payment of fees. The term public is employed here to indicate that services are available to anyone interested in using them. In other words, these services are not closed or private to a delimited group of people and organizations [17]. Some examples of public clouds comprise Amazon Web Services, Microsoft Azure, Google Cloud, IBM Cloud, and the lesser-known Cloudflare.
• Private Cloud: considering all benefits of the Cloud Computing model, why not bring its philosophy and main characteristics to the company’s on-premise infrastructure? This is the idea behind the private cloud. In this delivery model, companies use part of their on-premise infrastructure to implement a particular cloud environment and offer internal services along the lines of a public cloud provider. Remark that, even using the on-premise infrastructure, the private cloud still preserves many of the advantages of the Cloud Computing model, like rapid provisioning and releasing of resources [17]. HPE Helion Managed Private Cloud, VMware vRealize Suite Cloud Management Platform, Dell Enterprise Private Cloud Solution, and Cisco ONE Enterprise Cloud Suite are some examples of this delivery model.
• Hybrid Cloud: the best of both worlds. In a hybrid cloud, companies have a private cloud environment to satisfy internal demands while keeping the public cloud to deploy applications and IT services as usual [17]. We can cite IBM Bluemix and Verizon Enterprise as examples of hybrid clouds.

In general, a private cloud is attractive to mid- and large-sized companies because these enterprises usually have a lot of internal IT services that can benefit from the cloud’s philosophy. Even so, companies need to evaluate the trade-off of creating (and maintaining) their own cloud environment versus moving IT services to the public cloud or traditionally deploying them on the on-premise infrastructure.

2.4 Virtualization and Containers Technologies

Cloud Computing is based on the premise of resource optimization, which, in turn, is achieved through resource sharing. In the cloud, two technologies are used to enable resource sharing: virtualization and containers. Before describing those technologies, it is worth saying that we do not intend to cover this topic in detail. However, we need to explain this subject because it is essential to comprehend many of the topics covered throughout the book.

2.4.1 Virtualization

Virtualization is the engine of Cloud Computing. This technology allows cloud providers to host many virtual machines on a single physical server, enabling the resource sharing needed to make cloud computing possible. Although it started to be used more intensively in the last decades, virtualization was created many years before, by IBM in the 1960s. By then, IBM had invested many resources in developing mainframe and time-sharing machines, which are the basis of the virtualization we use nowadays.

For the sake of simplicity, we can say that virtualization is the technology that allows virtual machines (guests) to run on top of a physical server (host). As shown in Fig. 2.6, these virtual machines access shared computing resources available in the physical host, like CPU, memory, network devices, disks, and so forth. In spite of sharing resources at the hardware level, virtual machines are isolated from each other, each having a separate software stack comprising an operating system, libraries, and applications.


Fig. 2.6 Virtualization technology

The main component of the software virtualization stack is the hypervisor, which is responsible for managing the entire virtual environment. This component provides functionalities to execute administrative tasks like creating, deleting, monitoring, and restoring virtual machines. It also offers advanced features such as hardware emulation and binary translation of instructions. Linux KVM, Oracle VirtualBox, and Microsoft Hyper-V are examples of hypervisors.

Essentially, there are three types of virtualization techniques [12]:

• Full virtualization: the hypervisor intercepts and translates sensitive and privileged instructions, whereas end-user instructions are directly executed in the CPU. Because the hypervisor handles privileged instructions, the guest operating system does not need to be modified to run in the virtual environment. Consequently, the guest OS is utterly unaware that it is running upon a virtual machine. If, on the one hand, this high level of abstraction is exceptionally convenient and flexible, on the other hand, it leads to considerable performance overhead [18].
• Para-virtualization: this technique is similar to the full virtualization strategy in that end-user instructions are executed directly in the CPU. But, in the para-virtualization approach, the guest operating system uses hypercalls to ask the hypervisor to run privileged instructions in the CPU. The use of hypercalls relieves the hypervisor from the burden of trapping and translating privileged instructions before their execution, increasing overall system performance. As a counterpart, the guest OS needs to be modified to support this capability [18].
• Native virtualization: also called hardware-assisted virtualization, this approach relies on hardware instructions especially designed to support virtualization. This functionality allows the guest OS to execute both end-user and privileged instructions directly in the CPU. So, if a processor is endowed with this capability, the CPU is able to provide native support for virtualization. In the case of Intel and AMD, this instruction set is called VT-x and AMD-V, respectively [19].

Hardware vendors continuously invest in technical solutions to improve native virtualization. Besides the set of CPU instructions for virtualization, they also provide hardware capabilities to increase the performance of virtual machines when accessing I/O and memory subsystems [19]. Intel processors, for example, are endowed with VT-d, a technology that enables guest systems to directly access PCI devices like GPUs, network cards, RAID controllers, and so on. Hence, the goal is to get ever closer to bare-metal performance.
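As a quick practical check of the hardware support discussed above, the following minimal sketch (an illustration added here for Linux on x86, not part of the original text) inspects /proc/cpuinfo for the vmx (Intel VT-x) or svm (AMD-V) CPU flags:

/* Minimal sketch: report whether the host CPU advertises hardware-assisted
 * virtualization by looking for the "vmx" (Intel VT-x) or "svm" (AMD-V)
 * flags in /proc/cpuinfo (Linux/x86 only). */
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[4096];
    int found = 0;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "flags", 5) == 0 &&
            (strstr(line, " vmx") || strstr(line, " svm"))) {
            found = 1;
            break;
        }
    }
    fclose(f);

    printf("hardware-assisted virtualization: %s\n",
           found ? "supported (vmx/svm flag present)" : "not reported");
    return 0;
}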

2.4.2 Containers

Besides resource sharing, virtual machines are commonly used for software isolation. As a virtual machine is completely isolated from the others running on the same host, we can use it to confine applications so that modifications in the software stack of one virtual machine do not affect an application or service running in another. In this way, we can protect the application’s context against problems experienced by other applications running in the same physical system.

However, the isolation provided by virtual machines comes with a price. Even with the recent advances, virtual machines still face performance overhead. This overhead is derived from the entire operating system running in the guest, with plenty of services, tools, and management tasks. In addition to the performance problem, we can have unnecessary software duplication when virtual machines run the same software stack.

An alternative to obtain software isolation and resource sharing is container technology. Unlike virtual machines, containerization does not require a guest operating system running on each container instance. For this reason, containerization is sometimes called lightweight virtualization because it guarantees the main benefits of virtualization without the overhead imposed by a guest operating system. As pictorially described in Fig. 2.7, a physical machine can host many containers, where each container is able to execute a separate software stack composed of applications and libraries. The access to shared resources is controlled by the host operating system and not by an additional component, such as a hypervisor in the case of virtualization.

Fig. 2.7 Container technology

Containerization provides software isolation and enables resource sharing by means of two features, namely cgroups and namespaces [13], described as follows:

• Cgroups: allow the host operating system to control the access of running containers to the available set of shared resources. Thus, the host OS can either guarantee fair use of those resources or assign a distinct resource configuration to a particular group of containers.
• Namespaces: the host operating system uses this feature to maintain virtual scopes of the system’s objects and entities, like files, directories, users, permissions, and so forth. Due to this feature, the host operating system can separate the execution context of each running container in such a way that a container cannot access or even see objects of other containers running on the same physical host.

Cgroups and namespaces are provided by the host operating system. Even so, we still need tools to execute administrative tasks like managing the container lifecycle. In this sense, we have frameworks, tools, and solutions such as Docker, Singularity (recently renamed to Apptainer), Podman, and Enroot. Docker is currently the de facto standard for general-purpose containerization, while Singularity and Enroot emerged as specific solutions for HPC applications. It is worth stating that containerization should not be considered a better solution than virtualization for all scenarios. In many cases, virtualization continues to be the right choice to provide software isolation while enabling resource sharing.
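To make the namespace idea concrete, the following minimal sketch (an illustration added here, not taken from the chapter; it requires root privileges on Linux) moves the calling process into a new UTS namespace so that a hostname change becomes invisible to the rest of the system, which is essentially what container runtimes do on a much larger scale:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Give this process its own UTS (hostname) namespace, then change the
 * hostname. The change is visible only inside the new namespace; the host
 * keeps its original hostname. Requires CAP_SYS_ADMIN (run as root). */
int main(void) {
    if (unshare(CLONE_NEWUTS) != 0) {
        perror("unshare");
        return 1;
    }
    const char *name = "container-like";
    if (sethostname(name, strlen(name)) != 0) {
        perror("sethostname");
        return 1;
    }
    char buf[64];
    gethostname(buf, sizeof buf);
    printf("hostname inside the new UTS namespace: %s\n", buf);
    return 0;
}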

2.5 Final Remarks

This chapter briefly introduces fundamental concepts of Cloud Computing, like service and delivery models, besides presenting a short description of surrounding technologies like Virtualization and Containerization. These topics are essential to


effectively understanding the subjects covered in this book. It is worth stating that this chapter intends to give an overall view of these topics; a detailed description of these themes is out of the scope of this book.

References

1. Leymann, F., Fritsch, D.: Cloud computing: The next revolution in IT. Proceedings of the 52th Photogrammetric Week, 169, 3–12 (2009)
2. Stanoevska, K., Wozniak, T., Ristol, S.: Grid and cloud computing: a business perspective on technology and applications. Springer Science & Business Media (2009)
3. Jacob, B., Brown, M., Fukui, K., Trivedi, N.: Introduction to grid computing. IBM Redbooks, 3–6 (2005)
4. Bairagi, S. I., Bang, A. O.: Cloud computing: History, architecture, security issues. In: National Conference “CONVERGENCE”, p. 28 (2005)
5. Haris, M., Khan, R. Z.: A systematic review on cloud computing. International Journal of Computer Sciences and Engineering, 6(11), 632–639 (2018)
6. Mell, P., Grance, T.: The NIST definition of cloud computing. Technical report, Computer Security Division, Information Technology Laboratory, National Institute of Standards and Technology, Gaithersburg (2011)
7. Bokhari, M. U., Makki, Q., Tamandani, Y. K.: A survey on cloud computing. In: Big Data Analytics (pp. 149–164). Springer, Singapore (2018)
8. Rashid, A., Chaturvedi, A.: Cloud computing characteristics and services: a brief review. International Journal of Computer Sciences and Engineering, 7(2), 421–426 (2019)
9. Rafique, K., Tareen, A. W., Saeed, M., Wu, J., Qureshi, S. S.: Cloud computing economics opportunities and challenges. In: 4th IEEE International Conference on Broadband Network and Multimedia Technology (pp. 401–406). IEEE (2011)
10. Al-Dhuraibi, Y., Paraiso, F., Djarallah, N., Merle, P.: Elasticity in cloud computing: state of the art and research challenges. IEEE Transactions on Services Computing, 11(2), 430–447 (2017)
11. Deniziak, S., Bąk, S.: Scheduling of Distributed Applications in HPCaaS Clouds for Internet of Things. In: 23rd International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS) (pp. 1–4). IEEE (2020)
12. Lee, H.: Virtualization basics: Understanding techniques and fundamentals. School of Informatics and Computing, Indiana University, 815 (2014)
13. Jain, S. M.: Linux Containers and Virtualization: A Kernel Perspective. Apress (2020)
14. Xue, C. T. S., Xin, F. T. W.: Benefits and challenges of the adoption of cloud computing in business. International Journal on Cloud Computing: Services and Architecture, 6(6), 01–15 (2016)
15. Nwogbaga, N. E., Ogbaga, I. N.: Overview of Cloud Computing, Benefits and Drawbacks. EPRA International Journal of Multidisciplinary Research, 2 (2016)
16. Abdalla, P. A., Varol, A.: Advantages to disadvantages of cloud computing for small-sized business. In: 7th International Symposium on Digital Forensics and Security (ISDFS) (pp. 1–6). IEEE (2019)
17. Beri, R., Behal, V.: Cloud computing: a survey on cloud computing. International Journal of Computer Applications, 111(16) (2015)
18. Fayyad-Kazan, H., Perneel, L., Timmerman, M.: Full and para-virtualization with Xen: a performance comparison. Journal of Emerging Trends in Computing and Information Sciences, 4(9), 719–727 (2013)


19. Ganesan, R., Murarka, Y., Sarkar, S., Frey, K.: Empirical study of performance benefits of hardware assisted virtualization. In: Proceedings of the 6th ACM India Computing Convention (pp. 1–8) (2013)
20. Schleier-Smith, J., Sreekanti, V., Khandelwal, A., Carreira, J., Yadwadkar, N. J., Popa, R. A., Gonzalez, J. E., Stoica, I., Patterson, D. A.: What serverless computing is and should become: the next phase of cloud computing. Communications of the ACM, 64(5), 76–84 (2021)
21. Malla, S., Christensen, K.: HPC in the cloud: Performance comparison of function as a service (FaaS) vs infrastructure as a service (IaaS). Internet Technology Letters, 3(1), e137 (2020)

Chapter 3

What Do HPC Applications Look Like?
Claude Tadonki

3.1 About High-Performance Computing and Its Way So Far

3.1.1 Concept and Motivations

When writing a program, the first focus is correctness. Then follows the need for speed, and the target becomes performance. This is already the case at the fundamental level: we first design an algorithm, make sure it is correct, and then analyse its complexity. While the complexity of an algorithm is somehow absolute, the performance of a program is relative, as it depends on the considered target machine. The main question from the end-user is “How long will the program take to execute?”; this is the time-to-completion. For many reasons, the user might want this time to be as short as desired, or the shortest possible. Hence the need for a very fast implementation, which is the main purpose of High-Performance Computing (HPC). From the computer standpoint, the standard configuration is a parallel machine, which is made up of several individual processors interconnected in such a way that they can cooperate to perform a given computation; this is parallel computing. The idea is therefore to use as many processors as necessary to reach the expected level of performance. The need for speed is genuine at various levels, from the standpoint of an ordinary user expecting a prompt interaction with a given application to the processing of large-scale scenarios in cutting-edge industrial or research activities.

The reference metric for an HPC system is the number of (reference) instructions per unit of time, expressed in Millions of Instructions per Second (MIPS) or

C. Tadonki () Mines ParisTech - PSL University, Centre de Recherche en Informatique (CRI), Fontainebleau Cedex, France e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 E. Borin et al. (eds.), High Performance Computing in Clouds, https://doi.org/10.1007/978-3-031-29769-4_3


Floating-Point Operations per Second (FLOPS). This metric applies either to a machine alone or to a machine running a given program. In the first case, it indicates the potential speed of the machine and is called the peak performance, typically expressed in XFLOPS, where X is M = Mega, G = Giga, T = Tera, P = Peta, or E = Exa. In the second case, the so-called sustained performance, the measurement considers a given program executed on that machine. Figure 3.1 presents the first exascale supercomputer (top500 ranking of June 2022 [32]), hosted at the Oak Ridge Leadership Computing Facility (OLCF) in Tennessee, USA. On the main High-Performance Linpack (HPL) benchmark used by the Top500, Frontier reached 1.102 exaflops of sustained performance. It has a theoretical peak performance of 1.686 exaflops, although Oak Ridge believes it can be boosted to 2 exaflops.

It is important to understand that the calculation of the peak performance assumes that each clock cycle corresponds to one instruction/operation. The reality at runtime includes an overhead related to data activities like memory accesses and inter-processor communication, synchronization, initialization, and other non-numerical instructions. Thus, there is a (natural) gap between the theoretical peak and the sustained performance. Making this gap as small as possible is the main purpose of code optimization in HPC, the limit being algorithm and machine dependent [30]. The so-called absolute efficiency is the percentage of the peak obtained with a given implementation, which is usually considered to

Fig. 3.1 Specifications of the current fastest supercomputer


evaluate the effort of the programmer. Users expect supercomputers to be powerful enough for their applications, not in absolute terms. Thus, getting close to the maximum sustained performance is a crucial request. The speedup, which measures how a given implementation scales with the number of computing units, is more about the quality of the parallelization. The task of an HPC specialist is to strive for the best of both metrics (a small worked example of these metrics is given at the end of this subsection).

HPC is genuinely needed in various application domains. Let us consider for instance the case of Astronomy. It can take millions of years for a specific event to occur, like stars colliding or galaxies merging, so astrophysicists must turn to computer simulations to investigate. The models are complex and the effective time frame is tremendously long, thus the need for massive compute power to figure out what could happen through short-time simulations. Other nice examples can be found in oceanic investigations to understand specific phenomena; atmospheric activities for weather forecasting and ecosystem predictions; cutting-edge operational research and its applications like, for example, the airline crew pairing problem, which is to find a minimum-cost assignment of flight crews to a given flight schedule, see Fig. 3.2; large-scale genomics; and high-precision numerical simulations; to name a few.

HPC infrastructures are costly to manage and maintain, in addition to the cost of the machines. The so-called mean time between failures (MTBF) gives an indication of the probability of having a failure on a node of the supercomputer. Since a supercomputer has many compute nodes (tens of thousands, summing up to millions of cores), this eventuality is a genuine concern. In addition, the overall cost of energy (electricity and cooling system) is considerable. For all these reasons, supercomputers are typically located in supercomputing centers and made available to the users through remote access. Another means to provide HPC resources to end-users is the so-called “Cloud”. Cloud computing offers a great alternative

Fig. 3.2 Airline crew pairing


for mass storage, software, and computing devices. Federating available computing resources, assuming a sufficiently fast network, is certainly an efficient way to offer more powerful computing systems to the community. The main advantage is that users pay only for what they have actually consumed and are not concerned with any kind of maintenance effort.
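As a small worked example of the metrics introduced above (a hedged sketch with illustrative numbers, not figures for any specific machine), the theoretical peak of a node is simply cores × clock frequency × floating-point operations per cycle, and the absolute efficiency is the ratio of the sustained performance to that peak:

#include <stdio.h>

/* Back-of-the-envelope peak vs. sustained performance for a hypothetical
 * 64-core node (all numbers are assumed for illustration only). */
int main(void) {
    double cores           = 64;
    double clock_ghz       = 2.0;   /* GHz */
    double flops_per_cycle = 16;    /* e.g. two 256-bit FMA units in double precision */

    double peak_gflops      = cores * clock_ghz * flops_per_cycle; /* 2048 GFLOPS */
    double sustained_gflops = 1500.0;  /* hypothetically measured, e.g. with HPL */

    printf("peak                = %.0f GFLOPS\n", peak_gflops);
    printf("sustained           = %.0f GFLOPS\n", sustained_gflops);
    printf("absolute efficiency = %.0f%%\n",
           100.0 * sustained_gflops / peak_gflops);
    return 0;
}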

3.1.2 Evolution of HPC Systems

The first commercially available supercomputer was the CRAY-1. Introduced in 1975 and installed at Lawrence Livermore National Laboratory (USA), the whole machine weighed 5.5 tons and had a theoretical peak of 160 MFLOPS. If we dare to compare with FRONTIER, the current fastest supercomputer, which sustains about 1.1 EFLOPS, we get a factor of about 6.8 × 10⁹. Figure 3.3 displays the evolution of the computing power since 1990 (the beginning of the top500 ranking).

Let us come back to the current fastest supercomputer. Figure 3.4 provides its main technical specifications. We see that the machine is hybrid (CPU/GPU) with several compute

Fig. 3.3 Performance evolution of the major supercomputers


Fig. 3.4 Hardware specifications of FRONTIER supercomputer

nodes, each of them being a 64-core processor. This configuration gives an overview of the evolution of HPC systems from the hardware standpoint. Let us briefly describe the evolution of each of the main aforementioned aspects.

• Distributed memory configuration. The model here is to have several independent processors interconnected by a physical network. The main evolution in this configuration is in the nature of the so-called compute node, which has moved from a standard processor to a multi-core processor possibly equipped with a GPU. The main challenge with a large number of nodes is the efficiency of the interconnect, which includes the topology, the network speed, and the routing mechanism. Before the advent of the multi-core, increasing the number of processors was the main way to build larger supercomputers. Afterwards, the common metric became the number of cores. Indeed, the size of a supercomputer is currently given as its total number of cores.
• Shared memory configuration. From the beginning, the design of processors followed Moore’s Law, which roughly prescribes that processor transistor count doubles every 2 years. This was still possible by adding transistors and logic to the standard CPU and increasing the clock frequency, until it became impractical because of the associated power. Therefore, the multi-core processor strategy appeared (around 2003) and became the new standard. The trend now is to increase the number of cores in a single die or package. In any multi-core configuration, the main memory (whatever its organization) is really shared by all the cores, thus there is no need for explicit data communication. Some examples are the AMD Ryzen Threadripper PRO 5995WX with 64 cores and the Intel Core i9-12900K with 16 cores.
• Vector computing units. This refers to so-called vector computing, also known as Single Instruction Multiple Data (SIMD). It is the innermost level of parallelism, which is performed with specific units and associated registers. This was one of the first ways to implement parallelism, as with the CRAY-1, followed by other examples. Then, this approach re-emerged in 1997 with the MMX instructions, followed by SSE, AVX, and AVX-512. The evolution in this direction mainly addresses the length of vector registers and the set of instructions. Initially devoted to the processing of images, it has been extended to general-purpose computation.


• Hybrid configuration. A typical hybrid supercomputer is made up of several multi-core nodes, each of them possibly coupled with one or more GPUs. This combination results in cutting-edge supercomputers. The FRONTIER supercomputer is exactly made that way. Cloud computing is also a common place to find hybrid computing resources. Hybrid machines require an important programming effort to get the maximum benefit because all units, as well as all levels of parallelism, have to be exploited at their best.

As we indicated for hybrid configurations, GPUs are the main devices for so-called accelerated computing. We now provide an extended description of this special unit.

3.1.3 Graphical Programming Unit as the Main HPC Accelerator

The graphics processing unit (GPU) is a specialized microprocessor that is used to offload and accelerate graphics rendering from the central processor [18]. It was primarily a graphics chip, acting as a graphics co-processor. Gradually, the chip became increasingly programmable and computationally powerful, thereby leading to a general-purpose unit. Figure 3.5 shows this impressive evolution of GPU performance. Nowadays, GPUs are commonly used for scientific and engineering applications. The highly parallel structure of modern GPUs makes them much more efficient than traditional CPUs for stream processing applications

Fig. 3.5 Performance evolution of the GPU [18]


Fig. 3.6 SP/DP performance scaling of the GPU [18]

or, more generally, for applications with high data parallelism. This has pushed computer scientists to start thinking about an effective use of the GPU to efficiently accelerate a wider range of applications, thus leading to the so-called GPGPU (General-Purpose Graphics Processing Unit) approach. In GPGPU, a GPU is viewed as a high-performance many-core processor that can be used (together with a standard CPU acting as a master) to perform a wide range of computing tasks at high speed.

At the early stages of GPGPU (and this is still an important aspect), the main concern was how to efficiently exchange data between the main memory of the hosting CPU and the GPU. This CPU-to-GPU bottleneck [19], often glossed over in some enthusiastic reports, has been one of the main hurdles in the ascent of GPGPU as a genuine HPC unit. Another critical point is the severe slowdown when using double-precision data rather than single precision, which is essential in many numerical applications. These two issues have been seriously and efficiently addressed in current-generation GPUs. Figure 3.6 shows the performance scaling of the GPU over time together with its evolution, considering single precision (SP) and double precision (DP) cases. We can note the SP/DP convergence with the Ampere A100 GPU.

With some applications requiring massive vector operations, using a GPU can yield several orders of magnitude higher performance than with a standard CPU. Figure 3.7 illustrates the speedup of new-generation GPUs compared to conventional CPUs. Indeed, considering artificial intelligence kernels, there is an acceleration of 8 (resp. 237) between an NVIDIA A100 and an NVIDIA T4 (resp. between an NVIDIA A100 and an Intel Cooper Lake). The use of GPUs as an


Fig. 3.7 Speedup of new generation GPUs (©NVIDIA)

HPC accelerator has become a common choice in a number of scientific areas (with noticeable success in artificial intelligence), with the additional objective of saving energy through shorter execution times. This last aspect has motivated the consideration of hybrid CPU/GPU supercomputers and the use of the GPU as a key device in Cloud computing [20].

3.1.4 Overview of Current HPC Systems and Associated Concerns

A simple way to get a picture of the major HPC infrastructures is to take a look at the semi-annual top500 [31], which has triggered an exciting competition among manufacturers and countries for the world’s fastest supercomputers. Alongside the ranking itself, the top500 provides valuable statistics on existing supercomputers that can be used for factual and prospective analyses. Figure 3.8 displays the top five machines of the list of June 2022. We see that the fastest machine is an exascale system; it is the first to have broken the exascale barrier. In addition, its theoretical peak (Rpeak) alone exceeds the combined peak of the other four systems in the group. We also see that three of them have GPUs in their configuration, which shows that the heterogeneous profile is spreading seriously. Another important aspect is the sustained performance (Rmax), which corresponds to 65%, 82%, 70%, 74%, and 75% of the peak, respectively, in our selection. Note that these performances are measured using linear algebra subroutines, which are known to have highly regular memory accesses; with real-life applications, the overhead of memory accesses, which is not taken into account in the peak performance, is usually dominant. When speaking about “memory accesses”, it is basically


Fig. 3.8 Top five supercomputers from the June 2022 list

“data accesses”, which also include explicit data transfers between the individual compute nodes. The trend with current and future generation HPC processors is an increasing number of cores and an efficient but more complex memory system. Indeed, besides the classical memory hierarchy, there are non-uniform memory access (NUMA) architectures for which the overhead of the memory transactions is likely to be more severe, and this penalty is exacerbated with multi-socket packaging. Outside a compute node of a supercomputer, we have data exchanges, which are explicitly performed through the physical network following specific routing protocols. Depending on the protocol and on the underlying interconnection topology, inter-processor communication is what counts the most in the global overhead with distributed memory parallel machines. All these mechanisms that act on the path between the data and the computation have to be managed skillfully in order to minimize their impact on the overall execution time. At the level of the internal memory, having good cache performance is known to be the key. From the hardware standpoint, the cache protocol (including coherency) is the main point, while from the programming side the focus is essentially on data locality. Regarding inter-processor communication, there are two aspects to care about: the communication graph and the possibility of overlapping data exchanges with


computations. What we should understand here is the performance limitation that comes from accessing the data internally (“memory wall”) and externally (“network wall”). Besides this aspect, there are two other sources of performance concern: energy, because of its technical consequences and associated cost, and hardware failure, which can also be seen as a limitation.

Energy is a very important measure when it comes to HPC devices. Indeed, processing with a high-speed macro-system like a supercomputer corresponds to a huge number of individual CPUs working concomitantly, thus a serious risk of overheating. From the dynamic power model of CMOS circuits, the dissipated power is approximately proportional to the capacitance, the square of the CPU voltage, and the CPU frequency, which gives P = CV²f, where C is the capacitance, V the voltage, and f the frequency (a small numerical sketch of this relation is given at the end of this subsection). It is important to note that these parameters can be tuned dynamically at runtime, which offers an opportunity for energy-aware monitoring. Note that the strong correlation between energy and CPU frequency is one of the main reasons for considering less powerful processors for energy-aware HPC systems. Network and memory activities also count, but the most important focus is on pure CPU activities. This concern about energy is related to the associated cost (electricity and cooling) and also to its potential impact on hardware failure.

Failure (hardware and software) is an important concern with computations that take a long execution time. Failure can be due to a hardware issue or an unexpected software behaviour. In both cases, the running workflow is affected and might be resumed from a previous state or restarted from scratch. This scenario is a classical concern when running heavy simulation codes on a supercomputer. For a given company, maintenance activities (emergency and preventive ones) are costly, and this is a serious criterion when making investment plans about HPC resources.

In summary, processor manufacturers are constantly improving their products by tweaking CPU components and implementing new hardware concepts. The aim is to keep providing increasingly powerful computers for common applications and large-scale supercomputers for cutting-edge research and engineering activities. There is a kind of game between progress and needs, where the respective limits are alternately and iteratively pushed forward. Harvesting computing cycles for science has a clear impact on the landscape of experimental research and has shortened the path to scientific discovery and technical insights. We now describe how HPC applications can be designed and implemented, followed by an overview of the main performance issues.
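A minimal numerical sketch of the dynamic power relation P = CV²f discussed above (all constants are assumed for illustration; real DVFS behaviour is more involved):

#include <stdio.h>

/* Illustrative only: scale frequency down by 20% and voltage down by 10%,
 * as a DVFS governor might, and compare the resulting dynamic power. */
int main(void) {
    double C = 2.0e-8;   /* effective switched capacitance in farads (assumed) */
    double V = 1.2;      /* volts */
    double f = 3.0e9;    /* hertz */

    double p_nominal = C * V * V * f;
    double p_scaled  = C * (0.9 * V) * (0.9 * V) * (0.8 * f);

    printf("nominal power: %.1f W\n", p_nominal);
    printf("scaled power : %.1f W (about %.0f%% of nominal)\n",
           p_scaled, 100.0 * p_scaled / p_nominal);
    return 0;
}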

3.2 Design and Performance

3.2.1 Methodology for the Design of HPC Applications

Designing HPC applications, especially for solving large-scale problems, is a complex task from fundamental aspects (modelling, methods, design, analysis) to more technical aspects (programming, deployment, runtime management). From


Fig. 3.9 Overview of a hybrid programming chain

the programming point of view, there are a number of serious challenges that need to be addressed or that remain under deep investigation. The heterogeneity of current and upcoming supercomputers requires considering hybrid implementations, which correspond to a more complex programming task. Another programming approach is to use (semi-)automatic code generators, which allow the programmer to concentrate on a higher level of abstraction. This approach implies relying on the output of the code generation frameworks and considering practical issues related to debugging, maintenance, adaptability, tuning, and refactoring. Figure 3.9 displays an example of a complex code design framework. Formalism and Algorithms refer to the modelling of the problem followed by an explicit algorithm expressed accordingly. Intermediate Representation is an abstract specification of the program so as to allow proofs/analyses/transformations. Ad-Hoc Methods stand for specific approaches, for example those that come from the intuition and personal skills of the programmer.

Regarding fundamental aspects, we need to deal with powerful methods to seek efficient algorithms before moving to programming. Indeed, the noteworthy increase in supercomputer capability has boosted the enthusiasm for solving large-scale combinatorial problems, as they are among the major clients for HPC. However, we still need powerful methods to tackle those problems, and afterwards provide efficient implementations on modern computing systems. We need to move far beyond brute-force or ad-hoc (unless genius) approaches, as increasingly bigger instances are under genuine consideration. Figure 3.10 displays an overview of a typical workflow when it comes to solving optimization problems. The feasibility question concerns the existence of one solution (i.e. an element that satisfies all the constraints), while the optimization question concerns the best points (i.e. those that yield the optimal value of the objective function). Once a method is chosen, a corresponding implementation is considered to compute a valid solution that will be used for the decision.


Fig. 3.10 Typical operations research workflow

Besides combinatorial problems, we have data-intensive applications like those related to genomics, finance, statistics, and streaming analytics, to name a few. In addition to speeding up the computing process, we need a space-time-efficient management of the data. One of the challenges here is how to process/query large volumes of data and extract useful knowledge in a timely manner. Besides data placement techniques for managing big-data applications, there is also the data replication technique, which consists in creating multiple copies of the data sets so as to have them available at different locations. The main goal of replicating data is to improve the locality of the data sets and thus reduce their transfer costs when running on a federated cloud system [17]. Another interesting aspect is how the algorithm we choose to solve a problem turns into an efficient solution when implemented in the HPC universe. An illustrative example can be found in linear programming, where we have the simplex algorithm [21] and the ellipsoid method [22]. From the theoretical complexity viewpoint, the simplex is exponential while the ellipsoid is polynomial. However, when implemented, the simplex is much more efficient. We have a similar situation with sorting. One of the reasons for this is the memory access pattern, which is not necessarily cache-friendly with sophisticated algorithms (see the sketch below). In addition, when moving to a parallel implementation, as we will explain later, there might be several sources of weak scalability. A good HPC application should be able to benefit from a larger amount of computing resources for a faster execution.
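To make the memory access pattern argument concrete, the following minimal sketch (illustrative only, not taken from any of the applications discussed in this chapter, with an arbitrarily chosen matrix size) sums the elements of a matrix twice: once row by row, which matches the row-major layout used by C and is cache-friendly, and once column by column, which strides through memory and typically causes many more cache misses.

#include <stdio.h>
#include <stdlib.h>

#define N 4096

int main(void) {
    /* Row-major N x N matrix allocated as one contiguous block. */
    double *a = malloc((size_t)N * N * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < (size_t)N * N; i++) a[i] = 1.0;

    double s1 = 0.0, s2 = 0.0;

    /* Cache-friendly: consecutive iterations touch consecutive addresses. */
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s1 += a[i * N + j];

    /* Cache-unfriendly: each access jumps N*sizeof(double) bytes ahead. */
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s2 += a[i * N + j];

    printf("row-wise sum = %f, column-wise sum = %f\n", s1, s2);
    free(a);
    return 0;
}

Both loops perform exactly the same arithmetic, yet the second one is usually several times slower on common hardware, which is the kind of gap between theoretical and practical efficiency discussed above.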


Fig. 3.11 Overview of HPC programming

3.2.2 Synopsis of HPC Programming

As previously said, HPC clusters have three levels of parallelism and are enhanced in some configurations by accelerators like GPUs. A cutting-edge HPC application is expected to be designed so as to harvest all these possibilities for faster processing. Figure 3.11 provides an overview of the aforementioned levels of parallelism along with the corresponding programming paradigms. Let us say a few words about each of the main levels in the design of an HPC application.
• Distributed memory parallelism. This is the most important paradigm when targeting a supercomputer, as the compute nodes are typically independent CPUs linked together by a physical interconnect system. Since a parallel program written in this way also works on a shared memory machine, the distributed memory model is very common in the universe of scientific and technical applications. The main way to implement this form of parallelism is through message passing libraries, MPI [23] being the most popular one. The main program is considered by the operating system as the generic program associated with the processes to be created and scheduled on the available compute nodes, hence the term SPMD for single program multiple data. This generic program can itself be implemented following the other kinds of parallelism, which corresponds to the so-called hybrid implementation.
• Shared memory parallelism. The shared memory model was already conceptualised through the PRAM model at the dawn of parallel computing. With the advent and pervasiveness of multicore processors (the main memory is fully shared by all the cores), this model of parallelism has become more common


in the HPC universe. The main argument here is that we get rid of explicit data exchanges and the associated time/space overhead. Here the main program is a single process that spawns so-called threads (lightweight processes) to be scheduled by the operating system on the available cores. The threads can be explicitly created by the programmer using specific libraries like Pthreads [24] or through programming directives using a source-to-source compiler like OpenMP [25]. For NUMA-aware implementations (very important for scalability on unconventional multicore processors, as previously explained), there are libraries like libnuma [27] or hwloc [28] that can be used explicitly within the user program to control memory allocation and thread placement.
• Vector computing. Also referred to as single-instruction-multiple-data (SIMD), this level of parallelism can be seen as a fine-grained approach at the level of the instructions [26]. Intensively used in image processing, it has been extended (like the GPU) to common computations as long as data are aligned and floating-point operations are dominant. The evolution of vector computing is a clear fact, even if (maybe for programming reasons) it does not seem to be exploited as much as it could be. This evolution includes wider and modulable vector registers, more instructions, and powerful intrinsics. Compilers can try it automatically whenever it is possible and potentially efficient (i.e. auto-vectorisation). A typical vectorisation approach is to consider a SIMD implementation in specific sections of the program where there is a potential benefit. This can be done using a dedicated API like SSE or AVX and the relevant intrinsics.
• Accelerated computing. This generic term designates the use of a special computing device to execute the parts of the computation that have appropriate characteristics. The most common device considered for this purpose in HPC is the GPU. The typical organisation is to have the whole computation and data hosted on a standard CPU, with selected routines offloaded to the GPU together with the associated data. Thus, there is a need to implement the code to be executed on the GPU. The most common frameworks for this are OpenCL and CUDA. However, source-to-source compilers like OpenMP [25] and OpenACC [29] can generate CPU/GPU code automatically from ordinary code where appropriate directives are used to indicate the parts to be offloaded to the GPU and to specify particular runtime characteristics.
As we can see, implementing an HPC application is a complex programming task; a minimal hybrid sketch combining the first two levels is shown below. Targeting performance requires serious efforts, as there are a number of related issues that need to be handled skillfully. We describe the main ones in Sect. 3.2.3 for the issues related to absolute efficiency, and in Sect. 3.2.4 for those related to relative efficiency (i.e. how good the parallelism is).
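As an illustration of the hybrid model sketched above, the following minimal program (a didactic sketch only, with an artificial workload and a problem size assumed divisible by the process count) combines MPI for the distributed memory level with OpenMP for the shared memory level: each MPI process computes a partial sum of its slice with several threads, and the partial sums are then combined with a reduction across processes.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000L  /* total number of terms, assumed divisible by the process count */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long chunk = N / size;            /* slice handled by this process */
    double local = 0.0, global = 0.0;

    /* Shared memory level: threads cooperate on the local slice. */
    #pragma omp parallel for reduction(+:local)
    for (long i = rank * chunk; i < (rank + 1) * chunk; i++)
        local += 1.0 / (double)(i + 1);

    /* Distributed memory level: combine the per-process partial sums. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("partial-harmonic sum over %ld terms = %f\n", N, global);

    MPI_Finalize();
    return 0;
}

Built with an MPI compiler wrapper and OpenMP support (e.g. mpicc -fopenmp) and launched with one process per node and several threads per process, this structure mirrors the SPMD/hybrid organisation discussed above.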

3.2.3 Critical Numerical and Performance Challenges

1. Computing unit: The basic compute node of current and future supercomputers is (in most cases) a many-core processor. Seeking efficiency and scalability with a


many-core processor is a hard task [11]. As with any shared-memory system, the way to go is through the shared-memory paradigm. This approach avoids explicit data exchanges and thus gets rid of the associated time cost. However, concurrent accesses to the main memory might put noticeable pressure on the whole memory system (bus, controller, ...) and thus turn into a source of weak scalability. The availability of vector computing units is also common, thus allowing a vector or single-instruction-multiple-data (SIMD) implementation, which requires an appropriate organization of the data in memory.
2. Memory system: One critical point in the context of shared memory parallelism with multi-core processors is the management of shared global variables. Many computation schemes, like those used in matrix computation, are based on an iterative scheme, thus the accesses to common variables are repeated accordingly. For read-only accesses, performance will depend on the cache hit rate. For write accesses, the main issue is concurrency, with a special focus on iterative (in-place) updates. The case of non-uniform memory access (NUMA) architectures needs special attention, as most many-core processors follow this memory organization. A NUMA-unaware implementation will certainly suffer from severe inefficiency, with very poor scalability as the number of cores increases. Figure 3.12 [33] presents the NUMA model with two nodes and Fig. 3.13 presents an example of a NUMA configuration with 4 nodes, both figures illustrating the non-conventional sharing of the overall main memory. The main issues with the NUMA structure are the remote accesses (more costly than the local ones) and the contention on the memory controllers and memory buses.
3. Numerical sensitivity: Despite numerical accuracy concerns, it is common to consider a lower precision data type in order to get higher FLOPS through wider SIMD and better data locality. The main drawbacks of lower precision come from the potential loss of accuracy, which might lead to wrong numerical results or longer execution times (with an iterative numerical process, for instance, many more iterations may be necessary to converge).

Fig. 3.12 The NUMA model


Fig. 3.13 Examples of NUMA configuration with 4 nodes

4. Heterogeneity: The tendency with top-class supercomputers is heterogeneity, with the classical CPU-GPU conjunction being the most common configuration. GPUs (see Sect. 3.1.3 for details) have reached enough maturity that they are now considered for common computing tasks, including those that are not as highly regular as was once required. However, the well-known problem of CPU-GPU data transfers still needs serious consideration, even if the corresponding time overhead has been significantly reduced. In the case of high-precision computation, using a GPU might raise some concerns about accuracy. Indeed, it is common to consider lower precision data (i.e. single precision instead of double precision) in order to get the maximum FLOPS with GPUs, which has a direct impact on the numerical accuracy of the outputs.
5. Synchronization: Computing in parallel commonly requires synchronizing at some points for various reasons, including scheduling constraints, critical sharing, concurrent updates, global conditions, checkpoints, and so on. Synchronization can be local (only a subset of the computing units) or global (all the computing units). Synchronizing in the context of a large-scale supercomputer is costly and the effect on scalability can be noticeable.
6. Data exchanges: This is the main source of serious time overhead with distributed memory parallelism, besides the aforementioned mechanism (synchronization). The communication cost depends on the volume (amount of data exchanged), the occurrence (how many times), and the quality (compatibility with the physical interconnect) of the exchanges. This aspect is certainly one of the most hindering on the way to parallel efficiency, as it consumes the major part of the overall overhead. In the context of accelerated computing using an external device like a GPU or an FPGA, the exchanges with the host CPU are critical and should be taken into account carefully.


7. Load balance: Parallel tasks do not necessarily have the same makespan. Besides the (floating-point) computing load, there are also numerical and scheduling characteristics that might impact the runtime complexity of a task on a given compute node. This aspect is hard to fix without changing the way the computation is organized. The makespan of a task depends on its specific structure and also on the computation flowchart. As it is thus difficult to obtain a good static prediction of the makespan, a dynamic mechanism should be implemented to strive for load balancing among the computing units at runtime, as illustrated by the sketch after this list.
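As a minimal illustration of the last point (an illustrative sketch only, with an artificial workload whose cost grows with the task index), the following OpenMP loop would give the last threads far more work under a static schedule, whereas schedule(dynamic) hands out iterations on demand and balances the load at runtime.

#include <omp.h>
#include <stdio.h>

#define NTASKS 1000

/* Artificial task whose cost grows with its index. */
static double do_task(int t) {
    double s = 0.0;
    for (long i = 0; i < (long)t * 10000; i++)
        s += 1.0 / (double)(i + 1);
    return s;
}

int main(void) {
    double total = 0.0;

    /* schedule(dynamic, 8): idle threads grab the next chunk of 8 tasks,
       so threads that received cheap tasks do not sit idle at the end. */
    #pragma omp parallel for schedule(dynamic, 8) reduction(+:total)
    for (int t = 0; t < NTASKS; t++)
        total += do_task(t);

    printf("total = %f\n", total);
    return 0;
}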

3.2.4 About Parallel Efficiency

As we have explained so far, increasing the power of supercomputers by increasing the number of computing units raises a number of technical challenges that need to be addressed carefully in order to make their benefit clear to the community. Indeed, the gap between the peak performance and the sustained performance is a genuine concern. This is like gross salary versus net salary from the employee viewpoint. Users expect supercomputers to be powerful enough for their applications, not in absolute terms. Thus, getting close to the maximum performance is a crucial request. From the hardware point of view, this means a number of improvements: memory latency at all hierarchy levels should be reduced; the programmer should be given the opportunity to manage memory features as desired; data exchanges between different memory levels should be improved by adding additional buses; the penalty for accessing distant parts of a NUMA memory should be revisited; the set of vector instructions should be soundly extended; and network capability should be improved (topology, bandwidth, and latency) in order to lower the communication overhead sufficiently. At the algorithmic level, the scheduling should be aware of Amdahl's law [1]. When it comes to supercomputers, the main problem is scalability. The complexity of the communication pattern increases with the number of processors, thus exacerbating the gap between the virtual topology and the physical interconnect. Supercomputers are generally made of shared memory computing nodes with several cores. Programmers often consider the processor core as the basic processing unit and then launch a pure message passing program onto the machine. Current implementations of MPI allow this to work seamlessly, but a scalability wall is quickly reached. Having a shared memory implementation on each multicore node has several advantages. The first one is that the overall memory of the computing node is available for the task assigned to the node, which also reduces data dependencies. Secondly, the cores within a node no longer need to exchange data through the network; they concurrently access their local shared memory instead. Third, the global communication topology becomes lighter, which might lead to a significant reduction of the communication cost. When considering a parallel approach for a given application, the efficiency concern is mainly guided by the conceptual flowchart displayed in Fig. 3.14.
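To make the reference to Amdahl's law concrete, recall that if a fraction s of the execution is inherently sequential, the speedup on p processors is bounded as follows (a standard worked example, with s = 5% chosen purely for illustration):

\[
S(p) = \frac{1}{s + \frac{1-s}{p}}, \qquad S(64) = \frac{1}{0.05 + 0.95/64} \approx 15.4, \qquad \lim_{p\to\infty} S(p) = \frac{1}{s} = 20
\]

Even with 95% of the work perfectly parallelised, 64 cores yield a speedup of only about 15, and no number of cores can exceed 20; this is why the serial fraction and the overheads listed below dominate scalability discussions.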


Fig. 3.14 Parallelism in relation with efficiency

Technically, there are several aspects to consider when it comes to scalability. We list and comment on the major ones.
• Dependencies. Whatever the nature of the dependency between two tasks, if it really has to be taken into account at runtime then there will be a temporary loss of parallelism. In addition, there might be an overhead to handle the corresponding synchronisation, depending on the implementation and the target architecture. This conceptual aspect is the main obstacle to full parallelism and it stands as the primary focus when designing a parallel algorithm/program.
• Creation and management of the parallelism. Whatever the paradigm used for implementation, the tasks (i.e. processes/threads/...) have to be created and managed at runtime. In addition, each of the tasks will interact with the operating system and request resources, thus there is an overhead and a certain level of serialisation. The effect of this aspect increases with the magnitude of the parallelism, but noticeable efforts are being made to lower its cost.
• Load imbalance. As the main idea of parallelization is to share a global task among several workers, the ideal configuration is to have each worker carry the same load. This is exactly what we have in mind when expecting a parallel execution time of T/p, where T is the sequential execution time and p the number of processors. The main problem here comes from the fact that, most of the time, parallelism is designed with static considerations, which do not always match the realities at run-time. For instance, in a typical parallel sorting, the input array


is subdivided into fixed-size sub-arrays to be sorted independently. However, the effective complexity of a sort might depend on the configuration of the input array, for instance if it is already sorted. Several examples can be found in linear algebra and combinatorial optimisation, where the nature of the data has an influence on the workflow and thereby on the time complexity.
• Access to the data. This aspect is really crucial in all kinds of hardware configuration. Indeed, parallelization mainly focuses on the computation, and data aspects are considered later on to address space/time concerns. In a sequential computation, we already know that the penalty of an inefficient data access pattern can be severe, through the corresponding cache misses for instance. In shared memory parallelism, inefficiency can come from bus contention, memory controller saturation, remote memory accesses (in a NUMA configuration), false sharing (see the sketch after this list), and mutual exclusion to access critical variables. In distributed memory parallelism, the major obstacle to a good speedup comes from explicit data exchanges (latency and transfer through the network, together with the necessary synchronization).
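False sharing deserves a short illustration (a didactic sketch with made-up sizes, assuming a 64-byte cache line, not production code): when per-thread counters sit next to each other in memory they share a cache line, so every update by one thread invalidates the line in the other cores' caches; padding each counter to its own cache line removes the problem.

#include <omp.h>
#include <stdio.h>

#define NTHREADS  8
#define ITERS     100000000L
#define CACHELINE 64   /* assumed cache line size in bytes */

/* Padded counter: each instance occupies a full (assumed) cache line. */
struct padded { long value; char pad[CACHELINE - sizeof(long)]; };

int main(void) {
    long naive[NTHREADS] = {0};               /* counters packed into one or two cache lines */
    struct padded padded_ctr[NTHREADS] = {{0}};

    /* Version 1: adjacent counters -> false sharing, poor scaling. */
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++) naive[id]++;
    }

    /* Version 2: one cache line per counter -> no false sharing. */
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++) padded_ctr[id].value++;
    }

    printf("naive[0] = %ld, padded[0] = %ld\n", naive[0], padded_ctr[0].value);
    return 0;
}

Both versions compute exactly the same values; only the memory layout differs, yet on typical multicore machines the padded version scales markedly better.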

3.3 Two Examples of HPC Applications

Considering that the purpose of HPC is to provide a fast computing solution for a given problem, the noteworthy evolution of the potential computing power of modern machines has naturally pushed the horizon of users' expectations from the applications standpoint. Let us illustrate this with two examples: Lattice Quantum ChromoDynamics (LQCD) and high-resolution seismic imaging.

3.3.1 Lattice Quantum ChromoDynamics (LQCD)

Quantum ChromoDynamics (QCD) [9], the theory of the strong nuclear force which is responsible for the interactions of nuclear particles (the main purpose being to understand the origin of the universe), can be numerically simulated on massively parallel supercomputers using the Monte Carlo paradigm and the lattice gauge theory (LQCD) approach (see Vranas et al. [8]). A typical LQCD simulation workflow applies basic linear algebra operators to a huge number of structured variables. The major LQCD kernel is the inversion of the Dirac operator (seen as a matrix), which is an important step during the synthesis of a statistical gauge configuration sample. Indeed, in the Hybrid Monte Carlo (HMC) algorithm [7], it appears in the expression of the fermionic force that is used to update the momenta associated with the gauge fields along a trajectory. The Wilson-Dirac matrix is sparse and implicit (i.e. not given explicitly in the form (a_ij)), thus iterative approaches are the main option for its inversion (what is really done is the resolution of the corresponding linear system). In addition, some numerically


sensitive scenarios exhibit almost null eigenvalues, which exacerbates numerical issues and pushes far away the required number of iterations to converge. Moreover, this numerical sensitivity justifies the importance of using double precision arithmetic. Some authors consider so-called mixed precision [6], which sacrifices the precision of the core computation while keeping double precision for the updates and the convergence criterion. In the presence of very small eigenvalues, thus an ill-conditioned Wilson-Dirac matrix, the iteration process will hardly converge (too many iterations). Mixed precision is mainly motivated by the desire to use single precision, which yields the best performance with GPUs, and also with CPUs through larger vector registers and a lower memory footprint/bandwidth. However, the penalty from the loss of numerical robustness might not be affordable when it comes to large-scale sensitive LQCD scenarios like the ones related to very small pion masses. For all the aforementioned reasons, the need for efficient high-precision implementations of the Dirac operator is in the spotlight of both the HPC and the LQCD communities. A common way to parallelize LQCD applications is to follow the domain decomposition paradigm, which means partitioning the whole lattice into sublattices and assigning each of them to a computing node (see [5, 6]). This yields a standard SPMD model, which is afterwards implemented and deployed on a given parallel machine. As we have previously explained, a typical compute node of a modern supercomputer is a multicore CPU, thus the need for a hybrid implementation. With multicore/manycore nodes, the intention behind optimizing the computation using shared memory parallelism is to get a simpler communication graph between the compute nodes. Thereby, the interprocessor communication overhead should be significantly reduced. This is very important for large-scale LQCD on supercomputers, where each node has to communicate with its 8 “neighbors” (stencil computation), hence the often unacceptable communication overhead observed in that context. A number of authors have studied LQCD implementations on various kinds of supercomputers [4, 8, 10]. However, the efficiency of LQCD frameworks on large clusters is usually below expectations, sometimes unacceptably so. The main reason is that, even if supercomputers are increasingly powerful, all levels of parallelism need to be skillfully harnessed in order to harvest a significant fraction of the available computing power. In addition, memory accesses and data exchanges, which are not accounted for in the theoretical peak performance, are really dominant in LQCD computations. The way to get the maximum efficiency out of a supercomputer is to deeply focus on the compute node and strive for the most efficient implementation. In addition to a lower data communication overhead due to less complex interprocessor exchanges, data redundancy is also reduced by an explicit shared memory implementation on local nodes. The efforts in this direction should focus on several aspects like memory and data management, vector computing, and multithreading. NUMA-aware scheduling is also to be investigated in order to cope with scalability issues. Since the Wilson-Dirac inversion is exclusively done through iterative approaches, making each iteration faster should certainly improve the overall performance, besides those approaches which try to reduce the number of iterations through purely numerical techniques.


Here we point out a number of important facts that should be carefully considered in order to harvest an increasing fraction of the available computing power with LQCD applications. Let us start from the performance of 0.5 GFlops/core reported by G. Grosdidier [12] when running tmLQCD [13] on 10,000 cores of the CURIE-FAT machine [14]. The machine is based on the Xeon X7560 8C 2.26 GHz processor, thus a peak of 9 GFlops per core. We then see that each core is running at about 5% of its theoretical peak performance, which is unacceptable. Among the reasons why large-scale LQCD might show some level of inefficiency with standard implementations, there is the lack of low-level parallelism (mainly SIMD), which reduces the theoretical performance expectation by a factor of 4, since most modern processors now have at least 256-bit vector registers (4 double precision components). Memory performance is also a serious bottleneck. Indeed, as we have previously explained, computing the Wilson Dslash implies a noticeable memory activity with a lot of redundant accesses and wasted memory bandwidth. Another important source of performance penalty is the interprocessor communication overhead when running on distributed memory parallel machines. Indeed, in addition to the natural cost of data transfers, there is an important gap between the ideal 4D torus topology required for LQCD computations and the physical topology of real supercomputers. Hybrid implementations are certainly a relevant approach to reduce the need for explicit data exchanges, but this requires an efficient intranode implementation (a shared memory parallel implementation). Considering multi-socket processors (thus a larger number of cores), designing efficient and scalable LQCD code is challenging because of the side effects that are typically encountered with NUMA architectures [15, 16].
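The domain decomposition described above can be sketched with MPI's Cartesian topology support (a minimal, illustrative fragment that assumes the number of processes factors into a 4D grid; the lattice data structures and the Dirac kernel itself are omitted). Each process obtains its 8 neighbours, two per space-time dimension, which is exactly the communication pattern of the Wilson-Dirac stencil.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Let MPI factor the processes into a 4D grid (t, x, y, z). */
    int dims[4]    = {0, 0, 0, 0};
    int periods[4] = {1, 1, 1, 1};   /* periodic boundaries: a 4D torus */
    MPI_Dims_create(nprocs, 4, dims);

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 4, dims, periods, 1, &cart);

    /* The 8 neighbours of this process: one backward and one forward per dimension. */
    int back[4], fwd[4];
    for (int d = 0; d < 4; d++)
        MPI_Cart_shift(cart, d, 1, &back[d], &fwd[d]);

    if (rank == 0)
        printf("process grid: %d x %d x %d x %d\n",
               dims[0], dims[1], dims[2], dims[3]);

    /* In a real code, halo exchanges with back[d] and fwd[d] would be posted
       here (e.g. with non-blocking sends/receives), ideally overlapped with
       the local part of the Dirac operator application. */

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}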

3.3.2 High-Resolution Seismic Imaging

Seismic imaging techniques are extensively used in geophysical exploration. For example, full-waveform inversion (FWI) and reverse time migration (RTM) are essential applications for the identification and placement of hydrocarbon reservoirs and for the characterisation of subsurface material properties like porosity, viscosity, acoustic velocity, localisation, dimensions, and others. Such applications are extremely important for the efficiency of oil and gas exploration, for instance in the Brazilian coastal region where hydrocarbon reservoirs are found a few kilometres deep under salt bodies. A single well drilled in the wrong location can waste millions of euros and delay production for weeks or months. Even after production starts, it is important to track how reservoirs evolve (i.e., the quantities of remaining hydrocarbon, the flow and pressure of fluids in porous rocks, where to inject fluids to increase pressure, etc.) to maximise production. Seismic imaging methods can also be employed to locate both onshore and offshore sites for the storage of carbon dioxide. Studies on the use of caves for the capture and storage of CO2 are recent but have great potential for environmental sustainability [2].


Fig. 3.15 Geophysical data acquisition

Full-waveform inversion (FWI) is a data-fitting procedure based on full-wavefield modelling to extract quantitative information from collected data, as illustrated in Fig. 3.15. FWI is widely used in seismology and geophysical exploration to build high-resolution images of subsurface materials based on their physical properties. In conventional time-domain FWI workflows, most of the time is spent in the computation of the forward and adjoint wave propagation. The kernel of this process involves the numerical solution of partial differential equations (PDEs) that model the propagation of acoustic waves in multi-layer subsurface materials. The FWI workflow minimises both the amplitude and phase differences between the modelled signals and the signals that are recorded by a set of microphones (or hydrophones) located near the surface. The model is incrementally modified so that the functional that represents the error is sufficiently reduced [3]. The final objective is to reconstruct various parameters of the materials, such as the velocities of P-waves and S-waves, density, anisotropy, and attenuation. Reverse-time migration (RTM) is a quite similar process which uses a least-squares minimisation of the misfit between recorded and modelled data. Besides their application to the identification and placement of hydrocarbon reservoirs, FWI and RTM are important tools for investigating the use of caves for the capture and storage of carbon in the oil and gas industry.


As a numerically sensitive inversion problem, FWI is computationally challenging. For instance, cycle-skipping and non-linearity can lead to convergence toward a local minimum. To mitigate this issue, there is a strong need for: (1) multiscale strategies, which progressively incorporate shorter wavelengths in the parameter space; (2) differential approaches, in which the gradient and the Hessian operators can be efficiently estimated even in the presence of multi-layer materials with sharp interfaces and diverse geological shapes; (3) better modelling of the wave propagation physics; (4) noise reduction; and (5) efficient absorbing techniques along the domain boundaries to mitigate spurious reflections. FWI and RTM workflows are known to be computationally heavy. Typically, the execution of an FWI scenario can take several months on a Petaflop/s cluster with data that are collected within the range from 2 to 10 Hz. As better quality data can be collected and made available, and because higher resolution images are requested (i.e., processing higher frequency data) to shorten the “time to first oil” and improve the industry's efficiency, the computational cost of FWI will keep increasing significantly in the coming years. As mentioned, seismic imaging applications remain challenging because of the inherent complexity of the problem, the huge volume of data, and the high computational cost. The required software tools are highly specialized, from both the mathematical and the high performance computing standpoints, and they require many person-years to be designed and efficiently implemented. This fact stands as a serious issue on the way to the development of innovative methods. HPC is expected to reduce the “time to first oil” so as to make it realistic and profitable, while machine learning (ML) techniques are to be investigated for innovative approaches dedicated to high-resolution seismic imaging.
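The computational core mentioned above, i.e. the forward propagation of acoustic waves, boils down to a stencil update applied over millions of grid points for thousands of time steps. The fragment below is a deliberately simplified sketch (constant density, second-order finite differences, no source term or absorbing boundaries, and a precomputed Courant factor passed as a parameter) of one time step of the 2D acoustic wave equation, just to show the memory-bound, stencil-like nature of the kernel.

#include <stddef.h>

/* One explicit time step of the 2D constant-density acoustic wave equation:
 *   p_next = 2*p - p_prev + (v*dt/h)^2 * discrete_Laplacian(p)
 * Arrays are (nx x nz), stored row-major; boundary points are left untouched. */
void wave_step(size_t nx, size_t nz, double courant2,
               const double *restrict p_prev,
               const double *restrict p,
               double *restrict p_next)
{
    for (size_t i = 1; i < nx - 1; i++) {
        for (size_t j = 1; j < nz - 1; j++) {
            size_t k = i * nz + j;
            double lap = p[k - 1] + p[k + 1] + p[k - nz] + p[k + nz] - 4.0 * p[k];
            p_next[k] = 2.0 * p[k] - p_prev[k] + courant2 * lap;
        }
    }
}

In an HPC setting, this loop nest is what gets parallelised with OpenMP and MPI domain decomposition or offloaded to GPUs, and its low arithmetic intensity means that memory bandwidth, rather than peak FLOPS, usually dictates the achieved performance.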

3.4 HPC and Cloud Computing

Besides the objective of increasing processor power to cope with the need for processing speed in HPC applications, there is a focus on the ability to leverage remote computing power from an increasingly diverse collection of devices. Cloud computing offers a great alternative for mass storage, software, and computing devices. Federating available computing resources, assuming a sufficiently fast network, is certainly an efficient way to offer a more powerful computing system to the community. The main advantage is that the maintenance cost is thereby mutualized and users pay only for what they have (really or potentially) consumed. In addition, through the Software as a Service (SaaS) feature, users instantly benefit from updates, new releases, and new software, with an opportunity to share data and key parameters. This approach of federating available resources can also be seen as a way to save power through the minimization of wastage. The topic of cloud computing is coming into vogue and will probably be adopted for major large-scale scientific experiments. The challenge for computer scientists is how to efficiently schedule a given set of tasks on the available set of resources in order to serve user requests as well as possible, while taking care of energy.


References

1. M. Hill and M. Marty, Amdahl's law in the multicore era, Computer, vol. 41, no. 7, pp. 33–38, 2008.
2. RCGI scientists study storage of carbon-rich natural gas in underwater salt caves, https://www.rcgi.poli.usp.br/rcgi-scientists-study-storage-of-carbon-rich-natural-gas-in-underwater-salt-caves, 2018.
3. Virieux, J. and Operto, S., An overview of full-waveform inversion in exploration geophysics, Geophysics, vol. 74 (6), 2009.
4. G. Bilardi, A. Pietracaprina, G. Pucci, F. Schifano, and R. Tripiccione, The Potential of On-Chip Multiprocessing for QCD Machines, HiPC 2005, LNCS 3769, pp. 386–397, 2005.
5. M. Luscher, Implementation of the lattice Dirac operator, White paper (https://repository.prace-ri.eu), January 2012; revised November 2013.
6. Clark, M.A., Babich, R., Barros, K., Brower, R.C., Rebbi, C., Solving Lattice QCD systems of equations using mixed precision solvers on GPUs, Comput. Phys. Commun. 181 (2010) 1517–1528.
7. C. Urbach, K. Jansen, A. Shindler, and U. Wenger, HMC Algorithm with Multiple Time Scale Integration and Mass Preconditioning, Computer Physics Communications, vol. 174, p. 87, 2006.
8. P. Vranas, M. A. Blumrich, D. Chen, A. Gara, M. E. Giampapa, P. Heidelberger, V. Salapura, J. C. Sexton, R. Soltz, G. Bhanot, Massively parallel quantum chromodynamics, IBM J. Res. & Dev., vol. 52, no. 1/2, January/March 2008.
9. F. Wilczek, What QCD Tells Us About Nature and Why We Should Listen, Nuclear Phys. A 663, 320, 2000.
10. Smelyanskiy, M., Vaidyanathan, K., Choi, J., Joo, B., Chhugani, J., Clark, M.A., Dubey, P., High-performance lattice QCD for multi-core based parallel systems using a cache-friendly hybrid threaded-MPI approach, In: Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11 (2011) 69:1–69:11.
11. C. Tadonki, Scalability on Manycore Machines, https://www.cri.ensmp.fr/people/tadonki/talks/Scalability.pdf, 28th International Supercomputing Conference, ISC 2013, Leipzig, Germany, June 16–20, 2013.
12. G. Grosdidier, Scaling stories, PetaQCD Final Review Meeting, Orsay, France, Sept. 27th–28th, 2012.
13. K. Jansen and C. Urbach, tmLQCD: a program suite to simulate Wilson Twisted mass Lattice QCD, Computer Physics Communications, vol. 180(12), pp. 2717–2738, 2009.
14. QDP++, http://www.top500.org/system/177003.
15. Y. Li, I. Pandis, R. Mueller, V. Raman, and G. Lohman, NUMA-aware algorithms: the case of data shuffling, http://www.pandis.net/resources/cidr13numashuffling.pdf, 2013.
16. R. Al-Omairy, G. Miranda, H. Ltaief, R. M. Badia, X. Martorell, J. Labarta, and D. Keyes, Dense Matrix Computations on NUMA Architectures with Distance-Aware Work Stealing, Supercomputing Frontiers and Innovations, vol. 2(1), 2015.
17. L. Bouhouch, C. Tadonki, and M. Zbakh, Dynamic Data Replication and Placement Strategy in Geographically Distributed Data Centers, Concurrency and Computation: Practice and Experience (CCPE), https://doi.org/10.1002/cpe.6858, 2022.
18. W. Dally, S. Keckler and D. Kirk, Evolution of the Graphics Processing Unit (GPU), IEEE Micro, vol. 41, no. 06, pp. 42–51, https://doi.org/10.1109/MM.2021.3113475, 2021.
19. Chris Gregg and Kim Hazelwood, Where is the Data? Why You Cannot Debate CPU vs. GPU Performance Without the Answer, International Symposium on Performance Analysis of Systems and Software (ISPASS), Austin, TX, April 2011. http://www.cs.virginia.edu/kim/docs/ispass11.pdf


20. G. Giunta, R. Montella, G. Agrillo, G. Coviello, A GPGPU Transparent Virtualization Component for High Performance Computing Clouds, 16th International Euro-Par Conference, Ischia, Italy, August 31–September 3, 2010.
21. Alexander Schrijver, Theory of Linear and Integer Programming, Wiley, ISBN: 978-0-471-98232-6, June 1998.
22. L. G. Khachiyan, A polynomial algorithm in linear programming, Doklady Akademii Nauk SSSR 244:1093–1096, 1979.
23. Peter Pacheco, Parallel Programming with MPI, Morgan Kaufmann, 1996.
24. Bradford Nichols, Dick Buttlar and Jacqueline Proulx Farrell, Pthreads Programming, O'Reilly, 1996.
25. https://www.openmp.org/
26. Oliver K. Ban, Vector Computing: Principals, Implementation and Applications, M&L Publishers, 2001.
27. A. Kleen, A NUMA API for Linux, Technical report, Novell Inc, 2004. http://www.halobates.de/numaapi3.pdf
28. https://www.open-mpi.org/projects/hwloc/
29. https://www.openacc.org/
30. Claude Tadonki, High Performance Computing as a Combination of Machines and Methods and Programming, University of Paris-Sud, Orsay, France, 2013.
31. https://top500.org/
32. https://www.top500.org/lists/top500/2022/06/
33. https://frankdenneman.nl/2016/07/07/numa-deep-dive-part-1-uma-numa/

Part II

Running HPC Applications in Cloud

Chapter 4

Deploying and Configuring Infrastructure

Edson Borin and Otávio O. Napoli

4.1 Introduction

High-Performance Computing (HPC) systems are typically built with a set of basic components, as illustrated in Fig. 4.1. The computing nodes are responsible for executing the HPC applications. These components contain one or more processing cores, a main memory, and, in some cases, they may also contain specialized accelerators (e.g., GPUs) or local storage (e.g., SSD or HDD devices). The main storage component1 is responsible for storing the data and applications, and for providing interfaces so the HPC applications can access the data. The login node (also known as the head node) is a machine that provides an interface for users to log into the HPC system and run their applications. It allows users to copy files to and from the system storage component and is usually responsible for running the job scheduler (e.g., PBS or SLURM). Users connect to this node through an external network (e.g., the Internet), usually using secure shell protocols and associated tools, such as the Linux ssh tool. Finally, the system network is responsible for connecting all the components, including the computing nodes, the storage components, and the login node. Cloud providers offer several kinds of infrastructure components that can be deployed and configured to assemble HPC systems as illustrated by Fig. 4.1. Moreover, there are several ways of deploying and configuring these systems. This

1 The term “component” is used instead of “device” because these elements may be implemented in several ways, including by a dedicated, specialized hardware device or by using the local storage devices on the computing nodes.


Fig. 4.1 HPC system components overview

chapter presents the key infrastructure elements offered by cloud computing providers under the IaaS model, and discusses means to deploy and configure them into HPC clusters. The remainder of this chapter is organized as follows: Sect. 4.2 presents the key infrastructure elements offered by cloud computing providers and Sect. 4.3 discusses how these elements can be used to implement a cloud-based HPC cluster. Then, Sect. 4.4 discusses different approaches to deploy and configure this system. Finally, Sect. 4.5 provides the final considerations about selecting tools and instances for deploying and configuring HPC systems on the cloud.

4.2 Key Infrastructure Elements

This section presents the key infrastructure elements that can be used to assemble cloud-based HPC systems. First, Sect. 4.2.1 presents the role of virtual machines and virtual machine images in a cloud-based HPC system. Then, Sects. 4.2.2 and 4.2.3 discuss how placement strategies and tenancy may affect the system performance. Next, Sect. 4.2.4 provides an overview of the main storage services, their characteristics, and how they may affect the design and execution of HPC applications. Finally, Sect. 4.2.5 introduces the concept of Virtual Private Cloud networks and discusses their role in cloud-based HPC systems.

4.2.1 Virtual Machines

The IaaS model allows users to rent Virtual Machines, or VMs, which can be seen, in general, as remote computers that can be configured and accessed through the

Fig. 4.2 Three virtual machines (VM1, VM2, and VM3) instantiated on a single physical machine composed of one 48-core CPU and 256 GB of memory. Solid boxes indicate components on the physical machine while dashed ones indicate virtual machines

Internet. In this context, these resources could be used to implement a login node, computing nodes, or even a storage component.2 Cloud datacenters are usually composed of several kinds of physical machines, e.g., with different processor models, amount of main memory, etc. Moreover, modern virtualization technologies allow providers to instantiate virtual machines on any of these physical machines and configure them to use only a subset of the physical machine resources, as illustrated in Fig. 4.2. As a consequence, cloud providers usually offer a wide variety of virtual machine configurations. The virtual machine type is a logical concept often used to define the set of resources present on a virtual machine (e.g., number of processing cores, main memory capacity, etc.). For example, on AWS, the m5.16xlarge virtual machine type defines virtual machines that contain 256 GiB of memory, 64 vCPUs, and are powered by Intel Xeon Platinum 8175M processors while the m6g.16xlarge VM type defines virtual machines that contain 256 GiB of memory, 64 vCPUs, and are powered by ARM-based AWS Graviton2 processors. Cloud providers typically group their virtual machine types based on the resources that best suit each use case. These logical groupings are called families.3 Table 4.1 shows common families found on the main cloud providers and provides a brief description of their targeted workloads.

4.2.1.1 Virtual Machine Images

A virtual machine image is a binary file that contains a set of pre-installed software, often including an Operating System (OS). A VM image is composed of one or more volumes, which can be seen as logical partitions of a hard drive or as multiple

2 Section 4.2.4 will discuss alternative and more efficient ways of implementing the main storage component on the cloud.
3 Some providers use the term “series” instead of “families” to denote these logical groupings.


Table 4.1 Common virtual machine families

VM family | Description | Workload type | Common applications
General purpose | Machines with a balanced degree of computing, memory, and network resources | Balanced (applications that use resources in equal proportions) | Databases, web serving, caching, media streaming
Compute optimized | Machines with a high CPU-to-memory ratio | Compute-intensive | Scientific modeling, gaming, batch processing, media transcoding, artificial intelligence
Memory optimized | Machines with a high memory-to-CPU ratio | Memory-intensive | In-memory databases, genomics analysis, SQL analysis services
Storage optimized | Machines with high I/O operations per second (IOPS) | I/O-intensive | Non-relational databases, data warehousing
Accelerator optimized | Machines with hardware accelerators (e.g., GPUs) | Massive parallel computation | Applications with lots of floating-point operations, graphics processing, stencil computations

hard drives. They are loaded when booting virtual machines and can be seen as the equivalent of the contents of a hard drive on a desktop machine. Cloud providers usually offer a wide variety of virtual machine images, with different sets of software pre-installed. Based on such images (or from scratch), users may also create their own custom images, facilitating applications’ deployment and minimizing the effort needed to configure a virtual machine. Custom images are usually private, but users can share them publicly.



> Virtual Machine Images and Snapshots

Snapshots represent a copy of a single volume at a specific point in time. A virtual machine image represents a “copy” of the whole instance, which can be composed of several volumes (or snapshots of each of them), launch instance permissions, and the device mappings (which describe the volumes to attach to the instance).

4.2.2 Regions, Availability Zones, and Placement Strategies

Regions are geographic locations in which datacenters are located. Availability zones are locations within the region where a datacenter resides and operates. In general, zones have high-bandwidth and low-latency network connections to other


zones in the same region. Besides that, resource prices may vary depending on the zone in which they are located. Placement strategies4 are the policies used to allocate instances on the datacenters, optimizing a desired objective. For instance, placement strategies can be used to spread the allocation of resources (e.g., VMs) across different datacenters in order to reduce correlated failures and minimize the downtime of an application. Fault-tolerant and high-availability applications (e.g., web applications) are usually designed to operate in different regions, aiming to minimize the downtime in the event of disruptions. Some providers allow users to define placement strategies to allocate computing resources on the same computing rack.5 These strategies, which focus on placing resources as close as possible within the datacenter, are useful to achieve low-latency and high-bandwidth network performance between resources. This kind of approach is usually desired when assembling an HPC cluster on the cloud.

4.2.3 Tenancy

Tenancy refers to how a system or a resource is shared. Cloud providers usually offer virtual machines where the underlying hardware (i.e., the physical machine) may be shared among different users, similar to what is shown in Fig. 4.2. This model of sharing is called multi-tenant. In general, multi-tenant models allow cloud providers to save money by maximizing resource usage, as a resource may not be used all the time. However, users' applications may sometimes suffer performance interference if some underlying hardware resource is saturated (e.g., processor cache or memory bandwidth). Contrary to multi-tenant models, cloud providers may also offer single-tenant models for some resources, which grant users exclusive usage. This model is usually preferable when dealing with sensitive data, which may be exposed to leakages in multi-tenant scenarios. Single-tenant pricing is usually higher than multi-tenant pricing.

4.2.4 Storage Services

The organization of and access to data is a point that may impact the application's performance. Depending on the use case and the application, different storage strategies and services may perform better. Usually, data is stored locally, in the

4 The term varies according to the cloud provider; placement strategies are also called placement groups or placement policies.
5 A computing rack is a stand in which the physical machines are stored.


machine’s local storage system. However, in clusters, it is often convenient for users to have the data visible by multiple machines, which requires the use of shared storage. For this kind of storage, different cloud services and systems are available, and they are classified depending on how data is stored and interfaced. Cloud storage services are usually categorized into file-, block, and object-based storage systems. Block-based storage systems store the data into evenly-sized blocks of data, and each block is assigned with a unique identifier. These blocks can be stored on different locations, including on different physical devices (e.g., HDDs) or different servers, on a transparent way, i.e., without the user knowledge. Whenever the user requires data from the block storage system, the system puts together the requested blocks before retrieving the data to the user. This flexibility allows the cloud provider to allocate or move blocks around its infrastructure to optimize data access performance (e.g., increase bandwidth by distributing blocks among multiple servers or reduce latency by moving blocks to nearby servers). In fact, often, this is one of the cloud-based storage systems that offer the lowest latency. Services such as Amazon EBS and Google’s Persistent Disk are examples of block-based storage systems which have higher input and output operations per second (IOPS) compared to other storage types. On the cloud, block-based storage systems are typically used to store virtual machine images. In this context, they are mounted on a VM and used as a boot device. Because they may provide high-performance in terms of IOPS, they are also often used to store databases and for data warehousing. Block-based systems can also be formatted by the user to store the data of a file system (e.g., NTFS, ext4, etc.). In this case, the user may read, write, or modify files using common operating system calls (e.g., read and write system calls on Linux) and, consequently, common programming language file interfaces (e.g., fopen and fread in C). Block-based storage systems are acquired in volumes of fixed sizes, i.e., their size cannot be easily expanded or reduced once they are allocated and attached to virtual machines. Moreover, there is usually a limitation on how many virtual machines can mount the volume at the same time. Hence, this kind of storage is not commonly used as the main storage component of cloud-based HPC clusters, but is very common to store local and ephemeral contents. File-based storage systems store data as files, which are also organized hierarchically, in folders, and identified by their paths. These systems are usually managed by the operating system, allowing users to read, write, or modify files using common system calls (e.g., read and write system calls on Linux) and, consequently, common programming languages interfaces (e.g., fopen and fread in C). The file-based storage system may be stored at a local device (e.g., HDD), or at a remote server, which may allow multiple machines to access the file system concurrently, with a latency slightly higher than block-based systems. The Network File System (NFS) protocol, for instance, is one of the most popular shared file system protocols that provide a remote file-based storage, allowing users to mount their file systems on different physical machines and to access its contents transparently. AWS EFS, Azure Files, and Google FileStore are examples of file-based cloud storage services.


File-based cloud storage services are usually very flexible in terms of size, allowing users to add or remove as many files as they wish. Also, this kind of system is reasonably scalable, allowing users to store very large files, with tens of gigabytes, and lots of files per folder. Moreover, this kind of storage can usually be shared by a large number of VMs, making it a good fit for the main storage component of many cloud-based high-performance computer systems.6 On the cloud, file-based storage systems are typically used for storing application data, logs, web serving, and home directories. Object-based storage systems store data as objects. An object is usually defined by a unique identifier, the actual data, and its metadata. The metadata keeps information about the data, such as content type (e.g., video file), data size, ownership, date of creation, access rights, etc. Objects are accessed directly by applications through APIs that perform requests to store/retrieve the objects on/from the storage system server (e.g., HTTP's PUT and POST). AWS S3, Azure Blob, and Google's Cloud Storage are examples of object-based cloud storage services. By using a unique identifier, data can be accessed without the need for creating or mounting any file system, allowing it to be accessed from almost everywhere and from any number of VMs. This also provides more flexibility to system designers, as multiple users can read/write objects using the API, and also to cloud providers, as they can store objects at different locations. As a consequence, object-based storage systems are usually more scalable than block- and file-based ones. Additionally, object-based cloud storage services are usually cheaper than their counterparts. On the downside, object-based cloud storage services usually have higher latency. Also, organizing data into objects and accessing them through specialized APIs (e.g., HTTP requests) is not a common practice in HPC applications, hence this kind of storage may require the user to change the HPC application. Finally, every read and write operation generates Internet requests (e.g., HTTP requests), and, specifically for write operations, a new version of the object must be uploaded; hence, these operations may cause a significant impact on the application's performance if modifications are frequent or are made to large objects. Cloud-based storage services are usually charged as a function of the data size and the amount of time the data is stored (e.g., USD/GB/h). For block-based storage volumes, the user is charged for the whole size of the device, regardless of whether it is being fully utilized or not. For file- and object-based services, the user is charged according to the total size of the files/objects stored on the system. The price of these services can also be influenced by the location where the data is stored. Moreover, since data may be accessed through the Internet, object-based systems may also charge for data transfers (i.e., when the data is being accessed), especially if the data is being accessed by a connection from outside the region or the availability zone.

6 Some HPC systems may require a high-throughput I/O system. In these cases, a high-performance file-based system, such as Lustre, may be necessary. Some cloud providers already offer high-performance file-based storage services for these use cases (e.g., AWS FSx).


Table 4.2 Differences between storage types and services

Aspect | Block-based | File-based | Object-based
Pricing | Per contracted GB | Per used GB and location | Per contracted GB, location, and amount of transfer
Scalability | Low | Moderate | High
Latency | Lowest | Low | Moderate–High
Interface | Device-level | Operating system | API (usually via HTTP requests)
Sharing | Single/few VMs | Several VMs | Several VMs
Access control | Network firewalls, cloud user based authentication | Network firewalls, cloud user based authentication | Object-based policies, user policies, pre-signed URLs, and public accesses
Use case | Boot volumes, databases, and data warehousing | Logs, application data, web serving, home directories | Web serving, big data, content management, media, entertainment, and backups
Example services | Amazon EBS, Google's Persistent Disk, Azure Managed Disks | Amazon EFS, Google FileStore, Azure Files | Amazon S3, Google's Cloud Storage, Azure Blob Storage

Table 4.2 summarizes the main characteristics of the storage services in cloud contexts.

4.2.5 Virtual Private Cloud Networks

A Virtual Private Cloud network, or VPC network, is a virtual network that can be used to connect resources (e.g., storage services and VMs) allocated by the cloud provider on behalf of users. It can be used to allow resources to communicate as if they were inside a local area network (LAN), with their own private IP addresses, subnets, and routing tables. Also, VPC networks may transparently connect resources located in different regions. In the context of cloud-based high-performance computing systems, a VPC network can be used to connect the login and the computing nodes to the main storage component. It is worth noting that cloud providers may offer different types of network interfaces (e.g., the AWS elastic fabric adapter), which may result in different network performance in terms of latency and bandwidth. Usually, tightly-coupled HPC applications benefit from a fast network as the number of nodes increases.


4.3 Overview of a Cloud-Based HPC Cluster

High-performance clusters hosted on cloud providers are built with a set of virtual components and services, as illustrated in Fig. 4.3. In this example, the login and the computing nodes are implemented with AWS virtual machines. The login node is of type m5.xlarge while the computing ones are more powerful, of type c6i.32xlarge. Each VM has a block-based storage device attached to it, which stores the virtual machine image7 and local data. The main storage component is implemented by a file-based cloud storage service, the AWS Elastic File System, or EFS. This file system is mounted on all nodes, including the login and the computing ones, providing both the user and the applications with seamless access to the data files: the user accesses these files from the login node (e.g., through a remote connection) while the applications access them from the computing nodes. Finally, all components are connected by a VPC network. Once the infrastructure is deployed and configured, a cluster management tool (e.g., SLURM) may be installed on all nodes, allowing users to run their HPC jobs as if they were on a traditional HPC cluster. Figure 4.4 shows a typical workflow for deploying (steps 2–4), configuring (steps 5–6), and using (steps 7–10) an HPC system on the cloud. The first step consists of creating a user account. Once the user has an account, she can deploy the infrastructure (steps 2–4). She may start by creating a VPC network (step 2) and a shared file system (step 3). Then, she may instantiate VMs for the login and the computing nodes (step 4). Next, she may configure the system by mounting the shared file system (step 5) and installing common libraries and the job submission tool (step 6) on the VMs. Once the HPC system is deployed and configured, the user may log into the login node to install and/or execute the HPC application (step 8) and to retrieve the results (step 9). Next, the user may proceed to execute new experiments or applications (back to step 8) or, alternatively, shut the VMs down (step 10). Finally, when the

Fig. 4.3 Example of a cloud-based high-performance computer system

7 On AWS, a virtual machine image is also called an Amazon Machine Image (AMI).


Fig. 4.4 Typical workflow for deploying, configuring, and using an HPC cluster on the cloud

Finally, when the cluster is not necessary anymore, the user may destroy the infrastructure (step 11) to prevent unnecessary expenses. It is worth noticing that VMs are charged by the amount of time they are turned on, regardless of whether they are executing workloads or not, and also by the storage resources attached to them, such as the block-based device that stores the virtual machine image. Hence, it is a common practice to stop VMs when they are not being used and to turn them back on whenever they are needed. The contents of the VM image are preserved when a VM is stopped; hence, the user is still charged for its storage. However, the VM image storage cost is usually much lower than the cost of executing the VM on a physical machine. Finally, once a VM is no longer necessary, it may be terminated, which destroys the block devices attached to it, so the user is no longer charged for these services.

4.3.1 Cost and Performance of Cloud-Based HPC Clusters

The cloud computing model allows users to custom-design the computing infrastructure for each application's needs. On the one hand, this flexibility allows users to accelerate their applications by adding more (or faster) computing resources as needed. On the other hand, this flexibility can also be used to reduce the total computing cost.


Most of the resources in the cloud computing model are charged based on the amount of time they are used. For example, virtual machines are typically charged by time (e.g., USD/h), and storage is charged by the amount of storage being used and time (e.g., USD/GB/month). Hence, the total computing cost is defined by the resource price multiplied by the time it has been allocated for the user. In this context, using a less powerful computing resource, with a lower price tag, does not mean the total computing cost will be reduced. As an example, consider GROMACS, a molecular dynamics software package, and two fictitious virtual machine types: VM1 and VM2. Also, let's assume the prices of VM1 and VM2 are 0.4 USD/h and 0.6 USD/h, and the time it takes GROMACS to execute a given simulation on these VMs is 3 hours and 1 hour, respectively. In this scenario, using VM1 to execute the GROMACS simulation costs 1.2 USD (0.4 USD/h × 3 h), while using VM2 costs only 0.6 USD (0.6 USD/h × 1 h). Notice that, even though the price tag of VM2 is 1.5× higher than VM1's, executing the simulation on VM2 costs half as much and is three times faster than executing it on VM1. The opposite scenario is also possible, i.e., a computing resource with a higher price tag may cost more and take longer to execute the application. Even though the previous example is fictitious, this situation happens frequently with real applications on the cloud. For instance, Fig. 4.5 shows the cost (x-axis) and the time (y-axis) it takes to execute a simulation with the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) on different AWS virtual machine types.

Fig. 4.5 Execution time and cost of executing LAMMPS application at different AWS’ virtual machine types


It is possible to note that using the x1e.2xlarge type is 7.3× costlier and 3× slower than the c5.4xlarge virtual machine type for this simulation. The results in Fig. 4.5 indicate that c5.4xlarge is the fastest while t2.medium is the cheapest VM type to execute this particular LAMMPS simulation. The t2.large is slightly faster than the t2.medium, but costs almost twice as much as the t2.medium VM type. These three VM types define the Pareto frontier for this experiment, which is the set of VM types where one objective cannot be improved without detriment to another objective (cost and performance, in this case). The remaining VM types take longer and cost more than at least one of these three VM types, and there is no point in selecting them for this particular workload. Even though this is the case for this particular LAMMPS simulation, as shown by Brunetta and Borin [1], the best virtual machine types for a given workload depend on the application itself and, in some cases, may also depend on the input dataset. The previous analyses focus on the selection of VM types, which are usually the resources responsible for most of the computing cost on a typical cloud-based HPC cluster. Nonetheless, it is worth mentioning that the same cost-performance analysis must be done for other resources, such as storage. Selecting the resources that optimize the cost/benefit of using the cloud environment may not be a trivial task, due to the number of different resource configurations offered by cloud providers. There are several strategies to deal with this problem and determine the best configuration for a given workload [2–7]. This optimization problem is covered in the chapters of Part III. The next section discusses means to deploy and configure infrastructure on the cloud.

4.4 Deploying Infrastructure on the IaaS Model

Infrastructure deployment is an important step that consists of creating and starting all the resources needed to compose a cloud-based system. The next sections discuss the main means of deploying infrastructure in the IaaS model.

4.4.1 GUI and Command-Line Interface Tools

Cloud providers usually offer graphical user interfaces, or GUIs, to help users deploy infrastructure in the IaaS model. This interface is often implemented as a web portal and accessed through Internet browsers. Figure 4.6 illustrates a graphical user interface being used to deploy virtual machines. In this particular case, the user is selecting the virtual machine image to be used on the virtual machines on AWS.


Fig. 4.6 Example of a web interface used to deploy virtual machines

Section A.1, in Appendix A, illustrates, step by step, how to deploy a small cluster of computers on AWS using the AWS GUI. Graphical user interfaces are usually more intuitive and easier to learn than the other interfaces; nonetheless, they can become very tedious and error-prone when performing repeated or large tasks. Besides the graphical user interfaces, cloud providers also offer command-line interface (CLI) tools to enable users to automate the infrastructure management process, i.e., to create, configure, and/or destroy infrastructure elements in a programmatic way. These tools are designed to perform the same operations as the GUI, in a standardized way, so users can invoke them to deploy their infrastructure. For instance, the command in Fig. 4.7 uses the AWS CLI tool to instantiate 3 virtual machines of type t2.medium, with the image identified as ami-04505e74c0741db8d, and with a key pair named MyKeyPair. These tools are not limited to virtual machine deployment; they cover the management of the whole cloud environment (e.g., keys, VPCs, etc.). Section A.2, in Appendix A, illustrates how to deploy a small cluster of computers on AWS using the AWS CLI tool.


Fig. 4.7 Example of a command issued on a command-line interface to deploy virtual machines on AWS
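The command shown in Fig. 4.7 most likely takes a form similar to the following sketch (it assumes that a default region, subnet, and security group are already configured):

# Launch 3 t2.medium instances from the Ubuntu 20.04 image mentioned above
aws ec2 run-instances \
    --image-id ami-04505e74c0741db8d \
    --instance-type t2.medium \
    --count 3 \
    --key-name MyKeyPair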

Fig. 4.8 Example of a command issued on a command-line interface to deploy virtual machines on Google Compute Engine

Other cloud providers, such as Microsoft Azure and Google Compute Engine, also offer CLI tools to manage computing resources. Figure 4.8 shows the instantiation of a virtual machine named instance-1, of type n1-standard-1, in the us-central1-b zone, using the Google Cloud CLI tool. Often, cloud providers offer CLI tools as an alternative to the GUI, enabling the automation of the infrastructure management process. These tools, however, are tightly coupled to a specific cloud provider, similarly to the GUI, requiring users to create specialized scripts for each provider they want to use. Given the similarities in several processes of different providers, Infrastructure as Code tools, described in the next section, usually try to unify this diversity in provider-agnostic tools.
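For reference, the command depicted in Fig. 4.8 likely resembles the following sketch (it assumes the gcloud CLI is installed and a project has already been selected, e.g., with gcloud init):

# Create one n1-standard-1 instance in the us-central1-b zone
gcloud compute instances create instance-1 \
    --machine-type=n1-standard-1 \
    --zone=us-central1-b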

4.4.2 Infrastructure as Code

Infrastructure as Code, or IaC, is the process of provisioning and managing infrastructure through source code. IaC tools are used to implement an automated deployment pipeline, from infrastructure deployment to software installation and monitoring, and have been used in many different scenarios and segments. IaC tools make use of IaC scripts, also referred to as configuration scripts, which describe the actions to be taken when they are executed. Different tools provide different approaches, libraries, and languages to write these scripts. Besides reducing the manual-intensive labor of system administrators, which may be slow and error-prone, the consistency of IaC script languages makes it easy to place the scripts under version control, so they can be fixed and kept up to date. For example, Ansible is an IaC tool that enables the user to manage both the deployment of computing infrastructure and its configuration using YAML, a declarative language. Figure 4.9 shows a sample of Ansible code used to deploy virtual machines on AWS. The actions to be performed are described using YAML data structures, making them easy to understand. Section A.3, in Appendix A, gives more details and illustrates how to deploy a small cluster of computers on AWS using Ansible.

Fig. 4.9 Example of an Ansible code fragment to deploy virtual machines
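The fragment in Fig. 4.9 is reproduced in the book as an image; a minimal, hypothetical playbook with a similar intent could look as follows (it assumes the amazon.aws collection is installed and AWS credentials are configured; the instance name, region, and key pair are placeholders):

- name: Deploy virtual machines on AWS
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Launch an EC2 instance for the cluster
      amazon.aws.ec2_instance:
        name: hpc-compute-node            # tag applied to the instance
        key_name: MyKeyPair               # existing key pair
        instance_type: t2.medium
        image_id: ami-04505e74c0741db8d   # Ubuntu 20.04 image used earlier in the chapter
        region: us-east-1
        wait: true                        # wait until the instance is running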

The IaC tools also aim at canonizing the computing ecosystem in their IaC scripts, unifying the universe of resources and taxonomies across different cloud providers and operating systems. The benefits are manifold: improved consistency, since many IaC tools manage provisioning configurations in a way that allows easy versioning, reproducibility, and low management overhead; fast provisioning with different configurations, allowing users to test different scenarios in a straightforward way; automatic exception handling; and standardized error messages. In fact, these tools are widely used in industry to automate and simplify the deployment pipeline. For instance, the National Aeronautics and Space Administration (NASA) used an IaC tool to increase efficiency and migrate applications across different cloud providers [8]. Handing the management and scheduling of the cloud environment to Ansible reduced a multi-day patching process to a 45-minute one, and the update of the nasa.gov website went from over 1 h to under 5 min.
There are several IaC tools available: Ansible [9], Chef [10], CLAP [7], elasticluster [11], OpenStack [12], Puppet [13], Salt Stack [14], and Terraform [15] are examples of provider-agnostic IaC tools. However, these tools differ in their approach to automation and configuration management; their similarities and differences allow them to thrive in different scenarios. When selecting a deployment tool, the user should usually take the following properties into consideration:
• Ease of installation and start-up: Automation tools should have a lightweight setup process, as a heavy one may delay the system or application deployment. When choosing a tool, it is very important to check how easy it is to install and start using it. Some automation tools must be installed directly on the nodes that are being managed, while others only need to be installed on the "control machine", i.e., the machine from which the user will execute the IaC tool.
• Scalability: This characteristic is related to how well the tool performs when dealing with numerous resources, e.g., many virtual machines. This is very important when setting up HPC clusters, as these systems may contain many compute nodes.


• Error treatment: In general, all machines in a cluster must be configured properly in order to execute an HPC application. The tool must be capable of detecting exceptions raised during the configuration process and handling them gracefully, e.g., reporting the errors to the user and destroying the partially deployed resources when the system cannot be fully deployed.
• Interoperability: Sometimes, a tool must be capable of working correctly with machines that use different operating systems and hardware. Besides that, some tools are restricted to certain cloud providers, while others are designed to be cloud-provider agnostic.
• Learning curve: This characteristic denotes how easy it is to learn how to use the tool. Some tools may require the user to learn new domain-specific languages, while others are based on well-known configuration or script files.
Although general-purpose IaC tools are useful for managing cloud infrastructure, applications, and services, several IaC tools have been designed specifically for building HPC clusters in the computing cloud, simplifying the deployment process even further for this niche. This is discussed in the next subsection.

4.4.3 IaC Tools for Cloud HPC-Cluster-Like Environments

HPC clusters are usually composed of collections of tightly coupled components (e.g., compute, storage, and networking resources) which allow users to run HPC workloads. As discussed in Sect. 4.3, an HPC cluster is usually composed of one login node and several computing nodes, each one connected through a VPC network and to a shared file system (the main storage component). The computing nodes usually share the same operating system interfaces, allowing users' applications to run transparently on any of them, but may offer different capabilities (e.g., GPUs, CPU models, etc.). Based on these premises, several IaC tools were constructed or adapted to simplify the deployment and scaling of HPC clusters and the execution of HPC workloads. AWS ParallelCluster [16], Azure Batch MPI [17], AlcesFlight, CLAP [7], and elasticluster [11] are some examples of IaC tools focused on creating and managing cloud HPC-cluster-like environments. Usually, these tools abstract the complexity of setting up a scheduler (e.g., SLURM) and simplify the infrastructure deployment process when composing the HPC cluster. However, it is worth noting that they usually do not instantiate the storage and network resources; that is, these components must be instantiated at least once by the user, and they will be (re-)used when instantiating new cluster nodes.
AWS ParallelCluster, for instance, is a free, open-source cluster management tool written in Python and designed to deploy cloud HPC-cluster-like environments on AWS. Figure 4.10 shows an AWS ParallelCluster configuration file used to deploy a SLURM-managed cluster. In this case, all nodes use a VM image called alinux2.


Fig. 4.10 Example of an AWS Parallel Cluster configuration file to create a SLURM-based cluster
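Since Fig. 4.10 is reproduced as an image, the configuration it shows should roughly correspond to the following AWS ParallelCluster (version 3) YAML sketch; the region and the SSH key name are assumptions, and details may differ from the figure:

Region: us-east-1
Image:
  Os: alinux2                          # VM image used by all nodes
HeadNode:
  InstanceType: t2.medium
  Networking:
    SubnetId: subnet-abcdef01234567890
  Ssh:
    KeyName: my-key-pair               # placeholder key pair
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      ComputeResources:
        - Name: c5-2xlarge
          InstanceType: c5.2xlarge
          MinCount: 0
          MaxCount: 10                 # up to 10 computing nodes
      Networking:
        SubnetIds:
          - subnet-abcdef01234567890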

The login node is a virtual machine of type t2.medium and is attached to a network device that is connected to the subnet identified as subnet-abcdef01234567890. In this example, the cluster can have up to 10 computing nodes, which are virtual machines of type c5.2xlarge, also connected to the subnet identified by subnet-abcdef01234567890. AWS ParallelCluster can be configured to automatically turn computing nodes on and off according to the workloads on the SLURM queue. For example, it can automatically start computing nodes whenever a new job is submitted to the SLURM system and stop them whenever the job is completed. Moreover, the whole cluster management process can be done using a user-friendly command-line tool. Chapter 12 discusses how AWS ParallelCluster was used to deploy the infrastructure and execute a Bioinformatics case study.
Azure CycleCloud is a web application that can be used to deploy and manage HPC clusters on Azure. The tool provides several cluster templates for schedulers (e.g., PBSPro, LSF, Grid Engine, Slurm, HTCondor), allowing users to deploy batch-queueing clusters easily. Once a cluster is created, it automatically autoscales to handle the computational jobs that are submitted to the scheduler. Besides that, the tool allows users to create different types of file systems and mount them on the computing nodes, also using templates.
The Google Cloud HPC Toolkit is a set of open-source tools to deploy HPC environments, composed of four main elements: the HPC blueprint, a YAML file that describes the elements that must be used in the cluster (compute, networking, storage, etc.) and the scheduler, by referencing HPC modules; the HPC modules, a set of Terraform or Packer templates designed to deploy several infrastructure elements; an HPC deployment folder, a self-contained folder that can be used to deploy a cluster; and Google's ghpc engine, which


combines these elements and provides the command-line tool used as the default interface. Google offers several HPC modules covering a variety of components, providing a configurable and extensible solution.
elasticluster is a user-friendly, open-source command-line tool to create and manage clusters hosted in private and public cloud providers. The clusters are instantiated using configuration files, similarly to AWS ParallelCluster. However, it relies on Ansible to configure the cluster and comes with several HPC cluster configuration presets to set up batch-queueing schedulers easily (e.g., SLURM, GridEngine). Differently from AWS ParallelCluster, new HPC cluster presets can easily be added by simply implementing them in Ansible. Besides that, the tool is provider-agnostic.
The CLAP tool is also a user-friendly, open-source, and provider-agnostic command-line tool to manage clusters. It behaves similarly to elasticluster but relies on Ansible both to deploy the infrastructure and to configure a cluster, with or without a scheduler. Besides that, CLAP allows users to create mixed clusters, i.e., clusters with machines belonging to different cloud providers, and manage them transparently. Finally, an easy-to-use Python API is also provided.
Besides the above-mentioned tools, many other generic IaC frameworks can be configured to deploy HPC-cluster-like environments, such as Ansible, Chef, Saltstack, and Terraform. Also, besides IaC frameworks, Platform-as-a-Service (PaaS) and Software-as-a-Service (SaaS) tools can be used to deploy HPC-cluster-like environments. This includes tools like Azure Batch MPI, AWS Batch, Amazon Elastic Container Service (Amazon ECS), and Microsoft Batch, among others. It is worth noticing that the choice of the right tool may depend on several factors, as described in the last subsection. For non-open-source tools, pricing must also be taken into account.

4.5 Considerations About Selecting Resources and Tools to Deploy HPC Systems on the Cloud

In summary, cloud providers offer a wide range of resources that can be used to compose an HPC infrastructure based on users' needs. Usually, the main elements include: the virtual machine type and image that will be used; the storage components; the network interfaces; and the location where these elements will be instantiated (region and availability zone). Several aspects must be considered when selecting the elements to compose the infrastructure, such as which network adapters to use, which instance types are most suitable to the workload, how much storage is needed, and what kind of storage to use, among others. Cloud computing resources can be instantiated and managed in several ways. Web-based GUIs, such as the AWS Web Console, allow users to quickly learn how to deploy infrastructure using a web browser. However, despite the fact that this interface is easy to learn and straightforward, this approach may become tedious and error-prone as the number of infrastructure elements increases, or when handling


errors and faults. In this way, the Infrastructure as Code approach is a better fit for users who need to deal with large or multiple infrastructures, as automation tools provide easier ways to organize, track changes, handle errors, and reduce the infrastructure management overhead. The selection of proper automation tools also involves several aspects, such as usability, scalability, and exception and error handling features, among others. Thus, automation tools become handy and less laborious when scaling the infrastructure.
Even though the discussion revolved around an HPC cluster, it is worth noticing that a typical HPC cluster, with a head node and a job scheduler, may not be necessary. For simpler use cases, users can opt to start one or a few powerful virtual machines and run their commands directly on them, without deploying a login node or installing a job submission system. Moreover, for recurring workloads, the user can also create a custom virtual machine image to simplify the configuration process. Finally, it is worth noting that the aforementioned resources need to be thought out in advance, as they directly impact the cost, the maintenance, and the performance of the infrastructure. As these impacts can be severe, the following chapters discuss several of these aspects in detail and provide insights on how the user can optimize the infrastructure for different applications and different objectives (e.g., minimizing the computing cost or the processing time).

Acknowledgments The authors would like to thank the following funding agencies for supporting their research into High-Performance Cloud Computing: FAPESP (process 2013/08293-7) and CNPq (processes 314645/2020-9 and 404087/2021-3).

References

1. J.R. Brunetta, E. Borin, Selecting efficient cloud resources for HPC workloads, in Proceedings of the 12th IEEE/ACM International Conference on Utility and Cloud Computing (2019), pp. 155–164
2. N.T. Okita, T.A. Coimbra, M. Tygel, E. Borin, A heuristic to optimize the execution cost of distributed seismic processing programs on the cloud, in SEG International Exposition and Annual Meeting (OnePetro, 2019)
3. D. Samuel, S. Khan, C.J. Balos, Z. Abuelhaj, A.D. Dutoi, C. Kari, D. Mueller, V.K. Pallipuram, A2Cloud-RF: A random forest based statistical framework to guide resource selection for high-performance scientific computing on the cloud. Concurrency and Computation: Practice and Experience 32(24), e5942 (2020)
4. T.A.S. Camacho, V.M. do Rosario, O.O. Napoli, E. Borin, PB3Opt: Profile-based biased Bayesian optimization to select computing clusters on the cloud. Concurrency and Computation: Practice and Experience, e7540 (2022)
5. N.T. Okita, T.A. Coimbra, Faster and cheaper: How graphics processing units on spot-market instances minimize turnaround time and budget. Interpretation 9(1), SA1 (2021)
6. W.F. Tavares, M.R. Assis, E. Borin, Leveraging vCPU-utilization rates to select cost-efficient VMs for parallel workloads, in Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing (2021), pp. 1–10


7. O.O. Napoli, G.C. Pinton, E. Borin, CLAP-Bot: a framework for automatic optimization of high-performance elastic applications on the Clouds, in 2021 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW) (IEEE, 2021), pp. 28–34
8. Ansible, Increasing cloud efficiency with Ansible and Ansible Tower (2019). URL https://www.ansible.com/hubfs/pdf/Ansible-Case-Study-NASA.pdf?hsLang=en-us
9. R.H. Inc., Ansible is simple IT automation (2012). URL https://www.ansible.com/
10. Progress, Chef: Automation software for continuous delivery of secure applications and infrastructure (2009). URL https://www.chef.io/
11. R. Murri, elasticluster: Create clusters of VMs on the cloud and configure them with Ansible (2013). URL https://github.com/elasticluster/elasticluster/
12. O. Sefraoui, M. Aissaoui, M. Eleuldj, et al., OpenStack: toward an open-source solution for cloud computing. International Journal of Computer Applications 55(3), 38 (2012)
13. Perforce, Puppet: Powerful infrastructure automation and delivery (2005). URL https://puppet.com/
14. VMware, SaltStack is a revolutionary approach to infrastructure management that replaces complexity with speed (2011). URL https://saltproject.io/
15. Y. Brikman, Terraform: Up & Running: Writing Infrastructure as Code (O'Reilly Media, 2019)
16. AWS, AWS ParallelCluster: Quickly build HPC compute environments on AWS (2018). URL https://aws.amazon.com/hpc/parallelcluster/
17. Azure, Introducing MPI support for Linux on Azure Batch (2016). URL https://azure.microsoft.com/es-es/blog/introducing-mpi-support-for-linux-on-azure-batch/

Chapter 5

Executing Traditional HPC Application Code in Cloud with Containerized Job Schedulers

Christophe Cérin, Nicolas Grenèche, and Tarek Menouer

5.1 Introduction

5.1.1 Foreword

High-Performance Computing (HPC) [1, 2] refers to aggregating computing power in a way that delivers much higher horsepower than traditional computers and servers organized "on-premises", according to the vocabulary of cloud computing. "On-premises" literally means "on-site": it refers to using the company's own server and computing environment. In this usage model, the customer, or licensee, purchases or rents server-based software to be installed on their own server or a rented one. Cloud computing [3] and HPC are a way of processing huge volumes of data at very high speeds using multiple computers and storage devices as a cohesive fabric. Cloud and HPC allow us to explore and find answers to some of the world's biggest problems in science, engineering, and business. Today, HPC is used to solve complex, performance-intensive problems (number crunching), and organizations are increasingly moving HPC workloads to the cloud. HPC in the cloud is changing the economics of product development and research because it requires fewer prototypes, accelerates testing, and decreases time to market. These goals are for practical purposes. According to the NIST (National Institute of Standards and Technology, US Department of Commerce)

definition, the cloud has been precisely devised as "a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services). These resources can be rapidly provisioned and released with minimal management effort or service provider interaction." In other words, cloud computing is a utility computing model that everyone can practice, thanks to automation and virtualization, as we will see later in the chapter. Here we discuss research directions in Cloud computing so as not to confuse people who have been working for many years in the HPC world and want to move their applications and workflows to the Cloud computing world. The critical starting point consists of the containerization of HPC job schedulers, which opens challenging questions that we discuss later in the chapter.

5.1.2 Chapter Organization

Consequently, since we have one main research direction and for the sake of brevity, the chapter is divided into two sections. Moreover, at the beginning and the end, we give the motivation for the work and provide a general conclusion. We follow a unified organization. First, we present some definitions and vocabulary regarding the topic. Second, we offer a related-work section. Third, we dig into the problems and explain the solutions, their potential benefits, and their limits. Finally, we conclude the section.

5.2 Change Nothing at the Application Level but a Little at the Cloud Orchestrator Level

5.2.1 Introduction

This part assumes that we use available HPC applications of any sort, communicating or not, and that we do not modify any piece of their code. We specifically discuss the containerization of job schedulers in terms of the organization and interactions of the different containers that constitute the new systems in the cloud. We also show some experiments using the Kubernetes orchestrator. The idea is not to work at the application level but at the cloud orchestrator level to leverage the execution of applications. In doing so, we do HPC in the cloud.

5.2.2 Elements of Vocabulary and Essential Definitions

In this self-contained section, we describe all the notions that underlie the work. We aim to scale containerized HPC clusters on a Cloud infrastructure and run HPC jobs, and


we will chronologically introduce these concepts. First, we must define HPC jobs and how they spread on HPC clusters. Then, we will define containers and explain how they are allocated on a Cloud infrastructure.

5.2.2.1 Basic Vocabulary Regarding the Notion of HPC Jobs and HPC Job Schedulers

An HPC job is a set of processes. These processes may run on a single node and share its memory; this is the shared-memory scheme. A master process forks threads on the node with a low-level library like pthreads or a higher-level one like OpenMP. Processes may also be distributed over several nodes. In this case, a launcher is used to instantiate a communication infrastructure and run the processes on it. The communication infrastructure allocates a different rank to each process, and processes can talk to each other through their ranks. In this case, the memory is distributed over the HPC nodes; this is the distributed-memory scheme. MPI (Message Passing Interface) is the standard for implementing this scheme. A hybrid mode may mix both schemes.
HPC jobs run on an HPC cluster infrastructure, which consists of a set of physical computing nodes. The challenge is to place HPC jobs on this infrastructure, and HPC job schedulers are designed to achieve this aim. They are divided into two parts: a controller on an administration node and a worker on each computing node. The user submits an HPC job to the controller. The controller knows the HPC cluster infrastructure's topology and its available and reserved resources. The controller takes the resource constraints defined by the user and places the HPC job according to them. If resources are unavailable, the HPC job waits in a queue for resource availability; if resources are available, the HPC job is sent to the worker on the computing nodes. The worker is a daemon that runs with administrative privileges on the computing node. This daemon forks the tasks to be executed on the node. A task is a system process that may be multithreaded at some points of its execution in the shared-memory scheme, or an MPI process in the distributed-memory scheme. After this fork, a setuid is performed to assign the user's identity as the owner of the task-related processes. In the distributed scheme, the worker daemon may also instantiate the communication infrastructure before launching the tasks.
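As a concrete illustration (a sketch only, using SLURM directives as one example; OAR and OpenPBS offer equivalent mechanisms, and the application binary is a placeholder), a distributed-memory job could be described as follows:

#!/bin/bash
#SBATCH --job-name=mpi-demo      # resource constraints handed to the controller
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --time=01:00:00
srun ./my_mpi_app                # the launcher gives each process its rank

The controller queues such a job until the requested nodes are free, after which the workers fork the tasks under the submitting user's identity.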

5.2.2.2 Overview of Containers and Cloud Orchestrator

In modern operating systems [20], the data structure that implements system processes in the kernel contains two elements: a resource-limitation subsystem and dedicated kernel object instantiations. Combining these two elements leads to system process isolation, also called containerization. We discuss Linux containers because Linux is the experimentation platform. In Linux-based operating systems, the resource-limitation mechanism is the cgroup, and the mechanism for multi-instantiation of kernel objects is the namespace.


Containerization does not target a whole OS but a set of one or more system processes; these system processes are called containers. When the host has to build a container, the kernel starts by grouping system processes independently from the parent/child model: a container can be composed of horizontally picked-up system processes. The process groups are called cgroups on Linux and JobObjects on Windows systems. The groups are implemented through dedicated kernel data structures attached to system processes. The kernel also has several accounting drivers to monitor process groups and their resource consumption, so as to limit and control their usage. Moreover, the kernel supplies a way to create multiple instantiations of objects such as the user namespace, the network stack, or the users' PID index, creating an isolated namespace for process groups. These technologies are combined to supply a sandboxed environment for system processes, also called containers.
An orchestrator deals with containers' placement on the nodes of a Cloud infrastructure. Nodes can be either physical servers or virtual machines, depending on the provisioning workflow that comes with the container engine. When a container creation is requested, the orchestrator elects a node based on available resources and asks the container engine to create it.
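To make these mechanisms concrete, the following illustrative commands (the image name and limits are arbitrary placeholders) show the two ingredients separately:

# cgroup-based resource limitation applied through a container engine
docker run --rm -it --cpus=2 --memory=4g ubuntu:22.04 bash
# namespace-based isolation of the PID index for a plain process group
sudo unshare --fork --pid --mount-proc bash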

5.2.2.3 Overview of Kubernetes, Slurm, OAR and OpenPBS

The SLURM job scheduler [5] is built upon two services: Slurmctld and Slurmd. Slurmctld is the scheduler, and Slurmd is the worker. All HPC nodes are described in a plain text file owned by the scheduler. We configure SLURM in configless mode: the workers connect to the scheduler to retrieve the configuration. This configuration requires the Munge service to authenticate the communications between workers and the scheduler.
The OpenPBS job scheduler [6] hosts several services. The process pbs_sched is the scheduler itself, pbs_comm handles High Availability, and pbs_server.bin communicates with the worker nodes to execute users' jobs. This process also interacts with a Postgres database to store resource descriptions (such as workers' specifications) and job information. On the worker node, we have pbs_mom, which receives jobs from the PBS server and executes them on the node.
The OAR job scheduler [7] is composed of several processes. A central one executes an automaton that reacts to all events from jobs' and nodes' states and initiates the appropriate action by launching corresponding processes for scheduling rounds, job launching, and node checking. All states related to jobs, nodes, and scheduling decisions are stored in a Postgres database.
Kubernetes [4], a container orchestrator above the runtime layer, does not create containers itself. This task is delegated to container runtimes, and Kubernetes must therefore be able to connect to several of them, each with a different API. To tackle this problem, Kubernetes created an interface, the Container Runtime Interface (CRI), which defines how Kubernetes talks to container runtimes; it is then up to each runtime to implement, or not, the calls described in the CRI.
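For SLURM, for instance, the node description and the configless mode boil down to a few lines in the controller's slurm.conf; the fragment below is a minimal, hypothetical sketch (host and node names are placeholders, and a real deployment needs more parameters):

# slurm.conf fragment held by the controller only
SlurmctldHost=hpc-scheduler
SlurmctldParameters=enable_configless    # workers fetch this file from slurmctld
NodeName=hpc-node-[0-1] CPUs=2 State=UNKNOWN
PartitionName=COMPUTE Nodes=hpc-node-[0-1] Default=YES State=UP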


5.2.3 Related Works

In the first part of the chapter, we propose an advanced integration of popular batch schedulers into the Cloud and discuss the suitability of the methodology for the resonance between HPC and Cloud systems. The originality and strength lie in deploying and removing, on the fly, multiple batch schedulers. More importantly, we developed a layer between the batch scheduler and the Cloud orchestrator that dynamically adds or removes computational nodes, thanks to dedicated mechanisms at the Cloud orchestrator level. Thus, the notion of a container for cloud orchestrators is central. We comment here on two papers on containers, then on four papers dealing more with runtimes and orchestrators.
In [15], the authors noticed that conventional HPC workload managers lack microservice support and deeply integrated container management, as opposed to container orchestrators. They introduce a Torque-Operator, which bridges the HPC workload manager (TORQUE) and the container orchestrator (Kubernetes). They also propose a hybrid architecture that seamlessly integrates HPC and Cloud clusters with little interference to HPC systems, where container orchestration is performed on two levels.
In [13], the authors define a model for parallel MPI application DevOps and deployment using containers to enhance development efforts and provide container portability from laptops to clouds or supercomputers. In this endeavor, they extend the use of Singularity containers to a Cray XC-series supercomputer. They use the HPCG and IMB benchmarks to investigate potential points of overhead and scalability with containers on a Cray XC30 testbed system. Furthermore, they also deploy the same containers with Docker on Amazon's Elastic Compute Cloud (EC2) and compare them against the Cray supercomputer testbed. The authors' results indicate that Singularity containers operate at native performance when dynamically linking Cray's MPI libraries on a Cray supercomputer testbed, that Amazon EC2 may be helpful for initial DevOps and testing, and that scaling HPC applications better fits supercomputing resources like a Cray.
In [10], the authors address the problem of running HPC workloads efficiently on Kubernetes clusters. They compare Kubernetes' default scheduler with KubeFlux, a Kubernetes plugin scheduler built on the Flux graph-based scheduler, on a 34-node Red Hat OpenShift cluster on IBM Cloud. They also detail how scheduling can affect the performance of GROMACS [19], a well-known HPC application, and they demonstrate that KubeFlux can improve its performance through better pod scheduling. In contrast with our work, these authors work at the level of one application (GROMACS), whereas we are working on containerizing job schedulers.
In [11], the authors studied the potential use of Kubernetes on HPC infrastructure for use by the scientific community. They directly compared both its features and performance against Docker Swarm and bare-metal execution of HPC applications. They detailed some configurations required for Kubernetes to operate with


containerized MPI applications, explicitly accounting for operations such as (1) underlying device access, (2) inter-container communication across different hosts, and (3) configuration limitations. They showed that Kubernetes presents overheads for several HPC applications over the TCP/IP protocol.
In [12], the authors argued that HPC container runtimes (Charliecloud, Shifter, Singularity) have minimal or no performance impact. To prove this claim, they ran industry-standard benchmarks (SysBench, STREAM, HPCG). They found no meaningful performance differences between the environments used, except a modest variation in memory usage. They invite the HPC community to containerize their applications without concern about performance degradation.
In [14], the authors related their experience utilizing the Kubernetes orchestrator to efficiently allocate resources in a heterogeneous and dynamic academic environment. Among the three significant sources of inefficiency they disclosed is the unavailability of the fair-sharing functionality (dynamic user priorities), which hampers efforts to develop a fair scheme for Pod/job scheduling and/or eviction. In the first part of the chapter, we can help a little with these problems by showing that we can add and remove server nodes dynamically.

5.2.4 Challenges, Issues, and Solutions

Until the conclusion of this first section, the following paragraphs follow the ideas presented in [8, 9] and synthesize them.

5.2.4.1 Motivation

This section presents the results of a survey that we conducted at several HPC centers. We aim to determine the real resource usage rate of HPC clusters at several scales. We considered four categories of HPC clusters:
1. Laboratory scale;
2. University scale;
3. National scale;
4. Specialized HPC infrastructure.

We collected the CPU usage efficiency of HPC jobs over 6 months. The first three categories of HPC clusters differ in scale and are ordered from the smallest (Laboratory) to the biggest (National). The last kind is a specialized HPC cluster dedicated to a community, a usage, or, more specifically, a software package. The job efficiency is computed via this formula:

\[ \mathrm{CPUefficiency} = \mathrm{TotalCPU} \,/\, \mathrm{AllocCPUs} \,/\, \mathrm{ElapsedTime} \tag{5.1} \]

Table 5.1 CPU efficiency statistical analysis results

Category   Jobs      Average   Median   Overall
1          548       43.47     4.16     73.4
2          2366      75.17     99.73    63.36
3          967,652   38.43     47.67    46.64
4          2080      85.51     99.43    88.85

Fig. 5.1 Jobs CPU efficiency boxplot. (a) Category 1—Laboratory. (b) Category 2—University. (c) Category 3—National. (d) Category 4—Specialized

In (5.1), TotalCPU is the sum of the System CPU (amount of system CPU time used by the job) and the User CPU (amount of user CPU time used by the job), AllocCPUs is the number of CPUs allocated to the job, and ElapsedTime is the job's elapsed time. For each kind of cluster, we considered the number of jobs, the average efficiency, and the median. The results are summed up in Table 5.1. There are four columns:
• Jobs: number of jobs during the study;
• Average: average CPU efficiency of the jobs;
• Median: median CPU efficiency of the jobs (Fig. 5.1 is a boxplot representation for each category of HPC cluster);
• Overall: overall efficiency.


The Overall efficiency is the computing time spent by all jobs relative to the sum of the jobs' elapsed times:

\[ \mathrm{Overall} = \sum \mathrm{TotalCPU} \,\Big/\, \sum \mathrm{AllocCPUs} \,\Big/\, \sum \mathrm{Elapsed} \tag{5.2} \]

These metrics are pretty standard in the HPC world. We attempted to compute a custom metric to represent the harmful impact of a job on the whole HPC cluster for a given resource (the CPU in this study). This metric adds a temporal ponderation to the average efficiency: the philosophy is that the longer an inefficient job lasts, the more badly it impacts the whole cluster. The formula for each job is:

\[ \mathrm{BadImpact} = \mathrm{WastedJobCPUTime} \,/\, \mathrm{OverallElapsed} \tag{5.3} \]

In (5.3), WastedJobCPUTime is the complement of the CPU efficiency: it represents the CPU time not used by the job (for example, while performing I/O). OverallElapsed is the sum of the elapsed times of all jobs. In Fig. 5.2, we draw a scatter graph for each category, where each job is represented by a point. The Y-axis is the average efficiency of the job, and the X-axis is the bad-impact factor. This view lets us differentiate jobs with the same efficiency on a time basis.
This study demonstrates that, for the first three categories of HPC cluster, the CPU usage is not maximized. Moreover, the harmful-impact factor shows that some jobs do not utilize the specialized HPC hardware. Due to the multi-purpose usage of these HPC clusters, we can guess that some users see HPC clusters as a platform where one can simply run long jobs. This study opens the path to cohabitation between HPC jobs and more regular services on the same hardware, to make the best use of resources.

5.2.4.2 Propositions

As stated in the previous Sect. 5.2.4.1, HPC resources are not always used efficiently: the more versatile the HPC cluster is, the less optimal its resource usage. Our approach to resource management is similar to virtualization in that we divide the node between several isolated worker daemons. However, we do not use virtualization. Virtualization relies on hypervision, i.e., the ability to run different kernels in parallel over multi-instantiated emulated hardware. We do not need such isolation, and we want direct access to the hardware for performance's sake. As stated in Sect. 5.2.2.2, containers provide resource limitation (cgroups) and isolation (namespaces) at the system-process level, making them lighter than virtualization. Moreover, HPC worker daemons are well suited for containerization: as explained in Sect. 5.2.2.3, a worker is a single system process that forks and performs a setuid to execute the user's tasks. Consequently, a containerized HPC worker daemon is a process whose available resources (memory and CPU) are limited by cgroups, and whose user identifiers, filesystem, and network stack are isolated by namespaces. Finally, as stated


Fig. 5.2 Bad impact scatter plot. (a) Category 1—Laboratory. (b) Category 2—University. (c) Category 3—National. (d) Category 4—Specialized

in Sect. 5.2.2.2, containers can be instantiated and removed on the fly, making it possible to build scalable containerized HPC clusters on a Cloud infrastructure. Consequently, according to the discussion, the contribution is to (1) Evaluate the feasibility of containerizing HPC schedulers and (2) The ability of containerized HPC schedulers to grow or shrink dynamically with a Cloud Orchestrator.

5.2.4.3 Containerized HPC Schedulers

In the study, we used Kubernetes as the Cloud orchestrator. Kubernetes runs Pods; Pods are a gathering of containers that belong to the same Cloud node. The containers of a Pod share the same network stack and can share volatile or persistent file-system volumes. Moreover, each container of a Pod can have resource limitations enforced through cgroups. We evaluated three major HPC schedulers: SLURM, OpenPBS, and OAR. We run the containerized version of each HPC scheduler as a StatefulSet in Kubernetes. A StatefulSet is a set of identical Pods that manages stateful applications and guarantees the ordering and uniqueness of these


Pods. A StatefulSet contains Pods based on identical container specifications. Statefulset also maintains a sticky identity for each Pod. The listing 5.1 is a shortened example of such a Statefulset. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

apiVersion: apps/v1 kind: StatefulSet metadata: name: hpc-node namespace: hpc-nico labels: role: worker spec: replicas: 2 template: metadata: labels: partition: COMPUTE containers: - name: image: resources: limits: cpu: "2" requests: cpu: "2" Listing 5.1 Example of a user-defined Worker

On line 2, we can see that the worker Pods are defined as a StatefulSet. The keyword replicas (line 9) gives the number of instantiated Pods. On line 7, the label role informs on the type of Pod (Scheduler or Worker); here, we have a worker Pod. In HPC clusters, homogeneous nodes are frequently gathered in partitions or queues (the denomination may differ from one HPC scheduler to another). On line 13, we label this set of Pods with partition set to COMPUTE (a partition is a set of nodes). Lines 17–21 give the resource constraints required by the worker Pod to the Kubernetes orchestrator. Here, we request 2 CPUs, and this example instantiates two Pods with two CPUs each in the COMPUTE partition/queue. As stated in Sect. 5.2.4.7, the compute nodes' resource topology is defined in the configuration of the HPC scheduler controller. Consequently, the resource limitation specified in listing 5.1 must be mapped to the HPC scheduler controller configuration. We developed sidecar containers for each HPC scheduler controller Pod (SLURM, OpenPBS, and OAR) to query the Kubernetes deployment through its API. A sidecar container is a regular container that interacts with the Pod's main container(s). In the study, the sidecar container generates the HPC scheduler controller configuration according to the Kubernetes deployment. From this point, we can instantiate a static containerized HPC cluster. The next step is to dynamically expand or shrink this containerized HPC cluster.

5.2.4.4 Dynamic Containerization of HPC Clusters

We expanded each sidecar container of each HPC scheduler controller Pod (SLURM, OpenPBS, and OAR) to poll the containerized HPC cluster's StatefulSet. The containerized HPC scheduler controller is reconfigured when a modification is detected, i.e., the addition or removal of containerized HPC nodes. We explored several scenarios to evaluate how each HPC scheduler behaves when resources (nodes) are added or removed, and we qualify the impact on pending and running jobs. For each scenario, we submitted MPI and non-MPI jobs. The MPI job is a Pi computation with a Monte Carlo method; the non-MPI job is a multithreaded infinite computation. The nature of the jobs does not matter, meaning that jobs with MPI communication and without communication both run correctly. We want to keep nodes busy and generate MPI communications while adding or removing workers' containers on the fly. In Table 5.2, we consider four scenarios for each of the three evaluated job schedulers. There are two states of jobs regarding the queue of requests in an HPC scheduler: pending (the job is waiting for resources) and running (the job is running somewhere on the HPC cluster nodes). We consider the impact of growing and shrinking the workers' containers for each state. In Table 5.2, a None value means that we did not encounter any problem, suggesting that the execution was correct.
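For instance, with the StatefulSet of listing 5.1, growing or shrinking the containerized cluster amounts to changing the replica count, after which the sidecar reconciles the HPC scheduler configuration, e.g.:

# add two worker Pods to the COMPUTE partition defined in listing 5.1
kubectl -n hpc-nico scale statefulset hpc-node --replicas=4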

5.2.4.5 Impact on Pending Jobs

(1) Workers Addition In this scenario, we submit several jobs to consume all the resources of the containerized HPC cluster and have some jobs in a pending state. Then we dynamically extend the containerized HPC cluster to supply enough resources to the pending jobs. Sequential pending jobs run without errors with each containerized HPC scheduler. MPI jobs fail with SLURM on dynamically added nodes. The reason is that the MPI job is run with srun, and the srun command instantiates the MPI communication infrastructure. The first MPI job scheduled on newly added workers fails; the second one then works. When new nodes are added to a SLURM cluster, a restart of slurmctld and of each slurmd service is required. Dynamic node addition will be fully supported in the 23.02 version of SLURM.1

1 https://slurm.schedmd.com/SLUG21/Roadmap.pdf

Table 5.2 Synthesis of the experimentation

Scenario                                                  SLURM   OpenPBS   OAR
(1) Impact on pending jobs when resources are added       Fail    None      None
(2) Impact on pending jobs when resources are removed     None    None      None
(3) Impact on running jobs when resources are added       None    None      None
(4) Impact on running jobs when resources are removed     None    None      None


(2) Workers Removal In this scenario, we consider the state of a pending job when containerized HPC nodes are removed. This removal has no impact on pending MPI and sequential jobs.

5.2.4.6 Impact on Running Jobs

(3) Workers Addition We launch mixed MPI and sequential jobs. Then, we add workers to the containerized HPC cluster. This addition does not disturb the running jobs, so there is no impact on running jobs.

(4) Workers Removal We launch mixed MPI and sequential jobs, but we keep some containerized HPC nodes free. While jobs are running, we remove the free workers, and all jobs run and end smoothly for each containerized job scheduler. So, workers' removal has no impact on running jobs.

5.2.4.7 Towards a General Methodology to Containerize HPC Job Schedulers

We now highlight that we followed the same methodology to containerize each HPC scheduler. This method has two levels of detail: macro and micro. The macro-level draws the main outlines of HPC scheduler containerization, while the micro-level considers the specificities of each HPC scheduler that must be folded into the macro-level. In this section, the term "user" refers to the person who defines and instantiates the containerized HPC cluster, and the term "developer" refers to the person who develops the services coupling the Cloud orchestrator and the job scheduler.

Macro-Level The developer must implement two kinds of service: initialization and resource polling. These services run sequentially. The initialization service runs as an initContainer that runs before any other container of the containerized HPC scheduler. An initContainer is a container that runs before any regular container of the Pod. Standard containers start when the initContainer ends successfully, i.e., the containerized process exits with return code zero. The initContainer initializes the configuration of the containerized HPC scheduler according to the listing 5.1 supplied by the user. At this point, notice that this step is sufficient to build static containerized HPC clusters. Resource polling is essential for HPC scheduler dynamicity. Developers must implement sidecar containers for each HPC scheduler controller pod (SLURM, OpenPBS, and OAR) to request Kubernetes deployment through its API. This sidecar container is a regular container that interacts with the Pod’s main container(s). In the study, the sidecar container generates HPC scheduler controller configuration

according to the Kubernetes deployment. This step adds a dynamic elasticity property to the containerized HPC scheduler. Figure 5.3 sums up the interactions between the macro-level actors.

Fig. 5.3 Methodology to containerize HPC job schedulers: the macro-level

Micro-Level This unit details the internals related to each HPC scheduler's bootstrapping and topology modification handling. It enables the reader to weigh the specificities of each HPC scheduler that must be handled to fit in the macro level.

SLURM The SLURM job scheduler is built upon two services: Slurmctld and Slurmd. Slurmctld is the scheduler, and Slurmd is the worker. All HPC nodes are described in a plain text file owned by the scheduler. We configure SLURM in configless mode: the workers connect to the scheduler to retrieve the configuration. This configuration requires the Munge service to authenticate the communications between workers and the scheduler. As a result, the scheduler Pod relies on four containers: an initContainer, Slurmctld, Munge, and a sidecar container that generates or updates the configuration file. The worker Pod has three containers: an initContainer, Slurmd, and Munge.


The contributions are based on introducing initContainer for Slurmd and Slurmctld and the Slurmctld’s sidecar container. Slurmctld’s initContainer generates a minimal configuration that enables Slurmctld to start. Slurmd’s initContainer locates the Slurmctld service to retrieve configuration. We have an initContainer for both Slurmctld and Slurmd Pods. Slurmctld’s sidecar container is responsible for configuration updates when nodes are added or suppressed from containerized HPC cluster. SLURM does not support a comprehensive dynamic creation/suppression of his nodes in his current state. However, a relatively safe method is to restart Slurmctld. Then, in configless mode, all attached Slurmd daemons will reread their configuration. This method has limitations, and we will discuss them below. Consequently, when a modification is detected in the containerized HPC cluster’s topology, the sidecar container modifies the configuration file and sends a SIGTERM signal to the Slurmctld process. We use2 to supervise the Slurmctld process. Thus, when Slurmctld exits due to the SIGTERM reception, Daemontools’ manage process restarts it gracefully without crashing the container. OpenPBS OpenPBS job scheduler hosts several services. The process pbs_sched is the scheduler itself, pbs_comm handles the High Availability, and pbs_server.bin communicates with worker nodes to execute users’ jobs. This process also interacts with a Postgres database to store resource descriptions (such as workers’ specifications) and job information. We have pbs_mom on the worker node, which receives jobs from the PBS server to execute them on the node. The scheduler Pod has three containers: an initContainer that creates the configuration file for the PBS server, a container that hosts all the processes composing the PBS server, and the sidecar container that registers or unregister worker from the PBS server’s database. The containerization of OpenPBS follows the same scheme as SLURM. The initContainer is likely to be the SLURM’s. It creates the configuration file for PBS server Pod and worker Pod. The sidecar container triggers the commands to add or delete resources in the PBS server database at each containerized HPC cluster’s topology modification. OpenPBS and SLURM are very close regarding the methodology because they work on the same pattern of server/agent, and these two components are more or less coupled. We now consider a third HPC scheduler called OAR that relies on SSH for interactions between schedulers and workers. OAR OAR job scheduler is composed of several processes. A central one executes an automaton that reacts to all events from jobs’ and nodes’ states and initiates appropriate action by launching corresponding processes like scheduling round, job launching, and nodes’ checking. All states related to jobs, nodes, and scheduling decisions are stored in a Postgres database. OAR is well suited for containerization because workers and schedulers are loosely coupled, making it easier to deal with synchronization. An initContainer in the scheduler Pod initiates a

2 https://cr.yp.to/daemontools.html.


An initContainer in the scheduler Pod initiates a configuration for the Almighty service that drives the OAR cluster resources. An initContainer is also deployed alongside the worker Pods to get the scheduler Pod location. Then, a sidecar container is executed alongside the scheduler server container inside the scheduler Pod to add or remove workers according to the resources defined in the StatefulSets.
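The sidecar pattern described above for SLURM can be illustrated with a short sketch. The snippet below is a minimal, hypothetical sidecar loop written in Python: it lists the worker Pods through the Kubernetes API, regenerates the node section of slurm.conf when the topology changes, and sends SIGTERM to Slurmctld so that the Daemontools supervisor restarts it with the new configuration. The label selector, file paths, and pid-file location are assumptions, not the exact values used by the authors.

# Hypothetical sidecar loop for the Slurmctld Pod (a sketch, not the authors' exact code).
import os
import signal
import time

from kubernetes import client, config


def list_worker_names(namespace="slurm", selector="app=slurmd"):
    """Return the names of the worker Pods currently registered in Kubernetes."""
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=selector)
    return sorted(p.metadata.name for p in pods.items)


def write_node_section(workers, conf_path="/etc/slurm/slurm.conf"):
    """Rewrite the node-related lines of slurm.conf (assumed, simplified template)."""
    lines = [f"NodeName={w} CPUs=2 State=UNKNOWN" for w in workers]
    lines.append("PartitionName=main Nodes=ALL Default=YES State=UP")
    with open(conf_path, "a") as conf:   # simplified: real code would rebuild the whole file
        conf.write("\n".join(lines) + "\n")


def restart_slurmctld(pid_file="/var/run/slurmctld.pid"):
    """Ask the supervised Slurmctld to exit; the supervisor restarts it gracefully."""
    with open(pid_file) as f:
        os.kill(int(f.read().strip()), signal.SIGTERM)


if __name__ == "__main__":
    config.load_incluster_config()       # the sidecar runs inside the Kubernetes cluster
    known = None
    while True:
        workers = list_worker_names()
        if workers != known:             # topology changed: reconfigure and restart
            write_node_section(workers)
            restart_slurmctld()
            known = workers
        time.sleep(10)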

5.2.5 Summary of the Discussion

The previous units detailed a method to build scalable containerized HPC clusters in the Cloud. The originality of the work consists in containerizing three central HPC schedulers: SLURM, OpenPBS, and OAR. They can all be jailed in containers, and the experimentation demonstrates that scaling operations do not impact running or pending jobs (except for SLURM, but this point will be handled in an upcoming release). Contrary to most related works, we do not operate at the application level but at the orchestration level for containerization. Recall that the initial motivation of the work and the artifacts to replay the experiments are given in [8, 9] if the reader needs more insights. In short, the above discussion concerns converged computing, a paradigm that aims to offer HPC performance, efficiency, and sophisticated scheduling together with cloud benefits. While orchestration frameworks like Kubernetes offer several advantages, such as resiliency, elasticity, portability, and manageability, they are not performance-oriented to the same degree as HPC. The vision of converged computing, presented in this first part of the chapter, is first to put the HPC ecosystem itself, rather than the applications, into the Cloud under the supervision of the cloud orchestrator.

5.3 Adding a Mechanism for Autoscaling for Containerized HPC Schedulers

5.3.1 Introduction

This section briefly introduces a Cloud orchestrator controller that enables the autoscaling of containerized HPC clusters in the Cloud. This controller triggers the creation or suppression of containerized HPC compute nodes according to metrics collected at the containerized HPC scheduler's job queue level. Again, we promote using Kubernetes as the Cloud orchestrator and OAR as the HPC scheduler, and our approach modifies neither the Kubernetes Cloud orchestrator nor the OAR HPC scheduler. Again, the scheme followed in this section is generic and can be applied to HPC schedulers other than OAR.


This section also exemplifies Cloud and HPC convergence, allowing a high degree of flexibility for users and community platform developers. Again, we assume that containerization principles facilitate the reproducibility of experiments by adding the HPC scheduler to the environment replayed by the end user. In the aforementioned sections, we evaluated the dynamic abilities of the three central open-source HPC schedulers, OAR, SLURM, and OpenPBS, in a containerized environment. We demonstrated that they all provide mechanisms to add or remove nodes on the fly in a container orchestration context. We highlighted that each HPC scheduler follows the same pattern to scale up or down in a containerized environment. Consequently, we deduced a methodology to build dynamic containerized HPC schedulers. We now design a custom controller to add the autoscaling feature to a containerized OAR HPC scheduler. Recall that, in Kubernetes terminology, a controller is a control loop that drives the current state of a cluster towards its desired state. Kubernetes has built-in support for CPU and memory metrics and supplies an API to handle the metrics of its objects. Our metric is the queue state of the containerized HPC scheduler, so we cannot rely on native Kubernetes metrics. Our controller queries metrics straight from the containerized HPC scheduler API.
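To make the idea of a queue-driven control loop concrete, the fragment below sketches, in Python, the overall shape of such a controller under simplifying assumptions: the helpers count_waiting_jobs() and reconcile() are hypothetical placeholders for, respectively, a query to the HPC scheduler's own API and the creation or deletion of worker Pods (a more detailed sketch follows in Sect. 5.3.3).

# Skeleton of a queue-driven autoscaling loop (hypothetical helpers, see Sect. 5.3.3).
import time

MAX_NODES = 3        # upper bound, e.g., taken from the Pod annotations
PERIOD = 15          # seconds between two reconciliations


def count_waiting_jobs() -> int:
    """Placeholder: query the containerized HPC scheduler (e.g., the OAR API)."""
    raise NotImplementedError


def reconcile(desired_nodes: int) -> None:
    """Placeholder: create or delete containerized compute nodes to reach the target."""
    raise NotImplementedError


def control_loop() -> None:
    while True:
        waiting = count_waiting_jobs()
        # The desired state follows the queue, bounded by the declared maximum.
        reconcile(min(waiting, MAX_NODES))
        time.sleep(PERIOD)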

5.3.2 Related Works and Positioning

In [16], the authors propose a Semi-Elastic Cluster, shortened to SEC. SEC aims to predict the number of resources needed. This prediction relies on the accounting data of the HPC scheduler (SLURM in their experimentation), which is processed every weekend. At the end of the data processing, compute nodes are provisioned on various Cloud platforms such as GCE or EC2. Our work differs because we rely on queues instead of accounting-data processing. Moreover, we are not using a prediction paradigm but a real-time adaptation paradigm driven by the users' needs. In [17], the authors describe an infrastructure where the HPC scheduler (Moab) collaborates with a provisioning service (OpenStack) to deploy virtual machines embedding specialized physics software on compute nodes. Users request the deployment of virtual machines through Moab like any regular job, and OpenStack provisions the VMs on compute nodes. When the computation ends on the virtual machines, the hosting physical node is released for any other standard calculation. This work is similar to Kadeploy [18], which collaborates with the OAR HPC scheduler to provision bare-metal servers. In our work, cohabitation occurs at the Cloud orchestrator level. As a consequence, it is more application-oriented than a full-fledged operating system (virtualized [17] or physical [18]). Complementary readings, already commented on above, are [10–15].


In short, this section highlighted two categories of related work. The first is the integration of containers in HPC schedulers. This integration can be more or less deep: a light integration consists of deploying a container engine on compute nodes to run applications in a sandbox without significant overhead, while a deeper integration adds connectors between HPC schedulers and the Cloud platform. These connectors enable the HPC scheduler to run applications on the Cloud platform, providing scalability. The second category is application-oriented. Authors try to adapt their HPC applications to a Cloud context and must adjust their scientific workflows from regular HPC schedulers to Cloud orchestrators. It implies containerizing their applications and studying the impact of Software Defined Networking (SDN), container engine security policies, etc. Our contribution opens a third path by containerizing the job scheduler itself. We use the Cloud orchestrator as a provisioning system for our containerized HPC cluster, and our containerized cluster can scale according to the available resources on the Cloud platform. Moreover, our containerized HPC scheduler runs among other regular applications on the same hardware platform, reducing the waste of resources. This approach corresponds to a full-Cloud approach, not as efficient as bare-metal HPC but more scalable and reproducible. As stated previously in the chapter, we introduced a motivating example by studying the efficiency of submitted jobs at the laboratory, university, national, and specialized HPC infrastructure scales. In this study's conclusion, we found that multi-purpose HPC systems do not maximize CPU efficiency. In summary, we think there is potential for improvement in the efficiency of an HPC system. Our research studies the feasibility of cohabitation between HPC jobs and generalist services, e.g., running a small website in a fully containerized environment. Indeed, we promoted the multiplexing of online jobs with batch workloads on the same infrastructure. Thus, the overall research directions we investigate in our research, and more specifically in this section, are:

1. On-demand and scalable HPC infrastructures;
2. Reproducibility of HPC infrastructures deployment;
3. Flat-sharing with regular Cloud applications;
4. Orchestration of comprehensive HPC infrastructures in the Cloud;
5. Collaboration between Cloud orchestrators and HPC schedulers;
6. Integration of HPC schedulers in DevOps recipes;
7. Hybrid HPC infrastructures provisioning with bare-metal, virtual machines, and containers.

In this part of the work, we specifically investigate on-demand and scalable HPC infrastructure issues and the autoscaling problem, i.e., the ability of HPC job schedulers to dynamically request the extension or the shrinking of the HPC cluster.


5.3.3 Challenge and Issues for Auto Scaling Mechanisms with OAR

Minimal alteration of the Kubernetes and OAR ecosystems is one of our main objectives. OAR services are fairly easy to containerize because Almighty is a service that can run in the foreground. Moreover, it forks the additional required services, such as the task launcher or the monitoring service. This property meets best practices in container management: specifically, one container should only host one foreground process and its children. The agent on containerized compute nodes is a standard sshd service configured for OAR, and it can natively run in the foreground. When the sshd service receives a job, it invokes the typical sequence fork(), setuid(), and exec() to run the job. The only mandatory requirement is to be able to add the chroot capability to the containerized compute node. Optionally, you may add the netraw capability to the Almighty container. This capability is used by the monitoring service to check the availability of containerized compute nodes through ICMP. Kubernetes has two native controllers that manage a set of Pods: Deployment and StatefulSet. They both rely on ReplicaSet. A ReplicaSet is a process that runs multiple instances of a Pod and keeps the specified number of Pods constant; its purpose is to maintain the specified number of Pod instances running in a cluster at any given time, and it handles the creation and deletion of Pods. Deployments create Pods with unpredictable names, making it impossible to provision their configurations on the HPC scheduler. StatefulSets produce Pods with predictable names. However, they both delete Pods following a LIFO scheme. This scheme is not an option because we delete idle containerized Pods. Moreover, they do not take queues as a metric to scale up/down. That is why we developed an HPC-dedicated controller that scales according to queue metrics and deletes Pods in a non-LIFO way. Our containerized OAR cluster runs in a dedicated namespace. Namespaces enable isolating a set of Pods in a virtual cluster. This isolation is a great help in applying access control between this set of Pods and the rest of the Kubernetes cluster. The scheduler part of our containerized OAR cluster is defined in Listing 5.2. The listing is shortened to show only the relevant details.

1  apiVersion: v1
2  kind: ConfigMap
3  metadata:
4    name: oarconf
5    namespace: oar
6  data:
7    DB_TYPE: "Pg"
8    DB_HOST: "db-server"
9    DB_PORT: "5432"
10   DB_OAR_BASE_NAME: "oar"
11 ---
12 apiVersion: v1
13 kind: Service
14 metadata:
15   name: nodes
16   namespace: oar
17 spec:
18   selector:
19     net: headless
20   clusterIP: None
21 ---
22 apiVersion: v1
23 kind: Pod
24 metadata:
25   name: hpc-scheduler
26   namespace: oar
27   labels:
28     net: headless
29   annotations:
30     default/nodes: "3"
31     default/cpuspernode: "2"
32     default/image: "nyk0/chsc-oar"
33     default/hostnamebase: "hpc-node"
34 spec:
35   hostname: hpc-scheduler
36   containers:
37   - image: nyk0/chsc-oar
38     name: oar-server
39     envFrom:
40     - configMapRef:
41         name: oarconf
42     command: ["/bin/bash"]
43     args: ["/start-almighty.sh"]
44     securityContext:
45       capabilities:
46         add: ["NET_RAW"]
47 ---
48 apiVersion: v1
49 kind: Pod
50 metadata:
51   name: controller
52   namespace: oar
53   labels:
54     net: headless
55 spec:
56   containers:
57   - image: nyk0/chsc-oar
58     name: controller
59     envFrom:
60     - configMapRef:
61         name: oarconf

Listing 5.2 Containerized HPC scheduler YAML

In lines 1–10, we defined a ConfigMap object. A ConfigMap is an API object that stores non-confidential data in key/value pairs. These key/value pairs are passed to the containers as environment variables.


We parse them with a basic shell script to generate the main configuration file of OAR. Almighty and every administration or user command of OAR needs this configuration file; as a consequence, every container will use this ConfigMap. In lines 12–20, we defined a Service. A Service sets the network context of the Pod. It registers the Pod in the DNS service and eventually establishes a load-balancer address if you use a Deployment or a StatefulSet. In our experimentation, we do not use replicas. We manage each Pod individually, so we use a headless Service. A headless Service supplies an IP address and a DNS record to the Pod. In lines 22–46, we defined the Pod that hosts the containerized OAR scheduler. This Pod contains only one container, for Almighty. As stated before, Almighty forks every other service required for OAR to work. We are in the parent/child process model, which fits containerization. An essential part of the Pod definition is the annotations section. Annotations add metadata to objects, and tools or libraries can retrieve this metadata. We used annotations to define the queue parameters. In this example, we define a queue named "default" with three nodes (default/nodes: "3"). This value is the maximum number of nodes the controller can create. Each node has two CPUs (default/cpuspernode: "2"). This value is the amount of CPU accounted by the underlying cgroups to the containerized compute node. When containerized compute nodes are orchestrated on a Cloud infrastructure, they are less adherent to the hardware. This results in a kind of loose HPC. As stated before, OAR is versatile, and we can customize the resource hierarchy according to the underlying resource manager (direct access to hardware or cgroups). For the sake of brevity, we stop the discussion at this point, and we hope it is sufficient to convey the difficulties. Indeed, it remains to discuss how we could define the image used when containerized compute nodes are instantiated and how to design the architecture of the controller. In short, the controller reads the annotations and the job queue state to instantiate containerized compute nodes.
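As an illustration of this reconciliation step, the sketch below shows how such a controller could create or delete worker Pods with predictable names through the Kubernetes Python client. It is a simplified, hypothetical version of the mechanism described in this section: the annotation keys mirror Listing 5.2, while the worker Pod template and the queue information passed to reconcile() (waiting jobs and idle nodes, which would come from the OAR API) are assumptions.

# Simplified reconciliation logic for the HPC-dedicated controller (a sketch).
from kubernetes import client, config

NAMESPACE = "oar"


def read_annotations(v1):
    """Read the queue parameters from the scheduler Pod (annotation keys as in Listing 5.2)."""
    pod = v1.read_namespaced_pod("hpc-scheduler", NAMESPACE)
    ann = pod.metadata.annotations
    return int(ann["default/nodes"]), ann["default/image"], ann["default/hostnamebase"]


def worker_pod(name, image):
    """Build a worker Pod with a predictable name (minimal, assumed template)."""
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name=name, namespace=NAMESPACE,
                                     labels={"net": "headless"}),
        spec=client.V1PodSpec(hostname=name,
                              containers=[client.V1Container(name="oar-node", image=image)]),
    )


def reconcile(v1, waiting_jobs, idle_nodes):
    """Scale the containerized compute nodes according to the OAR queue state."""
    max_nodes, image, base = read_annotations(v1)
    running = [p.metadata.name for p in v1.list_namespaced_pod(
        NAMESPACE, label_selector="net=headless").items if p.metadata.name.startswith(base)]

    if waiting_jobs > 0 and len(running) < max_nodes:
        # Scale up: pick the first free predictable name, e.g., hpc-node1, hpc-node2, ...
        for i in range(1, max_nodes + 1):
            name = f"{base}{i}"
            if name not in running:
                v1.create_namespaced_pod(NAMESPACE, worker_pod(name, image))
                break
    elif waiting_jobs == 0:
        # Scale down in a non-LIFO way: delete exactly the nodes reported idle by OAR.
        for name in idle_nodes:
            v1.delete_namespaced_pod(name, NAMESPACE)


if __name__ == "__main__":
    config.load_incluster_config()
    api = client.CoreV1Api()
    # waiting_jobs and idle_nodes are placeholders for values obtained from the OAR API.
    reconcile(api, waiting_jobs=2, idle_nodes=[])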

5.3.4 Summary of the Discussion

In this section, we discussed the specific issues of introducing an autoscaling mechanism into the initial proposal. The goal is to offer the concurrent HPC job schedulers, traditionally deployed on bare metal, the ability to dynamically add or remove nodes. This point is a crucial issue to meet the scale-in/scale-out properties of Cloud Computing.

5.4 Conclusion

Bibliographic references [8] and [9] constitute the genesis of this chapter. In summary, the chapter is part of an effort to containerize HPC schedulers and jobs within a Kubernetes environment.


We assume that this option is an excellent way to work towards unifying HPC and Cloud, thus aiming to encapsulate cloud and HPC services within a single environment. The proposed path is promising for converged computing and plays well with existing technologies (e.g., Kubernetes, OAR, SLURM, etc.). In this chapter, we described, for instance, an implementation of an HPC-oriented controller for Kubernetes. This controller enables the building of autoscaling containerized HPC schedulers. However, it is essential to state that this approach does not conflict with traditional HPC cluster infrastructures. Indeed, traditional HPC clusters run directly on hardware and have several features that contribute to a highly fine-grained allocation of physical resources. Our work addresses users who get along well with loose HPC or need highly reproducible environments (including the job scheduler). Finally, we are one step closer to offering an architecture where different HPC job scheduling components coexist under the supervision of a Cloud orchestrator and where each scheduler can require more or fewer resources independently. We believe that such an organizational scheme allows us to manage and share resources in a more refined way, probably in a frugal way, while respecting the "on-demand" motto and the habits of cluster users. However, much work remains to accomplish this vision. For instance, when it comes to the establishment of autoscaling, what happens if one step in the recipe fails? This path of the control plane is not yet investigated. The discussed contribution is containerized schedulers that, in turn, spawn jobs within containers. In the future, we plan to consider placement and overlay networks for tightly-coupled jobs. We also plan to consider challenges in getting features like remote storage or enabling MPI. Our first tests with MPI programs demonstrate that communicating programs are operational. At last, we need to experiment with this scheduler on significant workloads (e.g., 10s–100s of nodes, tightly-coupled and loosely-coupled workloads, hybrid jobs spanning on-site and cloud clusters, etc.). The idea is to check whether the proposal fits HPC/HTC and whether it could support high-frequency use cases. On the experimental side, quantitative data supporting the work would be expected. For example, how well does the FIFO controller saturate compute nodes? If nodes are activated and returned to standby, what is the turnaround time for jobs to launch?

Acknowledgments This work was conducted during the Délégation of Mr. Cérin with the Centre National de la Recherche Scientifique (CNRS). Thanks to the institutional support of the CNRS, University of Grenoble Alpes, DATAMOVE INRIA Team, and University Sorbonne Paris Nord. Mr. Grenèche is also working with the "Pôle de soutien à la recherche" of Sorbonne Paris Nord, Direction des Systèmes d'Information (DSI).


References 1. Thomas Sterling, Maciej Brodowicz, and Matthew Anderson, High Performance Computing: Modern Systems and Practices 1st Edition, Morgan Kaufmann; (December 19, 2017), ISBN10: 012420158X; ISBN-13: 978-0124201583 2. Maciej Brodowicz, Thomas L. Sterling, Matthew Anderson: Continuum Computing - on a New Performance Trajectory beyond Exascale. Supercomput. Front. Innov. 5(3): 5–24 (2018) 3. Nick Antonopoulos, Lee Gillam: Cloud Computing - Principles, Systems and Applications, Second Edition. Computer Communications and Networks, Springer 2017, ISBN 978-3-31954644-5 4. Google, Kubernetes – see https://kubernetes.io/ 5. Morris A. Jette and Andy B. Yoo and Mark Grondona, SLURM: Simple Linux Utility for Resource Management, In Lecture Notes in Computer Science: Proceedings of Job Scheduling Strategies for Parallel Processing (JSSPP) 2003, 2002, pages 44–60, Springer-Verlag 6. Henderson, R.L., Tweten, D.: Portable Batch System-PBS: Requirements Specification. NASA Ames Research Center (1998) 7. The OAR job scheduler. See http://oar.imag.fr 8. Christophe Cérin, Nicolas Grenèche, Tarek Menouer, Towards Pervasive Containerization of HPC Job Scheduler, SBAC-PAD 2020, pages 281–288. 9. Nicolas Grenèche, Tarek Menouer, Christophe Cérin, and Olivier Richard, A methodology to scale containerized HPC infrastructures in the Cloud, In: Euro-Par 2022, Glasgow, UK, August 22–26, 2022 (2022). 10. C. Misale and M. Drocco and D. J. Milroy and C. Gutierrez and S. Herbein and D. H. Ahn and Y. Park, 2021 3rd International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC (CANOPIE-HPC), It’s a Scheduling Affair: GROMACS in the Cloud with the KubeFlux Scheduler, 2021, pages 10–16, doi https://doi.org/10.1109/ CANOPIEHPC54579.2021.00006, IEEE Computer Society, Los Alamitos, CA, USA. 11. A. M. Beltre and P. Saha and M. Govindaraju and A. Younge and R. E. Grant, 2019 IEEE/ACM International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC (CANOPIE-HPC), Enabling HPC Workloads on Cloud Infrastructure Using Kubernetes Container Orchestration Mechanisms, 2019, pages 11–20. 12. A. Torrez and T. Randles and R. Priedhorsky, 2019 IEEE/ACM International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC (CANOPIEHPC), HPC Container Runtimes have Minimal or No Performance Impact, 2019, doi https:// doi.org/10.1109/CANOPIE-HPC49598.2019.00010, IEEE Computer Society, Los Alamitos, CA, USA. 13. A. J. Younge and K. Pedretti and R. E. Grant and R. Brightwell, 2017 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), A Tale of Two Systems: Using Containers to Deploy HPC Applications on Supercomputers and Clouds, 2017, pages 74–81, doi https://doi.org/10.1109/CloudCom.2017.40, IEEE Computer Society, Los Alamitos, CA, USA 14. Viktoria Spisakova, Dalibor Klusacek and Lukas Hejtmanek, Using Kubernetes in Academic Environment: Problems and Approaches (Open Scheduling Problem), Job Scheduling Strategies for Parallel Processing (JSSPP). In conjunction with 36th IEEE International Parallel and Distributed Processing Symposium (IPDPS’2022), May 30–June 3, 2022 in Lyon, France. Available at https://jsspp.org/papers22/kubernetes-OSP.pdf 15. Naweiluo Zhou, Yiannis Georgiou, Marcin Pospieszny, Li Zhong, Huan Zhou, Christoph Niethammer, Branislav Pejak, Oskar Marko, Dennis Hoppe: Container orchestration on HPC systems through Kubernetes. J. Cloud Comput. 10(1): 16 (2021) 16. S. Niu, J. Zhai, X. 
Ma, X. Tang, W. Chen and W. Zheng, “Building Semi-Elastic Virtual Clusters for Cost-Effective HPC Cloud Resource Provisioning,” in IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 7, pp. 1915-1928, 1 July 2016, doi: https://doi. org/10.1109/TPDS.2015.2476459.


17. Meier, Konrad & Fleig, Georg & Hauth, Thomas & Janczyk, Michael & Quast, Günter & von Suchodoletz, Dirk & Wiebelt, Bernd. (2016). Dynamic provisioning of a HEP computing infrastructure on a shared hybrid HPC system. Journal of Physics: Conference Series. 762. 012012. https://doi.org/10.1088/1742-6596/762/1/012012. 18. Emmanuel Jeanvoine, Luc Sarzyniec, Lucas Nussbaum. Kadeploy3: Efficient and Scalable Operating System Provisioning for Clusters. USENIX Association, USENIX Association, 2013, 38 (1), pp. 38–44 19. The GROMACS user guide - https://doi.org/10.5281/zenodo.6103568 20. Andrew S. Tanenbaum and Herbert Bos. Modern Operating Systems, 4th edition. Published by Pearson (July 14th 2021) - Copyright © 2015

Chapter 6

Designing Cloud-Friendly HPC Applications

Rodrigo da Rosa Righi, Guilherme Galante, Vinicius Facco Rodrigues, Heonyoung Yeom, Harald Koestler, Madhusudan Singh, and Guann-Pyng Li

6.1 Introduction

Although first proposed in 1992, the cloud became a reality only in 2003 with the Amazon AWS proposal. Unlike in 1992, in 2003 we had both a mature pay-as-you-go business model and advances in virtualization technologies to allow this style of computing delivery. As said previously, the cloud started focusing on transaction-based applications, and soon high-performance computing (HPC) demands were tested with cloud resources.

R. da Rosa Righi () · V. F. Rodrigues Universidade do Vale do Rio dos Sinos (UNISINOS), São Leopoldo, Brazil e-mail: [email protected]; [email protected] G. Galante Universidade Estadual do Oeste do Paraná (UNIOESTE), Cascavel, Paraná, Brazil e-mail: [email protected] H. Yeom Seoul National University (SNU), Seoul, South Korea e-mail: [email protected] H. Koestler Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany e-mail: [email protected] M. Singh Engineering Management Technology (EMT), Oregon Institute of Technology, Klamath Falls, OR, USA e-mail: [email protected] G.-P. Li University of California Irvine (UCI), Irvine, CA, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 E. Borin et al. (eds.), High Performance Computing in Clouds, https://doi.org/10.1007/978-3-031-29769-4_6


Although facilitating the deployment of HPC demands, the delay in managing and orchestrating virtualized resources was commonly not acceptable for high-performance computing. Soon, the big cloud players, including Amazon, Google, and Microsoft, reorganized their datacenters to offer particular configurations and deployments to bypass this limitation. Thus, non-shared resources and specialized networks could be rented quickly, allowing performance rates very similar when comparing on-premise and virtualized environments. Today, containers (with the Docker suite) and SDN (software-defined networks) are critical technologies in the cloud to guarantee acceptable performance for HPC applications. Both HPC and cloud computing have the keyword "fast" in mind. The former provides strategies to run demands faster while keeping an acceptable application precision. The latter enables a quick time-to-market. In the HPC world, we only discuss compiling and executing applications, not taking care of infrastructure details. In order to combine the "fast" of both worlds, we need to understand how HPC applications are written, as well as details regarding the computing architecture. In brief, we have five application models: Bag-of-Tasks, Master-Slave, Pipeline, Divide-and-Conquer, and Bulk-Synchronous Parallel. Each has its peculiarities in process coupling and interactions, which profoundly impact CPU, memory, and network specifications. Also, selecting a model (or a combination) should consider how we can explore cloud facilities and the provided infrastructure. We can extract multiprocessor and multicomputer parallelism, in addition to exploring both by combining appropriate libraries like OpenMP and MPI (Message-Passing Interface). In this context, this book chapter presents how we can design cloud-friendly HPC applications. The key objective is first to detail the main features of cloud computing, and then to detail how the HPC application models mentioned earlier can address each feature. In other words, we plan to confront cloud features against HPC application models, revealing matching and non-matching engines. Figure 6.1 depicts the challenge adopted in the present book chapter. In the end, we will have the best ideas for developing HPC applications to run in the cloud, as well as those source organizations that are not recommended to be migrated to the cloud. For example, to motivate the reading of the book chapter, we have resource elasticity as one of the main characteristics of cloud testbeds. Thus, we will investigate the different faces of this keyword, also looking at how the different application models can exploit each face (either in a friendly or not-friendly way). After presenting the Introduction in Sect. 6.1, we explore the cloud features and capabilities in Sect. 6.2. Section 6.3 is in charge of overviewing the current HPC application models. At this moment, we have a division in the document. Section 6.4 presents how loosely-coupled applications can extract the main cloud features. In its turn, Sect. 6.5 details the matching between cloud features and tightly-coupled applications. Section 6.6 reveals the open challenges on HPC-oriented cloud applications. Finally, Sect. 6.7 presents the conclusion, bringing back the main discussed concepts and some hints that can be explored as future work.

Fig. 6.1 Cloud features versus application models. The figure confronts cloud features (automation, elasticity, pay-as-you-go, backup and recovery, big data analytics, virtualization, storage, and networking) with the application models (Bag-of-Tasks, Master-Slave, Pipeline, Divide-and-Conquer, and Bulk-Synchronous Parallel) through a matching engine to execute cloud-friendly HPC applications

6.2 Exploring Cloud Features and Capabilities Through the Lens of HPC Demands

The compute resources needed to analyze big data and solve complex HPC problems are expanding beyond the on-premise compute clusters towards resources available from public cloud services. Cloud adoption for HPC is central to transitioning workloads from an on-premise-only approach to one decoupled from a specific infrastructure or location. As said, cloud computing allows resources to be available on-demand, which can be cost-effective and allow for greater flexibility to run HPC workloads. On-site datacenters typically require many racking and stacking procedures (involving hardware setup, software patching, and other time-consuming IT management chores). Cloud computing removes the need for many of these tasks, so IT teams can spend time on achieving more important business goals. Cloud offers on-demand self-service and fast scalability. The user can continuously monitor the server uptime, capabilities, and allotted network storage to decide on on-the-fly resource provisioning. The scalability feature enables cost-effective handling of workloads that require many servers but only for a short period.


Many HPC demands have workloads that can be run very cost-effectively due to the rapid scalability of cloud computing. Thus, we have the ability to quickly increase or decrease the size or power of the resources. People interested in the HPC field move towards cloud computing because it provides scalable options within a few clicks and a few moments. Additionally, resource pooling enables scalability for cloud providers and users because compute, storage, networking, and other assets can be added or removed as needed. This feature helps enterprise IT teams optimize cloud-hosted workloads and avoid end-user bottlenecks. Moreover, clouds can scale vertically or horizontally, and service providers offer automation software to handle dynamic scaling for users. Cloud also goes in the direction of aiding disaster recovery and resource dynamism. This feature is aligned with long-running HPC demands. Cloud-based backup and recovery ensure that the data is secure. Implementing robust disaster recovery was once a problem for small businesses. However, cloud solutions now provide these organizations with cost-effective solutions and the expertise they need. Cloud services save time, avoid significant investments, and provide third-party expertise for the company. Backing up data can be a pain, so cloud providers back up many resources by default, and the user can add additional backup services relatively easily. One can choose a backup strategy for various services and keep those backups in various places in a region, continent, or even worldwide. For example, a user can store critical files on three servers in three data centers located on three continents, giving 3 × 3 × 3 backups, or 27 copies of the same file. Restoring a backup might be as easy as picking which backup to restore and clicking a button. This feature makes disaster recovery fast and cost-effective. Thus, we can quickly mitigate a disaster at one data center, or even across an entire region, by turning over to another datacenter in another region. Traditional, on-premises architectures cannot scale quickly. Typically, enterprises have to plan for peak capacity by purchasing servers and other infrastructure assets; those extra resources sit idle during lulls in activity. While scalability describes longer-term cloud infrastructure plans, rapid elasticity is more of a short-term characteristic. When demand unexpectedly surges, properly configured cloud applications and services instantly and automatically add resources to handle the load. When the demand abates, services return to the original resource levels. The adoption of container technologies has also gained momentum in HPC. Containers are lightweight and enable flexibility with low levels of overhead (improving performance and cost). Containers also help meet the requirements of many HPC applications, such as scalability, reliability, automation, and security. The ability to package application code, its dependencies, and even user data, combined with the demand to simplify the sharing of scientific research and findings with a global community across multiple locations, as well as the ability to migrate said applications into public or hybrid clouds, makes containers very relevant for HPC environments. By using containers to deploy HPC applications and workloads in the cloud, you are not tied to a specific HPC system or cloud provider. In this way, containers for HPC applications can profit from the pay-as-you-go business model of the cloud. The user only pays for actually used resources.


This characteristic is well suited for institutions that otherwise would have to deal with underutilized resources or a restricted budget that prevents an investment in on-site clusters. In addition, the on-demand self-service characteristic of HPC cloud offerings allows additional execution scenarios. For example, workloads submitted to grids or clusters are typically handled by a scheduling system, stored in a queue, and executed later when resources become available. In contrast, unlimited and immediately available resources in cloud environments allow the execution of all workloads in parallel, avoiding waiting times. This feature gains much power when combined with resource elasticity: when the average CPU load is high (above a particular upper threshold), we can enlarge the computing infrastructure. As discussed in the remaining sections of this book chapter, the key challenges are integrating this set of new resources with the HPC code and using them correctly. We will see that some HPC application models support this integration more easily, while for others it is harder, and for others yet it is practically unfeasible.
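To make the threshold-based reasoning above concrete, the fragment below sketches, in Python, a minimal reactive horizontal-elasticity rule. It only decides the target number of nodes; applying that decision (creating or terminating instances) is left to a provider's provisioning API. The thresholds and bounds are arbitrary example values, not recommendations.

# Minimal reactive (threshold-based) horizontal elasticity rule -- a sketch, not a product.
UPPER, LOWER = 0.80, 0.30     # example CPU-load thresholds
MIN_NODES, MAX_NODES = 2, 16  # bounds imposed by budget or quota


def elasticity_step(observed_load: float, current_nodes: int) -> int:
    """Return the new number of nodes after applying the reactive rule once."""
    if observed_load > UPPER and current_nodes < MAX_NODES:
        return current_nodes + 1   # scale out: add one instance
    if observed_load < LOWER and current_nodes > MIN_NODES:
        return current_nodes - 1   # scale in: release one instance
    return current_nodes           # within thresholds: do nothing


# Example: a load of 0.9 with 4 nodes triggers a scale-out to 5 nodes.
assert elasticity_step(0.9, 4) == 5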

6.3 Analyzing HPC Models to Write Cloud-Friendly Applications

High-performance computing is typically explored by joining the machine model, the programming model, and the application model. Machine models can comprise the use of a multiprocessor or multicomputer strategy, or a combination of them. Following Flynn's classification of parallel machines, this set is enclosed in the MIMD (Multiple Instruction, Multiple Data) class. In addition, we have the SIMD (Single Instruction, Multiple Data) architectures, where vector machines and GPU boards represent the leading players. Together with these machine details, HPC applications can be organized under at least one of the following application models: Bag-of-Tasks, Master-Slave, Divide-and-Conquer, Pipeline, and Bulk-Synchronous Parallel. Finally, programming models comprise the following discussions: processes' interactions (shared memory or message passing), parallelism extraction (data parallelism and task parallelism), and process creation. This last topic, in particular, involves themes like SPMD (Single Program Multiple Data), MPMD (Multiple Program Multiple Data), and static or dynamic process creation. Thus, the act of coding an HPC application, or translating a sequential application into a parallel one, includes the combination of these three models. Application and machine models can be matched with cloud computing characteristics to extract the best strategies for exploring high performance. As described in Sect. 6.2, all cloud features are important to meet HPC requirements. However, we would like to highlight three of them: (1) on-demand self-service and fast scalability; (2) economic benefits; and (3) disaster recovery and resource dynamism.


Regarding on-demand self-service and fast scalability (feature (1) above), the cloud elasticity feature and the fast deployment of computing resources using containers can be effectively explored to enhance the performance of applications. When workloads are dynamic and unpredictable, resource elasticity is helpful to reduce or enlarge the number of resources (or containers) at runtime according to the current input demand. According to the technique used to trigger the elasticity actions, we can classify them as reactive or proactive. Reactive solutions use rules based on lower and upper thresholds to execute elasticity actions. The predictive approach uses heuristics and mathematical/analytical techniques to anticipate the system load behavior. In addition, we have horizontal and vertical strategies. The first removes or adds computing resources as a whole. In contrast, the second works with virtual computing resource resizing (changing memory size or vCPUs, for example, up to an upper limit imposed by the physical hardware). Together with these possibilities, some works also present container or VM migration as a complementary possibility to address performance on cloud environments. Here, commonly used strategies are migrating one or more resource units from an overloaded resource to a lighter one, or bringing closer processes that present a recurrent communication pattern. Economic benefits (presented in feature (2) above) are closely related to renting resources instead of buying and configuring them, in addition to provisioning resources that best fit the particular situation of the application/machine. Most HPC applications are executed in batch mode and their workloads are defined by input files containing the data to be processed; sometimes, the resource requirements cannot be determined precisely in advance or can change during runtime. It may happen due to changes in the application structure (e.g., use of adaptive mesh refinement) or the use of different algorithms with specific resource demands. Leveraging cloud elasticity enables precise management of resources as compute load and memory requirements change along the application execution. Thus, unlike allocating over-provisioned resources, it is possible to start with a middle-range level and pass the control of resources to an elasticity manager, which organizes allocations at runtime. The user will only pay for the number of resources being used, which seems a fair approach. Disaster recovery and resource dynamism (presented in feature (3) above) are pertinent for long-running applications. Any power supply or hardware problem can imply reinitializing the application, throwing away the money and time already spent. Thus, periodical backups saving incremental execution images help users when they need to restore the application state. Also, this feature is related to economic benefits since new executions from scratch are no longer needed. According to Kehrer and Blochinger [1], three different cloud adaptation strategies can be applied to different parallel applications: (1) Copy and Paste, (2) Cloud-aware Refactoring, and (3) Cloud-aware Refactoring and Elasticity Control. The Copy and Paste strategy consists in migrating the existing parallel applications without modifications. In these cases, users can substitute their HPC infrastructure with virtual hardware, harnessing immediately available resources in cloud environments offered in the pay-per-use model.


Relevant options are IaaS or container solutions, since they can be customized to host the required software stack for running such applications and coordinated with distributed computing middleware capable of interacting with cloud-based infrastructures. The main drawbacks of this strategy are: (1) elasticity is not employed, considering that the application was not modified to handle variable resources; thus, the number of processing units has to be statically selected; and (2) some characteristics of the cloud, such as heterogeneous processing speeds, varying network latencies, low network bandwidths, and virtualization overhead, may affect the performance of applications that demand synchronous communications (tightly-coupled ones) [2]. The Cloud-aware Refactoring strategy proposes architectural refactoring to make existing parallel applications cloud-aware and less affected by the characteristics of standard cloud environments. In this context, Fehling et al. [3] identified the IDEAL application properties that effectively allow taking advantage of cloud environments: Isolation of state, Distribution, Elasticity, Automated Management, and Loose coupling. Isolation of state is related to designing the parts of a cloud application to be stateless, thus isolating state in small portions of the application. Distribution is achieved by decomposing an application into separate components that can be distributed among the available resources. Elasticity focuses on the dynamic addition and removal of computational resources and demands that the application handle a variable amount of resources. Automated Management is essential in the context of cloud computing, since the management of all properties of a cloud-aware application should be automated by a software layer (e.g., a monitoring system or middleware) that interacts with the interfaces of cloud providers. However, as management is an operations-related property, it cannot be directly linked to parallel application design on a conceptual level [4]. Finally, loose coupling means that the dependencies between application components should be minimized. It facilitates procedures such as scaling and failure recovery by reducing dependencies among the application parts. The challenge is to think differently and rewrite the application to support the new computational and programming models. It is necessary to analyze the structure and nature of the application and determine how to exploit what the cloud offers. For example, if the application is tightly coupled, it will probably restrict the freedom of where you can port the application. On the other hand, if it is relatively loosely coupled, it will enable greater freedom when porting the application [2]. Fox and Gannon [5] argue that applications can be redesigned and implemented on top of cloud programming models to leverage their unique capabilities. They suggest using frameworks (e.g., MapReduce or Hadoop), PaaS platforms (e.g., Microsoft Azure and Aneka), and workflows in the construction or adaptation of applications that will run in the cloud. In Cloud-aware Refactoring and Elasticity Control, the use of elasticity to process HPC workloads in the cloud must also be considered in addition to application refactoring. This ability to run parallel applications with dynamic resources can provide several benefits, including improvements in application performance and efficiency, cost reduction, fault tolerance, load balancing, and better resource utilization [6]. Several articles related to cloud computing address exploring this feature and survey elasticity mechanisms. We highlight the works presented in references [7–9].
In the following sections, we divide the HPC application models into two groups and detail how we can explore cloud features on each.


6.4 Loosely-Coupled HPC Applications for Cloud

Loosely-coupled applications, sometimes referred to as embarrassingly parallel applications, are characterized by the minimal effort required to divide the application into a collection of tasks that can be distributed among a set of computing components for parallel processing. Loosely-coupled applications with thousands or even billions of tasks can be found in the literature, and are used to solve complex and resource-intensive scientific problems in various fields such as drug discovery, high-energy physics, chemistry, astronomy, image processing, machine learning, and bioinformatics [10, 11]. In other words, in loosely-coupled applications, tasks are not interdependent and can be executed without meaningful communication with other tasks. Thus, considering the low overhead resulting from idle time or communication between components, it is possible to achieve high efficiency and scalability in distributed environments such as clouds [4, 12]. Moreover, these applications can also take advantage of the variable number of resources offered by clouds, especially by exploring elasticity [13, 14]. In summary, loosely-coupled parallel applications have characteristics that make them suitable for running in the cloud, with little or no adaptation. In the following sections, we present some loosely-coupled models and show how they can take advantage of cloud features.

6.4.1 Bag-of-Tasks

A Bag-of-Tasks (BoT) application is one class of workload that is commonly used on the cloud. It consists of many independent tasks that do not communicate with each other, may depend upon one or more input files, and can be executed in any order (normally, but we can also have tasks organized following a Directed Acyclic Graph (DAG) fashion). The execution of a BoT application is considered finished only when the processing of all of its component tasks has been completed. Each task generates a set of one or more output files [15]. Figure 6.2 illustrates the BoT execution model. As seen, we have a shared space where tasks are deposited and a manager who is in charge of coordinating the access to the work pool (the bag). A computing process requests a task from the manager, computes it, and returns the results afterward. From a load-balancing viewpoint, in a particular execution snapshot, the difference between the number of tasks assigned to the slowest and the fastest process is one; logically, the latter will return to the bag many more times to request tasks. Therefore, the literature normally presents BoT as a good model for load balancing. Since each task is typically independent, tasks can be dispatched to run in parallel on multiple instances in the cloud, taking advantage of the high availability of resources seamlessly provisioned at runtime. These instances can include virtual machines, spot instances, containers, and even serverless functions [16].

Fig. 6.2 Bag-of-Tasks applications execution model. The figure shows a repository of input files, the parallel execution of independent tasks, and the resulting output (result) files

In particular, serverless is ideal for asynchronous, stateless applications that can be started instantaneously. Likewise, serverless is a good fit for use cases that see infrequent, unpredictable surges in demand. Many HPC problems fit the stateless, embarrassingly parallel compute model well suited to functions. Examples include stochastic analysis, parametric sweeps, and pricing calculations in financial risk. Also, think of a task like batch processing of incoming image files, which might run infrequently but also must be ready when a large batch of images arrives all at once. Developers can call serverless apps through APIs, which the provider handles through an API gateway. Cloud providers offer virtually unlimited resources, so the limit should be determined based on the user's budget constraint. Another concern when executing BoT applications in the cloud is the trade-off between using high-performance (and expensive) machines or a collection of cheaper machines that can improve execution parallelism. It must be analyzed for each case, since tasks of a CPU-bound application will perform best on a compute-optimized machine, while a memory-bound application may not require such expensive resources, for example. Since the input data and result files may be stored in a shared repository, cloud storage can be a valid option, since such services offer low latency, high throughput, high availability, and scalability. An example of high-performance storage is Amazon FSx for Lustre.1 The Lustre file system is optimized for data processing, with sub-millisecond latencies and throughput that scales to hundreds of gigabytes per second.

1 https://aws.amazon.com/pt/fsx/lustre/.


It is possible to access the shared storage from tens of thousands of compute instances and scale up storage capacity on demand. Examples of Bag-of-Tasks applications include Monte Carlo simulations, massive searches, image manipulation applications, data mining, and parameter sweep applications [15]. An interesting case that shows the scalability of Bag-of-Tasks applications in cloud environments is presented by Kaplan et al. [17]. Motivated by the shortage of ventilators in the Covid-19 pandemic, they developed a numerical model in Matlab aimed at improving the safety profile of ventilator splitting by accurately predicting delivered tidal volumes and pressures under a variety of clinically relevant situations. By combining the parameters of this model, a total of 270 million different simulations were created. These simulations were grouped into 146,000 jobs (CSV files), corresponding to 800,000 compute hours, that were processed in parallel over 72 h using 24,000 cores allocated in the Microsoft Azure Cloud. The authors point out that if they had used only on-premises resources (approximately 1000 cores), the total time to resolution would have delayed the ability to address the initial surge of Covid-19 cases.
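The BoT pattern itself is straightforward to express in code. The sketch below is a minimal, node-level illustration in Python using the standard concurrent.futures module: independent tasks are pulled from a shared pool and executed in any order, and the run finishes only when every task has completed. In a cloud setting, each worker would typically be a VM, container, or serverless function rather than a local process; the simulate() task is a made-up example.

# Minimal Bag-of-Tasks sketch: independent tasks, no communication, any execution order.
from concurrent.futures import ProcessPoolExecutor, as_completed


def simulate(task_id: int) -> float:
    """A made-up independent task (e.g., one Monte Carlo trial)."""
    x = 0.0
    for i in range(100_000):
        x += ((task_id * 31 + i) % 7) / 7.0
    return x


if __name__ == "__main__":
    bag = range(64)  # the "bag": 64 independent tasks
    results = {}
    # Workers pull tasks as they become free, which naturally balances the load.
    with ProcessPoolExecutor(max_workers=8) as pool:
        futures = {pool.submit(simulate, t): t for t in bag}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    print(f"finished {len(results)} tasks")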

6.4.2 Master-Slave

The Master-Slave (or Task-Farm) paradigm is a fundamental and commonly used approach for parallel and distributed applications, since it can achieve high computational speedups and an interesting degree of scalability [18]. Specifically, the paradigm distinguishes two types of actors: a master and a set of slaves, as illustrated in Fig. 6.3. The master coordinates the execution of the application by assigning and scheduling work units (tasks) to each of the slaves for processing. Upon receiving the results of the processed data packets, the master merges them to obtain a final solution to the problem. The slaves receive the tasks, process them, and return the results. We must consider that bi-directional connections are established between the master and all the slaves; however, there is no interconnection between any pair of slaves [19]. In some implementations, the master is also responsible for launching the slaves; in others, the slaves are deployed separately by the user or a specific framework. In Master-Slave, we assume that the work can be divided into a set of independent tasks that can be processed independently by the slaves. Here, the number of tasks can be equal to the number of slaves, or the work can be divided into a larger number of tasks. The task distribution among the slaves can be based on push or pull approaches. In the push-based approach, the master assigns work to the slaves without the slaves asking for it. In the pull-based approach, the master only assigns work to the slaves when they ask for it. The choice of the number of tasks and the distribution approach depends on the application and computational environment characteristics. In cloud computing, Master-Slave applications can take advantage of virtually infinite resources. Theoretically, it would be possible to assign a cloud resource (virtual machine, container, or serverless function) to each slave. However, in practice, it can be limited by budget constraints.

Fig. 6.3 Master-Slave applications execution model. The master creates tasks, launches the n slaves, sends tasks, and collects and merges the results; each slave receives a task, processes it, and sends the result back

Besides, cloud resources can be explored statically or elastically. The number of slaves (W) in a static scenario is fixed and known in advance. For example, a virtual cluster of W + 1 instances of virtual machines (VMs) or containers can be used. If the number of slaves is small, a VM with multiple vCPUs/cores is also an option. Supposing resources are homogeneous and previously allocated, the workload can be divided into W tasks, and the distribution among slaves can be done in a single step using a push-based approach. The implementation of this static application can be made using MPI (cluster), OpenMP (multiprocessor VMs) [18], or MapReduce-like frameworks [5]. In a dynamic and elastic scenario, in its turn, the cloud resources can be dynamically allocated, and the number of slaves can also vary during execution, as presented in Fig. 6.4. Elasticity can improve performance and fault tolerance, adapt costs, or help meet deadline constraints. We can start the application execution with the master and a minimal number of slaves. While the master waits for slave results, it can also check the availability of resources and launch additional slaves if needed. As tasks run out, slaves and resources can be deallocated. Considering that the number of slaves can be modified along the execution, the number of tasks must be large enough to allow all slaves to have a workload. Here, the pull-based distribution approach may be appropriate, since there is a gap between the requests for new resources and their availability. Using this approach, the slaves ask for tasks as soon as they start running. It should be noted, though, that if the tasks are too small or if there are too many slaves connected to one master, the master may become a bottleneck [20].

Fig. 6.4 Traditional Master-Slave applications (a) divide the workload among a fixed set of i slave processes. Cloud-based Master-Slave applications (b) can have additional j slave processes according to the availability of new resources due to elasticity actions

In this case, it could be necessary to allocate a VM with more extensive capabilities to host the master or to use vertical elasticity to improve its processing power. The critical problems when executing Master-Slave applications that explore, in particular, the elasticity feature of the cloud are: how to inform the application about the existence of new resources, and how to scale down the application without crashing it as a whole? Finally, elastic Master-Slave applications can be implemented using MPI-2, MapReduce, and frameworks such as Work Queue [21] and AutoElastic [22]. The Work Queue framework consists of a master program that coordinates the overall computation and multiple slaves that carry out the individual tasks. Each task consists of a standalone sub-program to run, along with a definition of the necessary input and output files for the task. Abdul-Wahid et al. [23] present an example of using this framework for task management in a protein folding application. They implemented a technique called Accelerated Weighted Ensemble (AWE) and applied it to an all-atom protein model. The experiments showed good scalability, running between 1700 and 2500 slaves simultaneously and up to three masters. Multiple computing platforms were used, including clouds (Amazon EC2, Microsoft Azure), dedicated clusters, and grids. They achieved an aggregate sampling rate of over 500 ns/h; as a comparison, a single process typically achieves 0.1 ns/h. In this context, AutoElastic [22] is a middleware that exploits data parallelism to handle iterative message-passing applications that are modeled as Master-Slave.

(a) Master process

1.  size = initial_mapping(ports);
2.  for (j=0; j< total_tasks; j++){
3.    publish_ports(ports, size);
4.    for (i=0; i< size; i++){
5.      connection_accept(slaves[i], ports[i]);
6.    }
7.    calculate_load(size, work[j], intervals);
8.    for (i=0; i< size; i++){
9.      task = create_task(work[j], intervals[i]);
10.     send_assync(slaves[i], task);
11.   }
12.   for (i=0; i< size; i++){
13.     recv_sync(slaves[i], results[i]);
14.   }
15.   store_results(slave[j], results);
16.   for (i=0; i< size; i++){
17.     disconnect(slaves[i]);
18.   }
19.   unpublish_ports(ports);
20. }

(b) Slave process

1. master = lookup(master_address, naming);
2. port = create_port(IP_address, VM_id);
3. while (true){
4.   connection_request(master, port);
5.   recv_sync(master, task);
6.   result = compute(task);
7.   send_assync(master, result);
8.   disconnect(master);
9. }

(c) Elasticity code

1.  int changes = 0;
2.  if (action == 1){
3.    changes += add_VMs();
4.  }
5.  else if (action == 2){
6.    changes -= drop_VMs();
7.    allow_consolidation(); // enabling action3
8.  }
9.  if (action == 1 or action == 2){
10.   reorganize_ports(ports);
11. }
12. size += changes;

Fig. 6.5 Application model in pseudo-language: (a) Master process; (b) Slave process; (c) elasticity code to be inserted in the Master process at PaaS level by using either method overriding, source-to-source translation or wrapper technique

In this way, the composition of the communication framework began by analyzing the traditional interfaces of MPI 1.0 and MPI 2.0. In the former, process creation is given in a static approach, where a program starts and ends with the same number of processes. MPI 2.0 provides features that enable elasticity because it offers both the dynamic creation of new processes and the on-the-fly connection with other processes in the topology. AutoElastic parallel applications are designed according to the MPMD model (Multiple Program Multiple Data), where multiple autonomous VMs simultaneously execute one type of program: master or slave. This decoupling helps to provide cloud elasticity because a specific VM template is generated for each program type to enable a more flexible scaling-out operation. Figure 6.5a presents a Master-Slave application that is supported by AutoElastic. As stated, it has an iterative behavior in which the master has a series of tasks, sequentially distributing them among the slave processes. The distribution of tasks is emphasized in the external loop of Fig. 6.5a (lines 2–20). Based on the MPI 2.0 interface, AutoElastic works with the following groups of programming directives: (1) publication of connection ports, (2) searching for a server using a specific port, (3) connection acceptance, (4) connection request, and (5) disconnection. Unlike the approach in which the master process dynamically launches other processes (using the spawn call), the proposed model operates according to the other MPI 2.0 approach for dynamic process management: point-to-point communication with Socket-like connections and disconnections.


to the master automatically. We emphasize that a program with AutoElastic does not need to follow the MPI 2.0 API; it only needs to follow the semantics of each aforesaid directive. Communications between master and slaves follow the asynchronous model, where sending operations are nonblocking and receiving operations are blocking (see lines 5 and 7 of Fig. 6.5b). The method in line 1 of the master process checks either a configuration file or arguments passed to the program to obtain the virtual machine identifiers and the IP address of each process. Based on the results, the master knows the number of slaves and creates port numbers to receive connections from each slave. The publishing of the ports occurs in the method of line 3. Programs with an outer loop are convenient for establishing elasticity because, at the beginning of an iteration, it is possible to make resource reconfigurations without changing either the application syntax or its semantics. The transformation of the application shown in Fig. 6.5 into an elastic one is performed at the PaaS level by applying one of the following methods: (1) in an object-oriented implementation, overriding the publish_ports() method for elasticity management; (2) use of a source-to-source translator that inserts the elasticity code between lines 3 and 4 in the master code; (3) development of a wrapper for procedural languages that changes the method in line 3 of the master code transparently. The additional code for enabling elasticity checks the shared directory to determine whether there is any new data from the AutoElastic Manager (see Fig. 6.5c). In the case of Action1, the manager has already set up new VMs, and the application can add the new slaves' data to the slave set. In case Action2 takes place, the application understands the order from the manager, reduces the number of VMs (and consequently, the number of processes in the parallel application), and triggers Action3. Although the initial focus of AutoElastic is on iterative Master-Slave applications, the use of MPI 2.0-like directives makes the inclusion of new processes and the reassembly of arbitrary communication topologies easier. At the implementation level, it is possible to optimize the connections if a process remains in the list of active processes. This optimization is pertinent over TCP networks, whose three-way handshake protocol is a known source of overhead when connecting two endpoints.
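The connection-oriented directives above map naturally onto the MPI-2 dynamic process management calls. The sketch below, written with the mpi4py binding, is an illustrative assumption rather than AutoElastic's actual code: the port name is exchanged through a hypothetical shared file (a name service such as MPI_Publish_name could be used instead), and the task payload and computation are placeholders.

from mpi4py import MPI

PORT_FILE = "/shared/master_port.txt"   # hypothetical shared location (e.g., NFS)

def master_accept_one_slave():
    # "Publish" a port: open it and expose its name through the shared file
    port = MPI.Open_port()
    with open(PORT_FILE, "w") as f:
        f.write(port)
    inter = MPI.COMM_SELF.Accept(port)               # blocking connection accept
    inter.send({"work": [1, 2, 3]}, dest=0, tag=0)   # hand a task to the slave
    result = inter.recv(source=0, tag=1)             # blocking receive of its result
    inter.Disconnect()
    MPI.Close_port(port)
    return result

def slave_loop():
    # Executed automatically when a new VM/container boots
    with open(PORT_FILE) as f:
        port = f.read().strip()
    inter = MPI.COMM_SELF.Connect(port)              # connection request to the master
    task = inter.recv(source=0, tag=0)
    result = sum(task["work"])                       # placeholder for the real computation
    inter.send(result, dest=0, tag=1)
    inter.Disconnect()

In a deployment following the semantics of Fig. 6.5, the master would repeat the accept/send/receive/disconnect cycle once per slave and per iteration, while each newly booted VM would simply run slave_loop().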

6.4.3 Pipeline The pipeline paradigm is based on a functional decomposition approach. The problem is divided into a series of tasks, called stages, that must be completed one after the other [24]. Each stage receives data from the preceding stage, carries out a subset of the original operations, and sends the results to the next stage, as illustrated in Fig. 6.6. The key word for this application model is cadence, since an acceptable degree of synchronism is needed to chain one stage's output into the next one's input. When the stages are highly heterogeneous, pipeline performance commonly degrades: some stages become bottlenecks while others remain empty, waiting for incoming data.


Fig. 6.6 Pipeline applications execution model: input data flows from Stage 1 through Stage N, with each stage receiving data, processing its task, and sending data to the next stage until the result is produced

Thus, the parallelism degree in a pipeline is limited by the number of stages that can be executed simultaneously. A larger number of stages allows a higher level of parallelism; however, the data transfer between stages introduces overhead, and there are more chances of disparities in the stages' performance (which impacts, as said previously, the pipeline cadence). First, from the communication viewpoint, it is essential to divide the stages in such a way that the work done by a stage is large compared to the communication overhead [25]. Second, from the computing viewpoint, we should avoid situations where some resources remain idle until all the stages are occupied with useful work. This situation is called pipeline filling and is illustrated in Fig. 6.7, where we can observe that full parallelism is achieved only after the fourth application instance enters the pipeline and we have four stages executing simultaneously. As the task instances end (pipeline draining), the degree of parallelism decreases and the resources become idle. In this scenario, cloud elasticity can be helpful, since new computing resources can be dynamically allocated for a particular stage during pipeline filling and deallocated in the pipeline draining phase. We can replicate the overloaded stages when the application throughput is lower than expected, so scaling-out actions for VMs or containers can take place (see Fig. 6.8). The exact number of additional replicas needed to resolve a performance problem is hard to determine, so a heuristic of adding replicas one by one and measuring the performance impact is a good option. Regarding implementation details on the cloud, the most natural approach is to assign one (or more) VM/container to each pipeline stage and to implement each connection between successive stages as a sequence of messages. Also, a load balancer is helpful to dispatch an incoming request for a stage to a particular VM/container replica. The MPI API provides all mechanisms needed for pipeline

Fig. 6.7 Pipeline applications execution model: pipeline filling, full parallelism, and pipeline draining phases as successive application instances traverse the stages

Fig. 6.8 Traditional pipeline applications (a) divide the data processing among a fixed set of n processes. Cloud-based pipeline applications (b) can have additional processes in each stage according to the stages' processing demands and the availability of new resources

application construction. Nevertheless, some frameworks can also be employed, e.g., Taskflow [26]² and Pipel [27]. In summary, an essential aspect of this parallel application structure is the load imbalance among stages. Significant differences in the computational effort of the pipeline stages will create bottlenecks, and the pipeline's throughput will

2 https://taskflow.github.io/taskflow/ParallelPipeline.html.


be determined by the slowest stage [28]. To mitigate the problem, we can insert bubbles along the stages or include large buffers in front of each stage [27]. Another alternative to ensure load balancing is allocating cloud resources according to the demands of each stage. For example, memory-bound stages can be deployed in large-memory VMs, while a processing-intensive stage can be executed in a VM with many vCPUs or even a GPU. In this case, parallelism can be explored at two levels: pipeline and intra-stage. It is then necessary to parallelize each stage's algorithm according to the characteristics of the VM's resources. Meyer et al. [27] describe a pipeline application implemented using the Pipel framework that leverages dynamic allocation of resources. In this application, a sequence of three image processing operations (greyscale conversion, colour reversal, and thresholding) is applied to a set of images. After passing through the three stages, an image is completely transformed and the task can be classified as done. Pipel is employed to manage and rearrange the resources; thus, it is possible to allocate or consolidate VMs according to the specific needs of each stage of the application. The results show that it is possible to reduce the application execution time by up to 38% when using the elastic resources provided by the cloud.
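As a concrete illustration of the stage/message structure described above, the sketch below wires three stages together with Python's multiprocessing queues. It is a minimal, self-contained sketch: the stage functions and queue sizes are placeholders, not Pipel's API, and in a cloud deployment each stage would typically become a VM/container and each queue a message channel behind a load balancer.

import multiprocessing as mp

SENTINEL = None  # marks the end of the stream

def stage(fn, q_in, q_out):
    """Generic pipeline stage: receive, process, send, until the stream ends."""
    while True:
        item = q_in.get()
        if item is SENTINEL:
            q_out.put(SENTINEL)  # propagate shutdown downstream
            break
        q_out.put(fn(item))

# Placeholder stage functions standing in for greyscale, reversal, thresholding
def greyscale(x):  return ("grey", x)
def reversal(x):   return ("rev", x)
def threshold(x):  return ("thr", x)

if __name__ == "__main__":
    q0, q1, q2, q3 = (mp.Queue(maxsize=8) for _ in range(4))
    workers = [
        mp.Process(target=stage, args=(greyscale, q0, q1)),
        mp.Process(target=stage, args=(reversal,  q1, q2)),
        mp.Process(target=stage, args=(threshold, q2, q3)),
    ]
    for w in workers:
        w.start()
    for image_id in range(10):   # feed the pipeline (pipeline filling)
        q0.put(image_id)
    q0.put(SENTINEL)             # start pipeline draining
    done = []
    while True:
        r = q3.get()
        if r is SENTINEL:
            break
        done.append(r)
    for w in workers:
        w.join()
    print(len(done), "images processed")

Replicating an overloaded stage, as in Fig. 6.8b, amounts to starting several processes that read from the same input queue, which is the role a load balancer plays between containers in the cloud version.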

6.4.4 Divide-and-Conquer In the divide-and-conquer (DaC) approach, the problem at hand is divided into smaller subproblems, and each part is solved independently. By repeatedly dividing the subproblems into even smaller subproblems, we may eventually reach a stage where the subproblem is simple enough to be solved directly. These smallest subproblems are then solved in parallel. Finally, the results of the subproblems are merged to obtain the solution for the whole problem [29]. A DaC application is organized in a virtual tree structure, in which some of the processes split the workload, create subtasks, and combine the received results to produce the final result. When each division creates two parts, we obtain a binary tree, as illustrated in Fig. 6.9. However, divide-and-conquer can also be applied where the workload is divided into more than two parts at each stage [18]. This application model can be seen as a recursive procedure. When receiving a request, each node proceeds in the following way: if it can compute the task, it does so and passes the result to its parent node; if the incoming task is too heavy, it creates an arbitrary number of children, divides the original task among the new entities, and passes each of them its respective computing request. DaC applications can be implemented on both shared- and distributed-memory architectures. For instance, OpenMP nested parallelism or OpenMP tasks can be used to construct DaC applications on multiprocessor machines [18]. For distributed-memory systems, it is possible to use MPI, Cilk, MapReduce, and other frameworks/APIs that support the construction of subtasks that are executed in parallel.
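A minimal sketch of this recursive procedure using Python's concurrent.futures (an illustrative assumption, not one of the frameworks cited above): the top level divides the workload into one child task per worker, the children recurse until the subproblem is small enough to solve directly, and the partial results are merged on the way back, mirroring the division, parallel solution, and merging phases of Fig. 6.9. The threshold and the sum workload are placeholders.

from concurrent.futures import ProcessPoolExecutor

THRESHOLD = 4  # hypothetical cutoff: below this size, solve directly

def merge(left, right):
    return left + right                    # combine children results

def solve(chunk):
    """Recursive divide-and-conquer executed inside a single worker."""
    if len(chunk) <= THRESHOLD:
        return sum(chunk)                  # leaf: solve directly (placeholder workload)
    mid = len(chunk) // 2
    return merge(solve(chunk[:mid]), solve(chunk[mid:]))

def dac_parallel(data, workers=4):
    """Top-level division: one child task per worker, solved in parallel,
    followed by the merging phase."""
    step = max(1, len(data) // workers)
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(solve, chunks))
    result = partials[0]
    for p in partials[1:]:
        result = merge(result, p)
    return result

if __name__ == "__main__":
    print(dac_parallel(list(range(1000))))  # prints 499500

In an elastic cloud deployment, each submitted chunk could instead be mapped to a freshly allocated container, which is released as soon as its partial result is merged.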

Fig. 6.9 In the DaC model, a problem is divided into two or more subproblems (division phase). Each subproblem is solved in parallel (parallel solution) and the results are combined to give the final result (merging phase)

Considering that the solving phase of each minimal subproblem can be executed in parallel, ideally each task should be assigned to a different processor/core. However, as the number of subproblems can grow exponentially, this is not always possible on computational platforms with a fixed number of processors. Another approach consists of stopping the recursion when the number of active subproblems equals the number of processors; nevertheless, this approach limits the parallelism degree. In this sense, the cloud can provide the necessary resources to allocate all tasks of a DaC application. Besides, these resources can be allocated elastically, starting with a single resource for the tree root and, at each workload division, creating additional resources to map the new processes. As the results are merged, the over-provisioned resources can be deallocated. Containers are especially recommended in this case, since their startup time is much shorter than that of VMs. A drawback of this application model is load balancing. Ideally, execution time is minimized if the workload can be subdivided in a way that perfectly matches


the work distribution for the solution of the corresponding subproblems. However, in some cases the resulting subproblems can present different workloads, and additional subproblems have to be created, causing an unbalanced tree. For example, when creating two children, it is sometimes not easy to divide the incoming request precisely in half. Also, using heterogeneous resources (VMs or containers with different numbers of vCPUs, for example) is challenging when dividing and distributing requests. Traditionally, techniques such as work-stealing using shared queues and over-partitioning are implemented to provide load balancing [30]. Nevertheless, considering that these applications can run in the cloud, elastic resource allocation can help provide the most appropriate resources for each subtask to guarantee (at least approximately) an equilibrium of execution among the children nodes. An example of a problem that can take advantage of divide-and-conquer is the Barnes-Hut (BH) algorithm for the N-body problem [31]. The N-body problem aims to compute the states of N bodies (or particles) at a time T, given their initial states (velocities and positions). In BH, the simulation domain is partitioned into cubic cells via an octree (or squares in 2D spaces using quadtrees), so that only particles from nearby cells need to be treated individually, while particles in distant cells can be treated as a single large particle centered at the cell's center of mass. This can drastically reduce the number of particle-pair interactions that need to be calculated. Parallel N-body simulation based on BH consists of splitting the force calculations for all particles among different processors so that they can run in parallel. Augustyn et al. [32] describe a Barnes-Hut solution to the N-body problem using storage and computational resources provided by the Azure platform. Later, Katsogridakis et al. [33] presented a Barnes-Hut N-body simulation implemented using the Apache Spark MapReduce engine. Although the application has not been tested in cloud environments, Spark applications can be ported to this type of environment with little effort.

6.5 Tightly-Coupled HPC Applications for Cloud The notion of a tightly coupled HPC system was originally linked to multiprocessor architectures: a multiprocessor is called tightly coupled if it has shared memory. The communication bandwidth is on the order of the memory bandwidth, and a tightly coupled multiprocessor is often called a "shared-memory" multiprocessor. This kind of organization has a high degree of interaction between tasks. With the advances in HPC, tightly-coupled concepts were extended to architectures composed of separate processors and memory modules interconnected via a multistage switch. Thus, this model can also be explored on distributed-memory machines such as clusters and clouds. As the need for higher data rates and bandwidths becomes more and more demanding, especially with the addition of heterogeneous resources to everyday tasks and applications, tightly coupled workloads at the core of HPC have been introduced into cloud computing. These tasks are


increasingly dependent on each other, utilize a common shared memory, and have enormous data rate and transfer needs. A tightly coupled workload requires interprocess communication patterns that rely on high bandwidth and low latency to maintain optimal performance. Given these demanding requirements, this characteristic can lead to a significantly reduced number of tasks that can be executed.

6.5.1 Bulk-Synchronous Parallel Bulk-Synchronous Parallel (BSP) is a parallel programming model designed for applications that execute on homogeneous and dedicated resources. However, we can also use it on grid computing and cloud infrastructures. Since this is a tightly-coupled model, elasticity through scaling-in and scaling-out operations is not easy, because there is a high degree of arbitrary communication among the processes. Since the computing units in the cloud are VMs or containers, a cloud feature that can be explored here is migration: all processes inside a VM/container can be migrated from one physical machine to another. A relevant idea of a rescheduling model refers to the treatment of pertinent information for process migration in dynamic and possibly heterogeneous environments. Firstly, the primary metric for performing load balancing in distributed systems is the computational load of the nodes, or the computational time spent by each process/task to execute a set of instructions. Figure 6.10 illustrates the BSP functioning, together with the benefits of applying process rescheduling to minimize the remaining supersteps' times. To demonstrate a possible problem of only looking at the CPU metric for load balancing, we can create a hypothetical infrastructure with two clusters and a migration situation between them. We model two clusters, Cluster1 and Cluster2. Both have ten nodes, each one with a single processor. In addition, each cluster has a Gigabit Ethernet connection for intra-cluster communication. Their interconnection comprises links through the Internet with a mean capacity of 10 Mbits/s. Cluster1 has nodes with 500 MHz processors, while the second cluster has nodes with a capacity of 1 GHz. Suppose that the initial process-resource assignment maps all six application processes onto Cluster1, each process on a different node. Concerning this, we can design a possible scenario of computation-based process rescheduling where process p1 is chosen for migration from Cluster1 to Cluster2. This decision considers that this process is the slowest one and that it could run two times faster on Cluster2. After the migration, we could have a worse scenario, since all communications to process p1 must go through the low-bandwidth link between the clusters. Therefore, it is essential that load balancing and rescheduling strategies for cloud and multi-cluster environments consider the communication speed between the sites. One of the main challenges of process migration is to choose which process(es) should migrate. In this context, we can apply the approach presented in Fig. 6.11. Here, D informs the percentage of how far the time of the slowest and the fastest processes can be from the average. If there are processes in the unbalancing space,


Fig. 6.10 Observation of supersteps in different situations. (a) Superstep k: Processes are not balanced among the resources. (b) Superstep > k: Situation after applying the processes reassignment model

Fig. 6.11 Analysis of both balancing and unbalancing situations which depend on the distance D from the average time A

they can be selected for migration at the end of a superstep. Thus, the next superstep can start faster. As said earlier, a pertinent migration strategy for BSP applications


should consider communication and computation-related metrics. In addition, it is also important to take the weight of the process into consideration: we cannot disregard the amount of memory already allocated by the process and the instruction code itself, since they will be serialized and transferred through the network between two endpoints.
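A minimal sketch of the selection heuristic described above: processes whose superstep time falls outside a band of D around the average are migration candidates, ranked by an estimated migration cost that accounts for communication and process weight. The metric names, the cost model, and the bandwidth value are illustrative assumptions, not the exact formulation used by any specific middleware.

from dataclasses import dataclass

@dataclass
class ProcStats:
    pid: int
    compute_time: float   # seconds spent computing in the last superstep
    comm_bytes: float     # bytes exchanged in the last superstep
    memory_bytes: float   # state that would be serialized on migration

def migration_candidates(stats, D=0.2, link_bw=1.25e6):
    """Return processes outside the +/- D band around the average superstep time,
    sorted so that the cheapest-to-migrate come first. link_bw is the assumed
    inter-site bandwidth in bytes/s (10 Mbit/s here)."""
    avg = sum(s.compute_time for s in stats) / len(stats)
    lo, hi = avg * (1 - D), avg * (1 + D)
    unbalanced = [s for s in stats if s.compute_time < lo or s.compute_time > hi]
    # Estimated cost of moving a process: its state plus pending communication
    cost = lambda s: (s.memory_bytes + s.comm_bytes) / link_bw
    return sorted(unbalanced, key=cost)

if __name__ == "__main__":
    stats = [ProcStats(0, 1.0, 1e5, 5e7), ProcStats(1, 2.4, 8e5, 2e8),
             ProcStats(2, 1.1, 1e5, 4e7), ProcStats(3, 0.5, 2e5, 3e7)]
    for s in migration_candidates(stats):
        print("candidate for migration:", s.pid)

Run at the barrier that closes a superstep, such a check lets migrations take effect before the next superstep starts, as discussed for Fig. 6.10.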

6.6 Discussion and Open Challenges on HPC-Oriented Cloud Applications After presenting a vision regarding high-performance computing application models and their adherence to cloud computing features, here we summarize some pertinent aspects. They can be seen below: • Changing or not the application to fit cloud features—The cloud can be understood as a distributed system where a collection of networked resources is delivered. Features like cost benefits, rapid deployment, and automatic backup are helpful and desirable but do not affect the code of parallel applications. Also, the possible loss of performance on network operations in the cloud is not significant, even for IO-bound applications (those applications where the filesystem or the databases are accessed through the network, for example). Resource elasticity, on the other hand, is a critical concern in extracting the advantages of cloud computing for the HPC world, and it is not trivial to glue these two aspects together. Typically, a third-party manager coordinates actions between resource provisioning and the application code, so some modifications in the HPC application are usually required. Another fine-tuning possibility is to insert elasticity directives (the API of a particular middleware) along the application code to manage resource provisioning by hand. A third solution, implemented by a middleware denoted AutoElastic, consists of wrappers that insert elasticity actions in the application at compilation time, providing dynamic resource provisioning effortlessly from the user's viewpoint. Finally, elasticity is much easier to manage in loosely-coupled applications since, when adding new resources, we only need to establish a network connection between two processes (a master and a slave, a parent and a child, a slave and the work pool, for example). • Performance on virtualized resources—In the first years of cloud computing, it was known that cloud resources did not fit HPC demands. This was true basically up to the mainstream use of containers. The state of the art today uses Docker containers, where we have an image that is lightweight and standalone. This executable package of software includes everything needed to run an application: code, runtime, system tools, system libraries, and settings. A container is managed using the Docker API or CLI. Docker is a set of platform-as-a-service (PaaS) products that use OS-level virtualization to deliver software in packages called containers. Docker can package an application and its dependencies in a virtual container that can run on any Linux, Windows,


or macOS computer. This feature enables the application to run in various locations, such as on-premises or in a public or private cloud. As virtualization has become mainstream in modern data centers, many businesses and IT decision-makers are contemplating its potential benefits in the near term and in the future. For instance, AWS supports two types of virtualization for computing instances: Para Virtualization (PV) and Hardware-assisted Virtual Machine (HVM). Every physical machine has a hypervisor running on it; a Xen hypervisor, for example, supports both virtualization strategies above. In addition to computing, the advantages of virtualization for networks are vast: such networks offer higher operational speed, cost savings, flexibility, and agility, and cloud providers have incorporated this kind of virtualization as mainstream. • Performance isolation—Although they provide a degree of isolation, virtual machines and containers do not provide complete performance isolation. In other words, a CPU-hungry container can interfere with a neighboring container's performance when both run on the same physical resource. Considering that even a minimal delay can become a significant bottleneck for HPC applications, the authors suggest that developers/users analyze the SLA (Service Level Agreement) carefully when launching parallel applications in the cloud. Xen, KVM, and VMware virtualization engines have solved many performance-related problems; however, it is common sense in the literature that this point must be observed when migrating from on-premise clusters to virtualized data centers. • Best application models to run in the cloud—In our understanding, Master-Slave and Bag-of-Tasks represent the most used and affordable models to run HPC applications in the cloud. Both have a straightforward communication structure, favoring the increment or decrement of resources and facilitating the adaptation of the current set of processes. Also, load balancing is straightforward in both models: (1) a master adapts its workload among the existing slaves; (2) slaves access the bag, collecting a single task at a time. Finally, we agree that in both cases it is easy to adapt an application to run in the cloud and to develop applications with these models from scratch. Taking into account the current state of the art and novel frameworks and technologies at the intersection between cloud computing and HPC applications, below we present some future trends and research opportunities: • Lightweight AI-driven resource provisioning solutions—We envisage the use of more and more AI in scheduling and load balancing solutions, in such a way that prediction, pattern recognition, on-the-fly classification, and event correlation can help to optimize performance decisions. The challenge here consists of addressing the time-consuming procedures related to AI, so it is possible to use the idea of train once, use many times, and to combine AI with fast heuristic approaches. The recent articles published by Jiang et al. [34] and Yadav et al. [35] confirm this trend, in which machine learning techniques are used to provide efficient resource provisioning. Figure 6.12 depicts an example where we can use AI solutions to combine multiple objectives for running HPC applications in fog and cloud architectures; a minimal sketch of a simple threshold-based check is shown after this list of trends.

Fig. 6.12 Possibility to use AI-based solutions to improve not only performance, but also energy and budget when running HPC applications in the cloud. In (a), we have an application that executes inside predefined load thresholds using reactive rules (scaling out adds resources and energy consumption but reduces system load; scaling in does the opposite). In (b), we have the AI-based control that keeps the application inside the thresholds taking into account several objective criteria

Fig. 6.13 Combination of Function as a Service (FaaS) and traditional reactive horizontal elasticity (a container orchestration layer handling long-running requests, with a FaaS layer absorbing short-running ones) to execute HPC applications

• Blending cloud elasticity methodologies—Combining vertical and horizontal elasticity in fog and cloud architectures is possible. First, vertical elasticity can be used up to the limit of a physical machine; if more performance is needed, horizontal elasticity can allocate additional virtual resources. This combination can be of interest for energy consumption and financial cost reduction, since we first allocate containers and demands to a particular physical machine, using it efficiently before allocating newer ones. Also, the combination of Serverless, or FaaS (Function as a Service), with the traditional approaches of cloud elasticity could be interesting for executing irregular applications, since FaaS is best suited for short-running tasks, while the allocation of servers handles long-running demands better, as illustrated in Fig. 6.13. • Efficient use of heterogeneous architectures—With accelerator technologies (e.g., GPU and FPGA), system architectures have started on a clear trend towards increased parallelism and heterogeneity. However, resource allocation frameworks targeting heterogeneous architectures have not achieved their full potential, which would require orchestrating all the different resources together and dynamically selecting the most suitable resource configuration for each application (or application stage).


• Exploration of promising programming languages—We envisage the use of powerful programming languages such as Go and Elixir/Erlang as an HPC trend. Both have built-in features to handle high-performance computing, including mutual exclusion directives, efficient send/receive mechanisms, data replication, scalability support, and compatibility with both distributed- and shared-memory-based architectures. We also observe the growing usage of these programming languages together with container orchestration tools like K3s and Kubernetes to address efficient scheduling and elasticity proposals. • Effortless ways of exploring high-performance computing—As a trend, we observe the exploration of automatic approaches to make already developed HPC applications run even faster. The idea here is to explore programming libraries and compilers that change the code of functions to insert resource management directives transparently from the user's viewpoint. Thus, it is possible to transform a non-elastic MPI application into an elastic one simply by compiling it with a particular MPI elasticity-based library, enabling the code to exploit the benefits of cloud computing. • Data redistribution mechanisms—The challenge of designing adaptivity techniques for some classes of applications (e.g., SPMD and domain decomposition ones) is not simply to modify the number of processes that the application is running on according to the availability of resources. Reconfiguration actions involve redistributing the data across the new processes (which may cause load imbalance) and modifying the communication patterns. In this sense, user-transparent data redistribution mechanisms are needed to enable a broad class of applications to use dynamic resources efficiently.
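Returning to the reactive thresholds of Fig. 6.12a, a minimal sketch of such a check is shown below; the threshold values and the load metric are illustrative assumptions, and an AI-based controller as in Fig. 6.12b would replace this fixed rule with a learned, multi-objective policy.

def scaling_decision(load, lower=0.3, upper=0.7):
    """Reactive thresholds: load is a fraction of the monitored system capacity
    (1.0 = fully loaded); lower/upper are illustrative bounds."""
    if load > upper:
        return "scale out"   # add VMs/containers: more resources and energy, less load per node
    if load < lower:
        return "scale in"    # release resources: less energy, higher load per node
    return "keep"

if __name__ == "__main__":
    for observed in (0.85, 0.5, 0.2):
        print(observed, "->", scaling_decision(observed))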

6.7 Conclusion As technologies like the Internet of Things (IoT), artificial intelligence (AI), and 3-D imaging evolve, the size and amount of data that organizations have to work with are growing exponentially. In this context, we also envisage that the HPC area is constantly evolving and growing, addressing not only scientific applications but also the aforementioned demands and others such as streaming events, testing of new products, and stock trend analysis. To keep ahead of the competition, organizations need lightning-fast, highly reliable IT infrastructure to process, store, and analyze massive amounts of data. Here, the combination of HPC and cloud shows its high value. For the following years, we envisage the increasing use of cloud computing to run HPC demands, enabling the use of elasticity and the allocation of specific hardware, including caching requirements, GPU devices, QoS definitions, and specifications of libraries and dependencies. Also, to the best of our knowledge, we envisage two hot topics for the following years. First, effortlessness, automation, and transparency are key drivers for exploring HPC on virtualized architectures, enabling developers to run their demands faster with minimal code changes. Second, aligned


with the trends of Top500.org, we will increasingly have programs that extract the simultaneous power of multiple architectures. Thus, new middleware and libraries should automatically deploy code on many layers (multicomputer, multiprocessor, and accelerators, for instance) with minimal effort. Acknowledgments The authors would like to thank the following Brazilian funding entities: FAPERGS (process 21/2551-0000118-6), CAPES (process 88881.310440/2018-01) and CNPq (process 305263/2021-8).

References
1. Stefan Kehrer and Wolfgang Blochinger. A survey on cloud migration strategies for high performance computing. In Proceedings of the 13th Advanced Summer School on Service-Oriented Computing, pages 57–69. IBM Research Division, 2019.
2. Guilherme Galante, Luis Carlos Erpen De Bona, Antonio Roberto Mury, Bruno Schulze, and Rodrigo Rosa Righi. An analysis of public clouds elasticity in the execution of scientific applications: A survey. J. Grid Comput., 14(2):193–216, June 2016.
3. Christoph Fehling, Frank Leymann, Ralph Retter, Walter Schupeck, and Peter Arbitter. Cloud Computing Patterns: Fundamentals to Design, Build, and Manage Cloud Applications. Springer Publishing Company, Incorporated, 2014.
4. Stefan Kehrer and Wolfgang Blochinger. Migrating parallel applications to the cloud: assessing cloud readiness based on parallel design decisions. SICS Softw.-Intensive Cyber Phys. Syst., 34(2–3):73–84, 2019.
5. Geoffrey C. Fox and Dennis Gannon. Using clouds for technical computing. In High Performance Computing Workshop (1), volume 24 of Advances in Parallel Computing, pages 81–102. IOS Press, 2012.
6. Guilherme Galante and Rodrigo da Rosa Righi. Exploring cloud elasticity in scientific applications. In Nick Antonopoulos and Lee Gillam, editors, Cloud Computing - Principles, Systems and Applications, Second Edition, Computer Communications and Networks, pages 101–125. Springer, 2017.
7. Emanuel Ferreira Coutinho, Flávio Rubens de Carvalho Sousa, Paulo Antonio Leal Rego, Danielo Goncalves Gomes, and José Neuman de Souza. Elasticity in cloud computing: a survey. Ann. des Télécommunications, 70(7–8):289–309, 2015.
8. Yahya Al-Dhuraibi, Fawaz Paraiso, Nabil Djarallah, and Philippe Merle. Elasticity in cloud computing: State of the art and research challenges. IEEE Transactions on Services Computing, 11(2):430–447, 2018.
9. Stefan Kehrer and Wolfgang Blochinger. Elastic parallel systems for high performance cloud computing: State-of-the-art and future directions. Parallel Processing Letters, 29(02):1950006, 2019.
10. Thilina Gunarathne, Tak-Lon Wu, Jong Youl Choi, Seung-Hee Bae, and Judy Qiu. Cloud computing paradigms for pleasingly parallel biomedical applications. Concurrency and Computation: Practice and Experience, 23(17):2338–2354, 2011.
11. Eunji Hwang, Suntae Kim, Tae-kyung Yoo, Jik-Soo Kim, Soonwook Hwang, and Youngri Choi. Resource allocation policies for loosely coupled applications in heterogeneous computing systems. IEEE Transactions on Parallel and Distributed Systems, 27(8):2349–2362, 2016.
12. Mohamed Ben Belgacem and Bastien Chopard. A hybrid HPC/cloud distributed infrastructure: Coupling EC2 cloud resources with HPC clusters to run large tightly coupled multiscale applications. Future Generation Computer Systems, 42:11–21, 2015.


13. Marco A. S. Netto, Rodrigo N. Calheiros, Eduardo R. Rodrigues, Renato L. F. Cunha, and Rajkumar Buyya. HPC cloud for scientific and business applications: Taxonomy, vision, and research challenges. ACM Comput. Surv., 51(1), Jan 2018.
14. Sulav Malla and Ken Christensen. HPC in the cloud: Performance comparison of function as a service (FaaS) vs infrastructure as a service (IaaS). Internet Technology Letters, 3(1):e137, 2020.
15. Hermes Senger and Fabrício Alves Barbosa da Silva. Bounds on the scalability of bag-of-tasks applications running on master-slave platforms. Parallel Processing Letters, 22(02):1250004, 2012.
16. Long Thai, Blesson Varghese, and Adam Barker. A survey and taxonomy of resource optimisation for executing bag-of-task applications on public clouds. Future Generation Computer Systems, 82:1–11, 2018.
17. Michael Kaplan, Charles Kneifel, Victor Orlikowski, James Dorff, Mike Newton, Andy Howard, Don Shinn, Muath Bishawi, Simbarashe Chidyagwai, Peter Balogh, and Amanda Randles. Cloud computing for covid-19: Lessons learned from massively parallel models of ventilator splitting. Computing in Science & Engineering, 22(6):37–47, 2020.
18. Paweł Czarnul. Parallel Programming for Modern High Performance Computing Systems. CRC Press, USA, 2018.
19. Mohammad Hammoud and Majd F. Sakr. Distributed programming for the cloud: Models, challenges, and analytics engines. In Sherif Sakr and Mohamed Gaber, editors, Large Scale and Big Data, pages 1–38. Auerbach Publications, Boca Raton, Florida, 2014.
20. Lucas Baldo, Leonardo Brenner, Luiz Gustavo Fernandes, Paulo Fernandes, and Afonso Sales. Performance models for master/slave parallel programs. Electronic Notes in Theoretical Computer Science, 128(4):101–121, 2005. Proceedings of the First International Workshop on Practical Applications of Stochastic Modelling (PASM 2004).
21. Dinesh Rajan, Anthony Canino, Jesus A. Izaguirre, and Douglas Thain. Converting a high performance application to an elastic cloud application. In 2011 IEEE Third International Conference on Cloud Computing Technology and Science, CLOUDCOM '11, pages 383–390, USA, 2011. IEEE Computer Society.
22. Rodrigo da Rosa Righi, Vinicius Facco Rodrigues, Cristiano André da Costa, Guilherme Galante, Luis Carlos Erpen De Bona, and Tiago C. Ferreto. AutoElastic: Automatic resource elasticity for high performance applications in the cloud. IEEE Trans. Cloud Comput., 4(1):6–19, 2016.
23. B. Abdul-Wahid, L. Yu, D. Rajan, H. Feng, E. Darve, D. Thain, and J. A. Izaguirre. Folding proteins at 500 ns/hour with work queue. In 2012 IEEE 8th International Conference on E-Science (e-Science), pages 1–8, Los Alamitos, CA, USA, Oct 2012. IEEE Computer Society.
24. Barry Wilkinson and Michael Allen. Parallel programming - techniques and applications using networked workstations and parallel computers. Pearson Education, 1998.
25. Michael McCool, James Reinders, and Arch Robison. Structured Parallel Programming: Patterns for Efficient Computation. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 2012.
26. Tsung-Wei Huang, Dian-Lun Lin, Chun-Xun Lin, and Yibo Lin. Taskflow: A lightweight parallel and heterogeneous task graph computing system. IEEE Transactions on Parallel and Distributed Systems, 33(6):1303–1320, 2022.
27. Vinicius Meyer, Vinicius Facco Rodrigues, Rodrigo da Rosa Righi, Cristiano André da Costa, Guilherme Galante, and Cristiano Bonato Both. Pipel: exploiting resource reorganisation to optimise performance of pipeline-structured applications in the cloud. Int. J. Computational Systems Engineering, 5(1), 2019.
28. Andreu Moreno, Anna Sikora, Eduardo César, Joan Sorribes, and Tomàs Margalef. HeDPM: Load balancing of linear pipeline applications on heterogeneous systems. J. Supercomput., 73(9):3738–3760, Sep 2017.
29. Marco Danelutto, Tiziano De Matteis, Gabriele Mencagli, and Massimo Torquati. A divide-and-conquer parallel pattern implementation for multicores. In Proceedings of the 3rd International Workshop on Software Engineering for Parallel Systems, SEPS 2016, pages 10–19, New York, NY, USA, 2016. Association for Computing Machinery.


30. Mattias V. Eriksson, Christoph W. Keßler, and Mikhail Chalabine. Load balancing of irregular parallel divide-and-conquer algorithms in group-SPMD programming environments. In ARCS Workshops, volume P-81 of LNI, pages 313–322. GI, 2006.
31. Barry Wilkinson. Grid Computing: Techniques and Applications. CRC Press, Boca Raton, FL, 1st ed. edition, 2009.
32. Dariusz Rafał Augustyn and Łukasz Warchał. Cloud service solving n-body problem based on windows azure platform. In Andrzej Kwiecień, Piotr Gaj, and Piotr Stera, editors, Computer Networks, pages 84–95, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
33. Pavlos Katsogridakis, Sofia Papagiannaki, and Polyvios Pratikakis. Execution of recursive queries in Apache Spark. In Francisco F. Rivera, Tomás F. Pena, and José C. Cabaleiro, editors, Euro-Par 2017: Parallel Processing, pages 289–302, Cham, 2017. Springer International Publishing.
34. Yuang Jiang, Murali Kodialam, T. V. Lakshman, Sarit Mukherjee, and Leandros Tassiulas. Resource allocation in data centers using fast reinforcement learning algorithms. IEEE Transactions on Network and Service Management, 2021.
35. Mahendra Pratap Yadav, Rohit, and Dharmendra Kumar Yadav. Resource provisioning through machine learning in cloud services. Arabian Journal for Science and Engineering, 2021.

Chapter 7

Exploiting Hardware Accelerators in Clouds Cristiano A. Künas, Matheus S. Serpa, and Philippe O. A. Navaux

7.1 Introduction Cloud Computing service providers, such as Amazon Web Services (AWS), Microsoft Azure (Azure), and Google Cloud Platform (GCP), offer many types of computing services [1, 2, 5]. In the most basic model, known as infrastructure as a service (IaaS), the user can hire virtual machines (VMs), storage devices, and network connectors and pay only for their use. Virtual machines are generally billed by the time they remain on (US$/h), while storage devices are billed by the occupied space and the time of use (US$/GB/h) [10]. The types of computing resources hired can affect both the cost and the total processing time of a high-performance application in Cloud Computing. In this context, the computing resources with the highest price (in US$/h) have better configurations and tend to offer the best performance. On the other hand, because the total cost of processing depends on price, processing time, and network performance, the lowest-priced resource is not always the one that offers the lowest cost of processing [9]. With the increasing use of Artificial Intelligence in several applications, accelerators are essential to process algorithms in a reasonable time. Driven by this challenge, computer architectures have been changing from homogeneous to heterogeneous in recent years. In these heterogeneous systems, more and more different architectures are used, especially GPUs, but other options such as FPGAs or vector processor chips can also be used. Currently, computers are composed of x86 processors with GPUs and/or other accelerators. This reality of heterogeneous architectures is adopted by cloud providers so that users can instantiate computers

C. A. Künas · M. S. Serpa · P. O. A. Navaux () Federal University of Rio Grande do Sul, Porto Alegre, Brazil e-mail: [email protected]; [email protected]; [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 E. Borin et al. (eds.), High Performance Computing in Clouds, https://doi.org/10.1007/978-3-031-29769-4_7



with different configurations of processors and accelerators. These configurations need to be evaluated considering the performance versus cost. The resource variety offered in Cloud Computing allows the user to adjust both the computer system’s hardware and the application’s software to optimize the cost or performance of the processing. The first can be done using accelerators, such as GPUs, TPUs, and FPGAs, which help accelerate the execution of many workloads requiring high computational performance. On the other hand, to optimize the performance of an application in a cloud instance, the user is generally limited to adjusting the application’s software to make good use of the available resources in the computer system [11]. Providers use virtualization technologies to allocate computing resources more flexibly for customers. With this, it is possible to divide the computing resources of a single physical server among multiple virtual machines so that multiple clients hire fractions of the system for exclusive use. Although the virtualization layer provides a reasonable level of resource isolation, some parts of the hardware, such as the processor cache and the main memory communication channel, cannot be divided and are shared by virtual machines running on the same machine [13]. In this way, the intense demand for these parts in a virtual machine (intensive memory access) can affect the performance of applications in other virtual machines allocated on the same physical machine. As the user has no control over the allocation of other virtual machines on the same physical machine, the performance of virtual machines may vary throughout their use. Some Cloud Computing service providers allow the user to hire the physical machine for exclusive use, which would avoid volatility in performance. On the other hand, the price of this type of service can be much higher than hiring VMs that share the physical machine. Cloud Computing service providers typically offer different models for contracting VMs with different prices and guarantees regarding availability and volatility [12]. In AWS, for example, there are three main models: (1) the on-demand model; (2) the reserved model; and (3) the spot model (or preemptible VMs) [8].
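To make the price-versus-total-cost distinction concrete, here is a small back-of-the-envelope sketch; all prices and runtimes are hypothetical.

# Hypothetical instances: (name, price in US$/h, measured runtime in hours)
instances = [("small", 0.10, 12.0), ("large", 0.90, 1.0)]

for name, price, hours in instances:
    print(f"{name}: total cost = {price * hours:.2f} US$")
# The cheaper hourly rate ("small") costs 1.20 US$ in total,
# while the pricier instance ("large") finishes the same job for 0.90 US$.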

7.2 Accelerator Optimized Instances on the Cloud Cloud providers offer several accelerators, such as GPUs, TPUs, and FPGAs. This section introduces the instance types and discusses the main characteristics of each computer architecture. Cloud providers offer a wide selection of optimized instance types to cater to different use cases. The instance types consist of various combinations of CPU, memory, storage, and network capabilities and offer the flexibility of choosing a suitable composition of resources for each application. Each type of instance includes one or more instance sizes, allowing resources to be scaled according to the requirements of the workload to be executed. Accelerated computing instances use hardware accelerators, or coprocessors, to execute functions such as floating-point calculations, graphics processing, or data pattern matching,


more efficiently than is possible with software executing on CPUs. Cloud providers have several instances for accelerated computing, ranging from GPUs to FPGAs, besides their own accelerators, such as AWS Inferentia and Google Cloud TPU.

7.2.1 GPUs: Graphic Processing Units This section introduces the concepts of graphics processing units (GPUs), their types, and their instances, and discusses the properties of the different instance types, such as the number of CUDA cores, memory capacity, and storage. Usually, computational clouds like Amazon AWS have instances with the most modern GPUs and some low-cost ones with older GPUs. As of June 2022, AWS has eight GPU instance types. Amazon EC2 P4 instances are the latest generation of GPU-based instances and provide the highest performance for machine learning training and high-performance cloud computing. Each instance has 8 NVIDIA A100 Tensor Core GPUs, each with 6912 CUDA cores and 432 Tensor cores. These instances provide the highest performance for training machine learning (ML) and high-performance computing (HPC) applications in the cloud. P4d instances are powered by the latest NVIDIA A100 Tensor Core GPUs and deliver industry-leading high-throughput and low-latency networking; they are the first in the cloud to support 400 Gbps instance networking. P4d instances offer up to 60% lower cost to train ML models, with an average 2.5× better performance for deep learning models than the previous-generation P3 and P3dn instances. Researchers, data scientists, and developers can use P4 instances to train machine learning models for different use cases such as Natural Language Processing (NLP), object detection and classification, and recommendation systems, as well as to run HPC applications such as oil and gas simulation, financial modeling, and pharmaceutical discovery. Unlike on-premises systems, customers can access "unlimited" compute and storage resources, scaling their infrastructure based on their business needs without any setup or maintenance cost. Like P4 instances, AWS EC2 P3 instances provide up to 8 NVIDIA V100 Tensor Core GPUs and up to 100 Gbps of network throughput for ML and HPC applications; the main difference is that P4 has NVIDIA A100 GPUs while P3 has NVIDIA V100. Another similar instance is the AWS EC2 P2, intended for general-purpose GPU computing and equipped with NVIDIA K80 GPUs. AWS also has several instances focused on graphical applications, such as the G5, G5g, G4dn, G4ad, and G3 instances. Amazon EC2 G5 instances are designed to help accelerate graphics-intensive application inference and machine learning; customers can also use them to train simple to moderately complex machine learning models. Amazon EC2 G5g instances are powered by AWS Graviton2 processors and feature NVIDIA T4G Tensor Core GPUs to provide the best price-performance ratio on Amazon EC2 for graphics workloads such as streaming Android games. These are


the first Arm-based instances in a large cloud to offer GPU acceleration. Customers can also cost-effectively use G5g instances for ML inference. Amazon EC2 G4dn instances are designed to help accelerate machine learning inference and graphics-intensive workloads; the processor in this instance is from Intel and the GPU from NVIDIA, while on the G4ad both the processor and the GPU are from AMD. The Amazon EC2 G4ad instances provide the best price-performance ratio for graphics-intensive cloud applications: the GPU is an AMD Radeon Pro V520 and the processor an AMD EPYC 7R32. Finally, Amazon EC2 G3 instances are optimized for graphics-intensive applications and have NVIDIA Tesla M60 GPUs, each with 2048 parallel processing cores and 8 GiB of video memory.

7.2.2 TPUs: Tensor Processing Units This section discusses the Tensor Processing Unit (TPU), an AI accelerator developed by Google and available on the Google Cloud Platform (GCP). It is deeply integrated with the TensorFlow software. TPUs are specialized application-specific integrated circuits (ASICs) designed to support large-scale machine-learning tasks. The performance of linear algebra computations, used intensively in machine learning, is accelerated by the TPU resources, minimizing the time it takes to reach a given accuracy when training large and complex neural network models. Figure 7.1 shows the architecture of a TPUv2 chip. Each TPUv2 device has four internal chips, and each chip is made up of two cores. Each core has scalar, vector, and matrix units (MXU) connected to 8 GB of on-chip high-bandwidth memory (HBM) per TPUv2 core. The performance of each TPUv2 device is 180 TFlops (32-bit and 16-bit mixed precision) [18]. Compared to TPUv2, TPUv3 doubles the number of MXUs and the HBM capacity per core and has a peak of 420 TFlops, 2.3× greater than TPUv2 [15]. Furthermore, each core of the TPU device performs calculations independently, and the high-bandwidth interconnections allow the chips to communicate directly with each other within the TPU device [4]. Additionally, you can run machine learning workloads on a TPU Pod to scale up workloads with little or no code change. A TPU Pod is a set of TPU devices connected by dedicated high-speed network interfaces. A TPU Pod allows you to distribute the processing load among multiple TPUs, with up to 512 cores and 11.5 petaFLOPS of performance for TPUv2 and up to 2048 cores with 100+ petaFLOPS of performance for TPUv3. A high-performance CPU-based host machine is connected to each TPU board for data loading and preprocessing [17]. Figure 7.2 illustrates how a user can access a Cloud TPU. Cloud TPUs are network-attached: the user must create a Compute Engine virtual machine and request a Cloud TPU. The virtual machine connects to the Cloud TPU through gRPC, Google's open-source high-performance Remote Procedure Call framework that can run in any

Fig. 7.1 The architecture of TPUv2 can achieve up to 180 TFlops

Fig. 7.2 Illustration of how to offload work to a Cloud TPU: the user connects via SSH to a Compute Engine VM, which communicates with the Cloud TPU over gRPC

environment. Users do not need to install any driver and can use the machine images provided by Google Cloud. However, users still need to design the algorithm and write the code for their applications [18]. Configuring a model for Cloud TPU requires a distribution strategy that allows training across all cores. This strategy scales the batch size by the number of available TPU cores, i.e., if the per-core batch size is 32, the global batch is 256 (8 cores × 32 = 256). The global batch size is automatically sharded across all replicas [6].
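A minimal sketch of that configuration step with TensorFlow's distribution strategy API; the TPU address, the model, and the dataset are placeholders, and the exact resolver arguments depend on how the TPU was provisioned.

import tensorflow as tf

# Resolve and initialize the network-attached Cloud TPU (address is a placeholder)
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="grpc://10.0.0.2:8470")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

PER_CORE_BATCH = 32
global_batch = PER_CORE_BATCH * strategy.num_replicas_in_sync  # e.g., 8 cores -> 256

with strategy.scope():
    # Model and optimizer must be created inside the strategy scope
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# model.fit(dataset.batch(global_batch), epochs=...)  # dataset is assumed to exist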

7.2.3 FPGAs: Field-Programmable Gate Arrays We introduce the field-programmable gate arrays (FPGAs) available from cloud providers, mainly used for machine learning inference.


Amazon EC2 F1 instances offer customizable hardware acceleration with field-programmable gate arrays (FPGAs). They use FPGAs to enable the delivery of custom hardware accelerations. F1 instances are easy to program and come with everything needed to develop, simulate, debug, and compile hardware-accelerated code, including an AMI for FPGA developers, supporting hardware development in the cloud. Using F1 instances to deploy hardware accelerations can be helpful in a variety of applications to solve complex problems in science, engineering, and business that require high bandwidth, enhanced networking, and very high computing power. Examples of target applications that can benefit from F1 instance acceleration are genomics, search/analysis, image and video processing, network security, electronic design automation (EDA), image and file compression, and big data analysis. The instance has Intel Xeon Scalable processors and Xilinx Virtex UltraScale+ VU9P FPGAs with up to 64 GB of memory. Amazon EC2 VT1 instances are designed to accelerate real-time video transcoding and deliver low-cost transcoding for live video streams. They can deliver up to 30% lower cost per stream compared to Amazon EC2 G4dn GPU-based instances and up to 60% lower cost per stream compared to Amazon EC2 C5 CPU-based instances for transcoding live video streams. VT1 instances can support streams up to 4K UHD resolution at 60 frames per second (FPS) and can transcode up to 64 simultaneous 1080p60 streams in real time. They are powered by up to 8 Xilinx Alveo™ U30 media accelerator cards, with accelerated H.264/AVC and H.265/HEVC decoders, and support up to 96 vCPUs, 192 GB of memory, 25 Gbps of enhanced networking, and 19 Gbps of EBS bandwidth. They are optimized for workloads such as live broadcast, video conferencing, and just-in-time transcoding.

7.2.4 Other Cloud Providers' Accelerators and AI Processors We present and discuss other accelerators for machine learning, such as AWS Inferentia and Intel Habana. With the growing use of Artificial Intelligence models, providers are offering instances with AI processors, like the TPU from Google and Inferentia from AWS. These processors implement in hardware functions that are frequent in AI processing, significantly accelerating deep learning model training time. A previous section of this chapter discussed the Tensor Processing Unit used by Google, while this section presents some of the other processors. Amazon EC2 DL1 instances are powered by Gaudi accelerators from Habana Labs (an Intel company). They offer up to 40% better price-performance for training deep learning models than current GPU-based EC2 instances. Each instance has up to 8 Gaudi accelerators with 32 GB of high-bandwidth memory (HBM) per accelerator. Pairs of Gaudi accelerators are attached directly through a PCIe Gen3 x16 link. Additionally, peer-to-peer networking via 100 Gbps RoCEv2 links—with seven active links per card—provides a torus configuration with a total of 700 Gbps


of interconnect bandwidth per card. This topology is a separate interconnect outside of the two NUMA domains. Furthermore, the instance supports four EFA ENIs and 4 × 1 TB of local NVMe SSD storage [AWS data sheet]. AWS instances with Habana accelerators are designed to provide high performance and cost-efficiency for deep learning model training workloads. Specifically, DL1 instances are ideal for training machine learning models in applications such as natural language processing, object detection and classification, recommendation engines, and autonomous vehicle perception. Amazon EC2 Trn1 instances provide the best price-performance ratio for training deep learning models in the cloud. Trn1 instances are powered by AWS Trainium, the second AWS-designed machine learning (ML) chip, optimized for high-performance deep learning training. Trn1 instances are now available in preview and have up to 16 AWS Trainium accelerators. Finally, Amazon EC2 Inf1 instances are built from scratch to support machine learning inference applications and have up to 16 AWS Inferentia chips. Graphcore and Microsoft are working on a hardware-software combination that provides artificial intelligence through the Azure cloud. Graphcore's chips, which it calls Intelligence Processing Units (IPUs), have many more cores than GPUs or TPUs. They also feature memory on the chip itself, which removes a bottleneck that comes with moving data onto a chip for processing and off again. A software framework called Poplar was created, which allows existing AI programs to be ported to the hardware [Graphcore data sheet].

7.3 Programming for Cloud Accelerators This section presents some guidelines on deploying applications for the accelerators available on the cloud. We focus on data storage and accelerator instantiation. We also show how to train and save the model in the cloud provider's bucket, deploy new versions of the neural networks, and make the endpoint for inference globally available.

7.3.1 Amazon Web Services (AWS) We discuss how to program for GPUs and other accelerators through AWS SageMaker. On AWS, using SageMaker, you can build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows. SageMaker has several benefits that make machine learning more accessible, empowering more people to innovate with ML through a choice of tools: integrated development environments for data scientists and code-free visual interfaces for business analysts. You can also prepare data at scale by

134

C. A. Künas et al.

accessing, labeling, and processing large amounts of structured data (tabular data) and unstructured data (photos, video, and audio) for ML. In addition, you can use SageMaker to accelerate ML development, reducing training time from hours to minutes with optimized infrastructure and increasing team productivity by up to 10 times with specific tools. Finally, you can simplify the ML lifecycle by automating and standardizing MLOps practices across your organization to build, train, deploy, and manage models at scale. Now, let us see how to use AWS Sagemaker to create an application that analyzes the sentiment of movie reviews. Our goal is to have a web page that a user can use to enter a movie review. The web page then sends the review off to our deployed model, which predicts the sentiment of the entered review. We use Python and Amazon SageMaker Notebook Instances. First, we need to install the Sagemaker using: 1

!pip install sagemaker

Afterward, you need to put the training data in an S3 bucket. The data is available at http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz. You must read all the reviews and combine them into a single structure, and then divide them into training and testing sets. Another important step is to transform the data. Most sentiment analysis projects transform the data from its word representation to a bag-of-words feature representation, and we build a similar feature representation for the model used in this notebook. To start, we represent each word as an integer. Of course, some of the words that appear in the reviews occur very infrequently and are unlikely to contain much information for sentiment analysis. We deal with this problem by fixing the size of our working vocabulary and only including the words that appear most frequently. We then combine all the infrequent words into a single category which, in our case, we label as 1. We must also define a training function based on PyTorch to train the model. It should be very similar to training methods you may have written before for PyTorch models.

def train(model, train_loader, epochs, optimizer, loss_fn, device):
    for epoch in range(1, epochs + 1):
        model.train()
        total_loss = 0
        for batch in train_loader:
            batch_X, batch_y = batch

            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)

            optimizer.zero_grad()

            output = model.forward(batch_X)

            loss = loss_fn(output, batch_y)
            loss.backward()

            optimizer.step()

            total_loss += loss.data.item()
        print("Epoch: {}, BCELoss: {}".format(epoch, total_loss / len(train_loader)))

When a PyTorch model is constructed in SageMaker, an entry point must be specified. This Python file is executed when the model is trained. Inside the train directory is a file called train.py, which has been provided and contains most of the necessary code to train our model.

from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point="train.py",
                    source_dir="train",
                    role=role,
                    framework_version='0.4',
                    train_instance_count=1,
                    train_instance_type='ml.p2.xlarge',
                    hyperparameters={
                        'epochs': 10,
                        'hidden_dim': 200,
                    })

estimator.fit({'training': input_data})

Now that we have trained our model, we would like to test it to see how it performs. When deploying a model, you are asking SageMaker to launch a compute instance that waits for data to be sent to it. As a result, this compute instance continues to run until you shut it down. This is important since the cost of a deployed endpoint depends on how long it has been running.

predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
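Once the endpoint is up, we can send it data for inference. The call below is a minimal sketch (our own, not from the original notebook) assuming the input has already been converted into the integer bag-of-words encoding the model expects; test_review is a hypothetical placeholder.

import numpy as np

# Hypothetical encoded review (vocabulary indices); in the real notebook the
# raw text would first go through the same preprocessing used for training.
test_review = np.array([[1, 42, 7, 250, 1, 0, 0, 0]], dtype=np.int64)

result = predictor.predict(test_review)   # invokes the deployed endpoint
print(result)                             # predicted sentiment score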

Of course, once we have deployed an endpoint, it continues to run until we tell it to shut down. Since we are done using our endpoint for now, we can delete it using:

estimator.delete_endpoint()

7.3.2 Google Cloud Platform (GCP)

We show how to set up a Cloud TPU to run a deep learning model and how to use Google Vertex AI to train a TensorFlow model in a custom container. To get started, you can take advantage of a set of open-source reference models optimized for use with TPUs at http://cloud.google.com/tpu/docs/tutorials/supported-models. Before using Cloud TPU features, you must create a Google Cloud account and project. Next, you must enable the Compute Engine and Cloud TPU APIs to train a model. Cloud TPU requires access to the Google Cloud Storage buckets where you store your datasets and training results. With a free Google account, you can access the web-based Google Cloud Platform user interface, which covers computing, big data/analytics, security, networking, and artificial intelligence services.

There is a cost-effective option in GCP that you can consider for your project, known as the preemptible TPU. If you select a preemptible instance, you can launch an instance at a much lower price than a regular one. However, the provider can terminate (preempt) these instances if it needs the resources for other tasks. Another limitation is that a preemptible instance is permanently shut down after running for 24 hours. This is the trade-off between preemptible and non-preemptible instances, and you should decide based on your requirements. While using this option can be a little inconvenient, it is cost-effective, making preemptible TPUs a practical choice for a new user running small deep learning models while still exploring the different functionalities of GCP and Cloud TPU. The model we use here runs its training script in less than 5 minutes on a Cloud TPU v3-8; therefore, we selected the preemptible option when configuring our environment.

We assume that you have already created a project and that billing is enabled for it. To set up the environment and get your own TPU, access the GCP console and open the web-based Google Cloud Shell. Activate the Cloud TPU API (line 1) and set the ID of the project (line 2) in which you want to create the Cloud TPU. The first time you run this command, you must authorize Cloud Shell to make GCP API calls with your credentials. Create a service account for the Cloud TPU project (line 3). Also, create a Cloud Storage bucket (line 4); it stores the data you use to train your model and the training results, and it ensures high TPU performance by optimizing the input pipeline. The next step is to launch a Compute Engine virtual machine (VM) and the Cloud TPU (lines 5–9). In this command, you specify the zone in which to create your TPU, the TPU type, the software version, and whether Cloud TPU may preempt your TPU. When the command finishes running, you can connect to the Compute Engine instance (line 10). We set some environment variables in the VM session window to help train the model (lines 11–15). The last step is to change to the directory that stores the model (line 16) and run the training script (lines 17–23). In this command, we specify the name of the TPU, the bucket where we store the results, the path to the training input data, the number of epochs to train the model, and the distribution strategy. Finally, the last flag enables the script to download and preprocess the dataset if it has not already been downloaded.

1   gcloud services enable tpu.googleapis.com
2   gcloud config set project project-id
3   gcloud beta services identity create --service tpu.googleapis.com --project project-id
4   gsutil mb -p project-id -c standard -l us-central1 gs://mnist-tf
5   gcloud alpha compute tpus tpu-vm create mnist-tutorial \
6       --zone=us-central1-b \
7       --accelerator-type=v3-8 \
8       --version=tpu-vm-tf-2.8.0 \
9       --preemptible
10  gcloud alpha compute tpus tpu-vm ssh mnist-tutorial --zone=us-central1-b
11  export TPU_NAME=local
12  export STORAGE_BUCKET=gs://mnist-tf
13  export MODEL_DIR=${STORAGE_BUCKET}/mnist
14  export DATA_DIR=${STORAGE_BUCKET}/data
15  export PYTHONPATH="${PYTHONPATH}:/usr/share/tpu/models"
16  cd /usr/share/tpu/models/official/vision/image_classification
17  python3 mnist_main.py \
18      --tpu=${TPU_NAME} \
19      --model_dir=${MODEL_DIR} \
20      --data_dir=${DATA_DIR} \
21      --train_epochs=10 \
22      --distribution_strategy=tpu \
23      --download

Cloud providers develop tools that unify their services for building ML, such as Google's Vertex AI, to facilitate the deployment of machine learning solutions. The unified set of APIs that Vertex AI provides allows companies and developers to accelerate the development and maintenance of machine learning solutions using as little code as possible. According to Google [16], Vertex AI requires nearly 80% fewer lines of code to train a model, allowing data scientists and ML engineers at all levels of expertise to implement machine learning operations (MLOps) and to create and manage ML projects efficiently throughout the development lifecycle. Vertex AI includes many different products to support end-to-end machine learning workflows; Figure 7.3 presents an overview of the platform. We now show how to use Google Vertex AI to train a TensorFlow model in a custom container. Again, we assume you have already created a GCP project and enabled billing for it. To begin setting up the environment, you must enable the Compute Engine and Vertex AI APIs (line 1), if they are not already enabled.

Fig. 7.3 Vertex AI overview, an end-to-end unified AI platform


The first time you run this command, you must authorize Cloud Shell to make GCP API calls with your credentials. Set the ID of the project (line 2) in which you want to use the Vertex AI functionality. Create a Cloud Storage bucket (line 3) and copy the training dataset into your bucket (line 4). Then, in the Vertex AI console, create a dataset. Specify a name, such as text_classification, select Text as the data type, and select single-label Text Classification. Keep the same region used when creating the bucket, then click Create. On the import page, select Import a file from Cloud Storage and specify the location where you saved the copy of the dataset. Vertex AI automatically splits the dataset into training, validation, and test sets, so you can keep the default data split. To start the import process, click Continue; this process may take some time to complete.

1  gcloud services enable compute.googleapis.com aiplatform.googleapis.com
2  gcloud config set project project-id
3  gsutil mb -p project-id -l us-central1 gs://bucket_folder/
4  gsutil -m cp -R gs://cloud-ml-data/NL-classification/happiness.csv gs://bucket_folder/text/

In the Vertex AI console, access the Models page. For the region, select us-central1 and click Create. The Train New Model window opens. In this window, select the training dataset you created and the text classification annotation set, check the AutoML option, then click Continue. Now specify a name for the model and start training. Training can take a long time. When training is complete, click on your AutoML model to see details such as performance metrics. Select the Deploy and Test tab to create an endpoint and click Deploy to Endpoint. In the Deploy to Endpoint window, select Create a new endpoint, set a name, accept the 100% traffic split, and click Deploy. It can take several minutes to create the endpoint and deploy the AutoML model. Finally, after the endpoint is created, you can obtain text predictions from the Vertex AI console: enter your text in the Test your model section and click Predict to view the predicted label and confidence score.
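Predictions can also be requested programmatically instead of through the console. The snippet below is a minimal sketch using the google-cloud-aiplatform Python SDK; the endpoint ID shown is a placeholder, and the instance format assumed here is the {"content": ...} form used by AutoML text models.

from google.cloud import aiplatform

aiplatform.init(project="project-id", location="us-central1")

# "1234567890" is a placeholder for the numeric ID of the endpoint created above.
endpoint = aiplatform.Endpoint("1234567890")

response = endpoint.predict(instances=[{"content": "What a wonderful, happy day!"}])
print(response.predictions)   # predicted labels and confidence scores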

7.3.3 Microsoft Azure

We now discuss the use of accelerators on Microsoft Azure, through Azure Machine Learning and Azure Machine Learning Studio. Azure Machine Learning is a cloud service for accelerating and managing the lifecycle of machine learning projects. Machine learning professionals, data scientists, and engineers can use it in their everyday workflows to train and deploy models and to manage MLOps. You can build a model in Azure Machine Learning or use a model built with an open-source framework such as PyTorch, TensorFlow, or scikit-learn, and MLOps tools help you monitor, train, and redeploy models. Azure Machine Learning is intended for individuals and teams who implement MLOps in their organizations to bring machine learning models into production in a secure, auditable environment. Data scientists and ML engineers find tools to accelerate and automate their everyday workflows; application developers find tools to integrate models into applications or services; and platform developers find a robust set of tools, supported by durable Azure Resource Manager APIs, for building advanced ML tooling.

Now, let us see how to train a machine learning model using Azure Machine Learning Studio. You use the Azure Machine Learning training and deployment workflow in a Python Jupyter Notebook. We train a simple logistic regression model on the MNIST dataset using scikit-learn, with the goal of creating a multiclass classifier that identifies the digit a given image represents. We assume you already have an Azure account with an active subscription and that you have also created a workspace and a compute instance. Azure Machine Learning includes a cloud notebook server in your workspace for a pre-configured experience with no installation required. First, sign in to Azure Machine Learning Studio and select the subscription and workspace you created. On the left, select the Notebooks menu and then select the Samples tab. We then clone the tutorials folder into our user directory (Fig. 7.4).

Fig. 7.4 Azure Machine Learning Studio window showing the Samples tab


Fig. 7.5 Selecting Jupyter notebook from the user directory in Azure Machine Learning Studio

Select the quickstart-azureml-in-10mins.ipynb file in the compute-instance-quickstarts/quickstart-azureml-in-10mins folder (Fig. 7.5). In the top bar, select the compute instance you created in order to use it to run the notebook, and switch to Jupyter Notebook to run the code. In the notebook, you use Azure Open Datasets to get the MNIST [7] data files, load the zipped files into NumPy arrays, train the model, and log metrics with MLflow. You use the LogisticRegression classifier from the scikit-learn framework to classify the data. Model training takes just a few minutes to complete. In the left menu, select Jobs and then choose your experiment (azure-ml-in-10-mins-tutorial) to see metrics, logs, explanations, and so on. You can use model registration to store and version your models in your workspace. The code below registers the trained model and controls its version. After running it, you can see the model in the registry by selecting Models from the left menu in Azure Machine Learning Studio.

# register the model
model_uri = "runs:/{}/model".format(run.info.run_id)
model = mlflow.register_model(model_uri, "sklearn_mnist_model")

Now, with the model saved, let us create the deployment configuration, with all the dependencies and the amount of computation needed to host the model, and then deploy it. The deployment uses a scoring script (the entry script score.py referenced in the code below) containing a function that loads the model from the registry and sets global variables, and a function that runs whenever a call is made to the service; in the latter, you typically format the input data, run a prediction, and return the predicted result. A sketch of such a script is shown next. You can view the deployed model by navigating to Endpoints in the left menu of Azure Machine Learning Studio.
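The entry script itself is not listed in the chapter. As an illustration only, a minimal score.py following the usual Azure ML init/run pattern could look like the sketch below; it assumes the registered artifact can be loaded as a plain scikit-learn pickle (an MLflow-registered model may instead require mlflow.sklearn.load_model).

import json

import joblib
import numpy as np
from azureml.core.model import Model


def init():
    # Runs once when the service starts: locate and load the registered model.
    global model
    model_path = Model.get_model_path("sklearn_mnist_model")
    model = joblib.load(model_path)


def run(raw_data):
    # Runs on every request: parse the JSON payload, predict, return the result.
    data = np.array(json.loads(raw_data)["data"])
    result = model.predict(data)
    return result.tolist()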

# create an environment for the deploy
import uuid
from azureml.core.model import InferenceConfig
from azureml.core.model import Model
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.webservice import AciWebservice

# get a curated environment
env = Environment.get(
    workspace=ws,
    name="AzureML-sklearn-0.24.1-ubuntu18.04-py37-cpu-inference",
    version=1
)
env.inferencing_stack_version = 'latest'

# create deployment config, i.e., compute resources
aciconfig = AciWebservice.deploy_configuration(
    cpu_cores=1,
    memory_gb=1,
    tags={"data": "MNIST", "method": "sklearn"},
    description="Predict MNIST with sklearn",
)

%%time
# get the registered model
model = Model(ws, "sklearn_mnist_model")
# create an inference config, i.e., the scoring script and environment
inference_config = InferenceConfig(entry_script="score.py", environment=env)
# deploy the service
service_name = "sklearn-mnist-svc-" + str(uuid.uuid4())[:4]
service = Model.deploy(
    workspace=ws,
    name=service_name,
    models=[model],
    inference_config=inference_config,
    deployment_config=aciconfig,
)

service.wait_for_deployment(show_output=True)

You can test the model by sending a raw HTTP request to the deployed web service.

# send raw HTTP requests to test the web service
import requests

# send a random row from the test set to score
random_index = np.random.randint(0, len(X_test) - 1)
input_data = '{"data": [' + str(list(X_test[random_index])) + "]}"

headers = {"Content-Type": "application/json"}
resp = requests.post(service.scoring_uri, input_data, headers=headers)
print("POST to url", service.scoring_uri)
print("label:", y_test[random_index])
print("prediction:", resp.text)

Of course, if you are not going to continue using this model, delete the deployed service using:

service.delete()

7.4 Influence of Accelerators in IoT and Edge Computing

In this section, we discuss the influence of accelerators in IoT and Edge Computing, as well as how the adoption of 5G and its workloads transform the computing landscape and drive the proliferation of acceleration technologies. Cloud and IoT technologies complement each other, allowing for increased system performance. At the edge, IoT devices preprocess data locally, reducing the amount of information that must be transferred to the cloud and the time needed to transfer it. The more computing power the edge has, the less data needs to be transferred and the better the resulting system performance. With the increased use of accelerators (GPUs, FPGAs, vector accelerators, and others), more computing power can be used at the edge, reducing the need for cloud computing. This is a trend for the next few years, as the amount of data to transfer is growing very fast and the connection between the cloud and the IoT has bandwidth and cost limitations. The adoption of 5G increases the bandwidth between the cloud and the IoT [14]. Many providers offer Storage as a Service (SaaS), i.e., customers pay for the space used on the servers. With edge computing, the amount of data that needs to be stored in the cloud is reduced, and so is the storage capacity the user must pay for, since only the capacity needed for correct functioning is used.


Fig. 7.6 Quantity of images by size in KB, with (right) and without (left) preprocessing. (a) Without preprocessing. (b) With preprocessing

As an example of the importance of managing the amount of data transferred to the cloud and of the decision to preprocess at the edge, Fig. 7.6 gives an idea of the difference in size of the image data from the APTOS 2019 dataset [3] when transferred without preprocessing and with edge preprocessing. The left figure presents the size distribution of the 3,662 images transferred without any processing, with two size peaks, one near 2,000 KB and the other near 5,000 KB. With preprocessing, in the right figure, the sizes fall into two groups, one near 50 KB and the other near 60 KB. To speed up edge processing, acceleration technologies such as GPUs, FPGAs, and vector processors are proliferating. They become central when balancing performance against cost between edge and cloud processing. Edge processing will grow in the future, as it is important to reduce the amount of data stored in the cloud and its price, and to reduce the data transfers that are constrained by connection limitations.
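The exact preprocessing applied to the APTOS images is not detailed here, but the sketch below (our own illustration using the Pillow library, with an arbitrary target resolution) shows the kind of edge-side reduction that shrinks each image before it is sent to the cloud.

from PIL import Image

def preprocess(input_path, output_path, size=(224, 224), quality=85):
    """Downscale and re-encode an image so far fewer bytes cross the network."""
    img = Image.open(input_path).convert("RGB")
    img = img.resize(size)                          # reduce resolution
    img.save(output_path, "JPEG", quality=quality)  # lossy re-encoding shrinks it further

# Example: a multi-megabyte fundus photograph typically ends up at tens of KB.
# preprocess("raw/retina_0001.png", "edge/retina_0001.jpg")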

7.5 Final Remarks

The evolution toward using accelerators alongside conventional processors is inexorable: heterogeneous machines are the future. In the cloud context, the sections of this chapter show that using accelerators is common practice when an application needs more processing power. Cloud providers offer accelerators in their instances, and some are developing specific processors to meet AI demands.


References

1. Aljamal, R., El-Mousa, A., Jubair, F.: A comparative review of high-performance computing major cloud service providers. In: 2018 9th International Conference on Information and Communication Systems (ICICS), pp. 181–186. IEEE (2018)
2. Dutta, P., Dutta, P.: Comparative study of cloud services offered by Amazon, Microsoft & Google. International Journal of Trend in Scientific Research and Development 3(3), 981–985 (2019)
3. Gangwar, A.K., Ravi, V.: Diabetic retinopathy detection using transfer learning and deep learning. In: Evolution in Computational Intelligence, pp. 679–689. Springer (2021)
4. Google: Cloud TPU system architecture (2022), https://cloud.google.com/tpu/docs/system-architecture-tpu-vm
5. Kotas, C., Naughton, T., Imam, N.: A comparison of Amazon Web Services and Microsoft Azure cloud platforms for high performance computing. In: 2018 IEEE International Conference on Consumer Electronics (ICCE), pp. 1–4. IEEE (2018)
6. Künas, C.A., Serpa, M.S., Bez, J.L., Padoin, E.L., Navaux, P.O.: Offloading the training of an I/O access pattern detector to the cloud. In: 2021 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW), pp. 15–19. IEEE (2021)
7. LeCun, Y.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998)
8. Lin, L., Pan, L., Liu, S.: Methods for improving the availability of spot instances: A survey. Computers in Industry 141, 103718 (2022)
9. Maliszewski, A.M., Roloff, E., Carreño, E.D., Griebler, D., Gaspary, L.P., Navaux, P.O.: Performance and cost-aware HPC in clouds: A network interconnection assessment. In: 2020 IEEE Symposium on Computers and Communications (ISCC), pp. 1–6. IEEE (2020)
10. Roloff, E., Diener, M., Gaspary, L.P., Navaux, P.O.: HPC application performance and cost efficiency in the cloud. In: 2017 25th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), pp. 473–477. IEEE (2017)
11. Serpa, M.S., Cruz, E.H., Diener, M., Krause, A.M., Navaux, P.O., Panetta, J., Farrés, A., Rosas, C., Hanzich, M.: Optimization strategies for geophysics models on manycore systems. The International Journal of High Performance Computing Applications 33(3), 473–486 (2019)
12. Singh, H.: AWS pricing and cost management. In: Practical Machine Learning with AWS, pp. 29–44. Springer (2021)
13. Vogel, A., Griebler, D., Maron, C.A., Schepke, C., Fernandes, L.G.: Private IaaS clouds: a comparative analysis of OpenNebula, CloudStack and OpenStack. In: 2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), pp. 672–679. IEEE (2016)
14. Wang, D., Chen, D., Song, B., Guizani, N., Yu, X., Du, X.: From IoT to 5G I-IoT: The next generation IoT-based intelligent algorithms and 5G technologies. IEEE Communications Magazine 56(10), 114–120 (2018)
15. Wang, Y.E., Wei, G.Y., Brooks, D.: Benchmarking TPU, GPU, and CPU platforms for deep learning. arXiv preprint arXiv:1907.10701 (2019)
16. Wiley, C.: Google Cloud unveils Vertex AI, one platform, every ML tool you need (2021), https://cloud.google.com/blog/products/ai-machine-learning/google-cloud-launches-vertex-ai-unified-platform-for-mlops
17. Ying, C., Kumar, S., Chen, D., Wang, T., Cheng, Y.: Image classification at supercomputer scale. arXiv preprint arXiv:1811.06992 (2018)
18. You, Y., Zhang, Z., Hsieh, C.J., Demmel, J., Keutzer, K.: Fast deep neural network training on distributed systems and Cloud TPUs. IEEE Transactions on Parallel and Distributed Systems 30(11), 2449–2462 (2019)

Part III

Cost and Performance Optimizations

Chapter 8

Optimizing Infrastructure for MPI Applications

José E. Moreira

8.1 Fundamentals of MPI

The Message Passing Interface (MPI) [1, 2] is a specification for writing message-passing applications for parallel computing architectures. While not sanctioned by any major standards organization, it has become a de facto standard for programming parallel applications. Its many benefits include portability from laptops to the largest supercomputers, the enabling of high-performance computations, and a variety of high-quality implementations, both open source and proprietary [3–5]. The execution entity in MPI is a job. An MPI job consists of tasks. The number N of tasks in the collection is called the size of the MPI job. Each task is identified by its rank, a unique integer number between 0 and N − 1. A task consists of an address space and one or more instruction streams (threads) that can perform memory load and store operations against that address space. MPI tasks in a job exchange data through communication operations. Communication operations can be either one- or two-sided. One-sided operations are initiated by a task and complete without requiring action from any other task. One-sided operations take the form of either a get or a put. A get(i, a, n) operation returns to the initiating thread the n bytes of data starting at address a in the address space of task rank i. A put(D, n, i, a) stores data D in the n bytes starting at address a in the address space of task rank i. One can think of a get as a load from the address space of another task. Correspondingly, a put is like a store into the address space of another task. Two-sided operations can take many forms. The most common, and the basis for other operations, are send(B, n, i), which sends n bytes of data in buffer B from


Fig. 8.1 Sample MPI program. Each task i receives data from task i − 1 and sends data to task i + 1

the initiating task to another task i, and recv(B, n, i), which receives n bytes of data sent by task i into buffer B of the initiating task. These operations are called two-sided because two tasks must match corresponding operations. For example, one task can perform a send only if the destination task performs a matching recv. Both one- and two-sided MPI operations are often used in their nonblocking form. In those cases, the operations do not necessarily complete before returning to the caller. Instead, they return a token to the initiating task that can later be used to test for completion of the operation. This mode of operation allows a task to continue to make progress in its own execution while waiting for the pending communication operations to complete in the background. The nonblocking mode of operation both improves performance and avoids deadlock, as the threads of execution in a task are constantly making forward progress. MPI applications can follow either the Single Program Multiple Data (SPMD) model or the Multiple Program Multiple Data model (MPMD). In SPMD, all tasks follow (for the most part) the same control flow of execution and, therefore, are


Fig. 8.2 Output from a 4-task run of the sample MPI program. The order of tasks printing is not deterministic

performing the same operations but each on its own data. In MPMD, each task can have its own control flow (likely in a different executable) and perform a different computation on its own data. The SPMD model is far more common, although large multidisciplinary applications may follow a Multiple SPMD model (a subset of the MPMD model) of different subsets of tasks working on a particular problem and cooperating with other subsets as part of a larger global problem. We illustrate some basic concepts of MPI with the sample program in Fig. 8.1. The program starts by each task initializing MPI and obtaining both the total number of tasks and its rank in the global communicator (lines 15–17). Task 0 prints the total number of tasks (line 19) and then all tasks synchronize (line 21). Each task generates a random number (lines 23–24) to exchange with its neighbors. The ranks of the left and right neighbors of each task are computed in lines 26–27. We then use the MPI_Sendrecv function (lines 29–30) to both send data to the right neighbor and receive data from the left neighbor. (Using a combined send/receive operation prevents any possibility of deadlock.) Finally, each task prints the data it sent and received, identifying the corresponding neighbors (lines 32–33), and stops MPI services (line 35). An output from the MPI program of Fig. 8.1 when running with 4 tasks is shown in Fig. 8.2. The barrier in line 21 ensures that task 0 prints the total number of tasks before any downstream operations, but the order of individual tasks printing their information is not deterministic. Different runs can generate different task orders.
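The book's sample program (Fig. 8.1) is written in a compiled language and is not reproduced here. As a language-neutral illustration of the same neighbor-exchange pattern, the following mpi4py sketch (our own) uses the nonblocking operations described above; it is not the code referenced by the line numbers in the text.

from mpi4py import MPI
import random

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

if rank == 0:
    print(f"Running with {size} tasks")
comm.Barrier()

value = random.random()            # data to exchange with the neighbors
left = (rank - 1) % size
right = (rank + 1) % size

# Nonblocking send and receive: both return immediately with a request token.
send_req = comm.isend(value, dest=right)
recv_req = comm.irecv(source=left)

received = recv_req.wait()         # wait (or test) for completion later
send_req.wait()

print(f"Task {rank}: sent {value:.3f} to {right}, received {received:.3f} from {left}")

Run, for example, with mpirun -np 4 python ring.py; as in Fig. 8.2, the order in which tasks print their lines is not deterministic.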

8.2 Interconnection Networks for MPI Environments

MPI applications can impose high communication demands on the interconnection network of a computing system. Therefore, it is important to build a network that can offer both high bandwidth and low latency. An important class of interconnection networks with those characteristics is the fat tree network [6, 7]. The links (edges) in a fat tree network, shown in Fig. 8.3, have different bandwidths. Links at the bottom (leaf level) of the tree have the lowest bandwidth. As links get closer to the root of the tree, the bandwidth increases. In the ideal scenario illustrated in Fig. 8.3, the bandwidth of the links doubles as we go up each level of the tree. The most important and desirable characteristic of a fat tree network is its bisection bandwidth. The leaf level of the tree in Fig. 8.3 consists of nodes 0, …, 7. Each node has a network link with bandwidth B. For any two disjoint subsets of


Fig. 8.3 In a fat tree network, links near the root of the tree are fatter (thicker), delivering more bandwidth. In this particular example, links at the leaf level have bandwidth B. Bandwidth per link increases to 2B and 4B as we get closer to the root

Fig. 8.4 32-port fat-tree network built of 8-port switches. This is a two-layer network, with 8 leaf switches (LS0, …, LS7) and 4 spine switches (SS0, …, SS3)

those leaf nodes, of size n1 and n2 such that n1 ≤ n2, the total bandwidth between the two subsets is n1 · B. In particular, if we split the leaf level into two equal halves, the bandwidth between the halves is 4B. This bandwidth is independent of the actual nodes in each half. It holds if we assign all even numbered nodes (0, 2, 4, 6) to one half and all odd numbered nodes to the other half (1, 3, 5, 7). It also holds if we assign nodes 0, …, 3 to one half and nodes 4, …, 7 to the other half. This generalizes to fat trees of n leaf nodes, as the bandwidth between any two disjoint partitions of n/2 nodes is (n/2) · B. This property will prove very useful when we discuss allocating nodes to executing MPI jobs in Sect. 8.4 [8, 9]. In practice, fat tree networks are built by combining all-to-all switches in multiple layers. Figure 8.4 illustrates a 32-port fat tree built out of 8-port switches organized in two layers. At the leaf layer, we have eight 8-port leaf switches (LS0, …, LS7). Each of those switches is configured so that 4 of its ports are for connecting down to compute servers (ports 0, …, 31). The remaining 4 ports in each leaf switch connect up to a layer of four 8-port spine switches (SS0, …, SS3). The switches themselves are exactly the same as the leaf switches, but each of their 8 ports connects down to a different leaf switch. The resulting structure has the same bisection bandwidth property of the ideal fat tree in Fig. 8.3. In many cases, the leaf and spine switches do not have the same number of ports. If the leaf switches have n ports and the spine switches have m ports, the resulting two-layer network has a total of nm/2 ports. Bigger networks can be built by


Fig. 8.5 Routing in a fat-tree network can include both single leaf switch routes (e.g., between ports 0 and 1, in green) as well as routes that have to go through multiple leaf switches and a spine switch (e.g., between ports 2 and 24, in red)

introducing additional layers. For an l-layer network of n-port switches, the number of leaf ports is (n/2)^(l−1) · n. Routing between ports in a fat tree network may require multiple hops through the switches. Two cases are illustrated in Fig. 8.5. In the best case scenario, between ports 0 and 1, a single hop through switch LS0 is enough to connect the two ports. In the case of routing between ports 2 and 24, we need to go through leaf switch LS0, through spine switch SS3, and then finally through leaf switch LS6, for a total of three hops. This three-hop trip is the longest trip in a two-layer network. For an l-layer network, the longest trip generalizes to 2l − 1 hops. We should also mention that the routing between nodes 2 and 24 could have gone through any of the spine switches, since each of them is connected to all leaf switches. It is this property that makes the tree fat, delivering the characteristic bisection bandwidth that we discuss above.
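As a quick sanity check on these formulas, the small helper below (our own illustration, not from the book) computes the number of leaf ports and the worst-case hop count for a fat tree built from n-port switches in l layers.

def leaf_ports(n: int, l: int) -> int:
    """Number of leaf-level ports of an l-layer fat tree of n-port switches."""
    return (n // 2) ** (l - 1) * n

def max_hops(l: int) -> int:
    """Longest route between two leaf ports in an l-layer fat tree."""
    return 2 * l - 1

# The two-layer example of Fig. 8.4, built from 8-port switches:
print(leaf_ports(8, 2))   # 32 ports
print(max_hops(2))        # 3 hops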

8.3 Cloud Facilities for MPI Applications Cloud data centers [10] are typically organized as a collection of pods. Each pod is a collection of servers, typically homogeneous in capabilities, with good network connectivity among them. A pod is the natural infrastructure for running MPI programs. A pod consists of a collection of both compute racks and switch racks, as shown in Fig. 8.6. Compute racks contain the compute servers that execute the MPI jobs. A typical configuration of a compute rack is shown in Fig. 8.7a. It contains 20 compute servers, each two rack units high (2 RUs), and two top-of-the rack switches (.TOR). Each .TOR connects to the 20 servers in the rack and plays the role of a leaf switch in the fat tree configuration discussed in Sect. 8.2. Each .TOR is part of a different and independent network.


Fig. 8.6 Diagram of a pod with 800 compute servers organized into 40 compute racks of 20 servers each (C00 through C39). The servers are interconnected through two fat tree networks, A and B. The spine switches for those networks are located in racks SSA and SSB, respectively

Fig. 8.7 Compute (a) and switch (b) racks of a pod. Each compute rack has 20 servers and two top-of-the-rack switches (TOR A and TOR B) that are the leaf switches of two distinct fat tree networks (A and B, respectively). The 20 spine switches of each network are contained in the switch racks

Switch racks contain the spine switches. Each network has its own set of spine switches. A typical configuration of a switch rack is shown in Fig. 8.7b. It contains 20 switches of 40 ports each. Each port of a spine switch connects to the corresponding TOR of a different compute rack.


Fig. 8.8 Organization of a typical dual-socket node used as a compute node for MPI applications

The organization of a typical compute server is illustrated in Fig. 8.8. It consists of two processors (CPU0 and CPU1), each with multiple compute cores. Each processor is attached to a main memory through multiple memory channels. A processor in a server can access both its locally attached memory as well as the memory attached to the other processor in the same server. The processors connect to the networks through interface cards (NICs) that are typically PCIe-attached. In this example, we show a single card attached to both processors and with two network ports, A and B. Those ports are used to connect the compute server to interconnection networks A and B, respectively. We can consider an entire server as a compute node. That means that a single MPI task can utilize all the resources of the server, including compute cores, memory, and network ports. When a task does not need all those resources by itself, it can share the node with other tasks, effectively partitioning a single physical compute node into multiple virtual nodes. Of particular interest is the case of partitioning the physical compute node into two virtual nodes, one assigned to each processor. In this scenario, each MPI task exploits the parallelism of one CPU using some form of multithreading (e.g., OpenMP, pthreads) and does not have to consider the nonuniform memory access behavior that derives from accessing memory attached to a different processor. Parallelism across processors, whether on the same compute node or in different nodes, is exploited the same way, through MPI. This approach usually leads to simpler and more predictable behavior, and can be adopted whenever the memory and compute requirements of a task can be satisfied by a single processor.


8.4 Executing an MPI Job in the Cloud

The lifetime of an MPI job begins with its submission by a user. When submitting a job, the user specifies a list of parameters including:

1. The size of the job (N). That is, how many tasks it requires. Sometimes this is specified as a single value, sometimes as a list of possible values, to indicate the program is flexible and can run with any of those sizes.
2. The memory requirements of each task (M). This could be a function of the number of tasks. That is, if fewer tasks are used, each will require more memory to fit the same global problem size.
3. The number of executing streams of instructions per task (P). This can also be either a single value or a list of possible values. It denotes the capability of each task to use multiple threads internally.
4. The executable(s) (E) that each task will run. When the application follows the SPMD model, a single executable is used for all tasks. In the MPMD model, each task can have its own executable.
5. A list of file systems (F). Those file systems contain the files that will be read/written by the running job. Therefore, they must be mounted on each node running the job.
6. An execution time limit (T). The maximum execution time that the user will allow for the job, once it starts running. This could also be a function of the job size, as fewer tasks require more time to finish the same computation.

To simplify discussion, we will consider only the case of a single value for each of the parameters above. A job J can be described as a tuple J = ⟨N, M, P, E, F, T⟩, with each element having just one value. Once a job is submitted, it is the job of the meta scheduler to select a pod for its execution. The pods available to the meta scheduler can all be located in one data center or spread over various data centers in multiple geographies. When choosing a pod, the meta scheduler has to take into consideration:

1. Availability of the required file system on the pod. It makes little sense to send a job for execution where it cannot reach the files it needs.
2. Capacity of the pod. Does the pod contain enough resources to run the program, as per its N, M, and P parameters?
3. Current load on the pod. It makes more sense to send a job to the pod where it will run sooner. Different pods may have different resources, so this load is computed relative to those resources. A bigger pod can process more work quicker than a smaller one.

A sample algorithm for selecting the best pod to send a job to is shown in Algorithm 1; a Python sketch of the same selection logic follows the algorithm. The cloud infrastructure consists of a set of pods. Each pod has its own scheduler that, when given the parameters of a prospective job, can provide an estimate of when that job will start executing in that pod. The algorithm is independent of the capabilities, load, and scheduling strategy of each pod.


Algorithm 1 An algorithm for selection of an execution pod by a cloud meta scheduler. J = ⟨N, M, P, E, F, T⟩ is the submitted job, and cloud is a set of pods. Each pod has its own scheduler. Method estimate provides an estimate by that scheduler of when a job can start execution.

procedure META(J, cloud)                               ▷ select pod for execution of job J
    best ← ∅                                           ▷ start with no pod
    for pod ∈ cloud do                                 ▷ try each pod in the cloud
        if pod can mount file system F then            ▷ file system requirement
            if pod can satisfy N, M, and P parameters then    ▷ size requirements
                estimate ← pod.estimate(J)             ▷ get an estimate for this pod
                if estimate < best.estimate(J) then    ▷ better than current estimate?
                    best ← pod                         ▷ new best pod for execution
                end if
            end if
        end if
    end for
    return best                                        ▷ return best pod
end procedure
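To make the selection logic concrete, here is a minimal Python sketch of Algorithm 1. It is our own illustration: the pod methods can_mount, can_fit, and estimate are hypothetical names standing in for "pod can mount file system F", "pod can satisfy N, M, and P", and the per-pod scheduler's start-time estimate.

def select_pod(job, cloud):
    """Return the pod with the earliest estimated start time for `job`, or None."""
    best = None
    best_estimate = float("inf")
    for pod in cloud:                              # try each pod in the cloud
        if not pod.can_mount(job.F):               # file system requirement
            continue
        if not pod.can_fit(job.N, job.M, job.P):   # size requirements
            continue
        estimate = pod.estimate(job)               # when could this pod start the job?
        if estimate < best_estimate:               # better than current best?
            best, best_estimate = pod, estimate
    return best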

Once a pod is selected, the job is sent to that pod for execution. It is then up to the local scheduler in the pod to allocate the necessary resources and decide when to run the job. The first step in allocating resources is to map the tasks of the job to the nodes in the pod. Sometimes each task is big enough to fill an entire physical compute node, and the mapping is 1 to 1: each task is allocated one node of the pod. Sometimes the tasks are small, and multiple tasks can be allocated per node. It is unusual to share a node among tasks from different jobs, as this has security and performance implications. In particular, it may invalidate a user's estimates of the running time of their programs. Therefore, we ignore this scenario and consider only sharing of a node by tasks of the same job, as compatible with the memory and processing requirements of each task. Using the terminology we already established, let M and P represent the memory and processing requirements for each task. (Each stream of instructions is executed on its dedicated processing core.) And let m and p represent the memory capacity and number of cores in each node of the pod. It is then reasonable to assign

$n = \min\left(\left\lfloor \frac{m}{M} \right\rfloor, \left\lfloor \frac{p}{P} \right\rfloor\right)$    (8.1)

tasks per node. In that case, the required number of nodes $\bar{N}$ is computed by

$\bar{N} = \left\lceil \frac{N}{n} \right\rceil$.    (8.2)
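As a small worked illustration of equations (8.1) and (8.2) (our own example, with made-up node and task sizes), the computation translates directly into code:

import math

def nodes_needed(N, M, P, m, p):
    """Tasks per node (Eq. 8.1) and number of nodes required (Eq. 8.2)."""
    n = min(m // M, p // P)        # how many tasks fit in one node
    return n, math.ceil(N / n)     # tasks per node, nodes required

# Example: 128 tasks, each needing 32 GB and 8 cores,
# on nodes with 256 GB of memory and 48 cores.
n, n_nodes = nodes_needed(N=128, M=32, P=8, m=256, p=48)
print(n, n_nodes)                  # 6 tasks per node, 22 nodes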


Given the necessary number of nodes $\bar{N}$, the scheduler still has to select the exact nodes for executing the job. The fat tree network helps with that assignment, since any set of $\bar{N}$ nodes is just as good for executing the job. Parallel job scheduling is a rich field of study on its own [11], and we will not go into details here. Suffice it to say that the simplest schedulers implement a First Come First Serve (FCFS) policy [12]. In this case, jobs are placed in a waiting queue in the order they arrive. As soon as the required number of nodes is available to run the job at the head of the queue, that job is issued for execution. There is one additional role of the pod scheduler that we need to cover: killing jobs. When an executing job exceeds its specified running time T, it must be killed. This is necessary for the proper functioning of the system, as otherwise the job queue may stop making progress and estimates of when jobs will run may become invalid. Schedulers reward good estimates provided by the users. Estimates that are over but close to the actual running time lead to earlier issue of the job, but underestimating may cause the job to be killed and waste a lot of execution time. Therefore, it is good practice for users to specify an execution time that is a little longer than their best estimate. If that estimate is T, then it is appropriate to request an execution time of $(1 + \epsilon)T$, where $\epsilon$ is a safety margin.

8.5 Optimizing the Performance of MPI Applications on the Cloud

The Cloud offers unique opportunities and challenges to the execution of MPI applications. The source of opportunities and challenges is the same: the diversity of cloud resources. Users of an on-premise parallel cluster typically have good knowledge of their systems, including number of nodes, memory capacity and computation speed of each node, and interconnection network capabilities. The cloud offers a much broader diversity of resources. Pods can have very different configurations, including number of nodes and type of each node. Resource availability and price can vary from moment to moment. A single geographical location may include several pods, all capable of running parallel MPI jobs. Therefore, when composing a parallel MPI program to run on the cloud, flexibility of execution is the key characteristic to pursue. We note that flexibility is also a useful characteristic for on-premise execution, even if not as critical as in the cloud. We illustrate an approach to develop flexible MPI programs through a simple example. In this example, we want to solve Laplace's equation ($\nabla^2 a = 0$) through a series of Jacobi relaxation steps [13] on a two-dimensional problem grid, represented by a matrix A of shape M × N. (These have nothing to do with the job parameters in the previous section.) The left and right edges of this problem grid define the boundary conditions and have fixed value. The top and bottom edges connect to each other, as in a cylindrical surface. At each relaxation step, the value of an interior point of the grid, a_{i,j}, is updated to the average value of its four neighbors:

$a_{i,j} \leftarrow \frac{1}{4}\left(a_{i-1,j} + a_{i+1,j} + a_{i,j-1} + a_{i,j+1}\right) \quad \forall\, i \in 0{:}M-1,\ j \in 0{:}N-1$.    (8.3)

The global M × N problem grid described above can be decomposed on a virtual processor grid, an arrangement of the tasks of the MPI job. The decomposition typically works better when the problem and processor grids have similar topology. Therefore, we choose to use a two-dimensional grid of p × q processors, as shown in Fig. 8.9. Each processor will be responsible for computing an m × n panel of the global grid, such that M = mp and N = nq. Just like the problem grid is periodic along one dimension, so is the processor grid. In this particular case, each column wraps around. The rows do not, and the left boundary of the problem grid is mapped to column 1 of the processor grid and the right boundary to column q. The task associated with each virtual processor is responsible for computing the relaxation within its local m × n panel, which will be denoted by a_{0:m−1,0:n−1}. Computing the new values of each element in that panel requires the presence of halos. A halo is either (1) a region of the global grid that is mapped to a different processor, but used in computing the elements assigned to the local processor; or (2) a boundary condition, for processors at the edge of the grid. We illustrate a local panel and its halos in Fig. 8.10. Before a task can compute a relaxation step, it must receive the top halo (a_{−1,0:n−1}) from its up neighbor in the processor grid. It must also send its own top row (a_{0,0:n−1}) to that same neighbor. A similar exchange of the bottom halo (a_{m,0:n−1}) and bottom row (a_{m−1,0:n−1}) must be performed with its down neighbor.

Fig. 8.9 Two-dimensional p × q virtual processor grid. In this particular case, the processor grid wraps around each column but not around each row. (For example, processor (1, 1) has an up neighbor at (p, 1) but no left neighbor.) Each virtual processor executes one MPI task. The local problem size in each task is m × n and the global problem size is M × N, with M = mp and N = nq. We use MPI topology services to create and optimize this virtual processor grid


Fig. 8.10 Local view of the problem array A in each task. At each relaxation step, the internal region (a_{0:m−1,0:n−1}) is updated in each task. The shaded areas (a_{−1,0:n−1}, a_{m,0:n−1}, a_{0:m−1,−1}, a_{0:m−1,n}) are the halos which overlap the internal regions of the up, down, left, and right neighbors, respectively. For tasks at the left and right edges of the processor grid, the left and right halos are the problem boundary conditions, respectively

Fig. 8.11 MPI topology services can be used to create virtual processor grids

To simplify the figure, we do not show the exchanges with the left and right neighbors, but those are also necessary. (Processors in column 1 of the grid do not have a left neighbor and processors in column q of the grid do not have a right neighbor. They do not participate in those particular exchanges.) MPI offers various types of topology services, including the creation of multidimensional Cartesian (rectangular) virtual processor grids. This is illustrated in the code of Fig. 8.11. Lines 2–5 define the parameters of the grid we want MPI to create. In this case, the dimension along the first axis (number of rows) is p and the dimension along the second axis (number of columns) is q. As discussed, the first axis is periodic, whereas the second is not.


Line 6 of Fig. 8.11 is the call to MPI_Cart_create to create the two-dimensional virtual processor grid. It passes an input communicator (MPI_COMM_WORLD) and grid parameters (grid_size and grid_period), each a two-element vector. The .true. parameter tells the MPI implementation that it can reorder the tasks to optimize communication in the resulting grid. This gives MPI the most flexibility in mapping tasks to physical resources, and it is essential if we want a flexible program that can run well across a variety of platforms. The call produces a new MPI communicator (grid) that represents the virtual processor grid. In tasks that do not belong to the new processor grid (when the number of tasks in the input communicator is larger than p × q), grid has the value MPI_COMM_NULL. In that case, execution jumps to line 100, which marks the end of the routine. We can obtain the rank and coordinates of the local task in the new processor grid using the MPI_Comm_rank (line 8) and MPI_Cart_coords (line 9) calls. We can also obtain the ranks of the up, down, left, and right neighbors of the local task on the processor grid with the MPI_Cart_shift calls (lines 12 and 13). This is essential for the parallel program, since the actual message passing operations require the ranks of the sources and destinations. These services isolate the running program from the complexity of mapping tasks and virtual processors to the physical resources of the platform, making the program more flexible and agnostic to the specific cloud implementation. The actual computation of the relaxation steps is shown in Fig. 8.12. We allocate the local array in each task dynamically, using the allocate command (line 2). This keeps the code flexible, since for the same global problem grid, the size of the local grid will change as the number of processors in the virtual processor grid changes. Lines 3 and 4 of Fig. 8.12 show the initialization of the boundary conditions, for those tasks that are on column 1 or q of the processor grid. In general, bcleft and bcright are vectors of m elements. (MPI numbers grid indices starting from 0. In

Fig. 8.12 Relaxation loop. Each iteration performs one step of relaxation, including halo exchange with the four neighbors


our discussion above we adopted, for simplicity, indices starting from 1. The code in Fig. 8.12 uses the MPI convention.) The loop of relaxation steps is shown in lines 7–18 of Fig. 8.12. At each step we first perform the halo exchanges (lines 8–15) and then the computation of new element values (lines 16–17). The array notation of Fortran lets us compute the new values of matrix A in place, since the evaluation is performed before the assignment. We also call attention to the use of MPI_Sendrecv to perform the halo exchanges. This MPI function is particularly well suited for the job because:

1. Each call to MPI_Sendrecv sends data to one neighbor and receives data from the other neighbor. When performed by all tasks, it corresponds to a collective operation, with all tasks in the virtual processor grid working together.
2. It is deadlock-free, as used in this example, since all tasks are sending and receiving at the same time, in proper coordination. (That is, there is always an active receiver for each ongoing sender.)
3. It handles both periodic and non-periodic grid directions uniformly. This is another benefit from obtaining the neighbors through the MPI topology services.
4. It is compatible with Fortran array syntax, since it is a blocking operation. (Non-blocking MPI operations are not compatible with general Fortran array syntax [14].)

Each of the four MPI_Sendrecv operations in Fig. 8.12 performs a halo exchange in one direction. Lines 8–9 perform a halo exchange to the down neighbors (send down, receive up), while lines 10–11 perform a halo exchange to the up neighbors (send up, receive down). Lines 12–13 perform a halo exchange to the right neighbors (send right, receive left), while lines 14–15 perform a halo exchange to the left neighbors (send left, receive right).
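The Fortran code of Figs. 8.11 and 8.12 is not reproduced here. As a language-neutral illustration of the same pattern, the following mpi4py sketch (our own, assuming four tasks arranged as a 2 × 2 grid and an assumed local panel size) creates the Cartesian grid and performs the up/down halo exchange followed by the relaxation update; the left/right exchanges are analogous.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p, q = 2, 2                          # assumed processor-grid shape (p*q tasks)
m, n = 64, 64                        # assumed local panel size

# Periodic along the first axis (columns wrap around), non-periodic along the
# second; reorder=True lets MPI remap ranks to optimize communication.
grid = comm.Create_cart([p, q], periods=[True, False], reorder=True)
if grid == MPI.COMM_NULL:
    raise SystemExit                 # this task is not part of the p x q grid

up, down = grid.Shift(0, 1)          # ranks of the up and down neighbors

a = np.zeros((m + 2, n + 2))         # local panel with a one-element halo border

for step in range(100):
    top_halo = np.empty(n)
    bottom_halo = np.empty(n)
    # Send my bottom interior row down, receive the top halo from my up neighbor.
    grid.Sendrecv(np.ascontiguousarray(a[m, 1:n+1]), dest=down,
                  recvbuf=top_halo, source=up)
    # Send my top interior row up, receive the bottom halo from my down neighbor.
    grid.Sendrecv(np.ascontiguousarray(a[1, 1:n+1]), dest=up,
                  recvbuf=bottom_halo, source=down)
    a[0, 1:n+1] = top_halo
    a[m + 1, 1:n+1] = bottom_halo
    # (The analogous left/right halo exchanges along the second axis go here.)
    # Jacobi update of the interior points, Eq. (8.3).
    a[1:m+1, 1:n+1] = 0.25 * (a[0:m, 1:n+1] + a[2:m+2, 1:n+1] +
                              a[1:m+1, 0:n] + a[1:m+1, 2:n+2])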

8.6 Conclusions

MPI is one of the most popular systems for developing parallel applications. The elasticity and scale of cloud data centers provide a natural environment for the execution of those applications. MPI jobs place significant demands on the cloud infrastructure, which should be properly optimized for their execution. Fat tree networks provide the bisection bandwidth and flexibility that high-performance MPI programs demand and are one of the common interconnect solutions for cloud infrastructure supporting those programs. The computation infrastructure is typically organized as pods, containing both the compute servers and the network switches that interconnect them. When an MPI job is submitted to the cloud, a meta scheduler must first select a pod for the execution of that job. The job is then directed to the selected pod, which will do its own scheduling, including allocation of nodes and sequencing of execution. MPI programs that want to take advantage of the broad spectrum of cloud resources should be coded to allow for flexible and infrastructure-agnostic execution. MPI has several facilities that support the optimization and portability of parallel applications.


References

1. Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra. MPI-The Complete Reference, Volume 1: The MPI Core. MIT Press, Cambridge, MA, USA, 2nd (revised) edition, 1998.
2. William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. The MIT Press, 2014.
3. William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Comput., 22(6):789–828, Sep 1996.
4. Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timothy S. Woodall. Open MPI: Goals, concept, and design of a next generation MPI implementation. In Proceedings, 11th European PVM/MPI Users' Group Meeting, pages 97–104, Budapest, Hungary, September 2004.
5. IBM Corporation. IBM Spectrum MPI: User's Guide, 2016.
6. Charles E. Leiserson. Fat-trees: Universal networks for hardware-efficient supercomputing. IEEE Trans. Comput., 34(10):892–901, Oct 1985.
7. Nikhil Jain, Abhinav Bhatele, Louis H. Howell, David Böhme, Ian Karlin, Edgar A. León, Misbah Mubarak, Noah Wolfe, Todd Gamblin, and Matthew L. Leininger. Predicting the performance impact of different fat-tree configurations. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '17, New York, NY, USA, 2017. Association for Computing Machinery.
8. Samuel D. Pollard, Nikhil Jain, Stephen Herbein, and Abhinav Bhatele. Evaluation of an interference-free node allocation policy on fat-tree clusters. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC '18. IEEE Press, 2018.
9. Peixin Qiao, Xin Wang, Xu Yang, Yuping Fan, and Zhiling Lan. Joint effects of application communication pattern, job placement and network routing on fat-tree systems. In Proceedings of the 47th International Conference on Parallel Processing Companion, ICPP '18, New York, NY, USA, 2018. Association for Computing Machinery.
10. Luiz André Barroso, Urs Hölzle, and Parthasarathy Ranganathan. The datacenter as a computer: Designing warehouse-scale machines, third edition. Synthesis Lectures on Computer Architecture, 13(3):i–189, 2018.
11. Dror G. Feitelson, Larry Rudolph, and Uwe Schwiegelshohn. Parallel job scheduling — a status report. In Dror G. Feitelson, Larry Rudolph, and Uwe Schwiegelshohn, editors, Job Scheduling Strategies for Parallel Processing, pages 1–16, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg.
12. Uwe Schwiegelshohn and Ramin Yahyapour. Analysis of first-come-first-serve parallel job scheduling. In Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '98, pages 629–638, USA, 1998. Society for Industrial and Applied Mathematics.
13. Sandip Mazumder. Chapter 3 - Solution to a system of linear algebraic equations. In Sandip Mazumder, editor, Numerical Methods for Partial Differential Equations, pages 103–167. Academic Press, 2016.
14. David Henty. Message Passing Programming: Array Issues in Fortran. www.archer.ac.uk/training/course-material/2018/11/mpi-newcastle/notes/MPP-f90issues.pdf

Chapter 9
Harnessing Low-Cost Virtual Machines on the Spot

Alexandre C. Sena, Cristina Boeres, Luan Teylo, Lúcia Maria A. Drummond, and Vinod E. F. Rebello

9.1 Introduction

Typically, cloud providers enable users to acquire computational resources encapsulated as preconfigured Virtual Machines (VMs) or Instances that can be selected according to their application's requirements (in terms of CPU, memory and I/O, for example). Each instance may be offered under one of several contract models that differ in terms of availability guarantees and prices. In most commercial cloud providers, there are three main contract models (also called markets): (1) the Reserved market, where the user makes a commitment to a consistent instance configuration for a fixed period of time (e.g. 1 or 3 years); (2) the On-demand market, where instances are allocated for specific periods of time and incur a fixed cost per unit time of use, with availability being guaranteed during this period; and (3) the Spot market, where the provider's unused resources are offered at rates with up to a 90% discount when compared to the on-demand model, but under the condition that these resources be released at the provider's request, at any time. The main characteristics of each market can be seen in Table 9.1. The Reserved instance market is a discount billing concept in which the user can obtain significant cost reductions (of up to around 70%) compared to standard On-demand cloud computing prices in return for committing to a specified level



Table 9.1 Cloud market characteristics

Markets | Availability | Discount | Use | Payment
Reserved | Guaranteed | Up to 72% | Fixed time (e.g. 1 year) | During the whole period
On-Demand | Very high | 0% | Only when necessary | Only when used
Spot | Can be revoked | Up to 90% | Only when necessary | Only when used

of compute capacity usage over a predetermined period of time. Although the specific terms under which a Reserved instance discount is offered vary from provider to provider, today the main providers, such as Amazon Web Services (AWS), Microsoft Azure and Google Cloud, offer a choice of 1-year or 3-year commitments. However, in some cases, not all of their virtual machine instance types are available in the agreement. Financially, Reserved instances can be of significant benefit to users with steady workloads. However, for many HPC users, who do not run virtual machines around the clock (i.e., at 100% utilization), this reservation model does not fit well. On-demand VMs can be deployed at any time, offering high availability. Differently from the Reserved market, where the user pays for the allocated resources regardless of whether they use them or not, the user pays only for the time the instance is running. However, no discount is offered in this market. On the other hand, cloud providers often have spare compute capacity (unused instances) which they offer with huge discounts as Spot instances. Spot instances are generally the cheapest option (50%–90% lower than the On-demand rates) amongst the three. However, Spot VMs can be revoked whenever the provider requires its resources back. Today, although all the main providers offer Spot VMs, only AWS also offers the Spot VM hibernation feature. In this case, when the VM is hibernated by AWS, its memory and context are saved, and when AWS reactivates the VM, the context is restored and the interrupted tasks are resumed from the hibernation point. Nowadays, even with the possibility of definitive or temporary revocation, the use of Spot instances has attracted a lot of attention, and their use, associated with fault tolerance or revocation prediction strategies, has been widely reported, as will be shown in this chapter. In particular, this chapter presents an overview of the HADS framework, developed for scheduling deadline-constrained Bag-of-Tasks applications on Spot VMs with the aim of minimizing the financial costs. The choice of instance amongst markets is strongly related to the minimization of monetary costs while aiming to preserve application performance, especially when dealing with high-performance applications. It is well known that this class of applications can execute for long periods of time even in the presence of a significant amount of resources. Therefore, another relevant aspect that needs to be addressed is the selection of instances with the appropriate amounts of resources, within a given market, so that not only is the necessary performance obtained, but also the corresponding monetary costs are minimized.
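For concreteness, the following is a minimal sketch of how a hibernation-capable Spot instance could be requested with the AWS SDK for Python (boto3). The AMI ID is a placeholder, and prerequisites such as an encrypted root volume and a supported instance type are omitted; treat it as an illustration under those assumptions rather than a ready-to-use recipe.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Ask for one Spot instance whose interruption behavior is hibernation rather
    # than termination. The AMI ID below is a placeholder for an image prepared
    # for hibernation.
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",       # hypothetical, hibernation-ready AMI
        InstanceType="c5.large",
        MinCount=1,
        MaxCount=1,
        HibernationOptions={"Configured": True},
        InstanceMarketOptions={
            "MarketType": "spot",
            "SpotOptions": {
                "SpotInstanceType": "persistent",             # hibernate requires a persistent request
                "InstanceInterruptionBehavior": "hibernate",
            },
        },
    )
    print(response["Instances"][0]["InstanceId"])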


Within each of the cloud markets, most cloud providers generally offer a variety of instance types to fit different application needs, varying from general-purpose configurations to instances with specific features or technologies. Typically, each instance type comprises various instance sizes with different combinations of the number of CPUs, memory, storage, and networking capacities, and thus provides users with the flexibility to choose the appropriate mix of resources for their target applications. Nonetheless, in their search for the required performance at an acceptable monetary cost, users have the hard task of analyzing and choosing not only the right instance type and size, but also making optimal software-level decisions, such as the choice of operating system and compiler. With the aim of highlighting the difficulties faced by cloud users, this chapter also exemplifies some of the cost-performance related issues that affect all three cloud markets. Besides the contract models and the variety of instance configurations, some leading cloud providers (e.g. Microsoft Azure, Amazon EC2) recently introduced the concept of a burstable VM that can boost its performance for a limited period of time to help cope with sudden workload variations. These are again offered at a discounted rate in relation to non-burstable instances with equivalent computational resources. Burstable VMs have also been used more recently for HPC applications, especially together with Spot instances, to mitigate the Spot revocation problem, as briefly introduced in this chapter. The remainder of this chapter is organized as follows. Section 9.2 presents a panorama of the use of the Spot market to execute Bag-of-Tasks (BoT), MPI and GPU-based applications. It also presents some results of the Hibernation-Aware Dynamic Scheduler (HADS) framework, which was developed for scheduling deadline-constrained BoT applications on Spot instances. Section 9.3 presents some discussions about identifying the appropriate instance type for applications and why this decision is not solely related to defining the required resource capacities the chosen instances should have. Section 9.4 presents some works which used burstable VMs to reduce the financial costs and execution times frequently associated with Spot VMs in the HPC context. Finally, Sect. 9.5 concludes the chapter and introduces some future directions.

9.2 Spot VMs

Spot instances, also called preemptible VMs, were introduced by Amazon in 2009 as a response to the spare cloud capacity that remained unused during periods of low demand [8]. Those instances were born from the provider's observation that the idle computational power of the cloud could be offered at a low cost without guarantees of reliability. Thus, whenever the provider needs that capacity back (due to increased requests for on-demand VMs, for example), the instances can be revoked without violating QoS agreements. That model quickly


became popular, and other providers such as Google and Azure also adopted their own Spot-like strategies. Initially, in AWS, the price of Spot instances and their revocations were defined according to users' bids: to contract a VM on the Spot market, users needed to offer a bid indicating the maximum price they would accept to pay for the instance. Prices were defined according to those bids (if users started to give high bids for a specific instance, its price would immediately rise), and the VMs were interrupted when their prices exceeded the bid. That pricing model created a competitive market, with huge price variations and a high rate of revocations. Because of that, most of the works related to cost reduction using Spot VMs focused on revocation prediction based on historical prices and on mechanisms to define the best bidding strategy. However, in December 2017, mainly due to the success of other providers whose Spot markets were not based on bids, Amazon changed that pricing for a more stable model. According to the provider, the new model reduced the number of revocations, and Spot instances became more reliable. Moreover, new features, such as hibernation, were included, opening new research directions for low-cost instances. Nowadays, the availability of VMs in the Spot market fluctuates according to the cloud's current demand. If there are not enough resources to meet clients' requests, the cloud provider can interrupt a Spot VM (temporarily or definitively). As already said, despite the risk of unavailability, the main advantage of using Spot VMs is that their costs are usually much lower than those of on-demand VMs. Given that an interrupted Spot VM instance can either terminate or hibernate, fault tolerance techniques, such as checkpointing and restart, can be used to migrate computations to another instance or complete their execution later. Many studies have been proposed to improve the reliability of Spot instances and benefit from the low prices. Several of them focus on proposing fault tolerance techniques to handle unexpected Spot failures, while many others present prediction models of Spot prices to help users moderate the risk of Spot failures, thus increasing the reliability of Spot instances. Regarding HPC applications, we will consider here two main classes of parallel applications. One of them is Bag-of-Tasks (BoT) applications, which are composed of independent tasks that can be executed in any order and in parallel. Although simple, the BoT approach is used by several applications such as parameter sweep, chromosome mapping, Monte Carlo simulation and computer imaging applications. The other class consists of MPI applications composed of partially dependent tasks, given that the Message Passing Interface (MPI) is the key programming paradigm for developing HPC applications. Although the Spot market has received a lot of attention in recent years, few BoT schedulers exploit the use of Spot VMs. Examples are [22, 25, 29, 30, 37], which propose, for the sake of monetary cost, the use of Spot VMs for scheduling tasks whenever possible. They cope with the termination/revocation of Spot VMs and do not consider the hibernation feature. Their common objective is rather a trade-off between monetary cost, reliability, and execution time.
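That price history is directly exposed by the provider's API. A minimal boto3 sketch that retrieves one week of it (the instance type, region, and time window are arbitrary examples):

    import boto3
    from datetime import datetime, timedelta, timezone

    ec2 = boto3.client("ec2", region_name="us-east-1")
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=7)

    # Walk one week of Spot price history for a single instance type.
    paginator = ec2.get_paginator("describe_spot_price_history")
    pages = paginator.paginate(
        InstanceTypes=["c5.large"],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=start,
        EndTime=end,
    )
    for page in pages:
        for record in page["SpotPriceHistory"]:
            print(record["AvailabilityZone"], record["Timestamp"], record["SpotPrice"])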


In [30], Subramanya et al. present SpotOn, a batch computing service platform that uses checkpointing, migration, and replication to mitigate the impact of Spot VM revocations. SpotOn uses the price history of the Spot market to select the fault-tolerance mechanism that minimizes the expected monetary cost of job execution. Loo et al. [22] propose a hybrid approach that considers on-demand VMs for high-priority tasks and Spot VMs for non-priority ones. In order to tolerate Spot VM interruptions, a certain number of on-demand VMs are reserved as spare resources to execute backup tasks. Whenever Spot instances are terminated, the workload is immediately migrated to on-demand VMs. SpotCheck [29] uses nested VMs within Spot VMs to provide the illusion of a platform that offers always-available VMs. The nested VMs are transparently migrated to an on-demand VM when a Spot revocation is detected. It also provides a checkpoint of the VM's memory state on an external disk by running a background process that continually flushes dirty memory pages to a backup machine. The VM may then resume from the saved memory state on a different machine. In [25], an online learning algorithm that selects Spot and on-demand VMs to execute batch jobs arriving over time is proposed. The algorithm dynamically adapts the resource allocation by learning from its performance on prior job executions and from the history of Spot prices. The solution is able to switch to on-demand resources whenever there are not enough available Spot VMs to ensure the desired performance. AutoBot [37] presents the following characteristics: (1) it uses both Spot and on-demand VMs for scheduling tasks of BoT applications with a user-defined deadline; (2) it applies task migration from Spot to on-demand VMs to satisfy constraints; and (3) it uses checkpoint strategies for the sake of performance. Three checkpoint strategies are proposed: (1) optimistic checkpoint, where the state of the task is recorded just before the migration to an on-demand VM; (2) grace period checkpoint, where the two minutes between the notification of the interruption of a Spot VM and the VM interruption itself are used to take the checkpoint; and (3) sliding checkpoint, where checkpoints are taken at fixed intervals. AutoBot also considers a critical point within the application execution when all tasks running on Spot VMs should migrate to on-demand ones, even if no Spot interruption has happened. Several of the above works exploit the history of Spot VM price variations to predict Spot VM revocations. However, with the new pricing model announced by Amazon in December 2017, prices of VMs in the Spot market are quite stable. Moreover, the use of the hibernation mechanism of Spot VMs in BoT scheduling strategies is only discussed in more recent works. Aiming at coping with Spot VM revocations, in [33], Teylo et al. proposed a static heuristic that creates pre-defined backup maps, i.e., defined before the execution of the job tasks themselves. Such a heuristic was the first attempt to handle the hibernation of Spot VMs, and results obtained by simulation showed that the hibernation problem is better handled with a dynamic approach. Thus, in [36], the authors present the first version of the HADS framework, which aims at minimizing the monetary costs of executing BoT applications on clouds while ensuring that their deadlines are respected even in the presence of multiple hibernations.
Results collected from experiments on Amazon EC2 VMs using synthetic applications and a NAS benchmark application show the effectiveness of HADS in terms of monetary costs when compared to on-demand-VM-only solutions.

Regarding MPI applications, there is an even smaller number of works that consider Spot instances. Taifi et al. [31] introduced a formal model to guide the design of checkpointing for MPI-based applications. They only consider the checkpointing technique and emphasize the dynamic adjustment of optimal checkpoint-restart (CPR) intervals. They also introduce a formal model based on a Bid-aware Optimal CPR Interval, and provide an HPC application toolkit, named SpotMPI, to facilitate the practical execution of real MPI applications on volatile auction-based cloud platforms. Their models capture the intrinsic dependencies between critical time-consuming elements by leveraging instrumented performance parameters and publicly available resource bidding histories. In [14], a monetary cost optimization for MPI-based applications with deadline constraints on Amazon EC2 is proposed. In particular, the authors consider utilizing two kinds of Amazon EC2 instances (on-demand and Spot instances). As a Spot instance can fail, fault-tolerant executions are necessary. Through detailed studies, they found that two common fault tolerance mechanisms, i.e., checkpoints and replicated executions, are complementary for cost-effective MPI executions on Spot instances. They formulate the optimization problem and propose a novel cost model to minimize the expected monetary cost. The experimental results with NPB benchmarks on Amazon EC2 demonstrate that (1) it is feasible to run MPI applications with performance constraints on Spot instances, (2) the proposal achieves significant monetary cost reductions compared to the state-of-the-art algorithm, and (3) it is necessary to adaptively choose checkpoint and replication techniques for cost-effective and reliable MPI executions on Amazon EC2. Marathe et al. [24] exploit replicated executions for cost-effective, time-constrained execution of HPC applications on Amazon EC2. They present several ways of minimizing the cost of running applications on the cloud. First, they describe a method to exploit redundancy of compute resources for cost-effective execution on the EC2 Spot market. Second, they present the Adaptive algorithm, which takes a user-defined execution time bound as input and chooses a bid price and a checkpoint-insertion algorithm that results in meeting the bound at low cost. Finally, they extend their adaptive framework to incorporate application scalability characteristics into the scheduling decision. More recently, in [41], an optimization framework for HPC applications on the Spot market, named FarSpot, was proposed with the goal of minimizing application cost while ensuring performance constraints. FarSpot provides accurate long-term price predictions for a wide range of Spot instance types using an ensemble-based learning method. It further incorporates a cost-aware deadline assignment algorithm to distribute the application deadline to each task according to Spot price changes. With the assigned sub-deadline of each task, FarSpot dynamically migrates tasks among Spot instances to reduce execution cost. Evaluation results using real HPC benchmarks show that (1) the prediction error of FarSpot is very low (below 3%), and (2) FarSpot reduced the monetary cost by 32% on average compared to state-of-the-art


Table 9.2 Main characteristics of some works related to the execution of BoT and MPI applications in Spot instances

Article | BoT/MPI | Hibernate/resume | Fault tolerance | Objective (minimize) | Constraint | Evaluation approach
Taifi et al. [31] (2011) | MPI | No | Checkpoint | Monetary cost | – | Execution on EC2
Lu et al. [22] (2013) | BoT | No | Migration | Monetary cost | – | Simulation
Menache et al. [25] (2014) | BoT | No | Migration | Monetary cost | Deadline | Simulation
Subramanya et al. [30] (2015) | BoT | No | Checkpoint, migration and replication | Monetary cost | – | Simulation and real cloud
Sharma et al. [29] (2015) | BoT | No | Migration | Monetary cost | – | Real cloud
Gong et al. [14] (2015) | MPI | No | Checkpoint and replication | Monetary cost | Deadline | Real cloud
Marathe et al. [24] (2016) | MPI | No | Checkpoint and replication | Monetary cost | Deadline | Real cloud
Teylo et al. [33] (2019) | BoT | Yes | Migration | Monetary cost | Deadline | Simulation
Varshney and Simmhan [37] (2019) | BoT | No | Checkpoint and migration | Monetary cost | Deadline | Simulation and real cloud
Zhou et al. [41] (2021) | MPI | No | Migration | Monetary cost | Deadline | Simulation
Teylo et al. [36] (2021) | BoT | Yes | Checkpoint and migration | Monetary cost | Deadline | Real cloud

algorithms, and (3) FarSpot satisfies the user-specified deadline constraints at all times. However, the tests were not executed in a real cloud environment. Table 9.2 summarizes the main characteristics of some works that consider the execution of BoT and MPI applications on Spot instances. The following features are highlighted in the table: Hibernate/Resume, which shows whether hibernation-prone VMs are used; the applied fault tolerance technique; the objectives and constraints of the scheduling algorithm; and how the proposed scheduler is evaluated. It is important to note that there are also a few papers that exploit Spot GPU VMs. However, in this environment the application must be fault-tolerant to deal with revocations. Some works use Spot VMs to train Machine Learning (ML) models. Lee and Son [20] proposed DeepSpotCloud, which searches for the cheapest AWS Spot instances available in different countries to train Deep Learning tasks. The authors used checkpoints of intermediate training results to recover revoked VMs. They also implemented a live migration heuristic to reduce execution costs.


Table 9.3 Main characteristics of works related to the use of Spot GPUs in clouds

Paper | Application | Objective (minimize) | Constraint | Evaluation approach
Lee and Son [20] | ML | Monetary cost | – | Real cloud
Wagenländer et al. [38] | ML | Execution time | – | Real cloud
Zhou et al. [42] | SC | Execution time | – | Real cloud
Brum et al. [6] | SW | Monetary cost | Deadline | Real cloud

Note that DeepSpotCloud is specific to ML applications. To deal with revocations in distributed ML training, Wagenländer et al. [38] proposed Spotnik, which handles VM revocations by synchronizing the communication phase of each isolated model and ignoring the changes in the iterations when the provider revokes any VM. With this synchronization, Spotnik avoids the overhead incurred by checkpoints. This technique cannot handle revocations of all VMs: if all preemptible VMs are revoked, the ML training restarts from the beginning. Zhou et al. [42] implemented a fault-tolerant stencil computation (SC) on AWS Spot GPU instances. The proposed framework takes advantage of pipelining to overcome the communication overhead and to increase the processing speed. It uses a low-cost checkpointing mechanism to handle the possible termination of the Spot instances. The checkpointing mechanism periodically copies the memory block from the GPU to a buffer in the host memory using CUDA memcpy and sends it to the backup server using a non-blocking MPI send. The results were obtained in two different environments, a 4-node cluster and Amazon's GPU instances (g2.2xlarge), and the checkpointing mechanism was shown to have a small overhead, achieving fault tolerance on cloud Spot instances as required. In addition, an optimized bidding strategy for Spot instances minimized the monetary cost while enhancing stability. In [6], the focus is on minimizing the monetary cost of using Spot GPU VMs to execute a sequence alignment algorithm while observing a deadline constraint. The results show that the costs are greatly reduced, even in cases of revocations, when compared against on-demand executions, while still respecting the deadline. Table 9.3 summarizes the main characteristics of these four works: target application; objective; constraint; and how the proposal is evaluated. A common characteristic of all these works is that they use fault tolerance mechanisms at the application level, so they are specifically designed for a particular application. Many of the described works focus on the Spot price prediction problem, and most of them do not consider the hibernation feature. Nonetheless, exploiting this feature can in fact reduce the financial cost without increasing the execution time significantly, as discussed in the next subsection, where some results of the HADS framework, which deals with Spot hibernation, are presented.
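The device-to-host checkpoint pattern described above for the Spot GPU works can be sketched as follows. This is only an illustration under assumed dependencies (cupy for the GPU-resident array and mpi4py for the non-blocking send, with at least two MPI ranks); it is not the implementation used in the cited papers.

    from mpi4py import MPI
    import cupy as cp
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    N = 1_000_000                                      # hypothetical size of the GPU-resident state

    if rank == 0:                                      # worker rank holding the GPU state
        state_gpu = cp.zeros(N, dtype=cp.float32)
        # ... kernels update state_gpu here ...
        host_buffer = cp.asnumpy(state_gpu)            # device-to-host copy (CUDA memcpy)
        req = comm.Isend(host_buffer, dest=1, tag=7)   # non-blocking send to the backup rank
        # ... computation can continue while the checkpoint is in flight ...
        req.Wait()
    elif rank == 1:                                    # backup rank stores the checkpoint
        backup = np.empty(N, dtype=np.float32)
        comm.Recv(backup, source=0, tag=7)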

9.2.1 Using Hibernation-Prone Spot VMs in BoT Applications

To take advantage of hibernation-prone Spot instances, a BoT scheduler framework called Hibernation-Aware Dynamic Scheduler (HADS) was proposed in [34]. This section provides an overview of HADS and shows some practical results to illustrate how Spot VMs can be used to reduce the monetary cost of the execution, respecting the application's deadline even in the presence of VM revocations. HADS is available and documented at https://github.com/luanteylo/hads_.

HADS schedules BoT applications with deadline constraints on both hibernation-prone Spot VMs (for the sake of cost) and on-demand VMs. If a hibernated Spot VM resumes early enough to satisfy the application deadline, the tasks already assigned to it are resumed from the break-point when the VM becomes available again. If that is not the case, however, a temporal failure will possibly take place and the application deadline will be violated. Therefore, the aim of HADS is to offer a dynamic scheduling solution that guarantees the execution of the tasks of applications with deadline constraints, avoiding temporal failures even in the presence of multiple hibernations, with minimum monetary cost with respect to VM allocation prices. To this end, the framework provides mechanisms to migrate tasks from a hibernated Spot VM whenever it does not resume early enough to ensure the deadline constraints of the application. If the existing allocated Spot VMs are not enough to execute these tasks, new on-demand VMs are deployed. The framework is composed of two main modules: (1) the Primary Scheduling Heuristic Module, which defines an initial task scheduling map, and (2) an event-driven Dynamic Scheduler Module which, if necessary, migrates tasks to other VMs so that the deadline is respected. Furthermore, for reducing costs or for load balancing, it may also migrate tasks from busy VMs to idle Spot ones by applying a work-stealing procedure. Finally, in order to avoid executing migrated tasks from the beginning, tasks on Spot VMs take checkpoints periodically. Hence, those which were running on a VM that hibernated are migrated to other VMs and start their execution from their respective last checkpoint. On the other hand, checkpointing a task induces overhead, increasing the task execution time, which must be considered by the Primary Scheduling Module when mapping tasks to Spot VMs. A simplified sketch of the migration decision made by the Dynamic Scheduler Module is given after the workload description below.

All results presented in this section were obtained from real executions using VMs from AWS EC2. In those experiments, two kinds of BoT applications were considered:

Synthetic BoT: In this case, the BoT is composed of tasks generated with the application template proposed by Alves et al. [3], which is based on vector operations whose execution times depend on the size of the vectors. Several synthetic tasks were created, each one with a memory footprint between 2.81 MB and 13.19 MB, resulting in execution times which vary from 1:42 to 5:30 min. Three BoT applications, J60, J80, and J100, were then conceived by randomly selecting those tasks.

ED Jobs from the NAS Benchmark: The Embarrassingly Distributed (ED) applications from the GridNPB 3.1 suite [5] are composed of multiple independent tasks, each one executing the same program with different input parameters. ED instances composed of 200 tasks running the largest problem size (class B), denoted as ED200, were considered.
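As announced above, the following is a simplified sketch of the decision the Dynamic Scheduler Module has to make when a Spot VM hibernates: if waiting for the VM to resume would risk missing the deadline, its paused tasks are migrated and restarted from their last checkpoints. All names, data structures and estimates are hypothetical, and the logic is deliberately reduced; it illustrates the idea, not HADS's actual code.

    # Tasks and VMs are plain dictionaries here; in a real scheduler these would be
    # richer objects. Times are in seconds.
    def handle_hibernation(paused_tasks, now, deadline, expected_resume_wait,
                           idle_spot_vms, start_on_demand_vm):
        migrations = []
        slack = deadline - now
        for task in paused_tasks:
            remaining = task["remaining_after_checkpoint"]
            if expected_resume_wait + remaining <= slack:
                continue                              # safe to wait for the VM to resume
            target = next((vm for vm in idle_spot_vms if vm["free_slots"] > 0), None)
            if target is None:
                target = start_on_demand_vm()         # last resort: the on-demand market
            target["free_slots"] -= 1
            migrations.append((task["id"], target["name"]))
        return migrations

    # Example: 35-minute deadline, 20 minutes elapsed, resume expected in 10 minutes.
    tasks = [{"id": "t1", "remaining_after_checkpoint": 120},
             {"id": "t2", "remaining_after_checkpoint": 600}]
    spots = [{"name": "spot-1", "free_slots": 1}]
    plan = handle_hibernation(tasks, now=1200, deadline=2100, expected_resume_wait=600,
                              idle_spot_vms=spots,
                              start_on_demand_vm=lambda: {"name": "ondemand-1", "free_slots": 1})
    print(plan)   # t1 can wait (600 + 120 <= 900); t2 cannot, so it is migrated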


Table 9.4 Jobs characteristics

Job | # tasks | Runtime (min) Min | Runtime (min) Avg | Runtime (min) Max | Memory footprint Min | Memory footprint Avg | Memory footprint Max
J60 | 60 | 01:42 | 03:18 | 05:23 | 2.85 MB | 4.69 MB | 12.20 MB
J80 | 80 | 01:43 | 03:19 | 05:22 | 2.91 MB | 4.71 MB | 13.19 MB
J100 | 100 | 01:47 | 03:10 | 05:30 | 2.81 MB | 4.49 MB | 10.86 MB
ED200 | 200 | 02:41 | 03:31 | 05:54 | 153.74 MB | 168.68 MB | 177.77 MB

Table 9.5 VMs attributes

Type | #vCPUs | Memory | Gflops | On-demand price per hour | Spot price per hour
c3.large | 2 | 3.75 GB | 22.09 | $0.105 | $0.0294
c4.large | 2 | 3.75 GB | 40.73 | $0.100 | $0.0308
c3.xlarge | 4 | 7.50 GB | 44.46 | $0.210 | $0.0596
c4.xlarge | 4 | 7.50 GB | 83.33 | $0.199 | $0.0673

Table 9.4 summarizes the characteristics of the four BoT applications tackled, including their respective number of tasks, memory footprint and runtime on a baseline c3.large VM. For all jobs, the execution deadline is 35 min (D = 2100 s). In December 2019, according to EC2, only the VMs of families c3, c4, c5, m4, m5, r3, and r4 with less than 100 GB of memory, running in the Spot market, were hibernation-prone. Therefore, in the experiments, only Spot VMs of the families c3 and c4, which provide good computational power and have high availability in the Spot market, were selected.1 Table 9.5 shows the computational characteristics of the used VMs as well as their respective prices in the On-demand and Spot markets (obtained in December 2019). Cloud users have no control over Spot VM hibernations, since it is the cloud provider that decides when to hibernate and resume a given Spot VM according to resource demand variations. Thus, in order to evaluate different patterns of Spot VM hibernation and resuming, the Hibernation Emulation Module (HEM) was developed. HEM emulates the cloud hibernation feature using Poisson distributions [1] to model both the hibernation and resuming times for each type of Spot VM. Since, in Amazon EC2, when a Spot VM of a given type hibernates other VMs of the same type will probably hibernate too, HEM emulates events for groups of VMs of identical types. In other words, when a HEM event happens for a VM, it has an impact not only on this VM but also on all VMs of that type. HEM uses distinct Poisson functions for modeling the events, which allows the creation of scenarios where hibernating and resuming events have different probability mass functions defined by the parameters λh and λr, respectively. Whenever an emulated hibernation event occurs, the Spot VM state is saved by using the checkpoint tool

1 https://aws.amazon.com/ec2/spot/instance-advisor/.

Table 9.6 Different execution scenarios generated by varying parameters λh and λr

ID | Hibernating | Resuming | λh | λr
sc1 | kh = 1 | kr = 0 | 1/2100 | 0/2100
sc2 | kh = 5 | kr = 0 | 5/2100 | 0/2100
sc3 | kh = 1 | kr = 5 | 1/2100 | 5/2100
sc4 | kh = 5 | kr = 5 | 5/2100 | 5/2100
sc5 | kh = 3 | kr = 2.5 | 3/2100 | 2.5/2100
sc6 | kh = 2 | kr = 1 | 2/2100 | 1/2100
sc7 | kh = 2 | kr = 2 | 2/2100 | 2/2100

CRIU,2 and all tasks allocated to it are paused. Hence, if the VM resumes later, those tasks can be recovered and continue their execution. Note that, although the hibernation event is emulated, the behaviour of the hibernation feature was preserved, i.e., all tasks are recovered from the break-point when a hibernated Spot VM resumes. During the tests, the maximum number of concurrently deployed VMs was limited to 20, respecting the default constraints specified by Amazon EC2.3 That constraint also limits to five the number of VMs of the same type deployed simultaneously in each market. The framework also respected that limit. Regarding the emulation, the λ parameter of a Poisson distribution is the number of expected events divided by a given time interval. Since the application execution is discretized by time intervals and D is the application deadline, let kh and kr denote the expected number (rate) of hibernating and resuming events during the application execution; the λh and λr parameters are therefore given by λh = kh/D and λr = kr/D, respectively. Nevertheless, the actual number of hibernation (resp., resuming) events that occur during the experiment might be greater than the expected kh (resp., kr), as presented in Table 9.8, discussed later. Table 9.6 presents seven different scenarios obtained by varying kh and kr. Two baseline cases are considered: (1) Spot VMs without hibernation, which is the case where the initial primary scheduling is followed without the need for migration; and (2) On-demand only, which uses the same scheduling, but only with On-demand VMs. Table 9.7 presents the average costs of executing the four evaluated applications/jobs (J60, J80, J100 and ED200) in both baseline cases. It also contains the type and number of used VMs, the average makespan in minutes, and the percentage difference between their execution costs (diff). Note that, because the scheduling is the same in both cases, except for the market, the cost difference is around 66.33%–76.25%, which is close to the difference in price between the used Spot and on-demand VMs (see Table 9.5). Observe also that, in Table 9.7, the job makespans are below the job deadline of 35 min. Table 9.8 presents the performance results related to the execution of jobs J60, J80, J100, and ED200 in each one of the seven scenarios sc1 to sc7.


Table 9.7 Baseline executions

Job | #VMs | Makespan | Spot without hibernation | On-demand only | Diff
J60 (6 VMs) | 2-c3.large, 2-c4.large, 2-c4.xlarge | 20:08 | $0.08 | $0.32 | 76.25%
J80 (8 VMs) | 2-c3.large, 1-c3.xlarge, 3-c4.large, 2-c4.xlarge | 19:49 | $0.10 | $0.37 | 72.97%
J100 (10 VMs) | 2-c3.large, 1-c3.xlarge, 4-c4.large, 3-c4.xlarge | 18:43 | $0.13 | $0.43 | 70.78%
ED200 (16 VMs) | 5-c3.large, 2-c3.xlarge, 5-c4.large, 4-c4.xlarge | 31:27 | $0.33 | $0.98 | 66.33%

For each job and scenario, Table 9.8 shows the average number of hibernations and the number of On-demand VMs used in executions where hibernation took place, followed by the corresponding average values of both the makespan and the monetary cost. The percentage difference between the latter and the On-demand baseline is in the last column, labeled diff. Remark that, in all cases, the makespan is less than the 35-min deadline. When compared to the On-demand baseline, HADS presents cost reductions in all of the cases, which vary from 19.79% to 72.92%. For all jobs, the worst results in terms of monetary cost are those for scenario sc2. Such a result is expected since sc2 has no VM resuming (kr = 0) and has the highest hibernation rate (kh = 5). Therefore, in this case, it is always necessary to allocate many On-demand VMs throughout the execution to avoid temporal failures. Nevertheless, it is worth pointing out that, although there is also no possibility of resuming in scenario sc1, the cost is reduced by more than 50% for almost all jobs, except for J80, where the reduction is 46.87%. Such a reduction happens because, in this scenario, the number of hibernations is low (kh = 1), i.e., in general, less than half of the Spot VMs hibernate. Thus, the tasks are migrated to busy or idle VMs instead of newly allocated On-demand VMs. These results illustrate the potential of Spot instances to reduce the monetary cost of executions. Moreover, they confirm the effectiveness of scheduling approaches and the importance of adopting fault tolerance techniques when using Spot instances. They also show the importance of adopting scheduler tools, such as HADS, to execute applications in the cloud. In the next section, we look more deeply at reducing monetary costs through the proper selection of VM types.


Table 9.8 Results of HADS in scenarios sc1 to sc7, where the columns show: λh and λr, the probabilistic mass function of the hibernation and resume events; the average number of hibernations; the number of used On-demand VMs; the average makespan; and the average monetary cost

Jobs | Scenario | λh | λr | # hibernations | # On-demand | Makespan (min) | Cost | Diff
J60 (6 Spot VMs) | sc1 | 1/2100 | 0/2100 | 1.33 | 1.33 | 25:13 | $0.146 | 54.52%
J60 (6 Spot VMs) | sc2 | 5/2100 | 0/2100 | 4.33 | 3.33 | 34:24 | $0.256 | 19.79%
J60 (6 Spot VMs) | sc3 | 1/2100 | 5/2100 | 1.67 | 1.0 | 24:39 | $0.087 | 72.92%
J60 (6 Spot VMs) | sc4 | 5/2100 | 5/2100 | 2.02 | 1.33 | 30:40 | $0.145 | 54.69%
J60 (6 Spot VMs) | sc5 | 3/2100 | 2.5/2100 | 2.00 | 0.67 | 24:31 | $0.090 | 71.77%
J60 (6 Spot VMs) | sc6 | 2/2100 | 1/2100 | 2.00 | 1.00 | 30:45 | $0.097 | 69.79%
J60 (6 Spot VMs) | sc7 | 2/2100 | 2/2100 | 2.00 | 1.67 | 32:02 | $0.093 | 70.94%
J80 (8 Spot VMs) | sc1 | 1/2100 | 0/2100 | 2.57 | 1.0 | 27:11 | $0.197 | 46.82%
J80 (8 Spot VMs) | sc2 | 5/2100 | 0/2100 | 6.33 | 4.67 | 34:56 | $0.284 | 23.12%
J80 (8 Spot VMs) | sc3 | 1/2100 | 5/2100 | 2.67 | 1.30 | 31:53 | $0.117 | 68.47%
J80 (8 Spot VMs) | sc4 | 5/2100 | 5/2100 | 4.00 | 2.33 | 32:41 | $0.140 | 62.09%
J80 (8 Spot VMs) | sc5 | 3/2100 | 2.5/2100 | 4.33 | 3.33 | 33:26 | $0.213 | 42.34%
J80 (8 Spot VMs) | sc6 | 2/2100 | 1/2100 | 2.67 | 1.33 | 26:58 | $0.153 | 58.56%
J80 (8 Spot VMs) | sc7 | 2/2100 | 2/2100 | 2.67 | 1.33 | 29:13 | $0.123 | 66.67%
J100 (10 Spot VMs) | sc1 | 1/2100 | 0/2100 | 2.33 | 0.67 | 26:42 | $0.167 | 61.77%
J100 (10 Spot VMs) | sc2 | 5/2100 | 0/2100 | 7.67 | 3.67 | 30:14 | $0.302 | 30.66%
J100 (10 Spot VMs) | sc3 | 1/2100 | 5/2100 | 1.33 | 1.00 | 26:08 | $0.150 | 65.61%
J100 (10 Spot VMs) | sc4 | 5/2100 | 5/2100 | 3.40 | 1.89 | 34:12 | $0.189 | 56.64%
J100 (10 Spot VMs) | sc5 | 3/2100 | 2.5/2100 | 3.00 | 2.70 | 32:59 | $0.223 | 48.78%
J100 (10 Spot VMs) | sc6 | 2/2100 | 1/2100 | 4.67 | 2.00 | 28:54 | $0.177 | 59.48%
J100 (10 Spot VMs) | sc7 | 2/2100 | 2/2100 | 3.67 | 1.33 | 32:31 | $0.160 | 63.30%
ED200 (16 Spot VMs) | sc1 | 1/2100 | 0/2100 | 3.00 | 3.33 | 32:19 | $0.430 | 56.12%
ED200 (16 Spot VMs) | sc2 | 5/2100 | 0/2100 | 8.33 | 7.67 | 33:03 | $0.657 | 32.99%
ED200 (16 Spot VMs) | sc3 | 1/2100 | 5/2100 | 2.33 | 4.00 | 34:43 | $0.353 | 63.95%
ED200 (16 Spot VMs) | sc4 | 5/2100 | 5/2100 | 5.33 | 4.93 | 34:22 | $0.413 | 57.82%
ED200 (16 Spot VMs) | sc5 | 3/2100 | 2.5/2100 | 4.67 | 5.00 | 33:00 | $0.442 | 54.84%
ED200 (16 Spot VMs) | sc6 | 2/2100 | 1/2100 | 4.00 | 4.67 | 34:01 | $0.523 | 46.60%
ED200 (16 Spot VMs) | sc7 | 2/2100 | 2/2100 | 4.33 | 3.00 | 33:15 | $0.410 | 58.16%

9.3 Reducing Monetary Costs Within Markets

By their nature, HPC applications often require large quantities of resources for extended periods of time. Thus, enterprises and research labs make it a priority to try to optimize their cloud usage for cost and performance: inappropriately chosen cloud instances will likely have a detrimental effect on the budget and on application performance, and ultimately a negative impact on user experience. This section focuses on some of the issues facing users when attempting to reduce costs by choosing an appropriate VM instance. One is often led to believe that, when using the cloud, one does not pay for what is not being used. While elastic resource provisioning has long been associated with cloud computing, in terms of IaaS the concept has basically been limited to horizontal elasticity, where the number of instances is scaled in or out in relation to demand. Vertical elasticity of VM instances is not yet common practice in public cloud offerings. Instead, the user must choose a specific instance type with an appropriate static configuration prior to launching it in the cloud.
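In practice, horizontal elasticity amounts to changing the number of identical instances behind some group abstraction. A minimal boto3 sketch, assuming an already existing Auto Scaling group with a hypothetical name:

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    # Scale the worker pool out to 8 instances; scaling back in is the same call
    # with a smaller value. "hpc-workers" is a placeholder group name.
    autoscaling.set_desired_capacity(
        AutoScalingGroupName="hpc-workers",
        DesiredCapacity=8,
        HonorCooldown=False,
    )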

9.3.1 Instances Galore and the Paradox of Choice

Within each of the previously mentioned cloud markets, most cloud providers generally offer users a variety of instance configurations and sizes, possibly based on different hardware, capacities and characteristics, in an attempt to satisfy the particular needs of as many of their applications as possible. Unfortunately, being spoiled for choice can also be off-putting, especially for relatively inexperienced users. More so when one considers that the types of instances available may vary from one geographical region to another, and that those on offer are constantly being modernized, with older generations of instance configurations being slowly phased out. Through strategic pricing policies, cloud providers frequently offer incentives, in the form of attractive costs, to accelerate the migration to the latest generation of instances. Thus, users frequently have to reconsider the instance choices they make. As a more concrete example, in Amazon EC2 the user must first identify the Instance Type, which will determine the hardware of the server that will be used to host the VM, and then one of the available Instance Sizes for that type. Generally speaking, within an instance type, each size increment involves either a doubling of both the number of physical or logical CPUs and the amount of RAM available, and/or an increase in the local storage capacity and network bandwidth available to the instance. This is usually accompanied by a linear increase in the rental cost of the instance relative to the smallest size of the corresponding type. Amazon EC2 groups instances into five Instance Families. General Purpose instances aim to provide a balance of compute, memory and networking resources for use by a variety of diverse workloads; these instances might be considered a good initial option, especially for applications that use these resources in equal proportions. Compute Optimized instances are suited to more compute-intensive applications that can benefit from high-performance standard CPUs. Memory Optimized instances have a higher memory-to-vCPU ratio than even the Compute Optimized ones, so they can benefit more data-intensive workloads. Accelerated Computing instances employ additional co-processing hardware, such as Tensor-core GPUs or FPGAs, that is able to process certain workload functions more efficiently than conventional CPUs. Finally, Storage Optimized instances are configured to offer high-bandwidth and/or low-latency I/O operations to local and non-local storage. Each of these groups is typically composed of multiple Instance Varieties based on the latest processor generations from different manufacturers. Furthermore, each


instance variety is generally charged at a different base price, on a per-hour or per-second basis, with the cost increasing proportionally with the increase in the instance's size. In the On-Demand market, Amazon EC2 currently4 offers a choice of 483 different Linux-based instances in its US East (N. Virginia) region, while in its US East (New York City) region it has only 5, South America (Sao Paulo) has 299, and Europe (London) 316 instances. This shows that not all of the instances may be available in every region, which may be an issue when choosing an instance if the user needs to use a specific region, for example, for reasons of data locality or legal restrictions. Since one might argue that choosing the instance family group is generally easier, as this depends, in principle, on identifying the predominant characteristic of the application (i.e., whether the application's limiting performance factor is compute, memory or I/O consumption), the remainder of Sect. 9.3 will focus on just one of the five groups, Compute Optimized instances, although the discussion applies equally to any of the families. Compute Optimized instances are normally the first candidates for HPC applications in search of good CPU performance. Table 9.9 presents just the Compute Optimized Linux instances available in the AWS region US East (N. Virginia). The table shows that there are 12 distinct instance types available, each with a specific processor architecture and a number of options in terms of the size of the instance: a total of 101 possible choices for the user. Given the wide variety of hardware configurations and instance sizes available, making the correct choice can actually be a rather complex decision, particularly for less knowledgeable users. Table 9.9 is organized as follows: the instance name usually starts with the single letter "c", referring to the group or family of Compute Optimized instances designed for compute-intensive workloads, and is followed by a number that refers to the generation of this instance group, a higher value implying a more recent release (e.g., the instance type c7g was launched at the end of May 2022, while the families c6i and c6a were both released at the end of 2021). The letters that then follow, if any, refer to the processor manufacturer or family ("a" for AMD, "i" for Intel, or "g" for AWS Graviton, for example), or to enhancements to specific instance resources, such as "n" for network bandwidth or "d" for access to NVMe SSD storage locally on the host server, in addition to EBS, which provides persistent block storage volumes for use with most Amazon EC2 instances. Each instance family has a number of preconfigured instances with increasing capacities in terms of vCPUs, memory and network bandwidth. The number of different instance sizes and the range of vCPUs and Gibibytes (GiB) of memory available in these different sizes is also presented in the table. The corresponding cost per hour in US$ increases proportionally with the increase in size, relative to the smallest instance size for that family. The first cost value is for the smallest sized instance, the second for the largest available instance. Although all of these instance types are also available in the Spot market, the table

4 As of May 2022.


Table 9.9 The currently available AWS EC2 Compute Optimized instance families [13]

Instance type | Processor (clock freq. range) | No. of sizes | vCPUs | Memory (GiB) | On-demand cost/hr
c7g | 3rd gen. AWS Graviton3 | 8 | 1-64 | 2-128 | $0.0363–$2.32
c6i | 3rd gen. Intel Xeon Scalable Ice Lake 8375C (2.9–3.5 GHz) | 10 | 2-128 | 4-256 | $0.085–$5.44
c6a | 3rd gen. AMD EPYC 7R13 (2.65–3.6 GHz) | 11 | 2-192 | 4-384 | $0.0765–$7.344
c6g | 2nd gen. AWS Graviton2 (2.5 GHz) | 9 | 1-64 | 2-128 | $0.034–$2.176
c6gd | 2nd gen. AWS Graviton2 (2.5 GHz) | 9 | 1-64 | 2-128 | $0.0384–$2.4576
c6gn | 2nd gen. AWS Graviton2 (2.5 GHz) | 8 | 1-64 | 2-128 | $0.0432–$2.7648
c5 | 2nd gen. Intel Xeon Scalable Cascade Lake 8275CL (3.0–3.9 GHz), 2nd gen. Intel Xeon Scalable Cascade Lake 8223CL, or 1st gen. Intel Xeon Platinum Skylake 8124 (3.0–3.5 GHz) | 9 | 2-96 | 4-192 | $0.085–$4.08
c5d | 2nd gen. Intel Xeon Scalable Cascade Lake 8275CL (3.0–3.9 GHz), 2nd gen. Intel Xeon Scalable Cascade Lake 8223CL, or 1st gen. Intel Xeon Platinum Skylake 8124 (3.0–3.5 GHz) | 9 | 2-96 | 4-192 | $0.096–$4.608
c5a | 2nd gen. AMD EPYC 7R32 (2.8–3.3 GHz) | 8 | 2-96 | 4-192 | $0.077–$3.696
c5ad | 2nd gen. AMD EPYC 7R32 (2.8–3.3 GHz) | 8 | 2-96 | 4-192 | $0.086–$4.128
c5n | 1st gen. Intel Xeon Platinum Skylake 8124 (3.0–3.5 GHz) | 7 | 2-72 | 5.25-192 | $0.108–$3.888
c4 | 1st gen. Intel Xeon Haswell E5-2666 v3 | 5 | 2-36 | 3.75-60 | $0.100–$1.591

only presents the current On-Demand market costs, because the hourly Spot market price varies based on demand. Notice that the base cost may or may not differ between instance types. Also, the cost for a given instance may differ from AWS region to region. From Table 9.9, one can see that the 101 instance configurations available are based on 10 distinct processors from three different manufacturers. The Intel and AMD processors implement 2-way simultaneous multithreading (hyperthreading in Intel terminology), so AWS EC2 uses the term vCPU to refer to the number of logical cores that will be visible to the VM instance's operating system. Note that the AWS Graviton processors, based on ARM Neoverse N1 cores, do not have this feature, so each vCPU refers to a physical core. When launched, AWS claimed that their 2nd generation Graviton instances (c6g) offered 40% better price-performance than comparable 5th generation x86-based instances (c5 and c5a). While each instance variety generally employs a single processor type, there can be exceptions, for example c5 and c5d, where EC2 will instantiate a VM with one of three processors. Sometimes, depending on the instance size, the user may only discover which processor has been allocated after the instance has booted up. As each processor may provide differing performance, the execution of a given application can take longer and cost more. Most multicore processors adjust the clock frequency of individual cores in order to remain within the processor's total power, current and thermal budget. This means that the single-core performance of an application may also depend on the number of active cores in the instance (within the user's control) and on the server hosting that instance (outside the user's control). In practice, in addition to the processor family, cloud providers often specify a range of clock frequencies, with a minimum guaranteed clock frequency for all cores and an upper threshold that might be attainable depending on how busy the host and instance are (also shown in the table). All of these issues are in addition to the fact that the size and type of the instance to choose also depend on the characteristics of the application and, in some cases, on the size of its inputs [7]. As exemplified by AWS EC2, most public cloud providers offer a portfolio of different off-the-shelf cloud instances (i.e., with pre-selected hardware configurations). But, although these standard instances can be useful in many cases, they may not be the best options for all workloads and organizations. Instead, users may find that customized instances better suit their needs. A custom instance enables users to pick and choose specific memory, CPU and storage resources to fit the specific requirements of their applications. For example, an application might be very CPU intensive, but not require much memory or storage. AWS, Google Cloud and Microsoft Azure all claim to offer the ability to customize and build instances to match the exact needs of the user's workloads; however, given the diversity of the existing instance offering, downsizing specific resource capacities may only save a relatively small amount of money (AWS EC2 does not appear to offer any rebates). Cloud providers insist that users should test their applications on a variety of different instances in order to find a suitable target. But, with the continuous renewal of instance configurations and changing prices due to competition and demand, this can still be both a time-consuming and expensive proposition. Some research has proposed approaches to address the issue of choosing appropriately sized instances for certain classes of applications [7, 12, 15, 17, 32]. An early analysis of resource utilization on the Google Cloud Platform (GCP) reported [26] that, while instances are typically allocated so that more than 80% of the memory capacity and more than 100% of the CPU (i.e., the CPUs are oversubscribed) of the host servers are reserved, the actual memory usage did not exceed 50%, nor did the CPU usage consistently exceed 60%. This seems to reflect a habit of users overestimating the resources required by their applications, and thus incurring unnecessary additional cloud costs.
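A practical first step before any benchmarking campaign is simply enumerating the candidate instance types and their advertised capacities programmatically. A minimal boto3 sketch (the family filter and region are arbitrary examples; the shortlist produced is only a starting point for the tests the providers recommend):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # List current-generation instance types in the c6g family, with their
    # advertised vCPU count and memory, as a starting shortlist to benchmark.
    paginator = ec2.get_paginator("describe_instance_types")
    pages = paginator.paginate(
        Filters=[{"Name": "current-generation", "Values": ["true"]}]
    )
    for page in pages:
        for itype in page["InstanceTypes"]:
            name = itype["InstanceType"]
            if name.startswith("c6g."):
                print(name,
                      itype["VCpuInfo"]["DefaultVCpus"], "vCPUs",
                      itype["MemoryInfo"]["SizeInMiB"], "MiB")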


When focusing on execution time and monetary costs, a typical approach combines their minimization in a single function, which may hamper the quality of the proposed solutions. Differently, the work presented by Durillo and Prodan in [12] proposed a Pareto-based approach to provide a set of (near) optimal solutions that represent a trade-off between the two distinct goals. Meanwhile, with the aim of finding ideal instance configurations, good results were achieved by using evaluations to generate a machine learning model to predict the performance of workloads and then applying search-based techniques to find near-optimal configurations [17]. Hsu et al. use low-level performance information, such as CPU utilization, working memory size and I/O wait times, instead of only core counts and memory size as proposed in previous work, to improve a Bayesian optimization approach [15]. The authors in [16] argued that most cloud optimizers only improve one workload at a time, which can be a prohibitively expensive approach for optimizing many workloads. Based on a large-scale empirical study which showed that there is often a single cloud configuration that is surprisingly near-optimal for most workloads, they created a collective optimizer for finding this efficient cloud configuration. Different from prior solutions that require knowledge about the application or require the execution of the application on multiple VM instances, the work presented by Tavares et al. [32] optimizes the cost of parallel workloads in cloud instances using just the CPU utilization in the VM. The authors highlight that their work is intended for users that do not have enough experience to properly select the best computing resources for their applications. However, while the instance configuration needs to meet all of the application's requirements, independently of which instance type is chosen and whether it is from the Spot or On-Demand market, the user is still faced with a number of other performance-related dilemmas.

9.3.2 Choosing the "Right" Instance May Not Be Enough

Just as an application's performance does not depend solely on the server's hardware but also on its run-time environment, in order to launch an instance in the cloud, the user must specify the software environment the VM instance will boot and run. Choosing the appropriate environment is equally essential to achieving efficient cloud performance and, consequently, to reducing costs. In AWS EC2, this environment is referred to as an Amazon Machine Image (AMI) (in Microsoft Azure, a Virtual Hard Disk, and in GCP, simply a Virtual Machine Image), and it defines not only the operating system, but also the software stack and tools needed to take advantage of the underlying hardware. One can launch multiple instances from a single AMI when multiple instances with the same configuration are required. The cost per hour of an instance often depends on the chosen AMI. AWS EC2 offers a number of Linux-based images, both open source and commercial, as well as Microsoft Windows. Amazon Linux 2 is the current default Linux operating system for EC2 instances. The preferred form of virtualization for Linux AMIs is the hardware virtual machine (HVM), which takes advantage of hardware extensions that provide fast access to the underlying hardware on the host system, including Advanced Vector Extensions (AVX) instructions on microprocessors from AMD and Intel. To highlight some of the issues related to choosing the appropriate instance and image, this section compares the variations in execution times of two state-of-the-art bioinformatics tools that are often used to find the optimal pairwise alignment of genetic sequences. The scientific importance of such tools has been brought to the forefront recently due to the pandemic of 2020. Performing DNA sequence comparisons to find, for example, new SARS-CoV-2 variants or variants of other viruses is essential for understanding the infection each virus may cause, how easily that virus spreads, the severity of associated symptoms, and the effectiveness of the respective vaccines. Over 18 million SARS-CoV-2 DNA sequences, obtained from human infections from around the world, were rapidly made available to scientists in public genome databases, such as GenBank5 from the National Center for Biotechnology Information (NCBI) and GISAID,6 for analysis. While these two applications adopt different algorithms, they also differ in their approach to exploiting parallelism. Their performance and costs are compared when executed on the smallest sized instances of the Compute Optimized instance types c5, c6g, and c5a (i.e., four instances, each containing a single physical core of a different processor architecture). The first application, MLCS [19], is based on an approach that finds the longest common subsequence [39] and was designed to exploit the SIMD registers available in many modern microprocessor architectures, including those available in AWS Compute Optimized instances. The second application is a highly optimized tool designed to align long DNA sequences on a variety of hardware/software platforms, aptly named MASA (Multi-Platform Architecture for Sequence Aligners) [11]. As input to both applications, two strains of the SARS-CoV-2 virus were compared: the original reference sequence from Wuhan and another registered sequence, both obtained from GenBank. The average execution time (in seconds) over ten executions and the corresponding cost (in US dollars) for a single sequence comparison using each application are presented in Tables 9.10 (for MLCS) and 9.11 (for MASA). These executions were also carried out with different AMIs, based on the Amazon Linux 2 and Ubuntu 20.04 operating systems. The times obtained are considered to be quite reliable, with coefficients of variation below 1.5%. Analyzing the results for the MLCS application in Table 9.10, the difference in execution times is noticeable, and so are the monetary costs. The execution times range from 0.73 to 3.98 s, the latter being 5.45 times slower. Different aspects contribute to this. With respect to the AMIs, Ubuntu 20.04 and the default Amazon Linux 2 are both HVM-based and thus are enabled to exploit the vector extension instructions of their respective underlying microarchitectures. The differences between execution times are not substantial, although Ubuntu, in general, is the AMI

5 https://www.ncbi.nlm.nih.gov/genbank/. 6 https://www.gisaid.org/.


Table 9.10 Average execution times and costs of MLCS on 4 Compute Optimized instances

Instance | Processor | Compiler | Ubuntu 20.04 time (s) | Ubuntu 20.04 cost ($) | Amazon Linux 2 time (s) | Amazon Linux 2 cost ($)
c5.large | Intel 8124 | gcc | 2.87 | 0.068 | 3.09 | 0.073
c5.large | Intel 8124 | icc | 0.73 | 0.017 | 0.97 | 0.023
c5.large | Intel 8275CL | gcc | 2.71 | 0.064 | 2.56 | 0.060
c5.large | Intel 8275CL | icc | 0.77 | 0.018 | 0.79 | 0.019
c6g.medium | AWS Graviton2 | gcc | 3.80 | 0.036 | 3.98 | 0.038
c5a.large | AMD EPYC 7R32 | gcc | 2.67 | 0.057 | 2.70 | 0.058

that provides slightly better performance for all instances, with the exception of the Intel 8275CL processor with gcc. While the gcc compiler in Ubuntu is more recent, the version of the Intel icc compiler is the same in both AMIs, so the improvement may possibly be due to differences in the Linux kernel. Comparing the compilers themselves, the Intel icc compiler yields a considerable performance gain over the gcc-compiled version of MLCS. As this application was written to take advantage of SIMD registers, specific compiler options were enabled in both compilers. Nevertheless, it appears that icc is able to harness insider knowledge of the Intel microarchitecture to significantly improve performance. As a consequence, the corresponding monetary cost was the lowest. As shown in Table 9.9, another issue is that for some instance types, for example c5, there can be some uncertainty as to which processor family will be assigned by AWS EC2 when launching an instance of that type. As reflected by the execution times in Table 9.10 for the two different c5.large instances, there can be a penalty, paid for by the user, for receiving the slower, older Intel Skylake 8124 processor. In relation to the Intel Cascade Lake 8275CL processor, the difference in performance with the default AMI is over 20%, although with the Ubuntu AMI this difference drops significantly. Using the gcc-compiled executions for comparison, in some sense the AMD EPYC and AWS Graviton2 instances appear competitive with the Intel Cascade Lake 8275CL c5 instance. In the case of c5a and c5, these execution times are very similar and, being a cheaper instance, the cost of c5a is also a little lower. With its own Graviton2 processor, AWS's claimed price-performance improvement of up to 40% is achievable if the same compiler is used. But while significantly cheaper, its execution times were more than 40% slower. On the other hand, in comparison with the icc-compiled version of MLCS, the c6g instance surprisingly incurs around twice the cost. Results for the MASA application are shown in Table 9.11. Different from the behavior observed with the MLCS application, the variation between the shortest and longest execution times across all instances was much smaller, but still significant: the worst, 2.14 s, being 1.66 times slower than the best execution time of 1.29 s. While the difference between AMIs is now almost negligible, at most 3%, the gcc-compiled version is over 10% slower than the icc-compiled one.


Table 9.11 Average execution times and costs of MASA on 4 Compute Optimized instances

                                             Ubuntu 20.04          Amazon Linux 2
  Instance     Processor       Compiler   Time (s)  Cost ($)    Time (s)  Cost ($)
  c5.large     Intel 8124      gcc          1.53     0.036        1.57     0.037
                               icc          1.36     0.032        1.37     0.032
  c5.large     Intel 8275CL    gcc          1.45     0.034        1.49     0.035
                               icc          1.29     0.030        1.29     0.030
  c6g.medium   AWS Graviton2   gcc          1.89     0.018        1.88     0.018
  c5a.large    AMD EPYC 7R32   gcc          2.13     0.045        2.14     0.046

In terms of the execution times for the different instance types, a user seeking the best performance should opt for Intel-based c5 instances; choosing a larger instance size will guarantee the faster 8275CL processor. For this application, the poorest performance and highest cost were obtained with the c5a AMD EPYC processor-based instance. Similar to the case of MLCS, the cheapest execution was observed using the Graviton2 processor. While the cheapest execution on a c5 instance was 1.67 times more expensive than on the c6g instance, the latter was 1.46 times slower than the Intel 8275CL processor. Therefore, there exists a cost-performance trade-off: different users will have different tolerances for how much longer they are willing to wait for the execution to complete in exchange for a reduction in cost.

Although hourly rates are commonly billed per second, one should also note that cloud providers often charge a minimum period when launching and running an instance. Therefore, to dilute this cost, instances should be utilized for significantly longer durations. For example, in the case of AWS EC2, the minimum charge is for 60 s, which happens to be the average time to boot a VM instance.

As a consequence of the recent worldwide COVID-19 pandemic, it has become common practice for scientists to require the comparison of thousands or even millions of sequences. To address this increasing demand, a number of cloud-based bioinformatics services seek sustainability by aiming to tackle the complexities of finding instances with good performance and cost effectiveness. While most previous research has focused on techniques to choose instances based on their available hardware configurations, this section has tried to shine a light on an orthogonal opportunity to obtain additional cost reductions.
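To make the effect of the minimum billing period concrete, the short Python sketch below computes the billed cost of a run under per-second billing with a 60 s minimum. The hourly price is a hypothetical value chosen for illustration, and the sketch does not attempt to reproduce the per-comparison costs reported in Tables 9.10 and 9.11.

```python
def billed_cost(exec_seconds, hourly_price, minimum_seconds=60):
    """Cost of one run under per-second billing with a minimum charged period."""
    billed_seconds = max(exec_seconds, minimum_seconds)
    return billed_seconds * hourly_price / 3600.0

# Hypothetical on-demand price of $0.085/h for a Compute Optimized instance.
print(billed_cost(0.77, 0.085))    # a 0.77 s alignment is still billed as 60 s
print(billed_cost(3600, 0.085))    # a one-hour batch of comparisons dilutes the minimum
```

Batching many sequence comparisons into a single instance lifetime is therefore the natural way to amortize both the minimum charge and the boot time.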

9.4 Burstable Virtual Machines

In this section, we introduce another approach to reducing the costs of HPC applications on clouds. In particular, we are interested in presenting applications that benefit from the use of burstable VMs.

As pointed out earlier, several types of VMs are offered in a typical cloud environment. Each VM type has a number of computational resources and is usually


grouped by its purpose. For instance, the type C VMs in Amazon have a good relation between price and computational performance and are optimized for CPU-bound applications. In contrast, type R VMs, designed for memory-bound applications, have high-capacity, high-performance memory. In general, there is always a VM type that will fit the application's requirements.

One particular type of VM has gained much attention in recent years: the burstable VM [21, 35]. Burstable VMs provide a baseline CPU performance level and can burst to a higher level of performance to deal with occasional performance peaks of the application. In other words, those instances were designed for workloads that do not demand 100% of the CPU throughout the entire execution, varying between states of idleness (or very low CPU utilization) and high demand. Examples of such applications are micro-services, development and test environments, web servers, and small databases.

All the major cloud providers offer burstable VMs. In Amazon, burstable VMs are those of types T2, T3, and T4 [28]. In Google Cloud and Azure, VMs of the E2 series and the B-series, respectively, have burstable capacities [4, 9]. The reason these types of VMs are so popular is that, by offering a VM that does not have to deliver its maximum computational power all the time, providers can oversubscribe the computational resources. This translates into economic advantages for clients: burstable VMs are usually cheaper than non-burstable ones. According to [28], burstable VMs can save up to 15% in costs when compared to M instances in the On-demand market.

The way providers determine the burst duration of a VM varies. In Amazon, burst times are controlled by a CPU credit regime that determines whether users' applications can access 100% of the CPU performance (burst mode) or not (baseline mode). In summary, when the use of the CPU is below the baseline, the instance accumulates CPU credits. Each earned credit allows users to burst a CPU core for 1 min; when the VM does not have CPU credits, it automatically limits the CPU performance to its baseline [28] (a simplified sketch of this mechanism is shown below). On the other hand, in the Oracle cloud, the ability to burst depends on the CPU usage pattern and the underlying server resource usage. According to its documentation, if the average CPU utilization of an instance over the past 24 h was below the baseline, the provider allows that instance to burst. Therefore, there is no guarantee that an instance will be able to burst when needed [10].

In the related literature, several works have explored those VMs with the objective of reducing the monetary cost or even the execution time of applications. Many of them focus on evaluating the burstable approach and the improvement in computational performance that it can induce. In [21], Leitner and Scheuner presented a first empirical and analytical study of the second generation of AWS burstable instances (the T2 family). They specifically considered T2.micro, T2.small, and T2.medium instances. Their article aimed at answering whether, in terms of monetary cost and performance, these instance types are more efficient than other ones. The presented results show that, compared to general-purpose and compute-optimized instances (2015 generation), the evaluated T2 instances provide a higher CPU performance-cost ratio as long as the average utilization of the instances is below 40%.
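The following Python sketch is a toy model of the CPU-credit regime described above. The baseline fraction, initial balance, credit cap, and accrual rule are illustrative assumptions, not the exact accounting used by any provider.

```python
def simulate_credits(cpu_usage_per_minute, baseline=0.20, credits=30.0, max_credits=144.0):
    """Toy CPU-credit accounting: usage below the baseline earns credits, usage
    above it spends them (1 credit = 1 vCPU-minute at 100%), and once the balance
    hits zero the instance is throttled to its baseline."""
    throttled_minutes = []
    for minute, usage in enumerate(cpu_usage_per_minute):   # usage: fraction of one vCPU
        if usage <= baseline:
            credits = min(max_credits, credits + (baseline - usage))
        else:
            needed = usage - baseline
            if credits >= needed:
                credits -= needed
            else:
                credits = 0.0
                throttled_minutes.append(minute)            # burst denied: capped at baseline
    return credits, throttled_minutes

# 90 quiet minutes followed by a 60-minute burst at full CPU:
balance, throttled = simulate_credits([0.05] * 90 + [1.0] * 60)
print(balance, len(throttled))
```

Workloads whose bursts outlast the accumulated balance, as in this example, end up running at the baseline, which is precisely why long CPU-bound HPC kernels rarely fit burstable instances on their own.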


To figure out the CPU usage limits on the fly, considering the dynamic variation of the workloads, Ali et al. proposed in [2] an autonomic framework that combines lightweight profiling and an analytical model. The objective was to maximize the amount of work done using the burstable capacity of T2 VM instances. The authors state that the framework extends the CPU credit depletion period. Similarly to Leitner and Scheuner's work [21], their results also confirmed the benefits of actively controlling CPU usage when burstable instances are exploited. However, they do not discuss the impact on the total execution time when such an approach is applied.

Jiang et al. [18] analytically modelled the performance of burstable VMs, considering their respective configurations, such as CPU, memory, and CPU credit parameters. They also showed that providers could maximize their total revenue by finding the optimal prices for burstable instances. Although their work does not focus on application performance, its contribution is interesting since it states that providers can offer burstable instances at low prices without losing revenue while meeting QoS parameters.

In [40], Wang et al. combined On-demand, Spot, and burstable instances, proposing an in-memory distributed storage solution. Burstable instances were used as a backup to overcome the performance degradation resulting from Spot instance revocations. According to the authors, those instances' burst capacity makes them ideal candidates for such a backup. Performance results show that the backup based on burstable instances presents a latency 25% lower than that of a backup based on regular instances, thus leading to significant monetary cost savings.

In [35], the authors investigate how burstable instances can be used to reduce the impact of Spot revocations on the application's execution time and the corresponding monetary costs. From their study, the authors proposed a multiobjective framework called BurstHADS that manages the execution of BoT applications with deadline constraints. The framework explores both hibernation-prone Spot VMs and burstable On-demand VMs, aiming to minimize the monetary costs and reduce the execution time (makespan). Their results show that burstable VMs can act as an additional, cheap resource that mitigates the performance loss caused by the revocation or interruption of Spot VMs.

9.5 Conclusions and Future Directions

Despite cloud providers' advertising, which promotes ease of use as one of the main advantages of cloud environments, once all of the variables that must be defined in order to execute a given application efficiently are considered, a cloud can become an intricate environment where every decision directly impacts the final execution and the respective monetary costs. This chapter has presented an overview of how users can use and benefit from the variety of VM instances and contract models on offer from public cloud providers to reduce their financial costs. In particular, the framework HADS was introduced.


With mechanisms to take advantage of Spot VMs even in the presence of hibernation events, it aims to reduce the financial costs of deadline-constrained BoT applications while maintaining good execution performance.

Concerning the choice of instances within markets, this chapter also pointed out important aspects related to using the instances made available by cloud providers that are often overlooked. Rather than exploiting horizontal elasticity, a methodology which may not always be cost-effective, further improvements in terms of both performance and monetary costs may be achieved if the utilization of resources within given instances is maximized. Nonetheless, vertical provisioning of resources is not commonly offered in public clouds.

Nowadays, there are many discussions about the future of cloud computing. In [27], the authors foresee that the majority of data center computing will be dominated by serverless computing. They believe that new general-purpose serverless abstractions will emerge, adding sophisticated state management and automatic optimization to enable many more use cases. They also foresee that, in the future, serverless will simplify the use of hardware accelerators such as Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs) that support specific workloads.

Concerning HPC applications, some very recent papers evaluate the use of serverless cloud computing for some models of HPC applications. For example, in [23], Google Cloud's FaaS (Cloud Functions) is compared with its IaaS (Compute Engine) in terms of cost and performance for an embarrassingly parallel application. Those experiments showed that FaaS can be 14%–40% less expensive than IaaS for the same level of performance. However, the performance of FaaS exhibits higher variation since the number of CPUs allocated (scalability) depends on the cloud provider. As a future direction, many other studies should be conducted with the constantly evolving services that cloud providers offer in order to identify strategies to reduce the financial costs and execution times of HPC applications.

References

1. Joachim H. Ahrens and Ulrich Dieter. Computer methods for sampling from gamma, beta, poisson and binomial distributions. Computing, 12(3):223–246, 1974.
2. Ahsan Ali, Riccardo Pinciroli, Feng Yan, and Evgenia Smirni. Cedule: A scheduling framework for burstable performance in cloud computing. In IEEE International Conference on Autonomic Computing (ICAC), pages 141–150, 2018.
3. Maicon Melo Alves and Lúcia Maria de Assumpção Drummond. A multivariate and quantitative model for predicting cross-application interference in virtual environments. Journal of Systems and Software, 128:150–163, 2017.
4. Microsoft Azure. B-series burstable virtual machine sizes. https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-b-series-burstable. Accessed in May 2022.
5. David Bailey, Tim Harris, William Saphir, Rob Van Der Wijngaart, Alex Woo, and Maurice Yarrow. The NAS parallel benchmarks 2.0. Technical Report NAS-95-020, NASA Ames Research Center, 1995.


6. Rafaela C. Brum, Walisson P. Sousa, Alba C. M. A. Melo, Cristiana Bentes, Maria Clicia Stelling de Castro, and Lúcia Maria de A. Drummond. A fault tolerant and deadline constrained sequence alignment application on cloud-based spot GPU instances. In Leonel Sousa, Nuno Roma, and Pedro Tomás, editors, Euro-Par 2021: Parallel Processing - 27th International Conference on Parallel and Distributed Computing, Lisbon, Portugal, September 1-3, 2021, Proceedings, volume 12820 of Lecture Notes in Computer Science, pages 317–333. Springer, 2021.
7. Jeferson R. Brunetta and Edson Borin. Selecting efficient cloud resources for hpc workloads. In Proceedings of the 12th IEEE/ACM International Conference on Utility and Cloud Computing, UCC'19, pages 155–164, New York, NY, USA, 2019. Association for Computing Machinery.
8. Navraj Chohan, Claris Castillo, Mike Spreitzer, Malgorzata Steinder, Asser Tantawi, and Chandra Krintz. See spot run: Using spot instances for MapReduce workflows. In 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 10), 2010.
9. Google Cloud. E2 machine series. https://cloud.google.com/compute/docs/general-purpose-machines#e2_machine_types. Accessed in May 2022.
10. Oracle Cloud. Burstable Instances. https://docs.oracle.com/en-us/iaas/Content/Compute/References/burstable-instances.htm. Accessed in May 2022.
11. Edans F. De O. Sanders, Guillermo Miranda, Xavier Martorell, Eduard Ayguade, George Teodoro, and Alba C. M. A. De Melo. Masa: A multiplatform architecture for sequence aligners with block pruning. ACM Trans. Parallel Comput., 2(4), February 2016.
12. J.J. Durillo and R. Prodan. Multi-objective workflow scheduling in Amazon EC2. Cluster Computing, 17(2):169–189, 2014.
13. Amazon EC2. Amazon EC2 Instance Types. https://aws.amazon.com/ec2/instance-types/. Accessed in May 2022.
14. Yifan Gong, Bingsheng He, and Amelie Chi Zhou. Monetary cost optimizations for mpi-based hpc applications on amazon clouds: Checkpoints and replicated execution. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '15, New York, NY, USA, 2015. Association for Computing Machinery.
15. Chin-Jung Hsu, Vivek Nair, Vincent W. Freeh, and Tim Menzies. Arrow: Low-level augmented bayesian optimization for finding the best cloud vm. In 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), pages 660–670, 2018.
16. Chin-Jung Hsu, Vivek Nair, Tim Menzies, and Vincent Freeh. Micky: A cheaper alternative for selecting cloud instances. In 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), pages 409–416, 2018.
17. Chin-Jung Hsu, Vivek Nair, Tim Menzies, and Vincent W. Freeh. Scout: An experienced guide to find the best cloud configuration. ArXiv, abs/1803.01296, 2018.
18. Yuxuan Jiang, Mohammad Shahrad, David Wentzlaff, Danny HK Tsang, and Carlee Joe-Wong. Burstable instances for clouds: Performance modeling, equilibrium analysis, and revenue maximization. In IEEE INFOCOM Conference on Computer Communications, pages 1576–1584, 2019.
19. Mario João Jr, Alexandre C. Sena, and Vinod E. F. Rebello. On the parallelization of hirschberg's algorithm for multi-core and many-core systems. Concurrency and Computation: Practice and Experience, 31(18):e5174, 2019.
20. K. Lee and M. Son. DeepSpotCloud: Leveraging Cross-Region GPU Spot Instances for Deep Learning. In 2017 IEEE 10th Int. Conf. on Cloud Computing (CLOUD), pages 98–105, 2017.
21. Philipp Leitner and Joel Scheuner. Bursting with possibilities – an empirical study of credit-based bursting cloud instance types. In IEEE/ACM 8th International Conference on Utility and Cloud Computing (UCC), pages 227–236, 2015.
22. Sifei Lu, Xiaorong Li, Long Wang, Henry Kasim, Henry Novianus Palit, Terence Hung, Erika Fille Tupas Legara, and Gary Kee Khoon Lee. A dynamic hybrid resource provisioning approach for running large-scale computational applications on cloud spot and on-demand instances. In 19th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2013, Seoul, Korea, December 15-18, 2013, pages 657–662, 2013.


23. S. Malla and K. Christensen. HPC in the cloud: Performance comparison of function as a service (FaaS) vs infrastructure as a service (IaaS). Internet Technology Letters, 3(1):e137, 2020.
24. Aniruddha Marathe, Rachel Harris, David K. Lowenthal, Bronis R. de Supinski, Barry Rountree, and Martin Schulz. Exploiting redundancy and application scalability for cost-effective, time-constrained execution of hpc applications on amazon ec2. IEEE Transactions on Parallel and Distributed Systems, 27(9):2574–2588, 2016.
25. Ishai Menache, Ohad Shamir, and Navendu Jain. On-demand, spot, or both: Dynamic resource allocation for executing batch jobs in the cloud. In 11th International Conference on Autonomic Computing, ICAC '14, Philadelphia, PA, USA, June 18-20, 2014, pages 177–187, 2014.
26. Charles Reiss, Alexey Tumanov, Gregory R. Ganger, Randy H. Katz, and Michael A. Kozuch. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proceedings of the Third ACM Symposium on Cloud Computing, SoCC '12, New York, NY, USA, 2012. Association for Computing Machinery.
27. Johann Schleier-Smith, Vikram Sreekanti, Anurag Khandelwal, Joao Carreira, Neeraja J. Yadwadkar, Raluca Ada Popa, Joseph E. Gonzalez, Ion Stoica, and David A. Patterson. What serverless computing is and should become: The next phase of cloud computing. Commun. ACM, 64(5):76–84, apr 2021.
28. Amazon Web Services. Burstable performance instances. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-performance-instances.html. Accessed in May 2022.
29. Prateek Sharma, Stephen Lee, Tian Guo, David E. Irwin, and Prashant J. Shenoy. Spotcheck: designing a derivative iaas cloud on the spot market. In Proceedings of the Tenth European Conference on Computer Systems, EuroSys 2015, Bordeaux, France, April 21-24, 2015, pages 16:1–16:15, 2015.
30. Supreeth Subramanya, Tian Guo, Prateek Sharma, David E. Irwin, and Prashant J. Shenoy. Spoton: a batch computing service for the spot market. In Proceedings of the Sixth ACM Symposium on Cloud Computing, SoCC 2015, Kohala Coast, Hawaii, USA, August 27-29, 2015, pages 329–341, 2015.
31. Moussa Taifi, Justin Y. Shi, and Abdallah Khreishah. Spotmpi: A framework for auction-based hpc computing using amazon spot instances. In Proceedings of the 11th International Conference on Algorithms and Architectures for Parallel Processing - Volume Part II, ICA3PP'11, pages 109–120, Berlin, Heidelberg, 2011. Springer-Verlag.
32. William F. C. Tavares, Marcio R. M. Assis, and Edson Borin. Leveraging vcpu-utilization rates to select cost-efficient vms for parallel workloads. In Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing, New York, NY, USA, 2021. Association for Computing Machinery.
33. L. Teylo, L. Arantes, P. Sens, and L. M. d. A. Drummond. A bag-of-tasks scheduler tolerant to temporal failures in clouds. In 31st International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pages 144–151, 2019.
34. Luan Teylo. Scheduling Deadline Constrained Bag-of-Tasks in Cloud Environments using Hibernation prone Spot Instances. PhD thesis, Federal Fluminense University, 2021. Available at http://www.ic.uff.br/PosGraduacao/frontend-tesesdissertacoes/download.php?id=1015.pdf&tipo=trabalho (Accessed in May 2022).
35. Luan Teylo, Luciana Arantes, Pierre Sens, and Lucia Drummond. Scheduling bag-of-tasks in clouds using spot and burstable virtual machines. IEEE Transactions on Cloud Computing, 2021.
36. Luan Teylo, Luciana Arantes, Pierre Sens, and Lúcia M. A. Drummond. A dynamic task scheduler tolerant to multiple hibernations in cloud environments. Clust. Comput., 24(2):1051–1073, 2021.
37. Prateeksha Varshney and Yogesh Simmhan. Autobot: Resilient and cost-effective scheduling of a bag of tasks on spot vms. IEEE Trans. Parallel Distrib. Syst., 30(7):1512–1527, 2019.
38. Marcel Wagenländer, Luo Mai, Guo Li, and Peter Pietzuch. Spotnik: Designing Distributed Machine Learning for Transient Cloud Resources. In 12th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 20). USENIX Association, July 2020.


39. Robert A. Wagner and Michael J. Fischer. The string-to-string correction problem. J. ACM, 21(1):168–173, January 1974.
40. Cheng Wang, Bhuvan Urgaonkar, Aayush Gupta, George Kesidis, and Qianlin Liang. Exploiting spot and burstable instances for improving the cost-efficacy of in-memory caches on the public cloud. In Twelfth European Conference on Computer Systems, pages 620–634, 2017.
41. Amelie Chi Zhou, Jianming Lao, Zhoubin Ke, Yi Wang, and Rui Mao. Farspot: Optimizing monetary cost for hpc applications in the cloud spot market. IEEE Transactions on Parallel and Distributed Systems, pages 1–1, 2021.
42. J. Zhou, Y. Zhang, and W. Wong. Fault Tolerant Stencil Computation on Cloud-Based GPU Spot Instances. IEEE Trans. on Cloud Comput., 7(4):1013–1024, 2019.

Chapter 10

Ensuring Application Continuity with Fault Tolerance Techniques

Rafaela Brum, Luan Teylo, Luciana Arantes, and Pierre Sens

10.1 Introduction

A cloud environment is a distributed system composed of hundreds to millions of components. At such a scale, the probability of failures is extremely high and, therefore, failures become the norm and not the exception [97]. In general, public providers like Google and Amazon offer guarantees of high availability for their services, but they are not 100% failure-safe. For instance, in 2017, during an operational intervention on AWS, a typo in one of the commands executed by the technicians took down a large number of servers, affecting the S3 service. Suddenly, numerous services on the internet, including services offered by big players like Quora and Spotify, started to report crashes. Millions of users could not use the services in question during the four hours needed to solve the problem [57]. Another recent incident happened in December 2020, when the N. Virginia region (us-east-1) of AWS EC2 faced a significant outage that took down many sites on the internet, rendering unavailable services that were essential to the functioning of devices such as autonomous vacuum cleaners and doorbells [25]. Other public cloud providers, such as Microsoft Azure and Google Cloud, have also had to cope with failures in recent years [4].

R. Brum Fluminense Federal University, Niterói, Brazil e-mail: [email protected] L. Teylo Inria Bordeaux Sud Ouest, Bordeaux, France e-mail: [email protected] L. Arantes () · P. Sens Sorbonne Université, CNRS, Paris, France e-mail: [email protected]; [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 E. Borin et al. (eds.), High Performance Computing in Clouds, https://doi.org/10.1007/978-3-031-29769-4_10


Therefore, the probability that failures will have an impact on cloud and user applications is extremely high. In particular, for HPC applications, usually composed of long-running tasks whose execution can last for days or even months, failures can have strong negative consequences on the correct execution of the application and can even cause work loss. Thus, it is fundamental to know which fault tolerance level a given HPC application requires and which is the ideal fault-tolerance technique to achieve it.

Beyond failures, clients and their application tasks can suffer interruptions related to the execution model of a cloud service. In this case, the provider does not offer full guarantees in terms of reliability and might revoke the service. The most well-known example of such services is the preemptible or spot VMs offered by the majority of cloud providers [32]. Such VMs have economic advantages but can be revoked at any time, contrary to on-demand VMs; the former can have prices up to 90% below the latter. One can argue that such revocations are not failures since they are not generated by unexpected behavior or malfunction. However, fault tolerance techniques are mandatory for ensuring the reliability of such services; otherwise, applications will not work correctly. Hence, in this chapter, we also consider service revocation as a type of failure and discuss how FT techniques can be used to extract the maximum economic advantage from this execution model while also guaranteeing the correct execution of applications.

HPC applications are typically executed in the cloud using the Infrastructure as a Service (IaaS) model, where computational resources, such as storage, computing, and network, are offered as virtual machines (VMs) on a pay-as-you-go basis [52]. Hence, in order to execute an application, the user requests a set of VMs, sets up the execution environment, and launches the application. At the end of the execution, the total monetary cost is computed based on the execution time and the VMs' costs. Nevertheless, one or more VMs can be interrupted due to failures or revocation, stopping the user's application and increasing the execution time and, in some cases, the monetary cost. Fault tolerance mechanisms thus play two fundamental roles: they ensure that applications finish correctly and they avoid large monetary cost increases.

In this chapter, we consider the two main types of faults: crash and resource revocation. In the context of clouds, a crash happens when a resource stops unexpectedly, for instance, when an on-demand VM stops working before the client releases it. On the other hand, a revocation happens when a resource, such as a spot or preemptible VM, is intentionally stopped by the provider. In both cases, the most common techniques used to tolerate them are checkpoint-rollback and replication. Therefore, we present an overview of solutions from the related literature that provide fault tolerance for HPC applications in cloud environments based on these two techniques. We also discuss some works on fault detection and the different existing reliable cloud storage services.

The remainder of the chapter is organized as follows. The next section summarizes fault tolerance techniques in distributed systems, focusing mainly on fault detection, checkpointing, and replication. Section 10.3 discusses the implementation of these techniques in clouds and the different approaches available in the related


literature. Section 10.4 concludes the chapter and discusses some future directions and challenges.

10.2 Fault Tolerance

This section presents some fault tolerance concepts and mechanisms used in distributed systems. Section 10.2.1 discusses the concept of failure detection, a step that precedes the use of fault tolerance techniques, while Sects. 10.2.2 and 10.2.3 respectively present the checkpoint/recovery and replication techniques. Section 10.2.4 presents some MPI projects and approaches that provide fault tolerance, while Sect. 10.2.5 discusses some issues in applying fault tolerance mechanisms to HPC applications.

10.2.1 Failure Detection

A classic approach for tolerating failures in distributed systems is to detect them and then recover from them. The failure detection phase is essential in reducing system unavailability, thus playing a central role in the engineering of such systems.

Proposed by Chandra and Toueg [21], unreliable failure detectors (FDs) can be seen as oracles which provide information on task crashes. They usually output a list of tasks suspected of having crashed. The information is unreliable in the sense that correct tasks might be falsely suspected of having crashed, and faulty tasks might still be trusted after they have crashed. If an FD later detects its mistake, it corrects it. For instance, an FD can stop suspecting, at time t + 1, a task that it suspected at time t.

Unreliable failure detectors are usually characterized by two properties: completeness and accuracy, as defined by Chandra and Toueg [21]. Completeness characterizes the failure detector's capability of suspecting faulty tasks, while accuracy characterizes the failure detector's capability of not suspecting correct tasks, i.e., it restricts the mistakes that the failure detector can make. Two kinds of completeness and four kinds of accuracy are defined by Chandra and Toueg [21], which, once combined, yield eight classes of failure detectors.

Numerous failure detector implementations and classes have been proposed in the literature based on Chandra and Toueg's seminal work. They usually differ in their system assumptions, such as the type of node (identifiable, anonymous [10], homonymous [7]), the type of link [1, 2, 47] (lossy asynchronous, reliable, timely, eventually timely, etc.), behavior properties [1, 54], the type of network (static [9, 47], dynamic [6, 35]), etc.

Regarding implementation, unreliable FDs usually exploit either a timer-based or a message-pattern approach. In the first one, FD implementations make use of timers to detect failures in tasks. Two mechanisms can be used to implement the timer-based strategy: heartbeat and pinging. In the heartbeat mechanism [23], every task q periodically


sends an "I am alive" message to the task p that is responsible for monitoring q. If p does not receive such a message from q before the expiration of a timer, it adds q to its list of suspected tasks. If p later receives an "I am alive" message from q, p then removes q from its list of suspected tasks. In the pinging mechanism [26, 100], every task p periodically sends a query message "Are you alive?" to the other tasks. Upon reception of such a message, a task q replies with an "I am alive" message. The heartbeat strategy has advantages over pinging since the former sends half as many messages as the latter while providing the same detection quality. Furthermore, a heartbeat detector estimates only the transmission delay of "I am alive" messages, whereas the pinging detector must estimate the transmission delay of "Are you alive?" messages, the reaction delay, and the transmission delay of "I am alive" messages.

The message-pattern strategy does not use any timeout mechanism. In Mostefaoui et al. [54], the authors propose an implementation that exploits such a strategy. A task p sends a QUERY message to the n nodes that it monitors and then waits for responses (RESPONSE messages) from α tasks (α ≤ n, traditionally α = n − f, where f is the maximum number of failures). Task p then starts to suspect every task whose response is not among the first α received.
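As an illustration of the heartbeat mechanism just described, the Python sketch below implements a minimal unreliable failure detector that suspects any monitored task whose "I am alive" messages stop arriving before a timeout, and revokes the suspicion if a later heartbeat arrives. The fixed timeout value is an assumption made for the sake of the example.

```python
import time

class HeartbeatFD:
    """Minimal heartbeat-style unreliable failure detector."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}       # task name -> time of last "I am alive"
        self.suspected = set()

    def notify(self, task):
        """Called whenever an "I am alive" message is received from `task`."""
        self.last_seen[task] = time.time()
        self.suspected.discard(task)      # correct an earlier mistake, if any

    def check(self):
        """Return the current list of suspected tasks."""
        now = time.time()
        for task, seen in self.last_seen.items():
            if now - seen > self.timeout:
                self.suspected.add(task)
        return self.suspected
```

In a real deployment the timeout would be adapted to the observed network delays, as discussed for cloud-oriented detectors in Sect. 10.3.1.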

10.2.2 Checkpointing

Checkpointing and rollback recovery are well-known techniques to provide fault tolerance for parallel applications [3, 22, 30]. Each application task periodically saves its state on reliable storage in a checkpoint and, when a failure is detected, the execution is rolled back and resumed from earlier checkpoints. In a distributed context, backward error recovery of a task can result in a domino effect: to recover from a failure, the execution must be rolled back to a consistent state, but rolling back one task could result in an avalanche of rollbacks of other tasks before a consistent state is found. Figure 10.1 illustrates such an effect.

Fig. 10.1 Cascade of recoveries. Xs represent checkpoints. To recover from p0's failure, tasks p0, p1, and p2 need to recover from C0,1, C1,2, and C2,1 to maintain a consistent global state, represented by the red line


Fig. 10.2 Chandy-Lamport algorithm. Xs represent a consistent global state and messages “in transit” are logged. In case of failure, all tasks recover only from their last checkpoint

Numerous approaches to checkpointing and rollback recovery have been proposed in the literature for parallel systems. Checkpointing techniques can be divided into two categories: consistent and independent checkpointing.

With consistent checkpointing, tasks coordinate their checkpointing actions such that the collection of checkpoints represents a consistent state of the whole system, where the saved local state of each task does not depend on the receipt of a message that is yet to be sent [30]. When a failure occurs, the system restarts from these checkpoints. Chandy and Lamport [22] proposed the first algorithm to save a consistent global state, assuming FIFO communication channels. When a task starts a new checkpoint, it sends a special message called a marker over all its output channels. When a task receives a marker for the first time, it checkpoints. After beginning a checkpoint, all messages received from a neighbor n are added to the checkpoint image until the marker from n is received. Figure 10.2 illustrates the Chandy-Lamport algorithm. The main drawback of this approach is that the messages used for synchronizing a checkpoint are an important source of overhead. Moreover, after a failure, surviving tasks may have to roll back to their latest checkpoint in order to remain consistent with the recovering tasks. Alternatively, Koo and Toueg [45] reduce the number of tasks that must roll back by analyzing the interactions between tasks.

In the second approach, each task independently saves its state with no synchronization with the others. This technique is simple, but since the set of checkpoints may not define a consistent global state, the failure of one task leads to the rollback of other tasks. Reliable message logging [13, 27] avoids this domino effect. Logging methods fall into two classes: pessimistic and optimistic. Pessimistic message logging synchronously saves messages [16, 75], i.e., the receiver is blocked until the message is logged on stable storage. In this way, all sent messages are logged, and a recovering task will directly access the log to receive the messages again in the same order. A recovered task then has no interaction with the others until it reaches the last state before the failure. Optimistic message logging reduces failure-free overhead by logging recovery information asynchronously [16, 89]. Several messages can be grouped together and written to stable storage in a single operation to reduce the logging overhead. However, tasks that survive a failure may be rolled back.
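To make the marker rules concrete, the following Python sketch captures the per-task logic of the Chandy-Lamport algorithm under the FIFO-channel assumption. The class name, the use of a send callback, and the integer application state are illustrative choices, not part of the original algorithm's specification.

```python
from collections import defaultdict

MARKER = "MARKER"

class SnapshotTask:
    """Per-task rules of a Chandy-Lamport consistent snapshot (FIFO channels assumed)."""

    def __init__(self, name, incoming_channels):
        self.name = name
        self.state = 0                                    # application state (a counter here)
        self.recording = False
        self.saved_state = None
        self.marker_seen = {c: False for c in incoming_channels}
        self.channel_log = defaultdict(list)              # "in transit" messages per channel

    def _checkpoint(self, broadcast_marker):
        self.saved_state = self.state                     # record the local state...
        self.recording = True
        broadcast_marker(self.name)                       # ...then send a marker on every output channel

    def start_snapshot(self, broadcast_marker):
        self._checkpoint(broadcast_marker)                # snapshot initiator

    def receive(self, channel, message, broadcast_marker):
        if message == MARKER:
            if not self.recording:
                self._checkpoint(broadcast_marker)        # first marker: checkpoint immediately
            self.marker_seen[channel] = True              # channel state for `channel` is now closed
        else:
            self.state += message                         # normal application message
            if self.recording and not self.marker_seen[channel]:
                self.channel_log[channel].append(message)  # message was in transit at snapshot time
```

A small driver that delivers messages in FIFO order per channel and implements broadcast_marker over all outgoing channels completes the simulation; the saved states plus the per-channel logs together form the consistent global state illustrated in Fig. 10.2.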


Fig. 10.3 Execution scenarios illustrate the checkpoint and recovery approach (Scenario 01: failure-free execution; Scenario 02: a failure followed by a recovery procedure; time axis from 0 to 90)

Another central concern when implementing a checkpointing technique is the time overheads involved. Basically, they can be divided into recovery and dump overheads. The latter concerns the time spent recording the application state in checkpoint files on stable storage [95], while the former is related to the time spent reading these files and restarting the application. Depending on the recovery approach, the recovery time can also include extra overheads, such as the time to detect the failure. Consequently, both time overheads have a direct impact on the efficiency of the checkpointing technique.

In order to illustrate the difficulty in choosing a good checkpointing strategy, let us consider the example in Fig. 10.3, where two different execution scenarios are presented. In both cases, the application records a checkpoint every 15 units of time, and each checkpoint takes 5 time units to be recorded. In the first scenario, no failure happens, and the total execution time of the application is 75 time units. In the second scenario, just after the second checkpoint, the platform faces a failure (represented by the red X) and the application is interrupted. Once the failure is detected, the recovery procedure starts, and the application rolls back to its last recorded state, finishing with a total execution time of 95 time units. In both scenarios, the total dump time was 15 units of time (5 units per checkpoint). In Fig. 10.3, the recovery procedure took 10 time units.

Sometimes, the advantages of using a checkpoint strategy are not straightforward. For instance, considering only the total execution time (and ignoring monetary cost), if a failure takes place at time 50, an application without checkpointing will restart from the beginning, spending 110 units of time in total. Thus, in this case, checkpointing is worthwhile. However, if the checkpointing duration were 15 time units instead of 5, restarting the application from the beginning would take less time. Therefore, the time and cost of saving and recovering the checkpoint files need to be included in the overall time and cost of the checkpointing technique, which should not be greater than the cost of simply restarting the application. For this purpose, the storage system where checkpoints are recorded needs to be not only stable and reliable but also fast.

A second critical parameter that needs to be carefully chosen when using checkpoint-rollback recovery techniques is the checkpointing interval, which defines the time between two consecutive checkpoints, i.e., the frequency with which the


application’s states are recorded. Such a frequency also has an impact on the efficiency of the checkpoint strategy. On the one hand, the smaller the interval is, the higher the number of recorded checkpoints, leading to higher dump time. On the other hand, the longer the interval, the smaller the number of recorded checkpoints and the higher the recovery time. Thus, the ideal would be to adapt the checkpointing frequency according to the rate of failures or mean time between failures (MTBF). The closer the checkpoint frequency is to the frequency of failures, the more optimized the number of checkpoints will be. In Siavvas and Gelenbe [87], the authors propose a mathematical model to compute the optimal interval for application-level checkpoints of long-running loops. A single expression gives the interval by considering the program failure rate as well as dump and recovery times.

10.2.3 Replication

Replication has been applied to achieve fault tolerance in both distributed systems and databases, where a client interacts with a replicated service. Replication schemes are usually classified into three main types: active, semi-active, and passive replication.

In the active replication scheme, also called the state-machine approach [74], all replicas process the requests received from the client so that their internal states are closely synchronized. Then, any replica can respond to client requests, providing a low response time in the case of a crash. However, to ensure strong consistency, all replicas must receive the requests in the same order and processing must be deterministic, which renders such a scheme quite costly. Semi-active replication [24] extends active replication. While the actual processing of a request is performed by all replicas, one of the replicas, the leader, is responsible for performing the non-deterministic processing and informing the other replicas, called the followers.

With the passive replication technique, also called primary-backup [18], one of the replicas, the primary, receives the requests from the clients and returns responses. The other replicas, the backups, interact only with the primary and receive state update messages from it. This replication technique requires less processing power than the active ones and makes no assumption about the determinism of request processing. However, like semi-active replication, the implementation of passive replication requires a mechanism to agree on the primary (e.g., a leader election or group membership). If the primary fails, one of the backups takes over. This leads to a significantly increased response time in the case of failure, which makes passive replication unsuitable for time-critical applications.
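The toy Python sketch below illustrates the passive (primary-backup) scheme in its simplest form: the first live replica acts as primary, applies client requests, and pushes state updates to the backups, so that a backup can take over after a crash. The names and the in-memory "election" rule are simplifications for illustration only.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.state = {}      # replicated key-value state
        self.alive = True

class PrimaryBackupGroup:
    """Minimal passive replication: only the primary processes requests."""

    def __init__(self, replicas):
        self.replicas = replicas

    @property
    def primary(self):
        return next(r for r in self.replicas if r.alive)    # naive take-over rule

    def handle(self, key, value):
        primary = self.primary
        primary.state[key] = value                           # primary executes the request
        for backup in self.replicas:
            if backup is not primary and backup.alive:
                backup.state[key] = value                    # state-update message to the backups
        return primary.name

group = PrimaryBackupGroup([Replica("r0"), Replica("r1"), Replica("r2")])
group.handle("x", 1)
group.replicas[0].alive = False          # the primary crashes
print(group.handle("y", 2))              # "r1" takes over; it already holds x from the updates
```

The take-over step is exactly where the increased response time mentioned above comes from: a real implementation needs failure detection and agreement on the new primary before requests can be served again.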

10.2.4 Fault Tolerant MPI

Some projects, such as MPI-FT [14] or MPI/FT [20], integrate fault tolerance support into the MPI standard, where failures are totally masked from users and handled


by the MPI library. Unfortunately, several studies point out that user-transparent approaches exhibit poor efficiency on exaflop platforms [11, 12]. The User Level Failure Mitigation (ULFM) interface [51] adopts an alternative approach, offering application developers a set of functions to implement fault tolerance that takes into account the properties of the target application. ULFM includes functionalities for task failure detection and communication reconfiguration, but it does not provide a strategy for data restoration.

According to Ansel et al. [5], the checkpointing and rollback recovery of MPI applications is typically performed using user-level checkpointing libraries, which demand that all communication between tasks be made exclusively through MPI. Scalable Checkpoint/Restart for MPI (SCR) is probably one of the most popular such libraries for MPI applications [53]. It has been in production since 2007 and has several advantages over other solutions: it is a multilevel library that includes several strategies to reduce the load on critical shared resources such as the parallel file system. Another popular user-level library, Distributed MultiThreaded CheckPointing (DMTCP) [5], has been successfully used to checkpoint MPI applications running in a cloud environment [8].

MPI applications can also be checkpointed at the application level. For instance, CRAFT [85], an open source library, offers basic functionalities for implementing application-level checkpoints in MPI applications. According to the authors, the main advantage of such an approach is the reduction of the checkpointing overhead. Besides that, CRAFT also supports SCR, which enables checkpoint storage and recovery at the node level. At the system level, the Berkeley Lab Checkpoint/Restart library (BLCR) [37] is probably one of the most widely used checkpoint-restart implementations. BLCR was initially developed for Linux clusters, but Azeem and Helal [8] showed that BLCR can be used to checkpoint MPI applications running on multiple EC2 VMs.

10.2.5 Fault Tolerance in HPC Applications

Efficient fault tolerance mechanisms for HPC applications should consider performance and scalability issues. In an HPC context, fault tolerance mechanisms have conflicting goals, as they should provide good performance in both failure-free execution and recovery while limiting the amount of resources used. Since the failure rate increases proportionally to the number of nodes, large HPC applications require a high checkpointing frequency to limit the impact of rollbacks on the response time. On the other hand, increasing the frequency has a direct impact on failure-free execution performance. Message logging can avoid rolling back all the tasks, but at the cost of saving messages in node memory, while the memory available per CPU tends to shrink as the number of nodes increases. Coordinated checkpointing does not require saving any messages, but if a failure occurs all tasks need to roll back to the last checkpoint. Some hybrid protocols, combining coordinated checkpointing


and message logging, have been proposed for the fault tolerance of HPC applications at large scale [15, 56]. Note that HPC applications that use MPI [46] can be made fault tolerant by using one of the libraries or approaches discussed in Sect. 10.2.4.

10.3 Fault Tolerance in Clouds

This section discusses the implementation of failure detectors (Sect. 10.3.1), checkpoint-rollback (Sect. 10.3.2), and replication (Sect. 10.3.4) in the context of cloud environments. As presented in Sect. 10.2, these techniques are extensively used in distributed fault-tolerant systems. They have been extended to clouds by considering at which level they should be applied (application or provider) and cloud features such as elasticity, network dynamics, storage, and monetary cost. In Sect. 10.3.3, we present some of the existing storage services in public clouds and how they can be used alongside the VMs to implement the checkpointing approach. Finally, in Sect. 10.3.5, we discuss some existing solutions from the literature that tolerate the revocation of spot and preemptible VMs allocated by applications, guaranteeing that the latter execute correctly.

We point out that the addition of a fault tolerance feature can increase the user's final monetary cost, either because of the contracting and use of storage services or extra VMs, or because of the increase in execution time caused by additional overheads. Thus, a critical challenge is to define which resources should be used to implement the fault tolerance feature, leading to a good trade-off between monetary cost and reliability.

10.3.1 Failure Detectors in Clouds

As highlighted by Bui et al. [19], failure detectors (FDs) in the context of clouds have to cope with several features of the environment, such as elasticity, multi-purpose user services that continuously cause changes in the system, and the large number of nodes, which makes collecting failure detection information difficult. Clients and providers can respectively detect faults of application tasks and of physical resources. On the other hand, both of them can detect faults of virtual machines. The FD associated with an application should monitor the state of the tasks and/or VMs during their lifetime. In the case of a VM failure, the application requests the allocation of a new VM from the provider and then restarts, in the new VM, the tasks that were running in the failed one [92]. Some works in the literature propose the implementation of failure detection in clouds [49, 60, 99].


Xiong et al. [99] state that an FD for cloud environments should automatically adjust its parameters according to the dynamics of the network, which can vary greatly over time. Hence, they present SFD, a self-tuning FD for cloud computing networks. Every SFD module has a sliding window which maintains the most recent samples of heartbeat arrival times and, at the next timeout delay, the parameters are adjusted using both the information in the sliding window and the FD's output, thus matching recent network conditions.

The adaptive failure detector AFD for cloud computing infrastructures [60] exploits autonomic techniques and does not rely on failure history. It continuously monitors the cloud execution, collects runtime performance data, and then extracts the most relevant metrics, which are used to detect possible failures. When failures are detected, the AFD adapts itself to these new detection results.

Since clouds are composed of several non-overlapping layers (e.g., IaaS, SaaS, and PaaS), Lee et al. [49] argue that having a single heartbeat-based FD is not a good solution, as failures should be distinguished; for instance, failures in the system should be distinguished from those of the application or of the power supply. Thus, they propose to group cloud environment components into linearly dependent layers. Based on such layers, their FD solution can determine the faulty layer without needing to conduct fault detection in all layers.
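In the spirit of the sliding-window adaptation used by SFD [99] (the exact tuning rule below is our own simplification, not the one in the paper), a self-adjusting heartbeat timeout can be sketched in Python as follows.

```python
from collections import deque

class AdaptiveTimeout:
    """Next heartbeat timeout = mean of recent inter-arrival times plus a jitter margin."""

    def __init__(self, window_size=100, margin=2.0, initial_timeout=1.0):
        self.samples = deque(maxlen=window_size)   # sliding window of inter-arrival times
        self.margin = margin
        self.last_arrival = None
        self.timeout = initial_timeout

    def on_heartbeat(self, arrival_time):
        if self.last_arrival is not None:
            self.samples.append(arrival_time - self.last_arrival)
            mean = sum(self.samples) / len(self.samples)
            jitter = max(self.samples) - min(self.samples)
            self.timeout = mean + self.margin * jitter   # adapt to current network conditions
        self.last_arrival = arrival_time
        return self.timeout
```

The window keeps the estimate responsive to the elasticity-driven changes mentioned above: when network delays grow, so does the timeout, reducing false suspicions.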

10.3.2 Implementing Checkpoints in Cloud

A checkpoint can be implemented at one of the three following distinct levels, according to the degree of transparency and the location in the software stack [8, 40, 87]: (1) application level, (2) user level, and (3) system level.

At the application level, the application's code needs to indicate when checkpoints should be taken. The checkpoint procedure then captures the state of the application through direct interaction with it. Such an approach is expected to be the most efficient one since the programmer knows which data structures and variables must be preserved and which may be discarded [72, 87]. However, its applicability is restricted to the case where the application's code is available to be modified. Furthermore, the recovery time might be a concern, as it consists of the time to request, boot up, and configure a new VM on which to reload the application.

At the user level, checkpoints are implemented in user space and provide transparency to the application by virtualizing system calls. According to [40], such virtualization allows checkpoint tools to capture the state of the entire process without being tied to the kernel, thus providing more portability across platforms, but at the cost of a constant virtualization overhead. Moreover, user-level checkpoints are usually larger than application-level ones since they cannot take advantage of memory optimizations based on application semantics.

System-level checkpoint procedures are implemented either in the kernel or as a kernel module. In this case, the whole memory stack of the application is saved. Different from the user-level implementation, checkpointing at the system level does


not need to virtualize system call interfaces, since it has direct access to the kernel structures [40]. However, such procedures are often tied to the kernel version, making them not portable across different platforms.
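A minimal application-level checkpoint, in the sense described above, can be as simple as periodically serializing the handful of variables that define the computation's progress. The Python sketch below is illustrative: the file name, iteration counts, and the trivial loop body are assumptions, not part of any particular tool.

```python
import os
import pickle

CHECKPOINT_FILE = "state.ckpt"   # hypothetical path; in a cloud setting this would sit on reliable storage

def run(total_iterations=1_000_000, checkpoint_every=100_000):
    # Resume from the last checkpoint if one exists, otherwise start from scratch.
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "rb") as f:
            iteration, accumulator = pickle.load(f)
    else:
        iteration, accumulator = 0, 0.0

    while iteration < total_iterations:
        accumulator += iteration * 1e-9           # stand-in for the real computation
        iteration += 1
        if iteration % checkpoint_every == 0:
            # Only the variables the programmer knows to be essential are saved.
            with open(CHECKPOINT_FILE + ".tmp", "wb") as f:
                pickle.dump((iteration, accumulator), f)
            os.replace(CHECKPOINT_FILE + ".tmp", CHECKPOINT_FILE)  # atomic rename keeps the file consistent
    return accumulator

if __name__ == "__main__":
    print(run())
```

Writing to a temporary file and renaming it avoids leaving a half-written checkpoint behind if the VM is revoked in the middle of a dump.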

10.3.2.1 Bag-of-Tasks Applications

Bag-of-Tasks applications are composed of independent jobs (or tasks) which can thus be executed in parallel and in any order. Such a lack of task dependency simplifies the checkpointing implementation, since the latter does not require coordination between tasks. In other words, each task can take its checkpoint independently. CRIU [31], a very popular checkpointing tool, has been used in several works to guarantee reliability for applications running in clouds [42, 93]. It runs at the user level and saves the full state of the process without any changes to the application code. In Teylo et al. [94], the authors applied CRIU to record the checkpoints of BoT applications running in the Amazon EC2 cloud. Note that defining a good failure rate in cloud environments is not a straightforward task, particularly on the client side. Consequently, in clouds, the checkpoint intervals are typically either user-defined fixed intervals or adaptive ones [4].
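Since CRIU is driven from the command line, a BoT scheduler can invoke it through a thin wrapper such as the Python sketch below. The flag names follow the CRIU documentation as we recall them and should be checked against the installed version; the image directory and the required privileges (CRIU normally runs as root) are assumptions.

```python
import subprocess

def criu_dump(pid, image_dir, keep_running=True):
    """Checkpoint a running task (and its children) into `image_dir` using CRIU."""
    cmd = ["criu", "dump", "-t", str(pid), "-D", image_dir, "--shell-job"]
    if keep_running:
        cmd.append("--leave-running")   # dump without killing the task
    subprocess.run(cmd, check=True)

def criu_restore(image_dir):
    """Restore a previously dumped task from `image_dir`."""
    subprocess.run(["criu", "restore", "-D", image_dir, "--shell-job"], check=True)
```

In a Spot-VM setting, the image directory would typically be synchronized to one of the storage services discussed next, so that the restore can happen on a different VM.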

10.3.3 Reliable Cloud Storage Solutions

Cloud providers have several storage services available to rent. As of January 2022, Amazon Web Services (AWS) offers a total of 11 storage services divided into seven categories [83], while Google Cloud Platform (GCP) offers nine different storage services divided into eight categories [63]. Each storage service focuses on different needs in an institution's workflow. For example, AWS DataSync [77] and GCP's Data Transfer Services [70] focus mainly on data migration, while Amazon Simple Storage Service (S3) [76] and GCP's Cloud Storage [61] focus on storing data in the form of objects, without an underlying file system.

Files resulting from checkpoints can be huge. Furthermore, they need to be saved quickly and easily retrieved. The general storage services, based on object, file, or block storage, provide reliability guarantees and are usually cheaper than specialized storage services. Thus, in this section, we discuss these services in both AWS and GCP.

Amazon S3 and Cloud Storage are the object storage services from AWS and GCP, respectively. They represent objects similarly, using a two-level organization [61, 76]. At the upper level, they use buckets, structures similar to folders that have a globally unique name and help organize the data of different users, identifying and billing them accordingly. In S3, each bucket is restricted to a single region, and each account can be associated with up to 100 buckets. In Cloud Storage, the user can configure the bucket availability to a single cloud region, to two close regions (dual-region), or to several regions spread over a larger area


(multi-region). There is no limit in GCP on the number of buckets in a single account, but there are bounds regarding the bucket's name and creation rate [62].

Objects are the lower level of these two storage services. They contain the user's stored data, represented by a name and a unique key used to access the object.1,2 Both services impose an upper limit of 5 TB on the size of a single object [62, 76] and allow the user to create, change, and read objects in a bucket using a single operation. However, if the user wants to rename or move an object to another place, it takes at least two operations: downloading the object to a local system and then uploading it with the new name or to the new location.

The block storage service of AWS is called Amazon Elastic Block Store (EBS) [78], and those of GCP are Persistent Disk [68] and Local SSD [66]. In these services, the user creates storage volumes and attaches them to directories inside a Virtual Machine (VM) of each provider [73]. EBS allows the user to allocate disks from 1 GB to 16 TB [78], and these volumes can be Solid State Drives (SSDs), with low latency, or Hard Disk Drives (HDDs), with higher throughput. AWS restricts the availability of an EBS volume to a single zone (data center) of a region. In the Persistent Disk service of GCP, it is possible to create HDD or SSD volumes in a single cloud zone or in all zones of a cloud region. The size limit for the former is 10 GB to 64 TB and for the latter is 200 GB to 64 TB [69]. Both AWS and GCP allow the user only to increase the size of a volume while it is attached to a VM.3,4 Besides, a specific type of AWS EBS volume and all GCP Persistent Disk types can be attached to multiple VMs in read-only mode. However, the user does not know the exact physical location of an EBS or Persistent Disk volume. GCP's Local SSD service allows users to physically attach an SSD volume of 375 GB to a single instance, offering higher performance and lower latency compared to GCP's Persistent Disks [66].

Regarding file storage services, AWS offers EFS [80] and FSx [82], while GCP offers only Filestore [64]. FSx focuses on application migration from on-premise clusters to cloud resources. The user can choose from four high-performance file systems (NetApp ONTAP, OpenZFS, Windows File Server, and Lustre), making it easier to connect FSx to a local machine and send data to AWS. On the other hand, Amazon EFS and GCP Filestore provide a simple and scalable file system. They increase and decrease their allocated size automatically when the user adds or removes files. Both are compatible only with the Network File System (NFS).

1 https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingObjects.html, last access in July 19th, 2022.
2 https://cloud.google.com/storage/docs/naming-objects, last access in July 19th, 2022.
3 https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-modify-volume.html, last access in July 19th, 2022.
4 https://cloud.google.com/compute/docs/disks/working-with-persistent-disks#resize_pd, last access in July 19th, 2022.


The main advantages of these three file storage systems are their availability in all zones of a single region and their accessibility in parallel by several VMs (up to 500 VMs in GCP [65] and 120 VMs in AWS [81]). However, GCP imposes a 16 TiB size limit on a single file, while AWS establishes a 47.9 TiB size limit on a single file.
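For object storage, saving and fetching a checkpoint file amounts to a single PUT or GET. The Python sketch below uses the boto3 client for Amazon S3; the bucket name and key prefix are hypothetical and credentials are assumed to be configured in the environment.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-checkpoint-bucket"          # hypothetical; must already exist in the chosen region

def save_checkpoint(local_path, task_id):
    # One upload per checkpoint; S3 bills per request and per GB-month stored.
    s3.upload_file(local_path, BUCKET, f"checkpoints/{task_id}/state.ckpt")

def fetch_checkpoint(task_id, local_path):
    s3.download_file(BUCKET, f"checkpoints/{task_id}/state.ckpt", local_path)
```

Because many tasks can read and write the same bucket concurrently, this pattern fits BoT and workflow applications better than a block volume attached to a single VM, a point revisited in Sect. 10.3.3.1 below.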

10.3.3.1 Choice of the Storage Service

When using checkpointing to tolerate failures, the choice of the most suitable storage service depends mainly on the checkpointing patterns of the application. If the implementation requires a dedicated space to store the checkpoints with a single task accessing it at a time, the straightforward choice is a local volume from Amazon EBS or GCP's Local SSD. However, in this case, the price depends not only on what is stored but also on the size of the allocated volume. On the other hand, if the application stores multiple checkpoints concurrently or different tasks access the same checkpoint simultaneously, it is possible to use the object storage services (Amazon S3 or GCP's Cloud Storage) or the file storage systems (Amazon EFS or GCP's Filestore). The main difference between them is that the object storage services are cheaper than the file storage ones, while the latter usually have better performance than the former.

In Teylo et al. [95], the authors compared the dump and recovery times of a checkpoint stored in Amazon S3, Amazon EBS, and Amazon EFS. They showed that the fastest service for storing the checkpoint is EBS, while EFS is the fastest for recovery (Figures 1 and 4 in Teylo et al. [95]). Moreover, the time to store sequential checkpoints in Amazon S3 accounts for 37.1% of the total execution time, while in EBS it corresponds to only 11.4% (Figure 3 in Teylo et al. [95]). However, most works in the literature use Amazon S3 as the storage service for checkpoints [93, 101, 102], and very few use a spare EBS volume [90]. Such a difference can be explained by the possibility of concurrent access to a bucket, whereas each EBS volume is limited to a single VM at a time, which is not viable for several types of HPC applications. Examples of such applications are those with several independent tasks executing in parallel (Bag-of-Tasks) and those with data dependencies (workflows) or data exchange between tasks. On the other hand, in Amazon, it is more costly to store multiple checkpoints in EFS than in S3. For example, storing 30 GB for a month in S3 costs $21.91, while in EFS it costs $30.19 [95].

According to Nicolae and Cappello [58], one strong argument for using Amazon EBS instead of Amazon S3 to store checkpoints is that, most of the time, VMs use only part of the attached volume rather than the full allocated size, leaving a huge portion of paid storage unused. Therefore, the authors propose a shared pool, composed of all the spare disk space, to store and recover the checkpoints of applications. As several disks are used, different parts of the pool can be accessed simultaneously. To further benefit from this multitude of disks, all checkpoints are divided into small pieces and distributed among the disks so that they can be recovered in less time.


10.3.4 Replication

Depending on the level of control over the placement of each replicated task, we can divide replication in clouds according to either the provider's or the client's view. The provider sees the virtual machines (VMs) apart from the physical ones, which allows deploying different replicas on different physical resources. A client, on the other hand, cannot dictate where their VMs are deployed. In 2017, AWS released the spread placement group approach, which allows users to request the placement of their VMs on distinct hardware5 but is limited to seven VMs per cloud zone [84]. Thus, if the user needs more than seven VMs for her/his application, the only way to guarantee the mapping to different resources is to choose a different cloud zone per placement group (of seven VMs). In this case, however, the communication time between the deployed VMs increases considerably, as does their data access time, which can become prohibitive for some applications. Due to such performance issues, most works found in the literature assume that different VMs reside on separate physical resources, even within the same cloud region. It is also worth pointing out that the task replication approach increases the total execution cost for the client, since he/she pays for the execution of every replica. This higher monetary cost explains why there are more fault-tolerant solutions in clouds based on checkpointing than on replication: the former does not consume more resources than strictly necessary [58]. To the best of our knowledge, there is only one work concerning replication from the provider's view. Qiu et al. [71] presented an active task replication framework that executes the clients' jobs, each job mapped as a set of virtual machines with distinct tasks. The framework focuses on increasing reliability and performance and on decreasing energy consumption. After receiving a client request, the framework actively creates replicas for each VM and allocates them to different and heterogeneous resources. There are some works on task replication from the client's view, in which the physical data center of each VM is unknown. In [104], Zhu et al. present a passive replication technique in which every task of a workflow has a primary and a backup copy. They schedule them on different VMs and balance the number of primary copies across all deployed instances. Li et al. [50] and Xie et al. [98] propose task replication in clouds with a variable number of replicas per task. Both papers use empirical fault rates in a Poisson distribution to estimate the number of copies per task, aiming at minimizing costs. Li et al. consider a deadline constraint and thus need to schedule all replicas, while Xie et al. consider a reliability bound, which allows removing the duplicates that surpass such a limit in order to reduce costs. Consequently, the latter ensures lower reliability than the former but with lower execution costs.
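For reference, the following Python sketch shows how a spread placement group can be requested with boto3 and how instances are launched into it, up to the seven-instances-per-zone limit mentioned above. The region, AMI identifier, and instance type are illustrative assumptions.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

# Spread placement groups map each instance to distinct underlying hardware,
# but accept at most seven running instances per Availability Zone.
ec2.create_placement_group(GroupName="replica-spread", Strategy="spread")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # hypothetical AMI
    InstanceType="c5.xlarge",
    MinCount=7,
    MaxCount=7,                        # the per-zone limit for spread groups
    Placement={"GroupName": "replica-spread"},
)
```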

5 https://aws.amazon.com/about-aws/whats-new/2017/11/introducing-spread-placement-groups-for-amazon-ec2/, last access August 1st, 2022.


Another interesting paper is the one by Nik et al. [55], in which the authors propose an active task replication solution that does not increase execution costs. Tasks are replicated in the idle slots of the scheduling solution, with at most one replica per task. As the idle slots may not accommodate every task in the job, the probability of each task failing on its primary assigned virtual machine is computed and used, together with the expected execution time of each task, to select the ones to replicate. However, the approach does not guarantee a minimum level of reliability for all tasks, since some of them will not have replicas.

10.3.5 Fault Tolerance and Preemptible VMs

Preemptible VMs are offered with a steep discount but can be revoked at any time by the cloud provider. Therefore, fault tolerance techniques are mandatory for long-running applications using these VMs in order to ensure their complete execution. In the related literature, several works have explored preemptible (also called spot) VMs to reduce the monetary cost of executions. The majority of them rely on checkpoint-rollback approaches to guarantee that applications will finish even if revocations occur. Moreover, on-demand VMs are generally used as a backup resource: when a spot VM is revoked, the application is typically resumed on an on-demand VM. In Sharma et al. [86], for instance, the authors proposed SpotCheck, a framework that uses nested VMs within spot VMs to provide the illusion of a platform with always-available VMs. In order to cope with spot revocations, the nested VMs are migrated to an on-demand VM whenever a revocation occurs. AutoBoT [96] uses both spot and on-demand VMs for executing applications with a user-defined deadline. The framework migrates applications from spot to on-demand VMs to satisfy time constraints. It also uses checkpoint strategies to ensure reliability when executing on the preemptible VMs. Yi et al. propose in [101] an adaptive checkpointing scheme that takes the price history of spot VMs into account to predict their revocation and decide when a checkpoint should be recorded. In Subramanya et al. [90], the authors implement a proactive mechanism in which the number of checkpoints is related neither to the VMs' volatility nor to the number of revocations, but to a given checkpointing interval. In Varshney and Simmhan [96], three checkpoint strategies are proposed: (1) optimistic checkpoint, where the state of the task is recorded just before the migration to an on-demand VM; (2) grace period checkpoint, where the 2 min between the notification of the interruption of a spot VM and the interruption itself are used to take the checkpoint; and (3) sliding checkpoint, where checkpoints are taken at fixed intervals. A framework that exploits both spot and on-demand VMs to execute Bag-of-Tasks applications is proposed by Teylo et al. [93]. It aims at minimizing the execution's monetary cost while respecting a deadline defined by the user. Periodically,


the state of the application is recorded by checkpointing. Then, in case of spot VM revocation, the checkpoints are used to resume the application on on-demand VMs.
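The two-minute interruption notice used by the grace-period strategy can be observed from inside an EC2 spot instance through the instance metadata service. The following Python sketch polls that endpoint and triggers an application-level checkpoint; save_state is a placeholder for whatever state the application persists, and the polling interval is an arbitrary choice.

```python
import time
import urllib.error
import urllib.request

# EC2 exposes a spot interruption notice via the instance metadata service.
NOTICE_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def save_state() -> None:
    # Placeholder: dump the application state (e.g., to S3 or EFS) so that
    # the job can later be resumed on an on-demand VM.
    ...

def watch_for_interruption(poll_seconds: int = 5) -> None:
    while True:
        try:
            # A successful response means an interruption notice is present
            # (IMDSv2 may additionally require a session token; omitted here).
            urllib.request.urlopen(NOTICE_URL, timeout=2)
            save_state()          # roughly two minutes remain at this point
            return
        except urllib.error.URLError:
            pass                  # 404 / timeout: no interruption scheduled yet
        time.sleep(poll_seconds)
```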

10.4 Conclusion and Future Directions

In this chapter, we have discussed checkpoint/rollback and replication techniques, implemented in the cloud and/or at the application level, that provide fault tolerance to HPC applications, ensuring their correct and complete execution. However, most of the referenced solutions concern applications that use only CPUs for computation. We believe that accelerators, such as GPUs and FPGAs, can be used to reduce the total execution time of different HPC applications, and, therefore, there is a growing concern around fault-tolerant solutions in clouds using accelerators. GPUs are accelerators with thousands of simple cores that execute a single instruction on multiple data, while FPGAs are reconfigurable devices with logic blocks that can map different programs onto specialized hardware. The increasing number of supercomputers with accelerators in the Top500 list shows that they already play a part in HPC [28]. As presented by Jain and Cooperman [41], the number of clusters with Graphics Processing Units (GPUs) in the list was 136 in November 2019, growing to 151 in November 2021. At the same time, most cloud providers offer VMs with accelerators, such as GPUs and Field-Programmable Gate Arrays (FPGAs), to their users [67, 79]. However, to the best of our knowledge, there are few (resp., no) works in the literature that propose checkpoints on GPUs (resp., FPGAs) in clouds. Since GPU architectures evolve constantly and quickly, creating a generic checkpoint solution is very difficult, and most existing ones quickly become obsolete [33, 43, 59, 91]. Jain and Cooperman [41] present a GPU checkpoint approach in an initial development stage, suitable only for small applications. Hence, HPC applications using GPUs in clouds require application-level checkpoints or other fault-tolerant techniques to ensure their correct and complete execution in case of failures. Lee and Son [48] propose both application-level checkpointing and live migration to reduce computational costs when training Deep Learning tasks on clouds. The model weights are saved after each training epoch and thus used as checkpoints. Furthermore, the spot VM price per region is monitored, aiming at migrating tasks to the cheapest region. In Zhou et al. [103], a fault-tolerant stencil computation for AWS GPU instances, based on two-phased application-level checkpoints, is presented. The first phase blocks the execution of the stencil while copying the GPU memory block to the host memory, while the second one sends this memory block to a backup server asynchronously. The two-phased application-level checkpoints behave as a pipeline to mitigate the communication overhead between the host and the backup server. Brum et al. [17] present a framework to execute a sequence alignment application on AWS spot VMs. The goal is to minimize the monetary cost, considering user-defined deadline constraints. Application-level checkpoints


periodically save rows of the matrix computed to find the optimal sequence alignment. When a spot VM is revoked, the execution is restarted on another VM from the last saved row. Regarding FPGA checkpointing, since the FPGA is reconfigurable hardware, two different checkpoint approaches are used: the first is applied to the task running in the FPGA; the second concerns the hardware configuration itself [44]. The former is more restricted, as it needs to be restored on the same device with the same hardware configuration, while with the latter the computation can be restarted on another device. Most early works focus on executing multiple tasks in the same FPGA to allow preemption and context switching inside a single FPGA. Therefore, they present several mechanisms to stop and restore the execution of the concurrent tasks, using only task-level FPGA checkpoints. In a fault-tolerance context, however, this checkpoint approach cannot be considered generic due to the restriction on the restore. In Koch et al. [44], the authors propose the first formal model for hardware checkpoints, with different mechanisms to change each hardware module, thus improving the checkpointing capability. Since this first formal model, several others have been proposed, based on signal collection to reconstruct FPGA execution traces for checkpointing [34, 38, 39, 88]. On the other hand, all these models need dedicated hardware, which increases the overhead of FPGA synthesis, rendering them impractical in most cases [29, 36].

References 1. Aguilera, M.K., Delporte-Gallet, C., Fauconnier, H., Toueg, S.: On Implementing Omega with Weak Reliability and Synchrony Assumptions. In: Proceedings of the Twenty-Second Annual Symposium on Principles of Distributed Computing, PODC ’03, p. 306–314. Association for Computing Machinery, New York, NY, USA (2003) 2. Aguilera, M.K., Delporte-Gallet, C., Fauconnier, H., Toueg, S.: Communication-Efficient Leader Election and Consensus with Limited Link Synchrony. In: Proceedings of the Twenty-Third Annual ACM Symposium on Principles of Distributed Computing, p. 328–337. Association for Computing Machinery, New York, NY, USA (2004) 3. Alvisi, L., Marzullo, K.: Message logging: pessimistic, optimistic, causal, and optimal. IEEE Transactions on Software Engineering 24(2), 149–159 (1998) 4. Amoon, M., El-Bahnasawy, N., Sadi, S., Wagdi, M.: On the design of reactive approach with flexible checkpoint interval to tolerate faults in cloud computing systems. Journal of Ambient Intelligence and Humanized Computing 10(11), 4567–4577 (2019) 5. Ansel, J., Arya, K., Cooperman, G.: DMTCP: Transparent checkpointing for cluster computations and the desktop. In: 2009 IEEE International Symposium on Parallel Distributed Processing, pp. 1–12 (2009) 6. Arantes, L., Greve, F., Sens, P., Simon, V.: Eventual Leader Election in Evolving Mobile Networks. In: Proceedings of the 17th International Conference on Principles of Distributed Systems - Volume 8304, OPODIS 2013, p. 23–37. Springer-Verlag, Berlin, Heidelberg (2013) 7. Arévalo, S., Anta, A.F., Imbs, D., Jiménez, E., Raynal, M.: Failure Detectors in Homonymous Distributed Systems (with an Application to Consensus). J. Parallel Distrib. Comput. 83(C), 83–95 (2015) 8. Azeem, B.A., Helal, M.: Performance evaluation of checkpoint/restart techniques: For MPI applications on Amazon cloud. In: 2014 9th International Conference on Informatics and Systems, pp. PDC–49. IEEE (2014)


9. Bertier, M., Marin, O., Sens, P.: Performance analysis of a hierarchical failure detector. In: International Conference on Dependable Systems and Networks, 2003 (DSN), pp. 635–644 (2003) 10. Bonnet, F., Raynal, M.: Anonymous asynchronous systems: the case of failure detectors. Distributed Comput. 26(3), 141–158 (2013) 11. Bosilca, G., Bouteiller, A., Brunet, E., Cappello, F., Dongarra, J.J., Guermouche, A., Hérault, T., Robert, Y., Vivien, F., Zaidouni, D.: Unified model for assessing checkpointing protocols at extreme-scale. Concurr. Comput. Pract. Exp. 26(17), 2772–2791 (2014) 12. Bougeret, M., Casanova, H., Robert, Y., Vivien, F., Zaidouni, D.: Using group replication for resilience on exascale systems. Int. J. High Perform. Comput. Appl. 28(2), 210–224 (2014) 13. Bouteiller, A., Bosilca, G., Dongarra, J.J.: Redesigning the message logging model for high performance. Concurr. Comput. Pract. Exp. 22(16), 2196–2211 (2010) 14. Bouteiller, A., Bosilca, G., Dongarra, J.J.: Redesigning the message logging model for high performance. Concurr. Comput. Pract. Exp. 22(16), 2196–2211 (2010) 15. Bouteiller, A., Hérault, T., Bosilca, G., Dongarra, J.J.: Correlated set coordination in fault tolerant message logging protocols for many-core clusters. Concurr. Comput. Pract. Exp. 25(4), 572–585 (2013) 16. Bouteiller, A., Ropars, T., Bosilca, G., Morin, C., Dongarra, J.J.: Reasons for a pessimistic or optimistic message logging protocol in MPI uncoordinated failure, recovery. In: Proceedings of the 2009 IEEE International Conference on Cluster Computing, August 31 - September 4, 2009, New Orleans, Louisiana, USA, pp. 1–9. IEEE Computer Society (2009) 17. Brum, R.C., Sousa, W.P., Melo, A.C.M.A., Bentes, C., de Castro, M.C.S., Drummond, L.M.A.: A Fault Tolerant and Deadline Constrained Sequence Alignment Application on Cloud-Based Spot GPU Instances. In: L. Sousa, N. Roma, P. Tomás (eds.) Euro-Par 2021: Parallel Processing, pp. 317–333. Springer International Publishing, Cham (2021) 18. Budhiraja, N., Marzullo, K., Schneider, F.B., Toueg, S.: The Primary-Backup Approach, p. 199–216. ACM Press/Addison-Wesley Publishing Co., USA (1993) 19. Bui, K.T., Vo, L.V., Nguyen, C.M., Pham, T.V., Tran, H.C.: A fault detection and diagnosis approach for multi-tier application in cloud computing. J. Commun. Networks 22(5), 399– 414 (2020) 20. Buntinas, D., Coti, C., Hérault, T., Lemarinier, P., Pilard, L., Rezmerita, A., Rodriguez, E., Cappello, F.: Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI protocols. Future Gener. Comput. Syst. 24(1), 73–84 (2008) 21. Chandra, T.D., Toueg, S.: Unreliable Failure Detectors for Reliable Distributed Systems. J. ACM 43(2), 225–267 (1996) 22. Chandy, K.M., Lamport, L.: Distributed Snapshots: Determining Global States of Distributed Systems. ACM Trans. Comput. Syst. 3(1), 63–75 (1985) 23. Chen, W., Toueg, S., Aguilera, M.K.: On the Quality of Service of Failure Detectors. IEEE Trans. Comput. 51(1), 13–32 (2002) 24. Chereque, M., Powell, D., Reynier, P., Richier, J.L., Voiron, J.: Active replication in Delta4. In: [1992] Digest of Papers. FTCS-22: The Twenty-Second International Symposium on Fault-Tolerant Computing, pp. 28–37 (1992) 25. D’Antoni, J.: The Night the Lights Went Out in the Cloud: Lessons from the AWS Outage. https://redmondmag.com/articles/2020/12/02/lessons-from-aws-outage.aspx. Accessed: 2022-03-20 26. Das, A., Gupta, I., Motivala, A.: SWIM: scalable weakly-consistent infection-style process group membership protocol. 
In: Proceedings International Conference on Dependable Systems and Networks (DSN), pp. 303–312 (2002) 27. Dichev, K., Sensi, D.D., Nikolopoulos, D.S., Cameron, K.W., Spence, I.: Power Log’n’Roll: Power-Efficient Localized Rollback for MPI Applications Using Message Logging Protocols. IEEE Transactions on Parallel & Distributed Systems 33(06), 1276–1288 (2022) 28. Dongarra, J., Luszczek, P.: TOP500, pp. 2055–2057. Springer US, Boston, MA (2011)


29. Egwutuoha, I.P., Levy, D., Selic, B., Chen, S.: A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. The Journal of Supercomputing 65(3), 1302–1326 (2013) 30. Elnozahy, E.N.M., Alvisi, L., Wang, Y.M., Johnson, D.B.: A Survey of Rollback-Recovery Protocols in Message-Passing Systems. ACM Comput. Surv. 34(3), 375–408 (2002) 31. Emelyanov, P.: Criu: Checkpoint/restore in userspace, july 2011. https://criu.org (2011) 32. García, Á.L., del Castillo, E.F., Plasencia, I.C.: An efficient cloud scheduler design supporting preemptible instances. Future Generation Computer Systems 95, 68–78 (2019) 33. Garg, R., Mohan, A., Sullivan, M., Cooperman, G.: CRUM: Checkpoint-Restart Support for CUDA’s Unified Memory. In: 2018 IEEE International Conference on Cluster Computing (CLUSTER), pp. 302–313 (2018) 34. Goeders, J., Wilton, S.J.E.: Signal-Tracing Techniques for In-System FPGA Debugging of High-Level Synthesis Circuits. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 36(1), 83–96 (2017) 35. Gómez-Calzado, C., Lafuente, A., Larrea, M., Raynal, M.: Fault-Tolerant Leader Election in Mobile Dynamic Distributed Systems. In: IEEE 19th Pacific Rim International Symposium on Dependable Computing (PRDC), pp. 78–87 (2013) 36. Hale, R., Hutchings, B.: Enabling Low Impact, Rapid Debug for Highly Utilized FPGA Designs. In: 2018 28th International Conference on Field Programmable Logic and Applications (FPL), pp. 81–813 (2018) 37. Hargrove, P.H., Duell, J.C.: Berkeley lab checkpoint/restart (blcr) for linux clusters. In: Journal of Physics: Conference Series, vol. 46, p. 067. IOP Publishing (2006) 38. Holanda Noronha, D., Zhao, R., Goeders, J., Luk, W., Wilton, S.J.: On-Chip FPGA Debug Instrumentation for Machine Learning Applications. In: Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’19, p. 110–115. Association for Computing Machinery, New York, NY, USA (2019) 39. Hung, E., Wilton, S.J.E.: Scalable Signal Selection for Post-Silicon Debug. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 21(6), 1103–1115 (2013) 40. Hursey, J.: Coordinated checkpoint/restart process fault tolerance for MPI applications on HPC systems. Indiana University (2010) 41. Jain, T., Cooperman, G.: CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM. In: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–15 (2020) 42. Jesus Leonardo; Drummond, L.M.A., Oliveira, D.d.: Eeny meeny miny moe: Choosing the fault tolerance technique for my cloud workflow. In: Latin American High Performance Computing Conference, pp. 321–336. Springer (2017) 43. Jiang, H., Zhang, Y., Jennes, J., Li, K.C.: A Checkpoint/Restart Scheme for CUDA Programs with Complex Computation States. International Journal of Networked and Distributed Computing 1, 196–212 (2013) 44. Koch, D., Haubelt, C., Teich, J.: Efficient Hardware Checkpointing: Concepts, Overhead Analysis, and Implementation. In: Proceedings of the 2007 ACM/SIGDA 15th International Symposium on Field Programmable Gate Arrays, FPGA ’07, p. 188–196. Association for Computing Machinery, New York, NY, USA (2007) 45. Koo, R., Toueg, S.: Checkpointing and Rollback-Recovery for Distributed Systems. IEEE Transactions on Software Engineering SE-13(1), 23–31 (1987) 46. 
Laguna, I., Marshall, R., Mohror, K., Ruefenacht, M., Skjellum, A., Sultana, N.: A large-scale study of mpi usage in open-source hpc applications. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’19. Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/ 3295500.3356176 47. Larrea, M., Anta, A.F., Arévalo, S.: Implementing the weakest failure detector for solving the consensus problem. Int. J. Parallel Emergent Distributed Syst. 28(6), 537–555 (2013) 48. Lee, K., Son, M.: DeepSpotCloud: Leveraging Cross-Region GPU Spot Instances for Deep Learning. In: 2017 IEEE 10th Int. Conf. on Cloud Computing (CLOUD), pp. 98–105 (2017)


49. Lee, Y.L., Liang, D., Wang, W.J.: Optimal Online Liveness Fault Detection for Multilayer Cloud Computing Systems. IEEE Transactions on Dependable and Secure Computing (2021) 50. Li, Z., Yu, J., Hu, H., Chen, J., Hu, H., Ge, J., Chang, V.: Fault-tolerant scheduling for scientific workflow with task replication method in cloud. In: V. Munoz, R. Walters, F. Firouzi, G. Wills, V. Chang (eds.) IoTBDS 2018 - Proceedings of the 3rd International Conference on Internet of Things, Big Data and Security, pp. 95–104. SciTePress (2018) 51. Losada, N., González, P., Martín, M.J., Bosilca, G., Bouteiller, A., Teranishi, K.: Fault tolerance of MPI applications in exascale systems: The ULFM solution. Future Gener. Comput. Syst. 106, 467–481 (2020) 52. Manvi, S.S., Shyam, G.K.: Resource management for Infrastructure as a Service (IaaS) in cloud computing: A survey. Journal of network and computer applications 41, 424–440 (2014) 53. Moody, A., Bronevetsky, G., Mohror, K., Supinski, B.R.d.: Design, modeling, and evaluation of a scalable multi-level checkpointing system. In: SC ’10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–11 (2010) 54. Mostefaoui, A., Mourgaya, E., Raynal, M.: Asynchronous implementation of failure detectors. In: International Conference on Dependable Systems and Networks (DSN), pp. 351–360 (2003) 55. Mousavi Nik, S.S., Naghibzadeh, M., Sedaghat, Y.: Task replication to improve the reliability of running workflows on the cloud. Cluster Computing 24(1), 343–359 (2021) 56. Ndiaye, N.M., Sens, P., Thiare, O.: Performance comparison of hierarchical checkpoint protocols grid computing. Int. J. Interact. Multim. Artif. Intell. 1(5), 46–53 (2012) 57. Newton, C.: How a typo took down S3, the backbone of the internet. https://www.theverge. com/2017/3/2/14792442/amazon-s3-outage-cause-typo-internet-server. Accessed: 2022-0320 58. Nicolae, B., Cappello, F.: BlobCR: Efficient checkpoint-restart for HPC applications on IaaS clouds using virtual disk image snapshots. In: SC’11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12. IEEE (2011) 59. Nukada, A., Takizawa, H., Matsuoka, S.: NVCR: A Transparent Checkpoint-Restart Library for NVIDIA CUDA. In: 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, pp. 104–113 (2011) 60. Pannu, H.S., Liu, J., Guan, Q., Fu, S.: AFD: Adaptive failure detection system for cloud computing infrastructures. In: 31st IEEE International Performance Computing and Communications Conference, IPCCC 2012, Austin, TX, USA, December 1-3, 2012, pp. 71–80. IEEE Computer Society (2012) 61. Provider, G.C.: Cloud Storage. https://cloud.google.com/storage (2021). Accessed 19 December 2021 62. Provider, G.C.: Quotas & limits - Cloud Storage. https://cloud.google.com/storage/quotas (2021). Accessed 19 December 2021 63. Provider, G.C.: Cloud Computing Services. https://cloud.google.com/products/storage (2022). Accessed 11 January 2022 64. Provider, G.C.: Filestore. https://cloud.google.com/filestore (2022). Accessed 11 January 2022 65. Provider, G.C.: Limits - Filestore. https://cloud.google.com/filestore/docs/limits (2022). Accessed 12 January 2022 66. Provider, G.C.: Local SSD. https://cloud.google.com/local-ssd (2022). Accessed 11 January 2022 67. Provider, G.C.: Machine Families - Documentation. https://cloud.google.com/compute/docs/ machine-types#predefined_machine_types (2022). 
Accessed 14 March 2022 68. Provider, G.C.: Persistent Disk. https://cloud.google.com/persistent-disk (2022). Accessed 11 January 2022


69. Provider, G.C.: Storage Options - Compute Engine. https://cloud.google.com/compute/docs/ disks (2022). Accessed 11 January 2022 70. Provider, G.C.: Storage Transfer Service. https://cloud.google.com/storage-transfer-service (2022). Accessed 11 January 2022 71. Qiu, X., Sun, P., Dai, Y.: Optimal task replication considering reliability, performance, and energy consumption for parallel computing in cloud systems. Reliability Engineering & System Safety 215, 107834 (2021) 72. Roman, E.: A survey of checkpoint/restart implementations. In: Lawrence Berkeley National Laboratory, Tech. Citeseer (2002) 73. Ruiz-Alvarez, A., Humphrey, M.: An Automated Approach to Cloud Storage Service Selection. In: Proceedings of the 2nd International Workshop on Scientific Cloud Computing, ScienceCloud ’11, p. 39–48. Association for Computing Machinery, New York, NY, USA (2011) 74. Schneider, F.B.: Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. ACM Comput. Surv. 22(4), 299–319 (1990) 75. Sens, P., Folliot, B.: Performance Evaluation of Fault Tolerance for Parallel Applications in Networked Environments. In: 1997 International Conference on Parallel Processing (ICPP ’97), August 11-15, 1997, Bloomington, IL, USA, Proceedings, pp. 334–341. IEEE Computer Society (1997) 76. Services, A.W.: Amazon S3. https://aws.amazon.com/s3/ (2021). Accessed 19 December 2021 77. Services, A.W.: Amazon DataSync. https://aws.amazon.com/datasync/ (2022). Accessed 11 January 2022 78. Services, A.W.: Amazon EBS. https://aws.amazon.com/ebs (2022). Accessed 11 January 2022 79. Services, A.W.: Amazon EC2 Instance Types. https://aws.amazon.com/ec2/instance-types/ (2022). Accessed 14 March 2022 80. Services, A.W.: Amazon EFS. https://aws.amazon.com/efs/ (2022). Accessed 11 January 2022 81. Services, A.W.: Amazon EFS quotas and limits. https://docs.aws.amazon.com/efs/latest/ug/ limits.html (2022). Accessed 12 January 2022 82. Services, A.W.: Amazon FSx. https://aws.amazon.com/fsx/ (2022). Accessed 11 January 2022 83. Services, A.W.: Cloud Storage on AWS. https://aws.amazon.com/products/storage/ (2022). Accessed 11 January 2022 84. Services, A.W.: Placement Groups - Amazon Elastic Compute Cloud. https://docs.aws. amazon.com/AWSEC2/latest/UserGuide/placement-groups.html (2022). Accessed 1 August 2022 85. Shahzad, F., Thies, J., Kreutzer, M., Zeiser, T., Hager, G., Wellein, G.: CRAFT: A library for easier application-level checkpoint/restart and automatic fault tolerance. IEEE Transactions on Parallel and Distributed Systems 30(3), 501–514 (2018) 86. Sharma, P., Lee, S., Guo, T., Irwin, D.E., Shenoy, P.J.: SpotCheck: designing a derivative IaaS cloud on the spot market. In: Proceedings of the Tenth European Conference on Computer Systems, EuroSys 2015, Bordeaux, France, April 21-24, 2015, pp. 16:1–16:15 (2015) 87. Siavvas, M., Gelenbe, E.: Optimum interval for application-level checkpoints. In: 2019 6th IEEE International Conference on Cyber Security and Cloud Computing (CSCloud)/2019 5th IEEE International Conference on Edge Computing and Scalable Cloud (EdgeCom), pp. 145–150. IEEE (2019) 88. Sidler, D., Eguro, K.: Debugging framework for FPGA-based soft processors. In: 2016 International Conference on Field-Programmable Technology (FPT), pp. 165–168 (2016) 89. Strom, R., Yemini, S.: Optimistic Recovery in Distributed Systems. ACM Trans. Comput. Syst. 3(3), 204–226 (1985)


90. Subramanya, S., Guo, T., Sharma, P., Irwin, D.E., Shenoy, P.J.: SpotOn: a batch computing service for the spot market. In: Proceedings of the Sixth ACM Symposium on Cloud Computing, SoCC 2015, Kohala Coast, Hawaii, USA, August 27-29, 2015, pp. 329–341 (2015) 91. Takizawa, H., Sato, K., Komatsu, K., Kobayashi, H.: CheCUDA: A Checkpoint/Restart Tool for CUDA Applications. In: 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies, pp. 408–413 (2009) 92. Tchana, A., Broto, L., Hagimont, D.: Fault tolerant approaches in cloud computing infrastructures. In: The Eighth International Conference on Autonomic and Autonomous Systems, pp. 42–48 (2012) 93. Teylo, L., Arantes, L., Sens, P., Drummond, L.M.A.: A dynamic task scheduler tolerant to multiple hibernations in cloud environments. Cluster Computing 24(2), 1051–1073 (2021) 94. Teylo, L., Arantes, L., Sens, P., Drummond, L.M.A.: Scheduling Bag-of-Tasks in Clouds using Spot and Burstable Virtual Machines. IEEE Transactions on Cloud Computing pp. 1–1 (2021) 95. Teylo, L., Brum, R.C., Arantes, L., Sens, P., Drummond, L.M.A.: Developing Checkpointing and Recovery Procedures with the Storage Services of Amazon Web Services. In: 49th International Conference on Parallel Processing - ICPP: Workshops, ICPP Workshops ’20. Association for Computing Machinery, New York, NY, USA (2020) 96. Varshney, P., Simmhan, Y.: AutoBoT: Resilient and Cost-Effective Scheduling of a Bag of Tasks on Spot VMs. IEEE Trans. Parallel Distrib. Syst. 30(7), 1512–1527 (2019) 97. Vishwanath, K.V., Nagappan, N.: Characterizing cloud computing hardware reliability. In: Proceedings of the 1st ACM symposium on Cloud computing, pp. 193–204 (2010) 98. Xie, G., Zeng, G., Li, R., Li, K.: Quantitative Fault-Tolerance for Reliable Workflows on Heterogeneous IaaS Clouds. IEEE Transactions on Cloud Computing 8(4), 1223–1236 (2020) 99. Xiong, N., Vasilakos, A.V., Wu, J., Yang, Y.R., Rindos, A.J., Zhou, Y., Song, W., Pan, Y.: A Self-tuning Failure Detection Scheme for Cloud Computing Service. In: 26th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2012, Shanghai, China, May 21-25, 2012, pp. 668–679. IEEE Computer Society (2012) 100. Yang, R., Zhu, S., Li, Y., Gupta, I.: Medley: A Novel Distributed Failure Detector for IoT Networks. In: Proceedings of the 20th International Middleware Conference, Middleware ’19, p. 319–331. Association for Computing Machinery, New York, NY, USA (2019) 101. Yi, S., Andrzejak, A., Kondo, D.: Monetary cost-aware checkpointing and migration on amazon cloud spot instances. IEEE Transactions on Services Computing 5(4), 512–524 (2011) 102. Zhou, A.C., He, B., Liu, C.: Monetary cost optimizations for hosting workflow-as-a-service in IaaS clouds. IEEE transactions on cloud computing 4(1), 34–48 (2015) 103. Zhou, J., Zhang, Y., Wong, W.: Fault Tolerant Stencil Computation on Cloud-Based GPU Spot Instances. IEEE Trans. on Cloud Comput. 7(4), 1013–1024 (2019) 104. Zhu, X., Wang, J., Guo, H., Zhu, D., Yang, L.T., Liu, L.: Fault-Tolerant Scheduling for RealTime Scientific Workflows with Elastic Resource Provisioning in Virtualized Clouds. IEEE Transactions on Parallel and Distributed Systems 27(12), 3501–3517 (2016)

Chapter 11

Avoiding Resource Wastage
Altino M. Sampaio and Jorge G. Barbosa

11.1 Introduction

High-performance computing (HPC) workloads span from traditional computation-intensive applications, such as the simulation of complex systems (wind tunnels, drug development, the chemical industry, weather forecasting [40, 43]), to new workloads such as big data [20, 50], artificial intelligence [28, 46, 52], DNA sequencing [17, 39], and autonomous driving [2, 47]. Based on the degree of interaction between the concurrently running parallel processes, these workloads can be categorised as loosely coupled and tightly coupled. In any case, the complexity of HPC workloads is immense and typically requires a large amount of interconnected computing resources. The requirements concern not only processing units (e.g., CPU), but also the amount of memory, storage, and network bandwidth and latency needed to support the proper execution of HPC applications. Cloud computing is supported by large data centers with thousands of powerful compute servers. It utilises virtualisation technology to efficiently allocate and provision system resources (e.g., CPU, GPU,1 FPGA,2 memory and storage devices) among workloads in the form of well-defined heterogeneous virtual machines (VMs).

1 Graphics Processing Units.
2 Field-programmable Gate Array.




Cloud users have administrative privileges within the VM's operating system to customise the execution environment according to their specific requirements. Cloud providers typically consolidate several VMs on the same physical server in order to achieve higher resource utilisation levels and decrease operating costs. However, this approach may lead to VM performance degradation due to contention for hardware resources [60]. Such a situation is not appropriate for HPC applications, which are typically resource-intensive and have completion-time requirements. By means of efficient scheduling strategies, applications are allocated to VMs within specified Quality of Service (QoS) constraints, thus cumulatively satisfying performance objectives (e.g., deadline) and constraints (e.g., monetary cost) [13, 55]. As such, schedulers are expected to take advantage of the heterogeneity of the virtual cluster (i.e., VMs of different types and pricing models) to perform smart workload allocations. By better matching application requirements with virtual resources, it is expected to improve resource utilisation, minimise the cost of the leased VMs, and comply with deadline constraints. However, the scheduling problem has been shown to be NP-complete [36], which means that, given the diversity of QoS constraints and the heterogeneity of the environment, no algorithm is known that finds the optimal workload-to-VM mapping in polynomial time. Over the last years, extensive studies focusing on this problem have proposed several static and dynamic scheduling algorithms to find near-optimal solutions to schedule, execute, and manage HPC applications in different environments. To make efficient decisions, schedulers typically take into account the status of the VMs, the order and priority of tasks, and an estimation of the resources needed by those tasks. This chapter is organised as follows: Sect. 11.2 gives an overview of classes of HPC applications, namely Bag-of-Tasks (BoT) and workflows, and identifies the sources of resource wastage. It also formulates the workload management problem in the context of the time-slotted cloud cost model. Section 11.3 discusses metrics used to detect resource inefficiencies and provides a comparative analysis of resource optimisation strategies. It also presents a discussion of research challenges. Finally, a summary of this chapter and final conclusions regarding resource optimisation on HPC clouds are presented in Sect. 11.4.

11.2 HPC Workload Characteristics and Resource Wastage

Typical HPC workloads consist of a set of jobs, usually with strict resource requirements. A job is a resource request that is submitted to the HPC job scheduler service and that contains one or more tasks. A job requests X amount of Y resources for a time period Z (e.g., 10 CPUs, memory, and disk space for 2 h). The HPC job scheduler is responsible for assigning jobs to resources. Well-known job schedulers in HPC include Slurm, Moab/TORQUE, PBS, and Cobalt [22]. After a job is submitted to the HPC job scheduler service, it is placed in the queue, where it waits


until the resources necessary to run its tasks become available. The job scheduler service sorts the jobs in the queue according to the site policy.
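As a rough illustration of the job abstraction above, the following Python sketch models a job request ("X amount of Y resources for a time period Z") and a queue ordered by a simple site policy. The field names and the priority-then-shortest-walltime policy are illustrative assumptions, not the behaviour of Slurm or any other specific scheduler.

```python
from dataclasses import dataclass

@dataclass
class Job:
    job_id: str
    cpus: int          # requested amount of CPUs
    mem_gb: int        # requested memory
    disk_gb: int       # requested disk space
    walltime_h: float  # requested time period (e.g., 2 h)
    priority: int = 0  # site-assigned priority

# Example request: 10 CPUs, memory and disk space for 2 h.
queue = [
    Job("j1", cpus=10, mem_gb=64, disk_gb=100, walltime_h=2.0),
    Job("j2", cpus=128, mem_gb=512, disk_gb=500, walltime_h=12.0, priority=5),
]

# A hypothetical site policy: highest priority first, shortest walltime second.
queue.sort(key=lambda j: (-j.priority, j.walltime_h))
```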

11.2.1 Typical HPC Workloads

Modern HPC platforms comprise a large number of interconnected computing nodes, each having one or more multi-core or many-core processors [9, 45]. HPC systems run many types of workloads; however, two broad categories stand out based on the degree of interaction between the concurrently running parallel tasks, namely loosely coupled and tightly coupled workloads [8].

Loosely Coupled Workloads Also known as embarrassingly parallel workloads, these entail the processing of a collection of independent tasks that require very little or virtually no communication among themselves. Tasks might be of the same type or of different types, and there is no specific requirement for them to be executed at the same time. A task running on one node executes either as a single thread or as multiple threads with shared-memory parallelism within that node. The nodes involved in the execution of tasks can be heterogeneous in terms of characteristics and power. Owing to the low level of interaction among tasks, the execution is not sensitive to the bandwidth and latency of the network between nodes. In terms of processing requirements, some applications are optimised to take advantage of GPU or FPGA accelerators [3]. Regarding storage, requirements might be diverse considering the dataset size and the performance for transferring, reading, and writing the data. Two important classes of such workloads, widely used in science and technology, are Bag-of-Tasks (BoT) and workflows. BoT [16] applications are parallel computing applications whose tasks can be processed in parallel by independent workers without synchronisation (i.e., relatively poor network performance does not represent a bottleneck for a BoT application, as illustrated by SETI@home [27]). There are no data or control dependencies between the tasks, and the entire job is considered finished when the last of its tasks completes. BoT applications are widely used in a variety of scenarios, such as parameter sweeps, Monte Carlo simulations, and big data analysis [26, 67, 69]. Users now execute BoT applications on clouds owing to the fast access to massive amounts of computational resources in a pay-as-you-go manner [69]. Workflows [66] consist of multiple sequential and concurrent data processing tasks whose execution order is determined by the control or data dependencies between them. Scientific experiments and business processes are commonly represented as workflows, where a task is the basic data processing component, consuming data from input files or previous tasks and producing data for follow-up tasks or output files. A workflow is often described as a Directed Acyclic Graph (DAG), as shown in Fig. 11.1. In a DAG, the nodes represent individual computational tasks and the edges represent data and control dependencies between the tasks.


Fig. 11.1 Description of a DAG job where nodes V0 and V8 are the entry and exit nodes, respectively

A dependency ensures that a child task cannot be executed before all its parent tasks have finished successfully and transferred the required input data to the child. In a given DAG, a task with no predecessors is called an entry task and a task with no successors is called an exit task. If a DAG has multiple entry or exit tasks, a dummy entry or exit task is added to the graph. The HPC environment may be equipped with a shared filesystem to store intermediate files during workflow execution. Examples of workflow workloads are Montage, CyberShake, Epigenomics, and LIGO [29].

Tightly Coupled Workloads These encompass breaking a large workload into smaller tasks that run in parallel and depend on each other to carry out the calculation. All tasks iterate together and communicate continuously with one another. The failure of one node involved in the execution usually leads to the failure of the entire calculation. Interaction among tasks typically relies on shared memory (e.g., OpenMP) for parallelization inside the node and on the Message Passing Interface (MPI) for parallel computation between nodes. Owing to the high level of interaction among tasks, slow communication between nodes results in the slowdown of the entire calculation. In terms of processing power, tightly coupled workloads typically require a homogeneous cluster built from similar compute nodes. Regarding storage, requirements might be diverse considering the dataset size and the performance for transferring, reading, and writing the data. Additional consideration might be given to scratch storage (temporary storage for input, output, and intermediate data of currently running and soon-to-run user jobs) [41]. Examples of tightly coupled HPC workloads include weather and climate simulations, computational fluid dynamics, and nuclear reaction simulations [61]. Several studies have shown that cloud environments have great potential to execute loosely coupled HPC workloads, which require very little or virtually no communication among the application's tasks [4, 24, 38, 45].


Regarding the execution of large-scale tightly coupled workloads on HPC clouds (e.g., MPI-based applications that use thousands of cores and require a high-performance network), recent improvements in cloud inter-node network performance (e.g., the Elastic Fabric Adapter (EFA)3 for high-speed inter-node communications), the introduction of new EC2 instance types (e.g., c5n.18xlarge4) with much higher network bandwidth, and the fast MPI collective algorithms in the latest Intel MPI library put Amazon Web Services (AWS) clusters in a position to rival the performance of on-premise HPC clusters [70, 78]. With performance now comparable to supercomputing machines, this chapter focuses on running BoT and workflow applications on HPC clouds with resource wastage in mind.
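To make the workflow model of Fig. 11.1 concrete, the following Python sketch encodes a DAG as a mapping from each task to its parents and releases a task only after all of its parents have finished. The edge set is hypothetical (only the entry task V0 and the exit task V8 follow the figure), and run is a placeholder for dispatching a task to a VM.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency structure: each task maps to the set of its parents.
dag = {
    "V0": [],
    "V1": ["V0"], "V2": ["V0"], "V3": ["V0"],
    "V4": ["V1", "V2"], "V5": ["V2", "V3"],
    "V6": ["V4"], "V7": ["V5"],
    "V8": ["V6", "V7"],   # exit task waits for all remaining branches
}

def run(task: str) -> None:
    # Placeholder: a real workflow manager would dispatch the task to a VM
    # and wait until its output files are available to its children.
    print(f"executing {task}")

ts = TopologicalSorter(dag)   # graphlib expects node -> predecessors
ts.prepare()
while ts.is_active():
    ready = ts.get_ready()    # tasks whose parents have all finished
    for task in ready:
        run(task)
    ts.done(*ready)           # mark them finished, releasing their children
```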

11.2.2 Sources of Resource Wastage in HPC Cloud

Although clouds can potentially offer unlimited resources to run HPC workloads, managing large-scale computing resources is not a trivial task. Resource management encompasses several activities, such as resource provisioning, resource scheduling, and resource monitoring. While resource provisioning provides a way to acquire or release VMs based on user demands, resource scheduling controls the order of execution. Resource and application monitoring in cloud computing is fundamental to make sure tasks are running without problems and to be able to make more accurate future resource requests. To this end, AWS provides a monitoring and observability service called CloudWatch5 to track and collect different performance metrics. The objective is to offer users the ability to predict and respond to system-wide performance changes and to optimise resource utilisation [32]. CloudWatch can monitor custom metrics produced by users for their applications and services and can also monitor AWS resources. Of course, users can also install their own monitoring services by using tools such as Zabbix6 or Prometheus.7 Resource management is, in fact, a fundamental activity to avoid the problematic scenarios of underprovisioning and overprovisioning. Underprovisioning occurs when the amount of resources provisioned is insufficient for the application demands, which can vary over time. The advantage is that the cost of the infrastructure may decrease; however, the number of delayed jobs will increase. In turn, the overprovisioning problem occurs when the resources allocated to a certain application exceed its demands, resulting in a waste of resources that are paid for as if they were being used. The convenience results from the fact that jobs finish by the expected time, but the financial cost for the user increases.

3 https://aws.amazon.com/hpc/efa/.
4 https://aws.amazon.com/ec2/instance-types/c5/.
5 https://aws.amazon.com/cloudwatch.
6 https://www.zabbix.com/.
7 https://prometheus.io/.


Both underprovisioning and overprovisioning occur due to uncertainty in application demands and deviations in system performance. According to an analysis of 12,000 Google servers over the period of a month [42, 53], users request more than 80% of the cluster's memory capacity and more than 100% of its CPU capacity, but the overall usage is much lower, not exceeding 50% and 60% of CPU and memory, respectively. The wasted server resources incur additional financial costs for both users and providers. From the point of view of the user, the optimal scenario is the one that provisions the exact amount of resources that the application needs (i.e., it minimises the number of resources, and hence the financial cost of the infrastructure, while jobs still finish in due time). However, one factor, or a combination of several, contributes to underprovisioning and overprovisioning situations. Common sources of resource wastage are the lack of user knowledge regarding the amount of resources needed to run the application, variation of requirements over time, variable execution behaviour when changing inputs, and the randomness inherent to the algorithms of HPC applications [64, 69]. For example, as a task's execution time or resource demands are usually estimated values, their actual values can diverge due to factors such as the input size. Furthermore, VMs of the same type can have different levels of performance, since they may be located on physical servers of different hardware types. Besides these problems, the performance of HPC applications can be particularly affected by mutual interference due to the resource sharing policies usually adopted by cloud providers. In general, one physical server can host many VMs, possibly holding distinct applications. Although the virtualisation layer provides a reasonable level of resource isolation, some shared resources, like cache, memory, and I/O devices, cannot be sliced across all applications running in the VMs. As a consequence, these co-located applications can experience performance degradation over time, resulting in execution slowdown. Several studies have confirmed these phenomena and reported performance variations of up to 29% [35, 60, 73]. Another source of variation is the noticeable amount of time it takes for cloud resources to be made available, a delay referred to as instance startup time, which exists among all cloud providers and can vary from a few seconds up to a few minutes [59]. This problem is especially exacerbated with spot VMs. The price of spot VMs is typically lower than that of on-demand VMs, a decisive factor for many users who see in this pricing model an advantage to reduce costs [59, 68, 72]. However, spot instance prices may fluctuate dynamically over time based on the number of bidders. A spot VM can be terminated by the cloud provider if the spot price rises above the user's bid price. In such cases, the instance receives a notification to save temporary data, and the time from that point until the termination of the VM is lost. In the specific case of tightly coupled workloads, especially MPI-based ones, the parallel overhead (e.g., communication overhead) grows with an increasing number of processors, reducing parallel efficiency (i.e., the fraction of time for which a processor is usefully utilized) [48]. Poor parallel efficiency means that a tightly coupled application no longer uses a processor 100% of the time for the computation, leading to a waste of resources.
11 Avoiding Resource Wastage

219

Fig. 11.2 Categorization of the sources of resource wastage in HPC cloud

of the time for the computation, leading to a waste of resources. Notice that, even in situations when there are no overprovisioning or underprovisioning of resources, there may still be wastage caused by poor matching of the application and the infrastructure. For example, a typical user wanting to deploy a workload might pick the cheapest VM type and, paradoxically, end up not just with poor performance but also with higher total costs [74] than using a larger VM for less time. In fact, an adequate match of VM types and task requirements, in terms of CPU, memory and I/O, is of utmost relevance to obtain performance, reduce resource wastage and costs. Figure 11.2 categorizes the discussed sources of resource wastage in HPC cloud. In conclusion, resource management has a central role in reducing the resource wastage and, therefore, it is detailed in the next section.

11.2.3 Resource Management Cloud service providers follow a pay-as-you-go model for resource pricing, and offer a wide range of computing resources, packed under the form of VMs, with different performance levels, that can be instantiated and accessed through the Internet. This service is known as Infrastructure-as-a-Service (IaaS). The cost of rented resources is calculated based on a billing period, instance type, and pricing model. Typically, the partial usage of a VM is rounded up to the nearest billing period. A paradigmatic example is the case of Amazon Elastic Compute Cloud (EC2), a pioneer in cloud services and one of the world’s largest players in cloud computing. Amazon EC2 provides a wide selection of instance types (i.e., VMs), comprising diverse combinations of CPU, GPU, memory, local storage, and networking bandwidth, and several ways of paying for those computing instances (i.e., pricing models), such as on-demand, spot instances, reserved instances, and dedicated hosts [59]. Regarding storage, Amazon provides various data storage options8 to work with VM instances, namely EC2 instance store, Elastic Block Store 8 https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Storage.html.

220

A. M. Sampaio and J. G. Barbosa

(EBS), EFS, and S3. Amazon EC2 instance store (or local store) means there is a disk physically attached to the host computer where the VM runs, and data persists only during the life of the associated VM. The size of an instance store as well as the number of devices available varies by instance type and are included as part of the instance’s usage cost. Amazon EBS, EFS file system and S3 provide durable and network attached data storage, regardless of the life of VMs. However, these options have a price to users, which typically are charged on the basis of the amount of data being stored. In detail, Amazon EBS provides four volume types,9 namely: (1) General Purpose Solid-State Drives (SSD) volumes; (2) Provisioned Input/output Operations Per Second (IOPS) SSD volumes; (3) Throughput Optimized HDD and Cold HDD volumes; and (4) Previous generation Magnetic volumes. Provisioned IOPS SSD volumes are the highest performance EBS storage volumes and the only ones supporting Multi-Attach,10 which allows the attachment of a single volume to multiple instances (limited up to 16 Linux instances) that are in the same Availability Zone. The EFS file system can serve as a common data source for multiple VMs, and S3 can be accessed from anywhere on the web. A storage system is essential for executing some HPC workloads, as is the case of workflows, in which tasks typically communicate through the use of files. In a workflow, tasks produce one or more output files that become inputs to subsequent tasks. However, when tasks are run on different computational nodes, these files are either stored in a shared file system, or transferred from one node to the next by the workflow management system. One advantage of using network attached shared storage, such as Amazon S3 or EFS file system, comes from the fact that task’s input data will not be lost in case of failure as storage services guarantee high reliability and availability for their services. Another advantage is that in case the resources for the subsequent tasks are not up and running yet, a parent task does not need to wait until it transfers all the output data to its children, which may cause additional costs [55]. Now, let one assume the scenario shown in Fig. 11.3, where a set of users .U = {u1 , . . . , un }, n ∈ N, that submit HPC jobs to the cloud. For example, users who belong to the same organization, that selects a certain amount of VM instances, .V = {v1 , . . . , vh }, h ∈ N, in order to build HPC execution environments to support the HPC workloads submitted. VMs may be of different types (i.e., have different performance levels by aggregating resources of diverse capacity). The execution environment is based on the idea of a virtual cluster, which is a collection of VMs that are configured to act as an HPC cluster [30]. Each virtual cluster has access to a distributed shared storage, built upon each local VM disks. The reason behind this storage setup is to be free of charge, while keeping good performance [30]. A VM of instance type .υ in the cluster is defined by the tuple .v υ = (C υ , pυ ), where .C υ = {cr1 , . . . , crk }, k ∈ N, describes the capacity of each type of resource υ ∈ R+ is the .rm , m ∈ {1, . . . , k} (i.e., CPU, GPU, memory, and network), and .p

9 https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html. 10 https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volumes-multi.html.

11 Avoiding Resource Wastage

221

Fig. 11.3 Execution environment of HPC workloads on clouds

price paid for using one billing period.11 One advantage of HPC clouds is that users have root access to the instances operating system, which makes it possible installing and configuring a job management software, such as a job scheduler, a resource monitoring system and a network file system. The tasks to configure the cluster software can be performed automatically by using configuration tools, such as Ansible,12 to avoid tedious and error-prone manual configurations. The resource provisioning strategy followed in building the cluster is based on the concept of elastic VM pool. This way, the number and type of VMs forming the cluster can be updated as the execution of the tasks progresses, as a result of the decisions made by the scheduler. For example, a new VM can be provisioned so that the task being scheduled can finish before its deadline while idle VMs can be shutdown to save cost. Scheduling decisions are based on some sort of objectives, such as minimisation of cost, makespan (i.e., the time it takes to run the set of tasks), resource wastage, or even maximisation of workload (i.e., maximise the number of applications executed) [54]. Regarding resource wastage, idle time slots in leased VMs represent a waste of money as they were paid for but not used. In this regard, most scheduling algorithms indirectly address the objective of minimising resource wastage by being

11 Cloud providers, like Amazon EC2 and Google Cloud Platform, are billing VMs by the second, with a 1-min minimum. 12 https://www.ansible.com/.

222

A. M. Sampaio and J. G. Barbosa

cost-aware, i.e., by minimising idle time slots and maximising the utilisation of resources. In the next Sect. 11.3.2, we give examples of scheduling strategies, targeting the reduction of resource wastage, for both BoT applications and workflow jobs.

11.3 Strategies to Detect and Prevent Resource Wastage Significant technical difficulty is introduced when choosing the optimal platform based upon a limited knowledge of application characteristics, platform capabilities and high degree of heterogeneity. Insufficient information and naive user decisions lead to a potential mismatch between the required and selected resources for HPC applications. The result is that part of the infrastructure would be overloaded, and another would be idle or wasted, which in turn degrades the application’s performance and increases the usage costs. Therefore, it is necessary to have a comprehensive set of metrics to evaluate the overall performance of the HPC environment.

11.3.1 Metrics to Detect Resource Wastage With resource optimisation in mind, the performance of cloud resources, and the amount of time tasks will take to execute in different VMs, can be estimated based on several methods, such as analytical modeling, simulations, profiling through sample execution (e.g., the first few iterations) on actual platform, interpolation, prediction models, among others [24, 71]. Scheduling algorithms can consider the variability observed in resources (e.g., VM provisioning and deprovisioning delays, overhead of the virtualisation layer) and their utility-based pricing model, and apply a strategy to satisfy user-defined QoS requirements and objectives. These QoS requirements are generally defined in terms of performance metrics such as execution time, cost, speedup and throughput, and additionally eventual nonfunctional requirements such as security and energy consumption [45]. Machine characteristics, such as clock cycles are not good performance metrics when used isolated because they do not relate to application performance [49]. In HPC cloud context, performance constraints (e.g., deadline constraint) with cost objectives (i.e., cost minimisation) is the most popular requirement. This is because a desirable level of performance needs to be achieved while being aware of the monetary costs incurred in real time. Therefore, cost-aware scheduling, with the goal of meeting application deadlines, is the common denominator of many resource allocation solutions. Performance metrics such as cost have been considered by many researchers focused on reducing resource wastage while satisfying the deadline constraint, aiming at achieving the desired performance with the minimal cost. By being cost-aware, scheduling algorithms are indirectly


concerned with minimising idle time slots and, consequently, with reducing resource wastage, which benefits users in terms of cost. Aiming at maximising the utilisation of VMs, several studies [18, 65] propose to detect resource wastage by periodically capturing the values of resource load metrics on each VM (e.g., CPU utilisation rate, memory, and storage) and comparing them with minimum and maximum thresholds. Monitoring tools are able to regularly collect and track metrics, and also to draw insights from the logs collected from an EC2 instance. For example, the Amazon CloudWatch monitoring tool (referred to in Sect. 11.2.2) provides users with system-wide visibility into resource utilisation, application performance, and operational health. To this end, a diversity of system variables is periodically collected and stored as time-ordered sets of data points. CloudWatch provides two categories of monitoring, namely basic monitoring and detailed monitoring. Active by default, basic monitoring publishes metrics at 5-min intervals at no charge for resources such as Amazon EC2 instances and Amazon EBS volumes. In turn, detailed monitoring incurs charges, but users are allowed to customise their metrics and to define the granularity (in seconds) at which metric data is published. It is important to notice that detailed monitoring helps users find trends and take action faster in order to improve the utilisation of resources. Metric data can be aggregated over specified periods of time in order to obtain different statistics.13 A period can be as short as 1 s or as long as 1 day; the default value is 1 min. As an example, consider the particular case of an EC2 compute-optimised instance (e.g., "c5.xlarge") used to run, at distinct intervals in time, two different workloads. Amazon CloudWatch provides a set of metrics for each resource.14 For example, the "CPUUtilization" metric measures the processing power required to run a workload on a particular instance. Additionally, one can enable detailed monitoring so that each CPU utilisation data point covers the next minute of activity from the moment the workload was submitted to the instance. For the sake of simplicity, Fig. 11.4 combines the CPU utilisation of two hypothetical workloads running on the EC2 instance. Profile 1 represents the CPU consumed by a more homogeneous workload, for example, a BoT application, while profile 2 illustrates the CPU required by a more heterogeneous workload, such as a workflow application. For the former, the CPU utilisation is more stable during the execution of tasks, whereas for the latter the CPU consumption varies more over time due to task dependencies. As can be observed for both workloads, the percentage of CPU consumed is, on average, well below the allocated capacity, thus generating wastage of resources, with profile 2 being the worse case. The user is paying for more than is effectively being used.

13 https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Statistics-definitions.html.
14 https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/viewing_metrics_with_cloudwatch.html.


Fig. 11.4 CPU utilisation leading to wastage of CPU (profile 1 represents the CPU required by a more homogeneous workload, whereas profile 2 represents a more heterogeneous workload)

This problem can be tackled by the scheduler by performing scaling-out or scaling-in operations (e.g., moving to a cheaper instance with lower CPU power) when a threshold is reached (say, 70% of CPU utilisation), in order to improve the usage of resources. Additional low-level metrics (e.g., disk read and write rates, incoming and outgoing network traffic volume, memory and swap utilisation, etc.) can be combined with heuristics to predict user resource needs and avoid wastage [32].
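To make this monitoring workflow concrete, the sketch below retrieves per-minute "CPUUtilization" samples with boto3 and counts low-utilisation periods. It is a minimal sketch under stated assumptions: detailed monitoring is enabled on the instance, the instance id is a placeholder, and the 20% threshold is an arbitrary illustration of an underutilisation cut-off, not a value recommended by any of the cited works.

# Minimal sketch: pull per-minute CPUUtilization samples for one EC2 instance
# from CloudWatch (detailed monitoring enabled) and flag low-utilisation periods.
# The instance id and the 20% threshold are illustrative placeholders.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")


def cpu_samples(instance_id, minutes=60, period=60):
    end = datetime.datetime.utcnow()
    start = end - datetime.timedelta(minutes=minutes)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=start,
        EndTime=end,
        Period=period,              # seconds; 60 requires detailed monitoring
        Statistics=["Average"],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])


def wasted_minutes(instance_id, threshold=20.0):
    """Count 1-minute samples whose average CPU stayed below the threshold."""
    return sum(1 for d in cpu_samples(instance_id) if d["Average"] < threshold)


# Example (hypothetical instance id):
# print(wasted_minutes("i-0123456789abcdef0"))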

11.3.2 Resource Optimisation Strategies

This section presents scheduling algorithms proposed to optimise the execution of BoT and workflow workloads on clouds. Cloud environments are inherently heterogeneous, with a variety of instance types available to users. An instance type is designed to optimise the execution of some class of application (e.g., CPU-, memory-, or I/O-intensive) and follows a predefined hardware configuration (CPU, memory, network, storage capacity) and time-slot price. Scheduling selects a VM to execute each task so as to improve cost and resource efficiency while guaranteeing the task's time constraints. The scheduling process encompasses several aspects that need to be defined, such as the actions performed, the parameters needed for the decision-making process, the goals to achieve, the approach to deal with eventual parameter variations, and the algorithm applied to find a solution to the problem [25, 69]. Regarding the actions performed, the most resource-efficient task-to-VM mapping can be found by performing one, or a combination, of the following actions: (a) selecting the most appropriate instance type; (b) determining the number of VMs for each instance type; and (c) assigning the workload to the VM that best fits the application needs. The majority of existing solutions first determine the number of VMs for each instance type and then assign the workload onto those VMs. Therefore, VMs are created only if there is workload to assign to them [7, 12, 55, 56, 68, 77]. However, to perform such actions, the scheduling process must be aware of specific parameters, such as the performance across all instance types, the price for each one, the current state of


running VMs, and the application requirements. The availability of these parameters is crucial since they directly affect the outcome of the scheduling process. While some of the parameters are given before the execution (e.g., the number of tasks), others need to be estimated. Performance parameters like task execution time and makespan are specific to each pair of application and instance type, but they can be estimated with techniques such as profiling, matching benchmarking results against application characteristics, and sampling execution, in which a job is partially executed on VMs [11]. Resource demands can also be used as a parameter because the performance of an application depends on the amount of resources allocated to execute it [58].

Since leveraging cloud resources to run HPC workloads incurs monetary costs, it is in the user's interest to consider scheduling solutions that focus on optimising cloud usage by minimising the monetary cost while satisfying the deadline constraint [44, 55, 64, 68, 77]. As previously mentioned, cost minimisation emphasises efficient resource usage to decrease the wastage of resources. There are also other scheduling approaches, such as trade-offs between performance and cost [44, 68]. According to Thai et al. [69], scheduling algorithms can be categorised as exact algorithms and heuristic algorithms. Exact algorithms aim to find the optimal scheduling plan, but they also require a significant amount of time to find it, which makes the approach unsuitable for large-scale problems or scenarios in which a decision must be made in a timely manner. Heuristic (and meta-heuristic) algorithms aim to find a near-optimal solution in a reasonable amount of time [25]. Heuristic and meta-heuristic algorithms have been proposed to solve scheduling problems concerning workflows and independent tasks/applications [15, 25, 37, 44, 69, 72].

In cloud environments, load fluctuates constantly, unpredictable interference affects the performance of VMs running on the same physical server, and VMs of the same type may present different levels of performance. Also, the price of spot instances may change over time, and the interval between requesting and deploying a VM may vary from a few seconds to several minutes [24, 51, 60, 72]. Because one or several of these parameters fluctuate over time, the scheduling decision may become obsolete and inaccurate after some time, which makes static algorithms less adequate for scheduling in HPC clouds. Therefore, dynamic (or hybrid) algorithms are largely adopted in cloud environments in order to keep the amount of resources allocated to tasks precise and accurate and to overcome the potential problems of overutilisation and underutilisation [54, 75]. Another aspect to be considered is when to trigger dynamic scheduling to update the current task-to-VM mapping. Several alternatives can be considered, such as when idle VMs are detected, when the requirements change or are predicted to be violated, at the end of each billing cycle in order to decide if VMs must be released to reduce waste, or after the monitoring process that updates the parameters to reflect the current state of the execution. The existing solutions that deal with efficient use of resources typically follow the previous characteristics. Most of them assume that the performance parameters are available prior to the optimisation process.
In any case, there has been an effort to estimate performance parameters (e.g., the execution time, resource demand, etc.)


based on the characteristics of the application. Several studies have shown that the relative performance can be estimated with an accuracy above 97% by executing the application partially, fully, or multiple times [11, 14, 34]. HPC workloads exhibit lower variance than large-scale cloud workloads, and specific methods can be applied, such as ensemble-based prediction [31], linear regression [63], neural networks [33], and collaborative filtering [57], to achieve higher accuracy levels.

To respond to changes in the environment, Tavares et al. [65] started by leveraging low-level metrics (e.g., CPU utilisation rate) to detect overprovisioning occurrences in a reactive way. Such information can easily be obtained from cloud providers, e.g., the AWS CloudWatch service. A scheduling algorithm can use this information to dynamically adjust the task-to-VM mapping. In [64], the same authors proposed a reactive solution, as an alternative to predictive techniques that try to anticipate the occurrence of resource inefficiency, to recommend cost-effective VMs for parallel workloads, aiming at reducing resource wastage while maintaining performance. The mechanism works based solely on the CPU utilisation rate of the currently executing VM. Usually, these solutions are combined with other objectives or constraints, such as deadlines. For example, in the context of BoT applications, Sampaio et al. [60] developed a dynamic resource management framework to deal with VM performance variations. The framework combines a performance deviation estimator, based on a Kalman filter, with a scheduling algorithm that reacts to performance deviations (i.e., slowdowns in applications) and adjusts the amount of resources allocated to tasks. By dynamically applying corrective actions to detected performance deviation events, situations of overprovisioning and underprovisioning are reduced and the application deadline constraints are respected.

Other scheduling solutions consider monetary cost minimisation to indirectly deal with resource wastage. In this regard, Belgacem et al. [10] provided a solution, called the Multi-Objective Symbiotic Organism Search algorithm (MOSOS), for dynamically assigning tasks to VMs, aiming at minimising the makespan and monetary cost. Both CPU and memory resources are considered, and the number of tasks running in a VM cannot exceed its number of CPUs. The solution is tested with several Amazon EC2 instance types. Abdi et al. [1] developed a mathematical model for the resource allocation problem, in which BoT applications are assigned to instance types in federated hybrid clouds at minimal cost while meeting the deadline and resource constraints in the environment. The derived model is a binary linear programming problem that may be solved with the CPLEX solver in a reasonable time. The model considers the CPU, memory, and price of various instance types, as well as the cost of transferring data to cloud providers. The problem admits that a VM can run several tasks simultaneously. Additional analyses were carried out, and the results showed that the optimal costs in cloud federations are lower, and the optimal solutions more stable, than those presented by single-provider clouds. Aiming at further minimising monetary costs, Teylo et al. [68] explored spot and on-demand instances to minimise both the monetary cost and the total execution time of the BoT application while the deadline constraint was met. The optimisation


problem considered CPU and memory limits, availability, and the price of instances. The proposed framework defines an initial scheduling map of tasks to VMs by applying a heuristic based on Iterated Local Search (ILS), and a second, dynamic scheduler is then responsible for task migration in case of spot hibernation. The number of tasks allocated to a VM does not exceed its number of virtual cores. Idle VMs are either terminated or kept in order to execute tasks without incurring deployment overhead. AutoBoT [72] also uses spot instances to reduce the overall monetary cost of the BoT execution, while attaining the deadline constraint and limiting any potential monetary loss. Checkpointing and migration of tasks are performed to deal with the unreliable nature of spot instances. A collection of heuristics determines the acquisition and release of spot or fixed-price VMs, the task-to-VM mapping, and the checkpointing and migration of tasks.

Concerning the efficient management of a set of workflows in the context of utility computing, comprising heterogeneous resources that provide services of different capabilities and costs, Arabnejad [5] presented the on-line Multi-Workflow Deadline-Budget Scheduling (MW-DBS) algorithm, which aims to find a feasible schedule within individual budget and deadline constraint values for each submitted application. In this approach, resources are shared among workflows to increase resource usage and reduce the global cost, while complying with each user's individual constraints. In [6] the same authors presented the Multi-QoS Profit-Aware scheduling algorithm (MQ-PAS) for the same context, which in addition attempts to increase the revenue of the provider by considering the budget available for each job to define task priorities. In [23], Ghasemzadeh presented the Deadline-Budget Workflow Scheduling (DBWS) algorithm for cloud environments, which takes into account the hourly-based charging policy. This algorithm produces an array of acquire/release timestamps for each VM resource in order to minimise the cost of executing the workflow. Results show that DBWS achieved higher success rates under more restricted budget and deadline constraints. Sahni and Vidyarthi [56] presented a dynamic, cost-minimisation, deadline-constrained heuristic for scheduling scientific workflows in a public cloud environment. The solution addresses three important cloud issues, namely VM performance variation, resource acquisition and termination delays, and the heterogeneous nature of instance types. In order to minimise execution costs and wastage, resources are provisioned just before they are needed, and idle VMs that have completed the transfer of output data are deprovisioned. Running tasks are continuously monitored to dynamically make cost-effective scheduling decisions for subsequent tasks such that the deadline constraint is met. Chen et al. [13] schedule tasks from different workflows in a hybrid way in order to avoid idle time slots on VMs and improve resource utilisation. The objective is to reduce the cost of executing workflow applications while guaranteeing the deadlines. The solution proposes a cost-efficient reactive scheduling algorithm, namely CERSA, which combines three reactive scheduling strategies to schedule workflows and to adjust the available VMs. Sampaio and Barbosa [59] have proposed a multi-workflow scheduling framework to execute workflow applications on different Amazon EC2 instance types


and pricing models. The objective was to minimise execution cost and time, and to meet the user-defined deadline and budget QoS constraints. In order to further reduce costs, the framework combines spot and on-demand instances of several types with an approach to deal with the unreliable nature of spots. A heuristic-based algorithm uses a multi-objective utility function to dynamically allocate each task to the VM that best balances time, monetary cost, and reliability. Zhu and Tang [77] proposed a multi-resource workflow scheduling solution on heterogeneous on-demand instances. The cluster of VMs is dynamically scaled up and down during the scheduling process. The objective is to minimise the total monetary cost of workflow execution and satisfy the deadline constraint. To reduce waste, a single VM can host more than one task at the same time when there are sufficient resources. A list-scheduling algorithm assigns the tasks in a workflow, in the order of a priority list, to VMs dynamically acquired from IaaS platforms according to their multi-resource demands. By employing different task prioritisations, different scheduling criteria can be achieved. Data transferred between tasks are stored in a centralised storage.

As users became more interested in running workflows in the cloud, providers started to leverage their computational resources to offer an environment to execute workflows as a service (WaaS). In WaaS environments, users do not need to do anything to manage the resources. To this purpose, Saeedizade and Ashtiani [55] built a dynamic multi-constraint workflow scheduling algorithm for a WaaS cloud environment, named DDBWS. The algorithm schedules multiple tasks onto a single VM and shares the processing power amongst them by taking advantage of containers. Both the CPU and memory of a VM are considered by the scheduler. DDBWS tries to assign the minimum number of CPU cores to a task in order to achieve a makespan as close to the user-defined deadline as possible. Additionally, DDBWS delays tasks to reuse a running VM, avoiding the provisioning delay of launching a new VM, which could affect the makespan of a workflow. The result is a scheduling algorithm that decreases the overall cost and the number of leased VMs, while keeping high the number of workflows whose makespan and cost do not exceed the user-defined deadline and budget. It is worth noting that, unfortunately, the solution does not perform well under high uncertainty caused by CPU performance degradation.

The majority of research addresses resource efficiency in clouds in terms of computation (CPU and memory) and does not consider other resources, such as network communication and storage, from the user's point of view. Nevertheless, some existing research supports communication- and storage-aware resource management. For example, Stavrinides and Karatza [62] proposed heuristics for the scheduling of real-time BoT jobs that arrive dynamically at a hybrid cloud. The proposed scheduling strategies take into account the end-to-end deadlines of the jobs, the monetary cost required for processing and transferring the data, and legal constraints (sensitive data should not be transferred to the public cloud). Duan et al. [19] formulated a communication- and storage-aware algorithm for scheduling multiple large-scale parallel workflow applications on heterogeneous hybrid clouds. The multi-objective solution minimises the expected execution time and economic cost of applications based on a sequential cooperative game-theoretic algorithm.


Table 11.1 Scheduling approaches in the HPC cloud

Work | Workload             | Approach                   | Objectives                                | Constraints
[68] | BoT                  | Integer linear programming | Minimise cost and execution time          | Deadline
[60] | Independent tasks    | Heuristic                  | Minimise interference and execution time  | Deadline
[10] | Independent tasks    | Meta-heuristic             | Minimise cost and makespan                | CPU and memory
[72] | BoT                  | Heuristic                  | Minimise cost and number of VMs           | Deadline and monetary loss
[1]  | BoT                  | Binary linear programming  | Minimise cost in federated hybrid clouds  | Deadline and amount of resources
[62] | BoT                  | Heuristic                  | Minimise cost                             | Deadline
[19] | Workflow             | Heuristic                  | Minimise cost and makespan                | Network bandwidth and storage
[77] | Workflow             | Heuristic                  | Minimise cost                             | Deadline
[55] | Workflow             | Heuristic                  | Minimise cost and number of VMs           | Deadline and budget
[5]  | Concurrent workflows | Heuristic                  | No optimisation                           | Deadline and budget
[23] | Workflow             | Heuristic                  | Minimise cost and number of VMs           | Deadline and budget
[56] | Workflow             | Heuristic                  | Minimise cost                             | Deadline
[59] | Workflow             | Heuristic                  | Minimise cost                             | Deadline and budget
[13] | Workflow             | Heuristic                  | Minimise cost                             | Deadline

Regarding storage resources and their cost, an alternative to central shared storage (e.g., Amazon S3 and the EFS file system) would be the use of a distributed file system, such as GlusterFS,15 aggregating the local disks of the running VMs in the cluster. Such a solution can achieve good results in terms of cost and performance compared to Amazon S3 [30]. Table 11.1 summarises all methods reviewed in this section and categorises them according to important aspects, such as the type of workload, the approach used to solve the scheduling problem, and the objectives to achieve and constraints to meet.
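To make the recurring pattern behind these approaches tangible, the sketch below implements a simplified cost-aware, deadline-constrained selection of the cheapest feasible instance type for a single task. It is not a reimplementation of any of the cited algorithms: the instance catalogue, prices, and runtime estimates are illustrative assumptions, and a real scheduler would add the dynamic adjustments discussed above.

# Simplified illustration of the cost-aware, deadline-constrained greedy pattern
# shared by many of the heuristics above. It is not a reimplementation of any
# cited algorithm; runtime estimates and prices are assumed to be given.
from dataclasses import dataclass


@dataclass
class VmType:
    name: str
    price_per_hour: float   # on-demand price (illustrative)
    speed: float            # relative performance factor


def cheapest_feasible_vm(task_work, deadline_h, vm_types):
    """Pick the cheapest VM type that still finishes the task by its deadline.

    task_work: estimated work in "reference hours" on a speed-1.0 machine.
    deadline_h: time budget for the task, in hours.
    """
    best = None
    for vm in vm_types:
        runtime_h = task_work / vm.speed
        if runtime_h > deadline_h:
            continue                      # misses the deadline, discard
        cost = runtime_h * vm.price_per_hour
        if best is None or cost < best[1]:
            best = (vm, cost, runtime_h)
    return best                           # None means no feasible VM type


# Hypothetical catalogue and task:
catalogue = [
    VmType("small", 0.10, 1.0),
    VmType("large", 0.38, 4.0),
    VmType("gpu",   1.20, 12.0),
]
print(cheapest_feasible_vm(task_work=6.0, deadline_h=2.0, vm_types=catalogue))

Applied per task and combined with monitoring-driven rescheduling, this greedy step is the skeleton that most of the surveyed heuristics refine in different ways.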

11.3.3 Research Challenges

15 https://www.gluster.org/.


Based on the approaches to detect resource wastage presented in Sect. 11.3.1 and the analyses of strategies to minimise cost and resource wastage summarised in Sect. 11.3.2, this section discusses research challenges and directions that can improve the usage of cloud resources by users in the context of HPC.

From an HPC user's perspective, it is still challenging to utilise cloud resources efficiently, and the area of resource management has several open issues still pending. Most research considers that the necessary parameters, such as execution time or resource demands, can be obtained prior to executing an application. However, such an assumption is not always valid given the system behavior in cloud computing. Therefore, there is a need for more effective performance and resource demand prediction techniques rather than simply sampling applications. It is important to keep in mind that performance prediction plays a key role in resource allocation and adjustment decisions that make the best use of resources and avoid overprovisioning and underprovisioning problems. Nevertheless, predicting the performance of an application and its resource demands can be challenging, since it requires sophisticated profiling techniques to analyse an application statically or dynamically [32, 69].

For simplicity, a number of solutions assume that the performance of cloud computing resources remains unchanged during execution. Unfortunately, this assumption does not hold for real clouds, because virtualisation does not guarantee performance isolation and can suffer from unpredictable performance interference arising from other VMs hosted on the same physical server [60]. Moreover, several studies execute more than one task simultaneously in a VM during the same time slot, which, as analyses suggest, increases interference and resource contention [64, 76]. Consequently, tasks may slow down due to resource contention. Also, VM acquisition delays and the variability of spot prices introduce additional uncertainty [59]. Hence, resource management frameworks need to monitor running tasks to ensure that they are getting the compute resources they need. If this is not the case, then adjustments need to be made to the initial scheduling plan. To this end, dynamic scheduling must occur during task execution, and the user should be given the possibility to prioritise the objective: maximise resource utilisation or minimise execution time. This is especially important for the execution of workflows, since a number of solutions propose static scheduling.

Some HPC applications, such as BoT and workflows, can benefit from HPC hybrid and federated cloud environments. Another reason to justify cloud federation is that commercial clouds typically limit the maximum number of VMs that a customer can run simultaneously [21]. To overcome this limitation, users can explore cloud federation to simultaneously run numerous VMs. Such an approach creates a more flexible environment in which users can actively select the most cost-effective cloud to run their applications. Studies have shown that the optimal cost is lower, and the optimal solutions are more stable, in cloud federations than in single-provider clouds [1]. However, mature research on resource optimisation focuses essentially on solutions for single cloud providers. Geographically distributed resources available through a number of cloud providers consequently increase the number of options that the scheduling process must consider, which significantly expands the search space. Moreover, transferring data out of a cloud provider is generally expensive, thus algorithms that schedule


applications across multiple providers may need to be concerned with this cost. These factors constitute a major problem for the design and implementation of an efficient scheduling algorithm, which research needs to address.

11.4 Conclusions

Cloud computing has been emerging as a promising alternative to supercomputers for some HPC applications, and the use of cloud infrastructures for HPC applications has coined the term HPC cloud. Compared with traditional HPC platforms, clouds offer intrinsic advantages, such as on-demand network access to virtualised computing resources that can be rapidly provisioned and released with minimal management effort, and significant economies of use by leveraging the pay-per-use model rather than investing in the acquisition of expensive hardware. Therefore, HPC cloud users do not need to pay attention to the deployment and maintenance of the physical infrastructure. Nevertheless, they are responsible for using the resources offered by the providers to build their own HPC cluster. Moreover, HPC users face the challenges of dealing with highly heterogeneous resources, different pricing models, and VM performance variation.

This chapter discussed the resource wastage problem in the context of the HPC cloud, an open issue that results in additional costs to HPC users. To this end, the chapter started by describing two classes of HPC workloads that typically run on cloud environments, namely BoT and workflows, followed by an overview of the sources of resource wastage. The concepts of overprovisioning and underprovisioning, the reasons why these two situations occur, and the metrics that can be used to detect resource inefficiencies were also introduced. The problem of efficiently managing resources to avoid waste was presented, and several scheduling-based strategies to solve it were discussed. Although several advances have happened in the cloud space in recent years, many current issues and challenges still need more analysis and discussion. This chapter aimed at helping HPC users understand the problem of utilising resources efficiently, the existing solutions and their limitations, and pointed out research challenges that need to be addressed in the context of optimising HPC cloud usage.

References

1. Somayeh Abdi, Latif PourKarimi, Mahmood Ahmadi, and Farzad Zargari. Cost minimization for deadline-constrained bag-of-tasks applications in federated hybrid clouds. Future Generation Computer Systems, 71:113–128, 2017. 2. Furqan Alam, Rashid Mehmood, and Iyad Katib. Comparison of decision trees and deep learning for object classification in autonomous driving. In Smart Infrastructure and Applications, pages 135–158. Springer, 2020.


3. Emad Alamoudi, Rashid Mehmood, Aiiad Albeshri, and Takashi Gojobori. A survey of methods and tools for large-scale dna mixture profiling. In Smart Infrastructure and Applications, pages 217–248. Springer, 2020. 4. Rawan Aljamal, Ali El-Mousa, and Fahed Jubair. Benchmarking microsoft azure virtual machines for the use of hpc applications. In 2020 11th International Conference on Information and Communication Systems (ICICS), pages 382–387. IEEE, 2020. 5. Hamid Arabnejad and Jorge G. Barbosa. Maximizing the completion rate of concurrent scientific applications under time and budget constraints. Journal of Computational Science, 23:120–129, 2017. 6. Hamid Arabnejad and Jorge G. Barbosa. Multi-qos constrained and profit-aware scheduling approach for concurrent workflows on heterogeneous systems. Future Generation Computer Systems, 68:211–221, 2017. 7. Vahid Arabnejad, Kris Bubendorfer, and Bryan Ng. Budget and deadline aware e-science workflow scheduling in clouds. IEEE Transactions on Parallel and Distributed systems, 30(1):29–44, 2018. 8. Amazon AWS. High performance computing lens - aws well-architected framework, 12 2018. Last accessed 27 December 2021. 9. Amazon AWS. New – amazon ec2 hpc6a instance optimized for high performance computing, 01 2022. Last accessed 28 January 2022. 10. Ali Belgacem, Kadda Beghdad-Bey, and Hassina Nacer. Dynamic resource allocation method based on symbiotic organism search algorithm in cloud computing. IEEE Transactions on Cloud Computing, 10(3):1714–1725, 2022. 11. Jeferson R Brunetta and Edson Borin. Selecting efficient cloud resources for hpc workloads. In Proceedings of the 12th IEEE/ACM International Conference on Utility and Cloud Computing, pages 155–164, 2019. 12. Koneti Kalyan Chakravarthi and L Shyamala. Topsis inspired budget and deadline aware multiworkflow scheduling for cloud computing. Journal of Systems Architecture, 114, 2021. 13. Huangke Chen, Jianghan Zhu, Guohua Wu, and Lisu Huo. Cost-efficient reactive scheduling for real-time workflows in clouds. The Journal of Supercomputing, 74(11):6291–6309, 2018. 14. Zheyi Chen, Jia Hu, Geyong Min, Albert Y Zomaya, and Tarek El-Ghazawi. Towards accurate prediction for high-dimensional and highly-variable cloud workloads with deep learning. IEEE Transactions on Parallel and Distributed Systems, 31(4):923–934, 2019. 15. Amit Chhabra, Kuo-Chan Huang, Nebojsa Bacanin, and Tarik A Rashid. Optimizing bag-oftasks scheduling on cloud data centers using hybrid swarm-intelligence meta-heuristic. The Journal of Supercomputing, 78:1–63, 2022. 16. Walfredo Cirne, Francisco Brasileiro, Jacques Sauve, Nazareno Andrade, Daniel Paranhos, Elizeu Santos-neto, Raissa Medeiros, and Federal Campina Gr. Grid computing for bag of tasks applications. In Proc. of the 3rd IFIP Conference on E-Commerce, E-Business and EGovernment, 2003. 17. Iacopo Colonnelli, Barbara Cantalupo, Ivan Merelli, and Marco Aldinucci. Streamflow: crossbreeding cloud with hpc. IEEE Transactions on Emerging Topics in Computing, 9(04):1723– 1737, 2020. 18. Rodrigo da Rosa Righi, Vinicius Facco Rodrigues, Cristiano André Da Costa, Guilherme Galante, Luis Carlos Erpen De Bona, and Tiago Ferreto. Autoelastic: Automatic resource elasticity for high performance applications in the cloud. IEEE Transactions on Cloud Computing, 4(1):6–19, 2015. 19. Rubing Duan, Radu Prodan, and Xiaorong Li. Multi-objective game theoretic schedulingof bag-of-tasks workflows on hybrid clouds. IEEE Transactions on Cloud Computing, 2(1):29– 42, 2014. 20. 
Donatello Elia, Sandro Fiore, and Giovanni Aloisio. Towards hpc and big data analytics convergence: Design and experimental evaluation of a hpda framework for escience at scale. IEEE Access, 9:73307–73326, 2021.


21. Joseph Emeras, Sebastien Varrette, Valentin Plugaru, and Pascal Bouvry. Amazon elastic compute cloud (ec2) versus in-house hpc platform: A cost analysis. IEEE Transactions on Cloud Computing, 7(2):456–468, 2016. 22. Yuping Fan, Zhiling Lan, Paul Rich, William E Allcock, Michael E Papka, Brian Austin, and David Paul. Scheduling beyond cpus for hpc. In Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, HPDC ’19, pages 97– 108. Association for Computing Machinery, 2019. 23. Mozhgan Ghasemzadeh, Hamid Arabnejad, and Jorge G. Barbosa. Deadline-Budget constrained Scheduling Algorithm for Scientific Workflows in a Cloud Environment. In Panagiota Fatourou, Ernesto Jiménez, and Fernando Pedone, editors, 20th International Conference on Principles of Distributed Systems (OPODIS 2016), volume 70 of Leibniz International Proceedings in Informatics (LIPIcs), pages 19:1–19:16, 2017. 24. Abhishek Gupta, Paolo Faraboschi, Filippo Gioachin, Laxmikant V Kale, Richard Kaufmann, Bu-Sung Lee, Verdi March, Dejan Milojicic, and Chun Hui Suen. Evaluating and improving the performance and scheduling of hpc applications in cloud. IEEE Transactions on Cloud Computing, 4(3):307–321, 2014. 25. Essam H Houssein, Ahmed G Gad, Yaser M Wazery, and Ponnuthurai Nagaratnam Suganthan. Task scheduling in cloud computing based on meta-heuristics: Review, taxonomy, open challenges, and future trends. Swarm and Evolutionary Computation, 62:100841, 2021. 26. Menglan Hu and Bharadwaj Veeravalli. Requirement-aware scheduling of bag-of-tasks applications on grids with dynamic resilience. IEEE Transactions on Computers, 62(10):2108– 2114, 2012. 27. Bahman Javadi, Derrick Kondo, Jean-Marc Vincent, and David P Anderson. Discovering statistical models of availability in large distributed systems: An empirical study of seti@ home. IEEE Transactions on Parallel and Distributed Systems, 22(11):1896–1903, 2011. 28. Zihan Jiang, Wanling Gao, Fei Tang, Lei Wang, Xingwang Xiong, Chunjie Luo, Chuanxin Lan, Hongxiao Li, and Jianfeng Zhan. Hpc ai500 v2. 0: The methodology, tools, and metrics for benchmarking hpc ai systems. In 2021 IEEE International Conference on Cluster Computing (CLUSTER), pages 47–58. IEEE, 2021. 29. Gideon Juve, Ann Chervenak, Ewa Deelman, Shishir Bharathi, Gaurang Mehta, and Karan Vahi. Characterizing and profiling scientific workflows. Future Generation Computer Systems, 29(3):682–692, 2013. 30. Gideon Juve, Ewa Deelman, G Bruce Berriman, Benjamin P Berman, and Philip Maechling. An evaluation of the cost and performance of scientific workflows on amazon ec2. Journal of Grid Computing, 10(1):5–21, 2012. 31. Gurleen Kaur, Anju Bala, and Inderveer Chana. An intelligent regressive ensemble approach for predicting resource usage in cloud computing. Journal of Parallel and Distributed Computing, 123:1–12, 2019. 32. Hisham A Kholidy. An intelligent swarm based prediction approach for predicting cloud computing user resource needs. Computer Communications, 151:133–144, 2020. 33. Jitendra Kumar and Ashutosh Kumar Singh. Workload prediction in cloud using artificial neural network and adaptive differential evolution. Future Generation Computer Systems, 81:41–52, 2018. 34. Jitendra Kumar, Ashutosh Kumar Singh, and Rajkumar Buyya. Self directed learning based workload forecasting model for cloud resource management. Information Sciences, 543:345– 366, 2021. 35. Philipp Leitner and Jürgen Cito. 
Patterns in the chaos–a study of performance variation and predictability in public iaas clouds. ACM Transactions on Internet Technology (TOIT), 16(3):1– 23, 2016. 36. Thomas Lengauer. Optimization problems. In Combinatorial Algorithms for Integrated Circuit Layout, pages 31–45. Vieweg+Teubner Verlag, 1990. 37. Syed Hamid Hussain Madni, Muhammad Shafie Abd Latiff, Yahaya Coulibaly, et al. Resource scheduling for infrastructure as a service (iaas) in cloud computing: Challenges and opportunities. Journal of Network and Computer Applications, 68:173–200, 2016.


38. Aniruddha Marathe, Rachel Harris, David K Lowenthal, Bronis R De Supinski, Barry Rountree, Martin Schulz, and Xin Yuan. A comparative study of high-performance computing on the cloud. In Proceedings of the 22nd international symposium on High-performance parallel and distributed computing, pages 239–250, 2013. 39. João F Matias Rodrigues and Christian von Mering. Hpc-clust: distributed hierarchical clustering for large sets of nucleotide sequences. Bioinformatics, 30(2):287–288, 2014. 40. John Michalakes. Hpc for weather forecasting. In Parallel Algorithms in Computational Science and Engineering, pages 297–323. Springer, 2020. 41. Henry M Monti, Ali R Butt, and Sudharshan S Vazhkudai. /scratch as a cache: Rethinking hpc center scratch storage. In Proceedings of the 23rd international conference on Supercomputing, pages 350–359, 2009. 42. Ismael Solis Moreno, Peter Garraghan, Paul Townend, and Jie Xu. Analysis, modeling and simulation of workload patterns in a large-scale utility cloud. IEEE Transactions on Cloud Computing, 2(2):208–221, 2014. 43. Raúl Moreno, Enrique Arias, Andrés Navarro, and Francisco J Tapiador. How good is the openpower architecture for high-performance cpu-oriented weather forecasting applications? The Journal of Supercomputing, 75(10):6178–6193, 2019. 44. Ioannis A Moschakis and Helen D Karatza. Multi-criteria scheduling of bag-of-tasks applications on heterogeneous interlinked clouds with simulated annealing. Journal of Systems and Software, 101:1–14, 2015. 45. Marco AS Netto, Rodrigo N Calheiros, Eduardo R Rodrigues, Renato LF Cunha, and Rajkumar Buyya. Hpc cloud for scientific and business applications: taxonomy, vision, and research challenges. ACM Computing Surveys (CSUR), 51(1):1–29, 2018. 46. Marek Nowicki, Łukasz Górski, and Piotr Bała. Pcj java library as a solution to integrate hpc, big data and artificial intelligence workloads. Journal of Big Data, 8(1):1–21, 2021. 47. Hamza Ouarnoughi, Grislin-Le Strugeon, Smail Niar, et al. Simulating multi-agent-based computation offloading for autonomous cars. Cluster Computing, 25:2755–2766, 2022. 48. Fearghal O’Donncha, Emanuele Ragnoli, Srikumar Venugopal, Scott C James, and Kostas Katrinis. On the efficiency of executing hydro-environmental models on cloud. Procedia Engineering, 154:199–206, 2016. 49. John O’Loughlin and Lee Gillam. Good performance metrics for cloud service brokers. In The Fifth International Conference on Cloud Computing, GRIDs, and Virtualization, pages 64–69. Citeseer, 2014. 50. Ajeet Ram Pathak, Manjusha Pandey, and Siddharth S Rautaray. Approaches of enhancing interoperations among high performance computing and big data analytics via augmentation. Cluster Computing, 23(2):953–988, 2020. 51. Thanh-Phuong Pham, Sasko Ristov, and Thomas Fahringer. Performance and behavior characterization of amazon ec2 spot instances. In 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), pages 73–81. IEEE, 2018. 52. Ferran Parés Pont, Pedro Megias, Dario Garcia-Gasulla, Marta Garcia-Gasulla, Eduard Ayguadé, and Jesús Labarta. Size & shape matters: The need of hpc benchmarks of high resolution image training for deep learning. Supercomputing Frontiers and Innovations, 8(1):28–44, 2021. 53. Charles Reiss, Alexey Tumanov, Gregory R Ganger, Randy H Katz, and Michael A Kozuch. Towards understanding heterogeneous clouds at scale: Google trace analysis. Intel Science and Technology Center for Cloud Computing, Tech. Rep, 84:1–21, 2012. 54. Maria Alejandra Rodriguez and Rajkumar Buyya. 
A taxonomy and survey on scheduling algorithms for scientific workflows in iaas cloud computing environments. Concurrency and Computation: Practice and Experience, 29(8):e4041, 2017. 55. Ehsan Saeedizade and Mehrdad Ashtiani. Ddbws: a dynamic deadline and budget-aware workflow scheduling algorithm in workflow-as-a-service environments. The Journal of Supercomputing, pages 1–40, 2021.


56. Jyoti Sahni and Deo Prakash Vidyarthi. A cost-effective deadline-constrained dynamic scheduling algorithm for scientific workflows in a cloud environment. IEEE Transactions on Cloud Computing, 6(1):2–18, 2015. 57. Shweta Salaria, Aleksandr Drozd, Artur Podobas, and Satoshi Matsuoka. Predicting performance using collaborative filtering. In 2018 IEEE International Conference on Cluster Computing (CLUSTER), pages 504–514. IEEE, 2018. 58. Altino M Sampaio and Jorge G Barbosa. Towards high-available and energy-efficient virtual computing environments in the cloud. Future Generation Computer Systems, 40:30–43, 2014. 59. Altino M Sampaio and Jorge G Barbosa. Constructing reliable computing environments on top of amazon ec2 spot instances. Algorithms, 13(8):187, 2020. 60. Altino M Sampaio, Jorge G Barbosa, and Radu Prodan. Piasa: A power and interference aware resource management strategy for heterogeneous workloads in cloud data centers. Simulation Modelling Practice and Theory, 57:142–160, 2015. 61. Naresh Kumar Sehgal and Pramod Chandra P. Bhatt. Cloud Workload Characterization, pages 61–83. Springer International Publishing, Cham, 2018. 62. Georgios L Stavrinides and Helen D Karatza. Dynamic scheduling of bags-of-tasks with sensitive input data and end-to-end deadlines in a hybrid cloud. Multimedia Tools and Applications, 80(11):16781–16803, 2021. 63. Xiaoyong Tang, Xiaoyi Liao, Jie Zheng, and Xiaopan Yang. Energy efficient job scheduling with workload prediction on cloud data center. Cluster Computing, 21(3):1581–1593, 2018. 64. William FC Tavares, Marcio RM Assis, and Edson Borin. Leveraging vcpu-utilization rates to select cost-efficient vms for parallel workloads. In Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing, pages 1–10, 2021. 65. William FC Tavares, Marcio Roberto Miranda Assis, and Edson Borin. Quantifying and detecting hpc resource wastage in cloud environments. In 2021 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW), pages 41–46. IEEE, 2021. 66. Ian J. Taylor, Ewa Deelman, Dennis B. Gannon, and Matthew Shields. Workflows for e-science: scientific workflows for grids, volume 1. Springer, 2014. 67. George Terzopoulos and Helen D Karatza. Bag-of-task scheduling on power-aware clusters using a dvfs-based mechanism. In 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, pages 833–840. IEEE, 2014. 68. Luan Teylo, Luciana Arantes, Pierre Sens, and Lucia Drummond. Scheduling bag-of-tasks in clouds using spot and burstable virtual machines. IEEE Transactions on Cloud Computing, 2021. 69. Long Thai, Blesson Varghese, and Adam Barker. A survey and taxonomy of resource optimisation for executing bag-of-task applications on public clouds. Future Generation Computer Systems, 82:1–11, 2018. 70. John Thompson, Ramon Ramirez-Linan, Michael Rilee, Aaron Skolnik, and Daniel Duffy. Leveraging High Performance Computing Cloud Based Resources for Advancing Science at the NASA Goddard Space Flight Center. In AGU Fall Meeting Abstracts, volume 2021, December 2021. 71. László Toka, Gergely Dobreff, Balázs Fodor, and Balázs Sonkoly. Machine learning-based scaling management for kubernetes edge clusters. IEEE Transactions on Network and Service Management, 18(1):958–972, 2021. 72. Prateeksha Varshney and Yogesh Simmhan. Autobot: Resilient and cost-effective scheduling of a bag of tasks on spot vms. IEEE Transactions on Parallel and Distributed Systems, 30(7):1512–1527, 2018. 73. 
Jonathan Stuart Ward and Adam Barker. Observing the clouds: a survey and taxonomy of cloud monitoring. Journal of Cloud Computing, 3(1):1–30, 2014. 74. Neeraja J Yadwadkar, Bharath Hariharan, Joseph E Gonzalez, Burton Smith, and Randy H Katz. Selecting the best vm across multiple public clouds: A data-driven performance modeling approach. In Proceedings of the 2017 Symposium on Cloud Computing, pages 452–465, 2017.


75. Lu Yin, Junlong Zhou, and Jin Sun. A stochastic algorithm for scheduling bag-of-tasks applications on hybrid clouds under task duration variations. Journal of Systems and Software, 184:111123, 2022. 76. Yi Zhang, Junlong Zhou, and Jin Sun. Scheduling bag-of-tasks applications on hybrid clouds under due date constraints. Journal of Systems Architecture, 101:101654, 2019. 77. Zhaomeng Zhu and Xueyan Tang. Deadline-constrained workflow scheduling in iaas clouds with multi-resource packing. Future Generation Computer Systems, 101:880–893, 2019. 78. Jiawei Zhuang, Daniel J. Jacob, Haipeng Lin, Elizabeth W. Lundgren, Robert M. Yantosca, Judit Flo Gaya, Melissa P. Sulprizio, and Sebastian D. Eastham. Enabling high-performance cloud computing for earth science modeling on over a thousand cores: Application to the geoschem atmospheric chemistry model. Journal of Advances in Modeling Earth Systems, 12(5), 2020.

Part IV

Application Study Cases

Chapter 12

Biological Sequence Comparison on Cloud-Based GPU Environment

Walisson P. Sousa, Filipe M. Soares, Rafaela C. Brum, Marco Figueiredo, Alba C. M. A. Melo, Maria Clicia S. de Castro, and Cristiana Bentes

12.1 Introduction

The analysis and comparison of DNA, RNA, or protein sequences is one of the most important tasks in computational biology. The efforts in DNA sequencing bring contributions to different areas, such as molecular evolution [19], pharmaceutical development [30], or protein folding [21]. Recently, we have witnessed the importance of biological sequence analysis in the fight against COVID-19, not only in identifying new variants of the virus, but also in the search for adequate medicines or treatments for the disease. Several algorithms and methods were developed for biological sequence comparison. Among these, dynamic programming algorithms, such as the Smith-Waterman [41] algorithm and its variants, produce optimal solutions. Still, the computational cost of such algorithms makes the comparison of large sequences prohibitive in terms of computing power. Therefore, biological sequencing applications that compare large sequences and produce optimal results demand high-performance computing (HPC) environments to provide realistic execution times [44]. Many tools that exploit HPC for the pairwise comparison of biological sequences have been proposed in the literature. Examples of state-of-the-art tools are SW# [31], SWIMM 2.0 [37], and MASA-CUDAlign [22, 38]. Of those, MASA-CUDAlign is able to compare larger sequences and



presents the best performance [22]. MASA-CUDAlign is a highly optimized, tightly coupled parallel sequence comparison tool that exploits the power of GPUs to accelerate sequence comparisons; it achieved an impressive 82,822 GCUPS (Giga-Cells Updated per Second) on a cluster with 512 GPUs. Although MASA-CUDAlign is able to compare sequences with more than 240 million base pairs in about 11 min, the platform used to achieve such a result is very expensive and mostly inaccessible to scientists. For this reason, we advocate here the adoption of cloud computing as an alternative platform for executing biological sequence comparison. The cloud pay-as-you-go model avoids the costs of setting up, buying, and maintaining a dedicated HPC infrastructure. Furthermore, it offers cost-effective advantages with the flexibility and customization provided by its virtualization support. Nowadays, cloud providers offer a wide variety of resources, such as virtualized central processing units (vCPUs), disks for storage, and accelerators such as Graphics Processing Units (GPUs) or Field-Programmable Gate Arrays (FPGAs). Likewise, cloud providers offer distinct markets, such as On-Demand, Spot, and Reserved. The cost and performance of those resources under different cloud markets vary widely. Therefore, choosing appropriate resources that provide good performance at low monetary cost is not trivial at all, since it involves conflicting objectives. For example, resources with accelerators are often more expensive than CPU-only resources, but using them may reduce the execution time considerably, leading to reduced financial costs.

In this chapter, we propose to explore the parallelism provided by the different services of the Amazon Web Services (AWS) cloud provider to run MASA-CUDAlign, focusing on the reduction of monetary costs and execution time. We used the most recent versions of MASA-CUDAlign: MASA-CUDAlign 4.0 [38] and MASA-CUDAlign-MultiBP (Multi Block Pruning) [22]. We propose to run MASA-CUDAlign 4.0 taking advantage of AWS GPU Spot instances to reduce the monetary costs. However, since Spot instances may be revoked by AWS at any time, MASA-CUDAlign 4.0 must be prepared to deal with revocations, using a fault tolerance mechanism. We also propose to run MASA-CUDAlign-MultiBP on the cloud, creating an HPC parallel cluster on AWS to reduce the execution time. More specifically, among all the service possibilities provided by AWS, we explored in our study of biological sequence comparison: the On-Demand and Spot models, GPU Virtual Machines as accelerators, and a parallel cluster. Hence, the execution of our biological sequence application on the cloud brought into discussion two important aspects of cloud computing: fault tolerance and guarantee of application isolation.

The remainder of this chapter is organized as follows. Section 12.2 discusses the AWS Cloud, with focus on the AWS functionalities that we used to run our application. Then, in Sect. 12.3, we present our Bioinformatics case study, including an overview of MASA-CUDAlign and our strategies to (a) run it with reduced monetary costs in one GPU and (b) run it in several GPU instances in the AWS Parallel Cluster. Section 12.4 discusses the experimental results of our case study. Finally, Sect. 12.5 presents our conclusions and future research directions.


12.2 Amazon Web Services

12.2.1 Overview

Cloud service providers offer On-Demand resource provisioning, with the potential advantages of availability and scalability on a pay-as-you-go basis. Cloud resources may be offered by generic cloud models [43], such as Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS), or by more specialized ones such as Security-as-a-Service [26] and Mobility-as-a-Service [15], among others. In this chapter, we will discuss Infrastructure-as-a-Service solutions for executing our highly optimized biological sequence comparison application. Traditionally, the IaaS cloud model offers virtual machines (VMs) to the user, which are called instances. As soon as the user obtains access to the instance, he/she can copy applications to it and execute them. The IaaS model is thus very flexible, since nearly any application can be executed on it. There are several service providers available in the cloud market, such as Amazon Web Services (AWS) [5], Microsoft Azure [9], or Google Cloud Platform [8]. We chose AWS due to several factors, such as the diversity of services offered, reasonable prices, and multiple pricing models, but mainly grounded on the Gartner Magic Quadrant for Cloud Infrastructure and Platform Services 2021 [7], where AWS is shown to be the top leader.

One important AWS IaaS service is the Elastic Compute Cloud (EC2) [2]. AWS EC2 offers three main instance purchasing options: On-Demand, Spot, and Reserved. In this work, we focus on the On-Demand and Spot models, which are classified as pay-as-you-go (i.e., charged by the hour/second of use). In the On-Demand model, each instance has a fixed cost per hour/second, which is charged from the moment the instance is acquired to the moment the instance is terminated by the user. On the other hand, Spot instances have a variable cost per hour/second and are often much cheaper than their On-Demand counterparts. Spot instances, however, may be revoked by the provider at any time and, for this reason, applications that run on Spot instances must be prepared for this situation with, for instance, a fault tolerance mechanism. In March 2022, Amazon EC2 had 84 data centers spread over 26 regions all over the globe. It offered almost 400 instance types, including Intel and ARM CPUs, ranging from 1 vCPU (instance t2.nano) to 192 vCPUs (instance c6a) and from 0.5 GiB (instance t2.nano) to 24 TiB (instance high memory). EC2 also provides accelerated computing instances for high-performance computing, such as GPUs and FPGAs, as well as the custom-designed Habana Gaudi and AWS Trainium accelerators for deep learning [2].
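As an illustration of how the Spot market can be inspected programmatically, the sketch below queries recent Spot price history for a GPU instance type with boto3 and compares the latest observation with an On-Demand price. The instance type and the On-Demand figure are illustrative placeholders, not values used in our experiments.

# Minimal sketch: query recent Spot price history for a GPU instance type and
# compare the latest observation with an (illustrative) On-Demand price.
import datetime
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")


def latest_spot_price(instance_type="p3.2xlarge", hours=6):
    end = datetime.datetime.utcnow()
    resp = ec2.describe_spot_price_history(
        InstanceTypes=[instance_type],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=end - datetime.timedelta(hours=hours),
        EndTime=end,
    )
    history = sorted(resp["SpotPriceHistory"], key=lambda h: h["Timestamp"])
    return float(history[-1]["SpotPrice"]) if history else None


ON_DEMAND_PRICE = 3.06  # illustrative figure, check current pricing
spot = latest_spot_price()
if spot is not None:
    print(f"spot: ${spot:.4f}/h (~{100 * spot / ON_DEMAND_PRICE:.0f}% of on-demand)")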


12.2.2 GPU Instances on AWS

GPUs provide a powerful and efficient platform to accelerate a broad range of compute-intensive applications that present regular data dependency patterns. The rapid increase in the computational power of GPUs, which nowadays have thousands of cores, as well as advances in their memory subsystem, makes the GPU faster than the CPU for most compute-intensive problems. Moreover, general-purpose programming environments, such as the Compute Unified Device Architecture (CUDA) and the Open Computing Language (OpenCL), provide an adequate programming apparatus to successfully exploit the GPU's inherent parallelism. Biological sequence comparison applications that obtain the optimal result with variants of the Smith-Waterman (SW) algorithm have a regular data dependency pattern over the anti-diagonals of the DP matrix, which has been successfully exploited using GPUs. MASA-CUDAlign-MultiBP [22] and ADEPT [17] are examples of recent tools for sequence comparison that exploit the power of GPUs. ADEPT aims to pairwise compare a large set of small DNA or protein sequences on a cluster of GPUs with a variant of SW. It achieved 497 GCUPS on a standalone cluster of 8 NVidia V100 GPUs. MASA-CUDAlign-MultiBP is the latest version of the MASA-CUDAlign tool. It pairwise compares very long DNA sequences (i.e., whole chromosomes) on one or more GPUs with a variant of SW. On a stand-alone cluster of 8 NVidia V100 GPUs, it achieved 2521 GCUPS. MASA-CUDAlign-MultiBP actually holds the best performance in the literature (82,822 GCUPS), on a cluster of 512 NVidia V100 GPUs. Neither of these tools, however, has been executed on GPUs in the cloud. AWS EC2 offers a wide choice of NVidia GPUs, spread over two instance families (p and g). The p family contains the most recent NVidia GPUs, including GPU architectures such as the A100 (6912 cores) and V100 (5120 cores), whereas the g family contains more affordable GPUs, such as the ones with the T4 architecture (2560 cores).
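For readers unfamiliar with the recurrence behind these tools, the sketch below is a plain sequential reference implementation of the Smith-Waterman score computation with a linear gap penalty. It only illustrates why GPU implementations work well: cell (i, j) depends solely on its upper, left, and upper-left neighbours, so all cells on the same anti-diagonal can be computed in parallel. The scoring parameters are illustrative and are not those used by MASA-CUDAlign or ADEPT.

# Sequential reference sketch of the Smith-Waterman recurrence with a linear
# gap penalty. Cell (i, j) depends only on (i-1, j-1), (i-1, j) and (i, j-1),
# so all cells on the same anti-diagonal (i + j = const) are independent --
# the property exploited by GPU implementations. Scoring values are illustrative.
def smith_waterman_score(a, b, match=3, mismatch=-3, gap=-2):
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]   # DP matrix, first row/column are 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap
            left = H[i][j - 1] + gap
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best                              # optimal local alignment score


print(smith_waterman_score("GGTTGACTA", "TGTTACGG"))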

12.2.3 Application Execution on AWS

From the application point of view, utilizing cloud resources requires some decisions regarding resource management. Figure 12.1 shows a workflow of seven steps to deploy the cloud execution considering only one instance. These decisions can be made manually or partially automated by custom frameworks [10, 11]. In this work, we made the resource management decisions manually to better adapt to our sequence comparison application requirements. Following the workflow for one instance, first, we decided on the us-east-1 region (Northern Virginia) because it presented, in preliminary studies, the lowest and most stable prices. Since the biological sequence comparison application exhibits a wave pattern, it permits parallel execution and the use of GPUs.


Fig. 12.1 Executing application steps: (1) Region Selection, (2) Instance Type Selection, (3) VM Image Selection, (4) Disk Type Selection, (5) VM Request, (6) Application Execution, (7) VM Release

In Step 2, among the AWS instances equipped with GPUs, we selected the ones that cost less in the p and g families. In Step 3, we selected the VM images, which on AWS are called Amazon Machine Images (AMIs); our AMI configuration includes Ubuntu 18.04 with the CUDA 10.2 driver, adjusted for each GPU architecture tested. In Step 4, we selected the standard local disk (Elastic Block Store—EBS) [12] to store the checkpoints required for the fault-tolerant version of our application. Steps 5, 6, and 7 correspond to requesting resources, executing the application, and releasing resources.
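A rough sketch of Steps 5-7 with boto3 is shown below. The AMI id, key pair, instance type, and EBS volume size are placeholders, and the launch of the application itself (Step 6) is omitted here.

# Minimal sketch of Steps 5-7 for a single GPU instance: request the VM,
# wait until it is running, and release it afterwards. AMI id, key pair and
# volume size are placeholders; the application would be started over SSH
# or via user data, which is omitted.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")


def request_vm(spot=True):
    params = dict(
        ImageId="ami-0123456789abcdef0",   # placeholder AMI (Step 3)
        InstanceType="g4dn.xlarge",        # affordable GPU instance (Step 2)
        KeyName="my-keypair",              # placeholder key pair
        MinCount=1,
        MaxCount=1,
        BlockDeviceMappings=[{             # EBS volume for checkpoints (Step 4)
            "DeviceName": "/dev/sda1",
            "Ebs": {"VolumeSize": 100, "VolumeType": "gp3"},
        }],
    )
    if spot:
        params["InstanceMarketOptions"] = {
            "MarketType": "spot",
            "SpotOptions": {"SpotInstanceType": "one-time"},
        }
    (instance,) = ec2.create_instances(**params)
    instance.wait_until_running()          # Step 5 complete
    return instance


def release_vm(instance):
    instance.terminate()                   # Step 7
    instance.wait_until_terminated()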

12.2.4 High-Performance Computing on AWS

Interest in adopting cloud computing as an alternative platform for executing HPC applications is growing steadily. Some early studies on evaluating the use of cloud computing for HPC applications, however, exposed some of its shortcomings [25, 27, 33]. Tightly coupled HPC applications that rely on frequent interprocess communication commonly do not perform or scale well in cloud environments. The main reasons for these results were the poor network performance due to virtualization and the sharing of processors. However, the network and processing performance of cloud platforms has improved rapidly over the years. Recent studies show that the cloud can be a viable platform for HPC applications and also that HPC applications can be adapted to become more cloud friendly [34, 45]. In order to execute an HPC application on the cloud, we need to consider two important aspects of cloud computing: fault tolerance and the guarantee of application isolation.

12.2.4.1 Fault Tolerance

Fault tolerance is the ability of a system to perform its function regardless of faults in software or hardware components, power failures, or any other adversities. In cloud computing, fault tolerance is related to robustness, reliability, availability, the effective throughput supplied by cloud service providers, and the absence of breakdowns. Due to the heterogeneity, large scale, and distribution of their components, cloud computing platforms are prone to failures. Therefore, fault tolerance becomes a critical requirement for achieving high performance in cloud computing.


Some classes of cloud computing applications can tolerate partial failures, where a partial fault leads to performance degradation instead of a complete breakdown. Fault-tolerance methods fall into two main classes, based on their policies and procedures: proactive and reactive. Various new fault-tolerance strategies for cloud computing have also been proposed [13, 16, 20, 32, 40]. In this study, since we exploit Spot GPU VMs, we deploy an application-specific fault tolerance scheme to deal with revocations. To replace a revoked VM, the new VM can be Spot or On-Demand, depending on a decision that involves cost and performance.

12.2.4.2 Application Isolation

Most HPC applications execute on multiple nodes and require sustained computing power as well as dedicated network bandwidth. Such an execution environment is not traditionally offered by the cloud. In order to bridge the gap between cloud computing and HPC, AWS provides a service that enables the creation of clusters for HPC, named ParallelCluster [3]. ParallelCluster is an open-source tool, created in 2018, that uses resources from EC2 and other AWS services to assemble and manage a high-performance cluster. Through a configuration file, it is possible to define the characteristics of the parallel cluster, which include (a) the number of nodes and the instance(s) that will compose each node; (b) the network; (c) the storage type and size; and (d) other AWS services. With a configured cluster, AWS allows the user to submit jobs, supporting job schedulers such as AWS Batch and Slurm. There is no charge for using the AWS ParallelCluster tool itself; however, AWS charges for the instances, network, storage, and other services included in a particular cluster. We chose AWS ParallelCluster to run our application because it mimics an HPC platform on the cloud. Since there is a considerable amount of communication between neighbor nodes in our application, having dedicated network and computing power is mandatory for us. In addition, ParallelCluster provides job schedulers such as the ones used in supercomputers (e.g., Slurm), and our execution scripts are already prepared to use them.

12.3 Case Study: Biological Sequence Comparison Application

12.3.1 Overview

In this case study, we execute MASA-CUDAlign, which is based on the dynamic programming SW algorithm. It compares two DNA sequences with a variant of the Gotoh [24] and Myers-Miller [35] algorithms.


Fig. 12.2 Overview of MASA-CUDAlign [38]. (a) Stage 1 finds the optimal score and its position. Special rows are stored on disk and some blocks are pruned (gray blocks). (b) Stage 2 finds crosspoints between the optimal alignment and the special rows. Special columns are flushed. (c) Stage 3 finds more crosspoints over special rows stored in previous stages. (d) Stage 4 executes Myers and Miller's algorithm between each pair of successive crosspoints. (e) Stage 5 obtains the complete alignment between each pair of successive crosspoints

It produces the optimal local, global, or semi-global alignment in quadratic time and linear memory. Since MASA-CUDAlign targets long DNA sequences, it executes on one or more GPUs. MASA-CUDAlign runs in 5 stages (Fig. 12.2), where Stage 1 executes Step 1 of the Gotoh algorithm (compute the optimal score) and Stages 2–5 execute Step 2 of the Gotoh algorithm (obtain the optimal alignment), combined with Myers-Miller. Stage 1 computes three dynamic programming (DP) matrices. Instead of saving all rows of the DP matrices, MASA-CUDAlign saves only some rows, either on disk or in memory, shown in bold in Fig. 12.2a. These rows form the special rows area (SRA). The number of special rows to be saved is configurable and defined by the user. As output, Stage 1 provides the value of the maximum score (optimal score) and its coordinates in the DP matrices, as well as the saved rows. Since data dependencies among the DP matrices' elements must be respected, the parallel computation follows the wavefront pattern, where each anti-diagonal of the DP matrices may be computed in parallel. Stages 2–5 retrieve the optimal alignment as follows. Stage 2 (Fig. 12.2b) starts from the point where the optimal score occurs (marked as x in the figure), reads the saved row closest to it, and recomputes the DP matrices until the point that belongs to the optimal alignment and crosses this row (crosspoint) [35] is found. Then, the row closest to the one that has just been processed is read and the process is repeated. This continues until the beginning of the alignment is found. At the end of Stage 2, some crosspoints have been found. Then, in Stages 3, 4, and 5 (Fig. 12.2c, d, and e), more crosspoints are obtained in a divide-and-conquer manner, until the whole optimal alignment is found at the end of Stage 5. In addition, MASA-CUDAlign may apply the block pruning (BP) strategy [39] in Stage 1. BP does not calculate the blocks of cells in the DP matrices which surely do not belong to the optimal alignment. In Fig. 12.2a, the pruned blocks are shown in gray.

Fig. 12.3 Multi-GPU chaining in Stage 1 of MASA-CUDAlign 4.0 [38]

The most recent versions of MASA-CUDAlign are (a) MASA-CUDAlign 4.0 [38], which runs on single and multiple GPUs and provides pruning only for the single-GPU version; and (b) MASA-CUDAlign-MultiBP (Multi Block Pruning) [22], which executes Stage 1 on two or more GPUs, with pruning capability for multi-GPU executions. In MASA-CUDAlign 4.0, each GPU computes a subset of columns of the DP matrices in Stage 1, sending the DP matrices' border elements to the right neighbor and receiving border elements from the left neighbor, as shown in Fig. 12.3. In order to do the traceback (Stages 2–5), the communication between the GPUs occurs in the reverse order (i.e., from GPU_i to GPU_(i-1)), with a speculative mechanism [38]. MASA-CUDAlign 4.0 also provides an application-specific fault tolerance scheme for executions on one GPU. First, the user should specify that all rows saved in Stage 1 must be written to disk. If there is a failure, the user restarts the execution of MASA-CUDAlign 4.0, informing the path where the rows are saved. Then, MASA-CUDAlign 4.0 accesses the checkpoint, which is the last saved row, and restarts the execution from it. It is important to notice that the execution with multiple GPUs does not have a fault tolerance mechanism. MASA-CUDAlign-MultiBP, besides pruning the matrices, distributes the load (i.e., subsets of columns) among the GPUs with static or dynamic workload assignment strategies, taking into account the pruning pattern and the characteristics of the GPU execution environment. MASA-CUDAlign 4.0 was able to run very efficiently on a large-scale cluster with 384 Tesla M2090 GPUs, achieving 10,370 GCUPS [38]. Similarly, MASA-CUDAlign-MultiBP compared whole human vs. chimpanzee chromosomes on a GPU cluster composed of 512 NVidia V100 GPUs, achieving the best result in the literature so far [22]. GPU clusters like those, however, are very expensive and inaccessible for most scientists. Therefore, executing MASA-CUDAlign on the cloud would bring a new perspective for biological sequence comparison in terms of performance and cost.
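A minimal sketch of the special-rows idea behind this application-specific fault tolerance scheme is shown below. It is not MASA-CUDAlign code: the GPU kernel is replaced by a stub, the checkpoint directory is an illustrative path (for instance, on the detachable EBS volume used later in Sect. 12.3.2), and only the save/restore logic is outlined.

# Sketch of the special-rows idea used as an application-level checkpoint:
# every `interval` rows, the last computed DP row is written to disk; after a
# revocation, the run resumes from the most recent saved row. This is a
# simplification of the scheme described in the text, not MASA-CUDAlign code.
import os
import numpy as np

CKPT_DIR = "checkpoints"     # illustrative path (e.g., on a detachable EBS volume)

def compute_next_row(prev_row, i):
    # Stand-in for the GPU kernel that computes DP row i from the previous row
    # (the real code would apply the Gotoh recurrences).
    return prev_row

def save_special_row(i, row):
    os.makedirs(CKPT_DIR, exist_ok=True)
    np.save(os.path.join(CKPT_DIR, f"row_{i:09d}.npy"), row)

def last_checkpoint():
    files = sorted(os.listdir(CKPT_DIR)) if os.path.isdir(CKPT_DIR) else []
    if not files:
        return 0, None                                    # no checkpoint: start from row 0
    latest = files[-1]
    return int(latest[4:13]), np.load(os.path.join(CKPT_DIR, latest))

def run_stage1(n_rows, n_cols, interval=1000):
    start, row = last_checkpoint()
    if row is None:
        row = np.zeros(n_cols, dtype=np.int64)            # first DP row
    for i in range(start + 1, n_rows + 1):
        row = compute_next_row(row, i)
        if i % interval == 0:
            save_special_row(i, row)                      # special row flushed to disk
    return row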

12.3.2 Reducing the Monetary Costs

We propose to run MASA-CUDAlign 4.0 taking advantage of AWS GPU Spot instances to reduce the monetary cost of the executions. Spot instances may be up to 90% cheaper than On-Demand ones, but can be revoked at any time. Therefore, the execution of MASA-CUDAlign 4.0 should be fault tolerant. We discarded the idea of using a generic GPU checkpointing and recovery mechanism since the current solutions are either obsolete [23, 29, 36, 42] or at an initial development stage [28].

Fig. 12.4 Workflow in case of a revocation: AWS interrupts the Spot VM (VM 1), the Worker notifies the Controller about the revocation, and the Controller selects another VM (VM 2) and deploys the application on it, restarting the execution from the checkpoint

Therefore, we proposed in [18] a framework that exploits the application-specific fault tolerance scheme of MASA-CUDAlign 4.0 to guarantee correct execution even when a Spot GPU is revoked. The proposed framework focuses on scheduling MASA-CUDAlign 4.0 on the cloud considering Spot interruptions. It uses the MASA-CUDAlign scheme of saving some rows in Stage 1 as a checkpointing mechanism. We assume that all rows in the SRA are saved to a spare disk that can be detached from one VM and attached to another. The framework has two modules: the Controller and the Worker. The Controller module schedules MASA-CUDAlign on the Spot GPU VM and deploys it on EC2. The Worker module is responsible for monitoring the deployed VM and for informing the Controller whenever the VM status changes from active to revoked. When a revocation occurs, the Controller is responsible for moving the application to a new (Spot or On-Demand) VM. When a revocation occurs, the framework needs some time to select a new VM, deploy it, and recover from the last checkpoint. So, the framework has to guarantee that all these recovery steps are performed respecting the given deadline, called MD. The worst-case scenario for recovery is when the application is interrupted during the execution of the checkpointing. In this case, MASA-CUDAlign needs to restart from the previously stored special row, which means re-executing the whole block between the last two consecutive special rows. The time for the whole recovery process needs to be accounted for in the initial allocation, and a new deadline, earlier than the user-provided one, called MDSpot, needs to be set. We calculate the value of MDSpot by computing the time to execute the block between two consecutive special rows on the slowest instance and subtracting that time from the user-defined deadline. The initial schedule of MASA-CUDAlign on a Spot GPU VM uses a greedy algorithm to select the instance that meets the deadline MDSpot and provides the minimum cost. When none of the Spot VMs is able to meet the deadline, the framework asks the user to define a new deadline. Figure 12.4 shows how our framework handles a revocation. When a revocation occurs, the framework tries to reschedule the application on a different type of Spot instance, since we observed in our experiments that after an instance type is revoked, AWS often revokes it again shortly. If there is no other type of Spot instance available to run the application, the framework moves the application to an On-Demand instance.
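The following sketch outlines the deadline adjustment and the greedy instance choice described above. The candidate instances, Spot prices, and estimated execution times are illustrative placeholders rather than measured values; the logic simply tightens the user deadline by the worst-case re-execution time and picks the cheapest feasible Spot instance.

# Sketch of the scheduling decision described above: tighten the user deadline
# by the worst-case recovery cost (re-executing the block between two special
# rows on the slowest instance), then greedily pick the cheapest Spot instance
# whose estimated runtime still meets the tightened deadline.
# Prices and estimated times below are illustrative placeholders.

def md_spot(user_deadline_h, block_time_slowest_h):
    # Worst case: revocation right before a checkpoint completes, so the whole
    # block between two consecutive special rows is re-executed.
    return user_deadline_h - block_time_slowest_h

def pick_spot_instance(candidates, deadline_h):
    """candidates: list of (name, spot_price_usd_per_h, estimated_time_h)."""
    feasible = [c for c in candidates if c[2] <= deadline_h]
    if not feasible:
        return None                                       # ask the user for a new deadline
    return min(feasible, key=lambda c: c[1] * c[2])       # minimum estimated cost

candidates = [
    ("g4dn.xlarge", 0.16, 5.0),                           # illustrative numbers only
    ("g4dn.2xlarge", 0.23, 4.5),
    ("g3s.xlarge", 0.22, 7.0),
    ("g2.2xlarge", 0.20, 9.5),
]
deadline = md_spot(user_deadline_h=10.0, block_time_slowest_h=0.5)
print(pick_spot_instance(candidates, deadline))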


There are three ways to recognize a revocation. First, the Worker monitors the instance metadata every 5 s, since AWS writes to this metadata a 2-minute notice of a revocation [6]. Sometimes, however, this 2-minute notice is not properly sent. So, the second way of recognizing a revocation is by monitoring the VM state using the AWS SDK for Python, boto3 [4]. In both cases, when the revocation is recognized, the Worker sends a signal to the Controller so that the migration process can start. The third way considers the monitoring of the communication between the Controller and the Worker: if this communication fails three times in a row, the Controller also assumes that a revocation has occurred.
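The first two detection mechanisms can be sketched as follows. The Spot interruption notice is exposed by the EC2 instance metadata service (IMDSv1 shown for brevity), and the VM state is queried through boto3; the instance ID is a placeholder.

# Sketch of the first two revocation checks described above. The metadata
# endpoint returns HTTP 404 while no interruption is scheduled; the boto3 call
# checks the VM state as seen by the EC2 API. The instance ID is a placeholder.
import urllib.request
import urllib.error
import boto3

METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_notice_pending():
    """Check (from inside the VM) the 2-minute Spot interruption notice."""
    try:
        with urllib.request.urlopen(METADATA_URL, timeout=2) as resp:
            return resp.status == 200          # notice present: revocation imminent
    except urllib.error.HTTPError as err:
        return err.code != 404                 # 404 means no interruption scheduled
    except urllib.error.URLError:
        return False                           # metadata service unreachable

def vm_state(instance_id, region="us-east-1"):
    """Check (from the Controller) the instance state via the EC2 API."""
    ec2 = boto3.client("ec2", region_name=region)
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    return reservations[0]["Instances"][0]["State"]["Name"]   # e.g., 'running', 'terminated'

# Example polling loop (placeholder instance ID):
# while not interruption_notice_pending() and vm_state("i-0123456789abcdef0") == "running":
#     time.sleep(5)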

12.3.3 Reducing the Execution Time

We propose to run MASA-CUDAlign-MultiBP on a virtual cluster using AWS ParallelCluster in order to guarantee performance isolation and reduce its execution time. The steps required for the integration of MASA-CUDAlign-MultiBP into AWS ParallelCluster are the following. In the first step, the user fills in a configuration file with the desired settings, such as the type and number of instances, storage, VPC (Virtual Private Cloud) network, and job scheduler, among others. The ParallelCluster is then created according to the information provided in the configuration file, and an ssh connection is provided to the user. In our ParallelCluster, we opted to use the g4dn.xlarge GPU instance, since it provides a good tradeoff between performance (NVidia T4, 2560 cores) and price (USD 0.526/hour). We also opted to create parallel clusters of 2, 4, and 8 GPUs. Having the ssh connection, the user may install and configure his/her application which, in our case, is MASA-CUDAlign-MultiBP. Figure 12.5 illustrates how the MASA-CUDAlign-MultiBP execution works in the AWS ParallelCluster. It shows a hypothetical cluster composed of a control node (master) and four GPU nodes. In this figure, the user connects to the master node through ssh and copies to it, through scp, the input sequences, as well as two scripts, Scheduler and Executor. The Scheduler script is responsible for determining the job characteristics, while the Executor script defines how the GPU nodes must be connected. The Scheduler script is submitted to the Slurm job scheduler. When the job is selected for execution, the Scheduler script calls the Executor script, which manages the execution of MASA-CUDAlign-MultiBP on the ParallelCluster. In MASA-CUDAlign-MultiBP, each GPU_i computes its subset of columns of the DP matrix and regularly sends the data of its edges to GPU_(i+1), as indicated by the purple arrows in Fig. 12.5. After the execution of all previous GPUs, the node's last GPU communicates with the first GPU of the next host. Furthermore, as stated in Sect. 12.3.1, MASA-CUDAlign-MultiBP performs the pruning procedure on multiple GPUs and, to maximize the pruning area, the GPUs exchange information on the best score obtained so far. This communication occurs from GPU_i to GPU_((i+1) mod ng), where ng is the number of GPUs, and is represented by the red arrows.

Fig. 12.5 MASA-CUDAlign execution at the AWS ParallelCluster: the user connects to the master node via SSH, transfers the sequences and scripts via SCP, and submits the job with sbatch; each GPU node has a private workdir, and a shareddir is shared among all nodes

The Executor script of MASA-CUDAlign-MultiBP, which establishes communication among the GPUs of a physical cluster, was adapted to the AWS ParallelCluster environment. The script first creates a list with all allocated nodes and associates each of them with its IP address. The list is then traversed to identify which node is currently running, as well as the previous and next ones. Each node can contain one or more GPUs, depending on the parameters used in the Scheduler script. Then, the script identifies the pair of sequences to be compared and creates two directories: workdir and shareddir. Workdir is private and stores the node's local execution data, whereas shareddir is placed in a shared file system since it stores control information for multi-pruning, which is shared by all GPUs. Finally, the master node sends a command to the allocated GPU nodes to start the MASA-CUDAlign-MultiBP execution. This command establishes the communication between neighbor GPUs, performed via sockets, and sets some execution parameters, making it specific to each GPU. Examples of parameters defined inside the command are the GPU index, the subset of columns to be computed, and the IP address and communication port of the preceding and following GPUs. When the execution is complete, the last node copies the statistics results to the destination folder. The source code of the main instructions executed by the Scheduler (lines 1–13) and Executor (lines 14–30) scripts can be seen in Fig. 12.6, where: $shareddir is the shared directory; $workdir is the work directory; $split is the matrix column distribution among GPUs, i.e., a split of 1,2 will assign the first 1/3 of the matrix to the first GPU and the last 2/3 of the matrix to the second one; $PARAMS contains common parameters required to execute MASA-CUDAlign-MultiBP; $sol is the path where the executable is located; $previp is the IP of the previous GPU host; $nextip is the IP of the next GPU host; $prevport is the previous socket port; $nextport is the next socket port; and $seq1 and $seq2 are the paths of the two input sequences.

1  #**********************************************************************************#
2  #!/bin/bash
3  # File: Scheduler.sh
4  # Input parameter: sequence ID
5  # Parameters for sbatch:
6  #SBATCH --job-name=multibp
7  #SBATCH --ntasks=8
8  #SBATCH --nodes=8
9  #SBATCH --time=09:59:00
10 #SBATCH --mail-type=END,FAIL
11 gpus=1  # Number of GPUs per node
12 srun Executor.sh $1 $gpus
13 #**********************************************************************************#
14 #!/bin/bash
15 # File: Executor.sh
16 # Input parameters: sequence ID, amount of GPUs, instance type, amount of nodes
17 #
18 # Some instructions were omitted...
19 # Common execution parameters
20 PARAMS="--stage-1 --no-flush --blocks=512 --shared-dir=$shareddir --split=$split"
21 if [ x$previp == x ]; then    # If first host
22   ./$sol/cudalign $PARAMS --work-dir=$workdir --part=1 --flush-column=socket://127.0.0.1:$baseport $seq1 $seq2
23 elif [ x$nextip == x ]; then  # If last host
24   prevport=$((baseport+basepart-2))
25   ./$sol/cudalign $PARAMS --work-dir=$workdir --part=$basepart --load-column=socket://$previp:$prevport $seq1 $seq2
26 else                          # Remaining hosts
27   prevport=$((baseport+basepart-2))
28   nextport=$((baseport+basepart-1))
29   ./$sol/cudalign $PARAMS --work-dir=$workdir --part=$basepart --load-column=socket://$previp:$prevport --flush-column=socket://127.0.0.1:$nextport $seq1 $seq2
30 fi
   #**********************************************************************************#

Fig. 12.6 Scheduler and executor scripts


12.4 Experimental Results

As discussed in Sect. 12.2, the AWS EC2 service offers computational resources organized into different instance families. As we need GPUs for MASA-CUDAlign, we used the accelerated computing family, considering cost restrictions (less than 1 USD/hour) and different GPU architectures: Kepler (K520 and K80), Maxwell (M60), and Turing (T4), as shown in Table 12.1. The experiments were performed on 10 real biological sequences obtained from the NCBI website [1], with sizes ranging from 3 MBP (Millions of Base Pairs) to 66 MBP. The sequences used in the experiments are shown in Table 12.2.


Table 12.1 Selected AWS GPU instances

Name         | CPU                           | RAM (GiB) | GPU
g2.2xlarge   | Intel Xeon E5-2670 2.6 GHz    | 15        | Nvidia K520
g3s.xlarge   | Intel Xeon E5-2686 v4 2.3 GHz | 30.5      | Nvidia M60
g4dn.xlarge  | Intel Xeon 24C 2.5 GHz        | 16        | Nvidia T4 Tensor Core
g4dn.2xlarge | Intel Xeon 24C 2.5 GHz        | 32        | Nvidia T4 Tensor Core
p2.xlarge    | Intel Xeon E5-2686 v4 2.3 GHz | 61        | Nvidia K80

Table 12.2 Biological sequences used in our experiments

Seq.  | Accession                  | Name                                                                      | Size                    | Type
3m    | BA000035.2 / BX927147.1    | Corynebacterium efficiens YS-314 / Corynebacterium glutamicum ATCC 13032 | 3,147,090 / 3,282,708   | small
5m    | AE016879.1 / AE017225.1    | Bacillus anthracis str. Ames / Bacillus anthracis str. Sterne             | 5,227,293 / 5,228,663   | small
7m    | NC_003997.3 / NC_005027.1  | Bacillus anthracis str. Ames / Rhodopirellula baltica SH 1 chromosome     | 5,227,293 / 7,145,576   | small
10m   | NC_014318.1 / NC_017186.1  | Amycolatopsis mediterranei U32 chromosome / Amycolatopsis mediterranei S699 chromosome | 10,236,715 / 10,236,779 | small
23m   | NT_033779.4 / NT_037436.3  | Drosophila melanogaster chromosome 2L / Drosophila melanogaster chromosome 3L | 23,011,544 / 24,543,557 | small
chr19 | NC_000019.10 / NC_006486.4 | Homo sapiens chromosome 19 / Pan troglodytes chromosome 19                | 58,617,616 / 61,309,027 | large
chr20 | NC_000020.11 / NC_006487.4 | Homo sapiens chromosome 20 / Pan troglodytes chromosome 20                | 64,444,167 / 66,533,130 | large
chr21 | NC_000021.9 / NC_006488.4  | Homo sapiens chromosome 21 / Pan troglodytes chromosome 21                | 46,709,983 / 33,445,071 | large
chr22 | NC_000022.11 / NC_006489.4 | Homo sapiens chromosome 22 / Pan troglodytes chromosome 22                | 50,818,468 / 37,823,149 | large
chrY  | NC_000024.10 / NC_006492.4 | Homo sapiens chromosome Y / Pan troglodytes chromosome Y                  | 57,227,415 / 26,350,515 | large

12.4.1 Reducing the Monetary Costs

We first evaluate the reductions in monetary cost provided by the use of Spot instances. For these experiments, we executed MASA-CUDAlign 4.0 using a single GPU (the MASA-CUDAlign versions with multiple GPUs do not have a checkpointing mechanism). We performed two sets of experiments. In the first set, MASA-CUDAlign 4.0 was executed with small sequences. The purpose of this set of experiments was to test the Spot instances with comparisons that take less than 1 h to finish. This was motivated by previous exhaustive testing on Spot instances, from 2019 to 2021, that showed no revocations before 1 h of computation. In this way, we want to compare the performance of the sequence comparisons on the Spot and On-Demand instances when there is no revocation involved and, consequently, no fault-tolerance mechanism is required.

Fig. 12.7 Comparison of execution time between On-Demand and Spot instances for small sequences. (a) Execution time in On-Demand instances. (b) Execution time in Spot instances

Fig. 12.8 Comparison of monetary cost between On-Demand and Spot instances for small sequences. (a) Monetary cost in On-Demand instances. (b) Monetary cost in Spot instances

In the second set of experiments, MASA-CUDAlign 4.0 was executed with large sequences. In these experiments, the probability of a revocation occurring is high. In a previous work [18], we observed that some instance types suffer more frequent revocations when executing with longer sequences, especially g2.2xlarge and p2.xlarge. Figure 12.7 shows the execution time of MASA-CUDAlign 4.0 with the small sequences from Table 12.2 running on On-Demand (Fig. 12.7a) and Spot instances (Fig. 12.7b). Figure 12.8 shows the monetary cost of these executions. We can observe in these figures that there is no significant difference in terms of execution time between the two types of instances, but there is a considerable difference in terms of monetary cost. The executions on the Spot instances cost, on average, 70% less than the corresponding executions on the On-Demand instances. The results also show that the g4dn.xlarge and g4dn.2xlarge instances presented the lowest execution times, whereas g2.2xlarge presented the highest. The most expensive instances were, respectively, p2.xlarge and g2.2xlarge. The g4dn.xlarge instance was the one that provided the best cost-benefit ratio in our experiments with small sequences.


For the second set of experiments, we used the fault-tolerance framework described in Sect. 12.3.2. The framework presented an overhead of 3% in the execution time on g4dn.2xlarge, when compared to executions on the same instance without the framework. In order to guarantee the occurrence of revocations and properly evaluate the cost of migrating the computation to another instance, we created different scenarios of simulated revocations. Three scenarios were created using a Poisson distribution [14] with a λ of 1 divided by the average time between revocations: 2 h (S1), 4 h (S2), and 6 h (S3). The framework starts the MASA-CUDAlign 4.0 computation on the g4dn.xlarge instance and, upon the first revocation, migrates the application to the g4dn.2xlarge instance. When the second revocation occurs, the application migrates to the g3s.xlarge instance. When the third revocation occurs, the application migrates to g2.2xlarge, and to p2.xlarge when the next revocation occurs. For the fifth revocation (which occurred in scenario S1 of chromosome 20), the application migrates to the On-Demand g4dn.xlarge instance. In Table 12.3, we show, for each sequence and each revocation scenario (S1, S2, and S3), the number of Spot and On-Demand VMs used, the execution time, and the monetary cost. We also show, for comparison purposes, the execution time and monetary cost when the execution is performed only on the g4dn.xlarge On-Demand instance. Figure 12.9 shows the execution time and monetary cost of the three simulated revocation scenarios and the On-Demand-only execution. All the results were obtained as the average of three executions in each of the evaluated scenarios.
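The revocation instants for these scenarios can be drawn as sketched below: for a Poisson process whose rate is 1 divided by the mean time between revocations, the inter-revocation gaps are exponentially distributed. The horizon value is an illustrative placeholder.

# Sketch of the simulated-revocation setup: for a Poisson process with rate
# lambda = 1 / (mean hours between revocations), inter-revocation times are
# exponentially distributed. Revocation instants are drawn until the expected
# execution time (horizon) is covered.
import numpy as np

def simulated_revocations(mean_gap_h, horizon_h, seed=0):
    rng = np.random.default_rng(seed)
    lam = 1.0 / mean_gap_h                     # revocations per hour
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / lam)        # next inter-revocation gap
        if t >= horizon_h:
            return times
        times.append(round(t, 2))

# S1, S2, S3: average time between revocations of 2 h, 4 h, and 6 h
for name, gap in [("S1", 2), ("S2", 4), ("S3", 6)]:
    print(name, simulated_revocations(gap, horizon_h=8))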

Table 12.3 Number of used Spot and On-Demand VMs, execution times, and monetary costs with simulated revocations vs. the On-Demand results

Seq.  | Scenario | Spot VMs | On-Demand VMs | Exec. time | Cost  | On-Demand VM exec. time | On-Demand VM cost
chr19 | S1       | 4        | 0             | 07:18:45   | $1.39 | 04:56:49                | $2.60
chr19 | S2       | 2        | 0             | 05:58:34   | $1.18 |                         |
chr19 | S3       | 2.67     | 0             | 06:17:53   | $1.31 |                         |
chr20 | S1       | 4.33     | 0.67          | 08:35:08   | $2.34 | 05:57:15                | $3.13
chr20 | S2       | 2.67     | 0             | 07:35:34   | $1.44 |                         |
chr20 | S3       | 2        | 0             | 06:02:27   | $1.14 |                         |
chr21 | S1       | 2        | 0             | 02:31:51   | $0.46 | 02:31:39                | $1.33
chr21 | S2       | 1.67     | 0             | 02:25:24   | $0.43 |                         |
chr21 | S3       | 1        | 0             | 02:29:27   | $0.39 |                         |
chr22 | S1       | 2        | 0             | 03:19:12   | $0.61 | 02:38:30                | $1.39
chr22 | S2       | 1.33     | 0             | 02:50:13   | $0.49 |                         |
chr22 | S3       | 1.33     | 0             | 02:47:40   | $0.49 |                         |
chrY  | S1       | 1.33     | 0             | 02:40:29   | $0.44 | 02:45:52                | $1.45
chrY  | S2       | 1        | 0             | 02:38:34   | $0.41 |                         |
chrY  | S3       | 1.67     | 0             | 02:42:41   | $0.49 |                         |

Fig. 12.9 Comparison of execution time and monetary cost between each scenario of simulated revocations and On-Demand execution. (a) Execution time of revocation scenarios vs. On-Demand execution. (b) Monetary cost of revocation scenarios vs. On-Demand execution

We can observe in these results that, despite the revocations, the fault-tolerant execution reduced the monetary cost significantly when compared to the On-Demand execution. The cost reductions were around 60% on average. There is, though, an average increase of 13% in the execution time, which is modest compared to the reductions in monetary cost. In our experiments, each revocation is handled within 4 min, a handling time that is negligible when compared to the hours required to compare these long sequences. In terms of monetary cost, the most expensive execution was the comparison of chromosome 20 (chr20) in revocation scenario S1, with five revocations. In this case, the framework had to migrate the application five times, and all the chosen Spot and On-Demand instance types were used. Even for this worst-case scenario, the use of Spot instances allowed a reduction of 25% in the monetary cost. In terms of execution time, the comparison with the highest increase was chromosome 19 (chr19) in revocation scenario S1, with four revocations. In this case, the framework migrated the application to the slowest Spot instance, g2.2xlarge. This increased the execution time by almost 48%, but the monetary cost was still reduced by 46%.

12.4.2 Reducing the Execution Time

The reduction in execution time was evaluated with the fastest version of MASA-CUDAlign, MASA-CUDAlign-MultiBP, running on AWS ParallelCluster with multiple GPUs. For these experiments, we used only On-Demand instances, since none of the MASA-CUDAlign versions with multiple GPUs support checkpointing (Sect. 12.3.1). AWS ParallelCluster was implemented on top of the g4dn.xlarge instance type, since this was the instance that provided the best cost-benefit ratio in our previous experiments. We used the same small and large sequences shown in Table 12.2.

Fig. 12.10 Execution time of MASA-CUDAlign-MultiBP running on ParallelCluster for small sequences

Four cluster configurations were created, with 1, 2, 4, and 8 computational nodes. As required by ParallelCluster, all configurations have to include a master node. Each experiment was executed five times. Figures 12.10 and 12.11 show the execution time of MASA-CUDAlign-MultiBP for small and large sequences, respectively, according to the number of computational nodes allocated to the cluster. As expected, the longer sequences present greater execution times. For the smaller sequences, we noticed that the execution time reduced only slightly as we doubled the number of computational nodes. This may indicate that there is not enough parallelism to keep the GPUs of the largest cluster configurations (4 and 8 GPUs) busy all the time. For the large sequences, we can see that when the number of computational nodes used by the cluster doubles, the execution time is reduced by around 50%. Other measures of the performance of MASA-CUDAlign-MultiBP on ParallelCluster are the CUPS rates and the speedups. As more GPUs are used, more parallelism is exploited, leading to better speedups and higher CUPS rates. We achieved 1.34 TCUPS for the execution of chromosome Y (chrY) using 8 GPUs. We also analyzed the speedup of MASA-CUDAlign-MultiBP when compared to the execution with one GPU. Figures 12.12 and 12.13 show the speedups for the small and large sequences, respectively. The best speedups were 7.40 and 6.30, obtained for the execution of chromosome Y and the 23M sequence, respectively, using 8 GPUs.

Fig. 12.11 Execution time of MASA-CUDAlign-MultiBP running on ParallelCluster for large sequences

Fig. 12.12 Speedups of the execution with 8 GPUs compared to the execution with 1 GPU for the small sequences

Fig. 12.13 Speedups of the execution with 8 GPUs compared to the execution with 1 GPU for the large sequences

Figures 12.14 and 12.15 show the monetary costs of the cluster executions for the small and large sequences, respectively. As expected, the 1-instance scenario provides the lowest monetary costs. In general, we observed that, for the small sequences, the bigger the cluster, the more expensive the execution. The 23M sequence execution, however, showed different results: the 2-instance cluster was the most expensive one, and its monetary cost decreased as the size of the cluster increased from 2 to 4 and from 4 to 8 nodes. This happens because any execution with more than one node requires an extra instance for the master node. So, the 2-instance execution requires the allocation of 3 instances, which increases its cost by 50%, whereas the execution with 8 instances requires the allocation of 9 instances, increasing the cost by around 12%. In addition, when there is more computation to be performed, which is the case for the 23M sequence, the use of more nodes significantly decreases the time during which the cloud resources are used, which also reduces the cost. We observed the same trend presented by the 23M sequence for the large sequences: the most expensive execution was on the 2-instance cluster, and the costs tend to decrease as the size of the cluster increases.
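The master-node overhead mentioned above can be made explicit with the small sketch below. The hourly price is the g4dn.xlarge On-Demand value quoted earlier in the chapter; the execution times are illustrative placeholders.

# Sketch of the master-node overhead discussed above: a cluster with n compute
# nodes allocates n+1 instances, so the relative overhead is 1/n (50% for 2
# nodes, 25% for 4, 12.5% for 8). The price per hour follows the g4dn.xlarge
# value quoted in the text; the execution times are illustrative placeholders.
PRICE_PER_HOUR = 0.526   # USD, g4dn.xlarge On-Demand (from the text)

def cluster_cost(n_compute_nodes, exec_time_h):
    billed_instances = n_compute_nodes + (1 if n_compute_nodes > 1 else 0)  # + master node
    return billed_instances * PRICE_PER_HOUR * exec_time_h

for nodes, hours in [(1, 4.0), (2, 2.2), (4, 1.2), (8, 0.65)]:   # illustrative times
    overhead = 1.0 / nodes if nodes > 1 else 0.0
    print(f"{nodes} node(s): cost = {cluster_cost(nodes, hours):.2f} USD, "
          f"master overhead = {overhead:.0%}")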

12.4.3 Discussion

When we deploy MASA-CUDAlign on the cloud, we need to take into account an important trade-off: how to balance performance and cost. Our results show that the selection of the adequate instance, number of nodes, and version of MASA-CUDAlign is not trivial.

Fig. 12.14 Monetary cost of the cluster execution for small sequences

Fig. 12.15 Monetary cost of the cluster execution for large sequences


Table 12.4 Comparison of the execution time and cost on Spot instances and ParallelCluster

Seqs  | Nodes | Spot time (min) | Spot cost (US$) | Cluster time (min) | Cluster cost (US$) | GCUPS/US$ Spot | GCUPS/US$ Cluster
chr19 | 1     | 358.67          | 1.18            | 249.92             | 2.19               | 141.52         | 109.43
chr19 | 2     |                 |                 | 155.13             | 4.08               |                | 94.63
chr19 | 4     |                 |                 | 84.69              | 3.71               |                | 190.63
chr19 | 8     |                 |                 | 44.57              | 3.52               |                | 381.78 *
chr20 | 1     | 455.57          | 1.44            | 271.75             | 2.38               | 108.93         | 110.49
chr20 | 2     |                 |                 | 182.21             | 4.79               |                | 81.88
chr20 | 4     |                 |                 | 95.21              | 4.17               |                | 179.99
chr20 | 8     |                 |                 | 55.46              | 4.38               |                | 294.18 *
chr21 | 1     | 140.40          | 0.43            | 109.17             | 0.96               | 431.28         | 248.44
chr21 | 2     |                 |                 | 72.12              | 1.90               |                | 190.01
chr21 | 4     |                 |                 | 41.46              | 1.82               |                | 345.06
chr21 | 8     |                 |                 | 22.03              | 1.74               |                | 679.25 *
chr22 | 1     | 170.22          | 0.49            | 155.96             | 1.37               | 384.08         | 149.93
chr22 | 2     |                 |                 | 89.17              | 2.35               |                | 152.88
chr22 | 4     |                 |                 | 50.53              | 2.21               |                | 286.87
chr22 | 8     |                 |                 | 26.85              | 2.12               |                | 562.79 *
chrY  | 1     | 158.57          | 0.41            | 158.58             | 1.39               | 386.58         | 114.02
chrY  | 2     |                 |                 | 80.16              | 2.11               |                | 148.59
chrY  | 4     |                 |                 | 41.81              | 1.83               |                | 328.48
chrY  | 8     |                 |                 | 21.43              | 1.69               |                | 693.96 *

For example, choosing the most powerful instance to reduce the execution time is not necessarily the best decision in a scenario with budget restrictions. In Table 12.4, we compare the execution time and monetary cost of the executions on Spot instances and on ParallelCluster. For the executions on Spot instances, we used the results of the average revocation scenario, S2. We can observe that, as expected, MASA-CUDAlign runs faster on the cluster than on Spot instances (except for chromosome Y running on 1 node, since in this case there is no revocation). For the executions with 8 nodes, the ParallelCluster execution is about 6–8 times faster than the execution on Spot instances. When we compare the monetary cost, on the other hand, the execution on Spot instances is 3–4 times cheaper than the execution with 8 nodes on the cluster. In order to properly discuss these results, we evaluate the performance gain per invested dollar, or GCUPS/US$. This value is obtained by dividing the GCUPS obtained in each execution by its monetary cost. Thus, this metric shows the relation between performance and investment (the higher, the better). Table 12.4 shows the GCUPS/US$ values obtained for the Spot instance executions and for the cluster executions. In the last column, the values marked with an asterisk (*) highlight the best GCUPS/US$ rate for each comparison. We observe that the GCUPS/US$ of the Spot instance execution is mostly better than the GCUPS/US$ of the cluster execution when less


than 8 nodes are used, except for chromosomes 19 (chr19) and 20 (chr20). The comparisons of these chromosomes on Spot instances generate more revocations than the comparisons of the other chromosomes, which contributes to reducing their GCUPS/US$. Nevertheless, for all the sequence comparisons, the cluster execution with 8 nodes is the one that provides the best GCUPS/US$ values, since the use of 8 GPUs improves the performance drastically.
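As an illustration of how the metric is computed, the sketch below reproduces (approximately) the chr19 row for the 8-node cluster in Table 12.4, assuming the table times are expressed in minutes.

# Sketch of the GCUPS and GCUPS/US$ computation. GCUPS = (cells updated) /
# (time * 1e9), where the number of cells is the product of the two sequence
# lengths; GCUPS/US$ divides that by the monetary cost of the run. The numbers
# below reproduce (approximately) the chr19, 8-node cluster row of Table 12.4,
# assuming the table times are given in minutes.
def gcups(len1, len2, time_s):
    return (len1 * len2) / (time_s * 1e9)

def gcups_per_dollar(len1, len2, time_s, cost_usd):
    return gcups(len1, len2, time_s) / cost_usd

len_hs_chr19, len_pt_chr19 = 58_617_616, 61_309_027    # sizes from Table 12.2
time_s = 44.57 * 60                                    # 44.57 min on the 8-node cluster
cost = 3.52                                            # USD, from Table 12.4

print(f"{gcups(len_hs_chr19, len_pt_chr19, time_s):.0f} GCUPS")                       # ~1344
print(f"{gcups_per_dollar(len_hs_chr19, len_pt_chr19, time_s, cost):.1f} GCUPS/US$")  # ~381.8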

12.5 Conclusions

In this work, we analyzed the execution of a biological sequence comparison application on a cloud-based AWS GPU environment. We used the highly optimized MASA-CUDAlign tool, which compares two long DNA sequences exploiting the power of GPUs to accelerate the comparisons. Our goal in executing MASA-CUDAlign on the cloud was twofold: to reduce the monetary cost and to reduce the execution time. Since these two goals can be conflicting, we analyzed them separately and then compared their results using the performance gain per invested dollar metric. In the analysis of the monetary cost, we took advantage of the GPU Spot instances provided by AWS, which can cost up to 90% less than On-Demand instances. However, since these instances can be revoked at any time, the execution of MASA-CUDAlign on this type of instance required a fault-tolerance mechanism. Our results showed that small sequence comparisons do not experience revocations, and the execution on Spot instances provides the same execution time as On-Demand instances while costing 70% less. The comparison of large sequences, on the other hand, has a high probability of revocation. For these sequences, the execution on Spot instances showed an average increase of 13% in the execution time, due to the fault-tolerance mechanism, and an average reduction of 60% in the monetary cost. To reduce the execution time, we used the multi-GPU version, MASA-CUDAlign-MultiBP, and explored the ability of AWS to provide a virtual cluster environment with ParallelCluster. We created virtual clusters with 2, 4, and 8 nodes and showed that, for large sequences, the use of multiple GPUs allows great reductions in the execution time, around 50% each time the cluster doubles the number of nodes, and speedups of up to 7.4. When we compared the execution on Spot instances and on the virtual cluster using the performance gain per invested dollar metric (GCUPS/US$), we observed that the best GCUPS/US$ results were obtained with the 8-node cluster execution, due to its drastic performance increase. As future work, we intend to predict the ideal cluster size according to the sequence size. We also aim to combine our framework with ParallelCluster by using Spot instances.


References

1. National Center for Biotechnological Information (2020). https://www.ncbi.nlm.nih.gov/. Accessed March 2021
2. Amazon Web Services: Amazon EC2 Instance Types (2021). https://aws.amazon.com/ec2/instance-types/. Accessed December 2021
3. Amazon Web Services: AWS ParallelCluster - Quickly build HPC compute environments on AWS (2021). https://aws.amazon.com/pt/hpc/parallelcluster/. Accessed January 2022
4. Amazon Web Services: Boto3 Documentation (2021). https://boto3.readthedocs.io/. Accessed February 2021
5. Amazon Web Services: Cloud Services (2021). https://aws.amazon.com/. Accessed December 2021
6. Amazon Web Services: User Guide for Linux Instances - Spot Instance interruptions (2021). https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html. Accessed 03 February 2021
7. Gartner: Magic Quadrant for Cloud Infrastructure and Platform Services (2021). https://www.gartner.com/technology/media-products/reprints/AWS/1-271W1OT3-PTB.html. Accessed February 2022
8. Google Cloud Provider: Cloud Computing Services (2021). https://cloud.google.com/. Accessed December 2021
9. Microsoft Azure: Cloud Computing Services (2021). https://azure.microsoft.com/en-us/. Accessed December 2021
10. Open Infrastructure Foundation: The Most Widely Deployed Open Source Cloud Software in the World (2021). http://openstack.org. Accessed December 2021
11. OpenNebula Systems: Discover OpenNebula (2021). https://opennebula.io/discover/#why_opennebula. Accessed December 2021
12. Amazon Web Services: Amazon Elastic Block Store (EBS) (2022). https://aws.amazon.com/ebs. Accessed January 2022
13. Agarwal, H., Sharma, A.: A comprehensive survey of fault tolerance techniques in cloud computing. In: 2015 International Conference on Computing and Network Communications (CoCoNet), pp. 408–413 (2015)
14. Ahrens, J.H., Dieter, U.: Computer methods for sampling from gamma, beta, poisson and binomial distributions. Computing 12(3), 223–246 (1974)
15. P.F., et al.: On the use of LoRaWAN and cloud platforms for diversification of mobility-as-a-service infrastructure in smart city scenarios. IEEE Transactions on Instrumentation and Measurement 71, 5501109:1–5501109:9 (2022)
16. Ataallah, S.M., Nassar, S.M., Hemayed, E.E.: Fault tolerance in cloud computing-survey. In: 2015 11th International Computer Engineering Conference (ICENCO), pp. 241–245. IEEE (2015)
17. Awan, M., Deslippe, J., Buluc, A., et al.: Adept: a domain independent sequence alignment strategy for gpu architectures. BMC Bioinformatics 21, 406:1–406:12 (2020)
18. Brum, R.C., Sousa, W.P., Melo, A.C.M.A., Bentes, C., de Castro, M.C.S., Drummond, L.M.d.A.: A fault tolerant and deadline constrained sequence alignment application on cloud-based spot gpu instances. In: Sousa, L., Roma, N., Tomás, P. (eds.) Euro-Par 2021: Parallel Processing, pp. 317–333. Springer International Publishing, Cham (2021)
19. Dayhoff, M.O.: Atlas of protein sequence and structure. National Biomedical Research Foundation (1972)
20. Dhingra, M., Gupta, N.: Comparative analysis of fault tolerance models and their challenges in cloud computing. International Journal of Engineering & Technology 6, 36 (2017)
21. Dill, K.A., MacCallum, J.L.: The protein-folding problem, 50 years on. Science 338(6110), 1042–1046 (2012)


22. Figueiredo, M., Navarro, J.P., Sandes, E.F., Teodoro, G., Melo, A.C.: Parallel fine-grained comparison of long dna sequences in homogeneous and heterogeneous gpu platforms with pruning. IEEE Transactions on Parallel and Distributed Systems 32(12), 3053–3065 (2021)
23. Garg, R., Mohan, A., Sullivan, M., Cooperman, G.: Crum: Checkpoint-restart support for cuda's unified memory. In: 2018 IEEE International Conference on Cluster Computing (CLUSTER), pp. 302–313 (2018)
24. Gotoh, O.: An improved algorithm for matching biological sequences. J Mol Biol 162(3), 705–708 (1982)
25. Gupta, A., Milojicic, D.: Evaluation of hpc applications on cloud. In: 2011 Sixth Open Cirrus Summit, pp. 22–26. IEEE (2011)
26. Huang, C., Chen, W., Yuan, L., Yan Ding, S.J., Tan, Y., Chen, H., Chen, D.: Toward security as a service: A trusted cloud service architecture with policy customization. Journal of Parallel and Distributed Computing 149, 76–88 (2021)
27. Iosup, A., Ostermann, S., Yigitbasi, M.N., Prodan, R., Fahringer, T., Epema, D.: Performance analysis of cloud computing services for many-tasks scientific computing. IEEE Transactions on Parallel and Distributed Systems 22(6), 931–945 (2011)
28. Jain, T., Cooperman, G.: CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM. In: Proc. of the Int. Conf. for High Performance Computing, Networking, Storage and Analysis, SC '20. IEEE Press (2020)
29. Jiang, H., Zhang, Y., Jennes, J., Li, K.C.: A checkpoint/restart scheme for cuda programs with complex computation states. International Journal of Networked and Distributed Computing 1, 196–212 (2013)
30. Jones, W.: Genomics and bioinformatics in biological discovery and pharmaceutical development. In: Quantitative Methods in Pharmaceutical Research and Development, pp. 105–142. Springer (2020)
31. Kopar, M., Sikic, M.: SW#-gpu-enabled exact alignments on genome scale. Bioinformatics 29(19), 2494–2495 (2013)
32. Kumari, P., Kaur, P.: A survey of fault tolerance in cloud computing. Journal of King Saud University - Computer and Information Sciences 33(10), 1159–1176 (2021)
33. Mehrotra, P., Djomehri, J., Heistand, S., Hood, R., Jin, H., Lazanoff, A., Saini, S., Biswas, R.: Performance evaluation of amazon ec2 for nasa hpc applications. In: Proceedings of the 3rd Workshop on Scientific Cloud Computing, pp. 41–50 (2012)
34. Mohammadi, M., Bazhirov, T.: Comparative benchmarking of cloud computing vendors with high performance linpack. In: Proceedings of the 2nd International Conference on High Performance Compilation, Computing and Communications, pp. 1–5 (2018)
35. Myers, E.W., Miller, W.: Optimal alignments in linear space. Comp App in Biosci 4(1), 11–17 (1988)
36. Nukada, A., Takizawa, H., Matsuoka, S.: Nvcr: A transparent checkpoint-restart library for nvidia cuda. In: 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, pp. 104–113 (2011)
37. Rucci, E., Garcia Sanchez, C., Botella Juan, G., et al.: Swimm 2.0: Enhanced smith–waterman on intel's multicore and manycore architectures based on avx-512 vector extensions. International Journal of Parallel Programming 47(3), 296–316 (2019)
38. Sandes, E.F.O., Miranda, G., Martorell, X., Ayguade, E., Teodoro, G., Melo, A.C.M.A.: MASA: A Multiplatform Architecture for Sequence Aligners with block pruning. ACM Trans Parallel Computing 2(4) (2016)
39. Sandes, E.F.O., Teodoro, G.L.M., Walter, M.E.M.T., Martorell, X., Ayguade, E., Melo, A.C.M.A.: Formalization of block pruning: Reducing the number of cells computed in exact biological sequence comparison algorithms. The Computer Journal 61, 687–713 (2018)
40. Shahid, M.A., Islam, N., Alam, M.M., Mazliham, M., Musa, S.: Towards resilient method: An exhaustive survey of fault tolerance methods in the cloud computing environment. Computer Science Review 40, 100398 (2021)
41. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J Mol Biol 147(1), 195–197 (1981)


42. Takizawa, H., Sato, K., Komatsu, K., Kobayashi, H.: Checuda: A checkpoint/restart tool for cuda applications. In: 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies, pp. 408–413 (2009)
43. Wulf, F., Lindner, T., Strahringer, S., Westner, M.: Iaas, paas, or saas? The why of cloud computing delivery model selection. In: Proceedings of the 54th Hawaii International Conference on System Sciences, pp. 6285–6294 (2021)
44. Xia, Z., Cui, Y., Zhang, A., et al.: A review of parallel implementations for the smith–waterman algorithm. Interdisciplinary Sciences: Computational Life Sciences 14(2), 1–14 (2022)
45. Zhuang, J., Jacob, D.J., Lin, H., Lundgren, E.W., Yantosca, R.M., Gaya, J.F., Sulprizio, M.P., Eastham, S.D.: Enabling high-performance cloud computing for earth science modeling on over a thousand cores: Application to the geos-chem atmospheric chemistry model. Journal of Advances in Modeling Earth Systems 12(5), e2020MS002064 (2020)

Chapter 13

Reservoir Simulation in the Cloud
Felipe Albuquerque Portella and Fabio Moreira de Souza

13.1 Introduction

Reservoir simulation is one of the most important High-Performance Computing (HPC) problems in the oil and gas (O&G) industry. Therefore, it is natural to think of migrating this type of workload to the growing HPC cloud. This chapter will give the reader a brief overview of the Reservoir Engineering area and its reservoir simulator applications, and of how Petrobras developed a project to move part of its daily reservoir simulation HPC demand to third-party cloud providers. This project also addressed interesting challenges regarding Network Attached Storage (NAS) synchronization issues and how to balance on-premises and cloud provisioning. Petróleo Brasileiro S.A., better known as Petrobras, is a Brazilian energy company established in 1953. Petrobras has set many world records for oil and gas exploration in deep and ultradeep waters (more than 7 km below sea level), mainly for the pre-salt layer off the Brazilian coast. The company is ranked 128th in terms of revenue on the 2022 Fortune Global 500 list (https://fortune.com/global500) and has appeared multiple times in the TOP500 list (https://www.top500.org/) of the largest HPC systems in the world, being in 33rd position on the 2022 TOP500 list with its Pégaso supercomputer. This chapter is organized as follows: Sect. 13.2 provides a brief overview of reservoir simulation tools, how they help the engineer develop the oil field, some details of their scalability, and some of the typical workflows where they are



applied, which bring the problem to a supercomputer level. Section 13.3 describes some of the advantages and challenges of the cloud, specifically for the O&G industry, a capital-intensive market with huge investments in supercomputing. Section 13.4 presents the architectures adopted in the Petrobras project to run reservoir simulators in the cloud. Finally, Sect. 13.5 provides details of some new cloud-related technologies and trends that require Petrobras IT to reconsider its use of the deployed architectures.

13.2 Reservoir Simulation Overview

The development of oil reservoirs requires significant investment, and reservoir simulation is an essential computational tool that helps reduce risks in the Oil and Gas (O&G) industry. Engineers use it to reproduce the production history and forecast the future production of oil, gas, and water from the reservoir field over time. The computer simulation system is called the "reservoir simulator", and the input dataset, which represents a physical reservoir field, is called the "reservoir model" [1]. Simulators support important decision-making processes across many life-cycle phases, from defining the number of wells to drill and their spatial placement during the Field Development phase to optimizing injection rates and maximizing production during the Field Management phase. Engineers can simulate different scenarios countless times for a fraction of the cost and time of real-world operations. Reservoir models are typically three-dimensional grids that discretize the actual reservoir into thousands or millions of cells [1]. Figure 13.1 shows the UNISIM-I reservoir model [2] with its wells. Each cell represents different rock properties of the reservoir field by single values. In the image, we can observe the porosity of each cell. Simply put, porosity is the void space in the rock of the reservoir that can trap oil and gas. The reservoir model is, in fact, the result of geostatistical techniques in which geoscientists try to describe the actual reservoir with the best possible fidelity, also considering the uncertainty of the available data. The geological model that engineers use as the basis for the reservoir model is generally on such a fine scale that it becomes computationally unfeasible to simulate. A standard step taken by engineers to allow simulation in reasonable time is a procedure known as "upscaling", where the size of the grid is reduced by certain sampling techniques. The equations used to describe the reservoir models derive from fundamental physical principles, such as mass conservation, thermodynamic equilibrium, heat transfer, and flow in porous media governed by Darcy's law [1]. These equations are expressed in finite and differential forms, discretizing the problem in time and space. The resulting numerical model is solved with partial differential equations (PDEs), which can accurately model the fluid flow in porous media. However, these equations are complex and require advanced mathematical techniques backed by extremely powerful computational resources. Because of the uncertainty associated with the

13.2 Reservoir Simulation Overview The development of oil reservoirs requires significant investment, and reservoir simulation is an essential computational tool that helps reduce risks in the Oil and Gas (O&G) industry. Engineers use it to reproduce the production history and forecast the future production of oil, gas, and water from the reservoir field over time. The computer simulation system is called the “reservoir simulator”, and the input dataset, which represents a physical reservoir field, is called the “reservoir model” [1]. Simulators support important decision-making processes among many life cycle phases, from defining the number of wells to drill and their spacial placement during the Field Development Phase to optimizing injection rates and maximizing production during the Field Management phase. Engineers can simulate different scenarios countless times for a fraction of the cost and time of real-world operations. Reservoir models are typically three-dimensional grids that discretize the actual reservoir among thousands or millions of cells [1]. Figure 13.1 shows the reservoir model of the UNISIM-I [2] model with its wells. Each cell represents different rock properties of the reservoir field by single values. In the image, we can observe the porosity of each cell. Simply put, porosity is the void space in the rock of the reservoir that can trap oil and gas. The reservoir model is, in fact, the result of geostatistical techniques in which geoscientists try to describe with the best fidelity the actual reservoir, also considering the uncertainty of the available data. The geological model that engineers use as the basis for the reservoir model is generally on such a fine scale that it becomes computationally unfeasible to simulate. A standard step taken by engineers to allow simulation in reasonable time is a procedure known as “upscale”, where the size of the grid is reduced by certain sampling techniques. The equations used to describe the reservoir models derive from fundamental physical principles, such as mass conservation, thermodynamic equilibrium, heat transfer, and flow in porous media governed by Darcy’s law [1]. These equations are expressed in finite and differential forms, discretizing the problem in time and space. This numerical model is solved with partial differential equations (PDEs), which can accurately model the fluid flow in porous media. However, these equations are complex and require advanced mathematical techniques that require extremely powerful computational resources. Because of the uncertainty associated with the


Fig. 13.1 Porosity 3D map of the UNISIM-I model. Injector wells (blue) and producer wells (red) are also shown

data, it is common for engineers to work with multiple models simultaneously (an ensemble of models). In addition, many workflows, such as optimization or history matching, require numerous simulations of the same models. As a result, engineers rely on computer clusters and supercomputers to conduct their studies with hundreds of simulations. Note that, in many cases, the reservoir engineer can run the simulation tool on his or her own laptop or workstation; this is a common routine when adding new features to the model, to quickly evaluate whether the reservoir model was built correctly. What turns this same simulation into an HPC problem is the aforementioned workflows or the resolution of the model. An example of an optimization workflow is determining a better placement of the wells in the grid. To find the optimal location, systems such as OCTOPUS [3] use an evolutionary strategy that generates hundreds of simulations by changing the well placement and comparing the results found. A single simulation can run on a laptop, but when users have such a big bag-of-tasks, they need an HPC infrastructure. The size of the model can be another reason why the engineer needs an HPC environment, for instance, to better capture certain details. The reservoir engineer often works with upscaled models (as previously explained, with a lower number of cells), but if, for some reason, a more fine-scale simulation is needed, memory consumption can become an issue, requiring the "fat nodes" found in some HPC systems, or the simulation can become so slow that multiple nodes are needed to solve a single simulation. The combination of both is also common. The size and complexity of the numerical models push the limits of High-Performance Computing (HPC) technology. Together with seismic processing, reservoir simulation is responsible for the most extensive HPC demands in the O&G industry.



Table 13.1 O&G direct presence among the first 100 positions in the TOP500 list of Jun/2022

Rank | System     | Site                         | Rmax (PFlops/s)
12   | HPC5       | Eni                          | 35.45
18   | Dammam-7   | Saudi Aramco                 | 22.40
28   | Ghawar-1   | Saudi Aramco                 | 19.26
33   | PANGEA III | Total Exploration Production | 17.86
44   | HPC4       | Eni                          | 12.21
60   | Dragão     | Petróleo Brasileiro S.A.     | 8.98

HPC use in the O&G industry is so substantial that a quick filter of the TOP500 list (available at https://www.top500.org/lists/top500) reveals that a considerable number of supercomputers have appeared at well-known oil companies over the years, as can be seen in Table 13.1. The HPC5 system, which ranks 12th overall, is the first when filtering the list by the industry segment. The TOP500 list is a ranking of the 500 largest supercomputers in the world, published twice a year, with benchmarks submitted voluntarily by the institutions that own the systems. Not all company names are recorded or identifiable, as there are entries on the list that appear just as "Energy Company". It is also worth mentioning that many companies prefer not to benchmark or submit their results due to internal corporate strategies or policies, or due to the rush to put newer supercomputers into production use. For instance, British Petroleum (BP) announced an investment in a 2.2 PFlops HPC facility in 2013 [4], which would have put it at around 15th place at that time. In 2017, they announced an upgrade [5] to their CHPC system, bringing it to around 8 PFlops, and in 2020 another press release [6] on the donation of computing capacity to COVID-19 research states that they have a 16.3 PFlops system. BP never submitted any of its systems to the TOP500 lists, but by the latest public data for 2020, their supercomputers could be ranking among the top fifty, as shown in Table 13.1. There are also political considerations that might lead to some supercomputers being left off the list [7]. A more in-depth analysis of the list also reveals that there are many research institutions in the TOP500 with substantial relationships with O&G companies. One example is King Abdullah University of Science and Technology (KAUST), which appears in 97th position and has appeared in different news reports [8, 9] related to advances in simulation tools for Saudi Aramco. However, it is important to note that these machines from O&G companies in the TOP500 are not necessarily used to run reservoir simulations. In fact, the largest HPC workload in the O&G industry is seismic processing. Furthermore, the TOP500 list is undergoing rapid changes, as Cloud HPC has gained significant attention in recent years. Due to the massive investments necessary for HPC, many corporations are moving their workloads to the cloud. From the business perspective, Cloud HPC allows the conversion of those investments from Capex (capital expenditure) into Opex (operational expenditure). From a technical standpoint, there are many well-known advantages, such as "easy" and "quick" elasticity. However, Cloud HPC also imposes some new challenges,


and this elasticity does not come so "cheap". In Sect. 13.3 we detail the pros and cons, but the benefits seem to prevail, and many companies are migrating to the cloud at different paces [10]. The O&G industry follows the same trend, and it is expected that fewer O&G systems will be seen on the TOP500 list in the future.

13.2.1 Reservoir Simulation Software

There is no single way to classify reservoir simulators. Some prefer to classify them by application (e.g., thermal), others by model formulation (e.g., implicit vs. IMPES4 vs. AIM5), and yet others by an attribute of the reservoir rock formation (e.g., dual-porosity). The first of these classifications is the most widely accepted, as reservoir engineers typically choose an appropriate reservoir simulator depending on the simulation objectives, the type of reservoir, and the production mechanisms involved. Following that approach, we can use a typical phase diagram, as shown in Fig. 13.2, to classify the reservoir itself.

[Fig. 13.2 Simplified phase diagram of hydrocarbons with the most appropriate simulator type for each problem (axes: pressure vs. temperature; regions: single-phase gas, single-phase liquid, two-phase liquid + gas, gas condensate, volatile oil; simulator types: thermal, black-oil, compositional). This plot is also known as a pressure-temperature diagram or phase envelope diagram.]

4 Implicit Pressure, Explicit Saturations.
5 Adaptive Implicit Method: uses different levels of implicitness in different blocks.


This figure, known as a phase envelope, shows the equilibrium condition between the different phases of the hydrocarbon compounds being simulated at each pressure and temperature. For heavy oils, which typically sit on the extreme left of the diagram, a thermal simulator is used; it has advanced modeling capabilities for sophisticated extraction techniques such as in-situ combustion, steam, solvents, chemicals, and other complex techniques that make the oil more fluid and help with mobility.

Compositional simulators are used to model reservoirs in the two-phase region, where there may be volatile oil and gas condensate. These simulators can even model conditions close to the end point of a phase equilibrium curve, the critical point, where different states, such as a liquid and its vapor, can co-exist. These simulators use an Equation of State (EOS) to capture changes in each component of the fluid; therefore, they are also known as EOS simulators. These changes are represented by the percentage of pure compounds, such as CO2, N2, H2S, C1, C2, C3, etc., present in each cell of the fluid over time. Components are grouped for simplification purposes and to improve the performance of EOS simulators.

Black-oil (BO) simulators, on the other hand, are used for most regions of the diagram, represented by the whole area where there is only one well-defined phase (liquid or gas). BO simulators can be used when the fluid has constant PVT (Pressure, Volume, and Temperature) behavior. The fluid can then be modeled with simpler equations, representing it with only three phases: oil, gas, and water. This leads to simulators that are much faster than compositional simulators, even when the latter group components in the EOS (e.g., C1-C4, C4-C7, C7-C15, C15+, plus the water).

This classification is not exhaustive, as other overlapping types exist, such as pseudo-compositional simulators, which break the composition into chains but use a physical equilibrium constant (K) throughout the lifetime of the simulation, and are, accordingly, also known as K-value simulators. There are also various combinations of these classifications. For example, compositional simulators can support dual porosity and dual permeability to deal with fractured reservoirs, such as the carbonates in the Brazilian pre-salt. More modern simulators, called next-generation simulators, aggregate all these different models under a single engine, allowing greater flexibility in changing from one fluid model to another and seamless integration when modeling different reservoirs that will use the same production facilities (platforms, FPSOs,6 etc.).

The current commercial market for reservoir simulators has been dominated by two companies since the 1970s with their respective commercial software: Schlumberger7 with its ECLIPSE simulator line,8 and Computer Modelling Group with its CMG Suite.9

6 Floating Production Storage and Offloading.
7 In fact, the ECLIPSE simulators were originally developed by a company named Exploration Consultants Limited (ECL), which was purchased by GeoQuest Systems, both being incorporated into the SIS division of Schlumberger in 1992.
8 https://www.software.slb.com/products/eclipse.


Both companies have different products for each type of application, as described above, and both have initiatives for "next-generation" reservoir simulators, which integrate all kinds of simulations into one product with further enhancements, such as better parallel support. The ECLIPSE family consists of E100, which supports black-oil models, and E300, which supports compositional and special methods. The CMG Suite consists of many programs, including pre- and post-simulation tools, but there are three simulators in the current generation: IMEX for black-oil models, GEM for compositional problems, and STARS for simulating thermal, chemical EOR, and other advanced processes. These two main product lines, CMG and ECLIPSE, have become standard in the oil and gas industry, so newcomer companies usually develop their products to support the input data format of ECLIPSE, CMG, or both.

Open-source initiatives for reservoir simulators are still modest. In addition to following the input standards of commercial products, they usually lack many features that would make them suitable for use in an oil company, and they are explored more in the academic community to develop new methods. To cite two relevant simulators in this category, we have the MATLAB Reservoir Simulation Toolbox (MRST)10 and OPM Flow.11 MRST is a set of MATLAB modules that offers the basic functionality required of a reservoir simulator, such as data structures and visualization, different solvers, and full workflow tools for upscaling, history matching with EnKF methods, and much more [11]. OPM Flow is a three-phase, black-oil simulator from the Open Porous Media (OPM) initiative, and it has an experimental version with FPGA12 support. Both simulators are only black-oil simulators and use the ECLIPSE dataset as the input data format.

The simulator list mentioned above is not exhaustive, and many other simulators are available in commercial or open-source form. Schlumberger and CMG, for instance, each have their own "next-generation" simulator, called INTERSECT and CoFlow, respectively. The former combines high-resolution modeling of complex geological structures with outstanding performance (including full GPU simulation of black-oil models13). The latter, developed in collaboration with Shell and Petrobras, provides a collaborative modeling environment that allows reservoir and production engineers to make informed decisions on large integrated oil and gas projects.14 Some energy companies also have internally developed simulators, such as Saudi Aramco with its POWERS simulator, which is known for breaking records for large reservoir models with multiple nodes.15

9 https://www.cmgl.ca/.
10 https://www.sintef.no/projectweb/mrst/.
11 https://opm-project.org/?page_id=19.
12 Field-Programmable Gate Array (FPGA) is a reconfigurable hardware accelerator.

13 https://www.software.slb.com/products/intersect/features.
14 https://www.cmgl.ca/coflow.
15 https://www.worldoil.com/news/2016/11/28/saudi-aramco-in-trillion-cell-reservoir-simulation-run.


Some other "new" commercial simulators, such as 6X,16 tNavigator17 from Rock Flow Dynamics (RFD), and ECHELON from Stone Ridge Technology,18 bring many performance enhancements, such as support for different hardware systems (CPUs, GPUs, and even TPU19 accelerators), as discussed by Esler et al. [12] and Bogachev et al. [13].

13.2.2 Reservoir Simulation Challenges

The perceived computational performance of any reservoir simulator is the elapsed (wall) time it takes to produce the prediction of a reservoir model for a specified period (e.g., simulating a 10-year history and forecasting 20 years into the future). The elapsed time of a reservoir simulation on a given hardware system depends mainly on the input data, on the numerical convergence controls, and on the efficiency of its internal solvers. Regarding input data, the factors that affect simulation time range from the number and shape of cells (grid data) to the type of fluid, which determines the use of a specific simulator with a different equation system, as seen in Fig. 13.2. Therefore, each property of the model should be the focus of specific studies in order to identify problems that might generate numerical inconsistencies, which often increase the elapsed time. Reservoir simulators also have a numerical control section that allows engineers to tune the internal numerical parameters to improve the convergence behavior of the iterative matrix solution routines. For most applications, the default solution methods and parameters are sufficiently robust and efficient, but in more complex cases, fine-tuning can not only significantly speed up the simulation runtime, but also lead to better numerical results. This optimization can be performed manually or with the help of specific software for that purpose.

The utility of the reservoir model lies in its ability to predict future behavior with greater confidence. To better calibrate the model, engineers use a method called History Matching (HM). The explicit purpose of HM is to assign values to the parameters of the model being optimized in such a way that the mathematical model of the reservoir reproduces the behaviors observed during the production period, leading to better forecasting (or at least to reduced uncertainty). HM is a computationally expensive workflow, as it may require hundreds of simulations of the same reservoir model. Some techniques try to reduce this cost by optimizing the numerical parameters using a machine learning performance model within the HM workflow [14].

16 https://ridgewaykite.com/.
17 https://rfdyn.com/.
18 https://www.stoneridgetechnology.com/echelon-reservoir-simulation-software/.
19 The Tensor Processing Unit was developed by Google for targeting Artificial Intelligence applications, but can be explored for other applications, including as the linear solver for reservoir simulators.


However, even without a separate optimization study, the best tool the engineer has on hand to speed up an HM study is to launch the simulations as individual jobs, so that hundreds of compute nodes can process slightly different problems in parallel.

The cloud environment tested and described in the following sections has key architecture decisions based on the characteristics of the CMG Suite, which was the target simulator for us to evaluate in a cloud environment, although it is not the only one used at Petrobras. Up to the 2021 version, the classic CMG simulators used only a shared-memory system with OpenMP, which means that we were unable to scale a single reservoir model simulation across multiple compute nodes. However, many nodes are commonly used to simultaneously solve multiple alternatives of the same reservoir field (e.g., for a history matching workflow or an optimization problem).
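To make the bag-of-tasks nature of such studies concrete, the sketch below dispatches an ensemble of history-matching realizations as independent jobs. It is only an illustrative sketch, not Petrobras's actual tooling: it assumes a Slurm-managed cluster, and the run_simulator command, the hm_ensemble directory, and the resource figures are hypothetical placeholders.

```python
# Illustrative sketch: submit each history-matching realization as an independent
# Slurm job, so that many nodes can process slightly different models in parallel.
import subprocess
from pathlib import Path

realizations = sorted(Path("hm_ensemble").glob("realization_*.dat"))  # hypothetical inputs

for dat in realizations:
    cmd = f"run_simulator {dat}"            # placeholder simulator invocation
    subprocess.run(
        ["sbatch",
         f"--job-name=hm_{dat.stem}",
         "--ntasks=1",
         "--cpus-per-task=16",              # one shared-memory (OpenMP) run per job
         f"--wrap={cmd}"],
        check=True,
    )
```

Each job runs a single realization with shared-memory parallelism only, mirroring the single-node execution model of the classic CMG simulators mentioned above.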

13.3 Cloud Advantages and Challenges for the O&G Industry

The O&G industry seeks the same advantages that many other businesses see in the cloud and faces many of the same challenges, as briefly described in the preface to this book. However, some advantages and challenges have a more significant impact due to our business model or application needs. This chapter briefly discusses these specific pros and cons without going into too much detail or repeating the more general considerations previously presented. This does not mean, however, that the advantages and challenges mentioned in the preface do not also apply to O&G HPC workloads.

The dilemma between Capex and Opex with regard to IT costs has many aspects. In the O&G industry, the Opex cloud model allows us to share costs with partners more clearly, since there is a third party providing the service and the cost metrics. Another advantage of Opex is that it has no or low upfront costs, allowing the expenses to be spread over a period of time. A disadvantage of Capex is the cost of infrastructure purchases, such as hardware and equipment, which in HPC generally have a lifespan of 5 to 6 years, depending on the depreciation schedule. This concern about the amortization of these expenses does not apply to cloud resources.

Cloud computing has some payment aspects that strike a trade-off between agility, planning, and cost. If the company has a good governance process and well-executed capacity planning, it can pay less in the cloud through some upfront planning. On the other hand, when using cloud resources on an on-demand/pay-per-use model, costs can be significantly higher. Another concern with the use of cloud resources is the risk of cloud provider lock-in and of a solution that is too tied to one provider, making it difficult to migrate to another cloud provider offering a better cost, for instance, or even to an on-premises system. One good strategy to avoid this issue is to implement a multi-cloud architecture when designing a solution using the cloud, prioritizing elements that can be exchanged between providers, following the mantra: the simpler, the better.


One of the most significant benefits of cloud computing is the agility of provisioning the computing environment. This agility allows the evaluation of new software development paths and the analysis of new hardware architectures, which can help not only scientific software development but also the definition of the architecture for new on-premises HPC cluster purchases.

Near-endless compute resources, provisioning agility, and low costs are the biggest draws of cloud computing. These aspects may be true for most cloud users, but in HPC, and more specifically in O&G workloads, the reality can be entirely different. HPC workloads require a lot of specific and expensive hardware, such as fast networks and parallel file systems, and at such a volume that not all cloud providers can handle the demand. A single history matching job may require one to two hundred nodes at once, which could represent all the resources of an availability zone at a cloud provider, and ensuring access to all these resources at a good price per hour would make it necessary to sign long-term contracts with the cloud providers. This brings back the same demand and capacity planning problem faced with on-premises resources and limits the use of elasticity.

13.4 Cloud Deployment Case Study of Reservoir Simulation

The use of HPC at Petrobras began in the 1980s with mainframes and continued with RISC machines at the beginning of the 1990s. In 1997, the use of commodity clusters began, followed 10 years later by the use of GPUs. The first appearance of Petrobras on the TOP500 list was in 2003, and it has appeared 16 times since then, with its newest machine at the time of writing, called Dragão, occupying 55th position on the list released in November 2021.

The HPC reservoir simulation environment at Petrobras comprises Linux clusters that run the jobs submitted from Windows workstations. The data is shared between the clusters and the workstations through a Network-Attached Storage (NAS) device. Windows workstations usually access the data on the NAS using the Common Internet File System/Server Message Block (CIFS/SMB) protocol, and the Linux clusters use the Network File System (NFS) protocol.

The first reason for using cloud computing for reservoir simulation was to enable cloud bursting, where cloud computing resources are called on whenever the on-premises infrastructure reaches its capacity limit, by linking our Linux clusters to any of the cloud providers in a multi-cloud solution, together with a storage sync service to provide data movement transparency to users. As shown in Fig. 13.3, all these services are supported by a fast connection to the cloud. However, a proof-of-concept (PoC) conducted with some cloud providers showed that this strategy would not be so easy to implement, since data locality is a major issue in our workload.


Fig. 13.3 HPC cloud environment for reservoir simulation with cloud bursting

It is difficult to manage the amount of data moving between the on-premises environment and the cloud, because job submission requires prior data movement, and the reservoir simulation launcher needs to track the output data during the simulation in order to manage it and check that the job is still running; it is therefore necessary to sync the files between cloud and on-premises storage. Although the amount of data is not large (a few hundred megabytes), it can be spread over many directories, which makes tracking and data movement quite complicated.

To avoid this complex strategy, we decided to build a full HPC reservoir simulation environment in the cloud, with Windows instances to submit jobs and visualize simulation results, a Linux cluster to run the simulations, and a shared file system. This environment only requires transferring the input dataset to the cloud, where it is used in complete isolation from the on-premises facilities. Although this approach is an easy way to use cloud computing to run reservoir simulations, the data management issue persists, since the output data has to be moved back on-premises to store the results. This concern about moving data back was resolved by having the cloud access NAS storage in the on-premises data center. With this architecture, we use only a cluster in the cloud, reading and writing data to an on-premises NAS. Although this resolves the data locality issue, a fast and dedicated link to the cloud is mandatory to provide good performance when accessing the on-premises NAS. Another concern with this strategy is which availability zone can be used in the cloud: even with a dedicated connection, using a zone abroad is a huge problem because of the latency when accessing on-premises facilities.

As mentioned before, our ambition is to implement an architecture with cloud bursting fully integrated with our on-premises HPC environment, providing transparent use of cloud resources to the users.


Fig. 13.4 Architecture used during tests with Azure HPC cache

In this context, we conducted tests with a solution from Azure, called HPC Cache, and one from AWS, called Storage Gateway. We describe these tests and the results below.

The first solution we tested in order to implement cloud bursting was Azure HPC Cache, since we had previous contact with a solution called Avere. Microsoft acquired Avere Systems at the beginning of 2018 and, after some development and enhancements, renamed the product Azure HPC Cache. Figure 13.4 shows the architecture implemented during the tests with Azure HPC Cache. This environment provides low-latency hybrid storage that caches metadata in the cloud from the on-premises NAS. In this way, the computational workload running on our on-premises HPC cluster is made available to the cloud under the same namespace and is transferred transparently to the user by HPC Cache. Similarly, if the workload is created in the cloud, the metadata is cached by HPC Cache to the local NAS, allowing users to easily use the simulation results from their workstations. Caching only the metadata allows storage or file creation processes to complete quickly, with the data moved on demand, as required by usage. More details and information about HPC Cache are available on the Microsoft Azure web page [15].

The other cloud NAS solution evaluated was AWS Storage Gateway. As shown in Fig. 13.5, the implemented architecture used a component called File Gateway, an on-premises virtual machine acting as a NAS. File Gateway works as a local cache and syncs with AWS Storage Gateway in order to maintain the same metadata in the cloud and on-premises environments and to move data on demand. Two S3 buckets managed by Storage Gateway were created in the cloud: one to export the metadata from FSx for Lustre to the File Gateway, and another to import data from on-premises to FSx for Lustre. This scheme was more complex than expected, as these buckets in the middle of the sync process introduced some latency, and not all steps were automated. At the end of the testing, we reported this issue, and AWS informed us that this mechanism would be improved and become more automated in the future.


Fig. 13.5 Architecture used during tests with the AWS storage gateway

Regardless of these latency-related details, the solution was able to provide a good experience when bursting our workload from the on-premises HPC cluster to the cloud. More information and details about AWS Storage Gateway can be found on the AWS web page [16].

There are many challenges to address when using storage in HPC, which usually requires high-performance Input/Output (I/O), and managing data between the on-premises and cloud environments requires extra attention and effort. These solutions provide a way to simplify this task, allowing users to send their workload to the cloud as needed. In our tests, we noticed that HPC Cache is more mature, as it has benefited from the historical development by Avere. Both tools have been improving over time, with many enhancements since we tested them.

A completely different challenge was the proper selection of the type of cloud instance to provide the best performance at the best cost. As mentioned in Sect. 13.2, the CMG suite chosen for this PoC simulates the reservoir using OpenMP to parallelize the work across all the cores of the virtual-machine instance, but its scalability is not linear as we add more virtual CPUs (vCPUs). In other words, the CMG simulator will not deliver double the performance if we choose, for example, an AWS c6i.16xlarge (with 64 vCPUs) instead of an AWS c6i.8xlarge (with 32 vCPUs). In addition, hyper-threading (HT) does not benefit this class of problems and is often even detrimental to performance, leading us to deactivate HT on our instances.

Also with regard to instance selection, the processor generation and micro-architecture enhancements can play a more important role than other factors when comparing instances. For example, CMG simulators benefit significantly from Intel AVX-512 SIMD instructions, in such a way that the performance gain reduces the simulation time, compensating for the usually higher price of these instances.


Last, but possibly the most important consideration in instance selection, is memory availability. While the former considerations affect performance, and it is interesting from the user or company perspective to optimize cost and time-to-solution, without enough memory the simulator will simply fail and produce no results at all. Therefore, the primary selection of the instance type takes into consideration the memory required by the target model being simulated. For the PoC, we took all those variables into consideration to manually select a few instance types matching the models each engineer works with, but an enhancement is underway to dynamically select the instance type according to cloud-provider instance availability and the model's requirements and scalability (a minimal selection heuristic is sketched at the end of this section).

The user experience in the cloud is another challenge to be faced, as users expect it to offer a virtually infinite pool of resources whenever needed. Users run simulations in the cloud only when their demands cannot be met by our on-premises clusters, in which case they submit their jobs to a specific head node in the cloud that creates the instances on demand. The time from job submission to instance creation and the effective start of the simulation is longer than in our on-premises environment, and users often complain about it. One solution that has been identified is to create an image already customized by the IT team to reduce instance creation time. Another issue concerns the offer of high-memory instances, which are usually required in reservoir simulation. This type of resource is not used by most cloud users and, therefore, is not commonly found in all Availability Zones (AZs). In this case, we have to make a reservation in advance to guarantee the amount and type of resources needed for a specific demand, something undesirable when using the cloud.

The journey to the cloud requires analyzing all the workflows and infrastructure before migrating, or even before touching the cloud management console. Even when multiple studies, tests, and prior use have occurred, it is always important to investigate better ways to use the cloud, whatever the task. Nowadays, we apply the following workflow when a demand arises: we first design a preliminary architecture for the solution and, together with the cloud provider (in our case AWS or Azure), we decide which element of each cloud service we are going to use. After that, a specialized team creates the environment for the users. In terms of infrastructure, various dedicated links to the cloud providers have also been implemented to improve data movement, the user experience, and the connection's performance.
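As an illustration of the memory-first selection logic described above, the sketch below picks the cheapest instance type that satisfies a model's memory requirement, preferring AVX-512-capable types. The catalog, the prices, and the per-type AVX-512 flags are hypothetical placeholders; a real implementation would query the provider's APIs for current offerings, prices, and availability.

```python
# Illustrative memory-first instance selection heuristic (hypothetical catalog values).
CATALOG = [
    # (name, vCPUs, memory_GiB, has_avx512, price_per_hour_usd)
    ("c6i.8xlarge",  32,  64, True, 1.36),
    ("c6i.16xlarge", 64, 128, True, 2.72),
    ("r6i.8xlarge",  32, 256, True, 2.02),
    ("r6i.16xlarge", 64, 512, True, 4.03),
]

def pick_instance(required_mem_gib, prefer_avx512=True):
    """Return the cheapest catalog entry with enough memory for the target model."""
    feasible = [i for i in CATALOG if i[2] >= required_mem_gib]
    if prefer_avx512:
        feasible = [i for i in feasible if i[3]] or feasible
    if not feasible:
        raise ValueError("no instance in the catalog satisfies the memory requirement")
    return min(feasible, key=lambda i: i[4])

print(pick_instance(200))   # -> ('r6i.8xlarge', 32, 256, True, 2.02)
```

The same heuristic could be extended with per-model scalability data to also choose the vCPU count, following the observation that OpenMP scaling is not linear.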

13.5 Conclusions and Future Trends

The PoC described in the previous section has advanced Petrobras in terms of cloud usage, technological maturity, and governance. A Center of Excellence in Cloud Computing was created to serve as an architecture and technical reference and also to centralize demands.


We built new automation over our scripts to deploy the environments quickly, better customized to each end-user's needs. We are provisioning the entire infrastructure as code (IaC), with CloudFormation for AWS, for example, inside a pipeline that provides HPC cluster environments for specific reservoir projects. This allows us to achieve better accountability, security isolation, and scalability to satisfy each project's SLA (Service Level Agreement). We could say that the move to HPC in the cloud as Infrastructure as a Service (IaaS) using IaC is a reality from which there will be no going back and which has brought us many advantages, but we continue to face challenges, as we have to adapt to constant innovations from the cloud providers and Independent Software Vendors (ISVs).

Balancing on-premises and cloud machine provisioning remains a challenging issue. As described in Sect. 13.4, we are not just moving the pending simulations in our on-premises queue to the cloud; we are offloading a whole set of projects to the cloud. Typically, these involve a specific reservoir field, affecting the entire engineering team dedicated to it, while other projects remain on our on-premises HPC resources. Despite the massive investment, for continuous large-scale HPC demand, on-premises continues to be a "cheaper" option in the long term, but this margin is being reduced year after year. One trend is to move more projects with continuous demands to the cloud, because these have more predictable resource usage. In this scenario, the question often arises of whether to move all the data to the cloud instead of regularly transferring data between the cloud and on-premises. However, there are some dilemmas with this strategy. The first is the performance of accessing this "all-inclusive" environment with data and computing in the cloud, which demands a good connection, but also provides the advantage of low latency between data and computing resources. The other is that putting all the data in a single cloud provider creates vendor lock-in. If the company has a huge repository in the cloud, it can easily share the data with partners, reducing this downside, as data sharing is extremely common and useful in the O&G industry.

Another challenge is how to deal with the Software as a Service (SaaS) solutions for reservoir engineering that are now being provided by the major ISVs. We can now choose to use CMG Cloud20 instead of purchasing licenses for the simulator and investing in our own infrastructure (on-premises or IaaS). Some other ISVs make it even harder to find this balance, as they provide a full-stack ecosystem of applications in the cloud, such as Schlumberger's DELFI21 platform. The data movement challenges we face are considerably greater when integrating different reservoir engineering lifecycle applications that rely on different ISVs' cloud solutions and, therefore, on various cloud providers. Many key global energy companies are building Artificial Intelligence (AI) strategies for data analysis based on cloud platforms, including partnerships or strategic alliances for digital transformation (DT) with cloud providers [17].

20 https://www.cmgl.ca/cloud.
21 https://www.software.slb.com/delfi.


These DT programs rely on many data sources and, as expected, reservoir simulations are part of the kernel of these workflows. All this will reduce the number of supercomputers seen in the TOP500 as usage spreads into Cloud HPC computing [10], as described in Sect. 13.2.

The SaaS movement, from one viewpoint, simplifies deployment requirements for the energy companies, transferring many of the infrastructure responsibilities to the ISVs. From another viewpoint, however, it limits the possibilities of applying R&D to enhance the reservoir simulations; the trend is that the ISVs themselves incorporate this kind of intelligence into their SaaS solutions [17]. A third view is that it could bring new R&D opportunities for optimizing the ISVs' and cloud providers' environments, as they have more data.

The topic of security and privacy is also very relevant. Despite data privacy agreements and terms of service, the simple collection of metadata about the reservoir simulations can provide valuable insights, as happens in many other fields. Using IaaS, we can apply state-of-the-art cryptography solutions to bring more security to the cloud. One example is homomorphic encryption, which conveys the idea of performing the computation directly on encrypted data. With SaaS, however, the same problem discussed before arises, and we need to rely on the ISV's security.

Another trend we foresee is towards serverless computing, a cloud solution that has become popular in recent years. The idea is that the developer publishes code functions that can be used by the applications, leaving all the infrastructure provisioning responsibility to be managed by the cloud provider. It has an attractive, flexible pricing model in which the user pays per number of requests, function uptime, and memory consumption. These benefits have attracted some HPC users to explore serverless computing for embarrassingly parallel workloads, including those of the O&G industry. A research work examining serverless seismic imaging in the cloud [18] demonstrated that it is feasible and allows the operating cost to be reduced by up to a factor of 6. However, it is still a very new service with some pitfalls, and there is much research underway. To provide serverless computing, the cloud providers typically containerize the service, which introduces large payloads for complex HPC workloads that are slow to start and introduce latency. In addition, serverless computing is not always the cheapest cloud solution, and a careful analysis of the use cases is required.

Finally, it is worth mentioning that, despite all the investment in the cloud that Petrobras is making, by creating a center of cloud competence (CCC) group and adopting an internal cloud-first strategy, the company is still investing deeply in its on-premises infrastructure. The most recent evidence is the newly announced Pégaso supercomputer [19], with 2016 GPUs and 678 terabytes of RAM, which will add 21 petaflops of theoretical peak performance to the company's on-premises HPC capacity. This underscores the strategy discussed in this chapter: the cloud is not a magical solution for all demands and needs.


A careful analysis should be performed case by case to decide which demands will benefit from the cloud and should be moved to it, and which are better kept in-house. After all, the best strategy seems to be cloud-first, not cloud-all-in, even for HPC.

References

1. J. R. Fanchi, Principles of Applied Reservoir Simulation, 2006. https://doi.org/10.1016/B978-0-7506-7933-6.X5000-4
2. G. D. Avansi, D. J. Schiozer, UNISIM-I: Synthetic Model for Reservoir Development and Management Applications, International Journal of Modeling and Simulation for the Petroleum Industry (2015).
3. R. Lima, A. C. Abreu, M. A. Pacheco, Optimization of Reservoir Development Plan Using the System OCTOPUS, in: OTC Brasil, Offshore Technology Conference, 2015, pp. 1724–1732. https://doi.org/10.4043/26266-MS. URL http://www.onepetro.org/doi/10.4043/26266-MS
4. BP, BP opens new facility to house the world's largest supercomputer for commercial research, Accessed May/2022 (October 2013). URL https://www.bp.com/en/global/corporate/news-and-insights/press-releases/bp-opens-new-facility-houston-largest-supercomputer.html
5. BP, BP supercomputer now world's most powerful for commercial research, Accessed May/2022 (December 2017). URL https://www.bp.com/en_us/united-states/home/news/press-releases/bp-supercomputer-now-worlds-most-powerful-for-commercial-research.html
6. BP, BP supercomputer to aid global healthcare researchers in race to halt COVID-19, Accessed May/2022 (April 2020). URL https://www.bp.com/en/global/corporate/news-and-insights/press-releases/bp-supercomputer-to-aid-global-healthcare-researchers-in-race-to-halt-covid19.html
7. South China Morning Post, China's supercomputer Sunway TaihuLight falls to sixth place amid reluctance to share data over US sanctions fears, Accessed June/2022 (June 2022). URL https://www.scmp.com/tech/big-tech/article/3180037/chinas-supercomputer-sunway-taihulight-falls-sixth-place-amid
8. HPCwire, ANSYS, Saudi Aramco & KAUST Shatter Supercomputing Record, Accessed May/2022 (November 2016). URL https://www.hpcwire.com/off-the-wire/ansys-saudi-aramco-kaust-shatter-supercomputing-record/
9. HPCwire, Saudi Aramco Scientists Achieve First Trillion Cell Reservoir Simulation Run, Accessed May/2022 (July 2017). URL https://www.hpcwire.com/off-the-wire/saudi-aramco-scientists-achieve-first-trillion-cell-reservoir-simulation-run/
10. D. Reed, D. Gannon, J. Dongarra, Reinventing High Performance Computing: Challenges and Opportunities (2022). https://doi.org/10.48550/ARXIV.2203.02544. URL https://arxiv.org/abs/2203.02544
11. K.-A. Lie, O. Møyner (Eds.), Advanced Modeling with the MATLAB Reservoir Simulation Toolbox, Cambridge University Press, 2021. https://doi.org/10.1017/9781009019781. URL https://www.cambridge.org/core/product/identifier/9781009019781/type/book
12. K. Esler, K. Mukundakrishnan, V. Natoli, J. Shumway, Y. Zhang, J. Gilman, Realizing the Potential of GPUs for Reservoir Simulation, ECMOR XIV - 14th European Conference on the Mathematics of Oil Recovery (September 2014) (2014) 8–11. https://doi.org/10.3997/2214-4609.20141771
13. K. Bogachev, S. Milyutin, A. Telishev, V. Nazarov, V. Shelkov, D. Eydinov, O. Malinur, High-Performance Reservoir Simulations on Modern CPU-GPU Computational Platforms (2018) 1–11.
14. F. Portella, D. Buchaca, J. R. Rodrigues, J. L. Berral, TunaOil: A Tuning Algorithm Strategy for Reservoir Simulation Workloads, Journal of Computational Science (Aug 2022). https://doi.org/10.1016/j.jocs.2022.101811. URL https://linkinghub.elsevier.com/retrieve/pii/S1877750322001806
15. Microsoft Azure, HPC Cache, File caching for high-performance computing (HPC), Accessed June/2022 (June 2022). URL https://azure.microsoft.com/en-us/services/hpc-cache/
16. Amazon Web Services, AWS Storage Gateway, Accessed June/2022 (June 2022). URL https://aws.amazon.com/storagegateway/
17. L. Kuang, H. Liu, Y. Ren, K. Luo, M. Shi, J. Su, X. Li, Application and development trend of artificial intelligence in petroleum exploration and development, Petroleum Exploration and Development 48 (1) (2021) 1–14. https://doi.org/10.1016/S1876-3804(21)60001-0. URL https://www.sciencedirect.com/science/article/pii/S1876380421600010
18. P. A. Witte, M. Louboutin, C. Jones, F. J. Herrmann, Serverless seismic imaging in the cloud (2019). https://doi.org/10.48550/ARXIV.1911.12447. URL https://arxiv.org/abs/1911.12447
19. HPCwire, Petrobras to Launch Atos-Built Pégaso Supercomputer, Accessed August/2022 (July 2022). URL https://www.hpcwire.com/2022/07/24/petrobras-to-launch-atos-built-pegaso-supercomputer/

Chapter 14

Cost Effective Deep Learning on the Cloud

Otávio O. Napoli, Rafael K. Tesser, Daniel L. Fonseca, and Edson Borin

14.1 Introduction

The use of Deep Learning (DL) has been growing steadily in recent years due to its great performance in several areas [10, 14, 17, 24] and due to large investments in research in this field [4]. One of the reasons for this constant growth in performance is the construction of deeper models, capable of inferring increasing amounts of information from the input data. However, deeper models require more computational power and, consequently, more powerful and expensive computing resources.

Cloud providers have become an attractive solution for training these models. They offer virtual machine instances with GPUs, one of the most widely used hardware devices for machine learning due to their design and their performance on matrix and vector computations. Besides, they allow users to build a custom infrastructure designed for their needs. Furthermore, they offer several specialized products aimed at automating and speeding up the building and development of deep learning models. These products are based on different service models, such as IaaS, PaaS, and SaaS.

As cloud providers offer various types of services and virtual machines, users can design custom computing infrastructures with different performance and costs to run their models. Since training deep learning models can be expensive, users must wisely choose services and configurations with good cost and performance trade-offs. Nonetheless, as discussed in several research works [3, 13, 21], selecting a cost-effective set of cloud computing resources may not be trivial.



This chapter discusses how to train a deep learning model in the cloud using a medical image segmentation case study. It starts by describing the main components involved in training a deep model (Sect. 14.2) and presents strategies used to train large models. Later, in Sect. 14.3, we discuss how these models can be trained using different services offered by cloud providers. As training a DL model usually follows a very typical workflow, we show how some services increasingly automate the workflow, requiring less human intervention, but at an additional cost. In Sect. 14.4 we present and discuss the problem of selecting virtual machines and services, guiding the reader to observe crucial points involved in choosing cost-efficient services and instances. Finally, in Sect. 14.5 we present our final considerations.

14.2 Key Deep Learning Concepts

A machine learning (ML) algorithm is able to learn how to perform a task from a dataset [5]. Traditionally, a machine learning system is composed of four elements: a dataset; a model, which aims to generalize a function f based on the dataset; a cost function, which measures how far the model is from a correct solution; and an optimization algorithm, which aims to optimize (minimize or maximize) the value of the cost function.

Machine learning algorithms are usually categorized as supervised learning, unsupervised learning, and reinforcement learning algorithms. In supervised learning algorithms, in addition to the dataset X, there is a set of labels Y, which describes each element of X. Their goal is to optimize the parameters of a parametric function f, known as the model, so that for each sample in X, f returns a value that closely matches its respective label [2]. This process of optimizing the model is called training [15]. During training, the objective is to reduce the training error (also called loss), which is the error observed when generating a prediction based on an input. In unsupervised learning, the dataset is not labeled, and the model is intended to reveal insights about the data by finding patterns and hidden structures within it. Finally, in reinforcement learning, observations of an environment are made at specific points in time, and the training process consists in optimizing a policy function that maximizes the reward for the observer [2].

Artificial neural networks (ANNs) are among the most powerful machine learning algorithms. These algorithms are inspired by biological nervous systems [16] and are composed of artificial neurons, or perceptrons, arranged in layers of multiple sequentially connected neurons, as illustrated in Fig. 14.1. The connections between neurons are called synapses, and they have associated weights, which are part of the model's parameters. At the input layer, different parts of the input data are mapped to its perceptrons. Each subsequent layer receives the output of the previous layer, which is calculated using a mathematical function (the activation function) of the previous layer's activations and the weights associated with the corresponding synapses.


[Fig. 14.1 An example of an artificial neural network, with an input layer, a hidden layer, and an output layer, traversed by forward propagation (from data to output) and backward propagation. Circles represent perceptrons and arrows represent the connections between perceptrons (synapses). Each arrow (except dashed ones) has a weight associated with it. The output of each perceptron is calculated using an activation function.]

At the end, the output layer contains the prediction produced by the model. The process of propagating the data from the input to the output layer is called forward propagation, or inference.

The training process consists of finding values for the weights (i.e., the parameters) that optimize the accuracy of the result produced by the model (i.e., the model accuracy). Algorithms like stochastic gradient descent (SGD) [8] calculate the error of the result produced by the inference process based on the input data label and back-propagate the partial derivatives relative to this error (the gradients), in the same way as is done when performing a linear or logistic regression, for instance. This process is called backward propagation.

Deep Neural Networks (DNNs) are artificial neural networks with several hidden layers. Each layer of a DNN learns increasing levels of abstraction from the input data. Thus, this process, chained along several layers, allows DNNs to learn more and more complex functions [23]. As deep neural networks evolved, different architectures were proposed, replacing the traditional perceptrons' functionality. Convolutional neural networks (CNNs), for example, were inspired by the visual cortex of animal brains and introduce new layers, such as convolutional and pooling layers [5].


We refer to this class of machine learning models, with several layers, as Deep Learning (DL) models. Modern deep learning models gained great notoriety when they outperformed important machine learning algorithms, such as Support Vector Machines (SVMs), in various pattern recognition applications and competitions [12]. In recent years, DL models have been used for various purposes, such as financial forecasting and driving assistance, among others [11, 12, 18, 20]. As these models became deeper and more complex, training them started to demand more computational resources and the development of strategies to improve training efficiency.
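To make the four elements introduced above (dataset, model, cost function, and optimization algorithm) concrete, the toy sketch below fits a linear model to synthetic data with plain gradient descent. It is an illustrative example only; the data and learning rate are arbitrary and unrelated to the chapter's case study.

```python
import numpy as np

# Toy supervised-learning setup: a dataset (X, y), a linear model f(x) = w.x + b,
# a mean-squared-error cost function, and gradient descent as the optimizer.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)   # labels with a little noise

w, b, lr = np.zeros(3), 0.0, 0.1
for epoch in range(100):
    pred = X @ w + b                  # forward pass (inference)
    err = pred - y
    loss = np.mean(err ** 2)          # cost function (training error)
    grad_w = 2 * X.T @ err / len(y)   # gradient of the loss w.r.t. the parameters
    grad_b = 2 * err.mean()
    w -= lr * grad_w                  # optimization step
    b -= lr * grad_b
print(round(loss, 4), w.round(2))     # the loss shrinks and w approaches true_w
```

Deep learning models follow the same recipe, only with far more parameters, non-linear layers, and stochastic (mini-batch) updates, as described next.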

14.2.1 Training Deep Learning Models

Training a deep learning model is an iterative process that consists in performing the forward and the backward propagation multiple times over the entire dataset. Typically, the dataset is partitioned into three subsets: (a) the training dataset, which is used to tune the model parameters and minimize the prediction error; (b) the validation dataset, which is used during training to verify how well the model performs on samples not used to tune the parameters (usually used as a criterion to stop the training process); and (c) the test dataset, which is used to evaluate the model at the end of the training process. Each iteration over the whole training dataset is called an epoch, and each epoch is composed of multiple steps, where each step consists of computing the gradients and updating the model parameters using a batch of data.

A typical training workflow is illustrated in Fig. 14.2. First, the model is defined and its parameters are initialized (step 1). At each epoch, the model processes the whole training dataset (steps 3-6), which consists in fetching a batch of data from the training dataset (step 3), performing the forward and backward propagation to calculate the gradients (step 4), and updating the model's parameters (step 5). Next, the validation dataset may be used to evaluate the model's performance (step 7) and to support the training stop criteria. The training runs for as many epochs as needed to satisfy the stop criteria (step 8). These criteria may include reaching a target accuracy or loss, training for a fixed number of epochs, or training for a maximum number of epochs without significant accuracy improvements.

Besides the model parameters, there are several other parameters, known as hyperparameters, that affect the training process and the quality of the resulting model. The batch size, i.e., the amount of data processed in each step, is one of them. Choosing the right hyperparameters to use during training is critical for optimizing the quality of the trained model, and the process of tuning them is usually done during the model's development. During this process, it is important to consider the specifications of the computing infrastructure, as they may affect the hyperparameter choices. As an example, the batch size usually affects the model's accuracy and how long it takes to reach the target accuracy. Nonetheless, its maximum value is usually limited by the amount of data that fits inside the processing device's volatile memory (RAM or VRAM).

[Fig. 14.2 Typical deep learning training workflow: (1) initialize parameters; (2) start an epoch; (3) fetch the next batch from the training dataset; (4) calculate the gradients for each sample in the batch; (5) update the model parameters using the batch gradients; (6) repeat until all batches have been processed; (7) evaluate on the validation dataset; (8) compute the stop criteria; (9) end when the stop criterion is satisfied.]
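The sketch below implements the workflow of Fig. 14.2 as a minimal training loop, written here with PyTorch (one of the frameworks mentioned later in this chapter). The random dataset, the tiny network, and the fixed number of epochs are illustrative placeholders rather than the chapter's actual case study, and a real run would also move the model and batches to a GPU.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Step 1: define the model, initialize its parameters, and set the hyperparameters.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

train_set = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
val_x, val_y = torch.randn(256, 32), torch.randint(0, 2, (256,))
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)  # batch size: a hyperparameter

for epoch in range(10):                      # step 2: start an epoch
    model.train()
    for xb, yb in train_loader:              # step 3: fetch the next batch
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)        # step 4: forward propagation and loss
        loss.backward()                      # step 4: backward propagation (gradients)
        optimizer.step()                     # step 5: update the parameters
    model.eval()                             # steps 6-7: all batches processed, validate
    with torch.no_grad():
        val_loss = loss_fn(model(val_x), val_y).item()
    print(f"epoch {epoch}: validation loss = {val_loss:.4f}")
    # Step 8: a real stop criterion (target loss/accuracy, patience, etc.) would go here.
```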

14.2.2 Model Partitioning Strategies for Distributed Training

Training a deep learning model can take a significant amount of time, depending on its size (number of parameters), the operations performed, the number of samples in the dataset, the batch size, the data type, and other hyperparameters. A practical solution to reduce training time while meeting the growing demand for computational power is to distribute the training process across multiple computing devices (e.g., GPUs or CPUs) [2]. This can be achieved by partitioning different elements of the model, such as the input data (data parallelism), the model (model parallelism), or the layers (pipeline parallelism), as illustrated in Fig. 14.3.


[Fig. 14.3 Strategies for model partitioning: data parallelism, model parallelism, and pipeline parallelism. Different colors represent different computing devices.]

Data Parallelism
Data parallelism strategies consist in partitioning the dataset across different computing devices. Thus, each device has a copy of the model parameters and performs the forward and backward propagation independently. After the backward propagation, the devices must share the computed gradients so that they can be combined to update the model parameters. In distributed training, this is usually performed using collective synchronization operations. There are many ways to perform this task, the most common being the AllReduce operation [2]. Since the synchronization overhead may become a bottleneck, the scalability of the training process depends not only on the batch size, but also on the number of devices participating in the training. There are alternatives to reduce this bottleneck, the most common being asynchronous training. With this strategy, the parameters are updated asynchronously and the workers do not maintain an exact copy of all the model's parameters, thus minimizing the extra overhead on the training process. Nonetheless, to prevent the parameter copies from diverging too much and affecting the final model accuracy, the copies must be synchronized periodically. When using this strategy, the user may need to reduce the batch size as the number of devices increases, to maintain the model's accuracy [6].

Model Parallelism
In model parallelism strategies, the neurons of each network layer are distributed across different computing devices, which perform the training process using the same batch of data. This strategy allows training large models that do not fit in a single computing device's memory, since each device does not need a copy of the whole model. However, this approach adds extra communication operations after each layer finishes processing. As a result, it may raise a device underutilization issue, as forward propagation and backward propagation are both synchronous operations.


Pipeline Parallelism
In deep learning, pipelining refers to overlapping computations, such as between one layer and the next (as data becomes ready), or to partitioning the layers and assigning each layer to a specific worker [2]. The latter, called layer parallelism, is often used and allows large deep learning models to be trained. Unlike model parallelism, workers do not need to store the parameters of all layers for forward and backward propagation. Besides that, communication occurs at well-defined points (at layer boundaries), across a small number of computing devices, thus avoiding all-to-all communication operations [2]. In order to fully utilize the computing system, the pipeline must be carefully partitioned among the devices, so that all devices process the same number of samples per second.

In general, if the computing devices have enough memory to hold the model parameters and a fair number of dataset samples, it is easier, and more common, to use data parallelism to accelerate the training process. Hence, we will focus on this strategy in the next sections.
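As an illustration of synchronous data parallelism, the sketch below wraps a model in PyTorch's DistributedDataParallel, which combines the workers' gradients with an AllReduce during the backward pass. The random dataset and the small network are placeholders; a real job would use the NCCL backend and move the model and batches to GPUs.

```python
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def train():
    dist.init_process_group(backend="gloo")   # "nccl" on GPU nodes; launcher sets rank/world size
    dataset = TensorDataset(torch.randn(4096, 32), torch.randint(0, 2, (4096,)))
    sampler = DistributedSampler(dataset)      # shards the dataset across the workers
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    model = DDP(nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2)))
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

    for epoch in range(5):
        sampler.set_epoch(epoch)               # reshuffle the shards every epoch
        for xb, yb in loader:
            optimizer.zero_grad()
            loss_fn(model(xb), yb).backward()  # DDP all-reduces the gradients here
            optimizer.step()
        if dist.get_rank() == 0:
            print(f"epoch {epoch} done")
    dist.destroy_process_group()

if __name__ == "__main__":
    train()
```

Assuming the file is saved as ddp_train.py, it could be launched on one node with, e.g., torchrun --nproc_per_node=4 ddp_train.py; the same script scales to multiple nodes by pointing the launcher at a shared rendezvous address.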

14.3 Training Deep Learning Models in the Cloud

A typical machine learning workflow is divided into the following stages:

• Data preparation: All the relevant data to train the model is extracted and preprocessed. It is also common to extract or select features that will improve the accuracy of the trained model;
• Training: Using the prepared data, the model is trained with techniques such as those presented in the previous sections. The result of the training is usually stored in a model registry, which provides a set of APIs to manage the lifecycle of trained models;
• Serving: The trained model is retrieved from the model registry and served under a prediction service that can be used for inference.

There are multiple ways of executing these stages in the cloud. Depending on their needs and technical expertise, users may choose between these three types of service:

• Infrastructure as a Service (IaaS): Users are given access to a virtualized environment, where they can customize most aspects of the execution environment (including operating system, runtime libraries, tools, middleware, and application). Different virtual-machine or container base images may be provided to facilitate the configuration. In fact, many cloud providers offer specialized virtual machine images with popular frameworks (e.g., TensorFlow, Keras, PyTorch) already pre-configured.
• Platform as a Service (PaaS): Typically, the user only needs to provide an application and its input data (this may include configuration files). Popular ML/DL frameworks are usually pre-installed, configured, and optimized for the underlying platform. The deployment and execution process is automated, and the platform usually provides means to facilitate retrieving data and saving results to cloud storage.


• Software as a Service (SaaS): Ready-to-use ML/DL software accessed via graphical user interfaces or APIs. The user provides the data, and may be able to select from available learning models and to configure some aspects/parameters. It may be possible to build a custom model, e.g., via a graphical interface. Some services provide pre-trained models. In other cases, the user may provide their own training data to train their selected model.

The next sections provide an overview of different cloud computing services for deep learning (Sect. 14.3.1), and discuss how to train deep learning models using VMs on the IaaS model (Sect. 14.3.2) and using SageMaker (Sect. 14.3.3), a PaaS tool.
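To give a flavor of the IaaS route in the list above, the sketch below provisions a single GPU virtual machine on AWS with boto3. The AMI ID, key pair, region, and instance type are hypothetical placeholders; in practice one would pick a pre-configured deep learning image and a type available in the chosen region, and other providers offer equivalent APIs.

```python
import boto3

# Illustrative IaaS provisioning: start one GPU VM; all identifiers are placeholders.
ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder: e.g., a pre-built deep learning image
    InstanceType="p3.2xlarge",         # example GPU instance type
    KeyName="my-keypair",              # placeholder SSH key pair
    MinCount=1,
    MaxCount=1,
)
print("started", response["Instances"][0]["InstanceId"])
# After the instance is running, the user still has to configure frameworks (unless the
# image already ships them), copy the dataset, launch the training, and stop the VM.
```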

14.3.1 Services for Deep Learning in the Cloud

As previously mentioned, each machine learning workflow stage has specific requirements. Users can leverage different services to handle each part of their workflow, customizing it based on their own needs. Some cloud providers offer "all-in-one" solutions that can cover most needs across different stages. These solutions are usually more expensive, but they can offer integrated pipelines, as well as a better development experience.

At the SaaS level, users may take advantage of pre-trained models for common use cases or provide their own dataset to train an existing model to solve more specific problems. Some cloud providers also supply services capable of automatically building a DL model based on user-provided data. Such services typically can be accessed programmatically, thus allowing customers to integrate them into their own applications. In addition, some of them may also provide a graphical user interface. Major cloud providers, like Amazon,1 Google,2 and Microsoft,3 provide a multitude of ML/DL services in the SaaS model. Common SaaS ML/DL solutions encompass fields such as computer vision, natural language processing, and business applications, among others. Computer vision applications may include image classification, labeling of objects in images and video, text detection, object segmentation, detection of defects and automated inspection, face detection, and face recognition. Examples of these include Amazon's Rekognition and Lookout for Vision, Google's Vision AI and Video AI, and some of the services in the Azure Cognitive Services.

1 https://aws.amazon.com/machine-learning/ (accessed in August 2022).
2 https://cloud.google.com/products/ai (accessed in August 2022).
3 https://azure.microsoft.com/en-us/overview/ai-platform/ (accessed in August 2022).


Examples of such services are Amazon's Lex, Textract, Comprehend, Transcribe, and Polly, Google's DialogFlow, Translation AI, Speech-to-Text, and Text-to-Speech, and part of the Azure Cognitive Services. On the business side, solutions may include the analysis and forecasting of business metrics, fraud detection, and sentiment analysis. Examples include Amazon's Lookout for Metrics, Metrics Advisor, Forecast, and Fraud Detector.

At the PaaS level, customers are able to run their applications in an environment that has been specially tailored for machine learning. Users are typically given access to tools designed to facilitate the building, training, tuning, and deployment of ML/DL models. These may include IDEs, Jupyter notebook environments, debuggers, profilers, optimizers, vendor-specific APIs, as well as access to popular machine-learning frameworks. Users may employ these environments to build deep-learning applications based on vendor-provided built-in models or on their own custom models. These services often employ container technologies to implement their execution environments and may offer ways for users to customize such containers to their needs, or even bring their own containers into the ML platform. Examples of PaaS for ML/DL include Amazon SageMaker,4 Google Vertex AI,5 and Microsoft's Azure Machine Learning6 (AzureML). Section 14.3.3 presents SageMaker in more detail as our case study (discussed in Sect. 14.4.1) relies on this platform.

At the IaaS level, besides the inherent increase in flexibility, the burden of configuring a suitable environment falls completely on the users. This includes choosing an operating system image, installing and configuring the required software tools, libraries, and DL frameworks, as well as their dependencies. In some cases, it may also be necessary to configure cloud storage for the training dataset and for output data. Some of this work may be eased if the cloud vendor provides ML-optimized virtual-machine images or pre-configured container images. The latter may be either publicly available or offered by the vendor itself. Nevertheless, even with containers, IaaS may still entail a non-negligible amount of extra work, but it allows users to have more control over the end-to-end pipeline.

Since there are many details to consider, deciding which type of service to use during each stage of the machine learning workflow is a non-trivial task. The level of automation depends on the selected service model and, in general, the amount of flexibility decreases as automation increases. It is worth noting that increased automation usually comes with an increased price. Therefore, users have to weigh the advantages brought by automation against the potential increase in cost. For instance, unless there is a significant increase in performance, moving from IaaS to PaaS will most likely incur higher costs, which may make IaaS the more advantageous option.

4 https://aws.amazon.com/sagemaker/ (accessed in August 2022).
5 https://cloud.google.com/vertex-ai/ (accessed in August 2022).
6 https://azure.microsoft.com/services/machine-learning/ (accessed in August 2022).

Fig. 14.4 Typical workflow for training a deep learning model in the cloud: (1) prepare the environment, (2) prepare the data, (3) select virtual machines, (4) start virtual machines, (5) set up the environment, (6) set up the model and hyperparameters, (7) train the model, (8) fetch results, (9) end

14.3.2 Training with IaaS

Training a deep learning model consists of several steps, from preprocessing to optimization. A typical workflow for training a deep learning workload in the cloud is illustrated in Fig. 14.4. First, the user must prepare the environment, which includes describing the software and dependencies, preparing containers or VM images, and other useful tools needed to execute the desired workload (step 1). After that, the user must preprocess the data and put it in an appropriate place (step 2). Depending on the amount of data, the data type (e.g., plain text files, digital images, structured databases, etc.), and the training magnitude (e.g., the number of devices used in the training process and the model size), different storage services may fit better. For instance, for small workloads and datasets, the data may be placed directly in the block storage, together with the VM image, or included directly in the environment, in order to avoid downloading external files, while for larger workloads, file-based or object-based storage services may be more appropriate and cheaper.

Then, the user must select the virtual machines to run their workload (step 3). Cloud providers offer a variety of virtual machine configurations with different costs, performance, and resource options. Thus, selecting the set of virtual machines that optimizes these axes may not be a trivial task. These performance and cost optimizations can be addressed in many ways, although many of them require the model to be executed in different configurations (arrow from End to step 3). This challenge will be discussed later, in Sect. 14.4. Once the virtual machines are selected, the user can instantiate them in the cloud provider (step 4); configure the environment (step 5), e.g., installing the required packages or running the appropriate containers; set the model and hyperparameters up for training (step 6); train the model (step 7); and, finally, fetch the results (step 8). In model testing, building, and tuning scenarios, steps 5 to 8 may be performed several times. Moreover, optimizing the cost and performance of training may also require several repetitions of steps 3 to 9, to profile different infrastructure alternatives. It is worth noting that the first steps (1, 2, and 3) are sometimes performed locally, outside the cloud, while the other steps are executed in the cloud. A common way of automating those steps is using infrastructure-as-code tools, such as Terraform7 and Ansible,8 to create scripts and resource definitions that are able to provision and configure the infrastructure automatically.
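As an illustration of this kind of automation, the sketch below uses the boto3 Python SDK (rather than the IaC tools mentioned above) to request a GPU-equipped VM for step 4 and wait until it is running; the AMI ID, key pair name, and security group ID are placeholders, not real resources.

```python
# Minimal sketch: programmatically launching a GPU VM for training (steps 3-5).
# The AMI ID, key pair name, and security group ID below are placeholders.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",             # placeholder: a DL-ready AMI (e.g., PyTorch pre-installed)
    InstanceType="g4dn.xlarge",                   # GPU-equipped VM type (see Table 14.1)
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",                        # placeholder key pair
    SecurityGroupIds=["sg-0123456789abcdef0"],    # placeholder security group
)

instance = instances[0]
instance.wait_until_running()                     # block until the VM is up (step 4)
instance.reload()
print("Training VM ready at", instance.public_ip_address)
```

In practice, IaC tools such as Terraform or Ansible are often preferred for this purpose because they also keep a versioned, reproducible description of the whole infrastructure.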

14.3.3 Training with SageMaker

Amazon SageMaker is a cloud-based machine-learning platform that provides services to facilitate and automate various tasks related to developing, tuning, and deploying machine learning models. This includes all the stages of a machine learning workflow, from data preparation to serving. By using Amazon SageMaker, it is possible to perform data analysis via a visual interface (no coding needed), as well as to have a fully-integrated IDE to handle every step of the machine learning process. Finally, Amazon SageMaker also offers dedicated instances that users can leverage to train and serve their models.

Training a model using Amazon SageMaker follows the steps described in Fig. 14.5. The core of the workflow is similar to the one presented in Fig. 14.4, but most of the work associated with infrastructure provisioning is handled by Amazon SageMaker automatically. Users can initialize a training job directly in the AWS Console or using any of the available SDKs.9 If they decide to use an SDK, it is common to use the Python SDK,10 inside either a Notebook Instance or Amazon SageMaker Studio.11

7 https://www.terraform.io/ (accessed in August 2022).
8 https://www.ansible.com/ (accessed in August 2022).
9 https://docs.aws.amazon.com/sagemaker/latest/dg/api-and-sdk-reference.html (accessed in August 2022).
10 https://sagemaker.readthedocs.io/en/stable/ (accessed in August 2022).

Fig. 14.5 Typical workflow for training a deep learning model in the cloud using Amazon SageMaker: (1) prepare the data, (2) set up the model and hyperparameters, (3) launch a training job, (4) fetch results, (5) end

As for the model, users can either (a) use an algorithm provided by SageMaker, (b) use Apache Spark integrated with SageMaker, (c) use an algorithm from AWS Marketplace, or (d) submit a custom model for training. When training a custom model, users need to configure a SageMaker Estimator, which is responsible for provisioning the training infrastructure, as well as for setting the hyperparameters. The training process is initialized with the fit method of the Estimator, passing the input, validation, and test data. Users can use data from different sources, Amazon S3 (Simple Storage Service) being the most common, and can also integrate it with other SageMaker products (like Ground Truth). The only change required in the source code is fetching the data from SageMaker channels, defined as environment variables by the SageMaker platform.12

Once a training job is created, SageMaker launches the user-specified ML compute instances and uses the specified code and dataset to train the model. Upon completion, the resulting model artifacts and other outputs are saved in the user-specified Amazon S3 bucket. After the training is completed, AWS charges the user according to the amount of billable time of the training instance. It is important to mention that neither the time to download the training data nor the instance boot time is billable, but the time to download the container image may be charged if the user provides a custom container to be used by the Estimator.
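As a concrete illustration of this workflow, the sketch below uses the SageMaker Python SDK to launch a training job for a custom PyTorch script; the entry-point script, IAM role ARN, and S3 URIs are placeholders.

```python
# Minimal sketch of launching a SageMaker training job for a custom PyTorch script.
# The script name, role ARN, and S3 URIs are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                                # placeholder training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder IAM role
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    framework_version="1.12",                              # must match an available PyTorch version
    py_version="py38",
    hyperparameters={"epochs": 10, "batch_size": 1},
)

# fit() provisions the ML instance, runs train.py, and stores the model artifacts in S3.
estimator.fit({"training": "s3://my-bucket/heart-dataset/train"})  # placeholder S3 channel
```

Inside train.py, the training data would be read from the local path exposed by the corresponding SM_CHANNEL_* environment variable, as described above.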

11 https://aws.amazon.com/sagemaker/studio/ (accessed in August 2022).
12 https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md#sm_channel_channel_name (accessed in August 2022).


14.4 Optimizing Cost and Training Time

Training deep learning models can be expensive. Cloud providers offer various types of services and virtual machines, allowing users to design custom computing infrastructures for their models. As discussed in several research works [3, 13, 21], selecting the set of cloud resources and services for deep learning that maximizes the cost-benefit may not be trivial. This section discusses the cost and performance implications of choosing different cloud resources and services for training deep learning models through a case study, and presents a strategy to select cost-efficient configurations. Section 14.4.1 presents the case study and Sect. 14.4.2 discusses a strategy to search for efficient resource configurations. Sections 14.4.3 and 14.4.4 show the results of applying this strategy on the case study. Finally, Sect. 14.4.5 discusses how cost-efficiency can be improved even further with preemptible VMs.

14.4.1 Case Study: Medical Image Segmentation with MONAI

According to Antonelli et al. [1], "semantic segmentation refers to the process of transforming raw medical images into clinically relevant, spatially structured information, such as outlining tumor boundaries, and is an essential prerequisite for a number of clinical applications [. . .]." In this case study, we train a deep learning model using a dataset from the Medical Segmentation Decathlon (MSD) [1]. This is a biomedical image analysis challenge where competing algorithms have to solve 10 medical segmentation tasks, 7 of which were provided in the development phase while 3 were used for evaluation in a later phase, called the mystery phase. The goal of MSD was to demonstrate generalization, by showing that algorithms that perform well on multiple tasks should also do well on previously unseen tasks. The winner was nnU-Net [9], which is based on the U-Net architecture [19].

For this case study, we chose the Heart dataset, which, according to Antonelli et al. [1], "consists of 30 3-D mono-modal MRI scans [(20 for training and 10 for testing)] of the entire heart [. . . ]." They also explain that it "was selected due to the combination of a small training data set with large anatomical variability", and that the data came from the 2013 Left Atrial Segmentation Challenge (LASC) [22]. We will use this data set to train the Dynamic UNet model (a.k.a. DynUnet), a state-of-the-art architecture based on nnU-Net for this task. The model contains several compute-intensive layers, such as three-dimensional convolutions, and more than 3 million trainable parameters, which makes it a reasonably large model to train and parallelize. The model was obtained from the Medical Open Network for Artificial Intelligence (MONAI) Framework. This framework is a community-supported


Table 14.1 Instances used to train the deep learning model on AWS

Service type  Instance type    Accelerator (GPU)   Price^a (USD/h)
EC2           g3s.xlarge       NVIDIA Tesla M60    0.750
EC2           g4dn.xlarge      NVIDIA T4           0.526
EC2           g4dn.12xlarge    4x NVIDIA T4        3.912
EC2           g5.xlarge        NVIDIA A10G         1.006
EC2           p2.xlarge        NVIDIA K80          0.900
EC2           p3.2xlarge       NVIDIA V100         3.060
SageMaker     ml.g4dn.xlarge   NVIDIA T4           0.736
SageMaker     ml.p3.2xlarge    NVIDIA V100         3.825

^a AWS on-demand pricing information accessed in August 2022

open-source framework for deep-learning in healthcare imaging developed by the MONAI project,13 and is based on PyTorch. The next sections discuss how to train this model using both the AWS Elastic Compute Cloud service, or EC2 (IaaS), and the AWS SageMaker service (PaaS). Table 14.1 shows the type, the hardware accelerators, and the price of the VM instances we used. Instance names with the "ml." prefix indicate SageMaker instances. Each experiment was executed three times per virtual machine type, each time in a different instance. For training, we used a batch size of 1. This value was chosen due to GPU memory limitations, as the memory required for training grows with the batch size. The input dataset is stored in an EFS file system. The experiments on EC2 were conducted with Amazon-provided PyTorch AMIs, which contain the Ubuntu 20.04 operating system, NVIDIA driver version 470, CUDA toolkit version 11.4, and PyTorch version 1.12. The SageMaker instances use a similar AMI as well. Finally, all the code, instructions to execute it, and the experimental results are publicly available.14

13 https://monai.io/ (accessed in July 2022).
14 https://github.com/discovery-unicamp/HPCC-18-MONAI.
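For reference, a model of this kind can be instantiated with MONAI as sketched below; the kernel sizes, strides, and patch size are illustrative values and not necessarily the exact configuration used in our experiments.

```python
# Minimal sketch: instantiating a 3-D DynUNet with MONAI (illustrative hyperparameters).
import torch
from monai.networks.nets import DynUNet

model = DynUNet(
    spatial_dims=3,                 # 3-D convolutions for volumetric MRI data
    in_channels=1,                  # mono-modal input
    out_channels=2,                 # background + left atrium
    kernel_size=[3, 3, 3, 3],       # illustrative values, not the exact experimental setup
    strides=[1, 2, 2, 2],
    upsample_kernel_size=[2, 2, 2],
)

x = torch.randn(1, 1, 96, 96, 96)   # a single 3-D patch (batch size 1, as in our experiments)
with torch.no_grad():
    y = model(x)
print(y.shape)                      # torch.Size([1, 2, 96, 96, 96])
```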

14.4.2 Searching for a Cost-Efficient Infrastructure

There are multiple ways of looking for the appropriate instance that maximizes the cost-benefit for a particular workload. The naive approach consists of executing the whole workload and measuring the performance of the training algorithm in all available configurations (e.g., VM types, number of VMs, etc.). The cost of this exploration can easily become prohibitive if there are many candidate configurations (search space) or the workload demands a large amount of time to finish its execution. Therefore, we need strategies to reduce the search cost, such as:


(a) sampling part of the search space using smarter algorithms, such as grid search, Bayesian optimization, genetic algorithms, among others; (b) employing a performance estimation method that costs less than executing the whole workload; or (c) combining both.

Regarding performance estimation, this can be done using performance models or proxy workloads. In the first, mathematical models are manually designed or learned in order to map a given configuration to its respective performance metric (e.g., execution time). Although this method can be fast and cheap, since there is no need to execute code, the mathematical model may need to be tuned to new DL models or hardware devices. Moreover, these methods usually cannot deal with performance fluctuations caused by multi-tenancy on the cloud. In the second method, a proxy workload is one that behaves similarly to the base workload. It is typically a simpler (or shorter) workload, but it still shows the same performance behavior as the original when executed on different hardware configurations. Although this method requires deploying the infrastructure and executing the proxy workload for every candidate configuration, it accounts for performance fluctuations, since it measures the system performance under the current (sharing) conditions.

Estimating Performance Using Paramount Iterations

Brunetta and Borin [3] showed that, in the context of HPC workloads on the cloud, it is possible to estimate the relative performance of different virtual machine types with very little execution time by only collecting information about a few iterations of the main execution cycle of the application, or Paramount Iterations (PI). Paramount iterations also proved to be a good proxy for deep learning applications [13, 21]. Tesser et al. [21] showed that, for models with a stable computational behavior, the performance of the initial steps of the first epoch, except for the first step, can be a good performance proxy for the whole training.

We also verified this behavior in our experiments. This is shown in Fig. 14.6, where the DynUnet model was trained for 10 epochs with the Heart dataset. On almost all instances, after the first step, the execution time of each step remains stable. The first step of the first epoch has a higher execution time due to overheads introduced by common DL frameworks, such as: (a) lazy compilation, as the model gets compiled at the start of the first epoch; (b) benchmarking, as some DL frameworks benchmark some layers of the model in order to select the most efficient kernel implementation; and (c) GPU initialization and memory transfers, since in the first epoch the weights and batches are copied to the GPU memory, while in subsequent epochs this copying is interleaved asynchronously. It is worth noting that, in our experiments, SageMaker instances suffer less external interference than EC2 instances. This can be noticed in the last two rows of Fig. 14.6, which correspond to training performed on SageMaker. This may occur due to optimizations performed in the AWS framework that executes the training.
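A minimal sketch of this proxy is shown below: only the first few steps of the training loop are timed and the first (warm-up) step is discarded; the model, loss function, optimizer, and data loader are assumed to be supplied by a standard PyTorch training script.

```python
# Minimal sketch: timing a few "paramount iterations" of a PyTorch training loop.
import time
import torch

def time_paramount_iterations(model, loss_fn, optimizer, loader, device, n_steps=10):
    """Time the first n_steps+1 training steps and return the last n_steps (drops the warm-up step)."""
    model.train()
    durations = []
    for x, y in loader:
        start = time.perf_counter()
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        if device.type == "cuda":
            torch.cuda.synchronize()        # wait for the GPU before reading the clock
        durations.append(time.perf_counter() - start)
        if len(durations) == n_steps + 1:
            break
    return durations[1:]                    # discard the first (warm-up) step
```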

Fig. 14.6 Execution times of every step of every epoch when training the DynUnet model with the Heart dataset on Amazon EC2 (IaaS) and SageMaker (PaaS). Columns are different executions and rows represent different instances. The plots do not show the execution time of the first step, which is located outside the plot area


Fig. 14.7 Execution times and costs of training the DynUnet model for 10 epochs at different virtual machine types and services, ignoring initialization overheads (VM boot, data fetch, environment setup, etc). Instances with prefix “ml.” stand for SageMaker instances and “1x” says that only one instance was used for training (i.e., it is not distributed)

14.4.3 Selecting Efficient VM Types on EC2 and SageMaker

Most of the resources on public clouds are charged based on the amount of time they are used. Virtual machines are typically charged by time (e.g., USD/h), both on IaaS and SageMaker. Hence, the total computing cost is defined by the resource price multiplied by the time it has been allocated for the user. In this context, using a less powerful computing resource, with a lower price tag, does not mean the total computing cost would be reduced.

Figure 14.7 shows the cost (x-axis) and time (y-axis) it takes to train the DynUnet model for 10 epochs, not accounting for the initialization overheads (VM boot, data fetch, environment setup, etc.). The results indicate that g5.xlarge is the cheapest VM type while ml.p3.2xlarge is the fastest one. Besides, ml.g4dn.xlarge and g4dn.xlarge have almost the same cost, but ml.g4dn.xlarge is 1.39x faster. It is worth noticing that SageMaker instances performed slightly better than the respective instances on EC2, especially in the first iteration of each epoch, as already shown in Fig. 14.6. Finally, the p2.xlarge is 2.26x costlier and 9.62x slower than ml.p3.2xlarge. Notice that, even though p2.xlarge has a lower price tag (0.9 USD/h), it is cheaper to train the model with the ml.p3.2xlarge VM type.


Fig. 14.8 Median cost of training in different virtual machine types and services (3 executions). The error bars show the minimum and maximum values. (a) Real costs for 10 epochs. (b) Estimated costs for 3000 epochs

These results show the importance of properly selecting a cost-efficient VM type when training a DL model on the cloud.

The previous analysis did not account for the initialization overheads, which include: (a) booting the VM instances; (b) fetching the data from the external storage; (c) setting the environment up (e.g., starting the docker container, mounting the EFS file system); and (d) executing other data pre-processing steps in the training script before the training loop. For short-running training workloads, these overheads may be significant when compared to the time and cost to train the model. For example, Fig. 14.8a shows the total cost of training the DynUnet model for 10 epochs on multiple configurations at EC2 and SageMaker, including the initialization and the training cost. Notice that, in some configurations, the initialization cost is higher than the training cost. Moreover, the cost to execute the first step of the first epoch (in green) can be significant in some scenarios.

Even though the initialization cost is non-negligible in this scenario, on more common, long-running workloads it tends to become negligible. The MONAI framework, for example, trains the DynUnet for 3000 epochs, 300 times more than in our experiments. In Fig. 14.8b, we show the estimated initialization and training costs for such a 3000-epoch execution, based on the average cost of a single training step (except the first step of the first epoch). In this case, the estimated training cost is ≈112–296 times higher and the initialization cost is negligible. Therefore, as can be clearly seen in the chart, the initialization and first-step overheads become very small. As a consequence, the cost vs. performance analysis performed using only the training loop, ignoring the initialization and the first step, becomes a good proxy for estimating the total execution cost.
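The sketch below illustrates this kind of extrapolation; the initialization times and average step times are placeholder values that would come from the measurements described above (the full DynUnet training corresponds to 3000 epochs of 16 steps, i.e., 48,000 steps).

```python
# Minimal sketch: extrapolating total training time and cost from a few measured steps.
# The numeric inputs below are placeholders; real values come from the paramount-iteration runs.

def estimate_training(price_per_hour, init_time_s, avg_step_time_s,
                      steps_per_epoch=16, epochs=3000):
    """Estimate total time (s) and cost (USD) for a full training run."""
    total_time_s = init_time_s + epochs * steps_per_epoch * avg_step_time_s
    total_cost = price_per_hour * total_time_s / 3600.0
    return total_time_s, total_cost

# Example with placeholder measurements for two VM types:
for name, price, init_s, step_s in [("g5.xlarge", 1.006, 180.0, 0.30),
                                    ("p2.xlarge", 0.900, 180.0, 3.00)]:
    t, c = estimate_training(price, init_s, step_s)
    print(f"{name}: ~{t / 3600:.1f} h, ~{c:.2f} USD")
```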


Fig. 14.9 Training times (left) and costs (right): mean values for 10 steps vs. estimated values for 3000 epochs (48000 steps). Relative time and cost to execute a few steps on different VM types is a good proxy for the relative time and cost of training for several epochs

Figure 14.9 shows that the time (left) and the cost (right) to train the DynUnet model for 10 steps (steps 2 to 11) are good proxies for the time and cost to perform the whole training (i.e., 3000 epochs) on different VM types.

14.4.4 Exploring Cost and Training Time with Distributed Training

One way to reduce the training time is using distributed training, which uses multiple computing devices to perform the training. Regarding the arrangement of computing devices, there are three possible organizations: (a) a single node with multiple computing devices; (b) multiple nodes with a single computing device each; and (c) multiple nodes with multiple computing devices. Regarding the participating computing devices, they can be homogeneous, when they have the same configuration, or heterogeneous, when they have different configurations.

Cloud providers usually offer virtual machines with multiple accelerators at proportionally higher prices, allowing users to build custom architectures by combining components as they wish. However, the training time is not proportionally smaller, as networking and synchronization may become bottlenecks when scaling up the training process. For instance, synchronous data parallelism, one of the most common parallelism strategies, besides requiring that the whole model fit in the accelerator memory, also adds a synchronization point at the end of each epoch. Thus, if the user builds a heterogeneous architecture, faster machines may sit idle waiting for slower machines to reach the synchronization point. This generates underutilization, leading to a waste of resources and money.
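For reference, the sketch below shows the synchronous data-parallel pattern with PyTorch's DistributedDataParallel; it assumes the script is launched with torchrun (one process per GPU, on one or more nodes) and that the model builder, dataset, and loss function are supplied by the caller.

```python
# Minimal sketch of synchronous data parallelism with PyTorch DistributedDataParallel.
# Assumes launch via `torchrun --nproc_per_node=<gpus>` (or its multi-node equivalent);
# `build_model`, `dataset`, and `loss_fn` are supplied by the caller.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train_ddp(build_model, dataset, loss_fn, num_epochs=10, batch_size=1):
    dist.init_process_group(backend="nccl")            # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(build_model().cuda(local_rank), device_ids=[local_rank])
    sampler = DistributedSampler(dataset)               # each worker sees a different shard
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    optimizer = torch.optim.Adam(model.parameters())

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)                         # reshuffle shards every epoch
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x.cuda(local_rank)), y.cuda(local_rank))
            loss.backward()                              # gradients are all-reduced here (synchronization)
            optimizer.step()

    dist.destroy_process_group()
```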


Fig. 14.10 Estimated execution times and costs of training distributed the DynUnet model for 3000 epochs at different virtual machine types, including initialization overheads. The numbers at the end of the labels (e.g., 4x) represent the number of nodes used in the distributed training. Blue arrows indicate the improvements achieved by single-node multi-GPUs while gray arrows indicate improvements achieved by multi-node single-GPU data parallel training

Thus, it is important to select the best cost-benefit machine types and quantity in order to avoid resource wastage.

Figure 14.10 shows the estimated cost (x-axis) and time (y-axis) it takes to train the DynUnet model for 3000 epochs, including the initialization overheads (VM boot, data fetch, environment setup, etc.). Blue arrows indicate the improvements achieved using single-node multi-GPU configurations while gray arrows indicate improvements achieved using multi-node single-GPU configurations when employing the data-parallel training model. It is worth noticing that, for the chosen problem, the batch size could only be divided by 2 and 4. Thus, we only used configurations with 1, 2, and 4 computing devices in the distributed training. For this reason, single-node multi-GPU distributed training was not performed on SageMaker instances, as only instances with 8 GPUs or more are available there. Also, only g3s.xlarge and g4dn.xlarge have equivalent instances with up to 4 computing devices (for single-node multi-GPU training), namely g3.8xlarge and g3.16xlarge with 2 and 4 GPUs, respectively, and g4dn.12xlarge with 4 GPUs.

As can be seen in Fig. 14.10, the Pareto frontier is slightly altered when compared to the one in Fig. 14.7. The new frontier is now defined by configurations composed of 1, 2, and 4 g5.xlarge VM instances. Notice that training the model with two and four g5.xlarge nodes reduced the execution time at an additional cost.


The same happens for the g4dn.xlarge and g3s.xlarge VM types; however, the cost increase is very small when compared to the execution time reduction. This happens because the cost to boot the VMs, which does not depend on the GPU type, becomes proportionally higher on VMs with powerful GPUs when compared to the total training cost. Another interesting observation is that multi-node configurations were cheaper than their single-node multi-GPU equivalents. The experimental results also indicate that single-node multi-GPU setups perform slightly faster than multi-node equivalent ones (i.e., with the same number of GPUs). This is likely due to reduced synchronization overheads, as GPUs can communicate through the node's system bus instead of the datacenter network. Finally, it is worth noticing that, despite the better performance, training with single-node multi-GPU setups costs more.

Based on the previous analysis, we selected the g5.xlarge (2x) configuration to perform the full training of the DynUnet model (i.e., train it for 3000 epochs) on AWS EC2. The total time and cost to train the model were 14,344 seconds and 8.01 USD, respectively. Notice that these values are very close to the estimates shown in Fig. 14.10, which indicate this training would take 14,137 seconds and 7.91 USD to complete.

Selecting a Cost-Efficient Configuration

As can be observed in Fig. 14.10, the more computing devices employed on the distributed training, the lower the execution time. Nonetheless, as the training time reduces and the initialization and synchronization overheads become more expressive when compared to the training time, the cost increase rates become higher than the execution time reduction rates. Eventually, with too many computing devices, the execution time gains become marginal when compared to the cost increase. Hence, it is important to select a number of computing devices that yields a good trade-off between cost and execution time. Also, as discussed in Sect. 14.4.3, the VM type has a strong influence on the cost-efficiency of the training process. Therefore, both factors must be taken into account when optimizing cost and performance.

Exploring all combinations of VM types and number of computing devices may lead to a large search space, which may require too much time and money to identify the best configuration. In this context, we suggest the following approach to reduce the time and cost required to identify a cost-efficient configuration to train deep learning models on the cloud:

1. First, benchmark each instance type without parallelism strategies. This must be performed using a performance proxy (e.g., paramount iterations, as discussed in Sect. 14.4.2);
2. Then, select one or more instance types that produce results close to the bottom-left frontier, i.e., cheap and fast;
3. Finally, perform scalability tests by adding more nodes of the selected instance type to the distributed training until the execution time improvement rates are no longer worth the cost increase rate.


The number of nodes can be increased incrementally by a constant factor (e.g., 4 or 8) or exponentially (e.g., by doubling them); a sketch of this search loop is given at the end of this section. Notice that these experiments must also be performed using a performance proxy to reduce the search cost.

This approach allows the user to reason about the cost and execution time increase/reduction rates. Nonetheless, if the user knows for how many epochs the training process must run, the absolute cost and execution time values can also be estimated based on the initialization overhead, the average step execution time, and the number of steps per epoch. Finally, it is worth noting that this strategy is recommended for training that relies on synchronous data parallelism and assumes configuration prices to grow proportionally to the number of computing devices. This approach may need to be adapted when these assumptions do not hold.
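The sketch below outlines the scalability search of step 3; measure_step_time is a hypothetical helper that deploys the given number of nodes of a VM type, runs a performance proxy (e.g., paramount iterations), and returns the average step time.

```python
# Minimal sketch of the incremental scalability search (step 3 of the approach above).
# `measure_step_time(vm_type, n_nodes)` is a hypothetical helper that deploys the
# configuration, runs a performance proxy, and returns the average step time in seconds.

def search_node_count(vm_type, measure_step_time, max_nodes=16, tolerance=1.0):
    """Double the node count while the speedup still outweighs the cost increase."""
    nodes, step_time = 1, measure_step_time(vm_type, 1)
    while nodes * 2 <= max_nodes:
        candidate = nodes * 2
        new_time = measure_step_time(vm_type, candidate)
        speedup = step_time / new_time                              # execution-time improvement rate
        cost_ratio = (candidate * new_time) / (nodes * step_time)   # cost-increase rate (hourly price cancels out)
        if speedup < tolerance * cost_ratio:                        # gains no longer worth the extra cost
            break
        nodes, step_time = candidate, new_time
    return nodes
```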

14.4.5 Reducing the Cost with Preemptible VMs

Besides multiple hardware configurations (e.g., VM types and number of instances), cloud providers may also offer a variety of service level agreements (SLAs), which may significantly affect the cost of using the cloud. A notable example is the preemptible VM SLA, in which cloud providers offer VM instances for much lower prices when compared to regular VMs; however, they might stop (preempt) these instances to reclaim the computing capacity for other purposes, such as allocating the underlying physical server to other clients. The price of preemptible VMs may vary according to demand and can be much lower than that of on-demand VMs.

Preemptible VMs can be reclaimed with no or very short notice, which usually requires users to employ fault-tolerance strategies to ensure partial results are saved and the computation can later be resumed from where it stopped. Luckily, the typical workflow used for training deep learning models, illustrated in Fig. 14.2, provides easy opportunities for saving and resuming the training process. As an example, users can modify the training loop to save to a persistent storage (e.g., EFS) (i) the epoch and step identifiers (e.g., sequence numbers) and (ii) the model parameters after each set of steps (e.g., after each epoch). Upon failure, the training process can later be resumed by loading the last set of stored parameters and continuing the data fetch process (Step 3) from the last epoch and step sequence numbers that were recorded. Harlap et al. (2017) [7] proposed using a non-preemptible VM to run a parameter server, which is responsible for generating checkpoints so that work can be resumed later upon worker failures. Finally, it is worth noting that AWS SageMaker allows users to employ preemptible VMs to train DL models. Nonetheless, AWS recommends that users implement checkpointing and restore strategies if the training process is long.

Preemptible VMs can be combined with the approach we discussed in the previous sections to reduce the cost of training deep learning models even further.
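The sketch below illustrates such an epoch-level checkpoint/restore scheme in PyTorch; the checkpoint path points to a hypothetical directory on a shared EFS mount, and the model and optimizer are the ones used in the training loop.

```python
# Minimal sketch of epoch-level checkpointing to a shared file system (e.g., an EFS mount).
# The checkpoint path is a placeholder; `model` and `optimizer` come from the training script.
import os
import torch

CKPT = "/mnt/efs/checkpoints/dynunet.pt"   # placeholder path on persistent storage

def save_checkpoint(model, optimizer, epoch):
    torch.save({"epoch": epoch,
                "model_state": model.state_dict(),
                "optimizer_state": optimizer.state_dict()}, CKPT)

def load_checkpoint(model, optimizer):
    """Return the epoch to resume from (0 if no checkpoint exists yet)."""
    if not os.path.exists(CKPT):
        return 0
    ckpt = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1

# In the training loop:
#   start_epoch = load_checkpoint(model, optimizer)
#   for epoch in range(start_epoch, num_epochs):
#       ...train one epoch...
#       save_checkpoint(model, optimizer, epoch)
```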


Nonetheless, since preemptible VM price tags may vary according to demand, they may affect the cost and performance trade-off observed by the user before the training process, when profiling the configurations. To mitigate this problem, the user may set a maximum price tag when renting preemptible VMs, which would prevent unexpectedly high costs. Alternatively, the user may set up a recurrent program (or script) that periodically checks for price changes, computes new cost estimates, and sends notices to the user or automatically adapts the infrastructure to optimize the cost (e.g., by replacing VM types).
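A minimal sketch of such a price-watching script, using the boto3 Python SDK, is shown below; the instance type, region, and price threshold are placeholders.

```python
# Minimal sketch: periodically checking preemptible (spot) prices and flagging increases.
# Instance type, region, threshold, and polling interval are placeholders.
import time
from datetime import datetime
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def current_spot_price(instance_type):
    """Return the lowest current Linux spot price across availability zones."""
    resp = ec2.describe_spot_price_history(
        InstanceTypes=[instance_type],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.utcnow(),          # only the current price per availability zone
    )
    return min(float(p["SpotPrice"]) for p in resp["SpotPriceHistory"])

THRESHOLD = 0.40                              # placeholder: maximum acceptable USD/h
while True:
    price = current_spot_price("g5.xlarge")
    if price > THRESHOLD:
        print(f"Spot price rose to {price:.3f} USD/h; consider switching VM types")
    time.sleep(600)                           # re-check every 10 minutes
```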

14.5 Final Considerations

Deep learning models may vary in architecture, frameworks, and in the methods used for partitioning and parallelizing them. Nonetheless, a typical workflow is followed in order to train them. This includes preparing the data, setting the model and environment up, training it, and fetching the results. This chapter discussed how different service models offered by cloud providers automate this typical workflow. At the IaaS level, besides the inherent increase in flexibility, the burden of configuring a suitable environment falls completely on the users. At the PaaS level, users are able to run their applications in an environment that has been specially tailored for machine learning. We demonstrated how SageMaker, a specialized Amazon service for developing and training deep learning models, can automate much of the training workflow at an affordable price. In the use case presented, of segmenting 3D medical images, we showed that SageMaker greatly reduces the initial overhead of initializing a VM, fetching the data, and setting the environment up, compared to the IaaS service model, where users must deploy and configure the whole infrastructure, thus being a good choice for small workloads. However, for long-running workloads, training with SageMaker is less attractive, as the costs tend to increase due to higher instance price tags.

Besides that, choosing the appropriate resources that maximize the training efficiency and reduce costs may not be trivial. First, it involves estimating the performance, which can be done by using a mathematical model or a proxy application to estimate the training performance on a set of resources. We showed that early-stopping mechanisms, such as paramount iterations, can be a good performance proxy, with minimal overhead. After that, the resources must be chosen wisely. In the use case we presented, choosing a VM type that is not suitable for the task, such as p2.xlarge, can be 5.43x slower and 4.86x costlier than choosing g5.xlarge. This chapter also discussed approaches for dealing with this problem. Finally, distributed training can be a good way of exploring the trade-off between cost and execution time, as well as handling large datasets. For instance, in our experiments we managed to reduce the execution time by 3.69x when using four g4dn.xlarge nodes instead of only one, with an increase in cost of only 1.08x.


Regarding the distributed training strategy, data parallelism is usually simple to use, as it is natively implemented in several deep learning frameworks, and it is a good choice when the whole model fits in the accelerator memory, since every participant of the training keeps a local copy of the model that must be synchronized at the end of every epoch. If this condition is not met, other parallelism options can be used, such as model and pipeline parallelism. Other efficient training strategies involve asynchronous training architectures that use parameter server approaches. These strategies allow elasticity in the training process, enabling it to take more advantage of preemptible instances and to reduce the training cost even further.

Acknowledgments The authors would like to thank the following funding agencies for supporting their research into High-Performance Cloud Computing: FAPESP (process 2013/08293-7) and CNPq (processes 314645/2020-9 and 404087/2021-3).

References

1. ANTONELLI, M., REINKE, A., BAKAS, S., FARAHANI, K., KOPP-SCHNEIDER, A., LANDMAN, B. A., LITJENS, G., MENZE, B., RONNEBERGER, O., SUMMERS, R. M., ET AL. The medical segmentation decathlon. Nature Communications 13, 1 (2022), 1–13.
2. BEN-NUN, T., AND HOEFLER, T. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. ACM Computing Surveys (CSUR) 52, 4 (2019), 1–43.
3. BRUNETTA, J. R., AND BORIN, E. Selecting efficient cloud resources for HPC workloads. In Proceedings of the 12th IEEE/ACM International Conference on Utility and Cloud Computing (2019), pp. 155–164.
4. BUREAL, E. Abacus.ai announces series B funding of $22M and Abacus.ai Deconstructed, a set of stand-alone modules that help organizations deploy AI models in production. https://enterprisetalk.com/news/abacus-ai-announces-series-b-funding-of-22m-and-abacus-ai-deconstructed-a-set-of-stand-alone-modules-that-help-organizations-deploy-ai-models-in-production Published at 19/11/2020. Accessed at 15/08/2022.
5. GOODFELLOW, I., BENGIO, Y., AND COURVILLE, A. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
6. GUPTA, S., ZHANG, W., AND WANG, F. Model accuracy and runtime tradeoff in distributed deep learning: A systematic study. In 2016 IEEE 16th International Conference on Data Mining (ICDM) (2016), IEEE, pp. 171–180.
7. HARLAP, A., TUMANOV, A., CHUNG, A., GANGER, G. R., AND GIBBONS, P. B. Proteus: Agile ML elasticity through tiered reliability in dynamic resource markets. In Proceedings of the Twelfth European Conference on Computer Systems (EuroSys) (2017), pp. 589–604.
8. ROBBINS, H., AND MONRO, S. A stochastic approximation method. Annals of Mathematical Statistics (1951).
9. ISENSEE, F., JAEGER, P. F., KOHL, S. A., PETERSEN, J., AND MAIER-HEIN, K. H. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods 18, 2 (2021), 203–211.
10. JIANG, Z., DING, C., LIU, M., AND TAO, D. Two-stage cascaded U-Net: 1st place solution to BraTS challenge 2019 segmentation task. In International MICCAI Brainlesion Workshop (2019), Springer, pp. 231–241.
11. KIM, S., CHUN, J., AND DEY, A. K. Sensors know when to interrupt you in the car: Detecting driver interruptibility through monitoring of peripheral interactions. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (2015), pp. 487–496.


12. KRIZHEVSKY, A., SUTSKEVER, I., AND HINTON, G. E. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (2012), 1097–1105.
13. MALTA, E. M., AVILA, S., AND BORIN, E. Exploring the cost-benefit of AWS EC2 GPU instances for deep learning applications. In Proceedings of the 12th IEEE/ACM International Conference on Utility and Cloud Computing (2019), pp. 21–29.
14. MATHURIYA, A., BARD, D., MENDYGRAL, P., MEADOWS, L., ARNEMANN, J., SHAO, L., HE, S., KÄRNÄ, T., MOISE, D., PENNYCOOK, S. J., ET AL. CosmoFlow: Using deep learning to learn the universe at scale. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis (2018), IEEE, pp. 819–829.
15. MAYER, R., AND JACOBSEN, H.-A. Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools. ACM Computing Surveys (CSUR) 53, 1 (2020), 1–37.
16. MCCULLOCH, W. S., AND PITTS, W. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics 5, 4 (1943), 115–133.
17. NASSIF, A. B., SHAHIN, I., ATTILI, I., AZZEH, M., AND SHAALAN, K. Speech recognition using deep neural networks: A systematic review. IEEE Access 7 (2019), 19143–19165.
18. RAMOS, E. G., AND MARTÍNEZ, F. V. A review of artificial neural networks: How well do they perform in forecasting time series? Analítika: revista de análisis estadístico 2, 6 (2013), 7–18.
19. RONNEBERGER, O., FISCHER, P., AND BROX, T. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (2015), Springer, pp. 234–241.
20. SCHMIDHUBER, J. Deep learning in neural networks: An overview. Neural Networks 61 (2015), 85–117.
21. TESSER, R. K., MARQUES, A., AND BORIN, E. Selecting efficient VM types to train deep learning models on Amazon SageMaker. In 2021 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW) (2021), IEEE, pp. 20–27.
22. TOBON-GOMEZ, C., GEERS, A. J., PETERS, J., WEESE, J., PINTO, K., KARIM, R., AMMAR, M., DAOUDI, A., MARGETA, J., SANDOVAL, Z., ET AL. Benchmark for algorithms segmenting the left atrium from 3D CT and MRI datasets. IEEE Transactions on Medical Imaging 34, 7 (2015), 1460–1473.
23. WANG, W., CHEN, G., DINH, A. T. T., GAO, J., OOI, B. C., TAN, K.-L., AND WANG, S. SINGA: Putting deep learning in the hands of multimedia users. In Proceedings of the 23rd ACM International Conference on Multimedia (2015), pp. 25–34.
24. ZHAO, T. Seismic facies classification using different deep convolutional neural networks. In SEG Technical Program Expanded Abstracts 2018. Society of Exploration Geophysicists, 2018, pp. 2046–2050.

Appendix A

Deploying an HPC Cluster on AWS

Edson Borin and Otávio O. Napoli

This appendix describes, step by step, how to deploy the infrastructure for a cloud-based HPC cluster on the AWS provider. A simple cluster, based on Fig. 4.3, shall be used to illustrate the process. First, Sect. A.1 demonstrates how to create the infrastructure using the web interface. Then, Sect. A.2 shows the same process using a command-line interface. Finally, Sect. A.3 discusses how to deploy the cluster infrastructure using Ansible, an automation tool that can be used to implement IaC frameworks.

A.1 Deploying Infrastructure Using the Web Console

To create virtual machines in AWS Elastic Compute Cloud (EC2), the user must first create an account in the AWS services1 to access the AWS Web Console. Once the account is created, the user may log into the AWS Web Console using an Internet browser and proceed to create a Virtual Private Cloud (VPC) network.

1 https://aws.amazon.com/ec2/.



Fig. A.1 Creating a VPC

A.1.1 Creating the VPC Network

AWS' VPC allows users to control their virtual networking environment, including resource placement, connectivity, and security. To create the Virtual Private Cloud (VPC) network, the user must first locate the VPC service in the AWS console search bar. Once the process is started, the user will be provided with a window to set the VPC up, as illustrated in Fig. A.1. In the example, a VPC named "my-vpc" is created with default values, which provides up to 256 private IPs, allowing the user's resources to communicate with each other as if they were in the same LAN.

Subnets are logical divisions of the VPC, in which fine-grained network rules can be applied. The creation of a subnet is usually required to use some cloud services, such as the EFS. Figure A.2 illustrates how to create a subnet named "my-subnet" inside the "my-vpc" VPC. Here, the subnet is created with the default parameters, spanning the whole VPC and providing up to 256 private IPs.


Fig. A.2 Creating a subnet inside the VPC

A.1.2 Creating a Shared File System Using the AWS Elastic File System (EFS)

AWS offers the Elastic File System (EFS) service,2 which provides a simple, serverless, set-and-forget file-based storage to be used together with other AWS cloud resources. The EFS can be mounted and used by multiple VMs concurrently, and it grows and shrinks automatically as users add and remove files, without any need for management.

2 https://aws.amazon.com/efs/.


Fig. A.3 Creating an EFS shared file system

In order to create an EFS, the user must locate the EFS service in the AWS web console's search bar and select the "create file system" option, which will start the process by displaying a configuration window, as illustrated in Fig. A.3. The user may, optionally, assign a name to the EFS file system (e.g., "My EFS") and select a VPC to connect the file system to (e.g., "my-vpc", as created previously). Finally, the user must select whether the EFS file system will be stored regionally (redundantly across multiple availability zones) or in one zone (redundantly within a single availability zone). It is worth noticing that the one-zone option can be significantly cheaper than the regional one.


Fig. A.4 Initiating the launch instance process

A.1.3 Instantiating Virtual Machines

To instantiate the virtual machines, the user must locate the EC2 service in the AWS web console's search bar, click on the "Launch Instances" button, and select the "Launch Instance" option, as illustrated in Fig. A.4. This allows the creation of several virtual machines with the same configuration. Proceeding with the instance launch, the next step is to select a virtual machine image, as illustrated in Fig. A.5. On AWS, a virtual machine image is called an Amazon Machine Image, or AMI. There are several AMIs available on AWS, each of them with an operating system and a set of pre-installed software. In this example, the "Ubuntu Server 20.04 LTS (HVM), SSD Volume Type" (64-bit, x86) image is selected.



! Virtual Machine Image Architecture

Operating systems are tightly coupled with the underlying machine instruction set architecture, or ISA (e.g., x86 or ARM). AWS allows selecting virtual machine images compiled for different machine architectures.

After selecting the virtual machine image, the user must proceed by selecting the virtual machine type, as illustrated in Fig. A.6. As discussed in Sect. 4.2.1, VM types may be grouped in families or series. On AWS, the T- and M-family VM types, i.e., VM types starting with the letter T or M, correspond to general-purpose virtual machine types. In this example, the t2.medium VM type was selected. This VM type comprises 2 vCPUs and 4 GB of RAM.

Once the VM instance is running, AWS allows users to access it through SSH connections; however, the access has to be authenticated with a private/public key pair (at least for the first access). Thus, the VM must be configured with a public key for which the user holds the private pair. The user may create the key when deploying the VM by choosing the option "Create a new key pair". In case a new key pair is created, AWS will generate a private key for the user and add the respective public key to the instance.


Fig. A.5 Selecting the virtual machine image



! A Note About Key Pairs

The user may create a new key pair when deploying VMs or use a pre-registered key pair. If the user opts for creating a new key pair, the newly created private key must be downloaded immediately. This download can be done only once and, without the private key, the user cannot log into the machine.

The next step consists of configuring the network interfaces of the virtual machines, as illustrated in Fig. A.7. In order to include the instances in the freshly created VPC, the user must select it in the VPC drop box ("my-vpc", in this case) and the proper subnet ("my-subnet", in this case). Next, the user must configure the security groups. Security groups act as a virtual firewall for the user's instances to control incoming and outgoing traffic. In this example, the security group called "my-secgroup" was created and, for simplicity, all traffic from/to anywhere is allowed for all instances in the subnet, as shown in Fig. A.7.


Fig. A.6 Selecting the virtual machine type

Finally, the user must define the size and other parameters of the block-based storage that will be deployed to store the virtual machine image, as illustrated in Fig. A.8. In this case, an 8 GB EBS volume is attached to the instance, which will be mounted at /dev/sda1. If the "Delete on Termination" option is set to true, this volume will be automatically destroyed after the VM instance is terminated. Otherwise, the volume will be kept and remain available to be attached to another VM instance. After configuring these options, the user may select the number of VM replicas to instantiate and click the launch instance button to place a request for creating the VMs.

Once the instances are started, the user may log into the virtual machines by executing the command ssh -i "<private-key-file>" ubuntu@<hostname>, in which <private-key-file> is the path to the private key used when creating the VM and <hostname> is the virtual machine hostname or public IP address. The hostname, or the public IP address, may be retrieved from the AWS EC2 console, which lists all VMs created by the user.

A.2 Deploying Infrastructure Using the AWS Command-Line Interface

Cloud providers usually provide means to deploy their resources programmatically, for example, by using command-line tools. This section discusses how to deploy the infrastructure at AWS EC2, as presented in the previous section, using a command-line interface. For simplicity, this section focuses the discussion on the deployment of the virtual machines and assumes the VPC and the EFS have already been created by the user,3 as in the previous section.


Fig. A.7 Setting up the virtual machine network

For AWS, the resources can be accessed using the aws-cli command-line tool.

3 Notice that the VPC and the EFS can also be deployed and managed using a command-line interface.


Fig. A.8 Configuring the virtual machine storage



Reproducing Command-Line Commands

In this example, and in the ones to come, it is assumed that the user runs the commands in a command terminal of a Unix-like operating system. In particular, the commands presented were tested using the bash shell on a machine running Ubuntu 20.04.

At AWS, the username and password information are used to access the web console. To use the aws-cli tool, the user needs another credential for programmatic access. This credential, called the access key, is composed of two elements: the access key ID, which publicly identifies the user, and the secret access key, which is the private key for programmatic access. To create this credential, the user may consult the Amazon documentation.4 The aws-cli tool can be easily installed on Ubuntu and other Debian-based Linux operating systems through the default package manager (e.g., using sudo apt install awscli). Once the aws-cli tool is installed, the user must

4 https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys.


inform his/her access credentials. Using a shell terminal, this can be done by executing the command aws configure, which prompts the user for the credentials (access key ID and secret access key). Finally, the user can perform a dry-run test by querying the instances with the command aws ec2 describe-instances --dry-run. In case of errors, the respective error messages are printed on the terminal in which the command was executed. To instantiate three VMs of type t2.medium, the user can use the command aws ec2 run-instances, as indicated in the following code:5

Instantiating Instances Using AWS CLI

aws ec2 run-instances \
    --image-id ami-04505e74c0741db8d \
    --count 3 \
    --instance-type t2.medium \
    --key-name MyKeyPair

The previous command specifies the following parameters:

• image-id: the virtual machine image ID (AMI ID). The example in the previous section uses the Ubuntu 20.04 Server LTS image, which has the following AMI ID: ami-04505e74c0741db8d.6 The ID of other AMIs can be queried via the web interface or by executing the command aws ec2 describe-images.
• count: the number of VM instances to create.
• instance-type: the VM type.
• key-name: the name of the key pair registered in the AWS keyring. Every public key must be registered in the AWS keyring under a key pair ID. If the user already has a public/private key pair but has not registered it in the AWS keyring yet, this can be done with the command aws ec2 import-key-pair. Alternatively, the user can create a new key pair and register the public key in the AWS keyring with the command aws ec2 create-key-pair; this returns the public and private keys to the user and registers the public one in the AWS keyring under the user-specified ID (a sketch of both options is shown below).

Running the previous command places a request to start three t2.medium VM instances at AWS. The command returns the IDs created for the virtual machine instances; these IDs can be used to issue commands to specific VM instances.
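As referenced in the key-name description above, the key-pair registration could be done with commands such as the ones below; the key name and file paths are hypothetical, and the exact flags may vary slightly between aws-cli versions.

# Create a new key pair and save the returned private key locally
aws ec2 create-key-pair --key-name MyKeyPair \
    --query 'KeyMaterial' --output text > MyKeyPair.pem
chmod 400 MyKeyPair.pem

# Alternatively, register an existing public key under a key pair ID
aws ec2 import-key-pair --key-name MyKeyPair \
    --public-key-material fileb://~/.ssh/id_rsa.pub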

5 This is a single command. Backslashes at the end of each line inform bash that the contents of the next line belong to the same command.
6 This ID is shown at the bottom of Fig. A.5.


For example, the user may retrieve the properties of a virtual machine instance by executing the command aws ec2 describe-instances --instance-ids ID, where ID must be replaced by the instance ID. The aws ec2 stop-instances --instance-ids IDs and aws ec2 terminate-instances --instance-ids IDs commands can be used to stop and terminate instances. Again, the user must inform one or more instance IDs.
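For instance, assuming a hypothetical instance ID, these operations look like:

# Inspect, stop, and terminate a specific instance (hypothetical instance ID)
aws ec2 describe-instances --instance-ids i-0123456789abcdef0
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0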

A.3 Deploying Infrastructure Using Ansible

This section discusses how to deploy the infrastructure for a cloud-based HPC cluster using an IaC tool called Ansible. Ansible is a well-known open-source automation tool with a gentle learning curve: the user does not need to learn a programming language, and only Python and SSH are required on the managed machines,7 as well as on the control machine. Ansible can be installed on the control machine using Python's pip tool or other software management tools from the operating system. Boto3, AWS's Python SDK for programmatic access, is also required on the control machine to enable Ansible to interact with AWS resources. Both Ansible and Boto3 can be installed with the command python3 -m pip install ansible boto3.

In order to deploy and configure infrastructure, Ansible relies on Playbooks, which are YAML ("YAML Ain't Markup Language") files where Ansible code is written to indicate where, how, and which operations must be executed. Each Playbook comprises one or more Plays, as illustrated in Fig. A.9. The goal of a Play is to define the set of tasks that must be performed on a group of hosts; hence, each Play is defined by a group of hosts and a set of tasks. In the example of Fig. A.9, the first Play contains three tasks that must be executed on Group 1, which contains one host. The Plays and their tasks are executed sequentially on the groups of hosts identified by the Playbook. Each task is defined by a module (a Python script) and a set of parameters for this module. When executing the Playbook, Ansible copies the module into the hosts, executes it using the set of parameters defined by the task, and, finally, removes it. For instance, in the Playbook illustrated in Fig. A.9, Play 1 executes Tasks A, B, and C, sequentially, on the hosts belonging to Group 1; Task A executes Module 1 with its set of arguments, and likewise for the other tasks. This structure allows modules to be reused by tasks in other Playbooks or in other Plays of the same Playbook. Ansible provides a wide variety of modules, ranging from provisioning to management modules. Finally, the hosts (the machines on which the modules will be executed) and how they are grouped can be specified using an inventory, which assigns hosts to groups.

7 These applications are very common in almost any new Linux distribution and usually come installed by default.


Fig. A.9 Example of an Ansible Playbook. (The diagram shows a Playbook with two Plays: Play 1 applies Tasks A, B, and C, each invoking a module with its arguments, to the hosts of Group 1; Play 2 applies Tasks D and E to the hosts of Group 2. The groups of hosts are defined by the inventory.)

To execute the tasks, Ansible opens an SSH connection to each host, copies the required modules, runs the necessary commands inside the host environment, and, finally, removes the modules from the host. In this way, Ansible requires only that hosts have SSH and Python installed, allowing virtually any Linux VM to be used as a host with no need to install extra software. Besides the hosts defined by the inventory, localhost is the host used when executing modules on the controller, i.e., the machine that runs Ansible.

An Ansible Playbook may be used to create VMs on a cloud provider. As an example, the Playbook below creates two c6i.32xlarge VMs on AWS. The Playbook follows the YAML syntax and the Plays are declared as a list of dictionaries, organized hierarchically through indentation. Also, the # character is used for line comments. This Playbook contains only one Play, which is identified by the name "Deploy cluster nodes". The set of machines to which this Play will be applied is specified by the key "hosts", which must be associated with a value that identifies the machines.8 The Play tasks must be organized in a list identified by the key "tasks". Each task is defined by a dictionary with, usually, two attributes: a name (e.g., using the "name" key) and a module.9 The name of the module attribute (e.g., amazon.aws.ec2) identifies the module (i.e., the Python script) and its value identifies the parameters that must be used when executing the module. In the following example, there is only one task, named "Deploy compute nodes", that invokes the module amazon.aws.ec2 with several parameters (e.g., key_name, instance_type, etc.) to instantiate two c6i.32xlarge VMs on AWS.10

8 This identifier may be: (a) the identifier of a group in the inventory; (b) "localhost", i.e., the control machine; or (c) the keyword "all", which specifies all hosts in the inventory, regardless of the group.
9 A task may also contain other attributes, but for the sake of simplicity this example shows only the name and the module attributes.


deploy-cluster-nodes.yaml

---
# A single Play
- name: Deploy cluster nodes
  # The Play will be executed at the control machine (localhost)
  hosts: localhost
  # The list of tasks of this Play
  tasks:
    # The task's name
    - name: Deploy compute nodes
      # This task calls a module named amazon.aws.ec2
      amazon.aws.ec2:
        # keypair name
        key_name: mykeypair
        # instance type
        instance_type: c6i.32xlarge
        # AMI ID
        image: ami-04505e74c0741db8d
        # Number of VM replicas
        count: 2
        # Region being used
        aws_region: us-east-1
        # Access key ID
        aws_access_key: AKIAVWOJMI5XXXXXXXX
        # Secret access key
        aws_secret_key: XXXXXX

In order to use this Playbook, the user must save it to a text file and adjust the following parameters of the "amazon.aws.ec2" module:

• key_name: the AWS key pair ID.
• aws_access_key: the user's access key ID.
• aws_secret_key: the user's secret access key.

Private Information on Ansible Playbooks

The parameters adjusted above are user specific and contain private data (especially the access and secret keys); hence, this Playbook is not supposed to be shared with other users and should not be stored in code versioning repositories.11

10 For simplicity, the example contains only one task, which deploys the computing nodes. However, notice that it can easily be extended with a new task to also deploy the login node.


Nonetheless, in many cases it is useful to turn these Playbooks into templates that can be used by several users or in different contexts. In these cases, the user may omit these parameters from the Playbook; the amazon.aws.ec2 module will then look for the missing information in local environment variables (e.g., AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), which can be set by each user in their local environment.
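For example, a user could export the credentials in the shell session before running the Playbook; the values below are placeholders.

# Export AWS credentials as environment variables (placeholder values)
export AWS_ACCESS_KEY_ID="AKIAXXXXXXXXXXXXXXXX"
export AWS_SECRET_ACCESS_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# Optionally, the default region can be set as well
export AWS_DEFAULT_REGION="us-east-1"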

Assuming the Playbook file is named "deploy-cluster-nodes.yaml", once it is adapted and saved, the user may run the following command to deploy the infrastructure for the cluster nodes: ansible-playbook deploy-cluster-nodes.yaml. After running this command, Ansible will create the virtual machines on the cloud provider and print useful information in the terminal, such as the instance IDs and connection IPs. Ansible has a large collection of modules to perform several operations on AWS EC2 services, such as creating key pairs, querying instance status, creating network interfaces, and creating storage, among others.

11 Notice that there may be a risk of private data leakage in case the repository is shared with others.

Appendix B

Configuring a Cloud-Deployed HPC Cluster

Edson Borin and Otávio O. Napoli

Configuring an HPC cluster comprises a sequence of tasks (e.g., commands) that must be performed inside the cluster nodes. Typical tasks include copying/moving files, creating directories, exchanging SSH keys, changing files’ permissions and ownership, starting services, installing packages, etc. This appendix illustrates how to configure and use an HPC cluster deployed on the cloud.

B.1 Introduction

To illustrate the configuration and usage of a cloud-deployed HPC cluster, this appendix discusses how to configure a simple cluster with the SLURM batch-queueing system. SLURM is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for small and large Unix clusters. A minimal SLURM cluster is composed of two components: one SLURM controller (slurmctld), usually installed on the login node, and multiple SLURM computing nodes (slurmd). At the login node, the user may submit, query, and perform other job-related operations. The user may also run small tasks, such as compiling the HPC application code. In general, job results are saved on paths mapped onto the shared file system (e.g., an EFS file system), which can be accessed by the user from the login node and is also visible to all computing nodes. In order to configure a batch-queueing SLURM cluster, the user must mount the shared file system (EFS, in our example), install the SLURM software and its dependencies, and configure them properly on all nodes. The next sections show how to configure the SLURM cluster using the command-line interface (Sect. B.2) and then using the Ansible automation tool (Sect. B.3).


The concepts will be illustrated using as an example the cluster shown in Fig. 4.3. All the components needed (virtual network, shared file system, and virtual machines) can be created based on the instructions presented in the previous appendix. To install the SLURM tool, the user shall:

1. Mount the EFS file system at all nodes;
2. Set up password-less SSH access between all nodes;
3. Install MUNGE at all nodes;
4. Create a MUNGE key at one node and distribute this key to all nodes;
5. Start the MUNGE service at all nodes;
6. Install SLURM at all nodes;
7. Configure SLURM at all nodes; and
8. Start the slurmctld service at the login node and the slurmd service at the computing nodes.

B.2 Configuring the Cluster Using the Command-Line Interface

B.2.1 Mounting the EFS File System

Once the EFS file system and the VMs are deployed, the file system must be mounted on each VM, i.e., on each cluster node. To do so, the user may log into each VM, create a directory on the local VM file system, and invoke the operating system mount tool to mount the EFS file system on this directory. The following code, which must be executed inside the VM, shows how to create a directory named efs inside the user's home directory (mkdir command) and how to mount the EFS file system on this directory (mount command). The mount command takes several arguments.1 The <efs-id> argument must be replaced by the EFS identifier, which can be the IP address or the hostname associated with the EFS system.

Mounting the EFS File System

# Create the mounting point
mkdir -p ~/efs
# Mount the remote EFS file system
sudo mount -t nfs -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,\
timeo=600,retrans=2,noresvport <efs-id>:/ ~/efs
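To check that the mount succeeded, the user may, for example, list the mounted file system:

# Verify that the EFS file system is mounted at ~/efs
df -h ~/efs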

1 These are the default parameters to mount an EFS system currently recommended by AWS.


Fig. B.1 Retrieving the EFS file system identifier. (The figure shows the AWS EFS console page for “My EFS” (fs-0c7074a59164ea65d), highlighting the file system DNS name: fs-0c7074a59164ea65d.efs.us-east-1.amazonaws.com.)

The EFS identifier can be retrieved from the AWS EFS service. To do so, the user may log into the web console, select the EFS service, and select the desired EFS file system to inspect its properties, as illustrated in Fig. B.1. The red box indicates the DNS name of the EFS file system (fs-0c7074a59164ea65d.efs.us-east-1.amazonaws.com).

B.2.2 Configuring SSH for Password-Less Connections

In order to use SLURM, and several other HPC frameworks (e.g., MPI), the machines belonging to the cluster must be able to communicate with each other via password-less SSH connections. One way to set this kind of connection up is by generating a single public/private key pair for all nodes in the cluster. The key pair can be generated using the following command (which can be executed on the user's local machine): ssh-keygen -t rsa -b 4096 -f <path>/id_rsa, where <path> must be replaced with the path where the key pair shall be generated. After that, the user must copy these keys to the default SSH directory (usually ~/.ssh) on all machines. Finally, the user must append the contents of the public key (usually, the id_rsa.pub file) to the authorized keys file (usually ~/.ssh/authorized_keys) on all machines. Once done, the user may perform SSH connections between the machines without being prompted for passwords.
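The steps above could look like the following sketch; the key directory (~/cluster-keys), the access key file (my-key.pem), and the node address placeholder are hypothetical.

# On the local machine: generate a single key pair for the cluster
mkdir -p ~/cluster-keys
ssh-keygen -t rsa -b 4096 -f ~/cluster-keys/id_rsa -N ""

# Copy the key pair to each node's default SSH directory
scp -i my-key.pem ~/cluster-keys/id_rsa ~/cluster-keys/id_rsa.pub \
    ubuntu@<node-address>:~/.ssh/

# On each node: authorize the public key and adjust permissions
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/id_rsa ~/.ssh/authorized_keys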


B.2.3 Installing and Configuring MUNGE

MUNGE is an authentication service for creating and validating user credentials and must be installed at all nodes. In order to install MUNGE, the user must log into each virtual machine and execute the following commands:

Installing MUNGE at All Nodes

# Updating packages
sudo apt update

# Creating a Munge user
export MUNGEUSER=3456
sudo groupadd -g $MUNGEUSER munge
sudo useradd -m -d /var/lib/munge -u $MUNGEUSER \
    -g munge -s /sbin/nologin munge

# Installing MUNGE
sudo apt install -y munge libmunge2 libmunge-dev

# Setting MUNGE permissions
sudo chown -R munge: /etc/munge/ /var/log/munge/ \
    /var/lib/munge/ /run/munge/
sudo chmod 0700 /etc/munge/ /var/log/munge/ \
    /var/lib/munge/
sudo chmod 0755 /run/munge/

The previous commands create a MUNGE user and group, install the required packages, and set the directory permissions required by MUNGE. For security reasons, a dedicated MUNGE user and group are used for MUNGE operations. Once MUNGE is installed, a MUNGE key must be generated using the command sudo /usr/sbin/create-munge-key. The key, which will be created at /etc/munge/munge.key, must then be copied to all nodes belonging to the cluster, preserving its ownership and permissions (a possible way to do this is sketched after the next listing). After that, the MUNGE service must be started at all nodes using the following commands:

Starting MUNGE

sudo systemctl enable munge
sudo systemctl start munge
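As referenced above, one way to distribute the munge.key created on one node to the other nodes is sketched below; the compute node name (compute01) and the use of scp are illustrative assumptions, and the copy goes through a temporary file so that ownership and permissions can be restored on the destination node.

# On the node where the key was created: copy it to another node (hypothetical host name)
sudo scp -i ~/.ssh/id_rsa /etc/munge/munge.key ubuntu@compute01:/tmp/munge.key

# On the receiving node: move the key into place and restore ownership and permissions
sudo mv /tmp/munge.key /etc/munge/munge.key
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 0400 /etc/munge/munge.key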


B.2.4 Installing and Configuring SLURM

Finally, the SLURM workload manager can be installed at all nodes by executing the command sudo apt install -y slurm-wlm. Once SLURM is installed, it must be configured properly. The SLURM configuration is driven by a configuration file, which tells the SLURM controller where the SLURM computing nodes are located, among several other configuration parameters. The SLURM configuration file must be created at the /etc/slurm-llnl/slurm.conf path on the login node. The following listing shows an example of the contents of a SLURM configuration file:

/etc/slurm-llnl/slurm.conf

ControlMachine=<login-hostname>
AuthType=auth/munge
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=root
SlurmdUser=root
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/linear
AccountingStorageType=accounting_storage/none
ClusterName=cluster
JobAcctGatherType=jobacct_gather/none
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
NodeName=<compute01-hostname> CPUs=2 State=UNKNOWN
NodeName=<compute02-hostname> CPUs=2 State=UNKNOWN
PartitionName=Part Nodes=<compute01-hostname>,<compute02-hostname> State=UP

In the previous listing, the ControlMachine option defines the hostname of the machine where slurmctld is running (usually the login node); the <login-hostname> placeholder must be replaced by the login node hostname. Also, the NodeName options define the computing nodes and must be populated with the computing nodes' hostnames (in place of <compute01-hostname> and <compute02-hostname>). Finally, it is worth noting that the cluster may have many computing nodes, which can be managed by the SLURM controller by adding new NodeName entries to this configuration file.
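As an illustration, assuming hypothetical AWS-style private hostnames (each node's hostname can be obtained by running hostname -s on that node), the host-specific lines could look like:

# Hypothetical hostnames; obtain the real ones with hostname -s on each node
ControlMachine=ip-10-0-0-10
NodeName=ip-10-0-0-11 CPUs=2 State=UNKNOWN
NodeName=ip-10-0-0-12 CPUs=2 State=UNKNOWN
PartitionName=Part Nodes=ip-10-0-0-11,ip-10-0-0-12 State=UP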


Once the SLURM configuration file is properly set up at all nodes (the same file must be present on all of them), the SLURM controller can be initialized by executing the following commands at the login node:

Starting the SLURM Controller

sudo systemctl enable slurmctld
sudo systemctl start slurmctld

Finally, SLURM can be initialized on the computing nodes by executing the following code on all of them:

Starting the SLURM Computing Nodes

sudo systemctl enable slurmd
sudo systemctl start slurmd

Once started, the user can submit jobs at the login node using the SLURM interface.

B.3 Configuring the Cluster Using Ansible

The conventional approach to configuring the cloud-based HPC system, presented in the previous section, involves executing several commands on each one of the computing nodes, which may be a tedious and error-prone task, especially if the system contains many computing nodes. In this context, it is important to use software automation tools to configure the infrastructure. These tools can be used to automate the execution of tasks (e.g., batch processes, workflows, etc.) across the infrastructure. Examples of such tools include Terraform, Chef, Puppet, Salt Stack, Ansible, elasticluster, CLAP, and OpenStack, among others. This section discusses how the Ansible tool can be used to automate the configuration of a cloud-based HPC system using the same setup presented in the previous section. The following workflow will be used to configure the cluster with Ansible:

1. Create the Playbook inventory to identify and group the cluster nodes;
2. Write a Playbook with tasks to configure the cluster nodes; and
3. Execute the Playbook to configure the cluster nodes.


B.3.1 Creating the Playbook Inventory

As discussed in Sect. 4.4.2, each Ansible Play contains a set of tasks that are applied to a group of hosts. These groups are defined in an Ansible inventory file, a text file that maps each group name to a list of host names. The following Ansible inventory file maps the group named "control" to a single host, named "login", and the group named "compute" to two hosts, named "compute01" and "compute02". Each host is associated with a host address (ansible_host) and a user (ansible_user), which will be used by Ansible when performing the SSH connection.

Ansible Inventory File

control:
  hosts:
    login:
      ansible_host: <login-node-address>
      ansible_user: ubuntu
compute:
  hosts:
    compute01:
      ansible_host: <compute01-address>
      ansible_user: ubuntu
    compute02:
      ansible_host: <compute02-address>
      ansible_user: ubuntu

Similar to the Ansible Playbook file, the inventory file may be stored anywhere on the user's file system. It is worth noticing that Ansible allows the inventory to be written in several formats; the previous example uses the YAML format. The inventory file defines two groups: control and compute. The control group has one node, the login node, and the compute group has the other two nodes, the computing nodes. Groups are a way to organize hosts, allowing Ansible Playbooks to execute specific tasks only on the nodes of a given group, or on all of them. The user should note that the node addresses (used to perform the SSH connections) must replace the <...> placeholders in the inventory file above.
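Once the addresses are filled in, one way to check that Ansible can reach every host in the inventory is an ad-hoc ping; the key path below is a hypothetical example.

# Test SSH connectivity to all hosts declared in the inventory
ansible all -i inventory.yaml -m ping --private-key ~/.ssh/my-key.pem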

B.3.2 Configuring the HPC Cluster

Once the inventory and the key pair have been created, a single Playbook file can be used to configure an HPC cluster with SLURM. This includes:

1. Mounting the EFS;
2. Setting up password-less SSH between all nodes;
3. Installing, setting up, and starting MUNGE; and
4. Installing and setting up SLURM.

All Plays and tasks must be defined in a single Playbook file, which is named configure-cluster.yaml in this example; nonetheless, to simplify the discussion, its contents will be shown in multiple listings. For simplicity, it is assumed that the user's key pair (generated in the previous steps) is located at ~/.ssh, and that the slurm.conf file (generated in the previous section) is located in the user's home directory (i.e., ~/slurm.conf). Finally, this example assumes the EFS host is the one indicated in Fig. B.1; it must be adjusted in the Playbook accordingly.

The configure-cluster.yaml Playbook file contains three Plays. The first one mounts the EFS, installs SLURM and MUNGE, and configures the SSH connection and the MUNGE authentication service on all nodes; the second one sets the SLURM system up on the login node; and, finally, the third one sets the SLURM system up on the computing nodes. The following listing shows the first three tasks of the first Play. The first task uses the mount module to mount the EFS file system at ~/efs. The become: yes key-value pair informs Ansible that this task must be executed by a superuser, i.e., the root user. The second task copies the private key to all hosts, and the third one uses the authorized_key module to set the password-less SSH connection up.

configure-cluster.yaml (part 1)

---
# Play 1: Install SLURM/MUNGE and setup SSH/MUNGE on all nodes
- hosts: all
  name: Install SLURM and setup SSH/MUNGE on all nodes
  tasks:
    - name: Mount EFS volume
      become: yes
      mount:
        name: ~/efs
        src: fs-0c7074a59164ea65d.efs.us-east-1.amazonaws.com:/
        fstype: nfs4
        opts: 'nfsvers=4.1'
        state: mounted

    - name: Copy private key to nodes
      copy:
        src: ~/id_rsa.pem
        dest: ~/.ssh/id_rsa
        mode: '0400'
        owner: ubuntu
        group: ubuntu


    - name: Set the authorized keys file for password-less SSH access
      authorized_key:
        user: ubuntu
        state: present
        key: "{{ lookup('file', '~/id_rsa.pub') }}"

The following listing shows the remaining tasks of the first Play, which are used to set the MUNGE service up. First, the munge user and the munge group are created using the group and the user modules. Then, the MUNGE software is installed2 using the apt module. Next, the MUNGE key file is created using the shell module, and the login node's munge key is copied to the remaining nodes using the synchronize module. Finally, the munge service is enabled and started.

configure-cluster.yaml (part 2)

    - name: Create and setup the munge group
      become: yes
      group:
        name: munge
        state: present

    - name: Create and setup the munge user
      become: yes
      user:
        name: munge
        createhome: no
        shell: /sbin/nologin
        state: present
        group: munge

    - name: Install the SLURM and the MUNGE packages
      become: yes
      apt:
        update_cache: yes
        pkg:
          - munge
          - libmunge2
          - libmunge-dev
          - slurm-wlm

    - name: Create the MUNGE key file
      become: yes
      shell: /usr/sbin/create-munge-key
      args:
        creates: /etc/munge/munge.key

    - name: Copy the munge key file from login to compute nodes
      become: yes
      synchronize:
        src: /etc/munge/munge.key
        dest: /etc/munge/munge.key
      delegate_to: login

    - name: Enable the munge service at startup
      become: yes
      service:
        name: munge
        state: started
        enabled: yes

2 This task also installs the slurm package on all nodes.

The following listing shows the second Play, designed to configure the SLURM system on the login node. First, the slurm.conf file is copied from the local user machine (~/slurm.conf) to the login node. Then, the slurmctld service is started.

configure-cluster.yaml (part 3)

# Play 2: SLURM login node setup
- name: SLURM login node setup
  hosts: control
  become: yes
  tasks:
    - name: Copy the slurm.conf file to the login node
      copy:
        src: ~/slurm.conf
        dest: /etc/slurm-llnl/slurm.conf

    - name: Enable the slurmctld service on startup
      service:
        name: slurmctld
        state: started
        enabled: yes

Finally, the following listing shows the last Play, designed to configure the SLURM system on the computing nodes. This Play simply copies the ~/slurm.conf file to all computing nodes and starts the slurmd service on them.


configure-cluster.yaml (part 4)

# Play 3: SLURM compute nodes setup
- name: Start slurmd in compute nodes
  hosts: compute
  become: yes
  tasks:
    - name: Copy the slurm.conf file to the compute nodes
      copy:
        src: ~/slurm.conf
        dest: /etc/slurm-llnl/slurm.conf

    - name: Enable the slurmd service on startup
      service:
        name: slurmd
        state: started
        enabled: yes

B.3.3 Executing the Playbook

The configure-cluster.yaml Playbook can be executed with the command ansible-playbook configure-cluster.yaml -i inventory.yaml, where inventory.yaml and configure-cluster.yaml represent the paths to the inventory and the Playbook files, respectively.
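If the SSH private key authorized on the nodes is not the control machine's default key, it can also be passed explicitly; a possible invocation (with hypothetical paths) is:

# Run the Playbook against the inventory, pointing Ansible at the SSH private key
ansible-playbook configure-cluster.yaml -i inventory.yaml \
    --private-key ~/.ssh/my-key.pem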

B.4 Submitting Jobs on the HPC Cluster

Once the SLURM system is deployed, the user can access the login node via SSH to use the HPC cluster. The user may execute the sinfo command in order to list the available nodes in the HPC cluster. If everything is correctly set up, the two computing nodes will appear in the output of the command, as illustrated below:3

3 In this case, the computing nodes are identified by the ip-172-31-91-36 and ip-172-31-26-39 host names.


sinfo output

PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
Part*      up     infinite   1      idle   ip-172-31-91-36,ip-172-31-26-39

In order to submit a job, the user must write a simple job script. SLURM job scripts are composed of two parts: resource requests and job steps. Resource requests tell SLURM which and how many resources the user would like to use; in job scripts, they are usually expressed as "#SBATCH" lines at the beginning of the script. The job steps come after and describe the tasks that must be performed by the computing nodes, i.e., the command lines to be executed. The following listing shows an example of a SLURM job script, which executes the command hostname on the compute nodes allocated to the job. This job script requests one CPU (ntasks=1) for one minute (time=1:00), along with 100 MB of RAM (mem-per-cpu=100). The output produced by the job is saved in the res.txt file, as indicated by the job script (output=res.txt).

Example of SLURM Job Script (job.sh)

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=1:00
#SBATCH --mem-per-cpu=100
#SBATCH --output=res.txt
srun hostname

In order to submit this job for execution, the user must log into the login node and execute the command sbatch job.sh, where job.sh is the path to the job script file. Once the job is done, the hostnames of the nodes that executed the job are written into the res.txt file.
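A typical interaction on the login node could then look like the following sketch; the job ID printed by sbatch will vary.

# Submit the job script to SLURM
sbatch job.sh
# Monitor the job queue until the job completes
squeue
# Inspect the job output
cat res.txt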