Programming Your GPU with OpenMP: Performance Portability for GPUs (ISBN 9780262547536, 9780262377737, 9780262377720)

The essential guide for writing portable, parallel programs for GPUs using the OpenMP programming model.


English · 336 pages · 2023


Table of Contents:
Scientific and Engineering Computation
Programming Your GPU with OpenMP
Contents
List of Figures
List of Tables
Series Foreword
Preface
Acknowledgments
I SETTING THE STAGE
1 Heterogeneity and the Future of Computing
1.1 The Basic Building Blocks of Modern Computing
1.1.1 The CPU
1.1.2 The SIMD Vector Unit
1.1.3 The GPU
1.2 OpenMP: A Single Code-Base for Heterogeneous Hardware
1.3 The Structure of This Book
1.4 Supplementary Materials
2 OpenMP Overview
2.1 Threads: Basic Concepts
2.2 OpenMP: Basic Syntax
2.3 The Fundamental Design Patterns of OpenMP
2.3.1 The SPMD Pattern
2.3.2 The Loop-Level Parallelism Pattern
2.3.3 The Divide-and-Conquer Pattern
2.3.3.1 Tasks in OpenMP
2.3.3.2 Parallelizing Divide-and-Conquer
2.4 Task Execution
2.5 Our Journey Ahead
II THE GPU COMMON CORE
3 Running Parallel Code on a GPU
3.1 Target Construct: Offloading Execution onto a Device
3.2 Moving Data between the Host and a Device
3.2.1 Scalar Variables
3.2.2 Arrays on the Stack
3.2.3 Derived Types
3.3 Parallel Execution on the Target Device
3.4 Concurrency and the Loop Construct
3.5 Example: Walking through Matrix Multiplication
4 Memory Movement
4.1 OpenMP Array Syntax
4.2 Sharing Data Explicitly with the Map Clause
4.2.1 The Map Clause
4.2.2 Example: Vector Add on the Heap
4.2.3 Example: Mapping Arrays in Matrix Multiplication
4.3 Reductions and Mapping the Result from the Device
4.4 Optimizing Data Movement
4.4.1 Target Data Construct
4.4.2 Target Update Directive
4.4.3 Target Enter/Exit Data
4.4.4 Pointer Swapping
4.5 Summary
5 Using the GPU Common Core
5.1 Recap of the GPU Common Core
5.2 The Eightfold Path to Performance
5.2.1 Portability
5.2.2 Libraries
5.2.3 The Right Algorithm
5.2.4 Occupancy
5.2.5 Converged Execution Flow
5.2.6 Data Movement
5.2.7 Memory Coalescence
5.2.8 Load Balance
5.3 Concluding the GPU Common Core
III BEYOND THE COMMON CORE
6 Managing a GPU’s Hierarchical Parallelism
6.1 Parallel Threads
6.2 League of Teams of Threads
6.2.1 Controlling the Number of Teams and Threads
6.2.2 Distributing Work between Teams
6.3 Hierarchical Parallelism in Practice
6.3.1 Example: Batched Matrix Multiplication
6.3.2 Example: Batched Gaussian Elimination
6.4 Hierarchical Parallelism and the Loop Directive
6.4.1 Combined Constructs that Include Loop
6.4.2 Reductions and Combined Constructs
6.4.3 The Bind Clause
6.5 Summary
7 Revisiting Data Movement
7.1 Manipulating the Device Data Environment
7.1.1 Allocating and Deleting Variables
7.1.2 Map Type Modifiers
7.1.3 Changing the Default Mapping
7.2 Compiling External Functions and Static Variables for the Device
7.3 User-Defined Mappers
7.4 Team-Only Memory
7.5 Becoming a Cartographer: Mapping Device Memory by Hand
7.6 Unified Shared Memory for Productivity
7.7 Summary
8 Asynchronous Offload to Multiple GPUs
8.1 Device Discovery
8.2 Selecting a Default Device
8.3 Offload to Multiple Devices
8.3.1 Reverse Offload
8.4 Conditional Offload
8.5 Asynchronous Offload
8.5.1 Task Dependencies
8.5.2 Asynchronous Data Transfers
8.5.3 Task Reductions
8.6 Summary
9 Working with External Runtime Environments
9.1 Calling External Library Routines from OpenMP
9.2 Sharing OpenMP Data with Foreign Functions
9.2.1 The Need for Synchronization
9.2.2 Example: Sharing OpenMP Data with cuBLAS
9.3 Using Data from a Foreign Runtime with OpenMP
9.3.1 Example: Sharing cuBLAS Data with OpenMP
9.3.2 Avoiding Unportable Code
9.4 Direct Control of Foreign Runtimes
9.4.1 Query Properties of the Foreign Runtime
9.4.2 Using the Interop Construct to Correctly Synchronize with Foreign Functions
9.4.3 Non-blocking Synchronization with a Foreign Runtime
9.4.4 Example: Calling CUDA Kernels without Blocking
9.5 Enhanced Portability Using Variant Directives
9.5.1 Declaring Function Variants
9.5.1.1 OpenMP Context and the Match Clause
9.5.1.2 Modifying Variant Function Arguments
9.5.2 Controlling Variant Substitution with the Dispatch Construct
9.5.3 Putting It All Together
10 OpenMP and the Future of Heterogeneous Computing
Appendix: Reference Guide
A.1 Programming a CPU with OpenMP
A.2 Directives and Constructs for the GPU
A.2.1 Parallelism with Loop, Teams, and Worksharing Constructs
A.2.2 Constructs for Interoperability
A.2.3 Constructs for Device Data Environment Manipulation
A.3 Combined Constructs
A.4 Internal Control Variables, Environment Variables, and OpenMP API Functions
Glossary
References
Subject Index
Scientific and Engineering Computation
