Programming Your GPU with OpenMP: Performance Portability for GPUs (ISBN 9780262547536, 9780262377737, 9780262377720)

The essential guide for writing portable, parallel programs for GPUs using the OpenMP programming model.


English · 336 pages · 2023


Table of Contents:
Scientific and Engineering Computation
Programming Your GPU with OpenMP
Contents
List of Figures
List of Tables
Series Foreword
Preface
Acknowledgments
I SETTING THE STAGE
1 Heterogeneity and the Future of Computing
1.1 The Basic Building Blocks of Modern Computing
1.1.1 The CPU
1.1.2 The SIMD Vector Unit
1.1.3 The GPU
1.2 OpenMP: A Single Code-Base for Heterogeneous Hardware
1.3 The Structure of This Book
1.4 Supplementary Materials
2 OpenMP Overview
2.1 Threads: Basic Concepts
2.2 OpenMP: Basic Syntax
2.3 The Fundamental Design Patterns of OpenMP
2.3.1 The SPMD Pattern
2.3.2 The Loop-Level Parallelism Pattern
2.3.3 The Divide-and-Conquer Pattern
2.3.3.1 Tasks in OpenMP
2.3.3.2 Parallelizing Divide-and-Conquer
2.4 Task Execution
2.5 Our Journey Ahead
II THE GPU COMMON CORE
3 Running Parallel Code on a GPU
3.1 Target Construct: Offloading Execution onto a Device
3.2 Moving Data between the Host and a Device
3.2.1 Scalar Variables
3.2.2 Arrays on the Stack
3.2.3 Derived Types
3.3 Parallel Execution on the Target Device
3.4 Concurrency and the Loop Construct
3.5 Example: Walking through Matrix Multiplication
4 Memory Movement
4.1 OpenMP Array Syntax
4.2 Sharing Data Explicitly with the Map Clause
4.2.1 The Map Clause
4.2.2 Example: Vector Add on the Heap
4.2.3 Example: Mapping Arrays in Matrix Multiplication
4.3 Reductions and Mapping the Result from the Device
4.4 Optimizing Data Movement
4.4.1 Target Data Construct
4.4.2 Target Update Directive
4.4.3 Target Enter/Exit Data
4.4.4 Pointer Swapping
4.5 Summary
5 Using the GPU Common Core
5.1 Recap of the GPU Common Core
5.2 The Eightfold Path to Performance
5.2.1 Portability
5.2.2 Libraries
5.2.3 The Right Algorithm
5.2.4 Occupancy
5.2.5 Converged Execution Flow
5.2.6 Data Movement
5.2.7 Memory Coalescence
5.2.8 Load Balance
5.3 Concluding the GPU Common Core
III BEYOND THE COMMON CORE
6 Managing a GPU’s Hierarchical Parallelism
6.1 Parallel Threads
6.2 League of Teams of Threads
6.2.1 Controlling the Number of Teams and Threads
6.2.2 Distributing Work between Teams
6.3 Hierarchical Parallelism in Practice
6.3.1 Example: Batched Matrix Multiplication
6.3.2 Example: Batched Gaussian Elimination
6.4 Hierarchical Parallelism and the Loop Directive
6.4.1 Combined Constructs that Include Loop
6.4.2 Reductions and Combined Constructs
6.4.3 The Bind Clause
6.5 Summary
7 Revisiting Data Movement
7.1 Manipulating the Device Data Environment
7.1.1 Allocating and Deleting Variables
7.1.2 Map Type Modifiers
7.1.3 Changing the Default Mapping
7.2 Compiling External Functions and Static Variables for the Device
7.3 User-Defined Mappers
7.4 Team-Only Memory
7.5 Becoming a Cartographer: Mapping Device Memory by Hand
7.6 Unified Shared Memory for Productivity
7.7 Summary
8 Asynchronous Offload to Multiple GPUs
8.1 Device Discovery
8.2 Selecting a Default Device
8.3 Offload to Multiple Devices
8.3.1 Reverse Offload
8.4 Conditional Offload
8.5 Asynchronous Offload
8.5.1 Task Dependencies
8.5.2 Asynchronous Data Transfers
8.5.3 Task Reductions
8.6 Summary
9 Working with External Runtime Environments
9.1 Calling External Library Routines from OpenMP
9.2 Sharing OpenMP Data with Foreign Functions
9.2.1 The Need for Synchronization
9.2.2 Example: Sharing OpenMP Data with cuBLAS
9.3 Using Data from a Foreign Runtime with OpenMP
9.3.1 Example: Sharing cuBLAS Data with OpenMP
9.3.2 Avoiding Unportable Code
9.4 Direct Control of Foreign Runtimes
9.4.1 Query Properties of the Foreign Runtime
9.4.2 Using the Interop Construct to Correctly Synchronize with Foreign Functions
9.4.3 Non-blocking Synchronization with a Foreign Runtime
9.4.4 Example: Calling CUDA Kernels without Blocking
9.5 Enhanced Portability Using Variant Directives
9.5.1 Declaring Function Variants
9.5.1.1 OpenMP Context and the Match Clause
9.5.1.2 Modifying Variant Function Arguments
9.5.2 Controlling Variant Substitution with the Dispatch Construct
9.5.3 Putting It All Together
10 OpenMP and the Future of Heterogeneous Computing
Appendix: Reference Guide
A.1 Programming a CPU with OpenMP
A.2 Directives and Constructs for the GPU
A.2.1 Parallelism with Loop, Teams, and Worksharing Constructs
A.2.2 Constructs for Interoperability
A.2.3 Constructs for Device Data Environment Manipulation
A.3 Combined Constructs
A.4 Internal Control Variables, Environment Variables, and OpenMP API Functions
Glossary
References
Subject Index
Scientific and Engineering Computation
