Computational Technologies: Advanced Topics (ISBN 9783110359961, 9783110359947)

This book discusses the numerical solution of applied problems on parallel computing systems.


English Pages 278 [280] Year 2014




Petr N. Vabishchevich (Ed.) Computational Technologies De Gruyter Graduate

Computational Technologies | Advanced Topics Edited by Petr N. Vabishchevich

Mathematics Subject Classification 2010: 35-01, 65-01, 65M06, 65M22, 65M50, 65N06, 65N22, 65N50, 68-01, 68N15, 68N20

Editor
Prof. Dr. Petr N. Vabishchevich
Nuclear Safety Institute
Russian Academy of Sciences
B. Tulskaya 52
Moscow 115191
Russia
[email protected]

ISBN 978-3-11-035994-7
e-ISBN (PDF) 978-3-11-035996-1
e-ISBN (EPUB) 978-3-11-038688-2

Library of Congress Cataloging-in-Publication Data
A CIP catalog record for this book has been applied for at the Library of Congress.

Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.dnb.de.

© 2015 Walter de Gruyter GmbH, Berlin/Munich/Boston
Typesetting: PTP-Berlin, Protago TEX-Produktion GmbH, www.ptp-berlin.de
Printing and binding: CPI books GmbH, Leck
♾ Printed on acid-free paper
Printed in Germany
www.degruyter.com

List of contributors Dr. Mikhail Yu. Antonov Centre of Computational Technologies North-Eastern Federal University Yakutsk Russia E-mail: [email protected]

Dr. Nadezhda M. Afanasyeva Centre of Computational Technologies North-Eastern Federal University Yakutsk Russia E-mail: [email protected]

Victor S. Borisov Centre of Computational Technologies North-Eastern Federal University Yakutsk Russia E-mail: [email protected]

Dr. Aleksandr G. Churbanov Nuclear Safety Institute Russian Academy of Sciences Moscow Russia E-mail: [email protected]

Dr. Aleksander V. Grigoriev Centre of Computational Technologies North-Eastern Federal University Yakutsk Russia E-mail: [email protected]

Alexandr E. Kolesov Centre of Computational Technologies North-Eastern Federal University Yakutsk Russia E-mail: [email protected]

Petr A. Popov Centre of Computational Technologies North-Eastern Federal University Yakutsk Russia E-mail: [email protected]

Ivan K. Sirditov Centre of Computational Technologies North-Eastern Federal University Yakutsk Russia E-mail: [email protected]

Prof. Dr. Petr N. Vabishchevich Nuclear Safety Institute Russian Academy of Sciences Moscow Russia E-mail: [email protected]

Dr. Maria V. Vasilieva Centre of Computational Technologies North-Eastern Federal University Yakutsk Russia E-mail: [email protected]

Dr. Petr Zakharov Centre of Computational Technologies North-Eastern Federal University Yakutsk Russia E-mail: [email protected]

Contents
Preface | xi
Introduction | xiii
Mikhail Yu. Antonov
1 Architecture of parallel computing systems | 1
1.1 History of computers | 1
1.2 Architecture of parallel computers | 1
1.3 Modern supercomputers | 4
1.4 Multicore computers | 5
1.5 Operating system processes and threads | 5
1.6 Programming multi-threaded applications | 7
Mikhail Yu. Antonov
2 Multi-threaded programming | 9
2.1 POSIX threads | 9
2.2 Creating and terminating threads | 10
2.3 Thread life-cycle | 11
2.4 Multi-threaded matrix summation | 12
2.5 Thread synchronization | 15
Petr A. Popov
3 Essentials of OpenMP | 21
3.1 OpenMP parallel programming model | 21
3.2 The parallel construct | 22
3.3 Work-sharing constructs | 23
3.4 Work-sharing clauses | 28
3.5 Synchronization | 32
3.6 Dirichlet problem for the Poisson equation | 36
Aleksandr V. Grigoriev, Ivan K. Sirditov, and Petr A. Popov
4 MPI technology | 45
4.1 Preliminaries | 45
4.2 Message-passing operations | 50
4.3 Functions of collective interaction | 53
4.4 Dirichlet problem for the Poisson equation | 68
Aleksandr V. Grigoriev
5 ParaView: An efficient toolkit for visualizing large datasets | 77
5.1 An overview | 77
5.2 Data file formats | 78
5.3 Preparing data | 82
5.4 Working with ParaView | 90
5.5 Parallel visualization | 95
Victor S. Borisov
6 Tools for developing parallel programs | 99
6.1 Installation of PTP | 99
6.2 Program management | 100
6.3 Parallel debugging | 113
6.4 Performance analysis | 116
Aleksandr G. Churbanov, Petr N. Vabishchevich
7 Applied software | 121
7.1 Numerical simulation | 121
7.2 Applied software engineering | 124
7.3 Software architecture | 125
7.4 General purpose applied software | 127
7.5 Problem-oriented software | 129
Maria V. Vasilieva, Alexandr E. Kolesov
8 Geometry generation and meshing | 133
8.1 General information | 133
8.2 The Gmsh workflow | 135
8.3 NETGEN first look | 149
Nadezhda M. Afanasyeva, Maria V. Vasilieva
9 PETSc for parallel solving of linear and nonlinear equations | 153
9.1 Preliminaries | 153
9.2 Solvers for systems of linear equations | 163
9.3 Solution of nonlinear equations and systems | 171
9.4 Solving unsteady problems | 180
Petr E. Zakharov
10 The FEniCS project | 189
10.1 Preliminaries | 189
10.2 Model problem | 191
10.3 Finite element discretization | 192
10.4 Program | 194
10.5 Result processing | 204
10.6 Nonlinear problems | 208
10.7 Time-dependent problems | 212
Victor S. Borisov, Maria V. Vasilieva, and Alexandr E. Kolesov
11 Numerical study of applied problems | 219
11.1 Heat transfer with phase transitions | 219
11.2 Lid-driven cavity flow | 229
11.3 Steady thermoelasticity problem | 240
11.4 Joule heating problem | 249
Bibliography | 263
Index | 265

Preface

Modern scientific and engineering computations are based on a numerical study of applied mathematical models. Mathematical models involve linear or nonlinear equations as well as systems of ordinary differential equations (ODEs). But in the majority of cases, mathematical models consist of systems of partial differential equations (PDEs), which may be time-dependent as well as nonlinear and, moreover, may be strongly coupled with each other. In addition, these equations are supplemented with the appropriate boundary and initial conditions. To obtain high-fidelity numerical results of practical interest, it is necessary to solve boundary value problems in complicated computational domains.

The existing literature that discusses problems of scientific and engineering computations, in fact, does not reflect up-to-date realities. In practice, we have books on numerical methods somehow adapted to the needs of the reader. They focus on scientific and engineering computations from the viewpoint of specialists in numerical methods and computational mathematics. This approach includes developing a numerical method, programming a code, performing computations, and processing numerical results; all these steps are carried out by the readers themselves. Such a methodology is appropriate for solving rather simple problems and presumes a universality of the reader which is quite rare.

Applied software must reflect the state of the art in numerical methods, programming techniques, and the efficient use of computing systems. This can be achieved with component-based software development. In this approach, after a modular analysis, a mathematical model is divided into basic computational subproblems, and then an algorithmic interface is organized between them. The solution of the subproblems is implemented using standard computational components of scientific and engineering software. Software packages and modules for pre- and post-processing of problem data may also be treated as components of the developed applied software. The above-mentioned computational components are oriented to solving typical problems of computational mathematics, and they are developed by experts in numerical methods and programming techniques. The latter condition ensures the quality of the developed product when we employ modern computing systems.

Applied software is developed using certain standards and agreements. This concerns, in particular, the programming language. For a long time, software for scientific and engineering computations was implemented in the programming language Fortran. The main advantage of Fortran is the large number of programs and libraries written in it, which are often freely available with source code and documentation. Nowadays, the situation is changing in favour of other programming languages, especially C and C++. At present, new mathematical libraries and particular components are usually written in C/C++. Moreover, many well-proven applied software projects developed earlier in Fortran have been rewritten in C/C++.

In research projects, we traditionally focus on using free and open source software (FOSS). It is especially suitable for the educational process. From our point of view, the natural business model should be based on payment to an educational institution for training potential users of proprietary software, but in reality it seems to be quite the contrary.

The second requirement, in our mind, is portability, i.e. software should work on various hardware platforms and/or operating systems. In more exact terms, programming languages, available libraries, and applied software should be cross-platform.

Another important issue is related to multiprocessor computers. Applied software for multiprocessor computing systems with shared memory (multicore computers) is developed using OpenMP. For systems with distributed memory (clusters), the standard programming technique is MPI. Applied problems that are governed by PDEs can be solved on parallel computers using the PETSc library.

These key ideas have determined the structure of the book and its general orientation towards the design of modern applied software. We describe the basic elements of present-day computational technologies that use the algorithmic languages C/C++. The emphasis is on GNU compilers and libraries as well as FOSS for the solution of computational mathematics problems and visualization of the obtained data. This set of development tools might be slightly different in other circumstances, but this does not change the general orientation.

The book was prepared by a team of researchers from the Centre of Computational Technologies, M.K. Ammosov North-Eastern Federal University, Yakutsk, Russia, and scientists from the Nuclear Safety Institute, Russian Academy of Sciences, Moscow, Russia.

We hope that this book will be useful for students and specialists who solve engineering and scientific problems using numerical methods. We gratefully accept any constructive comments on the book.

Petr N. Vabishchevich
Moscow and Yakutsk, March 2014

Introduction

Nowadays, engineering and scientific computations are carried out on parallel computing systems, which provide parallel data processing on a few computing nodes. In developing up-to-date applied software, this feature of computers must be taken into account for the most efficient usage of their resources. In constructing computational algorithms, we should separate relatively independent subproblems in order to solve them on a single computing node.

Parallel computing is supported by a variety of programming techniques. Improvement of program performance on computing systems of various structures (multiprocessor, multicore or cluster architecture) is possible with the multi-threaded programming model.

Open Multi-Processing (OpenMP) is widely used to parallelize programs. This standard includes a set of compiler directives, library routines, and environment variables which are employed for the programming of multi-threaded applications on multiprocessor systems with shared memory. In OpenMP, a master thread creates a set of slave threads which run concurrently on a machine with multiple processors within the selected fragment of a source code. Language constructions in OpenMP are defined as the corresponding compiler directives.

Message Passing Interface (MPI) is applied in the development of programs for parallel computing systems with distributed memory. An MPI application is a set of independent processes interacting with each other by sending and receiving messages. The robustness of this technique is supported by the use of networks to transfer data between processes.

ParaView is commonly used for the visualization and analysis of large amounts of data in scientific and engineering computations. This multiplatform open source tool can work on single or multiple processor computing systems with distributed or shared memory. It supports the importation of three-dimensional data in many formats for the handling of geometrical models. ParaView also provides visualization of data obtained on structured, unstructured, and multiblock grids. Moreover, processing filters allow the creation of new data sets.

Eclipse for Parallel Application Developers is used as an integrated development environment (IDE) for the development of parallel applications. It supports a wide range of parallel architectures and computing systems and includes, in particular, a parallel debugger. OpenMP and MPI applications can be developed using this IDE.

Current applied software includes a pre-processing system, a computational core, and a data post-processing (with visualization) system. Component-based programming involves the use of highly developed components. This book describes free and open source software (FOSS), which can be employed as already constructed and validated components for scientific and engineering computations.

Computer-Aided Design (CAD) systems are applied to construct 3D parametric geometrical models. Using geometrical models, we must construct computational grids for numerical research. Netgen and Gmsh are frequently used for the preparation of geometrical models and computational meshes. The capabilities and working methods of these programs are looked at briefly.

Portable Extensible Toolkit for Scientific Computation (PETSc) is widely used for scientific and engineering computations. This toolkit supports modern parallel programming paradigms on the basis of the MPI standard. The main purpose of PETSc is the solution of linear and nonlinear systems of equations which result from the discretization of boundary value problems for partial differential equations (PDEs).

Complete multicomponent software for solving multiphysics problems should include tools for geometry and mesh generation, constructing discrete problems, numerical solution, data visualization, and processing. The FEniCS finite element software provides an example of such a complete numerical tool for the solution of applied problems. The primary functional capabilities and features of FEniCS are discussed below.

The capabilities of existing FOSS tools for scientific and engineering computations are illustrated by the solution of model problems. Heat and fluid flow problems are studied numerically. The melting/solidification problem and the lid-driven flow of an incompressible fluid are modelled using FEniCS. Next, Joule heating and poroelasticity are presented as examples of multiphysics phenomena. All of these problems are studied through a complete cycle of computational research: problem formulation, mesh generation, approximation of equations, solution of discrete problems, and analysis of numerical results.

Mikhail Yu. Antonov

1 Architecture of parallel computing systems Abstract: Impressive advances in computer design and technology have been made over the past several years. Computers have become a widely used tool in different areas of science and technology. Today, supercomputers are one of the main tools employed for scientific research in various fields, including oil and gas recovery, continuum mechanics, financial analysis, materials manufacturing, and other areas. That is the reason computational technologies, parallel programming, and efficient code development tools are so important for specialists in applied mathematics and engineering.

1.1 History of computers The first electronic digital computers capable of being programmed to solve different computing problems appeared in the 1940s. In the 1970s, when Intel developed the first microprocessor, computers became available to the general public. For decades, the computational performance of a single processor increased in accordance with Moore's law. New micro-architecture techniques, increased clock speed, and parallelism at the instruction level made it possible for old programs to run faster on new computers without the need for reprogramming. There has been almost no increase in instruction rate and clock speed since the mid-2000s. Now, major manufacturers emphasize multicore central processing units (CPUs) as the answer to scaling system performance, though to begin with this approach was used mainly in large supercomputers. Nowadays, multicore CPUs are applied in home computers, laptops, and even in smartphones. The downside of this approach is that software has to be programmed in a special manner to take full advantage of the multicore architecture.

1.2 Architecture of parallel computers Parallel programming means that computations are performed on several processors simultaneously. This can be done on multicore processors, multiprocessor computers with shared memory, computer clusters with distributed memory or hybrid architecture.

1.2.1 Flynn's taxonomy of parallel architecture Computer architecture can be classified according to various criteria. The most popular taxonomy of computer architecture was defined by Flynn in 1966. The classifications introduced by Flynn are based on the number of concurrent instruction and data streams available in the architecture under consideration.

SISD (single instruction stream / single data stream) – A sequential computer which has a single instruction stream, executing a single operation on a single data stream. Instructions are processed sequentially, i.e. one operation at a time (the von Neumann model).

SIMD (single instruction stream / multiple data stream) – A computer that has a single instruction stream for processing multiple data flows, which may be naturally parallelized. Machines of this type usually have many identical interconnected processors under the supervision of a single control unit. Examples include array processors, Graphics Processing Units (GPUs), and the SSE (Streaming SIMD Extensions) instruction sets of modern x86 processors.

MISD (multiple instruction stream / single data stream) – Multiple instructions for processing a single data flow. The same data stream flows through an array of processors executing different instruction streams. This architecture is uncommon, and in practice the class is considered almost empty.

MIMD (multiple instruction stream / multiple data stream) – Multiple instructions for processing multiple data flows. Multiple processor units execute various instructions on different data simultaneously. An MIMD computer has many interconnected processing elements, and each of them processes its own data with its own instructions. All multiprocessor systems fall under this classification.

Flynn's taxonomy is the most widely used classification for the initial characterization of computer architecture. However, this classification has evident drawbacks. In particular, the MIMD class is overcrowded. Most multiprocessor systems and multiple computer systems can be placed in this category, including any modern personal computer with x86-based multicore processors.

1.2.2 Address-space organization Another method of classification of parallel computers is based on address-space organization. This classification reflects the types of communication between processors.

Shared-memory multiprocessors (SMP) – Shared-memory multiprocessor systems have more than one scalar processor, and all processors share the same address space (main memory). This category includes traditional multicore and multiprocessor personal computers. Each processor in an SMP system may have its own cache memory, but all processors have to be connected to a common memory bus and memory bank (Figure 1.1).

Fig. 1.1. Tightly coupled shared-memory system (SMP).

One of the main advantages of this architecture is the (comparative) simplicity of the programming model. Disadvantages include bad scalability (due to bus contention) and price. Therefore, SMP-based supercomputers are more expensive than MPP systems with the same number of processors.

Massively parallel processors (MPP) – Massively parallel processor systems are composed of multiple subsystems (usually standalone computers) with their own memory and copy of the operating system (Figure 1.2). Subsystems are connected by a high-speed network (an interconnection). In particular, this category includes computing clusters, i.e. sets of computers interconnected using standard networking interfaces (Ethernet, InfiniBand, etc.).

Fig. 1.2. MPP architecture.

MPP systems can easily have several thousand nodes. The main advantages of MPP systems are scalability, flexibility, and relatively low price. MPP systems are usually programmed using message-passing libraries. Nodes exchange data through the interconnection network, so the speed, latency, and flexibility of the interconnection become very important; existing interconnects are slower than the data processing speed within the nodes.

Nonuniform memory access (NUMA) – NUMA architecture is something between SMP and MPP. NUMA systems consist of multiple computational nodes which each have their own local memory. Each node can access the entire system memory.

However, the speed of access to local memory is much faster than the speed of access to remote memory. It should be mentioned that this classification is not absolutely mutually exclusive. For example, clusters of symmetric multiprocessors are relatively common in the TOP500 list.

1.3 Modern supercomputers Floating-point operations per second (FLOPS) is a measure of computer performance. The LINPACK software for performing numerical linear algebra operations is one of the most popular methods of measuring the performance of parallel computers. The TOP500 project ranks and details the 500 most powerful supercomputer systems in the world. The project started in 1993 and publishes an updated list of supercomputers twice a year. Since then, 21 years have gone by, and during the 11 years of the rating, the peak power of supercomputers has increased by three orders of magnitude (Table 1.1).

Table 1.1. Supercomputer performance.

Name                  Year        Performance
ENIAC                 1946        300 flops
IBM 709               1957        5 Kflops
Cray-1                1974        160 MFlops
Cray Y-MP             1988        2.3 Gflops
Intel ASCI Red        1997        1 Tflops
IBM Blue Gene/L       2006        478.2 Tflops
IBM Roadrunner        2008        1.042 Pflops
Cray XT5 Jaguar       2009        1.759 Pflops
Tianhe-1A             2010        2.507 Pflops
Fujitsu K computer    2011        8.162 Pflops
IBM Sequoia           2012        20 Pflops
Cray XK7 Titan        2012        27 Pflops
Tianhe-2              June 2014   54.9 Pflops

It is convenient to have high-performance computational power in a desktop computer, either for computational tasks or to speed up standard applications. Computer processor manufacturers have presented dual-core, quad-core and even 8- and 16-core x86-compatible processors since 2005. Using a standard 4-processor motherboard, it is now possible to have up to 64 cores in a single personal computer. In addition, the idea of creating a personal supercomputer is supported by graphics processing unit (GPU) manufacturers, who have adapted the technology and software for general purpose calculations on GPUs. For instance, NVIDIA provides CUDA technology, whereas AMD presents ATI Stream technology. GPUs demonstrate up to 1 TFlops on a single GPU processor, i.e. more than traditional x86 central processing units.

1.4 Multicore computers Nowadays, the majority of modern personal computers have two or more computational cores. As a result, parallel computing is used extensively around the world in a wide variety of applications. As stated above, these computers belong to SMP systems with shared memory. Different cores on these computers can run distinct command flows. A single program can have more than one command flow (thread), all of which operate in shared memory. A program can significantly increase its performance on a multicore system if it is designed to employ multiple threads efficiently. The advantage of multi-threaded software for SMP systems is that data structures are shared among threads, and thus there is no need to copy data between execution contexts (threads, processes or processes over several computers), as implemented in the Message-Passing Interface (MPI) library. Also, system (shared) memory is usually much faster (by orders of magnitude in some scenarios) than the interconnection generally used in MPP systems (e.g. InfiniBand). Writing complex parallel programs for modern computers requires the design of codes for the multiprocessor system architecture. While this is relatively easy to implement for symmetrical multiprocessing, uniprocessor and SMP systems require different programming methods in order to achieve maximum performance. Programmers need to know how modern operating systems support processes and threads, as well as to understand the performance limits of threaded software and to predict results.

1.5 Operating system processes and threads Before studying multi-threaded programming, it is necessary to understand what processes and threads in modern operating systems are. First, we illustrate how threads and processes work.

1.5.1 Processes In a multitasking operating system, multiple programs, also called processes, can be executed simultaneously without interfering with each other (Figure 1.3). Memory protection is applied at the hardware level to prevent a process from accessing the memory of another process. A virtual memory system exists to provide a framework for an operating system to manage memory on behalf of various processes.

Fig. 1.3. Multitasking operating system.

Each process is presented with its own virtual address space, while hardware and the operating system prevent a process from accessing memory outside its own virtual address space. When a process is run in the protected mode, it has its own independent, continuous, and accessible address space, and the virtual memory system is responsible for managing communication between the virtual address space of the process and the real physical memory of the computer (Figure 1.4). A virtual memory system also allows programmers to develop software in a simple memory model without needing to synchronize the global address space between different processes.

Fig. 1.4. Multitasking and virtual memory.

1.5.2 Threads Just as an operating system can simultaneously execute several processes, each process can simultaneously execute several threads. Usually each process has at least one thread. Each thread belongs to one process, and threads cannot exist outside a process. Each thread represents a separate command flow executed inside a process (with its own program counter, system registers and stack). Both processes and threads can be seen as independent sequences of execution. The main difference between them is that while processes run in different contexts and virtual memory spaces, all threads of the same process share some resources, particularly the memory address space. Threads within a single process share
– global variables;
– descriptors;
– timers;
– semaphores; and more.

Each thread, however, has its own
– program counter;
– registers;
– stack;
– state.

A processor core switches rapidly from one thread to another in order to maintain a large number of different running processes and threads in the system. When the operating system decides to switch the currently running thread, it saves the context information of the thread/process (registers, the program counter, etc.) so that the execution can be resumed at the same point later, and loads a new thread/process context into the processor. This also enables multiple threads/processes to share a single processor core.

1.6 Programming multi-threaded applications There are different forms, technologies, and parallel programming models available for the development of parallel software. Each technique has both advantages and disadvantages, and in each case it should be decided whether the development of a parallel software version justifies the additional effort and resources.

1.6.1 Multi-threading: pros and cons The main advantage of multi-thread programming is obviously the effective use of SMP-architecture resources, including personal computers with multicore CPUs. A multi-threaded program can execute several commands simultaneously, and its performance is significantly higher as a result. The actual performance increase depends on the computer architecture and operating system, as well as on how the program is implemented, but it is still limited by Amdahl's law. Disadvantages include, first of all, possible loss of performance due to thread management overheads. Secondly, it is also more difficult to write and debug multi-threaded programs. In addition to common programming mistakes (memory leaks, allocation failures, etc.), programmers face new problems when handling parallel codes (race conditions, deadlocks, synchronization, etc.). To make matters worse, the program will in many cases continue to work when such a mistake has been made, because these mistakes only manifest themselves under very specific thread schedules.
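As a rough illustration of the Amdahl's law limit mentioned above (the notation here is the standard one and is not taken from this book): if a fraction p of a program can be parallelized and n threads are used, the attainable speedup is S(n) = 1 / ((1 - p) + p / n). For example, with p = 0.9 and n = 8 threads, S(8) = 1 / (0.1 + 0.9 / 8) ≈ 4.7, and even with an unlimited number of threads the speedup cannot exceed 1 / (1 - p) = 10.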

1.6.2 Program models One of the popular ways of implementing multi-threading is to use OpenMP shared memory programming technology. A brief overview of this programming environment will be given later in the book. Another way is to use special operating system interfaces for developing multithreaded programs. When using these low-level interfaces, threads must be explicitly created, synchronized, and destroyed. Another difference is that while high-level solutions are usually more task-specific, low-level interfaces are more flexible and give sophisticated control over thread management.

Mikhail Yu. Antonov

2 Multi-threaded programming Abstract: Parallel programming is vital to the efficient use of computers with multiprocessor architecture. MPI is one of the most popular technologies for parallel computations, in particular on MPP systems. On the other hand, most personal computers are equipped with multicore processors these days. There are different Application Programming Interfaces (APIs) which provide easier and simpler multi-threaded programming in such systems. Also, if MPI programs are used for communication inside a single computer, performance could possibly be improved by multi-threaded parallelization.

2.1 POSIX threads Thread libraries provide an API for creating and managing threads. Depending on the operating system, it can be Win32 Threads as a part of the Windows API, Solaris threads (Sun Microsystems), the clone() function of Linux, or the Pthreads library (POSIX API). One operating system may also adopt several different threading APIs. The portable operating system interface (POSIX) standard defines an API for writing multi-threaded applications. The interface is known as Pthreads. Implementations of the Pthreads library are available on most modern *nix operating systems, including GNU Linux, Unix, Mac OS X, Solaris, and others. Also, third-party implementations for Windows exist, making it easier for a programmer to write portable codes. It should be mentioned that although POSIX differs from the Windows API threads library, they are very similar in some ways, which is why the latter is not reviewed in this book.

Pthreads defines a set of C programming language types, routines, and constants used for developing multi-threaded applications. It includes routines for creating and terminating threads, and mechanisms for communication and synchronization of processes in parallel programs, such as mutexes, semaphores, condition variables and others. Technically speaking, semaphores work with POSIX threads, but they are not part of the threads standard. The use of Pthreads is the same as that of any other library. A multi-threaded program must include the Pthreads header files, which contain information on Pthreads-specific data types, structures, macros, and routines. In addition, the program must be linked with the Pthreads library (via, e.g. the -lpthread compiler parameter).

While the Pthreads library is fairly comprehensive (although not quite as extensive as some other native API sets) and distinctly portable, it suffers from serious limitations common to all native threading APIs: it requires the inclusion of a large threading-specific part of the code. In other words, coding for Pthreads ties the code base to a threaded model. Moreover, certain decisions, such as the number of necessary threads, can become hard-coded into the program. For example, using a threaded loop to step through a large data block requires that threading structures be declared, that the threads be created individually, that the loop bounds for each thread be computed and assigned to the thread, and ultimately that the thread termination be handled; all these operations must be coded by the developer.

2.2 Creating and terminating threads In Pthreads, all operations for creating and controlling threads are expressed directly by calling special functions of the POSIX standard. In order to create a new thread, we apply the following function: int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void* (*start)(void *), void *arg)

The pthread_create() function starts a new thread and stores the thread identifier in the variable pointed to by the thread parameter. This identifier can be used to control the thread in subsequent calls of other Pthreads routines. Here we see one of the features of the Pthreads API, namely the concept of opaque objects. Such objects reveal nothing about their implementation. Users cannot directly modify them by assignments. A set of routines is provided to initialize, configure, and destroy each Pthreads object type. For instance, we can copy pthread_t variables and compare them using the int pthread_equal(pthread_t thr1, pthread_t thr2) function. Attribute objects require some attention. When an object is initialized, memory is allocated for it. When the thread terminates, the Pthreads standard provides a function to destroy attribute objects and return the memory to the system. The attr argument is a pointer to an opaque attribute object, which describes the attributes for the new thread. We can specify a thread attributes object (using pthread_attr_init and related functions), or pass NULL for default values. The start argument is a pointer to a function which is executed by the newly created thread. The arg argument is passed as the single argument to this function. In the case of success, pthread_create() returns 0. On error, it returns an error number, and the contents of *thread are undefined. For example, by calling pthread_create(&thr, NULL, start, NULL) we create a thread which will execute the start routine once it is created. The thread will be created with default attributes, and its identifier will be saved in the thr variable. The NULL value will be passed as the argument to the start routine.
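A minimal sketch of this call sequence (the routine and variable names here are illustrative and not taken from the book) creates one thread with default attributes, passes it an integer argument, and waits for it to finish:

#include <pthread.h>
#include <stdio.h>

/* Start routine: receives the pointer passed as the last argument of pthread_create(). */
void *start(void *arg)
{
    int value = *(int *)arg;
    printf("thread started with argument %d\n", value);
    return NULL;   /* equivalent to pthread_exit(NULL) */
}

int main(void)
{
    pthread_t thr;
    int arg = 42;

    /* Default attributes (NULL), start routine, and a pointer to the argument. */
    if (pthread_create(&thr, NULL, start, &arg) != 0) {
        fprintf(stderr, "pthread_create failed\n");
        return 1;
    }
    pthread_join(thr, NULL);   /* wait for the thread to terminate */
    return 0;
}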


int pthread_join(pthread_t thread, void** value_ptr)

A thread terminates when it invokes the pthread_exit(void *retval) function or returns from its start routine, with the exit status supplied in the return statement. The pthread_join() function waits for the thread specified by thread to terminate. If this thread has already terminated, then pthread_join() returns immediately. The exit status of the thread is copied to the location pointed to by *value_ptr. pthread_join() releases the system resources of the joined thread; therefore, if multiple threads simultaneously try to join with the same thread, the results are undefined. If we do not need to keep the thread exit status, then the thread can be detached using the pthread_detach() function. A detached thread cannot be joined, but its resources are automatically released when the thread terminates. A thread can also be created as not joinable using the attr argument of pthread_create().
int pthread_detach(pthread_t thread)

The thread argument identifies the thread to be marked as detached. A thread can detach itself if it obtains its own identifier using the pthread_t pthread_self(void) function. Once a thread has been detached, it cannot be made joinable again. Attempting to detach an already detached thread results in unspecified behavior. pthread_join() or pthread_detach() should be called for each terminated thread to release the system resources of the thread. Thread resources include the thread's kernel stack, security context, and other specific data. It should be mentioned that explicitly allocated resources, including memory allocated by new or malloc(), are not released when the thread terminates. We should explicitly release those resources via free()/delete/close() etc.
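The following small sketch (with illustrative names, not taken from the book) shows these termination rules in one program: one thread returns an exit value that the main thread retrieves with pthread_join(), while a second thread is detached so that its resources are released automatically:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

void *worker(void *arg)
{
    int *result = malloc(sizeof(int));   /* heap memory outlives the thread */
    *result = 123;
    pthread_exit(result);                /* exit status passed to the joining thread */
}

void *background(void *arg)
{
    /* Independent work; nobody will join this thread. */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    void *status;

    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, background, NULL);

    pthread_detach(t2);                  /* resources released automatically on termination */

    pthread_join(t1, &status);           /* wait and pick up the exit status */
    printf("worker returned %d\n", *(int *)status);
    free(status);                        /* explicitly allocated memory must be freed by us */
    return 0;
}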

2.3 Thread life-cycle In C/C++, the main() function starts execution. The thread which invokes the main() routine is called the main (initial) thread. There are some major differences between the main thread and other process threads. First, the main thread is not created using the pthread_create() call and has a different routine prototype, int main(int argc, char ** argv). Calling return or exit from the main thread results in termination of the entire process, including all other process threads. If we need other threads to continue after the main thread finishes, the main thread should be terminated by calling pthread_exit(). Depending on the Pthreads implementation, there may be more differences between the main thread and other process threads.

The thread life-cycle consists of four major states (Table 2.1). A thread starts its life-cycle in the Ready state. When the thread receives processor time (which depends on the system scheduling policy), it changes its state to Running. It should be noted that the thread does not necessarily exist immediately after pthread_create returns: it is possible that by the time pthread_create returns in the parent thread, the child thread has already terminated and stopped existing.

Table 2.1. Thread life-cycle.

State        Description
Ready        The thread is ready for execution.
Running      The thread is currently being executed by the processor. In a computer with a multicore central processor, several simultaneously running threads may exist.
Blocked      The thread waits for some interruption to occur. For example, it may wait for user input or wait for a mutex. If no interruption occurs, the thread cannot continue; thus, threads in this state are not considered by the scheduler for scheduling.
Terminated   The thread was terminated, but has not released its resources because it was not yet joined or detached. As soon as a thread in this state is either joined or detached, it stops existing.

2.4 Multi-threaded matrix summation Now we will consider an example of a multi-threaded program for matrix summation. Because element-by-element operations are independent of each other, they can be processed in parallel. In this example, we use the POSIX API and a pool of two threads.

Listing 2.1.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>   /* assumed */
#include <unistd.h>   /* assumed */
int a[6], b[6] = {0,1,2,3,4,5}, c[6] = {1,1,2,3,5,8};

We need to include the Pthreads library header. Next, we declare the global arrays a, b, c. Here the global variables are used to simplify data exchange between threads. All process threads share the virtual address space; thus, all static and global variables are shared by all threads, and a variable can be accessed by different threads as long as they have pointers to it. Each thread can modify such variables and can detect changes made to a variable by other threads. However, access is not guaranteed to be synchronized. We should also highlight that changes of shared variables are not always seen immediately in other threads, because the compiler may perform certain optimizations based on code analysis. For instance, assume that thread number 1 is continuously changing the global variable i, and the variable is accessed without any external synchronization. The compiler can optimize the program to keep this variable in a processor register and make all the changes in the register without immediately copying the results to global memory. As register values are not shared between different processor cores, other threads trying to read the variable may read it from global memory. In this case, they will read the old value of the variable i, without the changes made by thread number 1. In order to prevent the compiler from applying such optimizations, the variable can be declared as volatile.

In the following example (Listing 2.2), it is not necessary to synchronize threads since there is no concurrency (each thread works with its own part of the array).

Listing 2.2.

void * ThreadFunction(void *id);
int main(int argc, char *argv[])
{
    pthread_t thread[2];
    int ID[2] = {0,1};
    pthread_create(&thread[0], NULL, ThreadFunction, (void *)&ID[0]);
    pthread_create(&thread[1], NULL, ThreadFunction, (void *)&ID[1]);
    pthread_join(thread[0], NULL);
    pthread_join(thread[1], NULL);
    printf("\nResult: ");
    for (int i = 0; i < 6; i++)
        printf("%d ", a[i]);
    return 0;
}

/* Reconstructed: each thread sums its own half of the arrays. */
void * ThreadFunction(void *id)
{
    int num = *((int *)id);
    for (int i = num * 3; i < num * 3 + 3; i++)
        a[i] = b[i] + c[i];
    return NULL;
}

2.5 Thread synchronization

In the following example, two threads work with the shared variables i and flag without any synchronization; both variables are declared volatile.

volatile int i = 1;
volatile int flag = 1;

/* Reconstructed: the first thread keeps flipping the sign of i while flag is set. */
void * ThreadFunction1(void *id)
{
    while (flag)
        i = -i;
    return NULL;
}

void * ThreadFunction2(void *id)
{
    for (int j = 0; j < 100; ++j)
        if (i > 0) printf("\n Value is %d",i);
    flag = 0;
    return NULL;
}

int main(int argc, char *argv[])
{
    pthread_t thread[2];
    pthread_create(&thread[0], NULL, ThreadFunction1, NULL);
    pthread_create(&thread[1], NULL, ThreadFunction2, NULL);
    pthread_join(thread[0], NULL);
    pthread_join(thread[1], NULL);
    return(0);
}

The program creates two threads. The first thread repeatedly changes the value of the variable i from 1 to -1 and back. The second thread checks the value 100 times and, if it is positive, displays the value of the variable. The second thread manages the first thread through the global variable flag. This example illustrates the concept of concurrent processing in a multi-threaded system. First of all, it is important to declare the i and flag variables as volatile. If not, the compiler may perform some optimization, e.g. keeping and working with them in fast memory (a register or cache). This could result in a situation where one thread does not see changes made to the variables i and flag by another thread. As a result, the second thread may not see these changes and, moreover, the first thread may not see the change of flag and fall into an infinite loop. For this reason it is necessary to declare them as volatile.


To make things more complicated, this program will output the value -1 from time to time. This can happen because the first thread can change the value of i between the moment when the second thread checks it and the moment when it prints it.

2.5.1 Semaphore Sometimes it is necessary for one thread to wait until data is processed by other threads. A mechanism is required for a thread to receive some sort of signal that data is ready for processing. The simplest solution is for the thread to repeatedly check whether a condition is true (using, e.g. a global volatile variable), waiting in a cycle. This technique is called busy waiting. While busy waiting can be a valid strategy in certain circumstances, it should generally be avoided because it wastes processor time on useless activity. In 1968, Edsger Dijkstra introduced the counting semaphore concept as a tool for computer synchronization. The counting semaphore S is an object with an associated integer value (called the semaphore value) and two operations, P(S) and V(S). The P(S) operation checks the semaphore value, and if it is positive it decrements the semaphore value by 1 and continues. If the value is equal to 0, the thread is blocked and added to the semaphore queue. All threads in the queue remain blocked until another thread invokes the V(S) operation. The V(S) operation checks if there are blocked threads waiting in the semaphore queue. If blocked threads exist, then one of these threads is unblocked. If the queue is empty, the semaphore value is increased by 1. Threads in the queue may belong to different processes when the semaphore is used to organize process synchronization. Further, the definition of the V(S) operation does not specify how the unblocked thread is chosen. Assume that we have the semaphore S with value 1. The first thread which calls P(S) will decrease the semaphore value to 0, and all other threads which call P(S) will be blocked until V(S) is called. Thus we have a critical section, where a thread enters by performing P(S) and leaves by performing V(S). Only one thread can be active within the boundaries of the critical section at the same moment in time. If the semaphore is initialized with the value N, then no more than N threads can execute that section of the code simultaneously.
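POSIX provides counting semaphores of this kind in the semaphore.h header (they belong to POSIX, not to the Pthreads standard). The following minimal sketch, with illustrative names not taken from the book, uses a semaphore initialized to 1 to build a critical section around a shared counter:

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

sem_t sem;                 /* counting semaphore */
int shared_counter = 0;

void *worker(void *arg)
{
    sem_wait(&sem);        /* P(S): decrement, or block if the value is 0 */
    shared_counter++;      /* critical section: one thread at a time */
    sem_post(&sem);        /* V(S): wake a blocked thread or increment */
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    sem_init(&sem, 0, 1);  /* initial value 1 makes it a binary semaphore */
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %d\n", shared_counter);
    sem_destroy(&sem);
    return 0;
}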

2.5.2 Mutex Semaphores initialized with the value 1 are commonly used in practice. Such semaphores are called binary and may exist in two states: 0 (locked) and 1 (unlocked). Pthreads provides a special version of a binary semaphore called a mutex. In many cases, it is easier to apply them for thread synchronization.

A mutex is a synchronization object which can have only two states, locked and unlocked, and only two operations, lock and unlock. It is very similar to a binary semaphore, but usually has some additional features that make it different. For instance, in the Win32 API, a thread which has already locked a mutex can lock it again without being blocked. In other implementations, an attempt to lock an already locked mutex may lead to blocking of the thread. In Pthreads, mutexes are available as pthread_mutex_t variables. In accordance with the POSIX concept, this data type is opaque, and a set of routines is provided to initialize, lock, unlock, and destroy mutex variables.
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *attr);

To start working with a pthread_mutex_t variable, it must be initialized. System resources are allocated when the mutex is initialized. The mutex argument refers to the mutex to be initialized, and the attr argument points to a mutex attribute object. We can specify the mutex attribute object or pass NULL for default values. Upon successful initialization, the mutex becomes initialized and unlocked. When the mutex is no longer needed, it must be destroyed in order to release the system resources associated with it.
int pthread_mutex_destroy(pthread_mutex_t *mutex);

The mutex argument refers to the initialized unlocked mutex to be destroyed. The attempt to destroy a locked mutex leads to undefined behavior. The destroyed mutex becomes, in fact, uninitialized, and can be reinitialized using pthread_mutex_init(). After the mutex is initialized it can be locked or unlocked with the following operations: int pthread_mutex_lock(pthread_mutex_t *mutex); int pthread_mutex_trylock(pthread_mutex_t *mutex); int pthread_mutex_unlock(pthread_mutex_t *mutex);

The mutex argument refers to an initialized mutex. The pthread_mutex_lock() function tries to lock the mutex. If the mutex is already locked, the calling thread is blocked until the mutex is released. If the mutex is successfully locked, pthread_mutex_lock() returns 0. The pthread_mutex_trylock() function is identical to pthread_mutex_lock() except that if the mutex object is already locked, it immediately returns the EBUSY value.
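A brief sketch of the difference between blocking and non-blocking locking (illustrative code, not from the book): with pthread_mutex_trylock() a thread can do useful work instead of waiting when the mutex is busy:

#include <pthread.h>
#include <errno.h>
#include <stdio.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg)
{
    int rc = pthread_mutex_trylock(&m);
    if (rc == 0) {
        /* We own the mutex: work with the shared data, then release it. */
        pthread_mutex_unlock(&m);
    } else if (rc == EBUSY) {
        /* The mutex is held by another thread: do something else instead of blocking. */
        printf("mutex busy, skipping\n");
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}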


The pthread_mutex_unlock() function releases the mutex object referenced by the mutex argument. The thread calling pthread_mutex_unlock() must have previously locked the mutex. An attempt to release an unlocked mutex, or one previously locked by another thread, leads to undefined behavior. Let us improve the previous example by using mutexes for synchronization. For this, we create a pthread_mutex_t variable and initialize it.
pthread_mutex_t i_mutex = PTHREAD_MUTEX_INITIALIZER;

Using the mutex, we organize critical sections, i.e. parts of the code where shared variables are read or modified; see Listing 2.7 below.

Listing 2.7.

void * ThreadFunction1(void *id)
{
    while (flag) {
        pthread_mutex_lock(&i_mutex);
        i = -i;
        pthread_mutex_unlock(&i_mutex);
    }
    return NULL;
}

void * ThreadFunction2(void *id)
{
    int j = 0;
    for (int j = 0; j < 10000; ++j) {
        pthread_mutex_lock(&i_mutex);
        if (i > 0) printf("\n Value is %d",i);
        pthread_mutex_unlock(&i_mutex);
    }
    flag = 0;
    return NULL;
}

Because only one thread can be active inside the critical section at any moment, this prevents the situation in which the value of i is changed between checking and displaying it inside the if operator.

Petr A. Popov

3 Essentials of OpenMP Abstract: OpenMP is a shared-memory Application Programming Interface (API). It represents a collection of compiler directives, library routines, and environment variables for shared-memory parallelism in C, C++, and Fortran programs. OpenMP is managed by the non-profit technology consortium OpenMP Architecture Review Board (OpenMP ARB), which consists of major computer hardware and software vendors. The first OpenMP API specification was published in October 1997. The latest version (4.0) was released in July 2013.

3.1 OpenMP parallel programming model OpenMP employs a portable, easy-to-use, and scalable programming model which gives programmers a simple and very flexible interface for the development of parallel applications on various platforms, ranging from standard personal computers to the most powerful supercomputer systems. OpenMP provides #pragma compiler directives, library routines, and environment variables to create and control the execution of parallel programs. OpenMP pragmas begin with #pragma omp. They typically direct the compiler to parallelize sections of a code. The syntax of #pragma omp directives is as follows:
#pragma omp directive-name [clause [ [,] clause]...]

OpenMP directives are ignored by compilers which do not support OpenMP. If so, the program will still behave correctly, but without parallelism. The main directives are: parallel, for, parallel for, section, sections, single, master, critical, flush, ordered, and atomic. These directives specify either work-sharing or synchronization constructs. Each OpenMP directive may have a few extra optional modifiers, i.e. clauses, which affect the behavior of directives. There are also five directives (master, critical, flush, ordered, and atomic) which do not accept clauses at all. OpenMP runtime routines mainly serve to modify and retrieve information about the environment. There are also API functions for certain kinds of synchronization. In order to use these functions, a program must include the OpenMP header file omp.h.
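A small sketch of these runtime routines (illustrative code, not from the book): the program below queries the thread number and team size, and it still builds and runs sequentially when OpenMP is not supported, since the pragma is then ignored and stub functions are used instead of omp.h:

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Stubs so that the code also compiles without OpenMP support. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

int main(void)
{
    /* With GCC, compile with -fopenmp; without it the pragma is simply ignored. */
    #pragma omp parallel
    {
        printf("thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}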


3.2 The parallel construct The parallel construct is the most important construct of OpenMP. A program without parallel constructs is executed sequentially. The syntax of the parallel construct has the following form: #pragma omp parallel [clause[[,] clause]...] { structured block }

The parallel construct is used to specify the code parts which should be executed in parallel. Code parts not wrapped by the parallel construct are executed serially. OpenMP follows the fork-join programming model of parallel execution. An OpenMP program starts with a single thread of execution (called the initial thread). When it encounters a parallel construct, it becomes the master thread of the new team of threads created to execute the parallel region. The number of threads in the team remains constant. Each thread in the team is assigned a unique thread number for identification; the number can be obtained using the omp_get_thread_num() function. The master thread is assigned the thread number zero. All threads in the team execute the parallel region, but the parallel construct only ensures that work is performed in parallel. It does not distribute the work among threads, so if no action is specified, the work will simply be replicated. The parallel region ends implicitly at the end of the structured block, and there is an implied barrier which forces all threads to wait until each thread in the team finishes its work within the parallel region. When the team threads complete the statements in the parallel region, they synchronize and terminate, leaving only the master thread to resume execution. A simple example is presented in Listing 3.1.

Listing 3.1.

#pragma omp parallel
{
    printf("Hello World!\n");
}

The output for four threads is as follows:
Hello World!
Hello World!
Hello World!
Hello World!



The parallel construct has the following optional clauses:
– if () indicates that the parallel region is executed only if a certain condition is met; if the condition is not satisfied, the directive fails and execution continues in the sequential mode.
– num_threads () explicitly sets the number of threads which will execute the parallel region. The default value is the last value set by the omp_set_num_threads() function, or the environment variable OMP_NUM_THREADS.
– default(shared|none). The shared value means that all variables without a directly assigned data-sharing attribute are shared by all threads. The none value means that the data-sharing attributes of variables must be specified explicitly.
– private () specifies a list of private variables which have local copies made for each thread, so that changes made in one thread are not visible in other threads.
– firstprivate () prescribes a list of private variables which have local copies made for each thread; the values of these variables are copied into each thread before the parallel region.
– shared () specifies a list of variables to be shared by all threads.
– copyin () determines a list of threadprivate variables whose values are copied from the master thread to all other threads.
– reduction () specifies an operator and a list of variables which have local copies in each thread. The specified reduction operation is applied to each variable at the end of the parallel region.
A short example combining several of these clauses is sketched after this list.
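The sketch below is illustrative and not taken from the book: the array a and the size n are shared, the per-thread partial sums are combined by the reduction clause, and num_threads fixes the team size:

#include <stdio.h>

int main(void)
{
    int n = 100, sum = 0;
    int a[100];
    for (int i = 0; i < n; i++)
        a[i] = i;

    #pragma omp parallel num_threads(4) default(none) shared(a, n) reduction(+:sum)
    {
        #pragma omp for
        for (int i = 0; i < n; i++)
            sum += a[i];          /* each thread accumulates into its own copy of sum */
    }
    printf("sum = %d\n", sum);    /* 4950 after the copies are combined */
    return 0;
}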

3.3 Work-sharing constructs

Work-sharing constructs are responsible for distributing the work between threads which encounter this directive. These constructs do not launch new threads, and there is no implied barrier upon entry to a work-sharing construct. However, there is an implied barrier at the end of a work-sharing construct, unless the nowait clause is specified. OpenMP defines the following work-sharing constructs:
– for directive;
– sections directive;
– single directive.
These constructs must be connected with an active parallel region in order to have an effect or they are simply ignored. This is due to the fact that work-sharing directives can be in procedures used both inside and outside parallel regions, and can therefore be executed during certain calls and ignored in others.


The work-sharing constructs have the following restrictions:
– they must be encountered by all threads in a team or by none at all;
– consecutive work-sharing constructs and barriers must be encountered in the same order by each thread in a team.

3.3.1 The for directive

If there is a loop operator in a parallel region, then it is executed by all threads, i.e. each thread executes all iterations of this loop. It is necessary to apply the for directive to distribute loop iterations between different threads. The for directive gives instructions to the for loop operator it is associated with to iterate the loop in the parallel mode. During execution, loop iterations are distributed between threads. The syntax of the for directive is as follows:

#pragma omp for [clause[[,] clause] ... ]
    for loop

The use of the for construct is limited to those types of loop operators where the number of iterations is an integer value. The iteration counter must also increase (decrease) by a fixed integer value on each iteration. In the following example (Listing 3.2), the parallel directive is used to define a parallel region. Next, we distribute the work inside the parallel region among the threads with the help of the #pragma omp for directive, which indicates that the iterations of the associated loop operator will be distributed among the threads.

Listing 3.2.
#pragma omp parallel shared(n) private(i)
{
    #pragma omp for
    for (i = 0; i < n; i++)
        printf("Thread %d executes iteration %d\n", omp_get_thread_num(), i);
}

The output of the program for n = 8 is

Thread 1 executes iteration 2
Thread 2 executes iteration 4
Thread 3 executes iteration 6
Thread 4 executes iteration 8
Thread 0 executes iteration 0
Thread 1 executes iteration 3
Thread 2 executes iteration 5
Thread 3 executes iteration 7
Thread 0 executes iteration 1

It is easy to see that the program has been executed by five threads. Considering that this is a parallel program, we should not expect the results to be printed in the deterministic order known a priori.

3.3.2 The sections directive

The sections directive is employed to define a finite (iteration-free) parallelism mode. It contains a set of structured blocks which are distributed between threads. Each structured block is executed once by one of the threads. The sections directive has the following syntax:

#pragma omp sections [clause[[,] clause] ...]
{
    #pragma omp section
    { structured block }
    #pragma omp section
    { structured block }
    . . .
}

The sections directive is the easiest way to order different threads to perform different types of work, since it allows specification of different code segments, each of which will be executed by one of the threads. The directive consists of two parts: the first part is #pragma omp sections and indicates the start of the construct, and the second is #pragma omp section, which marks each separate section. Each section must be a structured block of code independent from other blocks. Each thread executes only one block of code at a time, and each block of code is executed only once. If the number of threads is less than the number of blocks, some threads have to perform multiple blocks; otherwise, the remaining threads will be idle. The sections directive can be used to perform various independent tasks by threads. The most frequently used application is to carry out functions or subroutines in parallel mode. The following example (Listing 3.3) contains the sections directive consisting of two section constructs, which means that only two threads can execute it at the same time while other threads remain idle.

Listing 3.3.
#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
            (void) func1();
        #pragma omp section
            (void) func2();
    }
}

The sections directive may cause a load balancing problem; this occurs when threads have different amounts of work in different section constructs.

3.3.3 The single directive

If any portion of a code in a parallel region is to be executed only once, then this part should be allocated inside a single construct, which has the following syntax:

#pragma omp single [clause [[,] clause]...]
{
    structured block
}

The single directive indicates that the associated structured block of code must be executed by only one thread. Which thread will execute the code block is not specified. Typically, the choice of thread can vary from one run to another. It can also be different for various single constructs within a single program. This is not a limitation; the single directive should only be used when it is not important which thread executes this part of the program. Other threads remain idle until the execution of the single construct is completed. The following example demonstrates the use of the single directive:

Listing 3.4.
#pragma omp parallel shared(a,b) private(i)
{
    #pragma omp single
    {
        printf("Single construct is executed by thread %d\n", omp_get_thread_num());
    }
}

The output of the program is

Single construct is executed by thread 3


3.3.4 Combined work-sharing constructs

Combined constructs are shortcuts which can be applied when a parallel region includes exactly one work-sharing directive and this directive includes all code in the parallel region. For example, we consider the following programs:

#pragma omp parallel
{
    #pragma omp for
        for loop
}

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        { structured block }
        #pragma omp section
        { structured block }
        . . .
    }
}

These programs are equivalent to the following codes:

#pragma omp parallel for
    for loop

#pragma omp parallel sections
{
    #pragma omp section
    { structured block }
    #pragma omp section
    { structured block }
    . . .
}

Combined constructs allow us to specify options supported by both directives. The main advantage of combined constructs is their readability, and also the performance growth which may occur in this case. When using combined constructs, a compiler knows what to expect and is able to generate a slightly more efficient code.

For instance, the compiler does not insert more than one synchronization barrier at the end of a parallel region.

3.4 Work-sharing clauses

The OpenMP directives discussed above support a series of clauses which provide a simple and effective way to control the behavior of the directives to which they are applied. They include the syntax necessary to specify which variables are shared and which remain private in the code associated with a construct.

3.4.1 The shared clause

The shared clause is employed to identify data that will be shared between threads in the parallel region associated with the clause. Namely, we have one unique instance of a variable, and each thread in the parallel region can freely read and modify its value. The syntax of the shared clause is as follows:

#pragma omp shared (list)

A simple example using the shared clause is shown in Listing 3.5 below:

Listing 3.5.
#pragma omp parallel for shared(a)
for (i = 0; i < n; ++i)
{
    a[i] += i;
}

In the above example, the a array is declared shared, which means that all threads can read and write the elements without interference. When we employ the shared clause, multiple threads may simultaneously try to update the same memory location or, for example, one thread may try to read data from the memory location where another thread is writing. Special attention should be paid to ensure none of these situations occur and that the access to shared data is ordered in accordance with the requirements of an algorithm. OpenMP places the responsibility on users and contains several structures which can help to overcome such situations.
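The following minimal sketch (the variable count and the value 8 are chosen here only for illustration) shows the kind of conflict that must be avoided and how one of the synchronization constructs discussed later in this chapter resolves it:

int count = 0;
#pragma omp parallel num_threads(8) shared(count)
{
    // Unsafe: two threads may read the same old value of count,
    // so one of the increments can be lost.
    // count += 1;

    // Safe: the update of the shared variable is performed atomically.
    #pragma omp atomic
    count += 1;
}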


3.4.2 The private clause

The private clause specifies a list of variables which are private to each thread. Each variable in the list is replicated so that each thread has exclusive access to a local copy of this variable. Changes made to the data of one thread are not visible to other threads. The syntax of the private clause is as follows:

#pragma omp private (list)

Private variables replace all references to the original objects with references to newly declared objects of the same type as the original. A new object is declared once for each thread. The private variables are assumed to be uninitialized for each thread. In Listing 3.6 we explain how to use the private clause.

Listing 3.6.
#pragma omp parallel for private(i,a)
for (i = 0; i < n; ++i)
{
    a = i + 1;
}

The variable i is a parallel loop iterations counter. It cannot be a shared variable because the loop iteration is divided between threads, and each thread must have its own unique local copy of i so that it can safely change its value. Otherwise, changes made by one thread can affect the value of i in the memory of another thread, thus making it impossible to track iterations. Even if i is explicitly declared as shared, it will be converted into a private variable. Here both variables are declared as private. If variable a were to be listed as shared, several threads would simultaneously try to update it in an uncontrolled fashion. The final value of the variable would depend on which thread updated it last. This error is called a data race. Thus, the a variable must also be declared a private variable. Because each thread has its own local copy of the variable there is no interference between them and the result will be correct. Note that the values of variables from the list specified in the private clause are uninitialized. The value of any variable with the same name as a private variable will also be uncertain after the end of the associated construct, even if the corresponding variable was defined before entering the construct.
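The difference between private and firstprivate can be seen in the following minimal sketch (the variable offset and its value 10 are chosen only for illustration):

int offset = 10;
#pragma omp parallel private(offset)
{
    // offset is uninitialized here: a value must be assigned
    // inside the region before it is used
}
#pragma omp parallel firstprivate(offset)
{
    // each thread starts with its own copy of offset equal to 10
}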

3.4.3 The schedule clause

The schedule clause is supported only by the for directive. It is used to control how loop iterations are divided among threads. This can have a major impact on the performance of a program. The default schedule depends on the implementation. This clause has the following syntax:

#pragma omp schedule (type[, chunk_size])

chunk_size is a variable which determines the number of iterations allocated to threads. If it is specified, the value of chunk_size must be integer and positive. There are four schedule types:
– static. Iterations are divided into equal chunks of size chunk_size. The chunks are statically assigned to threads. If chunk_size is not specified, the iterations are divided as uniformly as possible among threads.
– dynamic. Iterations are allocated to threads on request. After a thread finishes its assigned iterations, it gets another chunk. If chunk_size is not specified, it is equal to 1.
– guided. Iterations are dynamically assigned to threads in chunks of decreasing size. It is similar to the dynamic type, but in this case the chunk size decreases each time. The size of the initial chunk is proportional to the number of unassigned iterations divided by the number of threads. If chunk_size is not specified, it is equal to 1.
– runtime. The type of scheduling is determined at runtime using the environment variable OMP_SCHEDULE. It is illegal to specify chunk_size for this type.
The static scheduling of iterations is used by default in most compilers supporting OpenMP unless explicitly specified. It also has the lowest overheads of the scheduling types listed above. The dynamic and guided scheduling types are useful for work with ill-balanced and unpredictable workloads. The difference between them is that with the guided scheduling type the chunk size decreases with time. This means that initially large pieces are more desirable because they reduce overheads. Load balancing becomes the essential problem by the end of the calculations. The system then uses relatively small pieces to fill gaps in the distribution.
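As a hedged illustration of the two typical situations (assuming the arrays a, b, c, the counter i, and the bound n are declared elsewhere; process_row() is a hypothetical routine whose cost varies strongly from one iteration to another):

// uniform work per iteration: the default static schedule is usually best
#pragma omp parallel for schedule(static)
for (i = 0; i < n; i++)
    a[i] = b[i] + c[i];

// unpredictable work per iteration: hand out chunks of 4 iterations on demand
#pragma omp parallel for schedule(dynamic, 4)
for (i = 0; i < n; i++)
    process_row(i);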

3.4.4 The reduction clause

The reduction clause specifies some forms of recurrence calculation (involving associative and commutative operators), so that they can be performed in parallel mode without code modifications. It is only necessary to define an operation and variables
that contain results. The results will be shared, therefore there is no need to explicitly specify the appropriate variables as shared. The syntax of the clause has the following form:

#pragma omp reduction (operator:list)

operator is not an overloaded operator, but one of +, *, -, &, ^, |, &&, ||. It is applied to the specified list of scalar reduction variables. The reduction clause allows the accumulation of a shared variable without the use of special OpenMP synchronization directives, which means better performance. At the beginning of a parallel block, a private copy of each variable is created for each thread and pre-initialized to a certain value. At the end of the reduction, the private copies are merged into the shared variable using the defined operator. Listing 3.7 demonstrates the use of the reduction clause.

Listing 3.7.
int factorial(int n)
{
    int fac = 1;
    #pragma omp parallel for reduction(*:fac)
    for (int i = 2; i <= n; ++i)
        fac *= i;
    return fac;
}

The following fragment uses the critical directive to safely update the shared variable max when searching for the maximum element of the array a:

#pragma omp parallel for
for (i = 0; i < n; ++i)
{
    if (a[i] > max)
    {
        #pragma omp critical
        {
            if (a[i] > max)
                max = a[i];
        }
    }
}
for (i = 0; i < n; ++i)
{
    printf("%d\n", a[i]);
}
printf("max = %d\n", max);

The extra comparison of the a[i] and max variables is due to the fact that the max variable could have been changed by another thread after the comparison outside the critical section. The output of the program is as follows:

23
64
74
98
56
35
2
12
5
18
max = 98

3.5.3 The atomic directive

The atomic construct allows multiple threads to safely update a shared variable. It specifies that a specific memory location is updated atomically. The atomic directive is commonly used to update counters and other simple variables which are accessed
by multiple threads simultaneously. In some cases it can be employed as an alternative to the critical construct. Unlike other OpenMP constructs, the atomic construct can only be applied to a single assignment statement which immediately follows it. The affected expression must conform to certain rules to have an effect, which severely limits its area of applicability. The syntax of the atomic directive is shown below.

#pragma omp atomic
    expression

The way the atomic construct is used is highly dependent on the implementation of OpenMP. For example, some implementations can replace all atomic directives with critical directives with the same unique name. On the other hand, there are hardware instructions which can perform atomic updates with lower overheads and are optimized better than the critical constructs. The atomic directive is applied only to a single assignment operator. It can only be applied to simple expressions such as increments and decrements; it cannot include function calls, array indexing, or multiple statements. The supported operators are: +, *, -, /, &, ^, |, <<, >>. The atomic directive specifies only that the update of the memory cell located to the left of the operator is atomic. It does not guarantee that the expression to the right of the operator is evaluated atomically. If the hardware platform is able to perform atomic read-modify-write instructions, then the atomic directive tells the compiler to use this operation. Listing 3.10 demonstrates the application of the atomic directive.

Listing 3.10.
#pragma omp atomic
counter += value;

In the next example (Listing 3.11), the atomic directive does not prevent multiple threads from executing the func function at the same time. It only atomically updates the variable j. If we do not intend to allow threads to simultaneously execute the func function, the critical construct should be used instead of the atomic directive.

Listing 3.11.
#pragma omp parallel shared(n,j) private(i)
for (i = 0; i < n; ++i) {
    #pragma omp atomic
    j = j + func();
}

To avoid race conditions, all updates in parallel computations should be protected with the atomic directive, except for those that are known to be free of race conditions.

3.6 Dirichlet problem for the Poisson equation

Partial differential equations provide the basis for developing mathematical models in various fields of science and technology. Usually, the analytic solution of these equations is possible only in particular simple cases. Therefore, an analysis of mathematical models based on differential equations requires the application of numerical methods. In this section, we employ OpenMP technology to develop a parallel program which solves the Dirichlet problem for the Poisson equation.

3.6.1 Problem formulation

Let us consider, in the unit square domain

    Ω = {x | 0 < xα < 1,  α = 1, 2},

the Poisson equation

    −Δu = f(x),   x ∈ Ω,                                  (3.1)

with Dirichlet boundary conditions

    u(x) = g(x),   x ∈ ∂Ω.                                (3.2)

For simplicity, we define the function f(x) and the boundary condition g(x) as follows:

    f(x) = 1,   g(x) = 0.                                 (3.3)

We need to find an approximate solution of (3.1)–(3.3). We introduce a uniform grid ω̄ = ω ∪ ∂ω with steps h1 and h2 in the corresponding directions. Here ω is the set of interior nodes

    ω = {x | xα = iα hα,   iα = 1, 2, ..., Nα,   Nα hα = 1,   α = 1, 2},


and ∂ω is the set of boundary nodes. We apply the finite difference approximation in space to the problem (3.1)–(3.3) and obtain the following grid problem with the solution denoted by y(x), x ∈ ω̄:

    −[y(x1 − h1, x2) − 2y(x1, x2) + y(x1 + h1, x2)]/h1²
    −[y(x1, x2 − h2) − 2y(x1, x2) + y(x1, x2 + h2)]/h2² = 1,   x ∈ ω,        (3.4)

    y(x) = 0,   x ∈ ∂ω.                                                      (3.5)

The system can be written in the form of a matrix problem Ay = b, where A is the coefficient matrix and b is the right-hand side vector. The matrix is sparse (only five diagonals contain nonzero values). We solve the system using one of the best-known iterative techniques: the conjugate gradient method. The numerical algorithm is as follows:
– preset the initial value y0;
– calculate s0 = r0 = b − A y0;
– arrange the loop with k = 0, 1, ..., where we:
  – calculate pk = A sk;
  – calculate the iteration parameter τ = (rk, rk)/(sk, pk);
  – find the new iterate yk+1 = yk + τ sk;
  – calculate the residual rk+1 = rk − τ pk;
  – evaluate sk+1 = rk+1 + sk (rk+1, rk+1)/(rk, rk).
The iterative process terminates if ‖rk+1‖ ≤ ε, where ε is the required tolerance.

3.6.2 Parallel implementation

First, we include the standard libraries and define the size of the domain as demonstrated in Listing 3.12.

Listing 3.12.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#define N1 100
#define N2 100
#define M N1*N2

The main function of our program is shown in Listing 3.13.

Listing 3.13.
int main()
{
    double begin, end;
    double h1 = 1.0 / N1, h2 = 1.0 / N2;
    // Create matrix A and vectors x, b
    double *A = (double *) malloc(sizeof(double) * M * M);
    double *x = (double *) malloc(sizeof(double) * M);
    double *b = (double *) malloc(sizeof(double) * M);
    initMat(A, h1, h2);
    initVec(x, 0.0);
    initVec(b, h1 * h2);
    begin = omp_get_wtime();
    int iters = solveCG(A, x, b, 1000, 1.0e-3);
    end = omp_get_wtime();
    int threads;
    #pragma omp parallel
    threads = omp_get_num_threads();
    printf("Time = %3.2lf sec.\tIterations = %d\tThreads = %d\n",
           end - begin, iters, threads);
    vecPrint("./solution.txt", x, h1, h2);
    free(A);
    free(b);
    free(x);
    return 0;
}

Here, we define the size of the matrix variable A employing the macros N1 and N2, and specify the steps of the grid by means of the variables h1 and h2. We also allocate memory for the pointers A, x, and b. Next we initialize the matrix data as illustrated in Listing 3.14.

Listing 3.14.
void initMat(double *A, double h1, double h2)
{
    int i, j, col, row;
    double val;
    for (col = 0; col < M; col++) {
        i = col % N1;
        j = col / N1;
        for (row = 0; row < M; row++) {
            val = 0;
            if ((i > 0 && row == (j * N1 + (i - 1))) ||
                (i < (N1 - 1) && row == (j * N1 + (i + 1))))
                val = -1 * h2 / h1;
            if ((j > 0 && row == ((j - 1) * N1 + i)) ||
                (j < (N2 - 1) && row == ((j + 1) * N1 + i)))
                val = -1 * h1 / h2;
            if (row == (j * N1 + i))
                val = 2 * (h2 / h1 + h1 / h2);
            A[col * M + row] = val;
        }
    }
}

Obviously, it is a five-diagonal matrix. The implementation of the conjugate gradient method is shown in Listing 3.15.

Listing 3.15.
// Conjugate gradient method
int solveCG(double* A, double* x, double* b, double maxIter, double eps)
{
    int k;
    double *r = (double*) malloc(sizeof(double) * M);
    double *s = (double*) malloc(sizeof(double) * M);
    double *p = (double*) malloc(sizeof(double) * M);
    double *Ax = (double*) malloc(sizeof(double) * M);
    // s_0 = r_0 = b - A x_0
    matVecMult(Ax, A, x);
    vecSum(r, -1, Ax, b);
    vecSum(s, -1, Ax, b);
    for (k = 0; k < maxIter; k++) {
        // p_k = A s_k
        matVecMult(p, A, s);
        // tau = (r_k, r_k)/(s_k, p_k)
        double rr = vecDot(r, r);
        double tau = rr/vecDot(s, p);
        // x_k+1 = x_k + tau*s_k
        vecSum(x, tau, s, x);
        // r_k+1 = r_k - tau*p_k
        vecSum(r, -tau, p, r);
        // s_k+1 = r_k+1 + (r_k+1, r_k+1)/(r_k, r_k)*s_k
        double rrNew = vecDot(r, r);
        vecSum(s, rrNew/rr, s, r);
        double norm = sqrt(vecDot(r, r));
        if (norm < eps)
            break;
    }
    free(r);
    free(s);
    free(p);
    free(Ax);
    return k;
}

In the function for matrix-vector multiplication, we use the #pragma omp parallel for construct, which allows us to distribute iterations among threads.

Listing 3.16.
// y = A * x
void matVecMult(double *y, double *A, double *x)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < M; i++) {
        y[i] = 0;
        int j;
        for (j = 0; j < M; j++)
            y[i] += A[i * M + j] * x[j];
    }
}

The function for the scalar product includes the #pragma omp parallel for construct with the reduction(+:a) clause. This means that the private copies of the variable a are summed up at the end of the parallel region, as shown in Listing 3.17.

Listing 3.17.
// scalar product (x, x)
double vecDot(double *x, double *y)
{
    double a = 0;
    int i;
    #pragma omp parallel for reduction(+:a)
    for (i = 0; i < M; i++)
        a += x[i] * y[i];
    return a;
}


The complete text of the program is presented in Listing 3.18.

Listing 3.18.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#define N1 100
#define N2 100
#define M N1*N2

void initMat(double *A, double h1, double h2)
{
    int i, j, col, row;
    double val;
    for (col = 0; col < M; col++) {
        i = col % N1;
        j = col / N1;
        for (row = 0; row < M; row++) {
            val = 0;
            if ((i > 0 && row == (j * N1 + (i - 1))) ||
                (i < (N1 - 1) && row == (j * N1 + (i + 1))))
                val = -1 * h2 / h1;
            if ((j > 0 && row == ((j - 1) * N1 + i)) ||
                (j < (N2 - 1) && row == ((j + 1) * N1 + i)))
                val = -1 * h1 / h2;
            if (row == (j * N1 + i))
                val = 2 * (h2 / h1 + h1 / h2);
            A[col * M + row] = val;
        }
    }
}

void initVec(double* x, double val)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < M; i++)
        x[i] = val;
}

// y = A * x
void matVecMult(double *y, double *A, double *x)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < M; i++) {
        y[i] = 0;
        int j;
        for (j = 0; j < M; j++)
            y[i] += A[i * M + j] * x[j];
    }
}

// s = a * x + y
void vecSum(double *s, double a, double *x, double *y)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < M; i++)
        s[i] = a * x[i] + y[i];
}

// scalar product (x, x)
double vecDot(double *x, double *y)
{
    double a = 0;
    int i;
    #pragma omp parallel for reduction(+:a)
    for (i = 0; i < M; i++)
        a += x[i] * y[i];
    return a;
}

// print vector for Gnuplot
void vecPrint(char* fileName, double* x, double h1, double h2)
{
    int i, j;
    remove(fileName);
    FILE *file;
    file = fopen(fileName, "a+");
    for (i = 0; i < N1; i++) {
        for (j = 0; j < N2; j++) {
            fprintf(file, "%f %f %f \n", i*h1, j*h2, x[j * N1 + i]);
        }
        fprintf(file, "\n");
    }
    fclose(file);
}

// Conjugate gradient method
int solveCG(double* A, double* x, double* b, double maxIter, double eps)
{
    int k;
    double *r = (double*) malloc(sizeof(double) * M);
    double *s = (double*) malloc(sizeof(double) * M);
    double *p = (double*) malloc(sizeof(double) * M);
    double *Ax = (double*) malloc(sizeof(double) * M);
    // s_0 = r_0 = b - A x_0
    matVecMult(Ax, A, x);
    vecSum(r, -1, Ax, b);
    vecSum(s, -1, Ax, b);
    for (k = 0; k < maxIter; k++) {
        // p_k = A s_k
        matVecMult(p, A, s);
        // tau = (r_k, r_k)/(s_k, p_k)
        double rr = vecDot(r, r);
        double tau = rr/vecDot(s, p);
        // x_k+1 = x_k + tau*s_k
        vecSum(x, tau, s, x);
        // r_k+1 = r_k - tau*p_k
        vecSum(r, -tau, p, r);
        // s_k+1 = r_k+1 + (r_k+1, r_k+1)/(r_k, r_k)*s_k
        double rrNew = vecDot(r, r);
        vecSum(s, rrNew/rr, s, r);
        double norm = sqrt(vecDot(r, r));
        if (norm < eps)
            break;
    }
    free(r);
    free(s);
    free(p);
    free(Ax);
    return k;
}

int main()
{
    double begin, end;
    double h1 = 1.0 / N1, h2 = 1.0 / N2;
    // Create matrix A and vectors x, b
    double *A = (double *) malloc(sizeof(double) * M * M);
    double *x = (double *) malloc(sizeof(double) * M);
    double *b = (double *) malloc(sizeof(double) * M);
    initMat(A, h1, h2);
    initVec(x, 0.0);
    initVec(b, h1 * h2);
    begin = omp_get_wtime();
    int iters = solveCG(A, x, b, 1000, 1.0e-3);
    end = omp_get_wtime();
    int threads;
    #pragma omp parallel
    threads = omp_get_num_threads();
    printf("Time = %3.2lf sec.\tIterations = %d\tThreads = %d\n",
           end - begin, iters, threads);
    vecPrint("./solution.txt", x, h1, h2);
    free(A);
    free(b);
    free(x);
    return 0;
}

We set the required tolerance to ε = 10⁻³ and the grid sizes to N1 = 100, N2 = 100. After compiling the program, we run it with different numbers of threads. The solution is found in 75 iterations (Figure 3.1). In the output we see that the program execution time decreases as the number of threads increases:


Fig. 3.1. Solution of the Dirichlet problem (using gnuplot).

$ OMP_NUM_THREADS=1 ./poisson
Time = 58.34 sec.   Iterations = 75   Threads = 1
$ OMP_NUM_THREADS=2 ./poisson
Time = 29.87 sec.   Iterations = 75   Threads = 2
$ OMP_NUM_THREADS=4 ./poisson
Time = 14.97 sec.   Iterations = 75   Threads = 4
$ OMP_NUM_THREADS=8 ./poisson
Time = 7.94 sec.    Iterations = 75   Threads = 8
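These timings correspond to speedups of approximately 58.34/29.87 ≈ 1.95 on two threads, 58.34/14.97 ≈ 3.9 on four threads, and 58.34/7.94 ≈ 7.3 on eight threads, i.e. a parallel efficiency above 90% for this problem size.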

Aleksandr V. Grigoriev, Ivan K. Sirditov, and Petr A. Popov

4 MPI technology

Abstract: Message Passing Interface Standard (MPI)¹ is a portable, efficient, and flexible message-passing standard for developing parallel programs. Standardization of MPI is performed by the MPI Forum, which incorporates more than 40 organizations including suppliers, researchers, software developers, and users. MPI is not, however, an IEEE or ISO standard but, in fact, an industry standard for writing message-passing programs for high-performance computing systems. There are a large number of MPI implementations, including both commercial and free variants. MPI is widely used to develop programs for clusters and supercomputers. MPI is mainly oriented to systems with distributed memory, where the costs of message-passing are high, while OpenMP is more appropriate for systems with shared memory (multicore processors with shared cache). Both technologies can be applied together in order to optimally employ multicore systems.

4.1 Preliminaries

Here the basic concepts of the standard are presented and a simple example is considered.

4.1.1 General information

MPI is a message-passing library specification for developers and users. It is not a library on its own; it is a specification of what such a library should be. MPI is primarily focused on the message-passing parallel programming model: data is transferred from the address space of one process to the address space of another process through cooperative operations on each process. The goal of MPI is to provide a widely used standard for programs based on message-passing. There have been a number of MPI versions, the current version being MPI-3. The interface specification is defined for C/C++ and Fortran 77/90. Implementations of MPI libraries may differ in both supported specification versions and standard features. The MPI standard was released in 1994 to resolve problems with various architectures of parallel computing systems and to enable the creation of portable programs. The existence of such a standard allows users to develop libraries which hide the majority of architectural features of parallel computing systems, thus greatly simplifying the development of parallel programs.

1 www.mpi-forum.org.

Moreover, standardization of the basic system level has greatly improved the portability of parallel programs, because there are many implementations of the MPI standard for most computer platforms. The main goal of the MPI specification is a combination of portable, efficient, and advanced message-passing tools. This means the ability to develop programs using specialized hardware or software of different providers. At the same time, many properties, such as an application-oriented structure of the processes or dynamically controllable processes with a wide range of collective operations, can be used in any parallel program. A large number of libraries for numerical analysis such as ScaLAPACK², AZTEC³, PETSc⁴, Trilinos⁵, and Hypre⁶ have been developed on the basis of MPI.

4.1.2 Compiling

Several implementations of the MPI standard exist, both free (OpenMPI⁷, LAM/MPI⁸, MPICH⁹ etc.) and commercial (HP-MPI, Intel MPI etc.). Here we consider the OpenMPI library. To install it, we apply the following command:

$ sudo apt-get install mpi

To compile programs written in C, we use

$ mpicc <source file> -o <executable>

For C++ programs, we employ

$ mpicxx <source file> -o <executable>

To execute programs, we use

$ mpirun -np <number of processes> <executable>

2 www.netlib.org/scalapack.
3 http://acts.nersc.gov/aztec/.
4 www.mcs.anl.gov/petsc/.
5 http://trilinos.org/.
6 http://acts.nersc.gov/hypre/.
7 http://www.open-mpi.org/.
8 http://www.lam-mpi.org/.
9 http://www.mpich.org/.


For a computer with a 4-core processor, at most 4 processes can actually run in parallel. If we launch an MPI program with more than 4 processes, some of them will be executed in a time-shared manner on the same cores, and the program may run inefficiently.

4.1.3 Getting started with MPI

An MPI program is a set of simultaneously running processes. Each process works in its own address space, which implies that there are no shared variables or data. Processes can be executed on several processors or on a single processor. Each process of a parallel program is generated from a copy of the same code (SPMD – Single Program, Multiple Data). This program code, in the form of an executable, must be available to all processes when the parallel program is launched. The number of processes is determined at launch by means of the execution environment for MPI programs, and it cannot be changed during the calculations (the MPI-2 standard provides the ability to change the number of processes dynamically). All program processes are numbered from 0 to p − 1, where p is the total number of processes. The process number is called the rank of the process. It allows us to load a particular sub-task depending on the rank of a process, i.e. the initial task is divided into sub-tasks (decomposition). The common technique is as follows: each sub-task is issued as a separate unit (function, module), and the same program loader is invoked for all processes, which loads one or another sub-task depending on the rank of the process.

First, we must call the MPI_INIT function, i.e. the initialization function. Each MPI program should call this function once before calling any other MPI function, and MPI_INIT cannot be called multiple times in a single program. MPI provides objects called communicators and groups which determine which processes can communicate with each other. MPI_INIT defines a global communicator MPI_COMM_WORLD for each process which calls it. MPI_COMM_WORLD is a predefined communicator and includes all processes and MPI groups. In addition, at the start of a program, there is the MPI_COMM_SELF communicator, which contains only the current process, as well as the MPI_COMM_NULL communicator, which does not contain any processes. All MPI communication functions require a communicator as an argument. Processes can communicate with each other only if they have a common communicator. Each communicator contains a group, that is, a list of processes. Each of these processes has its own unique integer identifier called the rank. The rank is assigned by the system during process initialization. The rank is sometimes also called the task ID. Rank numbering begins with zero. Processes of parallel programs are grouped together. The structure of the formed groups is arbitrary. Groups may be identical and contain other groups which may or

may not intersect. Each group forms a communication area which is associated with a communicator. Processes can communicate only within a certain communicator; messages sent in different communicators do not overlap and do not interfere with each other. When all communications have been completed, the MPI_FINALIZE function needs to be invoked. This function cleans up all MPI data structures. MPI_FINALIZE must be called at the end; no MPI functions can be used after MPI_FINALIZE.

4.1.4 Determination of the number of processes and their ranks

The basics of working with MPI will now be demonstrated in Listing 4.1 using an example which displays the process rank and the total number of running processes.

Listing 4.1.

#include <stdio.h>
#include <mpi.h>
int main(int argc, char* argv[])
{
    int rank, size;
    // MPI initialization
    MPI_Init(&argc, &argv);
    // determine process count
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    // determine rank of process
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("\n Hello, World from process %3d of %3d", rank, size);
    // finalize program
    MPI_Finalize();
    return 0;
}

The program with two processes can result in two different outputs as seen below:

variation 1.
Hello, World from process 0 of 2
Hello, World from process 1 of 2

variation 2.
Hello, World from process 1 of 2
Hello, World from process 0 of 2


These output variants result from the fact that the order of message printing is determined by which process reaches the print statement first, and it is difficult to predict this in advance. Let us now consider the implementation of the above example in detail. At the beginning of the program, the library header file mpi.h is included. It contains definitions of the functions, types, and constants of MPI. This file must be included in all modules which use MPI. To get started with MPI, we call MPI_Init as follows:

int MPI_Init (int *argc, char ***argv)

MPI_Init can be employed to pass arguments from the command line to all processes, although this is not recommended by the standard and essentially depends on the MPI implementation. argc is a pointer to the number of command line options, whereas argv contains the command line options. The MPI_Comm_size function returns the number of started processes size for a given communicator comm as follows:

int MPI_Comm_size (MPI_Comm comm, int *size)

The way the user runs these processes depends on the MPI implementation, but any program can determine the number of running processes with this function. To determine the rank of a process, we apply the following function:

int MPI_Comm_rank (MPI_Comm comm, int *rank)

This returns the rank of the process which calls this function in the communication area of the specified communicator.

4.1.5 The standard MPI timer

In order to parallelize a sequential program or to write a parallel program, it is necessary to measure the runtime of calculations so as to evaluate the acceleration achieved. Standard timers employed in such cases depend on hardware platforms and operating systems. The MPI standard includes special functions for measuring time. Applying these functions eliminates the dependency on the runtime environment of parallel programs. To get the current time, we simply use the following function:

double MPI_Wtime (void);

Listing 4.2 demonstrates the use of this function.

Listing 4.2.
double starttime, endtime;
starttime = MPI_Wtime();
. . .
endtime = MPI_Wtime();
printf("Work time %f seconds\n", endtime - starttime);

The function returns the number of seconds which have elapsed since a certain point in time. This reference point is arbitrary and may depend on the MPI implementation, but it will not change during the lifetime of the process. The MPI_Wtime function should only be used to determine the duration of the execution of certain code fragments of parallel programs.

4.2 Message-passing operations

We now consider data transfer operations between processes and their basic modes (synchronous, blocking, etc.). We discuss MPI data types and their compliance with C data types.

4.2.1 Data exchange between two processes

The basis of MPI is message-passing operations. There are two types of communication functions: operations between two processes, and collective operations for the simultaneous interaction of several processes. We begin with point-to-point communications. This means that there are two points of interaction: a sending process and a receiving process. Point-to-point communications and other types of communications always occur within a single communicator, which is specified as a parameter in function calls. The ranks of the processes involved in exchanges are calculated with respect to the specified communicator. In Listing 4.3 we consider a program which describes the behavior of two interacting processes, one of which sends data while the other receives and prints it.


Listing 4.3.
#include <stdio.h>
#include <mpi.h>
int main(int argc, char* argv[])
{
    int rank;
    MPI_Status st;
    char buf[64];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        // process-sender
        sprintf(buf, "Hello from process 0");
        MPI_Send(buf, 64, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // process-receiver
        MPI_Recv(buf, 64, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
        printf("Process %d received %s \n", rank, buf);
    }
    MPI_Finalize();
    return 0;
}

The output of the program is

Process 1 received Hello from process 0

Here the process with rank 0 puts the welcome message into the buffer buf and then sends it to the process with rank 1. The process with rank 1 receives this message and prints it. We apply point-to-point exchange functions to arrange the data transfer in the above example. To send data, the following function is used:

int MPI_Send (void * buf, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm);

where buf is the starting address of the send buffer; count is the number of items in data; type is the datatype of sent items; dest is the rank of the destination process in the group associated with the communicator comm; tag is the data identifier; comm is the communicator. In our example, we send the entire array buf. Since the array consists of 64 elements, count and type parameters are specified as 64 and the MPI_CHAR type, respectively. The dest argument determines the rank of the process for which the message is intended. The tag argument sets the so-called message tag, which is an integer that is passed along with the message and checked upon receiving. In the example, the rank of the receiver is 1, and the tag is 0.

To receive data, we employ the following function:

int MPI_Recv (void * buf, int count, MPI_Datatype type, int source, int tag, MPI_Comm comm, MPI_Status * status);

where buf is the starting address of the receive buffer; count is the number of received items; type is the datatype of received items; source is the rank of the process from which data was sent; tag is the data identifier; comm is the communicator; and status contains the attributes of the received message. MPI provides additional data exchange functions which differ from each other in the ways they organize data exchange. The two functions described above implement the standard mode with blocking. In functions with blocking, it is not possible to exit from the function until the exchange operation is completed. Therefore, sending functions are blocked until all sent data has been placed in a buffer (in some MPI implementations, it may be an intermediate system buffer or a buffer of the receiving process). On the other hand, receiving functions are blocked until all received data has been read from the buffer into the address space of the receiving process. Both blocking and nonblocking operations support four execution modes. Table 4.1 presents the basic point-to-point communication functions.

Table 4.1. Point-to-point functions.

Execution modes     Blocking     Nonblocking
Send                MPI_Send     MPI_Isend
Synchronous send    MPI_Ssend    MPI_Issend
Buffered send       MPI_Bsend    MPI_Ibsend
Ready send          MPI_Rsend    MPI_Irsend
Receive             MPI_Recv     MPI_Irecv

In the table we can see how the functions are named. To the names of the basic functions Send/Recv, we add the following prefixes:
– The S (synchronous) prefix indicates the synchronous data transfer mode. Here an operation ends only when the receiving of data ends. The function is non-local.
– The B (buffered) prefix denotes the buffered data transfer mode. In this case a special function creates a buffer in the address space of the sending process which will be used in the operations. The send operation terminates when data is placed in this buffer. The function is local.
– The R (ready) prefix indicates the ready data exchange mode. The send operation starts only when the receiving operation has been initiated. The function is non-local.
– The I (immediate) prefix refers to nonblocking operations; a minimal sketch of their use is given after this list.
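The sketch below (assuming rank has already been obtained with MPI_Comm_rank and at least two processes are running) shows the typical nonblocking pattern: start the transfer, do other work, then wait for completion.

MPI_Request req;
MPI_Status st;
double data[64];
if (rank == 0) {
    MPI_Isend(data, 64, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    // ... computations not involving data can overlap with the transfer ...
    MPI_Wait(&req, &st);
} else if (rank == 1) {
    MPI_Irecv(data, 64, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
    // ... computations not involving data ...
    MPI_Wait(&req, &st);
}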


4.2.2 Data types

We must explicitly specify the type of data being sent for data exchange. MPI contains a large set of basic datatypes which are essentially consistent with the data types of the C and Fortran programming languages. Table 4.2 shows the basic MPI data types.

Table 4.2. Compliance between MPI and C datatypes.

MPI type               C type
MPI_BYTE
MPI_CHAR               signed char
MPI_DOUBLE             double
MPI_FLOAT              float
MPI_INT                int
MPI_LONG               long
MPI_LONG_DOUBLE        long double
MPI_PACKED
MPI_SHORT              short
MPI_UNSIGNED_CHAR      unsigned char
MPI_UNSIGNED           unsigned int
MPI_UNSIGNED_LONG      unsigned long
MPI_UNSIGNED_SHORT     unsigned short

However, if the system provides additional data types, MPI will also support them. For example, if the system supports complex variables with double precision DOUBLE COMPLEX, there will be the MPI_DOUBLE_COMPLEX data type. The MPI_BYTE and MPI_PACKED data types are used to transmit binary data without any conversion. In addition, it is possible to create new derived data types for a more precise and brief description of data.
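For example, a derived type describing a block of consecutive elements can be built with MPI_Type_contiguous; the sketch below (the names row_type, A, i, and dest are chosen only for illustration, and N is assumed to be the row length) sends one matrix row as a single item:

MPI_Datatype row_type;
MPI_Type_contiguous(N, MPI_DOUBLE, &row_type);   // N consecutive doubles
MPI_Type_commit(&row_type);
MPI_Send(&A[i * N], 1, row_type, dest, 0, MPI_COMM_WORLD);
MPI_Type_free(&row_type);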

4.3 Functions of collective interaction

Functions for collective interaction are discussed below. We consider examples of matrix-vector multiplication and the scalar product of vectors.

4.3.1 General information

A set of operations such as point-to-point communication is sufficient for programming any algorithm. However, MPI is not limited to this set of communication operations. One of the strong features of MPI is a wide set of collective operations which allow us to solve the most frequently arising problems in parallel programming. For

example, we often need to send some variable or array from one process to all other processes. Of course, we can write a procedure for this using the Send/Recv functions, but it is more convenient to employ the collective MPI_Bcast operation, which implements the broadcast of data from one process to all other processes in the communication environment. This operation is very efficient, since the function is implemented using the internal capabilities of the communication environment. The main difference between collective and point-to-point operations is that collective operations always involve all processes associated with a certain communicator. Noncompliance with this rule leads to either crashing or hanging. A set of collective operations includes the following:
– the synchronization of all processes (MPI_Barrier);
– collective communication operations, including:
  – broadcasting of data from one process to all other processes in a communicator (MPI_Bcast);
  – gathering of data from all processes into a single array in the address space of the root process (MPI_Gather, MPI_Gatherv);
  – gathering of data from all processes into a single array and broadcasting of the array to all processes (MPI_Allgather, MPI_Allgatherv);
  – splitting of data into fragments and scattering them to all other processes in a communicator (MPI_Scatter, MPI_Scatterv);
  – the combined operation Scatter/Gather (All-to-All), i.e. each process divides data from its transmit buffer and distributes fragments to all other processes while collecting fragments sent by other processes in its receive buffer (MPI_Alltoall, MPI_Alltoallv);
– global computational operations (sum, min, max etc.) over the data located in different address spaces of processes:
  – saving a result in the address space of one process (MPI_Reduce);
  – sending a result to all processes (MPI_Allreduce);
  – the operation Reduce/Scatter (MPI_Reduce_scatter);
  – prefix reduction (MPI_Scan).
All communication routines, except for MPI_Bcast, are available in the following two versions:
– the simple option, when all parts of the transmitted data have the same length and occupy a contiguous area in the process's address space;
– the vector option, which provides more opportunities for organizing collective communications in terms of both block lengths and data placement in the address spaces of processes. Vector variants have the v character at the end of the function name.


Distinctive features of collective operations are as follows:
– a collective communication does not interact with point-to-point communications;
– collective communications are performed with an implicit barrier synchronization: the function returns in each process only when all processes have finished the collective operation;
– the amount of received data must be equal to the amount of sent data;
– the data types of sent and received data must be the same;
– data have no tags.
Next we use some examples to study collective operations in detail.

4.3.2 Synchronization function

The process synchronization function MPI_Barrier blocks the calling process until all other processes of the group have called this function. The function finishes at the same time in all processes (all processes overcome the barrier simultaneously).

int MPI_Barrier (MPI_Comm comm),

where comm is a communicator. Barrier synchronization is employed, for example, to complete a stage of solving a problem whose results will be used in the next stage. Using barrier synchronization ensures that none of the processes start the next stage before they are allowed to. Implicit synchronization of processes is performed by any collective function.
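A typical usage pattern is sketched below (compute_local_part() and use_all_parts() are hypothetical routines standing for two consecutive stages of a computation, and rank is assumed to be known):

// stage 1: every process prepares its part of the data
compute_local_part(rank);
// no process proceeds until all processes have finished stage 1
MPI_Barrier(MPI_COMM_WORLD);
// stage 2: the results of stage 1 can now be used safely
use_all_parts();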

4.3.3 Broadcast function

The broadcast of data from one process to all other processes in a communicator is performed using the MPI_Bcast function (see Figure 4.1). The process with rank root sends a message from its buffer to all processes in the comm communicator.

int MPI_Bcast (void * buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm),

where buffer is the starting address of the buffer; count is the number of items in data; datatype is the MPI datatype of sent data; root is the rank of the sending process; comm is the communicator.

After performing this function, each process in the comm communicator, including the sender, will receive a copy of the sent data from the sender process root.

Fig. 4.1. Graphic interpretation of the MPI_Bcast function.
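A minimal sketch of its use (assuming rank has been obtained with MPI_Comm_rank; the parameter n and its value are chosen only for illustration) is:

int n;
if (rank == 0)
    n = 100;   // e.g. a parameter read or computed on the root process
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
// from this point every process in MPI_COMM_WORLD has n == 100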

4.3.4 Data gathering functions

There are four functions for gathering data from all processes: MPI_Gather, MPI_Allgather, MPI_Gatherv, and MPI_Allgatherv. Each of these functions extends the possibilities of the previous one. The MPI_Gather function assembles the data blocks sent by all processes into an array in the process with the root rank (Figure 4.2). The size of the blocks should be the same. Gathering takes place in rank order, i.e. data sent by the i-th process from its buffer sendbuf is located in the i-th portion of the buffer recvbuf of the process root. The size of the array where data is collected should be sufficient to assemble it.

int MPI_Gather (void * sendbuf, int sendcount, MPI_Datatype sendtype, void * recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

Here sendbuf is the starting address of sent data; sendcount is the number of items in data; sendtype is the datatype of sent data; recvbuf is the starting address of the receive buffer (used only in the receiving process root); recvcount is the number of elements received from each process (used only in the receiving process root); recvtype is the datatype of received elements; comm is the communicator. The MPI_Allgather function described in Figure 4.3 is the same as MPI_Gather, but in this case all processes are receivers. Data sent by the i-th process from its buffer sendbuf is placed in the i-th portion of the buffer recvbuf of each process. After the operation, the contents of buffers recvbuf of all processes are the same.


Fig. 4.2. Graphic interpretation of the MPI_Gather function.

int MPI_Allgather (void * sendbuf, int sendcount, MPI_Datatype sendtype, void * recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)

sendbuf is the starting address of sent data; sendcount is the number of items in data; sendtype is the type of sent data; recvbuf is the starting address of the receive buffer; recvcount is the number of elements received from each process; recvtype is the datatype of received elements; comm is the communicator.

Fig. 4.3. Graphic interpretation of the MPI_Allgather function. In this sketch, the Y-axis is the group of processes and the X-axis indicates data blocks.

The MPI_Gatherv function (Figure 4.4) allows us to gather data blocks with different numbers of elements from each process, since the number of elements received from each process is defined individually using the array recvcounts. This function also provides greater flexibility in locating data in the receiving process by introducing the new argument displs.

int MPI_Gatherv(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int *recvcounts, int *displs, MPI_Datatype recvtype, int root, MPI_Comm comm),

where sendbuf is the starting address of the send buffer; sendcount is the number of items in data; sendtype is the datatype of sent data; recvbuf is the starting address of the receive buffer; recvcounts is an integer array (whose length equals the number of processes), where the i-th value determines the number of elements to be received from the i-th process; displs is an integer array (whose length equals the number of processes), where the i-th value is the displacement of the i-th block of data with respect to recvbuf; recvtype is the datatype of received elements; root is the rank of the receiving process; comm is the communicator. Messages are placed in the buffer of the receiving process in accordance with the sending process numbers. To be exact, data sent by the i-th process is placed in the address space of the root process starting at recvbuf + displs[i].

Fig. 4.4. Graphic interpretation of the MPI_Gatherv function.

The MPI_Allgatherv function is similar to the MPI_Gatherv function, except that gathering is performed by all processes. There is therefore no need for the root argument.

int MPI_Allgatherv(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int *recvcounts, int *displs, MPI_Datatype recvtype, MPI_Comm comm)

where sendbuf is the starting address of the send buffer; sendcount is the number of items in data; sendtype is the data type of sent data; recvbuf is the starting address of the receive buffer; recvcounts is an integer array (whose length equals the number of processes), where the i-th value determines the number of elements to be received from the i-th process; displs is an integer array (whose length equals the number of processes), where the i-th value is the displacement of the i-th block of data with respect to recvbuf; recvtype is the data type of received elements; comm is the communicator.


4.3.5 Data distribution functions

The functions for sending data blocks to all processes in a group are MPI_Scatter and MPI_Scatterv. The MPI_Scatter function (see Figure 4.5) splits the data from the send buffer of the root process into pieces of size sendcount and sends the i-th piece to the receive buffer of the process with rank i (including itself). The root process uses two buffers (sending and receiving), so all function parameters are significant for the calling routine. The rest of the processes in the comm communicator are only recipients, and therefore their parameters which specify the send buffer are not significant.

int MPI_Scatter(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

where sendbuf is the starting address of the send buffer (employed only by root); sendcount is the number of items sent to each process; sendtype is the data type of sent items; recvbuf is the starting address of the receive buffer; recvcount is the number of received items; recvtype is the datatype of received items; root is the rank of the sending process; comm is the communicator. The data type of sent items sendtype must match the type recvtype of received items, and the number of sent items sendcount must be equal to the number of received items recvcount. It should be noted that the value of sendcount at the root process is the number of elements sent to each process, not the total number. The Scatter function is the inverse of the Gather function.

Fig. 4.5. Graphic interpretation of the MPI_Scatter function.

The MPI_Scatterv function (Figure 4.6) is a vector version of the MPI_Scatter function, which allows us to send a different number of elements to each process. The starting addresses of block elements sent to the i-th process are specified in the

array of displacements displs, and the numbers of sent items are defined in the array sendcounts. This function is the inverse of the MPI_Gatherv function.

int MPI_Scatterv(void* sendbuf, int *sendcounts, int *displs, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

where sendbuf is the starting address of the send buffer (used only by root); sendcounts is an integer array (whose length equals the number of processes) containing the number of elements sent to each process; displs is an integer array (whose length equals the number of processes), where the i-th value determines the displacement of the data sent to the i-th process with respect to sendbuf; sendtype is the data type of sent items; recvbuf is the starting address of the receive buffer; recvcount is the number of received items; recvtype is the data type of received items; root is the rank of the sending process; comm is the communicator.

Fig. 4.6. Graphic interpretation of the MPI_Scatterv function.
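The following hedged sketch (not from the original text) shows one way to fill sendcounts and displs when an array of n elements on the root process must be distributed over processes whose number does not divide n evenly; data, n, rank, and size are assumed to be defined already.

    // Illustrative sketch: distribute n doubles from root as evenly as possible.
    int *sendcounts = (int *) malloc(sizeof(int) * size);
    int *displs = (int *) malloc(sizeof(int) * size);
    for (int i = 0; i < size; i++) {
        sendcounts[i] = n / size + (i < n % size ? 1 : 0);     // first n % size parts get one extra element
        displs[i] = (i == 0) ? 0 : displs[i-1] + sendcounts[i-1];
    }
    double *local = (double *) malloc(sizeof(double) * sendcounts[rank]);
    MPI_Scatterv(data, sendcounts, displs, MPI_DOUBLE,
                 local, sendcounts[rank], MPI_DOUBLE, 0, MPI_COMM_WORLD);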

4.3.6 Matrix-vector multiplication

Now we discuss how to implement the parallel matrix-vector multiplication. We consider an N×N square matrix A and a vector b with N elements. The result of multiplying the matrix A by the vector b is a vector c with N elements. We describe the parallel implementation of the algorithm, which splits the matrix into strips and calculates separate parts of the vector c within individual processes, and then brings the vector elements together.


Listing 4.4.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 100

int main(int argc, char* argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    int i, j;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // Create matrix A and vectors b and c
    double *A = (double *) malloc(sizeof(double) * N * N);
    double *local_A = (double *) malloc(sizeof(double) * N * N / size);
    double *b = (double *) malloc(sizeof(double) * N);
    double *local_c = (double *) malloc(sizeof(double) * N / size);
    double *c = (double *) malloc(sizeof(double) * N);
    // Initialize matrix and vectors in zero process.
    // It is convenient to consider the matrix as a
    // one-dimensional array with i*N+j indexes.
    if (rank == 0) {
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                A[i * N + j] = rand() % 100;
        for (i = 0; i < N; i++)
            b[i] = rand();
    }
    // Broadcast the vector b
    MPI_Bcast(b, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    // Divide the matrix into horizontal stripes
    MPI_Scatter(A, N * N / size, MPI_DOUBLE, local_A, N * N / size,
                MPI_DOUBLE, 0, MPI_COMM_WORLD);
    for (i = 0; i < N / size; i++) {
        local_c[i] = 0;
        for (j = 0; j < N; j++)
            local_c[i] += local_A[i * N + j] * b[j];
    }
    // Collect the result in zero process
    MPI_Gather(local_c, N / size, MPI_DOUBLE, c, N / size, MPI_DOUBLE,
               0, MPI_COMM_WORLD);
    // Print the result
    if (rank == 0) {
        for (i = 0; i < N; i++)
            printf("%3.3f\n", c[i]);
    }
    free(A);
    free(b);
    free(c);
    free(local_c);
    free(local_A);
    MPI_Finalize();
    return 0;
}

The partitioning of the matrix is performed using the MPI_Scatter function. For convenience, the matrix is represented as a one-dimensional array, and its parts are scattered among the separate processes. Since the vector b is employed by all processes, we apply the MPI_Bcast function to send it. Finally, we gather the vector c in the process of rank 0 with MPI_Gather.

4.3.7 Combined collective operations

The MPI_Alltoall function (see Figure 4.7) combines the Scatter and Gather functions and is an extension of the Allgather function, where each process sends different data to different recipients. The i-th process sends the j-th block of its send buffer to the j-th process, which puts the received data in the i-th block of its receive buffer. The amount of sent data must equal the amount of received data for each process.

int MPI_Alltoall(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)

Where sendbuf is the starting address of the send buffer; sendcount is the number of sent items; sendtype is the data type of sent items; recvbuf is the starting address of the receive buffer; recvcount is the number of items received from each process; recvtype is the data type of received items; comm is the communicator.

Fig. 4.7. Graphic interpretation of the MPI_Alltoall function.
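A minimal usage sketch (added here for illustration; rank and size are assumed to be defined as in the previous listings) in which every process sends one integer to every process:

    // Illustrative sketch: each process sends one int to every process.
    int *sendbuf = (int *) malloc(sizeof(int) * size);
    int *recvbuf = (int *) malloc(sizeof(int) * size);
    for (int i = 0; i < size; i++)
        sendbuf[i] = rank * 100 + i;          // block i is destined for process i
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);
    // On process j, recvbuf[i] now contains i * 100 + j.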


The MPI_Alltoallv function is a vector version of Alltoall. It sends and receives blocks of various lengths with more flexible placement of sent and received data.
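For reference, its C prototype follows the same parameter conventions as MPI_Gatherv and MPI_Scatterv, with separate count and displacement arrays for the send and receive sides (shown here for convenience; it is not discussed further in this section):

int MPI_Alltoallv(void* sendbuf, int *sendcounts, int *sdispls, MPI_Datatype sendtype, void* recvbuf, int *recvcounts, int *rdispls, MPI_Datatype recvtype, MPI_Comm comm)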

4.3.8 Global computational operations

In parallel programming, mathematical operations on data blocks distributed among processors are called global reduction operations. In general, a reduction operation is an operation, one argument of which is a vector, and the result is a scalar value obtained by applying a mathematical operation to all components of the vector. For example, if the address space of all processes of a group contains a variable var (there is no need to have the same value on each process), then we can apply the operation of global summation or the SUM reduction operation to this variable and obtain one value, which will contain the sum of all local values of this variable. These operations are one of the basic tools for organizing distributed computations. In MPI, the global reduction operations are available in the following versions:
– MPI_Reduce – operation which saves a result in the address space of one process;
– MPI_Allreduce – operation which saves a result in the address space of all processes;
– MPI_Scan – prefix reduction operation which returns a vector as a result. The i-th component of this vector is the result of the reduction of the first i components of a distributed vector;
– MPI_Reduce_scatter – combined operation Reduce/Scatter.
The MPI_Reduce function (Figure 4.8) works as follows. A global reduction operation specified by op is conducted on the first elements of the send buffer on each process, and the result is sent to the first element of the receive buffer of the root process. The same action is then done with the second element of the buffer, and so on.

int MPI_Reduce(void* sendbuf, void* recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)

Where sendbuf is the starting address of the send buffer; recvbuf is the starting address of the result buffer (used only in the root process); count is the number of elements in the send buffer; datatype is the data type of elements in the send buffer; op is the reduction operation; root is the rank of the receiving process; comm is the communicator.
As the op operation, it is possible to apply either predefined operations or user-defined operations. All predefined operations are associative and commutative. User-defined operations must be at least associative. The order of reduction is determined by the ranks of the processes in the group. The types of the elements must be compatible with the op operation. Table 4.3 presents the predefined operations which can be used in the reduction functions.

Fig. 4.8. Graphic interpretation of the MPI_Reduce function.

Table 4.3. Predefined reduction operations and compatible data types.

Name          Operation                          Data type
MPI_MAX       Maximum                            Integer, Floating point
MPI_MIN       Minimum                            Integer, Floating point
MPI_SUM       Sum                                C integer, Floating point, Complex
MPI_PROD      Product                            C integer, Floating point, Complex
MPI_LAND      Logical AND                        C integer, Logical
MPI_LOR       Logical OR                         C integer, Logical
MPI_LXOR      Logical exclusive OR               C integer, Logical
MPI_BAND      Bit-wise AND                       C integer, Byte
MPI_BOR       Bit-wise OR                        C integer, Byte
MPI_BXOR      Bit-wise exclusive OR              C integer, Byte
MPI_MAXLOC    Maximum and location of maximum    Special type for this function
MPI_MINLOC    Minimum and location of minimum    Special type for this function

The MAXLOC and MINLOC operations are carried out with special pair types, each element of which stores two values: the value governing the search of the maximum or minimum, and the index of the element. MPI for C provides 6 such predefined types:
– MPI_FLOAT_INT is float and int;
– MPI_DOUBLE_INT is double and int;
– MPI_LONG_INT is long and int;
– MPI_2INT is int and int;
– MPI_SHORT_INT is short and int;
– MPI_LONG_DOUBLE_INT is long double and int.
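As a short sketch (not part of the original text), the fragment below finds the global maximum of a locally computed value together with the rank that owns it; local_value, rank, and the communicator are assumed to be set up beforehand.

    // Illustrative sketch: global maximum and the rank that owns it.
    struct {
        double value;
        int    index;
    } in, out;
    in.value = local_value;   // assumed to be computed earlier on each process
    in.index = rank;
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, MPI_COMM_WORLD);
    // out.value holds the global maximum, out.index the rank where it was found.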


The MPI_Allreduce function (Figure 4.9) stores the result of the reduction in the address space of all processes, so the root parameter is absent from the argument list. The rest of the parameters are the same as in the previous function.

int MPI_Allreduce(void* sendbuf, void* recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

Where sendbuf is the starting address of the send buffer; recvbuf is the starting address of the receive buffer; count is the number of elements in the send buffer; datatype is the data type of elements in the send buffer; op is the reduce operation; comm is the communicator.

Fig. 4.9. Graphic interpretation of the MPI_Allreduce function.

The MPI_Reduce_scatter function combines the reduction operation and the scattering of results.

int MPI_Reduce_scatter(void* sendbuf, void* recvbuf, int *recvcounts, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

Where sendbuf is the starting address of the send buffer; recvbuf is the starting address of the receive buffer; recvcounts is an integer array that specifies the number of result elements distributed to each process; datatype is the data type of elements in the send buffer; op is the reduce operation; comm is the communicator.
The MPI_Reduce_scatter function (see Figure 4.10) differs from the MPI_Allreduce function in that the result of the operation is cut into disjoint parts according to the number of processes in the group, and the i-th part is sent to the i-th process. The lengths of these parts are defined by the third parameter, which is an array.
The MPI_Scan function (Figure 4.11) performs a prefix reduction. Its parameters are the same as in the MPI_Allreduce function, but the results obtained by each process differ. The operation sends the reduction of the values of the send buffers of the processes with ranks 0, 1, ..., i to the receive buffer of the i-th process.


Fig. 4.10. Graphic interpretation of the MPI_Reduce_scatter function.

int MPI_Scan(void* sendbuf, void* recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

Where sendbuf is the starting address of the send buffer; recvbuf is the starting address of the receive buffer; count is the number of elements in the send buffer; datatype is the data type of elements in the send buffer; op is the reduce operation; comm is the communicator.

Fig. 4.11. Graphic interpretation of the MPI_Scan function.
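A typical application of MPI_Scan is to turn per-process counts into offsets, for instance before a call to MPI_Gatherv. In the hedged sketch below, nlocal is assumed to hold the number of elements owned by the calling process.

    // Illustrative sketch: inclusive prefix sum of the local counts.
    int prefix = 0;
    MPI_Scan(&nlocal, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    // prefix is the sum of nlocal over ranks 0..rank, so the first global
    // index owned by this process is prefix - nlocal.
    int offset = prefix - nlocal;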

4.3.9 Scalar product of vectors

Let us consider the implementation of the scalar product of vectors. First, the sequential version of the scalar product is presented in Listing 4.5 below.


Listing 4.5.

double vectorDot(double *a, double *b){
    double p = 0;
    for (int i = 0; i < N; i++)
        p += a[i]*b[i];
    return p;
}

In a parallel implementation of the scalar product, vectors are stored in a distributed manner, i.e. each process contains only part of a vector. Each process calculates the local scalar product of vectors. To derive the total scalar product, we need to use the reduction operation of summation (Figure 4.12).

Fig. 4.12. Sequential and parallel implementations of the scalar product of vectors.

Listing 4.6.

double vectorDot(double *a, double *b, int size){
    double p = 0;
    for (int i = 0; i < (N / size); i++)
        p += a[i]*b[i];
    double r = 0;
    MPI_Allreduce(&p, &r, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return r;
}

The above function can be employed to calculate the norm of vectors: ‖x‖ = √(x, x), i.e. to take the square root of the scalar product.
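For instance, a small helper of the following form could be built on top of the parallel vectorDot from Listing 4.6 (a sketch, not part of the original code; sqrt requires math.h):

    // Euclidean norm of a distributed vector, reusing the parallel scalar product.
    double vectorNorm(double *a, int size) {
        return sqrt(vectorDot(a, a, size));
    }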


4.4 Dirichlet problem for the Poisson equation

In the course of the book we consider the Dirichlet problem for the Poisson equation as the basic problem. The problem was described in detail in the sixth section of the previous chapter. Therefore, we provide only the problem formulation here. The approximation details and algorithm of iterative solution of the grid problem can be found in the chapter devoted to OpenMP technology. Here we construct a parallel implementation to solve the considered problem on the basis of MPI.

4.4.1 Problem formulation

In the unit square domain

    Ω = {x | 0 < xα < 1,  α = 1, 2}

we consider the Poisson equation

    −Δu = f(x),   x ∈ Ω,                  (4.1)

with Dirichlet boundary conditions

    u(x) = g(x),  x ∈ ∂Ω.                 (4.2)

In our case, we put

    f(x) = 1,   g(x) = 0.                 (4.3)

4.4.2 Parallel algorithm

The parallel implementation of the above numerical algorithm requires essential modifications. First, we define the dimensions of the problem as shown in Listing 4.7.

Listing 4.7.

#define N1 100
#define N2 100
#define M N1*N2

The value M must be divisible by the number of running processes. In the main function, we create the matrix and vectors for the current process. Next, the five-diagonal matrix and the right-hand side vector are initialized. Further, we preset the initial guess and call the conjugate gradient method for the numerical solution of the system of equations. After obtaining the solution, we print the number of performed iterations and save the result to a file.


Listing 4.8.

int main(int argc, char* argv[]) {
    double begin, end;
    int rank, size;
    // Initialize MPI
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // Define grid
    double h1 = 1.0 / N1, h2 = 1.0 / N2;
    // Create matrix A and vectors b, x
    double *A = (double*) malloc(sizeof(double) * M * M / size);
    double *b = (double*) malloc(sizeof(double) * M / size);
    double *x = (double*) malloc(sizeof(double) * M / size);
    initMat(A, h1, h2, rank, size);
    initVec(b, h1 * h2, size);
    initVec(x, 0.0, size);
    begin = MPI_Wtime();
    int iters = solveCG(A, x, b, 1000, 1.0e-3, rank, size);
    end = MPI_Wtime();
    if (rank == 0)
        printf("Time = %3.2lf sec.\tIterations = %d\n", end-begin, iters);
    vecPrint("./solution.txt", x, h1, h2, rank, size);
    free(A);
    free(b);
    free(x);
    MPI_Finalize();
    return 0;
}

The output file solution.txt can be visualized by gnuplot. The implementation of the conjugate gradient method contains no direct calls to MPI functions; the communication is hidden inside the matrix-vector multiplication and scalar product routines.

Listing 4.9.

// Conjugate gradients method
int solveCG(double* A, double* x, double* b, double maxIter, double eps,
            int rank, int size) {
    int k;
    double *r = (double*) malloc(sizeof(double) * M / size);
    double *s = (double*) malloc(sizeof(double) * M / size);
    double *p = (double*) malloc(sizeof(double) * M / size);
    double *Ax = (double*) malloc(sizeof(double) * M / size);
    // s_0 = r_0 = b - A x_0
    matVecMult(Ax, A, x, rank, size);
    vecSum(r, -1, Ax, b, size);
    vecSum(s, -1, Ax, b, size);
    for (k = 0; k < maxIter; k++) {
        // p_k = A s_k
        matVecMult(p, A, s, rank, size);
        // tau = (r_k, r_k)/(s_k, p_k)
        double rr = vecDot(r, r, size);
        double tau = rr/vecDot(s, p, size);
        // x_k+1 = x_k + tau*s_k
        vecSum(x, tau, s, x, size);
        // r_k+1 = r_k - tau*p_k
        vecSum(r, -tau, p, r, size);
        // s_k+1 = r_k+1 + (r_k+1, r_k+1)/(r_k, r_k)*s_k
        double rrNew = vecDot(r, r, size);
        vecSum(s, rrNew/rr, s, r, size);
        double norm = sqrt(vecDot(r, r, size));
        if(norm < eps) break;
    }
    free(r);
    free(s);
    free(p);
    free(Ax);
    return k;
}

MPI functions appear in the scalar product of vectors, as shown below in Listing 4.10.

Listing 4.10.

// scalar product (x, y)
double vecDot(double *x, double *y, int size){
    double a = 0;
    int i;
    for (i = 0; i < (M / size); i++)
        a += x[i]*y[i];
    double r = 0;
    MPI_Allreduce(&a, &r, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return r;
}


Next, the matrix-vector multiplication is shown in Listing 4.11.

Listing 4.11.

// y = A * x
void matVecMult(double *y, double *A, double *x, int rank, int size){
    double *t = (double*) malloc(sizeof(double) * M);
    // gathering parts of vector x in vector t
    MPI_Allgather(x, M / size, MPI_DOUBLE, t, M / size, MPI_DOUBLE,
                  MPI_COMM_WORLD);
    int i, j;
    for (i = 0; i < M / size; i++) {
        y[i] = 0;
        for (j = 0; j < M; j++)
            y[i] += A[i * M + j] * t[j];
    }
    free(t);
}

The matrix array A is filled in by the initMat function.

Listing 4.12.

void initMat(double *A, double h1, double h2, int rank, int size){
    int i, j, disp, col, row;
    double val;
    for (col = 0; col < (M / size); col++) {
        // find the displacement
        disp = rank * (M / size);
        i = (col + disp) % N1;
        j = (col + disp) / N1;
        for (row = 0; row < M; row++) {
            val = 0;
            if ( ( i > 0      && row == (j * N1 + (i-1)) ) ||
                 ( i < (N1-1) && row == (j * N1 + (i+1)) ) )
                val = - 1*h2/h1;
            if ( ( j > 0      && row == ((j-1) * N1 + i) ) ||
                 ( j < (N2-1) && row == ((j+1) * N1 + i) ) )
                val = - 1*h1/h2;
            if ( row == (j * N1 + i) )
                val = 2*(h2/h1 + h1/h2);
            A[col * M + row] = val;
        }
    }
}

Here disp is the displacement of the matrix block, defined as the product of the process rank and the number of rows in the block (M / size). For example, with N1 = N2 = 100 and four processes, each process stores M / size = 2500 rows, so the process of rank 1 handles the unknowns with global indices 2500, ..., 4999. The complete source code of the program is shown in Listing 4.13 below.

Listing 4.13.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>

#define N1 100
#define N2 100
#define M N1*N2

void initMat(double *A, double h1, double h2, int rank, int size){
    int i, j, disp, col, row;
    double val;
    for (col = 0; col < (M / size); col++) {
        // find the displacement
        disp = rank * (M / size);
        i = (col + disp) % N1;
        j = (col + disp) / N1;
        for (row = 0; row < M; row++) {
            val = 0;
            if ( ( i > 0      && row == (j * N1 + (i-1)) ) ||
                 ( i < (N1-1) && row == (j * N1 + (i+1)) ) )
                val = - 1*h2/h1;
            if ( ( j > 0      && row == ((j-1) * N1 + i) ) ||
                 ( j < (N2-1) && row == ((j+1) * N1 + i) ) )
                val = - 1*h1/h2;
            if ( row == (j * N1 + i) )
                val = 2*(h2/h1 + h1/h2);
            A[col * M + row] = val;
        }
    }
}

void initVec(double* x, double val, int size){
    int i;
    for (i = 0; i < M / size; i++)
        x[i] = val;
}

// y = A * x
void matVecMult(double *y, double *A, double *x, int rank, int size){
    double *t = (double*) malloc(sizeof(double) * M);
    // gathering parts of vector x in vector t
    MPI_Allgather(x, M / size, MPI_DOUBLE, t, M / size, MPI_DOUBLE,
                  MPI_COMM_WORLD);
    int i, j;
    for (i = 0; i < M / size; i++) {
        y[i] = 0;
        for (j = 0; j < M; j++)
            y[i] += A[i * M + j] * t[j];
    }
    free(t);
}

// s = a * x + y
void vecSum(double *s, double a, double *x, double *y, int size){
    int i;
    for (i = 0; i < M / size; i++)
        s[i] = a*x[i] + y[i];
}

// scalar product (x, y)
double vecDot(double *x, double *y, int size){
    double a = 0;
    int i;
    for (i = 0; i < (M / size); i++)
        a += x[i]*y[i];
    double r = 0;
    MPI_Allreduce(&a, &r, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return r;
}

// print vector to file
void vecPrint(char* fileName, double* x, double h1, double h2,
              int rank, int size) {
    remove(fileName);
    FILE *file;
    file = fopen(fileName, "a+");
    int i, j;
    double *t = (double*) malloc(sizeof(double) * M);
    MPI_Gather(x, M / size, MPI_DOUBLE, t, M / size, MPI_DOUBLE, 0,
               MPI_COMM_WORLD);
    if (rank == 0) {
        for(i = 0; i < N1; i++){
            for(j = 0; j < N2; j++){
                fprintf(file, "%f %f %f \n", i*h1, j*h2, t[j*N1 + i]);
            }
            fprintf(file, "\n");
        }
    }
    fclose(file);
    free(t);
}

// Conjugate gradients method
int solveCG(double* A, double* x, double* b, double maxIter, double eps,
            int rank, int size) {
    int k;
    double *r = (double*) malloc(sizeof(double) * M / size);
    double *s = (double*) malloc(sizeof(double) * M / size);
    double *p = (double*) malloc(sizeof(double) * M / size);
    double *Ax = (double*) malloc(sizeof(double) * M / size);
    // s_0 = r_0 = b - A x_0
    matVecMult(Ax, A, x, rank, size);
    vecSum(r, -1, Ax, b, size);
    vecSum(s, -1, Ax, b, size);
    for (k = 0; k < maxIter; k++) {
        // p_k = A s_k
        matVecMult(p, A, s, rank, size);
        // tau = (r_k, r_k)/(s_k, p_k)
        double rr = vecDot(r, r, size);
        double tau = rr/vecDot(s, p, size);
        // x_k+1 = x_k + tau*s_k
        vecSum(x, tau, s, x, size);
        // r_k+1 = r_k - tau*p_k
        vecSum(r, -tau, p, r, size);
        // s_k+1 = r_k+1 + (r_k+1, r_k+1)/(r_k, r_k)*s_k
        double rrNew = vecDot(r, r, size);
        vecSum(s, rrNew/rr, s, r, size);
        double norm = sqrt(vecDot(r, r, size));
        if(norm < eps) break;
    }
    free(r);
    free(s);
    free(p);
    free(Ax);
    return k;
}

int main(int argc, char* argv[]) {
    double begin, end;
    int rank, size;
    // Initialize MPI
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // Define grid
    double h1 = 1.0 / N1, h2 = 1.0 / N2;
    // Create matrix A and vectors b, x
    double *A = (double*) malloc(sizeof(double) * M * M / size);
    double *b = (double*) malloc(sizeof(double) * M / size);
    double *x = (double*) malloc(sizeof(double) * M / size);
    initMat(A, h1, h2, rank, size);
    initVec(b, h1 * h2, size);
    initVec(x, 0.0, size);
    begin = MPI_Wtime();
    int iters = solveCG(A, x, b, 1000, 1.0e-3, rank, size);
    end = MPI_Wtime();
    if (rank == 0)
        printf("Time = %3.2lf sec.\tIterations = %d\n", end-begin, iters);
    vecPrint("./solution.txt", x, h1, h2, rank, size);
    free(A);
    free(b);
    free(x);
    MPI_Finalize();
    return 0;
}

Here we use the required tolerance ε = 10⁻³ and the grid sizes N1 = 100, N2 = 100. The solution of the problem with the given accuracy requires 75 iterations (Figure 4.13).

Fig. 4.13. The solution of the Dirichlet problem (using gnuplot).

The output of the program for 1, 2, 4, 8 processes is as follows:

$ mpirun -np 1 ./poisson
Time = 47.29 sec.   Iterations = 75
$ mpirun -np 2 ./poisson
Time = 23.44 sec.   Iterations = 75
$ mpirun -np 4 ./poisson
Time = 11.86 sec.   Iterations = 75
$ mpirun -np 8 ./poisson
Time = 6.58 sec.    Iterations = 75

Note that the solution time decreases almost linearly when we increase the number of processes.

Aleksandr V. Grigoriev

5 ParaView: An efficient toolkit for visualizing large datasets

Abstract: In this chapter, we consider a portable open source software package for visualizing scientific datasets: ParaView¹. Basic data formats used in the program are described below. We also discuss the preparation of input data on the basis of simple examples. Finally, the feature of parallel visualization is demonstrated.

5.1 An overview

We can download ParaView installation files from the official website (for Windows, Linux, and Mac). Here the focus is on the Ubuntu version. To install the application, we use the following command:

$ sudo apt-get install paraview

ParaView allows visualization of scientific datasets, both on local computers and on parallel computing systems. The program has a client-server architecture; it was developed on the basis of the Visualization ToolKit (VTK) library². It is possible to work with ParaView via the graphical user interface (GUI) or in batch mode using the Python programming language. The visualizer supports standard features of scientific visualization with the focus on two- and three-dimensional data. Other products, such as Tecplot³, VisIt⁴, and EnSight⁵, can be considered analogs. ParaView supports datasets defined on uniform and non-uniform, rectilinear and curvilinear, general unstructured and multiblock grids. The VTK format is the basic data format. ParaView can also import a large number of other data formats, e.g. STL⁶, NetCDF⁷, CAS⁸, and more. ParaView provides a large set of useful filters for data analysis. Users can also create their own filters. The program allows visualization of scalar and vector fields.

1 www.paraview.org/.
2 www.vtk.org/.
3 www.tecplot.com/.
4 https://wci.llnl.gov/simulation/computer-codes/visit.
5 www.ceisoftware.com/.
6 http://en.wikipedia.org/wiki/STL_(file_format).
7 www.unidata.ucar.edu/software/netcdf/.
8 http://filext.com/file-extension/CAS/.

Using filters, we can extract contours and isosurfaces, intersect domains with a plane or a given function, conduct data analysis, display streamlines for vector fields, and plot distributions of data values over given lines. We can also save screenshots and animations for time-dependent fields.

5.2 Data file formats

The VTK format has two different styles of data presentation: serial and XML-based formats. The serial formats are text files, which can be easily written and read. The XML-based formats support random access and parallel input/output. These formats also provide better data compression.

5.2.1 Serial formats

The serial formats consist of the following five basic parts:
– The file version and identifier. This part contains the single line: # vtk DataFile Version x.x.
– The header, which consists of a string terminated by \n. The maximum number of characters is 256. The header can be used to describe the dataset and include any other information.
– The file format, which indicates the type of file: files may be text (ASCII) or binary (BINARY).
– The structure of the dataset, which describes its geometry and topology. This part begins with the keyword DATASET followed by the type of dataset. Other combinations of keywords and data are then defined.
– The attributes of the dataset, which begin with the keywords POINT_DATA or CELL_DATA, followed by the number of points or cells, respectively. Other combinations of keywords and data define the dataset attribute values such as scalars, vectors, tensors, normals, texture coordinates, or field data.
This structure of serial files provides freedom to select dataset and geometry and to change data files using filters of VTK or other tools. VTK supports five different dataset formats: structured points, structured grid, rectilinear grid, unstructured grid, and polygonal representation.


5.2.1.1 Structured points

This file format allows storage of a structured point dataset. Grids are defined by the dimensions nx, ny, nz, the coordinates of the origin point x, y, z, and the data spacing sx, sy, sz.

DATASET STRUCTURED_POINTS
DIMENSIONS nx ny nz
ORIGIN x y z
SPACING sx sy sz

5.2.1.2 Structured grid

This file format allows storage of a structured grid dataset. Grids are described by the dimensions nx, ny, nz and the number of points, n. The POINTS section consists of the coordinates of each point.

DATASET STRUCTURED_GRID
DIMENSIONS nx ny nz
POINTS n dataType
P0x P0y P0z
P1x P1y P1z
...
P(n-1)x P(n-1)y P(n-1)z

5.2.1.3 Rectilinear grid

This file format is oriented to store a rectilinear grid dataset. Grids are defined by the dimensions nx, ny, nz, and three lists of coordinate values.

DATASET RECTILINEAR_GRID
DIMENSIONS nx ny nz
X_COORDINATES nx dataType
x0 x1 ... x(nx-1)
Y_COORDINATES ny dataType
y0 y1 ... y(ny-1)
Z_COORDINATES nz dataType
z0 z1 ... z(nz-1)

5.2.1.4 Polygonal representation

This format provides storage of an unstructured grid dataset and consists of arbitrary combinations of primitives (surfaces, lines, points, polygons, triangle strips). Grids are described by POINTS, VERTICES, LINES, POLYGONS, TRIANGLE_STRIPS. Note that here POINTS is the same as the POINTS in the structured grid.

DATASET POLYDATA
POINTS n dataType
P0x P0y P0z
P1x P1y P1z
...
P(n-1)x P(n-1)y P(n-1)z

VERTICES n size
numPoints0, i0, j0, k0, ...
numPoints1, i1, j1, k1, ...
...
numPoints(n-1), i(n-1), j(n-1), k(n-1), ...

LINES n size
numPoints0, i0, j0, k0, ...
numPoints1, i1, j1, k1, ...
...
numPoints(n-1), i(n-1), j(n-1), k(n-1), ...

POLYGONS n size
numPoints0, i0, j0, k0, ...
numPoints1, i1, j1, k1, ...
...
numPoints(n-1), i(n-1), j(n-1), k(n-1), ...

TRIANGLE_STRIPS n size
numPoints0, i0, j0, k0, ...
numPoints1, i1, j1, k1, ...
...
numPoints(n-1), i(n-1), j(n-1), k(n-1), ...

5.2.1.5 Unstructured grid

This format is designed for storing unstructured grid datasets. An unstructured grid is defined by vertices and cells. The CELLS keyword requires two parameters: the number of cells n, and the size of the cell list size. The cell list size indicates the total number of integer values required to represent the cells. The CELL_TYPES keyword requires only one parameter, the number of cells n, which should be equal to the value given in CELLS. The cell types data are integer values specifying the type of each cell.


DATASET UNSTRUCTURED_GRID
POINTS n dataType
p0x p0y p0z
p1x p1y p1z
...
p(n-1)x p(n-1)y p(n-1)z

CELLS n size
numPoints0, i, j, k, l, ...
numPoints1, i, j, k, l, ...
numPoints2, i, j, k, l, ...
...
numPoints(n-1), i, j, k, l, ...

CELL_TYPES n
type0
type1
type2
...
type(n-1)

5.2.2 XML formats

Data formats based on XML syntax provide many more features. The main feature is data streaming and parallel input/output. These data formats also provide data compression, random access, multiple file representation of data, and new file extensions for different VTK dataset types. There are two different types of XML formats:
– the serial type is applied to read and write employing a single process;
– the parallel type is applied to read and write using multiple processes. Each process writes or reads only its own part of the data, which is stored in an individual file.
In the XML format, datasets are divided into two categories: structured and unstructured. A structured dataset is a regular grid. The structured dataset types are vtkImageData, vtkRectilinearGrid, and vtkStructuredGrid. An unstructured dataset defines an irregular grid that is a set of points and cells. The unstructured dataset types are vtkPolyData and vtkUnstructuredGrid. XML formats are designed similarly to serial formats. A more detailed description can be found in the ParaView documentation.


5.3 Preparing data

Let us consider simple examples of how to prepare data using the given data visualization formats. We present the implementation of writing the results of mathematical modeling to a file.

5.3.1 Structured 2D grid

Using the above description of the serial data representation format, we write the values of a two-dimensional grid function to a VTK file. We use the following function to generate the grid dataset:

    f = exp(√(x² + y²)) cos(3πx) cos(4πy).

Function values are calculated on the uniform rectangular grid 100 × 100. Consider the implementation of the program. First, to work with mathematical functions and to write to a file, we include the standard libraries as shown in Listing 5.1.

Listing 5.1.

#include <stdio.h>
#include <math.h>

Further, we set the dimensions of the grid as follows:

Listing 5.2.

#define NX 101
#define NY 101

Define the function to calculate a 2D structured dataset as shown in Listing 5.3.

Listing 5.3.

double value2d(double x, double y) {
    return exp(sqrt(x * x + y * y)) * cos(3*M_PI * x) * cos(4*M_PI * y);
}

Next, we define the function which writes the dataset to a file in the serial VTK format, as shown in Listing 5.4.

5.3 Preparing data |

83

Listing 5.4.

void WriteStructuredGrid2D(double *x, double *y, double *A,
                           const char *filename) {
    FILE *out;
    out = fopen(filename, "w");
    fprintf(out, "# vtk DataFile Version 3.0\n");
    fprintf(out, "Example 2D regular grid VTK file.\n");
    fprintf(out, "ASCII\n");
    fprintf(out, "DATASET STRUCTURED_GRID\n");
    fprintf(out, "DIMENSIONS %d %d %d\n", NX, NY, 1);
    fprintf(out, "POINTS %d float\n", NX * NY);
    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++)
            fprintf(out, "%f %f %f\n", x[i], y[j], 0.0);
    fprintf(out, "POINT_DATA %d\n", NX * NY);
    fprintf(out, "SCALARS u float 1\n");
    fprintf(out, "LOOKUP_TABLE default\n");
    for (int i = 0; i < NX * NY; i++)
        fprintf(out, "%f\n", A[i]);
    fclose(out);
}

In the main function, it is necessary to specify the domain parameters, generate the coordinate vectors x and y, calculate the values of the function v, and write them to a file as shown in Listing 5.5 below.

Listing 5.5.

int main() {
    // Domain and mesh
    double lx = 1.0, ly = 1.0;
    double hx = lx/(NX-1), hy = ly/(NY-1);
    // Coordinates vectors x and y
    double x[NX];
    for (int i = 0; i < NX; i++)
        x[i] = i * hx;
    double y[NY];
    for (int i = 0; i < NY; i++)
        y[i] = i * hy;
    // Function values v
    double v[NX*NY];
    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++)
            v[i + NX * j] = value2d(x[i], y[j]);
    // Write to vtk-file
    WriteStructuredGrid2D(x, y, v, "vtk2d.vtk");
}


Fig. 5.1. Structured 2D grid dataset.

Finally, we visualize the resulting file vtk2d.vtk using ParaView as shown in Figure 5.1.

5.3.2 Structured 3D grid

Here we present an example of generating a structured 3D grid dataset and writing it to a VTK file. The dataset corresponds to the following function:

    f = exp(√(x² + y² + z²)) cos(3πx) cos(4πy),

with the regular uniform grid 100 × 100 × 50. The complete code of the program is shown in Listing 5.6 below; the implementation is the same as in the 2D example.

5.3 Preparing data |

85

Listing 5.6.

#include <stdio.h>
#include <math.h>

#define NX 101
#define NY 101
#define NZ 51

// Calculate data value on coordinates x, y, z
double value3d(double x, double y, double z) {
    return exp(sqrt(x * x + y * y + z * z)) * cos(3 * M_PI * x)
           * cos(4 * M_PI * y);
}

// Write VTK file from data A on x, y, z coordinates
void WriteStructuredGrid3D(double *x, double *y, double *z, double *A,
                           const char *filename) {
    FILE *out;
    out = fopen(filename, "w");
    fprintf(out, "# vtk DataFile Version 3.0\n");
    fprintf(out, "Example 3D regular grid VTK file.\n");
    fprintf(out, "ASCII\n");
    fprintf(out, "DATASET STRUCTURED_GRID\n");
    fprintf(out, "DIMENSIONS %d %d %d\n", NX, NY, NZ);
    fprintf(out, "POINTS %d float\n", NX * NY * NZ);
    for (int k = 0; k < NZ; ++k)
        for (int j = 0; j < NY; ++j)
            for (int i = 0; i < NX; ++i)
                fprintf(out, "%f %f %f\n", x[i], y[j], z[k]);
    fprintf(out, "POINT_DATA %d\n", NX * NY * NZ);
    fprintf(out, "SCALARS u float 1\n");
    fprintf(out, "LOOKUP_TABLE default\n");
    for (int i = 0; i < NX * NY * NZ; ++i)
        fprintf(out, "%f\n", A[i]);
    fclose(out);
}

int main() {
    // Domain and mesh
    double lx = 1.0, ly = 1.0, lz = 1.0;
    double hx = lx/(NX-1), hy = ly/(NY-1), hz = lz/(NZ-1);
    // Coordinates vectors x, y and z
    double x[NX];
    for (int i = 0; i < NX; i++)
        x[i] = i * hx;
    double y[NY];
    for (int j = 0; j < NY; j++)
        y[j] = j * hy;
    double z[NZ];
    for (int k = 0; k < NZ; k++)
        z[k] = k * hz;
    // Function values v
    double v[NX * NY * NZ];
    for (int k = 0; k < NZ; k++)
        for (int j = 0; j < NY; j++)
            for (int i = 0; i < NX; i++)
                v[i + NX*j + NX*NY*k] = value3d(x[i], y[j], z[k]);
    // Write to a file
    WriteStructuredGrid3D(x, y, z, v, "vtk3d.vtk");
}

We visualize the resulting file vtk3d.vtk via ParaView (Figure 5.2).

Fig. 5.2. The surface of the 3D dataset.

5.3.3 Unstructured 2D grid

In the case of unstructured 2D grids, it is reasonable to employ special software. Of course, we can use the VTK library itself, but there are other libraries which are more convenient for writing VTK files.


Let us consider the VisIt library⁹, and in particular the visit_writer module. This module provides a convenient interface for writing VTK files and allows the writing of unstructured grids.

Fig. 5.3. Slice.

Fig. 5.4. Clip.

visit_writer includes two files (visit_writer.c and visit_writer.h). The following code (Listing 5.7) writes an unstructured grid dataset to a VTK file using the library.

9 https://wci.llnl.gov/simulation/computer-codes/visit.

Listing 5.7.

#include "visit_writer.h"

int main(int argc, char *argv[]) {
    // Node coordinates
    int nnodes = 9;
    int nzones = 8;
    float pts[] = { 0., 0., 0., 2., 0., 0., 5., 0., 0.,
                    3., 3., 0., 5., 3., 0., 0., 5., 0.,
                    2., 5., 0., 4., 5., 0., 5., 5., 0. };
    // Zone types
    int zonetypes[] = { VISIT_TRIANGLE, VISIT_TRIANGLE, VISIT_TRIANGLE,
                        VISIT_TRIANGLE, VISIT_TRIANGLE, VISIT_TRIANGLE,
                        VISIT_TRIANGLE, VISIT_TRIANGLE };
    // Connectivity
    int connectivity[] = {
        1, 3, 6, /* tri zone 1. */
        3, 7, 6, /* tri zone 2. */
        0, 1, 5, /* tri zone 3. */
        1, 5, 6, /* tri zone 4. */
        1, 2, 3, /* tri zone 5. */
        2, 3, 4, /* tri zone 6. */
        3, 4, 7, /* tri zone 7. */
        4, 7, 8, /* tri zone 8. */
    };
    // Data arrays
    float nodal[] = { 0, 0, 0, 12, 2, 5, 5, 5, 5 };
    float zonal[] = { 1, 4, 9, 16, 25, 36, 49, 64 };
    // visit_writer parameters
    int nvars = 2;
    int vardims[] = { 1, 1 };
    int centering[] = { 0, 1 };
    const char *varnames[] = { "zonal", "nodal" };
    float *vars[] = { zonal, nodal };
    // Write mesh
    write_unstructured_mesh("unstructured2d.vtk", 1, nnodes, pts, nzones,
                            zonetypes, connectivity, nvars, vardims,
                            centering, varnames, vars);
    return 0;
}

We visualize the resulting file unstructured2d.vtk using ParaView as demonstrated in Figures 5.3 and 5.4. It is possible to visualize the data associated with the nodes (Figure 5.5) and cells (Figure 5.6).


Fig. 5.5. Nodal data.

Fig. 5.6. Cell data.


5.4 Working with ParaView

The user interface of ParaView consists of the following sections:
– Toolbars provides quick access to frequently used functions;
– Pipeline Browser displays a list of objects (opened files and the filters applied to them) and allows selection of the objects for visualization;
– Object Inspector contains three tabs:
  – the Properties tab provides options to change properties of selected objects;
  – the Display tab provides access to parameters of selected objects;
  – the Information tab provides basic statistics on objects.
– View area displays objects.

5.4.1 Loading data

The File | Open command or the Open button from the toolbar can be used to load data into ParaView. It is necessary to select a file, for example the file vtk2d.vtk from the previous section (Figure 5.7). Next, we click the Apply button in Object Inspector (the Properties tab). If a data file contains more than one function for visualization, then the same color will be used over the entire dataset (Solid Color).

Fig. 5.7. Loading data.


In this case, we select a function in the Toolbars or the Display tab of Object Inspector. Then the dataset will be colored based on the values of the function. The color palette of the function may be edited from the dialog that appears after clicking the Edit Color Map button.

5.4.2 Filters

After a file is loaded, we can use the mouse buttons to zoom, pan, rotate or roll an object. For a more detailed study of the object, filters for processing data should be applied. The most common filters are listed in Table 5.1. The results of applied filters are displayed in Pipeline Browser.

Table 5.1. Common filters.

Filter            Description
Calculator        Evaluates a user-defined expression on a per-point or per-cell basis.
Contour           Extracts the points, curves, or surfaces where a scalar field is equal to a user-defined value.
Clip              Intersects an object with a half-space.
Slice             Intersects an object with a plane.
Threshold         Extracts cells that lie within a user-defined range of a scalar field.
Extract Subset    Extracts a subset by defining either a volume of interest or a sampling rate.
Glyph             Places a glyph on each point in a mesh.
Stream Tracer     Seeds a vector field with points and then traces those points through the (steady state) vector field.
Warp              Displaces each point in a mesh by a given vector field.
Group Datasets    Combines the output of several objects into a dataset.
Extract Level     Extracts one or more items from a dataset.

These filters are only a small sample of the possibilities available. Currently, ParaView offers more than one hundred filters. For convenience, the Filters menu is divided into submenus (Table 5.2). The use of filters to process and visualize the generated 3D structured dataset (file vtk3d.vtk) is now demonstrated:
– load and display the dataset u from the file vtk3d.vtk (Figure 5.8);
– apply the Slice filter (the parameters of the filter are set before applying it; see Figure 5.9);
– similarly, build part of the object using the Clip filter (Figure 5.10);
– apply the Contour filter (the contour values are specified in the Properties tab of Object Inspector; see Figure 5.11).

Table 5.2. Filters menu.

Submenu         Description
Search          Searches through lists of filters
Recent          Recently used filters
Common          Common filters
Cosmology       Filters developed at LANL for cosmology research
Data Analysis   Filters designed to obtain quantitative values
Statistics      Filters which provide descriptive statistics of data
Temporal        Filters which analyze or modify data that changes over time
Alphabetical    Alphabetical list of filters

In addition, we can display the color legend of an object by clicking the corresponding toolbar button. The color map (palette), as well as the number of colors in the color map (the Resolution slider), can be edited in the color map editor. To save screenshots, the File | Save Screenshot menu command is available.

Fig. 5.8. Step 1.


Fig. 5.9. Step 2.

Fig. 5.10. Step 3.


Fig. 5.11. Step 4.

5.4.3 Time series

In order to work with transient fields, we need to create a pvd file which contains information about the time steps and the files with the data of each time step.
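A minimal collection file might look as follows (the time values and data file names are purely illustrative):

<?xml version="1.0"?>
<VTKFile type="Collection" version="0.1">
  <Collection>
    <DataSet timestep="0.0" part="0" file="data_0000.vtu"/>
    <DataSet timestep="0.1" part="0" file="data_0001.vtu"/>
    <DataSet timestep="0.2" part="0" file="data_0002.vtu"/>
  </Collection>
</VTKFile>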





Each individual file can contain a grid of any type: vtu = unstructured, vtp = polydata, vtr = rectilinear, vti = imagedata. At the same time, the file can be binary, text, compressed or not compressed. We can play a time-history with the VCR control toolbar. In addition, it is possible to save the current time series as an animation using the File | Save Animation menu command.


5.5 Parallel visualization

ParaView provides the possibility of parallel visualization and data processing to work with large datasets. We now briefly describe the architecture of ParaView to help understand this parallelism.

5.5.1 The architecture of ParaView

ParaView has a three-level client-server architecture and consists of the following parts:
– Data Server is responsible for reading, processing and writing data. All objects in Pipeline Browser are located on Data Server. It may in turn be parallel.
– Render Server is responsible for rendering data and can be parallel.
– Client is responsible for displaying data. The client controls the creation of objects, their execution, and their deletion from the servers, but does not contain data itself. The user interface is part of the client. The client is a serial program.
These parts of the program may be located on different computing systems, but they are often integrated into a single application. There are three modes of starting ParaView:
– The standard mode, when all three parts are combined in one application.
– The Client/Server mode, when Data Server and Render Server act as the server to which the Client connects. In this case, we use the pvserver command, which runs the data and render servers.
– The Render/Server mode, when all three parts are run as separate applications. This mode is suitable for work with very large data. The client connects to the Render server, which is in turn connected with the Data server.
The Render/Server mode should only be used in special cases, for example, when the power of a parallel system is not enough for both servers and they must be separated.

5.5.2 Running in parallel mode

Modern computers are parallel computing systems and can provide significant performance improvement due to multiprocessor architecture. Here we consider running ParaView in the Client/Server mode. To do this, we run the server using the following command:

$ mpirun -np 4 pvserver

This command runs the parallel server in four processes.

Further actions are associated with connecting the client to the parallel server. We invoke ParaView and click on the Connect button. Next, in the opened dialog, we click on the Add Server button and set the name or IP address. Finally, we can open files as usual, for instance, the file vtk3d.vtk, as presented in Figure 5.12.

Fig. 5.12. Opening the file on the parallel server.

After applying the filter Process Id Scalars, we can choose the array ProcessId to display subdomains belonging to different processes (Figure 5.13).


Fig. 5.13. Process Id Scalars.


Victor S. Borisov

6 Tools for developing parallel programs

Abstract: In this chapter, we present the Eclipse IDE with tools for parallel application development, i.e. the Parallel Tools Platform (PTP). PTP provides facilities for working with remote computing systems: remote launch, debugging, and performance analysis of parallel programs.

6.1 Installation of PTP

PTP can be installed as a separate program from an installation package which includes all prerequisite components. Alternatively, PTP can be added to pre-installed Eclipse as an additional plug-in.

6.1.1 Prerequisites

Eclipse is a cross-platform framework; it requires a Java Virtual Machine on the Linux, MacOSX (10.5 Leopard and later), and Windows (XP and later) operating systems. PTP needs certain virtual machines and does not work on others. The program has been tested on Java Runtime Environment (JRE) 1.6 or later from Oracle. The OpenJDK virtual machine has not been tested officially, so robustness and stability cannot be guaranteed.

6.1.2 Installing Eclipse with PTP

The special installation package Eclipse for Parallel Application Developers for the relevant operating system can be selected from the Eclipse downloads page¹. The package includes all of the prerequisite components for developing parallel applications. Installation is performed by decompressing the archive of the package. The program is launched by the eclipse executable file in the program's directory.

1 www.eclipse.org/downloads/.

6.1.3 Installing the plug-in

PTP can be added to pre-installed Eclipse via the Help | Install New Software menu command. The following address is entered in the Work With field: http://download.eclipse.org/tools/ptp/updates/kepler. Next, select the following packages:
– Fortran Development Tools (Photran);
– Parallel Tools Platform;
– Remote Development Tools.
After a few clicks on Next, Eclipse will download and install the plug-ins. It may be necessary to restart the program.

6.2 Program management

PTP is intended for the development of parallel applications with the possibility of working with a remote computer. Consider the following scenario: it is convenient to write a program on a local computer, but in the future the program must be executed on a remote computing system. We have remote access to it, e.g. using SSH, and therefore need to copy the program onto the remote system and launch it there. PTP provides enough features to substantially simplify this process. First, a connection to the remote system is created in the Eclipse editor. Then, a special project located on both the local and remote computer is designed. The developed code must be the same on both computers; if it has been changed, the source code must be synchronized. The program can be compiled and executed on both computers. PTP also has some auxiliary capabilities to employ MPI technology.

6.2.1 Basics of PTP

PTP is an integrated environment designed for parallel application development. Its main features are:
– it supports a wide range of parallel architectures and runtime systems;
– it has a scalable parallel debugger;
– it provides tools for parallel programming;
– it involves support for the integration of parallel tools;
– it ensures easy-to-use interaction with parallel systems.
More information is available on the PTP website².

2 www.eclipse.org/ptp/.


6.2.2 Creating a connection with a remote computer

PTP allows work on a remote system using the Remote Systems view. This view can be opened as follows: select Window | Show View | Other and choose the Remote Systems view from the Remote Systems folder (Figure 6.1). Another way to open the Remote Systems view is to apply the Remote C/C++ perspective.

Fig. 6.1. Choose Remote Systems from views.

In the Remote Systems view, a new connection is created using the File | New | Connection menu command or the New Connection toolbar button. In the New Connection wizard, the SSH Only type must be selected, then click Next, as shown in Figure 6.2. The name or IP address of the remote system should then be entered in the Host Name field and Finish clicked. The new connection appears in the Remote Systems view. A right-click on SSH Terminals opens a context menu in which Launch Terminal can be selected. User ID and password are entered in the dialog box which opens, and the Ok button clicked. Several RSA messages appear which must be accepted. The terminal will now open on the remote computer (Figure 6.3).

6.2.3 Creating projects

A program must be created as a special project to work with a remote computer. The main feature of this project is the possibility of editing the program which is actually located on the remote computer. PTP supports three project types:


Fig. 6.2. Create a connection for SSH access.

Fig. 6.3. Terminal on the remote computer.

– Local: source code is located on a local computer and the project is also built on this computer;
– Remote: source code is located on a remote computer and the project is also built and executed on the remote computer;
– Synchronized: source code is located on both the local and remote computers and the code is synchronized between these computers. The project is built and executed on the local or remote computer.


Fig. 6.4. New synchronized project wizard.

Let us suppose that the source code of a C/C++ program is located on a remote computer. Next, we create a Synchronized project using the File | New | Other | Synchronized C/C++ Project menu command. In the New Synchronized Project wizard (Figure 6.4), we fill in the Project Name field. In Local Directory, we set the directory on the local computer to store the source code. It may be the default location within the workspace directory. The Remote Directory field must indicate the project location on the remote computer. Next, the remote computer is selected from the Connection name list or a new connection created by clicking the New button. The Target Environment Configuration dialog then appears. We enter the name of the new connection in the Target name field. Host name, user, and password are specified in the corresponding fields. The Finish button completes creation of the new connection and we return to the project creation wizard (Figure 6.5). Further, the project type is selected. If a Makefile is present on the remote computer, we select the Makefile project | Empty Project type. If not, another appropriate project type is indicated. Next, we select toolchains for compilation on the remote and local computers. Finally, we click Finish and complete project setup.


Fig. 6.5. Create a new connection.

6.2.4 Project synchronization

A synchronized project is marked with a special icon in Project Explorer. We right-click on this icon and select the Synchronization sub-menu, where the synchronization process with the remote computer can be controlled. The check mark in Auto-Sync indicates that PTP will automatically synchronize the project if project files are modified. We set a specific configuration of synchronization in Auto-Sync Settings. The Sync Active Now and Sync All Now commands start synchronization of active and all project files respectively. If we do not want all files to be synchronized, they can be excluded from the Filter option of the Synchronization context menu (Figure 6.6). For example, binaries or large files can be excluded from synchronization. Filters can also be created during the project creation process; for this, the project creation wizard has the Modify File Filtering button. The source code of a project on a remote computer can contain preprocessor directives #include, which include files from the remote computer. If the correct location of these files is specified, Eclipse will display the code editor correctly and build projects. To do this, open the context menu Properties of the project and unfold the C/C++ General section. On the Entries tab of the Preprocessor Include Paths, Macros etc. page, choose the GNU C language, then find CDT User Setting Entries and click Add.


Fig. 6.6. Filters.

Select File System Path in the dropdown box on the upper right. Next, we enter a directory with header files in the Path field (e.g., //server/opt/openmpi/include) and click OK. After adding the required directories, Eclipse must invoke the C/C++ file indexer. If the indexer does not start, it can be invoked using the Project Properties | Index | Rebuild command.

6.2.5 Editing MPI programs

PTP includes the Parallel Language Development Tools (PLDT) for developing parallel applications based on MPI technology. PLDT provides the following features for user convenience:

– determination of the location of MPI function calls;
– code auto-completion;
– reference information about MPI;
– context-sensitive help;
– MPI code templates;
– MPI barrier analysis.

If the source code contains MPI function calls (artifacts) which need to be found, select the project, folder or file in the Project Explorer view and click the Show MPI Artifacts menu item in the PLDT drop-down menu of the tool bar. All MPI artifacts will be marked in the editor view. In addition, MPI Artifact View will be opened. This is a list of all artifacts, including their types and location (Figure 6.7). A double click on any line in this list will navigate the editor to that line. Artifact markers can be deleted using the red X button.

Fig. 6.7. MPI artifacts.

PTP can employ auto-completion of MPI functions. When the first letters of the MPI function name are entered and Ctrl+Space clicked, a list of possible completions with additional descriptive information appears (Figure 6.8) and the desired variant can be selected. Context-sensitive help regarding MPI commands provides extensive documentation. The Help view opens via F1, Ctrl+F1, or the Help | Dynamic Help menu command.


Fig. 6.8. Code completion.

In this view, the Related Topics tab must be active, as depicted in Figure 6.9. If we hover over an MPI function in the editor, the name of the function will appear in the Help view. A click on the function name provides the corresponding information. Return to Related Topics to retrieve information on other functions. MPI code templates allow common patterns of MPI programming to be entered quickly. For example, entering mpisr and hitting Ctrl+Space writes the typical send-receive construction. PTP will then automatically write the necessary template (Figure 6.10). To see all the code templates, enter mpi and hit Ctrl+Space. MPI barrier analysis helps detect synchronization errors in an MPI code. To start the analysis, select the place to be searched (files, folders or the whole project) in the Project Explorer view and click the MPI Barrier Analysis menu item in the PLDT drop-down menu of the toolbar. If there are barriers, they can be seen in the MPI Barriers, Barrier Matches, and Barrier Errors views. The first view is a list of all barriers detected, the second shows grouped sets of barriers (Figure 6.11), and the last displays errors with counterexamples.

6.2.6 Building programs

If a project is built on a remote computer, it must be synchronized between the local and remote computers. Usually files are synchronized automatically when they are saved, so an explicit synchronization step is not required.


Fig. 6.9. Context help.

Fig. 6.10. MPI code templates.


Fig. 6.11. MPI barrier analysis.

The active build configuration is selected by right-clicking on the project name in the Project Explorer view and opening the Build Configuration context menu; the configuration, e.g. Debug_remote, is chosen in the Set Active submenu. To start building, select the project in the Project Explorer view and click the build button with the drop-down menu, where the build configuration can also be changed. There is no Makefile by default, and Eclipse will create it automatically. The result can be seen in the Console view. The project is not recompiled if there are no changes in the source code. To enforce rebuilding regardless of changes, the project must be cleaned using the Clean Project command in the project context menu. Building the project begins without problems after cleaning. Forced synchronization is performed using the Synchronization | Sync Active Now command in the project context menu.


Fig. 6.12. Build configuration.

The build configuration can be customized in the C/C++ Build section of the project properties (Figure 6.12):
– C/C++ Build: the build command, which invokes make by default, can be changed;
– C/C++ Build | Settings: error and binary parsers are selected;
– C/C++ Build | Environment: environment variables are changed or added;
– C/C++ Build | Logging: build logging is enabled or disabled.

6.2.7 Running programs

It is necessary to create a run configuration to launch parallel applications. The Run Configurations dialog is opened using the Run | Run Configurations menu command and the Parallel Application item is selected. To create a new configuration, click the New button or double-click on Parallel Application and enter a name for the configuration in the Name field (Figure 6.13). In the Resources tab, select a Target System Configuration from the drop-down menu, e.g. Open MPI-Generic-Interactive or PBS-Generic-Batch, provided the remote system has the corresponding software (Figure 6.14). Then select a connection from the drop-down menu, where previously configured connections should be available. If necessary, a new connection can be added by clicking the New button. Next, fill in the fields to create a PBS script to run a parallel program. These will usually be the Queue, Number of nodes, MPI


Fig. 6.13. Run configuration.

Fig. 6.14. Run resources.


Command, and MPI Number of Processes fields. The script created can be seen by clicking the View Script button. Indicate the project in the Project field of the Application tab. Next, enter the path of the executable file on the remote computer in the Application program field, as shown in Figure 6.15. By clicking the Browse button, the executable on the remote system can be selected.

Fig. 6.15. Application program.

If the Display output from all processes in a console view option is checked, PTP displays the output of the program in the Console view. Sometimes files need to be transferred from the local computer to the remote computer and vice versa, for example computational mesh files or output files with the results of calculations. The Synchronize tab allows the creation of rules to upload/download files before and after running a program (Figure 6.16). Each rule specifies the files to be uploaded or downloaded and the location where these files will be placed.

The run configuration is now complete and the parallel program can be executed by clicking the Run button. PTP requests a switch from the current perspective to the System Monitoring perspective. If the PBS-Generic-Batch system was selected in the Target System Configuration field, this perspective allows the running processes on the remote computer to be viewed. Active and inactive jobs can be seen in the views of the same name. During execution the program is shown in the Active Jobs view; when it has finished, it is moved to Inactive Jobs. The System view contains a graphical representation of the activity of the remote system: a small box is a single process, whereas processes grouped in one big box represent a computational node. Different colors correspond to different running programs; the color can be found in the Active Jobs view. The view is updated once


Fig. 6.16. Synchronization rules.

per minute, but an immediate update can be requested by right-clicking on a job in the Active Jobs view and selecting the Refresh Job Status action. The output of a completed job can be displayed in the Console view by right-clicking on the job in the Inactive Jobs view and selecting Get Job Output.

6.3 Parallel debugging

In this section, elements of parallel debugging are discussed, including configuration and running of the parallel debugger, a review of the Parallel Debug perspective, managing processes, and breakpoints.

6.3.1 Debug configuration

The debugger requires its own configuration, which can be created using the Run | Debug Configurations menu command. The debugging configuration can also be designed on the basis of the run configuration. Thus we can select the run configuration and click the New button. We select one of the following interactive modes of running parallel programs in the Target System Configuration field:
– Torque-Generic-Interactive;
– PBS-Generic-Interactive;
– OpenMPI-Generic-Interactive.


Fig. 6.17. Debug configuration.

In the Debugger tab, gdb-mi is set as the Debugger backend. The SDM debugger located on the remote computer is selected by clicking the Browse button (Figure 6.17) and the debugger is invoked by clicking the Debug button.

6.3.2 Parallel debug perspective

For users’ convenience, PTP has the Parallel Debug perspective (Figure 6.18) with the following views:
– the Parallel Debug view shows the processes associated with the job;
– the Debug view contains threads and stack frames for each process;
– the Source view displays the source code and the marker showing the line with executable code;
– the Breakpoints view shows breakpoints;
– the Variables view demonstrates the current values of variables for the process selected in the Debug view.

6.3.3 Process managing

Step-by-step execution of a program is provided for both a single process and a group of processes using buttons from the Debug and Parallel Debug views, respectively.


Fig. 6.18. Parallel Debug perspective.

The usual debugger is applied to a single process, whereas the parallel debugger is used for a process set. At the beginning of a parallel debug session all processes are contained in the Root set. Process sets are always associated with a single job. A job can have any number of process sets, and a set can contain one or more processes. Debug operations on the Parallel Debug view always apply to the current process set. The name of the current set and the number of processes are shown next to the job. Other buttons in the Parallel Debug view serve to create, change, and remove process sets.
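As a concrete (invented) illustration, the following MPI fragment shows rank-dependent behavior of the kind that is conveniently examined with different process sets: one breakpoint can be applied to the set containing process 0 only, and another to the set of all remaining processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, token = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        token = 42;   /* a breakpoint here stops only the set {0} */
    }

    /* a breakpoint here, applied to the Root set, stops every process */
    MPI_Bcast(&token, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("process %d received token %d\n", rank, token);

    MPI_Finalize();
    return 0;
}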

6.3.4 Breakpoints

Breakpoints apply to a set of processes. Breakpoints differ in color as follows:
– a green breakpoint applies to the current process set;
– a blue breakpoint corresponds to overlapping process sets;
– a yellow breakpoint is associated with another process set.
All breakpoints are deleted automatically after the job is completed. Before creating a breakpoint, we must select in the Parallel Debug view a process set to which the breakpoint will be applied (Figure 6.19). We create a breakpoint in the editor by


Fig. 6.19. Breakpoints.

double-clicking on the left of the editor. We can also right-click and use the Parallel Breakpoint | Toggle Breakpoint context menu.

6.4 Performance analysis

Tuning and Analysis Utilities (TAU) conducts performance analysis for parallel programs through profiling and tracing. Profiling shows how much time was spent on each code section. Tracing shows when certain events appeared in all processes throughout the execution. TAU uses the Performance Database Toolkit (PDT) for automatic code processing. The program is free and works on all high-performance platforms.

6.4.1 Installation

TAU can be downloaded from the TAU website³. The following commands install the program:

tar -zxf tautools.tgz
cd tautools-2.22b
./configure

The installed executables must be added to the PATH environment variable, and Eclipse must be launched with this environment. After launching, the following command must be executed:

taudb_configure

3 http://tau.uoregon.edu/tautools.tgz.


and the instructions followed. More information about the installation process is available on the TAU website⁴.

6.4.2 Configuration

As usual, to launch the performance analysis, a profile configuration must be created on the basis of a run configuration using the Run | Profile Configurations menu command. The Performance Analysis tab has the Tool Selection sub-tab, where we choose TAU in the Select Tool field, and additional sub-tabs will appear. Check MPI and PDT in the TAU Makefile sub-tab, as shown in Figure 6.20. Next, select Makefile.tau-mpi-pdt in the Select Makefile drop-down menu.

Fig. 6.20. TAU Makefile.

The TAU Compiler sub-tab (see Figure 6.21) allows control of TAU’s compiling scripts. By checking various items, the corresponding controlling arguments are generated. For example:
– Verbose: shows detailed debugging information of profiling;
– KeepFiles: saves intermediate files;
– PreProcess: processes the source code before analysis.
The method of selecting source code for analysis is specified in the Selective Instrumentation tab (Figure 6.22) as follows:

4 www.cs.uoregon.edu/research/tau/home.php.


Fig. 6.21. TAU Compiler.

Fig. 6.22. Selective Instrumentation.

– Internal: uses the file created by Eclipse;
– User Defined: employs the file defined by the user.

The following parameters are specified in the TAU Runtime tab (Figure 6.23):
– Callpath: sets the location of events in the call stack;
– Throttling: reduces the load which occurs as a consequence of the analysis;
– Tracing: traces the execution time of certain commands.


Fig. 6.23. TAU Runtime.

The Data Collection tab allows specification of how the collected results are to be saved (Figure 6.24):
– Select Database: saves the results in a database;
– Keep profiles: saves data for further analysis;
– Print Profile Summary: displays the final result on the console.

Fig. 6.24. Data Collection.

6.4.3 Running

TAU is invoked by clicking the Profile button. The project is rebuilt with the new compiler commands specified by TAU. The project is executed in the standard way, but additional analysis files will be created. If there is a local database, program execution will be seen in the Performance Data Management view.

Aleksandr G. Churbanov, Petr N. Vabishchevich

7 Applied software

Abstract: Nowadays, theoretical studies of applied problems are performed with the extensive use of computational tools (computers and numerical methods). In this chapter, the contemporary concept of the so-called component-based software is discussed. Component-based software is a set of well-developed software components which solve basic individual problems. Computational technologies for scientific research are based on constructing geometrical models, generating computational meshes, applying discretization methods, solving discrete problems approximately, and visualizing and processing calculated data. This chapter gives a brief review of existing applied software which can be used in engineering and scientific computations.

7.1 Numerical simulation

Theoretical and experimental research approaches demonstrate a large degree of independence from one another. Since the fundamental models are well known and thoroughly validated, the problem of closer coordination between theoretical and experimental research can be posed. This leads to a new technology which combines scientific research with numerical simulation.

7.1.1 Mathematical modeling

The first stage in the mathematization of scientific knowledge (the theoretical level of research) involves creating an abstraction independent of the specific nature of a problem, i.e. idealization and specification of its mathematical form (a mathematical model is formulated). The abstraction of a mathematical model causes certain difficulties in its application to the description of a specific problem or process. Nowadays, thanks to accumulated experience, the process of idealization and constructing abstractions proceeds smoothly and quickly in most sciences.

The second stage of mathematization involves the study of mathematical models as purely mathematical (abstract) objects. For this, both existing and new, specially developed, mathematical methods are applied. Nowadays, computational tools (computers and numerical methods) provide great opportunities for the investigation of mathematical models.

And finally, the third stage of employing mathematics in applied research is characterized by result interpretation, i.e. the attachment of specific applied meaning to mathematical abstractions. A specialist in applied mathematical modeling always recognizes the specific applied meaning of any mathematical abstraction.

Heuristically, the role of mathematical modeling is apparent in the fact that a computational experiment is conducted instead of the corresponding physical experiment. Instead of studying a phenomenon of nature, we conduct a parametric study of its mathematical model and determine the dependence of the solution on its essential parameters. Such an experiment, in combination with natural tests, allows investigation of the process in a more comprehensive and fruitful manner.

The contemporary level of science and technology is characterized by studying complex nonlinear mathematical models. Under these conditions, computational tools are becoming the basic and predominant instruments of exploration. Conventional analytical methods of applied mathematical modeling perform only an auxiliary and attendant role in this process. They can only be employed for the qualitative study of a problem in a highly simplified statement. The ability to study complicated mathematical models using numerical methods and computers allows us to consider the methodology of scientific research from a new viewpoint. Nowadays, powerful computers, efficient computational algorithms, and modern software make it possible to arrange scientific research within a computational technology which includes both theoretical and experimental investigations.

7.1.2 Basic steps of a computational experiment

Let us present a short description of the basic components of mathematical modeling. We treat a computational experiment as the creation and study of mathematical models using computational tools. The model-algorithm-program triple (Samarskii’s triad) is recognized as the basis of computational experiments.

First, we design a mathematical model for the object under consideration based on known fundamental models. A computational experiment essentially involves the application of a group of close and possibly coupled models. We start with a simple but meaningful model which is close to experimental data. Further, this model is refined and tuned, incorporating new factors. Once the mathematical model has been built, a preliminary study is conducted using traditional methods of applied mathematics. The main purpose of the preliminary study is the extraction of simpler ("model") problems and their comprehensive analysis.

The next step of a computational experiment involves the construction of discrete problems and numerical methods to solve the original continuous problem. Mathematical models usually include partial differential equations as well as systems of ordinary differential and algebraic equations.

A computational experiment is characterized by two peculiarities which must be taken into account when creating adequate software: the necessity of multivariant calculations within the selected model, and the application of a variety of models (possibly a hierarchy of models). Therefore, we cannot work with only one program.


Also, we need to be able to modify our program easily to solve related problems, i.e. to extend the field of applications. Software for computational experiments is based on the use of mathematical libraries, special toolkits, and packages of applied programs. In order to take the main features of computational experiments into account, we need to employ object-oriented programming techniques and modern programming languages.

Finally, we conduct a series of predictions by varying certain parameters of the problem. Experts in the corresponding application area participate in the analysis and interpretation of the data. The numerical results are processed, taking the existing theoretical concepts and available experimental data into account. On analysis of the results, it becomes clear whether or not the selected mathematical model and its computational implementation are adequate and correct. If necessary, models and numerical methods are refined and improved and the whole cycle of computational experiments is repeated.

7.1.3 Mathematical models

The success of mathematical modeling is essentially governed by the quality of the mathematical models applied and how appropriate they are for the process under investigation. It seems reasonable to define sub-systems and construct lower-level mathematical models in order to study a complex object. Here it is important not to focus on a separate mathematical model corresponding to a single problem, but to try to cover a wide enough class (set) of mathematical models. This set of models should be structured and ordered in a certain way. We are therefore speaking of a hierarchy of applied mathematical models.

Applied models consist of fundamental models (governing equations) supplemented by appropriate boundary and initial conditions, i.e. boundary value problems (BVP) for partial differential equations (PDEs). This transition is associated with the construction of geometric models (computational domains) of the objects being studied. The creation of geometric models is extremely important for the success of predictions. Modern engineering and scientific computations are based on the use of multidimensional (ideally 3D) geometric models.

Applied mathematical models usually describe phenomena of different natures which may be interconnected. This multiphysics leads to mathematical models which are based on systems of nonlinear equations for scalar and vector unknowns. As a rule, time-dependent formulations are considered. In this case, we can construct computational algorithms to determine the state of a system at a new time level by solving simpler problems for sub-systems (schemes of splitting with respect to separate processes).
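As a minimal sketch of what such a splitting looks like (the notation here is ours, not taken from this chapter), suppose the semi-discrete problem has the form

$$\frac{du}{dt} + (A_1 + A_2)\,u = f, \qquad u(0) = u^0,$$

where $A_1$ and $A_2$ describe two separate physical processes. A classical componentwise splitting scheme advances the solution from $t^n$ to $t^{n+1} = t^n + \tau$ by treating the sub-processes one after another:

$$\frac{u^{n+1/2} - u^{n}}{\tau} + A_1 u^{n+1/2} = f_1, \qquad
\frac{u^{n+1} - u^{n+1/2}}{\tau} + A_2 u^{n+1} = f_2, \qquad f = f_1 + f_2,$$

so that at each time level only the simpler problems associated with $A_1$ and $A_2$ have to be solved.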


7.2 Applied software engineering

Traditionally, two types of software are recognized: applied software and system software. The latter is supporting software for the development of general-purpose applications, which is not directly related to applied problems. Below we consider problems associated with software for the numerical analysis of applied mathematical models. We discuss both commercial and free/open source software for multiphysics simulations.

7.2.1 Features of applied software

In the early years of computer use, mathematical models were simple and unreliable. Applied software, in fact, was also primitive. The transition to a new problem or a new version of calculations practically required assembling a new program, which was essentially a modification of the previously developed codes.

Modern use of computational tools is characterized by the study of complex mathematical models. Software has therefore greatly expanded, becoming larger as well as more complex to develop and use. Current software includes a large set of various program units requiring the construction of a certain workflow for efficient employment.

We do not study a particular mathematical problem in numerical simulations, but investigate a class of problems, i.e. we highlight some order of problems and a hierarchy of models. Software should therefore focus on a multitasking approach and be able to solve a class of problems by switching rapidly from one problem to another.

The numerical analysis of applied problems is based on multiparametric predictions. Within the framework of a specific mathematical model it is necessary to trace the impact of various parameters on the solution. This feature of numerical simulations requires that the software be adapted to massive calculations. Thus, software developed for computational experiments must, on the one hand, be adapted to quick significant modifications, and on the other be sufficiently conservative to focus on massive calculations using a single program.

Our experience shows that applied software is largely duplicated during its lifecycle. Repeating the programming of a code for a similar problem wastes time and eventually money. This problem is resolved by the unification of applied software and its standardization both globally (across the global scientific community) and locally (within a single research team).

7.2.2 Modular structure

The modular structure of applied software is based primarily on modular analysis of the area of application. Within a class of problems, we extract relatively independent


sub-problems, which form the basis for covering this class of problems, i.e. each problem of the class can generally be treated as a certain construction designed using individual sub-problems. An entire program is represented conceptually as a set of modules appropriately connected with one another. These modules are relatively independent and can be developed (coded and verified) separately. The decomposition of the program into a series of program modules implements the idea of structured programming. A modular structure of programs may be treated as an information graph, the vertices of which are identified with program modules, and the branches (edges) of which correspond to the interfaces between modules. Functional independence and content-richness are the main requirements of program modules.

A software module can be associated with an application area. For example, in a computational program, a separate module can solve a meaningful subproblem. Such a module can be named a subject-oriented module. A software module can be connected with the implementation of a computational method, and therefore the module can be called a mathematical (algorithmic) module (solver). In fact, this means that we conduct a modular analysis (decomposition) at the level of applications or computational algorithms. Moreover, mathematical modeling is based on the study of a class of mathematical models, and in this sense there is no need to extract a large number of subject-oriented modules. A modular analysis of a class of applied problems is carried out with the purpose of identifying individual, functionally independent mathematical modules.

Software modules may be parts of a program which are not directly related to a meaningful sub-problem in the applied or mathematical sense. They can perform supporting operations and functions. These types of module, referred to as internal modules, include data modules and documentation modules, among others. Separate parts of the program are extracted for the purpose of autonomous (by different developers) design, debugging, compilation, storage, etc. A module must be (relatively) independent and easily replaceable. When creating applied software for computational experiments, an important characteristic of a software module is its generality, i.e. its usability in as wide a range of problems of this class as possible.

7.3 Software architecture

The structure of applied software is determined by the problems being solved. Both general purpose and specialized program packages are used to model multiphysics processes.

7.3.1 Basic components

The basic components of software tools for mathematical modeling are the following:
– pre-processor – preparation and visualization of input data (geometry, material properties), assembly of computational modules;
– processor – generation of computational grids, numerical solving of discrete problems;
– post-processor – data processing, visualization of results, preparation of reports.
An analysis of these components is presented below.

7.3.2 Data preparation

In research applied software, data input is usually conducted manually via editing text input files. A more promising technology is implemented in commercial programs for mathematical modeling. It is associated with the use of a graphical user interface (GUI). It is necessary to carry out an additional control of input data to solve geometrically complex problems. Due to the modular structure of applied software, the problem of controlling geometric data is resolved on the basis of unified visualization tools.

The core of a pre-processor is a task manager, which allows an executable program to be assembled for a particular problem. It is designed for automatic preparation of numerical schemes to solve specific problems. The task manager includes system tools for solving both steady and time-dependent problems as well as adjoint problems using a multiple block of computational modules. A numerical scheme for adjoint problems includes an appropriately arranged chain of separate computational modules.

7.3.3 Computational modules

Program packages for applied mathematical modeling consist of a set of computational modules which are designed to solve specific applied problems, i.e. to carry out the functions of the processor (generating meshes, solving systems of equations). These software modules are developed by different research groups with individual traditions and programming techniques. A program package for applied mathematical modeling is a tool for integrating the developed software in a given application area. The use of unified standards of input/output allows the integration of a computational module into a software system for mathematical modeling, which makes automated assembly of computational schemes from separate computational modules possible.


A computational module is designed to solve a particular applied problem. For transient problems, a computational module involves solution of the problem from an initial condition up to the final state. The reusability of computational modules for different problems results from the fact that the computational module provides a parametric study of an individual problem. To solve the problem, we choose a group of variable parameters (geometry, material properties, computational parameters, etc.). Software system tools provide a user-friendly interface for computational modules on the basis of dialog tools which set the parameters of a problem.

7.3.4 Data processing and visualization

A software system post-processor is used for visualization of computational data. This problem is resolved by the use of a unified standard for the output of computational modules. Software systems for engineering and scientific calculations require visualization of 1D, 2D, and 3D calculated data for scalar and vector fields. Data processing (e.g. evaluation of integral field values or critical parameters) is performed in separate computational modules. Software systems for mathematical modeling make it possible to calculate these additional data, to output the results of calculations, and to include them in other documents and reports.

7.4 General purpose applied software

The idea of end-to-end mathematical modeling (from geometry creation to data visualization) is most completely realized in commercial multiphysics numerical packages. A similar organization of applied software is also possible with free/open source software; as an example, see the Salome platform (Section 7.4.3 below).

7.4.1 User-friendly interface

User interaction with a computer is based on a graphical user interface in modern applied software. A GUI is a system of tools based on the representation of all available user system objects and functions in the form of graphic display components (windows, icons, menus, buttons, lists, etc.). The user has random access (using keyboard or mouse) to all visible display objects.

GUIs are employed in commercial software for mathematical modeling, such as ANSYS¹, Marc, SimXpert, and other products of MSC Software², STAR-CCM+ and STAR-CD from CD-adapco³, and COMSOL Multiphysics⁴. In these cases, preparation of geometrical models, equation selection, and setting of boundary and initial conditions are conducted as easily and clearly as possible for inexperienced users.

7.4.2 Basic features

A pre-processor solves two primary problems in systems for applied mathematical modeling. The first problem involves preparing a geometrical model and specifying a computational domain. The second pre-processor problem consists of preparing an applied model, i.e. selecting and setting governing equations along with initial and boundary conditions.

To prepare geometrical models, it is usually possible to import geometry designed in CAD systems (see SolidWorks⁵, CATIA⁶, Autodesk Inventor⁷, among others) using the following formats: IGES, STEP, VRML, STL, etc. The second approach is to import from other mathematical modeling systems, in particular from systems of finite element analysis, such as NASTRAN, ABAQUS, ANSYS, etc. Advanced systems for applied mathematical modeling provide their own tools to create and edit geometrical models. Such editors for geometrical modeling provide geometrical primitives and boolean operations. In addition, it is possible to design ad hoc geometrical editors using the Open CASCADE⁸ library for solid modeling and visualization.

A pre-processor plays a very important role in setting a physical model. We usually work within a fixed set of governing equations, where it is necessary to specify coefficients for individual sub-domains. The setting of material properties is performed in analytical (functional dependence on variables) or table form. Standardization is ensured by using a common database of material properties and convenient tools for their retrieval. A special technique is used to prescribe boundary and initial conditions. It is necessary to have graphical tools to extract individual parts of surfaces and impose boundary conditions on them.

1 www.ansys.com.
2 www.mscsoftware.com.
3 www.cd-adapco.com.
4 www.comsol.com.
5 www.solidworks.com.
6 www.3ds.com/products/catia/welcome.
7 usa.autodesk.com.
8 http://www.opencascade.org.


A program system for applied mathematical modeling includes advanced tools for data visualization and processing. A post-processor provides, firstly, visualization of numerical results. In a user-friendly GUI, the user can view results, save graphs in chosen formats, and print them. In addition to visualizing 1D, 2D, and 3D scalar and vector fields, special tools may be available for the animation of time-dependent fields.

7.4.3 The Salome platform

Among free multiplatform software for numerical simulation, we can highlight Salome⁹ as a very attractive tool. It is considered to be a platform providing basic features of systems for end-to-end modeling not associated with a specific problem. The program allows users to construct geometries, generate meshes, and visualize results. Thus, we have a unified environment with the basic elements of pre- and post-processing.

Salome does not include computational modules, but has tools to add them and organize calculations. The Python algorithmic language is used to prepare user-defined numerical schemes. Salome can therefore be employed as a platform to integrate third party computational modules in the creation of modern software for applied mathematical modeling.

Among the basic modules of Salome, the Geometry module for creating and editing CAD geometric models with support for import and export is worth mentioning. The Mesh module is designed for generating meshes, including via third party generators. The Post-Pro module is applied for the visualization of calculated data. The YACS module is available for the creation, editing, and execution of numerical schemes.

Among the available extensions of SALOME, we note SALOME-. In addition to SALOME itself, this integrated toolkit for mechanical analysis includes the powerful package for finite element analysis, Code_Aster¹⁰. It is also possible to integrate SALOME with Code_Saturne¹¹, which is used to predict 2D and 3D heat and fluid flows.

7.5 Problem-oriented software

In addition to general purpose applied software oriented to a wide class of problems, much attention is paid to special software for the solution of specific applied problems.

9 www.salome-platform.org. 10 www.code-aster.org. 11 http://code-saturne.org.

In developing problem-oriented software, the focus is on computational algorithms, while the pre- and post-processing capabilities remain comparatively modest.

7.5.1 Parametric study

In software for end-to-end mathematical modeling, each computational module is associated with the solution of an individual applied problem. A set of such modules must solve all basic problems. To make this possible, a computational module should be focused on the parametric study of applied problems.

Mathematical modeling of complicated applied problems is based, firstly, on solving boundary value problems for systems of steady and unsteady equations of mathematical physics. As a rule, they are supplemented with systems of ordinary and transcendental equations. Next, numerical simulation requires the specification of a computational domain, i.e. creation of a geometric model. A problem-oriented software system for applied mathematical modeling may not provide tools for fully-fledged 3D editing of geometric models. Modeling is based on the separation of a clearly defined class of problems which will be solved using developed computational modules. Thus there is no need to construct a 3D model, nor do we need to employ an editor for complex 3D geometric objects. It is sufficient to have a library of standard parametric geometric objects (primitives). We have a set of parameters of geometric models and define a range of their variations for each computational module. System tools for visualization provide control of input parameters of geometric models. These tools are universal and do not depend on computational modules.

In a parametric study, we control parameters which specify the process being investigated. Users must have options to choose parameters of the applied model under consideration. Coefficients of governing equations, boundary and initial conditions are set for a particular computational module. Parameters associated with material properties deserve a separate mention. It is necessary to create a unified database of materials and tools to work with this database.

The third set of parameters is computational parameters. These parameters are associated with discretization schemes, methods for solving discrete problems, and other parameters of the computational algorithm. In particular, we need to specify the value of a timestep if we solve a transient problem. Thus, there are three groups of input parameters in each computational module for parametric study:
– parameters of a geometric model;
– parameters of an applied mathematical model;
– computational parameters.


Only interactive work with a relatively small number of parameters for solving applied problems makes end-to-end modeling software suitable for a wide range of users.

7.5.2 Component-based implementation of functionalities

The functionality of modern software systems for applied mathematical modeling must reflect the level of development of theory and practice achieved for numerical algorithms and software. This goal is attained by component-oriented programming based on the use of well-developed and verified software for solving basic mathematical problems (general mathematical functionality).

In applied program packages, the actual content of computational problems is the solution of the initial value problem (the Cauchy problem) for systems of ordinary differential equations which reflect mass, momentum, and energy conservation laws. We reach Cauchy problems by discretization in space (using finite element methods, control volume or finite difference schemes). The general features of the system of ordinary differential equations (ODEs) are:
– it is nonlinear;
– it is coupled (some variables are dependent on others);
– it is stiff, i.e. different scales (in time) are typical for the physical processes considered.

In order to solve the Cauchy problem we need to apply partially implicit schemes (to overcome stiffness) with iterative implementation (to treat the nonlinearity), as well as partial elimination of variables (to decouple unknowns). Applied software allows the tuning of iterative processes in order to take specific features of problems into account. It is possible to arrange nested iterative procedures, to define specific stopping criteria and timestep control procedures, and more. In this case, computational procedures demonstrate a very special character and work in a specific (and often very narrow) class of problems; the transition to other seemingly similar problems can drastically reduce their efficiency.
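Schematically (in our notation, not the book's), the spatially discretized problem is a Cauchy problem

$$\frac{dy}{dt} = f(t, y), \qquad y(0) = y^0,$$

and the simplest representative of the implicit schemes mentioned above is the backward Euler method

$$\frac{y^{n+1} - y^{n}}{\tau} = f(t^{n+1}, y^{n+1}),$$

whose nonlinear system for $y^{n+1}$ is solved iteratively, e.g. by Newton-type iterations

$$\Bigl(I - \tau \frac{\partial f}{\partial y}\Bigr)\,\delta y = -\bigl(y^{(s)} - y^{n} - \tau f(t^{n+1}, y^{(s)})\bigr), \qquad y^{(s+1)} = y^{(s)} + \delta y,$$

which is where stiffness, coupling, and nonlinearity all enter the implementation.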

7.5.3 Computational components

It is obviously necessary to apply more efficient computational algorithms in order to implement complicated nonlinear multidimensional transient applied models which take the parallel architecture (cluster, multicore, etc.) of modern computing systems into account. The nature of the algorithms used must be purely mathematical, i.e. involving no other considerations. The hierarchy of numerical algorithms is listed here (increasing in complexity from top to bottom):

– linear solvers – direct methods and, first of all, iterative methods if the systems of equations have large dimensions;
– nonlinear solvers – general methods for solving nonlinear systems of algebraic equations;
– ODE solvers – to solve the stiff systems of ODEs which appear in some mathematical models.

Solvers must support both sequential and parallel implementations and also reflect modern achievements of numerical analysis and programming techniques. This means, in particular, that we need to use appropriate specialized software developed by specialists in numerical analysis. This software must be deeply verified in practice and greatly appreciated by the international scientific community. Such applied software systems are presented in particular in the following collections:
– Trilinos¹² – Sandia National Laboratory;
– SUNDIALS¹³ – Lawrence Livermore National Laboratory;
– PETSc¹⁴ – Argonne National Laboratory.

12 www.trilinos.org. 13 computation.llnl.gov/casc/sundials/main.html. 14 www.mcs.anl.gov/petsc.

Maria V. Vasilieva, Alexandr E. Kolesov

8 Geometry generation and meshing

Abstract: The process of mesh generation, called triangulation, is a key element of numerical experiments due to its essential influence on the accuracy of numerical results. For this reason a good many tools have been developed to construct meshes of appropriate quality relevant to the peculiarities of the problems under consideration. The most popular FOSS grid generators are Gmsh¹ and NETGEN². In addition to meshing procedures, these programs are supplemented with tools for creating simple geometries.

8.1 General information

We need a discrete representation of a computational domain to solve a mathematical model numerically. Mesh generation is the process of creating such a discrete representation of a realistic geometry. This representation must be unique, and local refinement may be necessary in areas where large gradients of the desired functions are expected. The basic mesh elements are intervals in the 1D case, triangles and quadrangles in the 2D case, and tetrahedrons and hexahedrons in the 3D case. Two primary classes of meshes are recognized:
– structured meshes are those with regular connectivity;
– unstructured meshes have irregular connectivity.

8.1.1 Structured meshes

Structured meshes (Figure 8.1) are widely used in computational mathematics. A structured mesh has regular connectivity. Advantages of structured meshes are connected to preserving the structure of neighboring nodes and maintaining a certain template for each mesh point. Rectangles (2D) or parallelepipeds (3D) are usually used as cells in structured meshes. Such meshes are most often employed in finite difference methods.

To generate a regular mesh for a complex geometric object, it is possible to apply a coordinate transformation and construct a uniform grid in the transformed representation of the object.

1 http://geuz.org/gmsh/. 2 www.hpfem.jku.at/netgen/.


Fig. 8.1. Structured mesh.

In doing so, the computational mathematical model should be written in the corresponding curvilinear coordinates.
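A simple illustrative case (our example, not from this chapter) is an annular domain: with the polar map

$$x_1 = r\cos\varphi, \qquad x_2 = r\sin\varphi, \qquad R_1 \le r \le R_2, \quad 0 \le \varphi < 2\pi,$$

a uniform grid $r_i = R_1 + i\,\Delta r$, $\varphi_j = j\,\Delta\varphi$ in the $(r, \varphi)$ rectangle becomes a structured, boundary-fitted mesh of the annulus, and the governing equations are then discretized in the curvilinear coordinates $(r, \varphi)$.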

8.1.2 Unstructured meshes

The characteristic feature of unstructured meshes (Figure 8.2) is the arbitrary location of mesh points in a physical domain. This arbitrariness should be understood in the sense that there is no preferred direction in the location of mesh points, i.e. we do not observe a structure similar to regular meshes. Mesh points may be combined into polygons (2D) or polyhedrons (3D) of arbitrary shape. As a rule, triangular and tetrahedral cells are used in 2D and 3D, respectively. There is usually no need to use more complex cell shapes.

The process of mesh generation is called triangulation. Using a given set of points, a computational domain can be triangulated in various ways. Note that any triangulation method should lead to the same number of cells for the given set of points. It is often necessary to optimize a generated mesh using certain criteria. The basic optimization criterion is that the triangles obtained should be close to equilateral (no corner is too sharp). This criterion is local and relates to a single triangle. The second (global) criterion is that neighboring triangles should have equal areas (the uniformity criterion).

Fig. 8.2. Unstructured mesh.


There is a special triangulation, the Delaunay triangulation, which has a number of optimal properties. First, the triangles generated tend to be equilateral, i.e. the Delaunay algorithm maximizes the minimal interior angle of the mesh triangles. The second property is that the circle circumscribing any triangle does not include any other mesh points in its interior. A more detailed description of the properties of a Delaunay mesh can be found in the specialist literature.

We now introduce a Voronoi diagram (partition) and its connection with Delaunay triangulations for further discussion. The Voronoi cell of a point (called a seed) is the set of points located closer to this seed than to any other seed. A Voronoi cell is obtained as the intersection of half-planes bounded by the perpendicular bisectors of the segments joining the seed to its nearest neighbors. It should be noted that Voronoi cells are always convex polygons. Each vertex of a Voronoi diagram is the meeting point of three Voronoi polygons. The seeds of these polygons form a Delaunay triangulation. Thus, we have a duality between Delaunay triangulations and Voronoi diagrams.

Delaunay triangulations are actively employed for the numerical solution of applied problems on the basis of finite element methods. There are many well-developed methods and software tools for generating such triangle meshes. As mentioned above, we will now consider two of the most popular tools.
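For reference, the circumcircle property can be expressed as an explicit algebraic test (a standard formula, not specific to this book): for a triangle with vertices $A$, $B$, $C$ listed counterclockwise, a fourth point $D$ lies strictly inside its circumscribed circle if and only if

$$\begin{vmatrix}
A_x - D_x & A_y - D_y & (A_x - D_x)^2 + (A_y - D_y)^2 \\
B_x - D_x & B_y - D_y & (B_x - D_x)^2 + (B_y - D_y)^2 \\
C_x - D_x & C_y - D_y & (C_x - D_x)^2 + (C_y - D_y)^2
\end{vmatrix} > 0.$$

Edge-flipping algorithms for constructing Delaunay triangulations are built around this in-circle predicate.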

8.2 The Gmsh workflow

Two programs commonly used to construct 3D meshes are considered here, namely Gmsh and NETGEN. As a rule, CAD (computer-aided design) systems are employed to create complex geometrical models; these two programs also allow us to build simple geometrical models. To install Gmsh, we can download it from the official website listed above (for Windows, Linux and Mac), or for the Ubuntu operating system we can apply the command below.

$ sudo apt-get install gmsh

8.2.1 Elementary entities

Gmsh is an automatic mesh generator for 2D and 3D domains. It has a light and user-friendly graphical interface. Gmsh consists of four modules which deal with geometry, mesh, solver, and post-processing (Figure 8.3). To launch Gmsh in the interactive mode, we should click on the program icon or use the command

$ gmsh


Fig. 8.3. Gmsh.

A geometry is created in Gmsh using elementary entities, i.e. points, lines, surfaces, and volumes. The constructed geometry is saved as a geo-file. A geo-file can also be created and edited using any text editor, and can then be loaded by the File | Open menu command. The file can also be edited in the Geometry module by clicking the Edit button. The Geometry module provides the possibilities of a simple CAD system and employs a boundary representation (BRep) to define geometric objects. In order to describe geometries, we first define points (using the Point command), then introduce lines (the Line, Circle, Ellipse, and Spline commands) and surfaces (the Plane and Ruled Surface commands), and finally describe volumes (the Volume command). These geometric entities have an identification number, which is assigned to them at creation. We can also move and modify geometric entities via the Translate, Rotate, Dilate or Symmetry commands. These entities are considered in detail in the following.

Point:
– Point(id) = {x, y, z, size}. Creates a point with the identification number (id). The three x, y, and z coordinates of the point and the prescribed mesh element size are specified inside the brackets. The last argument is optional.


Lines:
– Line(id) = {pointId1, pointId2}. Creates a straight line with the identification number (id). The start and end points of the line are given by their identification numbers inside the brackets.
– Circle(id) = {pointId1, pointId2, pointId3}. Creates a circle arc (strictly smaller than π) with the identification number (id). The identification numbers of the start, center, and end points of the arc are given inside the brackets.
– Ellipse(id) = {pointId1, pointId2, pointId3, pointId4}. Creates an ellipse arc with the identification number (id). Inside the brackets, pointId1 and pointId2 are the identification numbers of the start and center points of the arc, respectively, pointId3 is the identification number of any point located on the major axis of the ellipse, and pointId4 is the identification number of the end point of the arc.
– Line Loop(id) = {lineId1, lineId2, ..., lineIdN}. Creates a closed line loop with the identification number (id). Inside the brackets, we set a list of identification numbers of all the lines which compose the loop.

Surfaces:
– Plane Surface(id) = {lineLoopId1, lineLoopId2, ..., lineLoopIdN}. Creates a plane surface with the identification number (id). Inside the brackets we give a list of identification numbers of all line loops which define the surface. The exterior boundary is defined by the first line loop and all other line loops specify holes in the surface.
– Ruled Surface(id) = {lineLoopId1, lineLoopId2, ..., lineLoopIdN}. Creates a ruled surface similar to plane surfaces.
– Surface Loop(id) = {surfaceId1, surfaceId2, ..., surfaceIdN}. Creates a surface loop (a closed shell) with the identification number (id). Inside the brackets we provide a list of identification numbers of all surfaces which constitute the loop.

Volume:
– Volume(id) = {surfaceLoopId1, surfaceLoopId2, ..., surfaceLoopIdN}. Creates a volume with the identification number (id). Inside the brackets we give a list of identification numbers of all surface loops which compose the volume.

8.2.2 Commands for working with geometric objects

Lines, surfaces, and volumes can also be created using the Extrude command.
– Extrude{x, y, z} {object}. Extrudes an object along a vector. Inside the first brackets we give the coordinates of the vector (x, y, z). Objects can be points, lines, or surfaces.
– Extrude{{x1, y1, z1}, {x2, y2, z2}, angle} {object}. Extrudes an object with a rotation. Inside the first brackets we specify the direction of the rotation axis (x1, y1, z1), the coordinates of a point on this axis (x2, y2, z2), and the rotation angle in radians (angle).
– Extrude{{x1, y1, z1}, {x2, y2, z2}, {x3, y3, z3}, angle} {object}. Extrudes an object along a vector with a rotation. Inside the first brackets we give the coordinates of the vector (x1, y1, z1). The other arguments are the same as in the previous case.

We can apply geometrical transformations to elementary entities or to their copies created by the Duplicata command.
– Rotate{{x1, y1, z1}, {x2, y2, z2}, angle} {object}. Rotates an object (points, lines, surfaces or volumes) by an angle (in radians). Inside the first brackets we specify the direction of the rotation axis (x1, y1, z1); the coordinates of a point on this axis (x2, y2, z2) are also specified. To rotate a copy of the object, we must use Duplicata{object}.
– Translate{x, y, z} {object}. Translates an object along the vector (x, y, z).
– Symmetry{A, B, C, D} {object}. Moves an object symmetrically with respect to a plane. Inside the brackets we set the coefficients of the plane’s equation (A, B, C, D).
– Dilate{{x, y, z}, factor} {object}. Scales an object by a factor. Inside the brackets we give the directions of the transformation (x, y, z).

8.2.3 Physical entities

The major advantage of Gmsh is the existence of physical entities. These entities are employed to unite elementary entities into larger groups. They can be used to set boundaries and sub-domains of a future mesh.
– Physical Point(id) = {PointId1, PointId2, ..., PointIdN}. Creates a physical point with the identification number (id). Inside the brackets we set a list of elementary points which need to be grouped into one group.
– Physical Line(id) = {LineId1, LineId2, ..., LineIdN}.


Creates a physical line with the identification number (id). Inside the brackets we specify a list of elementary lines which need to be grouped into one group.
– Physical Surface(id) = {SurfaceId1, SurfaceId2, ..., SurfaceIdN}. Creates a physical surface with the identification number (id). Inside the brackets we present a list of elementary surfaces which need to be combined into one group.
– Physical Volume(id) = {VolumeId1, VolumeId2, ..., VolumeIdN}. Creates a physical volume with the identification number (id). Inside the brackets we set a list of elementary volumes which need to be grouped into one group.

Note that the identification number of a group can be either a number or a string. In the latter case, a number will be given automatically.

8.2.4 Building geometry

We now consider an example of building a 2D domain. We create a mesh.geo file using any text editor. First, we define auxiliary parameters as follows:

p1 = 0.1;
p2 = 0.02;
r = 0.25;

Here, p1 and p2 are mesh element sizes and r is a radius. Next we create points to define a rectangle and circle as outlined below:

Point(1) = {0, 0, 0, p1};
Point(2) = {1, 0, 0, p1};
Point(3) = {1, 1, 0, p1};
Point(4) = {0, 1, 0, p1};
Point(5) = {0.5, 0.5, 0, p2};
Point(6) = {0.5+r, 0.5, 0, p2};
Point(7) = {0.5-r, 0.5, 0, p2};
Point(8) = {0.5, 0.5+r, 0, p2};
Point(9) = {0.5, 0.5-r, 0, p2};

Straight lines and circle arcs are then introduced as follows:

Line(1) = {1, 2};
Line(2) = {2, 3};
Line(3) = {3, 4};
Line(4) = {4, 1};


Circle(5) = {6, 5, 9};
Circle(6) = {9, 5, 7};
Circle(7) = {7, 5, 8};
Circle(8) = {8, 5, 6};

along with a plane surface:

Line Loop(9) = {8, 5, 6, 7};
Line Loop(10) = {3, 4, 1, 2};
Plane Surface(11) = {9, 10};

Further, we open the obtained mesh.geo file in Gmsh. Figure 8.4 shows the designed geometric domain.

Fig. 8.4. 2D geometric domain.

Finally, we can transform our 2D domain (Figure 8.4) into the 3D box presented in Figure 8.5 using the following Extrude command:

Extrude {0, 0, 1} {
  Surface{11};
}

Fig. 8.5. 3D geometric domain.


8.2.5 Tools

Gmsh provides an extensive set of tools which accelerate building geometric domains. We now discuss the most useful ones.
– Arithmetic operations. Gmsh supports all basic arithmetic (=, +, -, *, /, ^, %, +=, -=, *=, /=, ++, --), relational (>, >=,

            if (j > 0) {
                col[ncols].i = i;
                col[ncols].j = j - 1;
                v[ncols++] = -hx / hy;


            }
            if (j < info.my - 1) {
                col[ncols].i = i;
                col[ncols].j = j + 1;
                v[ncols++] = -hx / hy;
            }
            col[ncols].i = i;
            col[ncols].j = j;
            v[ncols++] = 2 * (hx / hy + hy / hx);
            MatSetValuesStencil(A, 1, &row, ncols, col, v, INSERT_VALUES);
        }
    }
    // Assemble matrix
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
    // Create vectors
    Vec x, b;
    DMCreateGlobalVector(da, &x);
    DMCreateGlobalVector(da, &b);
    VecSet(b, hx*hy);
    // Create linear solver
    KSP ksp;
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A, DIFFERENT_NONZERO_PATTERN);
    KSPSetFromOptions(ksp);
    // Solve and save the solution
    KSPSolve(ksp, b, x);
    save("u.vtk", da, x);
    // Get and print the number of iterations
    int its;
    KSPGetIterationNumber(ksp, &its);
    PetscPrintf(PETSC_COMM_WORLD, "Iterations %D\n", its);
    // Free work space
    KSPDestroy(&ksp);
    VecDestroy(&x);
    VecDestroy(&b);
    MatDestroy(&A);
    DMDestroy(&da);
    PetscFinalize();
    return 0;
}

First, the header files are included. The "petscdmda.h" file allows the use of distributed arrays DMDA. The "petscksp.h" file is necessary to employ KSP solvers as follows:


Listing 9.18.
#include <petscdmda.h>
#include <petscksp.h>

Listing 9.19 shows how to initialize PETSc and MPI.

Listing 9.19.
PetscInitialize(&argc, &argv, (char*)0, 0);

The da object is then created. The 2D structured uniform grid for the unit square is built and information about it obtained as shown below in Listing 9.20.

Listing 9.20.
DMDACreate2d(PETSC_COMM_WORLD, DMDA_BOUNDARY_NONE, DMDA_BOUNDARY_NONE,
             DMDA_STENCIL_STAR, -8, -7, PETSC_DECIDE, PETSC_DECIDE,
             1, 1, NULL, NULL, &da);
DMDASetUniformCoordinates(da, 0.0, 1.0, 0.0, 1.0, NULL, NULL);
DMDALocalInfo info;
DMDAGetLocalInfo(da, &info);
// Compute grid steps
double hx = 1.0/info.mx;
double hy = 1.0/info.my;

In order to solve the problem it is necessary to form the matrix and the right-hand side vector. The parallel matrix for the da object is created as shown in Listing 9.21.

Listing 9.21.
DMCreateMatrix(da, MATAIJ, &A);

The MATAIJ type corresponds to a sparse matrix. The result is an empty matrix A. Next, the five-diagonal coefficient matrix is filled according to the finite difference approximation (9.4).

Listing 9.22.
int ncols;
PetscScalar v[5];
MatStencil col[5], row;
for (int j = info.ys; j < info.ys + info.ym; j++) {
    for (int i = info.xs; i < info.xs + info.xm; i++) {
        ncols = 0;
        row.i = i;
        row.j = j;
        if (i > 0) {
            col[ncols].i = i - 1;
            col[ncols].j = j;
            v[ncols++] = -hy / hx;
        }
        if (i < info.mx - 1) {
            col[ncols].i = i + 1;
            col[ncols].j = j;
            v[ncols++] = -hy / hx;
        }
        if (j > 0) {
            col[ncols].i = i;
            col[ncols].j = j - 1;
            v[ncols++] = -hx / hy;
        }
        if (j < info.my - 1) {
            col[ncols].i = i;
            col[ncols].j = j + 1;
            v[ncols++] = -hx / hy;
        }
        col[ncols].i = i;
        col[ncols].j = j;
        v[ncols++] = 2 * (hx / hy + hy / hx);
        MatSetValuesStencil(A, 1, &row, ncols, col, v, INSERT_VALUES);
    }
}
// Assemble matrix
MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

Then the vector of the approximate solution x and the right-hand side vector b are created as shown in Listing 9.23.

Listing 9.23.
DMCreateGlobalVector(da, &x);
DMCreateGlobalVector(da, &b);


We set the values of the vector b as follows:

Listing 9.24.
VecSet(b, hx*hy);

After designing the matrix and vectors, the linear system Ax = b is solved using the KSP object. First, the KSP object is created as shown in Listing 9.25.

Listing 9.25.
KSPCreate(PETSC_COMM_WORLD, &ksp);

Then, we define the operators associated with the system as shown below:

Listing 9.26.
KSPSetOperators(ksp, A, A, DIFFERENT_NONZERO_PATTERN);

The matrix A serves as a preconditioning matrix in this case. The parameter DIFFERENT_NONZERO_PATTERN is a flag which gives information about the structure of the preconditioner. To specify various options of the solver, we apply the command shown in Listing 9.27.

Listing 9.27.
KSPSetFromOptions(ksp);

This command allows users to customize the linear solver through a set of runtime options. Using these options, we can select an iterative method and the type of preconditioner, control the convergence criteria, and set up various monitoring procedures. The linear system is then solved as shown in Listing 9.28 below.

Listing 9.28.

KSPSolve(ksp,b,x);

Here b and x are the right-hand side vector and the unknown vector, respectively. By default, KSP uses the GMRES method with the ILU preconditioner for a sequential run and with the block Jacobi preconditioner for a parallel run.
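Because KSPSetFromOptions has been called, these defaults can be changed without recompiling. A minimal sketch of a run that switches the preconditioner to Jacobi, tightens the relative tolerance, and prints the residual history (these are standard PETSc runtime options; the grid sizes are illustrative):

$ mpirun -np 4 ./poisson -da_grid_x 50 -da_grid_y 50 \
    -pc_type jacobi -ksp_rtol 1e-8 -ksp_monitor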

The desired method for solving systems of linear equations is specified by the following command:

KSPSetType(KSP ksp, KSPType method);

Table 9.3 presents a list of available methods.

Table 9.3. Methods for solving systems of linear equations.

Method                                        KSPType          Option name
Richardson                                    KSPRICHARDSON    richardson
Chebychev                                     KSPCHEBYCHEV     chebychev
Conjugate Gradient                            KSPCG            cg
BiConjugate Gradient                          KSPBICG          bicg
Generalized Minimal Residual                  KSPGMRES         gmres
BiCGSTAB                                      KSPBCGS          bcgs
Conjugate Gradient Squared                    KSPCGS           cgs
Transpose-Free Quasi-Minimal Residual (1)     KSPTFQMR         tfqmr
Transpose-Free Quasi-Minimal Residual (2)     KSPTCQMR         tcqmr
Conjugate Residual                            KSPCR            cr
Least Squares Method                          KSPLSQR          lsqr
Shell for no KSP method                       KSPPREONLY       preonly
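For example, to select the conjugate gradient method directly in the source code (a sketch; the program above leaves the choice to KSPSetFromOptions instead), one would write:

KSPSetType(ksp, KSPCG);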

A method for solving the system of equations can also be selected with the command line option -ksp_type, whose values are listed in the third column of Table 9.3. Next, we get the number of iterations its and print it as follows:

Listing 9.29.

int its;
KSPGetIterationNumber(ksp, &its);
PetscPrintf(PETSC_COMM_WORLD, "Iterations %D\n", its);

The its parameter holds either the number of iterations performed until convergence was reached, or the number of iterations performed before divergence or a breakdown was detected. Finally, all objects are destroyed as shown in Listing 9.30.

Listing 9.30.

KSPDestroy(&ksp);
VecDestroy(&x);
VecDestroy(&b);
MatDestroy(&A);
DMDestroy(&da);


Then PetscFinalize is called. In the example below, we compile and run the program on 4 parallel processes:

$ mpirun -np 4 ./poisson -da_grid_x 50 -da_grid_y 50

The result is

Iterations 37

Figure 9.4 shows the solution of the problem.

Fig. 9.4. The solution of the Dirichlet problem for the Poisson equation.


9.3 Solution of nonlinear equations and systems

Most problems requiring the solution of differential equations are nonlinear. PETSc offers the SNES module for the solution of nonlinear problems. Built on top of the KSP linear solvers and PETSc data structures, SNES provides an interface to Newton's method and its various modifications. An example of solving a nonlinear problem using the SNES module follows.

9.3.1 Statement of the problem

In the rectangle
\[
\Omega = \{ x \mid 0 \le x_\alpha \le 1, \ \alpha = 1, 2 \},
\]
we consider the nonlinear equation
\[
- \operatorname{div}(q(u) \operatorname{grad} u) = 0, \tag{9.6}
\]
with the boundary conditions
\[
u(0, x_2) = 1, \quad u(1, x_2) = 0, \qquad
\frac{\partial u}{\partial n}(x_1, 0) = 0, \quad \frac{\partial u}{\partial n}(x_1, 1) = 0. \tag{9.7}
\]
For our problem, we define q(u) as
\[
q(u) = u^{\beta}.
\]

9.3.2 Solution algorithm

Newton's method is in common use for the solution of nonlinear problems of mathematical physics:
\[
F(u) = 0. \tag{9.8}
\]
Newton's method has the following general form:
\[
u^{k+1} = u^{k} - J(u^{k})^{-1} F(u^{k}), \qquad k = 0, 1, \ldots, \tag{9.9}
\]
where J(u^k) = F'(u^k) is the Jacobian and u^0 is the initial approximate solution. In practice, Newton's method (9.9) is realized in two steps:
– Solve (approximately): J(u^k) δu^k = −F(u^k).
– Update: u^{k+1} = u^k + δu^k.
We apply this method to solve the problem (9.6), (9.7). Assuming that δu^k is sufficiently small, the problem can be linearized with respect to δu^k. The nonlinear coefficient q(u) is linearized as
\[
q(u^{k+1}) = q(u^{k}) + q'(u^{k})\, \delta u^{k} + O\!\left((\delta u^{k})^{2}\right) \approx q(u^{k}) + q'(u^{k})\, \delta u^{k}. \tag{9.10}
\]
Treating δu^k as an unknown, we obtain the equation
\[
\operatorname{div}(q(u^{k}) \operatorname{grad} u^{k})
+ \operatorname{div}(q(u^{k}) \operatorname{grad} \delta u^{k})
+ \operatorname{div}(q'(u^{k})\, \delta u^{k} \operatorname{grad} u^{k}) = 0. \tag{9.11}
\]
Collecting the terms with the unknown δu^k on the left, we obtain the system of linear equations
\[
\operatorname{div}(q(u^{k}) \operatorname{grad} \delta u^{k})
+ \operatorname{div}(q'(u^{k})\, \delta u^{k} \operatorname{grad} u^{k})
= - \operatorname{div}(q(u^{k}) \operatorname{grad} u^{k}). \tag{9.12}
\]


This system can be represented in the matrix form J(u^k) δu^k = −F(u^k), from which δu^k is determined. Using the calculated δu^k, we find u^{k+1} = u^k + δu^k. The main computational cost of Newton's method lies in the need to solve this linear system of equations at every iteration.
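For the particular coefficient q(u) = u^β used here, the derivative entering (9.10)–(9.12) is elementary, and the implementation in Section 9.3.3 evaluates q at the arithmetic mean of neighboring nodal values (this is our summary of the code below, not a formula quoted from the text):
\[
q(u) = u^{\beta}, \qquad q'(u) = \beta u^{\beta - 1}, \qquad
q_{i+1/2,j} \approx \left( \frac{u_{i,j} + u_{i+1,j}}{2} \right)^{\beta}.
\]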

9.3.3 Implementation in PETSc

The parallel implementation of the solution of the nonlinear problem using Newton's method is given in Listing 9.31 below.

Listing 9.31.

#include <petscdmda.h>   // header names assumed; the original include targets
#include <petscsnes.h>   // were lost in the text extraction

#define T0 0.0
#define T1 1.0
#define BETA 2.0

extern PetscErrorCode FormInitialGuess(SNES, Vec, void*);
extern PetscErrorCode FormFunction(SNES, Vec, Vec, void*);
extern PetscErrorCode FormJacobian(SNES, Vec, Mat*, Mat*, MatStructure*, void*);

int main(int argc, char **argv) {
    PetscInitialize(&argc, &argv, 0, 0);
    // Create the grid
    DM da;
    DMDACreate2d(PETSC_COMM_WORLD, DMDA_BOUNDARY_NONE, DMDA_BOUNDARY_NONE,
                 DMDA_STENCIL_STAR, -5, -5, PETSC_DECIDE, PETSC_DECIDE,
                 1, 1, 0, 0, &da);
    DMDASetUniformCoordinates(da, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0);
    // Create the solver
    SNES snes;
    SNESCreate(PETSC_COMM_WORLD, &snes);
    SNESSetDM(snes, (DM) da);
    SNESSetFunction(snes, NULL, FormFunction, NULL);
    SNESSetJacobian(snes, NULL, NULL, FormJacobian, NULL);
    SNESSetFromOptions(snes);
    SNESSetComputeInitialGuess(snes, FormInitialGuess, NULL);
    // Solve the problem and save the solution in a file
    SNESSolve(snes, NULL, NULL);
    Vec x;
    SNESGetSolution(snes, &x);
    save("u.vtk", da, x);
    // Obtain and print the number of linear and nonlinear iterations
    int its, lits;
    SNESGetIterationNumber(snes, &its);
    SNESGetLinearSolveIterations(snes, &lits);
    PetscPrintf(PETSC_COMM_WORLD, "Number of SNES iterations=%D\n", its);
    PetscPrintf(PETSC_COMM_WORLD, "Number of Linear iterations=%D\n", lits);
    // Free work space
    DMDestroy(&da);
    SNESDestroy(&snes);
    PetscFinalize();
    return 0;
}

The parameters for the function q(u) = u^β and the boundary conditions are specified as shown in Listing 9.32.

Listing 9.32.

#define T0 0.0
#define T1 1.0
#define BETA 2.0

Next, the da object is created as shown in Listing 9.33.

Listing 9.33.

DMDACreate2d(PETSC_COMM_WORLD, DMDA_BOUNDARY_NONE, DMDA_BOUNDARY_NONE,
             DMDA_STENCIL_STAR, -5, -5, PETSC_DECIDE, PETSC_DECIDE,
             1, 1, 0, 0, &da);
DMDASetUniformCoordinates(da, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0);

For the numerical solution, we invoke the nonlinear solver as shown in Listing 9.34.

Listing 9.34.

SNES snes;
SNESCreate(PETSC_COMM_WORLD, &snes);
SNESSetDM(snes, (DM) da);


To initialize it, the functions F and J should be set as follows:

Listing 9.35.

SNESSetFunction(snes, NULL, FormFunction, NULL);
SNESSetJacobian(snes, NULL, NULL, FormJacobian, NULL);
SNESSetFromOptions(snes);
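As with KSP, the call to SNESSetFromOptions makes the nonlinear solver configurable at run time. A possible invocation that monitors the nonlinear residual and tightens the relative tolerance (these are standard PETSc runtime options; the values are illustrative):

$ mpirun -np 4 ./nonlinear -da_grid_x 50 -da_grid_y 50 \
    -snes_monitor -snes_rtol 1e-10 -ksp_type cg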

A function providing the initial approximation is also needed, as shown below in Listing 9.36.

Listing 9.36.

SNESSetComputeInitialGuess(snes, FormInitialGuess, NULL);

The value T1 = 1.0 is set as the initial guess as follows:

Listing 9.37.

PetscErrorCode FormInitialGuess(SNES snes, Vec X, void *ctx) {
    DM da;
    SNESGetDM(snes, &da);
    int xs, ys, xm, ym;
    DMDAGetCorners(da, &xs, &ys, 0, &xm, &ym, 0);
    PetscScalar **x;
    DMDAVecGetArray(da, X, &x);
    for (int j = ys; j < ys + ym; j++) {
        for (int i = xs; i < xs + xm; i++) {
            x[j][i] = T1;
        }
    }
    DMDAVecRestoreArray(da, X, &x);
    PetscFunctionReturn(0);
}

The function used for setting F is shown below in Listing 9.38.

Listing 9.38.

PetscErrorCode FormFunction(SNES snes, Vec X, Vec F, void *ctx) {
    // Get grid information
    DM da;
    SNESGetDM(snes, &da);
    int mx, my, xs, ys, xm, ym;
    DMDAGetInfo(da, NULL, &mx, &my, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
    DMDAGetCorners(da, &xs, &ys, 0, &xm, &ym, 0);
    double hx = 1.0 / (mx - 1);
    double hy = 1.0 / (my - 1);
    // Get local vector
    Vec localX;
    DMGetLocalVector(da, &localX);
    DMGlobalToLocalBegin(da, X, INSERT_VALUES, localX);
    DMGlobalToLocalEnd(da, X, INSERT_VALUES, localX);
    PetscScalar **x, **f;
    DMDAVecGetArray(da, localX, &x);
    DMDAVecGetArray(da, F, &f);
    // Set values
    double t0, t1, a, d;
    double fUp, fDown, fRight, fLeft;
    for (int j = ys; j < ys + ym; j++) {
        for (int i = xs; i < xs + xm; i++) {
            t0 = x[j][i];
            // i-1
            t1 = (i == 0) ? T1 : x[j][i - 1];
            a = 0.5 * (t0 + t1);
            d = PetscPowScalar(a, BETA);
            fLeft = d * (t0 - t1);
            // i+1
            t1 = (i == mx - 1) ? T0 : x[j][i + 1];
            a = 0.5 * (t0 + t1);
            d = PetscPowScalar(a, BETA);
            fRight = d * (t1 - t0);
            // j-1
            fDown = 0.0;
            if (j > 0) {
                t1 = x[j - 1][i];
                a = 0.5 * (t0 + t1);
                d = PetscPowScalar(a, BETA);
                fDown = d * (t0 - t1);
            }
            // j+1
            fUp = 0.0;
            if (j < my - 1) {
                t1 = x[j + 1][i];
                a = 0.5 * (t0 + t1);
                d = PetscPowScalar(a, BETA);
                fUp = d * (t1 - t0);
            }
            f[j][i] = -hy / hx * (fRight - fLeft) - hx / hy * (fUp - fDown);
        }
    }
    DMDAVecRestoreArray(da, localX, &x);
    DMDAVecRestoreArray(da, F, &f);
    DMRestoreLocalVector(da, &localX);
    PetscFunctionReturn(0);
}

Here we use the finite difference approximation on the five-point stencil, taking the boundary conditions into account. The function for calculating J is given in Listing 9.39.

Listing 9.39.

PetscErrorCode FormJacobian(SNES snes, Vec X, Mat *J, Mat *B,
                            MatStructure *flg, void *ctx) {
    *flg = SAME_NONZERO_PATTERN;
    // Get grid information
    DM da;
    SNESGetDM(snes, &da);
    int mx, my, xs, ys, xm, ym;
    DMDAGetInfo(da, NULL, &mx, &my, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
    DMDAGetCorners(da, &xs, &ys, 0, &xm, &ym, 0);
    double hx = 1.0 / (mx - 1);
    double hy = 1.0 / (my - 1);
    // Get the local vector
    Vec localX;
    DMGetLocalVector(da, &localX);
    DMGlobalToLocalBegin(da, X, INSERT_VALUES, localX);
    DMGlobalToLocalEnd(da, X, INSERT_VALUES, localX);
    PetscScalar **x;
    DMDAVecGetArray(da, localX, &x);
    // Set Jacobian
    Mat jac = *J;
    PetscScalar v[5];
    MatStencil col[5], row;
    int colCount;
    double t0, t1, a, b;
    double qUp, qDown, qRight, qLeft;
    double dqUp, dqDown, dqRight, dqLeft;
    for (int j = ys; j < ys + ym; j++) {
        for (int i = xs; i < xs + xm; i++) {
            colCount = 0;
            row.i = i;
            row.j = j;
            t0 = x[j][i];
            // j-1
            qDown = 0.0;
            dqDown = 0.0;
            if (j > 0) {
                t1 = x[j - 1][i];
                a = 0.5 * (t0 + t1);
                b = PetscPowScalar(a, BETA - 1);
                qDown = PetscPowScalar(a, BETA);
                dqDown = BETA / 2 * b * (t0 - t1);
                col[colCount].i = i;
                col[colCount].j = j - 1;
                v[colCount++] = -hx / hy * (qDown - dqDown);
            }
            // i-1
            t1 = (i == 0) ? T1 : x[j][i - 1];
            a = 0.5 * (t0 + t1);
            b = PetscPowScalar(a, BETA - 1);
            qLeft = PetscPowScalar(a, BETA);
            dqLeft = BETA / 2 * b * (t0 - t1);
            if (i > 0) {
                col[colCount].i = i - 1;
                col[colCount].j = j;
                v[colCount++] = -hy / hx * (qLeft - dqLeft);
            }
            // i+1
            t1 = (i == mx - 1) ? T0 : x[j][i + 1];
            a = 0.5 * (t0 + t1);
            b = PetscPowScalar(a, BETA - 1);
            qRight = PetscPowScalar(a, BETA);
            dqRight = BETA / 2 * b * (t1 - t0);
            if (i < mx - 1) {
                col[colCount].i = i + 1;
                col[colCount].j = j;
                v[colCount++] = -hy / hx * (qRight + dqRight);
            }
            // j+1
            qUp = 0.0;
            dqUp = 0.0;
            if (j < my - 1) {
                t1 = x[j + 1][i];
                a = 0.5 * (t0 + t1);
                b = PetscPowScalar(a, BETA - 1);
                qUp = PetscPowScalar(a, BETA);
                dqUp = BETA / 2 * b * (t1 - t0);
                col[colCount].i = i;
                col[colCount].j = j + 1;
                v[colCount++] = -hx / hy * (qUp + dqUp);
            }
            // center i,j
            col[colCount].i = i;
            col[colCount].j = j;
            v[colCount++] = hx / hy * (qDown + qUp + dqDown - dqUp)
                          + hy / hx * (qLeft + qRight + dqLeft - dqRight);
            // set row
            MatSetValuesStencil(jac, 1, &row, colCount, col,
                                v, INSERT_VALUES);
        }
    }
    MatAssemblyBegin(jac, MAT_FINAL_ASSEMBLY);
    DMDAVecRestoreArray(da, localX, &x);
    MatAssemblyEnd(jac, MAT_FINAL_ASSEMBLY);
    DMRestoreLocalVector(da, &localX);
    PetscFunctionReturn(0);
}

To compute the Jacobian, we use its explicit analytical expression. For the numerical solution of the nonlinear system, the function SNESSolve is called and the solution is saved as shown in Listing 9.40 below.

Listing 9.40.

SNESSolve(snes, NULL, NULL);
Vec x;
SNESGetSolution(snes, &x);
save("u.vtk", da, x);

We then obtain the numbers of nonlinear and linear iterations and print them as shown in Listing 9.41 below.

Listing 9.41.

int its, lits;
SNESGetIterationNumber(snes, &its);
SNESGetLinearSolveIterations(snes, &lits);
PetscPrintf(PETSC_COMM_WORLD, "Number of SNES iterations=%D\n", its);
PetscPrintf(PETSC_COMM_WORLD, "Number of Linear iterations=%D\n", lits);

Finally, the work space is freed as follows:

Listing 9.42.

DMDestroy(&da);
SNESDestroy(&snes);

and then PetscFinalize is called.

In this example, we compile and run the program on 4 parallel processes as follows:

$ mpirun -np 4 ./nonlinear -da_grid_x 50 -da_grid_y 50

The result is

Number of SNES iterations=7
Number of Linear iterations=351

The solution of the problem is depicted in Figure 9.5.

Fig. 9.5. The solution of the nonlinear problem.

9.4 Solving unsteady problems

PETSc provides the TS module for solving unsteady PDEs. TS employs SNES to solve the nonlinear problems arising at each time level.

9.4.1 Problem formulation

In the rectangle
\[
\Omega = \{ x \mid x = (x_1, x_2), \ 0 \le x_\alpha \le 1, \ \alpha = 1, 2 \},
\]
we consider the unsteady equation
\[
\frac{\partial u}{\partial t} - \Delta u = f(x, t), \qquad x \in \Omega, \quad 0 < t \le T, \tag{9.13}
\]
supplemented with the boundary conditions
\[
u(x, t) = 0, \qquad x \in \partial\Omega. \tag{9.14}
\]
For the unique solvability of the unsteady problem, we set the initial condition
\[
u(x, 0) =
\begin{cases}
\exp(c r^{3}), & \text{if } r < 0.125, \\
0, & \text{if } r \ge 0.125,
\end{cases}
\tag{9.15}
\]
where c = const and r = ((x_1 - 0.5)^2 + (x_2 - 0.5)^2)^{1/2}, x ∈ Ω. In the program below, the constant is taken as c = −30.

9.4.2 Approximation

The scheme with weights is applied to solve this transient problem after approximation in space. For simplicity, we define the following uniform grid in time:
\[
\bar{\omega}_\tau = \omega_\tau \cup \{T\} = \{ t^n = n\tau, \ n = 0, 1, \ldots, N, \ \tau N = T \},
\]
and in space:
\[
\bar{\omega} = \{ x \mid x = (x_1, x_2), \ x_\alpha = i_\alpha h_\alpha, \ i_\alpha = 0, 1, \ldots, N_\alpha, \ N_\alpha h_\alpha = 1, \ \alpha = 1, 2 \},
\]
where ω is the set of interior nodes and 𝜕ω is the set of boundary nodes. We write the two-level scheme with weights for the problem (9.13)–(9.15) as follows:
\[
\frac{y^{n+1} - y^{n}}{\tau} + A \left( \sigma y^{n+1} + (1 - \sigma) y^{n} \right) = \varphi^{n}, \qquad t^n \in \omega_\tau, \tag{9.16}
\]
\[
y^{0} = u_0. \tag{9.17}
\]
The discrete operator A is defined as
\[
A y = - \frac{1}{h_1^{2}} \left( y(x_1 + h_1, x_2) - 2 y(x_1, x_2) + y(x_1 - h_1, x_2) \right)
      - \frac{1}{h_2^{2}} \left( y(x_1, x_2 + h_2) - 2 y(x_1, x_2) + y(x_1, x_2 - h_2) \right),
\qquad x \in \omega.
\]

If σ = 0, we have the explicit scheme, whereas for σ = 1, the scheme is implicit.
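For instance, with σ = 1 (the fully implicit scheme), equation (9.16) requires the solution of the linear system
\[
(I + \tau A)\, y^{n+1} = y^{n} + \tau \varphi^{n}
\]
at every time level, where I denotes the identity operator.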

9.4.3 The program

The program for the numerical solution of the time-dependent problem using the TS module is shown in Listing 9.43 below.

Listing 9.43.

#include <petscdmda.h>   // the original listing's four include targets were lost in
#include <petscts.h>     // extraction; these names are reconstructed (the last two assumed)
#include <sstream>
#include <fstream>

extern PetscErrorCode RHSMatrix(TS, PetscReal, Vec, Mat*, Mat*, MatStructure*, void*);
extern PetscErrorCode FormInitialSolution(DM, Vec, void*);
extern PetscErrorCode MyTSMonitor(TS, PetscInt, PetscReal, Vec, void*);

int main(int argc, char **argv) {
    PetscInitialize(&argc, &argv, (char *) 0, 0);
    // Create the grid
    DM da;
    DMDACreate2d(PETSC_COMM_WORLD, DMDA_BOUNDARY_NONE, DMDA_BOUNDARY_NONE,
                 DMDA_STENCIL_STAR, -8, -7, PETSC_DECIDE, PETSC_DECIDE,
                 1, 1, NULL, NULL, &da);
    DMDASetUniformCoordinates(da, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0);
    // Create the vector
    Vec u;
    DMCreateGlobalVector(da, &u);
    // Create the TS solver for the unsteady problem
    TS ts;
    TSCreate(PETSC_COMM_WORLD, &ts);
    TSSetDM(ts, da);
    TSSetType(ts, TSPSEUDO);
    // Create matrix
    Mat A;
    DMCreateMatrix(da, MATAIJ, &A);
    TSSetRHSJacobian(ts, A, A, RHSMatrix, NULL);
    TSSetRHSFunction(ts, NULL, TSComputeRHSFunctionLinear, NULL);
    FormInitialSolution(da, u, NULL);
    // Set parameters of the time scheme
    int maxsteps = 1000;
    double ftime = 1.0;
    double dt = 0.01;
    TSSetDuration(ts, maxsteps, ftime);
    TSSetInitialTimeStep(ts, 0.0, dt);
    TSMonitorSet(ts, MyTSMonitor, NULL, NULL);
    TSSetFromOptions(ts);
    // Solve the unsteady problem
    TSSolve(ts, u, &ftime);
    // Get the number of time steps
    int steps;
    TSGetTimeStepNumber(ts, &steps);
    PetscPrintf(PETSC_COMM_WORLD, "Steps count %D\n", steps);
    MatDestroy(&A);
    VecDestroy(&u);
    TSDestroy(&ts);
    DMDestroy(&da);
    PetscFinalize();
    return 0;
}

Unlike the previous examples, the TS solver is used as follows:

Listing 9.44.

TS ts;
TSCreate(PETSC_COMM_WORLD, &ts);
TSSetDM(ts, da);

The TS module is intended to solve systems of ordinary differential equations (ODEs) written as u_t = F(u, t), which are obtained after the discretization of unsteady problems in space. The module provides solvers based on the forward and backward Euler methods, among others. We can also solve stationary equations F(u) = 0 and differential-algebraic equations F(t, u, u̇) = 0. The solution method is selected using the command shown in Listing 9.45.

Listing 9.45.

TSSetType(ts, TSPSEUDO);

where TSPSEUDO selects pseudo time-stepping. Instead of this method, we can use, for example:
– TSEULER – the solver based on the forward Euler method;
– TSSUNDIALS – the solver based on the SUNDIALS package;
– TSBEULER – the solver based on the backward Euler method.
A runtime alternative is shown below.
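Because TSSetFromOptions is called later in the program, the time integrator can also be selected at run time with the standard PETSc option -ts_type, for example (assuming the executable is named ./unsteady; the actual name is not shown in this excerpt):

$ mpirun -np 4 ./unsteady -ts_type beuler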

Since TS is implemented on the basis of the nonlinear solver SNES, it is handled similarly to SNES. We must set the Jacobian, the right-hand side function, and the initial approximation as shown below in Listing 9.46.

Listing 9.46.

Mat A;
DMCreateMatrix(da, MATAIJ, &A);
TSSetRHSJacobian(ts, A, A, RHSMatrix, NULL);
TSSetRHSFunction(ts, NULL, TSComputeRHSFunctionLinear, NULL);
FormInitialSolution(da, u, NULL);

The Jacobian is defined by the following function according to the problem statement:

Listing 9.47.

PetscErrorCode RHSMatrix(TS ts, PetscReal t, Vec U, Mat *J, Mat *Jpre,
                         MatStructure *str, void *ctx) {
    DM da;
    TSGetDM(ts, &da);
    int mx, my, xs, ys, xm, ym;
    DMDAGetInfo(da, NULL, &mx, &my, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
    DMDAGetCorners(da, &xs, &ys, 0, &xm, &ym, 0);
    double hx = 1.0 / (mx - 1);
    double hy = 1.0 / (my - 1);
    MatStencil row, col[5];
    PetscScalar val[5];
    int colCount = 0;
    for (int j = ys; j < ys + ym; j++) {
        for (int i = xs; i < xs + xm; i++) {
            colCount = 0;
            row.i = i;
            row.j = j;
            if (i == 0 || j == 0 || i == mx - 1 || j == my - 1) {
                col[colCount].i = i;
                col[colCount].j = j;
                val[colCount++] = 1.0;
            } else {
                col[colCount].i = i - 1;
                col[colCount].j = j;
                val[colCount++] = 1.0/hx/hx;
                col[colCount].i = i + 1;
                col[colCount].j = j;
                val[colCount++] = 1.0/hx/hx;
                col[colCount].i = i;
                col[colCount].j = j - 1;
                val[colCount++] = 1.0/hy/hy;
                col[colCount].i = i;
                col[colCount].j = j + 1;
                val[colCount++] = 1.0/hy/hy;
                col[colCount].i = i;
                col[colCount].j = j;
                val[colCount++] = -2 * (1.0/hx/hx + 1.0/hy/hy);
            }
            MatSetValuesStencil(*Jpre, 1, &row, colCount, col, val, INSERT_VALUES);
        }
    }
    MatAssemblyBegin(*Jpre, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(*Jpre, MAT_FINAL_ASSEMBLY);
    if (*J != *Jpre) {
        MatAssemblyBegin(*J, MAT_FINAL_ASSEMBLY);
        MatAssemblyEnd(*J, MAT_FINAL_ASSEMBLY);
    }
    PetscFunctionReturn(0);
}
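To summarize the code above in formula form (our reading of the listing, not an equation from the text): at interior nodes the assembled matrix is the standard five-point discrete Laplacian, so the right-hand side of the semi-discrete system u_t = F(u, t) is
\[
(F(y))_{i,j} = \frac{y_{i-1,j} - 2 y_{i,j} + y_{i+1,j}}{h_x^{2}}
             + \frac{y_{i,j-1} - 2 y_{i,j} + y_{i,j+1}}{h_y^{2}},
\]
while at boundary nodes the row contains only a unit diagonal entry; since the initial values on the boundary are zero, this keeps the discrete solution consistent with the Dirichlet condition (9.14).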

The function that sets the initial condition is shown in Listing 9.48.

Listing 9.48.

PetscErrorCode FormInitialSolution(DM da, Vec U, void* ptr) {
    // Mesh info
    int mx, my, xs, ys, xm, ym;
    DMDAGetInfo(da, NULL, &mx, &my, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
    DMDAGetCorners(da, &xs, &ys, 0, &xm, &ym, 0);
    double hx = 1.0 / (mx - 1);
    double hy = 1.0 / (my - 1);
    // Vector
    PetscScalar **u;
    DMDAVecGetArray(da, U, &u);
    double x, y, r;
    for (int j = ys; j < ys + ym; j++) {
        y = j*hy;
        for (int i = xs; i < xs + xm; i++) {
            x = i*hx;
            r = PetscSqrtScalar((x - .5)*(x - .5) + (y - .5)*(y - .5));
            if (r < 0.125) {
                u[j][i] = PetscExpScalar(-30 * r * r * r);
            } else {
                u[j][i] = 0.0;
            }
        }
    }
    DMDAVecRestoreArray(da, U, &u);
    PetscFunctionReturn(0);
}

Then the time parameters are set as shown in Listing 9.49 below.

Listing 9.49.

int maxsteps = 1000;
double ftime = 1.0;
double dt = 0.01;
TSSetDuration(ts, maxsteps, ftime);
TSSetInitialTimeStep(ts, 0.0, dt);
TSMonitorSet(ts, MyTSMonitor, NULL, NULL);
TSSetFromOptions(ts);
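Since TSSetFromOptions is called after these defaults, the time step and the number of steps can also be overridden at run time. In the PETSc release this chapter appears to target, the relevant options are -ts_dt, -ts_max_steps, and -ts_final_time (option names may differ in other PETSc versions; the executable name below is again assumed):

$ mpirun -np 4 ./unsteady -ts_dt 0.005 -ts_max_steps 2000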

The function MyTSMonitor allows the result to be written to a file at each time level, as shown in Listing 9.50.

Listing 9.50.

PetscErrorCode MyTSMonitor(TS ts, PetscInt step, PetscReal ptime, Vec v, void *ptr) {
    PetscReal norm;
    VecNorm(v, NORM_2, &norm);
    PetscPrintf(PETSC_COMM_WORLD, "timestep %D: time %G, solution norm %G\n",
                step, ptime, norm);
    DM da;
    TSGetDM(ts, &da);
    std::stringstream ss;
    ss