Lecture Notes in Computer Science    1808
Edited by G. Goos, J. Hartmanis and J. van Leeuwen

Springer
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Singapore Tokyo

Santosh Pande   Dharma P. Agrawal (Eds.)

Compiler Optimizations for Scalable Parallel Systems
Languages, Compilation Techniques, and Run Time Systems
Series Editors
Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Volume Editors
Santosh Pande
Georgia Institute of Technology, College of Computing
801 Atlantic Drive, Atlanta, GA 30332, USA
E-mail: [email protected]

Dharma P. Agrawal
University of Cincinnati, Department of ECECS
P.O. Box 210030, Cincinnati, OH 45221-0030, USA
E-mail: [email protected]

Cataloging-in-Publication Data applied for

Die Deutsche Bibliothek - CIP-Einheitsaufnahme
Compiler optimizations for scalable parallel systems : languages, compilation techniques, and run time systems / Santosh Pande ; Dharma P. Agrawal (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 2001
(Lecture notes in computer science ; 1808)
ISBN 3-540-41945-4

CR Subject Classification (1998): D.3, D.4, D.1.3, C.2, F.1.2, F.3

ISSN 0302-9743
ISBN 3-540-41945-4 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

Springer-Verlag Berlin Heidelberg New York
a member of BertelsmannSpringer Science+Business Media GmbH
http://www.springer.de

© Springer-Verlag Berlin Heidelberg 2001
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Boller Mediendesign
Printed on acid-free paper
SPIN: 10720238    06/3142    5 4 3 2 1 0
Preface

Santosh Pande¹ and Dharma P. Agrawal²

¹ College of Computing, 801 Atlantic Drive, Georgia Institute of Technology, Atlanta, GA 30332
² Department of ECECS, ML 0030, PO Box 210030, University of Cincinnati, Cincinnati, OH 45221-0030
We are very pleased to publish this monograph on Compiler Optimizations for Scalable Distributed Memory Systems. Distributed memory systems offer a challenging model of computing and pose fascinating compiler optimization problems ranging from language design to run time systems. The research done in this area is therefore foundational to many challenges, from memory hierarchy optimizations to communication optimizations, encountered in both stand-alone and distributed systems. It is with this motivation that we present a compendium of the research done in this area in the form of this monograph.

The monograph is divided into five sections: section one deals with languages, section two with analysis, section three with communication optimizations, section four with code generation, and section five with run time systems. In the introduction we present a detailed summary of each of the chapters in these sections.

We would like to express our sincere thanks to the many people who contributed to this monograph. First, we thank all the authors for their excellent contributions, which make this monograph one of a kind; as readers will see, these contributions make it thorough and insightful (for the advanced reader) as well as highly readable and pedagogic (for students and beginners). Next, we thank our graduate student Haixiang He for all his help in organizing this monograph and for solving LaTeX problems. Finally, we express our sincere thanks to the LNCS editorial staff at Springer-Verlag for putting up with our schedule and for all their help and understanding. Without their invaluable help we would not have been able to put this monograph into its final shape. We sincerely hope that readers find the monograph truly useful in their work, be it further research or practice.
Introduction

Santosh Pande¹ and Dharma P. Agrawal²

¹ College of Computing, 801 Atlantic Drive, Georgia Institute of Technology, Atlanta, GA 30332
² Department of ECECS, ML 0030, PO Box 210030, University of Cincinnati, Cincinnati, OH 45221-0030
1. Compiling for Distributed Memory Multiprocessors

1.1 Motivation

Distributed memory parallel systems offer elegant architectural solutions for highly parallel, data intensive applications, primarily because:

– They are highly scalable. These systems currently come in a variety of architectures such as 3D torus, mesh, and hypercube that allow extra processors to be added should the computing demands increase. Scalability is an important issue especially for high performance servers such as parallel video servers, data mining, and imaging applications.
– With an increase in parallelism there is insignificant degradation in memory performance, since memories are isolated and decoupled from direct accesses by processors. This is especially good for data intensive applications such as parallel databases and data mining that demand considerable memory bandwidth. In contrast, in shared memory systems the memory bandwidth may not match the increase in the number of processors; in fact, overall system performance may degrade due to increased memory contention, which in turn jeopardizes the scalability of an application beyond a point.
– Spatial parallelism in large applications such as fluid flow, weather modeling, and image processing, in which the problem domains are perfectly decomposable, is easy to map onto these systems. The achievable speedups are almost linear, primarily due to fast accesses to data maintained in local memory.
– Interprocessor communication speeds and bandwidths have dramatically improved due to very fast routing. The performance ratings offered by newer distributed memory systems have improved, although they are not comparable to shared memory systems in terms of Mflops.
– Medium grained parallelism can be effectively mapped onto newer systems such as the Meiko CS-2, Cray T3D, IBM SP1/SP2, and EM4 due to a low ratio of communication to computation speeds. The communication bottleneck has decreased compared with earlier systems, and this has opened up the parallelization of newer applications.

1.2 Complexity

However, programming distributed memory systems remains very complex. Most current solutions mandate that the users of such machines manage processor allocation, data distribution, and inter-processor communication in their parallel programs themselves. Programming these systems to achieve the desired high performance is very difficult. In spite of strong demand from programmers, the solutions currently provided by (semi-automatic) parallelizing compilers are rather constrained. As a matter of fact, for many applications the only practical success has been through hand parallelization of codes with communication managed through MPI. In spite of a tremendous amount of research in this area, the applicability of many compiler techniques remains rather limited and the achievable performance enhancement remains less than satisfactory.

The main reason for the restrictive solutions offered by parallelizing compilers is the enormous complexity of the problem. Orchestrating computation and communication by suitable analysis, and optimizing their performance through judicious use of the underlying architectural features, demands true sophistication on the part of the compiler. It is not even clear whether these complex problems are solvable within the realm of compiler analysis and sophisticated restructuring transformations. Perhaps they are much deeper in nature and go right to the heart of the design of parallel algorithms for such an underlying model of computation. The primary purpose of this monograph is to provide insight into current approaches and to point to potentially open problems that could have an impact. The monograph is organized in terms of issues ranging from programming paradigms (languages) to effective run time systems.

1.3 Outline of the Monograph

Language design is largely a matter of legacy, and language design for distributed memory systems is no exception to the rule. In Section I of the monograph we examine three important approaches (one imperative, one object-oriented, and one functional) in this domain that have made a significant impact. The first chapter, on HPF 2.0, provides an in-depth view of a data parallel language which evolved from Fortran 90. The authors present HPF 1.0 features such as the BLOCK distribution and the FORALL loop as well as new features in HPF 2.0 such as the INDIRECT distribution and the ON directive. They also point to the complementary nature of MPI and HPF and discuss features such as the EXTRINSIC interface mechanism. HPF 2.0 has been a major commercial success, with vendors such as the Portland Group and Applied Parallel Research providing highly optimizing compiler support which generates message passing code. Many research issues, especially those related to supporting irregular computation, could prove valuable to domains such as sparse matrix computation.
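To make the data mapping idea concrete, the following minimal C++ sketch (ours, not HPF syntax and not taken from the chapter; the function names are illustrative) shows the ownership functions that BLOCK and CYCLIC(k) distributions of a one-dimensional array imply:

    #include <cstdio>

    // Illustrative sketch (not HPF syntax): the ownership mappings implied by
    // BLOCK and CYCLIC(k) distributions of a 1-D array of size n over p
    // processors.  All names here are our own.

    // BLOCK: contiguous chunks of size ceil(n/p); element i lives on one processor.
    int blockOwner(int i, int n, int p) {
        int b = (n + p - 1) / p;          // block size
        return i / b;
    }

    // CYCLIC(k): blocks of k elements dealt out round-robin to the p processors.
    int cyclicOwner(int i, int k, int p) {
        return (i / k) % p;
    }

    int main() {
        const int n = 16, p = 4;
        for (int i = 0; i < n; ++i)
            std::printf("i=%2d  BLOCK owner=%d  CYCLIC(2) owner=%d\n",
                        i, blockOwner(i, n, p), cyclicOwner(i, 2, p));
        return 0;
    }

The later decisions a data parallel compiler makes, such as which processor executes a statement and what must be communicated, are ultimately phrased in terms of such ownership mappings.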
The next chapter, on Sisal 90, provides a functional view of implicit parallelism specification and mapping. A shared memory implementation of Sisal is discussed, which involves optimizations such as update-in-place and copy elimination. Sisal 90 and a distributed memory implementation which uses message passing are also discussed. Finally, multi-threaded implementations of Sisal are described, with a focus on multi-threaded optimizations. Newer optimizations which perform memory management in hardware through dynamically scheduled multi-threaded code should prove beneficial for the performance of functional languages (including Sisal), which have an elegant programming model.

The next chapter, on HPC++, provides an object-oriented view as well as details of a library and compiler strategy to support the HPC++ Level 1 release. The authors discuss interesting features related to multi-threading, barrier synchronization, and remote procedure invocation. They also discuss library features that are especially useful for scientific programming. Extending this work to newer portable languages such as Java is currently an active area of research.

We also have a chapter on concurrency models of object-oriented paradigms. The authors specifically address a problem called the inheritance anomaly, which arises when synchronization constraints are implemented within the methods of a class and an attempt is made to specialize the methods through inheritance mechanisms. They propose a solution to this problem by separating the specification of synchronization from the method specification: the synchronization construct is not part of the method body and is handled separately. It will be interesting to study compiler optimizations on this model related to strength reduction of barriers, and issues such as data partitioning vs. barrier synchronization.

In Section II of the monograph, we focus on various analysis techniques. Parallelism detection is very important, and the first chapter presents a very interesting comparative study of the loop parallelization algorithms by Allen and Kennedy, Wolf and Lam, Darte and Vivien, and Feautrier. It provides comparisons in terms of their performance (ability to parallelize as well as the quality of the schedules generated for code generation) and their complexity. The comparison also focuses on the type of dependence information available. Further extensions could involve run-time parallelization given more precise dependence information.

Array data-flow is of utmost importance in optimization, both sequential and parallel. The first chapter on array data-flow analysis examines this problem in detail and presents techniques for exact as well as approximate data-flow analysis. The exact solution is shown for static control programs. The authors also show applications to interprocedural cases and to some important parallelization techniques such as privatization. Some interesting extensions could involve run-time data-flow analysis.
The next chapter discusses interprocedural analysis based on guarded (predicated) array regions. This is a framework based on path-sensitive predicated data-flow which provides summary information. The authors also show how their work can be applied to improve array privatization based on symbolic propagation. Extending this to newer object-oriented languages such as Java (which have a clean class hierarchy and inheritance model) could be interesting, since such programs really need this kind of summary MOD information for performing any optimization.

We finally present a very important analysis and optimization technique: array privatization. Array privatization removes memory-related dependences, which has a significant impact on communication optimization, loop scheduling, etc. The authors present a demand-driven data-flow formulation of the problem; an algorithm which performs single-pass propagation of symbolic array expressions is also presented. This comprehensive framework, implemented in the Polaris compiler, is making a significant impact in improving many other related optimizations such as load balancing and communication.
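As a concrete illustration of what privatization buys (our own minimal C++ example, not taken from the chapter), consider a loop whose only cross-iteration dependences are caused by a shared scratch array; giving each iteration its own copy removes them:

    #include <vector>

    // Before privatization: all iterations reuse the same scratch array 'tmp',
    // so each iteration's writes conflict with the next iteration's reads and
    // writes (anti and output dependences), even though no value computed in
    // one iteration is used by another.  The loop cannot run as a DOALL.
    void smooth_shared(const std::vector<double>& a, std::vector<double>& b, int n) {
        double tmp[3];                                  // shared scratch
        for (int i = 1; i < n - 1; ++i) {
            tmp[0] = a[i - 1]; tmp[1] = a[i]; tmp[2] = a[i + 1];
            b[i] = (tmp[0] + tmp[1] + tmp[2]) / 3.0;
        }
    }

    // After privatization: each iteration gets its own copy of the scratch
    // array, the memory-related dependences disappear, and the loop is a DOALL.
    void smooth_private(const std::vector<double>& a, std::vector<double>& b, int n) {
        for (int i = 1; i < n - 1; ++i) {
            double tmp[3] = { a[i - 1], a[i], a[i + 1] };   // private scratch
            b[i] = (tmp[0] + tmp[1] + tmp[2]) / 3.0;
        }
    }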
The next section is focused on communication optimization. Communication optimization can be achieved through data (and iteration space) distribution, statically or dynamically. These approaches further divide into data and code alignment, or simply iteration space transformations such as tiling. Communication can also be optimized in data-parallel programs through array region analysis. Finally, one can tolerate some communication latency through novel techniques such as multi-threading. We have chapters which cover this broad range of topics in depth.

The first chapter in this section focuses on tiling for cache-coherent multicomputers. This work derives optimal tile parameters for minimal communication in loops with affine index expressions. The authors introduce the notion of data footprints and tile the iteration spaces so that the volume of communication is minimized. They develop an important lattice-theoretic framework to precisely determine the sizes of data footprints, which is valuable not only in tiling but in many array distribution transformations.
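For a flavor of the transformation itself, here is a generic C++ tiling sketch (ours, not the chapter's algorithm); the tile size B stands in for the parameters the chapter derives from the footprint analysis:

    // Generic loop tiling sketch: the tiled nest visits B x B blocks of the
    // iteration space so that the data footprint of one tile is small and
    // stays in local memory / cache.
    const int N = 1024;
    const int B = 64;                       // tile size (illustrative)

    void copy_transpose_tiled(const double (&a)[N][N], double (&b)[N][N]) {
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                // one tile: iterations (ii..ii+B-1) x (jj..jj+B-1)
                for (int i = ii; i < ii + B && i < N; ++i)
                    for (int j = jj; j < jj + B && j < N; ++j)
                        b[i][j] = a[j][i];
    }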
The next two chapters deal with the important problem of communication-free loop partitioning. The second chapter in this section focuses on comparing different methods of achieving communication-free partitioning for DOALL loops. It discusses several variants of the communication-free partitioning problem, involving duplication or non-duplication of data, load balancing of the iteration space, and aspects such as statement-level vs. loop-level partitioning. Aspects such as trading parallelism to avoid inter-loop data distribution are also touched upon. Extending these techniques to broader classes of DOALL loops could enhance their applicability.

The next chapter, by Pingali et al., proposes a very interesting framework which first determines a set of constraints on data and loop iteration placement. It then determines which constraints should be left unsatisfied, relaxing an overconstrained system to find a solution with a large amount of parallelism. Finally, the remaining constraints are solved for the data and code distribution. This systematic linear algebraic framework improves over many ad-hoc loop partitioning approaches.

These approaches trade parallelism in codes that allow the issues of parallelism and communication to be decoupled by relaxing an appropriate constraint of the problem. However, for many important problems, such as image processing applications, such a relaxation is not possible; one must instead resort to a different partitioning solution based on the relative costs of communication and computation. The next chapter proposes a new approach for such problems: it partitions the iteration space by determining directions which maximally cover the communication while minimally trading parallelism. This approach allows the mapping of general medium grained DOALL loops. However, the communication resulting from this iteration space partitioning cannot be easily aggregated without sophisticated 'pack'/'unpack' mechanisms at the send/receive ends. Such extensions are desirable, since aggregating communication has as significant an impact as reducing its volume.

Static data distribution and alignment typically solve the communication problem on a loop nest by loop nest basis, but rarely at the scope of a whole procedure. Most inter-loop-nest and interprocedural boundaries require dynamic data redistribution. Banerjee et al. develop techniques that automatically determine which data partitions are most beneficial over specific sections of the program by accounting for the redistribution overhead. They determine split points and phases of communication, and redistribution is performed at the split points.

When communication must take place, it should be optimized, and any redundancies must be captured and eliminated. In the next chapter Manish Gupta proposes a comprehensive approach for performing global (interprocedural) communication optimizations such as vectorization, PRE, coalescing, and hoisting. Such an interprocedural approach to communication optimization is highly profitable and substantially improves performance. Extending this work to irregular communication could be interesting.
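The effect of message vectorization, for example, can be sketched as follows (our own C++ illustration with a hypothetical send_to primitive, not the chapter's framework): per-element communication inside a loop is hoisted and combined into one aggregate transfer.

    #include <vector>

    // Hypothetical runtime primitive (illustration only).
    void send_to(int proc, const double* buf, int count);

    // Naive code: one small message per iteration.
    void exchange_elementwise(const std::vector<double>& local, int neighbor, int m) {
        for (int i = 0; i < m; ++i)
            send_to(neighbor, &local[i], 1);       // m tiny messages
    }

    // After message vectorization: the loop's communication is hoisted and
    // combined into a single message covering the whole needed section.
    void exchange_vectorized(const std::vector<double>& local, int neighbor, int m) {
        send_to(neighbor, local.data(), m);        // 1 message of m elements
    }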
Finally, we present a multi-threaded approach which can hide communication latency. Two representative applications, bitonic sort and FFT, are chosen, and using fine grained multi-threading on EM-X it is shown that multi-threading can substantially help in overlapping computation with communication to hide latencies of up to 35%. These methods could be especially useful for irregular computation.

The final phase of compiling for distributed memory systems involves solving many code generation problems: determining the communication to be generated and performing address calculation to map global references to local ones. The next section deals with these issues. The first chapter presents structures and techniques for communication generation. The authors focus on issues such as flexible computation partitioning (going beyond the owner-computes rule), communication adaptation based on manipulating integer sets through abstract inequalities, and control flow simplification based on these. One good property of this work is that it can work with many different front ends (not just data parallel languages), and the code generator has more opportunities to perform low level optimizations due to the simplified control flow.
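The owner-computes rule mentioned above can be pictured with a small SPMD sketch in C++ (our own illustration; blockOwner mirrors the earlier sketch and fetch is a hypothetical helper): every processor runs the same loop, but a guard restricts it to the iterations whose left-hand side it owns, and the non-local right-hand-side values are what the compiler must turn into generated communication.

    #include <vector>

    // Block ownership mapping (same idea as the earlier sketch).
    int blockOwner(int i, int n, int p) { return i / ((n + p - 1) / p); }

    // Hypothetical helper: returns b[i], fetching it from its owner if non-local.
    double fetch(const std::vector<double>& b, int i);

    // SPMD owner-computes sketch: processor 'me' executes only the iterations
    // whose target element a[i] it owns.
    void update(std::vector<double>& a, const std::vector<double>& b,
                int n, int p, int me) {
        for (int i = 0; i < n - 1; ++i) {
            if (blockOwner(i, n, p) != me) continue;   // owner-computes guard
            a[i] = 0.5 * (b[i] + fetch(b, i + 1));     // b[i+1] may be non-local
        }
    }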
The second chapter discusses basis-vector-based address calculation mechanisms for efficient traversal of partitioned data. While one important code generation issue is communication generation, another very important one is mapping the global address space to local address spaces efficiently. The problem is complicated by data distributions and access strides. Ramanujam et al. present closed form expressions for the basis vectors for several cases and, using these closed form expressions, derive a non-unimodular linear transformation.
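To see why local address generation is nontrivial, the following C++ sketch (ours, not the chapter's algorithm) shows the standard global-to-local mapping for a CYCLIC(k), i.e. block-cyclic, distribution; the chapter tackles the harder problem of enumerating, in order, the local addresses touched by a strided array section under such a mapping.

    #include <cstdio>

    // Global index g of an array distributed CYCLIC(k) over p processors:
    // blocks of k consecutive elements are dealt round-robin to the processors.
    struct Local { int proc, offset; };

    Local toLocal(int g, int k, int p) {
        int block = g / k;                       // which block g falls in
        int proc  = block % p;                   // block is dealt to this processor
        int local = (block / p) * k + g % k;     // offset within that processor
        return { proc, local };
    }

    int main() {
        // Elements 0..15, CYCLIC(2) over 4 processors.
        for (int g = 0; g < 16; ++g) {
            Local l = toLocal(g, 2, 4);
            std::printf("global %2d -> proc %d, local %d\n", g, l.proc, l.offset);
        }
        // For a section A(l:u:s) the compiler must enumerate, per processor,
        // exactly the local offsets this mapping produces: the address
        // sequence generation problem treated in the chapter.
        return 0;
    }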
The final section is on supporting task parallelism and dynamic data structures; it also presents a run-time system to manage irregular computation. The first chapter, by Darbha et al., presents a task scheduling approach that is optimal for many practical cases. The authors evaluate its performance for many practical applications such as the Bellman-Ford algorithm, Cholesky decomposition, and the systolic algorithm, and show that the schedules generated by their algorithm are optimal for some cases and near optimal for most others. With HPF 2.0 supporting task parallelism, this could open up many new application domains.

The next two chapters describe language support for dynamic data structures such as pointers in a distributed address space. Gupta describes several extensions to C, with declarations such as TREE, ARRAY, and MESH to declare dynamic data structures. He then describes name generation and distribution strategies, and finally support for both regular and irregular dynamic structures. The second chapter, by Rogers et al., presents the approach followed in their Olden project, which uses a distributed heap. Remote access is handled by software caching or computation migration, and the selection between these mechanisms is made automatically through a compile time heuristic. They also provide a data layout annotation called local path lengths which allows programmers to give hints about the expected data layout, thereby fixing the choice of mechanism. Both of these chapters provide highly useful insights into supporting dynamic data structures, which are very important for the scalable domains of computation supported by these machines, and should have a significant impact on future scalable applications.

Finally, we present a run-time system called CHAOS which provides efficient support for irregular computations. Due to indirection in many sparse matrix computations, the communication patterns in these applications are unknown at compile time. Indirection patterns have to be preprocessed, and the sets of elements to be sent and received by each processor precomputed, in order to optimize communication. In this work, the authors provide details of efficient run time support for an inspector-executor model.
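A minimal sketch of the inspector half of this model (our own simplification in C++, not the CHAOS API): for a loop such as x[i] += y[idx[i]], the inspector scans the indirection array once and precomputes which elements of y must be fetched from which processor, so that the executor can exchange them in a few aggregate messages before running the loop locally.

    #include <set>
    #include <vector>

    // Block ownership mapping (same idea as the earlier sketch); x and y are
    // assumed to be BLOCK distributed arrays of length n.
    int blockOwner(int i, int n, int p) { return i / ((n + p - 1) / p); }

    // Inspector: for processor 'me', record per remote processor the set of
    // indices of y that will be needed by the loop x[i] += y[idx[i]].  The
    // executor phase (not shown) exchanges these lists, communicates the
    // values, and then executes the loop using local copies.
    std::vector<std::set<int>> inspector(const std::vector<int>& idx,
                                         int n, int p, int me) {
        std::vector<std::set<int>> needed(p);      // needed[q] = indices owned by q
        for (int i = 0; i < (int)idx.size(); ++i) {
            if (blockOwner(i, n, p) != me) continue;   // only my iterations
            int owner = blockOwner(idx[i], n, p);
            if (owner != me) needed[owner].insert(idx[i]);
        }
        return needed;
    }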
1.4 Future Directions

The two important bottlenecks for the use of distributed memory systems are the limited application domains and performance that is less than satisfactory. The main bottleneck seems to be handling communication, so efficient solutions must be developed. Application domains beyond regular communication can be handled by supporting a general run-time communication model. This run-time communication model must hide latency and should give the compiler sufficient flexibility to defer the hard decisions to run time, yet allow static optimizations such as communication motion. One of the big problems compilers face is that estimating the cost of communication is almost impossible; they can, however, gauge the criticality (or relative importance) of communication. Developing such a model will allow compilers to deal more effectively with the relative importance of computation versus communication, and of one communication versus another.

Probably the best reason to use distributed memory systems is to benefit from scalability, even though the application domains and performance might be somewhat weaker. Thus, new research must be done in scalable code generation. In other words, as the size of the problem and the number of processors increase, should the data and code partition change or should it remain the same? What code generation issues are related to this? How could one potentially handle the "hot spots" that inevitably arise (although at much lower levels than in shared memory systems)? Can one benefit from the communication model and dynamic data ownership discussed above?
Table of Contents

Preface
Santosh Pande and Dharma P. Agrawal .......... V

Introduction
Santosh Pande and Dharma P. Agrawal .......... XXI
1   Compiling for Distributed Memory Multiprocessors .......... XXI
    1.1 Motivation .......... XXI
    1.2 Complexity .......... XXII
    1.3 Outline of the Monograph .......... XXII
    1.4 Future Directions .......... XXVII
Section I: Languages

Chapter 1. High Performance Fortran 2.0
Ken Kennedy and Charles Koelbel .......... 3
1   Introduction .......... 3
2   History and Overview of HPF .......... 3
3   Data Mapping .......... 7
    3.1 Basic Language Features .......... 7
    3.2 Advanced Topics .......... 13
4   Data Parallelism .......... 18
    4.1 Basic Language Features .......... 19
    4.2 Advanced Topics .......... 29
5   Task Parallelism .......... 34
    5.1 EXTRINSIC Procedures .......... 34
    5.2 The TASK REGION Directive .......... 37
6   Input and Output .......... 39
7   Summary and Future Outlook .......... 41
Chapter 2. The Sisal Project: Real World Functional Programming
Jean-Luc Gaudiot, Tom DeBoni, John Feo, Wim Böhm, Walid Najjar,
and Patrick Miller .......... 45
1   Introduction .......... 45
2   The Sisal Language: A Short Tutorial .......... 46
3   An Early Implementation: The Optimizing Sisal Compiler .......... 49
    3.1 Update in Place and Copy Elimination .......... 49
    3.2 Build in Place .......... 50
    3.3 Reference Counting Optimization .......... 51
    3.4 Vectorization .......... 51
    3.5 Loop Fusion, Double Buffering Pointer Swap, and Inversion .......... 51
4   Sisal90 .......... 53
    4.1 The Foreign Language Interface .......... 54
5   A Prototype Distributed-Memory SISAL Compiler .......... 58
    5.1 Base Compiler .......... 59
    5.2 Rectangular Arrays .......... 59
    5.3 Block Messages .......... 60
    5.4 Multiple Alignment .......... 60
    5.5 Results .......... 61
    5.6 Further Work .......... 62
6   Architecture Support for Multithreaded Execution .......... 62
    6.1 Blocking and Non-blocking Models .......... 63
    6.2 Code Generation .......... 64
    6.3 Summary of Performance Results .......... 68
7   Conclusions and Future Research .......... 69
Chapter 3. HPC++ and the HPC++Lib Toolkit
Dennis Gannon, Peter Beckman, Elizabeth Johnson, Todd Green,
and Mike Levine .......... 73
1   Introduction .......... 73
2   The HPC++ Programming and Execution Model .......... 74
    2.1 Level 1 HPC++ .......... 75
    2.2 The Parallel Standard Template Library .......... 76
    2.3 Parallel Iterators .......... 77
    2.4 Parallel Algorithms .......... 77
    2.5 Distributed Containers .......... 78
3   A Simple Example: The Spanning Tree of a Graph .......... 78
4   Multi-threaded Programming .......... 82
    4.1 Synchronization .......... 84
    4.2 Examples of Multi-threaded Computations .......... 92
5   Implementing the HPC++ Parallel Loop Directives .......... 96
6   Multi-context Programming and Global Pointers .......... 99
    6.1 Remote Function and Member Calls .......... 101
    6.2 Using Corba IDL to Generate Proxies .......... 103
7   The SPMD Execution Model .......... 105
    7.1 Barrier Synchronization and Collective Operations .......... 105
8   Conclusion .......... 106
Chapter 4. A Concurrency Abstraction Model for Avoiding Inheritance
Anomaly in Object-Oriented Programs
Sandeep Kumar and Dharma P. Agrawal .......... 109
1   Introduction .......... 109
2   Approaches to Parallelism Specification .......... 113
    2.1 Issues in Designing a COOPL .......... 113
    2.2 Issues in Designing Libraries .......... 114
3   What Is the Inheritance Anomaly? .......... 115
    3.1 State Partitioning Anomaly (SPA) .......... 116
    3.2 History Sensitiveness of Acceptable States Anomaly (HSASA) .......... 118
    3.3 State Modification Anomaly (SMA) .......... 118
    3.4 Anomaly A .......... 119
    3.5 Anomaly B .......... 120
4   What Is the Reusability of Sequential Classes? .......... 120
5   A Framework for Specifying Parallelism .......... 121
6   Previous Approaches .......... 122
7   The Concurrency Abstraction Model .......... 123
8   The CORE Language .......... 126
    8.1 Specifying a Concurrent Region .......... 126
    8.2 Defining an AC .......... 126
    8.3 Defining a Parallel Block .......... 127
    8.4 Synchronization Schemes .......... 129
9   Illustrations .......... 129
    9.1 Reusability of Sequential Classes .......... 130
    9.2 Avoiding the Inheritance Anomaly .......... 131
10  The Implementation Approach .......... 133
11  Conclusions and Future Directions .......... 134
Section II: Analysis

Chapter 5. Loop Parallelization Algorithms
Alain Darte, Yves Robert, and Frédéric Vivien .......... 141
1   Introduction .......... 141
2   Input and Output of Parallelization Algorithms .......... 142
    2.1 Input: Dependence Graph .......... 143
    2.2 Output: Nested Loops .......... 144
3   Dependence Abstractions .......... 145
    3.1 Dependence Graphs and Distance Sets .......... 145
    3.2 Polyhedral Reduced Dependence Graphs .......... 147
    3.3 Definition and Simulation of Classical Dependence Representations .......... 148
4   Allen and Kennedy's Algorithm .......... 149
    4.1 Algorithm .......... 150
    4.2 Power and Limitations .......... 151
5   Wolf and Lam's Algorithm .......... 152
    5.1 Purpose .......... 153
    5.2 Theoretical Interpretation .......... 153
    5.3 The General Algorithm .......... 154
    5.4 Power and Limitations .......... 155
6   Darte and Vivien's Algorithm .......... 156
    6.1 Another Algorithm Is Needed .......... 156
    6.2 Polyhedral Dependences: A Motivating Example .......... 158
    6.3 Illustrating Example .......... 160
    6.4 Uniformization Step .......... 162
    6.5 Scheduling Step .......... 162
    6.6 Schematic Explanations .......... 165
    6.7 Power and Limitations .......... 166
7   Feautrier's Algorithm .......... 167
8   Conclusion .......... 169
Chapter 6. Array Dataflow Analysis
Paul Feautrier .......... 173
1   Introduction .......... 173
2   Exact Array Dataflow Analysis .......... 176
    2.1 Notations .......... 176
    2.2 The Program Model .......... 176
    2.3 Data Flow Analysis .......... 181
    2.4 Summary of the Algorithm .......... 189
    2.5 Related Work .......... 190
3   Approximate Array Dataflow Analysis .......... 190
    3.1 From ADA to FADA .......... 191
    3.2 Introducing Parameters .......... 195
    3.3 Taking Properties of Parameters into Account .......... 197
    3.4 Eliminating Parameters .......... 201
    3.5 Related Work .......... 202
4   Analysis of Complex Statements .......... 204
    4.1 What Is a Complex Statement .......... 204
    4.2 ADA in the Presence of Complex Statements .......... 206
    4.3 Procedure Calls as Complex Statements .......... 206
5   Applications of ADA and FADA .......... 208
    5.1 Program Comprehension and Debugging .......... 209
    5.2 Parallelization .......... 211
    5.3 Array Expansion and Array Privatization .......... 212
6   Conclusions .......... 214
A   Appendix: Mathematical Tools .......... 214
    A.1 Polyhedra and Polytopes .......... 214
    A.2 Z-modules .......... 215
    A.3 Z-polyhedra .......... 216
    A.4 Parametric Problems .......... 216
Chapter 7. Interprocedural Analysis Based on Guarded Array Regions
Zhiyuan Li, Junjie Gu, and Gyungho Lee .......... 221
1   Introduction .......... 221
2   Preliminary .......... 223
    2.1 Traditional Flow-Insensitive Summaries .......... 223
    2.2 Array Data Flow Summaries .......... 225
3   Guarded Array Regions .......... 226
    3.1 Operations on GAR's .......... 228
    3.2 Predicate Operations .......... 230
4   Constructing Summary GAR's Interprocedurally .......... 232
    4.1 Hierarchical Supergraph .......... 232
    4.2 Summary Algorithms .......... 233
    4.3 Expansions .......... 235
5   Implementation Considerations .......... 238
    5.1 Symbolic Analysis .......... 238
    5.2 Region Numbering .......... 239
    5.3 Range Operations .......... 240
6   Application to Array Privatization and Preliminary Experimental Results .......... 240
    6.1 Array Privatization .......... 241
    6.2 Preliminary Experimental Results .......... 241
7   Related Works .......... 243
8   Conclusion .......... 244
Chapter 8. Automatic Array Privatization
Peng Tu and David Padua .......... 247
1   Introduction .......... 247
2   Background .......... 248
3   Algorithm for Array Privatization .......... 250
    3.1 Data Flow Framework .......... 250
    3.2 Inner Loop Abstraction .......... 252
    3.3 An Example .......... 256
    3.4 Profitability of Privatization .......... 257
    3.5 Last Value Assignment .......... 258
4   Demand-Driven Symbolic Analysis .......... 261
    4.1 Gated Single Assignment .......... 263
    4.2 Demand-Driven Backward Substitution .......... 264
    4.3 Backward Substitution in the Presence of Gating Functions .......... 266
    4.4 Examples of Backward Substitution .......... 267
    4.5 Bounds of Symbolic Expression .......... 269
    4.6 Comparison of Symbolic Expressions .......... 269
    4.7 Recurrence and the µ Function .......... 272
    4.8 Bounds of Monotonic Variables .......... 273
    4.9 Index Array .......... 274
    4.10 Conditional Data Flow Analysis .......... 275
    4.11 Implementation and Experiments .......... 276
5   Related Work .......... 277
Section III: Communication Optimizations

Chapter 9. Optimal Tiling for Minimizing Communication in Distributed
Shared-Memory Multiprocessors
Anant Agarwal, David Kranz, Rajeev Barua, and Venkat Natarajan .......... 285
1   Introduction .......... 285
    1.1 Contributions and Related Work .......... 286
    1.2 Overview of the Paper .......... 288
2   Problem Domain and Assumptions .......... 289
    2.1 Program Assumptions .......... 289
    2.2 System Model .......... 291
3   Loop Partitions and Data Partitions .......... 292
4   A Framework for Loop and Data Partitioning .......... 295
    4.1 Loop Tiles in the Iteration Space .......... 296
    4.2 Footprints in the Data Space .......... 298
    4.3 Size of a Footprint for a Single Reference .......... 300
    4.4 Size of the Cumulative Footprint .......... 304
    4.5 Minimizing the Size of the Cumulative Footprint .......... 311
5   General Case of G .......... 314
    5.1 G Is Invertible, but Not Unimodular .......... 314
    5.2 Columns of G Are Dependent and the Rows Are Independent .......... 316
    5.3 The Rows of G Are Dependent .......... 316
6   Other System Environments .......... 318
    6.1 Coherence-Related Cache Misses .......... 318
    6.2 Effect of Cache Line Size .......... 320
    6.3 Data Partitioning in Distributed-Memory Multicomputers .......... 320
7   Combined Loop and Data Partitioning in DSMs .......... 322
    7.1 The Cost Model .......... 322
    7.2 The Multiple Loops Heuristic Method .......... 325
8   Implementation and Results .......... 328
    8.1 Algorithm Simulator Experiments .......... 330
    8.2 Experiments on the Alewife Multiprocessor .......... 330
9   Conclusions .......... 334
A   A Formulation of Loop Tiles Using Bounding Hyperplanes .......... 337
B   Synchronization References .......... 337
Chapter 10. Communication-Free Partitioning of Nested Loops
Kuei-Ping Shih, Chua-Huang Huang, and Jang-Ping Sheu .......... 339
1   Introduction .......... 339
2   Fundamentals of Array References .......... 341
    2.1 Iteration Spaces and Data Spaces .......... 342
    2.2 Reference Functions .......... 343
    2.3 Properties of Reference Functions .......... 343
3   Loop-Level Partitioning .......... 347
    3.1 Iteration and Data Spaces Partitioning – Uniformly Generated References .......... 347
    3.2 Hyperplane Partitioning of Data Space .......... 353
    3.3 Hyperplane Partitioning of Iteration and Data Spaces .......... 359
4   Statement-Level Partitioning .......... 365
    4.1 Affine Processor Mapping .......... 366
    4.2 Hyperplane Partitioning .......... 372
5   Comparisons and Discussions .......... 377
6   Conclusions .......... 381
Chapter 11. Solving Alignment Using Elementary Linear Algebra
Vladimir Kotlyar, David Bau, Induprakas Kodukula, Keshav Pingali,
and Paul Stodghill .......... 385
1   Introduction .......... 385
2   Linear Alignment .......... 388
    2.1 Equational Constraints .......... 388
    2.2 Reduction to Null Space Computation .......... 390
    2.3 Remarks .......... 391
    2.4 Reducing the Solution Basis .......... 392
3   Affine Alignment .......... 393
    3.1 Encoding Affine Constraints as Linear Constraints .......... 393
4   Replication .......... 396
    4.1 Formulation of Replication .......... 397
5   Heuristics .......... 398
    5.1 Lessons from Some Common Computational Kernels .......... 399
    5.2 Implications for Alignment Heuristic .......... 402
6   Conclusion .......... 402
A   Reducing the Solution Matrix .......... 404
    A.1 Unrelated Constraints .......... 404
    A.2 General Procedure .......... 405
B   A Comment on Affine Encoding .......... 408
Chapter 12. A Compilation Method for Communication-Efficient
Partitioning of DOALL Loops
Santosh Pande and Tareq Bali .......... 413
1   Introduction .......... 413
2   DOALL Partitioning .......... 414
    2.1 Motivating Example .......... 415
    2.2 Our Approach .......... 419
3   Terms and Definitions .......... 421
    3.1 Example .......... 422
4   Problem .......... 423
    4.1 Compatibility Subsets .......... 423
    4.2 Cyclic Directions .......... 424
5   Communication Minimization .......... 427
    5.1 Algorithm: Maximal Compatibility Subsets .......... 427
    5.2 Algorithm: Maximal Fibonacci Sequence .......... 428
    5.3 Data Partitioning .......... 428
6   Partition Merging .......... 429
    6.1 Granularity Adjustment .......... 431
    6.2 Load Balancing .......... 431
    6.3 Mapping .......... 432
7   Example: Texture Smoothing Code .......... 432
8   Performance on Cray T3D .......... 435
    8.1 Conclusions .......... 440
Chapter 13. Compiler Optimization of Dynamic Data Distributions for
Distributed-Memory Multicomputers
Daniel J. Palermo, Eugene W. Hodges IV, and Prithviraj Banerjee .......... 445
1   Introduction .......... 445
2   Related Work .......... 447
3   Dynamic Distribution Selection .......... 449
    3.1 Motivation for Dynamic Distributions .......... 449
    3.2 Overview of the Dynamic Distribution Approach .......... 450
    3.3 Phase Decomposition .......... 451
    3.4 Phase and Phase Transition Selection .......... 457
4   Data Redistribution Analysis .......... 462
    4.1 Reaching Distributions and the Distribution Flow Graph .......... 462
    4.2 Computing Reaching Distributions .......... 463
    4.3 Representing Distribution Sets .......... 464
5   Interprocedural Redistribution Analysis .......... 465
    5.1 Distribution Synthesis .......... 467
    5.2 Redistribution Synthesis .......... 468
    5.3 Static Distribution Assignment (SDA) .......... 471
6   Results .......... 472
    6.1 Synthetic HPF Redistribution Example .......... 473
    6.2 2-D Alternating Direction Implicit (ADI2D) Iterative Method .......... 475
    6.3 Shallow Water Weather Prediction Benchmark .......... 478
7   Conclusions .......... 480
Chapter 14. A Framework for Global Communication Analysis and
Optimizations
Manish Gupta .......... 485
1   Introduction .......... 485
2   Motivating Example .......... 487
3   Available Section Descriptor .......... 488
    3.1 Representation of ASD .......... 490
    3.2 Computing Generated Communication .......... 492
4   Data Flow Analysis .......... 494
    4.1 Data Flow Variables and Equations .......... 495
    4.2 Decomposition of Bidirectional Problem .......... 498
    4.3 Overall Data-Flow Procedure .......... 499
5   Communication Optimizations .......... 505
    5.1 Elimination of Redundant Communication .......... 505
    5.2 Reduction in Volume of Communication .......... 506
    5.3 Movement of Communication for Subsumption and for Hiding Latency .......... 507
6   Extensions: Communication Placement .......... 508
7   Operations on Available Section Descriptors .......... 510
    7.1 Operations on Bounded Regular Section Descriptors .......... 512
    7.2 Operations on Mapping Function Descriptors .......... 514
8   Preliminary Implementation and Results .......... 516
9   Related Work .......... 519
    9.1 Global Communication Optimizations .......... 519
    9.2 Data Flow Analysis and Data Descriptors .......... 520
10  Conclusions .......... 521
Chapter 15. Tolerating Communication Latency through Dynamic Thread
Invocation in a Multithreaded Architecture
Andrew Sohn, Yuetsu Kodama, Jui-Yuan Ku, Mitsuhisa Sato,
and Yoshinori Yamaguchi .......... 525
1   Introduction .......... 525
2   Multithreading Principles and Its Realization .......... 527
    2.1 The Principle .......... 527
    2.2 The EM-X Multithreaded Distributed-Memory Multiprocessor .......... 530
    2.3 Architectural Support for Fine-Grain Multithreading .......... 533
3   Designing Multithreaded Algorithms .......... 535
    3.1 Multithreaded Bitonic Sorting .......... 535
    3.2 Multithreaded Fast Fourier Transform .......... 538
4   Overlapping Analysis .......... 540
5   Analysis of Switches .......... 544
6   Conclusions .......... 547
Section IV: Code Generation

Chapter 16. Advanced Code Generation for High Performance Fortran
Vikram Adve and John Mellor-Crummey ... 553
  1 Introduction ... 553
  2 Background: The Code Generation Problem for HPF ... 556
    2.1 Communication Analysis and Code Generation for HPF ... 556
    2.2 Previous Approaches to Communication Analysis and Code Generation ... 558
  3 An Integer Set Framework for Data-Parallel Compilation ... 561
    3.1 Primitive Components of the Framework ... 561
    3.2 Implementation of the Framework ... 562
  4 Computation Partitioning ... 565
    4.1 Computation Partitioning Models ... 565
    4.2 Code Generation to Realize Computation Partitions ... 567
  5 Communication Code Generation ... 573
    5.1 Communication Generation with Message Vectorization and Coalescing ... 577
    5.2 Recognizing In-Place Communication ... 581
    5.3 Implementing Loop-Splitting for Reducing Communication Overhead ... 582
  6 Control Flow Simplification ... 584
    6.1 Motivation ... 584
    6.2 Overview of Algorithm ... 588
    6.3 Evaluation and Discussion ... 589
  7 Conclusions ... 590
Chapter 17. Integer Lattice Based Methods for Local Address Generation for Block-Cyclic Distributions
J. Ramanujam ... 597
  1 Introduction ... 597
  2 Background and Related Work ... 599
    2.1 Related Work on One-Level Mapping ... 600
    2.2 Related Work on Two-Level Mapping ... 602
  3 A Lattice Based Approach for Address Generation ... 603
    3.1 Assumptions ... 603
    3.2 Lattices ... 604
  4 Determination of Basis Vectors ... 605
    4.1 Basis Determination Algorithm ... 607
    4.2 Extremal Basis Vectors ... 609
    4.3 Improvements to the Algorithm for s < k ... 612
    4.4 Complexity ... 613
  5 Address Sequence Generation by Lattice Enumeration ... 614
  6 Optimization of Loop Enumeration: GO-LEFT and GO-RIGHT ... 616
    6.1 Implementation ... 620
  7 Experimental Results for One-Level Mapping ... 620
  8 Address Sequence Generation for Two-Level Mapping ... 626
    8.1 Problem Statement ... 626
  9 Algorithms for Two-Level Mapping ... 628
    9.1 Itable: An Algorithm That Constructs a Table of Offsets ... 629
    9.2 Optimization of the Itable Method ... 631
    9.3 Search-Based Algorithms ... 634
  10 Experimental Results for Two-Level Mapping ... 635
  11 Other Problems in Code Generation ... 638
    11.1 Communication Generation ... 639
    11.2 Union and Difference of Regular Sections ... 640
    11.3 Code Generation for Complex Subscripts ... 640
    11.4 Data Structures for Runtime Efficiency ... 640
    11.5 Array Redistribution ... 641
  12 Summary and Conclusions ... 641
Section V: Task Parallelism, Dynamic Data Structures and Run Time Systems

Chapter 18. A Duplication Based Compile Time Scheduling Method for Task Parallelism
Sekhar Darbha and Dharma P. Agrawal ... 649
  1 Introduction ... 649
  2 STDS Algorithm ... 652
    2.1 Complexity Analysis ... 663
  3 Illustration of the STDS Algorithm ... 664
  4 Performance of the STDS Algorithm ... 670
    4.1 CRC Is Satisfied ... 670
    4.2 Application of Algorithm for Random Data ... 672
    4.3 Application of Algorithm to Practical DAGs ... 674
    4.4 Scheduling of Diamond DAGs ... 675
    4.5 Comparison with Other Algorithms ... 680
  5 Conclusions ... 680
Chapter 19. SPMD Execution in the Presence of Dynamic Data Structures
Rajiv Gupta ... 683
  1 Introduction ... 683
  2 Language Support for Regular Data Structures ... 684
    2.1 Processor Structures ... 685
    2.2 Dynamic Data Structures ... 685
    2.3 Name Generation and Distribution Strategies ... 688
    2.4 Examples ... 689
  3 Compiler Support for Regular Data Structures ... 693
    3.1 Representing Pointers and Data Structures ... 693
    3.2 Translation of Pointer Operations ... 694
  4 Supporting Irregular Data Structures ... 703
  5 Compile-Time Optimizations ... 705
  6 Related Work ... 706
Chapter 20. Supporting Dynamic Data Structures with Olden
Martin C. Carlisle and Anne Rogers ... 709
  1 Introduction ... 709
  2 Programming Model ... 711
    2.1 Programming Language ... 711
    2.2 Data Layout ... 711
    2.3 Marking Available Parallelism ... 714
  3 Execution Model ... 715
    3.1 Handling Remote References ... 715
    3.2 Introducing Parallelism ... 718
    3.3 A Simple Example ... 719
  4 Selecting Between Mechanisms ... 722
    4.1 Using Local Path Lengths ... 723
    4.2 Update Matrices ... 724
    4.3 The Heuristic ... 726
  5 Experimental Results ... 731
    5.1 Comparison with Other Published Work ... 733
    5.2 Heuristic Results ... 733
    5.3 Summary ... 735
  6 Profiling in Olden ... 735
    6.1 Verifying Local Path Lengths ... 736
  7 Related Work ... 739
    7.1 Gupta's Work ... 741
    7.2 Object-Oriented Systems ... 741
    7.3 Extensions of C with Fork-Join Parallelism ... 743
    7.4 Other Related Work ... 744
  8 Conclusions ... 745
Chapter 21. Runtime and Compiler Support for Irregular Computations
Raja Das, Yuan-Shin Hwang, Joel Saltz, and Alan Sussman ... 751
  1 Introduction ... 751
  2 Overview of the CHAOS Runtime System ... 753
  3 Compiler Transformations ... 758
    3.1 Transformation Example ... 759
    3.2 Definitions ... 763
    3.3 Transformation Algorithm ... 765
  4 Experiments ... 769
    4.1 Hand Parallelization with CHAOS ... 769
    4.2 Compiler Parallelization Using CHAOS ... 773
  5 Conclusions ... 775
Author Index ................................................. 779
Chapter 1. High Performance Fortran 2.0
Ken Kennedy (1) and Charles Koelbel (2)
(1) Department of Computer Science and Center for Research on Parallel Computation, Rice University, Houston, Texas, USA
(2) Advanced Computational Infrastructure and Research, National Science Foundation, 4201 Wilson Boulevard, Suite 1122 S, Arlington, VA 22230, USA
Summary. High Performance Fortran (HPF) was defined in 1993 as a portable data-parallel extension to Fortran. This year it was updated by the release of HPF version 2.0, which clarified many existing features and added a number of extensions requested by users. Compilers for these extensions are expected to appear beginning in late 1997. In this paper, we present an overview of the entire language, including HPF 1 features such as BLOCK distribution and the FORALL statement and HPF 2 additions such as INDIRECT distribution and the ON directive.
1. Introduction

High Performance Fortran (HPF) is a language that extends standard Fortran by adding support for data-parallel programming on scalable parallel processors. The original language document, the product of an 18-month informal standardization effort by the High Performance Fortran Forum, was released in 1993. HPF 1.0 was based on Fortran 90 and was strongly influenced by the SIMD programming model that was popular in the early 90s. The language featured a single thread of control and a shared-memory programming model in which any required interprocessor communication would be generated implicitly by the compiler. In spite of widespread interest in the language, HPF was not an immediate success, suffering from the long lead time between its definition and the appearance of mature compilers and from the absence of features that many application developers considered essential. In response to the latter problem, the HPF Forum reconvened in 1995 and 1996 to produce a revised standard called HPF 2.0 [11]. The purpose of this paper is two-fold:
– To give an overview of the HPF 2.0 specification, and
– To explain (in general terms) how the language may be implemented.
We start by giving a short history of HPF and a discussion of the components of the language.
2. History and Overview of HPF

HPF has attracted great interest since the inception of the first standardization effort in 1991. Many users had long hoped for a portable, efficient, high-level language for parallel programming. In the 1980's, Geoffrey Fox's
analysis of parallel programs [5,6] and other projects had identified and popularized data-parallel programming as one promising approach to this goal. The data-parallel model derived its parallelism from the observation that updates to individual elements of large data structures were often independent of each other. For example, successive over-relaxation techniques update every point of a mesh based on the (previous) values there and at adjacent points. This observation identified far more parallelism in the problem than could be exploited by the physical processors available. Data-parallel implementations solved this situation by dividing the data structure elements between the physical processors and scheduling each processor to perform the computations needed by its local data. Sometimes the local computations on one processor required data from another processor. In these cases, the implementation inserted synchronization and/or communication to ensure that the correct version of the data was used. How the data had been divided determined how often the processors had to interact. Therefore, the key intellectual step in writing a data-parallel program was to determine how the data could be divided to minimize this interaction; once this was done, inserting the synchronization and communication was relatively mechanical.

In the late 1980's, several research projects [2,7,8,15–18,20] and commercial compilers [12,19] designed languages to implement data-parallel programming. These projects extended sequential or functional languages to include aggregate operations, most notably array syntax and forall constructs, that directly reflected data-parallel operations. Also, they added syntax for describing data mappings, usually by specifying a high-level pattern for how the data would be divided among processors. Programmers were responsible for using these "data distribution" and "data parallel" constructs appropriately. In particular, the fastest execution was expected when the dimension(s) that exhibited data parallelism were also distributed across parallel processors. Furthermore, the best distribution pattern was the one that produced the least communication; that is, the pattern that required the least combining of elements stored on separate processors.

What the programmer did not have to do was equally important. Data-parallel languages did not require the explicit insertion of synchronization and communication operations. This made basic programming much easier, since the user needed only to consider the (sequential) ordering of large-grain operations rather than the more complex and numerous interconnections between individual processors. In other words, data-parallel languages had sequential semantics; race conditions were not possible.

The cost of this convenience was increasingly complex compilers. The job of the compiler and run-time system for a data-parallel language was to efficiently map programs onto parallel hardware. Typically, the implementation used a form of the owner-computes rule, which assigned the computation in an assignment statement to the processor that owned the left-hand side. Loops over distributed data structures, including the loops
implied by aggregate operations, were an important special case of this rule; they were strip-mined so that each processor ran over the subset of the loop iterations specified by the owner-computes rule. This strip-mining automatically divided the work between the processors. Different projects developed various strategies for inserting communication and synchronization, ranging from pattern-matching [16] to dependence-based techniques [17]. Because the target platforms for the compilers were often distributed-memory computers like the iPSC/860, communication costs were very high. Therefore, the compilers expended great effort to reduce this cost through bundling communication [15] and using efficient collective communication primitives [16]. Similar techniques proved useful on a variety of platforms [8], giving further evidence that data-parallel languages might be widely portable. At the same time, the commercial Connection Machine Fortran compiler [19] was proving that data parallelism was feasible for expressing a variety of codes.

Many of the best ideas for data-parallel languages were eventually incorporated into Fortran dialects by the Fortran D project at Rice and Syracuse Universities [4], the Vienna Fortran project at the University of Vienna [3] and work at COMPASS, Inc. [12]. Early successes there led to a Supercomputing '91 birds-of-a-feather session that essentially proposed development of a standard data-parallel dialect of Fortran. At a follow-up meeting in Houston the following January, the Center for Research on Parallel Computation (CRPC) agreed to sponsor an informal standards process, and the High Performance Fortran Forum (HPFF) was formed. A "core" group of HPFF met every 6 weeks in Dallas for the next year, producing a preliminary draft of the HPF language specification presented at Supercomputing '92 (a presentation so overcrowded that movable walls were removed during the session to make a larger room) and the final HPF version 1.0 language specification early the next year [9].

The outlines of HPF 1.0 were very similar to its immediate predecessors:
– Fortran 90 [1] (the base language) provided immediate access to array arithmetic, array assignment, and many useful intrinsic functions.
– The ALIGN and DISTRIBUTE directives (structured comments recognized by the compiler) described the mapping of partitioned data structures. Section 3 describes these features in more detail.
– The FORALL statement (a new construct), the INDEPENDENT directive (an assertion to the compiler), and the HPF library (a set of data-parallel subprograms) provided a rich set of data-parallel operations. Section 4 describes these features in more detail.
– EXTRINSIC procedures (an interface to other programming paradigms) provided an "escape hatch" for programmers who needed access to low-level machine details or forms of parallelism not well-expressed by data-parallel constructs. Section 5 describes these functions in more detail.
A reference to the standard [14] was published soon afterward, and HPFF went into recess for a time.
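To give a rough flavor of how these pieces fit together in practice, the following is a small sketch of an HPF 1.0-style kernel. It is our own illustration, not code from the language specification; the array names, sizes, and processor count are invented for the example.

      REAL A(1000), B(1000)
!HPF$ PROCESSORS P(4)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P      ! partition A into four contiguous blocks
!HPF$ ALIGN B(I) WITH A(I)            ! keep B(I) on the same processor as A(I)
      ...
!     element-wise smoothing step; the iterations are independent, so the
!     compiler can strip-mine the FORALL across the processors that own A
      FORALL ( I = 2:999 ) B(I) = 0.5 * (A(I-1) + A(I+1))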
In 1994, HPFF resumed meetings with two purposes:
– To consider Corrections, Clarifications, and Interpretations (CCI) of the HPF 1.0 language in response to public comments and questions, and
– To determine requirements for further extensions to HPF by consideration of advanced applications codes.
The CCI discussions led to the publication of a new language specification (HPF version 1.1). Although some of the clarifications were important for special cases, there were no major language modifications. The extensions requirements were collected in a separate document [10]. They later served as the basis for discussions toward HPF 2.0.

In January 1995, HPFF undertook its final (to date) series of meetings, with the intention of producing a significant update to HPF. Those meetings were completed in December 1996, and the HPF version 2.0 language specification [11] appeared in early 1997. The basic syntax and semantics of HPF did not change in version 2.0; generally, programs still consisted of sequential compositions of aggregate operations on distributed arrays. However, there were some significant revisions:
– HPF 2.0 consists of two parts: a base language, and a set of approved extensions. The base language is very close to HPF 1.1, and is expected to be fully implemented by vendors in a relatively short time. The approved extensions are more advanced features which are not officially part of the language, but which may be adopted in future versions of HPF. However, several vendors have committed to supporting one or more of the extensions due to customer demands. In this paper, we will refer to both parts of the language as "HPF 2.0" but will point out approved extensions when they are introduced.
– HPF 2.0 removes, restricts, or reclassifies some features of HPF 1.1, particularly in the area of dynamic remapping of data. In all cases, the justification of these changes was that the cost of implementation was much higher than originally thought, and did not justify the advantage gained by including the features.
– HPF 2.0 adds a number of features, particularly in the areas of new distribution patterns, REDUCTION clauses in INDEPENDENT loops, the new ON directive for computation scheduling (including task parallelism), and asynchronous I/O.
The remainder of this paper considers the features of HPF 2.0 in more detail. In particular, each section below describes a cluster of related features, including examples of their syntax and use. We close the paper with a look to future prospects for HPF.
3. Data Mapping

The most widely discussed features of HPF describe the layout of data onto memories of parallel machines. Conceptually, the programmer gives a high-level description of how large data structures (typically, arrays) will be partitioned between the processors. It is the compiler and run-time system's responsibility to carry out this partitioning. This data mapping does not directly create a parallel program, but it does set the stage for parallel execution. In particular, data parallel statements operating on partitioned data structures can execute in parallel. We will describe that process more in section 4.

3.1 Basic Language Features

HPF uses a 2-phase data mapping model. Individual arrays can be aligned together, thus ensuring that elements aligned together are always stored on the same processor. This minimizes data movement if the corresponding elements are accessed together frequently. Once arrays are aligned in this way, one of them can be distributed, thus partitioning its elements across the processors' memories. This affects the data movement from combining different elements of the same array (or, by implication, elements of different arrays that are not aligned together). Distribution tends to be more machine-dependent than alignment; differing cost tradeoffs between machines (for example, relatively higher bandwidth or longer latency) may dictate different distribution patterns when porting. Section 3.1.1 below describes alignment, while Section 3.1.2 describes distribution.

It bears mentioning that the data mapping features of HPF are technically directives—that is, structured comments that are recognized by the compiler. The advantage of this approach is determinism; in general, an HPF program will produce the same result when run on any number of processors. (There are two notable exceptions to this. HPF has intrinsic functions, not described in this paper, that can query the mapping of an array; a programmer could use these to explicitly code different behaviors for different numbers of processors. Also, round-off errors can occur which may be sensitive to data mapping; this is most likely to occur if the reduction intrinsics described in Section 4 are applied to mapped arrays.) Another way of saying this is that HPF data mapping affects only the efficiency of the program, not its correctness. This has obvious attraction for maintaining and porting codes. We feel it is a key to HPF's success to date.

3.1.1 The ALIGN Directive. The ALIGN directive creates an alignment between two arrays. Syntactically, the form of this directive is

!HPF$ ALIGN alignee [ ( align-dummy-list ) ] WITH [ * ] target [ ( align-subscript-list ) ]
(For brevity, we ignore several alternate forms that can be reduced to this one.) The alignee can be any object name, but is typically an array name. This is the entity being aligned. The target may be an object or a template; in either case, this is the other end of the alignment relationship. (Templates are "phantom" objects that have shape, but take no storage; they are sometimes useful to provide the proper sized target for an alignment.) Many alignees can be aligned to a single target at once. An align-dummy is a scalar integer variable that may be used in (at most) one of the align-subscript expressions. An align-subscript is an affine linear function of one align-dummy, or it is a constant, or it is an asterisk (*). The align-dummy-list is optional if the alignee is a scalar; otherwise, the number of list entries must match the number of dimensions of the alignee. The same holds true for the align-subscript-list and target. The meaning of the optional asterisk before the target is explained in Section 3.2.3; here, it suffices to mention that it only applies to procedure dummy arguments. The ALIGN directive must appear in the declaration section of the program.

An ALIGN directive defines the alignment of the alignee by specifying the element(s) of the target that correspond to each alignee element. Each align-dummy implicitly ranges over the corresponding dimension of the alignee. Substituting these values into the target expression specifies the matching element(s). A "*" used as a subscript in the target expression means the alignee element matches all elements in that dimension. For example, Figure 3.1 shows the result of the HPF directives

!HPF$ ALIGN B(I,J) WITH A(I,J)
!HPF$ ALIGN C(I,J) WITH A(J,I)
!HPF$ ALIGN D(K) WITH A(K,1)
!HPF$ ALIGN E(L) WITH A(L,*)
!HPF$ ALIGN F(M) WITH D(2*M-1)

[Fig. 3.1. Examples of the ALIGN directive.]

Elements (squares) with the same symbol are aligned together. Here, B is identically aligned with A; this is by far the most common case in practice. Similarly, C is aligned with the transpose of A, which might be appropriate if one array were accessed row-wise and the other column-wise. Elements of D are aligned with the first column of A; any other column could have been used as well. Elements of E, however, are each aligned with all elements in a row of A. As we will see in Section 3.1.2, this may result in E being replicated on many processors when A is distributed. Finally, F has a more complex alignment; through the use of (nontrivial) linear functions, it is aligned with the odd elements of D. However, D is itself aligned to A, so F is ultimately aligned with A. Note that, because the align-subscripts in each directive are linear functions, the overall alignment relationship is still an invertible linear function.

The ALIGN directive produces rather fine-grain relationships between array elements. Typically, this corresponds to relationships between the data
in the underlying algorithm or in physical phenomena being simulated. For example, a fluid dynamics computation might have arrays representing the pressure, temperature, and fluid velocity at every point in space; because of their close physical relationships those arrays might well be aligned together. Because the alignment derives from a deep connection, it tends to be machine-independent. That is, if two arrays are often accessed together on one computer, they will also be accessed together on another. This makes alignment useful for software engineering. A programmer can choose one "master" array to which all others will be aligned; this effectively copies the master's distribution (or a modification of it) to the other arrays. To change the distributions of all the arrays (for example, when porting to a new
machine), the programmer only has to adjust the distribution of the master array, as explained in the next section.

3.1.2 The DISTRIBUTE Directive. The DISTRIBUTE directive defines the distribution of an object or template and all other objects aligned to it. Syntactically, the form of this directive is

!HPF$ DISTRIBUTE distributee [ * ] [ ( dist-format-list ) ] [ ONTO [ * ] [ processor [ ( section-list ) ] ] ]

The distributee can be any object or template, but is typically an array name; it is the entity being distributed. The dist-format-list gives a distribution pattern for each dimension of the distributee. The number of dist-format-list entries must match the number of distributee dimensions. The ONTO clause identifies the processor arrangement (or, as an HPF 2.0 approved extension, the section thereof) where the distributee will be stored. The number of dimensions of this expression must match the number of entries in the dist-format-list that are not *. The dist-format-list or the processor expression is only optional if its * option (explained in Section 3.2.3) is present. The DISTRIBUTE directive must appear in the declaration section of the program.

A DISTRIBUTE directive defines the distribution of the distributee by giving a general pattern for how each of its dimensions will be divided. HPF 1.0 had three such formats—BLOCK, CYCLIC, and *. HPF 2.0 adds two more—GEN_BLOCK and INDIRECT—as approved extensions. Figure 3.2 shows the results of the HPF directives

!HPF$ DISTRIBUTE S( BLOCK )
!HPF$ DISTRIBUTE T( CYCLIC )
!HPF$ DISTRIBUTE U( CYCLIC(2) )
!HPF$ DISTRIBUTE V( GEN_BLOCK( (/ 3, 5, 5, 3 /) ) )
!HPF$ DISTRIBUTE W( INDIRECT( SNAKE(1:16) ) )
!HPF$ DISTRIBUTE X( BLOCK )
!HPF$ SHADOW X(1)
!HPF$ DISTRIBUTE Y( BLOCK, * )
!HPF$ DISTRIBUTE Z( BLOCK, BLOCK )

[Fig. 3.2. Examples of the DISTRIBUTE directive.]
Here, the color of an element represents the processor where it is stored. All of the arrays are mapped onto four processors; for the last case, the processors are arranged as a 2 × 2 array, although other arrangements (1 × 4 and 4 × 1) are possible. As shown by the S declaration, BLOCK distribution breaks the dimension into equal-sized contiguous pieces. (If the size is not divisible by the number of processors, the block size is rounded upwards.) The T declaration shows how the CYCLIC distribution assigns the elements one-by-one to processors in round-robin fashion. CYCLIC can take an integer parameter k, as shown by the declaration of U; in this case, blocks of size k are assigned cyclically to the processors. The declaration of V demonstrates the GEN_BLOCK pattern, which extends the BLOCK distribution to unequal-sized blocks. The
sizes of the blocks on each processor are given by the mandatory integer array argument; there must be one such element for each processor. W demonstrates the INDIRECT pattern, which allows arbitrary distributions to be defined by declaring the processor home for each element of the distributee. The size of the INDIRECT parameter array must be the same as the size of the distributee. The contents of the parameter array SNAKE are not shown in the figure, but it must be set before the DISTRIBUTE directive takes effect. The BLOCK distribution can be modified by the SHADOW directive, which allocates "overlap" areas on each processor. X shows how this produces additional copies of the edge elements on each processor; the compiler can then use these copies to
optimize data movement. Finally, multi-dimensional arrays take a distribution pattern in each dimension. For example, the Y array uses a BLOCK pattern for the rows and the * pattern (which means "do not partition") for the columns. The Z array displays a true 2-dimensional distribution, consisting of a BLOCK pattern in rows and columns.

When other objects are aligned to the distributee, its distribution pattern propagates to them. Figure 3.3 shows this process for the alignments in Figure 3.1.

[Fig. 3.3. Combining ALIGN and DISTRIBUTE: the left side shows the result of DISTRIBUTE A(BLOCK,*), the right side the result of DISTRIBUTE A(*,BLOCK).]

The left side shows the effect of the directive
!HPF$ DISTRIBUTE A( BLOCK, * )

assuming three processors are available. Because the alignments are simple in this example, the same patterns could be achieved with the directives

!HPF$ DISTRIBUTE A( BLOCK, * )
!HPF$ DISTRIBUTE B( BLOCK, * )
!HPF$ DISTRIBUTE C( *, BLOCK )
!HPF$ DISTRIBUTE D( BLOCK )
!HPF$ DISTRIBUTE E( BLOCK )
!HPF$ DISTRIBUTE F( BLOCK )
The right side shows the effect of the directive

!HPF$ DISTRIBUTE A( *, BLOCK )

Elements of E are replicated on all processors; therefore, each element has three colors. This mapping cannot be achieved by the DISTRIBUTE directive
alone. The mappings of D and F onto a single processor are inconvenient to specify by a single directive, but it is possible using the ONTO clause.

The DISTRIBUTE directive affects the granularity of parallelism and communication in a program. The keys to understanding this are to remember that computation partitioning is based on the location of data, and that combining elements on different processors (e.g. adding them together, or assigning one to the other) produces data movement. To balance the computational load, the programmer must choose distribution patterns so that the updated elements are evenly spread across processors. If all elements are updated, then either the BLOCK or CYCLIC patterns do this; triangular loops sometimes make CYCLIC the only load-balanced option. To reduce data movement costs, the programmer should maximize the number of on-processor accesses. BLOCK distributions do this for nearest-neighbor access patterns; for irregular accesses, it may be best to carefully calculate an INDIRECT mapping array.

3.2 Advanced Topics

The above forms of ALIGN and DISTRIBUTE are adequate for declaring the mappings of arrays whose shape and access patterns are static. Unfortunately, this is not the case for all arrays; Fortran ALLOCATABLE and POINTER arrays do not have a constant size, and subprograms may be called with varying actual arguments. The features in this section support more dynamic uses of mapped arrays.

3.2.1 Specialized and Generalized Distributions. In many dynamic cases, it is only possible to provide partial information about the data mapping of an array. For example, it may be known that a BLOCK distribution was used, but not how many processors (or which processors) were used. When two mappings interact—for example, when a mapped pointer is associated with a mapped array—the intuition is that the target must have a more fully described mapping than the incoming pointer. HPF makes this intuition more precise by defining "generalizations" and "specializations" of mappings. In short, S's mapping is a specialization of G's mapping (or G's mapping is a generalization of S's) if S is more precisely specified. To make this statement more exact, we must introduce some syntax and definitions.

The "lone star" syntax that appeared in the DISTRIBUTE directives indicates any valid value can be used at runtime. Consider the directives

!HPF$ DISTRIBUTE A * ONTO P
!HPF$ DISTRIBUTE B (BLOCK) ONTO *

The first line means that A is distributed over processor arrangement P, but does not specify the pattern; it would be useful for declaring dummy arguments to a subroutine that would only be executed on those processors. Similarly, the second line declares that B has a block distribution, but does not specify the processors. Either clause gives partial information about the
mapping of the distributee. In addition, the INHERIT directive specifies only that an object has a mapping. This serves as a "wild card" in matching mappings. It is particularly useful for passing dummy arguments that can take on any mapping (indeed, the name of the directive comes from the idea that the dummy arguments "inherit" the mapping of the actuals).

HPF defines the mapping of S to be a specialization of mapping G if:
1. G has the INHERIT attribute, or
2. S does not have the INHERIT attribute, and
   a) S is a named object (i.e. an array or other variable), and
   b) S and G are aligned to objects with the same shape, and
   c) The align-subscripts in S's and G's ALIGN directives reduce to identical expressions (except for align-dummy name substitutions), and
   d) Either
      i. Neither S's nor G's align target has a DISTRIBUTE directive, or
      ii. Both S's and G's align targets have a DISTRIBUTE directive, and
         A. G's align target's DISTRIBUTE directive has no ONTO clause, or specifies "ONTO *", or specifies ONTO the same processor arrangement as S's align target, and
         B. G's align target's DISTRIBUTE directive has no distribution format clause, or uses "*" as the distribution format clause, or the distribution patterns in the clause are equivalent dimension-by-dimension to the patterns in S's align target.
Two distribution patterns are equivalent in the definition if they both have the same identifier (e.g. are both BLOCK, both CYCLIC, etc.) and any parameters have the same values (e.g. the m in CYCLIC(m), or the array ind in INDIRECT(ind)). This defines "specialization" as a partial ordering of mappings, with an INHERIT mapping as its unique minimal element. Conversely, "generalization" is a partial ordering, with INHERIT as its unique maximum.

We should emphasize that, although the definition of "specialization" is somewhat complex, the intuition is very simple. One mapping is a specialization of another if the specialized mapping provides at least as much information as the general mapping, and when both mappings provide information they match. For example, the arrays A and B mentioned earlier in this section have mappings that are specializations of

!HPF$ INHERIT GENERAL

If all the arrays are the same size, both A and B are generalizations of

!HPF$ DISTRIBUTE SPECIAL(BLOCK) ONTO P

Neither A's nor B's mapping is a specialization of the other.
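To make the ordering concrete, the declarations discussed in this subsection can be collected into a single sketch; this is our own arrangement, and the array sizes and the declaration of the processor arrangement P are assumed for illustration:

!HPF$ PROCESSORS P(4)                     ! assumed processor arrangement
      REAL A(100), B(100), GENERAL(100), SPECIAL(100)
!HPF$ DISTRIBUTE A * ONTO P               ! processors known, pattern unspecified
!HPF$ DISTRIBUTE B (BLOCK) ONTO *         ! pattern known, processors unspecified
!HPF$ INHERIT GENERAL                     ! says only that GENERAL has some mapping
!HPF$ DISTRIBUTE SPECIAL(BLOCK) ONTO P    ! fully specified
!     SPECIAL's mapping is a specialization of A's, of B's, and of GENERAL's;
!     A's and B's mappings are specializations of GENERAL's, but not of each other.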
3.2.2 Data Mapping for ALLOCATABLE and POINTER Arrays. One of the great advantages of Fortran 90 and 95 over previous versions of FORTRAN is dynamic memory management. In particular, ALLOCATABLE arrays can have their size set during execution, rather than being statically allocated. This generality does come at some cost in complicating HPF's data mapping, however. In particular, the number of elements per processor cannot be computed until the size of the array is known; this is a particular problem for BLOCK distributions, since the beginning and ending indices on a particular processor may depend on block sizes, and thus on the number of processors. In addition, it is unclear what to do if an allocatable array is used in a chain of ALIGN directives.

HPF resolves these issues with a few simple rules. Mapping directives (ALIGN and DISTRIBUTE) take effect when an object is allocated, not when it is declared. If an object is the target of an ALIGN directive, then it must be allocated when the ALIGN directive takes effect. These rules enforce most users' expectations; patterns take effect when the array comes into existence, and one cannot define new mappings based on ghosts. For example, consider the following code.

      REAL, ALLOCATABLE :: A(:), B(:), C(:)
!HPF$ DISTRIBUTE A(CYCLIC(N))
!HPF$ ALIGN B(I) WITH A(I*2)
!HPF$ ALIGN C(I) WITH A(I*2)
      ...
      ALLOCATE( B(1000) )   ! Illegal
      ...
      ALLOCATE( A(1000) )   ! Legal
      ALLOCATE( C(500) )    ! Legal

The allocation of B is illegal, since it is aligned to an object that does not (yet) exist. However, the allocations of A and C are properly sequenced; C can rely on A, which is allocated immediately before it. A will be distributed cyclically in blocks of N elements (where N is evaluated on entry to the subprogram where these declarations occur). If N is even, C will have blocks of size N/2; otherwise, its mapping will be more complex.

It is sometimes convenient to choose the distribution based on the actual allocated size of the array. For example, small problems may use a CYCLIC distribution to improve load balance while large problems benefit from a BLOCK distribution. In these cases, the REALIGN and REDISTRIBUTE directives described in Section 3.2.4 provide the necessary support.

The HPF 2 core language does not allow explicitly mapped pointers, but the approved extensions do. In this case, the ALLOCATABLE rules also apply to POINTER arrays. In addition, however, pointer assignment can associate a POINTER variable with another object, or with an array section. In this case, the rule is that the mapping of the POINTER must be a generalization of the target's mapping. For example, consider the following code.
      REAL, POINTER :: PTR1(:), PTR2(:), PTR3(:)
      REAL, TARGET :: A(1000)
!HPF$ PROCESSORS P(4), Q(8)
!HPF$ INHERIT PTR1
!HPF$ DISTRIBUTE PTR2(BLOCK)
!HPF$ DISTRIBUTE PTR3 * ONTO P
!HPF$ DISTRIBUTE A(BLOCK) ONTO Q
      PTR1 => A( 2 : 1000 : 2 )   ! Legal
      PTR2 => A                   ! Legal
      PTR3 => A                   ! Illegal

PTR1 can point to a regular section of A because it has the INHERIT attribute; neither of the other pointers can, because regular sections are not named objects and thus not specializations of any other mapping. PTR2 can point to the whole of A; for pointers, the lack of an ONTO clause effectively means "can point to data on any processor arrangement." PTR3 cannot point to A because their ONTO clauses are not compatible; the same would be true if their ONTO clauses matched but their distribution patterns did not.

The effect of these rules is to enforce a form of type checking on mapped pointers. In addition to the usual requirements that pointers match their targets in type, rank, and shape, HPF 2.0 adds the requirement that any explicit mappings be compatible. That is, a mapped POINTER can only be associated with an object with the same (perhaps more fully specified) mapping.

3.2.3 Interprocedural Data Mapping. The basic ALIGN and DISTRIBUTE directives declare the mapping of global variables and automatic variables. Dummy arguments, however, require additional capabilities. In particular, the following situations are possible in HPF:
– Prescriptive mapping: The dummy argument can be forced to have a particular mapping. If the actual argument does not have this mapping, the mapping is changed for the duration of the procedure.
– Descriptive mapping: This is the same as a prescriptive mapping, except that it adds an assertion that the actual argument has the same mapping as the dummy. If it does not, the compiler may emit a warning.
– Transcriptive mapping: The dummy argument can have any mapping. In effect, it inherits the mapping of the actual argument.
Syntactically, prescriptive mapping is expressed by the usual ALIGN or DISTRIBUTE directives, as described in Section 3.1. Descriptive mapping is expressed by an asterisk preceding a clause of an ALIGN or DISTRIBUTE directive. Transcriptive mappings are expressed by an asterisk in place of a clause in an ALIGN or DISTRIBUTE directive. It is possible for a mapping to be partially descriptive and partially transcriptive (or some other combination) by this definition, but such uses are rare. For example, consider the following subroutine header.
      SUBROUTINE EXAMPLE( PRE, DESC1, DESC2, TRANS1, TRANS2, N )
      INTEGER N
      REAL PRE(N), DESC1(N), DESC2(N), TRANS1(N), TRANS2(N)
!HPF$ DISTRIBUTE PRE(BLOCK)               ! Prescriptive
!HPF$ DISTRIBUTE DESC1 *(BLOCK)           ! Descriptive
!HPF$ ALIGN DESC2(I) WITH *DESC1(I*2-1)   ! Descriptive
!HPF$ INHERIT TRANS1                      ! Transcriptive
!HPF$ DISTRIBUTE TRANS2 * ONTO *          ! Transcriptive

PRE is prescriptively mapped; if the corresponding actual argument does not have a BLOCK distribution, then the data will be remapped on entry to the subroutine and on exit. DESC1 and DESC2 are descriptively mapped; the actual argument for DESC1 is expected to be BLOCK-distributed, and the actual for DESC2 should be aligned with the odd elements of DESC1. If either of these conditions is not met, then a remapping is performed and a warning is emitted by the compiler. TRANS1 is transcriptively mapped; the corresponding actual can have any mapping or can be an array section without causing remapping. TRANS2 is also transcriptively mapped, but passing an array section may cause remapping.

HPF 2.0 simplified and clarified the rules for when an explicit interface is required. (Fortran 90 introduced the concept of explicit interfaces, which give the caller all information about the types of dummy arguments. Explicit interfaces are created by INTERFACE blocks and other mechanisms.) If any argument is declared with the INHERIT directive, or if any argument is remapped when the call is executed, then an explicit interface is required. Remapping is considered to occur if the mapping of an actual argument is not a specialization of the mapping of its corresponding dummy argument. In other words, if the dummy's mapping uses INHERIT or doesn't describe the actual as well (perhaps in less detail), then the programmer must supply an explicit interface. The purpose of this rule is to ensure that both caller and callee have the information required to perform the remapping. Because it is sometimes difficult to decide whether mappings are specializations of each other, some programmers prefer to simply use explicit interfaces for all calls; this is certainly safe.

3.2.4 REALIGN and REDISTRIBUTE. Sometimes, remapping of arrays is required at a granularity other than subroutine boundaries. For example, within a single procedure an array may exhibit parallelism across rows for several loops, then parallelism across columns. Another common example is choosing the distribution of an array based on runtime analysis, such as computing the parameter array for a later INDIRECT distribution. For these cases, HPF 2.0 approved extensions provide the REALIGN and REDISTRIBUTE directives. It is worth noting that these directives were part of HPF version 1.0, but reclassified as approved extensions in HPF 2.0 due to unforeseen difficulties in their implementation.
Syntactically, the REALIGN directive is identical to the ALIGN directive, except for two additional characters in the keyword. (Also, REALIGN does not require the descriptive and transcriptive forms of ALIGN since its purpose is always to change the data mapping.) Similarly, REDISTRIBUTE has the same syntax as DISTRIBUTE’s prescriptive case. Semantically, both directives set the mapping for the arrays they name when the program control flow reaches them; in a sense, they act like executable statements in this regard. The new mapping will persist until the array becomes deallocated, or another REALIGN or REDISTRIBUTE directive is executed. Data in the remapped arrays must be communicated to its new home unless the compiler can detect that the data is not live. One special case of dead data is called out in the HPF language specification—a REALIGN or REDISTRIBUTE directive for an array immediately following an ALLOCATE statement for the same array. The HPFF felt this was such an obvious and common case that strong advice was given to the vendors to avoid data motion there. There is one asymmetry between REALIGN and REDISTRIBUTE that bears mentioning. REALIGN only changes the mapping of its alignee; the new mapping does not propagate to any other arrays that might be aligned with it beforehand. REDISTRIBUTE of an array changes the mapping for the distributee and all arrays that are aligned to it, following the usual rules for ALIGN. The justification for this behavior is that both “remap all” and “remap one” behaviors are needed in different algorithms. (The choice to make REDISTRIBUTE rather than REALIGN propagate to other arrays was somewhat arbitrary, but fit naturally with the detailed definitions of alignment and distribution in the language specification.) The examples in Figure 3.4 may be helpful. In the assignments to A, that array is first computed from corresponding elements of C and their vertical neighbors, then updated from C’s transpose. Clearly, the communication patterns are different in these two operations; use of REALIGN allows both assignments to be completed without communication. (Although the communication here occurs in the REALIGN directives instead, a longer program could easily show a net reduction in communication.) In the operations on B, corresponding elements of B and D are multiplied in both loops; this implies that the two arrays should remain identically aligned. However, each loop only exhibits one dimension of parallelism; using REDISTRIBUTE as shown permits the vector operations to be executed fully in parallel in each loop, while any static distribution would sacrifice parallel execution in one or the other.
      REAL A(100,100), B(100,100), C(100,100), D(100,100)
!HPF$ DYNAMIC A, B
!HPF$ DISTRIBUTE C(BLOCK,*)
!HPF$ ALIGN D(I,J) WITH B(I,J)

!HPF$ REALIGN A(I,J) WITH C(I,J)
      A = C + CSHIFT(C,1,2) + CSHIFT(C,-1,2)
!HPF$ REALIGN A(I,J) WITH C(J,I)
      A = A + TRANSPOSE(C)

!HPF$ REDISTRIBUTE B(*,BLOCK)
      DO I = 2, N-1
        B(I,:) = B(I-1,:)*D(I-1,:) + B(I,:)
      END DO
!HPF$ REDISTRIBUTE B(BLOCK,*)
      DO J = 2, N-1
        B(:,J) = B(:,J-1)*D(:,J-1) + B(:,J-1)
      END DO

Fig. 3.4. REALIGN and REDISTRIBUTE

4. Data Parallelism

Although the data mapping features of HPF are vital, particularly on distributed memory architectures, they must work in concert with data-parallel operations to fully use the machine. Conceptually, data-parallel loops and functions identify masses of operations that can be executed in parallel, typically element-wise updates of data. The compiler and run-time system use the data mapping information to package this potential parallelism for the physical machine. Therefore, as a first-order approximation programmers should expect that optimal performance will occur for vector operations along a distributed dimension. This will be true modulo communication and synchronization costs, and possibly implementation shortcomings.

4.1 Basic Language Features

Four features make up the basic HPF support for data parallelism:
1. Fortran 90 array assignments define element-wise operations on regular arrays.
2. The FORALL statement is a new form of array assignment that provides greater flexibility.
3. The INDEPENDENT assertion is a directive (i.e. structured comment) that gives the compiler more information about a DO loop.
4. The HPF library is a set of useful functions that perform parallel operations on arrays.
All of these features are part of the HPF 2.0 core language; Section 4.2 will consider additional topics from the approved extensions. We will not cover array assignments in more detail, except to say that they formed an invaluable
base to build HPF's more general operations. FORALL is the first case that we have discussed of a new language construct; in contrast to the data mapping directives, it changes the values of program data. Because of this, it could not be another directive. INDEPENDENT, on the other hand, is a directive; if correctly used, it only provides information to the compiler and does not alter the meaning of the program. Finally, the HPF library is a set of functions that provide interesting parallel operations. In most cases, these operations derive their parallelism from independent operations on large data sets, but the operations do not occur element-wise as in array assignments.

Compilers may augment the explicit data-parallel features of HPF by analyzing DO loops and other constructs for parallelism. In fact, many do precisely that. Doing so is certainly a great advantage for users who are porting existing code. Unfortunately, different compilers have different capabilities in this regard. For portable parallelism, it is often best to use the explicit constructs described below.

4.1.1 The FORALL Statement. The FORALL statement provides data-parallel operations, much like array assignments, with an explicit index space, much like a DO loop. There are both single-statement and multi-statement forms of the FORALL. The single-statement form has the syntax

FORALL ( forall-triplet-list [, mask-expr ] ) forall-assignment-stmt

The multi-statement form has the syntax

FORALL ( forall-triplet-list [, mask-expr ] )
  [ forall-body-construct ] ...
END FORALL

In both cases, a forall-triplet is

index-name = int-expr : int-expr [ : int-expr ]

If a forall-triplet-list has more than one triplet, then no index-name may be used in the bounds or stride for any other index. A forall-assignment-stmt is either an assignment statement or a pointer assignment statement. A forall-body-construct can be an assignment, pointer assignment, WHERE, or FORALL statement. For both forall-assignment-stmt and forall-body-construct, function calls are restricted to PURE functions; as we will see in Section 4.1.2, these are guaranteed to have no side effects.

The semantics of a single-statement FORALL is essentially the same as for a single array assignment. First, the bounds and strides in the FORALL header are evaluated. These determine the range for each index to iterate over; for multidimensional FORALL statements, the indices are combined by Cartesian product. Next, the mask expression is evaluated for every "iteration" in range. The FORALL body will not be executed for iterations that produce a false mask. The right-hand side of the assignment is computed for every remaining iteration. The key to parallel execution of the FORALL is that
these computations can be performed in parallel—no data is modified at this point, so there can be no interference between different iterations. Finally, all the results are assigned to the left-hand sides. It is an error if two of the iterations produce the same left-hand side location. Absent that error, the assignments can be made in parallel since there are no other possible sources of interference.

The semantics of a multi-statement FORALL reduce to a series of single-statement FORALLs, one per statement in the body. That is, the bounds and mask are computed once at the beginning of execution. Then each statement is executed in turn, first computing all right-hand sides and then assigning to the left-hand sides. After all assignments are complete, execution moves on to the next body statement. If the body statement is another FORALL, then the inner bounds must be computed (and may be different for every outer iteration) before the inner right-hand and left-hand sides, but execution still proceeds one statement at a time. The situation is similar for the mask in a nested WHERE statement.

One way to visualize this process is shown in Figure 4.1. The diagram to the left of the figure illustrates the data dependences possible in the example code
Fig. 4.1. Visualizations of FORALL and DO
FORALL ( I = 2:4 ) A(I) = A(I-1) + A(I+1) C(I) = B(I) * A(I+1) END FORALL In the figure, each elemental operation is shown as an oval. The first row represents the A(I-1)+A(I+1) computations; the second row represents the assignments to A(I); the third row represents the computation of B(I) * A(I+1); the bottom row represents the assignments to C(I). The numbers in the ovals are based on the initial values A(1:5) = (/ 0, 1, 2, 3, 4 /) B(1:5) = (/ 0, 10, 20, 30, 40 /) The reader can verify that the semantics above lead to a final result of A(2:4) = (/ 2, 4, 6 /) C(2:4) = (/ 40, 120, 120 /) Arrows in the diagram represent worst-case data dependences; that is, a FORALL could have any one of those dependences. (To simplify the picture, transitive dependences are not shown.) These dependences arise in two ways: – From right-hand side to left-hand side: the left-hand side may overwrite data needed to compute the right-hand side. – From left-hand side to right-hand side: the right-hand side may use data assigned by the left-hand side. By inspection of the diagram, it is clear that there are no connections running across rows. This is true in general, and indicates that it is always legal to execute all right-hand and all left-hand sides simultaneously. Also, it appears from the diagram that a global synchronization is needed between adjacent rows. This is also true in general, but represents an uncommon worst case. In the diagram, dark arrows represent dependences that actually arise in this example; light arrows are worst-case dependences that do not occur here. It is often the case, as here, that many worst-case dependences do not occur in a particular case; a good compiler will detect these cases and simplify communication and synchronization accordingly. In this case, for example, there is no need for synchronization during execution of the second statement. It is useful to contrast the FORALL diagram with the corresponding dependence structure for an ordinary DO loop with the same body. The DO is shown on the right side of Figure 4.1. One can immediately see that the dependence structures are different, and verify that the result of the DO loop has changed to A(2:4) = (/ 2, 5, 9 /) C(2:4) = (/ 20, 60, 120 /)
At first glance, the diagram appears simpler. However, the dependence arcs from the bottom of each iteration to the top eliminate parallelism in the general case. Of course, there are many cases where these serializing dependences do not occur—parallelizing and vectorizing compilers work by detecting those cases. However, if the analysis fails the DO will run sequentially, while the FORALL can run in parallel. The example above could also be written using array assignments.
A(2:4) = A(1:3) + A(3:5)
C(2:4) = B(2:4) * A(3:5)
However, FORALL can also access and assign to more complex array regions than can easily be expressed by array assignment. For example, consider the following FORALL statements.
FORALL ( I = 1:N ) A(I,I) = B(I,N-I+1)
FORALL ( J = 1:N ) C(INDX(J),J) = J*J
FORALL ( K = 1:N )
   FORALL ( L = 1:K ) D(K,L) = E( K*(K-1)/2 + L )
END FORALL
The first statement accesses the anti-diagonal of B and assigns it to the main diagonal of A. The anti-diagonal could be accessed using advanced features of the array syntax, but there is no way to do an array assignment to a diagonal or other non-rectangular region. The second statement does a computation using the values of the index and assigns them to an irregular region of the array C. Again, the right-hand side could be done using array syntax by creating a new array, but the left-hand side is too irregular to express by regular sections. Finally, the last FORALL nest unpacks the one-dimensional E array into the lower triangular region of the D array. Neither the left nor right-hand side can easily be expressed in array syntax alone. 4.1.2 PURE Functions. FORALL provides a form of “parallel loop” over the elements of an array, but the only statements allowed in the loop body are various forms of assignment. Many applications benefit from more complex operations on each element, such as point-wise iteration to convergence. HPF addresses these needs by allowing a class of functions—the PURE functions— to be called in FORALL assignments. PURE functions are safe to call from within FORALL because they cannot have side effects; therefore, they cannot create new dependences in the FORALL statement’s execution. It is useful to call PURE functions because they allow very complex operations to be performed, including internal control flow that cannot easily be expressed directly in the body of a FORALL. Syntactically, a PURE function is declared by adding the keyword PURE to its interface before the function’s type. PURE functions must have an explicit interface, so this declaration is visible to both the caller and the function itself. The more interesting syntactic issue is what can be included in the
PURE function. The HPF 2.0 specification has a long list of restrictions on statements that can be included. The simple statement of these restrictions is that any construct that could assign to a global variable or a dummy argument is not allowed. This includes obvious cases such as using a global variable on the left-hand side of an assignment and less obvious ones such as using a dummy argument as the target of a pointer assignment. (The latter case does not directly change the dummy, but allows later uncontrolled side effects through the pointer.) Of course, this list of restrictions leads directly to the desired lack of side effects in the function. It is important to note that there are no restrictions on control flow in the function, except that STOP and PAUSE statements are not allowed. This allows quite complex iterative algorithms to be implemented in PURE functions.
! The caller
FORALL ( I=1:N, J=1:M )
   K(I,J) = MANDELBROT( CMPLX((I-1)*1.0/(N-1), &
                               (J-1)*1.0/(M-1)), 1000 )
END FORALL
! The callee
PURE INTEGER FUNCTION MANDELBROT(X, ITOL)
   COMPLEX, INTENT(IN) :: X
   INTEGER, INTENT(IN) :: ITOL
   COMPLEX XTMP
   INTEGER K
   K = 0
   XTMP = -X
   DO WHILE (ABS(XTMP)
Fig. 6.4. MIDC-2 and MIDC-3 code examples.
is sent to Thread 256. Thread 255 does not block; it continues execution until termination. When the result of the split-phase read is available, it is forwarded to Thread 256, which starts execution when all its input data are available. There are no restrictions on the processors on which Thread 255 and Thread 256 are executed. In the MIDC-3 code, Thread 255 and Thread 256 belong to the same code block. The read structure memory operation is a local single-phase operation. Hence, the two threads become a single thread. The thread blocks when the read operation is encountered and waits for the read request to be satisfied.
Discussion of the Models. The main differences between the blocking and non-blocking models lie in their synchronization and thread switching strategies. The blocking model requires complex architectural support to efficiently switch between ready threads. The frame space is deallocated only when all the thread instances associated with its code block have terminated execution, which is determined by extensive static program analysis. The model also relies on static analysis to distribute the shared data structures and therefore reduce the overhead of split-phase accesses by making some data structure accesses local. The non-blocking model relies on a simple
scheduling mechanism driven by data availability. Once a thread completes execution, its framelet is deallocated and the space is reclaimed. The main difference between the Frame and Framelet models of synchronization is token duplication. The Framelet model requires that variables which are shared by several threads within a code block be replicated to all these threads, while in the Frame model these variables are allocated only once in the frame. The advantage of the Framelet model is that it is possible to design special storage schemes [31] that can take advantage of inter-thread and intra-thread locality and achieve a cache miss rate close to 1%.
6.3 Summary of Performance Results
This section summarizes the results of an experimental evaluation of the two execution models and their associated storage models. A preliminary version of these results was reported in [4]; detailed results are reported in [5]. The evaluation of the program execution characteristics of these two models shows that the blocking model has a significant reduction in threads, instructions, and synchronization operations executed with respect to the non-blocking model. It also has a larger average thread size (by 26% on average) and, therefore, a lower number of synchronization operations per instruction executed (17% lower on average). However, the total number of accesses to the Frame storage, in the blocking model, is comparable to the number of accesses to the Framelet storage in the non-blocking model. Although the Frame storage model eliminates the replication of data values, the synchronization mechanism requires that two or more synchronization slots (counters) be accessed for each shared data value. The number of synchronization accesses to the frames nearly offsets all the redundant accesses. In fact, the size of the trace of accesses to the frames is less than 3% smaller than the framelet trace size. Hence, synchronization overhead is the same for the frame and framelet models of synchronization. The evaluation also looked at the performance of a cache memory for the Frame and Framelet models. Both models exhibit a large degree of spatial locality in their accesses: in both cases the optimal cache block size was 256 bytes. However, the Framelet model has a much higher degree of temporal locality, resulting in an average miss rate of 1.82% as opposed to 5.29% for the Frame model (both caches being 16KB, 4-way set associative with 256-byte blocks). The execution time of the blocking model is highly dependent on the success rate of the static data distribution. The execution times for success rates of 100% or 90% are comparable and outperform those of the non-blocking model. For a success rate of 50%, however, the execution time may be higher than that of the non-blocking model. The performance, however, depends largely on the network latency. When the network latency is low and the
processor utilization is high, the non-blocking model performs as well as the blocking model with a 100% or 90% success rate.
7. Conclusions and Future Research The functional model of computation is one attempt at providing an implicitly parallel programming paradigm2. Because of its lack of state and its functional semantics, it allows the compiler to extract all available parallelism, fine and coarse grain, regular and irregular, and generate a partial evaluation order of the program. In its pure form (e.g., pure Lisp, Sisal, Haskell), this model is unable to express algorithms that rely explicitly on state. However, extensions to these languages have been proposed to allow a limited amount of stateful computation when needed. We are investigating the feasibility of the declarative programming style, both in terms of its expressibility and its run-time performance, over a wide range of numerical and non-numerical problems and algorithms, executing on both conventional and novel parallel architectures. We are also evaluating the ability of these languages to aid compiler analysis to disambiguate and parallelize data structure accesses. On the implementation side, we have demonstrated how multithreaded implementations combine the strengths of both the von Neumann model (in its exploitation of program and data locality) and the data-driven model (in its ability to hide latency and support efficient synchronization). New architectures such as TERA [2] and *T [26] are being built with hardware support for multithreading. In addition, software multithreading models such as TAM [13] and MIDC [8] are being investigated. We are currently further investigating the performance of both software-supported and hardware-supported multithreaded models on a wide range of parallel machines. We have designed and evaluated low-level machine-independent optimization and code generation for multithreaded execution. The target hardware platforms will be stock machines, such as single superscalar processors, shared memory, and multithreaded machines. We will also target more experimental dataflow machines (e.g., Monsoon [18, 37]). Acknowledgement. This research is supported in part by ARPA grant #DABT6395-0093 and NSF grant 53-4503-3481.
References
1. S. J. Allan and R. R. Oldehoeft. Parallelism in sisal: Exploiting the HEP architecture. In 19th Hawaii International Conference on System Sciences, pages 538–548, January 1986.
2 Other attempts include the vector, data parallel and object-oriented paradigms.
2. R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith. The Tera computer system. In Proceedings 1990 Int. Conf. on Supercomputing, pages 1–6. ACM Press, June 1990. 3. B. S. Ang, Arvind, and D. Chiou. StarT the Next Generation: Integrating Global Caches and Dataflow Architecture. Technical Report 354, LCS, Massachusetts Institute of Technology, August 1994. 4. M. Annavaram and W. Najjar. Comparison of two storage models in data-driven multithreaded architectures. In Proceedings of Symp. on Parallel and Distributed Processing, October 1996. 5. M. Annavaram, W. Najjar, and L. Roh. Experimental evaluation of blocking and non-blocking multithreaded code execution. Technical Report 97-108, Colorado State University, Department of Computer Science, www.cs.colostate.edu/ftppub/TechReports/, March 1997. 6. J. Backus. Can programming be liberated from the von Neumann style? Communications of the ACM, 21(8):613–641, 1978. 7. A. Böhm and J. Sargeant. Efficient dataflow code generation for sisal. Technical report, University of Manchester, 1985. 8. A. P. W. Böhm, W. A. Najjar, B. Shankar, and L. Roh. An evaluation of coarse-grain dataflow code generation strategies. In Working Conference on Massively Parallel Programming Models, Berlin, Germany, Sept. 1993. 9. D. Cann. Compilation Techniques for High Performance Applicative Computation. PhD thesis, Colorado State University, 1989. 10. D. Cann. Compilation techniques for high performance applicative computation. Technical Report CS-89-108, Colorado State University, 1989. 11. D. Cann. Retire FORTRAN? A debate rekindled. CACM, 35(8):81–89, Aug 1992. 12. D. Cann. Retire Fortran? A debate rekindled. Communications of the ACM, 35(8):81–89, 1992. 13. D. E. Culler et al. Fine grain parallelism with minimal hardware support: A compiler-controlled Threaded Abstract Machine. In Proc. 4th Int’l Conf. on Architectural Support for Programming Languages and Operating Systems, April 1991. 14. J. T. Feo, P. J. Miller, and S. K. Skedzielewski. Sisal90. In Proceedings of High Performance Functional Computing, April 1995. 15. Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, 1994. 16. G. Fox, S. Hiranandani, K. Kennedy, U. Kremer, C. Tseng, and M. Wu. Fortran D language specification. Technical Report CRPC-TR90079, Center for Research on Parallel Computation, Rice University, P.O. Box 1892, Houston, TX 77251-1892, 1990. 17. D. Garza-Salazar and W. Böhm. Reducing communication by honoring multiple alignments. In Proceedings of the 9th ACM International Conference on Supercomputing (ICS’95), pages 87–96, Barcelona, 1995. 18. J. Hicks, D. Chiou, B. Ang, and Arvind. Performance studies of Id on the Monsoon dataflow system. Journal of Parallel and Distributed Computing, 3(18):273–300, July 1993. 19. H. Hum, O. Macquelin, K. Theobald, X. Tian, G. Gao, P. Cupryk, N. Elmassri, L. Hendren, A. Jimenez, S. Krishnan, A. Marquez, S. Merali, S. Nemawarkar,
P. Panangaden, X. Xue, and Y. Zhu. A design study of the EARTH multiprocessor. In Parallel Architectures and Compilation Techniques, 1995. 20. R. Iannucci. A Dataflow/von Neumann Hybrid Architecture. Technical Report TR-418 (Ph.D. dissertation), Laboratory for Computer Science, MIT, Cambridge, MA, June 1988. 21. Y. Kodama, H. Sakane, M. Sato, H. Yamana, S. Sakai, and Y. Yamaguchi. The EM-X parallel computer: Architecture and basic performance. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 14–23, June 1995. 22. J. McGraw, S. Skedzielewski, S. Allan, D. Grit, R. Oldehoeft, J. Glauert, I. Dobes, and P. Hohensee. SISAL—Streams and Iterations in a Single Assignment Language, Language Reference Manual, version 1.2. Technical Report TR M-146, University of California – Lawrence Livermore Laboratory, March 1985. 23. P. Miller. TWINE: A portable, extensible sisal execution kernel. In J. Feo, editor, Proceedings of Sisal ’93. Lawrence Livermore National Laboratory, October 1993. 24. P. Miller. Simple sisal interpreter, 1995. ftp://ftp.sisal.com/pub/LLNL/SSI. 25. R. S. Nikhil, G. M. Papadopoulos, and Arvind. *T: A multithreaded massively parallel architecture. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 156–167, May 1992. 26. R. S. Nikhil, G. M. Papadopoulos, and Arvind. *T: A multithreaded massively parallel architecture. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 156–167, Gold Coast, Australia, May 19–21, 1992. ACM SIGARCH and IEEE Computer Society TCCA. Computer Architecture News, 20(2), May 1992. 27. R. Oldehoeft and D. Cann. Applicative parallelism on a shared-memory multiprocessor. IEEE Software, January 1988. 28. G. Papadopoulos. Implementation of a general-purpose dataflow multiprocessor. Technical Report TR-432, MIT Laboratory for Computer Science, August 1988. 29. G. M. Papadopoulos and D. E. Culler. Monsoon: an explicit token-store architecture. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 82–91, June 1990. 30. J. Rannelletti. Graph transformation algorithms for array memory optimization in applicative languages. PhD thesis, U. California, Davis, 1987. 31. L. Roh and W. Najjar. Design of storage hierarchy in multithreaded architectures. In IEEE Micro, pages 271–278, November 1995. 32. L. Roh, W. Najjar, B. Shankar, and A. P. W. Böhm. Generation, optimization and evaluation of multithreaded code. J. of Parallel and Distributed Computing, 32(2):188–204, February 1996. 33. L. Roh, W. A. Najjar, B. Shankar, and A. P. W. Böhm. An evaluation of optimized threaded code generation. In Parallel Architectures and Compilation Techniques, Montreal, Canada, 1994. 34. S. Sakai, K. Hiraki, Y. Yamaguchi, and T. Yuba. Optimal Architecture Design of a Data-flow Computer. In Japanese Symposium on Parallel Processing, 1989. In Japanese.
35. S. Skedzielewski and R. Simpson. A simple method to remove reference counting in applicative programs. In Proceedings of CONPAR 88, Sept 1988. 36. S. K. Skedzielewski, R. K. Yates, and R. R. Oldehoeft. DI: An interactive debugging interpreter for applicative languages. In Proceedings of the ACM SIGPLAN 87 Symposium on Interpreters and Interpretive Techniques, pages 102–109, June 1987. 37. K. Traub. Monsoon: Dataflow Architectures Demystified. In Proc. Imacs 91 13th Congress on Computation and Applied Mathematics, 1991. 38. M. Welcome, S. Skedzielewski, R. Yates, and J. Ranelleti. IF2: An applicative language intermediate form with explicit memory management. Technical Report TR M-195, University of California - Lawrence Livermore Laboratory, December 1986.
Chapter 3. HPC++ and the HPC++Lib Toolkit
Dennis Gannon1, Peter Beckman2, Elizabeth Johnson1, Todd Green1, and Mike Levine1
1 Department of Computer Science, Indiana University
2 Los Alamos National Laboratory
1. Introduction
The High Performance C++ consortium is a group that has been working for the last two years on the design of a standard library for parallel programming based on the C++ language. The consortium consists of people from research groups within universities, industry and government laboratories. The goal of this effort is to build a common foundation for constructing portable parallel applications. The design has been partitioned into two levels. Level 1 consists of a specification for a set of class libraries and tools that do not require any extension to the C++ language. Level 2 provides the basic language extensions and runtime library needed to implement the full HPC++ Level 1 specification. Our goal in this chapter is to briefly describe part of the Level 1 specification and then provide a detailed account of our implementation strategy. Our approach is based on a library, HPC++Lib, which is described in detail in this document. We note at the outset that HPC++Lib is not unique and the key ideas are drawn from many sources. In particular, many of the ideas originate with K. Mani Chandy and Carl Kesselman in the CC++ language [6, 15], the MPC++ Multiple Threads Template Library designed by Yutaka Ishikawa of RWCP [10], the IBM ABC++ library [15, 16], the Object Management Group CORBA specification [9] and the Java concurrency model [1]. Carl Kesselman at USC ISI is also building an implementation of HPC++ using CC++ as the level 2 implementation layer. Our implementation builds upon a compiler technology developed in collaboration with ISI, but our implementation strategy is different. The key features of HPC++Lib are
– A Java style thread class that provides an easy way to program parallel applications on shared memory architectures. This thread class is also used to implement the loop parallelization transformations that are part of the HPC++ level 1 specification.
– A template library to support synchronization, collective parallel operations such as reductions, and remote memory references.
– A CORBA style IDL-to-proxy generator is used to support member function calls on objects located in remote address spaces. This chapter introduces the details of this programming model from the application programmer’s perspective and describes the compiler support required to implement and optimize HPC++.
2. The HPC++ Programming and Execution Model
The runtime environment for HPC++ can be described as follows. The basic architecture consists of the following components.
– A node is a shared-memory multiprocessor (SMP), possibly connected to other SMPs via a network. Shared memory is a coherent shared-address space that can be read and modified by any processor in the node. A node could be a laptop computer or a 128-processor SGI Origin 2000.
– A context refers to a virtual address space on a node, usually accessible by several different threads of control. A Unix process often represents a context. We assume that there may be more than one context per node in a given computation.
– A set of interconnected nodes constitutes a system upon which an HPC++ program may be run.
There are two conventional modes of executing an HPC++ program. The first is “multi-threaded, shared memory” where the program runs within one context. Parallelism comes from the parallel loops and the dynamic creation of threads. Sets of threads and contexts can be bound into Groups and there are collective operations such as reductions and prefix operators that can be applied to synchronize the threads of a group. This model of programming is very well suited to modest levels of parallelism (about 32 processors) and to cases where memory locality is not a serious factor.
Fig. 2.1. A SPMD program on three nodes with four contexts. Each context may have a variable number of threads.
The second mode of program execution is an explicit “Single Program Multiple Data” (SPMD) model where n copies of the same program are run on n different contexts. This programming model is similar to that of Split-C [7], pC++ [15], AC [5] or C/C++ with MPI or PVM in that the distribution of data that must be shared between contexts and the synchronization of accesses to that data must be managed by the programmer. HPC++ differs from these other C-based SPMD systems in that the computation on each context can also be multi-threaded and the synchronization mechanisms for thread groups extend to sets of thread groups running in multiple contexts. It should also be noted that an SPMD computation need not be completely homogeneous: a program may contain two contexts on one node and one context on each of two other nodes. Furthermore, each of these contexts may contain a variable number of threads (see Figure 2.1). Multi-context SPMD programming with multi-threaded computation within each context supports a range of applications, such as adaptive grid methods for large scale simulation, that are best expressed using a form of multi-level parallelism.
2.1 Level 1 HPC++
The level 1 library has three components. The first component is a set of simple loop directives that control parallelism within a single context. The compiler is free to ignore these directives, but if there is more than one processor available, it can use the directives to parallelize simple loops. The HPC++ loop directives are based on ideas from HPF [8] and other older proposals. The idea is very simple. The HPC++ programmer can identify a loop and annotate it with a #pragma to inform the compiler it is “independent”. This means that each iteration is independent of every other iteration, and they are not ordered. Consequently, the compiler may choose to execute the loop in parallel, and generate the needed synchronization for the end of the loop. In addition, variables that do not carry loop dependences can be labeled as PRIVATE so that one copy of the variable is generated for each iteration. Furthermore, in the case of reductions, it is possible to label a statement with the REDUCE directive so that the accumulation operations will be atomic. As a simple example, consider the following function which will multiply an n by n matrix with a vector. This function may generate up to n^2 parallel threads because both loops are labeled as HPC_INDEPENDENT. However, the compiler and the runtime system must work together to choose when new threads of control will be created and when loops will be performed sequentially. Also, each iteration of the outer loop uses the variable tmp, labeled PRIVATE, to accumulate the inner product. The atomicity of the reduction is guaranteed by the HPC_REDUCE directive at the innermost level.
void Matvec(double **A, int n, double *X, double *Y){
   double tmp;
#pragma HPC_INDEPENDENT, PRIVATE tmp
   for(int i = 0; i < n; i++){
      tmp = 0;
#pragma HPC_INDEPENDENT
      for(int j = 0; j < n; j++){
#pragma HPC_REDUCE tmp
         tmp += A[i][j]*X[j];
      }
      Y[i] = tmp;
   }
}
In section 5 below we will describe the program transformations that the compiler must undertake to recast the annotated loop above into a parallel form using the HPC++ Thread library.
2.2 The Parallel Standard Template Library
As described above, there are two execution models for HPC++ programs. For the single context model, an HPC++ program is launched as an ordinary C++ program with an initial single main thread of control. If the context is running on a node with more than one processor, parallelism can be exploited by using parallel loop directives, the HPC++ Parallel Standard Template Library (PSTL), or by spawning new threads of control. For multiple context execution, an HPC++ program launches one thread of control to execute the program in each context. This Single Program Multiple Data (SPMD) mode is a model of execution that is easily understood by programmers even though it requires the user to reason about and debug computations where the data structures are distributed over multiple address spaces. The HPC++ library is designed to help simplify this process. One of the major recent changes to the C++ standard has been the addition of the Standard Template Library (STL) [13, 14]. The STL has five basic components.
– Container class templates provide standard definitions for common aggregate data structures, including vector, list, deque, set and map.
– Iterators generalize the concept of a pointer. Each container class defines an iterator that gives us a way to step through the contents of containers of that type.
– Generic Algorithms are function templates that allow standard element-wise operations to be applied to containers.
– Function Objects are created by wrapping functions with classes that typically have only operator() defined. They are used by the generic algorithms in place of function pointers because they provide greater efficiency.
– Adaptors are used to modify STL containers, iterators, or function objects. For example, container adaptors are provided to create stacks and queues, and iterator adaptors are provided to create reverse iterators to traverse an iteration space backwards.
The Parallel Standard Template Library (PSTL) is a parallel extension of STL. Distributed versions of the STL container classes are provided along with versions of the STL algorithms which have been modified to run in parallel. In addition, several new algorithms have been added to support standard parallel operations such as the element-wise application of a function and parallel reduction over container elements. Finally, parallel iterators have been provided. These iterators extend global pointers and are used to access remote elements in distributed containers.
2.3 Parallel Iterators
STL iterators are generalizations of C++ pointers that are used to traverse the contents of a container. HPC++ parallel iterators are generalizations of this concept to allow references to objects in different address spaces. In the case of random access parallel iterators, the operators ++, --, +n, -n, and [i] allow random access to the entire contents of a distributed container. In general, each distributed container class C will have a subclass for the strongest form of parallel iterator that it supports (e.g. random access, forward or bidirectional) and begin and end iterator functions. For example, each container class will provide functions of the form
template <class T> class Container{
   ....
   class pariterator{
      ...
   };
   pariterator parbegin();
   pariterator parend();
};
2.4 Parallel Algorithms
In HPC++ PSTL there are two types of algorithms. First are the conventional STL algorithms like for_each(), which can be executed in parallel if called with parallel iterators. The second type includes STL algorithms where the semantics of the algorithm must be changed to make sense in a parallel context, as well as several new algorithms that are very common in parallel computation. Algorithms in this second group are identified by the prefix
par_, and may be invoked with the standard random access iterators for single context parallelism or with parallel random access iterators for multi-context SPMD parallelism. The most important of the new parallel algorithms in HPC++ STL are
– par_apply(begin1, end1, begin2, begin3, ..., f()) which applies a function object pointwise to the elements of a set of containers.
– par_reduction(begin1, end1, begin2, begin3, ..., reduce(), f()) which is a parallel apply followed by a reduction on an associative binary operator.
– par_scan(result_begin, result_end, begin2, begin3, ..., scanop(), f()) which is a parallel apply followed by a parallel prefix computation.
2.5 Distributed Containers
The HPC++ container classes include versions of each of the STL containers prefixed by the phrase distributed to indicate that they operate in a distributed SPMD execution environment. Constructors for these containers are collective operations, i.e. they must be invoked in each executing context in parallel. For example, a distributed vector with elements of type T is constructed with
distributed_vector<T> X(dim0, &distrib_object);
The last parameter is a distribution object which defines the mapping of the array index space to the set of contexts active in the computations. If the distribution parameter is omitted, then a default block distribution is assumed. A more complete description of the Parallel STL is given in [11].
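To make these calling conventions concrete, the following is a small usage sketch, assuming the HPC++Lib/PSTL declarations described in this chapter are in scope. The function-object names Scale, Add and Ident and the function scale_and_sum are illustrative only and are not part of the library specification.
// Hypothetical usage sketch: scale every element of a distributed vector
// in parallel, then sum the results.
struct Scale {
   double factor;
   Scale(double f): factor(f) {}
   void operator()(double &x) { x *= factor; }   // applied element-wise by par_apply
};
struct Add {
   // associative binary operator, as required for reductions
   double & operator()(double &x, double &y) { x += y; return x; }
};
struct Ident {
   double & operator()(double &x) { return x; }  // the "apply" step of the reduction
};
void scale_and_sum(long n) {
   distributed_vector<double> V(n);              // default block distribution
   // ... each context fills its local section of V ...
   par_apply(V.parbegin(), V.parend(), Scale(2.0));
   double total = par_reduction(V.parbegin(), V.parend(), Add(), Ident());
   // every context now holds the global sum in 'total'
}
The same pattern, with a min-style reduction and a distance-update functor, underlies the spanning tree example in the next section.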
3. A Simple Example: The Spanning Tree of a Graph The minimum spanning tree algorithm [12] takes a graph with weighted connections and attempts to find a tree that contains every vertex of the graph so that the sum of connection weights in the tree is minimal. The graph is represented by the adjacency matrix W of dimensions n ∗ n, where n is the number of vertices in the graph. W [i, j] contains the weight of the connection between vertex i and vertex j. W [i, j] is set to infinity if vertex i and vertex j are not connected. The algorithm starts with an arbitrary vertex of the graph and considers it to be the root of the tree being created. Then, the algorithm iterates n − 1 times choosing one more vertex from the pool of unselected vertices during each iteration. The pool of unselected vertices is represented by the distance vector D. D[i] is the weight of the connection from an unselected vertex i to the closest selected vertex. During each iteration, the algorithm selects a vertex whose corresponding D value is the smallest among all the
unselected vertices. It adds the selected vertex to the tree and updates the values in D for the rest of the unselected vertices in the following way. For each remaining vertex, it compares the corresponding D value with the weight of the connection between the newly selected vertex and the remaining vertex. If the weight of the new connection is less than the old D value, it is stored in D. After the n − 1 iterations D will contain the weights of the selected connections. We can parallelize this algorithm by searching for the minimum in D and updating D in parallel. To conserve memory, we decided to deal with sparse graphs and impose a limit on the number of edges for any one vertex. We represent the adjacency matrix W by a distributed vector of an edge list of pairs. Each edge list describes all the edges for one vertex; each pair represents one weighted edge where the first element is the weight of the edge and the second element is the index of the destination vertex. class weighted_edge{ int weight; int vertex; }; struct edge_list { typedef weighted_edge* iterator; weighted_edge my_edges[MAX_EDGES]; int num_edges; iterator begin() { return my_edges; } iterator end() { return my_edges+num_edges; } };
typedef distributed_vector<edge_list> Graph;
Graph W(n);
We represent the distance vector D by a distributed vector of pairs. The first element of each pair is the D value, the weight of the connection from the corresponding unselected vertex to the closest selected vertex. The second element of each pair is used as a flag of whether the corresponding vertex has already been selected into the tree. It is set to the pair’s index in D until the vertex is selected and is assigned -1 after the vertex is selected.
struct cost_to_edge{
   int weight;
   long index;
   cost_to_edge(int _weight, long _to_vertex);
};
typedef distributed_vector<cost_to_edge> DistanceVector;
DistanceVector D(n);
The main part of the program is a for loop that repeatedly finds a minimum in the distance vector D using par_reduction, marks the found vertex as selected, and updates the distance vector using par_apply. The call to par_reduction uses the identity function-class as the operation to apply to each element of the distributed vector (it simply returns its argument) and a min function-class as the reducing operation (it compares two pairs and returns the one with smaller weight). Min also requires an initial value for the reduction; in this case an edge cost pair with weight INT_MAX.
for(long i=1; i < n; i++){
   ....
   if (v.index >= 0) {
      Graph::pariterator w_iter = w.parbegin();
      edge_list wi = w_iter[v.index];
      // find an edge from v.index that goes to u
      edge_list::iterator temp = find_if(wi.begin(), wi.end(), FindVert(u));
      int weight_uv = (temp == wi.end()) ? INT_MAX : (*temp).weight;
      if (v.weight > weight_uv)
         v = cost_to_edge(weight_uv, v.index);
      }
   }
};
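Because the surrounding driver code is garbled in this copy of the text, the following is a rough sketch, not the authors' original code, of how the selection and update steps described above could be expressed with the PSTL calls. The functor names Identity, MinCost and UpdateDistance, and their members, are assumptions made for illustration.
// Hypothetical reconstruction: one iteration selects the cheapest unselected
// vertex with par_reduction, then relaxes distances with par_apply.
struct Identity {
   cost_to_edge & operator()(cost_to_edge &c) { return c; }
};
struct MinCost {
   // associative "min" operator; an initial value with weight INT_MAX would be
   // supplied when the functor is constructed
   cost_to_edge & operator()(cost_to_edge &a, cost_to_edge &b) {
      return (a.weight <= b.weight) ? a : b;
   }
};
struct UpdateDistance {
   Graph &w;      // adjacency structure
   long u;        // index of the vertex selected in this iteration
   UpdateDistance(Graph &w_, long u_): w(w_), u(u_) {}
   void operator()(cost_to_edge &v) {
      if (v.index >= 0) {                  // vertex not yet selected
         Graph::pariterator w_iter = w.parbegin();
         edge_list wi = w_iter[v.index];
         // ... compare the edge (v.index, u) against v.weight as shown above ...
      }
   }
};
// Inside the main loop the calls would then look roughly like:
//   cost_to_edge u = par_reduction(D.parbegin(), D.parend(), MinCost(), Identity());
//   // mark u.index as selected by setting its flag to -1
//   par_apply(D.parbegin(), D.parend(), UpdateDistance(W, u.index));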
processors (P)   graph init. time (sec)   computation time (sec)   speed-up
      1                 0.128                    22.94                1
      2                 0.215                    11.96                1.91
      3                 0.258                     8.67                2.65
      4                 0.308                     6.95                3.30
      5                 0.353                     5.84                3.93
      6                 0.371                     5.40                4.25
      7                 0.402                     5.03                4.59
      8                 0.470                     4.94                4.64
Table 3.1. Spanning Tree Performance Results
The basic parallel STL has been prototyped on the SGI Power Challenge and the IBM SP2. In Table 3.1 we show the execution time for the spanning tree computation on a graph with 1000 vertices on the SGI. The computation is dominated by the reduction. This operation will have a speed-up that grows as P/(1 + C*log(P)/N), where P is the number of processors, N is the problem size, and C is the ratio of the cost of a memory reference in a remote context to that of a local memory reference. In our case that is approximately 200. The resulting speed-up for this size problem with 8 processors is 5, so our computation is performing about as well as we might expect. We have also included the time to build the graph. It should be noted that the time to build the graph grows with the number of processors. We have not yet attempted to optimize the parallelization of this part of the program.
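As a quick check of this model, the small standalone program below (an illustration only, not part of HPC++Lib) evaluates the formula with the numbers from the experiment; taking the logarithm base 2 reproduces the speed-up of about 5 quoted above, close to the measured 4.64 in Table 3.1.
#include <cmath>
#include <cstdio>

// Predicted speed-up for the reduction-dominated computation:
//   speedup(P) = P / (1 + C * log2(P) / N)
// where C is the remote/local memory reference cost ratio and N the problem size.
double predicted_speedup(int P, double C, double N) {
   return P / (1.0 + C * std::log2(double(P)) / N);
}

int main() {
   // C = 200, N = 1000 vertices, P = 8 processors: 8 / (1 + 0.6) = 5.0
   std::printf("%.2f\n", predicted_speedup(8, 200.0, 1000.0));
   return 0;
}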
4. Multi-threaded Programming The implementation of HPC++ described here uses a model of threads that is based on a Thread class which is, by design, similar to the Java thread system. More specifically, there are two basic classes that are used to instantiate a thread and get it to do something. Basic Thread objects encapsulate a thread and provide a private data space. Objects of class Runnable provide a convenient way for a set of threads to execute the member functions of a shared object. The interface for Thread is given by class HPCxx_Thread{ public: HPCxx_Thread(const char *name = NULL); HPCxx_Thread(HPCxx_Runnable *runnable, const char *name = NULL); virtual ~HPCxx_Thread(); HPCxx_Thread& operator=(const HPCxx_Thread& thread); virtual void run(); static void stop(void *status); static void yield(); void resume(); int isAlive(); static HPCxx_Thread *currentThread(); void join(long milliseconds = 0, long nanoseconds = 0); void setName(const char *name); const char *getName(); int getPriority(); int setPriority(int priority); static void sleep(long milliseconds, long nanoseconds = 0); void suspend(); void start(); }; The interface for Runnable is given by class HPCxx_Runnable{ public: virtual void run() = 0; }; There are two ways to create a thread and give it work to do. The first is to create a subclass of Runnable which provides an instance of the run() method. For example, to make a class that prints a message we can write
class MyRunnable: public HPCxx_Runnable{
   char *x;
public:
   MyRunnable(char *c): x(c){}
   void run(){ printf(x); }
};
The program below will create an instance of two threads that each run the run() method for a single instance of a runnable object.
MyRunnable r("hello world");
Thread *t1 = new Thread(&r);
Thread *t2 = new Thread(&r);
t1->start(); // launch the thread but don't block
t2->start();
This program prints
hello worldhello world
It is not required that a thread have an object of class Runnable to execute. One may subclass Thread to provide a private data and name space for a thread and overload the run() function there as shown below.
class MyThread: public HPCxx_Thread{
   char *x;
public:
   MyThread(char *y): x(y), HPCxx_Thread(){}
   void run(){ printf(x); }
};
int main(int argc, char *argv[]){
   HPCxx_Group *g;
   hpcxx_init(&argc, &argv, g);
   MyThread *t1 = new MyThread("hello");
   t1->start();
   return hpcxx_exit(g);
}
The decision for when to subclass Thread or Runnable depends upon the application. As we shall see in the section on implementing HPC++ parallel loops, there are times when both approaches are used together. The initialization function hpcxx_init() strips all command line flags of the form -hpcxx from the argv and argc arguments so that application flags are passed
to the program in normal order. This call also initializes the object g of type HPCxx_Group which is used for synchronization purposes and is described in greater detail below. The termination function hpcxx_exit() is a clean-up and termination routine. It should be noted that in this small example, it is possible for the main program to terminate prior to the completion of the two threads. This would signal an error condition. We will discuss the ways to prevent this from happening in the section on synchronization below.
4.1 Synchronization
There are two types of synchronization mechanisms used in this HPC++ implementation: collective operator objects and primitive synchronization objects. The collective operations are based on the HPCxx_Group class which plays a role in HPC++ that is similar to that of the communicator in MPI. 4.1.1 Primitive Sync Objects. There are four basic synchronization classes in the library: An HPCxx_Sync<T> object is a variable that can be written to once and read as many times as you want. However, if a read is attempted prior to a write, the reading thread will be blocked. Many readers can be waiting for a single HPCxx_Sync<T> object and when a value is written to it all the readers are released. Readers that come after the initial write see this as a const value. CC++ provides this capability as the sync modifier. The standard methods for HPCxx_Sync<T> are
// // // // // //
read a value assign a value another form of read another form of writing TRUE if the value is there, returns FALSE otherwise.
}; HP Cxx SyncQ < T > provides a dual ”queue” of values of type T. Any attempt to read a sync variable before it is written will cause the reading thread to suspend until a value has been assigned. The thread waiting will ”take the value” from the queue and continue executing. Waiting threads are also queued. The ith thread in the queue will receive the ith value written to the sync variable.
HPC++ and the HPC++Lib Toolkit
85
There are several other standard methods for SyncQ< T >. template class HPCxx_SyncQ{ public: operator T(); operator =(T &); void read(T &); void write(T &); int length();
// // // // //
read a value assign a value another form of read another form of writing the number of values in the queue
// wait until the value is there and then // read the value but do not remove it from // the queue. The next waiting thread is signaled. void waitAndCopy(T& data); bool peek(T &); // same as Sync } For example, threads that synchronize around a producer-consumer interaction can be easily build with this mechanism. class Producer: public HPCxx_Thread{ HPCxx_SyncQ &x; public: Producer( HPCxx_SyncQ &y): x(y){} void run(){ printf("hi there\n"); x = 1; // produce a value for x } }; int main(int argc, char *argv[]){ Hpcxx_Group *g; hpcxx_init(&argc, &argv, g); HPCxx_SyncQ a; MyThread *t = new Producer(a); printf("start then wait for a value to assigned\"); t->start(); int x = a; // consume a value here. hpcxx_exit(g); return x; } Counting semaphores.. HPCxx CSem provide a way to wait for a group of threads to synchronize termination of a number of tasks. When constructed, a limit value is supplied and a counter is set to zero. A thread executing
86
Dennis Gannon et al.
waitAndReset() will suspend until the counter reaches the ”limit” value. The counter is then reset to zero. The overloaded “++” operator increments the counter by one. class HPCxx_CSem{ public: HPCxx_CSem(int limit); // prefix and postfix ++ operators. HPCxx_CSem& operator++(); const HPCxx_CSem& operator++(); HPCxx_CSem& operator++(int); const HPCxx_CSem& operator++(int); waitAndReset(); // wait until the count reaches the limit // then reset the counter to 0 and exit. }; By passing a reference to a HPCxx CSem to a group of threads each of which does a “++” prior to exit, you can build a multi-threaded ”join” operation. class Worker: public HPCxx_Thread{ HPCxx_CSem &c; public: Worker(HPCxx_CSem &c_): c(c_){} void run(){ // work c++; } }; int main(int argc, char *argv[]){ HPCxx_Group *g; hpcxx_init(&argc, &argv, g); HPCxx_CSem cs(NUMWORKERS); for(int i = 0; i < NUMWORKERS; i++) Worker *w = new Worker(cs); w->start(); } cs.waitAndReset(); //wait here for all workers to finish. hpcxx_exit(g); return 0; } Mutex locks. Unlike Java, the library cannot support synchronized methods or CC++ atomic members, but a simple Mutex object with two functions lock and unlock provide the basic capability.
HPC++ and the HPC++Lib Toolkit
87
class HPCxx_Mutex{ public: void lock(); void unlock(); }; To provide a synchronized method that only allows one thread at a time execution authority, one can introduce a private mutex variable and protect the critical section with locks as follows. class Myclass: public HPCxx_Runnable{ HPCxx_Mutex l; public: void synchronized(){ l.lock(); .... l.unlock(); } 4.1.2 Collective Operations. Recall that an HPC++ computation consists of a set of nodes, each of which contains one or more contexts. Each context runs one or more threads. To access the node and context structure of a computation the HPC++Lib initialization creates an object called a group. The HP Cxx Group class has the following public interface. class HPCxx_Group{ public: // Create a new group for the current context. HPCxx_Group(hpcxx_id_t &groupID = HPCXX_GEN_LOCAL_GROUPID, const char *name = NULL); // Create a group whose membership is this context //and those in the list HPCxx_Group(const HPCxx_ContextID *&id, int count, hpcxx_id_t &groupID = HPCXX_GEN_LOCAL_GROUPID, const char *name = NULL); ~HPCxx_Group(); hpcxx_id_t &getGroupID(); static HPCxx_Group *getGroup(hpcxx_id_t groupID); // Get the number of contexts that are participating // in this group int getNumContexts(); // Return an ordered array of context IDs in // this group. This array is identical for every member
88
Dennis Gannon et al.
// of the group. HPCxx_ContextID *getContextIDs(); // Return the context id for zero-based context where // is less than the current number of contexts HPCxx_ContextID getContextID(int context); // Set the number of threads for this group in *this* // context. void setNumThreads(int count); int getNumThreads(); void setName(const char *name); const char *getName(); }; As shown below, a node contains all of the contexts running on the machine, and the mechanisms to create new ones. class HPCxx_Node{ public: HPCxx_Node(const char *name = NULL); HPCxx_Node(const HPCxx_Node &node); ~HPCxx_Node(); bool contextIsLocal(const HPCxx_ContextID &id); int getNumContexts(); // Get an array of global pointers to the contexts // on this node. HPCxx_GlobalPtr *getContexts(); // Create a new context and add it to this node int addContext(); // Create a new context and run the specified executable int addContext(const char *execPath, char **argv); void setName(const char *name); const char *getName(); }; A context keeps track of the threads, and its ContextID provides a handle that can be passed to other contexts. class HPCxx_Context{ public: HPCxx_Context(const char *name=NULL); ~HPCxx_Context(); HPCxx_ContextID getContextID(); bool isMasterContext();
HPC++ and the HPC++Lib Toolkit
89
// Return the current number of threads in this context int getNumThreads(); // Null terminated list of the current threads in this node hpcxx_id_t *getThreadIDs(); // Return the number of groups of which this context is // a member. int getNumGroups(); // Return a list of the groups of which this context is // a member. hpcxx_id_t *getGroupIDs(); void setName(const char *name); const char *getName(); }; A group object represents a set of nodes and contexts and is the basis for collective operations. Groups are used to identify sets of threads and sets of contexts that participate in collective operations like barriers. In this section we only describe how a set of threads on a single context can use collective operations. Multicontext operations will be described in greater detail in the multi-context programming sections below. The basic operation is barrier synchronization. This is accomplished in following steps: We first allocate an object of type HPCxx Group and set the number of threads to the maximum number that will participate in the operation. For example, to set the thread count on the main group to be 13 we can write the following. int main(int argc, char *argv[]){ HPCxx_Group *g hpcxx_init(&argc, &argv, g); g->setThreadCout(13); HPCxx_Barrier barrier(*g); As shown above, a HPCxx Barrier object must be allocated for the group. This can be accomplished in three ways: – Use the group created in the initialization hpcxx init(). This is the standard way SPMD computations do collective operations and it is described in greater detail below. – Allocate the group with the constructor that takes an array of context IDs as an argument. This provides a limited form of ”subset” SIMD parallelism and will also be described in greater detail later. – Allocate a group object with the void constructor. This group will refer to this context only and will only synchronize threads on this context. The constructor for the barrier takes a reference to the Group object.
90
Dennis Gannon et al.
Each thread that will participate in the barrier operation must then acquire a key from the barrier object with the getKey() function. Once the required number of threads have a key to enter the barrier, the barrier can be invoked by means of the overloaded () operator as shown in the example below. class Worker: public HPCxx_Thread{ int my_key; HPCxx_Barrier &barrier; public: Worker(HPCxx_Barrier & b): barrier(b){ my_key = barrier.getKey(); } void run(){ while( notdone ){ // work barrier(key); } } }; int main(int argc, char *argv[]){ HPCxx_Group *g; hpcxx_init(&argc, &argv, g); g->setThreadCout(13); HPCxx_Barrier barrier(g); for(int i = 0; i < 13; i++){ Worker *w = new Worker(barrier); w->start(); } hpcxx_exit(g); } A thread can participate in more than one barrier group and a barrier can be deallocated. The thread count of a Group may be changed, a new barrier may be allocated and thread can request new keys. Reductions. Other collective operations exist and they are subclasses of the HP Cxx Barrier class. For example, let intAdd be the class, class intAdd{ public: int & operator()(int &x, int &y) { x += y; return x;} }; To create an object that can be used to form the sum-reduction of one integer from each thread, the declaration takes the form HPCxx_Reduct1 r(group);
HPC++ and the HPC++Lib Toolkit
91
and it can be used in the threads as follows: class Worker: public HPCxx_Thread{ int my_key; HPCxx_Reduct1 &add; public: Worker(HPCxx_Reduct1 & a): add(a){ my_key = add.getKey(); } void run(){ int x =3.14*my_id; // now compute the sum of all x values int t = add(key, x, intAdd() ); } } }; The public definition of the reduction class is given by template class HPCxx_Reduct1: public HPCxx_Barrier{ public: HPCxx_Reduct1(HPCxx_Group &); T operator()(int key, T &x, Oper op); T* destructive(int key, T *buffer, Oper op); }; The operation can be invoked with the overloaded () operation as in the example above, or with the destructive() form which requires a user supplied buffer to hold the arguments and returns a pointer to the buffer that holds the result. to avoid making copies all of the buffers are modified in the computation. This operation is designed to be as efficient as possible, so it is implemented as a tree reduction. Hence the binary operator is required to be associate, i.e. op(x, op(y, z)) == op( op(x, y), z) The destructive form is much faster if the size of the data type T is large. A mult-argument form of this reduction will allow operations of the form sum =
Op(x1i , x2i , ..., xKi ) i=0,n
and it is declared as by the template template < class R, class T1, class T2, ... TK , class Op1, class Op2 > class HPCxx_ReductK{ public:
Dennis Gannon et al.
92
HPCxx_ReductK(Hpxx_Group &); R & operator()(int key, T1, T2, ..., Tk , Op2, Op1); }; where K is 2, 3, 4 or 5 in the current implementation and Op1 returns a value of type R and Op2 is an associative binary operator on type R. Broadcasts. A synchronized broadcast of a value between a set of threads is accomplished with the operation template < class T > class HPCxx_Bcast{ public: HPCxx_Bcast(HpxxGroup &); T operator()(int key, T *x); In this case, only one thread supplies a non-null pointer to the value and all the others receive a copy of that value. Multicasts. A value in each thread can be concatenated into a vector of values by the synchronized multicast operation. template < class T > class HPCxx_Mcast{ public: HPCxx_Mcast(Hpxx_Group &); T * operator()(int key, T &x); In this case, the operator allocates an array of the appropriate size and copies the argument values into the array in ”key” order. 4.2 Examples of Multi-threaded Computations 4.2.1 The NAS EP Benchmark. The NAS Embarrassingly Parallel benchmark illustrates a common approach to parallelizing loops using the thread library. The computation consists of computing a large number of Gaussian pairs and gathering statistics about the results (see [2] for more details). The critical component of the computation is a loop of the form: double q[nq], gc; for(k = 1; k variable is used to synchronize the spawned thread with the calling thread. template void pqsort( T *x, int low, int high, int depth){ HPCxx_Sync s; int i = low; int j = high; int k = 0; int checkflag = 0; if (i >= j) return; T m = x[i]; while( i < j){ while ((x[j] >= m) && (j > low)) j--; while ((x[i] < m) && (i < high)) i++; if(i < j){ swap(x[i], x[j]); } } if(j < high){ if( (j-low < MINSIZE) || (depth > MAXDEPTH) ){ pqsort(x, low, j, depth+1); } else{ SortThread Th(s, x, low, j, depth+1); Th.start(); checkflag = 1; } }
Dennis Gannon et al.
96
if(j+1 > low) pqsort(x, j+1, high, depth); int y; if(checkflag) s.read(y); } The SortT hread < T > class is based on a templated subclass of thread. The local state of each thread contains a reference to the sync variable, a pointer to the array of data and the index values of the range to be sorted and the depth of the tree when the object was created. template class SortThread: public HPCxx_Thread{ HPCxx_Sync &s; T *x; int low, high; int depth; public: SortThread(HPCxx_Sync &s_, T *x_, int low_, int high_, int depth_): s(s_), x(x_), low(low_), high(high_), depth(depth_), HPCxx_Thread(NULL){} void run(){ int k = 0; pqsort(x, low, high, depth); s.write(k); } }; This program was run on the SGI Power Challenge with 10 processors for a series of random arrays of 5 million integers. As noted above, the speed-up is bounded by the theoretical limit of about 7. We also experimented with the MINSIZE and MAXDEPTH parameters. We found that a MINSIZE of 500 and a MAXDEPTH of 6 gave the results. We ran several hundred experiments and found that speed-up ranged from 5.6 to 6.6 with an average of 6.2 for 10 processors. With a depth of 6 the number of threads was bounded by 128. What is not clear from our experiments is the role thread scheduling and thread initialization play in the final performance results. This is a topic that is under current investigation.
5. Implementing the HPC++ Parallel Loop Directives Parallel loops in HPC++ can arise in two places: inside a standard function or in a member function for a class. In both cases, these occurrences may be part of an instantiation of a template. We first describe the transformational
HPC++ and the HPC++Lib Toolkit
97
problems associated with class member functions. The approach we use is based on a style of loop transformations invented by Aart Bik for parallelizing loops in Java programs [4]. The basic approach was illustrated in the Embarrassingly Parallel example in the previous section. Consider the following example. class C{ double x[N]; float &y; public: void f(float *z, int n){ double tmp = 0; #pragma HPC_INDEPENDENT for(int i = 0; i < n; i++){ #pragma HPC_REDUCE tmp tmp += y*x[i]+*z++; } ... cout f_blocked(z, bs, base, key); } }; The user’s class C must be modified as follows. The compiler must create a new member function, called f blocked() below, which executes a blocked version of the original loop. This function is called by the thread class as shown above. In addition, the original loop has to be modified to spawn the K threads (where K is determined by the runtime environment) and start each thread. (As in our previous examples, we have deleted the details about removing the thread objects when the thread completes execution.) The user class is also given local synchronization objects to signal the termination of the loop. For loops that do not have any reduction operations, we can synchronize the completion of the original iteration with a HPCxx CSem as was done in the Embar benchmark in the previous section. However, in this case, we do have a reduction, so we use the reduction classes and generate a barrier-based reduction that will synchronize all of the iteration classes with the calling thread as shown below. class C{ double x[N]; float &y; HPCxx_Group g; HPCxx_Reduct1 add; public: C(...): add(g){ .... }
   void f_blocked(float *z, double &tmp, int bs, int base, int key){
      z = z + base;
      double t = 0;
      for(int i = base; i < base+bs; i++){
         t += y*x[i]+*z++;
      }
      add(key, t, doubleAdd());
   }
   void f(float *z, int n){
      double tmp = 0;
      ...
      g.setThreadCount(K+1);
      int key = add.getKey();
      for(int th = 0; th < K; th++){
         C_Thread *t = new C_Thread(this, z, tmp, n/K, th*(n/K), add);
         t->start();
      }
      tmp = add(key, 0.0, doubleAdd());
      ...
      cout
template as is done in the MPC++ Template Library [10]. A global pointer is an object that is a proxy for a remote object. In most ways it can be treated exactly as a pointer to a local object. One major difference is that global pointers can only point to objects allocated from a special "global data area". To allocate a global pointer object of type T from the global area one can write
HPCxx_GlobalPtr<T> p = new ("hpcxx_global") T(args);
or, for a global pointer to an array of 100 objects,
HPCxx_GlobalPtr<T> p = new ("hpcxx_global") T[100];
For objects of simple type a global pointer can be dereferenced like any other pointer. For example, assignment and copy through a global pointer is given by
HPCxx_GlobalPtr<int> p = new ("hpcxx_global") int;
*p = 3.14;
float y = 2 - *p;
Integer arithmetic and the [] operator can be applied to global pointers in the same way as ordinary pointers. Because global pointer operations are far more expensive than regular pointer dereferencing, there are special operators for reading and writing blocks of data.
void HPCxx_GlobalPtr<T>::read(T *buffer, int size);
void HPCxx_GlobalPtr<T>::write(T *buffer, int size);
Objects of a user-defined type may be copied through a global pointer only if there are pack and unpack friend functions defined as follows. Suppose you have a class of the form shown below. You must also supply a special function that knows how to pack an array of such objects.
class C{
   int x;
   float y[100];
public:
   friend void hpcxx_pack(HPCxx_Buffer *b, C *buffer, int size);
   friend void hpcxx_unpack(HPCxx_Buffer *b, C *buffer, int &size);
};
void hpcxx_pack(HPCxx_Buffer *b, C *buffer, int size){
   hpcxx_pack(b, size, 1);
   for(int i = 0; i < size; i++){
      hpcxx_pack(b, buffer[i].x, 1);
      hpcxx_pack(b, buffer[i].y, 100);
   }
}
void hpcxx_unpack(HPCxx_Buffer *b, C *buffer, int &size){
   hpcxx_unpack(b, size, 1);
   for(int i = 0; i < size; i++){
      hpcxx_unpack(b, buffer[i].x, 1);
      hpcxx_unpack(b, buffer[i].y, 100);
   }
}
These pack and unpack functions can be considered a type of remote constructor. For example, suppose a class object contains a pointer to a buffer in the local heap. It is possible to write the unpack function so that it allocates the appropriate storage on the remote context and initializes it with the appropriate data. Unfortunately, it is not possible to access data members directly through global pointers without substantial compiler support. The following is an illegal operation:
class D{
public:
   int x;
};
...
HPCxx_GlobalPtr<D> p = new ("hpcxx_global") D;
...
p->x; // illegal member reference
To solve this problem, we must create a data access function that returns this value. Then we can make a remote member call.
6.1 Remote Function and Member Calls
For a user-defined class C with a member function,
class C{
public:
   int foo(float, char);
};
the standard way to invoke the member through a pointer is an expression of the form:
C *p;
p->foo(3.14, 'x');
It is a bit more work to make the member function call through a global pointer. First, for each type C we must register the class and all members that will be called through global pointers. Registering the class is easy. There is a macro that will accomplish this task and it should be called at the top level. We next must register the member as shown below.
hpcxx_register(C);
int main(int argc, char *argv[]){
   HPCxx_Group *g;
   hpcxx_init(&argc, &argv, g);
   hpcxx_id_t C_foo_id = hpcxx_register(C::foo);
The overloaded, templated register function builds a table of system information about registered functions, classes, and member functions, and returns its location as an ID. Because this ID is an index into the table, it is essential that each context register the members in exactly the same order. To invoke the member function, there is a special function template.
HPCxx_GlobalPtr<C> P;
...
int z = hpcxx_invoke(P, C_foo_id, 3.13, 'x');
Invoke will call C::foo(3.13, 'x') in the context that contains the object that P points to. The calling process will wait until the function returns. The asynchronous invoke will allow the calling function to continue executing until the result is needed.
HPCxx_Sync<int> sz;
hpcxx_ainvoke(&sz, P, C_foo_id, 3.13, 'x');
.... // go do some work
int z = sz; // wait here.
It should be noted that it is not a good idea to pass pointers as argument values to invoke or ainvoke. However, it is completely legal to pass global pointers and return global pointers as results of remote member invocations.
6.1.1 Global Functions. Ordinary functions can be invoked remotely. By using a ContextID, the context that should invoke the function may be identified.
HPCxx_ContextID HPCxx_Group::getContextID(int i);
For example, to call a function in context "3" from context "0", the function must be registered in each context. (As with member functions, the
order of the function registration determines the function identifier, so the functions must be registered in exactly the same order in each context.)
double fun(char x, int y);
int main(int argc, char *argv[]){
   HPCxx_Group *g;
   hpcxx_init(&argc, &argv, g);
   hpcxx_id_t fun_id = hpcxx_register(fun);
   // remote invocation of x = fun('z', 44);
   double x = hpcxx_invoke(g->getContextID(3), fun_id, 'z', 44);
   // asynchronous invocation
   HPCxx_Sync<double> sx;
   hpcxx_ainvoke(&sx, g->getContextID(3), fun_id, 'z', 44);
   double x = sx;
   ....
}
6.2 Using Corba IDL to Generate Proxies
Two of the most annoying aspects of the HPC++Lib are the requirements to register member functions and write the pack and unpack routines for user-defined classes. In addition, the use of the invoke template syntax
HPCxx_GlobalPtr<C> P;
...
int z = hpcxx_invoke(P, member_id, 3.13, 'x');
instead of the more conventional syntax
z = P->member(3.13, 'x');
is clearly awkward. Unfortunately, the C++ language does not allow an extension to the overloading of the -> operator that will provide this capability. However, there is another solution to this problem. The CORBA Interface Definition Language (IDL) provides a well structured language for defining the public interface to remote objects and serializable classes. As a separate utility, HPC++Lib provides an IDL to C++ translator that maps interface specifications to user-defined classes. For example, consider the IDL interface definition of a remote blackboard object class and the definition of a structure which represents a graphical object (Gliph) that can be drawn on the blackboard:
struct Gliph{
   short type;
   int x, y;
   int r, g, b;
};
interface BBoard{
   int draw(in Gliph mark);
};
The IDL to C++ compiler generates a special case of the global pointer template and the header for the blackboard class as shown below. The prototype for BBoard contains a static registration function that registers all the member functions. The user need only call this one registration function at the start of the program. The specialization of the global pointer template contains the requisite overloading of the -> operator and a new member function for each of the functions in the public interface.
class BBoard{
   static int draw_id;
public:
   static void register(){
      draw_id = hpcxx_register(BBoard::draw);
   }
   int draw(Gliph mark);
};
class HPCxx_GlobalPtr<BBoard>{
   // implementation specific GP attributes
public:
   ...
   HPCxx_GlobalPtr<BBoard> * operator->(){ return this; }
   int draw(Gliph mark){
      return hpcxx_invoke(*this, BBoard::draw_id, mark);
   }
};
The structure Gliph in the interface specification is compiled into a structure which contains the serialization pack and unpack functions. Using this tool the user compiles the interface specification into a new header file. This file can be included into the C++ files which contain the use of the class as well as the definition of functions like BBoard::draw. To use the class with remote pointers the program must include only the registration call BBoard::register(); in the main program.
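The chapter does not show the serialization code that the translator emits for Gliph, but following the hand-written hpcxx_pack/hpcxx_unpack pattern given earlier for class C, the generated functions could plausibly take a form such as the sketch below (an assumption about the generator's output, not code taken from the toolkit).
// Hypothetical sketch of the pack/unpack functions an IDL-to-C++ translator
// could emit for Gliph, mirroring the pattern shown earlier for class C.
// The actual generated code is not given in this chapter.
void hpcxx_pack(HPCxx_Buffer *b, Gliph *buffer, int size){
   hpcxx_pack(b, size, 1);
   for(int i = 0; i < size; i++){
      hpcxx_pack(b, buffer[i].type, 1);
      hpcxx_pack(b, buffer[i].x, 1);
      hpcxx_pack(b, buffer[i].y, 1);
      hpcxx_pack(b, buffer[i].r, 1);
      hpcxx_pack(b, buffer[i].g, 1);
      hpcxx_pack(b, buffer[i].b, 1);
   }
}
void hpcxx_unpack(HPCxx_Buffer *b, Gliph *buffer, int &size){
   hpcxx_unpack(b, size, 1);
   for(int i = 0; i < size; i++){
      hpcxx_unpack(b, buffer[i].type, 1);
      hpcxx_unpack(b, buffer[i].x, 1);
      hpcxx_unpack(b, buffer[i].y, 1);
      hpcxx_unpack(b, buffer[i].r, 1);
      hpcxx_unpack(b, buffer[i].g, 1);
      hpcxx_unpack(b, buffer[i].b, 1);
   }
}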
7. The SPMD Execution Model
The Single Program Multiple Data (SPMD) model of execution is one of the standard models used in parallel scientific programming. Our library supports this model as follows. At program load time n copies of a single program are loaded into n processes which each define a running context. The running processes are coordinated by the hpcxx_init initialization routine which is invoked as the first call in the main program of each context.
int main(int argc, char *argv[]){
   HPCxx_Group *g;
   hpcxx_init(&argc, &argv, g);
   ...
As described in the multi-context programming section, the context IDs allow one context to make remote function calls to any of the other contexts. The SPMD execution continues with one thread of control per context executing main. However, that thread can dynamically create new threads within its context. There is no provision for thread objects to be moved from one context to another.
7.1 Barrier Synchronization and Collective Operations
In SPMD execution mode, the runtime system provides the same collective operations as were provided before for multi-threaded computation. The only semantic difference is that the collective operations apply across every context and every thread of the group. The only syntactic difference is that we allow a special form of the overloaded () operator that does not require a thread "key". For example, to do a barrier between contexts all we need is the HPCxx_Group object.
int main(int argc, char *argv[]){
   HPCxx_Group *context_set;
   hpcxx_init(&argc, &argv, context_set);
   HPCxx_Barrier barrier(context_set);
   HPCxx_Reduct1 float_reduct(context_set);
   ...
   barrier();
   float z = 3.14;
   z = float_reduct(z, floatAdd());
Note that the thread key can be used if there are multiple threads in a context that want to synchronize with the other contexts.
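To make the barrier semantics concrete, the sketch below shows a counting barrier for the threads of a single process, written with standard C++ primitives. It is an illustration only; HPCxx_Barrier itself also spans contexts, and its actual implementation is not shown in this chapter.
#include <condition_variable>
#include <mutex>

// Illustration only: a counting barrier with the semantics described above
// for the threads of one context. Each caller blocks in arrive_and_wait()
// until all n participants have arrived; the generation counter lets the
// barrier be reused for the next synchronization point.
class CountingBarrier {
   std::mutex m;
   std::condition_variable cv;
   int count;
   int waiting = 0;
   int generation = 0;
public:
   explicit CountingBarrier(int n) : count(n) {}
   void arrive_and_wait(){
      std::unique_lock<std::mutex> lk(m);
      int gen = generation;
      if (++waiting == count){
         // last arrival releases everyone and resets for reuse
         waiting = 0;
         ++generation;
         cv.notify_all();
      } else {
         cv.wait(lk, [&]{ return gen != generation; });
      }
   }
};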
8. Conclusion This chapter has outlined a library and compilation strategy that is being used to implement the HPC++ Level 1 design. An earlier version of HPC++Lib was used to implement the prototype PSTL library and future releases of PSTL will be completely in terms of HPC++Lib. Our goal in presenting this library goes beyond illustrating the foundation for HPC++ tools. We also feel that this library will be used by application programmers directly and care has been taken to make it as usable as possible for large scale scientific application. In particular, we note that the MPC++ MTTL has been in use with the Real World Computing Partnership as an application programming platform for nearly one year and CC++ has been in use for two years. HPC++Lib is modeled on these successful systems and adds only a Java style thread class and a library for collective operations. In conjunction with the parallel STL, we feel that this will be an effective tool for writing parallel programs. The initial release of HPC++Lib will be available by the time of this publication at http://www.extreme.indiana.edu and other sites within the Department of Energy. There will be two runtime systems in the initial release. One version will be based on the Argonne/ISI Nexus runtime system and the other will be using the LANL version of Tulip [3]. Complete documentation and sample programs will be available with the release.
References
1. Ken Arnold and James Gosling. The Java Programming Language. Addison Wesley, 1996.
2. D.H. Bailey, E. Barszcz, L. Dagum, and H.D. Simon. NAS Parallel Benchmark Results October 1994. Technical Report NAS-94-001, NASA Ames Research Center, 1994.
3. P. Beckman and D. Gannon. Tulip: A portable run-time system for object-parallel systems. In Proceedings of the 10th International Parallel Processing Symposium, April 1996.
4. Aart J.C. Bik and D. Gannon. Automatically exploiting implicit parallelism in Java. Technical report, Department of Computer Science, Indiana University, 1996.
5. William W. Carlson and Jesse M. Draper. Distributed data access in AC. In Fifth ACM Sigplan Symposium on Principles and Practices of Parallel Programming, 1995.
6. K. Mani Chandy and Carl Kesselman. CC++: A declarative concurrent object-oriented programming notation, 1993. In Research Directions in Concurrent Object Oriented Programming, MIT Press.
7. D. Culler, A. Dusseau, S. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick. Parallel programming in Split-C. In Supercomputing '93, 1993.
8. High Performance Fortran Forum. Draft High Performance Fortran Language Specification, November 1992. Version 0.4.
9. Object Management Group. The Common Object Request Broker: Architecture and specification, July 1995. Revision 2.0.
10. Yutaka Ishikawa. Multiple threads template library. Technical Report TR-96-012, Real World Computing Partnership, September 1996.
11. Elizabeth Johnson and Dennis Gannon. HPC++: Experiments with the parallel standard template library. Technical Report TR-96-51, Indiana University, Department of Computer Science, December 1996.
12. Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis. Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin/Cummings Publishing Company, 1994.
13. Mark Nelson. C++ Programmer's Guide to the Standard Template Library. IDG Books Worldwide, 1995.
14. Alexander Stepanov and Meng Lee. The Standard Template Library. Technical Report HPL-95-11, Hewlett-Packard Laboratories, January 1995.
15. Gregory Wilson and Paul Lu. Parallel Programming Using C++. MIT Press, 1996.
16. Gregory Wilson and William O'Farrell. An introduction to ABC++. 1995.
Chapter 4. A Concurrency Abstraction Model for Avoiding Inheritance Anomaly in Object-Oriented Programs
Sandeep Kumar1 and Dharma P. Agrawal2
1 a ard Laboratories, One Main Street, 10th Floor, Cambridge, MA 02142
2 Department of ECECS, ML 0030, University of Cincinnati, PO Box 210030, Cincinnati, OH 45221-0030
Summary. In a concurrent object-oriented programming language one would like to be able to inherit behavior and realize synchronization control without compromising the flexibility of either the inheritance mechanism or the synchronization mechanism. A problem called the inheritance anomaly arises when synchronization constraints are implemented within the methods of a class and an attempt is made to specialize methods through inheritance. The anomaly occurs when a subclass violates the synchronization constraints assumed by the superclass. A subclass should have the flexibility to add methods, add instance variables, and redefine inherited methods. Ideally, all the methods of a superclass should be reusable. However, if the synchronization constraints are defined by the superclass in a manner prohibiting incremental modification through inheritance, they cannot be reused, and must be reimplemented to reflect the new constraints; hence, inheritance is rendered useless. We have proposed a novel model of concurrency abstraction, where (a) the specification of the synchronization code is kept separate from the method bodies, and (b) the sequential and concurrent parts in the method bodies of a superclass are inherited by its subclasses in an orthogonal manner.
1. Introduction
A programming model is a collection of program abstractions which provides a programmer with a simplified, and transparent, view of the computer's hardware/software system. Parallel programming models are specifically designed for multiprocessors, multicomputers, or vector computers, and are characterized as: shared-variable, message-passing, data-parallel, object-oriented (OO), functional, logic, and heterogeneous. Parallel programming languages provide a platform to a programmer for effectively expressing (or, specifying) his/her intent of parallel execution of the parts of computations in an application. A parallel program is a collection of processes which are the basic computational units. The granularity of a process may vary in different programming models and applications. In this work we address issues relating to the use of the OO programming model in programming parallel machines [20]. OO programming is a style of programming which promotes program abstraction, and thus, leads to modular and portable programs and reusable software. This paradigm has radically influenced the design and development
of almost all kinds of computer applications including user-interfaces, data structure libraries, computer-aided design, scientific computing, databases, network management, client-server based applications, compilers and operating systems. Unlike the procedural approach, where problem-solving is based around actions, OO programming provides a natural facility for modeling and decomposing a problem into objects. An object is a program entity which encapsulates data (a.k.a. instance variables) and operations (a.k.a. methods) into a single computational unit. The values of the instance variables determine the state of an object. Objects are dynamically created and their types may change during the course of program execution. The computing is done by sending messages (a.k.a. method invocation) among objects. In Figure 1.1 the methods (marked as labels on the edges) transform the initial states (shown as circles), r0 and s0 , of two objects, O1 and O2 , into their final states, rM and sN , respectively. Notice that O2 sends a message, p2 , to O1 while executing its method, q2 .
Fig. 1.1. The object-oriented computing model. On one hand, the OO language features are a boon to writing portable programs efficiently; but on the other hand they also contribute to degraded performance at run-time. It turns out that the concurrency is a natural consequence of the concept of objects: The concurrent use of coroutines in conventional programming is akin to the concurrent manipulation of objects in OO programming. Notice from Figure 1.1 that the two objects, O1 and O2 , change their states independently and only occasionally exchange values. Clearly, the two objects can compute concurrently. Since the OO programming model is inherently parallel, it should be feasible to exploit the potential parallelism from an OO program in order to improve its performance. Furthermore, if two states for an object can be computed independently, a finer granularity of parallelism can be exploited. For instance in the Actor [1] based
concurrent OO languages, maximal fine-grained object-level concurrency can be specified. Load-balancing becomes fairly easy as the objects can be freely placed and migrated within a system. Such languages have been proposed as scalable programming approaches for massively parallel machines [9]. Several research projects have successfully exploited the data-parallel (SIMD) programming model in C++, notably, Mentat [15], C** [23], pC++ [13], and Charm++ [19]; probably these researchers were overwhelmed by the benefits of this model as noted in Fortran 90 and High Performance Fortran (HPF) [16] applications. The data-parallel models of computation exploit the property of homogeneity in computational data structures, as is commonly found in scientific applications. For heterogeneous computations, however, it is believed that the task-parallel (SPMD or MIMD) model, a.k.a. control or functional parallelism, is more effective than the data-parallel model. Such a model exists in Fortran M [12] and CC++ [8]. Most of the OO applications by nature are based on heterogeneous computations, and consequently, are more amenable to task-parallelism. Besides, the existence of numerous, inherently concurrent objects can assist in masking the effects of latency (as there is always work to be scheduled). Furthermore, the SPMD model is cleanly represented in a task-parallel OO language where the data and the code that manipulates that data are clearly identified. Consider a typical OO program segment shown below. Our goal is to seek a program transformation from the left to the right. In other words, can we concurrently execute S1 and S2, and if so, how?

   S1: object_i.method_p();                 par { S1': object_i.method_p();
                                    ⇒?
   S2: object_j.method_q();                       S2': object_j.method_q(); }
There are two possible approaches: either (a) specify the above parallelism by writing a parallel program in a Concurrent OO Programming Language (COOPL), or (b) automatically detect that the above transformation is valid and then restructure the program. In this work we limit our discussion to the specification of parallelism. In designing a parallel language we aim for: (i) efficiency in its implementation, (ii) portability across different machines, (iii) compatibility with existing sequential languages, (iv) expressiveness of parallelism, and (v) ease of programming. A COOPL offers the benefits of object-orientation, primarily, inherently concurrent objects, ease of programming, and code reuse, along with the task of specification of parallelism (or more generally, concurrency). Several COOPLs have been proposed in the literature (refer to [4, 18, 33] for a survey).
It is unfortunate that when parallel programs are written in most of these COOPLs, the inheritance anomaly [25] is unavoidable and the reusability of the sequential components is improbable [21, 22]. In using a COOPL, one would want to inherit the behavior and realize synchronization control without compromising the flexibility in either exploiting the inheritance characteristics or using different synchronization schemes. A problem called the inheritance anomaly [25] arises when synchronization constraints are implemented within the methods of a class and an attempt is made to specialize methods through inheritance. The anomaly occurs when a subclass violates the synchronization constraints assumed by the superclass. A subclass should have the flexibility to add methods, add instance variables, and redefine inherited methods. Ideally, all the methods of a superclass should be reusable. However, if the synchronization constraints are defined by the superclass in a manner prohibiting incremental modification through inheritance, they cannot be reused, and must be reimplemented to reflect the new constraints; hence, inheritance is rendered useless. We claim that the following are the two primary reasons for the inheritance anomaly [25] and the improbable reuse of the sequential components in a COOPL [22]:
– The synchronization constraints are implemented within the methods of a class and an attempt is made to specialize methods through inheritance. The anomaly occurs when a subclass violates the synchronization constraints assumed by the superclass.
– The inheritance of the sequential part and the concurrent part of a method code are not orthogonal.
In this work we have proposed a novel model of concurrency abstraction, where (a) the specification of the synchronization code is kept separate from the method bodies, and (b) the sequential and concurrent parts in the method bodies of a superclass are inherited by its subclasses in an orthogonal manner. The rest of the chapter is organized as follows. Section 2 discusses the issues in designing and implementing a COOPL. In sections 3 and 4 we present a detailed description of the inheritance anomaly and the reuse of sequential classes, respectively. In section 5 we propose a framework for specifying parallelism in COOPLs. In section 6 we review various COOPL approaches proposed by researchers in solving different kinds of inheritance anomalies. In section 7 we describe the concurrency abstraction model we have adopted in designing our proposed COOPL, CORE. Subsequently, we give an overview of the CORE model and describe its features. Later, we illustrate with examples how a CORE programmer can effectively avoid the inheritance anomaly and also reuse the sequential classes. Finally, we discuss an implementation approach for CORE, summarize our conclusions and indicate directions for future research.
2. Approaches to Parallelism Specification Conventionally, specification of task parallelism refers to explicit creation of multiple threads of control (or, tasks) which synchronize and communicate under a programmer’s control. Therefore, a COOPL designer should provide explicit language constructs for specification, creation, suspension, reactivation, migration, termination, and synchronization of concurrent processes. A full compiler has to be implemented when a new COOPL is designed. Alternatively, one could extend a sequential OO language in widespread use, such as C++ [11], to support concurrency. The latter approach is more beneficial in that: (a) the learning curve is smaller, (b) incompatibility problems seldom arise, and most importantly, (c) a preprocessor and a low-level library of the target computer system can more easily upgrade the old compiler to implement high-level parallel constructs. 2.1 Issues in Designing a COOPL A COOPL designer should carefully consider the pros and cons in selecting the following necessary language features: 1. Active vs. Passive Objects: An active object (a.k.a. an actor in [1]) possesses its own thread(s) of control. It encapsulates data structures, operations, and the necessary communication and synchronization constraints. An active object can be easily unified with the notion of a lightweight process [3]. In contrast, a passive object does not have its own thread of control. It must rely on either active objects containing it, or on some other process management scheme. 2. Granularity of Parallelism: Two choices are evident depending upon the level of parallelism granularity sought: a) Intra-Object: An active object may be characterized as: (i) a sequential object, if exactly one thread of control can exist in it; (ii) a quasi-concurrent object, such as a monitor, when multiple threads can exist in it, but only one can be active at a time; and (iii) a concurrent object, if multiple threads can be simultaneously active inside it [8, 33]. b) Inter-Object: If two objects receive the same message, then can they execute concurrently? For example, in CC++ [8], one can specify inter-object parallelism by enclosing the method invocations inside a parallel block, such as, a cobegin-coend construct [3], where an implicit synchronization is assumed at the end of the parallel block. 3. Shared Memory vs. Distributed Memory Model (SMM vs. DMM): Based on the target architecture, appropriate constructs for communication, synchronization, and partitioning of data and processes may be necessary. It should be noted that even sequential OOP is considered
equivalent to programming with messages: a method invocation on an object, say x, is synonymous to a message send (as in DMM) whose receiver is x. However, floating pointers to data members (as in C++ [11]) may present a serious implementation hazard in COOPLs for DMMs.
4. Object Distribution: In a COOPL for a DMM, support for location independent object interaction, migration, and transparent access to other remote objects, may be necessary. This feature may require intense compiler support and can increase run-time overhead [9].
5. Object Interaction: On a SMM, communication may be achieved via synchronized access to shared variables, locks, monitors, etc., whereas on a DMM, both synchronous and asynchronous message passing schemes may be allowed. Besides, remote procedure call (RPC) and its two variants, blocking RPC and asynchronous RPC (ARPC) [7, 34], may be supported, too.
6. Selective Method Acceptance: In a client/server based application [6], it may be necessary to provide support for server objects, who receive messages non-deterministically based on their internal states, parameters, etc.
7. Inheritance: The specification of synchronization code is considered as the most difficult part in writing parallel programs. Consequently, it is highly desirable to avoid rewriting of such code and instead reuse code via inheritance. Unfortunately, in many COOPLs, inheritance is either disallowed or limited [2, 18, 32, 33], in part due to the occurrence of inheritance anomaly [24, 25]. This anomaly is a consequence of reconciling concurrency and inheritance in a COOPL, and its implications on a program include: (i) extensive breakage of encapsulation, and (ii) redefinitions of the inherited methods.
2.2 Issues in Designing Libraries Parallelism can be made a second class citizen in COOPLs by providing OO libraries of reusable abstractions (classes). These classes hide the lower-level details pertaining to specification of parallelism, such as, architecture (SMM or DMM), data partitions, communications, and synchronization. Two kinds of libraries can be developed as described below: – Implicit Libraries: These libraries use OO language features to encapsulate concurrency at the object-level. A comprehensive compiler support is essential for: (i) creating active objects without explicit user commands and in the presence of arbitrary levels of inheritance, (ii) preventing acceptance of a message by an object until it has been constructed, (iii) preventing destruction of an object until the thread of control has been terminated, (iv) object interaction, distribution and migration, and (v) preventing deadlocks.
– Explicit Libraries: These class libraries provide a set of abstract data types to support parallelism and synchronization. The objects of these classes are used in writing concurrent programs. In these libraries, the synchronization control and mutual exclusion code is more obvious at the user interface level. Most of these libraries are generally meant for programming on SMM [5]. The libraries are a good alternative to parallel programming as they are more easily portable. Some notable libraries reported in the literature are: ABC++, Parmacs [5], µC++, AT&T’s Task Library, PRESTO, PANDA, AWESIME, ES-Kit, etc. [4, 33]. Although the above approaches to parallel programming are considered simple and inexpensive, they require sophisticated compiler support. Moreover, they fail to avoid the inheritance anomaly as would be clear from the following section.
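Before turning to the anomaly itself, the fragment below gives a concrete, purely illustrative sketch of the explicit-library style just described, using the standard C++ thread and mutex classes in place of any particular library from the list above: the thread objects and the lock object are created directly by the programmer, and the mutual-exclusion code is visible at the user level.
#include <mutex>
#include <thread>
#include <vector>

// Illustration only: explicit-library style parallel programming.
// Thread and lock objects come from a class library, and the
// synchronization code is written explicitly by the user.
int main(){
   std::mutex lock;                 // explicit lock object
   long counter = 0;
   std::vector<std::thread> workers;
   for (int t = 0; t < 4; ++t){
      workers.emplace_back([&]{
         for (int i = 0; i < 100000; ++i){
            std::lock_guard<std::mutex> guard(lock);   // user-visible mutual exclusion
            ++counter;
         }
      });
   }
   for (auto &w : workers) w.join();
   return 0;
}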
3. What Is the Inheritance Anomaly? In this section we describe the inheritance anomaly. An excellent characterization of this anomaly can be found in [25], and it has also been discussed in [18, 22, 24, 32, 33]. Definition 3.1. A synchronization constraint is that piece of code in a concurrent program which imposes control over concurrent invocations on a given object. Such constraints manage concurrent operations and preserve the desired semantic properties of the object being acted upon. The invocations that are attempted at inappropriate times are subject to delay. Postponed invocations are those that invalidate the desired semantic property of that object2 . Consider one of the classical concurrency problems, namely, bounded buffer [3]. This problem can be modeled by defining a class, B Buffer, as shown in Figure 3.1. In this class3 , there are three methods: B Buffer, Put, and Get. The constructor creates a buffer, buf, on the heap of a user-specified size, max. Note that buf is used as a circular array. Both the indices, in and out, are initialized to the first element of buf, and n represents the number of items stored in the buf so far. Put stores an input character, c, at the current index, in, and increments the values of in and n. On the other hand, Get retrieves a character, c, at the current index, out, from buf, and decrements the value of n but increments the value of out. A synchronization constraint is necessary if Put and Get are concurrently invoked on an object of B Buffer. This synchronization constraint(s) must satisfy following properties: (i) the execution of Put cannot be deferred as 2 3
2 An excellent characterization of different synchronization schemes can be found in [3].
3 Note that we have used the C++ syntax for defining classes in this section.
class B_Buffer { int in, out, n, max; char *buf; public: B_Buffer(int size) { max = size; buf = new char [max]; in = out = n = 0; } void Put (char c) { P(empty); buf[in] = c; in = (in+1) % max; n++; V(full); }
char Get (void) { char c; P(full); c = buf[out]; out = (out+1) % max; n--; V(empty); return c; }
}; Fig. 3.1. A class definition for B Buffer. long as there is an empty slot in buf, and (ii) the execution of Get must be postponed until there is an item in buf. In other words, the number of invocations for Put must be at least one more than the number of invocations for Get, but at the most equal to max. Such a constraint is to be provided by the programmer as part of the synchronization code in the concurrent program. For example, one could model the above synchronization constraint by using the P and V operations on two semaphores, full and empty. Consider defining a subclass of B Buffer, where we need different synchronization constraints. The methods Put and Get then may need non-trivial redefinitions. A situation where such redefinitions become necessary with inheritance of concurrent code in a COOPL is called the inheritance anomaly. In the following subsections, we review different kinds of inheritance anomalies as described in [25, 33]. 3.1 State Partitioning Anomaly (SPA) Consider Figure 3.2, where we have defined a subclass, B Buffer2, of the base class, B Buffer, as introduced in Figure 3.1. B Buffer2 is a specialized version of B Buffer, which inherits Put and Get from B Buffer and defines a
new method, GetMorethanOne. GetMorethanOne invokes Get as many times as the input value howmany specifies.
class B_Buffer2: public B_Buffer { public: char GetMorethanOne (int howmany) { char last_char; for (int i=0; i < howmany; i++) last_char = Get(); return last_char; } }; Fig. 3.2. A class definition for B Buffer2. Put, Get, and GetMorethanOne can be concurrently invoked on an object of B Buffer2. Besides the previous synchronization constraint for Put and Get in B Buffer, now we must further ensure that whenever GetMorethanOne executes, there are at least two items in buf. Based on the current state of the object, or equivalently, the number of items in buf, either Get or GetMorethanOne must be accepted. In other words, we must further partition the set of acceptable states for Get. In order to achieve such a finer partition, the inherited Get must be redefined in B Buffer2, resulting in SPA. SPA occurs when the synchronization constraints are written as part of the method code and they are based on the partitioning of states of an object. This anomaly commonly occurs in accept-set based schemes. In one of the variants of this scheme, known as behavior abstraction [18], a programmer uses the become primitive [1] to specify the next set of methods that can be accepted by that object. An addition of a new method to a subclass is handled by redefining such a set to contain the name of the new method. The use of guarded methods (as shown in Figure 3.3) can prevent SPA, where the execution of methods is contingent upon first evaluating their guards. void Put (char c) when (in < out + max) { ... } char Get (void) when (in >= out + 1) { ... } char GetMorethanOne (int howmany) when (in >= out + howmany) { ... } Fig. 3.3. Redefined methods from class B Buffer2.
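CORE-style guards such as when (in >= out + howmany) are not part of standard C++. As an illustration only (and using a simple item count in place of the in/out counters of Figure 3.3), the effect of guarded methods can be emulated with a monitor whose methods block until their guard predicate holds:
#include <condition_variable>
#include <mutex>
#include <vector>

// Illustration only: a bounded buffer whose methods wait on the guard
// predicates corresponding to Figure 3.3, expressed with a mutex and a
// condition variable instead of CORE's "when" clauses.
class GuardedBuffer {
   std::mutex m;
   std::condition_variable cv;
   std::vector<char> buf;
   size_t head = 0, tail = 0, n = 0;
public:
   explicit GuardedBuffer(size_t max) : buf(max) {}
   void put(char c){                        // guard: buffer not full
      std::unique_lock<std::mutex> lk(m);
      cv.wait(lk, [&]{ return n < buf.size(); });
      buf[tail] = c; tail = (tail + 1) % buf.size(); ++n;
      cv.notify_all();
   }
   char get(){                              // guard: at least one item
      std::unique_lock<std::mutex> lk(m);
      cv.wait(lk, [&]{ return n >= 1; });
      char c = buf[head]; head = (head + 1) % buf.size(); --n;
      cv.notify_all();
      return c;
   }
   char getMoreThanOne(size_t howmany){     // guard: at least howmany items
      std::unique_lock<std::mutex> lk(m);
      cv.wait(lk, [&]{ return n >= howmany; });
      char last = 0;
      for (size_t i = 0; i < howmany; ++i){
         last = buf[head]; head = (head + 1) % buf.size(); --n;
      }
      cv.notify_all();
      return last;
   }
};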
3.2 History Sensitiveness of Acceptable States Anomaly (HSASA) Consider Figure 3.4, where we have defined yet another subclass, B Buffer3, of B Buffer (see Figure 3.1). B Buffer3 is a specialized version of B Buffer, which inherits Put and Get and introduces a new method, GetAfterPut. Note that the guard for GetAfterPut is also defined. class B_Buffer3: public B_Buffer { public: char GetAfterPut() when (!after_put && in >= out + 1) { ... } }; Fig. 3.4. A class definition for B Buffer3. Put, Get, and GetAfterPut can be concurrently invoked on an object of B Buffer3. Apart from the previous synchronization constraint for Put and Get as in B Buffer, we must now ensure that GetAfterPut executes only after executing Put. The guard for GetAfterPut requires a boolean, after put, to be true, which is initially false. The synchronization requirement is that after put must be set to true and false in the inherited methods, Put and Get, respectively. In order to meet such a requirement, the inherited methods, Put and Get, must be redefined, and thus, resulting in HSASA. HSASA occurs when it is required that the newly defined methods in a subclass must only be invoked after certain inherited methods have been executed, i.e., the invocations of certain methods are history sensitive. Guarded methods are inadequate because the newly defined (and history sensitive) methods wait on those conditions which can only be set in the inherited methods, and consequently, redefinitions become essential. 3.3 State Modification Anomaly (SMA) Consider Figure 3.5, where we have defined two classes, Lock and B Buffer4. Lock is a mix-in class [6], which when mixed with another class, gives its object an added capability of locking itself. In B Buffer4 we would like to add locking capability to its objects. Thus, the inherited Put and Get in B Buffer4 execute in either a locked or unlocked state. Clearly, the guards in the inherited methods, Put and Get, must be redefined to account for the newly added features. Besides, the invocation of methods of Lock on an object of B Buffer4 affects the execution of Put and Get for this object. SMA occurs when the execution of a base class method modifies the condition(s) for the methods in the derived class. This anomaly is usually found in mix-in [6] class based applications.
class Lock {
   int locked;
public:
   void lock() when (!locked) { locked = 1; }
   void unlock() when (locked) { locked = 0; }
};

class B_Buffer4: public B_Buffer, Lock {
   // Unlocked Put and Get.
};
Fig. 3.5. Class definitions for Lock and B Buffer4. 3.4 Anomaly A Some COOPL designers have advocated the use of a single centralized class for controlling the invocation of messages received by an object [7]. Anomaly A occurs when a new method is added to a base class such that all its subclasses are forced to redefine their centralized classes. This happens because the centralized class associated with a subclass is oblivious of the changes in the base class, and therefore, cannot invoke the newly inherited method. Consider two centralized classes, B Buffer Server and B Buffer2 Server, as shown in Figure 3.6, for classes, B Buffer and B Buffer2, respectively. If a new method, NewMethod, is added to B Buffer, it becomes immediately visible in B Buffer2; however, both B Buffer Server and B Buffer2 Server are oblivious of such a change and must be redefined for their correct use, resulting in Anomaly A.
class B_Buffer_Server {
   void controller() {
      switch(...) {
         case ...: Put(c); break;
         case ...: Get(); break;
      }
   }
};

class B_Buffer2_Server {
   void controller2() {
      switch(...) {
         case ...: Put(c); break;
         case ...: Get(); break;
         case ...: GetMorethanOne(n); break;
      }
   }
};
Fig. 3.6. Centralized server class definitions for B Buffer and B Buffer2.
3.5 Anomaly B Instead of using centralized synchronization, each method could maintain data consistency by using critical sections. A possible risk with this scheme, however, is that a subclass method could operate on the synchronization primitives used in the base class, resulting in Anomaly B.
4. What Is the Reusability of Sequential Classes? The success of an OO system in a development environment is largely dependent upon the reusability of its components, namely, classes. A COOPL designer must provide means for class reuse without editing previously written classes. Many C++ based COOPLs do not support sequential class reuse [21, 22]. Consider Figure 4.1, where a base class, Base, is defined with three methods, foo, bar, and baz. We also define a subclass, Derived, of Base, which inherits bar and baz, but overrides foo with a new definition.
class Base { int a,b,c; public: void foo() { a = 2; b = a*a; } void bar() { c = 3; c = a*c; } void foobar() { foo(); bar(); } };
class Derived: public Base { int d; public: void foo() { bar(); d = c * c; } };
Fig. 4.1. The sequential versions of the classes, Base and Derived. Let us assume that the parallelism for methods of Base is specified as shown in Figure 4.2. The parallelization of Base forces the redefinition of Derived, because otherwise, one or both of the following events may occur:
– Assume that a message foo is received by an object of Derived. A deadlock occurs once the inherited bar, which contains a receive synchronization primitive, is called from within foo, because there is no complementary send command.
– Assume further that bar is not called from Derived::foo. Now, the inherited definition of foobar becomes incorrect: foo and bar can no longer be enclosed inside a parallel block as these two methods would violate the Bernstein conditions [3, 17].
class Base { int a,b,c; public: void foo() { a = 2; send(a); b = a*a; } void bar() { c = 3; receive(a); c = a*c; } void foobar() { cobegin foo(); bar(); coend; } };
class Derived: public Base { int d; public: void foo() { bar(); d = c * c; } };
Fig. 4.2. The parallelized versions of the classes, Base and Derived.
5. A Framework for Specifying Parallelism In the previous section we established that it is extremely difficult to specify parallelism and synchronization constraints elegantly in a COOPL. Apparently, the inheritance anomaly and dubious reuse of sequential classes, make COOPLs a less attractive alternative for parallel programming. In the past, several researchers have designed COOPLs in an attempt to break these problems, but they have been only partially successful. In this section, we propose our solution for these problems in a class based COOPL.
We have designed a new COOPL, called CORE [21], which is based on C++ [11]. In CORE, the parallelism and all the necessary synchronization constraints for the methods of a class, are specified in an abstract class (AC) [6,11] associated with the class. Consequently, a subclass of a superclass is able to either: (i) bypass the synchronization code which would otherwise be embedded in the inherited methods, or (ii) inherit, override, customize, and redefine the synchronization code of the inherited methods in an AC associated with the subclass. In CORE, we are able to break SPA, SMA, Anomaly A, and Anomaly B. However, we are not completely able to avoid HSASA, but we minimize the resulting code redefinitions. Besides, the sequential classes are reusable in a CORE program. The CORE framework for parallel programming is attractive because of the following reasons: – synchronization constraints are specified separately from the method code; – inheritance hierarchies of the sequential and concurrent components are maintained orthogonally; – the degrees of reuse of the sequential and synchronization code are higher; and – parallelism granularity can be more easily controlled.
6. Previous Approaches
In ACT++ [18], concurrent objects are designed as sets of states. An object can only be in one state at a time and methods transform its state. States are inherited and/or may be re-defined in a subclass without a method ever needing a re-definition; however, sequential components cannot be re-used, and the become expression does not allow the call/return mechanism of C++. Their proposal suffers from SPA. In Rosette [32], enabled sets are used to define messages that are allowed in the object's next state. The enabled sets are also objects and their method invocations combine the sets from the previous states. The authors suggest making the enabled sets first-class values. However, their approach is extremely complex for specification of concurrency. Additionally, this solution, too, is inadequate to solve SPA and HSASA. The authors of Hybrid [28] associate delay queues with every method. The messages are accepted only if the delay queues are empty. The methods may open or close other queues. The problem with the delay queue approach is that the inheritance and queue management are not orthogonal. Their approach is vulnerable to Anomaly B and HSASA. Besides, the sequential components cannot be reused. Eiffel [7] and the family of POOL languages [2] advocate the use of centralized classes to control concurrent computations. The sequential components can be reused; however, only one method can execute at a time, and the live method must be reprogrammed every time a subclass with a new method is
added. The designers of POOL-T [2] disallow inheritance. Both these schemes also suffer from Anomaly A. Guide [10] uses activation conditions, or synchronization counters, to specify an object’s state for executing a method. These activation conditions are complex expressions involving the number of messages received, completed, executing, and message contents, etc.. Clearly, such a specification directly conflicts with inheritance. Besides, a derived class method can potentially invalidate the synchronization constraints of the base class method, and hence, faces Anomaly B. Saleh et. al. [31] have attempted to circumvent the two problems but they restrict the specification of concurrency to intra-object level. They use conditional waits for synchronization purposes. There are no multiple mechanisms for specifying computational granularity and the reusability of sequential classes is impossible. Meseguer [26] has suggested the use of order-sorted rewriting logic and declarative solutions, where no synchronization code is ever used for avoiding the inheritance anomaly. However, it is unclear as to how the proposed solutions could be adapted into a more practical setting, as in a class based COOPL. Although, in Concurrent C++ [14], the sequential classes are reusable, SPA and SMA do not occur, however, HSASA and Anomaly B remain unsolved. Similarly, in CC++ [8], SPA, HSASA, and Anomaly B can occur and the sequential class reusability is doubtful. Much like us, Matsuoka et. al. [24], too, have independently emphasized on the localization and orthogonality of synchronization schemes for solving the problems associated with COOPLs. They have suggested an elegant scheme similar to that of path expressions [3] for solving these problems for an actor based concurrent language, called ABCL. In their scheme every possible state transitions for an object is specified in the class. However, with their strategy the reuse of sequential components is improbable.
7. The Concurrency Abstraction Model Recall from the previous section that the occurrence of the inheritance anomaly and the dubious reuse of sequential classes in a COOPL are primarily due to: (i) the synchronization constraints being an integral part of the method bodies, and (ii) the inheritance of the sequential and concurrent parts in a method code being non-orthogonal. We propose a novel notion of concurrency abstraction as the model for parallel programming in CORE, where these two factors are filtered out, and consequently, the two problems associated with a COOPL are solved. We first define following two terms before we explain the meaning of concurrency abstraction.
Definition 7.1. A concurrent region is that piece of method code4 (or a thread of control), which must be protected using a synchronization constraint. Definition 7.2. An AC is an abstract class5 associated with a class, C, where the parallelism and the necessary synchronization constraints for the concurrent regions of C are specified. The foundation of the concurrency abstraction model is built on the identification of concurrent regions and definitions of ACs: The sequential code of a concurrent region is “customized” to a concurrent code (i.e., a piece of code which has the specified parallelism and synchronization) by using the specifications in an AC. In a sense a class, C, inherits some “attributes” (specification of parallelism) for its methods from its AC such that the subclasses of C cannot implicitly inherit them. For a subclass to inherit these “attributes”, it must explicitly do so by defining its AC as a subclass of the AC associated with its superclass. Thus, a CORE programmer builds three separate and independent inheritance hierarchies: first, for the sequential classes; second, for a class and its AC; and third, for the ACs, if required. A hierarchy of ACs keeps the inheritance of the synchronization code orthogonal to the inheritance of the sequential methods. Such a dichotomy helps a subclass: (a) to bypass the synchronization specific code which would otherwise be embedded in a base class method, and (b) to inherit, override, customize, and redefine the synchronization code of the inherited methods in its own AC. We should point out that any specification inside an AC is treated as a compiler directive and no processes get created. These specifications are inlined into the method codes by a preprocessor. We shall now illustrate the concurrency abstraction model with an example. Consider a class, B, with two methods, b1 and b2, which are to be concurrently executed for an object of this class. In other words, the method bodies of b1 and b2 are the concurrent regions. In order to specify concurrent execution of these methods, one creates an AC of B, say Syn B. In Syn B, the two concurrent regions, b1 and b2, are enclosed inside a parallel block as shown in Figure 7.1(a). Let us consider following three scenarios, where D is defined as a subclass of B. Case 1 : Assume that D inherits b1 but overrides b2 by a new definition. If it is incorrect to concurrently invoke b1 and b2 for objects of D, then an AC for D is not defined (see Figure 7.1(b)). Otherwise an AC, Syn D, for D is defined. Two possibilities emerge depending upon whether or not a new concurrency specification is needed, i.e. either (i) a new specification 4 5
4 Note that an entire method could be identified as a concurrent region.
5 Conventionally, an abstract class denotes that class in an OO program for which no object can be instantiated [6, 11]. We have followed the same convention in CORE.
Fig. 7.1. Examples to illustrate the concurrency abstraction model. is needed, and hence, a new parallel block enclosing b1 and b2 is defined in Syn D (see Figure 7.1(c)); or, (ii) the old specification is reused, and thus, Syn D simply inherits from Syn B (see Figure 7.1(d)). Notice that when Syn D neither defines nor inherits from Syn B, the specified parallelism and synchronization in Syn B are effectively bypassed by the inherited method, b1, in D. Case 2 : Assume that D inherits b1 and b2, and defines a new method, d1. Much like the previous case, a new AC, Syn D, for D may or may not be needed. Assume that Syn D is needed. In case that b1, b2 and d1 require
a new concurrency specification, a new parallel block in Syn D is defined, which does not inherit from Syn B (see Figure 7.1(e)). However, if the old specification for b1 and b2 is reused but with a new specification for d1, then a parallel block enclosing d1 is defined in Syn D. Moreover, Syn D inherits from Syn B (see Figure 7.1(f)). Case 3 : Assume that D inherits b1 and b2, and defines a new method, d1. Unlike the previous two cases, assume that one specifies the concurrency and synchronization at the inter-class level rather than at the intra-class level. Consider a situation where a function (or a method), foo, defines a parallel block enclosing objectB.b1() and objectD.d1() as two parallel processes. If these two processes communicate, then that must be specified in an AC, Syn B D, associated with B and D, both.
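The text above states that an AC's specification is treated as a compiler directive and is inlined into the method code by a preprocessor. Purely as an assumption about what that inlined form might resemble (CORE's actual preprocessor output is not shown in this chapter), the parallel block of Figure 7.1(a) could expand into standard threaded C++ along the following lines.
#include <thread>

// Hypothetical sketch of a preprocessor expansion for class B of Figure
// 7.1(a): the two concurrent regions b1 and b2, placed in a parallel block
// by the AC Syn_B, run in separate threads, and the caller waits for both,
// matching the implicit synchronization at Parend. This is an assumption,
// not CORE's generated code.
class B {
public:
   void b1() { /* ... */ }
   void b2() { /* ... */ }
   // driver corresponding to the Parbegin b1() b2() Parend specification
   void b1_b2_parallel() {
      std::thread t1([this]{ b1(); });
      std::thread t2([this]{ b2(); });
      t1.join();
      t2.join();
   }
};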
8. The CORE Language Following our discussions on issues in designing a COOPL, in CORE we support: (i) passive objects with an assumption that an appropriate process management scheme is available, (ii) schemes for specification of intra- and inter-object parallelism, (iii) multiple synchronization schemes, and (iv) the inheritance model of C++. Our primary goal is to to avoid the inheritance anomaly and to allow the reuse of sequential classes. In CORE, the language extensions are based on the concurrency abstraction model as described in the previous section. We shall present the syntax of new constructs in CORE using the BNF grammar and their semantics informally with example programs. 8.1 Specifying a Concurrent Region As defined earlier, a concurrent region is that piece of code which is protected using a synchronization constraint. Some examples of a concurrent region include a critical section of a process, a process awaiting the completion of inter-process communication (such as a blocking send or a receive), a process protected under a guard (as in Figure 3.3), etc.. In CORE, a concurrent region can be specified at the intra- and inter-class levels by using the reserved words, Intra Conc Reg and Inter Conc Reg, respectively. These reserved words are placed inside a class in much the same way as the access privilege specifiers (public, private, protected) are placed in a C++ program. The concurrent regions do not explicitly encode the synchronization primitives inside the class, but they are redefined inside their associated ACs. 8.2 Defining an AC As mentioned earlier, an AC is associated with a class for whose concurrent regions, the parallelism and the necessary synchronization constraints are to
be specified. In other words, a concurrent region of a class is redefined in an AC such that the necessary synchronization scheme is encoded in it. Note that the members of an AC can access all the accessible members of the class it is associated with. An AC can be specified at the intra-class and inter-class levels using the reserved words, Intra Sync and Inter Sync, respectively. The BNF grammar for specifying an AC is given in Figure 8.1.
ACspec       →  ACtag ACname : InheritList { ACdef } ;
InheritList  →  classname InheritList | ACname InheritList
ACtag        →  Intra_Sync | Inter_Sync
ACname       →  identifier
Fig. 8.1. The BNF grammar for specifying an AC.
8.3 Defining a Parallel Block A parallel block [8,17] encloses a list of concurrent or parallel processes. These processes become active once the control enters the parallel block, and they synchronize at the end of the block, i.e., all the processes must terminate before the first statement after the end of block can be executed. The reserved words, Parbegin and Parend, mark the beginning and the end of a parallel block, respectively. The BNF grammar for specifying a parallel block in a CORE program is shown in Figure 8.2.
ParProc   →  Parbegin [methodname] [LoopStmt] ProcList Parend ;
LoopStmt  →  for ( identifier = initExp ; lastExp ; incExp )
ProcList  →  proc ; ProcList | proc
proc      →  functionCall | methodCall
Fig. 8.2. The BNF grammar for specifying a parallel block. A P rocList enlists the parallel processes, where a process from this list is either a function call, or a method invocation. A loop may be associated with a parallel block by using the for loop syntax of C++ [11]. In a loop version of a parallel block, all the loop iterations are active simultaneously. In CORE, another kind of parallel block can be specified by associating a methodname at the beginning of the block. However, such a specification can only be enforced inside an intra-class AC. With such a specification once the process, methodname, completes its execution, all the processes in the block are spawned as the child processes of this process. Note that a similar
128
Sandeep Kumar and Dharma P. Agrawal
specification is possible in Mentat [15]. We now describe how the intra- and inter-object concurrency can be specified in CORE. 8.3.1 Intra-object Concurrency. Inside a parallel block, method invocations on the same object can be specified for parallel execution. Consider a class, Foo, and its associated AC, Syn Foo, as shown in Figure 8.3. Foo has a public method, foobar, and two private methods, bar and baz. In Syn Foo, a parallel block is defined which specifies that whenever an object of Foo invokes foobar, two concurrent processes corresponding to bar and baz must be spawned further. The parent process, foobar, terminates only after the two children complete their execution.
class Foo { private: bar() { ... } baz() { ... } public: foobar(); };
Intra_Sync Syn_Foo: Foo { Parbegin foobar() bar() baz() Parend; };
Fig. 8.3. A class Foo and its intra-class AC.
8.3.2 Inter-object Concurrency. Inside a parallel block, method invocations on two objects (of the same or different classes) can be specified for parallel execution. Consider Figure 8.4, where the function, main, creates two objects, master and worker. In main, a parallel block is specified where each object concurrently receives a message, foobar. Earlier, in Figure 8.3, we had specified that the methods, bar and baz, be concurrently invoked on an object executing foobar. Consequently, main spawns two processes, master.foobar() and worker.foobar(), and both these processes further fork two processes each.
int main() { Foo master, worker; Parbegin master.foobar(); worker.foobar(); Parend; } Fig. 8.4. A program to show specification of inter-object parallelism.
A Concurrency Abstraction Model for OO Programs
129
8.4 Synchronization Schemes The concurrent processes in CORE can interact and synchronize using different schemes, namely: (i) a mailbox, (ii) a guard, (iii) a pair of blocking send and receive primitives, and (iv) a predefined Lock class which implements a binary semaphore such that the P and V operations (or methods) can be invoked on an object of this class. Consider Figure 8.5, where a class, A, is defined with two methods, m1 and m2. These methods have been identified as two concurrent regions using the tag, Intra Conc Reg. We have defined two different intra-class ACs, Syn A and Syn A1, for illustrating the use of different synchronization schemes. Note that in Syn A, Send and Receive primitives have been used for synchronization, while a semaphore, sem, has been used in Syn A1 for the same purpose.
class A { Intra_Conc_Reg: void method1() { ... } void method2() { ... } }; Intra_Sync Syn_A: A { Comm_Buffer buf; void method1() { method1(); Send(&buf, writeVar); }
Intra_Sync Syn_A1: A { Lock sem; void method1() { method1(); sem.V(); }
void method2() { Receive(&buf, readVar); method2(); } };
void method2() { sem.P(); method2(); } };
Fig. 8.5. A program to illustrate the use of different synchronization schemes inside ACs.
9. Illustrations In this section we shall illustrate how the concurrency abstraction model of CORE can effectively support the reuse of a sequential class method in a
Sandeep Kumar and Dharma P. Agrawal
130
subclass and avoid the inheritance anomaly without ever needing any redefinitions. 9.1 Reusability of Sequential Classes Consider Figure 9.1, where a class, Queue, along with its two methods, add and del, are defined. Assume that inside a parallel block add and del are con-
class Queue { public: add(int newitem) { temp = ...; temp->item = newitem; temp->next = front_ptr; front_ptr = temp; } del() { val = front_ptr->item; front_ptr = front_ptr->next; } }; Fig. 9.1. A class definition for Queue. currently invoked on an object of Queue. Since a shared variable, front ptr, is modified by add and del, it must be protected against simultaneous modifications from these two parallel processes. Let us identify and define two concurrent regions, R1 and R2, corresponding to the code segments accessing front ptr in add and del, respectively. In other words, the code segments in R1 and R2 correspond to the critical sections of add and del. The transformed definition of Queue, and its associated AC, Syn Queue, are shown in Figure 9.2. Note that if we inline R1 and R2 at their respective call-sites in add and del, we get the original (sequential) versions of the add and del methods, as in Figure 9.1. R1 and R2 are redefined in Syn Queue by enclosing them around a pair of P and V operations on a semaphore, sem. These redefined concurrent regions are inlined into add and del while generating their code. Similarly, an object, sem, is defined as an instance variable in Queue. Assume that a subclass, Symbol Table, is defined for Queue, as shown in Figure 9.3. Symbol Table reuses add, overrides del, and defines a new method, search.
A Concurrency Abstraction Model for OO Programs class Queue { Intra_Conc_Reg: R1(SomeType* ptr) { ptr->next = front_ptr; front_ptr = ptr; }
Intra_Sync Syn_Queue : Queue { Lock sem; void R1() { sem.P(); R1(); sem.V(); }
R2() { val = front_ptr->item; front_ptr = front_ptr->next; } public: add(int newitem) { temp = ...; temp->item = newitem; R1(temp); }
131
void R2() { sem.P(); R2(); sem.V(); } };
del() { R2(); } };
Fig. 9.2. A complete definition for the Queue class. While compiling the code for add in Symbol Table, the untransformed inlined code for R1 is used, and hence, its definition remains the same as in Figure 9.1, as desired.
class Symbol_Table : public Queue { public: del() { ... } search() { ... } } Fig. 9.3. A class definition for Symbol Table.
9.2 Avoiding the Inheritance Anomaly Anomaly A cannot occur in CORE because the notion of centralized classes (as proposed in [7]) does not exist. Furthermore, since the declaration and use of different synchronization primitives in a CORE program are restricted within an AC, a possible risk of them being operated upon by subclass methods, is eliminated. Consequently, Anomaly B is also avoided.
Sandeep Kumar and Dharma P. Agrawal
132
We now illustrate using an example how SPA can be avoided in a CORE program. The use of guards has been proposed by several researchers for avoiding SPA; we, too, advocate their use. However, while defining a subclass a CORE programmer is never in a dilemma simply because some other form of synchronization scheme has been previously committed to in the base class. In contrast with the other COOPL approaches, where the guards can only be associated with the methods, in CORE, they can be associated with concurrent regions, Consequently, more computations can be interleaved in CORE, resulting in a more fine-grain specification of parallelism.
class B_Buffer { int in, out, n, max; char *buf; public: B_Buffer (int size) { max = size; buf = new char [max]; in = out = n = 0; } void Put (char c) { R1(c); } int Get (void) { char c; R2(&c); return c; }
Intra_Sync Syn_B1: B_Buffer { Lock empty, full; R1 () { empty.P(); R1(); full.V(); } R2 () { full.P(); R1(); empty.V(); } };
Intra_Conc_Reg: void R1 (char c) { buf[in] = c; in = (in+1) % max; n++; } void R2 (char *c) { *c = buf[out]; out = (out+1) % max; n--; } };
Fig. 9.4. A class definition for B Buffer in CORE. Let us reconsider the class definition for the bounded buffer problem as shown in Figure 3.1. The CORE classes for B Buffer and B Buffer2 are shown in Figure 9.4 and Figure 9.5, respectively. The intra-class ACs, Syn B1 and Syn B2, as associated with B Buffer and B Buffer2, respectively, are also shown in these figures. Note the use of different synchronization schemes in
A Concurrency Abstraction Model for OO Programs class B_Buffer2: public B_Buffer { public: char GetMorethanOne(int howmany) { char last_char; R3(&last_char, howmany); return last_char; }
133
Intra_Sync Syn_B2: Syn_B1, B_Buffer2 { R3() { if ( in >= out + n) R3(); } };
Intra_Conc_Reg: void R3 (char *c, int n) { for (int i=0; i < n; i++) *c = Get(); } };
Fig. 9.5. A class definition for B Buffer2 in CORE. the three concurrent regions: R1 and R2 use semaphores for synchronization, and R3 uses a guard. In CORE, we are not completely successful in avoiding HSASA, however, we minimize the resulting code redefinitions, as we illustrate below. Let us reconsider the class, B Buffer3, in Figure 3.4 from section 3.2. The CORE class for B Buffer3 along with its associated intra-class AC, Syn B3, are shown in Figure 9.6. Note that the boolean, after put, must be set to true and false in the inherited methods, Put and Get, respectively. In addition to inheriting Syn B1 in Syn B3, we redefine the synchronization code as shown in Figure 9.6. We are thus able to minimize the code redefinition as claimed earlier.
10. The Implementation Approach We have proposed CORE as a framework for developing concurrent OO programs such that different kinds of inheritance anomalies do not occur and the sequential classes remain highly reusable. The code redefinitions in CORE are effectively and easily handled by a preprocessor, which customizes in a bottom up manner the method codes for each class. An obvious problem with such a static inlining is that of code duplication and an overall increase in code size. However, one can avoid such a code duplication by a scheme similar to that of manipulating virtual tables in C++ [11]. Naturally, such a solution is more complex and the involved overhead in retrieving the exact call through chain of pointers at run-time, makes this approach inefficient. Since concurrent programs are targeted for execution on multiple processors and a network of workstations, where the cost of code migration is extremely high, our choice of code duplication is very well justified. However, the compile
134
Sandeep Kumar and Dharma P. Agrawal class B_Buffer3: public B_Buffer { public: char GetAfterPut() { R4 (); }
Intra_Sync Syn_B3: Syn_B1, B_Buffer { int after_put = 0; R1() { R1(); after_put = 1; }
Intra_Conc_Reg: R4() { ... } };
R2() { R2(); after_put = 0; } R4() { if (!after_put && in >= out + 1) R4(); } };
Fig. 9.6. A class definition for B Buffer3 in CORE. time expansion approach cannot handle previously compiled class libraries; they must be recompiled.
11. Conclusions and Future Directions We have proposed a framework for specifying parallelism in COOPLs, and have demonstrated that: (a) the inheritance anomaly can be avoided, and (b) the sequential classes can be effectively reused. We have introduced a novel model of concurrency abstraction in CORE and is the key to solving two important problems associated with COOPLs. In the proposed model (a) the specification of the synchronization code is kept separate from the method bodies, and (b) the sequential and the concurrent parts in the method bodies of a superclass are inherited by the subclasses in an orthogonal manner. In CORE, we avoid state partitioning anomaly (SPA), state modification anomaly (SMA), and Anomaly B. We disallowed the notion of a centralized class, and hence, Anomaly A can never be encountered in CORE. However, the history sensitiveness of acceptable states anomaly (HSASA) can still occur in CORE, but with minimal code redefinitions for the inherited methods. We have also established that there is no need for a COOPL designer to commit to just one kind of synchronization scheme; multiple synchronization schemes may be allowed in a COOPL. Finally, intra- and inter-object parallelism can be easily accommodated in a COOPL, as in CORE. As the
A Concurrency Abstraction Model for OO Programs
135
proposed concurrency abstraction model is language independent, it can be easily adapted into other class based COOPLs. In the CORE framework we have not discussed the issues pertaining to task partitioning and scheduling, load-balancing, and naming and retrieval of remote objects. These issues can potentially expose several interesting research problems. Moreover, the model in CORE itself could use work in the direction of avoiding HSASA in COOPLs. While the data-parallel model of parallel programming has not been the focus of this work, integration of inheritance and the specification of the data-partitions, may exhibit an anomaly similar to the inheritance anomaly.
References 1. Agha, G., “Concurrent Object-Oriented Programming,” Communications of the ACM, Vol. 33, No. 9, Sept. 1990, pp. 125-141. 2. America P., “Inheritance and Subtyping in a Parallel Object-Oriented Language,” Proc. European Conf. on Object-Oriented Programming, SpringerVerlag, Berlin, 1987, pp. 234-242. 3. Andrews, G., “Concurrent Programming: Principles and Practice,” The Benjamin Cummings Publ. Co., 1991. 4. Arjomandi E., O’Farrell, W., and Kalas, I., “Concurrency Support for C++: An Overview,” C++ Report, Jan. 1994, pp. 44-50. 5. Beck, B., “Shared-Memory Parallel Programming in C++,” IEEE Software, July 1990, pp. 38-48. 6. Booch, G., “Object-Oriented Analysis and Design with Applications,” The Benjamin Cummings Publ. Co., 1994. 7. Caromel D., “A General Model for Concurrent and Distributed Object-Oriented Programming,” Proc. of the ACM SIGPLAN Workshop on Object Based Concurrent Programming, Vol 24, No. 4, April 1989, pp. 102-104. 8. Chandy, K. M. and Kesselman, C., “Compositional C++: Compositional Parallel Programming,” Conf. Record of Fifth Workshop on Languages and Compilers for Parallel Computing, Vol. 757, LNCS, Springer-Verlag, Aug. 1992, pp. 124-144. 9. Chien, A., Feng, W., Karamcheti, V., and Plevyak, J., “Techniques for Efficient Execution of Fine-Grained Concurrent Programs,” Conf. Record of Sixth Workshop on Languages and Compilers for Parallel Computing, Aug. 1993, pp. 160-174. 10. Decouchant D., Krakowiak S., Meysembourg M., Riveill M., and Rousset de Pina X., “A Synchronization Mechanism for Typed Objects in a Distributed System,” Proc. ACM SIGPLAN Workshop on Object Based Concurrent Programming, Vol 24, no. 4, April 1989, pp. 105-108. 11. Ellis, M. and Stroustrup, B., The Annotated C++ Reference Manual, AddisonWesely, 1990. 12. Foster, I., “Task Parallelism and High-Performance Languages,” IEEE Parallel & Distributed Technology: Systems & Applications, Vol. 2, No. 3, Fall 1994, pp. 27-36. 13. Gannon, D. and Lee, J. K., “Object-Oriented Parallelism: pC++ Ideas and Experiments,” Proc. 1991 Japan Soc. for Parallel Processing, 1993, pp. 13-23.
136
Sandeep Kumar and Dharma P. Agrawal
14. Gehani, N. and Roome, W. D., “Concurrent C++: Concurrent Programming With Class(es),” Software Practice and Experience, Vol. 18, No. 12, 1988, pp. 1157-1177. 15. Grimshaw, A. S., “Easy-to-Use Object-Oriented Parallel Programming with Mentat,” Technical Report, CS-92-32, Dept. of Computer Sci., Univ. of Virgina, Charlottesville, 1992. 16. Gross, T., O’Hallaron, and Subhlok, J., “Task Parallelism in a HighPerformance Fortran Framework,” IEEE Parallel & Distributed Technology: Systems & Applications, Vol. 2, No. 3, Fall 1994, pp. 16-26. 17. Hwang, K., “Advanced Computer Architecture: Parallelism, Scalability, Programmability,” Mc-GrawHill, Inc., 1993. 18. Kafura, D.G. and Lavender, R.G., “Concurrent Object-Oriented Languages and the Inheritance Anomaly,” Parallel Computers: Theory and Practice, The IEEE Computer Society Press, Los Alamitos, CA, 1995. 19. Kale, L. V. and Krishnan, S., “CHARM++: A Portable Concurrent ObjectOriented System Based on C++,” Proc. of OOPSLA, Washington DC, SeptOct, 1993, pp.91-109. 20. Kumar, S., “Issues in Parallelizing Object-Oriented Programs,” Proc. of Intn’l Conf. on Parallel Processing Workshop on Challenges for Parallel Processing, Oconomowoc, WI, Aug. 14, 1995, pp. 64-71. 21. Kumar, S. and Agrawal, D. P., “CORE: A Solution to the Inheritance Anomaly in Concurrent Object-Oriented Languages,” Proc. Sixth Intn’l Conf. on Parallel and Distributed Computing and Systems, Louisville, KY, Oct. 14-16, 1993, pp. 75-81. 22. Kumar, S. and Agrawal, D. P., “A Class Based Framework for Reuse of Synchronization Code in Concurrent Object-Oriented Languages,” Intn’l Journal of Computers and Their Applications, Vol. 1, No. 1, Aug. 1994, pp. 11-23. 23. Larus, J. R., “C**: A Large Grain, Object-Oriented Data Parallel Programming Language,” Conf. Record of Fifth Workshop on Languages and Compilers for Parallel Computing, Vol. 757, LNCS, Springer-Verlag, Aug. 1992, pp. 326-341. 24. Matsuoka, S., Taura, K., and Yonezawa, A., “Highly Efficient and Encapsulated Reuse of Synchronization Code in Concurrent Object-Oriented Languages,” Proc. of OOPSLA, Washington DC, Sept-Oct, 1993, pp.109-126. 25. Matsuoka, S. and Yonezawa, A., “Analysis of Inheritance Anomaly in ObjectOriented Concurrent Programming Languages,” Research Directions in Concurrent Object-oriented Programming, The MIT Press, 1993. 26. Meseguer J., “Solving the Inheritance Anomaly in Concurrent Object-Oriented Programming,” Proc. European Conf. on Object-Oriented Programming, Kaiserslautern, Germany, July 1993. 27. Meyer, B., Object-Oriented Software Construction, Prentice-Hall, Englewood Cliffs, NJ, 1988. 28. Nierstrasz, O., “Active Objects in Hybrid,” Proc. of OOPSLA, Orlando, Florida, USA, Oct. 1987, pp. 243-253. 29. Open Software Foundation, “OSF DCE Application Development Reference,” Prentice Hall, Inc., Englewood Cliffs, NJ, 1993. 30. Plevyak, J. and Chien, A. A., “Obtaining Sequential Efficiency From Concurrent Object-Oriented Programs,” Proc. of the 22nd ACM Symp. on the Priniciples of Programming Languages, Jan. 1995. 31. Saleh, H. and Gautron, P., “A Concurrency Control Mechanism for C++ Objects,” Object-Based Concurrent Computing, Springer-Verlag, July 1991, pp. 195-210. 32. Tomlinson, C. and Singh, V., “Inheritance and Synchronization with Enabled Sets,” Proc. of OOPSLA, New Orleans, USA, Oct. 1989, pp. 103-112.
A Concurrency Abstraction Model for OO Programs
137
33. Wyatt, B. B., Kavi, K., and Hufnagel, S., “Parallelism in Object-Oriented Languages: A Survey,” IEEE Software, Nov. 1992, pp. 39-47. 34. Yu, G. and Welch, L. R., “Program Dependence Analysis for Concurrency Exploitation in Programs Composed of Abstract Data Type Modules,” Proc. of Sixth IEEE Symp. on Parallel & Distributed Processing, Dallas, TX, Oct. 26-29, 1994, pp. 66-73.
Chapter 5. Loop Parallelization Algorithms Alain Darte, Yves Robert, and Fr´ed´eric Vivien
Ecole Normale Sup´erieure de Lyon, F - 69364 LYON Cedex 07, France [Alain.Darte,Yves.Robert,Frederic.Vivien]@lip.ens-lyon.fr Summary. This chapter is devoted to a comparative survey of loop parallelization algorithms. Various algorithms have been presented in the literature, such as those introduced by Allen and Kennedy, Wolf and Lam, Darte and Vivien, and Feautrier. These algorithms make use of different mathematical tools. Also, they do not rely on the same representation of data dependences. In this chapter, we survey each of these algorithms, and we assess their power and limitations, both through examples and by stating “optimality” results. An important contribution of this chapter is to characterize which algorithm is the most suitable for a given representation of dependences. This result is of practical interest, as it provides guidance for a compiler-parallelizer: given the dependence analysis that is available, the simplest and cheapest parallelization algorithm that remains optimal should be selected.
1. Introduction Loop parallelization algorithms are useful source to source program transformations. They are particularly appealing as they can be applied without any knowledge of the target architecture. They can be viewed as a first – machineindependent – step in the code generation process. Loop parallelization will detect parallelism (transforming DO loops into DOALL loops) and will expose those dependences that are responsible for the intrinsic sequentiality of some operations in the original program. Of course, a second step in code generation will have to take machine parameters into account. Determining a good granularity generally is a key to efficient performance. Also, data distribution and communication optimization are important issues to be considered. But all these problems will be addressed on a later stage. Such a two-step approach is typical in the field of parallelizing compilers (other examples are general task graph scheduling and software pipelining). This chapter is devoted to the study of various parallelism detection algorithms based on: 1. A simple decomposition of the dependence graph into its strongly connected components such as Allen and Kennedy’s algorithm [2]. 2. Unimodular loop transformations, either ad-hoc transformations such as Banerjee’s algorithm [3], or generated automatically such as Wolf and Lam’s algorithm [31]. 3. Schedules, either mono-dimensional schedules [10, 12, 19] (a particular case being the hyperplane method [26]) or multi-dimensional schedules [15, 20]. S. Pande, D.P. Agrawal (Eds.): Compiler Optimizations for Scalable PS, LNCS 1808, pp. 141-171, 2001. Springer-Verlag Berlin Heidelberg 2001
142
Alain Darte, Yves Robert, and Fr´ed´eric Vivien
These loop parallelization algorithms are very different for a number of reasons. First, they make use of various mathematical techniques: graph algorithms for (1), matrix computations for (2), and linear programming for (3). Second, they take a different description of data dependences as input: graph description and dependence levels for (1), direction vectors for (2), and description of dependences by polyhedra or affine expressions for (3). For each of these algorithms, we identify the key concepts that underline them, and we discuss their respective power and limitations, both through examples and by stating “optimality” results. An important contribution of this chapter is to characterize which algorithm is the most suitable for a given representation of dependences. No need to use a sophisticated dependence analysis algorithm if the parallelization algorithm cannot take advantage of the precision of its result. Conversely, no need to use a sophisticated parallelization algorithm if the dependence representation is not precise enough. The rest of this chapter is organized as follows. Section 2 is devoted to a brief summary of what loop parallelization algorithms are all about. In Section 3, we review major dependences abstractions: dependence levels, directions vectors, and dependence polyhedra. Allen and Kennedy’s algorithm [2] is presented in Section 4 and Wolf and Lam’s algorithm [31] is presented in Section 5. It is shown that both algorithms are “optimal” in the class of those parallelization algorithms that use the same dependence abstraction as their input, i.e. dependence levels for Allen and Kennedy and direction vectors for Wolf and Lam. In Section 6 we move to a new algorithm that subsumes both previous algorithms. This algorithm is based on a generalization of direction vectors, the dependence polyhedra. In Section 7 we briefly survey Feautrier’s algorithm, which relies on exact affine dependences. Finally, we state some conclusions in Section 8.
2. Input and Output of Parallelization Algorithms Nested DO loops enable to describe a set of computations, whose size is much larger than the corresponding program size. For example, consider nested loops whose loop counters describe a -cube of size : these loops encapsulate a set of computations of size . Furthermore, it often happens that such loop nests contain a non trivial degree of parallelism, i.e. a set of independent computations of size Ω( r ) for ≥ 1. This makes the parallelization of nested loops a very challenging problem: a compiler-parallelizer must be able to detect, if possible, a non trivial degree of parallelism with a compilation time not proportional to the sequential execution time of the loops. To make this possible, efficient parallelization algorithms must be proposed with a complexity, an input size and an output size that depend only on but certainly not on , i.e. that depend on the size of the sequential code but not on the number of computations described.
Loop Parallelization Algorithms
143
The input of parallelization algorithms is a description of the dependences which link the different computations. The output is a description of an equivalent code with explicit parallelism. 2.1 Input: Dependence Graph Each statement of the loop nest is surrounded by several loops. Each iteration of these loops defines a particular execution of the statement, called an operation. The dependences between the operations are represented by a directed acyclic graph: the expanded dependence graph (EDG). There are as many vertices in the EDG as operations in the loop nest. Executing the operations of the loop nest while respecting the partial order specified by the EDG guarantees that the correct result of the loop nest is preserved. Detecting parallelism in the loop nest amounts to detecting anti-chains in the EDG. We illustrate the notion of “expanded dependence graph” with the Example 21 below. The EDG corresponding to this code is depicted on Figure 2.1. j
Example 21.
5
DO i=1,n DO j=1,n a(i, j) = a(i-1, j-1) + a(i, j-1) ENDDO ENDDO
4 3 2 1 0
1
2
3
4
5
i
Fig. 2.1. Example 21 and its EDG. Unfortunately, the EDG cannot be used as input for parallelization algorithms, since it is usually too large and may not be described exactly at compile-time. Therefore the reduced dependence graph (RDG) is used instead. The RDG is a condensed and approximated representation of the EDG. This approximation must be a superset of the EDG, in order to preserve the dependence relations. The RDG has one vertex per statement in the loop nest and its edges are labeled according to the chosen approximation of dependences (see Section 3 for details). Figure 2.2 presents two possible RDGs for Example 21, corresponding to two different approximations of the dependences. Since its input is a RDG and not an EDG, a parallelization algorithm is not able to distinguish between two different EDGs which have the same RDG. Hence, the parallelism that can be detected is the parallelism contained
144
Alain Darte, Yves Robert, and Fr´ed´eric Vivien
2
1 a)
0 1
1 1 b)
Fig. 2.2. RDG: a) with dependence levels; b) with direction vectors. in the RDG. Thus, the quality of a parallelization algorithm must be studied with respect to the dependence analysis. For example, Example 21 and Example 22 have the same RDG with dependence levels (Figure 2.2 (a)). Thus, a parallelization algorithm which takes as input RDGs with dependence levels, cannot distinguish between the two codes. However, Example 21 contains one degree of parallelism whereas Example 22 is intrinsically sequential. Example 22. DO i=1,n DO j=1,n a(i, j) = 1 + a(i-1, n) + a(i, j-1) ENDDO ENDDO
2.2 Output: Nested Loops The size of the parallelized code, as noticed before, should not depend on the number of operations that are described. This is the reason why the output of a parallelization algorithm must always be described by a set of loops 1 . There are at least three ways to define a new order on the operations of a given loop nest (i.e. three ways to define the output of the parallelization algorithm), in terms of nested loops: 1. Use elementary loop transformations as basic steps for the algorithm, such as loop distribution (as in Allen and Kennedy’s algorithm), or loop interchange and loop skewing (as in Banerjee’s algorithm); 2. Apply a linear change of basis on the iteration domain, i.e. apply a unimodular transformation on the iteration vectors (as in Wolf and Lam’s algorithm). 3. Define a -dimensional schedule, i.e. apply an affine transformation from Z to Zd and interpret the transformation as a multi-dimensional timing function. Each component will correspond to a sequential loop, and 1
These loops can be arbitrarily complicated, as long as their complexity only depends on the size of the initial code. Obviously, the simpler the result, the better. But, in this context, the meaning of “simple” is not clear: it depends on the optimizations that may follow. We consider that structural simplicity is preferable, but this can be discussed.
Loop Parallelization Algorithms
145
the missing ( − ) dimensions will correspond to DOALL loops (as in Feautrier’s algorithm and Darte and Vivien’s algorithm). The output of these three transformation schemes can indeed be described as loop nests, after a more or less complicated rewriting processes (see [8, 9, 11, 31, 36]). We do not discuss the rewriting process here. Rather, we focus on the link between the representation of dependences (the input) and the loop transformations involved in the parallelization algorithm (the output). Our goal is to characterize which algorithm is optimal for a given representation of dependences. Here, “optimal” means that the algorithm succeeds in exhibiting the maximal number of parallel loops.
3. Dependence Abstractions For the sake of clarity, we restrict ourselves to the case of perfectly nested DO loops with affine loop bounds. This restriction permits to identify the iterations of the nested loops ( is called the depth of the loop nest) with vectors in Z (called the iteration vectors) contained in a finite convex polyhedron (called the iteration domain) bounded by the loop bounds. The -th component of an iteration vector is the value of the -th loop counter in the nest, counting from the outermost to the innermost loop. In the sequential code, the iterations are therefore executed in the lexicographic order of their iteration vectors. In the next sections, we denote by D the polyhedral iteration domain, by and -dimensional iteration vectors in D, and by the -th statement in the loop nest, where 1 ≤ ≤ . We write J if is lexicographically greater than J and ≥ J if J or = J . Section 3.1 recalls the different concepts of dependence graphs introduced in the informal discussion of Section 2.1: expanded dependence graphs (EDG), reduced dependence graphs (RDG), apparent dependence graphs (ADG), and the notion of distance sets. In Section 3.2, we formally define what we call polyhedral reduced dependence graphs (PRDG), i.e. reduced dependence graphs whose edges are labeled by polyhedra. Finally, in Section 3.3, we show how the model of PRDG generalizes classical dependence abstractions of distance sets such as dependence levels and direction vectors. 3.1 Dependence Graphs and Distance Sets Dependence relations between operations are defined by Bernstein’s conditions [4]. Briefly speaking, two operations are considered dependent if both operations access the same memory location and if at least one of the accesses is a write. The dependence is directed according to the sequential order, from the first executed operation to the last one. Depending on the order of write(s) and/or read, the dependence corresponds to a so called
146
Alain Darte, Yves Robert, and Fr´ed´eric Vivien
flow dependence, anti dependence or output dependence. We write: =⇒ !$ (J) if statement S$ at iteration J depends on statement S" at iteration I. The partial order defined by =⇒ describes the expanded dependence graph (EDG). Note that (J − I) is always lexicographically nonnegative when S" (I) =⇒ S$ (J). In general, the EDG cannot be computed at compile-time, either because some information is missing (such as the values of size parameters or even worse, precise memory accesses), or because generating the whole graph is too expensive (see [35, 37] for a survey on dependence tests such as the gcd test, the power test, the omega test, the lambda test, and [18] for more details on exact dependence analysis). Instead, dependences are captured through a smaller cyclic directed graph, with s vertices (as many as statements), called the reduced dependence graph (RDG) (or statement level dependence graph). The RDG is a compression of the EDG. In the RDG, two statements S" and S$ are said dependent (we write e : S" → S$ ) if there exists at least one pair (I, J) such that S" (I) =⇒ S$ (J). Furthermore, the 2 edge e from S" to S$ in the RDG is labeled by the set {(I, J) ∈ D2 | S" (I) =⇒ S$ (J)}, or by an approximation De that contains this set. The precision and the representation of this approximation make the power of the dependence analysis. In other words, the RDG describes, in a condensed manner, an iteration level dependence graph, called (maximal) apparent dependence graph (ADG), that is a superset of the EDG. The ADG and the EDG have the same vertices, but the ADG has more edges, defined by:
!" (# )
(S" , I) =⇒ (S$ , J) (in the ADG) ⇔ ∃ e = (S" , S$ ) (in the RDG ) such that (I, J) ∈ De . For a certain class of nested loops, it is possible to express exactly this set of pairs (I, J) (see [18]): I is given as an affine function (in some particular cases, involving floor or ceiling functions) f"%$ of J where J varies in a polyhedron P"%$ : {(I, J) ∈ D2 | S" (I) =⇒ S$ (J)} = {(f"%$ (J), J) | J ∈ P"%$ ⊂ D}
(3.1)
In most dependence analysis algorithms however, rather than the set of pairs (I, J), one computes the set E"%$ of all possible values (J − I). E"%$ is called the set of distance vectors, or distance set: E"%$ = {(J − I) | S" (I) =⇒ S$ (J)} When exact dependence analysis is feasible, Equation 3.1 shows that the set of distance vectors is the projection of the integer points of a polyhedron. This set can be approximated by its convex hull or by a more or less accurate 2
Actually, there is such an edge for each pair of memory accesses that induces a dependence between Si and Sj .
Loop Parallelization Algorithms
147
description of a larger polyhedron (or a finite union of polyhedra). When the set of distance vectors is represented by a finite union, the corresponding dependence edge in the RDG is decomposed into multi-edges. Note that the representation by distance vectors is not equivalent to the representation by pairs (as in Equation 3.1), since the information concerning the location in the EDG of such a distance vector is lost. This may even cause some loss of parallelism, as will be seen in Example 64. However, this representation remains important, especially when exact dependence analysis is either too expensive or not feasible. Classical representations of distance sets (by increasing precision) are: – level of dependence, introduced in [1, 2] for Allen and Kennedy’s parallelizing algorithm. – direction vector, introduced by Lamport [26] and by Wolfe in [32, 33], then used in Wolf and Lam’s parallelizing algorithm [31]. – dependence polyhedron, introduced in [22] and used in Irigoin and Triolet’s supernode partitioning algorithm [23]. We refer to the PIPS software [21] for more details on dependence polyhedra. We now formally define reduced dependence graphs whose edges are labeled by dependence polyhedra. Then we show that this representation subsumes the two other representations, namely dependence levels and direction vectors. 3.2 Polyhedral Reduced Dependence Graphs We first recall the mathematical definition of a polyhedron, and how it can be decomposed into vertices, rays and lines. Definition 31 (Polyhedron, polytope). A set P of vectors in Q & is called a (convex) polyhedron if there exists an integral matrix A and an integral vector b such that: P = {x | x ∈ Q & , Ax ≤ b} A polytope is a bounded polyhedron. A polyhedron can always be decomposed as the sum of a polytope and of a polyhedral cone (for more details see [30]). A polytope is defined by its vertices, and any point of the polytope is a non-negative barycentric combination of the polytope vertices. A polyhedral cone is finitely generated and can be defined by its rays and lines. Any point of a polyhedral cone is the sum of a nonnegative combination of its rays and of any combination of its lines. Therefore, a dependence polyhedron P can be equivalently defined by a set of vertices (denoted by {v1 , . . . , vω }), a set of rays (denoted by {r1 , . . . , rρ }),
148
Alain Darte, Yves Robert, and Fr´ed´eric Vivien
and a set of lines (denoted by {l1 , . . . , lλ }). Then, P is the set of all vectors p such that: ρ λ ω ξi li (3.2) νi ri + µi vi + p= i=1
i=1
i=1
with µi ∈ Q , νi ∈ Q , ξi ∈ Q , and i=1 µi = 1. We now define what we call a polyhedral reduced dependence graph (or PRDG), i.e. a reduced dependence graph labeled by dependence polyhedra. Actually, we are interested only in integral vectors that belong to the dependence polyhedra, since dependence distance are indeed integral vectors. +
+
ω
Definition 32. A polyhedral reduced dependence graph (PRDG) is a RDG, where each edge e : Si → Sj is labeled by a dependence polyhedron P (e) that approximates the set of distance vectors: the associated ADG contains an edge from instance I of node Si to instance J of node Sj if and only if (J − I) ∈ P (e). We explore in Section 6 this representation of dependences. At first sight, the reader can see dependence polyhedra as a generalization of direction vectors. 3.3 Definition and Simulation of Classical Dependence Representations We come back to more classical dependence abstractions: level of dependence and direction vector. We recall their definition and show that RDGs labeled by direction vectors or dependence levels are actually particular cases of polyhedral reduced dependence graphs. Direction vectors When the set of distance vectors is a singleton, the dependence is said uniform and the unique distance vector is called a uniform dependence vector. Otherwise, the set of distance vectors can still be represented by a ndimensional vector (called the direction vector), whose components belong to Z ∪ {∗} ∪ (Z × {+, −}). Its i-th component is an approximation of the i-th components of all possible distance vectors: it is equal to z+ (resp. z−) if all i-th components are greater (resp. smaller) than or equal to z. It is equal to ∗ if the i-th component may take any value and to z if the dependence is uniform in this dimension with unique value z. In general, + (resp. −) is used as shorthand for 1+ (resp. (−1)−). We denote by ei the i-th canonical vector, i.e. the n-dimensional vector whose components are all null except the i-th component equal to 1. Then, a direction vector is nothing but an approximation by a polyhedron, with a single vertex and whose rays and lines, if any, are canonical vectors. Indeed, consider an edge e labeled by a direction vector d and denote by I + , I − and I ∗ the sets of components of d which are respectively equal to
Loop Parallelization Algorithms
149
z+ (for some integer z), z−, and ∗. Finally, denote by dz the n-dimensional vector whose i-th component is equal to z if the i-th component of d is equal to z, z+ or z−, and to 0 otherwise. Then, by definition of the symbols +, − and ∗, the direction vector d represents exactly all n-dimensional vectors p for which there exist integers + − ∗ (ν, ν ′ , ξ) in N |I | × N |I | × Z|I | such that: p = dz + ξi ei (3.3) νi ei − νi′ ei + i∈I +
i∈I −
i∈I ∗
In other words, the direction vector d represents all integer points that belong to the polyhedron defined by the single vertex dz , the rays ei for i ∈ I + , the rays −ei for i ∈ I − and the lines ei for i ∈ I ∗ . For example, the direction vector (2+, ∗, −, 3) defines the polyhedron with one vertex (2, 0, −1, 3), two rays (1, 0, 0, 0) and (0, 0, −1, 0), and one line (0, 1, 0, 0). Dependence levels The representation by level is the less accurate dependence abstraction. In a loop nest with n nested loops, the set of distance vectors is approximated by an integer l, in [1, n] ∪ {∞}, defined as the largest integer such that the l − 1 first components of the distance vectors are zero. A dependence at level l ≤ n means that the dependence occurs at depth l of the loop nest, i.e. at a given iteration of the l − 1 outermost loops. In this case, one says that the dependence is a loop carried dependence at level l. If l = ∞, the dependence occurs inside the loop body, between two different statements, and is called a loop independent dependence. A reduced dependence graph whose edges are labeled by dependence levels is called a Reduced Leveled Dependence Graph (RLDG). Consider an edge e of level l. By definition of the level, the first non-zero component of the distance vectors is the l-th component and it can possibly take any positive integer value. Furthermore, we have no information on the remaining components. Therefore, an edge of level l < ∞ is equivalent to the l−1
n−l direction vector: (0, . . . , 0, 1+, ∗, . . . , ∗) and an edge of level ∞ corresponds to the null dependence vector. As any direction vector admits an equivalent polyhedron, so does a representation by level. For example, a level 2 dependence in a 3-dimensional loop nest, means a direction vector (0, 1+, ∗) which corresponds to the polyhedron with one vertex (0, 1, 0), one ray (0, 1, 0) and one line (0, 0, 1).
4. Allen and Kennedy’s Algorithm Allen and Kennedy’s algorithm [2] has first been designed to vectorizing loops. Then, it has been extended so as to maximize the number of parallel loops and to minimize the number of synchronizations in the transformed code. The input of this algorithm is a RLDG.
150
Alain Darte, Yves Robert, and Fr´ed´eric Vivien
Allen and Kennedy’s algorithm is based on the following facts: 1. A loop is parallel if it has no loop carried dependence, i.e. if there is no dependence, whose level is equal to the depth of the loop, that concerns a statement surrounded by the loop. 2. All iterations of a statement S1 can be carried out before any iteration of a statement S2 if there is no dependence in the RLDG from S2 to S1 . Property (1) allows to mark a loop as a DOALL or a DOSEQ loop, whereas property (2) suggests that parallelism detection can be independently conducted in each strongly connected component of the RLDG. Parallelism extraction is done by loop distribution. 4.1 Algorithm For a dependence graph G, we denote by G(k) the subgraph of G in which all dependences at level strictly smaller than k have been removed. Here is a sketch of the algorithm in its most basic formulation. The initial call is Allen-Kennedy(RLDG, 1). Allen-Kennedy(G, k). – If k > n, stop. – Decompose G(k) into its strongly connected components Gi and sort them topologically. – Rewrite code so that each Gi belongs to a different loop nest (at level k) and the order on the Gi is preserved (distribution of loops at level ≥ k). – For each Gi , mark the loop at level k as a DOALL loop if Gi has no edge at level k. Otherwise mark the loop as a DOSEQ loop. – For each Gi , call Allen-Kennedy(Gi , k + 1). We illustrate Allen and Kennedy’s algorithm on the code below: Example 41. DO i=1,n DO j=1,n DO k=1,n S1 : a(i, j, k) = a(i-1, j+i, k) + a(i, j, k-1) + b(i, j-1, k) S2 : b(i, j, k) = b(i, j-1, k+j) + a(i-1, j, k) ENDDO ENDDO ENDDO
The dependence graph G = G(1), drawn on Figure 4.1, has only one strongly connected component and at least one edge at level 1, thus the first call finds that the outermost loop is sequential. However, at level 2 (the edge at level 1 is no longer considered), G(2) has two strongly connected components: all iterations of statement S2 can be carried out before any
Loop Parallelization Algorithms
151
1
S2
S1 1, 3
2
2
Fig. 4.1. RLDG for Example 41. iteration of statement S1 . A loop distribution is performed. The strongly connected component including S1 contains no edge at level 2 but one edge at level 3. Thus the second loop surrounding S1 is marked DOSEQ and the third one DOALL. The strongly connected component including S2 contains an edge at level 2 but no edge at level 3. Thus the second loop surrounding S1 is marked DOALL and the third one DOSEQ. Finally, we get: DOSEQ i=1,n DOSEQ j=1,n DOALL k=1,n S2 : b(i, j, k) = b(i, j-1, k+j) + a(i-1, j, k) ENDDO ENDDO DOALL j=1,n DOSEQ k=1,n S1 : a(i, j, k) = a(i-1, j+i, k) + a(i, j, k-1) + b(i, j-1, k) ENDDO ENDDO ENDDO
4.2 Power and Limitations It has been shown in [6] that for each statement of the initial code, as many surrounding loops as possible are detected as parallel loops by Allen and Kennedy’s algorithm. More precisely, consider a statement S of the initial code and Li one of the surrounding loops. Then Li will be marked as parallel if and only if there is no dependence at level i between two instances of S. This result proves only that the algorithm is optimal among all parallelization algorithms that describe, in the transformed code, the instances of S with exactly the same loops as in the initial code. In fact a much stronger result has been proved in [17]: Theorem 41. Algorithm Allen-Kennedy is optimal among all parallelism detection algorithms whose input is a Reduced Leveled Dependence Graph (RLDG). It is proved in [17] that for any loop nest N1 , there exists a loop nest N2 , which has the same RLDG, and such that for any statement S of N1 surrounded after parallelization by dS sequential loops, there exists in the
152
Alain Darte, Yves Robert, and Fr´ed´eric Vivien
exact dependence graph of N2 a dependence path which includes Ω(N dS ) instances of statement S. In other words, Allen and Kennedy’s algorithm cannot distinguishes N1 from N2 as they have the same RLDG, and the parallelization algorithm is optimal in the strongest sense on N2 as it reaches on each statement the upper bound on the parallelism defined by the longest dependence paths in the EDG. This proves that, as long as the only information available is the RLDG, it is not possible to find more parallelism than found by Allen and Kennedy’s algorithm. In other words, algorithm Allen-Kennedy is well adapted to a representation of dependences by dependence levels. Therefore, to detect more parallelism than found by algorithm Allen-Kennedy, more information on the dependences is required. Classical examples for which it is possible to overcome algorithm Allen-Kennedy are Example 42 where a simple interchange (Figure 4.2) reveals parallelism and Example 43 where a simple skew and interchange (Figure 4.3) are sufficient. Example 42. DO i=1,n DO j=1,n a(i, j) = a(i-1, j-1) + a(i, j-1) ENDDO ENDDO
0 1
1 1
Fig. 4.2. Example 42: code and RDG.
Example 43. DO i=1,n DO j=1,n a(i, j) = a(i-1, j) + a(i, j-1) ENDDO ENDDO
0 1
1 0
Fig. 4.3. Example 43: code and RDG.
5. Wolf and Lam’s Algorithm Examples 42 and 43 contain some parallelism, that can not be detected by Allen and Kennedy’s algorithm. Therefore, as shown by Theorem 41, this
Loop Parallelization Algorithms
153
parallelism can not be extracted if the dependences are represented by dependence levels. To overcome this limitation, Wolf and Lam [31] proposed an algorithm that uses direction vectors as input. Their work unifies all previous algorithms based on elementary matrix operations such as loop skewing, loop interchange, loop reversal, into a unique framework: the framework of valid unimodular transformations. 5.1 Purpose Wolf and Lam aim at building sets of fully permutable loop nests. Fully permutable loops are the basis of all tiling techniques [5, 23, 29, 31]. Tiling is used to expose medium-grain and coarse-grain parallelism. Furthermore, a set of d fully permutable loops can be rewritten as a single sequential loop and d − 1 parallel loops. Thus, this method can also be used to express fine grain parallelism. Wolf and Lam’s algorithm builds the largest set of outermost fully permutable3 loops. Then it looks recursively at the remaining dimensions and at the dependences not satisfied by these loops. The version presented in [31] builds the set of loops via a case analysis of simple examples, and relies on a heuristic for loop nests of depth greater than or equal to six. In the rest of this section, we explain their algorithm from a theoretical perspective, and we provide a general version of this algorithm. 5.2 Theoretical Interpretation Unimodular transformations have two main advantages: linearity and invertibility. Given a unimodular transformation T , the linearity allows to easily check whether T is a valid transformation. Indeed, T is valid if and only if T d >l 0 for all non zero distance vectors d. The invertibility enables to rewrite easily the code as the transformation is a simple change of basis in Zn. In general, T d >l 0 cannot be checked for all distance vectors, as there are two many of them. Thus, one tries to guarantee T d >l 0 for all non-zero direction vectors, with the usual arithmetic conventions in Z ∪ {∗} ∪ (Z × {+, −}). In the following, we consider only non-zero direction vectors, which are known to be lexicographically positive (see Section 3.1). Denote by t(1), . . . , t(n), the rows of T . Let Γ be the closure of the cone generated by all direction vectors. For a direction vector d: T d >l 0 ⇔ ∃kd , 1 ≤ kd ≤ n | ∀i, 1 ≤ i < kd , t(i).d = 0 and t(kd ).d > 0. This means that the dependences represented by d are carried at loop level kd . If kd = 1 for all direction vectors d, then all dependences are carried by the first loop, and all inner loops are DOALL loops. t(1) is then called a 3
The i-th and (i + 1)-th loops are permutable if and only if the i-th and (i + 1)-th components of any distance vector of depth ≥ i are nonnegative.
154
Alain Darte, Yves Robert, and Fr´ed´eric Vivien
timing vector or separating hyperplane. Such a timing vector exists if and only if Γ is pointed, i.e. if and only if Γ contains no linear space. This is also equivalent to the fact that the cone Γ + – defined by Γ + = {y | ∀x ∈ Γ, y.x ≥ 0} – is full-dimensional (see [30] for more details on cones and related notions). Building T from n linearly independent vectors of Γ + permits to transform the loops into n fully permutable loops. The notion of timing vector is at the heart of the hyperplane method and its variants (see [10, 26]), which are particularly interesting for exposing finegrain parallelism, whereas the notion of fully permutable loops is the basis of all tiling techniques. As said before, both formulations are strongly linked by Γ + . When the cone Γ is not pointed, Γ + has a dimension r, 1 ≤ r < n, r = n − s where s is the dimension of the lineality space of Γ . With r linearly independent vectors of Γ + , one can transform the loop nest so that the r outermost loops are fully permutable. Then, one can recursively apply the same technique to transform the n − r innermost loops, by considering the direction vectors not already carried by one of the r outermost loops, i.e by considering the direction vectors included in the lineality space of Γ . This is the general idea of Wolf and Lam’s algorithm even if they obviously did not express it in such terms in [31]. 5.3 The General Algorithm Our discussion can be summarized by the algorithm Wolf-Lam given below. Algorithm Wolf-Lam takes as input a set of direction vectors D and a sequence of linearly independent vectors E (initialized to void) from which the transformation matrix is built: Wolf-Lam(D, E). – Define Γ as the closure of the cone generated by the direction vectors of D. – Define Γ + = {y | ∀x ∈ Γ, y.x ≥ 0} and let r be the dimension of Γ + . – Complete E into a set E ′ of r linearly independent vectors of Γ + (by construction, E ⊂ Γ + ). – Let D′ be the subset of D defined by d ∈ D′ ⇔ ∀v ∈ E ′ , v.d = 0 (i.e. D′ = D ∩ E ′⊥ = D ∩ lin.space(Γ )). – Call Wolf-Lam(D′ , E ′ ). Actually, the above process may lead to a non unimodular matrix. Building the desired unimodular matrix T can be done as follows: – Let D be the set of direction vectors. Set E = ∅ and call Wolf-Lam(D, E). – Build a non singular matrix T1 whose first rows are the vectors of E (in the same order). Let T2 = pT1−1 where p is chosen so that T2 is an integral matrix. – Compute the left Hermite form of T2 , T2 = QH, where H is nonnegative, lower triangular and Q is unimodular. – Q−1 is the desired transformation matrix (since pQ−1 D = HT1 D).
Loop Parallelization Algorithms
155
We illustrate this algorithm with the following example: Example 51. DO i=1,n DO j=1,n DO k=1,n a(i, j, k) = a(i-1, j+i, k) + a(i, j, k-1) + a(i, j-1, k+1) ENDDO ENDDO ENDDO
0 0 1 0 1 −1
1 − 0
Fig. 5.1. Example 51: code and RDG.
The set of direction vectors is D = {(1, −, 0), (0, 0, 1), (0, 1, −1)} (see Figure 5.1). The lineality space of Γ (D) is two-dimensional (generated by (0, 1, 0) and (0, 0, 1)). Thus, Γ + (D) is one dimensional and generated by E1 = {(1, 0, 0)}. Then D′ = {(0, 0, 1), (0, 1, −1)} and Γ (D′ ) is pointed. We complete E1 by two vectors of Γ + (D′ ), for example by E2 = {(0, 1, 0), (0, 1, 1)}. In this particular example, the transformation matrix whose rows are E1 , E2 is already unimodular and corresponds to a simple loop skewing. For exposing DOALL loops, we choose the first vector of E2 in the relative interior of Γ + , for example E2 = {(0, 2, 1), (0, 1, 0)}. In terms of loops transformations, this amounts to skewing the loop k by factor 2 and then to interchanging loops j and k: DOSEQ i=1,n DOSEQ k=3,3×n DOALL j=max(1, ⌈ k−n ⌉), min(n, ⌊ k−1 ⌋) 2 2 a(i, j, k-2×j) = a(i-1, j+i, k-2×j) + a(i, j, k-2×j-1) + a(i, j-1, k-2×j+1) ENDDO ENDDO ENDDO
5.4 Power and Limitations Wolf and Lam showed that this methodology is optimal (Theorem B.6. in [31]): “an algorithm that finds the maximum coarse grain parallelism, and then recursively calls itself on the inner loops, produces the maximum degree of parallelism possible”. Strangely, they gave no hypothesis for this theorem. However, once again, this theorem has to be understood with respect to the dependence analysis that is used: namely, direction vectors, but without any information on the structure of the dependence graph. A correct formulation is the following:
156
Alain Darte, Yves Robert, and Fr´ed´eric Vivien
Theorem 51. Algorithm Wolf-Lam is optimal among all parallelism detection algorithms whose input is a set of direction vectors (implicitly, one considers that the loop nest has only one statement or that all statements form an atomic block). Therefore, as for algorithm Allen-Kennedy, the sub-optimality of algorithm Wolf-Lam in the general case has to be found, not in the algorithm methodology, but in the weakness of its input: the fact that the structure of the RDG is not exploited may result in a loss of parallelism. For example, contrarily to algorithm Allen-Kennedy, algorithm Wolf-Lam finds no parallelism in Example 41 (whose RDG is given by Figure 5.2) because of the typical structure of the direction vectors (1, −, 0), (0, 1, −), (0, 0, 1).
1 0 0 0 0 1
1 − 0
S 1
S2
0 1 −
0 1 0
Fig. 5.2. Reduced Dependence Graph with direction vectors for Example 41.
6. Darte and Vivien’s Algorithm In this section, we introduce a third parallelization algorithm, that takes as input polyhedral reduced dependence graphs. We first explain our motivation (Section 6.1), then we proceed to a step-by-step presentation of the algorithm. We work out several examples. 6.1 Another Algorithm Is Needed We have seen two parallelization algorithms so far. Each algorithm may output a pure sequential code for examples where the other algorithm does find some parallelism. This motivates the search for a new algorithm subsuming algorithms Wolf-Lam and Allen-Kennedy. To reach this goal, one can imagine to combine these algorithms, so as to simultaneously exploit the structure of the RDG and the structure of the direction vectors: first, compute the cone generated by the direction vectors and transform the loop nest so as to expose the largest outermost fully permutable loop nest; then, consider the subgraph of the RDG, formed by the direction vectors that are not carried
Loop Parallelization Algorithms
157
by the outermost loops, and compute its strongly connected components; finally, apply a loop distribution in order to separate these components, and recursively apply the same technique on each component. Such a strategy enables to expose more parallelism by combining unimodular transformations and loop distribution. However, it is not optimal as Example 61 (Figure 6.1) illustrates. Indeed, on this example, combining algorithms Allen-Kennedy and Wolf-Lam as proposed above enables to find only one degree of parallelism, since at the second phase the RDG remains strongly connected. This is not better than the basic algorithm AllenKennedy. However, one can find two degrees of parallelism in Example 61 by scheduling S1 (i, j, k) at time-step 4i−2k and S2 (i, j, k) at time-step 4i−2k+3. Example 61. DO i=1,n DO j=1,n DO k=1,n S1 : a(i, j, k) = b(i-1, j+i, k) + b(i, j-1, k+2) S2 : b(i, j, k) = a(i, j-1, k+j) + a(i, j, k-1) ENDDO ENDDO ENDDO
0 1 -
0 0 1
S1
S2 1 0
0 1 -2
Fig. 6.1. Example 61: code and RDG. Consequently, we would like to have a single parallelization algorithm which finds some parallelism at least when Allen-Kennedy or Wolf-Lam does. The obvious solution would be to try Allen-Kennedy, then WolfLam (and even a combination of both algorithms) and to report the best answer. But such a naive approach is not powerful enough, because it uses either the dependence graph structure (Allen-Kennedy) or direction vectors (Wolf-Lam), but never benefits from both knowledges at the same step. For example, the proposed combination of both algorithms would use the dependence graph structure before or after the computation of a maximal set of fully permutable loops, but never during this computation. We claim that information on both the graph structure and the direction vectors must be used simultaneously. This is because the key concept when scheduling RDGs is not the cone generated by the direction vectors (i.e. the weights of the edges of the RDG), but turns out to be the cone generated by the weights of the cycles of the RDG. This is the motivation for the multi-dimensional scheduling algorithm presented below. It can be seen as a combination of unimodular transformations, loop distribution, and index-shift method. This algorithm subsumes algorithms Allen-Kennedy and Wolf-Lam. Beforehand we motivate the
158
Alain Darte, Yves Robert, and Fr´ed´eric Vivien
choice of the representation of the dependences that the algorithm works with. 6.2 Polyhedral Dependences: A Motivating Example In this section we present an example which contains some parallelism that cannot be detected if the dependences are represented by levels or direction vectors. However, there is no need to use an exact representation of the dependences to find some parallelism in this loop nest. Indeed, a representation of the dependences with dependence polyhedra enables us to parallelize this code. Example 62. 1 ≤ i ≤ n, 1 ≤ j < n
DO i = 1, n DO j = 1, n S: a(i, j) = a(j, i) + a(i, j-1) ENDDO ENDDO
flow
S(i, j) −→ S(i, j+1) flow
1≤i ) = ∅. Similar to the self-dependent relation, the cross-dependent relations also have to consider all cross-dependent relations obtained from different array variables. Synthesizing the above analyses, the space obtained by cross-dependent relations for loop L3.1 is span({[1, 1]: }). As Section 2.3 described, iteration-dependent space includes the selfdependent relations and cross-dependent relations. Since the spaces obtained by self-dependent relations and cross-dependent relations are the same, span({[1, 1]: }), the iteration-dependent space of loop L3.1 , IDS(L3.1 ), is therefore equal to span({[1, 1]:}). We have found the basis of iterationdependent space, which is {[1, 1]:}. Based on the finding of iteration-dependent space, the data space partitions can be obtained accordingly. Basically, all data elements referenced by iterations in the iteration-dependent space are grouped together and distributed to the processor where the iteration-dependent space is distributed to. Suppose that there are k different reference functions Ref? , 1 ≤ i ≤ k, that reference the same array variable in loop L and IDS(L) is the iterationdependent space of loop L. The following iteration set and data set are allocated to the same processor. {I|∀I ∈ IDS(L)}, and {D|D = Ref? (I), ∀I ∈ IDS(L) and i = 1, 2, . . . , k}. The above constraint should hold true for all array variables in loop L. Suppose Ψ (L) is the iteration space partitioning and Φ(L) is the data space partitioning. Let Ref@v denote the k :A array reference of array variable
350
Kuei-Ping Shih, Chua-Huang Huang, and Jang-Ping Sheu
v in loop L and Dv denote the data index in data space of v. For this example, given Iinit , an initial iteration, the following sets are distributed to the same processor. Ψ (L3.1 ) Φ(L3.1 )
= {I|I = Iinit + c[1, 1]t , c ∈ Z}, = {DA |DA = Ref1A (I) or DA = Ref2A (I), ∀I ∈ Ψ (L3.1 )} {DB |DB = Ref1B (I) or DB = Ref2B (I), ∀I ∈ Ψ (L3.1 )} {DC |DC = Ref1C (I) or DC = Ref2C (I), ∀I ∈ Ψ (L3.1 )}
Fig. 3.1 illustrates the non-duplicate data communication-free iteration and data allocations for loop L3.1 when Iinit = [1, 1]t . Fig. 3.1 (a) is the iteration space partitioning Ψ (L3.1 ) on iteration space of loop L3.1 , IS(L3.1 ), where Iinit = [1, 1]t . Fig. 3.1 (b), (c), and (d) are the data space partitionings Φ(L3.1 ) on data spaces of A, B, and C, respectively. If Ψ (L3.1 ) and Φ(L3.1 ) are distributed onto the same processors, communication-free execution is obtained. 3.1.2 Duplicate Data Strategy. The fact that each data can be allocated on only one processor reduces the probability of communication-free execution. If the constraint can be removed, it will get much improvement on the findings of communication-free partitionings. Since interprocessor communication is too time-consuming, it is worth replicating data in order to obtain higher degree of parallelism. Chen and Sheu’s method is the only one that takes data replication into consideration. Actually, not all data elements can be replicated. The data elements that incur output, input, and anti-dependences can be replicated. It is because only true dependence results data movements. Output, input, and anti-dependences affect only execution orders but no data movements. Hence, the iteration-dependent space needs considering only the true dependence relations. Chen and Sheu’s method defines two terms, one is fully duplicable array, another is partially duplicable array, to classify arrays into fully or partially duplicable arrays. An array is fully duplicable if the array involves no true dependence relations; otherwise, the array is partially duplicable. For a fully duplicable array, which incurs no true dependence relation, since all iterations use the old values of data elements, not the newly generated values; therefore, the array can be fully duplicated onto all processors without affecting the correctness of execution. For a partially duplicable arrays, only the data elements which involve no true dependence relations can be replicated. Example 32. Consider the following loop. do i1 = 1, N do i2 = 1, N A(i1 , i2 ) = A(i1 + 1, i2 ) + A(i1 , i2 + 1) enddo enddo
(L3.2 )
Communication-Free Partitioning of Nested Loops
351
Fig. 3.1. Non-duplicate data communication-free iteration and data allocations for loop L3.1 while IBCBD = [1, 1]D . (a) Iteration space partitioning Ψ (L3.1 ) on IS(L3.1 ), where IBCBD = [1, 1]D . (b) Data space partitioning Φ(L3.1 ) on DS(A). (c) Data space partitioning Φ(L3.1 ) on DS(B). (d) Data space partitioning Φ(L3.1 ) on DS(C). Loop L3.2 is a perfectly nested loop with uniformly generated reference. Suppose that the non-duplicate data strategy is adopted. Form Section 3.1.1, the iteration-dependent space of loop L3.2 is IDS(L3.2 ) = span({[−1, 0]D, [0, −1]D , [1, −1]D}), where the self- and cross-dependent relations are ∅ and span({[−1, 0]D, [0, −1]D , [1, −1]D }), respectively. Obviously, IDS(L3.2 ) spans IS(L3.2 ). It means that if sequential execution is out of consideration, loop L3.2 exists no communication-free partitioning without duplicate data. If the duplicate data strategy is adopted instead, loop L3.2 can be fully parallelized under communication-free criteria. The derivation of the result is as follows. As explained above, the data elements that incur output, input, and anti-dependences do not affect the correctness of execution and can be replicated. On the other hand, the data elements that incur true dependences will cause data movement and can be replicated. For this example, the data dependent vectors obtained are [−1, 0]D , [0, −1]D , and [1, −1]D . The data dependent vectors [−1, 0]D and [0, −1]D are anti-dependence and [1, −1]D
352
Kuei-Ping Shih, Chua-Huang Huang, and Jang-Ping Sheu
is an input dependence. Since array A incurs no true dependence, array A is a fully duplicable array. Therefore, the iteration-dependent space contains no dependent relations that is incurred by true dependence relations. Thus, IDS(L3.2 ) = ∅. It implies that if array A is replicated onto processors appropriately, each iteration can be executed separately and no interprocessor communication is incurred. The distributions of iterations and data elements are as follows. Given an initial iteration, IEFEG , the following sets are mapped to the same processor. Ψ (L3.2 ) = {IEFEG }, Φ(L3.2 ) = {D | D = Ref1 (I) or D = Ref2 (I) or D = Ref3 (I), ∀ I ∈ Ψ (L3.2 )}. Fig. 3.2 shows the duplicate data communication-free allocation of iterations and data elements for loop L3.2 when IEFEG = [3, 3]G . Fig. 3.2 (a) is the iteration space partitioning Ψ (L3.2 ) on iteration space of loop L3.2 , IS(L3.2 ), where IEFEG = [3, 3]G . Fig. 3.2 (b) is the data space partitioning Φ(L3.2 ) on data space of A. Note that the overlapped data elements are the duplicate data elements.
Fig. 3.2. Duplicate data communication-free iteration and data allocations for loop L3.2 when IEFEG = [3, 3]G . (a) Iteration space partitioning Ψ (L3.2 ) on IS(L3.2 ), where IEFEG = [3, 3]G . (b) Data space partitioning Φ(L3.2 ) on DS(A).
The duplicate data strategy does not always promise to obtain higher degree of parallelism. It is possible that applying the duplicate data strategy is in vain for increasing the degree of parallelism. Example 33. Reconsider loop L3.1 . According to the previously analyses, array A can cause self- and cross-dependence relations and the dependence vectors are t[1, 1], where t ∈ Z. Array B causes no self- and cross-dependence relations. Array C causes no self-dependence relations but cross-dependence
Communication-Free Partitioning of Nested Loops
353
relations and the dependence vectors are [1, 1]. Array A can involve true dependence and array C involves only input dependence. Obviously, array B involves no dependence. Therefore, array A is a partially duplicable array and arrays B and C are fully duplicable arrays. Since fully duplicable arrays invoke no data movement, the dependence vectors caused by fully duplicable arrays can be ignored by way of replication of data. However, true dependence vectors caused by partially duplicable arrays do cause interprocessor communication and must be included in the iteration-dependent space. Consequently, the iteration-dependent space is IDS(L3.1 ) = span({[1, 1]H}), which is the same as the result obtained in Section 3.1.1. Clearly, the degree of parallelism is still no improved even though the duplicate data strategy is adopted. We have mentioned that only true dependence relations can cause data movement. Nevertheless, output dependence relations cause no data movement but data consistency problem. Output dependence relations mean that there are multiple-writes to the same data elements. It is feasible to duplicate data to eliminate the output dependence relations. However, the data consistency problem is occurred. How to maintain the data to preserve their consistency is important for the correctness of the execution. Although there are multiple-writes to the same data element, only the last write is needed. Clearly, multiple-writes to the same data element may exist redundant computations. Besides, these redundant computations may occur unwanted data dependence relations and result in the losing of parallelism. Eliminating these redundant computations can remove these unwanted data dependence relations simultaneously and increase the degree of parallelism. In order to exploit more degrees of parallelism, Chen and Sheu’s method proposed another scheme to eliminate redundant computations. Eliminating redundant computations is a preprocessing step before applying the communication-free partitioning strategy. However, the scheme to eliminate redundant computations is complex and time-consuming. The tradeoff on whether to apply or not to apply the scheme depends on the users. Since the scope of elimination of redundant computations is beyond the range of the chapter, we omit the discussions of the scheme. Whoever is interested in this topic can refer to [5]. 3.2 Hyperplane Partitioning of Data Space This section introduces the method proposed in [26]. We use Ramanujam and Sadayappan’s method to denote this method. Ramanujam and Sadayappan’s method discusses data spaces partitioning for two-dimensional arrays; nevertheless, their method can be easily generalized to higher dimensions. They use a single hyperplane as a basic partitioning unit for each data space. Data on a hyperplane are assigned to a processor. Hyperplanes on data spaces are called data hyperplanes and on iteration spaces are called iteration hyperplane. Hyperplanes within a space are parallel to each other. In Ramanujam and
354
Kuei-Ping Shih, Chua-Huang Huang, and Jang-Ping Sheu
Sadayappan’s method, iteration space partitioning is not addressed. However, their method implicitly contains the concept of iteration hyperplane. The basic ideas of Ramanujam and Sadayappan’s method are as follows. First, the data hyperplane of each data space is assumed to be the standard form of hyperplane. The coefficients of data hyperplanes are unknown and need to be evaluated later on. Based on the array reference functions, each data hyperplane can derive its corresponding iteration hyperplane. That is, all iterations referencing the data elements on the data hyperplane are on the iteration hyperplane. Since communication-free partitioning is required, therefore, all iteration hyperplanes derived from every data hyperplanes actually represent the same hyperplane. In other words, although these iteration hyperplanes are different in shape, they all represent the iteration hyperplanes of the iteration space. Hence, if interprocessor communication is prohibited, these iteration hyperplanes should be the same. As a result, conditions to satisfy the requirement of communication-free partitioning are established. These conditions form a linear system and are composed of the coefficients of data hyperplanes. Solving the linear system can obtain the values of the coefficients of data hyperplanes. The data hyperplanes are then determined. Since iteration space partitioning is not considered by this method, it results in the failure in applying to multiple nested loops. Furthermore, their method can deal with only fully parallel loop, which contains no data dependence relations within the loop. Example 34. Consider the following program model. do i1 = 1, N do i2 = 1, N A(i1 , i2 ) = B(b1,1 i1 + b1,2 i2 + b1,0 , b2,1 i1 + b2,2 i2 + b2,0 ) enddo enddo
(L3.3 )
Let v1 denote A and v2 denote B, and so on, unless otherwise noted. DJI denote the j KL component of array reference of array variable vI . Ramanujam and Sadayappan’s method partitions data spaces along hyperplanes. A data hyperplane on a two-dimensional data space DS(vI ) is a set of data indices {[D1I , D2I ]K |θ1I D1I + θ2I D2I = cI } and is denoted as ΦI , where θ1I and θ2I ∈ Q are hyperplane coefficients and cI ∈ Q is the constant term of the hyperplane. All elements in a hyperplane are undertaken by a processor, that is, a processor should be responsible for the executions of all computations in an iteration hyperplane and manage the data elements located in a data hyperplane. Note that the hyperplanes containing at least one integer-valued point are considered in the Chapter. As defined above, the data hyperplanes for array variables v1 and v2 are Φ1 = {[D11 , D21 ]K |θ11 D11 + θ21 D21 = c1 } and Φ2 = {[D12 , D22 ]K |θ12 D12 + θ22 D22 = c2 }, respectively. Since the array reference of v1 is (i1 , i2 ), hence, D11 = i1 and D21 = i2 . The array reference of v2 is (D12 , D22 ), where
Communication-Free Partitioning of Nested Loops
355
D12 = b1,1 i1 + b1,2 i2 + b1,0 , D22 = b2,1 i1 + b2,2 i2 + b2,0 . Substituting the loop indices for data indices into the data hyperplanes can obtain the following two hyperplanes. θ11 D11 + θ21 D21 = c1 ⇒ θ11 i1 + θ21 i2 = c1 , θ12 D12 + θ22 D22 = c2 ⇒ (θ12 b1,1 + θ22 b2,1 )i1 + (θ12 b1,2 + θ22 b2,2 )i2 = c2 − θ12 b1,0 − θ22 b2,0 . From the above explanation, these two hyperplanes actually represent the same hyperplane if the requirement of communication-free partitioning has to be satisfied. It implies 1 θ1 = θ12 b1,1 + θ22 b2,1 θ1 = θ12 b1,2 + θ22 b2,2 12 c = c2 − θ12 b1,0 − θ22 b2,0
We can rewrite the above formulations in matrix representation as follows. 2 1 θ1 b1,1 b2,1 0 θ1 θ21 = b1,2 b2,2 0 θ22 (3.1) c2 −b1,0 −b2,0 1 c1
By the above analyses, the two data hyperplanes of v1 and v2 can be represented as below. Φ1 = {[D11 , D21 ]M |(θ12 b1,1 + θ22 b2,1 )D11 + (θ12 b1,2 + θ22 b2,2 )D21 = c2 − θ12 b1,0 − θ22 b2,0 }, 2 2 2 2 2 2 M 2 2 Φ = {[D1 , D2 ] |θ1 D1 + θ2 D2 = c }. A comprehensive methodology for communication-free data spaces partitioning proposed in [26] has been described. Let’s take a real program as an example to show how to apply the technique. In the preceding program model, we discussed the case that the number of different array references in the rhs(right hand side) of the assignment statement is just one. If there are multiple array references in the rhs of the assignment statement, the constraints from Eq. (3.1) should hold true for each reference functions to preserve the requirements of communication-free partitioning. Example 35. Consider the following loop. do i1 = 1, N do i2 = 1, N A(i1 , i2 ) = B(i1 + 2i2 + 2, i2 + 1) + B(2i1 + i2 , i1 − 1) enddo enddo
(L3.4 )
356
Kuei-Ping Shih, Chua-Huang Huang, and Jang-Ping Sheu
Suppose the data hyperplanes of v1 and v2 are Φ1 and Φ2 , respectively, where Φ1 = {[D11 , D21 ]N | θ11 D11 + θ21 D21 = c1 } and Φ2 = {[D12 , D22 ]N |θ12 D12 + θ22 D22 = c2 }. Since this example contains two different array references in the rhs of the assignment statement, therefore, by Eq. (3.1), we have the following two constraints for the first and the second references. 1 2 θ1 1 0 0 θ1 θ21 = 2 1 0 θ22 , 1 c2 c1 −2 −1 1 2 θ1 2 1 0 θ1 θ21 = 1 0 0 θ22 . c1 0 1 1 c2 The parameters θ11 , θ21 , θ12 , θ22 , c1 , and c2 have to satisfy the above equations. Solving these equations can obtain the following solution θ11 = θ21 = θ12 = −θ22 and c2 = c1 + θ11 . Therefore, the communication-free data hyperplanes on DS(v1 ) and DS(v2 ) are respectively represented in the following: Φ1 = {[D11 , D21 ]N |θ11 D11 + θ11 D21 = c1 }, Φ2 = {[D12 , D22 ]N |θ11 D12 − θ11 D22 = c1 + θ11 }. Fig. 3.3 illustrates Φ1 and Φ2 , the communication-free data spaces partitioning, for loop L3.4 when θ11 = 1 and c1 = 3.
Fig. 3.3. Communication-free data spaces partitioning for loop L3.4 . (a) Data hyperplane Φ1 on data space DS(v1 ). (b) Data hyperplane Φ2 on data space DS(v2 ).
After having explained these examples, we have explicated Ramanujam and Sadayappan’s method in detail. Nevertheless, all the above examples are
Communication-Free Partitioning of Nested Loops
357
not general enough. The most general program model is considered below. Based on the same ideas, the constraints for satisfying the requirements of communication-free hyperplane partitioning is derived. An example is also given to illustrate the most general case. Example 36. Consider the following program model. do i1 = 1, N do i2 = 1, N A(a1,1 i1 + a1,2 i2 + a1,0 , a2,1 i1 + a2,2 i2 + a2,0 ) = B(b1,1 i1 + b1,2 i2 + b1,0 , b2,1 i1 + b2,2 i2 + b2,0 ) enddo enddo
(L3.5 )
Suppose the data hyperplanes of v1 and v2 are Φ1 = {[D11 , D21 ]O |θ11 D11 + = c1 } and Φ2 = {[D12 , D22 ]O |θ12 D12 + θ22 D22 = c2 }, respectively. The reference functions for each dimension of each array reference is listed as follows. D11 = a1,1 i1 + a1,2 i2 + a1,0 , D21 = a2,1 i1 + a2,2 i2 + a2,0 , D12 = b1,1 i1 + b1,2 i2 + b1,0 , D22 = b2,1 i1 + b2,2 i2 + b2,0 . θ21 D21
Replacing each DQP with its corresponding reference function, i = 1, 2 and j = 1, 2, the data hyperplanes Φ1 and Φ2 can be represented in terms of loop indices i1 and i2 as below. ⇒ ⇒ ⇒ ⇒
θ11 D11 + θ21 D21 = c1 θ11 (a1,1 i1 + a1,2 i2 + a1,0 ) + θ21 (a2,1 i1 + a2,2 i2 + a2,0 ) = c1 (θ11 a1,1 + θ21 a2,1 )i1 + (θ11 a1,2 + θ21 a2,2 )i2 = c1 − θ11 a1,0 − θ21 a2,0 , θ12 D12 + θ22 D22 = c2 θ12 (b1,1 i1 + b1,2 i2 + b1,0 ) + θ22 (b2,1 i1 + b2,2 i2 + b2,0 ) = c2 (θ12 b1,1 + θ22 b2,1 )i1 + (θ12 b1,2 + θ22 b2,2 )i2 = c2 − θ12 b1,0 − θ22 b2,0 .
As previously stated, these two hyperplanes are the corresponding iteration hyperplanes of the two data hyperplanes on iteration space. These two iteration hyperplanes should be consistent if the requirement of communicationfree partitioning has to be met. It implies θ11 a1,1 + θ21 a2,1 θ11 a1,2 + θ21 a2,2 1 c − θ11 a1,0 − θ21 a2,0 The above conditions a2,1 a1,1 a1,2 a2,2 −a1,0 −a2,0
can be 0 0 1
= = =
θ12 b1,1 + θ22 b2,1 θ12 b1,2 + θ22 b2,2 c2 − θ12 b1,0 − θ22 b2,0
represented in matrix form as follows. 2 b2,1 0 θ1 θ11 b1,1 b2,2 0 θ22 θ21 = b1,2 c2 −b1,0 −b2,0 1 c1
(3.2)
358
Kuei-Ping Shih, Chua-Huang Huang, and Jang-Ping Sheu
If there exists a nontrivial solution to the linear system obtained from Eq. (3.2), the nested loop exists communication-free hyperplane partitioning. Example 37. Consider the following loop. do i1 = 1, N do i2 = 1, N A(i1 + i2 , i1 − i2 ) = B(i1 − 2i2 , 2i1 − i2 ) enddo enddo
(L3.6 )
Let Φ1 = {[D11 , D21 ]R |θ11 D11 + θ21 D21 = c1 } and Φ2 = {[D12 , D22 ]R |θ12 D12 + = c2 } be the data hyperplanes of v1 and v2 , respectively. From Eq. (3.2), we have the following system of equations: 1 2 1 1 0 θ1 1 2 0 θ1 1 −1 0 θ21 = −2 −1 0 θ22 . c1 0 0 1 c2 0 0 1 θ22 D22
The solution to this linear system is: 2 θ1 = −θ11 + 31 θ21 θ2 = θ11 + 13 θ21 22 c = c1
The linear system exists a nontrivial solution; therefore, loop L3.6 has communication-free hyperplane partitioning. Φ1 and Φ2 can be written as: Φ1 = {[D11 , D21 ]R |θ11 D11 + θ21 D21 = c1 }, Φ2 = {[D12 , D22 ]R |(−θ11 + 31 θ21 )D12 + (θ11 + 13 θ21 )D22 = c1 }. Let θ11 = 1 and θ21 = 1. The hyperplanes Φ1 and Φ2 are rewritten as follows. Φ1 = {[D11 , D21 ]R |D11 + D21 = c1 } Φ2 = {[D12 , D22 ]R | − 2D12 + 4D22 = 3c1 } Fig. 3.4 gives an illustration for c1 = 2. Ramanujam and Sadayappan’s method can deal with a single nested loop well. Their method fails in processing multiple nested loops. This is because they did not consider the iteration space partitioning. On the other hand, they do well for the fully parallel loop, but they can not handle the loop with data dependence relations. These shortcomings will be made up by methods proposed in [16, 30].
Communication-Free Partitioning of Nested Loops
359
Fig. 3.4. Communication-free data spaces partitioning for loop L3.6 . (a) Data hyperplane Φ1 on data space DS(v1 ). (b) Data hyperplane Φ2 on data space DS(v2 ). 3.3 Hyperplane Partitioning of Iteration and Data Spaces Huang and Sadayappan also proposed methods toward the communicationfree partitioning for nested loops. In this section, we will describe the method proposed in [16]. This method is denoted as Huang and Sadayappan’s method. Huang and Sadayappan’s method aims at the findings of iteration hyperplanes and data hyperplanes such that, based on the partitioning, the execution of nested loops involves no interprocessor communication. Furthermore, sufficient and necessary conditions for communication-free hyperplane partitioning are also derived. They proposed single-hyperplane and multiplehyperplane partitionings for nested loops. Single-hyperplane partitioning implies that a partition element contains a single hyperplane per space and a partition element is allocated onto a processor. Multiple-hyperplane partitioning means that a partition element contains a group of hyperplanes and all elements in a partition group is undertaken by a processor. Multiplehyperplane partitioning can provide more powerful capability than singlehyperplane partitioning in communication-free partitioning. For the sake of space limitation, we only introduce the single-hyperplane partitioning. Multiple-hyperplane partitioning can refer to [16]. In Section 3.2, Ramanujam and Sadayappan’s method assumes the generic format of data hyperplanes and then determines the coefficients of data hyperplanes. Since Ramanujam and Sadayappan’s method considers only data hyperplane partitioning, the loss of sight on iteration hyperplanes causes the failure of applying to sequences of nested loops. This phenomenon has been improved by Huang and Sadayappan’s method. However, Huang and Sadayappan’s method requires the nested loops be perfectly nested loop(s).
360
Kuei-Ping Shih, Chua-Huang Huang, and Jang-Ping Sheu
An n-dimensional data hyperplane on DS(vS ) is the set of data indices {[D1S , D2S , . . . , DTS ]U |θ1S D1S + θ2S D2S + · · · + θTS DTS = cS }, which is denoted as ΦS , where θ1S , . . ., and θTS ∈ Q are hyperplane coefficients and cS ∈ Q is the constant term of the hyperplane. Similarly, an iteration hyperplane of a dnested loop LV is a set of iterations {[I1V , I2V , . . . , IWV ]U |δ1V I1V +δ2V I2V +· · ·+δWV IWV = cV } and is denoted as Ψ V , where δ1V , . . ., and δWV ∈ Q are hyperplane coefficients and cV ∈ Q is the constant term of the hyperplane. Let ∆V = [δ1V , δ2V , . . . , δWV ] be the coefficient vector of iteration hyperplane and ΘS = [θ1S , θ2S , . . . , θTS ] be the coefficient vector of data hyperplane. An iteration hyperplane on IS(LV ) and a data hyperplane on DS(vS ) can be abbreviated as Ψ V = {I V | ∆V · I V = cV }, and ΦS = {DS | ΘS · DS = cS }, respectively, where I V = [I1V , I2V , . . . , IWV ]U is an iteration on IS(LV ) and DS = [D1S , D2S , . . . , DTS ]U is a data index on DS(vS ). If the hyperplane coefficient vector is a zero vector, it means the whole iteration space or data space needs to be allocated onto a processor. This fact leads to sequential execution and is out of the question in this Chapter. Hence, only non-zero iteration hyperplane coefficient vectors and data hyperplane coefficient vectors are considered in the Chapter. For any array reference of array vS in loop LV , there exists a sufficient and necessary condition to verify the relations between iteration hyperplane coefficient vector and data hyperplane coefficient vector if communicationfree requirement is satisfied. The sufficient and necessary condition is stated in the following lemma. Lemma 31. For a reference function Ref (I V ) = R · I V + r = DS , which is from IS(LV ) to DS(vS ), Ψ V = {I V | ∆V · I V = cV } is the iteration hyperplane on IS(LV ) and ΦS = {DS | ΘS · D S = cS } is the data hyperplane on DS(vS ). Ψ V and ΦS are communication-free hyperplane partitions if and only if ∆V = αΘS · R, for some α, α = 0. Proof. (⇒): Suppose that Ψ V = {I V |∆V · I V = cV } and ΦS = {D S |ΘS · DS = cS } are communication-free hyperplane partitions. Let I1V and I2V be two distinct iterations and belong to the same iteration hyperplane, Ψ V . If D1S and D2S are two data indices such that Ref (I1V ) = D1S and Ref (I2V ) = D2S , from the above assumptions, D1S and D2S should belong to the same data hyperplane, ΦS . Because I1V and I2V belong to the same iteration hyperplane, Ψ V , ∆V ·I1V = cV and ∆V · I2V = cV , therefore, ∆V · (I1V − I2V ) = 0. On the other hand, since D1S and D2S belong to the same data hyperplane, ΦS , it means that ΘS · D1S = cS and ΘS · D2S = cS . Replacing DXS by reference function Ref (IXV ), for k = 1, 2, we can obtain (ΘS · R) · (I1V − I2V ) = 0. Since I1V and I2V are any two iterations on Ψ V , (I1V − I2V ) is a vector on the iteration hyperplane. Furthermore, both ∆V · (I1V − I2V ) = 0 and (ΘS · R) · (I1V − I2V ) = 0, hence we can conclude that ∆V and (ΘS · R) are linearly dependent. It implies ∆V = αΘS · R, for some α, α = 0 [15].
Communication-Free Partitioning of Nested Loops
361
(⇐): Suppose Ψ i = {I i |∆i · I i = ci } and ΦY = {DY |ΘY · DY = cY } are hyperplane partitions for IS(LZ ) and DS(vY ), respectively and ∆Z = αΘY · R, for some α, α = 0. We claim Ψ Z and ΦY are communication-free partitioning. Let I Z be any iteration on iteration hyperplane Ψ Z . Then ∆Z ·I Z = cZ . Since Z ∆ = αΘY · R, replacing ∆Z by αΘY · R can get ΘY · Ref (I Z ) = α1 cZ + ΘY · r. Let cY = α1 cZ +ΘY ·r, then Ref (I Z ) ∈ ΦY . We have shown that ∀I Z ∈ Ψ Z , Ref (I Z ) ∈ ΦY . It then follows that Ψ Z and ΦY are communication-free partitioning. Lemma 31 shows good characteristics in finding communication-free partitioning. It can be used for determining the hyperplane coefficient vectors. Once the data hyperplane coefficient vectors are fixed, the iteration hyperplane coefficient vectors can also be determined. If the reference matrix R is invertible, we can also determine the iteration hyperplane coefficient vectors first, then the data hyperplane coefficient vectors can be evaluated by ΘY = ( α1 )∆Z · R−1 accordingly, where R−1 is the inverse of R. As regards to the constant terms of hyperplanes, since the constant terms of hyperplanes are correlated to each other, hence, if the constant term of some hyperplane is fixed, the others can be represented in terms of that constant term. From the proof of Lemma 31, we can know that if cZ is fixed, cY = α1 cZ + ΘY · r. Generally speaking, in a vector space, a vector does not change its direction after being scaled. Since α in Lemma 31 is a scale factor, it can be omitted without affecting the correctness. Therefore, we always let α = 1 unless otherwise noted. Example 38. Consider one perfectly nested loop. do i1 = 1, N do i2 = 1, N A(i1 + i2 , i1 + i2 ) = 2 ∗ A(i1 + i2 , i1 + i2 ) − 1 enddo enddo
(L3.7 )
Suppose the iteration hyperplane on IS(L3.7 ) is of the form Ψ = {I|∆·I = c} and the data hyperplane on DS(A) is Φ = {D|Θ · D = c′ }. From Lemma 31, the data hyperplane coefficient vector Θ can be set to arbitrarily 2-dimensional vector except zero vector and those vectors that cause the iteration hyperplane coefficient vectors also to be zero vectors. In this example, let Θ = [0, 1], then the iteration hyperplane coefficient vector ∆ is equal to [1, 1]. If the constant term of iteration hyperplane is fixed as c, the data hyperplane constant term c′ = c + Θ · r. For this example, c′ = c. Therefore, the iteration hyperplane and data hyperplane of loop L3.7 are Ψ = {I|[1, 1] · I = c} and Φ = {D|[0, 1] · D = c}, respectively. That is, Ψ = {[I1 , I2 ][ |I1 + I2 = c} and Φ = {[D1 , D2 ][ |D2 = c}. Fig. 3.5 illustrates the communication-free hyperplane partitioning of loop L3.7 , where c = 5. Fig. 3.5 (b) and (c) are iteration hyperplane and data hyperplane, respectively. On the other hand, if the data hyperplane coefficient vector Θ is chosen as [1, −1], it causes the iteration hyperplane coefficient vector ∆ to be [0, 0],
362
Kuei-Ping Shih, Chua-Huang Huang, and Jang-Ping Sheu
Fig. 3.5. Communication-free hyperplane partitioning of loop L3.7 . (a) Iteration hyperplane partition on IS(L3.7 ): Ψ = {[I1 , I2 ]\ |I1 + I2 = 5}. (b) Data hyperplane partition on DS(A): Φ = {[D1 , D2 ]\ |D2 = 5}. which is a zero vector. Since all the hyperplane coefficient vectors are nonzero vectors, therefore, the above result is invalid. In other words, the data hyperplane coefficient vector Θ can be any 2-dimensional vector except [0, 0] and [1, −1]. By Section 2.3, since the null space of R is N S(R) = span({[1, −1]\}), the spaces caused by the self-dependent relations and cross-dependent relations are the same and equal span({[1, −1]\}). The iteration-dependent space of loop L3.7 is IDS(L3.7 ) = span({[1, −1]\}). It means that the iterations along the direction [1, −1]\ should be allocated onto the same processor. This result matches that shown in Fig. 3.5. Lemma 31 is enough to meet the requirement of communication-free partitioning for only one array reference. If there are more than one different array references in the same loop, Lemma 31 is useful but not enough. More conditions is attached in order to satisfy the communication-free criteria. Suppose there are γ different array references of array variable v] in nested loop L^ , which are Ref`^_] (I ^ ) = R`^_] I ^ + r`^_] , k = 1, 2, . . . , γ. As previously defined, the iteration hyperplane is Ψ ^ = {I ^ |∆^ · I ^ = c^ } and the data hyperplane is Φ] = {D] |Θ] · D] = c] }. By Lemma 31, Ψ ^ and Φ] are communication-free partitioning if and only if ∆^ = αΘ] · R, where R is some reference matrix and α is a non-zero constant. Without loss of generality, let α = 1. Since there are γ different array references, hence, Lemma 31 should be satisfied for every array reference. That is, ∆^ = Θ] · R1^_] = Θ] · R2^_] = · · · = Θ] · Rγ^_] .
(3.3)
On the other hand, the constant term of the data hyperplane is c′ = c + Θ] · r1^_] = c + Θ] · r2^_] = · · · = c + Θ] · rγ^_] if the iteration hyperplane constant term is c. It implies that
Communication-Free Partitioning of Nested Loops
Θa · r1bea = Θa · r2bea = · · · = Θa · rγbea .
363
(3.4)
Therefore, Eqs. (3.3) and (3.4) are the sufficient and necessary conditions for the communication-free hyperplane partitioning of nested loop with several array references to an array variable. If there exists a contradiction within the findings of the hyperplane coefficient vectors, it implies that the nested loop exists no communication-free hyperplane partitioning. Otherwise, the data hyperplane coefficient vector can be evaluated accordingly. The iteration hyperplane coefficient vector can also be determined. As a result, the iteration hyperplane and data hyperplane can be resolved. Similarly, the same ideas can be extended to sequences of nested loops, too. The results obtained by the above analyses can be combined for the most general case. Suppose the iteration space for each loop Lb is IS(Lb ). The iteration hyperplane on IS(Lb ) is Ψ b = {I b |∆b · I b = cb }. The data space for array variable va is DS(va ) and the hyperplane on DS(va ) is Φa = {Da |Θa · Da = ca }. Let Reffbea be the reference function of the kgh reference to array variable va in loop Lb . Eqs. (3.3) and (3.4) can be rewritten by minor modifications to meet the representation. Θa1 · Rfbea1 1 a
bea f1
Θ ·r
= Θa2 · Rfbea2 2 , a
bea f2
= Θ ·r .
(3.5) (3.6)
Furthermore, since ca = cb + Θa · rfbea , for some array variable va1 , ca1 = cb1 + Θa1 · r1b1 ea1 = cb2 + Θa1 · r1b2 ea1 , for two different loops Lb1 and Lb2 . We can obtain that cb2 − cb1 = Θa1 · (r1b1 ea1 − r1b2 ea1 ). Similarly, for some array variable va2 , ca2 = cb1 + Θa2 · r1b1 ea2 = cb2 + Θa2 · r1b2 ea2 , for two different loops Lb1 and Lb2 . We get that cb2 − cb1 = Θa2 · (r1b1 ea2 − r1b2 ea2 ). Combining these two equations can obtain the following equation. Θa1 · (r1b1 ea1 − r1b2 ea1 ) = Θa2 · (r1b1 ea2 − r1b2 ea2 ).
(3.7)
Thus, Eqs. (3.5), (3.6), and (3.7) are the sufficient and necessary conditions for the communication-free hyperplane partitioning of sequences of nested loops. Example 39. Consider the following sequence of loops. do i1 = 1, N do i2 = 1, N A(i1 + i2 + 2, −2i2 + 2) = B(i1 + 1, i2 ) enddo enddo do i1 = 1, N do i2 = 1, N B(i1 + 2i2 , i2 + 4) = A(i1 , i2 − 1) enddo enddo
(L3.8 )
(L3.9 )
364
Kuei-Ping Shih, Chua-Huang Huang, and Jang-Ping Sheu
To simplify the representation, we number the sequences of loops and array variables according to the order of occurrences. Let L1 refer to L3.8 , L2 refer to L3.9 , v1 refer to array variable A, and v2 refer to array variable B. The iteration space of loop L1 is IS(L1 ) and the iteration space of loop L2 is IS(L2 ). The data spaces of v1 and v2 are DS(v1 ) and DS(v2 ), respectively. Suppose Ψ 1 = {I 1 |∆1 · I 1 = c1 } is the iteration hyperplane on IS(L1 ) and Ψ 2 = {I 2 |∆2 · I 2 = c2 } is the iteration hyperplane on IS(L2 ). The data ′ hyperplanes on DS(v1 ) and DS(v2 ) are Φ1 = {D1 |Θ1 · D1 = c1 } and Φ2 = ′ {D2 |Θ2 · D2 = c2 }, respectively. As defined above, Refrmpq is the reference function of the ksu reference to array variable vq in loop Lm . Since Θ1 , and Θ2 are 2-dimensional non-zero vectors, let Θ1 = [θ11 , θ21 ], and Θ2 = [θ12 , θ22 ], where θ11 , θ21 , θ12 , and θ22 ∈ Q, (θ11 )2 + (θ21 )2 = 0, and (θ12 )2 +(θ22 )2 = 0. By Eq. (3.5), we have the following two equations Θ1 ·R11,1 = Θ2 · R11,2 and Θ1 · R12,1 = Θ2 · R12,2 . Thus, 1 1 1 0 = [θ12 , θ22 ] · , and [θ11 , θ21 ] · 0 −2 0 1 1 0 1 2 1 1 2 2 [θ1 , θ2 ] · = [θ1 , θ2 ] · . 0 1 0 1 There is no condition that satisfies Eq. (3.6) in this example since each array variable is referenced just once by each nested loop. To satisfy Eq. (3.7), the equation Θ1 · (r11,1 − r12,1 ) = Θ2 · (r11,2 − r12,2 ) is obtained. That is, 2 0 1 0 1 1 2 2 − ) = [θ1 , θ2 ] · ( − ). [θ1 , θ2 ] · ( 2 −1 0 4 Solving the above system of equations, we can obtain that θ21 = θ11 , θ12 = θ11 , and θ22 = −θ11 . Therefore, the data hyperplane coefficient vectors of Φ1 and Φ2 , Θ1 and Θ2 , are [θ11 , θ11 ] and [θ11 , −θ11 ], respectively. Since Θ1 and Θ2 are non-zero vectors, θ11 ∈ Q − {0}. The coefficient vector of iteration hyperplane Ψ 1 is ∆1 = Θ1 · R11,1 = 2 Θ · R11,2 = [θ11 , −θ11 ]. Similarly, ∆2 = Θ1 · R12,1 = Θ2 · R12,2 = [θ11 , θ11 ]. Let the constant term of iteration hyperplane Ψ 1 be fixed as c1 . Therefore, the constant term of iteration hyperplane Ψ 2 can, therefore, be computed by c2 = c1 + Θ1 · (r11,1 − r12,1 ) = c1 + Θ2 · (r11,2 − r12,2 ) = c1 + 5θ11 . The constant ′ term of data hyperplane Φ1 can be evaluated by c1 = c1 + Θ1 · r11,1 = 2,1 c2 + Θ1 · r1 = c1 + 4θ11 . The constant term of data hyperplane Φ2 can be ′ evaluated by c2 = c1 +Θ2 ·r11,2 = c2 +Θ2 ·r12,2 = c1 +θ11 . The communicationfree iteration hyperplane and data hyperplane partition are as follows. Ψ1 Ψ2 Φ1 Φ2
= = = =
{[I11 , I21 ]s |θ11 I11 − θ11 I21 = c1 } {[I12 , I22 ]s |θ11 I12 + θ11 I22 = c1 + 5θ11 } {[D11 , D21 ]s |θ11 D11 + θ11 D21 = c1 + 4θ11 } {[D12 , D22 ]s |θ11 D12 − θ11 D22 = c1 + θ11 }
Communication-Free Partitioning of Nested Loops
365
Fig. 3.6. Communication-free hyperplane partitionings for iteration spaces of loops L3.8 and L3.9 and data spaces of arrays A and B. (a) Iteration hyperplane partition on IS(L3.8 ): Ψ 3.8 = {[I11 , I21 ]v |I11 − I21 = 0}. (b) Iteration hyperplane partition on IS(L3.9 ): Ψ 3.9 = {[I12 , I22 ]v |I12 + I22 = 5}. (c) Data hyperplane partition on DS(A): Φw = {[D11 , D21 ]v |D11 + D21 = 4}. (d) Data hyperplane partition on DS(B): Φx = {[D12 , D22 ]v |D12 − D22 = 1}. Fig. 3.6 shows the communication-free hyperplane partitions for iteration spaces IS(L3.8 ) and IS(L3.9 ) and data spaces DS(A) and DS(B), assuming θ1 = 1 and c1 = 0. Fig. 3.6 (a) is the iteration space hyperplane partitioning of loop L3.8 and Fig. 3.6 (b) is the iteration space hyperplane partitioning of loop L3.9 . Fig. 3.6 (c) and (d) illustrate the hyperplane partitionings on data spaces DS(A) and DS(B), respectively.
4. Statement-Level Partitioning Traditionally, the concept of the iteration space is from the loop-level point of view. An iteration space is formed by iterations. Each iteration is an integer point in iteration space and is indexed by the values of loop indices. Every
366
Kuei-Ping Shih, Chua-Huang Huang, and Jang-Ping Sheu
iteration consists of all statements of that index within the loop body. The execution of an iteration includes all the execution of statements of that index. Actually, each statement is an individual unit and can be scheduled separately. Therefore, instead of viewing each iteration indivisible, an iteration can be separated into the statements enclosed in that iteration. The separated statements have the same index with that iteration and each is termed as a statement-iteration. We use I y to denote a statement-iteration of statement s. Since iteration space is composed of iterations, statement-iterations of a statement also form a space. Each statement has its corresponding space. We use statement-iteration space, denoted as SIS(s), to refer the space composed by statement-iterations of s. Statement-iteration space has the same loop boundaries with the corresponding iteration space. Generally speaking, statement-iteration space and iteration space have similar definitions except the viewpoint of objects; the former is from the statement-level point of view and the latter is from the loop-level viewpoint. In this section we describe two statement-level communication-free partitionings: one is using affine processor mappings [22] and another is using hyperplane partitioning [30]. 4.1 Affine Processor Mapping The method proposed in [22] considers iteration spaces partitioning, especially statement-level partitioning, to totally eliminate interprocessor communication and simultaneously maximizes the degree of parallelism. We use Lim and Lam’s method to refer the technique proposed in [22]. They use affine processor mappings to allocate statement-iterations to processors. The major consideration of Lim and Lam’s method is to find maximum communicationfree parallelism. That is, the goal is to find the set of affine processor mappings for statements in the program and to exploit as large amount of parallelism as possible on the premise that no interprocessor communication incurs while execution. Lim and Lam’s method deals with the array references with affine functions of outer loop indices or loop invariant variables. Their method can be applied to arbitrarily nested loops and sequences of loops. The statement-iteration distribution scheme adopted by Lim and Lam’s method is affine processor mapping, which is of the form P rocz (I z ) = P z I z +pz for statement sz . It maps each statement-iteration I z in SIS(sz ) to a (virtual) processor P rocz (I z ). P z is the mapping matrix of sz and pz is the mapping offset vector of sz . Maximizing the degree of parallelism is to maximize the rank of P z . To maximize the rank of P z and to minimize the dimensionality of the null space of P z are conceptually the same. Therefore, minimizing the dimensionality of the null space of P z is one major goal in Lim and Lam’s method. Similar to the meanings of iteration-dependent space defined in Section 2.3, they define another term to refer to those statement-iterations which have to be mapped to the same processor. The statement-iterations which have to be mapped to the same processor are collected in the minimal localized
Communication-Free Partitioning of Nested Loops
367
statement-iteration space, which is denoted as Li for statement si . Therefore, the major goal has changed from the minimization of the dimensionality of the null space of P i to the finding of the minimal localized statementiteration space. Once the minimal localized statement-iteration space of each statement is determined, the maximum degree of communication-free parallelism of each statement can be decided by dim(SIS(si ))−dim(Li ). Since each statement’s maximum degree of communication-free parallelism is different, in order to preserve the communication-free parallelism available to each statement, Lim and Lam’s method chooses the maximum value among all the degrees of communication-free parallelism of each statement as the dimensionality of the virtual processor array. By means of the minimal localized statement-iteration space, the affine processor mapping can be evaluated accordingly. The following examples demonstrate the concepts of Lim and Lam’s method proposed in [22]. Example 41. Consider the following loop. do i1 = 1, N do i2 = 1, N A(i1 , i2 ) = A(i1 − 1, i2 ) + B(i1 , i2 − 1) s1 : s2 : B(i1 , i2 ) = A(i1 , i2 + 1) + B(i1 + 1, i2 ) enddo enddo
(L4.1 )
Loop L4.1 contains two statements and there are two array variables referenced in the nested loop. Let v1 = A and v2 = B. We have defined statementiteration space and described the differences of it from iteration space above. Fig. 4.1 gives a concrete example to illustrate the difference between iteration space and statement-iteration space. In Fig. 4.1(a), a circle means an iteration and includes two rectangles with black and gray colors. The black rectangle indicates statement s1 and the gray one indicates statement s2 . In Fig. 4.1(b) and Fig. 4.1(c), each statement is an individual unit and the collection of statements forms two statement-iteration spaces. Let S be the set of statements and V be the set of array variables referenced by S. Suppose S = {s1 , s2 , . . . , sα } and V = {v1 , v2 , . . . , vβ }, where α, β ∈ Z+ . For this example, α = 2 and β = 2. Let the number of occurrences of variable v{ in statement s| be denoted as γ|}{ . For this example, γ1,1 = 2, γ1,2 = 1, γ2,1 = 1, and γ2,2 = 2. Let Ref~|}{ denote the reference function of the k occurrence of array variable v{ in statement s| , where 1 ≤ i ≤ α, 1 ≤ j ≤ β, and 1 ≤ k ≤ γ|}{ . A statement-iteration on a d-dimensional statement-iteration space SIS(s) can be written as I = [i1 , i2 , i ] . Let i~ denote the k component of statement-iteration I . The reference functions for each array reference are described as follows. 1 1 i11 i1 − 1 i1 1,2 1 1,1 1 , , Ref (I ) = , Ref (I ) = Ref11,1 (I 1 ) = 1 1 1 1 2 i22 i22 i22 − 1 i1 i1 i1 + 1 Ref12,2 (I 2 ) = , Ref12,1 (I 2 ) = , Ref22,2 (I 2 ) = . i22 i22 + 1 i22
368
Kuei-Ping Shih, Chua-Huang Huang, and Jang-Ping Sheu
Fig. 4.1. The illustrations of the differences between iteration space and statement-iteration space. Communication-free partitioning requires the referenced data be located on the processor performing that execution, no matter whether the data is read or written. Therefore, it makes no difference for communication-free partitioning whether the data dependence is true dependence, anti-dependence, output dependence or input dependence. Hence, in Lim and Lam’s method, they defined a function, co-reference function, to keep the data dependence relationship. The co-reference function just keeps the data dependence relationship but does not retain the order of read or write. Let ℜ′ be the co-reference function and can be defined as the set of statement-iterations ′ I such that the data elements referenced by I are also referenced by I′ , where s, s′ ∈ S. Fig. 4.2 gives an abstraction of co-reference function.
Fig. 4.2. The abstraction of co-reference function.
Communication-Free Partitioning of Nested Loops
369
Accordingly, ′
′
′
′
′
′
ℜ1,1 (I 1 ) = {I 1 |(i11 = i11 ) ∧ (i12 = i12 )} ∪ {I 1 |(i11 = i11 − 1) ∧ (i12 = i12 )}∪ ′ ′ ′ {I 1 |(i11 = i11 + 1) ∧ (i12 = i12 )}, ℜ1,2 (I 1 ) = {I 2 |(i21 = i11 ) ∧ (i22 = i12 − 1)} ∪ {I 2 |(i21 = i11 − 1) ∧ (i22 = i12 − 1)}, ℜ2,1 (I 2 ) = {I 1 |(i11 = i21 ) ∧ (i12 = i22 + 1)} ∪ {I 1 |(i11 = i21 + 1) ∧ (i12 = i22 + 1)}, ′ ′ ′ ′ ′ ′ ℜ2,2 (I 2 ) = {I 2 |(i21 = i21 ) ∧ (i22 = i22 )} ∪ {I 2 |(i21 = i21 + 1) ∧ (i22 = i22 )}∪ ′ ′ ′ {I 2 |(i21 = i21 − 1) ∧ (i22 = i22 )}. As previously described, finding the minimal null space of P
is the same as to find the minimal localized statement-iteration space. Hence, how to determine the minimal localized statement-iteration space of each statement is the major task of Lim and Lam’s method. The minimal localized statementiteration space is composed of the minimum set of column vectors satisfying the following conditions: – Single statement: The data dependence relationship within a statementiteration space may be incurred via the array references in the same statement or between statements. This requirement is to map all the statementiterations in a statement-iteration space that directly or indirectly access the same data element to the same processor. In other words, these statement-iterations should belong to the minimal localized statementiteration space. – Multiple Statements: For two different statements s
1 and s
2 , suppose ′ I
1 and I
1 are two statement-iterations in SIS(s
1 ) and I
2 ∈ ℜ
1
2 (I
1 ) ′ ′ ′ and I
2 ∈ ℜ
1
2 (I
1 ). If statement-iterations I
1 and I
1 are mapped to the same processor, this requirement requires all the statement-iterations I
2 ′ and I
2 being mapped to the same processor. Figs. 4.3 and 4.4 conceptually illustrate the conditions Single Statement and Multiple Statements, respectively. The boldfaced lines in the two figures are the main requirements that these two condition want to meet.
Fig. 4.3. The abstraction of Single Statement condition.
370
Kuei-Ping Shih, Chua-Huang Huang, and Jang-Ping Sheu
Fig. 4.4. The abstraction of Multiple Statements condition. An iterative algorithm can be used to evaluate the minimal localized statement-iteration space of each statement. First, initialize each Li using condition Single Statement. Second, iterate using condition Multiple Statements until all the Li is converged. In what follows, we use this iterative algorithm to evaluate the minimal localized statement-iteration space of each statement for this example. Based on the condition Single Statement, L1 is initialized to {[1, 0] } and L2 is initialized to {[1, 0] }. The algorithm iterates according to the condition Multiple Statements to check if there is any column vector that should be added to the minimal localized statementiteration spaces. The algorithm considers one minimal localized statementiteration space at a time. For all other minimal localized statement-iteration spaces, the iterative algorithm uses condition Multiple Statements to add column vectors to the minimal localized statement-iteration space, if any. Once all the localized statement-iteration spaces are converged, the algorithm halts. As for this example, the iterative algorithm is halted when L1 and L2 both converge to {[1, 0] }. Thus, the minimal localized statement-iteration spaces L1 and L2 have been evaluated and all equal {[1, 0] }. For any two statement-iterations, if the difference between these two statement-iterations belongs to the space spanned by the minimal localized statement-iteration space, these two statement-iterations have to be mapped to the same processor. Therefore, the orthogonal complement of the minimal localized statement-iteration space is a subspace that there exists no data dependent relationship within the space. That is, all statement-iterations that located on the the orthogonal complement of the minimal localized statementiteration space are completely independent. Accordingly, the maximum degree of communication-free parallelism of a statement is the dimensionality of the statement-iteration space SIS(s ) minus the dimensionality of the minimal localized statement-iteration space L . Let the maximum degree of communication-free parallelism available for statement s be denoted as τ . Then, τ = dim(SIS(s )) − dim(L ). Thus, τ1 = 1 and τ2 = 1. Lim and Lam’s method wants to exploit as large amount of parallelism as possible. To retain the communication-free parallelism of each statement, the dimension-
Communication-Free Partitioning of Nested Loops
371
ality of (virtual) processor array has to set to the maximum value of maximum degree of communication-free parallelism among all statements. Let τ be the dimensionality of the (virtual) processor array. It can be defined as τ = max∈{1,2,...,α} τi . For this example, τp = max(τ1 , τ2 ) = 1. We have decided the dimensionality of (virtual) processor array. Finally, we want to determine the affine processor mapping for each statement by means of the co-reference function and the minimal localized statementiteration space. To map a d-dimensional statement-iteration space to a τp dimensional processor array, the mapping matrix P in the affine processor mapping is a τp × d matrix and the mapping offset vector p is a τp × 1 vector. For each affine processor mapping P roci (I i ) = P i I i + pi , the following two constraints should be satisfied, where i ∈ {1, 2, . . . , α}. C1 span(Li ) ⊆ null space of P roci . ′ ′ ′ C2 ∀i′ ∈ {1, 2, . . . , α}, i′ = i, ∀I i ∈ SIS(si ), ∀I i ∈ ℜi,i′ (I i ) : P roci (I i ) = P roci (I i ). Condition C1 can be reformulated as follows. Since span(Li ) ⊆ null space ′ ′ of P roci , it means that if I i , I i ∈ SIS(si ) and (I i − I i ) ∈ span(Li ), then ′ ′ ′ P roci (I i ) = P roci (I i ). Thus, P i I i + pi = P i I i + pi . It implies that P i (I i − ′ I i ) = ∅, where ∅ is a τp × 1 zero vector. Because (I i − I i ) ∈ span(Li ), we can conclude that C1′ ∀x ∈ Li , P i x = ∅. A straightforward algorithm to find the affine processor mappings according to the constraints C1 and C2 is derived in the following. First, choose one statement that its maximum degree of communication-free parallelism equals the dimensionality of the (virtual) processor array, say si . Find the affine processor mapping P roci such that the constraint C1 is satisfied. Since span(Li ) has to be included in the null space of P roci , it means that the range of P roci is the orthogonal complement of the space spanned by L1 . Therefore, one intuitively way to find the affine processor mapping P roci is to set P roci (I i ) = (Li )⊥ I i , where W ⊥ means the orthogonal complement of the space W . The mapping offset vector pi is set to a zero vector. Next, based on the affine processor mapping P roci , use constraint C2 to find the other statements’ affine processor mappings. This process will repeat until all the affine processor mappings are found. Using the straightforward algorithm described above, we can find the two affine processor mappings P roc1 = [0, 1]I 1 and P roc2 = [0, 1]I 2 + 1. Fig. 4.5 shows the communication-free affine processor mappings of statements s1 and s2 for loop L4.1 . Data distribution is an important issue for parallelizing compilers on distributed-memory multicomputers. However, Lim and Lam’s method ignores that. The following section describes the communication-free hyperplane partitioning for iteration and data spaces.
372
Kuei-Ping Shih, Chua-Huang Huang, and Jang-Ping Sheu
Fig. 4.5. Communication-free affine processor mappings P roc1 (I 1 ) = [0, 1]I 1 and P roc2 (I 2 ) = [0, 1]I 2 + 1 of statement s1 and s2 for loop L4.1 , assuming N = 5. 4.2 Hyperplane Partitioning In this section, the method proposed by Shih, Sheu, and Huang [30] studies toward the statement-level communication-free partitioning. They partition statement-iteration spaces and data spaces along hyperplanes. We use Shih, Sheu, and Huang’s method to denote the method proposed in [30]. Shih, Sheu, and Huang’s method can deal with not only an imperfectly nested loop but also sequences of imperfectly nested loops. They propose the sufficient and necessary conditions for the feasibility of communication-free singlehyperplane partitioning for an imperfectly nested loop and sequences of imperfectly nested loops. The main ideas of Shih, Sheu, and Huang’s method is similar to those proposed in Sections 3.2 and 3.3. In the following, we omit the tedious mathematical inference and just describe the concepts of the method. The details is referred to [30]. As defined in Section 4.1, S = {s1 , s2 , . . . , sα } is the set of statements and V = {v1 , v2 , . . . , vβ } is the set of array variables, where α, β ∈ Z+ . The number of occurrences of array variable vj in statement si is γi,j . If vj is not referenced in statement si , γi,j = 0. The reference function of the k th array reference of array variable vj in statement si is denoted as Refki,j , where i ∈ {1, 2, . . . , α}, j ∈ {1, 2, . . . , β}, and k ∈ {1, 2, . . . , γi,j }. Suppose Ψ i = {I i |∆i · I i = ci } is the statement-iteration hyperplane on SIS(si ) and Φj = {Dj |Θj · Dj = cj } is the data hyperplane on DS(vj ). Statement-level communication-free hyperplane partitioning requires those statement-iterations that reference the same array element be allocated on the same statement-iteration hyperplane. According to Lemma 21, two statement-iterations reference the same array element if and only if the difference of these two statement-iterations belongs to the null space of Rki,j , for some i, j and k. Hence, N S(Rki,j ) should be a subspace of the statement-
Communication-Free Partitioning of Nested Loops
373
iteration hyperplane. Since there may exist many different array references, partitioning a statement-iteration space must consider all array references appeared in the statement. Thus, the space spanned by N S(R ) for all array references appearing in the same statement should be a subspace of the statement-iteration hyperplane. Therefore, the above observations is concluded in the following lemma. Lemma 41 (Statement-Iteration Hyperplane Coefficient Check). For any communication-free statement-iteration hyperplane Ψ = {I |∆ ·I = c }, the following two conditions must hold: γ
i,j N S(Rki,j )) ⊆ Ψ i , (1) span(∪β=1 ∪k=1 γi,j (2) (∆i )t ∈ (span(∪βj=1 ∪k=1 N S(Rki,j )))⊥ ,
where S ⊥ denotes the orthogonal complement space of S. On the other hand, the dimension of a statement-iteration hyperplane is one less than the dimension of the statement-iteration space. If there exists a statement si , for some i, such that the dimension of the spanning space of N S(Rki,j ), for all j and k, is equal to the dimension of SIS(si ), then the spanning space cannot be a subspace of the statement-iteration hyperplane. Therefore, there exists no nontrivial communication-free hyperplane partitioning. Thus, we obtain the following lemma. Lemma 42 (Statement-Iteration Space Dimension Check). If ∃si ∈ S such that γ
i,j N S(Rki,j ))) = dim(SIS(si )), dim(span(∪βj=1 ∪k=1
then there exists no nontrivial communication-free hyperplane partitioning. In addition to the above observations, Shih, Sheu, and Huang’s method also finds more useful properties for the findings of communication-free hyperplane partitioning. Lemma 31 demonstrates that the iteration hyperplane and data hyperplane are communication-free partitioning if and only if the iteration hyperplane coefficient vector is parallel to the vector obtained by the multiplication of the data hyperplane coefficient vector and the reference matrix. Although Lemma 31 is for iteration space, it also holds true for statement-iteration space. Since the statement-iteration hyperplane coefficient vector is a non-zero vector, thus the multiplication of the data hyperplane coefficient vector and the reference matrix can not be a zero vector. From this condition, we can derive the feasible range of a data hyperplane coefficient vector. Therefore, we obtain the following lemma. Lemma 43 (Data Hyperplane Coefficient Check). For any communication-free data hyperplane Φj = {Dj |Θj · Dj = cj }, the following condition must hold: γi,j i,j t ′ (Θj )t ∈ (∪α i=1 ∪k=1 N S((Rk ) )) , where S ′ denotes the complement set of S.
374
Kuei-Ping Shih, Chua-Huang Huang, and Jang-Ping Sheu
Lemmas 41 and 43 provide the statement-iteration hyperplane coefficient vector check and the data hyperplane coefficient vector check, respectively. Suppose the data hyperplane on data space DS(vj) is Φ^j = {D^j | Θ^j · D^j = c^j}. Each data element is accessed by some statement-iteration via some reference function; that is, D^j can be represented as Ref_k^{i,j}(I^i) = R_k^{i,j} · I^i + r_k^{i,j}. Thus,

Θ^j · D^j = c^j
⇔ Θ^j · (R_k^{i,j} · I^i + r_k^{i,j}) = c^j
⇔ (Θ^j · R_k^{i,j}) · I^i = c^j − (Θ^j · r_k^{i,j}).

Let

∆^i = Θ^j · R_k^{i,j},    (4.1)
c^i = c^j − (Θ^j · r_k^{i,j}).    (4.2)
As a result, those statement-iterations that reference the data elements lying on the data hyperplane Φ^j = {D^j | Θ^j · D^j = c^j} will be located on the statement-iteration hyperplane Ψ^i = {I^i | (Θ^j · R_k^{i,j}) · I^i = c^j − (Θ^j · r_k^{i,j})}. Since there are three parameters in the above formulas, with i ∈ {1, 2, ..., α}, j ∈ {1, 2, ..., β}, and k ∈ {1, 2, ..., γ_{i,j}}, we can derive conditions that guarantee the consistency of the hyperplane coefficient vectors and constant terms in each space. Combining the above constraints, we obtain the following theorem.

Theorem 41. Let S = {s1, s2, ..., s_α} and V = {v1, v2, ..., v_β} be the sets of statements and array variables, respectively. Ref_k^{i,j} is the reference function of the k-th occurrence of array variable vj in statement si, where i ∈ {1, 2, ..., α}, j ∈ {1, 2, ..., β}, and k ∈ {1, 2, ..., γ_{i,j}}. Ψ^i = {I^i | ∆^i · I^i = c^i} is the statement-iteration hyperplane on SIS(si), for i = 1, 2, ..., α. Φ^j = {D^j | Θ^j · D^j = c^j} is the data hyperplane on DS(vj), for j = 1, 2, ..., β. Ψ^i and Φ^j are communication-free hyperplane partitions if and only if the following conditions hold.
1. ∀i, Θ^j · R_k^{i,j} = Θ^j · R_1^{i,j}, for j = 1, 2, ..., β; k = 2, 3, ..., γ_{i,j}.
2. ∀i, Θ^j · R_1^{i,j} = Θ^1 · R_1^{i,1}, for j = 2, 3, ..., β.
3. ∀i, Θ^j · r_k^{i,j} = Θ^j · r_1^{i,j}, for j = 1, 2, ..., β; k = 2, 3, ..., γ_{i,j}.
4. Θ^j · (r_1^{i,j} − r_1^{1,j}) = Θ^1 · (r_1^{i,1} − r_1^{1,1}), for i = 2, 3, ..., α; j = 2, 3, ..., β.
5. ∀j, (Θ^j)^t ∈ (∪_{i=1}^{α} ∪_{k=1}^{γ_{i,j}} NS((R_k^{i,j})^t))′.
6. ∀i, ∆^i = Θ^j · R_k^{i,j}, for some j, k, j ∈ {1, 2, ..., β}; k ∈ {1, 2, ..., γ_{i,j}}.
7. ∀i, (∆^i)^t ∈ (span(∪_{j=1}^{β} ∪_{k=1}^{γ_{i,j}} NS(R_k^{i,j})))^⊥.
8. ∀j, j = 2, 3, ..., β, c^j = c^1 − Θ^1 · r_1^{i,1} + Θ^j · r_1^{i,j}, for some i, i ∈ {1, 2, ..., α}.
9. ∀i, c^i = c^j − (Θ^j · r_k^{i,j}), for some j, k, j ∈ {1, 2, ..., β}; k ∈ {1, 2, ..., γ_{i,j}}.
Theorem 41 can be used to determine whether the nested loop(s) is/are communication-free. It can also be used as a systematic procedure for finding a communication-free hyperplane partitioning. Conditions 1 to 4 in
Theorem 41 are used for finding the data hyperplane coefficient vectors. Condition 5 checks whether the data hyperplane coefficient vectors found in the preceding steps are within the legal range. Following the determination of the data hyperplane coefficient vectors, the statement-iteration hyperplane coefficient vectors can be obtained by using Condition 6. Similarly, Condition 7 checks whether the statement-iteration hyperplane coefficient vectors are within the legal range. The data hyperplane constant terms and statement-iteration hyperplane constant terms can be obtained by using Conditions 8 and 9, respectively. If one of the conditions is violated, the whole procedure stops, which establishes that the nested loop has no communication-free hyperplane partitioning.
From Conditions 1 and 3, to satisfy the constraint that Θ^j is a non-zero row vector, we have the following condition:

Rank(R_1^{i,j} − R_2^{i,j}, ..., R_1^{i,j} − R_{γ_{i,j}}^{i,j}, r_1^{i,j} − r_2^{i,j}, ..., r_1^{i,j} − r_{γ_{i,j}}^{i,j}) < dim(DS(vj)),    (4.3)
for i = 1, 2, ..., α and j = 1, 2, ..., β. Note that this condition can also be found in [16] for loop-level hyperplane partitioning. We conclude the above by the following lemma.

Lemma 44 (Data Space Dimension Check). Suppose S = {s1, s2, ..., s_α} and V = {v1, v2, ..., v_β} are the sets of statements and array variables, respectively. R_k^{i,j} and r_k^{i,j} are the reference matrix and the reference vector, respectively, where i ∈ {1, 2, ..., α}, j ∈ {1, 2, ..., β}, and k ∈ {1, 2, ..., γ_{i,j}}. If a communication-free hyperplane partitioning exists, then Eq. (4.3) must hold.

Lemmas 42 and 44 are sufficient but not necessary. Lemma 42 is the statement-iteration space dimension test and Lemma 44 is the data space dimension test. To determine the existence of a communication-free hyperplane partitioning, we need to check the conditions in Theorem 41. We show the following example to explain the finding of communication-free hyperplanes of statement-iteration spaces and data spaces.

Example 42. Consider the following sequence of imperfectly nested loops.

do i1 = 1, N
  do i2 = 1, N
s1:     A[i1 + i2, 1] = B[i1 + i2 + 1, i1 + i2 + 2] + C[i1 + 1, −2i1 + 2i2, 2i1 − i2 + 1]
    do i3 = 1, N
s2:       B[i1 + i3 + 1, i2 + i3 + 1] = A[2i1 + 2i3, i2 + i3] + C[i1 + i2 + 1, −i2 + i3 + 1, i1 − i2 + 1]
    enddo
  enddo
enddo
(L4.2)
do i1 = 1, N
  do i2 = 1, N
    do i3 = 1, N
s3:       C[i1 + 1, i2, i2 + i3] = A[2i1 + 3i2 + i3, i1 + i2 + 2] + B[i1 + i2, i1 − i3 + 1]
    enddo
s4:     A[i1, i2 + 3] = B[i1 − i2, i1 − i2 + 2] + C[i1 + i2, −i2, −i2]
  enddo
enddo
The set of statements is S = {s1, s2, s3, s4}. The set of array variables is V = {v1, v2, v3}, where v1, v2, and v3 represent A, B, and C, respectively. The values of γ_{1,1}, γ_{1,2}, γ_{1,3}, γ_{2,1}, γ_{2,2}, γ_{2,3}, γ_{3,1}, γ_{3,2}, γ_{3,3}, γ_{4,1}, γ_{4,2}, and γ_{4,3} are all 1. We first use Lemmas 42 and 44 to test whether L4.2 has no communication-free hyperplane partitioning. Since dim(span(∪_{j=1}^{3} NS(R_1^{i,j}))) = 1, which is smaller than dim(SIS(si)) for i = 1, ..., 4, Lemma 42 cannot establish that L4.2 has no communication-free hyperplane partitioning. Lemma 44 does not help here either, because all the values of γ_{i,j} are 1, for i = 1, ..., 4 and j = 1, ..., 3. Further examination is necessary, because Lemmas 42 and 44 cannot prove that L4.2 has no communication-free hyperplane partitioning. From Theorem 41, if a communication-free hyperplane partitioning exists, the conditions listed in Theorem 41 must be satisfied; otherwise, L4.2 has no communication-free hyperplane partitioning. Let Ψ^i = {I^i | ∆^i · I^i = c_s^i} be the statement-iteration hyperplane on SIS(si) and Φ^j = {D^j | Θ^j · D^j = c_v^j} be the data hyperplane on DS(vj). Since the dimensions of the data spaces DS(v1), DS(v2), and DS(v3) are 2, 2, and 3, respectively, the data hyperplane coefficient vectors can, without loss of generality, be assumed to be Θ^1 = [θ1^1, θ2^1], Θ^2 = [θ1^2, θ2^2], and Θ^3 = [θ1^3, θ2^3, θ3^3]. In what follows, the requirements for the feasibility of communication-free hyperplane partitioning are examined one by one. There is no need to examine Conditions 1 and 3 because all the values of γ_{i,j} are 1. Solving the linear system obtained from Conditions 2 and 4 gives the general solution (θ1^1, θ2^1, θ1^2, θ2^2, θ1^3, θ2^3, θ3^3) = (t, −t, 2t, −t, t, t, t), t ∈ Q − {0}. Therefore, Θ^1 = [t, −t], Θ^2 = [2t, −t], and Θ^3 = [t, t, t]. Verifying Condition 5 shows that all the data hyperplane coefficient vectors are within the legal range. Therefore, the statement-iteration hyperplane coefficient vectors can be evaluated by Condition 6. Thus, ∆^1 = [t, t], ∆^2 = [2t, −t, t], ∆^3 = [t, 2t, t], and ∆^4 = [t, −t]. The legality of these statement-iteration hyperplane coefficient vectors is then checked by using Condition 7, which shows that all the statement-iteration and data hyperplane coefficient vectors are legal. These results reveal that the nested loops have communication-free hyperplane partitionings. Finally, the data and statement-iteration hyperplane constant terms are decided by using Conditions 8 and 9, respectively. Let one data hyperplane constant term be fixed, say c_v^1. The other hyperplane constant terms can be determined accordingly: c_v^2 = c_v^1 + t, c_v^3 = c_v^1 + 3t, c_s^1 = c_v^1 + t, c_s^2 = c_v^1, c_s^3 = c_v^1 + 2t, and c_s^4 = c_v^1 + 3t. Therefore, the communication-free hyperplane partitioning for loop L4.2 is G = Ψ^1 ∪ Ψ^2 ∪ Ψ^3 ∪ Ψ^4 ∪ Φ^1 ∪ Φ^2 ∪ Φ^3, where
Ψ^1 = {I^1 | [t, t] · I^1 = c_v^1 + t},
Ψ^2 = {I^2 | [2t, −t, t] · I^2 = c_v^1},
Ψ^3 = {I^3 | [t, 2t, t] · I^3 = c_v^1 + 2t},
Ψ^4 = {I^4 | [t, −t] · I^4 = c_v^1 + 3t},
Φ^1 = {D^1 | [t, −t] · D^1 = c_v^1},
Φ^2 = {D^2 | [2t, −t] · D^2 = c_v^1 + t},
Φ^3 = {D^3 | [t, t, t] · D^3 = c_v^1 + 3t}.
Fig. 4.6 illustrates the communication-free hyperplane partitionings for loop L4.2, where t = 1 and c_v^1 = 0. The corresponding parallelized program is as follows.

doall c = −7, 18
  do i1 = max(min(c − 4, ⌈(c−4)/2⌉), 1), min(max(c, ⌊(c+4)/2⌋), 5)
    if( max(c − 4, 1) ≤ i1 ≤ min(c, 5) )
      i2 = c − i1 + 1
      A[i1 + i2, 1] = B[i1 + i2 + 1, i1 + i2 + 2] + C[i1 + 1, −2i1 + 2i2, 2i1 − i2 + 1]
    endif
    do i2 = max(2i1 − c + 1, 1), min(2i1 − c + 5, 5)
      i3 = c − 2i1 + i2
      B[i1 + i3 + 1, i2 + i3 + 1] = A[2i1 + 2i3, i2 + i3] + C[i1 + i2 + 1, −i2 + i3 + 1, i1 − i2 + 1]
    enddo
  enddo
  do i1 = max(c − 13, 1), min(c − 1, 5)
    do i2 = max(⌈(c − i1 − 3)/2⌉, 1), min(⌊(c − i1 + 1)/2⌋, 5)
      i3 = c − i1 − 2i2 + 2
      C[i1 + 1, i2, i2 + i3] = A[2i1 + 3i2 + i3, i1 + i2 + 2] + B[i1 + i2, i1 − i3 + 1]
    enddo
  enddo
  do i1 = max(c + 4, 1), min(c + 8, 5)
    i2 = i1 − c − 3
    A[i1, i2 + 3] = B[i1 − i2, i1 − i2 + 2] + C[i1 + i2, −i2, −i2]
  enddo
enddoall
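The coefficient vectors derived in Example 42 can be checked mechanically. The following sketch is our own illustration, not part of the original chapter: it transcribes the reference matrices of statement s1 of L4.2 and verifies, for t = 1, the Lemma 42 dimension test and Conditions 2 and 6 of Theorem 41. The use of numpy/scipy and the variable names are assumptions made for the illustration.

```python
# Hedged sketch: numeric check of Example 42 for statement s1 only, with t = 1.
# The matrices below are transcribed from the subscripts of loop L4.2.
import numpy as np
from scipy.linalg import null_space  # orthonormal basis of a null space

# Reference matrices R_1^{1,j} of s1 (iteration vector (i1, i2)).
R_A = np.array([[1, 1], [0, 0]])             # A[i1+i2, 1]
R_B = np.array([[1, 1], [1, 1]])             # B[i1+i2+1, i1+i2+2]
R_C = np.array([[1, 0], [-2, 2], [2, -1]])   # C[i1+1, -2i1+2i2, 2i1-i2+1]

# Lemma 42 check: dim(span of the union of null spaces) vs. dim(SIS(s1)) = 2.
spanning = np.hstack([null_space(R) for R in (R_A, R_B, R_C)])
print("dim of spanning space:", np.linalg.matrix_rank(spanning))  # 1 < 2

# Data hyperplane coefficient vectors from the text, with t = 1.
theta = {"A": np.array([1, -1]), "B": np.array([2, -1]), "C": np.array([1, 1, 1])}

# Condition 2: Theta^j . R_1^{1,j} must agree for every array referenced in s1.
products = [theta["A"] @ R_A, theta["B"] @ R_B, theta["C"] @ R_C]
assert all(np.array_equal(p, products[0]) for p in products)
print("Delta^1 =", products[0])   # [1 1], i.e. [t, t] as claimed (Condition 6)
```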
5. Comparisons and Discussions

Recently, communication-free partitioning has received much attention in parallelizing compilers, and several partitioning techniques have been proposed in the literature. In the previous sections we have surveyed these techniques. Chen and Sheu's and Ramanujam and Sadayappan's methods can deal with a single loop. Since Ramanujam and Sadayappan's method does not address
Fig. 4.6. Communication-free statement-iteration hyperplanes and data hyperplanes for loop L4.2 , where t = 1 and c1v = 0. (a) Statement-iteration hyperplane on SIS(s1 ). (b) Statement-iteration hyperplane on SIS(s2 ). (c) Statement-iteration hyperplane on SIS(s3 ). (d) Statement-iteration hyperplane on SIS(s4 ). (e) Data hyperplane on DS(A). (f) Data hyperplane on DS(B). (g) Data hyperplane on DS(C).
iteration space partitioning, the absence of iteration space partitioning makes the method unable to handle multiple nested loops. Huang and Sadayappan's, Lim and Lam's, and Shih, Sheu, and Huang's methods can all deal with a sequence of nested loops. In addition, Chen and Sheu's and Huang and Sadayappan's methods address only perfectly nested loop(s), whereas all the others can manage imperfectly nested loop(s). Except for Ramanujam and Sadayappan's method, which requires the nested loops to be fully parallel, the methods can process nested loop(s) with or without data dependence relations. As for the array reference function, each method can process affine array reference functions, except that Chen and Sheu's method additionally requires the references in the loop to be uniformly generated.
We classify these methods as loop-level partitioning or statement-level partitioning. Loop-level partitioning views each iteration as a basic unit and partitions iterations and/or data onto processors. Chen and Sheu's, Ramanujam and Sadayappan's, and Huang and Sadayappan's methods are loop-level partitioning methods. Lim and Lam's and Shih, Sheu, and Huang's methods partition statement-iterations and/or data onto processors and are statement-level partitioning methods. The partitioning strategy used by Chen and Sheu's method is based on finding the iteration-dependent space. Once the iteration-dependent space is determined, the data accessed by the iterations on the iteration-dependent space are grouped together and then distributed onto processors along with the corresponding iteration-dependent space. Lim and Lam's method partitions statement-iteration spaces by using affine processor mappings. All the remaining methods partition iteration and/or data spaces along hyperplanes. Ramanujam and Sadayappan's method addresses only data space partitioning and Lim and Lam's method addresses only statement-iteration space partitioning; the others propose both iteration and data space partitionings.
It is well known that the dimensionality of a hyperplane is one less than the dimensionality of the original vector space. Therefore, the degree of parallelism exploited by the hyperplane partitioning techniques is one. On the other hand, Chen and Sheu's and Lim and Lam's methods can exploit the maximum degree of communication-free parallelism. All methods discuss communication-free partitioning under the assumption that each data element is distributed onto exactly one processor; every other processor that needs the data element has to access it via interprocessor communication. However, Chen and Sheu's method presents not only a non-duplicate data strategy but also a duplicate data strategy, which allows data to be appropriately duplicated onto processors in order to make the nested loop communication-free or to exploit a higher degree of parallelism.
For simplicity, we number the methods as follows.
1. Chen and Sheu's method.
2. Ramanujam and Sadayappan's method.
3. Huang and Sadayappan's method.
4. Lim and Lam's method.
5. Shih, Sheu, and Huang's method.
Synthesizing the above discussion, we obtain the following tables. Table 5.1 compares the methods according to the loop model that each method can deal with. It compares Loop(s), Nest, Type, and Reference Function. Loop(s) is the number of loops that the method can handle. Nest indicates whether the nested loop is perfectly or imperfectly nested. Type indicates the type of nested loop, fully parallel or other. Reference Function denotes the type of array reference functions. Table 5.2 compares the capabilities of each method. Level indicates whether the method is performed at the loop or statement level. Partitioning Strategy is the strategy adopted by each method. Partitioning Space shows the spaces that the method can partition, namely computation space partitioning and data space partitioning. Table 5.3 compares the functionalities of each method. Degree of Parallelism indicates how many degrees of parallelism the method can exploit on the premise of communication-free partitioning. Duplicate Data indicates whether the method allows data to be duplicated onto processors.

Table 5.1. Comparisons of communication-free partitioning techniques – Loop Model.

Method  Loop(s)   Nest         Type            Reference Function
1       single    perfectly    arbitrary       affine function with uniformly generated reference
2       single    imperfectly  fully parallel  affine function
3       multiple  perfectly    arbitrary       affine function
4       multiple  imperfectly  arbitrary       affine function
5       multiple  imperfectly  arbitrary       affine function
Table 5.2. Comparisons of communication-free partitioning techniques – Capability.

Method  Level      Partitioning Strategy       Partitioning Space (Computation / Data)
1       loop       iteration-dependent space   yes / yes
2       loop       hyperplane                  no / yes
3       loop       hyperplane                  yes / yes
4       statement  affine processor mapping    yes / no
5       statement  hyperplane                  yes / yes
Table 5.3. Comparisons of communication-free partitioning techniques – Functionality.

Method  Degree of Parallelism                    Duplicate Data
1       maximum communication-free parallelism   yes
2       1                                        no
3       1                                        no
4       maximum communication-free parallelism   no
5       1                                        no
6. Conclusions

As the cost of data communication is much higher than that of a primitive computation in distributed-memory multicomputers, reducing the communication overhead as much as possible is the most promising way to achieve high-performance computing. Communication-free partitioning is an ideal situation in which the communication overhead is eliminated entirely, whenever this is possible, and it is therefore of critical importance for distributed-memory multicomputers. We have surveyed the current compilation techniques for communication-free partitioning of nested loops in this chapter. The characteristics of each method and the differences among them have also been addressed. However, many programs cannot be partitioned communication-free. Developing efficient partitioning techniques that reduce the communication overhead as much as possible for such programs remains future research in this area.
References 1. C. Ancourt and F. Irigoin. Scanning polyhedra with DO loops. In Proceedings of the 3rd ACM/SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 39–50, April 1991. 2. J. M. Anderson and M. S. Lam. Global optimizations for parallelism and locality on scalable parallel machines. In Proceedings of the ACM SIGPLAN’93 Conference on Programming Language Design and Implementation, pages 112–125, June 1993. 3. U. Banerjee. Unimodular transformations of double loops. In Proceedings of the 3rd Workshop on Languages and Compilers for Parallel Computing, pages 192–219, July 1990. 4. T. S. Chen. Compiling Nested Loops for Communication-Efficient Execution on Distributed Memory Multicomputers. PhD thesis, Department of Computer Science and Information Engineering, National Central University, Taiwan, June 1994.
5. T. S. Chen and J. P. Sheu. Communication-free data allocation techniques for parallelizing compilers on multicomputers. IEEE Transactions on Parallel and Distributed Systems, 5(9):924–938, September 1994. 6. A. Darte and Y. Robert. Affine-by-statement scheduling of uniform and affine loop nests over parametric domains. Journal of Parallel and Distributed Computing, 29:43–59, 1995. 7. M. Dion, C. Randriamaro, and Y. Robert. How to optimize residual communications? In Proceedings of International Parallel Processing Symposium, April 1996. 8. M. Dion and Y. Robert. Mapping affine loop nests: New results. In B. Hertzberger and G. Serazzi, editors, High-Performance Computing and Networking, International Conference and Exhibition, volume LNCS 919, pages 184–189. Springer-Verlag, May 1995. 9. P. Feautrier. Some efficient solution to the affine scheduling problem, part I, one dimensional time. International Journal of Parallel Programming, 21(5):313– 348, October 1992. 10. P. Feautrier. Some efficient solution to the affine scheduling problem, part II, multidimensional time. International Journal of Parallel Programming, 21(6):389–420, December 1992. 11. M. Gupta and P. Banerjee. Demonstration of automatic data partitioning techniques for parallelizing compilers on multicomputers. IEEE Transactions on Parallel and Distributed Systems, 3(2):179–193, March 1992. 12. M. Gupta, E. Schonberg, and H. Srinivasan. A unified framework for optimizing communication in data-parallel programs. IEEE Transactions on Parallel and Distributed Systems, 7(7):689–704, July 1996. 13. S. Hiranandani, K. Kennedy, and C. W. Tseng. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, 35(8):66– 80, August 1992. 14. S. Hiranandani, K. Kennedy, and C. W. Tseng. Evaluating compiler optimizations for Fortran D. Journal of Parallel and Distributed Computing, 21:27–45, 1994. 15. K. Hoffman and R. Kunze. Linear Algebra. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, second edition, 1971. 16. C.-H. Huang and P. Sadayappan. Communication-free hyperplane partitioning of nested loops. Journal of Parallel and Distributed Computing, 19:90–102, 1993. 17. F. Irigoin and R. Triolet. Supernode partitioning. In Proceedings of the 15th Annual ACM Symposium Principle of Programming Languages, pages 319–329, January 1988. 18. A. H. Karp. Programming for parallelism. IEEE Comput. Mag., 20(5):43–57, May 1987. 19. C. Koelbel. Compiling Programs for Nonshared Memory Machines. PhD thesis, Department of Computer Science, Purdue University, November 1990. 20. C. Koelbel and P. Mehrotra. Compiling global name-space parallel loops for distributed execution. IEEE Transactions on Parallel and Distributed Systems, 2(4):440–451, October 1991. 21. L. Lamport. The parallel execution of do loops. Communications of the ACM, 17(2):83–93, February 1974. 22. A. W. Lim and M. S. Lam. Communication-free parallelization via affine transformations. In Proceedings of the 7th Workshop on Languages and Compilers for Parallel Computing, August 1994. 23. L. S. Liu, C. W. Ho, and J. P. Sheu. On the parallelism of nested for-loops using index shift method. In Proceedings of International Conference on Parallel Processing, volume II, pages 119–123, August 1990.
24. D. A. Padua and M. J. Wolfe. Advanced compiler optimizations for supercomputers. Communications of the ACM, 29:1184–1201, December 1986. 25. J. Ramanujam. Compile-Time Techniques for Parallel Execution of Loops on Distributed Memory Multiprocessors. PhD thesis, Department of Computer and Information Science, Ohio State University, September 1990. 26. J. Ramanujam and P. Sadayappan. Compile-time techniques for data distribution in distributed memory machines. IEEE Transactions on Parallel and Distributed Systems, 2(4):472–482, October 1991. 27. A. Rogers and K. Pingali. Process decomposition through locality of reference. In Proceedings of the ACM SIGPLAN’89 Conference on Programming Language Design and Implementation, pages 69–80, June 1989. 28. M. Rosing, R. B. Schnabel, and R. P. Weaver. The DINO parallel programming language. Journal of Parallel and Distributed Computing, 13:30–42, 1991. 29. J. P. Sheu and T. H. Tai. Partitioning and mapping nested loops on multiprocessor systems. IEEE Transactions on Parallel and Distributed Systems, 2(4):430–439, October 1991. 30. K.-P. Shih, J.-P. Sheu, and C.-H. Huang. Statement-level communication-free partitioning techniques for parallelizing compilers. In Proceedings of the 9th Workshop on Languages and Compilers for Parallel Computing, August 1996. 31. C.-W. Tseng. An Optimizing Fortran D Compiler for MIMD DistributedMemory Machines. PhD thesis, Department of Computer Science, Rice University, January 1993. 32. M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN’91 Conference on Programming Language Design and Implementation, pages 30–44, June 1991. 33. M. E. Wolf and M. S. Lam. A loop transformation theory and an algorithm to maximize parallelism. IEEE Transactions on Parallel and Distributed Systems, 2(4):452–471, October 1991. 34. M. J. Wolfe. More iteration space tiling. In Proceedings of ACM International Conference on Supercomputing, pages 655–664, 1989. 35. M. J. Wolfe. Optimizing Supercompilers for Supercomputers. London and Cambridge, MA: Pitman and the MIT Press, 1989. 36. M. J. Wolfe. High Performance Compilers for Parallel Computing. AddisonWesley Publishing Company, 1996. 37. M. J. Wolfe and U. Banerjee. Data dependence and its application to parallel processing. International Journal of Parallel Programming, 16(2):137–178, April 1987. 38. H. P. Zima, P. Brezany, and B. M. Chapman. SUPERB and Vienna Fortran. Parallel Computing, 20:1487–1517, 1994. 39. H. P. Zima and B. Chapman. Supercompilers for Parallel and Vector Computers. ACM Press, New York, 1991.
Chapter 11. Solving Alignment Using Elementary Linear Algebra

Vladimir Kotlyar, David Bau, Induprakas Kodukula, Keshav Pingali, and Paul Stodghill
Department of Computer Science, Cornell University, Ithaca NY 14853, USA
Summary. Data and computation alignment is an important part of compiling sequential programs to architectures with non-uniform memory access times. In this paper, we show that elementary matrix methods can be used to determine communication-free alignment of code and data. We also solve the problem of replicating data to eliminate communication. Our matrix-based approach leads to algorithms which work well for a variety of applications, and which are simpler and faster than other matrix-based algorithms in the literature.
1. Introduction A key problem in generating code for non-uniform memory access (NUMA) parallel machines is data and computation placement — that is, determining what work each processor must do, and what data must reside in each local memory. The goal of placement is to exploit parallelism by spreading the work across the processors, and to exploit locality by spreading data so that memory accesses are local whenever possible. The problem of determining a good placement for a program is usually solved in two phases called alignment and distribution. The alignment phase maps data and computations to a set of virtual processors organized as a Cartesian grid of some dimension (a template in HPF Fortran terminology). The distribution phase folds the virtual processors into the physical processors. The advantage of separating alignment from distribution is that we can address the collocation problem (determining which iterations and data should be mapped to the same processor) without worrying about the load balancing problem. Our focus in this paper is alignment. A complete solution to this problem can be obtained in three steps. 1. Determine the constraints on data and computation placement. 2. Determine which constraints should be left unsatisfied. 3. Solve the remaining system of constraints to determine data and computation placement. 0
An earlier version of this paper was presented in the 7th Annual Workshop on Languages and Compilers for Parallel Computers (LCPC), Ithaca, 1994.
In the first step, data references in the program are examined to determine a system of equations in which the unknowns are functions representing data and computation placements. Any solution to this system of equations determines a so-called communication-free alignment [6] — that is, a map of data elements and computations to virtual processors such that all data required by a processor to execute the iterations mapped to it are in its local memory. Very often, the only communication-free alignment for a program is the trivial one in which every iteration and datum is mapped to a single processor. Intuitively, each equation in the system is a constraint on data and computation placement, and it is possible to overconstrain the system so that the trivial solution is the only solution. If so, the second step of alignment determines which constraints must be left unsatisfied to retain parallelism in execution. The cost of leaving a constraint unsatisfied is that it introduces communication; therefore, the constraints left unsatisfied should be those that introduce as little communication as possible. In the last step, the remaining constraints are solved to determine data and computation placement.
The following loop illustrates these points. It computes the product Y of the sub-matrix A(11 : N + 10, 11 : N + 10) and a vector X:

DO i=1,N
  DO j=1,N
    Y(i) = Y(i) + A(i+10,j+10)*X(j)

For simplicity, assume that the virtual processors are organized as a one-dimensional grid T. Let us assume that computations are mapped by iteration number — that is, a processor does all or none of the work in executing an iteration of the loop. To avoid communication, the processor that executes iteration (i, j) must have A(i + 10, j + 10), Y(i) and X(j) in its local memory. These constraints can be expressed formally by defining the following functions that map loop iterations and array elements to virtual processors:

C  : (i, j) → T    processor that performs iteration (i, j)
DA : (i, j) → T    processor that owns A(i, j)
DY : i → T         processor that owns Y(i)
DX : j → T         processor that owns X(j)
The constraints on these functions are the following.

∀i, j s.t. 1 ≤ i, j ≤ N :   C(i, j) = DA(i + 10, j + 10)
                            C(i, j) = DY(i)
                            C(i, j) = DX(j)
If we enforce all of the constraints, the only solution is the trivial solution in which all data and computations are mapped to a single processor. In this case, we say that our system is overconstrained. If we drop the constraint on X, we have a non-trivial solution to the resulting system of constraints, which maps iteration (i, j) to processor i, and maps array elements A(i + 10, j + 10),
X(i) and Y(i) to processor i. Note that all these maps are affine functions — for example, the map of array A to the virtual processors can be written as follows:

DA(a, b) = [1 0; 0 1] [a; b] − [10; 10] = [i; j]    (1.1)

Since there is more than one processor involved in the computation, we have parallel execution of the program. However, elements of X must be communicated at runtime.
In this example, the solution to the alignment equations was determined by inspection, but how does one solve such systems of equations in general? Note that the unknowns are general functions, and that each function may be constrained by several equations (as is the case for C in the example). To make the problem tractable, it is standard to restrict the maps to linear (or affine) functions of loop indices. This restriction is not particularly onerous in general – in fact, it permits more general maps of computation and data than are allowed in HPF. The unknowns in the equations now become matrices, rather than general functions, but it is still not obvious how such systems of matrix equations can be solved. In Section 2, we introduce our linear algebraic framework that reduces the problem of solving systems of alignment equations to the standard linear algebra problem of determining a basis for the null space of a matrix.
One weakness of existing approaches to alignment is that they handle only linear functions; general affine functions, like the map of array A, must be dealt with in ad hoc ways. In Section 3, we show that our framework permits affine functions to be handled without difficulty.
In some programs, replication of arrays is useful for exploiting parallelism. Suppose we wanted to parallelize all iterations of our matrix-vector multiplication loop. The virtual processor (i, j) would execute the iteration (i, j) and own the array element A(i + 10, j + 10). It would also require the array element X(j). This means that we have to replicate the array X along the i dimension of the virtual processor grid. In addition, element Y(i) must be computed by reducing (adding) values computed by the set of processors (i, ∗). In Section 4, we show that our framework permits a solution to the replication/reduction problem as well.
Finally, we give a systematic procedure for dropping constraints from overconstrained systems. Finding an optimal solution that trades off parallelism for communication is very difficult. First, it is hard to model accurately the cost of communication and the benefit of parallelism. For example, parallel matrix-vector product is usually implemented either by mapping rows of the matrix to processors (so-called 1-D alignment) or by mapping general submatrices to processors (so-called 2-D alignment). Which mapping is better depends very much on the size of the matrix, and on the communication to computation speed ratio of the machine [9]. Second, even for simple parallel models and restricted cases of the alignment problem, finding the optimal solution is known to be an NP-complete problem [10]. Therefore, we must fall
back on heuristics. In Section 5, we discuss our heuristic. Not surprisingly, our heuristic is skewed to “do the right thing” for kernels like matrix-vector product which are extremely important in practice. How does our work relate to previous work on alignment? Our work is closest in spirit to that of Huang and Sadayappan who were the first to formulate the problem of communication-free alignment in terms of systems of equational constraints [6]. However, they did not give a general method for solving these equations. Also, they did not handle replication of data. Anderson and Lam sketched a solution method [1], but their approach is unnecessarily complex, requiring the determination of cycles in bipartite graphs, computing pseudo-inverses etc – these complications are eliminated by our approach. The equational, matrix-based approach described in this paper is not the only approach that has been explored. Li and Chen have used graph-theoretic methods to trade off communication for parallelism for a limited kind of alignment called axis alignment [10]. More general heuristics for a wide variety of cost-of-communication metrics have been studied by Chatterjee, Gilbert and Schreiber [2, 3], Feautrier [5] and Knobe et al [7, 8]. To summarize, the contributions of this paper are the following. 1. We show that the problem of determining communication-free partitions of computation and data can be reduced to the standard linear algebra problem of determining a basis for the null space of a matrix , which can be solved using fairly standard techniques (Section 2.2). 2. Previous approaches to alignment handle linear maps, but deal with affine maps in fairly ad hoc ways. We show that affine maps can be folded into our framework without difficulty (Section 3). 3. We show how replication of arrays is handled by our framework (Section 4). 4. We suggest simple and effective heuristic strategies for deciding when communication should be introduced (Section 5).
2. Linear Alignment

To avoid introducing too many ideas at once, we restrict attention to linear subscripts and linear maps in this section. First, we show that the alignment problem can be formulated using systems of equational constraints. Then, we show that the problem of solving these systems of equations can be reduced to the standard problem of determining a basis for the null space of a matrix, which can be solved using integer-preserving Gaussian elimination.

2.1 Equational Constraints

The equational constraints for alignment are simply a formalization of an intuitively reasonable statement: 'to avoid communication, the processor that
performs an iteration of a loop nest must own the data referenced in that iteration'. We discuss the formulation of these equations in the context of the following example:

DO j=1,100
  DO k=1,100
    B(j,k) = A(j,k) + A(k,j)

If i is an iteration vector in the iteration space of the loop, the alignment constraints require that the processor that performs iteration i must own B(F1 i), A(F1 i) and A(F2 i), where F1 and F2 are the following matrices:
F1 = [1 0; 0 1]      F2 = [0 1; 1 0]
Let C, DA and DB be p × 2 matrices representing the maps of the computation and arrays A and B to a p-dimensional processor template; p is an unknown which will be determined by our algorithm. Then, the alignment problem can be expressed as follows: find C, DA and DB such that

∀ i ∈ iteration space of loop :   Ci = DB F1 i
                                  Ci = DA F1 i
                                  Ci = DA F2 i

To 'cancel' the i on both sides of each equation, we will simplify the problem and require that the equations hold for all 2-dimensional integer vectors, regardless of whether they are in the bounds of the loop or not. In that case, the constraints simply become equations involving matrices, as follows: find C, DA and DB such that

C = DB F1
C = DA F1    (2.1)
C = DA F2
We will refer to the equation scheme C = DF as the fundamental equation of alignment. The general principle behind the formulation of alignment equations should be clear from this example. Each data reference for which alignment is desired gives rise to an alignment equation. Data references for which subscripts are not linear functions of loop indices are ignored; therefore, such references may give rise to communication at runtime. Although we have discussed only a single loop nest, it is clear that this framework of equational constraints can be used for multiple loop nests as well. The equational constraints from each loop nest are combined to form a single system of simultaneous equations, and the entire system is solved to find communication-free maps of computations and data.
2.2 Reduction to Null Space Computation

One way to solve systems of alignment equations is to set C and D matrices to the zero matrix of some dimension. This is the trivial solution in which all computations and data are mapped to a single processor, processor 0. This solution exploits no parallelism; therefore, we want to determine a non-trivial solution if it exists. We do this by reducing the problem to the standard linear algebra problem of determining a basis for the null space of a matrix.
Consider a single equation:

C = DF

This equation can be written in block matrix form as follows:

[C D] [I; −F] = 0

Now it is of the form UV = 0 where U is an unknown matrix and V is a known matrix. To see the connection with null spaces, we take the transpose of this equation and we see that this is the same as the equation V^T U^T = 0. Therefore, U^T is a matrix whose columns are in the null space of V^T. To exploit parallelism, we would like the rank of U^T to be as large as possible. Therefore, we must find a basis for the null space of matrix V^T. This is done using integer-preserving Gaussian elimination, a standard algorithm in the literature [4, 12].
The same reduction works in the case of multiple constraints. Suppose that there are s loops and t arrays. Let the computation maps of the loops be C1, C2, ..., Cs, and the array maps be D1, D2, ..., Dt. We can construct a block row with all the unknowns as follows:

U = [C1 C2 ... Cs D1 ... Dt]

For each constraint of the form Cj = Dk Fℓ, we create a block column:

Vq = [0; I; 0; −Fℓ; 0]

where the zeros are placed so that

U Vq = Cj − Dk Fℓ    (2.2)
Putting all these block columns into a single matrix V, the problem of finding communication-free alignment reduces once again to a matrix equation of the form
UV = 0    (2.3)

Input: A set of alignment constraints of the form Cj = Dk Fℓ.
Output: Communication-free alignment matrices Cj and Dk.
1. Assemble block columns as in (2.2).
2. Put all block columns Vq into one matrix V.
3. Compute a basis U^T for the null space of V^T.
4. Set template dimension to number of rows of U.
5. Extract the solution matrices Cj and Dk from U.
6. Reduce the solution matrix U as described in Section 2.4.
Fig. 2.1. Algorithm LINEAR-ALIGNMENT.
The reader can verify that the system of equations (2.1) can be converted into the following matrix equation:

            [  I     I     I  ]
[C DA DB]   [  0   −F1   −F2  ]  =  0    (2.4)
            [ −F1    0     0  ]

A solution matrix is:

U = [1 1 1 1 1 1]    (2.5)

which gives us:

C = DA = DB = [1 1]    (2.6)
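The null-space computation behind (2.4)–(2.6) can be reproduced with a few lines of code. The sketch below is our own illustration (the authors do not prescribe a tool); sympy and the variable names are assumptions.

```python
# Hedged sketch: reproduce (2.4)-(2.6) with sympy's exact null-space routine.
from sympy import Matrix, eye, zeros

F1 = Matrix([[1, 0], [0, 1]])
F2 = Matrix([[0, 1], [1, 0]])
I2, Z2 = eye(2), zeros(2, 2)

# Block rows correspond to the unknowns [C | DA | DB]; block columns to the
# three constraints C = DB*F1, C = DA*F1, C = DA*F2, assembled as in (2.2).
V = Matrix.vstack(Matrix.hstack(I2, I2, I2),
                  Matrix.hstack(Z2, -F1, -F2),
                  Matrix.hstack(-F1, Z2, Z2))

# Rows of U are (transposed) basis vectors of the null space of V^T.
U = Matrix.vstack(*[vec.T for vec in V.T.nullspace()])
print(U)   # [1 1 1 1 1 1] up to scaling, i.e. C = DA = DB = [1 1]
```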
Since the number of rows of U is one, the solution requires a one dimensional template. Iteration (i, j) is mapped to processor i + j. Arrays A and B are mapped identically so that the ‘anti-diagonals’ of these matrices are mapped to the same processor. The general algorithm is outlined in Figure 2.1. 2.3 Remarks Our framework is robust enough that we can add additional constraints to computation and data maps without difficulty. For example, if a loop in a loop nest carries a dependence, we may not want to spread iterations of that loop across processors. More generally, dependence information can be characterized by a distance vector z, which for our purposes says that iterations i and i+z have to be executed on the same processor. In terms of our alignment model: Ci + b = C(i + z) + b
⇔   Cz = 0    (2.7)
We can now easily incorporate (2.7) into our matrix system (2.3) by adding the following block column to V:
Vdep = [0; z; 0]
where the zeros are placed so that U Vdep = Cz. Adding this column to V will ensure that any two dependent iterations end up on the same processor.
In some circumstances, it may be necessary to align two data references without aligning them with any computation. This gives rise to equations of the form D1 F1 = D2 F2. Such equations can be incorporated into our framework by adding block columns of the form

Vp = [0; F1; 0; −F2; 0]    (2.8)

where the zeros are placed so that U Vp = D1 F1 − D2 F2.

2.4 Reducing the Solution Basis

Finally, one practical note. It is possible for Algorithm LINEAR-ALIGNMENT to produce a solution U which has p rows, even though all Cj produced by Step 5 have rank less than p. A simple example where this can happen is a program with two loop nests which have no data in common. Mapping the solution into a lower dimensional template can be left to the distribution phase of compiling; alternatively, an additional step can be added to Algorithm LINEAR-ALIGNMENT to solve this problem directly in the alignment phase. This modification is described next. Suppose we compute a solution which contains two computation alignments:

U = [C1 C2 ...]    (2.9)
Let r be the number of rows in U. Let r1 be the rank of C1 , and let r2 be the rank of C2 . Assume that r1 < r2 . We would like to have a solution basis where the first r1 rows of C1 are linearly independent, as are the first r2 rows of C2 — that way, if we decide to have an r1 -dimensional template, we are guaranteed to keep r1 degrees of parallelism for the second loop nest, as well. Mathematically, the problem is to find a sequence of row transformations T such that the first r1 rows of TC1 are linearly independent and so are the first r2 rows of TC2 . A detailed procedure is given in the appendix. Here, we describe the intuitive idea. Suppose that we have already arranged the first r1 rows of C1 to be linearly independent. Inductively, assume that the first k < r2 rows of C2 are linearly independent as well. We want to make the k + 1-st row of C2 linearly independent of the previous k rows. If it already is, we go the
next row. If not, then there must be a row m > k + 1 of C2 which is linearly independent of the first k rows. It is easy to see that if we add the m-th row to the k + 1-st row, we will make the latter linearly independent of the first k rows. Notice that this can mess up C1! Fortunately, it can be shown that if we add a suitably large multiple of the m-th row, we can be sure that the first r1 rows of C1 remain independent. This algorithm can be easily generalized to any number of Cj blocks.
3. Affine Alignment

In this section, we generalize our framework to affine functions. The intuitive idea is to 'encode' affine subscripts as linear subscripts by using an extra dimension to handle the constant term. Then, we apply the machinery in Section 2 to obtain linear computation and data maps. The extra dimension can be removed from these linear maps to 'decode' them back into affine maps.
We first generalize the data access functions Fℓ so that they are affine functions of the loop indices. In the presence of such subscripts, aligning data and computation requires affine data and computation maps. Therefore, we introduce the following notation.

Computation maps:       Cj(i) = Cj i + cj      (3.1)
Data maps:              Dk(a) = Dk a + dk      (3.2)
Data access functions:  Fℓ(i) = Fℓ i + fℓ      (3.3)

Cj, Dk and Fℓ are matrices representing the linear parts of the affine functions, while cj, dk and fℓ represent constants. The alignment constraints from each reference are now of the form

∀i ∈ Z^n :   Cj i + cj = Dk (Fℓ i + fℓ) + dk    (3.4)
3.1 Encoding Affine Constraints as Linear Constraints

Affine functions can be encoded as linear functions by using the following identity:

Tx + t = [T t] [x; 1]    (3.5)

where T is a matrix, and t and x are vectors. We can put (3.4) in the form:
[Cj cj] [i; 1] = Dk [Fℓ fℓ] [i; 1] + dk = [Dk dk] [Fℓ fℓ; 0 1] [i; 1]    (3.6)

Now we let:

Ĉj = [Cj cj],    D̂k = [Dk dk],    F̂ℓ = [Fℓ fℓ; 0 1]    (3.7)

(3.6) can be written as:

∀i ∈ Z^d :   Ĉj [i; 1] = D̂k F̂ℓ [i; 1]    (3.8)

As before, we would like to 'cancel' the vector [i; 1] from both sides of the equation. To do this, we need the following result.

Lemma 31. Let T be a matrix, t a vector. Then

∀x :   [T t] [x; 1] = 0

if and only if T = 0 and t = 0.

Proof: In particular, we can let x = 0. This gives us:

[T t] [0; 1] = t = 0

So t = 0. Now, for any x:

[T t] [x; 1] = [T 0] [x; 1] = Tx = 0

which means that T = 0, as well.    ∎

Using Lemma 31, we can rewrite (3.8) as follows:

Ĉj = D̂k F̂ℓ    (3.9)

We can now use the techniques in Section 2 to reduce systems of such equations to a single matrix equation as follows:

Û V̂ = 0    (3.10)

In turn, this equation can be solved using the Algorithm LINEAR-ALIGNMENT to determine Û. To illustrate this process, we use the example from Section 1:
DO i=1,N
  DO j=1,N
    Y(i) = Y(i) + A(i+10,j+10)*X(j)

Suppose we wish to satisfy the constraints for Y and A. The relevant array access functions are:

FA = [1 0; 0 1],   fA = [10; 10],   FY = [1 0],   fY = [0]

F̂A = [1 0 10; 0 1 10; 0 0 1],   F̂Y = [1 0 0; 0 0 1]    (3.11)

The reader can verify that the matrix equation to be solved is the following one:

Û V̂ = 0    (3.12)

where:

Û = [Ĉ  D̂A  D̂Y],    V̂ = [  I      I   ]
                          [ −F̂A     0   ]
                          [  0    −F̂Y  ]

And the solution is the following matrix:

Û = [ 1 0 0    1 0 −10    1 0 ]
    [ 0 0 1    0 0   1    0 1 ]    (3.13)

From this matrix, we can read off the following maps of computation and data:

Ĉ = [1 0 0; 0 0 1],   D̂A = [1 0 −10; 0 0 1],   D̂Y = [1 0; 0 1]

This says that iteration (i, j) of the loop is mapped to the following virtual processor:

Ĉ [i; j; 1] = [1 0 0; 0 0 1] [i; j; 1] = [i; 1]

Notice that although the space of virtual processors has two dimensions (because of the encoding of constants), the maps of the computation and data use only a one-dimensional subspace of the virtual processor space. To obtain a clean solution, it is desirable to remove the extra dimension introduced by the encoding.
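To make the encoding concrete, the sketch below is our own illustration (not the authors' code; sympy and the variable names are assumptions). It builds F̂A and F̂Y from (3.11), solves Û V̂ = 0 as in (3.12), and notes where the redundant row of the next subsection comes from.

```python
# Hedged sketch: the affine example of Section 3, solved via the homogeneous
# encoding.  Block layout follows (3.11)-(3.13).
from sympy import Matrix, eye, zeros

FA_hat = Matrix([[1, 0, 10], [0, 1, 10], [0, 0, 1]])   # encodes A(i+10, j+10)
FY_hat = Matrix([[1, 0, 0], [0, 0, 1]])                # encodes Y(i)

# V_hat has block rows for [C_hat | DA_hat | DY_hat] and one block column per
# constraint C_hat = DA_hat*FA_hat and C_hat = DY_hat*FY_hat.
V_hat = Matrix.vstack(Matrix.hstack(eye(3), eye(3)),
                      Matrix.hstack(-FA_hat, zeros(3, 3)),
                      Matrix.hstack(zeros(2, 3), -FY_hat))

U_hat = Matrix.vstack(*[v.T for v in V_hat.T.nullspace()])
print(U_hat)  # two rows, spanning the same row space as (3.13)

# The "constant only" row w^T of (3.14) lies in this row space; removing it
# leaves the 1-D affine alignment: iteration (i, j) -> processor i,
# A(a, b) -> processor a - 10, Y(i) -> processor i.
```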
Input: A set of alignment constraints as in Equation (3.4).
Output: Communication-free alignment mappings characterized by Cj, cj, Dk, dk.
1. Assemble the F̂ℓ matrices as in Equation (3.6).
2. Assemble block columns Vq as in Equation (2.2), using F̂ℓ instead of Fℓ.
3. Put all block columns Vq into one matrix V̂.
4. Compute a basis Û^T for the null space of V̂^T as in Step 3 of the LINEAR-ALIGNMENT algorithm.
5. Eliminate redundant row(s) in Û.
6. Extract the solution matrices from Û.
Fig. 3.1. Algorithm AFFINE-ALIGNMENT.

We have already mentioned that there is always a trivial solution that maps everything to the same virtual processor p = 0. Because we have introduced affine functions, it is now possible to map everything to the same virtual processor p ≠ 0. In our framework this is reflected in the fact that there is always a row

w^T = [0 0 ... 0 1 0 ... 0 1 ... 0 0 1]    (3.14)

(with zeros placed appropriately) in the row space of the solution matrix Û. To "clean up" the solution, notice that we can always find a vector x such that x^T Û = w^T. Moreover, let k be the position of some non-zero element of x and let J be an identity matrix with the k-th row replaced by x^T (J is non-singular). Then the k-th row of Û′ = JÛ is equal to w^T and is linearly independent from the rest of the rows. This means that we can safely remove it from the solution matrix. Notice that this procedure is exactly equivalent to removing the k-th row from Û. A more detailed description is given in the appendix. Algorithm AFFINE-ALIGNMENT is summarized in Figure 3.1.
4. Replication

As we discussed in Section 1, communication-free alignment may require replication of data. Currently, we allow replication only of read-only arrays or of arrays which are updated using reduction operations. In this section, we show how replication of data is handled in our linear algebra framework. We use a matrix-vector multiplication loop (MVM) as a running example.

DO i=1,N
  DO j=1,N
    Y(i) = Y(i) + A(i,j)*X(j)
We are interested in deriving the parallel version of this code which uses 2-D alignment — that is, it uses a 2-dimensional template in which processor (i, j) performs iteration (i, j). If we keep the alignment constraint for A only, we get the solution:

C = DA = [1 0; 0 1]    (4.1)

which means that iteration (i, j) is executed on the processor with coordinates (i, j). This processor also owns the array element A(i, j). For the computation, it needs X(j) and Y(i). This requires that X be replicated along the i dimension of the processor grid, and Y be reduced along the j dimension. We would like to derive this information automatically.

4.1 Formulation of Replication

To handle replication, we associate a pair of matrices R and D with each data reference for which alignment is desired; as we show next, the fundamental equational scheme for alignment becomes RC = DF.
Up to this point, data alignment was specified using a matrix D which mapped array element a to logical processor Da. If D has a non-trivial null space, then elements of the array belonging to the same coset of the null space get placed onto the same virtual processor; that is,

Da1 = Da2   ⇔   a1 − a2 ∈ null(D)

When we allow replication, the mapping of array elements to processors can be described as follows. Array element a is mapped to processor p if

Rp = Da
The mapping of the array is now a many-to-many relation that can be described in words as follows: – Array elements that belong to the same coset of null(D) are mapped onto the same processors. – Processors that belong to the same coset of null(R) own the same data. From this, it is easy to see that the fundamental equation of alignment becomes RC = DF. The replication-free scenario is just a special case when R is I. Not all arrays in a procedure need to be replicated — for example, if an array is involved in a non-reduction dependence or it is very large, we can disallow replication of that array. Notice that the equation RC = DF is non-linear if both R and C are unknown. To make the solution tractable, we first compute C based on the constraints for the non-replicated arrays. Once C is determined, the equation is again linear in the unknowns R and
D. Intuitively, this means that we first drop some constraints from the nonreplicated alignment system, and then try to satisfy these constraints via replication. We need to clarify what “fixing C” means. When we solve the alignment system (2.3), we obtain a basis Cbasis for all solutions to the loop alignment. The solutions can be expressed parametrically as C = TCbasis
(4.2)
for any matrix T. Now the replication equation becomes RTCbasis = DF
(4.3)
and we are faced again with a non-linear system (T is another unknown)! The key observation is that if we are considering a single loop nest, then T becomes redundant since we can "fold it" into R. This lets us solve the replication problem for a single loop nest.
In our MVM example, once the loop alignment has been fixed as in (4.1), the system of equations for the replication of X and Y is:

RX C = DX FX
RY C = DY FY

These can be solved independently or put together into a block-matrix form UV = 0:

U = [RX  RY  DX  DY],    V = [  C     0  ]
                             [  0     C  ]
                             [ −FX    0  ]
                             [  0   −FY  ]

and solved using the standard methods. The solution to this system:

RX = [0 1],   DX = [1]    (4.4)
RY = [1 0],   DY = [1]    (4.5)

which is the desired result: columns of the processor grid form the cosets of null(RX) and rows of the processor grid form the cosets of null(RY). The overall Algorithm SINGLE-LOOP-REPLICATION-ALIGNMENT is summarized in Figure 4.1.
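The replication solve can again be reproduced with a few lines of code. The sketch below is our own illustration; the access matrices FX and FY for the MVM loop are written out explicitly as an assumption, and the helper name is ours.

```python
# Hedged sketch: solve R*C = D*F for the 2-D MVM alignment of Section 4,
# i.e. find the row [R | D] in the null space of [C; -F]^T, with C fixed to I.
from sympy import Matrix, eye

C = eye(2)                    # computation alignment from (4.1)
FX = Matrix([[0, 1]])         # X(j): selects j from iteration (i, j)
FY = Matrix([[1, 0]])         # Y(i): selects i

def replication(F):
    # The unknown row [R | D] must satisfy [R | D] * [C; -F] = 0.
    W = Matrix.vstack(C, -F)
    return [v.T for v in W.T.nullspace()]

print(replication(FX))   # basis (0, 1, 1): RX = [0 1], DX = [1]
print(replication(FY))   # basis (1, 0, 1): RY = [1 0], DY = [1]
```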
5. Heuristics

In practice, systems of alignment constraints are usually over-determined, so it is necessary to drop one or more constraints to obtain parallel execution. As we mentioned in the introduction, it is very difficult to determine which constraints must be dropped to obtain an optimal solution. In this section, we discuss our heuristic which is motivated by scalability analysis of common computational kernels.
Input: Replication constraints of the form RC = DF.
Output: Matrices R, D and Cbasis that specify alignment with replication.
1. Find Cbasis by solving the alignment system for the non-replicated arrays using the Algorithm AFFINE-ALIGNMENT. If all arrays in the loop nest are allowed to be replicated, then set Cbasis = I.
2. Find (R, D) pairs that specify replication by solving the R Cbasis = DF equations.
Fig. 4.1. Algorithm SINGLE-LOOP-REPLICATION-ALIGNMENT.

5.1 Lessons from Some Common Computational Kernels

We motivate our ideas by the following example. Consider a loop nest that computes a matrix-matrix product:

DO i=1,n
  DO j=1,n
    DO k=1,n
      C(i,j) = C(i,j) + A(i,k)*B(k,j)

[9] provides a description of various parallel algorithms for matrix-matrix multiplication. It is shown that the best scalability is achieved by an algorithm which organizes the processors into a 3-D grid. Let p, q and r be the processor indices in the grid. Initially, A is partitioned in 2-D blocks along the p-r "side" of the grid. That is, if we let A^{pr} be a block of A, then it is initially placed on the processor with coordinates (p, 0, r). Similarly, each block B^{rq} is placed on processor (0, q, r). Our goal is to accumulate the block C^{pq} of the result on processor (p, q, 0). At the start of the computation, A is replicated along the second (q) dimension of the grid. B is replicated along the first dimension (p). Therefore, we end up with processor (p, q, r) holding a copy of A^{pr} and B^{rq}. Then each processor computes the local matrix-matrix product:

D^{pqr} = A^{pr} ∗ B^{rq}
(5.1)
It is easy to see that the blocks of C are related to these local products by:

C^{pq} = Σ_r D^{pqr}    (5.2)
Therefore, after the local products are computed, they are reduced along the r dimension of the grid. We can describe this computation using our algebraic framework. There is a 3-D template and the computation alignment is an identity. Each of the arrays is replicated. For example the values of D and R for the array A are:
R=
1 0
0 0 0 1
D=
1 0
0 1
(5.3)
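As a quick sanity check (our own illustration, not from the chapter), one can enumerate a small 3-D grid and confirm that with R and D as in (5.3) the owners of a block A^{pr} are exactly the processors (p, q, r) for every q, i.e. A is replicated along the q dimension. The grid size and variable names below are arbitrary assumptions.

```python
# Hedged sketch: verify that (5.3) replicates A along the q dimension of a
# small 3-D processor grid.
import numpy as np

R = np.array([[1, 0, 0], [0, 0, 1]])
D = np.array([[1, 0], [0, 1]])
P = 3                      # processors per grid dimension (arbitrary)

a = np.array([1, 2])       # block index (p, r) = (1, 2) of A
owners = [(p, q, r)
          for p in range(P) for q in range(P) for r in range(P)
          if np.array_equal(R @ np.array([p, q, r]), D @ a)]
print(owners)              # [(1, 0, 2), (1, 1, 2), (1, 2, 2)] -- one copy per q
```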
By collapsing different dimensions of the 3-D grid, we get 2-D and 1-D versions of this code. In general, it is difficult for a compiler to determine which version to use — the optimal solution depends on the size of the matrix, and on the overhead of communication relative to computation of the parallel machine [9]. On modern machines where the communication overhead is relatively small, the 3-D algorithm is preferable, but most alignment heuristics we have seen would not produce this solution — note that all arrays are communicated in this version! These heuristics are much more likely to "settle" for the 2-D or 1-D versions, with some of the arrays kept local.
Similar considerations apply to other codes such as matrix-vector product, 2-D and 3-D stencil computations, and matrix factorization codes [9]. Consider stencil computations. Here is a typical example:

DO i=1,N
  DO j=1,N
    A(i,j) = ...B(i-1,j)...B(i+1,j)...
             ...B(i,j)...B(i,j-1)...B(i,j+1)...

In general, stencil computations are characterized by array access functions of the form Fi + fℓ, where the linear part F is the same for most of the accesses. The difference in the offsets induces nearest-neighbor communication. We will analyze the communication/computation cost ratio for 1-D and 2-D partitioning of this example. For the 1-D case, the N-by-N iteration space is cut up into N/P-by-N blocks. If N is large enough, then each processor has to communicate with its "left" and "right" neighbors, and the volume of communication is 2N. We can assume that the communication between the different pairs of neighbors happens at the same time. Therefore, the total communication time is Θ(2N). The computation done on each processor is Θ(N²/P), so the ratio of communication to computation is Θ(P/N). In the 2-D case, the iteration space is cut up into N/√P-by-N/√P blocks. Each processor now has four neighbors to communicate with, and the volume of communication is 4N/√P. Therefore, the ratio for this case is Θ(√P/N). We conclude that the 2-D case scales better than the 1-D case.¹ In general, if we have a d-dimensional stencil-like computation, then it pays to have a d-dimensional template.
¹ In fact, the total volume of communication is smaller in the 2-D case, despite the fact that we had fewer alignment constraints satisfied (this paradoxical result arises from the fact that the amount of communication is a function not just of alignment but of distribution as well).
The situation is somewhat different in matrix and vector products and matrix factorization codes (although the final result is the same). Let us consider matrix-vector product together with some vector operation between X and Y:
DO i=1,N
  DO j=1,N
    Y(i) = Y(i) + A(i,j) * X(j)
DO i=1,N
  X(i) = ...Y(i)...

This fragment is typical of many iterative linear system solvers [11]. One option is to use a 1-D template by leaving the constraint for X in the matrix-vector product loop unsatisfied. The required communication is an all-to-all broadcast of the elements of X. The communication cost is Θ(N log(P)). The computation cost is Θ(N²/P). This gives us a communication to computation ratio of Θ(log(P)P/N).
In the 2-D version, each processor gets an N/√P-by-N/√P block of the iteration space and of A. X and Y are partitioned in √P pieces placed along the diagonal of the processor grid [9]. The algorithm is somewhat similar to matrix-matrix multiplication: each block of X gets broadcast along the column dimension, and each block of Y is computed as the sum-reduction along the row dimension. Note that because each broadcast or reduction happens in parallel, the communication cost is Θ(log(√P)N/√P) = Θ(log(P)N/√P). This results in a communication to computation ratio of Θ(log(P)√P/N). Although the total volume of communication is roughly the same for the 1-D and 2-D cases, the cost is asymptotically smaller in the 2-D case. Intuitively, the reason is that we were able to parallelize the communication itself.
To reason about this in our framework, let us focus on matrix-vector product, and see what kind of replication for X we get in the 1-D and 2-D cases. In the 1-D case, the computation alignment is:

C = [1 0]    (5.4)

The replication equation RC = DF for X is:

RX [1 0] = DX [0 1]    (5.5)

The only solution is:

RX = DX = 0    (5.6)
This means that every processor gets all elements of X — i.e., it is an all-to-all broadcast. We have already computed the alignments for the 2-D case in Section 4. Because R_X has rank 1, we have "parallelizable" broadcasts — that is, the broadcasts along different dimensions of the processor grid can happen simultaneously. In general, if the replication matrix has rank r and the template has dimension d, then we have broadcasts along (d − r)-dimensional subspaces of the template. The larger r, the more of these broadcasts happen at the same time. In the extreme case r = d we have a replication-free alignment, which requires no communication at all.
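The rank computation behind this classification is easy to sketch. The following fragment is an illustration (not the authors' code) using numpy: given a replication matrix R and the template dimension d, it reports the rank r and the dimension d − r of the subspaces along which the broadcasts take place.

import numpy as np

def broadcast_structure(R, template_dim):
    """r = rank(R); broadcasts run along (template_dim - r)-dimensional
    subspaces of the template, so larger r means more parallel broadcasts."""
    r = int(np.linalg.matrix_rank(np.atleast_2d(R)))
    return r, template_dim - r

# 1-D case from the text: R_X = 0 on a 1-D template -> all-to-all broadcast.
print(broadcast_structure(np.zeros((1, 1)), template_dim=1))   # (0, 1)

# A rank-1 replication matrix on a 2-D template -> broadcasts along
# 1-D subspaces of the grid, which can proceed in parallel.
print(broadcast_structure(np.array([[1, 0]]), template_dim=2)) # (1, 1)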
5.2 Implications for Alignment Heuristic

The above discussion suggests the following heuristic strategy.
– If a number of constraints differ only in the offset of the array access function, use only one of them.
– If there is a d-dimensional DOALL loop (or loop with reductions), use a d-dimensional template for it and try to satisfy conflicting constraints via replication. Keep the d-dimensional template if the rank of the resulting replication matrices is greater than zero.
– If the above strategy fails, use a greedy strategy based on array dimensions as a cost measure. That is, try to satisfy the alignment constraints for the largest array first (intuitively, we would like large arrays to be "locked in place" during the computation). This is the strategy used by Feautrier [5].

Intuitively, this heuristic is biased in favor of exploiting parallelism in DOALL loops, since communication can be performed in parallel before the computation starts. This is true even if there are reductions in the loop nest, because the communication required to perform reductions can also be parallelized. This bias in favor of exploiting parallelism in DOALL loops at the expense of communication is justified on modern machines.
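As a rough rendering of this strategy, the toy sketch below is our paraphrase: the constraint tuples, the doall_depth and replication_ranks inputs, and the returned strings are hypothetical simplifications of the analyses described above, shown only to make the order of the three rules explicit.

def choose_strategy(constraints, doall_depth, array_sizes, replication_ranks):
    # 1. Constraints that differ only in the offset add nothing to alignment:
    #    keep one representative per (array, linear_part).
    kept = {}
    for array, linear, offset in constraints:
        kept.setdefault((array, linear), (array, linear, offset))
    constraints = list(kept.values())

    # 2. If there is a d-dimensional DOALL (or reduction) loop, keep a
    #    d-dimensional template when every resulting replication matrix has
    #    rank > 0, so the induced broadcasts can themselves be parallelized.
    if doall_depth > 0 and replication_ranks and min(replication_ranks) > 0:
        return f"{doall_depth}-D template for {len(constraints)} constraint(s), conflicts resolved by replication"

    # 3. Otherwise fall back to a greedy order: satisfy constraints for the
    #    largest arrays first (Feautrier's cost measure).
    order = sorted(array_sizes, key=array_sizes.get, reverse=True)
    return "greedy, array order: " + ", ".join(order)

constraints = [("B", ((1, 0), (0, 1)), (0, 0)),
               ("B", ((1, 0), (0, 1)), (-1, 0))]   # same linear part, offsets differ
print(choose_strategy(constraints, doall_depth=2,
                      array_sizes={"A": 10**6, "B": 10**6},
                      replication_ranks=[1]))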
6. Conclusion We have presented a simple framework for the solution of the alignment problem. This framework is based on linear algebra, and it permits the development of simple and fast algorithms for a variety of problems that arise in alignment.
References

1. Jennifer M. Anderson and Monica S. Lam. Global optimizations for parallelism and locality on scalable parallel machines. ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 112–125, June 1993.
2. Siddartha Chatterjee, John Gilbert, and Robert Schreiber. The alignment-distribution graph. In U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers for Parallel Computing. Sixth International Workshop, number 768 in LNCS. Springer-Verlag, 1993.
3. Siddartha Chatterjee, John Gilbert, Robert Schreiber, and Shang-Hua Teng. Optimal evaluation of array expressions on massively parallel machines. Technical Report CSL-92-11, XEROX PARC, December 1992.
4. Henri Cohen. A Course in Computational Algebraic Number Theory. Graduate Texts in Mathematics. Springer-Verlag, 1995.
5. Paul Feautrier. Toward automatic distribution. Technical Report 92.95, IBP/MASI, December 1992.
6. C.-H. Huang and P. Sadayappan. Communication-free hyperplane partitioning of nested loops. In U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers for Parallel Computing. Fourth International Workshop, Santa Clara, CA, number 589 in LNCS, pages 186–200. Springer-Verlag, August 1991.
7. Kathleen Knobe, Joan D. Lucas, and William J. Dally. Dynamic alignment on distributed memory systems. In Proceedings of the Third Workshop on Compilers for Parallel Computers, July 1992.
8. Kathleen Knobe and Venkataraman Natarajan. Data optimization: minimizing residual interprocessor motion on SIMD machines. In Proceedings of the 3rd Symposium on the Frontiers of Massively Parallel Computation - Frontiers '90, pages 416–423, October 1990.
9. Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis. Introduction to Parallel Computing: Design and Analysis of Algorithms. The Benjamin/Cummings Publishing Company, 1994.
10. Jingke Li and Marina Chen. Index domain alignment: minimizing cost of cross-referencing between distributed arrays. Technical Report YALEU/DCS/TR725, Department of Computer Science, Yale University, September 1989.
11. Youcef Saad. Krylov subspace methods on supercomputers. SIAM Journal on Scientific and Statistical Computing, 10(6):1200–1232, November 1989.
12. Michael Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, Redwood City, CA, 1996.

Acknowledgement. This research was supported by an NSF Presidential Young Investigator award CCR-8958543, NSF grant CCR-9503199, ONR grant N00014-931-0103, and a grant from Hewlett-Packard Corporation.
A. Reducing the Solution Matrix

As we mentioned in Section 2.4, it is possible for our solution procedure to produce a matrix U which has more rows than the rank of any of the computation alignments C_j. Intuitively, this means that we end up with a template that has a larger dimension than can be exploited in any loop nest in the program. Although the extra dimensions can be 'folded' away during the distribution phase, we show how the problem can be eliminated by adding an extra step to our alignment procedure. First, we discuss two ways in which this problem can arise.

A.1 Unrelated Constraints

Suppose we have two loops with iteration alignments C_1 and C_2 and two arrays A and B with data alignments D_A and D_B. Furthermore, only A is accessed in loop 1, via access function F_A, and only B is accessed in loop 2, via access function F_B. (For simplicity we are considering linear alignments and subscripts; for affine alignments and subscripts the argument is exactly the same after the appropriate encoding.) The alignment equations in this case are:

C_1 = D_A F_A    (A.1)
C_2 = D_B F_B    (A.2)

We can assemble this into a combined matrix equation:

U = [ C_1  C_2  D_A  D_B ]

V = [   I    0
        0    I
     -F_A    0
        0 -F_B ]

U V = 0    (A.3)

Say C_1* and D_A* are the solution to (A.1), and C_2* and D_B* are the solution to (A.2). Then it is not hard to see that the following matrix is a solution to (A.3):

U = [ C_1*   0    D_A*    0
        0   C_2*    0   D_B* ]    (A.4)
So we have obtained a processor space whose dimension is the sum of the dimensions allowed by (A.1) (say, p_1) and (A.2) (say, p_2). However, these dimensions are not fully utilized, since only the first p_1 dimensions are used in loop 1, and only the remaining p_2 dimensions are used in loop 2.
This problem is relatively easy to solve. In general, we can model the alignment constraints as an undirected alignment constraint graph G whose vertices are the unknown D and C alignment matrices; an edge (x, y) represents an alignment equation constraining vertex x and vertex y. We solve the constraints in each connected component separately, and choose a template with dimension equal to the maximum of the dimensions required for the connected components.

A.2 General Procedure

Unfortunately, extra dimensions can arise even when there is only one component in the alignment constraint graph. Consider the following program fragment:

DO i=1,n
  DO j=1,n
    ...A(i,0,j)...

The alignment equation for this loop is:

C = D [ 1 0
        0 0
        0 1 ]

One full-rank solution takes D to be the 3-by-3 identity and C = DF:

U = [ C  D ] = [ 1 0   1 0 0
                 0 0   0 1 0
                 0 1   0 0 1 ]
So we have rank(C) = 2 < rank(U). If we use this solution, we end up placing the unused dimension of A onto an extra dimension of virtual processor space. We need a way of modifying the solution matrix U so that:

rank(U) = max_k { rank(C_k) }    (A.5)
For this, we apply elementary (unimodular) row operations to U (multiplying a row by ±1 and adding a multiple of one row to another are elementary row operations), so that we end up with a matrix U′ in which the first rank(C_k) rows of each C_k component form a row basis for the rows of that component. We will say that each component of U′ is reduced. By taking the first max_k {rank(C_k)} rows of U′ we obtain the desired solution W. In our example, the matrix U is not reduced: the first two rows of C do not form a basis for all rows of C. But if we add the third row of U to the second row, we get a U′ with the desired property:
U′ = [ 1 0   1 0 0
       0 1   0 1 1
       0 1   0 0 1 ]
Now by taking the first two rows of U′ we obtain a solution which does not induce unused processor dimensions. The question now becomes: how do we systematically choose a sequence of row operations on U in order to reduce its components? Without loss of generality, let us assume that U consists only of C components:

U = [ C_1  C_2  ...  C_s ]    (A.6)

Let:
– q be the number of rows in U. Also, by construction of U, q = rank(U).
– r_i be the rank of C_i for i = 1, ..., s.
– r_max = max_i { rank(C_i) }. Notice that, in general, r_max ≠ q.

We want to find a matrix W, so that:
– the number of rows in W equals r_max;
– each component of W has the same rank as the corresponding component of U.

Here is the outline of our algorithm:
1. Perform elementary row operations on U to get U′ in which every component is reduced.
2. Set W to the first r_max rows of U′.

The details are filled in below. We need the following Lemma:

Lemma A1. Let a_1, ..., a_r, a_{r+1}, ..., a_n be some vectors. Furthermore, assume that the first r vectors form a basis for the span of a_1, ..., a_n. Let

a_k = Σ_{j=1}^{r} β_j a_j    (A.7)

be the representation of a_k in the basis above. Then the vectors a_1, ..., a_{r−1}, a_r + α a_k are linearly independent (and form a basis) if and only if:

1 + α β_r ≠ 0    (A.8)
Proof:

a_r + α a_k = a_r + α Σ_{j=1}^{r} β_j a_j = a_r (1 + α β_r) + α Σ_{j=1}^{r−1} β_j a_j    (A.9)
Now if in equation (A.9) (1 + αβ_r) = 0, then the vectors a_1, ..., a_{r−1}, a_r + α a_k are linearly dependent. Vice versa, if (1 + αβ_r) ≠ 0, then these vectors are independent by the assumption on the original first r vectors. □

Lemma A1 forms the basis for an inductive algorithm to reduce all components of U. Inductively assume that we have already reduced C_1, ..., C_{k−1}. Below we show how to reduce C_k while keeping the first k − 1 components reduced. Let

C_k = [ a_1^T
        a_2^T
         ...
        a_q^T ]

We want the first r_k rows to be linearly independent. Assume inductively that the first i − 1 rows (i < r_k) are already linearly independent. There are two cases for the i-th row a_i^T:
1. a_i^T is linearly independent from the previous rows. In this case we just move to the next row.
2. a_i = Σ_{ℓ=1}^{i−1} γ_ℓ a_ℓ, i.e., a_i is linearly dependent on the previous rows. Note that since r_k = rank(C_k) > i, there is a row a_p which is linearly independent from the first i − 1 rows. Because of this, the rows a_1, ..., a_{i−1}, a_i + α a_p are linearly independent for any α ≠ 0. Lemma A1 tells us that we can choose α so that the previous components are kept reduced. We have to solve a system of inequalities like:

1 + α β_{r_1}^{(1)} ≠ 0
1 + α β_{r_2}^{(2)} ≠ 0
...
1 + α β_{r_{k−1}}^{(k−1)} ≠ 0    (A.10)
where β_{r_1}^{(1)}, ..., β_{r_{k−1}}^{(k−1)} come from the inequalities (A.8) for each component. The β_{r_i}^{(i)}'s are rational numbers: β_{r_i}^{(i)} = η_i / ξ_i. So we have to solve a system of inequalities:

α η_1 ≠ −ξ_1
α η_2 ≠ −ξ_2
...
α η_{k−1} ≠ −ξ_{k−1}    (A.11)

It is easy to see that α = max_i {|ξ_i|} + 1 is a solution.
The full algorithm for communication-free alignment ALIGNMENT-WITH-FIXUP is outlined in Figure A.1.
Input: A set of encoded affine alignment constraints as in Equation (3.4).
Output: Communication-free alignment mappings characterized by C_j, c_j, D_k, d_k which do not induce unused processor dimensions.

1. Form the alignment constraint graph G.
2. For each connected component of G:
   a) Assemble the system of constraints and solve it as described in Algorithm AFFINE-ALIGNMENT to get the solution matrix Û.
   b) Remove the extra row of Û that was induced by the affine encoding (Section B).
   c) If necessary, apply the procedure described in Section A.2 to reduce the computation alignment components of Û.

Fig. A.1. Algorithm ALIGNMENT-WITH-FIXUP
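Step 1 and the per-component iteration of step 2 are easy to make concrete. The sketch below is ours and purely illustrative: constraints are reduced to (loop, array) pairs, a small union-find stands in for a graph library, and the per-component solver of step 2a is outside its scope.

def components(constraints):
    """Group alignment constraints by connected component of the
    alignment constraint graph; each constraint is a (loop, array) edge
    between an unknown C matrix and an unknown D matrix."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for loop, array in constraints:
        union(("C", loop), ("D", array))

    groups = {}
    for c in constraints:
        groups.setdefault(find(("C", c[0])), []).append(c)
    return list(groups.values())

# Two unrelated constraints (Section A.1): loop 1 touches only A, loop 2 only B.
print(components([(1, "A"), (2, "B")]))   # two components -> solve separately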
B. A Comment on Affine Encoding

Finally, we make a remark about affine encoding. A sanity check for alignment equations is that there should always be a trivial solution which places everything onto one processor. In the case of linear alignment functions and linear array accesses, we have the solution U = 0. When we use affine functions, this solution is still valid, but there is more. We should be able to express a solution Û ≠ 0 that places everything on a single non-zero processor. Such a solution would have C_j = 0, D_k = 0, c_j = d_k = 1. Or, using our affine encoding:

Ĉ_j = [ 0 0 ... 0 1 ]
D̂_k = [ 0 0 ... 0 0 1 ]

Below, we prove that a solution of this form always exists; moreover, this gives rise to an extra processor dimension which can be eliminated without using the algorithm of Section A. Let the matrix of unknowns be:

Û = [ Ĉ_1 ... Ĉ_s  D̂_1 ... D̂_t ]

Also let:
– m_i be the number of columns of Ĉ_i for i = 1, ..., s (m_i is the dimension of the ith loop);
– m_{s+i} be the number of columns of D̂_i for i = 1, ..., t (m_{s+i} is the dimension of the ith array);
– e_k ∈ Z^k, e_k = [ 0 0 ... 0 1 ]^T (k − 1 zeros followed by a 1);
– w ∈ Z^{m_1 + m_2 + ... + m_{s+t}}, obtained by stacking these vectors:

w = [ e_{m_1}
      e_{m_2}
        ...
      e_{m_s}
      e_{m_{s+1}}
        ...
      e_{m_{s+t}} ]

It is not hard to show that w^T V̂ = 0. In particular, we can show that the vector w is orthogonal to every block column V̂_q that is assembled into V̂. Suppose that V̂_q corresponds to the equation:

Ĉ_i = D̂_j F̂_k

Therefore:

V̂_q = [   0
           I
           0
        −F̂_k
           0 ]

Note that V̂_q has m_i columns (the dimension of the ith loop) and that its last column looks like (check the definition of F̂ in Section 3.1):

[ 0 ... 0 1 0 ... 0 −1 0 ... 0 ]^T

with the 1 and the −1 placed in the same positions as 1s in w. It is clear that w is orthogonal to this column of V̂_q. w is also orthogonal to the other columns of V̂_q, since only the last column has non-zeros in the positions where w has 1s.
How can we remove an extra dimension in Û that corresponds to w? Note that in general Û will not have a row that is a multiple of w! Suppose that Û has r = rank(Û) rows:

Û = [ u_1^T
      u_2^T
       ...
      u_r^T ]

Since w^T V̂ = 0, we have that

w ∈ null(V̂^T)

But the rows of Û form a basis for null(V̂^T). Therefore:

w ∈ span(u_1, ..., u_r)    (B.1)

Let x be the solution to:

x^T Û = w^T

One of the coordinates of x, say x_ℓ, must be non-zero. Form the matrix J(x) by substituting the ℓth row of an r-by-r identity matrix with x^T:

J(x) = [ 1   0   0   ...  0  ...  0
         0   1   0   ...  0  ...  0
         ...
         x_1 x_2 x_3 ... x_ℓ ... x_r
         ...
         0   0   0   ...  0  ...  1 ]

J(x) is non-singular because x_ℓ ≠ 0. Therefore Û′ = J(x)Û has the same rank as Û, and it is also a basis for the solutions to our alignment system:

Û′ V̂ = J(x) Û V̂ = J(x) 0 = 0

But by construction:

Û′ = [ u_1^T
       u_2^T
        ...
       u_{ℓ−1}^T
       w^T
       u_{ℓ+1}^T
        ...
       u_r^T ]
Now we can just remove the w^T row to get non-trivial solutions! Notice that we don't really have to form J(x) — we only have to find x (using Gaussian elimination) and then remove from Û the ℓth row, for any ℓ such that x_ℓ ≠ 0.
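This last observation translates directly into a few lines of linear algebra. The sketch below (an illustration using numpy, not the authors' code) finds x with x^T Û = w^T and removes a row of Û at a coordinate where x is non-zero; the small Û and w are made-up examples.

import numpy as np

def drop_trivial_dimension(U_hat, w):
    """Given w in the row space of U_hat, remove one row so that the remaining
    rows still span the solution space without the trivial affine dimension."""
    # Solve x^T U_hat = w^T, i.e. U_hat^T x = w (consistent by equation (B.1)).
    x, *_ = np.linalg.lstsq(U_hat.T, w, rcond=None)
    ell = int(np.argmax(np.abs(x)))          # some coordinate with x_ell != 0
    return np.delete(U_hat, ell, axis=0)

# Made-up 3-row U_hat whose last row happens to equal the trivial solution w.
U_hat = np.array([[1.0, 0.0, 1.0, 0.0],
                  [0.0, 1.0, 0.0, 1.0],
                  [0.0, 0.0, 1.0, 1.0]])
w = np.array([0.0, 0.0, 1.0, 1.0])
print(drop_trivial_dimension(U_hat, w))      # keeps the first two rows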
Chapter 12. A Compilation Method for Communication–Efficient Partitioning of DOALL Loops

Santosh Pande and Tareq Bali
College of Computing, 801 Atlantic Drive, Georgia Institute of Technology, Atlanta, GA 30332
[email protected]
Summary. Due to the significant communication overhead of sending and receiving data, loop partitioning approaches on distributed memory systems must guarantee not just computation load balance but computation+communication load balance. Previous approaches to loop partitioning have achieved a communication-free, computation load balanced iteration space partitioning for a limited subset of DOALL loops [6]. But a large category of DOALL loops inevitably results in communication, and the tradeoffs between computation and communication must be carefully analyzed for those loops in order to balance out the combined computation time and communication overheads. In this work, we describe a partitioning approach based on the above motivation for the general case of DOALL loops. Our goal is to achieve a computation+communication load balanced partitioning through static data and iteration space distribution. First, the code partitioning phase analyzes the references in the body of the DOALL loop nest and determines a set of directions that eliminate a larger amount of communication by trading a smaller amount of parallelism. The partitioning is carried out in the iteration space of the loop by cyclically following a set of direction vectors such that the data references are maximally localized and re-used, eliminating a large communication volume. A new larger partition owns rule is formulated to minimize the communication overhead for a compute intensive partition by localizing its references relatively more than a smaller non-compute-intensive partition. A Partition Interaction Graph is then constructed and used to merge the partitions to achieve granularity adjustment, computation+communication load balance, and mapping on the actual number of available processors. Relevant theory and algorithms are developed, along with a performance evaluation on the Cray T3D.
1. Introduction

The distributed memory parallel architectures are quite popular for highly parallel scientific software development. The emergence of better routing schemes and technologies has reduced the inter-processor communication latency and increased the communication bandwidth by a large degree, making these architectures attractive for a wide range of applications. Compiling for distributed memory systems continues to pose complex, challenging problems to researchers. Some of the important research directions include data parallel languages such as HPF/Fortran 90D [4, 12, 13, 16],
communication free partitioning [1, 6, 14, 22], communication minimization [2, 17, 19], array privatization [30], data alignment [1, 3, 5, 11, 18, 24, 26, 29, 33], load balancing through multi-threading [27], mapping functional parallelism [8, 9, 20, 21, 23], compile and run time optimizations for irregular problems [25, 28] and optimizing data redistributions [15, 31]. The focus of most of these approaches is on eliminating as much interprocessor communication as possible. The primary motivation behind such approaches is that the data communication speeds on most of the distributed memory systems are orders of magnitude slower than the processor speeds. In particular, the loop partitioning approaches on these systems attempt to fully eliminate the communication through communication free partitioning for a sub-set of DOALL loops [1, 6, 14, 22]. These methods first attempt to find a communication free partition of the loop nest by determining a set of hyperplanes in the iteration and data spaces of the loop and then attempt to load balance the computation [6]. However, communication free partitioning is possible only for a very small, highly restrictive sub-class of DOALL loops and partitioning of general DOALL loops inevitably results in communication in the absence of any data replication. In these cases, the important goals of the loop partitioning strategy are to minimize the communication possibly trading parallelism and to achieve a computation+communication load balance for almost equal execution times of the generated loop partitions. The literature does not present comprehensive solutions to the above issues and this is the focus of our paper. Section 2 describes the previous work on DOALL partitioning on distributed memory systems and discusses our approach. Section 3 introduces necessary terms and definitions. Section 4 develops the theory and section 5 discusses the algorithms for DOALL iteration and data space partitioning. Section 6 discusses the algorithms for granularity adjustment, load balancing and mapping. Section 7 illustrates the methods through an example. Section 8 deals with the performance results on Cray T3D and conclusions.
2. DOALL Partitioning The DOALL loops offer the highest amount of parallelism to be exploited in many important applications. The primary motivation in DOALL partitioning on distributed memory systems is reducing data communication overhead. The previous work on this topic has focused on completely eliminating communication to achieve a communication free iteration and data space partition [1, 6, 14, 22]. But in many practical DOALL loops, the communication free partitioning may not be possible due to the incompatible reference instances of a given variable encountered in the loop body or due to incompatible variables [14, 22]. The parallelization of such DOALL loops is not possible by the above approaches. In this work, our motivation is to develop an iteration and data space partitioning method for these DOALL
loops where the reference patterns do not permit a communication free partition without replicating the data. We attempt to minimally trade parallelism to maximally eliminate communication. Our other objective is to achieve a computation+communication load balanced partitioning of the loop. The focus, thus, is on communication minimization and computation+communication load balance, as against communication elimination and computation load balance for restricted cases as in previous approaches [6]. We choose not to replicate the data since replication involves a point-to-point or broadcast type communication and poses an initial data distribution overhead on every loop slice. We first motivate our approach through an example.

2.1 Motivating Example

Consider the following DOALL loop:

for i = 2 to N
  for j = 2 to N
    A[i,j] = B[i-2,j-1] + B[i-1,j-1] + B[i-1,j-2]
  endfor
endfor

As far as this loop is concerned, it is not possible to determine a communication free data and iteration space partition [1, 6, 14, 22]. The reason this loop can not be partitioned in a communication free manner is that we can not determine a direction which will partition the iteration space so that all the resulting data references can be localized by suitably partitioning the data space of B without any replication. The question is: can we profitably (speedup > 1) parallelize this loop in any way? If there are many ways for such a parallelization, which one will give us the computation+communication load balance for the best speedup? We illustrate that by carefully choosing the iteration and data distributions, it is possible to maximally eliminate the communication while minimally sacrificing the parallelism, so as to minimize the loop completion time and maximize the speedup. It is then possible to construct a partition interaction graph to adapt the partitions for granularity and computation+communication load balance and to map them on the available number of processors for a specific architecture. One approach is to replicate each element of matrix B on each processor and partition the nested DOALL iterations on an N x N mesh so that each processor gets one iteration. This, however, has a data distribution overhead proportional to N², and thus is not a good solution where the cost of communication is higher than that of computation. This method has the maximum parallelism, but it will not give any speedup due to a very high data distribution overhead. The other possibility is to minimize communication by carefully choosing iteration and data distributions.
Table 2.1. Comparison of effect of partitioning on parallelism, communication and loop execution time

Dir.     #Part.   Commn.        Execution Time
(-1,0)   (N-1)    4(N-2)(N-1)   (N-1) + c1(2N-3) + c2(2N-3)
(0,1)    (N-1)    4(N-2)(N-1)   (N-1) + c1(2N-3) + c2(2N-3)
(-1,1)   (2N-3)   4(N-2)(N-1)   (N-1) + 2c1(N-1) + 2c2(N-1)
Cyclic   (N-1)    2(N-2)(N-1)   (2N-5) + 2c2⌈(2N-5)/2⌉ + c1⌈(2N-5)/2⌉
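The execution-time column of Table 2.1 can be evaluated numerically. The sketch below is ours; it simply evaluates the table's closed-form expressions, with c1 and c2 the per-send and per-receive costs relative to one loop-body execution, as defined in the discussion following the table.

from math import ceil

def completion_times(N, c1, c2):
    """Closed-form loop completion times from Table 2.1 (unit-cost loop body)."""
    t_row_col = (N - 1) + c1 * (2 * N - 3) + c2 * (2 * N - 3)   # (-1,0) or (0,1)
    t_diag    = (N - 1) + 2 * c1 * (N - 1) + 2 * c2 * (N - 1)   # (-1,1)
    t_cyclic  = (2 * N - 5) + 2 * c2 * ceil((2 * N - 5) / 2) + c1 * ceil((2 * N - 5) / 2)
    return {"(-1,0)/(0,1)": t_row_col, "(-1,1)": t_diag, "cyclic (0,1)/(-1,0)": t_cyclic}

for scheme, t in completion_times(N=1000, c1=200, c2=200).items():
    print(f"{scheme:22s} {t:>12,}")

For any c1, c2 > 1 the cyclic scheme's completion time is the smallest of the three, which is the comparison made in the text below.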
In the above example, it is possible to determine communication free directions for the iteration and data spaces by excluding some references. In this example, let B1 ≡ B[i − 2, j − 1], B2 ≡ B[i − 1, j − 1], B3 ≡ B[i − 1, j − 2]. If we decide to exclude B1 , the communication free direction for partitioning the iteration space is (0,1) (data space partitioned column wise). If we decide to exclude B2 , the iteration partitioning direction is (-1,1) (data partitioning along anti-diagonal). If we decide to exclude B3 , one could partition iterations along the direction (-1,0) (data partition will be row wise). Please refer to figure 2.1 for details of iteration and data partitioning for (0,1) partitioning and figure 2.2 for (-1,0) partitioning. The iterations/data grouped together are connected by arrows which show the direction of partitioning. Figure 2.3 shows details of iteration and data partitioning for (-1,1) direction vector.
[Fig. 2.1. Iteration and data partitioning for direction vector (0,1): (a) data partition of 'B'; (b) iteration partition.]
[Fig. 2.2. Iteration and data partitioning for direction vector (-1,0): (a) data partition of 'B'; (b) iteration partition.]
[Fig. 2.3. Iteration and data partitioning for direction vector (-1,1): (a) data partition of 'B'; (b) iteration partition.]

Table 2.1 shows the volume of communication, the parallelism, and the loop completion times for each of the above three cases. It is assumed that the loop body takes unit time to execute, whereas each 'send' operation is c1 times as expensive as the loop body and each 'receive' operation is c2 times as expensive as the loop body. We now show that if we carry out the iteration and data space partitioning in the following manner, we can do better than any of the above partitions. We first decide to partition along the direction (0,1), followed by partitioning along (-1,0), on a cyclical basis. If we partition this way, it automatically ensures that the communication along the direction (-1,1) is not necessary. In
other words, the iteration space is partitioned such that most of the iterations in the space reference the same data elements. This results in localization of most of the references to the same iteration space partition. Please refer to figure 2.4 for details about iteration and data space partitions using cyclic directions.
[Fig. 2.4. Iteration and data partitioning for cyclical direction vectors (0,1)/(-1,0): (a) data partition of 'B'; (b) iteration partition.]

In this partitioning scheme, whatever parallelism is lost due to sequentialization of additional iterations is more than offset by the elimination of an additional amount of communication, thus improving overall performance. We, thus, compromise the parallelism to a certain extent to reduce the communication. However, the resulting partitions from this scheme are not well balanced with respect to total computation+communication per processor. If we perform a careful data distribution and merge iteration partitions after this phase, this problem can be solved. Our objective in data distribution is to minimize the communication overhead on larger partitions and put the burden of larger overheads on smaller partitions, so that each partition is more load balanced with respect to the total computation+communication. For this example, we demonstrate the superiority of our scheme over the other schemes listed in Table 2.1 as follows. It is clear that our scheme has half the amount of communication volume as compared to any of the (0,1), (-1,0) or (-1,1) partitionings (we determine the volume of communication by counting total non-local references; each non-local reference is counted as one send at its sender and one receive at its receiver). The total number of partitions is (N-1) in the case of (0,1) or (-1,0), (2N-3) in the case of (-1,1), and (N-1) in the case of (0,1)/(-1,0) cyclic partitioning. Thus, (0,1)/(-1,0) has a lesser amount of parallelism as compared to the (-1,1)
partitioning, but it also eliminates communication by the same degree, which is more beneficial since communication is more expensive than computation. As an overall effect of the saving in communication and the more balanced computation+communication at every partition, the loop execution time, and thus the speedup, resulting from our scheme is much superior to any of the above. The loop execution time given by (0,1) or (-1,0) partitioning is (N-1) + c1(2N-3) + c2(2N-3), that of (-1,1) is (N-1) + 2c1(N-2) + 2c2(N-2), whereas according to our scheme the loop execution time is given by (2N-5) + 2c2⌈(2N-5)/2⌉ + c1⌈(2N-5)/2⌉. It can easily be seen by comparing the expressions that the loop execution time given by our scheme is superior to any of the other ones if c1, c2 > 1 (typically c1, c2 ≫ 1). Finally, it may be noted that our scheme achieves an asymptotic speedup of N/(2 + 2c2 + c1). In modern systems, the typical values of c1 and c2 (the communication/computation cost ratios) are a few hundred. For example, assuming c1 and c2 are about 200, our scheme would give speedups > 1 for N > 600. This scheme thus results in effective parallelization of even medium size problems. Also, one can overlap the computation and communication on an architecture in which each PE node has a separate communication processor. In this case, the loop completion time can approach the ideal parallel time, since most communication overheads are absorbed due to overlap with the computation.

The above scheme, however, results in a large number of partitions, which could lead to two problems in mapping them. The first problem is that the partitions could be too fine grained for a given architecture, and the second is that the number of available processors may be much smaller than the number of partitions (as is usually the case). In order to solve these problems, we perform architecture dependent analysis after iteration and data space partitioning. We construct a Partition Interaction Graph from the iteration space partitions and optimize by merging partitions with respect to granularity, so that communication overheads are reduced at the cost of coarser granularity. We then load balance the partitions with respect to total execution time, consisting of computation+communication times, and finally map the partitions on the available number of processors. We now present an overall outline of our approach.

2.2 Our Approach

Figure 2.5 shows the structure of our DOALL partitioner and scheduler.

[Fig. 2.5. DOALL Partitioner and Scheduler: DOALL loops -> Code Partitioning Phase -> Data Distribution Phase -> Granularity Adjustment Phase -> Load Balancing Phase (partitions assuming infinite processors) -> Mapping Phase -> partitions mapped to P processors.]

It consists of five phases:
– Code Partitioning Phase: This phase is responsible for analyzing the references in the body of the DOALL loop nest and determining a set of directions to partition the iteration space, so as to minimize the communication by minimally trading the parallelism.
– Data Distribution Phase: This phase visits the iteration partitions generated above in the order of decreasing sizes and uses a larger partition owns
rule to generate the underlying data distribution, so that larger compute intensive partitions incur a lesser communication overhead and vice-versa. The larger partition owns rule says that if the same data item is referenced by two or more partitions, the largest partition owns the data item. The goal is to generate computation+communication load balanced partitions.
– Granularity Adjustment Phase: This phase analyzes whether the granularity of the partitions generated above is optimal or not. It attempts to combine two partitions which have a data communication between them and determines if the resulting partition is better in terms of completion time. It continues this process until the resulting partition has a worse completion time than any of the partitions from which it is formed. In this manner, a significant amount of communication is eliminated by this phase to improve the completion time.
– Load Balancing Phase: This phase attempts to combine the load of several lightly loaded processors to reduce the number of required processors. Such merging is carried out only to the extent that the overall completion time does not degrade.
– Mapping Phase: This phase is responsible for mapping the partitions from the previous phase to a given number of processors by minimally degrading the overall completion time. The partitions that minimally degrade the completion time on merger are combined, and the process is continued till the number of partitions equals the number of available processors.

The first two phases are architecture independent and the last three phases are architecture dependent; the latter use the architecture cost model to perform granularity adjustment, load balancing and mapping. We first develop the theory behind the code and data partitioning phases shown in figure 2.5.
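Viewed as code, the five phases compose into a simple pipeline. The skeleton below is purely structural and ours: every phase body is a hypothetical stub standing in for the analyses developed in Sections 4 through 6, but it shows where the architecture-independent phases end and where the architecture-dependent ones, which consume the machine cost model, begin.

# Structural skeleton of the DOALL partitioner/scheduler of Figure 2.5.
# Each phase is a hypothetical stub; the real analyses appear in Sections 4-6.

def code_partitioning(loop):
    return {"loop": loop, "partitions": [f"iter-block-{i}" for i in range(4)]}

def data_distribution(state):
    state["ownership"] = "larger-partition-owns"
    return state

def granularity_adjustment(state, cost_model):
    state["granularity_adjusted"] = True
    return state

def load_balancing(state, cost_model):
    state["balanced"] = True        # partitions still assume infinite processors
    return state

def mapping(state, cost_model, P):
    state["processors"] = P
    return state

def partition_and_schedule(loop, cost_model, P):
    # Architecture-independent phases.
    state = data_distribution(code_partitioning(loop))
    # Architecture-dependent phases (use the machine cost model).
    state = granularity_adjustment(state, cost_model)
    state = load_balancing(state, cost_model)
    return mapping(state, cost_model, P)

print(partition_and_schedule("texture-smoothing", {"c1": 200, "c2": 200}, P=8))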
3. Terms and Definitions

We limit ourselves to perfectly nested, normalized DOALL loops. We define each of the occurrences of a given variable in the loop nest as a reference instance of that variable. For example, different occurrences of a given variable 'B' are defined as the reference instances of 'B', and different instances are denoted as B1, B2, B3, ..., etc. for convenience. The iteration space of an n-nested loop is defined as I = {(i1, i2, i3, ..., in) | Lj ≤ ij ≤ Uj, 1 ≤ j ≤ n}, where i1, i2, ..., in are the different index variables of the loop nest. In the loop body, an instance of a variable 'B' references a subset of the data space of variable 'B'. For example, the instance B1 ≡ B[i1 + σ1^1, i2 + σ2^1, i3 + σ3^1, ...] references the data space of matrix B decided by the iteration space as defined above and the offsets σ1^1, σ2^1, σ3^1, .... Each partition of the iteration space is called an iteration block. In order to generate a communication free data and iteration partition, we determine partitioning directions in the iteration and data spaces such that the references
generated in each iteration block can be disjointly partitioned and allocated in the local memory of a processor to avoid communication. Although most of the discussion in this paper uses constant offsets for the variable references in each dimension, in general the references can be uniformly generated [6], so that it is possible to perform communication free partitioning analysis. Please note that our approach uses communication free partitioning analysis as the underlying method (as described in later sections); thus, the underlying assumptions and restrictions are the same as for any of those methods described in the literature [1, 6, 14, 22]. All of these methods are able to handle uniformly generated references; thus, our method is able to do the same.

A set of reference instances of a variable is called the instance set of that variable. A set of reference instances of a variable for which a communication free data and iteration partition can be determined is defined as a set of compatible instances of the variable. If a communication free partition can not be found, such a set of reference instances is called a set of incompatible instances. If a communication free partition can be determined for a set of variables considering all their instances, it is called a set of compatible variables; otherwise it is called a set of incompatible variables. In this paper, we focus on minimizing the communication when we have a set of incompatible instances of a variable, so that a communication free partition can not be found. Minimizing communication for multiple incompatible variables is even harder and is not attempted here.

3.1 Example

Consider the following code:

for i := 1 to N
  for j := 1 to N
    a[i,j] := (b[i,j] + b[i-1,j-1] + b[i-1,j] + b[i-1,j+1] + b[i,j-1] +
               b[i,j+1] + b[i+1,j-1] + b[i+1,j] + b[i+1,j+1])/9
  endfor
endfor

For this code, it is not possible to determine a communication free iteration and data partitioning direction. Let b1 ≡ b[i,j], b2 ≡ b[i-1,j-1], b3 ≡ b[i-1,j], b4 ≡ b[i-1,j+1], b5 ≡ b[i,j-1], b6 ≡ b[i,j+1], b7 ≡ b[i+1,j-1], b8 ≡ b[i+1,j], b9 ≡ b[i+1,j+1]. Thus, the instance set for the variable b is given by {b1, b2, ..., b9} for the nine occurrences of b. All these reference instances are, therefore, incompatible.
4. Problem

We begin by stating the problem of communication minimization for incompatible instances of a variable as follows: Given an instance set of a variable B, denoted by SB = {B1, B2, B3, ..., Bm}, which may comprise incompatible instances occurring within a loop nest as described before, determine a set of communication minimizing directions so that the volume of communication reduced is at least equal to or more than the parallelism reduced.

We measure the volume of communication by the number of non-local references (the references which fall outside the underlying data partition) corresponding to an iteration block. In our formulation of the problem, no data replication is allowed. There is only one copy of each array element, kept at one of the processors, and whenever any other processor references it, there is a communication: one send at the owner processor and one receive at the one which needs it. The justification for reducing the above volume of communication is that the data communication latency in most distributed memory systems consists of a fixed start-up overhead to initiate communication and a variable part proportional to the length of the message (that is, to the number of data items). Thus, reducing the number of non-local data values reduces this second part of the communication latency. Of course, one may perform message vectorization following our partitioning phase to group the values to be sent in a single message, to amortize the start-up costs. Such techniques are presented elsewhere [19] and do not form a part of this paper. We measure the amount of parallelism reduced by the number of additional iterations being introduced in an iteration block to eliminate the communication.

4.1 Compatibility Subsets

We begin by outlining a solution which may attain the above objective. We first partition the instance set of a variable, SB, into ρ subsets SB^1, SB^2, ..., SB^ρ which satisfy the relation:
– All the reference instances of the variable belonging to a given subset are compatible, so that one can determine a direction for communication free partitioning. Formally, for all Bi ∈ SB^j there exists (d1^j, d2^j, ..., dr^j) such that partitioning along the direction vector (d1^j, d2^j, ..., dr^j) achieves a communication free partition, where 1 ≤ j ≤ ρ.
– At least one reference instance belonging to a given subset is incompatible with all the reference instances belonging to any other subset. Formally, for j ≠ k with 1 ≤ j, k ≤ ρ, there exists Bl ∈ SB^k which is incompatible with all Bi ∈ SB^j. In other words, one can not find a communication free partition for SB^j ∪ {Bl}, for some Bl ∈ SB^k.
It is easy to see that the above relation is a compatibility relation. It is well known that a compatibility relation only defines a covering of the set and does not define mutually disjoint partitions. We, therefore, first determine maximal compatibility subsets SB^1, SB^2, ..., SB^ρ from the above relation. For each of the maximal compatibility subsets, there exists a direction for communication free partitioning. The algorithm to compute maximal compatibility subsets is described in the next section. The following Lemma summarizes the maximum and minimum number of maximal compatibility subsets that can result from the above relation.

Lemma 1: If m ≡ |SB| and if ρ maximal compatibility subsets result from the above relation on SB, then 2 ≤ ρ ≤ C_2^m.

Proof: It is clear that there must exist at least one Bi ∈ SB such that it is not compatible with SB − {Bi}. If this were not the case, then a communication free partition would exist for all the instances belonging to SB, which is not true. Thus, a minimum of two compatibility subsets must exist for SB. This proves the lower bound. We now show that any two reference instances Bi, Bj ∈ SB are always compatible. Let (σ1^i, σ2^i, ..., σr^i) and (σ1^j, σ2^j, ..., σr^j) be the two offsets corresponding to the instances Bi and Bj respectively. Thus, if we partition along the direction (σ1^i − σ1^j, σ2^i − σ2^j, ..., σr^i − σr^j) in the iteration and data space of B, we will achieve a communication free partitioning as far as the instances Bi and Bj are concerned. Thus, for any two instances, communication free partitioning is always possible, proving that they are compatible. The number of subsets which have two elements of SB is given by C_2^m, proving the upper bound on ρ. q.e.d

The bounds derived in the above lemma allow us to prove the overall complexity of our communication minimizing algorithms discussed later. The next step is to determine a set of cyclically alternating directions from the compatibility subsets found above to maximally cover the communication.

4.2 Cyclic Directions

Let the instance set SB for a variable B be partitioned into SB^1, SB^2, ..., SB^ρ, which are maximal compatibility subsets under the relation of communication free partitioning. Let Comp(B) be the set of communication free partitioning directions corresponding to these compatibility subsets. Thus, Comp(B) = {D1, D2, ..., Dρ}, where Dj = (d1^j, d2^j, ..., dr^j) is the direction of communication free partitioning for the subset SB^j. The problem now is to determine a
subset of Comp(B) which maximally covers the directions in Comp(B), as explained below. (A given direction is said to be covered by a set of directions iff partitioning along the directions in the set eliminates the need for communication along the given direction.) Let such a subset of Comp(B) be denoted by Cyclic(B). Let Cyclic(B) = {D_{π^{-1}(1)}, D_{π^{-1}(2)}, ..., D_{π^{-1}(t)}}, where D_{π^{-1}(i)} = Dj, i.e., i ≡ π(j) defines a permutation which maps the jth element of Comp(B) to the ith position in Cyclic(B). We now state the property which allows us to determine such a maximal, ordered subset Cyclic(B) of Comp(B):

Property 1: The subset Cyclic(B) must satisfy all of the following:
1. D_{π^{-1}(j)} = D_{π^{-1}(j-1)} + D_{π^{-1}(j-2)}, where 3 ≤ j ≤ t. Each direction D_{π^{-1}(j)} is then said to be covered by the directions D_{π^{-1}(j-1)} and D_{π^{-1}(j-2)}. Thus, each element of the ordered set Cyclic(B) must be covered by the previous two elements, the exception being the first two elements of Cyclic(B).
2. Consider Comp(B) − Cyclic(B), and let some Dk belong to this set. If Dk = c1 * D_{π^{-1}(t)} + Σ_{i=1}^{j} D_{π^{-1}(i)}, where 1 ≤ j ≤ (t−1) and c1 ∈ I+ (in other words, if the direction Dk can be expressed as a linear combination of a multiple of D_{π^{-1}(t)} and a summation of a subset of the ordered directions as above), then it is covered and there is no communication along it. Let Uncov(B) be the subset of Comp(B) − Cyclic(B) such that for all Dk ∈ Uncov(B), Dk ≠ c1 * D_{π^{-1}(t)} + Σ_{i=1}^{j} D_{π^{-1}(i)}, i.e., none of its elements is covered, and let s ≡ |Uncov(B)|.
3. Cyclic(B) is that subset of Comp(B) which, while satisfying the properties stated in 1 and 2 above, leads to the minimum s.

Stated more simply, Cyclic(B) is an ordered subset of Comp(B) which leaves the minimum number of uncovered directions in Comp(B). If we determine Cyclic(B) and follow the corresponding communication free directions cyclically from D_{π^{-1}(1)} to D_{π^{-1}(t-1)} (that is, D_{π^{-1}(1)}, D_{π^{-1}(2)}, ..., D_{π^{-1}(t-1)}, D_{π^{-1}(1)}, D_{π^{-1}(2)}, ...), communication is reduced by a larger degree than parallelism is lost, which is beneficial. The following Lemma formally states the result:

Lemma 2: If we follow iteration partitioning cyclically along the directions corresponding to Cyclic(B) as above, then for each basic iteration block (a basic iteration block is obtained by starting at a point in the iteration space and traversing once along the directions corresponding to Cyclic(B) from there), parallelism is reduced by (t−1) (due to sequentialization of (t−1) iterations) and communication is reduced by (ρ+t)−(s+3), where ρ ≡ |Comp(B)|, t ≡ |Cyclic(B)| and s ≡ |Uncov(B)|.

Proof: It is easy to see that if t ≡ |Cyclic(B)|, we traverse once along the corresponding directions and thus introduce (t−1) extra iterations in a basic
Santosh Pande and Tareq Bali
iteration block reducing the parallelism appropriately. So, we prove the result for communication reduction. It is obvious that if we traverse the iteration space along (t-1) directions corresponding to the ordered set Cyclic(B), the communication is reduced by (t-1). In addition to this, since the Property 1, condition 1 is satisfied by these directions, additional (t-2) directions are covered eliminating the corresponding communication. In addition to this, thepartitioning is also −1 −1 j capable to covering all the directions : c1 ∗ Dπ (t) + i=1 Dπ (i) , where, 1 ≤ j ≤ (t−1), c1 ∈ I + , Property 1 Condition 2. These directions are the ones which correspond to Comp(B) − Cyclic(B) − U ncov(B). Thus, the number of such directions is (ρ - t - s). Thus, the total number of directions covered = (t-1)+(t-2)+(ρ - t - s) = (ρ+t) - (s+3). Thus, in one basic iteration partition, one is able to eliminate the communication equal to ((ρ + t) - (s + 3)) by reducing parallelism by an amount (t-1). q.e.d Corollary 1 : According to the above lemma, we must find at least one pair of directions which covers at least one other direction in Comp(B) to reduce more communication than parallelism. Proof: The above Lemma clearly demonstrates that, in order to reduce more communication than parallelism, we must have (ρ + t) − (s + 3) > (t − 1), or, (ρ−s) > 2. Now, Comp(B) = Cyclic(B) + Cov(B) + Uncov(B), where Cov(B) is the set of directions covered as per condition 2 in Property 1. In other words, ρ = t + q + s, where |Cov(B)| ≡ q. Thus, (ρ − s) ≥ 3 ⇒ (t + q) ≥ 3. At its lowest value, (t+q) = 3. Consider following cases for (t+q) = 3: 1. t = 0, q = 3 : This is impossible since if Cyclic(B) is empty, it can not cover any directions in Comp(B). 2. t = 1, q = 2: This is also impossible since one direction in Cyclic(B) can not cover two in Comp(B). 3. t = 2, q = 1: This is possible since two directions in Cyclic(B) can cover a direction in Comp(B) through Property 1, condition 2. 4. t = 3, q = 0: This is also possible, since Cyclic(B) would then have three elements related by Property 1, condition 1. It can be seen that only cases (3) and (4) above are possible and each one would imply that a direction in Comp(B) is covered by Cyclic(B) either through condition 1 or through condition 2 of Property 1. Thus, the result. q.e.d Thus, in order to maximally reduce communication, we must find Cyclic(B) from Comp(B) so that it satisfies Property 1. As one can see, the directions in Cyclic(B) form a Fibonacci Sequence as per Property 1 maximally covering the remaining directions in Comp(B). Our problem is, thus, to find a maximal Fibonacci Sequence using a minimal subset of Comp(B). The algorithm to determine such a subset is discussed in the next section.
Communication–Efficient DOALL Partitioning
427
5. Communication Minimization In this section, we discuss the two algorithms based on the theory developed in last section. The first algorithm determines the maximal compatibility subsets of the instance set of a given variable and the second one determines a maximal Fibonacci Sequence as discussed in the last section. We also analyze the complexity of these algorithms. For illustration of the working of this algorithm, please refer to the example presented in section 7. 5.1 Algorithm : Maximal Compatibility Subsets This algorithm finds the maximal compatibility subsets, Comp(B) of a variable B, given the instance set SB as an input. As one can see that the compatibility relation of communication free partitioning for a set of a references (defined before) is reflexive and symmetric but not necessarily transitive. If a and b are compatible, we denote this relation as a ≈ b. 1. Initialize Comp(B) := φ, k := 1. 2. for every reference instance Bi ∈ SB do p a) Find Bj ∈ SB such that Bi ≈ Bj but both Bi , Bj ∈ / SB , for 1 ≤ p < k. In other words, find a pair of references such that it has not been put into some compatibility subset already constructed so far (where k-1 is the number of compatibility subsets constructed so far). Whether or not Bi ≈ Bj can be determined by algorithms described in [1, 6, 14, 22]. k := {Bi , Bj } (put the pair satisfying above property into b) Initialize SB k being constructed). a new subset SB k c) For every Bl ∈ (SB − SB ), do k k k – if ∀Bm ∈ SB , Bl ≈ Bm , SB := SB ∪ {Bl }. k – Add the constructed subset SB to Comp(B), Comp(B) := k Comp(B) ∪ SB , k := k+1. d) Repeat steps (a) through (c) above till no Bj can be found satisfying condition in (a). 3. After all the subsets are constructed, replace each of them by the corresponding communication free partitioning directions. That is, for i by Di , where, Di is the Comp(B) constructed above, replace each SB i corresponding communication free direction for SB . As one can see that the above algorithm checks for compatibility relation from an element of SB to all the other elements of SB and therefore, its worst case complexity O(|SB |2 ).
428
Santosh Pande and Tareq Bali
5.2 Algorithm : Maximal Fibonacci Sequence Following algorithm determines the set Cyclic(B) using Comp(B) as an input. 1. Sort the set Comp(B). If {D1 , D2 , ..., Dρ } is the sorted set, it must satisfy the following order: – D1i < D1i+1 , or i+1 i for some < Dk+1 – if Dji := Dji+1 for all j such that 1 ≤ j ≤ k and Dk+1 k, such that 1 ≤ k ≤ r − 1. The elements D1 , D2 , ..., Dρ are then said to be sorted in non-decreasing order < such that D1 < D2 < D3 .... 2. Initialize set MaxFib := φ, max := 0. 3. for i :=1 to n for j := i+1 to n a) Let D := Di + Dj . Initialize last := j, Fib := φ, k := j+1. b) while (Dk < D) k := k+1 c) if (Dk = D), F ib := F ib ∪ Dk , D := Dk + Dlast , last := k, k:=k+1. d) Repeat steps (b) and (c) above till k > n. e) Let q be the number of additional directions covered in Comp(B) by Fib as per Property v1. In other words, let D ∈ Comp(B) − F ib. If D = c1 ∗ Dlast + l=1 Dl , where, 1 ≤ v ≤ |F ib|, c1 ∈ I + , D is already covered by Cyclic(B). Determine q, the number of such covered directions in Comp(B) - Fib. f) if max < |F ib| + q, MaxFib := Fib, max := |F ib| + q. 4. Cyclic(B) := MaxFib. As one can see, the sorting step for the above algorithm would require O(ρ log ρ) and the step of finding the maximal cover would require O(ρ3 ). Thus, the total complexity of the algorithm is O(ρ log ρ + ρ3 ). From Lemma 1, since ρ ≤ |SB |2 , the overall complexity of the algorithm is O(|SB |2 log|SB | + |SB |6 ). The code partitioning phase (refer to figure 2.5) uses these two algorithms to determine a set of communication minimizing directions (given by Cyclic(B)) for iteration space partitioning. 5.3 Data Partitioning The next phase is data partitioning. The objective of the data distribution is to achieve computation+ communication load balance through data distribution. This phase attempts minimization of communication overhead for larger compute intensive partitions by localizing their references as much as possible. In order to determine the data partition, we apply the following simple algorithm which uses a new larger partition owns rule:
Communication–Efficient DOALL Partitioning
429
– Sort the partitions in the decreasing order of their sizes in terms of the number of iterations. Visit the partitions in the sorted order (largest to smallest) as above. For each partition do: – Find out all the data references generated in a given partition and allocate that data to the respective processor. If the generated reference is already owned by a larger partition generated previously, add it to the set of nonlocal references.
6. Partition Merging The next step in compilation of the DOALL loops is to schedule the partitions generated above on available number of processors. For scheduling the partitions generated by the iteration partitioning phase on available number of processors, first a Partition Interaction Graph is constructed and granularity adjustment and load balancing are carried out. Then the partitions are scheduled (mapped) on a given number of available processors. Each node of the partition interaction graph denotes one loop partition and the weight of the node is equal to the number of iterations in that loop partition. There is a directed edge from one node to another which represents the direction of data communication. The weight of the edge is equal to the number of data values being communicated. Let G(V, E) denote such a graph where V is the set of nodes and E is the set of edges as described above. Let t(v´ ) denote the weight of node t´ ∈ V and c(vj , vi ) denote the weight of edge (vj , vi ) ∈ E. The following is supposed to be the order of execution of each partition: – Send : The partition first sends the data needed by other partitions. – Receive : After sending the data, the partition receives the data it needs sent by other partitions. – Compute : After receiving the data in the above step, the partition executes the assigned loop iterations. The total time required for the execution of each partition is, thus, equal to Send time + Receive time + Compute time. The Send time is proportional to the total number of data values sent out (total weight ) on all outgoing edges and the receive time is proportional to the total number of data values received (total weight) on all incoming edges. The compute time is proportional to the number of iterations (node weight). Depending on the relative offsets of the reference instances between different partitions and the underlying data distribution, the data values needed by a given partition may be owned by one or more partitions. This communication dependency is denoted by the graph edges and the graph may contain a different number of edges depending on such dependency. The length of the longest path between vi and vj is defined as the communication distance
430
Santosh Pande and Tareq Bali
between vi and vj where (vi , vj ) ∈ E. For example, in figure 6.1, the communication distance for the edge (vk , vi ) is equal to two due to the fact that (vk , vi ) ∈ E and the longest path from vk to vi is of length two. It can be shown that due to the properties of partitioning method described in the last section (proof omitted due to lack of space), the following relationships hold good (refer to figure 6.1): – The weight of a given node is less than or equal to any of its predecessors. In other words, t(vi ) ≤ t(vj ) where (vj , vi ) ∈ E. – The weight of an edge incident on a given node is more than or equal to the weight of an outgoing edge from that node for the same communication distance. In other words, c(vk , vj ) ≥ c(vj , vi ) where both the edges represent the same communication distance. This relationship does not apply to two edges representing two different communication distances. We now describe three scheduling phases as outlined before. All of these heuristics traverse the partition interaction graph in reverse topological order by following simple breadth first rule as follows: – Visit the leaf nodes of the graph. – Visit the predecessor of a given node such that all of its successors are already visited. – Follow this procedure to visit backwards from leaf nodes till all the nodes including root node are visited.
Vk
Vj Vi Fig. 6.1. Portion of Partition Interaction Graph The complexity of each of these phases is O(|V |) where V is the number of nodes in the graph.
Communication–Efficient DOALL Partitioning
431
6.1 Granularity Adjustment Refer to figure 6.1. – Calculate the completion time of each node vj given by tcom(vj ) = k1 ∗ c(v , v ) + k ∗ j i 2 (vk ,vj )∈E c(vk , vj ) + t(vj ), where the cost of one (vj ,vi )∈E iteration is assumed to be 1 and the cost of one send is assumed to be k1 and that of one receive to be k2 . – Visit the nodes of the graph in the reverse topological order described as above. Suppose we choose a predecessor vk of node vj for merging to adjust granularity. – Determine the completion time of merged node vjk given by tcom(vjk ) = tcom(vj ) + tcom(vk ) − c(vj , vk ) ∗ (k1 + k2 ). – Compare it with each of tcom(vj ) and tcom(vk ) and if tcom(vjk ) is lesser than both, merge vj and vk . – Continue the process by attempting to expand the partition by considering vjk and a predecessor of vk next and so on. – If tcom(vjk ) is greater than either of tcom(vj ) or tcom(vk ), reject the merger of vj and vk . Next, attempt merger of vk and one of predecessors and so on. – Repeat all the steps again on the new graph resulting from the above procedure and iterate the procedure until no new partitions are merged together (condition of graph invariance). 6.2 Load Balancing Refer to figure 6.1. – Let T be the overall loop completion time generated by the above phase. – Visit the nodes of the graph in the reverse topological order described as above. Suppose we choose a predecessor vk of node vj to merge the partitions. – Determine the completion time of merged node vjk = tcom(vj )+tcom(vk )− c(vj , vk ) ∗ (k1 + k2 ). Obviously, tcom(vjk ) will be higher than that of either of tcom(vk ) or tcom(vj ) since if it were not the case, the two partitions would have been merged by the granularity adjustment algorithm. – Compare tcom(vjk ) with T and if tcom(vjk ) is lesser than T, merge vj and vk . – Continue the process by attempting to expand the partition by considering vjk and predecessor of vk next and so on. – If vjk is greater than T, reject the merger of vj and vk . Next, attempt merger of vk and one of its predecessor and so on. – Keep repeating this process and if at any stage the completion time of the merged node is worse than the overall completion time T, reject it and attempt a new one by considering predecessor and its predecessor and so on.
6.3 Mapping
Refer to figure 6.1.
– Let there be P available processors on which the partitions resulting from the previous phase are to be mapped, where # partitions > P.
– Traverse the graph in the reverse topological order described earlier. Suppose we choose a predecessor vk of a node vj for a possible merge to reduce the number of processors.
– Determine the completion time of the merged node vjk: tcom(vjk) = tcom(vj) + tcom(vk) − c(vj, vk) ∗ (k1 + k2). Obviously, tcom(vjk) will be higher than the loop completion time T. Store tcom(vjk) in a table.
– Attempt the merger of another node and its predecessor and store it in the table. Repeat this process for all the nodes, choose the pair which results in the minimum completion time when merged, and combine them. This reduces the number of partitions by 1.
– Continue the above process until the number of partitions is reduced to P, as sketched below.
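A hedged sketch of this pairwise-merging loop is given below (Python; the bookkeeping for merged nodes is simplified and the names are illustrative, not the authors' code):

def map_to_processors(partitions, preds, tcom, c, k1, k2, P):
    # repeatedly merge the (node, predecessor) pair with the smallest merged
    # completion time until at most P partitions remain
    partitions = set(partitions)
    while len(partitions) > P:
        best = None
        for vj in partitions:
            for vk in preds.get(vj, []):
                if vk not in partitions:
                    continue
                t_jk = tcom[vj] + tcom[vk] - c.get((vk, vj), 0) * (k1 + k2)
                if best is None or t_jk < best[0]:
                    best = (t_jk, vj, vk)
        if best is None:                 # no mergeable pair left
            break
        t_jk, vj, vk = best
        merged = (vj, vk)                # composite node standing for the merge
        partitions -= {vj, vk}
        partitions.add(merged)
        tcom[merged] = t_jk
        preds[merged] = [p for p in preds.get(vj, []) + preds.get(vk, [])
                         if p not in (vj, vk)]
    return partitions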
7. Example : Texture Smoothing Code
In this section, we illustrate the significance of the above phases using a template image processing code. This code exhibits a very high amount of spatial parallelism suitable for parallelization on distributed memory systems. On the other hand, this code also exhibits a high amount of communication in all possible directions in the iteration space. Thus, this code is a good example of the tradeoff between parallelism and communication. An important step in many image processing applications is texture smoothing, which involves finding the average luminosity at a given point in an image from its immediate and successive neighbors. Consider the following code:

for i := 1 to N
  for j := 1 to N
    a[i,j] := (b[i,j] + b[i-1,j-1] + b[i-1,j] + b[i-1,j+1] +
               b[i,j-1] + b[i,j+1] + b[i+1,j-1] + b[i+1,j] + b[i+1,j+1])/9
  endfor
endfor

The above code finds the average value of luminosity at a grid point (i,j) using its eight neighbors. In this code, every grid point is a potential candidate for parallelization; thus, the code exhibits a very high amount of parallelism. On the other hand, if we decide to parallelize every grid point, there would be a tremendous amount of communication in all possible directions. Thus, we apply our method to this application to demonstrate that we can achieve
a partition which maximally reduces communication by minimally reducing the parallelism. Let b1 ≡ b[i, j], b2 ≡ b[i-1, j-1], b3 ≡ b[i-1, j], b4 ≡ b[i-1, j+1], b5 ≡ b[i, j-1], b6 ≡ b[i, j+1], b7 ≡ b[i+1, j-1], b8 ≡ b[i+1, j], b9 ≡ b[i+1, j+1]. Thus, the instance set for the variable b is given by {b1, b2, ..., b9} for the nine occurrences of b. Obviously, no communication free partition is possible for the above set of references. The first step, therefore, is to determine maximal compatibility subsets of the instance set. In order to determine the maximal compatibility subsets, we follow the algorithm described in section 5.1. We begin by considering the compatibility subset involving b1. We try to group b1 with b2 to create a subset {b1, b2}. The direction for communication free partitioning for this subset is (1,1), and thus, we can not add any other reference of b to this subset except b9 since adding any other reference would violate the condition for communication free partitioning. Thus, one of our maximal compatibility subsets is {b1, b2, b9}. Next, we group b1 with b3 and add b8 to it to give {b1, b3, b8} as another compatibility subset with (1,0) as direction of communication free partitioning. Similarly, we try to group b1 with other elements so that b1 and that element are not together in any subset formed so far. Thus, the other subsets resulting from b1 are {b1, b4, b7} and {b1, b5, b6} with (1,-1) and (0,1) as the directions for communication free partitioning. Next, we follow the algorithm for b2. We already have {b1, b2} in one of the subsets constructed so far; thus, we start with {b2, b3}. The direction for communication free partitioning is (0,1) in this case and we can include only b4 in this subset. Thus, we get {b2, b3, b4} as another maximal compatibility set. By following the algorithm as illustrated above, the following are the maximal compatibility subsets found (directions for communication free partitions are shown next to each of them):
– b1 : {b1, b2, b9} (1,1), {b1, b3, b8} (1,0), {b1, b4, b7} (1,-1), {b1, b5, b6} (0,1).
– b2 : {b2, b3, b4} (0,1), {b2, b5, b7} (1,0), {b2, b6} (1,2), {b2, b8} (2,1).
– b3 : {b3, b5} (1,-1), {b3, b6} (1,1), {b3, b7} (2,-1), {b3, b9} (2,1).
– b4 : {b4, b5} (1,-2), {b4, b6, b9} (1,0), {b4, b8} (2,-1).
– b5 : {b5, b8} (1,1), {b5, b9} (1,2).
– b6 : {b6, b7} (1,-2), {b6, b8} (1,-1).
– b7 : {b7, b8, b9} (0,1).
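The grouping above relies on checking whether a set of references admits a single communication-free partitioning direction. One way to express that check — a sketch in Python consistent with the subsets and directions listed above, not the exact algorithm of section 5.1 — is to require that all pairwise offset differences of the references be collinear:

from math import gcd

def direction(ref_a, ref_b):
    # normalized offset difference, e.g. b[i,j] and b[i-1,j-1] give (1,1)
    dx, dy = ref_a[0] - ref_b[0], ref_a[1] - ref_b[1]
    g = gcd(abs(dx), abs(dy)) or 1
    dx, dy = dx // g, dy // g
    if dx < 0 or (dx == 0 and dy < 0):        # canonical sign
        dx, dy = -dx, -dy
    return (dx, dy)

def compatible(offsets):
    # a subset is compatible when every pairwise difference is collinear,
    # i.e. there is one common communication-free direction
    dirs = {direction(a, b) for a in offsets for b in offsets if a != b}
    return len(dirs) <= 1

# offsets of the nine references b1..b9 of the texture smoothing code
b = {1: (0, 0), 2: (-1, -1), 3: (-1, 0), 4: (-1, 1), 5: (0, -1),
     6: (0, 1), 7: (1, -1), 8: (1, 0), 9: (1, 1)}
print(compatible([b[1], b[2], b[9]]), direction(b[1], b[2]))  # True (1, 1)
print(compatible([b[1], b[2], b[3]]))                         # False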
The next step is to determine the set Comp(b), which is a collection of communication free directions corresponding to each one of the maximal compatibility subsets. Thus, Comp(b) = {(0,1), (1,-2), (1,-1), (1,0), (1,1), (1,2), (2,-1), (2,1)}. The next step is to determine Cyclic(b) to maximally cover the directions in Comp(b). We, thus, apply the algorithm in section 5.2. We begin by considering (0,1) and (1,-2) which add up to (1,-1). Thus, we include (1,-1) in the set Fib being constructed. If we try adding (1,-2) and (1,-1), it gives (2,-3) which is not a member of Comp(b). Thus, we stop and at this
Table 7.1. Fibonacci Sets Constructed by Algorithm in Section 5.2

  Fibonacci Set (Fib)               Directions Covered   Parallelism Reduced   Communication Reduced
  {(0,1), (1,-2), (1,-1)}           (1,0), (2,-1)        2                     5
  {(0,1), (1,-1), (1,0), (2,-1)}    -                    3                     4
  {(0,1), (1,0), (1,1), (2,1)}      -                    3                     4
  {(0,1), (1,1), (1,2)}             -                    2                     3
  {(1,-2), (1,1), (2,-1)}           -                    2                     3
  {(1,-1), (1,0), (2,-1)}           -                    2                     3
  {(1,-1), (1,2), (2,1)}            -                    2                     3
  {(1,0), (1,1), (2,1)}             -                    2                     3
point, Fib = {(0, 1), (1, -2), (1, -1)}. The next step in the algorithm is to determine the other directions in Comp(b) which are covered by this iteration partition. Following step 3.e of the algorithm, if we add (1,-1) and (0,1) it gives (1,0), and if we add 2*(1,-1) and (0,1) it gives (2,-1); thus, a linear combination of (1,-1) and (0,1) covers two other directions in Comp(b). In this case, we are able to eliminate communication equal to 5 by sequentializing 2 iterations. Next, we try to construct another set Fib by starting with (0,1) and (1,1) and following the procedure of adding them and checking if the sum covers any direction in Comp(b). If it does, then we add it to Fib and continue further by forming the Fibonacci series using the two most recently added elements in the set. Finally, we find the covered directions from the remainder of Comp(b), if any, using step 3.e of the algorithm. Table 7.1 shows the different Fib sets constructed by the algorithm, the covered directions (if any), and the parallelism lost and communication saved. From the table, one can see that the algorithm would compute Cyclic(b) = {(0, 1), (1, -2), (1, -1)} since it results in maximally reducing the communication. Thus, by cyclically following the (0,1)/(1,-2) directions, one could reduce communication by 5 while losing parallelism of 2 per basic iteration block. This demonstrates that it is possible to profitably parallelize these types of applications by using our method. Once Cyclic(b) is determined, the next step is to generate the iteration and the data partition for the loop nest. We apply the algorithm of section 5.3. We first determine (1,7) as the base point and then apply (0,1)/(1,-2) as directions cyclically to complete the partition. We then move along dimension 2 (for this example, the dimension involving ‘i’ is considered as dimension 1 and the dimension involving ‘j’ as dimension 2) and carry out the partitioning in a similar manner. We get the iteration partitioning shown in Figure 7.1. The data partition is found by traversing the iteration partitions in the order 0, 4, 1, 5, 2, 6, 3, and 7 using the largest partition owns rule.
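The construction of the candidate Fib sets and of the directions they cover can be sketched as follows. This is a simplified Python reading of the algorithm of section 5.2 (not the authors' code); the coverage test is expressed here as sums of consecutive steps of the cyclic partitioning sequence, a formulation that reproduces the entries of Table 7.1:

def build_fib(d1, d2, comp):
    # keep appending the sum of the two most recently added directions
    # while that sum remains a member of Comp(b)
    fib = [d1, d2]
    while True:
        s = (fib[-2][0] + fib[-1][0], fib[-2][1] + fib[-1][1])
        if s not in comp:
            break
        fib.append(s)
    return fib

def covered(fib, comp, max_len=6):
    # directions of Comp(b) outside Fib that connect two iterations of the
    # same cyclic partition: sums of consecutive steps of the alternating
    # (d1, d2, d1, d2, ...) sequence
    d1, d2 = fib[0], fib[1]
    hits = set()
    for start in (0, 1):
        seq = [(d1, d2)[(start + k) % 2] for k in range(max_len)]
        for i in range(len(seq)):
            sx = sy = 0
            for j in range(i, len(seq)):
                sx, sy = sx + seq[j][0], sy + seq[j][1]
                if (sx, sy) in comp:
                    hits.add((sx, sy))
    return hits - set(fib)

comp_b = {(0, 1), (1, -2), (1, -1), (1, 0), (1, 1), (1, 2), (2, -1), (2, 1)}
fib = build_fib((0, 1), (1, -2), comp_b)
print(fib)                    # [(0, 1), (1, -2), (1, -1)]
print(covered(fib, comp_b))   # {(1, 0), (2, -1)}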
Finally, the partition interaction graph of the partitions is generated as shown in figure 7.2. For this graph, only the communication distances 1 and 2 exist between the different partitions (please see the definition of communication distances in the preceding section). The total number of iterations in each partition (the computation cost) and the number of data values exchanged between two partitions (the communication cost) are shown against each node and each edge in figure 7.2. Depending on the relative costs of computation and communication, the granularity adjustment, load balancing and mapping phases will merge the partitions generated above. The results of these phases for Cray T3D for problem size N=16 are discussed in section 8.
Fig. 7.1. Iteration partition for texture smoothing code (iteration-space axes i and j; partitions numbered 0–7)
8. Performance on Cray T3D
The following example codes are used to test the method on a Cray T3D system with 32 processors:

Example I:
---------
for i = 2 to N
  for j = 2 to N
    for k = 1 to Upper
      A[i,j,k] = B[i-2,j-1,k] + B[i-1,j-1,k] + B[i-1,j-2,k]
    endfor
  endfor
endfor
Fig. 7.2. Partition Interaction Graph of Iteration Partitions

Example II:
----------
for i := 1 to N
  for j := 1 to N
    for k := 1 to Upper
      a[i,j,k] := (b[i,j,k] + b[i-1,j-1,k] + b[i-1,j,k] + b[i-1,j+1,k] +
                   b[i,j-1,k] + b[i,j+1,k] + b[i+1,j-1,k] + b[i+1,j,k] +
                   b[i+1,j+1,k])/9
    endfor
  endfor
endfor

In the above examples, there is an inner loop in the k dimension. The number of iterations in this loop, Upper, is chosen as 10 million in the case of Example I and 1 million in the case of Example II to make the computation in the loop body comparable to the communication. As one can see, there is no problem in terms of communication free partitioning in the k dimension. However, in the i and j dimensions, due to the reference patterns, no communication free partition exists and thus the outer two loops (in the i and j dimensions) and the underlying data are distributed by applying the techniques described in sections 5 and 6. For Example I, the cyclical directions of partitioning are (0,1,0)/(-1,0,0) and for Example II, the cyclical directions are (0,1,0)/(1,-2,0) as explained earlier. The number of partitions found by the method for each of these examples is equal to N, the size of the problem. Thus, the size of the problem N is appropriately chosen to match the number of processors. The method is partially implemented (some phases of the method are not fully automated yet) in the backend of the Sisal (Streams and Iterations in a Single Assignment Language) compiler, OSC [7, 32], targeted for the Cray T3D system.
The method is tested for N=4 (4 processors), N=8 (8 processors), N=16 (16 processors), and N=32 (32 processors). The timings are obtained using the clock() system call on the Cray T3D, which allows measuring timings in micro-seconds. PVM was used as the underlying mode of communication. The sequential (as shown above) and the parallel versions are implemented and the speedup is calculated as the ratio of the time required for each.

Table 8.1. Example I : Performance on Cray T3D

  Problem Size   Processors   Direction   Sequential Time (sec)   Parallel Time (sec)   Speedup
  4x4            4            Cyclic      15.8                    7.6                   2.08
  4x4            4            (0,1)       15.8                    15.1                  1.05
  4x4            4            (-1,0)      15.8                    15.52                 1.02
  8x8            8            Cyclic      52.73                   16.3                  3.23
  8x8            8            (0,1)       52.73                   29.07                 1.81
  8x8            8            (-1,0)      52.73                   30.09                 1.75
  16x16          16           Cyclic      213.9                   33.66                 6.35
  16x16          16           (0,1)       213.9                   61.5                  3.47
  16x16          16           (-1,0)      213.9                   63.3                  3.38
  32x32          32           Cyclic      919.7                   68.42                 13.44
  32x32          32           (0,1)       919.7                   113.44                8.1
  32x32          32           (-1,0)      919.7                   117.6                 7.82

Table 8.2. Example II : Performance on Cray T3D

  Problem Size   Processors   Sequential Time (sec)   Parallel Time (sec)   Speedup
  4x4            4            9.7                     2.57                  3.77
  8x8            8            35.1                    9.78                  3.59
  16x16          16           130.92                  19.12                 6.8
  32x32          32           543.3                   43.24                 12.56
Refer to Tables 8.1 and 8.2 for the results for each example. It can be clearly seen that it is possible to effectively parallelize both of these examples, which are quite demanding in terms of communication, by employing our method. The speedup values are quite promising in spite of the heavy inherent communication in these applications.
We also implemented Example I using (0,1) (column-wise) and (-1,0) (row-wise) as directions of partitioning using the ‘owner computes’ rule. The speedups obtained by using these directions are also shown in Table 8.1. It can be clearly seen that our method outperforms these partitionings by almost a factor of 2 in terms of speedup.
Fig. 8.1. Partition Interaction Graph for Example II (N=16)

Figure 8.1 shows the partition interaction graph of Example II for a problem size N=16. The first phase (granularity adjustment) attempts to increase the granularity of the partitions by combining them as per the algorithm in section 6.1, but no partitions are combined by this phase. In order to measure the performance after this stage, the partitions are mapped to the respective processors. The processor completion times are shown in figure 8.2.
Fig. 8.2. Completion times for processors before load balancing

Fig. 8.3. Completion times for processors after load balancing
Fig. 8.4. Completion times for variable number of available processors : P = 1 to P = 8

Next, the load balancing phase attempts to reduce the number of required processors without increasing the completion time. The number of partitions is reduced in this phase from 16 to 8. Figure 8.3 gives the completion times of the respective processors. One can see that these processors are quite well load balanced. Finally, the mapping phase attempts to map these 8 partitions onto 8 or fewer processors. The completion times of these mappings for # processors = 8 through 1 are shown in figure 8.4. One can see that the method demonstrates excellent linear scalability.

8.1 Conclusions
In this chapter, we have presented a methodology for partitioning and scheduling (mapping) DOALL loops in a communication efficient manner with the following contributions:
– Established a theoretical framework for communication efficient loop partitioning applicable to a large class of practical DOALL loops.
– Developed an iteration partitioning method for these loops by determining cyclic directions of partitioning in each dimension.
– Developed a new largest partition owns rule for data distribution for computation+communication load balance.
– Developed methodologies for granularity adjustment, load balancing, and mapping to significantly improve the execution time and computation+communication load balance of each partition.
– Experimentally shown that these methods give good speedups for problems that involve heavy inherent communication and also exhibit good load balance and scalability.
The method can be used for effective parallelization of many practical loops encountered in important codes, such as image processing and weather modeling, that have DOALL parallelism but which are inherently communication intensive.
References 1. D. Bau, I. Kodukula, V. Kotlyar, K. Pingali and P. Stodghill, “Solving Alignment Using Elementary Linear Algebra”, Proceedings of 7th International Workshop on Languages and Compilers for Parallel Computing, LNCS 892, 1994, pp. 46–60. 2. J. Anderson and M. Lam, “Global Optimizations for Parallelism and Locality on Scalable Parallel Machines”, Proceedings of SIGPLAN ’93 conference on Programming Language Design and Implementation, June 1993, pp. 112–125. 3. R. Bixby, K. Kennedy and U. Kremer, “Automatic Data Layout Using 0-1 Integer Programming”, Proc. Int’l Conf. on Parallel Architectures and Compilation Techniques, North-Holland, Amsterdam, 1994. 4. Z. Bozkus, A. Choudhary, G. Fox, T. Haupt and S. Ranka, “Compiling Fortran 90D/HPF for Distributed Memory MIMD Computers”, Journal of Parallel and Distributed Computing, Special Issue on Data Parallel Algorithms and Programming, Vol. 21, No. 1, April 1994, pp. 15–26. 5. S. Chatterjee, J. Gilbert, R. Schreiber and S. -H. Teng, “Automatic Array Alignment in Data Parallel Programs”, 20th ACM Symposium on Principles of Programming Languages, pp. 16–28, 1993. 6. T. Chen and J. Sheu, “Communication-Free Data Allocation Techniques for Parallelizing Compilers on Multicomputers”, IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No.9, September 1994, pp. 924–938. 7. J. T. Feo, D. C. Cann and R. R. Oldehoeft, “A Report on Sisal Language Project”, Journal of Parallel and Distributed Computing, Vol. 10, No. 4, October 1990, pp. 349-366. 8. A. Gerasoulis and T. Yang, “On Granularity and Clustering of Directed Acyclic Task Graphs”, IEEE Transactions on Parallel and Distributed Systems, Vol. 4, Number 6, June 1993, pp. 686-701. 9. M. Girkar and C. Polychronopoulos, “Automatic Extraction of Functional Parallelism from Ordinary Programs”, IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 2, March 1992, pp. 166-178. 10. G. Gong, R. Gupta and R. Melhem, “Compilation Techniques for Optimizing Communication on Distributed-Memory Systems”, Proceedings of 1993 International Conference on Parallel Processing, Vol. II, pp. 39-46. 11. M. Gupta and P. Banerjee, “Demonstration of Automatic Data Partitioning Techniques for Parallelizing Compilers on Multicomputers”, IEEE Transactions on Parallel and Distributed Systems, Vol. 3, March 1992, pp. 179–193.
12. High Performance Fortran Forum. High Performance Fortran Language Specification, Version 1.0, Technical Report, CRPC–TR92225, Center for Research on Parallel Computation, Rice University, Houston, TX, 1992 (revised January 1993). 13. S. Hiranandani, K. Kennedy and C. -W. Tseng, “Compiling Fortran for MIMD Distributed-Memory Machines”, Communications of ACM, August 1992, Vol. 35, No. 8, pp. 66-80. 14. C. -H. Huang and P. Sadayappan, “Communication free Hyperplane Partitioning of Nested Loops”, Journal of Parallel and Distributed Computing, Vol. 19, No. 2, October ’93, pp. 90-102. 15. S. D. Kaushik, C. -H. Huang, R. W. Johnson and P. Sadayappan, “An Approach to Communication-Efficient Data Redistribution”, Proceedings of 1994 ACM International Conference on Supercomputing, pp. 364–373, June 1994. 16. C. Koelbel and P. Mehrotra, “Compiling Global Name-Space Parallel Loops for Distributed Execution”, IEEE Transactions on Parallel and Distributed Systems, October 1991, Vol. 2, No. 4, pp. 440–451. 17. J. Li and M. Chen, “Compiling Communication-efficient Programs for Massively Parallel Machines”, IEEE Transactions on Parallel and Distributed Systems, July 1991, pp. 361–376 18. A. Lim and M. Lam, “Communication-free Parallelization via Affine Transformations”, Proceedings of 7th International Workshop on Languages and Compilers for Parallel Computing, LNCS 892, 1994, pp. 92–106. 19. D. J. Palermo, E. Su, J. Chandy and P. Banerjee, “Communication Optimizations Used in the PARADIGM Compiler”, Proceedings of the 1994 International Conference on Parallel Processing, Vol. II (Software), pp. II-1 – II-10. 20. S. S. Pande, D. P. Agrawal, and J. Mauney, “A Scalable Scheduling Method for Functional Parallelism on Distributed Memory Multiprocessors”, IEEE Transactions on Parallel and Distributed Systems Vol. 6, No. 4, April 1995, pp. 388– 399 21. S. S. Pande, D. P. Agrawal and J. Mauney, “Compiling Functional Parallelism on Distributed Memory Systems”, IEEE Parallel and Distributed Technology, Spring 1994, pp. 64–75. 22. J. Ramanujam and P. Sadayappan, “Compile-Time Techniques for Data Distribution in Distributed Memory Machines”, IEEE Transactions on Parallel and Distributed Systems, Vol. 2, No. 4, October 1991, pp. 472–482. 23. S. Ramaswamy, S. Sapatnekar and P. Banerjee, “A Convex Programming Approach for Exploiting Data and Functional Parallelism on Distributed Memory Multicomputers”, Proceedings of 1994 International Conference on Parallel Processing, Vol. II (Software), pp. 116–125. 24. A. Rogers and K. Pingali, “Process Decomposition through Locality of Reference”, Proceedings of SIGPLAN ’89 conference on Programming Language Design and Implementation, pp. 69–80. 25. J. Saltz, H. Berryman and J. Wu, “Multiprocessors and Run-time Compilation”, Concurrency: Practice & Experience, Vol. 3, No. 4, December 1991, pp. 573-592. 26. V. Sarkar and G. R. Gao, “Optimization of Array Accesses by Collective Loop Transformations”, Proceedings of 1991 ACM International Conference on Supercomputing, pp. 194–204, June 1991. 27. A. Sohn, M. Sato, N. Yoo and J. -L. Gaudiot, “Data and Workload Distribution in a Multi-threaded Architecture”, Journal of Parallel and Distributed Computing 40, February 1997, pp. 256–264. 28. A. Sohn, R. Biswas and H. Simon, “Impact of Load Balancing on Unstructured Adaptive Computations for Distributed Memory Multiprocessors”, Proc.
of 8th IEEE Symposium on Parallel and Distributed Processing, New Orleans, Louisiana, Oct. 1996, pp. 26–33.
29. B. Sinharoy and B. Szymanski, “Data and Task Alignment in Distributed Memory Architectures”, Journal of Parallel and Distributed Computing, 21, 1994, pp. 61–74.
30. P. Tu and D. Padua, “Automatic Array Privatization”, Proceedings of the Sixth Workshop on Language and Compilers for Parallel Computing, August 1993.
31. A. Wakatani and M. Wolfe, “A New Approach to Array Redistribution: Strip Mining Redistribution”, Proceedings of PARLE ’94, Lecture Notes in Computer Science, 817, pp. 323–335.
32. R. Wolski and J. Feo, “Program Partitioning for NUMA Multiprocessor Computer Systems”, Journal of Parallel and Distributed Computing (special issue on Performance of Supercomputers), Vol. 19, pp. 203–218, 1993.
33. H. Xu and L. Ni, “Optimizing Data Decomposition for Data Parallel Programs”, Proceedings of International Conference on Parallel Processing, August 1994, Vol. II, pp. 225–232.
Chapter 13. Compiler Optimization of Dynamic Data Distributions for Distributed-Memory Multicomputers
Daniel J. Palermo1, Eugene W. Hodges IV2, and Prithviraj Banerjee3
1 Hewlett-Packard Convex Division, Richardson, Texas ([email protected])
2 SAS Institute Inc., Cary, North Carolina ([email protected])
3 Northwestern Univ., Center for Parallel and Distributed Computing, Evanston, Illinois ([email protected])
Summary. For distributed-memory multicomputers, the quality of the data partitioning for a given application is crucial to obtaining high performance. This task has traditionally been the user’s responsibility, but in recent years much effort has been directed to automating the selection of data partitioning schemes. Several researchers have proposed systems that are able to produce data distributions that remain in effect for the entire execution of an application. For complex programs, however, such static data distributions may be insufficient to obtain acceptable performance. The selection of distributions that dynamically change over the course of a program’s execution adds another dimension to the data partitioning problem. In this chapter we present an approach for selecting dynamic data distributions as well as a technique for analyzing the resulting data redistribution in order to generate efficient code.
1. Introduction
As part of the research performed in the PARADIGM (PARAllelizing compiler for DIstributed-memory General-purpose Multicomputers) project [4], automatic data partitioning techniques have been developed to relieve the programmer of the burden of selecting good data distributions. Originally, the compiler could automatically select a static distribution of data (using a constraint-based algorithm [15]) specifying both the configuration of an abstract multi-dimensional mesh topology along with how program data should be distributed on the mesh. For complex programs, static data distributions may be insufficient to obtain acceptable performance on distributed-memory multicomputers. By allowing the data distribution to dynamically change over the course of a program’s execution this problem can be alleviated by matching the data

This research, performed at the University of Illinois, was supported in part by the National Aeronautics and Space Administration under Contract NASA NAG 1-613, in part by an Office of Naval Research Graduate Fellowship, and in part by the Advanced Research Projects Agency under contract DAA-H04-94-G-0273 administered by the Army Research office. We are also grateful to the National Center for Supercomputing Applications and the San Diego Supercomputing Center for providing access to their machines.
Fig. 1.1. Dynamic data partitioning framework

distribution more closely to the different computations performed throughout the program. Such dynamic partitionings can yield higher performance than a static partitioning when the redistribution is more efficient than the communication pattern required by the statically partitioned computation. We have developed an approach [31] (which extends the static partitioning algorithm) for selecting dynamic data distributions as well as a technique for analyzing the resulting data redistribution [32] in order to generate efficient code. In this chapter we present an overview of these two techniques.
The approach we have developed to automatically select dynamic distributions, shown in the light outlined region in Figure 1.1, consists of two main steps. The program is first recursively decomposed into a hierarchy of candidate phases obtained using existing static distribution techniques. Then, the most efficient sequence of phases and phase transitions is selected taking into account the cost of redistributing the data between the different phases.
An overview of the array redistribution data-flow analysis framework we have developed is shown in the shaded outlined areas of Figure 1.1. In addition to serving as a back end to the automatic data partitioning system, the framework is also capable of analyzing (and optimizing) existing High Performance Fortran [26] (HPF) programs providing a mechanism to generate fully explicit dynamic HPF programs while optimizing the amount of data redistribution performed.
The remainder of this chapter is organized as follows: related work is discussed in Section 2; our methodology for the selection of dynamic data
distributions is presented in Section 3; Section 4 presents an overview of the redistribution analysis framework and the representations used in its development; the techniques for performing interprocedural array redistribution analysis are presented in Section 5; results are presented in Section 6; and conclusions are presented in Section 7.
2. Related Work Static Partitioning. Some of the ideas used in the static partitioning algorithm originally implemented in the PARADIGM compiler [17] were inspired by earlier work on multi-dimensional array alignment [29]. In addition to this work, in recent years much research has been focused on: performing multi-dimensional array alignment [5, 8, 25, 29]; examining cases in which a communication-free partitioning exists [35]; showing how performance estimation is a key in selecting good data distributions [11, 44]; linearizing array accesses and analyzing the resulting one-dimensional accesses [39]; applying iterative techniques which minimize the amount of communication at each step [2]; and examining issues for special-purpose distributed architectures such as systolic arrays [42]. Dynamic Partitioning. In addition to the work performed in static partitioning, a number of researchers have also been examining the problem of dynamic partitioning. Hudak and Abraham have proposed a method for selecting redistribution points based on locating significant control flow changes in a program [22]. Chapman, Fahringer, and Zima describe the design of a distribution tool that makes use of performance prediction methods when possible but also uses empirical performance data through a pattern matching process [7]. Anderson and Lam [2] approach the dynamic partitioning problem using a heuristic which combines loop nests (with potentially different distributions) in such a way that the largest potential communication costs are eliminated first while still maintaining sufficient parallelism. Bixby, Kennedy and Kremer [6, 27], as well as Garcia, Ayguad´e, and Labarta [13], have formulated the dynamic data partitioning problem in the form of a 0-1 integer programming problem by selecting a number of candidate distributions for each of a set of given phases and constructing constraints from the data relations. More recently, Sheffler, Schreiber, Gilbert and Pugh [38] have applied graph contraction methods to the dynamic alignment problem to reduce the size of the problem space that must be examined. Bixby, Kremer, and Kennedy have also described an operational definition of a phase which defines a phase as the outermost loop of a loop nest such that the corresponding iteration variable is used in a subscript expression of an array reference in the loop body [6]. Even though this definition restricts phase boundaries to loop structures and does not allow overlapping phases, for certain programs, such as the example that will be presented in
Section 3.1, this definition is sufficient to describe the distinct phases of a computation. Analyzing Dynamic Distributions. By allowing distributions to change during the course of a program’s execution, more analysis must also be performed to determine which distributions are present at any given point in the program as well as to make sure redistribution is performed only when necessary in order to generate efficient code. The work by Hall, Hiranandani, Kennedy, and Tseng [18] defined the term reaching decompositions for the Fortran D [19] decompositions which reach a function call site. Their work describes extensions to the Fortran D compilation strategy using the reaching decompositions for a given call site to compile Fortran D programs that contain function calls as well as to optimize the resulting implicit redistribution. As presented, their techniques addressed computing and optimizing (redundant or loop invariant) implicit redistribution operations due to changes in distribution at function boundaries, but do not address many of the situations which arise in HPF. The definition of reaching distributions (using HPF terminology), however, is still a useful concept. We extend this definition to also include distributions which reach any point within a function in order to encompass both implicit and explicit distributions and redistributions thereby forming the basis of the work presented in this chapter. In addition to determining those distributions generated from a set of redistribution operations, this extended definition allows us to address a number of other applications in a unified framework. Work by Coelho and Ancourt [9] also describes an optimization for removing useless remappings specified by a programmer through explicit realignment and redistribution operations. In comparison to the work in the Fortran D project [18], they are also concerned with determining which distributions are generated from a set of redistributions, but instead focus only on explicit redistribution. They define a new representation called a redistribution graph in which nodes represent redistribution operations and edges represent the statements executed between redistribution operations. This representation, although correct in its formulation, does not seem to fit well with any existing analysis already performed by optimizing compilers and also requires first summarizing all variables used or defined along every possible path between successive redistribution operations in order to optimize redistribution. Even though their approach currently only performs this analysis within a single function, they do suggest the possibility of an extension to their techniques which would allow them to also handle implicit remapping operations at function calls but they do not describe an approach.
3. Dynamic Distribution Selection For complex programs, we have seen that static data distributions may be insufficient to obtain acceptable performance. Static distributions suffer in that they cannot reflect changes in a program’s data access behavior. When conflicting data requirements are present, static partitionings tend to be compromises between a number of preferred distributions. Instead of requiring a single data distribution for the entire execution, program data could also be redistributed dynamically for different phases of the program (where a phase is simply a sequence of statements over which a given distribution is unchanged). Such dynamic partitionings can yield higher performance than a static partitioning when the redistribution is more efficient than the communication pattern required by the statically partitioned computation. 3.1 Motivation for Dynamic Distributions
Figure 3.1 shows the basic computation performed in a two-dimensional Fast Fourier Transform (FFT). To execute this program in parallel on a machine with distributed memory, the main data array, Image, is partitioned across the available processors. By examining the data accesses that will occur during execution, it can be seen that, for the first half of the program, data is manipulated along the rows of the array. For the rest of the execution, data is manipulated along the columns. Depending on how data is distributed among the processors, several different patterns of communication could be generated. The goal of automatic data partitioning is to select the distribution that will result in the highest level of performance.

      program FFT2D
      complex Image(N,N)
*** 1-D FFTs along rows
      do i = 1, N
        RowFFT(Image, i, N)
      enddo
*** 1-D FFTs along columns
      do i = 1, N
        ColumnFFT(Image, i, N)
      enddo
      end

(a) Static (butterfly communication)          (b) Dynamic (redistribution)
Fig. 3.1. Two-dimensional Fast Fourier Transform

If the array were distributed by rows, every processor could independently compute the FFTs for each row that involved local data. After the rows had been processed, the processors would now have to communicate to perform the column FFTs as the columns have been partitioned across the processors.
Conversely, if a column distribution were selected, communication would be required to compute the row FFTs while the column FFTs could be computed independently. Such static partitionings, as shown in Figure 3.1(a), suffer in that they cannot reflect changes in a program’s data access behavior. When conflicting data requirements are present, static partitionings tend to be compromises between a number of preferred distributions. For this example, assume the program is split into two separate phases; a row distribution is selected for the first phase and a column distribution for the second, as shown in Figure 3.1(b). By redistributing the data between the two phases, none of the one-dimensional FFT operations would require communication. From Figure 3.1, it can be seen how such a dynamic partitioning can yield higher performance if the dynamic redistribution communication is more efficient than the static communication pattern. 3.2 Overview of the Dynamic Distribution Approach As previously shown in Figure 1.1, the approach we have developed to automatically select dynamic distributions, consists of two main steps. First, in Section 3.3, we will describe how to recursively decompose the program into a hierarchy of candidate phases obtained using existing static distribution techniques. Then, in Section 3.4 we will describe how to select the most efficient sequence of phases and phase transitions taking into account the cost of redistributing the data between the different phases. This approach allows us to build upon the static partitioning techniques [15, 17] previously developed in the PARADIGM project. Static cost estimation techniques [16] are used to guide the selection of phases while static partitioning techniques are used to determine the best possible distribution for each phase. The cost models used to estimate communication and computation costs use parameters, empirically measured for each target machine, to separate the partitioning algorithm from a specific architecture. To help illustrate the dynamic partitioning technique, an example program will be used. In Figure 3.2, a two-dimensional Alternating Direction Implicit iterative method1 (ADI2D) is shown, which computes the solution of an elliptic partial differential equation known as Poisson’s equation [14]. Poisson’s equation can be used to describe the dissipation of heat away from a surface with a fixed temperature as well as to compute the free-space potential created by a surface with an electrical charge. For the program in Figure 3.2, a static data distribution will incur a significant amount of communication for over half of the program’s execution. For illustrative purposes only, the operational definition of phases previously described in Section 2 identifies twelve different “phases” in the program. 1
To simplify later analysis of performance measurements, the program shown performs an arbitrary number of iterations as opposed to periodically checking for convergence of the solution.
      program ADI2d
      double precision u(N,N), uh(N,N), b(N,N), alpha
      integer i, j, k
*** Initial value for u                                   stmt   op. phase
      do j = 1, N                                            1       I
        do i = 1, N                                          2
          u(i,j) = 0.0                                       3
        enddo                                                4
        u(1,j) = 30.0                                        5
        u(n,j) = 30.0                                        6
      enddo                                                  7
*** Initialize uh
      do j = 1, N                                            8       II
        do i = 1, N                                          9
          uh(i,j) = u(i,j)                                  10
        enddo                                               11
      enddo                                                 12
      alpha = 4 * (2.0 / N)                                 13
      do k = 1, maxiter                                     14
*** Forward and backward sweeps along cols
        do j = 2, N - 1                                     15       III
          do i = 2, N - 1                                   16
            b(i,j) = (2 + alpha)                            17
            uh(i,j) = (alpha - 2) * u(i,j)
     &             + u(i,j + 1) + u(i,j - 1)                18
          enddo                                             19
        enddo                                               20
        do j = 2, N - 1                                     21       IV
          uh(2,j) = uh(2,j) + u(1,j)                        22
          uh(N - 1,j) = uh(N - 1,j) + u(N,j)                23
        enddo                                               24
        do j = 2, N - 1                                     25       V
          do i = 3, N - 1                                   26
            b(i,j) = b(i,j) - 1 / b(i - 1,j)                27
            uh(i,j) = uh(i,j) + uh(i - 1,j) / b(i - 1,j)    28
          enddo                                             29
        enddo                                               30
        do j = 2, N - 1                                     31       VI
          uh(N - 1,j) = uh(N - 1,j) / b(N - 1,j)            32
        enddo                                               33
        do j = 2, N - 1                                     34       VII
          do i = N - 2, 2, -1                               35
            uh(i,j) = (uh(i,j) + uh(i + 1,j)) / b(i,j)      36
          enddo                                             37
        enddo                                               38
*** Forward and backward sweeps along rows
        do j = 2, N - 1                                     39       VIII
          do i = 2, N - 1                                   40
            b(i,j) = (2 + alpha)                            41
            u(i,j) = (alpha - 2) * uh(i,j)
     &             + uh(i + 1,j) + uh(i - 1,j)              42
          enddo                                             43
        enddo                                               44
        do i = 2, N - 1                                     45       IX
          u(i,2) = u(i,2) + uh(i,1)                         46
          u(i,N - 1) = u(i,N - 1) + uh(i,N)                 47
        enddo                                               48
        do j = 3, N - 1                                     49       X
          do i = 2, N - 1                                   50
            b(i,j) = b(i,j) - 1 / b(i,j - 1)                51
            u(i,j) = u(i,j)
     &             + u(i,j - 1) / b(i,j - 1)                52
          enddo                                             53
        enddo                                               54
        do i = 2, N - 1                                     55       XI
          u(i,N - 1) = u(i,N - 1) / b(i,N - 1)              56
        enddo                                               57
        do j = N - 2, 2, -1                                 58       XII
          do i = 2, N - 1                                   59
            u(i,j) = (u(i,j) + u(i,j + 1))
     &             / b(i,j)                                 60
          enddo                                             61
        enddo                                               62
      enddo                                                 63
      end                                                   64
Fig. 3.2. 2-D Alternating Direction Implicit iterative method (ADI2D) (Shown with operational phases)

These phases exposed by the operational definition need not be known for our technique (and, in general, are potentially too restrictive) but they will be used here for comparison as well as to facilitate the discussion.
3.3 Phase Decomposition
Initially, the entire program is viewed as a single phase for which a static distribution is determined. At this point, the immediate goal is to determine if and where it would be beneficial to split the program into two separate phases such that the sum of the execution times of the resulting phases is less than the original (as illustrated in Figure 3.3). Using the selected distribution, a communication graph is constructed to examine the cost of communication in relation to the flow of data within the program. We define a communication graph as the flow information from the dependence graph weighted by the cost of communication. The nodes of the communication graph correspond to individual statements while the edges correspond to flow dependencies that exist between the statements. As a heuristic, the cost of communication performed for a given reference in a
Fig. 3.3. Phase decomposition

statement is initially assigned to (reflected back along) every incoming dependence edge corresponding to the reference involved. Since flow information is used to construct the communication graph, the weights on the edges serve to expose communication costs that exist between producer/consumer relationships within a program. Also since we restrict the granularity of phase partitioning to the statement level, single node cycles in the flow dependence graph are not included in the communication graph.
After the initial communication cost, comm(j, ref), has been computed for a given reference, ref, and statement, j, it is scaled according to the number of incoming edges for each producer statement, i, of the reference. The weight of each edge, (i, j), for this reference, W(i, j, ref), is then assigned this value:

Scaling and assigning initial costs:

  W(i, j, ref) = ( dyncount(i) / Σ_{P ∈ in_predsucc_ij(j, ref)} dyncount(P) ) · lratio(i, j) · comm(j, ref)    (3.1)

where in_predsucc_ij(j, ref) denotes the producers of ref at statement j that lie on the same lexical side (predecessor or successor) of j as i.
  lratio(i, j) = (nestlevel(i) + 1) / (nestlevel(j) + 1)    (3.2)
The scaling conserves the total cost of a communication operation for a given reference, ref , at the consumer, j, by assigning portions to each producer, i, proportional to the dynamic execution count of the given producer, i, divided by the dynamic execution counts of all producers. Note that the scaling factors are computed separately for producers which are lexical predecessors or successors of the consumer as shown in Equation (3.1). Also, to further differentiate between producers at different nesting levels, all scaling factors are also scaled by the ratio of the nesting levels as shown in Equation (3.2).
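A small Python sketch of this initial edge-weight assignment (equations (3.1) and (3.2)) is shown below; it is illustrative only, and the producers argument — a mapping from producer statements to whether they are lexical predecessors or successors of the consumer — is an assumed representation, not the compiler's data structure:

def lratio(nestlevel, i, j):
    # equation (3.2): ratio of nesting levels of producer i and consumer j
    return (nestlevel[i] + 1) / (nestlevel[j] + 1)

def initial_edge_weights(producers, dyncount, nestlevel, comm_cost, j):
    # equation (3.1): split comm(j, ref) over the producers of the reference,
    # proportionally to dynamic execution counts, computed separately for
    # lexical predecessors and successors of j
    weights = {}
    for side in ("pred", "succ"):
        side_nodes = [i for i, s in producers.items() if s == side]
        total = sum(dyncount[i] for i in side_nodes) or 1
        for i in side_nodes:
            weights[(i, j)] = (dyncount[i] / total) * lratio(nestlevel, i, j) * comm_cost
    return weights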
Once the individual edge costs have been scaled to conserve the total communication cost, they are propagated back toward the start of the program (through all edges to producers which are lexical predecessors) while still conserving the propagated cost as shown in Equation (3.3).

Propagating costs back:

  W(i, j, ref) += Σ_{k ∈ out(j)} ( dyncount(i) / Σ_{P ∈ in_pred(j, ∗)} dyncount(P) ) · W(j, k, ∗) · lratio(i, j)    (3.3)
In Figure 3.4, the communication graph is shown for ADI2D with some of the edges labeled with the unscaled comm cost expressions automatically generated by the static cost estimator (using a problem size of 512 × 512 and maxiter set to 100). For reference, the communication models for an Intel Paragon and a Thinking Machines CM-5, corresponding to the communication primitives used in the cost expressions, are shown in Table 3.1. Conditionals appearing in the cost expressions represent costs that will be incurred based on specific distribution decisions (e.g., P2 > 1 is true if the second mesh dimension is assigned more than one processor).
Once the communication graph has been constructed, a split point is determined by computing a maximal cut of the communication graph. The maximal cut removes the largest communication constraints from a given phase to potentially allow better individual distributions to be selected for the two resulting split phases. Since we also want to ensure that the cut divides the program at exactly one point so that only two subphases are generated for the recursion, only cuts between two successive statements will be considered. Since the ordering of the nodes is related to the linear ordering of statements in a program, this guarantees that the nodes on one side of the cut will always all precede or all follow the node most closely involved in the cut.
The following algorithm is used to determine which cut to use to split a given phase. For simplicity of this discussion, assume for now that there is at most only one edge between any two nodes. For multiple references to the same array, the edge weight can be considered to be the sum of all communication operations for that array. Also, to better describe the algorithm, view the communication graph G = (V, E) in the form of an adjacency matrix (with source vertices on rows and destination vertices on columns).
1. For each statement Si (i ∈ [1, |V| − 1]) compute the cut of the graph between statements Si and Si+1 by summing all the edges in the submatrices specified by [S1, Si] × [Si+1, S|V|] and [Si+1, S|V|] × [S1, Si].
2. While computing the cost of each cut also keep track of the current maximum cut.
3. If there is more than one cut with the same maximum value, choose from this set the cut that separates the statements at the highest nesting level.
(a) 100 ∗ (P2 > 1) ∗ Shift(510)    (b) 3100 ∗ Transfer(510)
Fig. 3.4. ADI2D communication graph with example initial edge costs (Statement numbers correspond to Figure 3.2)
Table 3.1. Communication primitives (time in µs for an m byte message)

                Intel Paragon                TMC CM-5
  Transfer(m)   23 + 0.12m  (m ≤ 16)         50 + 0.018m
                86 + 0.12m  (m > 16)
  Shift(m)      2 ∗ Transfer(m)
If there is more than one cut with the same highest nesting level, record the earliest and latest maximum cuts with that nesting level (forming a cut window).
4. Split the phase using the selected cut.
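A compact sketch of this running-total cut search (step 1 of the algorithm, without the nesting-level tie-breaking of steps 3 and 4) might look as follows in Python; the edge and statement encodings are assumptions made for the example:

def best_cut(num_stmts, edges):
    # edges: dict mapping (src, dst) statement pairs to communication cost;
    # cut(i) is the total weight of edges crossing the boundary between
    # S_i and S_(i+1).  Moving the boundary from i-1 to i only changes the
    # edges incident on statement i, so each edge is added and subtracted once.
    incident = {s: [] for s in range(1, num_stmts + 1)}
    for (u, v), w in edges.items():
        incident[u].append((v, w))
        incident[v].append((u, w))
    cut = sum(w for (u, v), w in edges.items() if (u == 1) != (v == 1))
    best_pos, best_val = 1, cut
    for i in range(2, num_stmts):               # boundary now after S_i
        for other, w in incident[i]:
            cut += w if other > i else -w       # S_i switched sides
        if cut > best_val:
            best_pos, best_val = i, cut
    return best_pos, best_val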
(a) Adjacency matrix    (b) Actual representation    (c) Efficient computation
Fig. 3.5. Example graph illustrating the computation of a cut

In Figure 3.5, the computation of the maximal cut on a smaller example graph with arbitrary weights is shown. The maximal cut is found to be between vertices 3 and 4 with a cost of 41. This is shown both in the form of the sum of the two adjacency submatrices in Figure 3.5(a), and graphically as a cut on the actual representation in Figure 3.5(b). In Figure 3.5(c), the cut is again illustrated using an adjacency matrix, but the computation is shown using a more efficient implementation which only adds and subtracts the differences between two successive cuts using a running cut total while searching for the maximum cut in sequence. This implementation also provides much better locality than the full submatrix summary when analyzing the actual sparse representation since the differences between two successive cuts can be easily obtained by traversing the incoming and outgoing edge lists (which correspond to columns and rows in the adjacency matrix respectively) of the node immediately preceding the cut. This takes O(|E|) time on the actual representation, only visiting each edge twice – once to add it and once to subtract it.
A new distribution is now selected for each of the resulting phases while inheriting any unspecified distributions (due to an array not appearing in a subphase) from the parent phase. This process is then continued recursively using the costs from the newly selected distributions corresponding to each subphase. As was shown in Figure 3.3, each level of the recursion is carried out in branch and bound fashion such that a phase is split only if the sum of the estimated execution times of the two resulting phases shows an improvement
Fig. 3.6. Partitioned communication graph for ADI2D (Statement numbers correspond to Figure 3.2.) over the original.2 In Figure 3.6, the partitioned communication graph is shown for ADI2D after the phase decomposition is completed. As mentioned in the cut algorithm, it is also possible to find several cuts which all have the same maximum value and nesting level forming a window over which the cut can be performed. This can occur since not all statements will necessarily generate communication resulting in either edges with zero cost or regions over which the propagated costs conserve edge flow, both of which will maintain a constant cut value. To handle cut windows, the phase should be split into two subphases such that the lower subphase uses the earliest cut point and the upper subphase uses the latest, resulting in overlapping phases. After new distributions are selected for each overlapping subphase, the total cost of executing the overlapped region in each subphase is examined. The overlap is then assigned to the subphase that resulted in the lowest execution time for this region. If they are equal, the overlapping region can be equivalently assigned to either subphase. Currently, this technique is not yet implemented for cut windows. We instead always select the earliest cut point in a window for the partitioning. To be able to bound the depth of the recursion without ignoring important phases and distributions, the static partitioner must also obey the following property. A partitioning technique is said to be monotonic if it selects the best available partition for a segment of code such that (aside from the cost of redistribution) the time to execute a code segment with a selected distribution is less than or equal to the time to execute the same segment 2
A further optimization can also be applied to bound the size of the smallest phase that can be split by requiring its estimated execution time to be greater than a “minimum cost” of redistribution.
with a distribution that is selected after another code segment is appended to the first. In practice, this condition is satisfied by the static partitioning algorithm that we are using. This can be attributed to the fact that conflicts between distribution preferences are not broken arbitrarily, but are resolved based on the costs imposed by the target architecture [17]. It is also interesting to note that if a cut occurs within a loop body, and loop distribution can be performed, the amount of redistribution can be greatly reduced by lifting it out of the distributed loop body and performing it in between the two sections of the loop. Also, if dependencies allow statements to be reordered, statements may be able to move across a cut boundary without affecting the cost of the cut while possibly reducing the amount of data to be redistributed. Both of these optimizations can be used to reduce the cost of redistribution but neither will be examined in this chapter. 3.4 Phase and Phase Transition Selection After the program has been recursively decomposed into a hierarchy of phases, a Phase Transition Graph (PTG) is constructed. Nodes in the PTG are phases resulting from the decomposition while edges represent possible redistribution between phases as shown in Figure 3.7(a). Since it is possible that using lower level phases may require transitioning through distributions found at higher levels (to keep the overall redistribution costs to a minimum), the phase transition graph is first sectioned across phases at the granularity of the lowest level of the phase decomposition.3 Redistribution costs are then estimated for each edge and are weighted by the dynamic execution count of the surrounding code. If a redistribution edge occurs within a loop structure, additional redistribution may be induced due to the control flow of the loop. To account for a potential “reverse” redistribution which can occur on the back edge of the iteration, the phase transition graph is also sectioned around such loops. The first iteration of a loop containing a phase transition is then peeled off and the phases of the first iteration of the body re-inserted in the phase transition graph as shown in Figure 3.7(b). Redistribution within the peeled iteration is only executed once while that within the remaining loop iterations is now executed (È − 1) times, where È is the number of iterations in the loop. The redistribution, which may occur between the first peeled iteration and the remaining iterations, is also multipled by (È − 1) in order to model when the back edge causes redistribution (i.e., when the last phase of the peeled iteration has a different distribution than the first phase of the remaining ones). Once costs have been assigned to all redistribution edges, the best sequence of phases and phase transitions is selected by computing the shortest 3
Sectioned phases that have identical distributions within the same horizontal section of the PTG are actually now redundant and can be removed, if desired, without affecting the quality of the final solution.
(a) Initial phase transition graph    (b) After performing loop peeling
Fig. 3.7. Phase transition graph for ADI2D
Table 3.2. Detected phases and estimated execution times (sec) for ADI2D (Performance estimates correspond to 32 processors.)

            Op. Phase(s)   Distribution       Intel Paragon   TMC CM-5
  Level 0   I-XII          ∗,BLOCK  1 × 32    22.151461       39.496276
  Level 1   I-VIII         ∗,BLOCK  1 × 32     1.403644        2.345815
            IX-XII         BLOCK,∗  32 × 1     0.602592        0.941550
  Level 2   I-III          BLOCK,∗  32 × 1     0.376036        0.587556
            IV-VIII        ∗,BLOCK  1 × 32     0.977952        1.528050

path on the phase transition graph. This is accomplished in O(V²) time (where V is now the number of vertices in the phase transition graph) using Dijkstra’s single source shortest path algorithm [40]. After the shortest path has been computed, the loop peeling performed on the PTG can be seen to have been necessary to obtain the best solution if the peeled iteration has a different transition sequence than the remaining iterations. Even if the peeled iteration does have different transitions, not actually performing loop peeling on the actual code will only incur at most one additional redistribution stage upon entry to the loop nest. This will not overly affect performance if the execution of the entire loop nest takes significantly longer than a single redistribution operation, which is usually the case especially if the redistribution considered within the loop was actually accepted when computing the shortest path.
Using the cost models for an Intel Paragon and a Thinking Machines CM-5, the distributions and estimated execution times reported by the static partitioner for the resulting phases (described as ranges of operational phases) are shown in Table 3.2. The performance parameters of the two machines are similar enough that the static partitioning actually selects the same distribution at each phase for each machine. The times estimated for the static partitioning are slightly higher than those actually observed, resulting from a conservative assumption regarding pipelines4 made by the static cost estimator [15], but they still exhibit similar enough performance trends to be used as estimates. For both machines, the cost of performing redistribution is low enough in comparison to the estimated performance gains that a dynamic distribution scheme is selected, as shown by the shaded area in Figure 3.7(b).
Pseudo-code for the dynamic partitioning algorithm is presented in Figure 3.8 to briefly summarize both the phase decomposition and the phase
4. Initially, a BLOCK, BLOCK distribution was selected by the static partitioner for (only) the first step of the phase decomposition. As the static performance estimation framework does not currently take into account any overlap between communication and computation for pipelined computations, we decided that this decision was due to the conservative performance estimate. For the analysis presented for ADI2D, we bypassed this problem by temporarily restricting the partitioner to only consider 1-D distributions.
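To make the shortest-path selection step concrete, the Python sketch below (illustrative only, not the PARADIGM implementation) runs Dijkstra's algorithm over a phase transition graph given as an adjacency map whose edge weights are the estimated, execution-count-weighted redistribution costs. It uses a binary heap rather than the O(V^2) array formulation cited above, and the names ptg, start, and stop are assumptions.

import heapq

def select_phase_sequence(ptg, start, stop):
    # ptg: dict mapping a phase node -> list of (successor, redistribution_cost).
    # Returns (sequence of phases from start to stop, total transition cost).
    best = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    done = set()
    while heap:
        cost, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == stop:
            break
        for succ, edge_cost in ptg.get(node, []):
            new_cost = cost + edge_cost
            if new_cost < best.get(succ, float("inf")):
                best[succ] = new_cost
                prev[succ] = node
                heapq.heappush(heap, (new_cost, succ))
    # Walk the predecessor links back from stop to recover the selected phases.
    path, node = [stop], stop
    while node != start:
        node = prev[node]
        path.append(node)
    return list(reversed(path)), best[stop]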
As distributions for a given phase are represented as a set of variables, each of which has an associated distribution, a masking union set operation is used to inherit unspecified distributions (dist ∪◦ dist_s). A given variable's distribution in the dist set will be replaced if it also has a distribution in the dist_s set, thus allowing any unspecified distributions in subphase s (dist_s) to be inherited from its parent (dist). These sets and the operations performed on them will be described in more detail in Section 4.3.

Since the use of inheritance during the phase decomposition process implicitly maintains the coupling between individual array distributions, redistribution at any stage will only affect the next stage. This can be contrasted with the technique proposed by Bixby, Kennedy, and Kremer [6], which first selects a number of partial candidate distributions for each phase specified by the operational definition. Since their phase boundaries are chosen in the absence of flow information, redistribution can affect stages at any distance from the current stage. This causes the redistribution costs to become binary functions depending on whether or not a specific path is taken, therefore necessitating 0-1 integer programming. In [27] they do agree, however, that 0-1 integer programming is not necessary when all phases specify complete distributions (as in our case). In their work, this occurs only as a special case in which they specify complete phases from the innermost to outermost levels of a loop nest. For this situation they show how the solution can be obtained using a hierarchy of single-source shortest path problems in a bottom-up fashion (as opposed to solving only one shortest path problem after performing a top-down phase decomposition as in our approach).

Up until now, we have not described how to handle control flow other than for loop constructs. More general flow (caused by conditionals or branch operations) can be viewed as separate paths of execution with different frequencies of execution. The same techniques that have been used for scheduling assembly-level instructions by selecting traces of interest [12] or forming larger blocks from sequences of basic blocks [23] in order to optimize the most frequently taken paths can also be applied to the phase transition problem. Once a single trace has been selected (using profiling or other criteria), its phases are obtained using the phase selection algorithm previously described but ignoring all code off the main trace. Once phases have been selected, all off-trace paths can be optimized separately by first setting their stop and start nodes to the distributions of the phases selected for the points at which they exit and re-enter the main trace. Each off-trace path can then be assigned phases by applying the phase selection algorithm to each path individually. Although this specific technique is not currently implemented in the compiler and will be addressed in future work, other researchers have also been considering it as a feasible solution for selecting phase transitions in the presence of general control flow [3].
Partition(program)
  cutlist ← ∅
  dist ← Static-Partitioning(program)
  phases ← Decompose-Phase(program, dist, cutlist)
  ptg ← Select-Redistribution(phases, cutlist)
  Assign distributions based on the shortest phase transition path recorded in ptg

Decompose-Phase(phase, dist, cutlist)
  Add phase to the list of recognized phases
  Construct the communication graph for the phase
  cut ← Max-Cut(phase)
  if VALUE(cut) = 0                          ⊲ no communication in phase
    then return
  Relocate cut to the highest nesting level of identical cuts
  phase1, phase2 ← split of phase at cut
  ⊲ Note: if cut is a window, phase1 and phase2 will overlap
  dist1 ← Static-Partitioning(phase1)
  dist2 ← Static-Partitioning(phase2)
  ⊲ Inherit any unspecified distributions from the parent (masking union)
  dist1 ← dist ∪◦ dist1
  dist2 ← dist ∪◦ dist2
  if (cost(phase1) + cost(phase2)) < cost(phase)
    then ⊲ if cut is a window, phase1 and phase2 overlap
         if LAST-STMTNUM(phase1) > FIRST-STMTNUM(phase2)
           then Resolve-Overlap(cut, phase1, phase2)
         List-Insert(cut, cutlist)
         phase→left ← Decompose-Phase(phase1, dist1, cutlist)
         phase→right ← Decompose-Phase(phase2, dist2, cutlist)
    else phase→left ← NULL
         phase→right ← NULL
  return (phase)
Select-Redistribution(phases, cutlist)
  if cutlist = ∅
    then return
  ptg ← Construct-PTG(phases, cutlist)
  Divide ptg horizontally at the lowest level of the recursion
  for each loop in phases
    do if loop contains a cut at its nesting level
         then Divide ptg at the loop boundaries
              Peel(loop, ptg)
  Estimate the interphase redistribution costs for ptg
  Compute the shortest phase transition path on ptg
  return (ptg)
Fig. 3.8. Pseudo-code for the partitioning algorithm
4. Data Redistribution Analysis

The intermediate form of a program within the framework, previously shown in Figure 1.1, specifies both the distribution of every array at every point in the program as well as the redistribution required to move from one point to the next. The different paths through the framework involve passes which process the available distribution information in order to obtain the missing information required to move from one representation to another. The core of the redistribution analysis portion of the framework is built upon two separate interprocedural data-flow problems which perform distribution synthesis and redistribution synthesis (which will be described later in Section 5). These two data-flow problems are both based upon the problem of determining both the inter- and intraprocedural reaching distributions for a program. Before giving further details of how these transformations are accomplished through the use of these two data-flow problems, we will first describe the idea of reaching distributions and the basic representations we use to perform this analysis.

4.1 Reaching Distributions and the Distribution Flow Graph

The problem of determining which distributions reach any given point, taking into account control flow in the program, is very similar to the computation of reaching definitions [1]. In classic compilation theory a control flow graph consists of nodes (basic blocks) representing uninterrupted sequences of statements and edges representing the flow of control between basic blocks. For determining reaching distributions, an additional restriction must be added to this definition. Not only should each block B be viewed as a sequence of statements with flow only entering at the beginning and leaving at the end, but the data distribution for the arrays defined or used within the block is also not allowed to change. In comparison to the original definition of a basic block, this imposes tighter restrictions on the extents of a block. Using this definition of a block in place of a basic block results in what we refer to as the distribution flow graph (DFG). This representation differs from [9] as redistribution operations now merely augment the definition of basic block boundaries as opposed to forming the nodes of the graph.

Since the definition of the DFG is based upon the CFG, the CFG can be easily transformed into a DFG by splitting basic blocks at points at which a distribution changes, as shown in Figure 4.1. This can be due to an explicit change in distribution, as specified by the automatic data partitioner, or by an actual HPF redistribution directive. If the change in distribution is due to a sequence of redistribution directives, the overall effect is assigned to the block in which they are contained; otherwise, a separate block is created whenever executable operations are interspersed between the directives.
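As a concrete picture of this splitting step, the short Python sketch below breaks one CFG basic block into DFG blocks at every statement that changes a distribution. The Statement/Block records and the redistributes field are illustrative assumptions rather than PARADIGM data structures, and the special handling of consecutive redistribution directives described above is omitted.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Statement:
    text: str
    redistributes: Dict[str, str] = field(default_factory=dict)  # array -> new distribution

@dataclass
class Block:
    stmts: List[Statement] = field(default_factory=list)

def split_into_dfg_blocks(basic_block: Block) -> List[Block]:
    # A new DFG block starts wherever the active distribution of some array
    # changes inside the original basic block.
    dfg_blocks, current = [], Block()
    for stmt in basic_block.stmts:
        if stmt.redistributes and current.stmts:
            dfg_blocks.append(current)   # close the block ending before the change
            current = Block()
        current.stmts.append(stmt)
    if current.stmts:
        dfg_blocks.append(current)
    return dfg_blocks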
Fig. 4.1. Splitting CFG nodes to obtain DFG nodes: (a) distribution split; (b) redistribution split.

4.2 Computing Reaching Distributions

Using this view of a block in a DFG and by viewing array distributions as definitions, the same data-flow framework used for computing reaching definitions [1] can now be used to obtain the reaching distributions by defining the following sets for each block B in a function:

• DIST(B)          - distributions present when executing block B
• REDIST(B)        - redistributions performed upon entering block B
• GEN(B)           - distributions generated by executing block B
• KILL(B)          - distributions killed by executing block B
• IN(B)            - distributions that exist upon entering block B
• OUT(B)           - distributions that exist upon leaving block B
• DEF(B), USE(B)   - variables defined or used in block B
It is important to note that GEN and KILL are specified as the distributions generated or killed by executing block B, as opposed to entering (redistribution at the head of the block) or exiting (redistribution at the tail of the block), in order to allow both forms of redistribution. GEN and KILL are initialized by DIST or REDIST (depending on the current application, as will be described in Section 5) and may be used to keep track of redistributions that occur on entry (e.g., HPF redistribution directives or functions with prescriptive distributions) or exit (e.g., calls to functions which internally change a distribution before returning). To perform interprocedural analysis, the function itself also has IN and OUT sets, which contain the distributions present upon entry and summarize the distributions for all possible exits.

Once the sets have been defined, the following data-flow equations are iteratively computed for each block until the solution OUT(B) converges for every block B (where PRED(B) are the nodes which immediately precede B in the flow of the program):

    IN(B)  = ∪_{P ∈ PRED(B)} OUT(P)                 (4.1)
    OUT(B) = GEN(B) ∪ (IN(B) − KILL(B))             (4.2)
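For illustration, the Python sketch below iterates Eqs. (4.1) and (4.2) to a fixed point; the (variable, distribution) pair encoding and the variable-level KILL are simplifications of the bit-vector sets described in Section 4.3, and all names are assumptions.

def reaching_distributions(blocks, preds, gen, kill):
    # blocks: iterable of block ids (reverse postorder speeds convergence).
    # preds[b]: predecessor block ids of b.
    # gen[b]: set of (variable, distribution) pairs generated by b.
    # kill[b]: set of variables whose incoming distributions are killed by b.
    IN = {b: set() for b in blocks}
    OUT = {b: set(gen[b]) for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            # Eq. (4.1): union of OUT over all predecessors.
            IN[b] = set().union(*(OUT[p] for p in preds[b])) if preds[b] else set()
            # Eq. (4.2): GEN(B) ∪ (IN(B) − KILL(B)), with KILL applied per variable.
            survivors = {(v, d) for (v, d) in IN[b] if v not in kill[b]}
            new_out = set(gen[b]) | survivors
            if new_out != OUT[b]:
                OUT[b] = new_out
                changed = True
    return IN, OUT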
Since the confluence operator is a union, both IN and OUT never decrease in size and the algorithm will eventually halt. By processing the blocks in the flow graph in a depth-first order, the number of iterations performed will roughly correspond to the level of the most deeply nested statement, which tends to be a fairly small number in real programs [1]. As can be seen from Eqs. (4.1) and (4.2), the DEF and USE sets are actually not used to compute reaching distributions, but they have other uses for optimizing redistribution which will be explained in more detail in Section 5.2.

4.3 Representing Distribution Sets

To represent a distribution set in a manner that provides efficient set operations, the bulk of the distribution information associated with a given variable is stored in its symbol table entry as a distribution table. Bit vectors are used within the sets themselves to specify distributions which are currently active for a given variable. Since a separate symbol table entry is created for each variable within a given scope, this provides a clean interface for accepting distributions from the HPF front end [21]. While the front end is processing HPF distribution or redistribution directives, any new distribution information present for a given variable is simply added to the corresponding distribution table for later analysis.
Fig. 4.2. Distribution set using bit vectors (each variable in the set carries a bit vector selecting active entries from its distribution table in the symbol table).

As shown in Figure 4.2, the actual distribution sets are maintained as linked lists, with a separate node representing each variable and a bit vector (corresponding to the entries in the distribution table for that variable) to indicate which distributions are currently active for the variable. To maintain efficiency while still retaining the simplicity of a list, the list is always kept in sorted order by the address of the variable's entry in the symbol table to facilitate operations between sets. This allows us to implement operations on two sets by merging them in only O(v) bit vector operations (where v is the number of variables in a given set).
Since these sets are now actually sets of variables, each of which contains a set representing its active distributions, SETvar will be used to specify the variables present in a given distribution set, SET. For example, complementing only the distribution bit vectors of the variables in SET (written with the complement applied to SETvar) indicates the inverse of the distributions for each variable contained within the set, as opposed to an inverse over the universe of all active variables (which would be indicated by complementing SET itself).

In addition to providing full union, intersection, and difference operations (∪, ∩, −), which operate on both levels of the set representation (between the variable symbols in the sets as well as between the bit vectors of identical symbols), masking versions of these operations (∪◦, ∩◦, −◦) are also provided which operate at only the symbol level. In the case of a masking union (a ∪◦ b), a union is performed at the symbol level such that any distributions for a variable appearing in set a will be replaced by the distributions in set b. This allows new distributions in b to be added to a set while replacing any existing distributions in a. Masking intersections (a ∩◦ b) and differences (a −◦ b) act somewhat differently in that the variables in set a are either selected or removed (respectively) based on their appearance in set b. These two operations are useful for implementing existence operations (e.g., selecting the elements of a whose variables appear in b, a ∩◦ b, or whose variables do not appear in b, a −◦ b).
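A minimal Python model of this representation is sketched below; the DistSet class and its method names are illustrative assumptions (PARADIGM keeps the per-variable nodes in a list sorted by symbol-table address rather than in a hash map), but the masking union shown matches the replacement behavior described above.

class DistSet:
    # Maps variable name -> integer bit vector over that variable's
    # distribution table (bit i set means distribution i is active).
    def __init__(self, bits=None):
        self.bits = dict(bits or {})

    def union(self, other):
        # Full union: merge variables and OR the bit vectors of shared variables.
        out = DistSet(self.bits)
        for v, b in other.bits.items():
            out.bits[v] = out.bits.get(v, 0) | b
        return out

    def masking_union(self, other):
        # Masking union (a ∪◦ b): b's distributions replace a's for any variable
        # present in b; variables only in a keep their distributions.
        out = DistSet(self.bits)
        out.bits.update(other.bits)
        return out

    def masking_difference(self, other):
        # Masking difference (a −◦ b): drop every variable that appears in b.
        return DistSet({v: b for v, b in self.bits.items() if v not in other.bits})

# Example: a subphase inherits its parent's distribution of array B
# but overrides the distribution of array A.
parent = DistSet({"A": 0b001, "B": 0b010})
sub = DistSet({"A": 0b100})
print(parent.masking_union(sub).bits)  # -> {'A': 4, 'B': 2} (A overridden, B inherited)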
5. Interprocedural Redistribution Analysis

Since the semantics of HPF require that all objects accessible to the caller after the call are distributed exactly as they were before the call [26], it is possible to first completely examine the context of a call before considering any distribution side effects due to the call. It may seem strange to speak of side effects when we have just said that the semantics of HPF preclude them. To clarify, such side effects are allowed to exist, but only to the extent that they are not apparent outside of the call. As long as the view specified by the programmer is maintained, the compiler is allowed to do whatever it can to optimize both the inter- and intraprocedural redistributions so long as the resulting distributions used at any given point in the program are not changed.

The data partitioner explicitly assigns different distributions to individual blocks of code, serving as an automated mechanism for converting sequential Fortran programs into efficient HPF programs. In this case, the framework is used to synthesize explicit redistribution operations in order to preserve the meaning of what the data partitioner intended in the presence of HPF semantics. In HPF, on the other hand, dynamic distributions are described by specifying the transitions between different distributions (through explicit redistribution directives or implicit redistribution at function boundaries). With the framework it is possible to convert an arbitrary HPF program into an optimized HPF program containing only explicit redistribution directives and descriptive [26] function arguments.
Fig. 5.1. Example call graph and depth-first traversal order: (a) top-down (pre-order) for calling context; (b) bottom-up (post-order) for side effects.

In Figure 5.1, an example call graph is shown to help illustrate the distribution and redistribution synthesis phases of the interprocedural analysis. If distribution information is not present (i.e., HPF input), distribution synthesis is first performed in a top-down manner over a program's call graph to compute which distributions are present at every point within a given function. By establishing the distributions that are present at each call site, clones are also generated for each unique set of input distributions obtained for the called functions. Redistribution synthesis is then applied in a bottom-up manner over the expanded call graph to analyze where the distributions are actually used and generate the redistribution required within a function.

Since this analysis is interested in the effects between an individual caller/callee pair, and not in summarizing the effects from all callers before examining a callee, it is not necessary to perform a topological traversal for the top-down and bottom-up passes over the call graph. In this case, it is actually more intuitive to perform a depth-first pre-order traversal of the call graph (shown in Figure 5.1(a)) to fully analyze a given function before proceeding to analyze any of the functions it calls, and to perform a depth-first post-order traversal (shown in Figure 5.1(b)) to fully analyze all called functions before analyzing the caller.

One other point to emphasize is that these interprocedural techniques can be much more efficient than analyzing a fully inlined version of the same program since it is possible to prune the traversal at the point a previous solution is found for a function in the same calling context. In Figure 5.1, asterisks indicate points at which a function is being examined after having already been examined previously. If the calling context is the same as the one used previously, the traversal can be pruned at this point, reusing information recorded from the previous context. Depending on how much reuse occurs, this factor can greatly reduce the amount of time the compiler spends analyzing a program in comparison to a fully inlined approach.
Referring back to Figure 1.1 once again, the technique for performing distribution synthesis will be described in Section 5.1, while redistribution synthesis will be described in Section 5.2. The static distribution assignment (SDA) technique will be described in Section 5.3, but as the HPF redistribution directive conversion is very straightforward, it will not be discussed further in this section. More detailed descriptions of these techniques and the implementation of the framework can be found in [30].

5.1 Distribution Synthesis

When analyzing HPF programs, it is necessary to first perform distribution synthesis in order to determine which distributions are present at every point in a program. Since HPF semantics specify that any redistribution (implicit or explicit) due to a function call is not visible to the caller, each function can be examined independently of the functions it calls. Only the input distributions for a given function and the explicit redistribution it performs have to be considered to obtain the reaching distributions within a function.

Given an HPF program, nodes (or blocks) in its DFG are delimited by the redistribution operations which appear in the form of HPF REDISTRIBUTE or REALIGN directives. As shown in Figure 5.2, the redistribution operations assigned to a block B represent the redistribution that will be performed when entering the block on any input path (indicated by the set REDIST(B)) as opposed to specifying the redistribution performed for each incoming path (REDIST(B, B1) or REDIST(B, B2) in the figure).
Fig. 5.2. Distribution and redistribution synthesis (blocks B1 and B2, with distributions DIST(B1) and DIST(B2), flow into block B; the redistribution on entry to B is represented as a single set REDIST(B) rather than per-edge sets REDIST(B, B1) and REDIST(B, B2)).
If the set GEN(B) is viewed as the distributions which are generated and KILL(B) as the distributions which are killed upon entering the block, this problem can now be cast directly into the reaching distribution data-flow framework by making the following assignments:

Data-flow initialization:
    REDIST(B) = from directives        DIST(B) = ∅
    GEN(B)    = REDIST(B)              KILL(B) = REDISTvar(B)
    OUT(B)    = REDIST(B)              IN(B)   = ∅

Data-flow solution:
    DIST(B) = OUT(B)

According to the HPF standard, a REALIGN operation only affects the array being realigned, while a REDISTRIBUTE operation should redistribute all arrays currently aligned to the given array being redistributed (in order to preserve any previously specified alignments). The current implementation only records redistribution information for the array immediately involved in a REDISTRIBUTE operation. This results in only redistributing the array involved in the directive and not all of the alignees of the target to which it is aligned. In the future, the implementation could easily be extended to support the full HPF interpretation of REDISTRIBUTE by simply recording the same redistribution information for all alignees of the target template of the array involved in the operation. Due to the properties of REALIGN, this will also require first determining which templates arrays are aligned to at every point in the program (i.e., reaching alignments) using similar techniques.

5.2 Redistribution Synthesis

After the distributions have been determined for each point in the program, the redistribution can be optimized. Instead of using either a simple copy-in/copy-out strategy or a complete redistribution of all arguments upon every entry and exit of a function, any implicit redistribution around function calls can be reduced to only that which is actually required to preserve HPF semantics. Redistribution operations (implicitly specified by HPF semantics or explicitly specified by a programmer) that result in distributions which would not otherwise be used before another redistribution operation occurs are completely removed in this pass.

Blocks are now delimited by changes in the distribution set. The set of reaching distributions previously computed for a block B represents the distributions which are in effect when executing that block (indicated by the set DIST(B) in Figure 5.2). For this reason, the DIST(B) sets are first restricted to only the variables defined or used within block B. Redistribution operations will now only be performed between two blocks if there is an intervening definition or use of a variable before the next change in distribution. Since we have also chosen to use a caller-redistributes model, the GEN(B) and KILL(B) sets are now viewed as the distributions which are generated or killed upon leaving block B.
Using these definitions, this problem can now be cast directly into the reaching distribution data-flow framework by making the following assignments:

Data-flow initialization:
    REDIST(B) = ∅                      DIST(B) = DIST(B) ∩◦ (DEF(B) ∪ USE(B))
    GEN(B)    = DIST(B)                KILL(B) = DISTvar(B)
    IN(B)     = ∅                      OUT(B)  = DIST(B)

Data-flow solution:
    REDIST(B) = DISTvar(B) | (INvar(B) − DISTvar(B)) ≠ ∅

that is, REDIST(B) contains the distributions needed in B for exactly those variables that reach B with some distribution other than the one needed. As will be seen later, using a caller-redistributes model exposes many interprocedural optimization opportunities and also cleanly supports function calls which may require redistribution on both their entry and exit. Since REDIST(B) is determined from both the DIST and IN sets, DIST(B) represents the distributions needed for executing block B, while the GEN and KILL sets will be used to represent the exit distribution (which may or may not match DIST).

By first restricting the DIST(B) sets to only those variables defined or used within block B, redistribution is generated only where it is actually needed – the locations at which a variable is actually used in a distribution different from the current one (demand-driven, or lazy, redistribution). Although it will not be examined here, it would also be possible to take a lazy redistribution solution and determine the earliest possible time that the redistribution could be performed (eager redistribution) in order to redistribute an array when a distribution is no longer in use. The area between the eager and lazy redistribution points forms a window over which the operation can be performed to obtain the same effect. As will be shown later, it would be advantageous to move multiple redistribution operations with overlapping windows to the same point in the program in order to aggregate the communication, thereby reducing the amount of communication overhead [30]. As the lazy redistribution point is found using a forward data-flow (reaching distributions) problem, it would be possible to find the eager redistribution point by performing some additional bookkeeping to record the last use of a variable as the reaching distributions are propagated along the flow graph; however, such a technique is not currently implemented in PARADIGM. In comparison to other approaches, interval analysis has also been used to determine eager/lazy points for code placement, but at the expense of a somewhat more complex formulation [43].

5.2.1 Optimizing Invariant Distributions. Besides performing redistribution only when necessary, it is also desirable to perform the necessary redistribution as infrequently as possible.
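As an illustration of the demand-driven rule above, the Python fragment below marks redistribution only for variables that are actually referenced in a block under a distribution different from one reaching the block; the pair-set encoding and names are assumptions, not the PARADIGM implementation.

def redistribution_needed(dist_b, in_b):
    # dist_b: set of (variable, distribution) pairs needed while executing B
    # (already restricted to variables defined or used in B).
    # in_b: set of (variable, distribution) pairs reaching B.
    # Returns the pairs that must be established by redistribution on entry.
    needed, reaching = {}, {}
    for var, d in dist_b:
        needed.setdefault(var, set()).add(d)
    for var, d in in_b:
        reaching.setdefault(var, set()).add(d)
    redist = set()
    for var, dists in needed.items():
        # Redistribute var only if some reaching distribution differs from
        # the distributions required in B.
        if reaching.get(var, set()) - dists:
            redist |= {(var, d) for d in dists}
    return redist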
Semantically loop-invariant distribution regions5 can first be grown before synthesizing the redistribution operations. All distributions that do not change within a nested statement (e.g., a loop or if structure) are recorded on the parent statement (or header) for that structure. This has the effect of moving redistribution operations which result in the invariant distribution out of nested structures as far as possible (as was also possible in [18]). As a side effect, loops which are considered to contain an invariant distribution no longer propagate previous distributions for the invariant arrays. Since redistribution is moved out of the loop, this means that for the (extremely rare) special case of a loop-invariant distribution (which was not originally present outside of the loop) contained within an undetectable zero-trip loop, only the invariant distribution from within the loop body is propagated even though the loop nest was never executed. As this is only due to the way invariant distributions are handled, the data-flow handles non-invariant distributions as expected for zero-trip loops (an extra redistribution check may be generated after the loop execution).

5.2.2 Multiple Active Distributions. Even though it is not specifically stated as such in the HPF standard, we will consider an HPF program in which every use or definition of an array has only one active distribution to be well-behaved. Since PARADIGM cannot currently compile programs which contain references with multiple active distributions, this property is currently detected by examining the reaching distribution sets for every node (limited by DEF/USE) within a function. A warning stating that the program is not well-behaved is issued if any set contains multiple distributions for a given use or definition of a variable.

In the presence of function calls, it is also possible to access an array through two or more paths when parameter aliasing is present. If there is an attempt to redistribute only one of the aliased symbols, the different aliases now have different distributions even though they actually refer to the same array. This form of multiple active distributions is actually considered to be non-conforming in HPF [26] as it can result in consistency problems if the same array were allowed to occupy two different distributions. As it may be difficult for programmers to make this determination, this can be automatically detected by determining if the reaching distribution set contains different distributions for any aliased arrays.6
5. Invariant redistribution within a loop can technically become non-invariant when return distributions from a function call within a loop nest are allowed to temporarily exist in the caller's scope. Such regions can still be treated as invariant since this is the view HPF semantics provide to the programmer.
6. The param alias pass in Parafrase-2 [34], which PARADIGM is built upon, is first run to compute the alias sets for every function call.
5.3 Static Distribution Assignment (SDA)

To utilize the available memory on a given parallel machine as efficiently as possible, only the distributions that are active at any given point in the program should actually be allocated space. It is interesting to note that as long as a given array is distributed among the same total number of processors, the actual space required to store one section of the partitioned array is the same no matter how many array dimensions are distributed.7 By using this observation, it is possible to statically allocate the minimum amount of memory by associating all possible distributions of a given array to the same area of memory.

Static Distribution Assignment (SDA) (inspired indirectly by the Static Single Assignment (SSA) form [10]) is a process we have developed in which the names of array variables are duplicated and renamed statically based on the active distributions represented in the corresponding DIST sets. As names are generated, they are assigned a static distribution corresponding to the currently active dynamic distribution for the original array. The new names will not change distribution during the course of the program. Redistribution now takes the form of an assignment between two different statically distributed source and destination arrays (as opposed to rearranging the data within a single array). To statically achieve the minimum amount of memory allocation required, all of the renamed duplicates of a given array are declared to be "equivalent." The EQUIVALENCE statement in Fortran 77 allows this to be performed at the source level in a manner somewhat similar to assigning two array pointers to the same allocated memory as is possible in C or Fortran 90. Redistribution directives are also now replaced with actual calls to a redistribution library.

Because the different static names for an array share the same memory, this implies that the communication operations used to implement the redistribution should read all of the source data before writing to the target. In the worst case, an entire copy of a partitioned array would have to be buffered at the destination processor before it is actually received and moved into the destination array. However, as soon as more than two different distributions are present for a given array, the EQUIVALENCE begins to pay off, even in the worst case, in comparison to separately allocating each different distribution. If the performance of buffered communication is insufficient for a given machine (due to the extra buffer copy), non-buffered communication could be used instead, thereby precluding the use of EQUIVALENCE (unless some form of explicit buffering is performed by the redistribution library itself).
7. Taking into account distributions in which the number of processors allocated to a given array dimension does not evenly divide the size of the dimension, or degenerate distributions in which memory is not evenly distributed over all processors, it can also be equivalently said that there is an amount of memory which can store all possible distributions with very little excess.
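The storage observation behind SDA can be checked with a few lines of Python (illustrative only; the even-divisibility caveat is the one discussed in footnote 7).

import math

def local_elements(n1, n2, p1, p2):
    # Per-processor storage (in elements) for an n1 x n2 array distributed
    # BLOCK-wise over a p1 x p2 processor mesh (assumes even divisibility).
    return math.ceil(n1 / p1) * math.ceil(n2 / p2)

# Every distribution of a 512 x 512 array over all 32 processors needs the
# same local storage, so all of its SDA clones can share one memory area:
for p1, p2 in [(32, 1), (1, 32), (8, 4), (4, 8)]:
    assert local_elements(512, 512, p1, p2) == 512 * 512 // 32  # 8192 elements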
      REAL A(N, N)
!HPF$ DISTRIBUTE (CYCLIC, *) :: A
      ...
      A(i, j) = ...
      ...
!HPF$ REDISTRIBUTE (BLOCK, BLOCK) :: A
      ...
      ... = A(i, j)
      ...

(a) Before SDA

      REAL A$0(N, N), A$1(N, N)
!HPF$ DISTRIBUTE (CYCLIC, *) :: A$0
!HPF$ DISTRIBUTE (BLOCK, BLOCK) :: A$1
      EQUIVALENCE (A$0, A$1)
      INTEGER A$cid
      A$cid = 0
      ...
      A$0(i, j) = ...
      ...
      CALL reconfig(A$1, 1, A$cid)
      ...
      ... = A$1(i, j)
      ...

(b) After SDA
Fig. 5.3. Example of static distribution assignment.

In Figure 5.3, a small example is shown to illustrate this technique. In this example, a redistribution operation on A causes it to be referenced using two different distributions. A separate name is statically generated for each distribution of A, and the redistribution directive is replaced with a call to a run-time redistribution library [36]. The array accesses in the program can now be compiled by PARADIGM using techniques developed for programs which only contain static distributions [4] by simply ignoring the communication side effects of the redistribution call.

As stated previously, if more than one distribution is active for any given array reference, the program is considered to be not well-behaved, and the array involved cannot be directly assigned a static distribution. In certain circumstances, however, it may be possible to perform code transformations to make an HPF program well-behaved. For instance, consider a loop whose body has multiple active distributions on entry only because a distribution created by redistribution within the loop reaches the body along the back edge but is not present on the loop entry; such a loop is not well-behaved. If the first iteration of that loop were peeled off, the entire loop body would have a single active distribution for each variable, and the initial redistribution into this state would be performed outside of the loop. This and other code transformations which help reduce the number of distributions reaching any given node will be the focus of further work in this area.
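A simplified model of the renaming step is sketched below in Python; the reference-list encoding and the recorded redistribution calls are illustrative assumptions (PARADIGM performs the renaming on its internal representation and emits Fortran with EQUIVALENCEd clones as in Figure 5.3).

def apply_sda(references):
    # references: ordered list of (array, distribution) pairs, one per
    # reference, where each reference has exactly one active distribution
    # (the program is assumed to be well-behaved).
    # Returns (renamed references, redistribution calls to insert).
    clone_ids = {}        # (array, distribution) -> clone index
    active = {}           # array -> currently active clone index
    renamed, redist_calls = [], []
    for idx, (array, dist) in enumerate(references):
        cid = clone_ids.setdefault(
            (array, dist), len([k for k in clone_ids if k[0] == array]))
        if active.get(array) is not None and active[array] != cid:
            # The distribution changes here: turn the implicit change into an
            # explicit copy between the equivalenced clones.
            redist_calls.append((idx, f"{array}${active[array]}", f"{array}${cid}"))
        active[array] = cid
        renamed.append(f"{array}${cid}")
    return renamed, redist_calls

# Example mirroring Figure 5.3: A is written under (CYCLIC,*) and later
# read under (BLOCK,BLOCK).
refs = [("A", "(CYCLIC,*)"), ("A", "(BLOCK,BLOCK)")]
print(apply_sda(refs))  # -> (['A$0', 'A$1'], [(1, 'A$0', 'A$1')])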
6. Results

To evaluate the quality of the data distributions selected using the techniques presented in this chapter, as implemented in the PARADIGM compiler, we analyze three programs which exhibit different access patterns over the course of their execution. These programs are individual Fortran 77 subroutines which range in size from roughly 60 to 150 lines of code:
• Synthetic HPF redistribution example
• 2-D Alternating Direction Implicit iterative method (ADI2D) [14]
• Shallow water weather prediction benchmark [37]

6.1 Synthetic HPF Redistribution Example

In Figure 6.1(a), a synthetic HPF program is presented which performs a number of different tests (described in the comments appearing in the code) of the optimizations performed by the framework. In this program, one array, x, is redistributed both explicitly using HPF directives and implicitly through function calls using several different interfaces. Two of the functions, func1 and func2, have prescriptive interfaces [26] which may or may not require redistribution (depending on the current configuration of the input array). The first function differs from the second in that it also redistributes the array such that it returns with a different distribution than the one with which it was called. The last function, func3, differs from the first two in that it has an (implicit) transcriptive interface [26]. Calls to this function will cause it to inherit the current distribution of the actual parameters.

Several things can be noted when examining the optimized HPF shown in Figure 6.1(b).8 First of all, the necessary redistribution operations required to perform the implicit redistribution at the function call boundaries have been made explicit in the program. Here, the interprocedural analysis has completely removed any redundant redistribution by relaxing the HPF semantics, allowing distributions caused by function side effects to exist so long as they do not affect the original meaning of the program. For the transcriptive function, func3, the framework also generated two separate clones, func3$0 and func3$1, corresponding to two different active distributions appearing in a total of three different calling contexts.

Two warnings were also generated by the compiler, inserted by hand as comments in Figure 6.1(b), indicating that there were (semantically) multiple reaching distributions for two references of x in the program. The first reference actually does have two reaching distributions due to a conditional with redistribution performed on only one path. The second, however, occurs after a call to a prescriptive function, func1, which implicitly redistributes the array to conform to its interface. Even though the redistribution for this function will accept either of the two input distributions and generate only a single distribution of x for the function, the following reference of x semantically still has two reaching distributions – hence the second warning.

The optimization of loop-invariant redistribution operations can also be seen in the first loop nest of this example, in which a redistribution operation on x is performed at the deepest level of a nested loop.
8. The HPF output, generated by PARADIGM, has been slightly simplified by removing unnecessary alignment directives from the figure to improve its clarity.
Fig. 6.1. Synthetic example for interprocedural redistribution optimization: (a) before optimization; (b) after optimization.
If there are no references of x before the occurrence of the redistribution (and no further redistribution performed in the remainder of the loop), then x will always have a (BLOCK, CYCLIC) distribution within the loop body. This situation is detected by the framework and the redistribution operation is re-synthesized to occur outside of the entire loop nest. It could be argued that even when it appeared within the loop, the underlying redistribution library could be written to be smart enough to only perform the redistribution when it is necessary (i.e., only on the first iteration), so that we have not really optimized away 10² redistribution operations. Even in this case, this optimization has still completely eliminated (10² − 1) check operations that would have been performed at run time to determine if the redistribution was required. As there are several other optimizations performed on this example which we will not describe in more detail here, the reader is directed to the comments in the code for further information.

6.2 2-D Alternating Direction Implicit (ADI2D) Iterative Method

In order to evaluate the effectiveness of dynamic distributions, the ADI2D program, with a problem size of 512 × 512,9 is compiled with a fully static distribution (one iteration shown in Figure 6.2(a)) as well as with the selected dynamic distribution10 (one iteration shown in Figure 6.2(b)). These two parallel versions of the code were run on an Intel Paragon and a Thinking Machines CM-5 to examine their performance on different architectures.
Fig. 6.2. Modes of parallel execution for ADI2D: (a) static (pipelined); (b) dynamic (redistribution).

The static scheme illustrated in Figure 6.2(a) performs a shift operation to initially obtain some required data and then satisfies two recurrences in the program using software pipelining [19, 33].
9. To prevent poor serial performance from cache-line aliasing due to the power-of-two problem size, the arrays were also padded with an extra element at the end of each column. This optimization, although performed by hand here, is automated by even aggressive serial optimizing compilers.
10. In the current implementation, loop peeling is not performed on the generated code. As previously mentioned in Section 3.4, the single additional startup redistribution due to not peeling will not be significant in comparison to the execution of the loop (containing a dynamic count of 600 redistributions).
Since values are being propagated through the array during the pipelined computation, processors must wait for results to be computed before continuing with their own part of the computation. Depending on the ratio of communication to computation performance for a given machine, the amount of computation performed before communicating to the next processor in the pipeline will have a direct effect on the overall performance of a pipelined computation [20, 33].

A small experiment is first performed to determine the best pipeline granularity for the static partitioning. A granularity of one (fine-grain) causes values to be communicated to waiting processors as soon as they are produced. By increasing the granularity, more values are computed before communicating, thereby amortizing the cost of establishing communication in exchange for some reduction in parallelism. In addition to the experimental data, compile-time estimates of the pipeline execution [33] are shown in Figure 6.3. For the two machines, it can be seen that by selecting the appropriate granularity, the performance of the static partitioning can be improved. Both a fine-grain and the optimal coarse-grain static partitioning will be compared with the dynamic partitioning.

The redistribution present in the dynamic scheme appears as three transposes11 performed at two points within an outer loop (the exact points in the program can be seen in Figure 3.7). Since the sets of transposes occur at the same point in the program, the data to be communicated for each transpose can be aggregated into a single message during the actual transpose. As it has been previously observed that aggregating communication improves performance by reducing the overhead of communication [20, 33], we will also examine aggregating the individual transpose operations here.
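To see why an intermediate granularity wins, the sketch below evaluates a generic first-order coarse-grain pipeline model in Python; this is not the chapter's estimator from [33], and the parameters (per-element compute time, message startup cost, per-element transfer time) and the cost expression are illustrative assumptions.

def pipeline_time(P, n, strip, t_comp, t_start, t_elem):
    # Generic first-order coarse-grain pipeline model (illustrative only).
    # P: processors in the pipeline; n: problem dimension; strip: strip size;
    # t_comp: compute time per element; t_start: message startup cost;
    # t_elem: transfer time per element.
    n_strips = (n + strip - 1) // strip
    per_stage = strip * (n / P) * t_comp + t_start + strip * t_elem
    return (P + n_strips - 1) * per_stage

def best_strip(P, n, t_comp, t_start, t_elem,
               candidates=(1, 2, 4, 8, 16, 32, 64)):
    # Pick the candidate strip size with the smallest estimated pipeline time.
    return min(candidates,
               key=lambda s: pipeline_time(P, n, s, t_comp, t_start, t_elem))

When the startup cost dominates the per-element costs, as on the Paragon and CM-5, the minimum moves away from a strip size of one, which is the trend visible in Figure 6.3.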
Fig. 6.3. Coarse-grain pipelining for ADI2D: speedup vs. strip size for 4, 8, and 16 processors on (a) an Intel Paragon and (b) a TMC CM-5.
11. This could have been reduced to two transposes at each point if we allowed the cuts to reorder statements and perform loop distribution on the innermost loops (between statements 17, 18 and 41, 42), as mentioned in Section 3, but these optimizations are not examined here.
In Figure 6.4, the performance of both dynamic and static partitionings for ADI2D is shown for an Intel Paragon and a Thinking Machines CM-5. For the dynamic partitioning, both aggregated and non-aggregated transpose operations were compared. For both machines, it is apparent that aggregating the transpose communication is very effective, especially as the program is executed on larger numbers of processors. This can be attributed to the fact that the start-up cost of communication (which can be several orders of magnitude greater than the per-byte transmission cost) is being amortized over multiple messages with the same source and destination. For the static partitioning, the performance of the fine-grain pipeline was compared to a coarse-grain pipeline using the optimal granularity. The coarse-grain optimization yielded the greatest benefit on the CM-5 while still improving the performance (to a lesser degree) on the Paragon.

For the Paragon, the dynamic partitioning with aggregation clearly improved performance (by over 70% compared to the fine-grain and 60% compared to the coarse-grain static distribution). On the CM-5, the dynamic partitioning with aggregation showed performance gains of over a factor of two compared to the fine-grain static partitioning, but only outperformed the coarse-grain version for extremely large numbers of processors. For this reason, it would appear that the limiting factor on the CM-5 is the performance of the communication. As previously mentioned in Section 3.4, the static partitioner currently makes a very conservative estimate for the execution cost of pipelined loops [15]. For this reason a dynamic partitioning was selected for both the Paragon as well as the CM-5. If a more accurate pipelined cost model [33] were used, a static partitioning would have been selected instead for the CM-5. For the Paragon, the cost of redistribution is still low enough that a dynamic partitioning would still be selected for large machine configurations.

It is also interesting to estimate the cost of performing a single transpose in either direction (P×1 ↔ 1×P) from the communication overhead present in the dynamic runs. Ignoring any performance gains from cache effects, the communication overhead can be computed by subtracting the ideal run time (serial time divided by the selected number of processors) from the measured run time.
Fig. 6.4. Performance of ADI2D: speedup vs. number of processors on (a) an Intel Paragon and (b) a TMC CM-5 for the ideal, aggregated dynamic, dynamic, coarse-grain static, and fine-grain static versions.
Given that three arrays are transposed 200 times, the resulting overhead divided by 600 yields a rough estimate of how much time is required to redistribute a single array, as shown in Table 6.1. From Table 6.1 it can be seen that as more processors are involved in the operation, the time taken to perform one transpose levels off until a certain number of processors is reached. After this point, the amount of data being handled by each individual processor is small enough that the startup overhead of the communication has become the controlling factor. Aggregating the redistribution operations minimizes this effect, thereby achieving higher levels of performance than would be possible otherwise.

Table 6.1. Empirically estimated time (ms) to transpose a 1-D partitioned matrix (512 × 512 elements; double precision)

                 Intel Paragon             TMC CM-5
  processors   individual  aggregated   individual  aggregated
           8         36.7        32.0        138.9       134.7
          16         15.7        15.6         86.8        80.5
          32         14.8        10.5         49.6        45.8
          64         12.7         6.2         40.4        29.7
         128         21.6         8.7         47.5        27.4

6.3 Shallow Water Weather Prediction Benchmark

Since not all programs will necessarily need dynamic distributions, we also examine another program which exhibits several different smaller phases of computation. The Shallow water benchmark is a weather prediction program using finite difference models of the shallow water equations [37] written by Paul Swarztrauber from the National Center for Atmospheric Research. As the program consists of a number of different functions, the program is first inlined since the approach for selecting data distributions is not yet fully interprocedural. Also, a loop which is implicitly formed by a GOTO statement is replaced with an explicit loop since the current performance estimation framework does not handle unstructured code. The final input program, ignoring comments and declarations, resulted in 143 lines of executable code.

In Figure 6.5, the phase transition graph is shown with the selected path using costs based on a 32-processor Intel Paragon with the original problem size of 257 × 257 limited to 100 iterations. The decomposition resulting in this graph was purposely bounded only by the productivity of the cut, and not by a minimum cost of redistribution, in order to expose all potentially beneficial phases. This graph shows that by using the decomposition technique presented in Figure 3.8, Shallow contains six phases (the length of the path between the start and stop node) with a maximum of four (sometimes redundant) candidates for any given phase.
Compiler Optimization of Dynamic Data Distributions for DMMs start 0
1 0 1 8x4 | 18 5448.3 5448.3
19 | 24
0 4 8x4 1260.82 6709.12
START LEVEL 0
8x4
1
LEVEL 1
32x1
LEVEL 2
25 0 8 8x4 | 53 672880
LEVEL 3
8x4 18
32x1
8x4
679589
54 | 65
53
32x1
7156.17
25 1 9 32 x 1 | 53 667931
12960.8
19 | 24
14242
25 2 10 32 x 1 | 53 667931
1 2 3 8x4 | 18 5448.3 23589.6
2 6 32 x 1 1281.15
19 | 24
3 7 8x4 1260.82
17784.9
25 3 11 32 x 1 | 53 667931
2.45104e+06
685716
24
32x1
8x4
679589
1 5 32 x 1 1281.15
0
1 2 32 x 1 5875.02
5875.02
19 | 24
0
0
1 | 18
479
0 12 8x4 40041.1
54 | 65
1 13 8x4 40041.1
54 | 65
3 15 8x4 40041.1
54 | 65
2 14 32 x 1 40310.4
8x4 65
719630
719630
726026
726026
32x1 79
8x4
66 | 79
0 16 8x4 746790
66 1 17 8x4 | 79 727044
1.44667e+06
80 | 143
0 20 8x4 559230
80 | 143
66 2 18 32 x 1 | 79 707876
1.44667e+06
1 21 8x4 514802
66 | 79
3 19 32 x 1 707876
1.44667e+06
80 | 143
2 22 8x4 514802
first | last
level ID configuration performance
143
1.96148e+06 STOP
Phase decomposition for Shallow
stop
Fig. 6.5. Phase transition graph and solution for Shallow (Displayed with the selected phase transition path and cummulative cost.) As there were no alignment conflicts in the program, and only BLOCK distributions were necessary to maintain a balanced load, the distribution of the 14 arrays in the program can be inferred from the selected configuration of the processor mesh. By tracing the path back from the stop node to the start, the figure shows that the dynamic partitioner selected a two-dimensional (8 × 4) static data distribution. Since there is no redistribution performed along this path, the loop peeling process previously described in Section 3.4 is not shown on this graph as it is only necessary when there is actually redistribution present within a loop. As the communication and computation estimates are best-case approximations (they don’t take into account communication buffering operations or effects of the memory hierarchy), it is safe to say that for the Paragon, a dynamic data distribution does not exist which can out-perform the selected static distribution. Theoretically, if the communication cost for a machine were insignificant in comparison to the performance of computation, redistributing data between the phases revealed in the decomposition of Shallow would be beneficial. In this case, the dynamic partitioner performed more work to come to the same conclusion that a single application of the static partitioner would have. It is interesting to note that even though the dynamic partitioner considers any possible redistribution, it will still select a static distribution if that is what is predicted to have the best performance.
Fig. 6.6. Performance of Shallow: speedup vs. number of processors for the ideal, 2-D BLOCK, and 1-D BLOCK distributions on (a) an Intel Paragon and (b) a TMC CM-5.

In Figure 6.6, the performance of the selected static 2-D distribution (BLOCK, BLOCK) is compared to a static 1-D row-wise distribution (BLOCK, ∗) which appeared in some of the subphases. The 2-D distribution matches the performance of the 1-D distribution for small numbers of processors while outperforming the 1-D distribution for both the Paragon and the CM-5 (by up to a factor of 1.6 or 2.7, respectively) for larger numbers of processors. Kennedy and Kremer [24] have also examined the Shallow benchmark, but predicted that a one-dimensional (column-wise) block distribution was the best distribution for up to 32 processors of an Intel iPSC/860 hypercube (while also showing that the performance of a 1-D column-wise distribution was almost identical to a 1-D row-wise distribution). They only considered one-dimensional candidate layouts for the operational phases (since the Fortran D prototype compiler cannot compile multidimensional distributions [41]). As their operational definition already results in 28 phases (over four times as many in comparison to our approach), the complexity of the resulting 0-1 integer programming formulation will only increase further when considering multidimensional layouts.

7. Conclusions

Dynamic data partitionings can provide higher performance than purely static distributions for programs containing competing data access patterns. The distribution selection technique presented in this chapter provides a means of automatically determining high quality data distributions (dynamic as well as static) in an efficient manner, taking into account both the structure of the input program as well as the architectural parameters of the target machine. Heuristics, based on the observation that high communication costs are a result of not being able to statically align every reference in complex programs simultaneously, are used to form the communication graph. By removing constraints between competing sections of the program,
better distributions can potentially be obtained for the individual sections. If the resulting gains in performance are high enough in comparison to the cost of redistribution, dynamic distributions are formed. Communication still occurs, but data movement is now isolated into dynamic reorganizations of ownership as opposed to constantly obtaining any required remote data based on a (compromise) static assignment of ownership. A key requirement in automating this process is to be able to obtain estimates of communication and computation costs which accurately model the behavior of the program under a given distribution. Furthermore, by building upon existing static partitioning techniques, the phases examined as well as the redistribution considered are focused on the areas of a program which will otherwise generate large amounts of communication.

In this chapter we have also presented an interprocedural data-flow technique that can be used to convert between redistribution and distribution representations, optimizing redistribution while maintaining the semantics of the original program. For the data partitioner, the framework is used to synthesize explicit redistribution operations in order to preserve the meaning of what the data partitioner intended in the presence of HPF semantics. For HPF programs, redistribution operations (implicitly specified by HPF semantics or explicitly specified by a programmer) that result in distributions which would not otherwise be used before another redistribution operation occurs are completely removed. Many HPF compilers that are currently available as commercial products, as well as those that have been developed as research prototypes, do not yet support transcriptive argument passing or the REDISTRIBUTE and REALIGN directives, as there is still much work required to provide efficient support for the HPF subset (which does not include these features). Since the techniques presented in this chapter can convert all of these features into constructs which are in the HPF subset (through the use of SDA), this framework can also be used to provide these features to an existing subset HPF compiler.

Acknowledgement. We would like to thank Amber Roy Chowdhury for his assistance with the coding of the serial ADI algorithm, John Chandy and Amber Roy Chowdhury for discussions on algorithm complexity, as well as Christy Palermo for her suggestions and comments on this chapter. The communication graph figures used in this chapter were also generated using a software package known as "Dot" developed by Eleftherios Koutsofios and Steven North with the Software and Systems Research Center, AT&T Bell Laboratories [28].
References 1. A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley Publ., Reading, MA, 1986.
482
Daniel J. Palermo, Eugene W. Hodges IV, and Prithviraj Banerjee
2. J. M. Anderson and M. S. Lam. Global optimizations for parallelism and locality on scalable parallel machines. In Proc. of the ACM SIGPLAN ’93 Conf. on Programming Language Design and Implementation, 112–125, Albuquerque, NM, June 1993. 3. E. Ayguad´e, J. Garcia, M. Girones, M. L. Grande, and J. Labarta. Data redistribution in an automatic data distribution tool. In Proc. of the 8th Workshop on Languages and Compilers for Parallel Computing, volume 1033 of Lecture Notes in Computer Science, 407–421, Columbus, OH, Aug. 1995. Springer-Verlag. 1996. 4. P. Banerjee, J. A. Chandy, M. Gupta, E. W. Hodges IV, J. G. Holm, A. Lain, D. J. Palermo, S. Ramaswamy, and E. Su. The PARADIGM compiler for distributed-memory multicomputers. IEEE Computer, 28(10):37–47, Oct. 1995. 5. D. Bau, I. Koduklula, V. Kotlyar, K. Pingali, and P. Stodghill. Solving alignment using elementary linear algebra. In Proc. of the 7th Workshop on Languages and Compilers for Parallel Computing, volume 892 of Lecture Notes in Computer Science, 46–60, Ithica, NY, 1994. Springer-Verlag. 1995. 6. R. Bixby, K. Kennedy, and U. Kremer. Automatic data layout using 0-1 integer programming. In Proc. of the 1994 Int’l Conf. on Parallel Architectures and Compilation Techniques, 111–122, Montr´eal, Canada, Aug. 1994. 7. B. Chapman, T. Fahringer, and H. Zima. Automatic support for data distribution on distributed memory multiprocessor systems. In Proc. of the 6th Workshop on Languages and Compilers for Parallel Computing, volume 768 of Lecture Notes in Computer Science, 184–199, Portland, OR, Aug. 1993. Springer-Verlag. 1994. 8. S. Chatterjee, J. R. Gilbert, R. Schreiber, and S. H. Teng. Automatic array alignment in data-parallel programs. In Proc. of the 20th ACM SIGPLAN Symp. on Principles of Programming Languages, 16–28, Charleston, SC, Jan. 1993. 9. F. Coelho and C. Ancourt. Optimal compilation of HPF remappings (extended ´ abstract). Tech. Report CRI A-277, Centre de Recherche en Informatique, Ecole des mines de Paris, Fontainebleau, France, Nov. 1995. 10. R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Trans. on Programming Languages and Systems, 13(4):451–490, Oct. 1991. 11. T. Fahringer. Automatic Performance Prediction for Parallel Programs on Massively Parallel Computers. Ph.D. thesis, Univ. of Vienna, Austria, Sept. 1993. TR93-3. 12. J. A. Fisher. Trace scheduling: A technique for global microcode compaction. IEEE Trans. on Computers, c-30:478–490, July 1981. 13. J. Garcia, E. Ayguad´e, and J. Labarta. A novel approach towards automatic data distribution. In Proc. of the Workshop on Automatic Data Layout and Performance Prediction, Houston, TX, Apr. 1995. 14. G. Golub and J. M. Ortega. Scientific Computing: An Introduction with Parallel Computing. Academic Press, San Diego, CA, 1993. 15. M. Gupta. Automatic Data Partitioning on Distributed Memory Multicomputers. Ph.D. thesis, Dept. of Computer Science, Univ. of Illinois, Urbana, IL, Sept. 1992. CRHC-92-19/UILU-ENG-92-2237. 16. M. Gupta and P. Banerjee. Compile-time estimation of communication costs on multicomputers. In Proc. of the 6th Int’l Parallel Processing Symp., 470–475, Beverly Hills, CA, Mar. 1992.
Compiler Optimization of Dynamic Data Distributions for DMMs
483
17. M. Gupta and P. Banerjee. PARADIGM: A compiler for automated data partitioning on multicomputers. In Proc. of the 7th ACM Int’l Conf. on Supercomputing, Tokyo, Japan, July 1993. 18. M. W. Hall, S. Hiranandani, K. Kennedy, and C. Tseng. Interprocedural compilation of Fortran D for MIMD distributed-memory machines. In Proc. of Supercomputing ’92, 522–534, Minneapolis, MN, Nov. 1992. 19. S. Hiranandani, K. Kennedy, and C. Tseng. Compiling Fortran D for MIMD distributed memory machines. Communications of the ACM, 35(8):66–80, Aug. 1992. 20. S. Hiranandani, K. Kennedy, and C.-W. Tseng. Evaluation of compiler optimizations for Fortran D on MIMD distributed-memory machines. In Proc. of the 6th ACM Int’l Conf. on Supercomputing, 1–14, Washington D.C., July 1992. 21. E. W. Hodges IV. High Performance Fortran support for the PARADIGM compiler. Master’s thesis, Dept. of Electrical and Computer Eng., Univ. of Illinois, Urbana, IL, Oct. 1995. CRHC-95-23/UILU-ENG-95-2237. 22. D. E. Hudak and S. G. Abraham. Compiling Parallel Loops for High Performance Computers – Partitioning, Data Assignment and Remapping. Kluwer Academic Pub., Boston, MA, 1993. 23. W. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann, R. G. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, and D. M. Lavery. The Superblock: An effective technique for VLIW and superscalar compilation. The Journal of Supercomputing, 7(1):229–248, Jan. 1993. 24. K. Kennedy and U. Kremer. Automatic data layout for High Performance Fortran. In Proc. of Supercomputing ’95, San Diego, CA, Dec. 1995. 25. K. Knobe, J. Lukas, and G. Steele, Jr. Data optimization: Allocation of arrays to reduce communication on SIMD machines. Journal of Parallel and Distributed Computing, 8(2):102–118, Feb. 1990. 26. C. Koelbel, D. Loveman, R. Schreiber, G. Steele, Jr., and M. Zosel. The High Performance Fortran Handbook. The MIT Press, Cambridge, MA, 1994. 27. U. Kremer. Automatic Data Layout for High Performance Fortran. Ph.D. thesis, Rice Univ., Houston, TX, Oct. 1995. CRPC-TR95559-S. 28. B. Krishnamurthy, editor. Practical Reusable UNIX Software. John Wiley and Sons Inc., New York, NY, 1995. 29. J. Li and M. Chen. The data alignment phase in compiling programs for distributed-memory machines. Journal of Parallel and Distributed Computing, 13(2):213–221, Oct. 1991. 30. D. J. Palermo. Compiler Techniques for Optimizing Communication and Data Distribution for Distributed-Memory Multicomputers. Ph.D. thesis, Dept. of Electrical and Computer Eng., Univ. of Illinois, Urbana, IL, June 1996. CRHC96-09/UILU-ENG-96-2215. 31. D. J. Palermo, E. W. Hodges IV, and P. Banerjee. Dynamic data partitioning for distributed-memory multicomputers. Journal of Parallel and Distributed Computing, 38(2):158–175, Nov. 1996. special issue on Compilation Techniques for Distributed Memory Systems. 32. D. J. Palermo, E. W. Hodges IV, and P. Banerjee. Interprocedural array redistribution data-flow analysis. In Proc. of the 9th Workshop on Languages and Compilers for Parallel Computing, San Jose, CA, Aug. 1996. 33. D. J. Palermo, E. Su, J. A. Chandy, and P. Banerjee. Compiler optimizations for distributed memory multicomputers used in the PARADIGM compiler. In Proc. of the 23rd Int’l Conf. on Parallel Processing, II:1–10, St. Charles, IL, Aug. 1994. 34. C. D. Polychronopoulos, M. Girkar, M. R. Haghighat, C. L. Lee, B. Leung, and D. Schouten. Parafrase-2: An environment for parallelizing, partitioning,
484
35. 36.
37. 38.
39. 40. 41. 42. 43. 44.
Daniel J. Palermo, Eugene W. Hodges IV, and Prithviraj Banerjee synchronizing and scheduling programs on multiprocessors. In Proc. of the 18th Int’l Conf. on Parallel Processing, II:39–48, St. Charles, IL, Aug. 1989. J. Ramanujam and P. Sadayappan. Compile-time techniques for data distribution in distributed memory machines. IEEE Trans. on Parallel and Distributed Systems, 2(4):472–481, Oct. 1991. S. Ramaswamy and P. Banerjee. Automatic generation of efficient array redistribution routines for distributed memory multicomputers. In Frontiers ’95: The 5th Symp. on the Frontiers of Massively Parallel Computation, 342–349, McLean, VA, Feb. 1995. R. Sadourny. The dynamics of finite-difference models of the shallow-water equations. Journal of the Atmospheric Sciences, 32(4), Apr. 1975. T. J. Sheffler, R. Schreiber, J. R. Gilbert, and W. Pugh. Efficient distribution analysis via graph contraction. In Proc. of the 8th Workshop on Languages and Compilers for Parallel Computing, volume 1033 of Lecture Notes in Computer Science, 377–391, Columbus, OH, Aug. 1995. Springer-Verlag. 1996. H. Sivaraman and C. S. Raghavendra. Compiling for MIMD distributed memory machines. Tech. Report EECS-94-021, School of Electrical Enginnering and Computer Science, Washington State Univ., Pullman, WA, 1994. R. E. Tarjan. Data Structures and Network Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1983. C. W. Tseng. An Optimizing Fortran D Compiler for MIMD DistributedMemory Machines. Ph.D. thesis, Rice Univ., Houston, TX, Jan. 1993. COMP TR93-199. P. S. Tseng. Compiling programs for a linear systolic array. In Proc. of the ACM SIGPLAN ’90 Conf. on Programming Language Design and Implementation, 311–321, White Plains, NY, June 1990. R. von Hanxleden and K. Kennedy. Give-N-Take – A balanced code placement framework. In Proc. of the ACM SIGPLAN ’94 Conf. on Programming Language Design and Implementation, 107–120, Orlando, FL, June 1994. S. Wholey. Automatic data mapping for distributed-memory parallel computers. In Proc. of the 6th ACM Int’l Conf. on Supercomputing, 25–34, Washington D.C., July 1992.
Chapter 14. A Framework for Global Communication Analysis and Optimizations Manish Gupta ×BM
T. J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598
1. Introduction Distributed memory architectures have become popular as a viable and costeffective method of building scalable parallel computers. However, the absence of global address space, and consequently, the need for explicit message passing among processes makes these machines very difficult to program. This has motivated the design of languages like High Performance Fortran [14], which allow the programmer to write sequential or shared-memory parallel programs that are annotated with directives specifying data decomposition. The compilers for these languages are responsible for partitioning the computation, and generating the communication necessary to fetch values of nonlocal data referenced by a processor. A number of such prototype compilers have been developed [3, 6, 19, 23, 29, 30, 33, 34, 43]. Accessing remote data is usually orders of magnitude slower than accessing local data. This gap is growing because CPU performance is out-growing network performance, CPU’s are running relatively independent multiprogrammed operating systems, and commodity networks are being found more cost-effective. As a result, communication startup overheads tend to be astronomical on most distributed memory machines, although reasonable bandwidth can be supported for sufficiently large messages [36, 37]. Thus compilers must reduce the number as well as the volume of messages in order to deliver high performance. The most common optimizations include message vectorization [23,43], using collective communication [18,30], and overlapping communication with computation [23]. However, many compilers perform little global analysis of the communication requirements across different loop nests. This precludes general optimizations, such as redundant communication elimination, or carrying out extra communication inside one loop nest if it subsumes communication required in the next loop nest. This chapter presents a framework, based on global array data-flow analysis, to reduce communication in a program. We apply techniques for partial redundancy elimination, discussed in the context of eliminating redundant computation by Morel and Renvoise [31], and later refined by other researchers [12, 13, 25]. The conventional approach to data-flow analysis regards each access to an array element as an access to the entire array. Previous researchers [16,17,35] have applied data-flow analysis to array sections to improve its precision. However, using just array sections is insufficient in the S. Pande, D.P. Agrawal (Eds.): Compiler Optimizations for Scalable PS, LNCS 1808, pp. 485-524, 2001. Springer-Verlag Berlin Heidelberg 2001
486
Manish Gupta
context of communication optimizations. There is a need to represent information about the processors where the array elements are available, or need to be made available. For this purpose, we introduce a new kind of descriptor, the Available Section Descriptor (ASD) [21]. The explicit representation of availability of data in our framework allows us to relax the restriction that only the owner of a data item be able to supply its value when needed by another processor. An important special case occurs when a processor that needs a value for its computation does not own the data but has a valid value available from prior communication. In that case, the communication from the owner to this processor can be identified as redundant and eliminated, with the intended receiver simply using the locally available value of data. We show how the data flow procedure for eliminating partial redundancies is extended and applied to communication, represented using the ASDs. With the resultant framework, we are able to capture a number of optimizations, such as: vectorizing communication, eliminating communication that is redundant in any control flow path, reducing the amount of data being communicated, reducing the number of processors to which data must be communicated, and – moving communication earlier to hide latency, and to subsume previous communication. – – – –
We do not know of any other system that tries to perform all of these optimizations, and in a global manner. Following the results presented in [13] for partially redundant computations, we show that the bidirectional problem of eliminating partial redundancies can be decomposed into simpler unidirectional problems, in the context of communication represented using ASDs as well. That makes the analysis procedure more efficient. We have implemented a simplified version of this framework as part of a prototype HPF compiler. Our preliminary experiments show significant performance improvements resulting from this analysis. An advantage of our approach is that the analysis is performed on the original program form, before any communication is introduced by the compiler. Thus, communication optimizations based on data availability analysis need not depend on a detailed knowledge of explicit communication representations. While our work has been done in the context of compiling for distributed memory machines, it is relevant for shared memory machines as well. Shared memory compilers can exploit information about interprocessor sharing of data to eliminate unnecessary barrier synchronization or replace barrier synchronization by cheaper producer-consumer synchronization in the generated parallel code [20, 32, 39]. A reduction in the number of communication messages directly translates into fewer synchronization messages. Another application of our work is in improving the effectiveness of block data transfer
A Framework for Global Communication Analysis and Optimizations
487
operations in scalable shared memory multiprocessors. For scalable shared memory machines with high network latency, it is important for the underlying system to reduce the overhead of messages needed to keep data coherent. Using block data transfer operations on these machines helps amortize the overhead of messages needed for non-local data access. The analysis presented in this paper can be used to accurately identify sections of data which are not available locally, and which should be subjected to block data transfer operations for better performance. The rest of this chapter is organized as follows: Section 2 describes, using an example, the various communication optimizations that are performed by the data flow procedure described in this chapter. Section 3 describes our representation of the Available Section Descriptor and the procedure to compute the communication generated for a statement in the program. Section 4 discusses the procedure for computing data flow information used in optimizing communications. In Section 5, we describe how the different communication optimizations are captured by the data flow analysis. Section 6 describes an extension to our framework to select a placement of communication that can reduce communication costs of a program even further. Section 7 presents the algorithms for the various operations on ASDs. Section 8 presents some preliminary results and in Section 9, we discuss related work. Finally, Section 10 presents conclusions.
2. Motivating Example We now illustrate various communication optimizations mentioned above using an example. Figure 2.1(a) shows an HPF program and a high level view of communication that would be generated by a compiler following the ownercomputes rule [23, 43], which assigns each computation to the processor that owns the data being computed. Communication is generated for each nonlocal data value used by a processor in the computation assigned to it. The HPF directives specify the alignment of each array with respect to a template VPROCS, which is viewed as a grid of virtual processors in the context of this work. The variables a and z are two-dimensional arrays aligned with VPROCS, and d, e, and w are one-dimensional arrays aligned with the first column of VPROCS. In this example, we assume that the scalar variable s is replicated on all processors. The communication shown in Figure 2.1(a) already incorporates message vectorization, a commonly used optimization to move communication out of loops. While message vectorization is captured naturally by our framework as we shall explain, in this chapter we focus on other important optimizations that illustrate the power of this framework. Our analysis is independent of the actual primitives (such as send-receive, broadcast) used to implement communication. We use the notation x(i) → VPROCS(i, j) (the ranges of i and j are omitted to save space in the figure) to mean that the value of x(i) is sent to the virtual processor position
488
Manish Gupta
VPROCS(i, j), for all 1 ≤ i ≤ 100, 1 ≤ j ≤ 100. Reduced communication after global optimization is shown in Figure 2.1(b). We consider optimizations performed for each variable d, e, and w. There are two identical communications for e in Figure 2.1(a), which result from the uses of e in statements 10 and 26. In both cases, e(i) must be sent to VPROCS(i, j), for all values of i, j. However, because of the assignment to e(1) in statement 13, the second communication is only partially redundant. Thus, we can eliminate the second communication, except for sending e(1) to VPROCS(1, j), for all values of j. This reduced communication is hoisted to the earliest possible place after statement 13 in Figure 2.1(b). In Figure 2.1(a), there are two communications for d, resulting from uses of d in statements 16 and 26. In Figure 2.1(b), the second communication has been hoisted to the earliest possible place, after statement 7, where it subsumes the first communication, which has been eliminated. Finally, there are two communications for w, resulting from uses of w in statements 16 and 28. The second, partially redundant, communication is hoisted inside the two branches of the if statement, and is eliminated in the then branch. The assignment to w(i) at statement 21 prevents the communication in the else branch from being moved earlier. The result of this collection of optimizations leads to a program in which communication is initiated as early as possible, and the total volume of communication has been reduced.
3. Available Section Descriptor In this section, we describe the Available Section Descriptor (ASD), a representation of data and its availability on processors. When referring to availability of data, we do not explicitly include information about the ownership of data, unless stated otherwise. This enables us to keep a close correspondence between the notions of availability of data and communication: in the context of our work, the essential view of communication is that it makes data available at processors which need it for their computation; the identity of the sender may be changed as long as the receiver gets the correct data. Hence, the ASD serves both as a representation of communication (by specifying the data to be made available at processors), and of the data actually made available by previous communications. Data remains available at receiving processors until it is modified by its owner or until the communication buffer holding non-local data is deallocated. Our analysis for determining the availability of data is based on the assumption that a communication buffer is never deallocated before the last read reference to that data. This can be ensured by the code generator after the analysis has identified read references which lead to communication that is redundant. Section 3.1 describes the
A Framework for Global Communication Analysis and Optimizations
489
HPF align (i, j) with VPROCS(i,j) :: a, z HPF align (i) with VPROCS(i,1) :: d, e, w 1: 5: 6: 7: 8: 9: 10: 11: 12: 13:
do i = 1, 100 e(i) = d(i) * w(i) d(i) = d(i) + 2 * w(i) end do e(i) → V PROCS(i, j) do i = 1, 100 do j = 1, 100 z(i,j) = e(i) end do end do e(1) = 2 * d(1)
14: if (s = 0) then d(i), w(i) → V PROCS(i, 100) 15: do i = 1, 100 16: z(i,100) = d(i) / w(i) 17: end do 18: else 19: do i = 1, 100 20: z(i,100) = m 21: w(i) = m 22: end do 23: end if e(i), d(i) → V PROCS(i, j) w(i) → V PROCS(i, 100) 24: do j = 1, 100 25: do i = 1, 100 26: a(i,j) = a(i,j) + (d(i) ∗ e(i))/z(i,j) 27: end do 28: z(j,100) = w(j) 29: end do (a)
do i = 1, 100 e(i) = d(i) * w(i) d(i) = d(i) + 2 * w(i) end do e(i), d(i) → V PROCS(i, j) do i = 1, 100 do j = 1, 100 z(i,j) = e(i) end do end do e(1) = 2 * d(1) e(1) → V PROCS(1, j) if (s = 0) then w(i) → V PROCS(i, 100) do i = 1, 100 z(i,100) = d(i) / w(i) end do else do i = 1,100 z(i, 100) = m w(i) = m end do w(i) → V PROCS(i, 100) end if
do j = 1, 100 do i = 1, 100 a(i,j) = a(i,j) + (d(i) * e(i))/z(i,j) end do z(j,100) = w(j) end do (b)
Fig. 2.1. Program before and after communication optimizations.
490
Manish Gupta
ASD representation. Section 3.2 describes how the communication generated at a statement is computed in terms of an ASD representation. 3.1 Representation of ASD The ASD is defined as a pair D, M , where D is a data descriptor, and M is a descriptor of the function mapping elements in D to virtual processors. Thus, M (D) refers collectively to processors where data in D is available. For an array variable, the data descriptor represents an array section. For a scalar variable, it consists of just the name of the variable. Many representations like the regular section descriptor (RSD) [9], and the data access descriptor (DAD) [5] have been proposed in the literature to summarize array sections. Our analysis is largely independent of the actual descriptor used to represent data. For the purpose of analysis in this work, we shall use the bounded regular section descriptor (BRSD) [22], a special version of the RSD, to represent array sections and treat scalar variables as degenerate cases of arrays with no dimensions. Bounded regular sections allow representation of subarrays that can be specified using the Fortran 90 triplet notation. We represent a bounded regular section as an expression A(S), where A is the name of an array variable, S is a vector of subscript values such that each of its elements is either (i) an expression of the form α ∗ k + β, where k is a loop index variable and α and β are invariants, (ii) a triple l : u : s, where l, u, and s are invariants (the triple represents the expression discussed above expanded over a range) , or (iii) ⊥, indicating no knowledge of the subscript value. The processor space is regarded as an unbounded grid of virtual processors. The abstract processor space is similar to a template in High Performance Fortran (HPF) [14], which is a grid over which different arrays are aligned. The mapping function descriptor M is a pair P, F , both P and F being vectors of length equal to the dimensionality of the processor grid. The ith element of P (denoted as P Ø ) indicates the dimension of the array A that is mapped to the ith grid dimension, and F Ø is the mapping function for that array dimension, i.e., F Ø (j) returns the position(s) along the ith grid dimension to which the jth element of the array dimension is mapped. We represent a mapping function, when known statically, as F Ø (j) =
(c ∗ j + l : c ∗ j + u : s)
In the above expression, c, l, u and s are invariants. The parameters c, l and u may take rational values, as long as F Ø (j) evaluates to a range over integers, over the data domain. The above formulation allows representation of one-toone mappings (when l = u), one-to-many mappings (when u ≥ l+s), and also constant mappings (when c = 0). The one-to-many mappings expressible with this formulation are more general than the replicated mappings for ownership that may be specified using HPF [14]. Under an HPF alignment directive,
A Framework for Global Communication Analysis and Optimizations
491
the jth element of array dimension P Ù may be mapped along the ith grid dimension to position c ∗ j + o or “∗”, which represents all positions in that grid dimension. If an array has fewer dimensions than the processor grid (this also holds for scalars, which are viewed as arrays with no dimensions), there is no array dimension mapped to some of the grid dimensions. For each such grid dimension m, P m takes the value µ, which represents a “missing” array dimension. In that case, F m is no longer a function of a subscript position. It is simply an expression of the form l : u : s, and indicates the position(s) in the mth grid dimension at which the array is available. As with the usual triplet notation, we shall omit the stride, s, from an expression when it is equal to one. When the compiler is unable to infer knowledge about the availability of data, the corresponding mapping function is set to ⊥. We also define a special, universal mapping function descriptor U, which represents the mapping of each data element on all of the processors. Example. Consider a 2-D virtual processor grid VPROCS, and an ASD A(2 : 100 : 2, 1 : 100), [1, 2], [F 1, F 2 ], where F 1 (i) = i − 1, F 2 (j) = 1 : 100. The ASD represents an array section A(2 : 100 : 2, 1 : 100), each of whose element A(2 ∗ i, j) is available at a hundred processor positions given by VPROCS(2∗i−1, 1 : 100). This ASD is illustrated in Figure 3.1. Figure 3.1(a) shows the array A, where each horizontal stripe Ai represents A(2∗i, 1 : 100). Figure 3.1(b) represents the mapping of the array section onto the virtual processor template VPROCS, where each subsection Ai is replicated along its corresponding row.
A1
A1 A1
...
A2
A2 A2
...
A3 A3 A3
...
A1 A2 A3
Array A
Processor Grid (b)
(a)
Fig. 3.1. Illustration of ASD
492
Manish Gupta
3.2 Computing Generated Communication Given an assignment statement of the form lhs = rhs, we describe how communication needed for each reference in the rhs expression is represented as an ASD. This section deals only with communication needed for a single instance of the assignment statement, which may appear nested inside loops. The procedure for summarizing communication requirements of multiple instances of a statement with respect to the surrounding loops is discussed in the next section. We shall describe our procedure for array references with arbitrary number of dimensions; references to scalar variables can be viewed as special cases with zero dimensions. This analysis is applied after any array language constructs (such as Fortran 90 array assignment statements) have been scalarized into equivalent Fortran 77-like assignment statements appearing inside loops. Each subscript expression is assumed to be a constant or an affine function of a surrounding loop index. If any subscript expressions are non-linear or coupled, then the ASD representing that communication is set to ⊥, and is conservatively underestimated or overestimated, based on the context. As remarked earlier, the identity of senders is ignored in our representation of communication. The ASD simply represents the intended availability of data to be realized via the given communication, or equivalently, the availability of data following that communication. Clearly, that depends on the mapping of computation to processors. In this work, we determine the generation of communication based on the owner computes rule, which assigns the computation to the owner of the lhs. The algorithm can be modified to incorporate other methods of assigning computation to processors [11], as long as that decision is made statically. Let DL be the data descriptor for the lhs reference and ML = PL , FL be the mapping function descriptor representing the ownership of the lhs variable (this case represents an exception where the mapping function of ASD corresponds to ownership of data rather than its availability at other processors). ML is directly obtained from the HPF alignment information which specifies both the mapping relationship between array dimensions and grid dimensions (giving PL ) and the mapping of array elements to grid positions (giving FL ), as described earlier. We calculate the mapping of the rhs variable DR , MR that results from enforcing the owner computes rule. The new ASD, denoted CGEN , represents the rhs data aligned with lhs. The regular section descriptor DR represents the element referenced by rhs. The mapping descriptor MR = PR , FR is obtained by the following procedure: Step 1. Align array dimensions with processor grid dimensions: Pi
1. For each processor grid dimension i, if the lhs subscript expression, SLL , in dimension PLi has the form α1 ∗ k + β1 and there is a rhs subscript n = α2 ∗ k + β2 , for the same loop index variable k, set PRi expression SR to the rhs subscript position n.
A Framework for Global Communication Analysis and Optimizations
493
2. For each remaining processor grid dimension i, set PÛÚ to j, where j is an unassigned rhs subscript position. If there is no unassigned rhs subscript position left, set PÛÚ to µ. Step 2. Calculate the mapping function for each grid dimension: For each processor grid dimension i, let FLÚ (j) = c ∗ j + o be the ownership mapping function of the lhs variable (c and o are integer constants, with the exception of replicated mapping, where c = 0 and o represents the range of all positions in that grid dimension). We determine the rhs mapping function FRi (j) from the lhs and the rhs subscript expressions corresponding respectively to dimensions PRi and PLi . The details are specified in Table 1. The first entry in Table 1 follows from the fact that element j = α2 ∗k+β2 Pi of SRR is made available at grid position c ∗ (α1 ∗ k + β1 ) + o along the ith dimension; substituting k by (j − β2 )/α2 directly leads to the given result. The second and the third entries correspond to the special cases when the rhs dimension has a constant subscript or there is no rhs dimension mapped to grid dimension i. The last entry represents the case when there is no lhs array dimension mapped to grid dimension i. In that case, the mapping function of the lhs variable must have c = 0. Pi
Pi
SL L
SRR
α1 ∗ k + β1 α1 ∗ k + β1 α1 ∗ k + β1 “missing”
α2 ∗ k + β2 , α2 = 0 β2 “missing” α2 ∗ k + β2
FRi (j) α1 ∗(j−β2 ) ( + α2
c∗ β1 ) + o c ∗ (α1 ∗ k + β1 ) + o c ∗ (α1 ∗ k + β1 ) + o o (c must be 0)
Table 3.1. Mapping function calculation based on the owner computes rule
Example. Consider the assignment statement in the code fragment: HPF ALIGN A(i, j) WITH VPROCS(i, j) ··· A(i, j) = . . . B(2 ∗ i, j − 1) . . . The ownership mapping descriptor ML for the lhs variable A is [1, 2], FL where FL1 (i) = i and FL2 (j) = j. This mapping descriptor is derived from the HPF alignment specification. Applying Step 1 of the compute rule algorithm, PR is set to [1, 2], that is, the first dimension of VPROCS is aligned with the first dimension of B, and the second dimension of VPROCS is aligned with the second dimension of B. The second step is to determine the mapping function FR . For the first grid dimension, PL1 corresponds to the subscript expression i and PR1 corresponds to the subscript expression 2 ∗ i. Therefore, using FL1 and the first rule in
494
Manish Gupta
Table 1, FÜ1 (i) is set to (1*(1*i - 0)/2) + 0) + 0 = i/2. For the second grid dimension, PL2 corresponds to the subscript expression j, and PR2 corresponds to the subscript expression j −1. Using FL2 and the first rule in Table 1, FR2 (j) is set to j + 1. The mapping descriptor thus obtained maps B(2*i, j-1) onto VPROCS(i, j).
4. Data Flow Analysis In this section, we present a procedure for obtaining data flow information regarding communication for a structured program. The analysis is performed on the control flow graph representation [1] of the program, in which nodes represent computation, and edges represent the flow of control. We are able to perform a collection of communication optimizations within a single framework, based on the following observations. Determining the data availability resulting from communication is a problem similar to determining available expressions in classical data flow analysis. Thus, optimizations like eliminating and hoisting communications are similar to eliminating redundant expressions and code motion. Furthermore, applying partial redundancy elimination techniques at the granularity of sections of arrays and processors enables not merely elimination, but also reduction in the volume of communication along different control flow paths. The bidirectional data-flow analysis for suppression of partial redundancies, introduced by Morel and Renvoise [31], and refined subsequently [12, 13, 25], defines a framework for unifying common optimizations on available expressions. We adapt this framework to solve the set of communication optimizations described in Section 2. This section presents the following results. – Section 4.1 reformulates the refined data-flow equations from [13] in terms of ASDs. We have incorporated a further modification that is useful in the context of optimizing communication. – Section 4.2 shows that the bidirectional problem of determining the possible placement of communication can be solved by obtaining a solution to a backward problem, followed by a forward correction. – In contrast to previous work, solving these equations for ASDs requires array data-flow analysis. In Section 4.3, we present the overall data-flow procedure that uses interval analysis. As with other similar frameworks, we require the following edge-splitting transformation to be performed on the control flow graph before the analysis begins: any edge that runs directly from a node with more than one successor, to a node with more than one predecessor, is split [13]. This transformation is illustrated in Figure 4.1. Thus, in the transformed graph, there is no direct edge from a branch node to a join node.
A Framework for Global Communication Analysis and Optimizations
495
A A
new node B B
Fig. 4.1. Edge splitting transformation 4.1 Data Flow Variables and Equations We use the following definitions for data-flow variables representing information about communication at different nodes in the control flow graph. Each of these variables is represented as an ASD. AN T LOCÝ : communication in node i, that is not preceded by a definition in node i of any data being communicated (i.e., local communication that may be anticipated at entry to node i). CGENÝ : communication in node i, that is not followed by a definition in node i of any data being communicated. KILLÝ : data being killed (on all processors) due to a definition in node i. AV INÝ /AV OU TÝ : availability of data at the entry/exit of node i. P P INÝ /P P OU TÝ : variables representing safe placement of communication at the entry/exit of node i, with some additional properties (described later in this section). IN SERTÝ : communication that should be inserted at the exit of node i. REDU N DÝ : communication in node i that is redundant. Local Data Flow Variables For an assignment statement, both AN T LOC and CGEN are set to the communication required to send each variable referenced on the right hand side (rhs) to the processor executing the statement. That depends on the compute-rule used by the compiler in translating the source program into SPMD form. Consider the program segment from the example in Section 3.2: HPF ALIGN A(i, j) WITH VPROCS(i, j) ··· A(i, j) = . . . B(2 ∗ i, j − 1) . . . Using the procedure in Section 3.2, we compute the communication necessary to send B(2*i, j-1) to the processor executing the statement as: CGEN = AN T LOC = B(2 ∗ i, j − 1), [1, 2], [F1 , F2 ], where F1 (i) = i/2,
496
Manish Gupta
and F2 (j) = j + 1. The KILL variable for the statement is set to A(i, j), U, signifying that A(i, j) is killed on all processors. The procedure for determining CGEN , AN T LOC, and KILL for nodes corresponding to program intervals shall be discussed in Section 4.3. Global Data Flow Variables The data flow equations, as adapted from [13], are shown below.1 AV IN is defined as ∅ for the entry node, while P P OU T is defined as ∅ for the exit node, and initialized to ⊤ for all other nodes. AV OU TÞ
=
AV INÞ
=
[AV INÞ − KILLÞ ] ∪ CGENÞ AV OU Tp
(4.2)
[(P P OU Ti − KILLi ) ∪ AN T LOCi ] ∩ [ (AV OU Tp ∪ P P OU Tp )]
(4.3)
(4.1)
p∈pred(i)
P P INi
=
p∈pred(i)
P P OU Ti
=
P P INs
(4.4)
s∈succ(i)
IN SERTi
=
[P P OU Ti − AV OU Ti ] − [P P INi − KILLi ] (4.5)
REDU N Di
=
P P INi ∩ AN T LOCi
(4.6)
The problem of determining the availability of data (AV INi /AV OU Ti ) is similar to the classical data-flow problem of determining available expressions [1]. This computation proceeds in the forward direction through the control flow graph. The first equation ensures that any data overwritten inside node i is removed from the availability set, and data communicated during node i (and not overwritten later) is added to the availability set. The second equation indicates that at entry to a join node in the control flow graph, only the data available at exit on each of the predecessor nodes can be considered to be available. We now consider the computation of P P IN/P P OU T . The term [(P P OU Ti − KILLi ) ∪ AN T LOCi ] in Equation 4.3 denotes the part of communication occurring in node i or hoisted into it that can legally be moved to the entry of node i. A further intersection of that term with [∩p∈pred(i) (AV OU Tp ∪ P P OU Tp )] gives an additional property to P P INi , 1
The original equation in [13] for P P INi has an additional term, corresponding to the right hand side being further intersected with P AV INi , the partial availability of data at entry to node i. This term is important in the context of eliminating partially redundant computation, because it prevents unnecessary code motion that increases register pressure. However, moving communication early can be useful even if it does not lead to a reduction in previous communication, because it may help hide the latency. Hence, we drop that term in our equation for P P INi .
A Framework for Global Communication Analysis and Optimizations
497
namely that all data included in P P INß must be available at entry to node i on every incoming path due to original or moved communication. P P OU Tß is set to communication that can be placed at entry to each of the successor nodes to i, as shown by Equation 4.4. Thus, P P OU Tß represents communication that can legally and safely appear at the exit of node i. The property of safety implies that the communication is necessary, regardless of the flow of control in the program. Hence, the compiler avoids doing any speculative communication in the process of moving communication earlier. As Equations 4.3 and 4.4 show, the value of P P INß for a node i is not only used to compute P P OU Tp for its predecessor node p, but it also depends on the value of P P OU Tp . Hence, this computation represents a bidirectional data flow problem. Finally, IN SERTi represents communication that should be inserted at the exit of node i as a result of the optimization. Given that P P OU Ti represents safe communication at that point, as shown in Equation 4.5, IN SERTi consists of P P OU Ti minus the following two components: (i) data already available at exit of node i due to original communication: given by AV OU Ti , and (ii) data available at entry to node i due to moved or original communication, and which has not been overwritten inside node i: this component is given by (P P INi − KILLi ). Following the insertions, any communication in node i that is not preceded by a definition of data (i.e., AN T LOCi ) and which also forms part of P P INi becomes redundant. This directly follows from the property of P P INi that any data included in P P INi must be available at entry to node i on every incoming path due to original or moved communication. Thus, in Equation 4.6, REDU N Di represents communication in node i that can be deleted. The union, intersection, and difference operations on ASDs are described later in the chapter, in Section 7. The ASDs are not closed under these operations (the intersection operation is always exact, except in the special case when two mapping functions, of the form Fi (j) = c ∗ i + l : c ∗ i + u : s, for corresponding array dimensions have different values of the coefficient c). Therefore, it is important to know for each operation whether to underestimate or overestimate the result, in case an approximation is needed. In the above equations, each of AV INi , AV OU Ti , P P INi , P P OU Ti , and REDU N Di are underestimated, if necessary. On the other hand, IN SERTi is overestimated, if needed. This ensures that the compiler does not incorrectly eliminate communication that is actually not redundant. While the overestimation of IN SERTi or underestimation of REDU N Di can potentially lead to more communication than necessary, our framework has some built-in guards against insertion of extra communication relative to the unoptimized program. The Morel-Renvoise framework [31] and its modified versions ensure that P P INi and P P OU Ti represent safe placements of computation at the entry/exit of node i. Correspondingly, in the context of our work, P P INi /P P OU Ti does not represent more communication than necessary.
498
Manish Gupta
4.2 Decomposition of Bidirectional Problem Before we describe our data-flow procedure using the above equations, we need to resolve the problem of bidirectionality in the computation of P P INà and P P OU Tà . Solving a bidirectional problem usually requires an algorithm that goes back and forth until convergence is reached. A preferable approach is to decompose the bidirectional problem, if possible, into simpler unidirectional problem(s) which can be solved more efficiently. Dhamdhere et al. [13] prove some properties about the bidirectional problem of eliminating redundant computation, and also prove that those properties are sufficient to allow the decomposition of that problem into two unidirectional problems. One of those properties, distributivity, does not hold in our case, because we represent data-flow variables as ASDs rather than bit strings, and the operations like union and difference are not exact, unlike the boolean operations. However, we are able to prove directly the following theorem: Theorem 41. The bidirectional problem of determining P P INà and P P OU Tà , as given by Equations 4.3 and 4.4, can be decomposed into a backward approximation, given by Equations 4.7 and 4.8, followed by a forward correction, given by Equation 4.9. BA P P INà
=
P P OU Tà
=
(P P OU Tà − KILLà ) ∪ AN T LOCà BA P P INs
(4.7) (4.8)
s∈succ(i)
P P INi
=
BA P P INi ∩ [
(AV OU Tp ∪ P P OU Tp )] (4.9)
p∈pred(i)
Proof : BA P P INi represents a backward approximation to the value of P P INi (intuitively, it represents communication that can legally and safely be moved to the entry of node i). We will show that the correction term (∩p∈pred(i) (AV OU Tp ∪ P P OU Tp )) applied to a node i to obtain P P INi cannot lead to a change in the value of P P OU T for any node in the control flow graph, and that in turn implies that the P P IN values of other nodes are also unaffected by this change. The correction term, being an intersection operation, can only lead to a reduction in the value of the set P P INi . Let X = BA P P INi − P P INi denote this reduction, and let x denote an arbitrary element of X. Thus, x ∈ BA P P INi , and x ∈ P P INi . Hence, there must exist a predecessor of i, say, node p (see Figure 4.2), such that: x ∈ AV OU Tp and x ∈ P P OU Tp . Therefore, p must have another child j such that x ∈ BA P P INj , otherwise x would have been included in P P OU Tp . Now let us consider the possible effects of removal of x from P P INi . From the given equations, a change in the value of P P INi can only affect the value of P P OU T for a predecessor of i (which can possibly lead to other changes). Clearly, the value of P P OU Tp
A Framework for Global Communication Analysis and Optimizations
cannot exist
499
p
i
j
Fig. 4.2. Proving the decomposition of the bidirectional problem does not change because P P OU Tp already does not include x. But node i cannot have any predecessors other than p because p is a branch node, and by virtue of the edge splitting transformation on the control flow graph, there can be no edge from a branch node to a join node. Hence, the application of the correction term at a node i cannot change the P P OU T value of any node: this implies the validity of the above process of decomposing the bidirectional problem. We observe that since the application of the correction term to a node does not change the value of P P OU T or P P IN of any other node, it does not require a separate pass through the control flow graph. During the backward pass itself, after the value of P P OU Tp is computed for a node p, the correction term can be applied to its successor node i by intersecting BA P P INi with AV OU Tp ∪ P P OU Tp . 4.3 Overall Data-Flow Procedure So far we have discussed the data flow equations that are applied in a forward or backward direction along the edges of a control flow graph to determine the data flow information for each node. In the presence of loops, which lead to cycles in the control flow graph, one approach employed by classical data flow analysis is to iteratively apply the data flow equations over the nodes until the data flow solution converges [1]. We use the other well-known approach, interval-analysis [2,17], which makes a definite number of traversals through each node and is well-suited to analysis such as ours which attempts to summarize data flow information for arrays. We use Tarjan intervals [38], which correspond to loops in a structured program. Each interval in a structured program has a unique header node h. As a further restriction, we require each interval to have a single loop exit node l. Each interval has a back-edge l, h. The edge-splitting transformation,
500
Manish Gupta
discussed earlier, adds a node b to split the back-edge l, h into two edges, l, b and b, h. We now describe how interval analysis is used in the overall data flow procedure. INTERVAL ANALYSIS Interval analysis is precisely defined in [7]. The analysis is performed in two phases, an elimination phase followed by a propagation phase. The elimination phase processes the program intervals in a bottom-up (innermost to outermost) traversal. During each step of the elimination phase, data flow information is summarized for inner intervals, and each such interval is logically collapsed and replaced by a summary node. Thus, when an outer interval is traversed, the inner interval is represented by a single node. At the end of the elimination phase, there are no more cycles left in the graph. For the purpose of our description, the top-level program is regarded as a special interval with no back-edge, which is the first to be processed during the propagation phase. Each step of the propagation phase expands the summary nodes representing collapsed intervals, and computes the data flow information for nodes comprising those intervals, propagating information from outside to those nodes. Our overall data flow procedure is sketched in Figure 4.3. We now provide details of the analysis.
for each interval in elimination phase (bottom-up) order do 1. Compute CGEN and KILL summary in forward traversal of the interval. 2. Compute AN T LOC summary in backward traversal of the interval. for each interval in propagation phase (top-down) order do 1. Compute AV IN and AV OU T for each node in a forward traversal of the interval. 2. Compute P P OU T and BA P P IN for each node in a backward traversal of the interval. Once P P OU T is computed for a node in this traversal, apply the forward correction to BA P P IN of each of its successor nodes. Once P P IN is obtained for a node via the forward correction(s), determine IN SERT and REDU N D for that node as well. Fig. 4.3. Overall data flow procedure
4.3.1 Elimination Phase. We now describe how the values of local dataflow variables, CGEN , KILL, and AN T LOC are summarized for nodes corresponding to program intervals in each step of the elimination phase. These values are used in the computations of global data-flow variables outside that interval.
A Framework for Global Communication Analysis and Optimizations
501
The computation of KILL and CGEN proceeds in the forward direction, i.e., the nodes within each interval are traversed in topological-sort order. For the computation of KILL, we define the variable Ká as the data that may be killed along any path from the header node h to node i. We initialize the data availability information and the kill information (Kh ) at the header node as follows: AV INh Kh
= =
∅ KILLh
The transfer function for Ki at all other nodes is defined as follows: Ki = ( Kp ) ∪ KILLi (4.10) p∈pred(i)
AVIN h Kh
AVOUT l K l
h
l
Fig. 4.4. Computing summary information for an interval The transfer functions given by Equations 4.1, 4.2 and 4.10 are then applied to each statement node during the forward traversal of the interval, as shown in Figure 4.4. Finally, the data availability generated for the interval last node l must be summarized for the entire interval, and associated with a summary node s. However, the data availability at l, obtained from Equations 4.1 and 4.2, is only for a single iteration of the loop. Following [17], we would like to represent the availability of data corresponding to all iterations of the loop.
502
Manish Gupta
Definition. For an ASD set S, and a loop with index k varying from low to high, expand(S, k, low : high) is a function which replaces all single data item references α ∗ k + β used in any array section descriptor D in S by the triple (α ∗ low + β : α ∗ high + β : α), and any mapping function of the form Fâ (j) = c ∗ k + o by Fâ (j) = c ∗ low + o : c ∗ high + o : c. The following equations define the transfer functions which summarize the data being killed and the data being made available in an interval with loop index k, for all iterations low : high. KILLs
=
expand(Kl , k, l ow : high)
CGENs
=
expand(AV OU Tl , k, l ow : high) − (∪AntiDepDef expand(AntiDepDef, k, l ow : high))
where AntiDepDef represents each definition in the interval loop that is the target of an anti-dependence at the loop nesting level (we conclude that a given dependence exists at a loop nesting level m if the direction vector corresponding to direction ‘=’ for all outer loops and direction ‘