Languages and Compilers for Parallel Computing: 11th International Workshop, LCPC'98, Chapel Hill, NC, USA, August 7-9, 1998, Proceedings (Lecture Notes in Computer Science, 1656) 9783540664260, 3540664262


English Pages 404 [395] Year 1999


Table of contents :
Languages and Compilers for Parallel Computing
Steering Committee
Program Committee
Organizing Committee
Preface
Table of Contents
From Flop to MegaFlops: Java for Technical Computing
Introduction
Optimization of Array Bounds Checking
Determining iterations with safe accesses
Tiling the iteration space with regions
Thread safety
An alternate approach
Optimizations
Program Transformations
Experimental Results
The MICROSTRIP benchmark
Parallelization
Related Work
Conclusions
Considerations in HPJava Language Design and Implementation
Introduction
Overview of HPJava
Translation scheme
Java packages for HPspmd programming
Programming in the adJava interface
Improving the performance
Issues in the language design
Extending the Java language
Why not HPF?
Datatypes in HPJava
Programming convenience
Concluding remarks
A Loop Transformation Algorithm Based on Explicit Data Layout Representation for Optimizing Locality
Introduction
Terminology
Memory layout representation using hyperplanes
Transformation for optimizing spatial locality
Algorithm to find the loop transformation for the general case
Utilizing partial layout information
Experimental Results
Related work
Conclusions
An Integrated Framework for Compiler-Directed Cache Coherence and Data Prefetching*
Introduction
The CCDP Framework
Background and Motivation
Overview of the CCDP Scheme
Data Prefetching Optimizations
Hardware Support
Architectural Model
Organization of the DCPFU
Compiler Techniques
Stale Reference Analysis
Prefetch Target Analysis
Prefetch Scheduling
Performance Evaluation
System Model
Simulated Schemes
Experimental Methodology
Application Codes
Performance Results
Conclusions
I/O Granularity Transformations*
Introduction
Granularity Transformations
Example Transformation
Problem Definition
Background
Interval and Interval Partitioning
FUD Graph
Identifying Induction Variables
Transformation Technique
Data Flow Analysis
Interval Analysis
Array Subscript Analysis
Code Generation
Discussion
Conclusions
Stampede: A Programming System for Emerging Scalable Interactive Multimedia Applications
Introduction
The Smart Kiosk: An Example Target Application
Overview of Stampede
Address Spaces and Threads
Space-Time Memory
Garbage Collection in STM
Communicating Complex Data Structures through STM
Synchronizing with Real Time
Cluster-Wide Distributed Shared Objects (DSO)
Status and Plans
Conclusion
Network-Aware Parallel Computing with Remos*
Introduction
Usage Models
Remos Design Challenges
Remos API
Query based interface
Level of abstraction
Dynamic resource sharing
Accuracy
Implementation
Parallel Application Development with Remos
Remos Usage Framework
Fx compiler
Clustering
Application structure
Usage Examples and Experimental Results
Node selection in a static environment
Node selection in a dynamic environment
Runtime adaptation
Related Work
Concluding Remarks
Object-Oriented Implementation of Data-Parallelism on Global Networks
The Parallel Execution Model
The Framework
Program Transformation
Mapping Unmapped Data
Specializing Array Operations: Iterations and Communications
Generation of Communication Sets: Algorithm
Creation of Execution Blocks
Creation of the IDL Specification
Concluding Remarks
The Input Language mHPF
Optimized Execution of Fortran 90 Array Language on Symmetric Shared-Memory Multiprocessors
Introduction
Examples
Conflict between efficient parallelization and array temporary minimization
Collective optimization of scalarized loops and scalar loops
Optimized Parallelization of Fortran 90 Array Constructs
Experimental Results
Related Work
Conclusions and Future Work
Fortran RED - A Retargetable Environment for Automatic Data Layout*
Introduction
Example Scientific Program
Performance Model
Compiler Model
Execution Model
Machine Model
Experiments
Related Work
Conclusion
Automatic Parallelization of C by Means of Language Transcription
Introduction
The Cepheus Transcriber
Representation of data
Arrays and pointers
Expression simplification
Flow control statement manipulation
Recursion
Semantic emulation
Data types
Other conversion issues
Related Work
Performance Results
Conclusion
Improving Compiler and Run-Time Support for Irregular Reductions Using Local Writes *
Introduction
Background
Irregular Reductions
Compiling for Software DSMs
Improving Irregular Reductions
LocalWrite
Compiler Analysis
Experimental Evaluation
Experimental Platform
Applications
Application Characteristics
Shared-Memory Speedups
Distributed-Memory Speedups
Discussion
Related Work
Conclusions
Beyond Arrays - A Container-Centric Approach for Parallelization of Real-World Symbolic Applications
Introduction
The Container-Centric Approach
Concept of container
Motivation
About the container-centric approach
Container Specification
Abstract containers
Abstract container operations
Concrete container description
Container-Based Transformation Techniques
Data dependences and loop-level parallelism
Loop parallelization
Container privatization
Exploiting associativity
Container-Based Dependence Test
Data dependence test for linear containers
Commutativity analysis --- dependence test for associate containers
Experimental Results
Related Work
Conclusions
SIPR: A New Framework for Generating Efficient Code for Sparse Matrix Computations
Introduction
Basic Sparse Matrix Programming Techniques
A New Framework for Sparse Code Generation
Static SIPR
A Sparse Data Structure Library
Dynamic SIPR
Cost Analysis of SIPR Programs
Generating Executable Code for SIPR Programs
Examples of Dynamic SIPR
Analytical and Experimental Results
Analytical results
Experimental results
Previous Work
Conclusions and Future Work
HPF-2 Support for Dynamic Sparse Computations*
Introduction
Sparse Data Structures and Distributions
Sparse Data Structures
Dynamic Sparse Distribution Schemes
Parallel Dynamic Sparse Computations
Parallel Sparse LU Code
Evaluating Results
Related Work
Conclusions
Integrated Instruction Scheduling and Register Allocation Techniques*
Introduction
Issues in Integrating Register Allocation with Instruction Scheduling
Live Range Spilling vs Live Range Splitting
Parallel Interference Graphs vs Register Reuse Dags
Unified Resource Allocation Using Reuse Dags and Splitting
Conclusion
A Spill Code Placement Framework for Code Scheduling
Introduction
Related Work
Definitions and Notations
Program Behavior
Execution Timing
Register Requirement
Condition for Register Spilling
Discussion
Conclusions
Copy Elimination for Parallelizing Compilers *
Introduction
Introductory Example
Related Work
Eliminating Copy Instructions
An Algorithm for Copy Elimination
Heuristic Copy Elimination
An Example
Experiments and Results
Conclusion
Compiling for SIMD Within a Register
Introduction
Basic SIMD-to-SWAR Compilation
Partitioned Operations
Inter-Processing-Element Communication
Reductions
Enable Masking
Compiler Optimizations for SWAR
Promotion of Field Sizes
SWAR Value Tracking
Enable Masking Optimizations
Conclusion
Automatic Analysis of Loops to Exploit Operator Parallelism on Reconfigurable Systems *
Introduction
Related Work
Overall Approach
Framework for Operator and Loop Transformations
Operator Transformation
Statement re-ordering for configuration reuse
Framework for Reconfiguration Analysis
Motivating Example
Cut-set and Configuration Generation
Reconfiguration Minimization
Implementation and Results
Implementation
Results
Discussion of Results
Conclusion
Principles of Speculative Run-Time Parallelization
Run-Time Optimization Is Necessary
Run-Time Optimization
Principles of Run-Time Optimization
Obtaining Performance
Foundational Work: Run-Time Parallelization
Variations of the LRPD Test
Early Failure Detection
Faster Analysis and Early Success Detection
Fully Independent and Privatizable Accesses
Aggregate LRPD test
Some Strategy and Implementation Issues
Current Implementation of Run-Time Pass in Polaris
Experimental Results of Run-Time Test in Polaris
Conclusion
The Advantages of Instance-Wise Reaching Definition Analyses in Array (S)SA
Introduction
Motivations
Reaching Definition Analyses
Array SA and Array SSA: Definitions and Comparison
Definitions of (S)SA Forms
Construction of Array SSA and Array SA Forms
Conversion to Array SA Form
Related Work
Preliminary Experimental Results
Extending Algorithms Based on SSA Form
Conclusion
Dependency Analysis of Recursive Data Structures Using Automatic Groups
Motivation
Definitions
The Programming Model
The Analysis
Conditional Statements
Merging call guards and the termination FSA
Completion of the executing FSA
Front-End Description Languages
Integrating ASAP description
Graph Types
Complexity
A Case Study: Fluid Flow Simulation
Conclusion
The I+ Test
Introduction
Background
Data Dependence
The GCD, Banerjee and I Tests
The Extension of the I Test
Interval-Equation Transformation
Interval-Equation Transformation Using the GCD Test
Time Complexity
Experimental Results
Conclusions
Author Index

Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis and J. van Leeuwen

1656


Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Singapore Tokyo

Siddhartha Chatterjee Jan F. Prins Larry Carter Jeanne Ferrante Zhiyuan Li David Sehr Pen-Chung Yew (Eds.)

Languages and Compilers for Parallel Computing 11th International Workshop, LCPC’98 Chapel Hill, NC, USA, August 7-9, 1998 Proceedings


Volume Editors

Siddhartha Chatterjee, Jan F. Prins
Department of Computer Science, The University of North Carolina
Chapel Hill, NC 27599-3175, USA
E-mail: {sc/prins}@cs.unc.edu

Larry Carter, Jeanne Ferrante
Department of Computer Science and Engineering
University of California at San Diego
9500 Gilman Drive, La Jolla, CA 92093-0114, USA
E-mail: {carter/ferrante}@cs.ucsd.edu

Zhiyuan Li
Department of Computer Science, Purdue University
1398 Computer Science Building, West Lafayette, IN 47907, USA
E-mail: [email protected]

David Sehr
Intel Corporation
2200 Mission College Boulevard, RN6-18, Santa Clara, CA 95052, USA
E-mail: [email protected]

Pen-Chung Yew
Department of Computer Science and Engineering, University of Minnesota
Minneapolis, MN 55455, USA
E-mail: [email protected]

Cataloging-in-Publication data applied for

Die Deutsche Bibliothek - CIP-Einheitsaufnahme
Languages and compilers for parallel computing : 11th international workshop ; proceedings / LCPC ’98, Chapel Hill, NC, USA, August 7 - 9, 1998. S. Chatterjee . . . (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 1999
(Lecture notes in computer science ; Vol. 1656)
ISBN 3-540-66426-2

CR Subject Classification (1998): D.1.3, D.3.4, F.1.2, B.2.1, C.2
ISSN 0302-9743
ISBN 3-540-66426-2 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

© Springer-Verlag Berlin Heidelberg 1999
Printed in Germany
Typesetting: Camera-ready by author
SPIN: 10704088 06/3142 – 5 4 3 2 1 0

Printed on acid-free paper


Steering Committee

Utpal Banerjee, Intel Corporation
David Gelernter, Yale University
Alex Nicolau, University of California at Irvine
David Padua, University of Illinois at Urbana-Champaign

Program Committee

Larry Carter, University of California at San Diego
Siddhartha Chatterjee, University of North Carolina at Chapel Hill
Jeanne Ferrante, University of California at San Diego
Zhiyuan Li, Purdue University
Jan Prins, University of North Carolina at Chapel Hill
David Sehr, Intel Corporation
Pen-Chung Yew, University of Minnesota

Organizing Committee

Linda Houseman, University of North Carolina at Chapel Hill

External Reviewers

George Almasi
Ana Azevedo
Brian Blount
Calin Cascaval
Walfredo Cirne
Paolo D’Alberto
Vijay Ganesh
Xiaomei Ji
Asheesh Khare
Jaejin Lee
Yuan Lin
Yunheung Paek
Nick Savoiu
Martin Simons
Weiyu Tang

Preface

The year 1998 marked the eleventh anniversary of the annual Workshop on Languages and Compilers for Parallel Computing (LCPC), an international forum for leading research groups to present their current research activities and latest results. The LCPC community is interested in a broad range of technologies, with a common goal of developing software systems that enable real applications. Among the topics of interest to the workshop are language features, communication code generation and optimization, communication libraries, distributed shared memory libraries, distributed object systems, resource management systems, integration of compiler and runtime systems, irregular and dynamic applications, performance evaluation, and debuggers.

LCPC’98 was hosted by the University of North Carolina at Chapel Hill (UNC-CH) on 7-9 August 1998, at the William and Ida Friday Center on the UNC-CH campus. Fifty people from the United States, Europe, and Asia attended the workshop. The program committee of LCPC’98, with the help of external reviewers, evaluated the submitted papers. Twenty-four papers were selected for formal presentation at the workshop. Each session was followed by an open panel discussion centered on the main topic of the particular session. Many attendees have come to regard the open panels as a very effective format for exchanging views and clarifying research issues. Using feedback provided both during and after the presentations, all of the authors were given an opportunity to improve their papers before submitting the final manuscript contained in this volume.

This collection documents important research activities from the past year in the design and implementation of programming languages and environments for parallel computing. The major themes of the workshop included both classical issues (Fortran, instruction scheduling, dependence analysis) and emerging areas (Java, memory hierarchy issues, network computing, irregular applications). These themes reflect several recent trends in computer architecture: aggressive hardware speculation, deeper memory hierarchies, multilevel parallelism, and “the network is the computer.” In this final editing of the workshop papers, we have grouped the papers into these categories.

In addition to the regular paper sessions, LCPC’98 featured an invited talk by Charles Leiserson, Professor of Computer Science at the MIT Laboratory for Computer Science, entitled “Algorithmic Multithreaded Programming in Cilk”. This talk was the first exposure to the Cilk system for many of the participants and resulted in many interesting discussions. We thank Prof. Leiserson for his special contribution to LCPC’98.

We are grateful to the Department of Computer Science at UNC-CH for its generous support of this workshop. We benefited especially from the efforts of Linda Houseman, who ably coordinated the logistical matters before, during, and after the workshop. Thanks also go out to our local team of volunteers: Brian Blount, Vibhor Jain, and Martin Simons. Special thanks are due to the LCPC’98 Steering and Program Committees for their time and energy in reviewing the submitted papers. Finally, and most importantly, we thank all the authors and participants of the workshop. It is their significant research work and their enthusiastic discussions throughout the workshop that made LCPC’98 a success.

May 1999

Siddhartha Chatterjee
Program Chair

Table of Contents

Java

From Flop to MegaFlops: Java for Technical Computing
  J. E. Moreira, S. P. Midkiff and M. Gupta (IBM T.J. Watson Research Center)
Considerations in HPJava Language Design and Implementation
  Guansong Zhang, Bryan Carpenter, Geoffrey Fox, Xinying Li and Yuhong Wen (Syracuse University)

Locality

A Loop Transformation Algorithm Based on Explicit Data Layout Representation for Optimizing Locality
  M. Kandemir (Northwestern University), J. Ramanujam (Louisiana State University), A. Choudhary (Northwestern University) and P. Banerjee (Northwestern University)
An Integrated Framework for Compiler-Directed Cache Coherence and Data Prefetching
  Hock-Beng Lim (University of Illinois) and Pen-Chung Yew (University of Minnesota)
I/O Granularity Transformations
  Gagan Agrawal (University of Delaware)

Network Computing

Stampede: A Programming System for Emerging Scalable Interactive Multimedia Applications
  Rishiyur S. Nikhil (Compaq), Umakishore Ramachandran (Georgia Tech), James M. Rehg (Compaq), Robert H. Halstead, Jr. (Curl Corporation), Christopher F. Joerg (Compaq) and Leonidas Kontothanassis (Compaq)
Network-Aware Parallel Computing with Remos
  Bruce Lowekamp, Nancy Miller, Dean Sutherland, Thomas Gross, Peter Steenkiste and Jaspal Subhlok (Carnegie Mellon University)
Object-Oriented Implementation of Data-Parallelism on Global Networks
  Jan Borowiec (GMD FIRST)

Fortran

Optimized Execution of Fortran 90 Array Language on Symmetric Shared-Memory Multiprocessors
  Vivek Sarkar (IBM T.J. Watson Research Center)
Fortran RED - A Retargetable Environment for Automatic Data Layout
  Ulrich Kremer (Rutgers University)
Automatic Parallelization of C by Means of Language Transcription
  Richard L. Kennell and Rudolf Eigenmann (Purdue University)

Irregular Applications

Improving Compiler and Run-Time Support for Irregular Reductions Using Local Writes
  Hwansoo Han and Chau-Wen Tseng (University of Maryland)
Beyond Arrays - A Container-Centric Approach for Parallelization of Real-World Symbolic Applications
  Peng Wu and David Padua (University of Illinois)
SIPR: A New Framework for Generating Efficient Code for Sparse Matrix Computations
  William Pugh and Tatiana Shpeisman (University of Maryland)
HPF-2 Support for Dynamic Sparse Computations
  R. Asenjo (University of Málaga), O. Plata (University of Málaga), J. Touriño (University of La Coruña), R. Doallo (University of La Coruña) and E.L. Zapata (University of Málaga)

Instruction Scheduling

Integrated Instruction Scheduling and Register Allocation Techniques
  David A. Berson (Intel Corporation), Rajiv Gupta (University of Pittsburgh) and Mary Lou Soffa (University of Pittsburgh)
A Spill Code Placement Framework for Code Scheduling
  Dingchao Li, Yuji Iwahori, Tatsuya Hayashi and Naohiro Ishii (Nagoya Institute of Technology)
Copy Elimination for Parallelizing Compilers
  David J. Kolson, Alexandru Nicolau and Nikil Dutt (University of California, Irvine)

Potpourri

Compiling for SIMD Within a Register
  Randall J. Fisher and Henry G. Dietz (Purdue University)
Automatic Analysis of Loops to Exploit Operator Parallelism on Reconfigurable Systems
  Narasimhan Ramasubramanian, Ram Subramanian and Santosh Pande (University of Cincinnati)
Principles of Speculative Run-Time Parallelization
  Devang Patel and Lawrence Rauchwerger (Texas A&M University)

Dependence Analysis

The Advantages of Instance-Wise Reaching Definition Analyses in Array (S)SA
  Jean-François Collard (University of Versailles)
Dependency Analysis of Recursive Data Structures Using Automatic Groups
  D. K. Arvind and T. A. Lewis (The University of Edinburgh)
The I+ Test
  Weng-Long Chang and Chih-Ping Chu (National Cheng Kung University)

Author Index

From Flop to MegaFlops: Java for Technical Computing

J. E. Moreira, S. P. Midkiff, and M. Gupta
IBM T.J. Watson Research Center
P.O. Box 218, Yorktown Heights, New York 10598, USA
{jmoreira,smidkiff,mgupta}@us.ibm.com

Abstract. Although there has been some experimentation with Java as a language for numerically intensive computing, there is a perception by many that the language is not suited for such work. In this paper we show how optimizing array bounds checks and null pointer checks creates loop nests on which aggressive optimizations can be used. Applying these optimizations by hand to a simple matrix-multiply test case leads to Java-compliant programs whose performance is in excess of 500 Mflops on an RS/6000 SP 332 MHz SMP node. We also report the effect that each optimization has on performance. Since all of these optimizations can be automated, we conclude that Java will soon be a serious contender for numerically intensive computing.

1 Introduction

The scientific programming community has recently demonstrated a great deal of interest in the use of Java for technical computing. There are many compelling reasons for such use: Java has a large supply of programmers, it is object-oriented without excessive complications (in contrast to C++), and it supports networking and graphics. Technical computing is moving more and more towards a network-centric model of computation. In this context, it can be expected that Java will first be used where it is most natural: for visualization and networking components. Eventually, Java will spread into the core computational components of technical applications. Nevertheless, a major obstacle remains to the pervasive use of Java in technical computing: performance.

Let us start by looking into the performance of a simple matrix-multiply routine in Java, as shown in Fig. 1. This routine computes C = C + A × B, where C is an m × p matrix, A is an m × n matrix, and B is an n × p matrix. We use that routine to multiply two 500 × 500 matrices (m = n = p = 500) on an RS/6000 SP 332 MHz SMP node. This machine contains four 332 MHz PowerPC 604e processors, each with a peak performance of 664 Mflops. We refer to this simple benchmark as MATMUL. The Java code is compiled into a native executable by the IBM High Performance Compiler for Java (HPCJ) [10], and achieves a performance of 5 Mflops on a 332 MHz PowerPC 604e processor. The equivalent Fortran code, compiled by the IBM XLF compiler, achieves 265 Mflops! A 50-fold performance

S. Chatterjee (Ed.): LCPC’98, LNCS 1656, pp. 1–17, 1999. © Springer-Verlag Berlin Heidelberg 1999
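As a rough check on the Mflops figures quoted above, the following sketch counts the floating-point operations in C = C + A × B (one multiply and one add per innermost iteration, so 2·m·n·p in total) and converts an elapsed time into Mflops. The class name MatmulFlops, its helper methods, and the i, j, k loop order are illustrative assumptions, not the code of Fig. 1, and timings from a current JVM will not reproduce the 1998 measurements.

```java
// Illustrative sketch only: names and loop order are assumptions, not the paper's code.
final class MatmulFlops {

    // C = C + A x B does one multiply and one add for every (i, j, k) triple.
    static long flopCount(int m, int n, int p) {
        return 2L * m * n * p;
    }

    // Rate in Mflops, i.e. millions of floating-point operations per second.
    static double mflops(long flops, double seconds) {
        return flops / (seconds * 1.0e6);
    }

    // Plain triple loop: A is m x n, B is n x p, C is m x p.
    static void matmul(double[][] a, double[][] b, double[][] c, int m, int n, int p) {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < p; j++) {
                for (int k = 0; k < n; k++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }

    public static void main(String[] args) {
        int size = 500; // m = n = p = 500, as in the MATMUL benchmark
        double[][] a = new double[size][size];
        double[][] b = new double[size][size];
        double[][] c = new double[size][size];
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                a[i][j] = i + j;  // arbitrary test data
                b[i][j] = i - j;
            }
        }

        long start = System.nanoTime();
        matmul(a, b, c, size, size, size);
        double seconds = (System.nanoTime() - start) / 1.0e9;

        long flops = flopCount(size, size, size);
        System.out.printf("%d flops in %.3f s = %.1f Mflops%n",
                flops, seconds, mflops(flops, seconds));
    }
}
```

For m = n = p = 500 the count is 250 million operations, so 5 Mflops corresponds to about 50 seconds per multiplication, while 265 Mflops corresponds to a little under one second.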

static void matmul(double[][] A, double[][] B, double[][] C, int m, int n, int p) { int i, j, k; for (i=0; i