Parallel and Distributed Processing: 15 IPDPS 2000 Workshops Cancun, Mexico, May 1–5, 2000 Proceedings (Lecture Notes in Computer Science, 1800) 354067442X, 9783540674429

This volume contains the proceedings from the workshops held in conjunction with the IEEE International Parallel and Dis


English Pages 1317 [1332] Year 2000


Table of contents:
Lecture Notes in Computer Science
Parallel and Distributed Processing
Volume Editors
Foreword
Contents
Memory Management in a combined VIA/SCI Hardware
1 Motivation and Introduction
2 What are the Memory Management Considerations?
3 PCI-SCI vs. VIA discussion and comparison
3.1 Question 1: How a process' memory area is made available to the NIC and in what way main memory is protected against wrong accesses?
3.2 Question 2: At which point in the system a DMA engine is working and how are the transactions of this DMA engine validated?
3.3 Question 3: In which way memory of a process on a remote node is made accessible for a local process?
4 A new PCI-SCI Architecture with VIA Approaches
4.1 Advanced Memory Management
4.2 Operation of Distributed Shared Memory from a memory-related point of view
4.3 Operation of Protected User-Level Remote DMA from a memory-related point of view
4.4 A free choice of using either Programmed I/O or User-Level Remote DMA
5 Influence on MPI Libraries
6 State of the project (November 1999)
7 Other Works on SCI and VIA
8 Conclusions and Outlook
References
ATOLL, a new switched, high speed Interconnect in Comparison to Myrinet and SCI
1 Introduction
2 Design Space for Network Interfaces
3 NIC Hardware Layout and Design
3.1 ATOLL
3.2 Myrinet
3.3 Scalable Coherent Interface (SCI)
4 Software
4.1 Low Level API
4.2 Upper Software Layer for Communication
4.3 Communication Overhead
5 Conclusion
References
ClusterNet: An Object-Oriented Cluster Network
1 Introduction
2 ClusterNet
2.1 Functionality: Router vs. Aggregate Function Execution
2.2 Network Storage: Routing Tables vs. Network-Embedded Data Structures
2.3 Network Port Interface: I/O Queues vs. Register Interface
2.4 Software Interface: Packet vs. Direct Read and Write
3 The ClusterNet4EPP Proof-of-Concept Prototype
4. Related Research
5. Conclusions and Future Directions
References
GigaBit Performance under NT
Abstract
1. Introduction
2. Message Passing
2.1 MPI Overview
2.2 PVM Overview
3. Gigabit Ethernet
4. MPI NT Environments
4.1 MPI/PRO for Windows NT
4.2 PaTENT WMPI 4.0
4.3 WMPI
5. Performance Tests
5.1 Test Equipment
5.2 Multi-processor Benchmark - PingPong
5.2.1 MPI Version
5.2.2 PVM Version
5.3 Differences of the MPI and PVM versions of PingPong
6. Results
6.1 Introduction
6.2 Latency Results (Table 3)
6.3 Network Bandwidths
6.3.1 Shared Memory Results (Figure 2)
6.3.2 Distributed Memory
7. Summary and Conclusions
7.1 Summary
7.2 Price/Performance Considerations
7.3 Summary of Conclusions
7.4 Future Work
References
MPI Collective Operations over IP Multicast *
1 Introduction
2 IP Multicast
3 MPI Collective Operations
3.1 MPI Broadcast
3.2 MPI Barrier Synchronization
4 Experimental Results
5 Conclusions and Future Work
References
An Open Market-Based Architecture for Distributed Computing
1 Introduction
2 System Properties
3 Resource Allocation
4 System Architecture
4.1 Overview of System Components
4.2 Basic System Services and Communication
5 Supporting Distributed Computing Paradigms
5.1 The Generic Master-Slave Model
5.2 A Sample Client Application
6 Related Work
7 Discussion
8 Future Directions
References
The MultiCluster Model to the Integrated Use of Multiple Workstation Clusters
1 Introduction
2 Integrating Multiple Clusters
2.1 Hardware Aspects
2.2 Software Aspects
3 The MultiCluster Model
3.1 Hardware Platform
3.2 Software Structure
3.3 The Programming Environment—DECK
4 Related Work
5 Conclusions and Current Work
References
Parallel Information Retrieval on an SCI-Based PC-NOW
1. Introduction
2. PC Cluster-based IR System
2.1 Typical IR System on Uniprocessor
2.2 Declustering IIF
2.3 Parallel IR System Model
2.4 Experimental PC Cluster System
2.5 SCI-based DSM Programming
3. Performance of PC Cluster-based IR System
3.1 Performance Comparison between SCI-based System and MPI-based System
3.2 Effect of Declustering IIF
3.3 Performance with Various-sized IIF
3.4 Reducing IR Operation Time
4. Conclusions
References
A PC-NOW Based Parallel Extension for a Sequential DBMS
1 Introduction
2 Architecture
2.1 General Overview
2.2 The Server Module
2.3 The Calculator Module
3 Prototyping
3.1 General Overview
3.2 Implementation of the Server Module
3.3 Implementation of the Calculator Module
3.4 Communication Issues
4 Current Performance of the Extension Prototype
4.1 Underlying Hardware
4.2 Speed-up
4.3 Real Database Tests
5 Discussion
5.1 Parallel Extension vs. Porting
5.2 Toward a Generalization of the Parallel Extension Concept
6 Context
7 Summary
References
Workshop on Advances in Parallel and Distributed Computational Models
The Heterogeneous Bulk Synchronous Parallel Model
1 Introduction
2 Related Work
3 Heterogeneous BSP
4 HBSP Algorithms
4.1 Prefix Sums
4.2 Matrix Multiplication
4.3 Randomized Sample Sort
5 Conclusions and Future Directions
References
On stalling in LogP* (Extended Abstract)
1 Introduction
2 The models
2.1 LogP’s stalling behaviour
3 Separation between -stalling LogP and stall-free LogP
References
Parallelizability of some P-complete problems*
1 Introduction
2 Parameterized convex layers
3 Lexicographically first maximal 3 sums
4 Conclusions
References
A New Computation of Shape Moments via Quadtree Decomposition *
1 Introduction
2 Basic Data Manipulation Operations
3 The Quadtree Decomposition
4 Computing Shape Moments
5 Parallel Moment Computation Algorithm
6 Concluding Remarks
References
The Fuzzy Philosophers
1. Introduction
2. A-protocol
3. Correctness of A-protocol
4. B-protocol
5. Efficiency of A-protocol
References
A Java Applet to Visualize Algorithms on Reconfigurable Mesh
1 Introduction
2 Reconfigurable Mesh
3 Specification of Software
3.1 User Interface
3.2 Programming Language
4 Execution of the JRM
4.1 Visualization of the Prefix Sums Algorithm
4.2 Some Algorithms Implemented on the JRM
5 Conclusion
References
A Hardware Implementation of PRAM and its Performance Evaluation *
1 Introduction
2 Design of the PRAM-like Computers
2.1 Steps of the PRAM Model and Our PRAM-like Computers
2.2 Architecture of the PRAM-like Computers
2.3 Memory Accessing and Synchronous Processing
2.4 The Internal Processing in the Nodes of the Units
2.5 Amount of Hardware and Theoretical Processing Time
3 Evaluation of the Implementation Method
References
A Non-Binary Parallel Arithmetic Architecture
1 Introduction
2 The shift switch logic
3 The small shift switch adder architecture
4 The larger shift switch adder architecture
5 Concluding remarks
References
Multithreaded Parallel Computer Model with Performance Evaluation *
1 Introduction
2 Multithreaded parallel computer model
3 Multithreaded parallel computer model simulator
4 PRAM algorithms implemented in the MPCM
4.1 Prefix sums algorithm
4.2 List ranking algorithm
5 Performance evaluation
References
Workshop on Parallel and Distributed Computing in Image Processing, Video Processing, and Multimedia (PDIVM2000)
Organizers
Preface
Committees
MAJC-5200: A High Performance Microprocessor for Multimedia Computing
1 Introduction
2 Architecture
3 MAJC-5200 Microprocessor
3.1 Building Blocks
3.2 MAJC CPU
4 Instruction Set
5 Performance in Multimedia Applications
6 Conclusion
References
A Novel Superscalar Architecture for Fast DCT Implementation
1. Introduction
2. Modified SIMD Architecture for Fast DCT
3. Superscalar Execution of FDCT
4. Comparison and Conclusion
REFERENCES
Computing Distance Maps Efficiently Using An Optical Bus
1 Introduction
2 The LARPBS Model
3 Algorithm Using n² Processors
4 Algorithm Using n³ Processors
5 Conclusions
References
Advanced Data Layout Optimization for Multimedia Applications
1 Introduction and Related Work
2 Example Illustration
3 Main Memory Data Layout Organization (MDO)
3.1 The General Problem
3.2 The Pragmatic Solution
4 Experimental Results
5 Conclusions
References
Parallel Parsing of MPEG Video in a Multi-threaded Multiprocessor Environment
1 Introduction
2 Scene change detection in MPEG1 video
2.1 Description of the MPEG1 video format
2.2 The motion-luminance approach
2.3 The spatio-temporal approach
3 Parallel video parsing
3.1 Parallel video parsing using the motion-luminance approach
3.2 Parallel video parsing using the spatio-temporal approach
4 Experimental Results and Conclusions
References
Parallelization Techniques for Spatial-Temporal Occupancy Maps from Multiple Video Streams
1 Introduction
2 Distributed Sensing
3 Algorithms
3.1 Image-based
3.2 Map-based
3.3 Image-level parallelism
3.4 Pixel-level parallelism
3.5 Map-level parallelism
4 Results
5 Conclusion
References
Heuristic Solutions for a Mapping Problem in a TV-Anytime Server Network*
1 Introduction
2 A Hierarchical TV-Anytime Server Network
3 The Media Mapping Problem
3.1 A Feature of the Mapping Problem
3.2 Formalizing the Mapping Problem
4 Parallel Simulated Annealing Algorithms
4.1 Initial Solution
4.2 Neighborhood Structure
4.2.1 Neighborhood Structure - Phase I
4.2.2 Neighborhood Structure - Phase II
5 Performance Evaluation
6 TV Cache - A TV-Anytime System
References
RPV: A Programming Environment for Real-time Parallel Vision - Specification and programming methodology -
1 Introduction
2 System Overview
2.1 Hardware Configuration
2.2 Software Architecture
2.3 Modules
3 RPV Programming Tool
3.1 Class RPV Connection
3.2 Function RPV Invoke
3.3 Sample Programs
4 Conclusion
Acknowledgement
References
Parallel low-level image processing on a distributed-memory system
1 Introduction
2 Low-level image processing operators
3 Integrating parallelism in an image processing library
4 Experimental results
5 Conclusions
6 Future work
References
Congestion-free Routing of Streaming Multimedia Content in BMIN-based Parallel Systems
1 Introduction
2 Folded Benes networks
3 Flow-based adaptive routing
4 The distributed modified looping algorithm
5 Simulation results
6 Conclusion
References
Performance of On-Chip Multiprocessors for Vision Tasks*
1 Introduction
2 Selected Vision Tasks
3 On-Chip Multiprocessor
4 Simulation Environment
5 Simulation Results and Analysis
6 Concluding Remarks
Acknowledgement
References
Parallel Hardware-Software Architecture for computation of Discrete Wavelet Transform using the Recursive Merge Filtering algorithm
1 Introduction
2 Formal Description of the Recursive Merge Filtering Algorithm
2.1 RMF Operator
3 DWT in terms of the RMF operator
4 RMF Algorithm Computations and Data Shifting
5 Transformation of Data Routing to Address Computation
6 Equations for Data Shifting
7 Hardware-Software Architecture
8 FPGA Implementation and Resource Use
References
Fifth International Workshop on High-level Parallel Programming Models and Supportive Environments HIPS 2000
Preface
Workshop Chair
Steering Committee
Program Committee
Acknowledgments
Accepted Papers for HIPS 2000
Pipelining Wavefront Computations: Experiences and Performance*
1 Introduction
2 Representations: MPI, HPF, and ZPL
3 Parallelization Experiences: MPI, HPF, and ZPL
3.1 MPI
3.2 HPF
3.3 ZPL
4 Performance
5 Conclusion
References
Specification Techniques for Automatic Performance Analysis Tools
1 Introduction
2 Related work
3 Overall Design of the KOJAK Cost Analyzer
4 Performance Property Specification
4.1 Data Model
4.2 Performance Properties
5 Implementation
6 Conclusion and Future Work
References
PDRS: A Performance Data Representation System*
1 Introduction
2 Design and Implementation of PDRS
2.1 Trace Data Module
2.2 Data Management Module
2.3 Performance Database
2.4 Relational Queries Module
2.5 Performance Diagnostic Agent (PDA) Module
2.6 Performance Visualization and Auralization (PVA) Module and Graphical User Interface Module
3 Summary
References
Clix* - A Hybrid Programming Environment for Distributed Objects and Distributed Shared Memory
1 Introduction
2 The Arts Platform
3 Distributed Shared Memory Abstractions
4 Object-Oriented Design of the Clix-DSM
5 Sample Implementation
6 Related Work
7 Conclusions
References
Controlling Distributed Shared Memory Consistency from High Level Programming Languages
1 Introduction
2 Implicit versus Explicit Consistency Management
3 Mome DSM Consistency Model
4 Consistency Management Optimizations
5 Experiments
5.1 Simulated Code
5.2 Data Prefetch
5.3 Consistency Management Strategy
5.4 Manager Distribution
6 Related Work
7 Conclusion and Future Work
References
Online Computation of Critical Paths for Multithreaded Languages
1 Introduction
2 Benefits of Getting Critical Path Information
3 Our Scheme: Online Longest Path Computation
3.1 Target Language
3.2 Computed Critical Paths
3.3 Instrumentation
3.4 Potential Problems and Possible Solutions
4 Experiments
5 Related Work
6 Conclusion and Future Work
References
Problem Solving Environment Infrastructure for High Performance Computer Systems
1 Introduction
2 The Proposed Model for Problem Solving Environments
2.1 The Layered Architecture
2.2 Level 0 - Infrastructure
2.3 Level 1 - Hardware Abstractions
2.4 Level 2 - Programming Model
2.5 Level 3 - Mathematics
2.6 Level 4 - Domain Specific Interface
3 Implementation
3.1 CECAAD
3.2 An Electromagnetics Environment for Cluster Computers
3.3 An Image Processing Environment for Reconfigurable Computers
4 Conclusion
References
Combining Fusion Optimizations and Piecewise Execution of Nested Data-Parallel Programs
1 Introduction
2 Related Work
3 Combining Fusion and Piecewise Execution
4 Implementation and Benchmarks
5 Conclusion and Future Work
References
Declarative concurrency in Java
1 Introduction
2 Related work
2.1 Concurrent object-oriented programming
2.2 Temporal constraints
3 Logic programs for concurrent programming
3.1 Events and constraints
3.2 Markers and events
3.3 Constraints and methods
4 Synchronization constraints
5 Implementation
6 Conclusion
References
Scalable Monitoring Technique for Detecting Races in Parallel Programs*
1 Introduction
2 Background
3 Scalable Monitoring Technique
4 Related Work
5 Conclusion
References
3rd IPDPS Workshop on High Performance Data Mining
Preface
Workshop Co-Chairs
Program Committee
Implementation Issues in the Design of I/O Intensive Data Mining Applications on Clusters of Workstations
1 Introduction
2 Implementation of I/O Intensive DM Applications
3 A Test Case DM Algorithm and its Implementation
4 Experimental Results and Conclusions
References
A Requirements Analysis for Parallel KDD Systems
1 Introduction
2 PKDD Requirements
3 Mining Methods
4 Hardware Models and Trends
5 Software Infrastructure
6 Conclusions
References
Parallel Data Mining on ATM-Connected PC Cluster and Optimization of its Execution Environments
1 Introduction
2 Our ATM-connected PC cluster and its communication characteristics
3 Parallel data mining application and its implementation on the cluster
3.1 Association rule mining
3.2 Implementation of HPA program on PC cluster
4 Optimization of transport layer protocol parameters
4.1 Broadcasting on the cluster and TCP retransmission
4.2 Total performance of HPA program using proposed method
5 Dynamic remote memory acquisition
5.1 Dynamic remote memory acquisition and its experiments
5.2 Remote update method
References
The Parallelization of a Knowledge Discovery System with Hypergraph Representation*
1 Introduction
2 Serial System INDED
2.1 Inductive Logic Programming
2.2 Serial Architecture
3 Parallelizing INDED
4 Naive Decomposition
5 Data Parallel Decomposition with Data Partitioning
5.1 Data Partitioning and Locality
5.2 Partitioning Algorithm
6 Global Hypergraph using Speculative Parallelism
7 Current Status and Results
8 Current and Future Work
References
Parallelisation of C4.5 as a Particular Divide and Conquer Computation
1 Problem statement
2 The Programming Environment
3 Related work and first experiments
4 Current results
5 Expected Results and Future work
References
Scalable Parallel Clustering for Data Mining on
1 Introduction
2 Bayesian Classification and AutoClass
3 P-AutoClass
3.1 Design of the parallel algorithm
3.1.1 Parallel update_wts
3.1.2 Parallel update_parameters
4 Experimental results
5 Related work
6 Conclusion
References
Exploiting Dataset Similarity for Distributed Mining *
1 Introduction
2 Similarity Measure
2.1 Association Mining Concepts
2.2 Similarity Metric
2.3 Sampling and Association Rules
3 Clustering Datasets
4 Experimental Analysis
4.1 Setup
4.2 Sensitivity to Sampling Rate
4.3 Synthetic Dataset Clustering
4.4 Census Dataset Evaluation
5 Conclusions
References
Scalable Model for Extensional and Intensional Descriptions of Unclassified Data
1 Introduction
2 Motivation
3 Proposed Architecture
4 ART1 Neural Network
5 Combinatorial Neural Model (CNM)
6 Ongoing Work
References
Parallel Data Mining of Bayesian Networks from Telecommunications Network Data
1 Introduction and the Global Picture
1.1 Telecommunication Fault Management, Fault Correlation and Data Mining BBNs
1.2 The Architecture
2 The Parallel Data Mining Algorithm
2.1 The Need for Parallelism
2.2 Parallel Cause And Effect Genetic Algorithm (P-CAEGA)
2.3 Results
2.4 Future Potential Research
3 Conclusion
References
IRREGULAR'00 SEVENTH INTERNATIONAL WORKSHOP ON SOLVING IRREGULARLY STRUCTURED PROBLEMS IN PARALLEL
General Chair
Program Co-chairs
Steering Committee
Program Committee
Invited Speakers
FOREWORD
Load Balancing and Continuous Quadratic Programming
1
Extended Abstract
Parallel Management of Large Dynamic Shared Memory Space: A Hierarchical FEM Application
1 Introduction
2 Efficient Irregular Memory Accesses within a Large Virtual Address Space
2.1 Understanding ccNUMA Architecture
2.2 Enhancing Virtual Pages Locality
3 Efficient Parallel Dynamic Memory Allocation
3.1 The Fragmentation Problem
3.2 Allocating Memory in Parallel
4 Conclusion
References
Efficient Parallelization of Unstructured Reductions on Shared Memory Parallel Architectures*
1 Introduction
2 Unstructured Reductions on Shared Memory Machines
3 Unstructured Reductions on Distributed Memory Machines
4 Exclusive Ownership Technique
5 Performance Results
6 Summary and Conclusion
References
Parallel FEM Simulation of Crack Propagation - Challenges, Status, and Perspectives*
1 Introduction
2 System Overview
3 Geometric Modeling and Mesh Generation
4 Equation Solving and Preconditioning
5 Adaptivity
6 Future Work
7 Conclusions
References
Support for Irregular Computations in Massively Parallel PIM Arrays, Using an Object-Based Execution Model
1 Introduction
2 Irregular Problems
3 Macroservers
4 PIM Arrays and Their Support for Irregular Computations
5 Case Study: Sparse Matrix Vector Multiply
6 Conclusion
References
Executing Communication-Intensive Irregular Programs Efficiently
1 Introduction
2 Constraints on Execution
2.1 Barrier Synchronization
2.2 Fixed Size Processor Partitions
3 Scheduling on Fixed Size Partitions
3.1 Handling Y-Irregularity
3.2 Handling X-Irregularity
4 Online Rebalancing of Threads
4.1 Costs of Imbalance and Rebalancing
4.2 Optimal Online Algorithm
4.3 Low Overhead Alternatives
5 Summary and Future Work
References
NON-MEMORY-BASED AND REAL-TIME ZEROTREE BUILDING FOR WAVELET ZEROTREE CODING SYSTEMS
1 INTRODUCTION
2 THE ARCHITECTURE FOR REARRANGING 2-STAGE 2-D DWT
2.1 Two Preliminary Devices Used in the Architecture for Rearrangement
2.2 The Proposed Architecture and the Analysis of Its Operations
3 THE DESIGN EXTENDED TO GENERAL STAGES OF DWT
4 PERFORMANCE ANALYSIS AND CONCLUSION
References
Graph Partitioning for Dynamic, Adaptive and Multi-phase Computations
A Multilevel Algorithm for Spectral Partitioning with Extended Eigen-Models
1 Introduction
2 Extended Eigen-Model
3 Subspace Algorithms for Solving Extended Eigenproblems
4 Algorithm
5 Numerical Results
References
An Integrated Decomposition and Partitioning Approach for Irregular Block-Structured Applications
1 Introduction
2 The partitioning approach
2.1 Overview
2.2 The clustering algorithm
2.3 The distribution method
3 Numerical results
3.1 The applications
3.2 Comparison with the Berger-Rigoutsos algorithm
3.3 Parallel performance
4 Conclusions
Acknowledgments
References
Ordering Unstructured Meshes for Sparse Matrix Computations on Leading Parallel Systems
1 Introduction
2 Partitioning and Linearization
2.1 Cuthill-McKee Algorithms (CM)
2.2 Self-Avoiding Walks (SAW)
2.3 Graph Partitioning (MeTiS)
3 Experimental Results
3.1 Distributed-Memory Implementation
3.2 Shared-Memory Implementation
3.3 Multithreaded Implementation
4 Work in Progress
Acknowledgements
References
A GRASP for computing approximate solutions for the Three-Index Assignment Problem
ABSTRACT
On Identifying Strongly Connected Components in Parallel*
1 Introduction
2 A Parallelizable Algorithm for Strongly Connected Components
3 Serial Complexity of Algorithm DCSC
4 Future Work
Acknowledgements
References
A Parallel, Adaptive Refinement Scheme for Tetrahedral and Triangular Grids
1 Introduction
1.1 Serial Element Adaption Scheme
2 Parallel Implementation
2.1 Data Structures
2.2 Refinement
2.3 Grid Closure
3 Groundwater Application
4 Conclusion
References
PaStiX : A Parallel Sparse Direct Solver Based on a Static Scheduling for Mixed 1D/2D Block Distributions *
1 Introduction
2 Parallel solver and block mapping
3 Numerical Experiments
4 Conclusion and Perspectives
References
Workshop on Java for Parallel and Distributed Computing
Workshop Organization
Program Co-Chairs
Program Committee
An IP Next Generation Compliant Java™ Virtual Machine
1 Introduction
2 A Quick overview of IPv6
2.1 Addressing format
2.2 Modifications of the IP stack protocols
2.3 Other new features
3 Developing an IPv6 package for Java
3.1 Architecture
3.2 Implementation of the underlying mechanisms
4 Results and extensions
4.1 A raw level and new options
4.2 An IPv6 compliant JVM
4.3 Results
5 Conclusion
References
An Approach to Asynchronous Object-Oriented Parallel and Distributed Computing on Wide-Area Systems*
1 Introduction
2 Related work
3 The Moka programming model
4 The Java API of Moka
5 Transparent vs. non-transparent distributed interactions
6 Performance evaluation
7 Conclusions
References
Performance Issues for Multi-language Java Applications
1 Introduction
2 Choices for Native Interface
3 Java Native Interface Performance
3.1 Cost of a Native Call
3.2 Cost of Accessing Data from Native Code
4 Embedding the JVM in Servers: Memory Management and Threading Interactions
5 Implementation of Fast JNI on IRIX
5.1 Fast JNI Calling Optimization in the MIPS JIT
5.2 JNI Pinning Lock Optimization in the JVM
6 Related Work
7 Conclusions
Acknowledgements
References
MPJ: A Proposed Java Message Passing API and Environment for High Performance Computing
1. Introduction
2. Some design decisions
3. Overview of the Architecture
3.1 Process creation and monitoring
3.2 The MPJ daemon
3.3 Handling MPJ aborts – Jini events
3.4 Other failures – Jini leasing
3.5 Sketch of a “Device-Level” API for MPJ
4. Conclusions and Future Work
5. References
Implementing Java consistency using a generic, multithreaded DSM runtime system
1 Introduction
2 Executing Java programs on distributed clusters
2.1 Concurrent programming in Java
2.2 The Hyperion System
3 Implementing Java consistency
3.1 DSM-PM2: a generic, multi-protocol DSM layer
3.2 Using DSM-PM2 to build a Java consistency protocol
4 Preliminary performance evaluation
5 Conclusion
References
Third Workshop on Bio-Inspired Solutions to Parallel Processing Problems (BioSP3)
Workshop Chairs
Steering Committee
Program Committee
TAKE ADVANTAGE OF THE COMPUTING POWER OF DNA COMPUTERS
1 Introduction
2 DNA Computation Model
2.1 Operations
2.2 Biological Implementation
3 NP-complete Problem Solving
3.1 One Simplified NP-complete Problem
3.2 An Advanced Problem
4 Problem Reconsideration
5 Conclusion
References
Agent surgery: The case for mutable agents
1 Introduction
2 Mutable programs
3 Mutability in Bond
4 Surgery techniques
4.1 Simple surgical operations
4.2 Replacing the strategy of a state
4.3 Splitting a transition with a state
4.4 Bypassing states
4.5 Adding and removing planes
4.6 Joining and splitting agents
4.7 Trimming agents
5 Conclusions
Acknowledgments
References
Was Collective Intelligence before Life on Earth?
1. Introduction
2. Computational Collective Intelligence
2.1 Computational model of Collective Intelligence
2.2. The Inference Model for Collective Intelligence and its measure
3. Comprehension and definition of life
4. Ordering Collective Intelligence and Life
5. Conclusions
6. References
Solving Problems on Parallel Computers by Cellular Programming
1. Introduction
2. Cellular Programming
3. A Parallel Environment for CARPET
4. Programming Examples
4.1. The wireworld program
4.2. A forest fire model
4.3 Performance results
6. Conclusion
Acknowledgements
References
Multiprocessor Scheduling with Support by Genetic Algorithms-based Learning Classifier System
1 Introduction
2 Genetic Algorithms-based Learning Classifier System
3 Multi-agent Approach to Multiprocessor Scheduling
4 An Architecture of a Classifier System to Support Scheduling
5 Experimental Results
6 Conclusions
References
Viewing Scheduling Problems through Genetic and Evolutionary Algorithms
1 Introduction
2 Genetic and Evolutionary Algorithms
2.1 Basic Concepts
2.2 GEPE: The Genetic and Evolutionary Programming Environment
3 Analysis of the Scheduling Processes
4 Approaching the JSSP with GEAs
5 Results
6 A practical example
7 Conclusions
References
Dynamic Load Balancing Model: Preliminary Assessment of a Biological Model for a Pseudo-Search Engine
1 Introduction
2 Methodologies of Genetic Programming
2.1 Overview
2.2 Application of the Genetic Operators
3 Adapting the Biological Model
3.1 Overview of the Biological Model
3.2 Description of the Computer Model
4 Computation Measures for the Pseudo-Search Engine
4.1 Overview
4.2 Computational Measures
5 Conclusion
6 Acknowledgements
References
A Parallel Co-evolutionary Metaheuristic
1 Introduction
2 The Quadratic Assignment Problem
3 CO-SEARCH: A parallel co-evolutionary metaheuristic
3.1 MTS: the search agent
3.2 The Adaptive Memory
3.3 The Diversifying GA
3.4 The intensifying kick
4 Experiments
5 Conclusion
References
Neural Fraud Detection in Mobile Phone Operations
1 Introduction
2 Neural Fraud Model Construction
2.1 Classification of Mobile Phone Users
2.2 Neural Network Model
3 Experimental Study
3.1 Environment and Implementation Issues
4 Conclusion
References
Information Exchange in Multi Colony Ant Algorithms
1 Introduction
2 Ant Algorithm for TSP
3 Parallel Ant Algorithms
4 Strategies for Information Exchange
5 Results
6 Conclusion
References
A Surface-Based DNA Algorithm for the Expansion of Symbolic Determinants
1 Introduction
2 Surface-Based Operations
2.1 Abstract Model
2.2 Biological Implementation
3 Hard Computation Problem Solving
3.1 Expansion of Symbolic Determinants Problem
3.2 Surface-Based Algorithm
4 Analysis of the Algorithm
5 Conclusion
References
Hardware Support for Simulated Annealing and Tabu Search
1 Introduction
2 Local Search
3 Hardware Support
4 Implementation
5 Results
6 Discussion
References
Eighth International Workshop on Parallel and Distributed Real-Time Systems
General Chair
Program Chairs
Publicity Chair
Steering Committee
Program Committee
Message from the Program Chairs
A Distributed Real Time Coordination Protocol
1 Introduction
2 Problem Formulation
3 Protocol Design
4 Summary and Conclusion
Acknowledgement
References
A Segmented Backup Scheme for Dependable Real Time Communication in Multihop Networks
1 Introduction
2 Spare Resource Allocation
3 Backup Route Selection
4 Failure Recovery
5 Delay and Scalability
6 Performance Evaluation
7 Conclusions
References
Real-Time Coordination in Distributed Multimedia Systems
1 Introduction
2 The Coordination Language Manifold
3 Extending Manifold with a Real-Time Event Manager
3.1 Recording Time
3.2 Expressing Temporal Relationships
4 Coordination of RT Components in a Multimedia Presentation
5 Conclusions
References
Supporting Fault-Tolerant Real-Time Applications using the RED-Linux General Scheduling Framework *
1 Introduction
2 Related Work on Fault-Tolerant and Real-Time Support
3 The RED-Linux General Scheduling Framework
4 The Design of Fault Monitors
5 The Implementation of Task Group in RED-Linux
6 Conclusions
References
Are COTS suitable for building distributed fault-tolerant hard real-time systems?*
1 Introduction
2 COTS and hard real-time constraints
2.1 Methodology
2.2 WCET analysis and COTS hardware
2.3 WCET analysis and COTS real-time operating systems
3 COTS and fault tolerance constraints
3.1 Methodology
3.2 Off-line task replication
3.3 Basic fault-tolerance mechanisms and COTS components
4 Experimental platform
5 Concluding remarks
References
Autonomous Consistency Technique in Distributed Database with Heterogeneous Requirements
1 Background
1.1 Needs in SCM
2 Approach
2.1 Assurance
2.2 Goal
3 Accelerator
3.1 Allowable Volume
3.2 System Model in SCM
3.3 Accelerator
3.4 AV management
4 Simulation
5 Conclusion
References
Real-time Transaction Processing Using Two-stage Validation in Broadcast Disks*
1 Introduction
2 Issues of Transaction Processing in Broadcast Environments
3 Protocol Design
3.1 Broadcasting of Validation Information
3.2 Timestamp Ordering
4 The New Protocol
4.1 Transaction Processing at Mobile Clients
4.2 The Server Functionality
5 Conclusions and Future Work
References
Using Logs to Increase Availability in Real-Time Main-Memory Database
1 Introduction
2 RODAIN Database
3 Log Handling in the RODAIN Database Node
4 Experimental Study
5 Conclusion
References
Components are from Mars
1 Introduction
2 Basic Component Model and Qualification
Basic component model
Implications for distributed real-time systems
3 On Disputed Issues in Component Design
Do components have state?
Are objects components?
4 Concluding Remarks
Acknowledgements
References
2 + 10 ≻ 1 + 50 !
1 Introduction
2 Fault Modelling and Analysis
3 Testability Analysis
4 Conclusion and future challenges
References
A Framework for Embedded Real-time System Design *
1 Introduction
2 A Fully Automatic Approach for the Analysis of Real-time Systems
3 Conclusion
References
Best-effort Scheduling of (m,k)-firm Real-time Streams in Multihop Networks
1 Introduction
2 EDBP Scheduling Algorithm
3 Performance Study
4 Conclusions
References
Predictability and Resource Management in Distributed Multimedia Presentations
1 Introduction
2 The Proposed Language Extensions for QoS definition
3 The Proposed Runtime Environment
3.1 Determination of Task Blocking Time
4 Related Work
5 Conclusions
References
Quality of Service Negotiation for Distributed, Dynamic Real-time Systems
1 Introduction
2 QoS Negotiation Architecture and Approach
3 QoS Negotiation Algorithm and Protocol
4 Experimental Results
5 Previous Work in QoS Negotiation
6 Conclusions and Future Work
References
An Open Framework for Real-Time Scheduling Simulation
1 Introduction
2 Related Work
3 Theory of Operation
3.1 Task Model and Workload Generation
3.2 Scheduling and Dispatching
3.3 Logging and Statistics
4 Conclusions
References
5th International Workshop on Embedded/Distributed HPC Systems and Applications (EHPC 2000)
Preface
EHPC 2000 Contents
Program Committee
Advisory Committee
A Probabilistic Power Prediction Tool for the Xilinx 4000-Series FPGA
1 Introduction and Background
2 Overview of the Tool
3 Calibration of the Tool
4 Power Measurements
5 Experimental Evaluation of the Tool
6 Summary
Acknowledgements
References
Application Challenges: System Health Management for Complex Systems
1 Introduction
1.1 Challenges in system health management
1.2 Condition-Based Maintenance for Naval Ships
1.2.1 MPROS Architecture
1.2.2 Data Concentrator hardware
2 MPROS Software
2.1 PDME
2.2 Knowledge fusion
3 Validation
4 Conclusions
5 Acknowledgment
References
Accommodating QoS Prediction in an Adaptive Resource Management Framework
1 Introduction
2 Overview of RM approach
3 Software and Hardware Profiling
4 QoS and Resource Utilization Monitoring
5 Resource Selection
6 Experiments
7 Conclusions and Ongoing Work
References
Network Load Monitoring in Distributed Systems
1 Introduction
2 Load Simulator
3 Previous Work
4 Experimental Procedure
5 Conclusion
References
A Novel Specification and Design Methodology of Embedded Multiprocessor Signal Processing Systems Using High-Performance Middleware
1 Introduction
2 The Need for Model Continuity in Specification & Design Methodologies
3 The MAGIC Specification and Design Methodology
4 Model Continuity via Middleware
5 Using VSIPL & MPI for Model Continuity
6 Conclusion
References
Auto Source Code Generation and Run-Time Infrastructure and Environment for High Performance, Distributed Computing Systems
1 Introduction
1.1 Systems and Applications Genesis Environment (SAGE)
2 Auto-Glue Code Generation and Run-Time Kernel
3 Experiments
3.1 Benchmark Applications
3.2 Target Machine
3.3 Experiments and Test Method
3.4 Results
4 Conclusions
References
Developing an Open Architecture for Performance Data Mining
1 Introduction
2 Unified Modeling Language
3 A Performance Data Mining Architecture
4 Discussion
5 Future PDMA Research
6 Conclusions
Acknowledgements
References
A 90k gate “CLB” for Parallel Distributed Computing
1 Introduction
2 ManArray Parallel Distributed Computing
3 Evaluation
4 Conclusions
References
Power-Aware Replication of Data Structures in Distributed Embedded Real-Time Systems
1 System Model
2 Numerical Results
2.1 Effect of Application Write Ratios
2.2 Impact of Per-hop Transfer Cost
2.3 Task Allocation and Network Topology
2.4 Routing Issues
2.5 Selective Replication
3 Conclusion
References
Comparison of MPI Implementations on a Shared Memory Machine
1 Introduction
2 Approach
3 Results
3.1 Platform Configuration
3.2 Sun's HPC 3.0 MPI
3.3 LAM Shared Memory MPI
3.4 MPICH
4 Conclusions
References
A Genetic Algorithm Approach to Scheduling Communications for a Class of Parallel Space-Time Adaptive Processing Algorithms
1 Introduction and Background
2 Overview of Parallel STAP
3 Genetic Algorithm Methodology
4 Numerical Results
5. Conclusion
Acknowledgements
References
Reconfigurable Parallel Sorting and Load Balancing on a Beowulf Cluster: HeteroSort
1 Introduction
1.1 Dynamic Adaptability
1.2 Beowulf Clusters
1.3 Local Knowledge and Global Processes
1.4 Related Work
2 Approach
2.1 Beowulf Clusters
2.2 Optimization of HeteroSort
3 Fault Tolerance
5.1 Future Directions
Acknowledgments
Reference
7th Reconfigurable Architectures Workshop (RAW 2000)
Workshop Chair
Steering Chair
Program Chair
Publicity Chair
Programme Committee
Preface
Programme of RAW 2000:
Run-Time Reconfiguration at Xilinx (invited talk)
JRoute: A Run-Time Routing API for FPGA Hardware
1 Introduction
2 Overview of the Virtex Routing Architecture
3 JRoute Features
3.1 Various Levels of Control
3.2 Support for Cores
3.3 Unrouter
3.4 Avoiding Contention
3.5 Debugging Features
4 JRoute versus Routing with JBits
5 Portability
6 Future Work
7 Conclusions
Acknowledgements
References
A Reconfigurable Content Addressable Memory
1 Introduction
2 A Standard CAM Implementation
3 An FPGA CAM Implementation
4 The Reconfigurable Content Addressable Memory (RCAM)
5 An RCAM Example
6 System Issues
7 Comparison to Other Approaches
8 Associative Processing
9 Conclusions
10 Acknowledgements
References
ATLANTIS – A Hybrid FPGA/RISC Based Re-configurable System
1 Introduction
2 ATLANTIS System Architecture
2.1 ATLANTIS Computing Board (ACB)
2.2 ATLANTIS I/O Board (AIB)
2.3 ATLANTIS Active Backplane (AAB)
2.4 Host CPU
2.5 CHDL Development Environment
3 Applications
3.1 High Energy Physics
3.2 Image processing
3.3 Astronomy
3.4 Measured and Estimated Performance
4 Summary and Outlook
References
The Cellular Processor Architecture CEPRA-1X and its Configuration by CDL
1 Introduction
2 Target Architectures
3 CDL, a Language for Cellular Processing
4 Transformation into a Hardware Description
5 Conclusion
References
Loop Pipelining and Optimization for Run Time Reconfiguration*
1 Introduction
2 Related Work
3 Pipeline Construction
3.1 Definitions
3.2 Phase 0: Pre-processing and Mapping
3.3 Phase 1: Partitioning
3.4 Routing Considerations
3.5 Phase 2: Pipeline Segmentation
3.6 Reconfiguration of null stages
4 Results
5 Conclusions
References
Compiling Process Algebraic Descriptions into Reconfigurable Logic
1 Introduction
2 The Circal process algebra
3 Overview of compiler operation
4 A circuit model of Circal
4.1 Design outline
4.2 Process logic design
4.3 The complete process logic block
5 Mapping circuits to reconfigurable logic
6 Deriving modules from process descriptions
7 Conclusions
Acknowledgements
References
Behavioral Partitioning with Synthesis for Multi-FPGA Architectures under Interconnect, Area, and Latency Constraints
1 Introduction
2 Partitioning and Synthesis Framework
3 The FMPAR Partitioner with the Exploration Engine
3.1 The FMPAR Algorithm
4 Experimental Results
4.1 Effectiveness of Dynamic Exploration with FMPAR
4.2 Comparison of FMPAR against a Simulated Annealing Partitioner
4.3 On-Board Implementations
5 Summary
References
Module Allocation for Dynamically Reconfigurable Systems
1 Introduction
2 Problem Formulation
3 Configuration Bundling
3.1 Bundling Compatibility of Temporal Templates
3.2 Measure of Configuration Bundling
4 Configuration Bundling Driven Module Allocation Algorithm
4.1 Initial Module Allocation
4.2 Ordering and Allocating Temporal Templates
5 Experimental Results
6 Conclusions and Acknowledgments
References
Augmenting Modern Superscalar Architectures with Configurable Extended Instructions
1 Introduction
2 T1000 Architecture
2.1 Background
2.2 T1000 Details
3 Methodology
3.1 Performance evaluation
3.2 Hardware Cost
4 Potential Performance Payoff of Aggressive Instruction Selection
4.1 Performance Results Using the Greedy Selection Algorithm
5 A Selective Algorithm for Choosing Extended Instructions
5.1 Selective Algorithm Overview
5.2 Performance Improvements Using the Selective Algorithm
6 Configurable Hardware Cost
7 Prior Work
8 Conclusions and Future Work
References
Complexity Bounds for Lookup Table Implementation of Factored Forms in FPGA Technology Mapping
1 Introduction
2 Preliminaries
3 Worst Case Mapping to K-LUTs
4 Conclusion
References
Optimization of Motion Estimator for Run-Time-Reconfiguration Implementation.
1. Introduction.
2. Qualitative motion estimation in the Log-Polar space.
3. Determination of the possible number of steps for RTR implementation.
3.1. Evaluation of the possible number of steps.
3.2. Modelling and parameters determination.
4. Results.
5. Conclusion and future work.
References.
Constant-Time Hough Transform On A 3D Reconfigurable Mesh Using Fewer Processors
1 Introduction
2 The Computational Model
3 The Constant-Time Algorithm
References
Fifth International Workshop on Formal Methods for Parallel Programming: Theory and Applications (FMPPTA 2000)
Program and Organizing Chair's Message
Foreword
Programme Committee
A Method for Automatic Cryptographic Protocol Verification (Extended Abstract)
1 Introduction
2 Terms, Formulae, _-Parameterized Tree Automata
3 Messages, What Intruders Know, and Simulating Protocol Runs
4 Experimental Results
5 Conclusion
Acknowledgments
References
Verification Methods for Weaker Shared Memory Consistency Models
1 Introduction
Contribution 1: Architectural tests for Weaker Memory Models
Contribution 2: New Abstraction Methods for Architectural Tests
2 Summary of Results
3 Conclusions and Future Work
References
Models Supporting Nondeterminism and Probabilistic Choice
1 Introduction
2 Domains
3 CSP
4 The probabilistic power domain
4.1 Probabilistic CSP
5 Constructing a new model
6 Summary
References
Concurrent Specification And Timing Analysis of Digital Hardware using SDL
1 Introduction
2 Approach
3 Validation and Verification
4 Abstract Sequencing Constraints
5 A Typical Component: A Delay Flip-Flop
6 A Simple Circuit: The Single Pulser
7 A More Complex Circuit: A Bus Arbiter
8 Conclusions
Acknowledgements
References
Incorporating Non-functional Requirements into Software Architectures
1 Introduction
2 ZCL Framework
2.1 CL Model
2.2 ZCL Framework
3 Formalising and Incorporating NFRs into Dynamic Software Architectures
3.1 Formalising NFRs
3.2 Integrating NFRs into the ZCL Framework
4 Case Study: an Appointment System
5 Conclusion and Future Works
References
Automatic Implementation of Distributed Systems Formal Specifications
Introduction
Formal Description Techniques
Mondel Language
Implementation Approaches
The DRACO-PUC Environment
Setting up the Implementation Environment
Conclusions
References
Refinement based validation of an algorithm for detecting distributed termination
1 Introduction
2 Description of the Algorithm
2.1 Diffusing Computation
2.2 Termination Detection
2.3 Path Vectors
3 The UNITY Formalism
3.1 UNITY Logic Predicates
3.2 Refinements
4 Validation
4.1 Specification of the Termination
4.2 Structure of the Validation
4.3 Diffusing Computation Pattern
4.4 The Concrete Model
4.5 The Abstract Model
4.6 The Auxiliary Model
4.7 Liveness
4.8 Mechanizing the Development
5 Conclusion
References
Tutorial 1: Abstraction and Refinement of Concurrent Programs and Formal Specification: A Practical View
References
Tutorial 2: A Foundation for Composing Concurrent Objects
References
Workshop on Optics and Computer Science (WOCS 2000)
Organizers:
Preface
Program Chair
Steering Committee
Program Committee
Fault Tolerant Algorithms for a Linear Array with a Reconfigurable Pipelined Bus System
1 Introduction
2 Model Descriptions
2.1 LARPBS Model
2.2 Fault Model
3 Preprocessing Phase
4 Fault Tolerant Algorithms
5 Conclusions
References
Fast and Scalable Parallel Matrix Computations with Optical Buses (Extended Abstract)
1 Introduction
2 Scalable Parallelization
3 Optical Buses
4 Matrix Multiplication, Chain Product, and Powers
5 Inversion of Lower and Upper Triangular Matrices
6 Determinants, Characteristic Polynomials, and Ranks
7 Inversion of Arbitrary Matrices
8 Linear Systems of Equations
9 LU- and QR-Factorizations
References
Pulse-Modulated Vision Chips with Versatile-Interconnected Pixels
1 Introduction
2 Vision Chip Based on PWM
3 Vision Chip Based on PFM
4 Discussion
5 Conclusion
Acknowledgments
References
Connectivity Models for Optoelectronic Computing Systems
1 Connectivity, Dimensionality, and Rent's Rule
2 Discontinuities and the Origin of Rent's Rule
3 Free-Space Optical Interconnections
4 Fundamental Studies of Interconnections
5 Conclusion
References
Optoelectronic-VLSI Technology: Terabit/s I/O to a VLSI Chip
References
Three Dimensional VLSI-Scale Interconnects
Introduction
PIM Motivation
Optoelectronic Technologies
Summary
References
Present and Future Needs of Free-Space Optical Interconnects
1 Introduction
2 Present Status of FSOI
3 Present limitations in FSOI and future directions
4 Conclusions
References
Fast Sorting on a Linear Array with a Reconfigurable Pipelined Bus System
1 Introduction
2 Fast sorting on the LARPBS
2.1 Definitions and properties
2.2 An O(log logN) time merging algorithm on the LARPBS
2.3 The sorting algorithm
References
Architecture description and prototype demonstration of optoelectronic parallel-matching architecture
1 Introduction
2 Parallel Matching Architecture
3 Experimental prototype system
4 Conclusions
Acknowledgment
References
A Distributed Computing Demonstration System Using FSOI Inter-Processor Communication
1 Introduction
2 System Topology
2.1 Carrier Boards
2.2 System Board
3 Processor Interconnection
4 Processing Element
5 Conclusion
References
Optoelectronic Multi-Chip Modules Based on Imaging Fiber Bundle Structures
VCSEL based smart pixel array technology enables chip-to-chip optical interconnect
Run-Time Systems for Parallel Programming
Preface
A Portable and Adaptative Multi-Protocol Communication Library for Multithreaded Runtime Systems
1 Efficient Communication in Multithreaded Environments
2 The Madeleine II Multi-Protocol Communication Interface
3 Inside Madeleine II: from the Application to the Network
4 Implementation and Performances
5 Related work
6 Conclusion
References
CORBA Based Runtime Support for Load Distribution and Fault Tolerance
1 Introduction
2 Integrating Load Distribution into CORBA
3 Runtime Support for Fault Tolerance in CORBA Based Systems
4 Experimental Results
5 Conclusions
References
Run-time Support for Adaptive Load Balancing
1 Motivation and Related Work
2 Load Balancing Framework
3 Load Balancing Strategies
4 Application Performance
5 Conclusion
References
Integrating Kernel Activations in a Multithreaded Runtime System on top of Linux
1 Kernel Support for User Level Thread Schedulers
1.1 The Marcel Mixed Thread Scheduler
1.2 Better Support: Kernel Activations
2 Marcel on Top of Linux Activations
2.1 How it works
2.2 Extensions to the original proposal
2.3 Modifications to Marcel
3 Performance and Evaluation
3.1 Performance
4 Conclusion
References
DyRecT: Software Support for Adaptive Parallelism on NOWs
1 Introduction
2 High-Level Primitives
3 Low-Level Primitives
4 Performance Results
5 Conclusion
References
Fast Measurement of LogP Parameters for Message Passing Platforms
1 Introduction
2 Parameterized LogP
3 Fast parameter measurement
3.1 Limitations of the method
4 Result evaluation
5 Conclusions
Acknowledgements
References
Supporting flexible safety and sharing in multi-threaded environments
1 Introduction
2 Safe Threads package
2.1 Support for Threads
2.2 Protected domains and Permission relationships
2.3 Implementation
3 Performance Analysis
3.1 Thread Creation
3.2 Context Switch
4 Existing Safety Solutions
5 Conclusion
References
A Runtime System for Dynamic DAG Programming
1 Introduction
2 DAG and Compact DAG
3 The Incremental Execution Model
4 The Parallel Scheduling Algorithm
5 Runtime System Organization
6 Experimental Study
7 Conclusion
Acknowledgments
References
Workshop on Fault-Tolerant Parallel and Distributed Systems (FTPDS '00)
Workshop Chair
Invited speakers
Papers
Certification of system architecture dependability
Computing in the RAIN: A Reliable Array of Independent Nodes
1 Introduction
1.1 Related Work
1.2 Novel Features of RAIN
2 Communication
2.1 Fault-Tolerant Interconnect Topologies
2.2 Consistent-History Protocol for Link Failures
2.3 A Port of MPI
3 Group Membership
3.1 Novel Features
4 Data Storage
4.1 Array Codes
4.2 Distributed Store/Retrieve Operations
5 Proof-of-Concept Applications
5.1 High-Availability Video Server
5.2 High-Availability Web Server
5.3 Distributed Checkpointing Mechanism
6 Conclusions
References
Fault Tolerant Wide-Area Parallel Computing
1.0 Introduction
2.0 Related Work
3.0 Fault Tolerance Options for SPMD Applications
4.0 Performance Models
5.0 Results
5.1 Validating the Models
5.2 Head-to-head Comparison
6.0 Summary
7.0 References
Transient Analysis of Dependability/Performability Models by Regenerative Randomization with Laplace Transform Inversion
1 Introduction
2 The New Variant
2.1 Closed form solution in the Laplace transform domain
2.2 Numerical Laplace inversion
3 Analysis and Comparison
4 Conclusions
References
FANTOMAS: Fault Tolerance for Mobile Agents in Clusters
1 Introduction and Motivation
2 Related Work: Fault Tolerance for Mobile Agents
3 Concepts for a Fault Tolerance Approach for Mobile Agents
3.1 Goals and Requirements
3.2 Fault Model
3.3 Discussion of Fault Tolerance Methods
3.4 The FANTOMAS Concept
3.5 Diagnosis
4 Analytic Evaluation
5 Conclusions and Future Work
References
Metrics, Methodologies, and Tools for Analyzing Network Fault Recovery Performance in Real-Time Distributed Systems
1 Introduction
2 Network Fault Recovery Technologies
3 Network Fault Recovery Performance
3.1 Testing Model
3.2 Fault Recovery Performance Metrics
4 Testing Methodology
5 Network Fault Recovery Performance Measurement Toolset
5.1 Test Orchestration Tools
5.2 Data Collection Tools
5.3 Analysis/Visualization Tools
6 Applying the Metrics, Tools, and Testing Methodology
6.1 General Test Setup
6.2 Example FDDI Test Results
6.3 Example Fast Ethernet Test Results
7 Conclusions and Ongoing Work
8 References
Consensus Based on Strong Failure Detectors: A Time and Message-Efficient Protocol
1 Introduction
2 Asynchronous Distributed Systems, Failure Detectors and the Consensus Problem
2.1 Asynchronous Distributed System with Process Crash Failures
2.2 The Class S of Unreliable Failure Detectors
2.3 The Consensus Problem
3 The S-Based Consensus Protocol
3.1 The Protocol
3.2 Underlying Principles
3.3 Structure
3.4 Proof
4 Cost of the Protocol
5 Conclusion
References
Implementation of Finite Lattices in VLSI for Fault-State Encoding in High-speed Networks
1 Introduction
2 Lattices and Fault-Tolerance
3 Application to Selected Fault-Tolerant Routing Algorithms
4 Implementation
4.1 Simple Table-Based Method
4.2 Implementation with Boolean Lattice
4.3 Hybrid Implementations
5 Conclusion
References
Building a Reliable Message Delivery System Using the CORBA Event Service
1 Introduction
2 Log files and retry policies – Are they adequate?
3 Application-level reliability mechanism to provide resilience
3.1 Model for reliability: Resynchronization
4 Effectiveness of the reliability mechanism - Experiments
5 Summary and Concluding Remarks
References
Network Survivability Simulation of a Commercially Deployed Dynamic Routing System Protocol
INTRODUCTION
DRS ALGORITHM
DRS PROACTIVE COST
NETWORK SURVIVABILITY ANALYSIS
References
Fault-tolerant Distributed-Shared-Memory on a Broadcast-based Interconnection Network
1 Introduction
2 Fault Tolerant DSM on the SOME-bus
3 Conclusion
References
An Efficient Backup-Overloading for Fault-Tolerant Scheduling of Real-Time Tasks
1 Introduction
2 Dynamic Logical Groups
3 Performance Study
4 Conclusions
References
Mobile Agents to Automate Fault Management in Wireless and Mobile Networks
1. Introduction
2. The Fault-Tolerant Wireless Network Management Architecture
3. Overall Description of Methodology
4. Fault Management
4.1 An Example of the Steps in Fault Correction and Recovery
4.2 A High-Level View of the System
5. Conclusion
References
9th Heterogeneous Computing Workshop (HCW 2000)
Session 1-A: Grid Environment
Session 1-B: Resource Discovery and Management
Session 2-A: Communication and Data Management
Session 2-B: Modeling and Metrics
Session 3-A: Theory and Modeling
Session 3-B: Scheduling I
Session 4-A: Grid Applications
Session 4-B: Resource Management
Session 5-B: Scheduling II
Author Index

Lecture Notes in Computer Science

1800

Jose Rolim et al. (Eds.)

Parallel and Distributed Processing
15 IPDPS 2000 Workshops Cancun, Mexico, May 2000 Proceedings

Springer



TC Parallel Processing



Association for Computing Machinery

Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis and J. van Leeuwen

1800

Springer Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Singapore Tokyo

Jose Rolim et al. (Eds.)

Parallel and Distributed Processing 15 IPDPS 2000 Workshops Cancun, Mexico, May 1-5, 2000 Proceedings

Springer

Series Editors
Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Managing Volume Editor
Jose Rolim
Universite de Geneve, Centre Universitaire d'Informatique
24, rue General Dufour, CH-1211 Geneve 4, Switzerland
E-mail: [email protected]

Cataloging-in-Publication Data applied for

Die Deutsche Bibliothek - CIP-Einheitsaufnahme Parallel and distributed processing : 15 IPDPS 2000 workshops, Cancun, Mexico, May 1 - 5, 2000, proceedings/ Jose Rolim et al. (ed.). Berlin ; Heidelberg ; New York; Barcelona; Hong Kong; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 2000 (Lecture notes in computer science ; Vol. 1800) ISBN 978-3-540-67442-9

CR Subject Classification (1998): C.1-4, B.1-7, D.1-4, F.1-2, G.1-2, E.1, H.2

ISSN 0302-9743
ISBN 978-3-540-67442-9 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

Springer-Verlag is a company in the BertelsmannSpringer publishing group.
© Springer-Verlag Berlin Heidelberg 2000

Typesetting: Camera-ready by author, data conversion by Boller Mediendesign
Printed on acid-free paper
SPIN: 10720149 06/3142 5 4 3 2 1 0

Volume Editors

Jose D.P. Rolim
G. Chiola
G. Conte
L.V. Mancini
Oscar H. Ibarra
Koji Nakano
Stephan Olariu
Sethuraman Panchanathan
Andreas Uhl
Martin Schulz
Mohammed J. Zaki
Vipin Kumar
David B. Skillicorn
Sartaj Sahni
Timothy Davis
Sanguthevar Rajasekeran
Sanjay Ranka
Denis Caromel
Serge Chaumette
Geoffrey Fox
Peter Graham
Albert Y. Zomaya
Fikret Ercal
Kenji Toda
Sang Hyuk Son
Maarten Boasson
Yoshiaki Kakuda
Devesh Bhatt
Lonnie R. Welch
Hossam ElGindy
Viktor K. Prasanna
Hartmut Schmeck
Oliver Diessel
Beverly Sanders
Dominique Mery
Fouad Kiamilev
Jeremy Ekman
Afonso Ferreira
Sadik Esener
Yi Pan
Keqin Li
Ron Olsson
Laxmikant V. Kale
Pete Beckman
Matthew Haines
Dimiter R. Avresky

Foreword

This volume contains the proceedings from the workshops held in conjunction with the IEEE International Parallel and Distributed Processing Symposium, IPDPS 2000, on 1-5 May 2000 in Cancun, Mexico. The workshops provide a forum for bringing together researchers, practitioners, and designers from various backgrounds to discuss the state of the art in parallelism. They focus on different aspects of parallelism, from run time systems to formal methods, from optics to irregular problems, from biology to networks of personal computers, from embedded systems to programming environments; the following workshops are represented in this volume:

Workshop on Personal Computer Based Networks of Workstations
Workshop on Advances in Parallel and Distributed Computational Models
Workshop on Par. and Dist. Comp. in Image, Video, and Multimedia
Workshop on High-Level Parallel Prog. Models and Supportive Env.
Workshop on High Performance Data Mining
Workshop on Solving Irregularly Structured Problems in Parallel
Workshop on Java for Parallel and Distributed Computing
Workshop on Biologically Inspired Solutions to Parallel Processing Problems
Workshop on Parallel and Distributed Real-Time Systems
Workshop on Embedded HPC Systems and Applications
Reconfigurable Architectures Workshop
Workshop on Formal Methods for Parallel Programming
Workshop on Optics and Computer Science
Workshop on Run-Time Systems for Parallel Programming
Workshop on Fault-Tolerant Parallel and Distributed Systems

All papers published in the workshops proceedings were selected by the program committee on the basis of referee reports. Each paper was reviewed by independent referees who judged the papers for originality, quality, and consistency with the themes of the workshops.

We would like to thank the general co-chairs Joseph JaJa and Charles Weems for their support and encouragement, the steering committee chairs, George Westrom and Victor Prasanna, for their guidance and vision, and the finance chair, Bill Pitts, for making this publication possible. Special thanks are due to Sally Jelinek, for her assistance with meeting publicity, to Susamma Barua for making local arrangements, and to Danuta Sosnowska for her tireless efforts in interfacing with the organizers. We gratefully acknowledge sponsorship from the IEEE Computer Society and its Technical Committee of Parallel Processing and the cooperation of the ACM SIGARCH. Finally, we would like to thank Danuta Sosnowska and Germaine Gusthiot for their help in the preparation of this volume.

February 2000

Jose D. P. Rolim

Contents

Workshop on Personal Computer Based Networks of Workstations
G. Chiola, G. Conte, L.V. Mancini

Memory Management in a Combined VIA/SCI Hardware
M. Trams, W. Rehm, D. Balkanski, S. Simeonov

ATOLL, a New Switched, High Speed Interconnect in Comparison to Myrinet and SCI
M. Fischer, U. Bruning, J. Kluge, L. Rzymianowicz, P. Schulz, M. Waack

ClusterNet: An Object-Oriented Cluster Network
R.R. Hoare

GigaBit Performance under NT
M. Baker, S. Scott, A. Geist, L. Browne

MPI Collective Operations over IP Multicast
H.A. Chen, Y.O. Carrasco, A.W. Apon

An Open Market-Based Architecture for Distributed Computing
S. Lalis, A. Karipidis

The MultiCluster Model to the Integrated Use of Multiple Workstation Clusters
M. Baretto, R. Avila, P. Navaux

Parallel Information Retrieval on an SCI-Based PC-NOW
S.-H. Chung, H.-C. Kwon, K.R. Ryu, H.-K. Jang, J.-H. Kim, C.-A. Choi

A PC-NOW Based Parallel Extension for a Sequential DBMS
M. Exbrayat, L. Brunie

Workshop on Advances in Parallel and Distributed Computational Models
O.H. Ibarra, K. Nakano, S. Olariu

The Heterogeneous Bulk Synchronous Parallel Model
T.L. Williams, R.J. Parsons

On Stalling in LogP
G. Bilardi, K.T. Herley, A. Pietracaprina, G. Pucci

Parallelizability of Some P-Complete Problems
A. Fujiwara, M. Inoue, T. Masuzawa

A New Computation of Shape Moments via Quadtree Decomposition
C.-H. Wu, S.-J. Horng, P.-Z. Lee, S.-S. Lee, S.-Y. Lin

The Fuzzy Philosophers
S.-T. Huang

A Java Applet to Visualize Algorithms on Reconfigurable Mesh
K. Miyashita, R. Hashimoto

A Hardware Implementation of PRAM and Its Performance Evaluation
M. Imai, Y. Hayakawa, H. Kawanaka, W. Chen, K. Wada, G.D. Castanho, Y. Okajima, H. Okamoto

A Non-binary Parallel Arithmetic Architecture
R. Lin, J.L. Schwing

Multithreaded Parallel Computer Model with Performance Evaluation
J. Cui, J.L. Bordim, K. Nakano, T. Hayashi, N. Ishii

Workshop on Parallel and Distributed Computing in Image Processing, Video Processing, and Multimedia (PDIVM 2000)
S. Panchanathan, A. Uhl

MAJC-5200: A High Performance Microprocessor for Multimedia Computing
S. Sudharsanan

A Novel Superscalar Architecture for Fast DCT Implementation
Z. Yong, M. Zhang

Computing Distance Maps Efficiently Using an Optical Bus
Y. Pan, Y. Li, J. Li, K. Li, S.-Q. Zheng

Advanced Data Layout Optimization for Multimedia Applications
C. Kulkarni, F. Catthoor, H. De Man

Parallel Parsing of MPEG Video in a Multi-threaded Multiprocessor Environment
S.M. Bhandarkar, S.R. Chandrasekaran

Parallelization Techniques for Spatial-Temporal Occupancy Maps from Multiple Video Streams
N. DeBardeleben, A. Hoover, W. Jones, W. Ligon

Heuristic Solutions for a Mapping Problem in a TV-Anytime Server Network
X. Zhou, R. Luling, L. Xie

RPV: A Programming Environment for Real-Time Parallel Vision Specification and Programming Methodology
D. Arita, Y. Hamada, S. Yonemoto, R.-i. Taniguchi

Parallel Low-Level Image Processing on a Distributed Memory System
C. Nicolescu, P. Jonker

Congestion-Free Routing of Streaming Multimedia Content in BMIN-Based Parallel Systems
H. Sethu

Performance of On-Chip Multiprocessors for Vision Tasks
Y. Chung, K. Park, W. Hahn, N. Park, V.K. Prasanna

Parallel Hardware-Software Architecture for Computation of Discrete Wavelet Transform Using the Recursive Merge Filtering Algorithm
P. Jamkhandi, A. Mukherjee, K. Mukherjee, R. Franceschini

Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS 2000)
M. Schulz

Pipelining Wavefront Computations: Experiences and Performance
E.G. Lewis, L. Snyder

Specification Techniques for Automatic Performance Analysis Tools
M. Gerndt, H.-G. Eßer

PDRS: A Performance Data Representation System
X.-H. Sun, X. Wu

Clix - A Hybrid Programming Environment for Distributed Objects and Distributed Shared Memory
F. Mueller, J. Nolte, A. Schlaefer

Controlling Distributed Shared Memory Consistency from High Level Programming Languages
Y. Jegou

Online Computation of Critical Paths for Multithreaded Languages
Y. Oyama, K. Taura, A. Yonezawa

Problem Solving Environment Infrastructure for High Performance Computer Systems
D.C. Stanzione, Jr., W.B. Ligon III

Combining Fusion Optimizations and Piecewise Execution of Nested Data-Parallel Programs
W. Pfannenstiel

Declarative Concurrency in Java
R. Ramirez, A.E. Santosa

Scalable Monitoring Technique for Detecting Races in Parallel Programs
Y.-K. Jun, C.E. McDowell

Workshop on High Performance Data Mining
M.J. Zaki, V. Kumar, D.B. Skillicorn

Implementation Issues in the Design of I/O Intensive Data Mining Applications on Clusters of Workstations
R. Baraglia, D. Laforenza, S. Orlando, P. Palmerini, R. Perego

A Requirements Analysis for Parallel KDD Systems
W.A. Maniatty, M.J. Zaki

Parallel Data Mining on ATM-Connected PC Cluster and Optimization of Its Execution Environment
M. Oguchi, M. Kitsuregawa

The Parallelization of a Knowledge Discovery System with Hypergraph Representation
J. Seitzer, J.P. Buckley, Y. Pan, L.A. Adams

Parallelisation of C4.5 as a Particular Divide and Conquer Computation
P. Becuzzi, M. Coppola, S. Ruggieri, M. Vanneschi

Scalable Parallel Clustering for Data Mining on Multicomputers
D. Foti, D. Lipari, C. Pizzuti, D. Talia

Exploiting Dataset Similarity for Distributed Mining
S. Parthasarathy, M. Ogihara

Scalable Model for Extensional and Intensional Descriptions of Unclassified Data
H.A. Prado, S.C. Hirtle, P.M. Engel

Parallel Data Mining of Bayesian Networks from Telecommunications Network Data
R. Sterrit, K. Adamson, C.M. Shapcott, E.P. Curran

Irregular 2000 - Workshop on Solving Irregularly Structured Problems in Parallel S. Sahni, T. Davis, S. Rajasekeran, S. Ranka

423

Load Balancing and Continuous Quadratic Programming W.W. Hager

427

Parallel Management of Large Dynamic Shared Memory Space: A Hierarchical FEM Application X. Cavin, L. Alonso

428

Efficient Parallelization of Unstructured Reductions on Shared Memory Parallel Architectures S. Benkner, T. Brandes

435

Parallel FEM Simulation of Crack Propagation-Challenges, Status, and Perspectives B. Carter, C.-S. Chen, L.P. Chew, N. Chrisochoides, G.R. Gao, G. Heber, A.R. Ingraffea, R. Krause, C. Myers, D. Nave, K. Pingali, P. Stodghill, S. Vavasis, P.A. Wawrzynek Support for Irregular Computations in Massively Parallel PIM Arrays, Using an Object-Based Execution Model H.P. Zima, T.L. Sterling Executing Communication-Intensive Irregular Programs Efficiently V. Ramakrishnan, I.D. Scherson

443

450

457

Non-Memory-Based and Real-Time Zerotree Building for Wavelet Zerotree Coding Systems D. Peng, M. Lu

469

Graph Partitioning for Dynamic, Adaptive, and Multi-phase Computations V. Kumar, K. Schloegel, G. Karypis

476

XIV

Contents

A Multilevel Algorithm for Spectral Partitioning with Extended Eigen-Models S. Oliveira, T. Soma

477

An Integrated Decomposition and Partitioning Approach for Irregular Block-Structured Applications J. Rantakokko

485

Ordering Unstructured Meshes for Sparse Matrix Computations on Leading Parallel Systems L. Oliker, X. Li, G. Heber, R. Biswas

497

A GRASP for Computing Approximate Solutions for the Three-Index Assignment Problem R.M. Aiex, P.M. Pardalos, L.S. Pitsoulis, M.G.C. Resende

504

On Identifying Strongly Connected Components in Parallel L.K. Fleischer, B. Hendrickson, A. Pinar

505

A Parallel, Adaptive Refinement Scheme for Tetrahedral and Triangular Grids A. Stagg, J. Hallberg, J. Schmidt

512

PaStiX: A Parallel Sparse Direct Solver Based on a Static Scheduling for Mixed 1D/2D Block Distributions P. Hénon, P. Ramet, J. Roman

519

Workshop on Java for Parallel and Distributed Computing D. Caromel, S. Chaumette, G. Fox, P. Graham

526

An IP Next Generation Compliant Java TM Virtual Machine G. Chelius, E. Fleury

528

An Approach to Asynchronous Object-Oriented Parallel and Distributed Computing on Wide-Area Systems M. Di Santo, F. Frattolillo, W. Russo, E. Zimeo

536

Performance Issues for Multi-language Java Applications P. Murray, T. Smith, S. Srinivas, M. Jacob

544

MPJ: A Proposed Java Message Passing API and Environment for High Performance Computing M. Baker, B. Carpenter

552

Implementing Java Consistency Using a Generic, Multithreaded DSM Runtime System G. Antoniu, L. Bougé, P. Hatcher, M. MacBeth, K. McGuigan, R. Namyst

560

Workshop on Bio-Inspired Solutions to Parallel Processing Problems (BioSP3) A.Y. Zomaya, F. Ercal, S. Olariu

568

Take Advantage of the Computing Power of DNA Computers Z.F. Qiu, M. Lu

570

Agent Surgery: The Case for Mutable Agents L. Boloni, D. C. Marinescu

578

Was Collective Intelligence before Life on Earth? T. Szuba, M. Almulla

586

Solving Problems on Parallel Computers by Cellular Programming D. Talia

595

Multiprocessor Scheduling with Support by Genetic Algorithms-Based Learning Classifier System J.P. Nowacki, G. Pycka, F. Seredynski

604

Viewing Scheduling Problems through Genetic and Evolutionary Algorithms M. Rocha, C. Vilela, P. Cortez, J. Neves

612

Dynamic Load Balancing Model: Preliminary Assessment of a Biological Model for a Pseudo-search Engine R.L. Walker

620

A Parallel Co-evolutionary Metaheuristic V. Bachelet, E.-G. Talbi

628

Neural Fraud Detection in Mobile Phone Operations A. Boukerche, M.S.M.A. Notare

636

Information Exchange in Multi Colony Ant Algorithms M. Middendorf, F. Reischle, H. Schmeck

645

A Surface-Based DNA Algorithm for the Expansion of Symbolic Determinants Z.F. Qiu, M. Lu

653

Hardware Support for Simulated Annealing and Tabu Search R. Schneider, R. Weiss

660

Workshop on Parallel and Distributed Real-Time Systems K. Toda, S.H. Son, M. Boasson, Y. Kakuda

668

A Distributed Real Time Coordination Protocol L. Sha, D. Seto

671

A Segmented Backup Scheme for Dependable Real Time Communication in Multihop Networks P.K. Gummadi, J.P. Madhavarapu, S.R. Murthy

678

Real-Time Coordination in Distributed Multimedia Systems T.A. Limniotes, G.A. Papadopoulos

685

Supporting Fault-Tolerant Real-Time Applications Using the RED-Linux General Scheduling Framework K.-J. Lin, Y.-C. Wang

692

Are COTS Suitable for Building Distributed Fault-Tolerant Hard Real-Time Systems? P. Chevochot, A. Colin, D. Decotigny, I. Puaut

699

Autonomous Consistency Technique in Distributed Database with Heterogeneous Requirements H. Hanamura, I. Kaji, K. Mori

706

Real-Time Transaction Processing Using Two-Stage Validation in Broadcast Disks K.-w. Lam, V. C.S. Lee, S.H. Son

713

Using Logs to Increase Availability in Real-Time Main-Memory Database T. Niklander, K. Raatikainen

720

Components Are from Mars M.R.V. Chaudron, E. de Jong

727

2+10 >-- 1+50 ! H. Hansson, C. Norstrom, S. Punnekkat

734

A Framework for Embedded Real-Time System Design J. - Y. Choi, H.- H. Kwak, I. Lee

738

Best-Effort Scheduling of (m,k)-Firm Real-Time Streams in Multihop Networks A. Striegel, G. Manimaran

743

Predictability and Resource Management in Distributed Multimedia Presentations C. Mourlas

750

Quality of Service Negotiation for Distributed, Dynamic Real-Time Systems G.D. Cavanaugh, L.R. Welch, B.A. Shirazi, E.-n. Huh, S. Anwar

757

An Open Framework for Real-Time Scheduling Simulation T. Kramp, M. Adrian, R. Koster

766

Workshop on Embedded/Distributed HPC Systems and Applications (EHPC 2000) D. Bhatt, L.R. Welch

773

A Probabilistic Power Prediction Tool for the Xilinx 4000-Series FPGA T. Osmulski, J.T. Muehring, B. Veale, J.M. West, H. Li, S. Vanichayobon, S.-H. Ko, J.K. Antonio, S.K. Dhall

776

Application Challenges: System Health Management for Complex Systems G.D. Hadden, P. Bergstrom, T. Samad, B.H. Bennett, G.J. Vachtsevanos, J. Van Dyke

784

Accomodating QoS Prediction in an Adaptive Resource Management Framework E.-n. Huh, L.R. Welch, B.A. Shirazi, B.C. Tjaden, G.D. Cavanaugh

792

Network Load Monitoring in Distributed Systems K.M. Jahirul Islam, B.A. Shirazi, L.R. Welch, B.C. Tjaden, G.D. Cavanaugh, S. Anwar

800

A Novel Specification and Design Methodology of Embedded Multiprocessor Signal Processing Systems Using High-Performance Middleware R.S. Janka, L.M. Wills

808

Auto Source Code Generation and Run-Time Infrastructure and Environment for High Performance, Distributed Computing Systems M.I. Patel, K. Jordan, M. Clark, D. Bhatt

816

Developing an Open Architecture for Performance Data Mining D.B. Pierce, D. T. Rover

823

A 90k Gate "CLE" for Parallel Distributed Computing B. Schulman, G. Pechanek

831

Power-Aware Replication of Data Structures in Distributed Embedded Real-Time Systems O.S. Unsal, I. Koren, C.M. Krishna

839

Comparison of MPI Implementations on a Shared Memory Machine B. Van Voorst, S. Seidel

847

A Genetic Algorithm Approach to Scheduling Communications for a Class of Parallel Space-Time Adaptive Processing Algorithms J.M. West, J.K. Antonio

855

Reconfigurable Parallel Sorting and Load Balancing on a Beowulf Cluster: HeteroSort P. Yang, T.M. Kunau, B.H. Bennett, E. Davis, B. Wren

862

Reconfigurable Architectures Workshop (RAW 2000) H. ElGindy, V.K. Prasanna, H. Schmeck, O. Diessel

870

Run-Time Reconfiguration at Xilinx S.A. Guccione

873

JRoute: A Run-Time Routing API for FPGA Hardware E. Keller

874

A Reconfigurable Content Addressable Memory S.A. Guccione, D. Levi, D. Downs

882

ATLANTIS - A Hybrid FPGA/RISC Based Re-configurable System O. Brosch, J. Hesser, C. Hinkelbein, K. Kornmesser, T. Kuberka, A. Kugel, R. Manner, H. Singpiel, B. Vettermann

890

The Cellular Processor Architecture CEPRA-1X and Its Configuration by CDL C. Hochberger, R. Hoffmann, K.-P. Volkmann, S. Waldschmidt

898

Loop Pipelining and Optimization for Run Time Reconfiguration K. Bondalapati, V.K. Prasanna

906

Compiling Process Algebraic Descriptions into Reconfigurable Logic O. Diessel, G. Milne

916

Behavioral Partitioning with Synthesis for Multi-FPGA Architectures under Interconnect, Area, and Latency Constraints P. Lakshmikanthan, S. Govindarajan, V. Srinivasan, R. Vemuri

924

Module Allocation for Dynamically Reconfigurable Systems X.-j. Zhang, K.-w. Ng

932

Augmenting Modern Superscalar Architectures with Configurable Extended Instructions X. Zhou, M. Martonosi

941

Complexity Bounds for Lookup Table Implementation of Factored Forms in FPGA Technology Mapping W. Feng, F.J. Meyer, F. Lombardi

951

Optimization of Motion Estimator for Run-Time-Reconfiguration Implementation C. Tanougast, Y. Berviller, S. Weber

959

Constant-Time Hough Transform on a 3D Reconfigurable Mesh Using Fewer Processors Y. Pan

966

Workshop on Formal Methods for Parallel Programming (FMPPTA 2000) B. Sanders, D. Mery

974

A Method for Automatic Cryptographic Protocol Verification J. Goubault-Larrecq

977

Verification Methods for Weaker Shared Memory Consistency Models R.P. Ghughal, G. C. Gopalakrishnan

985

Models Supporting Nondeterminism and Probabilistic Choice M. Mislove

993

Concurrent Specification and Timing Analysis of Digital Hardware Using SDL K.J. Turner, F.J. Argul-Marin, S.D. Laing

1001

Incorporating Non-functional Requirements into Software Architectures N.S. Rosa, G.R.R. Justo, P.R.F. Cunha

1009

Automatic Implementation of Distributed Systems Formal Specifications L.H. Castelo Branco, A.F. do Prado, W. Lopes de Souza, M. Sant'Anna

1019

Refinement Based Validation of an Algorithm for Detecting Distributed Termination M. Filali, P. Mauran, G. Padiou, P. Queinnec, X. Thirioux

1027

Tutorial 1: Abstraction and Refinement of Concurrent Programs and Formal Specification D. Cansell, D. Mery, C. Tabacznyj

1037

Tutorial 2: A Foundation for Composing Concurrent Objects J.-P. Bahsoun

1039

Workshop on Optics and Computer Science (WOCS 2000) F. Kiamilev, J. Ekman, A. Ferreira, S. Esener, Y. Pan, K. Li

1042

Fault Tolerant Algorithms for a Linear Array with a Reconfigurable Pipelined Bus System A.G. Bourgeois, J.L. Trahan

1044

Fast and Scalable Parallel Matrix Computations with Optical Buses K. Li

1053

Pulse-Modulated Vision Chips with Versatile-Interconnected Pixels J. Ohta, A. Uehara, T. Tokuda, M. Nunoshita

1063

Connectivity Models for Optoelectronic Computing Systems H.M. Ozaktas

1072

Optoelectronic-VLSI Technology: Terabit/s I/O to a VLSI Chip A. V. Krishnamoorthy

1089

Three Dimensional VLSI-Scale Interconnects D. W. Prather

1092

Present and Future Needs of Free-Space Optical Interconnects S. Esener, P. Marchand

1104

Fast Sorting on a Linear Array with a Reconfigurable Pipelined Bus System A. Datta, R. Owens, S. Soundaralakshmi

1110

Architecture Description and Prototype Demonstration of Optoelectronic Parallel-Matching Architecture K. Kagawa, K. Nitta, Y. Ogura, J. Tanida, Y. Ichioka

1118

A Distributed Computing Demonstration System Using FSOI Inter-Processor Communication J. Ekman, C. Berger, F. Kiamilev, X. Wang, H. Spaanenburg, P. Marchand, S. Esener

1126

Optoelectronic Multi-chip Modules Based on Imaging Fiber Bundle Structures D.M. Chiarulli, S.P. Levitan

1132

VCSEL Based Smart Pixel Array Technology Enables Chip-to-Chip Optical Interconnect Y. Liu

1133

Workshop on Run-Time Systems for Parallel Programming (RTSPP) R. Olsson, L.V. Kale, P. Beckman, M. Haines

1134

A Portable and Adaptative Multi-protocol Communication Library for Multithreaded Runtime Systems O. Aumage, L. Bougé, R. Namyst

1136

CORBA Based Runtime Support for Load Distribution and Fault Tolerance T. Barth, G. Flender, B. Freisleben, M. Grauer, F. Thilo

1144

Run-Time Support for Adaptive Load Balancing M.A. Bhandarkar, R.K. Brunner, L.V. Kale

1152

Integrating Kernel Activations in a Multithreaded Runtime System on Top of LINUX V. Danjean, R. Namyst, R.D. Russell

1160

DyRecT: Software Support for Adaptive Parallelism on NOWs E. Godard, S. Setia, E. White

1168

Fast Measurement of LogP Parameters for Message Passing Platforms T. Kielmann, H.E. Bal, K. Verstoep

1176

Supporting Flexible Safety and Sharing in Multi-threaded Environments S.H. Samorodin, R. Pandey

1184

A Runtime System for Dynamic DAG Programming M.-Y. Wu, W. Shu, Y. Chen

1192

Workshop on Fault-Tolerant Parallel and Distributed Systems (FTPDS 2000) D.R. Avresky

1200

Certification of System Architecture Dependability I. Levendel

1202

Computing in the RAIN: A Reliable Array of Independent Nodes V. Bohossian, C.C. Fan, P.S. LeMahieu, M.D. Riedel, L. Xu, J. Bruck

1204

Fault-Tolerant Wide-Area Parallel Computing J.B. Weissman

1214

Transient Analysis of Dependability/Performability Models by Regenerative Randomization with Laplace Transform Inversion J.A. Carrasco

1226

FANTOMAS: Fault Tolerance for Mobile Agents in Clusters H. Pals, S. Petri, C. Grewe

1236

Metrics, Methodologies, and Tools for Analyzing Network Fault Recovery Performance in Real-Time Distributed Systems P.M. Irey IV, B.L. Chappell, R.W. Hott, D.T. Marlow, K.F. O'Donoghue, T.R. Plunkett

1248

Consensus Based on Strong Failure Detectors: A Time and Message-Efficient Protocol F. Greve, M. Hurfin, R. Macedo, M. Raynal

1258

Implementation of Finite Lattices in VLSI for Fault-State Encoding in High-Speed Networks A.G. Doring, G. Lustig

1266

Building a Reliable Message Delivery System Using the CORBA Event Service S. Ramani, B. Dasarathy, K.S. Trivedi

1276

Network Survivability Simulation of a Commercially Deployed Dynamic Routing System Protocol A. Chowdhury, O. Frieder, P. Luse, P.-J. Wan

1281

Fault-Tolerant Distributed-Shared-Memory on a Broadcast-Based Interconnection Network D. Hecht, C. Katsinis

1286

An Efficient Backup-Overloading for Fault-Tolerant Scheduling of Real-Time Tasks R. Al-Omari, G. Manimaran, A.K. Somani

1291

Mobile Agents to Automate Fault Management in Wireless and Mobile Networks N. Pissinou, Bhagyavati, K. Makki

1296

Heterogeneous Computing Workshop (HCW 2000) V.K. Prasanna, C.S. Raghavendra

1301

Author Index

1307

3rd Workshop on Personal Computer based Networks Of Workstations (PC-NOW 2000)

Clusters composed of fast personal computers are now well established as cheap and efficient platforms for distributed and parallel applications. The main drawback of a standard NOW is the poor performance of the standard inter-process communication mechanisms based on RPC, sockets, TCP/IP, and Ethernet. Such standard communication mechanisms perform poorly in terms of both throughput and message latency. Several prototypes developed around the world have proved that, by revisiting the implementation of the communication layer of a standard operating system kernel, a low-cost hardware platform composed of only commodity components can scale up to several tens of processing nodes and deliver communication and computation performance exceeding that delivered by conventional high-cost parallel platforms. This workshop provides a forum to discuss issues related to the design of efficient NOWs/clusters based on commodity hardware and public-domain operating systems as compared to custom hardware devices and/or proprietary operating systems.

Workshop Organizers G. Chiola (DISI, U. Genoa, I) G. Conte (CE, U. Parma, I) L.V. Mancini (DSI, U. Rome, I)

Sponsors IEEE TFCC (Task Force on Cluster Computing)

J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 1-3, 2000. © Springer-Verlag Berlin Heidelberg 2000

Program Committee

Program Chair:

C. Anglano (U. Piemonte Or., I)
M. Baker (CSM, U. Portsmouth, UK)
L. Bougé (ENS Lyon, F)
G. Chiola (DISI, U. Genoa, I)
G. Ciaccio (DISI, U. Genoa, I)
G. Conte (CE, U. Parma, I)
H.G. Dietz (ECE, Purdue U., USA)
W. Gentzsch (GENIAS Software GmbH, D)
G. Iannello (DIS, U. Napoli, I)
Y. Ishikawa (RWCP, J)
K. Li (Princeton U., USA)
L.V. Mancini (DSI, U. Roma 1, I)
T.G. Mattson (Intel Corp., USA)
W. Rehm (Informatik, T.U. Chemnitz, D)
P. Rossi (ENEA HPCN, Bologna, I)
P. Roe (Queensland U. of Tech., AUS)
D.B. Skillicorn (Queens U., CAN)
D. Tavangarian (Informatik, U. Rostock, D)
B. Tourancheau (LHPC, U. Lyon, F)

Referees

C. Anglano, O. Aumage, M. Baker, G. Chiola, G. Ciaccio, G. Conte, M. Fischer, W. Gentzsch, G. Iannello, Y. Ishikawa, L.V. Mancini, T.G. Mattson, J.-F. Mehaut, R. Namyst, W. Rehm, P. Roe, P. Rossi, D. Tavangarian, B. Tourancheau, R. Westrelin

Accepted Papers

Session 1: Cluster Interconnect Design and Implementation

- M. Trams, W. Rehm, D. Balkanski, and S. Simeonov "Memory Management in a combined VIA/SCI Hardware"
- M. Fischer, et al. "ATOLL, a new switched, high speed Interconnect in comparison to Myrinet and SCI"
- R.R. Hoare "ClusterNet: An Object-Oriented Cluster Network"

Session 2: Off-the-shelf Clusters Communication

- M. Baker, S. Scott, A. Geist, and L. Browne "GigaBit Performance under NT"
- H.A. Chen, Y.O. Carrasco, and A.W. Apon "MPI Collective Operations over IP Multicast"

Session 3: Multiple Clusters and Grid Computing

- S. Lalis, and A. Karipidis "An Open Market-Based Architecture for Distributed Computing"
- M. Barreto, R. Avila, and Ph. Navaux "The MultiCluster Model to the Integrated Use of Multiple Workstation Clusters"

Session 4: Data Intensive Applications

- S.H. Chung, et al. "Parallel Information Retrieval on an SCI-Based PC-NOW"
- M. Exbrayat, and L. Brunie "A PC-NOW Based Parallel Extension for a Sequential DBMS"

Other Activities

In addition to the presentation of contributed papers, an invited talk will be scheduled at the workshop.

Memory Management in a combined VIA/SCI Hardware

Mario Trams, Wolfgang Rehm, Daniel Balkanski and Stanislav Simeonov *

{mtr,rehm}@informatik.tu-chemnitz.de, DaniBalkanski@yahoo.com, stan@bfu.bg

Technische Universität Chemnitz, Fakultät für Informatik **, Straße der Nationen 62, 09111 Chemnitz, Germany

Abstract In this document we briefly review memory management and DMA considerations for common SCI hardware and for the Virtual Interface Architecture. On this basis we present our ideas for an improved memory management of a hardware that combines the positive characteristics of both basic technologies in order to get one completely new design rather than simply adding one to the other. The described memory management concept provides the opportunity of a real zero-copy transfer for Send-Receive operations while keeping the full flexibility and efficiency of a node's local memory management system. From the resulting hardware we expect a very good system throughput for message passing applications even if they use a wide range of message sizes.

1 Motivation and Introduction

PCI-SCI bridges (Scalable Coherent Interface [12]) are becoming an increasingly popular technological choice in the growing market of cluster computing based on non-proprietary hardware. Although the absolute performance of this communication hardware keeps increasing, it still has some disadvantages. Dolphin Interconnect Solutions AS (Norway) is the leading manufacturer of commercial SCI link chips as well as the only manufacturer of commercially available PCI-SCI bridges. These bridges offer very low latencies in the range of a few microseconds for their distributed shared memory and also reach relatively high bandwidths (more than 80 MByte/s). In our clusters we use Dolphin's PCI-SCI bridges in conjunction with standard PC components [11].

MPI applications running on our cluster can be accelerated considerably by the low latencies of the underlying SCI shared memory if it is used as the communication medium for transferring messages. MPI implementations such as [7] show a bandwidth of about 35 MByte/s for a message size of 1 kByte, which is quite a lot (refer also to figure 1 later). The major problem of MPI implementations over shared memory is high CPU utilization for long messages due to copy operations. So the good MPI performance just cited [7] is more of an academic peak performance, achieved with more or less total CPU consumption. A standard solution to this problem is to use a block-moving DMA engine for data transfers in the background. Dolphin's PCI-SCI bridges implement such a DMA engine. Unfortunately, it cannot be controlled directly from a user process without violating general protection issues. Kernel calls are therefore required, which in effect increase the minimum achievable latency and cost many additional CPU cycles.

The Virtual Interface Architecture (VIA) Specification [16] defines mechanisms for moving the communication hardware closer to the application by migrating protection mechanisms into the hardware. In fact, VIA specifies nothing completely new, since it can be seen as an evolution of U-Net [15]. But it is a first attempt to define a common industry standard for a basic communication architecture for message passing, from hardware to software layers. Due to its DMA transfers and its reduced latency because of user-level hardware access, a VIA system will increase the general system throughput of a cluster computer compared to a cluster equipped with a conventional communication system with similar raw performance characteristics. For very short transmission sizes, however, VIA comes nowhere near programmed IO over global distributed shared memory in terms of latency and bandwidth. This is natural, because a simple memory reference cannot be compared with DMA descriptor preparation and execution.

* Daniel Balkanski and Stanislav Simeonov are from the Burgas Free University, Bulgaria.
** The work presented in this paper is sponsored by the SMWK/SMWA Saxony ministries (AZ:7531.50-03-0380-98/6). It is also carried out in strong interaction with the project GRANT SFB393/B6 of the DFG (German National Science Foundation).
J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 4-15, 2000. © Springer-Verlag Berlin Heidelberg 2000

 

         













[Plot not recovered; x-axis: Message Size [Bytes]]

Figure 1. Comparison of MPI Implementations for Dolphin's PCI-SCI Bridges and GigaNet's cLAN VIA Hardware

Figure 1 shows bandwidth curves of MPI implementations for both an SCI and a native VIA implementation (GigaNet cLAN). The hardware is in both cases based on the PCI bus, and the machines on which the measurements were taken are comparable. The concrete values are based on ping-pong measurements and were taken from [7] in the case of SCI, and from [10] (Linux case) for the cLAN hardware.

As expected, SCI achieves the better bandwidth in the range of smaller message sizes. For larger message sizes the cLAN implementation demonstrates higher bandwidth because of its advanced DMA engine. No less important is the fact that a DMA engine leaves the CPU more time for computations. Details of such CPU utilization considerations are outside the scope of this paper and are already discussed in [14] and [8]. To summarize these motivating facts: besides a powerful DMA engine controllable from user level, a distributed shared memory for programmed IO is an important feature that should not be missing from a communication system.

2 What are the Memory Management Considerations?

First of all we want to define briefly what memory management means in this document. It comprises the following aspects, expressed in the form of questions:

1. How is a process' memory area made available to the Network Interface Controller (NIC), and in what way is main memory protected against wrong accesses?
2. At which point in the system does a DMA engine work, and how are the transactions of this DMA engine validated?
3. In which way is the memory of a process on a remote node made accessible to a local process?

Based on these questions we can classify the different communication system architectures in terms of the advantages and disadvantages of their memory management. The analysis in the following sections reveals these advantages and disadvantages for the common PCI-SCI architecture and for the VI Architecture.

3 PCI-SCI vs. VIA discussion and comparison

3.1 Question 1: How a process' memory area is made available to the NIC and in what way main memory is protected against wrong accesses?

Common PCI-SCI case: Current PCI-SCI bridges developed by Dolphin realize a quite static memory management [4] to get access to main memory, or rather to PCI address space. To avoid unwanted accesses to sensitive locations, the PCI-SCI bridge is set up to allow accesses only to a dedicated memory window. Memory access requests caused by remote machines are only allowed if they fall within the specified window. This causes two big disadvantages:

- Contiguous exported regions must also be contiguous inside the physical address space. Additionally, these regions must be aligned to the minimum exportable block size, which is typically quite large (512kB for Dolphin's bridges).
- Exported memory must reside within this window.

To handle these problems it is necessary to reserve main memory exclusively for SCI purposes. In practice this 'wastes' a part of memory if it is not actually exported later. As a consequence, these disadvantages of the common PCI-SCI bridge architecture make its use with MPI applications very difficult, especially in view of zero-copy transfer operations. Because data transfers can be processed using the reserved memory region only, MPI applications would have to use special malloc() functions for allocating data structures that are later used for send/receive purposes. But this violates a major goal of the MPI standard: architecture independence.
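The two restrictions listed above can be made concrete with a small check. The following is only an illustrative sketch in Python, not Dolphin's driver logic: the function name `exportable`, the 4kB host page size, and the example window are assumptions introduced here; only the 512kB block size comes from the text.

```python
BLOCK = 512 * 1024  # minimum exportable block size quoted in the text
PAGE = 4096         # assumed host page size (x86)


def exportable(phys_pages, window_base, window_size, block=BLOCK, page=PAGE):
    """Hypothetical check: can these pages be exported through a static window?

    phys_pages holds the physical address of each page of the region,
    listed in the order in which the region is seen by the exporter.
    """
    if not phys_pages:
        return False
    start = phys_pages[0]
    size = len(phys_pages) * page
    # Restriction 1a: the region must be physically contiguous.
    contiguous = all(b - a == page for a, b in zip(phys_pages, phys_pages[1:]))
    # Restriction 1b: the region must start on a block boundary.
    aligned = start % block == 0
    # Restriction 2: the region must lie entirely inside the reserved window.
    inside = window_base <= start and start + size <= window_base + window_size
    return contiguous and aligned and inside
```

A buffer obtained from an ordinary allocator will usually consist of scattered physical pages and therefore fail the first test, which is why a special allocation function would be needed in this architecture.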

VIA case: The VI Architecture gives the NIC a much better view of main memory. Instead of a flat one-to-one representation of the physical memory space it implements a more flexible lookup-table address translation. Comparing this mechanism with its PCI-SCI counterpart, the following advantages become visible:

- Contiguous regions seen by the VIA hardware are not required to be contiguous inside the host physical address space as well.
- Accesses to sensitive address ranges are prevented by simply not including them in the translation table.
- The NIC can get access to every physical memory page, even if this may not be possible for all physical pages at once (when the translation table has fewer entries than the number of physical pages).

The translation table serves not only for address translation, but also for memory protection. To achieve this, a so-called Protection Tag is included in each translation and protection table entry. This tag is checked prior to each access to main memory to qualify the access. For more information about this see section 3.2.

Conclusions regarding question 1: It is clear that the VIA approach offers much more flexibility. Using this local memory access strategy in a PCI-SCI bridge design will eliminate all of the problems seen in current designs. Of course, the drawback is more complicated hardware and the additional cycles needed to translate the address.
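The lookup-table translation described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the data layout of the VIA specification: the class name `TranslationTable`, its methods, and the 4kB page size are hypothetical, and protection tags are left out for brevity.

```python
PAGE = 4096  # assumed page size


class TranslationTable:
    """Toy NIC-side table: NIC-virtual page number -> physical page number."""

    def __init__(self):
        self.entries = {}

    def register(self, nic_vpage, phys_pages):
        """Map consecutive NIC-virtual pages onto arbitrary physical pages."""
        for i, ppage in enumerate(phys_pages):
            self.entries[nic_vpage + i] = ppage

    def translate(self, nic_vaddr):
        """Return the physical address, or refuse if no entry exists."""
        vpage, offset = divmod(nic_vaddr, PAGE)
        if vpage not in self.entries:
            # Sensitive or unregistered memory is simply absent from the table.
            raise PermissionError(f"no translation for NIC page {vpage}")
        return self.entries[vpage] * PAGE + offset
```

Registering the physical pages 7, 3 and 42 at NIC-virtual page 0 gives the NIC a contiguous three-page region although the pages are scattered in physical memory; any address outside that region is rejected.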

3.2 Question 2: At which point in the system a DMA engine is working and how are the transactions of this DMA engine validated?

Common PCI-SCI case: The DMA engine accesses local memory in the same way as already discussed in section 3.1. It therefore inherits all the disadvantages of dealing with physical addresses on the PCI-SCI bridge. For accesses to global SCI memory a more flexible translation table is used. This Downstream Translation Table realizes a virtual view onto global SCI memory, similar to the view a VIA NIC has onto local memory. Every page of the virtual SCI memory can be mapped to a page of the global SCI memory. Regarding validation, the DMA engine cannot distinguish between regions owned by different processes (neither local nor remote). Therefore the hardware cannot check access rights on the fly. Instead, the DMA descriptor containing the information about the block to copy must be guaranteed to be correct. In other words, the operating system kernel has to prepare, or at least check, every DMA descriptor to be posted to the NIC. This requires OS calls, which we want to avoid at all costs.

VIA case: A VIA NIC implements mechanisms to execute a DMA descriptor from user level while assuring protection among multiple processes using the same VIA hardware. A user process can own one or more interfaces of the VIA hardware (so-called Virtual Interfaces). In other words, a virtual interface presents its owner with what appears to be its own, unique communication hardware. The connection between the virtual interfaces and the VIA hardware is made by Doorbells, each of which represents a virtual interface with its specific control registers. A user-level process can insert a new DMA descriptor into a job queue of the VIA hardware by writing an appropriate value into a doorbell assigned to this process. The size of a doorbell is equal to the page size of the host computer, so the decision which process may access which doorbell (or virtual interface) can simply be left to the host's virtual memory management system. Protection during DMA transfers is achieved by means of Protection Tags. The DMA engine uses these tags to check whether the virtual interface currently being processed is allowed to access a memory page. The protection tag of the accessed memory page is compared with the protection tag assigned to the virtual interface of the process that provided this DMA descriptor. Only if both tags are equal is the access legal and performed. A more detailed description of this mechanism is outside the scope of this document (refer to [13] and [16]).
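The tag comparison described above can be illustrated with a short Python sketch. It is a simplified model, not the descriptor format of the VIA specification: the `Descriptor` fields, the `validate` function, the page-tag dictionary, and the 4kB page size are all assumptions made for this example.

```python
from dataclasses import dataclass

PAGE = 4096  # assumed page size


@dataclass
class Descriptor:
    """Hypothetical DMA descriptor posted through a virtual interface."""
    vi_tag: int     # protection tag of the posting virtual interface
    nic_vaddr: int  # start address in the NIC's virtual view of memory
    length: int     # number of bytes to transfer


def validate(desc, page_tags):
    """Accept the transfer only if every touched page carries the VI's tag.

    page_tags maps a NIC-virtual page number to the protection tag stored
    in that page's table entry; unregistered pages have no entry.
    """
    if desc.length <= 0:
        return False
    first = desc.nic_vaddr // PAGE
    last = (desc.nic_vaddr + desc.length - 1) // PAGE
    return all(page_tags.get(p) == desc.vi_tag for p in range(first, last + 1))
```

Because the check needs only the tag of the virtual interface and the per-page table entries, it can run in hardware for each descriptor, which is what removes the kernel from the data path.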

Conclusions regarding question 2: The location of the DMA engine is in principle the same in both cases. The difference is that in the case of VIA a real lookup-table based address translation is performed between the DMA engine and PCI memory. That is, the VIA DMA engine operates on a virtual local address space, while the PCI-SCI DMA engine operates directly on local physical addresses. The answer regarding access protection is simple: the common PCI-SCI DMA engine supports no protection in hardware and must rely on correct DMA descriptors. The VIA hardware supports full protection in hardware, with the DMA engine being only one part of the whole protection mechanism.

3.3 Question 3: In which way memory of a process on a remote node is made accessible for a local process?

Common PCI-SCI case: Making remote memory accessible is, of course, a key function in an SCI system. Each PCI-SCI bridge offers a special PCI memory window which is practically the virtual SCI memory seen by the card. So the same SCI memory that the DMA engine may access can also be accessed via memory references (also called programmed IO here). The procedure of making globally available SCI memory accessible to the local host is also referred to as importing global memory into the local address space. On the other side, every PCI-SCI bridge can open a window to the local address space and make it accessible to remote SCI nodes. The mechanism of this window is already described in section 3.1 regarding question 1. The procedure of making local memory globally accessible is also called exporting local memory into global SCI space. As far as memory references are concerned, protection is fully guaranteed for imported and exported memory: a process can access a remote process' memory page only if it has obtained a valid mapping of that page.

VIA case: The VI Architecture offers in principle no mechanism for accessing remote memory in the way a distributed shared memory communication system such as SCI does. But there is an indirect way, using a so-called Remote DMA (RDMA) mechanism. This method is very similar to DMA transfers as they are used in common PCI-SCI bridges. A process that wants to transfer data between its local memory and the memory of a remote process specifies an RDMA descriptor. This contains an address in the local VIA virtual address space and an address in the remote node's local VIA virtual address space.

Conclusions regarding question 3: While a PCI-SCI architecture allows processes to really share their memory globally across a system, this is not possible with VIA hardware. Of course, VIA was never designed for realizing distributed shared memory.

4 A new PCI-SCI Architecture with VIA Approaches

In our design we want to combine the advantages of an ultra-low-latency SCI shared memory with a VIA-like advanced memory management and protected user-level DMA. This combination will make our SCI hardware more suitable for our message passing oriented parallel applications, which require short as well as long transmission sizes.

4.1 Advanced Memory Management

In order to eliminate the restrictions discussed above, namely contiguous and aligned exported memory regions that must reside in a special window, our PCI-SCI architecture will implement two address translation tables, one for local and one for remote memory accesses. In contrast, common PCI-SCI bridges use only one translation table, for accesses to remote memory. This new and more flexible memory management, combined with a reduced minimum page size for distributed shared memory, leads to much better usage of the main memory of the host system. In fact, our targeted amount of imported SCI memory is 1GB with a page granularity of 16kB. With a larger downstream address translation table this page size may be reduced further to match exactly the page size used in the host systems (such as 4kB for x86 CPUs). For the granularity of memory to be exported (in SCI terminology) or made available for VIA operations there is no question: it must be equal to the host system page size, in other words 4kB, since the primary target system is an x86 one. 128MB is the planned maximum window size here.
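The table sizes implied by these figures follow from a simple division. The short Python sketch below works them out, assuming binary units (1GB = 2^30 bytes, 1kB = 2^10 bytes) and exactly one table entry per page; both assumptions and the helper name `table_entries` are introduced here and are not stated in the text.

```python
KB, MB, GB = 1024, 1024 ** 2, 1024 ** 3  # binary units (assumption)


def table_entries(window_bytes, page_bytes):
    """Entries needed when each page of the window gets one table entry."""
    if window_bytes % page_bytes:
        raise ValueError("window must be a whole number of pages")
    return window_bytes // page_bytes


# Downstream table for 1GB of imported SCI memory at 16kB granularity.
downstream_16k = table_entries(1 * GB, 16 * KB)
# The same window at host page granularity needs four times as many entries.
downstream_4k = table_entries(1 * GB, 4 * KB)
# Upstream table for the planned 128MB export window at 4kB pages.
upstream_4k = table_entries(128 * MB, 4 * KB)

print(downstream_16k, downstream_4k, upstream_4k)  # 65536 262144 32768
```

The factor of four between the two downstream figures is the "larger downstream address translation table" that the text names as the price of matching the host page size.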

Operation of Distributed Shared Memory from a memory-related point of view  

  

 



 

  

  

  

    

   

  

Figure2. Address Translations between exporting and importing Processes for programmed IO

Figure 2 gives an overall example of exporting and importing memory regions. The example illustrates the address translations performed when the importing process accesses memory exported by a process on the remote node. The exporting process exports some of its previously allocated memory by registering it with its local PCI-SCI hardware. Memory is registered on a per-page basis. Recall that a common PCI-SCI system would require this exported memory to be physically located inside the special memory area reserved for SCI purposes. Here, in contrast, we can take advantage of a virtual view onto local memory similar to that of the VI Architecture.


Once the upstream address translation table entries are adjusted, the exported memory can be accessed from remote machines, since it has become part of the global SCI memory. To access this memory, the remote machine must import it first. The major step here is to set up entries inside its downstream address translation table so that they point to the region inside the global SCI memory that belongs to the exporter. The only remaining task is then to map the physical PCI pages that correspond to the prepared downstream translation entries into the virtual address space of the importing process.

When the importing process accesses the imported area, the transaction is forwarded through the PCI-SCI system and the address is translated three times. First, the host MMU translates the address from the process' virtual address space into the physical address space (or rather PCI space). Then the PCI-SCI bridge takes up the transaction and translates the address into the global SCI address space using the downstream translation table. The downstream address translation includes generation of the remote node id and the address offset inside the remote node's virtual local PCI address space. When the remote node receives the transaction, it translates the address to the correct local physical (or rather PCI) address using the upstream address translation table.
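The three translation steps described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class `PageTable`, the functions `remote_access` and `table_entries`, and all addresses are hypothetical and chosen for illustration; the real translation is performed by the host MMU and the bridge hardware. The page sizes (4kB host pages, 16kB import granularity) and window sizes are taken from section 4.1, interpreted here as binary units.

```python
class PageTable:
    """Toy translation table: page number -> target, with a fixed page size."""

    def __init__(self, page_size):
        self.page_size = page_size
        self.entries = {}

    def map(self, page_no, target):
        self.entries[page_no] = target

    def translate(self, addr):
        page_no, offset = divmod(addr, self.page_size)
        if page_no not in self.entries:
            # An access without a valid mapping is rejected.
            raise PermissionError(f"page {page_no} is not mapped")
        return self.entries[page_no], offset


def remote_access(vaddr, mmu, downstream, upstream_by_node):
    """Follow one programmed-IO access through all three translations."""
    # 1. Host MMU: process virtual address -> local physical (PCI) address.
    pci_base, offset = mmu.translate(vaddr)
    # 2. Downstream table: PCI address -> (remote node id, address inside
    #    the remote node's virtual local PCI address space).
    (node_id, remote_base), offset = downstream.translate(pci_base + offset)
    # 3. Upstream table on the remote node: -> remote physical (PCI) address.
    phys_base, offset = upstream_by_node[node_id].translate(remote_base + offset)
    return node_id, phys_base + offset


def table_entries(window_bytes, page_bytes):
    """Number of translation entries needed to cover a window."""
    return window_bytes // page_bytes


KB, MB, GB = 1024, 1024**2, 1024**3

mmu = PageTable(4 * KB)                 # host page size (x86)
mmu.map(5, 0x4000)                      # virtual page 5 -> PCI address 0x4000
downstream = PageTable(16 * KB)         # import granularity from section 4.1
downstream.map(1, (2, 0x8000))          # -> node 2, offset 0x8000
upstream_node2 = PageTable(4 * KB)      # export granularity = host page size
upstream_node2.map(8, 0x123000)         # -> physical page on node 2

node, phys = remote_access(0x5010, mmu, downstream, {2: upstream_node2})
print(node, hex(phys))
print(table_entries(1 * GB, 16 * KB), table_entries(128 * MB, 4 * KB))
```

Under these assumptions, the 1GB import window at 16kB granularity needs 65536 downstream entries, and the 128MB export window at 4kB granularity needs 32768 upstream entries.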

4.3 Operation of Protected User-Level Remote DMA from a memory-related point of view

Figure 3 shows the principle of operation of the DMA engine in our PCI-SCI bridge design. The figure shows essentially the same address spaces and translation tables as figure 2. Only the processes' virtual address spaces and the corresponding translations into physical address spaces are omitted so as not to overload the figure. The DMA engine inside the bridge is surrounded by two address translation tables or, more precisely, by two address translation and protection tables. On the active node (that is, the node where the DMA engine is executing DMA descriptors; node 1 here) both translation tables are involved. On the remote node, however, practically nothing has changed compared to the programmed IO case. Hence the remote node does not distinguish whether a transaction was generated by the DMA engine or not. Both translation tables of a PCI-SCI bridge incorporate protection tags as described in section 3.2. But while VIA uses these tags only for accesses to local memory, here they are also used for accesses to remote SCI memory. Together with the VIA mechanisms for descriptor notification and execution, this makes the DMA engine unable to access wrong memory pages, whether local (exported) or remote (imported). Note that protection tags are actually checked only for the DMA engine and only on the active node (node 1 in figure 3). In all other cases the same translation and protection tables are used, but the protection tags inside are ignored.
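The tag check on the active node can be sketched as follows. This is a simplified model, not the bridge's actual logic: the names `Entry`, `Descriptor` and `dma_allowed`, the field layout, and the tag values are all assumptions made for illustration. The sketch only shows the idea that every page touched by a descriptor, in both the local and the remote table, must carry the descriptor's protection tag.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    target: int   # translated page base address
    tag: int      # protection tag of the owning process


@dataclass
class Descriptor:
    local_addr: int    # address in the local (exported) address space
    remote_addr: int   # address in the imported SCI address space
    length: int        # transfer length in bytes (> 0)
    tag: int           # protection tag of the process that posted it


class ProtectionTable:
    """Toy translation and protection table (hypothetical layout)."""

    def __init__(self, page_size):
        self.page_size = page_size
        self.entries = {}

    def set(self, page_no, target, tag):
        self.entries[page_no] = Entry(target, tag)


def dma_allowed(desc, local_table, remote_table):
    """Check every page touched by the descriptor in both tables."""
    for table, addr in ((local_table, desc.local_addr),
                        (remote_table, desc.remote_addr)):
        first = addr // table.page_size
        last = (addr + desc.length - 1) // table.page_size
        for page_no in range(first, last + 1):
            entry = table.entries.get(page_no)
            # Missing entry or foreign tag: the DMA engine must not proceed.
            if entry is None or entry.tag != desc.tag:
                return False
    return True


local = ProtectionTable(4096)
local.set(0, 0x10000, tag=7)
local.set(1, 0x20000, tag=9)      # page owned by a different process
remote = ProtectionTable(16384)
remote.set(0, 0x0, tag=7)

print(dma_allowed(Descriptor(0, 0, 100, tag=7), local, remote))
print(dma_allowed(Descriptor(4000, 0, 200, tag=7), local, remote))
```

The second descriptor crosses from local page 0 into page 1, which carries a different tag, so the transfer is rejected as a whole in this model.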

Figure 3. Address Translations performed during RDMA Transfers

4.4 A free choice of using either Programmed I/O or User-Level Remote DMA

This kind of global memory management allows applications, or more precisely communication libraries, to decide on the fly, depending on the data size, how the data should be transferred. For a short message a PIO transfer may be used, and for a longer message an RDMA transfer may be suitable. The corresponding remote node is not involved in this decision since it does not see any difference. This keeps the protocol overhead very low. Finally, recall the VIA case. Although VIA already provides a relatively low-latency protected user-level remote DMA mechanism without the memory handling problems of common PCI-SCI, it has nothing like a PIO mechanism for realizing a distributed shared memory. Hence the advantages of an ultra-low latency PIO transfer are not available there.
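A communication library could make this decision with a simple size threshold. The following is a minimal sketch; the function name `choose_transfer` and the threshold of 1024 bytes are assumptions for illustration, since the text gives no concrete crossover point, and a real library would determine it by measurement.

```python
PIO_THRESHOLD = 1024  # bytes; hypothetical crossover point, not from the paper


def choose_transfer(nbytes, threshold=PIO_THRESHOLD):
    """Pick the transfer mechanism for a message of nbytes bytes."""
    if nbytes < 0:
        raise ValueError("message size must be non-negative")
    # Short messages: low start-up cost of programmed IO wins.
    # Long messages: RDMA frees the CPU during the transfer.
    return "PIO" if nbytes <= threshold else "RDMA"


for size in (64, 1024, 65536):
    print(size, choose_transfer(size))
```

Because the remote node sees the same transactions in both cases, the choice stays local to the sender and requires no extra protocol messages.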

5 Influence on MPI Libraries

To show the advantages of the presented advanced memory management, we take a look at the so-called Rendezvous Protocol that is commonly used for Send-Receive operations. Figure 4 illustrates the principle of the Rendezvous protocol used in common MPI implementations [7] based on Dolphin's PCI-SCI bridges. One big problem in this model is the copy operation that takes place on the receiver's side to take data out of the SCI buffer. Although the resulting increase in latency can in principle be hidden by the overlapping mechanism, a lot of CPU cycles are burned there.

Figure 4. Typical Rendezvous-Protocol in common PCI-SCI Implementations

Figure 5. Improved Rendezvous-Protocol based on advanced PCI-SCI Memory Management

[Figures 4 and 5: message sequence diagrams between Sender and Receiver. Recoverable labels: Request_Send, Ok_to_Send, Block_Ready, Memory, Ready, Transfer completed, CPU busy, CPU free.]

With our proposed memory management there is a chance to remove this copy operation on the receiver's side. The basic operation of the Rendezvous protocol can be implemented as described in figure 5. Here the sender informs the receiver as usual. Before the receiver sends back an acknowledgement, it checks whether the data structure the data is to be written to is already exported to the sender. If not, the memory region that includes the data structure is registered with the receiver's PCI-SCI bridge and exported to the sender. The sender itself must also import this memory region if it has not already done so. After this, the sender copies the data from the private memory of the sending process directly into the private memory of the receiving process. As a further optimization, the sender may decide to use the DMA engine to copy the data without further CPU intervention. This decision will typically be based on the message size.
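The sequence of steps in the improved Rendezvous protocol can be sketched as a small simulation. This is an illustrative model only: the `Node` class, the `rendezvous_send` function, the bookkeeping sets, and the 4096-byte DMA threshold are assumptions, not part of any MPI library or of the authors' implementation. The message names Request_Send and Ok_to_Send are taken from the figure labels.

```python
class Node:
    """Toy model of one process together with its PCI-SCI bridge state."""

    def __init__(self, name):
        self.name = name
        self.exported = set()   # (buffer id, peer name) already exported
        self.imported = set()   # (peer name, buffer id) already imported
        self.buffers = {}       # buffer id -> bytearray (private memory)


def rendezvous_send(sender, receiver, data, buf_id, dma_threshold=4096):
    """Run one send and return the list of protocol steps performed."""
    target = receiver.buffers[buf_id]
    if len(data) > len(target):
        raise ValueError("receive buffer too small")

    steps = ["Request_Send"]
    # Receiver: export the destination region only if not yet exported.
    if (buf_id, sender.name) not in receiver.exported:
        receiver.exported.add((buf_id, sender.name))
        steps.append("export")
    steps.append("Ok_to_Send")
    # Sender: import the region only if not yet imported.
    if (receiver.name, buf_id) not in sender.imported:
        sender.imported.add((receiver.name, buf_id))
        steps.append("import")
    # Direct copy into the receiver's private memory; no staging buffer.
    mode = "RDMA" if len(data) > dma_threshold else "PIO"
    target[:len(data)] = data
    steps.append("copy:" + mode)
    return steps


a, b = Node("A"), Node("B")
b.buffers["recv0"] = bytearray(16)
print(rendezvous_send(a, b, b"hello", "recv0"))
print(rendezvous_send(a, b, b"again", "recv0"))
```

In this model the export and import steps appear only in the first exchange; later sends to the same buffer reuse the existing mappings, which is where the saving over the copy-based protocol would come from.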

6 State of the project (November 1999)

We developed our own FPGA-based PCI-SCI card and already have prototypes of this card running. At the moment they offer only a so-called Manual Packet Mode, which is intended for sideband communication besides the regular programmed IO and DMA transfers. The card itself is a 64Bit/33MHz PCI Rev.2.1 one [8]. As SCI link controller we are using Dolphin's LC-2 for now, and we intend to migrate to the LC-3 as soon as it is available. The reprogrammable FPGA design leads to a flexible, reconfigurable hardware and also offers the opportunity for experiments. Linux low-level drivers for Alpha and x86 platforms and several configuration/test programs were developed. In addition, our research group is working on an appropriate higher-level Linux driver for our card [5, 6]. This driver offers a software interface (advanced Virtual Interface Provider Library) that combines SCI and VIA features such as importing/exporting memory regions, VI connection management etc. It also emulates parts of the hardware, so that other software can run on top of it even though the real hardware is not yet available. As an example, a parallelized MPI version of the popular raytracer POVRAY is already running over this emulation. This program uses an MPI-2 library for our combined SCI/VIA hardware. This library is also under development at our department [3]. For more details and latest news refer to our project homepage at http://www.tu-chemnitz.de/~mtr/VIA_SCI/

7 Other Works on SCI and VIA

Dolphin already presented some performance measurements in [1] for their VIA implementation, which is an emulation over SCI shared memory. Although the presented VIA performance looks very good, it is again achieved at the cost of excessive CPU utilization. The number of vendors of native VIA hardware is growing steadily. One of these companies is GigaNet [17], for which performance values are already available. GigaNet states on their web pages latencies of 8µs for short transmission sizes. Dolphin gives a latency of 2.3µs for PIO operations (remote memory access). This demonstrates the relatively large performance advantage that a distributed shared memory offers here. The University of California, Berkeley [2] and the Berkeley Lab [9] are doing more open research, also in the direction of improving the VIA specification. The work at the University of California, Berkeley concentrates more on VIA hardware implementations based on Myrinet. In contrast, the work at the Berkeley Lab is targeted mainly at software development for Linux.

8 Conclusions and Outlook

The combined PCI-SCI/VIA system is not just a simple result of adding two different things. Rather, it is a real integration of both in one design. More concretely, it is an integration of concepts defined by the VIA specification into a common PCI-SCI architecture, since the major PCI-SCI characteristics are kept. The result is a hardware design with completely new qualitative characteristics. It combines the most powerful features of SCI and VIA in order to obtain highly efficient messaging mechanisms and high throughput over a broad range of message lengths. The advantage that MPI libraries can gain from a more flexible memory management was illustrated for the case of a Rendezvous Send-Receive for MPI. The final proof in practice is still pending because hardware with all features implemented is not yet available.

References

1. Torsten Amundsen and John Robinson: High-performance cluster-computing with Dolphin's CluStar PCI adapter card. In: Proceedings of SCI Europe '98, Pages 149-152, Bordeaux, 1998.
2. Philip Buonadonna, Andrew Geweke: An Implementation and Analysis of the Virtual Interface Architecture. University of California at Berkeley, Dept. of Computer Science, Berkeley, 1998. www.cs.berkeley.edu/~philipb/via/
3. A new MPI-2-Standard MPI Implementation with support for the VIA. www.tu-chemnitz.de/informatik/RA/projects/chempi-html/
4. Dolphin Interconnect Solutions AS: PCI-SCI Bridge Spec. Rev. 4.01. 1997.
5. Friedrich Seifert: Design and Implementation of System Software for Transparent Mode Communication over SCI. Student Work, Dept. of Computer Science, University of Technology Chemnitz, 1999. See also: www.tu-chemnitz.de/~sfri/publications.html
6. Friedrich Seifert: Development of System Software to integrate the Virtual Interface Architecture (VIA) into Linux Operating System Kernel for optimized Message Passing. Diploma Thesis, TU-Chemnitz, Sept. 1999. See also: www.tu-chemnitz.de/informatik/RA/themes/works.html
7. Joachim Worringen and Thomas Bemmerl: MPICH for SCI-connected Clusters. In: Proceedings of SCI-Europe'99, Toulouse, Sept. 1999, Pages 3-11. See also: wwwbode.in.tum.de/events/sci-europe99/
8. Mario Trams and Wolfgang Rehm: A new generic and reconfigurable PCI-SCI bridge. In: Proceedings of SCI-Europe'99, Toulouse, Sept. 1999, Pages 113-120. See also: wwwbode.in.tum.de/events/sci-europe99/
9. M-VIA: A High Performance Modular VIA for Linux. Project Homepage: http://www.nersc.gov/research/FTG/via/
10. MPI Software Technology, Inc.: Performance of MPI/Pro for cLAN on Linux and Windows. www.mpi-softtech.com/performance/perf-win-lin.html
11. The Open Scalable Cluster ARchitecture (OSCAR) Project. TU Chemnitz. www.tu-chemnitz.de/informatik/RA/projects/oscar.html/
12. IEEE Standard for Scalable Coherent Interface (SCI). IEEE Std. 1596-1992. SCI Homepage: www.SCIzzL.com
13. Mario Trams: Design of a system-friendly PCI-SCI Bridge with an optimized User-Interface. Diploma Thesis, TU-Chemnitz, 1998. See also: www.tu-chemnitz.de/informatik/RA/themes/works.html
14. Mario Trams, Wolfgang Rehm, and Friedrich Seifert: An advanced PCI-SCI bridge with VIA support. In: Proceedings of 2nd Cluster-Computing Workshop, Karlsruhe, 1999, Pages 35-44. See also: www.tu-chemnitz.de/informatik/RA/CC99/
15. The U-Net Project: A User-Level Network Interface Architecture. www2.cs.cornell.edu/U-Net
16. Intel, Compaq and Microsoft: Virtual Interface Architecture Specification V1.0. VIA Homepage: www.viarch.org
17. GigaNet Homepage: www.giganet.com

ATOLL, a new switched, high speed Interconnect in Comparison to Myrinet and SCI

Markus Fischer, Ulrich Bruning, Jorg Kluge, Lars Rzymianowicz, Patrick Schulz, Mathias Waack

University of Mannheim, Germany, markus@atoll-net.de

Abstract. While standard processors achieve supercomputer performance, a performance gap exists between the interconnect of MPP's and COTS. Standard solutions like Ethernet cannot keep up with the demand for high speed communication of today's powerful CPU's. Hence, high speed interconnects have an important impact on a cluster's performance. While standard solutions for processing nodes exist, communication hardware is currently only available as a special, expensive, non-portable solution. ATOLL presents a switched, high speed interconnect, which fulfills the current needs for user level communication and concurrency in computation and communication. ATOLL is a single chip solution; additional switching hardware is not required.

1 Introduction

Using commodity off the shelf components (COTS) is a viable option to build up powerful clusters not only for number crunching but also for highly parallel, commercial applications. First clusters already show up in the Top500 [6] list and the number of entries is expected to rise continuously. Powerful CPU's such as the Intel PIII Xeon with SMP functionality achieve processing performance known from supercomputers. Currently a high percentage of existing clusters is equipped with standard solutions such as Fast Ethernet. This is mainly for compatibility reasons, since applications based on standardized TCP/IP are easily portable. This protocol, however, is known to cause too much overhead [7]. Lowering latency in particular is an important key to achieving good communication performance. A survey on message sizes shows that protocols and hardware have to be designed to handle short messages extremely well [14]:

- in seven parallel scientific applications 30% of the messages were between 16 bytes and a kilobyte
- the median message sizes for TCP and UDP traffic in a departmental network were 32 and 128 bytes respectively
- 99% of TCP and 86% of the UDP traffic was less than 200 bytes
- on a commercial database all messages were less than 200 bytes
- the average message size ranges between 19 and 230 bytes

J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 16-27, 2000. © Springer-Verlag Berlin Heidelberg 2000


Recent research with Gigabit/s interconnects, such as Myrinet and SCI, has shown that one key to achieving low latency and high bandwidth is to bypass the operating system, avoiding a trap into the system: User Level Communication (ULC) gives the user application full control over the interconnect device (BIP, HPVM, UNET, AM). While ULC shortens the critical path when sending a message, a global instance such as the kernel is no longer involved in scheduling outgoing data. This has the disadvantage that security issues have to be addressed if different users are running their applications. Thrashing and context switching between multiple processes can also lower performance. Current research examines how to multiplex a network device efficiently [8] if this is not supported by the NI hardware itself. Therefore, a unique solution would be to support multiple NIs directly in hardware. Designing interconnects for the standard PCI interface cuts down production costs due to higher volume. Nevertheless, necessary additional switching hardware increases the total cost per node significantly. While PCI is a standard interface designed for IO, current PCI bridges are limited to a bandwidth of 132 MB/s when running at 32 bit/33 MHz. Upcoming mainboards will run at 64 bit/66 MHz and achieve a maximum bandwidth of 528 MB/s.

The paper is organized as follows. The next section evaluates the design space for network interfaces and gives an overview of the key functionality needed to achieve good communication performance. Section 3 describes the design issues of ATOLL in comparison to Myrinet and SCI. Section 4 discusses software layers, such as the low level API and message passing interfaces for ATOLL and other NIC's. Finally, section 5 concludes our paper.
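The PCI peak bandwidth figures quoted above follow directly from bus width times clock rate, with one transfer per clock cycle. The short calculation below reproduces them; the function name `pci_peak_mb_per_s` is chosen for illustration, and the nominal clock values of 33 and 66 MHz are used as in the text.

```python
def pci_peak_mb_per_s(width_bits, clock_mhz):
    """Theoretical peak: one bus-width transfer per clock cycle."""
    return (width_bits // 8) * clock_mhz


print(pci_peak_mb_per_s(32, 33))   # 32 bit / 33 MHz
print(pci_peak_mb_per_s(64, 66))   # 64 bit / 66 MHz
```

These are upper bounds for burst transfers; sustained throughput in practice is lower.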

2 Design Space for Network Interfaces

In this section we evaluate current NICs and characterize the design space of IO features in general, differentiating between hardware and software issues. From the hardware's point of view, features of interest include a special purpose processor on board, additional (staging) memory, support of concurrency by allowing both PIO and DMA operations, and support for shared memory at the lowest level. The requirement for additional switching hardware to build up large scaling clusters is another concern. From the software's point of view it is interesting to examine which protocols are offered and how they are implemented, whether MMU functionality is implemented to allow RDMA, and how message delivery and arrival are detected. The latter will have a major impact on performance. We break down the design space into the following items:

- Concurrency with PIO and DMA Transactions, MMU Functionality to support RDMA

Basically, when sending a message, the NIC's API chooses PIO or DMA for the transfer, depending on the message size. PIO has the advantage of low start-up costs to initiate the transfer. However, since the processor is transferring data directly to the network, it is busy during the entire transaction. To allow concurrency, the DMA mode must be chosen, in which the processor only prepares a message by creating a descriptor pointing to the actual message. This descriptor is handed to the DMA engine, which picks up the information and injects the message into the network. It is important to know that the DMA engine relies on pinned down memory, since otherwise pages can be swapped out of memory and the engine usually cannot page on demand by itself. The advantage of using DMA is that it hides latency (allowing for multiple sends and receives). However, it has a higher start-up time than PIO. Typically, a threshold value determines which protocol is chosen for the transaction. Both mechanisms also play an important role when trying to avoid memory copies.

- Intelligent Network Adapter, Hardware and Software Protocols

The most important benefit of an intelligent network adapter (processor and SRAM on board) is flexibility in programming message handling functionality. Protocols for error detection and correction can be programmed in software, and new techniques can be applied as well (VIA). Support for concurrency is also improved. Additional memory on board lowers congestion and decreases the possibility of deadlocks on the network. It allows incoming data to be buffered, thus emptying the network links on which the message has been transferred. However, the memory size is usually limited and the memory is expensive; the number of data copies also rises. Another disadvantage of this combination is that the speed of an on-board processor cannot keep up with that of the main processing unit. Finally, programming the network adapter is a versatile task.

- Switches, Scalability and Routing

A benchmark of a point to point routine typically only shows the best performance for non-standard situations. Since a parallel application usually consists of dozens of processes communicating in a more or less fixed pattern, measuring the bisection bandwidth gives better information about the underlying communication hardware. A cost-effective SAN has bidirectional links and allows sending and receiving concurrently. A key factor for performance is scalability when switches are added to form a multistage connection network for larger clusters. Here blocking behavior becomes the major concern. Another point of interest is the connection from NIC to NIC: data link cables must provide a good compromise between data path width and transfer speed.

- Hardware support for Shared Memory (Coherency) and NI locations

Currently there is a trend towards clustering bigger SMP nodes. Within an SMP node, a cache coherence protocol like MESI synchronizes the caches to achieve data consistency. To add this functionality to IO devices (such as the NIC), they would have to participate in the cache coherence protocol and be able to snoop on the system bus. However, this would require a special solution for every processor type and system and cannot be propagated as a commodity solution. With growing distance between the NI and the processor, the latency of communication operations rises and, at the same time, the bandwidth declines. The only position that results in a wide distribution and, thus, the necessary higher production volumes is the standardized PCI bus. This leads to the loss of a number of functions, such as cache coherent accesses to the main memory of the processor. As the NI on the PCI card is (and has to be) independent of the processor used, functions like the MMU cannot be recreated in the NI, as they differ according to which processor is being used. For this purpose an adaptable hardware realization of the basic mechanisms or an additional programmable processor on the PCI card can be used.

- Performance Issues: Copy Routines and Notification Mechanisms

Once a message is ready for sending, the data has to be placed at a location where the NIC can fetch it. Using the standard memcpy routines, however, may show poor performance. The reason is that the CPU cache is polluted when larger messages are injected into the network. Modern CPU's like the Pentium III or Ultrasparc offer special MMX or VIS instructions which copy the data without polluting the cache. Another critical point is the software overhead caused by diverse protocols to guarantee data transfer. Nowadays cables are almost error free. Thus heavy protocols like TCP/IP are no longer necessary. Since an error may still occur, automatic error detection and correction implemented directly in hardware would improve efficiency. Performance is also sensitive to message arrival detection. A polling method typically wastes a lot of CPU cycles, and an interrupt causes too much overhead, since contexts have to be switched. Avoiding the interrupt mechanism is very important, as each new interrupt handling leads to a latency of approximately 60 µs [8].

3 NIC Hardware Layout and Design

In the ATOLL project, all design space features have been carefully evaluated and the result is an implementation of a very advanced technology.

3.1 ATOLL

Overview. The ATOLL cluster interface network is a future communication technology for building cost-effective and very efficient SAN's using standard processing nodes. Due to an extremely low communication start-up time and very broad hardware support for processing messages, a much higher performance standard in the communication of parallel programs is achieved. Unique features are the four links of the interface network, an integrated 8 x 8 crossbar and four independent host ports. They allow diverse network topologies to be created without additional external switches, and the ATOLL network is one of the first network-on-a-chip implementations. This design feature especially supports SMP nodes by assigning multiple processes their own dedicated device. Figure 1 depicts an overview of the hardware layout and data flow of ATOLL.

[Figure residue, ATOLL hardware layout and data flow. Recoverable labels: Processor (Pentium II/III), Transfer via DMA, Transfer via PIO (PCI Burst), DMA Descriptor.]

[Plot key: 100 Mbps, GigaNet, Packet Engine, SysKonnect, NetGear]

Figure 3 - MPI/Pro Bandwidth Results (x-axis: Message Length (Bytes))

Our experiences with the performance of MPI under NT 4 and Windows 2000 are inconclusive. Currently, it appears that in shared-memory mode the latencies under Windows 2000 may be marginally lower than under NT 4. The measured peak bandwidths of Windows 2000 were greater than those of NT 4. In distributed-memory mode the measured latencies under Windows 2000 were approximately 20% higher than the equivalent under NT 4. The measured bandwidths for Windows 2000 and NT 4 were, however, very similar. It is interesting to note that the measured network latencies for 100 Mbps Ethernet cards and GigaNet under WinSock and MPI/Pro are almost equivalent. The Packet Engine Gigabit card is between 7% and 13% faster, respectively. However, the SysKonnect and NetGear cards are significantly slower than standard 100 Mbps Ethernet.

7.2 Price/Performance Considerations

Table 4 shows the price/performance ratios calculated using the network card costs in September 1999 versus the peak measured bandwidth and minimum latency. It should be noted that the calculated ratios are only an approximate indicator, as the price of the network cards varies significantly with the quantity bought and the discounts given. The smaller the price/performance ratio, the better the value for money that can be expected from a network card. The choice of the most appropriate card is often not based solely on price/performance, but also on other factors such as desired performance, compatibility or availability.

[Plot: Bandwidth (Log) versus Message Length (In Distributed Memory); x-axis: Message Length (Bytes); plot key: 100 Mbps, GigaNet, Packet Engine, SysKonnect, NetGear]

Figure 4 - WinSock Bandwidth Results

The ratios shown in Table 4 indicate that the 100 Mbps Fast Ethernet cards provide significantly better price/performance than the other network cards. However, the ratios for the NetGear Gigabit card are significantly better than the other price/performance ratios available.

Card make and speed          Price/Performance ($/Mbytes/s)   Price/Performance ($/us)
NetGear FA310TX 100 Mbps     $24.95/8.8 ≈ 2.835               $24.95/208 ≈ 0.12
GigaNet - cLAN GNN1000       $795/37 ≈ 21.49                  $795/208 ≈ 3.82
Packet Engine - GNIC II      $995/12 ≈ 82.92                  $995/336 ≈ 2.96
SysKonnect - SK-9841         $729/17 ≈ 42.88                  $729/179 ≈ 4.07
NetGear - GA620              $299.99/19 ≈ 15.79               $299.99/585 ≈ 0.51

Table 4: Network Card Cost versus Performance (MPI/Pro)

7.3 Summary of Conclusions

Our work has shown that release 1.2.3 of MPI/Pro imposes an additional latency for 1-byte messages of approximately 25% and 50% over WinSock in shared- and distributed-memory modes, respectively. We have shown that the GigaNet Gigabit Ethernet provides the highest bandwidth of those tested. We suspect, as we currently do not have a concrete price for this card, that the price/performance of this card will be poorer than that of NetGear but better than that of Packet Engine and SysKonnect. Our price/performance figures do, however, strongly suggest that the current performance and costs of the Gigabit cards make standard 100 Mbps a much sounder technology investment at the moment. Obviously, other factors, such as required peak bandwidth, may mean that the choice of technology is not based purely on price/performance. Another factor that puts Gigabit Ethernet at a disadvantage compared to other network technologies, such as Myrinet [23] and SCI [24], is its relatively high start-up latency, which is approximately an order of magnitude higher. These high latencies are being addressed by the new VIA interfaces and drivers being developed for Ethernet.
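The ratios in Table 4 can be recomputed from the quoted prices, peak bandwidths and minimum latencies. The sketch below does this; the dictionary layout and the function name `ratios` are illustrative choices, while the numbers are those given in the table.

```python
# card -> (price in $, peak bandwidth in Mbytes/s, minimum latency in us)
CARDS = {
    "NetGear FA310TX 100 Mbps": (24.95, 8.8, 208),
    "GigaNet cLAN GNN1000": (795.00, 37, 208),
    "Packet Engine GNIC II": (995.00, 12, 336),
    "SysKonnect SK-9841": (729.00, 17, 179),
    "NetGear GA620": (299.99, 19, 585),
}


def ratios(price, bandwidth, latency):
    """Return ($ per Mbytes/s, $ per us) as used in Table 4."""
    return price / bandwidth, price / latency


for card, (price, bw, lat) in CARDS.items():
    per_bw, per_lat = ratios(price, bw, lat)
    print(f"{card:28s} {per_bw:8.3f} {per_lat:6.2f}")

# Card with the smallest price per unit of bandwidth.
best = min(CARDS, key=lambda c: ratios(*CARDS[c])[0])
print("best $/Mbytes/s:", best)
```

Recomputing the values this way is also a quick consistency check on the table entries.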

7.4 Future Work

This work is part of an ongoing effort to investigate the performance of a range of cluster-based technologies. The next phase of our work will involve comparing the performance of different network technologies under NT and Linux.

References

[1] A. Geist, Cluster Computing: The Wave of the future, Springer Verlag Lecture Notes in Computer Science, May 1994.
[2] The PVM project - http://www.epm.ornl.gov/pvm/
[3] MPI Forum - http://www.mpi-forum.org/docs/docs.html
[4] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, University of Tennessee, Knoxville, Report No. CS-94-230, May 5, 1994.
[5] MPICH - http://www.mcs.anl.gov/mpi/mpich/
[6] W. Gropp, et. al., A high-performance, portable implementation of the MPI message passing interface standard - http://www-c.mcs.anl.gov/mpi/mpicharticle/paper.html
[7] W. Gropp and B. Smith, Chameleon parallel programming tools users manual. Technical Report ANL-93/23, Argonne National Laboratory, March 1993.
[8] PVM: A Users' Guide and Tutorial For Networked Parallel Computing - http://www.netlib.org/pvm3/book/pvm-book.html
[9] Gigabit Ethernet Alliance - Gigabit Ethernet: Accelerating the standard for speed, http://www.gigabit-ethernet.org/technology/whitepapers, September 1999.
[10] Ethernet Segment Limits - http://www.gigabit-ethernet.org/technology/
[11] TOPIC - http://www.dcs.port.ac.uk/~mab/TOPIC/
[12] MPI Software Technology, Inc. - http://www.mpi-softtech.com/
[13] WinMPICh - http://www.erc.msstate.edu/mpi/mpiNT.html
[14] VIA - http://www.viaarch.com
[15] PaTENT - http://www.genias.de/products/patent/
[16] WINdows based PARallel computing - http://www.genias.de/
[17] WMPI - http://dsg.dei.uc.pt/w32mpi/
[18] R. Buttler and E. Lusk, User's Guide to the p4 Parallel Programming System, ANL-92/17, Argonne National Laboratory, October 1992.
[19] NetGear - http://netgear.baynetworks.com/
[20] GigaNet - http://www.giga-net.com/
[21] Packet Engine - http://www.packetengines.com/index4.html
[22] SysKonnect - http://www.syskonnect.de/
[23] N. Boden, et. al., Myrinet - A Gbps LAN. IEEE Micro, Vol. 15, No. 1, February 1995. http://www.myri.com/
[24] Dolphin Interconnect Solutions - http://www.dolphinics.no/

MPI Collective Operations over IP Multicast*

Hsiang Ann Chen, Yvette O. Carrasco, and Amy W. Apon

Computer Science and Computer Engineering, University of Arkansas, Fayetteville, Arkansas, U.S.A. {hachen,yochoa,aapon}@comp.uark.edu

Abstract. Many common implementations of Message Passing Interface (MPI) implement collective operations over point-to-point operations. This work examines IP multicast as a framework for collective operations. IP multicast is not reliable. If a receiver is not ready when a message is sent via IP multicast, the message is lost. Two techniques for ensuring that a message is not lost due to a slow receiving process are examined. The techniques are implemented and compared experimentally over both a shared and a switched Fast Ethernet. The average performance of collective operations is improved as a function of the number of participating processes and message size for both networks.

1 Introduction

Message passing in a cluster of computers has become one of the most popular paradigms for parallel computing. Message Passing Interface (MPI) has emerged as the de facto standard for message passing. In many common implementations of MPI for clusters, MPI collective operations are implemented over MPI point-to-point operations. Opportunities for optimization remain. Multicast is a mode of communication where one sender can send to multiple receivers by sending only one copy of the message. With multicast, the message is not duplicated unless it has to travel to different parts of the network through switches. Many networks support broadcast or multicast. For example, shared Ethernet, token bus, token ring, FDDI, and reflective memory all support broadcast at the data link layer. The Internet Protocol (IP) supports multicast over networks that have IP multicast routing capability at the network layer. The goal of this paper is to investigate the design issues and performance of implementing MPI collective operations using multicast. IP multicast is used to optimize the performance of MPI collective operations, namely the MPI broadcast and MPI barrier synchronization, for this preliminary work. The results are promising and give insight into work that is planned on a l...

Fig. 11. Performance comparison with MPI_Bcast over hub and switch for 4 processes (horizontal axis: size of message in bytes)

multicast is where the extra latency of sending scout messages becomes less than the latency from sending extra packets of data when the data is large. For some numbers of nodes, collisions also caused larger variance in performance with the multicast implementations. For example, this is observed for 6 nodes as shown in Fig. 9. With 6 nodes using the binary algorithm, both node 2 and node 1 attempt to send to node 0 at the same time, which causes extra delay. Figure 11 compares the average performance of the switch and the hub for 4 processes. When using IP multicast, the average performance of the hub is better than that of the switch for all measured message sizes. For the original MPICH implementation, the average performance of the hub becomes worse than that of the switch when the message is larger than 3000 bytes. The MPICH implementation puts more messages into the network. As the load on the network grows, the extra latency of the switch becomes less significant than the improvement gained from more bandwidth. The multicast implementation is better than MPICH for message sizes greater than one Ethernet frame.
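The trade-off between scout messages and duplicated data can be illustrated with a rough frame count. The sketch below is an illustrative model, not the measurement code used for the figures. It assumes a 1500-byte payload per Ethernet frame, that a point-to-point broadcast delivers a separate copy of the data to each of the n-1 receivers, and that the multicast version costs one scout frame per receiver plus a single copy of the data. The class and method names are hypothetical.

```java
// Illustrative frame-count model; the names and the 1500-byte payload are assumptions.
class BroadcastFrameModel {
    static final int PAYLOAD_BYTES = 1500; // assumed payload carried by one Ethernet frame

    // Number of frames needed to carry one copy of a message (at least one).
    static int framesPerCopy(int messageBytes) {
        return Math.max(1, (messageBytes + PAYLOAD_BYTES - 1) / PAYLOAD_BYTES);
    }

    // Point-to-point broadcast: each of the n-1 receivers gets its own copy of the data.
    static int pointToPointFrames(int processes, int messageBytes) {
        return (processes - 1) * framesPerCopy(messageBytes);
    }

    // Multicast broadcast: n-1 scout frames, then a single multicast copy of the data.
    static int multicastFrames(int processes, int messageBytes) {
        return (processes - 1) + framesPerCopy(messageBytes);
    }

    public static void main(String[] args) {
        for (int size : new int[] {1000, 3000, 5000}) {
            System.out.println(size + " bytes, 4 processes: point-to-point="
                + pointToPointFrames(4, size) + ", multicast=" + multicastFrames(4, size));
        }
    }
}
```

Under these assumptions, with 4 processes the multicast variant needs more frames for a message that fits in one frame (4 versus 3) and fewer once the message spans two or more frames (5 versus 6 at 3000 bytes), which is in line with the crossover described above.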

Fig. 12. Performance comparison with MPI_Bcast over 3, 6, and 9 processes over Fast Ethernet switch (horizontal axis: size of message in bytes; series: mpich and linear, each for 3, 6, and 9 processes)

Fig. 13. Comparison of MPI_Barrier over Fast Ethernet hub (horizontal axis: number of processes; series: multicast and MPICH)

Figure 12 compares MPICH and the linear multicast implementation for 3, 6, and 9 processes over the switch. The results show that the linear multicast algorithm scales well up to 9 processes, and better than MPICH. With the linear implementation, the extra cost for additional processes is nearly constant with respect to message size. This is not true for MPICH. Figure 13 shows the results of the MPI_Barrier operation over the hub. The results for MPI_Barrier show that IP multicast performs better on average than the original MPICH implementation. The performance improvement increases as the size of the message gets bigger. In a Single Program Multiple Data (SPMD) environment, message passing using either the linear algorithm or the binary algorithm is correct even when there are multiple multicast groups. However, since the IP multicast implementation requires the receive call to be posted before the message is sent, each process must execute the multicast calls in the same order. This restriction is equivalent to requiring that the MPI code be safe [5]. If several processes broadcast to the same multicast group (in MPI terms, the same process group with the same context), the order of the broadcasts will be correctly preserved. For example, suppose that in an environment including the 4 processes with ids 4, 6, 7, and 8, processes 6, 7, and 8 all belong to the same multicast group and the broadcast is called in the following order:

MPI_Bcast(&buffer, count, MPI_INT, 6, MPI_COMM_WORLD);
MPI_Bcast(&buffer, count, MPI_INT, 7, MPI_COMM_WORLD);
MPI_Bcast(&buffer, count, MPI_INT, 8, MPI_COMM_WORLD);

Using either the binary algorithm or the linear algorithm, process 7 cannot proceed to send the second broadcast until it has received the broadcast message from process 6, and process 8 cannot send the third broadcast until it has received the broadcast message from process 7. The order of the three broadcasts is carried out correctly. Using a similar argument, when there are two or more multicast groups that a process receives from, the order of broadcast will be correct as long as the MPI code is safe.
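The ordering argument can be tried out with a small in-memory simulation. The sketch below is not the MPICH-based implementation: it replaces the network with queues, models a scout as a readiness signal that each receiver raises once its receive is posted, and lets the root "multicast" only after all scouts for that call have arrived. All class and method names are hypothetical, and every root id is assumed to be a member of the group.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

// In-memory stand-in for one multicast group; no real network traffic is generated.
class OrderedBroadcastSketch {
    // Every member executes the same sequence of broadcasts (one root id per call).
    // Returns, for each member id, the messages in the order that member observed them.
    static Map<Integer, List<String>> run(final int[] members, final int[] roots) {
        final Map<Integer, BlockingQueue<String>> inbox = new HashMap<>();
        for (int m : members) inbox.put(m, new LinkedBlockingQueue<String>());
        // ready[k] reaches zero once every receiver has posted its receive for call k.
        final CountDownLatch[] ready = new CountDownLatch[roots.length];
        for (int k = 0; k < roots.length; k++) ready[k] = new CountDownLatch(members.length - 1);

        ExecutorService pool = Executors.newFixedThreadPool(members.length);
        Map<Integer, Future<List<String>>> pending = new HashMap<>();
        for (final int id : members) {
            pending.put(id, pool.submit(() -> {
                List<String> seen = new ArrayList<>();
                for (int k = 0; k < roots.length; k++) {
                    if (roots[k] == id) {
                        ready[k].await();                   // wait for a scout from every receiver
                        String msg = "bcast-from-" + id;
                        for (int m : members) if (m != id) inbox.get(m).put(msg); // "multicast"
                        seen.add(msg);
                    } else {
                        ready[k].countDown();               // scout: the receive is now posted
                        seen.add(inbox.get(id).take());     // block until the data arrives
                    }
                }
                return seen;
            }));
        }
        Map<Integer, List<String>> result = new HashMap<>();
        try {
            for (int id : members) result.put(id, pending.get(id).get());
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdownNow();
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(new int[] {6, 7, 8}, new int[] {6, 7, 8}));
    }
}
```

In this model process 7 cannot start the second broadcast before it has taken the message of process 6 from its queue, so every member observes the three broadcasts in the order in which they were called.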

5 Conclusions and Future Work

Multicast reduces the number of messages required and thereby improves the performance of MPI collective operations. Its receiver-directed message passing mode allows the sender to address all the receivers as a group. This experiment focused on a particular implementation using IP multicast. Future work is planned in several areas. Improvements are possible to the binary tree and linear communication patterns. While we have not observed buffer overflow due to a set of fast senders overrunning a single receiver, it is possible that this may occur in many-to-many communications, and it needs to be examined further. Additional experimentation using parallel applications is planned. Also, low latency protocols such as the Virtual Interface Architecture [9] standard typically require a receive descriptor to be posted before a message arrives. This is similar to the requirement in IP multicast that the receiver be ready. Future work is planned to examine how multicast may be applied to MPI collective operations in combination with low latency protocols.

References
[1] D. E. Comer. Internetworking with TCP/IP Vol. I: Principles, Protocols, and Architecture. Prentice Hall, 1995.
[2] T. H. Dunigan and K. A. Hall. PVM and IP Multicast. Technical Report ORNL/TM-13030, Oak Ridge National Laboratory, 1996.
[3] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard. Technical Report Preprint MCS-P567-0296, Argonne National Laboratory, March 1996.
[4] N. Nupairoj and L. M. Ni. Performance Evaluation of Some MPI Implementations on Workstation Clusters. In Proceedings of the 1994 Scalable Parallel Libraries Conference, pages 98-105. IEEE Computer Society Press, October 1994.
[5] P. Pacheco. Parallel Programming with MPI. Morgan Kaufmann, 1997.
[6] The LAM source code. http://www.mpi.nd.edu/lam.
[7] The MPICH source code. www-unix.mcs.anl.gov/mpi/index.html.
[8] A. S. Tanenbaum, M. F. Kaashoek, and H. E. Bal. Parallel Programming Using Shared Objects and Broadcasting. Computer, 25(8), 1992.
[9] The Virtual Interface Architecture Standard. http://www.viarch.org.
[10] D. Towsley, J. Kurose, and S. Pingali. A Comparison of Sender-Initiated and Receiver-Initiated Reliable Multicast Protocols. IEEE JSAC, 15(3), April 1997.

An Open Market-Based Architecture for Distributed Computing
Spyros Lalis and Alexandros Karipidis
Computer Science Dept., University of Crete, Hellas
{lalis,karipid}@csd.uoc.gr
Institute of Computer Science, Foundation for Research and Technology, Hellas
{lalis,karipid}@ics.forth.gr

Abstract. One of the challenges in large scale distributed computing is to utilize the thousands of idle personal computers. In this paper, we present a system that enables users to effortlessly and safely export their machines in a global market of processing capacity. Efficient resource allocation is performed based on statistical machine profiles and leases are used to promote dynamic task placement. The basic programming primitives of the system can be extended to develop class hierarchies which support different distributed computing paradigms. Due to the object-oriented structuring of code, developing a distributed computation can be as simple as implementing a few methods.

1 Introduction

The growth of the Internet has provided us with the largest network of interconnected computers in history. As off-the-shelf hardware becomes faster and gains Internet access, the network's processing capacity will continue increasing. Many of these systems are often under-utilized, a fact accentuated by the globe's geography since "busy" hours in one time-zone tend to be "idle" hours in another. Distributing computations over the Internet is thus very appealing. However, several issues must be resolved for this to be feasible. The obstacle of platform heterogeneity must be overcome and security problems arising from the execution of code from untrusted parties must be confronted. Further inconveniences arise when installing and maintaining the corresponding programming environments. And then, distributed computations must be designed and implemented on top of them, a challenging task even for experienced programmers.

In this paper we present a system that addresses these problems, simplifying distributed computing over the Internet considerably. Through a maintenance-free, web-based user interface any machine can be safely connected to the system to act as a host for remote computations. A framework that promotes code reuse and incremental development through object-oriented extensions is offered to the application programmer. Writing computations for the system can be as trivial as implementing a few routines. We feel that the ease of deploying the system and developing applications for it is of importance to the scientific community since most of the programming is done by scientists themselves with little or no support from computer experts.

The rest of the paper is organized as follows. Section 2 summarizes the general properties of the system. Details about the resource allocation mechanism are given in Sect. 3. In Sect. 4 we look into the system architecture, giving a description of components and communication mechanisms. In Sect. 5 we show how our system can be used to develop distributed computations in a straightforward way. A comparison with related work is given in Sect. 6. Section 7 discusses the advantages of our approach. Finally, future directions of this work are mentioned in the last section.

J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 61-70, 2000. © Springer-Verlag Berlin Heidelberg 2000

2 System Properties

When designing the system, the most important goal was to achieve a level of simplicity that would make it popular both with programmers and with owners of lightweight host machines, most notably PCs. Ease of host registration was thus considered a key issue. Safety barriers to shield hosts from malicious behavior of foreign code were also required. Portability and inter-operability were needed to maximize the number of host platforms that can be utilized. A simple yet powerful programming environment was called for to facilitate the distribution of computations over the Internet. All these features had to be accompanied by a dynamic and efficient mechanism for allocating resources to applications without requiring significant effort from the programmer.

In order to guarantee maximal cross-platform operability the system was implemented in Java. Due to Java's large scale deployment, the system can span many architectures and operating systems. Host participation is encouraged via a web-based interface, which installs a Java applet on the host machine. This accommodates the need for a user-friendly interface, as users are accustomed to using web browsers. Furthermore, the security manager installed in Java-enabled browsers is a widely trusted firewall, protecting hosts from downloaded programs. Finally, due to the applet mechanism, neither administration nor maintenance is required at the host, since the majority of users already have a recent version of a web browser installed on their machines.

On the client side we provide an open, extensible architecture for developing distributed applications. Basic primitives are provided which can in turn be used to implement diverse, specialized processing models. Through such models it is possible to hide the internals of the system and/or provide advanced programming support in order to simplify application development.

3 Resource Allocation

Host allocation is based on profiles, which are created by periodically benchmarking each host. A credit-based [1] mechanism is used for charging. Credit can be translated into anything that makes sense in the context where the system is deployed. Within a non-profit institution, it may represent time units to facilitate quotas. Service-oriented organizations could charge clients for using hosts by converting credit to actual currency. Both hosts (sellers) and clients (buyers) submit orders to a market, specifying their actual and desired machine profile respectively. The parameters of an order are listed in Table 1. The performance vectors include the host's mean score and variance for a set of benchmarks over key performance characteristics such as integer and floating point arithmetic, network connection speed to the market server, etc. The host abort ratio is the ratio of computations killed by the host versus computations initiated on that host (a "kill" happens when a host abruptly leaves the market). The host performance vectors and abort ratio are automatically produced by the system. Host profiles can easily be extended to include additional information that could be of importance for host selection.
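The profile quantities described above can be computed from raw observations in a few lines. The following is a minimal sketch under stated assumptions: the paper does not give the formulas used by the system, so the use of the population variance and the handling of a host with no initiated computations are choices made here, and the class and method names are hypothetical.

```java
// Hypothetical helper; formulas and names are assumptions, not the system's own code.
class HostProfileSketch {
    // Mean of the recorded scores for one benchmark.
    static double mean(double[] scores) {
        double sum = 0.0;
        for (double s : scores) sum += s;
        return sum / scores.length;
    }

    // Population variance of the recorded scores for one benchmark.
    static double variance(double[] scores) {
        double mu = mean(scores);
        double acc = 0.0;
        for (double s : scores) acc += (s - mu) * (s - mu);
        return acc / scores.length;
    }

    // Computations killed by the host divided by computations initiated on it.
    // A host with no history is given a ratio of 0 (an assumption).
    static double abortRatio(int killed, int initiated) {
        return initiated == 0 ? 0.0 : (double) killed / initiated;
    }

    public static void main(String[] args) {
        double[] scores = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println("mean=" + mean(scores) + " variance=" + variance(scores)
            + " abortRatio=" + abortRatio(1, 4));
    }
}
```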

Table 1. Parameters specified in orders

Parameter | Sell Orders | Buy Orders
price/sec | The minimum amount of credit required per second of use of the host. | The maximum amount of credit offered per second of use of the host.
lease duration | The maximum amount of usage time without renegotiation. | The minimum amount of usage time without renegotiation.
granted/demanded compensation | Credit granted/demanded for not honoring the lease duration. | (same description for both order types)
performance statistics vectors | The host's average score and variance for each of the benchmarks (measured). | The average performance score and variance a buyer is willing to accept.
abort ratio | The host's measured abort ratio. | The abort ratio a buyer is willing to accept.
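The rows of Table 1 can be read as a compatibility test between a sell order and a buy order. The sketch below is one possible reading, not the market server's matching code: it reduces the performance vectors to a single benchmark, leaves out the compensation field, and assumes that a buyer accepts a host whose mean score is at least, and whose variance and abort ratio are at most, the values stated in the buy order. All type and field names are hypothetical.

```java
// Hypothetical order types; fields follow the rows of Table 1 (single benchmark only).
class SellOrder {
    final double minPricePerSec;   // minimum credit required per second of use
    final double maxLeaseSeconds;  // maximum usage time without renegotiation
    final double meanScore;        // measured benchmark mean
    final double scoreVariance;    // measured benchmark variance
    final double abortRatio;       // measured abort ratio

    SellOrder(double minPricePerSec, double maxLeaseSeconds,
              double meanScore, double scoreVariance, double abortRatio) {
        this.minPricePerSec = minPricePerSec;
        this.maxLeaseSeconds = maxLeaseSeconds;
        this.meanScore = meanScore;
        this.scoreVariance = scoreVariance;
        this.abortRatio = abortRatio;
    }
}

class BuyOrder {
    final double maxPricePerSec;   // maximum credit offered per second of use
    final double minLeaseSeconds;  // minimum usage time without renegotiation
    final double minMeanScore;     // lowest acceptable benchmark mean
    final double maxScoreVariance; // highest acceptable benchmark variance
    final double maxAbortRatio;    // highest acceptable abort ratio

    BuyOrder(double maxPricePerSec, double minLeaseSeconds,
             double minMeanScore, double maxScoreVariance, double maxAbortRatio) {
        this.maxPricePerSec = maxPricePerSec;
        this.minLeaseSeconds = minLeaseSeconds;
        this.minMeanScore = minMeanScore;
        this.maxScoreVariance = maxScoreVariance;
        this.maxAbortRatio = maxAbortRatio;
    }
}

class OrderMatcher {
    // True if every constraint of the buyer and of the seller can be met at once.
    static boolean compatible(SellOrder s, BuyOrder b) {
        return b.maxPricePerSec >= s.minPricePerSec
            && s.maxLeaseSeconds >= b.minLeaseSeconds
            && s.meanScore >= b.minMeanScore
            && s.scoreVariance <= b.maxScoreVariance
            && s.abortRatio <= b.maxAbortRatio;
    }
}
```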

An economy-based mechanism is employed to match the orders that are put in the market. For each match, the market produces a lease, which is a contract between a host and a client containing their respective orders and the price of use agreed upon. Leases are produced periodically using a continuous double auction [8]. A lease entitles the client to utilize the host for a specific amount of time. If the client's task completes within the lease duration, then the buyer transfers an amount of credit to the seller as a reward, calculated by multiplying the actual duration by the lease's price per second. If the lease duration is not honored, an amount of credit is transferred from the dishonoring party to the other.
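The settlement rule can be written down as a small calculation. The sketch below is a hedged illustration: the reward formula (actual duration times price per second) is taken from the text, whereas the assumption that a dishonored lease results in a transfer of the compensation amount only, with no payment for partial usage, is a simplification made here. The names are hypothetical.

```java
// Hypothetical settlement helper; only the reward formula comes from the text.
class LeaseSettlementSketch {
    enum Outcome { COMPLETED, HOST_DISHONORED, CLIENT_DISHONORED }

    // Credit paid by the buyer when the task completes within the lease duration.
    static double reward(double actualSeconds, double pricePerSec) {
        return actualSeconds * pricePerSec;
    }

    // Net credit moving from the client (buyer) to the host (seller).
    // A negative value means that the host pays the client.
    static double clientToHost(Outcome outcome, double actualSeconds,
                               double pricePerSec, double compensation) {
        switch (outcome) {
            case COMPLETED:
                return reward(actualSeconds, pricePerSec);
            case HOST_DISHONORED:
                return -compensation;  // assumed: the host compensates the client
            default:
                return compensation;   // assumed: the client compensates the host
        }
    }

    public static void main(String[] args) {
        System.out.println(clientToHost(Outcome.COMPLETED, 120.0, 0.5, 10.0));
    }
}
```

For example, a task that runs for 120 seconds at an agreed price of 0.5 credit per second transfers 60 credits from the buyer to the seller.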

4 System Architecture

4.1 Overview of System Components

An overview of the system's architecture is depicted in Fig. 1. The basic components of our system are the market server, hosts, the host agent, schedulers, tasks and client applications.

Fig. 1. Overview of architecture (components: Client application, Scheduler, Market Server, Host Agent, Task; labelled links: ControlProtocol, MarketSchedulerProtocol, ComputationProtocol, MarketHostAgentProtocol, HostAgentTaskProtocol)

The Client Application is a program which needs to perform computations that require considerable processing power. Through the system, it may either distribute a computation across a number of machines or just delegate the execution of an entire computation to a faster machine to speed up execution. The Market Server is the meeting place for buyers and sellers of processing power. It collects orders from clients and hosts. Using the host profiles, it then matches buy with sell orders and thus allocates resources. A Host is a machine made available to be used by clients. A host participates in the market through the Host Agent, a Java applet. The user visits a URL with a Java enabled web browser and the agent is downloaded to his system. The agent communicates with the market server, takes care of placing orders on behalf of the user and executes tasks assigned to the host. It also provides the market server with the benchmark scores needed for the host's profile. A computation in our system consists of a Scheduler and one or more Tasks. The application installs the scheduler on the market server. The scheduler then places orders in the market for acquiring machines to complete the computation. New orders can be issued at any time in order to adapt to fluid market conditions. When a lease is accepted by the scheduler, a task is launched in the host machine to assist in completing the computation.

4.2 Basic System Services and Communication

There are six protocols used for communication by the system.

The UploadProtocol is a fixed, published Remote Method Invocation (RMI) interface used by the client application to upload a computation to the market server and to instantiate its scheduler. A client application may instantiate multiple schedulers to simultaneously launch the same code with multiple data.

The ControlProtocol is a published RMI interface for the client application to control a scheduler. Through this interface the application performs tasks such as starting a computation with new parameters, altering the computation's budget for acquiring hosts, instructing the scheduler to kill all tasks and exit, etc. The basic functions are implemented in the system classes. The programmer can introduce computation-specific control functions by extending this interface.

The ComputationProtocol is used within the bounds of a single computation for communication among tasks and their scheduler. It is application dependent and thus unknown to the system. We do, however, provide message passing support (not further discussed in this paper) that can be used by application developers to implement flexible, safe and efficient data exchange.

The MarketSchedulerProtocol is used for local communication between the market server and schedulers. The market server implements a standard published interface for servicing requests from schedulers such as placing orders and retrieving host and market status information. In turn, schedulers provide methods for being notified by the market of events such as the opportunity to acquire a new lease, a change in the client's account balance, the completion of a task's work and the failure of a host that was leased to them.

Similarly, the HostAgentTaskProtocol provides local communication between a host agent and the task it is hosting. The agent implements a published interface for servicing requests from tasks, such as retrieving information about a host's performance.

The MarketHostAgentProtocol is a proprietary protocol used by the market server and the host agent. It allows orders to be placed in the market by the host. It is also used to retrieve tasks from the market, to ask for "payment" when tasks complete and to post benchmarking data to the market server.
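The notification side of the MarketSchedulerProtocol can be pictured as a callback interface implemented by each scheduler. The sketch below is purely illustrative: the paper lists the kinds of events but not the method signatures, so the interface, the method names and the acceptance policy of the stub scheduler are all assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical notification interface; the method names are not taken from the paper.
interface MarketEvents {
    void leaseOffered(String hostId, double pricePerSec);
    void balanceChanged(double newBalance);
    void taskCompleted(String hostId);
    void hostFailed(String hostId);
}

// Minimal scheduler stub: accepts affordable leases and tracks which hosts are busy.
class BudgetedScheduler implements MarketEvents {
    final Set<String> busyHosts = new LinkedHashSet<>();
    final List<String> lostHosts = new ArrayList<>();
    private final double maxPricePerSec;
    private double balance;

    BudgetedScheduler(double maxPricePerSec, double balance) {
        this.maxPricePerSec = maxPricePerSec;
        this.balance = balance;
    }

    public void leaseOffered(String hostId, double pricePerSec) {
        // Accept only if the price is within budget and some credit is left.
        if (pricePerSec <= maxPricePerSec && balance > 0.0) busyHosts.add(hostId);
    }

    public void balanceChanged(double newBalance) {
        balance = newBalance;
    }

    public void taskCompleted(String hostId) {
        busyHosts.remove(hostId);
    }

    public void hostFailed(String hostId) {
        // Work placed on a failed host has to be rescheduled; here it is only recorded.
        if (busyHosts.remove(hostId)) lostHosts.add(hostId);
    }
}
```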

5 Supporting Distributed Computing Paradigms

Through the set of primitives offered by the system, it is possible to develop a wide range of applications. More importantly, generic support can be provided for entire classes of distributed computations. Applications can then be developed by extending these classes to introduce specific functionality. This incremental development can greatly simplify programming. As an example, in the following we describe this process for embarrassingly parallel computations requiring no communication between tasks. Other distributed computation paradigms can be supported in a similar fashion.

5.1 The Generic Master-Slave Model

In this model work is distributed among many processors by a distinguished processor referred to as the "master". The other processors, referred to as "slaves", complete the work assigned to them and return the results to the master. In order to process its workload a slave does not need to communicate with any other slave. This model is used in image processing, genetic algorithms, brute force search and game tree evaluation. One possible implementation of this model is sketched below. For brevity, only the methods a programmer has to be aware of are shown.

public interface MS_Control extends Control {
    void start(Object pars);   // inherited from Control
    void stop();               // inherited from Control
    Object[] getResults(boolean all, boolean keep);
}

public abstract class MS_Scheduler extends Scheduler implements MS_Control {
    // Splits the computation described by pars into partitions.
    public abstract Object[] doPartitions(Object pars);
    // Called for each returned result; override to post-process results.
    public void receiveResult(Object result) { }
}

public abstract class MS_Task extends Task {
    // Processes one partition on a host and returns its result.
    public abstract Object processPartition(Object partition);
}

The MS_Control.start method starts a new computation. MS_Control.start triggers MS_Scheduler.doPartitions to produce the various partitions of the computation. These are forwarded to instances of MS_Task residing on hosts allocated to the computation and MS_Task.processPartition is invoked to process them. The results are returned to the scheduler where post-processing is performed via calls to the MS_Scheduler.receiveResult method. It is important to notice that programmers need to implement just three methods in order to complete a computation following this model. All other implementation issues, including the resource allocation strategy of the scheduler, remain hidden. The MS_Control interface, which defines the primitives for controlling and retrieving the results of the computation, is implemented by the base MS_Scheduler class and thus does not concern the programmer. This master/slave model could be further extended to introduce additional functionality such as check-pointing and restarting of tasks for fault tolerance. Programmers would exploit this functionality without effort.

5.2 A Sample Client Application

Based on this model, we show how a specific application, e.g. for computing the Mandelbrot set, can be implemented. We assume that the area to be calculated is partitioned in bands, processed in parallel to speed up execution. The user selects an area and the computation is started to zoom into the selected area.
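The band partitioning and the per-point calculation can be sketched as follows. This is a minimal illustration rather than the code of the sample application: the way rows are divided among bands, the escape radius of 2 and all names are assumptions made here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helpers for a band-partitioned Mandelbrot computation.
class MandelbrotBands {
    // Splits rows [0, height) into contiguous bands {firstRow, endRowExclusive}.
    // Bands differ in size by at most one row; empty bands are omitted.
    static List<int[]> split(int height, int bands) {
        List<int[]> out = new ArrayList<>();
        int base = height / bands;
        int extra = height % bands;
        int start = 0;
        for (int i = 0; i < bands; i++) {
            int len = base + (i < extra ? 1 : 0);
            if (len > 0) out.add(new int[] {start, start + len});
            start += len;
        }
        return out;
    }

    // Number of iterations of z = z*z + c before |z| exceeds 2, capped at maxIter.
    static int escapeTime(double cr, double ci, int maxIter) {
        double zr = 0.0, zi = 0.0;
        int n = 0;
        while (n < maxIter && zr * zr + zi * zi <= 4.0) {
            double t = zr * zr - zi * zi + cr;
            zi = 2.0 * zr * zi + ci;
            zr = t;
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        for (int[] band : split(10, 3)) {
            System.out.println("rows " + band[0] + " to " + (band[1] - 1));
        }
    }
}
```

Each band would become one partition handed to a task, which evaluates escapeTime for every pixel in its rows.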


The parameters, partitions and results of the fractal application must be extensions of the Object class. The classes must implement the Serializable interface in order to be successfully transported across machine boundaries.

class FractalParameters extends Object implements Serializable {
    // ... fractal computation parameters
}

class FractalPartition extends Object implements Serializable {
    // ... parameters for calculating a slice
}

class FractalResult extends Object implements Serializable {
    // ... results of a slice calculation
}
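The role of the Serializable interface can be seen with a round trip through standard Java object serialization, which is what a transport layer has to perform when an object crosses a machine boundary. The sketch below uses only java.io classes; the BandPartition type and its fields are hypothetical stand-ins for a partition class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical partition type: a horizontal band of the image.
class BandPartition implements Serializable {
    private static final long serialVersionUID = 1L;
    final int firstRow;
    final int lastRow;

    BandPartition(int firstRow, int lastRow) {
        this.firstRow = firstRow;
        this.lastRow = lastRow;
    }
}

class SerializationRoundTrip {
    // Writes the object to a byte array and reads it back, as a transport layer would.
    static Object roundTrip(Serializable value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        BandPartition copy = (BandPartition) roundTrip(new BandPartition(0, 63));
        System.out.println(copy.firstRow + ".." + copy.lastRow);
    }
}
```

A class that does not implement Serializable would cause writeObject to fail with a NotSerializableException, which is why the parameter, partition and result classes must all declare the interface.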

Assuming the parameter and result objects have been appropriately defined, a FractalScheduler class must be programmed as a subclass of MS_Scheduler to produce partitions via the doPartitions method. The MS_Scheduler.receiveResult method is not overridden because individual results are not merged by the scheduler. Also, the basic MS_Control interface needs no extension since it already offers the necessary routines for controlling and monitoring the computation. Analogously, a FractalTask class must be provided that implements the MS_Task.processPartition method to perform the calculation of slices.

class FractalScheduler extends MS_Scheduler {
    public Object[] doPartitions(Object comp_pars) {
        FractalPartition partitions[];
        FractalParameters pars = (FractalParameters) comp_pars;
        // ... split calculation and produce partitions
        return (partitions);
    }
}

class FractalTask extends MS_Task {
    public Object processPartition(Object partition) {
        FractalResult result;
        FractalPartition pars = (FractalPartition) partition;
        // ... perform the computation
        return (result);
    }
}

Finally, to run the application, the computation's classes must be uploaded to the market server using the UploadProtocol and a scheduler instance must be created. The MS_Control interface is used to control the scheduler and periodically retrieve the computation's results.
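The interplay of the three programmer-supplied methods can be exercised on a single machine without the market. The sketch below is not the system's implementation: LocalScheduler and LocalTask are local stand-ins for MS_Scheduler and MS_Task, the driver loop replaces host allocation and task dispatch, and the toy computation (a sum of squares split into ranges) is chosen here only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-ins for the scheduler and task base classes of Sect. 5.1.
abstract class LocalScheduler {
    final List<Object> results = new ArrayList<>();
    public abstract Object[] doPartitions(Object pars);
    public void receiveResult(Object result) { results.add(result); }
}

abstract class LocalTask {
    public abstract Object processPartition(Object partition);
}

class RangeScheduler extends LocalScheduler {
    // pars is an int[] {n, parts}: split 1..n into `parts` contiguous ranges {lo, hi}.
    public Object[] doPartitions(Object pars) {
        int n = ((int[]) pars)[0];
        int parts = ((int[]) pars)[1];
        Object[] out = new Object[parts];
        for (int i = 0; i < parts; i++) {
            int lo = 1 + (int) ((long) n * i / parts);
            int hi = (int) ((long) n * (i + 1) / parts);
            out[i] = new int[] {lo, hi};
        }
        return out;
    }
}

class SumSquaresTask extends LocalTask {
    // Sums k*k over the inclusive range given by the partition.
    public Object processPartition(Object partition) {
        int[] r = (int[]) partition;
        long sum = 0;
        for (int k = r[0]; k <= r[1]; k++) sum += (long) k * k;
        return sum; // boxed to Long
    }
}

class MasterSlaveDemo {
    // Calls the three methods in the order described in Sect. 5.1, sequentially.
    static long run(int n, int parts) {
        LocalScheduler scheduler = new RangeScheduler();
        LocalTask task = new SumSquaresTask();
        for (Object p : scheduler.doPartitions(new int[] {n, parts})) {
            scheduler.receiveResult(task.processPartition(p));
        }
        long total = 0;
        for (Object r : scheduler.results) total += (Long) r;
        return total;
    }

    public static void main(String[] args) {
        System.out.println(run(10, 3));
    }
}
```

In the real system the loop in run would be replaced by the scheduler acquiring leases and forwarding each partition to a task on a remote host.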

6 Related Work

Popular distributed programming environments such as PVM [9] and MPI [9] lack advanced resource allocation support. PVM allows applications to be notified when machines join or leave the system, but the programmer must provide code that investigates hosts' properties and decides on proper allocation. MPI, using a static node setup, prohibits dynamic host allocation: the programmer must make such decisions a priori. Both systems require explicit installation of their runtime system on participating hosts. A user must therefore have access to all participating machines, as she must be able to log in to them in order to spawn tasks. This is impractical and may result in only a small number of hosts being utilized, even within a single organization. Finally, the choice of C as the main programming language, compared to Java, is an advantage where speed is concerned. But to be able to exploit different architectures, the user must provide and compile code for each one of them, adding to the complexity and increasing development time due to porting considerations. The maturation of Java technology ("just in time" compilation, Java processors, etc.) could soon bridge the performance gap with C. Notably, a Java PVM implementation is underway [6], which will positively impact the portability of the PVM platform.

Condor is a system that has been around for several years. It provides a comparative "matchmaking" process for resource allocation through its "classified advertisement" matchmaking framework [11]. A credit-based mechanism could be implemented using this framework, but is currently unavailable. Condor too requires extensive administration and lacks support for easy development.

Newer systems such as Legion [10] and Globus [7] address the issues of resource allocation and security. They provide mechanisms for locating hosts and signing code. However, both require administration such as compiling and installing the system as well as access to the host computer. They do not support the widely popular Windows platform (though Legion supports NT) and do little to facilitate application development for non-experts. Globus merely offers an MPI implementation whereas Legion provides the "Mentat" language extensions. Legion's solution is more complete but also complicated for inexperienced programmers. It requires using a preprocessor and an "XDR" style serialization process, and it introduces error-prone situations since virtual method calls will not work as expected in all cases. Stateful and stateless objects are also handled differently. Finally, adding hosts to a running computation is done from the command line and additional hosts are assigned to the computation at random; no matching of criteria is performed.

Several other systems using Java as the "native" programming language have been designed for supporting globally distributed computations, such as Charlotte [3], Javelin [4] and Challenger [5]. These systems automatically distribute computations over machines. However, they do not employ market-based principles to allocate hosts and do not maintain information about hosts' performance.

The market paradigm has received considerable attention in distributed systems aiming for flexible and efficient resource allocation. A system operating on the same principles as ours is Popcorn [12]. Popcorn also uses auction mechanisms to allocate hosts to client computations and exploits Java applet technology to achieve portability, inter-operability and safety. However, it neither provides "host profiling" nor promotes incremental development.

7 Discussion

Besides the fact that the allocation strategies used in most systems don't take into account "behavioral patterns" of hosts, there is also virtually no support for leasing. We argue that both are invaluable for efficient resource allocation in open computational environments. Providing information about the statistical behavior of participating hosts can assist schedulers in taking task placement decisions, avoiding hosts that will degrade performance (and waste credit). For example, assume a scheduler has two tasks to allocate. Blind allocation on two hosts is not a good idea; unless two machines exhibit comparable performance, the faster machine will be wasted since the computation will be delayed by the slower one. Similarly, using the abort ratio, schedulers can avoid unstable hosts for placing critical parts of a computation. Those can be assigned to perhaps more "expensive" but stable hosts. Computations implementing check-pointing and crash-recovery could utilize less credible hosts. The lack of leasing is also a drawback in open environments: a client could obtain many processors when there is no contention and continue to hold them when demand rises. This is unacceptable in a real world scenario where credit reflects priorities or money. This would imply that prioritized or wealthy computations can be blocked by "lesser" ones. To guarantee quality of service, some form of leasing or preemption must be adopted. Leases are also practical in non-competitive environments. The lease duration allows users to indicate the time during which hosts are under-utilized. Based on this knowledge, tasks can be placed on hosts that will be idle for enough time, and checkpoints can be accurately scheduled, right before a host is about to become unavailable. Finally, it is generally acknowledged that incremental development increases productivity by separation of concerns and modular design. Distributed computing can benefit from such an approach. 
Modern object-oriented programming environments are a step in this direction, but significant programming experience and discipline are still required. We feel that with our system's design, it is possible even for inexperienced programmers to write computations rapidly.
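The two-task example given earlier in this section can be quantified with a simple model. The sketch below assumes equal-sized tasks, one task per host, and a host speed expressed in work units per second, so that the computation finishes when the slowest host does. These modelling choices and all names are assumptions made for illustration.

```java
// Hypothetical model of blind task placement on hosts of different speed.
class PlacementSketch {
    // Completion time of equal-sized tasks run in parallel, one per host:
    // the slowest host determines when the whole computation finishes.
    static double makespan(double workPerTask, double[] hostSpeeds) {
        double worst = 0.0;
        for (double speed : hostSpeeds) worst = Math.max(worst, workPerTask / speed);
        return worst;
    }

    // Fraction of the leased host time that is spent waiting for the slowest host.
    static double idleFraction(double workPerTask, double[] hostSpeeds) {
        double span = makespan(workPerTask, hostSpeeds);
        double busy = 0.0;
        for (double speed : hostSpeeds) busy += workPerTask / speed;
        return 1.0 - busy / (span * hostSpeeds.length);
    }

    public static void main(String[] args) {
        double[] mixed = {10.0, 2.0};
        System.out.println("makespan=" + makespan(100.0, mixed)
            + " idle=" + idleFraction(100.0, mixed));
    }
}
```

With 100 work units per task and host speeds of 10 and 2 units per second, the faster host finishes after 10 seconds but the computation takes 50 seconds, so 40 percent of the leased host time is idle. With two hosts of equal speed the idle fraction is zero.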

8 Future Directions

New versions of the Java platform will offer more fine-grained control in the security system. Using the new mechanisms we expect to be able to provide more efficient services, such as access to local storage for task checkpoints and invocation of native calls to exploit local, tuned libraries such as [2], [13]. Logging mechanisms, along with the signing of classes, will further increase the security of the system.


We also wish to experiment with schedulers capable of recording the performance of previous allocations. Accumulated information can perhaps be converted into "experience", leading towards more efficient allocation strategies. Lastly, the issue of scalability needs to be addressed. The current architecture is limited by the market server. A single server could not handle the millions or billions of hosts connecting to a truly world-wide version of this service. It would also be impossible to have all schedulers running on that machine. We intend to overcome this problem by introducing multiple market servers that will allow traffic to be shared among several geographically distributed servers.

References

[1] Y. Amir, B. Awerbuch, and R. S. Borgstrom. A cost-benefit framework for online management of a metacomputing system. In Proceedings of the First International Conference on Information and Computation Economies, pages 140-147, October 1998.
[2] M. Baker, B. Carpenter, G. Fox, S. H. Ko, and S. Lim. mpiJava: An Object-Oriented Java Interface to MPI. Presented at International Workshop on Java for Parallel and Distributed Computing, IPPS/SPDP 1999, April 1999.
[3] A. Baratloo, M. Karaul, Z. M. Kedem, and P. Wyckoff. Charlotte: Metacomputing on the web. In Ninth International Conference on Parallel and Distributed Computing Systems, September 1996.
[4] P. Cappello, B. Christiansen, M. F. Ionescu, M. O. Neary, K. E. Schauser, and D. Wu. Javelin: Internet-based parallel computing using Java. In Proceedings of the ACM Workshop on Java for Science and Engineering Computation, June 1997.
[5] A. Chavez, A. Moukas, and P. Maes. Challenger: A multiagent system for distributed resource allocation. In Proceedings of the First International Conference on Autonomous Agents '97, 1997.
[6] A. Ferrari. JPVM: The Java Parallel Virtual Machine. Journal of Concurrency: Practice and Experience, 10(11), November 1998.
[7] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. Intl J. Supercomputer Applications, 11(2), 1997.
[8] D. Friedman. The double auction market institution: A survey. In D. Friedman and J. Rust, editors, Proceedings of the Workshop in Double Auction Markets, Theories and Evidence, June 1991.
[9] G. A. Geist, J. A. Kohl, and P. M. Papadopoulos. PVM and MPI: a Comparison of Features. Calculateurs Paralleles, 8(2):137-150, June 1996.
[10] A. S. Grimshaw and W. A. Wulf. The Legion vision of a worldwide computer. CACM, 40(1):39-45, 1997.
[11] R. Raman, M. Livny, and M. Solomon. Matchmaking: Distributed resource management for high throughput computing. In Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, July 1998.
[12] O. Regev and N. Nisan. The POPCORN Market - an Online Market for Computational Resources. In Proceedings of the First International Conference on Information and Computation Economies, pages 148-157, October 1998.
[13] The Java Grande Working Group. Recent Progress of the Java Grande Numerics Working Group. http://math.nist.gov/javanumerics/reports/jgfnwg-02.html.

The MultiCluster Model to the Integrated Use of Multiple Workstation Clusters

Marcos Barreto*, Rafael Avila**, and Philippe Navaux***

Institute of Informatics - UFRGS
Av. Bento Gonçalves, 9500 Bl. IV
PO Box 15064 - 90501-910 Porto Alegre, Brazil
E-mail: {barreto, bohrer, navaux}@inf.ufrgs.br

Abstract. One of the new research tendencies within the well-established cluster computing area is the growing interest in the use of multiple workstation clusters as a single virtual parallel machine, in much the same way as individual workstations are nowadays connected to build a single parallel cluster. In this paper we present an analysis of several aspects concerning the integration of different workstation clusters, such as Myrinet and SCI, and propose our MultiCluster model as an alternative to achieve such an integrated architecture.

1 Introduction

Cluster computing is nowadays common practice for many research groups around the world that seek high performance for a great variety of parallel and distributed applications, such as aerospace and molecular simulations, Web servers, data mining, and so forth. To achieve high performance, much effort has been devoted to the design and implementation of low-overhead communication libraries, specially dedicated to the fast communication networks used to interconnect nodes within a cluster, such as Fast Ethernet [14], Myrinet [3] and SCI [12]. The design of such software is a widely explored area, resulting in proposals like BIP [21], GM [9], VIA [24] and Fast Messages [19].

Currently, there are other research areas being explored, such as administrative tools for cluster management and what is being called Grid Computing, whose objective is to join geographically distributed clusters to form a Metacomputer and to benefit from the resulting overall computational power [4].

The work presented here is not focused on these areas directly, because our goal is to discuss a practical situation in which a Myrinet cluster must be interconnected with an SCI cluster to form a single parallel machine. Such a machine can be used to verify an application's behaviour when it runs on a shared-memory cluster or on a message-passing cluster, to efficiently distribute the tasks of an application according to their communication needs, and to offer a complete environment for teaching parallel and distributed programming, allowing the user to express, through the same API, message-passing and shared-memory interactions.

This paper is organised as follows: Section 2 analyses the problems that arise from integrating multiple workstation clusters; in Section 3 we present the MultiCluster model and the DECK environment as our contribution towards this objective; Section 4 comments on related research efforts; and finally Section 5 presents our conclusions and current research activities.

* M.Sc. student at PPGC/UFRGS (CAPES fellow)
** M.Sc. (PPGC/UFRGS, 1999); RHAE/CNPq researcher at PPGC/UFRGS
*** Ph.D. (INPG, Grenoble - France, 1979); Professor at PPGC/UFRGS

J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 71-80, 2000. © Springer-Verlag Berlin Heidelberg 2000

2 Integrating Multiple Clusters

When computer networks were an emergent platform for parallel and distributed programming, much effort was expended on solving the problems of joining individual PCs into a single virtual parallel machine. From these efforts, communication libraries such as PVM [8] and MPI [17] arose to allow individual network nodes to be identified within the parallel environment. The integration of multiple workstation clusters presents a similar problem. Individual clusters of workstations are nowadays fairly well managed by communication libraries and parallel execution environments. When we start to think about clusters of clusters, we again face the same problems: connecting elements that run independently from each other while still offering the user an appropriate environment for parallel and distributed programming. By appropriate we mean providing an intuitive programming interface and offering enough resources to meet the programmer's needs. As the purpose of this paper is to identify these problems and propose possible solutions to them, we have divided our study into hardware and software analysis.

2.1 Hardware Aspects

From the hardware point of view there are no major problems in achieving such integration, since the networks considered (Myrinet and SCI) can co-exist within the same node and use different techniques to communicate. Figure 1 presents the simplest cluster interconnection that could be realised. Each individual cluster can have any number of physical nodes connected through a switch (in the Myrinet case) or directly as a ring (in the SCI case). To allow the integration, each cluster must have a "gateway" node configured with two network interfaces (two Myrinet NIs, or one Myrinet and one SCI NI), where the additional Myrinet NI is used to link clusters. For the moment we do not consider SCI a suitable technology as a linking medium, since a message-passing paradigm seems more adequate for this purpose.

2.2 Software Aspects

Several points have been discussed by the community in order to identify problems and solutions related to the design and implementation of communication libraries for cluster-based applications, with one main objective: to provide high bandwidth at low latency. Besides this, the development of cluster middleware tools to furnish high availability and single system image support is an ongoing task [4, 11].


Fig. 1. The simplest way to interconnect two workstation clusters.

In the case of clusters of clusters, performance is not a key point, due to the drawbacks implicitly imposed by the loosely coupled integration. There are other problems regarding such integration that must be addressed first; performance will then be a consequence of the techniques used to solve them.

The first point to consider is how to combine message passing with distributed shared memory. A desirable solution would be to offer a single communication abstraction that could be efficiently implemented over both message-passing and shared-memory architectures. In practice, however, it is easier to have an individual mechanism for each one and allow the user to choose between them, depending on the needs of the application.

Another point to treat is the routing problem, which arises when a task needs to exchange data with another task running in a remote cluster. The communication layer must identify the location of a communication endpoint and know how to map physical nodes from separate clusters in order to route messages between them.

Finally, heterogeneity could be a problem. Although most individual workstation clusters are internally homogeneous, there may be cases where multiple clusters are heterogeneous in relation to each other. In these cases, problems regarding "endianisms" and floating-point data representation have to be addressed.

If the previous problems can be efficiently treated, it is also possible to give the user the capacity to decide where to place a specific set of tasks, according to their communication needs. If the application granularity can be modelled considering the underlying platform, it is still possible to achieve good performance.

3 The MultiCluster Model

The MultiCluster model is an approach to join independent clusters and provide a simple programming interface which allows the user to configure and utilize such an integrated platform. With this model we intend to address and provide solutions to the problems mentioned in the previous Section, while still keeping a well-structured and efficient programming environment. To best explain the proposed model, we have divided the discussion into hardware and software aspects.

3.1 Hardware Platform

We are assuming the configuration illustrated in Figure 1, which corresponds to our available hardware platform. We currently have a Myrinet cluster, composed of 4 Dual Pentium Pro 200 MHz nodes, and an SCI cluster, composed of 4 Pentium Celeron 300 MHz nodes. These clusters are linked through a Fast Ethernet network. The choice of the medium used to interconnect the clusters depends mostly on the application needs. It is possible to use a standard Ethernet link instead of Myrinet for the communication between clusters. We propose Myrinet as a link medium because it could minimize the loss in performance caused by the integration of different platforms; for our model, however, it is enough that some node in each cluster plays the role of a gateway. Questions related to cost and scalability are outside the scope of this paper. In the near future, many companies and universities are likely to own a small number of cluster platforms, and so these questions are particular to each of them. We are assuming the situation where at least two clusters are available and have to be used together.

3.2 Software Structure

We have studied each problem mentioned in Section 2.2, trying to find the best solution to each one and structuring our software layer to carry out such solutions. As a result, the MultiCluster model follows some conceptual definitions which rule the way such integration must be handled. Figure 2 shows the user-defined descriptor file for a MultiCluster application. In this file, the user must specify a list of machines within the clusters he wants to use, the communication subnet identifiers (used for inter-cluster communication), a set of logical nodes with their corresponding machines, and the gateway nodes.

Physical and Logical Nodes. A physical node corresponds to each available machine plugged into any individual cluster and only matters for physical questions. Logical nodes are the set of available nodes from the application's point of view. In the case of message-passing clusters, each physical node corresponds to one logical node (this is mandatory). In shared-memory clusters, a logical node can be composed of more than one physical node. The distinction between logical nodes for Myrinet and SCI is made by the node id field. For example, "node 1:0" means the second node within subnet 0 (which is Myrinet in our example), while "node 4:1" means the first node within subnet 1 (which is SCI). It is important to notice that this numbering scheme, although complex, is entirely processed by the environment in a transparent manner; the user only knows how many logical nodes he has and which physical machines are within each logical node.


    // DECK user-defined descriptor file
    // virtual machine
    verissimo, quintana, euclides, dionelio, scliar, ostermann, meyer, luft
    // communication subnets
    myrinet: 0
    sci: 1
    // logical nodes
    node 0:0 machines: verissimo
    node 1:0 machines: quintana
    node 2:0 machines: euclides
    node 3:0 machines: dionelio
    node 4:1 machines: scliar, luft
    node 5:1 machines: ostermann, meyer
    // gateway nodes
    gateways: quintana, scliar

Fig. 2. Descriptor file for a MultiCluster application.
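To make the structure of the descriptor in Fig. 2 concrete, the following Python sketch parses it into a dictionary. It assumes one entry per line and `//` comment lines; the function name `parse_descriptor` and the resulting dictionary layout are hypothetical and are not part of DECK.

```python
SAMPLE = """\
// DECK user-defined descriptor file
// virtual machine
verissimo, quintana, euclides, dionelio, scliar, ostermann, meyer, luft
// communication subnets
myrinet: 0
sci: 1
// logical nodes
node 0:0 machines: verissimo
node 1:0 machines: quintana
node 2:0 machines: euclides
node 3:0 machines: dionelio
node 4:1 machines: scliar, luft
node 5:1 machines: ostermann, meyer
// gateway nodes
gateways: quintana, scliar
"""


def parse_descriptor(text):
    """Parse a descriptor in the assumed line-based layout into a dictionary."""
    result = {"machines": [], "subnets": {}, "nodes": {}, "gateways": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("//"):
            continue  # skip blank lines and comments
        if line.startswith("node "):
            # e.g. "node 4:1 machines: scliar, luft" -> logical node 4 in subnet 1
            head, _, machines = line.partition("machines:")
            node_id, subnet_id = head.split()[1].split(":")
            result["nodes"][int(node_id)] = {
                "subnet": int(subnet_id),
                "machines": [m.strip() for m in machines.split(",")],
            }
        elif line.startswith("gateways:"):
            names = line.split(":", 1)[1]
            result["gateways"] = [g.strip() for g in names.split(",")]
        elif ":" in line:
            # e.g. "myrinet: 0" -> subnet name and identifier
            name, subnet_id = line.split(":", 1)
            result["subnets"][name.strip()] = int(subnet_id)
        else:
            # a bare comma-separated list is taken as the virtual machine
            result["machines"] += [m.strip() for m in line.split(",") if m.strip()]
    return result
```

With this layout, `parse_descriptor(SAMPLE)["nodes"][4]` describes the SCI logical node made of the two physical machines `scliar` and `luft`.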

Intra- and Inter-node Communication. As the application only sees logical nodes, it is relatively easy to adapt the different communication paradigms: inside a logical node, communication is made by shared memory; between logical nodes, communication is made by message passing. From the user's point of view, there is only one programming interface furnishing both mechanisms to specify communication over Myrinet or SCI clusters; the underlying communication layer is in charge of implementing one paradigm or the other.

Heterogeneity. Although a less frequent problem, heterogeneity may arise depending on the availability of the clusters that have to be interconnected. Here, we are considering different data representations and the need to indicate to the message receiver the architecture type of the message sender. This problem is implicitly treated by the communication software.

Even though some performance loss occurs due to such integration, the user can define the best location for his application tasks, creating communication resources according to each task location (i.e. communication subnets). Through this facility, the granularity of communication can be balanced among clusters, avoiding as far as possible traffic across the link network.

3.3 The Programming Environment: DECK

The interface between the programmer and the MultiCluster architecture is the DECK environment. DECK (Distributed Executive Communication Kernel) is composed of a runtime system and a user API which provides a set of services and abstractions for the development of parallel and distributed applications. A DECK application runs in an SPMD style, split in terms of logical nodes.


DECK is divided into two layers: one called µDECK, which directly interacts with the underlying OS, and a service layer, where more elaborate resources (including the support for multiple clusters) are made available. Figure 3 shows the layered structure of DECK.

























Fig. 3. Internal structure of DECK.

µDECK is the platform-dependent part of DECK. This layer implements the five basic abstractions provided within the environment: threads, semaphores, messages, mailboxes and shared segments. Each of these abstractions is treated by the application as an object, and has associated primitives for proper manipulation.

Messages present pack/unpack primitives, which do not necessarily perform marshalling/unmarshalling actions. When a message object is created, one of its attributes holds the identification of the host architecture. At the time of a pack no marshalling is performed; at the time of an unpack, if the receiving host is of a different architecture, the proper data conversion is made¹.

Messages can be posted to or retrieved from mailboxes. Only the creator of a mailbox is allowed to retrieve messages from it, but any other thread knowing the mailbox can post to it. To use a mailbox, the creator must register it in a name server. There are two ways to obtain a mailbox address: fetching it from the name server or receiving it in a message.

The service layer is built on top of µDECK and aims to furnish additional, more sophisticated mechanisms that might be useful for the development of parallel applications, such as naming, group communication and fault tolerance support. In the scope of this paper, two elements of this layer must be analysed: the naming service and the Remote Communication Daemon (RCD).

The name server is a dedicated thread which runs in the first node within each cluster. For example, in the configuration illustrated in Figure 2, there will be a name server running on "verissimo" and another running on "scliar". Each name server is responsible for registering mailboxes created within its cluster. The name server is automatically executed when the application starts and has a well-known mailbox to allow other threads to communicate with it.

¹ It is important to observe that we only expect this to happen for messages crossing cluster boundaries, since clusters are assumed to be internally homogeneous.
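The pack/unpack behaviour described above, where data travels in the sender's representation and is converted only on unpack when architectures differ, can be sketched in Python as follows. This is a minimal illustration restricted to byte order of 32-bit integers; the `Message` class and its method names are hypothetical and do not reproduce the DECK API.

```python
import sys


class Message:
    """Toy message tagged with the sender's byte order (hypothetical layout)."""

    def __init__(self, byte_order=sys.byteorder):
        self.byte_order = byte_order  # "little" or "big": the architecture tag
        self.payload = b""

    def pack_int32(self, value):
        # No marshalling to a neutral format: store in the sender's own order.
        self.payload += value.to_bytes(4, self.byte_order, signed=True)

    def unpack_int32(self, receiver_order=sys.byteorder, offset=0):
        raw = self.payload[offset:offset + 4]
        if self.byte_order != receiver_order:
            # Architectures differ: convert on the receiving side only.
            raw = raw[::-1]
        return int.from_bytes(raw, receiver_order, signed=True)
```

When sender and receiver share the same architecture, as is assumed inside a cluster, the unpack path performs no byte swapping at all, which is the point of deferring conversion to the receiver.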


The DECK/Myrinet Implementation. In the implementation of DECK on top of Myrinet, we are currently using BIP (Basic Interface for Parallelism) [21] as the communication protocol, to use the underlying hardware efficiently and deliver high performance to applications. As BIP utilizes reception queues labeled with tags within each node, our mailbox implementation assigns a specific tag to each mailbox. To create a mailbox, the programmer uses a deck_mbox_create() primitive, passing as arguments the mailbox name and the communication subnet (defined in the descriptor file) in which this mailbox will be used. Communication is made by post and retrieve operations, passing as arguments the corresponding mailbox and the message object, which contains the DECK supported datatypes. Posting a message is an asynchronous operation, while retrieving a message is a synchronous operation. To achieve this behaviour, we use the bip_tisend() and bip_trecv() primitives, respectively. The implementation of µDECK mailboxes and messages on top of BIP is straightforward, since both are based on message passing. Shared segments, however, need additional software DSM support to be implemented with the same library. For the moment we are studying the introduction of a DSM library, such as TreadMarks [25], to allow the usage of shared segments over Myrinet. The primitives for threads and semaphores are trivial and follow the Pthreads standard [13].

The DECK/SCI Implementation. We base our DECK/SCI implementation on two SCI programming libraries: Yasmin [23], which provides basic primitives for creation, mapping and synchronisation of shared segments, and Sthreads [22], which offers a Pthread-like environment on top of Yasmin. A µDECK shared segment object offers primitives for creation, naming, mapping and locking. Unlike Myrinet, SCI allows an easier implementation of both communication paradigms, so DECK/SCI offers mailboxes and messages as well as shared segments. The creation of threads in DECK/SCI follows a simple round-robin placement strategy, according to the number of physical nodes that compose a logical node, which means that placement is still transparent to the end user. Notice that local memory can still be used for communication by local threads (i.e. threads in the same physical node), but it is up to the programmer to keep this kind of control. This means that, within SCI clusters, memory is only guaranteed to be correctly shared between remote threads if it is mapped into a µDECK shared segment.

RCD: Remote Communication Daemon. In order to support the MultiCluster model, the Remote Communication Daemon has been designed as a DECK service responsible for communicating with remote clusters. As each cluster must have a "gateway" node, the RCD is automatically executed inside this node when the application starts and follows the same semantics as the name server, i.e., it also has a well-known mailbox. The RCD acts upon demand in two special cases: when fetching names defined remotely (i.e. on another cluster) and when posting messages to remote mailboxes. When a DECK primitive fails to fetch a mailbox address from a local name server, it contacts the RCD, which then broadcasts the request to the other RCDs in the system and waits for an answer, returning it to the caller. In the second case, when a DECK primitive sees a remote mailbox address when posting a message, it contacts the RCD, which then forwards the message to the RCD responsible for the communication subnet in which the mailbox is valid.

It is important to emphasize that communication between threads in different logical nodes, as well as in different clusters, must always be made by message passing. Even in the case of an SCI cluster, there must be at least one mailbox to allow communication with the RCD and, eventually, to retrieve messages. For the moment we are not considering the use of a global shared memory space to establish communication among clusters, due to the lack of this support in the DECK/Myrinet implementation. Our intention in designing DECK in three parts is to make it usable without changes in both single- and multi-clustered environments. In the first case, the RCD will simply not be brought into action by the application, since all the objects will be local to a specific cluster.
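The two cases in which the RCD acts (remote name lookup and forwarding to a remote mailbox) can be mimicked with a small in-memory Python model. Everything here is an illustrative assumption: the class names, the `(subnet, mailbox)` address tuple and the list-based queues are hypothetical and stand in for the real DECK objects and network traffic.

```python
class NameServer:
    """Per-cluster registry mapping mailbox names to addresses."""

    def __init__(self):
        self.table = {}

    def register(self, name, address):
        self.table[name] = address

    def fetch(self, name):
        return self.table.get(name)


class RCD:
    """Toy Remote Communication Daemon for one cluster's gateway node."""

    def __init__(self, subnet, name_server):
        self.subnet = subnet
        self.name_server = name_server
        self.peers = []      # RCDs of the other clusters
        self.delivered = []  # messages handed over for local mailboxes

    def fetch_remote(self, name):
        # Stand-in for the broadcast: ask every other RCD, return first answer.
        for peer in self.peers:
            address = peer.name_server.fetch(name)
            if address is not None:
                return address
        return None

    def forward(self, address, message):
        # Hand the message to the RCD responsible for the mailbox's subnet.
        for peer in self.peers:
            if peer.subnet == address[0]:
                peer.delivered.append((address, message))
                return True
        return False


def fetch_mailbox(name, local_ns, local_rcd):
    """Try the local name server first, then fall back to the RCD."""
    return local_ns.fetch(name) or local_rcd.fetch_remote(name)


def post(address, message, local_rcd, local_queue):
    """Post directly inside the local subnet, otherwise go through the RCD."""
    if address[0] == local_rcd.subnet:
        local_queue.append((address, message))
        return True
    return local_rcd.forward(address, message)
```

In a single-cluster run every address carries the local subnet, so `post` never reaches `forward`, which mirrors the statement that the RCD is simply not brought into action.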

4 Related Work

Since the purpose of this paper is to discuss practical questions involved in the integration of multiple clusters and to propose our model to achieve such integration, we tried to identify similar proposals regarding this subject. There is a great number of research projects concerning the integration of multiple workstation clusters, such as NOW [1], Beowulf [2], Globus [7] and Legion [10]. The goal of these projects is to allow parallel and distributed programming over geographically distributed, heterogeneous clusters that correspond to a "global computational grid". The distinguishing characteristic of our MultiCluster model is that we assume the simultaneous use of different network technologies, while these projects plan to use a common network technology to connect clusters, providing high scalability.

In terms of programming environments, there are also some efforts concentrated on joining message-passing and distributed shared memory facilities, such as Stardust [5] and Active Messages II [16]. Their main goal is to provide support for both message-passing and distributed shared memory paradigms and, at the same time, to offer mechanisms for fault tolerance and load balancing, as well as portability. There are also some important contributions based on Java, such as JavaNOW [15], JavaParty [20] and Javelin [6]. All these contributions aim to provide distributed programming across networks of workstations or Web-based networks, differing in the communication model they use.

The idea behind MultiCluster is similar in some aspects to the objectives found in the projects and environments mentioned here, though on a smaller scale. Our research goal is to identify and propose solutions to problems related to the specific integration of Myrinet and SCI clusters, while the goals of such projects comprise a larger universe, including fast communication protocols, cluster tools, job scheduling and so on. Nevertheless, it is possible to state brief comparisons: our RCD is a simpler implementation when compared with Nexus, the communication system used inside Globus; it is just a way to give remote access to mailboxes defined in other clusters, and it allows us to separate out the functionality DECK needs when it runs on a single-cluster platform.


The combination of message passing and distributed shared memory we offer is not so different from the usual mechanisms provided by the other environments. We want to implement these mechanisms efficiently in both clusters, without changing the programming interface. To accomplish this, our choice is to provide a mailbox object and a shared segment object to express message passing and memory sharing, respectively.

5 Conclusions and Current Work

In this paper we presented some problems related to the integration of two different cluster platforms and proposed our MultiCluster model to achieve such integration. We are developing our software environment aiming to accomplish a number of objectives, such as joining two specific cluster platforms (Myrinet and SCI) and providing a uniform API for parallel and distributed programming on both platforms, as well as opening research activities concerning such integration.

The integration is easier in terms of hardware because many solutions are already implemented within the OS kernel (e.g. co-existence of network device drivers). In terms of software, we have to decide on the degree of abstraction we want to offer to the programmer. It is important that the user be aware of the characteristics of each individual cluster in order to best adapt his application to benefit from them. On the other hand, the DECK layer must abstract implementation details as much as possible, offering users a complete and simple API able to express the application needs.

Currently, the descriptor file is the key point in configuring the MultiCluster platform, because it represents the communication contexts and the logical nodes the user wants to use. Although this configuration is not so transparent, it is the most suitable way to adapt the execution environment to the user's needs. We consider that there are no problems in this task, since the execution environment guarantees the expected functionality.

Our work has been guided towards the design of a complete set of programming resources, enclosed in a software layer. Through the modularisation of DECK, we have divided our work in such a way that we can parallelize our efforts to cover all the problems exposed and to make the MultiCluster model available as soon as possible. At the moment we already have an implementation of DECK based on Pthreads and UNIX sockets, available at our Web page [18]. This implementation has played an important role in defining the DECK structure and behaviour. At the time of this writing, we are concluding the implementation on top of BIP and collecting some performance results and, at the same time, starting the implementation of DECK objects on top of SCI. The next step is to join both clusters and develop the RCD communication protocol.

References

1. T. Anderson, D. Culler, and D. Patterson. A case for NOW - Network of Workstations. Available by WWW at http://now.cs.berkeley.edu, Oct. 1999.
2. Beowulf. The Beowulf project. Available by WWW at http://www.beowulf.org, Jun. 1999.
3. N. Boden et al. Myrinet: A gigabit-per-second local-area network. IEEE Micro, 15(1):29-36, Feb. 1995.
4. Rajkumar Buyya. High Performance Cluster Computing. Prentice Hall PTR, Upper Saddle River, NJ, 1999.
5. Gilbert Cabillic and Isabelle Puaut. Stardust: an environment for parallel programming on networks of heterogeneous workstations. Journal of Parallel and Distributed Computing, 40:65-80, 1997.
6. B. Christiansen et al. Javelin: Internet-based parallel computing using Java. Available by WWW at http://www.cs.ucsb.edu/research/javelin/, Nov. 1999.
7. Ian Foster and Carl Kesselman. The Globus project. Available by WWW at http://www.globus.org, Jul. 1999.
8. Al Geist et al. PVM Parallel Virtual Machine. MIT Press, Cambridge, MA, 1994.
9. GM message passing system. Available by WWW at http://www.myri.com, Nov. 1999.
10. A. Grimshaw et al. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1), Jan. 1997.
11. Kai Hwang and Zhiwei Xu. Scalable Parallel Computing: Technology, Architecture, Programming. McGraw-Hill, New York, NY, 1997.
12. IEEE. IEEE standard for Scalable Coherent Interface (SCI). IEEE 1596-1992, 1992.
13. IEEE. Information technology - portable operating system interface (POSIX), threads extension [C language]. IEEE 1003.1c-1995, 1995.
14. IEEE. Local and metropolitan area networks - supplement - media access control (MAC) parameters, physical layer, medium attachment units and repeater for 100Mb/s operation, type 100BASE-T (clauses 21-30). IEEE 802.3u-1995, 1995.
15. Java and High Performance Computing Group. The JavaNOW project. Available by WWW at http://www.jhpc.org/projects.html, Nov. 1999.
16. Steven S. Lumetta, Alan M. Mainwaring, and David E. Culler. Multi-protocol Active Messages on a cluster of SMP's. In Proc. of SuperComputing 97, 1997.
17. MPI FORUM. Document for a standard message passing interface. International Journal of Supercomputer Applications and High Performance Computing Technology, 8(3/4), 1994.
18. The MultiCluster project. Available by WWW at http://wwwgppd.inf.ufrgs.br/projects/mcluster, Nov. 1999.
19. S. Pakin, M. Lauria, and A. Chien. High performance messaging on workstations: Illinois Fast Messages for Myrinet. In Supercomputing '95. IEEE Computer Society Press, 1996.
20. Michael Philippsen and Matthias Zenger. JavaParty: A distributed companion to Java. Available by WWW at http://wwwipd.ira.uka.de/JavaParty, Nov. 1999.
21. Loic Prylli and Bernard Tourancheau. BIP: A new protocol designed for high performance networking on Myrinet. In Jose Rolim, editor, Parallel and Distributed Processing, number 1388 in Lecture Notes in Computer Science, pages 472-485. Springer, 1998.
22. Enno Rehling. Sthreads: Multithreading for SCI clusters. In Proc. of Eleventh Symposium on Computer Architecture and High Performance Computing, Natal - RN, Brazil, 1999. Brazilian Computer Society.
23. H. Taskin. Synchronizationsoperationen für gemeinsamen Speicher in SCI-Clustern. Available by WWW at http://www.uni-paderborn.de/cs/ag-heiss/en/veroeffentlichungen.html, Aug. 1999.
24. VIA - Virtual Interface Architecture. Available by WWW at http://www.via.org, Nov. 1999.
25. Willy Zwaenepoel et al. TreadMarks distributed shared memory (DSM) system. Available by WWW at http://www.cs.rice.edu/~willy/TreadMarks/overview.html, Dec. 1998.

Parallel Information Retrieval on an SCI-Based PC-NOW

Sang-Hwa Chung, Hyuk-Chul Kwon, Kwang Ryel Ryu, Han-Kook Jang, Jin-Hyuk Kim, and Cham-Ah Choi

Division of Computer Science and Engineering, Pusan National University, Pusan, 609-735, Korea
{shchung, hckwon, krryu, hkjang, variant, cca}@hyowon.pusan.ac.kr

Abstract. This paper presents an efficient parallel information retrieval (IR) system which provides fast information service for Internet users on a low-cost, high-performance PC-NOW environment. The IR system is implemented on a PC cluster based on the Scalable Coherent Interface (SCI), a powerful interconnecting mechanism for both shared memory models and message passing models. In the IR system, the inverted-index file (IIF) is partitioned into pieces using a greedy declustering algorithm and distributed to the cluster nodes to be stored on each node's hard disk. For each incoming user query with multiple terms, the terms are sent to the corresponding nodes, which contain the relevant pieces of the IIF, to be evaluated in parallel. According to the experiments, the IR system outperforms an MPI-based IR system using Fast Ethernet as an interconnect. Speed-up of up to 4.0 was obtained with an 8-node cluster in processing each query on a 500,000-document IIF.

J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 81-90, 2000. © Springer-Verlag Berlin Heidelberg 2000

1. Introduction

As more and more people access the Internet and acquire vast amounts of information easily, the problem of information retrieval (IR) is increasingly seen to lie not in a lack of information but in how to select the right information from a vast amount quickly. Many of us have already experienced that some IR systems provide information service much faster than others. How fast an IR system can respond to users' queries depends mostly on the performance of the underlying hardware platform. Therefore, most of the major IR service providers have been urged to spend several hundred thousand dollars to purchase their hardware systems. However, for many small businesses on the Internet, that cost is too high.

In this paper, as a cost-effective solution to this problem, a PC cluster interconnected by a high-speed network card is suggested as a platform for fast IR service. With the PC cluster, a massive digital library can be efficiently distributed to PC nodes by utilizing local hard disks. Besides, every PC node can act as an entry to process multiple users' queries simultaneously.

It is extremely important to select a suitable network adapter to construct a high-speed system area network (SAN). For a message passing system, the Fast Ethernet card or the Myrinet card can be used. For a distributed shared memory (DSM) system, the SCI card can be considered. Fast Ethernet, developed for LANs, relies on complicated protocol software such as TCP/IP, and its bandwidth is not high. The Myrinet [1] card is a high-speed message passing card with a maximum bandwidth of 160 Mbyte/sec. However, the network cost is relatively high because Myrinet requires crossbar switches for the network connection. Besides, its message-passing mechanism is based on time-consuming operating system calls. For applications with frequent message-passing, this can lead to performance degradation. To overcome the system call overhead, systems based on a user-level interface for message-passing without intervention of the operating system have been developed. Representative systems include AM [2], FM [3], and U-Net [4]. Recently, Myrinet has also been provided with a new message-passing system called GM [5], which supports user-level OS-bypass network interface access.

The SCI (Scalable Coherent Interface: ANSI/IEEE standard 1596-1992) is designed to provide a low-latency (less than 1 µs) and high-bandwidth (up to 1 Gbyte/sec) point-to-point interconnect. The SCI interconnect can assume any topology including ring and crossbar. Once fully developed, the SCI can connect up to 64K nodes. Since the SCI supports DSM models that can feature both NUMA and CC-NUMA variants, it is possible to make transparent remote memory accesses with memory read/write transactions without using explicit message-passing. The performance of SCI-based systems has been proven by commercial CC-NUMA servers such as Sequent NUMA-Q 2000 [6] and Data General's Aviion [7].

In this research, the SCI is chosen as the underlying interconnecting mechanism for clustering. The parallel IR system is implemented on an SCI-based PC cluster using a DSM programming technique. In the IR system, the inverted-index file (IIF) is partitioned into pieces using a greedy declustering algorithm and distributed to the cluster nodes to be stored on each node's hard disk. An IIF is the sorted list of terms (or keywords), with each term having links to the documents containing that term. For each incoming user's query with multiple terms, terms are sent to the corresponding nodes which contain the relevant pieces of the IIF to be evaluated in parallel. An MPI-based IR system using Fast Ethernet as an interconnect is also constructed for comparison purposes.

2. PC Cluster-based IR System

2.1 Typical IR System on Uniprocessor

Figure 1 shows the structure of a typical IR system implemented on a uniprocessor. As shown in the figure, once a user's query with multiple terms is presented to the system, for each query term in turn the IR engine retrieves relevant information from the IIF in the hard disk. When all the information is collected, the IR engine performs necessary IR operations, scores the retrieved documents, ranks them, and sends the IR result back to the user. For the efficient parallelization of the system, it is important to find out the most time consuming part in executing the IR system. Using the sequential IR system developed previously[8], the system's execution time is analyzed as shown in Figure 2. In the sequential system, the most time consuming part is disk access. Thus, it is necessary to parallelize disk access. This can be done by partitioning the IIF into pieces and distributing the pieces to the processing nodes in a PC cluster.

Fig. 1. A typical IR system (block diagram with the labels Query, IR Engine, and Result)

Fig. 2. Execution time analysis in the sequential IR system (bar chart; vertical axis: share of execution time in %, 0 to 50; bars: disk access, vector extract, IR operation, ranking)

2.2 Declustering IIF

Most current IR systems use a very large lookup table called an inverted index file (IIF) to index relevant documents for given query terms. Each entry of the IIF consists of a term and a list of ids of the documents containing the term. Each document id is tagged with the weight of the term for that document. Given a query, all the query terms are looked up in the IIF to retrieve the relevant document ids and the corresponding term weights. Next, the documents are scored based on the term weight values and then ranked before they are reported back to the user.

Since our IR system processes a user's query in parallel on a PC cluster, it is desirable to have the IIF appropriately declustered to the local hard disks of the processing nodes. We can achieve maximum parallelism if the declustering is done in such a way that the disk I/O and the subsequent scoring job are distributed as evenly as possible over all the processing nodes. An easy random declustering method would be just to assign each of the terms (together with its list of documents) in the IIF lexicographically to each of the processing nodes in turn, repeatedly until all the terms are assigned. In this paper, we present a simple greedy declustering method which performs better than the random method.

Our greedy declustering method tries to put together in the same node those terms which have a low probability of simultaneous occurrence in the same query. If the terms in a query all happen to be stored in the same node, the disk I/O cannot be done in parallel, nor can the scoring job readily be processed in parallel. For an arbitrary pair of terms in the IIF, how can we predict the probability of their co-occurring in the same query? We conjecture that this probability has a strong correlation with the probability of their co-occurrence in the same documents. Given a pair of terms, the probability of their co-occurrence in the same documents can be obtained as the number of documents in which the two terms co-occur divided by the number of all the documents in a given document collection. We calculate this probability for every pair of terms by preprocessing the whole document collection. When the document collection is very large, we can limit the calculation of the co-occurrence probabilities to those terms which are significant. The reason is that about 80% of the terms in a document collection usually occur only once or twice in the whole collection, and they are unlikely to appear in user queries. Also, since the number of terms in a document collection is known to increase in log scale as the number of documents increases, our method will not have much difficulty in scaling up. As more documents are added to the collection, however, the co-occurrence probabilities would need to be re-calculated for maintenance. But this would not happen frequently, because the statistical characteristics of a document collection do not change abruptly.

In the first step of our greedy declustering algorithm, all the terms in the IIF are sorted in decreasing order of the number of documents in which each term appears. The higher this number, the more important the term is, in the sense that it is quite likely to be included in many queries. This is especially true when the queries are modified by relevance feedback [9]. Such terms also have longer lists of documents in the IIF and thus cause heavier disk I/O. Therefore, it is advantageous to store these terms in different nodes whenever possible to enhance I/O parallelism. Suppose there are n processing nodes. We assign the first n of the sorted terms to each of the n nodes in turn. For the next n terms, each term is assigned to the node which contains a term with the lowest probability of co-occurrence. From the third pass of the term assignment, a term is assigned to the node for which the sum of the probabilities of co-occurrence of the term with the terms already assigned to that node is the lowest. This process repeats until all the terms in the IIF are assigned.

2.3 Parallel IR System Model

The PC cluster-based parallel IR system model is shown in Figure 3. The IR system consists of an entry node and multiple processing nodes. The participating nodes are PCs with local hard disks, connected by an SCI-based high-speed network. The working mechanism of the parallel IR system model can be explained as follows. The entry node accepts a user's query and distributes the query terms to the processing nodes (including itself) based on the declustering information described in the previous subsection. Each processing node consults its partition of the IIF using the list of query terms delivered from the entry node, and collects the necessary document list for each term from the local hard disk. Once all the necessary document lists are collected, they are transmitted to the entry node. The entry node collects the document lists from the participating processing nodes (including itself), performs the required IR operations such as AND/OR, and ranks the selected documents according to their scores. Finally, the sorted document list is sent back to the user as the IR result.
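The working mechanism described above can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the names `route_query`, `process_query`, `term_to_node`, and `node_iifs` are hypothetical, the cluster is simulated with in-memory dictionaries instead of SCI remote memory transactions, and scoring a document by the sum of its term weights is an assumption made only for illustration.

```python
from collections import defaultdict


def route_query(terms, term_to_node):
    """Group query terms by the node whose IIF partition stores them."""
    per_node = defaultdict(list)
    for term in terms:
        node = term_to_node.get(term)
        if node is not None:  # terms that are not in the IIF are skipped
            per_node[node].append(term)
    return dict(per_node)


def process_query(terms, term_to_node, node_iifs, mode="OR"):
    """Entry-node view: scatter terms, gather document lists, combine, rank."""
    gathered = {}
    for node, node_terms in route_query(terms, term_to_node).items():
        for term in node_terms:
            # Each document list maps doc_id -> term weight for that document.
            gathered[term] = node_iifs[node][term]
    if not gathered:
        return []
    doc_sets = [set(postings) for postings in gathered.values()]
    if mode == "AND":
        docs = set.intersection(*doc_sets)
    else:
        docs = set.union(*doc_sets)
    # Assumed scoring rule: sum of the term weights of the matching terms.
    scores = {d: sum(p.get(d, 0.0) for p in gathered.values()) for d in docs}
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

In this sketch the declustering information is simply the `term_to_node` dictionary; any declustering method, random or greedy, can supply it.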

     

  

    

             

            

       

       

       

       

   

   

   

   

Fig. 3. Parallel IR system model

2.4 Experimental PC Cluster System

In this research, an 8-node SCI-based PC cluster system is constructed as shown in Figure 4. Each node is a 350 MHz Pentium II PC with 128 Mbyte of main memory and a 4.3 Gbyte SCSI hard disk, running Linux kernel 2.0.36. In the cluster, any PC node can be configured as an entry node. As shown in the figure, each PC node is connected to the SCI network through a PCI-SCI bridge card from Dolphin Interconnect Solutions (DIS). There are 4 rings in the network, with 2 nodes in each ring. The rings are interconnected by a DIS 4x4 SCI switch. For DSM programming, the DIS SISCI (Software Infrastructure for SCI) API [10] is used. With this configuration, the maximum point-to-point bulk transfer rate obtained is approximately 80 Mbyte/sec.





  











    

  

 

Fig. 4. SCI-based 8 node PC cluster system

For comparison purposes, an 8-node Fast Ethernet-based PC cluster system is also constructed. Each PC node has the same configuration as a node of the SCI network except that a PCI Fast Ethernet adapter is used for networking. A switching hub is used to interconnect the PC nodes in the cluster. For message-passing programming, MPICH 1.1.1 [11] is used. In this case, the maximum point-to-point bulk transfer rate obtained is approximately 10 Mbyte/sec.

2.5 SCI-based DSM Programming

The SCI interconnect mechanism supports DSM programming. By using SISCI, a node in the SCI-based PC cluster can establish a mapping between its local memory address space and a remote node's memory address space. Once the mapping is established, the local node can access the remote node's memory directly. In DSM programming, the communication between PC nodes in the cluster is done using remote read and remote write transactions instead of message-passing. These remote read/write transactions are carried out using the remote read/write functions provided by SISCI. When the IR program is actually coded, most of the remote memory transactions are implemented using the remote write function. This is because the remote write function performs about 10 times faster than the remote read function on the DIS PCI-SCI bridge card.

3. Performance of PC Cluster-based IR System

3.1 Performance Comparison between SCI-based System and MPI-based System

In this experiment, average query processing times are measured for the 8-node SCI-based system, the 8-node MPI-based system and a single-node system. The IIF is constructed from 100,000 documents collected from articles in a newspaper. A user's query consists of 24 terms. Each query is made to contain a rather large number of terms because queries modified by relevance feedback usually have that many terms. The IIF is randomly declustered to be stored on each processing node's local disk. As shown in Table 1, the disk access time is reduced for both the SCI-based system and the MPI-based system when compared with the single-node system. However, the MPI-based system is worse than the single-node system in total query processing time because of the communication overhead. The SCI-based system has much less communication overhead than the MPI-based system, and performs better than the single-node system. The speed-up improves with the further optimizations presented in the following subsections.

Table 1. Query processing times of 8-node SCI-based system and 8-node MPI-based system (unit: sec)

                        SCI-based system   MPI-based system   Single-node system
Send query term         0.0100             0.0251             0
Receive document list   0.0839             0.2097             0
Disk access             0.0683             0.0683             0.2730
IR operation            0.0468             0.0468             0.0468
Total                   0.2091             0.3500             0.3198
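As a worked check of the comparison above, the speed-ups implied by Table 1 can be recomputed from the component times. The short sketch below uses only the numbers printed in the table; the names `TABLE1`, `total`, `speedup`, and `comm_fraction` are illustrative. Summing the components gives totals that differ from the printed Total row by 0.0001 sec for the two parallel systems, presumably because of rounding in the table.

```python
# Component times from Table 1 in seconds, in the order:
# send query term, receive document list, disk access, IR operation.
TABLE1 = {
    "SCI": (0.0100, 0.0839, 0.0683, 0.0468),
    "MPI": (0.0251, 0.2097, 0.0683, 0.0468),
    "single": (0.0, 0.0, 0.2730, 0.0468),
}


def total(system):
    """Total query processing time as the sum of the four components."""
    return sum(TABLE1[system])


def speedup(system):
    """Speed-up relative to the single-node system."""
    return total("single") / total(system)


def comm_fraction(system):
    """Share of the total time spent sending terms and receiving lists."""
    send, receive, _disk, _ir = TABLE1[system]
    return (send + receive) / total(system)


if __name__ == "__main__":
    for name in ("SCI", "MPI"):
        print(name, round(speedup(name), 2), round(comm_fraction(name), 2))
```

With these figures the SCI-based system is about 1.53 times faster than the single node, while the MPI-based system reaches only about 0.91, that is, it is slower than the single node, which matches the discussion above.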

3.2 Effect of Declustering IIF

The greedy declustering method is compared with the random method on a test set consisting of 500 queries, each containing 24 terms. To generate the test queries we randomly sampled 500 documents from a document collection containing 500,000 newspaper articles. From each document, the 24 most important terms are selected to make a query. The importance of a term in a document is judged by the value tf x idf, where tf is the term's frequency in that document and idf is the so-called inverse document frequency. The inverse document frequency is given by log_2(N/n) + 1, where N is the total number of documents in the collection and n is the number of documents containing the term. Therefore, a term in a document is considered important if its frequency in that document is high enough but at the same time it does not appear in too many other documents. Table 2 shows the experimental results comparing the random declustering and the greedy declustering methods using those 500 queries on our 500,000-document collection.
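The query generation step can be sketched as follows. This is an illustrative reading of the tf x idf selection rule, not the authors' code: the function names `idf` and `top_terms` are hypothetical, the logarithm base is exposed as the parameter `base` (defaulting to 2), and ties between equally important terms are broken alphabetically, which the text does not specify.

```python
import math
from collections import Counter


def idf(num_docs, doc_freq, base=2.0):
    """Inverse document frequency: log(N/n) + 1 in the given base."""
    return math.log(num_docs / doc_freq, base) + 1.0


def top_terms(doc_tokens, doc_freq, num_docs, k=24):
    """Return the k terms of one document with the highest tf x idf value."""
    tf = Counter(doc_tokens)
    scored = {
        term: count * idf(num_docs, doc_freq[term])
        for term, count in tf.items()
        if doc_freq.get(term, 0) > 0
    }
    # Highest tf x idf first; alphabetical order breaks ties (an assumption).
    ranked = sorted(scored, key=lambda term: (-scored[term], term))
    return ranked[:k]
```

For example, in a collection of N = 8 documents, a term occurring in n = 2 of them has idf = log_2(4) + 1 = 3, whereas a term occurring in all 8 documents has idf = 1.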

Table 2. Comparison of random declustering and greedy declustering (unit: sec)

                                                    Random declustering   Greedy declustering
Average query processing time                       0.5725                0.5384
Accumulated query processing time for 500 queries   286.2534              269.1919
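For reference, the greedy declustering procedure of Section 2.2 that is evaluated in Table 2 can be sketched as below. The sketch makes two assumptions that the text leaves open: within each pass of n terms every node receives exactly one term, and ties are broken by the lowest node index. The names `cooccurrence_probs` and `greedy_decluster` are hypothetical, and documents are represented simply as sets of terms.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_probs(docs):
    """Return pairwise co-occurrence probabilities and per-term document counts."""
    n_docs = len(docs)
    pair_count = defaultdict(int)
    doc_count = defaultdict(int)
    for terms in docs:
        for term in terms:
            doc_count[term] += 1
        for a, b in combinations(sorted(terms), 2):
            pair_count[frozenset((a, b))] += 1
    probs = {pair: count / n_docs for pair, count in pair_count.items()}
    return probs, dict(doc_count)


def greedy_decluster(docs, n_nodes):
    """Assign each term to a node, keeping co-occurring terms apart."""
    probs, doc_count = cooccurrence_probs(docs)
    # Step 1: sort terms by decreasing document count (alphabetical on ties).
    terms = sorted(doc_count, key=lambda t: (-doc_count[t], t))
    nodes = [[] for _ in range(n_nodes)]
    for start in range(0, len(terms), n_nodes):
        open_nodes = set(range(n_nodes))  # assumption: one term per node per pass
        for term in terms[start:start + n_nodes]:

            def cost(node):
                # Sum of co-occurrence probabilities with terms already on the node.
                return sum(
                    probs.get(frozenset((term, other)), 0.0)
                    for other in nodes[node]
                )

            best = min(sorted(open_nodes), key=cost)
            nodes[best].append(term)
            open_nodes.remove(best)
    return {term: node for node, assigned in enumerate(nodes) for term in assigned}
```

In the first pass every node is empty, so all costs are zero and the n most frequent terms are dealt out to the nodes in turn; in the second pass the cost reduces to a single pairwise probability, which corresponds to the second-pass rule in the text.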

3.3 Performance with Various-sized IIF

In this subsection, the performance of the SCI-based parallel IR system is analyzed with the number of documents increased up to 500,000. These documents are collected from a daily newspaper, and 500,000 documents amount to the collection of the daily newspaper articles for 7 years. The size of the IIF increases proportionally as the number of documents increases. For example, the size of the IIF is 300 Mbytes for 100,000 documents, and 1.5 Gbytes for 500,000 documents. The 8-node PC cluster and the greedy declustering method are used for the experiment. The experimental result is presented in Figure 5. It takes 0.1805 seconds to process a single query with the 100,000-document IIF, while it takes 0.2536 seconds with the 200,000-document IIF and 0.5398 seconds with the 500,000-document IIF. As the IIF size increases, the document list for each query term becomes longer, and the time spent for IR operations (AND/OR operations) increases considerably. As a result, the IR operation eventually takes more time than the disk access, and becomes the major bottleneck.

  

    

    





  

  

  

  

Fig. 5. IIF size vs. query processing time



 

 


3.4 Reducing IR Operation Time

As presented in the previous subsection, the IR operation time turns out to be a new overhead as the IIF size increases. In the IR system, AND/OR operations are performed by the entry node after all the necessary document lists are collected from the processing nodes. However, it is possible to perform the AND/OR operations partially on the document lists collected in each processing node, so that each processing node transmits only its partial result to the entry node. This helps in reducing not only the IR operation time but also the communication time. The performance of the improved system in comparison with the original system is shown in Figure 6. In the experiment, the 8-node PC cluster, the greedy declustering method and the 500,000-document IIF are used. In the original system, the IR operation takes 0.2873 seconds, which is more than 53% of the total query processing time. However, in the improved system, the IR operation takes only 0.1035 seconds, which is about 35% of the total time. Thus, the IR operation again takes less time than the disk access. The communication time is also reduced from 0.1128 seconds to 0.0500 seconds, and the total time is reduced to almost half when compared with the original system.
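The effect of partial AND/OR operations can be illustrated with a small simulation. It is a sketch under stated assumptions rather than the implementation measured above: document lists are modelled as dictionaries from document id to weight, scores are summed, and the number of transmitted list entries stands in for communication cost. The names `combine`, `original_scheme`, and `improved_scheme` are hypothetical.

```python
def combine(postings, mode="AND"):
    """Merge {doc_id: score} maps with AND (intersection) or OR (union)."""
    postings = list(postings)
    if not postings:
        return {}
    keys = [set(p) for p in postings]
    docs = set.intersection(*keys) if mode == "AND" else set.union(*keys)
    return {d: sum(p.get(d, 0.0) for p in postings) for d in docs}


def original_scheme(node_lists, mode="AND"):
    """Every node ships all its document lists; the entry node combines them."""
    shipped = [p for lists in node_lists for p in lists]
    entries_sent = sum(len(p) for p in shipped)
    return combine(shipped, mode), entries_sent


def improved_scheme(node_lists, mode="AND"):
    """Each node combines its own lists first and ships one partial result."""
    partials = [combine(lists, mode) for lists in node_lists if lists]
    entries_sent = sum(len(p) for p in partials)
    return combine(partials, mode), entries_sent


if __name__ == "__main__":
    # Node 0 holds two query terms, node 1 holds one.
    node_lists = [
        [{1: 1.0, 2: 1.0, 3: 1.0}, {2: 0.5, 3: 0.5}],
        [{3: 2.0, 4: 2.0}],
    ]
    print(original_scheme(node_lists))
    print(improved_scheme(node_lists))
```

Both schemes return the same final document list because AND and OR are associative, but in the improved scheme the entry node receives fewer entries and has fewer lists to combine.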



 











   

  

Fig. 6. Query processing time with reduced IR operation time

Figure 7 shows the speed-up of the parallel IR system. The maximum speed-up obtained from the 8-node system when compared with the single-node system is 4.0. As shown in the figure, the speed-up of the parallel IR system saturates rapidly beyond the 4-node system. As the number of processing nodes in the system increases, the disk access time¹ is reduced because the average number of query terms assigned to each node decreases. However, the IR operation time and the communication time increase as the number of document lists transmitted to the entry node increases, and this attenuates the overall speed-up. The problem may be alleviated by applying the following idea. Instead of sending all the document lists to the entry node, intermediate nodes can be utilized to merge the document lists by performing AND/OR operations in advance, as shown in Figure 8. Thus the entry node finally handles only two document lists. This will help in reducing both the IR operation time and the communication time. Experiments need to be performed to verify the above idea.

¹ The disk access time includes the time spent for partial AND/OR operations in the processing nodes.
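The proposed merging through intermediate nodes amounts to a pairwise tree reduction. The following sketch shows one way it could work; it is an assumption about the scheme, which the text only outlines, and it models AND merging with summed scores. The names `merge_and` and `tree_merge` are hypothetical.

```python
def merge_and(a, b):
    """AND-merge two {doc_id: score} lists, summing the scores."""
    return {d: a[d] + b[d] for d in a.keys() & b.keys()}


def tree_merge(lists, merge=merge_and):
    """Merge lists pairwise until at most two remain, then merge at the entry node.

    Returns (result, lists handled by the entry node, intermediate rounds).
    """
    level = list(lists)
    if not level:
        return {}, 0, 0
    rounds = 0
    while len(level) > 2:
        # Intermediate nodes merge neighbouring pairs in parallel.
        merged = [
            merge(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2:
            merged.append(level[-1])  # an unpaired list moves up unchanged
        level = merged
        rounds += 1
    entry_inputs = len(level)
    result = level[0] if entry_inputs == 1 else merge(level[0], level[1])
    return result, entry_inputs, rounds
```

With 8 document lists, two rounds of intermediate merging leave exactly two lists for the entry node, as in Figure 8.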

  

 



 





  

 

















 







  

Fig. 7. Number of processing nodes vs. query processing time

Fig. 8. Merging document lists in intermediate nodes

4. Conclusions

In this paper, as a cost-effective solution for fast IR service, an SCI-based PC cluster system is proposed. In the parallel IR system developed on the PC cluster, the IIF is partitioned into pieces using a greedy declustering algorithm and distributed to the cluster nodes to be stored on each node's hard disk. For each incoming user's query with multiple terms, terms are sent to the corresponding nodes which contain the relevant pieces of the IIF to be evaluated in parallel. The IR system is developed using a DSM programming technique based on SCI. According to the experiments, the IR system outperforms an MPI-based IR system using Fast Ethernet as an interconnect. Speed-up of 4.0 was obtained with the 8-node cluster in processing each query on a 500,000-document IIF.

Currently, the parallel IR system has a single entry node. In future research, a PC cluster-based IR system with multiple entry nodes will be developed. Each processing node in the cluster system can then act as an entry node to process multiple users' queries simultaneously. This will help in improving both the IR system's utilization and throughput. With more research effort, we hope this model will evolve into a practical solution for low-cost, high-performance IR service on the Internet.

References

1. IEEE, "MYRINET: A GIGABIT PER SECOND LOCAL AREA NETWORK", IEEE-Micro, Vol. 15, No. 1, February 1995, pp. 29-36.
2. "Active Messages: a Mechanism for Integrated Communication and Computation", Thorsten von Eicken and David Culler, et al., 1992.
3. "Fast Messages (FM): Efficient, Portable Communication for Workstation Clusters and Massively-Parallel Processors", IEEE Concurrency, vol. 5, No. 2, April-June 1997, pp. 60-73. (Pakin, Karamcheti & Chien)
4. "U-Net: A User-Level Network Interface for Parallel and Distributed Computing", Anindya Basu, Vineet Buch, Werner Vogels, Thorsten von Eicken, Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP), Copper Mountain, Colorado, December 3-6, 1995.
5. http://www.myri.com/GM/doc/gm_toe.html
6. "NUMA-Q: An SCI based Enterprise Server", http://www.sequent.com/products/highend_srv/sci_wpl.html
7. "SCI Interconnect Chipset and Adapter: Building Large Scale Enterprise Servers with Pentium Pro SHV Nodes", http://www.dg.com/about/html/sci_interconnect_chipset_and_a.html
8. S.H. Park, H.C. Kwon, "An Improved Relevance Feedback for Korean Information Retrieval System", Proc. of the 16th IASTED International Conf. Applied Informatics, IASTED/ACTA Press, pp. 65-68, Garmisch-Partenkirchen, Germany, February 23-25, 1998.
9. Salton, G. and Buckley, C., "Improving retrieval performance by relevance feedback", American Society for Information Science, 41, 4, pp. 288-297, 1990.
10. http://www.dolphinics.no/customer/software/linux/index.html
11. "A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard", http://www-unix.mcs.anl.gov/mpi/mpich/docs.html

A PC-NOW Based Parallel Extension for a Sequential DBMS

Matthieu Exbrayat and Lionel Brunie
Laboratoire d'Ingenierie des Systemes d'Information
Institut National des Sciences Appliquees, Lyon, France
Matthieu.Exbrayat@lisi.insa-lyon.fr, Lionel.Brunie@insa-lyon.fr

Abstract. In this paper we study the use of networks of PCs to handle the parallel execution of relational database queries. This approach is based on a parallel extension, called the parallel relational query evaluator, working in a coupled mode with a sequential DBMS. We present a detailed architecture of the parallel query evaluator and introduce Enkidu, the efficient Java-based prototype that has been built according to our concepts. We present a set of measurements, conducted on Enkidu, that highlight its performance. We finally discuss the interest and viability of the concept of parallel extension in the context of relational databases and in the wider context of high performance computing.

Keywords: Networks of workstations, Parallel DBMS, Java

1 Introduction

Parallelizing Database Management Systems (DBMS) has been a flourishing field of research for the last fifteen years. Research, experiment and development have been conducted according to three main goals. The first one is to accelerate heavy operations, su


Figure 3.(b) The schematic of the GP1 switch.

Figure 4. The GP shift switches.

A Non-binary Parallel Arithmetic Architecture

Table 1. The function of GP1 switch

4, the MPSU reads four sets of coordinates from the queue. These four coordinates represent the four blocks that are to be merged. The MPSU then carries out the data movement operation using the hardware configured on the FPGA. The data for the addition process is read from the FPGA on-board RAM. Once the merging process is completed, the coordinates of the merged block are written to queue Q2. The MPSU repeats the process for all 2x2 blocks read from Q1. The process continues until all the blocks in the queue Q1 are processed, at which time all blocks of size 2x2 have been merged into 4x4 blocks. The MPSU then begins to read the block coordinates from the queue Q2, merges the blocks and writes the resulting coordinates to queue Q1. This process of switching between queue Q1 and queue Q2 is repeated until all the blocks are processed and we have a single entry in one of the queues.
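The alternation between the two queues can be sketched in software as follows. This is only an illustration of the control flow described above, under several assumptions: blocks are represented as hypothetical (x, y, size) triples, the initial queue is filled in Z-order so that four consecutive entries always form one larger square, the image side is the leaf size times a power of two, and the actual data movement performed on the FPGA is not modelled. The names `z_order_blocks` and `merge_all` are hypothetical.

```python
from collections import deque


def z_order_blocks(x, y, size, leaf):
    """List the leaf blocks of a square region so that siblings are adjacent."""
    if size == leaf:
        return [(x, y, size)]
    half = size // 2
    blocks = []
    for dy in (0, half):
        for dx in (0, half):
            blocks += z_order_blocks(x + dx, y + dy, half, leaf)
    return blocks


def merge_all(n, leaf=2):
    """Merge leaf blocks of an n x n array level by level using two queues."""
    q_in = deque(z_order_blocks(0, 0, n, leaf))   # plays the role of Q1
    q_out = deque()                               # plays the role of Q2
    passes = 0
    while len(q_in) > 1:
        while q_in:
            # Read the coordinates of the four blocks that are to be merged.
            group = [q_in.popleft() for _ in range(4)]
            x = min(block[0] for block in group)
            y = min(block[1] for block in group)
            # Write the coordinates of the merged block to the other queue.
            q_out.append((x, y, 2 * group[0][2]))
        q_in, q_out = q_out, q_in                 # switch the queue roles
        passes += 1
    return q_in[0], passes
```

For an 8x8 array, the 16 initial 2x2 blocks are merged into four 4x4 blocks in the first pass and into a single 8x8 block in the second, leaving one entry in one of the queues.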

Parallel Hardware-Software Architecture


Figure 1.2 Hardware Software Architecture for DWT

8 FPGA Implementation and Resource Use

The H.O.T Works board from VCC [3] is provided with on-board RAM, which can be used to store the rMap index. Implementing the rMap index in this way is efficient, as the data shifting process can be carried out by means of an addition/subtraction circuit configured on the FPGA. The rMap index contains a series of (x,y) pairs which point to specific locations in the original data matrix. Figure 1.3 shows the comparative number of data accesses for the conventional RMF and the hardware-software implementation of RMF. We see a substantial decrease in the total number of direct accesses. Although we need to reset the data array to the correct positions after the completion of all blocks of a certain level, we can do so by using the block access mechanism rather than singular data accesses. These block accesses amount to a fraction of the initial data accesses. The figure below shows the original gray map image along with the reverse-transformed image (.PGM format).


P. Jamkhandi et al.

Re-constructed Image

Original Image

       











 











  

 

  



 





 

Fig. 1.3 Chart showing the reduction in the main memory data accesses. The data accesses are transformed into FPGA board RAM accesses.

References

[1] K. Mukherjee, "Image Compression and Transmission using Wavelets and Vector Quantization", Ph.D. Dissertation, University of Central Florida, 1999.
[2] S. G. Mallat, "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 7, pp. 674-693, July 1989.
[3] VCC H.O.T Works Board Manual

Fifth International Workshop on High-level Parallel Programming Models and Supportive Environments HIPS 2000

Chair: Martin Schulz
Lehrstuhl für Rechnertechnik und Re

[Figure panels: IBM SP-2: WF/1D/VERT; Cray T3E: WF/1D/HOR; IBM SP-2: WF/1D/HOR; Cray T3E: WF/2D; IBM SP-2: WF/2D; Cray T3E: WF/1D/BOTH; IBM SP-2: WF/1D/BOTH. Horizontal axis: number of processors.]

Fig. 3. Performance summary. Kernel names are from Fig. 1. Note that all PGHPF bars are present, but they are very small for WF/1D/HOR and WF/2D.

xlHPF is competitive with C+MPI and ZPL because it performs pipelining. The single processor bars highlight disparities in local computation performance. ZPL performs considerably better than any of the others for WF/1D/VERT. We hypothesize that the dependences in this kernel thwart proper array access optimization by the xl optimizer (used by both the Fortran and C compilers). The ZPL code does not suffer from this, because its compiler generates direct pointer references rather than using C arrays. When the C+MPI code is modified in this way, its performance matches ZPL. Conversely, ZPL is worse for WF/1D/HOR. Again, we believe this is an optimization issue. When the ZPL code is modified to use C arrays rather than pointer manipulation, it matches HPF. In summary, when we ignore the differences that arise from using C versus Fortran, the performance of the C+MPI, xlHPF, and ZPL kernels is comparable. Nevertheless, as stated in the previous section, we found a number of wavefronts that even the xlHPF compiler failed to optimize.

5 Conclusion

We have evaluated the experience and performance of expressing wavefront computations by three different approaches: programmer implemented via message passing, compiler discovered via automatic parallelization, and programmer defined via explicit parallel language features for pipelining. Our study reveals that in developing wavefronts, each approach can produce an efficient solution, but at a cost. The message passing codes took considerably longer to develop and debug than the other approaches. The HPF codes did not reliably perform well. Although one compiler produced efficient code, the other was three orders of magnitude worse. Even the better compiler failed to pipeline some very simple cases. We find that the language-level approach embodied in ZPL simplifies program development and results in good performance that is consistently achieved.

Acknowledgements. This research was supported in part by a grant of HPC time from the Arctic Region Supercomputing Center.


Specification Techniques for Automatic Performance Analysis Tools

Michael Gerndt, Hans-Georg Eßer
Central Institute for Applied Mathematics, Research Centre Juelich
{m.gerndt, h.g.esser}@fz-juelich.de

Abstract. Performance analysis of parallel programs is a time-consuming task and requires a lot of experience. It is the goal of the KOJAK project at the Research Centre Juelich to develop an automatic performance analysis environment. A key requirement for the success of this new environment is its easy integration with already existing tools on the target platform. The design should lead to tools that can be easily retargeted to different parallel machines based on specification documents. This article outlines the features of the APART Specification Language designed for that purpose and demonstrates its applicability in the context of the KOJAK Cost Analyzer, a first prototype tool of KOJAK.

1 Introduction

Current performance analysis tools for parallel programs assist the application programmer in measuring and interpreting performance data. But the application of these tools to real programs is a time-consuming task which requires a lot of experience, and frequently the revealed performance bottlenecks belong to a small number of well-defined performance problems, such as load balancing and excessive message passing overhead. It is the goal of the KOJAK project (Kit for Objective Judgement and Automatic Knowledge-based detection of bottlenecks) at the Research Centre Juelich to develop an environment that automatically reveals well-defined typical bottlenecks [www.fz-juelich.de/zam/kojak].

We designed KOJAK [6] such that it is not implemented for a single target environment only, e.g. the Cray T3E currently installed at our center, but can easily be ported to other target platforms as well. KOJAK will use specification documents to interface to existing performance data supply tools and to specify potential performance problems of the target programming paradigm.

In parallel with the development of KOJAK, automatic performance analysis techniques are investigated in the ESPRIT IV Working Group on Automatic Performance Analysis: Resources and Tools (APART) [www.fz-juelich.de/apart].

This article demonstrates the main features of the APART Specification Language (ASL) [3] within the context of the KOJAK Cost Analyzer (COSY) (Section 3). The performance data analyzed in COSY are specified as an ASL object model (Section 4.1) and represented at runtime via a relational database scheme.

J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 269-276, 2000. © Springer-Verlag Berlin Heidelberg 2000


The performance problems COSY is aiming at are specified as ASL performance properties (Section 4.2) based on the performance data model and are implemented via SQL queries (Section 5).

2 Related Work

The use of specification languages in the context of automatic performance analysis tools is a new approach. Paradyn [8] performs an automatic online analysis and is based on dynamic monitoring. While the underlying metrics can be defined via the Metric Description Language (MDL) [9], the set of searched bottlenecks is fixed. It includes CPUbound, ExcessiveSyncWaitingTime, ExcessiveIOBlockingTime, and TooManySmallIOOps.

A rule-based specification of performance bottlenecks and of the analysis process was developed for the performance analysis tool OPAL [5] in the SVM-Fortran project. The rule base consists of a set of parameterized hypotheses with proof rules and refinement rules. The proof rules determine whether a hypothesis is valid based on the measured performance data. The refinement rules specify which new hypotheses are generated from a proven hypothesis [4].

Another approach is to define a performance bottleneck as an event pattern in program traces. EDL [1] allows the definition of compound events based on extended regular expressions. EARL [10] describes event patterns in a more procedural fashion as scripts in a high-level event trace analysis language which is implemented as an extension of common scripting languages like Tcl, Perl or Python.

3 Overall Design of the KOJAK Cost Analyzer

COSY [7] analyzes the performance of parallel programs based on performance data of multiple test runs. It identifies program regions, i.e. subprograms, loops, if-blocks, subroutine calls, and arbitrary basic blocks, with high parallelization overhead based on the region's speedup. It explains the parallelization overhead by identifying performance problems and ranking those problems according to their severity.

COSY is integrated into the CRAY T3E performance analysis environment. The performance data measured by Apprentice [2] are transferred into a relational database. The implementation of the interface between COSY and Apprentice via the database facilitates the integration with other performance data supply tools on CRAY T3E as well as the integration with other environments.

The database includes static program information, such as the region structure and the program source code, as well as dynamic information, such as execution time, number of floating point, integer and load/store operations, and instrumentation overhead. For each subroutine call, the mean value, standard deviation, minimum, and maximum of both the execution time and the pass count are stored.


After program execution Apprentice is started. Apprentice then computes summary data for program regions, taking into account compiler optimizations. The resulting information is written to a file and transferred into the database. The database includes multiple applications with different versions and multiple test runs per program version. The data model is outlined in Section 4.1.

The user interface of COSY allows the user to select a program version and a specific test run. The tool analyzes the dynamic data and evaluates a set of performance properties (Section 4.2). The main property is the total cost of the test run, i.e. the cycles lost in comparison to optimal speedup; other properties explain these costs in more detail. The basis for this computation is the test run with the smallest number of processors. The performance properties are ranked according to their severity and presented to the application programmer.

4 Performance Property Specification

COSY is based on specifications of the performance data and performance properties. The specifications are presented in ASL in the next two subsections. ASL supports the following concepts:

Performance property: A performance property characterizes one aspect of the performance behavior of an application. A property has one or more conditions that can be applied to identify this property. It has a confidence expression that returns a value between zero and one depending on the strength of the indicating condition. Finally it has a severity expression that returns a numerical value. If the severity of a property is greater than zero, this property has some negative effect on the program's performance.

Performance problem: A performance property is a performance problem, iff its severity is greater than a user- or tool-defined threshold.

Bottleneck: A program has a unique bottleneck, which is its most severe performance property. If this bottleneck is not a performance problem, the program does not need any further tuning.
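The three concepts above can be illustrated with a small Python sketch. It is not part of ASL or COSY; the class and function names, the example severities, and the thresholds are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PropertyInstance:
    """One evaluated performance property (illustrative names, not ASL)."""
    name: str
    holds: bool        # did one of the property's conditions evaluate to true?
    confidence: float  # value between zero and one
    severity: float    # greater than zero means a negative performance effect


def performance_problems(props: List[PropertyInstance], threshold: float):
    """A property is a performance problem iff its severity exceeds the threshold."""
    return [p for p in props if p.holds and p.severity > threshold]


def bottleneck(props: List[PropertyInstance]) -> Optional[PropertyInstance]:
    """The bottleneck is the most severe property that holds."""
    held = [p for p in props if p.holds]
    return max(held, key=lambda p: p.severity) if held else None


def needs_tuning(props: List[PropertyInstance], threshold: float) -> bool:
    """No further tuning is needed if the bottleneck is not a performance problem."""
    b = bottleneck(props)
    return b is not None and b.severity > threshold


# Hypothetical evaluation results for one region and one test run.
props = [
    PropertyInstance("SublinearSpeedup", True, 1.0, 0.30),
    PropertyInstance("SyncCost", True, 1.0, 0.12),
    PropertyInstance("LoadImbalance", False, 1.0, 0.0),
]
```

With a threshold of 0.2 only SublinearSpeedup counts as a performance problem here; with a threshold of 0.5 the bottleneck is not a problem and the program would need no further tuning.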

The entire specification consists of two sections. The first section models performance data while the second section specifies performance properties based on the data model.

4.1 Data Model

The performance data can be easily modeled via an object-oriented approach. ASL provides constructs to specify classes similar to Java with single inheritance only. Classes in the data model have attributes but no methods, since the specification will not be executed. Due to space limitations, the ASL syntax is not formally introduced in this article; instead, we present the performance data model used in COSY.


class Program {
  String Name;
  setof ProgVersion Versions;
}

class ProgVersion {
  DateTime Compilation;
  setof Function Functions;
  setof TestRun Runs;
  SourceCode Code;
}

The Program class represents a single application which is identified by its name. COSY can store multiple programs in its database. An object of that class contains a set of ProgVersion objects, each with the compilation timestamp, the source code, the set of functions (static information) and the executed test runs (dynamic information).

class TestRun {
  DateTime Start;
  int NoPe;
  int Clockspeed;
}

class Function {
  String Name;
  setof FunctionCall Calls;
  setof Region Regions;
}

A TestRun object determines the start time and the processor configuration. A Function object specifies the function name, the call sites, and the program regions in this function. All this information is static.

class TotalTiming {
  TestRun Run;
  float Excl;
  float Incl;
  float Ovhd;
}

class Region {
  Region ParentRegion;
  setof TotalTiming TotTimes;
  setof TypedTiming TypTimes;
}

The Region class models a program region with its parent region and its performance data gathered during execution. Performance data are modeled by two classes, according to the internal structure of Apprentice. The TotalTiming class contains the summed up exclusive and inclusive computing time as well as the overhead time. As there may be several test runs, there are also possibly several TotalTiming objects for a region. The TypedTiming class determines the execution time for special types of overhead such as I/O, message passing and barrier synchronization; Apprentice knows 25 such types. As with the TotalTiming objects, there is a set of TypedTiming objects for every test run, but for each region there is at most one object per timing type and per test run.

class TypedTiming {
  TestRun Run;
  TimingType Type;
  float Time;
}

class FunctionCall {
  Function Caller;
  Region CallingReg;
  setof CallTiming Sums;
}
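As an illustration of how the region-related part of this data model could be mirrored in a host language, the following Python sketch uses dataclasses. It is an assumption-laden sketch, not the COSY implementation (which maps the model to a relational database): the attribute spellings, the string stand-ins for DateTime and TimingType, and the helper typed_timings_are_unique are all hypothetical. The helper checks the stated constraint of at most one TypedTiming object per timing type and test run.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class TestRun:
    start: str       # DateTime Start, kept as an ISO string in this sketch
    no_pe: int       # NoPe: number of processors
    clockspeed: int  # Clockspeed


@dataclass
class TotalTiming:
    run: TestRun
    excl: float  # summed exclusive computing time
    incl: float  # summed inclusive computing time
    ovhd: float  # overhead time


@dataclass
class TypedTiming:
    run: TestRun
    type: str    # stands in for the TimingType enumeration
    time: float


@dataclass
class Region:
    parent: Optional["Region"] = None
    tot_times: List[TotalTiming] = field(default_factory=list)
    typ_times: List[TypedTiming] = field(default_factory=list)


def typed_timings_are_unique(region: Region) -> bool:
    """Check: at most one TypedTiming per (timing type, test run) in a region."""
    keys = [(tt.type, tt.run) for tt in region.typ_times]
    return len(keys) == len(set(keys))


# Hypothetical example data: one region measured in an 8-processor run.
run8 = TestRun("1999-11-01T10:00:00", 8, 300)
region = Region(
    tot_times=[TotalTiming(run8, excl=40.0, incl=75.0, ovhd=15.0)],
    typ_times=[TypedTiming(run8, "Barrier", 4.0), TypedTiming(run8, "IO", 1.5)],
)
```

The frozen TestRun makes test runs hashable, so they can serve as part of the uniqueness key.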

TypedTiming objects have three attributes: the TestRun attribute Run codes the specific test run of the program, Type (an enumeration type) is the work type that is being considered in this object, and Time is the time spent doing work of this type.

Call sites of functions are modeled by the FunctionCall class. A function call has a set of CallTiming objects which store the differences of the individual processes. A CallTiming object is composed of the TestRun it belongs to and the minimum, maximum, mean value, and standard deviation over a) the number of calls and b) the time spent in the function. For the four extremal values, the processor that was first or last in the respective category is memorized.

Due to the design of Apprentice, the data model does not make use of inheritance. More complex data models can be found in [3].

4.2 Performance Properties

property       is PROPERTY pp-name '(' arg-list ')' '{'
                    [LET def* IN]
                    pp-condition
                    pp-confidence
                    pp-severity
                  '};'
arg            is type ident
pp-condition   is CONDITION ':' conditions ';'
conditions     is condition
               or condition OR conditions
condition      is ['(' cond-id ')'] bool-expr
pp-confidence  is CONFIDENCE ':' MAX '(' confidence-list ')' ';'
               or CONFIDENCE ':' confidence ';'
confidence     is ['(' cond-id ')' '->'] arith-expr
pp-severity    is SEVERITY ':' MAX '(' severity-list ')' ';'
               or SEVERITY ':' severity ';'
severity       is ['(' cond-id ')' '->'] arith-expr

Fig. 1. ASL property specification syntax.

The property specification (Figure 1) defines the name of the property, its context via a list of parameters, and the condition, confidence, and severity expressions. The property specification is based on a set of parameters. These parameters specify the property's context and parameterize the expressions. The context specifies the environment in which the property is evaluated, e.g. the program region and the test run.

The condition specification consists of a list of conditions. A condition is a predicate that can be prefixed by a condition identifier. The identifiers have to be unique with respect to the property, since the confidence and severity specifications can refer to the conditions via those condition identifiers.


The confidence specification is an expression that computes the maximum of a list of confidence values. Each confidence value is computed via an arithmetic expression resulting in a value in the interval of zero and one. The value can be guarded by a condition identifier introduced in the condition specification. The condition identifier represents the value of the condition.

The severity specification has the same structure as the confidence specification. It computes the maximum of the individual severity values of the conditions.

The following example properties are checked by COSY. They demonstrate the ASL language features. Most of the property specifications make use of the following two functions:

TotalTiming Summary(Region r, TestRun t) = UNIQUE({s IN r.TotTimes WITH s.Run==t});
float Duration(Region r, TestRun t) = Summary(r,t).Incl;

The first function Summary takes a Region r and a TestRun object and returns the unique TotalTiming object which is a member of r.TotTimes belonging to that test run. The second function Duration uses Summary to extract the total execution time of the specified region in the specified test run. Note that all timings in the database are summed up values of all processes.

The first property SublinearSpeedup determines the lost cycles in relation to the test run with the minimal number of processors.

Property SublinearSpeedup(Region r, TestRun t, Region Basis) {
  LET
    TotalTiming MinPeSum = UNIQUE({sum IN r.TotTimes WITH
        sum.Run.NoPe == MIN(s.Run.NoPe WHERE s IN r.TotTimes)});
    float TotalCost = Duration(r,t) - Duration(r,MinPeSum.Run)
  IN
  CONDITION: TotalCost > 0;
  CONFIDENCE: 1;
  SEVERITY: TotalCost/Duration(Basis,t);
}
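To make the arithmetic of SublinearSpeedup concrete, the following Python sketch re-expresses Summary, Duration, and the property as plain functions. It is only an illustration under stated assumptions: the minimal classes, the lower-case names, the example timings, and the simplification that a test run is identified by its processor count are all hypothetical and not taken from COSY.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class TestRun:
    no_pe: int  # assumption: one test run per processor count


@dataclass
class TotalTiming:
    run: TestRun
    incl: float  # inclusive time, summed over all processes


@dataclass
class Region:
    tot_times: List[TotalTiming] = field(default_factory=list)


def unique(items):
    """Counterpart of ASL UNIQUE: the single element of a one-element set."""
    items = list(items)
    if len(items) != 1:
        raise ValueError(f"expected exactly one element, got {len(items)}")
    return items[0]


def summary(r: Region, t: TestRun) -> TotalTiming:
    return unique(s for s in r.tot_times if s.run == t)


def duration(r: Region, t: TestRun) -> float:
    return summary(r, t).incl


def sublinear_speedup(r: Region, t: TestRun, basis: Region):
    """Return (condition, confidence, severity) for region r in test run t."""
    min_pe = min(s.run.no_pe for s in r.tot_times)
    min_pe_sum = unique(s for s in r.tot_times if s.run.no_pe == min_pe)
    total_cost = duration(r, t) - duration(r, min_pe_sum.run)
    return total_cost > 0, 1.0, total_cost / duration(basis, t)


# Hypothetical timings: a loop inside the whole program, measured on 1 and 4 processors.
run1, run4 = TestRun(1), TestRun(4)
main = Region([TotalTiming(run1, 100.0), TotalTiming(run4, 120.0)])
loop = Region([TotalTiming(run1, 60.0), TotalTiming(run4, 75.0)])
holds, confidence, severity = sublinear_speedup(loop, run4, main)
```

Because the timings are summed over all processes, perfect speedup would leave the sum unchanged; here the loop's sum grows from 60 to 75, so the total cost is 15 and the severity relative to the whole program in the 4-processor run is 15/120 = 0.125.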

The property is based on the total costs, i.e. the lost cycles compared to a reference run with the smallest number of processors. If TotalCost is greater than zero, the region has the SublinearSpeedup property. The confidence value, which is one in all examples here, might be lower than one if the condition is only an indication for that property. The severity of the SublinearSpeedup property is determined as the fraction of the total costs compared to the duration of Basis in that test run.

Property MeasuredCost(Region r, TestRun t, Region Basis) {
  LET
    float Cost = Summary(r,t).Ovhd;
  IN
  CONDITION: Cost > 0;
  CONFIDENCE: 1;
  SEVERITY: Cost / Duration(Basis,t);
}

The total costs can be split up into measured and unmeasured costs. The MeasuredCost property determines that more detailed information might be available (Summary(r,t).Ovhd is the overhead measured by Apprentice). If the severity of its counterpart, the UnmeasuredCost, is much higher, the reason cannot be found with the available data.

Property SyncCost(Region r, TestRun t, Region Basis) {
  LET
    float Barrier = SUM(tt.Time WHERE tt IN r.TypTimes
                        AND tt.Run==t AND tt.Type == Barrier);
  IN
  CONDITION: Barrier > 0;
  CONFIDENCE: 1;
  SEVERITY: Barrier / Duration(Basis,t);
}

The SyncCost property determines that barrier synchronization is a reason for overhead in that region. Its severity depends on the time spent for barrier synchronization in relation to the execution time of the ranking basis.

Property LoadImbalance(FunctionCall Call, TestRun t, Region Basis) {
  LET
    CallTiming ct = UNIQUE({c IN Call.Sums WITH c.Run == t});
    float Dev = ct.StdevTime;
    float Mean = ct.MeanTime;
  IN
  CONDITION: Dev > ImbalanceThreshold * Mean;
  CONFIDENCE: 1;
  SEVERITY: Mean / Duration(Basis,t);
}

The LoadImbalance property is a refinement of the SyncCost property. It is evaluated only for calls to the barrier routine. If the deviation is significant, the barrier costs result from load imbalance.
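The relation between SyncCost and its refinement LoadImbalance, and the ranking by severity, can be sketched in Python as follows. This is a simplified illustration, not COSY code: the value of IMBALANCE_THRESHOLD, the function names, and the example timings are assumptions (the source does not state a threshold), and the inputs are passed in directly instead of being read from the data model.

```python
from dataclasses import dataclass

IMBALANCE_THRESHOLD = 0.25  # hypothetical value; the source does not give one


@dataclass
class CallTiming:
    mean_time: float   # MeanTime of the call over all processes
    stdev_time: float  # StdevTime of the call over all processes


def sync_cost(barrier_times, basis_duration):
    """barrier_times: Time values of a region's Barrier-typed timings for one test run."""
    barrier = sum(barrier_times)
    return barrier > 0, 1.0, barrier / basis_duration


def load_imbalance(ct: CallTiming, basis_duration):
    """Refinement of SyncCost, meant for calls to the barrier routine only."""
    holds = ct.stdev_time > IMBALANCE_THRESHOLD * ct.mean_time
    return holds, 1.0, ct.mean_time / basis_duration


def rank(results):
    """Keep the properties whose condition holds, most severe first."""
    held = [(name, sev) for name, (holds, _conf, sev) in results.items() if holds]
    return sorted(held, key=lambda item: item[1], reverse=True)


# Hypothetical numbers: Duration(Basis, t) = 200, two barrier timings, one barrier call.
basis = 200.0
results = {
    "SyncCost": sync_cost([12.0, 8.0], basis),
    "LoadImbalance": load_imbalance(CallTiming(mean_time=10.0, stdev_time=4.0), basis),
}
ranking = rank(results)
```

In this example the barrier time of 20 gives SyncCost a severity of 0.1, and the deviation of 4 exceeds 0.25 times the mean of 10, so LoadImbalance also holds with severity 0.05 and is ranked second.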

5 Implementation

The design and implementation of COSY ensures portability and extensibility. The design requires that the performance data supply tools are extended such that the information can be inserted into the database. This extension was implemented for Apprentice with the help of Cray Research.

The database interface is based on standard SQL and therefore any relational database can be utilized. We ran experiments with four different databases: Oracle 7, MS Access, MS SQL server, and Postgres. For all those databases except MS Access, the setup was distributed, i.e. the data were transferred over the network to the database server. While Oracle was a factor of 2 slower than MS SQL server and Postgres, MS Access outperformed all those systems. Insertion of performance information into MS Access was a factor of 20 faster than with the Oracle server.

COSY is implemented in Java and is thus portable to any Java environment. It uses the standard JDBC interface to access the database. Although accessing the database via JDBC is a factor of two to four slower than C-based implementations (fetching a record from the Oracle server takes about 1 ms), the portability of the implementation outweighs the performance drawbacks.

The overall performance depends very much on the work distribution between the client and the database. It is a significant advantage to translate the conditions of performance properties entirely into SQL queries instead of first accessing the data components and evaluating the expressions in the analysis tool.
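The last point can be illustrated with a small, self-contained Python sketch that evaluates the MeasuredCost property entirely inside the database. It is a sketch under assumptions, not the COSY implementation: SQLite stands in for the relational databases named above, and the table total_timing, its columns, and the example rows are hypothetical, since the article does not give the relational scheme.

```python
import sqlite3

# In-memory database with a hypothetical table for TotalTiming objects.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE total_timing (region_id INTEGER, run_id INTEGER,
                               incl REAL, ovhd REAL);
    INSERT INTO total_timing VALUES
        (1, 7, 120.0, 30.0),   -- region 1 serves as the ranking basis
        (2, 7,  75.0, 15.0),
        (3, 7,  20.0,  0.0);
""")

# MeasuredCost evaluated in one query: the condition (ovhd > 0), the severity
# (ovhd / Duration(Basis, t)) and the ranking are all computed by the database.
MEASURED_COST_SQL = """
    SELECT r.region_id, r.ovhd / b.incl AS severity
    FROM total_timing AS r
    JOIN total_timing AS b
      ON b.run_id = r.run_id AND b.region_id = :basis
    WHERE r.run_id = :run AND r.ovhd > 0
    ORDER BY severity DESC
"""

rows = conn.execute(MEASURED_COST_SQL, {"basis": 1, "run": 7}).fetchall()
```

Only the regions that actually have the property are returned to the client, already ranked, so no TotalTiming records need to be fetched one by one and evaluated in the analysis tool.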

6 Conclusion and Future Work

This article presented a novel design for performance analysis tools. As an example, COSY, a prototype component of the KOJAK environment, was presented. The design enables excellent portability and integration into existing performance environments. The performance data and the performance properties are described in ASL and can therefore easily be adapted to other environments. For this prototype, the specification is manually translated into a relational database scheme, and the evaluation of the conditions and the severity expressions of the performance properties is transformed into appropriate SQL queries and ranking code by the tool developer. In the future, we will investigate techniques for the automatic generation of the database design from the performance property specification and the automatic translation of the property description into executable code.

References

1. P. Bates, J.C. Wileden: High-Level Debugging of Distributed Systems: The Behavioral Abstraction Approach, The Journal of Systems and Software, Vol. 3, pp. 255-264, 1983
2. CRAY Research: Introducing the MPP Apprentice Tool, Cray Manual IN-2511, 1994
3. Th. Fahringer, M. Gerndt, G. Riley, J.L. Träff: Knowledge Specification for Automatic Performance Analysis, to appear: APART Technical Report, Forschungszentrum Jülich, FZJ-ZAM-IB-9918, 1999
4. M. Gerndt, A. Krumme: A Rule-based Approach for Automatic Bottleneck Detection in Programs on Shared Virtual Memory Systems, Second Workshop on High-Level Programming Models and Supportive Environments (HIPS '97), in combination with IPPS '97, IEEE, 1997
5. M. Gerndt, A. Krumme, S. Ozmen: Performance Analysis for SVM-Fortran with OPAL, Proceedings Int. Conf. on Parallel and Distributed Processing Techniques and Applications (PDPTA'95), Athens, Georgia, pp. 561-570, 1995
6. M. Gerndt, B. Mohr, F. Wolf, M. Pantano: Performance Analysis on CRAY T3E, Euromicro Workshop on Parallel and Distributed Processing (PDP '99), IEEE Computer Society, pp. 241-248, 1999
7. A. Lucas: Basiswerkzeuge zur automatischen Auswertung von Apprentice-Leistungsdaten, Diploma Thesis, RWTH Aachen, Internal Report Forschungszentrum Jülich Jül-3652, 1999
8. B.P. Miller, M.D. Callaghan, J.M. Cargille, J.K. Hollingsworth, R.B. Irvin, K.L. Karavanic, K. Kunchithapadam, T. Newhall: The Paradyn Parallel Performance Measurement Tool, IEEE Computer, Vol. 28, No. 11, pp. 37-46, 1995
9. Paradyn Project: Paradyn Parallel Performance Tools: User's Guide, Paradyn Project, University of Wisconsin Madison, Computer Sciences Department, 1998
10. F. Wolf, B. Mohr: EARL - A Programmable and Extensible Toolkit for Analyzing Event Traces of Message Passing Programs, 7th International Conference on High-Performance Computing and Networking (HPCN'99), A. Hoekstra, B. Hertzberger (Eds.), Lecture Notes in Computer Science, Vol. 1593, pp. 503-512, 1999

PDRS: A Performance Data Representation System*

Xian-He Sun (1,2), Xingfu Wu (3,1)

1 Dept. of Computer Science, Louisiana State University, Baton Rouge, LA 70803
2 Dept. of Computer Science, Illinois Institute of Technology, Chicago, IL 60616
3 Dept. of Electrical and Computer Engineering, Northwestern University, Evanston, IL 60208
[email protected] [email protected]

Abstract. We present the design and development of a Performance Data Representation System (PDRS) for scalable parallel computing. PDRS provides decision support that helps users find the right data to understand their programs' performance and to select appropriate ways to display and analyze it. PDRS is an attempt to provide appropriate assistance to help programmers identify performance bottlenecks and optimize their programs.

1 Introduction

Many performance measurement systems have been developed in recent years. While these systems are important, their practical usefulness relies on an appropriate understanding of the measured data. When monitoring a complex parallel program, the amount of performance data collected may be very large. This data needs to be processed for further performance evaluation and analysis. A general performance measurement system always provides a facility that assists manipulation of this performance data. Data manipulation functions are often dependent on performance data organization and representation.

The difficulty in providing an adequate performance environment for high performance computing is the lack of appropriate models, representations and associated evaluation methods to understand measured data and locate performance bottlenecks. The Performance Data Representation System (PDRS) proposed in this paper is designed to attack this difficulty. PDRS is a general-purpose integrated system supported by performance database representation and the combination of performance visualization and auralization. It is based on our recent success in automatic performance evaluation and prediction.

Many performance measurement systems exist right now [3, 4, 5]. While these performance systems have made their contribution to the state-of-the-art of performance evaluation, none of them has addressed the data presentation and understanding issue adequately. With the advance in performance measurement and visualization techniques, and increased use of large, scalable computing systems, data presentation and management becomes increasingly important.

PDRS is a post-execution performance data representation system designed for scalable computing, and is distinct from existing performance systems. First, while it supports conventional visualization views, it is designed based on the most recent analytical results in scalability and statistical analysis to reveal the scaling properties of a scalable computing system. Second, the system uses a relational database, SQL and Java JDBC techniques such that performance information is easily retrieved, compared and displayed. Because of the complexity and volume of the data involved in a performance database, it is natural to exploit a database management system (DBMS) to archive and retrieve performance data. A DBMS will help not only in managing the performance data, but also in assuring that the various performance information can be presented in some reasonable format for users. Third, the system is implemented based on the combination of performance visualization and auralization techniques and object-oriented Java techniques such that it is easy for users to understand and use. Finally, the system supports the SDDF data format. It can be either used as a stand-alone application or easily integrated into other existing performance environments.

* This work was supported in part by National Science Foundation under NSF grant ASC-9720215 and CCR-9972251.

J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 277-284, 2000. © Springer-Verlag Berlin Heidelberg 2000

2 Design and Implementation of PDRS

Figure 2.1 depicts the design framework of PDRS. The technical approaches used to develop these components are discussed below section by section.

2.1 Trace Data Module

This module is in charge of collecting the original performance data of parallel programs and stores them in SDDF [1]. The large volume of data involved in parallel computations requires that the instrumentation collect the data selectively and intelligently. One way to collect data of a parallel program is to instrument the program executable so that when the program runs, it generates the desired information. PDRS is designed to use the Scala Instrumentation System (SIS) [11] to get the SDDF trace data file. PDRS also provides a general interface that can be used under any system that provides the SDDF trace data interface.

2.2 Data Management Module

This module is in charge of performance data filtering and mapping. Event histories of parallel programs are valuable information sources for performance analysis, but the problem is how to extract the useful information from massive amounts of low-level event traces. Our system performs the data filtering as a preparation to store the event history into a relational database. SDDF is a trace description language that specifies both data record structures and data record instances. We are building a performance database based on the SDDF specification. Our data management module is being implemented in Oracle DBMS.

Figure 2.1 Design framework of PDRS

2.3 Performance Database

We classify the performance data saved in the SDDF tracefiles into five groups: processor information, memory information, program information, communication information and I/O information. Each group is represented as an entity relation in the performance database. An individual event in a relation is treated as a tuple with a given unique identifier. Information retrieval is achieved by relational database queries. The example below shows how objects can be retrieved using JDBC [13]. For instance, suppose that we want to get the communication events that occurred in processor 0. The query is

select sourcePE, destinationPE, messageLength, event_startTimestamp, event_endTimestamp from Communication_Information where processor = 0

We may make this SQL query by JDBC as follows:

// stmt is a java.sql.Statement created from an open java.sql.Connection
ResultSet rs = stmt.executeQuery(
    "select sourcePE, destinationPE, messageLength, "
    + "event_startTimestamp, event_endTimestamp "
    + "from Communication_Information where processor = 0");
while (rs.next()) {
    Object i1 = rs.getObject("sourcePE");
    Object i2 = rs.getObject("destinationPE");
    Object r1 = rs.getObject("messageLength");
    Object r2 = rs.getObject("event_startTimestamp");
    Object r3 = rs.getObject("event_endTimestamp");
}
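For readers who want to try such a query without a database server, the following Python sketch runs an equivalent query against an in-memory SQLite database. It is only an illustration under assumptions: the column types, the event_id identifier, the underscore in the table name (needed for valid SQL), the example rows, and in particular the version column, which anticipates the per-tuple version attribute, are not taken from the PDRS schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Communication_Information (
        event_id INTEGER PRIMARY KEY,   -- unique identifier of the tuple
        version INTEGER,                -- hypothetical per-tuple version attribute
        processor INTEGER,
        sourcePE INTEGER, destinationPE INTEGER, messageLength INTEGER,
        event_startTimestamp REAL, event_endTimestamp REAL)
""")
conn.executemany(
    "INSERT INTO Communication_Information VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [(1, 1, 0, 0, 1, 1024, 0.10, 0.15),
     (2, 1, 1, 1, 0, 2048, 0.20, 0.30),
     (3, 2, 0, 0, 3, 512, 0.05, 0.07)])


def communication_events(conn, processor, version):
    """Communication events of one processor for one version of the program."""
    cur = conn.execute(
        "SELECT sourcePE, destinationPE, messageLength, "
        "event_startTimestamp, event_endTimestamp "
        "FROM Communication_Information "
        "WHERE processor = ? AND version = ? ORDER BY event_id",
        (processor, version))
    return cur.fetchall()


events_v1 = communication_events(conn, 0, 1)
events_v2 = communication_events(conn, 0, 2)
```

Passing the processor and version as query parameters, instead of pasting them into the SQL string, keeps the same statement reusable for comparing several versions.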

Multiple versions of performance data are handled by specifying a version attribute in each tuple. By specifying a version number in each database query, we can get multiple versions of program performance for comparison. In addition to the default PDRS performance parameters, new performance parameters such as sound files can also be added by users and be supported by the database.

2.4 Relational Queries Module

This module includes four parts: Symbolic Analysis, Statistical Analysis, Scalability Analysis, and Performance Model Generator. The module is being implemented in JDBC. Its structure is shown in Figure 2.2. Java applications include the PDA, PVA, and GUI modules implemented in Java. JDBC provides a bridge between the Java applications and the performance database.

Figure 2.2 Relational Queries Module (Java Applications, JDBC Driver(s), Performance Database; database access via JDBC)

We use symbolic evaluation [2, 6] that combines both data and control flow analysis to determine variable values, assumptions about and constraints between variable values, and conditions under which control flow reaches a program statement. Computations are represented as symbolic expressions defined over the program's problem and machine size. Each program variable is associated with a symbolic expression describing its value at a specific program point.

Statistical Analysis determines code and/or machine effects, finds the correlation between program phases, identifies the scaling behavior of "difficult segments", and provides statistical performance data [12] for the PDA (Performance Diagnostic Agent) module and GUI.

The development of the scalability analysis is based on newly developed algorithms for predicting performance in terms of execution time and scalability of a code-machine combination [8, 9, 11, 15]. Analytical and experimental results show that scalability combined with initial execution time can provide good performance prediction, in terms of execution times. In addition, crossing-point analysis [9] finds fast/slow performance crossing points of parallel programs and machines. In contrast with execution time, which is measured for a particular pair of problem and system size, range comparison compares performance over a wide range of ensemble and problem size via scalability and crossing-point analysis.

In addition to high-level performance prediction, PDRS also supports low-level performance analysis to identify performance bottlenecks and hardware constraints based on performance models chosen by the user. For example, we have proposed an empirical memory model based on a simplified mean value parameterization [14] to separate CPU execution time from stall time due to memory loads/stores. From traced information or information from the analysis modules, performance models can be generated to predict the performance at the component level, as well as over-all performance.

2.5 Performance Diagnostic Agent (PDA) Module

This module provides performance advice to help users find performance bottlenecks in their application programs. It also provides performance comparisons and suggestions based on real performance results and predicted performance ranges. The PDA is based on our approaches to statistical analysis, scalability analysis, and performance model generation. The module operates as follows.

Algorithm (Performance diagnosis):
  Performance analysis requests;
  switch (analysis type) {
    Statistical:
      Retrieve the performance information required;
      Get or compute the predicted performance range;
      Compute the real result of requested performance parameter;
      Compare the result with the performance range;
      If (the result is not in the performance range)
        Give an explanation (using graphics and sound);
      break;
    Scalability:
      Retrieve the performance information required;
      Get or compute the predicted scalability results;
      Compute the real scalability results;
      Compare the real result with the predicted results;
      Explain the compared results (using graphics and sound);
      break;
    Models:
      Retrieve the performance information required;
      Get the predicted performance range;
      Compute the real result of requested performance parameter;
      Compare the result with the performance range;
      If (the result is not in the performance range)
        Give an explanation (using graphics and sound);
      break;
    Default:
      printf("No such analysis type");
      break;
  }
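The dispatch structure of the diagnosis algorithm can be illustrated with a minimal Python sketch. It is not the PDRS implementation: the names `Diagnosis` and `diagnose` are hypothetical, the measured value and predicted range are passed in directly instead of being retrieved from the performance database, the scalability branch is simplified to use a range like the other two, and the explanation is a plain string where PDRS would use graphics and sound.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Diagnosis:
    analysis_type: str
    in_range: bool
    explanation: Optional[str]  # None when nothing needs explaining


def diagnose(analysis_type: str, measured: float,
             predicted: Tuple[float, float]) -> Diagnosis:
    """Compare a measured value with a predicted (low, high) range."""
    kind = analysis_type.lower()
    if kind not in ("statistical", "scalability", "models"):
        # Mirrors the Default branch of the algorithm.
        raise ValueError(f"No such analysis type: {analysis_type}")

    low, high = predicted
    in_range = low <= measured <= high
    explanation = None
    if kind == "scalability" or not in_range:
        # The scalability branch always explains the comparison; the
        # other two explain only when the result leaves the range.
        explanation = (f"{kind}: measured {measured} vs predicted "
                       f"[{low}, {high}]")
    return Diagnosis(kind, in_range, explanation)


if __name__ == "__main__":
    print(diagnose("Statistical", 12.5, (8.0, 10.0)).explanation)
```

Raising an exception for an unknown analysis type, in place of the printed message in the Default branch, is a design choice of this sketch so that callers cannot silently ignore a bad request.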

In the algorithm, the PDA provides suggestions and explanations when performance bottlenecks occur. Based on the statistical analysis, the PDA can retrieve the performance information from the performance database and then provide advice about program performance.

2.6 Performance Visualization and Auralization (PVA) Module and Graphical User Interface Module

The PVA module provides graphical displays of performance information about users' application programs and platforms. It is natural to use different visual objects to represent various performance data, and to use visualization techniques to gain insight into the execution of parallel programs so that their performance can be understood and improved. The basic goal of this module is to use graphics and sound (Java 2D, Java 3D and JavaSound) to present advice and performance views of application programs. For example, performance bottlenecks found through performance comparison can be shown graphically. Suggestions can also be given graphically, such as which applications are suitable for a platform, which platforms are suitable for an application, and how to modify an application program to suit a platform. When performance bottlenecks occur, sound is used to inform users of the performance problem in their application programs. The sound files are stored in the performance database.

The Graphical User Interface module is an integrated, user-friendly graphical interface. It integrates all the functions of the PVA module and directly displays the performance data requested by users. Figures 2.3 and 2.4 are two views of the PDRS GUI. Figure 2.3 shows a speed comparison of the PDD and PT algorithms [7]. Figure 2.4 shows the Kiviat graph for performance comparison.
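As an illustration of how a Kiviat graph lays out several performance metrics, the following Python sketch computes the polygon vertices for a set of metric values. It is only a sketch under stated assumptions: the function name `kiviat_vertices` is hypothetical, each metric is normalized by a caller-supplied maximum, and the axes are spaced evenly around the origin. The text does not describe how PDRS computes or draws its Kiviat graphs, which it renders with Java 2D.

```python
import math
from typing import List, Sequence, Tuple


def kiviat_vertices(values: Sequence[float],
                    maxima: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (x, y) polygon vertices, one per metric axis."""
    if len(values) != len(maxima) or len(values) < 3:
        raise ValueError("need at least three metrics with matching maxima")
    if any(m <= 0 for m in maxima):
        raise ValueError("each maximum must be positive")

    n = len(values)
    points = []
    for i, (v, m) in enumerate(zip(values, maxima)):
        radius = v / m                 # normalized distance from the center
        angle = 2 * math.pi * i / n    # axes evenly spaced around the circle
        points.append((radius * math.cos(angle), radius * math.sin(angle)))
    return points


if __name__ == "__main__":
    # Four hypothetical metrics, each scaled by its own maximum.
    for x, y in kiviat_vertices([1.0, 2.0, 3.0, 4.0], [1.0, 4.0, 3.0, 8.0]):
        print(f"{x:.3f} {y:.3f}")
```

Plotting the vertices of two platforms on the same axes gives two overlaid polygons, which is one way to read a comparison across all metrics at once.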


Figure 2.3 Speed comparison of PDD and PT algorithms

Figure 2.4 Kiviat Graph for Performance Comparison

3 Summary

We have presented the design of a Performance Data Representation System (PDRS), which builds on our success in developing the SCALA [10, 11] performance system for scalable parallel processing. While PDRS has not been fully implemented at this time, some of its key components have been implemented and tested, and the implementation results are very encouraging. PDRS highlights performance data representation using a relational database. Integrated into an advanced restructuring compilation and performance analysis system, the proposed PDRS attempts to lift performance evaluation systems to a new level. It is designed to give developers guidelines on performance optimization, to assist purchasers in selecting the systems best suited to their needs, and to give vendors valuable feedback on bottlenecks that can be alleviated in future products. It has the potential to provide users with much more useful information than existing performance systems. Many advanced technologies, such as database management, object-oriented programming, visualization, and auralization, are used in PDRS. The integration of these technologies into a compilation and performance analysis system is new and very challenging, and it can motivate many new research and development issues. PDRS is only a first step toward automatic performance analysis and optimization.

References

1. R. Aydt, The Pablo Self-Defining Data Format, Department of Computer Science, University of Illinois, April 1995, ftp://bugle.cs.uiuc.edu/pub/Release/Documentation/SDDF.ps.
2. T. Fahringer and B. Scholz, Symbolic evaluation for parallelizing compilers, in Proc. of the 11th ACM International Conference on Supercomputing, Vienna, Austria, ACM Press, July 1997, 261-268.
3. J. Kohn and W. Williams, ATExpert, Journal of Parallel and Distributed Computing 18, 1993, 205-222.
4. A.D. Malony and G.V. Wilson, Future directions in parallel performance environment, Performance Measurement and Visualization of Parallel Systems, Eds: G. Haring and G. Kotsis, Elsevier Science Publishers B.V., 1993, 331-351.
5. B.P. Miller, M.D. Callaghan, J.M. Cargille, J.K. Hollingsworth, R.B. Irvin, K.L. Karavanic, K. Kunchithapadam, and T. Newhall, The Paradyn parallel performance measurement tools, IEEE Computer 28, 11, 1995.
6. M. Scheibl, A. Celic, and T. Fahringer, Interfacing Mathematica from the Vienna Fortran Compilation System, Technical Report, Institute for Software Technology and Parallel Systems, Univ. of Vienna, December 1996.
7. X.-H. Sun, H. Zhang, and L. Ni, Efficient tridiagonal solvers on multicomputers, IEEE Transactions on Computers 41, 3 (1992), 286-296.
8. X.-H. Sun and D. Rover, Scalability of parallel algorithm-machine combinations, IEEE Transactions on Parallel and Distributed Systems, June 1994, 599-613.
9. X.-H. Sun, Performance range comparison via crossing point analysis, Lecture Notes in Computer Science 1388 (J. Rolim, ed.), Springer, March 1998.
10. X.-H. Sun, T. Fahringer, M. Pantano, and Z. Zhan, SCALA: A performance system for scalable computing, in Proc. of the Workshop on High-Level Parallel Programming Models & Supportive Environments, Lecture Notes in Computer Science 1586, Springer, April 1999.
11. X.-H. Sun, M. Pantano, and T. Fahringer, Integrated range comparison for data-parallel compilation systems, IEEE Transactions on Parallel and Distributed Systems, Vol. 10, May 1999, 448-458.
12. X.-H. Sun, D. He, K. Cameron, and Y. Luo, A Factorial Performance Evaluation for Hierarchical Memory Systems, in Proc. of the IEEE Int'l Parallel Processing Symposium '99, April 1999.
13. Sun Microsystems Inc., JDBC: A Java SQL API, Version 1.20, http://www.javasoft.com/products/jdbc/index.html, January 1997.
14. M. V. Vernon, E. D. Lazowska, and J. Zahorjan, An accurate and efficient performance analysis technique for multi-processor snooping cache-consistency protocols, in Proc. 15th Annual Symp. Computer Architecture, Honolulu, HI, June 1988, 308-315.
15. X. Wu, Performance Evaluation, Prediction, and Visualization of Parallel Systems, Kluwer Academic Publishers, Boston, ISBN 0-7923-8462-8, 1999.

Clix* - A Hybrid Programming Environment for Distributed Objects and Distributed Shared Memory

Frank Mueller1, Jörg Nolte2, and Alexander Schlaefer3

1 Humboldt University Berlin, Institut f. Informatik, 10099 Berlin, Germany
2 GMD FIRST, Rudower Chaussee 5, D-12489 Berlin, Germany
3 University of Washington, CSE, Box 352350, Seattle, WA 98195-2350, USA
mueller@informatik.hu-berlin.de, phone: (+49) (30) 2093-3011, fax: -3010

Abstract. Parallel programming with distributed object technology is becoming increasingly popular, but shared-memory programming is still a common way of utilizing parallel machines. In fact, both models can coexist fairly well, and software DSM systems can be constructed easily using distributed object systems. In this paper, we describe the construction of a hybrid programming platform based on the ARTS distributed object system. We describe how an object-oriented design approach provides a compact and flexible description of the system components. A sample implementation demonstrates that three classes of less than 100 lines of code each suffice to implement sequential consistency.

1 Introduction

Object-oriented programming and distributed object technology are considered state of the art in distributed as well as parallel computing. However, typical numerical data structures such as huge arrays or matrices are hard to represent in a distributed object paradigm. Such data structures usually cannot be represented as single objects because this leads to extremely coarse-grained programs, thus limiting parallel execution. On the other hand, it is not feasible to represent, e.g., each array element as a remote object because remote object invocation mea

S. A. Moyer and V. S. Sunderam. PIOUS: a scalable parallel I/O system for distributed computing environments. In Scalable High-Performance Computing Conf., 1994.
N. Nieuwejaar and D. Kotz. The Galley parallel file system. Parallel Computing, 23(4), June 1997.
M. T. Oszu and P. Valduriez. Principles of Distributed Database Systems. Prentice Hall, 1999.
R. H. Patterson III. Informed Prefetching and Caching. PhD thesis, Carnegie Mellon University, December 1997.
Pirahesh et al. Parallelism in Relational Data Base Systems. In Int'l Symp. on Parallel and Distributed Systems, July 1990.
D. A. Reed et al. Performance analysis of parallel systems: Approaches and open problems. In Joint Symposium on Parallel Processing (JSPP), June 1998.
E. Riedel, G. A. Gibson, and C. Faloutsos. Active storage for large-scale data mining and multimedia. In Int'l Conf. on Very Large Databases, August 1997.
H. Nagesh, S. Goil, and A. Choudhary. MAFIA: Efficient and scalable subspace clustering for very large data sets. Technical Report 9906-010, Northwestern University, June 1999.
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with databases: alternatives and implications. In ACM SIGMOD Conf. on Manag