English Pages 1317 [1332] Year 2000
Lecture Notes in Computer Science
1800
Jose Rolim et al. (Eds.)
Parallel and Distributed Processing
15 IPDPS 2000 Workshops
Cancun, Mexico, May 2000
Proceedings
Springer
TC Parallel Processing
Association for Computing Machinery
Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis and J. van Leeuwen
Springer Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Singapore Tokyo
Series Editors
Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Managing Volume Editor
Jose Rolim
Universite de Geneve, Centre Universitaire d'Informatique
24, rue General Dufour, CH-1211 Geneve 4, Switzerland
E-mail: [email protected]
Cataloging-in-Publication Data applied for
Die Deutsche Bibliothek - CIP-Einheitsaufnahme Parallel and distributed processing : 15 IPDPS 2000 workshops, Cancun, Mexico, May 1 - 5, 2000, proceedings/ Jose Rolim et al. (ed.). Berlin ; Heidelberg ; New York; Barcelona; Hong Kong; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 2000 (Lecture notes in computer science ; Vol. 1800) ISBN 978-3-540-67442-9
CR Subject Classification (1998): C.1-4, B.1-7, D.1-4, F.1-2, G.1-2, E.1, H.2
ISSN 0302-9743
ISBN 978-3-540-67442-9 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

Springer-Verlag is a company in the BertelsmannSpringer publishing group.
© Springer-Verlag Berlin Heidelberg 2000
Typesetting: Camera-ready by author, data conversion by Boller Mediendesign
Printed on acid-free paper
SPIN: 10720149 06/3142 5 4 3 2 1 0
Volume Editors
Jose D.P. Rolim, G. Chiola, G. Conte, L.V. Mancini, Oscar H. Ibarra, Koji Nakano, Stephan Olariu, Sethuraman Panchanathan, Andreas Uhl, Martin Schulz, Mohammed J. Zaki, Vipin Kumar, David B. Skillicorn, Sartaj Sahni, Timothy Davis, Sanguthevar Rajasekaran, Sanjay Ranka, Denis Caromel, Serge Chaumette, Geoffrey Fox, Peter Graham, Albert Y. Zomaya, Fikret Ercal,
Kenji Toda, Sang Hyuk Son, Maarten Boasson, Yoshiaki Kakuda, Devesh Bhatt, Lonnie R. Welch, Hossam ElGindy, Viktor K. Prasanna, Hartmut Schmeck, Oliver Diessel, Beverly Sanders, Dominique Mery, Fouad Kiamilev, Jeremy Ekman, Afonso Ferreira, Sadik Esener, Yi Pan, Keqin Li, Ron Olsson, Laxmikant V. Kale, Pete Beckman, Matthew Haines, Dimiter R. Avresky
Foreword

This volume contains the proceedings from the workshops held in conjunction with the IEEE International Parallel and Distributed Processing Symposium, IPDPS 2000, on 1-5 May 2000 in Cancun, Mexico. The workshops provide a forum for bringing together researchers, practitioners, and designers from various backgrounds to discuss the state of the art in parallelism. They focus on different aspects of parallelism, from run time systems to formal methods, from optics to irregular problems, from biology to networks of personal computers, from embedded systems to programming environments; the following workshops are represented in this volume:

Workshop on Personal Computer Based Networks of Workstations
Workshop on Advances in Parallel and Distributed Computational Models
Workshop on Par. and Dist. Comp. in Image, Video, and Multimedia
Workshop on High-Level Parallel Prog. Models and Supportive Env.
Workshop on High Performance Data Mining
Workshop on Solving Irregularly Structured Problems in Parallel
Workshop on Java for Parallel and Distributed Computing
Workshop on Biologically Inspired Solutions to Parallel Processing Problems
Workshop on Parallel and Distributed Real-Time Systems
Workshop on Embedded HPC Systems and Applications
Reconfigurable Architectures Workshop
Workshop on Formal Methods for Parallel Programming
Workshop on Optics and Computer Science
Workshop on Run-Time Systems for Parallel Programming
Workshop on Fault-Tolerant Parallel and Distributed Systems

All papers published in the workshops proceedings were selected by the program committee on the basis of referee reports. Each paper was reviewed by independent referees who judged the papers for originality, quality, and consistency with the themes of the workshops.
We would like to thank the general co-chairs Joseph JáJá and Charles Weems for their support and encouragement, the steering committee chairs, George Westrom and Victor Prasanna, for their guidance and vision, and the finance chair, Bill Pitts, for making this publication possible. Special thanks are due to Sally Jelinek, for her assistance with meeting publicity, to Susamma Barua for making local arrangements, and to Danuta Sosnowska for her tireless efforts in interfacing with the organizers. We gratefully acknowledge sponsorship from the IEEE Computer Society and its Technical Committee on Parallel Processing and the cooperation of the ACM SIGARCH. Finally, we would like to thank Danuta Sosnowska and Germaine Gusthiot for their help in the preparation of this volume.

February 2000
Jose D. P. Rolim
Contents
Workshop on Personal Computer Based Networks of Workstations
  G. Chiola, G. Conte, L.V. Mancini
Memory Management in a Combined VIA/SCI Hardware
  M. Trams, W. Rehm, D. Balkanski, S. Simeonov
ATOLL, a New Switched, High Speed Interconnect in Comparison to Myrinet and SCI
  M. Fischer, U. Bruning, J. Kluge, L. Rzymianowicz, P. Schulz, M. Waack
ClusterNet: An Object-Oriented Cluster Network
  R.R. Hoare
GigaBit Performance under NT
  M. Baker, S. Scott, A. Geist, L. Browne
MPI Collective Operations over IP Multicast
  H.A. Chen, Y.O. Carrasco, A.W. Apon
An Open Market-Based Architecture for Distributed Computing
  S. Lalis, A. Karipidis
The MultiCluster Model to the Integrated Use of Multiple Workstation Clusters
  M. Barreto, R. Avila, P. Navaux
Parallel Information Retrieval on an SCI-Based PC-NOW
  S.-H. Chung, H.-C. Kwon, K.R. Ryu, H.-K. Jang, J.-H. Kim, C.-A. Choi
A PC-NOW Based Parallel Extension for a Sequential DBMS
  M. Exbrayat, L. Brunie
Workshop on Advances in Parallel and Distributed Computational Models
  O.H. Ibarra, K. Nakano, S. Olariu
The Heterogeneous Bulk Synchronous Parallel Model
  T.L. Williams, R.J. Parsons
On Stalling in LogP
  G. Bilardi, K.T. Herley, A. Pietracaprina, G. Pucci
Parallelizability of Some P-Complete Problems
  A. Fujiwara, M. Inoue, T. Masuzawa
A New Computation of Shape Moments via Quadtree Decomposition
  C.-H. Wu, S.-J. Horng, P.-Z. Lee, S.-S. Lee, S.-Y. Lin
The Fuzzy Philosophers
  S.-T. Huang
A Java Applet to Visualize Algorithms on Reconfigurable Mesh
  K. Miyashita, R. Hashimoto
A Hardware Implementation of PRAM and Its Performance Evaluation
  M. Imai, Y. Hayakawa, H. Kawanaka, W. Chen, K. Wada, G.D. Castanho, Y. Okajima, H. Okamoto
A Non-binary Parallel Arithmetic Architecture
  R. Lin, J.L. Schwing
Multithreaded Parallel Computer Model with Performance Evaluation
  J. Cui, J.L. Bordim, K. Nakano, T. Hayashi, N. Ishii
Workshop on Parallel and Distributed Computing in Image Processing, Video Processing, and Multimedia (PDIVM 2000)
  S. Panchanathan, A. Uhl
MAJC-5200: A High Performance Microprocessor for Multimedia Computing
  S. Sudharsanan
A Novel Superscalar Architecture for Fast DCT Implementation
  Z. Yong, M. Zhang
Computing Distance Maps Efficiently Using an Optical Bus
  Y. Pan, Y. Li, J. Li, K. Li, S.-Q. Zheng
Advanced Data Layout Optimization for Multimedia Applications
  C. Kulkarni, F. Catthoor, H. De Man
Parallel Parsing of MPEG Video in a Multi-threaded Multiprocessor Environment
  S.M. Bhandarkar, S.R. Chandrasekaran
Parallelization Techniques for Spatial-Temporal Occupancy Maps from Multiple Video Streams
  N. DeBardeleben, A. Hoover, W. Jones, W. Ligon
Heuristic Solutions for a Mapping Problem in a TV-Anytime Server Network
  X. Zhou, R. Luling, L. Xie
RPV: A Programming Environment for Real-Time Parallel Vision Specification and Programming Methodology
  D. Arita, Y. Hamada, S. Yonemoto, R.-i. Taniguchi
Parallel Low-Level Image Processing on a Distributed Memory System
  C. Nicolescu, P. Jonker
Congestion-Free Routing of Streaming Multimedia Content in BMIN-Based Parallel Systems
  H. Sethu
Performance of On-Chip Multiprocessors for Vision Tasks
  Y. Chung, K. Park, W. Hahn, N. Park, V.K. Prasanna
Parallel Hardware-Software Architecture for Computation of Discrete Wavelet Transform Using the Recursive Merge Filtering Algorithm
  P. Jamkhandi, A. Mukherjee, K. Mukherjee, R. Franceschini
Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS 2000)
  M. Schulz
Pipelining Wavefront Computations: Experiences and Performance
  E.G. Lewis, L. Snyder
Specification Techniques for Automatic Performance Analysis Tools
  M. Gerndt, H.-G. Eßer
PDRS: A Performance Data Representation System
  X.-H. Sun, X. Wu
Clix - A Hybrid Programming Environment for Distributed Objects and Distributed Shared Memory
  F. Mueller, J. Nolte, A. Schlaefer
Controlling Distributed Shared Memory Consistency from High Level Programming Languages
  Y. Jegou
Online Computation of Critical Paths for Multithreaded Languages
  Y. Oyama, K. Taura, A. Yonezawa
Problem Solving Environment Infrastructure for High Performance Computer Systems
  D.C. Stanzione, Jr., W.B. Ligon III
Combining Fusion Optimizations and Piecewise Execution of Nested Data-Parallel Programs
  W. Pfannenstiel
Declarative Concurrency in Java
  R. Ramirez, A.E. Santosa
Scalable Monitoring Technique for Detecting Races in Parallel Programs
  Y.-K. Jun, C.E. McDowell
Workshop on High Performance Data Mining
  M.J. Zaki, V. Kumar, D.B. Skillicorn
Implementation Issues in the Design of I/O Intensive Data Mining Applications on Clusters of Workstations
  R. Baraglia, D. Laforenza, S. Orlando, P. Palmerini, R. Perego
A Requirements Analysis for Parallel KDD Systems
  W.A. Maniatty, M.J. Zaki
Parallel Data Mining on ATM-Connected PC Cluster and Optimization of Its Execution Environment
  M. Oguchi, M. Kitsuregawa
The Parallelization of a Knowledge Discovery System with Hypergraph Representation
  J. Seitzer, J.P. Buckley, Y. Pan, L.A. Adams
Parallelisation of C4.5 as a Particular Divide and Conquer Computation
  P. Becuzzi, M. Coppola, S. Ruggieri, M. Vanneschi
Scalable Parallel Clustering for Data Mining on Multicomputers
  D. Foti, D. Lipari, C. Pizzuti, D. Talia
Exploiting Dataset Similarity for Distributed Mining
  S. Parthasarathy, M. Ogihara
Scalable Model for Extensional and Intensional Descriptions of Unclassified Data
  H.A. Prado, S.C. Hirtle, P.M. Engel
Parallel Data Mining of Bayesian Networks from Telecommunications Network Data
  R. Sterritt, K. Adamson, C.M. Shapcott, E.P. Curran
Irregular 2000 - Workshop on Solving Irregularly Structured Problems in Parallel
  S. Sahni, T. Davis, S. Rajasekaran, S. Ranka
Load Balancing and Continuous Quadratic Programming
  W.W. Hager
Parallel Management of Large Dynamic Shared Memory Space: A Hierarchical FEM Application
  X. Cavin, L. Alonso
Efficient Parallelization of Unstructured Reductions on Shared Memory Parallel Architectures
  S. Benkner, T. Brandes
Parallel FEM Simulation of Crack Propagation - Challenges, Status, and Perspectives
  B. Carter, C.-S. Chen, L.P. Chew, N. Chrisochoides, G.R. Gao, G. Heber, A.R. Ingraffea, R. Krause, C. Myers, D. Nave, K. Pingali, P. Stodghill, S. Vavasis, P.A. Wawrzynek
Support for Irregular Computations in Massively Parallel PIM Arrays, Using an Object-Based Execution Model
  H.P. Zima, T.L. Sterling
Executing Communication-Intensive Irregular Programs Efficiently
  V. Ramakrishnan, I.D. Scherson
Non-Memory-Based and Real-Time Zerotree Building for Wavelet Zerotree Coding Systems
  D. Peng, M. Lu
Graph Partitioning for Dynamic, Adaptive, and Multi-phase Computations
  V. Kumar, K. Schloegel, G. Karypis
A Multilevel Algorithm for Spectral Partitioning with Extended Eigen-Models
  S. Oliveira, T. Soma
An Integrated Decomposition and Partitioning Approach for Irregular Block-Structured Applications
  J. Rantakokko
Ordering Unstructured Meshes for Sparse Matrix Computations on Leading Parallel Systems
  L. Oliker, X. Li, G. Heber, R. Biswas
A GRASP for Computing Approximate Solutions for the Three-Index Assignment Problem
  R.M. Aiex, P.M. Pardalos, L.S. Pitsoulis, M.G.C. Resende
On Identifying Strongly Connected Components in Parallel
  L.K. Fleischer, B. Hendrickson, A. Pinar
A Parallel, Adaptive Refinement Scheme for Tetrahedral and Triangular Grids
  A. Stagg, J. Hallberg, J. Schmidt
PaStiX: A Parallel Sparse Direct Solver Based on a Static Scheduling for Mixed 1D/2D Block Distributions
  P. Henon, P. Ramet, J. Roman
Workshop on Java for Parallel and Distributed Computing
  D. Caromel, S. Chaumette, G. Fox, P. Graham
An IP Next Generation Compliant Java(TM) Virtual Machine
  G. Chelius, E. Fleury
An Approach to Asynchronous Object-Oriented Parallel and Distributed Computing on Wide-Area Systems
  M. Di Santo, F. Frattolillo, W. Russo, E. Zimeo
Performance Issues for Multi-language Java Applications
  P. Murray, T. Smith, S. Srinivas, M. Jacob
MPJ: A Proposed Java Message Passing API and Environment for High Performance Computing
  M. Baker, B. Carpenter
Implementing Java Consistency Using a Generic, Multithreaded DSM Runtime System
  G. Antoniu, L. Bouge, P. Hatcher, M. MacBeth, K. McGuigan, R. Namyst
Workshop on Bio-Inspired Solutions to Parallel Processing Problems (BioSP3)
  A.Y. Zomaya, F. Ercal, S. Olariu
Take Advantage of the Computing Power of DNA Computers
  Z.F. Qiu, M. Lu
Agent Surgery: The Case for Mutable Agents
  L. Boloni, D.C. Marinescu
Was Collective Intelligence before Life on Earth?
  T. Szuba, M. Almulla
Solving Problems on Parallel Computers by Cellular Programming
  D. Talia
Multiprocessor Scheduling with Support by Genetic Algorithms-Based Learning Classifier System
  J.P. Nowacki, G. Pycka, F. Seredynski
Viewing Scheduling Problems through Genetic and Evolutionary Algorithms
  M. Rocha, C. Vilela, P. Cortez, J. Neves
Dynamic Load Balancing Model: Preliminary Assessment of a Biological Model for a Pseudo-search Engine
  R.L. Walker
A Parallel Co-evolutionary Metaheuristic
  V. Bachelet, E.-G. Talbi
Neural Fraud Detection in Mobile Phone Operations
  A. Boukerche, M.S.M.A. Notare
Information Exchange in Multi Colony Ant Algorithms
  M. Middendorf, F. Reischle, H. Schmeck
A Surface-Based DNA Algorithm for the Expansion of Symbolic Determinants
  Z.F. Qiu, M. Lu
Hardware Support for Simulated Annealing and Tabu Search
  R. Schneider, R. Weiss
Workshop on Parallel and Distributed Real-Time Systems
  K. Toda, S.H. Son, M. Boasson, Y. Kakuda
A Distributed Real Time Coordination Protocol
  L. Sha, D. Seto
A Segmented Backup Scheme for Dependable Real Time Communication in Multihop Networks
  P.K. Gummadi, J.P. Madhavarapu, S.R. Murthy
Real-Time Coordination in Distributed Multimedia Systems
  T.A. Limniotes, G.A. Papadopoulos
Supporting Fault-Tolerant Real-Time Applications Using the RED-Linux General Scheduling Framework
  K.-J. Lin, Y.-C. Wang
Are COTS Suitable for Building Distributed Fault-Tolerant Hard Real-Time Systems?
  P. Chevochot, A. Colin, D. Decotigny, I. Puaut
Autonomous Consistency Technique in Distributed Database with Heterogeneous Requirements
  H. Hanamura, I. Kaji, K. Mori
Real-Time Transaction Processing Using Two-Stage Validation in Broadcast Disks
  K.-w. Lam, V.C.S. Lee, S.H. Son
Using Logs to Increase Availability in Real-Time Main-Memory Database
  T. Niklander, K. Raatikainen
Components Are from Mars
  M.R.V. Chaudron, E. de Jong
2 + 10 ≻ 1 + 50 !
  H. Hansson, C. Norstrom, S. Punnekkat
A Framework for Embedded Real-Time System Design
  J.-Y. Choi, H.-H. Kwak, I. Lee
Best-Effort Scheduling of (m,k)-Firm Real-Time Streams in Multihop Networks
  A. Striegel, G. Manimaran
Predictability and Resource Management in Distributed Multimedia Presentations
  C. Mourlas
Quality of Service Negotiation for Distributed, Dynamic Real-Time Systems
  G.D. Cavanaugh, L.R. Welch, B.A. Shirazi, E.-n. Huh, S. Anwar
An Open Framework for Real-Time Scheduling Simulation
  T. Kramp, M. Adrian, R. Koster
Workshop on Embedded/Distributed HPC Systems and Applications (EHPC 2000)
  D. Bhatt, L.R. Welch
A Probabilistic Power Prediction Tool for the Xilinx 4000-Series FPGA
  T. Osmulski, J.T. Muehring, B. Veale, J.M. West, H. Li, S. Vanichayobon, S.-H. Ko, J.K. Antonio, S.K. Dhall
Application Challenges: System Health Management for Complex Systems
  G.D. Hadden, P. Bergstrom, T. Samad, B.H. Bennett, G.J. Vachtsevanos, J. Van Dyke
Accommodating QoS Prediction in an Adaptive Resource Management Framework
  E.-n. Huh, L.R. Welch, B.A. Shirazi, B.C. Tjaden, G.D. Cavanaugh
Network Load Monitoring in Distributed Systems
  K.M. Jahirul Islam, B.A. Shirazi, L.R. Welch, B.C. Tjaden, G.D. Cavanaugh, S. Anwar
A Novel Specification and Design Methodology of Embedded Multiprocessor Signal Processing Systems Using High-Performance Middleware
  R.S. Janka, L.M. Wills
Auto Source Code Generation and Run-Time Infrastructure and Environment for High Performance, Distributed Computing Systems
  M.I. Patel, K. Jordan, M. Clark, D. Bhatt
Developing an Open Architecture for Performance Data Mining
  D.B. Pierce, D.T. Rover
A 90k Gate "CLB" for Parallel Distributed Computing
  B. Schulman, G. Pechanek
Power-Aware Replication of Data Structures in Distributed Embedded Real-Time Systems
  O.S. Unsal, I. Koren, C.M. Krishna
Comparison of MPI Implementations on a Shared Memory Machine
  B. Van Voorst, S. Seidel
A Genetic Algorithm Approach to Scheduling Communications for a Class of Parallel Space-Time Adaptive Processing Algorithms
  J.M. West, J.K. Antonio
Reconfigurable Parallel Sorting and Load Balancing on a Beowulf Cluster: HeteroSort
  P. Yang, T.M. Kunau, B.H. Bennett, E. Davis, B. Wren
Reconfigurable Architectures Workshop (RAW 2000)
  H. ElGindy, V.K. Prasanna, H. Schmeck, O. Diessel
Run-Time Reconfiguration at Xilinx
  S.A. Guccione
JRoute: A Run-Time Routing API for FPGA Hardware
  E. Keller
A Reconfigurable Content Addressable Memory
  S.A. Guccione, D. Levi, D. Downs
ATLANTIS - A Hybrid FPGA/RISC Based Re-configurable System
  O. Brosch, J. Hesser, C. Hinkelbein, K. Kornmesser, T. Kuberka, A. Kugel, R. Manner, H. Singpiel, B. Vettermann
The Cellular Processor Architecture CEPRA-1X and Its Configuration by CDL
  C. Hochberger, R. Hoffmann, K.-P. Volkmann, S. Waldschmidt
Loop Pipelining and Optimization for Run Time Reconfiguration
  K. Bondalapati, V.K. Prasanna
Compiling Process Algebraic Descriptions into Reconfigurable Logic
  O. Diessel, G. Milne
Behavioral Partitioning with Synthesis for Multi-FPGA Architectures under Interconnect, Area, and Latency Constraints
  P. Lakshmikanthan, S. Govindarajan, V. Srinivasan, R. Vemuri
Module Allocation for Dynamically Reconfigurable Systems
  X.-j. Zhang, K.-w. Ng
Augmenting Modern Superscalar Architectures with Configurable Extended Instructions
  X. Zhou, M. Martonosi
Complexity Bounds for Lookup Table Implementation of Factored Forms in FPGA Technology Mapping
  W. Feng, F.J. Meyer, F. Lombardi
Optimization of Motion Estimator for Run-Time-Reconfiguration Implementation
  C. Tanougast, Y. Berviller, S. Weber
Constant-Time Hough Transform on a 3D Reconfigurable Mesh Using Fewer Processors
  Y. Pan
Workshop on Formal Methods for Parallel Programming (FMPPTA 2000)
  B. Sanders, D. Mery
A Method for Automatic Cryptographic Protocol Verification
  J. Goubault-Larrecq
Verification Methods for Weaker Shared Memory Consistency Models
  R.P. Ghughal, G.C. Gopalakrishnan
Models Supporting Nondeterminism and Probabilistic Choice
  M. Mislove
Concurrent Specification and Timing Analysis of Digital Hardware Using SDL
  K.J. Turner, F.J. Argul-Marin, S.D. Laing
Incorporating Non-functional Requirements into Software Architectures
  N.S. Rosa, G.R.R. Justo, P.R.F. Cunha
Automatic Implementation of Distributed Systems Formal Specifications
  L.H. Castelo Branco, A.F. do Prado, W. Lopes de Souza, M. Sant'Anna
Refinement Based Validation of an Algorithm for Detecting Distributed Termination
  M. Filali, P. Mauran, G. Padiou, P. Queinnec, X. Thirioux
Tutorial 1: Abstraction and Refinement of Concurrent Programs and Formal Specification
  D. Cansell, D. Mery, C. Tabacznyj
Tutorial 2: A Foundation for Composing Concurrent Objects
  J.-P. Bahsoun
Workshop on Optics and Computer Science (WOCS 2000)
  F. Kiamilev, J. Ekman, A. Ferreira, S. Esener, Y. Pan, K. Li
Fault Tolerant Algorithms for a Linear Array with a Reconfigurable Pipelined Bus System
  A.G. Bourgeois, J.L. Trahan
Fast and Scalable Parallel Matrix Computations with Optical Buses
  K. Li
Pulse-Modulated Vision Chips with Versatile-Interconnected Pixels
  J. Ohta, A. Uehara, T. Tokuda, M. Nunoshita
Connectivity Models for Optoelectronic Computing Systems
  H.M. Ozaktas
Optoelectronic-VLSI Technology: Terabit/s I/O to a VLSI Chip
  A.V. Krishnamoorthy
Three Dimensional VLSI-Scale Interconnects
  D.W. Prather
Present and Future Needs of Free-Space Optical Interconnects
  S. Esener, P. Marchand
Fast Sorting on a Linear Array with a Reconfigurable Pipelined Bus System
  A. Datta, R. Owens, S. Soundaralakshmi
Architecture Description and Prototype Demonstration of Optoelectronic Parallel-Matching Architecture
  K. Kagawa, K. Nitta, Y. Ogura, J. Tanida, Y. Ichioka
A Distributed Computing Demonstration System Using FSOI Inter-Processor Communication
  J. Ekman, C. Berger, F. Kiamilev, X. Wang, H. Spaanenburg, P. Marchand, S. Esener
Optoelectronic Multi-chip Modules Based on Imaging Fiber Bundle Structures
  D.M. Chiarulli, S.P. Levitan
VCSEL Based Smart Pixel Array Technology Enables Chip-to-Chip Optical Interconnect
  Y. Liu
Workshop on Run-Time Systems for Parallel Programming (RTSPP)
  R. Olsson, L.V. Kale, P. Beckman, M. Haines
A Portable and Adaptative Multi-protocol Communication Library for Multithreaded Runtime Systems
  O. Aumage, L. Bouge, R. Namyst
CORBA Based Runtime Support for Load Distribution and Fault Tolerance
  T. Barth, G. Flender, B. Freisleben, M. Grauer, F. Thilo
Run-Time Support for Adaptive Load Balancing
  M.A. Bhandarkar, R.K. Brunner, L.V. Kale
Integrating Kernel Activations in a Multithreaded Runtime System on Top of LINUX
  V. Danjean, R. Namyst, R.D. Russell
DyRecT: Software Support for Adaptive Parallelism on NOWs
  E. Godard, S. Setia, E. White
Fast Measurement of LogP Parameters for Message Passing Platforms
  T. Kielmann, H.E. Bal, K. Verstoep
Supporting Flexible Safety and Sharing in Multi-threaded Environments
  S.H. Samorodin, R. Pandey
A Runtime System for Dynamic DAG Programming
  M.-Y. Wu, W. Shu, Y. Chen
Workshop on Fault-Tolerant Parallel and Distributed Systems (FTPDS 2000)
  D.R. Avresky
Certification of System Architecture Dependability
  I. Levendel
Computing in the RAIN: A Reliable Array of Independent Nodes
  V. Bohossian, C.C. Fan, P.S. LeMahieu, M.D. Riedel, L. Xu, J. Bruck
Fault-Tolerant Wide-Area Parallel Computing
  J.B. Weissman
Transient Analysis of Dependability/Performability Models by Regenerative Randomization with Laplace Transform Inversion
  J.A. Carrasco
FANTOMAS: Fault Tolerance for Mobile Agents in Clusters
  H. Pals, S. Petri, C. Grewe
Metrics, Methodologies, and Tools for Analyzing Network Fault Recovery Performance in Real-Time Distributed Systems
  P.M. Irey IV, B.L. Chappell, R.W. Hott, D.T. Marlow, K.F. O'Donoghue, T.R. Plunkett
Consensus Based on Strong Failure Detectors: A Time and Message-Efficient Protocol
  F. Greve, M. Hurfin, R. Macedo, M. Raynal
Implementation of Finite Lattices in VLSI for Fault-State Encoding in High-Speed Networks
  A.G. Doring, G. Lustig
Building a Reliable Message Delivery System Using the CORBA Event Service
  S. Ramani, B. Dasarathy, K.S. Trivedi
Network Survivability Simulation of a Commercially Deployed Dynamic Routing System Protocol
  A. Chowdhury, O. Frieder, P. Luse, P.-J. Wan
Fault-Tolerant Distributed-Shared-Memory on a Broadcast-Based Interconnection Network
  D. Hecht, C. Katsinis
An Efficient Backup-Overloading for Fault-Tolerant Scheduling of Real-Time Tasks
  R. Al-Omari, G. Manimaran, A.K. Somani
Mobile Agents to Automate Fault Management in Wireless and Mobile Networks
  N. Pissinou, Bhagyavati, K. Makki
Heterogeneous Computing Workshop (HCW 2000)
  V.K. Prasanna, C.S. Raghavendra
Author Index
3rd Workshop on Personal Computer based Networks Of Workstations (PC-NOW 2000)
Clusters composed of fast personal computers are now well established as cheap and efficient platforms for distributed and parallel applications. The main drawback of a standard NOW is the poor performance of the standard inter-process communication mechanisms based on RPC, sockets, TCP/IP, and Ethernet. Such standard communication mechanisms perform poorly in terms of both throughput and message latency. Several prototypes developed around the world have proved that, by revisiting the implementation of the communication layer of a standard operating system kernel, a low-cost hardware platform composed of only commodity components can scale up to several tens of processing nodes and deliver communication and computation performance exceeding that delivered by conventional high-cost parallel platforms. This workshop provides a forum to discuss issues related to the design of efficient NOW/Clusters based on commodity hardware and public domain operating systems as compared to custom hardware devices and/or proprietary operating systems.
Workshop Organizers
G. Chiola (DISI, U. Genoa, I)
G. Conte (CE, U. Parma, I)
L.V. Mancini (DSI, U. Rome, I)

Sponsors
IEEE TFCC (Task Force on Cluster Computing)
J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 1-3, 2000. © Springer-Verlag Berlin Heidelberg 2000
Program Committee
Program Chair:
C. Anglano (U. Piemonte Or., I)
M. Baker (CSM, U. Portsmouth, UK)
L. Bouge (ENS Lyon, F)
G. Chiola (DISI, U. Genoa, I)
G. Ciaccio (DISI, U. Genoa, I)
G. Conte (CE, U. Parma, I)
H.G. Dietz (ECE, Purdue U., USA)
W. Gentzsch (GENIAS Software GmbH, D)
G. Iannello (DIS, U. Napoli, I)
Y. Ishikawa (RWCP, J)
K. Li (Princeton U., USA)
L.V. Mancini (DSI, U. Roma 1, I)
T.G. Mattson (Intel Corp., USA)
W. Rehm (Informatik, T.U. Chemnitz, D)
P. Rossi (ENEA HPCN, Bologna, I)
P. Roe (Queensland U. of Tech., AUS)
D.B. Skillicorn (Queen's U., CAN)
D. Tavangarian (Informatik, U. Rostock, D)
B. Tourancheau (LHPC, U. Lyon, F)
Referees
C. Anglano, O. Aumage, M. Baker, G. Chiola, G. Ciaccio, G. Conte, M. Fischer, W. Gentzsch, G. Iannello, Y. Ishikawa, L.V. Mancini, T.G. Mattson, J.-F. Mehaut, R. Namyst, W. Rehm, P. Roe, P. Rossi, D. Tavangarian, B. Tourancheau, R. Westrelin
Accepted Papers

Session 1: Cluster Interconnect Design and Implementation
- M. Trams, W. Rehm, D. Balkanski, and S. Simeonov "Memory Management in a combined VIA/SCI Hardware"
- M. Fischer, et al. "ATOLL, a new switched, high speed Interconnect in comparison to Myrinet and SCI"
- R.R. Hoare "ClusterNet: An Object-Oriented Cluster Network"

Session 2: Off-the-shelf Clusters Communication
- M. Baker, S. Scott, A. Geist, and L. Browne "GigaBit Performance under NT"
- H.A. Chen, Y.O. Carrasco, and A.W. Apon "MPI Collective Operations over IP Multicast"

Session 3: Multiple Clusters and Grid Computing
- S. Lalis, and A. Karipidis "An Open Market-Based Architecture for Distributed Computing"
- M. Barreto, R. Avila, and Ph. Navaux "The MultiCluster Model to the Integrated Use of Multiple Workstation Clusters"

Session 4: Data Intensive Applications
- S.H. Chung, et al. "Parallel Information Retrieval on an SCI-Based PC-NOW"
- M. Exbrayat, and L. Brunie "A PC-NOW Based Parallel Extension for a Sequential DBMS"
Other Activities
In addition to the presentation of contributed papers, an invited talk will be scheduled at the workshop.
Memory Management in a combined VIA/SCI Hardware

Mario Trams, Wolfgang Rehm, Daniel Balkanski and Stanislav Simeonov *
{mtr,rehm}@informatik.tu-chemnitz.de
DaniBalkanski@yahoo.com, stan@bfu.bg

Technische Universität Chemnitz
Fakultät für Informatik
Straße der Nationen 62, 09111 Chemnitz, Germany
Abstract. In this document we briefly review memory management and DMA considerations for common SCI hardware and for the Virtual Interface Architecture. On this basis we present our ideas for an improved memory management of a hardware that combines the positive characteristics of both basic technologies, in order to arrive at one completely new design rather than simply adding one to the other. The described memory management concept provides the opportunity of a real zero-copy transfer for Send-Receive operations while keeping the full flexibility and efficiency of a node's local memory management system. From the resulting hardware we expect a very good system throughput for message passing applications, even if they use a wide range of message sizes.
1 Motivation and Introduction
PCI-SCI bridges (Scalable Coherent Interface [12]) are becoming an increasingly preferred technological choice in the growing market of cluster computing based on non-proprietary hardware. Although the absolute performance of this communication hardware keeps increasing, it still has some disadvantages. Dolphin Interconnect Solutions AS (Norway) is the leading manufacturer of commercial SCI link chips as well as the only manufacturer of commercially available PCI-SCI bridges. These bridges offer very low latencies in the range of a few microseconds for their distributed shared memory and also reach relatively high bandwidths (more than 80 MBytes/s). In our clusters we use Dolphin's PCI-SCI bridges in conjunction with standard PC components [11].

MPI applications running on our cluster can be accelerated considerably by the low latencies of the underlying SCI shared memory if it is used as the communication medium for transferring messages. MPI implementations such as [7] show a bandwidth of about 35 MByte/s for a message size of 1 kByte, which is quite a lot (refer also to figure 1 later). The major problem of MPI implementations over shared memory is the high CPU utilization for long messages due to copy operations. So the good MPI performance just cited [7] is more of an academic peak performance, achieved with more or less total CPU consumption.

A standard solution to this problem is to use a block-moving DMA engine for data transfers in the background. Dolphin's PCI-SCI bridges implement such a DMA engine. Unfortunately, it cannot be controlled directly from a user process without violating general protection issues. Kernel calls are therefore required, which in effect increase the minimum achievable latency and cost a lot of additional CPU cycles.

The Virtual Interface Architecture (VIA) Specification [16] defines mechanisms for moving the communication hardware closer to the application by migrating protection mechanisms into the hardware. In fact, VIA specifies nothing completely new, since it can be seen as an evolution of U-Net [15]. But it is a first attempt to define a common industry standard for a general communication architecture for message passing, from the hardware to the software layers. Due to its DMA transfers and its reduced latency because of user-level hardware access, a VIA system will increase the general system throughput of a cluster computer compared to a cluster equipped with a conventional communication system with similar raw performance characteristics. For very short transmission sizes, however, it will not come close to programmed IO over global distributed shared memory in terms of latency and bandwidth. This is natural, because a simple memory reference cannot be compared with the preparation and execution of a DMA descriptor.

* Daniel Balkanski and Stanislav Simeonov are from the Burgas Free University, Bulgaria.
** The work presented in this paper is sponsored by the SMWK/SMWA Saxony ministries (AZ:7531.50-03-0380-98/6). It is also carried out in strong interaction with the project GRANT SFB393/B6 of the DFG (German National Science Foundation).
J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 4-15, 2000. © Springer-Verlag Berlin Heidelberg 2000
Figure 1. Comparison of MPI Implementations for Dolphin's PCI-SCI Bridges and GigaNet's cLAN VIA Hardware (horizontal axis: message size in bytes)
Figure 1 shows bandwidth curves of MPI implementations for both an SCI and a native VIA implementation (GigaNet cLAN). In both cases the hardware is based on the PCI bus, and the machines on which the measurements were taken are comparable. The concrete values are based on ping-pong measurements and were taken from [7] for SCI and from [10] (Linux case) for the cLAN hardware.
M. Trams et al.
As expected, SCI achieves the better bandwidth for smaller message sizes. For larger message sizes the cLAN implementation demonstrates higher bandwidth because of its advanced DMA engine. No less important is the fact that a DMA engine leaves the CPU more time for computation. Details of such CPU utilization considerations are outside the scope of this paper and are discussed in [14] and [8]. Summarizing these motivating facts, we can state that, besides a powerful DMA engine controllable from user level, a distributed shared memory for programmed IO is an important feature that a communication system should not lack.
2 What are the Memory Management Considerations?
First of all, we briefly define what belongs to memory management for the purposes of this document. It can be stated by the following aspects, expressed as questions:
1. How is a process' memory area made available to the Network Interface Controller (NIC), and in what way is main memory protected against wrong accesses?
2. At which point in the system does a DMA engine work, and how are the transactions of this DMA engine validated?
3. In which way is memory of a process on a remote node made accessible to a local process?
Based on these questions we can classify the different communication system architectures in terms of the advantages and disadvantages of their memory management. The analysis in the following sections reveals these advantages and disadvantages for the common PCI-SCI architecture and the VI Architecture.
3 PCI-SCI vs. VIA discussion and comparison
3.1 Question 1: How is a process' memory area made available to the NIC, and in what way is main memory protected against wrong accesses?
Common PCI-SCI case: Current PCI-SCI bridges developed by Dolphin implement a quite static memory management [4] for access to main memory, or rather to PCI address space. To avoid unwanted accesses to sensitive locations, the PCI-SCI bridge is set up to allow accesses only to a dedicated memory window. Memory access requests caused by remote machines are only allowed if they fall within the specified window. This causes two big disadvantages:
- Contiguous exported regions must also be contiguous in the physical address space. Additionally, these regions must be aligned to the minimum exportable block size, which is typically quite large (512 kB for Dolphin's bridges).
- Exported memory must reside within this window.
To handle these problems, main memory has to be reserved exclusively for SCI purposes. In practice this 'wastes' part of the memory if it is not actually exported later. Consequently, these disadvantages of the common PCI-SCI bridge architecture make its use with MPI applications very difficult, especially with regard to zero-copy transfer operations. Because data transfers can be processed using the reserved memory region only, MPI applications would have to use special malloc() functions to allocate the data structures later used for send/receive purposes. But this violates a major goal of the MPI standard: architecture independence.
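The restrictions of the static window can be made concrete with a small sketch. The function below is purely illustrative: its name, the list-of-page-addresses representation, and the window parameters are assumptions made for this example; only the 512 kB minimum block size is taken from the text.

```python
BLOCK = 512 * 1024  # minimum exportable block size named in the text (512 kB)


def exportable_static(phys_pages, page_size, win_base, win_size):
    """Return True if a buffer could be exported through a static window.

    phys_pages lists the physical addresses of the pages backing the
    buffer, in virtual order (a hypothetical representation).
    """
    if not phys_pages:
        return False
    start = phys_pages[0]
    # 1. The region must be physically contiguous.
    contiguous = all(p == start + i * page_size
                     for i, p in enumerate(phys_pages))
    # 2. The region must start on a block boundary.
    aligned = start % BLOCK == 0
    # 3. The region must lie completely inside the reserved window.
    end = start + len(phys_pages) * page_size
    inside = win_base <= start and end <= win_base + win_size
    return contiguous and aligned and inside
```

An ordinary malloc() buffer, whose pages are typically scattered over physical memory, fails the first check, which is why a dedicated allocator would be needed in this scheme.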
VIA case: The VI Architecture specifies a much better view of main memory for the NIC. Instead of a flat one-to-one representation of the physical memory space, it implements a more flexible lookup-table address translation. Compared with its PCI-SCI counterpart, this mechanism has the following advantages:
- Contiguous regions seen by the VIA hardware are not required to be contiguous in the host physical address space as well.
- Accesses to sensitive address ranges are prevented by simply not including them in the translation table.
- The NIC can get access to every physical memory page, even if this may not be possible for all physical pages at once (when the translation table has fewer entries than the number of physical pages).
The translation table serves not only address translation but also memory protection. To achieve this, a so-called Protection Tag is included in each translation and protection table entry. This tag is checked prior to each access to main memory to qualify the access. For more information see section 3.2.
Conclusions regarding question 1: It is clear that the VIA approach offers much more flexibility. Using this local memory access strategy in a PCI-SCI bridge design will eliminate all of the problems seen in current designs. The drawback, of course, is the more complicated hardware and the additional cycles needed to translate the address.
3.2 Question 2: At which point in the system does a DMA engine work, and how are the transactions of this DMA engine validated?
Common PCI-SCI case: The DMA engine accesses local memory in the same way as discussed in section 3.1. It therefore also inherits all the disadvantages of dealing with physical addresses on the PCI-SCI bridge.
For accesses to global SCI memory a more flexible translation table is used. This Downstream Translation Table realizes a virtual view onto global SCI memory, similar to the view a VIA NIC has onto local memory. Every page of the virtual SCI memory can be mapped to a page of the global SCI memory. Regarding validation, the DMA engine cannot distinguish between regions owned by different processes (neither local nor remote). Therefore the hardware cannot check access rights on the fly. Instead, the DMA descriptor containing the information about the block to copy must be guaranteed to be correct. In other words, the operating system kernel has to prepare, or at least check, every DMA descriptor to be posted to the NIC. This requires OS calls, which we want to remove at all cost.
VIA case: A VIA NIC implements mechanisms to execute a DMA descriptor from user level while assuring protection among multiple processes using the same VIA hardware. A user process can own one or more interfaces of the VIA hardware (so-called Virtual Interfaces). In other words, a virtual interface is a virtual representation of a virtually unique communication hardware. The connection between the virtual interfaces and the VIA hardware is made by Doorbells, each of which represents a virtual interface with its specific control registers. A user-level process can insert a new DMA descriptor into a job queue of the VIA hardware by writing an appropriate value into a doorbell assigned to this process. The size of a doorbell is equal to the page size of the host computer, so the decision which process may access which doorbell (or virtual interface) can simply be enforced by the host's virtual memory management system. Protection during DMA transfers is achieved by Protection Tags. The DMA engine uses these tags to check whether the virtual interface currently being processed is allowed to access a memory page. The protection tag of the accessed memory page is compared with the protection tag assigned to the virtual interface of the process that provided the DMA descriptor. Only if both tags are equal is the access legal and performed. A more detailed description of this mechanism is outside the scope of this document (refer to [13] and [16]).
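The tag comparison described above can be sketched as a lookup in a translation and protection table. This is a minimal software model, not the VIA hardware interface: the class and method names, the dictionary-based table, and the 4 kB page size are assumptions chosen for illustration.

```python
PAGE_SIZE = 4096  # host page size assumed for this sketch


class ProtectionError(Exception):
    """Raised when an access is not permitted by the table."""


class TranslationProtectionTable:
    """Maps NIC-virtual page numbers to (physical page address, tag)."""

    def __init__(self):
        self.entries = {}

    def register(self, nic_page, phys_page_addr, tag):
        # Pages that are never registered stay invisible to the NIC.
        self.entries[nic_page] = (phys_page_addr, tag)

    def translate(self, nic_addr, vi_tag):
        """Translate an address on behalf of a virtual interface."""
        page, offset = divmod(nic_addr, PAGE_SIZE)
        entry = self.entries.get(page)
        if entry is None:
            raise ProtectionError("page not registered")
        phys, tag = entry
        # The access is legal only if the page's tag equals the tag of
        # the virtual interface that posted the descriptor.
        if tag != vi_tag:
            raise ProtectionError("protection tag mismatch")
        return phys + offset
```

In this model two processes sharing the NIC simply hold different tags, so a descriptor posted by one process can never resolve to a page registered by the other.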
Conclusions regarding question 2: The location of the DMA engine is in principle the same in both cases. The difference is that in the VIA case a real lookup-table based address translation is performed between the DMA engine and PCI memory. That is, the VIA DMA engine operates on a virtual local address space, while the PCI-SCI DMA engine operates directly on local physical addresses. The answer regarding access protection is simple: the common PCI-SCI DMA engine supports no protection in hardware and must rely on correct DMA descriptors. The VIA hardware supports full protection in hardware, where the DMA engine is only one part of the whole protection mechanism.
3.3 Question 3: In which way is memory of a process on a remote node made accessible to a local process?
Common PCI-SCI case: Making remote memory accessible is, of course, a key function in an SCI system. Each PCI-SCI bridge offers a special PCI memory window which is practically the virtual SCI memory seen by the card. So the same SCI memory that the DMA engine may access can also be accessed via memory references (also called programmed IO here). The procedure of making globally available SCI memory accessible to the local host is also referred to as importing global memory into local address space. On the other side, every PCI-SCI bridge can open a window to local address space and make it accessible to remote SCI nodes. The mechanism of this window is described in section 3.1 regarding question 1. The procedure of making local memory globally accessible is also called exporting local memory into global SCI space. With respect to memory references, protection is fully guaranteed for imported and exported memory. A process can access a remote process' memory page only if it has obtained a valid mapping of that page.
VIA case: The VI Architecture offers in principle no mechanism to access remote memory as it is realized in a distributed shared memory communication system such as SCI. But there is an indirect way, using a so-called Remote DMA (RDMA) mechanism. This method is very similar to DMA transfers as they are used in common PCI-SCI bridges. A process that wants to transfer data between its local memory and the memory of a remote process specifies an RDMA descriptor. This contains an address in the local VIA virtual address space and an address in the remote node's local VIA virtual address space.
Conclusions regarding question 3: While a PCI-SCI architecture allows processes to really share their memory globally across a system, this is not possible with VIA hardware. Of course, VIA was never designed for realizing distributed shared memory.
4 A new PCI-SCI Architecture with VIA Approaches
In our design we want to combine the advantages of an ultra-low latency SCI shared memory with a VIA-like advanced memory management and protected user-level DMA. This combination will make our SCI hardware more suitable for our message passing oriented parallel applications, which require short as well as long transmission sizes.
4.1 Advanced Memory Management
In order to eliminate the restrictions discussed above, namely contiguous and aligned exported memory regions that must reside in a special window, our PCI-SCI architecture will implement two address translation tables, one for local and one for remote memory accesses. In contrast, common PCI-SCI bridges use only one translation table, for accesses to remote memory. This new and more flexible memory management, combined with a reduced minimum page size of the distributed shared memory, leads to a much better usage of the main memory of the host system. In fact, our targeted amount of imported SCI memory is 1 GB with a page granularity of 16 kB. With a larger downstream address translation table this page size may be reduced further to match exactly the page size used in the host systems (such as 4 kB for x86 CPUs). For the granularity of memory to be exported (in SCI terminology) or to be made available for VIA operations there is no question: it must be equal to the host system page size, in other words 4 kB, since the primary target system is an x86 one. The planned maximum window size here is 128 MB.
4.2 Operation of Distributed Shared Memory from a memory-related point of view
Figure 2. Address Translations between exporting and importing Processes for programmed IO
Figure 2 gives an overall example of exporting and importing memory regions. The example illustrates the address translations performed when the importing process accesses memory exported by a process on the remote node. The exporting process exports some of its previously allocated memory by registering it with its local PCI-SCI hardware. Memory is registered page by page. Remember that in a common PCI-SCI system this exported memory would have to be physically located inside the special memory area reserved for SCI purposes. Here, however, we can take advantage of a virtual view onto local memory similar to that of the VI Architecture.
Once the upstream address translation table entries are set up, the exported memory can be accessed from remote machines, since it has become part of the global SCI memory. To access this memory, the remote machine must import it first. The major step here is to set up entries in its downstream address translation table so that they point to the region inside the global SCI memory that belongs to the exporter. From then on, the only remaining task is to map the physical PCI pages that correspond to the prepared downstream translation entries into the virtual address space of the importing process. When the importing process accesses the imported area, the transaction is forwarded through the PCI-SCI system and addresses are translated three times. First, the host MMU translates the address from the process' virtual address space into physical address space (or rather PCI space). Then the PCI-SCI bridge takes up the transaction and translates the address into the global SCI address space using the downstream translation table. The downstream address translation includes generation of the remote node id and the address offset inside the remote node's virtual local PCI address space. When the remote node receives the transaction, it translates the address to the correct local physical (or rather PCI) address using the upstream address translation table.
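The three translation steps can be traced with a small software model. It is only a sketch under stated assumptions: the function names, the dictionary-based tables, and the import window base address are invented for illustration, while the page sizes and window sizes are the figures given in section 4.1.

```python
HOST_PAGE = 4 * 1024   # x86 host page size (from the text)
SCI_PAGE = 16 * 1024   # targeted downstream page granularity (from the text)


def mmu_translate(page_table, vaddr):
    """Step 1: process-virtual address -> local PCI address (host MMU)."""
    vpage, off = divmod(vaddr, HOST_PAGE)
    return page_table[vpage] + off


def downstream_translate(table, window_base, pci_addr):
    """Step 2: PCI address in the import window -> (node id, remote offset)."""
    index, off = divmod(pci_addr - window_base, SCI_PAGE)
    node_id, remote_base = table[index]
    return node_id, remote_base + off


def upstream_translate(table, remote_addr):
    """Step 3, on the exporting node: remote offset -> physical address."""
    page, off = divmod(remote_addr, HOST_PAGE)
    return table[page] + off


# Table sizes implied by the figures in section 4.1.
DOWNSTREAM_ENTRIES = (1 << 30) // SCI_PAGE    # 1 GB of imported memory / 16 kB
UPSTREAM_ENTRIES = (128 << 20) // HOST_PAGE   # 128 MB window / 4 kB
```

With these figures the downstream table needs 65536 entries and the upstream table 32768 entries. Reducing the downstream granularity to 4 kB would quadruple the downstream table to 262144 entries, which is the table growth the text alludes to.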
4.3 Operation of Protected User-Level Remote DMA from a memory-related point of view
Figure 3 shows the principle of operation of the DMA engine of our PCI-SCI bridge design. The figure shows essentially the same address spaces and translation tables as figure 2. Only the processes' virtual address spaces and the corresponding translations into physical address spaces are omitted so as not to overload the figure. The DMA engine inside the bridge is surrounded by two address translation tables or, more precisely, by two address translation and protection tables. On the active node (that is, the node where the DMA engine is executing DMA descriptors, node 1 here) both translation tables are involved. On the remote node, however, practically nothing has changed compared to the programmed IO case. Hence the remote node does not distinguish whether transactions were generated by a DMA engine or not. Both translation tables of a PCI-SCI bridge incorporate protection tags as described in section 3.2. But while in VIA these are used for accesses to local memory, here they are also used for accesses to remote SCI memory. Together with the VIA mechanisms for descriptor notification and execution, this makes the DMA engine unable to access wrong memory pages, neither local (exported) nor remote (imported) ones. Note that protection tags are actually checked only for the DMA engine and only on the active node (node 1 in figure 3). In all other cases the same translation and protection tables are used, but the protection tags inside are ignored.
Figure 3. Address Translations performed during RDMA Transfers
4.4 A free choice of using either Programmed I/O or User-Level Remote DMA
This kind of global memory management allows applications or, more exactly, communication libraries to decide on the fly, depending on the data size, how data should be transferred. For a short message a PIO transfer may be used, and for a longer message an RDMA transfer may be suitable. The corresponding remote node is not involved in this decision since it does not see any difference. This keeps the protocol overhead very low. Finally, recall the VIA case. Although it already offers a relatively low-latency, protected user-level remote DMA mechanism without the memory handling problems of common PCI-SCI, it has nothing like a PIO mechanism for realizing a distributed shared memory. Hence the advantages of an ultra-low latency PIO transfer are not available there.
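How a communication library might make this decision can be sketched as follows. The threshold value and all names are hypothetical; the text does not specify a cut-over point, which in practice would have to be determined by measurement on the target hardware.

```python
PIO_THRESHOLD = 1024  # bytes; hypothetical cut-over, to be tuned by measurement


def choose_transfer(nbytes, threshold=PIO_THRESHOLD):
    """Pick the transfer mechanism for a message of nbytes bytes."""
    if nbytes < 0:
        raise ValueError("message size must be non-negative")
    return "PIO" if nbytes <= threshold else "RDMA"


def send(data, pio_write, rdma_post, threshold=PIO_THRESHOLD):
    """Dispatch data to one of two caller-supplied transfer functions.

    pio_write and rdma_post stand in for the library's real PIO copy
    routine and descriptor-posting routine.
    """
    if choose_transfer(len(data), threshold) == "PIO":
        pio_write(data)
        return "PIO"
    rdma_post(data)
    return "RDMA"
```

Because the receiving node sees the same kind of transaction in both cases, this choice stays local to the sender and needs no extra protocol message.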
5 Influence on MPI Libraries
To show the advantages of the presented advanced memory management we take a look at the so-called Rendezvous Protocol that is commonly used for Send-Receive operations. Figure 4 illustrates the principle of the Rendezvous protocol used in common MPI implementations [7] based on Dolphin's PCI-SCI bridges. One big problem in this model is the copy operation that takes place on the receiver's side to take data out of the SCI buffer. Although the resulting increase in latency can in principle be hidden by the overlapping mechanism, a lot of CPU cycles are burned there.
Figure 4. Typical Rendezvous Protocol in common PCI-SCI Implementations
Figure 5. Improved Rendezvous Protocol based on advanced PCI-SCI Memory Management
(Sequence diagrams between sender and receiver; recoverable labels: Request_Send, Ok_to_Send, Block_Ready, Memory Ready, Transfer completed, CPU busy, CPU free.)
With our proposed memory management there is a chance to remove this copy operation on the receiver's side. The basic operation of the Rendezvous protocol can be implemented as described in figure 5. Here the sender informs the receiver as usual. Before the receiver sends back an acknowledgment, it checks whether the data structure the data is to be written to is already exported to the sender. If not, the memory region that includes the data structure is registered with the receiver's PCI-SCI bridge and exported to the sender. The sender itself must also import this memory region if this has not been done before. After this the sender copies data from the private memory of the sending process directly into the private memory of the receiving process. As a further optimization the sender may decide to use the DMA engine to copy data without further CPU intervention. This decision will typically be based on the message size.
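The sequence just described can be sketched as a small simulation. It models only the control flow: the class names, the region identifiers, and the use of Python sets for the export and import state are assumptions of this example, and the real registration, export, and import steps are reduced to bookkeeping.

```python
class Receiver:
    def __init__(self):
        self.exported = set()  # ids of regions already exported to the sender
        self.memory = {}       # region id -> private user buffer

    def on_request_send(self, region_id, size):
        """Handle Request_Send: export the target buffer if needed, then ack."""
        if region_id not in self.memory:
            self.memory[region_id] = bytearray(size)
        if region_id not in self.exported:
            # Stands in for registering the region with the bridge
            # and exporting it to the sender.
            self.exported.add(region_id)
        return "Ok_to_Send"


class Sender:
    def __init__(self):
        self.imported = set()  # ids of regions already imported

    def send(self, receiver, region_id, data):
        reply = receiver.on_request_send(region_id, len(data))
        if reply != "Ok_to_Send":
            raise RuntimeError("receiver refused transfer")
        if region_id not in self.imported:
            self.imported.add(region_id)  # import once, reuse afterwards
        # Direct copy into the receiver's private buffer; there is no
        # intermediate SCI buffer and no copy on the receiver's side.
        receiver.memory[region_id][:len(data)] = data
        return "Transfer completed"
```

Repeated transfers into the same buffer skip the export and import steps, so their cost is paid only once per region.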
6 State of the project (November 1999)
We developed our own FPGA-based PCI-SCI card and have prototypes of this card already running. At the moment they only offer a so-called Manual Packet Mode, which is intended for sideband communication besides the regular programmed IO and DMA transfers. The card itself is a 64Bit/33MHz PCI Rev.2.1 one [8]. As SCI link controller we are using Dolphin's LC-2 for now, and we plan to migrate to the LC-3 as soon as it is available. The reprogrammable FPGA design leads to a flexible, reconfigurable hardware and also offers the opportunity for experiments. Linux low-level drivers for Alpha and x86 platforms and several configuration/test programs were developed. In addition, our research group is working on an appropriate higher-level Linux driver for our card [5, 6]. This offers a software interface (advanced Virtual Interface Provider Library) that combines SCI and VIA features such as importing/exporting memory regions, VI connection management etc. It also emulates parts of the hardware, so that other software can run on top of it although the real hardware is not available. As an example, a parallelized MPI version of the popular raytracer POVRAY is already running over this emulation. This program uses an MPI-2 library for our combined SCI/VIA hardware. This library is also under development at our department [3]. For more details and latest news refer to our project homepage at http://www.tu-chemnitz.de/-mtr/VIA..SCI/
7 Other Works on SCI and VIA
Dolphin already presented some performance measurements in [1] for their VIA implementation, which is an emulation over SCI shared memory. Although the presented VIA performance looks very good, it is again achieved at the cost of excessive CPU utilization. The number of vendors of native VIA hardware keeps growing. One of these companies is GigaNet [17], for which performance values are already available. On their web pages GigaNet gives latencies of 8µs for short transmission sizes. Dolphin gives a latency of 2.3µs for PIO operations (remote memory access). This demonstrates the relatively big performance advantage a distributed shared memory offers here. The University of California, Berkeley [2] and the Berkeley Lab [9] are doing more open research, also in the direction of improving the VIA specification. The work at the University of California, Berkeley concentrates more on VIA hardware implementations based on Myrinet. In contrast, the work at the Berkeley Lab is targeted mainly at software development for Linux.
8 Conclusions and Outlook
The combined PCI-SCI/VIA system is not just a simple result of adding two different things. Rather, it is a real integration of both in one design. More concretely, it is an integration of concepts defined by the VIA specification into a common PCI-SCI architecture, since the major PCI-SCI characteristics are kept. The result is a hardware design with completely new qualitative characteristics. It combines the most powerful features of SCI and VIA in order to obtain highly efficient messaging mechanisms and high throughput over a broad range of message lengths. The advantage that MPI libraries can draw from a more flexible memory management was illustrated for the case of a Rendezvous Send-Receive for MPI. The final proof in practice is still pending, because hardware with all features implemented is not yet available.
References
1. Torsten Amundsen and John Robinson: High-performance cluster-computing with Dolphin's CluStar PCI adapter card. In: Proceedings of SCI Europe '98, Pages 149-152, Bordeaux, 1998
2. Philip Buonadonna, Andrew Geweke: An Implementation and Analysis of the Virtual Interface Architecture. University of California at Berkeley, Dept. of Computer Science, Berkeley, 1998. www.cs.berkeley.edu;-philipb/via/
3. A new MPI-2-Standard MPI Implementation with support for the VIA. www.tu-chemnitz.de/informatik/RA/projects/chempi-html/
4. Dolphin Interconnect Solutions AS: PCI-SCI Bridge Spec. Rev. 4.01. 1997.
5. Friedrich Seifert: Design and Implementation of System Software for Transparent Mode Communication over SCI. Student Work, Dept. of Computer Science, University of Technology Chemnitz, 1999. See also: www.tu-chemnitz.de/-sfri/publications.html
6. Friedrich Seifert: Development of System Software to integrate the Virtual Interface Architecture (VIA) into Linux Operating System Kernel for optimized Message Passing. Diploma Thesis, TU-Chemnitz, Sept. 1999. See also: www.tu-chemnitz.de/informatik/RA/themes/works.html
7. Joachim Worringen and Thomas Bemmerl: MPICH for SCI-connected Clusters. In: Proceedings of SCI-Europe'99, Toulouse, Sept. 1999, Pages 3-11. See also: wwwbode.in.tum.de/events/sci-europe99/
8. Mario Trams and Wolfgang Rehm: A new generic and reconfigurable PCI-SCI bridge. In: Proceedings of SCI-Europe'99, Toulouse, Sept. 1999, Pages 113-120. See also: wwwbode.in.tum.de/events/sci-europe99/
9. M-VIA: A High Performance Modular VIA for Linux. Project Homepage: http://www.nersc.gov/research/FTG/via/
10. MPI Software Technology, Inc.: Performance of MPI/Pro for cLAN on Linux and Windows. www.mpi-softtech.com/performance/perf-win-lin.html
11. The Open Scalable Cluster ARchitecture (OSCAR) Project. TU Chemnitz. www.tu-chemnitz.de/informatik/RA/projects/oscar.html/
12. IEEE Standard for Scalable Coherent Interface (SCI). IEEE Std. 1596-1992. SCI Homepage: www.SCizzL.com
13. Mario Trams: Design of a system-friendly PCI-SCI Bridge with an optimized User-Interface. Diploma Thesis, TU-Chemnitz, 1998. See also: www.tu-chemnitz.de/informatik/RA/themes/works.html
14. Mario Trams, Wolfgang Rehm, and Friedrich Seifert: An advanced PCI-SCI bridge with VIA support. In: Proceedings of 2nd Cluster-Computing Workshop, Karlsruhe, 1999, Pages 35-44. See also: www.tu-chemnitz.de/informatik/RA/CC99/
15. The U-Net Project: A User-Level Network Interface Architecture. www2.cs.cornell.edu/U-Net
16. Intel, Compaq and Microsoft: Virtual Interface Architecture Specification V1.0. VIA Homepage: www.viarch.org
17. GigaNet Homepage: www.giganet.com
ATOLL, a new switched, high speed Interconnect in Comparison to Myrinet and SCI
Markus Fischer, Ulrich Bruning, Jorg Kluge, Lars Rzymianowicz, Patrick Schulz, Mathias Waack
University of Mannheim, Germany, markus@atoll-net.de
Abstract. While standard processors achieve supercomputer performance, a performance gap exists between the interconnects of MPPs and COTS. Standard solutions like Ethernet cannot keep up with the demand for high speed communication of today's powerful CPUs. Hence, high speed interconnects have an important impact on a cluster's performance. While standard solutions for processing nodes exist, communication hardware is currently only available as a special, expensive, non-portable solution. ATOLL presents a switched, high speed interconnect, which fulfills the current needs for user level communication and concurrency in computation and communication. ATOLL is a single chip solution; additional switching hardware is not required.
1 Introduction
Using commodity off the shelf components (COTS) is a viable option to build up powerful clusters, not only for number crunching but also for highly parallel, commercial applications. The first clusters already show up in the Top500 list [6], and the number of entries is expected to rise continuously. Powerful CPUs such as the Intel PIII Xeon with SMP functionality achieve processing performance known from supercomputers. Currently a high percentage of existing clusters is equipped with standard solutions such as Fast Ethernet. This is mainly for compatibility reasons, since applications based on standardized TCP/IP are easily portable. This protocol, however, is known to cause too much overhead [7]. Lowering latency in particular is an important key to achieving good communication performance. A survey on message sizes shows that protocols and hardware have to be designed to handle short messages extremely well [14]:
- in seven parallel scientific applications 30% of the messages were between 16 bytes and a kilobyte
- the median message sizes for TCP and UDP traffic in a departmental network were 32 and 128 bytes respectively
- 99% of TCP and 86% of the UDP traffic was less than 200 bytes
- on a commercial database all messages were less than 200 bytes
- the average message size ranges between 19 and 230 bytes
J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 16-27, 2000. © Springer-Verlag Berlin Heidelberg 2000
Recent research with Gigabit/s interconnects, such as Myrinet and SCI, has shown that one key to achieving low latency and high bandwidth is to bypass the operating system, avoiding a trap into the system: User Level Communication (ULC) gives the user application full control over the interconnect device (BIP, HPVM, UNET, AM). While ULC shortens the critical path when sending a message, a global instance such as the kernel is no longer involved in scheduling outgoing data. This has the disadvantage that security issues have to be addressed if different users are running their applications. Thrashing and context switching among multiple processes can also lower performance. Current research examines how to multiplex a network device efficiently [8] if this is not supported by the NI hardware itself. Therefore, a unique solution would be to support multiple NIs directly in hardware.
Designing interconnects for the standard PCI interface cuts down production costs due to higher volume. Nevertheless, necessary additional switching hardware increases the total cost per node significantly. While PCI is a standard interface designed for IO, current PCI bridges are limited to a bandwidth of 132 MB/s when running at 32 bit/33 MHz. Upcoming mainboards will run at 64 bit/66 MHz and achieve a maximum bandwidth of 528 MB/s.
The paper is organized as follows. The next section evaluates the design space for network interfaces and gives an overview of the key functionality needed to achieve good communication performance. Section 3 describes the design issues of ATOLL in comparison to Myrinet and SCI. In section 4 software layers, such as low level APIs and message passing interfaces for ATOLL and other NICs, are discussed. Finally, section 5 concludes our paper.
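The PCI peak bandwidths quoted in this introduction follow directly from bus width times clock rate. The short derivation below assumes one transfer per clock cycle and ignores arbitration and protocol overhead, so the results are upper bounds; the symbols are introduced here for illustration only.

```latex
\begin{align*}
B_{32/33} &= 4\,\text{bytes} \times 33 \times 10^{6}\,\text{s}^{-1} = 132\,\text{MB/s},\\
B_{64/66} &= 8\,\text{bytes} \times 66 \times 10^{6}\,\text{s}^{-1} = 528\,\text{MB/s}.
\end{align*}
```

Doubling both the data path width and the clock rate therefore yields the fourfold increase stated above.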
2 Design Space for Network Interfaces
In this section we evaluate current NICs and characterize the design space of IO features in general, differentiating between hardware and software issues. From the hardware's point of view, features of interest are a special purpose processor on board, additional (staging) memory, support of concurrency by allowing both PIO and DMA operations, and support for shared memory at the lowest level. The requirement for additional switching hardware to build up large scaling clusters is another concern. From the software's point of view it is interesting to examine which protocols are offered and how they are implemented, whether MMU functionality is implemented allowing RDMA, and how message delivery and arrival are detected. The latter will have a major impact on performance. We break down the design space into the following items:
- Concurrency with PIO and DMA Transactions, MMU Functionality to support RDMA
Basically, when sending a message, the NIC's API chooses PIO or DMA for the transfer, depending on the message size. PIO has the advantage of low start-up costs to initiate the transfer. However, since the processor itself transfers the data directly to the network, it is busy during the entire transaction. To allow concurrency, the DMA mode must be chosen, in which the processor only prepares a message by creating a descriptor pointing to the actual message. This descriptor is handed to the DMA engine, which picks up the information and injects the message into the network. It is important to know that the DMA engine relies on pinned-down memory, since otherwise pages can be swapped out of memory and the engine usually cannot page on demand by itself. The advantage of using DMA is that it hides latency (allowing for multiple sends and receives). However, it has a higher start-up time than PIO. Typically, a threshold value determines which protocol is chosen for the transaction. Both mechanisms also play an important role when trying to avoid memory copies.
- Intelligent Network Adapter, Hardware and Software Protocols
The most important benefit of an intelligent network adapter (processor and SRAM on board) is flexibility in programming message handling functionality. Protocols for error detection and correction can be programmed in software, and new techniques can also be applied (VIA). Support for concurrency is improved as well. Additional memory on board lowers congestion, and the possibility of deadlocks on the network decreases. It has the advantage of buffering incoming data, thus emptying the network links on which the message has been transferred. However, the memory size is usually limited and the memory expensive, and the number of data copies rises. Another disadvantage of this combination is that the speed of an on-board processor cannot keep up with the main processing unit. Finally, programming the network adapter is a versatile task.
- Switches, Scalability and Routing
A benchmark of a point-to-point routine typically only shows the best performance for non-standard situations. Since a parallel application usually consists of dozens of processes communicating in a more or less fixed pattern, measuring the bisection bandwidth gives better information about the underlying communication hardware. A cost-effective SAN has bidirectional links and allows sending and receiving concurrently. A key factor for performance is scalability when switches are added to form a multistage connection network for larger clusters. Here blocking behavior becomes the major concern. Another point of interest is the connection from NIC to NIC: data link cables must provide a good compromise between data path width and transfer speed.
- Hardware support for Shared Memory (Coherency) and NI locations
Currently, a trend towards clustering bigger SMP nodes can be seen. Within an SMP node, a cache coherence protocol like MESI keeps the data consistent. To extend this functionality to I/O devices (such as the NIC), they would have to participate in the cache coherence protocol and be able to snoop on the system bus. However, this would require a special solution for every processor type and system and cannot be offered as a commodity solution. With growing distance between the NI and the processor, the latency of communication operations rises and, at the same time, the bandwidth declines. The only location that results in wide distribution, and thus in the necessary higher production volumes, is the standardized PCI bus. This leads to the loss of a number of functions, e.g., cache coherent accesses to the main memory of the processor. As the NI on the PCI card is (and has to be) independent of the processor used, functions like the MMU cannot be recreated in the NI, as they differ according to which processor is being used. For this purpose, an adaptable hardware realization of the basic mechanisms or an additional programmable processor on the PCI card can be used.
- Performance Issues: Copy Routines and Notification Mechanisms
Once a message is ready for sending, the data has to be placed at a location where the NIC can fetch it. Using the standard memcpy routines, however, may show poor performance. The reason is that the CPU cache is polluted when larger messages are injected into the network. Modern CPUs like the Pentium III or UltraSPARC offer special MMX or VIS instructions which copy the data without polluting the cache. Another critical point is the software overhead caused by the various protocols that guarantee data transfer. Nowadays cables are almost error free, so heavy protocols like TCP/IP are no longer necessary. Since errors may still occur, automatic error detection and correction implemented directly in hardware would improve efficiency. Performance is also sensitive to message arrival detection. A polling method typically wastes a lot of CPU cycles, while an interrupt causes too much overhead, since contexts have to be switched. Avoiding the interrupt mechanism is very important, as each interrupt adds a latency of approximately 60 µs [8].
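The size-based choice between PIO and DMA described earlier in this section can be sketched as follows. This is only an illustration: the class name, the method names and the 256-byte threshold are assumptions made for the example and are not taken from the ATOLL API.

```java
// Hypothetical sketch of a threshold-based transfer mode choice.
// Nothing here is part of the ATOLL software; all names are illustrative.
final class TransferModeSelector {
    enum Mode { PIO, DMA }

    private final int thresholdBytes;

    TransferModeSelector(int thresholdBytes) {
        if (thresholdBytes < 0) {
            throw new IllegalArgumentException("threshold must be non-negative");
        }
        this.thresholdBytes = thresholdBytes;
    }

    // Small messages use PIO (low start-up cost, but the CPU is busy for the
    // whole transfer); larger ones use DMA (higher start-up cost, CPU stays free).
    Mode choose(int messageBytes) {
        return messageBytes <= thresholdBytes ? Mode.PIO : Mode.DMA;
    }

    public static void main(String[] args) {
        TransferModeSelector selector = new TransferModeSelector(256); // assumed threshold
        System.out.println("64 bytes   -> " + selector.choose(64));
        System.out.println("4096 bytes -> " + selector.choose(4096));
    }
}
```

In practice the threshold would be tuned by measuring where the DMA start-up cost is amortized by the CPU time saved; the value used above is a placeholder.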
3 NIC Hardware Layout and Design
In the ATOLL project, all design space features have been carefully evaluated and the result is an implementation of a very advanced technology.
3.1 ATOLL Overview
The ATOLL cluster interface network is a future communication technology for building cost-effective and very efficient SANs using standard processing nodes. Due to an extremely low communication start-up time and very broad hardware support for processing messages, a much higher level of performance in the communication of parallel programs is achieved. Unique features are the availability of four links of the interface network, an integrated 8 x 8 crossbar and four independent host ports. They allow diverse network topologies to be created without additional external switches, and the ATOLL network is one of the first network-on-a-chip implementations. This design feature especially supports SMP nodes by assigning multiple processes their own dedicated device. Figure 1 depicts an overview of the hardware layout and data flow of ATOLL.
[Figure 1: overview of the ATOLL hardware layout and data flow; the legend distinguishes transfers via DMA (DMA descriptor, PCI burst) from transfers via PIO]
[Plot: x-axis: message length (bytes); plot key: 100 Mbps, GigaNet, Packet Engine, SysKonnect, NetGear]
Figure 3 - MPI/Pro Bandwidth Results
Our experiences with the performance of MPI under NT 4 and Windows 2000 are inconclusive. Currently, it appears that in shared-memory mode the latencies under Windows 2000 may be marginally lower than under NT 4. The measured peak bandwidths of Windows 2000 were greater than those of NT 4. In distributed-memory mode, the measured latencies under Windows 2000 were approximately 20% higher than the equivalent under NT 4. The measured bandwidths for Windows 2000 and NT 4 were, however, very similar. It is interesting to note that the measured network latencies for the 100 Mbps Ethernet cards and GigaNet under WinSock and MPI/Pro are almost equivalent. The performance of the Packet Engine Gigabit card is between 7% and 13% faster respectively. However, the SysKonnect and NetGear cards are significantly slower than standard 100 Mbps Ethernet.
7.2 Price/Performance Considerations
Table 4 shows the price/performance ratios calculated using the network card costs in September 1999 versus the peak measured bandwidth and minimum latency. It should be noted that the calculated ratios are only an approximate indicator, as the price of network cards varies significantly with the quantity bought and the discounts given. The smaller the price/performance ratio, the better the value for money that can be expected from a network card. The choice of the most appropriate card is often not based solely on price/performance, but also on other factors such as desired performance, compatibility or availability.
[Plot: Bandwidth (Log) versus Message Length (In Distributed Memory); x-axis: message length (bytes); plot key: 100 Mbps, GigaNet, Packet Engine, SysKonnect, NetGear]
Figure 4 - WinSock Bandwidth Results
The ratios shown in Table 4 indicate that the 100 Mbps Fast Ethernet cards provide significantly better price/performance than the other network cards. However, the ratios for the NetGear Gigabit card are significantly better than those of the other Gigabit cards.

Card make and speed         Price/Performance ($/Mbytes/s)   Price/Performance ($/us)
NetGear FA310TX 100 Mbps    $24.95/8.8 = 2.835               $24.95/208 = 0.12
GigaNet - cLAN GNN1000      $795/37 = 21.49                  $795/208 = 3.82
Packet Engine - GNIC II     $995/12 = 82.92                  $995/336 = 2.96
SysKonnect - SK-9841        $729/17 = 42.88                  $729/179 = 4.07
NetGear - GA620             $299.99/19 = 15.79               $299.99/585 = 0.51

Table 4: Network Card Cost versus Performance (MPI/Pro)

7.3 Summary of Conclusions
Our work has shown that release 1.2.3 of MPI/Pro adds approximately 25% and 50% to the 1-byte latency of WinSock in shared- and distributed-memory modes respectively. We have shown that the GigaNet Gigabit Ethernet card provides the highest bandwidth of those tested. We suspect, as currently we do not have a concrete price for this card, that the price/performance of this card will be poorer than that of NetGear but better than that of Packet Engine and SysKonnect. Our price/performance figures do, however, strongly suggest that the current performance and costs of the Gigabit cards make standard 100 Mbps Ethernet a much sounder technology investment at the moment. Obviously, other factors, like required peak bandwidth, may mean that the choice of technology is not based purely on price/performance. Another factor that puts Gigabit Ethernet at a disadvantage compared to other network technologies, such as Myrinet [23] and SCI [24], is its relatively high start-up latency, which is approximately an order of magnitude higher. These high latencies are being addressed with the new VIA interfaces and drivers being developed for Ethernet.
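The ratios in Table 4 are plain quotients of card price and measured performance, so they can be rechecked mechanically. The sketch below recomputes them from the prices, peak bandwidths and minimum latencies listed in the table; the class and method names are our own and are used for illustration only.

```java
import java.util.Locale;

// Recomputes the Table 4 ratios; names are illustrative, data are from the table.
final class PricePerformance {
    // Dollars per Mbytes/s of peak bandwidth.
    static double perBandwidth(double priceUsd, double peakMBytesPerSec) {
        return priceUsd / peakMBytesPerSec;
    }

    // Dollars per microsecond of minimum latency, as the table defines it.
    static double perLatency(double priceUsd, double minLatencyUs) {
        return priceUsd / minLatencyUs;
    }

    public static void main(String[] args) {
        String[] cards = {
            "NetGear FA310TX 100 Mbps", "GigaNet cLAN GNN1000",
            "Packet Engine GNIC II", "SysKonnect SK-9841", "NetGear GA620"
        };
        double[] priceUsd = {24.95, 795, 995, 729, 299.99};
        double[] bandwidth = {8.8, 37, 12, 17, 19};      // Mbytes/s
        double[] latencyUs = {208, 208, 336, 179, 585};  // microseconds
        for (int i = 0; i < cards.length; i++) {
            System.out.printf(Locale.ROOT, "%-26s %8.3f $/(Mbytes/s) %6.2f $/us%n",
                    cards[i],
                    perBandwidth(priceUsd[i], bandwidth[i]),
                    perLatency(priceUsd[i], latencyUs[i]));
        }
    }
}
```

Note that the second ratio divides price by latency, so a card with a higher latency obtains a smaller value; the sketch simply reproduces the metric as the table states it.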
7.4 Future Work
This work is part of an ongoing effort to investigate the performance of a range of cluster-based technologies. The next phase of our work will involve comparing the performance of different network technologies under NT and Linux.
References
[1] A. Geist, Cluster Computing: The Wave of the future, Springer Verlag Lecture Notes in Computer Science, May 1994.
[2] The PVM project - http://www.epm.ornl.gov/pvm/
[3] MPI Forum - http://www.mpi-forum.org/docs/docs.html
[4] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, University of Tennessee, Knoxville, Report No. CS-94-230, May 5, 1994.
[5] MPICH - http://www.mcs.anl.gov/mpi/mpich/
[6] W. Gropp, et al., A high-performance, portable implementation of the MPI message passing interface standard - http://www-c.mcs.anl.gov/mpi/mpicharticle/paper.html
[7] W. Gropp and B. Smith, Chameleon parallel programming tools users manual. Technical Report ANL-93/23, Argonne National Laboratory, March 1993.
[8] PVM: A Users' Guide and Tutorial For Networked Parallel Computing - http://www.netlib.org/pvm3/book/pvm-book.html
[9] Gigabit Ethernet Alliance - Gigabit Ethernet: Accelerating the standard for speed, http://www.gigabit-ethernet.org/technology/whitepapers, September 1999.
[10] Ethernet Segment Limits - http://www.gigabit-ethernet.org/technology/
[11] TOPIC - http://www.dcs.port.ac.uk/~mab/TOPIC/
[12] MPI Software Technology, Inc. - http://www.mpi-softtech.com/
[13] WinMPICh - http://www.erc.msstate.edu/mpi/mpiNT.html
[14] VIA - http://www.viaarch.com
[15] PaTENT - http://www.genias.de/products/patent/
[16] WINdows based PARallel computing - http://www.genias.de/
[17] WMPI - http://dsg.dei.uc.pt/w32mpi/
[18] R. Butler and E. Lusk, User's Guide to the p4 Parallel Programming System, ANL-92/17, Argonne National Laboratory, October 1992.
[19] NetGear - http://netgear.baynetworks.com/
[20] GigaNet - http://www.giga-net.com/
[21] Packet Engine - http://www.packetengines.com/index4.html
[22] SysKonnect - http://www.syskonnect.de/
[23] N. Boden, et al., Myrinet - A Gbps LAN. IEEE Micro, Vol. 15, No. 1, February 1995. http://www.myri.com/
[24] Dolphin Interconnect Solutions - http://www.dolphinics.no/
MPI Collective Operations over IP Multicast*
Hsiang Ann Chen, Yvette O. Carrasco, and Amy W. Apon
Computer Science and Computer Engineering, University of Arkansas, Fayetteville, Arkansas, U.S.A.
{hachen,yochoa,aapon}@comp.uark.edu
Abstract. Many common implementations of Message Passing Interface (MPI) implement collective operations over point-to-point operations. This work examines IP multicast as a framework for collective operations. IP multicast is not reliable. If a receiver is not ready when a message is sent via IP multicast, the message is lost. Two techniques for ensuring that a message is not lost due to a slow receiving process are examined. The techniques are implemented and compared experimentally over both a shared and a switched Fast Ethernet. The average performance of collective operations is improved as a function of the number of participating processes and message size for both networks.
1 Introduction
Message passing in a cluster of computers has become one of the most popular paradigms for parallel computing. Message Passing Interface (MPI) has emerged to be the de facto standard for message passing. In many common implementations of MPI for clusters, MPI collective operations are implemented over MPI point-to-point operations. Opportunities for optimization remain. Multicast is a mode of communication where one sender can send to multiple receivers by sending only one copy of the message. With multicast, the message is not duplicated unless it has to travel to different parts of the network through switches. Many networks support broadcast or multicast. For example, shared Ethernet, token bus, token ring, FDDI, and reflective memory all support broadcast at the data link layer. The Internet Protocol (IP) supports multicast over networks that have IP multicast routing capability at the network layer. The goal of this paper is to investigate the design issues and performance of implementing MPI collective operations using multicast. IP multicast is used to optimize the performance of MPI collective operations, namely the MPI broadcast and MPI barrier synchronization, for this preliminary work. The results are promising and give insight to work that is planned on a l...
[Plot: x-axis: size of message (in bytes)]
Fig. 11. Performance Comparison with MPI_Bcast over hub and switch for 4 processes
multicast is where the extra latency of sending scout messages becomes less than the latency of sending extra packets of data when the data is large. For some numbers of nodes, collisions also caused a larger variance in performance with the multicast implementations. For example, this is observed for 6 nodes, as shown in Fig. 9. With 6 nodes using the binary algorithm, both node 2 and node 1 attempt to send to node 0 at the same time, which causes extra delay. Figure 11 compares the average performance of the switch and the hub for 4 processes. When using IP multicast, the average performance of the hub is better than that of the switch for all measured message sizes. For the original MPICH implementation, the average performance of the hub becomes worse than that of the switch when the message is larger than 3000 bytes. The MPICH implementation puts more messages into the network. As the load on the network grows, the extra latency of the switch becomes less significant than the improvement gained from more bandwidth. The multicast implementation is better than MPICH for message sizes greater than one Ethernet frame.
[Plot: x-axis: size of message (in bytes); legend: mpich (9 proc), mpich (6 proc), mpich (3 proc), linear (9 proc), linear (6 proc), linear (3 proc)]
Fig. 12. Performance Comparison with MPI_Bcast over 3, 6, and 9 processes over Fast Ethernet switch
[Plot: x-axis: number of processes; legend: multicast, MPICH]
Fig. 13. Comparison of MPI_Barrier over Fast Ethernet hub
Figure 12 compares MPICH and the linear multicast implementation for 3, 6, and 9 processes over the switch. The results show that the linear multicast algorithm scales well up to 9 processes, and better than MPICH. With the linear implementation, the extra cost for additional processes is nearly constant with respect to message size. This is not true for MPICH. Figure 13 shows the results of the MPI_Barrier operation over the hub. The results for MPI_Barrier show that IP multicast performs better on average than the original MPICH implementation. The performance improvement increases as the size of the message gets bigger.
In a Single Program Multiple Data (SPMD) environment, message passing using either the linear algorithm or the binary algorithm is correct even when there are multiple multicast groups. However, since the IP multicast implementation requires the receive call to be posted before the message is sent, each process must execute the multicast calls in the same order. This restriction is equivalent to requiring that the MPI code be safe [5]. If several processes broadcast to the same multicast group (in MPI terms, the same process group with the same context), the order of the broadcasts will be correctly preserved. For example, suppose that in an environment including the 4 processes with ids 4, 6, 7 and 8, processes 6, 7, and 8 all belong to the same multicast group and the broadcasts are called in the following order:
    MPI_Bcast(&buffer, count, MPI_INT, 6, MPI_COMM_WORLD);
    MPI_Bcast(&buffer, count, MPI_INT, 7, MPI_COMM_WORLD);
    MPI_Bcast(&buffer, count, MPI_INT, 8, MPI_COMM_WORLD);
Using either the binary algorithm or the linear algorithm, process 7 cannot proceed to send the second broadcast until it has received the broadcast message from process 6, and process 8 cannot send the third broadcast until it has received the broadcast message from process 7. The order of the three broadcasts is therefore carried out correctly. By a similar argument, when a process receives from two or more multicast groups, the order of the broadcasts will be correct as long as the MPI code is safe.
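The ordering argument above can be replayed with a small toy model. The sketch below is not the implementation described in the paper: it ignores the network, the scout messages and the tree structure, and only models the rule that a process must complete broadcast k (by sending it if it is the root, or by receiving it otherwise) before it may start broadcast k+1. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of broadcast ordering; every process runs the same call sequence.
final class BroadcastOrderSim {
    // Returns the roots in the order in which their broadcasts were actually sent.
    static List<Integer> run(int[] processIds, int[] roots) {
        int n = processIds.length;
        int[] pc = new int[n];                       // next call index per process
        boolean[] sent = new boolean[roots.length];  // has broadcast k been sent yet?
        List<Integer> sendOrder = new ArrayList<>();
        boolean progress = true;
        while (progress) {
            progress = false;
            // Visit processes in reverse order, so later roots get the first chance to run.
            for (int i = n - 1; i >= 0; i--) {
                while (pc[i] < roots.length) {
                    int k = pc[i];
                    if (processIds[i] == roots[k]) {
                        sent[k] = true;              // this process is the root: send
                        sendOrder.add(roots[k]);
                    } else if (!sent[k]) {
                        break;                       // blocked in the receive for call k
                    }
                    pc[i]++;
                    progress = true;
                }
            }
        }
        return sendOrder;
    }

    public static void main(String[] args) {
        System.out.println(run(new int[] {6, 7, 8}, new int[] {6, 7, 8}));
    }
}
```

Even though processes 8 and 7 are given the chance to run before process 6, each of them blocks in the receive of the first broadcast, so the sends still happen in the order 6, 7, 8.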
5 Conclusions and Future Work
Multicast reduces the number of messages required and thereby improves the performance of MPI collective operations. Its receiver-directed message passing mode allows the sender to address all the receivers as a group. This experiment focused on a particular implementation using IP multicast. Future work is planned in several areas. Improvements are possible to the binary tree and linear communication patterns. While we have not observed buffer overflow due to a set of fast senders overrunning a single receiver, it is possible that this may occur in many-to-many communications, and it needs to be examined further. Additional experimentation using parallel applications is planned. Also, low latency protocols such as the Virtual Interface Architecture [9] standard typically require a receive descriptor to be posted before a message arrives. This is similar to the requirement in IP multicast that the receiver be ready. Future work is planned to examine how multicast may be applied to MPI collective operations in combination with low latency protocols.
References
[1] D. E. Comer. Internetworking with TCP/IP Vol. I: Principles, Protocols, and Architecture. Prentice Hall, 1995.
[2] T. H. Dunigan and K. A. Hall. PVM and IP Multicast. Technical Report ORNL/TM-13030, Oak Ridge National Laboratory, 1996.
[3] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard. Technical Report Preprint MCS-P567-0296, Argonne National Laboratory, March 1996.
[4] N. Nupairoj and L. M. Ni. Performance Evaluation of Some MPI Implementations on Workstation Clusters. In Proceedings of the 1994 Scalable Parallel Libraries Conference, pages 98-105. IEEE Computer Society Press, October 1994.
[5] P. Pacheco. Parallel Programming with MPI. Morgan Kaufmann, 1997.
[6] The LAM source code. http://www.mpi.nd.edu/lam.
[7] The MPICH source code. www-unix.mcs.anl.gov/mpi/index.html.
[8] A. S. Tanenbaum, M. F. Kaashoek, and H. E. Bal. Parallel Programming Using Shared Objects and Broadcasting. Computer, 25(8), 1992.
[9] The Virtual Interface Architecture Standard. http://www.viarch.org.
[10] D. Towsley, J. Kurose, and S. Pingali. A Comparison of Sender-Initiated and Receiver-Initiated Reliable Multicast Protocols. IEEE JSAC, 15(3), April 1997.
An Open Market-Based Architecture for Distributed Computing
Spyros Lalis and Alexandros Karipidis
Computer Science Dept., University of Crete, Hellas
{lalis,karipid}@csd.uoc.gr
Institute of Computer Science, Foundation for Research and Technology, Hellas
{lalis,karipid}@ics.forth.gr
Abstract. One of the challenges in large scale distributed computing is to utilize the thousands of idle personal computers. In this paper, we present a system that enables users to effortlessly and safely export their machines in a global market of processing capacity. Efficient resource allocation is performed based on statistical machine profiles and leases are used to promote dynamic task placement. The basic programming primitives of the system can be extended to develop class hierarchies which support different distributed computing paradigms. Due to the object-oriented structuring of code, developing a distributed computation can be as simple as implementing a few methods.
1 Introduction
The growth of the Internet has provided us with the largest network of interconnected computers in history. As off-the-shelf hardware becomes faster and gains Internet access, the network's processing capacity will continue increasing. Many of these systems are often under-utilized, a fact accentuated by the globe's geography, since "busy" hours in one time-zone tend to be "idle" hours in another. Distributing computations over the Internet is thus very appealing. However, several issues must be resolved for this to be feasible. The obstacle of platform heterogeneity must be overcome and security problems arising from the execution of code from untrusted parties must be confronted. Further inconveniences arise when installing and maintaining the corresponding programming environments. And then, distributed computations must be designed and implemented on top of them, a challenging task even for experienced programmers.
In this paper we present a system that addresses these problems, simplifying distributed computing over the Internet considerably. Through a maintenance-free, web-based user interface any machine can be safely connected to the system to act as a host for remote computations. A framework that promotes code reuse and incremental development through object-oriented extensions is offered to the application programmer. Writing computations for the system can be as trivial as implementing a few routines. We feel that the ease of deploying the system and developing applications for it is of importance to the scientific community, since most of the programming is done by scientists themselves with little or no support from computer experts.
The rest of the paper is organized as follows. Section 2 summarizes the general properties of the system. Details about the resource allocation mechanism are given in Sect. 3. In Sect. 4 we look into the system architecture, giving a description of components and communication mechanisms. In Sect. 5 we show how our system can be used to develop distributed computations in a straightforward way. A comparison with related work is given in Sect. 6. Section 7 discusses the advantages of our approach. Finally, future directions of this work are mentioned in the last section.
J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 61-70, 2000. © Springer-Verlag Berlin Heidelberg 2000
2 System Properties
When designing the system, the most important goal was to achieve a level of simplicity that would make it popular both with programmers and with owners of lightweight host machines, most notably PCs. Ease of host registration was thus considered a key issue. Safety barriers to shield hosts from malicious behavior of foreign code were also required. Portability and inter-operability were needed to maximize the number of host platforms that can be utilized. A simple yet powerful programming environment was called for to facilitate the distribution of computations over the Internet. All these features had to be accompanied by a dynamic and efficient mechanism for allocating resources to applications without requiring significant effort from the programmer.
In order to guarantee maximal cross-platform operability the system was implemented in Java. Due to Java's large scale deployment, the system can span many architectures and operating systems. Host participation is encouraged via a web-based interface, which installs a Java applet on the host machine. This accommodates the need for a user-friendly interface, as users are accustomed to using web browsers. Furthermore, the security manager installed in Java-enabled browsers is a widely trusted firewall, protecting hosts from downloaded programs. Finally, due to the applet mechanism, neither administration nor maintenance is required at the host, since the majority of users already have a recent version of a web browser installed on their machines.
On the client side we provide an open, extensible architecture for developing distributed applications. Basic primitives are provided which can in turn be used to implement diverse, specialized processing models. Through such models it is possible to hide the internals of the system and/or provide advanced programming support in order to simplify application development.
3 Resource Allocation
Host allocation is based on profiles, which are created by periodically benchmarking each host. A credit-based [1] mechanism is used for charging. Credit can be translated into anything that makes sense in the context where the system is deployed. Within a non-profit institution, it may represent time units to facilitate quotas. Service-oriented organizations could charge clients for using hosts by converting credit to actual currency. Both hosts (sellers) and clients (buyers) submit orders to a market, specifying their actual and desired machine profile respectively. The parameters of an order are listed in Table 1. The performance vectors include the host's mean score and variance for a set of benchmarks over key performance characteristics such as integer and floating point arithmetic, network connection speed to the market server, etc. The host abort ratio is the ratio of computations killed by the host versus computations initiated on that host (a "kill" happens when a host abruptly leaves the market). The host performance vectors and abort ratio are automatically produced by the system. Host profiles can easily be extended to include additional information that could be of importance for host selection.
Table 1. Parameters specified in orders

Parameter: price/sec
  Sell orders: The minimum amount of credit required per second of use of the host.
  Buy orders: The maximum amount of credit offered per second of use of the host.
Parameter: lease duration
  Sell orders: The maximum amount of usage time without renegotiation.
  Buy orders: The minimum amount of usage time without renegotiation.
Parameter: granted/demanded compensation
  Sell and buy orders: Credit granted/demanded for not honoring the lease duration.
Parameter: performance statistics vectors
  Sell orders: The host's average score and variance for each of the benchmarks (measured).
  Buy orders: The average performance score and variance a buyer is willing to accept.
Parameter: abort ratio
  Sell orders: The host's measured abort ratio.
  Buy orders: The abort ratio a buyer is willing to accept.
An economy-based mechanism is employed to match the orders that are put in the market. For each match, the market produces a lease, which is a contract between a host and a client containing their respective orders and the price of use agreed upon. Leases are produced periodically using a continuous double auction [8]. A lease entitles the client to utilize the host for a specific amount of time. If the client's task completes within the lease duration, then the buyer transfers an amount of credit to the seller as a reward, calculated by multiplying the actual duration by the lease's price per second. If the lease duration is not honored, an amount of credit is transferred from the dishonoring party to the other.
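The settlement rules above can be written down as a few lines of arithmetic. The following sketch is illustrative only: the class, method and enum names are hypothetical, the compatibility test is our reading of the order parameters (the bid covers the ask, the seller's maximum lease covers the buyer's minimum, and the host's abort ratio is acceptable to the buyer), and the way the auction picks the agreed price is not modelled.

```java
// Hypothetical sketch of order compatibility and lease settlement.
final class LeaseMarket {
    enum Outcome { COMPLETED, HOST_DISHONORED, CLIENT_DISHONORED }

    // Assumed reading of the order parameters; not the system's actual matching rule.
    static boolean compatible(double askPerSec, double bidPerSec,
                              long sellerMaxLeaseSec, long buyerMinLeaseSec,
                              double hostAbortRatio, double acceptableAbortRatio) {
        return bidPerSec >= askPerSec
                && sellerMaxLeaseSec >= buyerMinLeaseSec
                && hostAbortRatio <= acceptableAbortRatio;
    }

    // Reward for a completed task: actual duration times the lease's price per second.
    static double reward(double actualSeconds, double pricePerSec) {
        return actualSeconds * pricePerSec;
    }

    // Credit change seen by the host (positive: the client pays the host).
    // Assumption: when a lease is dishonored, only the compensation changes hands.
    static double hostCreditDelta(Outcome outcome, double actualSeconds,
                                  double pricePerSec, double compensation) {
        switch (outcome) {
            case COMPLETED:
                return reward(actualSeconds, pricePerSec);
            case HOST_DISHONORED:
                return -compensation;
            case CLIENT_DISHONORED:
                return compensation;
            default:
                throw new IllegalArgumentException("unknown outcome");
        }
    }

    public static void main(String[] args) {
        System.out.println(compatible(0.2, 0.5, 3600, 600, 0.05, 0.10));
        System.out.println(hostCreditDelta(Outcome.COMPLETED, 120, 0.5, 10.0));
    }
}
```

Whether a dishonoring client also pays for the time already used is not stated in the text, so the sketch leaves that charge out.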
4 System Architecture
4.1 Overview of System Components
An overview of the system's architecture is depicted in Fig. 1. The basic components of our system are the market server, hosts, the host agent, schedulers, tasks and client applications.
[Diagram: Client application, Scheduler, Market Server, Host Agents and Tasks, connected by the UploadProtocol, ControlProtocol, ComputationProtocol, MarketSchedulerProtocol, MarketHostAgentProtocol and HostAgentTaskProtocol]
Fig. 1. Overview of architecture
The Client Application is a program which needs to perform computations that require considerable processing power. Through the system, it may either distribute a computation across a number of machines or just delegate the execution of an entire computation to a faster machine to speed up execution.
The Market Server is the meeting place for buyers and sellers of processing power. It collects orders from clients and hosts. Using the host profiles, it then matches buy with sell orders and thus allocates resources.
A Host is a machine made available to be used by clients. A host participates in the market through the Host Agent, a Java applet. The user visits a URL with a Java-enabled web browser and the agent is downloaded to his system. The agent communicates with the market server, takes care of placing orders on behalf of the user and executes tasks assigned to the host. It also provides the market server with the benchmark scores needed for the host's profile.
A computation in our system consists of a Scheduler and one or more Tasks. The application installs the scheduler on the market server. The scheduler then places orders in the market for acquiring machines to complete the computation. New orders can be issued at any time in order to adapt to fluid market conditions. When a lease is accepted by the scheduler, a task is launched on the host machine to assist in completing the computation.
4.2 Basic System Services and Communication
There are six protocols used for communication by the system.
The UploadProtocol is a fixed, published Remote Method Invocation (RMI) interface used by the client application to upload a computation to the market server and to instantiate its scheduler. A client application may instantiate multiple schedulers to simultaneously launch the same code with multiple data.
The ControlProtocol is a published RMI interface for the client application to control a scheduler. Through this interface the application performs tasks such as starting a computation with new parameters, altering the computation's budget for acquiring hosts, instructing the scheduler to kill all tasks and exit, etc. The basic functions are implemented in the system classes. The programmer can introduce computation-specific control functions by extending this interface.
The ComputationProtocol is used within the bounds of a single computation for communication among tasks and their scheduler. It is application dependent and thus unknown to the system. We do, however, provide message passing support (not further discussed in this paper) that can be used by application developers to implement flexible, safe and efficient data exchange.
The MarketSchedulerProtocol is used for local communication between the market server and schedulers. The market server implements a standard published interface for servicing requests from schedulers, such as placing orders and retrieving host and market status information. In turn, schedulers provide methods for being notified by the market of events such as the opportunity to acquire a new lease, a change in the client's account balance, the completion of a task's work and the failure of a host that was leased to them.
Similarly, the HostAgentTaskProtocol provides local communication between a host agent and the task it is hosting. The agent implements a published interface for servicing requests from tasks, such as retrieving information about a host's performance.
The MarketHostAgentProtocol is a proprietary protocol used by the market server and the host agent. It allows orders to be placed in the market by the host. It is also used to retrieve tasks from the market, to ask for "payment" when tasks complete and to post benchmarking data to the market server.
5 Supporting Distributed Computing Paradigms
Through the set of primitives offered by the system, it is possible to develop a wide range of applications. More importantly, generic support can be provided for entire classes of distributed computations. Applications can then be developed by extending these classes to introduce specific functionality. This incremental development can greatly simplify programming. As an example, in the following we describe this process for embarrassingly parallel computations requiring no communication between tasks. Other distributed computation paradigms can be supported in similar fashion.
S. Lalis and A. Karipidis
5.1
The Generic Master - Slave Model
In this model work is distributed among many processors by a distinguished processor referred to as the "master". The other processors, referred to as "slaves", complete the work assigned to them and return the results to the master. In order to process its workload a slave does not need to communicate with any other slave. This model is used in image processing, genetic algorithms, brute force search and game tree evaluation. One possible implementation of this model is sketched below. For brevity, only the methods a programmer has to be aware of are shown.
public interface MS_Control extends Control {
    void start(Object pars);   // inherited from Control
    void stop();               // inherited from Control
    Object[] getResults(boolean all, boolean keep);
}

public abstract class MS_Scheduler extends Scheduler implements MS_Control {
    public abstract Object[] doPartitions(Object pars);
    public void receiveResult(Object result);   // body omitted; may be overridden
}

public abstract class MS_Task extends Task {
    public abstract Object processPartition(Object partition);
}
The MS_Control.start method starts a new computation. MS_Control.start triggers MS_Scheduler.doPartitions to produce the various partitions of the computation. These are forwarded to instances of MS_Task residing on hosts allocated to the computation, and MS_Task.processPartition is invoked to process them. The results are returned to the scheduler, where post-processing is performed via calls to the MS_Scheduler.receiveResult method. It is important to notice that programmers need to implement just three methods in order to complete a computation following this model. All other implementation issues, including the resource allocation strategy of the scheduler, remain hidden. The MS_Control interface, which defines the primitives for controlling and retrieving the results of the computation, is implemented by the base MS_Scheduler class and thus does not concern the programmer. This master/slave model could be further extended to introduce additional functionality such as check-pointing and restarting of tasks for fault tolerance. Programmers would exploit this functionality without effort.
5.2
A Sample Client Application
Based on this model, we show how a specific application, e.g. for computing the Mandelbrot set, can be implemented. We assume that the area to be calculated is partitioned in bands, processed in parallel to speed up execution. The user selects an area and the computation is started to zoom into the selected area.
An Open Market-Based Architecture for Distributed Computing
The parameters, partitions and results of the fractal application must be extensions of the Object class. The classes must implement the Serializable interface in order to be successfully transported across machine boundaries.

class FractalParameters extends Object implements Serializable {
    // ... fractal computation parameters
}

class FractalPartition extends Object implements Serializable {
    // ... parameters for calculating a slice
}

class FractalResult extends Object implements Serializable {
    // ... results of a slice calculation
}
Assuming the parameter and result objects have been appropriately defined, a FractalScheduler class must be programmed as a subclass of MS_Scheduler to produce partitions via the doPartitions method. The MS_Scheduler.receiveResult method is not overridden because individual results are not merged by the scheduler. Also, the basic MS_Control interface needs no extension since it already offers the necessary routines for controlling and monitoring the computation. Analogously, a FractalTask class must be provided that implements the MS_Task.processPartition method to perform the calculation of slices.

class FractalScheduler extends MS_Scheduler {
    @Override
    public Object[] doPartitions(Object comp_pars) {
        FractalPartition partitions[];
        FractalParameters pars = (FractalParameters) comp_pars;
        // ... split calculation and produce partitions
        return (partitions);
    }
}

class FractalTask extends MS_Task {
    @Override
    public Object processPartition(Object partition) {
        FractalResult result;
        FractalPartition pars = (FractalPartition) partition;
        // ... perform the computation
        return (result);
    }
}
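To give a feel for what the elided partitioning and computation steps might contain, here is a hedged, standalone sketch of band splitting and per-band evaluation. The class names `BandPartition` and `MandelbrotBands`, all field names, the split by image rows, and the use of the standard escape-time iteration are assumptions for illustration; the paper does not specify how its partitions are laid out.

```java
import java.io.Serializable;

// Hypothetical partition layout; the paper elides the real fields.
class BandPartition implements Serializable {
    final double xMin, xMax, yMin, yMax;  // region of the complex plane covered by the image
    final int width, height;              // image size in pixels
    final int rowStart, rowEnd;           // this band covers rows [rowStart, rowEnd)

    BandPartition(double xMin, double xMax, double yMin, double yMax,
                  int width, int height, int rowStart, int rowEnd) {
        this.xMin = xMin; this.xMax = xMax; this.yMin = yMin; this.yMax = yMax;
        this.width = width; this.height = height;
        this.rowStart = rowStart; this.rowEnd = rowEnd;
    }
}

class MandelbrotBands {
    // Splits the image into `bands` horizontal bands of near-equal height.
    static BandPartition[] split(double xMin, double xMax, double yMin, double yMax,
                                 int width, int height, int bands) {
        BandPartition[] parts = new BandPartition[bands];
        for (int b = 0; b < bands; b++) {
            int start = (int) ((long) b * height / bands);
            int end = (int) ((long) (b + 1) * height / bands);
            parts[b] = new BandPartition(xMin, xMax, yMin, yMax, width, height, start, end);
        }
        return parts;
    }

    // Escape-time iteration count for the point cx + i*cy, capped at maxIter.
    static int escapeTime(double cx, double cy, int maxIter) {
        double x = 0, y = 0;
        int i = 0;
        while (i < maxIter && x * x + y * y <= 4.0) {
            double t = x * x - y * y + cx;
            y = 2 * x * y + cy;
            x = t;
            i++;
        }
        return i;
    }

    // Computes the iteration counts for every pixel of one band.
    static int[][] compute(BandPartition p, int maxIter) {
        int[][] out = new int[p.rowEnd - p.rowStart][p.width];
        for (int r = p.rowStart; r < p.rowEnd; r++) {
            double cy = p.yMin + (p.yMax - p.yMin) * r / p.height;
            for (int c = 0; c < p.width; c++) {
                double cx = p.xMin + (p.xMax - p.xMin) * c / p.width;
                out[r - p.rowStart][c] = escapeTime(cx, cy, maxIter);
            }
        }
        return out;
    }
}
```

Because each band carries the full image geometry, a task can compute its rows without any knowledge of the other bands, which matches the no-communication assumption of the master/slave model.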
Finally, to run the application, the computation's classes must be uploaded to the market server using the UploadProtocol and a scheduler instance must be created. The MS_Control interface is used to control the scheduler and periodically retrieve the computation's results.
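The full cycle of partitioning, processing and collecting results can be tried out locally before anything is uploaded. The sketch below is a stand-in under stated assumptions: `LocalScheduler` and `LocalTask` are hypothetical classes that only mimic the shape of MS_Scheduler and MS_Task, a thread pool plays the role of the leased hosts, and the range-sum example is invented for illustration. It does not use the UploadProtocol or any part of the real system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Local stand-in for a task: processes one partition.
abstract class LocalTask {
    abstract Object processPartition(Object partition);
}

// Local stand-in for a scheduler: worker threads play the role of hosts.
abstract class LocalScheduler {
    private final List<Object> results = new ArrayList<>();

    abstract Object[] doPartitions(Object pars);

    // Default post-processing: simply store the result.
    void receiveResult(Object result) {
        results.add(result);
    }

    // Partition the work, process the partitions in parallel, collect the results.
    List<Object> run(Object pars, LocalTask task, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Object>> pending = new ArrayList<>();
            for (Object partition : doPartitions(pars)) {
                Callable<Object> job = () -> task.processPartition(partition);
                pending.add(pool.submit(job));
            }
            for (Future<Object> f : pending) {
                receiveResult(f.get());   // results arrive back at the "master"
            }
            return new ArrayList<>(results);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("computation failed", e);
        } finally {
            pool.shutdown();
        }
    }
}

// Example computation: sum the integers 1..n by splitting the range into chunks of 25.
class RangeSumScheduler extends LocalScheduler {
    @Override
    Object[] doPartitions(Object pars) {
        int n = (Integer) pars;
        List<int[]> parts = new ArrayList<>();
        for (int lo = 1; lo <= n; lo += 25) {
            parts.add(new int[] { lo, Math.min(n, lo + 24) });
        }
        return parts.toArray();
    }
}

class RangeSumTask extends LocalTask {
    @Override
    Object processPartition(Object partition) {
        int[] range = (int[]) partition;
        long sum = 0;
        for (int i = range[0]; i <= range[1]; i++) {
            sum += i;
        }
        return sum;
    }
}
```

Calling `new RangeSumScheduler().run(100, new RangeSumTask(), 4)` yields four partial sums, one per partition, which the caller can then add up.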
6
Related Work
Popular distributed programming environments such as PVM [9] and MPI [9] lack advanced resource allocation support. PVM allows applications to be notified when machines join/leave the system, but the programmer must provide code that investigates hosts' properties and decides on proper allocation. MPI, using a static node setup, prohibits dynamic host allocation: the programmer must make such decisions a priori. Both systems require explicit installation of their runtime system on participating hosts. A user must therefore have access to all participating machines, as she must be able to log in to them in order to spawn tasks. This is impractical and may result in only a small number of hosts being utilized, even within a single organization. Finally, the choice of C as the main programming language, compared to Java, is an advantage where speed is concerned. But to be able to exploit different architectures, the user must provide and compile code for each one of them, adding to the complexity and increasing development time due to porting considerations. The maturation of Java technology ("just in time" compilation, Java processors, etc.) could soon bridge the performance gap with C. Notably, a Java PVM implementation is underway [6], which will positively impact the portability of the PVM platform. Condor is a system that has been around for several years. It provides a comparative "matchmaking" process for resource allocation through its "classified advertisement" matchmaking framework [11]. A credit-based mechanism could be implemented using this framework, but is currently unavailable. Condor too requires extensive administration and lacks support for easy development. Newer systems such as Legion [10] and Globus [7] address the issues of resource allocation and security. They provide mechanisms for locating hosts and signing code. However, both require administration such as compiling and installing the system as well as access to the host computer. 
They do not support the widely popular Windows platform (though Legion supports NT) and do little to facilitate application development for non-experts. Globus merely offers an MPI implementation whereas Legion provides the "Mentat" language extensions. Legion's solution is more complete but also complicated for inexperienced programmers. It requires using a preprocessor and an "XDR" style serialization process, and it introduces error-prone situations since virtual method calls will not work as expected in all cases. Stateful and stateless objects are also handled differently. Finally, adding hosts to a running computation is done from the command line and additional hosts are assigned to the computation at random; no matching of criteria is performed.
Several other systems using Java as the "native" programming language have been designed for supporting globally distributed computations, such as Charlotte [3], Javelin [4] and Challenger [5]. These systems automatically distribute computations over machines. However, they do not employ market-based principles to allocate hosts and do not maintain information about hosts' performance.
The market paradigm has received considerable attention in distributed systems aiming for flexible and efficient resource allocation. A system operating on the same principles as ours is Popcorn [12]. Popcorn also uses auction mechanisms to allocate hosts to client computations and exploits Java applet technology to achieve portability, inter-operability and safety. However, it neither provides "host profiling" nor promotes incremental development.
7
Discussion
Besides the fact that the allocation strategies used in most systems do not take into account "behavioral patterns" of hosts, there is also virtually no support for leasing. We argue that both are invaluable for efficient resource allocation in open computational environments.
Providing information about the statistical behavior of participating hosts can assist schedulers in making task placement decisions, avoiding hosts that will degrade performance (and waste credit). For example, assume a scheduler has two tasks to allocate. Blind allocation on two hosts is not a good idea; unless the two machines exhibit comparable performance, the faster machine will be wasted since the computation will be delayed by the slower one. Similarly, using the abort ratio, schedulers can avoid placing critical parts of a computation on unstable hosts. Those parts can be assigned to perhaps more "expensive" but stable hosts. Computations implementing check-pointing and crash-recovery could utilize less reliable hosts.
The lack of leasing is also a drawback in open environments: a client could obtain many processors when there is no contention and continue to hold them when demand rises. This is unacceptable in a real-world scenario where credit reflects priorities or money, since it would imply that prioritized or wealthy computations can be blocked by "lesser" ones. To guarantee quality of service, some form of leasing or preemption must be adopted. Leases are also practical in non-competitive environments. The lease duration allows users to indicate the time during which hosts are under-utilized. Based on this knowledge, tasks can be placed on hosts that will be idle for enough time, and checkpoints can be accurately scheduled, right before a host is about to become unavailable.
Finally, it is generally acknowledged that incremental development increases productivity by separation of concerns and modular design. Distributed computing can benefit from such an approach. 
Modern object-oriented programming environments are a step towards this direction, but significant programming experience and discipline are still required. We feel that with our system's design, it is possible even for inexperienced programmers to write computations rapidly.
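The placement argument made in this section (avoid unstable hosts, and prefer hosts of comparable speed so that no fast machine waits for a slow one) can be turned into a small selection routine. This is a sketch under assumptions: the `HostProfile` fields, the `maxAbortRatio` threshold and the "smallest performance spread" criterion are illustrative choices, not the allocation strategy of the described system.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical host record; field names are illustrative, not the system's.
class HostProfile {
    final String name;
    final double benchmark;   // higher means faster
    final double abortRatio;  // fraction of leases the host aborted, in [0, 1]

    HostProfile(String name, double benchmark, double abortRatio) {
        this.name = name;
        this.benchmark = benchmark;
        this.abortRatio = abortRatio;
    }
}

class HostSelector {
    // Returns `count` sufficiently stable hosts whose benchmark scores are as close
    // together as possible; ties go to the faster group. Returns an empty list if
    // fewer than `count` stable hosts are on offer.
    static List<HostProfile> pick(List<HostProfile> offers, int count, double maxAbortRatio) {
        List<HostProfile> stable = new ArrayList<>();
        for (HostProfile h : offers) {
            if (h.abortRatio <= maxAbortRatio) {
                stable.add(h);
            }
        }
        if (count <= 0 || stable.size() < count) {
            return new ArrayList<>();
        }
        // Fastest first, so that each window of adjacent hosts has similar speed.
        stable.sort(Comparator.comparingDouble((HostProfile h) -> h.benchmark).reversed());
        int best = 0;
        double bestSpread = Double.MAX_VALUE;
        for (int i = 0; i + count <= stable.size(); i++) {
            double spread = stable.get(i).benchmark - stable.get(i + count - 1).benchmark;
            if (spread < bestSpread) {
                bestSpread = spread;
                best = i;
            }
        }
        return new ArrayList<>(stable.subList(best, best + count));
    }
}
```

A computation with its own check-pointing could simply pass a larger `maxAbortRatio` and thereby gain access to cheaper, less stable hosts.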
8
Future Directions
New versions of the Java platform will offer more fine-grained control in the security system. Using the new mechanisms we expect to be able to provide more efficient services, such as access to local storage for task checkpoints and invocation of native calls to exploit local, tuned libraries such as [2] [13]. Logging mechanisms, along with the signing of classes, will further increase the security of the system.
We also wish to experiment with schedulers capable of recording the performance of previous allocations. Accumulated information can perhaps be converted into "experience", leading towards more efficient allocation strategies. Lastly, the issue of scalability needs to be addressed. The current architecture is limited by the market server. A single server could not handle the millions or billions of hosts connecting to a truly world-wide version of this service. It would also be impossible to have all schedulers running on a single machine. We intend to overcome this problem by introducing multiple market servers that will allow traffic to be shared among several geographically distributed servers.
References
[1] Y. Amir, B. Awerbuch, and R. S. Borgstrom. A cost-benefit framework for online management of a metacomputing system. In Proceedings of the First International Conference on Information and Computation Economies, pages 140-147, October 1998.
[2] M. Baker, B. Carpenter, G. Fox, S. H. Ko, and S. Lim. mpiJava: An Object-Oriented Java Interface to MPI. Presented at International Workshop on Java for Parallel and Distributed Computing, IPPS/SPDP 1999, April 1999.
[3] A. Baratloo, M. Karaul, Z. M. Kedem, and P. Wyckoff. Charlotte: Metacomputing on the web. In Ninth International Conference on Parallel and Distributed Computing Systems, September 1996.
[4] P. Cappello, B. Christiansen, M. F. Ionescu, M. O. Neary, K. E. Schauser, and D. Wu. Javelin: Internet-based parallel computing using Java. In Proceedings of the ACM Workshop on Java for Science and Engineering Computation, June 1997.
[5] A. Chavez, A. Moukas, and P. Maes. Challenger: A multiagent system for distributed resource allocation. In Proceedings of the First International Conference on Autonomous Agents '97, 1997.
[6] A. Ferrari. JPVM: The Java Parallel Virtual Machine. Journal of Concurrency: Practice and Experience, 10(11), November 1998.
[7] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. Intl J. Supercomputer Applications, 11(2), 1997.
[8] D. Friedman. The double auction market institution: A survey. In D. Friedman and J. Rust, editors, Proceedings of the Workshop in Double Auction Markets, Theories and Evidence, June 1991.
[9] G. A. Geist, J. A. Kohl, and P. M. Papadopoulos. PVM and MPI: a Comparison of Features. Calculateurs Paralleles, 8(2):137-150, June 1996.
[10] A. S. Grimshaw and W. A. Wulf. The Legion vision of a worldwide computer. CACM, 40(1):39-45, 1997.
[11] R. Raman, M. Livny, and M. Solomon. Matchmaking: Distributed resource management for high throughput computing. In Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, July 1998.
[12] O. Regev and N. Nisan. The POPCORN Market - an Online Market for Computational Resources. In Proceedings of the First International Conference on Information and Computation Economies, pages 148-157, October 1998.
[13] The Java Grande Working Group. Recent Progress of the Java Grande Numerics Working Group. http://math.nist.gov/javanumerics/reports/jgfnwg-02.html.
The MultiCluster Model to the Integrated Use of Multiple Workstation Clusters
Marcos Barreto*, Rafael Avila**, and Philippe Navaux***
Institute of Informatics - UFRGS
Av. Bento Gonçalves, 9500 Bl. IV
PO Box 15064 - 90501-910 Porto Alegre, Brazil
E-mail: {barreto, bohrer, navaux}@inf.ufrgs.br
Abstract. One of the new research tendencies within the well-established cluster computing area is the growing interest in the use of multiple workstation clusters as a single virtual parallel machine, in much the same way as individual workstations are nowadays connected to build a single parallel cluster. In this paper we present an analysis on several aspects concerning the integration of different workstation clusters, such as Myrinet and SCI, and propose our MultiCluster model as an alternative to achieve such integrated architecture.
1
Introduction
Cluster computing is nowadays a common practice for many research groups around the world that seek high performance for a great variety of parallel and distributed applications, like aerospace and molecular simulations, Web servers, data mining, and so forth. To achieve high performance, many efforts have been devoted to the design and implementation of low overhead communication libraries, specially dedicated to the fast communication networks used to interconnect nodes within a cluster, which is the case of Fast Ethernet [14], Myrinet [3] and SCI [12]. The design of such software is a widely explored area, resulting in proposals like BIP [21], GM [9], VIA [24] and Fast Messages [19].
Currently, there are other research areas being explored, such as administrative tools for cluster management and what is being called Grid Computing, with the objective of joining geographically distributed clusters to form a Metacomputer and taking benefit of the resulting overall computational power [4].
The work presented here is not focused on these areas directly, because our goal is to discuss a practical situation in which a Myrinet cluster must be interconnected with a SCI cluster to form a single parallel machine. Such a machine can be used to verify an application's behaviour when it runs on a shared memory cluster or on a message passing cluster, to distribute tasks from an application efficiently according to their communication needs, and to offer a complete environment intended for teaching parallel and distributed programming, allowing the user to express, through the same API, message passing and shared memory interactions.
This paper is organised as follows: Section 2 presents an analysis of the problems that arise from integrating multiple workstation clusters; in Section 3 we present the MultiCluster model and the DECK environment as our contribution towards this objective; Section 4 comments on related research efforts and finally Section 5 presents our conclusions and current research activities.
* M.Sc. student at PPGC/UFRGS (CAPES fellow)
** M.Sc. (PPGC/UFRGS, 1999); RHAE/CNPq researcher at PPGC/UFRGS
*** Ph.D. (INPG, Grenoble - France, 1979); Professor at PPGC/UFRGS
J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 71-80, 2000. © Springer-Verlag Berlin Heidelberg 2000
2
Integrating Multiple Clusters
When computer networks were an emergent platform for parallel and distributed programming, many efforts were expended to solve problems related to joining individual PCs in a single virtual parallel machine. From these efforts, communication libraries such as PVM [8] and MPI [17] arose to allow individual network nodes to be identified within the parallel environment. The integration of multiple workstation clusters presents a similar problem. Individual clusters of workstations are nowadays fairly well managed by communication libraries and parallel execution environments. When we start to think of clusters of clusters, we again have the same problems regarding the connection of elements that run independently from each other, while still having to offer the user an appropriate environment for parallel and distributed programming. What we mean by appropriate is to provide an intuitive programming interface and offer enough resources to meet the programmer's needs. As the purpose of this paper is to identify these problems and propose possible solutions to them, we have divided our study into hardware and software analysis.
2.1
Hardware Aspects
There are no major problems from the hardware point of view in achieving such integration, since the networks considered (Myrinet and SCI) could co-exist within the same node and use different techniques to communicate. Figure 1 presents the simplest cluster interconnection that could be realised. Each individual cluster could have any number of physical nodes connected through a switch (in the Myrinet case) or directly as a ring (in the SCI case). To allow the integration, each cluster must have a "gateway" node configured with two network interfaces (two Myrinet NIs or a Myrinet + SCI NIs), where the additional Myrinet NI is used to link clusters. For the moment we do not consider SCI a suitable technology as a linking medium, since a message-passing paradigm seems more adequate for this purpose.
2.2
Software Aspects
Several points have been discussed by the community in order to identify problems and solutions related to the design and implementation of communication libraries for cluster-based applications, with one main objective: to provide high bandwidth at small latencies. Besides this, the development of cluster middleware tools to furnish high availability and single system image support is an ongoing task [4, 11].
Fig. 1. The simplest way to interconnect two workstation clusters.
In the case of clusters of clusters, performance is not a key point due to the drawbacks implicitly imposed by the loosely coupled integration. There are other problems regarding such integration that must be addressed first, and performance will then be the consequence of the techniques used to solve them. The first point to consider is how to combine message passing with distributed shared memory. A desirable solution would be to offer a single communication abstraction that could be efficiently implemented over message passing and shared memory architectures. In practice, however, it is easier to have an individual mechanism for each one and allow the user to choose between them, depending on the needs of the application. Another point to treat is the routing problem, which arises when a task needs to exchange data with another task running in a remote cluster. It is necessary that the communication layer identifies the location of a communication endpoint and knows how to map physical nodes from separate clusters in order to be capable of routing messages between them. Finally, heterogeneity could be a problem. Although most individual workstation clusters are internally homogeneous, there may be cases where multiple clusters are heterogeneous in relation to each other. In these cases, problems regarding "endianisms" and floating-point data representation have to be addressed. If the previous problems can be efficiently treated, it is also possible to provide the user with the capacity of deciding where to place a specific set of tasks, according to their communication needs. If the application granularity can be modelled considering the underlying platform, it is still possible to achieve good performance.
3
The MultiCluster Model
The MultiCluster model is an approach to join independent clusters and provide a simple programming interface which allows the user to configure and utilize such an integrated platform. With this model we intend to address and provide solutions to the problems mentioned in the previous section, while still keeping a well-structured and
efficient programming environment. To best explain the proposed model, we have divided the discussion into hardware and software aspects.
3.1
Hardware Platform
We are assuming the configuration illustrated in Figure 1, which corresponds to our available hardware platform. We currently have a Myrinet cluster, composed of 4 Dual Pentium Pro 200 MHz nodes, and a SCI cluster, composed of 4 Pentium Celeron 300 MHz nodes. These clusters are linked through a Fast Ethernet network. The choice of the medium used to interconnect the clusters depends mostly on the application needs. It is possible to use a standard Ethernet link instead of Myrinet to realise the communication between clusters. We propose Myrinet as a link medium because it could minimize the loss in performance caused by the integration of different platforms; for our model, however, it is enough that some node in each cluster plays the role of a gateway. It is important to say that questions related to cost and scalability are out of the scope of this paper. In the near future, many companies and universities are likely to own a small number of cluster platforms, and so these questions are particular to each of them. We are assuming the situation where at least two clusters are available and have to be used together.
3.2
Software Structure
We have studied each problem mentioned in Section 2.2, trying to find the best solution to each one and structuring our software layer to carry out such solutions. As a result, the MultiCluster model follows some conceptual definitions which rule the way such integration must be handled.
Figure 2 shows the user-defined descriptor file for a MultiCluster application. In this file, the user must specify a list of machines within the clusters he wants to use, the communication subnet identifiers (used for inter-cluster communication), a set of logical nodes with their corresponding machines, and the gateway nodes.
Physical and Logical Nodes. A physical node corresponds to each available machine plugged into any individual cluster and only matters for physical questions. Logical nodes are the set of available nodes from the application's point of view. In the case of message-passing clusters, each physical node corresponds to one logical node (this is mandatory). In shared-memory clusters, a logical node can be composed of more than one physical node. The distinction between logical nodes for Myrinet and SCI is made by the node id field. For example, "node 1:0" means the second node within the subnet 0 (which is Myrinet in our example), while "node 4:1" means the first node within the subnet 1 (which is SCI). It is important to notice that this numbering scheme, although complex, is entirely processed by the environment in a transparent manner; the user only knows how many logical nodes he has and which physical machines are within each logical node.
// DECK user-defined descriptor file
// virtual machine
verissimo, quintana, euclides, dionelio, scliar, ostermann, meyer, luft
// communication subnets
myrinet: 0
sci: 1
// logical nodes
node 0:0 machines: verissimo
node 1:0 machines: quintana
node 2:0 machines: euclides
node 3:0 machines: dionelio
node 4:1 machines: scliar, luft
node 5:1 machines: ostermann, meyer
// gateway nodes
gateways: quintana, scliar

Fig. 2. Descriptor file for a MultiCluster application.
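As an illustration of the "id:subnet" numbering, the logical-node entries of such a descriptor could be read as follows. This is a hedged sketch, not DECK's actual parser: the class names `LogicalNode` and `DescriptorParser` are hypothetical, and the code assumes one entry per line in exactly the form shown in Figure 2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical in-memory form of one "node <id>:<subnet> machines: ..." entry.
class LogicalNode {
    final int id;                 // logical node number seen by the application
    final int subnet;             // 0 = Myrinet, 1 = SCI in the example of Figure 2
    final List<String> machines;  // physical machines that make up this logical node

    LogicalNode(int id, int subnet, List<String> machines) {
        this.id = id;
        this.subnet = subnet;
        this.machines = machines;
    }
}

class DescriptorParser {
    // Assumed line format: node <id>:<subnet> machines: <name>[, <name>...]
    private static final Pattern NODE =
        Pattern.compile("^node\\s+(\\d+):(\\d+)\\s+machines:\\s*(.+)$");

    // Extracts the logical nodes; comment lines and other sections are ignored.
    static List<LogicalNode> parseNodes(String descriptor) {
        List<LogicalNode> nodes = new ArrayList<>();
        for (String line : descriptor.split("\\R")) {
            Matcher m = NODE.matcher(line.trim());
            if (m.matches()) {
                List<String> machines = Arrays.asList(m.group(3).trim().split("\\s*,\\s*"));
                nodes.add(new LogicalNode(Integer.parseInt(m.group(1)),
                                          Integer.parseInt(m.group(2)),
                                          machines));
            }
        }
        return nodes;
    }
}
```

With the descriptor of Figure 2, the entry "node 4:1 machines: scliar, luft" would become a logical node with id 4 on subnet 1 made of two physical machines.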
Intra- and Inter-node Communication. As the application only sees logical nodes, it is relatively easy to adapt the different communication paradigms: inside a logical node, communication is made by shared memory; between logical nodes, communication is made by message passing. From the user's point of view, there is only one programming interface furnishing both mechanisms to specify communication over Myrinet or SCI clusters; the underlying communication layer is in charge of implementing one or the other paradigm.
Heterogeneity. Although a less frequent problem, heterogeneity may arise depending on the availability of clusters that have to be interconnected. Here, we are considering different data representations and the need to indicate to the message receiver the architecture type of the message sender. This problem is implicitly treated by the communication software.
Even if some performance loss occurs due to such integration, it is possible for the user to define the best location for his application tasks, creating communication resources according to each task location (i.e. communication subnets). Through this facility, the granularity of communication could be balanced among clusters, avoiding as far as possible traffic across the link network.
3.3
The Programming Environment-DECK
The interface between the programmer and the MultiCluster architecture is the DECK environment. DECK (Distributed Executive Communication Kernel) is composed of a runtime system and a user API which provides a set of services and abstractions for the development of parallel and distributed applications. A DECK application runs in an SPMD style, split in terms of logical nodes.
DECK is divided into two layers: one called µDECK, which directly interacts with the underlying OS, and a service layer, where more elaborate resources (including the support for multiple clusters) are made available. Figure 3 shows the layered structure of DECK.
Fig. 3. Internal structure of DECK.
µDECK is the platform-dependent part of DECK. This layer implements the five basic abstractions provided within the environment: threads, semaphores, messages, mailboxes and shared segments. Each of these abstractions is treated by the application as an object, and has associated primitives for proper manipulation.
Messages present pack/unpack primitives, which do not necessarily perform marshalling/unmarshalling actions. When a message object is created, one of its attributes holds the identification of the host architecture. At the time of a pack no marshalling is performed; at the time of an unpack, if the receiving host is of a different architecture, the proper data conversion is made¹.
Messages can be posted to or retrieved from mailboxes. Only the creator of a mailbox is allowed to retrieve messages from it, but any other thread knowing the mailbox can post to it. To use a mailbox, the creator must register it in a name server. There are two ways to obtain a mailbox address: fetching it from the name server or receiving it in a message.
The service layer is built on top of µDECK and aims to furnish additional, more sophisticated mechanisms that might be useful for the development of parallel applications, such as naming, group communication and fault tolerance support. In the scope of this paper, two elements of this layer must be analysed: the naming service and the Remote Communication Daemon (RCD).
The name server is a dedicated thread which runs in the first node within each cluster. For example, in the configuration illustrated in Figure 2, there will be a name server running on "verissimo" and another running on "scliar". Each name server is responsible for registering mailboxes created within its cluster. The name server is automatically executed when the application starts and has a well-known mailbox to allow other threads to communicate with it.
¹ It is important to observe that we only expect this to happen for messages crossing cluster boundaries, since clusters are assumed to be internally homogeneous.
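The conversion-on-unpack policy described for messages can be illustrated with a few lines of Java. This is a sketch under assumptions: `TaggedMessage` is a hypothetical class, byte order stands in for the "host architecture" attribute, and only a single 32-bit integer is handled. DECK's real message format is not described in the text.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Message carrying one int plus a tag naming the sender's byte order.
class TaggedMessage {
    final ByteOrder senderOrder;  // stands in for the "host architecture" attribute
    final byte[] payload;

    private TaggedMessage(ByteOrder senderOrder, byte[] payload) {
        this.senderOrder = senderOrder;
        this.payload = payload;
    }

    // pack: the value is written in the sender's own byte order, with no conversion.
    static TaggedMessage packInt(int value, ByteOrder senderOrder) {
        byte[] raw = ByteBuffer.allocate(4).order(senderOrder).putInt(value).array();
        return new TaggedMessage(senderOrder, raw);
    }

    // unpack: read in the receiver's order and swap bytes only if the orders differ.
    int unpackInt(ByteOrder receiverOrder) {
        int raw = ByteBuffer.wrap(payload).order(receiverOrder).getInt();
        return senderOrder == receiverOrder ? raw : Integer.reverseBytes(raw);
    }
}
```

The benefit of this arrangement is that messages exchanged inside a homogeneous cluster never pay for a conversion; the cost appears only when sender and receiver actually differ.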
The DECK/Myrinet Implementation. In the implementation of DECK on top of Myrinet, we are currently using BIP (Basic Interface for Parallelism) [21] as a communication protocol to efficiently use the underlying hardware and deliver high performance to applications. As BIP utilizes reception queues labeled with tags within each node, our mailbox implementation assigns a specific tag to each mailbox. To create a mailbox, the programmer uses a deck_mbox_create() primitive, passing as arguments the mailbox name and the communication subnet (defined in the descriptor file) in which this mailbox will be used. Communication is made by post and retrieve operations, passing as arguments the corresponding mailbox and the message object, which contains the DECK supported datatypes. Posting a message is an asynchronous operation, while retrieving a message is a synchronous operation. To achieve this behaviour, we use the bip_tisend() and bip_trecv() primitives, respectively. The implementation of µDECK mailboxes and messages on top of BIP is straightforward, since both are based on message passing. Shared segments, however, need additional software DSM support to be implemented with the same library. For the moment we are studying the introduction of a DSM library, such as TreadMarks [25], to allow the usage of shared segments over Myrinet. The primitives for threads and semaphores are trivial and follow the Pthreads standard [13].
The DECK/SCI Implementation. We base our DECK/SCI implementation on two SCI programming libraries: Yasmin [23], which provides basic primitives for creation, mapping and synchronisation of shared segments, and Sthreads [22], which offers a Pthread-like environment on top of Yasmin. A µDECK shared segment object offers primitives for creation, naming, mapping and locking. Unlike Myrinet, SCI allows an easier implementation of both communication paradigms, so DECK/SCI offers mailboxes and messages as well as shared segments. 
The creation of threads in DECK/SCI follows a simple round-robin placement strategy, according to the number of physical nodes that compose a logical node, which means that placement is still transparent to the end user. Notice that local memory can still be used for communication by local threads (i.e. threads in the same physical node), but it is up to the programmer to keep this kind of control. This means that, within SCI clusters, memory is only guaranteed to be correctly shared between remote threads if it is mapped into a µDECK shared segment.
RCD - Remote Communication Daemon. In order to support the MultiCluster model, the Remote Communication Daemon has been designed as a DECK service responsible for communicating with remote clusters. As each cluster must have a "gateway" node, the RCD is automatically executed inside this node when the application starts and follows the same semantics as the name server, i.e., it also has a well-known mailbox. The RCD acts upon demand in two special cases: when fetching names defined remotely (i.e. on another cluster) and when posting messages to remote mailboxes. When a DECK primitive fails to fetch a mailbox address in a local name server, it contacts the RCD, which then broadcasts the request to other RCDs in the system and
waits for an answer, returning it to the caller. In the second case, when a DECK primitive sees a remote mailbox address when posting a message, it contacts the RCD, which then forwards the message to the RCD responsible for the communication subnet in which the mailbox is valid. It is important to emphasize that communication between threads in different logical nodes, as well as in different clusters, must always be made by message passing. Even in the case of an SCI cluster, there must be at least one mailbox to allow communication with the RCD and, eventually, to retrieve messages. For the moment we do not consider using a global shared memory space to establish communication among clusters, because the DECK/Myrinet implementation lacks this support. Our intention in designing DECK in three parts is to make it usable without changes in both single- and multi-clustered environments. In the first case, the RCD will simply not be brought into action by the application, since all the objects will be local to a specific cluster.
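The name lookup path just described (local name server first, then the RCD, which asks the RCDs of the other clusters) can be sketched as follows. This is an illustrative model only: the classes `NameServer` and `RCD`, the function `fetch_mailbox` and the address tuples are assumptions, since the RCD protocol itself had not yet been developed at the time of writing.

```python
class NameServer:
    """Per-cluster table of mailbox names (hypothetical structure)."""

    def __init__(self, cluster):
        self.cluster = cluster
        self._table = {}

    def register(self, name):
        # An address is modelled here as a (cluster, mailbox name) pair.
        self._table[name] = (self.cluster, name)

    def lookup(self, name):
        return self._table.get(name)


class RCD:
    """Gateway daemon of one cluster; peers are the RCDs of other clusters."""

    def __init__(self, name_server):
        self.name_server = name_server
        self.peers = []

    def resolve_remote(self, name):
        # "Broadcast": ask every peer RCD and return the first answer found.
        for peer in self.peers:
            address = peer.name_server.lookup(name)
            if address is not None:
                return address
        return None


def fetch_mailbox(name, name_server, rcd):
    """Try the local name server first; fall back to the RCD on a miss."""
    address = name_server.lookup(name)
    if address is not None:
        return address
    return rcd.resolve_remote(name)


if __name__ == "__main__":
    ns_myri, ns_sci = NameServer("myrinet"), NameServer("sci")
    rcd_myri, rcd_sci = RCD(ns_myri), RCD(ns_sci)
    rcd_myri.peers.append(rcd_sci)
    rcd_sci.peers.append(rcd_myri)
    ns_sci.register("results")
    print(fetch_mailbox("results", ns_myri, rcd_myri))
```

In a single-cluster run every lookup is satisfied by the local name server, so the RCD branch is never taken, which matches the stated intention that DECK be usable unchanged in both environments.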
4 Related Work
Since the purpose of this paper is to discuss practical questions involved in the integration of multiple clusters and to propose our model to achieve such integration, we tried to identify similar proposals regarding this subject. There are a great number of research projects concerning the integration of multiple workstation clusters, such as NOW [1], Beowulf [2], Globus [7] and Legion [10]. The goal of these projects is to allow parallel and distributed programming over geographically distributed, heterogeneous clusters that correspond to a "global computational grid". The distinguishing characteristic of our MultiCluster model is that we assume the simultaneous use of different network technologies, while these projects plan to use a common network technology to connect clusters, providing high scalability. In terms of programming environments, there are also some efforts concentrated on joining message passing and distributed shared memory facilities, such as Stardust [5] and Active Messages II [16]. Their main goal is to provide support for both the message passing and distributed shared memory paradigms and, at the same time, to offer mechanisms for fault tolerance and load balancing, as well as portability. There are also some important contributions based on Java, such as JavaNOW [15], JavaParty [20] and Javelin [6]. All these contributions aim to provide distributed programming across networks of workstations or Web-based networks, differing in the communication model they use. The idea behind MultiCluster is similar in some aspects to the objectives found in the projects/environments mentioned here, though on a smaller scale. Our research goal is to identify and propose solutions to problems related to the specific integration of Myrinet and SCI clusters, while the goals of such projects comprise a larger universe, including fast communication protocols, cluster tools, job scheduling and so on.
Nevertheless, it is possible to state brief comparisons: our RCD is a simpler implementation when compared with Nexus, the communication system used inside Globus; it is just a way to give remote access to mailboxes defined in other clusters, and it allows us to separate out the functionality of DECK when it runs on a single-cluster platform.
The combination of message passing and distributed shared memory we offer is not so different from the usual mechanisms provided by the other environments. We want to implement these mechanisms efficiently in both clusters, without changing the programming interface. To accomplish this, our choice is to provide a mailbox object and a shared segment object to express message passing and memory sharing, respectively.
5 Conclusions and Current Work
In this paper we presented some problems related to the integration of two different cluster platforms and proposed our MultiCluster model to achieve such integration. We are developing our software environment with a number of objectives in mind, such as joining two specific cluster platforms (Myrinet and SCI) and providing a uniform API for parallel and distributed programming on both platforms, as well as opening research activities concerning such integration. The integration is easier in terms of hardware because many solutions are already implemented within the OS kernel (e.g. co-existence of network device drivers). In terms of software, we have to decide what degree of abstraction we want to offer to the programmer. It is important that the user be aware of the characteristics of each individual cluster, so that he can best adapt his application to benefit from them. On the other hand, the DECK layer must abstract implementation details as much as possible, offering to the users a complete and simple API able to express the application needs. Currently, the descriptor file is the key point in configuring the MultiCluster platform, because it represents the communication contexts and the logical nodes the user wants to use. Although this configuration is not so transparent, it is the most suitable way to adapt the execution environment to the user needs. We consider that there are no problems in this task, since the execution environment guarantees the expected functionality. Our work has been guided towards the design of a complete set of programming resources, enclosed in a software layer. Through the modularisation of DECK, we have divided our work in such a way that we can parallelize our efforts to cover all the problems presented and to make the MultiCluster model available as soon as possible. At the moment we already have an implementation of DECK based on Pthreads and UNIX sockets, available at our Web page [18].
This implementation has played an important role in defining the DECK structure and behaviour. At the time of this writing, we are concluding the implementation on top of BIP and collecting some performance results and, at the same time, starting the implementation of DECK objects on top of SCI. The next step is to join both clusters and develop the RCD communication protocol.
References
1. T. Anderson, D. Culler, and D. Patterson. A case for NOW - Network of Workstations. Available by WWW at http://now.cs.berkeley.edu, Oct. 1999.
2. Beowulf. The Beowulf project. Available by WWW at http://www.beowulf.org, Jun. 1999.
3. N. Boden et al. Myrinet: A gigabit-per-second local-area network. IEEE Micro, 15(1):29-36, Feb. 1995.
4. Rajkumar Buyya. High Performance Cluster Computing. Prentice Hall PTR, Upper Saddle River, NJ, 1999.
5. Gilbert Cabillic and Isabelle Puaut. Stardust: an environment for parallel programming on networks of heterogeneous workstations. Journal of Parallel and Distributed Computing, 40:65-80, 1997.
6. B. Christiansen et al. Javelin: Internet-based parallel computing using Java. Available by WWW at http://www.cs.ucsb.edu/research/javelin/, Nov. 1999.
7. Ian Foster and Carl Kesselman. The Globus project. Available by WWW at http://www.globus.org, Jul. 1999.
8. Al Geist et al. PVM Parallel Virtual Machine. MIT Press, Cambridge, MA, 1994.
9. GM message passing system. Available by WWW at http://www.myri.com, Nov. 1999.
10. A. Grimshaw et al. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1), Jan. 1997.
11. Kai Hwang and Zhiwei Xu. Scalable Parallel Computing: Technology, Architecture, Programming. McGraw-Hill, New York, NY, 1997.
12. IEEE. IEEE standard for Scalable Coherent Interface (SCI). IEEE 1596-1992, 1992.
13. IEEE. Information technology - portable operating system interface (POSIX), threads extension [C language]. IEEE 1003.1c-1995, 1995.
14. IEEE. Local and metropolitan area networks - supplement - media access control (MAC) parameters, physical layer, medium attachment units and repeater for 100Mb/s operation, type 100BASE-T (clauses 21-30). IEEE 802.3u-1995, 1995.
15. Java and High Performance Computing Group. The JavaNOW project. Available by WWW at http://www.jhpc.org/projects.html, Nov. 1999.
16. Steven S. Lumetta, Alan M. Mainwaring, and David E. Culler. Multi-protocol Active Messages on a cluster of SMP's. In Proc. of SuperComputing 97, 1997.
17. MPI FORUM. Document for a standard message passing interface. International Journal of Supercomputer Applications and High Performance Computing Technology, 8(3/4), 1994.
18. The MultiCluster project. Available by WWW at http://wwwgppd.inf.ufrgs.br/projects/mcluster, Nov. 1999.
19. S. Pakin, M. Lauria, and A. Chien. High performance messaging on workstations: Illinois Fast Messages for Myrinet. In Supercomputing '95. IEEE Computer Society Press, 1996.
20. Michael Philippsen and Matthias Zenger. JavaParty: A distributed companion to Java. Available by WWW at http://wwwipd.ira.uka.de/JavaParty, Nov. 1999.
21. Loic Prylli and Bernard Tourancheau. BIP: A new protocol designed for high performance networking on Myrinet. In Jose Rolim, editor, Parallel and Distributed Processing, number 1388 in Lecture Notes in Computer Science, pages 472-485. Springer, 1998.
22. Enno Rehling. Sthreads: Multithreading for SCI clusters. In Proc. of Eleventh Symposium on Computer Architecture and High Performance Computing, Natal - RN, Brazil, 1999. Brazilian Computer Society.
23. H. Taskin. Synchronizationsoperationen für gemeinsamen Speicher in SCI-Clustern. Available by WWW at http://www.uni-paderborn.de/cs/ag-heiss/en/veroeffentlichungen.html, Aug. 1999.
24. VIA - Virtual Interface Architecture. Available by WWW at http://www.via.org, Nov. 1999.
25. Willy Zwaenepoel et al. TreadMarks distributed shared memory (DSM) system. Available by WWW at http://www.cs.rice.edu/~willy/TreadMarks/overview.html, Dec. 1998.
Parallel Information Retrieval on an SCI-Based PC-NOW
Sang-Hwa Chung, Hyuk-Chul Kwon, Kwang Ryel Ryu, Han-Kook Jang, Jin-Hyuk Kim, and Cham-Ah Choi
Division of Computer Science and Engineering, Pusan National University, Pusan, 609-735, Korea
{shchung, hckwon, krryu, hkjang, variant, cca}@hyowon.pusan.ac.kr
Abstract. This paper presents an efficient parallel information retrieval (IR) system which provides fast information service for Internet users on a low-cost, high-performance PC-NOW environment. The IR system is implemented on a PC cluster based on the Scalable Coherent Interface (SCI), a powerful interconnecting mechanism for both shared memory models and message passing models. In the IR system, the inverted-index file (IIF) is partitioned into pieces using a greedy declustering algorithm and distributed to the cluster nodes to be stored on each node's hard disk. For each incoming user's query with multiple terms, terms are sent to the corresponding nodes which contain the relevant pieces of the IIF to be evaluated in parallel. According to the experiments, the IR system outperforms an MPI-based IR system using Fast Ethernet as an interconnect. Speed-up of up to 4.0 was obtained with an 8-node cluster in processing each query on a 500,000-document IIF.
J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 81-90, 2000. © Springer-Verlag Berlin Heidelberg 2000
1. Introduction
As more and more people access the Internet and easily acquire vast amounts of information, more people consider that the problem of information retrieval (IR) no longer resides in the lack of information, but in how to choose the right information from a vast amount quickly. Many of us have already experienced that some IR systems provide information service much faster than others. How fast an IR system can respond to users' queries mostly depends on the performance of the underlying hardware platform. Therefore, most of the major IR service providers have been urged to spend several hundred thousand dollars to purchase their hardware systems. However, for many small businesses on the Internet, that cost is too high. In this paper, as a cost-effective solution to this problem, a PC cluster interconnected by a high-speed network card is suggested as a platform for fast IR service. With the PC cluster, a massive digital library can be efficiently distributed to PC nodes by utilizing local hard disks. Besides, every PC node can act as an entry to process multiple users' queries simultaneously. It is extremely important to select a suitable network adapter to construct a high-speed system area network (SAN). For a message passing system, the Fast Ethernet card or the Myrinet card can be used. For a distributed shared memory (DSM) system, the SCI card can be considered. Fast Ethernet, developed for LANs, is based on complicated protocol software such as TCP/IP, and its bandwidth is not high. The Myrinet[1] card is a high-speed message passing card with a maximum bandwidth of 160Mbyte/sec. However, the network cost is relatively high because Myrinet
requires crossbar switches for the network connection. Besides, its message-passing mechanism is based on time consuming operating system calls. For applications with frequent message-passing, this can lead to performance degradation. To overcome the system call overhead, systems based on a user-level interface for message-passing without intervention of the operating system have been developed. Representative systems include AM[2], FM[3], and U-Net[4]. Recently, Myrinet has also been provided with a new message-passing system called GM[5], which supports user-level OS-bypass network interface access. The SCI (Scalable Coherent Interface: ANSI/IEEE standard 1596-1992) is designed to provide a low-latency (less than 1µs) and high bandwidth (up to 1Gbyte/sec) point-to-point interconnect. The SCI interconnect can assume any topology including ring and crossbar. Once fully developed, the SCI can connect up to 64K nodes. Since the SCI supports DSM models that can feature both NUMA and CC-NUMA variants, it is possible to make transparent remote memory accesses with memory read/write transactions without using explicit message-passing. The performance of SCI-based systems has been proven by commercial CC-NUMA servers such as Sequent NUMA-Q 2000[6] and Data General's Aviion[7]. In this research, the SCI is chosen as the underlying interconnecting mechanism for clustering. The parallel IR system is implemented on an SCI-based PC cluster using a DSM programming technique. In the IR system, the inverted-index file (IIF) is partitioned into pieces using a greedy declustering algorithm and distributed to the cluster nodes to be stored on each node's hard disk. An IIF is the sorted list of terms (or keywords), with each term having links to the documents containing that term. For each incoming user's query with multiple terms, terms are sent to the corresponding nodes which contain the relevant pieces of the IIF to be evaluated in parallel. An MPI-based IR system using Fast Ethernet as an interconnect is also constructed for comparison purposes.
2. PC Cluster-based IR System
2.1 Typical IR System on Uniprocessor
Figure 1 shows the structure of a typical IR system implemented on a uniprocessor. As shown in the figure, once a user's query with multiple terms is presented to the system, for each query term in turn the IR engine retrieves relevant information from the IIF in the hard disk. When all the information is collected, the IR engine performs necessary IR operations, scores the retrieved documents, ranks them, and sends the IR result back to the user. For the efficient parallelization of the system, it is important to find out the most time consuming part in executing the IR system. Using the sequential IR system developed previously[8], the system's execution time is analyzed as shown in Figure 2. In the sequential system, the most time consuming part is disk access. Thus, it is necessary to parallelize disk access. This can be done by partitioning the IIF into pieces and distributing the pieces to the processing nodes in a PC cluster.
Fig. 1. A typical IR system (diagram: Query, IR Engine, Result)
Fig. 2. Execution time analysis in the sequential IR system (bar chart; vertical axis: share of execution time in %, 0 to 50; categories: disk access, vector extract, IR operation, ranking)
2.2 Declustering IIF
Most current IR systems use a very large lookup table called an inverted index file (IIF) to index relevant documents for given query terms. Each entry of the IIF consists of a term and a list of ids of the documents containing the term. Each document id is tagged with the weight of the term for that document. Given a query, all the query terms are looked up in the IIF to retrieve the relevant document ids and the corresponding term weights. Next, the documents are scored based on the term weight values and then ranked before they are reported back to the user. Since our IR system processes a user's query in parallel on a PC cluster, it is desirable to have the IIF appropriately declustered to the local hard disks of the processing nodes. We can achieve maximum parallelism if the declustering is done in such a way that the disk I/O and the subsequent scoring job are distributed as evenly as possible to all the processing nodes. An easy random declustering method would be to assign the terms (each together with its list of documents) in the IIF, in lexicographic order, to the processing nodes in turn, repeatedly until all the terms are assigned. In this paper, we present a simple greedy declustering method which performs better than the random method. Our greedy declustering method tries to put together in the same node those terms which have a low probability of simultaneous occurrence in the same query. If the terms in a query all happen to be stored in the same node, the disk I/O cannot be done in parallel and the scoring job cannot readily be processed in parallel either. For an arbitrary pair of terms in the IIF, how can we predict the probability of their co-occurring in the same query? We conjecture that this probability has a strong correlation with the probability of their co-occurrence in the same documents.
Given a pair of terms, the probability of their co-occurrence in the same documents can be obtained as the number of documents in which the two terms co-occur divided by the number of all the documents in a given document collection. We calculate this probability for every pair of terms by preprocessing the whole document collection. When the size of the document collection is very large, we can limit the calculation of the co-occurrence probabilities to those terms which are significant. The reason is that about 80% of the terms in a document collection usually occur only once or twice in the whole document collection and are unlikely to appear in user queries. Also, since the number of terms in a document collection is known to increase in log scale as the number of documents increases, our
method will not have much difficulty in scaling up. As more documents are added to the collection, however, re-calculation of the co-occurrence probabilities would be needed for maintenance. This would not happen frequently, though, because the statistical characteristics of a document collection do not change abruptly. In the first step of our greedy declustering algorithm, all the terms in the IIF are sorted in decreasing order of the number of documents in which each term appears. The higher this number, the more important the term is, in the sense that it is quite likely to be included in many queries. This is especially true when the queries are modified by relevance feedback[9]. Such terms also have a longer list of documents in the IIF and thus cause heavier disk I/O. Therefore, it is advantageous to store these terms in different nodes whenever possible to enhance I/O parallelism. Suppose there are n processing nodes. We assign the first n of the sorted terms to each of the n nodes in turn. For the next n terms, each term is assigned to the node which contains a term with the lowest probability of co-occurrence. From the third pass of the term assignment, a term is assigned to the node for which the summation of the probabilities of co-occurrence of the term with the terms already assigned to that node is the lowest. This process repeats until all the terms in the IIF are assigned.
2.3 Parallel IR System Model
The PC cluster-based parallel IR system model is shown in Figure 3. The IR system consists of an entry node and multiple processing nodes. The participating nodes are PCs with local hard disks, connected by an SCI-based high-speed network. The working mechanism of the parallel IR system model can be explained as follows. The entry node accepts a user's query and distributes the query terms to processing nodes (including itself) based on the declustering information described in the previous subsection. Each processing node consults the partitioned IIF using the list of query terms delivered from the entry node, and collects the necessary document list for each term from the local hard disk. Once all the necessary document lists are collected, they are transmitted to the entry node. The entry node collects the document lists from the participating processing nodes (including itself), performs the required IR operations such as AND/OR and ranks the selected documents according to their scores. Finally the sorted document list is sent back to the user as the IR result.
Fig. 3. Parallel IR system model
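The working mechanism of Section 2.3 can be sketched in a few lines of Python. The sketch runs sequentially in one process and only models the data flow; the function name `answer_query`, the dictionary layouts and the scoring rule (sum of term weights per document) are assumptions made for illustration, not the authors' SISCI-based implementation.

```python
from collections import defaultdict


def answer_query(query_terms, term_to_node, node_iifs, mode="OR"):
    """Route terms to nodes, gather document lists, combine and rank.

    term_to_node: term -> node id (the declustering information)
    node_iifs:    node id -> {term: [(doc_id, weight), ...]} (local IIF piece)
    """
    # 1. The entry node groups the query terms by the node that stores them.
    per_node = defaultdict(list)
    for term in query_terms:
        if term in term_to_node:
            per_node[term_to_node[term]].append(term)

    # 2. Each processing node reads its document lists (in parallel in the
    #    real system) and sends them to the entry node.
    lists = {}
    for node, terms in per_node.items():
        for term in terms:
            lists[term] = node_iifs[node].get(term, [])

    # 3. The entry node scores the documents (assumed: sum of term weights).
    scores = defaultdict(float)
    for postings in lists.values():
        for doc_id, weight in postings:
            scores[doc_id] += weight

    # AND keeps only documents that contain every query term.
    if mode == "AND":
        required = [{d for d, _ in lists.get(t, [])} for t in query_terms]
        keep = set.intersection(*required) if required else set()
        scores = {d: s for d, s in scores.items() if d in keep}

    # 4. Rank by decreasing score, breaking ties by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    term_to_node = {"cluster": 0, "sci": 1}
    node_iifs = {
        0: {"cluster": [(1, 0.5), (2, 0.2)]},
        1: {"sci": [(2, 0.4), (3, 0.9)]},
    }
    print(answer_query(["cluster", "sci"], term_to_node, node_iifs, "OR"))
    print(answer_query(["cluster", "sci"], term_to_node, node_iifs, "AND"))
```

Step 2 is where the parallelism lies: if all query terms map to the same node, the disk reads are serialized, which is exactly what the declustering of Section 2.2 tries to avoid.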
2.4 Experimental PC Cluster System
In this research, an 8-node SCI-based PC cluster system is constructed as shown in Figure 4. Each node is a 350MHz Pentium II PC with 128Mbyte main memory and a 4.3Gbyte SCSI hard disk, running Linux kernel 2.0.36. In the cluster, any PC node can be configured as an entry node. As shown in the figure, each PC node is connected to the SCI network through a PCI-SCI bridge card from Dolphin Interconnect Solutions (DIS). There are 4 rings in the network, and 2 nodes in each ring. The rings are interconnected by the DIS 4x4 SCI switch. For DSM programming, the DIS SISCI (Software Infrastructure for SCI) API[10] is used. With this configuration, the maximum point-to-point bulk transfer rate obtained is approximately 80 Mbyte/sec.
Fig. 4. SCI-based 8 node PC cluster system
For comparison purposes, an 8-node Fast Ethernet-based PC cluster system is also constructed. Each PC node has the same configuration as the SCI network's node except that a PCI Fast Ethernet adapter is used for networking. A switching hub is used to interconnect the PC nodes in the cluster. For message-passing programming, MPICH 1.1.1[11] is used. In this case, the maximum point-to-point bulk transfer rate obtained is approximately 10 Mbyte/sec.
2.5 SCI-based DSM Programming
The SCI interconnect mechanism supports DSM programming. By using SISCI, a node in the SCI-based PC cluster can establish a mapping between its local memory address space and a remote node's memory address space. Once the mapping is established, the local node can access the remote node's memory directly. In DSM programming, the communication between PC nodes in the cluster is done using remote read and remote write transactions instead of message-passing. These remote read/write transactions are actually carried out using the remote read/write functions provided by SISCI. In the actual IR program, most of the remote memory transactions are implemented using the remote write function. This is because the remote write function performs about 10 times faster than the remote read function on the DIS PCI-SCI bridge card.
3. Performance of PC Cluster-based IR System
3.1 Performance Comparison between SCI-based System and MPI-based System
In this experiment, average query processing times are measured for the 8-node SCI-based system, the 8-node MPI-based system and a single node system. The IIF is constructed from 100,000 documents collected from articles in a newspaper. A user's query consists of 24 terms. Each query is made to contain a rather large number of terms because queries modified by relevance feedback usually have that many terms. The IIF is randomly declustered to be stored on each processing node's local disk. As shown in Table 1, the disk access time is reduced for both the SCI-based system and the MPI-based system when compared with the single node system. However, the MPI-based system is worse than the single node system in total query processing time because of the communication overhead. The SCI-based system has much less communication overhead than the MPI-based system, and performs better than the single node system. The speed-up improves with the further optimizations presented in the following subsections.
Table 1. Query processing times of 8-node SCI-based system and 8-node MPI-based system (unit: sec)
                        SCI-based system   MPI-based system   Single-node system
Send query term         0.0100             0.0251             0
Receive document list   0.0839             0.2097             0
Disk access             0.0683             0.0683             0.2730
IR operation            0.0468             0.0468             0.0468
Total                   0.2091             0.3500             0.3198
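The claims drawn from Table 1 can be checked with a little arithmetic. The following Python fragment only re-uses the figures printed in the table; the variable names are illustrative, and "communication" is taken here to mean the sum of the "send query term" and "receive document list" rows, which is an interpretation rather than a definition given in the text.

```python
# Figures copied from Table 1 (seconds).
table1 = {
    "SCI":    {"send": 0.0100, "receive": 0.0839, "disk": 0.0683, "ir": 0.0468, "total": 0.2091},
    "MPI":    {"send": 0.0251, "receive": 0.2097, "disk": 0.0683, "ir": 0.0468, "total": 0.3500},
    "single": {"send": 0.0,    "receive": 0.0,    "disk": 0.2730, "ir": 0.0468, "total": 0.3198},
}


def speedup(system):
    """Total-time speed-up of a system relative to the single-node system."""
    return table1["single"]["total"] / table1[system]["total"]


def communication(system):
    """Assumed definition: time to send terms plus time to receive lists."""
    return table1[system]["send"] + table1[system]["receive"]


if __name__ == "__main__":
    for name in ("SCI", "MPI"):
        print(name,
              "speed-up:", round(speedup(name), 2),
              "communication:", round(communication(name), 4))
    disk_gain = table1["single"]["disk"] / table1["SCI"]["disk"]
    print("disk access speed-up:", round(disk_gain, 2))
```

With these figures the SCI-based system is about 1.5 times faster than the single node, the MPI-based system is slower than the single node (ratio below 1), and the disk access alone is about 4 times faster on 8 nodes, so the gap between the two clusters comes entirely from communication.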
3.2 Effect of Declustering IIF
The greedy declustering method is compared with the random method on a test set consisting of 500 queries, each containing 24 terms. To generate the test queries we randomly sampled 500 documents from a document collection containing 500,000 newspaper articles. From each document, the 24 most important terms are selected to make a query. The importance of a term in a document is judged by the value tf × idf, where tf is the term's frequency in that document and idf is the so-called inverse document frequency. The inverse document frequency is given by log(N/n) + 1, where N is the total number of documents in the collection and n is the number of documents containing the term. Therefore, a term in a document is considered important if its frequency in that document is high enough but at the same time it does not appear in too many other documents. Table 2 shows the experimental results comparing the random declustering and the greedy declustering methods using those 500 queries on our 500,000 document collection.
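The query generation step can be sketched as follows. The function names `idf` and `top_terms` are illustrative, and the logarithm base is an assumption (base 2 is used here because the base is not legible in the source); since the formula adds 1 to the logarithm, a different base could change the ranking of some terms.

```python
import math
from collections import Counter


def idf(n_docs, docs_with_term):
    """Inverse document frequency, log(N/n) + 1, with base 2 assumed."""
    return math.log2(n_docs / docs_with_term) + 1


def top_terms(doc_tokens, doc_freq, n_docs, k=24):
    """Return the k terms of one document with the highest tf x idf."""
    tf = Counter(doc_tokens)
    score = {t: tf[t] * idf(n_docs, doc_freq[t]) for t in tf}
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(score, key=lambda t: (-score[t], t))[:k]


if __name__ == "__main__":
    tokens = ["sci", "sci", "the", "the", "the", "cluster"]
    doc_freq = {"sci": 1, "the": 8, "cluster": 2}   # documents containing each term
    print(top_terms(tokens, doc_freq, n_docs=8, k=2))
```

In the toy collection the word "the" is the most frequent token in the document, but it appears in every document, so its idf is 1 and it loses to the rarer term "sci".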
Table 2. Comparison of random declustering and greedy declustering (unit: sec)
                                                    Random declustering   Greedy declustering
Average query processing time                       0.5725                0.5384
Accumulated query processing time for 500 queries   286.2534              269.1919
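The two declustering strategies compared in Table 2 can be sketched in Python. This is a minimal reading of the description in Section 2.2, not the authors' code: the function names are illustrative, and the rule that each node receives exactly one term per pass of n terms is an assumption added to keep the nodes balanced.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_stats(docs):
    """docs: list of sets of terms. Returns (document frequency, pair probability)."""
    doc_freq, pair_count = Counter(), Counter()
    for terms in docs:
        doc_freq.update(terms)
        pair_count.update(frozenset(p) for p in combinations(sorted(terms), 2))
    # P(a, b) = documents containing both a and b / all documents.
    pair_prob = {pair: c / len(docs) for pair, c in pair_count.items()}
    return doc_freq, pair_prob


def random_decluster(terms, n_nodes):
    """The 'random' method of the text: lexicographic round-robin assignment."""
    return {t: i % n_nodes for i, t in enumerate(sorted(terms))}


def greedy_decluster(doc_freq, pair_prob, n_nodes):
    """Assign terms, most frequent first, to the node they co-occur with least."""
    order = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))
    node_terms = [[] for _ in range(n_nodes)]
    assignment = {}
    for start in range(0, len(order), n_nodes):
        free = set(range(n_nodes))          # assumption: one term per node per pass
        for term in order[start:start + n_nodes]:
            def cost(node):
                # Sum of co-occurrence probabilities with terms already on the node.
                return sum(pair_prob.get(frozenset((term, other)), 0.0)
                           for other in node_terms[node])
            best = min(free, key=lambda node: (cost(node), node))
            free.remove(best)
            node_terms[best].append(term)
            assignment[term] = best
    return assignment


if __name__ == "__main__":
    docs = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"b", "d"}, {"c"}, {"d"}]
    doc_freq, pair_prob = cooccurrence_stats(docs)
    print("random:", random_decluster(doc_freq, 2))
    print("greedy:", greedy_decluster(doc_freq, pair_prob, 2))
```

In the toy collection the terms "a" and "c" co-occur, so a query containing both is served by a single node under the random assignment but by two nodes under the greedy one.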
3.3 Performance with Various-sized IIF
In this subsection, the performance of the SCI-based parallel IR system is analyzed with the number of documents increased up to 500,000. These documents are collected from a daily newspaper, and 500,000 documents amount to 7 years of the newspaper's articles. The size of the IIF increases proportionally as the number of documents increases. For example, the size of the IIF is 300 Mbytes for 100,000 documents, and 1.5 Gbytes for 500,000 documents. The 8-node PC cluster and the greedy declustering method are used for the experiment. The experimental result is presented in Figure 5. It takes 0.1805 seconds to process a single query with the 100,000-document IIF, while it takes 0.2536 seconds with the 200,000-document IIF and 0.5398 seconds with the 500,000-document IIF. As the IIF size increases, the document list for each query term becomes longer, and the time spent for IR operations (AND/OR operations) increases considerably. As a result, the IR operation eventually takes more time than the disk access, and becomes the major bottleneck.
Fig. 5. IIF size vs. query processing time
3.4 Reducing IR Operation Time
As presented in the previous subsection, the IR operation time turns out to be a new overhead as the IIF size increases. In the IR system, AND/OR operations are performed by the entry node after all the necessary document lists are collected from the processing nodes. However, it is possible to perform the AND/OR operations partially, on the document lists collected in each processing node, so that each processing node transmits only the result to the entry node. This helps in reducing not only the IR operation time but also the communication time. The performance of the improved system in comparison with the original system is shown in Figure 6. In the experiment, the 8-node PC cluster, the greedy declustering method and the 500,000-document IIF are used. In the original system, the IR operation takes 0.2873 seconds, which is more than 53% of the total query processing time. In the improved system, however, the IR operation takes only 0.1035 seconds, which is about 35% of the total time. Thus, the IR operation again takes less time than the disk access. The communication time is also reduced from 0.1128 seconds to 0.0500 seconds, and the total time is reduced to almost half when compared with the original system.
Fig. 6. Query processing time with reduced IR operation time
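The optimization works because AND (intersection) and OR (union) can be applied in any grouping, so combining lists first inside each processing node gives the same result as combining everything at the entry node. A minimal sketch follows; the function names and the dictionary representation of a document list (document id mapped to accumulated weight) are assumptions made for illustration.

```python
from functools import reduce


def combine(a, b, mode):
    """AND/OR of two document lists given as {doc_id: accumulated weight}."""
    docs = a.keys() & b.keys() if mode == "AND" else a.keys() | b.keys()
    return {d: a.get(d, 0.0) + b.get(d, 0.0) for d in docs}


def original_system(node_lists, mode):
    """Every list is shipped to the entry node, which does all the combining."""
    shipped = [lst for lists in node_lists for lst in lists]
    entries_sent = sum(len(lst) for lst in shipped)
    return reduce(lambda a, b: combine(a, b, mode), shipped), entries_sent


def improved_system(node_lists, mode):
    """Each node combines its own lists and ships only the partial result."""
    partials = [reduce(lambda a, b: combine(a, b, mode), lists)
                for lists in node_lists if lists]
    entries_sent = sum(len(p) for p in partials)
    return reduce(lambda a, b: combine(a, b, mode), partials), entries_sent


if __name__ == "__main__":
    # Two processing nodes; the first holds two query terms, the second one.
    node_lists = [
        [{1: 1.0, 2: 1.0}, {2: 2.0, 3: 1.0}],
        [{2: 1.0, 4: 1.0}],
    ]
    print(original_system(node_lists, "AND"))
    print(improved_system(node_lists, "AND"))
```

Both variants return the same documents, but the improved one transmits fewer list entries, which corresponds to the reduction in both IR operation time and communication time reported above.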
Figure 7 shows the speed-up of the parallel IR system. The maximum speed-up obtained from the 8-node system when compared with the single node system is 4.0. As shown in the figure, the speed-up of the parallel IR system saturates rapidly beyond the 4-node system. As the number of processing nodes in the system increases, the disk access time¹ is reduced because the average number of query terms assigned to each node decreases. However, the IR operation time and the communication time increase as the number of document lists transmitted to the entry node increases, and this attenuates the overall speed-up. The problem may be alleviated by applying the following idea. Instead of sending all the document lists to the entry node, intermediate nodes can be utilized to merge the document lists by performing AND/OR operations in advance, as shown in Figure 8. Thus the entry node finally handles only two document lists. This will help in reducing both the IR operation time and the communication time. Experiments need to be performed to verify the above idea.
¹ The disk access time includes the time spent for partial AND/OR operations in the processing nodes.
Fig. 7. Number of processing nodes vs. query processing time
Fig. 8. Merging document lists in intermediate nodes
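The idea of Figure 8 amounts to a pairwise (tree-shaped) reduction of the per-node results. The sketch below simulates it in a single process; `tree_merge`, `combine` and the dictionary representation of a document list are hypothetical, and, as the text notes, the benefit of this scheme had not been verified experimentally.

```python
def combine(a, b, mode):
    """AND/OR of two document lists given as {doc_id: accumulated weight}."""
    docs = a.keys() & b.keys() if mode == "AND" else a.keys() | b.keys()
    return {d: a.get(d, 0.0) + b.get(d, 0.0) for d in docs}


def tree_merge(partials, mode):
    """Merge per-node lists pairwise; returns (result, number of rounds).

    In each round, neighbouring nodes merge their lists (these merges could
    run in parallel), so the last round combines only two lists.
    """
    if not partials:
        return {}, 0
    level, rounds = list(partials), 0
    while len(level) > 1:
        merged = [combine(level[i], level[i + 1], mode)
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:              # an odd list is carried to the next round
            merged.append(level[-1])
        level, rounds = merged, rounds + 1
    return level[0], rounds


if __name__ == "__main__":
    # Eight processing nodes, each holding one already-combined list.
    partials = [{1: 1.0, n + 10: 1.0} for n in range(8)]
    result, rounds = tree_merge(partials, "OR")
    print(rounds, sorted(result))
```

With 8 nodes the merge completes in 3 rounds, and the final round, performed by the entry node, combines just two lists instead of eight.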
4. Conclusions
In this paper, as a cost-effective solution for fast IR service, an SCI-based PC cluster system is proposed. In the parallel IR system developed on the PC cluster, the IIF is partitioned into pieces using a greedy declustering algorithm and distributed to the cluster nodes to be stored on each node's hard disk. For each incoming user's query with multiple terms, terms are sent to the corresponding nodes which contain the relevant pieces of the IIF to be evaluated in parallel. The IR system is developed using a DSM programming technique based on SCI. According to the experiments, the IR system outperforms an MPI-based IR system using Fast Ethernet as an interconnect. Speed-up of 4.0 was obtained with the 8-node cluster in processing each query on a 500,000-document IIF. Currently, the parallel IR system has a single entry node. In future research, a PC cluster based IR system with multiple entry nodes will be developed. Each processing node in the cluster system can act as an entry node to process multiple users' queries simultaneously. This will help in improving both the IR system's utilization and throughput. With more research effort, we hope this model will evolve into a practical solution for low-cost high-performance IR service on the Internet.
References
1. IEEE, "MYRINET: A GIGABIT PER SECOND LOCAL AREA NETWORK", IEEE-Micro, Vol.15, No.1, February 1995, pp.29-36.
2. "Active Messages: a Mechanism for Integrated Communication and Computation", Thorsten von Eicken and David Culler, et al., 1992.
3. "Fast Messages (FM): Efficient, Portable Communication for Workstation Clusters and Massively-Parallel Processors", IEEE Concurrency, vol. 5, No. 2, April-June 1997, pp. 60-73. (Pakin, Karamcheti & Chien)
4. "U-Net: A User-Level Network Interface for Parallel and Distributed Computing", Anindya Basu, Vineet Buch, Werner Vogels, Thorsten von Eicken, Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP), Copper Mountain, Colorado, December 3-6, 1995.
5. http://www.myri.com/GM/doc/gm_toe.html
6. "NUMA-Q: An SCI based Enterprise Server", http://www.sequent.com/products/highend_srv/sci_wpl.html
7. "SCI Interconnect Chipset and Adapter: Building Large Scale Enterprise Servers with Pentium Pro SHV Nodes", http://www.dg.com/about/html/sci_interconnect_chipset_and_a.html
8. S.H.Park, H.C.Kwon, "An Improved Relevance Feedback for Korean Information Retrieval System", Proc. of the 16th IASTED International Conf. Applied Informatics, IASTED/ACTA Press, pp.65-68, Garmisch-Partenkirchen, Germany, February 23-25, 1998.
9. Salton, G. and Buckley, C., "Improving retrieval performance by relevance feedback", American Society for Information Science, 41, 4, pp. 288-297, 1990.
10. http://www.dolphinics.no/customer/software/linux/index.html
11. "A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard", http://www-unix.mcs.anl.gov/mpi/mpich/docs.html
A PC-NOW Based Parallel Extension for a Sequential DBMS
Matthieu Exbrayat and Lionel Brunie
Laboratoire d'Ingenierie des Systemes d'Information, Institut National des Sciences Appliquees, Lyon, France
[email protected], [email protected]
Abstract. In this paper we study the use of networks of PCs to handle the parallel execution of relational database queries. This approach is based on a parallel extension, called the parallel relational query evaluator, working in a coupled mode with a sequential DBMS. We present a detailed architecture of the parallel query evaluator and introduce Enkidu, the efficient Java-based prototype that has been built according to our concepts. We present a set of measurements, conducted over Enkidu, that highlight its performance. We finally discuss the interest and viability of the concept of parallel extension in the context of relational databases and in the wider context of high performance computing.
Keywords: Networks of workstations, Parallel DBMS, Java
1 Introduction
Parallelizing Database Management Systems (DBMS) has been a flourishing field of research for the last fifteen years. Research, experiment and development have been conducted according to three main goals. The first one is to accelerate heavy operations, su
Figure 3(b). The schematic of the GP1 switch.
Figure 4. The GP shift switches.
A Non-binary Parallel Arithmetic Architecture
Table 1. The function of the GP1 switch
4, the MPSU reads four sets of coordinates from the queue. These four coordinates represent the four blocks that are to be merged. The MPSU then carries out the data movement operation using the hardware configured on the FPGA. The data for the addition process is read from the FPGA on-board RAM. Once the merging process is completed, the coordinates of the merged block are written to queue Q2. The MPSU repeats the process for all 2x2 blocks read from Q1. The process continues until all the blocks in the queue Q1 are processed, at which time all blocks of size 2x2 have been merged into 4x4 blocks. The MPSU then begins to read the block coordinates from the queue Q2, merges the blocks and writes the resulting coordinates to queue Q1. This process of switching between queue Q1 and queue Q2 is repeated until all the blocks are processed and we have a single entry in one of the queues.
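The alternation between the two queues can be illustrated with a small software sketch. This is only an illustration under stated assumptions: the function names `merge_pass` and `merge_all` are hypothetical, a block is identified by its smallest (x, y) corner, the four blocks that belong together are assumed to be adjacent in the queue, and the number of blocks is assumed to be a power of four. The actual MPSU performs the data movement in FPGA hardware, which is not modeled here.

```python
from collections import deque


def merge_pass(src, dst):
    """Drain src four entries at a time and append one merged entry to dst.

    Assumption: the four blocks forming one larger block are adjacent in
    the queue, and the merged block is named by its smallest (x, y) corner.
    """
    while src:
        group = [src.popleft() for _ in range(4)]
        dst.append((min(x for x, _ in group), min(y for _, y in group)))


def merge_all(coords):
    """Alternate between two queues until a single entry remains.

    Returns the final block coordinate and the number of passes made.
    """
    q1, q2 = deque(coords), deque()
    passes = 0
    while len(q1) > 1:
        merge_pass(q1, q2)
        q1, q2 = q2, q1  # the queue just written becomes the next source
        passes += 1
    return q1[0], passes
```

For an 8x8 data matrix split into sixteen 2x2 blocks, two passes are needed: the first produces four 4x4 blocks, the second produces the single 8x8 block.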
Parallel Hardware-Software Architecture
Figure 1.2 Hardware Software Architecture for DWT
8 FPGA Implementation and Resource Use
The H.O.T Works board from VCC [3] provides onboard RAM, which can be used to store the rMap index. Implementing the rMap index in this way is efficient because the data shifting process can be carried out by an addition/subtraction circuit configured on the FPGA. The rMap index contains a series of (x,y) pairs, each of which points to a specific location in the original data matrix. Figure 1.3 shows the comparative number of data accesses for the conventional RMF and the hardware-software implementation of RMF. We see a substantial decrease in the total number of direct accesses. Although we need to reset the data array to the correct positions after the completion of all blocks of a certain level, we can do so by using the block access mechanism rather than singular data accesses. These block accesses amount to only a fraction of the initial data accesses. The figure below shows the original gray map image along with the reverse-transformed image (.PGM format).
P. Jamkhandi et al.
[Figure panels: Original Image; Re-constructed Image]
Fig. 1.3 Chart showing the reduction in the main memory data accesses. The data accesses are transformed into FPGA board RAM accesses.
References
[1] K. Mukherjee, "Image Compression and Transmission using Wavelets and Vector Quantization", Ph.D. Dissertation, University of Central Florida, 1999.
[2] S. G. Mallat, "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 7, pp. 674-693, July 1989.
[3] VCC H.O.T Works Board Manual.
Fifth International Workshop on High-level Parallel Programming Models and Supportive Environments
HIPS 2000
Chair: Martin Schulz
sch [email protected]
Lehrstuhl für Rechnertechnik und Re
[Figure 3: bar-chart panels. Recoverable panel titles: IBM SP-2: WF/1D/VERT; Cray T3E: WF/1D/HOR; IBM SP-2: WF/1D/HOR; Cray T3E: WF/2D; IBM SP-2: WF/2D; Cray T3E: WF/1D/BOTH; IBM SP-2: WF/1D/BOTH. x-axis: number of processors (4, 16, 64).]
Fig. 3. Performance summary. Kernel names are from Fig. 1. Note that all PGHPF bars are present, but they are very small for WF/1D/HOR and WF/2D.
xlHPF is competitive with C+MPI and ZPL, because it performs pipelining. The single processor bars highlight disparities in local computation performance. ZPL performs considerably better than any of the others for WF/1D/VERT. We hypothesize that the dependences in this kernel thwart proper array access optimization by the xl optimizer (used by both the Fortran and C compilers). The ZPL code does not suffer from this, because its compiler generates direct pointer references rather than using C arrays. When the C+MPI code is modified in this way, its performance matches ZPL. Conversely, ZPL is worse for WF/1D/HOR. Again, we believe this is an optimization issue. When the ZPL code is modified to use C arrays rather than pointer manipulation, it matches HPF. In summary, when we ignore the differences that arise from using C versus Fortran, the kernel performance of C+MPI, xlHPF, and ZPL is comparable. Nevertheless, as stated in the previous section, we found a number of wavefronts that even the xlHPF compiler failed to optimize.
5 Conclusion
We have evaluated the experience and performance of expressing wavefront computations by three different approaches: programmer implemented via message passing, compiler discovered via automatic parallelization, and programmer defined via explicit parallel language features for pipelining. Our study reveals that in developing wavefronts, each approach can produce an efficient solution, but at a cost. The message passing codes took considerably longer to develop and debug than the other approaches. The HPF codes did not reliably perform well. Although one compiler produced efficient code, the other was three orders of magnitude worse. Even the better compiler failed to pipeline some very simple cases. We find that the language-level approach embodied in ZPL simplifies program development and results in good performance that is consistently achieved.
E.C. Lewis and L. Snyder
Acknowledgements. This research was supported in part by a grant of HPC time from the Arctic Region Supercomputing Center.
Specification Techniques for Automatic Performance Analysis Tools
Michael Gerndt, Hans-Georg Esser
Central Institute for Applied Mathematics, Research Centre Juelich
{m.gerndt, h.g.esser}@fz-juelich.de
Abstract. Performance analysis of parallel programs is a time-consuming task and requires a lot of experience. It is the goal of the KOJAK project at the Research Centre Juelich to develop an automatic performance analysis environment. A key requirement for the success of this new environment is its easy integration with already existing tools on the target platform. The design should lead to tools that can be easily retargeted to different parallel machines based on specification documents. This article outlines the features of the APART Specification Language designed for that purpose and demonstrates its applicability in the context of the KOJAK Cost Analyzer, a first prototype tool of KOJAK.
1 Introduction
Current performance analysis tools for parallel programs assist the application programmer in measuring and interpreting performance data. But the application of these tools to real programs is a time-consuming task which requires a lot of experience, and frequently the revealed performance bottlenecks belong to a small number of well-defined performance problems, such as load balancing and excessive message passing overhead. It is the goal of the KOJAK project (Kit for Objective Judgement and Automatic Knowledge-based detection of bottlenecks) at the Research Centre Juelich to develop an environment that automatically reveals well-defined typical bottlenecks [www.fz-juelich.de/zam/kojak]. We designed KOJAK [6] such that it is not implemented for a single target environment only, e.g. the Cray T3E currently installed at our center, but can easily be ported to other target platforms as well. KOJAK will use specification documents to interface to existing performance data supply tools and to specify potential performance problems of the target programming paradigm. In parallel with the development of KOJAK, automatic performance analysis techniques are investigated in the ESPRIT IV Working Group on Automatic Performance Analysis: Resources and Tools (APART) [www.fz-juelich.de/apart]. This article demonstrates the main features of the APART Specification Language (ASL) [3] within the context of the KOJAK Cost Analyzer (COSY) (Section 3). The performance data analyzed in COSY are specified as an ASL object model (Section 4.1) and represented at runtime via a relational database scheme.
J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 269-276, 2000. © Springer-Verlag Berlin Heidelberg 2000
The performance problems COSY is aiming at are specified as ASL performance properties (Section 4.2) based on the performance data model and are implemented via SQL queries (Section 5).
2 Related Work
The use of specification languages in the context of automatic performance analysis tools is a new approach. Paradyn [8] performs an automatic online analysis and is based on dynamic monitoring. While the underlying metrics can be defined via the Metric Description Language (MDL) [9], the set of searched bottlenecks is fixed. It includes CPUbound, ExcessiveSyncWaitingTime, ExcessiveIOBlockingTime, and TooManySmallIOOps. A rule-based specification of performance bottlenecks and of the analysis process was developed for the performance analysis tool OPAL [5] in the SVM-Fortran project. The rule base consists of a set of parameterized hypotheses with proof rules and refinement rules. The proof rules determine whether a hypothesis is valid based on the measured performance data. The refinement rules specify which new hypotheses are generated from a proven hypothesis [4]. Another approach is to define a performance bottleneck as an event pattern in program traces. EDL [1] allows the definition of compound events based on extended regular expressions. EARL [10] describes event patterns in a more procedural fashion as scripts in a high-level event trace analysis language which is implemented as an extension of common scripting languages like Tcl, Perl or Python.
3 Overall Design of the KOJAK Cost Analyzer
COSY [7] analyzes the performance of parallel programs based on performance data of multiple test runs. It identifies program regions, i.e. subprograms, loops, if-blocks, subroutine calls, and arbitrary basic blocks, with high parallelization overhead based on the region's speedup. It explains the parallelization overhead by identifying performance problems and ranking those problems according to their severity. COSY is integrated into the CRAY T3E performance analysis environment. The performance data measured by Apprentice [2] are transferred into a relational database. The implementation of the interface between COSY and Apprentice via the database facilitates the integration with other performance data supply tools on CRAY T3E as well as the integration with other environments. The database includes static program information, such as the region structure and the program source code, as well as dynamic information, such as execution time, number of floating point, integer and load/store operations, and instrumentation overhead. For each subroutine call the execution time as well as the pass count with the mean value and standard deviation, as well as the minimum and maximum values are stored.
After program execution Apprentice is started. Apprentice then computes summary data for program regions, taking into account compiler optimizations. The resulting information is written to a file and transferred into the database. The database includes multiple applications with different versions and multiple test runs per program version. The data model is outlined in Section 4.1. The user interface of COSY allows the user to select a program version and a specific test run. The tool analyzes the dynamic data and evaluates a set of performance properties (Section 4.2). The main property is the total cost of the test run, i.e. the cycles lost in comparison to optimal speedup; other properties explain these costs in more detail. The basis for this computation is the test run with the smallest number of processors. The performance properties are ranked according to their severity and presented to the application programmer.
4 Performance Property Specification
COSY is based on specifications of the performance data and performance properties. The specifications are presented in ASL in the next two subsections. ASL supports the following concepts:
Performance property: A performance property characterizes one aspect of the performance behavior of an application. A property has one or more conditions that can be applied to identify this property. It has a confidence expression that returns a value between zero and one depending on the strength of the indicating condition. Finally, it has a severity expression that returns a numerical value. If the severity of a property is greater than zero, this property has some negative effect on the program's performance.
Performance problem: A performance property is a performance problem iff its severity is greater than a user- or tool-defined threshold.
Bottleneck: A program has a unique bottleneck, which is its most severe performance property. If this bottleneck is not a performance problem, the program does not need any further tuning.
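The relationship between the three concepts can be made concrete with a short sketch. It is written in Python purely for illustration and is not part of ASL or COSY; the record type `PropertyInstance` and the function names are hypothetical, and the sketch assumes that each property has already been evaluated to a condition result, a confidence and a severity.

```python
from collections import namedtuple

# Hypothetical record for one evaluated property:
# holds = result of the condition, confidence in [0, 1], severity >= 0.
PropertyInstance = namedtuple("PropertyInstance", "name holds confidence severity")


def performance_problems(props, threshold):
    """Properties that hold and whose severity exceeds the threshold."""
    return [p for p in props if p.holds and p.severity > threshold]


def bottleneck(props):
    """The most severe property that holds, or None if no property holds."""
    held = [p for p in props if p.holds]
    return max(held, key=lambda p: p.severity, default=None)


def needs_tuning(props, threshold):
    """Tuning is warranted only if the bottleneck is a performance problem."""
    b = bottleneck(props)
    return b is not None and b.severity > threshold
```

With a threshold of 0.2, a program whose most severe property has severity 0.3 needs tuning; with a threshold of 0.5 the same program does not, even though several properties hold.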
The entire specification consists of two sections. The first section models performance data while the second section specifies performance properties based on the data model.
4.1 Data Model
The performance data can be easily modeled via an object-oriented approach. ASL provides constructs to specify classes similar to Java with single inheritance only. Classes in the data model have attributes but no methods, since the specification will not be executed. The ASL syntax is not formally introduced in this article due to space limitations; instead, we present the performance data model used in COSY.
class Program {
  String Name;
  setof ProgVersion Versions;
}

class ProgVersion {
  DateTime Compilation;
  setof Function Functions;
  setof TestRun Runs;
  SourceCode Code;
}
The Program class represents a single application which is identified by its name. COSY can store multiple programs in its database. An object of that class contains a set of ProgVersion objects, each with the compilation timestamp, the source code, the set of functions (static information) and the executed test runs (dynamic information).

class TestRun {
  DateTime Start;
  int NoPe;
  int Clockspeed;
}

class Function {
  String Name;
  setof FunctionCall Calls;
  setof Region Regions;
}
A TestRun object determines the start time and the processor configuration. A Function object specifies the function name, the call sites, and the program regions in this function. All this information is static.

class TotalTiming {
  TestRun Run;
  float Excl;
  float Incl;
  float Ovhd;
}

class Region {
  Region ParentRegion;
  setof TotalTiming TotTimes;
  setof TypedTiming TypTimes;
}
The Region class models a program region with its parent region and its performance data gathered during execution. Performance data are modeled by two classes, according to the internal structure of Apprentice. The TotalTiming class contains the summed up exclusive and inclusive computing time as well as the overhead time. As there may be several test runs, there are also possibly several TotalTiming objects for a region. The TypedTiming class determines the execution time for special types of overhead such as I/O, message passing and barrier synchronization. Apprentice knows 25 such types. As with the TotalTiming objects, there is a set of TypedTiming objects for every test run, but for each region there is at most one object per timing type and per test run.

class TypedTiming {
  TestRun Run;
  TimingType Type;
  float Time;
}

class FunctionCall {
  Function Caller;
  Region CallingReg;
  setof CallTiming Sums;
}
TypedTiming objects have three attributes: the TestRun attribute Run identifies the specific test run of the program, Type (an enumeration type) is the work type that is being considered in this object, and Time is the time spent doing work of this type. Call sites of functions are modeled by the FunctionCall class. A function call has a set of CallTiming objects which store the differences of the individual processes. A CallTiming object is composed of the TestRun it belongs to and the minimum, maximum, mean value, and standard deviation over a) the number of calls and b) the time spent in the function. For the four extremal values, the processor that was first or last in the respective category is recorded. Due to the design of Apprentice, the data model does not make use of inheritance. More complex data models can be found in [3].
4.2 Performance Properties
property       is PROPERTY pp-name '(' arg-list ')' '{'
                    [LET def* IN]
                    pp-condition pp-confidence pp-severity
                  '};'
arg            is type ident
pp-condition   is CONDITION ':' conditions ';'
conditions     is condition
               or condition OR conditions
condition      is ['(' cond-id ')'] bool-expr
pp-confidence  is CONFIDENCE ':' MAX '(' confidence-list ')' ';'
               or CONFIDENCE ':' confidence ';'
confidence     is ['(' cond-id ')' '->'] arith-expr
pp-severity    is SEVERITY ':' MAX '(' severity-list ')' ';'
               or SEVERITY ':' severity ';'
severity       is ['(' cond-id ')' '->'] arith-expr

Fig. 1. ASL property specification syntax.
The property specification (Figure 1) defines the name of the property, its context via a list of parameters, and the condition, confidence, and severity expressions. The property specification is based on a set of parameters. These parameters specify the property's context and parameterize the expressions. The context specifies the environment in which the property is evaluated, e.g. the program region and the test run. The condition specification consists of a list of conditions. A condition is a predicate that can be prefixed by a condition identifier. The identifiers have to be unique with respect to the property, since the confidence and severity specifications can refer to the conditions via those condition identifiers.
The confidence specification is an expression that computes the maximum of a list of confidence values. Each confidence value is computed via an arithmetic expression resulting in a value in the interval of zero and one. The value can be guarded by a condition identifier introduced in the condition specification. The condition identifier represents the value of the condition. The severity specification has the same structure as the confidence specification. It computes the maximum of the individual severity values of the conditions. The following example properties are checked by COSY. They demonstrate the ASL language features. Most of the property specifications make use of the following two functions:

TotalTiming Summary(Region r, TestRun t) = UNIQUE({s IN r.TotTimes WITH s.Run==t});
float Duration(Region r, TestRun t) = Summary(r,t).Incl;
The first function, Summary, takes a Region r and a TestRun t and returns the unique TotalTiming object that is a member of r.TotTimes and belongs to that test run. The second function, Duration, uses Summary to extract the total execution time of the specified region in the specified test run. Note that all timings in the database are summed up values of all processes. The first property, SublinearSpeedup, determines the lost cycles in relation to the test run with the minimal number of processors.

Property SublinearSpeedup(Region r, TestRun t, Region Basis) {
  LET
    TotalTiming MinPeSum = UNIQUE({sum IN r.TotTimes WITH sum.Run.NoPe == MIN(s.Run.NoPe WHERE s IN r.TotTimes)});
    float TotalCost = Duration(r,t) - Duration(r,MinPeSum.Run)
  IN
    CONDITION: TotalCost>0;
    CONFIDENCE: 1;
    SEVERITY: TotalCost/Duration(Basis,t);
}
The property is based on the total costs, i.e. the lost cycles compared to a reference run with the smallest number of processors. If TotalCost is greater than zero, the region has the SublinearSpeedup property. The confidence value, which is one in all examples here, might be lower than one if the condition is only an indication for that property. The severity of the SublinearSpeedup property is determined as the fraction of the total costs compared to the duration of Basis in that test run.

Property MeasuredCost(Region r, TestRun t, Region Basis) {
  LET
    float Cost = Summary(r,t).Ovhd;
  IN
    CONDITION: Cost > 0;
    CONFIDENCE: 1;
    SEVERITY: Cost / Duration(Basis,t);
}
The total costs can be split up into measured and unmeasured costs. The MeasuredCost property determines that more detailed information might be available (Summary(r,t).Ovhd is the overhead measured by Apprentice). If the severity of its counterpart, the UnmeasuredCost, is much higher, the reason cannot be found with the available data.

Property SyncCost(Region r, TestRun t, Region Basis) {
  LET
    float Barrier = SUM(tt.Time WHERE tt IN r.TypTimes AND tt.Run==t AND tt.Type == Barrier);
  IN
    CONDITION: Barrier > 0;
    CONFIDENCE: 1;
    SEVERITY: Barrier / Duration(Basis,t);
}
The SyncCost property determines that barrier synchronization is a reason for overhead in that region. Its severity depends on the time spent for barrier synchronization in relation to the execution time of the ranking basis.

Property LoadImbalance(FunctionCall Call, TestRun t, Region Basis) {
  LET
    CallTiming ct = UNIQUE({c IN Call.Sums WITH c.Run == t});
    float Dev = ct.StdevTime;
    float Mean = ct.MeanTime;
  IN
    CONDITION: Dev > ImbalanceThreshold * Mean;
    CONFIDENCE: 1;
    SEVERITY: Mean / Duration(Basis,t);
}
The LoadImbalance property is a refinement of the SyncCost property. It is evaluated only for calls to the barrier routine. If the deviation is significant, the barrier costs result from load imbalance.
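To make the evaluation of these properties concrete, the following sketch re-expresses Summary, Duration and three of the properties in Python. It is an illustration only, not the COSY implementation (which is written in Java and evaluates the properties through SQL). The tuple types, the lower-case field names and the `threshold` parameter standing in for ImbalanceThreshold are assumptions made for this example; each property function returns its condition, confidence and severity as a triple.

```python
from collections import namedtuple

# Simplified stand-ins for the ASL classes (field names are assumptions).
Run = namedtuple("Run", "name no_pe")                      # ASL: TestRun
TotalTiming = namedtuple("TotalTiming", "run excl incl ovhd")
TypedTiming = namedtuple("TypedTiming", "run type time")
CallTiming = namedtuple("CallTiming", "run mean_time stdev_time")
Region = namedtuple("Region", "name tot_times typ_times")


def unique(items):
    """ASL UNIQUE: the selected set must contain exactly one element."""
    items = list(items)
    if len(items) != 1:
        raise ValueError("expected exactly one element, got %d" % len(items))
    return items[0]


def summary(region, run):
    return unique(s for s in region.tot_times if s.run == run)


def duration(region, run):
    return summary(region, run).incl


def sublinear_speedup(region, run, basis):
    """Lost time relative to the run with the fewest processors."""
    min_pe = min(s.run.no_pe for s in region.tot_times)
    ref = unique(s for s in region.tot_times if s.run.no_pe == min_pe)
    total_cost = duration(region, run) - duration(region, ref.run)
    return total_cost > 0, 1.0, total_cost / duration(basis, run)


def sync_cost(region, run, basis):
    """Barrier time of the region relative to the ranking basis."""
    barrier = sum(tt.time for tt in region.typ_times
                  if tt.run == run and tt.type == "Barrier")
    return barrier > 0, 1.0, barrier / duration(basis, run)


def load_imbalance(call_sums, run, basis, threshold):
    """threshold plays the role of ImbalanceThreshold (value not given in the text)."""
    ct = unique(c for c in call_sums if c.run == run)
    holds = ct.stdev_time > threshold * ct.mean_time
    return holds, 1.0, ct.mean_time / duration(basis, run)
```

For a region with a summed inclusive time of 100 on 4 processors and 130 on 16 processors, the total cost of the 16-processor run is 30, and its severity relative to the region itself is 30/130.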
5 Implementation
The design and implementation of COSY ensures portability and extensibility. The design requires that the performance data supply tools are extended such that the information can be inserted into the database. This extension was implemented for Apprentice with the help of Cray Research. The database interface is based on standard SQL and therefore any relational database can be utilized. We ran experiments with four different databases: Oracle 7, MS Access, MS SQL server, and Postgres. For all those databases, except MS Access, the setup was in a distributed fashion. The data were transferred over the network to the database server. While Oracle was a factor of 2 slower than MS SQL server and Postgres, MS Access outperformed all those systems. Insertion of performance information was a factor of 20 faster than with the Oracle server. COSY is implemented in Java and is thus portable to any Java environment. It uses the standard JDBC interface to access the database. Accessing the database via JDBC is a factor of two to four slower than C-based implementations (fetching a record from the Oracle server takes about 1 ms), but the portability of the implementation outweighs the performance drawbacks. The overall performance depends very much on the work distribution between the client and the database. It is a significant advantage to translate the conditions of performance properties entirely into SQL queries instead of first accessing the data components and evaluating the expressions in the analysis tool.
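The advantage of pushing the evaluation into the database can be illustrated with a small sketch that evaluates and ranks the SyncCost property for all regions of one test run in a single SQL statement. The sketch uses Python with the standard sqlite3 module as a stand-in for the Java/JDBC setup described above; the table and column names (`total_timing`, `typed_timing`, and so on) and the sample data are hypothetical and do not reproduce the COSY database scheme.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a hypothetical, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE total_timing (region TEXT, run INTEGER, incl REAL, ovhd REAL);
        CREATE TABLE typed_timing (region TEXT, run INTEGER, type TEXT, time REAL);
    """)
    conn.executemany(
        "INSERT INTO total_timing VALUES (?, ?, ?, ?)",
        [("main", 1, 200.0, 30.0), ("solve", 1, 150.0, 25.0), ("init", 1, 50.0, 5.0)],
    )
    conn.executemany(
        "INSERT INTO typed_timing VALUES (?, ?, ?, ?)",
        [("solve", 1, "Barrier", 20.0), ("init", 1, "Barrier", 4.0),
         ("solve", 1, "IO", 3.0), ("solve", 2, "Barrier", 99.0)],
    )
    return conn


def rank_sync_cost(conn, run_id, basis_region):
    """Evaluate SyncCost for every region of one run inside the database.

    The condition (barrier time > 0), the severity (barrier time divided by
    the duration of the ranking basis) and the ranking are all expressed in
    SQL, so only the final (region, severity) rows reach the client.
    """
    sql = """
        SELECT tt.region, SUM(tt.time) / b.incl AS severity
        FROM typed_timing AS tt
        JOIN total_timing AS b
          ON b.run = tt.run AND b.region = ?
        WHERE tt.run = ? AND tt.type = 'Barrier'
        GROUP BY tt.region, b.incl
        HAVING SUM(tt.time) > 0
        ORDER BY severity DESC
    """
    return conn.execute(sql, (basis_region, run_id)).fetchall()
```

Calling `rank_sync_cost(make_demo_db(), 1, "main")` returns one row per region with barrier time, most severe first. The alternative, fetching every timing record and computing the sums in the client, would transfer far more rows for the same result.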
6 Conclusion and Future Work
This article presented a novel design for performance analysis tools. As an example, COSY, a prototype component of the KOJAK environment, was presented. The design enables excellent portability and integration into existing performance environments. The performance data and the performance properties are described in ASL and can therefore easily be adapted to other environments. For this prototype, the specification is manually translated into a relational database scheme, and the evaluation of the conditions and the severity expressions of the performance properties is transformed into appropriate SQL queries and ranking code by the tool developer. In the future, we will investigate techniques for the automatic generation of the database design from the performance property specification and the automatic translation of the property description into executable code.
References
1. P. Bates, J.C. Wileden: High-Level Debugging of Distributed Systems: The Behavioral Abstraction Approach, The Journal of Systems and Software, Vol. 3, pp. 255-264, 1983
2. CRAY Research: Introducing the MPP Apprentice Tool, Cray Manual IN-2511, 1994
3. Th. Fahringer, M. Gerndt, G. Riley, J.L. Träff: Knowledge Specification for Automatic Performance Analysis, to appear: APART Technical Report, Forschungszentrum Jülich, FZJ-ZAM-IB-9918, 1999
4. M. Gerndt, A. Krumme: A Rule-based Approach for Automatic Bottleneck Detection in Programs on Shared Virtual Memory Systems, Second Workshop on High-Level Programming Models and Supportive Environments (HIPS '97), in combination with IPPS '97, IEEE, 1997
5. M. Gerndt, A. Krumme, S. Ozmen: Performance Analysis for SVM-Fortran with OPAL, Proceedings Int. Conf. on Parallel and Distributed Processing Techniques and Applications (PDPTA'95), Athens, Georgia, pp. 561-570, 1995
6. M. Gerndt, B. Mohr, F. Wolf, M. Pantano: Performance Analysis on CRAY T3E, Euromicro Workshop on Parallel and Distributed Processing (PDP '99), IEEE Computer Society, pp. 241-248, 1999
7. A. Lucas: Basiswerkzeuge zur automatischen Auswertung von Apprentice-Leistungsdaten, Diploma Thesis, RWTH Aachen, Internal Report Forschungszentrums Jülich Jül-3652, 1999
8. B.P. Miller, M.D. Callaghan, J.M. Cargille, J.K. Hollingsworth, R.B. Irvin, K.L. Karavanic, K. Kunchithapadam, T. Newhall: The Paradyn Parallel Performance Measurement Tool, IEEE Computer, Vol. 28, No. 11, pp. 37-46, 1995
9. Paradyn Project: Paradyn Parallel Performance Tools: User's Guide, Paradyn Project, University of Wisconsin Madison, Computer Sciences Department, 1998
10. F. Wolf, B. Mohr: EARL - A Programmable and Extensible Toolkit for Analyzing Event Traces of Message Passing Programs, 7th International Conference on High-Performance Computing and Networking (HPCN'99), A. Hoekstra, B. Hertzberger (Eds.), Lecture Notes in Computer Science, Vol. 1593, pp. 503-512, 1999
PDRS: A Performance Data Representation System*
Xian-He Sun (1, 2) and Xingfu Wu (3, 1)
1 Dept. of Computer Science, Louisiana State University, Baton Rouge, LA 70803
2 Dept. of Computer Science, Illinois Institute of Technology, Chicago, IL 60616
3 Dept. of Electrical and Computer Engineering, Northwestern University, Evanston, IL 60208
[email protected] [email protected]
Abstract. We present the design and development of a Performance Data Representation System (PDRS) for scalable parallel computing. PDRS provides decision support that helps users find the right data to understand their programs' performance and to select appropriate ways to display and analyze it. PDRS is an attempt to provide appropriate assistance to help programmers identify performance bottlenecks and optimize their programs.
1 Introduction Many performance measurement systems have been developed in recent years. While these systems are important, their practical usefulness relies on an appropriate understanding of the measured data. When monitoring a complex parallel program, the amount of performance data collected may be very huge. This huge amount of performance data needs to be processed for further performance evaluation and analysis. A general performance measurement system always provides a facility that assists manipulation of this performance data. Data manipulation functions are often dependent on performance data organization and representation. The difficulty in providing an adequate performance environment for high performance computing is the lack of appropriate models, representations and associated evaluation methods to understand measured data and locate performance bottlenecks. Performance Data Representation System (PDRS) proposed in this paper is designed to attack this difficulty. PDRS is a general-purpose integrated system supported by performance database representation and the combination of performance visualization and auralization. It is based on our recent success in automatic performance evaluation and prediction. Many performance measurement systems exist right now [3, 4, 5]. While these performance systems have made their contribution to the state-of-the-art of performance *
* This work was supported in part by the National Science Foundation under NSF grants ASC-9720215 and CCR-9972251.
J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 277-284, 2000. © Springer-Verlag Berlin Heidelberg 2000
X.-H. Sun and X. Wu
evaluation, none of them has adequately addressed the issue of data presentation and understanding. With advances in performance measurement and visualization techniques, and the increased use of large, scalable computing systems, data presentation and management become increasingly important. PDRS is a post-execution performance data representation system designed for scalable computing, and is distinct from existing performance systems. First, while it supports conventional visualization views, its design is based on the most recent analytical results in scalability and statistical analysis, in order to reveal the scaling properties of a scalable computing system. Second, the system uses a relational database, SQL and Java JDBC techniques so that performance information is easily retrieved, compared and displayed. Because of the complexity and volume of the data involved in a performance database, it is natural to exploit a database management system (DBMS) to archive and retrieve performance data. A DBMS helps not only in managing the performance data, but also in ensuring that the various kinds of performance information can be presented to users in a reasonable format. Third, the system is implemented with a combination of performance visualization and auralization techniques and object-oriented Java techniques so that it is easy for users to understand and use. Finally, the system supports the SDDF data format. It can be used either as a stand-alone application or integrated into other existing performance environments.
2 Design and Implementation of PDRS

Figure 2.1 depicts the design framework of PDRS. The technical approaches used to develop these components are discussed below section by section.
2.1 Trace Data Module

This module is in charge of collecting the original performance data of parallel programs and storing it in SDDF [1]. The large volume of data involved in parallel computations requires that the instrumentation collect data selectively and intelligently. One way to collect data from a parallel program is to instrument the program executable so that, when the program runs, it generates the desired information. PDRS is designed to use the Scala Instrumentation System (SIS) [11] to obtain the SDDF trace data file. PDRS also provides a general interface that can be used with any system that provides the SDDF trace data interface.

2.2 Data Management Module

This module is in charge of performance data filtering and mapping.
PDRS: A Performance Data Representation System
Event histories of parallel programs are valuable information sources for performance analysis, but the problem is how to extract the useful information from massive amounts of low-level event traces. Our system filters the data in preparation for storing the event history in a relational database. SDDF is a trace description language that specifies both data record structures and data record instances. We are building a performance database based on the SDDF specification. Our data management module is being implemented in the Oracle DBMS.
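As a rough illustration of the filtering step, the sketch below keeps only those trace events whose kind is of interest and whose duration reaches a threshold, before they would be loaded into the database. The `TraceEvent` and `TraceFilter` classes, their fields, the event labels and the threshold are all hypothetical; they are not taken from the SDDF specification or from the PDRS implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory form of one low-level trace event (not an SDDF type).
final class TraceEvent {
    final String kind;    // illustrative labels such as "send", "recv", "io"
    final int processor;  // processor on which the event occurred
    final double start;   // timestamps in seconds
    final double end;

    TraceEvent(String kind, int processor, double start, double end) {
        this.kind = kind;
        this.processor = processor;
        this.start = start;
        this.end = end;
    }

    double duration() {
        return end - start;
    }
}

final class TraceFilter {
    // Keep events whose kind is in `kinds` and that last at least
    // `minDuration` seconds; everything else is dropped before loading.
    static List<TraceEvent> filter(List<TraceEvent> events,
                                   Set<String> kinds,
                                   double minDuration) {
        List<TraceEvent> kept = new ArrayList<TraceEvent>();
        for (TraceEvent e : events) {
            if (kinds.contains(e.kind) && e.duration() >= minDuration) {
                kept.add(e);
            }
        }
        return kept;
    }
}
```

Filtering in memory before the load keeps the relational tables small; the same selection could instead be expressed as a query after loading, at the cost of storing every low-level event.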
Figure 2.1 Design framework of PDRS

2.3 Performance Database

We classify the performance data saved in the SDDF trace files into five groups: processor information, memory information, program information, communication information and I/O information. Each group is represented as an entity relation in the performance database. An individual event in a relation is treated as a tuple with a unique identifier. Information is retrieved through relational database queries. The example below shows how objects can be retrieved using JDBC [13]. For instance, suppose that we want the communication events that occurred on processor 0. The query is:

    select sourcePE, destinationPE, messageLength, event_startTimestamp, event_endTimestamp from Communication Information where processor = 0
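We may issue this SQL query through JDBC as follows:

    // stmt is a java.sql.Statement. The table name is written as it appears in
    // the text; an unquoted name containing a space must be quoted or renamed
    // to be valid in most DBMSs.
    ResultSet rs = stmt.executeQuery(
        "select sourcePE, destinationPE, messageLength, "
        + "event_startTimestamp, event_endTimestamp "
        + "from Communication Information where processor = 0");
    while (rs.next()) {
        // Read each column of the current row by name.
        Object i1 = rs.getObject("sourcePE");
        Object i2 = rs.getObject("destinationPE");
        Object r1 = rs.getObject("messageLength");
        Object r2 = rs.getObject("event_startTimestamp");
        Object r3 = rs.getObject("event_endTimestamp");
    }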
Multiple versions of performance data are handled by including a version attribute in each tuple. By specifying a version number in each database query, we can retrieve multiple versions of program performance for comparison. In addition to the default PDRS performance parameters, users can add new performance parameters, such as sound files, which are then supported by the database.

2.4 Relational Queries Module

This module includes four parts: Symbolic Analysis, Statistical Analysis, Scalability Analysis, and Performance Model Generator. The module is being implemented with JDBC. Its structure is shown in Figure 2.2. The Java applications include the PDA, PVA, and GUI modules, all implemented in Java. JDBC provides a bridge between the Java applications and the performance database.
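A Java application might use the version attribute through JDBC roughly as in the following sketch, which parameterizes both the processor and the version number. The class name, the single-word table name `CommunicationInformation` and the column name `version` are assumptions made for illustration; the text only states that each tuple carries a version attribute.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

final class VersionedCommunicationQuery {
    // Assumed names: table CommunicationInformation, column version.
    static final String SQL =
        "select sourcePE, destinationPE, messageLength, "
      + "event_startTimestamp, event_endTimestamp "
      + "from CommunicationInformation "
      + "where processor = ? and version = ?";

    // Returns one Object[] of five column values per matching event.
    static List<Object[]> fetch(Connection conn, int processor, int version)
            throws SQLException {
        List<Object[]> rows = new ArrayList<Object[]>();
        try (PreparedStatement stmt = conn.prepareStatement(SQL)) {
            stmt.setInt(1, processor);
            stmt.setInt(2, version);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    Object[] row = new Object[5];
                    for (int i = 0; i < row.length; i++) {
                        row[i] = rs.getObject(i + 1); // JDBC columns are 1-based
                    }
                    rows.add(row);
                }
            }
        }
        return rows;
    }
}
```

Calling `fetch` twice with the same processor and two different version numbers would yield the two result lists to compare. Binding the values as parameters, rather than concatenating them into the SQL text, lets the statement be reused across versions.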
[Figure 2.2 diagram: Java Applications, JDBC Driver(s), database access, Performance Database]
Figure 2.2 Relational Queries Module

We use symbolic evaluation [2, 6], which combines data and control flow analysis to determine variable values, assumptions about and constraints between variable values, and the conditions under which control flow reaches a program statement. Computations are represented as symbolic expressions defined over the program's problem and machine size. Each program variable is associated with a symbolic expression describing its value at a specific program point.

Statistical Analysis determines code and/or machine effects, finds the correlation between program phases, identifies the scaling behavior of "difficult segments", and provides statistical performance data [12] for the PDA (Performance Diagnostic Agent)
module and GUI. The scalability analysis is based on newly developed algorithms for predicting performance in terms of execution time and scalability of a code-machine combination [8, 9, 11, 15]. Analytical and experimental results show that scalability combined with initial execution time can provide good performance prediction in terms of execution time. In addition, crossing-point analysis [9] finds the fast/slow performance crossing points of parallel programs and machines. In contrast with execution time, which is measured for a particular pair of problem and system sizes, range comparison compares performance over a wide range of ensemble and problem sizes via scalability and crossing-point analysis.

In addition to high-level performance prediction, PDRS also supports low-level performance analysis to identify performance bottlenecks and hardware constraints based on performance models chosen by the user. For example, we have proposed an empirical memory model based on a simplified mean value parameterization [14] to separate CPU execution time from stall time due to memory loads/stores. From traced information or information from the analysis modules, performance models can be generated to predict performance at the component level as well as overall performance.

2.5 Performance Diagnostic Agent (PDA) Module
This module provides performance advice to help users find performance bottlenecks in their application programs. It also provides performance comparisons and suggestions based on real performance results and predicted performance ranges. The PDA is based on our approaches to statistical analysis, scalability analysis and the performance model generator. The module operates according to the following algorithm.

Algorithm (Performance diagnosis):
  Performance analysis requests;
  switch (analysis type) {
    Statistical:
      Retrieve the performance information required;
      Get or compute the predicted performance range;
      Compute the real result of the requested performance parameter;
      Compare the result with the performance range;
      If (the result is not in the performance range)
        Give an explanation (using graphics and sound);
      break;
    Scalability:
      Retrieve the performance information required;
      Get or compute the predicted scalability results;
      Compute the real scalability results;
      Compare the real result with the predicted results;
      Explain the compared results (using graphics and sound);
      break;
    Models:
      Retrieve the performance information required;
      Get the predicted performance range;
      Compute the real result of the requested performance parameter;
      Compare the result with the performance range;
      If (the result is not in the performance range)
        Give an explanation (using graphics and sound);
      break;
    Default:
      printf("No such analysis type");
      break;
  }
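The control flow of the diagnosis algorithm can be sketched in Java as follows. This is only an outline under stated assumptions: the class and method names are hypothetical, the retrieval and prediction steps are replaced by values passed in as arguments, the predicted scalability results are treated as an interval, and the explanation is returned as text where PDRS would use graphics and sound.

```java
final class DiagnosticSketch {
    // measured: the real result computed from the performance database.
    // low, high: bounds of the predicted range (assumed to be given).
    // Returns an explanation, or an empty string when none is needed.
    static String diagnose(String analysisType, double measured,
                           double low, double high) {
        // Position of the measured value relative to the predicted range.
        String position = measured < low ? "below"
                        : (measured > high ? "above" : "within");
        switch (analysisType) {
            case "statistical":
            case "models":
                // These branches explain only when the result is out of range.
                return position.equals("within")
                    ? ""
                    : "measured value is " + position + " the predicted range";
            case "scalability":
                // The scalability branch always explains the comparison.
                return "measured scalability is " + position
                    + " the predicted results";
            default:
                return "No such analysis type";
        }
    }
}
```

For example, `DiagnosticSketch.diagnose("statistical", 12.0, 5.0, 10.0)` reports that the measured value lies above the predicted range, while the same call with a measured value of 7.0 returns an empty string.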
In this algorithm, the PDA provides suggestions and explanations when performance bottlenecks occur. Based on the statistical analysis, the PDA retrieves the performance information from the performance database and may then provide advice about program performance.

2.6 Performance Visualization and Auralization (PVA) Module and Graphical User Interface Module
The PVA module provides graphical displays of performance information about users' application programs and platforms. It is natural to use different visual objects to represent various performance data and to use visualization techniques to gain insight into the execution of parallel programs so that their performance may be understood and improved. The basic goal of this module is to use graphics and sound (Java 2D, Java 3D and JavaSound) to present advice and performance views about application programs. For example, based on performance comparison, some performance bottlenecks can be shown graphically. Suggestions can also be given graphically, such as which applications are suitable for a platform, which platforms are suitable for an application, and how to modify an application program to suit a platform. When performance bottlenecks occur, sound is used to alert users to the performance problem in their application programs. The sound files are stored in the performance database.

The Graphical User Interface module is an integrated, user-friendly graphical interface. It integrates all the functions of the PVA module and directly displays the performance data requested by users. Figures 2.3 and 2.4 are two views of the PDRS GUI. Figure 2.3 shows a speed comparison of the PDD and PT algorithms [7]. Figure 2.4 shows the Kiviat graph for performance comparison.
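A Kiviat graph places one axis per performance metric at equal angles around a common center and joins the points marked on neighboring axes into a polygon. The sketch below computes those polygon vertices. It is an illustration only: the class name is hypothetical, the metric values are assumed to be already normalized to the interval [0, 1], and the actual drawing with Java 2D is left out.

```java
final class KiviatSketch {
    // values[i] is the i-th metric, assumed normalized to [0, 1].
    // radius is the axis length in pixels.
    // Returns one {x, y} pair per axis, relative to the center of the graph.
    static double[][] vertices(double[] values, double radius) {
        int n = values.length;
        double[][] points = new double[n][2];
        for (int i = 0; i < n; i++) {
            double angle = 2.0 * Math.PI * i / n; // axes are evenly spaced
            points[i][0] = radius * values[i] * Math.cos(angle);
            points[i][1] = radius * values[i] * Math.sin(angle);
        }
        return points;
    }
}
```

Two program versions or two platforms can then be compared by computing one polygon for each and overlaying them on the same axes.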
Figure 2.3 Speed comparison of PDD and PT algorithms
Figure 2.4 Kiviat Graph for Performance Comparison
3 Summary

We have presented the design of a Performance Data Representation System (PDRS), building on our success in developing the SCALA [10, 11] performance system for scalable parallel processing. While PDRS has not been fully implemented at this time, some of its key components have been implemented and tested, and the implementation results are very encouraging. PDRS emphasizes performance data representation using a relational database. Integrated into an advanced restructuring compilation and performance analysis system, the proposed PDRS attempts to lift performance evaluation systems to a new level. It is designed to give developers guidelines on performance optimization, to assist purchasers in selecting the systems best suited to their needs, and to give vendors valuable feedback on bottlenecks that can be alleviated in future products. It has the potential to provide users with much more useful information than existing performance systems. Many advanced technologies, such as database management, object-oriented programming, visualization and auralization, are used in PDRS. The integration of these technologies into a compilation and performance analysis system is new and very challenging. It can motivate many new
research and development issues. PDRS is only a first step toward automatic performance analysis and optimization.
References

1. R. Aydt, The Pablo Self-Defining Data Format, Department of Computer Science, University of Illinois, April 1995, ftp://bugle.cs.uiuc.edu/pub/Release/Documentation/SDDF.ps.
2. T. Fahringer and B. Scholz, Symbolic evaluation for parallelizing compilers, in Proc. of the 11th ACM International Conference on Supercomputing, Vienna, Austria, ACM Press, July 1997, 261-268.
3. J. Kohn and W. Williams, ATExpert, Journal of Parallel and Distributed Computing 18, 1993, 205-222.
4. A.D. Malony and G.V. Wilson, Future directions in parallel performance environment, Performance Measurement and Visualization of Parallel Systems, Eds: G. Haring and G. Kotsis, Elsevier Science Publishers B.V., 1993, 331-351.
5. B.P. Miller, M.D. Callaghan, J.M. Cargille, J.K. Hollingsworth, R.B. Irvin, K.L. Karavanic, K. Kunchithapadam, and T. Newhall, The Paradyn parallel performance measurement tools, IEEE Computer 28, 11, 1995.
6. M. Scheibl, A. Celic, and T. Fahringer, Interfacing Mathematica from the Vienna Fortran Compilation System, Technical Report, Institute for Software Technology and Parallel Systems, Univ. of Vienna, December 1996.
7. X.-H. Sun, H. Zhang, and L. Ni, Efficient tridiagonal solvers on multicomputers, IEEE Transactions on Computers 41, 3 (1992), 286-296.
8. X.-H. Sun and D. Rover, Scalability of parallel algorithm-machine combinations, IEEE Transactions on Parallel and Distributed Systems, June 1994, 599-613.
9. X.-H. Sun, Performance range comparison via crossing point analysis, Lecture Notes in Computer Science 1388 (J. Rolim, ed.), Springer, March 1998.
10. X.-H. Sun, T. Fahringer, M. Pantano, and Z. Zhan, SCALA: A performance system for scalable computing, in Proc. of the Workshop on High-Level Parallel Programming Models & Supportive Environments, Lecture Notes in Computer Science 1586, Springer, April 1999.
11. X.-H. Sun, M. Pantano, and T. Fahringer, Integrated range comparison for data-parallel compilation systems, IEEE Transactions on Parallel and Distributed Systems, Vol. 10, May 1999, 448-458.
12. X.-H. Sun, D. He, K. Cameron, and Y. Luo, A Factorial Performance Evaluation for Hierarchical Memory Systems, in Proc. of the IEEE Int'l Parallel Processing Symposium '99, April 1999.
13. Sun Microsystems Inc., JDBC: A Java SQL API, Version 1.20, http://www.javasoft.com/products/jdbc/index.html, January 1997.
14. M. V. Vernon, E. D. Lazowska, and J. Zahorjan, An accurate and efficient performance analysis technique for multi-processor snooping cache-consistency protocols, in Proc. 15th Annual Symp. Computer Architecture, Honolulu, HI, June 1988, 308-315.
15. Xingfu Wu, Performance Evaluation, Prediction, and Visualization of Parallel Systems, Kluwer Academic Publishers, Boston, ISBN 0-7923-8462-8, 1999.
Clix* - A Hybrid Programming Environment for Distributed Objects and Distributed Shared Memory

Frank Mueller1, Jorg Nolte2, and Alexander Schlaefer3

1 Humboldt University Berlin, Institut f. Informatik, 10099 Berlin, Germany
2 GMD FIRST, Rudower Chaussee 5, D-12489 Berlin, Germany
3 University of Washington, CSE, Box 352350, Seattle, WA 98195-2350, USA
mueller@informatik.hu-berlin.de, phone: (+49) (30) 2093-3011, fax: -3010
Abstract. Parallel programming with distributed object technology becomes increasingly popular but shared-memory programming is still a common way of utilizing parallel machines. In fact, both models can coexist fairly well and software DSM systems can be constructed easily using distributed object systems. In this paper, we describe the construction of a hybrid programming platform based on the ARTS distributed object system. We describe how an object-oriented design approach provides a compact and flexible description of the system components. A sample implementation demonstrates that three classes of less than 100 lines of code each suffice to implement sequential consistency.
1 Introduction
Object-oriented programming and distributed object technology are considered to be state of the art in distributed as well as parallel computing. However, typical numerical data structures like huge arrays or matrices are hard to represent in a distributed object paradigm. Such data structures usually cannot be represented as single objects because this leads to extremely coarse-grained programs, thus limiting parallel execution. On the other hand, it is not feasible to represent, e.g., each array element as a remote object because remote object invocation