Euro-Par 2010 - Parallel Processing: 16th International Euro-Par Conference, Ischia, Italy, August 31 - September 3, 2010, Proceedings, Part II (Lecture Notes in Computer Science, 6272) 3642152902, 9783642152900

This book constitutes the refereed proceedings of the 16th International Euro-Par Conference held in Ischia, Italy, in August/September 2010.


English Pages 569 [570] Year 2010


Table of contents:
Title Page
Preface
Organization
Table of Contents – Part II
Topic 9: Parallel and Distributed Programming
Parallel and Distributed Programming
Transactional Mutex Locks
Introduction
The TML Algorithm
Boundary and Per-Access Instrumentation
Implementation Issues
Programmability
Performance Considerations
Compiler Support
Evaluation
List Traversal
Red-Black Tree Updates
Write-Dominated Workloads
Conclusions
References
Exceptions for Algorithmic Skeletons
Introduction
Related Work
Algorithmic Skeletons in a Nutshell
Exceptions for Algorithmic Skeletons
Exception Semantics
Illustrative Example
Exceptions in Skandium Library
How Exceptions Cancel Parallelism
Skandium API with Exception
Exception Handler
High-Level Stack Trace
Conclusions
References
Generators-of-Generators Library with Optimization Capabilities in Fortress
Introduction
Motivating Example and Problems We Tackle
Composing Parallel Skeletons to Create Naive Program
Applying Theorem to Derive Efficient Parallel Program
Problems We Tackled
GoG Library in Fortress
GoG: Generation and Consumption of Nested Data Structures
Optimization Mechanism in GoGs
Growing GoG Library
Programming Examples and Experimental Results
Example Programming with GoG Library
Experimental Results
Related Work
Conclusion
References
User Transparent Task Parallel Multimedia Content Analysis
Introduction
User Transparent Parallel Programming Tools
Parallel-Horus
System Design
The Task Graph
Graph Execution
A Line Detection Application
Curvilinear Structure Detection
Evaluation
Conclusions and Future Work
References
Parallel Simulation for Parameter Estimation of Optical Tissue Properties
Introduction
Tissue Optics
Challenges and Approach
Challenges
Our Approach
Parallelisation and Computational Considerations
Coarse-Grained Parallelism Using Parallel MATLAB
Fine-Grained Parallelism Using Graphics Hardware Acceleration
Performance Results
Conclusions
References
Topic 10: Parallel Numerical Algorithms
Parallel Numerical Algorithms
Scalability and Locality of Extrapolation Methods for Distributed-Memory Architectures
Introduction
Implementation of Extrapolation Methods for ODE Systems with Arbitrary Access Structure
Algorithmic Structure and Sequential Implementation
Exploiting Data and Task Parallelism
Implementations Specialized in ODE Systems with Limited Access Distance
Reducing Communication Costs
Optimization of the Locality Behavior
Experimental Evaluation
Sequential Performance and Locality of Memory References
Parallel Speedups and Scalability
Conclusions
References
CFD Parallel Simulation Using Getfem++ and Mumps
Introduction
The Getfem++ Library
The Model Description
Parallelization under Getfem++
Linear Algebra Procedures
Mumps Library
Navier-Stokes Simulation to Study the Transition to Turbulence in the Wake of a Circular Cylinder
Time Discretization
Space Discretization Using Getfem++
Parallel Experiments
References
Aggregation AMG for Distributed Systems Suffering from Large Message Numbers
Introduction
Algorithm and Implementation
Parallel Implementation of AMG
Pairwise Aggregation AMG
Smoothed Aggregation AMG
Combination of Pairwise Aggregation AMG and Smoothed Aggregation
Benchmarks on Computations of Flows in an Engine
Background and Details of the Simulations
Results of the Benchmarks
Conclusions
References
A Parallel Implementation of the Jacobi-Davidson Eigensolver and Its Application in a Plasma Turbulence Code
Introduction
The Jacobi-Davidson Method
Implementation Description
Overview of SLEPc
Parallelization Details
Solution of the Correction Equation
GENE
Results
Conclusions
References
Scheduling Parallel Eigenvalue Computations in a Quantum Chemistry Code
Introduction
Description of the Problem
Malleable Parallel Task Scheduling
Related Work
The Algorithm
Cost Function
Evaluation
Conclusion and Future Work
References
Scalable Parallelization Strategies to Accelerate NuFFT Data Translation on Multicores
Introduction
Background
Data Translation Algorithm
Basic Data Translation Procedure
Data Translation Parallelization
Geometric Tiling and Binning
Parallelization Strategies
Experimental Evaluation
Concluding Remarks
References
Topic 11: Multicore and Manycore Programming
Multicore and Manycore Programming
JavaSymphony: A Programming and Execution Environment for Parallel and Distributed Many-Core Architectures
Introduction
Related Work
JavaSymphony
Dynamic Virtual Architectures
JavaSymphony Objects
Object Agent System
Synchronisation Mechanism
Locality Control
Matrix Transposition Example
Experiments
Discrete Cosine Transformation
NAS Parallel Benchmarks: CG Kernel
3D Ray Tracing
Cholesky Factorisation
Matrix Transposition
Sparse Matrix-Vector Multiplication
Conclusions
References
Scalable Producer-Consumer Pools Based on Elimination-Diffraction Trees
Introduction
The ED-Tree
Implementation
Performance Evaluation
References
Productivity and Performance: Improving Consumability of Hardware Transactional Memory through a Real-World Case Study
Introduction
Background
HTM Implementation and Interface
Discrete Event and Supply Chain Simulation
Using Transactions in GBSE-C
Resource Management
Scheduling Management
Productivity and Performance Evaluation
Conclusion
References
Exploiting Fine-Grained Parallelism on Cell Processors
Introduction
Distributed Task Pools
Task Pool Runtime System
Design and Implementation
Experimental Results
Synthetic Application
Matrix Multiplication
LU Decomposition
Linked-Cell Particle Simulation
Related Work
Conclusions
References
Optimized On-Chip-Pipelined Mergesort on the Cell/B.E.
Introduction
Cell/B.E. Overview
On-Chip Pipelined Mergesort
Implementation Details
Experimental Results
On-Chip-Pipelined Merging Times
Results of DC-Map
Comparison to CellSort
Discussion
Conclusion and Future Work
References
Near-Optimal Placement of MPI Processes on Hierarchical NUMA Architectures
Introduction
Related Work
Problem Modeling
Hardware Architecture
Application Communication Patterns
The TreeMatch Algorithm
Experimental Validation
Experimental Set-Up
Simulation Results
NAS Parallel Benchmarks
NAS Communication Patterns Modeling
Conclusion and Future Works
References
Parallel Enumeration of Shortest Lattice Vectors
Introduction
Preliminaries
Enumeration of the Shortest Lattice Vector
Algorithm for Parallel Enumeration of the Shortest Lattice Vector
Parallel Lattice Enumeration
The Algorithm for Parallel Enumeration
Improvements
Experiments
Conclusion and Further Work
References
A Parallel GPU Algorithm for Mutual Information Based 3D Nonrigid Image Registration
Introduction
Mutual Information Based Nonrigid Image Registration
GPU Architecture Overview
Parallelization and Optimization
Parallel Execution on the GPU
Use of Look Up Table
Optimizations for Transformation Coefficients
Memory Coalescing for the Gradient Computation
Experimental Results
Performance Results and Discussion
Validation of Registration Results
Summary and Conclusions
References
Multi-GPU and Multi-CPU Parallelization for Interactive Physics Simulations
Introduction
Related Works
Physics Simulation
Multi-GPU Abstraction Layer
Multi-architecture Data Types
Transparent GPU Kernel Launching
Architecture Specific Task Implementations
Scheduling on Multi-GPUs
Partitioning and Task Mapping
Dynamic Load Balancing
Harnessing Multiple GPUs and CPUs
Results
Colliding Objects on Multiple GPUs
Affinity Guided Work Stealing
Involving CPUs
Conclusion
References
Long DNA Sequence Comparison on Multicore Architectures
Introduction
Related Work
Algorithm Description and Parallelism
Available Parallelism
Parallel Implementations on a Multicore Architecture
Centralized Data Storage Approach
Distributed Data Storage Approach
Experimental Methodology
Modeled Systems
Experimental Results
Speedup in the Real Machine
Bandwidth Requirements
Simulation Results
Conclusions
References
Adaptive Fault Tolerance for Many-Core Based Space-Borne Computing
Introduction
Fault Tolerance in the Context of Dependability
Future Space Missions and Their Requirements
Introspection
Adaptive Fault Tolerance for High-Performance On-Board Computing
Assertions
Fault Detection and Recovery
Analysis-Based Assertion Generation
Related Work
Conclusion
References
Maestro: Data Orchestration and Tuning for OpenCL Devices
Introduction
OpenCL
Overview
High Level Queue
Problem: Code Portability
Proposed Solution: Autotuning
Problem: Load Balancing
Proposed Solution: Benchmarks and Device Interrogation
Problem: Suboptimal Use of Interconnection Bus
Proposed Solution: Multibuffering
Results
Experimental Testbeds
Test Kernels
Related Work
Conclusions
References
Multithreaded Geant4: Semi-automatic Transformation into Scalable Thread-Parallel Software
Introduction
Geant4 and Parallelization
Geant4: Background
Prior Distributed Memory Parallelizations of Geant4
Geant4MT Methodology and Tools
T1: Transformation for Thread Safety
T2: Transformation for Memory Footprint Reduction
Custom Scalable Malloc Library
Detection for Shared-Update and Run-Time Correctness
Experimental Results
Related Work
Conclusion
References
Parallel Exact Time Series Motif Discovery
Introduction
Background and Notation
Related Work
Parallel Motif Algorithm Design
Load Balancing Optimizations
Results and Analysis
Strong Scalability
Load Balancing Optimizations Analysis
Conclusions and Future Work
References
Optimized Dense Matrix Multiplication on a Many-Core Architecture
Introduction
The IBM Cyclops-64 Architecture
Classic Matrix Multiplication Algorithms
Proposed Matrix Multiplication Algorithm
Work Distribution
Minimization of High Cost Memory Operations
Architecture Specific Optimizations
Experimental Evaluation
Conclusions and Future Work
References
A Language-Based Tuning Mechanism for Task and Pipeline Parallelism
Introduction
The XJava Language
Language
Compiler and Runtime System
Tuning Challenges
Language-Based Tuning Mechanism
Tuning Parameters
Inferring Tuning Parameters from XJava Code
Inferring Context Information
Tuning Heuristics
Experimental Results
Benchmarked Applications
Results
Related Work
Conclusion
References
A Study of a Software Cache Implementation of the OpenMP Memory Model for Multicore and Manycore Architectures
Introduction
A Key Observation for Implementing the Flush Operation Efficiently
Main Contributions
Formalization of Our OpenMP Memory Model Instantiation
Operational Semantics of ModelLF
Cache Protocol of ModelLF
Cache Line States
Cache Operations and State Transitions
Experimental Results and Analyses
Experimental Testbed
Summary of Main Results
Scalability
Impact of Cache Size
Related Work
Conclusion and Future Work
References
Programming CUDA-Based GPUs to Simulate Two-Layer Shallow Water Flows
Introduction
Mathematical Model and Numerical Scheme
CUDA Implementation
Parallelism Sources
Algorithmic Details of the CUDA Version
Experimental Results
Conclusions and Further Work
References
Topic 12: Theory and Algorithms for Parallel Computation
Theory and Algorithms for Parallel Computation
Analysis of Multi-Organization Scheduling Algorithms
Introduction
Motivation and Presentation of the Problem
Related Work
Contributions and Road Map
Problem Description and Notations
Local Constraint
Selfishness
Complexity Analysis
Lower Bounds
Selfishness and Lower Bounds
Computational Complexity
Algorithms
Iterative Load Balancing Algorithm
LPT-LPT and SPT-LPT Heuristics
Analysis
Experiments
Concluding Remarks
References
Area-Maximizing Schedules for Series-Parallel DAGs
Introduction
Background
Maximizing Area for Series-Parallel dags
The Idea Behind Algorithm ASP-DAG
Algorithm ASP-DAG's Inductive Approach
The Timing of Algorithm ASP-DAG
Conclusion
References
Parallel Selection by Regular Sampling
Introduction
The BSP Model
The Algorithm
Conclusions
References
Ants in Parking Lots
Introduction
Technical Background
Ant-Robots Formalized
The Parking Problem for Ants
Single Ants and Parking
The Simplified Framework for Single Ants
A Single Ant Cannot Park, Even on a One-Dimensional Mesh
Single-Ant Configurations That Can Park
Multi-Ant Configurations That Can Park
Quadrant Determination with the Help of Adjacent Ants
Completing the Parking Process
Conclusions
References
Topic 13: High Performance Networks
High Performance Networks
An Efficient Strategy for Reducing Head-of-Line Blocking in Fat-Trees
Motivation
Efficient Deterministic Routing in Fat-Trees
OBQA Description
Performance Evaluation
Simulation Model
Results for Uniform Traffic
Results for Hot-Spot Traffic
Results for Traces
Data Memory Requirements
Conclusions
References
A First Approach to King Topologies for On-Chip Networks
Introduction
Description of the Topologies
Routing
Minimal Routing
Misrouting
Routing Composition
Evaluation
Conclusion
References
Optimizing Matrix Transpose on Torus Interconnects
Introduction
Matrix Transpose Overview
Load Imbalance on Asymmetric Torus Networks
Optimizing Permutation Communications
Basic Short Dimension Routing (SDR)
Selective Routing
Extending to Higher Dimensions
Implementation Level Optimizations
Experimental Evaluation
Blue Gene Overview
Experimental Setup
Results
References
Topic 14: Mobile and Ubiquitous Computing
Mobile and Ubiquitous Computing
cTrust: Trust Aggregation in Cyclic Mobile Ad Hoc Networks
Introduction
Related Work
cTrust Scheme
Trust Graph in CMANETs
Trust Path Finding Problems in CMANET
Markov Decision Process Model
Value Iteration
cTrust Distributed Trust Aggregation Algorithm
Experimental Evaluation
Experiment Setup
Results and Analysis
Conclusion
References
Maximizing Growth Codes Utility in Large-Scale Wireless Sensor Networks
Introduction
Related Works
Problem Description
Network Model
Overview of Growth Codes
Maximizing Growth Codes Utility by Coding on the Symbol Level
Decimal Codeword Degree
Search for Appropriate Codeword Degree
Maximizing Growth Codes Utility by Priority Broadcast
Unicast
Priority Broadcast
Simulation Results
Impact of Coding on the Symbol Level
Impact of Priority Broadcast
Conclusion
References
@Flood: Auto-Tunable Flooding for Wireless Ad Hoc Networks
Introduction
Related Work
Flooding Algorithms
The Case for Adaptive Behavior
@Flood
Forwarding Procedure
Probing Module
Monitor Module
Adaptation Engine
Evaluation
Convergence
@Flood vs. Standard Pre-configuration
Adapting the Forwarding Algorithm
Conclusions and Future Work
References
On Deploying Tree Structured Agent Applications in Networked Embedded Systems
Introduction
Application and System Model, Problem Formulation
Application Model
System Model
Problem Formulation
Uncapacitated 1-Hop Agent Migration Algorithm
Uncapacitated k-Hop Agent Migration Algorithm
Handling Capacity Constraints
Experiments
Setup
Results without Capacity Constraints
Results with Capacity Constraints – Small Scale Experiments
Results with Capacity Constraints – Large Scale Experiments
Result Summary
Related Work
Conclusions
References
Meaningful Metrics for Evaluating Eventual Consistency
Introduction
Related Work
Experimental Methodology
Evaluation
Commit Ratio
Average Agreement Delays (AAD)
Average Commitment Delays (ACD)
Conclusions
References
Caching Dynamic Information in Vehicular Ad Hoc Networks
Introduction
Related Work
Enabling Caching Support in VITP
Simulation Testbed Setup
Vehicular Mobility Generation
Evaluation Scenarios and Query Generation
Evaluation
Caching Evaluation - Querying Road Traffic Conditions
Caching Evaluation - Querying Road-Side Facilities Availability
Conclusions
References
Collaborative Cellular-Based Location System
Introduction
Related Work
System Architecture
Location API Module
Location Module
Communications Module
Database Module
Experiments
Locating Large Areas
Locating Small Areas
Conclusions and Future Work
References
Author Index

Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board

David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany

6272

Pasqua D’Ambra Mario Guarracino Domenico Talia (Eds.)

Euro-Par 2010 Parallel Processing 16th International Euro-Par Conference Ischia, Italy, August 31 - September 3, 2010 Proceedings, Part II


Volume Editors

Pasqua D’Ambra
ICAR-CNR, Via P. Castellino 111, 80131 Napoli, Italy
E-mail: [email protected]

Mario Guarracino
ICAR-CNR, Via P. Castellino 111, 80131 Napoli, Italy
E-mail: [email protected]

Domenico Talia
ICAR-CNR, Via P. Bucci 41c, 87036 Rende, Italy
E-mail: [email protected]

Library of Congress Control Number: 2010932506
CR Subject Classification (1998): F.1.2, C.3, C.2.4, D.1, D.4, I.6
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues

ISSN 0302-9743
ISBN-10 3-642-15290-2 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-15290-0 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

springer.com

© Springer-Verlag Berlin Heidelberg 2010
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper 06/3180

Preface

Euro-Par is an annual series of international conferences dedicated to the promotion and advancement of all aspects of parallel computing. The major themes can be divided into four broad categories: theory, high performance, cluster and grid, and distributed and mobile computing. These categories comprise 14 topics that focus on particular issues. The objective of Euro-Par is to provide a forum within which to promote the development of parallel computing both as an industrial technique and an academic discipline, extending the frontier of both the state of the art and the state of practice. The main audience for and participants in Euro-Par are researchers in academic departments, government laboratories, and industrial organizations. Euro-Par 2010 was the 16th conference in the Euro-Par series, and was organized by the Institute for High-Performance Computing and Networking (ICAR) of the Italian National Research Council (CNR), in Ischia, Italy. Previous Euro-Par conferences took place in Stockholm, Lyon, Passau, Southampton, Toulouse, Munich, Manchester, Paderborn, Klagenfurt, Pisa, Lisbon, Dresden, Rennes, Las Palmas, and Delft. Next year the conference will take place in Bordeaux, France. More information on the Euro-Par conference series and organization is available on the website http://www.europar.org. As mentioned before, the conference was organized in 14 topics. The paper review process for each topic was managed and supervised by a committee of at least four persons: a Global Chair, a Local Chair, and two members. Some specific topics with a high number of submissions were managed by a larger committee with more members. The final decisions on the acceptance or rejection of the submitted papers were made in a meeting of the Conference Co-chairs and Local Chairs of the topics. The call for papers attracted a total of 256 submissions, representing 41 countries (based on the corresponding authors’ countries).
A total of 938 review reports were collected, an average of 3.66 reports per paper. In total, 90 papers were selected as regular papers to be presented at the conference and included in the conference proceedings, representing 23 countries from all continents; the acceptance rate was thus 35%. Three papers were selected as distinguished papers. These papers, which were presented in a separate session, are:

1. Friman Sanchez, Felipe C, Alex Ramirez and Mateo Valero: Long DNA Sequence Comparison on Multicore Architectures
2. Michel Raynal and Damien Imbs: The x-Wait-Freedom Progress Condition
3. Mark James, Paul Springer and Hans Zima: Adaptive Fault Tolerance for Many-Core-Based Space-Borne Computing


Euro-Par 2010 was very happy to present three invited speakers of high international reputation, who discussed important developments in very interesting areas of parallel and distributed computing:

1. Jack Dongarra (University of Tennessee, Oak Ridge National Laboratory, University of Manchester): Impact of Architecture and Technology for Extreme Scale on Software and Algorithm Design
2. Vittoria Colizza (ISI Foundation): Computational Epidemiology: A New Paradigm in the Fight Against Infectious Diseases
3. Ignacio M. Llorente (Universidad Complutense de Madrid): Innovation in Cloud Computing Architectures

During the conference Jack Dongarra received the Third Euro-Par Achievement Award for his leadership in the field of parallel and distributed computing. In this edition, 11 workshops were held in conjunction with the main track of the conference. These workshops were:

1. CoreGrid/ERCIM Workshop (CoreGrid 2010)
2. The 5th Workshop on Virtualization in High-Performance Cluster and Grid Computing (VHPC 2010)
3. Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (HeteroPar 2010)
4. The 4th Workshop on Highly Parallel Processing on a Chip (HPPC 2010)
5. The 7th International Workshop on the Economics and Business of Grids, Clouds, Systems, and Services (GECON 2010)
6. The First Workshop on High-Performance Bioinformatics and Biomedicine (HiBB)
7. The Third Workshop on UnConventional High-Performance Computing (UCHPC 2010)
8. Workshop on HPC Applied to Finance (HPCF 2010)
9. The Third Workshop on Productivity and Performance Tools for HPC Application Development (PROPER 2010)
10. Workshop on Cloud Computing: Projects and Initiatives (CCPI 2010)
11. XtreemOS Summit 2010

The 16th Euro-Par conference in Ischia was made possible due to the support of many individuals and organizations.
Special thanks are due to the authors of all the submitted papers, the members of the topic committees, and all the reviewers in all topics for their contributions to the success of the conference. We also thank the members of the Organizing Committee and the staff of the YES Meet organizing secretariat. We are grateful to the members of the Euro-Par Steering Committee for their support. We acknowledge the help we received from Henk Sips and Dick Epema, organizers of Euro-Par 2009. A number of institutional and industrial sponsors contributed to the organization of the conference. Their names and logos appear on the Euro-Par 2010 website http://www.europar2010.org.


It was our pleasure and honor to organize and host Euro-Par 2010 in Ischia. We hope all the participants enjoyed the technical program and the social events organized during the conference.

August 2010

Domenico Talia Pasqua D’Ambra Mario Rosario Guarracino

Organization

Euro-Par Steering Committee

Chair
Chris Lengauer (University of Passau, Germany)

Vice-Chair
Luc Bougé (ENS Cachan, France)

European Representatives
José Cunha (New University of Lisbon, Portugal)
Marco Danelutto (University of Pisa, Italy)
Christos Kaklamanis (Computer Technology Institute, Greece)
Paul Kelly (Imperial College, UK)
Harald Kosch (University of Passau, Germany)
Thomas Ludwig (University of Heidelberg, Germany)
Emilio Luque (University Autonoma of Barcelona, Spain)
Tomàs Margalef (University Autonoma of Barcelona, Spain)
Wolfgang Nagel (Dresden University of Technology, Germany)
Rizos Sakellariou (University of Manchester, UK)
Henk Sips (Delft University of Technology, The Netherlands)

Honorary Members
Ron Perrott (Queen’s University Belfast, UK)
Karl Dieter Reinartz (University of Erlangen-Nuremberg, Germany)

Observers
Domenico Talia (University of Calabria and ICAR-CNR, Italy)
Emmanuel Jeannot (Laboratoire Bordelais de Recherche en Informatique (LaBRI) / INRIA, France)


Euro-Par 2010 Organization

Conference Co-chairs
Pasqua D’Ambra (ICAR-CNR, Italy)
Mario Guarracino (ICAR-CNR, Italy)
Domenico Talia (ICAR-CNR and University of Calabria, Italy)

Local Organizing Committee
Laura Antonelli (ICAR-CNR, Italy)
Eugenio Cesario (ICAR-CNR, Italy)
Agostino Forestiero (ICAR-CNR, Italy)
Francesco Gregoretti (ICAR-CNR, Italy)
Ivana Marra (ICAR-CNR, Italy)
Carlo Mastroianni (ICAR-CNR, Italy)

Euro-Par 2010 Program Committee Topic 1: Support Tools and Environments Global Chair Omer Rana

Cardiff University, UK

Local Chair Giandomenico Spezzano

ICAR-CNR, Italy

Members Michael Gerndt Daniel S. Katz

Technical University of Munich, Germany University of Chicago, USA

Topic 2: Performance Prediction and Evaluation Global Chair Stephen Jarvis

Warwick University, UK

Local Chair Massimo Coppola

ISTI-CNR, Italy

Members Junwei Cao Darren Kerbyson

Tsinghua University, China Los Alamos National Laboratory, USA


Topic 3: Scheduling and Load-Balancing Global Chair Ramin Yahyapour

Technical University of Dortmund, Germany

Local Chair Raffaele Perego

ISTI-CNR, Italy

Members Frederic Desprez Leah Epstein Francesc Guim Bernat

INRIA Rhˆone-Alpes, France University of Haifa, Israel Intel, Spain

Topic 4: High-Performance Architectures and Compilers Global Chair Pedro Diniz

IST/UTL/INESC-ID, Portugal

Local Chair Marco Danelutto

University of Pisa, Italy

Members Denis Barthou Marc Gonzalez Tallada Michael Huebner

University of Versailles, France Polytechnic University of Catalonia, Spain Karlsruhe Institute of Technology, Germany

Topic 5: Parallel and Distributed Data Management Global Chair Rizos Sakellariou

University of Manchester, UK

Local Chair Salvatore Orlando

University of Venice, Italy

Members Josep-L. Larriba-Pey Srinivasan Parthasarathy Demetrios Zeinalipour

Polytechnic University of Catalonia, Spain Ohio State University, USA University of Cyprus, Cyprus

Topic 6: Grid, Cluster and Cloud Computing Global Chair Kate Keahey

Argonne National Laboratory, USA


Local Chair Domenico Laforenza

IIT-CNR, Italy

Members Alexander Reinefeld Pierluigi Ritrovato Doug Thain Nancy Wilkins-Diehr

Zuse Institute Berlin, Germany University of Salerno and CRMPA, Italy University of Notre Dame, USA San Diego Supercomputer Center, USA

Topic 7: Peer-to-Peer Computing Global Chair Adriana Iamnitchi

University of South Florida, USA

Local Chair Paolo Trunfio

University of Calabria, Italy

Members Jonathan Ledlie Florian Schintke

Nokia Research, USA Zuse Institute of Berlin, Germany

Topic 8: Distributed Systems and Algorithms Global Chair Pierre Sens

University of Paris 6, France

Local Chair Giovanni Schmid

ICAR-CNR, Italy

Members Pascal Felber Ricardo Jimenez-Peris

University of Neuchatel, Switzerland Polytechnic University of Madrid, Spain

Topic 9: Parallel and Distributed Programming Global Chair Thilo Kielmann

Vrije Universiteit, The Netherlands

Local Chair Andrea Clematis

IMATI-CNR, Italy

Members Sergey Gorlatch Alexey Lastovetsky

University of Munster, Germany University College Dublin, Ireland


Topic 10: Parallel Numerical Algorithms

Global Chair Patrick Amestoy

University of Toulouse, France

Local Chair Daniela di Serafino

Second University of Naples and ICAR-CNR, Italy

Members Rob Bisseling Enrique S. Quintana Ortí Marian Vajtersic

Utrecht University, The Netherlands University Jaime I, Spain University of Salzburg, Austria

Topic 11: Multicore and Manycore Programming Global Chair Fabrizio Petrini

IBM, USA

Local Chair Beniamino Di Martino

Second University of Naples, Italy

Members Siegfried Benkner Kirk Cameron Dieter Kranzlmüller Jakub Kurzak Jesper Larsson Träff Davide Pasetto

University of Vienna, Austria Virginia Tech, USA Ludwig Maximilians University of Munich, Germany University of Tennessee, USA University of Vienna, Austria IBM, Ireland

Topic 12: Theory and Algorithms for Parallel Computation Global Chair Thomas Rauber

University of Bayreuth, Germany

Local Chair Vittorio Scarano

University of Salerno, Italy

Members Christoph Kessler Yves Robert

Linköping University, Sweden École Normale Supérieure de Lyon, France


Topic 13: High-Performance Networks Global Chair José Flich

Technical University of Valencia, Spain

Local Chair Alfonso Urso

ICAR-CNR, Italy

Members Ulrich Bruening Giuseppe Di Fatta

Heidelberg University, Germany University of Reading, UK

Topic 14: Mobile and Ubiquitous Computing Global Chair Gregor Schiele

University of Mannheim, Germany

Local Chair Giuseppe De Pietro

ICAR-CNR, Italy

Members Jalal Al-Muhtadi Zhiwen Yu

King Saud University, Saudi Arabia Northwestern Polytechnical University, China

Euro-Par 2010 Referees Virat Agarwal Josep Aguilar Saborit Jalal Al Muhtadi Samer Al-Kiswany Jose Aliaga Pedro Alonso Patrick Amestoy Panayiotis Andreou Artur Andrzejak Ashiq Anjum Benjamin Arai Luciana Arantes Peter Arbenz Mikael Asplund Rocco Aversa Rosa M. Badia Jos´e Manuel Bad´ıa-Contelles Mark Baker

Ranieri Baraglia Kevin Barker Denis Barthou Tom Beach Shajulin Benedict Siegfried Benkner Anne Benoit John Bent Anca Berariu Massimo Bernaschi Carlo Bertolli Paolo Bientinesi Angelos Bilas Rob Bisseling Jeremy Blackburn Brian Blake Erik Boman Mathieu Bouillaguet

Thomas Brady John Bresnahan Ron Brightwell Andrey Brito Maciej Brodowicz Shawn Brown Ulrich Bruening Ali Butt Alfredo Buttari Surendra Byna Joao Cachopo Kirk Cameron Agustin Caminero Ramon Canal Pasquale Cantiello Junwei Cao Gabriele Capannini Emanuele Carlini David Carrera Simon Caton Eugenio Cesario Gregory Chockler Martin Chorley Peter Chronz Giuseppe Ciaccio Mario Ciampi Andrea Clematis Carmela Comito Denis Conan Guojing Cong Massimo Coppola Angelo Corana Julita Corbalan Antonio Coronato Stefania Corsaro Rubén Cuevas Rumín Yong Cui Alfredo Cuzzocrea Daniele D’Agostino Pasqua D’Ambra Maurizio D’Arienzo John Daly Marco Danelutto Patrizio Dazzi Giuseppe De Pietro

Valentina De Simone Frederic Desprez Giuseppe Di Fatta Beniamino Di Martino Claudia Di Napoli Daniela di Serafino Pedro Diniz Cristian Dittamo David Dominguez-Sal Rion Dooley Jan Dunnweber Alejandro Duran Pierre-François Dutot Jorge Ejarque Artigas Thomas Epperly Leah Epstein Juan Carlos Fabero Pascal Felber Florian Feldhaus Xizhou Feng John Feo Salvatore Filippone Joshua Finnis Jose Flich Gianluigi Folino Agostino Forestiero Giancarlo Fortino Michel Fournié Antonella Galizia Luigi Gallo Efstratios Gallopoulos Alfredo Garro Rong Ge Joanna Geibig Krassimir Georgiev Joseph Gergaud Abdou Germouche Michael Gerndt Vittoria Gianuzzi Domingo Gimenez Maurizio Giordano Harald Gjermundrod Frank Glinka Sergio Gómez-Villamor Jose Gonzalez

Marc Gonzalez Maria Gradinariu Vincent Gramoli Fabíola Greve Mario Guarracino Michele Guidolin Carla Guillen Carias Francesc Guim Bernat Thom Haddow Georg Hager Houssam Haitof Paul Hargrove Tim Harris Enric Herrero Josep Ramon Herrero Ignacio Hidalgo Perez Daniel Higuero Sing Wang Ho Yannick Hoarau Torsten Hoefler Mikael Hogqvist Gerard Holzmann Haowei Huang Michael Hübner Mauro Iacono Adriana Iamnitchi Francisco Igual-Peña Stephen Jarvis Prasad Jayanti Emmanuel Jeannot Shantenu Jha Daniel Jiménez-González Ricardo Jimenez-Peris Maik Jorra Gopi Kandaswamy Karen Karavanic Daniel Katz Kate Keahey Philipp Kegel Darren Kerbyson Christoph Kessler Thilo Kielmann Hyunjoo Kim Zach King Björn Kolbeck

Derrick Kondo Andreas Konstantinidis Nicolas Kourtellis Dieter Kranzlmüller Peter Kropf Nico Kruber Herbert Kuchen Jakub Kurzak David LaBissoniere Giuliano Laccetti Domenico Laforenza Juan Lanchares Marco Lapegna Josep Larriba-Pey Alexey Lastovetsky Rob Latham Jonathan Ledlie Arnaud Legrand Sergey Legtchenko Francesco Lelli Jeff Linderoth Yan Liu David Lowenthal Claudio Lucchese Xiaosong Ma Lucia Maddalena Mahin Mahmoodi Mesaac Makpangou Barnaby Malet Loris Marchal Olivier Marin Stefano Marrone Suresh Marru Paul Marshall Alberto Martín-Huertas Norbert Martínez-Bazán Carlo Mastroianni Rafael Mayo Michele Mazzucco Dominik Meilaender Alessio Merlo Marcel Meyer Gregory Michaelson Mauro Migliardi Matteo Migliavacca

Einat Minkov Sébastien Monnet Raffaele Montella Matteo Mordacchini Christopher Moretti Paolo Mori Francesco Moscato Alexander Moskovsky Sandrine Mouysset Kiran-Kumar Muniswamy-Reddy Victor Muntés-Mulero Alin Murarasu Farrukh Nadeem Jamin Naghmouchi Franco Maria Nardini John-Paul Navarro Libero Nigro Dimitrios Nikolopoulos Praveen Nuthulapati Yury Oleynik Salvatore Orlando Renzo Orsini George Pallis Alexander Papaspyrou Scott Parker Srinivasan Parthasarathy Davide Pasetto Jean-Louis Pazat François Pellegrini Olivier Peres Francesca Perla Kathrin Peter Ventsislav Petkov Fabrizio Petrini Marlon Pierce Jean-Marc Pierson Giuseppe Pirró Stefan Plantikow Antonio Plaza Sabri Pllana Alexander Ploss Joseph Polifroni Andrea Pugliese Judy Qiu Alfonso Quarati

Martin Quinson Ioan Raicu Massimiliano Rak Thomas Rauber Alfredo Remon Laura Ricci Pierluigi Ritrovato Etienne Riviere Yves Robert Ivan Rodero Arun Rodrigues Arnold Rosenberg François-Henry Rouet Alain Roy Paul Ruth Vladimir Rychkov Rizos Sakellariou Jose Sánchez Martin Sandrieser Vittorio Scarano Daniele Scarpazza Patrick Schafer Valerio Schiavoni Gregor Schiele Michael Schiffers Florian Schintke Giovanni Schmid Erik Schnetter Assaf Schuster Frank Seinstra Pierre Sens Lei Shu Claudio Silvestri Raül Sirvent Tor Skeie David Skinner Warren Smith Julien Sopena Paul Soule Giandomenico Spezzano Kyriakos Stavrou Thomas Steinke Jan Stender Pietro Storniolo Hari Subramoni

Frédéric Suter Pierre Sutra Martin Swany Spencer Swift Alex Tabbal Domenico Talia Ian Taylor Andrei Tchernykh Douglas Thain Gaël Thomas Juan Tirado Matthew Tolentino Rafael Tolosana-Calasanz Nicola Tonellotto Jesper Träff Corentin Travers Paolo Trunfio Mauricio Tsugawa Alfonso Urso Marian Vajteršic Jose Valerio Rob van Nieuwpoort Robbert van Renesse Ana Lucia Varbanescu

Salvatore Venticinque Rossano Venturini Antonio Vidal Vicente Vidal Frédéric Vivien Edward Walker Hanbo Wang Shaowen Wang Charles Weems Josef Weidendorfer Philipp Wieder Matthew Woitaszek Wenjun Wu Roman Wyrzykowski Dongyan Xu Ramin Yahyapour Kenneth Yoshimoto Choonhan Youn Zhiwen Yu Luca Zanni Demetrios Zeinalipour-Yazti Wensheng Zhang Wolfgang Ziegler Hans Zima

Table of Contents – Part II

Topic 9: Parallel and Distributed Programming Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Thilo Kielmann, Andrea Clematis, Sergei Gorlatch, and Alexey Lastovetsky

1

Transactional Mutex Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Luke Dalessandro, Dave Dice, Michael Scott, Nir Shavit, and Michael Spear

2

Exceptions for Algorithmic Skeletons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mario Leyton, Ludovic Henrio, and José M. Piquer

14

Generators-of-Generators Library with Optimization Capabilities in Fortress . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kento Emoto, Zhenjiang Hu, Kazuhiko Kakehi, Kiminori Matsuzaki, and Masato Takeichi User Transparent Task Parallel Multimedia Content Analysis . . . . . . . . . . Timo van Kessel, Niels Drost, and Frank J. Seinstra Parallel Simulation for Parameter Estimation of Optical Tissue Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mihai Duta, Jeyarajan Thiyagalingam, Anne Trefethen, Ayush Goyal, Vicente Grau, and Nic Smith

26

38

51

Topic 10: Parallel Numerical Algorithms Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Patrick Amestoy, Daniela di Serafino, Rob Bisseling, Enrique S. Quintana-Ortí, and Marian Vajteršic

63

Scalability and Locality of Extrapolation Methods for Distributed-Memory Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Matthias Korch, Thomas Rauber, and Carsten Scholtes

65

CFD Parallel Simulation Using Getfem++ and Mumps . . . . . . . . . . . . . . . Michel Fournié, Nicolas Renon, Yves Renard, and Daniel Ruiz

77

Aggregation AMG for Distributed Systems Suffering from Large Message Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Maximilian Emans

89

A Parallel Implementation of the Jacobi-Davidson Eigensolver and Its Application in a Plasma Turbulence Code . . . . . . . . . . . . . . . . . . . . . . . . . . . Eloy Romero and Jose E. Roman Scheduling Parallel Eigenvalue Computations in a Quantum Chemistry Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Martin Roderus, Anca Berariu, Hans-Joachim Bungartz, Sven Krüger, Alexei Matveev, and Notker Rösch Scalable Parallelization Strategies to Accelerate NuFFT Data Translation on Multicores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yuanrui Zhang, Jun Liu, Emre Kultursay, Mahmut Kandemir, Nikos Pitsianis, and Xiaobai Sun

101

113

125

Topic 11: Multicore and Manycore Programming Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Beniamino Di Martino, Fabrizio Petrini, Siegfried Benkner, Kirk Cameron, Dieter Kranzlmüller, Jakub Kurzak, Davide Pasetto, and Jesper Larsson Träff

137

JavaSymphony: A Programming and Execution Environment for Parallel and Distributed Many-Core Architectures . . . . . . . . . . . . . . . . . . . . Muhammad Aleem, Radu Prodan, and Thomas Fahringer

139

Scalable Producer-Consumer Pools Based on Elimination-Diffraction Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yehuda Afek, Guy Korland, Maria Natanzon, and Nir Shavit

151

Productivity and Performance: Improving Consumability of Hardware Transactional Memory through a Real-World Case Study . . . . . . . . . . . . . Huayong Wang, Yi Ge, Yanqi Wang, and Yao Zou

163

Exploiting Fine-Grained Parallelism on Cell Processors . . . . . . . . . . . . . . . Ralf Hoffmann, Andreas Prell, and Thomas Rauber

175

Optimized on-Chip-Pipelined Mergesort on the Cell/B.E. . . . . . . . . . . . . . Rikard Hultén, Christoph W. Kessler, and Jörg Keller

187

Near-Optimal Placement of MPI Processes on Hierarchical NUMA Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Emmanuel Jeannot and Guillaume Mercier

199

Parallel Enumeration of Shortest Lattice Vectors . . . . . . . . . . . . . . . . . . . . Özgür Dagdelen and Michael Schneider

211

A Parallel GPU Algorithm for Mutual Information Based 3D Nonrigid Image Registration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Vaibhav Saxena, Jonathan Rohrer, and Leiguang Gong

223

Multi-GPU and Multi-CPU Parallelization for Interactive Physics Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Everton Hermann, Bruno Raffin, François Faure, Thierry Gautier, and Jérémie Allard

235

Long DNA Sequence Comparison on Multicore Architectures . . . . . . . . . . Friman Sánchez, Felipe Cabarcas, Alex Ramirez, and Mateo Valero

247

Adaptive Fault Tolerance for Many-Core Based Space-Borne Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mark James, Paul Springer, and Hans Zima

260

Maestro: Data Orchestration and Tuning for OpenCL Devices . . . . . . . . . Kyle Spafford, Jeremy Meredith, and Jeffrey Vetter

275

Multithreaded Geant4: Semi-automatic Transformation into Scalable Thread-Parallel Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xin Dong, Gene Cooperman, and John Apostolakis

287

Parallel Exact Time Series Motif Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . Ankur Narang and Souvik Bhattacherjee

304

Optimized Dense Matrix Multiplication on a Many-Core Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Elkin Garcia, Ioannis E. Venetis, Rishi Khan, and Guang R. Gao

316

A Language-Based Tuning Mechanism for Task and Pipeline Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Frank Otto, Christoph A. Schaefer, Matthias Dempe, and Walter F. Tichy A Study of a Software Cache Implementation of the OpenMP Memory Model for Multicore and Manycore Architectures . . . . . . . . . . . . . . . . . . . . . Chen Chen, Joseph B. Manzano, Ge Gan, Guang R. Gao, and Vivek Sarkar Programming CUDA-Based GPUs to Simulate Two-Layer Shallow Water Flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marc de la Asunción, José M. Mantas, and Manuel J. Castro

328

341

353

Topic 12: Theory and Algorithms for Parallel Computation Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christoph Kessler, Thomas Rauber, Yves Robert, and Vittorio Scarano

365

Analysis of Multi-Organization Scheduling Algorithms . . . . . . . . . . . . . . . . Johanne Cohen, Daniel Cordeiro, Denis Trystram, and Frédéric Wagner

367

Area-Maximizing Schedules for Series-Parallel DAGs . . . . . . . . . . . . . . . . . Gennaro Cordasco and Arnold L. Rosenberg

380

Parallel Selection by Regular Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alexander Tiskin

393

Ants in Parking Lots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Arnold L. Rosenberg

400

Topic 13: High Performance Networks Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . José Flich, Alfonso Urso, Ulrich Bruening, and Giuseppe Di Fatta An Efficient Strategy for Reducing Head-of-Line Blocking in Fat-Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jesus Escudero-Sahuquillo, Pedro Javier Garcia, Francisco J. Quiles, and Jose Duato

412

413

A First Approach to King Topologies for On-Chip Networks . . . . . . . . . . . Esteban Stafford, Jose L. Bosque, Carmen Martínez, Fernando Vallejo, Ramon Beivide, and Cristobal Camarero

428

Optimizing Matrix Transpose on Torus Interconnects . . . . . . . . . . . . . . . . . Venkatesan T. Chakaravarthy, Nikhil Jain, and Yogish Sabharwal

440

Topic 14: Mobile and Ubiquitous Computing Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gregor Schiele, Giuseppe De Pietro, Jalal Al-Muhtadi, and Zhiwen Yu

452

cTrust: Trust Aggregation in Cyclic Mobile Ad Hoc Networks . . . . . . . . . Huanyu Zhao, Xin Yang, and Xiaolin Li

454

Maximizing Growth Codes Utility in Large-Scale Wireless Sensor Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yao Zhao, Xin Wang, Jin Zhao, and Xiangyang Xue

466

@Flood: Auto-Tunable Flooding for Wireless Ad Hoc Networks . . . . . . . . José Mocito, Luís Rodrigues, and Hugo Miranda

478

On Deploying Tree Structured Agent Applications in Networked Embedded Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nikos Tziritas, Thanasis Loukopoulos, Spyros Lalis, and Petros Lampsas

490

Meaningful Metrics for Evaluating Eventual Consistency . . . . . . . . . . . . . . João Barreto and Paulo Ferreira

503

Caching Dynamic Information in Vehicular Ad Hoc Networks . . . . . . . . . . Nicholas Loulloudes, George Pallis, and Marios D. Dikaiakos

516

Collaborative Cellular-Based Location System . . . . . . . . . . . . . . . . . . . . . . . David Navalho and Nuno Preguiça

528

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

541

Table of Contents – Part I

Topic 1: Support Tools and Environments Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Omer Rana, Giandomenico Spezzano, Michael Gerndt, and Daniel S. Katz

1

Starsscheck: A Tool to Find Errors in Task-Based Parallel Programs . . . . Paul M. Carpenter, Alex Ramirez, and Eduard Ayguade

2

Automated Tuning in Parallel Sorting on Multi-Core Architectures . . . . . Haibo Lin, Chao Li, Qian Wang, Yi Zhao, Ninghe Pan, Xiaotong Zhuang, and Ling Shao

14

Estimating and Exploiting Potential Parallelism by Source-Level Dependence Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jonathan Mak, Karl-Filip Faxén, Sverker Janson, and Alan Mycroft

26

Source-to-Source Optimization of CUDA C for GPU Accelerated Cardiac Cell Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fred V. Lionetti, Andrew D. McCulloch, and Scott B. Baden

38

Efficient Graph Partitioning Algorithms for Collaborative Grid Workflow Developer Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gergely Sipos and Péter Kacsuk

50

Profile-Driven Selective Program Loading . . . . . . . . . . . . . . . . . . . . . . . . . . . Tugrul Ince and Jeffrey K. Hollingsworth

62

Characterizing the Impact of Using Spare-Cores on Application Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . José Carlos Sancho, Darren J. Kerbyson, and Michael Lang

74

Topic 2: Performance Prediction and Evaluation Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stephen Jarvis, Massimo Coppola, Junwei Cao, and Darren Kerbyson A Model for Space-Correlated Failures in Large-Scale Distributed Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Matthieu Gallet, Nezih Yigitbasi, Bahman Javadi, Derrick Kondo, Alexandru Iosup, and Dick Epema

86

88

Architecture Exploration for Efficient Data Transfer and Storage in Data-Parallel Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rosilde Corvino, Abdoulaye Gamatié, and Pierre Boulet

101

jitSim: A Simulator for Predicting Scalability of Parallel Applications in Presence of OS Jitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pradipta De and Vijay Mann

117

pCFS vs. PVFS: Comparing a Highly-Available Symmetrical Parallel Cluster File System with an Asymmetrical Parallel File System . . . . . . . . Paulo A. Lopes and Pedro D. Medeiros

131

Comparing Scalability Prediction Strategies on an SMP of CMPs . . . . . . Karan Singh, Matthew Curtis-Maury, Sally A. McKee, Filip Blagojević, Dimitrios S. Nikolopoulos, Bronis R. de Supinski, and Martin Schulz

143

Topic 3: Scheduling and Load-Balancing Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ramin Yahyapour, Raffaele Perego, Frédéric Desprez, Leah Epstein, and Francesc Guim Bernat

156

A Fast 5/2-Approximation Algorithm for Hierarchical Scheduling . . . . . . Marin Bougeret, Pierre-François Dutot, Klaus Jansen, Christina Otte, and Denis Trystram

157

Non-clairvoyant Scheduling of Multiple Bag-of-Tasks Applications . . . . . . Henri Casanova, Matthieu Gallet, and Frédéric Vivien

168

Extremal Optimization Approach Applied to Initial Mapping of Distributed Java Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ivanoe De Falco, Eryk Laskowski, Richard Olejnik, Umberto Scafuri, Ernesto Tarantino, and Marek Tudruj A Delay-Based Dynamic Load Balancing Method and Its Stability Analysis and Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Qingyang Meng, Jianzhong Qiao, Shukuan Lin, Enze Wang, and Peng Han

180

192

Code Scheduling for Optimizing Parallelism and Data Locality . . . . . . . . . Taylan Yemliha, Mahmut Kandemir, Ozcan Ozturk, Emre Kultursay, and Sai Prashanth Muralidhara

204

Hierarchical Work-Stealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jean-Noël Quintin and Frédéric Wagner

217

Optimum Diffusion for Load Balancing in Mesh Networks . . . . . . . . . . . . . George S. Markomanolis and Nikolaos M. Missirlis

230

A Dynamic, Distributed, Hierarchical Load Balancing for HLA-Based Simulations on Large-Scale Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . Robson Eduardo De Grande and Azzedine Boukerche

242

Topic 4: High Performance Architectures and Compilers Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pedro C. Diniz, Marco Danelutto, Denis Barthou, Marc Gonzales, and Michael Hübner

254

Power-Efficient Spilling Techniques for Chip Multiprocessors . . . . . . . . . . Enric Herrero, José González, and Ramon Canal

256

Scalable Object-Aware Hardware Transactional Memory . . . . . . . . . . . . . . Behram Khan, Matthew Horsnell, Mikel Lujan, and Ian Watson

268

Efficient Address Mapping of Shared Cache for On-Chip Many-Core Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fenglong Song, Dongrui Fan, Zhiyong Liu, Junchao Zhang, Lei Yu, and Weizhi Xu Thread Owned Block Cache: Managing Latency in Many-Core Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fenglong Song, Zhiyong Liu, Dongrui Fan, Hao Zhang, Lei Yu, and Shibin Tang Extending the Cell SPE with Energy Efficient Branch Prediction . . . . . . . Martijn Briejer, Cor Meenderinck, and Ben Juurlink

280

292

304

Topic 5: Parallel and Distributed Data Management Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rizos Sakellariou, Salvatore Orlando, Josep Lluis Larriba-Pey, Srinivasan Parthasarathy, and Demetrios Zeinalipour-Yazti

316

Federated Enactment of Workflow Patterns . . . . . . . . . . . . . . . . . . . . . . . . . Gagarine Yaikhom, Chee Sun Liew, Liangxiu Han, Jano van Hemert, Malcolm Atkinson, and Amy Krause

317

A Distributed Approach to Detect Outliers in Very Large Data Sets . . . . Fabrizio Angiulli, Stefano Basta, Stefano Lodi, and Claudio Sartori

329

Topic 6: Grid, Cluster and Cloud Computing Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . K. Keahey, D. Laforenza, A. Reinefeld, P. Ritrovato, D. Thain, and N. Wilkins-Diehr

341

Deployment of a Hierarchical Middleware . . . . . . . . . . . . . . . . . . . . . . . . . . . Eddy Caron, Benjamin Depardon, and Frédéric Desprez

343

Toward Real-Time, Many-Task Applications on Large Distributed Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sangho Yi, Derrick Kondo, and David P. Anderson

355

Scheduling Scientific Workflows to Meet Soft Deadlines in the Absence of Failure Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kassian Plankensteiner, Radu Prodan, and Thomas Fahringer

367

A GPGPU Transparent Virtualization Component for High Performance Computing Clouds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Giulio Giunta, Raffaele Montella, Giuseppe Agrillo, and Giuseppe Coviello What Is the Price of Simplicity? A Cross-Platform Evaluation of the SAGA API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mathijs den Burger, Ceriel Jacobs, Thilo Kielmann, Andre Merzky, Ole Weidner, and Hartmut Kaiser

379

392

User-Centric, Heuristic Optimization of Service Composition in Clouds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kevin Kofler, Irfan ul Haq, and Erich Schikuta

405

A Distributed Market Framework for Large-Scale Resource Sharing . . . . Marian Mihailescu and Yong Meng Teo

418

Using Network Information to Perform Meta-Scheduling in Advance in Grids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Luis Tomás, Agustín Caminero, Blanca Caminero, and Carmen Carrión

431

Topic 7: Peer to Peer Computing Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Adriana Iamnitchi, Paolo Trunfio, Jonathan Ledlie, and Florian Schintke Overlay Management for Fully Distributed User-Based Collaborative Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Róbert Ormándi, István Hegedűs, and Márk Jelasity Dynamic Publish/Subscribe to Meet Subscriber-Defined Delay and Bandwidth Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Muhammad Adnan Tariq, Gerald G. Koch, Boris Koldehofe, Imran Khan, and Kurt Rothermel

444

446

458

Combining Hilbert SFC and Bruijn Graphs for Searching Computing Markets in a P2P System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Damià Castellà, Hector Blanco, Francesc Giné, and Francesc Solsona Sampling Bias in BitTorrent Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . Boxun Zhang, Alexandru Iosup, Johan Pouwelse, Dick Epema, and Henk Sips A Formal Credit-Based Incentive Model for Sharing Computer Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Josep Rius, Ignasi Barri, Fernando Cores, and Francesc Solsona

471 484

497

Topic 8: Distributed Systems and Algorithms Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pascal Felber, Ricardo Jimenez-Peris, Giovanni Schmid, and Pierre Sens

510

Improving Message Logging Protocols Scalability through Distributed Event Logging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Thomas Ropars and Christine Morin

511

Value-Based Sequential Consistency for Set Objects in Dynamic Distributed Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Roberto Baldoni, Silvia Bonomi, and Michel Raynal

523

Robust Self-stabilizing Construction of Bounded Size Weight-Based Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Colette Johnen and Fouzi Mekhaldi

535

Adaptive Conflict Unit Size for Distributed Optimistic Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kim-Thomas Rehmann, Marc-Florian Müller, and Michael Schöttner Frame Allocation Algorithms for Multi-threaded Network Cameras . . . . . José Miguel Piquer and Javier Bustos-Jiménez Scalable Distributed Simulation of Large Dense Crowds Using the Real-Time Framework (RTF) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ole Scharf, Sergei Gorlatch, Felix Blanke, Christoph Hemker, Sebastian Westerheide, Tobias Priebs, Christoph Bartenhagen, Alexander Ploss, Frank Glinka, and Dominik Meilaender

547

560

572

The x-Wait-Freedom Progress Condition . . . . . . . . . . . . . . . . . . . . . . . . . . . . Damien Imbs and Michel Raynal

584

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

597

Parallel and Distributed Programming

Thilo Kielmann¹, Andrea Clematis¹, Sergei Gorlatch², and Alexey Lastovetsky²

¹ Topic Chairs
² Members

Developing parallel or distributed applications is a hard task that requires advanced algorithms, realistic modeling, efficient design tools, high-level programming abstractions, high-performance implementations, and experimental evaluation. Ongoing research in this field emphasizes the design and development of correct, high-performance, portable, and scalable parallel programs. Related to these central needs, important work addresses methods for reusability, performance prediction, large-scale deployment, self-adaptivity, and fault-tolerance. Given the rich history in this field, practical applicability of proposed methods, models, algorithms, or techniques is a key requirement for timely research. This topic focuses on parallel and distributed programming in general, except for work specifically targeting multicore architectures, which has matured into a Euro-Par topic of its own. This year, 17 papers were submitted to this topic. Each submission was reviewed by at least four reviewers, and in the end we selected five regular papers spanning the topic’s scope, ranging from low-level issues like locking schemes and exceptions, all the way up to the parallelisation of a biocomputational simulation. In particular, Dalessandro et al. propose “Transactional Mutex Locks”, combining the generality of mutex locks with the scalability of software transactional memory. In “Exceptions for Algorithmic Skeletons”, Leyton et al. describe how to handle exceptions without breaking the high-level abstractions of algorithmic skeletons. Emoto et al. contributed “Generators-of-Generators Library with Optimization Capabilities in Fortress”, a library for constructing parallel skeletons from nested data structures. In “User Transparent Task Parallel Multimedia Content Analysis”, van Kessel et al. present a domain-specific, user-transparent programming model. Last but not least, Duta et al.
exploit the algorithmic parallelism in biocomputational simulations in their paper “Parallel Simulation for Parameter Estimation of Optical Tissue Properties”. We are proud of the scientific program that we managed to assemble. Of course, this was only possible by combining the efforts of many. We would like to take this opportunity to thank the authors who submitted their contributions, and the external referees who made the scientific selection process possible in the first place.

P. D’Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, p. 1, 2010. © Springer-Verlag Berlin Heidelberg 2010

Transactional Mutex Locks

Luke Dalessandro¹,*, Dave Dice², Michael Scott¹, Nir Shavit²,³,*, and Michael Spear⁴

¹ University of Rochester
² Sun Labs at Oracle
³ Tel-Aviv University
⁴ Lehigh University
{luked,scott}@cs.rochester.edu, [email protected], [email protected], [email protected]

Abstract. Mutual exclusion (mutex) locks limit concurrency but offer low single-thread latency. Software transactional memory (STM) typically has much higher latency, but scales well. We present transactional mutex locks (TML), which attempt to achieve the best of both worlds for read-dominated workloads. We also propose compiler optimizations that reduce the latency of TML to within a small fraction of mutex overheads. Our evaluation of TML, using microbenchmarks on the x86 and SPARC architectures, is promising. Using optimized spinlocks and the TL2 STM algorithm as baselines, we find that TML provides the low latency of locks at low thread levels, and the scalability of STM for read-dominated workloads. These results suggest that TML is a good reference implementation to use when evaluating STM algorithms, and that TML is a viable alternative to mutex locks for a variety of workloads.

1 Introduction

In shared-memory parallel programs, synchronization is most commonly provided by mutual exclusion (mutex) locks, but these may lead to unnecessary serialization. Three common alternatives allow parallelism among concurrent read-only critical sections. (1) Reader/writer (R/W) locks typically require two atomic operations (one to enter a critical section, another to depart), thereby enabling multiple threads to hold the lock in “read mode” simultaneously. R/W locks typically do not restrict what can be done within the critical section (e.g., code may perform I/O or modify thread-local data), but the programmer must statically annotate any critical section that might modify shared data as a writer, in which case it cannot execute concurrently with other critical sections. (2) Read-copy-update (RCU) [1] 



* At the University of Rochester, this work was supported in part by NSF grants CNS-0411127, CNS-0615139, CCF-0702505, and CSR-0720796; by financial support from Intel and Microsoft; and by equipment support from Sun. The work in Tel-Aviv University was supported in part by the European Union under grant FP7-ICT-2007-1 (project VELOX), by grant 06/1344 from the Israeli Science Foundation, and by a grant from Sun Microsystems.

P. D’Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 2–13, 2010. © Springer-Verlag Berlin Heidelberg 2010

ensures no blocking in a read-only critical section, but constrains the allowable behavior (e.g., doubly-linked lists can be traversed only in one direction). (3) Sequence locks (seqlocks) [2] forbid linked data structure traversal or function calls. While software transactional memory (STM) [3] appears ideally suited to replacing R/W locks, RCU, and sequence locks, there are two main obstacles. First, STM implementations typically require significant amounts of global and per-thread metadata. This space overhead may be prohibitive if STM is not used often within an application. Second, STM tends to have unacceptably high single-thread latency, usually higher than 2× that of lock-based code [4]. The nature of many critical sections in systems software suggests an approach that spans the gap between locks and transactions: specifically, we may be able to leverage TM research to create a better locking mechanism. In this paper we propose Transactional Mutex Locks (TML). TML offers the generality of mutex locks and the read-read scalability of sequence locks, while avoiding the atomic operation overheads of R/W locks or the usage constraints of RCU and sequence locks. These properties make TML an appealing lock replacement for many critical sections. They also suggest that TML, rather than a mutex lock, should be used as the baseline when evaluating new STM algorithms.1
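TML builds directly on the sequence-lock idea, so it helps to see the basic pattern concretely. Below is a minimal single-writer seqlock sketch in C11; the names (seq, writer, reader_sum) are ours for illustration and are not from the paper, and real implementations use finer-grained memory fences and a separate mutex when multiple writers are possible.

```c
#include <stdatomic.h>
#include <assert.h>

/* Shared data guarded by the sequence counter: an odd value of seq
   means a writer is inside its critical section. */
static atomic_uint seq;
static int a, b;

/* Single writer: make seq odd, update the data, make seq even again. */
void writer(int x, int y) {
    atomic_fetch_add(&seq, 1);   /* seq becomes odd */
    a = x;
    b = y;
    atomic_fetch_add(&seq, 1);   /* seq becomes even */
}

/* Readers take no lock and write no shared memory: they snapshot seq,
   read the data, and retry if a writer was active or intervened. */
int reader_sum(void) {
    unsigned s;
    int x, y;
    do {
        while ((s = atomic_load(&seq)) & 1)
            ;                    /* spin while a writer is active */
        x = a;
        y = b;
    } while (atomic_load(&seq) != s);
    return x + y;
}
```

Because a reader may observe inconsistent data before its final check, such read-only sections must be harmlessly restartable, which is exactly why seqlocks forbid pointer chasing and function calls; TML lifts these constraints by adding per-access instrumentation and checkpoint-based rollback.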

2 The TML Algorithm

Lock-based critical sections require instrumentation only at the boundaries, to acquire and release the lock. STM-based critical sections also require instrumentation on every load or store to any location that may be shared. This instrumentation is costly: when entering a critical section, the thread must be checkpointed (e.g., via a call to setjmp); each load must be logged to enable detection of conflicts; each store must be logged, both to enable conflict detection and to enable undoing writes in the event of a conflict; and at the end of the region the entire set of reads must be double-checked to identify conflicts. If a conflict is detected, all writes must be undone and the checkpoint must be restored (e.g., via a call to longjmp), so that the critical section can be rolled back and retried. Furthermore, many STM algorithms require the use of atomic instructions, such as compare-and-swap (cas), on each write to shared data.

TML is essentially an STM implemented via a single global seqlock. While it requires both boundary and per-access instrumentation, it keeps overhead low by trading concurrency for low latency: by allowing concurrency only among read-only critical sections, the entire cost can be reduced to a handful of instructions at boundaries, a few instructions on each load or store of shared data, no per-access logging, and at most one cas per critical section.

2.1 Boundary and Per-Access Instrumentation

Listing 1 presents the instrumentation required for TML. We use glb to refer to a single word of global metadata, and loc to refer to the single, local word of

¹ We do precisely this in our PPoPP’10 paper [5], which, while published earlier, was completed after the work presented here.

L. Dalessandro et al.

Listing 1. TML instrumentation

TMBegin:
  1  checkpoint()
  2  if (nest++) return
  3  while ((loc = glb) & 1) { }

TMEnd:
  1  if (--nest) return
  2  if (loc & 1) glb++

TMRead(addr):
  1  tmp = *addr
  2  if (glb != loc)
  3      restore_chkpt()
  4  return tmp

TMWrite(addr, val):
  1  if (!(loc & 1))
  2      if (!cas(&glb, loc, loc + 1))
  3          restore_chkpt()
  4      loc++
  5  *addr = val

metadata required by a thread in a critical section. We also maintain a per-thread local variable, nest, to support dynamic nesting of critical sections. TMBegin and TMEnd mark the beginning and ending of a critical section, respectively. Loads from shared memory are made via (inlined) calls to TMRead, and stores to shared memory are made via TMWrite. The checkpoint() and restore_chkpt() functions can be mapped directly to setjmp() and longjmp(), respectively.

At a high level, the algorithm provides a multi-reader, single-writer protocol. Which critical sections perform writes need not be determined statically; instead, threads can dynamically transition to writer mode. Whenever a thread suspects an atomicity violation (something that can happen only before it has become a writer), it unwinds its stack and restarts using the restore_chkpt() function. Three properties ensure atomicity for race-free programs:

– When glb is even, there are no writing critical sections. This property is provided by line 3 of TMBegin, which prevents critical sections from starting when glb is odd; TMWrite lines 1–4, which ensure that a thread only modifies glb once via a call to TMWrite, and only by transitioning it from the even value observed at TMBegin to the next successive odd value; and TMEnd line 2, which ensures that a writer updates glb to the next successive even value when it has finished performing reads and writes.

– Concurrent writing critical sections are forbidden. A critical section Ci either never modifies glb, or else modifies it exactly twice, by incrementing it to an odd value at TMWrite line 2, and then incrementing it to an even value at TMEnd line 2. Since an intervening call to TMWrite from critical section Cj cannot proceed between when Ci sets glb odd and when Ci completes, concurrent writing critical sections are prevented.

– Within any critical section, all values returned by TMRead are consistent with an execution in which the critical section runs in isolation. We have already shown that critical sections cannot start when a writing critical section is in flight. Since writing critical sections execute in isolation, and can only become writers if there have been no intervening writing critical sections, it remains only to show that a call to TMRead by read-only critical section Ci will not succeed if there has been an intervening write in critical section

Cj. On every call to TMRead by Ci, the test on line 2 ensures that glb has not changed since Ci began. Since modifications to glb always precede modifications to shared data, this test detects intervening writes.
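For concreteness, the operations of Listing 1 can be written down in portable C11; the following is our own sketch (the names and the setjmp-based checkpointing policy are ours, mapped from checkpoint()/restore_chkpt() as the paper suggests), not the authors' implementation:

```c
#include <stdatomic.h>
#include <setjmp.h>

static atomic_ulong glb;                  /* global seqlock word */
static _Thread_local unsigned long loc;   /* value of glb observed at TMBegin */
static _Thread_local int nest;            /* critical-section nesting depth */
static _Thread_local jmp_buf chkpt;       /* outermost checkpoint */

/* Checkpoint only at the outermost section; an abort longjmps back here,
 * resets nest, and re-enters via tm_begin(). */
#define TM_BEGIN()                                             \
    do {                                                       \
        if (nest == 0) {                                       \
            if (setjmp(chkpt)) nest = 0;  /* abort path */     \
        }                                                      \
        tm_begin();                                            \
    } while (0)

static void tm_begin(void) {
    if (nest++) return;
    while ((loc = atomic_load(&glb)) & 1) { }   /* spin while glb is odd */
}

static void tm_end(void) {
    if (--nest) return;
    if (loc & 1) atomic_store(&glb, loc + 1);   /* writer: glb back to even */
}

static unsigned long tm_read(unsigned long *addr) {
    unsigned long tmp = *addr;
    if (atomic_load(&glb) != loc) longjmp(chkpt, 1); /* intervening writer */
    return tmp;
}

static void tm_write(unsigned long *addr, unsigned long val) {
    if (!(loc & 1)) {                           /* first write: become writer */
        unsigned long expected = loc;
        if (!atomic_compare_exchange_strong(&glb, &expected, loc + 1))
            longjmp(chkpt, 1);                  /* another writer won the cas */
        loc++;                                  /* loc now matches odd glb */
    }
    *addr = val;
}

/* Single-threaded demo: increment a shared counter transactionally. */
unsigned long tml_demo(void) {
    static unsigned long counter;
    TM_BEGIN();
    tm_write(&counter, tm_read(&counter) + 42);
    tm_end();
    TM_BEGIN();
    unsigned long v = tm_read(&counter);
    tm_end();
    return v;
}
```

A multi-threaded deployment would additionally need the memory-ordering constraints discussed in Section 2.2; the sketch above shows only the control flow of the protocol.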

2.2 Implementation Issues

Ordering: Four ordering constraints are required: read-before-read/write ordering after TMBegin line 3, read/write-before-write ordering in TMEnd line 2, write-before-read/write ordering after TMWrite line 2, and read-before-read ordering before TMRead line 2. Of these, only the cost of ordering in TMRead can be incurred more than once per critical section. On TSO and x86 architectures, where the cas imposes ordering, no hardware fence instructions are required, but compiler fences are necessary.

Overflow: Our use of a single counter admits the possibility of overflow. On 64-bit systems, overflow is not a practical concern, as it would take decades to occur. For 32-bit counters, we recommend a mechanism such as that proposed by Harris et al. [6]. Briefly, before line 2 of TMEnd is allowed to set glb to 0, the thread must block until all active TML critical sections complete. A variety of techniques exist to make all threads visible for such an operation.

Allocation: If critical section Ci delays immediately before executing line 1 of TMRead with address X, and a concurrent critical section frees X, then it is possible for Ci to fault if the OS reclaims X. There are many techniques to address this concern in STM, and all are applicable to TML:

– A garbage-collecting or transaction-aware allocator [7] may be used.
– The allocator may be prohibited from returning memory to the OS.
– On some architectures, line 1 of TMRead may use a non-faulting load [8].
– Read-only critical sections can call restore_chkpt when a fault occurs [9].
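One way to make all threads visible for the overflow quiescence step is a per-thread activity flag that a wrapping writer polls. The following sketch is hypothetical (the names, array bound, and structure are ours), not a mechanism prescribed by the paper:

```c
#include <stdatomic.h>

/* Hypothetical quiescence scheme for 32-bit counter overflow: each thread
 * publishes a flag while inside a critical section, so a writer about to
 * wrap glb to 0 can wait for every other section to finish first. */
#define MAX_THREADS 64
static atomic_int in_section[MAX_THREADS];  /* 1 while thread i is active */

void section_enter(int tid) { atomic_store(&in_section[tid], 1); }
void section_exit(int tid)  { atomic_store(&in_section[tid], 0); }

/* Called by the wrapping writer before resetting the counter: block until
 * every other thread has left its critical section. */
void wait_for_quiescence(int self) {
    for (int i = 0; i < MAX_THREADS; i++) {
        if (i == self) continue;
        while (atomic_load(&in_section[i])) { }   /* spin until i exits */
    }
}
```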

2.3 Programmability

I/O and Irrevocable Operations: A TML critical section that performs operations that cannot be rolled back (such as I/O and some syscalls) must never, itself, roll back. We provide a lightweight call suitable for such instances (it is essentially TMWrite lines 1–4). This call must be made once, before performing I/O or other irrevocable operations from within a critical section [10, 11]. For simplicity, we treat memory management operations as irrevocable.²

Interoperability with Legacy Code: In lock-based code, programmers frequently transition data between thread-local and shared states (e.g., when an element is removed from a shared lock-based collection, it becomes thread-local and can be modified without additional locks). Surprisingly, most STM implementations do not support such accesses [12,13,14]. In the terminology of Menon et al. [12],

² Since nearly all workloads that perform memory management (MM) also write to shared data, making MM irrevocable does not affect scalability, but it eliminates the overhead of supporting rollback of allocation and reclamation.


TML provides asymmetric lock atomicity (ALA), meaning that race-free code can transition data between shared and private states via the use of TML-protected regions. ALA, in turn, facilitates porting from locks to transactions without a complete, global understanding of object lifecycles. The argument for ALA is straightforward: transitioning data from shared to private (“privatization”) is safe, as TML uses polling on every TMRead to prevent doomed critical sections from accessing privatized data, and writing critical sections are completely serialized.³ Transitions from private to shared (“publication”) satisfy Menon’s ALA conditions, since the sampling of glb serves as prescient acquisition of a read lock covering a critical section’s entire set of reads. These properties are provided by TMBegin line 3 and strong memory ordering between TMWrite lines 2 and 5.

Limitations: TML can be thought of as both a replacement for locks and an STM implementation. However, there are a few restrictions on TML’s use in these settings. First, when TML is used instead of a R/W lock, the possibility of rollback precludes the use of irrevocable operations (such as I/O) within read-only critical sections. Instead, I/O must be treated as a form of writing. Second, when used as an STM, TML does not allow programmer-induced rollback for condition synchronization [15], except in the case of conditional critical regions [16], where all condition synchronization occurs before the first write. Third, our presentation assumes lexical scoping of critical sections (e.g., that the stack frame in which checkpoint() is executed remains active throughout the critical section). If this condition does not hold, then the critical section must be made irrevocable (e.g., via TMWrite lines 1–4) before the frame is deactivated.
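The privatization pattern that ALA permits can be illustrated with a hypothetical linked-list pop. The cs_* functions below are single-threaded stand-ins for Listing 1's instrumentation (names and stubs are ours, used only to make the shape of the code concrete); in real TML they would carry the full glb/loc protocol:

```c
#include <stddef.h>

typedef struct node { unsigned long value; struct node *next; } node_t;

/* Single-threaded stand-ins for Listing 1's API (no concurrency here). */
static void  cs_begin(void) {}
static void  cs_end(void) {}
static void *cs_read_ptr(void **addr)            { return *addr; }
static void  cs_write_ptr(void **addr, void *val) { *addr = val; }

static node_t *head;   /* shared, TML-protected */

/* Unlink the head node inside a critical section ("privatization"). */
node_t *pop_and_privatize(void) {
    cs_begin();
    node_t *n = cs_read_ptr((void **)&head);
    if (n)
        cs_write_ptr((void **)&head, cs_read_ptr((void **)&n->next));
    cs_end();
    return n;          /* now thread-local: plain accesses are race-free */
}

unsigned long privatize_demo(void) {
    static node_t b = { 2, NULL }, a = { 1, &b };
    head = &a;
    node_t *n = pop_and_privatize();
    n->value += 100;   /* uninstrumented write to privatized data */
    return n->value;
}
```

The key point is the last line of pop_and_privatize: once the critical section has committed, the unlinked node may be read and written with ordinary loads and stores, which many STMs do not guarantee to be safe.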

2.4 Performance Considerations

Inlining and Instrumentation: Given its simplicity, we expect all per-access instrumentation to be inlined. Depending on the ability of the compiler to cache loc and the address of glb in registers, up to six extra x86 assembly instructions remain per load, and up to 11 extra x86 assembly instructions per store. We also assume either a manually invoked API, such that instrumentation is minimal, or else compiler support to avoid instrumenting accesses to stack, thread-local, and “captured” memory [17].

Cache Behavior: We expect TML to incur fewer cache coherence invalidations than mutex or R/W locks, since read-only critical sections do not write metadata. Until it calls TMWrite, a TML critical section accesses only a single global, glb, and only to read it; thus the only cache effects of one thread on another are (1) a TMRead can cause the line holding glb to downgrade to shared in a concurrent thread that called TMWrite, and (2) a failed TMWrite can cause an eviction in a concurrent thread that successfully called TMWrite. These costs are equivalent to those experienced when a thread attempts to acquire a held test-and-test-and-set mutex lock. Furthermore, they are less costly than the evictions caused by

³ The sufficiency of these two conditions for privatization safety was established by Marathe et al. [14].


R/W locks, where, whenever any thread acquires or releases the lock, all concurrent threads holding the lock in their cache experience an eviction.

Progress: TML is livelock-free: an in-flight critical section A can roll back only if another in-flight critical section W increments glb. However, once W increments glb, it will not roll back (it is guaranteed to win all conflicts, and we prohibit programmer-induced rollback). Thus A’s rollback indicates that W is making progress. If starvation is an issue, the high-order bits of the nest field can be used as a consecutive-rollback counter. As in RingSTM [18], an additional branch in TMBegin can compare this counter to some threshold and, if the count is too high, make the critical section irrevocable at begin time.
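The suggested starvation counter can be sketched as follows, packing the nesting depth and a consecutive-rollback count into one word as the text describes; the bit layout and threshold are illustrative choices of ours, not values from the paper:

```c
/* Hypothetical starvation guard: low 16 bits hold the nesting depth,
 * high bits count consecutive rollbacks of the current critical section. */
#define NEST_BITS  16
#define NEST_MASK  ((1u << NEST_BITS) - 1)
#define THRESHOLD  8        /* illustrative bound */

unsigned nest_word;          /* per-thread in a real implementation */

/* On abort: depth resets to 0, rollback count increments. */
void on_rollback(void) {
    unsigned rollbacks = (nest_word >> NEST_BITS) + 1;
    nest_word = rollbacks << NEST_BITS;
}

/* On commit: a successful section clears the consecutive count. */
void on_commit(void) { nest_word &= NEST_MASK; }

/* Extra branch in TMBegin: become irrevocable if starving. */
int should_become_irrevocable(void) {
    return (nest_word >> NEST_BITS) >= THRESHOLD;
}
```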

3 Compiler Support

When there are many reads and writes, the instrumentation in Listing 1 admits much redundancy. We briefly discuss optimizations that target this overhead.

Post-Write Instrumentation (PWI): When a critical section W issues its first write to shared memory, via TMWrite, it increments the glb field, making it odd. It also increments its local loc field, ensuring that it matches glb. At this point, W cannot roll back, and no other critical section can modify glb until W increments it again, making it even. Any other concurrent critical sections are guaranteed to roll back, and to block until W completes. Thus once W performs its first write, instrumentation is not required on any subsequent read or write.

Unfortunately, standard static analysis does not suffice to eliminate this instrumentation, since glb is a volatile variable: the compiler cannot tell that glb is odd and immutable until W commits. We could assist the compiler by maintaining a separate per-thread flag, which is set on line 4 of TMWrite and unset in line 2 of TMEnd. TMWrite could then use this flag for its condition on line 1, and TMRead could test this flag between lines 1 and 2, returning immediately when the flag is set. Standard compiler analysis would then be able to elide most instrumentation that occurs after the first write of shared data.

A more precise mechanism for this optimization uses static analysis: any call to TMRead that occurs on a path that has already called TMWrite can skip lines 2–3. Similarly, any call to TMWrite that occurs on a path that has already called TMWrite can skip lines 1–4. Thus, after the first write, the remainder of a writing critical section will execute as fast as one protected by a single mutex lock. Propagation of this analysis must terminate at a call to TMEnd. It must also terminate at a join point between multiple control flows if a call to TMWrite does not occur on every flow.
To maximize the impact of this optimization, the compiler may clone basic blocks (and entire functions) that are called from both writing and non-writing contexts.⁴
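The per-thread writer flag described above might look like this in C11; this is our sketch of the proposed variant (names are ours), not code from the TML sources:

```c
#include <stdatomic.h>
#include <stdlib.h>

static _Atomic unsigned long glb;
static _Thread_local unsigned long loc;
static _Thread_local int is_writer;          /* set after the first TMWrite */

static void restore_chkpt(void) { abort(); } /* stand-in for rollback */

/* TMRead with the flag test inserted between lines 1 and 2 of Listing 1:
 * once the flag is set, the validation of glb can be elided. */
unsigned long tm_read_pwi(unsigned long *addr) {
    unsigned long tmp = *addr;
    if (is_writer) return tmp;               /* writer: skip validation */
    if (atomic_load(&glb) != loc) restore_chkpt();
    return tmp;
}

/* TMWrite line 1 now tests the flag instead of loc's parity. */
void tm_write_pwi(unsigned long *addr, unsigned long val) {
    if (!is_writer) {
        unsigned long expected = loc;
        if (!atomic_compare_exchange_strong(&glb, &expected, loc + 1))
            restore_chkpt();
        loc++;
        is_writer = 1;                       /* Listing 1, TMWrite line 4 */
    }
    *addr = val;
}

/* TMEnd unsets the flag (Listing 1, TMEnd line 2). */
void tm_end_pwi(void) {
    if (is_writer) { atomic_store(&glb, loc + 1); is_writer = 0; }
}
```

Because is_writer is an ordinary thread-local variable rather than a volatile global, a standard compiler can hoist the flag test and elide the instrumentation on paths dominated by the first write.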

⁴ In the context of STM, similar redundancy analysis has been suggested by Adl-Tabatabai et al. [19] and Harris et al. [6].


Relaxing Consistency Checks (RCC): Spear et al. [20] reduce processor memory fence instructions within transactional instrumentation by deferring post-validation (such as lines 2–3 of TMRead) when the result of a read is not used until after additional reads are performed. For TML, this optimization reduces the total number of instructions, even on machines that do not require memory fence instructions to ensure ordering. In effect, multiple tests of glb can be condensed into a single check without compromising correctness.

Lightweight Checkpointing and Rollback (LCR): When a critical section contains neither nesting nor function calls, the checkpoint at TMBegin can be skipped. Since all instrumentation is inlined, and rollback occurs only in read-only critical sections that cannot have any externally visible side effects, unwinding the stack can be achieved with an unconditional branch rather than a longjmp. Extending this optimization to critical sections that make function calls is possible, but requires an extra test on every function return.
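The RCC idea of condensing several validations into one can be illustrated as follows. In the paper this is a compiler transformation; our sketch shows the hand-written equivalent, and the names are ours:

```c
#include <stdatomic.h>

static _Atomic unsigned long glb;
static _Thread_local unsigned long loc;

/* Three fields are loaded with a single deferred validation, condensing
 * three tests of glb into one. This is safe only because no loaded value
 * is used (e.g., dereferenced) before the check; returns 0 on conflict. */
int read_triple(unsigned long *a, unsigned long *b, unsigned long *c,
                unsigned long out[3]) {
    out[0] = *a;
    out[1] = *b;
    out[2] = *c;
    return atomic_load(&glb) == loc;   /* one check covers all three reads */
}
```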

4 Evaluation

We evaluate TML using parameterized microbenchmarks taken from the RSTM suite [21]. Experiments labeled “Niagara2” were collected on a 1.165 GHz, 64-way Sun UltraSPARC T2 with 32 GB of RAM, running Solaris 10. The Niagara2 has eight cores, each of which is eight-way multithreaded. On the Niagara2, code was compiled using gcc 4.3.2 with –O3 optimizations. Experiments labeled “Nehalem” were collected on an 8-way Sun Ultra 27 with 6 GB RAM and a 2.93 GHz Intel Xeon W3540 processor with four cores, each of which is two-way multithreaded. Nehalem code was compiled using gcc 4.4.1, with –O3 optimizations. On both machines, the lowest level of the cache hierarchy is shared among all threads. However, the Niagara2 cores are substantially simpler than the Nehalem cores, resulting in different instrumentation overheads.

On each architecture, we evaluate five algorithms:

– Mutex – All critical sections are protected by a single coarse-grained mutex lock, implemented as a test-and-test-and-set lock with exponential backoff.
– R/W Lock – Critical sections are protected by a writer-prioritizing R/W lock, implemented as a 1-bit writer count and a 31-bit reader count. Regions statically identified as read-only acquire the lock for reading. Regions that may perform writes conservatively acquire the lock for writing.
– STM – Critical sections are implemented via transactions using a TL2-like STM implementation [8] with 1M ownership records, a hash table for write-set lookups, and setjmp/longjmp for rollback. This implementation is not privatization-safe.
– TML – Our TML implementation, using setjmp/longjmp for rollback.
– TML+opt – TML extended with the PWI, RCC, and LCR optimizations discussed in Section 3. These optimizations were implemented by hand.

In our microbenchmarks, threads repeatedly access a single, shared, pre-populated data structure. All data points are the average of five 5-second trials.
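A single-word reader/writer lock of the kind described (one writer bit plus a 31-bit reader count) can be sketched in C11 as follows; this simplified version of ours omits the writer prioritization and backoff of the evaluated implementation:

```c
#include <stdatomic.h>

#define WRITER (1u << 31)         /* high bit: writer held */
static _Atomic unsigned rw;       /* low 31 bits: reader count */

/* One atomic op to enter read mode, one to depart -- the boundary cost
 * the paper contrasts with TML's read-only path. */
void read_lock(void) {
    for (;;) {
        unsigned v = atomic_load(&rw);
        if (!(v & WRITER) &&
            atomic_compare_exchange_weak(&rw, &v, v + 1))
            return;
    }
}
void read_unlock(void) { atomic_fetch_sub(&rw, 1); }

/* Writer waits for the lock to be completely free. */
void write_lock(void) {
    for (;;) {
        unsigned v = 0;           /* expected: no readers, no writer */
        if (atomic_compare_exchange_weak(&rw, &v, WRITER))
            return;
    }
}
void write_unlock(void) { atomic_fetch_and(&rw, ~WRITER); }
```

Note that both read_lock and read_unlock write the shared lock word, which is the source of the cache evictions among readers discussed in Section 2.4.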

[Figure 1 shows two throughput plots, (a) Niagara2 and (b) Nehalem: throughput (1000 Tx/sec) versus thread count for Mutex, R/W Lock, STM, TML, and TML+opt.]

Fig. 1. Linked list benchmark. All threads perform 90% lookups, 5% inserts, and 5% deletes from a singly linked list storing 8-bit keys.

4.1 List Traversal

Figure 1 presents a workload where threads perform 90% lookups, 5% inserts, and 5% removes on a linked list storing 8-bit values. TML scales well, since writes are rare. STM also scales well, but with much higher latency. In STM, each individual read and write must be logged, and the per-read instrumentation is more complex. The resulting overheads, such as increased pressure on the L1 cache, prevent the workload from scaling well beyond the number of cores – 4 and 8 threads, respectively, on the Nehalem and Niagara2. In contrast, since TML has constant per-thread metadata requirements, there is much less L1 contention, and scalability beyond the number of cores is quite good.

Furthermore, even with a shared cache, the R/W Lock implementation does not perform as well as TML. There are three contributing factors. First, atomic operations on both critical section entry and exit increase single-thread latency. Second, TML causes fewer cache evictions than R/W locks. Third, the conservative decision to acquire the lock in writer mode for critical sections that may perform writes limits scalability. Furthermore, TML allows a reader to complete while a writer is active, if the reader can start and finish in the time between when the writer begins and when it performs its first write. In the List benchmark, writers perform on average 64 reads before their first write, providing ample time for a reader to complete successfully.⁵

On the Niagara2, simple in-order cores cannot mask even the lightweight instrumentation of TML. Thus even though TML is more than three times as fast as STM at one thread, it is slower than Mutex until two threads and slower than R/W Lock until four threads. Furthermore, we observe that the RCC and LCR optimizations have a profound impact on the Niagara2. Single-thread latency

⁵ This same property results in the PWI optimization having no noticeable impact on the List workload, since writer critical sections are rare and perform all reads before the first write.


[Figure 2 shows two throughput plots, (a) Niagara2 and (b) Nehalem: throughput (1000 Tx/sec) versus thread count for Mutex, R/W Lock, STM, TML, and TML+opt.]

Fig. 2. Red-black tree benchmark. All threads perform a 90/5/5 mix of lookup/insert/remove operations on a red-black tree storing 16-bit keys.

improves by more than 10%, resulting in a crossover with R/W Lock at two threads. In addition, decreasing the latency of critical sections leads to greater scalability.

On the Nehalem, cas is heavily optimized, resulting in impressive single-thread performance for both Mutex and R/W Lock. In contrast, the requirement for per-access instrumentation leads to TML performing much worse than the Mutex baseline. As a result, TML does not outperform single-threaded Mutex until three threads, at which point it also begins to outperform R/W Lock. As on the Niagara2, the RCC and LCR optimizations lead to much lower single-thread latency (roughly the same as Mutex), but they do not yield a change in the slope of the curve. We also note that for the List, the Nehalem is not able to exploit multithreading to scale beyond the number of cores. Last, on the Nehalem, TML proves quite resilient to preemption, even without the use of Solaris schedctl.⁶ This behavior matches our intuition that a preempted TML read-only critical section should not impede the progress of concurrent readers or writers.

4.2 Red-Black Tree Updates

The List workload does not exhibit much parallelism in the face of writers, since any write is likely to invalidate most concurrent readers (or writers, in the case of STM). To assess the merits of TML relative to STM, we consider a (pre-populated) red-black tree in Figure 2. In this workload, we again have a 90/5/5 mix of lookups, inserts, and removes, but we now use 16-bit keys. This results in much shorter critical sections, but also many fewer true conflicts, since operations on different branches of the tree should not conflict.

Since it has fine-grained conflict detection, and since conflicts are rare, STM scales to the full capacity of the Niagara2. TML achieves a higher peak, but then false conflicts cause performance to degrade starting around 32 threads. Separate experiments at all thread levels from 1–64 confirm that this tapering

⁶ Schedctl allows a thread to briefly defer preemption, e.g., when holding locks.


off is smooth, and not related to an increase in multithreading. As with the list, we observe that the conservative assumptions of writing critical sections cause R/W Lock to scale poorly, despite its lower single-thread latency.

On the Nehalem, STM starts from a lower single-thread throughput, but scales faster than TML. Both TML and STM scale beyond the core count, effectively using hardware multithreading to increase throughput. Furthermore, since there is significant work after the first write in a writing critical section, the ability of STM to allow concurrent readers proves crucial. At four threads, STM rollbacks are three orders of magnitude fewer than in TML, while commits are only 20% fewer. This implies that most TML rollbacks are unnecessary, but that the low latency of TML is able to compensate.

Surprisingly, we also see that our compiler optimizations have a negative impact on scalability for this workload on the Nehalem. We can attribute this result to the LCR optimization. In effect, longjmp approximates randomized backoff on rollback, which enables some conflicts to resolve themselves. In separate experiments, we found PWI and RCC to have a slight positive impact on this workload when LCR is not applied. We conclude that the wide issue width of the Nehalem decreases the merit of these optimizations.

4.3 Write-Dominated Workloads

The scalability of TML relative to STM is tempered by the fact that TML is optimized for read-dominated workloads. As a higher percentage of critical sections perform writes, TML loses its edge over STM. In this setting, TML will often scale better than R/W locks, since a read-only critical section can overlap with the beginning of a writing critical section that does not perform writes immediately. However, TML should have lower throughput than an ideal STM, where nonconflicting critical sections can proceed in parallel. We assess this situation in Figure 3. In the experiment, we fix the thread count at 4 on the Nehalem, and at 16 on the Niagara2, and then vary the frequency of read-only critical sections. For many workloads, a 90% read-only

[Figure 3 shows two throughput plots, (a) Niagara2 at 16 threads and (b) Nehalem at 4 threads: throughput (1000 Tx/sec) versus the percentage of read-only transactions for Mutex, R/W Lock, STM, TML, and TML+opt.]

Fig. 3. Red-black tree benchmark with 16-bit keys. The percentage of read-only critical sections varies, while the number of threads is fixed.


ratio is common, and in such a setting, TML provides higher throughput than STM. However, as the read-only ratio decreases, the workload still admits a substantial amount of parallelism. STM can exploit this parallelism, while TML, R/W locks, and mutex locks cannot.

5 Conclusions

In this paper, we presented Transactional Mutex Locks (TML), which provide the strength and generality of mutex locks without sacrificing scalability when critical sections are read-only and can be executed in parallel. TML avoids much of the instrumentation overhead of traditional STM. In comparison to reader/writer locks, it avoids the need for static knowledge of which critical sections are read-only. In comparison to RCU and sequence locks, it avoids restrictions on the programming model.

Our results are very promising, showing that TML can perform competitively with mutex locks at low thread counts, and that TML performs substantially better when the thread count is high and most critical sections are read-only. By leveraging many lessons from STM research (algorithms, semantics, compiler support), TML can improve software today, while offering a clear upgrade path to STM as hardware and software improvements continue. We also hope that TML will provide an appropriate baseline for evaluating new STM algorithms, since it offers substantial read-only scalability and low latency without the overhead of a full and complex STM implementation.

Acknowledgments. We thank the anonymous reviewers for many insightful comments that improved the quality of this paper.

References

1. McKenney, P.E.: Exploiting Deferred Destruction: An Analysis of Read-Copy-Update Techniques in Operating System Kernels. PhD thesis, OGI School of Science and Engineering at Oregon Health and Sciences University (2004)
2. Lameter, C.: Effective Synchronization on Linux/NUMA Systems. In: Proc. of the May 2005 Gelato Federation Meeting, San Jose, CA (2005)
3. Shavit, N., Touitou, D.: Software Transactional Memory. In: Proc. of the 14th ACM Symp. on Principles of Distributed Computing, Ottawa, ON, Canada (1995)
4. Cascaval, C., Blundell, C., Michael, M., Cain, H.W., Wu, P., Chiras, S., Chatterjee, S.: Software Transactional Memory: Why Is It Only a Research Toy? Queue 6(5), 46–58 (2008)
5. Dalessandro, L., Spear, M.F., Scott, M.L.: NOrec: Streamlining STM by Abolishing Ownership Records. In: Proc. of the 15th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, Bangalore, India (2010)
6. Harris, T., Plesko, M., Shinar, A., Tarditi, D.: Optimizing Memory Transactions. In: Proc. of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation, Ottawa, ON, Canada (2006)
7. Hudson, R.L., Saha, B., Adl-Tabatabai, A.R., Hertzberg, B.: A Scalable Transactional Memory Allocator. In: Proc. of the 2006 International Symp. on Memory Management, Ottawa, ON, Canada (2006)
8. Dice, D., Shalev, O., Shavit, N.: Transactional Locking II. In: Proc. of the 20th International Symp. on Distributed Computing, Stockholm, Sweden (2006)
9. Felber, P., Fetzer, C., Riegel, T.: Dynamic Performance Tuning of Word-Based Software Transactional Memory. In: Proc. of the 13th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, Salt Lake City, UT (2008)
10. Spear, M.F., Silverman, M., Dalessandro, L., Michael, M.M., Scott, M.L.: Implementing and Exploiting Inevitability in Software Transactional Memory. In: Proc. of the 37th International Conference on Parallel Processing, Portland, OR (2008)
11. Welc, A., Saha, B., Adl-Tabatabai, A.R.: Irrevocable Transactions and their Applications. In: Proc. of the 20th ACM Symp. on Parallelism in Algorithms and Architectures, Munich, Germany (2008)
12. Menon, V., Balensiefer, S., Shpeisman, T., Adl-Tabatabai, A.R., Hudson, R., Saha, B., Welc, A.: Practical Weak-Atomicity Semantics for Java STM. In: Proc. of the 20th ACM Symp. on Parallelism in Algorithms and Architectures, Munich, Germany (2008)
13. Spear, M.F., Dalessandro, L., Marathe, V.J., Scott, M.L.: Ordering-Based Semantics for Software Transactional Memory. In: Baker, T.P., Bui, A., Tixeuil, S. (eds.) OPODIS 2008. LNCS, vol. 5401, pp. 275–294. Springer, Heidelberg (2008)
14. Marathe, V.J., Spear, M.F., Scott, M.L.: Scalable Techniques for Transparent Privatization in Software Transactional Memory. In: Proc. of the 37th International Conference on Parallel Processing, Portland, OR (2008)
15. Harris, T., Marlow, S., Peyton Jones, S., Herlihy, M.: Composable Memory Transactions. In: Proc. of the 10th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, Chicago, IL (2005)
16. Brinch Hansen, P.: Operating System Principles. Prentice-Hall, Englewood Cliffs (1973)
17. Dragojevic, A., Ni, Y., Adl-Tabatabai, A.R.: Optimizing Transactions for Captured Memory. In: Proc. of the 21st ACM Symp. on Parallelism in Algorithms and Architectures, Calgary, AB, Canada (2009)
18. Spear, M.F., Michael, M.M., von Praun, C.: RingSTM: Scalable Transactions with a Single Atomic Instruction. In: Proc. of the 20th ACM Symp. on Parallelism in Algorithms and Architectures, Munich, Germany (2008)
19. Adl-Tabatabai, A.R., Lewis, B.T., Menon, V., Murphy, B.R., Saha, B., Shpeisman, T.: Compiler and Runtime Support for Efficient Software Transactional Memory. In: Proc. of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation, Ottawa, ON, Canada (2006)
20. Spear, M.F., Michael, M.M., Scott, M.L., Wu, P.: Reducing Memory Ordering Overheads in Software Transactional Memory. In: Proc. of the 2009 International Symp. on Code Generation and Optimization, Seattle, WA (2009)
21. Rochester Synchronization Group, Department of Computer Science, University of Rochester: Rochester STM (2006–2009), http://www.cs.rochester.edu/synchronization/rstm/

Exceptions for Algorithmic Skeletons

Mario Leyton¹, Ludovic Henrio², and José M. Piquer³

¹ NIC Labs, Universidad de Chile, Miraflores 222, Piso 14, 832-0198, Santiago, Chile
[email protected]
² INRIA Sophia-Antipolis, Université de Nice Sophia-Antipolis, CNRS – I3S, 2004 Route des Lucioles, BP 93, F-06902 Sophia-Antipolis Cedex, France
[email protected]
³ Departamento de Ciencias de la Computación, Universidad de Chile, Av. Blanco Encalada 2120, Santiago, Chile
[email protected]

Abstract. Algorithmic skeletons offer high-level abstractions for parallel programming based on recurrent parallelism patterns. Patterns can be combined and nested into more complex parallelism behaviors. Programmers fill the skeleton patterns with the functional (business) code, which transforms the generic skeleton into a specific application. However, when the functional code generates exceptions, programmers are exposed to implementation details of the skeleton library, breaking the high-level abstraction principle. Furthermore, related parallel activities must be stopped as the exception is raised. This paper describes how to handle exceptions in algorithmic skeletons without breaking the high-level abstractions of the programming model. We describe both the behavior of the framework in a formal way and its implementation in Java: the Skandium library.

Keywords: Algorithmic skeletons, exceptions, semantics.

1 Introduction

Exceptions are the traditional way of handling programming errors that alter the normal execution flow. Many programming languages provide support for exception handling, such as Ada, C++, Eiffel, Java, OCaml, and Ruby.

Algorithmic skeletons (skeletons for short) are a high-level programming model for parallel and distributed computing, introduced by Cole [7]. Skeletons take advantage of recurrent programming patterns to hide the complexity of parallel and distributed applications. Starting from a basic set of patterns (skeletons), more complex patterns can be built by nesting the basic ones. To write an application, programmers must compose skeleton patterns and fill them with the sequential blocks specific to the application. The skeleton pattern implicitly defines the parallelization, distribution, orchestration, and composition aspects, while the functional code provides the application’s functional aspects (i.e., business code).

P. D’Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 14–25, 2010. © Springer-Verlag Berlin Heidelberg 2010


The functional code, provided by users, is likely to encounter errors and generate exceptions. The open question we address in this paper is how exceptions raised by business code interact with the surrounding skeleton pattern to alter (or not) the pattern's normal execution flow; and, in the worst case, how these exceptions are reported back to the parent skeleton or user after related parallel activities are aborted. In this paper we present an exception mechanism for an algorithmic skeleton library which blends in with the host programming language's exception handling. Furthermore, the process of unwinding the stack does not reveal unnecessary low-level details of the skeleton library implementation, but is consistent with the high-level abstractions of the programming model.

This paper is organized as follows. Section 2 describes the related work. Section 3 provides a brief overview of the algorithmic skeleton programming model. Section 4 introduces the exception model. Section 5 shows how the exception model is implemented in the Skandium library, and Section 6 provides the conclusions.

2 Related Work

As a skeleton library we use Skandium [19,18], a multi-core reimplementation of Calcium [16], a ProActive [5] based algorithmic skeleton library. Skandium is mainly inspired by the Lithium [1] and Muskel [9] frameworks developed at the University of Pisa. In all of them, skeletons are provided to the programmer as a Java API. Our previous work has provided formalisms for algorithmic skeletons such as a type system for nestable parallelism patterns [6] and a reduction semantics [16]. This work extends both previous formalisms.

QUAFF [10] is a recent skeleton library written in C++ and MPI. QUAFF relies on template-based meta-programming techniques to reduce runtime overheads and perform skeleton expansions and optimizations at compilation time. Skeletons can be nested and sequential functions are stateful. QUAFF takes advantage of C++ templates to generate, at compilation time, new C/MPI code. QUAFF is based on the CSP model, where the skeleton program is described as a process network and production rules (single, serial, par, join) [11].

Formalisms in Skil [3] provide polymorphic skeletons in C. Skil was later reimplemented as the Muesli [14] skeleton library, but instead of a subset of the C language, skeletons are offered through C++. Contrary to Skil, Muesli supports nesting of task and data parallel skeletons [15] but is limited to P3L's two-tier approach [2].

Exceptions are a relevant subject in parallel programming models. For example, [4] introduces exceptions for asynchronous method calls on active objects, and [12] provides formalisms for exception management in BSML, a functional parallel language for BSP. However, to the best of our knowledge, no previous work has focused on exception handling for algorithmic skeletons.

Regarding exception management in parallel and distributed programming, the exception handling approach proposed by Keen et al. [13] for asynchronous method invocation presents similarities with ours, in the sense that exceptions within the skeleton framework happen in an asynchronous context and a predefined handler is required to manage the exception. However, in our approach handlers are allowed to fail, and we use the familiar Future.get() as a synchronization point to deliver the result or exception.

The exception handling approach used by JCilk [8] also presents a similarity with our approach. An exception raised by a parallel activity causes its siblings to abort. This yields simpler semantics, closer to what a sequential programmer would expect in Java, and also provides a mechanism to abort parallel activities in search-type algorithms (such as branch and bound). However, the approach used in JCilk corresponds to a language extension, while our approach is implemented as a library extension. Furthermore, our approach provides exceptions which do not break the high-level principle.

3 Algorithmic Skeletons in a Nutshell

In Skandium [19], skeletons are provided as a Java library. The library can nest task and data parallel skeletons in the following way:

    ∆ ::= seq(fe, h) | farm(∆, h) | pipe(∆1, ∆2, h) | while(fc, ∆, h) | for(i, ∆, h)
        | if(fc, ∆true, ∆false, h) | map(fs, ∆, fm, h) | fork(fs, {∆i}, fm, h)
        | d&c(fc, fs, ∆, fm, h)

Each skeleton represents a different pattern of parallel computation. For each skeleton, an optional argument (h) has been introduced in this paper, which corresponds to the skeleton's exception handler, described in detail in Section 4. In algorithmic skeletons all communication details are implicit for each pattern, hidden away from the programmer. The task parallel skeletons are: seq for wrapping execution functions; farm for task replication; pipe for staged computation; while/for for iteration; and if for conditional branching. The data parallel skeletons are: map for single instruction multiple data; fork, which is like map but applies multiple instructions to multiple data; and d&c for divide and conquer.

The nested skeleton pattern (∆) relies on sequential blocks of the application. These blocks provide the business logic and transform a general skeleton pattern into a specific application. We denominate these blocks muscles, as they provide the real (non-parallel) functionality of the program. In Skandium, muscles come in four flavors:

    Execution    fe : p → r
    Split        fs : p → [r0, ..., rk]
    Merge        fm : [p0, ..., pk] → r
    Condition    fc : p → boolean

Here p ranges over parameters, i.e., values or objects as represented in the host language (Java in our implementation). Additionally,

    r ::= p | e


where the result r can be either a new parameter p or an exception e, and [r0, ..., rk] is a list of results. For the skeleton language, muscles are black boxes invoked during the computation of the skeleton program. Multiple muscles may be executed either sequentially or in parallel with respect to each other, in accordance with the defined ∆. The result of a muscle is passed as a parameter to other muscle(s). When no further muscles need to be executed, the final result is delivered to the user.
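The four muscle flavors can be pictured as small Java interfaces. The shapes below are illustrative renderings of the signatures fe, fs, fm, and fc; the exact Skandium interface names and signatures may differ, so treat them as a sketch rather than the library's API.

```java
// Hypothetical Java renderings of the four muscle flavors; names and
// signatures are illustrative, not necessarily the exact Skandium API.
interface Execute<P, R>   { R execute(P p) throws Exception; }          // fe : p -> r
interface Split<P, R>     { R[] split(P p) throws Exception; }          // fs : p -> [r0, ..., rk]
interface Merge<P, R>     { R merge(P[] parts) throws Exception; }      // fm : [p0, ..., pk] -> r
interface Condition<P>    { boolean condition(P p) throws Exception; }  // fc : p -> boolean

public class Muscles {
    public static void main(String[] args) throws Exception {
        // A muscle reduces to a result r: a new parameter p' or an exception e.
        Execute<Integer, Integer> square = p -> p * p;
        Condition<Integer> positive = p -> p > 0;
        System.out.println(square.execute(7));      // 49
        System.out.println(positive.condition(-1)); // false
    }
}
```

The `throws Exception` clauses reflect the point of this section: any muscle may reduce to an exception e instead of a value.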

4 Exceptions for Algorithmic Skeletons

Exceptions provide a useful mechanism to report errors and disrupt the normal program flow. Besides error communication, exceptions are also useful, for example, to stop computation once a result is found in recursive algorithms such as branch & bound. To illustrate the relevance of exceptions in algorithmic skeleton programming, consider the following simplified skeleton pseudocode:

    Skeleton s = new Farm(new Pipe(stage1, stage2));
    //...
    public X stage1(P p) throws FileNotFoundException {
        FileInputStream fis = new FileInputStream(new File(p.filename));
    }

Clearly, muscles such as stage1 can raise an exception (e.g., if filename is not found). However, for the skeleton pattern Pipe it is unclear whether the FileNotFoundException raised by stage1 should be passed as a result to stage2, cancel computation altogether, or be reported to the parent skeleton (Farm). The model we propose contemplates three possible scenarios:

1. An exception is raised and caught inside a muscle. For the skeleton it is as if no exception was raised at all, and thus no action is required.
2. An exception is raised by a muscle or a sub-skeleton, and a matching handler is found and executed to produce the skeleton's result.
3. An exception is raised by a muscle or sub-skeleton, and no matching handler is found in the skeleton. The exception is raised to the parent skeleton.

4.1 Exception Semantics

A simplified version of the skeleton semantics is described, which extends a classical skeleton semantics that can be found in [17]. The principle of the operational semantics with exceptions is that muscle functions f are reduced to a result r which can be either a new parameter p or an exception e. Conceptually, when an exception is raised by a muscle we want the surrounding skeleton to either handle the exception and produce a regular result, or raise the exception to the parent skeleton.

Consider the skeleton example farm(pipe(∆1, ∆2, h)) with the input p0. If ∆1(p0) reduces to a value p, then the evaluation continues normally: p is passed to ∆2 and evaluated to p′:

    farm(pipe(∆1, ∆2, h))(p0) → ... → farm(pipe(p, ∆2, h)) → farm(p′) → ...

On the contrary, if ∆1 is reduced to an exception e then:

    farm(pipe(e, ∆2, h))(p0) → farm(h(e)) → farm(r)

The pipe is reduced to h(e), which returns a result r: either an exception e or a value p.

Extending the semantics for exceptions is tricky because nested skeletons are compiled into a sequence of lower-level instructions J. Therefore the skeleton nesting must be remembered so that the instruction reduction semantics raise the exception to the parent skeleton's handler. Let us introduce some new concepts and helper functions:

- e is an exception which represents an error during the computation. A concatenation τ ⊕ e extends the trace of a given exception e by the creation context τ.
- h is a programmer-provided handler function. Given a parameter p and an exception e, this handler can be evaluated into a result: h(p, e) → r. The result r can be a new parameter p or a new exception e if an error took place during the handler evaluation.
- match is a function taking a handler h and an exception e as parameters. match(h, e) returns true if the handler h can be applied to the exception e, and false otherwise.
- The → and ⇒ arrows are used for local and global reductions, respectively.
- ↑ is used to separate the evaluated skeleton from its exception handler.

An instruction J can reduce its parameter either to a result r or an exception e. For the latter case, the handler h remembers the parameter p before J is reduced: p is both stored in the handler and used in J.

    p-remember:    J ↑ h(τ)(p) → J(p) ↑ h(τ, p)

If the instruction finishes without raising an exception, then the handler is not invoked. Above, at the end of the pipe, the handler h was discarded and farm(p′) was obtained.

    finished-ok:   p′ ↑ h(τ, p) ⇒ p′

On the contrary, if the instruction raises an exception, then this exception is transmitted and further instructions at the same level are discarded.

    e-transmit:    J(e) · ... · Jn ⇒ e

If the result is an exception e and the handler h matches the exception, then the handler is invoked on the exception.

    e-catch:       if match(h, e), then e ↑ h(τ, p) ⇒ h(τ, p, e)


If the handler h does not match e, then we add a trace to the exception.

    e-raise:       if ¬match(h, e), then e ↑ h(τ, p) ⇒ τ ⊕ e
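As a plain-Java intuition for the e-catch and e-raise rules, an exception can be seen as unwinding a stack of (trace, handler) levels: a matching handler yields a result, and each non-matching level prepends its creation trace τ to the exception. This is a toy model with illustrative names, not the Skandium implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class Unwind {
    // One stack level: a creation trace tau plus a handler that either
    // matches the exception (producing a recovery result) or does not.
    static class Level {
        final String trace; final boolean matches; final String recovery;
        Level(String trace, boolean matches, String recovery) {
            this.trace = trace; this.matches = matches; this.recovery = recovery;
        }
    }

    static String raise(String exn, Deque<Level> stack) {
        StringBuilder traced = new StringBuilder(exn);
        while (!stack.isEmpty()) {
            Level l = stack.pop();                 // innermost handler first
            if (l.matches) return l.recovery;      // e-catch: handler yields a result
            traced.insert(0, l.trace + " (+) ");   // e-raise: tau (+) e
        }
        return "uncaught: " + traced;              // empty stack: report to the user
    }

    public static void main(String[] args) {
        Deque<Level> s = new ArrayDeque<>();
        s.push(new Level("pipe@Main.java:10", true, "recovered")); // outer level
        s.push(new Level("if@Main.java:12", false, null));         // inner level
        System.out.println(raise("e", s)); // inner level adds its trace, outer catches
    }
}
```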

In the scenario of data parallelism, if one of the subcomputations raises an exception:

    conq-inst-reduce-with-exception:   conqI(fc)(Ω1 ... ei ... Ωn) → conqI(fc)(ei)

then the exception ei is kept and the other parallel activities are discarded. For further details see Section 5.1.

4.2 Illustrative Example

Let us illustrate the semantics introduced in Section 4.1 with a simple example showing some of the reduction rules. We consider the following skeleton:

    ∆ = pipe(if(fb, seq(fpre, hp), seq(fid)), seq(ft), h)

This skeleton acts in two phases. First, depending on a boolean condition on the received data (given by the muscle function fb), a pre-treatment fpre might be realized (or nothing if the condition is not verified; fid is the identity muscle function); then the main treatment, expressed by the function ft, is performed. The instruction corresponding to the preceding skeleton is the following:

    pipeI(ifI(fb, seqI(fpre) ↑ hp(τp), seqI(fid) ↑ h∅(τ1)) ↑ h∅(τi), seqI(ft) ↑ h∅(τt)) ↑ h(τ)

For readability, ifI(...) below abbreviates ifI(fb, seqI(fpre) ↑ hp(τp), seqI(fid) ↑ h∅(τ1)). The evaluation on an input d1 proceeds as follows (rule names on the right):

    pipeI(ifI(...) ↑ h∅(τi), seqI(ft) ↑ h∅(τt)) ↑ h(τ)(d1)
      ⇒* pipeI(ifI(...) ↑ h∅(τi), seqI(ft) ↑ h∅(τt))(d1) ↑ h(τ, d1)        [remember]
      ⇒* ifI(...) ↑ h∅(τi)(d1) · pipeI(seqI(ft) ↑ h∅(τt)) ↑ h(τ, d1)       [pipe-reduction-n]
      ⇒* ifI(...)(d1) ↑ h∅(τi, d1) · pipeI(seqI(ft) ↑ h∅(τt)) ↑ h(τ, d1)   [remember]
      ⇒* seqI(fb)(d1) · choiceI(d1, seqI(fpre) ↑ hp(τp), seqI(fid) ↑ h∅(τ1))
           ↑ h∅(τi, d1) · pipeI(seqI(ft) ↑ h∅(τt)) ↑ h(τ, d1)              [if-inst]
      ⇒* true · choiceI(d1, seqI(fpre) ↑ hp(τp), seqI(fid) ↑ h∅(τ1))
           ↑ h∅(τi, d1) · pipeI(seqI(ft) ↑ h∅(τt)) ↑ h(τ, d1)              [seq]
      ⇒* seqI(fpre) ↑ hp(τp)(d1) ↑ h∅(τi, d1)
           · pipeI(seqI(ft) ↑ h∅(τt)) ↑ h(τ, d1)                           [choice]
      ⇒* e ↑ hp(τp, d1) ↑ h∅(τi, d1) · pipeI(seqI(ft) ↑ h∅(τt)) ↑ h(τ, d1) [remember+seq]
      ⇒* τi ⊕ τp ⊕ e · pipeI(seqI(ft) ↑ h∅(τt)) ↑ h(τ, d1)                 [raise+finished-ok]
      ⇒* τi ⊕ τp ⊕ e ↑ h(τ, d1)                                            [next+pipe]
      ⇒* h(d1, τi ⊕ τp ⊕ e)                                                [catch+seq-insts]

Fig. 1. Exception Semantics Example


Here we introduce h∅ as the empty handler, and τx for the locations of the different instructions; for example, τ is the creation point of the pipe instruction, and τi of the if instruction. This instruction is evaluated as shown in Figure 1. Starting from an incoming data d1, we suppose that fb(d1) is true. For simplicity of the reduction, we suppose that all the functions are stateless, allowing a parallel evaluation of the different steps. The usage of the context-handler and context-handler-stack rules is implicit (inst-arrow allows a reduction → to be raised to the level of a reduction ⇒). The evaluation of the skeleton is shown in Figure 1, where ⇒* is the reflexive transitive closure of ⇒. It involves an exception being raised and propagated up to the top-level handler.

5 Exceptions in Skandium Library

Skandium is a Java based algorithmic skeleton library for high-level parallel programming of multi-core architectures. Skandium provides basic nestable parallelism patterns which can be composed to program more complex applications. From a general perspective, parallelism or distribution of an application in Skandium is a producer/consumer problem: the shared buffer is a task queue, and the produced/consumed data are tasks, as shown in Figure 2 (Thread Pool Execution).

A ready-queue stores ready tasks. Root-tasks are entered into the ready-queue by users, who provide the initial parameter and the skeleton program. The skeleton program undergoes a transformation process into an internal stack of instructions which harness the parallelism behavior of each skeleton (as detailed in Figure 2 of [17]). Interpreter threads consume tasks from the ready-queue and compute their skeleton instruction stack. When the interpreters cannot compute a task any further, the task is either in the finished or the waiting state. If the task is in the finished state, its result is delivered to the user. If the task is in the waiting state, then the task has generated new sub-tasks which are inserted into the ready-queue. Sub-tasks represent data parallelism for skeletons such as map, fork, and d&c. A sub-task may in turn produce new sub-tasks. A task will exit the waiting state and be reinserted into the ready-queue when all of its sub-tasks are finished.
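The producer/consumer structure described above can be sketched with standard java.util.concurrent primitives. This is a toy model of the execution scheme with illustrative names, not Skandium's actual code: user code produces tasks into a shared ready-queue and an "interpreter" thread consumes and runs them.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

public class MiniPool {
    // Runs n trivial tasks through a ready-queue consumed by one
    // interpreter thread, and returns how many tasks completed.
    static int runTasks(int n) throws InterruptedException {
        BlockingQueue<Runnable> ready = new LinkedBlockingQueue<>();
        CountDownLatch done = new CountDownLatch(n);
        Thread interpreter = new Thread(() -> {
            try {
                while (true) ready.take().run();   // consume from the ready-queue
            } catch (InterruptedException e) { /* pool shutdown */ }
        });
        interpreter.setDaemon(true);
        interpreter.start();
        for (int i = 0; i < n; i++) ready.put(done::countDown); // produce tasks
        done.await();                              // all tasks interpreted
        interpreter.interrupt();
        return n - (int) done.getCount();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runTasks(3)); // 3
    }
}
```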

5.1 How Exceptions Cancel Parallelism

Exceptions disrupt the normal execution flow of a program. The generation and propagation of exceptions require the cancellation of sibling parallel activities. The semantics introduced in Section 4.1 simply discards sibling parallel


     1  // 1. Define the skeleton program
     2  Skeleton sort = new DaC(
     3      new ShouldSplit(threshold),
     4      new SplitList(),
     5      new Sort(),
     6      new MergeList(),
     7      new HandleSortException());
     8
     9  // 2. Input parameters
    10  Future<Range> future = sort.input(new Range(...));
    11
    12  // 3. Do something else here...
    13
    14  // 4. Block for the results
    15  try {
    16      Range result = future.get();
    17  } catch (ExecutionException e) { ... }

    Listing 1. Skandium Library Usage

activities, as specified by the CONQ-INST-REDUCE-WITH-EXCEPTION rule. Thus, implementations are free to continue the execution of sibling parallel activities and disregard their results, or to apply a best effort to cancel the sibling parallel activities. The latter approach is implemented in Skandium, as it can be used to abort recursive searches once a result is found.

In the case of the Skandium library, a task is cancelled as follows (this also applies to direct task cancellation by a user). If the task is in the ready-queue, then it is removed. If the task is in execution, then it is stopped as soon as possible: after the task's current instruction is finished, but before the next instruction begins. Finally, if the task is in the waiting state, then each of its sub-tasks is cancelled (recursively).

A raised exception unwinds the task's execution stack until a handler is found and the computation can continue, or until the task's stack is empty. When the stack is empty, the exception is either returned to the user (for root-tasks) or passed to the parent-task (for sub-tasks). When a parent receives an exception from a sub-task, all of the sub-task's siblings are cancelled, and the exception continues to unwind the parent's stack. Note that an exception propagation will not abort parallel activities from different root-tasks (task parallelism), as they do not have a common task ancestor.
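The recursive cancellation of a waiting task's sub-tasks can be sketched as follows. This is a toy model with hypothetical names, not Skandium's implementation; it only illustrates the cascade.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

public class CancelTree {
    // A task holds its sub-tasks; cancelling a task cancels every
    // sub-task recursively, as described above for waiting tasks.
    static class Task {
        final List<Task> subTasks = new ArrayList<>();
        final AtomicBoolean cancelled = new AtomicBoolean(false);

        void cancel() {
            if (cancelled.compareAndSet(false, true)) {   // cancel at most once
                for (Task sub : subTasks) sub.cancel();   // cascade recursively
            }
        }
    }

    public static void main(String[] args) {
        Task root = new Task();
        Task child = new Task();
        Task grand = new Task();
        root.subTasks.add(child);
        child.subTasks.add(grand);
        root.cancel();
        System.out.println(grand.cancelled.get()); // true: cancellation cascaded
    }
}
```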

5.2 Skandium API with Exception

The code in Listing 1 shows how simple it is to interact with the Skandium API to input data and retrieve the result. Lines 1-7 show how a Divide and Conquer (DaC) skeleton is built using four muscle functions and an exception handler. Line 10 shows how new data is entered into the skeleton. In this case, a new Range(...) contains an array and two indexes, left and right, which represent the minimum and maximum indexes to consider. The input yields a future which can be used to cancel the computation, or to query or block for the result. Finally, line 16 blocks until the result is available or an exception is raised.


    class SplitList implements Split<Range, Range> {
        @Override
        public Range[] split(Range r) throws ArrayIndexOutOfBoundsException {
            int i = partition(r.array, r.left, r.right);
            Range[] intervals = { new Range(r.array, r.left, i - 1),
                                  new Range(r.array, i + 1, r.right) };
            return intervals;
        }
    }

    Listing 2. Skandium Muscle Example

    interface ExceptionHandler<P, R, E extends Exception> {
        public R handle(P p, E exception) throws Exception;
    }

    Listing 3. Exception Handler Interface

Listing 2 shows the definition of a Skandium muscle, which provides the functional (business) behavior to a skeleton. In the example, the SplitList muscle implements the Split interface, and thus requires the implementation of the R[] split(P p) method. Additionally, the muscle may raise an ArrayIndexOutOfBoundsException when stepping outside of an array.

5.3 Exception Handler

An exception handler is a function h : e → r which transforms an exception into a regular result. All exception handlers must implement the ExceptionHandler interface shown in Listing 3. The handler is in charge of transforming an exception raised by a sub-skeleton or muscle into a result. The objective is to have a chance to recover from an error and continue with the computation.

Listing 4 shows an example of an exception handler. If for some reason the SplitList.split(...) shown in Listing 2 raises an ArrayIndexOutOfBoundsException, then we use the handler to catch the exception and directly sort the array Range without subdividing.

5.4 High-Level Stack Trace

If the exception is raised outside of the Skandium library, i.e., when invoking Future.get() as shown in Listing 1, then this means that no handler

    class HandleSortException {
        public Range handle(Range r, Exception e) {
            Arrays.sort(r.array, r.left, r.right + 1);
            return r;
        }
    }

    Listing 4. Exception Handler Example


    1  Caused by: java.lang.Exception: Solve Test Exception
    2      at examples.nqueens.Solve.execute(Solve.java:26)
    3      at examples.nqueens.Solve.execute(Solve.java:1)
    4      at instructions.SeqInst.interpret(SeqInst.java:53)
    5      at system.Interpreter.interLoop(Interpreter.java:69)
    6      at system.Interpreter.inter(Interpreter.java:163)
    7      at system.Task.run(Task.java:137)

    Listing 5. Real Low-level Stack Trace

    1  Caused by: java.lang.Exception: Solve Test Exception
    2      at examples.nqueens.Solve.execute(Solve.java:26)
    3      at examples.nqueens.Solve.execute(Solve.java:1)
    4      at skeletons.Seq.<init>(DaC.java:68)
    5      at skeletons.DaC.<init>(NQueens.java:53)
    6      at skeletons.Map.<init>(NQueens.java:60)

    Listing 6. High-level Stack Trace

was capable of catching the exception, and the error is reported back to the programmer. A stack-trace would normally have, for example, the form shown in Listing 5. This is the real stack-trace, which corresponds to the actual Java stack-trace. In the example, line 2 generated the exception from inside the Solve.execute(...) muscle. This muscle was executed through a SeqInst instruction, which in turn was called by the interpretation methods.

The problem with the real stack-trace is that the lower-level implementation details are exposed to programmers, in this case the SeqInst instruction, thus breaking the high-level abstraction offered by algorithmic skeletons. Furthermore, the stack-trace is flat, since there is no evidence of the skeleton nesting. In regular programming, this would be equivalent to printing only the name of the function which generated the exception, without actually printing the calling function. Furthermore, the function Solve.execute() could have been nested into more than one skeleton in the application, and it is impossible to know, by inspecting the stack-trace, from which of the nestings the exception was generated.

Therefore, we have introduced a high-level stack-trace which hides the internal interpretation details by not exposing instruction-level elements, and instead traces the skeleton nesting. The high-level stack-trace for the same example is shown in Listing 6. Lines 4-7 of the low-level stack-trace shown in Listing 5 are replaced by lines 4-6 in Listing 6. Thus it is evident that the error was generated by a Solve muscle nested inside a DaC, which in turn was nested into a Map skeleton.

To produce a high-level stack-trace, three steps are required in the library's implementation:

1. When a new skeleton object is created, it must remember its trace: class, method, file and line number. This corresponds to τ in Section 4.1.
2. When the skeleton object is transformed into an instruction, the trace must be copied to the instruction (Figure 1 of [17]).


3. When an exception is raised, an instruction unwinds the stack and adds its corresponding trace to the high-level stack-trace. This corresponds to the E-RAISE rule in Section 4.1.
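Step 1 can be approximated in plain Java by capturing the creation site of an object when it is constructed. The sketch below is illustrative, not Skandium's actual code; it only shows that the creation trace (class, method, file, line number) is available at construction time via the standard stack-trace API.

```java
public class CreationTrace {
    // Remembers the stack frame of whoever constructed this object,
    // i.e. the creation context tau of Section 4.1.
    final StackTraceElement site;

    CreationTrace() {
        StackTraceElement[] frames = new Throwable().getStackTrace();
        // frames[0] is this constructor; frames[1] is the creation site.
        site = frames[1];
    }

    public static void main(String[] args) {
        CreationTrace t = new CreationTrace();
        // The captured frame names the creating method, file and line.
        System.out.println(t.site.getMethodName()); // main
    }
}
```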

6 Conclusions

We have presented an exception management model for algorithmic skeletons, in particular for the Java Skandium library, which supports nestable parallelism patterns. The exception management has been formally specified with respect to the library's operational semantics.

Exceptions can be raised and handled at each level of the skeleton nesting structure. Each skeleton can have handlers attached, specified by programmers through an API. The handlers are capable of catching errors and returning regular results, or raising the exception to be handled by the parent skeleton. Additionally, the raised exceptions are dynamically modified to reflect the nesting of skeleton patterns. Furthermore, no trace of lower-level library methods is exposed to programmers, and exceptions do not break the abstraction level.

As future work we would like to apply this model to other skeleton (and related) frameworks. The ones which will benefit most from the proposed exception model are those that support nestable skeletons, asynchronous computations, and transformation of the skeleton patterns into lower-level modules that implement the parallelism behaviour.

References

1. Aldinucci, M., Danelutto, M., Teti, P.: An advanced environment supporting structured parallel programming in Java. Future Generation Computer Systems 19(5), 611–626 (2003)
2. Bacci, B., Danelutto, M., Orlando, S., Pelagatti, S., Vanneschi, M.: P3L: A structured high level programming language and its structured support. Concurrency: Practice and Experience 7(3), 225–255 (1995)
3. Botorog, G.H., Kuchen, H.: Efficient high-level parallel programming. Theor. Comput. Sci. 196(1-2), 71–107 (1998)
4. Caromel, D., Chazarain, G.: Robust exception handling in an asynchronous environment. In: Romanovsky, A., Dony, C., Knudsen, J., Tripathi, A. (eds.) Proceedings of ECOOP 2005 Workshop on Exception Handling in Object Oriented Systems. Tech. Report No 05-050, Dept. of Computer Science, LIRMM, Montpellier-II Univ., France (July 2005)
5. Caromel, D., Delbé, C., di Costanzo, A., Leyton, M.: ProActive: an integrated platform for programming and running applications on Grids and P2P systems. Computational Methods in Science and Technology 12 (2006)
6. Caromel, D., Henrio, L., Leyton, M.: Type safe algorithmic skeletons. In: Proceedings of the 16th Euromicro PDP, pp. 45–53. IEEE CS Press, Toulouse (February 2008)
7. Cole, M.: Algorithmic skeletons: structured management of parallel computation. MIT Press, Cambridge (1991)
8. Danaher, J.S., Lee, I.T.A., Leiserson, C.E.: Programming with exceptions in JCilk. Sci. Comput. Program. 63(2), 147–171 (2006)
9. Danelutto, M.: QoS in parallel programming through application managers. In: Proceedings of the 13th Euromicro PDP, pp. 282–289. IEEE Computer Society, Washington (2005)
10. Falcou, J., Sérot, J., Chateau, T., Lapresté, J.T.: QUAFF: efficient C++ design for parallel skeletons. Parallel Computing 32(7), 604–615 (2006)
11. Falcou, J., Sérot, J.: Formal semantics applied to the implementation of a skeleton-based parallel programming library. In: Joubert, G.R., Bischof, C., Peters, F.J., Lippert, T., Bücker, M., Gibbon, P., Mohr, B. (eds.) Parallel Computing: Architectures, Algorithms and Applications, Proc. of PARCO 2007, Jülich, Germany. NIC, vol. 38, pp. 243–252. John von Neumann Institute for Computing, Germany (September 2007)
12. Gesbert, L., Loulergue, F.: Semantics of an exception mechanism for Bulk Synchronous Parallel ML. In: PDCAT 2007: Proceedings of the Eighth International Conference on Parallel and Distributed Computing, Applications and Technologies, pp. 201–208. IEEE Computer Society, Washington (2007)
13. Keen, A.W., Olsson, R.A.: Exception handling during asynchronous method invocation (research note). In: Monien, B., Feldmann, R.L. (eds.) Euro-Par 2002. LNCS, vol. 2400, pp. 656–660. Springer, Heidelberg (2002)
14. Kuchen, H.: A skeleton library. In: Monien, B., Feldmann, R.L. (eds.) Euro-Par 2002. LNCS, vol. 2400, pp. 620–629. Springer, Heidelberg (2002)
15. Kuchen, H., Cole, M.: The integration of task and data parallel skeletons. Parallel Processing Letters 12(2), 141–155 (2002)
16. Leyton, M.: Advanced Features for Algorithmic Skeleton Programming. Ph.D. thesis, Université de Nice-Sophia Antipolis (October 2008)
17. Leyton, M., Henrio, L., Piquer, J.M.: Operational semantics for algorithmic skeletons. Tech. rep., University of Chile (to appear, 2010), http://skandium.niclabs.cl/publications/semantics-2010.pdf
18. Leyton, M., Piquer, J.M.: Skandium: Multi-core programming with algorithmic skeletons. In: Proceedings of the 18th Euromicro PDP. IEEE CS Press, Pisa (February 2010) (to appear)
19. Skandium: http://skandium.niclabs.cl/

Generators-of-Generators Library with Optimization Capabilities in Fortress

Kento Emoto¹, Zhenjiang Hu², Kazuhiko Kakehi¹, Kiminori Matsuzaki³, and Masato Takeichi¹

¹ University of Tokyo
  {[email protected],k kakehi@ducr,[email protected]}.u-tokyo.ac.jp
² National Institute of Informatics
  [email protected]
³ Kochi University of Technology
  [email protected]

Abstract. A large number of studies have been conducted on parallel skeletons and optimization theorems over skeleton programs to resolve difficulties with parallel programming. However, two nontrivial tasks still remain unresolved when we need nested data structures: the first is composing skeletons to generate and consume them; the second is applying optimization theorems to obtain efficient parallel programs. In this paper, we propose a novel library, the Generators of Generators (GoG) library. It provides a set of primitives, GoGs, to produce nested data structures. A program developed with these GoGs is automatically optimized by the optimization mechanism in the library, so that its asymptotic complexity can be improved. We demonstrate its implementation in the Fortress language and report some experimental results.

1 Introduction

Consider the following variant of the maximum segment sum problem: given a sequence of numbers, find the maximum sum of 4-flat segments. Here, '4-flat' means that each difference between successive elements in the segment is less than four. For example, the answer for the sequence below is 13, contributed by the segment 2, 4, 3, 4.

    [2, 1, −5, 3, 6, 2, 4, 3, 4, −5, 3, 1, −2, 8]

This is a simplified example of combinatorial optimization [1], which is one of the most important classes of computational problems. It is difficult to develop an efficient parallel program to solve such problems, especially one whose cost is linear in the length of the input. Even if one can use parallel skeletons [2, 3] such as map, reduce, and scan, it is still difficult to generate all the segments by composing them. In addition, we often need to optimize skeleton programs, but deriving efficient programs from naive programs is still a difficult task even though we have various theorems for shortcut derivations [4, 5, 6, 7].

If we have a generation function segs that returns all the segments, we can then solve the problem rather easily. Such a program is written as follows with

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 26–37, 2010. © Springer-Verlag Berlin Heidelberg 2010


comprehension notation [8,9,10,11,12]. Here, x is the given sequence, s is bound to each segment of x, flat4 is a predicate to check 4-flatness, and Σ and MAX denote reductions computing the summation and the maximum.

    MAX [ Σ s | s ← segs x, flat4 s ]

Normal execution of this naive program clearly has a cubic cost w.r.t. the length of x. Therefore, we need to optimize the program to make it efficient.

This paper proposes a novel library with which we can run the above naive program efficiently with a linear-cost parallel reduction (i.e., it runs in O(n/p + log p) parallel time for an input x of length n on p processors). The library has three features. (1) It provides a set of primitives, Generators of Generators (GoGs), to produce nested data structures. (2) It is equipped with an automatic optimization mechanism that exploits knowledge of optimization theorems developed in the field of skeletal parallel programming thus far. (3) Its optimization is lightweight, and no deep analysis of program code is required to apply optimization theorems.

The main contributions of this paper are the novel design of the library as well as its implementation in Fortress [12]. Note that the implementation has been merged into the Fortress interpreter/compiler.

The rest of this paper is organized as follows. Section 2 clarifies the problems we have tackled with the GoG library. Section 3 describes our GoG library. Section 4 presents programming examples and experimental results for the library. Finally, Section 5 reviews related work, and Section 6 concludes the paper.

2 Motivating Example and Problems We Tackle

Let us again consider the maximum 4-flat segment sum problem (MFSS for short) discussed in the introduction. We will identify two problems we tackle in this paper, through creating an efficient parallel program for MFSS via parallel skeletons and optimization theorems. The notation follows that of Haskell [13]. 2.1

2.1 Composing Parallel Skeletons to Create Naive Program

We introduce the following parallel skeletons [14,2] on lists to describe a naive parallel program. Here, map applies a function to each element of a list, reduce takes a summation of a list with an associative operator, scan and scanr produce forward and backward accumulations with associative operators, respectively, and filter removes elements that do not satisfy a predicate.

map f [a1, a2, ..., an] = [f a1, f a2, ..., f an]
reduce (⊕) [a1, a2, ..., an] = a1 ⊕ a2 ⊕ ··· ⊕ an
scan (⊕) [a1, a2, ..., an] = [b1, b2, ..., bn] where bi = a1 ⊕ ··· ⊕ ai
scanr (⊕) [a1, a2, ..., an] = [c1, c2, ..., cn] where ci = ai ⊕ ··· ⊕ an
filter p = reduce (++) ∘ map (λa. if p a then [a] else [ ])

Here, ++ means list concatenation, and an application of reduce (⊕) to an empty list results in the identity of ⊕.
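These semantics can be expressed as an executable (sequential) specification. The following Python sketch is ours, not part of any library discussed here; the real skeletons execute in parallel, while this model only captures their input/output behavior:

```python
from functools import reduce as freduce

def smap(f, xs):
    """map f [a1..an] = [f a1, ..., f an]"""
    return [f(a) for a in xs]

def sreduce(op, xs, identity):
    """reduce (op) [a1..an] = a1 op a2 op ... op an; identity for []."""
    return freduce(op, xs, identity)

def scan(op, xs):
    """scan (op) [a1..an] = [b1..bn] where bi = a1 op ... op ai"""
    out, acc = [], None
    for a in xs:
        acc = a if acc is None else op(acc, a)
        out.append(acc)
    return out

def scanr(op, xs):
    """scanr (op) [a1..an] = [c1..cn] where ci = ai op ... op an"""
    # reverse, scan with the operator's arguments flipped, reverse back
    return scan(lambda x, y: op(y, x), xs[::-1])[::-1]

def sfilter(p, xs):
    """filter p = reduce (++) . map (\\a -> if p a then [a] else [])"""
    return sreduce(lambda x, y: x + y,
                   [[a] if p(a) else [] for a in xs], [])

print(scan(lambda a, b: a + b, [1, 2, 3, 4]))  # -> [1, 3, 6, 10]
```

The argument flip in scanr matters for non-commutative operators such as concatenation, for which scanr must still produce ci = ai ⊕ ··· ⊕ an.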


K. Emoto et al.

Now, we can compose a naive parallel program, mfss, for MFSS as follows. Here, ↑ is the max operator, segs generates all segments of a list, inits and tails generate all initial and tail segments, and flat4 checks 4-flatness.

mfss = reduce (↑) ∘ map (reduce (+)) ∘ filter flat4 ∘ segs
segs = reduce (++) ∘ map inits ∘ tails
inits = scan (++) ∘ map (λa.[a])
tails = scanr (++) ∘ map (λa.[a])
flat4 = rpred (λ(u, v). |u − v| < 4)
rpred r [a1, a2, ..., an] = reduce (∧) (map r [(a1, a2), (a2, a3), ..., (an−1, an)])

Since mfss is described with parallel skeletons, it is a parallel program. The program mfss is clear once we know that segs generates all segments. However, composing skeletons to create segs is difficult for ordinary programmers. Such compositions of skeletons to generate nested data structures are generally a difficult task. For example, the generation of all subsequences (subsets) of a list is far more difficult and complicated.
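As a sequential cross-check of these definitions, the composition can be transliterated to Python (names ours; list slices stand in for the skeleton-based definitions of inits, tails, and segs, which produce the same values):

```python
def inits(x):   # scan (++) . map (\a -> [a])
    return [x[:i + 1] for i in range(len(x))]

def tails(x):   # scanr (++) . map (\a -> [a])
    return [x[i:] for i in range(len(x))]

def segs(x):    # reduce (++) . map inits . tails
    return [s for t in tails(x) for s in inits(t)]

def rpred(r, s):  # reduce (and) over r applied to adjacent pairs
    return all(r(u, v) for u, v in zip(s, s[1:]))

def flat4(s):
    return rpred(lambda u, v: abs(u - v) < 4, s)

def mfss(x):    # reduce max . map sum . filter flat4 . segs
    return max((sum(s) for s in segs(x) if flat4(s)),
               default=float('-inf'))

print(mfss([3, 1, -4, 6, 5]))  # -> 11, from the 4-flat segment [6, 5]
```

Note the cubic cost: segs produces Θ(n²) segments of average length Θ(n), which is exactly the inefficiency the GoG library removes.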

2.2 Applying Theorem to Derive Efficient Parallel Program

Let us introduce a theorem to derive an efficient program from the naive program. Of the various optimization theorems that have been studied thus far [4,5,6,7,15], the following theorem [15] can be applied to the naive program, mfss.

Theorem 1. Provided that ⊕ with identity ı⊕ is associative and commutative, and ⊗ is associative and distributes over ⊕, the following equation holds.

reduce (⊕) ∘ map (reduce (⊗)) ∘ filter (rpred r) ∘ segs = π1 ∘ reduce (⊙) ∘ map hex

where
(m1, t1, i1, s1, h1, l1) ⊙ (m2, t2, i2, s2, h2, l2) =
  (m1 ⊕ m2 ⊕ (t1 ⊗ i2)l1,h2, (t1 ⊗ s2)l1,h2 ⊕ t2, i1 ⊕ (s1 ⊗ i2)l1,h2, (s1 ⊗ s2)l1,h2, h1, l2)

hex a = (a, a, a, a, a, a);  (a)l,h = if r l h then a else ı⊕

Applying this theorem to mfss, we obtain the efficient parallel program shown on the right-hand side of the equation. The resulting program is a simple reduction with a linear cost, and thus runs in O(n/p + log p) parallel time for an input of size n on p processors. The difficult task here, however, is to select this theorem from a sea of optimization theorems. Moreover, it is also difficult to implement the derived operators manually without introducing bugs. In general, applying optimization theorems involves two difficult tasks: finding a suitable theorem from a sea of optimization theorems, and correctly implementing the resulting efficient program.
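The derived operator can be checked mechanically. The following Python sketch is our transliteration of Theorem 1 for the MFSS instance (⊕ = max with identity −∞, ⊗ = +, and r the 4-flatness relation); it folds ⊙ over hex-tuples and cross-checks the result against the cubic specification:

```python
import math
import random

NEG = -math.inf                      # identity of the max reduction

def r(u, v):                         # the relation defining 4-flatness
    return abs(u - v) < 4

def guard(a, l, h):                  # (a)_{l,h} = a if r l h else identity
    return a if r(l, h) else NEG

def hex6(a):                         # hex a = (a, a, a, a, a, a)
    return (a, a, a, a, a, a)

def odot(p, q):                      # the derived associative operator
    m1, t1, i1, s1, h1, l1 = p
    m2, t2, i2, s2, h2, l2 = q
    return (max(m1, m2, guard(t1 + i2, l1, h2)),   # best valid segment
            max(guard(t1 + s2, l1, h2), t2),       # best valid suffix
            max(i1, guard(s1 + i2, l1, h2)),       # best valid prefix
            guard(s1 + s2, l1, h2),                # sum if whole is valid
            h1, l2)                                # first / last element

def mfss_fast(x):  # pi1 . reduce odot . map hex, as a linear fold
    acc = hex6(x[0])
    for a in x[1:]:
        acc = odot(acc, hex6(a))
    return acc[0]

def mfss_naive(x):  # the cubic-cost specification
    all_segs = [x[i:j] for i in range(len(x))
                       for j in range(i + 1, len(x) + 1)]
    flat = lambda s: all(r(u, v) for u, v in zip(s, s[1:]))
    return max(sum(s) for s in all_segs if flat(s))

random.seed(0)
xs = [random.randint(-5, 5) for _ in range(60)]
assert mfss_fast(xs) == mfss_naive(xs)
```

Because ⊙ is associative, the left fold above could equally be computed as a parallel tree reduction, which is where the O(n/p + log p) bound comes from.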

2.3 Problems We Tackled

The two main problems we tackled are the difficult tasks in the development of efficient parallel programs identified above: (1) composing skeletons to produce the nested data structures used to describe naive programs, and (2) selecting and applying suitable optimization theorems to derive efficient programs. To solve these problems, we propose a library that conceals these difficult tasks from users.

3 GoG Library in Fortress

To overcome the two problems, we propose a novel library called the GoG library. The library provides a set of GoGs equipped with an optimization mechanism. The whole structure of the library is outlined in Figure 1. A GoG is, basically, an object representing a nested data structure, such as the list of all segments. It also has the ability to carry out computation (nested reductions, specifically) on the nested data structure. The combination of GoGs and comprehension notation provides a concise way of describing naive parallel programs. For example, a naive program for MFSS can be written with GoGs and comprehension notation as follows, where segs is a function that creates a GoG object representing the list of all segments of the given list x. Note that the result is not a simple list of all segments.

MAX ⟨ Σ s | s ← segs x, flat4 s ⟩

Since the generation of all segments is implemented in the GoG, users are freed from the difficult task of composing skeletons to produce segments. They only need to learn what kinds of GoGs are provided. The feature distinguishing the library from others is that a GoG optimizes computation when it is executed. A GoG automatically checks whether the given parameters (such as functions, predicates, and operators) satisfy the application conditions of optimization theorems. Once it finds an applicable theorem, it executes the computation using the efficient implementation given by the theorem. For example, the library applies Theorem 1 to the above naive program, so that it runs with a linear cost. This mechanism clearly frees users from the difficult task of applying optimization theorems. The rest of this section explains GoGs and the optimization mechanism as well as their implementation in Fortress. We also discuss how the library can be extended; details can be found in Emoto [15]. We selected Fortress as the implementation language because it has both comprehension notation and generators, which share the same concept as GoGs.
[Figure 1 sketches two collections: GoGs for describing specifications (inits, tails, segs, subs, ...) used in nested comprehensions, and a growable collection of theorems, pairs (Condition_i, EfficientImpl_i), for optimizing reductions; generate2(⊕, ⊗, f) returns EfficientImpl_i(⊕, ⊗, f) for the first i such that Condition_i(⊕, ⊗, f) holds.]

Fig. 1. Two collections form an optimizing GoG library. It optimizes naively-described computation using knowledge of optimization theorems.

3.1 GoG: Generation and Consumption of Nested Data Structures

First, let us introduce generators in Fortress. A generator is basically an object holding a set of elements; for example, a list is a generator. It differs from a simple data set in that it also carries out parallel computation on its elements. The computation is implemented in the method generate, which has the following semantics, where the generator g is a list and generate takes a pair of an associative operator (enclosed in an object) and a function.

g.generate(⊕, f) ≡ reduce (⊕) (map f g)

What is important is that the generator (data structure) itself carries out the computation, which enables the whole computation to be optimized when it is executed. For example, a generator may fuse the reduce and map above, and may use a specific efficient implementation exploiting the zero of ⊕ when it exists. As generators are equipped with comprehension notation, we can use a concise notation instead of direct method invocations. An expression described in comprehension notation is desugared into an invocation of generate as ⟨⊕ f a | a ← g⟩ ⇒ g.generate(⊕, f). It is worth noting that a generator has a method, filter, that returns another generator holding the filtered elements. Also, the expression ⟨e | x ← g, p x⟩, which involves filtering by predicate p, is interpreted as ⟨e | x ← g.filter(p)⟩. The actual filtering is delayed until the resulting generator of g.filter(p) carries out the computation on its elements; this may enable optimization exploiting properties of the predicate. It is also worth noting that the body of a comprehension expression can contain another comprehension expression to describe complex computation. Now, we introduce our GoGs, which extend the concept of generators. A GoG is an object representing a nested data structure, such as the list of all segments, but it also carries out computation on the nested data structure. The computation is implemented in a method, generate2, with the following semantics.
Here, the GoG gg is a list of lists, such as segs x for a list x.

gg.generate2(⊕, ⊗, f) ≡ reduce (⊕) (map (reduce (⊗) ∘ map f) gg)

Again, the encapsulation of the computation into a GoG enables the whole computation to be optimized. The details are presented in the next section. The combination of GoGs and comprehension notation gives us a concise way of describing naive nested computations, which may then be optimized by the GoGs. A nested comprehension expression is desugared into an invocation of generate2 as follows.

⟨⊕ ⟨⊗ f a | a ← g⟩ | g ← gg⟩ ⇒ gg.generate2(⊕, ⊗, f)

It is worth noting that we have extended the desugaring process of the Fortress interpreter to deal with our GoGs. This extension will be implemented completely within our library once the syntax extension feature of Fortress becomes available.
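The generate/generate2 contract can be modeled in a few lines of Python. The class and parameter names below are ours, and we pass identities explicitly where Fortress encloses operator and identity together in a reduction object:

```python
from functools import reduce

class Generator:
    """A flat generator: a data set that carries out its own computation."""
    def __init__(self, items):
        self.items = list(items)

    def generate(self, op, identity, f):
        # <op f a | a <- g>  desugars to  g.generate(op, f)
        return reduce(op, (f(a) for a in self.items), identity)

    def filter(self, p):
        # filtering is delayed in Fortress; here we model only its result
        return Generator(a for a in self.items if p(a))

class GoG(Generator):
    """A generator of generators, e.g. the list of all segments."""
    def generate2(self, op1, id1, op2, id2, f):
        # <op1 <op2 f a | a <- g> | g <- gg>  =>  gg.generate2(op1, op2, f)
        return reduce(op1,
                      (reduce(op2, (f(a) for a in g), id2)
                       for g in self.items),
                      id1)

xs = [3, 1, 4, 1, 5]
all_segs = GoG(xs[i:j] for i in range(len(xs))
                       for j in range(i + 1, len(xs) + 1))
mss = all_segs.generate2(max, float('-inf'),
                         lambda a, b: a + b, 0, lambda a: a)
print(mss)  # -> 14: all elements positive, so the whole list wins
```

This model computes the naive semantics directly; the point of the real library is that generate2 is also the hook where an efficient implementation can be substituted.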


The library provides a set of functions to create GoG objects, such as segs. Using such functions, we can write a naive parallel program for MFSS with comprehension notation as explained at the beginning of this section. Note that the generation of all segments is delayed until the GoG carries out its computation, and that the generation may be canceled altogether when an efficient implementation is used instead.

3.2 Optimization Mechanism in GoGs

The outstanding feature of the library is the GoGs' optimization of computation. In the previous section, we explained how we designed GoGs to carry out computation by themselves so that they can optimize it. We need three functionalities to implement optimization exploiting knowledge of theorems: (1) knowing the mathematical properties of parameters such as predicates and operators, (2) judging the application conditions of theorems, and (3) dispatching the efficient implementations given by applicable theorems. Once these are available, optimization is straightforward: if an applicable theorem is found, a GoG executes the computation with the dispatched implementation.

Know Mathematical Properties of Parameters. We have to know whether the operators have mathematical properties such as distributivity, for example, to use Theorem 1. It is generally very difficult to infer such properties from the definitions of operators and functions. Therefore, we took another route: parameters are annotated with such properties beforehand by their implementors. The annotations are embedded in the types of the parameters. We use types as the location of annotations because the annotations are not values needed for computation, and because the type hierarchy is useful for reuse. For example, Figure 2 shows an annotation about the distributivity of + (enclosed in an object, SumReduction) over ↑ (enclosed in an object, MaxReduction). To indicate the distributivity, the object SumReduction extends the trait DistributesOver⟦MaxReduction⟧. In Fortress, type arguments are enclosed in ⟦·⟧. It is worth noting that predicates can be annotated in another way. Since a predicate is just a function, we cannot add an annotation directly to its type as we do for the objects of reduction operators; however, we can add an annotation to the type of its return value, because Fortress allows Boolean extension.

Judge Application Conditions. To use knowledge of optimization theorems correctly, we have to judge their application conditions with respect to the parameters. Since the properties of parameters are annotated on their types, we can implement such judgments by branching expressions based on types. For example, Fortress has a typecase expression that branches on the types of the given arguments. Figure 2 shows an implementation of a judgment on distributivity. The judgment checks whether the second reduction object (r) extends the trait DistributesOver⟦Q⟧, in which Q is the type of the first reduction object (q). If it does, the second reduction object distributes over the first, and the judgment returns true. It is worth noting that such judgments can also be implemented by overloading functions.
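The annotation-on-types idea can be mimicked with marker classes. The following Python sketch uses names of our own choosing; Python has no parameterized traits, so one marker class per instantiation is recorded in a lookup table, which is a simplification of DistributesOver⟦E⟧:

```python
class Reduction:
    """A reduction object: an associative operator with its identity."""
    def empty(self): raise NotImplementedError
    def join(self, a, b): raise NotImplementedError

class Commutative:                 # marker 'trait' for commutativity
    pass

class MaxReduction(Reduction, Commutative):
    def empty(self): return float('-inf')
    def join(self, a, b): return max(a, b)

# One marker class per instantiation of the parameterized trait,
# recorded in a table keyed by the type it distributes over.
class DistributesOverMax:
    pass

DISTRIBUTES_MARKER = {MaxReduction: DistributesOverMax}

class SumReduction(Reduction, DistributesOverMax, Commutative):
    def empty(self): return 0
    def join(self, a, b): return a + b

def distributes(q, r):
    """Judgment: does r distribute over q? True iff r's type carries the
    marker associated with q's type (the 'typecase' branch)."""
    marker = DISTRIBUTES_MARKER.get(type(q))
    return marker is not None and isinstance(r, marker)

def commutative(q):
    return isinstance(q, Commutative)

print(distributes(MaxReduction(), SumReduction()))  # True: + over max
print(distributes(SumReduction(), MaxReduction()))  # False
```

As in the paper's design, the judgments inspect only declared type-level annotations; no attempt is made to verify the properties from the operator definitions themselves.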


trait DistributesOver⟦E⟧ end  (* used for annotation: distributive over E *)

object SumReduction extends { DistributesOver⟦MaxReduction⟧, ... }
  empty(): Number = 0
  join(a: Number, b: Number): Number = a + b
end

distributes⟦Q, R⟧(q: Q, r: R): Boolean =
  typecase (q, r) of
    (Q, DistributesOver⟦Q⟧) ⇒ true
    else ⇒ false
  end

Fig. 2. Annotation and judgment about distributivity

generate2⟦R⟧(q: Reduction⟦R⟧, r: Reduction⟦R⟧, f: E → R): R =
  if distributes(q, r) ∧ commutative(q) ∧ relational(p)
  then efficientImpl(q, r, f)
  else naiveImpl(q, r, f)
  end

Fig. 3. Simplified dispatching of the efficient implementation for Theorem 1. Here, commutative is the judgment of commutativity, the predicate p is stored in a field variable, and relational is the judgment that checks whether p is defined by rpred.

The judgment of an application condition is implemented simply by composing the judgment functions for the required properties.

Dispatch Efficient Implementations. The dispatch process is straightforward once we have the judgments of the application conditions. Figure 3 shows a simplified dispatch process of the GoG for all segments. The process is implemented in the generate2 method; it checks whether the parameters satisfy the application condition of Theorem 1. If the condition is satisfied, it computes the result with the efficient implementation (i.e., the RHS of the equation in Theorem 1); otherwise, it computes the result with the naive semantics. It is worth noting that each of the new operators in the efficient implementation uses the original operators a fixed number of times. In general, each GoG has a list of theorems (pairs of conditions and efficient implementations) and checks their application conditions one by one. If an applicable theorem is found, it computes the result with the corresponding efficient implementation; if no applicable theorem is found in the list, it computes the result with the naive semantics. The current library uses a first-match strategy in this process.
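The first-match dispatch loop can be sketched as follows. Everything here is ours and hypothetical: the registered theorem uses a Kadane-style linear fold (a stand-in for the paper's ⊙-based implementation), and a deliberately crude condition stands in for the type-based judgments:

```python
import math

def naive_impl(op1, id1, op2, id2, f, gg):
    """Fallback: the naive nested-reduction semantics of generate2."""
    acc = id1
    for g in gg:
        inner = id2
        for a in g:
            inner = op2(inner, f(a))
        acc = op1(acc, inner)
    return acc

class SegsGoG:
    """GoG of all segments, holding a list of (condition, impl) theorems."""
    def __init__(self, xs):
        self.xs = xs
        self.theorems = []

    def add_theorem(self, condition, impl):
        self.theorems.append((condition, impl))

    def generate2(self, op1, id1, op2, id2, f):
        for condition, impl in self.theorems:     # first-match strategy
            if condition(op1, op2):
                return impl(op1, id1, op2, id2, f, self.xs)
        n = len(self.xs)
        all_segs = (self.xs[i:j] for i in range(n)
                                 for j in range(i + 1, n + 1))
        return naive_impl(op1, id1, op2, id2, f, all_segs)

def mss_linear(op1, id1, op2, id2, f, xs):
    """Hypothetical efficient implementation: linear fold for max/+."""
    best = tail = id1
    for a in xs:
        tail = op1(f(a), op2(tail, f(a)))   # best segment ending here
        best = op1(best, tail)
    return best

gog = SegsGoG([31, -41, 59, 26, -53, 58, 97, -93, -23, 84])
gog.add_theorem(lambda o1, o2: o1 is max, mss_linear)  # crude condition
print(gog.generate2(max, -math.inf,
                    lambda a, b: a + b, 0, lambda a: a))  # -> 187
```

The overhead of such a dispatch is a handful of condition checks per reduction, which matches the paper's observation that dispatch cost is negligible unless hundreds of theorems are registered.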

3.3 Growing GoG Library

The library can easily be extended. We can add GoGs and accompanying functions to widen its application area, and we can add new pairs of application conditions and efficient implementations to strengthen its optimization. We have extended the library with the following GoGs (and accompanying functions): all segments of a list (segs), all initial segments of a list (inits), all tail segments of a list (tails), and all subsequences of a list (subs). The naive semantics of the former three were discussed in Section 2; the last one is trivial. We have also added various optimization theorems to the library in addition to Theorem 1. The following optimizations were used during the experiments in


Section 4. Here, x is the input of the computation, and each of the LHS programs can be replaced with the corresponding efficient program on the RHS under certain conditions. The common application condition is that the associative operator ⊗ distributes over the other associative operator ⊕. In addition, the theorem for segs requires commutativity of ⊕, and the theorems involving predicates require the predicate p to be defined by a certain relation r as p = rpred r. We have omitted the definitions of the new constant-cost reduction operators ⊙x. See Emoto [15] for details.

⟨⊕ ⟨⊗ f a | a ← i⟩ | i ← inits x⟩ = π1 ⟨⊙inits (f a, f a) | a ← x⟩
⟨⊕ ⟨⊗ f a | a ← t⟩ | t ← tails x⟩ = π1 ⟨⊙tails (f a, f a) | a ← x⟩
⟨⊕ ⟨⊗ f a | a ← s⟩ | s ← segs x⟩ = π1 ⟨⊙segs (f a, f a, f a, f a) | a ← x⟩
⟨⊕ ⟨⊗ f a | a ← i⟩ | i ← inits x, p i⟩ = π1 ⟨⊙inits,p (f a, f a, a, a) | a ← x⟩
⟨⊕ ⟨⊗ f a | a ← t⟩ | t ← tails x, p t⟩ = π1 ⟨⊙tails,p (f a, f a, a, a) | a ← x⟩

It is worth noting that these optimizations are applicable not only to the usual plus and maximum operators but to any operators that satisfy the required conditions. It is also worth noting that the RHSs run in O(n/p + log p) parallel time for an input of size n on p processors, when ⊕ and ⊗ have constant costs. Equipped with the GoGs and theorems above, the GoG library enables us to describe various naive parallel programs whose computations are carried out efficiently by implicitly exploiting the theorems.
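For the inits case, the shape of such an operator is easy to reconstruct, even though its definition is omitted above: a pair holds the best prefix value and the running ⊗-total, and the fold is associative whenever ⊗ distributes over ⊕. A Python cross-check (all names ours, instantiated with ⊕ = max and ⊗ = +) is:

```python
import operator
import random

def mis_fast(x, plus=operator.add, cmb=max):
    """pi1 ( reduce odot_inits [ (f a, f a) | a <- x ] ), with f = id.
    The pair is (best prefix value, total of the whole prefix)."""
    def odot(p, q):
        m1, s1 = p
        m2, s2 = q
        # prefixes of (left ++ right) are prefixes of left, or all of
        # left followed by a prefix of right
        return (cmb(m1, plus(s1, m2)), plus(s1, s2))
    acc = (x[0], x[0])
    for a in x[1:]:
        acc = odot(acc, (a, a))
    return acc[0]

def mis_naive(x):
    """Quadratic specification: max over all initial-segment sums."""
    return max(sum(x[:i + 1]) for i in range(len(x)))

random.seed(1)
xs = [random.randint(-9, 9) for _ in range(200)]
assert mis_fast(xs) == mis_naive(xs)
print(mis_fast([3, -1, 4, -6, 2]))  # -> 6, the prefix [3, -1, 4]
```

The tails and segs operators follow the same pattern with more components, as in Theorem 1, which is why each new operator uses the original ⊕ and ⊗ only a fixed number of times.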

4 Programming Examples and Experimental Results

Here, we explain how naive parallel programs can be written with the GoG library, and present experimental results demonstrating that these naive programs actually run efficiently.

4.1 Example Programming with GoG Library

Figure 4 shows the complete code written with the GoG library for MFSS. Here, Σ⟦Number⟧ s abbreviates the summation ⟨ Σ a | a ← s ⟩, relationalPredicate corresponds to rpred, and the type information (i.e., ⟦Number⟧) is written explicitly as a workaround for the current type system. The program clearly looks like a cubic-cost naive program. However, it runs with a linear cost due to the optimization mechanism implemented in the GoG (in this program, the GoG is the object returned by the expression segs x). This will be demonstrated in the next section.

component ExampleProgram
  import List.{...}; import Generator2.{...}; export Executable
  run(): () = do
    x = array⟦Number⟧(400).fill(fn a ⇒ random(10) − 5)
    flat4 = relationalPredicate⟦Number⟧(fn (a, b) ⇒ |a − b| < 4)
    mfss = MAX⟦Number⟧ ⟨ Σ⟦Number⟧ s | s ← segs x, flat4 s ⟩
    println(“the maximum 4-flat segment sum of x is ” mfss)
  end
end

Fig. 4. Complete code of an example program with GoG library for MFSS

It is worth noting that the expressiveness is at least equal to that of our set of list skeletons, because scan and scanr can be created with inits and tails as follows: scan (⊕) x = ⟨ reduce (⊕) i | i ← inits x ⟩ and scanr (⊕) x = ⟨ reduce (⊕) t | t ← tails x ⟩. Here, a comprehension expression without reduction operations results in a list of the elements in the usual sense. Their computations can then be executed with a linear cost exploiting the well-known scan lemmas.

4.2 Experimental Results

We present experimental results that demonstrate the effect of the optimization and the parallel execution performance. The measurements were done with the current Fortress interpreter (release 4444 from the Subversion repository) running on a PC with two quad-core CPUs (two Intel® Xeon® X5550, 8 cores in total, without hyper-threading), 12 GB of memory, and Linux 2.6.31. Figure 5 plots the measured execution times of the following micro-programs with and without optimization on a logarithmic scale. Here, inits and tails create the GoGs of all initial and tail segments, and ascending and descending are predicates that check whether the given arguments are sorted in ascending and descending order. Note that the input for mtp and mdtp is a list of positive real numbers.

mis = MAX ⟨Σ s | s ← inits x⟩ ;  mais = MAX ⟨Σ s | s ← inits x, ascending s⟩
mtp = MIN ⟨Π s | s ← tails x⟩ ;  mdtp = MIN ⟨Π s | s ← tails x, descending s⟩
mss = MAX ⟨Σ s | s ← segs x⟩ ;  mfss = MAX ⟨Σ s | s ← segs x, flat4 s⟩

The graph indicates that the optimization works well: the naively described micro-programs run with linear costs, while naive execution of these programs suffers from quadratic and cubic costs. It is worth noting that the program code used to measure the case without optimization is the same as that with optimization, except that the annotations

[Figure 5 plots execution time (s) against input length (1 to 10^6, log-log) for mis, mais, mtp, mdtp, mss, and mfss, each with and without optimization, together with a linear-cost reference line.]

Fig. 5. Execution time of micro programs. They achieve linear costs by optimization.


[Figure 6 plots speedup (1 to 4) against FORTRESS_THREADS (1 to 4) for mis, mais, mtp, mdtp, mss, and mfss, together with a linear-speedup reference line.]

Fig. 6. Speedup of the micro programs for a large input with optimization

about the mathematical properties were removed from the reduction objects. This means that the library correctly applies the theorems it knows. The GoG library now takes on the large task of applying suitable theorems, leaving us only the small task of notifying the library of the mathematical properties of objects; moreover, commonly used objects are annotated by the library implementors. Next, let us look at the parallel execution performance of the programs. Figure 6 plots the measured speedup of the micro-programs for a large input (2^18 elements) with optimization. The graph indicates good speedup of the programs, although it is slightly less than optimal because the computation is light. Unfortunately, the current Fortress interpreter is not mature and has a limitation on parallelism: no Fortress program, including those using our library, can achieve more than a four-fold speedup. Therefore, the graph only shows the results for at most four native threads. This limitation will be removed in a future Fortress interpreter or compiler, and the programs will then be able to achieve better speedup with a larger number of threads. It is worth noting that a program can achieve good speedup even if it is not optimized and is thus executed with the naive semantics, because the naive semantics uses the existing generators of Fortress for its computation. It is also worth noting that the overhead of the dispatch process in generate2 is negligibly small compared to the execution time of the main computation, unless too many (more than hundreds of) theorems are given. If that many theorems were given, we would need to organize them for efficient dispatch, which we intend to do in future work. The results indicate that naive programs with GoGs run efficiently in parallel.

5 Related Work

The SkeTo library [14,4] is a parallel skeleton library equipped with optimization mechanisms. Its optimization is designed to fuse successive flat calls of skeletons, not to optimize nested uses of skeletons. The work in this paper deals with the optimization of nested compositions of skeletons, and can thus be seen as a complement to that previous work. The FAN skeleton framework [3] is a skeletal parallel programming framework with an interactive transformation (optimization) mechanism. It has the


same goal as ours. It helps programmers refine naive skeleton compositions interactively so that they become efficient, using a graphical tool that locates applicable transformations and provides performance estimates. Our GoG library is designed for automatic optimization, and is therefore equipped with transformations (optimization theorems) that always improve performance for specific cases. Also, the optimization mechanism of the GoG library is lightweight in the sense that it needs no extra tools such as preprocessors. Programming with comprehension notation has long been considered a promising approach to concise parallel programming, with decades of research behind it [8,9,10,11]. Previous work [16] has studied optimization through flattening of nested comprehension expressions to exploit flat parallelism effectively; those optimizations focus on balancing computation tasks. The work discussed in this paper mainly focuses on improving the complexity of the computation.

6 Conclusion

We proposed the GoG library to tackle two difficult problems in the development of efficient parallel programs. The library frees users from the difficult tasks of composing skeletons to generate nested data structures and of manually applying optimization theorems (transformations) correctly. In this paper, we demonstrated with the MFSS problem that a naively-composed program that appears to have a cubic cost actually runs in parallel with a linear cost. This drastic improvement is due to automatic optimization based on theories of parallel skeletons. It is worth noting that the library can also be implemented in other modern languages; for example, it can be implemented in C++ using OpenMP [17] or MPI [18] for parallel execution and template techniques for the new notations. One direction of future work is to widen the application area of the library. We intend to extend the set of GoGs as well as the optimizations over them, so that we can describe more applications such as combinatorial-optimization problems. We also intend to extend the optimization to higher levels of nesting, as the current implementation only deals with two-level nesting. Moreover, we plan to apply the idea of GoGs to programming on matrices and trees, based on our previous research on parallel skeletons for them. Another direction of future work is the automatic discovery of the mathematical properties of operators and functions from their definitions, which would greatly reduce the number of user tasks. We believe that the rapid progress of recent computers will enable such automatic discovery in the near future.

Acknowledgments. This paper reports the first result of the joint research project "Development of a library based on skeletal parallel programming in Fortress" with Sun Microsystems. The authors would like to thank Guy L. Steele Jr. and Jan-Willem Maessen for fruitful discussions. This study was partially supported by the Global COE program "The Research and Training Center for New Development in Mathematics".

Generators-of-Generators Library with Optimization Capabilities in Fortress

37

References

1. Schrijver, A.: Combinatorial Optimization: Polyhedra and Efficiency. Springer, Heidelberg (2003)
2. Rabhi, F.A., Gorlatch, S. (eds.): Patterns and Skeletons for Parallel and Distributed Computing. Springer, Heidelberg (2002)
3. Aldinucci, M., Gorlatch, S., Lengauer, C., Pelagatti, S.: Towards parallel programming by transformation: the FAN skeleton framework. Parallel Algorithms and Applications 16(2-3) (2001)
4. Emoto, K., Hu, Z., Kakehi, K., Takeichi, M.: A compositional framework for developing parallel programs on two-dimensional arrays. International Journal of Parallel Programming 35(6) (2007)
5. Iwasaki, H., Hu, Z.: A new parallel skeleton for general accumulative computations. International Journal of Parallel Programming 32(5) (2004)
6. Hu, Z., Takeichi, M., Iwasaki, H.: Diffusion: Calculating efficient parallel programs. In: Proceedings of the 1999 ACM SIGPLAN Workshop on Partial Evaluation and Semantics-Based Program Manipulation (1999)
7. Gorlatch, S.: Systematic efficient parallelization of scan and other list homomorphisms. In: Fraigniaud, P., Mignotte, A., Robert, Y., Bougé, L. (eds.) Euro-Par 1996. LNCS, vol. 1124. Springer, Heidelberg (1996)
8. Blelloch, G.E., Sabot, G.W.: Compiling collection-oriented languages onto massively parallel computers. Journal of Parallel and Distributed Computing 8(2) (1990)
9. Chakravarty, M.M.T., Keller, G., Lechtchinsky, R., Pfannenstiel, W.: Nepal - nested data parallelism in Haskell. In: Sakellariou, R., Keane, J.A., Gurd, J.R., Freeman, L. (eds.) Euro-Par 2001. LNCS, vol. 2150, p. 524. Springer, Heidelberg (2001)
10. Chakravarty, M.M.T., Leshchinskiy, R., Jones, S.P., Keller, G., Marlow, S.: Data Parallel Haskell: a status report. In: DAMP 2007: Proceedings of the 2007 Workshop on Declarative Aspects of Multicore Programming (2007)
11. Fluet, M., Rainey, M., Reppy, J., Shaw, A., Xiao, Y.: Manticore: a heterogeneous parallel language. In: DAMP 2007: Proceedings of the 2007 Workshop on Declarative Aspects of Multicore Programming (2007)
12. Allen, E., Chase, D., Hallett, J., Luchangco, V., Maessen, J.-W., Ryu, S., Steele Jr., G.L., Tobin-Hochstadt, S.: The Fortress Language Specification, Version 1.0 (2008), http://research.sun.com/projects/plrg/fortress.pdf
13. Peyton Jones, S. (ed.): Haskell 98 Language and Libraries: The Revised Report. Cambridge University Press, Cambridge (2003)
14. Matsuzaki, K., Emoto, K.: Implementing fusion-equipped parallel skeletons by expression templates. In: Draft Proceedings of the 21st International Symposium on Implementation and Application of Functional Languages (IFL 2009), Technical Report SHU-TR-CS-2009-09-1, Seton Hall University (2009)
15. Emoto, K.: Homomorphism-based Structured Parallel Programming. PhD thesis, University of Tokyo (2009)
16. Leshchinskiy, R., Chakravarty, M.M.T., Keller, G.: Higher order flattening. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3992, pp. 920-928. Springer, Heidelberg (2006)
17. Chapman, B., Jost, G., van der Pas, R.: Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation). MIT Press, Cambridge (2007)
18. Gropp, W., Lusk, E., Skjellum, A.: Using MPI: Portable Parallel Programming with the Message-Passing Interface, 2nd edn. MIT Press, Cambridge (1999)

User Transparent Task Parallel Multimedia Content Analysis

Timo van Kessel, Niels Drost, and Frank J. Seinstra

Department of Computer Science, VU University,
De Boelelaan 1081A, 1081 HV Amsterdam, The Netherlands
{timo,niels,fjseins}@cs.vu.nl

Abstract. The research area of Multimedia Content Analysis (MMCA) considers all aspects of the automated extraction of knowledge from multimedia archives and data streams. To satisfy the increasing computational demands of emerging MMCA problems, there is an urgent need to apply High Performance Computing (HPC) techniques. As most MMCA researchers are not also experts in the field of HPC, there is a demand for programming models and tools that can help MMCA researchers in applying these techniques. Ideally, such models and tools should be efficient and easy to use. At present there are several user transparent library-based tools available that aim to satisfy both these conditions. All such tools use a data parallel approach in which data structures (e.g. video frames) are scattered among the available compute nodes. However, for certain MMCA applications a data parallel approach induces intensive communication, which significantly decreases performance. In these situations, we can benefit from applying alternative parallelization approaches. This paper presents an innovative user transparent programming model for MMCA applications that employs task parallelism. We show our programming model to be a viable alternative that is capable of outperforming existing user transparent data parallel approaches. As a result, the model is an important next step towards our goal of integrating data and task parallelism under a familiar sequential programming interface.

1 Introduction

In recent years, the desire to automatically access vast amounts of image and video data has become more widespread — for example due to the popularity of publicly accessible digital television archives. In a few years, the automatic analysis of the content of multimedia data will be a problem of phenomenal proportions, as digital video may produce high data rates, and multimedia archives steadily run into Petabytes of storage space. As a result, high-performance computing on clusters or even large collections of clusters (grids) is rapidly becoming indispensable for urgent problems in multimedia content analysis (MMCA). Unfortunately, writing efficient parallel applications for such systems is known to be hard. As not all MMCA experts are also parallel programming experts, there is a need for tools than can assist in creating parallel applications. Ideally P. D’Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 38–50, 2010. c Springer-Verlag Berlin Heidelberg 2010 

User Transparent Task Parallel Multimedia Content Analysis


such tools require little extra effort compared to traditional MMCA tools, and lead to efficient parallel execution in most application scenarios. MMCA researchers can benefit from HPC tools to reduce the time they spend on running their applications. However, they will have to put effort into learning how to use these tools. An HPC tool is only useful to MMCA researchers when the performance increase outweighs the extra effort required to use it. User transparent tools are special in that they try to remove the additional effort altogether by hiding the HPC complexity behind a familiar and fully sequential application programming interface (API).

It is well-known that applying data parallel approaches in MMCA computing can lead to good speedups [1,2,3,4,5], even when applied as part of a user transparent tool [6]. In such approaches, images and video frames are scattered among the available compute nodes, such that each node calculates over a partial structure only. Inter-node dependencies are then resolved by communication between the nodes. Previous results, however, show that exploiting data parallelism does not always give the desired performance [7]. Sometimes applications do not offer enough data parallelism, or data parallelism causes too much communication overhead for the application to scale well. For such applications, using alternative techniques, like task parallelism or pipelining, is often beneficial.

In this paper we describe a user transparent task parallel programming model for the MMCA domain. The model provides a fully sequential API identical to that of an existing sequential MMCA programming system [8]. The model delays the execution of operations and constructs a task graph in order to divide these (sets of) operations over the parallel system. We show that in certain realistic cases our model can be more efficient than a user transparent data parallel approach.
Ultimately, we aim to combine our task parallel efforts with our earlier results on a user transparent data parallel programming model [6] to arrive at a model that exploits both data and task parallelism (see also [9]). This paper is organized as follows. In Section 2, we describe general user transparent parallel programming tools. Section 3 explains the design of our user transparent task parallel programming model. We describe a simple example MMCA application in Section 4. This is followed by an evaluation in Section 5. Finally, we describe our conclusions and future work in Section 6.

2 User Transparent Parallel Programming Tools

Whereas specifying the parallelization of applications by hand may be reasonable for a small group of experts, most users prefer to dedicate their time to describing what a computer should do rather than how it should do it. As a result, many programming tools have been developed to alleviate the problem of low level software design for parallel and distributed systems. In all cases, such tools are provided with a programming model that abstracts from the idiosyncrasies of the underlying parallel hardware. The relatively small user base of parallel computing in the MMCA community indicates, however, that existing parallelization tools are still considered too hard to use by MMCA researchers.


T. van Kessel, N. Drost, and F.J. Seinstra

[Figure 1 plots parallelization tools along the dimensions of effort and efficiency, distinguishing expert tools, efficient expert tools, user friendly tools, and user transparent tools, delimited by the two thresholds.]

Fig. 1. Parallelization tools: effort versus efficiency. User transparent tools are considered both user friendly and sufficiently efficient.

The ideal solution would be to have a parallelization tool that abstracts from the underlying hardware completely, allowing users to develop optimally efficient parallel programs in a manner that requires no additional effort in comparison to writing purely sequential software. Unfortunately, no such parallelization tool currently exists, and due to the many intrinsic difficulties it is commonly believed that no such tool will ever be developed [10]. However, if the ideal of 'obtaining optimal efficiency without effort' is relaxed somewhat, it may still be possible to develop a parallelization tool that constitutes an acceptable solution for the MMCA community. The success of such a tool largely depends on the amount of effort requested from the application programmer and the level of efficiency obtained in return.

The graph of Figure 1 depicts a general classification of parallelization tools based on the two dimensions of effort and efficiency. Here, the efficiency of a parallelization tool is loosely defined as the average ratio between the performance of any MMCA application implemented using that particular tool and the performance of an optimal hand-coded version of the same application. Similarly, the required effort refers to (1) the amount of initial learning needed to start using a given parallelization tool, (2) the additional expense that goes into obtaining a parallel program that is correct, and (3) the amount of work required for obtaining a parallel program that is particularly efficient. In the graph, the maximum amount of effort the average MMCA expert is generally willing to invest in the implementation of efficient parallel applications is represented by THRESHOLD 1. The minimum level of efficiency a user generally expects as a return on investment is depicted by THRESHOLD 2.
To indicate that the two thresholds are not defined strictly, and may differ between groups of researchers, both are represented by somewhat fuzzy bars in the graph of Figure 1. Each tool that is considered both 'user friendly' and 'sufficiently efficient' is referred to as a tool that offers fully user transparent parallelization. An important additional feature of any user transparent tool is that it does not require the user


to fine-tune any application in order to obtain particularly efficient parallel code (although it may still allow the user to do so). Based on these considerations, we conclude that a parallelization tool constitutes an acceptable solution for the MMCA community only if it is fully user transparent.

2.1 Parallel-Horus

In our earlier work, we have designed and implemented Parallel-Horus [6], a user transparent parallelization tool for the MMCA domain. Parallel-Horus, which is implemented in C++ and MPI, allows programmers to implement data parallel multimedia applications as fully sequential programs. The library's API is identical to that of an existing sequential library: Horus [8]. Similar to other frameworks [2], Horus recognizes that a small set of algorithmic patterns covers the bulk of all commonly applied functionality. Parallel-Horus includes patterns for commonly used functionality such as unary and binary pixel operations, global reduction, neighborhood operations, generalized convolution, and geometric transformations (e.g. rotation, scaling). Recent developments include patterns for operations on large datasets, as well as patterns on increasingly important derived data structures, such as feature vectors.

For reasons of efficiency, all Parallel-Horus operations are capable of adapting to the performance characteristics of the parallel machine at hand, i.e. by being flexible in the partitioning of data structures. Moreover, it was realized that it is not sufficient to consider the parallelization of library operations in isolation. Therefore, the library was extended with a run-time approach for communication minimization (called lazy parallelization) that automatically parallelizes a fully sequential program at runtime by inserting communication primitives and additional memory management operations whenever necessary [11]. Results for realistic multimedia applications have shown the feasibility of the Parallel-Horus approach, with data parallel performance (obtained on individual cluster systems) consistently found to be optimal with respect to the abstraction level of message passing programs [6].
Notably, Parallel-Horus was applied in recent NIST TRECVID benchmark evaluations for content-based video retrieval, and played a crucial role in achieving top-ranking results in a field of strong international competitors [6,12]. Moreover, recent extensions to Parallel-Horus that allow for services-based multimedia Grid computing have been applied successfully in large-scale distributed systems, involving hundreds of massively communicating compute resources covering the entire globe [6].

Despite these successes, and as shown in [7], for some applications a data parallel approach generates too much communication overhead for efficient parallelization. In these situations, exploiting alternative forms of parallelism, such as task parallelism or pipelining, could improve the efficiency of the parallel execution. In this paper we investigate the design and implementation of a task parallel counterpart of Parallel-Horus. Keeping in mind that applications should have a sufficient number of independent tasks available, and that load-balancing may be required for optimized performance, the following section discusses the design of our user transparent task parallel system.


3 System Design

In the MMCA domain, most applications apply a multimedia algorithm that transforms one or more input data structures into one or more output data structures. Generally, such input and output structures take the form of dense data fields, like images, kernels, histograms and feature vectors. When we look more closely at the multimedia algorithms, we can regard them as sequences of basic operations, each transforming one or more dense data fields into another dense data field. These basic operations are exactly the ones provided by the Parallel-Horus API referred to in Section 2.1. For our user transparent task parallel programming model we use exactly the same sequential API.

In our task parallel approach, we aim to execute as many basic operations as possible in parallel. However, in most algorithms there are operations that require the output data of a preceding operation as one of their input data structures. As a result, to execute such an algorithm correctly in a task parallel manner, we need to identify these data dependencies. To that end, we do not execute the operations of the programming model immediately, but delay their execution and create a task graph of the basic operations in the application at run-time.

3.1 The Task Graph

When an operation is delayed in our programming model, an operation node is created instead, and added to the task graph.¹ The operation node contains a description of the operation itself, together with references to the nodes in the task graph corresponding to the input data structures of the operation. Furthermore, the operation node holds references to its parent nodes, as well as other parameters needed for the operation. In addition, each node registers how many children need its result as input.

A future object is created when the operation node is added to the task graph. The future object acts as a placeholder for the result of the operation, and it is returned to the application instead of the actual result (see Figure 2). At this point we take advantage of the fact that most intermediate results of an algorithm are, by themselves, not of much interest to the application (e.g. they will not be written out to file). As a result, the application will not notice that we do not calculate the result data structure immediately.

Eventually the operation nodes form a Directed Acyclic Graph (DAG), containing all operations of the application as its nodes. In this graph, the source nodes are a special type of node: they do not have parent nodes, but contain the data structure itself instead. For example, operations that import data structures (e.g., read an image from disk, create a kernel, etcetera) lead to the creation of such source nodes in the task graph.

¹ Some API calls lead to the creation of multiple operation nodes in the task graph. For example, a convolution with a Gaussian kernel is a single API call, but it consists of one operation for creating the kernel and another for the actual convolution.
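The mechanism above can be sketched in a few lines of Java. The sketch is illustrative only (the class and method names `OpNode`, `ImageFuture` and `apply` are our own and not the prototype's actual API): each API call records an operation node with references to its parents, increments the child counters of those parents, and hands a future back to the application instead of computed data.

```java
import java.util.ArrayList;
import java.util.List;

// A node in the task graph: either a source node (data already present)
// or an operation node referencing the parents that produce its inputs.
class OpNode {
    final String op;                 // description of the operation
    final List<OpNode> parents;      // nodes producing our input data
    int childCount = 0;              // how many children consume our result
    Object result;                   // set for sources, or after execution

    OpNode(String op, List<OpNode> parents) {
        this.op = op;
        this.parents = parents;
        for (OpNode p : parents) p.childCount++;   // register as consumer
    }

    static OpNode source(Object data) {            // e.g. an imported image
        OpNode n = new OpNode("source", new ArrayList<>());
        n.result = data;
        return n;
    }
}

// Placeholder returned to the application; accessing its data later
// triggers evaluation of the corresponding sub-graph (Section 3.2).
class ImageFuture {
    final OpNode node;
    ImageFuture(OpNode node) { this.node = node; }
}

public class TaskGraphDemo {
    static ImageFuture apply(String op, ImageFuture... inputs) {
        List<OpNode> parents = new ArrayList<>();
        for (ImageFuture f : inputs) parents.add(f.node);
        return new ImageFuture(new OpNode(op, parents));
    }

    public static void main(String[] args) {
        // Mirrors the example of Figure 2: one import, then Op2..Op5.
        ImageFuture img1 = new ImageFuture(OpNode.source("file"));
        ImageFuture img2 = apply("Op2", img1);
        ImageFuture img3 = apply("Op3", img2);
        ImageFuture img4 = apply("Op4", img3);
        ImageFuture img5 = apply("Op5", img2, img4);
        System.out.println(img2.node.childCount);  // Op2 feeds Op3 and Op5
    }
}
```

Note that no image data is touched while the graph is built; the only work done per API call is bookkeeping.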

[Figure 2 depicts the example program
Image1 = Import("file");
Image2 = Image1.Op2();
Image3 = Image2.Op3();
Image4 = Image3.Op4();
Image5 = Image2.Op5(Image4);
issued by the application through the API; each call returns a future (Image1 to Image5), backed in the task graph by a source node (for the import) or an operation node (Op2 to Op5).]

Fig. 2. Task graph construction. Each API call leads to the creation of a future object and an operation node; importing data from file leads to the creation of a source node.

3.2 Graph Execution

At some point during its execution, an application needs to access the data of a future object. At that point, the future initiates the execution of all parts of the task graph that lead to the creation of the required result structure. The execution is initiated by requesting the result data from the future's corresponding operation node. This operation node, in turn, needs to acquire the result data from all its parent nodes in order to calculate its own result data. The parent nodes do the same, until a source node is reached, which can return its result data immediately. This way, only those operations in the task graph that are needed to calculate the required data structure are executed, as shown in Figure 3. In addition, all nodes that took part in the execution now also have their result data, and thus become source nodes.

To ensure efficient execution of the task graph, we introduced a simple optimization to the execution process. First, we use the observation that any node that has only a single child node must be executed in sequence with that child node. Therefore, sequences of such operations are merged into a composite operation to prevent them from being executed on different processors. When the composite operations have been identified, the execution of the task graph can be started. Parallel execution is obtained at nodes that have multiple parents: at these nodes, a separate job is created for each parent node to obtain its result. In contrast, any node with multiple children acts as a synchronization point in the task graph, because all children will have to wait for the node to calculate its result before they can continue their own calculations in parallel.
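The demand-driven evaluation and the single-child merge criterion can be illustrated with the following Java sketch (again with hypothetical names, not the prototype's real classes): requesting a node's data recursively evaluates only its ancestors, evaluated nodes keep their data and behave as source nodes afterwards, and a node with exactly one child qualifies for merging into a composite operation.

```java
import java.util.ArrayList;
import java.util.List;

public class GraphExec {
    static class Node {
        final String name;
        final List<Node> parents = new ArrayList<>();
        final List<Node> children = new ArrayList<>();
        Integer result;               // set for sources, or after evaluation
        Node(String name) { this.name = name; }
    }

    static Node op(String name, Node... parents) {
        Node n = new Node(name);
        for (Node p : parents) { n.parents.add(p); p.children.add(n); }
        return n;
    }

    // Demand-driven: only the ancestors of the requested node are run.
    static int evaluate(Node n, List<String> executed) {
        if (n.result != null) return n.result;        // source / already done
        int in = 0;
        for (Node p : n.parents) in += evaluate(p, executed); // parents first
        executed.add(n.name);
        n.result = in + 1;                            // dummy image operation
        return n.result;                              // node is now a source
    }

    // A node with exactly one child runs in sequence with that child, so
    // the pair can be merged into one composite job before execution.
    static boolean mergeable(Node n) {
        return n.children.size() == 1 && n.result == null;
    }

    public static void main(String[] args) {
        Node src = new Node("src"); src.result = 0;
        Node a = op("a", src);        // chain src -> a -> b, plus src -> c
        Node b = op("b", a);
        Node c = op("c", src);
        List<String> executed = new ArrayList<>();
        evaluate(b, executed);
        System.out.println(executed);       // [a, b]: c is never evaluated
        System.out.println(mergeable(a));   // true: a and b form a chain
    }
}
```

In the real runtime the dummy integer operation stands in for a Parallel-Horus image operation, and the jobs created for multiple parents are executed in parallel rather than in a sequential loop.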


Fig. 3. Task graph execution. Only nodes leading to the required result are evaluated.

We implemented a prototype runtime system that executes the task graph as explained above. The runtime system is capable of merging nodes into composite nodes, and of executing the required part of the task graph in parallel when a data structure is accessed through a future object.

As part of our prototype runtime system we use the Satin [13] system to obtain parallel execution on clusters. Satin is a divide-and-conquer system implemented in Java that solves problems by splitting them into smaller independent subproblems. The subproblems are spawned by Satin to be solved in parallel. When all the subproblems are solved, the original problem collects all partial solutions and combines them to form the final result. This process of splitting problems into subproblems is repeated recursively until trivial subproblems remain, which are calculated directly. It should be noted that Satin is a very powerful system. Apart from the automatic parallel execution of the spawned problems, Satin offers automatic load-balancing using random job stealing. Moreover, Satin offers fault-tolerance and automatic task migration without requiring any extra effort from the programmer.

To execute our task graph using Satin, the future spawns the operation node leading to the desired result data. In turn, the operation node spawns all its parent operations in the task graph using Satin, in order to get their result data. Consequently, the parent operations spawn their parents, until the source operations are reached. The source operations do not spawn any new jobs, but deliver their result data instead, thus ending the recursion. When all parents are finished and synchronized by the Satin runtime, the node can start its own execution and then deliver its result to its children.

Despite its benefits (see above), Satin induces overheads that could have been avoided had we implemented a complete run-time system from scratch.
For example, Satin creates a spawn tree of the different parallel tasks instead of a DAG, to the effect that tasks are replicated in Satin, which may cause such tasks to be executed more than once. Also, Satin assumes the presence of many (even millions of) tasks, which may not be the case in all situations. However, the reader should keep in mind that the main contribution of this work is the user transparent task parallel programming model. Despite these issues related


to Satin’s overhead, in the remainder of this paper we do obtain close-to-linear speedup characteristics for several runtime scenarios.
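The spawn/sync recursion described above follows the same pattern as Java's standard fork/join framework, which we use here purely as an illustration (Satin's own API is not shown): a node forks a task per parent, joins on their results, and only then computes its own result, ending the recursion at source nodes whose data is already available.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Illustrative spawn/sync recursion over a (binary) dependency tree.
// Satin applies the same pattern: spawn the parents, sync on their
// results, then run the node's own operation.
public class SpawnSync {
    static class NodeTask extends RecursiveTask<Integer> {
        final int depth;
        NodeTask(int depth) { this.depth = depth; }

        @Override
        protected Integer compute() {
            if (depth == 0) return 1;            // source node: data is ready
            NodeTask left = new NodeTask(depth - 1);
            NodeTask right = new NodeTask(depth - 1);
            left.fork();                         // spawn one parent job
            int r = right.compute();             // evaluate the other locally
            return left.join() + r;              // sync, then combine results
        }
    }

    public static void main(String[] args) {
        int result = ForkJoinPool.commonPool().invoke(new NodeTask(3));
        System.out.println(result);              // 2^3 leaf tasks -> 8
    }
}
```

Unlike this local fork/join pool, Satin distributes the spawned jobs over cluster nodes and balances load by random job stealing.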

4 A Line Detection Application

The following describes a typical, yet simple, example application from the MMCA domain. The example is selected because results for data parallel execution with Parallel-Horus are readily available. Importantly, in some cases the data parallel execution of this particular application leads to non-linear speedups.

4.1 Curvilinear Structure Detection

As discussed in [14], the computationally demanding problem of line detection is solved by considering the second order directional derivative in the gradient direction, for each possible line direction. This is achieved by applying anisotropic Gaussian filters, parameterized by orientation θ, smoothing scale σu in the line direction, and differentiation scale σv perpendicular to the line, given by:

\[
  r(x, y, \sigma_u, \sigma_v, \theta) = \sigma_u \sigma_v\, f_{vv}^{\sigma_u,\sigma_v,\theta}\, \frac{1}{b^{\sigma_u,\sigma_v,\theta}},
\tag{1}\]

with b the line brightness. When the filter is correctly aligned with a line in the image, and σu, σv are optimally tuned to capture the line, the filter response is maximal. Hence, the per pixel maximum line contrast over the filter parameters yields line detection:

\[
  R(x, y) = \arg\max_{\sigma_u, \sigma_v, \theta}\, r(x, y, \sigma_u, \sigma_v, \theta).
\tag{2}\]

Figure 4(a) gives a typical example of an image used as input. Results obtained for a reasonably large subspace of (σu , σv , θ) are shown in Figure 4(b). The anisotropic Gaussian filtering problem can be implemented in many different ways. In this paper we consider three possible approaches. First, for each orientation θ it is possible to create a new filter based on σu and σv . In effect, this yields a rotation of the filters, while the orientation of the input image remains fixed. Hence, a sequential implementation based on this approach (which we refer to as Conv2D) implies full 2-dimensional convolution for each filter.

Fig. 4. Detection of C. Elegans worms (courtesy of Janssen Pharmaceuticals, Belgium)


FOR all orientations θ DO
  RotatedIm = GeometricOp(OriginalIm, "rotate", θ);
  ContrastIm = UnPixOp(ContrastIm, "set", 0);
  FOR all smoothing scales σu DO
    FOR all differentiation scales σv DO
      FiltIm1 = GenConvOp(RotatedIm, "gaussXY", σu, σv, 2, 0);
      FiltIm2 = GenConvOp(RotatedIm, "gaussXY", σu, σv, 0, 0);
      DetectedIm = BinPixOp(FiltIm1, "absdiv", FiltIm2);
      DetectedIm = BinPixOp(DetectedIm, "mul", σu × σv);
      ContrastIm = BinPixOp(ContrastIm, "max", DetectedIm);
    OD
  OD
  BackRotatedIm = GeometricOp(ContrastIm, "rotate", −θ);
  ResultIm = BinPixOp(ResultIm, "max", BackRotatedIm);
OD

Fig. 5. Pseudo code for the ConvRot algorithm

FOR all orientations θ DO
  FOR all smoothing scales σu DO
    FOR all differentiation scales σv DO
      FiltIm1 = GenConvOp(OriginalIm, "func", σu, σv, 2, 0);
      FiltIm2 = GenConvOp(OriginalIm, "func", σu, σv, 0, 0);
      ContrastIm = BinPixOp(FiltIm1, "absdiv", FiltIm2);
      ContrastIm = BinPixOp(ContrastIm, "mul", σu × σv);
      ResultIm = BinPixOp(ResultIm, "max", ContrastIm);
    OD
  OD
OD

Fig. 6. Pseudo code for the Conv2D and ConvUV algorithms, with "func" either "gauss2D" or "gaussUV"

The second approach (referred to as ConvUV) is to decompose the anisotropic Gaussian filter along the perpendicular axes u, v, and use bilinear interpolation to approximate the image intensity at the filter coordinates. Although comparable to the Conv2D approach, ConvUV is expected to be faster due to a reduced number of accesses to the image pixels. A third possibility (called ConvRot) is to keep the orientation of the filters fixed, and to rotate the input image instead. The filtering then proceeds as a two-stage separable Gaussian, applied along the x- and y-direction.

Pseudo code for the ConvRot algorithm is given in Figure 5. The program starts by rotating the original input image for a given orientation θ. Then, for all (σu, σv) combinations, the filtering is performed by xy-separable Gaussian filters. For each orientation the maximum response is combined in a single contrast image. Finally, the temporary contrast image is rotated back to match the orientation of the input image, and the maximum response image is obtained.

For the Conv2D and ConvUV algorithms, the pseudo code is identical and given in Figure 6. Filtering is performed in the inner loop by either a full two-dimensional convolution (Conv2D) or by a separable filter in the principal axes directions (ConvUV). On a state-of-the-art sequential machine either program may take from a few minutes up to several hours to complete, depending on the size of the input image and the extent of the chosen parameter subspace.
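The inner loop of Figure 6 computes, per pixel, the maximum filter response over all (σu, σv, θ) combinations, as in Equation (2). The sketch below illustrates this accumulation pattern in plain Java; the anisotropic Gaussian filter itself is mocked with a placeholder response function, and all names are illustrative rather than part of the actual system. Because each parameter combination is processed independently, every inner-loop iteration is a natural candidate parallel task.

```java
// Per-pixel maximum over a filter parameter subspace, mirroring the
// Figure 6 loop nest. The real anisotropic Gaussian filtering is
// replaced by a mock response() so the structure stands on its own;
// responses are assumed non-negative, so the result starts at zero
// (the "set", 0 initialization in the pseudo code).
public class MaxResponse {
    static double[] response(double[] img, double su, double sv, double th) {
        double[] r = new double[img.length];     // placeholder for the real
        for (int i = 0; i < img.length; i++)     // filter response image
            r[i] = img[i] * su * sv * Math.cos(th);
        return r;
    }

    static double[] lineContrast(double[] img, double[] sus,
                                 double[] svs, double[] thetas) {
        double[] result = new double[img.length];
        for (double th : thetas)
            for (double su : sus)
                for (double sv : svs) {          // one independent task each
                    double[] r = response(img, su, sv, th);
                    for (int i = 0; i < img.length; i++)
                        result[i] = Math.max(result[i], r[i]);  // "max" op
                }
        return result;
    }

    public static void main(String[] args) {
        double[] img = {1.0, 2.0};
        double[] out = lineContrast(img, new double[]{1, 3},
                new double[]{1, 2}, new double[]{0, Math.PI / 2});
        System.out.println(out[1]);  // best combo: 2 * 3 * 2 * cos(0) = 12
    }
}
```

In the task parallel model, each (σu, σv, θ) iteration becomes a node in the task graph, and the "max" accumulation forms the synchronization points.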

5 Evaluation

We implemented the three versions of the line detection application using our task parallel programming model. We tested the applications on the DAS-3 distributed supercomputer², which consists of 5 clusters spread over The Netherlands. For our experiments we used up to 64 nodes of the cluster located at the VU University in Amsterdam. On this cluster, each node contains 2 dual core AMD Opteron CPUs for a total of 4 cores, and is connected by a high-speed 10 Gbit/s Myrinet network. We used the JRockit JVM version 3.1.2. In order to be able to compare the results of our experiments to previous experiments with Parallel-Horus, we only used a single worker thread on each node.

Table 1. Performance results (in seconds) for computing a typical orientation scale-space at 1° angular resolution (i.e., 180 orientations) and 8 (σu, σv) combinations. Scales computed are σu ∈ {3, 5, 7} and σv ∈ {1, 2, 3}, ignoring the isotropic case σu,v = {3, 3}. Image size is 512×512 (8-byte) pixels.

# Nodes    ConvUV    Conv2D   ConvRot
      1    198.23   2400.88    724.03
      2     94.84   1193.74    387.04
      4     48.20    593.18    190.31
      8     24.08    298.45     98.25
     16     12.43    150.73     53.60
     32      6.53     77.50     30.21
     48      4.58     52.94     19.77
     64      3.63     40.66     17.00

Table 1 shows the results for the three applications under our programming model. For the sequential performance, it shows that the ConvUV approach is the fastest, followed by ConvRot. As expected, Conv2D is significantly slower, due to the excessive accessing of pixels in the 2-dimensional convolution operations. Overall, these results are similar to our earlier sequential results obtained with Parallel-Horus [7].

The speedup for the three versions of the application is shown in Figure 7(a). It shows that both ConvUV and Conv2D obtain close-to-linear speedups of 54.6 and 59.1, respectively (on 64 nodes). ConvRot scales worse than the other two, and reaches a speedup of 42.6 on 64 nodes. For our task parallel programming model, scalability is influenced by the number of available parallel tasks. For all three algorithms, the inner loops are good candidates to serve as parallel tasks. As the structure of ConvUV and Conv2D is identical, we expect these algorithms to scale equally well. The marginal differences in the speedup results are explained by the fact that the ConvUV version has much smaller task sizes, while the communication patterns are the same. As a result, the communication overhead is relatively larger. Unlike these two algorithms, in ConvRot some of the operations take place in the outer loop. As a result, the task graph constructed during ConvRot execution contains more data dependencies between the tasks. This reduces the number of

² See http://www.cs.vu.nl/das3/


Fig. 7. (a) Speedup obtained with our model. (b) Parallel-Horus shows similar results.


Fig. 8. Speedup of ConvRot for 18, 180 and 1800 rotations

tasks that can be run in parallel, and complicates the distribution of the tasks over the compute nodes, which explains the lower speedup results.

Previous results [15] obtained for Parallel-Horus on DAS-3 are shown in Figure 7(b). Despite the fact that the applied parallelization is entirely different, Parallel-Horus obtains very similar speedup characteristics in all compared cases. As a result, one could say that our task parallel model has not brought the desired improvement. As shown below, however, inspection of the Satin system reveals that our model can still perform much better than Parallel-Horus.

The load-balancing mechanism of the Satin system performs best when there are massive numbers of tasks available. As a consequence, our prototype runtime system works best for applications in which many tasks can be identified. As a result, the low number of tasks available in ConvRot reduces the effectiveness of our runtime system. Figure 8 shows that varying the number of rotations by a factor of 10 (by changing the angular resolution) greatly influences the speedup characteristics of ConvRot. On 64 nodes, the speedup is only 9.0 for 18 rotations. For 180 and 1800 rotations, however, the speedup is 42.6 and 58.6, respectively. In other words, for a very large number of rotations (thus: tasks) all versions of the application obtain close-to-linear speedup under our user

User Transparent Task Parallel Multimedia Content Analysis

49

transparent task parallel programming model. In the near future we will reduce these scaling effects by adapting our runtime system (e.g. by replacing Satin) such that it requires fewer tasks to be effective. Importantly, for Parallel-Horus the reduced speedup of ConvRot cannot be improved by varying the chosen parameter space of the application [7,15]. The large data parallel overhead is due to the large number of global data dependencies present in the algorithm (i.e., the repeated image rotations). Clearly, in such cases our task parallel model provides a better solution.

6 Conclusions and Future Work

In this paper we introduced a user transparent task parallel programming model for the MMCA domain. The programming model delays execution and constructs a task graph in order to identify tasks that can be executed in parallel. When a result data structure is required by the application, only the relevant part of the task graph is executed by the runtime system in order to produce it. We have built a prototype runtime system that is (currently) based on the Satin divide-and-conquer system. Measurement results for three versions of an example line detection problem show that a user transparent task parallel approach is a viable alternative to user transparent data parallelism.

In the near future we will investigate whether we can improve our graph execution engine, either by reducing Satin's overhead, or by replacing Satin altogether. Further investigations will be aimed at graph optimization strategies for improved performance. As the next major step, we aim to combine our task parallel model with Parallel-Horus to arrive at a user transparent programming model that employs both task and data parallelism. At the same time, we aim to extend our model to support many-core hardware (e.g. GPUs, FPGAs), which is particularly suitable for executing selected MMCA compute kernels.

References

1. Galizia, A., D'Agostino, D., Clematis, A.: A Grid Framework to Enable Parallel and Concurrent TMA Image Analysis. International Journal of Grid and Utility Computing 1(3), 261–271 (2009)
2. Morrow, P.J., et al.: Efficient implementation of a portable parallel programming model for image processing. Concurrency - Practice and Experience 11(11), 671–685 (1999)
3. Lebak, J., et al.: Parallel VSIPL++: An Open Standard Software Library for High-Performance Signal Processing. Proceedings of the IEEE 93(2), 313–330 (2005)
4. Juhasz, Z., Crookes, D.: A PVM Implementation of a Portable Parallel Image Processing Library. In: Ludwig, T., Sunderam, V.S., Bode, A., Dongarra, J. (eds.) PVM/MPI 1996 and EuroPVM 1996. LNCS, vol. 1156, pp. 188–196. Springer, Heidelberg (1996)
5. Plaza, A., et al.: Commodity cluster-based parallel processing of hyperspectral imagery. Journal of Parallel and Distributed Computing 66(3), 345–358 (2006)
6. Seinstra, F.J., et al.: High-Performance Distributed Video Content Analysis with Parallel-Horus. IEEE MultiMedia 14(4), 64–75 (2007)


7. Seinstra, F.J., Koelma, D.: User transparency: a fully sequential programming model for efficient data parallel image processing. Concurrency - Practice and Experience 16(6), 611–644 (2004)
8. Koelma, D., et al.: Horus C++ Reference. Technical report, Univ. Amsterdam, The Netherlands (January 2002)
9. Ramaswamy, S., et al.: A Framework for Exploiting Task and Data Parallelism on Distributed Memory Multicomputers. IEEE Trans. Parallel Distrib. Syst. 8(11), 1098–1116 (1997)
10. Blume, W., et al.: Automatic Detection of Parallelism: A Grand Challenge for High-Performance Computing. IEEE Parallel Distrib. Technol. 2(3), 37–47 (1994)
11. Seinstra, F.J., Koelma, D., Bagdanov, A.D.: Finite State Machine-Based Optimization of Data Parallel Regular Domain Problems Applied in Low-Level Image Processing. IEEE Trans. Parallel Distrib. Syst. 15(10), 865–877 (2004)
12. Snoek, C., et al.: The Semantic Pathfinder: Using an Authoring Metaphor for Generic Multimedia Indexing. IEEE Trans. Pattern Anal. Mach. Intell. 28(10), 1678–1689 (2006)
13. van Nieuwpoort, R., et al.: Satin: Simple and Efficient Java-based Grid Programming. Scalable Computing: Practice and Experience 6(3), 19–32 (2005)
14. Geusebroek, J.M., et al.: A Minimum Cost Approach for Segmenting Networks of Lines. International Journal of Computer Vision 43(2), 99–111 (2001)
15. Liu, F., Seinstra, F.J.: A Comparison of Distributed Data Parallel Multimedia Computing over Conventional and Optical Wide-Area Networks. In: DMS, Knowledge Systems Institute, pp. 9–14 (2008)

Parallel Simulation for Parameter Estimation of Optical Tissue Properties

Mihai Duta¹, Jeyarajan Thiyagalingam¹, Anne Trefethen¹, Ayush Goyal², Vicente Grau², and Nic Smith²

¹ Oxford e-Research Centre, University of Oxford, Oxford, UK
² Computing Laboratory, University of Oxford, Oxford, UK
[email protected]

Abstract. Several important laser-based medical treatments rest on the crucial knowledge of the response of tissues to laser penetration. Optical properties are often localised and are measured using optically active fluorescent microspheres injected into the tissue. However, the measurement process combines the tissue properties with the optical characteristics of the measuring device, which in turn requires numerically intensive mathematical simulations for extracting the tissue properties from the data. In this paper, we focus on exploiting the algorithmic parallelism in the biocomputational simulation, in order to achieve significant runtime reductions. The entire simulation accounts for over 30,000 spatial points and is too computationally demanding to run in a serial fashion. We discuss our strategies of parallelisation at different levels of granularity and we present our results on two different parallel platforms. We also emphasise the importance of retaining a high level of code abstraction in the application to benefit both agile coding and interdisciplinary collaboration between research groups.

Keywords: tissue optics, parallel simulation, CUDA, parallel MATLAB.

1 Introduction

This paper describes the acceleration of a computationally intensive biomedical application through the exploitation of parallelism at different levels of granularity while retaining a high level of code abstraction. The biomedical application involves a large number of computer simulations, which are based on a mathematical model that is subject to continuous refinement as research progresses. This research is an interdisciplinary effort, drawing on results from both biology and mathematics. Thus, the computational performance demanded by the computer simulations conflicts with the need for high-level abstractions in the implementation, which facilitate the rapid collaborative development of the mathematical model. Software performance and programming abstraction are two conflicting goals of software engineering. Resolving this conflict in the case of an interdisciplinary research effort is aggravated by the likely lack of expertise in using conventional high-performance programming languages (e.g. FORTRAN or C++) or numerical function libraries. The typical challenges to address in such a case are:

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 51–62, 2010. © Springer-Verlag Berlin Heidelberg 2010

52

M. Duta et al.

– the ability to cope with rapidly changing prototypes (agile code development);
– the retention of a high level of abstraction in order to enable easy knowledge transfer across disciplines;
– the ability to exploit parallelism without trading off abstraction.

The approach followed in this paper towards maintaining the programming abstraction is the exploitation of parallelism at multiple levels of granularity. Thus, the independent computer simulations on which the biomedical application is based are run in parallel on conventional multi-core hardware or multi-node clusters, controlled via an abstracted description using Parallel MATLAB [1]. At the same time, parts of each individual simulation are accelerated on general-purpose graphics processing units (GPUs) using nVidia CUDA [2]. This paper makes the following contributions:

– we develop a parallel computational model for the biomedical application, which effectively exploits parallelism at different levels of granularity;
– we demonstrate significant computing speedups of the application while retaining the high-level abstraction, which is essential for agile interdisciplinary code development.

The rest of this paper is organised as follows. We first provide background on tissue optics and introduce the underlying mathematical model of the biomedical application in Section 2. We then describe our approach to addressing the challenges raised by this interdisciplinary application in Section 3. Following that, we give the details of parallelising the simulation in Section 4, and discuss the performance results in Section 5. Conclusions are drawn in Section 6, where the generality of the present approach is also assessed.

2 Tissue Optics

Recent medical advances have witnessed the development of a host of minimally invasive laser-based treatments, e.g. the detection of bacterial pathogens in living hosts, the measurement of neoplasmic tumour growth, the pharmacokinetic study of drug effects or photo-dynamic cancer therapy [3]. A crucial factor in these treatments is the set of optical tissue properties, which vary throughout the body [4] and which determine how laser light is scattered, diffused and attenuated. As a consequence of varying optical properties, tissues are exposed to varying energy levels of incident light [5], and the treatment must avoid extended exposure to high energy levels. A good understanding of the optical properties of tissue is, therefore, very important. Of the several applications, the specific medical research application of this paper is optical fluorescence imaging, which is used for obtaining high-contrast images of coronary vasculature. This application concerns measuring the radii of blood vessels in the coronary vasculature images, which requires knowledge of the optical properties of the tissue. To estimate these optical properties, fluorescent microspheres are first injected into the tissue vascular network. The microspheres embedded in the tissue are excited by filtered light and in response emit light with a spectrum distinct from the

incident light. The emitted light is recorded by a CCD camera every time a slice (in the micrometre range) of the tissue is sectioned off [6]. The result is a set of digitised 2D images that make up, slice-by-slice, the 3D tissue volume. This set of images does not directly characterise the optical properties of the tissue, as the recorded photographic data mixes the sought properties with the optics of the recording device. The prevalent way to describe the recording process is via point-spread functions (PSF) [7], which are mathematical models of how an optical medium blurs an imaged object. In essence, a PSF describes the response of an optically active system to a point light source, so its precise definition is that of a Green's (or impulse response) function. In our application, two distinct PSF models are used to represent the optical properties of the recording device and of the biological tissue. Thus, the recorded digital image I of a single microsphere is modelled as the three-dimensional (3D) image convolution (denoted by ⊗) of the original microsphere image Im with the PSF f1 of the biological tissue and the PSF f2 of the recording device:

I = Im ⊗ f1 ⊗ f2    (1)

For each microsphere, f1 is parameterised by

– the coefficients of excitation and emission tissue attenuation, µex and µem, respectively;
– the location of the microsphere (denoted by s) inside the slice, which affects the detected intensity of the microsphere.

The parameters that characterise f1 are not known, and estimating them therefore constitutes part of the calculation of the tissue properties. The parameter estimation is a fitting problem: a simulated image Is is synthetically generated and the parameters are adjusted such that the difference between the simulated image and the observed image I is minimised in a least-squares sense. The algorithm is iterative, starting with initial parameter values and updating them (thus generating a new image every time) until the least-squares difference is minimised. The overall algorithm for the computational model is depicted in Figure 1. In the model, as a first step, a set of synthetic microspheres M is initialised with known synthetic values via the function Initialise_Synthetic. InitGuess initialises the parameters with starting values obtained from the literature [8]. GetProps extracts the a-priori known parameter values for a given microsphere. SignalSpread performs the expensive signal convolution between a single point source, the surrounding simulated tissue and the lens, as specified in Equation 1. As part of this operation, the images being convolved need to be resized through interpolation along one of the axes (the imresize operation). The Optimise function searches for values of µem, µex and s that satisfy the constraints. These newly found values become the current parameters, which are set by the SetProps function. The search continues until the mean-squared error, calculated by MSE, between the actual image and the simulated image (formed with the optimised parameter values) is minimised.
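Equation 1 can be illustrated with a small, self-contained sketch: the cascaded convolution becomes an element-wise product of spectra (the frequency-domain route detailed in Section 4). This is a 1D toy with a naive DFT and circular (not zero-padded) convolution, so it is an analogue only and none of it reflects the paper's actual 3D MATLAB/CUFFT code:

```python
import cmath

def dft(x, sign=-1):
    # Naive DFT; sign=-1 is the forward transform, sign=+1 the (unscaled) inverse.
    n = len(x)
    return [sum(x[j] * cmath.exp(sign * 2j * cmath.pi * k * j / n)
                for j in range(n))
            for k in range(n)]

def convolve_via_fft(im, f1, f2):
    # I = Im ⊗ f1 ⊗ f2 computed as a product of spectra (convolution theorem).
    # The unscaled inverse transform is divided by n and its (numerically
    # tiny) imaginary part discarded.
    n = len(im)
    spectrum = [a * b * c for a, b, c in zip(dft(im), dft(f1), dft(f2))]
    return [v.real / n for v in dft(spectrum, sign=+1)]
```

With f1 the unit impulse and f2 an impulse delayed by one sample, the cascade reduces to a circular shift of Im, so `convolve_via_fft([1, 2, 3, 4], [1, 0, 0, 0], [0, 1, 0, 0])` returns [4, 1, 2, 3] up to rounding.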
The optimiser may rely on any algorithm; in our case it is based on the Newton–Raphson gradient descent algorithm [9]. The search is constrained such that the parameters µex, µem and s can assume only a subset of values and are inter-dependent [10]. Furthermore, the algorithm is computationally expensive

M ← Initialise_Synthetic()
for m ∈ M
    [μex′, μem′, s′] ← InitGuess()
    err ← ∞
    [μex, μem, s] ← GetProps(m)
    while (err > ε)
        V ← SignalSpread(μex′, μem′, s′)
        err ← MSE(V, m)
        [μex′, μem′, s′] ← Optimise(μex′, μem′, s′, err)
        SetProps(μex′, μem′, s′)
    end while
end for

Fig. 1. Sequential computational model of the problem

— the cost of computation depends directly on the number of microspheres involved, which is typically in the order of tens of thousands.
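The loop of Figure 1 — simulate, score with MSE, update the parameters, repeat until convergence — can be sketched with a one-parameter toy. Here `signal_spread` is a hypothetical stand-in (simple exponential attenuation, not the paper's convolution) and the optimiser is a plain finite-difference gradient descent rather than the Newton–Raphson scheme of [9]:

```python
import math

def mse(a, b):
    # Mean-squared error between a simulated and an observed signal.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def signal_spread(mu, xs):
    # Toy stand-in for the expensive SignalSpread convolution:
    # a single attenuation coefficient mu applied at sample depths xs.
    return [math.exp(-mu * x) for x in xs]

def fit_mu(observed, xs, mu=0.1, lr=0.5, eps=1e-12, max_iter=10000):
    # Figure 1 in miniature: simulate, score, update, repeat.
    for _ in range(max_iter):
        err = mse(signal_spread(mu, xs), observed)
        if err < eps:
            break
        h = 1e-6  # finite-difference step for the error gradient w.r.t. mu
        grad = (mse(signal_spread(mu + h, xs), observed) - err) / h
        mu -= lr * grad
    return mu
```

Fitting a signal generated with a known coefficient recovers that coefficient, which is exactly the synthetic-microsphere validation idea of the model.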

3 Challenges and Approach

3.1 Challenges

One of the pressing factors of this study is the time-to-solution. This includes the code development time, the time spent on simulation and the time spent on the interpretation of results. The simulation is highly time-consuming, and the setting should be such that scientists are able to encode their domain-specific knowledge without extensive programming expertise, let alone parallel programming experience. This need for agile development leads us to consider a solution in which we can develop the model with minimal effort while exploiting high-performance computing resources. These issues collectively demand a solution able to represent high-level abstractions, with the facility to utilise them in a parallel environment. Libraries or high-level programming languages such as Fortran or C++ may not yield that level of abstraction, especially with parallel constructs. Furthermore, the need for rapid prototyping and simulation discourages such a choice. Modelling tools such as Simulink [11] may be a useful approach, but the abstractions within Simulink still need to be developed.

3.2 Our Approach

In realising a solution, we favour an array programming language, MATLAB [1]. Array programming languages offer a high level of abstraction, especially when combined with their own libraries. They offer an easy route for developing short yet powerful prototypes. By the same token, Octave, Python and SciLab are potential alternatives to MATLAB. However, we were motivated to choose MATLAB on the basis of interactivity, core functionality, popularity, support and the scalability of applications.

MATLAB is an integrated, modular development and visualisation platform. Abstraction within MATLAB goes beyond the notion of libraries. While delivering Fortran-90-like array-level abstractions, MATLAB specialises in different domains through toolboxes, which are implementations of very common domain-specific operations. These toolboxes bridge the communication gap between scientists across different disciplines and enable easier knowledge transfer across domains. MATLAB supports parallel computing through a specialised toolbox, the Parallel Computing Toolbox [12]. This toolbox enables writing parallel loops or processing large arrays in a single-program, multiple-data style. The functionality of the Parallel Computing Toolbox does not scale beyond a single node; this is augmented by the Distributed Computing Toolbox, whose distributed computing engine runs transparently on the master node. All these collectively enable a cluster of workstations to be utilised in a transparent manner. The fact that MATLAB is a just-in-time compiled language often leads to suboptimal runtime performance. To overcome this, we selectively use the CUDA architecture wherever possible, without breaking any abstractions.

4 Parallelisation and Computational Considerations

In identifying parallelism within the parameter fitting algorithm depicted in Figure 1, the following observations are useful.

1. The task of optimising the parameter space for one microsphere is independent of that for another; this allows the microspheres to be processed in parallel.
2. The fitting algorithm relies on an optimiser to find the point (µex, µem, s) in the parameter space that minimises the SignalSpread cost function. Numerous types of optimisers require repeated and independent evaluations of the SignalSpread cost function, which can be obtained in a parallel manner.
3. The SignalSpread function performs the 3D image convolution of three different images, as given in Equation 1. Following the standard approach to numerical convolution, the images are first transformed to the frequency domain, followed by a frequency-domain multiplication and the transformation of the result back to the image domain. This involves two discrete Fourier transforms, for which the popular numerical algorithm is the Fast Fourier Transform (FFT). Parallel multi-threaded implementations of this exist, which exploit a fine-grained level of parallelism.
4. To avoid the large memory penalty of convolving the 512³-resolution image volume within the SignalSpread function, we convolve a down-sampled image and then resize the result via interpolation. This operation exposes fine-grained parallelism. The negligible loss of accuracy in the interpolated image does not affect the tuning of the optical parameters.

The spatial resolution of a microsphere image varies between 32³ and 512³ double-precision values, corresponding to 256KB and 1GB respectively. However, avoiding strong aliasing effects in converting an image to the frequency domain requires the doubling (at a minimum) of the original array size along all three dimensions.
Additionally, the entire convolution operation requires three 3D arrays for the reasons stated in Section 2, which raises

the memory requirement for a single operation to 6MB and 24GB, respectively. This clearly highlights the need for the image resize manipulation, which reduces the memory footprint of the convolution to manageable dimensions. In light of the demands posed in Section 3, the representation of the parallelism should remain at as high a level as possible. High-level parallelism is clearly expressed by observations (1) and (2). This is easily exploited in an abstracted way within Parallel MATLAB, as described in the next section. At the other end of the spectrum, the fine-grained parallelism expressed by the FFT and image resize operations (observations (3) and (4)) can take advantage of multithreading, and adds extra performance on appropriate hardware. In this work, programmable GPUs were used to great advantage; their specialised programming is in contradiction with the overall requirement for abstraction, but is of limited extent within the application and hidden within MATLAB functions.

4.1 Coarse-Grained Parallelism Using Parallel MATLAB

MATLAB provides parallelism through the notion of workers. Each worker is responsible for a MATLAB session, and a subscription mechanism exists to govern the granularity of the workers. It is very common to subscribe a single worker per core of a microprocessor. In the case of a cluster of workstations, multiple workers may exist and their execution is orchestrated by the distributed computing engine. The distributed computing engine may use any scheduler and provides transparency across workers. In other words, the application does not differentiate between workers which span different nodes and workers residing within the same host. Using the Message Passing Interface as the underlying communication mechanism between workers, MATLAB provides two different high-level constructs to perform parallel operations: parfor and spmd.
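This worker model — a fixed pool of workers, each fitting its own share of the microspheres independently, with the results gathered at the end — can be mimicked outside MATLAB. Below is a minimal Python analogue with a thread pool; all names are hypothetical, and MATLAB workers are separate processes rather than threads, so this is a sketch of the pattern only:

```python
from concurrent.futures import ThreadPoolExecutor

def fit_microsphere(m):
    # Hypothetical stand-in for one independent parameter search
    # (the InitGuess / SignalSpread / Optimise loop of Figure 1).
    return m * m

def fit_all(microspheres, workers=4):
    # spmd-style pattern: the microsphere set is split across a pool of
    # workers, each fit runs with no synchronisation (they are independent),
    # and map() collects the results in order, much like MATLAB's gather.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fit_microsphere, microspheres))
```

Because the fits share no data, the only coordination point is the final gather, which is what makes the problem an ideal spmd candidate.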
The parfor construct represents a parallel loop, whose enclosed statements are executed in parallel across workers in a lock-step fashion. The data distribution is automatic: the array is sliced based on the access patterns. The spmd construct is a single-program, multiple-data construct, where all workers perform the same operation but on different data. The data distribution can be performed manually depending on the requirements. In this work, we chose the spmd construct to achieve parallelism at the microsphere level. The evaluations of the parameter space for different microspheres are independent of each other, but may vary in their runtime depending on the convergence rate of the optimiser. This implies that microspheres cannot be processed in a lock-step fashion without incurring additional delays. The spmd construct of Parallel MATLAB allows this condition to be met: it enables several MATLAB workers to perform a similar operation on different data without the need for explicit synchronisation. The original data space, here a set of microspheres, is partitioned across distributed workers so that workers deal with different data. Since no communication is required across workers for the evaluation of the parameter space, this is an ideal case for spmd. The modified algorithm accounting for this is shown in Figure 2. In addition to the original notions of the sequential model, M is distributed across workers using CoDistribute. Placeholder result arrays (µex^d, µem^d, s^d) are created across all workers in a distributed fashion using CreateDistributed. The spmd block guarantees that all

M ← Initialise_Synthetic()
M^d ← CoDistribute(M, N)
μex^d ← CreateDistributed(N)
μem^d ← CreateDistributed(N)
s^d ← CreateDistributed(N)
spmd (N)
    for m ∈ M^d
        [μex′, μem′, s′] ← InitGuess()
        err ← ∞
        [μex, μem, s] ← GetProps(m)
        while (err > ε)
            V ← SignalSpread(μex′, μem′, s′)
            err ← MSE(V, m)
            [μex′, μem′, s′] ← Optimise(μex′, μem′, s′, err)
            SetProps(μex′, μem′, s′)
        end while
        μex^d(m) ← μex
        μem^d(m) ← μem
        s^d(m) ← s
    end for
    μex ← gather(μex^d)
    μem ← gather(μem^d)
    s ← gather(s^d)
end

Fig. 2. Parallel computational model of the problem

workers perform the same operation on their own data. Finally, the data are gathered and consolidated into the non-distributed arrays µex, µem and s using the gather construct. Each worker may reach a given point in the program at a different time.

4.2 Fine-Grained Parallelism Using Graphics Hardware Acceleration

The fine-grained parallelism inherent in the 3D convolution and image resizing can be accelerated on commodity GPU hardware relatively easily, using nVidia CUDA [13]. CUDA is a general-purpose high-level parallel architecture that enables programmers to control the execution of (numerically intensive) code on GPUs via the C language with nVidia-specific extensions. GPU programming using CUDA is shaped by two important hardware aspects:

– the highly parallel structure of GPU hardware (a large number of specialised processing cores) is best exploited through a massively multi-threaded SIMD programming approach, with minimal conditional branching;
– the global memory available on GPU hardware is relatively limited, and the memory bandwidth between the host and the GPU is too limiting to allow graphics

computing to access the host memory in a way that overcomes the graphics memory limitation.

The convolution operations use the optimised FFT transforms implemented by the CUFFT library from nVidia. In using the CUFFT library, there are a number of factors to consider:

– The CUFFT library processes FFT transforms in an interleaved data format for the real and imaginary parts, whereas MATLAB holds the real and imaginary parts separately. The data format transformation back and forth is best carried out on the GPU itself, but that obviously requires extra device memory.
– The best transform performance (in terms of both absolute speed and scaling with the data size) is achieved for transform sizes that are powers of 2; this is a general feature of the FFT algorithm in any implementation (including FFTW [14], used by MATLAB itself). The divide-and-conquer approach of the FFT underperforms when the data size is a prime or an unbalanced factorisation containing a large prime. Another advantage of powers of two is that transform results from CUFFT and the more standard FFTW are equal to within machine accuracy, in contrast to prime sizes, for which results can have a relative difference of order 1.

Apart from the FFT transforms themselves, the frequency-domain data array multiplication and scaling also benefited from GPU acceleration. Besides the 3D convolution, the parallelism inherent in the image resizing (independent row and/or column operations) also benefits from the massively multi-threaded acceleration on GPUs to great effect. Finally, to retain the high abstraction level in the application, the CUDA programming is hidden from the rest of the application behind MEX interface functions directly usable from MATLAB.
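The power-of-two observation can be folded directly into the array sizing: after the anti-aliasing doubling described in Section 4, round each dimension up to the next power of two before transforming. A hypothetical helper sketching that policy (not taken from the paper's code):

```python
def next_pow2(n):
    # Smallest power of two greater than or equal to n.
    p = 1
    while p < n:
        p *= 2
    return p

def fft_shape(image_shape):
    # Double every dimension (the anti-aliasing requirement of Section 4),
    # then round each up to a power of two to stay in the fast, accurate
    # regime of FFT libraries such as CUFFT and FFTW.
    return tuple(next_pow2(2 * d) for d in image_shape)
```

For the paper's native sizes the doubling already lands on a power of two (e.g. 32³ → 64³), so the rounding costs nothing; for awkward sizes such as 96³ it trades extra memory (192 → 256 per dimension) for transform speed and CUFFT/FFTW agreement.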

5 Performance Results

Since the entire computation is an ongoing process, we report runtime performance for a limited set of microspheres, for resolutions from 32 through 128. To enable the measurement of speedups between different versions, we fixed the number of iterations of the optimiser or, wherever this was not possible, we calculated the speedups on a per-iteration basis. This is perfectly valid, as the runtimes between iterations vary only marginally. Our simulations are based on a cluster of workstations and a GPU-based system whose configurations are as follows:

– A Microsoft cluster running Microsoft Windows 2008 (64-bit) with 32 nodes, interconnected using Gigabit Ethernet. Each node is equipped with a quad-core Intel Xeon processor (64-bit) and 8GB of RAM, running MATLAB (2009b, 64-bit). MATLAB is configured to use a single worker per core and the native scheduler.
– A Tesla S1070 GPU cluster, running Linux (kernel 2.6.18), with 8 compute nodes shared across 4 Tesla S1070 units. Each compute node is equipped with two quad-core Xeon processors and 8GiB of memory. Each GPU device is equipped with 4GB of device memory shared across 240 cores per processor, and devices are interconnected using PCIe x16. The CUDA version is 2.3.

– Hosts used for the local-worker/single-worker experiments have the same configuration of quad-core Xeon processor running Windows 2008 (64-bit) and 32GB of RAM. All hosts run the same version of MATLAB as the cluster. Results are averaged over these.

The initial speedups obtained by simply using Parallel MATLAB and CUDA are very attractive. A number of observations can be made on the reported performance results.

Speedups from parallel workers: We show our runtime performance in Figure 3(a). An immediate observation, which we investigate further below, is the performance difference between local and distributed workers. In MATLAB, local workers are confined to a single node, whereas distributed workers may span any number of nodes. We report our speedups in two different stages: the speedup of local workers against a single worker (4L vs 1L) and the speedup of distributed workers against local workers (xD vs 4L, where x = 4, 8, 16, 32). When using multiple local workers, we see substantial yet suboptimal speedups against a single worker. However, the speedups of distributed workers against four local workers (and thus against 1L) are very attractive. Distributed workers outperform the corresponding number of local workers. It is worth noting that MATLAB does not have a shared-memory model; instead it relies on the message-passing model. Hence all communications are treated as messages, which may create bus contention. To investigate this further, we performed a synthetic data transfer between local and distributed workers. In the case of distributed workers, we customised the worker allocation to ensure that the workers are on distinct nodes, to avoid any local bus contention. Figure 3(d) shows the ratio of the transfer times of local workers to those of distributed workers. Transfers between distributed workers are faster than transfers between local workers by several fold.
We ascribe these observations to the differing characteristics of the local bus and the network interconnect, and to the schedulers: local workers rely on the operating system, whereas distributed workers rely on the scheduling policy of the distributed computing engine.

Speedups from CUDA: Two fine-grained operations, 3D convolution (FFT) and image resizing (ImResize), were time-consuming in the original application, with neither benefiting from threading within MATLAB. We parallelised each of these using CUDA. We report our findings, again, in two stages: the performance of these kernels as standalone versions, and after integrating them within the application. The integration of the CUDA kernels is tested in three configurations: a single MATLAB worker with and without CUDA, and four local workers without CUDA. The speedups are obtained by comparing their runtimes against a single worker without CUDA (using the standard FFT and ImResize functions). The results are shown in Figures 3(e) and 3(f), respectively. In all cases, the performance figures include data transfer and transformation costs. In the case of the standalone kernels, we observe that both kernels vary in their performance with data size. The initial rise in speedups is obtained by amortising the transfer and transformation costs over the computation time. For larger data sizes, however, the transfer costs begin to dominate again and the overall speedups begin to decrease. In the case of the integrated version, parallel MATLAB offers significant speedups compared to the single worker, but a single worker with CUDA outperforms the parallel version on a single host. Performance varies with problem sizes,

Fig. 3. Runtimes/speedups for parameter searching: (a) runtimes for different configurations; (b) speedups with four local workers; (c) speedups with distributed workers; (d) ratio of local to distributed data transfer times; (e) 3D convolution/ImResize speedups; (f) combined speedups. The search is carried out for three different sizes (32, 64 and 128) on different worker configurations: a single worker (denoted by 1L), four workers localised to a single node (4L), and four, eight, 16 and 32 workers distributed across the cluster (denoted by 4D, 8D, 16D and 32D respectively) (Figures 3(a)–3(c)). The ratio of transfer times between local and distributed workers is presented in Figure 3(d). On CUDA, both the 3D convolution and ImResize were run in isolation and after integration with the overall solution (Figures 3(e), 3(f)). See the text for further information.

replicating the original effects found in the standalone versions. Furthermore, a current limitation of the CUDA driver, in terms of using CUDA-based routines in MATLAB, requires constant flushing of the CUDA device memory when CUDA kernels are repeatedly called within MATLAB. Our explanation is that such constant memory management may trigger the MATLAB just-in-time compiler, which may offset some of the benefits of the CUDA parallelisation. Furthermore, runs on larger data sizes could not be performed, as the overall memory requirements of the application exceed those of the device. Despite all these complexities, the overall speedups are substantial. Furthermore, the configuration of the GPU cluster is such that two nodes share one S1070 device; parallel workers would thus only create contention and would not offer any speedups. The absence of hardware combining the MATLAB distributed computing engine with CUDA prevents us from making use of the fine-grained parallelism within each worker.

The scalability of the solution: We observe that, when increasing the number of distributed workers, the speedup increases linearly with respect to the minimum number of distributed workers (4D). In the case of CUDA, scalability exists only for a sub-range of problem sizes, and this is the case even when including data transfer and transformation costs. The integration of the kernels and the associated overheads, however, leads to sub-optimal scalability.

Programmability: What is more promising is the level of detail at which the parallelism is expressed. The same sequential algorithm is readily parallelised without any effort requiring C++ experience. Utilising the CUDA FFT and ImResize did require some expertise in understanding and transforming the data formats within the satellite MEX functions. However, this is only a one-time effort and does not affect the level of abstraction represented by the application.

6 Conclusions

This paper outlines the solution adopted in the parallelisation of a biomedical application, in which a high level of programming abstraction is a crucial criterion both for rapid code development and for the underlying interdisciplinary research. The key to maintaining abstraction is the exploitation of parallelism at multiple levels of granularity: at a high level, multiple simulations of the core mathematical model on which the application is based are run in parallel on conventional computer hardware, while the simulations themselves are accelerated at a low level on GPUs. The high-level parallelism is facilitated by the use of Parallel MATLAB, which strikes a good balance between performance and a high level of abstraction, with the added advantage of its prevalence in the academic community. Parallelism at the low level of granularity benefits from remarkable acceleration on graphics cards via CUDA programming. Both levels of parallelism are relatively flexible in terms of the type of hardware the application can be run on. By combining Parallel MATLAB programming with low-level CUDA programming for the graphics card acceleration, remarkable performance benefits were demonstrated on two different parallel hardware configurations. The paper emphasises the importance of the high-level programming abstraction for simple knowledge transfer across

different scientific disciplines. The general separation of high-level parallel programming from low-level acceleration on graphics devices renders the approach applicable in similar cases, where interdisciplinarity is coupled with agile code development.

References
1. Sharma, G., Martin, J.: MATLAB®: A Language for Parallel Computing. International Journal of Parallel Programming 37(1), 3–36 (2009)
2. Harris, M.: Many-Core GPU Computing with nVidia CUDA. In: Proceedings of the 22nd Annual International Conference on Supercomputing, p. 1. ACM, New York (2008)
3. Shah, K., Weissleder, R.: Molecular Optical Imaging: Applications Leading to the Development of Present Day Therapeutics. Journal of the American Society for Experimental NeuroTherapeutics 2(2), 215–225 (2005)
4. Steyer, G.J., Roy, D., Salvado, O., Stone, M.E., Wilson, D.L.: Removal of Out-of-Plane Fluorescence for Single Cell Visualization and Quantification in Cryo-Imaging. Annals of Biomedical Engineering 37(8), 1613–1628 (2009)
5. Steven, L.J., Prahl, S.A.: Modeling Optical and Thermal Distributions in Tissue During Laser Irradiation. Lasers in Surgery and Medicine 6, 494–503 (2008)
6. Spaan, J.A.E., ter Wee, R., van Teeffelen, J.W.G.E., Streekstra, G., Siebes, M., Kolyva, C., Vink, H., Fokkema, D.S., VanBavel, E.: Visualisation of Intramural Coronary Vasculature by an Imaging Cryomicrotome Suggests Compartmentalisation of Myocardial Perfusion Areas. Medical and Biological Engineering and Computing 43(4), 431–435 (2005)
7. Rolf, P., ter Wee, R., van Leeuwen, G.T., Spaan, J., Streekstra, G.: Diameter Measurement from Images of Fluorescent Cylinders Embedded in Tissue. Medical and Biological Engineering and Computing 46(6), 589–596 (2008)
8. Cheong, W., Prahl, S., Welch, A.: A Review of the Optical Properties of Biological Tissues. IEEE Journal of Quantum Electronics 26, 2166–2185 (1990)
9. Nocedal, J., Wright, S.J.: Numerical Optimization. Springer, New York (1999)
10. Goyal, A., van den Wijngaard, J., van Horssen, P., Grau, V., Spaan, J., Smith, N.: Intramural Spatial Variation of Optical Tissue Properties Measured with Fluorescence Microsphere Images of Porcine Cardiac Tissue. In: Proceedings of the Annual IEEE International Conference on Engineering in Medicine and Biology Society, pp. 1408–1411 (2009)
11. Mathworks: MATLAB Simulink, Simulation and Model-Based Design, http://www.mathworks.co.uk/products/simulink/ (Last accessed June 1, 2010)
12. Mathworks: MATLAB Parallel Computing Toolbox, http://www.mathworks.co.uk/products/parallel-computing/ (Last accessed June 1, 2010)
13. Govindaraju, N.K., Lloyd, B., Dotsenko, Y., Smith, B., Manferdelli, J.: High Performance Discrete Fourier Transforms on Graphics Processors. In: Proceedings of SuperComputing 2008 (Electronic Media). ACM/IEEE, Austin (2008)
14. Frigo, M., Johnson, S.G.: FFTW: An Adaptive Software Architecture for the FFT. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing, vol. 3, pp. 1381–1391 (1998)

Parallel Numerical Algorithms

Patrick Amestoy and Daniela di Serafino (Topic Chairs)
Rob Bisseling, Enrique S. Quintana-Ortí, and Marian Vajteršic (Members)

Robust and efficient parallel numerical algorithms and their implementation in easy-to-use portable software components are crucial for computational science and engineering applications. They are strongly influenced by the remarkable progress in the development of parallel computer systems. This makes it necessary to adapt or redesign existing algorithms, or to find new algorithmic approaches, in order to effectively exploit the power offered by emerging parallel architectures. On the other hand, the use of numerical algorithms and software in more and more complex applications, to perform simulations and computational experiments with a very high level of detail, demands increasing computing power, thus giving new impulses to parallel computer developers.

The aim of Topic 10 is to provide a forum for the presentation and discussion of recent advances in the field of parallel and distributed numerical algorithms. Different aspects of their design and implementation are addressed, ranging from fundamental algorithmic concepts, through their efficient implementation on modern parallel architectures, up to software design and prototyping in scientific computing and simulation software environments, and to performance analysis. Overall, 17 papers were submitted to this topic, with authors from Austria, China, Egypt, France, Germany, Greece, India, Italy, Lebanon, The Netherlands, Russia, Spain, Sweden and the United States of America. Each paper received at least three reviews and, finally, we were able to select 6 regular papers.

The accepted papers discuss parallel algorithms for ordinary differential equations, partial differential equations, linear algebra and fast Fourier transforms. We grouped the accepted papers into two sessions. In the session on differential equations, M. Korch, T. Rauber and C. Scholtes analyze several parallel implementations of explicit extrapolation methods for the solution of systems of ODEs; M. Fournier, N. Rennon and D. Ruiz describe the integration of the parallel linear system solver Mumps into the Getfem++ finite element library and its application in CFD simulations; M. Emans presents an approach to reduce latency effects in algebraic multigrid applications on clusters. In the session on linear algebra and FFT, E. Romero and J. Roman propose a robust and efficient parallel implementation of the Jacobi–Davidson eigensolver for complex non-Hermitian matrices; M. Roderus, A. Berariu, H.-J. Bungartz, S. Krüger, A. Matveev and N. Rösch describe the use of scheduling algorithms to improve the scalability of parallel eigenvalue computations in quantum chemistry; Y. Zhang, J. Liu, E. Kultursay, M. T. Kandemir, N. Pitsianis and X. Sun

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 63–64, 2010. © Springer-Verlag Berlin Heidelberg 2010


discuss the parallelization of the data translation phase for non-uniform FFT methods. These papers show a wide diversity of themes, approaches and applications related to the field of parallel numerical algorithms. We are convinced that they represent valuable contributions to Euro-Par 2010 and that their presentation at the conference will be of interest to a broad audience of participants. Finally, we would like to take the opportunity to thank all the authors for their submissions, the referees for their competent and useful comments, as well as the Euro-Par Organizing Committee for the smooth coordination of Topic 10.

Scalability and Locality of Extrapolation Methods for Distributed-Memory Architectures

Matthias Korch, Thomas Rauber, and Carsten Scholtes

University of Bayreuth, Department of Computer Science
{korch,rauber,carsten.scholtes}@uni-bayreuth.de

Abstract. The numerical simulation of systems of ordinary differential equations (ODEs), which arise from the mathematical modeling of time-dependent processes, can be highly computationally intensive. Thus, efficient parallel solution methods are desirable. This paper considers the parallel solution of systems of ODEs by explicit extrapolation methods. We analyze and compare the scalability of several implementation variants for distributed-memory architectures which make use of different load balancing strategies and different loop structures. By exploiting the special structure of a large class of ODE systems, the communication costs can be reduced considerably. Further, by processing the micro-steps using a pipeline-like loop structure, the locality of memory references can be increased and a better utilization of the cache hierarchy can be achieved. Runtime experiments on modern parallel computer systems show that the optimized implementations can deliver high scalability.

1 Introduction

Systems of ordinary differential equations (ODEs) arise from the mathematical modeling of time-dependent processes. We consider the numerical solution of initial value problems (IVPs) of systems of ODEs defined by

    y'(t) = f(t, y(t)),   y(t_0) = y_0,                                   (1)

with y : ℝ → ℝ^n and f : ℝ × ℝ^n → ℝ^n. The solution of such problems can be highly computationally intensive, and the utilization of parallelism is desirable, in particular if the dimension n of the ODE system is large. Parallel solution methods have been proposed by many authors, for example, extrapolation methods [4,5,8,14,11], waveform relaxation methods [2], iterated Runge–Kutta methods [9,7], and peer two-step methods [12]. An overview is given in [2].

Numerical solution methods for ODE IVPs start with the initial value y_0 and perform a (potentially large) number of time steps to walk through the integration interval [t_0, t_e]. At each time step κ, a new approximation η_{κ+1} to the solution function y at time t_{κ+1} is computed, i.e., η_0 = y_0, η_{κ+1} ≈ y(t_{κ+1}). Sophisticated methods can adapt the stepsize h_κ = t_{κ+1} − t_κ such that the number of time steps is reduced while the local error is kept below a user-defined error tolerance. The different solution methods are distinguished mainly by the computations performed at each time step.

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 65–76, 2010. © Springer-Verlag Berlin Heidelberg 2010
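As a concrete (and deliberately minimal) instance of such a time-stepping scheme, the following plain-C sketch applies the explicit Euler method with a fixed stepsize to the scalar test problem y'(t) = −y(t). It is illustrative only; it is not one of the paper's extrapolation methods, which are built on top of such one-step base methods and additionally adapt the stepsize:

```c
#include <assert.h>
#include <math.h>

/* Fixed-stepsize time stepping eta_{kappa+1} = eta_kappa + h * f(t, eta_kappa)
 * (explicit Euler) for the scalar test problem y'(t) = -y(t), y(0) = y0.
 * Illustrative sketch only; the paper extrapolates such base steps. */
double euler(double y0, double te, int steps) {
    double h = te / steps, y = y0, t = 0.0;
    for (int kappa = 0; kappa < steps; ++kappa) {
        y += h * (-y);   /* f(t, y) = -y */
        t += h;
    }
    return y;
}
```

For te = 1 the result approaches y(1) = e^{−1} with first-order accuracy as the number of steps grows.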


Extrapolation methods compute a sequence υ^(1), ..., υ^(r) of approximations to y(t_{κ+1}) of increasing accuracy, which are then used to extrapolate η_{κ+1}. In order to compute these r approximations, the IVP y'(t) = f(t, y(t)), y(t_κ) = η_κ on the interval [t_κ, t_{κ+1}] is solved r times by a base method using an individual constant stepsize h_κ^(j), j = 1, ..., r, where h_κ^(1) > h_κ^(2) > ··· > h_κ^(r). The time steps with stepsize h_κ^(j) computed by the base method are called local time steps or micro-steps, whereas the time steps of the extrapolation method are referred to as global time steps or macro-steps. Often, one-step methods of order 1 or 2 are used as base method. The stepsizes h_κ^(j) are determined by an increasing sequence of positive integer numbers n_1 < n_2 < ··· < n_r, such that h_κ^(j) = h_κ/n_j for j = 1, ..., r. In the experiments presented in this paper, we use the harmonic sequence n_j = j, which is presumed to deliver a better practical performance than other commonly used sequences [3] and which facilitates load balancing.

Explicit extrapolation methods use an explicit base method, e.g., the explicit Euler method (Richardson–Euler method) or the explicit midpoint rule (GBS method [1,6]). In this paper, we use the Richardson–Euler method, which computes the r approximations υ^(j) = μ_{n_j}^(j), j = 1, ..., r, according to

    μ_0^(j) = η_κ,   μ_λ^(j) = μ_{λ−1}^(j) + h_κ^(j) · f(t_κ + (λ−1)h_κ^(j), μ_{λ−1}^(j)),   λ = 1, ..., n_j.   (2)

Interpreting the sequence υ^(1), ..., υ^(r) as values of a function T(h) such that T(h_κ^(j)) = υ^(j), we use the Aitken–Neville algorithm

    T_{j,k+1} = (h_{j−k} · T_{j,k} − h_j · T_{j−1,k}) / (h_{j−k} − h_j),   k = 1, ..., r−1,  j = k+1, ..., r,   (3)

starting with T_{j,1} = T(h_κ^(j)), to extrapolate T(0) = T_{r,r}. This value is then used as the new approximation η_{κ+1} to start the next macro-step.

In the following, we investigate the scalability of different implementations of extrapolation methods for distributed-memory architectures. First, in Section 2, we describe how sequential and parallel implementations can be derived from the mathematical formulations (2) and (3). Then, in Section 3, we show how the communication costs can be reduced and how the locality of memory references can be improved by exploiting the special structure of a large class of ODE IVPs. In Section 4, we evaluate the results of runtime experiments performed on two modern supercomputer systems to investigate the scalability of the previously described implementations. Section 5 concludes the paper.
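The two phases described above can be condensed into a runnable plain-C sketch (variable names are ours, not the paper's; harmonic sequence n_j = j assumed). The in-place Aitken–Neville round combines T_{j,k−1} and T_{j−1,k−1} using the stepsizes h/(j−k+1) and h/j:

```c
#include <assert.h>
#include <math.h>
#include <stdlib.h>

/* Right-hand side of the scalar test ODE y'(t) = -y(t) (n = 1). */
static void eval_f(double t, const double *y, double *fy, int n) {
    (void)t;
    for (int i = 0; i < n; ++i) fy[i] = -y[i];
}

/* One macro-step of the Richardson-Euler extrapolation method, Eqs. (2)-(3),
 * with the harmonic sequence n_j = j.  eta (length n) holds the approximation
 * at t and is overwritten with the extrapolated approximation at t + h. */
void extrapolation_step(double t, double h, double *eta, int n, int r) {
    double *R = malloc((size_t)r * n * sizeof *R);  /* registers R_1..R_r */
    double *F = malloc((size_t)n * sizeof *F);      /* temporary register */
    for (int j = 1; j <= r; ++j) {                  /* n_j = j micro-steps */
        double hj = h / j;
        double *Rj = R + (size_t)(j - 1) * n;
        for (int i = 0; i < n; ++i) Rj[i] = eta[i]; /* initial value eta_kappa */
        for (int lam = 1; lam <= j; ++lam) {        /* explicit Euler, Eq. (2) */
            eval_f(t + (lam - 1) * hj, Rj, F, n);
            for (int i = 0; i < n; ++i) Rj[i] += hj * F[i];
        }
    }
    /* Aitken-Neville, Eq. (3), in place: after round k, R_j holds T_{j,k},
     * combining T_{j,k-1} and T_{j-1,k-1} with stepsizes h/(j-k+1) and h/j. */
    for (int k = 2; k <= r; ++k)
        for (int j = r; j >= k; --j) {
            double ha = h / (j - k + 1), hb = h / j;
            double *Rj = R + (size_t)(j - 1) * n;
            for (int i = 0; i < n; ++i)
                Rj[i] = (ha * Rj[i] - hb * Rj[i - n]) / (ha - hb);
        }
    for (int i = 0; i < n; ++i) eta[i] = R[(size_t)(r - 1) * n + i];
    free(R); free(F);
}
```

Applied to y' = −y with h = 0.1 and r = 4, a single macro-step reproduces e^{−0.1} to roughly seven digits, illustrating the increased order obtained by extrapolating the first-order Euler results.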

2 Implementation of Extrapolation Methods for ODE Systems with Arbitrary Access Structure

In this section, we derive sequential and parallel distributed-memory implementations of extrapolation methods, using (2) and (3) as a starting point. The implementations derived in this section are suitable for general ODE systems consisting of n arbitrarily coupled equations.


// computation of υ^(1), ..., υ^(r) in the registers R1, ..., Rr
 1: for (j = 1; j ≤ r; ++j)                    // for each stepsize h_κ^(j) = h_κ/n_j
 2:   for (i = 1; i ≤ n; ++i) Rj[i] = η_κ[i];  // use η_κ as initial value
 3:   for (λ = 1; λ ≤ n_j; ++λ)                // n_j micro-steps
 4:     for (i = 1; i ≤ n; ++i) F[i] = f_i(t_κ + (λ−1)h_κ^(j), Rj);  // evaluate f
 5:     for (i = 1; i ≤ n; ++i) Rj[i] += h_κ^(j) F[i];               // update Rj
// extrapolation according to Aitken–Neville
 6: for (k = 2; k ≤ r; ++k)    // r − 1 rounds
 7:   for (j = r; j ≥ k; --j)  // in-place computation of T_{j,k} in register Rj
 8:     for (i = 1; i ≤ n; ++i)
 9:       Rj[i] = (h_κ^(j−k) · Rj[i] − h_κ^(j) · Rj−1[i]) / (h_κ^(j−k) − h_κ^(j));
10: accept or reject Rr as new approximation η_{κ+1} and select new stepsize h_{κ+1};

Fig. 1. Sequential loop structure

2.1 Algorithmic Structure and Sequential Implementation

The computations defined by (2) and (3) suggest the subdivision of the algorithmic structure of a macro-step into two phases: the computation of the approximation values υ^(1), ..., υ^(r) and the subsequent extrapolation of T(0) using the Aitken–Neville algorithm. Figure 1 shows a possible sequential loop structure implementing these two phases as separate loop nests. This loop structure is used as the basis for deriving parallel implementations in Section 2.2.

In the first phase of a macro-step, the approximation values υ^(j) ≈ y(t_{κ+1}), j = 1, ..., r, are computed by solving the IVP y'(t) = f(t, y(t)), y(t_κ) = η_κ on the interval [t_κ, t_{κ+1}] r times by the explicit Euler method (2) using an individual stepsize h_κ^(j). To exploit the locality of successive micro-steps λ → λ + 1, the outermost loop iterates over the sequence index j = 1, ..., r such that each iteration computes one sequence j of micro-steps. The computation of one sequence of micro-steps makes use of two vectors of size n (i.e., arrays which can store n floating-point values, referred to as registers). After initializing the register Rj with the n components of the initial value η_κ, a loop is executed to compute the micro-steps λ = 1, ..., n_j. At each micro-step, an evaluation of the right-hand-side function f and an update of Rj are required. Since, here, we make no assumptions about the structure of the ODE system, the evaluation of f_i(t_κ + (λ − 1)h_κ^(j), Rj) may depend on all components of Rj, and, therefore, the results of the function evaluations f_1, ..., f_n have to be stored in a temporary register (F) before the components of Rj can be updated in a subsequent loop over the system dimension. After the update, the temporary register F can be reused to compute another micro-step.


At the beginning of the second phase (the Aitken–Neville algorithm (3)), the registers Rj, j = 1, ..., r, contain the r approximation values υ^(j) = T_{j,1}. According to (3), the computation of T_{j,k} only depends on T_{j−1,k−1} and T_{j,k−1}. Hence, T_{r,r} can be computed in place by carefully choosing the update order of the registers in use. To do so, one iteration of the outer loop k (one round) overwrites T_{j,k−1} in Rj by T_{j,k} for all j ≥ k. Starting the j-loop with j = r and counting j backwards avoids overwriting values needed for the computation of the remaining T_{j′,k} with j′ < j. Respecting these dependencies, all components of Rj can be updated in place according to (3). After r − 1 rounds, the final result of the Aitken–Neville extrapolation, T_{r,r} = η_{κ+1}, resides in Rr and can be accepted or rejected by the stepsize controller.

2.2 Exploiting Data and Task Parallelism

Compared to classical Runge–Kutta methods, extrapolation methods provide a larger potential for parallelism. In addition to the potential for parallelism across the system (data parallelism), which increases with the dimension of the ODE system and which is available in all ODE solution methods, extrapolation methods can exploit parallelism across the method (task parallelism). The following description of parallelization approaches assumes the use of a message-passing programming model, in particular MPI [13].

Consecutive Implementation. In a purely data-parallel implementation, the i-loops are parallelized such that the n components of the ODE system are distributed evenly across all P processors and each processor is responsible for the computation of one contiguous block of components of size n/P. As in the sequential implementation, the sequences of micro-steps j = 1, ..., r as well as the micro-steps λ = 1, ..., n_j of these sequences are computed consecutively to exploit the locality of successive micro-steps. Therefore, we refer to this implementation as the consecutive implementation. Since the equations of the ODE system may be coupled arbitrarily, a processor has to provide the part of the current approximation vector μ_λ^(j) it computes to all other participating processors before the next micro-step can be started. This requires the use of a multi-broadcast operation (MPI_Allgather()). Similarly to the computation of the micro-steps, the Aitken–Neville algorithm can be executed in data-parallel style using the same blockwise data distribution, which assigns contiguous blocks of n/P components to the P processors. During the execution of the Aitken–Neville algorithm, no communication is necessary. Only one further communication operation, a multi-accumulation operation (MPI_Allreduce()), needs to be executed near the end of the macro-step to enable stepsize control.

Group Implementations.
As the r sequences of micro-steps within one macro-step are independent of each other, extrapolation methods provide potential for task parallelism. It can be exploited by parallelizing across the j-loop (line 1 of Fig. 1), i.e., by partitioning the processors into up to r disjoint groups, where each group is responsible for the computation of one or several sequences of micro-steps. Since these sequences consist of different numbers of micro-steps, load balancing is required for the group implementations. We consider two load balancing strategies:

– Linear group implementation: The processors are partitioned into r groups, and the number of processors per group is chosen to be linearly proportional to the number of micro-steps computed.
– Extended group implementation: The processors are partitioned into r/2 groups of equal size. Each group is responsible for the computation of (up to) two sequences of micro-steps, such that the total number of micro-steps per group is (nearly) evenly balanced. (For odd r, only one sequence of micro-steps can be assigned to one of the groups.) In the case of the harmonic sequence, which we use in our experiments, and even r, group g computes υ^(g) and υ^(r+1−g), which leads to a total number of r + 1 micro-steps per group.

As in the sequential and in the consecutive implementation, the groups compute the micro-steps of a sequence j one after the other to maintain locality. If a group is responsible for two sequences j and j′, the two sequences are computed consecutively. Within a group consisting of G processors, work is distributed to the processors in a similar blockwise fashion as in the consecutive implementation, such that each processor in the group is assigned n/G components of the ODE system. As in the consecutive implementation, multi-broadcast operations have to be executed between successive micro-steps, but data has to be exchanged only between processors belonging to the same group. Thus, fewer processors participate in each multi-broadcast operation, and multiple multi-broadcast operations are executed in parallel.

Compared to the consecutive implementation, the group implementations have the difficulty that the data distribution resulting from the task-parallel computation of υ^(1), ..., υ^(r) is not optimal for the Aitken–Neville algorithm, since each processor only has a subset of the components of one or two of the vectors υ^(1), ..., υ^(r) in its memory. Moreover, in the linear group implementation the groups differ in size, and the processors store different numbers of components of the vectors. In order to avoid a complex and, thus, time-consuming reorganization of the data distribution, we chose to gather all components of υ^(1), ..., υ^(r) at one processor (the leader) of the group responsible for the respective vector. The group leaders then execute the Aitken–Neville algorithm. This still involves a significant amount of communication: since the vectors υ^(1), ..., υ^(r) are distributed among the groups, each group leader has to send all components of one of its vectors, say T_{j−1,k}, to the leader of the group which stores T_{j,k} and is responsible for the computation of T_{j,k+1}. This has to be done once in each round and can be implemented by single-transfer operations (MPI_Send(), MPI_Recv()). After all r − 1 rounds, the processor which has computed the result T_{r,r} broadcasts it to all other processors using MPI_Bcast().
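The blockwise data distribution used by the consecutive and the group implementations amounts to simple index bookkeeping. The following helper is a sketch of ours (not the paper's code) that assigns processor p of P one contiguous block of roughly n/P components, giving the first n mod P processors one extra component when n is not divisible by P:

```c
#include <assert.h>

/* Contiguous blockwise distribution of n components among P processors:
 * processor p owns components [*first, *first + *size).  Illustrative
 * helper; the paper does not specify how remainders are distributed. */
void block_range(int n, int P, int p, int *first, int *size) {
    int base = n / P, rem = n % P;   /* first rem processors get one extra */
    *first = p * base + (p < rem ? p : rem);
    *size  = base + (p < rem ? 1 : 0);
}
```

The blocks produced this way are disjoint and cover all n components, which is what the data-parallel i-loops of the implementations rely on.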

3 Implementations Specialized in ODE Systems with Limited Access Distance

In the following, we show how the performance can be improved by exploiting the specific access structure of a large class of ODE systems.

3.1 Reducing Communication Costs

In the case of general ODE systems, where the equations may be coupled arbitrarily, we have to assume that the evaluation of a component function f_i(t, y) requires all components of the argument vector y. General implementations of extrapolation methods therefore have to exchange the current approximation vector μ_λ^(j) by a multi-broadcast operation once per micro-step. Consequently, general implementations cannot be expected to reach a high scalability, since the execution time of multi-broadcast operations increases with the number of processors. There is, however, a large class of ODE systems for which this multi-broadcast operation can be avoided. This class includes ODE systems where the components of the argument vector accessed by each component function f_i lie within a bounded index range near i. Many sparse ODE systems, in particular many ODE systems resulting from the spatial discretization of partial differential equations (PDEs) by the method of lines, can be formulated such that this condition is satisfied. To measure this property of a function f, we use the access distance d(f), which is the smallest value b such that all component functions f_i(t, y) access only the subset {y_{i−b}, ..., y_{i+b}} of the components of their argument vector y. We say the access distance of f is limited if d(f) ≪ n. If the right-hand-side function f has a limited access distance, a processor only needs to exchange data with the two processors responsible for the 2·d(f) components immediately adjacent to its own set of components, i.e., with its two neighbor processors, to meet the dependencies of the function evaluation. (In our implementations, we assume that each processor is responsible for at least 2·d(f) components.) A similar approach has been followed in [5] to devise efficient linearly-implicit extrapolation algorithms.
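Which index ranges a processor must receive from its neighbors follows directly from its block boundaries and d(f). The following sketch (our helper, not the paper's code) computes the two ghost ranges, clipped at the system boundary, for a processor owning the contiguous block [first, first + size):

```c
#include <assert.h>

/* With a limited access distance d, a processor owning the contiguous block
 * [first, first + size) of an n-component system only needs the d components
 * adjacent to each end of its block, i.e. data from at most two neighbours.
 * left range:  [*left_lo,  *left_lo  + *left_n)
 * right range: [*right_lo, *right_lo + *right_n)   (clipped to [0, n)) */
void ghost_ranges(int n, int first, int size, int d,
                  int *left_lo, int *left_n, int *right_lo, int *right_n) {
    *left_lo  = first - d < 0 ? 0 : first - d;
    *left_n   = first - *left_lo;
    *right_lo = first + size;
    *right_n  = (first + size + d > n ? n : first + size + d) - *right_lo;
}
```

For interior processors both ranges have exactly d components; for the first and last processor one range is empty, so the communication pattern degenerates to a single neighbor there.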
A limited access distance not only allows the replacement of the expensive multi-broadcast operation by scalable neighbor-to-neighbor communication; it also makes it possible to partially overlap the data transfer times with computations: while a processor is waiting for the incoming data from the neighboring processors, it can already evaluate those component functions which do not depend on the incoming data. These optimizations can be applied to all three general implementations introduced in Section 2.2.

3.2 Optimization of the Locality Behavior

If the right-hand-side function f has a limited access distance, the working space of the micro-steps can be reduced by a loop interchange of the i-loop and the λ-loop. Instead of processing all components i = 1, ..., n in the innermost loop, we can switch over to a block-based computation order using blocks of size B ≥ d(f)

[Figure: a grid of blocks illustrating the pipelined computation. Row labels, top to bottom: t_κ; t_κ + h_κ^(j); t_κ + 2·h_κ^(j); ...; t_κ + n_j·h_κ^(j). The horizontal axis spans the system dimension n in blocks of size B. The blocks are numbered 1–18 in the order of their computation, anti-diagonal by anti-diagonal.]

Fig. 2. Pipelined computation of the micro-steps

such that the ODE system is subdivided into n_B = n/B contiguous blocks. Given this subdivision, the function evaluation of a block I ∈ {2, ..., n_B − 1} only depends on the blocks I − 1, I and I + 1 of the previous micro-step. This enables a pipeline-like computation of the micro-steps as illustrated in Fig. 2. The figure shows the pipelined computation of υ^(j) for the case n_j = 4. Each box represents a block of B components at a certain micro-step. The time indices of the micro-steps proceed from top to bottom and are displayed at the left of each row, while the system dimension runs from left to right. The top row represents η_κ and is used as input to the first micro-step. The blocks are numbered in the order they are computed. The blocks on the diagonal (15, 16, 17, 18) are to be computed in the current pipelining step. The three blocks above each one of the blocks on this diagonal are needed as input to evaluate the right-hand-side function on the respective block. For example, the blocks with numbers 8, 12, and 16 are needed as input to compute the block with number 17. Numbered white blocks represent final values of the components of υ^(j) at micro-step t_κ + n_j·h_κ^(j) = t_κ + h_κ. All other numbered blocks have been used in previous pipelining steps, but are not needed to compute the current or subsequent pipelining steps. After the blocks on the current diagonal have been computed, all dependencies of the blocks on the next diagonal are satisfied. Thus, we can continue with the computation of one diagonal after the other until all blocks of υ^(j) have been computed. Using this computation order, the working space of one pipelining step contains only 3·n_j + 1 blocks of size B, which is usually small enough to fit in the cache. For very large sequences of Euler steps, as they are common in many PDE solvers where they are used for global time stepping, spanning a pipeline across all time steps would be inefficient.
For such codes, diamond-like tile shapes as proposed in [10] lead to a higher performance. We have implemented pipelining variants of the sequential implementation, the consecutive implementation, the linear group implementation, and the extended group implementation. Since the pipelining computation order requires a limited access distance, as does the optimized communication pattern proposed in Section 3.1, the optimized communication pattern is used in all parallel pipelining implementations. Thus, the amount of data exchanged in the pipelining implementations is equal to that in the implementations from Section 3.1. During the initialization and during the finalization of the pipeline, communication with one of the neighboring processors is required. In order to match these communication needs, pipelines running in opposite directions on processors with neighboring sets of components are used.
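The anti-diagonal traversal of Fig. 2 can be generated mechanically. The following sketch (names are ours; it records the traversal order only and performs no ODE computation) numbers block c of micro-step λ in the order in which the pipelined loop structure computes it:

```c
#include <assert.h>

/* Anti-diagonal ("pipelined") traversal of the blocks of one sequence of
 * nj micro-steps over nB blocks, as in Fig. 2.  ord[(lam-1)*nB + (c-1)]
 * receives the position (1, 2, 3, ...) at which block c of micro-step lam
 * is computed.  Blocks on a diagonal are visited top to bottom, so the
 * dependencies of block (lam, c) on (lam-1, c-1), (lam-1, c), (lam-1, c+1)
 * are always satisfied before it is computed. */
void pipeline_order(int nj, int nB, int *ord) {
    int pos = 0;
    for (int d = 2; d <= nj + nB; ++d)        /* one anti-diagonal per step */
        for (int lam = 1; lam <= nj; ++lam) { /* within a diagonal: top-down */
            int c = d - lam;                  /* column index on diagonal d */
            if (c >= 1 && c <= nB)
                ord[(lam - 1) * nB + (c - 1)] = ++pos;
        }
}
```

For n_j = 4 the produced numbering matches Fig. 2: the first column receives 1, 3, 6, 10, and the diagonal 15, 16, 17, 18 is computed top to bottom.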

4 Experimental Evaluation

The scalability of the parallel implementation variants has been investigated on two supercomputer systems, JUROPA and HLRB 2. HLRB 2 is an SGI Altix 4700 based on Intel Itanium 2 (Montecito) dual-core processors running at 1.6 GHz, which are interconnected by an SGI NUMAlink network, totaling 9 728 CPU cores. The system is divided into 19 partitions with 512 cores each. JUROPA consists of 2 208 compute nodes equipped with two quad-core Intel Xeon X5570 (Nehalem-EP) processors running at 2.93 GHz, resulting in a total number of 17 664 cores. An Infiniband network interconnects the nodes. The sequential implementations have additionally been investigated on a third system equipped with AMD Opteron 8350 (Barcelona) quad-core processors running at 2.0 GHz. The implementations have been developed using C with MPI. As compilers we used GCC 4.1.2 on HLRB 2, GCC 4.3.2 on JUROPA, and GCC 4.4.1 on the Opteron system. The MPI libraries used on the two supercomputer systems were SGI MPT 1.22 on HLRB 2 and MPICH2 1.0.8 on JUROPA. As an example problem we use BRUSS2D [2], a typical example of a two-dimensional PDE discretized by the method of lines. The spatial discretization of this problem on an N × N grid leads to an ODE system of size n = 2N^2. The components of the ODE system can be ordered such that a limited access distance of d(f) = 2N results. We use B = d(f) = 2N as the block size for the specialized implementations. This is the smallest possible choice and keeps the size of the working space at a minimum. On the other hand, this value is also large enough to exploit spatial locality of memory references. In all experiments presented in the following, we compute r = 4 sequences of micro-steps, and we use the harmonic sequence n_j = j to determine the number of micro-steps in each sequence j ∈ {1, ..., 4}. As weight of the diffusion term, we use α = 2·10^−3.

4.1 Sequential Performance and Locality of Memory References

First, we investigate the two sequential implementations, i.e., the general implementation and the specialized pipelining implementation. Since an increase of the system size leads to a larger working space and thus to a different utilization of the cache hierarchy, we normalize the execution time by the number of macro-steps and by the size of the ODE system and plot the resulting normalized execution time against the size of the ODE system (Fig. 3). Consequently, every increase or decrease of the normalized execution time corresponds to an increase or decrease of the average execution time of memory operations and thus to an increase or decrease of the number of cache misses per work unit. The overhead incurred by instructions independent of the system size (such as the stepsize control mechanism) usually is negligible if the ODE system is sufficiently large.
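The limited access distance d(f) = 2N quoted for the BRUSS2D ordering can be checked mechanically. The sketch below assumes a grid-point-wise ordering with two species per point and a five-point diffusion stencil plus a pointwise reaction coupling; these are our illustrative assumptions, as the paper does not spell out the exact ordering:

```c
#include <assert.h>
#include <stdlib.h>

/* BRUSS2D-like ordering (an illustration, not the paper's code): two species
 * s in {0,1} on an N x N grid, ordered grid-point-wise as
 * idx = 2*(x*N + y) + s.  Diffusion couples a component to the same species
 * at the four grid neighbours; the reaction term couples it to the other
 * species at the same point.  All index offsets are then at most 2N. */
static int idx(int s, int x, int y, int N) { return 2 * (x * N + y) + s; }

int access_distance(int N) {
    int d = 0;
    for (int x = 0; x < N; ++x)
        for (int y = 0; y < N; ++y)
            for (int s = 0; s < 2; ++s) {
                int i = idx(s, x, y, N);
                int nx[4] = { x - 1, x + 1, x, x };
                int ny[4] = { y, y, y - 1, y + 1 };
                for (int k = 0; k < 4; ++k) {  /* same species, grid neighbours */
                    if (nx[k] < 0 || nx[k] >= N || ny[k] < 0 || ny[k] >= N)
                        continue;
                    int off = abs(idx(s, nx[k], ny[k], N) - i);
                    if (off > d) d = off;
                }
                int off = abs(idx(1 - s, x, y, N) - i); /* reaction coupling */
                if (off > d) d = off;
            }
    return d;
}
```

The dominant offset is 2N, produced by the same-species neighbor in the adjacent grid row, which is why B = 2N is the smallest admissible block size.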

[Figure: three panels (Opteron 8350, 2 GHz; Itanium 2 Montecito, 1.6 GHz; Xeon X5570, 2.93 GHz), each plotting the runtime per step and component in s against the system size n for the general and the pipelining implementation.]

Fig. 3. Comparison of the sequential implementations

Except for very small system sizes, the normalized execution time is lower for smaller system sizes, and it increases when the system size is increased and the working space of a significant loop thus outgrows the size of one of the cache levels. The largest increase of the normalized execution time can be observed when the size of two registers starts exceeding the size of the highest-level cache. The Opteron processor has the smallest total cache size of the three target platforms considered. The L3 cache has a size of 2 MB, the L2 cache has a size of 512 KB, and the L1 data cache has a size of 64 KB. Up to n ≈ 3.3·10^4, two registers fit in the L2 cache. In this situation, the normalized execution times of the general and the pipelining implementation are nearly equal and reach their minimal value of ≈ 3.7·10^−7 s. If n increases above 3.3·10^4, the normalized execution times of both implementations increase rapidly. For n ≳ 1.6·10^5, two vectors exceed the sum of the sizes of the L2 and the exclusive L3 cache. If n is increased beyond this value, the growth of the normalized execution times slows down to nearly 0. In this situation, the pipelining implementation runs about 10 % faster than the general implementation, because the working space of the general implementation no longer fits in the L3 + L2 cache, while the working space of the pipelining implementation is still small enough. The working space of the pipelining implementation outgrows the cache size for n ≳ 3·10^8. Figure 3 includes normalized execution times for n ≤ 1.28·10^8. While at this point the working space of the pipelining implementation still fits in the cache, for n = 1.28·10^8 the normalized execution time of the pipelining implementation already starts growing and nearly approaches that of the general implementation.

On the Itanium 2 and the Xeon processor, we observe a similar behavior as on the Opteron processor. Though on these two processors the pipelining implementation is faster than the general implementation for all system sizes, the difference is not very large as long as two registers fit in the 256 KB L2 cache, which is the case for n ≲ 1.6·10^4. For larger values of n, the normalized execution times of both implementations grow rapidly on the Itanium 2 processor until n ≈ 5.9·10^5, when two registers no longer fit in the 9 MB L3 cache. Then, the normalized execution times remain almost constant up to the largest system size considered in this experiment. In this situation, where the working space of the pipelining implementation still fits in the cache, the pipelining implementation is nearly 20 % faster than the general implementation. On the Xeon processor
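The cache thresholds quoted above follow from simple arithmetic: two registers of n doubles occupy 16n bytes, so the largest n for which they fit into a cache of a given size is cache_bytes/16. A sketch of this back-of-the-envelope check:

```c
#include <assert.h>

/* Largest ODE dimension n for which two registers of n doubles
 * (2 * 8 bytes per component) fit into a cache of the given size. */
long max_n_for_two_registers(long cache_bytes) {
    return cache_bytes / (2L * (long)sizeof(double));
}
```

This reproduces the values in the text: 512 KB gives n = 32 768 ≈ 3.3·10^4 (Opteron L2), 256 KB gives 16 384 ≈ 1.6·10^4 (Itanium 2/Xeon L2), 9 MB gives 589 824 ≈ 5.9·10^5 (Itanium 2 L3), and 8 MB gives 524 288 ≈ 5.2·10^5 (Xeon L3).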

[Figure: two panels (JUROPA, N = 2000; HLRB 2, N = 2000), each plotting the speedup against the number of MPI processes for the implementations cons gnrl, glin gnrl, gext gnrl, glin comm, gext comm, glin pipe, and gext pipe.]

Fig. 4. Comparison of the general implementations and the specialized group implementations

the normalized execution time of the pipelining implementation increases only marginally when two registers outgrow the L2 cache. The curve of the general implementation, however, shows a step at this point and continues to increase for n ≳ 10^5 until n ≈ 2 · 10^6, because for n ≳ 5.2 · 10^5 the 8 MB L3 cache is too small to store two registers completely, so the registers can only be partially reused by subsequent micro-steps. For values of n in this range, the pipelining implementation is about 8.5 % faster than the general implementation.

4.2  Parallel Speedups and Scalability

To investigate the parallel performance of the implementations, we consider the parallel speedups and the parallel efficiency on HLRB 2 and JUROPA with respect to the fastest sequential implementation on the respective machine. For the problem sizes N = 1000 and N = 2000 considered in this paper, the fastest sequential implementation on both systems is the pipelining implementation. As shown in Fig. 4, the consecutive general implementation (cons gnrl) and the group implementations (glin, gext) do not achieve high scalability. The scalability of all general implementations (gnrl) is limited by the multi-broadcast operation. The specialized group implementations reach slightly higher speedups, but their speedups are also determined by the number of group leaders that participate in the Aitken–Neville algorithm. Consequently, the speedups of the linear group implementations (glin) are higher than those of the extended group implementations (gext), since a larger number of groups is used. Figure 5 shows speedups and efficiencies of the specialized consecutive implementation with unmodified loop structure and of the consecutive pipelining implementation. Both can take advantage of neighbor-to-neighbor communication and obtain high scalability. For large numbers of cores (Fig. 5, left), no significant difference in the parallel performance of the two implementations can be observed, because the working space of both implementations decreases in inverse proportion to the number of cores. Thus, even the working space of the unmodified loop structure fits in the cache if the number of cores used is

Extrapolation Methods for Distributed-Memory Architectures


[Figure 5: speedup for up to N/2 processors (left column) and efficiency for up to 150 processors (right column), plotted against the number of MPI processes for HLRB 2 and JUROPA with the opt. comm. and pipelining variants; top row: N = 1000 (n = 2 · 10^6), speedups up to 500; bottom row: N = 2000 (n = 8 · 10^6), speedups up to 1000; efficiencies range from 0.6 to 1.2.]
Fig. 5. Comparison of the specialized consecutive implementations

sufficiently large. On HLRB 2 a single partition was used for fewer than 512 MPI processes. For larger runs, processes were distributed evenly over two partitions. With this strategy, we observe a decrease in performance when the number of MPI processes per partition comes close to the partition size. For small numbers of cores (Fig. 5, right), the decrease of the working space due to parallelization can produce superlinear speedups, since distributing the data to more processors has a similar effect on the computation time per component as decreasing the system size in a sequential run. On HLRB 2, the computation times per component for the system sizes n = 2 · 10^6 and 8 · 10^6 are significantly larger than for n ≲ 5.9 · 10^5 (cf. Fig. 3). Thus, superlinear speedups occur for n/p ≲ 5.9 · 10^5. On JUROPA no superlinear speedups are observed, because the normalized sequential runtime of the fastest implementation depends only marginally on the number of components to be processed. On both systems, the efficiency of the pipelining implementation is significantly higher than that of the unmodified loop structure for small numbers of cores.

5  Conclusions

We have investigated the scalability and the locality of several implementations of explicit extrapolation methods for ODE IVPs. The implementations exploit pure data parallelism, or mixed task and data parallelism combined with two


different load balancing strategies. Further, we have compared a general implementation suitable for ODE systems consisting of arbitrarily coupled equations with a pipeline-like implementation applicable to right-hand-side functions with limited access distance. We have exploited this limited access distance to replace expensive global communication by neighbor-to-neighbor communication. Runtime experiments on two modern supercomputer systems show that high scalability is possible if expensive global communication can be avoided by exploiting the special structure of the right-hand-side function. A pipeline-like loop structure can significantly improve the performance of sequential and moderately sized parallel runs. For large numbers of processors, the amount of data handled per processor is small and fits in the cache, and the influence of the loop structure on the performance is less significant.

Acknowledgments. We thank the Jülich Supercomputing Centre and the Leibniz Supercomputing Centre Munich for providing access to their systems.


CFD Parallel Simulation Using Getfem++ and Mumps

Michel Fournié (1), Nicolas Renon (2), Yves Renard (3), and Daniel Ruiz (4)

(1) Institut de Mathématiques de Toulouse, CNRS (UMR 5219), Université de Toulouse, France
(2) Centre de Calcul Inter Universitaire de Toulouse (CICT-CALMIP), France
(3) Institut Camille Jordan, CNRS (UMR 5208), INSA Lyon, France
(4) Institut de Recherche en Informatique de Toulouse, CNRS (UMR 5505), Université de Toulouse, France

Abstract. We consider the finite element environment Getfem++, which is a C++ library of generic finite element functionalities and allows for parallel distributed data manipulation and assembly. For the solution of the large sparse linear systems arising from the finite element assembly, we consider the multifrontal massively parallel solver package Mumps, which implements a parallel distributed LU factorization of large sparse matrices. In this work, we present the integration of the Mumps package into Getfem++, which provides a complete and generic parallel distributed chain from the finite element discretization to the solution of the PDE problems. We consider the parallel simulation of the transition to turbulence of a flow around a circular cylinder using the Navier-Stokes equations, where the nonlinear term is semi-implicit and requires that some of the discretized differential operators be updated, with an assembly process, at each time step. Preliminary parallel experiments using this new combination of Getfem++ and Mumps are presented.

1  Introduction

In many applications, the solution of PDE problems with discretization on very fine grid meshes is required, as in the simulation of turbulent fluid flows or of lightning in plasmas, for example. The number of unknowns in such cases can reach levels that make it difficult to hold the data in the memory of a single host, and distributed computing may become intrinsically necessary. In the following, we are interested in the particular case of the simulation of non-stationary turbulent flows, 2D or 3D, using the Navier-Stokes equations, where the nonlinear term is semi-implicit and requires that some of the discretized differential operators be updated with an assembly process at each time step. We therefore exploit the finite element environment Getfem++, which is a C++ library of generic finite element functionalities, that makes it possible to express in a simple

1 http://home.gna.org/Getfem/
2 http://Mumps.enseeiht.fr/

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 77–88, 2010. © Springer-Verlag Berlin Heidelberg 2010


way the finite element discretization of PDE problems, including various types of 2D or 3D finite elements, and that can be used for many PDE applications. Additionally, Getfem++ allows for parallel distributed data manipulation and assembly, and this is one of the key features to address appropriately the type of problems that we are interested in. For the solution, at each time step, of the large sparse linear systems arising from the finite element assembly, we also use the multifrontal massively parallel solver package Mumps, which implements a parallel distributed LU factorization of large sparse matrices. Since Mumps accepts an already distributed sparse linear system as input, it was easy for us to integrate the Mumps package into Getfem++ in order to design a complete and generic parallel distributed chain from the finite element discretization to the solution of the PDE problems. This is developed in more detail in Sections 2 and 3. The main contribution of this work is to propose a coupling between Getfem++ and Mumps in an advantageous way. We introduce the non-stationary turbulent flow problem in Section 4. The results are presented in Section 5, where we try to analyze the potential and limitations for parallelism of this general computational chain made of Getfem++ and Mumps together. We briefly discuss the identified bottlenecks and possible evolutions of this platform for improved performance.

2  The Getfem++ Library

Getfem++ (Generic Toolbox for Finite Element Methods in C++) is a finite element library, freely distributed under the terms of the GNU Lesser General Public License (LGPL). It aims at providing standard tools for the development of finite element codes for PDEs, and in particular:

1. proposing a generic management of meshes: arbitrary dimension, arbitrary geometric transformations,
2. providing generic assembly methods,
3. implementing many advanced methods (mixed methods, mortar elements, hierarchical elements, X-FEM, ...) and simplifying the addition of new methods,
4. providing interpolation methods, computation of norms, mesh operations (including automatic refinement), and boundary conditions,
5. proposing a simple interface under Matlab and Python, thus giving the possibility to use Getfem++ without any knowledge of C++,
6. offering post-processing tools such as the extraction of slices from a mesh.

Getfem++ can be used to build very general finite element codes, where the finite elements, integration methods, and dimension of the meshes are just parameters that can be changed very easily, thus allowing a large spectrum of experimentation in arbitrary dimension, i.e., not only 2D and 3D problems (numerous examples are available in the tests directory of the distribution). Getfem++ has no meshing capabilities (apart from regular meshes), hence it is necessary to import meshes. Import formats currently known by Getfem++


are GiD, Gmsh and emc2 mesh files. However, given a mesh, it is possible to refine it automatically. Getfem++ was awarded the second prize at the "Trophées du libre 2000" in the category of scientific software.

2.1  The Model Description

Getfem++ provides a model description which makes it possible to quickly build finite element applications for complex linear or nonlinear coupled PDE models. The basic idea is to define generic bricks which can be assembled to describe a complex situation. A brick can describe either an equation (Poisson equation, linear elasticity, ...), a boundary condition (Dirichlet, Neumann, ...), or any relation between two variables. Once a brick is written, it is possible to use it in very different ways. This allows the generated code to be reused and increases the capability of the library. A particular effort has been made to ease as much as possible the development of new bricks. A brick is mainly defined by its contribution to the final linear system to be solved. Here, we present some command lines used for a complete Poisson problem. After defining a mesh with boundary DirichletBoundaryNum, finite element methods mfU (Pk-FEM or Qk-FEM, ..., for the unknowns) and mfRhs (for the right-hand sides), and an integration method mim, we can use the following main code.

getfem::model laplacian_model;
// Main unknown of the problem
laplacian_model.add_fem_variable("u", mfU);
// Laplacian term on u
getfem::add_Laplacian_brick(laplacian_model, mim, "u");
// Dirichlet condition
gmm::resize(F, mfRhs.nb_dof());
getfem::interpolation_function(mfRhs, F, sol_u);
laplacian_model.add_initialized_fem_data("DirichletData", mfRhs, F);
getfem::add_Dirichlet_condition_with_multipliers(laplacian_model, mim,
    "u", mfU, DirichletBoundaryNum, "DirichletData");
// Volumic source term and Neumann conditions are defined in the same way
// Solve the linear system
gmm::iteration iter(residual, 1, 40000);
getfem::standard_solve(laplacian_model, iter);
std::vector<double> U(mfU.nb_dof());
gmm::copy(laplacian_model.real_variable("u"), U); // solution in U

The standard_solve command is linked to a solver method, which can be an iterative solver (using iter) or the Mumps library (see Section 3) when the installation is on a parallel machine.

3 http://gid.cimne.upc.es/
4 http://www.geuz.org/gmsh/
5 http://www.trophees-du-libre.org/


2.2  Parallelization under Getfem++

It is not necessary to parallelize all steps in the code; the brick system offers a generic parallelization based on MPI (communication between processes), ParMetis (partitioning of the mesh), and Mumps (parallel sparse direct solver). In this way, each mesh used is implicitly partitioned (using ParMetis) into a number of regions corresponding to the number of processors, and the assembly procedures are parallelized. This means that the matrices stored are all distributed. In the current version of Getfem++, the right-hand-side vectors of the considered problem are replicated on each processor (the sum of the contributions is made at the end of the assembly, and each MPI process holds a copy of the global vector). This aspect introduces some memory limitation but provides a simple generic parallelization with little effort; it can be improved in further developments. Finally, I/O has been parallelized too, in order to prevent any sequential stage (avoiding a significant bottleneck when storing partial results during simulations of large-scale problems).

2.3  Linear Algebra Procedures

The linear algebra library used by Getfem++ is Gmm++, a generic C++ template library for sparse, dense and skyline matrices. It is built as a set of generic algorithms (mult, add, copy, sub-matrices, dense and sparse solvers, ...) for any interfaced vector or matrix type. It can be viewed as a glue library allowing cooperation between several vector and matrix types. However, basic sparse, dense and skyline matrix/vector types are built into Gmm++, hence it can also be used as a standalone linear algebra library. Benchmarks of this library are given on the web site http://grh.mur.at/misc/sparselib_benchmark/. Getfem++ proposes classical iterative solvers (CG, GMRES, ...) and includes its own version of SuperLU 3.0. An interface to Mumps is provided to enable simple parallelization.

3  Mumps Library

Mumps ("MUltifrontal Massively Parallel Solver") is a package for solving systems of linear equations of the form Ax = b, where A is a general square sparse matrix. Mumps is a direct solver based on a multifrontal approach, which performs a factorization A = LU or A = LDL^T depending on the symmetry of the matrix. Mumps exploits both the parallelism arising from sparsity in the matrix A and that of dense factorization kernels. One of the key features of the Mumps package for the coupling with Getfem++ is the possibility to input the sparse matrix in distributed assembled or elemental format. This enables us to keep the already assembled and distributed matrices created under Getfem++ in place on the different processors and to call Mumps for the solution without any data exchange or communication in between.

6 http://glaros.dtc.umn.edu/gkhome/metis/parmetis/


Mumps offers several built-in ordering packages and a tight interface to some external ordering packages. The software is written in Fortran 90, and the available C interface has been used to make the link with Getfem++. The parallel version of Mumps requires MPI [7] for message passing and makes use of the BLAS [4], BLACS, and ScaLAPACK [1] libraries. The system Ax = b is solved in three main steps:

1. Analysis. An ordering based on the symmetrized pattern A + A^T is computed, based on a symbolic factorization. A mapping of the multifrontal computational graph is then computed, and the resulting symbolic information is used by the processors to estimate the memory necessary for factorization and solution.

2. Factorization. The numerical factorization is a sequence of dense factorizations on so-called frontal matrices. Task parallelism is derived from the elimination tree and enables multiple fronts to be processed simultaneously; this is called the multifrontal approach. After the factorization, the factor matrices are kept distributed; they are used in the solution phase.

3. Solution. The right-hand side b is broadcast from the host to the working processors, which compute the solution x using the distributed factors computed during factorization. The solution is then either assembled on the host or kept distributed on the working processors.

Each of these phases can be called separately, and several instances of Mumps can be handled simultaneously. Mumps implements a fully asynchronous approach with dynamic scheduling of the computational tasks. Asynchronous communication is used to enable overlapping between communication and computation. Dynamic scheduling was initially chosen to accommodate numerical pivoting in the factorization. The other important reason for this choice was that, with dynamic scheduling, the algorithm can adapt itself at execution time to remap work and data to more appropriate processors.
In fact, the main features of static and dynamic approaches are combined: the estimates obtained during the analysis are used to map some of the main computational tasks, while the other tasks are dynamically scheduled at execution time. The main data structures (the original matrix and the factors) are similarly partially mapped during the analysis phase.

4  Navier-Stokes Simulation to Study the Transition to Turbulence in the Wake of a Circular Cylinder

The transition to turbulence of the flow around a circular cylinder is studied by two- and three-dimensional numerical simulations of the Navier-Stokes equations. This problem has been extensively investigated and is still a challenge for high Reynolds numbers Re = 1/ν when a rotation of the cylinder is considered. The numerical method used is based on a splitting scheme; Neumann boundary conditions are imposed at the two boundaries in the


Fig. 1. Boundary conditions for the velocity components

spanwise direction, and non-reflecting boundary conditions are specified for the outlet downstream boundary [6]; see Figure 1. The profile of the flow depends on Re. For example, without rotation of the cylinder, for Re = 200, the flow becomes asymmetric and unsteady, and eddies are shed alternately from the upper and lower edges of the disk in a time-periodic fashion. This trail of vortices in the wake, known as the Kármán vortex street, is illustrated in Figure 2. The qualitative results obtained are not presented in this paper.

Fig. 2. Representation of the Navier-Stokes solution. Left: velocity u (norm, streamlines). Right: pressure p (iso-values).

4.1  Time Discretization

We consider the time-dependent Navier-Stokes equations written in terms of the primitive variables, namely the velocity u and the pressure p (ν is the viscosity), on a finite time interval [0, T] and in a domain Ω ⊂ R^2 or R^3:

    ∂u/∂t − νΔu + (u·∇)u + ∇p = 0   on Ω × (0, T),
    div(u) = 0                       on Ω × (0, T),        (1)
    u(t = 0) = u0, plus boundary conditions on u.

We consider the incremental pressure-correction scheme, which is a time-marching technique composed of sub-steps [5,2].


Let Δt > 0 be a time step and set t^k = kΔt for 0 ≤ k ≤ K = T/Δt. We denote by u^n and p^n the approximate solutions for the velocity and the pressure at time t^n. For given values of the velocity u^n and the pressure p^n, we compute an approximate velocity field u* by solving the following momentum equation at t^{n+1} (with the same boundary conditions as u):

    (u* − u^n)/Δt + (u^n·∇)u* = −∇p^n + νΔu*.        (2)

Notice that the incompressibility condition is not satisfied by u*, so we perform a projection step (onto a space of divergence-free functions) by introducing an auxiliary potential function Φ, solution of the Poisson equation (with homogeneous boundary condition)

    ∇·u* = ΔΦ.        (3)

The true velocity is then obtained by

    u^{n+1} − u* = −∇Φ        (4)

and the pressure is given by

    p^{n+1} = p^n + Φ/Δt.        (5)

The nonlinear term in (2) is treated with a semi-implicit discretization, which requires this term to be assembled at each time iteration. As opposed to the initial formulation of the problem, the splitting scheme has the advantage of uncoupling the velocity and the pressure, which reduces the dimension of the matrices appearing in the linear systems to be solved.

4.2  Space Discretization Using Getfem++

To define a weak formulation of the problem, we multiply equations (2) and (3) by a test function v and equation (4) by a test function q, belonging to suitable spaces V and Q, respectively. The spaces are chosen in such a way that the test functions vanish on the part of the boundary where Dirichlet data are prescribed on u, and we assume that p has zero average. Then we perform a Galerkin approximation: we consider two families of finite-dimensional subspaces Vh and Qh of the spaces V and Q, respectively. We refer to [5] for more details on those spaces. We consider piecewise polynomial finite elements of degree 2 for the velocity and degree 1 for the pressure. More precisely, we consider the compatible couple of spaces Q2-Q1 (on a quadrangular mesh), such that the Babuška-Brezzi (inf-sup) condition is satisfied [3].

5  Parallel Experiments

Computations were run on the CALMIP SGI single-system-image supercomputer, an Altix 3700 with 128 Itanium 2 processors (1.5 GHz, 6 MB L3 cache, single-core) and 256 GB RAM (one single address space). The Altix 3700 is a well


known ccNUMA parallel architecture which can run parallel applications efficiently. CALMIP stands for "Computations in Midi-Pyrénées" and was founded in 1994 by 17 research laboratories of Toulouse University to promote, at the regional scale, the use of new parallel computing technologies in the research community. The SGI Altix system is used through a batch scheduler, namely the Portable Batch System (PBSpro). The policy of the scheduler does not take into account spatial locality or affinity of the resources (CPUs). However, we tried, as far as we could, to run jobs during periods of low load of the whole system, in order to help locality between processes, but without any guarantee. Nevertheless, we always used the command "dplace" to avoid undesired migration of MPI processes by the operating system. The installation of Mumps on the SGI Altix system was made with SGI's Message Passing Toolkit v11 for MPI, and with SGI's Scientific Computing Software Library (SCSL) and its routines for distributed shared memory (SDSM) for BLAS, BLACS and ScaLAPACK. We must mention beforehand that Getfem++ is a C++ library, and since C++ codes suffer partly from the performance of the Itanium processor with the current version of the compiler, the timings that we present might be improved. Still, our main purpose is to analyse the potential and limitations of the combination of Getfem++ with Mumps as a complete parallel platform for numerical simulation. The integration of Mumps into Getfem++ naturally exploits the distributed sparse matrix format used in Getfem++. It was only necessary to transform the native compressed sparse column format used in Getfem++ into the general distributed sparse format required by Mumps. This is done independently on each processor with the local data.
As discussed in Section 2, Getfem++ holds so-called "global vectors" that are duplicated and synchronized on each processor as a generic and easy way to share data in parallel. Mumps inputs the right-hand-side vectors from the master processor and may return distributed solutions. However, since these solutions are needed to build the next right-hand sides in the time loop, the solution vectors are gathered onto the master processor. Nevertheless, to follow the requirements for data parallelism in Getfem++, we needed to declare both the right-hand-side and solution vectors as "global vectors". In that way, interfacing Mumps with Getfem++ was straightforward, but the many synchronizations implied by this handling of global vectors may be a bottleneck for efficiency. At any rate, we have included these global data synchronizations in the various timings presented in the following. Further improvements of the Getfem++ parallel platform may be obtained with alternative approaches for generic data exchanges. As presented in Section 4, the Navier-Stokes simulation involves a time loop where the three sparse linear systems arising from (2), (3), (4) have to be solved at each time step; one of these three systems needs (due to the nonlinear term) to be reassembled and factorized systematically. The other two systems need to be factorized only once, at the first iteration in time, and the


three systems need to be analyzed only once, since their sparsity structure remains unchanged. In Figure 3, where timings are shown, we indicate the average time spent in one time loop (excluding the particular case of the first loop, where extra computation is performed). We superimpose the elapsed time spent strictly in the Mumps-specific routines (shown in dark red, bottom) with the global elapsed time per time iteration (in light yellow, top). Speedups are given in terms of elapsed time after synchronization at the boundaries between the Getfem++ code and Mumps. We have run experiments for a 2D simulation with three different mesh sizes: one of 450K nodes, a medium one of 900K nodes, and a larger one of 1.8M nodes (the node counts correspond to the velocity unknowns), and we varied the number of processors from 4 to 64 (the current user limit in the PBS queue management system). The generic time loop involves, as already said, one sparse-matrix FEM assembly for the system associated with equation (2), performed in parallel with the Getfem++ functions, one factorization with Mumps for the same system, plus three solutions for the linear systems associated with equations (2), (3), and (4). Equation (5) involves a simple linear combination of vectors and is performed with a collective reduction operation (MPI_Allreduce). We mention that some marginal extra operations are performed at each time loop to recover the lift and drag coefficients, useful for a qualitative analysis of the properties of the flow simulation. It appears from the timings shown in these figures that the matrix re-assembly benefits directly from parallel computation, since we observe in all cases a substantial CPU-time reduction for the operations excluding the Mumps-specific ones, illustrated by the reduction of the dark red part of the bar charts with increasing numbers of processors.
The Mumps operations, which involve three solutions for one factorization only, benefit much less from parallelism, except when the size of the matrices becomes sufficiently large that the computations are mostly dominated by the factorization of the system arising from (2). This is easily explained by the fact that the granularity of a solution phase in Mumps is much smaller than that of the factorization phase, and the benefit from the different levels of parallelism (sparsity structure plus BLAS kernels) is in general much smaller in the solution

200

3000

500

2500

400

2000

300

1500

200

1000

100

500

150

100

50

0 0

4

8

16

32

0 0

4

8

16

32

0 0

4

8

16

32

Fig. 3. Average timings with respect to the number of processors. Left: 450K nodes - Center: 900K nodes - Right: 1.8M nodes. Black: Mumps part - White: outside Mumps.


phase in general. Additionally, in the case of the largest experiment, the time spent specifically in the Mumps operations becomes dominant with respect to the rest. This observation is favorable for larger simulations (3D cases), where we may indeed expect that the gains in parallelism of the code will be mostly driven by the parallelism in the Mumps factorization. Tables 1, 2 and 3 give details about the computational timings of the different parts of the code, as well as some speed-up information, to estimate the degree of parallelism that can be achieved in each sub-part and in the combination of Mumps plus Getfem++. It is clear that effort has to be devoted in particular to improving the parallelism of the solution phase of Mumps. This is under current investigation, and we shall benefit from future releases of Mumps. The case of 4 processors shows better performance than the others; this might be explained by the architecture of the Altix machine, but we do not have a definitive explanation for it. To get a better understanding of these timings and speed-ups, we detail in Table 4 the contribution of the various sub-parts of the time loop to the total time, including the timing for the assembly, at each time step, of the nonlinear term for the velocity, the extra computations for the construction of the systems themselves (including right-hand sides), the timing for the factorization plus solution with Mumps of the system associated with equation (2), and the timings for the solution only of the systems linked to (3) and (4). The total computational time in Mumps additionally includes the extra operations when chaining the results of one equation to the

Table 1. Timings and Speed Ups for 450K nodes test case

  Number of Processors              4       8      16     32
  Time in Mumps operations        10.7    29.7     7.5    6
  Speed Up in Mumps part           1       0.36    1.43   1.78
  Time spent outside Mumps       172      96      60.3   41.4
  Speed Up in ops. outside Mumps   1       1.79    2.85   4.16
  Total Time per Iteration       182.7   125.7    67.8   47.4
  Global Speed Up per Iteration    1       1.45    2.7    3.85

Table 2. Timings and Speed Ups for 900K nodes test case

  Number of Processors                4      8     16     32
  Time in Mumps operations        195.6  155.3  102.4  113.2
  Speed Up in Mumps part              1   1.26   1.91   1.72
  Time spent outside Mumps        365.3  212.8  129.9     96
  Speed Up in ops. outside Mumps      1   1.72   2.81   3.80
  Total Time per Iteration        560.9  368.1  232.3  209.2
  Global Speed Up per Iteration       1   1.52   2.41   2.68

CFD Parallel Simulation Using Getfem++ and Mumps

Table 3. Timings and Speed Ups for 1.8M nodes test case

  Number of Processors                 4       8      16     32
  Time in Mumps operations        1786.8  1800.1  1400.6    739
  Speed Up in Mumps part               1    0.99    1.28   2.42
  Time spent outside Mumps         732.3   429.5   275.7  185.8
  Speed Up in ops. outside Mumps       1     1.7    2.66   3.94
  Total Time per Iteration        2519.1  2229.6  1676.3  924.8
  Global Speed Up per Iteration        1    1.13     1.5   2.72

other, as indicated in section 4.1. We also give the time for summing and broadcasting the global vectors that correspond to the velocity and pressure vectors, and to the right-hand sides of the three systems involved at each iteration in time. We have extracted the timings for the medium-size test case, on 4 and 16 processors, in order to get information about the relative weights of the different parts in the total computational timings. From these results we observe that the time for the extra computations is one of the main bottlenecks. Future improvements should address that part first, but we still need to investigate in more detail the mechanisms of the data exchanges involved. A second axis for improvement may be a finer integration of Mumps in Getfem++, and in particular the development of specific functionalities to handle either distributed right-hand sides or distributed solution vectors.

We must mention that Mumps did not encounter specific numerical problems with the various systems involved. Indeed, the estimated number of entries for the factors and the actual number of entries after factorization do not differ by more than 1 or 2 percent in all cases. The maximum frontal sizes are 1457 for the systems in equations (2) and (4), and 558 for the system in equation (3); the numbers of delayed pivots are 5760 and 0, respectively (for indication, the numbers of operations during node elimination are 3.135·10^10 and 1.432·10^9, respectively).

Future developments will be concerned, as already mentioned, with the improvement of the parallelization in general and of the data exchanges in particular. This may require new features such as sparse distributed right hand side and

Table 4. Detailed timings for 900K nodes test case on 16 processors

  Number of Processors                             4          16  Speed-Ups
  Assembly of nonlinear term                   284.7        72.2       3.94
  Extra computations (sum and Broadcast)  80.6(0.89)  57.7(1.04)        1.4
  Mumps timings for facto + solve of (2)       182.4        94.7       1.92
  Mumps timings for solution of (3)             0.15       0.092       1.63
  Mumps timings for solution of (4)             1.74        0.87          2
  Time in Mumps operations                     195.6       103.3       1.89
  Time spent outside Mumps                     365.3       129.9       2.81
  Total Time per iteration                     560.9       233.2        2.4


solution vectors in Mumps. Besides that, we shall improve the memory allocation in Getfem++ with an approach similar to that for the sparse distributed vectors in Mumps. Finally, we shall experiment with much larger 3D fluid dynamics test cases and with the larger numbers of processors available on the new massively parallel machine in Toulouse (which will replace the Altix machine). We hope that these future developments will help to ensure good scalability for challenging problems.


Aggregation AMG for Distributed Systems Suffering from Large Message Numbers

Maximilian Emans

AVL List GmbH, Hans-List-Platz 1, 8020 Graz, Austria
[email protected]

Abstract. In iterative solution procedures for problems in science and engineering, AMG methods with inexpensive computation of the hierarchy of coarse-grid operators are a good choice for solving systems of linear equations when accurate solutions of these systems are not needed. In this contribution we demonstrate that the parallel performance of this kind of algorithm is significantly improved if it is applied in combination with the Smoothed Aggregation approach, since this reduces the number of communication events. The resulting hybrid algorithms are particularly beneficial on systems where the number of messages limits the performance.

1 Introduction

If linear equation solvers are applied within iterative schemes to obtain solutions of non-linear coupled differential equations, it is often sufficient to reduce the residual by only a few (e.g. two) orders of magnitude. Such situations are not unusual in some important applications of numerical software, e.g. in fluid dynamics, where a solution of the Navier-Stokes equations is computed by algorithms like SIMPLE ("semi-implicit method for pressure-linked equations"), see Ferziger and Perić [6]. Our examples are taken from this field of application. Among the most efficient solvers for this kind of problem are aggregation AMG methods with simple setup, i.e. with small aggregates and constant interpolation schemes.

In production runs of such algorithms it is important that problems of different sizes can be solved efficiently on hardware that is available to scientists and engineers. Since supercomputers are not accessible to the vast majority of users of such software, the affordable alternative is clusters. Typically these machines consist of a certain number of nodes with one or more chips that comprise in turn one or more processing units (cores). While the cores of one node have access to a common memory space, the individual nodes are linked by a network interconnect. Although modern interconnects provide enormous bandwidth and remarkably low latency for individual point-to-point communication events, the number and the timing of the communication events of typical AMG implementations lead in certain cases to a degradation of the parallel performance of these algorithms.

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 89–100, 2010. © Springer-Verlag Berlin Heidelberg 2010


In this contribution we analyse the parallel performance of common AMG algorithms on a Linux-cluster and examine the reasons for degraded parallel performance. We show that the performance degradation of a simple AMG method with small aggregates can be easily overcome if it is combined with the Smoothed Aggregation approach in a suitable manner.

2 Algorithm and Implementation

Suppose the task is to solve the linear system

Ax = b   (1)

where A ∈ R^(n×n) is a symmetric positive definite system matrix, b ∈ R^n is a right-hand-side vector and x ∈ R^n the solution that is approached iteratively. A represents the discretisation of a partial differential equation; the number of elements per row depends on the discretisation, but we shall assume that it is small compared to the number of unknowns n; in other words, A is sparse. The fundamental AMG algorithm is not repeated here; for this we refer to the literature, e.g. to the appendix of Trottenberg et al. [9]. Typically, the AMG algorithm comprises two phases: in the first phase (setup) the grid hierarchy is constructed and the operators are defined; for both, various methods have been described. In the second phase (solution) the linear system is solved; here, the algorithms mainly differ in the choice of the smoother and in the cycling strategy.

2.1 Parallel Implementation of AMG

With R ∈ R^(nc×n) denoting the restriction operator that is constructed during the coarsening process (described for the different methods below), the Galerkin approach to compute the matrix of the coarse-grid system of size nc, A_C ∈ R^(nc×nc), is

A_C = R A R^T.   (2)

We follow this Galerkin approach. On the level of the discretisation, the problem is divided into disjoint subdomains and distributed to the available processes. The corresponding matrices are available to the solver of the systems of linear equations, which uses the same distributed-memory approach. Since it is important to keep the communication cost of the setup phase as low as possible for the type of application we are concerned with, we allow only local interpolation in all our algorithms, see Emans [1]. This is a common method to reduce the communication required for the explicit computation of A_C according to eqn. (2). Deteriorated convergence in parallel cases is the price that may be paid for this practice.

To obtain a smoother, we follow the usual practice and employ a fully parallel Jacobi scheme on the boundaries between two domains and a lexicographic Gauß-Seidel scheme for the interior points. This technique is referred to as the hybrid Gauß-Seidel smoother, see Yang and Henson [4].
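In a serial setting, eqn. (2) is just a sparse triple matrix product. A minimal sketch (scipy stands in for the paper's hand-written CRS routines; the toy matrices are ours):

```python
import numpy as np
from scipy import sparse

def galerkin_coarse_operator(A, R):
    """Coarse-grid operator of eqn. (2): A_C = R A R^T.

    A: (n, n) fine-grid operator in CSR (the paper's CRS) format
    R: (nc, n) restriction operator produced by the coarsening
    """
    return (R @ A @ R.T).tocsr()

# Toy fine-grid operator: 1D Laplacian on 4 points; constant restriction
# over the pairwise aggregates {0, 1} and {2, 3}.
A = sparse.csr_matrix(np.array([[ 2., -1.,  0.,  0.],
                                [-1.,  2., -1.,  0.],
                                [ 0., -1.,  2., -1.],
                                [ 0.,  0., -1.,  2.]]))
R = sparse.csr_matrix(np.array([[1., 1., 0., 0.],
                                [0., 0., 1., 1.]]))
A_C = galerkin_coarse_operator(A, R)
print(A_C.toarray())   # [[ 2. -1.] [-1.  2.]]
```

Note that the coarse operator is again a (smaller) sparse Laplacian-like matrix, which is what makes recursive application of the same coarsening possible.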


In Emans [1] it was shown that the setup of C–F-splitting based AMG methods is usually more expensive than that of aggregation AMG. Therefore we concentrate our interest on aggregation-based AMG methods. These methods divide the set of fine-grid points into a number of disjoint subsets that are called aggregates. The aggregates become the coarse-grid entities.

We use a uniform interface between the rest of the code and the linear solver, similar to the one described by Falgout and Meier-Yang [5]. The data is stored in CRS (compressed row storage) format and all linear algebra operations are implemented uniformly without the use of external libraries. The choice of the CRS format for the matrix essentially determines the implementation of all matrix-vector operations in the sequential case. The data of the domain boundaries is exchanged in point-to-point communication events through the asynchronous (MPI: "immediate send") transfer protocol. The communication during the setup comprises two stages: first the numbers of operator elements to be communicated are exchanged, then the exchange of data is started. Simultaneously, the local parts of the computation of the coarse-grid operator are performed; once this and the exchange have been completed, the boundaries are treated. The scheme is illustrated in figure 1.

The same principle is applied in the solution phase: while the exchange of the information at the inter-domain boundaries takes place, local work (smoothing or computation of matrix-vector products) is done; the computation of the boundaries starts as soon as the local computation has been finished and the information from the neighbouring domains is available. Apart from this type of communication, some collective communication events for reduction operations occur, e.g. for the computation of the scalar products of the conjugate gradients and of norms, or for the control of the setup phase. No blocking communication takes place.

Fig. 1. Pattern of inter-domain data exchange: a) preparation of data exchange: local boundary data is copied to a coherent array mapping the buffer cells of the neighbour; b) data exchange, i.e. update of buffer cells, and local computations take place simultaneously; c) local computations using information from neighbouring domains (buffer cells) are done
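The overlap pattern of Fig. 1 can be mimicked without MPI: below, two threads and a pair of queues stand in for two processes and the MPI_Isend/MPI_Waitall calls. This is a sketch of the overlap idea only, not the solver's actual communication code:

```python
import threading
import queue

def exchange_and_compute(send_q, recv_q, local, boundary_idx):
    """One subdomain's step, following the pattern of Fig. 1:
    a) copy local boundary data into one coherent send buffer,
    b) start the (non-blocking) exchange and overlap it with local work,
    c) compute on the boundary once the neighbour's buffer cells arrived."""
    send_buf = [local[i] for i in boundary_idx]   # a) coherent array
    send_q.put(send_buf)                          # b) "MPI_Isend": returns at once
    interior = sum(local)                         #    local work overlaps transfer
    ghost = recv_q.get()                          #    synchronisation ("MPI_Waitall")
    return interior + sum(ghost)                  # c) boundary computation

# Two "processes" linked by a pair of queues standing in for the interconnect.
q01, q10 = queue.Queue(), queue.Queue()
results = {}
t0 = threading.Thread(target=lambda: results.__setitem__(
    0, exchange_and_compute(q01, q10, [1.0, 2.0, 3.0], [2])))
t1 = threading.Thread(target=lambda: results.__setitem__(
    1, exchange_and_compute(q10, q01, [4.0, 5.0], [0])))
t0.start(); t1.start(); t0.join(); t1.join()
print(results)   # {0: 10.0, 1: 12.0}
```

Because each side posts its send before waiting for the receive, the pattern is deadlock-free, which is exactly the property the non-blocking MPI calls provide.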

2.2 Pairwise Aggregation AMG

The most natural application of algebraic (and also geometric) multigrid is its use as a "stand-alone" solver, i.e. the multigrid algorithm acts as a solver and is not used as a preconditioner of a Krylov method. Good performance of AMG methods requires the quality of the interpolation to be satisfactory and the computation of the coarse-grid operator to be very


efficient, in particular if the requirement on the accuracy of the solution is modest. The second requirement precludes the application of expensive interpolation methods, so that constant interpolation will be the method of choice. The first requirement can then only be met if the number of fine-grid points per aggregate is very low, e.g. 2 or 4.

For the coarse-grid selection of this algorithm we use the pairwise aggregation algorithm described by Notay [8]. It produces aggregates comprising not more than two fine-grid nodes. When this pairwise aggregation is applied within a V-cycle scheme without Krylov acceleration, the convergence is quite poor since the low frequencies of the error are not sufficiently reduced. Using an F-cycle, i.e. recursively following a W-cycle by a V-cycle, see Trottenberg et al. [9], leads to a method with reasonable convergence properties without Krylov acceleration. Only two hybrid Gauß-Seidel sweeps are necessary after the return to the finer grid, i.e. there is no pre-smoothing. Moreover, in parallel computations, an agglomeration strategy is employed to treat the systems assigned to the coarsest grids, see Trottenberg et al. [9]: all the information of grids with fewer than 200 nodes is passed to one of the neighbours until only one grid remains. The system of equations of this coarsest grid is then solved directly by Gaussian elimination by a single process while the other processes idle. It was shown e.g. in Emans [2] that this simple AMG method performs best for the kind of systems of equations we are concerned with in this article. It is referred to as amggs2 in this contribution.

2.3 Smoothed Aggregation AMG

This type of algorithm groups significantly more fine-grid nodes into the aggregates. This results in more rapid coarsening compared to algorithm amggs2, i.e. fewer grids are constructed. The corresponding procedure relies on a distinction between strong and weak (mutual) influence between two neighbouring nodes. Strong influence of i on j (and vice versa) is given if

|a_ij| ≥ ε · √(a_ii · a_jj),   (3)

where ε is defined as

ε := 0.08 · 0.25^(l−1),   (4)

see Vaněk et al. [11], and l counts the grid levels (l = 1: finest grid). Formally, a tentative prolongation (or interpolation) operator with constant interpolation is constructed. Since it would result in a rather poor representation of the fine-grid problem on the coarse grid, it is smoothed by applying one Jacobi smoothing step to it. The implemented serial algorithm has been described by Vaněk et al. [11].

In our parallel implementation the aggregation and the smoothing are strictly local processes, i.e. aggregates are not intersected by inter-domain boundaries and smoothing is done only between the interior points of each process. The same agglomeration strategy as for algorithm amggs2 is employed to treat the systems of the coarsest grids. More details of the parallel implementation can be found in [2].
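Eqns. (3) and (4) together with the prolongation smoothing can be sketched as follows; the damping factor omega = 2/3 is our assumption, since the paper only states that one Jacobi step is applied:

```python
import numpy as np
from scipy import sparse

def eps_on_level(l):
    """Level-dependent threshold of eqn. (4); l = 1 is the finest grid."""
    return 0.08 * 0.25 ** (l - 1)

def is_strong(A, i, j, l):
    """Strength criterion of eqn. (3): |a_ij| >= eps * sqrt(a_ii * a_jj)."""
    return abs(A[i, j]) >= eps_on_level(l) * np.sqrt(A[i, i] * A[j, j])

def smooth_prolongation(A, P_tent, omega=2.0 / 3.0):
    """One Jacobi step applied to the tentative (constant) prolongation:
    P = (I - omega * D^-1 A) P_tent; omega = 2/3 is an assumption."""
    D_inv = sparse.diags(1.0 / A.diagonal())
    n = A.shape[0]
    return ((sparse.identity(n) - omega * D_inv @ A) @ P_tent).tocsr()

# 1D Laplacian on 4 points; constant interpolation over aggregates {0,1}, {2,3}.
A = sparse.csr_matrix(np.array([[ 2., -1.,  0.,  0.],
                                [-1.,  2., -1.,  0.],
                                [ 0., -1.,  2., -1.],
                                [ 0.,  0., -1.,  2.]]))
P_tent = sparse.csr_matrix(np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]]))
print(is_strong(A, 0, 1, l=1))   # True: |-1| >= 0.08 * 2
P = smooth_prolongation(A, P_tent)
print(P.toarray())
```

After smoothing, the interpolation is no longer piecewise constant: each fine point also draws a fraction of the neighbouring aggregate's value, which is what improves the coarse-grid representation.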


This AMG algorithm is used as a preconditioner of a conjugate gradient method since its convergence as a "stand-alone" solver is still quite poor. The multigrid cycle is a V-cycle with two pre-smoothing and two post-smoothing sweeps. The algorithm is referred to as ams1cg.

2.4 Combination of Pairwise Aggregation AMG and Smoothed Aggregation

It will be demonstrated in section 3 that the parallel performance of amggs2 is impaired mainly by the high cost per iteration in the solution phase. This cost is due to a very large number of communication events, since smoothing is done on each level and the number of levels is comparatively large. Since this is not the case for algorithm ams1cg, we suggest switching to this algorithm on the coarsest grids. The cycling strategy is the same as for algorithm amggs2. We refer to this algorithm as amggs3. Algorithm ams1cg is invoked as soon as the total number of grid points drops below 10000; for an explanation of this figure see section 3.2.
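The switching logic just described can be sketched in a few lines; the function names and the toy coarsening rates are ours, not the paper's (pairwise aggregation roughly halves the grid, Smoothed Aggregation coarsens much faster), and the agglomeration onto one process below 200 nodes happens elsewhere in the solver:

```python
def build_hierarchy(n_fine, pairwise_coarsen, sa_coarsen,
                    switch_threshold=10_000, agglomerate_below=200):
    """Sketch of the amggs3 setup logic: pairwise aggregation (amggs2-style)
    on the fine levels, switch to Smoothed Aggregation (ams1cg-style) once
    the total number of grid points drops below the threshold."""
    levels, n, use_sa = [("fine", n_fine)], n_fine, False
    while n > agglomerate_below:
        use_sa = use_sa or n < switch_threshold
        n = sa_coarsen(n) if use_sa else pairwise_coarsen(n)
        levels.append(("smoothed-aggregation" if use_sa else "pairwise", n))
    return levels

# Toy coarsening rates: pairs of nodes vs. aggregates of roughly 8 nodes.
levels = build_hierarchy(240_000, lambda n: n // 2, lambda n: n // 8)
print(levels)
```

With these illustrative rates, a 240000-point problem gets five pairwise levels before the hierarchy switches to two Smoothed Aggregation levels, i.e. the many small coarse grids of amggs2 are replaced by a few larger steps.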

3 Benchmarks on Computations of Flows in an Engine

3.1 Background and Details of the Simulations

Our benchmark cases are two short periods of an unsteady three-dimensional simulation of a full cycle of a four-cylinder gasoline engine. The fluid dynamics is modelled by the Navier-Stokes equations amended by a standard k-ε turbulence model; the thermal and thermodynamic effects are considered through the solution of the energy equation for enthalpy and the computation of the material properties using the coefficients of air. The conservation laws are discretised by means of finite volumes on an unstructured mesh. The variable arrangement is collocated, and the algorithm managing the coupling of the variables and the non-linearity of the system of partial differential equations is SIMPLE; details of the algorithm are found in Ferziger and Perić [6].

The simulation comprises the gas flow and the combustion in one cylinder. The stroke of the cylinder is 81.4 mm, the bore is 79.0 mm, yielding a (maximum) volume of 0.4 l (per cylinder). Each benchmark case consists of a few time steps for which SIMPLE iterates until convergence is reached. The computational domain is subject to change in time: it contains the interior of the cylinder and, at times, the parts of the ducts through which the air is sucked into the cylinder or expelled from it. The piston surface (bottom line in the slices through the computational meshes in figure 2) is a moving boundary. A simulation of a full engine cycle comprises the simulation of the (compressible) flow of cold air into the cylinder while the piston is moving downward, the subsequent compression after the valves are closed, the combustion of the explosive mixture, and the discharge of the hot gas while the piston moves upward. The model fuel that is burnt is octane; an eddy break-up model is used to simulate the combustion process. The data describing the engine cycle is shown in figure 3.


Fig. 2. Slices through the three-dimensional meshes of the partial problems, 0000 load (left), 0972 combustion (right)

Fig. 3. Scheme of the engine cycle and notation for considered partial problems (plot "CTEC-GASOLINE - Pressure and Temperature": pressure / kPa, temperature / K, and piston position over crank angle / deg; marked events: valves closed, ignition, valves opened; considered strokes: 0000, 0972)

Our cases are taken from the strokes load (0000) and combustion (0972). Even though the physical processes characterising these strokes are quite different, the number that has the most influence on the behaviour of the solver is the number of unknowns. Case 0000 represents a large problem, case 0972 a small one. The computationally relevant information about the cases is compiled in table 1, which contains e.g. the time step dt, the number of time steps nt, and the number of linear systems to be solved nsy.

Table 1. Benchmark cases (hex/tet: hexahedral/tetrahedral)

                        0000              0972
  stroke                load              combustion
  crank angle           746°              360°
  problem size          111.0MB           18.6MB
  dt                    6.06·10^-6        3.03·10^-5
  nt                    5                 20
  boundaries            mass flow, wall   wall
  nsy                   130               378
  mesh: hex/tet/other   80.0%/1.2%/18.8%  77.7%/1.5%/21.8%

3.2 Results of the Benchmarks

For the measurements we used up to four nodes with two quad-core chips each (i.e. 8 cores per node) of a Linux cluster equipped with Intel Xeon CPU X5365 (3.00GHz, main memory 16GB, L1-cache 32kB per core, L2-cache 4MB shared between two cores), connected by an Infiniband interconnect (by Mellanox) with an effective bandwidth of around 750 Mbit/s and a latency of around 3.3 μs. The computational part of the program is compiled with the Intel FORTRAN compiler 10.1; the communication is performed through calls to HP-MPI subroutines (C binding). The benchmarks were run within the environment of the software AVL FIRE(R) 2009 with 1, 2, 4, 8, and 16 processes, where the domain decomposition was performed once for each case by the graph partitioning algorithm METIS, see Karypis and Kumar [10]. Computations with 1, 2, and 4 processes were done on a single node; for 8 and 16 processes we used 2 and 4 nodes, respectively. The processes are mapped to the available cores in such a way that each process has sole access to one 4MB L2-cache. In preliminary experiments it had been found that the L2-cache and the data transfer from the memory to it are the bottleneck for this kind of computation, see Emans and van der Meer [3]. Although distributing two or four processes to two or four nodes would increase the performance, we used a single node for these computations since in practical applications the gain in performance usually does not justify the occupation of the additional nodes. It is important to note that the MPI implementation uses the shared memory space of one node for the communication between two processes wherever possible. This means that all intra-node communication is done without utilisation of the network interconnect.

Besides the algorithms ams1cg, amggs2, and amggs3 we present the performance of a reference solver, ichpcg. It is a parallel incomplete Cholesky factorisation preconditioned conjugate gradient method.
In the benchmark runs the algorithms are used as solvers of the pressure-correction equations of the SIMPLE algorithm. The iteration of the linear solver was terminated as soon as the 1-norm of the residual had been reduced by a factor of 20. The measured computing time refers to the solution of these systems only. The computing time of the setup and the computing time of the solution phase are shown in figure 4. Furthermore, the operator complexity

c = ( Σ_l number of matrix elements on level l ) / ( number of matrix elements on level 1 ),   (5)

is relevant since it is a measure of the memory requirement. It is shown together with the cumulative iteration count and the number of AMG levels in table 2. From the measured times the parallel efficiency E_p is computed as

E_p = t_1 / (p · t_p),   (6)

where t_p denotes the computing time with p processes.

The comparison between the total computing times of amggs2 and ams1cg shows that amggs2 is the faster AMG solver for the systems of equations that

Fig. 4. Computing times (left) and parallel efficiency for solution and setup phase (right), benchmarks on cluster comp1; top: Problem 012-0000 (1.4 mio cells, load), bottom: Problem 012-0972 (240000 cells, combustion)

Table 2. Number of iterations, operator complexity c, and number of levels nl of the computations with the linear AMG solvers

                      ams1cg                               amggs2
  Case     1    2    4    8   16     c  nl      1    2    4    8   16     c  nl
  0000   416  415  446  469  484  1.51   3    575  575  573  573  573  2.55  14
  0972  1128 1128 1125 1125 1134  1.50   3   1149 1154 1151 1152 1153  2.60  11
have to be solved in this kind of simulation, as long as the problem is large enough or distributed to not more than 4 processes. The increased number of iterations of parallel ams1cg is a consequence of the simplified parallelisation, see Emans [1]. For parallel runs, communication overhead, competition for hardware resources between processes on the same node, and degraded convergence properties are responsible for a reduction of the parallel efficiency. But a clear distinction between the retarding factors is necessary. Computations with 1, 2, or 4 parallel processes are done on one node, i.e. the communication is done through a shared-memory mechanism that is automatically

invoked by the MPI implementation used. As described in Emans [3], the parallel efficiency of this type of computation is mainly affected by the competition for access to the main memory. It is hard to bypass this bottleneck, which leads to a significant reduction of the parallel efficiency. If the problem is large enough, more precisely if the number of unknowns per process is large enough, the parallel efficiency increases again for runs with 8 or 16 processes (where additionally the network interconnect is utilised), since the probability of L2-cache misses is reduced and less data needs to be transferred from the main memory to the cache. This is valid as long as the inter-node communication is not significantly slower than the typical internal tasks, as e.g. in case 0000 for all algorithms. Since ichpcg requires very little communication and can take advantage of the reduced probability of L2-cache misses in parallel runs, it shows a very high parallel efficiency and excellent performance for the small problem 0972.

Although the Infiniband interconnect has an excellent latency (the values given above are obtained by the "pingpong" benchmark, see Hockney [7], i.e. isolated communication events are considered), it is known that certain types of Infiniband interconnects show much worse behaviour if loaded by many communication events taking place at approximately the same time. This is exactly what happens here when the asynchronous MPI communication between the neighbouring domains is invoked. The algorithm amggs2 requires much more communication than ams1cg since the number of multigrid levels is larger, see table 2, and the levels are visited more frequently due to the F-cycle. The consequence is that the parallel efficiency of amggs2 with 16 processes for case 0972 is lower than that of the 8-process run, while for ams1cg it is higher.

In case 0000 the internal tasks are large enough that the communication does not result in waiting time; therefore the described effect is not seen in this case. The communication statistics for case 0972 are shown in table 3. The meaning of tex needs some explanation: tex summarises the time needed for the initialisation of the asynchronous communication (e.g. the call of "MPI Isend") and the time spent waiting for the termination of the data transfer at the synchronisation points (e.g. "MPI Waitall"). Since transfer operations with different message sizes are done at a time, the waiting times at synchronisation points are assigned to one of the message sizes in table 3 according to the average message size of all transfers that are synchronised at this point. Note that no proportionality between nby and tex can be expected, since tex only measures the overhead. From the data in table 3 it can be seen that a large part of the communication events of algorithm amggs2 are small messages, while small messages make up only a small part of the communication events of algorithm ams1cg. The reason is obviously the different number of levels. Since the communication events of amggs2 are required by the necessarily sequential smoothing steps on the different grids, it is not possible to combine individual communication events in a way that a smaller number of larger data packages is exchanged.

Merging both algorithms in a way that amggs2 is used for the fine levels and ams1cg for the coarse levels will render the communication pattern significantly

Table 3. Communication statistics for case 0972 of the computations with the linear AMG solvers with 16 processes (nex: number of exchange operations, nby: sum of exchanged information, tex: impact on computing time, i.e. initialisation of asynchronous data transfer and waiting time at synchronisation point; p2p: non-blocking point-to-point communication; collective: reduction operations)

                                   ams1cg                         amggs2
  Message [bytes]           nex       nby     tex        nex       nby      tex
  p2p [0 ... 256]       1105408    27.7MB   2.39s    4989571   389.5MB   11.92s
  p2p [257 ... 1024]     415243   223.1MB   0.68s    2651019  1302.9MB    6.41s
  p2p [1025 ... 4096]    419162   818.0MB   0.29s     605579  1056.5MB    0.40s
  p2p [4097 ... 16384]    30360   199.9MB   0.02s       6493    70.4MB    0.00s
  p2p [16385 ... 65536]    1749    40.1MB   0.07s       5027   115.3MB    0.21s
  p2p [65537 ... ∞]           0     0.0MB   0.00s          0     0.0MB    0.00s
  Σ p2p                 1971922  1308.8MB   3.45s    8257689  2934.6MB   18.94s
  collective, 4-byte      27236    198944   4.08s      70157    280628    3.92s
  collective, 8-byte       3776     30208   0.71s       1153      9224    0.53s

more suitable for the cluster hardware used. In order to choose an appropriate threshold for the switch to ams1cg, one observes how the exchanged amount of data in table 3 depends on the message size: for ams1cg the message sizes below 1kB entail less than 20% of the exchanged amount of data, whereas for amggs2 this percentage is more than 55%. The threshold should be chosen such that the number of messages smaller than 1kB is reduced. The threshold might depend on the problem; for our decision we made the following rough estimate: we assumed that, e.g. for amggs2 with 16 processes, on the coarser grids every fifth unknown is exchanged; a 1kB message size (128 double-precision variables) then corresponds to a grid size of 10240 points; below this threshold the data transfer is efficiently reduced through the use of ams1cg instead of amggs2.

The comparison between the performance and parallel efficiency of amggs2 and the new algorithm, amggs3, in figure 4 shows that this modification does not destroy the favourable properties of amggs2 in case 0000 and in the computations with up to 8 processes in case 0972, while it enhances the performance significantly where amggs2 was not competitive in case 0972. Although ams1cg, amggs2, or ichpcg perform equally well in particular situations, the computing times of amggs3 are the fastest or at least very competitive in all cases; in practice this is an important property of a solver in a typical "black-box" application. The algorithm-related numbers are found in table 4, the communication statistics for the 16-process run of case 0972 in table 5.

Since the findings depend strongly on the hardware the algorithm runs on, we finally present the parallel efficiency of computations of the critical problem 0972 on other clusters that are characterised by a comparatively large latency of the interconnect. The curves are found in figure 5.
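The rough estimate behind the 10000-point threshold can be reproduced in a few lines; note that 128 double-precision values occupy 1 kB, and the every-fifth-unknown exchange fraction is the paper's stated assumption:

```python
# 1 kB per message = 128 double-precision values (8 bytes each)
doubles_per_message = 1024 // 8
# assumption from the text: on coarse grids every fifth unknown is exchanged
unknowns_per_process = doubles_per_message * 5
processes = 16
threshold = unknowns_per_process * processes
print(doubles_per_message, threshold)   # 128 10240
```

The resulting 10240 points is then rounded to the 10000-point switch criterion used by amggs3.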
Both clusters use a Gigabit Ethernet interconnect with a measured effective latency of 60 μs and a bandwidth of 120 Mbit/s. The first cluster, comp2, has the same nodes as the cluster used for all previous computations, i.e. it has two Intel Xeon

Table 4. Operator complexities, total number of levels and number of coarse levels generated with Smoothed Aggregation (in brackets) for amggs3 Case

1

2

4

8

16

c

nl

0000

572 575 575 579 590 2.55 10 (2)

0972

1156 1176 1194 1269 1309 2.37 5 (2)

Table 5. Communication statistics for case 0972 of the computations with algorithm amggs3 with 16 processes (notation as in table 3)

Message [bytes]          nex       nby        tex
p2p [0 … 256]            2303163   119.2MB    4.66s
p2p [257 … 1024]         1427591   812.6MB    3.12s
p2p [1025 … 4096]        689385    1221.3MB   0.44s
p2p [4097 … 16384]       4823      45.9MB     0.00s
p2p [16385 … 65536]      4461      120.3MB    0.12s
p2p [65537 … ∞]          0         0.0MB      0.00s
p2p (total)              4429423   2319.4MB   8.36s
collective, 4-byte       40475     161900     3.57s
collective, 8-byte       1209      9672       0.61s

[Figure: Problem 012-0972 (240000 cells, combustion). Two panels, parallel efficiency in % (40–100) over the number of processes (1–16), for clusters comp2 and comp3; curves: amggs2 setup, amggs2 solution, amggs3 setup, amggs3 solution.]

Fig. 5. Problem 0972: parallel efficiency for solution and setup phase for amggs2 and amggs3, benchmarks on clusters comp2 and comp3

5365 processors per node. The second additional cluster, comp3, has a single dual-core Intel Xeon 3.2 GHz with 2MB L2-cache and 16kB L1-cache per core. The curves in figure 5 show that the obtained results are not unique to a specific hardware configuration.

100

M. Emans

4

Conclusions

AMG methods with a particularly simple setup and moderately aggressive coarsening are vulnerable to the restrictions that the hardware of modern Linux clusters often imposes on parallel computations of the described type. As a remedy to this problem we suggest limiting the number of coarse grids by switching to the Smoothed Aggregation algorithm for the coarsest grids. In this way the favourable properties of simple AMG implementations using, e.g., pairwise aggregation schemes are largely preserved, while these modifications lead to significantly improved parallel performance.

References
1. Emans, M.: Performance of Parallel AMG-preconditioners in CFD-codes for weakly compressible flows. Parallel Computing 36, 326–338 (2010)
2. Emans, M.: AMG for linear systems in engine flow simulations. In: Karczewski, K. (ed.) PPAM 2009, Part II. LNCS, vol. 6068, pp. 350–359. Springer, Heidelberg (2010)
3. Emans, M., van der Meer, A.: Mixed Precision AMG as Linear Equation Solver for Definite Systems. In: Sloot, P.M.A., van Albada, G.D., Dongarra, J.J. (eds.) ICCS 2010. Procedia Computer Science, vol. 1, pp. 175–183. Elsevier Academic Press, Amsterdam (2010)
4. Henson, E., Yang, U.M.: BoomerAMG: a Parallel Algebraic Multigrid Solver and Preconditioner. Appl. Num. Math. 41, 155–177 (2001)
5. Falgout, R.D., Yang, U.M.: hypre: a Library of High Performance Preconditioners. In: Sloot, P.M.A., Tan, C.J.K., Dongarra, J.J., Hoekstra, A.G. (eds.) ICCS-ComputSci 2002, Part III. LNCS, vol. 2331, pp. 632–641. Springer, Heidelberg (2002)
6. Ferziger, J.H., Perić, M.: Computational Methods for Fluid Dynamics. Springer, Heidelberg (1996)
7. Hockney, R.: Performance parameters and benchmarking of supercomputers. Parallel Computing 17, 1111–1130 (1991)
8. Notay, Y.: An aggregation-based algebraic multigrid method. Electr. Trans. Num. Anal. 37, 123–146 (2010)
9. Trottenberg, U., Oosterlee, C., Schüller, A.: MULTIGRID. Elsevier Academic Press, Amsterdam (2001)
10. Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comp. 20, 359–392 (1998)
11. Vaněk, P., Brezina, M., Mandel, J.: Algebraic Multigrid by Smoothed Aggregation for Second and Fourth Order Elliptic Problems. Computing 56, 179–196 (1996)

A Parallel Implementation of the Jacobi-Davidson Eigensolver and Its Application in a Plasma Turbulence Code

Eloy Romero and Jose E. Roman

Instituto ITACA, Universidad Politécnica de Valencia, Camino de Vera s/n, 46022 Valencia, Spain
{eromero,jroman}@itaca.upv.es

Abstract. In the numerical solution of large-scale eigenvalue problems, Davidson-type methods are an increasingly popular alternative to Krylov eigensolvers. The main motivation is to avoid the expensive factorizations that are often needed by Krylov solvers when the problem is generalized or interior eigenvalues are desired. In Davidson-type methods, the factorization is replaced by iterative linear solvers that can be accelerated by a smart preconditioner. Jacobi-Davidson is one of the most effective variants. However, parallel implementations of this method are not widely available, particularly for non-symmetric problems. We present a parallel implementation to be released in SLEPc, the Scalable Library for Eigenvalue Problem Computations, and test it in the context of a highly scalable plasma turbulence simulation code. We analyze its parallel efficiency and compare it with Krylov-type eigensolvers. Keywords: Message-passing parallelization, eigenvalue computations, Jacobi-Davidson, plasma simulation.

1

Introduction

We are concerned with the partial solution of the standard eigenvalue problem defined by a large, sparse matrix A of order n, Ax = λx, where the scalar λ is called the eigenvalue and the n-vector x is called the eigenvector. Many iterative methods are available for the partial solution of the above problem, that is, for computing a subset of the eigenvalues. The most popular ones are Krylov projection methods such as Lanczos, Arnoldi or Krylov-Schur, and Davidson-type methods such as Generalized Davidson or Jacobi-Davidson. Details of these methods can be found in [2]. Krylov methods achieve good performance when computing extreme eigenvalues, but usually fail to compute interior eigenvalues. In that case, the convergence can be improved by combining the method with a spectral transformation technique, i.e., solving (A − σI)⁻¹x = θx instead of Ax = λx. The drawback of this approach is the added high computational cost

This work was partially supported by the Spanish Ministerio de Ciencia e Innovación under project TIN2009-07519.

P. D’Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 101–112, 2010. © Springer-Verlag Berlin Heidelberg 2010


of solving large linear systems at each iteration of the eigensolver. Moreover, for stability reasons these systems must be solved very accurately (normally with direct methods). Davidson-type methods aim at reducing the cost by solving linear systems approximately, without compromising the robustness, usually with iterative methods. Davidson methods are becoming an excellent alternative due to the possibility of tuning the balance between numerical behaviour and computational performance. A powerful preconditioner (close to the matrix inverse), if available, can usually reduce the number of iterations significantly. However, in practice its use is normally too expensive computationally and difficult to parallelize, thus dominating the cost of the eigensolver. Otherwise, depending on the performance of the matrix-vector product, the preconditioner and the orthogonalization, there exist Davidson-type variants that can be competitive with respect to Krylov-type eigensolvers. This paper illustrates an example of this. Despite their potential benefit, it is still difficult to find freely available parallel implementations of Davidson-type eigensolvers, especially for the non-symmetric case, although there are some publications dealing with parallel implementations of these methods employed for certain applications (for instance, the Jacobi-Davidson solver described in [1]). Some Davidson-type methods can be found in PRIMME [19], Anasazi [3] and JADAMILU [5]. PRIMME implements almost all Davidson-type variants, Anasazi only implements a basic block Generalized Davidson method, and JADAMILU implements Jacobi-Davidson. However, none of them support non-Hermitian problems. Implementation of non-Hermitian solvers gets complicated because of the need to work with invariant subspaces rather than eigenvectors, as well as to consider both right and left eigenspaces.
Our aim is to provide a robust and efficient parallel implementation of the Jacobi-Davidson method in the context of SLEPc, the Scalable Library for Eigenvalue Problem Computations [9], that can address standard and generalized problems, both Hermitian and non-Hermitian, with either real or complex arithmetic. The solver will also provide different Davidson variants other than Jacobi-Davidson. Some preliminary results have been presented in [14], where a simple non-restarted variant with real arithmetic is discussed. In this work, we focus on the restarted Jacobi-Davidson method for complex non-Hermitian problems. The eigenvalue problem is the main algebraic problem in many areas such as structural dynamics, quantum chemistry and control theory. In this work, we show results for the eigenvalue calculation that takes place in the plasma physics application GENE, which solves a set of non-linear partial integro-differential equations in five-dimensional phase space by means of the method of lines. Because of the shape of the spectrum (see Fig. 1), computing the largest-magnitude eigenvalues of the linearized operator is not particularly difficult, despite the unfavorable characteristics of the problem (complex non-Hermitian with matrix in implicit form). However, the case of computing the rightmost eigenvalues is much more difficult from the numerical point of view, since these eigenvalues are much smaller in magnitude compared to the dominant ones. This makes the


computational problem challenging and suitable as a testbed for our new parallel eigensolver running on distributed memory architectures. In section 2 we describe the Jacobi-Davidson method and several relevant variants such as harmonic extraction. The implementation details, including how the method is parallelized, are discussed in section 3. In section 4 we provide a brief description of the application. The performance of the parallel eigensolver in this application is presented in section 5.

2

The Jacobi-Davidson Method

Davidson-type methods belong to the class of subspace methods, i.e., approximate eigenvectors are sought in a search subspace V. Each iteration of these methods has two phases: the subspace extraction, which selects the best approximate eigenpair from V in the desired spectrum region, and the subspace expansion, which adds a correction for the selected eigenvector to V. The subspace expansion distinguishes a Davidson-type variant from others. Jacobi-Davidson computes a correction t orthogonal to the selected approximate eigenvector u as an approximate solution of the so-called Jacobi orthogonal component correction (JOCC) [10] equation

A(u + t) = λ(u + t) ,   u ⊥ t .   (1)

Equation (1) can be turned into different linear systems. For our purpose, we implement the Jacobi-Davidson correction equation studied in [7],

(I − (u z*)/(z* u)) (A − θI) (I − (u z*)/(z* u)) t = −r ,   (2)

where r = Au − θu is the residual associated with the selected approximate eigenpair (θ, u), and z ∈ span{Au, u}. If (2) is solved exactly, one step of the algorithm turns out to be one step of the Rayleigh quotient iteration, which converges almost quadratically [7]. Otherwise, i.e., if (2) is solved only approximately, this high convergence rate may be lost. There is a trade-off between the speed of convergence and the amount of work one is willing to spend on solving the equation, which is easily tuned if an iterative method is used. In practice, the performance of the eigensolver depends dramatically on a suitable stopping criterion for the iterative method. In the subspace extraction phase, Davidson-type methods classically impose the Ritz-Galerkin condition on the eigenpair that will be selected (θ, u),

r = Au − θu ⊥ V .   (3)

Since u ∈ V, it is possible to express u = V ũ, where V is a basis of V. This leads to the low-dimensional projected eigenproblem V*AV ũ = θũ. In practice this extraction technique, named Rayleigh-Ritz, obtains good convergence rates when exterior eigenvalues are desired. However, it gives poor


approximate eigenvectors for interior eigenvalues. The harmonic Rayleigh-Ritz method was proposed in [11,12] as an alternative extraction technique for this case. Assuming that interior eigenvalues close to a given target τ are desired, harmonic Rayleigh-Ritz imposes the Petrov-Galerkin condition on the selected eigenpair,

(A − τI)u − ξu ⊥ W ≡ (A − τI)V ,   ξ = θ − τ .   (4)

For stability reasons V and W (a basis of W) are both constructed to be orthonormal: W is such that (A − τI)V = WS, where S is upper triangular. In the same way, this leads to the projected eigenproblem

W*(A − τI)V ũ = ξ W*V ũ .   (5)

Then the smallest-magnitude pairs (ξ, ũ) of (5) correspond to the pairs (ξ + τ, V ũ) in V closest to the target τ. Finally, instead of directly computing the eigenpairs of the projected problem, the implementation performs the Schur decomposition and works with Schur vectors along the method. At the end, the solver has obtained a partial Schur decomposition from which it is possible to compute the corresponding approximate eigenpairs. This variant, called JDQZ [7], facilitates restarting (when the subspaces are full) and locking (when eigenpairs converge). Due to memory limitations and in order to improve efficiency, the maximum sizes of the search and test subspaces have to be bounded. The thick restart technique resets the subspace with the best m_min approximate eigenvectors when its size reaches m_max. JDQZ replaces the eigenvectors by an orthogonal basis of the corresponding eigenspace, which means updating V and W with the first m_min right and left Schur vectors of the projected problem, respectively. When an eigenpair converges, it is removed from the subspace bases V and W, forcing a restart without it and, in the following iterations, orthogonalizing the new vectors t also against all locked vectors. Algorithm 1 summarizes the scheme of a Jacobi-Davidson method with harmonic Rayleigh-Ritz extraction and the correction equation (2). For a more detailed description, the reader is referred to [16,15,7].
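The harmonic extraction of eqs. (4)–(5) can be illustrated with dense linear algebra as follows (an illustrative sketch, not the SLEPc implementation; the matrix, basis size and target τ are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 120, 12                  # problem size and search-space size (made up)
A = rng.standard_normal((n, n))
tau = 0.3                       # target in the interior of the spectrum

V, _ = np.linalg.qr(rng.standard_normal((n, m)))   # orthonormal basis of V
W, S = np.linalg.qr((A - tau * np.eye(n)) @ V)     # (A - tau I)V = WS, S upper triangular

# Projected problem (5): W*(A - tau I)V u~ = xi W*V u~, reduced to standard form.
lhs = W.conj().T @ (A - tau * np.eye(n)) @ V
rhs = W.conj().T @ V
xi, U = np.linalg.eig(np.linalg.solve(rhs, lhs))

k = int(np.argmin(np.abs(xi)))  # smallest |xi|: the pair closest to the target
theta = xi[k] + tau             # harmonic Ritz value
u = V @ U[:, k]                 # harmonic Ritz vector in V
```

Since W*W = I, the matrix W*(A − τI)V equals the triangular factor S of the QR step, so the projected problem is cheap to form in an actual implementation.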

3

Implementation Description

3.1

Overview of SLEPc

SLEPc, the Scalable Library for Eigenvalue Problem Computations [9]¹, is a software library for the solution of large, sparse eigenvalue and singular value problems on parallel computers. It was designed to solve problems formulated in either standard or generalized form, both Hermitian and non-Hermitian, with either real or complex arithmetic.

¹ http://www.grycap.upv.es/slepc/


Algorithm 1. Block Jacobi-Davidson with harmonic Rayleigh-Ritz extraction

Input: matrix A of size n, target τ, number of wanted eigenpairs p, block size s, maximum size of V m_max, restart with m_min vectors
Output: resulting eigenpairs (Θ̃, X̃)
Choose an n × s full rank matrix V such that V*V = I

0. When the tasks have to be executed on processors with successive indices, Steinberg [22] proposed an adapted strip-packing algorithm with approximation factor 2. Furthermore, Jansen [16] gave an approximation scheme with makespan at most 1 + ε for any fixed ε > 0. There exist other algorithms for special cases of the MPTS with approximation factors close to one (e.g. [9] for identical malleable tasks), but those do not apply to our case.

Scheduling Parallel Eigenvalue Computations in a Quantum Chemistry Code

117

The problem stated in this paper is a standard MPTS problem, so the algorithms mentioned above could in principle be applied. However, as we will see, the number of sub-matrices to be processed is limited in most cases. This allows us to modify the algorithm from [18] by introducing a combinatorial approach in order to find a better solution.

3.2

The Algorithm

We focus on a two-phase approach. In the first phase, the number of allotted processors for each task is determined. This step performs the transformation from an MPTS to an NPTS. In the second phase, an optimal schedule for the non-malleable tasks is constructed.

Processor Allotment: In [18], Ludwig and Tiwari showed how a processor allotment can be found in runtime O(mn). The algorithm also computes a lower bound ω on the optimal makespan d* such that ω ≤ d* ≤ 2ω. The basic idea is to find a number of allotted processors P_Ti ∈ P for each task Ti such that t_{i,P_Ti} ≤ τ. Here, τ ∈ ℝ is defined as an upper bound for the execution time of each task, and P_Ti is the minimum number of processors required to satisfy this condition. The goal is to find the minimum τ* which produces a feasible allotment for each task. Furthermore, the values of τ to be considered can be limited to the set X = {t_{i,Pj} : i = 1 … n, j = 1 … m}; thus, |X| = mn. Once τ* has been found, the algorithm yields an allotment, which is subsequently used for solving the NPTS problem. The algorithm requires t to be strictly monotonic: t_{i,p1} > t_{i,p2} for p1 < p2. This property is not given in general, but can be achieved by a suitable choice of the cost function (see below). For further details, please refer to the original paper [18].

Solution of NPTS: Ludwig and Tiwari presented a technique which “. . . takes any existing approximation algorithm for NPTS and uses it to obtain an algorithm for MPTS with the same approximation factor.” [18]. In other words, if there is a way to find an optimal schedule for the NPTS, it simultaneously yields an optimal solution for the MPTS. For each irreducible representation, one block on the diagonal of the Hamilton matrix H has to be diagonalized; in most applications, the number of such blocks is limited to 10.
This includes the important point groups Ih and Oh and their subgroups, as well as all point groups with up to fivefold rotational axes as symmetry elements [1]. Thus, point groups comprising more than 10 irreducible representations are very rarely encountered in practical applications. For such a limited problem size, we can achieve an optimal solution by a combinatorial approach. The number of possible permutations σ of an NPTS is n!; in our case, this leads to a maximum number of 10! ≈ 3.6 · 10⁶. Algorithm 1 shows the routine to find the schedule with the minimum makespan d*. A schedule is represented by a sequence of task numbers σ, from which a schedule is generated in Algorithm 2. For simplicity, we do not explicitly provide the routine that generates the permutations.

118

M. Roderus et al.

Algorithm 1. Finds the scheduling sequence σ* from which MakeSchedule (Algorithm 2) generates a schedule with the minimum makespan

1. Find the tasks which have been allotted all processors: Tm ⊆ T = {Ti : P_Ti = m}
2. Schedule Tm at the beginning
3. For each possible permutation σ of T \ Tm
   (a) Call MakeSchedule to generate a schedule from σ
   (b) σ* ← σ if makespan of σ < makespan of σ*

Algorithm 2. The procedure to generate a schedule from a scheduling sequence σ. It allots to each task σ(i) → Ti a start time and an end time (σ(i).startTime and σ(i).endTime, respectively), as well as a set of allotted processors (σ(i).allottedProcessors). There is a set of processors, C = {C1 . . . Cm}; each element Ci has an attribute Ci.availableTime, which indicates until which time the processor is occupied and, thus, the earliest time from which on it can be used for a new task.

procedure MakeSchedule(σ)
    for all Ci ∈ C do
        Ci.availableTime ← 0
    end for
    for i ← 1, |σ| do
        Ctmp ← first P_Ti processors which are available
        σ(i).startTime ← max(available times from Ctmp)
        σ(i).endTime ← σ(i).startTime + t_{i,P_Ti}
        σ(i).allottedProcessors ← Ctmp
        for all Cj ∈ Ctmp do
            Cj.availableTime ← σ(i).endTime
        end for
    end for
end procedure
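In Python, Algorithms 1 and 2 can be rendered compactly as follows (an illustrative sketch, not the authors' implementation; `cost(i, p)` stands for the cost function t_{i,p}, and `allotted[i]` for the number of processors P_Ti fixed in the allotment phase):

```python
from itertools import permutations

def make_schedule(seq, allotted, cost, m):
    """Algorithm 2: turn a task sequence into a schedule.

    Returns the makespan and, per task, (startTime, endTime, processors)."""
    available = [0.0] * m                     # C_i.availableTime
    schedule = {}
    for i in seq:
        p = allotted[i]
        # the first P_Ti processors which become available
        procs = sorted(range(m), key=lambda c: available[c])[:p]
        start = max(available[c] for c in procs)
        end = start + cost(i, p)
        for c in procs:
            available[c] = end
        schedule[i] = (start, end, procs)
    return max(end for _, end, _ in schedule.values()), schedule

def best_schedule(tasks, allotted, cost, m):
    """Algorithm 1: tasks allotted all m processors are scheduled first,
    then an exhaustive search over permutations of the remaining tasks."""
    full = [i for i in tasks if allotted[i] == m]
    rest = [i for i in tasks if allotted[i] < m]
    best = None
    for perm in permutations(rest):
        cand = make_schedule(full + list(perm), allotted, cost, m)
        if best is None or cand[0] < best[0]:
            best = cand
    return best

# Hypothetical example: two processors; task 0 was allotted both of them,
# tasks 1 and 2 one each; cost t_{i,p} = work_i / p.
work = [4.0, 3.0, 3.0]
allotted = {0: 2, 1: 1, 2: 1}
makespan, sched = best_schedule([0, 1, 2], allotted,
                                lambda i, p: work[i] / p, m=2)
print(makespan)   # 5.0
```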

4

Cost Function

The scheduling algorithm described requires a cost function which estimates the execution time of the (Sca)LAPACK routines DSYGV and PDSYGV. It is difficult to determine upfront how accurate the estimates have to be. However, the validation of the algorithm will show whether the error bounds are tight enough for the algorithm to work in practice. The ScaLAPACK User's Guide [6] proposes a general performance model, which depends on machine-dependent parameters such as floating-point or network performance, and on data- and routine-dependent parameters such as total FLOP or communication count. In [10], Demmel and Stanley used this approach to evaluate the general performance behavior of the ScaLAPACK routine PDSYEVX. The validation of the models shows that the prediction error usually lies between 10 and 30%. Apart from that, for the practical use in a scheduling

Scheduling Parallel Eigenvalue Computations in a Quantum Chemistry Code

119

algorithm, it exhibits an important drawback: to establish a model of that kind, good knowledge of the routine used is required. Furthermore, each routine needs its own model; thus, if the routine changes (e.g. due to a revision or the use of a different library), the model has to be adapted as well. Here we follow a different approach: the routine is handled as a “black box”. Predictions of its execution time are based on empirical data, which are recorded by test runs with a set of randomly generated matrices on a set of possible processor allotments P. Then, with a one-dimensional curve-fitting algorithm, a continuous cost function t is generated for each element of P. Thus, each P ∈ P has a related cost function t_P : S → t_{P,S} (not to be confused with (2)). ScaLAPACK uses a two-dimensional block-cyclic data distribution. For each instance of a routine, a P_r × P_c process grid has to be allocated, with P_r process rows and P_c process columns. However, the ScaLAPACK User's Guide [6] suggests using a square grid (P_r = P_c = ⌊√P⌋) for P ≥ 9 and a one-dimensional grid (P_r = 1; P_c = P) for P < 9. Following this suggestion results in a reduced set of processor configurations, e.g. P = {1, 2, …, 8, 9, 16, 25, …, ⌊√m⌋²}, which is used here. We used the method of least squares to fit the data. For our purposes, it shows two beneficial properties:

– the data are fitted by polynomials, which are easy to handle and allow us to generate an estimated execution time t_{P,S} with low computational effort;
– during the data generation, processing of a matrix sometimes takes longer than expected due to hardware delays (see circular marks in Fig. 2). As long as those cases are rare, their influence on the cost function is minor and can be neglected.

Finally, we combine the emerging set of P-related cost functions to form the general cost function (2).
However, in practice, when a certain number of allotted processors is exceeded, parallel routines no longer show a speedup, or even slow down; see [24]. This behavior does not comply with the assumption of a generally monotonic cost function. To satisfy this constraint, we define (2) as follows:

t_{i,p} = min{ t_{P,S_i} : P ∈ P ∧ P ≤ p } .   (3)

[Figure: execution time in seconds over the matrix size n (500–2000) of n × n matrices; curves: empiric data, fitted data.]

Fig. 2. Execution time measurements of the routine PDSYGV diagonalizing randomly generated matrices. The curve labeled “fitted data” represents a polynomial of degree 3, which was generated by the method of least squares from the empiric data.


All processor counts P ∈ P that are smaller than or equal to p are considered. The P which results in the smallest execution time for the given S determines the P-related cost function and thus the result t of the general cost function (3).
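A minimal sketch of this black-box cost model (the timing samples below are made up for illustration; `numpy.polynomial.Polynomial.fit` performs the least-squares fit):

```python
import numpy as np

# P -> (matrix sizes S, measured seconds); hypothetical sample timings.
samples = {
    1: ([500, 1000, 1500, 2000], [0.9, 6.8, 22.1, 52.0]),
    4: ([500, 1000, 1500, 2000], [0.4, 2.1, 6.3, 14.2]),
    9: ([500, 1000, 1500, 2000], [0.5, 1.4, 3.5, 7.9]),
}

# Least-squares fit of one degree-3 polynomial t_P(S) per configuration P
# (cf. Fig. 2).
fits = {P: np.polynomial.Polynomial.fit(S, t, deg=3)
        for P, (S, t) in samples.items()}

def cost(size, p):
    """General cost function (3): the best predicted time over all stored
    configurations P <= p; non-increasing in p by construction."""
    return min(float(fits[P](size)) for P in fits if P <= p)
```

Taking the minimum over all P ≤ p makes the estimate monotone in p even where an individual fit would predict a slowdown.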

5

Evaluation

We evaluated the presented scheduler for two molecular systems as example applications: the gold cluster compound Au55(PH3)12 in symmetry S6 and the palladium cluster Pd344 in symmetry Oh. Table 1 lists the sizes of the symmetry-adapted basis sets which result in S_Au55 and S_Pd344.

Table 1. The resulting point group classes (PGC) of the two example systems Au55(PH3)12 in symmetry S6 and Pd344 in symmetry Oh. The third row contains the sizes n of the n × n matrices which comprise the elements of S_Au55 and S_Pd344.

Au55(PH3)12:
i    1    2    3    4
PGC  Ag   Eg   Au   Eu
Si   782  1556 782  1560

Pd344:
i    1    2    3    4    5    6    7    8    9    10
PGC  A1g  A2g  Eg   T1g  T2g  A1u  A2u  Eu   T1u  T2u
Si   199  317  513  838  956  1110 317  471  785  956

The test platform was an SGI Altix 4700, installed at the Leibniz-Rechenzentrum München, Germany. The system uses Intel Itanium2 Montecito dual cores as CPUs and SGI NUMAlink 4 as the underlying network. As numerical library, SGI's SCSL was used to provide BLAS, LAPACK, BLACS and ScaLAPACK support. For further details on the hardware and software specification, please refer to [21]. We performed time and load measurements using the VampirTrace profiling tool [5]. For that purpose, we manually instrumented the code, inserting calls to the VT library. As a result, the two relevant elements of the scheduling algorithm could be measured separately: the execution time of the eigensolvers and the idle time (see below). Two negative influences on the parallel efficiency can be expected: firstly, like virtually every parallel numerical library, ScaLAPACK does not scale ideally with the number of processors. Consequently, the overall efficiency worsens the more the scheduler parallelizes the given tasks. The second negative influence arises from the scheduling itself. As Fig. 1 shows, a processor can stay idle while waiting for other processors or the whole routine to finish. Those idle times result in load imbalance, hence decreased efficiency. Figure 3 shows the execution times of a scheduled diagonalization during one SCF cycle. The gap between the predicted and computed execution time is reasonably low (at most ≈ 10%) except for the case p = 1. This shows that the cost function, which generates the values for the predicted makespan, works sufficiently accurately to facilitate the practical use of the scheduling algorithm.

[Figure: two time diagrams in seconds over the number of processors, (a) Au55(PH3)12 (1–20 processors) and (b) Pd344 (1–28 processors); curves: predicted, computed, idle time, LPT.]

Fig. 3. Time diagrams of the two test systems Au55(PH3)12 and Pd344. Considered are the execution times of the diagonalization module during one SCF iteration. The curves labeled “predicted” show the predicted makespan of the scheduling algorithm, whereas the curves labeled “computed” provide the real execution time of the scheduled eigensolvers. The curves labeled “idle time” represent the average time during which a processor was not computing. The lines labeled “LPT” indicate the execution time of the sequential LAPACK routine computing the largest matrix from S and thus yield the best possible performance of the previously used LPT-scheduler (see Sect. 1).

The figure also shows a lower bound on the execution time of a sequential scheduler (the “LPT” line). To recapitulate the basic idea of the previously used LPT-scheduler: all matrices are sorted by their size and accordingly scheduled on any processor which becomes available, where the matrix is diagonalized by a sequential LAPACK eigensolver routine (see Fig. 1a). This performance bound is now broken: our new algorithm improves the execution time beyond this barrier. The diagonalization of the first system, Au55(PH3)12, scales up to 20 processors, whereas the LPT-scheduler can only exploit up to 4 processors. The proposed MPTS-scheduler is faster by a factor of about 2 when using 4 processors and by a factor of 8.4 when using 20 processors. For the diagonalization of the second system, Pd344, the execution time improved for up to 28 processors. Compared to the LPT-scheduler, which in this case can only exploit up to 10 processors, the MPTS-scheduler is faster by a factor of 1.6 when using this processor number and by a factor of about 4 when using 28 processors. The parallel efficiency is given in Fig. 4. For the system Au55(PH3)12, a superlinear speedup was achieved for the cases p = {2, 4, 5, 9}. One can also see for both examples that the idle times of the scheduling can cause a notable loss of efficiency. Further investigations of the scheduling algorithm to reduce those time gaps would thus be an opportunity to improve the overall efficiency. One also has to consider the cost of establishing the schedule. It is difficult to estimate this cost beforehand, but recall that the schedule, once computed, can be re-used in each recurring SCF cycle, of which there are at least 10³ in a typical application


[Figure: parallel efficiency over the number of processors, (a) Au55(PH3)12 (1–20 processors) and (b) Pd344 (1–28 processors); curves: computed, no idle time.]

Fig. 4. The parallel efficiency in computing the two test systems Au55(PH3)12 and Pd344. The curves labeled “computed” show the efficiency according to the execution times displayed in Fig. 3. The curves labeled “no idle time” represent the efficiency where the idle times of the scheduling are subtracted from the execution times; they thus indicate the parallel efficiency of the ScaLAPACK routine PDSYEVX as used in the given schedules.

(see Sect. 1). Thus, a reasonable criterion is the number of SCF cycles required to amortize these costs. In the worst case considered, with 10! possible task permutations (see Sect. 3.2), our implementation requires a runtime of ≈ 34 s. This case arises for the second test system, Pd344, as its symmetry group Oh implies a total number of 10 matrices. In an example case where these matrices are scheduled on 10 processors, a performance gain of ≈ 1.46 s per cycle could be achieved compared to the sequential LPT-scheduler (see Fig. 3b). Accordingly, the initial scheduling costs are amortized after 24 SCF iterations.
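The quoted numbers can be checked directly:

```python
import math

n_permutations = math.factorial(10)           # candidate schedules, 10! ≈ 3.6e6
scheduling_cost = 34.0                        # seconds, one-off (from the text)
gain_per_cycle = 1.46                         # seconds saved per SCF iteration

cycles_to_amortize = math.ceil(scheduling_cost / gain_per_cycle)
print(n_permutations, cycles_to_amortize)     # 3628800 24
```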

6

Conclusion and Future Work

We demonstrated how a parallel eigensolver can be used efficiently in quantum chemistry software when the Hamilton matrix has a block-diagonal structure. The scalability of the proposed parallel scheduler has been demonstrated on real chemical systems. The first system, Au55(PH3)12, scales up to 20 processors. For the second system, Pd344, a performance gain could be achieved for up to 28 processors. Compared to the previously used LPT-scheduler, the scalability was significantly improved. Performance improvements by factors of about 2 and 1.6, respectively, could be achieved when using the same numbers of processors. With the improved parallelizability, the diagonalization step can now be executed about 8.4 and 4 times faster, respectively. Furthermore, the proposed strategy for the cost function, which relies on empirical records of execution times, provides results accurate enough for the scheduler to work in practice. In summary, the presented technique significantly improves the performance and the scalability of the solution of the generalized eigenvalue problem in


parallel quantum chemistry codes. It thus makes an important contribution to preventing this step from becoming a bottleneck in simulations of large symmetric molecular systems, especially nanoparticles. It will be interesting to explore how an approximate scheduling algorithm like the one of [20] compares with the combinatorial approach proposed here. Adopting such an algorithm would also make the presented technique more versatile, because the number of tasks would no longer be limited. Thus, rare cases in practical applications with significantly more than 10 blocks would be covered as well.

Acknowledgments The work was funded by the Munich Centre of Advanced Computing (MAC) and Fonds der Chemischen Industrie.

References
1. Altmann, S.L., Herzig, P.: Point-group theory tables. Clarendon, Oxford (1994)
2. Anderson, E., Bai, Z., Bischof, C., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Ostrouchov, S., Sorensen, D.: LAPACK user's guide. SIAM, Philadelphia (1992)
3. Belling, T., Grauschopf, T., Krüger, S., Mayer, M., Nörtemann, F., Staufer, M., Zenger, C., Rösch, N.: High performance scientific and engineering computing. In: Bungartz, H.J., Durst, F., Zenger, C. (eds.) Lecture Notes in Computational Science and Engineering, vol. 8, pp. 439–453 (1999)
4. Belling, T., Grauschopf, T., Krüger, S., Nörtemann, F., Staufer, M., Mayer, M., Nasluzov, V.A., Birkenheuer, U., Hu, A., Matveev, A.V., Shor, A.V., Fuchs-Rohr, M.S.K., Neyman, K.M., Ganyushin, D.I., Kerdcharoen, T., Woiterski, A., Majumder, S., Rösch, N.: ParaGauss, version 3.1. Tech. rep., Technische Universität München (2006)
5. Bischof, C., Müller, M.S., Knüpfer, A., Jurenz, M., Lieber, M., Brunst, H., Mix, H., Nagel, W.E.: Developing scalable applications with Vampir, VampirServer and VampirTrace. In: Proc. of ParCo 2007, pp. 113–120 (2007)
6. Blackford, L.S., Choi, J., Cleary, A., D'Azeuedo, E., Demmel, J., Dhillon, I., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley, R.C., Dongarra, J.: ScaLAPACK user's guide. SIAM, Philadelphia (1997)
7. Blazewicz, J., Ecker, K., Pesch, E., Schmidt, G., Weglarz, J.: Handbook on scheduling: from theory to applications. Springer, Heidelberg (2007)
8. Blazewicz, J., Kovalyov, M.Y., Machowiak, M., Trystram, D., Weglarz, J.: Scheduling malleable tasks on parallel processors to minimize the makespan. Ann. Oper. Res. 129, 65–80 (2004)
9. Decker, T., Lücking, T., Monien, B.: A 5/4-approximation algorithm for scheduling identical malleable tasks. Theor. Comput. Sci. 361(2), 226–240 (2006)
10. Demmel, J., Stanley, K.: The performance of finding eigenvalues and eigenvectors of dense symmetric matrices on distributed memory computers. In: Proc. Seventh SIAM Conf. on Parallel Processing for Scientific Computing, pp. 528–533 (1995)
11. Dunlap, B.I., Rösch, N.: The Gaussian-type orbitals density-functional approach to finite systems. Adv. Quantum Chem. 21, 317–339 (1990)
124

M. Roderus et al.

12. Garey, M., Johnson, D.: Computers and intractability: a guide to the theory of NP-completeness. W.H. Freeman and Company, New York (1979) 13. Graham, R.L.: Bounds for certain multiprocessing anomalies. Bell Syst. Tech. J. 45, 1563–1581 (1966) 14. Graham, R.: Bounds on multiprocessing timing anomalities. SIAM J. Appl. Math. 17, 263–269 (1969) 15. Hein, J.: Improved parallel performance of SIESTA for the HPCx Phase2 system. Tech. rep., The University of Edinburgh (2004) 16. Jansen, K.: Scheduling malleable parallel tasks: an asymptotic fully polynomial time approximation scheme. Algorithmica 39, 59–81 (2004) 17. Koch, W., Holthausen, M.C.: A chemist’s guide to density functional theory. WileyVCH, Weinheim (2001) 18. Ludwig, W., Tiwari, P.: Scheduling malleable and nonmalleable parallel tasks. In: SODA 1994, pp. 167–176 (1994) 19. Mouni´e, G., Rapine, C., Trystram, D.: Efficient approximation algorithms for scheduling malleable tasks. In: SPAA 1999, pp. 23–32 (1999) 20. Mouni´e, G., Rapine, C., Trystram, D.: A 3/2-approximation algorithm for scheduling independent monotonic malleable tasks. SIAM J. Comp. 37(2), 401–412 (2007) 21. National supercomputer HLRB-II, http://www.lrz-muenchen.de/ 22. Steinberg, A.: A strip-packing algorithm with absolute performance bound 2. SIAM J. Comp. 26(2), 401–409 (1997) 23. Turek, J., Wolf, J., Yu, P.: Approximate algorithms for scheduling parallelizable tasks. In: SPAA 1992, pp. 323–332 (1992) 24. Ward, R.C., Bai, Y., Pratt, J.: Performance of parallel eigensolvers on electronic structure calculations II. Tech. rep., The University of Tennessee (2006) 25. Yudanov, I.V., Matveev, A.V., Neyman, K.M., R¨ osch, N.: How the C-O bond breaks during methanol decomposition on nanocrystallites of palladium catalysts. J. Am. Chem. Soc. 130, 9342–9352 (2008) 26. Yudanov, I.V., Metzner, M., Genest, A., R¨ osch, N.: Size-dependence of adsorption properties of metal nanoparticles: a density functional study on Pd nanoclusters. J. Phys. Chem. 
C 112, 20269–20275 (2008)

Scalable Parallelization Strategies to Accelerate NuFFT Data Translation on Multicores

Yuanrui Zhang1, Jun Liu1, Emre Kultursay1, Mahmut Kandemir1, Nikos Pitsianis2,3, and Xiaobai Sun3

1 Pennsylvania State University, University Park, USA
{yuazhang,jxl1036,euk139,kandemir}@cse.psu.edu
2 Aristotle University, Thessaloniki, Greece
3 Duke University, Durham, USA
{nikos,xiaobai}@cs.duke.edu

Abstract. The non-uniform FFT (NuFFT) has been widely used in many applications. In this paper, we propose two new scalable parallelization strategies to accelerate the data translation step of the NuFFT on multicore machines. Both schemes employ geometric tiling and binning to exploit data locality, and use recursive partitioning and scheduling with dynamic task allocation to achieve load balancing. The experimental results collected from a commercial multicore machine show that, with the help of our parallelization strategies, the data translation step is no longer the bottleneck in the NuFFT computation, even for large data set sizes, with any input sample distribution.

1 Introduction

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 125–136, 2010. © Springer-Verlag Berlin Heidelberg 2010

The non-uniform FFT (NuFFT) [2] [7] [15] has been widely used in many applications, including synthetic radar imaging [16], medical imaging [13], telecommunications [19], and geoscience and seismic analysis [6]. Unlike the Fast Fourier Transform (FFT) [4], it allows the sampling in the data or frequency space (or both) to be unequally-spaced or non-equispaced. To achieve the same O(N log N) computational complexity as the FFT, the NuFFT translates the unequally-spaced samples to equally-spaced points, and then applies the FFT to the translated Cartesian grid; the complexity of the first step, called data translation or re-sampling, is linear in the size of the sample ensemble. Despite its lower arithmetic complexity compared to the FFT, the data translation step has been found to be the most time-consuming part of computing the NuFFT [18]. The reason lies in its irregular data access pattern, which significantly deteriorates memory performance on modern parallel architectures [5]. Furthermore, as data translation is essentially a matrix-vector multiplication with an irregular and sparse matrix, its intrinsic parallelism cannot be readily exploited by conventional compiler-based techniques [17] that work well mostly for regular dense matrices. Many existing NuFFT algorithms [2] [8] [14] [12] try to reduce the complexity of data translation while maintaining desirable accuracy through mathematical methods, e.g., by designing different kernel


functions. In a complementary effort, we attempt to improve the performance of data translation through different parallelization strategies that take into account the architectural features of the target platform, without compromising accuracy. In our previous work [20], we developed a tool that automatically generates fast parallel NuFFT data translation code for a user-specified multicore architecture and algorithmic parameters. This tool consists of two major components. The first one applies an architecture-aware parallelization strategy to the input samples, rearranging them in off-chip memory or a data file. The second one instantiates parallel C code based on the derived parallel partitions and schedules, using a pool of codelets for various kernel functions. The key to the success of this tool is its parallelization strategy, which directly dictates the performance of the output code. The scheme we developed in [20] has generated significant improvements for the data translation computation, compared to a more straightforward approach that does not consider the architectural features of the underlying parallel platform. However, it is limited to small data set sizes, e.g., 2K × 2K, with non-uniformly distributed samples; the parallelization takes an excessively long time to finish when the data size is large. This slow-down is mainly due to the use of recursive geometric tiling and binning during parallelization, which is intended to improve the data locality (cache behavior) of the translation code. To overcome this drawback, in this paper, we design and experimentally evaluate two new scalable parallelization strategies that employ an equally-sized tiling and binning to cluster the unequally-spaced samples, for both uniform and non-uniform distributions. The first strategy is called the source driven parallelization, and the second one is referred to as the target driven parallelization.
Both strategies use dynamic task allocation to achieve load balancing, instead of the static approach employed in [20]. To guarantee mutual exclusion in data updates during concurrent computation, the first scheme applies a special parallel scheduling, whereas the second one employs a customized parallel partitioning. Although both schemes have comparable performance for data translation with uniformly distributed sources, the target driven parallelization outperforms the other on inputs with non-uniformly distributed sources, especially on a large number of cores, where synchronization overheads become significant in the first scheme. We conducted experiments on a commercial multicore machine, and compared the execution time of the data translation step with the FFT from FFTW [11]. The collected results demonstrate that, with the help of our proposed parallelization strategies, the data translation step is no longer the primary bottleneck in the NuFFT computation, even for non-uniformly distributed samples with large data set sizes. The rest of the paper is organized as follows. Section 2 explains the NuFFT data translation algorithm and its basic operations. Section 3 describes our proposed parallelization strategies in detail. Section 4 presents the results from our experimental analysis, and finally, Section 5 gives the concluding remarks.

2 Background

2.1 Data Translation Algorithm

Data translation algorithms vary in the type of data sets they target and the type of re-gridding or re-sampling methods they employ. In this paper, we focus primarily on convolution-based schemes for data translation, as represented by Eq. (1) below, where the source samples S are unequally-spaced, e.g., in the frequency domain, and the target samples T are equally-spaced, e.g., in an image domain. The dual case with equally-spaced sources and unequally-spaced targets can be treated in a symmetrical way.

    v(T) = C(T, S) · q(S).    (1)

In Eq. (1), q(S) and v(T) denote the input source values and the translated target values, respectively, and C represents the convolution kernel function. The set S of source locations can be provided in different ways. In one case, the sample coordinates are expressed and generated by closed-form formulas, as in the case of sampling on a polar grid (see Figure 1 (a)). Alternatively, the coordinates can be provided in a data file as a sequence of coordinate tuples generated from a random sampling (see Figure 1 (b)). The range and space intervals of the target Cartesian grid T are specified by the algorithm designer, and they can be simplified into a single oversampling factor [7] since the target samples are uniformly distributed. With an oversampling factor of α, the relationship between the number of sources and targets can be expressed as |T| = α|S|. The convolution kernel C is obtained either by closed-form formulas or numerically. Examples of the former are the Gaussian kernel and the central B-splines [14] [7], whereas examples of the latter are functions obtained numerically according to the local least-squares criterion [14] and the min-max criterion [8]. In each case, we assume that the function evaluation routines are provided. In addition, the kernel function has local support, i.e., each source is only involved in the convolution computation with the targets within a window, and vice versa. Figure 1 (c) shows an example for a single source in the 2D case, where the window has a side length of w.

Fig. 1. Illustration of different sampling schemes and the local convolution window: (a) the sampling on a polar grid; (b) a random sampling; (c) the convolution window, with side length w, for a single source

2.2 Basic Data Translation Procedure

The above data translation algorithm can be described by the following pseudo code:

for each source si in S
    for each target tj in T ∩ window(si)
        V(tj) += c(tj, si) × Q(si)

where Q is a one-dimensional array containing the source values regardless of the geometric dimension, V is a d-dimensional array holding the target values, and c represents a kernel function over T × S, e.g., the Gaussian kernel e^(−|tj − si|²/σ), where |tj − si| denotes the distance between a target and a source. The target coordinates are generated on demand during the computation based on the source coordinates, the oversampling value (α), and the range of the data space, e.g., L × L in the 2D case. In the pseudo code, the outer loop iterates over sources, because it is easy to find the specific targets within the window of a source, but not vice versa. We name this type of computation the source driven computation. The alternative is the target driven computation, whose outer loop iterates over targets. The complexity of the code is O(w^d × |S|); however, since w is usually very small (in other words, w^d is nearly constant), the data translation time is linear in the number of sources |S|.
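As a concrete illustration, the source driven computation above can be sketched in Python for a 1D target grid. All names, the data-space length L, and the exact Gaussian kernel form are assumptions for illustration, not the authors' implementation:

```python
import math

def translate(sources, values, grid_n, w, sigma, L=10.0):
    """1D sketch of source driven data translation: each source
    accumulates onto the w nearest equally spaced targets
    (its local convolution window)."""
    h = L / grid_n                     # target spacing
    targets = [0.0] * grid_n
    for s, q in zip(sources, values):
        center = int(round(s / h))     # nearest target index
        for j in range(center - w // 2, center + w // 2 + 1):
            if 0 <= j < grid_n:
                # assumed Gaussian-type kernel c(t, s)
                c = math.exp(-abs(j * h - s) ** 2 / sigma)
                targets[j] += c * q
    return targets
```

With w = 5, each source updates 5 targets per dimension, as in the window of Section 2.1; the cost is O(w·|S|) in 1D, linear in the number of sources.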

3 Data Translation Parallelization

3.1 Geometric Tiling and Binning

The convolution-based data translation is essentially a matrix-vector multiplication with sparse and irregular matrices. While the sparsity stems from the local window effect of the kernel function, the irregularity is caused by the unequally-spaced sources. In this case, the conventional tiling of dense and regular matrices [17] cannot help to achieve high data reuse. For instance, the source samples s1 and s2 from the same tile, as shown in Figure 2 (a), may update different targets located far away from each other in the data space, as indicated by Figure 2 (b). To exploit target reuse, geometric tiling [3] is employed to cluster the sources into cells/tiles based on their spatial locations. The tiles can be equally-sized (Figure 3 (a)) or unequally-sized (Figure 3 (b)), with the latter obtained through an adaptive tiling based on the sample distribution. In either case, the basic data translation procedure can be expressed as:

for each non-empty source tile Tk in S
    for each source si in Tk
        for each target tj in T ∩ window(si)
            V(tj) += c(tj, si) × Q(si)

Associated with tiling is a process called binning. It reshuffles the source data in the storage space, e.g., external memory or a file, according to tiles. In terms of data movements, the complexity of an equally-sized tiling and binning is O(|S|),

Fig. 2. Conventional tiling: (a) the conventional tiling on the convolution matrix; (b) two sources from the same tile may be geometrically separate, as are their affected targets

Fig. 3. Geometric tiling: (a) equally-sized geometric tiling with uniformly distributed sources; (b) adaptive geometric tiling with non-uniformly distributed sources; neighboring tiles have overlapped target windows

whereas the cost of an adaptive recursive tiling and binning is O(|S| log |S|). The latency of the latter increases dramatically when the data set size is very large, as observed in [20]. Consequently, in this work, we adopt the equally-sized tiling, irrespective of the source distribution, i.e., uniform or non-uniform, as illustrated in Figure 3. In this way, the non-Cartesian grid of sources is transformed into a Cartesian grid of tiles before any parallel partitioning and scheduling is applied, and this separation makes our parallelization strategies scalable with the data set sizes.
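A minimal sketch of the equally-sized 2D geometric tiling and binning, assuming a square L × L data space and per-tile bucket lists (names and data layout are illustrative, not the tool's actual C implementation):

```python
def bin_sources(coords, L, tiles_per_dim):
    """Equally-sized 2D geometric tiling and binning: each source is
    mapped to a tile by its coordinates, then the sources are
    reshuffled so that samples of the same tile become contiguous.
    One pass over the samples, i.e., O(|S|) data movements."""
    n = tiles_per_dim
    tile_w = L / n
    bins = [[] for _ in range(n * n)]
    for x, y in coords:
        i = min(int(x / tile_w), n - 1)   # clamp points on the far edge
        j = min(int(y / tile_w), n - 1)
        bins[i * n + j].append((x, y))
    # concatenated order: the binned (reshuffled) storage layout
    order = [p for b in bins for p in b]
    return bins, order
```

Inside each tile the samples remain unequally-spaced; only the tile grid is Cartesian, which is exactly what the subsequent parallel partitioning relies on.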

3.2 Parallelization Strategies

To parallelize the data translation step, two critical issues need to be considered: mutual exclusion of data updates and load balancing. On the one hand, a direct parallelization of the source loop of the code in Section 2.2, or of the source tile loop of the code in Section 3.1, may lead to incorrect results, as threads on different cores may attempt to update the same target when they process geometrically nearby sources concurrently. Although parallelizing the inner target loop can avoid this, it would cause significant synchronization overheads. On the other hand, a simple equally-sized parallel partitioning of the data space may lead to unbalanced workloads across multiple processors when the input sources are non-uniformly distributed. To address these issues, we have designed two parallelization strategies that aim at accelerating data translation on emerging multicore machines with on-chip caches. One of these strategies is called the source driven parallelization, and the other is referred to as the target driven parallelization. Both are intended to be used in the context of the source driven computation. To ensure mutual exclusion of target updates, the first strategy employs a special parallel scheduling, whereas the second applies a customized parallel partitioning. Both strategies use a recursive approach with dynamic task allocation to achieve load balance across the cores of the target architecture.
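The dynamic task allocation shared by both strategies can be sketched as a worker pool pulling block indices from a shared queue. This is a simplified illustration that omits the step-wise synchronization of the source driven scheme; all names are hypothetical:

```python
import threading
import queue

def run_blocks(blocks, process, num_threads):
    """Dynamic task allocation: workers repeatedly pull the next block
    from a shared queue, so a thread that finishes a sparse block early
    simply grabs another, balancing skewed source distributions."""
    tasks = queue.Queue()
    for b in blocks:
        tasks.put(b)

    def worker():
        while True:
            try:
                b = tasks.get_nowait()
            except queue.Empty:
                return            # queue drained: this worker is done
            process(b)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The contrast with static allocation is that no thread is pre-assigned a fixed share of blocks, so non-uniform per-block work does not translate into idle cores.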

Fig. 4. One-dimensional partition with a 2-step scheduling for the 2D case

Fig. 5. Two-dimensional partition with a 4-step scheduling for the 2D case

1) Source Driven Parallelization

For a data space containing both sources and targets, an explicit partitioning of the sources induces an implicit partitioning of the targets, and vice versa, because of the local convolution window effect. The source driven parallelization carries out parallel partitioning and scheduling in the source domain, whereas the other operates on the targets. In both cases, the source domain has been transformed into a Cartesian grid of tiles through geometric tiling and binning (inside each tile, the samples are still unequally-spaced). When partitioning the sources, a special scheduling is indispensable to guarantee the mutual exclusion of target updates. Consider Figure 4 and Figure 5, for example. No matter how the sources in a 2D region are partitioned, e.g., using a one-dimensional or two-dimensional partition, adjacent blocks cannot be processed at the same time, because of the potentially affected overlapping targets indicated by the dashed lines. However, by further dividing the neighboring source regions into smaller blocks according to the target overlapping patterns, a schedule can be found that processes those adjacent source regions in several steps, where at each step the sub-blocks with the same number (indicated in the figures) can be executed concurrently. A synchronization takes place when moving from one step to the next. Our observation is that a one-dimensional partition needs a 2-step scheduling to eliminate the contention, whereas a two-dimensional partition needs 4 steps. In general, an x-dimensional partition (1 ≤ x ≤ d) requires a 2^x-step scheduling to ensure correctness. Based on this observation, our source driven parallelization scheme is designed as follows. Given m threads and a d-dimensional data space, first factorize m into p1 × p2 × · · · × pd, then divide dimension i into 2pi segments (1 ≤ i ≤ d), which results in 2p1 × 2p2 × · · · × 2pd blocks in the data space, and finally schedule every 2^d neighboring blocks using the same execution-order pattern. Figure 6 (a) illustrates an example for the 2D case with m = 16 and p1 = p2 = 4. The blocks having the same time stamp (number) can be processed concurrently, provided that the window side length w is less than any side length of a block. In the case where m is very large and this window size condition no longer holds, some threads are dropped to decrease m until the condition is met. Although a similar d′-dimensional partition (d′ < d) with a 2^d′-step scheduling can also be used for a d-dimensional data space, e.g., a one-dimensional partition for the 2D

Fig. 6. Illustration of non-recursive source driven parallelization in the 2D case: (a) a two-dimensional partition with a 4-step scheduling for m = 16, p1 = p2 = 4; (b) a one-dimensional partition with a 2-step scheduling for m = 16
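The 2^d-step scheduling of Figure 6 can be sketched as a parity coloring of block indices: blocks whose indices share the same parity pattern in every dimension receive the same time stamp and can run concurrently, since same-stamp blocks are at least two blocks apart in each dimension. This is a sketch under assumed conventions; the concrete stamp labels need not match the figure:

```python
def time_stamp(block_index):
    """Time stamp of a block in the 2^d-step schedule: the stamp
    encodes the parity of the block's index in each dimension,
    so two blocks with the same stamp never touch the same
    overlapping target window."""
    step = 0
    for k, i in enumerate(block_index):
        step |= (i % 2) << k
    return step

def steps_for_grid(nx, ny):
    """Stamp every block of a 2D grid of blocks (cf. Figure 6 (a))."""
    return [[time_stamp((i, j)) for j in range(ny)] for i in range(nx)]
```

In 2D this yields exactly 4 distinct stamps within every 2 × 2 neighborhood, i.e., the 4-step schedule of the two-dimensional partition.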

Fig. 7. Illustration of recursive source driven parallelization in the 2D case: (a) a two-level recursive partitioning and scheduling for m = 4, p1 = p2 = 2; (b) the scheduling table for the source blocks obtained in (a), with 10 synchronization steps

case, as shown in Figure 6 (b), it is not as scalable as the d-dimensional partition when m is huge. The scheme explained so far works well with uniformly distributed sources, but not with non-uniformly distributed ones, as it can cause unbalanced workloads in the latter case. However, a slight modification can fix this problem. Specifically, since the amount of computation in a block is proportional to the number of sources it contains, the above scheme can be recursively applied to the source space until the number of sources in a block is less than a preset value (δ). The recursion needs to ensure that the window size condition is not violated. Figure 7 (a) depicts an example of two-level recursive partitioning for the 2D case, where m = 4, p1 = p2 = 2, and the shaded regions are assumed to have dense samples. The corresponding concurrent schedule is represented by the table shown in Figure 7 (b). A synchronization takes place after processing each group of blocks pointed to by a table entry. However, within each group, there is no processing order for the blocks, which are dynamically allocated to the threads at run time. The selection of the threshold δ has an impact on the recursion depth as well as on the synchronization latency. A smaller value of δ usually leads to deeper recursions and higher synchronization overheads, but it is also expected to give better load balance. Thus, there is a tradeoff between minimizing synchronization overheads and balancing workloads, and careful selection of δ is important for the overall performance.

2) Target Driven Parallelization

We have also designed a target driven parallelization strategy that performs recursive partitioning and scheduling in the target domain, which has the advantage of requiring no synchronization. Given m threads and a d-dimensional data space, this scheme employs a 2^d-branch geometric tree to divide the target space into blocks recursively until the number of sources associated with each block is below a threshold (σ), and then uses a neighborhood traversal to obtain an execution order for those blocks, based on which it dynamically allocates their corresponding sources to the threads at run time without any synchronization. Figure 8 (a) shows an example for the 2D case with a quadtree partition [9] and its neighborhood traversal. Since target blocks are non-overlapping, there is no data update contention. Although the associated sources of adjacent blocks overlap, there is no coherence issue, as sources are only read from the external memory. The neighborhood traversal helps improve data locality during computation through the exploitation of source reuse. The threshold σ is set to be less than |S|/m and greater than the maximum number of sources in a tile. A smaller value of σ usually results in more target blocks and more duplicated source accesses because of overlapping, but it is also expected to exhibit better load balance, especially with non-uniformly distributed inputs.
Therefore, in selecting a value for σ, there is a tradeoff between minimizing memory access overheads and balancing workloads. In addition, finding the associated sources of a particular target block is not as easy as the other way around. Typically, one needs to compare the coordinates of each source with the boundaries of the target block, which takes O(|S|) time. Our parallelization scheme reduces this time to a constant by aligning the window of a target block with the Cartesian grid of tiles, as depicted in Figure 8 (b). Although this method attaches irrelevant sources to each block, the introduced execution overheads can be reduced by choosing a proper tile size.

4 Experimental Evaluation

We implemented these two parallelization strategies in our tool [20], and evaluated them experimentally on a commercial multicore machine. Two sample inputs are used, both of which are generated in a 2D region of 10 × 10, with a data set size of 15K × 15K. One contains random sources that are uniformly distributed in the data space, whereas the other has samples generated on a

Fig. 8. Example of target driven parallelization in the 2D case: (a) an example of quadtree partition in the target domain and neighborhood traversal among target blocks; (b) an illustration of the window of a target block aligned to the source tiles. In particular, (b) shows the Cartesian grid of tiles and a target partition in bold lines, where the window of a target block is aligned to the tiles.

polar grid with a non-uniform distribution. The kernel is the Gaussian function and the oversampling factor (α) is 2.25. The convolution window side length w is set to 5, in terms of the number of targets affected in each dimension. In this case, each source is involved in the computation with 25 targets, and the total number of targets is 22.5K × 22.5K. The underlying platform is an Intel Harpertown multicore machine [1], which features two quad-core processors operating at 3 GHz, 8 private L1 caches of size 32 KB, and 4 shared L2 caches of size 6 MB, each connected to a pair of cores. We first investigated the relationship between the tile size and the cache size (both L1 and L2) on the target platform, and analyzed the impact of the tile size on the performance of binning (the most time-consuming process in the parallelization phase) and data translation. The tile size is determined based on the cache size so that all the sources and their potentially affected targets in each tile are expected to fit in the cache space. For non-uniform distributions, the tile size is calculated based on the assumption of a uniform distribution. Figure 9 shows the execution time of binning and data translation for the input with uniformly distributed sources (random samples), using the source driven parallelization strategy. The cache size varies from 16 KB to 12 MB. The four groups depicted from left to right are the results of our experiments with 1 core, 2 cores, 4 cores, and 8 cores, respectively. For binning, the time decreases as the cache size (or tile size) increases, irrespective of the number of cores used. The reason is that each tile employs an array data structure to keep track of its sources; when the tile size increases, or equivalently, the number of tiles decreases, it is likely that the data structure of the corresponding tile is in the cache when binning a source, i.e., the cache locality is better. Hence, fewer tiles make binning faster.
In contrast, data translation becomes slower when the tile size increases, since the geometric locality of the sources in each tile worsens, which in turn reduces target reuse. When both binning and data translation are considered, there is a lowest point (minimum execution time) in each group, around a tile size of 1.5 MB, half of the L2 cache size per core (3 MB) on the Harpertown processor. A similar behavior can also be observed for

Fig. 9. Execution time of binning and data translation under uniform distribution, as the cache (tile) size varies

Fig. 10. Execution time of binning and data translation under non-uniform distribution, as the cache (tile) size varies

the input with non-uniformly distributed sources (samples on the polar grid), as shown in Figure 10, but with a shifted minimum value. We then evaluated the efficiency of our two parallelization strategies, using the best tile size found in the first group of experiments. Figure 11 and Figure 12 show their performance with tuned δ and σ, respectively, for non-uniformly distributed sources. We can see that, in both plots, the execution time of data translation scales well with the number of cores, due to improved data locality; however, the execution time of geometric tiling and binning decreases much more slowly as the number of cores increases, since this process involves only physical data movements in memory. In particular, the two execution times become very close to each other on 8 cores. This indicates that our proposed parallelization strategies are suitable for a pipelined NuFFT at this point, where the three steps of the NuFFT, namely parallelization, data translation, and FFT, are expected to have similar latencies in order to achieve balanced pipeline stages for streaming applications. Further, the two parallelization strategies have comparable performance for data translation with uniformly distributed sources, as shown by the group of bars on the left in Figure 13; however, the target driven strategy outperforms the other by 13% and 37% on 4 cores and 8 cores, respectively, with non-uniformly distributed sources, as depicted by the group of bars on the right in Figure 13, where the respective data translation times of the source and target driven schemes are 23.1 and 20.1 seconds on 4 cores, and 17.1 and 10.8 seconds on 8 cores. This difference is expected to become more pronounced as the number of cores increases, since there will be more synchronization overhead in the source driven scheme.
We also conducted a performance comparison between the data translation using the target driven parallelization strategy on the input samples, and the FFT obtained from FFTW with the "FFTW_MEASURE" option [11] [10] on the translated target points. Figure 14 presents the collected results with non-uniformly distributed sources. The graph shows that the execution times of data translation and FFT are comparable on 1, 2, 4, and 8 cores. In particular, the data translation becomes faster than the FFT as the number of cores increases. This good performance indicates that, with our parallelization strategies, the data translation step is no longer the bottleneck in the NuFFT computation.

Fig. 11. Performance of the source driven parallelization with non-uniformly distributed sources

Fig. 12. Performance of the target driven parallelization with non-uniformly distributed sources

Fig. 13. Data translation time with the two parallelization strategies, for uniform and non-uniform distributions, respectively

Fig. 14. Execution time of the data translation using target driven parallelization and the FFT from FFTW

5 Concluding Remarks

In this work, we proposed two new parallelization strategies for the NuFFT data translation step. Both schemes employ geometric tiling and binning to exploit data locality, and use recursive partitioning and scheduling with dynamic task allocation to achieve load balance on emerging multicore architectures. To ensure mutual exclusion in data updates during concurrent computation, the first scheme applies a special parallel scheduling, whereas the second one employs a customized parallel partitioning. Our experimental results show that the proposed parallelization strategies work well with large data set sizes, even for non-uniformly distributed input samples, helping the NuFFT achieve good data translation performance on multicores.

Acknowledgement This research is supported in part by NSF grants CNS 0720645, CCF 0811687, OCI 821527, CCF 0702519, CNS 0720749, and a grant from Microsoft.

References

1. http://www.intel.com
2. Beylkin, G.: On the fast Fourier transform of functions with singularities. Applied and Computational Harmonic Analysis 2, 363–381 (1995)


Y. Zhang et al.

3. Chen, G., Xue, L., et al.: Geometric tiling for reducing power consumption in structured matrix operations. In: Proceedings of IEEE International SOC Conference, pp. 113–114 (September 2006)
4. Cooley, J., Tukey, J.: An algorithm for the machine computation of complex Fourier series. Mathematics of Computation 19, 297–301 (1965)
5. Debroy, N., Pitsianis, N., Sun, X.: Accelerating nonuniform fast Fourier transform via reduction in memory access latency. In: SPIE, vol. 7074, p. 707404 (2008)
6. Duijndam, A., Schonewille, M.: Nonuniform fast Fourier transform. Geophysics 64, 539 (1999)
7. Dutt, A., Rokhlin, V.: Fast Fourier transforms for nonequispaced data. SIAM Journal on Scientific Computing 14, 1368–1393 (1993)
8. Fessler, J.A., Sutton, B.P.: Nonuniform fast Fourier transforms using min-max interpolation. IEEE Transactions on Signal Processing 51, 560–574 (2003)
9. Finkel, R., Bentley, J.: Quad trees: A data structure for retrieval on composite keys. Acta Informatica 4, 1–9 (1974)
10. Frigo, M.: A fast Fourier transform compiler. ACM SIGPLAN Notices 34, 169–180 (1999)
11. Frigo, M., Johnson, S.: FFTW: An adaptive software architecture for the FFT. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 3, pp. 1381–1384 (May 1998)
12. Greengard, L., Lee, J.Y.: Accelerating the nonuniform fast Fourier transform. SIAM Review 46, 443–454 (2004)
13. Knopp, T., Kunis, S., Potts, D.: A note on the iterative MRI reconstruction from nonuniform k-space data. International Journal of Biomedical Imaging 6, 4089–4091 (2007)
14. Liu, Q., Nguyen, N.: An accurate algorithm for nonuniform fast Fourier transforms (NUFFTs). IEEE Microwave and Guided Wave Letters 8, 18–20 (1998)
15. Liu, Q., Tang, X.: Iterative algorithm for nonuniform inverse fast Fourier transform (NU-IFFT). Electronics Letters 34, 1913–1914 (1998)
16. Renganarayana, L., Rajopadhye, S.: An approach to SAR imaging by means of non-uniform FFTs. In: Proceedings of IEEE International Geoscience and Remote Sensing Symposium, vol. 6, pp. 4089–4091 (July 2003)
17. Renganarayana, L., Rajopadhye, S.: A geometric programming framework for optimal multi-level tiling. In: Proceedings of ACM/IEEE Conference on Supercomputing, p. 18 (May 2004)
18. Sorensen, T., Schaeffter, T., Noe, K., Hansen, M.: Accelerating the nonequispaced fast Fourier transform on commodity graphics hardware. IEEE Transactions on Medical Imaging 27, 538–547 (2008)
19. Ying, S., Kuo, J.: Application of two-dimensional nonuniform fast Fourier transform (2-D NuFFT) technique to analysis of shielded microstrip circuits. IEEE Transactions on Microwave Theory and Techniques 53, 993–999 (2005)
20. Zhang, Y., Kandemir, M., Pitsianis, N., Sun, X.: Exploring parallelization strategies for NUFFT data translation. In: Proceedings of EMSOFT (2009)

Multicore and Manycore Programming

Beniamino Di Martino¹, Fabrizio Petrini¹, Siegfried Benkner², Kirk Cameron², Dieter Kranzlmüller², Jakub Kurzak², Davide Pasetto², and Jesper Larsson Träff²

¹ Topic Chairs
² Members

We would like to join the other members of the Program Committee in welcoming you to the Multicore and Manycore Programming Topic of Euro-Par 2010. Euro-Par is one of the primary forums where researchers, architects and designers from academia and industry explore new and emerging technologies in multicore programming and algorithmic development. This year, we received 43 submissions. Each paper was reviewed by at least three reviewers, and we were able to select 17 regular high-quality papers. The Topic Committee members handled the paper review process aiming at a high-quality and timely review process. Each TPC member was able to handle a high load, exceeding 20 papers, providing valuable insight and guidance to improve the quality of the scientific contributions.

The accepted papers discuss very interesting issues. In particular, the paper "Parallel Enumeration of Shortest Lattice Vectors" by M. Schneider and O. Dagdelen presents a parallel version of the shortest lattice enumeration algorithm, using multi-core CPU systems. The paper "Exploiting Fine-Grained Parallelism on Cell Processors" by A. Prell, R. Hoffmann and T. Rauber presents a hierarchically distributed task pool for task parallel programming on Cell processors. The paper "Optimized on-chip-pipelined mergesort on the Cell/B.E." by R. Hulten, C. Kessler and J. Keller works out the technical issues of applying the on-chip pipelining technique to a parallel mergesort algorithm for the Cell processor. The paper "A Language-Based Tuning Mechanism for Task and Pipeline Parallelism" by F. Otto, C. A. Schaefer, M. Dempe and W. F. Tichy tackles the issue, arising with auto-tuners for parallel applications, of requiring several tuning runs to find optimal values for all parameters, by introducing a language-based tuning mechanism. The paper "Near-optimal placement of MPI processes on hierarchical NUMA architecture" by E. Jeannot and G. Mercier describes a novel algorithm called TreeMatch that maps processes to resources in order to reduce the communication cost of the whole application. The paper "Multithreaded Geant4: Semi-Automatic Transformation into Scalable Thread-Parallel Software" by X. Dong, G. Cooperman and J. Apostolakis presents the transformation into a scalable thread-parallel version of an application case study, Geant4, a 750,000-line toolkit first designed in the early 1990s. The paper "Parallel Exact Time Series Motifs Discovery" by A. Narang presents novel parallel algorithms for exact motif discovery on multi-core architectures. The paper "JavaSymphony: A Programming and Execution Environment for Parallel and Distributed Many-core Architectures" by M. Aleem, R. Prodan and T. Fahringer proposes a new Java-based programming model for shared memory multi-core parallel computers as an extension to the JavaSymphony distributed programming environment. The paper "Adaptive Fault Tolerance for Many-Core based Space-Borne Computing" by H. Zima describes an approach for providing software fault tolerance in the context of future deep-space robotic NASA missions, which will require a high degree of autonomy and enhanced on-board computational capabilities, focusing on introspection-based adaptive fault tolerance. The paper "A Parallel GPU Algorithm for Mutual Information based 3D Nonrigid Image Registration" by V. Saxena, J. Rohrer and L. Gong presents the parallel design and implementation of 3D non-rigid image registration for Graphics Processing Units (GPUs). The paper "A Study of a Software Cache Implementation of the OpenMP Memory Model for Multicore and Manycore Architectures" by C. Chen, J. Manzano, G. Gan, G. Gao and V. Sarkar presents an efficient and scalable software cache implementation of OpenMP on multicore and manycore architectures in general, and on the IBM CELL architecture in particular. The paper "Maestro: Data Orchestration for OpenCL Devices" by K. Spafford, J. S. Meredith and J. Vetter introduces Maestro, an open source library for automatic data orchestration on OpenCL devices. The paper "Optimized dense matrix multiplication on a manycore architecture" by E. Garcia, I. E. Venetis, R. Khan and G. Gao utilizes dense matrix multiplication as a case study to present a general methodology to map applications to manycore architectures. The paper "Multi-GPU and Multi-CPU Parallelization for Interactive Physics Simulations" by E. Hermann, B. Raffin, F. Faure, T. Gautier and J. Allard proposes a parallelization scheme for dynamically balancing work load between multiple CPUs and GPUs. The paper "Long DNA Sequence Comparison on Multicore Architectures" by F. Sánchez, F. A. Ramirez and M. Valero analyzes how large-scale biological sequence comparison takes advantage of current and future multicore architectures, and investigates which memory organization is more efficient in a multicore environment. The paper "Programming CUDA-based GPUs to simulate two-layer shallow water flows" by M. De la Asunción, J. Miguel Mantas Ruiz and M. Castro describes an accelerated implementation of a first order well-balanced finite volume scheme for 2D two-layer shallow water systems using GPUs supporting the CUDA programming model and double precision arithmetic. The paper "Scalable Producer-Consumer Pools based on Elimination-Diffraction Trees" by Y. Afek, G. Korland, M. Natanzon and N. Shavit presents new highly distributed pool implementations based on a novel combination of the elimination-tree and diffracting-tree paradigms. Finally, the paper "Productivity and Performance: Improving Consumability of Hardware Transactional Memory through a Real-World Case Study" by H. Wang, G. Yi, Y. Wang and Y. Zou shows how, with well-designed encapsulation, HTM can deliver good consumability for commercial applications.

We would like to take this opportunity to thank the authors who submitted contributions, the Euro-Par chairs Domenico Talia, Pasqua D'Ambra and Mario Guarracino, and the referees, whose highly useful comments and efforts have made this conference and this topic possible.

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 137–138, 2010. © Springer-Verlag Berlin Heidelberg 2010

JavaSymphony: A Programming and Execution Environment for Parallel and Distributed Many-Core Architectures Muhammad Aleem, Radu Prodan, and Thomas Fahringer Institute of Computer Science, University of Innsbruck, Technikerstraße 21a, A-6020 Innsbruck, Austria {aleem,radu,tf}@dps.uibk.ac.at

Abstract. Today, software developers face the challenge of re-engineering their applications to exploit the full power of the new emerging many-core processors. However, a uniform high-level programming model and interface for parallelising Java applications is still missing. In this paper, we propose a new Java-based programming model for shared memory many-core parallel computers as an extension to the JavaSymphony distributed programming environment. The concept of dynamic virtual architecture allows modelling of hierarchical resource topologies ranging from individual cores and multi-core processors to more complex parallel computers and distributed Grid infrastructures. On top of this virtual architecture, objects can be explicitly distributed, migrated, and invoked, enabling high-level user control of parallelism, locality, and load balancing. We evaluate the JavaSymphony programming model and the new shared memory run-time environment for six real applications and benchmarks on a modern multi-core parallel computer. We report scalability analysis results that demonstrate that JavaSymphony outperforms pure Java implementations, as well as alternative related solutions.

1 Introduction

Over the last 35 years, increasing the processor clock frequency was the main technique used to enhance overall processor performance. Today, power consumption and heat dissipation are the two main factors that have driven a design shift towards multi-core architectures. A multi-core processor consists of several homogeneous or heterogeneous processing cores packaged in a single chip [5], with possibly varying computing power, cache size, cache levels, and power consumption requirements. This shift profoundly affects application developers, who can no longer transparently rely on Moore's law to speed up their applications. Rather, they have to re-engineer and parallelise their applications with user-controlled load balancing and locality control to exploit the underlying many-core architectures and ever more complex memory hierarchies. The locality of tasks and data has a significant impact on application performance, as demonstrated by [9,11].

The use of Java for scientific and high-performance applications has increased significantly in recent years. The Java programming constructs related to threads, synchronisation, remote method invocations, and networking are well suited to exploit the medium- to coarse-grained parallelism required by parallel and distributed applications. Terracotta [10], Proactive [1], DataRush [4], and JEOPARD [6] are some of the prominent efforts which have demonstrated the use of Java for performance-oriented applications. Most of these efforts, however, do not provide user-controlled locality of tasks and data to exploit the complex memory hierarchies of many-core architectures.

In previous work, we developed JavaSymphony [2] (JS) as a Java-based programming paradigm for programming conventional parallel and distributed infrastructures such as heterogeneous clusters and computational Grids. In this paper, we extend JS with a new shared memory abstraction for programming many-core architectures. JS's design is based on the concept of dynamic virtual architecture, which allows modelling of hierarchical resource topologies ranging from individual cores and multi-core processors to more complex parallel computers and distributed Grid infrastructures. On top of this virtual architecture, objects can be explicitly distributed, migrated, and invoked, enabling high-level user control of parallelism, locality, and load balancing. The extensions to the run-time environment were performed with minimal API changes, meaning that old distributed JS applications can now transparently benefit from being executed on many-core parallel computers with improved locality.

The rest of the paper is organised as follows. The next section discusses related work. Section 3 presents the JS programming model for many-core processors, including a shared memory run-time environment, dynamic virtual architectures, and locality control mechanisms. In Section 4, we present experimental results on six real applications and benchmarks. Section 5 concludes the paper.

This research is partially funded by the "Tiroler Zukunftsstiftung", Project name: "Parallel Computing with Java for Manycore Computers".

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 139–150, 2010. © Springer-Verlag Berlin Heidelberg 2010

2 Related Work

Proactive [1] is an open source Java-based library and parallel programming environment for developing parallel, distributed and concurrent applications. Proactive provides high-level programming abstractions based on the concept of remote active objects [1], which return future objects after asynchronous invocations. Alongside programming, Proactive provides deployment-level abstractions for applications on clusters, Grids and multi-core machines. Proactive does not provide user-controlled locality of tasks and objects at the processor or core level. The JEOPARD [6] project's main goal is to provide a complete hardware and software framework for developing real-time Java applications on embedded systems and multi-core SMPs. The project aims to provide operating system support in the form of system-level libraries, hardware support in the form of Java processors, and tool support for application development and performance analysis. Although focused on embedded multi-core systems, the API includes functionality for multi-core and NUMA parallel architectures. Terracotta [10] is a Java-based open source framework for application development on multi-cores, clusters, Grids, and Clouds. Terracotta uses a JVM clustering technique through which the application developer sees a combined view of the JVMs. Terracotta targets enterprise and Web applications and does not provide abstractions to hide concurrency from the application developer, who has to take care of these low-level details. Pervasive DataRush [4] is a Java-based high-performance parallel solution for data-driven applications, such as data mining, text mining, and data services. A DataRush application consists of data flow graphs, which represent data dependencies among different components. The run-time system executes the data flow graphs and handles all underlying low-level details related to synchronisation and concurrency. Most of these related works either prevent the application developer from controlling the locality of data and tasks, or engage the developer in time-consuming and error-prone low-level parallelisation details of the Java language such as socket communication, synchronisation, remote method invocation, and thread management. High-level user-controlled locality at the application, object, and task level distinguishes JavaSymphony from other Java-based frameworks for performance-oriented applications.

3 JavaSymphony

JavaSymphony (JS) [2] is a Java-based programming paradigm, originally designed for developing applications on distributed cluster and Grid architectures. JS provides high-level programming constructs which abstract low-level programming details and simplify the tasks of controlling parallelism, locality, and load balancing. In this section, we present extensions to the JavaSymphony programming model to support shared memory architectures, ranging from distributed NUMA and SMP parallel computers to modern many-core processors. We provide a unified solution for user-controlled locality-aware mapping of applications, objects and tasks on shared and distributed memory infrastructures with a uniform interface that shields the programmer from the low-level resource access and communication mechanisms.

3.1 Dynamic Virtual Architectures

Most existing Java infrastructures that support performance-oriented distributed and parallel applications hide the underlying physical architecture or assume a single flat hierarchy of a set (array) of computational nodes or cores. This simplified view does not reflect heterogeneous architectures such as multi-core parallel computers or Grids. Often the programmer fully depends on the underlying operating system on shared memory machines, or on local resource managers on clusters and Grids, to properly distribute the data and computations, which results in significant performance losses.


To alleviate this problem, JS introduces the concept of dynamic virtual architecture (VA) that defines the structure of a heterogeneous architecture, which may vary from a small-scale multi-core processor or cluster to a large-scale Grid. The VAs are used to control mapping, load balancing, code placement and migration of objects in a parallel and distributed environment. A VA can be seen as a tree structure, where each node has a certain level that represents a specific resource granularity. Originally, JavaSymphony focused exclusively on distributed resources and had no capabilities of specifying hierarchical VAs at the level of shared memory resources (i.e. scheduling of threads on shared memory resources was simply delegated to the operating system). In this paper, we extended the JavaSymphony VA to reflect the structure of shared memory many-core computing resources. For example, Figure 1 depicts a four-level VA representing a distributed memory machine (such as a shared memory NUMA) consisting of a set of SMPs on level 2, multi-core processors on level 1, and individual cores on the leaf nodes (level 0). Lines 6−8 in Listing 1 illustrate the JS statements for creating this VA structure.

Fig. 1. Four-level locality-aware VA
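Such a level-annotated resource tree can be mimicked in plain Java. The sketch below uses an illustrative `VANode` class (a stand-in for this discussion, not the actual JavaSymphony `VA` API, whose constructors appear in Listing 1) to build a small NUMA-like hierarchy and count the cores at its leaves:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for a dynamic virtual architecture node: each node has a
// level (resource granularity) and child nodes (finer-grained resources).
class VANode {
    final int level;
    final List<VANode> children = new ArrayList<>();

    VANode(int level) { this.level = level; }

    // Build a balanced subtree whose leaves are cores (level 0).
    static VANode balanced(int level, int fanout) {
        VANode n = new VANode(level);
        if (level > 0)
            for (int i = 0; i < fanout; i++)
                n.children.add(balanced(level - 1, fanout));
        return n;
    }

    int countLeaves() {
        if (children.isEmpty()) return 1;
        int sum = 0;
        for (VANode c : children) sum += c.countLeaves();
        return sum;
    }
}

public class VADemo {
    public static void main(String[] args) {
        // Level-3 root (NUMA machine) with two level-2 SMPs, each holding two
        // dual-core (level-1) processors: 2 * 2 * 2 = 8 cores in total.
        VANode dsm = new VANode(3);
        dsm.children.add(VANode.balanced(2, 2));
        dsm.children.add(VANode.balanced(2, 2));
        System.out.println(dsm.countLeaves()); // 8
    }
}
```

Mapping an object to an inner node of this tree then corresponds to constraining it to the set of cores in that node's subtree.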

3.2 JavaSymphony Objects

Writing parallel JavaSymphony applications requires encapsulating Java objects into so-called JS objects, which are distributed and mapped onto the hierarchical VA nodes (levels 0 to n). A JS object can be either a single-threaded or a multi-threaded object. A single-threaded JS object is associated with one thread which executes all invoked methods of that object. A multi-threaded JS object is associated with n parallel threads, all invoking methods of that object. In this paper, we extend JS with a shared memory programming model based on shared JS (SJS) objects. An SJS object can be mapped to a node of levels 0−3 according to the VA depicted in Figure 1 (see Listing 1, lines 7 and 12) and cannot be distributed onto higher-level remote VA nodes. An SJS object can also be a single-threaded or a multi-threaded object. Listing 1 (line 12) shows the JS code for creating a multi-threaded SJS object. Three types of method invocations can be used with JS as well as with SJS objects: asynchronous (see Listing 1, line 14), synchronous, and one-sided method invocations.
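The single-threaded object semantics (one dedicated thread executes all invoked methods, asynchronous calls return handles) can be approximated with standard Java concurrency utilities. The following is a hedged sketch, not the JavaSymphony implementation: `SingleThreadedObject` and its `ainvoke` method are illustrative names, and `Future` plays the role of a ResultHandle:

```java
import java.util.concurrent.*;

// Approximates a single-threaded JS object: one dedicated thread processes
// all method invocations in order; ainvoke-style calls return a Future.
class SingleThreadedObject {
    private final ExecutorService jobHandler = Executors.newSingleThreadExecutor();

    <T> Future<T> ainvoke(Callable<T> method) {
        return jobHandler.submit(method); // asynchronous invocation
    }

    void shutdown() { jobHandler.shutdown(); }
}

public class SJSDemo {
    public static void main(String[] args) throws Exception {
        SingleThreadedObject obj = new SingleThreadedObject();
        Future<Integer> h1 = obj.ainvoke(() -> 6 * 7);  // queued job 1
        Future<Integer> h2 = obj.ainvoke(() -> 10 + 1); // queued job 2, same thread
        System.out.println(h1.get() + " " + h2.get());  // 42 11
        obj.shutdown();
    }
}
```

A multi-threaded object would correspondingly use a pool of n threads serving the same queue.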

3.3 Object Agent System

The Object Agent System (OAS), a part of the JS run-time system (JSR), manages and processes shared memory jobs for SJS objects and remote memory jobs for JS objects. Figure 2 shows two Object Agents (OAs). An OA is responsible for creating jobs, mapping objects to VAs, and migrating and releasing objects. The shared memory jobs are processed by a local OA, while the remote memory jobs are distributed and processed by remote OAs. An OA has a multi-threaded job queue which contains all jobs related to the multi-threaded JS or SJS objects. A multi-threaded job queue is associated with n job processing threads called Job Handlers. Each single-threaded JS or SJS object has a single-threaded job queue and an associated job handler. The results returned by the shared and remote memory jobs are accessed using ResultHandle objects (Listing 1, line 5), which can be either local object references in the case of SJS objects, or remote object references in the case of distributed JS objects.

Fig. 2. The Object Agent System job processing mechanism
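The queue organisation described here (one shared queue served by n job handlers for multi-threaded objects, plus one serialising queue per single-threaded object) resembles standard executor patterns. A plain-Java sketch with hypothetical names, not the JSR internals:

```java
import java.util.concurrent.*;

// Sketch of an Object Agent's queues: ExecutorService stands in for a job
// queue plus its handler threads. Names are illustrative, not the JS API.
public class ObjectAgentSketch {
    final ExecutorService multiThreadedQueue; // n job handlers, shared queue
    final ConcurrentMap<String, ExecutorService> singleThreadedQueues =
            new ConcurrentHashMap<>();

    ObjectAgentSketch(int nHandlers) {
        multiThreadedQueue = Executors.newFixedThreadPool(nHandlers);
    }

    // Jobs on multi-threaded objects may run on any of the n handlers.
    <T> Future<T> submitMultiThreaded(Callable<T> job) {
        return multiThreadedQueue.submit(job);
    }

    // Jobs on a single-threaded object are serialised on its own handler.
    <T> Future<T> submitSingleThreaded(String objectId, Callable<T> job) {
        return singleThreadedQueues
                .computeIfAbsent(objectId, id -> Executors.newSingleThreadExecutor())
                .submit(job);
    }

    void shutdown() {
        multiThreadedQueue.shutdown();
        singleThreadedQueues.values().forEach(ExecutorService::shutdown);
    }

    public static void main(String[] args) throws Exception {
        ObjectAgentSketch oa = new ObjectAgentSketch(4);
        Future<Integer> r = oa.submitSingleThreaded("worker-1", () -> 21 * 2);
        System.out.println(r.get()); // 42
        oa.shutdown();
    }
}
```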

3.4 Synchronisation Mechanism

JS provides a synchronisation mechanism for one-sided and asynchronous object method invocations. Asynchronous invocations can be synchronised at the individual or at the group level. Group-level synchronisation involves n asynchronous invocations combined in one ResultHandleSet object (see Listing 1, lines 13−14). The group-level synchronisation may block or examine (without blocking) whether one, a certain number, or all threads have finished processing their methods. The JS objects invoked using one-sided method invocations may use barrier synchronisation implemented using barrier objects. A barrier object has a unique identifier and an integer value specifying the number of threads that will wait on that barrier.
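Both synchronisation styles map naturally onto standard Java constructs; the sketch below uses `CompletableFuture.allOf` as a stand-in for a ResultHandleSet's waitAll and `CyclicBarrier` as a stand-in for a JS barrier object (illustrative only, not the JS API):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class GroupSyncDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        AtomicInteger done = new AtomicInteger();

        // Group-level synchronisation: collect n asynchronous invocations and
        // block until all of them have finished (analogous to rhs.waitAll()).
        CompletableFuture<?>[] handles = new CompletableFuture<?>[4];
        for (int i = 0; i < handles.length; i++)
            handles[i] = CompletableFuture.runAsync(done::incrementAndGet, pool);
        CompletableFuture.allOf(handles).join();
        System.out.println(done.get()); // 4: all invocations completed

        // Barrier synchronisation: 4 parties wait until all reach the barrier,
        // then a barrier action runs once.
        CyclicBarrier barrier = new CyclicBarrier(4, () -> System.out.println("all arrived"));
        for (int i = 0; i < 4; i++)
            pool.submit(() -> { try { barrier.await(); } catch (Exception e) { } });
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```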

3.5 Locality Control

The Locality Control Module (LCM) is a part of the JSR that applies and manages locality constraints on the executing JS application by mapping JS objects and tasks onto the VA nodes. In JS, we can specify locality constraints at three levels of abstraction:

1. Application-level locality constraints are applied to all JS or SJS objects and all future data allocations of a JS application. The locality constraints can be specified with the help of the setAppAffinity static method of the JSRegistry class, as shown in line 9 of Listing 1.
2. Object-level locality constraints are applied to all method invocations and data allocations performed by a JS or SJS object (see line 12 in Listing 1). The object-level locality constraints override any previous application-level constraints for that object.
3. Task-level locality constraints are applied to specific task invocations and override any previous object-level or application-level constraints for that task.

Mapping an application, object, or task to a specific core will constrain the execution to that core. Mapping them onto a higher-level VA node will delegate the scheduling on the inferior VA nodes to the JSR. The LCM, in coordination with the OAS job processing mechanism, applies the locality constraints. The jobs are processed by the job handler threads within an OA. The job handlers are JVM threads that are executed by system-level threads such as the POSIX threads on a Linux system. The locality constraints are enforced within the job handler thread by invoking appropriate POSIX system calls through the JNI mechanism. First, a unique system-wide thread identifier is obtained for the job handler thread with the help of the syscall(__NR_gettid) system call. This unique thread identifier is then used as input to the sched_setaffinity function call to schedule the thread on a certain core of the level 1 VA.

3.6 Matrix Transposition Example

Listing 1 displays the kernel of a simple shared memory matrix transposition application, a much-used kernel in many numerical applications, which interchanges a matrix's rows and columns: A[i, j] = A[j, i].

 1  boolean bSingleThreaded = false; int N = 1024; int np = 4;
 2  int[] startRow = new int[np]; int[][] T = new int[N][N]; int[][] A = new int[N][N];
 3  initializeMatrix(A);
 4  JSRegistry reg = new JSRegistry("MatrixTransposeApp");
 5  ResultHandleSet rhs = new ResultHandleSet();
 6  VA smp1 = new VA(2, new int[]{2, 2, 2});
 7  VA smp2 = new VA(2, new int[]{4, 4});
 8  VA dsm = new VA(3); dsm.addVA(smp1); dsm.addVA(smp2);
 9  JSRegistry.setAppAffinity(dsm);                  // Application-level locality
10  for (int i = 0; i < np; i++)
11      startRow[i] = i * (N / np);
12  SJSObject worker = new SJSObject(bSingleThreaded, "workspace.Worker", new Object[]{A, T, N}, smp2); // Object-level locality
13  for (int i = 0; i < np; i++)
14      rhs.add(worker.ainvoke("Transpose", new Object[]{i, startRow[i]}), i);
15  rhs.waitAll();
16  reg.unregister();

Listing 1. Matrix transposition example

The application first initialises the matrix and some variables (lines 1−3) and registers itself with the JSR (line 4). Then it creates a group-level ResultHandleSet object rhs and a level 3 VA node dsm (lines 5−8). Next, the application specifies the application-level locality (line 9), corresponding to a distributed shared memory NUMA parallel computer. The application then partitions the matrix blockwise among np parallel tasks (lines 10−11). In line 12, a multi-threaded SJS object is created and mapped onto the level 2 VA node smp2. The application then invokes np asynchronous methods, adds the returned ResultHandle objects to rhs (lines 13−14), and waits for all invoked methods to finish their execution (line 15). In the end, the application unregisters itself from the JSR (line 16).
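The body of the Worker's Transpose method is not shown in the paper. The following plain-Java sketch shows what each of the np tasks plausibly computes under the blockwise partitioning above (a row-range transpose); the class and method names are illustrative, and plain threads stand in for the np asynchronous JS invocations:

```java
// Plain-Java sketch of the per-task work behind Listing 1's "Transpose"
// invocations: task p writes the transpose of its assigned row block of A
// into T. Illustrative names, not the JS API.
public class TransposeWorker {
    public static void transposeBlock(int[][] a, int[][] t, int startRow, int numRows) {
        int n = a.length;
        for (int i = startRow; i < startRow + numRows; i++)
            for (int j = 0; j < n; j++)
                t[j][i] = a[i][j]; // T[j][i] = A[i][j]
    }

    public static void main(String[] args) throws InterruptedException {
        int n = 4, np = 2;
        int[][] a = new int[n][n], t = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                a[i][j] = i * n + j;

        // np parallel tasks, one row block each (stands in for the ainvoke loop).
        Thread[] tasks = new Thread[np];
        for (int p = 0; p < np; p++) {
            final int start = p * (n / np);
            tasks[p] = new Thread(() -> transposeBlock(a, t, start, n / np));
            tasks[p].start();
        }
        for (Thread task : tasks) task.join(); // rhs.waitAll() analogue
        System.out.println(t[1][2]); // a[2][1] = 9
    }
}
```

The tasks write disjoint column ranges of T, so no synchronisation beyond the final join is needed.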

4 Experiments

We have developed several applications and benchmarks using the JS shared memory programming model with locality awareness. For our experiments, we used a shared memory SunFire X4600 M2 NUMA machine equipped with eight quad-core processors, where each processor has a local memory bank. Each core has private L1 and L2 caches of size 128 KB and 512 KB, respectively, and a shared 2 MB L3 cache. Each processor has three cache-coherent HyperTransport links, each supporting up to 8 GB/s of direct and inter-processor data transfer.

4.1 Discrete Cosine Transformation

The Discrete Cosine Transformation (DCT) algorithm is used to remove non-essential information from digital images and audio data. Typically, it is used to compress JPEG images. The DCT algorithm divides the image into square blocks and then applies the transformation on each block to remove the non-essential information. After that, a reverse transformation is applied to produce a restored image, which contains only the essential data. We developed a shared memory version of JS DCT and compared it with a previous message passing-based implementation [2] in order to test the improvement of the shared memory solution against the old implementation. Figure 3 shows that the shared memory implementation requires approximately 20% to 50% less execution time compared to the RMI-based distributed implementation. The results validate the scalability of the JS shared memory programming model and also highlight its importance on a multi-core system compared to the message passing-based model, which introduces costly communication overheads.

Fig. 3. DCT experimental results
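For reference, the per-block transform at the heart of DCT-based compression is the DCT-II, applied row- and column-wise to each image block. A textbook 1D sketch in plain Java (not the JS DCT application code):

```java
public class DctDemo {
    // 1D DCT-II: X[k] = sum_{i=0..n-1} x[i] * cos(pi/n * (i + 0.5) * k).
    // A textbook sketch; JPEG-style codecs apply it to rows and columns of
    // each square block and then quantise the coefficients.
    public static double[] dct(double[] x) {
        int n = x.length;
        double[] out = new double[n];
        for (int k = 0; k < n; k++) {
            double s = 0;
            for (int i = 0; i < n; i++)
                s += x[i] * Math.cos(Math.PI / n * (i + 0.5) * k);
            out[k] = s;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] block = {1, 1, 1, 1, 1, 1, 1, 1}; // flat block: energy in X[0]
        double[] coeff = dct(block);
        System.out.println(Math.round(coeff[0])); // 8; higher coefficients are near 0
    }
}
```

A flat block compresses to a single non-zero coefficient, which is why smooth image regions compress well under the DCT.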

4.2 NAS Parallel Benchmarks: CG Kernel

The NAS parallel benchmarks [3] consist of five kernels and three simulated applications. We used for our experiments the Conjugate Gradient (CG) kernel which

146

M. Aleem, R. Prodan, and T. Fahringer

2500

4 35 3,5 Speedup

3 2,5 2 15 1,5 JS with locality Proactive Java

1 0,5 0 1

2

4

8

Number of cores

(a) Speedup

16

32

L3 cac che misses (thousands)

4,5

2000 1500 1000 500

App with locality App without locality

0 2

4

8

16

32

Locall DRAM acces ss (thousand ds)

involves irregular long distance communications. The CG kernel uses the power and conjugate gradient method to compute an approximation to the smallest eigenvalue of large sparse, symmetric positive definite matrices. We implemented the kernel in JS using the Java-based implementation of CG [3]. The JS CG implementation is based on the master-worker computational model. The main JS program acts as master and creates multiple SJS worker objects that implement four different methods. The master program asynchronously invokes these methods on the SJS worker objects and waits for them to finish their execution. A single iteration involves several serial computations from the master and many steps of parallel computations from the workers. In the end, the correctness of results is checked using a validation method. We compared our locality-aware JS version of the CG benchmark with a pure Java and a Proactive-based implementation. Figure 4(a) shows that JS exhibits better speedup compared to the Java and Proactive-based implementations for almost all machine sizes. Proactive exhibits the worse performance since the communication among threads is limited to RMI, even within shared memory computers. The speedup of the JS implementation for 32 cores is lower than the Java-based version because of the overhead induced by the locality control, which becomes significant for the relatively small problem size used. In general, specifying locality constraints for n parallel tasks on a n-core machine does not bring much benefit, still it is useful to solve the problems related to thread migration and execution of multiple JVM-level threads by a single native thread. To investigate the effect of locality constraints, we measured the number of cache misses and local DRAM accesses for the locality aware and non-locality aware applications. 
Figure 4(b) illustrates that the number of L3 cache misses increased for the locality-aware JS implementation because of the contention on the L3 cache shared by multiple threads scheduled on the same multi-core processor. The locality constraints keep the threads close to the node where the data was allocated, which results in a high number of local memory accesses that significantly boost the overall performance (see Figure 4(c)). Although the JS CG kernel achieves better speedup as compared to the all other versions, the overall speedup is not impressive because it involves a large number of parallel invocations (11550 − 184800 for 2 − 32 cores) to complete 2500 2000 1500 1000 500

Fig. 4. CG kernel experimental results: (a) speedup, (b) L3 cache misses, (c) local DRAM accesses

JavaSymphony: A Programming Environment for Manycores

Fig. 5. Ray tracing experimental results: (a) speedup, (b) L3 cache misses, (c) local DRAM accesses
75 iterations. It is a communication-intensive kernel, and the memory access latencies play a major role in the application performance. The non-contiguous data accesses also result in more cache misses. This kernel achieves good speedup up to 4 cores (all local memory accesses), while beyond 8 cores the speedup decreases because of the increased memory access latencies of the threads scheduled on remote cores.

4.3 3D Ray Tracing

The ray tracing application, part of the Java Grande Forum (JGF) benchmark suite [8], renders a scene containing 64 spheres at N × N resolution. We implemented this application in JS using the multi-threaded Java version from the JGF benchmarks. The ray tracing application first creates several ray tracer objects, initialises them with scene and interval data, and then renders at the specified resolution. The interval data points to the rows of pixels a parallel thread will render. The JS implementation parallelises the code by distributing the outermost loop (over rows of pixels) to several SJS objects. The SJS objects render the corresponding rows of pixels and write back the resulting image data. We applied locality constraints by mapping objects to cores close to each other to minimise the memory access latencies. We experimentally compared our JS implementation with the multi-threaded Java ray tracer from JGF. As shown in Figure 5(a), the JS implementation achieves better speedup for all machine sizes. Figure 5(c) shows that there is a higher number of local memory accesses for the locality-aware JS implementation, which is the main reason for the performance improvement. Figure 5(b) further shows that the locality-aware version has a higher number of L3 cache misses because of the contention on this shared resource; however, the performance penalty is significantly lower than the locality gain.

4.4 Cholesky Factorisation

The Cholesky factorisation [7] expresses an N × N symmetric positive-definite matrix A (implying that all the diagonal elements are positive and the non-diagonal elements are not too big) as the product of a triangular matrix L and

M. Aleem, R. Prodan, and T. Fahringer

Fig. 6. Cholesky factorisation experimental results: (a) speedup, (b) total DRAM accesses

its transpose L^T: A = L × L^T. This numerical method is generally used to calculate the inverse and the determinant of a positive-definite matrix. We developed a multi-threaded Java-based version and a locality-aware JS-based implementation of the algorithm. The triangular matrix L is parallelised by distributing a single row of values among several parallel tasks; in this way, rows are computed one after another. Figure 6(a) shows that the locality-aware JS version has a better speedup than the Java-based version; however, for machine size 8 both versions show quite similar performance. We observed that the operating system by default scheduled the threads of the Java version one router away from the data (using both the right and left neighbour nodes), which matched the locality constraints we applied to the JS version. Figure 6(b) shows that the locality-based version has a lower number of memory accesses than the non-locality-based version due to the spatial locality.
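The row-after-row dependence described above can be seen in a minimal, sequential sketch of the factorisation. This is our own illustration, not the JS or Java code used in the experiments; in the JS version the independent entries of each row would be distributed among parallel tasks.

```java
// Sequential sketch of the row-wise Cholesky factorisation A = L * L^T.
// Row i depends on rows 0..i-1, so rows are computed one after another;
// within a row, the entries l[i][0..i-1] are mutually independent and
// could be computed by parallel tasks, as the JS implementation does.
public class CholeskySketch {

    static double[][] cholesky(double[][] a) {
        int n = a.length;
        double[][] l = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < i; j++) {        // off-diagonal entries of row i
                double s = a[i][j];
                for (int k = 0; k < j; k++) s -= l[i][k] * l[j][k];
                l[i][j] = s / l[j][j];
            }
            double s = a[i][i];                   // diagonal entry
            for (int k = 0; k < i; k++) s -= l[i][k] * l[i][k];
            l[i][i] = Math.sqrt(s);
        }
        return l;
    }

    public static void main(String[] args) {
        double[][] a = {{4, 2, 2}, {2, 5, 3}, {2, 3, 6}};
        double[][] l = cholesky(a);
        // verify A = L * L^T
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++) {
                double s = 0;
                for (int k = 0; k < 3; k++) s += l[i][k] * l[j][k];
                if (Math.abs(s - a[i][j]) > 1e-9) throw new AssertionError();
            }
        System.out.println("L[0][0] = " + l[0][0]); // 2.0
    }
}
```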

4.5 Matrix Transposition

We developed a multi-threaded Java-based and a locality-aware JS-based version of the matrix transposition algorithm that we introduced in Section 3.6. Again, the locality-aware JS version achieved better speedup, as illustrated in Figure 7(a). The locality constraints mapped the threads to all cores of a processor before moving to another node, which resulted in L3 cache contention and, therefore, more cache misses (see Figure 7(b)). The locality constraints also ensure that there are more local and fewer costly remote memory accesses, which produces better application performance (see Figure 7(c)).

4.6 Sparse Matrix-Vector Multiplication

Sparse Matrix-Vector Multiplication (SpMV) is an important kernel used in many scientific applications; it computes y = A · x, where A is a sparse matrix and x and y are dense vectors. We developed an iterative version of the SpMV kernel where the matrix A was stored in a vector using the Compressed Row Storage format. For this experiment we used a matrix size of 10000 × 10000 and set the number of non-zero elements per row to 1000. We developed both

Fig. 7. Matrix transposition experimental results: (a) speedup, (b) L3 cache misses, (c) local DRAM accesses

Fig. 8. SpMV experimental results: (a) speedup, (b) L3 cache misses, (c) local DRAM accesses

pure Java and JS versions of this kernel by distributing the rows of the sparse matrix among several parallel threads. Each element of the resulting vector y is computed as y_i = Σ_{j=1}^{n} a_{ij} · x_j. SpMV involves indirect and unstructured memory accesses, which negatively affect the pure Java implementation with no locality constraints (see Figure 8(a)), while the locality-aware JS implementation performs significantly better. The vector x is used by all threads and, once the data values from this vector are loaded into the shared L3 cache, they are reused by the other working threads on that processor. This spatial locality results in fewer cache misses, as shown in Figure 8(b). The locality also ensures fewer remote and more local memory accesses, as shown in Figure 8(c), which further improves the application performance.
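The CRS-based kernel can be sketched as follows. The array names (`val`, `colIdx`, `rowPtr`) are our own conventional CRS naming, not the identifiers of the actual JS implementation; distributing disjoint row ranges among threads, as the JS version does, needs no synchronisation because each thread writes a distinct part of y.

```java
// SpMV y = A * x with A in Compressed Row Storage (CRS): `val` holds the
// non-zeros, `colIdx` their column indices, and rowPtr[i]..rowPtr[i+1]
// delimits row i. The outer loop over rows is the parallel dimension.
public class SpMV {

    static double[] multiply(double[] val, int[] colIdx, int[] rowPtr,
                             double[] x) {
        int n = rowPtr.length - 1;
        double[] y = new double[n];
        for (int i = 0; i < n; i++) {
            double s = 0;
            for (int k = rowPtr[i]; k < rowPtr[i + 1]; k++)
                s += val[k] * x[colIdx[k]];   // y_i = sum_j a_ij * x_j
            y[i] = s;
        }
        return y;
    }

    public static void main(String[] args) {
        // 3x3 matrix [[2,0,1],[0,3,0],[4,0,5]] in CRS form
        double[] val = {2, 1, 3, 4, 5};
        int[] colIdx = {0, 2, 1, 0, 2};
        int[] rowPtr = {0, 2, 3, 5};
        double[] y = multiply(val, colIdx, rowPtr, new double[]{1, 1, 1});
        System.out.println(y[0] + " " + y[1] + " " + y[2]); // 3.0 3.0 9.0
    }
}
```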

5 Conclusions

We presented JavaSymphony, a parallel and distributed programming and execution environment for multi-core architectures. JS's design is based on the concept of dynamic virtual architecture, which allows modelling of hierarchical resource topologies ranging from individual cores and multi-core processors to more complex symmetric multiprocessors and distributed-memory parallel computers. On top of this virtual architecture, objects can be explicitly distributed,


migrated, and invoked, enabling high-level user control of parallelism, locality, and load balancing. Additionally, JS provides high-level abstractions to control locality at the application, object, and thread level. We illustrated a number of real-world application and benchmark experiments showing that the locality-aware JS implementation outperforms conventional technologies such as pure Java, which relies entirely on operating system thread scheduling and data allocation.

Acknowledgements The authors thank Hans Moritsch for his contribution to earlier stages of this research.

References

1. Caromel, D., Leyton, M.: ProActive Parallel Suite: From Active Objects-Skeletons-Components to Environment and Deployment. In: Euro-Par 2008 Workshops - Parallel Processing, pp. 423–437. Springer, Heidelberg (2008)
2. Fahringer, T., Jugravu, A.: JavaSymphony: A new programming paradigm to control and synchronize locality, parallelism and load balancing for parallel and distributed computing. Concurr. Comput.: Pract. Exper. 17(7-8), 1005–1025 (2005)
3. Frumkin, M.A., Schultz, M., Jin, H., Yan, J.: Performance and scalability of the NAS parallel benchmarks in Java. In: IPDPS 2003, p. 139a (2003)
4. Norfolk, D.: The growth in data volumes - an opportunity for IT-based analytics with Pervasive DataRush. White Paper (June 2009), http://www.pervasivedatarush.com/Documents/WP%27s/Norfolk%20WP%20-%20DataRush%20%282%29.pdf
5. Kumar, R., Farkas, K.I., Jouppi, N.P., Ranganathan, P., Tullsen, D.M.: Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction. In: MICRO-36 (2003)
6. Siebert, F.: JEOPARD: Java environment for parallel real-time development. In: JTRES 2008, pp. 87–93. ACM, New York (2008)
7. Siegfried, B., Maria, L., Kvasnicka, D.: Experiments with Cholesky factorization on clusters of SMPs. In: Proceedings of the Book of Abstracts and Conference CD, NMCM 2002, pp. 30–31 (July 2002)
8. Smith, L.A., Bull, J.M.: A multithreaded Java Grande benchmark suite. In: Third Workshop on Java for High Performance Computing (June 2001)
9. Song, F., Moore, S., Dongarra, J.: Feedback-directed thread scheduling with memory considerations. In: 16th International Symposium on High Performance Distributed Computing, pp. 97–106. ACM, New York (2007)
10. Terracotta, Inc.: The Definitive Guide to Terracotta: Cluster the JVM for Spring, Hibernate and POJO Scalability. Apress, Berkeley (2008)
11. Yang, R., Antony, J., Janes, P.P., Rendell, A.P.: Memory and thread placement effects as a function of cache usage: A study of the Gaussian chemistry code on the Sun Fire X4600 M2. In: International Symposium on Parallel Architectures, Algorithms, and Networks, pp. 31–36 (2008)

Scalable Producer-Consumer Pools Based on Elimination-Diffraction Trees

Yehuda Afek, Guy Korland, Maria Natanzon, and Nir Shavit

Computer Science Department, Tel-Aviv University, Israel
[email protected]

Abstract. Producer-consumer pools, that is, collections of unordered objects or tasks, are a fundamental element of modern multiprocessor software and a target of extensive research and development. For example, there are three common ways to implement such pools in the Java JDK6.0: the SynchronousQueue, the LinkedBlockingQueue, and the ConcurrentLinkedQueue. Unfortunately, most pool implementations, including the ones in the JDK, are based on centralized structures like a queue or a stack, and thus are limited in their scalability. This paper presents the ED-Tree, a distributed pool structure based on a combination of the elimination-tree and diffracting-tree paradigms, allowing high degrees of parallelism with reduced contention. We use the ED-Tree to provide new pool implementations that compete with those of the JDK. In experiments on a 128-way Sun Maramba multicore machine, we show that ED-Tree based pools scale well, outperforming the corresponding algorithms in the JDK6.0 by a factor of 10 or more at high concurrency levels, while providing similar performance at low levels.

1 Introduction

Producer-consumer pools, that is, collections of unordered objects or tasks, are a fundamental element of modern multiprocessor software and a target of extensive research and development. Pools show up in many places in concurrent systems. For example, in many applications, one or more producer threads produce items to be consumed by one or more consumer threads. These items may be jobs to perform, keystrokes to interpret, purchase orders to execute, or packets to decode. A pool allows push and pop with the usual pool semantics [1]. We call the pushing threads producers and the popping threads consumers. There are several ways to implement such pools. In the Java JDK6.0, for example, they are called "queues": the SynchronousQueue, the LinkedBlockingQueue, and the ConcurrentLinkedQueue. The SynchronousQueue provides a "pairing up" function without buffering; it is entirely symmetric: producers and consumers wait for one another, rendezvous, and leave in pairs. The term unfair refers to the fact that it allows starvation. The other queues provide a buffering mechanism and allow threads to sleep while waiting for their requests

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 151–162, 2010. © Springer-Verlag Berlin Heidelberg 2010


to be fulfilled. Unfortunately, all these pools, including the new scalable SynchronousQueue of Lea, Scott, and Shearer [2], are based on centralized structures like a lock-free queue or a stack, and thus are limited in their scalability: the head of the stack or queue is a sequential bottleneck and source of contention. This paper shows how to overcome this limitation by devising highly distributed pools based on an ED-Tree, a combined variant of the diffracting-tree structure of Shavit and Zemach [3] and the elimination-tree structure of Shavit and Touitou [4]. The ED-Tree does not have a central place through which all threads pass, and thus allows both parallelism and reduced contention. As we explain in Section 2, an ED-Tree uses randomization to distribute the concurrent requests of threads onto many locations so that they collide with one another and can exchange values. It has a specific combinatorial structure called a counting tree [3,5], which allows requests to be properly distributed even if such successful exchanges do not occur. As shown in Figure 1, one can add queues at the leaves of the trees so that requests are either matched up or end up properly distributed on the queues at the tree leaves. By "properly distributed" we mean that requests that do not eliminate always end up in the queues: the collection of all the queues together has the behavior of one large queue. Since the nodes of the tree will form a bottleneck if one uses the naive implementation in Figure 1, we replace them with highly distributed nodes that use elimination and diffraction on randomly chosen array locations as in Figure 2. The elimination and diffraction tree structures were each proposed years ago [4,3] and claimed to be effective through simulation [6]. A single level of an elimination array was also used in implementing shared concurrent stacks [7]. However, elimination trees and diffracting trees were never used to implement real-world structures.
This is mostly due to the fact that there was no need for them: machines with a sufficient level of concurrency and low enough interconnect latency to benefit from them did not exist. Today, multicore machines present the necessary combination of high levels of parallelism and low interconnection costs. Indeed, this paper is the first to show that ED-Tree based implementations of data structures from java.util.concurrent scale impressively on a real machine (a Sun Maramba multicore machine with 2x8 cores and 128 hardware threads), delivering throughput that at high concurrency levels is 10 times that of the existing JDK6.0 algorithms. But what about low concurrency levels? In their elegant paper describing the JDK6.0 SynchronousQueue, Lea, Scott, and Shearer [2] suggest that using elimination techniques may indeed benefit the design of synchronous queues at high loads. However, they wonder whether the benefits of reduced contention achievable by using elimination under high loads can be made to work at lower levels of concurrency, because of the possibility of threads not meeting in the array locations. This paper shows that elimination and diffraction techniques can be combined to work well at both high and low loads. There are two main components that our ED-Tree implementation uses to make this happen. The first is to have each thread adaptively choose an exponentially varying array range from which it


randomly picks a location, and the duration it will wait for another thread at that location. This means that, without coordination, threads will tend to map into a smaller array range as the load decreases, thus increasing the chances of a collision. The second component is the introduction of diffraction for colliding threads that do not eliminate because they are performing the same type of operation. The diffraction mechanism allows threads to continue down the tree at a low cost. The end result is an ED-Tree structure that, as our empirical testing shows, performs well at both high and low concurrency levels.

2 The ED-Tree

Before explaining how the ED-Tree works, let us review its predecessor, the diffracting tree [3] (see Figure 1). Consider a binary tree of objects called balancers, each with a single input wire and two output wires, as depicted in Figure 1. Threads arrive at a balancer and it sends them alternately up and down, so its top wire always carries the same number of threads as the bottom one, or at most one more. The Tree[k] network of width k is a binary tree of balancers constructed inductively by taking two Tree[k/2] networks of balancers and perfectly shuffling their outputs [3]. As a first step in constructing the ED-Tree, we add to the diffracting tree a collection of lock-free queues at the output wires of the tree leaves. To perform a push, threads traverse the balancers from the root to the leaves and then push the item onto the appropriate queue. In any quiescent state, when there are no


Fig. 1. A Tree[4] [3] leading to 4 lock-free queues. Threads pushing items arrive at the balancers in the order of their numbers, eventually pushing items onto the queues located on their output wires. In each balancer, a pushing thread fetches and then complements the bit, following the wire indicated by the fetched value: if the state is 0, the pushing thread changes it to 1 and continues on the top wire (wire 0); if it is 1, the thread changes it to 0 and continues on the bottom wire (wire 1). The tree and queues will end up in the balanced state seen in the figure. The state of the bits corresponds to 5 being the last inserted item, and the next location a pushed item will end up on is the queue containing item 2. Try it! We can add a similar tree structure for popping threads, so that the first will end up on the top queue, removing 1, and so on. This behavior holds for concurrent executions as well: the sequences of values in the queues in all quiescent states, when all threads have exited the structure, can be shown to preserve FIFO order.


Fig. 2. An ED-Tree. Each balancer in Tree[4] is an elimination-diffraction balancer. The start state depicted is the same as in Figure 1, as seen in the pusher's toggle bits. From this state, a push of item 6 by Thread A will not meet any others on the elimination-diffraction arrays, and so will toggle the bits and end up on the 2nd queue from the top. Two pops by Threads B and C will meet in the top balancer's array, diffract to the sides, and end up going up and down without touching the bit, popping the first two values, 1 and 2, from the top two lock-free queues. Thread F, which did not manage to diffract or eliminate, will end up as desired on the 3rd queue, returning the value 3. Finally, Threads D and E will meet in the top array and "eliminate" each other, exchanging the value 7 and leaving the tree. This is our exception to the FIFO rule: to allow good performance at high loads, we allow threads with concurrent push and pop requests to eliminate each other and leave, ignoring the otherwise FIFO order.

threads in the tree, the output items are balanced out so that the top queues have at most one more element than the bottom ones, and there are no gaps. One could implement the balancers in a straightforward way using a bit that threads toggle: they fetch the bit and then complement it using a compareAndSet (CAS) operation, exiting on the output wire they fetched (zero or one). One could keep a second, identical tree for pops, and you would see that from one quiescent state to the next, the items removed are the first ones pushed onto the queue. Thus, we have created a collection of queues that are accessed in parallel, yet act as one quiescent FIFO queue. The bad news is that the above implementation of the balancers using a bit means that every thread that enters the tree accesses the same bit at the root balancer, causing that balancer to become a bottleneck. This is true, though to a lesser extent, with balancers lower in the tree. We can parallelize the tree by exploiting a simple observation similar to the one made about the elimination backoff stack: If an even number of threads pass through a balancer, the outputs are evenly balanced on the top and bottom wires, but the balancer’s state remains unchanged.
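The toggle-bit routing just described (the "Try it!" of Figure 1) can be simulated sequentially. The sketch below is our own single-threaded illustration of a Tree[4], not the concurrent implementation: the bit flip is a plain assignment here, where the real code uses a CAS.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Sequential simulation of the Tree[4] of toggle-bit balancers in Fig. 1.
// Each balancer reads its bit, flips it, and routes the thread on the wire
// given by the value read (0 = top, 1 = bottom). With the shuffle wiring,
// consecutive pushes land on queues 0, 1, 2, 3, 0, ...
public class DiffractingTreeSim {
    static class Balancer {
        private boolean bit = false;          // false plays the role of 0
        int traverse() {
            boolean old = bit;
            bit = !bit;                       // a CAS in the real concurrent code
            return old ? 1 : 0;
        }
    }

    final Balancer root = new Balancer();
    final Balancer[] level2 = {new Balancer(), new Balancer()};
    final List<Deque<Integer>> queues = List.of(
            new ArrayDeque<>(), new ArrayDeque<>(),
            new ArrayDeque<>(), new ArrayDeque<>());

    void push(int item) {
        int w1 = root.traverse();             // which second-level balancer
        int w2 = level2[w1].traverse();       // which of its two outputs
        queues.get(2 * w2 + w1).add(item);    // perfect-shuffle wiring
    }

    public static void main(String[] args) {
        DiffractingTreeSim t = new DiffractingTreeSim();
        for (int i = 1; i <= 5; i++) t.push(i);
        // The balanced state of the figure: queue 0 holds 1 and 5,
        // the others one item each, in FIFO order.
        System.out.println(t.queues); // [[1, 5], [2], [3], [4]]
    }
}
```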


The idea behind the ED-Tree is to combine the modified diffracting tree [3] described above with the elimination-tree techniques [4]. We use an eliminationArray in front of the bit in every balancer, as in Figure 2. If two popping threads meet in the array, they leave on opposite wires, without a need to touch the bit, as it would anyhow remain in its original state. If two pushing threads meet in the array, they also leave on opposite wires. If a push or pop call does not manage to meet another in the array, it toggles the respective push or pop bit (in this sense it differs from prior elimination and/or diffraction balancer algorithms [4,3], which had a single toggle bit instead of separate ones, and provided LIFO rather than FIFO-like access through the bits) and leaves accordingly. Finally, if a push and a pop meet, they eliminate, exchanging items. It can be shown that all push and pop requests that do not eliminate each other provide a quiescently consistent FIFO queue behavior. Moreover, while the worst-case time is log k, where k is the number of lock-free queues at the leaves, in contended cases 1/2 of the requests are eliminated in the first balancer, another 1/4 in the second, 1/8 in the third, and so on, which converges to an average of 2 steps to complete a push or a pop, independent of k.
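The averaging argument can be checked numerically; this small sketch (ours, not part of the ED-Tree code) sums the series Σ i/2^i, in which the fraction 1/2^i of requests eliminated at depth i pays i balancer steps.

```java
// Expected number of balancer steps under contention: half the requests
// finish at the first balancer, a quarter at the second, and so on, so
// E[steps] = sum_{i=1}^{depth} i / 2^i, which converges to 2 no matter
// how deep (log k) the tree is.
public class ExpectedSteps {
    static double expectedSteps(int depth) {
        double e = 0;
        for (int i = 1; i <= depth; i++)
            e += i / Math.pow(2, i);   // fraction 1/2^i pays i steps
        return e;
    }

    public static void main(String[] args) {
        System.out.println(expectedSteps(10));  // already close to 2
        System.out.println(expectedSteps(50));  // ~2.0
    }
}
```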

3 Implementation

As described above, each balancer (see the pseudo-code in Listing 1) is composed of an eliminationArray, a pair of toggle bits, and two pointers, one to each of its child nodes. The last field, lastSlotRange (which has to do with the adaptive behavior of the elimination array), will be described later in this section.

1 public class Balancer {
2   ToggleBit producerToggle, consumerToggle;
3   Exchanger[] eliminationArray;
4   Balancer leftChild, rightChild;
5   ThreadLocal lastSlotRange;
6 }

Listing 1. A Balancer

The implementation of a toggle bit as shown in Listing 2 is based on an AtomicBoolean which provides a CAS operation. To access it, a thread fetches the current value (Line 5) and tries to atomically replace it with the complementary value (Line 6). In case of a failure, the thread retries (Line 6). 1 2 3 4 5 6 7 8

AtomicBoolean toggle = new AtomicBoolean(true); public boolean toggle(){ boolean result; do{ result = toggle.get (); }while(!toggle.compareAndSet(result, !result )); return result; }

Listing 2. The Toggle of a Balancer


The implementation of an eliminationArray is based on an array of Exchangers. Each exchanger (Listing 3) contains a single AtomicReference, which is used as a placeholder for exchanging an ExchangerPackage; the ExchangerPackage is an object used to wrap the actual data and to mark its state and type.

1 public class Exchanger {
2   AtomicReference slot;
3 }
4
5 public class ExchangerPackage {
6   Object value;
7   State state;
8   Type type;
9 }

Listing 3. An Exchanger

Each thread performing either a push or a pop traverses the tree as follows. Starting from the root balancer, the thread tries to exchange its package with a thread performing the complementary operation: a popper tries to exchange with a pusher and vice versa. In each balancer, each thread chooses a random slot in the eliminationArray, publishes its package, and then backs off in time, waiting in a loop to be eliminated. In case of failure, a backoff in "space" is performed several times. The type of space backoff depends on the cause of the failure: if a timeout is reached without meeting any other thread, a new slot is randomly chosen in a smaller range. However, if a timeout is reached after repeatedly failing in the CAS while trying to either pair up or just to swap in, a new slot is randomly chosen in a larger range. Adaptivity. In the backoff mechanism described above, a thread senses the level of contention and, depending on it, randomly selects an appropriate range of the eliminationArray to work on (by iteratively backing off). However, each time a thread starts a new operation, it initializes the backoff parameters, wasting the same unsuccessful rounds of backoff until sensing the current level of contention. To avoid this, we let each thread save its last-used range between invocations (Listing 1, line 5). This saved range is used as (a good guess of) the initial range at the beginning of the next operation. This method proved to be a major factor in reducing the overhead in low-contention situations while allowing the ED-Tree to yield good performance under high contention. The result of the meeting of two threads in each balancer is one of the following four states: ELIMINATED, TOGGLE, DIFFRACTED0, or DIFFRACTED1. In case of ELIMINATED, a popper and a pusher successfully paired up, and the method returns.
If the result is TOGGLE, the thread failed to pair up with any other type of request, so the toggle() method shown in Listing 2 is called, and according to its result the thread accesses one of the child balancers. Lastly, if the state is either DIFFRACTED0 or DIFFRACTED1, this is the result of two operations of the same type meeting in the same location, and the corresponding child balancer, either 0 or 1, is chosen.
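The four-way dispatch above can be sketched as follows. The enum values mirror the paper's states, but the method name `nextChild` and its shape are our own illustration, not the actual library API.

```java
// Sketch of the per-balancer dispatch: after the elimination-array
// rendezvous, the meeting result tells the thread whether it is done
// (ELIMINATED) or which child balancer to descend to.
public class BalancerDispatch {
    enum State { ELIMINATED, TOGGLE, DIFFRACTED0, DIFFRACTED1 }

    // Returns which child (0 or 1) to descend to, or -1 when the request
    // was eliminated and the operation is complete. For TOGGLE, the value
    // returned by the balancer's toggle bit decides the wire.
    static int nextChild(State result, boolean toggleBit) {
        switch (result) {
            case ELIMINATED:  return -1;                 // paired up: done
            case TOGGLE:      return toggleBit ? 1 : 0;  // consult the bit
            case DIFFRACTED0: return 0;                  // same-type meeting
            case DIFFRACTED1: return 1;
        }
        throw new IllegalStateException();
    }

    public static void main(String[] args) {
        System.out.println(nextChild(State.ELIMINATED, false)); // -1
        System.out.println(nextChild(State.TOGGLE, true));      // 1
    }
}
```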

Fig. 3. The unfair synchronous queue benchmark: a comparison of the latest JDK 6.0 algorithm and our novel ED-Tree based implementation. The graph on the left is a zoom in of the low concurrency part of the one on the right. Number of producers and consumers is equal in each of the tested workloads.

As a final step, the item of a thread that reaches one of the tree leaves is placed in the corresponding queue. A queue can be one of the known queue implementations: a SynchronousQueue, a LinkedBlockingQueue, or a ConcurrentLinkedQueue. Using ED-Trees with different queue implementations, we created the following three types of pools. An Unfair Synchronous Queue. When setting the leaves to hold an unfair SynchronousQueue, we get an unfair synchronous queue [2]. An unfair synchronous queue provides a "pairing up" function without buffering. Producers and consumers wait for one another, rendezvous, and leave in pairs. Thus, though it has internal queues to handle temporary overflows of mismatched items, the unfair synchronous queue does not require any long-term internal storage capacity. An Object Pool. With a simple replacement of the former SynchronousQueue with a LinkedBlockingQueue or a ConcurrentLinkedQueue, we get a blocking or a non-blocking object pool, respectively. An object pool is a software design pattern. It consists of a multi-set of initialized objects that are kept ready to use, rather than allocated and destroyed on demand. A client of the object pool will request an object from the pool and perform operations on the returned object. When the client finishes work on an object, it returns it to the pool rather than destroying it. Thus, it is a specific type of factory object. Object pooling can offer a significant performance boost in situations where the cost of initializing a class instance is high, the rate of instantiation of a class is high, and the number of instances in use at any one time is low. The pooled object is obtained in predictable time, whereas the creation of a new object (especially over a network) may take variable time. In this paper we show two versions of an object pool: a blocking and a non-blocking one. The only difference between these pools is the behavior of the popping thread when the pool is empty.
While in the blocking version a popping thread is forced to wait until an


Fig. 4. Throughput of BlockingQueue and ConcurrentQueue object pool implementations. Number of producers and consumers is equal in each of the tested workloads.

available resource is pushed back to the pool, in the non-blocking version it can leave without an object. An example of a widely used object pool is a connection pool. A connection pool is a cache of database connections maintained by the database so that the connections can be reused when the database receives new requests for data. Such pools are used to enhance the performance of executing commands on the database. Opening and maintaining a database connection for each user, especially for requests made to a dynamic database-driven website application, is costly and wastes resources. In connection pooling, after a connection is created, it is placed in the pool and used again so that a new connection does not have to be established. If all the connections are being used, a new connection is made and added to the pool. Connection pooling also cuts down on the amount of time a user waits to establish a connection to the database. Starvation avoidance. Finally, in order to avoid starvation in the queues (though it has never been observed in any of our tests), we limit the time a thread can be blocked in these queues before it retries the whole Tree[k] traversal.
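As an illustration of the pattern (not the ED-Tree implementation), a blocking object pool can be built on a single LinkedBlockingQueue; the class and method names here are our own.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal blocking object pool in the sense described above: objects are
// created once, borrowed by consumers, and returned instead of destroyed.
// In the blocking version, take() waits when the pool is empty; a
// non-blocking variant would use poll() and possibly leave empty-handed.
public class ConnectionPool<T> {
    private final BlockingQueue<T> pool = new LinkedBlockingQueue<>();

    public ConnectionPool(Iterable<T> initial) {
        for (T t : initial) pool.add(t);
    }

    public T borrow() throws InterruptedException {
        return pool.take();      // blocks while no object is available
    }

    public void giveBack(T t) {
        pool.add(t);             // reuse rather than destroy
    }

    public int available() {
        return pool.size();
    }

    public static void main(String[] args) throws InterruptedException {
        ConnectionPool<String> p =
            new ConnectionPool<>(java.util.List.of("conn-1", "conn-2"));
        String c = p.borrow();
        System.out.println(c + ", " + p.available() + " left");
        p.giveBack(c);
        System.out.println(p.available() + " available again");
    }
}
```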

4 Performance Evaluation

We evaluated the performance of our new algorithms on a Sun UltraSPARC T2 Plus multicore machine. This machine has 2 chips, each with 8 cores running at 1.2 GHz and each core with 8 hardware threads, giving 64-way parallelism on a processor and 128-way parallelism across the machine. There is obviously a higher latency when going to memory across the machine (a twofold slowdown). We begin our evaluation in Figure 3 by comparing the new unfair SynchronousQueue of Lea et al. [2], scheduled to be added to the java.util.concurrent library of JDK6.0, to our ED-Tree based version of an unfair synchronous queue. As we explained earlier, an unfair synchronous queue provides a symmetric "pairing up" function without buffering: producers and consumers wait for one another, rendezvous, and leave in pairs.

Scalable Producer-Consumer Pools Based on Elimination-Diffraction Trees

[Figure: two throughput plots over varying Work (0–600); left panel “Nonblocking Resource Pool” compares ConcurrentLinkedQueue with ED-Pool, right panel “Unfair Synchronous Queue” compares JDK-Sync-Queue with ED-Sync-Queue; y-axes show Throughput (× 10³ op/s)]

Fig. 5. Throughput of a SynchronousQueue as the work load changes, for 32 producer and 32 consumer threads

One can see that the ED-Tree behaves similarly to the JDK version up to 8 threads (left figure). Above this concurrency level, the ED-Tree scales nicely while the JDK implementation’s overall throughput declines. At its peak, at 64 threads, the ED-Tree delivers more than 10 times the performance of the JDK implementation. Beyond 64 threads, the threads are no longer placed on a single chip, and traffic across the interconnect causes a moderate performance decline for the ED-Tree version.

We next compare two versions of an object pool. An object pool is a set of initialized objects that are kept ready to use, rather than allocated and destroyed on demand. A consumer requests an object from the pool and performs operations on the returned object. When the consumer has finished using an object, it returns it to the pool rather than destroying it; the object pool is thus a type of factory object. Consumers wait if there is no available object, while producers, unlike the producers of an unfair synchronous queue, never wait for consumers: they add the object to the pool and leave. We compared an ED-Tree BlockingQueue implementation to the LinkedBlockingQueue of JDK 6.0. Comparison results for the object pool benchmark are shown on the left-hand side of Figure 4. The results are quite similar to those for the unfair SynchronousQueue. The JDK’s LinkedBlockingQueue performs better than its unfair SynchronousQueue, yet it still does not scale well beyond 4 threads. In contrast, our ED-Tree version scales well even up to 80 threads because of its underlying use of the LinkedBlockingQueue. At its peak at 64 threads it has 10 times the throughput of the JDK’s LinkedBlockingQueue.

Next, we evaluated implementations of ConcurrentQueue, a more relaxed version of an object pool in which the consumer is not required to wait when no object is available in the pool.
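For illustration only, a minimal blocking object pool might look like the following sketch. This is a single-lock baseline in the spirit of the description above, not the scalable ED-Tree design the paper proposes, and all names are ours:

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>

// Minimal blocking object pool: producers never wait,
// consumers block while the pool is empty.
template <typename T>
class BlockingPool {
public:
    void put(T obj) {                      // producer: deposit and leave
        std::lock_guard<std::mutex> lk(m_);
        items_.push_back(std::move(obj));
        cv_.notify_one();
    }
    T get() {                              // consumer: block until available
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !items_.empty(); });
        T obj = std::move(items_.front());
        items_.pop_front();
        return obj;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<T> items_;
};
```

Every operation here serializes on one lock, which is precisely the contention bottleneck the ED-Tree avoids by diffracting requests across multiple queues.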
We compared the ConcurrentLinkedQueue of JDK 6.0 (which in turn is based on Michael’s lock-free linked list algorithm [8]) to an ED-Tree based ConcurrentQueue (right-hand side of Figure 4). Again, the results show a similar pattern: the JDK’s ConcurrentLinkedQueue scales up to 14 threads and then drops, while the ED-Tree

Y. Afek et al.

[Figure: throughput as the percentage of consumers among all threads grows from 50% to 90%; left panel “Resource pool” compares JDK-Linked-Blocking-Queue with ED-Linked-Blocking-Queue, right panel “Unfair Synchronous Queue” compares JDK-Sync-Queue with ED-Sync-Queue; y-axes show Throughput (× 10³ op/s)]

Fig. 6. Throughput of a resource pool and an unfair SynchronousQueue with 64 threads in total, as the ratio of consumer threads grows from 50% to 90%

based ConcurrentQueue scales well up to 64 threads. At its peak at 64 threads, it has 10 times the throughput of the JDK’s ConcurrentLinkedQueue.

Since the ED-Tree object pool behaves well at very high loads, we wanted to test how it behaves in scenarios where the working threads are not pounding the pool all the time. To this end we emulated varying work loads by adding a delay between accesses to the pool. We tested 64 threads with a set of dummy delays representing work, varying from 30 to 600 ms. The comparison results in Figure 5 show that even as the load decreases, the ED-Tree synchronous queue outperforms the JDK’s synchronous queue. This is due to the low-overhead adaptive nature of the randomized mapping into the eliminationArray: as the load decreases, a thread tends to dynamically shrink the range of array locations into which it tries to map.

Another work scenario we tested is one in which the majority of the pool users are consumers, i.e., the rate of inserting items into the pool is lower than the rate demanded by consumers, and they have to wait until items become available. Figure 6 shows what happens when the number of threads using the pool is fixed (64 threads), but the ratio of consumers changes from 50% to 90%. One can see that the ED-Tree outperforms the JDK’s structures both when the numbers of producer and consumer threads are equal and when there are a

Fig. 7. Elimination rate by levels, as concurrency increases


[Figure: elimination array range of the ED-Sync-Queue (elimination size, 1–8) as the work load varies (Work, 0–800)]

Fig. 8. Elimination range as the work load changes for 32 producer and 32 consumer threads

lot more consumer threads than producer threads (for example, 90% consumers and 10% producers).

Next, we investigated the internal behavior of the ED-Tree as the number of threads grows, checking the elimination rate at each level of the tree. The results appear in Figure 7. Surprisingly, we found that the higher the concurrency, that is, the more threads added, the more threads get all the way down the tree to the queues. At 4 threads, all the requests were eliminated at the top level; throughout the concurrency range, even at 256 threads, 50% or more of the requests were eliminated at the top level of the tree, at least 25% at the next level, and at least 12.5% at the one after that. As we mentioned earlier, this forms a sequence that converges to less than 2 as n, the number of threads, grows. In our particular 3-level ED-Tree the average is 1.375 balancer accesses per request, which explains the excellent overall performance.

Lastly, we investigated how the adaptive method of choosing the elimination range behaves under different loads. Figure 8 shows that, as we expected, the algorithm adapts the working range to the load reasonably well: the more time each thread spent doing work not related to the pool, the more the contention decreased, and correspondingly, the default range used by the threads decreased.

Acknowledgements. This paper was supported in part by grants from Sun Microsystems and Intel Corporation, as well as grant 06/1344 from the Israeli Science Foundation and European Union grant FP7-ICT-2007-1 (project VELOX).

References

1. Herlihy, M., Shavit, N.: The Art of Multiprocessor Programming. Morgan Kaufmann, NY (2008)
2. Scherer III, W.N., Lea, D., Scott, M.L.: Scalable synchronous queues. Commun. ACM 52(5), 100–111 (2009)
3. Shavit, N., Zemach, A.: Diffracting trees. ACM Trans. Comput. Syst. 14(4), 385–428 (1996)


4. Shavit, N., Touitou, D.: Elimination trees and the construction of pools and stacks: preliminary version. In: SPAA 1995: Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures, pp. 54–63. ACM, New York (1995)
5. Aspnes, J., Herlihy, M., Shavit, N.: Counting networks. Journal of the ACM 41(5), 1020–1048 (1994)
6. Herlihy, M., Lim, B., Shavit, N.: Scalable concurrent counting. ACM Transactions on Computer Systems 13(4), 343–364 (1995)
7. Hendler, D., Shavit, N., Yerushalmi, L.: A scalable lock-free stack algorithm. In: SPAA 2004: Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures, pp. 206–215. ACM, New York (2004)
8. Michael, M.M., Scott, M.L.: Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In: PODC 1996: Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing, pp. 267–275. ACM, New York (1996)

Productivity and Performance: Improving Consumability of Hardware Transactional Memory through a Real-World Case Study

Huayong Wang, Yi Ge, Yanqi Wang, and Yao Zou
IBM Research - China
{huayongw,geyi,yqwang}@cn.ibm.com, [email protected]

Abstract. Hardware transactional memory (HTM) is a promising technology for improving the productivity of parallel programming. However, no general agreement has been reached on the consumability of HTM. User experiences indicate that the HTM interface is not straightforward for programmers to adopt when parallelizing existing commercial applications, because of the internal limitations of HTM and the difficulty of identifying shared variables hidden in the code. In this paper we demonstrate that, with well-designed encapsulation, HTM can deliver good consumability. Based on the study of a typical commercial application in supply chain simulation, GBSE, we develop a general scheduling engine that encapsulates the HTM interface. With the engine, we can convert the sequential program to a multi-threaded model without changing any source code of the simulation logic. The time spent on parallelization is reduced from two months to one week, and the performance is close to that of a manually tuned counterpart with fine-grained locks.

Keywords: hardware transactional memory, parallel programming, discrete event simulation, consumability.

1

Introduction

Despite years of research, parallel programming remains a challenging problem in most application domains, especially for commercial applications. Transactional memory (TM), with hardware-based solutions that provide the desired semantics while incurring the least runtime overhead, has been proposed to ameliorate this challenge. With hardware transactional memory (HTM), programmers simply delimit regions of code that access shared data; the hardware ensures that the execution of these regions appears atomic with respect to other threads. As a result, HTM allows programmers to enforce mutual exclusion as simply as with traditional coarse-grained locks, while achieving performance close to that of fine-grained locks. However, the consumability of the HTM programming model is still a point of controversy for commercial application developers, who have to trade off the cost of parallelization against the performance benefits. Better consumability means that applications can benefit from HTM in performance with less cost

P. D’Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 163–174, 2010. © Springer-Verlag Berlin Heidelberg 2010


of modification and debugging. It is not easy to achieve good consumability, for two reasons. First, when parallelizing a sequential program, its source code usually contains a large number of variables shared by functions. Those shared variables are neither indicated explicitly nor protected by locks, and identifying them thoroughly is a time-consuming task. Second, the HTM programming model has limitations: supporting I/O and other irrevocable operations (such as system calls) inside transactions is expensive in terms of implementation complexity and performance loss. Many HTM research efforts adopt application kernels as benchmarks for performance evaluation. Such benchmarks are designed for hardware tuning purposes and cannot reflect the consumability problem of HTM.

How to improve the consumability of HTM is a practical problem at the time of HTM’s imminent emergence in commercial processors. Some research works have tried to provide friendly HTM programming interfaces [1]. However, adding specific semantics into the HTM interfaces raises a new challenge for the compatibility of HTM implementations, which is a major concern for commercial HTM applications. It is more reasonable to solve the problem by combining two approaches: encapsulating a general HTM interface in runtime libraries, and taking advantage of the application’s particularities. In this paper a typical commercial application in the supply chain simulation domain is studied as a running example. We parallelize it in two ways: with HTM and with traditional locks. Based on a quantitative comparison of the costs, we demonstrate that HTM can have good consumability with proper encapsulation for this kind of application. This paper makes the following contributions:

1. It shows that our method can reduce the time of the parallelization work from two months to one week by using a new HTM encapsulation interface on the case study application.
The method is also suitable for most real-world applications in the discrete event simulation domain.
2. Besides showing the improvement in productivity, we also evaluate the overall performance and analyze the factors that influence it. With HTM encapsulation, we achieve speedups of 4.36× and 5.13× with 8 threads under two different algorithms. The result is close to the performance achieved through fine-grained locks.

The remainder of this paper is organized as follows. Section 2 provides background information on HTM and introduces the application studied in this paper. Section 3 describes the details of the parallelization work and explains how our method helps to improve consumability. Section 4 presents and analyzes experimental results. Section 5 concludes.

2

Background

This section introduces HTM concepts as well as the case study application.

2.1

HTM Implementation and Interface

In order to conduct a fair evaluation, we choose a basic HTM implementation, Best Effort Transaction (BET), which is referred to in many papers as a baseline


design [2],[3],[4]. BET uses the data cache as a buffer to save the data accessed by transactions. Each cache line is augmented with an atomic flag (A-flag) and an atomic color (A-color). The A-flag indicates whether the cache line has been accessed by an uncommitted transaction, and the A-color indicates which transaction has accessed the cache line. When a transaction accesses a cache line, the A-flag is set; when a transaction is committed or aborted, the A-flag of each cache line accessed by the transaction is cleared. When two transactions access a cache line simultaneously, and at least one of them modifies the cache line, one of the two transactions must be aborted to ensure transaction atomicity. This is called a “conflict”. If a transaction is aborted, all cache lines modified by the transaction are invalidated. Then, the execution flow jumps to a pre-defined error handler, in which cleanup can be performed before the transaction is re-executed.

From the programmers’ perspective, this basic HTM exposes five primitives. TM BEGIN and TM END mark the start and end of a transaction. TM BEGIN has a parameter “priority”, used to avoid livelock: if a conflict happens, the transaction with lower priority is aborted. In this paper, the lower the value, the higher the priority; therefore, using the time stamp as the priority means that the older transaction aborts the younger one in a conflict. TM SUSPEND and TM RESUME are two primitives used inside a transaction to temporarily stop and restart transactional execution. In the suspend state, memory accesses are treated as regular memory accesses, except that conflicts with the thread’s own suspended transaction are ignored. I/O and system calls are allowed in the suspend state. If a conflict happens in the suspend state and the transaction needs to be aborted, the cancellation is delayed until TM RESUME is executed.
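To make the interface concrete, here is a no-op sketch of how these primitives might frame a transaction. The macro definitions are our own stand-ins (written with underscores, e.g. TM_BEGIN); on real hardware they would be supplied by the HTM runtime:

```cpp
#include <cstdio>

// No-op stand-ins for the HTM primitives described above (assumption:
// the real hardware/runtime provides these; shown only for structure).
#define TM_BEGIN(priority) do { (void)(priority); } while (0)
#define TM_END()           do { } while (0)
#define TM_SUSPEND()       do { } while (0)
#define TM_RESUME()        do { } while (0)

int shared_counter = 0;

// A transaction that updates shared state, suspending itself
// temporarily to perform an irrevocable I/O call.
int process_event(long timestamp) {
    TM_BEGIN(timestamp);      // lower value (older timestamp) = higher priority
    shared_counter += 1;      // would be tracked by the HTM read/write sets
    TM_SUSPEND();             // leave transactional mode for I/O
    std::printf("log: counter=%d\n", shared_counter);
    TM_RESUME();              // re-enter transactional mode
    TM_END();                 // commit, or jump to the error handler on abort
    return shared_counter;
}
```

With the stubs the code simply runs sequentially; the point is only to show where the boundaries and the suspend/resume window sit relative to shared accesses and I/O.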
The last primitive, TM VALIDATE, is used in the suspend state to check whether a conflict has happened.

2.2

Discrete Event and Supply Chain Simulation

Discrete event simulation (DES) is a method to mimic the dynamics of a real system. Parallel DES (PDES) in this paper refers to DES parallelized by multiple threads on a shared-memory multiprocessor platform, rather than distributed DES running on clusters. The core of a parallel discrete event simulator is an event list and a thread pool processing those events. Each event has a time stamp and a handler function. The event list contains all unprocessed events, sorted by time stamp, as shown in Fig. 1. The main processing loop in a simulator repeatedly removes the oldest event from the event list and calls the handler function for the event; the process can thus be viewed as a sequence of event computations. While an event is being processed, it is allowed to add one or more events to the event list with time stamps in the future. The principle of DES is to ensure that events with different time stamps are processed in time stamp order. This is worthy of extra precaution because a thread does not know a priori whether a new event will be added later. Figure 1 demonstrates such a situation. Both threads fetch the event with the oldest time stamp to process. Since event B requires a relatively long processing time, thread


Fig. 1. Event list

1 fetches event D after it finishes processing event A. However, event B adds a new event C to the event list at time t_add, which causes an out-of-order execution of events C and D.

There are two kinds of algorithms to address the problem [5]: conservative and optimistic. Briefly, the conservative algorithm takes precautions to avoid out-of-order processing: each event is processed only when there is no event with a smaller time stamp. To achieve this, a Lower Bound on the Time Stamp (LBTS) is used. In this paper, LBTS is the smallest time stamp in the event list. Events with time stamp equal to LBTS are safe to process; after all of them have been processed, LBTS is advanced to the next time stamp in the event list. The optimistic algorithm, on the contrary, uses a detection-and-recovery approach. Events are allowed to be processed out of time stamp order; however, if the computations of two events conflict, the event with the larger time stamp must be rolled back and reprocessed.

Supply chain simulation is an application of DES. The General Business Simulation Environment (GBSE) is a supply chain simulation and optimization tool developed by IBM [6]. It is a widely used commercial application and earned the 2008 Supply Chain Excellence Award, a top award in this domain. GBSE-C is the C/C++ version of GBSE. Besides the aforementioned features of DES, GBSE-C presents further properties relevant to the parallelization work:

1. As a sequential program, it has a large number of shared variables not protected by locks in event handlers. To parallelize it, all shared variables in the source code must be identified, which is a very time-consuming task.
2. The number of events is large, and events sharing variables are likely processed at different times; therefore the actual conflict rate between events is low.
3. For business reasons, the source code of event handlers is changed frequently.
Considering that the code is developed by experts on supply chains rather than experts on parallel programming, it is desirable to keep the programming style as simple as before, i.e., writing sequential code as usual while achieving the performance of parallel execution.
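The sequential core loop described above can be sketched as follows. This is a minimal illustration with names of our own choosing, not GBSE-C code:

```cpp
#include <functional>
#include <queue>
#include <vector>

// Sequential discrete-event loop: repeatedly pop the oldest event and run
// its handler; a handler may schedule further events with future stamps.
struct Event {
    long timestamp;
    std::function<void(long)> handler;  // receives its own timestamp
};
struct Later {
    bool operator()(const Event& a, const Event& b) const {
        return a.timestamp > b.timestamp;  // min-heap on timestamp
    }
};

class Simulator {
public:
    void add_event(long ts, std::function<void(long)> h) {
        list_.push(Event{ts, std::move(h)});
    }
    void run() {
        while (!list_.empty()) {
            Event e = list_.top();
            list_.pop();
            e.handler(e.timestamp);  // may call add_event() with a future ts
        }
    }
private:
    std::priority_queue<Event, std::vector<Event>, Later> list_;
};
```

Because the loop always pops the minimum timestamp, events are processed in time stamp order even when handlers insert new events on the fly; parallelizing exactly this loop is what the conservative and optimistic algorithms below address.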


Using a traditional lock-based technique to parallelize GBSE-C makes no sense, because the business processing logic would need to be modified. It is unreasonable to depend on the business logic programmers, who have little knowledge of parallelization, to explicitly specify which parts should be protected by locks and which should not. On the contrary, our HTM-based approach modifies only the simulation service layer and is transparent to business logic programmers.

3

Using Transactions in GBSE-C

For the purpose of parallelization, GBSE-C can be viewed as a two-layer program. The upper layer is the business logic, such as the order handling process, inventory control process, and procurement process. These logics are implemented in the form of event handlers. The lower layer is the event scheduling engine, which is the core of the simulator. The engine consists of two modules:

1. The resource management module encapsulates commonly used functions, such as malloc, I/O, and operations on the event list. After encapsulation, those functions are safe to use inside transactions.
2. The scheduling management module is in charge of event scheduling and processing. Threads from a thread pool fetch suitable events from the event list and process them. The scheduling policy is either conservative or optimistic, which defines the event processing mechanism.

3.1

Resource Management

The resource management module includes a memory pool, an I/O wrapper API, and the event list interface.

Memory Pool. If a chunk of memory is allocated inside a transaction and the transaction is aborted, the allocated memory will never be released. Moreover, memory allocation from a global pool usually requires accessing shared variables, and hence incurs conflicts between transactions. To solve these problems, we implement a memory pool per thread. Inside a transaction, the new functions TM MALLOC and TM FREE replace the original functions malloc and free. TM MALLOC obtains a chunk of memory from the corresponding memory pool and records the address and size of the chunk in a thread-specific table (the MA table). If the transaction is aborted, the error handler releases all memory chunks recorded in the table; otherwise, the allocated memory remains valid. TM FREE returns the memory chunk to the pool and deletes the corresponding entry in the table.

I/O Wrapper. There are many disk and network I/O operations in GBSE-C. Functionally, they are used to access files and remote services, as well as for database operations. To simplify programming, these operations have already been encapsulated by helper functions. I/O operations inside a transaction are


re-executed if the transaction is re-executed. Based on whether an I/O operation can tolerate this side effect, operations can be classified into two categories. The first category is idempotent, such as reading a read-only file or printing for debugging purposes; these can be re-executed without harmful impact. The second category is non-idempotent, for example appending a row of data to a database table; these cannot be executed multiple times. We encountered many cases in the first category. What we do is add TM SUSPEND and TM RESUME at the start and end of those helper functions; the code in event handlers can remain unchanged if it calls the helper functions for these operations.

The cases in the second category are more interesting, and we have two approaches to handle them. First, we use I/O buffering, as described in previous work [7]. Briefly, we buffer the data of the I/O operations until the transaction finishes. If the transaction is committed, we perform the I/O operations with the data in the buffer; otherwise, we discard the data in the buffer. However, this method is inconvenient for some complex cases where the buffered I/O operation influences later operations in the transaction. We therefore propose another method: adding a new flag, “serial mode”, to each event. If it is set, the event handler contains complex I/O operations that should be handled in traditional sequential mode, where only one event is processed at a time. After this event is processed, the scheduler resumes the parallel execution mode.

Event List Interface. An event handler may add a new event to the event list through the event list interface. The interface exposes a function ADD EVENT for programmers. The function does not manipulate the event list immediately. Instead, it records the function call and the corresponding parameters in a thread-specific table (the ELO table). After the transaction is committed, extra code after TM END executes the operations recorded in the ELO table.
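The record-and-replay idea behind the ELO table can be sketched as follows. The class and method names here are our own illustration, not GBSE-C's:

```cpp
#include <functional>
#include <vector>

// Deferred event-list operations: ADD_EVENT-style calls are only recorded
// inside the transaction; the commit path replays them, the abort path
// discards them, so an aborted transaction leaves no trace.
class EloTable {
public:
    void record(std::function<void()> op) { ops_.push_back(std::move(op)); }
    void on_commit() {                  // run after TM_END succeeds
        for (auto& op : ops_) op();
        ops_.clear();
    }
    void on_abort() { ops_.clear(); }   // aborted: the operations never happened
private:
    std::vector<std::function<void()>> ops_;
};
```

The same deferral pattern works for any side effect that must not escape an aborted transaction, which is why the engine uses it for both event-list updates and buffered I/O.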
If the transaction is aborted, the table is simply cleared.

3.2

Scheduling Management

In order to make TM programming transparent, the HTM interface is encapsulated in the event scheduler; event handler programmers need not be aware of the HTM programming primitives. The engine supports both conservative and optimistic scheduling algorithms. It also has a special scheduling policy supporting conflict prediction; this policy is based on the conservative algorithm and requires a change to the HTM implementation.

The Conservative Algorithm. The main processing loop of each thread in the conservative algorithm includes the following steps:

1. Fetch an event with the smallest time stamp in the event list.
2. If the time stamp of this event is larger than LBTS, return the event to the event list and wait until LBTS is increased. This check guarantees that an event is processed only when there is no event with a smaller time stamp.
3. Execute the event handler in the context of a transaction. After the transaction is committed, execute the delayed event list operations (if any), and clear both the ELO and MA tables.


4. If all events with time stamp equal to LBTS have been processed, LBTS is increased to the next time stamp in the event list. After that, a notification about the LBTS change is sent to all other threads, and the threads blocked on LBTS wake up.

In the conservative algorithm, threads only process events with the same time stamp (LBTS). The thread pool will starve if the number of events with time stamp equal to LBTS is small: the parallelism of the conservative algorithm is limited by the event density, i.e., the average number of events per time stamp in the event list. In addition, the barrier synchronization used to prevent out-of-order event processing can be costly when event processing times are uneven; some threads may go idle while others are processing events with long execution times. Both effects degrade the performance of the conservative algorithm.

Fig. 2. An example of the optimistic algorithm

The Optimistic Algorithm. The optimistic algorithm overcomes the shortcomings of the conservative algorithm, since it allows execution out of time stamp order. However, in traditional implementations the optimistic algorithm does not necessarily improve performance in general, because it incurs overhead for checkpoint and rollback operations [8]. Using HTM, that overhead is minimized, since checkpoint and rollback are done by hardware. The main processing loop in the optimistic algorithm includes the following steps:

1. Fetch an event with the smallest time stamp in the event list.
2. Execute the event handler in the context of a transaction. In this case the transaction priority is equal to the time stamp.
3. After the execution of the event handler, suspend the transaction and wait until the event’s time stamp is equal to LBTS. During this period, the handler periodically wakes up and checks whether a conflict has been detected. If so, abort the transaction and re-execute the event handler.
4. After the transaction is committed, execute the delayed event list operations (if any), and clear both the ELO and MA tables.
5. If all events with time stamp equal to LBTS have been processed, LBTS is increased to the next time stamp in the event list. After that a notification


about the LBTS change is sent to all other threads, and the threads blocked on LBTS wake up.

Figure 2 shows an example of the optimistic algorithm. At the beginning, there are three events (A, B, and C) in the event list and LBTS is equal to k. The three events are being concurrently processed by three separate threads. Since the time stamp of event C is larger than LBTS, event C is blocked and put into the suspend state; the thread releases the CPU while blocked. Meanwhile, event B adds a new event D to the event list with time stamp k+1 by calling the function ADD EVENT. After events A and B are finished, LBTS is increased to k+1, and a notification about the LBTS change is sent to event C. The execution of event C is woken up, and the corresponding transaction will be committed soon. Event D can also be processed whenever there is a free thread in the thread pool. If a conflict happens between events A and C, event C is aborted, since its priority is lower. From the example, we can see that although events are processed out of order, they are committed strictly in order, which guarantees the correctness of the simulation. Besides the low cost of checkpoint and rollback, the HTM-based optimistic algorithm has another advantage over traditional implementations: fine-grained rollback. In an HTM-based implementation, only those events really affected are rolled back, since each transaction has its own checkpoint. The optimistic algorithm might suffer from overly optimistic execution, i.e., some threads may advance too far in the event list. The consequences are two-fold. First, the conflict rate balloons with the increase in the number of events being processed. Second, overflow might happen when many transactions are executed concurrently. Both limit the parallelism of the optimistic algorithm.

Scheduler with Conflict Prediction. The performance of the case study application is limited by the high conflict rate when the number of threads is large.
Without appropriate event scheduling mechanisms, a large thread pool may consume more on-chip power without improving performance, or may even cause performance degradation. In our engine, the event scheduler supports a special policy that predicts conflicts between events with the help of HTM. The prediction directs each thread to process suitable events so as to avoid unnecessary conflicts. Conflict prediction is feasible based on the following observations:

– Data locality. In PDES applications, executions of an event show data locality. The memory footprints of previous executions give hints about the memory addresses to be accessed in the future.
– Conflict position. Some event handlers are composed with a common pattern: first read configuration data, then do some computation, and finally write back the computed results. The positions of conflicts caused by shared-variable accesses are usually at the beginning and end of the event handler. The time span between the conflict position and the end of the transaction is highly relevant to the conflict probability.

In order to design the scheduler with conflict prediction, we modify the HTM implementation by adding two Bloom filters in each processor to record the


transaction’s read and write sets. When the transaction is committed, the contents of the Bloom filters, called signatures, are dumped into memory at pre-defined addresses. When the transaction is aborted, the Bloom filters are cleared and the conflicting address is recorded. Each event maintains a conflict record table (CRT) for conflict prediction, which contains the signatures and other transaction statistics. Some important fields of the statistics are:

– Total Execution Cycles (TEC): total execution cycles of a transaction between TM BEGIN and TM END.
– Conflicting Addresses (CA): conflicting addresses of a transaction.
– Conflicting Execution Cycles (CEC): execution cycles of a transaction between TM BEGIN and the conflict.

Before an event is processed, the scheduler first checks whether the event’s conflicting addresses are in the signatures of any event under execution. If so, a conflict may happen if this event is executed. The scheduler then uses the possible conflict time span (PCT) as a metric to determine the conflict probability. PCT refers to the total time span within which, if one event with lower priority starts to execute, it will be aborted by the other running event with higher priority. The smaller the PCT, the less probable the conflict. The value of PCT between event A and event B can be computed according to Eq. (1):

PCT_AB = (TEC_A − CEC_A) + (TEC_B − CEC_B)    (1)

Fig. 3 shows three cases of transaction conflict between events A and B. In the first case, the two transactions have a conflicting memory access at the very beginning; the transaction with lower priority can be aborted if it starts to execute at any time within the time span from point M to N.
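Eq. (1) translates directly into a small helper. This is an illustrative sketch; the struct and field names are ours:

```cpp
// Possible-conflict time span from Eq. (1): the cycles remaining after each
// transaction's conflict point, summed over the two transactions.
struct TxStats {
    long tec;  // total execution cycles (TM_BEGIN to TM_END)
    long cec;  // cycles from TM_BEGIN to the conflicting access
};

long pct(const TxStats& a, const TxStats& b) {
    return (a.tec - a.cec) + (b.tec - b.cec);
}
```

An early conflict (CEC near 0 for both) maximizes PCT, while a conflict just before commit (CEC near TEC) drives PCT toward 0, matching the three cases of Fig. 3.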
This case has the largest PCT (PCT_max ≈ TEC_A + TEC_B). In the second case, the conflict position is located in the middle, and the conflict probability is lower than in the first case, with PCT_mid ≈ (TEC_A + TEC_B)/2. Conflicts are rare in the third case, with PCT_min ≈ 0. Based on the CRT and the PCT, the scheduler can predict conflicts and decide the scheduling policy for each event.
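The prediction core described above can be sketched in plain C. This is an illustrative software model, not the hardware implementation: the signature width, the single hash function, and all names (`signature_t`, `sig_may_contain`, `crt_stats_t`, `pct`) are assumptions for exposition.

```c
#include <stdbool.h>
#include <stdint.h>

#define SIG_BITS 256  /* signature width; the real filters are hardware-sized */

/* Bloom-filter signature of a transaction's read or write set. */
typedef struct { uint64_t bits[SIG_BITS / 64]; } signature_t;

/* One multiplicative hash over the cache-line address (illustrative only). */
static unsigned sig_hash(uintptr_t addr) {
    return (unsigned)(((addr >> 6) * 2654435761u) % SIG_BITS);
}

static void sig_add(signature_t *s, uintptr_t addr) {
    unsigned b = sig_hash(addr);
    s->bits[b / 64] |= 1ULL << (b % 64);
}

/* Bloom filters may report false positives but never false negatives. */
static bool sig_may_contain(const signature_t *s, uintptr_t addr) {
    unsigned b = sig_hash(addr);
    return (s->bits[b / 64] >> (b % 64)) & 1;
}

/* CRT statistics of one transaction: TEC and CEC as defined in the text. */
typedef struct { uint64_t tec, cec; } crt_stats_t;

/* Eq. 1: PCT_AB = (TEC_A - CEC_A) + (TEC_B - CEC_B).
 * The smaller the PCT, the less probable a conflict. */
static uint64_t pct(const crt_stats_t *a, const crt_stats_t *b) {
    return (a->tec - a->cec) + (b->tec - b->cec);
}
```

Note how the formula reproduces the three cases of Fig. 3: early conflicts (CEC ≈ 0) give PCT ≈ TEC_A + TEC_B, while late conflicts (CEC ≈ TEC) give PCT ≈ 0.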

Fig. 3. Three cases with different PCT: (1) early conflict, PCT ≈ TEC_A + TEC_B; (2) middle conflict, PCT ≈ (TEC_A + TEC_B)/2; (3) late conflict, PCT ≈ 0

Fig. 4. Transaction conflict rate

H. Wang et al.

4 Productivity and Performance Evaluation

We have parallelized GBSE-C through two approaches: the new approach based on HTM and the traditional approach based on locks. Since the work was done by the same group of developers, the time spent implementing the different approaches is a straightforward indication of productivity.
– HTM-based approach. We spent about one week finishing the parallelization work for both the conservative and optimistic algorithms. Three days were used to identify I/O operations and handle them accordingly. This is not very difficult because the I/O operations are well encapsulated by helper functions. The remaining time was used to encapsulate the HTM interface in the scheduling engine.
– Lock-based approach. We spent two months completing the parallelization work for the conservative algorithm. About half of the time was used to identify the variables shared by events. This task is time-consuming because understanding the program logic in the event handlers requires some domain knowledge, and the usage of pointers exacerbates the problem. Performance tuning and debugging took another three weeks. This included shortening critical sections (fine-grained locks), choosing proper locks (pthread locks vs. spin locks), and preventing deadlocks. Since the execution order of events with the same time stamp is not deterministic, some bugs were found only after running the program many times. We did not implement the optimistic algorithm using locks because implementing checkpointing and rollback in software is difficult in general.
The performance evaluation is carried out on an IBM full system simulator [9], which supports configurable cycle-accurate simulation for POWER/PowerPC based CMPs and SMPs. The target processor contains 4 clusters connected by an on-chip interconnect. Each cluster includes 4 processor cores and a shared L2 cache. Each core runs at a frequency of 2 GHz, with an out-of-order multi-issue pipeline. We have studied the transaction size in this application.
The sizes are measured separately for the read set and the write set, which contain the data read and written by a transaction, respectively. 88% of the read sets and 91% of the write sets are smaller than 4 KB, indicating that most transactions are small. The sizes of the read and write sets of the largest transaction approach 512 KB. Since the L2 data cache (2 MB) is much larger, the execution of a single transaction does not incur overflow. Figure 4 shows the transaction conflict rate of the conservative and optimistic algorithms. The conflict rate is defined as the number of conflicts divided by the total number of transactions. The optimistic algorithm has a higher conflict rate than the conservative algorithm, since it attempts more parallel execution. As the number of threads in the thread pool increases, the conflict rate also increases. Generally, the conflict rate stays below 24%, which confirms that many events are actually independent and can be processed in parallel.


Fig. 5. Speedup from the three algorithms

Fig. 6. Speedup of the scheduler with conflict prediction (CP) compared with the normal scheduler

Figure 5 illustrates the observed speedup of the three approaches: the conservative algorithm with fine-grained locks, the conservative algorithm with HTM, and the optimistic algorithm with HTM. The first approach represents the best performance achievable through manual optimization. The conservative algorithm with HTM achieves performance that is only slightly lower, indicating that HTM is effective for the parallelization work. The optimistic algorithm with HTM is better than the other two approaches. It gains more than a 2× speedup with only two threads, because the optimistic algorithm reduces the synchronization overhead at each time stamp and the scheduler has more chances to balance workloads among threads. Overall, through HTM we achieve speedups of 4.36× and 5.13× with 8 threads in the conservative and optimistic algorithms, respectively, and a 5.69× speedup with 16 threads in the optimistic algorithm. Besides the comparison with the normal event scheduler, we also conduct an experiment to evaluate the performance of the scheduler with conflict prediction. Fig. 6 illustrates the speedup and conflict rate of the two schedulers with the conservative algorithm. Because of the extra overhead of conflict prediction, the scheduler with conflict prediction performs slightly worse than the normal one when the thread count is low. But as the thread count increases, it gradually outperforms the normal scheduler and the performance gap grows. With 16 threads, its speedup is 14% higher than that of the normal scheduler. We can also see that the conflict rate is reduced by the conflict-prediction scheduler.

5 Conclusion

In this paper, we demonstrate that encapsulated HTM can provide good consumability for real-world applications. Based on the findings in this paper, we will further investigate the consumability of HTM in a broader range of applications in the future.


References
1. McDonald, A., Chung, J., Carlstrom, B.D., Minh, C.C., Chafi, H., Kozyrakis, C., Olukotun, K.: Architectural semantics for practical transactional memory. In: Proc. of the 33rd International Symposium on Computer Architecture, pp. 53–65. IEEE, Los Alamitos (2006)
2. Wang, H., Hou, R., Wang, K.: Hardware transactional memory system for parallel programming. In: Proc. of the 13th Asia-Pacific Computer Systems Architecture Conference, pp. 1–7. IEEE, Los Alamitos (2008)
3. Baugh, L., Neelakantam, N., Zilles, C.: Using hardware memory protection to build a high-performance, strongly-atomic hybrid transactional memory. In: Proc. of the 35th International Symposium on Computer Architecture, pp. 115–126. IEEE, Los Alamitos (2008)
4. Chung, J., Baek, W., Kozyrakis, C.: Fast memory snapshot for concurrent programming without synchronization. In: Proc. of the 23rd International Conference on Supercomputing, pp. 117–125. ACM, New York (2009)
5. Perumalla, K.: Parallel and distributed simulation: traditional techniques and recent advances. In: Proc. of the 2006 Winter Simulation Conference, pp. 84–95 (2006)
6. Wang, W., Dong, J., Ding, H., Ren, C., Qiu, M., Lee, Y., Cheng, F.: An introduction to IBM General Business Simulation Environment. In: Proc. of the 2008 Winter Simulation Conference, pp. 2700–2707 (2008)
7. Chung, J., Chafi, H., Minh, C., McDonald, A., Carlstrom, B., Kozyrakis, C., Olukotun, K.: The common case transactional behavior of multithreaded programs. In: Proc. of the 12th International Symposium on High-Performance Computer Architecture, pp. 166–177. IEEE, Los Alamitos (2006)
8. Poplawski, A., Nicol, D.: Nops: a conservative parallel simulation engine for TeD. In: Proc. of the 12th Workshop on Parallel and Distributed Simulation, pp. 180–187 (1998)
9. Bohrer, P., Peterson, J., Elnozahy, M., Rajamony, R., Gheith, A., Rockhold, R.: Mambo: a full system simulator for the PowerPC architecture. ACM SIGMETRICS Performance Evaluation Review 31(4), 8–12 (2004)

Exploiting Fine-Grained Parallelism on Cell Processors

Ralf Hoffmann, Andreas Prell, and Thomas Rauber
Department of Computer Science, University of Bayreuth, Germany
[email protected]

Abstract. Driven by increasing specialization, multicore integration will soon enable large-scale chip multiprocessors (CMPs) with many processing cores. In order to take advantage of increasingly parallel hardware, independent tasks must be expressed at a fine level of granularity to maximize the available parallelism and thus potential speedup. However, the efficiency of this approach depends on the runtime system, which is responsible for managing and distributing the tasks. In this paper, we present a hierarchically distributed task pool for task parallel programming on Cell processors. By storing subsets of the task pool in the local memories of the Synergistic Processing Elements (SPEs), access latency and thus overheads are greatly reduced. Our experiments show that only a worker-centric runtime system that utilizes the SPEs for both task creation and execution is suitable for exploiting fine-grained parallelism.

1 Introduction

With the advent of chip multiprocessors (CMPs), parallel computing is moving into the mainstream. However, despite the proliferation of parallel hardware, writing programs that perform well on a variety of CMPs remains challenging. Heterogeneous CMPs achieve higher degrees of efficiency than homogeneous CMPs, but are generally harder to program. The Cell Broadband Engine Architecture (CBEA) is the most prominent example [1,2]. It defines two types of cores with different instruction sets and DMA-based "memory flow control" for data movement and synchronization between main memory and software-managed local stores. As a result, exploiting the potential of CBEA-compliant processors requires significant programming effort. Task parallel programming provides a flexible framework for programming homogeneous as well as heterogeneous CMPs. Parallelism is expressed in terms of independent tasks, which are managed and distributed by a runtime system. Thus, the programmer can concentrate on the task structure of an application, without being aware of how tasks are scheduled for execution. In practice, however, the efficiency of this approach strongly depends on the runtime system and its ability to deal with almost arbitrary workloads. Efficient execution across CMPs with different numbers and types of cores requires the programmer to maximize the available parallelism in an application. This is typically achieved

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 175–186, 2010.
© Springer-Verlag Berlin Heidelberg 2010

176

R. Hoffmann, A. Prell, and T. Rauber

by exposing parallelism at a fine level of granularity. The finer the granularity, the greater the potential for parallel speedup, but also the greater the communication and synchronization overheads. In the end, it is the runtime system that determines the degree to which fine-grained parallelism can be exploited. When we speak of fine-grained parallelism, we assume task execution times on the order of 0.1–10 µs. Given the trend towards large-scale CMPs, it becomes more and more important to provide support for fine-grained parallelism. In this work, we investigate task parallel programming with fine-grained tasks on a heterogeneous CMP, using the example of the Cell processor. Both currently available incarnations, the Cell Broadband Engine (Cell/B.E.) and the PowerXCell 8i, comprise one general-purpose Power Processing Element (PPE) and eight specialized coprocessors, the Synergistic Processing Elements (SPEs). Although the runtime system we present is tailored to the Cell's local store based memory hierarchy, we believe many concepts will be directly applicable to future CMPs. In summary, we make the following contributions:
– We describe the implementation of a hierarchically distributed task pool designed to take advantage of CMPs with local memories, such as the Cell processor. Lowering the task scheduling overhead is the first step towards exploiting fine-grained parallelism.
– Our experiments indicate that efficient support for fine-grained parallelism on Cell processors requires task creation by multiple SPEs instead of by the PPE. For this reason, we provide functions to offload the process of task creation to the SPEs. Although this may seem trivial, such sequential bottlenecks must be avoided in order to realize the full potential of future CMPs.

2 Distributed Task Pools

Task pools are shared data structures for storing parallel tasks of an application. As long as there are tasks available, a number of threads keep accessing the task pool to remove tasks for execution and to insert new tasks upon creation. Task pools generalize the concept of work queueing. While a work queue usually implies an order of execution such as LIFO or FIFO, a task pool need not guarantee such an order. To improve scheduling, task pools are often based on the assumption that inserted tasks are free of dependencies and ready to run. Implementing a task pool can be as simple as setting up a task queue that is shared among a number of threads. While such a basic implementation might suffice for small systems, frequent task pool access and increasing contention will quickly limit scalability. Distributed data structures, such as per-thread task queues, address the scalability issue of centralized implementations, at the cost of requiring additional strategies for load balancing.

2.1 Task Pool Runtime System

The task pool runtime is implemented as a library, which provides an API for managing the task pool, running SPE threads, performing task pool operations,


and synchronizing execution after parallel sections. In addition to the common approach of creating tasks in a PPE thread, we include functions to delegate task creation to a number of SPEs, intended for those cases in which the performance of the PPE presents a bottleneck to application scalability. Each SPE executes a basic self-scheduling loop that invokes user-defined task functions after removing tasks from the task pool. In this way, the user can focus on the implementation of tasks, rather than writing complete executables for the SPEs. To support workloads with nested parallelism, new tasks can be spawned from currently running tasks, without tying execution to any particular SPE.

2.2 Design and Implementation

To match the Cell's memory hierarchy, we define a hierarchical task pool organization consisting of two separate storage domains: a local storage domain, providing fast access to a small set of tasks, and a shared storage domain, collecting all remaining tasks outside of local storage. From an implementation viewpoint, the hierarchical task pool may be thought of as two disjoint task pools, one per storage domain, with an interface for moving tasks between them. An SPE thread initiates task movement in two cases: (1) there is no local task left when trying to schedule a task for execution, or (2) there is no local storage space left when trying to insert a new task. To reduce the frequency of these events, tasks should be moved in bundles. However, choosing a large bundle size may lead to increased load balancing activity, up to the point where load balancing overheads outweigh any savings. For this reason, we only try to maximize the bundle size within fixed bounds.

Shared Task Storage. Data structures for storing tasks in main memory should meet the following requirements: (1) allow concurrent access by multiple processing elements; (2) provide potentially unbounded storage space that can grow and shrink as needed; (3) facilitate movement of tasks between storage domains. Concurrent access by multiple processing elements is usually achieved by using a distributed task pool. In our previous work, we described the implementation of a distributed task pool based on a set of double-ended queues (deques) [3]. We noted that dequeuing a single task involved up to 11 small DMA transfers, which added significant latency to the overall operation. As a consequence of these overheads, fine-grained parallelism remained hard to exploit. To increase the efficiency of DMA transfers, we now allocate blocks of contiguous tasks and arrange the blocks in a circular linked list.
Similar in concept to distributed task queues, blocks are locked and accessed independently, allowing concurrent access to distinct blocks. Each SPE is assigned a separate block on which to operate. Only if an SPE finds its block empty or already locked when searching for tasks, it follows the pointer to the next block, effectively attempting to steal a task from another SPE’s block. Thus, load balancing is built into the data structure, rather than being implemented as part of the scheduling algorithm.
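The block-list organization can be sketched as the following data structure. The field names (`task_block_t`, `find_block`) and the capacity are hypothetical, and the plain `lock` field stands in for the atomic-DMA locking used on a real SPE.

```c
#include <stddef.h>

#define BLOCK_CAPACITY 64  /* tasks per block; hypothetical value */

/* A task descriptor; the real layout is application-defined. */
typedef struct { void (*func)(void *); void *args; } task_t;

/* One block of contiguous tasks. Blocks are locked and accessed
 * independently and chained into a circular list. */
typedef struct task_block {
    int lock;                 /* 0 = free; acquired via atomic DMA on real hardware */
    int count;                /* number of tasks currently stored */
    struct task_block *next;  /* circular link: the last block points back to the first */
    task_t tasks[BLOCK_CAPACITY];
} task_block_t;

/* Walk the circular list starting from an SPE's own block until a
 * non-empty, unlocked block is found; visiting another SPE's block
 * amounts to stealing from it. Returns NULL after one full round. */
static task_block_t *find_block(task_block_t *own) {
    task_block_t *b = own;
    do {
        if (!b->lock && b->count > 0)
            return b;
        b = b->next;
    } while (b != own);
    return NULL;
}
```

The traversal illustrates the point made in the text: load balancing is built into the data structure itself, since an SPE whose block is empty or locked simply follows the `next` pointer.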


Fig. 1. Task storage in main memory. The basic data structure is a circular linked list of task blocks, each of which is locked and accessed independently. If more than one list is allocated, the lists are in turn linked together.

The resulting data structure, including our extension to support multiple lists, is illustrated in Fig. 1. Individual task blocks may be resized by requesting help from the PPE, using the standard PPE-assisted library call mechanism. The number of blocks in a list remains constant after allocation and equals the number of SPEs associated with that list. In the case of more than one list, as shown in the example of Fig. 1, SPEs follow the pointer to the head of the next list if they fail to get a task from their current list.

Local Task Storage. Tasks in the local storage domain are naturally distributed across the local stores of the SPEs. Given the limited size of local storage, we can only reserve space for a small set of tasks. In our current implementation, we assume a maximum of ten tasks per local store. With DMA performance in mind, we adopt the task queue shown in Fig. 2. The queue is split into private and public segments for local-only and shared access, respectively. Access to the public segment is protected by a lock. The private segment is accessed without synchronization. Depending on the number of tasks in the queue, the local SPE adjusts the segment boundary to share at least one task via the public segment. Adapting the segments by shifting the boundary requires exclusive access to the queue, including the public segment. If the queue is empty or there is only one task left to share, the queue is public by default (2a). When inserting new tasks, the private segment may grow up to a defined maximum size; beyond that size, tasks remain in the public segment (2b–d). Similarly, when searching for a task, the private segment is checked first before accessing the public segment (2e–f). Unless the public segment is empty, there is no need to shrink the private segment (2g).
Given that other SPEs are allowed to remove tasks from a public segment, periodic checks are required to determine whether the public segment is empty, in which case the boundary is shifted left to share another task (2h–j).

Dynamic Load Balancing. Task queues with private and public segments provide the basis for load balancing within the local storage domain. If an SPE fails to return a task from both local and shared storage, it may attempt to steal a task from another SPE's local queue. Task stealing transfers a number of tasks between two local stores without involving main memory. We have implemented a two-level stealing protocol for a system containing two Cell processors. At first, task stealing is restricted to


Fig. 2. Task storage in local memory. The basic data structure is a bounded queue, implemented as an array of tasks. The queue owner is responsible for partitioning the queue dynamically into private and public segments. In this example, the private segment may contain up to five tasks.

finding a local victim, i.e., a victim located on the same chip. SPE_i starts off by peeking at the public segment of its logical neighbor SPE_((i mod N)+1), where N is the number of local SPEs and 1 ≤ i ≤ N. Note that two logically adjacent SPEs need not be physically adjacent, though we ensure that they are allocated on the same chip. To reduce the frequency of stealing attempts, a thief tries to steal more than one task at a time, up to half the number of tasks in a given public segment (steal-half policy). If the public segment is already locked or there is no task left to steal, the neighbor of the current victim becomes the next victim, and the procedure is repeated until either a task is found or the thief returns to its own queue. In the latter case, task stealing continues with trying to find a remote victim, i.e., a victim located off-chip. Again, each SPE in question is checked once.

Task Pool Access Sequence. Our current task pool implementation is based on the assumption that tasks are typically executed by the SPEs. Therefore, the PPE is used to insert but not to remove tasks. In the following, we focus on the sequence in which SPEs access the task pool. For the sake of clarity, we omit the functions for adapting the local queue segments.

Get Task. Figure 3(a) shows the basic sequence for removing a task from the task pool. Tasks are always taken from the local queue, but if the queue is empty, the shared storage pool must be searched for a task bundle to swap in. The size of the bundle is ultimately limited by the size of the local queue. Our strategy does not attempt to find the largest possible bundle by inspecting each block in the list; instead, it tries to minimize contention by transferring the first bundle found, even if the bundle is really a single task. Before the lock is released and other SPEs might begin to steal, one of the transferred tasks is reserved for execution by the local SPE.
Put Task. Figure 3(b) shows the sequence for inserting a task into the task pool. The task is always inserted locally, but if the local queue is full, a number of tasks must be swapped out to the shared storage pool first. For reasons of locality, it makes sense not to clear the entire queue before inserting the new task. Tasks

void *get_task() {
    void *task = remove_private();
    if (task) return task;
    lock(public);
    task = remove_public();
    if (task) {
        unlock(public);
        return task;
    }
    /* Local store queue is empty */
    swap_in_from_mm();
    task = remove();
    unlock(public);
    if (task) return task;
    /* Nothing left in main memory */
    return steal_task();
}

void put_task(void *task) {
    bool ins = insert_private(task);
    if (ins) return;
    lock(public);
    ins = insert_public(task);
    if (ins) {
        unlock(public);
        return;
    }
    /* Local store queue is full */
    swap_out_to_mm();
    insert(task);
    unlock(public);
}

Fig. 3. Simplified sequence for accessing the task pool (SPE code): (a) get task, (b) put task

that are moved from local to shared storage are inserted into an SPE’s primary block. If the block is already full, the SPE calls back to its controlling PPE thread, which doubles the size of the block and resumes SPE execution.
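The on-chip round of the two-level stealing protocol described under Dynamic Load Balancing can be sketched as follows. This is a plain-C simulation with hypothetical names (`steal_local`, `public_count`); it indexes SPEs 0..N−1 rather than the text's 1..N, models only the public-segment task counts, and ignores locking and DMA.

```c
/* Sketch of on-chip victim selection with the steal-half policy.
 * public_count[] holds the public-segment task counts of the n local
 * SPEs. Returns the number of tasks stolen into the thief's queue,
 * or 0 if no local victim had anything to steal (in which case the
 * real protocol falls through to off-chip stealing). */
static int steal_local(int thief, int *public_count, int n) {
    for (int i = 1; i < n; i++) {          /* visit each other local SPE once */
        int victim = (thief + i) % n;       /* logical-neighbor ordering */
        if (public_count[victim] > 1) {
            int amount = public_count[victim] / 2;  /* steal up to half */
            public_count[victim] -= amount;
            public_count[thief] += amount;
            return amount;
        } else if (public_count[victim] == 1) {
            public_count[victim] = 0;       /* single task: take it */
            public_count[thief] += 1;
            return 1;
        }
    }
    return 0;
}
```

Stealing several tasks at once amortizes the cost of a stealing round; the loop terminating after one full round matches the text's "each SPE in question is checked once".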

3 Experimental Results

We evaluate the performance and scalability of our task pool implementation in two steps. First, we compare different task pool variants, using synthetic workloads generated by a small benchmark application. Second, we present results from a set of three applications with task parallelism: a matrix multiplication, an LU decomposition, and a particle simulation based on the Linked-Cell method [4]. For these applications, we compare performance with Cell Superscalar (CellSs) 2.1, which was shown to achieve good scalability for workloads with tasks in the 50 µs range [5,6]. The matrix multiplication and decomposition codes are taken from the examples distributed with CellSs. We performed runtime experiments on an IBM BladeCenter QS22 with two PowerXCell 8i processors and 8 GB of DDR2 memory per processor. The system runs Fedora 9 with Linux kernel 2.6.25-14. Programming support is provided by the IBM Cell SDK version 3.1 [7]. The speedups presented in the following subsections are based on average runtimes from ten repetitions. Task pool parameters are summarized in Table 1.

3.1 Synthetic Application

We consider two synthetic workloads with the following runtime characteristics: – Static(f, n): Create n tasks of size f . All tasks are identical and require the same amount of computation. – Dynamic(f, n): Create n initial tasks of size f . Depending on the task, up to two child tasks may be created. Base tasks are identical to those of the static workload.
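The two workload shapes can be sketched as task counters. Note that the spawn rule for the dynamic workload is a guess purely for illustration (the paper does not specify it), so only the static count is meaningful; all names are hypothetical.

```c
/* Static(f, n): n identical tasks of size f, so the task count is simply n. */
static long static_task_count(long n) {
    return n;
}

/* Dynamic(f, n): each task may spawn up to two child tasks. The
 * depth-limited full-binary-tree rule below is an assumption made
 * only to illustrate the recursive shape of such a workload; the
 * paper's actual spawn rule is not specified. */
static long dynamic_task_count(int depth) {
    if (depth == 0)
        return 1;                                   /* leaf task, no children */
    return 1 + 2 * dynamic_task_count(depth - 1);   /* task plus two children */
}
```

Under this assumed rule a workload grows exponentially with nesting depth, which is why the paper's 26 initial dynamic tasks can expand into 143 992 executed tasks.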


Table 1. Distributed task pool parameters used in the evaluation

Task size is a linear function of the workload parameter f such that a base task of size f = 1 is executed in around 500 clock cycles on an SPE. Figure 4 shows the relative performance of selected task pools, based on executing the synthetic workloads Static(f, 10^5) and Dynamic(f, 26). In the case of the dynamic workload, n = 26 initial tasks result in a total of 143 992 tasks to be created and executed. Task sizes range from very fine-grained to moderately fine-grained. To evaluate the benefits of local task storage, we compare speedups with a central task pool that uses shared storage only. Such a task pool is simply configured without the local store queues described in the previous section. Lacking the corresponding data structures, tasks are scheduled one at a time without taking advantage of bundling. In addition, we include the results of our previous task pool implementation, reported in [3]. Speedups are calculated relative to the central task pool running with a single SPE. Figures 4(a) and (b) show that fine-grained parallelism drastically limits the scalability of all task pools that rely on the PPE to create tasks. This is to be expected, since the PPE cannot create and insert tasks fast enough to keep all SPEs busy. The key to breaking this sequential bottleneck is to involve the SPEs in the task creation. Because SPEs can further take advantage of their local stores, we see significant improvements in terms of scalability, resulting in a performance advantage of 8.5× and 6.7× over the same task pool using PPE task creation. In this and the following experiments, the tasks to be created are evenly distributed among half of the SPEs, in order to overlap task creation and execution. Without task creation support from the SPEs, our new implementation is prone to parallel slowdowns, as is apparent in Fig. 4(a) and (b).
While the deques of our previous task pool allow for concurrency between enqueue and dequeue operations, the block list requires exclusive access to a given block when inserting or removing tasks. Thus, with each additional SPE accessing the block list, lock contention increases and, as a result, the task creation performance of the PPE degrades. Figure 4(c) shows that tasks of size f = 100 (16 µs) are already large enough to achieve good scalability with all distributed task pools, regardless of task creation. The central task pool scales equally well up to eight SPEs, but suffers from increasing contention when using threads allocated on a second Cell processor. Figures 4(d) and (e) show that fine-grained nested parallelism is a perfect fit for the hierarchically distributed task pool, with performance improvements of 25.5× and 14.5× over the central task pool (using 16 SPEs). Even for larger tasks, the new implementation maintains a significant advantage, as can be seen in Fig. 4(f).


(a) Static(1, 10^5)
(b) Static(10, 10^5)
(c) Static(100, 10^5)
(d) Dynamic(1, 26)
(e) Dynamic(10, 26)
(f) Dynamic(100, 26)

Fig. 4. Speedups for the synthetic application with static and dynamic workloads (f, n), where f is the task size factor and n is the number of tasks. The resulting average task size is shown in the upper left of each figure. Speedups are calculated relative to the central task pool (shared storage only) using one SPE.

3.2 Matrix Multiplication

The matrix multiplication workload is characterized by a single type of task, namely the block-wise multiplication of the input matrices. Figure 5 shows the effect of decreasing task size on the scalability of the implementations. Speedups are calculated relative to CellSs using a single SPE. CellSs scales reasonably well for blocks of size 32. However, if we further decrease the block size towards fine-grained parallelism, we clearly see the limitations of CellSs. Likewise, requiring the PPE to create all tasks drastically limits the scalability of our task pool implementation. Because the shared

(a) B = 32

(b) B = 16

(c) B = 8

Fig. 5. Speedups for the multiplication of two 1024×1024 matrices. The multiplication is carried out in blocks of size B × B. Speedups are relative to CellSs using one SPE.

(a) B = 32
(b) B = 16
(c) B = 8

Fig. 6. Speedups for the LU decomposition of a 1024×1024 matrix. The decomposition is carried out in blocks of size B × B. Speedups are relative to CellSs using one SPE.

storage pool is contended by both PPE and SPE threads, we observe slowdowns similar to those described above. For scalable execution of tasks in the low microsecond range, task creation must be offloaded from the PPE to the SPEs. Using 16 SPEs (eight of them for task creation), the distributed task pool achieves up to 12.8× the performance of CellSs.

3.3 LU Decomposition

Compared to the matrix multiplication, the LU decomposition exhibits a much more complex task structure. The workload is characterized by four different types of tasks and decreasing parallelism with each iteration of the algorithm. Mapping the task graph onto the task pool requires stepwise scheduling and barrier synchronization. Figure 6 shows speedups relative to CellSs, based on the same block sizes as in the matrix multiplication example. Due to the complex task dependencies and the sparsity of the input matrix, we cannot expect to see linear speedups up to 16 SPEs. Using a block size greater than 32 results in a small number of coarse-grained tasks. Limited parallelism, especially in the final iterations of the algorithm, motivates a decomposition at a finer granularity. In fact, the best performance is achieved when using blocks of size 32 or 16. Once again, fine-grained parallelism can only be exploited by freeing the PPE from the burden of task creation. In that case, we see an improvement of up to 6× over CellSs.

3.4 Linked-Cell Particle Simulation

The particle simulation code is based on the Linked-Cell method for approximating short-range pair potentials, such as the Lennard-Jones potential in molecular dynamics [4]. To save computational time, the Lennard-Jones potential is truncated at a cutoff distance r_c, whose value is used to subdivide the simulation space into cells of equal length. Force calculations can then be limited to interactions with particles in the same and in neighboring cells. For our runtime tests, we use a 2D simulation box with reflective boundary conditions. The box is subdivided into 40 000 cells, and particles are randomly

184

R. Hoffmann, A. Prell, and T. Rauber

(a) 1000 particles

(b) 10 000 particles

(c) 100 000 particles

Fig. 7. Speedups for the Linked-Cell application with three different workloads based on the number of particles to simulate. Idealized scheduling assumes zero overhead for task management, scheduling, and load balancing. Speedups are relative to CellSs using one SPE.

distributed, with the added constraint of forming a larger cluster. The workload consists of two types of tasks—force calculation and time integration—which update the particles of a given cell. At the end of each time step, particles that have crossed cell boundaries are copied to their new enclosing cells. This task is performed sequentially by the PPE. We focus on the execution of three workloads, based on the number of particles to simulate: 1000, 10 000, and 100 000. The potential for parallel speedup increases with the number of particles, but at the same time, clustering of particles leads to increased task size variability. Thus, efficient execution depends on load balancing at runtime. Figure 7 shows the results of running the simulation for ten time steps. Once again, speedups are calculated relative to CellSs using a single SPE. Maximum theoretical speedups are indicated by the dashed line, representing parallel execution of force calculation and time integration without any task related overheads. In the case of only 1000 particles, there is little parallelism to exploit. Although SPE task creation leads to some improvement, 16 SPEs end up delivering roughly the same performance as a single SPE executing the workload in sequence. In this regard, it is important to note that tasks are only created for non-empty cells, requiring that each cell be checked for that condition. Whereas the PPE can simply load a value from memory, an SPE has to issue a DMA command. Even with atomic DMA operations, the additional overhead of reading a cell’s content is on the order of a few hundred clock cycles. The amount of useful parallelism increases with the number of particles to simulate. Using 16 SPEs, the task pool delivers 62% and 75% of the ideal performance, when simulating 10 000 and 100 000 particles, respectively. In contrast, CellSs achieves only 16% and 20% of the theoretical peak.
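To make the Linked-Cell idea concrete, the following is a minimal, illustrative C sketch (not the authors' code; all names are hypothetical). Particles are binned into a grid of cells with side length r_c, so that force evaluation for a particle only needs to visit its own cell and the adjacent cells; here we merely count candidate interaction partners to show the pruning effect.

```c
/* Minimal 2D Linked-Cell sketch (illustrative only). Particles are
 * binned into cells of side rc; interaction search for a particle
 * visits only its own cell and the 8 neighboring cells. */
#define NCELLS_X 4
#define NCELLS_Y 4

typedef struct { double x, y; int next; } Particle; /* 'next' links particles within a cell */

static int cell_head[NCELLS_X][NCELLS_Y];

static void bin_particles(Particle *p, int n, double rc) {
    for (int i = 0; i < NCELLS_X; i++)
        for (int j = 0; j < NCELLS_Y; j++)
            cell_head[i][j] = -1;
    for (int k = 0; k < n; k++) {
        int ci = (int)(p[k].x / rc), cj = (int)(p[k].y / rc);
        p[k].next = cell_head[ci][cj];   /* push particle onto the cell's list */
        cell_head[ci][cj] = k;
    }
}

/* Count interaction candidates for particle k: all particles in the
 * same and in neighboring cells, excluding k itself. */
static int count_neighbors(Particle *p, int k, double rc) {
    int ci = (int)(p[k].x / rc), cj = (int)(p[k].y / rc), count = 0;
    for (int di = -1; di <= 1; di++)
        for (int dj = -1; dj <= 1; dj++) {
            int ni = ci + di, nj = cj + dj;
            if (ni < 0 || nj < 0 || ni >= NCELLS_X || nj >= NCELLS_Y)
                continue;                /* reflective walls: no wrap-around */
            for (int q = cell_head[ni][nj]; q != -1; q = p[q].next)
                if (q != k) count++;
        }
    return count;
}
```

In the real application, the inner loop would compute truncated Lennard-Jones forces instead of counting, and each cell update would be issued as a task.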

4 Related Work

Exploiting Fine-Grained Parallelism on Cell Processors

185

Efficient support for fine-grained parallelism requires runtime systems with low overheads. Besides bundling tasks in the process of scheduling, runtime systems may decide to increase the granularity of tasks depending on the current load and available parallelism. Strategies such as lazy task creation [8] and task cut-off [9] limit the number of tasks and the associated overhead of task creation, while still preserving enough parallelism for efficient execution. However, not all applications lend themselves to combining tasks dynamically at runtime.

In a recent scalability analysis of CellSs, Rico et al. compare the speedups for a number of applications with different task sizes [10]. The authors conclude that, in some of their applications, the task creation overhead of the PPE limits scalability to less than 16 SPEs. To reduce this overhead and thereby improve scalability, the authors suggest a number of architectural enhancements to the PPE, such as out-of-order execution, wider instruction issue, and larger caches. All in all, task creation could be accelerated by up to 50%. In the presence of fine-grained parallelism, however, task creation must be lightweight and scalable, which we believe is best achieved by creating tasks in parallel.

Regardless of optimizations, runtime systems introduce additional overhead, which limits their applicability to tasks of a certain minimum size. Kumar et al. make a case for exploiting fine-grained parallelism on future CMPs and propose hardware support for accelerating task queue operations [11,12]. Their proposed design, Carbon, adds a set of hardware task queues, which implement a scheduling policy based on work stealing, and per-core task prefetchers to hide the latency of accessing the task queues. On a set of benchmark applications with fine-grained parallelism from the field of Recognition, Mining, and Synthesis (RMS), the authors report up to 109% performance improvement over optimized software implementations.

While hardware support for task scheduling and load balancing has great potential for exploiting large-scale CMPs, limited flexibility in terms of scheduling policies will not obviate the need for sophisticated software implementations.
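The cut-off idea referenced above [8,9] can be sketched in a few lines of C. This is a toy model under stated assumptions (no real task runtime behind it; `work` and `cutoff` are hypothetical names): above a cut-off recursion depth the divide step notionally spawns tasks, below it the recursion is inlined serially, so the total number of tasks created is bounded.

```c
/* Toy sketch of a depth-based task cut-off in the spirit of lazy task
 * creation / adaptive cut-off [8,9]. A binary divide-and-conquer
 * computation "spawns" two child tasks only while depth > cutoff;
 * deeper levels run serially, capping task-creation overhead. */
static long tasks_created;

static long work(int depth, int cutoff) {
    if (depth == 0)
        return 1;                 /* leaf computation */
    if (depth > cutoff)
        tasks_created += 2;       /* coarse enough: pretend to spawn two tasks */
                                  /* else: inline the children serially */
    return work(depth - 1, cutoff) + work(depth - 1, cutoff);
}
```

For a depth-4 tree, a cut-off at depth 2 creates 6 tasks instead of 30, while the computed result (16 leaves) is unchanged.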

5 Conclusions

In this paper, we have presented a hierarchically distributed task pool that provides a basis for efficient task-parallel programming on Cell processors. The task pool is divided into local and shared storage domains, which are mapped to the Cell’s local stores and main memory. Task movement between the storage domains is facilitated by DMA-based operations. Although we make extensive use of Cell-specific communication and synchronization constructs, the concepts are general enough to be of interest for other platforms.

Based on our experiments, we conclude that scalable execution on the Cell processor requires SPE-centric runtime system support. In particular, the current practice of relying on the PPE to create tasks is not suitable for exploiting fine-grained parallelism. Instead, tasks should be created by a number of SPEs, in order to maximize the available parallelism at any given time.

Acknowledgments. We thank the Forschungszentrum Jülich for providing access to their Cell blades. This work is supported by the Deutsche Forschungsgemeinschaft (DFG).


References

1. Kahle, J.A., Day, M.N., Hofstee, H.P., Johns, C.R., Maeurer, T.R., Shippy, D.: Introduction to the Cell multiprocessor. IBM J. Res. Dev. 49(4/5) (2005)
2. Johns, C.R., Brokenshire, D.A.: Introduction to the Cell Broadband Engine Architecture. IBM J. Res. Dev. 51(5) (2007)
3. Hoffmann, R., Prell, A., Rauber, T.: Dynamic Task Scheduling and Load Balancing on Cell Processors. In: Proc. of the 18th Euromicro Intl. Conference on Parallel, Distributed and Network-Based Processing (2010)
4. Griebel, M., Knapek, S., Zumbusch, G.: Numerical Simulation in Molecular Dynamics, 1st edn. Springer, Heidelberg (September 2007)
5. Bellens, P., Perez, J.M., Badia, R.M., Labarta, J.: CellSs: a Programming Model for the Cell BE Architecture. In: Proc. of the 2006 ACM/IEEE Conference on Supercomputing (2006)
6. Perez, J.M., Bellens, P., Badia, R.M., Labarta, J.: CellSs: Making it easier to program the Cell Broadband Engine processor. IBM J. Res. Dev. 51(5) (2007)
7. IBM: IBM Software Development Kit (SDK) for Multicore Acceleration Version 3.1, http://www.ibm.com/developerworks/power/cell
8. Mohr, E., Kranz, D.A., Halstead Jr., R.H.: Lazy Task Creation: A Technique for Increasing the Granularity of Parallel Programs. In: Proc. of the 1990 ACM Conference on LISP and Functional Programming (1990)
9. Duran, A., Corbalán, J., Ayguadé, E.: An adaptive cut-off for task parallelism. In: Proc. of the 2008 ACM/IEEE Conference on Supercomputing (2008)
10. Rico, A., Ramirez, A., Valero, M.: Available task-level parallelism on the Cell BE. Scientific Programming 17, 59–76 (2009)
11. Kumar, S., Hughes, C.J., Nguyen, A.: Carbon: Architectural Support for Fine-Grained Parallelism on Chip Multiprocessors. In: Proc. of the 34th Intl. Symposium on Computer Architecture (2007)
12. Kumar, S., Hughes, C.J., Nguyen, A.: Architectural Support for Fine-Grained Parallelism on Multi-core Architectures. Intel Technology Journal 11(3) (2007)

Optimized On-Chip-Pipelined Mergesort on the Cell/B.E.

Rikard Hultén1, Christoph W. Kessler1, and Jörg Keller2

1 Linköpings Universitet, Dept. of Computer and Inf. Science, 58183 Linköping, Sweden
2 FernUniversität in Hagen, Dept. of Math. and Computer Science, 58084 Hagen, Germany

Abstract. Limited bandwidth to off-chip main memory is a performance bottleneck in chip multiprocessors for streaming computations, such as Cell/B.E., and this will become even more problematic with an increasing number of cores. Especially for streaming computations where the ratio between computational work and memory transfer is low, transforming the program into more memory-efficient code is an important program optimization. In earlier work, we have proposed such a transformation technique: on-chip pipelining. On-chip pipelining reorganizes the computation so that partial results of subtasks are forwarded immediately between the cores over the high-bandwidth internal network, in order to reduce the volume of main memory accesses, and thereby improves the throughput for memory-intensive computations. At the same time, throughput is also constrained by the limited amount of on-chip memory available for buffering forwarded data. By optimizing the mapping of tasks to cores, balancing a trade-off between load balancing, buffer memory consumption, and communication load on the on-chip bus, a larger buffer size can be applied, resulting in less DMA communication and scheduling overhead. In this paper, we consider parallel mergesort on Cell/B.E. as a representative memory-intensive application in detail, and focus on the global merging phase, which dominates the overall sorting time for larger data sets. We work out the technical issues of applying the on-chip pipelining technique for the Cell processor, describe our implementation, evaluate experimentally the influence of buffer sizes and mapping optimizations, and show that optimized on-chip pipelining indeed reduces, for realistic problem sizes, merging times by up to 70% on QS20 and 143% on PS3 compared to the merge phase of CellSort, which was previously the fastest mergesort implementation on Cell.

1 Introduction

The new generation of multiprocessors-on-chip derives its raw power from parallelism, and explicit parallel programming with platform-specific tuning is needed to turn this power into performance. A prominent example is the Cell Broadband Engine [1] with a PowerPC core and 8 parallel slave processors called SPEs. Yet, many applications use the Cell BE like a dancehall architecture: the SPEs use their small on-chip local memories (256 KB for both code and data) as explicitly-managed caches, and they all load and store data from/to the external (off-chip) main memory. The bandwidth to the external memory is much smaller than the SPEs’ aggregate bandwidth to the on-chip interconnect bus (EIB). This limits performance and prevents scalability.

P. D’Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 187–198, 2010.
© Springer-Verlag Berlin Heidelberg 2010


External memory is also a bottleneck in other multiprocessors-on-chip. This problem will become more severe as the core count per chip is expected to increase considerably in the foreseeable future. Scalable parallelization on such architectures therefore must use direct communication between the SPEs to reduce communication with off-chip main memory. In this paper, we consider the important domain of memory-intensive computations and consider the global merging phase of pipelined mergesort on Cell as a challenging case study, for the following reasons:

– The ratio of computation to data movement is low.
– The computational load of tasks varies widely (by a factor of 2^k for a binary merge tree with k levels).
– The computational load of a merge task is not fixed but only averaged.
– Memory consumption is not proportional to computational load but constant among tasks.
– Communication always occurs between tasks of different computational load.

These factors complicate the mapping of tasks to SPEs. In total, pipelining a merge tree is much more difficult than task graphs of regular problems such as matrix-vector multiplication.

The task graph of the global merging phase consists of a tree of merge tasks that should contain, in the lowest layer, at least as many merger tasks as there are SPEs available. Previous solutions like CellSort [2] and AAsort [3] process the tasks of the merge tree layer-wise bottom-up in serial rounds, distributing the tasks of a layer equally over SPEs (there is no need to have more than one task per SPE). Each layer of the tree is then processed in a dancehall fashion, where each task operates on (buffered) operand and result arrays residing in off-chip main memory. This organization leads to relatively simple code but puts a high access load on the off-chip-memory interface.
On-chip pipelining reorganizes the overall computation in a pipelined fashion such that intermediate results (i.e., temporary stream packets of sorted elements) are not written back to main memory, where they would wait to be reloaded in the next layer-processing round, but instead are forwarded immediately to a consuming successor task that possibly runs on a different core. This of course requires some buffering in on-chip memory and on-chip communication of intermediate results where producer and consumer task are mapped to different SPEs, but multi-buffering is necessary anyway in processors like Cell in order to overlap computation with (DMA) communication. It also requires that all merger tasks of the algorithm be active simultaneously; usually there are several tasks mapped to an SPE, which are dynamically scheduled by a user-level round-robin scheduler as data becomes available for processing.

However, as we would like to guarantee fast context switching on SPEs, the limited size of Cell’s local on-chip memory puts a limit on the number of buffers and thus tasks that can be mapped to an SPE, or correspondingly a limit on the size of data packets that can be buffered, which also affects performance. Moreover, the total volume of intermediate data forwarded on-chip should be low and, in particular, must not exceed the capacity of the on-chip bus. Hence, we obtain a constrained optimization problem for mapping the tasks of streaming computations to the SPEs of Cell such that the resulting throughput is maximized.


In previous work we developed mappings for merge trees [4,5]. In particular, we have developed various optimal, approximative and heuristic mapping algorithms for optimized on-chip pipelining of merge trees. Theoretically, a tremendous reduction of the required memory bandwidth could be achieved, and our simulations for an idealized Cell architecture indicated that considerable speedup over previous implementations is possible. But an implementation on the real processor is very tricky if it is to overcome the overhead related to dynamic scheduling, buffer management, synchronization and communication delays. Here, we detail our implementation, which actually achieves a notable speedup of up to 61% over the best previous implementation; this supports our earlier theoretical estimations with experimental evidence. Also, the results support the hypothesis that on-chip pipelining as an algorithm-engineering option is worthwhile in general, because simpler applications might profit even more.

The remainder of this article is organized as follows. In Section 2, we give a short overview of the Cell processor, as far as needed for this article. Section 3 develops the on-chip pipelined merging algorithm, Section 4 gives details of the implementation, and Section 5 reports on the experimental results. Further details will soon be available in a forthcoming master thesis [6]. Section 6 concludes and identifies issues for future work.

2 Cell/B.E. Overview

The Cell/B.E. (Broadband Engine) processor [1] is a heterogeneous multi-core processor consisting of 8 SIMD processors called SPEs and a dual-threaded PowerPC core (PPE), which differ in architecture and instruction set. In earlier versions of the Sony PlayStation 3 (PS3), up to 6 SPEs of its Cell processor could be used under Linux. On IBM’s Cell blade servers such as the QS20 and later models, two Cells with a total of 16 SPEs are available. Cell blades are used, for instance, in the nodes of RoadRunner, which was the world’s fastest supercomputer in 2008–2009.

While the PPE is a full-fledged superscalar processor with direct access to off-chip memory via L1 and L2 caches, the SPEs are optimized for doing SIMD-parallel computations at a significantly higher rate and lower power consumption than the PPE. The SPE datapaths and registers are 128 bits wide, and the SPU vector instructions operate on them as on vector registers, holding 2 doubles, 4 floats or ints, 8 shorts, or 16 bytes, respectively. For instance, four parallel float comparisons between the corresponding sections of two vector registers can be done in a single instruction. However, branch instructions can tremendously slow down the data throughput of an SPE. The PPE should mainly be used for coordinating SPE execution, providing OS services and running control-intensive code.

Each SPE has a small local on-chip memory of 256 KBytes. This local store is the only memory that the SPE’s processing unit (the SPU) can access directly, and therefore it needs to accommodate both SPU code and data. There is no cache and no virtual memory on the SPE. Access to off-chip memory is only possible by asynchronous DMA put and get operations that can communicate blocks of up to 16 KB size at a time to and from off-chip main memory.
DMA operations are executed asynchronously by the SPE’s memory flow controller (MFC) unit in parallel with the local SPU; the SPU can initiate a DMA transfer and synchronize with a DMA transfer’s completion. DMA transfer is also possible between an SPE and another SPE’s local store.


There is no operating system or runtime system on the SPE except what is linked to the application code in the local store. This is what necessitates user-level scheduling if multiple tasks are to run concurrently on the same SPE.

SPEs, the PPE and the memory interface are interconnected by the Element Interconnect Bus (EIB) [1]. The EIB is implemented by four uni-directional rings with an aggregate bandwidth of 204 GByte/s (peak). The bandwidth of each unit on the ring to send data over or receive data from the ring is only 25.6 GB/s. Hence, the off-chip memory tends to become the performance bottleneck if heavily accessed by multiple SPEs.

Programming the Cell processor efficiently is a challenging task. The programmer should partition an application suitably across the SPEs and coordinate SPE execution with the main PPE program, use the SPE’s SIMD architecture efficiently, and take care of proper communication and synchronization at a fairly low level, overlapping DMA communication with local computation where possible. All these different kinds of parallelism are to be orchestrated properly in order to come close to the theoretical peak performance of about 220 GFlops (for single precision).

To allow for overlapping DMA handling of packet forwarding (both off-chip and on-chip) with computation on Cell, there should be buffer space for at least 2 input packets per input stream and 2 output packets per output stream of each streaming task to be executed on an SPE. While the SPU is processing operands from one buffer, the other one in the same buffer pair can be simultaneously filled or drained by a DMA operation. Then the two buffers are switched for each operand and result stream for processing the next packet of data. (Multi-buffering extends this concept from 2 to an arbitrary number of buffers per operand array, ordered in a circular queue.)
This amounts to at least 6 packet buffers for an ordinary binary streaming operation, which need to be accommodated in the size-limited local store of the SPE. Hence, the size of the local-store part used for buffers puts an upper bound on the buffer size and thereby on the size of packets that can be communicated. On Cell, the DMA packet size cannot be made arbitrarily small: the absolute minimum is 16 bytes, and in order not to be too inefficient, at least 128 bytes should be shipped at a time. Reasonable packet sizes are a few KB (the upper limit is 16 KB). As the size of SPE local storage is severely limited (256 KB for both code and data) and the packet size is the same for all SPEs and throughout the computation, the maximum number of packet buffers of the tasks assigned to any SPE should be as small as possible. Another reason to keep the packet size large is the overhead due to switching buffers and user-level runtime scheduling between different computational tasks mapped to the same SPE. Figure 1 shows the sensitivity of the execution time of our pipelined mergesort application (see later) to the buffer size.
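The double-buffering scheme described above can be illustrated with a small, portable C sketch. This is not Cell code: on the SPE, the prefetch would be an asynchronous `mfc_get` and the wait a tag-status poll, whereas here a plain `memcpy` stands in for the DMA, so only the buffer-switching logic is shown (function and constant names are hypothetical).

```c
/* Double-buffering sketch: while packet p is being processed from one
 * buffer, packet p+1 is "fetched" into the other; then the buffers are
 * switched. On Cell, the memcpy would be an asynchronous DMA transfer
 * overlapped with the compute loop. */
#include <string.h>

#define PACKET 4   /* elements per packet (real packets are a few KB) */

static long process_stream(const int *src, int n_packets) {
    int buf[2][PACKET];
    long sum = 0;
    int cur = 0;
    memcpy(buf[cur], src, sizeof buf[cur]);             /* prefetch packet 0 */
    for (int p = 0; p < n_packets; p++) {
        int nxt = cur ^ 1;
        if (p + 1 < n_packets)                          /* start "DMA" for next packet */
            memcpy(buf[nxt], src + (p + 1) * PACKET, sizeof buf[nxt]);
        for (int i = 0; i < PACKET; i++)                /* compute on current packet */
            sum += buf[cur][i];
        cur = nxt;                                      /* switch buffers */
    }
    return sum;
}
```

Extending `buf` from 2 to B entries, managed as a circular queue, gives the multi-buffering variant mentioned in the text.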

3 On-Chip Pipelined Mergesort

Parallel sorting is needed on every modern platform and hence heavily investigated. Several sorting algorithms have been adapted and implemented on the Cell BE. The highest performance is achieved by CellSort [2] and AAsort [3]. Both sort data sets that fit into off-chip main memory but not into the local store, and both implementations have similarities. They work in two phases to sort a data set of size N with local memories of size N′. In the first phase, blocks of data of size 8N′ that fit into the combined local memories


Fig. 1. Merge times (here for a 7-level merger tree pipeline), shown for various input sizes (number of 128-bit vectors per SPE), strongly depend on the buffer size used in multi-buffering

of the 8 SPEs are sorted. In the second phase, those sorted blocks of data are combined into a fully sorted data set. We concentrate on the second phase, as the majority of memory accesses occurs there and as it accounts for the largest share of sorting time for larger input sizes.

In CellSort [2], this phase is realized by a bitonic sort because this avoids data-dependent control flow and thus fully exploits the SPE’s SIMD architecture. Yet, O(N log^2 N) memory accesses are needed and the reported speedups are small. In AAsort [3], mergesort with 4-to-1 mergers is used in the second phase. The data flow graph of the merge procedures thus forms a fully balanced merge quadtree. The nodes of the tree are executed on the SPEs layer by layer, starting with the leaf nodes. As each merge procedure on each SPE reads from main memory and writes to main memory, all N words are read from and written to main memory in each merge round, resulting in N log_4(N/(8N′)) = O(N log_4 N) data being read from and written to main memory. While this improves the situation, speedup is still limited.

In order to decrease the bandwidth requirements to off-chip main memory and thus increase speedup, we use on-chip pipelining. This means that all merge nodes of all tree levels are active from the beginning, and that results from one merge node are forwarded in packets of fixed size to the follow-up merge node directly, without using main memory as intermediate store. With b-to-1 merger nodes and a k-level merge tree, we realize b^k-to-1 merging with respect to main memory traffic and thus reduce main memory traffic by a factor of k · log_4(b).
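The traffic-reduction factor above can be checked with a few lines of C (an illustrative sketch, not part of the implementation): layer-wise 4-to-1 merging of b^k sorted runs moves all N words through main memory once per round, i.e. log_4(b^k) = k · log_4(b) times, whereas on-chip pipelining moves them only once.

```c
/* Count the rounds of layer-wise 4-to-1 merging needed to reduce
 * 'streams' sorted runs to one. Each round reads and writes all N
 * words in main memory, so this count is exactly the factor by which
 * on-chip pipelining (one pass) reduces main memory traffic. */
static int merge_rounds(long streams) {
    int rounds = 0;
    while (streams > 1) {
        streams = (streams + 3) / 4;   /* one 4-to-1 merge layer */
        rounds++;
    }
    return rounds;
}
```

For example, a binary (b = 2) merge tree with k = 10 levels has 2^10 = 1024 leaf streams; layer-wise 4-to-1 merging needs 5 memory rounds (k · log_4(2) = 10 · 0.5), so pipelining cuts main memory traffic by a factor of 5.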
The decision to forward merged data streams in packets of fixed size allows us to use buffers of this fixed size for all merge tasks, and also enables follow-up merge tasks to start work before predecessor mergers have processed their input streams completely, thus keeping as many merge tasks busy as possible, and allowing pipeline depths independent of the lengths of data streams. Note that the mergers in the AAsort algorithm [3] must already work with buffering and fixed-size packets.


The requirement to keep all tasks busy is complicated by the fact that the processing of data streams is not completely uniform over all tasks but depends on the data values in the streams. A merger node may consume only data from one input stream for some time, if those data values are much smaller than the data values in the other input streams. Hence, if all input buffers for those streams are filled, and the output buffers of the respective predecessor merge tasks are filled as well, those merge tasks will be stalled. Moreover, after some time the throughput of the merger node under consideration will be reduced to the output rate of the predecessor merger producing the input stream with small data values, so that follow-up mergers might also be stalled as a consequence. Larger buffers might alleviate this problem, but are not possible if too many tasks are mapped to one SPE.

Finally, the merger nodes should be distributed over the SPEs such that two merger nodes that communicate data are placed on the same SPE whenever possible, to reduce communication load on the EIB. As a secondary goal, if they cannot be placed on the same SPE, they might be placed such that their distance on the EIB is small, so that different parts of the EIB might be used in parallel.

In our previous work [4], we have formulated the above problem of mapping tasks to SPEs as an integer linear programming (ILP) optimization problem with the constraints given. An ILP solver (we use CPLEX 10.2 [7]) can find optimal mappings for small tree sizes (usually for k ≤ 6) within reasonable time; for k = 7, it can still produce an approximative solution. For larger tree sizes, we used an approximation algorithm [4].


Fig. 2. Two Pareto-optimal solutions for mapping a 5-level merge tree onto 5 SPEs, computed by the ILP solver [4]. (a) The maximum memory load is 22 communication buffers (SPE4 has 10 nodes with 2 input buffers each, and 2 of these have output buffers for cross-SPE forwarding) and communication load 1.75 (times the root merger’s data output rate); (b) max. memory load 18 and communication load 2.5. The (expected) computational load is perfectly balanced (1.0 times the root merger’s load on each SPE) in both cases.

Optimized On-Chip-Pipelined Mergesort on the Cell/B.E.

193

The ILP-based mapping optimizer can be configured by a parameter ε ∈ (0, 1) that controls the priority of different secondary optimization goals, for memory load or communication load; computational load balance is always the primary optimization goal. Example mappings computed with different ε for a 5-level tree are visualized in Fig. 2.

4 Implementation Details

Merging kernel. SIMD instructions are used as much as possible in the innermost loops of the merger node. Merging two (quad-word) vectors is done entirely with SIMD instructions, as in CellSort [2]. In principle, it is possible to use only SIMD instructions in the entire merge loop, but we found that this did not reduce time, because the elimination of an if-statement required too many comparisons and redundant data movement.

Mapping optimizer. The mapping of merger task nodes to SPEs is read in by the PPE from a text file generated by the mapping optimizer. The PPE generates the task descriptors for each SPE at runtime, so that our code is not constrained to a particular merge tree, but is still optimized for the merge tree currently used. Due to the complexity of the optimization problem, optimal mappings can be (pre-)computed only for smaller tree sizes up to k = 6. For larger trees, we use the approximative mapping algorithm DC-map [4] that computes mappings by recursively composing mappings for smaller trees, using the available optimal mappings as base cases.

SPE task scheduler. Tasks mapped to the same SPE are scheduled by a user-level scheduler in round-robin order. A task is ready to run if it has sufficient input and an output buffer is free. A task runs as long as it has both input data and space in the output buffer, then initiates the transfer of its result packet to its parent node and returns control to the scheduler loop. If there are enough other tasks to run afterwards, the DMA time for flushing the output buffer is masked and hence only one output buffer per task is necessary (see below). Tasks that are not data-ready are skipped. As the root merger is always alone on its SPE, no scheduler is needed there and many buffers are available; its code is optimized for this special case.

Buffer management.
Because nodes (except for the root) are scheduled round-robin, the DMA latency can, in general, be masked completely by the execution of other tasks, and hence double-buffering of input or output streams is not necessary at all, which reduces buffer requirements considerably. An output stream buffer is only used for tasks whose parents/successors reside on a different SPE. Each SPE has a fixed-size pool of memory for buffers that is shared equally by the nodes. This means that nodes on less populated SPEs, for instance the root merger that has an SPE of its own, can get larger buffers (yet multiples of the packet size). Also, an SPE with high locality (few edges to tasks on other SPEs) needs fewer output buffers and thus may use larger buffers than another SPE with equally many nodes but where more output buffers are needed. A larger buffering capacity for certain tasks (compared to applying the worst-case size for all) reduces the likelihood of an SPE sitting idle because none of its merger tasks is data-ready.
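The round-robin, data-ready scheduling policy described under "SPE task scheduler" can be sketched as a few lines of C. This is a toy model, not the authors' scheduler (the `Task` fields and function names are hypothetical): a task runs only when it has input and free output space; otherwise it is skipped until the next round.

```c
/* Round-robin user-level scheduler sketch. A task executes only when
 * it is data-ready (input available AND output space free); tasks that
 * are not data-ready are skipped, as in the SPE scheduler described
 * above. Each run consumes one input packet and produces one output. */
typedef struct {
    int input;      /* packets waiting in the task's input buffer */
    int out_free;   /* free packet slots in its output buffer     */
    int executed;   /* how often the task actually ran            */
} Task;

static int run_round_robin(Task *tasks, int n, int rounds) {
    int total_runs = 0;
    for (int r = 0; r < rounds; r++)
        for (int t = 0; t < n; t++) {
            if (tasks[t].input > 0 && tasks[t].out_free > 0) {
                tasks[t].input--;       /* consume one packet */
                tasks[t].out_free--;    /* produce one packet */
                tasks[t].executed++;
                total_runs++;
            }                           /* else: not data-ready, skip */
        }
    return total_runs;
}
```

In the real scheduler, "producing a packet" additionally initiates the DMA transfer to the parent node, whose latency is masked by running the other tasks in the queue.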


Communication. Data is pushed up the tree (i.e., producers/child nodes control cross-SPE data transfer and consumers/parent nodes acknowledge receipt), except for the leaf nodes, which pull their input data from main memory. The communication looks different depending on whether the parent (consumer) node being pushed to is located on the same SPE or not. If the parent is local, the memory flow controller cannot be used, because it demands that the receiving address is outside the sender’s local store. Instead, the child’s output buffer and its parent’s input buffer can simply be the same. This eliminates the need for an extra output buffer and makes more efficient use of the limited amount of memory in the local store. The (system-global) addresses of buffers in the local store on the opposite side of cross-SPE DMA communications are exchanged between the SPEs in the beginning.

Synchronization. Each buffer is organized as a cyclic buffer with a head and a tail pointer. A task only reads from its input buffers and thus only updates the tail pointers and never writes to the head pointers. A child node only writes to its parent’s input buffers, which means it only writes to the head pointer and only reads from the tail position. The parent task updates the tail pointer of the input buffer for the corresponding child task; the child knows how large the parent’s buffer is and how much it has written to the parent’s input buffer so far, and thus knows how much space is left for writing data into the buffer. In particular, when a child reads the tail position of its parent’s input buffer, the value is pessimistic, so it is safe to use even if the parent is currently using its buffer and updating the tail position simultaneously. The reverse is true for the head position: the child writes to the parent’s head position of the corresponding input buffer and the parent only reads it. This means that no locks are needed for the synchronization between nodes.
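The head/tail protocol above is the classic single-producer/single-consumer cyclic buffer. The following is a minimal, single-threaded C sketch of that invariant (illustrative only; a real cross-SPE version additionally needs the DMA transfers and ordering guarantees the paper relies on): the producer writes only the head, the consumer writes only the tail, so a stale read of the other side's pointer is merely pessimistic.

```c
/* Lock-free SPSC cyclic buffer sketch mirroring the head/tail protocol:
 * the child (producer) only advances 'head', the parent (consumer) only
 * advances 'tail'; one slot is kept empty to distinguish full from empty. */
#define RING 8   /* usable capacity is RING - 1 */

typedef struct { int buf[RING]; int head, tail; } Ring;

static int ring_push(Ring *r, int v) {          /* producer (child) side */
    int next = (r->head + 1) % RING;
    if (next == r->tail) return 0;              /* full: producer stalls */
    r->buf[r->head] = v;
    r->head = next;
    return 1;
}

static int ring_pop(Ring *r, int *v) {          /* consumer (parent) side */
    if (r->tail == r->head) return 0;           /* empty: nothing to merge */
    *v = r->buf[r->tail];
    r->tail = (r->tail + 1) % RING;
    return 1;
}
```

A full buffer is exactly the stall condition for a predecessor merge task described in Section 3.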
DMA tag management. An SPE can have up to 32 DMA transfers in flight simultaneously and uses tags in {0, ..., 31} to distinguish between them when polling the DMA status. The Cell SDK offers an automatic tag manager for dynamic tag allocation and release. However, if an SPE has many buffers used for remote communication, it may run out of tags. If that happens, the tag-requesting task gives up, steps back into the task queue and tries to initiate the DMA transfer again when it is scheduled next.

5 Experimental Results

We used a Sony PlayStation 3 (PS3) with IBM Cell SDK 3.0 and an IBM blade server QS20 with SDK 3.1 for the measurements. We evaluated data sets as large as could fit into RAM on each system, which means up to 32Mi integers on the PS3 (6 SPEs, 256 MiB RAM) and up to 128Mi integers on the QS20 (16 SPEs, 1 GiB RAM). The code was compiled using gcc version 4.1.1 and run on Linux kernel version 2.6.18-128.el5.

A number of blocks equal to the number of leaf nodes in the tree to be tested were filled with random data and sorted. This corresponds to the state of the data after the local sorting phase (phase 1) of CellSort [2]. Ideally, each such block would be of the size of the aggregated local storage available for buffering on the processor. CellSort sorts 32Ki (32,768) integers per SPE; blocks would thus be 4 × 128KiB = 512KiB on

Optimized On-Chip-Pipelined Mergesort on the Cell/B.E.


the PS3 and 16 × 128KiB = 2MiB on the QS20. For example, a 6-level tree has 64 leaf nodes; hence the optimal data size on the QS20 would be 64 × 512KiB = 32MiB. However, other block sizes were also used in the tests in order to magnify the differences between the mappings. Different mappings (usually for optimization parameter values 0.1, 0.5 and 0.9) were tested.

5.1 On-Chip-Pipelined Merging Times

The resulting times with on-chip pipelining for 5-level and 6-level trees on the PS3 are shown in Fig. 3. For the QS20, mappings generated with parameter values 0.1, 0.5 and 0.9 were tested on different data sizes and merger tree sizes from k = 5 to k = 8; see Fig. 4. We see that the choice of the mapping can have a major impact on merging time, as even mappings that are optimal for different optimization goals exhibit timing differences of up to 25%.

Fig. 3. Merge times for k = 5 (left) and k = 6 (right) for different mappings on PS3

5.2 Results of DC-Map

Using the DC-map algorithm [4], mappings for trees for k = 8, 7 and 6 were constructed by recursive composition, using optimal mappings (computed with the ILP algorithm for parameter value 0.5) as base cases for smaller trees. Fig. 5 shows the results for merging 64Mi integers on the QS20.

5.3 Comparison to CellSort

Table 1 shows the direct comparison between the global merging phase of CellSort (which dominates the overall sorting time for large data sets like these) and on-chip-pipelined merging with the best mapping chosen. We achieve significant speedups with on-chip pipelining in all cases.

Table 1. Timings for the CellSort global merging phase vs. optimized on-chip-pipelined merging for global merging of integers on QS20

k   #ints   CellSort global merging   On-chip-pipelined merging   Speedup
5   16Mi    219 ms                    174 ms                      1.26
6   32Mi    565 ms                    350 ms                      1.61
7   64Mi    1316 ms                   772 ms                      1.70


R. Hultén, C.W. Kessler, and J. Keller

Fig. 4. Merge times for k = 5, 6, 7, 8 and different input sizes and mappings on QS20

Fig. 5. Merge times (64 Mi integers) for trees (k = 8, 7, 6) constructed from smaller trees using DC-map

The best speedup of 70% is obtained with 7 SPEs (64Mi elements) on the QS20, using the mapping with parameter value 0.01 shown in Fig. 6; the corresponding figure for the PS3 is 143% at k = 5 with 16Mi elements. The gain is due to reduced communication with off-chip memory.


Fig. 6. Merge times on QS20 for k = 7, further mappings

5.4 Discussion

While different mappings give some variation in execution times, the cost model used in the mapping optimizer seems to matter more than the priority parameters within it. Also with on-chip pipelining, using deeper tree pipelines (to fully utilize more SPEs) is not always beneficial beyond a certain depth k, here k = 6 for the PS3 and k = 8 for the QS20, as too large a number of tasks increases the overhead of on-chip pipelining (smaller buffers, scheduling overhead, tag administration, synchronization, communication overhead). The overall pipeline fill/drain overhead is more significant for smaller workloads but negligible for larger ones. From Fig. 1 it is clear that, with optimized mappings, the buffer size may be lowered without losing much performance, which frees more space in the local store of the SPEs, e.g. for accommodating the code for all phases of CellSort, saving the time overhead of loading a different SPE program segment for the merge phase.

6 Conclusion and Future Work

With an implementation of the global merging phase of parallel mergesort as a case study of a memory-intensive computation, we have demonstrated how to lower memory bandwidth requirements in code for the Cell BE by optimized on-chip pipelining. We obtained speedups of up to 70% on the QS20 and 143% on the PS3 over the global merging phase of CellSort, which dominates the sorting time for larger input sizes. On-chip pipelining is made possible by several architectural features of Cell that may not be available in other multicore processors. For instance, the possibility to forward data by DMA between individual on-chip memory units is not available on current GPUs, where communication goes only to and from off-chip global memory. The possibility to lay out buffers in on-chip memory and move data explicitly is not available on cache-based multicore architectures. Nevertheless, on-chip pipelining will be applicable in upcoming heterogeneous architectures for the DSP and multimedia domain with a design similar to Cell, such as ePUMA [9]. Intel's forthcoming 48-core single-chip


cloud computer [8] will support on-chip forwarding between tiles of two cores, with 16KB buffer space per tile, to save off-chip memory accesses. On-chip pipelining is also applicable to other streaming computations such as general data-parallel computations or FFT. In [10] we have described optimal and heuristic methods for optimizing mappings for general pipelined task graphs. The downside of on-chip pipelining is complex code that is hard to debug. We are currently working on an approach to generic on-chip pipelining where, given an arbitrary acyclic pipeline task graph, an (optimized) on-chip-pipelined implementation will be generated for Cell. This feature is intended to extend our BlockLib skeleton programming library for Cell [11]. Acknowledgements. C. Kessler acknowledges partial funding from EU FP7 (project PEPPHER, #248481), VR (Integr. Softw. Pipelining), SSF (ePUMA), Vinnova, and CUGS. We thank Niklas Dahl and his colleagues from IBM Sweden for giving us access to their QS20 blade server.

References

1. Chen, T., Raghavan, R., Dale, J.N., Iwata, E.: Cell Broadband Engine Architecture and its first implementation—a performance view. IBM J. Res. Devel. 51(5), 559–572 (2007)
2. Gedik, B., Bordawekar, R., Yu, P.S.: CellSort: High performance sorting on the Cell processor. In: Proc. 33rd Intl. Conf. on Very Large Data Bases, pp. 1286–1297 (2007)
3. Inoue, H., Moriyama, T., Komatsu, H., Nakatani, T.: AA-sort: A new parallel sorting algorithm for multi-core SIMD processors. In: Proc. 16th Intl. Conf. on Parallel Architecture and Compilation Techniques (PACT), pp. 189–198. IEEE Computer Society, Los Alamitos (2007)
4. Keller, J., Kessler, C.W.: Optimized pipelined parallel merge sort on the Cell BE. In: Proc. 2nd Workshop on Highly Parallel Processing on a Chip (HPPC-2008) at Euro-Par 2008, Gran Canaria, Spain (2008)
5. Kessler, C.W., Keller, J.: Optimized on-chip pipelining of memory-intensive computations on the Cell BE. In: Proc. 1st Swedish Workshop on Multicore Computing (MCC-2008), Ronneby, Sweden (2008)
6. Hultén, R.: On-chip pipelining on Cell BE. Forthcoming master thesis, Dept. of Computer and Information Science, Linköping University, Sweden (2010)
7. ILOG Inc.: Cplex version 10.2 (2007), http://www.ilog.com
8. Howard, J., et al.: A 48-Core IA-32 Message-Passing Processor with DVFS in 45nm CMOS. In: Proc. IEEE International Solid-State Circuits Conference, pp. 19–21 (February 2010)
9. Liu, D., et al.: ePUMA parallel computing architecture with unique memory access (2009), http://www.da.isy.liu.se/research/scratchpad/
10. Kessler, C.W., Keller, J.: Optimized mapping of pipelined task graphs on the Cell BE. In: Proc. 14th Int. Workshop on Compilers for Parallel Computing, Zürich, Switzerland (January 2009)
11. Ålind, M., Eriksson, M., Kessler, C.: BlockLib: A skeleton library for Cell Broadband Engine. In: Proc. ACM Int. Workshop on Multicore Software Engineering (IWMSE-2008) at ICSE-2008, Leipzig, Germany (May 2008)

Near-Optimal Placement of MPI Processes on Hierarchical NUMA Architectures

Emmanuel Jeannot and Guillaume Mercier
LaBRI, INRIA Bordeaux Sud-Ouest, Institut Polytechnique de Bordeaux
{Emmanuel.Jeannot,Guillaume.Mercier}@labri.fr

Abstract. MPI process placement can play a decisive role in application performance. This is especially true on today's architectures (heterogeneous, multicore, with several levels of cache, etc.). In this paper, we describe a novel algorithm called TreeMatch that maps processes to resources in order to reduce the communication cost of the whole application. We have implemented this algorithm and discuss its performance using simulations and the NAS benchmarks.

1 Introduction

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 199–210, 2010. © Springer-Verlag Berlin Heidelberg 2010

The landscape of parallel computing has undergone tremendous changes since the introduction of multicore architectures. Multicore machines feature hardware characteristics that are a novelty, especially when compared to cluster-based architectures. Indeed, the number of cores available within each system is much higher, and the memory hierarchy is much more complex than before. Thus, communication performance can change dramatically depending on the location of processes within the system, since the closer the data is to the process, the faster the access will be. This is known as the Non-Uniform Memory Access (NUMA) effect and is commonly experienced in modern computers. As the number of cores per node is expected to grow sharply in the near future, all these changes have to be taken into account in order to exploit such architectures to their full potential. However, there is a gap between the hardware and the software. Indeed, as far as programming is concerned, the change is less drastic, since users still rely on standards such as MPI or OpenMP. Hybrid programming (that is, mixing both message-passing and shared-memory paradigms) is one of the keys to obtaining the best performance from hierarchical multicore machines. This implies new programming practices that users should follow and apply. However, legacy MPI applications can already take advantage of the computing power offered by such complex architectures. One way of achieving this goal is to match an application's communication pattern to the underlying hardware. That is, the processes that communicate the most would be bound to cores that share the most levels of the memory hierarchy (e.g. caches). The idea is therefore to build a correspondence between the list of MPI process ranks and


the list of core numbers in the machine. Thus, the placement of MPI processes relies entirely on a matching computed by a suitable algorithm. The issue, then, is to use an algorithm that yields a satisfactory solution to our problem. In this paper, we introduce a new algorithm, called TreeMatch, that efficiently computes a solution to our problem by taking into account the specificities of the underlying hardware. The rest of this paper is organized as follows: in section 2 we describe related work. Section 3 presents the problem and shows how it can be modeled, while section 4 describes our TreeMatch algorithm. Both theoretical and empirical results are detailed in section 5. Finally, section 6 concludes this paper.

2 Related Work

Concerning process placement, pioneering work was done by Kruskal and Snir in [9], where the problem is modeled as a multicommodity flow. The MPIPP framework [2] takes into consideration a multicluster context and strives to dispatch the various MPI processes on the different clusters used by an application. Graph-theoretic algorithms are widely used to determine the matching (a list of (MPI process rank, core number) couples). For instance, several vendor MPI implementations, such as the ones provided by Hewlett-Packard [3] or by IBM (according to [4]), make use of such a mechanism. [6] also formalizes the problem with graphs. In these cases, however, the algorithm computing the final mapping is MPI implementation-specific and takes into account neither the complexity of the hierarchy encountered in multicore NUMA machines nor their topologies. In a previous paper [10], we used a graph-partitioning tool called Scotch [5] to perform the mapping computation. However, Scotch is able to work on any type of graph, not just trees as in our case. Making some hypotheses about the graph structure can lead to improvements, which is why we have developed a new algorithm, called TreeMatch, tailored to fit exactly our specific needs.

3 Problem Modeling

3.1 Hardware Architecture

The first step in determining a relevant process placement consists of gathering information about the underlying hardware. Retrieving information about the memory hierarchy in a portable way is not a trivial task. Indeed, no tool is able to provide information about the various cache levels (such as their respective sizes and which cores can access them) on a wide spectrum of systems. To this end, we participated in the development of a specific software tool that fulfills this goal: Hardware Locality, or Hwloc [1]. Thanks to Hwloc, we can get all the needed information about the architecture, that is, the number of NUMA nodes, sockets and cores, as well as information about the memory hierarchy. In previous works [2,10] the architecture is modeled by a topology matrix. Entries in this matrix correspond to the communication speed between two cores and take into account the number of elements of the memory hierarchy shared between the cores. However, such a representation induces a side effect, as it "flattens" the view of the hardware's structure. The result is a loss of valuable information that could be exploited by the matching algorithm in order to compute the best possible placement. Indeed, a NUMA machine is most of the time hierarchically structured. A tree can thus provide a more reliable representation than a topology matrix. Since Hwloc internally uses a tree-like data structure to store all information, we can easily build another data structure derived from Hwloc's. Formally, we have a tree of hardware elements. The depth of an element in this tree corresponds to its depth in the hierarchy; cores (or other computing elements) are the leaves of the tree.

3.2 Application Communication Patterns

The second vital piece of information is the target application's communication pattern. In our case, this pattern consists of the global amount of data exchanged between each pair of processes in the application and is stored in a p × p communication matrix (where p is the number of processes). This approximation yields a "flattened" view of the communication activities that occur during an application's execution. In order to gather the needed data, we chose to introduce a small amount of profiling code within an existing MPI implementation (MPICH2). By modifying the low-level communication channels in the MPICH2 stack, we are able to trace data exchanges for both point-to-point and collective communication (this profiling could also have been done by tools such as VampirTrace). Since this tracing is very lightweight, it does not disturb an application's execution. This approach has also been implemented by other vendor MPI implementations, such as HP-MPI (now Platform MPI) [3]. The main drawback is that a preliminary run of the application is mandatory in order to get the communication pattern, and a change in the execution (number of processors, input data, etc.) often requires rerunning the profiling. However, there are other possible approaches. For instance, the MPIPP framework uses a tool called FAST [11] that combines static analysis of the application's code and dynamic execution of a modified application that executes much faster while retaining the same communication pattern as the original application. The goal is to reduce the time necessary to determine the communication pattern by several orders of magnitude. It is to be noted that in all approaches an application, either the original one or a simpler version, has to be executed.

4 The TreeMatch Algorithm

Algorithm 1. The TreeMatch Algorithm
    Input: T // the topology tree
    Input: m // the communication matrix
    Input: D // the depth of the tree
1:  groups[1..D−1] ← ∅  // how nodes are grouped on each level
2:  foreach depth ← D−1 .. 1 do  // we start from the leaves
3:      p ← order of m
4:      if p mod arity(T, depth−1) ≠ 0 then  // extend the communication matrix if necessary
5:          m ← ExtendComMatrix(T, m, depth)
6:      groups[depth] ← GroupProcesses(T, m, depth)  // group processes by communication affinity
7:      m ← AggregateComMatrix(m, groups[depth])  // aggregate communication of the groups of processes
8:  MapGroups(T, groups)  // process the groups to build the mapping

We now describe the algorithm that we developed to compute the process placement. Our algorithm, as opposed to other approaches, is able to take into account the hardware's complex hierarchy. However, in order to slightly simplify the problem, we assume that the topology tree is balanced (leaves are all at the same

depth) and symmetric (all the nodes of a given depth possess the same arity). Such assumptions are indeed very realistic in the case of a homogeneous parallel machine where all processors, sockets, nodes or cabinets are identical. The goal of the TreeMatch algorithm is to assign to each MPI process a computing element, and hence a leaf of the tree. In order to optimize the communication time of an application, the TreeMatch algorithm maps processes to cores depending on the amount of data they exchange. The TreeMatch algorithm is depicted in Algorithm 1. To describe how it works, we will run it on the example given in Fig. 1. Here, the topology is modeled by a tree of depth 4 with 12 leaves (cores). The communication pattern between MPI processes is modeled by an 8 × 8 matrix (hence, we have 8 processes).

(a) Communication matrix:

Proc  0     1     2     3     4     5     6     7
0     0     1000  10    1     100   1     1     1
1     1000  0     1000  1     1     100   1     1
2     10    1000  0     1000  1     1     100   1
3     1     1     1000  0     1     1     1     100
4     100   1     1     1     0     1000  10    1
5     1     100   1     1     1000  0     1000  1
6     1     1     100   1     10    1000  0     1000
7     1     1     1     100   1     1     1000  0

(b) Topology tree (squares represent mapped processes using different algorithms)

Fig. 1. Input example of the TreeMatch algorithm

Function GroupProcesses(T, m, depth)
    Input: T // the topology tree
    Input: m // the communication matrix
    Input: depth // current depth
1:  l ← ListOfAllPossibleGroups(T, m, depth)
2:  G ← GraphOfIncompatibility(l)
3:  return IndependentSet(G)

The algorithm processes the tree upward, starting at depth 3. At this depth, the arity k = 2 of the nodes at the next level up divides the order p = 8 of the matrix m. Hence, we go directly to line 6, where the algorithm calls function GroupProcesses. This function first builds the list of possible groups of processes. The size of a group is given by the arity k of the nodes of the tree at the upper level (here 2). For instance, we can group process 0 with processes 1, or 2, and so on up to 7, and process 1 with processes 2 up to 7, and so on. Formally, we have (8 choose 2) = 28 possible groups of processes. As we have p = 8 processes and we group them by pairs (k = 2), we need to find p/k = 4 groups that do not have any process in common. To find these groups, we build the graph of incompatibilities between the groups (line 2). Two groups are incompatible if they share a process (e.g. group (2,5) is incompatible with group (5,7), as process 5 cannot be mapped to two different locations). In this incompatibility graph, the vertices are the groups, and there is an edge between two vertices if the corresponding groups are incompatible. The set of groups we are looking for is hence an independent set of this graph. In the literature, such a graph is referred to as the complement of a Kneser graph [8]. A valuable property of this graph (see http://www.princeton.edu/~jacobfox/MAT307/lecture14.pdf) is that, since k divides p, any maximal independent set is maximum and of size p/k. Therefore, any greedy algorithm always finds an independent set of the required size. However, not all groupings of processes (i.e. independent sets) are of equal quality: their quality depends on the values in the matrix. In our example, grouping process 0 with process 5 is not a good idea, as they exchange only one piece of data, and grouping them would leave a lot of remaining communication to perform at the next level of the topology. To account for this, we weight each vertex of the graph by the amount of communication that remains after the grouping. For instance, based on matrix m, the total communication of process 0 is 1114 and that of process 1 is 2104, for a total of 3218. If we group them together, we reduce the communication volume by 2000. Hence, the weight of the vertex corresponding to group (0,1) is 3218 − 2000 = 1218. The smaller the value, the better the grouping. Unfortunately, finding such an independent set of minimum weight is NP-hard and inapproximable within a constant ratio [7].
Hence, we use heuristics to find a "good" independent set:

– Smallest values first: we rank vertices by smallest value first and build a maximal independent set greedily, starting with the vertices of smallest value.
– Largest values last: we rank vertices by smallest value first and build a maximal independent set such that the largest index of the selected vertices is minimized.
– Largest weighted degrees first: we rank vertices by decreasing weighted degree (the average weight of their neighbours) and build a maximal independent set greedily, starting with the vertices of largest weighted degree [7].


In our case, whatever heuristic we use, we find the independent set of minimum weight, which is {(0,1),(2,3),(4,5),(6,7)}. This list is assigned to the array groups[3] in line 6 of the TreeMatch algorithm. This means that, for instance, process 0 and process 1 will be put on leaves sharing the same parent.
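To make the grouping step concrete, the following self-contained sketch (our own illustrative code in C, not the TreeMatch implementation; all names are ours) applies a greedy smallest-values-first pairing to the matrix of Fig. 1(a) and recovers exactly this grouping:

```c
#include <assert.h>

#define P 8                                  /* number of processes */

/* Communication matrix of Fig. 1(a) */
const int m[P][P] = {
    {    0, 1000,   10,    1,  100,    1,    1,    1 },
    { 1000,    0, 1000,    1,    1,  100,    1,    1 },
    {   10, 1000,    0, 1000,    1,    1,  100,    1 },
    {    1,    1, 1000,    0,    1,    1,    1,  100 },
    {  100,    1,    1,    1,    0, 1000,   10,    1 },
    {    1,  100,    1,    1, 1000,    0, 1000,    1 },
    {    1,    1,  100,    1,   10, 1000,    0, 1000 },
    {    1,    1,    1,  100,    1,    1, 1000,    0 },
};

int row_sum(int i)
{
    int s = 0;
    for (int j = 0; j < P; ++j)
        s += m[i][j];
    return s;
}

/* Weight of group (i,j): traffic of both members minus what the grouping
   resolves internally, e.g. (0,1) -> 1114 + 2104 - 2*1000 = 1218. */
int pair_value(int i, int j)
{
    return row_sum(i) + row_sum(j) - 2 * m[i][j];
}

/* Greedy "smallest values first" variant: repeatedly pick the cheapest pair
   among the still-unused processes until p/k disjoint groups are found. */
void group_pairs(int out[P / 2][2])
{
    int used[P] = { 0 };
    for (int g = 0; g < P / 2; ++g) {
        int bi = -1, bj = -1, bv = 0;
        for (int i = 0; i < P; ++i) {
            if (used[i]) continue;
            for (int j = i + 1; j < P; ++j) {
                if (used[j]) continue;
                int v = pair_value(i, j);
                if (bi < 0 || v < bv) { bv = v; bi = i; bj = j; }
            }
        }
        out[g][0] = bi; out[g][1] = bj;
        used[bi] = used[bj] = 1;
    }
}
```

On this matrix the four cheapest disjoint pairs all have weight 1218, so the greedy choice yields {(0,1),(2,3),(4,5),(6,7)}.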

Function AggregateComMatrix(m, g)
    Input: m // the communication matrix
    Input: g // list of groups of (virtual) processes to merge
1:  n ← NbGroups(g)
2:  for i ← 0..(n−1) do
3:      for j ← 0..(n−1) do
4:          if i = j then
5:              r[i, j] ← 0
6:          else
7:              r[i, j] ← Σ_{i1 ∈ g[i]} Σ_{j1 ∈ g[j]} m[i1, j1]
8:  return r
The algorithm then proceeds to build the groups at depth 2. Prior to that, however, we need to aggregate the matrix m with the remaining communication. The aggregated matrix is computed by AggregateComMatrix. The goal is to compute the remaining communication between each pair of groups of processes. For instance, between the first group (0,1) and the second group (2,3) the amount of communication is 1012, and this value is put in r[0, 1] (see Fig. 2(a)). The matrix r is of size 4 by 4 (we have 4 groups) and is returned to be assigned to m (line 7 of the TreeMatch algorithm). Now the matrix m corresponds to the communication pattern between the groups of processes (called virtual processes) built during this step. The goal of the remaining steps of the algorithm is to group these virtual processes up to the root of the tree. The algorithm then loops and decrements depth to 2. Here, the arity at depth 1 is 3 and does not divide the order of m (4); hence we add two artificial groups that do not communicate with any other group. This means that we add two lines and two columns full of zeroes to matrix m. The new matrix is depicted in Fig. 2(b). The goal of this step is to allow more flexibility in the mapping, thus yielding a more efficient mapping.
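The aggregation step can be checked with a few lines of code. This is our own illustrative sketch mirroring AggregateComMatrix for the pair grouping of the running example (function and variable names are ours):

```c
#include <assert.h>

/* Communication matrix of Fig. 1(a), stored for the check below. */
const int comm[8][8] = {
    {    0, 1000,   10,    1,  100,    1,    1,    1 },
    { 1000,    0, 1000,    1,    1,  100,    1,    1 },
    {   10, 1000,    0, 1000,    1,    1,  100,    1 },
    {    1,    1, 1000,    0,    1,    1,    1,  100 },
    {  100,    1,    1,    1,    0, 1000,   10,    1 },
    {    1,  100,    1,    1, 1000,    0, 1000,    1 },
    {    1,    1,  100,    1,   10, 1000,    0, 1000 },
    {    1,    1,    1,  100,    1,    1, 1000,    0 },
};

/* Sum m[a][b] over a in group i and b in group j, for a p x p matrix m stored
   row-major; each group holds gsize consecutive entries of groups[]. */
void aggregate(int p, const int *m, int ngroups, int gsize,
               const int *groups, int *r)
{
    for (int i = 0; i < ngroups; ++i)
        for (int j = 0; j < ngroups; ++j) {
            int s = 0;
            if (i != j)                       /* diagonal stays zero */
                for (int a = 0; a < gsize; ++a)
                    for (int b = 0; b < gsize; ++b)
                        s += m[groups[i * gsize + a] * p
                               + groups[j * gsize + b]];
            r[i * ngroups + j] = s;
        }
}
```

Applied to the groups (0,1), (2,3), (4,5), (6,7), this reproduces the matrix of Fig. 2(a), including r[0, 1] = 1012.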

(a) Aggregated matrix (depth 2):

Virt. proc  0     1     2     3
0           0     1012  202   4
1           1012  0     4     202
2           202   4     0     1012
3           4     202   1012  0

(b) Extended matrix:

Virt. proc  0     1     2     3     4  5
0           0     1012  202   4     0  0
1           1012  0     4     202   0  0
2           202   4     0     1012  0  0
3           4     202   1012  0     0  0
4           0     0     0     0     0  0
5           0     0     0     0     0  0

(c) Aggregated matrix (depth 1):

Virt. proc  0    1
0           0    412
1           412  0

Fig. 2. Evolution of the communication matrix at different steps of the algorithm


Function ExtendComMatrix(T, m, depth)
    Input: T // the topology tree
    Input: m // the communication matrix
    Input: depth // current depth
1:  p ← order of m
2:  k ← arity(T, depth+1)
3:  return AddEmptyLinesAndCol(m, k, p)

Once this step is performed, we can group the virtual processes (the groups of processes built in the previous step). Here, the graph modeling and the independent-set heuristics lead to the following grouping: {(0,1,4),(2,3,5)}. Then we aggregate the remaining communication to obtain a 2-by-2 matrix (see Fig. 2(c)). During the next loop iteration (depth = 1), there is only one possibility to group the virtual processes: {(0,1)}, which is assigned to groups[1]. The algorithm then goes to line 8. The goal of this step is to map the processes to the resources. To perform this task, we use the groups array, which describes a hierarchy of process groups. A traversal of this hierarchy gives the process mapping. For instance, virtual process 0 (resp. 1) of groups[1] is mapped on the left (resp. right) part of the tree. When a group corresponds to an artificial group, no processes are mapped to the corresponding subtree. At the end, processes 0 to 7 are respectively mapped to leaves (cores) 0,2,4,6,1,3,5,7 (see the bottom of Fig. 1(b)). This mapping is optimal. Indeed, it is easy to see that the algorithm provides an optimal solution if the communication matrix corresponds to a hierarchical communication pattern (processes can be arranged in a tree, and the closer they are in this tree, the more they communicate) that can be mapped to the topology tree (such as the matrix of Fig. 1(a)). In this case, optimal groups of (virtual) processes are automatically found by the independent-set heuristic, as the corresponding weights of these groups are the smallest among all groups. Moreover, thanks to the creation of artificial groups in line 5, we avoid the Packed mapping 0,2,4,6,8,1,3, which is worse, as processes 4 and 5 communicate a lot with processes 6 and 7 and hence must be mapped to the same subtree. In the same figure, we can see that the Round Robin mapping, which maps process i on core i, also leads to a very poor result.

5 Experimental Validation

In this section we present several sets of results. First, we show simulated performance comparisons of TreeMatch against simple placement policies such as Round Robin, where processes are dispatched to the various NUMA nodes in a round-robin fashion, and Packed, where processes are bound to cores of the same node until it is fully occupied, and so on. We also compare TreeMatch with the algorithm used in the MPIPP framework [2]. Basically, the MPIPP algorithm starts from a random mapping and strives to improve it by exchanging the cores assigned to two processes. The algorithm stops when no further improvement is possible. We have implemented


two versions of this randomized algorithm: MPIPP.5, where we take the best result of the MPIPP algorithm over five different initial random mappings, and MPIPP.1, where only one initial mapping is used.

5.1 Experimental Set-Up

For our simulation experiments, we used the NAS communication patterns, computed by profiling the execution of each benchmark. In this work the considered benchmarks have a size (number of processes) of 16, 32/36 or 64. The kernels are bt, cg, ep, ft, is, lu, mg and sp, and the classes (data sizes) are A for sizes 16 and 32/36, and B, C and D for sizes 32/36 and 64. For these benchmarks we use the topology tree constructed from the real Bertha machine described below. We have also used 14 synthetic communication matrices. These are random matrices with a clear hierarchy between the processes: distinct pairs of processes communicate a lot together, then pairs of pairs communicate a little less, etc. For the synthetic communication matrices we also used synthetic topologies built using the Hwloc tools, in addition to the Bertha topology. We have used 6 different topologies with a depth from 3 to 5, mimicking current parallel machines (e.g. a cluster of 20 nodes with 4 sockets per node, 6 cores per socket, cores being paired by an L2 cache).

The real-scale experiments were carried out on a 96-core machine called Bertha. It is a single system composed of four NUMA nodes. In this context, all communication between NUMA nodes is performed through shared memory. Each NUMA node features four sockets and each socket features six cores. The CPUs are Intel Dunnington processors at 2.66 GHz, where the L2 cache is shared between two cores and the L3 cache is shared by all the cores on the socket. We use MPICH2 for our real-scale experiments, as its process manager (called Hydra) includes Hwloc and thus provides a simple way to bind processes to cores.

5.2 Simulation Results

We have carried out simulations to assess the raw performance of our algorithm. Results are depicted in Fig. 3. On the diagonal of each figure the different heuristics are displayed. The lower part displays the histogram and the ECDF (empirical cumulative distribution function) of the simulated runtime ratios between the two heuristics in the corresponding row and column. If the ratio is greater than 1, the upper heuristic outperforms the lower one. The upper part gives a numeric summary: the proportion of ratios strictly above 1; the proportion of ratios equal to 1 (if any); the median ratio; the average ratio; and, in brackets, the maximum and minimum ratios. For example, in Fig. 3(a) we see that TreeMatch outperforms MPIPP.1 in more than 93% of the cases, with a median ratio of 1.415, an average ratio of 1.446, a minimum ratio of 1 and a maximum ratio of 2.352.

Fig. 3. Simulation results. (a) Simulation of matchings of the NAS benchmarks for the different heuristics, for all sizes. (b) Simulation of matchings of synthetic input for the different heuristics, for all sizes; RR is not shown as it is identical to the Packed mapping.

On Fig. 3(a), we see that for NAS communication patterns, our TreeMatch algorithm is better than all the other algorithms. It is only slightly better than MPIPP.5, but this version of MPIPP is run on five different seeds and hence is always slower than our algorithm.⁴ The MPIPP version with one seed (MPIPP.1) is outperformed by all the other algorithms. The Packed method provides better results than the Round Robin method because cores are not numbered sequentially by the system or the BIOS. On Fig. 3(b), we see that for synthetic input, the results are even more in favor of the TreeMatch algorithm. This comes from the fact that our algorithm finds the optimal matching for these kinds of matrices, as shown in Section 4.

5.3 NAS Parallel Benchmarks

We then compare the results for each heuristic on the real Bertha machine, using the NAS benchmarks. Here, ratios are computed on the average runtime of at least 4 runs. In Fig. 4(a) we see that TreeMatch is the best heuristic among all the tested ones. It slightly outperforms MPIPP.5, but this heuristic is much slower than ours. Surprisingly, TreeMatch is also only slightly better than Round Robin, and in some cases the ratio is under 0.8. Actually, it appears that Round Robin is very good for NAS instances of size 16 (when all cores are grouped on the same node). This means that for small-size problems a clever mapping is not required. However, if we plot the ratios for sizes above or equal to 32 (Fig. 4(b)), we see that, in such cases, TreeMatch compares even more favorably to Packed,

⁴ Up to 82 times slower in some cases.

208

E. Jeannot and G. Mercier

Fig. 4. NAS comparison. (a) NAS benchmark comparison of the different heuristics for all sizes (panel "Bertha ALL"). (b) NAS benchmark comparison of the different heuristics for sizes 32, 36 and 64 (panel "Bertha 64-32").

Round Robin, or MPIPP.1 (Round Robin being in this case the worst method). Moreover, we see that TreeMatch is never outperformed by more than 10% (the ratio is never under 0.9), and in some cases the gain approaches 30%.

5.4 NAS Communication Patterns Modeling

In the previous section, we have seen that on average the TreeMatch algorithm sometimes outperforms the other methods by only a small margin, and many performances are comparable (in the histograms of Fig. 4, most of the results have a ratio close to 1). We conjecture that this is mainly due to the fact that, many NAS kernels being computation-bound, improvements in terms of communication time are not always visible. Moreover, the communication patterns are an aggregation of the whole execution and do not account for phases in the algorithm. Hence, to evaluate the impact of our mapping on communication, we have designed an MPI program that executes only the communication pattern (exchanging data corresponding to the pattern with MPI_Alltoallv) and does not perform any computation. Results are displayed in Fig. 5. In Fig. 5(a) we see that TreeMatch is the best heuristic. Except for MPIPP.5, it outperforms the other heuristics in almost 75% of the cases. In several cases the gain exceeds 30%, and the loss never exceeds 10% (except for RR, where the minimum ratio is 0.83). When we restrict the experiments to 64 processes, the results are even more favorable to TreeMatch (Fig. 5(b)). In this case, the overall worst ratio is always greater than 0.92 (8% degradation), while TreeMatch outperforms the other techniques by up to 20%. Moreover, it performs better than or similarly to MPIPP.5 in two thirds of the cases.
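Such a replay benchmark only needs the per-rank count and displacement arrays that MPI_Alltoallv expects. As a minimal sketch (the paper's benchmark code is not shown, so the helper name and data layout below are our own assumption), these arrays can be derived from the measured communication matrix:

```python
def alltoallv_args(comm_matrix, rank):
    # comm_matrix[i][j]: number of elements rank i sends to rank j
    # (hypothetical helper; the paper's actual benchmark is not public)
    sendcounts = list(comm_matrix[rank])             # what this rank sends
    recvcounts = [row[rank] for row in comm_matrix]  # what this rank receives
    sdispls, rdispls, s, r = [], [], 0, 0
    for sc, rc in zip(sendcounts, recvcounts):
        sdispls.append(s)   # offset of the block destined for each peer
        rdispls.append(r)   # offset of the block arriving from each peer
        s += sc
        r += rc
    return sendcounts, sdispls, recvcounts, rdispls
```

With mpi4py, for instance, these four arrays would be passed to `comm.Alltoallv` together with the send and receive buffers.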

Near-Optimal Placement of MPI Processes

Fig. 5. Communication pattern comparison for the NAS benchmark. (a) NAS benchmark communication pattern comparison of the different heuristics for all sizes (panel "Model ALL"). (b) NAS benchmark communication pattern comparison of the different heuristics for size 64 (panel "Model 64").

6 Conclusion and Future Work

Executing a parallel application on a modern architecture requires carefully taking into account the architectural features of the environment. Indeed, current parallel computers are highly hierarchical, both in terms of topology (e.g., clusters made of nodes with several multicore processors) and in terms of data accesses or exchanges (NUMA architectures with various levels of cache, network interconnection of nodes, etc.). In this paper, we have investigated the placement of MPI processes on these modern infrastructures. We have proposed an algorithm called TreeMatch that maps processes to computing elements based on the hierarchical topology of the target environment and on the communication pattern of the different processes. Under reasonable assumptions (e.g., if the communication pattern is structured hierarchically), this algorithm provides an optimal mapping. Simulation results show that our algorithm outperforms other approaches (such as the MPIPP algorithm) both in terms of mapping quality and computation speed. Moreover, the quality improves with the number of processors. As, in some cases, the TreeMatch performance is very similar to that of other strategies, we have studied its impact when we remove the computations. In this case, we see a greater difference in terms of performance. We can then conclude that this approach delivers its full potential for communication-bound applications. However, this difference also highlights some modeling issues, as the communication matrix is an aggregated view of the whole execution and does not account for different phases of the application with different communication patterns.
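To illustrate the flavor of such topology-aware mapping, one level of grouping processes by communication volume might look like the sketch below. This is a simplified greedy stand-in, not the actual TreeMatch algorithm (which computes an optimal grouping per level of the topology tree); function name and interface are our own.

```python
def greedy_group(comm):
    """Pair up processes that communicate the most (one tree level).

    comm[i][j] is the amount of data process i sends to process j.
    Greedy stand-in for a TreeMatch-style grouping step.
    """
    n = len(comm)
    # all process pairs, heaviest total exchange first
    pairs = sorted(((comm[i][j] + comm[j][i], i, j)
                    for i in range(n) for j in range(i + 1, n)),
                   reverse=True)
    used, groups = set(), []
    for _, i, j in pairs:
        if i not in used and j not in used:
            groups.append((i, j))   # these two should share a node/cache
            used |= {i, j}
    return groups
```

Each resulting group would then be treated as a single "virtual process" when mapping the next level of the hierarchy.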



Parallel Enumeration of Shortest Lattice Vectors

Özgür Dagdelen¹ and Michael Schneider²

¹ Center for Advanced Security Research Darmstadt - CASED [email protected]
² Technische Universität Darmstadt, Department of Computer Science [email protected]

Abstract. Lattice basis reduction is the problem of finding short vectors in lattices. The security of lattice-based cryptosystems relies on the hardness of lattice reduction. Furthermore, lattice reduction is used to attack well-known cryptosystems such as RSA. One of the algorithms used in lattice reduction is the enumeration algorithm (ENUM), which provably finds a shortest vector of a lattice. We present a parallel version of the lattice enumeration algorithm. Using multi-core CPU systems with up to 16 cores, our implementation gains a speed-up of up to factor 14. Compared to the currently best public implementation, our parallel algorithm saves more than 90% of runtime.

Keywords: lattice reduction, shortest vector problem, cryptography, parallelization, enumeration.

1 Introduction

A lattice L is a discrete subgroup of the space R^d. Lattices are represented by linearly independent basis vectors b_1, ..., b_n ∈ R^d, where n is called the dimension of the lattice. Lattices have been known in number theory since the eighteenth century; they already appear in the work of Lagrange, Gauss, and Hermite on quadratic forms. Nowadays, lattices and hard lattice problems are widely used in cryptography as the basis of promising cryptosystems. One of the main problems in lattices is the shortest vector problem (SVP), which asks for a vector of shortest length in the lattice. The shortest vector problem is known to be NP-hard under randomized reductions. It is also considered to be intractable even in the presence of quantum computers. Therefore, many lattice-based cryptographic primitives, e.g., one-way functions, hash functions, encryption, and digital signatures, leverage the hardness of the SVP. In the field of cryptanalysis, lattice reduction is used to attack the NTRU and GGH cryptosystems. Further, there are attacks on RSA and on low-density knapsack cryptosystems. Lattice reduction also has applications in other fields of mathematics and number theory: it is used for factoring composite numbers and computing discrete logarithms using diophantine approximations, and in the field of discrete optimization it is used to solve linear integer programs.

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 211-222, 2010. © Springer-Verlag Berlin Heidelberg 2010

Ö. Dagdelen and M. Schneider

The fastest algorithms known to solve SVP are the enumeration algorithm of Kannan [8] and the algorithm of Fincke and Pohst [3]. The variant used mostly in practice was presented by Schnorr and Euchner in 1991 [14]. Nevertheless, these algorithms solve SVP in exponential runtime. So far, enumeration is only applicable in low lattice dimensions (n ≤ 60).¹ For higher dimensions it is only possible to find short vectors, but not a shortest vector. Mostly, these approximate solutions of the SVP are sufficient in practice. In 1982 the famous LLL algorithm was presented for factoring polynomials [9]. This algorithm does not solve SVP; rather, it finds a vector with length exponential in the lattice dimension. However, LLL is the first algorithm with a polynomial asymptotic running time, and it can be run in lattice dimensions up to 1000. In practice, the most promising algorithm for lattice reduction in high dimensions is the BKZ block algorithm by Schnorr and Euchner [14]. It mainly consists of two parts, namely enumeration in blocks of small dimension and LLL in high dimension. BKZ finds shorter vectors than LLL, at the expense of a higher runtime. Considering parallelization, there are various works dealing with LLL, e.g., [15,2]. The more time-consuming part of BKZ, namely the enumeration step (assuming the use of high block sizes), was considered in the master's thesis of Pujol [11] and in a very recent work [13]. A GPU version of enumeration was shown in [7]. The enumeration in lattices can be visualized as a depth-first search in a weighted search tree, with different subtrees being independent from each other. Therefore, it is possible to enumerate different subtrees in parallel threads without any communication between threads. We have chosen multi-core CPUs for the implementation of our parallel enumeration algorithm.

Our Contribution. In this paper, we parallelize the enumeration (ENUM) algorithm by Schnorr and Euchner [14].
We implement the parallel version of ENUM and test it on multi-core CPUs. More precisely, we use up to 16 CPU cores to speed up the lattice enumeration, in lattice dimensions of 40 and above. Considering the search tree, the main problem is to predict beforehand the subtrees that are examined during enumeration. We gain speed-ups of up to factor 14 in comparison to our single-core version.² Compared to the fastest single-core ENUM implementation known, our parallel version of ENUM saves more than 90% of runtime. We add some clever additional communication among threads, such that by using s processor cores we even gain a speed-up of more than s in some cases. By this work, we show that it is possible to parallelize the entire BKZ algorithm for the search for short lattice vectors. The strength of BKZ is used to assess the practical hardness of lattice reduction, which helps in finding suitable parameters for secure lattice-based cryptosystems. The algorithm of Pujol [11,13] uses a volume heuristic to predict the number of enumeration steps that will be performed in a subtree. This estimate is used to

¹ The recent work of [5] could no longer be considered for our final version.
² On a 24-core machine we gain speed-up factors of 22.


predict whether a subtree should be split recursively for enumeration in parallel. In contrast to that, our strategy is to control the height of the subtrees that can be split recursively.

Organization. Section 2 explains the required basic facts on lattices and parallelization, Section 3 describes the ENUM algorithm of [14], Section 4 presents our new algorithm for parallel enumeration, and Section 5 shows our experimental results.

2 Preliminaries

Notation. Vectors and matrices are written in bold face, e.g., v and M. The expression ⌈x⌋ denotes the nearest integer to x ∈ R, i.e., ⌈x⌋ = ⌈x − 0.5⌉. ‖v‖ denotes the Euclidean norm; other norms are indexed with a subscript, like ‖v‖_∞. Throughout the paper, n denotes the lattice dimension.
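In Python, this rounding convention (nearest integer, ties rounded down) can be sketched as:

```python
import math

def nearest_int(x):
    # the paper's rounding: nearest integer to x with ties rounded down,
    # i.e. the ceiling of x - 0.5
    return math.ceil(x - 0.5)
```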

Lattices. A lattice is a discrete additive subgroup of R^d. It can be represented as the linear integer span of n ≤ d linearly independent vectors b_1, ..., b_n ∈ R^d, which are arranged in a column matrix B = [b_1, ..., b_n] ∈ R^{d×n}. The lattice L(B) is the set of all linear integer combinations of the basis vectors b_i, namely L(B) = { Σ_{i=1}^n x_i b_i : x_i ∈ Z }. The dimension of the lattice equals the number of linearly independent basis vectors n. If n = d, the lattice is called full-dimensional. For n ≥ 2 there are infinitely many bases of a lattice; one basis can be transformed into another using a unimodular transformation matrix. The first successive minimum λ_1(L(B)) is the length of a shortest nonzero vector of the lattice. There may exist multiple shortest vectors of a lattice; a shortest vector is not unique. Define the Gram-Schmidt orthogonalization B* = [b*_1, ..., b*_n] of B. It is computed via b*_i = b_i − Σ_{j=1}^{i−1} μ_{i,j} b*_j for i = 1, ..., n, where μ_{i,j} = b_i^T b*_j / ‖b*_j‖² for all 1 ≤ j ≤ i ≤ n. We have B = B* μ^T, where B* is orthogonal and μ^T is an upper triangular matrix. Note that B* is not necessarily a lattice basis.
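The Gram-Schmidt computation translates directly into code. A small sketch in plain Python, operating on row vectors and assuming nothing beyond the definitions above:

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def gram_schmidt(B):
    """Compute B* = [b*_1, ..., b*_n] and the coefficients mu_{i,j}
    for a basis B given as a list of vectors."""
    n = len(B)
    Bstar = []
    mu = [[0.0] * n for _ in range(n)]
    for i in range(n):
        v = [float(x) for x in B[i]]
        for j in range(i):
            # mu_{i,j} = <b_i, b*_j> / ||b*_j||^2
            mu[i][j] = dot(B[i], Bstar[j]) / dot(Bstar[j], Bstar[j])
            v = [a - mu[i][j] * b for a, b in zip(v, Bstar[j])]
        Bstar.append(v)  # b*_i = b_i - sum_{j<i} mu_{i,j} b*_j
    return Bstar, mu
```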

Lattice Problems and Algorithms. The most famous problem in lattices is the shortest vector problem (SVP). The SVP asks to find a shortest vector of the lattice, namely a vector v ∈ L\{0} with ‖v‖ = λ_1(L). An approximation version of the SVP was solved by Lenstra, Lenstra, and Lovász [9]. The LLL algorithm is still the basis of most algorithms used for basis reduction today. It runs in time polynomial in the lattice dimension and outputs a so-called LLL-reduced basis. This basis consists of nearly orthogonal vectors and a short first basis vector, with approximation factor exponential in the lattice dimension. The BKZ algorithm by Schnorr and Euchner [14] reaches better approximation factors and is the algorithm used mostly in practice today. As a subroutine, BKZ makes use of an exact SVP solver, such as ENUM. In practice, SVP can only be solved in low dimension n, say up to 60, using exhaustive search techniques or, as a second approach, using sieving algorithms that work probabilistically. An overview of enumeration algorithms is presented in [12]. A randomized sieving approach for solving exact SVP was presented in [1] and an improved variant in [10]. In this paper, we only deal with enumeration algorithms.


Exhaustive Search. In [12] Pujol and Stehlé examine the floating-point behaviour of the ENUM algorithm. They state that double precision is suitable for lattice dimensions up to 90. It is common practice to pre-reduce lattices before starting enumeration, as this reduces the radius of the search space. In BKZ, the basis is always reduced with the LLL algorithm before starting enumeration. Publicly available implementations of enumeration algorithms are the established implementation of Shoup's NTL library and the fpLLL library of Stehlé et al. Experimental data on enumeration algorithms using NTL can be found in [4,10], both using NTL's enumeration. A parallel implementation of ENUM is available at Xavier Pujol's website.³ To our knowledge there are no results published using this implementation. Pujol mentions a speed-up factor of 9.7 using 10 CPUs. Our work was developed independently of Pujol's achievements.

Parallelization. Before we present our parallel enumeration algorithm, we need to introduce definitions specifying the quality of the realized parallelization. Furthermore, we give a brief overview of parallel computing paradigms. There exist many parallel environments to perform operations concurrently. Basically, on today's machines, one distinguishes between shared memory and distributed memory passing. A multi-core microprocessor follows the shared memory paradigm, in which each processor core accesses the same memory space. Nowadays, such computer systems are commonly available. They possess several cores, each core acting as an independent processor unit. The operating system is responsible for delivering operations to the cores. In the parallelization context there exist notions that measure the achieved quality of a parallel algorithm compared to the sequential version. In the sequel of this paper, we will need the following definitions:

Speed-up factor: the time needed for the serial computation divided by the time required for the parallel algorithm. Using s processes, a speed-up factor of up to s is expected.

Efficiency: the speed-up factor divided by the number of used processors. An efficiency of 1.0 means that s processors lead to a speed-up factor of s, which can be seen as a "perfect" parallelization. Normally the efficiency is smaller than 1.0 because of the overhead of inter-process communication. Parallel algorithms such as graph search algorithms may benefit from communication, in such a way that fewer operations need to be computed. As soon as the number of saved operations exceeds the communication overhead, an efficiency of more than 1.0 might be achieved. For instance, branch-and-bound algorithms for integer linear programming might have superlinear speed-up, due to the interdependency between the search order and the condition which enables the algorithm to disregard a subtree. The enumeration algorithm falls into this category as well.

³ http://perso.ens-lyon.fr/xavier.pujol/index_en.html
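The two definitions above are plain ratios; a minimal sketch:

```python
def speedup(t_serial, t_parallel):
    # time of the serial run divided by time of the parallel run
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, s):
    # speed-up factor divided by the number of processors s
    return speedup(t_serial, t_parallel) / s
```

For example, the speed-up factor of 9.7 on 10 CPUs mentioned above corresponds to an efficiency of 0.97.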

3 Enumeration of the Shortest Lattice Vector

In this chapter we give an overview of the ENUM algorithm first presented in [14]. Originally, the algorithm was proposed as a subroutine of the BKZ algorithm, but ENUM can be used as a stand-alone instance to solve the exact SVP. An example instance of ENUM in dimension 3 is shown by the solid line of Figure 1. An algorithm listing is shown as Algorithm 1.

Algorithm 1. Basic Enumeration Algorithm
Input: Gram-Schmidt coefficients (μ_{i,j})_{1≤j≤i≤n}, ‖b*_1‖², ..., ‖b*_n‖²
Output: u_min such that ‖Σ_{i=1}^n u_i b_i‖ = λ_1(L(B))
1:  A ← ‖b*_1‖², u_min ← (1, 0, ..., 0), u ← (1, 0, ..., 0), l ← (0, ..., 0), c ← (0, ..., 0)
2:  t ← 1
3:  while t ≤ n do
4:      l_t ← l_{t+1} + (u_t + c_t)² ‖b*_t‖²
5:      if l_t < A then
6:          if t > 1 then
7:              t ← t − 1                       ▷ move one layer down in the tree
8:              c_t ← Σ_{i=t+1}^n u_i μ_{i,t},  u_t ← ⌈c_t⌋
9:          else
10:             A ← l_t, u_min ← u              ▷ set new minimum
11:         end
12:     else
13:         t ← t + 1                           ▷ move one layer up in the tree
14:         choose next value for u_t using the zig-zag pattern
15:     end
16: end

To find a shortest non-zero vector of a lattice L(B) with B = [b_1, ..., b_n], ENUM takes as input the Gram-Schmidt coefficients (μ_{i,j})_{1≤j≤i≤n}, the squared norms ‖b*_1‖², ..., ‖b*_n‖² of the Gram-Schmidt orthogonalization of B, and an initial bound A. The search space is the set of all coefficient vectors u ∈ Z^n that satisfy ‖Σ_{t=1}^n u_t b_t‖² ≤ A. Starting with an LLL-reduced basis, it is common to set A = ‖b*_1‖² in the beginning. If the norm of the shortest vector is known beforehand, it is possible to start with a lower A, which limits the search space and reduces the runtime of the algorithm. If a vector v of norm smaller than A is found, A can be reduced to the norm of v; that means A always denotes the size of the current shortest vector. The goal of ENUM is to find a coefficient vector u ∈ Z^n satisfying

‖Σ_{t=1}^n u_t b_t‖ = min_{x ∈ Z^n} ‖Σ_{t=1}^n x_t b_t‖ .   (1)

Therefore, all coefficient combinations u that determine a vector of norm less than A are enumerated. In Equation (1) we replace all b_t by their orthogonalization, i.e., b_t = b*_t + Σ_{j=1}^{t−1} μ_{t,j} b*_j, and get Equation (2):

‖Σ_{t=1}^n u_t b_t‖² = ‖Σ_{t=1}^n u_t · (b*_t + Σ_{j=1}^{t−1} μ_{t,j} b*_j)‖² = Σ_{t=1}^n (u_t + Σ_{i=t+1}^n μ_{i,t} u_i)² · ‖b*_t‖² .   (2)

216

¨ Dagdelen and M. Schneider O.

n Let c ∈ Rd with ct = i=t+1 μi,t ui (line 8), which is predefined by all coefficients ui with n ≥ i > t. The intermediate norm lt (line 4) is defined as lt = lt+1 + (ut + ct )2 b∗t 2 . This is the norm part of Equation 2 that is predefined by the values ui with n ≥ i ≥ t. The algorithm enumerates the coefficients in reverse order, from un to u1 . This can be considered as finding a minimum in a weighted search tree. The height of the tree is uniquely determined by the dimension n. The root of the tree denotes the coefficient un . The coefficient values ut for 1 ≤ t ≤ n determine the values of the vertices of depth (n − t + 1), leafs of the tree contain coefficients u1 . The inner nodes represent intermediate nodes, not complete coefficient vectors, i.e., a node on level t determines a subtree (⊥, . . . , ⊥, ut , ut+1 , . . . , un ), where the first t − 1 coefficients are not yet set. lt is the norm part predefined by this inner node on level t. We only enumerate parts of the tree with lt < A. Therefore, the possible values for ut on the next lower level are in an interval around ct with (ut + ct )2 < (A − lt+1 )/ b∗t , following the definition of lt . ENUM iterates over all possible values for ut , as long as lt ≤ A, the current minimal value. If lt exceeds A, enumeration of the corresponding subtree can be cut off, the intermediate norm lt will only increase when stepping down in the tree, as lt ≤ lt−1 always holds. The iteration over all possible coefficient values is (due to Schnorr and Euchner) performed in a zig-zag pattern. The values for ut will be sequenced like either ct , ct + 1, ct − 1, ct + 2, ct − 2, . . . or ct , ct − 1, ct + 1, ct − 2, ct + 2, . . .. ENUM starts at the leaf (1, 0, . . . , 0) and gives the first possible solution for a shortest vector in the given lattice. The algorithm performs its search by moving up (when a subtree can be cut off due to lt ≥ A) and down in the tree (lines 13 and 7). The norm of leaf nodes is compared to A. 
If $l_1 \leq A$, the algorithm stores $A \leftarrow l_1$ and $u_{\min} \leftarrow u$ (line 10), which define the current shortest vector and its size. When ENUM moves up to the root of the search tree it terminates and outputs the computed global minimum $A$ and the corresponding shortest vector $u_{\min}$.
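The admissible interval and the zig-zag visiting order described above can be illustrated with a short sketch (Python; the function names and the closed-form interval computation are ours, not from the paper):

```python
import itertools
import math

def zigzag(center):
    """Visit integers in the order center, center+1, center-1, center+2, ...
    (one of the two zig-zag variants used by Schnorr-Euchner ENUM)."""
    yield center
    for k in itertools.count(1):
        yield center + k
        yield center - k

def admissible_values(c_t, A, l_next, bstar_sq):
    """All integers u_t with l_t = l_{t+1} + (u_t + c_t)^2 * ||b*_t||^2 < A,
    returned in zig-zag order around the real-valued minimizer -c_t."""
    radius = math.sqrt((A - l_next) / bstar_sq)
    candidates = range(math.ceil(-c_t - radius), math.floor(-c_t + radius) + 1)
    good = [u for u in candidates if l_next + (u + c_t) ** 2 * bstar_sq < A]
    # sorting by distance to -c_t reproduces the zig-zag visiting order
    return sorted(good, key=lambda u: (abs(u + c_t), u))
```

For example, with $c_t = 0.2$, $A = 5$, $l_{t+1} = 0$ and $\|b_t^*\|^2 = 1$, the admissible values are visited as $0, -1, 1, -2, 2$.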

4 Algorithm for Parallel Enumeration of the Shortest Lattice Vector

In this section we describe our parallel algorithm for enumeration of the shortest lattice vector. The algorithm is a parallel version of the algorithm presented in [14]. First we give the main idea of parallel enumeration. Second, we present a high-level description; Algorithms 2 and 3 depict our parallel ENUM. Third, we explain some improvements that speed up the parallelization in practice.

4.1 Parallel Lattice Enumeration

The main idea for parallelization is the following. Different subtrees of the complete search tree are enumerated in parallel, independently of each other, each represented as a thread (Sub-ENUM thread). Using $s$ processors, $s$ subtrees can be enumerated at the same time. All threads ready for enumeration are

Parallel Enumeration of Shortest Lattice Vectors



Fig. 1. Comparison of serial (solid line) and parallel (dashed line) processing of the search tree

stored in a list L, and each CPU core that has finished enumerating a subtree picks the next subtree from the list. Each of the subtrees is an instance of the SVP in smaller dimension; the initial state of a sub-enumeration can be represented by a tuple $(u, l, c, t)$. When the ENUM algorithm increases the level in the search tree, the center ($c_t$) and the range ($(A - l_{t+1})/\|b_t^*\|^2$) of possible values for the current index are calculated. Therefore, it is easy to open one thread for every value in this range. Figure 1 shows a 3-dimensional example and compares the flow of the serial ENUM with our parallel version. Beginning at the starting node, the processing order of the serial ENUM algorithm follows the directed solid edges to the root. In the parallel version, dashed edges represent the preparation of new Sub-ENUM threads, which can be executed by a free processor unit. Crossed-out edges point out irrelevant subtrees. Threads terminate as soon as they reach either a node of another thread or the root node.

Extra Communication – Updating the Shortest Vector. Again, we denote the current minimum, the global minimum, as $A$. In our parallel version, it is the global minimum over all threads. As soon as a thread has found a new minimum, the Euclidean norm of this vector is written back to the shared memory, i.e., $A$ is updated. At certain points every thread checks whether another thread has updated $A$ and, if so, uses the updated value. The smaller $A$ is, the faster a thread terminates, because subtrees that exceed the current minimum can be cut off in the enumeration. The memory access for this update operation is minimal: only one integer value has to be written back to or read from shared memory. This is the only type of communication among threads; all other computations can be performed independently without communication overhead.

4.2 The Algorithm for Parallel Enumeration

Algorithm 2 shows the main thread for the parallel enumeration. It is responsible for initializing the first Sub-ENUM thread and managing the thread list L.


Algorithm 2. Main thread for parallel enumeration
Input: Gram–Schmidt coefficients $(\mu_{i,j})_{1 \leq j \leq i \leq n}$, $\|b_1^*\|^2, \ldots, \|b_n^*\|^2$
Output: $u_{\min}$ such that $\|\sum_{i=1}^{n} u_i b_i\| = \lambda_1(\mathcal{L}(B))$
1: $A \leftarrow \|b_1^*\|^2$, $u_{\min} \leftarrow (1, 0, \ldots, 0)$    // global variables
2: $u \leftarrow (1, 0, \ldots, 0)$, $l \leftarrow 0$, $c \leftarrow 0$, $t \leftarrow 1$    // local variables
3: $L \leftarrow \{(u, l, c, t)\}$    // initialize list
4: while $L \neq \emptyset$ or threads are running do
5:   if $L \neq \emptyset$ and cores available then
6:     pick $\Delta = (u, l, c, t)$ from $L$
7:     start Sub-ENUM thread $\Delta = (u, l, c, t)$ on a new core
8:   end if
9: end while

A Sub-ENUM thread (SET) is represented by the tuple $(u, l, c, t)$, where $u$ is the coefficient vector, $l$ the intermediate norm at the root of this subtree, $c$ the search region center, and $t$ the lattice dimension minus the starting depth of the parent node in the search tree.
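As a minimal sketch, such a tuple can be modeled as follows (Python; the field documentation paraphrases the description above, and the type name is ours):

```python
from collections import namedtuple

# State from which a Sub-ENUM thread can resume enumeration of its subtree.
SET = namedtuple("SET", [
    "u",  # coefficient vector (entries above the starting level are fixed)
    "l",  # intermediate norm accumulated at the root of the subtree
    "c",  # center of the search region at the starting level
    "t",  # starting level of the subtree in the search tree
])
```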

Algorithm 3. Sub-ENUM thread (SET)
Input: Gram–Schmidt coefficients $(\mu_{i,j})_{1 \leq j \leq i \leq n}$, $\|b_1^*\|^2, \ldots, \|b_n^*\|^2$, $(\bar{u}, \bar{l}, \bar{c}, \bar{t})$
1: $u \leftarrow \bar{u}$, $l \leftarrow (0, \ldots, 0)$, $c \leftarrow (0, \ldots, 0)$
2: $t \leftarrow \bar{t}$, $l_{t+1} \leftarrow \bar{l}$, $c_t \leftarrow \bar{c}$, bound $\leftarrow n$
3: while $t \leq$ bound do
4:   $l_t \leftarrow l_{t+1} + (u_t + c_t)^2 \|b_t^*\|^2$
5:   if $l_t < A$ then
6:     if $t > 1$ then
7:       $t \leftarrow t - 1$    // move one layer down in the tree
8:       $c_t \leftarrow \sum_{i=t+1}^{n} u_i \mu_{i,t}$, $u_t \leftarrow \lceil -c_t \rfloor$
9:       if bound $= n$ then
10:        $L \leftarrow L \cup \{(u, l_{t+2}, c_{t+1}, t + 1)\}$    // insert new SET in list L
11:        bound $\leftarrow t$
12:      end if
13:    else
14:      $A \leftarrow l_t$, $u_{\min} \leftarrow u$    // set new global minimum
15:    end if
16:  else
17:    $t \leftarrow t + 1$    // move one layer up in the tree
18:    choose next value for $u_t$ using the zig-zag pattern
19:  end if
20: end while

Whenever the list contains a SET and free processor units exist, the first SET of the list is executed. The execution of SETs is performed by Algorithm 3. We process the search tree in the same manner as the serial algorithm (Algorithm 1), except for the introduction of the loop bound bound and the handling of new SETs (lines 9–11). First, the loop bound controls the termination within the subtree and prevents nodes from being visited twice. Second, only the SET whose bound is set to the lattice dimension is allowed to create new SETs. If we allowed each SET to create new SETs by itself, this would lead to an explosion in the number of threads, and each thread would have too little computation to perform.
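A toy model of this scheduling scheme, assuming a queue of subtree tasks and a lock-protected shared minimum (the names `SharedMinimum` and `run_sets` are ours; the real implementation enumerates lattice subtrees instead of scanning precomputed norms):

```python
import threading
from queue import Queue, Empty

class SharedMinimum:
    """The global minimum A, shared by all Sub-ENUM threads.
    Updating it is the only communication among the threads."""
    def __init__(self, initial):
        self.value = initial
        self._lock = threading.Lock()

    def update(self, candidate):
        with self._lock:
            if candidate < self.value:
                self.value = candidate

def run_sets(subtrees, num_cores, initial_bound):
    """Each free core repeatedly picks the next subtree from the list L and
    enumerates it, pruning against the current shared minimum A."""
    A = SharedMinimum(initial_bound)
    L = Queue()
    for s in subtrees:
        L.put(s)

    def worker():
        while True:
            try:
                norms = L.get_nowait()  # one "subtree": here just its leaf norms
            except Empty:
                return
            for n in norms:
                if n < A.value:         # cut off anything above the minimum
                    A.update(n)

    threads = [threading.Thread(target=worker) for _ in range(num_cores)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return A.value
```

The lock only protects the tiny compare-and-write of $A$; all other work runs without synchronization, which mirrors the low communication overhead claimed above.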


We denote the SET with bound set to $n$ as the unbounded SET (USET). At any time, there exists only one USET, which might be stored in the thread list L. As soon as a USET has the chance to find a new minimum within the current subtree (lines 5–6), its bound is set to the current value of $t$. Thereby, it is transformed into a SET, and the newly created SET becomes the USET.

4.3 Improvements

We have presented a first solution for the parallelization of the ENUM algorithm that provides a runtime speed-up by a divide-and-conquer technique: we distribute subtrees to several processor units to search for the minimum. Our improvements concern the creation of SETs and result in significantly shorter running times. Recall the definitions of Sub-ENUM thread (SET) and unbounded Sub-ENUM thread (USET). From now on we call a node where a new SET can be created a candidate. Note that a candidate can only be found in a USET. The following paragraphs describe worst cases of the presented parallel ENUM algorithm and present possible solutions to overcome the existing drawbacks.

Threads within threads. Our parallel ENUM algorithm allows new SETs to be created only by a USET. This decision is backed by the need to avoid the immense overhead that would arise if any SET were permitted to create new SETs. However, if a USET creates a new SET at a node of depth 1, then this new SET is executed sequentially by a single processor. Note that this SET solves the SVP in dimension $n - 1$. It turns out that when the depth of the currently analyzed node is sufficiently far from the depth $t$ of the starting node, creating a new SET is advantageous with respect to the overall running time and the number of simultaneously occupied processors. Therefore, we introduce a bound $s_{deep}$ which expresses what we consider to be sufficiently far away: if a SET that is not a USET visits a node of depth $k$ fulfilling $k - t \geq s_{deep}$, where $t$ is the depth of the starting node, then this SET is permitted to create a new SET once.

Thread Bound. Although we avoid the execution of SETs where the dimension of the subtree is too big, we can still optimize the parallel ENUM algorithm by considering execution bounds. We achieve additional performance improvements with the following idea. Instead of generating a SET at each possible candidate, we consider the depth of the node. This enables us to avoid big subtrees for new SETs by introducing an upper bound $s_{up}$ on the distance of a node to the root for it to become a candidate. If ENUM visits a node of depth $t$ fulfilling $n - t > s_{up}$ and this node is a candidate, we no longer make the subtree ready for a new SET; instead, we behave in that situation like the serial ENUM algorithm. Good choices for the bounds $s_{deep}$ and $s_{up}$ are evaluated in Section 5.
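The two spawning rules can be summarized in a small predicate (Python sketch; the parameter names are ours, and the depth convention follows the text above):

```python
def is_candidate(depth, start_depth, n, is_uset, s_deep, s_up):
    """May the node at the given depth spawn a new Sub-ENUM thread?

    depth:       depth of the currently visited node
    start_depth: depth of the SET's starting node
    n:           lattice dimension
    """
    if n - depth > s_up:
        # thread bound: the subtree is too big, behave like serial ENUM
        return False
    if is_uset:
        # the unbounded SET may always create new SETs
        return True
    # "threads within threads": an ordinary SET may spawn once it is
    # sufficiently far below its own starting node
    return depth - start_depth >= s_deep
```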


5 Experiments

We performed numerous experiments to test our parallel enumeration algorithm. We created 5 different random lattices of each dimension $n \in \{42, \ldots, 56\}$ in the sense of Goldstein and Mayer [6]. The bitsize of the entries of the basis matrices was on the order of $10n$. We started with bases in Hermite normal form and then LLL-reduced the bases (using LLL parameter $\delta = 0.99$). The experiments were performed on a compute server equipped with four AMD Opteron (2.3 GHz) quad-core processors. We compare our results to the highly optimized, serial version of fpLLL in version 3.0.12, the fastest ENUM implementation known, on the same platform. The programs were compiled using gcc version 4.3.2. For handling parallel processes, we used the Boost.Thread sublibrary in version 1.40. Our C++ implementation uses double precision to store the Gram–Schmidt coefficients $\mu_{i,j}$ and the $\|b_1^*\|^2, \ldots, \|b_n^*\|^2$. According to [12], this is suitable up to dimension 90, which seems to be out of the range of today's enumeration algorithms. We tested the parallel ENUM algorithm for several $s_{deep}$ values and concluded that $s_{deep} = \frac{25}{36}(n - t)$ seems to be a good choice, where $t$ is the depth of the starting node in a SET instance. Further, we use $s_{up} = \frac{5}{6}n$.


Fig. 2. Average runtimes of enumeration of 5 random lattices in each dimension, comparing our multi-core implementation to fpLLL’s and our own single-core version

Fig. 3. Occupancy of the cores. The x-axis marks the percentage of the complete runtime, the y-axis shows the average occupancy of all CPU cores over 5 lattices.

Table 1. Average time in seconds for enumeration of lattices in dimension n

n              42     44     46     48     50     52      54      56
1 core        3.81   27.7   37.6   241    484    3974    10900   223679
4 cores       0.99    7.2    8.8    55    107     976     2727    56947
8 cores       0.62    4.0    4.8    28     56     504     1390    28813
16 cores      0.52    2.6    3.5    18     36     280      794    16583
fpLLL 1 core  3.32   23.7   29.7   184    367    3274     9116   184730


Fig. 4. Average speed-up of parallel ENUM compared to our single-core version (left) and compared to fpLLL single-core version (right)

Table 1 and Figure 2 present the experimental results comparing our parallel version to our serial algorithm and to the fpLLL library. We only present the timings, as the output of the algorithms is the same in all cases, namely a shortest non-zero vector of the input lattice. The corresponding speed-ups are shown in Figure 4. To show the strength of parallelization of the lattice enumeration, we first compare our multi-core versions to our single-core version. The best speed-ups are 4.5 (n = 50) for 4 cores, 8.6 (n = 50) for 8 cores, and 14.2 (n = 52) for 16 cores. This shows that, using s processor cores, we sometimes gain speed-ups of more than s, which corresponds to an efficiency of more than 1. This is very untypical behavior for (standard) parallel algorithms, but understandable for graph search algorithms such as our lattice enumeration. It is caused by the extra communication for the write-back of the current minimum A. The highly optimized enumeration of fpLLL is around 10% faster than our serial version. Compared to the fpLLL algorithm, we gain a speed-up of up to 6.6 (n = 48) using 8 CPU cores and up to 11.7 (n = 52) using 16 cores. This corresponds to an efficiency of 0.825 (8 cores) and 0.73 (16 cores), respectively. Figure 3 shows the average, the maximum, and the minimum occupancy of all CPU cores during the runtime of 5 lattices in dimension n = 52. The average occupancy of more than 90% indicates that all cores are nearly optimally loaded; even the minimum load values are around 80%. These facts show the well-balanced behaviour of our parallel algorithm.
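The speed-ups quoted above follow directly from the runtimes in Table 1 (Python arithmetic; the variable names are ours):

```python
# Average runtimes in seconds from Table 1 (the subset used in the text).
t_single = {50: 484, 52: 3974}
t_4cores = {50: 107}
t_8cores = {48: 28, 50: 56}
t_16cores = {52: 280}
t_fplll = {48: 184, 52: 3274}

speedup_4 = t_single[50] / t_4cores[50]            # 4.5x on 4 cores (efficiency > 1)
speedup_8 = t_single[50] / t_8cores[50]            # 8.6x on 8 cores
speedup_16 = t_single[52] / t_16cores[52]          # 14.2x on 16 cores
speedup_vs_fplll_8 = t_fplll[48] / t_8cores[48]    # 6.6x, efficiency ~0.83
speedup_vs_fplll_16 = t_fplll[52] / t_16cores[52]  # 11.7x, efficiency ~0.73
```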

6 Conclusion and Further Work

In this paper we have presented a parallel version of the most common algorithm for solving the shortest vector problem in lattices, the ENUM algorithm. We have shown that a huge speed-up and a high efficiency are reachable using multi-core processors. As parallel versions of LLL are already known, with our parallel ENUM we have given evidence that both parts of the BKZ reduction algorithm can be parallelized. It remains to combine both, parallel LLL and parallel ENUM, into a parallel version of BKZ. Our experience with BKZ shows that


for higher blocksizes of ≈ 50, ENUM takes more than 99% of the complete runtime. Therefore, the speed-up of ENUM will directly speed up BKZ reduction, which in turn influences the security of lattice-based cryptosystems. To enhance scalability further, we consider an extension of our algorithm to parallel systems with multiple multi-core nodes as future work.

Acknowledgments. We thank Jens Hermans, Richard Lindner, Markus Rückert, and Damien Stehlé for helpful discussions and their valuable comments. We thank Michael Zohner for performing parts of the experiments. We thank the anonymous reviewers for their comments.

References
1. Ajtai, M., Kumar, R., Sivakumar, D.: A sieve algorithm for the shortest lattice vector problem. In: STOC 2001, pp. 601–610. ACM, New York (2001)
2. Backes, W., Wetzel, S.: Parallel lattice basis reduction using a multi-threaded Schnorr–Euchner LLL algorithm. In: Sips, H., Epema, D., Lin, H.-X. (eds.) Euro-Par 2009 Parallel Processing. LNCS, vol. 5704, pp. 960–973. Springer, Heidelberg (2009)
3. Fincke, U., Pohst, M.: A procedure for determining algebraic integers of given norm. In: van Hulzen, J.A. (ed.) ISSAC 1983 and EUROCAL 1983. LNCS, vol. 162, pp. 194–202. Springer, Heidelberg (1983)
4. Gama, N., Nguyen, P.Q.: Predicting lattice reduction. In: Smart, N.P. (ed.) EUROCRYPT 2008. LNCS, vol. 4965, pp. 31–51. Springer, Heidelberg (2008)
5. Gama, N., Nguyen, P.Q., Regev, O.: Lattice enumeration using extreme pruning. To appear in EUROCRYPT 2010 (2010)
6. Goldstein, D., Mayer, A.: On the equidistribution of Hecke points. Forum Mathematicum 15(2), 165–189 (2003)
7. Hermans, J., Schneider, M., Buchmann, J., Vercauteren, F., Preneel, B.: Parallel shortest lattice vector enumeration on graphics cards. In: Bernstein, D.J., Lange, T. (eds.) AFRICACRYPT 2010. LNCS, vol. 6055, pp. 52–68. Springer, Heidelberg (2010)
8. Kannan, R.: Improved algorithms for integer programming and related lattice problems. In: STOC 1983, pp. 193–206. ACM, New York (1983)
9. Lenstra, A., Lenstra, H., Lovász, L.: Factoring polynomials with rational coefficients. Mathematische Annalen 261, 515–534 (1982)
10. Micciancio, D., Voulgaris, P.: Faster exponential time algorithms for the shortest vector problem. In: SODA 2010 (2010)
11. Pujol, X.: Recherche efficace de vecteur court dans un réseau euclidien. Master's thesis, ENS Lyon (2008)
12. Pujol, X., Stehlé, D.: Rigorous and efficient short lattice vectors enumeration. In: Pieprzyk, J. (ed.) ASIACRYPT 2008. LNCS, vol. 5350, pp. 390–405. Springer, Heidelberg (2008)
13. Pujol, X., Stehlé, D.: Accelerating lattice reduction with FPGAs. To appear in LATINCRYPT 2010 (2010)
14. Schnorr, C.P., Euchner, M.: Lattice basis reduction: Improved practical algorithms and solving subset sum problems. Mathematical Programming 66, 181–199 (1994)
15. Villard, G.: Parallel lattice basis reduction. In: ISSAC 1992, pp. 269–277. ACM, New York (1992)

A Parallel GPU Algorithm for Mutual Information Based 3D Nonrigid Image Registration

Vaibhav Saxena¹, Jonathan Rohrer², and Leiguang Gong³

¹ IBM Research - India, New Delhi 110070, India, [email protected]
² IBM Research - Zurich, 8803 Rüschlikon, Switzerland, [email protected]
³ IBM T.J. Watson Research Center, NY 10598, USA, [email protected]

Abstract. Many applications in biomedical image analysis require alignment or fusion of images acquired with different devices or at different times. Image registration geometrically aligns images, allowing their fusion. Nonrigid techniques are usually required when the images contain anatomical structures of soft tissue. Nonrigid registration algorithms are very time consuming and can take hours to align a pair of 3D medical images on commodity workstation PCs. In this paper, we present a parallel design and implementation of 3D nonrigid image registration for graphics processing units (GPUs). Existing GPU-based registration implementations are mainly limited to intra-modality registration problems. Our algorithm uses mutual information as the similarity metric and can process images of different modalities. The proposed design takes advantage of the highly parallel, multi-threaded architecture of the GPU, which contains a large number of processing cores. The paper presents optimization techniques to effectively utilize the high memory bandwidth provided by the GPU, using on-chip shared memory and cooperative memory updates by multiple threads. Our optimized GPU implementation showed an average performance of 2.46 microseconds per voxel and achieved a factor-of-28 speedup over a CPU-based serial implementation. This improves the usability of nonrigid registration for some real-world clinical applications and enables new ones, especially within intra-operative scenarios, where strict timing constraints apply.

1 Introduction

Image registration is a key computational tool in medical image analysis. It is the process of aligning two images usually acquired at different times, with different imaging parameters or slightly different body positions, or using different imaging modalities, such as CT and MRI. Registration can compensate for subject motion and enables a reliable analysis of disease progression or treatment effectiveness over time. Fusion of pre- and intra-operative images can provide guidance to the surgeon. Rigid registration achieves alignment by scaling, rotation, and translation. However, most parts of the human body are soft-tissue

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 223–234, 2010.
© Springer-Verlag Berlin Heidelberg 2010


structures, and more complex transformation models are required to align them. There is a variety of so-called nonrigid registration algorithms. A major obstacle to their widespread clinical use is the high computational cost. Nonrigid registration algorithms typically require many hours to register 3D images, prohibiting interactive use [3]. Some implementations of nonrigid registration algorithms achieve runtimes in the order of minutes [11,5]. However, they typically run on large parallel systems of up to more than a hundred processors. Acquisition and maintenance of such architectures is expensive, and therefore the availability of such solutions is very limited. In this paper, we present a CUDA-based implementation of a mutual information based nonrigid registration algorithm on the GPU, which achieves significant runtime acceleration and in many cases even sub-minute runtimes for multimodal registration of 3D images. Our solution provides low-cost fast nonrigid registration, which hopefully will facilitate widespread clinical use. We present data partitioning for 3D images to effectively utilize the large number of threads supported on the GPU, and optimization techniques to achieve high memory throughput from the GPU memory spaces. The proposed implementation achieves consistent speedup across different datasets of varying dimensions. The registration algorithm is based on a B-spline transformation model [15,13]. This approach has been used successfully for the registration of a variety of anatomical structures, such as the brain, the chest [7], the heart, the liver, and the breast [13]. Mutual information is the most common metric for both monomodal and multimodal registration [10]. There are fast GPU implementations of nonrigid registration algorithms, but most of them are limited to monomodal registration [8,14,2]. The only multimodality-enabled implementation we are aware of is [17], which, however, uses only 2D textures and is implemented with OpenGL and GLSL. Another similar work, implemented on the Cell/B.E., is reported in [12]. However, the Cell/B.E. and the GPU represent two very contrasting architectures and require different optimization approaches for achieving good performance. The next section provides an overview of the registration method. Section 3 briefly describes the GPU architecture, followed by a discussion of the proposed parallel algorithm implementation in section 4. The experimental evaluation is discussed in section 5, followed by the conclusion in section 6.

2 Mutual Information Based Nonrigid Image Registration

Nonrigid image registration is the process of computing a nonlinear mapping or transformation between two images (2D or 3D). One of the images is called the reference or fixed image, and the other one the floating or moving image. We implemented a well-known mutual information based nonrigid image registration algorithm which models the transformation using B-splines. A set of control points is overlaid on the fixed image, and a transformation can be generated by letting these control points move freely. The transformation $T(x; \mu)$ is obtained by B-spline interpolation with transformation parameters $\mu$ that are B-spline


coefficients located at the control points. The degrees of freedom of the control points are governed by the spacing between the points. To register the fixed image $f_{fix}$ to the moving image $f_{mov}$, an optimal set of transformation parameters $\mu$ is computed that best aligns the fixed and transformed moving images (i.e., establishes similarity between them). To support the registration of images obtained from different modalities, we use the negative of mutual information as the similarity metric. A gradient descent optimizer with feedback step adjustment [6] is used to iteratively obtain the optimal set of transformation parameters (coefficients) that minimize the similarity metric. Mathematically, $T$ provides a mapping of a point in the fixed image space with coordinates $x_{fix}$ to the point in the moving image space with coordinates $x_{mov}$:
$$x_{mov} = T(x_{fix}; \mu)$$

Using the above mapping, we can obtain the warped (transformed) moving image $\tilde{f}_{mov}$:
$$\tilde{f}_{mov}(x) = f_{mov}(T(x; \mu))$$

An optimal set of transformation parameters $\mu$ makes the images $\tilde{f}_{mov}$ and $f_{fix}$ comparable (similar) to each other based on a similarity metric. The mutual information calculation is based on a Parzen estimation of the joint histogram $p(i_{fix}, i_{mov})$ of the fixed and the transformed moving image [16]. As proposed in [7], a zero-order B-spline Parzen window for the fixed image and a cubic B-spline Parzen window for the moving image are used. Together with a continuous cubic B-spline representation of the floating image, this allows the gradient of the similarity metric $S$ to be calculated in closed form. The metric $S$ is computed as
$$S(\mu) = -\sum_{\tau} \sum_{\eta} p(\tau, \eta; \mu) \log \frac{p(\tau, \eta; \mu)}{p_{fix}(\tau; \mu)\, p_{mov}(\eta; \mu)}$$
where $p$ is the joint pdf with fixed image intensity values $\tau$ and warped moving image intensity values $\eta$, and $p_{fix}$ and $p_{mov}$ are the marginal pdfs. The derivative of $S$ with respect to a transformation parameter $\mu_i$ is the following sum over all fixed image voxels that are within the support region of $\mu_i$:
$$\frac{\partial S}{\partial \mu_i} = -\alpha \sum_{x} \left.\frac{\partial p(i_{fix}, i_{mov})}{\partial i_{mov}}\right|_{i_{fix}=f_{fix}(x),\, i_{mov}=f_{mov}(T(x;\mu))} \cdot \left.\frac{\partial f_{mov}(\xi)}{\partial \xi}\right|_{\xi=T(x;\mu)}^{T} \cdot \frac{\partial T(x; \mu)}{\partial \mu_i}$$
where $\alpha$ is a normalization factor. We use all the voxels of the fixed image to calculate the histogram, not only a subset as in [7]. We also use a multi-resolution approach to increase robustness and speed [16], meaning that we first register with reduced image and transformation resolutions, and then successively increase them until the desired resolution has been reached. We downsample the images by 2 in each dimension using a Gaussian filter and interpolation to obtain them at lower resolution.
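As an illustration, the metric $S$ can be computed from a discrete joint histogram as follows (pure-Python sketch with our own naming; the Parzen windowing and the $\mu$-dependence are omitted):

```python
import math

def neg_mutual_information(joint_hist):
    """S = -sum_{tau,eta} p(tau,eta) * log( p(tau,eta) / (p_fix(tau) * p_mov(eta)) )
    for a 2D list of (unnormalized) joint intensity counts."""
    total = float(sum(sum(row) for row in joint_hist))
    p = [[v / total for v in row] for row in joint_hist]
    p_fix = [sum(row) for row in p]            # marginal pdf of the fixed image
    p_mov = [sum(col) for col in zip(*p)]      # marginal pdf of the moving image
    s = 0.0
    for i, row in enumerate(p):
        for j, p_ij in enumerate(row):
            if p_ij > 0.0:                     # 0 * log(0) := 0
                s -= p_ij * math.log(p_ij / (p_fix[i] * p_mov[j]))
    return s
```

Perfectly correlated intensity pairs give a strongly negative $S$ (high mutual information), while statistically independent pairs give $S = 0$.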

3 GPU Architecture Overview

A GPU can be modeled as a set of SIMD multiprocessors (SMs), each consisting of a set of scalar processor cores (SPs). The SPs of an SM execute the same instruction simultaneously but on different data points. The GPU has a large device global memory with high bandwidth and high latency. In addition, each SM also contains a very fast, low-latency on-chip shared memory. In the CUDA programming model [9], a host program runs on the CPU and launches a kernel program to be executed on the GPU device in parallel. The kernel executes as a grid of one or more thread blocks. Each thread block is dynamically scheduled to be executed on a single SM. The threads of a thread block cooperate with each other by synchronizing their execution and efficiently sharing resources on the SM such as shared memory and registers. Threads within a thread block get executed on an SM in scheduling units of 32 threads called warps. Global memory is used most efficiently when multiple threads simultaneously access words from a contiguous aligned segment of memory, enabling the GPU hardware to coalesce these memory accesses into a single memory transaction. The Nvidia Tesla C1060 GPU used in the present work contains 4 GB of off-chip device memory and 16 KB of on-chip shared memory. The GPU supports a maximum of 512 threads per thread block.

4 Parallelization and Optimization

The main steps of the registration algorithm are outlined in Algorithm 1. The iterative gradient descent optimization part (within the first for-block) consumes almost all the computation time of the registration algorithm. It has been shown that the two (for) loops that iterate over all fixed image voxels take more than 99% of the optimization time [12]. We therefore focus our parallelization and optimization effort on this part, following the CUDA programming model.

4.1 Parallel Execution on the GPU

We offload the two (for) loops to the GPU. Whenever the control on the host reaches either of these loops, the host calls the corresponding GPU routine. Once the GPU routine finishes, control is returned to the host for further execution.

Data Partitioning and Distribution: To enable parallel computation of the joint histogram (loop 1) and parallel computation of the gradient (loop 2) on the GPU, we divide the images into contiguous cubic blocks before the start of the iterative gradient descent optimization and process the fixed image blocks independently. The fixed image blocks are distributed to the CUDA thread blocks, with each thread block processing one or more image blocks. A thread block requires N/T iterations to process a fixed image block, where N is the number of voxels in the image block and T is the number of threads in the thread block. In each iteration, a thread in the thread block processes one voxel of the fixed image block, performing the operations in loop 1 or loop 2.


Algorithm 1. Image Registration Algorithm
Construct the image pyramid
For each level of the multi-resolution hierarchy
  Initialize the transformation coefficients
  For each iteration of the gradient descent optimizer
    Iterate over all fixed image voxels (loop 1)
      Map coordinates to moving image space
      Interpolate moving image intensity
      Update joint histogram
    Calculate joint and marginal probability density functions
    Calculate the mutual information
    Iterate over all fixed image voxels (loop 2)
      Map coordinates to moving image space
      Interpolate moving image intensity and its gradient
      Update the gradients for the coefficients in the neighborhood
    Calculate the new set of coefficients (gradient descent)

For example, for a fixed image block size of 8×8×8, we use a 3-dimensional thread block of size (8, 8, Dz). A thread block makes 8/Dz iterations to process an image block, where Dz is the number of threads in the third dimension. In the m-th (0 ≤ m < 8/Dz) iteration, a thread with index (i, j, k) processes voxel (i, j, k + m×Dz) of the image block. As described in section 3, a maximum of 512 threads is supported per thread block, and a single image block of size 8×8×8 contains the same number (512) of voxels. This allows a thread block with 512 threads to process one fixed image block in a single iteration. Moreover, as the fixed image blocks are stored contiguously in global memory, the threads can read consecutive fixed image values in a coalesced fashion. The reference and moving images do not change during the iterative optimization steps. Therefore we transfer these images to the GPU global memory in the beginning and never modify them throughout the optimization process.

Joint Histogram Computation: The joint histogram computation in the first loop requires several threads to update (read and write) a common GPU global memory region allocated for the histogram. This would require synchronization among the threads. As described in section 2, we use a Parzen estimation of the joint histogram with a cubic B-spline window for the moving image. This requires four bins to be updated with cubic B-spline weights for each pair of fixed image and warped moving image intensity values. Two threads with the same intensity value pair will collide on all four of these bins. Moreover, there will be collisions even if the threads have different intensity value pairs but share some common bins to be updated. An atomic-update based approach would be costly in this case, and therefore we allocate a separate local buffer per CUDA thread in the GPU global memory to store its partial histogram results.
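The voxel-to-thread mapping described above can be checked with a short sketch (Python; the function names are ours):

```python
def voxels_for_thread(i, j, k, Dz, block=8):
    """Voxels of a block x block x block fixed image block handled by thread
    (i, j, k) of an (8, 8, Dz) thread block: iteration m covers (i, j, k + m*Dz)."""
    return [(i, j, k + m * Dz) for m in range(block // Dz)]

def covers_block_once(Dz, block=8):
    """Check that every voxel of the image block is processed exactly once."""
    seen = [v
            for i in range(block)
            for j in range(block)
            for k in range(Dz)
            for v in voxels_for_thread(i, j, k, Dz, block)]
    return len(seen) == block ** 3 and len(set(seen)) == block ** 3
```

With Dz = 8 the thread block has 512 threads and covers the 512-voxel block in a single iteration, as stated above.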
A joint histogram with 32×32 (or 64×64) bins requires 4 KB (or 16 KB) of memory for its storage; it is therefore not possible to store these per-thread partial histogram buffers in the on-chip


GPU shared memory. In the end, we reduce these buffers on the GPU to compute the final joint histogram values.

Gradient Computation: Similar to the joint histogram computation, the gradient computation in the second loop also requires several threads to update (read and write) a common GPU global memory region allocated for storing gradient values. Each thread processing a voxel updates gradient values for the 192 (3×4×4×4) coefficients in the neighborhood affected by the voxel. Single-precision gradient values for 192 coefficients require 768 bytes of memory, and hence it is not possible to allocate per-thread partial gradient buffers in the on-chip 16 KB shared memory for more than 21 threads per thread block. Therefore, we allocate a separate local buffer per CUDA thread in the GPU global memory to store its partial gradient results. In the end, we reduce these buffers on the GPU to compute the final gradient values. For both the joint histogram and the gradient computation, the final reduction of partial buffers is performed using a binary-tree based parallel reduction approach that uses shared memory [4].

Marginal pdfs and Mutual Information Computation: The computation of the marginal pdfs for the fixed and moving images, together with the mutual information computation, is still done on the host. This computation takes less than 0.02% of the total gradient descent optimization time and hence does not become a bottleneck. Performing this computation on the host requires the transformation coefficients to be sent to the GPU before the joint histogram computation; once the computation is done, the computed joint histogram is transferred back to the host. Similarly, for the gradient computation, we send the modified histogram values to the GPU before the computation and transfer the computed gradient values back to the host in the end. However, this transfer of data does not become a bottleneck, as the amount of data transferred is small.
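The binary-tree reduction of the per-thread partial buffers can be sketched on the CPU as follows (Python; on the GPU the additions at each stride run in parallel across threads):

```python
def tree_reduce(partial_buffers):
    """Sum per-thread partial histogram/gradient buffers pairwise:
    in each step, buffer i accumulates buffer i+stride, and the
    stride doubles until only buffer 0 holds the complete sum."""
    bufs = [list(b) for b in partial_buffers]
    stride = 1
    while stride < len(bufs):
        for i in range(0, len(bufs) - stride, 2 * stride):
            bufs[i] = [a + b for a, b in zip(bufs[i], bufs[i + stride])]
        stride *= 2
    return bufs[0]
```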
The experimental results also show that these transfers require less than 0.1% of the total time for the joint histogram and gradient computation.
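The per-thread partial buffers and the binary-tree reduction used for both the joint histogram and the gradient can be sketched as follows. This is an illustrative Python/NumPy model of the scheme, not the paper's CUDA code; `partial_histograms` and `tree_reduce` are hypothetical names:

```python
import numpy as np

def partial_histograms(samples, n_threads, bins=32):
    """Each 'thread' accumulates a private partial joint histogram
    (on the GPU these buffers live in global memory)."""
    partial = np.zeros((n_threads, bins, bins), dtype=np.float64)
    for t in range(n_threads):
        for f, m in samples[t::n_threads]:  # voxels handled by thread t
            partial[t, f, m] += 1.0
    return partial

def tree_reduce(partial):
    """Binary-tree reduction of the per-thread buffers, mirroring the
    shared-memory parallel reduction pattern of [4]."""
    n = partial.shape[0]
    stride = 1
    while stride < n:
        for i in range(0, n - stride, 2 * stride):
            partial[i] += partial[i + stride]
        stride *= 2
    return partial[0]
```

The same reduce step applies to the per-thread partial gradient buffers; only the buffer shape (192 coefficients instead of bins×bins) changes.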

4.2 Use of Look Up Table

When computing transformed fixed image voxels and interpolating the moving image at the transformed positions, we need to compute a weighted sum of B-spline coefficients with B-spline weights. Since the cubic B-spline basis functions have limited support, this weighted sum involves only four B-spline coefficients located at four neighboring control points per dimension. In the 3D case, we need to consider only the 4×4×4 points in the neighborhood of x to compute the interpolated value:

f(x) = \sum_{i,j,k=0}^{3} c_{i,j,k} \, \beta_{x,i} \, \beta_{y,j} \, \beta_{z,k}

where f is one component of the transformation function or the moving image intensity, c_{i,j,k} are B-spline coefficients, and the β's are cubic B-spline weights. For the different components of the transformation function, the weights remain the same but the coefficients differ. Instead of computing these weights repeatedly at runtime,

A Parallel GPU Algorithm for Mutual Information

229

we pre-compute these weights at sub-grid points with a spacing of 1/32 of the voxel size and store the computed values in a lookup table. To enable the use of the lookup table, we constrain the control point spacing to be an integral multiple of the voxel spacing. Similarly, for image interpolation, we round the point coordinates down to the nearest sub-grid point. We compute the lookup table on the host and transfer it once to GPU device memory, along with the reference and moving images, before the start of the optimization process. The lookup table is not modified afterward and remains constant. For 32 sub-grid points per voxel width, we require only 512 bytes of memory (four single-precision B-spline weights per sub-grid point). On the GPU, we store the lookup table in the on-chip shared memory to avoid accessing high-latency global memory each time.
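The pre-computed weight table can be sketched as follows. This is a hedged Python sketch, not the paper's code: `bspline_weights` is the standard cubic B-spline basis, and 32 sub-grid points × 4 single-precision weights gives the 512 bytes quoted above.

```python
def bspline_weights(t):
    """Four cubic B-spline weights for a fractional offset t in [0, 1)."""
    return (
        (1 - t) ** 3 / 6.0,
        (3 * t**3 - 6 * t**2 + 4) / 6.0,
        (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6.0,
        t**3 / 6.0,
    )

def build_lut(subdivisions=32):
    """Pre-compute weights at sub-grid points spaced 1/32 of a voxel.
    32 points x 4 single-precision floats = 512 bytes on the GPU."""
    return [bspline_weights(i / subdivisions) for i in range(subdivisions)]
```

At runtime a coordinate is rounded down to the nearest sub-grid point and its row of four weights is read from the table instead of being recomputed.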

4.3 Optimizations for Transformation Coefficients

As explained previously, each fixed image voxel requires 3×4×4×4 (4×4×4 per dimension) transformation coefficients in its neighborhood to transform its coordinates to moving image space. However, if the spacing between coefficient grid points is an integral multiple of the fixed image block size, then all voxels within a fixed image block require the same set of coefficients for the transformation. Each fixed image block is transformed by a single thread block, and storing 3×4×4×4 coefficients requires only 768 bytes in single precision. Therefore, before processing an image block, all 192 coefficients required by that block are loaded into the on-chip shared memory for faster access. No synchronization is needed in this case, as the threads only read the coefficients.
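The condition that all voxels of a block share one coefficient neighborhood can be illustrated in one dimension. This is an illustrative sketch with hypothetical helper names (`coeff_base`, `block_shares_coeffs`), not code from the paper:

```python
def coeff_base(voxel, spacing):
    """Index of the first of the 4 control points influencing a voxel
    (1D; the 3D case applies this independently per axis)."""
    return voxel // spacing - 1

def block_shares_coeffs(block_origin, block_size, spacing):
    """True iff every voxel of the block needs the same 4-point
    neighborhood, which holds when the control-point spacing is an
    integral multiple of the (aligned) block size."""
    bases = {coeff_base(v, spacing)
             for v in range(block_origin, block_origin + block_size)}
    return len(bases) == 1
```

When the condition holds, one cooperative load of the 192 coefficients into shared memory serves every thread of the block.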

4.4 Memory Coalescing for the Gradient Computation

As described in Section 4.1, for the joint histogram and gradient computation we allocate separate per-thread buffers in global memory to store partial results, which are reduced at the end to obtain the final values. For the gradient computation, each voxel updates its local gradient buffer for its neighboring 4×4×4 coefficients per dimension, and the threads of a warp update the gradient buffer entry corresponding to the same coefficient in their respective local buffers. The gradient buffer entries for the threads can be organized in either Array of Structures (AOS) or Structure of Arrays (SOA) form. In the AOS form, the gradient entries for a thread are stored contiguously in its own distinct buffer; updates by the threads of a warp then result in non-coalesced global memory reads and writes. In the SOA form, the local buffers of different threads are interleaved so that the gradient entries of different threads corresponding to the same coefficient are stored contiguously in global memory; threads then update consecutive memory locations, resulting in coalesced memory access. We therefore use the SOA form for the gradient computation. At the end, we reduce the per-coefficient values from all threads to compute the final gradient values. We evaluate the performance of the two forms in Section 5.
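The two layouts differ only in how a (thread, coefficient) pair maps to a linear address. The following Python sketch is an illustrative 1D address model (the names are ours, not the paper's):

```python
def aos_index(thread, coeff, n_coeffs=192):
    """AOS: each thread's 192 gradient entries are contiguous, so a warp
    updating one coefficient touches addresses 192 floats apart."""
    return thread * n_coeffs + coeff

def soa_index(thread, coeff, n_threads):
    """SOA: entries for the same coefficient across threads are
    interleaved, so a warp updating one coefficient touches consecutive
    addresses (coalesced on the GPU)."""
    return coeff * n_threads + thread
```

With SOA, the 32 threads of a warp updating coefficient c write a single contiguous 32-element segment, which is exactly the access pattern the GPU can coalesce into few memory transactions.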


5 Experimental Results

The parallel version of the code was run on the Nvidia Tesla C1060 GPU. The Tesla C1060 is organized as a set of 30 SMs, each containing 8 SPs, for a total of 240 scalar cores running at 1.3 GHz. The GPU has 4 GB of off-chip device memory with a peak memory bandwidth of 102 GB/s, and 16 KB of on-chip shared memory and 16K registers available per SM. The CUDA SDK 2.3 and the NVCC 0.2.1221 compiler were used to compile the code. The GPU host system has an Intel Xeon 3.0 GHz processor with 3 GB of main memory; the GPU is connected to the host through a PCIe x16 link providing a peak bandwidth of 4 GB/s in a single direction. The serial version of the code was run on one core of an Intel Xeon processor running at 2.33 GHz with 2 GB of memory, compiled with GCC 4.1.1 (with -O2). For both systems, we measured the runtime of the most computationally expensive part, the multi-resolution iterative optimization process, for single-precision data. On the GPU system, it is assumed that the reference and moving images have already been transferred to GPU global memory.

5.1 Performance Results and Discussion

Runtime. We performed registrations of 22 different pairs of CT abdominal images of different sizes and measured the average registration time per voxel. The images were partitioned into cubic blocks of size 8×8×8. A three-level multi-resolution pyramid was used, with a B-spline grid spacing of 16×16×16 voxels at the finest level. The gradient descent optimizer was set to perform a fixed number of 30 iterations at each pyramid level. For the purpose of comparison, we measured the per-voxel time at the finest pyramid level only. The sequential version required 69.02 (±16) μs/voxel, whereas the GPU version required 2.46 (±0.33) μs/voxel, a factor-28 speedup over the serial version. For example, for an image of size 512×512×98 the sequential version required 1736.65 seconds for 30 iterations at the finest level, whereas the GPU version required only 67.20 seconds. Note that the above time does not include the operations performed for each pyramid level before starting the gradient descent iterations, e.g. allocating and transferring the fixed and moving images on the GPU. To include these operations in the performance measurement as well, we compared the total registration application execution times of the serial and parallel versions for all 22 datasets. The GPU-based parallel version showed a speedup between 18x and 26x, with an average of 22.3 (±2.7), over the serial version for total execution time. The good speedup suggests that the time-consuming part of the application was successfully offloaded to the GPU.

Scalability. We measured the scalability of the GPU implementation with different numbers of threads per block and numbers of thread blocks. Figure 1(a) shows the scalability on the GPU for an image of size 512×512×98 with up to 180 thread blocks. The performance with a single thread block is taken as a speedup of factor one. The number of

Fig. 1. Scaling on the GPU with different numbers of threads and thread blocks. (a) Scaling with thread blocks. (b) Scaling with threads per thread block.

threads per thread block was fixed to 128 for this experiment. As shown in the figure, we see good scalability with an increasing number of thread blocks, with a factor-34 speedup at 60 thread blocks over the single-thread-block performance. There is no performance improvement beyond 60 thread blocks. As described previously, we use per-thread local buffers for computing the joint histogram and gradient values. These buffers need to be initialized to zero before the computation and reduced at the end. This initialization and reduction time increases with the number of threads in the kernel grid and offsets any slight improvement in the actual computation time. For example, with the 512×512×98 image dataset, the reduction time for the joint histogram computation (and the gradient computation) increases from about 1.26% to 3.42% (and from about 1.1% to 2.66%) of the total histogram (and gradient) computation time when the number of thread blocks increases from 60 to 180. Figure 1(b) shows the scaling for an image of size 512×512×98 with different numbers of threads per thread block for 30, 60 and 90 thread blocks. We could not use 512 threads per thread block due to register overflow, and the implementation requires a minimum of 64 threads for the 8×8×8 cubic image block size. We observe the best performance with 60 thread blocks and 128 threads. Although we show the scaling for only three numbers of thread blocks, more threads per thread block generally provide better performance, as expected. Also, the performance difference between different thread counts decreases as the number of thread blocks increases, as seen by comparing the performance variation for different thread counts at 30 thread blocks with that at 90.

Optimizations. We compared the performance of the AOS and SOA forms for the gradient computation discussed in Section 4.4. Table 1 shows the speedup of the SOA form over the AOS form.
The SOA form, by enabling coalesced memory accesses, provides a significant performance improvement over the AOS form.

5.2 Validation of Registration Results

In this experiment, we validate the results of image registration for the mono- and multi-modal cases. We used the BrainWeb [1] simulated MRI volumes I_T1 and


Table 1. Speedup with SOA over AOS for gradient computation

Algorithm Component | Speedup
Per iteration of gradient descent optimizer | 7-8x
Total gradient computation (loop 2) time, including data transfer and reduction | 10-12x
Gradient computation only, excluding reduction | 10-11x

Table 2. Similarity metric values for the mono- and multi-modal cases

Registered Images | T1 to T1 (Mono) | T1D to T1 (Mono) | T1 to T2 (Multi) | T1D to T2 (Multi)
Serial | -1.9443 | -1.8914 | -1.3975 | -1.3743
GPU | -1.8246 | -1.8426 | -1.3452 | -1.3590

I_T2 (181×217×181 voxels, isotropic voxel spacing of 1 mm; sample slices are shown in Figures 3(a) and 3(b)): the T1 and T2 volumes are aligned. We deformed I_T1 with an artificially generated transformation function T_D, based on randomly placed Gaussian blobs, to obtain I_T1D. We then registered I_T1D to I_T1 (mono-modal case) and I_T1D to I_T2 (multi-modal case) using both the serial and GPU implementations, and measured the mutual information (MI) values before and after registration. To verify the final MI values after registration, we also compared them with the MI values of the perfectly aligned I_T1 & I_T1 (mono-modal) and I_T1 & I_T2 (multi-modal). We used three levels of the multi-resolution pyramid, with 30 iterations per pyramid level. Table 2 shows the final values of the similarity metric (negative of MI) after registration, along with the metric values for perfectly aligned images.

Mono-Modal Case: The similarity metric for the perfectly aligned I_T1 and I_T1 is -1.9443 (computed using the serial version). The final metric values after registering I_T1D to I_T1 are -1.8914 and -1.8426 using the serial and GPU implementations, respectively. The difference in metric values on the two platforms can be attributed to differences in floating point arithmetic and operation ordering; the difference in the metric values for perfectly aligned images on the two platforms is similar. Visual inspection of the images before and after registration confirmed that the registered image aligns well with the original image, and there is no visible difference between the images registered using the serial and GPU implementations. Figure 2 shows an example visual comparison of the GPU registration accuracy.

Multi-Modal Case: In this case, we registered I_T1D to I_T2 to obtain the registered image I_T1R. We then compared I_T1R to the already known solution volume I_T1 to verify the registration result. The similarity metric for the perfectly aligned I_T1 and I_T2 is -1.3975. The final metric values after registering I_T1D to I_T2 are -1.3743 and -1.3590 using the serial and GPU implementations, respectively. For visual comparison we cannot directly merge the registered image I_T1R with I_T2 as done in the mono-modal case; we therefore color-merged I_T1R and I_T1 for validation. In the case of a correct registration, the color-merged image should not show any colored regions. Figure 3 shows the visual comparison of the GPU registration accuracy for the multi-modal registration.


Fig. 2. Visual comparison of the GPU registration accuracy in mono-modal case. Pairs of grayscale images are color-merged (one image in the green channel, the other in the red and the blue channel); areas with alignment errors appear colored. (a) shows the misalignment of the original (green) and the artificially deformed image. (b) shows that after registration on the GPU only minor registration errors are visible (original image in green). In (c), no difference between the image registered on the GPU and the Xeon (green) is visible.


Fig. 3. Multi-modal case. (a) T1 image slice. (b) T2 image slice. (c) color-merged image of T1 (in green channel) and the registered image T1R on the GPU (in red and blue channels). Only minor registration errors are visible. (d) overlay of registered image T1R (red colored) on top of T2 (green colored) using 50% transparency.

6 Summary and Conclusions

In this paper we discussed a GPU-based implementation of mutual information based nonrigid registration. Our preliminary experimental results with the GPU implementation showed an average performance of 2.46 microseconds per voxel, a 28x speedup over the serial version. For a pair of images of 512×512×24 pixels, the registration takes about 17.1 seconds to complete. Our GPU performance also compares well with other high-performance platforms, although it is difficult to make a perfectly fair comparison due to differences in the implemented algorithms and experimental setups. A parallel implementation of mutual information based


nonrigid registration presented in [11] used 64 CPUs of a supercomputer, reporting a speedup factor of up to 40 compared to a single CPU and a mean execution time of 18.05 microseconds per voxel. The proposed GPU-based nonrigid registration provides a cost-effective alternative to existing implementations based on other, more expensive parallel platforms. Future work will include a more systematic comparative evaluation of our GPU-based implementation against others based on different multicore platforms.

References
1. Collins, D.L., Zijdenbos, A.P., Kollokian, V., Sled, J.G., Kabani, N.J., Holmes, C.J., Evans, A.C.: Design and construction of a realistic digital brain phantom. IEEE Trans. Med. Imaging 17(3), 463–468 (1998)
2. Courty, N., Hellier, P.: Accelerating 3D non-rigid registration using graphics hardware. International Journal of Image and Graphics 8(1), 81–98 (2008)
3. Crum, W.R., Hartkens, T., Hill, D.L.G.: Non-rigid image registration: theory and practice. Br. J. Radiol. 77(2), 140–153 (2004)
4. Harris, M.: Optimizing parallel reduction in CUDA (2007), http://www.nvidia.com/object/cuda_sample_advanced_topics.html
5. Ino, F., Tanaka, Y., Hagihara, K., Kitaoka, H.: Performance study of nonrigid registration algorithm for investigating lung disease on clusters. In: Proc. PDCAT, pp. 820–825 (2005)
6. Kybic, J., Unser, M.: Fast parametric elastic image registration. IEEE Transactions on Image Processing 12(11), 1427–1442 (2003)
7. Mattes, D., Haynor, D., Vesselle, H., Lewellen, T., Eubank, W.: PET-CT image registration in the chest using free-form deformations. IEEE Trans. Med. Imag. 22(1), 120–128 (2003)
8. Muyan-Ozcelik, P., Owens, J.D., Xia, J., Samant, S.S.: Fast deformable registration on the GPU: A CUDA implementation of demons. In: ICCSA, pp. 223–233 (2008)
9. Nvidia CUDA Programming Guide 2.3, http://www.nvidia.com/object/cuda_get.html
10. Pluim, J., Maintz, J., Viergever, M.: Mutual information based registration of medical images: a survey. IEEE Trans. Med. Imaging 22(8), 986–1004 (2003)
11. Rohlfing, T., Maurer Jr., C.R.: Nonrigid image registration in shared-memory multiprocessor environments with application to brains, breasts, and bees. IEEE Trans. Inf. Technol. Biomed. 7(1), 16–25 (2003)
12. Rohrer, J., Gong, L., Székely, G.: Parallel mutual information based 3D non-rigid registration on a multi-core platform. In: HP-MICCAI Workshop, in conjunction with MICCAI (2008)
13. Rueckert, D., Sonoda, L.I., Hayes, C., Hill, D.L.G., Leach, M.O., Hawkes, D.J.: Nonrigid registration using free-form deformations: Application to breast MR images. IEEE Transactions on Medical Imaging 18(8), 712–721 (1999)
14. Sharp, G., Kandasamy, N., Singh, H., Folkert, M.: GPU-based streaming architectures for fast cone-beam CT image reconstruction and demons deformable registration. Phys. Med. Biol. 52(19), 5771–5783 (2007)
15. Szeliski, R., Coughlan, J.: Spline-based image registration. Int. J. Comput. Vision 22(3), 199–218 (1997)
16. Thevenaz, P., Unser, M.: Spline pyramids for inter-modal image registration using mutual information. In: Proc. SPIE, vol. 3169, pp. 236–247 (1997)
17. Vetter, C., Guetter, C., Xu, C., Westermann, R.: Non-rigid multi-modal registration on the GPU. In: Proc. SPIE, vol. 6512 (2007)

Multi-GPU and Multi-CPU Parallelization for Interactive Physics Simulations

Everton Hermann(1), Bruno Raffin(1), François Faure(2), Thierry Gautier(1), and Jérémie Allard(1)

(1) INRIA
(2) Grenoble University

Abstract. Today, it is possible to associate multiple CPUs and multiple GPUs in a single shared memory architecture. Using these resources efficiently and seamlessly is a challenging issue. In this paper, we propose a parallelization scheme for dynamically balancing the workload between multiple CPUs and GPUs. Most tasks have both a CPU and a GPU implementation, so they can be executed on any processing unit. We rely on a two-level scheduling that associates a traditional task graph partitioning with work stealing guided by processor affinity and heterogeneity. These criteria are intended to limit inefficient task migrations between GPUs, the cost of memory transfers being high, and to favor mapping small tasks on CPUs and large ones on GPUs to take advantage of heterogeneity. This scheme has been implemented to support the SOFA physics simulation engine. Experiments show that we can reach speedups of 22 with 4 GPUs and 29 with 4 CPU cores and 4 GPUs. CPUs unload GPUs from small tasks, making the GPUs more efficient and leading to a "cooperative speedup" greater than the sum of the speedups obtained separately on 4 GPUs and 4 CPUs.

1 Introduction

Interactive physics simulations are a key component of realistic virtual environments. However, the amount of computation as well as the code complexity grow quickly with the variety, number and size of the simulated objects. The emergence of machines with many tightly coupled computing units raises expectations for interactive physics simulations of a complexity that has never been achieved so far. These architectures usually mix standard generic processor cores (CPUs) with specialized ones (GPUs). The difficulty is then to take advantage of these architectures efficiently. Several parallelization approaches have been proposed, but they usually focus on one aspect of the physics pipeline or target only homogeneous platforms (a GPU or multiple CPUs). Object-level parallelizations usually try to identify non-colliding groups of objects to be mapped on different processors. Fine-grain parallelizations on a GPU achieve high speedups but require deeply revisiting the computation kernels. In this article we propose a parallelization approach that takes advantage of the multiple CPU cores and GPUs available on an SMP machine. We rely on

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 235–246, 2010. © Springer-Verlag Berlin Heidelberg 2010

236

E. Hermann et al.

the open source SOFA physics simulation library, designed to offer a high degree of flexibility and high performance. SOFA [1] supports various types of differential equation solvers for single objects as well as complex scenes of different kinds of interacting physical objects (rigid objects, deformable solids, fluids). The physics pipeline is classically split into two main steps: collision detection and time integration. Collision detection being performed efficiently on a single GPU [2], we focus here on the time integration step. The workload of time integration varies according to collisions: new collisions require the time integration step to compute and apply the associated new repulsion forces. We developed a multi-CPU and multi-GPU parallelization of the time integration step. A first traversal extracts a data dependency graph between tasks. It defines the control flow graph of the application, which identifies the first level of parallelism. Several tasks have a CPU implementation as well as a GPU one using CUDA [3]. This GPU code provides a second, fine-grain parallelization level. At runtime, the tasks are scheduled according to a two-level scheduling strategy. At initialisation, and every time the task graph changes (addition or removal of collisions), the task graph is partitioned with a traditional graph partitioner and the partitions are distributed to PUs (we call both GPUs and CPUs Processing Units). Then, work stealing is used to move partitions between PUs to correct the work imbalance that may appear as the simulation progresses. Our approach differs from the classical work stealing algorithm [4] in that our stealing strategy takes temporal and spatial locality into account. Spatial locality relies on the classical Owner Compute Rule, where tasks using the same data tend to be scheduled on the same PU.
This locality criterion is enforced during task graph partitioning, where tasks accessing the same data are gathered in the same affinity group. Temporal locality is obtained by reusing the task mapping between consecutive iterations: when starting a new time integration step, tasks are first assigned to the PU they ran on at the previous iteration. CPUs tend to be more efficient than GPUs for small tasks, and vice versa. We thus associate weights with tasks, based on their execution time, that PUs use to steal tasks better suited to their capacities. Thanks to this criterion, PU heterogeneity becomes a performance improvement factor rather than a limiting one. Experiments show that a complex simulation composed of 64 colliding objects, totaling more than 400k FEM elements, is simulated on 8 GPUs in 0.082 s per iteration instead of 3.82 s on one CPU. A heterogeneous scene with both complex and simple objects can efficiently exploit all resources of a machine with 4 GPUs and 8 CPU cores to compute the time integration step 29 times faster than with a single CPU. The addition of the 4 CPU cores not dedicated to the GPUs actually increases the simulation performance by 30%, significantly more than the expected 5%, because the CPUs unload the GPUs from small tasks, making the GPUs more efficient. After discussing related work (Sec. 2) and giving a quick overview of the physics simulator (Sec. 3), we focus on multi-GPU support (Sec. 4) and scheduling (Sec. 5). Experimental results (Sec. 6) are detailed before concluding.
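The affinity- and heterogeneity-guided stealing described above can be sketched as follows. This is a hypothetical Python model (names and data layout are ours, not Kaapi's API): a thief first looks for tasks it ran last iteration (temporal locality), otherwise a CPU steals the lightest remote task and a GPU the heaviest.

```python
def pick_victim_task(queues, thief_kind, thief_id):
    """Heterogeneity- and affinity-aware steal (illustrative sketch).
    queues: PU id -> list of {"weight": ..., "affinity": ...} tasks."""
    # Temporal locality: resume work previously mapped to this PU.
    for q in queues.values():
        for task in q:
            if task["affinity"] == thief_id:
                q.remove(task)
                return task
    # Otherwise steal by weight: CPUs take the lightest remote task,
    # GPUs the heaviest, matching task size to PU capability.
    candidates = [t for pu, q in queues.items() if pu != thief_id for t in q]
    if not candidates:
        return None
    choose = min if thief_kind == "cpu" else max
    task = choose(candidates, key=lambda t: t["weight"])
    for q in queues.values():
        if task in q:
            q.remove(task)
            break
    return task
```

The weight-based choice is what turns heterogeneity into an asset: small tasks migrate toward CPUs, freeing GPUs for the large kernels they run best.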

2 Related Works

We first review work related to the parallelization of physics engines before focusing on approaches for scheduling tasks on multiple GPUs or between a CPU and a GPU. Some approaches propose cluster-based parallelizations for interactive physics simulations, but the scalability is usually limited due to the high overhead associated with communications [5]. This overhead is lower with shared memory multi-processor approaches [6,7,8]. GPUs drew a lot of attention for physics, first because we can expect these co-processors to be easily available on users' machines, but also because they can lead to impressive speedups [9,10,11]. The Bullet (1) and PhysX (2) physics engines defer solid and articulated object simulation to the GPU or the Cell, for instance. All these GPU approaches are however limited to one CPU and one GPU or co-processor, with the task distribution between processor and co-processor statically defined by the developer. Recent works propose a more transparent support of heterogeneous architectures mixing CPUs and GPUs. GPU codelets are either automatically extracted from an existing code or manually inserted by the programmer for more complex tasks [12,13]. StarPU supports heterogeneous scheduling on multiple CPUs and GPUs with a software cache to improve CPU/GPU memory transfers [14]. The authors experiment with various scheduling algorithms, some enabling "cooperative speedups" where the GPU gets support from the CPU, the resulting speedup being higher than the sum of the individual speedups. We also obtain such speedups in our experiments. A regular work stealing strategy is also tested, but the performance gain is more limited, the stealing scheme not being adapted to cope with heterogeneity. The published experiments include tests with one GPU only. We know of two different approaches for dynamic load balancing on multiple GPUs.
The extension of StarSs for GPUs [15] proposes a master/helper/worker scheme, where the master inserts tasks into a task dependency graph, helpers grab a ready task when their associated GPU becomes idle, and workers are in charge of memory transfers. The master implements a centralized list scheduling, which work stealing makes it possible to avoid. RenderAnts is a Reyes renderer using work stealing on multiple GPUs [16]. The authors underline the difficulty of applying work stealing to all tasks due to the overhead of data transfers. They obtain good performance by duplicating some computations to avoid transfers, and keep work stealing only in one part of the Reyes pipeline. Stealing follows a recursive data splitting scheme leading to tasks of adaptive granularity. Both RenderAnts and StarSs address multi-GPU scheduling, but neither includes multiple CPUs. In this paper we address scheduling on both GPUs and CPUs. We extend the Kaapi runtime [17] to better schedule tasks with data flow precedences on multiple CPUs and GPUs. Contrary to previous works, the initial workload is balanced by computing at runtime a partition of the tasks with respect to their affinity to the accessed objects. Then, during execution, the work imbalance is corrected by

(1) http://www.bulletphysics.com
(2) http://www.nvidia.com/physx


Fig. 1. Simulation of 64 objects, falling and colliding under gravity. Each object is a deformable body simulated using the Finite Element Method (FEM) with 3k particles. Panels: mechanical mesh, initial state, intermediary state, final state.

a work stealing scheduling algorithm [4,18]. In [19] the performance of the Kaapi work stealing algorithm was proved for tasks with data flow precedences, but not in the heterogeneous context mixing CPUs and GPUs. A GPU is significantly different from a "fast" CPU, due to the limited bandwidth between CPU and GPU memories, the overhead of kernel launching, and the SIMD nature of GPUs, which does not fit all algorithms. The work stealing policy needs to be adapted to take advantage of the different PU capabilities and obtain "cooperative speedups".

3 Physics Simulation

Physics simulations, particularly in interactive scenarios, are very challenging high performance applications. They require many different computations whose cost can vary unpredictably, depending on sudden contacts or user interactions. The amount of data involved can be important, depending on the number and complexity of the simulated objects. Data dependencies evolve during the simulation due to collisions. To remain interactive, the application should execute each iteration within a few tens of milliseconds. The physics simulation pipeline is an iterative process where a sequence of steps is executed to advance the scene forward in time. The pipeline includes a collision detection step based on geometry intersections to dynamically create or delete interactions between objects. Time integration consists in computing a new state (i.e. position and velocity vectors) by starting from the current state and integrating the forces in time. Finally, the new scene state is rendered and displayed or sent to other devices. In this paper, we focus on time integration. Interactive mechanical simulators can involve objects of different kinds (solid, articulated, soft, fluid), submitted to interaction forces. The objects are simulated independently, using their own encapsulated simulation methods (FEM, SPH, mass-springs, etc.). Interaction forces are updated at each iteration based on the current states of the objects. Our approach combines flexibility and performance, using a new efficient approach for the parallelization of strong coupling between independently


implemented objects [8]. We extend the SOFA framework [1], which we briefly summarize here. The simulated scene is split into independent sets of interacting objects. Each set is composed of objects along with their interaction forces, and is monitored by an implicit differential equation solver. The objects are made of components, each of them implementing specific operations related to forces, masses, constraints, geometries and other parameters of the simulation. A collision detection pipeline creates and removes contacts based on geometry intersections. It updates the objects accordingly, so that each one can be processed independently from the others. By traversing the object sets, the simulation process generates the elementary tasks that evaluate the physical model.
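The iterative pipeline described above can be summarized in a small control-loop sketch. This is an illustrative Python skeleton (names and callback structure are ours, not SOFA's API):

```python
def simulate(scene, dt, steps, detect, integrate, render):
    """One interactive physics loop (sketch): collision detection
    creates/removes contacts, then each independent object group is
    advanced in time by its own implicit solver, then the frame is
    rendered."""
    for _ in range(steps):
        detect(scene)                  # update interactions from geometry
        for group in scene["groups"]:  # independent sets of objects
            integrate(group, dt)       # implicit ODE solve per group
        render(scene)
```

The per-group time integration step is the part parallelized in this paper; collision detection and rendering bracket it in every iteration.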

4 Multi-GPU Abstraction Layer

We first introduce the abstraction layer we developed to ease the deployment of code on multiple GPUs.

4.1 Multi-architecture Data Types

The multi-GPU implementation of standard data types is intended to hide the complexity of data transfers and coherency management among multiple GPUs and CPUs. On shared memory multiprocessors, all CPUs share the same address space and data coherency is managed in hardware. In contrast, even when embedded on a single board, GPUs have their own local address spaces. We developed a DSM (Distributed Shared Memory) like mechanism to release the programmer from the burden of moving data between a CPU and a GPU or between two GPUs. When a variable is accessed, our data structure first queries the runtime environment to identify the processing unit trying to access the data. It then checks a bitmap to test whether the accessing processing unit has a valid version of the data. If so, it returns a memory reference that is valid in the address space of the requesting processing unit. If the local version is not valid, a copy from a valid version is required; this happens, for instance, when a processing unit accesses a variable for the first time, or when another processing unit has modified the data. This detection is based on dirty bits that flag the valid versions of the data on each PU. These bits are easily maintained by setting the valid flag of a PU each time the data is copied to it, and resetting all flags but that of the current PU when the data is modified. Since direct copies between GPU memories are not supported at the CUDA level, data first has to transit through CPU memory. Our layer transparently takes care of such transfers, but they are clearly expensive and must be avoided as much as possible.
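The valid-bit protocol described above can be sketched as follows. This is an illustrative Python model, not the SOFA/Kaapi implementation; `SharedValue` and its methods are hypothetical names:

```python
class SharedValue:
    """DSM-like wrapper (sketch): one logical variable with a per-PU
    valid bit; reads copy from a valid holder, writes invalidate every
    other PU's copy. Assumes the value is written before being read."""
    def __init__(self, pus):
        self.copies = {pu: None for pu in pus}
        self.valid = {pu: False for pu in pus}
        self.transfers = 0  # counts the expensive copies

    def read(self, pu):
        if not self.valid[pu]:                 # local copy stale or absent
            src = next(p for p, v in self.valid.items() if v)
            self.copies[pu] = self.copies[src]  # on real HW: GPU->CPU->GPU
            self.valid[pu] = True
            self.transfers += 1
        return self.copies[pu]

    def write(self, pu, value):
        self.copies[pu] = value
        # Reset all dirty bits but the writer's: other copies are stale.
        self.valid = {p: p == pu for p in self.valid}
```

Repeated reads on the same PU hit the cached copy and trigger no transfer, which is exactly why the layer pays off when the task mapping is stable across iterations.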

4.2 Transparent GPU Kernel Launching

The target GPU a kernel is started on is explicit in the code launching that kernel. This is constraining in our context as our scheduler needs to reallocate

240

E. Hermann et al.

Fig. 2. Multi-implementation task definition. Top: Task Signature. Left: CPU Implementation. Right: GPU Implementation.

a kernel to a GPU different from the one it was supposed to run on, without having to modify the code. We reimplemented part of the CUDA Runtime API. The code is compiled and linked as usual in a single-GPU scenario. At execution time, our implementation of the CUDA API is loaded and intercepts the calls to the standard CUDA API. When a CUDA kernel is launched, our library queries the runtime environment to determine the target GPU. The execution context is then retargeted to a different GPU if necessary and the kernel is launched. Once the kernel finishes, the execution context is released, so that other threads can access it.

4.3 Architecture Specific Task Implementations

One of the objectives of our framework is to seamlessly execute a task on a CPU or a GPU. This requires an interface hiding the task implementation, which differs greatly depending on whether it targets a CPU or a GPU. We provide a high level interface for architecture specific task implementations (Fig. 2). First, a task is associated with a signature that must be respected by all implementations. This signature includes the task parameters and their access mode (read or write). This information is later used to compute the data dependencies between tasks. Each CPU or GPU implementation of a given task is encapsulated in a functor object. There is thus a clear separation between a task definition and its various architecture specific implementations. We expect at least one implementation to be provided. If an implementation is missing, the task scheduler simply reduces the range of possible target architectures to the supported subset.
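This separation between a task signature and its per-architecture functors can be sketched as follows. The names (`Task`, `AccessMode`, the registration methods) are illustrative only, not the framework's actual API.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: a task definition separates its signature (parameters
// and their access modes, used to derive data dependencies) from
// architecture-specific implementations registered as functors.
enum class AccessMode { Read, Write };
enum class Arch { CPU, GPU };

struct Param { std::string name; AccessMode mode; };

class Task {
public:
    explicit Task(std::vector<Param> signature) : signature_(std::move(signature)) {}

    // Register a functor implementing the task for one architecture.
    void set_impl(Arch arch, std::function<void()> impl) { impls_[arch] = std::move(impl); }

    // The scheduler restricts candidate PUs to architectures with an implementation.
    std::vector<Arch> supported() const {
        std::vector<Arch> out;
        for (const auto& kv : impls_) out.push_back(kv.first);
        return out;
    }

    void run_on(Arch arch) const {
        auto it = impls_.find(arch);
        if (it == impls_.end()) throw std::runtime_error("no implementation for this arch");
        it->second();
    }

    const std::vector<Param>& signature() const { return signature_; }

private:
    std::vector<Param> signature_;
    std::map<Arch, std::function<void()>> impls_;
};
```

A CPU-only task, for example, would register a single functor; `supported()` then tells the scheduler that only CPUs are candidate targets, exactly the fallback behavior described above.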

5 Scheduling on Multi-GPUs

We mix two approaches for task scheduling. We first rely on a task partitioning that is executed every time the task graph changes, i.e. if new collisions or user interactions appear or disappear. Between two partitionings, work stealing is used to reduce the load imbalance that may result from work load variations due to the dynamic behavior of the simulation.

Multi-GPU and Multi-CPU Parallelization

5.1 Partitioning and Task Mapping

As partitioning is executed at runtime, it is important to keep its cost as low as possible. The task graph is simply partitioned by creating one partition per physical object. Interaction tasks, i.e. tasks that access two objects, are mapped to the partition of one of these objects. Then, using METIS or SCOTCH, we compute a mapping of the partitions that tries to minimize communications between PUs. Each time the task graph changes due to the addition or removal of interactions between objects (new collisions or new user interactions), the partitioning is recomputed. Grouping all tasks that share the same physical object into the same partition increases the affinity between these tasks. This significantly reduces memory transfers and improves performance, especially on GPUs where these transfers are costly. A physics simulation also shows a high level of temporal locality, i.e. the changes from one iteration to the next are usually limited. Work stealing can move partitions to reduce load imbalance, and these movements have a good chance of remaining relevant for the next iteration. Thus, if no new partitioning is required, each PU simply starts with the partitions it executed during the previous iteration.

5.2 Dynamic Load Balancing

At the beginning of a new iteration, each processing unit has a queue of partitions (ordered lists of tasks) to execute. The execution is then scheduled by the Kaapi [19] work stealing algorithm. During the execution, a PU first searches its local queue for partitions ready to execute. A partition is ready if and only if all its read mode arguments have already been produced. If there is no ready partition in the local queue, the PU is considered idle and tries to steal work from another PU selected at random. To improve performance, we guide steals so as to gather interacting objects on the same processing unit. We attach an affinity list of PUs to each partition: a partition owned by a given PU has another, distant PU in its affinity list if and only if that PU holds at least one task that interacts with the target partition. A PU steals a partition only if it is in the affinity list of that partition. We update the affinity list with respect to the PU that executes the tasks of the partition. Unlike the locality guided work stealing in [20], this affinity control is only enforced once the first task of the partition has been executed. Before that, any processor can steal the partition. As we will see in the experiments, this combination of initial partitioning and locality guided work stealing significantly improves data locality and thus performance.
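The affinity-guided steal test just described can be sketched as a small predicate. `Partition` and `can_steal` are our own illustrative names, not the scheduler's actual interface.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the affinity rule: a partition may be stolen by PU
// 'thief' only if the partition has not started yet, or if 'thief' appears in
// the partition's affinity list (it holds tasks interacting with it).
struct Partition {
    bool started = false;               // has the first task been executed?
    std::vector<std::size_t> affinity;  // PUs holding interacting tasks
};

bool can_steal(const Partition& p, std::size_t thief) {
    if (!p.started) return true;  // before the first task runs, anyone may steal
    return std::find(p.affinity.begin(), p.affinity.end(), thief)
           != p.affinity.end();
}
```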

5.3 Harnessing Multiple GPUs and CPUs

Our target platforms have multiple CPUs and GPUs: the time to perform a task depends on the PUs, but also on the kind of task itself. Some of them may

Fig. 3. Speedup per iteration when simulating 64 deformable objects falling under gravity (Fig. 1) using up to 8 GPUs

perform better on a CPU, while others have shorter execution times on a GPU. Usually GPUs are more efficient than CPUs on time-consuming tasks with a high degree of data parallelism, while CPUs generally outperform GPUs on small problems due to the cost of data transfers and the overhead of kernel launching. Following the idea of [18], we extended the work stealing policy to schedule time-consuming tasks on the fastest PUs, i.e. GPUs. Because tasks are grouped into partitions (Sec. 5.1), we apply this idea to partitions. During execution we collect the execution time of each partition on CPUs and GPUs. The first iterations are used as a warm-up phase to obtain these execution times. Not having the best possible performance for these first iterations is acceptable, as interactive simulations usually run for several minutes. Instead of keeping a queue of ready partitions sorted by their execution times, we implement a dynamic threshold algorithm that allows a better concurrent parallel execution. Partitions with a CPUTime/GPUTime ratio below the threshold are executed on a CPU, otherwise on a GPU. When a thief PU randomly selects a victim, it checks whether this victim has a ready partition that satisfies the threshold criterion and steals it. Otherwise the PU chooses a new victim. To avoid starving a PU for too long, the threshold is increased each time a CPU fails to steal, and decreased each time a GPU fails.
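The dynamic threshold rule can be sketched as follows. This is a hypothetical simplification; the starting threshold and the adjustment step are illustrative values, not the paper's.

```cpp
// Hypothetical sketch of the dynamic threshold rule: a partition goes to a
// CPU when its measured CPUTime/GPUTime ratio is below the threshold, to a
// GPU otherwise; failed steals nudge the threshold so neither side starves.
struct ThresholdScheduler {
    double threshold = 2.0;  // illustrative starting value
    double step = 0.1;       // illustrative adjustment step

    bool runs_on_cpu(double cpu_time, double gpu_time) const {
        return (cpu_time / gpu_time) < threshold;
    }
    // A CPU found nothing below the threshold: raise it so CPUs get more work.
    void on_cpu_steal_failure() { threshold += step; }
    // A GPU found nothing above the threshold: lower it.
    void on_gpu_steal_failure() { threshold -= step; }
};
```

A partition whose CPU and GPU times are comparable (ratio near 1) stays on a CPU, while a partition that is several times slower on a CPU migrates to a GPU, which is the behavior the text describes.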

6 Results

To validate our approach we used different simulation scenes, including independent objects as well as colliding and attached objects. We tested it on a quad-core Intel Nehalem at 3 GHz with 4 Nvidia GeForce GTX 295 dual GPUs. Tests using 4 GPUs were performed on a dual quad-core Intel Nehalem at 2.4 GHz with 2 Nvidia GeForce GTX 295 dual GPUs. The results presented are mean values over 100 executions.

6.1 Colliding Objects on Multiple GPUs

The first scene consists of 64 deformable objects falling under gravity (Fig. 3). This scene is homogeneous as all objects are composed of 3k particles and simulated using a Finite Element Method with a conjugate gradient equation solver.

Fig. 4. (a) A set of flexible bars attached to a wall (a color is associated to each block composing a bar). (b) Performances with blocks of different sizes, using different scheduling strategies.

At the beginning of the simulation all objects are separated; then the number of collisions increases, reaching 60 pairs of colliding objects, before the objects start to separate from each other again under the action of repulsion forces. The reference average sequential CPU time is 3.8 s per iteration. Recall that we focus on the time integration step and only time this phase. Collision detection is executed in sequence with time integration on one GPU (0.04 s per iteration on average for this scene). We can observe that when objects are not colliding (beginning and end of the simulation) the speedup (relative to one GPU) is close to 7 with 8 GPUs. As expected, the speedup decreases as the number of collisions increases, but we still get at least 50% efficiency (at iteration 260). During our experiments, we observed a high variance of the execution time at the iteration following the appearance of new collisions. This is due to the increased number of steals needed to adapt the load from the partitioning. Steal overhead is important as it triggers GPU-CPU-GPU memory transfers. The second scene tested is very similar: we just changed the mechanical models of the objects to get a scene composed of heterogeneous objects. Half of the 64 objects were simulated using a Finite Element Method, while the others relied on a Mass-Spring model. The object sizes were also heterogeneous, ranging from 100 to 3k particles. We obtained an average speedup of 4.4, to be compared with the 5.3 obtained for the homogeneous scene (Fig. 3). This lower speedup is due to the greater difficulty of finding a well balanced distribution given the scene heterogeneity.

6.2 Affinity Guided Work Stealing

We investigated the efficiency of our affinity guided work stealing. We simulated 30 soft blocks grouped in 12 bars (Fig. 4(a)). These bars are set horizontally

Fig. 5. Simulation performances with various combinations of CPUs and GPUs

and are attached to a wall. They flex under the action of gravity. The blocks attached in a single bar interact similarly to colliding objects. We then compare the performance of this simulation while activating different scheduling strategies (Fig. 4(b)). The first scheduling strategy assigns blocks to 4 GPUs in a round-robin way. The result is a distribution with good load balance but poor data locality, since blocks of the same bar end up on different GPUs. The second strategy uses a static partitioning that groups the blocks of a same bar on the same GPU. This solution has good data locality, since no data is transferred between different GPUs, but the work load is not well balanced as the bars have different numbers of blocks. The third strategy relies on standard work stealing. It slightly outperforms the static partitioning for small blocks as it ensures a better load balance. It also outperforms the round-robin scheduling because one GPU is slightly more loaded, as it executes the OpenGL code for rendering the scene on a display. For larger objects, the cost of memory transfers during steals becomes more important, making work stealing less efficient than the two other schedulings. Affinity guided work stealing gives the best results for both block sizes: it achieves a good load distribution while preserving data locality.

6.3 Involving CPUs

We tested a simulation combining multiple GPUs and CPUs in a machine with 4 GPUs and 8 cores. Because GPUs are passive devices, a core is dedicated to managing each GPU. We thus have 4 cores left that can compute for the simulation. The scene consists of independent objects with 512 to 3,000 particles. We then compare standard work stealing with the priority guided work stealing (Sec. 5.3). Results (Fig. 5) show that our priority guided work stealing always outperforms standard work stealing as soon as at least one CPU and one GPU are involved. We also get "cooperative speedups": for instance, the speedup with 4 GPUs and 4 CPUs (29) is larger than the sum of the 4-CPU (3.5) and 4-GPU (22) speedups. Processing a small object on a GPU sometimes takes as long as a large one. With the priority guided work stealing, the CPUs will execute


tasks that are not well suited to GPUs. The GPUs then only process larger tasks, resulting in larger performance gains than if they had to take care of all the smaller tasks as well. In contrast, standard work stealing leads to "competitive speed-downs": the simulation is slower with 4 GPUs and 4 CPUs than with only 4 GPUs. This is explained by the fact that when a CPU takes a task that is not well adapted to its architecture, it can become the critical path of the iteration, since tasks are not preemptive.

7 Conclusion

In this paper we proposed to combine partitioning and work stealing to parallelize physics simulations on multiple GPUs and CPUs. We take advantage of spatial and temporal locality for scheduling. Temporal locality relies mainly on reusing the partition distribution between consecutive iterations. Spatial locality is enforced by guiding steals toward partitions that need to access a physical object the thief already owns. Moreover, in the heterogeneous context where both CPUs and GPUs are involved, we use a priority guided work stealing to favor the execution of lightweight partitions on CPUs and heavy ones on GPUs. The goal is to give each PU the partitions it executes most efficiently. Experiments confirm the benefits of these strategies; in particular we get "cooperative speedups" when mixing CPUs and GPUs. Though we focused on physics simulations, our approach can probably be straightforwardly extended to other iterative simulations. Future work focuses on task preemption, so that CPUs and GPUs can collaborate even when only large objects are available. We also intend to directly spawn tasks on the target processor instead of using an intermediate task graph, which should reduce the runtime environment overhead. Integrating the parallelization of the collision detection and time integration steps would also remove the current synchronization point and enable global task scheduling, further improving data locality.

Acknowledgments. We would like to thank Théo Trouillon, Marie Durand and Hadrien Courtecuisse for their precious contributions to the code development.

References

1. Allard, J., Cotin, S., Faure, F., Bensoussan, P.J., Poyer, F., Duriez, C., Delingette, H., Grisoni, L.: SOFA - an open source framework for medical simulation. In: Medicine Meets Virtual Reality (MMVR 15), Long Beach, USA (2007)
2. Faure, F., Barbier, S., Allard, J., Falipou, F.: Image-based collision detection and response between arbitrary volume objects. In: Symposium on Computer Animation (SCA 2008), pp. 155–162. Eurographics, Switzerland (2008)

3. NVIDIA Corporation: NVIDIA CUDA compute unified device architecture programming guide (2007)
4. Frigo, M., Leiserson, C.E., Randall, K.H.: The implementation of the Cilk-5 multithreaded language. SIGPLAN Not. 33(5), 212–223 (1998)
5. Allard, J., Raffin, B.: Distributed physical based simulations for large VR applications. In: Virtual Reality Conference 2006, pp. 89–96 (2006)
6. Gutiérrez, E., Romero, S., Romero, L.F., Plata, O., Zapata, E.L.: Parallel techniques in irregular codes: cloth simulation as case of study. J. Parallel Distrib. Comput. 65(4), 424–436 (2005)
7. Thomaszewski, B., Pabst, S., Blochinger, W.: Parallel techniques for physically based simulation on multi-core processor architectures. Computers & Graphics 32(1), 25–40 (2008)
8. Hermann, E., Raffin, B., Faure, F.: Interactive physical simulation on multicore architectures. In: EGPGV, Munich (2009)
9. Georgii, J., Echtler, F., Westermann, R.: Interactive simulation of deformable bodies on GPUs. In: Proceedings of Simulation and Visualisation, pp. 247–258 (2005)
10. Comas, O., Taylor, Z.A., Allard, J., Ourselin, S., Cotin, S., Passenger, J.: Efficient nonlinear FEM for soft tissue modelling and its GPU implementation within the open source framework SOFA. In: Bello, F., Edwards, E. (eds.) ISBMS 2008. LNCS, vol. 5104, pp. 28–39. Springer, Heidelberg (2008)
11. Harris, M.J., Coombe, G., Scheuermann, T., Lastra, A.: Physically-based visual simulation on graphics hardware. In: HWWS 2002: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pp. 109–118. Eurographics Association, Aire-la-Ville (2002)
12. Leung, A., Lhoták, O., Lashari, G.: Automatic parallelization for graphics processing units. In: PPPJ 2009, pp. 91–100. ACM, New York (2009)
13. Dolbeau, R., Bihan, S., Bodin, F.: HMPP: A hybrid multi-core parallel programming environment. In: First Workshop on General Purpose Processing on Graphics Processing Units (2007)
14. Augonnet, C., Thibault, S., Namyst, R., Wacrenier, P.A.: StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. In: Sips, H., Epema, D., Lin, H.-X. (eds.) Euro-Par 2009 Parallel Processing. LNCS, vol. 5704, pp. 863–874. Springer, Heidelberg (2009)
15. Ayguadé, E., Badia, R.M., Igual, F.D., Labarta, J., Mayo, R., Quintana-Ortí, E.S.: An extension of the StarSs programming model for platforms with multiple GPUs. In: Sips, H., Epema, D., Lin, H.-X. (eds.) Euro-Par 2009 Parallel Processing. LNCS, vol. 5704, pp. 851–862. Springer, Heidelberg (2009)
16. Zhou, K., Hou, Q., Ren, Z., Gong, M., Sun, X., Guo, B.: RenderAnts: interactive REYES rendering on GPUs. ACM Trans. Graph. 28(5), 1–11 (2009)
17. Gautier, T., Besseron, X., Pigeon, L.: KAAPI: a thread scheduling runtime system for data flow computations on cluster of multi-processors. In: Parallel Symbolic Computation 2007 (PASCO 2007), London, Ontario, Canada, pp. 15–23. ACM, New York (2007)
18. Bender, M.A., Rabin, M.O.: Online scheduling of parallel programs on heterogeneous systems with applications to Cilk. Theory of Computing Systems 35(3), 289–304 (2000)
19. Gautier, T., Roch, J.L., Wagner, F.: Fine grain distributed implementation of a dataflow language with provable performances. In: PAPP Workshop, Beijing, China. IEEE, Los Alamitos (2007)
20. Acar, U.A., Blelloch, G.E., Blumofe, R.D.: The data locality of work stealing. In: SPAA, pp. 1–12. ACM, New York (2000)

Long DNA Sequence Comparison on Multicore Architectures

Friman Sánchez (1), Felipe Cabarcas (2,3), Alex Ramirez (1,2), and Mateo Valero (1,2)

(1) Technical University of Catalonia, Barcelona, Spain
(2) Barcelona Supercomputing Center, BSC, Spain
(3) Universidad de Antioquia, Colombia
{fsanchez}@ac.upc.es, {felipe.cabarcas,alex.ramirez,mateo.valero}@bsc.es

Abstract. Biological sequence comparison is one of the most important tasks in bioinformatics. Due to the growth of biological databases, sequence comparison is becoming an important challenge for high performance computing, especially when very long sequences are compared. The Smith-Waterman (SW) algorithm is an exact method based on dynamic programming to quantify local similarity between sequences. The inherent large parallelism of the algorithm makes it ideal for architectures supporting multiple dimensions of parallelism (TLP, DLP and ILP). In this work, we show how long sequence comparison takes advantage of current and future multicore architectures. We analyze two different SW implementations on the CellBE and use simulation tools to study the performance scalability in a multicore architecture. We study the memory organization that delivers the maximum bandwidth with the minimum cost. Our results show that a heterogeneous architecture is a valid alternative to execute challenging bioinformatic workloads.

1 Introduction

Bioinformatics is an emerging technology that is attracting the attention of computer architects, due to the important challenges it presents from the performance point of view. Sequence comparison is one of the fundamental tasks of bioinformatics and the starting point of almost all analyses that imply more complex tasks. It is basically an inference algorithm aimed at identifying similarities between sequences. The need for speeding up this process is a consequence of the continuous growth of sequence lengths. Usually, biologists compare long DNA sequences of entire genomes (coding and non-coding regions), looking for matched regions, which indicate similar functionality or regions conserved through evolution, or unmatched regions, which show functional differences, foreign fragments, etc. Dynamic programming (DP) based algorithms are recognized as optimal methods for sequence comparison. The Smith-Waterman algorithm [16] (SW) is a well-known exact method to find the best local alignment between sequences. However, because the complexity of DP based algorithms is O(nm) (n and m being the lengths of the sequences), comparing very long sequences becomes a challenging scenario. In such a case, it is common to obtain many optimal solutions, which

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 247–259, 2010. © Springer-Verlag Berlin Heidelberg 2010


can be relevant from the biological point of view. The time and space requirements of the SW algorithm limit its use. As an alternative, heuristic solutions have been proposed: FASTA [13] and BLAST [4] are widely used heuristics which allow fast comparisons, but at the expense of sensitivity. For these reasons, the use of parallel architectures able to exploit the several levels of parallelism existing in this workload is mandatory to obtain high quality results in a reduced time. At the same time, computer architects have been moving towards the paradigm of multicore architectures, which rely on the existence of sufficient thread-level parallelism (TLP) to exploit the large number of cores. In this context, we consider the use of multicore architectures in bioinformatics to provide the computing performance required by this workload. In this paper we analyze how large-scale sequence comparisons can be performed efficiently using modern parallel multicore architectures. As a baseline we take the IBM CellBE architecture, which has proved to be an efficient alternative for highly parallel applications [11][3][14]. We study the performance scalability of this workload in terms of speedup when many processing units are used concurrently in a multicore environment. Additionally, we study the memory organization that the algorithm requires and how to overcome the memory space limitations of the architecture. Furthermore, we analyze two different synchronization strategies. This paper is organized as follows: Section 2 discusses related work on parallel alternatives for sequence comparison. Section 3 describes the SW algorithm and the parallelization strategy. Section 4 describes the baseline architecture and presents two parallel implementations of the SW algorithm on the CellBE. Section 5 describes the experimental methodology. Section 6 discusses the results of our experiments. Finally, Section 7 concludes the paper with a general outlook.

2 Related Work

Researchers have developed many parallel versions of the SW algorithm [7][8], each designed for a specific machine. These works are able to find a short set of optimal solutions when comparing two very long sequences. The problem of finding many optimal solutions grows exponentially with the sequence length, making it an even more complex problem. The SW implementation of Azzedine et al. [6] avoids excessive memory requirements and obtains all the best local alignments between long sequences in a reduced time. In that work, the process is divided into two stages: first, the score matrix is computed and the maximum scores and their coordinates are stored; second, with this information, part of the matrix is recomputed with the inverted sequences (smaller than the original sequences) and the best local alignments are retrieved. The important point is that the compute time of the first stage is much higher than that needed in the second stage. Despite the efforts to reduce time and space, the common feature is that the score matrix computation is required, and it is still the most time-consuming part. There are some works on SW implementations on modern multicore architectures. Svetlin [12] describes an implementation on Nvidia's Graphics Processing Units (GPUs). Sachdeva et al. [15] present results on the use of the


CellBE to compare a few short pairs of sequences that fit entirely in the Local Storage (LS) of each processor. Sánchez [10] compares SW implementations on several modern multicore architectures, like the SGI Altix, IBM Power6 and CellBE, which support multiple dimensions of parallelism (ILP, DLP and TLP). Furthermore, several FPGA and custom VLSI hardware solutions have been designed for sequence comparison [1][5]. They are able to process millions of matrix cells per second. Among those alternatives, it is important to highlight the Kestrel processor [5], which is a single instruction multiple data (SIMD) parallel processor with 512 processing elements organized as a systolic array. The system originally focused on efficient high-throughput DNA and protein sequence comparison. Its designers argue that although this is a specific processor, it can be considered to be at the midpoint between dedicated hardware and general purpose hardware due to its programmability and reconfigurable architecture. Multicore architectures can deliver high performance in a wide range of applications like games, multimedia, scientific algorithms, etc. However, achieving high performance with these systems is a complex task: as the number of cores per chip and/or the number of threads per core increases, new challenges emerge in terms of power, scalability, design complexity, memory organization, bandwidth, programmability, etc. In this work we make the following contributions:

- We implement the SW algorithm on the CellBE. However, unlike previous works, we focus on long sequence comparison. We present two implementations that exploit TLP and DLP. In the first one, main memory is used as a centralized data storage, because the SPE LS is too small to hold the sequences and temporal data. In the second one, each SPE stores parts of the matrix in its own LS and other SPEs synchronously read the data via DMA operations. This requires handling data dependencies in a multicore environment, synchronization mechanisms between cores, on-chip and off-chip traffic management, double buffering to hide data communication latency, SIMD programming to extract fine-grain data parallelism, etc.
- As a major contribution, we use simulation techniques to explore the SW performance scalability across different numbers of cores working in parallel. We investigate the memory organization that delivers the maximum bandwidth with the minimum hardware cost, and analyze the impact of including a shared cache that can be accessed by all the cores. We also study the impact of memory latency and synchronization overhead on performance.

3 Algorithm Description and Parallelism

The SW algorithm determines the optimal local alignment between two sequences of lengths lx and ly by assigning scores to each character-to-character comparison: positive for exact matches/substitutions, negative for insertions/deletions. The process is done recursively and the data dependencies are shown in figure 1a. The computation of matrix cell (i, j) depends on the results (i-1, j), (i, j-1) and (i-1, j-1). However, cells across the antidiagonals are independent. The final score is reached when all the symbols have been compared. After


computing the similarity matrix, to obtain the best local alignment, the process starts from the cell with the highest score, following the arrows until the value zero is reached.
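The scoring recurrence just described can be sketched directly in C++. This is a minimal, score-only version with a linear gap penalty; the match/mismatch/gap values are illustrative, not the paper's parameters.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Minimal Smith-Waterman score computation: H(i,j) depends on H(i-1,j-1),
// H(i-1,j) and H(i,j-1); cells on the same antidiagonal are independent,
// which is what the wavefront parallelization exploits.
int sw_best_score(const std::string& x, const std::string& y,
                  int match = 2, int mismatch = -1, int gap = -1) {
    std::vector<std::vector<int>> H(x.size() + 1,
                                    std::vector<int>(y.size() + 1, 0));
    int best = 0;
    for (std::size_t i = 1; i <= x.size(); ++i) {
        for (std::size_t j = 1; j <= y.size(); ++j) {
            int s = (x[i - 1] == y[j - 1]) ? match : mismatch;
            H[i][j] = std::max({0, H[i - 1][j - 1] + s,
                                   H[i - 1][j] + gap, H[i][j - 1] + gap});
            best = std::max(best, H[i][j]);
        }
    }
    return best;
}
```

Recovering the alignment itself would additionally require storing the traceback arrows, as the text above describes.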

3.1 Available Parallelism

Because most of the time is spent computing the score matrix, this is the part usually parallelized. The commonly used strategy is the wavefront method, in which computation advances parallel to the antidiagonals. As figure 1a shows, the maximum available parallelism is obtained when the main antidiagonal is reached. Before developing a specific implementation, it is necessary to understand the parameters that influence performance. Figure 1c and table 1 illustrate these parameters and their description. The computation is done in blocks of a given size. We can identify three types of relevant parameters: first, those which depend on the input data set (the sequence lengths lx and ly); second, those which depend on the algorithm implementation, like b and k (the horizontal and vertical block lengths); and third, those which depend on the architecture (the number of workers p and the time Tblock(b,k) required to compute a block of size b*k). There are studies on the parallelism available in this kind of problem [2][9]; instead of developing a new model, we simply note that the total time to compute the comparison can be expressed as follows:

    Total_time_parallel = Tseq_part + Tcomp(b,k) + Ttransf(b,k) + Tsync(b,k)    (1)

where Tseq_part is the intrinsic sequential part of the execution; Tcomp(b,k) is the time to process the matrix in parallel with p processors and a specific block size b*k; Ttransf(b,k) is the time spent transferring all blocks of size b*k used in the computation; and Tsync(b,k) is the synchronization overhead. One synchronization is performed after each block of size b*k is computed. On one hand, Tcomp(b,k) basically depends on p, b and k: as the number of processors increases, this time decreases. The limit is given by the processor speed and the main antidiagonal; that is, if ly and lx are different, the maximum parallelism continues for |lx - ly| stages and then decreases again. Small values of b and k increase the number of parallel blocks, making the use of a larger number of processors effective. On the contrary, larger values of b and k reduce parallelism, and therefore Tcomp(b,k) increases. On the other hand, Tsync(b,k) also depends on b and k: small values increase the number of synchronization events, which can seriously degrade performance. Finally, Ttransf(b,k) increases with large values of b and k, but also with very small values, because the latter increase the number of inefficient small data transfers.

4 Parallel Implementations on a Multicore Architecture

Comparing long sequences presents many challenges to any parallel architecture. Many relevant issues, like synchronization, data partitioning, bandwidth use, memory space and data organization, must be studied carefully to use the available features of a machine efficiently and minimize the terms of equation (1).


Table 1. Parameters involved in the execution of the SW implementation on CellBE

Name          Description
b             Horizontal block size (in number of symbols (bytes))
k             Vertical block size (in number of symbols (bytes))
lx            Length of sequence in the horizontal direction
ly            Length of sequence in the vertical direction
p             Number of processors (workers), SPEs in the case of CellBE
Tblock(b,k)   Time required to process a block of size b*k

Fig. 1. (a) Data dependency (b) Different optimal regions (c) Computation distribution

To exploit TLP in the SW algorithm, a master thread takes the sequences and preprocesses them: it computes a profile according to a substitution score matrix, prepares the worker execution contexts and receives results. These steps correspond to the sequential part of the execution, the first term of equation (1). Workers compute the similarity matrix as figure 1b shows; that is, each worker computes different rows of the matrix. For example, if p = 8, p0 computes row 0, row 8, row 16, etc.; p1 computes row 1, row 9, row 17, and so on. Since the SIMD registers of the workers are 16 bytes long, it is possible to compute 8 symbols in parallel (with 2 bytes per temporal score), that is, k = 8 symbols. Each worker has to store temporal matrix values which will be used by the next worker; for example, in figure 1b, the computation of block 2 by p0 generates temporal data used in the computation of block 1 by p1. This feature leads to several possible implementations. In this work we show two, each having advantages and disadvantages, and being affected differently by synchronization and communications.
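The row-cyclic distribution of matrix rows over workers can be sketched as a small helper (an illustrative function of our own, not the actual implementation):

```cpp
#include <cstddef>
#include <vector>

// Sketch of the row-cyclic distribution: with p workers, worker w computes
// matrix rows w, w+p, w+2p, ... (each "row" being a horizontal strip of
// blocks whose bottom border feeds the next worker).
std::vector<std::size_t> rows_of_worker(std::size_t w, std::size_t p,
                                        std::size_t num_rows) {
    std::vector<std::size_t> rows;
    for (std::size_t r = w; r < num_rows; r += p) rows.push_back(r);
    return rows;
}
```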

4.1 Centralized Data Storage Approach

Due to the small size of the scratchpad memory (an LS of 256KB in the CellBE, shared between instructions and data), a buffer per worker is defined in main memory to store data, as figure 2a shows. Each worker reads from its own buffer and writes to the next worker's buffer via DMA GET and PUT operations. The shared data correspond to the borders of consecutive rows, as shown in figure 1b. This implies that all workers are continuously reading/writing from/to memory, which is a potential problem from the BW point of view, but it is easy to program. The Ttransf(b,k)

F. Sánchez et al.

Fig. 2. (a) SPEs store data in memory. (b) SPEs store data in an internal buffer.

term of equation 1 is minimized using double buffering, which reduces the impact of DMA operation latency by overlapping computation with data transfer. Atomic operations are used to synchronize workers and guarantee data dependencies.
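The double-buffering pattern used to hide DMA latency can be sketched in portable C. Here `dma_get`, `dma_put`, and `dma_wait` are stand-ins for the CellBE MFC primitives (mfc_get/mfc_put with tag-group waits); this is an illustrative sketch, not the authors' SPE code.

```c
#include <assert.h>
#include <string.h>

#define BLOCK_BYTES (16 * 1024)

/* Placeholders for the MFC DMA primitives; a real SPE program would
   issue asynchronous transfers and wait on tag groups instead. */
static void dma_get(void *dst, const void *src, int n, int tag) { memcpy(dst, src, n); (void)tag; }
static void dma_put(const void *src, void *dst, int n, int tag) { memcpy(dst, src, n); (void)tag; }
static void dma_wait(int tag) { (void)tag; }

static void compute_block(char *buf, int n) { (void)buf; (void)n; /* SW cell updates */ }

/* Two local buffers: while buffer `cur` is being computed, the DMA
   engine fills buffer `nxt` with the next block, overlapping transfer
   with computation and so minimizing the Ttransf(b,k) term. */
void process_blocks(char *mem, int nblocks) {
    static char ls[2][BLOCK_BYTES];        /* double buffer in the LS */
    int cur = 0;
    dma_get(ls[cur], mem, BLOCK_BYTES, cur);
    for (int i = 0; i < nblocks; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < nblocks)               /* prefetch the next block */
            dma_get(ls[nxt], mem + (long)(i + 1) * BLOCK_BYTES, BLOCK_BYTES, nxt);
        dma_wait(cur);                     /* current block has arrived */
        compute_block(ls[cur], BLOCK_BYTES);
        dma_put(ls[cur], mem + (long)i * BLOCK_BYTES, BLOCK_BYTES, cur);
        cur = nxt;
    }
    dma_wait(0); dma_wait(1);              /* drain outstanding PUTs */
}
```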

4.2 Distributed Data Storage Approach

Here, each worker defines a small local buffer in its own LS to store temporal results (figure 2b). When pi computes a block, it signals pi+1, indicating that a block is ready. When pi+1 receives this signal, it starts a DMA GET operation to bring the data from the LS of pi to its own LS. When the data arrives, pi+1 handshakes with pi by sending an ack signal. Once pi receives this signal, it knows that the buffer is available for storing new data. The process continues until all blocks are computed. However, due to the limited size of the LS, the first and last workers in the chain read from and write to memory. This approach reduces the data traffic generated by the previous approach, but it is more complex to program. There are three types of DMA: between workers to transfer data from LS to LS (on-chip traffic), from memory to LS, and from LS to memory (off-chip traffic). Synchronization overhead is reduced by noting that an SPE does not have to wait immediately for the ack signal from another SPE; it can start the computation of the next block and wait for the ack signal later. Synchronization is done using the signal operations available in the CellBE.

5

Experimental Methodology

As a starting point, we execute both SW implementations on the CellBE with 1 to 16 SPUs. We evaluate the performance impact of the block size and the bandwidth requirements. Then, using simulation, we study performance scalability and the impact of different memory organizations. We obtain traces from the execution to feed the architecture simulator (TaskSim) and carry out an analysis of more complex multicore scenarios. TaskSim is based on the idea that, in distributed memory architectures, the computation time on a processor does not depend on what is happening in the rest of the system, as is the case on the CellBE. Execution time depends on inter-thread synchronization, and

Fig. 3. Modeled system

Table 2. Evaluated configurations ranked by L2 bandwidth and memory bandwidth

Number of L2 banks       1      2           4           8           16
L2 bandwidth [GB/s]      25.6   51.2        102.4       204.8       409.6
MICs / DRAMs per MIC     1/1    1/2 or 2/1  1/4 or 4/1  2/4 or 4/2  4/4
Memory bandwidth [GB/s]  6.4    12.8        25.6        51.2        102.4

Each combination of L2 BW and memory BW is a possible configuration; e.g., 2 L2 banks and 2 MICs with 2 DRAMs/MIC deliver 51.2 GB/s of L2 bandwidth and 25.6 GB/s of memory bandwidth.

the memory system for DMA transfers. TaskSim models the memory system in cycle-accurate mode: the DMA controllers, the interconnection buses, the Memory Interface Controller, the DRAM channels, and the DIMMs. TaskSim does not model the processors themselves; it relies on the computation time recorded in the trace to measure the delay between memory operations (DMAs) or interprocessor synchronizations (modeled as blocking semaphores). As inputs for real execution, we use sequences with lengths of 3.4M and 1.8M symbols. However, to obtain traces of manageable size for simulation, we take shorter sequences, ensuring available parallelism for up to 112 workers and using block transfers of 16KB. The code running on the PPU and SPU sides is written in C and was compiled with ppu-gcc and spu-gcc 4.1.1, respectively, with the -O3 option. The executions run on an IBM BladeCenter QS20 system composed of 2 CellBEs at 3.2 GHz. As a result, it is possible to have up to 16 SPEs running concurrently.

5.1 Modeled Systems

Figure 3 shows the general organization of the multicore architectures evaluated using simulation. Each comprises several processing units integrated into clusters, operating at a frequency of 3.2 GHz. The main elements are:
- A control processor P: a PowerPC core with SIMD capabilities.
- Accelerators: 128 SIMD cores, connected into clusters of eight cores each. Each core is connected to a private LS and a DMA controller.


- A Global Data Bus (GDB) connects the LDBs, the L2 cache, and the memory controllers. It allows four 8-byte requests per cycle, i.e., 102.4 GB/s of bandwidth.
- Local Data Buses (LDBs) connect each cluster to the GDB. Each GDB-to-LDB connection is 8 bytes/cycle (25.6 GB/s).
- A shared L2 cache distributed into 1 to 16 banks, as table 2 describes. Each bank is 8-way set associative, with sizes ranging from 4KB to 4MB.
- On-chip memory interface controllers (MICs) connect the GDB and memory, providing up to 25.6 GB/s each (4x 6.4 GB/s multi-channel DDR-2 modules).
Our baseline, a CellBE-like blade machine, consists of: 2 clusters with 8 workers each, without an L2 cache; one GDB providing a peak BW of 102.4 GB/s for on-chip data transfer; one MIC providing a peak BW of 25.6 GB/s to memory; and four DRAM modules connected to the MIC with 6.4 GB/s each.

6

Experimental Results

6.1 Speedup in the Real Machine

Figures 4 and 5 show performance results for the centralized and distributed approaches on the CellBE. The baseline is the execution with one worker. The figures show the performance impact of the block size (parameter b of table 1). When using small block sizes (128B or 256B), the application does not exploit the available parallelism for two reasons: first, although more parallel blocks are available, the number of inefficient DMAs increases; second, because each block transfer is synchronized, the number of synchronization operations also increases, and the introduced overhead degrades performance. With larger blocks, such as 16KB, the parallelism decreases, but so does the synchronization overhead. The impact is less pronounced in the distributed SW because the synchronization mechanism is direct between workers and data transfers are done directly between LSs. With blocks of 16KB, the performance of both approaches is similar (14X for centralized and 15X for distributed with 16 workers), that is, synchronization and data transfer in both approaches are hidden by computation.

6.2 Bandwidth Requirements

Data traffic is measured in both the centralized and distributed cases. With these results, we estimate the on-chip and off-chip BW requirements by dividing the total traffic by the time to compute the matrix. Figure 6 depicts the results of these measurements. Up to 16 workers, the curves reflect the real execution on the CellBE; for the remaining points, we use a mathematical extrapolation that gives some idea of the BW required when using more workers. The figure shows that the centralized SW doubles the BW requirements of the distributed case when using an equal number of workers. The reason is that in the centralized case, a worker first sends data from an LS to memory, and then another worker brings this data from memory to its own LS; besides, all the traffic is off-chip. In the distributed case, data travels only once, from one LS to another LS, and most of the traffic

Fig. 4. Centralized SW implementation (speedup vs. number of SPEs for block sizes from 128B to 16384B)

Fig. 5. Distributed SW implementation (speedup vs. number of SPEs)

Fig. 6. Bandwidth requirements (on-chip and off-chip bandwidth in GB/s vs. number of workers)

Fig. 7. Memory latency impact, centralized (speedup vs. number of workers)
is on-chip. For example, for 16 workers, the off-chip BW reaches 9.1 GB/s in the first case, while the on-chip BW is around 4.7 GB/s in the second. Although the current CellBE architecture can deliver this BW for 16 cores, this demand is clearly unsustainable when more than 16 workers are used, as the figure shows.

6.3 Simulation Results

This section presents simulation results for the configurations of section 5. We show results for both SW implementations using a 16KB block size. We study the impact of memory latency with a perfect memory, of the real memory system, of including an L2 cache, and of the synchronization overhead.

Memory Latency Impact. We perform experiments using up to 128 cores, without an L2 cache, with sufficient memory bandwidth, with different latencies in a perfect memory, and without synchronization overhead. Figure 7 shows the results for the centralized SW. The first observation is that even in the ideal case (0 cycles), the execution does not reach linear performance. This is a consequence of Amdahl's law: with 1 worker, the sequential part of the execution takes 0.41% of the time, but with 128 workers it takes around 30.2% of the time. The figure also shows that this implementation hides the latency properly thanks to the use of double buffering. The degradation starts with latencies near 8K cycles, when more than 64 workers are used. Finally, the performance does not increase with more than 112 cores

Fig. 8. Memory bandwidth, centralized (speedup vs. number of workers for several memory bandwidths)

Fig. 9. Memory bandwidth, distributed (speedup vs. number of workers for several memory bandwidths)

Fig. 10. Cache sizes, 6.4 GB/s memory BW, centralized

Fig. 11. Cache sizes, 6.4 GB/s memory BW, distributed
because the simulated trace only has parallelism for this number of cores, as explained in section 5. Results for the distributed case exhibit similar behavior.

Real Memory System Without L2 Cache. Figures 8 and 9 show the performance results when using several combinations of MICs and DRAMs per MIC, without L2 caches. As shown, with 128 workers in the centralized case, the BW required to obtain the maximum performance is between 51.2 GB/s (2 MICs and 4 DRAMs/MIC) and 102.4 GB/s (4 MICs and 4 DRAMs/MIC). Comparing these results with the extrapolation of figure 6, we conclude that the required BW with 128 workers is close to 65 GB/s. However, such configurations are unrealistic because the physical connections do not scale in this way. For the distributed case, 12.8 GB/s (1 MIC and 2 DRAMs/MIC) is sufficient to obtain the maximum performance. This is because the off-chip traffic in this case is very small (figure 6); essentially all the traffic is kept inside the chip.

Impact of the L2 Cache and Local Storage. There are several ways to include a cache or local memory in the system. We evaluate two options: first, adding a bank-partitioned L2 cache connected to the GDB; second, adding a small LS to each worker. These two models differ in the way data locality and interprocessor communication are managed. Figure 10 shows the results for the centralized case, in which only one MIC and one DRAM module are used (6.4

Fig. 12. Synchronization overhead (speedup vs. number of workers for per-event latencies from 1 ns to 10000 ns)

GB/s of memory BW) and a maximum of 204.8 GB/s of L2 BW is available (the L2 is distributed into 8 banks). As shown, the cache requirement is very high because the matrix to compute is bigger than the L2 and data reuse is very small: when a block is computed, its data is used once by another worker and is then replaced by a new block. Additionally, with many workers the conflict misses increase significantly for small L2 caches; therefore, the miss rate increases and performance degrades. Figure 11 shows the results for the distributed case, where each worker has a 256KB LS (as in the CellBE) and there is an L2 cache distributed into 8 banks to access data that are not part of the matrix computation. The results show that a shared L2 cache of 2MB is enough to capture the on-chip traffic.

Synchronization Overhead. To keep synchronization overhead small, a synchronization technique that matches the target machine well is required. So far, we have carried out experiments disregarding synchronization overhead. Now, it is included in the performance analysis. Each time a worker computes a block, it communicates to another worker that the data is available, as explained in section 4.1. We include this overhead assuming that the time to perform a synchronization event (signal or wait) after a block is computed is a fraction of the time required to compute it, that is, Tsync_block(b,k) = α ∗ Tblock(b,k). We give this information to our simulator and measure the performance for different values of α. Figure 12 shows the experimental results for the centralized SW implementation. As observed, the system absorbs the impact of up to 1000 nanoseconds of latency per synchronization event. The results of the distributed approach exhibit similar behavior.

7

Conclusions

This paper describes the implementation of DP algorithms for long sequence comparisons on modern multicore architectures that exploit several levels of parallelism. We have studied different SW implementations that efficiently use the CellBE hardware and achieve near-linear speedups with respect to the number of workers. Furthermore, the major contribution of our work is the use


of simulation tools to study more complex multicore configurations. We have studied key aspects such as the impact of memory latency, memory organizations capable of delivering maximum BW, and synchronization overhead. We observed that it is possible to minimize the impact of memory latency by using techniques like double buffering while large data blocks are computed. Moreover, we have shown that, due to the sequential part of the algorithm, performance does not scale linearly with a large number of workers; it becomes necessary to optimize the sequential part of the SW implementations. We investigated the memory configuration that delivers the maximum BW to satisfy tens and even hundreds of cores on a single chip. As a result, we determined that for the SW algorithm it is more efficient to distribute small LSs across the workers than to have a shared on-chip L2 data cache connected to the GDB. This is a consequence of the streaming nature of the application, in which data reuse is low. However, the use of LSs makes programming more challenging, because communication is always managed at the user level. Finally, we observed that our synchronization strategy minimizes the impact of this overhead because it prevents a worker from waiting immediately for the response to a previous signal. In this way, the application can endure an overhead of up to a thousand ns with a maximum performance degradation of around 3%.

Acknowledgements. This work was sponsored by the European Commission (ENCORE Project, contract 248647), the HiPEAC Network of Excellence, the Spanish Ministry of Science (contract TIN2007-60625), and the AlBan Program (Scholarship E05D058240CO).


Adaptive Fault Tolerance for Many-Core Based Space-Borne Computing

Mark James¹, Paul Springer¹, and Hans Zima¹,²

¹ Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA
² University of Vienna, Austria
{mjames,pls,zima}@jpl.nasa.gov

Abstract. This paper describes an approach to providing software fault tolerance for future deep-space robotic NASA missions, which will require a high degree of autonomy supported by an enhanced on-board computational capability. Such systems have become possible as a result of the emerging many-core technology, which is expected to offer 1024-core chips by 2015. We discuss the challenges and opportunities of this new technology, focusing on introspection-based adaptive fault tolerance that takes into account the specific requirements of applications, guided by a fault model. Introspection supports runtime monitoring of the program execution with the goal of identifying, locating, and analyzing errors. Fault tolerance assertions for the introspection system can be provided by the user, domain-specific knowledge, or via the results of static or dynamic program analysis. This work is part of an on-going project at the Jet Propulsion Laboratory in Pasadena, California.

1

Introduction

On-board computing systems for space missions are subject to stringent dependability requirements, with enforcement strategies focusing on strict and widely formalized design, development, verification, validation, and testing procedures. Nevertheless, history has shown that despite these precautions errors occur, sometimes resulting in the catastrophic loss of an entire mission. There are theoretical as well as practical reasons for this situation:

1. No matter how much effort is spent on verification and test, well-known undecidability and NP-completeness results show that many relevant problems are either undecidable or computationally intractable.
2. As a result, large systems typically do contain design faults.
3. Even a perfectly designed system may be subject to external faults, such as radiation effects and operator errors.

As a consequence, it is essential to provide methods that avoid system failure and maintain the functionality of a system, possibly with degraded performance, even in the case of faults. This is called fault tolerance. Fault-tolerant systems were built long before the advent of the digital computer, based on the use of replication, diversified design, and federation of equipment. In an article on Babbage's difference engine published in 1834, Dionysius

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 260–274, 2010. © Springer-Verlag Berlin Heidelberg 2010


Lardner wrote [1]: “The most certain and effectual check upon errors which arise in the process of computation is to cause the same computations to be made by separate and independent computers; and this check is rendered still more decisive if they make their computation by different methods.” An example of an early fault-tolerant computer is NASA's Self-Testing-and-Repairing (STAR) system, developed for a 10-year mission to the outer planets in the 1960s. Today, highly sophisticated fault-tolerant computing systems control the new generation of fly-by-wire aircraft, such as the Airbus and Boeing airliners. Perhaps the most widespread use of fault-tolerant computing has been in the area of commercial transaction systems, such as automatic teller machines and airline reservation systems. Most space missions of the past were largely controlled from Earth, so that a significant number of failures could be handled by putting the spacecraft in a “safe” mode, with Earth-bound controllers attempting to return it to operational mode. This approach will no longer work for future deep-space missions, which will require enhanced autonomy and a powerful on-board computational capability. Such missions are becoming possible as a result of recent advances in microprocessor technology, which are leading to low-power many-core chips that today already have on the order of 100 cores, with 2015 technology expected to offer 1024-core systems. These developments have many consequences for fault tolerance, some of them challenging and others providing new opportunities. In this paper we focus on an approach to software-implemented application-adaptive fault tolerance. The paper is structured as follows: In Section 2, we establish a conceptual basis, providing more precise definitions for the notions of dependability and fault tolerance.
Section 3 gives an overview of future missions and their requirements, and outlines an on-board architecture that complements a radiation-hardened spacecraft control and communication component with a COTS-based high-performance processing system. After introducing introspection in Section 4, we discuss introspection-based adaptive fault tolerance in Section 5. The paper ends with an overview of related work and concluding remarks in Sections 6 and 7.

2

Fault Tolerance in the Context of Dependability

Dependability has been defined by the IFIP 10.4 Working Group on Dependable Computing and Fault Tolerance as the “trustworthiness of a computing system which allows reliance to be justifiably placed on the service it delivers”. Dependability is characterized by its attributes, the threats to it, and the means by which it can be achieved [2,3]. The attributes of dependability specify a set of properties that can be used to assess how a system satisfies its overall requirements. Key attributes are reliability, availability, mean-time-to-failure, and safety. A threat is any fact or event that negatively affects the dependability of a system. Threats can be classified as faults, errors, or failures. Their relationship can be illustrated by the fault-error-failure chain shown in Figure 1.


Fig. 1. Threats: the fault-error-failure chain (a fault is activated into an error; errors propagate as invalid states until one reaches the service interface, where it becomes a failure violating the system specification; external faults, caused by external failures, enter through the service interface)

A fault is a defect in a system. Faults can be dormant—e.g., incorrect program code that is not executed—and have no effect. When activated during system operation, a fault leads to an error, which is an illegal system state. Errors may be propagated through a system, generating other errors. For example, a faulty assignment to a variable may result in an error characterized by an illegal value for that variable; the use of the variable for the control of a for-loop can lead to ill-defined iterations and other errors, such as illegal accesses to data sets and buffer overflows. A failure occurs if an error reaches the service interface of a system, resulting in system behavior that is inconsistent with its specification. With the above terminology in place, we can now precisely characterize a system as fault tolerant if it never enters a failure state. Errors may occur in such a system, but they never reach its service boundary and always allow recovery to take place. The implementation of fault tolerance in general implies three steps: error detection, error analysis, and recovery. The means for achieving dependability include fault prevention, fault removal, and fault tolerance. Fault prevention addresses methods that prevent faults from being incorporated into a system. In the software domain, such methods include restrictive coding structures that avoid common programming faults, the use of object-oriented techniques, and the provision of high-level APIs. An example of hardware fault prevention is shielding against radiation-caused faults. Fault removal refers to a set of techniques that eliminate faults during the design and development process. Verification and Validation (V&V) are important in


this context: Verification provides methods for the review, inspection, and test of systems, with the goal of establishing that they conform to their specification. Validation checks the specification in order to determine whether it correctly expresses the needs of the system's users. For theoretical as well as practical reasons, neither fault prevention nor fault removal provides a complete solution, i.e., in general, for non-trivial programs there is no guarantee that they do not contain design faults. However, even in a program completely free of design faults, hardware malfunctions can cause software errors at execution time. In the domain underlying this paper, the problem is actually more severe: a spacecraft can be hit by radiation, which can cause arbitrary errors in data and control structures. This is discussed in more detail in the next section.

3

Future Space Missions and Their Requirements

Future deep-space missions face the challenge of designing, building, and operating progressively more capable autonomous spacecraft and planetary rovers. Given the latency and bandwidth of spacecraft-Earth communication for such missions, the need for enhanced autonomy becomes obvious: Earth-based mission controllers will be unable to directly control distant spacecraft and robots to ensure timely precision and safety, and to support “opportunistic science” by capturing rapidly changing events, such as dust devils on Mars or volcanic eruptions on a remote moon in the solar system [4]. Furthermore, the high data volume yielded by smart instruments on board the spacecraft can overwhelm the limited bandwidth of spacecraft-Earth communication, enforcing on-board data analysis, filtering, and compression. Science processing will require a high-performance capability that may range up to hundreds of Teraops for on-board synthetic aperture radar (SAR), hyperspectral assessment of scenes, or stereo vision. Currently, the performance of traditional mission architectures lags that of commercial products by at least two orders of magnitude; furthermore, this gap is expected to widen in the future. As a consequence, the traditional approach to on-board computing is not expected to scale with the requirements of future missions. A radical departure is necessary. Emerging technology offers a way out of this dilemma.
Recent developments in the area of commercial multi-core architectures have resulted in simpler processor cores, enhanced efficiency in terms of performance per Watt, and a dramatic increase in the number of cores on a chip, as illustrated by Tilera Corporation's Tile64 [5], a homogeneous parallel chip architecture with 64 identical cores arranged in an 8x8 grid, performing at 192 Gops with a power consumption of 170-300 mW per core, or Intel's terachip announced for 2011, an 80-core chip providing 1.01 Teraflops at a frequency of 3.16 GHz, with a power consumption of 62W. These trends suggest a new paradigm for spacecraft architectures, in which the ultra-reliable radiation-hardened core component responsible for control, navigation, data handling, and communication is extended with a scalable commodity-based multi-core system for autonomy and science processing. This approach


will provide the basis for a powerful parallel on-board supercomputing capability. However, bringing COTS components into space leads to a new problem: the need to address their vulnerability to hardware as well as software faults. Space missions are subject to faults caused by equipment failure or environmental impacts, such as radiation, temperature extremes, or vibration. Missions operating close to the Earth/Moon system can be controlled from the ground. Such missions may allow controlled failure, in the sense that they fail only in specific, pre-defined modes, and only to a manageable extent, avoiding complete disruption. Rather than providing the capability of resuming normal operation, a failure in such a system puts it into a safe mode, from which recovery is possible after the failure has been detected and identified. As an example, the on-board software controlling robotic planetary exploration spacecraft for those portions of a mission during which there is no critical activity (such as detumbling the spacecraft after launch or descent to a planetary surface) can be organized as a system allowing controlled failure. When a fault is detected during operation, all active command sequences are terminated, components inessential for spacecraft survival are powered off, and the spacecraft is positioned into a stable sun-pointed attitude. Critical information regarding the state of the spacecraft and the fault is transmitted to ground controllers via an emergency link. Restoring the spacecraft's health is then delegated to controllers on Earth. However, such an approach is not adequate for deep-space missions beyond immediate and continuous control from Earth.
For such missions, fault tolerance is a key prerequisite, i.e., a fail-operational response to faults must be provided, implying that the spacecraft must be able to deal autonomously with faults and continue to provide the full range of critical functionality, possibly at the cost of degraded performance. Systems which preserve continuity of service can be significantly more difficult to design and implement than fail-controlled systems. Not only is it necessary to determine that a fault has occurred, the software must be able to determine the effects of the fault on the system’s state, remove the effects of the fault, and then place the system into a state from which processing can proceed. This is the situation on which the rest of this paper is based. We focus on strategies and techniques for providing adaptive, introspection-based fault tolerance for space-borne systems. Deep-space missions are subject to radiation in the form of cosmic rays and the solar wind, exposing them to protons, alpha particles, heavy ions, ultraviolet radiation, and X-rays. Radiation can interact with matter through atomic displacement—a rearrangement of atoms in a crystal lattice—or ionization, with the potential of causing permanent or transient damage [6]. Modern COTS circuits are protected against long-term cumulative degradation as well as catastrophic effects caused by radiation. However, they are exposed to transient faults in the form of Single Event Upsets or Multiple Bit Upsets, which do not cause lasting damage to the device. A Single Event Upset (SEU) changes the state of a single bit in a register or memory, whereas a Multiple Bit Upset (MBU) results in a

Adaptive Fault Tolerance for Many-Core Based Space-Borne Computing

265

change of state of multiple adjacent bits. The probability of SEUs and MBUs depends on the environment in which the spacecraft is operating, and on the detailed characterization of the hardware components in use. COTS semiconductor fabrication processes vary: with the 65nm process now in commercial production, some of the semiconductor foundries are using Silicon on Insulator (SOI) construction, which makes the chips less susceptible to these radiation effects. Depending on the efficacy of fault tolerance mechanisms, SEUs and MBUs can manifest themselves at different levels. For example, faults may affect processor cores and caches, DRAM memory units, memory controllers, on-chip communication networks, I/O processor nodes, and interconnection networks. This can result in the corruption of instruction fetch/decode, address selection, memory units, synchronization, communication, and signal/interrupt processing. In a sequential thread this may lead to the (unrecognized) use of corrupted data and the execution of wrong or illegal instructions, branches, and data accesses in the program. Hangs or crashes of the program, as well as unwarranted exceptions, are other possible consequences. In a distributed system, transient faults can cause communication errors, livelock, deadlock, data races, or arbitrary Byzantine failures [7]. Some of these effects may be caught and corrected in the hardware (e.g., via the use of an error-correcting code (ECC)) with no disruption of the program. A combination of hardware and software mechanisms may provide an effective approach, as in the case of the fault isolation of cores in a multi-core chip [8]. Other faults, such as those causing illegal instruction codes, illegal addresses, or the violation of access protections may trigger a synchronous interrupt, which can lead to an application-specific response. In a distributed system, watchdogs may detect a message failure. However, in general, an error may remain undetected.
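As an illustration of why some of these faults are catchable in hardware while others slip through, consider a minimal sketch of bit upsets and a single parity bit. All names and the 8-bit word are our own illustrative choices; real ECC such as SECDED is considerably stronger than the parity check shown here.

```python
# Illustrative sketch (not flight code): the effect of SEUs and MBUs on a
# memory word, and why a lone parity bit is insufficient protection.
def seu(word, bit):
    """Single Event Upset: flip one bit of a stored word."""
    return word ^ (1 << bit)

def mbu(word, start_bit, width):
    """Multiple Bit Upset: flip `width` adjacent bits."""
    return word ^ (((1 << width) - 1) << start_bit)

def parity(word):
    """Parity check bit: count of one-bits modulo two."""
    return bin(word).count("1") % 2

value = 0b0001_0110
assert seu(value, 3) == 0b0001_1110               # one bit changed
assert parity(seu(value, 3)) != parity(value)     # parity detects the SEU
assert parity(mbu(value, 2, 2)) == parity(value)  # a 2-bit MBU evades parity
```

The last assertion is the motivation for stronger codes: an even number of adjacent flips leaves the parity bit unchanged, so MBUs require multi-bit-correcting ECC or software-level detection.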
Figure 2 outlines key building blocks of an architecture for space-borne computing in which the radiation-hardened core is augmented with a COTS-based scalable high-performance computing system (HPCS). The Spacecraft Control & Communication System is the core component of the on-board system, controlling the overall operation, navigation, and communication of the spacecraft. Due to its critical role for the operation and survival of the spacecraft this system is typically implemented using radiation-hardened components that are largely immune to the harsh radiation environments encountered in space. The Fault-Tolerant High-Capability Computational Subsystem (FTCS) is designed to provide an additional layer of fault tolerance around the HPCS via a Reliable Controller that shields the Spacecraft Control & Communication System from faults that evaded detection or masking in the High Performance Computing System. The Reliable Controller is the only component of the FTCS that communicates directly with the spacecraft control and communication system. As a consequence, it must satisfy stringent reliability requirements. Important approaches for implementing the Reliable Controller—either on a pure software basis, or by a combination of hardware and software—have been developed in the Ghidrah [9] and ST8 systems [10].


[Figure: block diagram. The Spacecraft Control & Communication System, comprising the Spacecraft Control Computer (SCC) and the Communication Subsystem (COMM), communicates with Earth and connects via a Spacecraft Interface to the Fault-Tolerant High-Capability Computational Subsystem (FTCS). The FTCS contains the Reliable Controller (CTRL), the High Performance Computing System (HPCS), the Intelligent Mass Data Storage (IMDS), and interconnection network(s); an Instrument Interface links it to the instruments.]
Fig. 2. An architecture for scalable space-borne computing

4 Introspection

The rest of this paper deals with an introspection-based approach to providing fault tolerance for the High Performance Computing System in the architecture depicted in Figure 2. A generic framework for introspection has been described in [11]; here we outline its major components. Introspection provides a generic software infrastructure for the monitoring, analysis, and feedback-oriented management of applications at execution time. Consider a parallel application executing in the High Performance Computing System. Code and data belonging to the object representation of the application will be distributed across its components, creating a partitioning of the application into application segments. An instance of the introspection system consists of a set of interacting introspection modules, each of which can be linked to application segments. The structure of an individual introspection module is outlined in Figure 3. Its components include:
– Application/System Links are either sensors or actuators. Sensors represent hardware or software events that occur during the execution of the application; they provide input from the application to the introspection module. Actuators represent feedback from the introspection module to the application. They are triggered as a result of module-internal processing and may result in changes of the application state, its components, or its instrumentation.


[Figure: an introspection module built around the SHINE inference engine and its knowledge base, linked to one or more application segments through sensors and actuators; internal components for monitoring, analysis, feedback/recovery, and prognostics, with control links to and from other introspection modules.]
Fig. 3. Introspection module

– Inference Engine. The nature of the problems to which introspection is applied demands efficient and flexible control of the associated application segments. These requirements are met in our system by the Spacecraft Health Inference Engine (SHINE) as the core of each introspection module. SHINE is a real-time inference engine that provides an expert systems capability and functionality for building, accessing, and updating a structured knowledge base.
– The Knowledge Base consists of declarative facts and rules that specify how knowledge can be processed. It may contain knowledge about the underlying system, the programming languages supported, the application domain, and properties of application programs and their execution that are either derived by static or dynamic analysis or supplied by the user.

We have implemented a prototype introspection system for a cluster of Cell Broadband Engines. Figure 4 illustrates the resulting hierarchy of introspection modules, where the levels of the hierarchy, from bottom to top, are respectively associated with the SPEs, the PPE, and the overall cluster.
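A minimal, schematic rendering of such a module might look as follows: sensors feed facts into a small forward-chaining rule engine over a knowledge base, and conclusions prefixed with an actuator tag are fed back to the application segment. This is an illustrative stand-in, not SHINE's actual interface; all rule and event names are hypothetical.

```python
class IntrospectionModule:
    """Schematic introspection module: sensors record events as facts,
    rules are forward-chained to a fixed point, and derived actuator
    commands model feedback to the application segment."""
    def __init__(self):
        self.facts = set()    # knowledge base: declarative facts
        self.rules = []       # (premises, conclusion) pairs
        self.actions = []     # actuator commands issued so far

    def add_rule(self, premises, conclusion):
        self.rules.append((frozenset(premises), conclusion))

    def sense(self, event):
        """Sensor link: record an observed hardware/software event."""
        self.facts.add(event)
        self._infer()

    def _infer(self):
        changed = True
        while changed:        # forward chaining to a fixed point
            changed = False
            for premises, conclusion in self.rules:
                if premises <= self.facts and conclusion not in self.facts:
                    self.facts.add(conclusion)
                    changed = True
        for fact in list(self.facts):
            if fact.startswith("actuate:"):   # conclusions that trigger feedback
                self.actions.append(fact)
                self.facts.discard(fact)

mod = IntrospectionModule()
mod.add_rule({"assert_violated", "region_critical"}, "fault_severe")
mod.add_rule({"fault_severe"}, "actuate:restart_segment")
mod.sense("region_critical")
mod.sense("assert_violated")   # both premises now hold: actuator fires
```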

5 Adaptive Fault Tolerance for High-Performance On-Board Computing

The prototype system mentioned above relied on user-specified assertions guiding the introspection system. In the following we outline the ideas underlying


[Figure: a three-level tree of introspection modules, each containing sensors, an inference engine, a knowledge base, and components for monitoring, analysis, feedback/recovery, and prognostics, connected through sensors and actuators; the root module has external links, and the leaf modules attach to the individual SPEs.]
Fig. 4. Introspection hierarchy for a cluster of Cell Broadband Engines

our current work, which generalizes this system in a number of ways, with a focus on providing support for the automatic generation of assertions. Our approach to providing fault tolerance for applications executing in the high-performance computing system is adaptive in the sense that faults can be handled in a way that depends on the potential damage they cause. This enables a flexible policy, resulting in a reduced performance penalty for the fault tolerance strategy when compared to fixed-redundancy schemes. For example, an SEU causing a single bitflip in the initial phase of an image processing algorithm may not affect the outcome of the computation at all. However, SEU-triggered faults such as the corruption of a key data structure caused by an illegal assignment to one of its components, the change of an instruction code, or the corruption of an address computation may have detrimental effects on the outcome of the computation. Such faults need to be handled through the use of redundancy, with an approach that reflects their severity and takes into account known properties of the application and the underlying system.

5.1 Assertions

An assertion describes a propositional logic predicate that must be satisfied at certain locations of the program, during specific phases of execution, or in


program regions such as loops and methods. Its specification consists of four components: an assertion expression, the region in which this expression applies, a characterization of the fault raised if the assertion is violated, and an optional recovery specification. We illustrate this with a set of examples.

assert (A(i) ≤ B(i)) in (L1) fault (F1, i, ...) recovery (...)

The assertion expression A(i) ≤ B(i) must be satisfied immediately after the execution of the statement labeled L1. If it fails, a fault of type F1 is signaled and a set of relevant arguments is relayed to the introspection system. Furthermore, a hint for the support of a recovery method is provided.

assert (x = 0) pre in (L2) fault (FT2, x, ...)
assert (z = f2(x)) in (L2) fault (FT3, x, y, z, ...)

The two assertion expressions x = 0 and z = f2(x) serve, respectively, as precondition and postcondition for the statement at label L2, with fault types FT2 and FT3 for assertion violations.

assert (diff ≥ ε) invariant in (r_loop) fault (...)

The assertion expression diff ≥ ε specifies an invariant associated with the region defined by r_loop. It must be satisfied at any point of execution within this loop.
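These assertion forms can be mimicked as runtime acceptance tests that report a fault type and its context to the introspection system instead of aborting. The following sketch is our own rendering; the fault-log convention and the function f2 are illustrative, not part of the paper's notation.

```python
# Hedged sketch: runtime checks mirroring the assertion examples above.
faults = []

def check(expr, fault_type, **context):
    """On violation, relay the fault type and relevant arguments."""
    if not expr:
        faults.append((fault_type, context))
    return expr

def f2(x):              # hypothetical stand-in for the function at L2
    return 2 * x + 1

# assert (x = 0) pre in (L2); assert (z = f2(x)) in (L2)
x = 0
check(x == 0, "FT2", x=x)              # precondition at L2: holds
z = f2(x)
check(z == f2(x), "FT3", x=x, z=z)     # postcondition at L2: holds

# assert (A(i) <= B(i)) in (L1)
A, B = [1, 5, 3], [2, 4, 9]
i = 1
check(A[i] <= B[i], "F1", i=i)         # violated (5 > 4): fault F1 logged
```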

5.2 Fault Detection and Recovery

Introspection-based fault tolerance provides a flexible approach that, in addition to applying innovative methods, can leverage existing technology. Methods that are useful in this context include assertion-based acceptance tests, which check the value of an assertion and transfer control to the introspection system in case of violation, and fault detectors, which can effectively mask a fault by using redundant code based on analysis information (see Section 5.3). Furthermore, faults in critical sections of the code can be masked by leveraging fixed-redundancy techniques such as Triple or N-Modular Redundancy (TMR, NMR). Another technique is the replacement of a function with an equivalent version that implements Algorithm-Based Fault Tolerance (ABFT). Information supporting the generation of assertion-based acceptance tests as well as fault detectors can be derived from static or dynamic automatic program analysis, retrieved from domain- or system-specific information contained in the knowledge base, or directly specified by an expert user. Figure 5 provides an informal illustration of the different methods used to gather such information.
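The TMR idea mentioned above can be sketched in a few lines: a critical computation is executed three times and a majority vote masks a single corrupted replica. This is an illustrative sketch, not the paper's implementation.

```python
# Hedged sketch of Triple Modular Redundancy (TMR) voting.
from collections import Counter

def tmr(replicas):
    """Majority vote over three independently computed results;
    a single faulty replica is masked, total divergence is reported."""
    winner, count = Counter(replicas).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: unrecoverable divergence")
    return winner

# A transient fault corrupts one of the three replicas; the vote masks it.
assert tmr([42, 42, 43]) == 42
assert tmr([7, 7, 7]) == 7
```

NMR generalizes this to N replicas, trading additional redundant executions for tolerance of more simultaneous faults.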

5.3 Analysis-Based Assertion Generation

Automatic analysis of program properties relevant to fault tolerance can draw on a rich spectrum of existing tools and methods. This includes the static analysis of the control and data structures of a program, its intra- and


[Figure: the source program P is analyzed; analysis results, application and system knowledge from the knowledge base (KB), and user input drive assertion generation; instrumentation weaves the assertions into P, yielding the instrumented program P’.]
Fig. 5. Assertion generation

inter-procedural control flow, data flow and data dependences, data access patterns, and patterns of synchronization and communication in multi-threaded programs [12,13]. Other static tools can check for the absence of deadlocks or race conditions. Profiling from simulation runs or test executions can contribute information on variable ranges, loop counts, or potential bottlenecks [14]. Furthermore, dynamic analysis provides knowledge that cannot be derived at compile time, such as the actual paths taken during a program execution and dynamic dependence relationships. Consider a simple example. In data flow analysis, a use-definition chain is defined as the link between a statement that uses (i.e., reads) a variable and the set of all definitions (i.e., assignments) of that variable that can reach this statement along an execution path. Similarly, a definition-use chain links a definition to all its uses. An SEU can break such a chain, for example by redirecting an assignment. This can result in a number of different faults, including the following:
– attempt to use an undefined variable or dereference an undefined pointer
– rendering a definition of a variable useless


– leading to an undefined expression evaluation
– destroying a loop bound

The results of static analysis (as well as results obtained from program profiling) can be exploited for fault detection and recovery in a number of ways, including the generation of assertions in connection with specific program locations or program regions. Examples include asserting:
– the value of a variable that has been determined to be a constant
– the value range of a variable or a pointer
– the preservation of use-definition and definition-use chains
– the preservation of dependence relationships
– a limit for the number of iterations in a loop
– an upper limit for the size of a data structure
– the correctness of access sequences to files

The generation of such assertions must be based on the statically derived information in combination with the generation of code that records the corresponding relationships at runtime. A more elaborate technique that exploits static analysis for the generation of a fault detector using redundant code generation can be based on program slicing [15]. This is an analysis technique that extracts from a program the set of statements that affect the values required at a certain point of interest. For example, it answers the question which statements of the program contribute to the value of a critical variable at a given location. These statements form a slice. The occurrence of SEUs can disrupt the connection between a variable occurrence and its slice. A fault detector for a specific variable assignment generates redundant code based only on that slice, and compares its outcome with that of the original code. Some of the techniques applied to sequential programs can be generalized to deal with multi-threaded programs. Of specific importance in this context are programs whose execution is organized as a data-parallel set of threads according to the Single-Program-Multiple-Data (SPMD) paradigm since the vast majority of parallel scientific applications belong to this category [16].
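A toy rendering of the slicing-based detector may clarify the idea: the slice of a critical variable is re-executed as redundant code and compared against the original computation. All function names are illustrative; real slice extraction would be done by the analysis tool, not by hand.

```python
# Hedged sketch of a slicing-based fault detector for the variable `total`.
def original(data):
    total = 0
    log = []                  # bookkeeping: not in the slice of `total`
    for v in data:
        total += v * v
        log.append(v)
    return total, log

def slice_of_total(data):
    """Redundant code generated from the slice of `total` only."""
    total = 0
    for v in data:
        total += v * v
    return total

def detect_fault(data):
    total, _ = original(data)
    return total != slice_of_total(data)   # mismatch signals a likely SEU

assert detect_fault([1, 2, 3]) is False    # both paths agree: no fault
```

Because the redundant path contains only the statements the slice identifies, the runtime cost of the detector is bounded by the slice size rather than by the whole program.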

6 Related Work

The Remote Exploration and Experimentation (REE) [17] project conducted at NASA was among the first to consider putting a COTS-based parallel machine into space and to address the resulting problems related to application-adaptive fault tolerance [18]. More recently, NASA’s Millennium ST-8 project [10] developed a “Dependable Multiprocessor” around a COTS-based cluster using the IBM PowerPC 750FX as a data processor, with a Xilinx Virtex-II 6000 FPGA co-processor for the support of application-specific modules for digital signal processing, data compression, and vector processing. A centralized system controller for the cluster is implemented using a redundant configuration of radiation-hardened Motorola processors.


Some significant work has been done in the area of assertions. The EAGLE system [19] provides an assertion language with temporal constraints. The Design for Verification (D4V) [20] system uses dynamic assertions, which are objects with state that are constructed at design time and tied to program objects and locations. Language support for assertions and invariants has been provided in Java 1.4, in Eiffel for pre- and postconditions in the style of Hoare logic, and in the Java Modeling Language (JML). Intelligent resource management in an introspection-based approach has been proposed in [21]. Finally, the concept of introspection, as used in our work, has been outlined in [22]. A similar idea has been used by Iyer and co-workers for application-specific security [23], based on hardware modules embedded in a reliability and security engine.

7 Conclusion

This paper focused on software-provided fault tolerance for future deep-space missions providing an on-board COTS-based computing capability for the support of autonomy. We described the key features of an introspection framework for runtime monitoring, analysis, and feedback-oriented recovery, and outlined methods for the automatic generation of assertions that trigger key actions of the framework. A prototype version of the system was originally implemented on a cluster of Cell Broadband Engines; currently, an implementation effort is underway for the Tile64 system. Future work will address an extension of the introspection technology to performance tuning and power management. Furthermore, we will study the integration of introspection with traditional V&V.

Acknowledgment This research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration and funded through the internal Research and Technology Development program.

References
1. Lardner, D.: Babbage’s Calculating Engine. Edinburgh Review (July 1834); reprinted in Morrison, P., Morrison, E. (eds.): Charles Babbage and His Calculating Engines. Dover, New York (1961)
2. Avizienis, A., Laprie, J.C., Randell, B.: Fundamental Concepts of Dependability. Technical Report CSD 010028, UCLA (2000)


3. Avizienis, A., Laprie, J.C., Randell, B., Landwehr, C.: Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing 1(1) (January–March 2004)
4. Castano, R., Estlin, T., Anderson, R.C., Gaines, D.M., Castano, A., Bornstein, B., Chouinard, C., Judd, M.: OASIS: Onboard Autonomous Science Investigation System for Opportunistic Rover Science. Journal of Field Robotics 24(5), 379–397 (2007)
5. Tile64 Processor Family (2007), http://www.tilera.com
6. Shirvani, P.P.: Fault-Tolerant Computing for Radiation Environments. Technical Report 01-6, Center for Reliable Computing, Stanford University, Stanford, California 94305 (June 2001) (Ph.D. Thesis)
7. Lamport, L., Shostak, R., Pease, M.: The Byzantine Generals Problem. ACM Trans. Programming Languages and Systems 4(3), 382–401 (1982)
8. Aggarwal, N., Ranganathan, P., Jouppi, N.P., Smith, J.E.: Isolation in Commodity Multicore Processors. IEEE Computer 40(6), 49–59 (2007)
9. Li, M., Tao, W., Goldberg, D., Hsu, I., Tamir, Y.: Design and Validation of Portable Communication Infrastructure for Fault-Tolerant Cluster Middleware. In: Cluster 2002: Proceedings of the IEEE International Conference on Cluster Computing, p. 266. IEEE Computer Society, Washington (September 2002)
10. Samson, J., Gardner, G., Lupia, D., Patel, M., Davis, P., Aggarwal, V., George, A., Kalbarczyk, Z., Some, R.: High Performance Dependable Multiprocessor II. In: Proceedings 2007 IEEE Aerospace Conference, pp. 1–22 (March 2007)
11. James, M., Shapiro, A., Springer, P., Zima, H.: Adaptive Fault Tolerance for Scalable Cluster Computing in Space. International Journal of High Performance Computing Applications (IJHPCA) 23(3) (2009)
12. Zima, H.P., Chapman, B.M.: Supercompilers for Parallel and Vector Computers. ACM Press Frontier Series (1991)
13. Nielson, F., Nielson, H.R., Hankin, C.: Principles of Program Analysis. Springer, New York (1999)
14. Havelund, K., Goldberg, A.: Verify Your Runs. In: Meyer, B., Woodcock, J. (eds.) VSTTE 2005. LNCS, vol. 4171, pp. 374–383. Springer, Heidelberg (2008)
15. Weiser, M.: Program Slicing. IEEE Transactions on Software Engineering 10, 352–357 (1984)
16. Strout, M.M., Kreaseck, B., Hovland, P.: Data Flow Analysis for MPI Programs. In: Proceedings of the 2006 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2006) (June 2006)
17. Some, R., Ngo, D.: REE: A COTS-Based Fault Tolerant Parallel Processing Supercomputer for Spacecraft Onboard Scientific Data Analysis. In: Proceedings of the Digital Avionics Systems Conference, pp. 7.B.3-1–7.B.3-12 (1999)
18. Kalbarczyk, Z.T., Iyer, R.K., Bagchi, S., Whisnant, K.: Chameleon: A Software Infrastructure for Adaptive Fault Tolerance. IEEE Trans. Parallel Distrib. Syst. 10(6), 560–579 (1999)
19. Goldberg, A., Havelund, K., McGann, C.: Runtime Verification for Autonomous Spacecraft Software. In: Proceedings 2005 IEEE Aerospace Conference, pp. 507–516 (March 2005)
20. Mehlitz, P.C., Penix, J.: Design for Verification with Dynamic Assertions. In: Proceedings of the 29th Annual IEEE/NASA Software Engineering Workshop, SEW 2005 (2005)


21. Kang, D.I., Suh, J., McMahon, J.O., Crago, S.P.: Preliminary Study toward Intelligent Run-time Resource Management Techniques for Large Multi-Core Architectures. In: Proceedings of the 2007 Workshop on High Performance Embedded Computing, HPEC 2007 (September 2007)
22. Zima, H.P.: Introspection in a Massively Parallel PIM-Based Architecture. In: Joubert, G.R. (ed.) Advances in Parallel Computing, vol. 13, pp. 441–448. Elsevier B.V., Amsterdam (2004)
23. Iyer, R.K., Kalbarczyk, Z., Pattabiraman, K., Healey, W., Hwu, W.M.W., Klemperer, P., Farivar, R.: Toward Application-Aware Security and Reliability. IEEE Security and Privacy 5(1), 57–62 (2007)

Maestro: Data Orchestration and Tuning for OpenCL Devices Kyle Spafford, Jeremy Meredith, and Jeffrey Vetter Oak Ridge National Laboratory {spaffordkl,jsmeredith,vetter}@ornl.gov

Abstract. As heterogeneous computing platforms become more prevalent, the programmer must account for complex memory hierarchies in addition to the difficulties of parallel programming. OpenCL is an open standard for parallel computing that helps alleviate this difficulty by providing a portable set of abstractions for device memory hierarchies. However, OpenCL requires that the programmer explicitly controls data transfer and device synchronization, two tedious and error-prone tasks. This paper introduces Maestro, an open source library for data orchestration on OpenCL devices. Maestro provides automatic data transfer, task decomposition across multiple devices, and autotuning of dynamic execution parameters for some types of problems.

1 Introduction

In our previous work with general purpose computation on graphics processors (GPGPU) [1, 2], as well as a survey of similar research in the literature [3, 4, 5, 6, 7], we have encountered several recurring problems. First and foremost is code portability: most GPU programming environments have been proprietary, requiring code to be completely rewritten in order to run on a different vendor’s GPU. With the introduction of OpenCL, the same kernel code (code which executes on the device) can generally be used on any platform, but must be “hand tuned” for each new device in order to achieve high performance. This manual optimization of code requires significant time, effort, and expert knowledge of the target accelerator’s architecture. Furthermore, the vast majority of results in GPGPU report performance for only single-GPU implementations, presumably due to the difficulty of task decomposition and load balancing: the process of breaking a problem into subtasks and dividing the work among multiple devices. These tasks require the programmer to know the relative processing capability of each device in order to appropriately partition the problem. If the load is poorly balanced, devices with insufficient work will be idle while waiting on those with larger portions of work to finish. Task decomposition also requires that the programmer carefully aggregate output data and perform device synchronization. This prevents OpenCL code from being portable: when moving to a platform with a different number of devices, or devices which differ in relative speed, work allocations must be adjusted.

P. D’Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 275–286, 2010. © Springer-Verlag Berlin Heidelberg 2010


Typically, GPUs or other compute accelerators are connected to the host processor via a bus. Many results reported in the literature focus solely on kernel execution and neglect to include the performance impact of data transfer across the bus. Poor management of this interconnection bus is the third common problem we have identified. The bandwidth of the bus is almost always lower than the memory bandwidth of the device, and suboptimal use of the bus can have drastic consequences for performance. In some cases, the difference in bandwidth can be more than an order of magnitude. Consider the popular NVIDIA Tesla C1060, which has a peak memory bandwidth of 102 gigabytes per second. It is usually connected to a host processor via a sixteen-lane PCIe 2.0 bus, with a peak bandwidth of only eight gigabytes per second. One common approach to using the bus is the function offload model. In this model, sequential portions of an application execute on the host processor. When a parallel section is reached, input data is transferred to the accelerator. When the accelerator is finished computing, outputs are transferred back to the host processor. This approach is the simplest to program, but the worst case for performance. It poorly utilizes system resources: the bus is never active at the same time as the accelerator. In order to help solve these problems, we have developed an open source library called Maestro. Maestro leverages a combination of autotuning, multibuffering, and OpenCL’s device interrogation capabilities in an attempt to provide a portable solution to these problems. Further, we argue that Maestro’s automated approach is the only practical solution, since the parameter space for hand tuning OpenCL applications is enormous.
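The cost of the function offload model, and the benefit of overlapping chunked transfers with computation (the multibuffering Maestro applies), can be sketched with simple timing arithmetic. The 125 ms figures are illustrative, roughly corresponding to moving 1 GB over an 8 GB/s bus.

```python
# Hedged sketch: sequential offload vs. pipelined transfer/compute overlap.
def offload_time(transfer_ms, compute_ms):
    """Function offload model: bus and accelerator never run concurrently."""
    return transfer_ms + compute_ms

def overlapped_time(transfer_ms, compute_ms, chunks):
    """Multibuffered pipeline: the transfer of chunk i+1 hides behind the
    computation of chunk i; only the first transfer and last compute are
    fully exposed."""
    t, c = transfer_ms / chunks, compute_ms / chunks
    return t + (chunks - 1) * max(t, c) + c

# 1 GB over an ~8 GB/s PCIe bus (~125 ms) feeding an equally long kernel:
assert offload_time(125.0, 125.0) == 250.0
assert overlapped_time(125.0, 125.0, 5) == 150.0
```

With more chunks the exposed first transfer and last compute shrink, and the total approaches the larger of the two phase times rather than their sum.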

1.1 OpenCL

In December 2008, the Khronos Group introduced OpenCL [8], an open standard for parallel computing on heterogeneous platforms. OpenCL specifies a language, based on C99, that allows a programmer to write parallel functions called kernels which can execute on any OpenCL device, including CPUs, GPUs, or any other device with a supporting implementation. OpenCL provides support for data parallelism as well as task parallelism. OpenCL also provides a set of abstractions for device memory hierarchies and an API for controlling memory allocation and data transfer. In OpenCL, parallel kernels are divided into tens of thousands of work items, which are organized into local work groups. For example, in matrix multiplication, a single work item might calculate one entry in the solution matrix, and a local work group might calculate a submatrix of the solution matrix.
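The NDRange decomposition just described can be sketched in plain Python: each work item computes one entry of the result matrix, and work items fall into TILE x TILE local work groups. The loops stand in for an OpenCL kernel launch; TILE and the matrices are illustrative.

```python
# Hedged sketch: work items and local work groups for matrix multiplication.
TILE = 2  # local work group size in each dimension

def matmul_by_work_items(A, B, n):
    C = [[0] * n for _ in range(n)]
    groups = {}
    for gi in range(n):                           # global id, dimension 0
        for gj in range(n):                       # global id, dimension 1
            group = (gi // TILE, gj // TILE)      # work group coordinates
            groups.setdefault(group, []).append((gi, gj))
            # one work item: compute a single entry of the solution matrix
            C[gi][gj] = sum(A[gi][k] * B[k][gj] for k in range(n))
    return C, groups

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, groups = matmul_by_work_items(A, B, 2)
assert C == [[19, 22], [43, 50]]
assert groups == {(0, 0): [(0, 0), (0, 1), (1, 0), (1, 1)]}   # one 2x2 group
```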

2 Overview

Before proceeding to Maestro’s proposed solutions to the observed problems, it is important to introduce one of the key ideas in Maestro’s design philosophy: a single, high-level task queue.

2.1 High Level Queue

In the OpenCL task queue model, the programmer must manage a separate task queue for each GPU, CPU, or other accelerator in a heterogeneous platform. This model requires that the programmer has detailed knowledge about which OpenCL devices are available. Modifications to the code are required to obtain high performance on a system with a different device configuration. The Maestro model (contrasted in Figure 1), unifies the disparate, devicespecific queues into a single, high-level task queue. At runtime, Maestro queries OpenCL to obtain information about the available GPUs or other accelerators in a given system. Based on this information, Maestro can transfer data and divide work among the available devices automatically. This frees the programmer from having to synchronize multiple devices and keep track of device specific information.

Fig. 1. Task Queue Hierarchies–In the OpenCL task queue hierarchy, the programmer must manage a separate task queue for each device. Maestro unifies these into a single, high-level queue which is independent of the underlying hardware.

3 Problem: Code Portability

OpenCL’s claim to portability relies on its ability to execute kernel code on any device with a supporting implementation. While this represents a substantial improvement over proprietary programming environments, some obstacles to portability remain. One such obstacle is the organization of local work items. What is the appropriate local work group size for a given kernel? With current hardware, local work items roughly correspond to device threads. Hence, for GPUs, the rule of thumb is to start with a sufficiently high multiple of sixteen, e.g. 128 or 256. However, this heuristic does not guarantee a kernel will execute successfully, much less exhibit high performance. For example, the OpenCL implementation in Mac OS X imposes an upper limit on local group size of one for code to execute on a CPU. Also, while larger group sizes often lead to better performance, if the kernel is strongly constrained by either register or local memory usage, it may simply fail to execute on a GPU due to lack of resources.

3.1 Proposed Solution: Autotuning

Since OpenCL can execute on devices which differ so radically in architecture and computational capabilities, it is difficult to develop simple heuristics with strong performance guarantees. Hence, Maestro’s optimizations rely solely on empirical data, instead of any performance model or a priori knowledge. Maestro’s general strategy for all optimizations can be summarized by the following steps:
1. Estimate based on benchmarks
2. Collect empirical data from execution
3. Optimize based on results
4. While performance continues to improve, repeat steps 2–3

This strategy is used to optimize a variety of parameters including local work group size, data transfer size, and the division of work among multiple devices. However, these dynamic execution parameters are only one of the obstacles to true portability. Another obstacle is the choice of hardware-specific kernel optimizations. For instance, some kernel optimizations may result in excellent performance on a GPU, but reduce performance on a CPU. This remains an open problem. Since the solution will no doubt involve editing kernel source code, it is beyond the scope of Maestro.

4 Problem: Load Balancing

In order to effectively distribute a kernel among multiple OpenCL devices, a programmer must keep in mind, at a minimum, each device's relative performance on that kernel, the speed of the interconnection bus between the host processor and each device (which can be asymmetric), a strategy for distributing input data to devices, and a scheme for synchronizing devices and aggregating output data. Given that an application can have many kernels, which can vary significantly in performance characteristics (bandwidth bound, compute bound, etc.), it quickly becomes impractical to hand-tune load balancing for every task.

4.1 Proposed Solution: Benchmarks and Device Interrogation

At install time, Maestro uses benchmarks and the OpenCL device interrogation API to characterize a system. Peak FLOPS, device memory bandwidth, and bus bandwidth are measured using benchmarks based on the Scalable Heterogeneous Computing (SHOC) Benchmark Suite [9]. The results of these benchmarks serve as the basis, or initial estimation, for the optimization of the distribution of work among multiple devices. As a kernel is repeatedly executed, either in an application or Maestro’s offline tuning methods, Maestro continues to optimize the distribution of work. After each iteration, Maestro computes the average rate at which each device completes work items, and updates a running, weighted average. This rate is specific to each device and kernel combination, and is a practical way to measure many interacting factors for performance. We examine the convergence to an optimal distribution of work in Section 6.2.

Maestro: Data Orchestration and Tuning for OpenCL Devices
279

5 Problem: Suboptimal Use of Interconnection Bus

OpenCL devices are typically connected to the host processor via a relatively slow interconnection bus. With current hardware, this is normally the PCIe bus. Since the bandwidth of this bus is dramatically lower than a GPU's memory bandwidth, it introduces a nontrivial amount of overhead.

5.1 Proposed Solution: Multibuffering

In order to minimize this overhead, Maestro attempts to overlap computation and communication as much as possible. Maestro leverages and extends the traditional technique of double buffering (also known as ping-pong buffering).

Fig. 2. Double Buffering – This figure contrasts the difference between (a) the function offload model and (b) a very simple case of double buffering. Devices which can concurrently execute kernels and transfer data are able to hide some communication time with computation.

Figure 2 illustrates the difference between the function offload model and double-buffered execution. Maestro implements concurrent double buffering to multiple devices, including optimization of the data chunk size, which we term multibuffering. In Maestro's implementation of multibuffering, the initial data chunk size is set to the size that resulted in the maximum bus bandwidth measured by benchmarks at install time. Maestro then varies the chunk size and optimizes based on observed performance.

However, double buffering cannot be used in all cases. Some OpenCL platforms simply lack the support for concurrent data copy and execution. Furthermore, some algorithms are not practical for use with double buffering. Consider an algorithm which accesses input data randomly. A work item might require data at the end of an input buffer which has not yet been transferred to the accelerator, resulting in an error. In order to accommodate this class of algorithms, Maestro allows the programmer to place certain inputs in a universal buffer, which is copied to all devices before execution begins. While this does


limit the availability of some performance optimizations, it greatly expands the number of algorithms which can be supported by Maestro.

6 Results

6.1 Experimental Testbeds

OpenCL Limitations. Since OpenCL is still a nascent technology, early software implementations impose several restrictions on the composition of test platforms. First, it is not possible to test a system with GPUs from different vendors due to driver and operating system compatibility issues. Second, CPU support is not widely available. As such, we attempt to provide results from a comprehensive selection of devices, including platforms with homogeneous GPUs, heterogeneous GPUs, and with an OpenCL-supported CPU and GPU.

Host Configurations

– Krakow. Krakow is a dual-socket Nehalem-based system, with a total of eight cores running at 2.8 GHz with 24GB of RAM. Krakow also features an NVIDIA Tesla S1070, configured to use two Tesla T10 processors connected via a sixteen-lane PCIe 2.0 bus. Results are measured using NVIDIA's GPU Computing SDK version 3.0.
– Lens. Lens is a medium-sized cluster primarily used for data visualization and analysis. Its thirty-two nodes are connected via InfiniBand, with each node containing four AMD quad-core Barcelona processors with 64GB of RAM. Each node also has two GPUs: one NVIDIA Tesla C1060 and one NVIDIA GeForce 8800GTX, connected to the host processor over a PCIe 1.0 bus with sixteen active lanes. Lens runs Scientific Linux 5.0, and results were measured using NVIDIA's GPU Computing SDK, version 2.3.
– Lyon. Lyon is a dual-socket, single-core 2.0 GHz AMD Opteron 246 system with a sixteen-lane PCIe 1.0 bus and 4GB of RAM, housing an ATI Radeon HD 5870 GPU. It runs Ubuntu 9.04 and uses the ATI Stream SDK 2.0 with the Catalyst 9.12 Hotfix 8.682.2RC1 driver.

Graphics Processors

– NVIDIA G80 Series. The NVIDIA G80 architecture combined the vertex and pixel hardware pipelines of traditional graphics processors into a single category of cores, all of which could be tasked for general-purpose computation if desired. The NVIDIA 8800GTX has 128 processor cores split among sixteen multiprocessors. These cores run at 1.35 GHz, and are fed from 768MB of GDDR3 RAM through a 384-bit bus.
– NVIDIA GT200 Series. The NVIDIA Tesla C1060 graphics processor comprises thirty streaming multiprocessors, each of which contains eight stream processors, for a total of 240 processor cores clocked at 1.3 GHz. Each multiprocessor has 16KB of shared memory, which can be accessed as quickly as a register under certain access patterns. The Tesla C1060 has 4GB of global memory and supplementary cached constant and texture memory.


Table 1. Comparison of Graphics Processors

GPU              Peak FLOPS (GF)  Mem. Bandwidth (GB/s)  Processors (#)  Clock (MHz)  Memory (MB)
Tesla C1060/T10  933              102                    240             1300         4096
GeForce 8800GTX  518              86                     128             1350         768
Radeon HD5870    2720             153                    1600            850          1024

– ATI Evergreen Series. In ATI's “Terascale Graphics Engine” architecture, stream processors are divided into groups of eighty, which are collectively known as SIMD cores. Each SIMD core contains four texture units, an L1 cache, and has its own control logic. SIMD cores can communicate with each other via an on-chip global data share. We present results from the Radeon HD5870 (Cypress XT), which has 1600 cores.

6.2 Test Kernels

We have selected the following five test kernels to evaluate Maestro. These kernels range in both complexity and performance characteristics. In all results, the same kernel code is used on each platform, although the problem size is varied. As such, cross-machine results are not directly comparable, and are instead presented in normalized form.

– Vector Addition. The first test kernel is the simple addition of two one-dimensional vectors, C ← A + B. This kernel is very simple and strongly bandwidth bound. Both input and output vectors can be multibuffered.
– Synthetic FLOPS. The synthetic FLOPS kernel maintains the simplicity of vector addition, but adds in an extra constant, K: C ← A + B + K. K is computed using a sufficiently high number of floating point operations to make the kernel compute bound.
– Vector Outer Product. The vector outer product kernel, u ⊗ v, takes two input vectors of length n and m, and creates an output matrix of size n × m. The outer product reads little input data compared to the generated output, and does not support multibuffering on any input.
– Molecular Dynamics. The MD test kernel is a computation of the Lennard-Jones potential from molecular dynamics. It is a strongly compute bound, O(n²) algorithm, which must compare each pair of atoms to compute all contributions to the overall potential energy. It does not support multibuffering on all inputs.
– S3D. We also present results from the key portion of S3D's Getrates kernel. S3D is a computational chemistry application optimized for GPUs in our previous work [2]. This kernel is technically compute bound, but also consumes seven inputs, making it the most balanced of the test kernels.


Fig. 3. Autotuning the local work group size – This figure shows the performance of the MD kernel on various platforms at different local work group sizes, normalized to the performance at a group size of 16. Lower runtimes are better.

Local Tuning Results. Maestro's capability for autotuning the local work group size is shown using the MD kernel in Figure 3. All runtimes are shown in normalized fashion, in this case as a percentage of the runtime on each platform with a local work group size of 16 (the smallest allowable on several devices). The optimal local work group size is highlighted for each platform. Note the variability and unpredictability of performance due to the sometimes competing demands of register pressure, memory access patterns, and thread grouping. These results indicate that a programmer is unlikely to consistently determine an optimal work group size at development time. By using Maestro's autotuning capability, the developer can focus on writing the kernel code, not on the implications of the local work group size for correctness and performance portability.

Multibuffering Results. Maestro's effectiveness when overlapping computation with communication can be improved by using an optimal buffer chunk size. Figure 4 shows Maestro's ability to auto-select the best buffer size on each platform. We observe in the vector outer product kernel one common situation, where the largest buffer size performs best. Of course, the S3D kernel results show that this is not always the case; here, a smaller buffer size is generally better. However, note that on Krakow, with its two Tesla T10 GPUs, there is an asymmetry between the two GPUs, with one preferring larger and one preferring smaller buffer sizes. This result was unusual enough to merit several repeated experiments for verification. Again, this shows the unpredictability of performance, even with what appears to be consistent hardware, and highlights the need for autotuning.


Fig. 4. Autotuning the buffer chunk size – This figure shows the performance on the (a) vector outer product and (b) S3D kernels when splitting the problem into various size chunks and using multibuffering. Lower runtimes are better. Values are normalized to the runtime at the 256kB chunk size on each platform.

Multi-GPU Results. One of Maestro's strengths is its ability to automatically partition computation between multiple devices. To determine the proportion of work for each device, it initially uses an estimate based on benchmarks run at install time, but quickly iterates to an improved load distribution based on the measured imbalance for a specific kernel. Figure 5 shows the proportion of total time spent on each device for the S3D and MD kernels. Note that there is generally an initial load imbalance which can be significant, and that even well-balanced hardware is not immune. Maestro's ability to automatically detect and account for load imbalance makes efficient use of the resources available on any platform.

Combined Results. Maestro's autotuning has an offline and an online component. At install time, Maestro makes an initial guess for local work group size, buffering chunk size, and workload partitioning for all kernels based on values measured using benchmarks. However, Maestro can do much better, running an autotuning process to optimize all of these factors, often resulting in significant improvements. Figure 6 shows the results of Maestro's autotuning for specific kernels relative to its initial estimate for these parameters. In (a), the single-GPU results show the combined speedup from tuning the local work group size and applying double buffering with a tuned chunk size, with improvements of up to 1.60×. In (b), the multi-GPU results show the combined speedup from tuning the local work group size and applying a tuned workload partitioning, with speedups of up to 1.8×.

This autotuning can occur outside full application runs. Kernels of particular interest can be placed in a unit test and executed several times to provide Maestro with performance data (measured internally via OpenCL's event API) for coarse-grained adjustments. This step is not required, since the same optimizations can


Fig. 5. Autotuning the load balance – This figure shows the load imbalance on the (a) S3D and (b) MD kernels, both before and after tuning the work distribution for the specific kernel. Longer striped bars show a larger load imbalance.

Fig. 6. Combined autotuning results – (a) Shows the combined benefit of autotuning both the local work group size and the double buffering chunk size on a single GPU of the test platforms. (b) Shows the combined benefit of autotuning both the local work group size and the multi-GPU load imbalance using both devices (GPU+GPU or GPU+CPU) of the test platforms. Longer bars are better.

be performed online, but reduces the number of online kernel executions with suboptimal performance.

7 Related Work

An excellent overview of the history of GPGPU is given in [10]. Work in this area has primarily focused on case studies, which describe the process of accelerating applications or algorithms that require extremely


high performance [3,4,5,6,7,1,2]. These applications are typically modified to use graphics processors, the STI Cell, or field-programmable gate arrays (FPGAs). These studies serve as motivation for Maestro, as many of them help illustrate the aforementioned common problems. Autotuning on GPU-based systems is beginning to gain popularity. For example, Venkatasubramanian et al. have explored autotuning stencil kernels for multi-CPU and multi-GPU environments [11]. Maestro is distinguished from this work because it uses autotuning for the optimization of data transfers and execution parameters, rather than the kernel code itself.

8 Conclusions

In this paper, we have presented Maestro, a library for data orchestration and tuning on OpenCL devices. We have shown a number of ways in which achieving the best performance, and sometimes even correctness, is a daunting task for programmers. For example, we showed that the choice of a viable, let alone optimal, local work group size for OpenCL kernels cannot be accomplished with simple rules of thumb. We showed that multibuffering, a technique nontrivial to incorporate in OpenCL code, is further complicated by the problem- and device-specific nature of choosing an optimal buffer chunk size. And we showed that even in what appear to be well-balanced hardware configurations, load balancing between multiple GPUs can require careful division of the workload. Combined, this leads to an immense space of performance and correctness parameters. By not only supporting double buffering and problem partitioning for existing OpenCL kernels, but also applying autotuning techniques to find the high-performance areas of this parameter space with little developer effort, Maestro leads to improved performance, improved program portability, and improved programmer productivity.

Acknowledgements This manuscript has been authored by a contractor of the U.S. Government under Contract No. DE-AC05-00OR22725. Accordingly, the U.S. Government retains a non-exclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.

References

[1] Meredith, J.S., Alvarez, G., Maier, T.A., Schulthess, T.C., Vetter, J.S.: Accuracy and Performance of Graphics Processors: A Quantum Monte Carlo Application Case Study. Parallel Computing 35(3), 151–163 (2009)
[2] Spafford, K.L., Meredith, J.S., Vetter, J.S., Chen, J., Grout, R., Sankaran, R.: Accelerating S3D: A GPGPU Case Study. In: HeteroPar 2009: Proceedings of the Seventh International Workshop on Algorithms, Models, and Tools for Parallel Computing on Heterogeneous Platforms (2009)


[3] Rodrigues, C.I., Hardy, D.J., Stone, J.E., Schulten, K., Hwu, W.M.W.: GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications. In: CF 2008: Proceedings of the 2008 Conference on Computing Frontiers, pp. 273–282. ACM, New York (2008)
[4] He, B., Govindaraju, N.K., Luo, Q., Smith, B.: Efficient Gather and Scatter Operations on Graphics Processors. In: SC 2007: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pp. 1–12. ACM, New York (2007)
[5] Fujimoto, N.: Faster Matrix-Vector Multiplication on GeForce 8800GTX. In: IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2008, pp. 1–8 (April 2008)
[6] Bolz, J., Farmer, I., Grinspun, E., Schröder, P.: Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. In: ACM SIGGRAPH 2003, pp. 917–924. ACM, New York (2003)
[7] Stone, J.E., Phillips, J.C., Freddolino, P.L., Hardy, D.J., Trabuco, L.G., Schulten, K.: Accelerating Molecular Modeling Applications with Graphics Processors. Journal of Computational Chemistry 28, 2618–2640 (2005)
[8] The Khronos Group (2009), http://www.khronos.org/opencl/
[9] Danalis, A., Marin, G., McCurdy, C., Meredith, J., Roth, P., Spafford, K., Tipparaju, V., Vetter, J.: The Scalable Heterogeneous Computing (SHOC) Benchmark Suite. In: Proceedings of the Third Annual Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU 2010). ACM, New York (2010)
[10] Owens, J., Houston, M., Luebke, D., Green, S., Stone, J., Phillips, J.: GPU Computing. Proceedings of the IEEE 96(5), 879–899 (2008)
[11] Venkatasubramanian, S., Vuduc, R.W.: Tuned and Wildly Asynchronous Stencil Kernels for Hybrid CPU/GPU Systems. In: ICS 2009: Proceedings of the 23rd International Conference on Supercomputing, pp. 244–255. ACM, New York (2009)

Multithreaded Geant4: Semi-automatic Transformation into Scalable Thread-Parallel Software

Xin Dong1, Gene Cooperman1, and John Apostolakis2

1 College of Computer Science, Northeastern University, Boston, MA 02115, USA
{xindong,gene}@ccs.neu.edu
2 PH/SFT, CERN, CH-1211, Geneva 23, Switzerland
[email protected]

Abstract. This work presents an application case study. Geant4 is a 750,000-line toolkit first designed in the mid-1990s and originally intended only for sequential computation. Intel's promise of an 80-core CPU meant that Geant4 users would have to struggle in the future with 80 processes on one CPU chip, each one having a gigabyte memory footprint. Thread parallelism would be desirable. A semi-automatic methodology to parallelize the Geant4 code is presented in this work. Our experimental tests demonstrate linear speedup in a range from one thread to 24 on a 24-core computer. To achieve this performance, we needed to write a custom, thread-private memory allocator, and to detect and eliminate excessive cache misses. Without these improvements, there was almost no performance improvement when going beyond eight cores. Finally, in order to guarantee the run-time correctness of the transformed code, a dynamic method was developed to capture possible bugs and either immediately generate a fault, or optionally recover from the fault.

1 Introduction

The number of cores on a CPU chip is currently doubling every two years, in a manner consistent with Moore's Law. If sequential software has a working set that is larger than the CPU cache, then running a separate copy of the software for each core has the potential to present immense memory pressure on the bus to memory. It is doubtful that the memory pressure will continue to be manageable as the number of cores on a CPU chip continues to double.

This work presents an application case study concerned with just this issue, with respect to Geant4 (GEometry ANd Tracking, http://geant4.web.cern.ch/). Geant4 was developed over about 15 years by physicists around the world, using the Booch software engineering methodology. The widest use of Geant4 is for Monte Carlo simulation and analysis of experiments at the LHC collider in Geneva. In some of the larger experiments, such as CMS [1], software applications can grow to a two-gigabyte footprint that includes hundreds of dynamic libraries (.so files). In addition to collider experiments, Geant4 is used for radiation-based medical applications [2], for cosmic ray simulations [3], and for space and radiation simulations [4].

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 287–303, 2010.
© Springer-Verlag Berlin Heidelberg 2010

288

X. Dong, G. Cooperman, and J. Apostolakis

Geant4 is a package with 750,000 lines of C++ code spread over 6,000 files. It is a toolkit with deep knowledge of the physics of particle tracks. Given the geometry, the corresponding materials and the fundamental particles, a Geant4 simulation is driven by randomly generated independent events. Within a loop, each event is simulated in sequence. The corresponding computation for each event is organized into three levels: event generation and result aggregation; tracking in each event; and stepping on each track. Geant4 stepping is governed by physics processes, which specify particle and material interaction.

The Geant4 team of tens of physicists issues a new release every six months. Few of those physicists have experience writing thread-parallel code. Hence, a manual rewriting of the entire Geant4 code base for thread parallelism was not possible. Geant4 uses an event loop programming style that lends itself to a straightforward thread parallelization. The parallelization also required the addition of the C/C++ __thread keyword to most of the files in Geant4. As described in Section 3.1, an automatic way to add this thread parallelism was developed by modifying the GNU C++ parser. Section 3.2 then describes a further step to reduce the memory footprint. This thread-parallel implementation of Geant4 is known as Geant4MT.

However, this intermediate Geant4MT was found not to be scalable. When scaling to 24 cores, two important performance drains were found: memory allocation and writes to shared variables.

Custom memory allocator. First, none of the standard memory allocators scale properly when used in Geant4. Some of the allocators tried include the glibc default malloc (ptmalloc2) [5], tcmalloc [6], ptmalloc3 [5] and hoard [7]. This is because the malloc standard requires the use of a shared memory data structure so that any thread can free the memory allocated by any other thread. Yet most of the Geant4 allocations are thread-private.
The number of futex calls in Geant4MT provided the final evidence of the importance of a thread-private custom memory allocator. We observed the excessive number of futex calls (the Linux analog of mutex calls) to completely disappear after introducing our thread-private allocator.

Writes to shared variables. The second important drain occurs due to excessive writes to shared variables. This drain occurs even when the working set is small. Note that the drain on performance makes itself known as excessive cache misses when measuring performance using performance counters. However, this is completely misleading. The real issue is the particular cache misses caused by a write to a shared variable. Even if the shared variable write is a cache hit, all threads that include this shared variable in their active working set will eventually experience a read/write cache miss. This is because there are four CPU chips on the motherboard with no off-chip cache, in the high-performance machines on which we tested. So, a write by one of the threads forces the chip set logic to invalidate the corresponding cache lines of the other three CPU chips. Thus, a single write eventually forces three subsequent L3 cache misses, one miss in each of the other three chips. The need to understand this unexpected behavior was a major source of the delay in making Geant4 fully scalable. The interaction with the malloc issue

Multithreaded Geant4: Semi-automatic Transformation

289

above initially masked this second performance drain. It was only after solving the issue of malloc, and then building a mechanism to track down the shared variables most responsible for the cache misses, that we were able to confirm the above working hypothesis. The solution was then quite simple: eliminate unnecessary sharing of writable variables.

Interaction of memory allocator and shared writable variables. As a result of this work, we were able to conclude that the primary reason that the standard memory allocators suffered degraded performance was likely not the issue of excessive futex calls. Instead, we now argue that it was due to writes to shared variables of the allocator implementation. Our back-of-the-envelope calculations indicated that there were not enough futex calls to account for the excessive performance drain! We considered the use of four widely used memory allocators, along with our own customized allocator, for a malloc/free-intensive toy program. Surprisingly, we observed a parallel slowdown for each of the four memory allocators. When increasing the number of threads from 8 to 16 on a 16-core computer, the execution was found to become slower!

Reasoning for correctness of Geant4. The domain experts are deeply concerned about the correctness of Geant4MT. Yet it challenges existing formal methods. First, the ubiquitous callback mechanism and C++ virtual member functions defined in Geant4 resist static methods. What part of the code will be touched is determined dynamically at run time. Second, the memory footprint is huge for large Geant4 applications, rendering dynamic methods impractically slow. For example, Helgrind [8] makes the data initialization too slow to finish for a representative large Geant4 application. Because Geant4MT, like Geant4, is a toolkit with frequent callbacks to end-user code, we relax the correctness requirements.
It is not possible with today’s technology to fully verify Geant4MT in the context of arbitrary user callbacks. Hence, we content ourselves with enhancements to verify correctness of production runs. In particular, we enforce the design assumption that “shared application data is never changed when parallel computing happens”. A run-time tool is developed to verify this condition. This tool also allows the application to coordinate the threads so as to avoid data races when updates to shared variables occur unexpectedly. Experience with Geant4MT. Geant4MT represents a development effort of two and a half years. This effort has now yielded experimental results showing linear speedup both on a 24-core Intel computer (four Nehalem-class CPU 6core chips), and on a 16-core AMD computer (four Barcelona-class CPU 4-core chips). The methodology presented here is recommended because it compresses these two and a half years of work into a matter of three days for a new version of Geant4. By using the tools developed as part of this work, along with the deeper understanding of the systems issues, we estimate the time for a new project to be the same three days, plus the time to understand the structure of the new software and create an appropriate policy about what objects should be shared,


while respecting the original software design. The contributions of this work are four-fold. It provides:

1. a semi-automatic way to transform C++ code into working thread-parallel code;
2. a thread-private malloc library scalable for intensive concurrent heap accesses to transient objects;
3. the ability to automatically attribute frequent sources of cache misses to particular variables; and
4. a dynamic method to guarantee the run-time correctness of the thread-parallel program.

One additional novelty in Section 3.4 is an analytical formula that predicts the number of updates to shared variables by all threads, based on measurements of the number of cache misses. The number of shared variable updates is important, because it has been identified as one of two major sources of performance degradation. The experimental section (Section 4) confirms the high accuracy of this formula.

The rest of this paper is organized as follows. Section 2 introduces Geant4 along with some earlier work on parallelization for clusters. Section 3 explains our multithreading methodology and describes the implementation of multithreading tools. Section 4 evaluates the experimental results. We review related work in Section 5 and conclude in Section 6.

2 Geant4 and Parallelization

2.1 Geant4: Background

Detectors, as used to detect, track, and/or identify high-energy particles, are ubiquitous in experimental and applied particle physics, nuclear physics, and nuclear engineering. Modern detectors are also used as calorimeters to measure the energy of the detected radiation. As detectors become larger, more complex and more sensitive, commensurate computing capacity is increasingly demanded for large-scale, accurate and comprehensive simulations of detectors. This is because simulation results dominate the design of modern detectors. A similar requirement also exists in space science and nuclear medicine, where particlematter interaction plays a key role. Geant4 [9,10] responds to this challenge by implementing and providing a “diverse, wide-ranging, yet cohesive set of software components for a variety of settings” [9]. Beginning with the first production release in 1998, the Geant4 collaboration has continued to refine, improve and enhance the toolkit towards more sophisticated simulations. With abundant physics knowledge, the Geant4 toolkit has modelling support for such concepts as secondary particles and secondary tracks (for example, through radioactive decay), the effect of electromagnetic fields, unusual geometries and layouts of particle detectors, and aggregation of particle hits in detectors.

2.2 Prior Distributed Memory Parallelizations of Geant4

As a Monte Carlo simulation toolkit, Geant4 profits from improved throughput via parallelism derived from independence among modelled events and their computation. Therefore, researchers have adopted two methods for parallel simulation in the era of computer clusters. The first method is a standard parameter sweep: each node of the cluster runs a separate instance of Geant4 that is given a separate set of input events to compute. The second method is that of ParGeant4 [11]. ParGeant4 uses a master-worker style of parallel computing on distributed-memory multiprocessors, implemented on top of the open source TOP-C package (Task Oriented Parallel C/C++) [12]. Following master-worker parallelism, each event is dispatched to the next available worker, which leads to dynamic load-balancing on workers. Prior to dispatching the events, each worker does its own initialization of global data structures.

Since then, many-core computing has gained an increasing presence in the landscape. For example, Intel presented a roadmap including an 80-core CPU chip for the future. It was immediately clear that 80 Geant4-based processes, each with a footprint of more than a gigabyte, would never work, due to the large memory pressure on a single system bus to memory.

A potentially simple solution takes advantage of UNIX copy-on-write semantics to enhance the sharing of data further by forking the child processes after Geant4 initializes its data. However, the natural object-oriented style of programming in Geant4 encourages a memory layout in which all fields of an object are placed on the same memory page. If just one field of the object is written to, then the entire memory page containing that object will no longer be shared. Hence the copy-on-write approach with forked processes was rejected as insufficiently scalable.

3 Geant4MT Methodology and Tools

Geant4MT follows the same event-level parallelism as the prior distributed-memory parallelizations. The goal of Geant4MT is: given a computer with k cores, to replace k independent copies of the Geant4 process with an equivalent single process with k threads that uses the many-core machine in a memory-efficient, scalable manner. The corresponding methodology comprises a code transformation for thread safety (T1), a transformation for memory footprint reduction (T2), a thread-private malloc library, and a shared-update checker. The transformation work generates two techniques that are reused by later work: one is to move the data segment into thread-local storage (TLS [13]) for thread safety, which later produces the thread-private allocator; the other is to protect memory pages in order to detect write accesses, which evolves into the run-time tool for shared-update elimination, correctness verification and data race arbitration.

3.1 T1: Transformation for Thread Safety

The Transformation for Thread Safety transforms k independent copies of the Geant4 process into an equivalent single process with k threads. The goal is to create correct thread-safe code, without yet worrying about the memory footprint. This transformation includes two parts: global variable detection and global variable privatization. For C++ programs, we collect information from four kinds of declarations for global variables, which are the possible sources of data races: “static” declarations, global declarations, “extern” declarations, and “static const” declarations of pointers. The last case is special and rare: the pointers are no longer constants once each thread holds its own copy of the same object instance. A rigorous way to collect this information for all global variables is to patch the C++ parser to recognize them. In our case, we changed the GNU C++ parser source file parser.c (g++ version 4.2.2) to add output statements there. After the parser has been changed, we re-compile gcc-4.2.2 and use the patched compiler to build Geant4 once more. In this pass, all relevant declarations and their locations are collected as part of the build process. For each global variable declaration, we add the thread-local storage keyword __thread, as described by T1.1 in Table 1. After this step, the data segment is almost empty, with merely some const values left. As a result, at the binary level, each thread acts almost the same as the original process, since any variable that could have been shared has been declared thread-local. Naturally, the transformed code is thread-safe. The handling of thread-local pointers is somewhat more subtle. In addition to the term thread-local, we will often refer to thread-private data. A thread may create a new object stored in the heap. If the pointer to the new object is stored in TLS, then no other thread can access this new object. In this situation, we refer to both the new object and its thread-local pointer as being thread-private. However, only the pointer is stored in TLS as a thread-local variable.
The thread-local pointer serves as a remedy for TLS to support variables that are not plain old data (non-POD) and to implement dynamic initialization of TLS variables. To make non-POD or dynamically initialized variables thread-safe, we introduce a new pointer variable, which is plain old data (POD), and initialize it dynamically before the original variable is first used. The two examples T1.2 and T1.3 in Table 1 demonstrate the implementation. A tool based on an open source C++ parser, Elsa [14], has been developed to transform the global declarations collected by the patched parser. This tool works not only for Geant4 but also for CLHEP, a common library widely used by Geant4. The transformation T1 generates an unconditionally thread-safe Geant4. This version is called unconditional because any thread can call any function with no data race.

3.2 T2: Transformation for Memory Footprint Reduction

The Transformation for Memory Footprint Reduction allows threads to share data without violating thread safety. The goal is to determine the read-only variables and member fields and to remove the __thread keyword from them, so that they become shared. A member field may have been written to during its initialization,

Table 1. Representative Code Transformation for T1 and T2

T1.1:  int global = 0;
  −→   __thread int global = 0;

T1.2:  static nonPOD field;
  −→   static __thread nonPOD *newfield;
       if (!newfield)              // Patch before reference, as indicated
         newfield = new nonPOD;    //   by compilation errors
       #define field (*newfield)
       #define CLASS::field (*CLASS::newfield)

T1.3:  static int var1 = var2;
  −→   static __thread int *var1_NEW_PTR = 0;
       if (!var1_NEW_PTR) {
         var1_NEW_PTR = new int;
         *var1_NEW_PTR = var2;
       }
       int &var1 = *var1_NEW_PTR;

T2.1:  class volume {
         // large, relatively read-only member field
         RD_t RD;
         // small, transitory member field
         RDWR_t RDWR;
       };
       __thread vector store;
  −→   // dynamically extended by the main thread via the
       // constructor and replicated by worker threads
       1 __thread RDWR_t *RDWR_array;
       2 class volume
       3 { int instanceID;
       4   RD_t RD; };
       5 #define RDWR (RDWR_array[instanceID])
       6 vector store;

but may be read-only thereafter. The difficulty is to figure out, for each sharable class (as defined in the next paragraph), which member fields become read-only after the worker threads are spawned. Below, such a member field is referred to as relatively read-only. A sharable class is defined as a class that has many instances, most of whose member fields are relatively read-only. A sharable instance is an instance of a sharable class. We take the left part of T2.1 in Table 1 as an example. The class “volume” is a sharable class that contains some relatively read-only fields, whose cumulative size is large. The class “volume” also contains some read-write fields, whose cumulative size is small. However, it is not clear in advance which fields are relatively read-only and which are read-write. To recognize relatively read-only data, we place all sharable instances into a pre-allocated region of the heap by overloading the “new” and “delete” methods for sharable classes and by replacing the allocator for their containers. The substitute allocator likewise draws its memory from the pre-allocated region of sharable instances. Another auxiliary program, the tracer, is introduced to direct the execution of the Geant4 application. Our tracer tool controls the execution of the application using the ptrace system call, similarly to how “gdb” debugs a target process. In addition, the tracer tool catches segmentation fault signals in order to recognize member fields that are not relatively read-only. (Such data will be called transitory data below.) Figure 1 briefly illustrates the protocol between the tracer and the target application. First, the Geant4 application (the “inferior” process in the terminology of ptrace) sets up signal handlers and spawns the tracer tool.
Second, the tracer tool (the “superior” process in the terminology of ptrace) attaches and notifies the “inferior” to remove the “write” permission from the pre-allocated memory region. Third, the tracer tool intercepts and relays each segmentation fault signal to the “inferior” to re-enable the “write” permission for the pre-allocated


[Figure 1 is a state diagram of the protocol between the inferior and the superior (the tracer). Downward signals (SIGUSR1, SIGFAULT) are sent by “raise” or the OS; upward signals (ATTACH, CONT, DETACH) are sent via ptrace with PTRACE_CONT. The inferior’s SIGUSR1 handler removes the write permission for the protected memory, while its SIGFAULT handler re-enables it; after a violation, the inferior retries the violating instruction, and the tracer finally detaches.]

Fig. 1. Interaction between Inferior and Tracer

memory region, retry the instruction that caused the segmentation fault, and return to the second step. As the last step, the “inferior” actively re-enables the “write” permission for the pre-allocated memory region and tells the tracer tool to terminate, whereupon the tracer detaches. If a sharable class has any transitory member field, it is unsafe for threads to share whole instances of this class. Instead, threads share instances whose transitory member fields have been moved to the thread-private region. The implementation is described by the right part of T2.1 in Table 1, and its objective is illustrated by Figure 2. In this figure, two threads share three instances of the sharable class “volume” in the heap. First, we remove the transitory member field RW-Field from the class “volume”. Then, we add a new field as an instance ID, declared on line 3 of T2.1. In Figure 2, the ID of each shared instance is 0, 1 or 2. Each thread uses a TLS pointer to an array of fields of type RW-Field, indexed by the instance ID. The RW-Field array is declared on line 1, while the RW-Field reference is redefined by the macro on line 5. As we can see from Figure 2, when worker thread 1 accesses the RW-Field of instance 0, it follows its TLS pointer and assigns 1 to the RW-Field. Similarly, worker thread 2 follows its TLS pointer and assigns 8 to the RW-Field. Thus the two threads access the RW-Field of the same instance, but touch different memory locations. Following this implementation pattern, T2 transforms the unconditionally thread-safe version of Geant4 further, to share the detector data. For the T2 transformation, it is important to recognize in advance all transitory member fields (fields that continue to be written to). As a side effect, this leads to larger read-only and read-write memory chunks in the logical address space. Furthermore, this helps to reduce the total physical memory consumption even for process parallelism, by taking advantage of copy-on-write technology. Internal experiments show that threads always perform better once this performance bottleneck is eliminated.


[Figure 2 depicts the Geant4MT data model. The central heap holds the shared, relatively read-only data: the sharable instances (instance IDs 0, 1, 2), write-protected for the T2 transformation, along with the text (code) segment and the static/global variables. Each worker thread has its own stack, a thread-private heap for transient data, and thread-local storage holding the RW-Field pointer. Worker 1 and worker 2 each index a private RW-Field array by instance ID (e.g., worker 1 holds 1, 5, 3 and worker 2 holds 8, 2, 6 for instances 0, 1, 2), so writes to the “same” field of a shared instance touch different memory locations.]

Fig. 2. Geant4MT Data Model

3.3 Custom Scalable Malloc Library

The parallel slowdown with the glibc default malloc library is reproducible through a toy program in which multiple threads work cooperatively on a fixed pool of tasks. The task for the toy program is to allocate 4,000 chunks of size 4 KB and then free them. As the number of threads increases, the wall-clock time increases, even though the load per thread decreases. An obvious solution to this problem would be to modify the Geant4MT source code directly: one can pre-allocate storage for “hot”, or frequently used, classes. The pre-allocated memory is then used instead of dynamic allocation in the central heap. This approach works only if there is a small upper bound on the number of object instances of each “hot” class. Further, the developer time needed to overload the “new” and “delete” methods, and possibly to define specialized allocators for different C++ STL containers, is unacceptable. Hence, this method is impractical except for software of medium size. The preferred solution is a thread-private malloc library. We call our implementation tpmalloc. This thread-private malloc uses a separate malloc arena for each thread, under the assumption that if a thread allocates memory, then the same thread will free it. A thread-local global variable is therefore provided, so that the modified behavior can be turned on or off on a per-thread basis. Such a custom allocator can be obtained by applying T1 to any existing malloc library. In our case, we modify the original malloc library from glibc to create the thread-private malloc arenas. In addition, we pre-initialize a memory region for each worker thread and force it to use a thread-private top chunk in the heap. In portions of the code where we know that a thread executing that code will not need to share a central heap region, we turn on the thread-local global variable to use a thread-private malloc arena. As Figure 2 shows, this allows Geant4MT to keep both transient objects and transitory data in a thread-private heap region. Consequently, the original lock associated with each arena in the glibc malloc library is no longer used by the custom allocator.

3.4 Detection for Shared-Update and Run-Time Correctness

The Geant4MT methodology makes worker threads share read-only data, while other data is stored as thread-local or thread-private. This avoids the need for thread synchronization. If the multithreaded application still slows down with additional cache misses, the most likely reason is a violation of the assumption that all shared variables are read-only. Using a machine with 4 Intel Xeon 7400 Dunnington CPUs (3 levels of cache and 24 cores in total), we estimate via a simple model the number of additional write cache misses generated by updates to shared variables; this number would otherwise be difficult to measure. Suppose there are w ≥ 2 worker threads that update the same variable, and let the total number of write accesses be n. In the model, write accesses are assumed to be evenly distributed, and optimal load balancing for thread parallelism is assumed, so that each thread performs n/w updates. Consider any update access u1 and its predecessor u0 in the same thread. Access u1 incurs a cache miss whenever the variable is changed by another thread between the times of u0 and u1. It is easy to see that any given write access from another thread falls between u0 and u1 with probability w/n. Therefore, the probability that no update happens during this period is (1 − w/n)^(n − n/w) ≈ e^(1−w), and a cache miss happens at u1 with probability 1 − e^(1−w). For the L1 cache, each core has its own cache, and it suffices to set w to the number of cores. For the L3 cache, there is a single cache per chip. Hence, for the L3 cache on a machine with multiple chips, one can consider the threads on one chip as a single large super-thread and apply the formula with the number of threads w set to the number of super-threads, i.e., the number of chips. Similar ideas can be used for analyzing the L2 cache.
The performance bottleneck from updates to shared variables is subtle enough to make testing based on inspection of the source code impractical for a software package as large as Geant4. Fortunately, we had already developed a tracer tool for Transformation T2 (see Section 3.2, along with Figure 1). An enhancement of the tracer tool can track down this bottleneck. The method is to remove write permission from all shared memory regions and to detect write accesses to them. As seen in Figure 2, all read-only regions are write-protected. Under such circumstances, the “tracer” catches a segmentation fault, allowing it to recognize when shared data has been changed. This method exposed some frequently used C++ expressions in the code that are unscalable. An expression is scalable if, given a load of n executions of the expression, a parallel execution with k threads on k cores spends 1/k of the time of a single thread with the same load. Some of the unscalable C++ expressions found in the code by this methodology are listed as follows:
1. cout.precision(*); Shared updates to the precision field occur, even in the absence of output via cout.
2. str = ””; All empty strings refer to the same static value through a reference count. This assignment changes the reference count.


3. std::ostringstream os; The default constructor of std::ostringstream takes a static instance of the locale class, which changes the reference count of this instance.
Whenever updates to shared variables occur intensively, the tracer tool can be employed to determine all instructions that modify shared variables. Note that since all workers execute the same code, if one worker thread modifies a shared variable, then all worker threads modify that shared variable. This creates a classic “ping-pong” situation for the cache line containing that variable, and hence unscalable code. The most frequent such occurrences are the obvious suspects for performance bottlenecks. The relevant code is then analyzed and replaced with a thread-private expression where possible. The same tracer tool is sensitive to violations of the T2 read-only assumption, so it also works in the production phase to guarantee the run-time correctness of Geant4MT applications. For this purpose, the tracer tool is enhanced further with policies, and corresponding mechanisms, for coordinating shared-variable updates. Based on memory protection, the tracer decides whether the shared data has ever been changed by Geant4MT. It serves as a dynamic verifier guaranteeing that a production run is correct. If no segmentation fault happens in the production phase, the computation is correct and the results are valid. When a segmentation fault is captured, one simply aborts that event; the results from all previous events remain valid. The tracer tool has zero run-time overhead in the preceding events when the T2 read-only assumption is not violated. A more sophisticated policy is to use a recovery strategy. The tracer tool suspends each thread that tries to write to the shared memory region. Meanwhile, all remaining threads finish their current events and arrive at a quiescent state. The quiescent threads then wait upon the suspended threads.
The tracer tool first picks a suspended thread to resume and finish its current event; all suspended threads then redo their current events in sequence. This portion of the computation experiences a slowdown due to the serialization, but the computation can continue without aborting. To accomplish this recovery strategy, a small modification of the Geant4MT source code is needed: the Geant4MT application must send a signal to the tracer tool before and after each Geant4 event. When violations of the shared read-only data are rare, this policy has minimal effect on performance.

4 Experimental Results

Geant4MT was tested using an example based on FullCMS. FullCMS is a simplified version of the actual code used by the CMS experiment at CERN [1]. This example was run on a machine with 4 AMD Opteron 8346 HE processors and a total of 16 cores working at 1.8 GHz. The cache hierarchy of this CPU consists of a 128 KB L1 cache and a 512 KB (non-inclusive) L2 cache per core, and a 2 MB L3 cache shared by the 4 cores on the same chip. The cache line size is 64 bytes. The kernel on this machine is Linux 2.6.31 and the compiler is gcc 4.3.4 with the “-O2” option, following the Geant4 default.


Removal of futex delays from Geant4MT. The first experiment, whose results are reported in the left part of Table 2, reveals the bottleneck in the ptmalloc2 library (the glibc default). The total number of Geant4 events is always 4800. Each worker holds around 20 MB of thread-private data and 200 MB of shared data. The shared data is initialized by the master thread prior to spawning the worker threads. Table 2 reports the wall-clock time for the simulation. From this table, we see the degradation of Geant4MT, accompanied by a tremendously increasing number of futex system calls.

Table 2. Malloc issue for FullCMS/Geant4MT: 4 AMD Opteron 8346 HE (4×4 cores) vs. 4 Intel Xeon 7400 Dunnington (4×6 cores). Time is in seconds.

4 AMD Opteron 8346 HE CPUs
                ptmalloc2                        tpmalloc
Workers   Time    Speedup   Futex & Time   Time    Speedup
1         10349    1          0, 0         10285    1
4          2650    3.91       2.4K, 0.04    2654    3.87
8          1406    7.36       38K, 0.4      1355    7.59
16          804   12.87       24M, 244       736   13.98

4 Intel Xeon 7400 Dunnington CPUs
                ptmalloc2                        tpmalloc
Workers   Time    Speedup   Futex & Time   Time    Speedup
1          6843    1          0, 0          6571    1
6          1498    4.57       13K, 0.3      1223    5.37
12         1050    6.51       24M, 266       824    7.97
24          654   10.3        66M, 1281      496   13.25

After replacing the glibc default ptmalloc2 with our tpmalloc library in Geant4MT, the calls to futex disappeared completely. (This is because the tpmalloc implementation with thread-private malloc arenas is lock-free.) As expected, the speedup increased, as seen in Table 2. However, that speedup was still less than linear, because another performance bottleneck remained in this intermediate version: writes to shared variables. Similar results were observed on a second computer with 4 Intel Xeon 7400 Dunnington CPUs (24 cores in total), running at 2.66 GHz. This CPU has three levels of cache. Each core has 64 KB of L1 cache. There are three L2 caches of 3 MB each, with each L2 cache shared between two cores. Finally, there is a single 16 MB L3 cache, shared among all cores on the same chip. The cache line size is 64 bytes at all three levels. The kernel on this machine is Linux 2.6.29 and the compiler is gcc 4.3.3 with the “-O2” option, just as in the experiment on the AMD computer. The load remained 4800 Geant4 events in total. The wall-clock times are presented in the right part of Table 2. The table shows that the malloc libraries do not scale as well on this Intel machine as on the AMD machine. This may be because the Intel machine runs at a higher clock rate. On the Intel machine, just 12 threads produce a number of futex calls similar to that produced by 16 threads on the AMD computer. Analysis of competing effects of futexes versus cache misses. The performance degradation due to writes to shared variables tends to mask the distinctions among different malloc libraries. Without that degradation, one expects still greater improvements from tpmalloc on both platforms. It was the experimental results of Table 2 that allowed us to realize the existence of this


one additional performance degradation. Even with the use of tpmalloc, we still experienced only a 13-fold speedup with 24 threads. Yet tpmalloc had eliminated the extraordinary number of futexes, and we no longer observed increasing system time (time spent in the kernel) for more threads. Hence, we eliminated operating-system overhead as a cause of the slowdown. Having removed futexes as a potential bottleneck, we turned to cache misses. We measured the number of cache misses for the three cache levels. The results are listed in Table 3 under the heading “before removal”. The number of L3 cache misses was observed to increase excessively. As described in the introduction, this was later ascribed to writes to shared variables. Upon discovering this, some writes to shared variables were removed. The heading “after removal” refers to the measurements after making thread-private some frequently used variables with excessive writes.

Table 3. Shared-update Elimination on 4 Intel Xeon 7400 Dunnington (4×6 cores)

                    Non-dominating statistics                      Before removal     After removal
Workers  # Instructions  L1-D Misses  L2 Misses  L3 References   L3 Misses  Cycles   L3 Misses  Time   Speedup
1          1,598G          24493M       402M       87415M          293M     1945G      308M     6547s    1
6          1,598G          24739M       630M       87878M          326M     2100G      302M     1087s    6.02
12         1,598G          24742M       634M       88713M          456M     3007G      302M      543s   12.06
24         1,599G          24827M       612M       88852M          517M     3706G      294M      271s   24.16

Estimating the number of writes to shared variables. It remains to determine experimentally whether the remaining performance degradation is due to updates to shared variables, or to other causes. For example, if the working set were larger than the available cache, that would create a different mechanism for performance degradation. If updates to shared variables are the primary cause of cache misses, then the analytical formula of Section 3.4 can be relied upon to correctly predict the number of updates to shared variables. The formula is used in two ways.
1. The formula is used to confirm that most of the remaining cache misses are due only to updates to shared variables. This is done by considering five different cases from the measured data: L2/6 (L2 cache misses with 6 threads), L2/12, L2/24, L3/12 and L3/24, and showing that all of the legal combinations predict approximately the same number of shared-variable updates.
2. Given the predicted number of shared-variable updates, one determines whether that number is excessive. When can we stop looking for shared variables that are frequently updated? Answer: when the number of shared-variable updates is sufficiently small.
Some data from Table 3 can be used to predict the number of shared-variable updates, following the first usage of the formula. The results are listed in Table 4. This table provides experimental validation that the predicted number of shared-variable updates can be trusted.


Table 4. Prediction for Shared Variable Updates on 4 Intel Xeon 7400 Dunnington (4×6 cores)

Level of  Number of   Cache Misses,    Cache Misses,      Additional    Threads or          Prediction for Shared
Cache     Workers     Single Thread    Multiple Threads   Misses        Super-threads (w)   Variable Updates (n)
L2        6           402 × 10^6       630 × 10^6         228 × 10^6    3                   ≈ 263 × 10^6
L2        12          402 × 10^6       634 × 10^6         232 × 10^6    6                   ≈ 230 × 10^6
L2        24          402 × 10^6       612 × 10^6         210 × 10^6    12                  ≈ 210 × 10^6
L3        12          293 × 10^6       456 × 10^6         163 × 10^6    2                   ≈ 258 × 10^6
L3        24          293 × 10^6       517 × 10^6         224 × 10^6    4                   ≈ 236 × 10^6

Effect of removing writes to shared memory. Table 3 shows that Geant4MT scales linearly to 24 threads after most updates to shared variables are removed. According to our estimate in Table 4, about 260 × 10^6 updates to shared variables have been removed. As seen earlier, with no optimizations, only a 10.3-fold speedup on 24 cores was obtained. Using tpmalloc, a 13.25-fold speedup was obtained, as seen in Table 2. With tpmalloc and removal of excessive writes to shared variables, a 24.16-fold speedup on 24 cores was obtained, as seen in Table 3. In fact, this represents a slight superlinear speedup (a 1% improvement). We believe this is due to reduced L3 cache misses from the sharing of cached data among distinct threads.

Results for different allocators with few writes to shared memory. The last experiment compared different malloc implementations using Geant4MT, after having eliminated the previous performance bottleneck (shared-memory updates). Results from this experiment are listed in Table 5. First, ptmalloc3 is worse than ptmalloc2 for Geant4MT; this may account for why glibc has not adopted ptmalloc3 as the default malloc library. Second, tcmalloc is excellent for 8 threads and is generally better than hoard, although hoard has better speedup in some cases. Our custom thread-private tpmalloc does not show any degradation, so Geant4MT with tpmalloc achieves a better speedup. This demonstrates the strong potential of Geant4MT for still larger scalability on future many-core machines.

Table 5. Malloc Library Comparison Using Geant4 on 4 AMD Opteron 8346 HE (4×4 cores)

          ptmalloc2        ptmalloc3        hoard            tcmalloc         tpmalloc
Workers   Time   Speedup   Time   Speedup   Time   Speedup   Time   Speedup   Time   Speedup
1         9923s    1      10601s    1      10503s    1       9918s    1      10090s    1
2         4886s    2.03    6397s    1.66    6316s    1.66    4980s    1.99    5024s    2.01
4         2377s    4.17    4108s    2.58    2685s    3.91    2564s    3.87    2504s    4.03
8         1264s    7.85    2345s    4.52    1321s    7.95    1184s    8.37    1248s    8.08
16         797s   12.46    1377s    7.70     691s   15.20     660s   15.02     623s   16.20

5 Related Work

Examples of multithreading work for linear algebra algorithms are: PLASMA [15], which addresses the scheduling framework; and PLUTO [16] and its variant [17], which address compiler-assisted parallelization, optimization and scheduling. While compiler-based methodologies are well suited to the data parallelism found in tractable loop nests, new approaches are necessary for other applications, e.g., commutativity analysis [18] to automatically extract parallelism from utilities such as gzip, bzip2, cjpeg, etc. The Geant4MT T1 transformation is similar to well-known approaches such as the private data clause in OpenMP [19] and Cilk [20], and the SUIF [21] privatizable directive supplied by either the programmer or the compiler. Nevertheless, T1 pursues thread safety in Geant4, with its large C/C++ code base containing many virtual and callback functions: a context that would overwhelm both the programming and the compile-time analysis. The Geant4MT T2 transformation applies task-oriented parallelism (one event is a task) and gains some data parallelism by sharing relatively read-only data and replicating transitory data. While this transformation benefits from existing parallel programming tools, it raises thread-safety issues for the generated code. Besides our own tool for run-time correctness, other approaches are also available. Some tools for static data race detection are: SharC [22], which checks data-sharing strategies for multi-threaded C by declaring and checking the data-sharing strategy; RELAY [23], which has been used to detect data races in the Linux kernel; RacerX [24], which finds serious errors in Linux and FreeBSD; and KISS [25], which obtained promising initial results on Windows device drivers. All these methods are unsound. Two sound static data race detection tools are LOCKSMITH [26], which found several races in Linux device drivers, and the method of Henzinger et al. [27], which is based on model checking.
Sound methods must check a large state space and may fail to complete due to resource exhaustion. All these methods, even with their limitations, are crucial for system software (e.g., an operating system), which requires strict correctness and is intended to run forever. In contrast, this work addresses run-time correctness for application software that already runs correctly in the sequential case.

6 Conclusion

Multithreaded software will benefit even more from future many-core computers. However, efficient thread parallelization is difficult when confronted by two real-world facts. First, software is continually under active development, with new releases appearing periodically to eliminate bugs, enhance functionality, or port to additional platforms. Second, the parallelization expert and the domain expert often have only a limited understanding of each other's job. The methodology presented here encourages black-box transformations, so that the jobs of the parallelization expert and the domain expert can proceed in a largely independent manner. The tools in this work not only enable highly scalable thread parallelism, but also provide a solution of wide applicability for efficient thread parallelization.


Acknowledgement. We gratefully acknowledge the use of the 24-core Intel computer at CERN for testing. We also gratefully acknowledge helpful discussions with Vincenzo Innocente, Sverre Jarp and Andrzej Nowak. The many performance tests run by Andrzej Nowak on our modified software were especially helpful in gaining insight into the sources of the performance degradation.

References
1. CMS, http://cms.web.cern.ch/cms/
2. Arce, P., Lagares, J.I., Perez-Astudillo, D., Apostolakis, J., Cosmo, G.: Optimization of an External Beam Radiotherapy Treatment Using GAMOS/Geant4. In: World Congress on Medical Physics and Biomedical Engineering, vol. 25(1), pp. 794–797. Springer, Heidelberg (2009)
3. Hohlmann, M., Ford, P., Gnanvo, K., Helsby, J., Pena, D., Hoch, R., Mitra, D.: GEANT4 Simulation of a Cosmic Ray Muon Tomography System With Micro-Pattern Gas Detectors for the Detection of High-Z Materials. IEEE Transactions on Nuclear Science 56(3-2), 1356–1363 (2009)
4. Godet, O., Sizun, P., Barret, D., Mandrou, P., Cordier, B., Schanne, S., Remoué, N.: Monte-Carlo simulations of the background of the coded-mask camera for X- and Gamma-rays on-board the Chinese-French GRB mission SVOM. Nuclear Instruments and Methods in Physics Research Section A 603(3), 365–371 (2009)
5. malloc, http://www.malloc.de/en/
6. TCMalloc, http://goog-perftools.sourceforge.net/doc/tcmalloc.html
7. The Hoard Memory Allocator, http://www.hoard.org/
8. Instrumentation Framework for Building Dynamic Analysis Tools, http://valgrind.org/
9. Agostinelli, S., et al.: GEANT4 – a simulation toolkit. Nuclear Instruments and Methods in Physics Research Section A 506(3), 250–303 (2003) (over 100 authors, including J. Apostolakis and G. Cooperman)
10. Allison, J., et al.: Geant4 Developments and Applications. IEEE Transactions on Nuclear Science 53(1), 270–278 (2006) (73 authors, including J. Apostolakis and G. Cooperman)
11. Cooperman, G., Nguyen, V., Malioutov, I.: Parallelization of Geant4 Using TOP-C and Marshalgen. In: IEEE NCA 2006, pp. 48–55 (2006)
12. TOP-C, http://www.ccs.neu.edu/home/gene/topc.html
13. Thread-Local Storage, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n1966.html
14. Elsa: An Elkhound-based C++ Parser, http://www.cs.berkeley.edu/~smcpeak/elkhound/
15. Parallel Linear Algebra For Scalable Multi-core Architecture, http://icl.cs.utk.edu/plasma/
16. Bondhugula, U., Hartono, A., Ramanujam, J.: A Practical Automatic Polyhedral Parallelizer and Locality Optimizer. In: PLDI 2008, vol. 43(6), pp. 101–113 (2008)
17. Baskaran, M.M., Vydyanathan, N., Bondhugula, U.K.R., Ramanujam, J., Rountev, A., Sadayappan, P.: Compiler-Assisted Dynamic Scheduling for Effective Parallelization of Loop Nests on Multicore Processors. In: PPoPP 2009, pp. 219–228 (2009)

Multithreaded Geant4: Semi-automatic Transformation


18. Aleen, F., Clark, N.: Commutativity Analysis for Software Parallelization: Letting Program Transformations See the Big Picture. In: ASPLOS 2009, vol. 44(3), pp. 241–252 (2009)
19. OpenMP, http://openmp.org/wp/
20. Cilk, http://www.cilk.com/
21. SUIF, http://suif.stanford.edu/
22. Anderson, Z., Gay, D., Ennals, R., Brewer, E.: SharC: Checking Data Sharing Strategies for Multithreaded C. In: PLDI 2008, vol. 43(6), pp. 149–158 (2008)
23. Voung, J.W., Jhala, R., Lerner, S.: RELAY: Static Race Detection on Millions of Lines of Code. In: ESEC-FSE 2007, pp. 205–214 (2007)
24. Engler, D., Ashcraft, K.: RacerX: Effective, Static Detection of Race Conditions and Deadlocks. In: SOSP 2003, vol. 37(5), pp. 237–252 (2003)
25. Qadeer, S., Wu, D.: KISS: Keep It Simple and Sequential. In: PLDI 2004, pp. 149–158 (2004)
26. Pratikakis, P., Foster, J.S., Hicks, M.: LOCKSMITH: Context-Sensitive Correlation Analysis for Race Detection. In: PLDI 2006, vol. 41(6), pp. 320–331 (2006)
27. Henzinger, T.A., Jhala, R., Majumdar, R.: Race Checking by Context Inference. In: PLDI 2004, pp. 1–13 (2004)

Parallel Exact Time Series Motif Discovery

Ankur Narang and Souvik Bhattacherjee

IBM India Research Laboratory, New Delhi
{annarang,souvbhat}@in.ibm.com

Abstract. Time series motifs are an integral part of diverse data mining applications including classification, summarization and near-duplicate detection. They are used across a wide variety of domains such as image processing, bioinformatics, medicine, extreme weather prediction, the analysis of web logs and customer shopping sequences, the study of XML query access patterns, electroencephalograph interpretation and entomological telemetry data mining. Exact motif discovery in soft real-time over 100K time series is a challenging problem. We present novel parallel algorithms for soft real-time exact motif discovery on multi-core architectures. Experimental results on a large-scale Power6 SMP system, using real-life and synthetic time series data, demonstrate the scalability of our algorithms and their ability to discover motifs in soft real-time. To the best of our knowledge, this is the first such work on parallel scalable soft real-time exact motif discovery.

Keywords: exact motif discovery, parallel algorithm, multi-core multithreaded architectures.

1 Introduction

Time series motifs are pairs of individual time series, or subsequences of a longer time series, which are very similar to each other. Since time series motifs were formalized in 2002, dozens of researchers have used them in domains as diverse as medicine, biology, telemedicine, entertainment and severe weather prediction. Further, domains such as financial market analysis [5], sensor networks and disaster management require real-time motif discovery over massive amounts of data, possibly in an online fashion. The intuitive algorithm for computing motifs is quadratic in the number of individual time series (or the length of the single time series from which subsequences are extracted). Thus, for massive time series it is hard to obtain exact time series motifs in a realistic timeframe. More than a dozen approximate algorithms to discover motifs have been proposed [1] [2] [6] [7] [10]. The sequential time complexity of most of these algorithms is O(m) or O(m log m), where m is the number of time series, but the associated constant factors are high. [8] presents a tractable exact algorithm to find time series motifs. This exact algorithm is worst-case quadratic, but it can reduce the time required by three orders of magnitude. This algorithm enables tackling problems previously thought intractable, for example automatically constructing dictionaries of recurring patterns from electroencephalographs. Further, this algorithm is fast enough to be used as a subroutine

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 304–315, 2010. © Springer-Verlag Berlin Heidelberg 2010


in higher-level data mining algorithms for summarization, near-duplicate detection and anytime classification [12]. However, the sequential MK algorithm [8] cannot deliver soft real-time exact motif discovery for a large number (100K or more) of time series. Hence, there is an imperative need to explore parallel scalable algorithms for exact motif discovery. Emerging and next-generation multi-core architectures have a large number of hardware threads and support high-throughput computing. A scalable parallel algorithm for exact motif discovery on multi-core architectures needs a fine-grained cache-aware multithreaded design along with optimizations for load balancing. Soft real-time scalable performance on large multi-cores is a challenging problem because cache locality and dynamic load balancing must be considered simultaneously. In this paper, we present the design of parallel multi-threaded algorithms for exact motif discovery, along with novel optimizations for cache performance and dynamic load balancing. Our algorithm operates essentially in-memory. Experimental results on real EEG data and random walk data show that our algorithms scale well and discover motifs in soft real-time. We make the following contributions:
– We present the design of parallel algorithms for exact discovery of time series motifs. To achieve high scalability and performance on multi-core architectures, we use novel cache locality and dynamic load balancing optimizations.
– We demonstrate soft real-time parallel motif discovery for realistic (EEG) and randomly generated time series on a large-scale Power6-based SMP system.

2 Background and Notation

Definition 1. Time Series: A Time Series is a sequence T = (t1, t2, ..., tn), an ordered set of n real-valued numbers. The ordering is typically temporal; however, other kinds of data such as color distributions, shapes and spectrographs also have a well-defined ordering and can fruitfully be considered time series for the purpose of indexing and mining. For simplicity and without loss of generality we consider only equispaced data. In general, we may have many time series to consider and thus need to define a time series database.

Definition 2. Time Series Database: A Time Series Database (D) is an unordered set of m time series, possibly of different lengths. For simplicity, we assume that all the time series are of the same length and that D fits in main memory. Thus, D is a matrix of real numbers, where Di is the ith row in D as well as the ith time series Ti in the database, and Di,j is the value at time j of Ti.

Definition 3. Time Series Motif: The Time Series Motif of a time series database D is the unordered pair of time series {Ti, Tj} in D which is the most similar among all possible pairs. More formally, ∀a, b, i, j the pair {Ti, Tj} is the motif iff dist(Ti, Tj) ≤ dist(Ta, Tb), i ≠ j and a ≠ b.
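Definition 3 corresponds directly to a quadratic scan over all unordered pairs. A minimal Python sketch for illustration (the Euclidean distance function and the toy data are assumptions for the example; they are not prescribed by the definition):

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two equal-length time series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_motif(D):
    """Definition 3 verbatim: the motif is the most similar unordered
    pair {Ti, Tj}, i != j, found by an O(m^2) scan over all pairs."""
    best = (float("inf"), None)
    for i, j in combinations(range(len(D)), 2):
        d = euclidean(D[i], D[j])
        if d < best[0]:
            best = (d, (i, j))
    return best  # (motif distance, (i, j))
```

For m time series this performs m(m-1)/2 distance computations, which is exactly the cost that the reference-point heuristic described below attacks.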


We can generalize the notion of motifs by considering a motif ranking notion. More formally:

Definition 4. kth Time Series Motif: The kth Time Series Motif is the kth most similar pair in the database D. The pair {Ti, Tj} is the kth motif iff there exists a set S of pairs of time series of size exactly k−1 such that ∀Td ∈ D, {Ti, Td} ∉ S and {Tj, Td} ∉ S, and ∀{Tx, Ty} ∈ S, {Ta, Tb} ∉ S: dist(Tx, Ty) ≤ dist(Ti, Tj) ≤ dist(Ta, Tb).

These ideas can be extended to subsequences of a very long time series by treating every subsequence of length n (n ≪ m) as an object in the time series database. Motifs in such a database are subsequences that are conserved locally in the long time series. More formally:

Definition 5. Subsequence: A subsequence of length n of a time series T = (t1, t2, ..., tm) is a time series Ti,n = (ti, ti+1, ..., ti+n−1) for 1 ≤ i ≤ m − n + 1.

Definition 6. Subsequence Motif: The Subsequence Motif is a pair of subsequences {Ti,n, Tj,n} of a long time series T that are most similar. More formally, ∀a, b, i, j the pair {Ti,n, Tj,n} is the subsequence motif iff dist(Ti,n, Tj,n) ≤ dist(Ta,n, Tb,n), |i − j| ≥ w and |a − b| ≥ w for w > 0. To avoid trivial motifs, w is typically chosen to be ≥ n/2.

[8] describes a fast sequential algorithm for exact motif discovery. This algorithm relies on the distances of all time series from a randomly chosen reference time series as the guiding heuristic to reduce the search space. A randomly chosen time series is taken as a reference. The distances of all other time series with respect to this reference are calculated and sorted. The key insight of this algorithm is that this linear ordering of the data provides useful heuristic information to guide the motif search. The observation is that if two objects are close in the original space, they must also be close in the linear ordering, but the reverse may not be true.
Two objects can be arbitrarily close in the linear ordering but very far apart in the original space. This observation helps to obtain a speedup over the usual brute-force algorithm. However, a further increase in speedup is obtained when the single reference point is extended to multiple reference points. This extended version is known as the MK algorithm [8]. For each reference time series in the MK algorithm (Fig. 1), the distances to all other time series are calculated. The reference time series with maximum standard deviation is used to order the time series by increasing distance. The differences between reference distances under this ordering are a good lower bound on the actual distances between the time series. At the top level, the algorithm iterates with increasing offset values, starting with offset = 1. In each iteration, it updates the best-so-far variable if a closer pair of time series is discovered.
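The pruning idea can be sketched in a few lines of Python. This is a simplified single-reference version for illustration, not the full multi-reference MK algorithm of [8]; the function names and toy data are assumptions. By the triangle inequality, |d(ref, a) − d(ref, b)| ≤ d(a, b), so the difference of sorted reference distances is a cheap lower bound:

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def motif_one_reference(D, seed=0):
    """Exact motif search pruned by one reference time series.
    The difference of reference distances lower-bounds the true
    distance, so most pairs are rejected without a full computation."""
    m = len(D)
    random.seed(seed)
    ref = D[random.randrange(m)]
    rd = [euclidean(ref, t) for t in D]            # reference distances
    order = sorted(range(m), key=lambda j: rd[j])  # the linear ordering
    best, motif = float("inf"), None
    offset = 1
    while offset < m:
        abandon = True
        for j in range(m - offset):
            a, b = order[j], order[j + offset]
            lb = rd[b] - rd[a]                     # lower bound, >= 0 (sorted)
            if lb < best:
                abandon = False
                d = euclidean(D[a], D[b])
                if d < best:
                    best, motif = d, (min(a, b), max(a, b))
        if abandon:   # lower bounds only grow with offset, so we can stop
            break
        offset += 1
    return best, motif
```

The result is exact regardless of which reference is chosen; the reference only influences how early pairs are rejected.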

3 Related Work

Most of the literature has focused on computationally fast approximate algorithms for motif discovery [1] [2] [4] [6] [7] [10]; however, the time series motif problem has been solved exactly only by the FLAME algorithm [11]. The FLAME algorithm works on


Algorithm MK Motif Discovery
Procedure [L1, L2] = MK_Motif(D, R)
1   best-so-far = INF
2   for i = 1 to R
3     ref_i = a randomly chosen time series D_r from D
4     for j = 1 to m
5       Dist_i,j = d(ref_i, D_j)
6       if Dist_i,j < best-so-far
7         best-so-far = Dist_i,j, L1 = ref_i, L2 = D_j
8     S_i = standard_deviation(Dist_i)
9   find an ordering Z of the indices of the reference time series such that S_Z(i) >= S_Z(i+1)
10  find an ordering I of the indices to the time series in D such that Dist_Z(1),I(j) <= Dist_Z(1),I(j+1)
11  offset = 0, abandon = false
12  while abandon = false
13    offset = offset + 1, abandon = true
14    for j = 1 to m
15      reject = false
16      for i = 1 to R
17        lower_bound = |Dist_Z(i),I(j) - Dist_Z(i),I(j+offset)|
18        if lower_bound > best-so-far
19          reject = true, break
20        else if i = 1
21          abandon = false
22      if reject = false
23        if d(D_I(j), D_I(j+offset)) < best-so-far
24          best-so-far = d(D_I(j), D_I(j+offset)), L1 = D_I(j), L2 = D_I(j+offset)

Fig. 1. The MK Motif Discovery algorithm

In the parallel algorithm, the sorted reference distance array is partitioned among the threads. Each thread maintains a local best so far[i] value and rejects a candidate pair whose lower bound (the current difference in reference distance) exceeds best so far[i] during the traversal of the sorted reference distance array. At the end of each iteration, the threads perform a reduction operation to update the overall best so far value and the corresponding pair of time series representing the motif pair. Further, at this point the threads check whether the termination criterion for the algorithm has been reached by all threads. If so, the program exits and the motif pair is reported.
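The array-parallel step can be sketched as follows, using Python threads purely for illustration (the paper's implementation uses POSIX Threads in C; the helper names and data are assumptions): each worker scans its own contiguous partition of the sorted reference-distance array with a private best value, and a final reduction merges the per-thread results.

```python
import math
import threading

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def par_mk_one_offset(D, order, rd, offset, num_threads=4):
    """One Par-MK iteration: each thread scans a partition of the
    sorted reference-distance array for the given offset, keeping a
    local best; a reduction at the end yields the global best pair."""
    m = len(D)
    local = [(float("inf"), None)] * num_threads

    def worker(t):
        lo = t * (m - offset) // num_threads
        hi = (t + 1) * (m - offset) // num_threads
        best, pair = local[t]
        for j in range(lo, hi):
            a, b = order[j], order[j + offset]
            if abs(rd[a] - rd[b]) < best:        # lower-bound test
                d = euclidean(D[a], D[b])
                if d < best:
                    best, pair = d, (a, b)
        local[t] = (best, pair)

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return min(local, key=lambda x: x[0])        # the reduction step
```

Each thread only ever writes its own slot of `local`, so no locking is needed until the reduction.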

[Figure: the sorted reference distance array is split into partitions A[1..4], A[5..8], A[9..12]. Left: single-dimension parallelism (array parallelism), where threads T1–T3 each scan one partition per offset. Right: two-dimension parallelism (array and offset parallelism), where each thread scans one partition for a group of offsets.]

Fig. 2. Parallelism in Exact Motif Discovery

When the second dimension of parallelism, i.e., offset parallelism, is also used, each iteration considers a group of offset values as opposed to a single offset per iteration. Each thread traverses its partition for all the offset values assigned to the current offset group. At the end of each iteration, a reduction operation is performed to update the overall best so far value and the corresponding pair of time series representing the motif pair, across all the offset values considered in this iteration. The number of offset values considered in an iteration is determined empirically for best performance. The parallel algorithm including offset parallelism is illustrated in Fig. 2; this algorithm is referred to as the Par-MK algorithm. When both dimensions of parallelism are used, the threads that work on different offsets but the same array partition are co-scheduled on the same core, or on multiple cores that share the same L2 cache. This improves cache performance and reduces pressure on external memory bandwidth. The overall performance of this parallel algorithm depends on architectural and algorithmic parameters. The architectural parameters we consider for performance evaluation include the number of cores, the number of hardware threads per core, the size of the L2 cache and the number of cores sharing an L2 cache. For a given time series database, the algorithm parameters that affect performance are the number of reference points and the number of parallel offsets


when offset parallelism is used. Section 5 details the interesting interplay among these parameters for obtaining the best performance. The compute workload on each thread is non-uniform due to the variable number of distance computations between time series. This causes a load imbalance problem across the threads and results in lower scalability and efficiency with an increasing number of threads and cores on the target architecture. The next section presents the design optimizations for load balancing.

4.1 Load Balancing Optimizations

Each thread in Algorithm 2 performs a different number of distance computations between time series, depending on its local best so far[i] value and the distribution of distances with respect to the reference points. This leads to load imbalance across the threads. In this section, we present static and dynamic load balancing techniques. We first present the static load balancing technique (referred to as Par-MK-SLB). The work that threads do is divided into two categories: (a) sorted distance array traversal and best so far[i] update, and (b) distance computation between pairs of time series (Fig. 3). Each thread is assigned a fixed category of work. Each thread Ti that performs traversal of the sorted distance array (referred to as a search thread) has an associated synchronized queue Qi into which it puts distance computation requests. A distance computation request is the tuple [Sa, Sb, Rab], where Sa and Sb are the time series and Rab is the difference in the reference distances of these two time series. Each thread DTj that computes distances between time series (referred to as a compute thread) picks up a distance computation request from a queue Qi. It first checks whether this difference in reference distance is less than best so far[i]. If so, it picks up the request for computation; otherwise the request is discarded.
It then computes the distance between the two given time series (Sa, Sb). If the actual distance between these two time series is less than the best so far[i] value, then best so far[i] is updated in a synchronized fashion. For static load balancing, each thread in the system is assigned to one of multiple work-share groups. Each work-share group consists of a certain set of search threads (and their respective queues) and a set of compute threads. The ratio of compute threads to search threads is referred to as the cs-ratio. For optimized mapping onto the target architecture, the threads that belong to the same work-share group are co-scheduled on cores that share the same L2 cache (Fig. 3). The cs-ratio in each work-share group is determined by the relative load on these threads and also by the underlying architecture, for best cache performance. Static load balancing has the limitation that it can neither share compute threads across work-share groups nor adjust the overall ratio of compute threads to search threads in the system. Further, there can be load imbalance between the search threads due to the variable number of synchronized queue operations. Thus, to provide further improvements in load balancing, we extend this to dynamic load balancing.

Dynamic Load Balancing. In the dynamic load balancing scheme, the execution is divided into multiple phases. Each thread does one category of work in a phase but can switch to the other category in the next phase if needed for load balancing. Thus, the cs-ratio can vary across phases (Fig. 4). In each phase, the average queue occupancies are

[Figure: five search (traversal) threads T1–T5, each with a synchronized queue Queue(1)–Queue(5) feeding distance compute threads T6–T10, organized into three work-share groups.]

Fig. 3. Static Load Balancing

monitored to determine whether the search load or the compute load is high. If in the current phase the compute load is high, as indicated by relatively high queue occupancy, then some search threads switch their work category to compute work in the next phase; thus, the cs-ratio increases in the next phase. Symmetrically, the cs-ratio is decreased in the next phase if in the current phase the queues have relatively low occupancy. The length of each phase, in terms of number of iterations, is determined empirically. The threads chosen to switch from one category to another between phases and the new value of the cs-ratio depend on the architectural configuration, to ensure maximal cache locality while taking care of load balancing. Further, we employ

[Figure: search threads T1–T4 claim small chunks schunk(0)–schunk(3) of the sorted reference distance array via a shared next_available chunk pointer; compute threads T6–T9 fetch distance computation requests from queues Queue(1)–Queue(4).]

Fig. 4. Dynamic Load Balancing


fine-grained parallelism to achieve dynamic load balance between the search threads. Here, instead of partitioning the sorted distance array into a number of chunks equal to the number of search threads, we partition it into much smaller chunks. A synchronized variable points to the next available small chunk for the fixed set of offsets. Each search thread picks the next available small chunk and performs the search over it while enqueuing distance computation requests for the compute threads (Fig. 4). When no small chunks remain available for this iteration, the search threads synchronize at a barrier, and the global best so far value is updated and made visible to all threads. Thus, each search thread now accesses non-contiguous small chunks, as opposed to the static load balancing algorithm (Par-MK-SLB) or the Par-MK algorithm. The dynamic load balancing algorithm is referred to as Par-MK-DLB.
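The chunk-claiming step of Par-MK-DLB can be sketched with a shared counter. This is a simplified Python illustration under the assumption of a generic `process` callback; the actual implementation uses Pthreads with synchronized queues. Search threads repeatedly claim the next small chunk of the sorted array instead of owning one fixed partition, so a thread that finishes early simply takes more work:

```python
import itertools
import threading

def dynamic_chunks(total, chunk_size, num_threads, process):
    """Search threads claim small chunks from a shared counter (the
    'next available chunk' variable), so load balances automatically."""
    next_chunk = itertools.count()   # atomic under the CPython GIL
    lock = threading.Lock()
    claimed = []                     # (thread id, chunk id) records

    def search_thread(tid):
        while True:
            c = next(next_chunk)     # claim the next small chunk
            lo = c * chunk_size
            if lo >= total:
                return               # no chunks left this iteration
            hi = min(lo + chunk_size, total)
            process(tid, lo, hi)     # scan [lo, hi), enqueue requests
            with lock:
                claimed.append((tid, c))

    threads = [threading.Thread(target=search_thread, args=(t,))
               for t in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return claimed
```

Every chunk is processed exactly once, and a barrier (the `join` here) plays the role of the end-of-iteration synchronization described above.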

5 Results and Analysis

We implemented all three algorithms, Par-MK, Par-MK-SLB and Par-MK-DLB, using the POSIX Threads (NPTL) API. The test data used was the same as in [8]. We evaluated the performance and scalability of our algorithms on the EEG data and the Random Walk data. The EEG data [8] records voltage differences across the scalp and reflects the activity of the large populations of neurons underlying the recording electrode. The EEG data has a single sequence of length L = 180K. We extracted subsequences of length N = 200 from this long sequence, as described in Section 2, and used them to find the subsequence motif. For the synthetic data, we generated a random walk of TS = 100K time series, each of length N = 1024. We performed the experiments on a large-scale SMP machine with 32 Power6 cores at 4 GHz. Each core has 32KB of L1 cache, and two cores share 32MB of L2 cache. Further, each thread was assigned to a single core.

5.1 Strong Scalability

In the strong scalability experiment, we kept the number of time series and the length of each time series constant while increasing the number of threads from 2 to 32. Fig. 5(a) shows the variation in exact motif discovery time with increasing number of cores. The input consists of subsequences derived from the EEG time series of length L = 180K. The motif discovery time decreases from 207s for 2 cores/threads to 22.05s for 32 cores/threads. We also measured the time of the sequential MK algorithm [8] on this Power machine. Compared to the sequential time of 734s, we obtained a speedup of 33.4× on 32 cores. The superlinear speedup is due to the dynamic load balancing in our algorithm together with superior L2 cache performance. The parallel efficiency in this case is 104.4%. Using random walk data over TS = 50K time series, each of length 1024, we obtained 29.5s on 32 threads and 176.6s on 2 threads (Fig. 5(b)), while the sequential MK algorithm found the same motif in 253s. This gives a speedup of roughly 8.57×.
The parallel efficiency turns out to be 27%. This fall in speedup and efficiency is due to reduced cache performance owing to the larger size of the time series (N = 1024). Using dimension-based partitioning, we can obtain better cache performance and hence better speedup.
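The reported figures follow from the standard definitions; the numbers below are the paper's EEG measurements, and small rounding differences from the quoted 33.4× are expected:

```python
def speedup_and_efficiency(t_sequential, t_parallel, num_cores):
    """Speedup S = T1 / Tp; parallel efficiency E = S / p.
    E > 1 indicates superlinear speedup."""
    s = t_sequential / t_parallel
    return s, s / num_cores

# EEG data, 32 cores: sequential MK took 734s, Par-MK-DLB took 22.05s.
s, e = speedup_and_efficiency(734.0, 22.05, 32)
# s is about 33.3x and e about 104%, i.e. superlinear,
# consistent with the improved L2 cache behavior described above.
```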

[Figure: motif discovery time versus number of threads. Measured times:]

Threads:    1      2       4       8      16     32
EEG (s):    734    207.11  107.87  65.03  31.37  22.05
RWalk (s):  253    176.6   115.4   70.7   40.83  29.5

Fig. 5. (a) Strong Scalability (EEG data, L = 180K, N = 200). (b) Strong Scalability (Random Walk, TS = 50K, N = 1024).

5.2 Load Balancing Optimizations Analysis

We studied and compared load balance (across the threads) between the Par-MK and Par-MK-DLB algorithms. Fig. 6(a) plots the wait time for 32 threads on the EEG data using the Par-MK algorithm. The wait time for a thread is the sum, over all iterations, of the time the thread has to wait for the all-thread barrier to complete; it represents the load imbalance across the threads. The minimum wait time is 2.9s while the maximum is 10.47s. In each iteration the load balance across threads can be different, depending on the nature of the data. This irregular nature of the computation causes all threads to have non-zero total wait time over all iterations. Since the search and compute loads across the threads are variable in the Par-MK algorithm, this load imbalance is high, resulting in poor scalability and performance. Fig. 6(b) shows the wait time for 32 threads for the Par-MK-DLB algorithm. Here the minimum wait time is 2.2s while the maximum is 4s. Each thread processes as many available small chunks in each iteration as it can, based on the load from each chunk it picks. This dynamic load balancing strategy leads to much better load balance compared to the Par-MK algorithm. The EEG data has more search load than distance computation load, hence the cs-ratio of compute threads to search threads is kept low (1:7). The random walk data has a higher distance computation load compared to the search load. To obtain the best overall time, we tried multiple values of the ratio of the number of compute threads to the number of search threads. Fig. 7(a) displays the wait time (using 32 threads, random walk data, TS = 100K, N = 1024) for the search threads when the cs-ratio is 1. The minimum wait time is 20s while the maximum is 53.5s, with mean 33.7s and standard deviation 9.7s. The overall motif discovery time in this case is 164s.
The wait time reflects the variation of the distance computation workload and search traversal workload across the search threads. Since the length of the time series for random walk data is set to 1024, there is more distance computation work than traversal work across the reference distance array. Due to this, the queues for the search threads have higher occupancy in this experiment. Thus, even when the cs-ratio is 1:1,

[Figure: wait time per thread ID for 32 threads, EEG data.]

Fig. 6. (a) Load Balance Analysis Par-MK. (b) Load Balance Analysis Par-MK-DLB.

the number of compute threads is smaller than needed; hence the load imbalance across the search threads is large and the overall completion time is 172s. When the cs-ratio is chosen as 7:1, the load imbalance across the threads becomes much better, as indicated by Fig. 7(b). Here, the minimum wait time is around 4s while the maximum is around 8.1s, with mean 5.84s and standard deviation 2.26s. The total motif discovery time here is 146.5s. The larger number of compute threads is able to serve the queues of distance computation requests quickly, resulting in better load balance and a lower overall exact motif discovery time.

[Figure: wait time per search thread ID, random walk data, TS = 100K; (a) |C| = 16, |S| = 16; (b) |C| = 28, |S| = 4.]

Fig. 7. (a) Load Balance Analysis with CS-Ratio 1:1. (b) Load Balance Analysis with CS-Ratio 7:1.

6 Conclusions and Future Work

We presented the design of parallel algorithms for exact motif discovery over large time series. Novel load balancing and cache optimizations provide high scalability and performance on multi-core architectures. We demonstrated the performance of our optimized parallel algorithms on large Power SMP machines using both real and random data sets. We achieved around a 33× speedup over the best sequential


performance reported so far for the discovery of exact motifs on real data. Our optimized parallel algorithm delivers exact motifs in soft real-time. To the best of our knowledge, this is the first such work on parallel scalable soft real-time exact motif discovery on multi-core architectures. In the future, we plan to investigate scalability on many-core architectures with thousands of threads.

References
1. Beaudoin, P., van de Panne, M., Poulin, P., Coros, S.: Motion-motif graphs. In: Symposium on Computer Animation (2008)
2. Chiu, B., Keogh, E., Lonardi, S.: Probabilistic discovery of time series motifs. In: 9th International Conference on Knowledge Discovery and Data Mining (KDD 2003), pp. 493–498 (2003)
3. Guralnik, V., Karypis, G.: Parallel tree-projection-based sequence mining algorithms. Parallel Computing 30(4), 443–472 (2001)
4. Guyet, T., Garbay, C., Dojat, M.: Knowledge construction from time series data using a collaborative exploration system. Journal of Biomedical Informatics 40(6), 672–687 (2007)
5. Jiang, T., Feng, Y., Zhang, B., Shi, J., Wang, Y.: Finding motifs of financial data streams in real time. In: Kang, L., Cai, Z., Yan, X., Liu, Y. (eds.) ISICA 2008. LNCS, vol. 5370, pp. 546–555. Springer, Heidelberg (2008)
6. Meng, J., Yuan, J., Hans, M., Wu, Y.: Mining motifs from human motion. In: Proc. of EUROGRAPHICS (2008)
7. Minnen, D., Isbell, C., Essa, I., Starner, T.: Discovering multivariate motifs using subsequence density estimation and greedy mixture learning. In: Conf. on Artificial Intelligence, AAAI 2007 (2007)
8. Mueen, A., Keogh, E.J., Zhu, Q., Cash, S., Westover, M.B.: Exact discovery of time series motifs. In: SDM, pp. 473–484 (2009)
9. Cong, S., Han, J., Padua, D.: Parallel mining of closed sequential patterns. In: Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, Chicago, USA, pp. 562–567 (2005)
10. Tanaka, Y., Iwamoto, K., Uehara, K.: Discovery of time-series motif from multi-dimensional data based on MDL principle. Machine Learning 58(2-3), 269–300 (2005)
11. Tata, S.: Declarative Querying For Biological Sequences. Ph.D. thesis, The University of Michigan (2007)
12. Ueno, K., Xi, X., Keogh, E., Lee, D.: Anytime classification using the nearest neighbor algorithm with applications to stream mining. In: Proc. of IEEE International Conference on Data Mining (2006)
13. Zaki, M.: Parallel sequence mining on shared-memory machines. Journal of Parallel and Distributed Computing 61(3), 401–426 (2001)

Optimized Dense Matrix Multiplication on a Many-Core Architecture

Elkin Garcia1, Ioannis E. Venetis2, Rishi Khan3, and Guang R. Gao1

1 Computer Architecture and Parallel Systems Laboratory, Department of Electrical and Computer Engineering, University of Delaware, Newark 19716, U.S.A. {egarcia,ggao}@capsl.udel.edu
2 Department of Computer Engineering and Informatics, University of Patras, Rion 26500, Greece [email protected]
3 ET International, Newark 19711, U.S.A. [email protected]

Abstract. Traditional parallel programming methodologies for improving performance assume cache-based parallel systems. However, new architectures, like the IBM Cyclops-64 (C64), belong to a new set of many-core-on-a-chip systems with a software-managed memory hierarchy. New programming and compiling methodologies are required to fully exploit the potential of this new class of architectures. In this paper, we use dense matrix multiplication as a case study to present a general methodology for mapping applications to these kinds of architectures. Our methodology has the following characteristics: (1) balanced distribution of work among threads to fully exploit available resources; (2) optimal register tiling and sequence of traversing tiles, calculated analytically and parametrized according to the register file size of the processor used, resulting in minimal memory transfers and optimal register usage; (3) implementation of architecture-specific optimizations to further increase performance. Our experimental evaluation on a real C64 chip shows a performance of 44.12 GFLOPS, which corresponds to 55.2% of the peak performance of the chip. Additionally, measurements of power consumption show that the C64 is very power efficient, providing 530 MFLOPS/W for the problem under consideration.

1 Introduction

Traditional parallel programming methodologies for improving performance assume cache-based parallel systems. They exploit temporal locality by making use of cache tiling techniques with tile size selection and padding [8,18]. However, data location and replacement in the cache are controlled by hardware, making fine control of these parameters difficult. In addition, power consumption and chip die area constraints make increasing on-chip cache an untenable solution to the memory wall problem [5,17].

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 316–327, 2010. © Springer-Verlag Berlin Heidelberg 2010


As a result, new architectures like the IBM Cyclops-64 (C64) belong to a new set of many-core-on-a-chip systems with a software managed memory hierarchy. These new kinds of architectures hand the management of the memory hierarchy to the programmer and save the die area of hardware cache controllers and over-sized caches. Although this might complicate programming at the current stage, these systems provide more flexibility and opportunities to improve performance. Following this path, new alternatives for classical algorithmic problems, such as Dense Matrix Multiplication (MM), LU decomposition (LU) and the Fast Fourier Transform (FFT), have been studied on these new many-core architectures [7,15,21]. The investigation of these new opportunities leads to two main conclusions: (1) optimizations for improving performance on cache-based parallel systems are not necessarily feasible or convenient on systems with a software managed memory hierarchy; (2) memory access patterns reached by appropriate tiling substantially increase the performance of applications. Based on these observations, we conclude that new programming and compiling methodologies are required to fully exploit the potential of this new class of architectures.

We believe that a good starting point for developing such methodologies are classical algorithms with known memory access and computation patterns. These applications provide realistic scenarios and have been studied thoroughly on cache-based parallel systems. Following this idea, we present a general methodology that provides a mapping of applications to software managed memory hierarchies, using MM on C64 as a case study. MM was chosen because it is simple to understand and analyze, but computationally and memory intensive. For the basic algorithm, both the arithmetic complexity and the number of memory operations for multiplying two m × m matrices are O(m³).
The methodology presented in this paper is composed of three strategies that result in a substantial increase in performance, by optimizing different aspects of the algorithm. The first one is a balanced distribution of work among threads. Providing the same amount of work to each thread minimizes the idle time of processing units waiting for others to finish; if a perfect distribution is not feasible, a mechanism to minimize the differences is proposed. The second strategy is an optimal register tiling and sequence of traversing tiles. Our register tiling and the implemented sequence of traversing tiles are designed to maximize the reuse of data in registers and minimize the number of memory accesses to slower levels, avoiding unnecessary stalls in the processing units while they wait for data. The last strategy involves more specific characteristics of C64. The use of special instructions, optimized instruction scheduling and other techniques further boost the performance reached by the first two strategies. The impact on performance can change according to the particular characteristics of the many-core processor used.

The experimental evaluation was performed using a real C64 chip. After implementing the three strategies proposed, the performance reached by the C64 chip is 44.12 GFLOPS, which corresponds to 55.2% of the peak performance. Additionally, measurements of power consumption prove that C64 is very power efficient, providing 530 MFLOPS/W for the problem under consideration. This value is comparable to the top of the Green500 list [13], which provides a ranking of the most energy-efficient supercomputers in the world.

The rest of this paper is organized as follows. In Section 2, we describe the C64 architecture. In Section 3, we give a short overview of the current status of MM algorithms. In Section 4, we introduce our proposed MM algorithm and optimizations. In Section 5, we present the experimental evaluation of our implementation. Finally, we conclude and present future work in Section 6.

[Fig. 1. C64 Architecture details. (a) C64 chip architecture: 80 processors, each with two thread units (TU) and one floating point unit (FP), connected together with the SRAM banks by a crossbar network; a host interface, Gigabit Ethernet, FPGA/HD external interface logic, an A-switch to the 3D mesh control network (1.92 TB/s), and a DDR2 SDRAM controller to off-chip memory. (b) Memory hierarchy of C64: 64 registers (read: 1 cycle, write: 1 cycle); scratchpad SP, 16 kB (load: 2 cycles, store: 1 cycle; 640 GB/s); on-chip global memory GM, ~2.5 MB (load: 31 cycles, store: 15 cycles; 320 GB/s); off-chip DRAM, 1 GB (load: 57 cycles, store: 28 cycles; 16 GB/s with multiple load and multiple store instructions, 2 GB/s otherwise).]

2 The IBM Cyclops-64 Architecture

Cyclops-64 (C64) is an innovative architecture developed by IBM, designed to serve as a dedicated petaflop computing engine for running high performance applications. A C64 chip is an 80-processor many-core-on-a-chip design, as can be seen in Fig. 1a. Each processor is equipped with two thread units (TUs), one 64-bit floating point unit (FP) and two SRAM memory banks of 30 kB each. It can issue one double precision floating point "Multiply and Add" instruction per cycle, for a total performance of 80 GFLOPS per chip when running at 500 MHz. A 96-port crossbar network with a bandwidth of 4 GB/s per port connects all TUs and SRAM banks [11]. The complete C64 system is built out of tens of thousands of C64 processing nodes arranged in a 3-D mesh topology. Each processing node consists of a C64 chip, external DRAM, and a small amount of external interface logic.

A C64 chip has an explicit three-level memory hierarchy (scratchpad memory, on-chip SRAM, off-chip DRAM), 16 instruction caches of 32 kB each (not shown in the figure) and no data cache. The scratchpad memory (SP) is a configured portion of each on-chip SRAM bank which can be accessed with very low latency by the TU it belongs to. The remaining sections of all on-chip SRAM banks constitute the on-chip global memory (GM), which is uniformly addressable from all TUs. In summary, Fig. 1b reflects the current size, latency (when there is no contention) and bandwidth of each level of the memory hierarchy.


Execution on a C64 chip is non-preemptive and there is no hardware virtual memory manager. The former means that the C64 micro-kernel will not interrupt the execution of a user application unless an exception occurs. The latter means the three-level memory hierarchy of the C64 chip is visible to the programmer.

3 Classic Matrix Multiplication Algorithms

MM algorithms have been studied extensively. These studies focus mainly on: (1) algorithms that decrease the naïve complexity of O(m³); (2) implementations that take advantage of advanced features of computer architectures to achieve higher performance. This paper is oriented towards the second area.

In the first area, more efficient algorithms are developed. Strassen's algorithm [20] is based on the multiplication of two 2 × 2 matrices with 7 multiplications, instead of the 8 required by the straightforward algorithm. The recursive application of this scheme leads to a complexity of O(m^(log₂ 7)) ≈ O(m^2.81) [10]. Disadvantages, such as numerical instability and the memory space required for submatrices in the recursion, have been discussed extensively [14]. The best known upper bound is O(m^2.376), given by the Coppersmith–Winograd algorithm [9]. However, this algorithm is not used in practice, due to its large constant term.

The second area focuses on efficient implementations. Although initially more emphasis was given to implementations for single processors, parallel approaches quickly emerged. A common factor among most implementations is the decomposition of the computation into blocks. Blocking algorithms not only give opportunities for better use of specific architectural features (e.g., the memory hierarchy) but are also a natural way of expressing parallelism. Parallel implementations have exploited the interconnection pattern of processors, as in Cannon's matrix multiply algorithm [6,16], or the reduced number of operations, as in Strassen's algorithm [4,12]. These implementations have explored the design space along different directions, according to the targeted parallel architecture. Other studies have focused on models that capture performance-relevant aspects of the hierarchical nature of computer memory, such as the Uniform Memory Hierarchy (UMH) model and the Parallel Memory Hierarchy (PMH) model [1,2].
The many-core architecture design space has not yet been explored in detail, but existing studies already show its potential. A performance prediction model for Cannon's algorithm has shown a huge performance potential for an architecture similar to C64 [3]. Previous research on MM on C64 showed that it is possible to increase performance substantially by applying well-known optimization methods and adapting them to specific features of the chip [15]. More recent results on LU decomposition conclude that some optimizations that perform well for classical cache-based parallel systems are not the best alternative for improving performance on systems with a software managed memory hierarchy [21].

4 Proposed Matrix Multiplication Algorithm

In this section we analyze the proposed MM algorithm and highlight our design choices. The methodology used is oriented towards exploiting the maximum benefit of features that are common across many-core architectures. Our target operation is the multiplication of dense square matrices A × B = C, each of size m × m, using algorithms of running time O(m³). Throughout the design process, we use some specific features of C64 to illustrate the advantages of the proposed algorithm over choices made in other MM algorithms.

Our methodology alleviates three related sources identified as causing poor performance on many-core architectures: (1) inefficient or unnecessary synchronization; (2) unbalanced work between threads; (3) latency due to memory operations. The relation and impact of these sources on performance are architecture dependent, and modeling their interactions has been an active research topic. In our particular case, MM is easier to analyze than other algorithms, not only because of the simple way it can be described but also because there exist parallel MM algorithms that do not require synchronization. This simplifies our design process: since the proposed algorithm requires no synchronization, we only need to analyze carefully two of the three identified causes of poor performance. These challenges are analyzed in the following subsections.

4.1 Work Distribution

The first challenge in our MM algorithm is to distribute work among P processors while avoiding synchronization. It is well known that each element c_{i,j} of C can be calculated independently. Therefore, serial algorithms can be parallelized without requiring any synchronization for the computation of each element c_{i,j}, which immediately satisfies this requirement.

The second step is to break the m × m matrix C into blocks such that we minimize the maximum block size, pursuing optimal resource utilization and trying to avoid overloading any processor. Optimally, this would break the problem into blocks of m²/P elements, but the blocks must be rectangular and fit into C. One way to break C into P rectangular blocks is to divide the rows and columns of C into q1 and q2 sets respectively, with q1 · q2 = P. The optimal way to minimize the maximum block size is to divide the m rows into q1 sets of ⌊m/q1⌋ rows each (with some sets having an extra row), and the same for the columns. The maximum tile size is then ⌈m/q1⌉ · ⌈m/q2⌉, which is bounded by (m/q1 + 1) · (m/q2 + 1). The difference between this upper bound and the optimal tile size m²/P is m/q1 + m/q2 + 1, and this difference is minimized when q1 = q2 = √P. If P is not a square number, we choose q1 as the factor of P that is closest to, but not larger than, √P. To further optimize, we can turn off some processors if doing so decreases the maximum tile size. In practice, this reduces to turning off processors when it makes q2 − q1 smaller; in general, this occurs if P is prime or one larger than a square number.

4.2 Minimization of High Cost Memory Operations

After addressing the synchronization and load-balancing problems for MM, the next major bottleneck is the impact of memory operations. Despite the high bandwidth of on-chip memory in many-core architectures (e.g. C64), memory bandwidth and size are still bottlenecks, producing stalls while processors wait for new data. As a result, implementations of MM, LU and FFT are still memory bound [7,15,21]. However, the flexibility of software managed memory hierarchies provides new opportunities to the programmer for developing better techniques for tiling and data locality, without the constraints imposed by cache parameters such as line size or associativity [19,21]. This implies an analysis of the tile shapes, the tile size and the sequences in which tiles are traversed, taking advantage of this new dimension in the design space.

In pursuing a better use of the memory hierarchy, our approach considers two levels of this hierarchy: one faster but smaller, the other slower but bigger. Our objective is to minimize the number of slow memory operations, loads (LD) and stores (ST), which are a function of the problem (Λ), the number of processors (P), the tile parameters (L) and the sequence of traversing tiles (S), subject to the constraint that the data used in the current computation (R) cannot exceed the size of the small memory (Rmax). This can be expressed as the optimization problem:

    min over L, S of  LD(Λ, P, L, S) + ST(Λ, P, L, S),    s.t.  R(Λ, P, L, S) ≤ Rmax        (1)

In our case, registers are the fast memory and Λ is the MM with the partitioning described in subsection 4.1. Our analysis assumes a perfect load-balancing where each block C' ∈ C of size n × n (n = m/√P), computed by one processor, is subdivided into tiles C'[i,j] ∈ C' of size L2 × L2. Due to data dependencies, the required blocks A' ∈ A and B' ∈ B of sizes n × m and m × n are subdivided into tiles A'[i,k] ∈ A' and B'[k,j] ∈ B' of sizes L2 × L1 and L1 × L2 respectively. Each processor requires 3 nested loops to compute all the tiles of its block. Using loop interchange analysis, an exhaustive study of the 6 possible schemes to traverse tiles was conducted, and two prototype sequences S1 and S2 were found. The algorithms that describe these sequences are shown in Fig. 2.

(a) Algorithm using sequence S1:

    for i = 1 to n/L2
      for j = 1 to n/L2
        Initialize C'[i,j]
        for k = 1 to m/L1
          Load A'[i,k], B'[k,j]
          C'[i,j] += A'[i,k] · B'[k,j]
        end for
        Store C'[i,j]
      end for
    end for

(b) Algorithm using sequence S2:

    for i = 1 to n/L2
      for k = 1 to m/L1
        Load A'[i,k]
        for j = 1 to n/L2
          if k = 1 then Initialize C'[i,j]
          else Load C'[i,j]
          Load B'[k,j]
          C'[i,j] += A'[i,k] · B'[k,j]
          Store C'[i,j]
        end for
      end for
    end for

Fig. 2. Implementation of sequences for traversing tiles in one block of C


Based on the data dependencies of these implementations, the general optimization problem described in (1) can be expressed for our case by Eq. (2):

    min over L ∈ {L1, L2}, S ∈ {S1, S2} of

    f(m, P, L, S) = 2m³/L2 + m²                        if S = S1
                    m³/L2 + 2m³/L1 + (√P − 1)·m²       if S = S2        (2)

    s.t.  2·L1·L2 + L2² ≤ Rmax

Analyzing the piecewise function f, we notice that if P ≥ 4 the objective function for S = S1 is always smaller than the objective function for S = S2. Since f then depends only on L2, we minimize f by maximizing L2; given the constraint, L2 is maximized by minimizing L1. Thus, setting L1 = 1, we solve for the optimum L2 on the boundary of the constraint. The solution of Eq. (2) for P ≥ 4 is:

    L1 = 1,    L2 = ⌊√(1 + Rmax) − 1⌋        (3)

This result is not completely accurate, since we assumed that there are no remainders when dividing the matrices into blocks and subdividing the blocks into tiles. Despite this, it can be used as a good estimate. For comparison, C64 has 63 registers, and we need to reserve six of them (for the stack pointer, the pointers to the A, B and C matrices, m and the stride parameter), so Rmax = 63 − 6 = 57 and the solution of Eq. (3) is L1 = 1 and L2 = 6. Table 1 summarizes the results in terms of the number of LD and ST for the proposed tiling and two other options that fully utilize the registers and have been used in practical algorithms: inner product of vectors (L1 = 28, L2 = 1) and square tiles (L1 = L2 = 4). As a consequence of using sequence S1, the number of ST is equal under all tiling strategies. As expected, the proposed tiling has the minimum number of LD: 6 times fewer than the inner product tiling and 1.5 times fewer than the square tiling.

Table 1. Number of memory operations for different tiling strategies

    Memory Operations    Inner Product    Square     Optimal
    Loads                2m³              (1/2)m³    (1/3)m³
    Stores               m²               m²         m²

4.3 Architecture Specific Optimizations

Although the general results of subsection 4.2 are of major importance, an implementation that properly exploits specific features of the architecture is also important for maximizing performance. We use our knowledge and experience to take advantage of the specific features of C64, but the guidelines proposed here can be extended to similar architectures.


The first optimization is the use of special assembly instructions for loads and stores. C64 provides the instructions multiple load (ldm RT, RA, RB) and multiple store (stm RT, RA, RB), which combine several memory operations into a single instruction. For the ldm instruction, starting from the memory address contained in RA, consecutive 64-bit values in memory are loaded into consecutive registers, from RT through and including RB. Similarly, the stm instruction stores the registers from RT through and including RB into consecutive 64-bit memory locations, starting at the address contained in RA. The advantage of these instructions is that a normal load issues one data transfer request per element, while the multiple load issues one request per 64-byte boundary. Because our tiling is 6 × 1 in A and 1 × 6 in B, we need A in column-major order and B in row-major order as a requirement for exploiting this feature. If they are not in the required layout, we transpose one matrix, which does not affect the complexity of the proposed algorithms because the running time of transposition is O(m²).

The second optimization applied is instruction scheduling: the correct interleaving of independent instructions to alleviate stalls. Data dependencies can stall the execution of the current instruction while it waits for the result of one issued previously. We want to hide or amortize the cost of critical instructions that increase the total computation time by executing other instructions that do not share variables or resources. The most common example is interleaving memory instructions with computation, but there are other cases: multiple integer operations can be executed while a floating point operation is computed.

5 Experimental Evaluation

This section describes the experimental evaluation, based on the analysis of Section 4, using the C64 architecture described in Section 2. Our baseline parallel MM implementation works with square matrices m × m and was written in C. The experiments were run for sizes up to m = 488, placing matrices A and B in on-chip SRAM and matrix C in off-chip DRAM; the maximum number of TUs used is 144.

To analyze the impact of the partitioning scheme described in subsection 4.1, we compare it with two other partitioning schemes. Fig. 3 shows the performance reached for two different matrix sizes. In Partitioning 1, the m rows are divided into q1 sets, the first q1 − 1 containing ⌊m/q1⌋ rows each and the last set containing the remaining rows; the same partitioning is followed for columns. It has the worst performance of the three partitionings because it does not minimize the maximum tile size. Partitioning 2 has the optimum maximum tile size of ⌈m/q1⌉ · ⌈m/q2⌉, but does not distribute the rows and columns uniformly among the q1 and q2 sets. Its performance is very close to that of our algorithm, Partitioning 3, which has the optimum maximum tile size and a better distribution of rows and columns among the sets. A disadvantage of Partitioning 2 compared to Partitioning 3 is that for small matrices (n ≤ 100) and a large number of TUs

[Fig. 3. Different partitioning schemes vs. number of thread units: performance (GFLOPS) of Partitioning 1, 2 and 3 for 1–144 thread units. (a) Matrix size 100 × 100. (b) Matrix size 488 × 488.]

Partitioning 2 may produce significantly lower performance, as can be observed in Fig. 3a. Our partitioning algorithm, Partitioning 3, always performs best; the maximum performance reached is 3.16 GFLOPS. The other scheme with optimum maximum tile size also performs well for large matrices, indicating that minimizing the maximum tile size is an appropriate target for optimizing the work load. In addition, our partitioning algorithm scales well with respect to the number of threads, which is essential for many-core architectures.

The results of the progressive improvements made to our MM algorithm are shown in Fig. 4 for the maximum matrix size that fits in SRAM. The tiling strategy proposed in subsection 4.2 for minimizing the number of memory operations was implemented in assembly code, using tiles of 6 × 1, 1 × 6 and 6 × 6 for blocks in A, B and C respectively. Because the sizes of blocks in C are not necessarily multiples of 6, all possible combinations of tiles with size less than 6 × 6 were also implemented. The maximum performance reached was 30.42 GFLOPS, which is almost 10 times the maximum performance reached by the version that uses only the optimum partitioning. This large improvement shows the advantage of the correct tiling and sequence of traversing tiles, which directly minimizes the time spent waiting for operands by substituting costly memory operations in SRAM with operations between registers. From another point of view, our tiling increases the reuse of data in registers, minimizing the number of accesses to memory for a fixed number of computations.

The following optimizations, related to more specific features of C64, increased the performance further. The use of multiple load and multiple store instructions (ldm/stm) diminishes the time spent transferring data addressed consecutively in memory. The new maximum performance is 32.22 GFLOPS: 6% better than the version without architecture-specific optimizations.
The potential of these instructions has not been completely exploited, because transactions that cross a 64-byte boundary are divided, and transactions in groups of 6 do not provide an optimum pattern for minimizing this division. Finally, the instruction scheduling applied to hide the cost of some instructions by doing other computations in between increases performance by 38%. The maximum performance of our


[Fig. 4. Impact of each optimization on the performance of MM using m = 488: performance (GFLOPS) vs. thread units (1–144) for the Partitioning, Tiling, Optimization 1 (ldm/stm) and Optimization 2 (instruction scheduling) versions.]

MM algorithm is 44.12 GFLOPS, which corresponds to 55.2% of the peak performance of a C64 chip. We also made measurements of power consumption using the current consumed by the two voltage sources of the C64 chip (1.2 V and 1.8 V), yielding a total of 83.22 W, or 530 MFLOPS/W. This demonstrates the power efficiency of C64 for the problem under consideration. This value is similar to the top of the Green500 list, which provides a ranking of the most energy-efficient supercomputers in the world.

6 Conclusions and Future Work

In this paper we present a methodology to design algorithms for many-core architectures with a software managed memory hierarchy, taking advantage of the flexibility these systems provide. We apply it to design a Dense Matrix Multiplication (MM) mapping, and we implement MM for C64. We propose three strategies for increasing performance and show their advantages on this kind of architecture. The first strategy is a balanced distribution of work among threads: our partitioning strategy not only distributes the amount of computation as uniformly as possible but also minimizes the maximum block size that belongs to each thread. Experimental results show that the proposed partitioning scales well with respect to the number of threads for different sizes of square matrices and performs better than other similar schemes. The second strategy alleviates the total cost of memory accesses. We propose an optimal register tiling with an optimal sequence of traversing tiles that minimizes the


number of memory operations and maximizes the reuse of data in registers. The implementation of the proposed tiling reached a maximum performance of 30.42 GFLOPS, which is almost 10 times the maximum performance reached by the optimum partitioning alone. Finally, architecture-specific optimizations were implemented. The use of multiple load and multiple store instructions (ldm/stm) diminishes the time spent transferring data that are stored or loaded consecutively in memory. It was combined with instruction scheduling, hiding or amortizing the cost of some memory operations and high-cost floating point instructions by doing other computations in between. After these optimizations, the maximum performance of our MM algorithm is 44.12 GFLOPS, which corresponds to 55.2% of the peak performance of a C64 chip. We also provide evidence of the power efficiency of C64: power consumption measurements show a maximum efficiency of 530 MFLOPS/W for the problem under consideration. This value is comparable to the top of the Green500 list, which provides a ranking of the most energy-efficient supercomputers in the world.

Future work includes the study of other techniques, like software pipelining and work stealing, that can further increase the performance of this algorithm. We also want to explore how to increase the size of the tiles beyond the maximum number of registers, using the stack and the scratchpad memory. In addition, we intend to apply this methodology to other linear algebra problems, such as matrix inversion.

Acknowledgments

This work was supported by NSF (CNS-0509332, CSR-0720531, CCF-0833166, CCF-0702244) and other government sponsors. We thank all the members of the CAPSL group at the University of Delaware and ET International who have given us valuable comments and feedback.

References

1. Alpern, B., Carter, L., Feig, E., Selker, T.: The uniform memory hierarchy model of computation. Algorithmica 12, 72–109 (1992)
2. Alpern, B., Carter, L., Ferrante, J.: Modeling parallel computers as memory hierarchies. In: Proceedings of Programming Models for Massively Parallel Computers, pp. 116–123. IEEE Computer Society Press, Los Alamitos (1993)
3. Amaral, J.N., Gao, G.R., Merkey, P., Sterling, T., Ruiz, Z., Ryan, S.: Performance Prediction for the HTMT: A Programming Example. In: Proceedings of the Third PETAFLOP Workshop (1999)
4. Bailey, D.H., Lee, K., Simon, H.D.: Using Strassen's Algorithm to Accelerate the Solution of Linear Systems. Journal of Supercomputing 4, 357–371 (1991)
5. Callahan, D., Porterfield, A.: Data cache performance of supercomputer applications. In: Supercomputing 1990: Proceedings of the 1990 ACM/IEEE Conference on Supercomputing, pp. 564–572. IEEE Computer Society Press, Los Alamitos (1990)
6. Cannon, L.E.: A Cellular Computer to Implement the Kalman Filter Algorithm. Ph.D. thesis, Montana State University, Bozeman, MT, USA (1969)
7. Chen, L., Hu, Z., Lin, J., Gao, G.R.: Optimizing the Fast Fourier Transform on a Multi-core Architecture. In: IEEE 2007 International Parallel and Distributed Processing Symposium (IPDPS 2007), pp. 1–8 (March 2007)
8. Coleman, S., McKinley, K.S.: Tile size selection using cache organization and data layout. In: PLDI 1995: Proceedings of the ACM SIGPLAN 1995 Conference on Programming Language Design and Implementation, pp. 279–290. ACM, New York (1995)
9. Coppersmith, D., Winograd, S.: Matrix Multiplication via Arithmetic Progressions. In: Proceedings of the 19th Annual ACM Symposium on Theory of Computing (STOC 1987), New York, NY, USA, pp. 1–6 (1987)
10. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn. The MIT Press, Cambridge (2001)
11. Denneau, M., Warren Jr., H.S.: 64-bit Cyclops: Principles of Operation. Tech. rep., IBM Watson Research Center, Yorktown Heights, NY (April 2005)
12. Douglas, C.C., Heroux, M., Slishman, G., Smith, R.M.: GEMMW: A Portable Level 3 BLAS Winograd Variant of Strassen's Matrix-Matrix Multiply Algorithm (1994)
13. Feng, W.C., Scogland, T.: The Green500 List: Year One. In: 5th IEEE Workshop on High-Performance, Power-Aware Computing, in conjunction with the 23rd International Parallel & Distributed Processing Symposium, Rome, Italy (May 2009)
14. Higham, N.J.: Exploiting Fast Matrix Multiplication Within the Level 3 BLAS. ACM Transactions on Mathematical Software 16(4), 352–368 (1990)
15. Hu, Z., del Cuvillo, J., Zhu, W., Gao, G.R.: Optimization of Dense Matrix Multiplication on IBM Cyclops-64: Challenges and Experiences. In: Nagel, W.E., Walter, W.V., Lehner, W. (eds.) Euro-Par 2006. LNCS, vol. 4128, pp. 134–144. Springer, Heidelberg (2006)
16. Lee, H.-J., Robertson, J.P., Fortes, J.A.B.: Generalized Cannon's algorithm for parallel matrix multiplication. In: Proceedings of the 11th International Conference on Supercomputing (ICS 1997), pp. 44–51. ACM, Vienna (1997)
17. Kondo, M., Okawara, H., Nakamura, H., Boku, T., Sakai, S.: SCIMA: a novel processor architecture for high performance computing. In: Proceedings of the Fourth International Conference/Exhibition on High Performance Computing in the Asia-Pacific Region, vol. 1, pp. 355–360 (2000)
18. Lam, M.D., Rothberg, E.E., Wolf, M.E.: The cache performance and optimizations of blocked algorithms. In: ASPLOS-IV: Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 63–74. ACM, New York (1991)
19. Orozco, D.A., Gao, G.R.: Mapping the FDTD application to many-core chip architectures. In: ICPP 2009: Proceedings of the 2009 International Conference on Parallel Processing, pp. 309–316. IEEE Computer Society, Washington (2009)
20. Strassen, V.: Gaussian Elimination is not Optimal. Numerische Mathematik 13, 354–356 (1969)
21. Venetis, I.E., Gao, G.R.: Mapping the LU Decomposition on a Many-Core Architecture: Challenges and Solutions. In: Proceedings of the 6th ACM Conference on Computing Frontiers (CF 2009), Ischia, Italy, pp. 71–80 (May 2009)

A Language-Based Tuning Mechanism for Task and Pipeline Parallelism Frank Otto, Christoph A. Schaefer, Matthias Dempe, and Walter F. Tichy Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany {otto,cschaefer,dempe,tichy}@ipd.uka.de

Abstract. Current multicore computers differ in many hardware aspects. Tuning parallel applications is indispensable to achieve best performance on a particular hardware platform. Auto-tuners represent a promising approach to systematically optimize a program’s tuning parameters, such as the number of threads, the size of data partitions, or the number of pipeline stages. However, auto-tuners require several tuning runs to find optimal values for all parameters. In addition, a program optimized for execution on one machine usually has to be re-tuned on other machines. Our approach tackles this problem by introducing a language-based tuning mechanism. The key idea is the inference of essential tuning parameters from high-level parallel language constructs. Instead of identifying and adjusting tuning parameters manually, we exploit the compiler’s context knowledge about the program’s parallel structure to configure the tuning parameters at runtime. Consequently, our approach significantly reduces the need for platform-specific tuning runs. We implemented the approach as an integral part of XJava, a Java language extension to express task and pipeline parallelism. Several benchmark programs executed on different hardware platforms demonstrate the effectiveness of our approach. On average, our mechanism sets over 90% of the relevant tuning parameters automatically and achieves 93% of the optimal performance.

1 Introduction

In the multicore era, performance gains for applications of all kinds will come from parallelism. The prevalent thread model forces programmers to think on low abstraction levels. As a consequence, writing multithreaded code that offers satisfying performance is not straightforward. New programming models have been proposed for simplifying parallel programming and improving portability. Interestingly, their high-level constructs can be used for automatic performance tuning. Libraries, in contrast, normally do not provide semantic information about parallel programming patterns.

Case studies have shown that parallel applications typically employ different types of parallelism on different levels of granularity [13]. Performance depends on various parameters such as the number of threads, the number of pipeline

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 328–340, 2010. © Springer-Verlag Berlin Heidelberg 2010


stages, or load balancing strategies. Usually, these parameters have to be defined and set explicitly by the programmer. Finding a good parameter configuration is far from easy due to large parameter search spaces. Auto-tuners provide a systematic way to find an optimal parameter configuration. However, as the best configuration strongly depends on the target platform, a program normally has to be re-tuned after porting to another machine.

In this paper, we introduce a mechanism to automatically infer and configure five essential tuning parameters from high-level parallel language constructs. Our approach exploits explicit information about task and pipeline parallelism and uses tuning heuristics to set appropriate parameter values at runtime. From the programmer's perspective, a considerable number of tuning parameters becomes invisible. That is, the need for feedback-directed auto-tuning processes on different target platforms is drastically reduced.

We implemented our approach as part of the previously introduced language XJava [11,12]. XJava extends Java with language constructs for high-level parallel programming and allows the direct expression of task and pipeline parallelism. An XJava program compiles to Java code instrumented with tuning parameters and context information about its parallel structure. The XJava runtime system exploits the context information and platform properties to set tuning parameters.

We evaluated our approach on a set of seven benchmark programs. Our approach sets over 90% of the relevant tuning parameters automatically, achieving 93% of the optimum performance on three different platforms.

2 The XJava Language

XJava extends Java by adding tasks and parallel statements. For a quick overview, the simplified grammar extension in BNF style is shown in Figure 1. We basically extend the existing production rules for method declarations (rule 1) and statements (rule 7). New keywords are work and push; new operators are => and |||. Semantics are described next.

2.1 Language

Tasks. Tasks are conceptually related to filters in stream languages. Basically, a task is an extension of a method. Unlike methods, a task defines a concurrently executable activity that expects a stream of input data and produces a stream of output data. The types of data elements within the input and output stream are defined by the task's input and output type. These types can also be void in order to specify that there is no input or output. For example, the code

  public String => String encode(Key key) {
    work (String s) { push encrypt(s, key); }
  }

declares a public task encode with input and output type String. The work block defines what to do for each incoming element and can be thought of as a

Fig. 1. The grammar extension of XJava

loop. A task body contains either exactly one or no work block (rule 6). A push statement inside a task body puts an element into the output stream. In the example, these elements are String objects encrypted by the method encrypt and the parameter key.

Parallel statements. Tasks are called like methods; parallelism is generated by combining task calls with operators to compose parallel statements (rule 9). Basically, these statements can be used both outside and inside a task body; the latter case introduces nested parallelism. Parallel statements allow for easily expressing many different types of parallelism, such as linear and non-linear pipelines, master/worker configurations, data parallelism, and recursive parallelism.

(1) Combining tasks with the "=>" operator introduces pipeline parallelism. In addition to the task encode above, we assume two more tasks read and write for reading and writing to a file. Then, the pipeline statement

  read(fin) => encode(key) => write(fout);

creates a pipeline that encodes the content of the file fin and writes results to the file fout.

(2) Combining tasks with the "|||" operator introduces task parallelism. Assuming a task compress, the concurrent statement

  compress(f1) ||| compress(f2);

compresses two files f1 and f2 concurrently.

By default, a task is executed by one thread. Optionally, a task call can be marked with a "+" operator to make it replicable. A replicable task can be executed by more than one thread, which is useful to reduce bottleneck effects in pipelines.


For example, the task encode in the pipeline example above might be the slowest stage. Using the expression encode(key)+ instead of encode(key) can increase throughput, since we allow more threads to execute that critical stage. The number of replicates is determined at runtime and thus does not need to be specified by the programmer. If the programmer wants to create a concrete number of task instances at once, say 4, he can use the expression encode(key):[4].

2.2 Compiler and Runtime System

The XJava compiler transforms XJava to optimized and instrumented Java code, which is then translated into bytecode. The translated program consists of logical code units that are passed to the XJava runtime system XJavaRT. XJavaRT is the place where parallelism happens. It is designed as a library employing executor threads and built-in scheduling mechanisms.
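As an illustration of the execution model (this is a sketch, not the code the XJava compiler actually emits), a pipeline such as read(fin) => encode(key) => write(fout) can be pictured as stages connected by bounded queues, with one executor thread per stage; the class and method names below are invented for the example:

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.function.UnaryOperator;

// Illustration only: a pipeline stage reads from an input queue (its input
// stream), applies its 'work' block, and 'push'es results downstream.
public class PipelineSketch {
    static final String EOS = "\u0000EOS";  // end-of-stream marker

    static Thread stage(BlockingQueue<String> in, BlockingQueue<String> out,
                        UnaryOperator<String> work) {
        Thread t = new Thread(() -> {
            try {
                for (String s = in.take(); !s.equals(EOS); s = in.take())
                    out.put(work.apply(s));   // corresponds to 'push'
                out.put(EOS);                 // propagate end of stream
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        t.start();
        return t;
    }

    // Source => an upper-casing stage => sink; returns the collected output.
    public static List<String> run(List<String> input) {
        BlockingQueue<String> q1 = new ArrayBlockingQueue<>(16);
        BlockingQueue<String> q2 = new ArrayBlockingQueue<>(16);
        List<String> result = Collections.synchronizedList(new ArrayList<>());
        Thread s1 = stage(q1, q2, String::toUpperCase);
        Thread sink = new Thread(() -> {
            try {
                for (String s = q2.take(); !s.equals(EOS); s = q2.take())
                    result.add(s);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        sink.start();
        try {
            for (String s : input) q1.put(s);
            q1.put(EOS);
            s1.join(); sink.join();
        } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a", "b", "c")));  // [A, B, C]
    }
}
```

Because each stage runs on its own thread and the queues are bounded, the sketch also shows where the tuning parameters of Section 4 arise: queue capacity, threads per stage, and stage count are all knobs.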

3 Tuning Challenges

A common reason for poor performance of parallel applications is poor adaptation of the parallel code to the underlying hardware platform. With the parallelization of an application, a large number of performance-relevant tuning parameters arise, e.g., how many threads are used for a particular calculation, how to set the size of data partitions, how many stages a pipeline requires, or how to accomplish load balancing for worker threads. Manual tuning is tedious, costly, and, due to the large number of possible parameter configurations, often hopeless.

To automate the optimization process, search-based automatic performance tuning (auto-tuning) [23,1,20,22] is a promising approach. Auto-tuning is a feedback-directed process consisting of several steps: choice of a parameter configuration, program execution, performance monitoring, and generation of a new configuration based on search algorithms such as hill climbing or simulated annealing. Experiments with real-world parallel applications have shown that, using appropriate tuning techniques, a significant performance gain can be achieved on top of "plausible" configurations chosen by the programmer [13,18].

However, as the diversity of application areas for parallelism has grown and the available parallel platforms differ in many respects (e.g., in number or type of cores, cache architecture, available memory, or operating system), the number of targets to optimize for is large. Optimizations made for a certain machine may cause a slowdown on another machine. Thus, a program optimized for a particular hardware platform usually has to be re-tuned on other platforms. For illustration, consider a parallel program with only one tuning parameter t that adjusts the number of concurrent threads. While the best configuration for t on a 4-core machine is probably a value close to 4, this configuration might be suboptimal for a machine with 16 cores. From the auto-tuner's perspective, t represents a set of values to choose from. If the tuner knew the purpose of t, it would be able to configure t directly in relation to the number of cores, providing significantly improved performance.
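The feedback loop just described, pick a configuration, run, measure, move to a better neighbor, can be sketched as a one-parameter hill climber. This is a generic illustration, not the authors' tuner; the cost function stands in for an actual timed program run:

```java
// Illustrative hill climber over a single integer tuning parameter t.
// 'measure' stands in for executing the program and timing one run.
public class HillClimb {
    public static int tune(java.util.function.IntToDoubleFunction measure,
                           int t, int lo, int hi) {
        double best = measure.applyAsDouble(t);   // initial tuning run
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int cand : new int[]{t - 1, t + 1}) {  // neighbors of t
                if (cand < lo || cand > hi) continue;
                double cost = measure.applyAsDouble(cand);  // one more run
                if (cost < best) { best = cost; t = cand; improved = true; }
            }
        }
        return t;  // local (here: global) optimum of the convex cost
    }

    public static void main(String[] args) {
        // Convex toy cost, minimal at t = 6 (e.g., 4 cores with some
        // oversubscription); a real tuner would measure execution time.
        int best = tune(t -> (t - 6) * (t - 6), 1, 1, 32);
        System.out.println(best);  // 6
    }
}
```

Note that every call to measure is a full tuning run of the program, which is exactly the cost that the language-based mechanism of Section 4 avoids.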


To tackle the problem of optimization portability, recent approaches propose the use of tuning heuristics that exploit information about the purpose and impact of tuning parameters [17]. This context information helps to configure parameters implicitly, without enumerating and testing their entire value range.

4 Language-Based Tuning Mechanism

We propose an approach that exploits tuning-relevant context information from XJava's high-level parallel language constructs (cf. Section 2). Relevant tuning parameters are automatically inferred and implicitly set by the runtime system (XJavaRT). Therefore, porting an XJava application to another machine requires less re-tuning, and in several cases no re-tuning at all. Figure 2 illustrates the concept of our approach (b) in contrast to feedback-directed auto-tuning (a). Our work focuses on task and pipeline parallelism; both forms of parallelism are widely used. Task parallelism refers to tasks whose computations are independent of each other. Pipeline parallelism refers to tasks with input-output dependencies, i.e., the output of one task serves as the input of the next task.




Fig. 2. Adapting a parallel program P to different target platforms M1, M2, M3. (a) A search-based auto-tuner requires the explicit declaration of tuning parameters a, b, c. The auto-tuner needs to perform several feedback-directed tuning runs on each platform to find the best configuration. (b) In our approach, we use compiler knowledge to automatically infer relevant tuning parameters and context information about the program's parallel structure. The parameters are set by the runtime system XJavaRT, which uses tuning heuristics that depend on the characteristics of the target platform.

First, we describe essential types of tuning parameters for these forms of parallelism (Section 4.1). Then, we show how the XJava compiler infers tuning parameters and context information from code (Sections 4.2 and 4.3). Finally, we describe heuristics to set the tuning parameters (Section 4.4).

4.1 Tuning Parameters

Tuning parameters are program variables that may influence performance. In our work, we distinguish between explicit and implicit tuning parameters. The former have to be specified and configured by the programmer; the latter are invisible to the programmer and set automatically. In the following, we describe essential types of tuning parameters for task and pipeline parallelism [13,17].

Thread count (TC). The total number of threads executing an application strongly influences its performance. Underestimating this number will limit speedup; overestimating it might slow down the program due to synchronization overhead and memory consumption.

Load balancing strategy (LB). The load balancing strategy determines how to distribute workload to execution threads or CPU cores. Load balancing can be done statically, e.g., in a round-robin style, or dynamically, e.g., in a first-come-first-served fashion or combined with work stealing.

Cut-off depth (CO). Parallel applications typically employ parallelism on different levels. Low-level parallelism can have a negative impact on performance if the synchronization and memory costs are higher than the additional speedup of concurrent execution. In other words, there is a level CO at which parallelism is no longer worthwhile and a serial execution of the code is preferable.

Stage replicates (SR). The throughput and speedup achieved by a pipeline are limited by its slowest stage. If this stage is stateless, it can be replicated in order to be executed by more than one thread. The parameter SR denotes the number of replicates.

Stage fusion (SF). From the programmer's perspective, the conceptual layout of a pipeline usually consists of n stages s1, ..., sn. However, mapping each stage si to one thread may not be the best configuration. Instead, fusing some stages could reduce bottleneck effects. Stage fusion represents functional composition of stages and is similar to the concept of filter fusion [14].

Data size (DS).
Parallel programs often process a large amount of data that needs to be decomposed into smaller partitions. The data partition size typically affects the program's performance. The applications considered here expose up to 14 parameters that need to be tuned (cf. Section 5). Note that one application can contain several parameters of the same type. The following sections show how our approach automatically infers and sets these parameters, except DS. As the most appropriate size of data partitions depends on the type of application, we leave this issue to the programmer or to further tuning. The XJava programmer must define separate tasks for decomposing and merging data.
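The explicit/implicit split described above can be summarized in a small data model. This is purely an illustration of the taxonomy, not XJava's internal representation; the class and field names are invented:

```java
// Illustrative model of the parameter types from this section, marked by
// who sets them. Only DS (data size) remains an explicit parameter.
public class TuningParams {
    enum Kind {
        TC(true),   // thread count
        LB(true),   // load balancing strategy
        CO(true),   // cut-off depth
        SR(true),   // stage replicates
        SF(true),   // stage fusion
        DS(false);  // data size: left to the programmer
        final boolean implicit;
        Kind(boolean implicit) { this.implicit = implicit; }
    }

    public static long implicitCount() {
        return java.util.Arrays.stream(Kind.values())
                               .filter(k -> k.implicit).count();
    }

    public static void main(String[] args) {
        System.out.println(implicitCount() + " of " + Kind.values().length);  // 5 of 6
    }
}
```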

4.2 Inferring Tuning Parameters from XJava Code

The XJava compiler generates Java code and adds tuning parameters. Task parallel statements are instrumented with the parameter cut-off depth (CO).


Fig. 3. Inferring tuning parameters and context information from XJava code

When compiling a pipeline statement consisting of n stages s1, ..., sn, the stages s2, ..., sn are instrumented with the boolean tuning parameter stage fusion (SF), indicating whether that stage should be fused with the previous one. In addition, the parameter stage replicates (SR) is added to each stage declared as replicable. Figure 3 illustrates the parameter inference for a task parallel statement and a pipeline. A task parallel statement p() ||| q() is instrumented with the parameter CO. Depending on its value, that statement executes either concurrently or sequentially if the cut-off depth is reached. A pipeline a() => b()+ => c()+ => d() compiles to a set of four task instances a, b, c and d. Since b and c are replicable, a tuning parameter SR is added to them. In addition, b, c and d get a boolean parameter SF defining whether to fuse that stage with the previous one. The parameters TC and LB for the overall number of threads and the load balancing strategy affect both task parallel statements and pipelines. In Section 4.4, we describe the heuristics used to set the parameters.
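A compiler pass attaching these parameters to a pipeline can be sketched as follows. This is an illustration of the inference rules just described, not XJava's actual compiler code; the classes and the "+"-suffix encoding of replicable calls are invented for the example:

```java
import java.util.*;

// Illustrative pass: given stage names with '+' marking replicable calls
// (as in a() => b()+ => c()+ => d()), attach SF to every stage but the
// first and SR to each replicable stage.
public class InferParams {
    static class Stage {
        final String name;
        final boolean hasSF;  // fuse-with-previous parameter?
        final boolean hasSR;  // stage-replicates parameter?
        Stage(String name, boolean hasSF, boolean hasSR) {
            this.name = name; this.hasSF = hasSF; this.hasSR = hasSR;
        }
        public String toString() {
            return name + (hasSR ? "[SR]" : "") + (hasSF ? "[SF]" : "");
        }
    }

    public static List<Stage> infer(String... calls) {
        List<Stage> stages = new ArrayList<>();
        for (int i = 0; i < calls.length; i++) {
            boolean replicable = calls[i].endsWith("+");
            String name = replicable
                    ? calls[i].substring(0, calls[i].length() - 1) : calls[i];
            stages.add(new Stage(name, i > 0, replicable));  // SF only for s2..sn
        }
        return stages;
    }

    public static void main(String[] args) {
        System.out.println(infer("a", "b+", "c+", "d"));
        // [a, b[SR][SF], c[SR][SF], d[SF]]
    }
}
```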

4.3 Inferring Context Information

Besides inferring tuning parameters, the XJava compiler exploits context information about the program's parallel structure. The compiler makes this knowledge available at runtime to set tuning parameters appropriately. The context information of a task call includes several aspects: (1) the purpose of the task (pipeline stage or part of a task-parallel section), (2) input and output dependences, (3) periodic or non-periodic task, (4) level of parallelism, and (5) current workload of the task. Aspects 1–3 can be inferred at compile time, aspects 4 and 5 at runtime. However, XJavaRT has access to all of this information. Figure 3 sketches potential context information for tasks.

4.4 Tuning Heuristics

Thread count (T C). XJavaRT provides a global thread pool to control the total number of threads and to monitor the numbers of running and idle threads at any time. XJavaRT knows the number n of a machine’s CPU cores and


therefore uses the heuristic TC = n · α for some α ≥ 1. We use α = 1.5 as a predefined value.

Load balancing (LB). XJavaRT employs different load balancing strategies depending on the corresponding context information. For recursive task parallelism, such as divide-and-conquer algorithms, XJavaRT applies a work stealing mechanism based on the Java fork/join framework [8]. For pipelines, XJavaRT prefers to execute stages with higher workloads, thus implementing a dynamic load balancing strategy.

Cut-off depth (CO). XJavaRT dynamically determines the cut-off depth for task parallel expressions to decide whether to execute a task parallel statement concurrently or in sequential order. Since XJavaRT keeps track of the number of idle executor threads, it applies the heuristic CO = ∞ if idle threads exist, and CO = l otherwise (where l is the nesting level of the task parallel expression). In other words, tasks are executed sequentially if there are no executor threads left.

Stage replicates (SR). When a replicable task is called, XJavaRT creates SR = i replicates of the task, where i denotes the number of idle executor threads. If there are no idle threads, i.e., all CPU cores are busy, no replicates will be created. XJavaRT uses a priority queue, putting tasks with lower workload (i.e., few data items waiting at their input port) at the end. This mechanism does not always achieve optimal results, but seems effective in practice, as our results show.
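The numeric heuristics above can be written out directly. The formulas (TC = n · α with α = 1.5, CO = ∞ given idle threads, SR = number of idle threads) come from the text; the class and method names are invented for this sketch:

```java
// Illustrative encoding of the runtime heuristics from this section.
public class Heuristics {
    static final double ALPHA = 1.5;  // predefined oversubscription factor

    // TC = n * alpha for a machine with n CPU cores.
    public static int threadCount(int cores) {
        return (int) Math.ceil(cores * ALPHA);
    }

    // CO = "infinity" if idle executor threads exist, else the nesting
    // level l of the task parallel expression (run it sequentially).
    public static int cutOff(int idleThreads, int nestingLevel) {
        return idleThreads > 0 ? Integer.MAX_VALUE : nestingLevel;
    }

    // SR = number of idle executor threads; zero when all cores are busy.
    public static int stageReplicates(int idleThreads) {
        return idleThreads;
    }

    public static void main(String[] args) {
        System.out.println(threadCount(4));  // 6 on a quad-core machine
        System.out.println(cutOff(0, 3));    // 3: no idle threads, go sequential
    }
}
```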

Fig. 4. Stage fusion for a pipeline a() => b()+ => c()+ => d(). (a) Stage replication without fusion introduces overhead for splitting and joining data items. (b) Stage fusion prior to replication removes some of this overhead.

Stage fusion (SF ). In a pipeline consisting of several stages, combining two or more stages into a single stage can increase performance, as the overhead for split-join operations is reduced. Therefore, XJava fuses consecutive replicable tasks within a pipeline expression to create a single replicable task. Figure 4 illustrates this mechanism for a pipeline a() => b()+ => c()+ => d().
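The fusion rule, collapse runs of consecutive replicable stages into one replicable stage, is a simple list transformation. The sketch below uses the same invented "+"-suffix encoding of replicable stages as before and is an illustration, not XJavaRT code:

```java
import java.util.*;

// Illustrative fusion pass: consecutive replicable stages (marked '+')
// collapse into one fused stage that remains replicable, as in Figure 4.
public class StageFusion {
    public static List<String> fuse(List<String> stages) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < stages.size()) {
            if (stages.get(i).endsWith("+")) {
                StringBuilder fused = new StringBuilder();
                // Absorb the whole run of adjacent replicable stages.
                while (i < stages.size() && stages.get(i).endsWith("+")) {
                    fused.append(stages.get(i), 0, stages.get(i).length() - 1);
                    i++;
                }
                out.add(fused + "+");  // the fused stage stays replicable
            } else {
                out.add(stages.get(i++));  // non-replicable stages pass through
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fuse(Arrays.asList("a", "b+", "c+", "d")));  // [a, bc+, d]
    }
}
```

For a() => b()+ => c()+ => d() this yields three stages, so replicating the fused bc stage no longer requires the split-join traffic between b and c.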

5 Experimental Results

We evaluate our approach using a set of seven benchmarks that cover a wide range of parallel applications, including algorithmic problems such as sorting


or matrix multiplication, as well as real-world applications for raytracing, video processing, and cryptography. The applications use task, data, or pipeline parallelism. We measure two metrics:

Implicit tuning parameters. We count the number of automatically handled tuning parameters as a metric for the simplification of the optimization process. If more tuning parameters are automated, fewer optimizations have to be performed manually.

Performance. For each application, we compared a sequential version to an XJava version and measured the speedups heur and best:
– heur: speedups of the XJava programs using our heuristic-based approach. These programs did not require any manual adjustments.
– best: speedups achieved for the best parameter configuration found by an auto-tuner performing an exhaustive search.

The speedups over the sequential versions were measured on three different parallel platforms: (1) an Intel Quadcore Q6600 with 2.40 GHz, 4 GB RAM and Windows 7 Professional 64 Bit; (2) a dual Intel Xeon Quadcore E5320 with 1.86 GHz, 8 GB RAM and Ubuntu Linux 7.10; (3) a Sun Niagara T2 with 8 cores (each capable of 8 threads), 1.2 GHz, 16 GB RAM and Solaris 10.

5.1 Benchmarked Applications

MSort and QSort implement the recursive mergesort and quicksort algorithms to sort a randomly generated array with approximately 33.5 million integer values. Matrix multiplies two matrices based on a master-worker configuration, where the master divides the final matrix into areas and assigns them to workers. MBrot computes the Mandelbrot set for a given resolution and a maximum number of 1000 iterations. LRay is a lightweight raytracer entirely written in Java. MBrot and LRay both use the master-worker pattern by letting the master divide the image into multiple blocks, which are then computed concurrently by workers.

The applications Video and Crypto use pipeline parallelism. Video combines multiple frames into a slideshow while applying several filters, such as scaling and sharpening, to each of the video frames. The resulting pipeline contains eight stages, five of which are data parallel and can be replicated. Crypto applies multiple encryption algorithms from the javax.crypto package to a 60 MB text file that is split into 5 KB blocks. The pipeline has seven stages; all stages except those for input and output are replicable.

5.2 Results

Implicit tuning parameters. Depending on the parallelization strategy, the programs expose different tuning parameters. Figure 5 shows the numbers of explicit and implicit parameters for each application. Explicit parameters are declared and set in the program code. Implicit parameters do not appear in the code; they are automatically inferred by the compiler and set using our approach.

Fig. 5. Explicit and implicit tuning parameters for the benchmarked applications. On average, our approach infers and sets 91% of the parameters automatically.

On average, the number of explicit tuning parameters is reduced by 91%, ranging from 67% to 100%. Our mechanism automatically infers and sets all parameters except the data size (DS). For Matrix, these are the sizes of the parts of the matrix to be computed by a worker; for MBrot and LRay, these are the sizes of the image blocks calculated concurrently. In Crypto, the granularity is determined by the size of the data blocks that are sent through the pipeline. As Video decomposes the video data frame by frame, there is no need for an explicit tuning parameter to control the data size.

Fig. 6. Execution times (milliseconds) of the sequential benchmark programs

Performance. Figure 6 lists the execution times of the sequential benchmark programs. Figure 7 shows the speedups for the corresponding XJava versions on the three parallel platforms. Using our approach, the XJava programs achieve an average speedup of about 3.5 on the Q6600 quad-core, 5.0 on the E5320 dual quad-core, and 17.5 on the Niagara T2. The automatic replication of XJava tasks achieves good utilization of the available cores in the master-worker and pipeline applications, although the round-robin distribution of items leads to suboptimal load balancing in the replicated stages. The blocks in Crypto are of equal size, leading to an even workload. The frames in Video have different dimensions, resulting in slightly lower speedups. To examine the quality of our heuristics, we used a script-based auto-tuner performing an exhaustive search to find the best parameter configuration. We


Fig. 7. Performance of our heuristic-based approach (heur ) in comparison to the best configuration found by a search-based auto-tuner

observed the largest performance differences for QSort and for LRay on the E5320 machine. In general, QSort benefits from further increasing the cut-off threshold, as more tasks allow better load balancing with work stealing. For LRay, reducing the number of workers by one increases the speedup from 3.4 to 5; we attribute this behavior to poor cache usage or other memory bottlenecks when using too many threads. In all other cases, the search-based auto-tuner achieved only minor additional speedups compared to our language-based tuning mechanism. In total, the mean error of the heuristic-based configurations relative to the best configurations is 9% on the E5320 dual quad-core, 7% on the Niagara T2, and 4% on the Q6600 quad-core. That is, our approach achieves 93% of the optimal performance.

6 Related Work

Auto-tuning has been investigated mainly in the area of numerical software and high-performance computing. Consequently, many approaches (such as ATLAS [23], FFTW [5], or FIBER [7]) focus on tuning particular types of algorithms rather than entire parallel applications. Datta et al. [4] address auto-tuning and optimization strategies for stencil computations on multicore architectures. MATE [10] uses a model-based approach to dynamically optimize distributed master/worker applications by predicting their performance. However, optimizing other types of parallel patterns requires the creation of new analytic models, and MATE does not target multicore systems.


Atune [17,19] introduces tuning heuristics to improve the search-based auto-tuning of parallel applications. However, Atune needs a separate configuration language and an offline auto-tuner. Stream languages such as StreamIt [21,6] provide explicit syntax for data, task, and pipeline parallelism. Optimizations are done at compile time for a given machine; dynamic adjustments are typically not addressed. Libraries such as java.util.concurrent [9] or TBB [16] provide constructs for high-level parallelism, but do not exploit context information and still require explicit tuning. Languages such as Chapel [2], Cilk [15] and X10 [3] focus on task and data parallelism but not on explicit pipelining, and they do not support tuning parameter inference.

7 Conclusion

Tuning parallel applications is essential to achieve the best performance on a particular platform. In this paper, we presented a language-based tuning mechanism for essentially any kind of application employing task and pipeline parallelism. Our approach automatically infers tuning parameters and corresponding context information from high-level parallel language constructs. Using appropriate heuristics, the tuning parameters are set at runtime. We implemented our technique as part of the XJava compiler and runtime system.

We evaluated our approach on seven benchmark programs covering different types of parallelism. Our tuning mechanism infers and sets over 90% of the relevant tuning parameters automatically. The average performance achieves 93% of the actual optimum, drastically reducing the need for further tuning. If further search-based tuning is still required, our approach provides a good starting point. Future work will address the support of further tuning parameters (such as data size), the refinement of tuning heuristics, and the integration of a feedback-driven online auto-tuner.

References

1. Asanovic, K., Bodik, R., Catanzaro, B.C., Gebis, J.J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., Williams, S.W., Yelick, K.A.: The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report, University of California, Berkeley (2006)
2. Chamberlain, B.L., Callahan, D., Zima, H.P.: Parallel Programmability and the Chapel Language. Int. J. High Perform. Comput. Appl. 21(3) (August 2007)
3. Charles, P., Grothoff, C., Saraswat, V., Donawa, C., Kielstra, A., Ebcioglu, K., von Praun, C., Sarkar, V.: X10: An Object-Oriented Approach to Non-Uniform Cluster Computing. In: Proc. OOPSLA 2005. ACM, New York (2005)
4. Datta, K., Murphy, M., Volkov, V., Williams, S., Carter, J., Oliker, L., Patterson, D., Shalf, J., Yelick, K.: Stencil Computation Optimization and Auto-tuning on State-of-the-Art Multicore Architectures. In: Proc. Supercomputing Conference (2008)
5. Frigo, M., Johnson, S.G.: FFTW: An Adaptive Software Architecture for the FFT. In: Proc. ICASSP, vol. 3 (May 1998)
6. Gordon, M.I., Thies, W., Amarasinghe, S.: Exploiting Coarse-grained Task, Data, and Pipeline Parallelism in Stream Programs. In: Proc. ASPLOS-XII. ACM, New York (2006)
7. Katagiri, T., Kise, K., Honda, H., Yuba, T.: FIBER: A Generalized Framework for Auto-tuning Software. In: Proc. International Symposium on HPC (2003)
8. Lea, D.: A Java fork/join Framework. In: Proc. Java Grande 2000. ACM, New York (2000)
9. Lea, D.: The java.util.concurrent Synchronizer Framework. Sci. Comput. Program. 58(3) (2005)
10. Morajko, A., Margalef, T., Luque, E.: Design and Implementation of a Dynamic Tuning Environment. Parallel and Distributed Computing 67(4) (2007)
11. Otto, F., Pankratius, V., Tichy, W.F.: High-level Multicore Programming with XJava. In: Comp. ICSE 2009, New Ideas and Emerging Results. ACM, New York (2009)
12. Otto, F., Pankratius, V., Tichy, W.F.: XJava: Exploiting Parallelism with Object-Oriented Stream Programming. In: Sips, H., Epema, D., Lin, H.-X. (eds.) Euro-Par 2009 Parallel Processing. LNCS, vol. 5704, pp. 875–886. Springer, Heidelberg (2009)
13. Pankratius, V., Schaefer, C.A., Jannesari, A., Tichy, W.F.: Software Engineering for Multicore Systems: An Experience Report. In: Proc. IWMSE 2008. ACM, New York (2008)
14. Proebsting, T.A., Watterson, S.A.: Filter Fusion. In: Proc. Symposium on Principles of Programming Languages (1996)
15. Randall, K.: Cilk: Efficient Multithreaded Computing. PhD Thesis, Dept. EECS, MIT (1998)
16. Reinders, J.: Intel Threading Building Blocks. O'Reilly Media, Inc., Sebastopol (2007)
17. Schaefer, C.A.: Reducing Search Space of Auto-Tuners Using Parallel Patterns. In: Proc. IWMSE 2009. ACM, New York (2009)
18. Schaefer, C.A., Pankratius, V., Tichy, W.F.: Atune-IL: An Instrumentation Language for Auto-Tuning Parallel Applications. In: Sips, H., Epema, D., Lin, H.-X. (eds.) Euro-Par 2009 Parallel Processing. LNCS, vol. 5704, pp. 9–20. Springer, Heidelberg (2009)
19. Schaefer, C.A., Pankratius, V., Tichy, W.F.: Engineering Parallel Applications with Tunable Architectures. In: Proc. ICSE 2010. ACM, New York (2010)
20. Tapus, C., Chung, I., Hollingsworth, J.K.: Active Harmony: Towards Automated Performance Tuning. In: Proc. Supercomputing Conference (2002)
21. Thies, W., Karczmarek, M., Amarasinghe, S.: StreamIt: A Language for Streaming Applications. In: Horspool, R.N. (ed.) CC 2002. LNCS, vol. 2304, p. 179. Springer, Heidelberg (2002)
22. Werner-Kytola, O., Tichy, W.F.: Self-tuning Parallelism. In: Williams, R., Afsarmanesh, H., Bubak, M., Hertzberger, B. (eds.) HPCN-Europe 2000. LNCS, vol. 1823, p. 300. Springer, Heidelberg (2000)
23. Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated Empirical Optimizations of Software and the ATLAS Project. Journal of Parallel Computing 27 (2001)

A Study of a Software Cache Implementation of the OpenMP Memory Model for Multicore and Manycore Architectures

Chen Chen¹, Joseph B. Manzano², Ge Gan², Guang R. Gao², and Vivek Sarkar³

¹ Tsinghua University, Beijing 100084, P.R. China
² University of Delaware, Newark DE 19716, USA
³ Rice University, Houston TX 77251, USA

Abstract. This paper is motivated by the desire to provide an efficient and scalable software cache implementation of OpenMP on multicore and manycore architectures in general, and on the IBM CELL architecture in particular. In this paper, we propose an instantiation of the OpenMP memory model with the following advantages: (1) The proposed instantiation prohibits undefined values that may cause problems of safety, security, programming and debugging. (2) The proposed instantiation is scalable with respect to the number of threads because it does not rely on communication among threads or a centralized directory that maintains consistency of multiple copies of each shared variable. (3) The proposed instantiation avoids the ambiguity of the original memory model definition proposed on the OpenMP Specification 3.0. We also introduce a new cache protocol for this instantiation, which can be implemented as a software-controlled cache. Experimental results on the Cell Broadband Engine show that our instantiation results in nearly linear speedup with respect to the number of threads for a number of NAS Parallel Benchmarks. The results also show a clear advantage when comparing it to a software cache design derived from a stronger memory model that maintains a global total ordering among flush operations.

P. D’Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 341–352, 2010. © Springer-Verlag Berlin Heidelberg 2010

1 Introduction

An important open problem for future multicore and manycore chip architectures is the development of shared-memory organizations and memory consistency models (or memory models for short) that are effective for small local memory sizes in each core, scalable to a large number of cores, and still productive for software to use. Although strong memory models such as Sequential Consistency (SC) [1] are supported on mainstream small-scale SMPs, it seems likely that weaker memory models will be explored in current and future multicore and manycore architectures such as the Cell Broadband Engine [2], Tilera [3], and Cyclops64 [4]. OpenMP [5] is a natural candidate as a programming model for multicore and manycore processors with software-managed local memories, thanks to its weak memory model. In the OpenMP memory model, each thread may maintain a temporary view of the shared memory which “allows the thread to cache variables and thereby to avoid going to memory for every reference to a variable” [5]. It includes a flush operation on


a specified flush-set that can be used to synchronize the temporary view with the shared memory for the variables in the flush-set. It is a weak consistency model “because a thread’s temporary view of memory is not required to be consistent with memory at all times” [5]. This relaxation of the memory consistency constraints gives computer system designers room to experiment with a wide range of caching schemes, each of which has a different performance and cost tradeoff. Therefore, the OpenMP memory model can exhibit very different instantiations, each of which is a memory model that is stronger than the OpenMP memory model; i.e., any legal value under an instantiation is also a legal value under the OpenMP memory model, but not vice versa. Among the various instantiations of the OpenMP memory model, an important problem is to find one that can be efficiently implemented on multicore and manycore architectures and easily understood by programmers.

1.1 A Key Observation for Implementing the Flush Operation Efficiently

The flush operation synchronizes temporary views with the shared memory, so it is more expensive than read and write operations. In order to implement the OpenMP memory model efficiently, an instantiation should be able to implement the flush operation efficiently. Unfortunately, the OpenMP memory model has a serialization requirement for flush operations, i.e., “if the intersection of the flush-sets of two flushes performed by two different threads is non-empty, then the two flushes must be completed as if in some sequential order, seen by all threads” [5]. Because of this requirement, it seems very hard to implement the flush operation efficiently. However, the requirement has a hidden meaning that is not clearly explained in [5], and this hidden meaning is the key to implementing the flush operation efficiently. We use an example to explain the real meaning of the serialization requirement.
For the program in Fig. 1, it seems that the final state of the shared memory must be either x = y = 1 or x = y = 2 according to the serialization requirement. However, discussion with the OpenMP community confirmed that x = 1, y = 2 and x = 2, y = 1 are also legal results under the OpenMP memory model. The reason is that the OpenMP memory model allows flush operations to be completed earlier (but not later) than the flush points (statements 3 and 6 in this program). Therefore, one possible way to get the result x = 1, y = 2 is that first thread 2 assigns 2 to x and immediately flushes x into the shared memory, then thread 1 assigns 1 to x and 1 to y and then flushes x and y, and finally thread 2 assigns 2 to y and flushes y. From this we derive a key observation for implementing the flush operation efficiently, as follows.

Thread 1            Thread 2
1: x = 1;           4: x = 2;
2: y = 1;           5: y = 2;
3: flush(x,y);      6: flush(x,y);

Is x = 1, y = 2 (or x = 2, y = 1) legal under the OpenMP memory model?

Fig. 1. A motivating example for understanding the serialization requirement under the OpenMP memory model
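To make the allowed interleaving concrete, here is a small Python sketch (our own illustration, not from the paper) of temporary views with per-location flushes. It replays the schedule described above, in which thread 2's flush of x completes early, and ends with x = 1, y = 2 in the shared memory.

```python
# Minimal model of OpenMP temporary views and per-location flushes.
# Each thread buffers writes in its own temporary view; a flush on one
# location writes that location back to shared memory and discards it.

shared = {"x": 0, "y": 0}

class Thread:
    def __init__(self):
        self.view = {}          # temporary view: location -> dirty value
    def write(self, loc, val):
        self.view[loc] = val
    def flush(self, loc):
        if loc in self.view:    # write back the dirty value, then discard it
            shared[loc] = self.view.pop(loc)

t1, t2 = Thread(), Thread()

# One interleaving allowed by the model: thread 2's flush of x is
# completed *early* (before its flush point), yielding x = 1, y = 2.
t2.write("x", 2); t2.flush("x")            # early flush of x
t1.write("x", 1); t1.write("y", 1)
t1.flush("x");    t1.flush("y")            # thread 1's flush point
t2.write("y", 2); t2.flush("y")            # rest of thread 2's flush

print(shared)   # {'x': 1, 'y': 2}
```

Because each flush touches a single location, the per-location flushes of x are serialized trivially, yet the flush-sets {x, y} of the two threads need not complete as monolithic units.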


The Key Observation: A flush operation on a flush-set of shared locations can be decomposed into unordered flush operations on each individual location. Each flush operation after decomposition must be completed no later than the flush point of the original flush operation. Assuming that a memory location is the minimal unit for atomic memory accesses, the serialization requirement is naturally satisfied.

1.2 Main Contributions

In this paper, we propose an instantiation of the OpenMP memory model based on the key observation in Section 1.1. It has the following advantages.

– Our instantiation prohibits undefined values that may cause problems for safety, security, programming, and debugging. The OpenMP memory model may allow programs with data races to generate undefined values. In our instantiation, however, every value returned by a read is either the initial value or a value that was previously written by some thread. Since the OpenMP memory model allows programs with data races¹, our instantiation is helpful when programming in such cases.
– Our instantiation is scalable with respect to the number of threads because it does not rely on communication among threads or a centralized directory that maintains consistency of multiple copies of each shared variable.
– Our instantiation avoids the ambiguity of the original memory model definition in the OpenMP Specification 3.0, such as the unclear serialization requirement, the problem of handling temporary view overflow, and the unclear semantics for programs with data races. Therefore, our instantiation is easy to understand from the angle of efficient implementations.

We also propose a cache protocol for the instantiation and implement it as a software-controlled cache on the Cell Broadband Engine. The experimental results show that our instantiation achieves nearly linear speedup with respect to the number of threads for a number of NAS Parallel Benchmarks.
The results also show a clear advantage when comparing it to a software cache design derived from a stronger memory model that maintains a global total ordering among flush operations. The rest of the paper is organized as follows. Section 2 introduces our instantiation of the OpenMP memory model. Section 3 introduces the cache protocol of the instantiation. Section 4 presents the experimental results. Section 5 discusses the related work. The conclusion is presented in Section 6.

2 Formalization of Our OpenMP Memory Model Instantiation

A necessary prerequisite for building OpenMP’s software cache implementations is the availability of formal memory models that establish the legality conditions for determining whether an implementation is correct. As observed in [6], “it is impossible to verify OpenMP applications formally since the prose does not provide a formal consistency

¹ Section 2.8.6 of the OpenMP specification 3.0 [5] shows a program with data races that implements critical sections.


model that precisely describes how reads and writes on different threads interact”. While there is general agreement that the OpenMP memory model is based on temporary views and flush operations, discussion with OpenMP experts led us to conclude that the OpenMP specification provides a lot of leeway on when flush operations can be performed and on the inclusion of additional flush operations (not specified by the programmer) to deal with local memory size constraints. In this section, we formalize an instantiation of the OpenMP memory model, ModelLF, based on the key observation in Section 1.1. ModelLF builds on OpenMP’s relaxed-consistency memory model, in which each worker thread maintains a temporary view of shared data that may not always be consistent with the actual data stored in the shared memory. The OpenMP flush operation is used to establish consistency between these temporary views and the shared memory at specific program points. In ModelLF, each flush operation only forces the local temporary view to be consistent with the shared memory; that is why we call it ModelLF, where “LF” means local flush. A flush operation is applied to only a single location. We assume that a memory location is the minimal unit for atomic memory accesses; therefore, the serialization requirement of flush operations is naturally satisfied. A flush operation on a set of shared locations is decomposed into unordered flush operations on each individual location, where the flush operations after decomposition must be completed no later than the flush point of the original flush operation. This avoids the known problem of decomposition explained in Section 2.8.6 of the OpenMP specification 3.0 [5], where the compiler may reorder the flush operations after decomposition to a position later than the flush point and cause incorrect semantics.

2.1 Operational Semantics of ModelLF

In this section, we define the operational semantics of ModelLF.
First, we introduce some background for the definition. A store, σ, is a mathematical representation of the machine’s shared memory, which maps memory location addresses to values (σ : addr → val). We model temporary views by introducing a distinct store, σi, for each worker thread Ti in an OpenMP parallel region. Following OpenMP’s convention, thread T0 is assumed to be the master thread. σi[l] represents the value stored in location l in thread Ti’s temporary view. The flush operation, flush(Ti, l), makes temporary view σi consistent with the shared memory σ on location l. Under ModelLF, program flush operations are performed at the program points specified by the programmer. Moreover, additional flush operations may be inserted nondeterministically by the implementation at any program point, which makes it possible to implement the memory model with bounded space for temporary views, such as caches. The operational semantics of the memory operations of ModelLF comprise the read, write, program flush, and nondeterministic flush operations, defined as follows:

– Memory read: If thread Ti needs to read the value of location l, it performs a read(Ti, l) operation on store σi. If σi does not contain a value for l, the value in the shared memory is loaded into σi and returned to the read operation.
– Memory write: If thread Ti needs to write value v to location l, it performs a write(Ti, v, l) operation on store σi.


– Program / Nondeterministic flush: If thread Ti needs to synchronize σi with the shared memory on a shared location l, it performs a flush(Ti, l) operation. If σi contains a “dirty value”² for l, it writes the value back into the shared memory. After the flush operation, σi discards the value of l. A thread performs program flush operations at program points specified by the programmer, and can nondeterministically perform flush operations at any program point. All the program and nondeterministic flush operations on the same shared location must be observed by all threads to complete in the same sequential order.
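The three operations above can be summarized in a small Python sketch (our own illustration; the names are not from the paper): a read loads a missing location from shared memory into the temporary view, a write marks it dirty, and a per-location flush writes back the dirty value and then discards the location from the view.

```python
# Sketch of the ModelLF operations from Section 2.1: sigma is the
# shared store, and each WorkerThread holds its temporary view sigma_i.

shared = {"l": 10}   # the store sigma: location -> value

class WorkerThread:
    def __init__(self):
        self.view = {}           # sigma_i: this thread's temporary view
        self.dirty = set()       # locations written but not yet flushed
    def read(self, l):
        if l not in self.view:               # miss: load from shared memory
            self.view[l] = shared[l]
        return self.view[l]
    def write(self, l, v):
        self.view[l] = v
        self.dirty.add(l)
    def flush(self, l):
        if l in self.dirty:                  # write back only dirty values
            shared[l] = self.view[l]
            self.dirty.discard(l)
        self.view.pop(l, None)               # always discard after a flush

t = WorkerThread()
assert t.read("l") == 10     # miss: loaded from shared memory
t.write("l", 42)
assert shared["l"] == 10     # the write stays in the temporary view
t.flush("l")
assert shared["l"] == 42     # the flush wrote the dirty value back
assert t.read("l") == 42     # the view was discarded, so this reloads
```

A nondeterministic flush would simply be the implementation calling `flush` at a point of its own choosing, e.g., when the bounded view runs out of space.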

3 Cache Protocol of ModelLF

In this section, we introduce the cache protocol that implements ModelLF. We assume that each thread has a cache which corresponds to its temporary view; performing operations on temporary views is therefore equivalent to performing those operations on the caches. Without loss of generality, in this section we assume that each operation is performed on one cache line, since an operation on one cache line can be decomposed into sub-operations, each of which is performed on a single location. We use per-location dirty bits in a cache line to take care of the decomposition problem.

3.1 Cache Line States

We assume that each cache line contains multiple locations. Each location contains a value that can be a “clean value”³, a “dirty value”, or an “invalid value”. Each cache line can be in one of the following five states.

Invalid: All the locations contain “invalid values”.
Clean: All the locations contain “clean values”.
Dirty: All the locations contain “dirty values”.
Clean-Dirty: Each location contains either a “clean value” or a “dirty value”.
Invalid-Dirty: Each location contains either an “invalid value” or a “dirty value”.

For simplicity, a cache line cannot be in any other state, such as Invalid-Clean. Additional nondeterministic flush operations may be performed when necessary to force the cache line into one of the five states above. We use a per-line flag bit together with the dirty bits to represent the state of the cache line. The flag bit indicates whether the non-dirty values in the cache line are clean or invalid.

3.2 Cache Operations and State Transitions

The state transition diagram of the ModelLF cache protocol is shown in Fig. 2. We now explain how each cache operation affects the state of a cache line.

Memory read: If the original state of the cache line is invalid or invalid-dirty, the invalid locations will load “clean values” from memory.
Therefore, the state will change to clean or clean-dirty, respectively. In other cases, the state will not change. After that, the values in the cache line will be returned.

² The term “dirty value” means that the value of location l was modified by thread Ti.
³ The term “clean value” means that the value was read but not modified by the thread.

[Figure: state transition diagram among the five cache line states (Invalid, Clean, Dirty, Clean-Dirty, Invalid-Dirty), with transitions labeled r (memory read), w (memory write), and f (flush).]

Fig. 2. State transition diagram for the cache protocol of ModelLF

Memory write: A write operation writes the specified “dirty values” to the cache line. If the original state is invalid or invalid-dirty, the line becomes either invalid-dirty or dirty after the write operation, depending on whether all the locations then contain “dirty values”. In the other cases, the state becomes either clean-dirty or dirty, again depending on whether all the locations contain “dirty values”.
Program / Nondeterministic flush: A flush operation forces all the “dirty values” of the cache line to be written back into memory. The state then becomes invalid. There may be various ways to implement the flush operation. For example, many architectures support writing back a block of data at a time, so one possible implementation is to write back the entire cache line being flushed, together with its dirty bits, and then merge the “dirty values” into the corresponding memory line in the shared memory. If merging in memory is not supported, a thread has to load the memory line, merge it with the cache line, and finally write back the merged line; this process must be atomic to handle the false sharing problem. For example, on the Cell processor, atomic DMA operations can be used to guarantee atomicity of the process.
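The five states need not be stored explicitly: as described above, they are recoverable from the per-location dirty bits plus the per-line flag bit. The following Python sketch (our own names, not the paper's) encodes that representation and walks one path through the transition diagram of Fig. 2.

```python
# Hedged sketch of the five cache line states from Section 3:
# per-location dirty bits plus one per-line flag bit saying whether
# the non-dirty locations are clean or invalid.

class CacheLine:
    def __init__(self, nlocs):
        self.nlocs = nlocs
        self.dirty = [False] * nlocs      # per-location dirty bits
        self.nondirty_clean = False       # flag bit for non-dirty locations

    def state(self):
        if all(self.dirty):
            return "Dirty"
        if any(self.dirty):
            return "Clean-Dirty" if self.nondirty_clean else "Invalid-Dirty"
        return "Clean" if self.nondirty_clean else "Invalid"

    def read(self):
        # a read fills all invalid locations with clean values from memory
        self.nondirty_clean = True

    def write(self, locs):
        # a write marks the written locations dirty
        for i in locs:
            self.dirty[i] = True

    def flush(self):
        # write back the dirty values (elided here), then invalidate the line
        self.dirty = [False] * self.nlocs
        self.nondirty_clean = False

line = CacheLine(4)
assert line.state() == "Invalid"
line.write([0])                      # Invalid --w--> Invalid-Dirty
assert line.state() == "Invalid-Dirty"
line.read()                          # Invalid-Dirty --r--> Clean-Dirty
assert line.state() == "Clean-Dirty"
line.write(range(4))                 # Clean-Dirty --w--> Dirty
assert line.state() == "Dirty"
line.flush()                         # Dirty --f--> Invalid
assert line.state() == "Invalid"
```

Note how the Invalid-Clean combination can never arise: a read makes all non-dirty locations clean at once, and a flush invalidates the whole line.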

4 Experimental Results and Analyses

In this section, we present our experimental results under the ModelLF cache protocol. Section 4.1 introduces the experimental testbed, Section 4.2 summarizes the major observations of our experiments, and the last two subsections present the details and analyses of those observations.

4.1 Experimental Testbed

The experimental results presented in this paper were obtained on the CBEA (Cell Broadband Engine Architecture) [2] under the OPELL (OPenmp for cELL) framework [7].

CBEA: The CBEA has a main processor called the Power Processing Element (PPE) and a number of co-processors called the Synergistic Processing Elements (SPEs). The PPE handles most of the computational workload and has control over the SPEs, i.e., it can start, stop, interrupt, and schedule processes on the SPEs. Each SPE has a 256KB local


storage which is used to store both instructions and data. An SPE can only access its own local storage directly. Both the PPE and the SPEs share main memory; SPEs access main memory via DMA (direct memory access) transfers, which are much slower than accesses to an SPE’s own local storage. We executed the programs on a PlayStation 3 [8], which has one 3.2 GHz Cell Broadband Engine CPU (with 6 accessible SPEs) and 256MB of global shared memory. Our experiments used all 6 SPEs, with the exception of the speedup evaluation, which used from 1 to 6 SPEs.
OPELL Framework: OPELL is an open source toolchain / runtime effort to implement OpenMP for the CBEA. OPELL has a single source compiler which compiles an OpenMP program to a single source file that is executable on the CBEA. At runtime, the executable starts by running the sequential region of the program on the PPE. Once the program enters a parallel region, the PPE assigns the tasks of computing the parallel code to the SPEs. After the SPEs finish the tasks, the parallel region ends and the PPE goes on to execute the following sequential region. Since each SPE has only 256KB of local storage for both instructions and data, OPELL has a partition/overlay manager runtime library that partitions the parallel code into small pieces to fit the local storage size, and loads and replaces those pieces on demand. Since a DMA transfer is much slower than an access to local storage, OPELL has a software cache runtime library to take advantage of locality. The runtime library manages a part of the local storage as a cache and provides a user interface for accesses. We implemented our cache protocol in OPELL’s software cache runtime library. The cache protocol uses 4-way set-associative caches. The size of each cache line is 128 bytes. We ran the experiments with various cache sizes ranging from 4KB to 64KB.
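As an illustration of the cache organization just described, the following sketch (the parameter names are ours, not OPELL's) maps an effective address to a set index and line offset for a 4-way set-associative cache with 128-byte lines; e.g., a 32KB cache has 32768 / (128 × 4) = 64 sets.

```python
# Hypothetical address mapping for a 4-way set-associative software
# cache with 128-byte lines, as in the evaluated configuration.

LINE_SIZE = 128          # bytes per cache line
WAYS = 4                 # associativity
CACHE_SIZE = 32 * 1024   # one of the evaluated sizes (4KB-64KB)

NUM_SETS = CACHE_SIZE // (LINE_SIZE * WAYS)   # 64 sets for 32KB

def map_address(addr):
    """Split an effective address into (tag, set index, line offset)."""
    offset = addr % LINE_SIZE
    line_no = addr // LINE_SIZE
    set_idx = line_no % NUM_SETS
    tag = line_no // NUM_SETS
    return tag, set_idx, offset

# two addresses NUM_SETS * LINE_SIZE apart compete for the same set
a, b = 0x1000, 0x1000 + NUM_SETS * LINE_SIZE
assert map_address(a)[1] == map_address(b)[1]
```

With 4 ways per set, up to four such conflicting lines can reside in the cache before an eviction (and hence a flush of any dirty values) is forced.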
We did not try bigger cache sizes because the local storage is very limited (256KB) and a part of it is used to store instructions and maintain the stack.
Benchmarks: We used three benchmark programs in our experiments: Integer Sort (IS), Embarrassingly Parallel (EP), and Multigrid (MG) from the NAS Parallel Benchmarks [9].

4.2 Summary of Main Results

The main results of our experiments are as follows:
Result I: Scalability (Section 4.3): The ModelLF cache protocol achieves nearly linear speedup with respect to the number of threads for the tested benchmarks.
Result II: Impact of Cache Size (Section 4.4): We use another instantiation of the OpenMP memory model, ModelGF⁴, to compare with ModelLF. ModelGF maintains a global total ordering among flush operations. The difference between ModelGF and ModelLF is that when ModelGF performs a flush operation on a location l, it forces the temporary views of all threads to see the same value of l by discarding the values of l in the temporary views. To implement ModelGF, we simulate a centralized directory

⁴ Operational semantics of ModelGF is defined in [10].


that maintains the information for all the caches. When a flush operation on a location l is performed, the directory informs all the threads that hold a value of l to discard it. We assume that the centralized directory is “ideal”, i.e., the cost of maintenance and lookup is negligible. However, the cost of informing a thread is as expensive as a DMA transfer because the directory is placed in main memory. ModelLF outperforms ModelGF due to its cheaper flush operations. Our results show that the performance gap between the ModelLF and ModelGF cache protocols increases as the cache size becomes smaller. This observation is significant because the current trend in multicore and manycore processors is that the local memory size per core decreases as the number of cores increases.

4.3 Scalability

Fig. 3 shows the speedup as a function of the number of SPEs (each SPE runs one thread) under the ModelLF cache protocol. The tested applications are MG with a 32KB cache size, and IS and EP with a 64KB cache size. All three applications have input size W. For the IS and EP benchmarks, the ModelLF cache protocol nearly achieves linear speedup. For the MG benchmark, the speedup is not as good when the number of threads is 3, 5, or 6, because the workloads among threads are not balanced when the number of threads is not a power of 2.

[Figure: speedup versus number of threads (SPEs) for MG-W, IS-W, and EP-W. IS-W and EP-W achieve almost linear speedup; MG-W performs worse because of unbalanced workloads.]

Fig. 3. Speedup as a function of the number of SPEs under the ModelLF cache protocol
4.4 Impact of Cache Size

Fig. 4 and 5 show execution time and cache eviction ratio curves for IS and MG with input size W at various cache sizes (4KB, 8KB, 16KB, 32KB, and 64KB⁵) per thread. The two figures show that the cache eviction ratio curves under the two cache protocols are equal, but the execution time curves are not. Moreover, the difference in execution time becomes larger as the cache size becomes smaller. This is because the cost of cache eviction in the ModelGF cache protocol is much higher, and the smaller the cache size, the higher the cache eviction ratio. To show the change of the performance gap clearly, we normalize the execution times into the interval [0, 1] by dividing every execution time by the maximal execution time over all tested configurations. The configurations with the maximal execution time are the 4KB cache sizes under ModelGF for both MG and IS. The performance gap between ModelGF and ModelLF remains constant for EP when we change the cache sizes, because EP has very poor temporal locality and is therefore insensitive to the cache size.

⁵ 64KB is only for IS.
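The normalization described above is a simple division by the maximum over all tested configurations; a sketch with made-up timing values (the numbers below are illustrative, not measured):

```python
# Sketch of the execution-time normalization used in Section 4.4:
# divide every time by the maximum over all tested configurations,
# so the slowest configuration maps to 1.0. Times here are invented.

times = {("ModelGF", "4k"): 12.0, ("ModelGF", "32k"): 6.0,
         ("ModelLF", "4k"): 9.0,  ("ModelLF", "32k"): 5.5}

max_t = max(times.values())
normalized = {cfg: t / max_t for cfg, t in times.items()}

assert normalized[("ModelGF", "4k")] == 1.0   # the slowest configuration
assert all(0 < v <= 1 for v in normalized.values())
```

With this scaling, the gap between the two protocols at each cache size can be read directly as a fraction of the worst-case execution time.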

[Figure: normalized execution time and cache eviction ratio versus cache size per SPE (4k–64k) under ModelGF and ModelLF. The difference of normalized execution time increased from 0.15 to 0.25 as the cache size per SPE was decreased from 64KB to 4KB.]

Fig. 4. Trends of execution time and cache eviction ratio for IS-W on various cache sizes

[Figure: normalized execution time and cache eviction ratio versus cache size per SPE (4k–32k) under ModelGF and ModelLF. The difference of normalized execution time increased from 0.04 to 0.16 as the cache size per SPE was decreased from 32KB to 4KB.]

Fig. 5. Trends of execution time and cache eviction ratio for MG-W on various cache sizes
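The widening gap comes down to what a flush must touch under each model. The following Python sketch (our own illustration; all names are hypothetical) contrasts the two flush semantics: under ModelLF a flush touches only the local temporary view, while under ModelGF a simulated centralized directory must inform every thread caching the location, at roughly DMA-transfer cost per message.

```python
# Contrast of the two flush semantics from Section 4.2: under ModelGF
# a flush of location l also discards l from every other thread's view.

shared = {"l": 0}
views = [{"l": 0}, {"l": 0}, {"l": 0}]       # one temporary view per thread
directory = {"l": {0, 1, 2}}                 # threads caching l (ModelGF only)

def flush_lf(tid, l, dirty_val=None):
    if dirty_val is not None:
        shared[l] = dirty_val
    views[tid].pop(l, None)                  # only the local view is touched

def flush_gf(tid, l, dirty_val=None):
    if dirty_val is not None:
        shared[l] = dirty_val
    for t in directory[l]:                   # one DMA-priced message per thread
        views[t].pop(l, None)
    directory[l] = set()

flush_lf(0, "l", 7)
assert "l" in views[1]                        # ModelLF: other views untouched
flush_gf(1, "l")
assert all("l" not in v for v in views)       # ModelGF: everyone discards l
```

Each eviction-triggered flush therefore costs ModelGF a number of main-memory round trips proportional to the sharers of the line, which is why the gap grows as shrinking caches drive the eviction ratio up.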

5 Related Work

Despite over two decades of research on memory consistency models, there does not appear to be a consensus on how memory models should be formalized [11,12,13,14]. The efforts to formalize memory models for mainstream parallel languages such as the Java memory model [15], the C++ memory model [16], and the OpenMP memory model [6] all take different approaches. The authoritative source for the OpenMP memory model can be found in the specification for OpenMP 3.0 [5], but the memory model definition therein is provided in terms of informal prose. To address this limitation, a formalization of the OpenMP memory model was presented in [6]. In that work, the authors developed a formal, mathematical language to model the relevant features of OpenMP, along with an operational model to verify its conformance to the OpenMP standard. Through these tools, the authors found that the OpenMP memory model is weaker than the weak consistency model [17]. The authors also claimed to have found some ambiguities in the informal definition of the OpenMP memory model presented in the OpenMP specification version 2.5 [18]. Since there is no significant change in the OpenMP memory model from version 2.5 to version 3.0, their work demonstrates the need for the


OpenMP community to work towards a formal and complete definition of the OpenMP memory model. Some early research on software-controlled caches can be found in the NYU Ultracomputer [19], Cedar [20], and IBM RP3 [21] projects. All three machines have local memories that can be used as programmable caches, with software taking responsibility for maintaining consistency by inserting explicit synchronization and cache consistency operations. By default, this responsibility falls on the programmer, but compiler techniques have also been developed in which these operations are inserted by the compiler instead, e.g., [22]. Interest in software caching has been renewed with the advent of multicore processors with local memories such as the Cell Broadband Engine. There have been a number of reports on more recent software cache optimization from the compiler angle, as described in [23,24,25]. Examples of recent work on software cache protocol implementations on Cell processors can be found in [26,27,28]. The cache protocol used in [26] relies on a centralized directory to keep track of cache line state information, which is reminiscent of the ModelGF cache protocol in this paper. The cache protocols reported in [27,28] do not appear to use a centralized directory, and hence appear to be closer to the ModelLF cache protocol. However, we do not have access to detailed information on the implementations of these models and cannot make a more definitive comparison at the time of writing. OPELL [7] is an open source toolchain / runtime effort to implement OpenMP for the Cell Broadband Engine. The cache protocol framework reported here was developed earlier, in the 2006-2007 time frame, and embedded in OPELL (see [7]), but the protocols themselves have not been published externally.

6 Conclusion and Future Work

In this paper, we investigate the problem of software cache implementations of the OpenMP memory model on multicore and manycore processors. We propose an instantiation of the OpenMP memory model, ModelLF, which prohibits undefined values and avoids the ambiguity of the original memory model definition in the OpenMP Specification 3.0. ModelLF is scalable with respect to the number of threads because it does not rely on communication among threads or a centralized directory that maintains consistency of multiple copies of each shared variable. We propose the corresponding cache protocol and implement it as a software cache on the Cell processor. The experimental results show that the ModelLF cache protocol achieves nearly linear speedup with respect to the number of threads for a number of NAS Parallel Benchmarks. The results also show a clear advantage when comparing it to the ModelGF cache protocol, which is derived from a stronger memory model that maintains a global total ordering among flush operations. This illustrates a useful methodology: formalize the (architecture-unspecified) OpenMP memory model in different ways and evaluate the resulting instantiations, which produce different performance profiles. Our conclusion is that OpenMP’s relaxed memory model with temporary views is a good match for software cache implementations, and that the refinements in ModelLF lead to good opportunities for scalable implementations of OpenMP on future multicore and manycore processors.


In the future, we will investigate the possibility of implementing our instantiation on different architectures and study its scalability on architectures with a large number of cores (e.g., over 100).

Acknowledgment

This work was supported by NSF (CNS-0509332, CSR-0720531, CCF-0833166, CCF-0702244) and other government sponsors. We thank all the members of the CAPSL group at the University of Delaware. We thank Ziang Hu for his suggestions on the experimental design. We thank Bronis R. de Supinski and Greg Bronevetsky for answering questions regarding the OpenMP memory model. We also thank our reviewers for their helpful suggestions, which have led to several important improvements of our work.


Programming CUDA-Based GPUs to Simulate Two-Layer Shallow Water Flows

Marc de la Asunción¹, José M. Mantas¹, and Manuel J. Castro²

¹ Dpto. Lenguajes y Sistemas Informáticos, Universidad de Granada
² Dpto. Análisis Matemático, Universidad de Málaga

Abstract. The two-layer shallow water system is used as the numerical model to simulate several phenomena related to geophysical flows, such as the steady exchange of two different water flows, as occurs in the Strait of Gibraltar, or the tsunamis generated by underwater landslides. The numerical solution of this model for realistic domains imposes great demands on computing power, and modern Graphics Processing Units (GPUs) have proved to be a powerful accelerator for this kind of computationally intensive simulation. This work describes an accelerated implementation of a first order well-balanced finite volume scheme for 2D two-layer shallow water systems on GPUs supporting the CUDA (Compute Unified Device Architecture) programming model and double precision arithmetic. This implementation uses the CUDA framework to efficiently exploit the potential fine-grain data parallelism of the numerical algorithm. Two versions of the GPU solver are implemented and studied: one using both single and double precision, and another using only double precision. Numerical experiments show the efficiency of this CUDA solver on several GPUs, and a comparison with an efficient multicore CPU implementation of the solver is also reported.

1 Introduction

The two-layer shallow water system of partial differential equations governs the flow of two superposed shallow layers of immiscible fluids with different constant densities. This mathematical model is used to simulate several phenomena related to stratified geophysical flows, such as the steady exchange of two different water flows, as occurs in the Strait of Gibraltar [6], or the tsunamis generated by underwater landslides [13]. The numerical resolution of two-layer or multilayer shallow water systems has been the object of intense research in recent years: see for instance [1,3,4,6,13]. The numerical solution of these equations in realistic applications, where large domains are simulated in space and time, is computationally very expensive. This fact, and the degree of parallelism which these numerical schemes exhibit, suggests the design of parallel versions of the schemes in order to solve and analyze these problems in reasonable execution times.

In this paper, we tackle the acceleration of a finite volume numerical scheme to solve two-layer shallow water systems. This scheme has been parallelized and optimized by combining a distributed implementation which runs on a PC cluster [4] with the use of SSE-optimized routines [5]. However, despite the important performance improvements, a greater reduction of the runtimes is necessary. A cost-effective way of obtaining substantially higher performance in these applications consists in using modern Graphics Processing Units (GPUs). The use of these devices to accelerate computationally intensive tasks is growing in popularity among the scientific and engineering community [15,14]. Modern GPUs present a massively parallel architecture which includes hundreds of processing units optimized for performing floating point operations and multithreaded execution. These architectures make it possible to obtain performance that is orders of magnitude faster than a standard CPU at a very affordable price.

There are previous proposals to port finite volume one-layer shallow water solvers to GPUs by using a graphics-specific programming language [9,10]. These solvers obtain considerable speedups to simulate one-layer shallow water systems, but their graphics-based design is not easy to understand and maintain. Recently, NVIDIA has developed the CUDA programming toolkit [11], which includes an extension of the C language and facilitates the programming of GPUs for general purpose applications by freeing the programmer from dealing with the graphics details of the GPU. A CUDA solver for one-layer systems based on the finite volume scheme presented in [4] is described in [2]. This one-layer shallow water CUDA solver obtains a good exploitation of the massively parallel architecture of several NVIDIA GPUs. In this work, we extend the proposal presented in [2] to the case of two-layer shallow water systems and study its performance.

P. D’Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 353–364, 2010. © Springer-Verlag Berlin Heidelberg 2010
From the computational point of view, the numerical solution of the two-layer system presents two main problems with respect to the one-layer case: the need to use double precision arithmetic for some calculations of the scheme, and the need to manage a larger volume of data to perform the basic calculations. Our goal is to efficiently exploit GPUs supporting CUDA and double precision arithmetic in order to notably accelerate the numerical solution of two-layer shallow water systems. This paper is organized as follows: the next section describes the underlying mathematical model, the two-layer shallow water system, and the finite volume numerical scheme which has been ported to the GPU. A description of the data parallelism of the numerical scheme and its CUDA implementation is presented in Sections 3 and 4. Section 5 shows and analyzes the performance results obtained when the CUDA solver is applied to several test problems using two different NVIDIA GPUs supporting double precision. Finally, Section 6 summarizes the main conclusions and presents lines for further work.

2 Mathematical Model and Numerical Scheme

The two-layer shallow water system is a system of conservation laws and nonconservative products with source terms which models the flow of two superposed homogeneous shallow fluid layers with different densities that occupy a bounded domain


D ⊂ R2 under the influence of a gravitational acceleration g. The system has the following form:

\[
\frac{\partial W}{\partial t} + \frac{\partial F_1}{\partial x}(W) + \frac{\partial F_2}{\partial y}(W)
= B_1(W)\,\frac{\partial W}{\partial x} + B_2(W)\,\frac{\partial W}{\partial y}
+ S_1(W)\,\frac{\partial H}{\partial x} + S_2(W)\,\frac{\partial H}{\partial y}
\tag{1}
\]

where

\[
W = \begin{pmatrix} h_1 \\ q_{1,x} \\ q_{1,y} \\ h_2 \\ q_{2,x} \\ q_{2,y} \end{pmatrix},
\qquad
F_1(W) = \begin{pmatrix}
q_{1,x} \\[2pt]
\dfrac{q_{1,x}^2}{h_1} + \dfrac{1}{2} g h_1^2 \\[6pt]
\dfrac{q_{1,x}\, q_{1,y}}{h_1} \\[6pt]
q_{2,x} \\[2pt]
\dfrac{q_{2,x}^2}{h_2} + \dfrac{1}{2} g h_2^2 \\[6pt]
\dfrac{q_{2,x}\, q_{2,y}}{h_2}
\end{pmatrix},
\qquad
F_2(W) = \begin{pmatrix}
q_{1,y} \\[2pt]
\dfrac{q_{1,x}\, q_{1,y}}{h_1} \\[6pt]
\dfrac{q_{1,y}^2}{h_1} + \dfrac{1}{2} g h_1^2 \\[6pt]
q_{2,y} \\[2pt]
\dfrac{q_{2,x}\, q_{2,y}}{h_2} \\[6pt]
\dfrac{q_{2,y}^2}{h_2} + \dfrac{1}{2} g h_2^2
\end{pmatrix},
\]

\[
B_1(W) = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -g h_1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
-r g h_2 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix},
\qquad
B_2(W) = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -g h_1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
-r g h_2 & 0 & 0 & 0 & 0 & 0
\end{pmatrix},
\]

\[
S_1(W) = \begin{pmatrix} 0 \\ g h_1 \\ 0 \\ 0 \\ g h_2 \\ 0 \end{pmatrix},
\qquad
S_2(W) = \begin{pmatrix} 0 \\ 0 \\ g h_1 \\ 0 \\ 0 \\ g h_2 \end{pmatrix},
\]
where hi(x, y, t) ∈ R denotes the thickness of the water layer i at point (x, y) at time t, H(x, y) ∈ R is the depth function measured from a fixed level of reference, and r = ρ1/ρ2 is the ratio of the constant densities of the layers (ρ1 < ρ2), which in realistic oceanographical applications is close to 1 (see Fig. 1). Finally, qi(x, y, t) = (qi,x(x, y, t), qi,y(x, y, t)) ∈ R2 is the mass-flow of the water layer i at point (x, y) at time t.

Fig. 1. Two-layer sketch

To discretize System (1), the computational domain D is divided into L cells or finite volumes Vi ⊂ R2, which are assumed to be quadrangles. Given a finite volume Vi, Ni ∈ R2 is the centre of Vi, ℵi is the set of indexes j such that Vj is a neighbour of Vi; Γij is the common edge of two neighbouring cells Vi and Vj, and |Γij| is its length; ηij = (ηij,x, ηij,y) is the unit vector which is normal to the edge Γij and points towards the cell Vj [4] (see Fig. 2).

Fig. 2. Finite volumes

Assume that the approximations at time tn, Win, have already been calculated. To advance in time, with Δtn being the time step, the following numerical scheme is applied (see [4] for more details):

\[
W_i^{n+1} = W_i^n - \frac{\Delta t^n}{|V_i|} \sum_{j \in \aleph_i} |\Gamma_{ij}|\, F_{ij}^-
\tag{2}
\]

where

\[
F_{ij}^- = \frac{1}{2}\, K_{ij} \cdot \bigl(I - \operatorname{sgn}(D_{ij})\bigr) \cdot K_{ij}^{-1} \cdot \bigl( A_{ij}(W_j^n - W_i^n) - S_{ij}(H_j - H_i) \bigr),
\]

|Vi| is the area of Vi, Hl = H(Nl) with l = 1, . . . , L, Aij ∈ R6×6 and Sij ∈ R6 depend on Win and Wjn, Dij is a diagonal matrix whose coefficients are the eigenvalues of Aij, and the columns of Kij ∈ R6×6 are the associated eigenvectors.

To compute the n-th time step, the following condition can be used:

\[
\Delta t^n = \min_{i=1,\dots,L} \left( \frac{\sum_{j \in \aleph_i} |\Gamma_{ij}|\, \|D_{ij}\|_\infty}{2\gamma\, |V_i|} \right)^{-1}
\tag{3}
\]

where γ, 0 < γ ≤ 1, is the CFL (Courant–Friedrichs–Lewy) parameter.
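For concreteness, the CFL condition (3) can be evaluated by a short serial routine. The sketch below is a minimal CPU reference in C; the `Volume` struct and its field names are illustrative assumptions, not taken from the paper's code.

```c
#include <float.h>

/* Hypothetical per-cell data for one quadrangular finite volume:
   the area |V_i|, and for each of its four edges the length
   |Gamma_ij| and the spectral radius ||D_ij||_inf. */
typedef struct {
    double area;
    double edge_len[4];
    double d_inf[4];
} Volume;

/* Global time step from the CFL condition (3):
   dt = min_i  2 * gamma * |V_i| / sum_j |Gamma_ij| * ||D_ij||_inf  */
double cfl_time_step(const Volume *v, int L, double gamma) {
    double dt = DBL_MAX;
    for (int i = 0; i < L; i++) {
        double z = 0.0;                      /* Z_i: sum over the edges */
        for (int j = 0; j < 4; j++)
            z += v[i].edge_len[j] * v[i].d_inf[j];
        double dt_i = 2.0 * gamma * v[i].area / z;
        if (dt_i < dt) dt = dt_i;
    }
    return dt;
}
```

In the CUDA version described later, the per-volume computation and the final minimum are split into separate kernels.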

3 CUDA Implementation

In this section we describe the potential data parallelism of the numerical scheme and its implementation in CUDA.

3.1 Parallelism Sources

Figure 3a shows a graphical description of the main sources of parallelism obtained from the numerical scheme. The main calculation phases, identified with circled numbers, present a high degree of parallelism because the computation performed at each edge or volume is independent of that performed at other edges or volumes.

(a) Parallelism sources of the numerical scheme

(b) General steps of the parallel algorithm implemented in CUDA

Fig. 3. Parallel algorithm

When the finite volume mesh has been constructed, the time stepping process is repeated until the final simulation time is reached:

1. Edge-based calculations: Two calculations must be performed for each edge Γij communicating two cells Vi and Vj (i, j ∈ {1, . . . , L}):


a) The vector Mij = |Γij| Fij− ∈ R6 must be computed as the contribution of each edge to the calculation of the new states of its adjacent cells Vi and Vj (see (2)). This contribution can be computed independently for each edge and must be added to the partial sums Mi and Mj associated to Vi and Vj, respectively.

b) The value Zij = |Γij| ‖Dij‖∞ must be computed as the contribution of each edge to the calculation of the local Δt values of its adjacent cells Vi and Vj (see (3)). This contribution can be computed independently for each edge and must be added to the partial sums Zi and Zj associated to Vi and Vj, respectively.

2. Computation of the local Δti for each volume: For each volume Vi, the local Δti is obtained as follows (see (3)): Δti = 2γ |Vi| Zi^{-1}. In the same way, the computation for each volume can be performed in parallel.

3. Computation of Δt: The minimum of all the local Δti values previously computed for each volume is obtained. This minimum Δt represents the next time step which will be applied in the simulation.

4. Computation of Win+1: The (n + 1)-th state of each volume (Win+1) is calculated from the n-th state and the data computed in previous phases in the following way (see (2)): Win+1 = Win − (Δt/|Vi|) Mi. This phase can also be performed in parallel (see Fig. 3a).

As can be seen, the numerical scheme exhibits a high degree of potential data parallelism and is a good candidate to be implemented on CUDA architectures.
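The edge-based accumulation pattern of phases 1 and 4 can be illustrated with a deliberately simplified serial sketch. The 1D scalar toy below is our illustrative stand-in (a single conserved quantity and a made-up centred flux, not the actual 6-component scheme): each interior edge computes one flux and scatters it with opposite signs into the accumulators of its two neighbouring cells, so with wall boundaries the total mass is conserved.

```c
#define NCELLS 8

/* Toy 1D stand-in for the edge-based phases of the scheme. */
void toy_step(double w[NCELLS], double dt, double dx) {
    double acc[NCELLS] = {0.0};   /* per-cell accumulators (the M_i) */

    /* Phase 1: edge-based calculations; edge e joins cells e and e+1. */
    for (int e = 0; e < NCELLS - 1; e++) {
        double flux = 0.5 * (w[e] + w[e + 1]);   /* illustrative flux */
        acc[e]     += flux;
        acc[e + 1] -= flux;
    }

    /* Phase 4: per-volume update, w_i <- w_i - (dt / |V_i|) * acc_i. */
    for (int i = 0; i < NCELLS; i++)
        w[i] -= dt / dx * acc[i];
}
```

In the CUDA version the two loops become separate kernels, and the scatter into `acc` is what the per-volume accumulator arrays implement without write conflicts.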

4 Algorithmic Details of the CUDA Version

In this section we describe the parallel algorithm we have developed and its implementation in CUDA. It is an extension of the algorithm described in [2] to simulate two-layer shallow water systems. We consider problems consisting of a bidimensional regular finite volume mesh. The general steps of the parallel algorithm are depicted in Fig. 3b. Each processing step executed on the GPU is assigned to a CUDA kernel. A kernel is a function executed on the GPU by many threads, which are organized into a grid of thread blocks that run logically in parallel (see [12] for more details). Next, we describe each step in detail:

– Build data structure: In this step, the data structure that will be used on the GPU is built. For each volume, we store its initial state (h1, q1,x, q1,y, h2, q2,x and q2,y) and its depth H. We define two arrays of float4 elements, where each element represents a volume. The first array contains h1, q1,x, q1,y and H, while the second array contains h2, q2,x and q2,y. Both arrays are stored as 2D textures. The area of the volumes and the lengths of the vertical and horizontal edges are precalculated and passed to the CUDA kernels that need them. We can determine at runtime whether an edge or volume lies on the boundary, and the value of the normal ηij of an edge, by checking the position of the thread in the grid.


– Process vertical edges and process horizontal edges: As in [2], we divide the edge processing into vertical and horizontal edge processing. For vertical edges, ηij,y = 0 and therefore all the operations where this term takes part can be discarded. Similarly, for horizontal edges, ηij,x = 0 and all the operations where this term takes part can be avoided. In vertical and horizontal edge processing, each thread represents a vertical or horizontal edge, respectively, and computes the contribution of the edge to its adjacent volumes as described in Section 3.1. The edges (i.e. threads) synchronize with each other when contributing to a particular volume by means of four accumulators (in [2] we used two accumulators for one-layer systems), each one being an array of float4 elements. The size of each accumulator is the number of volumes. Let us call the accumulators 1-1, 1-2, 2-1 and 2-2. Each element of accumulators 1-1 and 2-1 stores the contributions of the edges to layer 1 of Wi (the first 3 elements of Mi) and to the local Δt of the volume (a float value Zi), while each element of accumulators 1-2 and 2-2 stores the contributions of the edges to layer 2 of Wi (the last 3 elements of Mi). Then, in the processing of vertical edges:

◦ Each vertical edge writes in accumulator 1-1 the contribution to layer 1 and to the local Δt of its right volume, and writes in accumulator 1-2 the contribution to layer 2 of its right volume.

◦ Each vertical edge writes in accumulator 2-1 the contribution to layer 1 and to the local Δt of its left volume, and writes in accumulator 2-2 the contribution to layer 2 of its left volume.

Next, the processing of horizontal edges is performed in an analogous way, with the difference that the contribution is added to the accumulators instead of simply written. Figure 4 shows this process graphically.

(a) Vertical edge processing

(b) Horizontal edge processing

Fig. 4. Computing the sum of the contributions of the edges of each volume


(a) Contribution to Δti

(b) Contribution to Wi

Fig. 5. Computation of the final contribution of the edges for each volume

– Compute Δti for each volume: In this step, each thread represents a volume and computes the local Δti of the volume Vi as described in Section 3.1. The final Zi value is obtained by summing the two float values stored in the positions corresponding to the volume Vi in accumulators 1-1 and 2-1 (see Fig. 5a).

– Get minimum Δt: This step finds the minimum of the local Δti of the volumes by applying a reduction algorithm on the GPU. The reduction algorithm applied is kernel 7 (the most optimized one) of the reduction sample included in the CUDA Software Development Kit [11].

– Compute Wi for each volume: In this step, each thread represents a volume and updates the state Wi of the volume Vi as described in Section 3.1. The final Mi value is obtained as follows: the first 3 elements of Mi (the contribution to layer 1) are obtained by summing the two 3 × 1 vectors stored in the positions corresponding to the volume Vi in accumulators 1-1 and 2-1, while the last 3 elements of Mi (the contribution to layer 2) are obtained by summing the two 3 × 1 vectors stored in the positions corresponding to the volume Vi in accumulators 1-2 and 2-2 (see Fig. 5b). Since a CUDA kernel cannot write directly into textures, the textures are updated by first writing the results into temporary arrays, which are then copied to the CUDA arrays bound to the textures.

A version of this CUDA algorithm which uses double precision to perform all the computing phases has also been implemented. The volume data is stored in three arrays of double2 elements (which contain the state of the volumes) and one array of double elements (the depth H). We use six accumulators of double2 elements (for storing the contributions to Wi) and two accumulators of double elements (for storing the contributions to the local Δt of each volume).
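The "Get minimum Δt" step can be mimicked on the CPU by a pairwise, tree-shaped minimum that halves the active range on every pass, which is the general shape of the SDK reduction kernels; the function below is only a serial analogue for illustration, not the kernel itself.

```c
/* Serial analogue of a tree reduction for the minimum of the local
   dt_i values: each pass folds the upper half of the active range
   into the lower half. The input array is modified in place. */
double min_reduce(double *v, int n) {
    while (n > 1) {
        int s = (n + 1) / 2;             /* stride: half the range */
        for (int i = 0; i + s < n; i++)  /* fold v[i+s] into v[i]  */
            if (v[i + s] < v[i]) v[i] = v[i + s];
        n = s;
    }
    return v[0];
}
```

On the GPU, each pass roughly corresponds to one synchronized step within a thread block, with the per-block minima combined at the end.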

5 Experimental Results

We consider an internal circular dambreak problem in the [−5, 5] × [−5, 5] rectangular domain in order to compare the performance of our implementations. The depth function is given by H(x, y) = 2 and the initial condition is:

\[
W_i^0(x, y) = \bigl( h_1(x, y),\, 0,\, 0,\, h_2(x, y),\, 0,\, 0 \bigr)^T
\]

where

\[
h_1(x, y) = \begin{cases} 1.8 & \text{if } x^2 + y^2 > 4 \\ 0.2 & \text{otherwise} \end{cases},
\qquad h_2(x, y) = 2 - h_1(x, y).
\]
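The initial data translates directly into code; a minimal sketch in C:

```c
/* Circular dambreak initial condition on [-5,5] x [-5,5]:
   layer 1 is 1.8 outside the circle x^2 + y^2 = 4 and 0.2 inside;
   the two layers always fill the total depth H = 2. */
double h1_init(double x, double y) {
    return (x * x + y * y > 4.0) ? 1.8 : 0.2;
}

double h2_init(double x, double y) {
    return 2.0 - h1_init(x, y);
}
```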

The numerical scheme is run for several regular bidimensional finite volume meshes with different numbers of volumes (see Table 1). The simulation is carried out in the time interval [0, 1]. The CFL parameter is γ = 0.9, r = 0.998, and wall boundary conditions (q1 · η = 0, q2 · η = 0) are considered.

To perform the experiments, several programs have been implemented:

– A serial CPU version of the CUDA algorithm. This version has been implemented in C++ and uses the Eigen library [8] for operating with matrices. We have used the double data type in this implementation.

– A quadcore CPU version of the CUDA algorithm. This is a parallelization of the aforementioned serial CPU version which uses OpenMP [7].

– A mixed precision CUDA implementation (CUSDP). In this GPU version, the eigenvalues and eigenvectors of the Aij matrix (see Sect. 2) are computed using double precision to avoid numerical instability problems, but the rest of the operations are performed in single precision.

– A full double precision CUDA implementation (CUDP).

All the programs were executed on a Core i7 920 with 4 GB RAM. The graphics cards used were a GeForce GTX 280 and a GeForce GTX 480. Figure 7 shows the evolution of the fluid. Table 1 shows the execution times in seconds for all the meshes and programs. As can be seen, the number of volumes and the execution times scale with different factors because the number of time steps required for the same time interval also increases when the number of cells is increased (see (3)). Using a GTX 480, for big meshes, CUSDP achieves a speedup of 62 with respect to the monocore CPU version, while CUDP reaches a speedup of 38. As expected, the OpenMP version only reaches a speedup of less than four in all meshes. CUDP has been about 38 % slower than CUSDP for big meshes on the GTX 480 card, and 24 % slower on the GTX 280 card.
On the GTX 480 card, we obtain better execution times by setting the sizes of the L1 cache and shared memory to 48 KB and 16 KB per multiprocessor, respectively, for the two edge processing CUDA kernels. Table 2 shows the mean values of the percentages of the execution time and GPU FLOPS for all the computing steps and implementations. Clearly, almost all the execution time is spent in the edge processing steps.


Table 1. Execution times in seconds for all the meshes and programs

Mesh size       CPU       CPU       GTX 280           GTX 480
L = Lx × Ly     1 core    4 cores   CUSDP    CUDP     CUSDP   CUDP
100 × 100       7.54      2.10      0.48     0.80     0.37    0.53
200 × 200       59.07     15.84     3.15     4.38     1.42    2.17
400 × 400       454.7     121.0     21.92    29.12    8.04    13.01
800 × 800       3501.9    918.7     163.0    216.1    57.78   94.57
1600 × 1600     28176.7   7439.4    1262.7   1678.0   453.5   735.6
2000 × 2000     54927.8   14516.6   2499.2   3281.0   879.7   1433.6

Table 2. Mean values of the percentages of the execution time and GPU FLOPS for all the computing steps

                            % Execution time                % GPU
Computing step              1 core  4 cores  CUSDP  CUDP    FLOPS
Process vertical edges      49.6    48.2     49.5   50.0    49.5
Process horizontal edges    49.8    48.6     49.4   48.5    49.9
Compute Δti                 0.2     1.1      0.3    0.3     0.1
Get minimum Δt              —       0.4      0.1    0.2     0.0
Compute Win+1               0.4     1.7      0.7    1.0     0.5

Figure 6 shows graphically the GB/s and GFLOPS obtained in the CUDA implementations with both graphics cards. On the GTX 480 card, CUSDP achieves 4.2 GB/s and 34 GFLOPS for big meshes. The theoretical maxima are: for the GTX 480, 177.4 GB/s, and 1.35 TFLOPS in single precision or 168 GFLOPS in double precision; for the GTX 280, 141.7 GB/s, and 933 GFLOPS in single precision or 78 GFLOPS in double precision. As can be seen, the speedup, GB/s and GFLOPS reached with the CUSDP program are notably worse than those obtained in [2] with the single precision CUDA implementation for one-layer systems. This is mainly due to two reasons. Firstly, since double precision has been used to compute the eigenvalues and eigenvectors, the efficiency is reduced because the double precision speed is 1/8 of the single precision speed in GeForce cards with GT200 and GF100 architectures. Secondly, since the register usage and the complexity of the code executed by each thread are higher in this implementation, the CUDA compiler has to store some data into local memory, which also increases the execution time. We have also compared the numerical solutions obtained in the monocore and the CUDA programs. The L1 norm of the difference between the solutions obtained on CPU and GPU at time t = 1.0 was calculated for all meshes. The order of magnitude of the L1 norm using CUSDP varies between 10^−4 and 10^−6, while that obtained using CUDP varies between 10^−12 and 10^−14, which reflects the different accuracy of the numerical solutions computed on the GPU using both single and double precision, and using only double precision.
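A comparison of this kind boils down to an L1 norm of the pointwise difference between two discrete solutions. A sketch of such a check (the uniform cell-area weighting is our assumption; the paper does not state the exact normalization used):

```c
/* L1 norm of the difference between two discrete solutions a and b
   over n cells of uniform area (the area weighting is an assumption). */
double l1_diff(const double *a, const double *b, int n, double cell_area) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        double d = a[i] - b[i];
        s += (d < 0.0) ? -d : d;    /* |a_i - b_i| without math.h */
    }
    return s * cell_area;
}
```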


(a) GB/s


(b) GFLOPS

Fig. 6. GB/s and GFLOPS obtained with the CUDA implementations in all meshes with both graphics cards

(a) t = 0.0

(b) t = 2.5

(c) t = 5.0

Fig. 7. Graphical representation of the fluid evolution at different time instants

6 Conclusions and Further Work

In this paper we have presented an efficient first order well-balanced finite volume solver for two-layer shallow water systems. The numerical scheme has been parallelized, adapted to the GPU, and implemented using the CUDA framework in order to exploit the parallel processing power of GPUs. On the GTX 480 graphics card, the CUDA implementation using both single and double precision has reached 4.2 GB/s and 34 GFLOPS, and has been one order of magnitude faster than a monocore CPU version of the solver for big uniform meshes. It is expected that these results will significantly improve on an NVIDIA Tesla GPU architecture based on Fermi, since this architecture includes better double precision support than the GTX 480 graphics card. The simulations carried out also reveal the different accuracy obtained with the two implementations of the solver, better accuracy being obtained using only double precision than using both single and double precision. As further work, we propose to extend the strategy to enable efficient simulations on irregular and non-structured meshes.


Acknowledgements. J. M. Mantas acknowledges partial support from the DGI-MEC project MTM2008-06349-C03-03. M. de la Asunción and M. J. Castro acknowledge partial support from DGI-MEC project MTM2009-11923.

References

1. Abgrall, R., Karni, S.: Two-layer shallow water system: A relaxation approach. SIAM J. Sci. Comput. 31(3), 1603–1627 (2009)
2. de la Asunción, M., Mantas, J.M., Castro, M.: Simulation of one-layer shallow water systems on multicore and CUDA architectures. The Journal of Supercomputing (2010), http://dx.doi.org/10.1007/s11227-010-0406-2
3. Audusse, E., Bristeau, M.O.: Finite-volume solvers for a multilayer Saint-Venant system. Int. J. Appl. Math. Comput. Sci. 17(3), 311–320 (2007)
4. Castro, M.J., García-Rodríguez, J.A., González-Vida, J.M., Parés, C.: A parallel 2D finite volume scheme for solving systems of balance laws with nonconservative products: Application to shallow flows. Comput. Meth. Appl. Mech. Eng. 195, 2788–2815 (2006)
5. Castro, M.J., García-Rodríguez, J.A., González-Vida, J.M., Parés, C.: Solving shallow-water systems in 2D domains using finite volume methods and multimedia SSE instructions. J. Comput. Appl. Math. 221(1), 16–32 (2008)
6. Castro, M.J., García-Rodríguez, J.A., González-Vida, J.M., Macías, J., Parés, C.: Improved FVM for two-layer shallow-water models: Application to the Strait of Gibraltar. Adv. Eng. Softw. 38(6), 386–398 (2007)
7. Chapman, B., Jost, G., van der Pas, R.: Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press, Cambridge (2007)
8. Eigen 2.0.12, http://eigen.tuxfamily.org
9. Hagen, T.R., Hjelmervik, J.M., Lie, K.A., Natvig, J.R., Henriksen, M.O.: Visual simulation of shallow-water waves. Simulation Modelling Practice and Theory 13(8), 716–726 (2005)
10. Lastra, M., Mantas, J.M., Ureña, C., Castro, M.J., García-Rodríguez, J.A.: Simulation of shallow-water systems using graphics processing units. Math. Comput. Simul. 80(3), 598–618 (2009)
11. NVIDIA: CUDA home page, http://www.nvidia.com/object/cuda_home_new.html
12. NVIDIA: NVIDIA CUDA Programming Guide 3.0 (2010), http://developer.nvidia.com/object/cuda_3_0_downloads.html
13. Ostapenko, V.V.: Numerical simulation of wave flows caused by a shoreside landslide. J. Appl. Mech. Tech. Phys. 40, 647–654 (1999)
14. Owens, J.D., Houston, M., Luebke, D., Green, S., Stone, J.E., Phillips, J.C.: GPU computing. Proceedings of the IEEE 96(5), 879–899 (2008)
15. Rumpf, M., Strzodka, R.: Graphics Processor Units: New prospects for parallel computing. Lecture Notes in Comput. Science and Engineering 51, 89–134 (2006)

Theory and Algorithms for Parallel Computation

Christoph Kessler¹, Thomas Rauber², Yves Robert¹, and Vittorio Scarano²

¹ Topic Chairs
² Members

Parallelism concerns all levels of current computing systems, from single CPU machines to large server farms. Effective use of parallelism relies crucially on the availability of suitable models of computation for algorithm design and analysis, of efficient strategies for the solution of key computational problems on prominent classes of platforms, and of good models of the way the different components are interconnected. With the advent of multicore parallel machines, new models and paradigms are needed to allow parallel programming to advance into mainstream computing. This includes the following topics:

– foundations of parallel, distributed, multiprocessor and network computation;
– models of parallel, distributed, multiprocessor and network computation;
– emerging paradigms of parallel, distributed, multiprocessor and network computation;
– models and algorithms for parallelism in memory hierarchies;
– models and algorithms for real networks (scale-free, small world, wireless networks);
– theoretical aspects of routing;
– deterministic and randomized parallel algorithms;
– lower bounds for key computational problems.

This year, 8 papers discussing some of these issues were submitted to this topic. Each paper was reviewed by four reviewers and, finally, we were able to select 4 regular papers. The accepted papers discuss very interesting issues in theory and models for parallel computing, as well as the mapping of parallel computations to the execution resources of parallel platforms.

The paper “Analysis of Multi-Organization Scheduling Algorithms” by J. Cohen, D. Cordeiro, D. Trystram and F. Wagner considers the problem of scheduling single-processor tasks on computing platforms composed of several independent organizations, where each organization only cooperates if its local makespan is not increased by jobs of other organizations. This is called the ‘local constraint’.
Moreover, a ’selfishness constraint’ is considered which does not allow schedules in which foreign jobs are finished before all local jobs are started. The article shows some lower bounds and proves that the scheduling problem is NPcomplete. Three approximation algorithms are discussed and compared by an experimental evaluation with randomly generated workloads. The paper “Area-Maximizing Schedules for Series-Parallel DAGs” by G. Cordasco and A. Rosenberg explores the computations of schedules for series-parallel DAGs. In particular, AREA-maximizing schedules are computed which produce P. D’Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 365–366, 2010. c Springer-Verlag Berlin Heidelberg 2010 

366

C. Kessler et al.

execution-eligible tasks as fast as possible. In previous work, the authors have introduced this problem as IC-scheduling and have shown how AREA-maximizing schedules can be derived for specific families of DAGs. In this article, this work is extended to arbitrary series-parallel (SP) DAGs that are obtained by series and parallel composition. The paper “Parallel selection by regular sampling” by A. Tiskin examines the selection problem and uses the BSP model to express a new deterministic algorithm for solving the problem. The new algorithm needs O(n/p) local computations and communications (optimal) and O(log log p) synchronizations for arrays of length n using p processors. The paper “Ants in Parking Lots” by A. Rosenberg investigates the movement of autonomous units (ants) in a 2D mesh. The ants are controlled by a specialized finite-state machine (FSM); all ants have the same FSM. Ants can communicate by direct contact or by leaving pheromone values on mesh points previously visited. The paper discusses some movement operations of ants in 1D or 2D meshes. In particular, the parking problem is considered, defined as the movement to the nearest corner of the mesh. We would like to take the opportunity of thanking the authors who submitted a contribution, as well as the Euro-Par Organizing Committee, and the referees with their highly useful comments, whose efforts have made this conference and this topic possible.

Analysis of Multi-Organization Scheduling Algorithms

Johanne Cohen¹, Daniel Cordeiro², Denis Trystram², and Frédéric Wagner²

¹ Laboratoire d'Informatique PRiSM, Université de Versailles St-Quentin-en-Yvelines, 45 avenue des États-Unis, 78035 Versailles Cedex, France
² LIG, Grenoble University, 51 avenue Jean Kuntzmann, 38330 Montbonnot Saint-Martin, France

Abstract. In this paper we consider the problem of scheduling on computing platforms composed of several independent organizations, known as the Multi-Organization Scheduling Problem (MOSP). Each organization provides both resources and tasks and follows its own objectives. We are interested in the best way to minimize the makespan on the entire platform when the organizations behave in a selfish way. We study the complexity of MOSP with two different local objectives – makespan and average completion time – and show that MOSP is NP-hard in both cases. We formally define a notion of selfishness by means of restrictions on the schedules. We prove that selfish behavior imposes a lower bound of 2 on the approximation ratio for the global makespan. We present approximation algorithms of ratio 2 which respect these selfishness restrictions. These algorithms are experimentally evaluated through simulation, exhibiting good average performance.

1 Introduction

1.1 Motivation and Presentation of the Problem

The new generation of many-core machines and the now mature grid computing systems allow the creation of unprecedented massively distributed systems. In order to fully exploit the large number of processors and cores available and reach the best performance, we need sophisticated scheduling algorithms that encourage users to share their resources and, at the same time, respect each user's own interests.

Many of these new computing systems are composed of organizations that own and manage clusters of computers. A user of such a system submits his/her jobs to a scheduler that can choose any available machine in any of these clusters. However, each organization that shares its resources aims to take maximum advantage of its own hardware. In order to improve cooperation between the organizations, local jobs should be prioritized.

Finding an efficient schedule for the jobs using the available machines is a crucial problem. Although each user submits jobs locally in his/her own organization, it is necessary to optimize the allocation of the jobs for the whole


platform in order to achieve good performance. The global performance and the performance perceived by the users will depend on how the scheduler allocates resources among all available processors to execute each job.

1.2 Related Work

From classical scheduling theory, the problem of scheduling parallel jobs is related to Strip Packing [1]. It corresponds to packing a set of rectangles (without rotations or overlaps) into a strip of machines in order to minimize the height used. This problem was later extended to the case where the rectangles are packed into a finite number of strips [16, 15]. More recently, an asymptotic $(1+\varepsilon)$-approximation AFPTAS with additive constant $O(1)$ and with running time polynomial in $n$ and in $1/\varepsilon$ was presented in [8]. Schwiegelshohn, Tchernykh, and Yahyapour [14] studied a very similar problem, where the jobs can be scheduled on non-contiguous processors. Their algorithm is a 3-approximation for the maximum completion time (makespan) if all jobs are known in advance, and a 5-approximation for the makespan in the on-line, non-clairvoyant case.

The Multi-Organization Scheduling Problem (MOSP) was introduced by Pascual et al. [12, 13] and studies how to efficiently schedule parallel jobs on these new computing platforms, while respecting users' own selfish objectives. A preliminary analysis of the scheduling problem on homogeneous clusters was presented with the target of minimizing the makespan, resulting in a centralized 3-approximation algorithm. This problem was then extended to relaxed local objectives in [11].

The notion of cooperation between different organizations and the study of the impact of users' selfish objectives are directly related to Game Theory. The study of the Price of Anarchy [9] in non-cooperative games allows one to analyze how far the social costs – the results of selfish decisions – are from the social optimum in different problems. In selfish load-balancing games (see [10] for more details), selfish agents aim to allocate their jobs to the machine with the smallest load. In these games, the social cost is usually defined as the completion time of the last job to finish (makespan).
Several works have studied this problem, focusing on various aspects such as the convergence time to a Nash equilibrium [4] and the characterization of the worst-case equilibria [3]. We do not target such game-theoretical approaches here.

1.3 Contributions and Road Map

As suggested in the previous section, the problem of scheduling in multi-organization clusters has been studied from several different points of view. In this paper, we propose a theoretical analysis of the problem using classical combinatorial optimization approaches. Our main contribution is the extension and analysis of the problem for the case in which sequential jobs are submitted by selfish organizations that can handle different local objectives (namely, makespan and average completion times).


We introduce new restrictions to the schedule that take into account the notion of selfish organizations, i.e., organizations that refuse to cooperate if their objectives could be improved just by executing one of their jobs earlier on one of their own machines. The formal description of the problem and the notation used in this paper are given in Section 2. Section 3 shows that any algorithm respecting our new selfishness restrictions cannot achieve an approximation ratio better than 2 and that both problems are intractable. New heuristics for solving the problem are presented in Section 4. Simulation experiments, discussed in Section 5, show the good results obtained by our algorithms on average.

2 Problem Description and Notations

In this paper, we are interested in the scheduling problem in which different organizations own physical clusters of identical machines that are interconnected. They share resources and exchange jobs with each other in order to simultaneously maximize the profits of the collectivity and their own interests. All organizations intend to minimize the total completion time of all jobs (i.e., the global makespan) while they individually try to minimize their own objectives – either the makespan or the average completion time of their own jobs – in a selfish way. Although each organization accepts to cooperate with the others in order to minimize the global makespan, individually it behaves in a selfish way: an organization can refuse to cooperate if, in the final schedule, one of its migrated jobs could be executed earlier on one of the machines it owns.

Formally, we define our target platform as a grid computing system with $N$ different organizations interconnected by a middleware. Each organization $O^{(k)}$ ($1 \le k \le N$) has $m^{(k)}$ identical machines available that can be used to run jobs submitted by users from any organization.

Each organization $O^{(k)}$ has $n^{(k)}$ jobs to execute. Each job $J_i^{(k)}$ ($1 \le i \le n^{(k)}$) will use one processor for exactly $p_i^{(k)}$ units of time¹. No preemption is allowed, i.e., after its activation, a job runs until its completion at time $C_i^{(k)}$.

We denote the makespan of a particular organization $k$ by $C_{\max}^{(k)} = \max_{1 \le i \le n^{(k)}} C_i^{(k)}$ and its sum of completion times by $\sum C_i^{(k)}$. The global makespan for the entire grid computing system is defined as $C_{\max} = \max_{1 \le k \le N} C_{\max}^{(k)}$.
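The notation above can be made concrete with a short sketch (the helper names and the list encoding are ours; the paper defines no code):

```python
# Sketch of the notation of Section 2 (helper names are ours, not the paper's).
# completion[k][i] holds C_i^(k), the completion time of job J_i^(k).

def local_makespan(completion_k):
    """C_max^(k) = max over i of C_i^(k)."""
    return max(completion_k)

def local_sum_completion(completion_k):
    """Sum over i of C_i^(k), the local average-completion-time objective."""
    return sum(completion_k)

def global_makespan(completion):
    """C_max = max over k of C_max^(k)."""
    return max(local_makespan(c) for c in completion)

# Two organizations; jobs of the first complete at times 3 and 5,
# jobs of the second at times 2, 6 and 7, so the global makespan is 7.
completion = [[3, 5], [2, 6, 7]]
```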

2.1 Local Constraint

The Multi-Organization Scheduling Problem, as first described in [12], consists in minimizing the global makespan ($C_{\max}$) with an additional local constraint: at the end, no organization can have its makespan increased compared with the makespan it could have obtained by scheduling its jobs on its own machines ($C_{\max}^{(k)\,local}$). More formally, we call MOSP($C_{\max}$) the following optimization problem:

minimize $C_{\max}$ such that, for all $k$ ($1 \le k \le N$), $C_{\max}^{(k)} \le C_{\max}^{(k)\,local}$

In this work, we also study the case where all organizations are locally interested in minimizing their average completion time while minimizing the global makespan. As in MOSP($C_{\max}$), each organization imposes that the sum of completion times of its jobs cannot be increased compared with what it could have obtained using only its own machines ($\sum C_i^{(k)\,local}$). We denote this problem MOSP($\sum C_i$), and the goal of this optimization problem is to:

minimize $C_{\max}$ such that, for all $k$ ($1 \le k \le N$), $\sum C_i^{(k)} \le \sum C_i^{(k)\,local}$

¹ All machines are identical, i.e., every job will be executed at the same speed independently of the chosen machine.

2.2 Selfishness

In both MOSP($C_{\max}$) and MOSP($\sum C_i$), while the global schedule might be computed by a central entity, the organizations keep control over the way they execute the jobs in the end. This means that, in theory, it is possible for an organization to cheat the devised global schedule by re-inserting its jobs earlier in its local schedule. In order to prevent such behavior, we define a new restriction on the schedule, called the selfishness restriction. The idea is that, in any schedule respecting this restriction, no single organization can improve its local schedule by cheating.

Given a fixed schedule, let $J_f^{(l)}$ be the first foreign job scheduled to be executed in $O^{(k)}$ (or the first idle time if $O^{(k)}$ has no foreign job) and let $J_i^{(k)}$ be any job belonging to $O^{(k)}$. Then, the selfishness restriction forbids any schedule where $C_f^{(l)} < C_i^{(k)} - p_i^{(k)}$. In other words, $O^{(k)}$ refuses to cooperate if one of its jobs could be executed earlier on one of $O^{(k)}$'s machines, even if this leads to a larger global makespan.
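Under our reading, the restriction can be checked mechanically. A minimal sketch follows (the data layout, the triple encoding, and `respects_selfishness` are all our own, and the first-idle-time fallback of the definition is not modeled):

```python
# Sketch of a checker for the selfishness restriction of Section 2.2
# (data layout and helper names are ours, not the paper's).
# A machine schedule is a list of (owner, start, duration) triples;
# org_machines[k] lists the machines owned by organization O^(k).

def respects_selfishness(org_machines):
    for k, machines in enumerate(org_machines):
        jobs = [job for machine in machines for job in machine]
        foreign = [(start, start + dur)
                   for owner, start, dur in jobs if owner != k]
        if not foreign:
            continue  # no foreign job on O^(k) (idle-time case not modeled here)
        # C_f^(l): completion time of the first foreign job scheduled on O^(k)
        first_foreign_end = min(foreign)[1]
        # Forbidden: C_f^(l) < C_i^(k) - p_i^(k), i.e. a local job of O^(k)
        # starts only after the first foreign job has already completed.
        for owner, start, dur in jobs:
            if owner == k and start > first_foreign_end:
                return False
    return True
```

For example, an organization running a foreign job that completes at time 1 before a local job that only starts at time 2 violates the restriction, since the local job could be swapped earlier.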

3 Complexity Analysis

3.1 Lower Bounds

Pascual et al. [12] showed, with an instance having two organizations and two machines per organization, that every algorithm that solves MOSP (for rigid, parallel jobs and $C_{\max}$ as local objective) has an approximation ratio of at least $3/2$ when compared to the optimal makespan that could be obtained without the local constraints. We show that the same bound applies asymptotically even with a larger number of organizations.

Take the instance depicted in Figure 1a. $O^{(1)}$ initially has two jobs of size $N$ and all the others initially have $N$ jobs of size 1. All organizations contribute

[Figure 1: (a) Initial instance – $C_{\max} = 2N$; (b) Global optimum without constraints – $C_{\max} = N + 1$; (c) Optimum with MOSP constraints – $C_{\max} = 3N/2$.]

Fig. 1. Ratio between the global optimum makespan and the optimum makespan that can be obtained for both MOSP($C_{\max}$) and MOSP($\sum C_i$). Jobs owned by organization $O^{(2)}$ are highlighted.

only with 1 machine each. The optimal makespan for this instance is $N + 1$ (Figure 1b); nevertheless, it delays jobs from $O^{(2)}$ and, as a consequence, does not respect MOSP's local constraints. The best possible makespan that respects the local constraints (whether the local objective is the makespan or the average completion time) is $3N/2$, as shown in Figure 1c.

3.2 Selfishness and Lower Bounds

Although all organizations will likely cooperate with each other to achieve the best global makespan possible, their selfish behavior will certainly impact the quality of the best attainable global makespan. We study here the impact of the new selfishness restrictions on the quality of the achievable schedules. We show that these restrictions impact both MOSP($C_{\max}$) and MOSP($\sum C_i$) compared with unrestricted schedules and, moreover, that MOSP($C_{\max}$) with selfishness restrictions suffers from limited performance compared to MOSP($C_{\max}$) with local constraints only.

Proposition 1. Any approximation algorithm for both MOSP($C_{\max}$) and MOSP($\sum C_i$) has a ratio greater than or equal to 2 with respect to the optimal makespan without constraints if all organizations behave selfishly.

Proof. We prove this result using the example described in Figure 1. It is clear from Figure 1b that an optimal solution for a schedule without local constraints achieves a makespan of $N + 1$. However, with the added selfishness restrictions, Figure 1a (with a makespan of $2N$) represents the only valid schedule possible. We can, therefore, conclude that local constraints combined with selfishness restrictions imply that no algorithm can provide an approximation ratio better than 2 when compared with the problem without constraints. □

Proposition 1 gives a ratio with respect to the optimal makespan without the local constraints imposed by MOSP. We can show that the same approximation ratio of 2 also applies for MOSP($C_{\max}$) with respect to the optimal makespan even if the MOSP constraints are respected.
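The gap of Proposition 1 can be checked numerically, under our encoding of the Figure 1 instance ($N$ single-machine organizations, $O^{(1)}$ with two jobs of size $N$, every other organization with $N$ unit jobs; `figure1_ratio` is our own hypothetical helper):

```python
# Numeric sanity check of the instance behind Proposition 1 (our encoding).

def figure1_ratio(N):
    total_work = 2 * N + (N - 1) * N  # = N * (N + 1)
    optimum = total_work // N         # Fig. 1b: N + 1, no idle time
    selfish = 2 * N                   # Fig. 1a: the only schedule valid under selfishness
    return selfish / optimum

# The ratio 2N / (N + 1) tends to 2 as N grows.
```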

[Figure 2: (a) Initial instance – $C_{\max} = 2N - 2$; (b) Global optimum with MOSP constraints – $C_{\max} = N$.]

Fig. 2. Ratio between the global optimum makespan with MOSP constraints and the makespan that can be obtained by MOSP($C_{\max}$) with selfish organizations

Proposition 2. Any approximation algorithm for MOSP($C_{\max}$) has a ratio greater than or equal to $2 - \frac{2}{N}$ with respect to the optimal makespan with local constraints if all organizations behave selfishly.

Proof. Take the instance depicted in Figure 2a. $O^{(1)}$ initially has $N$ jobs of size 1 and $O^{(N)}$ has two jobs of size $N - 1$. The optimal solution that respects MOSP's local constraints is given in Figure 2b and has $C_{\max}$ equal to $N$. Nevertheless, the best solution that respects the selfishness restrictions is the initial instance, with $C_{\max}$ equal to $2N - 2$. So, the ratio of the optimal solution with the selfishness restrictions to the optimal solution with MOSP constraints is $2 - \frac{2}{N}$. □

3.3 Computational Complexity

This section studies how hard it is to find optimal solutions for MOSP, even in the simple case in which every organization contributes only one machine and two jobs. We consider the decision version of MOSP, defined as follows:

Instance: a set of $N$ organizations (for $1 \le k \le N$, organization $O^{(k)}$ has $n^{(k)}$ jobs, $m^{(k)}$ identical machines, and makespan as its local objective) and an integer $\ell$.
Question: does there exist a schedule with makespan less than $\ell$?

Theorem 1. MOSP($C_{\max}$) is strongly NP-complete.

Proof. It is straightforward to see that MOSP($C_{\max}$) $\in$ NP. Our proof is based on a reduction from the well-known 3-Partition problem [5]:

Instance: a bound $B \in \mathbb{Z}^+$ and a finite set $A$ of $3m$ integers $\{a_1, \ldots, a_{3m}\}$, such that every element of $A$ is strictly between $B/4$ and $B/2$ and such that $\sum_{i=1}^{3m} a_i = mB$.
Question: can $A$ be partitioned into $m$ disjoint sets $A_1, A_2, \ldots, A_m$ such that, for all $1 \le i \le m$, $\sum_{a \in A_i} a = B$ and $A_i$ is composed of exactly three elements?

Given an instance of 3-Partition, we construct an instance of MOSP where, for $1 \le k \le 3m$, organization $O^{(k)}$ initially has two jobs $J_1^{(k)}$ and $J_2^{(k)}$ with


$p_1^{(k)} = (m+1)B + 7$ and $p_2^{(k)} = (m+1)a_k + 1$, and all other organizations have two jobs with processing time equal to 2. We then set $\ell = (m+1)B + 7$. Figure 3 depicts the described instance. This construction is performed in polynomial time. Now, we prove that $A$ can be split into $m$ disjoint subsets $A_1, \ldots, A_m$, each one summing up to $B$, if and only if this instance of MOSP has a solution with $C_{\max} \le (m+1)B + 7$.

Assume that $A = \{a_1, \ldots, a_{3m}\}$ can be partitioned into $m$ disjoint subsets $A_1, \ldots, A_m$, each one summing up to $B$. In this case, we can build an optimal schedule for the instance as follows:

– for $1 \le k \le 3m$, $J_1^{(k)}$ is scheduled on machine $k$;
– for $3m+1 \le k \le 4m$, $J_1^{(k)}$ and $J_2^{(k)}$ are scheduled on machine $k$;
– for $1 \le i \le m$, let $A_i = \{a_{i_1}, a_{i_2}, a_{i_3}\} \subseteq A$. The jobs $J_2^{(a_{i_1})}$, $J_2^{(a_{i_2})}$ and $J_2^{(a_{i_3})}$ are scheduled on machine $3m + i$.

So, the global $C_{\max}$ is $(m+1)B + 7$ and the local constraints are respected.

Conversely, assume that MOSP has a solution with $C_{\max} \le (m+1)B + 7$. The total work $W$ of the jobs that must be executed is $W = 3m((m+1)B + 7) + 2 \cdot 2m + (m+1)\sum_{i=1}^{3m} a_i + 3m = 4m((m+1)B + 7)$. Since we have exactly $4m$ machines, the solution must be the optimal solution and there is no idle time in the schedule. Moreover, $3m$ machines must execute only one job, of size $(m+1)B + 7$. W.l.o.g., we can consider that for $3m+1 \le k \le 4m$, machine $k$ executes jobs of size less than $(m+1)B + 7$. To prove our proposition, we first show two lemmas:

Lemma 1. For all $3m+1 \le k \le 4m$, at most four jobs of size not equal to 2 can be scheduled on machine $k$ if $C_{\max}^{(k)} \le (m+1)B + 7$.

Proof. It is enough to notice that all jobs of size not equal to 2 are greater than $(m+1)B/4 + 1$, that $C_{\max}$ must be equal to $(m+1)B + 7$, and that $m + 1 > 3$. □

Lemma 2. For all $3m+1 \le k \le 4m$, exactly two jobs of size 2 are scheduled on each machine $k$ if $C_{\max}^{(k)} \le (m+1)B + 7$.

Proof. We prove this lemma by contradiction. Assume that there exists a machine $k$ such that at most one job of size 2 is scheduled on it. Then, by definition of the job sizes, all other jobs scheduled on machine $k$ have a size greater than $(m+1)B/4 + 1$. As a consequence of Lemma 1, since at most four such jobs can be scheduled on machine $k$, the total work on this machine is $(m+1)B + y + 2$ with $y \le 4$. This contradicts the facts that there is no idle time and that $\ell = (m+1)B + 7$. □

Now, we construct $m$ disjoint subsets $A_1, A_2, \ldots, A_m$ of $A$ as follows: for all $1 \le i \le m$, $a_j$ is in $A_i$ if the job with size $(m+1)a_j + 1$ is scheduled on machine $3m + i$. Note that all elements of $A$ belong to one and only one set in $\{A_1, \ldots, A_m\}$. We prove that this yields a partition with the desired properties. We focus

[Figure 3]

Fig. 3. Reduction of MOSP($C_{\max}$) from 3-Partition

on a fixed element $A_i$. By definition of $A_i$ we have that

$4 + \sum_{a_j \in A_i} ((m+1)a_j + 1) = (m+1)B + 7 \;\Rightarrow\; \sum_{a_j \in A_i} ((m+1)a_j + 1) = (m+1)B + 3$

Since $m + 1 > 3$, we have $\sum_{a_j \in A_i} (m+1)a_j = (m+1)B$. Thus, we can deduce that $A_i$ is composed of exactly three elements and $\sum_{a \in A_i} a = B$. □

We continue by showing that even if all organizations are locally interested in the average completion time, the problem is still NP-complete. We prove the NP-completeness of the MOSP($\sum C_i$) problem (with a formulation similar to the MOSP($C_{\max}$) decision problem) using a reduction from the Partition problem. The idea here is similar to the one used in the previous reduction, but the $\sum C_i$ constraints heavily restrict the allowed movements of jobs compared to the $C_{\max}$ constraints.

Theorem 2. MOSP($\sum C_i$) is NP-complete.

Proof. First, note that it is straightforward to see that MOSP($\sum C_i$) $\in$ NP. We use the Partition problem [5] to prove this theorem.

Instance: a set of $n$ integers $s_1, s_2, \ldots, s_n$.
Question: does there exist a subset $J \subseteq I = \{1, \ldots, n\}$ such that $\sum_{i \in J} s_i = \sum_{i \in I \setminus J} s_i$?

Consider an integer $M > \sum_i s_i$. Given an instance of the Partition problem, we construct an instance of the MOSP($\sum C_i$) problem as depicted in Figure 4a. There are $N = 2n + 2$ organizations having two jobs each. The organizations $O^{(2n+1)}$ and $O^{(2n+2)}$ have two jobs with processing time 1. Each integer $s_i$ from the Partition problem corresponds to a pair of jobs $t_i$ and $t'_i$, with processing times equal to $2^i M$ and $2^i M + s_i$ respectively. We set $J_1^{(k)} = t_k$ for all $1 \le k \le n$, and $J_1^{(k)} = t'_{k-n}$ for all $n + 1 \le k \le 2n$. We set $K = \frac{\sum_i t_i + \sum_i t'_i + 4}{2}$. To complete the construction, for any $k$, $1 \le k \le 2n$, the organization $O^{(k)}$ also has a job $J_2^{(k)}$ with processing time equal to $K$.

[Figure 4: (a) Initial instance; (b) Optimum.]

Fig. 4. Reduction of MOSP($\sum C_i$) from Partition

We set $\ell = K$. This construction is performed in polynomial time, and we prove that it is a reduction.

First, assume that $\{s_1, s_2, \ldots, s_n\}$ is partitioned into 2 disjoint sets $S_1, S_2$ with the desired properties. We construct a valid schedule with optimal global makespan for MOSP($\sum C_i$). For all $s_i$: if $i \in J$, we schedule job $t_i$ in organization $O^{(N)}$ and job $t'_i$ in organization $O^{(N-1)}$; otherwise, we schedule $t_i$ in $O^{(N-1)}$ and $t'_i$ in $O^{(N)}$. The problem constraints impose that organizations $O^{(N-1)}$ and $O^{(N)}$ first schedule their own jobs (two jobs of size 1). The remaining jobs are scheduled in non-decreasing order of processing time, using the Shortest Processing Time first (SPT) rule. This schedule respects MOSP's constraint of not increasing an organization's average completion time because each job is delayed by at most its own size (by construction, the sum of the sizes of all jobs scheduled before a given job is smaller than the size of that job). $C_{\max}^{(N-1)}$ will be equal to $2 + \sum_i 2^i M + \sum_{i \in J} s_i$. Since $J$ is a partition, $C_{\max}^{(N-1)}$ is exactly equal to $C_{\max}^{(N)} = 2 + \sum_i 2^i M + \sum_{i \in I \setminus J} s_i$. Also, $C_{\max}^{(N)} = C_{\max}^{(N-1)} = K$, which gives us the theoretical lower bound for $C_{\max}$.

Second, assume MOSP($\sum C_i$) has a solution with $C_{\max} \le K$. We prove that $\{s_1, s_2, \ldots, s_n\}$ can be partitioned into 2 disjoint sets $S_1, S_2$ with the desired properties. This solution of MOSP($\sum C_i$) has the structure drawn in Figure 4b. To achieve a $C_{\max}$ equal to $K$, the scheduler must keep all jobs of size exactly $K$ in their initial organizations. Moreover, all jobs of size 1 must also remain in their initial organizations, otherwise these jobs would be delayed. The remaining jobs (all $t_i$ and $t'_i$ jobs) must be scheduled either in organization $O^{(N-1)}$ or in $O^{(N)}$. Each of these two processors must execute a total work of $\frac{2K - 4}{2} = \sum_i 2^i M + \frac{\sum_i s_i}{2}$ to achieve a makespan equal to $K$.

Let $J \subseteq I = \{1, \ldots, n\}$ be such that $i \in J$ if $t'_i$ was scheduled on organization $O^{(N-1)}$. $O^{(N-1)}$ executes a total work of $W^{(N-1)} = \sum_i 2^i M + \sum_{i \in J} s_i$, which must be equal to the total work of $O^{(N)}$, $W^{(N)} = \sum_i 2^i M + \sum_{i \in I \setminus J} s_i$. Since $\sum_i s_i < M$, we have $W^{(N-1)} \equiv \sum_{i \in J} s_i \pmod{M}$ and $W^{(N)} \equiv \sum_{i \in I \setminus J} s_i \pmod{M}$. This means that $W^{(N-1)} = W^{(N)} \Rightarrow (W^{(N-1)} \bmod M) = (W^{(N)} \bmod M) \Rightarrow \sum_{i \in J} s_i = \sum_{i \in I \setminus J} s_i$. Hence, if MOSP($\sum C_i$) has a solution with $C_{\max} \le K$, then the set $J$ is a solution for Partition. □

4 Algorithms

In this section, we present three different heuristics to solve MOSP($C_{\max}$) and MOSP($\sum C_i$). All algorithms have the additional property of respecting the selfishness restrictions.

4.1 Iterative Load Balancing Algorithm

The Iterative Load Balancing Algorithm (ILBA) [13] is a heuristic that redistributes the load of the most loaded organizations. The idea is to incrementally rebalance the load without delaying any job, starting with the least loaded organizations; then, one by one, each organization has its load rebalanced.

The heuristic works as follows. First, each organization schedules its own jobs locally, and the organizations are numbered by non-decreasing makespan, i.e., $C_{\max}^{(1)} \le C_{\max}^{(2)} \le \ldots \le C_{\max}^{(N)}$. For $k = 2$ to $N$, the jobs of $O^{(k)}$ are rescheduled sequentially and assigned to the least loaded of the organizations $O^{(1)}, \ldots, O^{(k)}$. Each job is rescheduled by ILBA either earlier or at the same time as it was scheduled before the migration. In other words, no job is delayed by ILBA, which guarantees that the local constraint is respected for both MOSP($C_{\max}$) and MOSP($\sum C_i$).
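The iteration order can be sketched as follows, under simplifying assumptions of ours (one machine per organization, and only machine loads are tracked, so the start-time bookkeeping that guarantees no delays in the real algorithm is elided; `ilba_loads` is our own hypothetical name):

```python
# Load-only sketch of ILBA's iterative rebalancing (our simplification).

def ilba_loads(org_jobs):
    """org_jobs[k]: processing times of the jobs owned by organization k.
    Returns the final load of each organization's single machine."""
    n = len(org_jobs)
    # enumerate organizations by non-decreasing local makespan (= local load here)
    order = sorted(range(n), key=lambda k: sum(org_jobs[k]))
    loads = [0] * n
    loads[order[0]] = sum(org_jobs[order[0]])  # the least loaded one keeps its jobs
    for idx in range(1, n):
        k = order[idx]
        pool = order[:idx + 1]                 # O^(1) ... O^(k) in makespan order
        for p in org_jobs[k]:                  # reschedule O^(k)'s jobs one by one
            target = min(pool, key=lambda o: loads[o])
            loads[target] += p
    return loads
```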

4.2 LPT-LPT and SPT-LPT Heuristics

We developed and evaluated (see Section 5) two new heuristics based on the classical LPT (Longest Processing Time first [6]) and SPT (Shortest Processing Time first [2]) algorithms for solving MOSP($C_{\max}$) and MOSP($\sum C_i$), respectively.

Both heuristics work in two phases. During the first phase, all organizations minimize their own local objectives: each organization starts by applying LPT to its own jobs if it is interested in minimizing its own makespan, or SPT if it is interested in its own average completion time. In the second phase, all organizations cooperatively minimize the makespan of the entire grid computing system without worsening any local objective. This phase works as follows: each time an organization becomes idle, i.e., it finishes the execution of all jobs assigned to it, the longest job that has not started yet is migrated to and executed by the idle organization. This greedy algorithm works like a global LPT, always choosing the longest not-yet-executed job among the jobs of all organizations.
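The two-phase scheme can be sketched as a small event simulation (a simplified model of ours: one machine per organization, makespan only; `lpt_lpt_makespan` and the queue encoding are hypothetical, not the paper's):

```python
import heapq

def lpt_lpt_makespan(org_jobs):
    """Two-phase LPT-LPT sketch (single machine per organization):
    phase 1 runs LPT locally; phase 2 lets an idle organization take the
    longest job, over all organizations, that has not started yet."""
    queues = [sorted(jobs, reverse=True) for jobs in org_jobs]  # local LPT order
    idle = [(0, k) for k in range(len(org_jobs))]  # (time machine frees up, org)
    heapq.heapify(idle)
    makespan = 0
    while any(queues):
        t, k = heapq.heappop(idle)
        if queues[k]:                  # phase 1: still local work to run
            p = queues[k].pop(0)
        else:                          # phase 2: steal the globally longest job
            donor = max((q for q in queues if q), key=lambda q: q[0])
            p = donor.pop(0)
        makespan = max(makespan, t + p)
        heapq.heappush(idle, (t + p, k))
    return makespan
```

Because each queue is kept in decreasing order, the head of the fullest-headed queue is always the globally longest not-yet-started job, matching the greedy rule described above.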

4.3 Analysis

ILBA, LPT-LPT and SPT-LPT do not delay any of the jobs when compared to the initial local schedule. During the rebalancing phase, all jobs either remain in their original organization or are migrated to an organization that became idle at a preceding time. The implications are:


– the selfishness restriction is respected: if a job is migrated, it starts before the completion time of the last job of its initial organization;
– if an organization's local objective is to minimize the makespan, migrating a job to an earlier moment in time decreases the job's completion time and, as a consequence, does not increase the initial makespan of the organization;
– if an organization's local objective is to minimize the average completion time, migrating a job from its initial organization to one that became idle at an earlier moment in time decreases the completion time of all jobs of the initial organization and of the migrated job. This means that the $\sum C_i$ of the jobs of the initial organization always decreases;
– the rebalancing phase of all three algorithms works like a list scheduling algorithm, so Graham's classical approximation ratio of $2 - \frac{1}{N}$ for list scheduling [6] holds for all of them.

We recall from Section 3.2 that no algorithm respecting the selfishness restrictions can achieve an approximation ratio better than 2 for MOSP($C_{\max}$). Since all our algorithms achieve an approximation ratio of 2, no further enhancement is possible without removing the selfishness restrictions.

5 Experiments

We conducted a series of simulations comparing ILBA, LPT-LPT, and SPT-LPT under various experimental settings. The workload was randomly generated with parameters matching the typical environment found in academic grid computing systems [13]. We evaluated the algorithms on instances containing a random number of machines, organizations and jobs of different sizes. In our tests, the number of initial jobs in each organization follows a Zipf distribution with exponent 1.4267, which best models virtual organizations in real-world grid computing systems [7].

We are interested in the improvement of the global $C_{\max}$ provided by the different algorithms. The $C_{\max}$ obtained by the algorithms is compared with the well-known theoretical lower bound for the scheduling problem without constraints, $LB = \max\left( \frac{\sum_{i,k} p_i^{(k)}}{\sum_k m^{(k)}},\ \max_{i,k} p_i^{(k)} \right)$.
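The normalization above can be sketched directly (our own hypothetical helper; the paper only gives the formula):

```python
# Classical lower bound used to normalize the experimental results:
# max(total work / total number of machines, longest job).

def lower_bound(org_jobs, org_machines):
    all_jobs = [p for jobs in org_jobs for p in jobs]
    return max(sum(all_jobs) / sum(org_machines), max(all_jobs))
```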

Our main conclusion is that, despite the fact that the selfishness restrictions are respected by all heuristics, ILBA and LPT-LPT obtained near-optimal results in most cases. This is not unusual, since it follows the experimental behavior of standard list scheduling algorithms, for which it is easy to obtain a near-optimal schedule when the number of tasks grows large. SPT-LPT produces worse results due to the effect of applying SPT locally. However, in some particular cases, in which the number of jobs is not much larger than the number of machines available, the experiments yield more interesting results. Figure 5 shows the histograms of a representative instance of such a particular case. The histograms show the frequency of the ratio of the obtained $C_{\max}$ to the lower bound over 5000 different instances with 20 organizations and 100 jobs for ILBA, LPT-LPT and SPT-LPT. Similar results have been obtained for many different sets of parameters. LPT-LPT outperforms ILBA (and SPT-LPT) for most instances and its average ratio to the lower bound is less than 1.3.

[Figure 5: histograms of the ratio to the lower bound for (a) ILBA, (b) LPT-LPT, and (c) SPT-LPT.]

Fig. 5. Frequency of results obtained by ILBA, LPT-LPT, and SPT-LPT when the results are not always near optimal

6 Concluding Remarks

In this paper, we have investigated scheduling on multi-organization platforms. We presented the MOSP($C_{\max}$) problem from the literature and extended it to a new related problem, MOSP($\sum C_i$), with another local objective. In each case we studied how to improve the global makespan while guaranteeing that no organization worsens its own results.

We first showed that both versions MOSP($C_{\max}$) and MOSP($\sum C_i$) of the problem are NP-hard. Furthermore, we introduced the concept of selfishness in these problems, which corresponds to additional scheduling restrictions designed to reduce the incentive for the organizations to cheat locally and disrupt the global schedule. We proved that any algorithm respecting the selfishness restrictions cannot achieve an approximation ratio better than 2 for MOSP($C_{\max}$).

Two new scheduling algorithms were proposed, namely LPT-LPT and SPT-LPT, in addition to ILBA from the literature. All these algorithms are list scheduling algorithms, and thus achieve a 2-approximation. We provided an in-depth analysis of these algorithms, showing that all of them respect the selfishness restrictions. Finally, all these algorithms were implemented and analysed through experimental simulations. The results show that our new LPT-LPT outperforms ILBA and that all algorithms exhibit near-optimal performance when the number of jobs becomes large.

Future research directions will focus more on game theory. We intend to study schedules in the case where several organizations secretly cooperate to cheat the central authority.

References

1. Baker, B.S., Coffman Jr., E.G., Rivest, R.L.: Orthogonal packings in two dimensions. SIAM Journal on Computing 9(4), 846–855 (1980)
2. Bruno, J.L., Coffman Jr., E.G., Sethi, R.: Scheduling independent tasks to reduce mean finishing time. Communications of the ACM 17(7), 382–387 (1974)

Analysis of Multi-Organization Scheduling Algorithms

379

3. Caragiannis, I., Flammini, M., Kaklamanis, C., Kanellopoulos, P., Moscardelli, L.: Tight bounds for selfish and greedy load balancing. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4051, pp. 311–322. Springer, Heidelberg (2006)
4. Even-Dar, E., Kesselman, A., Mansour, Y.: Convergence time to Nash equilibria. ACM Transactions on Algorithms 3(3), 32 (2007)
5. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, New York (January 1979)
6. Graham, R.L.: Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics 17(2), 416–429 (1969)
7. Iosup, A., Dumitrescu, C., Epema, D., Li, H., Wolters, L.: How are real grids used? The analysis of four grid traces and its implications. In: 7th IEEE/ACM International Conference on Grid Computing, pp. 262–269 (September 2006)
8. Jansen, K., Otte, C.: Approximation algorithms for multiple strip packing. In: Bampis, E., Jansen, K. (eds.) WAOA 2009. LNCS, vol. 5893, pp. 37–48. Springer, Heidelberg (2010)
9. Koutsoupias, E., Papadimitriou, C.: Worst-case equilibria. In: Meinel, C., Tison, S. (eds.) STACS 1999. LNCS, vol. 1563, pp. 404–413. Springer, Heidelberg (1999)
10. Nisan, N., Roughgarden, T., Tardos, E., Vazirani, V.V.: Algorithmic Game Theory. Cambridge University Press, Cambridge (September 2007)
11. Ooshita, F., Izumi, T., Izumi, T.: A generalized multi-organization scheduling on unrelated parallel machines. In: International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), pp. 26–33. IEEE Computer Society, Los Alamitos (December 2009)
12. Pascual, F., Rzadca, K., Trystram, D.: Cooperation in multi-organization scheduling. In: Kermarrec, A.-M., Bougé, L., Priol, T. (eds.) Euro-Par 2007. LNCS, vol. 4641, pp. 224–233. Springer, Heidelberg (August 2007)
13. Pascual, F., Rzadca, K., Trystram, D.: Cooperation in multi-organization scheduling. Concurrency and Computation: Practice and Experience 21(7), 905–921 (2009)
14. Schwiegelshohn, U., Tchernykh, A., Yahyapour, R.: Online scheduling in grids. In: IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pp. 1–10 (April 2008)
15. Ye, D., Han, X., Zhang, G.: On-line multiple-strip packing. In: Proceedings of the 3rd International Conference on Combinatorial Optimization and Applications (June 2009). LNCS, vol. 5573, pp. 155–165. Springer, Heidelberg (2009)
16. Zhuk, S.N.: Approximate algorithms to pack rectangles into several strips. Discrete Mathematics and Applications 16(1), 73–85 (2006)

Area-Maximizing Schedules for Series-Parallel DAGs

Gennaro Cordasco¹ and Arnold L. Rosenberg²

¹ University of Salerno, ISISLab, Dipartimento di Informatica ed Applicazioni "R.M. Capocelli", Fisciano 84084, Italy
[email protected]
² Colorado State University, Electrical and Computer Engineering, Fort Collins, CO 81523, USA
[email protected]

Abstract. Earlier work introduced a new optimization goal for DAG schedules: the "AREA" of the schedule. AREA-maximizing schedules are intended for computational environments, such as Internet-based computing and massively multicore computers, that benefit from DAG-schedules that produce execution-eligible tasks as fast as possible. The earlier study of AREA-maximizing schedules showed how to craft such schedules efficiently for DAGs that have the structure of trees and other, less well-known, families of DAGs. The current paper extends the earlier work by showing how to efficiently craft AREA-maximizing schedules for series-parallel DAGs, a family that arises, e.g., in multi-threaded computations. The tools that produce the schedules for series-parallel DAGs promise to apply also to other large families of computationally significant DAGs.

1 Introduction

Many modern computing platforms, such as the Internet and massively multicore architectures, have characteristics that are not addressed by traditional strategies¹ for scheduling DAGs, i.e., computations having inter-task dependencies that constrain the order of executing tasks. This issue is discussed at length in, e.g., [24], where the seeds of the Internet-based computing scheduling (IC-scheduling) paradigm are planted. IC-scheduling strives to meet the needs of the new platforms by crafting schedules that execute DAGs in a manner that renders new tasks eligible for execution at the maximal possible rate. The paradigm thereby aims to: (a) enhance the utilization of computational resources, by always having work to allocate to an available client/processor; (b) lessen the likelihood of a computation's stalling pending completion of already-allocated tasks. Significant progress in [5,6,8,21,24,25] has extended the capabilities of IC-scheduling so that it can now optimally schedule a wide range of computationally significant DAGs (cf. [4]). Moreover, simulations using DAGs that arise in real scientific computations [20], as well as structurally similar artificial DAGs [13], suggest that IC-schedules can have substantial computational benefits over schedules produced by a

Research supported in part by US NSF Grant CNS-0905399.
¹ Many traditional DAG-scheduling strategies are discussed and compared in [12,18,22].

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 380–392, 2010. © Springer-Verlag Berlin Heidelberg 2010


range of common heuristics. However, it has been known since [21] that many significant classes of DAGs do not admit schedules that are optimal within the framework of IC-scheduling. The current authors have responded to this fact with a relaxed version of IC-scheduling under which every DAG admits an optimal schedule [7]. This relaxed strategy strives to maximize the average rate at which new tasks are rendered eligible for execution. For reasons that become clear in Section 2, we call the new optimization metric for schedules the AREA of the schedule: the goal is an AREA-maximizing schedule for a DAG (an AM-schedule, for short). The study in [7] derived many basic properties of AM-schedules. Notable among these are: (1) Every DAG admits an AM-schedule. (2) AM-scheduling subsumes the goal of IC-scheduling, in the following sense. If a DAG G admits an optimal IC-schedule, then: (a) that schedule is an AM-schedule; (b) every AM-schedule is optimal under IC-scheduling. Thus, we never lose scheduling quality by focusing on achieving an AM-schedule rather than an IC-optimal schedule.

Our Contribution. The major algorithmic results of [7] show how to craft AM-schedules efficiently for DAGs that have the structure of a monotonic tree, as well as for other, less common, families of DAGs. The current paper extends this contribution by showing how to efficiently craft an AM-schedule for any series-parallel DAG (SP-DAG, for short). SP-DAGs have a regularity of structure that makes them algorithmically advantageous in a broad range of applications; cf. [15,23,27]. Most relevant to our study: (1) SP-DAGs are the natural abstraction of many significant classes of computations, including divide-and-conquer algorithms. (2) SP-DAGs admit efficient schedules in parallel computing systems such as CILK [1] that employ a multi-threaded computational paradigm.
(3) Arbitrary DAGs can be efficiently (in linear time) reformulated as SP-DAGs with little loss in degree of parallelism [15,23]. Our contribution addresses two facets of the existing work on DAG-scheduling: (1) The work on CILK and kindred multi-threaded systems [1,2,3] employs performance metrics that are not relevant in the computational environments that we target (as discussed earlier). (2) Many SP-DAGs do not admit optimal schedules under IC-scheduling [6,21].

Related Work. Problems related to scheduling DAGs on parallel/distributed computing systems have been studied for decades. Most versions of these problems are known to be NP-hard [10], except when scheduling special classes of DAGs (cf. [11]). This has led both researchers and practitioners to seek efficient scheduling heuristics that seem to perform well in practice (cf. [14,26]). Among the interesting attempts to understand such heuristics are the comparisons in [12,18,22] and the taxonomy in [19]. Despite the disparate approaches to scheduling in sources such as those cited, virtually every algorithm/heuristic that predates IC-scheduling shares one central characteristic: they all rely on knowing (almost) exact times for each computer to execute each task and to communicate with a collaborating computer. The central premise underlying IC- and AM-scheduling is that within many modern computing environments, one cannot even approximate such knowledge reliably. This premise is shared by sources such as [16,17], whose approaches to scheduling for IC platforms admit margins of error in time estimates of 50% or more. Indeed, IC- and AM-scheduling are analytical proposals for what to do when accurate estimates are out of the question.


2 Background

Computation-DAGs and schedules. We study computations that are described by DAGs. Each DAG G has a set VG of nodes, each representing a task, and a set AG of (directed) arcs, each representing an intertask dependency. For an arc (u → v) ∈ AG:
• task v cannot be executed until task u is;
• u is a parent of v, and v is a child of u in G.
A parentless node is a source; a childless node is a target. G is connected if it is so when one ignores arc orientations. When VG1 ∩ VG2 = ∅, the sum G1 + G2 of DAGs G1 and G2 is the DAG with node-set VG1 ∪ VG2 and arc-set AG1 ∪ AG2.

When one executes a DAG G, a node v ∈ VG becomes ELIGIBLE (for execution) only after all of its parents have been executed. Note that all of G's sources are ELIGIBLE at the beginning of an execution; the goal is to render all of G's targets ELIGIBLE. Informally, a schedule Σ for G is a rule for selecting which ELIGIBLE node to execute at each step of an execution of G; formally, Σ is a topological sort of G, i.e., a linearization of VG under which all arcs point from left to right (cf. [9]). We do not allow recomputation of nodes/tasks, so a node loses its ELIGIBLE status once it is executed. In compensation, after v ∈ VG has been executed, there may be new nodes that are rendered ELIGIBLE; this occurs when v is their last parent to be executed.

We measure the quality of a schedule Σ using the rate at which Σ renders nodes of G ELIGIBLE. Toward this end, we define², for k ∈ [1, |VG|], the quantities EΣ(k) and eΣ(k): EΣ(k) is the number of nodes of G that are ELIGIBLE after Σ has executed k nodes; and eΣ(k) is the number of nodes (perforce, nonsources) of G that are rendered ELIGIBLE by Σ's kth node-execution. (We measure time in an event-driven manner, as the number of nodes that have been executed thus far, so we often refer to "step k" rather than "Σ's kth node-execution.")

The AREA metric for schedules.
Focus on a DAG G = (VG, AG) with n nontargets, N nonsources, s sources, and S targets (note: NG =def |VG| = s + N = S + n). The quality of a schedule Σ for G at step t is given by the size of EΣ(t): the larger, the better. The goal of IC-scheduling is to execute G's nodes in an order that maximizes quality at every step t ∈ [1, NG] of the execution. A schedule Σ* that achieves this demanding goal is IC-optimal; formally,

(∀t ∈ [1, NG])  EΣ*(t) = max{EΣ(t) : Σ a schedule for G}.

The AREA of a schedule Σ for G, AREA(Σ), is the sum

AREA(Σ) =def EΣ(0) + EΣ(1) + · · · + EΣ(NG).    (2.1)

The normalized AREA, E(Σ) =def AREA(Σ) ÷ NG, is the average number of nodes that are ELIGIBLE when Σ executes G. The term "area" arises by formal analogy with Riemann sums as approximations to integrals. The goal of the scheduling paradigm we develop here is to find an AM-schedule for G, i.e., a schedule Σ′ such that

AREA(Σ′) = max{AREA(Σ) : Σ a schedule for G}.

² [a, b] denotes the set of integers {a, a + 1, . . . , b}.
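The quantities EΣ(k) and AREA(Σ) can be computed directly from these definitions. A minimal sketch (the representation of a DAG by parent sets is our own convention, not from the paper):

```python
def area_of_schedule(parents, schedule):
    """AREA(S) = E(0) + E(1) + ... + E(N), where E(k) is the number of
    ELIGIBLE nodes (not yet executed, all parents executed) after the
    k-th node-execution.  `parents` maps each node to the set of nodes
    that must be executed before it; `schedule` is a topological order
    of all nodes."""
    executed = set()
    def eligible():
        return sum(1 for v, ps in parents.items()
                   if v not in executed and ps <= executed)
    total = eligible()            # E(0): exactly the sources
    for v in schedule:
        executed.add(v)
        total += eligible()       # E(k) after executing the k-th node
    return total
```

On the three-node DAG with sources s1, s2 and the single arc (s1 → c), the schedule ⟨s1, s2, c⟩ has AREA 5 while ⟨s2, s1, c⟩ has AREA 4: executing s1 first renders c ELIGIBLE sooner.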


We have discovered in [7] a number of ways to simplify the quest for AM-schedules. First, we can restrict the form of such schedules.

Lemma 1. [7] Altering a schedule Σ for DAG G so that it executes all of G's nontargets before any of its targets cannot decrease Σ's AREA.

Hence, we can streamline analysis by ignoring targets. Second, we can alter the AREA metric in certain ways: the only portion of AREA(Σ) that actually depends on choices made by Σ is

area(Σ) =def ∑_{t=1}^{n} ∑_{j=1}^{t} eΣ(j) = n·eΣ(1) + (n−1)·eΣ(2) + · · · + eΣ(n).    (2.2)
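Equation (2.2) can be checked mechanically. The sketch below (our own illustration) computes area(Σ) from an eligibility profile both as the double sum and as the weighted closed form:

```python
def area_double_sum(profile):
    """area(S) as the double sum in Eq. (2.2): for each t, add the number
    of nonsources rendered ELIGIBLE by the first t executions."""
    return sum(sum(profile[:t]) for t in range(1, len(profile) + 1))

def area_weighted(profile):
    """The equivalent closed form n*e(1) + (n-1)*e(2) + ... + 1*e(n):
    each e(j) is counted once for every t >= j."""
    n = len(profile)
    return sum((n - j) * e for j, e in enumerate(profile))
```

On the profile ⟨2, 2, 2, 0, 1, 0, 1, 0, 1⟩ used later in the worked example of Sect. 3.2, both forms give 57.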

The eligibility profile associated with schedule Σ for G is the n-entry vector Π(Σ) = ⟨eΣ(1), eΣ(2), . . . , eΣ(n)⟩. Our view of schedules as sequences of nontarget nodes allows us to talk about subschedules of Σ, which are contiguous subsequences of Σ. Each subschedule Φ delimits a (not necessarily connected) subDAG GΦ of G: VGΦ is the subset of VG whose nodes appear in Φ, plus all the nodes in VG which become ELIGIBLE due to the execution of Φ; AGΦ contains precisely those arcs from AG that have both ends in VGΦ.

Let G be a DAG that admits a schedule Σ. The following abbreviations hopefully enhance legibility. Let Φ be a subschedule of Σ.
– For any sequence Φ, including (sub)schedules, ℓΦ = |Φ| denotes the length of Φ. Note that EΦ(k) and eΦ(k) are defined for k ∈ [1, ℓΦ].
– Π(Φ) denotes GΦ's eligibility profile: Π(Φ) = ⟨eΦ(1), . . . , eΦ(ℓΦ)⟩ = ⟨eΣ(i), . . . , eΣ(j)⟩ for some i, j ∈ [1, n] with ℓΦ = j − i + 1.
– SUM(Φ, a, b) = ∑_{i=a}^{b} eΦ(i), and SUM(Φ) = SUM(Φ, 1, ℓΦ).
For disjoint subschedules Φ and Ψ of Σ, denote by (Φ · Ψ) the subschedule of Σ that concatenates Φ and Ψ. Thus: (Φ · Ψ) has ℓΦ + ℓΨ elements; it first executes all nodes in Φ, in the same order as Φ, and then executes all nodes in Ψ, in the same order as Ψ.

Series-parallel DAGs (SP-DAGs). A (2-terminal) series-parallel DAG G (SP-DAG, for short) is produced by a sequence of the following operations (cf. Fig. 1):
1. Create. Form a DAG G that has: (a) two nodes, a source s and a target t, which are jointly G's terminals; (b) one arc, (s → t), directed from s to t.
2. Compose SP-DAGs G′, with terminals s′, t′, and G″, with terminals s″, t″.
(a) Parallel composition. Form G = G′ ⇑ G″ from G′ and G″ by identifying/merging s′ with s″ to form a new source s, and t′ with t″ to form a new target t.
(b) Series composition. Form G = (G′ → G″) from G′ and G″ by identifying/merging t′ with s″. G has the single source s′ and the single target t″.
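The two composition rules can be mirrored in a tiny term representation of a decomposition tree. This is a sketch under our own encoding (the name `count_nodes` and the `'leaf'`/`('S', ...)`/`('P', ...)` terms are ours):

```python
# Decomposition-tree terms: 'leaf' is the single-arc DAG (s -> t);
# ('S', L, R) is series composition, ('P', L, R) is parallel composition.
def count_nodes(tree):
    """Node count of the SP-DAG a term denotes.  Series composition
    identifies t' with s'' (one merged node); parallel composition
    identifies both pairs of terminals (two merged nodes)."""
    if tree == 'leaf':
        return 2
    op, left, right = tree
    merged = 1 if op == 'S' else 2
    return count_nodes(left) + count_nodes(right) - merged
```

A series chain of two arcs has 3 nodes, while two arcs composed in parallel still have only the 2 shared terminals.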
One can use examples from [6] to craft SP-DAGs that do not admit optimal IC-schedules.


Fig. 1. Compositions of SP-DAGs

3 Maximizing Area for Series-Parallel DAGs

Theorem 1. There exists an algorithm ASP-DAG that finds, in time O(n²), an AM-schedule for any n-node SP-DAG.

The remainder of the section develops Algorithm ASP-DAG, to prove Theorem 1.

3.1 The Idea Behind Algorithm ASP-DAG

The problem of recognizing when a given DAG is an SP-DAG is classic within the area of algorithm design. A decomposition-based algorithm in [27], which solves the problem in linear time, supplies the basic idea underlying Algorithm ASP-DAG. We illustrate how with the sample SP-DAG G in Fig. 2 (left). The fact that G is an SP-DAG can be verified with the help of a binary decomposition tree TG whose structure illustrates the sequence of series and parallel compositions that form G from a set of nodes; hence, TG ultimately takes one back to the definition of SP-DAGs in Section 2. Fig. 2 (right) depicts TG for the G of Fig. 2 (left). The leaves of TG are single-arc DAGs; each of its internal nodes represents the SP-DAG obtained by composing (in series or in parallel) its two children. Importantly for the design of Algorithm ASP-DAG, one can use the algorithm of [27] to construct TG from a given DAG G: the construction succeeds just when G is an SP-DAG. Algorithm ASP-DAG uses TG to design an AM-schedule for G inductively.

3.2 Algorithm ASP-DAG's Inductive Approach

Given an SP-DAG G, Algorithm ASP-DAG first constructs TG. It then designs an AM-schedule for G by exploiting the structure of TG.

A. Leaf-DAGs of TG. Each leaf-DAG is a single-arc DAG, i.e., a series composition of degenerate one-node DAGs. Each such DAG has the form (s → t), hence admits a unique schedule (execute s; then execute t) which, perforce, is an AM-schedule.

Area-Maximizing Schedules for Series-Parallel DAGs

385

Fig. 2. An example of the decomposition of SP-DAGs

B. Series-Composing Internal Nodes of TG. Focus on a node of TG that represents the series composition (G′ → G″) of disjoint SP-DAGs G′ and G″.

Lemma 2. If the disjoint SP-DAGs G′ and G″ admit, respectively, AM-schedules Σ′ and Σ″, then the schedule (Σ′ · Σ″) is AM for the series composition (G′ → G″).

Proof. Let G′ have N′ nonsources, and let G″ have n″ nontargets. By definition of ELIGIBILITY, every node of G′ must be executed before any node of G″, so we need focus only on schedules for (G′ → G″) of the form (Σ1 · Σ2), where Σ1 is a schedule for G′ and Σ2 is a schedule for G″. We then have

area(Σ1 · Σ2) = area(Σ1) + area(Σ2) + ℓΣ2 · SUM(Σ1) = area(Σ1) + area(Σ2) + n″N′.

Thus, choosing schedules Σ1 and Σ2 that maximize both area(Σ1) and area(Σ2) will maximize area(Σ1 · Σ2). The lemma follows.

C. Parallel-Composing Internal Nodes of TG. Focus on a node of TG that represents the parallel composition (G′ ⇑ G″) of disjoint SP-DAGs G′ and G″. We present an algorithm that crafts an AM-schedule Σ for (G′ ⇑ G″) from AM-schedules Σ′ and Σ″ for G′ and G″.

Algorithm SP-AREA
Input: SP-DAGs G′ and G″ and respective AM-schedules Σ′ and Σ″.
Output: AM-schedule Σ for G = (G′ ⇑ G″).

G′ and G″ are disjoint within (G′ ⇑ G″), except for their shared source s and target t. Because every schedule for (G′ ⇑ G″) begins by executing s and finishes by executing t, we can focus only on how to schedule the sum, G1 + G2, obtained by removing both s and t from (G′ ⇑ G″): G1 and G2 are not-necessarily-connected subDAGs of, respectively, G′ and G″. (The parallel composition in Fig. 1 illustrates that G1 and/or G2 can be disconnected: removing s and t in the figure disconnects the lefthand DAG.)


(1) Construct the average-eligibility profile AVG(Σ′) from Π(Σ′) as follows. If Π(Σ′) = ⟨eΣ′(1), . . . , eΣ′(n′)⟩, then AVG(Σ′) = ⟨aΣ′(1), . . . , aΣ′(n′)⟩, where, for k ∈ [1, n′],

aΣ′(k) = (1/k) ∑_{i=1}^{k} eΣ′(i).

Similarly, construct the average-eligibility profile AVG(Σ″) from Π(Σ″).

(2) (a) Let j′ be the smallest index of AVG(Σ′) whose value is maximum within profile AVG(Σ′). Segregate the subsequence of Σ′ comprising elements 1, . . . , j′, to form an (indivisible) block of nodes of Σ′ with average eligibility value (AEV) aΣ′(j′). Perform a similar analysis for Σ″ to determine the value j″ and the associated block of nodes of Σ″ with AEV aΣ″(j″).
(b) Repeat procedure (a) for Σ′, using indices from j′ + 1 to n′, collecting blocks, until we find a block that ends with the last-executed node of Σ′. Do the analogous repetition for Σ″.

After procedure (a)-then-(b), each of Σ′ and Σ″ is decomposed into a sequence of blocks, plus the associated sequences of AEVs. We claim that the following schedule Σ for (G′ ⇑ G″) is AM.

Σ: 1. Execute s.
2. Merge the blocks of Σ′ and Σ″ in nonincreasing order of AEV. (Blocks are kept intact; ties in AEV are broken arbitrarily.)
3. Execute t.

Claim. Σ is a valid schedule for (G′ ⇑ G″). This is obvious because: (1) Σ keeps the blocks of both Σ′ and Σ″ intact; (2) Σ incorporates the blocks of Σ′ (resp., Σ″) in their order within Σ′ (resp., Σ″).

Claim. Σ is an AM-schedule for (G′ ⇑ G″).

Before verifying Σ's optimality, we digress to establish two technical lemmas.

Lemma 3. Let Σ be a schedule for DAG G, and let Φ and Ψ be disjoint subschedules of Σ. Then area(Φ · Ψ) > area(Ψ · Φ) iff SUM(Φ)/ℓΦ > SUM(Ψ)/ℓΨ.³

Proof. Invoking (2.2), we find that

area(Φ · Ψ) = (ℓΦ + ℓΨ)eΦ(1) + (ℓΦ + ℓΨ − 1)eΦ(2) + · · · + (ℓΨ + 1)eΦ(ℓΦ) + ℓΨ eΨ(1) + (ℓΨ − 1)eΨ(2) + · · · + eΨ(ℓΨ);
area(Ψ · Φ) = (ℓΦ + ℓΨ)eΨ(1) + (ℓΦ + ℓΨ − 1)eΨ(2) + · · · + (ℓΦ + 1)eΨ(ℓΨ) + ℓΦ eΦ(1) + (ℓΦ − 1)eΦ(2) + · · · + eΦ(ℓΦ).
Therefore, area(Φ · Ψ) − area(Ψ · Φ) = ℓΨ · SUM(Φ) − ℓΦ · SUM(Ψ). The result now follows by elementary calculation.
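Lemma 3's difference identity is easy to confirm numerically, treating the two eligibility profiles as fixed, as the proof does (the check below is our own illustration):

```python
from itertools import product

def area(profile):
    """area of a schedule from its eligibility profile, per Eq. (2.2)."""
    n = len(profile)
    return sum((n - j) * e for j, e in enumerate(profile))

def check_lemma3(phi, psi):
    """area(Phi.Psi) - area(Psi.Phi) == len(Psi)*SUM(Phi) - len(Phi)*SUM(Psi),
    where concatenation of subschedules concatenates their profiles."""
    lhs = area(phi + psi) - area(psi + phi)
    rhs = len(psi) * sum(phi) - len(phi) * sum(psi)
    return lhs == rhs
```

An exhaustive check over all small profiles with entries in {0, 1, 2} confirms the identity, and hence the iff of Lemma 3.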



Lemma 4. Let Σ be a schedule for DAG G, and let Φ, Ψ, Γ, and Δ be four mutually disjoint subschedules of Σ. Then area(Φ · Ψ) > area(Ψ · Φ) iff area(Γ · Φ · Ψ · Δ) > area(Γ · Ψ · Φ · Δ).

³ The disjointness of Φ and Ψ ensures that both (Φ · Ψ) and (Ψ · Φ) are subschedules of Σ.


The import of Lemma 4 is: one cannot change the area-ordering of the two concatenations of Φ and Ψ by appending the same fixed subschedule (Γ) before the concatenations and/or appending the same fixed subschedule (Δ) after the concatenations.

Proof. The following two differences have the same sign: [area(Φ · Ψ) − area(Ψ · Φ)] and [area(Γ · Φ · Ψ · Δ) − area(Γ · Ψ · Φ · Δ)]. To wit:

area(Γ · Φ · Ψ · Δ) = (ℓΓ + ℓΦ + ℓΨ + ℓΔ)eΓ(1) + · · · + (1 + ℓΦ + ℓΨ + ℓΔ)eΓ(ℓΓ)
+ (ℓΦ + ℓΨ + ℓΔ)eΦ(1) + · · · + (1 + ℓΨ + ℓΔ)eΦ(ℓΦ)
+ (ℓΨ + ℓΔ)eΨ(1) + · · · + (1 + ℓΔ)eΨ(ℓΨ)
+ ℓΔ eΔ(1) + · · · + eΔ(ℓΔ);

area(Γ · Ψ · Φ · Δ) = (ℓΓ + ℓΨ + ℓΦ + ℓΔ)eΓ(1) + · · · + (1 + ℓΨ + ℓΦ + ℓΔ)eΓ(ℓΓ)
+ (ℓΨ + ℓΦ + ℓΔ)eΨ(1) + · · · + (1 + ℓΦ + ℓΔ)eΨ(ℓΨ)
+ (ℓΦ + ℓΔ)eΦ(1) + · · · + (1 + ℓΔ)eΦ(ℓΦ)
+ ℓΔ eΔ(1) + · · · + eΔ(ℓΔ).

Elementary calculation now shows that

area(Γ · Φ · Ψ · Δ) − area(Γ · Ψ · Φ · Δ) = area(Φ · Ψ) − area(Ψ · Φ).

The Optimality of Schedule Σ. We focus only on step 2 of Σ because, as noted earlier, every schedule for (G′ ⇑ G″) begins by executing s and ends by executing t. We validate each of the salient properties of Algorithm SP-AREA.

Lemma 5. The nodes inside each block determined by Algorithm SP-AREA cannot be rearranged in any AM-schedule for (G′ ⇑ G″).

Proof. We proceed by induction on the structure of the decomposition tree TG of G = (G′ ⇑ G″). The base of the induction is when TG consists of a single leaf-node. The lemma is trivially true in this case, for the structure of the resulting block is mandated by the direction of the arc in the node. (In any schedule for the DAG (s → t), s must be computed before t.) The lemma is trivially true also when G is formed solely via parallel compositions. To wit, for any such DAG, removing the source and target leaves one with a set of independent nodes. The resulting eligibility profile is, therefore, a sequence of 0s: each block has size 1 and AEV 0. This means that every valid schedule is AM.

In general, as we decompose G: (a) The subDAG schedule needed to process a parallel composition (G1 ⇑ G2) does not generate any blocks other than those generated by the schedules for G1 and G2. (b) The subDAG schedule needed to process a series composition can generate new blocks. To wit, consider SP-DAGs G1 and G2, with respective schedules Σ1 and Σ2, so that (Σ1 · Σ2) is a schedule for the series composition (G1 → G2). Say that (Σ1 · Σ2) generates consecutive blocks, B1 from Σ1 and B2 from Σ2, where B1's AEV is smaller than B2's. Then blocks B1 and B2 are merged, via concatenation, by (Σ1 · Σ2) into block B1 · B2. For instance, the indicated scenario occurs when B1 is the last block of Σ1 and B2 is the first block of Σ2. Assuming, for induction, that the nodes in each of B1 and B2 cannot be rearranged, we see that the nodes inside B1 · B2 cannot be reordered.



The Optimality of Schedule Σ. We focus only on step 2 of Σ because, as noted earlier, every schedule for (G  ⇑ G  ) begins by executing s and ends by executing t. We validate each of the salient properties of Algorithm SP-AREA. Lemma 5. The nodes inside each block determined by Algorithm SP-AREA cannot be rearranged in any AM-schedule for (G  ⇑ G  ). Proof. We proceed by induction on the structure of the decomposition tree T G of G = (G  ⇑ G  ). The base of the induction is when T G consists of a single leaf-node. The lemma is trivially true in this case, for the structure of the resulting block is mandated by the direction of the arc in the node. (In any schedule for the DAG (s → t), s must be computed before t.) The lemma is trivially true also when G is formed solely via parallel compositions. To wit, for any such DAG, removing the source and target leaves one with a set of independent nodes. The resulting eligibility profile is, therefore, a sequence of 0s: each block has size 1 and AEV 0. This means that every valid schedule is AM. In general, as we decompose G: (a) The subDAG schedule needed to process a parallel composition (G 1 ⇑ G 2 ) does not generate any blocks other than those generated by the schedules for G 1 and G 2 . (b) The sub DAG schedule needed to process a series composition can generate new blocks. To wit, consider SP-DAGs G 1 and G 2 , with respective schedules Σ1 and Σ2 , so that (Σ1 · Σ2 ) is a schedule for the serial composition (G 1 → G 2 ). Say that (Σ1 · Σ2 ) generates consecutive blocks, B1 from Σ1 and B2 from Σ2 , where B1 ’s AEV is smaller than B2 ’s. Then blocks B1 and B2 are merged—via concatenation—by (Σ1 · Σ2 ) into block B1 · B2 . For instance, the indicated scenario occurs when B1 is the last block of Σ1 and B2 is the first block of Σ2 . Assuming, for induction, that the nodes in each of B1 and B2 cannot be rearranged, we see that the nodes inside B1 · B2 cannot be reordered. 
That is, every node of G1 (including those in B1) must be executed before any node of G2 (including those in B2).


Lemma 5 tells us that the nodes inside each block generated by Algorithm SP-AREA cannot be reordered. We show now that blocks can also not be subdivided. Note that this will also preclude merging blocks in ways that violate blocks' indivisibility.

Lemma 6. If a schedule Σ for a DAG (G′ ⇑ G″) subdivides a block (as generated by Algorithm SP-AREA), then Σ is not AM.

Proof. Assume, for contradiction, that there exists an AM-schedule Σ for G = (G′ ⇑ G″) that subdivides some block A of Σ′ into two blocks, B and C. (Our choice of Σ′ here clearly loses no generality.) This subdivision means that within Σ, B and C are separated by some sequence D. We claim that Σ's AREA-maximality implies that AEV(C) > AEV(B). Indeed, by construction, we know that

(∀j ∈ [1, ℓA − 1])  (1/ℓA) SUM(A, 1, ℓA) > (1/j) SUM(A, 1, j).

Therefore, we have

(1/ℓA) SUM(A, 1, ℓA) > (1/ℓB) SUM(A, 1, ℓB),

which means that

ℓB ∑_{i=1}^{ℓA} eA(i) > ℓA ∑_{i=1}^{ℓB} eA(i).

It follows, noting that ℓC = ℓA − ℓB, that

ℓB ∑_{i=ℓB+1}^{ℓA} eA(i) > (ℓA − ℓB) ∑_{i=1}^{ℓB} eA(i).

This, finally, implies that AEV(C) > AEV(B). Now we are in trouble: either of the following inequalities,

(a) [AEV(D) < AEV(C)]   or   (b) [AEV(D) ≥ AEV(C)]    (3.3)

would allow us to increase Σ's AREA, thereby contradicting Σ's alleged AREA-maximality! If inequality (3.3(a)) held, then Lemmas 3 and 4 would allow us to increase Σ's AREA by interchanging blocks D and C. If inequality (3.3(b)) held, then because [AEV(C) > AEV(B)], we would have [AEV(D) > AEV(B)]. But then Lemmas 3 and 4 would allow us to increase Σ's AREA by interchanging blocks B and D. We conclude that schedule Σ cannot exist.

We summarize the results of this section in the following result.

Theorem 2. Let G′ and G″ be disjoint SP-DAGs that, respectively, admit AM-schedules Σ′ and Σ″, and have n′ and n″ nontargets. If we know the block decompositions of Σ′ and Σ″, ordered by AEV, then Algorithm SP-AREA determines, within time O(n′ + n″), an AM-schedule for G = (G′ ⇑ G″).

Proof. By Lemma 6, any AM-schedule for G must be a permutation of the blocks obtained by decomposing Σ′ and Σ″. Assume, for contradiction, that some AM-schedule Σ* is not obtained by selecting blocks in decreasing order of AEV. There would then exist (at least) two blocks, A and B, that appear consecutively in Σ*, such that


AEV(A) < AEV(B). We would then have SUM(A)/ℓA < SUM(B)/ℓB, so, by Lemma 3, area(B · A) > area(A · B). Let C (resp., D) be the concatenation of all blocks that come before A (resp., after B) in Σ*. An invocation of Lemma 4 shows that Σ* cannot be AM, contrary to assumption.

Timing: Because the blocks obtained by the decomposition are already ordered, completing G's schedule requires only merging two ordered sequences of blocks, which can be accomplished in linear time.

Observation 1. It is worth noting that Algorithm ASP-DAG's decomposition phase, as described in Section 3.2, may require time quadratic in the size of G. However, we use Algorithm SP-AREA in a recursive manner, so the decomposition for an SP-DAG G is simplified by the algorithm's (recursive) access to the decompositions of G's subDAGs.

Fig. 3. An example of composition

An Example. The DAG G of Fig. 3 (center) is the parallel composition of the DAGs of Fig. 3 (left). (G appears at the root of TG in Fig. 2.) By removing nodes s and t from G, we obtain the disjoint DAGs G′ and G″ of Fig. 3 (right). We note the AM-schedules: Σ′ = ⟨a, b, c, d, e, f, g, h, i⟩ for G′; Σ″ = ⟨k, l, m, n, o, p, q, r⟩ for G″. We use G′ and G″ to illustrate Algorithm SP-AREA.

For schedule Σ′: Π(Σ′) = ⟨2, 2, 2, 0, 1, 0, 1, 0, 1⟩, and profile AVG(Σ′) = ⟨2, 2, 2, 3/2, 7/5, 7/6, 8/7, 1, 1⟩. The position of the maximum element of AVG(Σ′) is 3 (choosing the rightmost element in case of ties), so the first block is ⟨a, b, c⟩, with AEV 2. Continuing: the new eligibility profile is ⟨0, 1, 0, 1, 0, 1⟩, and the average-eligibility profile is ⟨0, 1/2, 1/3, 1/2, 2/5, 1/2⟩. The maximum is the last element, so the new block is ⟨d, e, f, g, h, i⟩, with AEV 1/2.

For schedule Σ″: Π(Σ″) = ⟨2, 0, 1, 4, 0, 0, 0, 1⟩, and AVG(Σ″) = ⟨2, 1, 1, 7/4, 7/5, 7/6, 1, 1⟩. The maximum is at the first element, so the first block is ⟨k⟩, with AEV 2. The next eligibility profile is ⟨0, 1, 4, 0, 0, 0, 1⟩, and the average-eligibility profile is ⟨0, 1/2, 5/3, 5/4, 1, 5/6, 1⟩, which has its maximum at the 3rd element. The new block is ⟨l, m, n⟩, with AEV 5/3. The new average-eligibility profile is ⟨0, 0, 0, 1/4⟩, so the last block is ⟨o, p, q, r⟩, with AEV 1/4. Thus, the two schedules are split into five blocks:


Σ′ Blocks            AEV        Σ″ Blocks        AEV
⟨a, b, c⟩            2          ⟨k⟩              2
⟨d, e, f, g, h, i⟩   1/2        ⟨l, m, n⟩        5/3
                                ⟨o, p, q, r⟩     1/4

The AM-schedule obtained by ordering the blocks in order of decreasing AEV is ⟨a, b, c, k, l, m, n, d, e, f, g, h, i, o, p, q, r⟩.
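The block decomposition and the AEV-ordered merge of Algorithm SP-AREA can be sketched as follows (our own illustration; ties in the running average are broken toward the longer prefix, matching the rightmost-tie choice used in the example above):

```python
from fractions import Fraction

def blocks(profile):
    """Cut a profile into indivisible blocks: each block is the prefix of
    the remaining profile with the largest running average (AEV)."""
    out, start = [], 0
    while start < len(profile):
        best_aev, best_end, run = None, start, 0
        for k in range(start, len(profile)):
            run += profile[k]
            aev = Fraction(run, k - start + 1)
            if best_aev is None or aev >= best_aev:   # rightmost tie
                best_aev, best_end = aev, k
        out.append((profile[start:best_end + 1], best_aev))
        start = best_end + 1
    return out

def merge_blocks(b1, b2):
    """Merge two block sequences in nonincreasing AEV order.  Each input
    already has nonincreasing AEVs and Python's sort is stable, so
    blocks stay intact and in their original relative order."""
    return sorted(b1 + b2, key=lambda blk: blk[1], reverse=True)
```

On Π(Σ′) = ⟨2, 2, 2, 0, 1, 0, 1, 0, 1⟩ and Π(Σ″) = ⟨2, 0, 1, 4, 0, 0, 0, 1⟩, this reproduces the five blocks of the table above, with AEVs 2, 2, 5/3, 1/2, 1/4 in the merged order.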

3.3 The Timing of Algorithm ASP-DAG

Let G be an N-node SP-DAG, and let T(N) be the time that Algorithm ASP-DAG takes to find an AM-schedule Σ for G, plus the time it takes to decompose Σ into (indivisible) ordered blocks. Let us see what goes into T(N).

(1) Algorithm ASP-DAG invokes [27] to decompose G in time O(N).

(2) The algorithm finds an AM-schedule Σ for G by recursively unrolling G's decomposition tree TG.

(2.a) If G is a series composition (G′ → G″) of an n-node SP-DAG G′ and an (N − n + 1)-node SP-DAG G″, then, by induction, T(N) = T(N − n + 1) + T(n) + O(N) for some n ∈ [2, N − 1]. (The "extra" node is the merged target of G′ and source of G″.) The algorithm: (i) recursively generates AM-schedules, Σ′ for G′ and Σ″ for G″, in time T(N − n + 1) + T(n); (ii) generates Σ by concatenating Σ′ and Σ″ in time O(N); (iii) decomposes Σ into blocks in time O(N), using the O(N) blocks obtained while generating Σ′ and Σ″.

(2.b) If G is a parallel composition (G′ ⇑ G″) of an (n + 1)-node SP-DAG G′ and an (N − n + 1)-node SP-DAG G″, then, by induction, T(N) = T(N − n + 1) + T(n + 1) + O(N), for some n ∈ [2, N − 2]. (The two "extra" nodes are the merged sources and targets of G′ and G″.) The algorithm: (i) recursively generates AM-schedules, Σ′ for G′ and Σ″ for G″, in time T(N − n + 1) + T(n + 1); (ii) generates Σ in time O(N) using Algorithm SP-AREA; (iii) decomposes Σ into blocks in time O(N), by taking the union of the blocks obtained while generating Σ′ and Σ″. (Recall that the parallel composition does not generate additional blocks.)

Overall, then, T(N) = O(N²), as claimed.

4 Conclusion

In furtherance of our extension of IC-scheduling [4,5,6,8,21] to a scheduling paradigm that applies to all DAGs, we have expanded our initial study [7] of AREA-maximizing (AM) DAG-scheduling so that we can now find optimal schedules for all series-parallel DAGs efficiently. We are in the process of studying whether AM-scheduling shares the computational benefits of IC-scheduling [13,20]. We are also seeking (possibly heuristic) algorithms that will allow us to provably efficiently find schedules that are (approximately) AREA-maximizing for arbitrary DAGs.

References

1. Blumofe, R.D., Joerg, C.F., Kuszmaul, B.C., Leiserson, C.E., Randall, K.H., Zhou, Y.: Cilk: An efficient multithreaded runtime system. In: 5th ACM SIGPLAN Symp. on Principles and Practices of Parallel Programming, PPoPP 1995 (1995)


2. Blumofe, R.D., Leiserson, C.E.: Space-efficient scheduling of multithreaded computations. SIAM J. Comput. 27, 202–229 (1998)
3. Blumofe, R.D., Leiserson, C.E.: Scheduling multithreaded computations by work stealing. J. ACM 46, 720–748 (1999)
4. Cordasco, G., Malewicz, G., Rosenberg, A.L.: Applying IC-scheduling theory to some familiar computations. In: Wkshp. on Large-Scale, Volatile Desktop Grids, PCGrid 2007 (2007)
5. Cordasco, G., Malewicz, G., Rosenberg, A.L.: Advances in IC-scheduling theory: scheduling expansive and reductive DAGs and scheduling DAGs via duality. IEEE Trans. Parallel and Distributed Systems 18, 1607–1617 (2007)
6. Cordasco, G., Malewicz, G., Rosenberg, A.L.: Extending IC-scheduling via the Sweep algorithm. J. Parallel and Distributed Computing 70, 201–211 (2010)
7. Cordasco, G., Rosenberg, A.L.: On scheduling DAGs to maximize area. In: 23rd IEEE Int. Symp. on Parallel and Distributed Processing, IPDPS 2009 (2009)
8. Cordasco, G., Rosenberg, A.L., Sims, M.: Accommodating heterogeneity in IC-scheduling via task fattening. In: 37th Intl. Conf. on Parallel Processing, ICPP 2008 (2008) (submitted for publication)
9. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn. MIT Press, Cambridge (1999)
10. Garey, M.R., Johnson, D.S.: Computers and Intractability. W.H. Freeman and Co., San Francisco (1979)
11. Gao, L.-X., Rosenberg, A.L., Sitaraman, R.K.: Optimal clustering of tree-sweep computations for high-latency parallel environments. IEEE Trans. Parallel and Distributed Systems 10, 813–824 (1999)
12. Gerasoulis, A., Yang, T.: A comparison of clustering heuristics for scheduling DAGs on multiprocessors. J. Parallel and Distributed Computing 16, 276–291 (1992)
13. Hall, R., Rosenberg, A.L., Venkataramani, A.: A comparison of DAG-scheduling strategies for Internet-based computing. In: Intl. Parallel and Distr. Processing Symp. (2007)
14. Hwang, K., Xu, Z.: Scalable Parallel Computing: Technology, Architecture, Programming. McGraw-Hill, New York (1998)
15. Jayasena, S., Ganesh, S.: Conversion of NSP DAGs to SP DAGs. MIT Course Notes 6.895 (2003)
16. Kondo, D., Casanova, H., Wing, E., Berman, F.: Models and scheduling mechanisms for global computing applications. In: Intl. Parallel and Distr. Processing Symp. (2002)
17. Korpela, E., Werthimer, D., Anderson, D., Cobb, J., Lebofsky, M.: SETI@home: massively distributed computing for SETI. In: Dubois, P.F. (ed.) Computing in Science and Engineering. IEEE Computer Soc. Press, Los Alamitos (2000)
18. Kwok, Y.-K., Ahmad, I.: Benchmarking and comparison of the task graph scheduling algorithms. J. Parallel and Distributed Computing 59, 381–422 (1999)
19. Kwok, Y.-K., Ahmad, I.: Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Computing Surveys 31, 406–471 (1999)
20. Malewicz, G., Foster, I., Rosenberg, A.L., Wilde, M.: A tool for prioritizing DAGMan jobs and its evaluation. J. Grid Computing 5, 197–212 (2007)
21. Malewicz, G., Rosenberg, A.L., Yurkewych, M.: Toward a theory for scheduling dags in Internet-based computing. IEEE Trans. Comput. 55, 757–768 (2006)
22. McCreary, C.L., Khan, A.A., Thompson, J., McArdle, M.E.: A comparison of heuristics for scheduling DAGs on multiprocessors. In: 8th Intl. Parallel Processing Symp., pp. 446–451 (1994)
23. Mitchell, M.: Creating minimal vertex series parallel graphs from directed acyclic graphs. In: 2004 Australasian Symp. on Information Visualisation, vol. 35, pp. 133–139 (2004)
24. Rosenberg, A.L.: On scheduling mesh-structured computations for Internet-based computing. IEEE Trans. Comput. 53, 1176–1186 (2004)

392

G. Cordasco and A.L. Rosenberg

25. Rosenberg, A.L., Yurkewych, M.: Guidelines for scheduling some common computationdags for Internet-based computing. IEEE Trans. Comput. 54, 428–438 (2005) 26. Sarkar, V.: Partitioning and Scheduling Parallel Programs for Multiprocessors. MIT Press, Cambridge (1989) 27. Valdes, J., Tarjan, R.E., Lawler, E.L.: The recognition of series-parallel digraphs. SIAM J. Comput. 11, 289–313 (1982)

Parallel Selection by Regular Sampling

Alexander Tiskin

Department of Computer Science, University of Warwick, Coventry CV4 7AL, UK

Abstract. Bulk-synchronous parallelism (BSP) is a simple and efficient paradigm for parallel algorithm design and analysis. In this paper, we present a new simple deterministic BSP algorithm for the classical problem of selecting the k-th smallest element from an array of size n, for a given k, on a parallel computer with p processors. Our algorithm is based on the technique of regular sampling. It runs in optimal O(n/p) local computation and communication, and near-optimal O(log log p) synchronisation. The algorithm is of theoretical interest, as it gives an improvement in the asymptotic synchronisation cost over its predecessors. It is also simple enough to be implementable.

1 Introduction

The selection problem is a classical problem in computer science.

Definition 1. Given an array of size n, and an integer k, 0 ≤ k < n, the selection problem asks for the array's element with rank k (i.e. the k-th smallest element, counting from 0).

Without loss of generality, we assume that all array elements are distinct. In this paper, we restrict ourselves to comparison-based selection, where the only primitive operations allowed on the elements are pairwise comparisons, with possible outcomes "greater than" or "less than".

The selection problem is closely related to sorting. Indeed, the naive solution to the selection problem consists in sorting the array, and then indexing the required element. Using an efficient sorting algorithm, such as mergesort, the selection problem can therefore be solved in time O(n log n). However, it is well-known that, in contrast to sorting, this time bound for the selection problem is not optimal, and linear time is achievable. The first selection algorithm running in time O(n) was given by Blum et al. [4]. Further constant-factor improvements were obtained by Schönhage et al. [17], and by Dor and Zwick [8]; see also a survey by Paterson [15].

In this paper, we consider the selection problem in a coarse-grained parallel computation model, such as BSP or its variant CGM. Once again, the naive solution to the selection problem on a coarse-grained parallel computer consists in sorting the array, and then indexing the required element. Using an efficient sorting algorithm, such as parallel sorting by regular sampling (PSRS) by Shi and Schaeffer [18] (see also [20]), the selection problem can be solved deterministically in local computation O((n log n)/p), communication O(n/p), and synchronisation O(1), on a parallel computer with p processors. Taking into account the cost of sharing the input of size O(n) among p processors, the communication cost O(n/p) is clearly optimal. The synchronisation cost O(1) is also trivially optimal. However, the local computation cost is not optimal, due to the existence of linear-time sequential selection algorithms. A natural question arises: can the local computation cost of coarse-grained parallel selection be reduced to the optimal O(n/p), while keeping the asymptotic optimality in communication and synchronisation?

A step towards this goal was made by Ishimizu et al. [12], who gave a coarse-grained parallel selection algorithm running in optimal local computation and communication O(n/p), and in synchronisation that is essentially O(log p). Fujiwara et al. [9] improved the synchronisation cost to O(min(log p, log log n)). They also proposed another algorithm, with suboptimal local computation cost O((n log p)/p) (which is still better than the one achievable by naive sorting), and the optimal synchronisation cost O(1). Gerbessiotis and Siniolakis [10] gave a randomised BSP algorithm, running in optimal local computation and communication O(n/p), and optimal synchronisation O(1), with high probability. An optimal selection algorithm in the PRAM model was given by Han [11].

There have also been significant developments in the area of experimental evaluation of selection algorithms. Various practical parallel selection algorithms have been proposed by Al-Furiah et al. [1], Saukas and Song [16], Bader [2], and Cafaro et al. [5].

In this paper, we present a new simple deterministic BSP algorithm for selection. Our algorithm is based on the technique of regular sampling. It runs in optimal O(n/p) local computation and communication, and in near-optimal O(log log p) synchronisation. Our algorithm is of theoretical interest, as it gives an improvement in the asymptotic synchronisation cost over its predecessors. It is also simple enough to be implementable.

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 393–399, 2010. © Springer-Verlag Berlin Heidelberg 2010
Throughout the paper, we ignore small irregularities arising from imperfect matching between integer parameters. This allows us to avoid overloading our notation with floor and ceiling functions, which have to be assumed implicitly wherever necessary.
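The contrast between the naive sorting-based approach and direct selection can be made concrete. The sketch below is an illustrative expected-linear-time quickselect, not the worst-case-linear algorithm of Blum et al. [4]; the function name is ours.

```python
import random

def select(a, k):
    """Element of rank k (0-based) of list a, by expected-linear quickselect.
    Illustrative only: not the worst-case-linear algorithm of Blum et al.
    Assumes 0 <= k < len(a)."""
    pivot = random.choice(a)
    lo = [x for x in a if x < pivot]   # occupies ranks 0 .. len(lo)-1
    hi = [x for x in a if x > pivot]   # occupies ranks len(a)-len(hi) .. len(a)-1
    if k < len(lo):
        return select(lo, k)
    if k >= len(a) - len(hi):
        return select(hi, k - (len(a) - len(hi)))
    return pivot                       # the pivot value fills the middle ranks
```

Unlike sorting followed by indexing, only one side of each partition is recursed into, which is what brings the expected running time down from O(n log n) to O(n).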

2 The BSP Model

The model of bulk-synchronous parallel (BSP) computation [22,13,3] provides a simple and practical framework for general-purpose parallel computing. Its main goal is to support the creation of architecture-independent and scalable parallel software. Key features of BSP are its treatment of the communication medium as an abstract fully connected network, and strict separation of all interaction between processors into point-to-point asynchronous data communication and barrier synchronisation. This separation allows an explicit and independent cost analysis of local computation, communication and synchronisation.

A BSP computer contains
• p processors; each processor has a local memory and is capable of performing an elementary operation or a local memory access every time unit;

• a communication network, capable of accepting a word of data from every processor, and delivering a word of data to every processor, every g time units;
• a barrier synchronisation mechanism, capable of synchronising all processors every l time units.

The processors may follow different threads of computation, and have no means of synchronising with one another between the global barriers. It will be convenient to consider a version of the BSP model equipped with an external memory, which serves as the source of the input and the destination for the output, and can also be used for intermediate data. Algorithms designed for this model can easily be translated to the traditional distributed-memory setting; see [20] for details.

A BSP computation is a sequence of supersteps. The processors are synchronised between supersteps; the computation within a superstep is completely asynchronous. Consider a superstep in which every processor performs up to w local operations, sends up to h_out words of data, and receives up to h_in words of data. We call w the local computation cost, and h = h_out + h_in the communication cost of the superstep. The total superstep cost is defined as w + h·g + l, where the communication gap g and the latency l are parameters of the network defined above. For a computation comprising S supersteps with local computation costs w_s and communication costs h_s, 1 ≤ s ≤ S, the total cost is W + H·g + S·l, where
• W = Σ_{s=1}^{S} w_s is the total local computation cost;
• H = Σ_{s=1}^{S} h_s is the total communication cost;
• S is the synchronisation cost.
The values of W, H and S typically depend on the number of processors p and on the problem size.

The original definition of BSP does not account for memory as a limited resource. However, the model can be easily extended by an extra parameter m, representing the maximum capacity of each processor's local memory. Note that this approach also limits the amount of communication allowed within a superstep: h ≤ m. One of the early examples of memory-sensitive BSP algorithm design is given in [14]. An alternative approach to reflecting memory cost is given by the model CGM, proposed in [7]. A CGM is essentially a memory-restricted BSP computer, where memory capacity and maximum superstep communication are determined by the size of the input/output: h ≤ m = O((input + output)/p). A large number of algorithms have been developed for the CGM; see e.g. [19].

In order to utilise the computer resources efficiently, a typical BSP program regards the values p, g and l as configuration parameters. Algorithm design should aim to minimise local computation, communication and synchronisation costs for any realistic values of these parameters.
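The superstep cost model above can be evaluated directly; the following minimal sketch (function names ours) computes the cost of a single superstep and of a whole computation from the definitions.

```python
def superstep_cost(w, h, g, l):
    """Cost w + h*g + l of one BSP superstep: w local operations,
    h = h_out + h_in words communicated, network gap g, latency l."""
    return w + h * g + l

def computation_cost(supersteps, g, l):
    """Total cost W + H*g + S*l of a computation, given as (w_s, h_s) pairs:
    W and H sum the per-superstep costs, S counts the barriers."""
    W = sum(w for w, _ in supersteps)
    H = sum(h for _, h in supersteps)
    S = len(supersteps)
    return W + H * g + S * l
```

By linearity, the total cost equals the sum of the individual superstep costs, which is why W, H and S can be minimised as separate objectives.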

3 The Algorithm

A typical approach to selection involves partitioning the input array into subarrays, sampling these subarrays, and then using (a subset of) the obtained samples as splitters for the elimination of array elements. For sequential selection, it is sufficient to choose suitable constants for the subarray size and the sampling frequency: for example, one can use subarrays of size 5, take subarray medians as samples, and take the median of samples as the single splitter. Each elimination stage reduces the amount of data by a constant factor, and therefore the overall data reduction rate is exponential.

In a coarse-grained parallel setting, a natural subarray size is n/p; we assume that this value is sufficiently high. As before, we could reduce the amount of data by a constant factor in every stage by taking a suitable constant for the sampling frequency. However, we can do better by varying the sampling frequency. Generally, as the remaining data become more and more sparse, we can afford more and more frequent sampling of the data. This accelerates the data reduction rate to super-exponential.

Algorithm 1 (Selection)
Parameter: integer k, 0 ≤ k < n.
Input: array a of size n in external memory; we assume n ≥ p^{3/2}.
Output: the value of the element of a with rank k.
Description.
First phase. The array is gradually reduced in size, by eliminating elements in repeated rounds of regular sampling. Let N = 2n. Consider a round in which m = N/r elements remain in the global array; for instance, in the first round we have m = N/2, and therefore r = 2. Each processor reads from the external memory a local subarray of size m/p. Then, by repeated application of a sequential selection algorithm, each processor selects from its local subarray a set of 2r^{1/2} + 1 samples spaced at regular intervals, inclusive of the two boundary elements of the subarray. Then, all the (2r^{1/2} + 1)p samples are collected in a designated processor.

The designated processor sorts the array of samples, and then selects from it a subset of 2r^{1/2} + 1 splitters spaced at regular intervals, inclusive of the two boundary elements (i.e. the minimum and the maximum sample). The whole set of splitters is then broadcast across the processors, using an efficient broadcasting algorithm, e.g. the two-phase broadcast of [3]. For each splitter, a processor finds its local rank within that processor's subarray. These local ranks are then collected in a designated processor, which adds them up to obtain the global rank of each splitter within the global array.

Consider two adjacent splitters a−, a+, such that the global rank of a− (respectively, a+) is below (respectively, above) k. We call the subset of all array elements that fall (inclusively) between splitters a− and a+ the bucket. Note that the bucket contains (2r^{1/2} + 1)p / (2r^{1/2} + 1) = p samples. Crucially, the bucket also contains the element of global rank k, which is required as the algorithm's output.
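The round just described can be simulated sequentially. The sketch below (all names ours) treats the p processors as consecutive slices of one array, and assumes m divisible by p and distinct elements; it is a single-process illustration, not the distributed algorithm itself.

```python
import math

def one_round(a, k, p, r):
    """Sequential sketch of one elimination round of Algorithm 1.
    a: remaining elements (m = len(a), assumed divisible by p, distinct),
    k: rank of the sought element within a, r = N/m.
    Returns (bucket, new_k): surviving elements and the updated rank."""
    m = len(a)
    s = 2 * math.isqrt(r) + 1                  # 2*r^(1/2) + 1 samples/splitters
    chunk = m // p
    samples = []
    for i in range(p):                         # each "processor" holds a slice
        sub = sorted(a[i * chunk:(i + 1) * chunk])
        step = (len(sub) - 1) / (s - 1)        # regular spacing, incl. boundaries
        samples += [sub[round(j * step)] for j in range(s)]
    samples.sort()
    step = (len(samples) - 1) / (s - 1)        # regularly spaced splitters
    splitters = [samples[round(j * step)] for j in range(s)]
    ranks = [sum(x < t for x in a) for t in splitters]   # global ranks
    lo = max(t for t, rk in zip(splitters, ranks) if rk <= k)
    highs = [t for t, rk in zip(splitters, ranks) if rk > k]
    hi = min(highs) if highs else splitters[-1]
    bucket = [x for x in a if lo <= x <= hi]   # eliminate everything outside
    return bucket, k - sum(x < lo for x in a)
```

Since the minimum splitter has global rank 0 and the maximum splitter has a rank above k (for k < m − 1), the pair (lo, hi) always exists and the bucket always retains the element of rank k.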


The bucket boundary splitters a−, a+ are then broadcast across the processors. Each processor eliminates all the elements of its local subarray that fall outside the bucket, i.e. are either below a−, or above a+. Then, a designated processor collects the sizes of the remaining local subarrays, allocates for each processor a range of locations of appropriate size in the external memory, and communicates to each processor the address of its allocated areas. The processors write their remaining local subarrays into the allocated areas. The bucket, which has now been collected in the external memory, replaces the global array in the subsequent round. To reflect that, we update the value of m by setting it to the size of the bucket, and we let r = N/m. We also update the value of k by subtracting from it the global rank of a−. The round is completed. Rounds are performed repeatedly, until the size of the array is reduced to n/p.

Second phase. A designated processor reads the remaining array of size n/p from the external memory, and finds the element of rank k by a sequential selection algorithm.

Cost analysis
First phase. Consider a particular round where, as before, there are m = N/r remaining elements. The communication cost of each processor reading its local subarray from the external memory is O(m/p) = O(n/(rp)). The local computation cost of each processor selecting the regular samples is (2r^{1/2} + 1) · O(m/p) = O(n/(r^{1/2}p)).

For each processor, let us call the subset of its local elements that fall between two (locally) adjacent samples a block (note that since the local subarray is not sorted, blocks are in general not physically contiguous). We have 2r^{1/2} blocks per processor, and therefore 2r^{1/2}p blocks overall, each containing m/(2r^{1/2}p) elements (including or excluding the boundaries as appropriate). Consider a block defined by a pair of samples b−, b+.
We call this block low (respectively, high) if the block's lower boundary b− is non-strictly below (respectively, strictly above) the bucket's lower boundary a−. A low block has a non-empty intersection with the bucket if and only if it contains the bucket's lower boundary a−: b− < a− ≤ b+. Since, for each processor, all its blocks are disjoint, at most one of them can contain the value a−. Therefore, across all processors, there can be at most p low blocks intersecting the bucket. A high block has a non-empty intersection with the bucket if and only if its lower boundary b− is contained within the bucket: a− ≤ b− < a+. Each of the p samples contained in the bucket can be a lower boundary sample for at most one high block. Therefore, across all processors, there can be at most p high blocks intersecting the bucket.

In total, there can be at most p + p = 2p blocks having a non-empty intersection with the bucket. Hence, the number of elements in the bucket is at most 2p · m/(2r^{1/2}p) = m/r^{1/2} = N/r^{3/2}. Since all the elements outside the bucket are eliminated, the maximum possible fraction (relative to N) of remaining elements thus gets raised to the power of 3/2 in every round. After log_{3/2} log_2 p = O(log log p) rounds, the number of remaining elements is at most 2^{−(3/2)^{log_{3/2} log_2 p}} · N = 2^{−log_2 p} · N = N/p = O(n/p).

The local computation and communication costs decrease super-exponentially in every round, and are therefore dominated by the respective costs in the first round, equal to O(n/p). The total number of rounds, and therefore the synchronisation cost, is O(log log p).

Second phase. The local computation and communication costs are O(n/p), and the synchronisation cost is O(1).

Total. The overall resource costs are dominated by the first phase. Therefore, the local computation and communication costs are O(n/p), and the synchronisation cost is O(log log p).
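The super-exponential reduction rate can be checked numerically; the small sketch below (function name ours) counts rounds by iterating the recurrence r ← r^{3/2}.

```python
def rounds_to_reduce(p):
    """Rounds of Algorithm 1 until at most N/p of the N elements remain.
    The remaining fraction 1/r is raised to the power 3/2 per round
    (starting from r = 2), giving ceil(log_{3/2} log_2 p) = O(log log p)."""
    r, rounds = 2.0, 0
    while r < p:      # m = N/r still exceeds the target N/p
        r **= 1.5     # each round: fraction 1/r becomes (1/r)^(3/2)
        rounds += 1
    return rounds
```

Even for p in the millions, the loop runs for fewer than a dozen iterations, illustrating how mild the O(log log p) synchronisation cost is in practice.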

It should be noted that the resource costs of Algorithm 1 are controlled mainly by the frequency of initial sampling in every round. In contrast, the frequency of choosing splitters among the samples is relatively less important, and allows for substantial freedom. In particular, we could even choose all the samples as splitters. Instead, we chose the minimum possible number of splitters, thus stressing the similarity with the PSRS algorithm of [18] (see also [20]). Also note that the assumption of external memory allows us to perform implicit load balancing between the rounds. If the algorithm were expressed in the standard distributed-memory version of BSP, then explicit load balancing within every round would be required.

4 Conclusions

We have given a deterministic BSP algorithm for selection, running in the optimal local computation and communication O(n/p), and synchronisation O(log log p). It remains an open problem whether the synchronisation cost can be reduced to the optimal O(1) without increasing the asymptotic local computation and communication costs, and without randomisation.

In the context of coarse-grained parallel computation, regular sampling has been used previously for the problems of sorting [18,20] and computing 2D and 3D convex hulls [21]. In the current paper, we have applied this technique to the problem of selection. We expect that further applications of regular sampling are possible, and conclude that this powerful technique should be an essential part of the parallel algorithm design toolkit.

References

1. Al-Furiah, I., Aluru, S., Goil, S., Ranka, S.: Practical algorithms for selection on coarse-grained parallel computers. IEEE Transactions on Parallel and Distributed Systems 8(8), 813–824 (1997)
2. Bader, D.A.: An improved, randomized algorithm for parallel selection with an experimental study. Journal of Parallel and Distributed Computing 64(9), 1051–1059 (2004)


3. Bisseling, R.H.: Parallel Scientific Computation: A Structured Approach Using BSP and MPI. Oxford University Press, Oxford (2004)
4. Blum, M., Floyd, R.W., Pratt, V.R., Rivest, R.L., Tarjan, R.E.: Time bounds for selection. Journal of Computer and System Sciences 7(4), 448–461 (1973)
5. Cafaro, M., De Bene, V., Aloisio, G.: Deterministic parallel selection algorithms on coarse-grained multicomputers. Concurrency and Computation: Practice and Experience 21(18), 2336–2354 (2009)
6. Corrêa, R., et al. (eds.): Models for Parallel and Distributed Computation: Theory, Algorithmic Techniques and Applications. Applied Optimization, vol. 67. Kluwer Academic Publishers, Dordrecht (2002)
7. Dehne, F., Fabri, A., Rau-Chaplin, A.: Scalable parallel computational geometry for coarse grained multicomputers. International Journal on Computational Geometry 6, 379–400 (1996)
8. Dor, D., Zwick, U.: Selecting the median. SIAM Journal on Computing 28(5), 1722–1958 (1999)
9. Fujiwara, A., Inoue, M., Masuzawa, T.: Parallel selection algorithms for CGM and BSP models with application to sorting. Transactions of Information Processing Society of Japan 41(5), 1500–1508 (2000)
10. Gerbessiotis, A.V., Siniolakis, C.J.: Architecture independent parallel selection with applications to parallel priority queues. Theoretical Computer Science 301(1–3), 119–142 (2003)
11. Han, Y.: Optimal parallel selection. In: Proceedings of ACM–SIAM SODA, pp. 1–9 (2003)
12. Ishimizu, T., Fujiwara, A., Inoue, M., Masuzawa, T., Fujiwara, H.: Parallel algorithms for selection on the BSP and BSP* models. Systems and Computers in Japan 33(12), 97–107 (2002)
13. McColl, W.F.: Scalable computing. In: van Leeuwen, J. (ed.) Computer Science Today: Recent Trends and Developments. LNCS, vol. 1000, pp. 46–61. Springer, Heidelberg (1995)
14. McColl, W.F., Tiskin, A.: Memory-efficient matrix multiplication in the BSP model. Algorithmica 24(3–4), 287–297 (1999)
15. Paterson, M.: Progress in selection. In: Karlsson, R., Lingas, A. (eds.) SWAT 1996. LNCS, vol. 1097, pp. 368–379. Springer, Heidelberg (1996)
16. Saukas, E.L.G., Song, S.W.: A note on parallel selection on coarse-grained multicomputers. Algorithmica 24(3–4), 371–380 (1999)
17. Schönhage, A., Paterson, M., Pippenger, N.: Finding the median. Journal of Computer and System Sciences 13(2), 184–199 (1976)
18. Shi, H., Schaeffer, J.: Parallel sorting by regular sampling. Journal of Parallel and Distributed Computing 14(4), 361–372 (1992)
19. Song, S.W.: Parallel graph algorithms for coarse-grained multicomputers. In: Corrêa, R., et al. (eds.) [6], pp. 147–178
20. Tiskin, A.: The bulk-synchronous parallel random access machine. Theoretical Computer Science 196(1–2), 109–130 (1998)
21. Tiskin, A.: Parallel convex hull computation by generalised regular sampling. In: Monien, B., Feldmann, R.L. (eds.) Euro-Par 2002. LNCS, vol. 2400, pp. 392–399. Springer, Heidelberg (2002)
22. Valiant, L.G.: A bridging model for parallel computation. Communications of the ACM 33(8), 103–111 (1990)

Ants in Parking Lots

Arnold L. Rosenberg

Electrical & Computer Engineering, Colorado State University, Fort Collins, CO 80523, USA
[email protected]

Abstract. Ants provide an attractive metaphor for robots that “cooperate” to perform complex tasks. This paper is a step toward understanding the algorithmic concomitants of this metaphor, the strengths and weaknesses of ant-based computation models. We study the ability of finite-state ant-robots to scalably perform a simple path-planning task called parking, within fixed, geographically constrained environments (“factory floors”). This task: (1) has each ant head for its nearest corner of the floor and (2) has all ants within a corner organize into a maximally compact formation. Even without (digital analogues of) pheromones, many initial configurations of ants can park, including: (a) a single ant situated along an edge of the floor; (b) any assemblage of ants that begins with two designated adjacent ants. In contrast, a single ant in the middle of (even a one-dimensional) floor cannot park, even with the help of (volatile digital) pheromones. Keywords: Ant-inspired robots, Finite-state machines, Path planning.

1 Introduction

As we encounter novel computing environments that offer unprecedented computing power, while posing unprecedented challenges, it is compelling to seek inspiration from natural analogues of these environments. Thus, empowered with technology that enables mobile intercommunicating robotic computers, it is compelling to seek inspiration from social insects, mainly ants (because robots typically operate within a two-dimensional world), when contemplating how to employ the computers effectively and efficiently in a variety of geographical environments; indeed, many sources—see, e.g., [1,4,6,7,8,9]—have done precisely that. This paper is a step toward understanding the algorithmic concomitants of the robot-as-ant metaphor within the context of a simple, yet nontrivial, path-planning problem.

Ant-robots in a "factory." We focus on mobile robotic computers (henceforth, ants, to stress the natural inspiration) that function within a fixed geographically constrained environment (henceforth, a factory floor [a possible application domain]) that is tessellated with identical (say, square) tiles. We expect ants to be able to:
– navigate the floor, while avoiding collisions (with obstacles and one another);
– communicate with and sense one another, by "direct contact" (as when real ants meet) and "timestamped message passing" (as when real ants deposit pheromones);
– assemble in desired locations, in desired configurations.
Although not relevant to the current study, we would also want ants to be able to discover goal objects ("food") and to convey "food" from one location to another; cf. [4,6,8,9].

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 400–411, 2010. © Springer-Verlag Berlin Heidelberg 2010


Indeed, the "parking" problem that we study might be the aftermath of the preceding activities, e.g., if one has ants convey collected "food" to designated stations. In the "standard realization" of ants studied here, ants are endowed with "intelligence" via embedded computers. These "intelligent" ants are responsible for planning and orchestrating assigned activities. In a quest to understand how a large assemblage of ants of very limited ability can "cooperate" to accomplish complex tasks, we focus on ants that are mobile finite-state machines. In particular, we want to understand what paths ants can plan in a scalable manner, i.e., without counting (beyond a fixed limit).

Having ants park. We study the preceding issue by focusing on a simple, yet algorithmically nontrivial, path-planning task that we studied in [8] under a rather different ant-based model. This task, parking: (1) has each ant head for the nearest corner of the floor and (2) has all ants within a corner organize into a maximally compact formation (see Section 2.2 for details). While we have not yet characterized which initial configurations of ants can park successfully, we report here on progress toward this goal:
– Even without using (digital analogues of) pheromones, many initial configurations of ants can park. These configurations include:
  • a single ant that starts anywhere along an edge of the floor (Theorem 2);
  • any assemblage of ants that begins with two distinguished ants that are adjacent—i.e., on tiles that share an edge or a corner (Theorem 3).
– In contrast: A single ant generally cannot park, even on a one-dimensional floor and even with the help of (volatile digital) pheromones (Theorem 1).

The algorithmic setting. We require algorithms to be simple, scalable and decentralized.
(1) Algorithms must work on floors of arbitrary sizes. An algorithm cannot exploit information about the size of the floor; it must treat the side-length n of the floor as an unknown, never exploiting its specific value.
(2) Algorithms must work with arbitrarily large collections of ants. This means, in particular, that all ants are identical; no ant has a "name" that renders it unique.
(3) Algorithms must employ noncentralized coordination. All coordination among ants is achieved in a distributed, noncentralized manner, via messages that pass between ants on neighboring tiles.
(4) Algorithms must be "finite-state." All ants must execute the same program (in SPMD mode), and this program must have the very restricted form described in Section 2.1.B.
These guidelines may be too costly to observe in practical systems; cf. [2,4,6].

2 Technical Background

2.1 Ant-Robots Formalized

A. Pheromone-bearing floors. Floors and connectivity (cf. [5]). The n × n floor is a square mesh of tiles, n along each side (Fig. 1(a)); tiles are indexed by the set [0, n − 1] × [0, n − 1]. We represent the floor algorithmically by the n × n grid(-graph)

Footnote 1: Our study adapts to rectangular floors with little difficulty.
Footnote 2: N denotes the nonnegative integers. For i ∈ N and j ≥ i, [i, j] =def {i, i + 1, . . ., j}.



Fig. 1. (a) A 3 × 3 floor M3 ; (b) the associated grid (edges represent mated opposing arcs); (c) the 2 × 2 corner of a grid with all incident arcs

(Fig. 1(b)). Mn ambiguously denotes the mesh or the grid, according to context. Ants move along Mn's King's-move arcs, which are labeled by the compass directions. Each tile v = ⟨i, j⟩ of Mn is connected by mated in- and out-arcs to its neighbors (Fig. 1(c)).
– If i ∈ {0, n − 1} and j ∈ {0, n − 1}, then v is a corner tile and has 3 neighbors.
– If i = 0 (resp., i = n − 1) and j ∈ [1, n − 2], then v is a bottom (resp., top) tile. If j = 0 (resp., j = n − 1) and i ∈ [1, n − 2], then v is a left (resp., right) tile. These four are collectively edge tiles; each has 5 neighbors.
– If i, j ∈ [1, n − 2], then v is an internal tile and has 8 neighbors.

Mn's regions. Mn's quadrants are its induced subgraphs (Footnote 3) on the following sets of tiles:

Quadrant | Name | Tile-set
SOUTHWEST | Q_SW | [0, n/2 − 1] × [0, n/2 − 1]
NORTHWEST | Q_NW | [n/2, n − 1] × [0, n/2 − 1]
SOUTHEAST | Q_SE | [0, n/2 − 1] × [n/2, n − 1]
NORTHEAST | Q_NE | [n/2, n − 1] × [n/2, n − 1]

In analogy with quadrants, which are "fenced off" by imaginary vertical and horizontal side-bisecting lines, Mn has four wedges, W_N, W_E, W_S, W_W, which are "fenced off" by imaginary diagonals that connect its corners:

Wedge | Name | Tile-set
NORTH | W_N | {⟨x, y⟩ | [x ≥ y] and [x + y ≥ n − 1]}
SOUTH | W_S | {⟨x, y⟩ | [x < y] and [x + y < n − 1]}
EAST | W_E | {⟨x, y⟩ | [x < y] and [x + y ≥ n − 1]}
WEST | W_W | {⟨x, y⟩ | [x ≥ y] and [x + y < n − 1]}
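The two classifications can be sketched directly from the tables. This is an illustrative sketch (function names ours), assuming tiles ⟨i, j⟩ with the first coordinate choosing the SOUTH/NORTH half, as in the quadrant table, and assuming even n:

```python
def quadrant(i, j, n):
    """Quadrant of tile <i, j> of the n-by-n floor, per the quadrant table:
    first coordinate in [0, n/2 - 1] means SOUTH, second means WEST."""
    ns = "S" if i < n // 2 else "N"
    ew = "W" if j < n // 2 else "E"
    return "Q_" + ns + ew

def wedge(x, y, n):
    """Wedge of tile <x, y>; the asymmetric boundary conditions of the
    wedge table make the answer unique."""
    if x >= y:
        return "W_N" if x + y >= n - 1 else "W_W"
    return "W_E" if x + y >= n - 1 else "W_S"
```

Because the inequalities are strict on exactly one side of each "fence," every tile falls into exactly one quadrant and exactly one wedge.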

The asymmetries in boundaries give each tile a unique quadrant and wedge. For k ∈ [0, 2n − 2], Mn's kth diagonal is the set of tiles Δ_k = {⟨i, j⟩ ∈ [0, n − 1] × [0, n − 1] | i + j = k}. For d ∈ [1, 2n − 1], the radius-d quarter-sphere of Q_SW is the union ∪_{k=0}^{d−1} Δ_k.

"Virtual" pheromones. Each tile of Mn contains a fixed number c of counters; each counter i can hold an integer in the range [0, I_i]. Each counter represents a "virtual"

Footnote 3: Let the directed graph G have tile-set N and arc-set A. The induced subgraph of G on the set N′ ⊆ N is the subgraph of G with tile-set N′ and all arcs from A that have both endpoints in N′.



Fig. 2. Snapshots of a pheromone of intensity I = 3 changing as FSM-ant F (the dot) moves. All snapshots have F on the center tile; unlabeled tiles have level 0. (a) F has deposited a maximum dose of pheromone on each tile that it has reached via a 2-step SE-SW path; note that the pheromone has begun to “evaporate” on the tiles that F has left. (b) F stands still for one time-step and deposits no pheromone. (c) F moves W and deposits a maximum dose of pheromone. (d) F moves S and deposits a maximum dose of pheromone. (e) F moves E and does not deposit any pheromone.

pheromone (cf. [4])—a digital realization of real ants’ volatile organic compounds; each value within [0, Ii ] is an intensity level of pheromone i. The number c and the ranges [0, Ij ]cj=1 are characteristics of a specific realization of the model. The volatility of real pheromones is modeled by a schedule of decrements of every pheromone counter; see Fig. 2. Every computation begins with all tiles having level 0 of every pheromone. B. Computers and programs. Each ant contains a computer that endows it with “intelligence.” Each computer possesses I/O ports that allow it to communicate with the outside world and with computers on adjacent tiles. In a single “step,” a computer can: 1. detect an ant on an adjacent tile; 2. recognize Mn ’s four edges/sides and its four corners; 3. communicate with each neighboring computer—one on a tile that shares an edge or corner—by receiving one message and transmitting one message per time-step. 4. receive a command from the outside world. As discussed earlier, we have embedded computers function as identical copies of a fixed, specialized finite-state machine (FSM) F ; cf. [10]. We specify FSMs’ statetransition functions in an algorithmically friendly fashion, as programs that are finite sets of case statements;4 an FSM F is, thus, specified as follows: – F has s states, named LABEL1 , . . . , LABELs . – F responds to a fixed repertoire of inputs, each INPUTi being a combination of: • the messages that F receives from neighboring FSMs and/or the outside world; • the presence/absence of an edge/corner of Mn , a “food” item to be manipulated, an “obstacle” to be avoided; • the levels of intensity of the pheromones that are present on the current tile. 
– F responds to the current input by
• emitting an output from a repertoire, each OUTPUT_k being a combination of:
∗ the messages that F sends to neighboring FSMs;
∗ pheromone-related actions: deposit a type-h pheromone at intensity I ≤ I_h; enhance a type-j pheromone, increasing its intensity to I ≤ I_j;
∗ "food"-related actions: pick up and carry the item on the current tile; deposit the item that F is currently carrying;

⁴ The CARPET programming environment [3,11] employs a similar programming style.

404

A.L. Rosenberg

∗ stand still or move to an appropriate neighboring tile (one that will not cause F to "fall off" Mn).
• changing state (by specifying the next case statement to execute).
An FSM is thus specified in a format similar to the following:

LABEL_1: if INPUT_1 then OUTPUT_1,1 and goto LABEL_1,1
         ...
         if INPUT_m then OUTPUT_1,m and goto LABEL_1,m
  ...
LABEL_s: if INPUT_1 then OUTPUT_s,1 and goto LABEL_s,1
         ...
         if INPUT_m then OUTPUT_s,m and goto LABEL_s,m
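The case-statement format can be mirrored directly in an executable sketch. Everything below (the state names SEEK/PARKED, the input and output labels) is an invented placeholder, not part of the paper's model:

```python
# Hypothetical states/inputs/outputs -- placeholders, not the paper's program.
program = {
    "SEEK": {                       # LABEL: {INPUT: (OUTPUT, next LABEL)}
        "open_south": ("move S", "SEEK"),
        "open_west":  ("move W", "SEEK"),
        "at_corner":  ("stop",   "PARKED"),
    },
    "PARKED": {
        "at_corner":  ("stand still", "PARKED"),
    },
}

def step(state, observed_input):
    """Execute one case statement: 'if INPUT then OUTPUT and goto LABEL'."""
    output, next_state = program[state][observed_input]
    return output, next_state
```

Because the table is finite and fixed, the machine's state set never grows with the mesh size n, which is the property the lower bounds below exploit.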

We specify algorithms in English—aiming to supply enough detail to make it clear how to craft finite-state programs that implement the algorithms.

2.2 The Parking Problem for Ants

The parking problem has each ant F move as close as possible to the corner tile of Mn that is closest to F when it receives the command PARK (from the outside world). Additionally, the ants in each quadrant must cluster within the most compact quarter-sphere "centered" at the quadrant's corner tile. Focus on QSW (easy clerical changes work for the other quadrants). Formally: A configuration of ants solves the parking problem for QSW precisely if it minimizes the parking potential function

    Π(t) =def Σ_{k=0}^{2n−2} (k + 1) × (the number of ants residing on Δ_k at step t).    (1)

This simply specified, yet algorithmically nontrivial, path-planning problem is a good vehicle for studying what ants can determine about the “floor” without “counting.”
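The potential function (1) translates directly into a few lines of code; this minimal sketch (the function name and the occupancy-list input format are ours) makes the "compactness" measure concrete:

```python
def parking_potential(ants_per_diagonal):
    """Pi(t) of Eq. (1): ants_per_diagonal[k] is the number of ants
    residing on diagonal Delta_k at step t; each such ant contributes
    k + 1 to the potential, so ants nearer the corner cost less."""
    return sum((k + 1) * count for k, count in enumerate(ants_per_diagonal))
```

For instance, three ants on Δ_0 and one on Δ_1 give Π = 3·1 + 1·2 = 5, while pushing any of them one diagonal outward strictly increases Π.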

3 Single Ants and Parking

3.1 The Simplified Framework for Single Ants

The input and output repertoires of this section's FSMs need not contain pheromone levels, because pheromones do not enhance the path-planning power of a single ant.

Proposition 1. Given any FSM F that employs pheromones while navigating Mn, there exists an FSM F′ that follows the same trajectory as F while not using pheromones.

Ants in Parking Lots

405

Proof (Sketch). We eliminate pheromones one at a time, so we may assume that F uses a single pheromone that it deposits with intensity levels from the set [0, I]. The pheromone-less FSM F′ "carries around" (in finite-state memory) a map that specifies all necessary information about the positions and intensities of deposits of F's pheromone. For F′ to exist, the map must be: (a) "small"—with size independent of n—and (b) easily updated as F's steps are simulated.
Map size. The portion of Mn that could contain nonzero levels of the pheromone is no larger than the "radius"-I subgrid of Mn that F has been in during the most recent I steps. No trace of pheromone can persist outside this region, because of volatility. Thus, the map need only be a (2I − 1) × (2I − 1) mesh centered at F's current tile. Because F is the only FSM, at most one tile of the map contains the integer I (a maximum level of the pheromone at this step), at most one contains the integer I − 1 (a maximum level one step ago), ..., at most one contains the integer 1 (a maximum level I − 1 steps ago). Fig. 2 displays a sample map, with four sample one-step updates.
Updating the map. Because of a map's restricted size and contents, there are fewer than 1 + Σ_{j=0}^{I−1} ((2I − 1)² − j) distinct maps (even ignoring the necessary adjacency of tiles that contain the integers k and k − 1). F′ can, therefore, carry the set of all possible maps in its finite-state memory, with the then-current map clearly "tagged." Thus, F′ has finitely many states as long as F does. The state-transition function of F′ augments F's by updating each state's map-component while emulating F's state change. □

3.2 A Single Ant Cannot Park, Even on a One-Dimensional Mesh

Theorem 1. One cannot design an FSM-ant that successfully parks when started on an arbitrary tile of (even the one-dimensional version of) Mn.

Proof (Sketch).
Aiming at a contradiction, say that we have an FSM F with state-set Q that can park successfully no matter where we place it in an arbitrarily large mesh Mn. By Proposition 1, we may assume that F does not use pheromones. Let us place F far enough within Mn that its path to its (allegedly correct) parking corner is longer than q = |Q|. Let us label each tile along this path with the state that F is in the last time it leaves the tile. Because F starts out far from its parking corner, there must be tiles along the path that are labeled with the same state, s ∈ Q, despite the fact that one tile is farther from F's parking corner than the other. (In Fig. 3(left), Mn's NE corner is depicted as F's parking corner.) Let us now lengthen Mn's sidelength—i.e., increase n—by "cutting and splicing" portions of Mn that begin and end with state-label s, as in Fig. 3(right). Because F is deterministic, it cannot "distinguish" between two tiles that have the same state-label. This means that F's ultimate behavior will be the same in the original and stretched versions of Mn: it will end its journey in both meshes at the NE corner, despite the fact that the "cut and splice" operation has lengthened F's parking path. However, if we perform the "cut and splice" operation enough times, we can make F's original tile as far as we want from its parking corner. In particular, we can make this distance greater than the distance between F's original tile and some other corner of (the stretched) Mn. Once this happens, F is no longer parking successfully. The theorem follows. □


Fig. 3. Illustrating the “cut and splice” operation of Theorem 1

Fig. 4. The two possible edge-terminating trajectories for a single ant

3.3 Single-Ant Configurations That Can Park

Theorem 2. A single FSM-ant F can park in time O(n) when started on an arbitrary edge-tile of Mn.

Proof (Sketch). With no loss of generality, let F begin on Mn's bottom edge, at tile ⟨0, k⟩. Clearly, F's target parking tile is either ⟨0, 0⟩ or ⟨0, n − 1⟩. To decide between these alternatives, F begins a 60° northeasterly walk from ⟨0, k⟩, i.e., a walk consisting of "Knight's-move supersteps" of the form (0, +1), (+1, 0), (+1, 0) (see Fig. 4), continuing until it encounters an edge or a corner of Mn. Note that after s supersteps, F has moved from ⟨0, k⟩ to ⟨2s, k + s⟩. Consequently, if F's walk terminates in:
– Mn's right edge, then F's parking tile is ⟨0, n − 1⟩, because 2s < n − 1 ≤ k + s;
– Mn's top edge or NE corner, then F's parking tile is ⟨0, 0⟩, because k + s ≤ n − 1 ≤ 2s. □
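The decision walk of Theorem 2 is easy to sanity-check by simulation. The sketch below (our own code, not the paper's) tracks the walk at superstep granularity and applies the two terminating cases from the proof; breaking the exact-midpoint tie in favor of ⟨0, 0⟩ is our choice, since both corners are then equidistant:

```python
def parking_corner(n, k):
    """Decide the parking corner for an ant on bottom-edge tile <0, k> of
    an n x n mesh. After s supersteps the ant sits on <2s, k+s>; we stop
    at the first s where an edge is met and apply Theorem 2's two cases."""
    s = 0
    while 2 * s < n - 1 and k + s < n - 1:   # still strictly inside the mesh
        s += 1
    if 2 * s >= n - 1:        # top edge or NE corner: k+s <= n-1 <= 2s
        return (0, 0)
    return (0, n - 1)         # right edge: 2s < n-1 <= k+s
```

On clear-cut starting columns the answer coincides with the genuinely nearest corner, which is the point of the 60° walk: the ant learns "which half am I in?" without counting to n.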

4 Multi-Ant Configurations That Can Park

We do not yet know whether a collection of ants can park successfully when started in an arbitrary initial configuration in Mn—but one simple restriction enables successful parking. Two ants are adjacent if they reside on tiles that share an edge or a corner.

Theorem 3. Any collection of ants that contains two designated adjacent ants can park in O(n²) synchronous steps.


Proof (Sketch). We focus, in turn, on the three components of the activity of parking:
1. having each ant determine its home quadrant;
2. having each ant use this information to plan a route to its target corner of Mn;
3. having ants that share the same home quadrant—hence, the same target corner—organize into a configuration that minimizes the parking potential function (1).

4.1 Quadrant Determination with the Help of Adjacent Ants

Lemma 1. Any collection of ants that contains two designated adjacent ants can determine their home quadrants in O(n²) synchronous steps.

We address the problem of quadrant determination in three parts. Section 4.1.A presents an algorithm that allows two adjacent ants to park. Because this algorithm is a bit complicated, we present in Section 4.1.B a much simpler algorithm that allows three adjacent ants to park. We then show in Section 4.1.C how two adjacent ants can act as "shepherds" to help any collection of ants determine their home quadrants.

A. Two adjacent ants can park. The following algorithm was developed in collaboration with Olivier Beaumont. The essence of the algorithm is depicted in Fig. 5. Say, for definiteness, that Mn contains two horizontally adjacent ants: AL on tile ⟨x, y⟩ and AR on tile ⟨x, y + 1⟩. (This assumption loses no generality, because the ants can remember their true initial configuration in finite-state memory, then move into the left-right configuration, and finally compute the adjustments necessary to make the correct determination about their home quadrants.) Our algorithm has two phases.

Determining home wedges. AL performs two roundtrip walks within Mn from its initial tile, one at 45° (a sequence of (+1, +1) moves, then a sequence of (−1, −1) moves) and one at −45° (a sequence of (+1, −1) moves, then a sequence of (−1, +1) moves); see Fig. 5(a). The termini of the outward walks determine the ants' home wedges:

45° walk terminus | −45° walk terminus | AL's home wedge | AR's home wedge
corner     | corner/top | WN | WE
corner     | left edge  | WW | if the walk ends within one tile of Mn's corner: WE; else: WS
top edge   | corner/top | WN | WN
top edge   | left edge  | WW | if the walk ends within one tile of Mn's corner: WN; else: WW
right edge | corner/top | WE | WE
right edge | left edge  | WS | if the walk ends within one tile of Mn's corner: WE; else: WS

Determining home quadrants. AL and AR use the knowledge of their home wedge(s) to determine their home quadrant(s), via one of the following subprocedures.
• The ants did not both begin in either WE or WW. In this case, AL and AR move in lockstep to the bottom edge of Mn (see Fig. 5(b)). They set out thence in lockstep on independent diagonal roundtrip walks, AL at an angle of −45° (via (+1, −1) moves)

Fig. 5. Quadrant determination for two adjacent ants: (a) wedge determination; (b) quadrant determination, horizontal version; (c) quadrant determination, vertical version

and AR at an angle of 45° (via (+1, +1) moves). When an ant returns to its original tile on the bottom edge, it checks for the other ant's presence in the appropriate adjacent tile, to determine the outcome of their roundtrip "race." A discrete set of "race" outcomes—Did the "race" end in a tie? Did one ant win by exactly one step? By more than one step?—provides the ants bounds on the columns where each began. These bounds allow AL and AR to determine their home quadrants.
• The ants both began in either WE or WW. In this case, a "vertical" analogue of the preceding "race" (see Fig. 5(c)) allows AL and AR to determine their home quadrants.
Once AL and AR know their respective home quadrants, they can park greedily.

B. Three adjacent ants can park. Let the ants A1, A2, and A3 be adjacent on Mn, in that their adjacency graph is connected. Via local communication, each ant can recognize and remember its initial location relative to the others'. The ants can thereby adjust the result of the following algorithm and determine their respective home quadrants.

Determining home quadrants. Horizontal placement (Fig. 6(Left)). The ants align vertically in the top-to-bottom order A1, A2, A3, with A2 retaining its original tile. (A simple clerical adjustment is needed if A2 begins on either the top or bottom edge of Mn.) A1 marches leftward to Mn's left edge and returns to its pre-march tile; in lockstep, A3 marches rightward to Mn's right edge and returns to its pre-march tile.
– If A1 and A3 return to A2 at the same step, or if A1 returns one step before A3, then A2's home is either QNW or QSW. A1 and A3 can now determine their home quadrants by noting how they moved in order to achieve the vertical alignment.
Fig. 6. The essence of the parking algorithm for three adjacent ants: (Left) Determining each ant’s horizontal placement. (Right) Determining each ant’s vertical placement.


– If A1 returns to A2 two or more steps before A3, then all three ants have QNW or QSW as home. The situation when A3 returns to A2 before A1 is almost the mirror image—except for the asymmetry in the eastern and western quadrants' boundaries.
Vertical placement (Fig. 6(Right)). This determination is achieved by "rotating the horizontal-placement algorithm by 90°." Details are left to the reader.
After horizontal and vertical placement, each ant knows its home quadrant, hence can park as specified in Section 4.2.

C. Two adjacent ants acting as shepherds. All ants in any collection that contains two designated adjacent ants are able to determine their respective home quadrants—with the help of the two adjacent ants. As we verify this, we encounter instances of ants blocking the intended paths of other ants. We resolve this problem by having conflicting ants switch roles—which is possible because all ants are identical. If ant A is blocking ant B, then A "becomes" B and continues B's blocked trajectory; and B "becomes" A and moves onto the tile that A has relinquished (by "becoming" B). We have m ≥ 2 ants, A1, ..., Am, with two designated adjacent ants—without loss of generality, A1 and A2. We present a three-phase algorithm that takes O(n²) synchronous steps and that has A1 and A2 act as shepherds to help all other ants determine their home quadrants.

Phase 1: A1 and A2 determine their home quadrant(s), using the algorithm of Section 4.1.A. They remember this information, which they will use to park in Phase 3.
Phase 2: A1 and A2 help other ants determine their home quadrants, via four subphases.

Subphase a: A1 and A2 distinguish east from west. A1 and A2 head to corner SW. A1 determines the parity of n via a roundtrip from tile ⟨0, 0⟩ to tile ⟨0, n − 1⟩ and back.
If n is even, then:
– A2 moves one tile eastward at every time-step until it reaches Mn's right edge. It then reverses direction and begins to move one tile westward at every time-step.
– Starting one time-step later, A1 moves one tile eastward at every third time-step.
– A1 and A2 stop when they are adjacent: A1 on tile ⟨0, ½n − 1⟩, A2 on tile ⟨0, ½n⟩.
A1 and A2 thus determine the midpoint of Mn's bottom row (hence, of every row):
– A2's trajectory, ⟨0, 0⟩ → ⟨0, n − 1⟩ → ⟨0, ½n⟩, takes (3/2)n − 2 time-steps.
– A1's trajectory, ⟨0, 0⟩ → ⟨0, ½n − 1⟩, takes (3/2)n − 3 steps. Because it starts one step later than A2 does, A1 arrives at ⟨0, ½n − 1⟩ after (3/2)n − 2 time-steps.

If n is odd, then:
– A2 moves one tile eastward at every time-step until it reaches Mn's right edge. At that point, it reverses direction and begins to move one tile westward at every time-step.
– Starting one time-step later, A1 moves one tile eastward at every third time-step.
– A1 and A2 stop when they are adjacent: A1 on ⟨0, ⌈½n⌉ − 1⟩, A2 on ⟨0, ⌈½n⌉⟩.
A1 and A2 thus determine the midpoint of Mn's bottom row (hence, of every row):
– A2's trajectory, ⟨0, 0⟩ → ⟨0, n − 1⟩ → ⟨0, ⌈½n⌉⟩, takes 3⌈½n⌉ − 4 time-steps.
– A1's trajectory, ⟨0, 0⟩ → ⟨0, ⌈½n⌉ − 1⟩, takes 3⌈½n⌉ − 3 steps. Because it starts one step later than A2 does, A1 arrives at ⟨0, ⌈½n⌉ − 1⟩ after 3⌈½n⌉ − 2 time-steps.
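Subphase a is easy to sanity-check by simulation. The sketch below handles the even-n case, abstracting both ants to column indices on the bottom row; starting both at column 0 and the exact stopping rule ("adjacent only after A2 has turned westward") are our modeling assumptions:

```python
def midpoint_rendezvous(n):
    """Simulate Subphase a for even n. A2 sweeps east to column n-1, then
    returns west one tile per time-step; A1, starting one time-step later,
    moves east one tile every third time-step. Returns (t, a1, a2) at the
    first time-step t when the two ants are adjacent after A2's turn."""
    a1, a2 = 0, 0            # both ants set out from corner SW (column 0)
    heading_east = True
    t = 0
    while heading_east or abs(a1 - a2) != 1:
        t += 1
        if heading_east:     # A2: one tile east per step, bounce at the edge
            a2 += 1
            if a2 == n - 1:
                heading_east = False
        else:                # A2: one tile west per step after the bounce
            a2 -= 1
        if t >= 2 and (t - 2) % 3 == 0:
            a1 += 1          # A1: one tile east every third step, from t = 2
    return t, a1, a2
```

The simulation reproduces the claimed accounting: for even n the ants halt on columns ½n − 1 and ½n after (3/2)n − 2 time-steps.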


Subphase b: A1 and A2 identify easterners and westerners. A1 walks through the western half of Mn, column by column, informing each encountered ant that it resides in either QNW or QSW. Simultaneously, A2 does the symmetric task in the eastern half of Mn, informing each encountered ant that it resides in either QNE or QSE. A1 and A2 meet at corner NW of Mn after their walks.
Subphase c: A1 and A2 distinguish north from south, via a process analogous to that of Subphase a.
Subphase d: A1 and A2 identify northerners and southerners, via a process analogous to that of Subphase b.
By the end of Phase 2, every ant knows its home quadrant.
Phase 3: Ants park. Every ant except for A1 and A2 begins to park as soon as it learns its home quadrant—which occurs no later than the end of Phase 2; A1 and A2 await the end of Phase 2, having learned their home quadrants in Phase 1.

4.2 Completing the Parking Process

We describe finally how ants that know their home quadrants travel to their parking corner and configure themselves optimally compactly within that corner.
Traveling to the corner. Each ant follows a two-stage trajectory. It proceeds horizontally to the left edge of Mn. Having reached that edge, it proceeds vertically toward its parking corner. An ant that is proceeding horizontally: (a) moves only to an empty tile; if none exists, then it waits; (b) yields to an ant that is already proceeding vertically.
Organizing within the corner. We describe this process in terms of QSW; other quadrants
are treated analogously; see Fig. 7. The first ant that reaches its parking corner (it may have started there) becomes an usher. It directs vertically arriving ants into the next adjacent downward diagonal; thus-directed ants proceed down this diagonal (k) and up the next higher one (k + 1)—moving only when there is an ant behind them that wants them to move! When an ant reaches the top of the upward diagonal, it “defrocks” the current usher and becomes an usher itself (via a message relayed by its lower neighbor). The corner of Mn thus gets filled in compactly, two diagonals at a time. One shows that this manner of filling the corner organizes the ants into a configuration that minimizes the parking potential function (1). This completes the parking algorithm and the proof of Theorem 3.  
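The claim that this diagonal-by-diagonal fill minimizes (1) can be checked against the obvious greedy lower bound. The sketch below (our code) assumes diagonal Δ_k can hold at most k + 1 ants near the corner, which holds while k < n:

```python
def min_parking_potential(m):
    """Greedy lower bound on Pi (Eq. (1)) for m ants in one quadrant:
    fill Delta_0, Delta_1, ... in order, at most k + 1 ants on Delta_k,
    each contributing k + 1. The usher-driven snaked fill of the text
    places ants in exactly this diagonal-by-diagonal pattern."""
    total, k = 0, 0
    while m > 0:
        placed = min(m, k + 1)      # pack diagonal Delta_k
        total += (k + 1) * placed
        m -= placed
        k += 1
    return total
```

For example, four ants pack as 1 + 2 + 1 across Δ_0, Δ_1, Δ_2, giving Π = 1 + 4 + 3 = 8; any less compact arrangement scores strictly higher.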

Fig. 7. Three stages in the snaked parking trajectory within QSW ; X-ed cells contain “ushers”


5 Conclusions

We have reported on progress in understanding the algorithmic strengths and weaknesses of ant-inspired robots within geographically constrained rectangular "floors." We have obtained this understanding via the simple path-planning problem we call parking: have ants configure themselves in a maximally compact manner within their nearest corner of the floor. We have illustrated a variety of initial configurations of a collection of ants that enable successful, efficient parking, the strongest being just that the collection contains two ants that are initially adjacent. We have also shown that—even in a one-dimensional world—a single ant cannot park.
We mention "for the record" that if efficiency is unimportant, then any collection of ants that contains at least four adjacent ones can perform a vast array of path-planning computations (and others as well), by simulating an autonomous (i.e., input-less) 2-counter Register Machine whose registers have capacity O(n²); cf. [10].
Where do we go from here? Most obviously, we want to solve the parking problem definitively, by characterizing which initial configurations enable parking and which do not. It would also be valuable to understand the capabilities of ant-inspired robots within the context of other significant tasks that involve path planning [1,6,7,8,9], including tasks that involve finding and transporting "food" and avoiding obstacles (as in [8,9]), as well as those that involve interactions among multiple genres of ants.
Acknowledgments. This research was supported in part by US NSF Grant CNS-0905399. Thanks to O. Beaumont, O. Brock, and the anonymous reviewers for insights and suggestions.

References
1. Chen, L., Xu, X., Chen, Y., He, P.: A novel ant clustering algorithm based on cellular automata. In: IEEE/WIC/ACM Int'l Conf. Intelligent Agent Technology (2004)
2. Chowdhury, D., Guttal, V., Nishinari, K., Schadschneider, A.: A cellular-automata model of flow in ant trails: non-monotonic variation of speed with density. J. Phys. A: Math. Gen. 35, L573–L577 (2002)
3. Folino, G., Mendicino, G., Senatore, A., Spezzano, G., Straface, S.: A model based on cellular automata for the parallel simulation of 3D unsaturated flow. Parallel Computing 32, 357–376 (2006)
4. Geer, D.: Small robots team up to tackle large tasks. IEEE Distributed Systems Online 6(12) (2005)
5. Goles, E., Martinez, S. (eds.): Cellular Automata and Complex Systems. Kluwer, Amsterdam (1999)
6. http://www.kivasystems.com/
7. Marchese, F.: Cellular automata in robot path planning. In: EUROBOT 1996, pp. 116–125 (1996)
8. Rosenberg, A.L.: Cellular ANTomata. In: Stojmenovic, I., Thulasiram, R.K., Yang, L.T., Jia, W., Guo, M., de Mello, R.F. (eds.) ISPA 2007. LNCS, vol. 4742, pp. 78–90. Springer, Heidelberg (2007)
9. Rosenberg, A.L.: Cellular ANTomata: food-finding and maze-threading. In: 37th Int'l Conf. on Parallel Processing (2008)
10. Rosenberg, A.L.: The Pillars of Computation Theory: State, Encoding, Nondeterminism. Universitext Series. Springer, Heidelberg (2009)
11. Spezzano, G., Talia, D.: The CARPET programming environment for solving scientific problems on parallel computers. Parallel and Distributed Computing Practices 1, 49–61 (1998)

High Performance Networks

José Flich¹, Alfonso Urso¹, Ulrich Bruening², and Giuseppe Di Fatta²
¹ Topic Chairs
² Members

Interconnection networks are key elements in current scalable compute and storage systems, such as parallel computers, networks of workstations, clusters, and even on-chip interconnects. In all these systems, common aspects of communication are of high interest, including advances in the design, implementation, and evaluation of interconnection networks, network interfaces, topologies, routing algorithms, communication protocols, etc. This year, five papers discussing some of those issues were submitted to this topic. Each paper was reviewed by four reviewers and, finally, we were able to select three regular papers. Despite the low number of submissions this year, the three selected papers exhibit the quality standard of past years in this track. The accepted papers discuss interesting issues such as the removal of Head-of-Line blocking in fat-tree topologies, the introduction of new topologies for on-chip networks, and the optimization of algorithms on specific topologies. In detail, the paper titled An Efficient Strategy for Reducing Head-of-Line Blocking in Fat-Trees, authored by J. Escudero-Sahuquillo, P. J. Garcia, F. J. Quiles, and J. Duato, discusses and proposes a new queue management strategy for fat-trees, where the number of queues is minimized while guaranteeing good performance levels. In the paper titled A First Approach to King Topologies for On-Chip Networks, authored by E. Stafford, J. L. Bosque, C. Martinez, F. Vallejo, R. Beivide, and C. Camarero, the use of new topologies is tackled for on-chip networks. Finally, the paper titled Optimizing Matrix Transpose on Torus Interconnects, authored by V. T. Chakaravarthy, N. Jain, and Y. Sabharwal, proposes application-level routing techniques that improve load balancing, resulting in better performance.
We would like to take this opportunity to thank the authors who submitted their contributions, as well as the Euro-Par Organizing Committee and the referees, whose highly useful comments and efforts have made this conference and this topic possible.

P. D’Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, p. 412, 2010. © Springer-Verlag Berlin Heidelberg 2010

An Efficient Strategy for Reducing Head-of-Line Blocking in Fat-Trees

Jesus Escudero-Sahuquillo¹, Pedro Javier Garcia¹, Francisco J. Quiles¹, and Jose Duato²
¹ Dept. of Computing Systems, University of Castilla-La Mancha, Spain
{jescudero,pgarcia,paco}@dsi.uclm.es
² Dept. of Computer Engineering, Technical University of Valencia, Spain
[email protected]

Abstract. The fat-tree is one of the most common topologies for the interconnection networks of PC clusters which are currently used for high-performance parallel computing. Among other advantages, fat-trees allow the use of simple but very efficient routing schemes. One of them is a deterministic routing algorithm that has been recently proposed, offering similar (or better) performance than adaptive routing while reducing complexity and guaranteeing in-order packet delivery. However, like other deterministic routing proposals, this algorithm cannot react when high traffic loads or hot-spot traffic scenarios produce severe contention for the use of network resources, leading to the appearance of Head-of-Line (HOL) blocking, which spoils network performance. In that sense, we present in this paper a simple, efficient strategy for dealing with the HOL blocking that may appear in fat-trees with the aforementioned deterministic routing algorithm. From the results presented in the paper, we can conclude that, in the mentioned environment, our proposal considerably reduces HOL blocking without significantly increasing switch complexity and required silicon area.

Keywords: Interconnection Networks, Fat-Trees, Deterministic Routing, Head-of-Line Blocking.

1 Motivation

High-performance interconnection networks are currently key elements for many types of computing and communication systems: massive parallel processors (MPPs), local and system area networks, IP routers, networks on chip, and clusters of PCs and workstations. In such environments, the performance achieved by the whole system greatly depends on the performance the interconnection network offers, especially when the number of processing and/or storage nodes is high. On its side, network performance depends on several issues (topology, routing, switching, etc.) which should be carefully considered by interconnect designers in order to obtain the desired low packet latencies and high network bandwidth. However, network design decisions are not actually taken based only on the achieved performance, but also on other factors, like network cost and power consumption.

414

J. Escudero-Sahuquillo et al.

In fact, a clear trend in interconnect research is to propose cost-effective techniques, which make it possible to obtain good performance while minimizing network resource requirements. Obviously, cost-effective solutions are especially relevant for the commercial high-speed interconnects (Myrinet [1], InfiniBand [2], etc.) used for building high-performance clusters, since they must satisfy, at affordable cost, the growing performance demands of cluster designers and users (many of the most powerful parallel computers are currently cluster-based machines [3]). In that sense, the fat-tree [4] has become one of the most popular network topologies since it offers (among other properties) a high communication bandwidth while minimizing the required hardware. Consequently, many interconnection networks in current clusters and MPPs (for instance, the Earth Simulator [5]) are fat-trees. Additionally, the fat-tree pattern eases the implementation of different efficient routing algorithms, either deterministic (packets follow a fixed path between source and destination) or adaptive (packets may follow several alternative paths). In general, adaptive routing algorithms are more difficult to implement than deterministic ones and present problems regarding in-order packet delivery and deadlock-freedom, but it has been traditionally assumed that they better balance traffic, thus achieving higher throughput. However, a recently proposed deterministic routing algorithm for fat-trees [6] achieves a similar or better throughput than adaptive routing, thanks to a smart balance of link utilization based on the properties of the fat-tree pattern. This algorithm can be implemented in a cost-effective way by using Flexible Interval Routing [7] (FIR), a memory-efficient generic routing strategy.
Summing up, this algorithm offers the advantages of deterministic routing (simple implementation, in-order packet delivery, deadlock-freedom) while reaching the performance of (or even outperforming) adaptive routing. Hereafter, this algorithm will be referred to as DET. However, the DET routing algorithm, like other deterministic proposals, is unable to keep by itself its good performance level when packet contention appears. Contention happens in a switch when several packets, from different input ports, concurrently require access to the same output port. In these cases, only one packet can cross while the others must wait until the required output port becomes free, so their latency increases. Moreover, when contention is persistent (in this case, it is usually referred to as congestion), a packet waiting to cross will prevent other packets stored behind it in the same input buffer from advancing, even if these packets request a free output port; thus their latency increases and switch throughput drops. This phenomenon is known as Head-of-Line (HOL) blocking, and may limit the throughput of the switch to about 58% of its peak value [8]. Of course, high traffic loads and hot-spot situations favor the appearance of contention and HOL blocking, and consequently in these scenarios network performance suffers degradation. In order to solve that problem, many mechanisms have been proposed. In that sense, as modern high-speed interconnects are lossless (i.e., discarding blocked packets is not allowed), the most common approach to deal with HOL blocking is to have different queues at each switch port, in order to separately store packets belonging to different flows. This is the basic approach followed by several techniques that, on the other hand, differ in many aspects, basically in the required number of queues and in the policy for mapping packets to queues. For instance, Virtual Output Queues at network

An Efficient Strategy for Reducing Head-of-Line Blocking in Fat-Trees

415

level (VOQnet) [9] requires as many queues per port as end-nodes, mapping each packet to the queue assigned to its destination. This solution guarantees that packets addressed to different destinations never share queues, thus completely eliminating HOL blocking, but, on the other hand, it is too costly in terms of the silicon area required to implement all these queues per port in medium or large networks. On the contrary, other techniques like Virtual Output Queues at switch level (VOQsw) [10], Dynamically Allocated Multi-Queues (DAMQs) [11] or Destination-Based Buffer Management (DBBM) [12] use far fewer queues per port than VOQnet. Although these techniques map packets to queues according to different "static" (i.e., independent of traffic conditions) criteria, all of them allow packets belonging to different flows to share queues, thus they just partially reduce HOL blocking. In contrast with the "static" queue assignment of the previously mentioned techniques, both Regional Explicit Congestion Notification (RECN) [13], [14] and Flow-Based Implicit Congestion Management (FBICM) [15] detect which flows cause HOL blocking and dynamically assign queues to separate them from others. Although this approach quite efficiently eliminates HOL blocking without using many queues, it requires specific mechanisms to detect congested flows and additional resources (basically, Content-Addressable Memories, CAMs) to keep track of them. In conclusion, it seems that effective HOL blocking elimination would imply considerably increasing switch cost and/or introducing significant complexity. On the contrary, we think that, if the queue assignment criterion exploits the properties of both the network topology and the routing scheme, it would be possible to quite effectively eliminate HOL blocking without using too many queues per port and without introducing significant complexity.
In that sense, in this paper we propose an efficient HOL blocking elimination technique for fat-trees that use the DET routing algorithm. Our proposal uses a reduced set of queues per port, mapping packets to these queues according to the traffic balance the routing algorithm performs. As shown in the paper, by linking queue assignment to the routing algorithm, our proposal reduces HOL blocking probability with a minimum number of resources, thus requiring fewer resources than generic HOL blocking elimination techniques while achieving similar (or even better) performance. Moreover, as queue assignment is static in our proposal, it does not introduce additional complexity, as dynamic approaches do. Summing up, we propose a simple technique for eliminating HOL blocking in a cost-effective interconnect configuration (fat-trees using the DET routing algorithm). As the proposed queue assignment is based on traffic balance, and this balance is based on a smart distribution of destinations among output ports, we call our proposal Output-Based Queue Assignment (OBQA).

The rest of the paper is organized as follows. In Section 2 we summarize the deterministic routing algorithm for fat-trees on which our proposal is based. Next, in Section 3, the basics of our OBQA proposal are presented. In Section 4, OBQA is compared, in terms of performance and resource needs, to previously proposed techniques which also reduce HOL blocking. Finally, in Section 5, some conclusions are drawn.

2 Efficient Deterministic Routing in Fat-Trees

Theoretically, a fat-tree topology [4] is an indirect bidirectional interconnection network consisting of several stages of switches connected by links, forming a complete tree


J. Escudero-Sahuquillo et al.

Fig. 1. Eight node 2-ary 3-tree with DET routing algorithm

which gets “thicker” near the “root” switch (transmission bandwidth between switches is increased by adding more links in parallel as switches get closer to the root), and whose “leaves” (switches at the first stage) are the places where processors are located. However, as this theoretical scheme requires that the number of ports of the switches increase as we get closer to the root, its implementation is actually unfeasible. For this reason, some equivalent, alternative implementations have been proposed that use switches with constant radix [16]. In [17], a fat-tree definition is given that embraces all the topologies referred to as fat-trees. Among these, we focus on k-ary n-trees, a parametric family of regular multistage interconnection networks (MINs) [18], where bandwidth between switches is increased towards the root by replicating switches. A k-ary n-tree topology is defined by two parameters: k, which represents the arity or number of ports connecting a switch to the next or previous stage, and n, which represents the number of stages of the network. A k-ary n-tree is able to interconnect k^n nodes and includes n*k^(n-1) switches. Figure 1 shows a 2-ary 3-tree connecting 8 processing nodes with 12 switches.

In such topologies, a minimal source-destination path can be easily found by going from the source to one of the nearest common ancestors of both source and destination (“upwards” direction) and then turning towards the destination (“downwards” direction). Note that when advancing upwards several paths are possible (thus making adaptive routing possible), but when advancing downwards only a single path is available. Taking that into account, adaptivity is limited to the upwards path, where any switch can select any of its upwards output ports for forwarding packets towards the next stage, but this is enough to allow adaptive routing algorithms based on different criteria.
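The node and switch counts given above follow directly from the two parameters; a quick sketch to check them against the network configurations used later in the paper:

```python
def kary_ntree_size(k, n):
    """A k-ary n-tree interconnects k**n end-nodes using
    n * k**(n-1) switches (k ports up, k ports down per switch)."""
    return k ** n, n * k ** (n - 1)

print(kary_ntree_size(2, 3))   # (8, 12): the 2-ary 3-tree of Figure 1
print(kary_ntree_size(4, 3))   # (64, 48): network configuration #1
print(kary_ntree_size(16, 2))  # (256, 32): network configuration #3
```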
For instance, it is possible to use a round-robin policy for selecting output ports, in order to balance traffic inside the network and thus improve performance. However, adaptive routing algorithms introduce out-of-order packet delivery, as packets may follow different paths between the same source-destination pair. By contrast, a deterministic routing algorithm able to balance traffic as well as adaptive routing schemes do would solve the out-of-order delivery problem without losing performance. Such a deterministic routing algorithm (referred to above as


Fig. 2. OBQA logical input port organization

DET) is proposed in [6], and it reduces the multiple upwards paths to a single one for each source-destination pair without unbalancing network link utilization (so all the links of a given stage are used by a similar number of paths). This is accomplished by shuffling consecutive destinations at each switch in the upwards direction. That means that consecutive destinations are distributed among the upwards output links, reaching different switches in the next stage. Figure 1 also shows the destination distribution proposed by the deterministic routing algorithm in a 2-ary 3-tree network. Note that each link is labeled¹ with its assigned destinations (that is, with the destinations of packets that will be forwarded through it). It can be seen that, by shuffling destinations as proposed, packets crossing a switch are addressed to a number of destinations that decreases as the stage of the switch increases (that is, the higher the stage, the lower the number of destinations the switch deals with). In fact, each switch of the last stage receives packets addressed to only two destinations, and packets destined to each one are forwarded through a different downwards link (e.g., switch 8 receives packets only for destinations 0 and 4, and they are forwarded downwards through different output ports). Note also that packets addressed to the same destination reach the same switch at the last stage, then follow a unique downwards path. That means that, when going downwards, a packet shares links only with packets addressed to the same destination. In this way, the mechanism distributes the traffic destined to different nodes and thus traffic is completely balanced (that is, both upwards and downwards links at a given stage are used by the same number of paths). In [6], more details about this deterministic routing algorithm for fat-trees can be found, including a possible implementation using Flexible Interval Routing [7] and an evaluation of its performance.
This evaluation shows that this deterministic routing algorithm equals adaptive routing for synthetic traffic patterns (while being simpler to implement), and it outperforms adaptive routing by a factor of 3 when real applications are considered. Summing up, this deterministic algorithm is one of the best options for routing in fat-trees, since it reaches high performance while being cost-effective. However, as this deterministic routing proposal considers only switches with single-queue ports, its performance drops when contention situations cause Head-of-Line blocking in those single queues², so some mechanism should be used to help this routing algorithm guarantee a certain performance level even in those situations.

¹ Upwards links in red, downwards links in green.
² All in all, it is not within the scope of the DET routing algorithm to solve the HOL blocking problem.
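The paper specifies DET by its effect: consecutive destinations are shuffled among the upwards links at each stage. One plausible digit-based formulation of that shuffling, written here purely for illustration (it is an assumption, not the exact mechanism of [6]), selects at each upwards stage the corresponding base-k digit of the destination:

```python
def digits(dest, k, n):
    """Base-k digits of a destination, least significant first (hypothetical encoding)."""
    return [(dest // k ** s) % k for s in range(n)]

def det_up_port(dest, stage, k, n):
    """Illustrative upwards port choice: the stage-th base-k digit of the
    destination, so consecutive destinations spread over different upwards
    links at the first stage."""
    return digits(dest, k, n)[stage]

k, n = 2, 3
# Consecutive destinations take different upwards links at the first stage:
print([det_up_port(d, 0, k, n) for d in range(4)])  # [0, 1, 0, 1]
# Destinations 0 and 4 share their upwards digits at stages 0 and 1, so they
# meet at the same last-stage switch (switch 8 in Figure 1), where the
# remaining digit separates them onto different downwards links:
print([det_up_port(x, s, k, n) for x in (0, 4) for s in (0, 1)])  # [0, 0, 0, 0]
```

This digit view reproduces the two properties the text relies on: per-stage spreading of consecutive destinations, and a unique last-stage switch per destination pair differing only in the most significant digit.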


3 OBQA Description

In this section we describe in depth our new proposal for reducing HOL blocking in fat-trees. As mentioned above, we call our proposal Output-Based Queue Assignment (OBQA), since it assigns packets to queues depending on their requested output port, thus taking advantage of the traffic balance the DET routing algorithm performs. In the following paragraphs, we first detail the assumed memory organization at each switch port. Next, we explain the OBQA basics and how they “fit” the DET routing algorithm in order to efficiently eliminate HOL blocking.

Figure 2 depicts a diagram of the assumed queue scheme at input ports³. As can be seen, a RAM is used to store incoming packets, this memory being statically divided into a reduced set of queues of fixed size. The exact number (n) of queues may be tuned, but, as one of our objectives is to reduce the queue requirements of other techniques, we assume that n is always smaller than the switch radix (as we explain in Section 4, a value of n equal to or lower than half the switch radix is enough to quite efficiently eliminate HOL blocking). In order to calculate the queue in which each incoming packet must be stored, OBQA performs a simple modulo-mapping operation between the output port requested by the packet and the number of queues per port:

    assigned_queue_number = requested_output_port MOD number_of_queues

This queue assignment is the key to OBQA functionality. As a first benefit, packets requesting the same output port (which may in fact be addressed to different destinations) are stored in the same input queue, so if that output port becomes congested, packets in other queues do not suffer HOL blocking. Moreover, taking into account that we assume the use of the DET routing algorithm, the number of destinations a switch deals with decreases as the stage increases (as can be seen in Figure 1), and so does the number of destinations assigned to each OBQA queue.
Therefore, the higher the stage, the fewer destinations share queues, thus decreasing HOL-blocking probability. In fact, both HOL blocking originated by contention inside the switch and HOL blocking created by contention in other switches are reduced.

Figure 3 depicts an example of OBQA operation in a portion of a 2-ary 3-tree (see Figure 1) consisting of several switches in different stages. Note that the switch radix is 4, switch ports being numbered from P0 to P3. We assume the number of queues per port is 2, although only the queues at port 0 of every switch are shown. Inside the queues, packets are labeled with their own destination. As in Figure 1, routing information is depicted beside the output ports, indicating which destinations are assigned to each output port (upwards destinations are colored in red, downwards ones in green). In the example, packets addressed to all destinations are injected by the end-node connected to port 0 of switch 0. When received at switch 0, these packets are stored in their corresponding input queue, which is calculated by using the aforementioned modulo-mapping function. For instance, packets addressed to destinations 1, 3, 5 or 7 are stored in queue 1 because, for these destinations, the requested output port is either P1 or P3, so in all these cases requested_output_port MOD 2 = 1. Analogously, packets addressed to destinations 2, 4 or 6 are stored in queue 0.

³ Hereafter, for the sake of simplicity, we assume an Input-Queued (IQ) switch architecture, but note that OBQA could also be valid for Combined Input and Output Queued (CIOQ) switches.
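The OBQA queue selection itself reduces to a one-line function; a minimal sketch of the modulo-mapping described above:

```python
def obqa_queue(requested_output_port, n_queues):
    """OBQA queue selection: a simple modulo over the requested output port."""
    return requested_output_port % n_queues

# Switch 0 of Figure 3, 2 queues per port: packets whose route requests an
# odd output port (P1, P3) share queue 1; even ports (P0, P2) share queue 0.
print([obqa_queue(p, 2) for p in range(4)])  # [0, 1, 0, 1]
```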


Fig. 3. OBQA operation example (4 × 4 Switches)

At the next stage, the number of destinations present in switch 4 is lower, because the DET routing algorithm smartly balances traffic among network links. For instance, port 0 of switch 4 only receives packets addressed to destinations 2, 4 and 6, and the routing algorithm again balances traffic among all possible output ports. At this point, packets addressed to destination 4 request P2, thus being stored in queue 0, while packets addressed to destinations 2 or 6 request P1 and P3 respectively, thus being stored in queue 1. Note that at this stage the number of destinations sharing each queue is lower than in the former stage, so HOL-blocking probability is reduced. At switch 8, port 0 only receives packets addressed to destination 4. This switch is at the highest stage and, for this reason, the number of destinations the switch deals with is minimal, so HOL blocking is completely eliminated at this stage. Finally, as the DET routing algorithm implies that downwards paths are exclusively used by packets addressed to the same destination, HOL-blocking situations in downwards paths are not possible.

Summing up, OBQA smartly exploits both the benefits of k-ary n-trees and the properties of the DET routing algorithm to progressively reduce HOL-blocking probability along any path inside the fat-tree. As this is achieved with a reduced set of queues per port, OBQA can be considered a cost-efficient technique for HOL-blocking elimination in fat-trees. In the next section, we evaluate OBQA in comparison with other schemes proposed for reducing HOL blocking, in terms of performance and memory requirements.

4 Performance Evaluation

In this section we present the evaluation of the OBQA mechanism, based on simulation results showing network performance when OBQA or other HOL-blocking elimination techniques are used. The simulation tool used in our experiments is an ad-hoc, event-driven simulator modeling interconnection networks at cycle level. In the next sections we first describe the simulated scenarios and the modeling considerations used to configure all the experiments. Then, we show and analyze the simulation results.

4.1 Simulation Model

The simulator models different types of network topologies by defining the number of switches, end-nodes and links. In our experiments, we model fat-trees (k-ary n-trees) with different network sizes and different switch radix values. In particular, we use the network configurations shown in Table 1.


Table 1. Evaluated network configurations

  #    Fat-Tree Size   Interconnection Pattern   Switch radix   #Switches (total)   #Stages
  #1      64 × 64         4-ary 3-tree                8                48               3
  #2     256 × 256        4-ary 4-tree                8               256               4
  #3     256 × 256       16-ary 2-tree               32                32               2

For all network configurations, we use the same link model. In particular, we assume serial full-duplex pipelined links with 1 GByte/s of bandwidth and 4 nanoseconds of link delay, both for switch-to-switch and node-to-switch links. The DET routing algorithm described in Section 2 has been used for all the network configurations in Table 1. The modeled switch architecture follows the IQ scheme, so memories are implemented only at switch input ports. However, memory size and organization at each input port depend on the mechanism used for reducing HOL blocking. For the purpose of comparison, the simulator models the following memory queue schemes:

– OBQA. A memory of 4 KB per input port is used, statically and equally divided among all queues in the input port. As described in Section 3, queues are assigned to a set of output ports (taking into account the routing algorithm, that means that a queue is assigned to a set of destinations) according to the aforementioned modulo-mapping function. We consider three values for the number of queues: 2, 4 or 8.
– DBBM. As in the previous scheme, a memory of 4 KB per input port is assumed, statically and equally divided among all configured DBBM queues in the port. Each queue is assigned to a set of destinations according to the modulo-mapping function assigned_queue_number = destination MOD number_of_queues. We consider 4 or 8 queues per port for DBBM.
– Single Queue. This is the simplest case, with only one queue at each input port for storing all the incoming packets. Hence, there is no HOL-blocking reduction policy at all, so this scheme allows us to evaluate the performance achieved by the DET routing algorithm alone. 4 KB memories are used in this case.
– VOQ at switch level (VOQsw). 4 KB memories per input port are used, statically and equally divided into as many queues as switch output ports, in order to store each incoming packet in the queue corresponding to its requested output port. As the number of queues equals the switch radix, 8 queues are used for network configurations #1 and #2, and 32 queues for network configuration #3.
– VOQ at network level (VOQnet). This scheme, although the most effective one, requires a greater memory size per port, because the memory must be split into as many queues as end-nodes in the network, and each queue requires a minimum size. Considering flow control restrictions, packet size, link bandwidth and link delay, we fix the minimum queue size to 512 bytes, which implies input port memories of 32 KB for the 64 × 64 fat-tree and 128 KB for the other (larger) networks. Note that this scheme is actually almost unfeasible, so it is considered only to show the theoretical maximum efficacy in HOL-blocking elimination⁴.

⁴ As both RECN and FBICM have been reported to obtain a performance level similar to VOQnet, we consider it redundant to include these techniques in the comparison.
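The difference between the DBBM and OBQA mapping functions is only the key they hash: the packet's destination versus the output port chosen by the routing algorithm. The sketch below contrasts them on a hypothetical destination set (chosen purely for illustration) such as a port beyond the first stage might see under DET routing, where destinations happen to be congruent modulo the queue count:

```python
def dbbm_queue(destination, n_queues):
    """DBBM: static mapping keyed on the packet's destination."""
    return destination % n_queues

def obqa_queue(requested_output_port, n_queues):
    """OBQA: static mapping keyed on the routing-chosen output port."""
    return requested_output_port % n_queues

# Hypothetical 8-port switch with 4 queues per input port; assumed values:
destinations = [3, 7, 11, 15]  # destination set seen at this port (illustrative)
ports = [0, 1, 2, 3]           # requested output ports for those packets

print({dbbm_queue(d, 4) for d in destinations})  # {3}: a single DBBM queue used
print({obqa_queue(p, 4) for p in ports})         # {0, 1, 2, 3}: all OBQA queues used
```

When the destinations visible at a port are not uniformly spread modulo the queue count, DBBM concentrates them in few queues, which is consistent with the observation later in the paper that some DBBM queues go unused beyond the first stage.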


Table 2. Synthetic traffic patterns used in the evaluation

  Traffic          Random Traffic                          Hot-Spot Traffic
  case #   Sources  Destination  Gen. rate     Sources  Destination  Gen. rate  Start time  End time
  #1        100%     random      incremental     0%         –           –           –          –
  #2         75%     random      100%           25%        123         100%      250 µs     300 µs

Notice that although memory requirements are different for each policy, we use the same amount of memory per port (4 KB), except for the VOQnet scheme. Later, we analyze the minimum memory requirements for each mechanism, thus putting the results in this section in the right context.

Regarding the message switching policy, we assume Virtual Cut-Through. Moreover, in all the switches, the flow control policy is a credit-based mechanism at the queue level. Packets are forwarded from input queues to output queues through a multiplexed crossbar, modeled with a speedup of 1 (link bandwidth is equal to crossbar bandwidth). The end-nodes are connected to switches through Input Adapters (IAs). Every IA is modeled with a fixed number of admittance queues (as many as destinations in the network, in order to avoid HOL blocking before packet injection), and a variable number of injection queues, which follow the same scheme selected for the queues at input port memories. A generated message is completely stored in the admittance queue assigned to its destination. Then, the stored message is packetized before being transferred to an injection queue. We use 64-byte packets.

Regarding traffic loads, we use both synthetic traffic patterns (in order to simulate ideal traffic scenarios) and storage area network (SAN) traces. The synthetic traffic patterns are described in Table 2. For uniform traffic (traffic case #1), each source injects packets addressed to random destinations. We range the injection rate from 0% up to 100% of the link bandwidth. In addition, a simple, intensive hot-spot scenario (traffic case #2) is defined, in order to create heavy congestion situations within the network. In this case congestion is generated by a percentage of sources (25%) injecting packets always addressed to the same destination, while the rest of the sources (75%) inject packets to random destinations.
Note that the random packet generation rate is incremental in case #1, thus increasing the traffic rate from 0% up to 100% of the link bandwidth. By contrast, traffic pattern #2 has been used to obtain performance results as a function of time, in order to show the impact of a sudden congestion situation. These synthetic traffic patterns have been applied to network configurations #2 and #3. On the other hand, we use real traffic traces provided by Hewlett-Packard Labs. They include all the I/O activity generated from 1/14/1999 to 2/28/1999 at the disk interface of the cello system. As these traces are eleven years old, we apply several time-compression factors to them. Of course, only results as a function of time are shown in this case. Traces are used for network configuration #1. Finally, although the simulator offers many metrics, we base our evaluation on the ones usually considered for measuring network performance: network throughput (network efficiency when normalized) and packet latency. Therefore, in the following subsections we analyze the obtained network performance by means of these metrics.
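The credit-based flow control mentioned above can be sketched as a small class (an illustrative model under our own assumptions, not the simulator's code): the sender may only transmit while it holds credits mirroring free downstream queue slots, so packets stall rather than being dropped, as required in a lossless network.

```python
class CreditedQueue:
    """Minimal credit-based flow control at queue level: each packet sent
    consumes one credit; the credit is returned when the packet leaves
    the downstream queue."""
    def __init__(self, capacity):
        self.credits = capacity  # credits mirror free slots downstream
        self.queue = []

    def can_send(self):
        return self.credits > 0

    def send(self, packet):
        assert self.can_send(), "no credits: sender must stall, never drop"
        self.credits -= 1
        self.queue.append(packet)

    def forward(self):
        packet = self.queue.pop(0)
        self.credits += 1        # credit returned to the sender
        return packet

q = CreditedQueue(capacity=2)
q.send("p0"); q.send("p1")
print(q.can_send())  # False: queue full, the sender stalls instead of dropping
q.forward()
print(q.can_send())  # True: a credit came back, transmission resumes
```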


4.2 Results for Uniform Traffic

Figure 4 depicts the simulated latency results as a function of traffic load, obtained for fat-tree configurations #2 (Figure 4(a)) and #3 (Figure 4(b)) when traffic case #1 (completely uniform traffic) is used. As can be seen in Figure 4(a), when the network is made of switches of radix 8 (configuration #2), OBQA configured with 4 queues per port reaches the saturation point (the point where average message latency dramatically increases) at the same traffic load as VOQnet and VOQsw. Note that in this case VOQsw requires 8 queues per input port and VOQnet requires 256 queues. Thus, OBQA equals the maximum possible performance while significantly reducing the number of required queues. For its part, the DBBM scheme, which is also configured with 4 queues per input port, experiences a high packet latency, near the poorest result (that of the Single Queue scheme). It can also be seen that, even when configured with just 2 queues, OBQA achieves better results than DBBM with 4 queues, although worse than VOQsw (note, however, that the difference is around 12% while the number of queues is reduced by 75%). Note also that OBQA dramatically increases (by around 30%) the performance achieved by the Single Queue scheme (that is, the performance achieved by the DET routing algorithm without any HOL blocking elimination mechanism).

For a network with switches of radix 32 (Figure 4(b)), OBQA configured with 8 queues per port reaches the saturation point at the same traffic load as VOQsw, which in this case requires 32 queues per port. Moreover, this traffic load is just 2% lower than the VOQnet saturation point, so the number of queues per port can be reduced from 256 to 8 at the cost of a minimal performance decrease. Even if configured with 4 queues, OBQA reaches the saturation point at a load only 5% lower than VOQsw. Again, the Single Queue scheme and DBBM with 4 or 8 queues achieve very poor results.
As already explained, that is what we would expect for the Single Queue scheme, since it does not implement any HOL blocking elimination mechanism. However, the reason for the poor behavior of DBBM is that its queue assignment policy does not “fit” the routing algorithm (contrary to OBQA), so some DBBM queues are not efficiently used for eliminating HOL blocking. In fact, considering the routing algorithm, in some switches beyond the first stage some DBBM queues are not used at all.

[Figure 4 plots: Network Latency (cycles) versus Normalized Accepted Traffic for the 1Q, DBBM, OBQA, VOQsw and VOQnet schemes; (a) Network Configuration #2, (b) Network Configuration #3]
Fig. 4. Network latency versus Accepted traffic. Uniform distribution of packet destinations.


Therefore, we can conclude that, in a uniform traffic scenario, OBQA achieves a network performance similar to that of VOQnet and VOQsw, while requiring far fewer queues per port (specifically, a maximum number of queues per port equal to half the switch radix). Moreover, OBQA clearly outperforms DBBM.

4.3 Results for Hot-Spot Traffic

In this subsection, we present in Figure 5 network efficiency results as a function of time, when synthetic traffic pattern #2 is used in fat-tree configurations #2 (Figure 5(a)) and #3 (Figure 5(b)). As previously described, in this case 25% of the end-nodes generate hot-spot traffic addressed to a single end-node (specifically, destination 123), whereas the rest of the traffic is randomly distributed. Furthermore, the hot-spot generation sources are active only for 50 microseconds, starting 250 microseconds after simulation start time, thus creating a sudden, transient congestion situation.

As can be seen in Figure 5(a), for a network with 8-radix switches, the Single Queue scheme barely achieves 5% of network efficiency when congestion appears. Likewise, DBBM (with 4 queues) performance decreases by around 25% when congestion arises. Obviously, both queue schemes are dramatically affected by the HOL blocking created by congestion, and do not recover their maximum theoretical performance during the simulation time. OBQA with 4 queues per port and VOQsw (8 queues per port) achieve similar performances, better than that of DBBM, decreasing by around 20% when the hot-spot appears; thus they improve on DBBM behavior and completely outperform the Single Queue scheme. However, again OBQA requires half as many queues as VOQsw while achieving similar performance. For its part, VOQnet achieves the maximum efficiency, but requires 256 queues. Similar results are achieved for a fat-tree made of 32-radix switches, as can be seen in Figure 5(b), with VOQsw (32 queues per port in this case) achieving the same results as OBQA with 8 queues per port.
Note that in this case the performance level of OBQA with 8 queues is quite close to the maximum (VOQnet), and also that OBQA with 4 queues outperforms DBBM with 8 queues.

(a) Network Configuration #2.

(b) Network Configuration #3.

Fig. 5. Network efficiency (normalized throughput) versus Time. Hot-Spot traffic (25% of packets are addressed to the same destination).


[Figure 6 plots: Normalized Throughput versus Time (nanoseconds) for the 1Q, DBBM, OBQA, VOQsw and VOQnet schemes; (a) Network Configuration #1, FC=20; (b) Network Configuration #1, FC=40]

Fig. 6. Network efficiency (normalized throughput) versus Time. Storage area network (SAN) traces with different compression factors.

Summing up, the analysis of the hot-spot results leads to conclusions similar to those of the former (uniform traffic) analysis: OBQA reaches great performance with a number of queues per port equal to half or a quarter of the switch radix.

4.4 Results for Traces

Finally, we evaluate network performance when real traffic (the I/O traces described in Section 4.1) is used as the traffic load. In this case we use fat-tree configuration #1 (64 nodes, switch radix 8). Figure 6(a) shows results for a trace compression factor of 20. As can be seen, OBQA with 4 queues again achieves excellent efficiency, at the same level as VOQsw (8 queues), and they slightly outperform DBBM (4 queues) and the Single Queue scheme. When a higher compression factor (40) is applied to the trace (Figure 6(b)), OBQA (4 queues), like VOQsw, outperforms DBBM more clearly, achieving in some cases a 30% improvement. Again, the Single Queue scheme achieves the poorest results (down to 50% of the OBQA efficiency).

4.5 Data Memory Requirements

In this section, we compare the minimum data memory requirements of the different switch organizations, as an estimation of switch complexity in each case. In particular, we consider the same HOL blocking elimination queue schemes modeled in the simulation tests. Table 3 shows memory size and area requirements for each queue scheme at each switch input port. To help in the comparison, different numbers of queues per port have been evaluated for the different schemes. For each case the minimum memory requirements have been computed assuming two packets per queue (64-byte packet size) and Virtual Cut-Through switching. All queue schemes can be interchangeably used by 8 × 8 and 32 × 32 switches, except the VOQsw ones. In particular, line #3 corresponds to 32 × 32 switches, and line #4 represents data for 8 × 8 switches. On the other hand, VOQnet requirements are shown for 256-node networks (line #1) and for


Table 3. Memory size and area requirements per input port for different queue schemes

  #   Queue scheme   #Queues per port   Data memory size   Data memory area per port (mm²)
  1     VOQnet             256             32768 bytes              0.33291
  2     VOQnet              64              8192 bytes              0.07877
  3     VOQsw               32              4096 bytes              0.03569
  4     VOQsw                8              1024 bytes              0.01412
  5     DBBM                 8              1024 bytes              0.01412
  6     DBBM                 4               512 bytes              0.00647
  7     OBQA                 8              1024 bytes              0.01412
  8     OBQA                 4               512 bytes              0.00647
  9     OBQA                 2               256 bytes              0.00359
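The memory sizes in Table 3 follow from the stated sizing rule (two maximum-size 64-byte packets per queue, Virtual Cut-Through switching); a quick check:

```python
PACKET_SIZE = 64  # bytes, as used throughout the evaluation

def min_port_memory(n_queues, packets_per_queue=2, packet_size=PACKET_SIZE):
    """Minimum input-port memory under the sizing rule used for Table 3:
    two maximum-size packets per queue."""
    return n_queues * packets_per_queue * packet_size

print(min_port_memory(256))  # 32768 bytes: VOQnet, 256-node network
print(min_port_memory(8))    # 1024 bytes: VOQsw/DBBM/OBQA with 8 queues
print(min_port_memory(2))    # 256 bytes: OBQA with 2 queues
```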

64-node networks (line #2). Memory area has been estimated by means of the CACTI tool v5.3 [19] using its SRAM modeling. We assume SRAM memories with one read and one write port, a 45 nm technology node, and 1 byte as the readout value. As can be seen, in general OBQA and DBBM present similar memory size and area requirements per port, although, as we have shown in previous sections, OBQA always outperforms DBBM when configured with the same number of queues (or even with fewer queues than DBBM in some scenarios). For its part, the VOQsw requirements for 32 × 32 switches are much greater than those of any OBQA or DBBM scheme. For 8 × 8 switches, the VOQsw requirements (8 queues) equal the OBQA and DBBM requirements when these are configured with 8 queues. Note, however, that for this switch radix OBQA with 4 queues always equals VOQsw performance, so in this case the OBQA requirements are actually half the VOQsw ones. Finally, VOQnet needs a vast amount of memory storage, as it demands as many queues as destinations in the network, and its required area is impractical in real implementations.

5 Conclusions

Currently, one of the most popular high-performance interconnection network topologies is the fat-tree, whose nice properties have favored its use in many clusters and massively parallel processors. One of these properties is the high communication bandwidth offered with minimum hardware, making it a cost-effective topology. The special connection pattern of the fat-tree has been exploited by a deterministic routing algorithm which achieves the same performance as adaptive routing in fat-trees, while being simpler and thus cost-effective. However, the performance of that deterministic routing algorithm is spoiled by Head-of-Line blocking when this phenomenon appears due to high traffic loads and/or hot-spot scenarios.

The HOL blocking elimination technique proposed, described and evaluated in this paper solves this problem, keeping the good performance of the aforementioned deterministic routing algorithm even in adverse scenarios. Our proposal is called Output-Based Queue Assignment (OBQA), and it is based on using a reduced set of queues at each switch port, and on mapping incoming packets to queues in line with the routing algorithm. Specifically, OBQA uses a modulo-mapping function which selects


the queue in which to store a packet depending on its requested output port, thus suiting the routing algorithm. From the results shown in this paper we can conclude that a number of OBQA queues per port equal to half (or even a quarter of) the switch radix is enough for efficiently dealing with HOL blocking even in scenarios with high traffic loads and/or hot-spots, especially in networks where the switch radix is not high. In fact, OBQA outperforms previous techniques with similar queue requirements (like DBBM) and achieves similar (or only slightly worse) performance than techniques with much higher queue requirements. Furthermore, OBQA significantly improves the performance achieved by the deterministic routing algorithm when no HOL blocking elimination technique is used, especially in hot-spot scenarios. As this is accomplished without requiring many resources, OBQA can be considered a cost-effective solution to significantly reduce HOL blocking in fat-trees.

Acknowledgements. This work is jointly supported by the MEC, MICINN and European Commission under projects Consolider Ingenio-2010 CSD2006-00046 and TIN2009-14475-C04, and by the JCCM under projects PCC08-0078 (PhD grant A08/048) and POII10-0289-3724. We are grateful to J. Flich, C. Gomez, F. Gilabert, M. E. Gomez and P. Lopez, from the Department of Computer Engineering, Technical University of Valencia, Spain, for their generous support of this work.

References

1. Myrinet 2000, Series Networking (2000), http://www.cspi.com/multicomputer/products/ 2000 series networking/2000 networking.htm
2. InfiniBand Trade Association: InfiniBand Architecture, Specification Volume 1, Release 1.0, http://www.infinibandta.com/
3. Top 500 List, http://www.top500.org
4. Leiserson, C.E.: Fat-trees: universal networks for hardware-efficient supercomputing. IEEE Transactions on Computers 34(10), 892–901 (1985)
5. Earth Simulator, http://www.jamstec.go.jp/es/en/index.html
6. Gomez, C., Gilabert, F., Gomez, M., Lopez, P., Duato, J.: Deterministic versus adaptive routing in fat-trees. In: Workshop CAC (IPDPS 2007), p. 235 (March 2007)
7. Gomez, M.E., Lopez, P., Duato, J.: A memory-effective routing strategy for regular interconnection networks. In: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2005), p. 41.2 (April 2005)
8. Karol, M.J., Hluchyj, M.G., Morgan, S.P.: Input versus output queueing on a space-division packet switch. IEEE Trans. on Commun. COM-35, 1347–1356 (1987)
9. Dally, W., Carvey, P., Dennison, L.: Architecture of the Avici terabit switch/router. In: Proc. of 6th Hot Interconnects, pp. 41–50 (1998)
10. Anderson, T., Owicki, S., Saxe, J., Thacker, C.: High-speed switch scheduling for local-area networks. ACM Transactions on Computer Systems 11(4), 319–352 (1993)
11. Tamir, Y., Frazier, G.: Dynamically-allocated multi-queue buffers for VLSI communication switches. IEEE Transactions on Computers 41(6) (June 1992)

An Efficient Strategy for Reducing Head-of-Line Blocking in Fat-Trees


12. Nachiondo, T., Flich, J., Duato, J.: Destination-based HoL blocking elimination. In: Proc. 12th ICPADS, pp. 213–222 (July 2006)
13. García, P.J., Flich, J., Duato, J., Johnson, I., Quiles, F.J., Naven, F.: Efficient, scalable congestion management for interconnection networks. IEEE Micro 26(5), 52–66 (2006)
14. Mora, G., García, P.J., Flich, J., Duato, J.: RECN-IQ: A cost-effective input-queued switch architecture with congestion management. In: Proc. ICPP (2007)
15. Escudero-Sahuquillo, J., García, P.J., Quiles, F.J., Flich, J., Duato, J.: FBICM: Efficient Congestion Management for High-Performance Networks Using Distributed Deterministic Routing. In: Sadayappan, P., Parashar, M., Badrinath, R., Prasanna, V.K. (eds.) HiPC 2008. LNCS, vol. 5374, pp. 503–517. Springer, Heidelberg (2008)
16. Leiserson, C.E., Maggs, B.M.: Communication-efficient parallel algorithms for distributed random-access machines. Algorithmica 3, 53–77 (1988)
17. Petrini, F., Vanneschi, M.: k-ary n-trees: High Performance Networks for Massively Parallel Architectures. In: Proceedings of the International Parallel Processing Symposium (1997)
18. Duato, J., Yalamanchili, S., Ni, L.: Interconnection Networks: An Engineering Approach. Morgan Kaufmann Publishers, San Francisco (2004)
19. Thoziyoor, S., Muralimanohar, N., Ahn, J.H., Jouppi, N.P.: CACTI 5.1. Technical Report HPL-2008-20, Hewlett-Packard Development Company (April 2008)

A First Approach to King Topologies for On-Chip Networks

Esteban Stafford, Jose L. Bosque, Carmen Martínez, Fernando Vallejo, Ramon Beivide, and Cristobal Camarero

Electronics and Computers Department, University of Cantabria, Faculty of Sciences, Avda. Los Castros s/n, 39006 Santander, Spain
[email protected], {joseluis.bosque,carmen.martinez,fernando.vallejo,ramon.beivide}@unican.es, [email protected]

Abstract. In this paper we propose two new topologies for on-chip networks, which we call the king mesh and the king torus. These are a higher-degree evolution of the classical mesh and torus topologies. In a king network, packets can traverse the network using orthogonal and diagonal movements, like the king on a chessboard. First, we present a topological study addressing distance properties, bisection bandwidth and path diversity, as well as a folding scheme. Second, we analyze different routing mechanisms, ranging from minimal-distance routings to misrouting techniques that exploit the topological richness of these networks. Finally, we make an exhaustive performance evaluation comparing the new king topologies with their classical counterparts. The experimental results show a performance improvement that allows us to present these new topologies as a better alternative to the classical ones.

1 Introduction

Although a lot of research on interconnection networks has been conducted in the last decades, constant technological changes demand new insights into this key component of modern computers. Nowadays, networks are critical for managing both off-chip and on-chip communications. Some recent and interesting papers advocate networks with high-radix routers for large-scale supercomputers [1][2]. The advent of economical optical signalling enables this kind of topology, which uses long global wires. Although the design scenario is very different, on-chip networks with higher degree than traditional 2D meshes or tori have also been recently explored [3]. Such networks entail the use of long wires in which repeaters and channel pipelining are needed. Nevertheless, with current VLSI technology, the planar substrate on which the network is going to be deployed suggests the use of 2D mesh-like topologies. This has been the case for Tilera [4] and Intel's Teraflop research chip [5], with 64 and 80 cores arranged in a 2D mesh, respectively. Forthcoming technologies such as on-chip high-speed signalling and optical communications could favor the use of higher-degree on-chip networks.

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 428–439, 2010. © Springer-Verlag Berlin Heidelberg 2010


In this paper, we explore an intermediate solution. We analyze networks whose degree doubles that of a traditional 2D mesh while still preserving an attractive layout for planar VLSI design. We study meshes and tori of degree eight in which a packet located in any node can travel in one hop to any of its eight neighbours, just like the king on a chessboard. For this reason, we denote these networks king meshes and king tori. In this way, we adopt a more conservative evolution towards higher-radix networks, trying to exploit their advantages while avoiding the use of long wires. The simplicity and topological properties of these networks offer tantalising features for future on-chip architectures: higher throughput, smaller latency, trivial partitioning into smaller networks, good scalability and high fault-tolerance. The use of diagonal topologies has been considered in the past, in the fields of VLSI [6], FPGA [7] and interconnection networks [8]. Mesh and toroidal topologies with added diagonals have also been considered, both with degree six [9] and eight [10]. The king lattice has been previously studied in several papers on Information Theory [11]. The goal of this paper is to explore the suitability of king topologies to constitute the communication substrate of forthcoming on-chip parallel systems. With this idea in mind, we present the foundations of king networks and a first attempt to unleash their potential. The main contributions of our research are the following: i) an in-depth analysis of the topological characteristics of king tori and king meshes; ii) the introduction and evaluation of king tori, not considered previously in the technical literature; iii) a folding scheme that ensures king tori scalability; iv) an adaptive and deadlock-free routing algorithm for king topologies; v) a first performance evaluation of king networks based on synthetic traffic. The remainder of this paper is organized as follows.
Section 2 is devoted to defining the network topologies considered in this paper. The most relevant distance parameters and the bisection bandwidth are computed for each network, and a folding method is considered for networks with wrap-around links. Section 3 tackles the task of finding routing algorithms to unlock the networks' potentially high performance, starting with simple minimum-distance algorithms and evolving to more elaborate misrouting and load-balancing techniques. Section 4 presents a first performance evaluation of these networks. Finally, Section 5 concludes the paper, highlighting its most important findings.

2 Description of the Topologies

In this section we define and analyze the distance properties of the network topologies considered in this paper: square meshes, square king meshes, square tori and square king tori. Then, we obtain expressions for significant distance parameters as well as the bisection bandwidth. Finally, we consider layout possibilities minimizing wire length for those topologies with wrap-around edges.


As usual, networks are modeled by graphs, where graph vertices represent processors and edges represent the communication links among them. In this paper we will only consider square networks, as networks with sides of different length sometimes result in an unbalanced use of the links in each dimension [12]. Therefore, in the following we will omit the adjective "square". Hence, for any of the networks considered here the number of nodes will be n = s², for any integer s > 1. By Ms we will denote the usual mesh of side s. This is a very well-known topology which has been deeply studied. A mesh-based network of degree eight can be obtained by adding new links such that any packet can travel not only in orthogonal directions but also along diagonal movements. We will denote by KMs the king mesh network, which is obtained by adding diagonal links (just for non-peripheral nodes) to Ms. Note that both networks are neither regular nor vertex-symmetric. The way to make this kind of network regular and vertex-symmetric is to add wrap-around links so that every node has the same number of neighbors. We will denote by Ts the usual torus network of side s. The torus is obviously the degree-four regular counterpart of the mesh. Then, KTs will denote the king torus network, that is, a king mesh with wrap-around links added in order to obtain a degree-eight regular network. Another way to see this network is as a torus with extra diagonal links that turn the degree-four torus into a degree-eight network. An example of each network is shown in Figure 1.

Fig. 1. Examples of Mesh, King Mesh, Torus and King Torus Networks
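To make these definitions concrete, a small sketch (helper name ours, not from the paper) enumerates the neighbours of a node in each of the four topologies and checks the degrees:

```python
def neighbors(x, y, s, king=False, torus=False):
    """Neighbours of node (x, y) in an s x s network.
    king=True adds the four diagonal links; torus=True adds
    wrap-around links (coordinates taken modulo s)."""
    ortho = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    diag = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
    steps = ortho + diag if king else ortho
    result = set()
    for dx, dy in steps:
        nx, ny = x + dx, y + dy
        if torus:
            result.add((nx % s, ny % s))
        elif 0 <= nx < s and 0 <= ny < s:
            result.add((nx, ny))
    return result

# A king torus is 8-regular; a king mesh is not, since peripheral
# nodes lose links (a corner node has degree 3).
assert len(neighbors(3, 3, 8, king=True, torus=True)) == 8
assert len(neighbors(0, 0, 8, king=True)) == 3
```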


In an ideal system, transmission delays in the network can be inferred from its topological properties. The maximum packet delay is given by the diameter of the graph, i.e., the maximum length over all minimum paths between any pair of nodes. The average delay is proportional to the average distance, which is computed as the average length of all minimum paths connecting every pair of nodes of the network. In Table 1 we record these parameters for the four networks considered. The diameter and average distance of the mesh and torus are well-known values [13]. The distance properties of the king torus were presented in [14].

Table 1. Topological Parameters

Network              Ms         KMs        Ts       KTs
Diameter             2s         s          s        ⌊s/2⌋
Average Distance     ≈ (2/3)s   ≈ (7/15)s  ≈ s/2    ≈ s/3
Bisection Bandwidth  2s         6s         4s       12s
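The torus columns of Table 1 can be checked numerically. For a torus, the minimal hop count between two nodes is the wrap-around Manhattan distance, and for a king torus the wrap-around Chebyshev distance; the following sketch (our code, not from the paper) verifies the diameter and average-distance entries for s = 8:

```python
def wrap(d, s):
    """Shortest displacement along one ring of length s."""
    d %= s
    return min(d, s - d)

def torus_dist(a, b, s):
    # Wrap-around Manhattan distance: one orthogonal link per hop.
    return wrap(a[0] - b[0], s) + wrap(a[1] - b[1], s)

def king_torus_dist(a, b, s):
    # Wrap-around Chebyshev distance: a diagonal hop advances both axes.
    return max(wrap(a[0] - b[0], s), wrap(a[1] - b[1], s))

s = 8
nodes = [(x, y) for x in range(s) for y in range(s)]
for dist, diam, avg in [(torus_dist, s, s / 2),
                        (king_torus_dist, s // 2, s / 3)]:
    ds = [dist((0, 0), n, s) for n in nodes]  # vertex-symmetric: fix a source
    assert max(ds) == diam
    assert abs(sum(ds) / len(ds) - avg) < 0.15 * avg
```

For small s the average distances are only approximately s/2 and s/3, hence the tolerance in the final check.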

An especially important metric of interconnection networks is the throughput, the maximum data rate the network can deliver. In the case of uniform traffic, that is, when nodes send packets to random nodes with uniform probability, the throughput is bounded by the bisection. According to the study in [13], in networks with homogeneous channel bandwidth, such as the ones considered here, the bisection bandwidth is proportional to the channel count across the smallest cut that divides the network into two equal halves. This value represents an upper bound on the throughput under uniform traffic. Values of the bisection for the mesh and torus are shown in Table 1, see [13]. Obtaining the bisection bandwidth of the king mesh and torus is straightforward. Note that a king network doubles the number of links of its orthogonal counterpart but has three times the bisection bandwidth. At a more technological level, the physical implementation of computer networks usually requires that the lengths of the links be similar, if not constant. In the context of networks-on-chip, mesh implementation is fairly straightforward: a regular mesh can be laid out with a single metal layer. Due to the crossing diagonal links, the king mesh requires two metal layers. Tori, however, have wrap-around links whose length depends on the size of the network. To overcome this problem, a well-known technique is graph folding. A standard folded torus can be implemented with two metal layers. Our approach to folding king tori is based on the former, but because of the diagonal links four metal layers are required. As a consequence of the folding, the length of the links in king tori is between 2 and √8. This seems to be the optimal solution for this kind of network. Figure 2 shows an 8 × 8 folded king torus. For the sake of clarity, the folded graph is shown with the orthogonal and diagonal links separated.
Now, if we compare king meshes with tori, we observe that the cost of doubling the number of links gives great returns. Bisection bandwidth is 50% larger,


Fig. 2. Folding of the King Torus Network. For the sake of clarity, the orthogonal and diagonal links are shown in separate graphs.

average distance is almost 5% less and the diameter remains the same. In addition, the implementation of a king mesh on a network-on-chip is simpler, as it does not need to be folded and fits in two metal layers, just like a folded torus.

3 Routing

This section explores different routing techniques that try to take full advantage of king networks. For simplicity, it focuses on toroidal networks, assuming that meshes will behave similarly. Our development starts with the simplest minimum-distance routing and continues through to more elaborate load-balancing schemes capable of giving high performance in both benign and adverse traffic situations. Enabling packets to reach their destination in direct networks is traditionally done with source routing. This means that at the source node, when the packet is injected, a routing record is calculated from the source and destination using a routing function. This routing record is a vector whose integer components are the number of jumps the packet must make in each dimension in order to reach its destination. In 2D networks, routing records have two components, ΔX and ΔY. These components could be used to route packets in king networks, but the diagonal links, which can be thought of as shortcuts, would never be used. It is therefore necessary to increase the number of components in the routing record to account for the greater degree of these new networks. Thus we broaden the definition of the routing record to a vector whose components are the number of jumps a packet must make in each direction, not dimension. King networks will then have four directions, namely X and Y (horizontal and vertical), Z for the diagonal y = x, and T for the diagonal y = −x.

3.1 Minimal Routing

To efficiently route packets in a king network, we need a routing function that takes source and destination nodes and gives a routing record that makes the packet reach its destination in the minimum number of jumps. Starting from the 2D routing record, it is easy to derive a naive king routing record that is minimal (Knaive). Of the four components of the routing record, this routing function will not use two. Hence, routing records will have at most two non-zero components, one orthogonal and the other diagonal. The algorithm is simple: consider (ΔX, ΔY) where ΔX > ΔY > 0. The corresponding king routing record would be (δX, δY, δZ, δT) = (ΔX − ΔY, 0, ΔY, 0). The rest of the cases are calculated in a similar fashion. In addition to being minimal, this algorithm balances the use of all directions under uniform traffic, a key aspect in order to achieve maximum throughput. The drawback, however, is that it does not exploit all the path diversity available in the network. Path diversity is defined as the number of minimal paths between a pair of nodes a, b of a network. For meshes and tori we will denote it by |Rab|:

    |Rab| = C(|ΔX| + |ΔY|, |ΔX|),

where C(n, k) denotes the binomial coefficient. Similarly, in king meshes and tori, assuming |ΔX| ≥ |ΔY|, the path diversity is

    |RKab| = Σ_{j=0}^{⌊(|ΔX|−|ΔY|)/2⌋} C(|ΔX|, j) · C(|ΔX| − j, j + |ΔY|).

Thus, the path diversity of king networks is overwhelmingly higher than that of meshes and tori. Take for example ΔX = 7, ΔY = 1; this is the routing record to go from the white box to the gray box in Figure 1. In a mesh the path diversity would be |Rab| = 8, while in a king mesh |RKab| = 357. Now, the corresponding Knaive routing record is (δX, δY, δZ, δT) = (6, 0, 1, 0). This yields only 7 alternative paths, so 350 paths are ignored; this is even fewer than in the 2D torus. This is not a problem under uniform and other benign traffic patterns, but in adverse situations a diminished performance is observed.
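Both path-diversity formulas and the Knaive record can be cross-checked with a short script (a sketch with our own function names; the king formula is stated for ΔX ≥ ΔY ≥ 0, where the minimal path length equals ΔX):

```python
from math import comb

def knaive(dx, dy):
    """Minimal king routing record (dX, dY, dZ, dT) for dx >= dy >= 0:
    dy diagonal Z-jumps plus dx - dy horizontal jumps."""
    return (dx - dy, 0, dy, 0)

def mesh_diversity(dx, dy):
    # Number of minimal orthogonal paths: choose where the x-steps go.
    return comb(dx + dy, dx)

def king_diversity(dx, dy):
    """Minimal king paths for dx >= dy >= 0: each of the dx steps moves
    +1 in x with a y-component in {-1, 0, +1}; j counts the -1 steps."""
    return sum(comb(dx, j) * comb(dx - j, j + dy)
               for j in range((dx - dy) // 2 + 1))

# The example from the text: dX = 7, dY = 1.
assert knaive(7, 1) == (6, 0, 1, 0)
assert mesh_diversity(7, 1) == 8
assert king_diversity(7, 1) == 357
```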
For instance, see the performance of a 16 × 16 king torus with 1-phit packets in Figure 3. The throughput under uniform traffic of the Knaive algorithm is 2.4 times higher than that of a standard torus, which is a good gain for the cost of doubling the network resources. However, under shuffle traffic the throughput is only double, and under other traffic patterns even less. A way of improving this is to increase the path diversity by using routing records with three non-zero components. This can be done by applying the observation that two jumps in one orthogonal direction can be replaced by a jump in Z plus one in T without altering the path's length. Based on our experiments, we have found that the best performance is obtained when using transformations similar to the following:

    (δX, 0, δZ, 0) → (⌈δX/3⌉, 0, δZ + ⌊δX/3⌋, ⌊δX/3⌋)
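A sketch of this transformation (our naming, not from the paper) reproduces the (6,0,1,0) → (2,0,3,2) row of Table 2; the diversity of a single record is the multinomial coefficient counting the orderings of its jumps:

```python
from math import factorial

def eknaive(record):
    """Trade pairs of X-jumps for one Z-jump plus one T-jump
    (a step (+2, 0) equals (+1, +1) followed by (+1, -1)), keeping
    the path length unchanged while spreading load over 3 directions."""
    dx, dy, dz, dt = record
    k = dx // 3
    return (dx - 2 * k, dy, dz + k, dt + k)

def record_diversity(record):
    """Paths realizing one record = orderings of its jumps (multinomial)."""
    n = sum(record)
    d = factorial(n)
    for c in record:
        d //= factorial(c)
    return d

assert eknaive((6, 0, 1, 0)) == (2, 0, 3, 2)
assert record_diversity((6, 0, 1, 0)) == 7      # Knaive: only 7 paths
assert record_diversity((2, 0, 3, 2)) == 210    # EKnaive: 7!/(2!·3!·2!)
```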


As this is an enhancement of the Knaive algorithm, we denote it EKnaive. It is important to note that it is still minimum-distance and gives more path diversity, although not all that is available. Continuing with our example, this algorithm gives us 210 of the total 357 paths (see Table 2). As can be seen in Figure 3, the EKnaive routing record improves the throughput in some adverse traffic patterns due to its larger path diversity. However, this comes at a cost: the inherent balance in link utilization of the Knaive algorithm is lost, thus giving worse performance under uniform traffic.

Table 2. Alternative routing records for (6,0,1,0) with corresponding path diversity

Routing Record (δX, δY, δZ, δT)   Path Diversity
(6,0,1,0)                         7
(4,0,2,1)                         105
(2,0,3,2)                         210
(0,0,4,3)                         35
theoretical                       357

[Figure 3 shows two plots for a 16 × 16 king torus, under uniform and shuffle traffic: throughput (phits/cycle/node) versus offered load (phits/cycle/node), for the routings 2D torus, Knaive, EKnaive, Kmiss and KBugal.]

Fig. 3. Throughput comparison of the various routing algorithms in 16 × 16 toroidal networks

3.2 Misrouting

In the light of the previous experience, we find that direction balancing is key. But is it important enough to relax the minimum-distance requirement? In response to this question, we have developed a new routing function whose routing records may have four non-zero components. Forcing packets to use all directions will cause misrouting, as minimum paths will no longer be used. Thus we name this approach Kmiss. Ideally, to achieve direction balance, the four components would be as close as possible. However, this would cause path lengths to be unreasonably long, so a compromise must be reached between path length and component similarity. With Kmiss, the routing record is extracted from a table indexed by the 2D


routing record. The table is constructed so that the components of the routing records do not differ by more than 3. The new function improves the load balance regardless of the traffic pattern and provides packets with more means to avoid local congestion. In addition, it increases the path diversity. Experimental results such as those shown in Section 4 show that this algorithm gives improved throughput under adverse traffic patterns, but the misrouting diminishes its performance in benign situations. Figure 3 shows that Kmiss is still poor under uniform traffic, but gives the highest throughput under shuffle.

3.3 Routing Composition

In essence, we have a collection of routing algorithms. Some are very good under benign traffic but perform badly under adverse traffic, while others are reasonably good in the latter but disappointing in the former. Ideally, we would like to choose which algorithm to use depending on the situation; better yet, the network would switch from one to another by itself. This is achieved to a certain extent in Universal Globally Adaptive Load-balancing (UGAL) [15]. In a nutshell, what this algorithm does is routing-algorithm composition: based on local traffic information, each node decides whether a packet is sent using a minimal routing or the non-minimal Valiant's routing [16], composing a better algorithm that should have the benefits of both of the simple ones. KBugal is an adaptation of UGAL to king networks and bubble routing, with two major improvements. On one hand, for the non-minimal routing we use Kmiss routing instead of Valiant's algorithm. This approach takes advantage of the topology's path diversity without significantly increasing latency, and it has a simpler implementation. On the other hand, the philosophy behind UGAL resides in estimating the transmission time of a packet at the source node based on local information, selecting the shortest output queue among all profitable channels for both the minimal and the non-minimal routings. In the best scenario, the performance of KBugal is the best of the two individual algorithms, as can be seen in Figure 3. The use of bubble routing allows deadlock-free operation with only two virtual channels per physical channel, in contrast to the three used by the original UGAL. In order to get a better estimation, KBugal takes into account the occupation of both virtual channels together for each profitable physical channel. The reason behind this is fairly simple.
Considering that all virtual channels share the same physical channel, the latency is determined by the occupation of all of them, not only of the one the packet is injected into.
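The source-side decision can be sketched in UGAL style (a simplified sketch with hypothetical names and a zero threshold; the estimate weighs queue occupancy by hop count, and KBugal additionally sums the occupancy of the two virtual channels sharing each physical channel):

```python
def ugal_choice(q_min, hops_min, q_nonmin, hops_nonmin, threshold=0):
    """Pick the minimal route unless the non-minimal one looks cheaper:
    compare queue occupancy weighted by hop count, a local estimate
    of the total delivery time of the packet."""
    if q_min * hops_min <= q_nonmin * hops_nonmin + threshold:
        return "minimal"
    return "nonminimal"

def kbugal_queue_estimate(vc_occupancies):
    """KBugal counts both virtual channels of a physical channel,
    since they share the same physical link bandwidth."""
    return sum(vc_occupancies)

# Lightly loaded network: the minimal (Knaive-like) route wins.
assert ugal_choice(q_min=2, hops_min=7, q_nonmin=2, hops_nonmin=9) == "minimal"
# Congested minimal path: divert to the non-minimal (Kmiss) route.
assert ugal_choice(q_min=20, hops_min=7, q_nonmin=2, hops_nonmin=9) == "nonminimal"
assert kbugal_queue_estimate([3, 5]) == 8
```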

4 Evaluation

In this section we present the experimental evaluation carried out to verify the improved performance and scalability of the proposed networks. This is done by comparing them with other networks usually considered for future network-on-chip


[Figure 4 shows two plots for 8 × 8 networks: throughput and latency (cycles) versus offered load (phits/cycle/node), for the topologies king mesh, king torus, mesh and torus.]

Fig. 4. Throughput and latency of king topologies with Knaive compared to mesh and tori under uniform traffic

architectures, such as the mesh and torus of size 8 × 8. The same study was made with 16 × 16 networks, but due to their similarity to the 8 × 8 results and the lack of space, those results are not shown. All the experiments have been done on a functional simulator called fsin [17]. The router model is based on the bubble adaptive router presented in [18], with two virtual channels. As we will be comparing networks of different degree, a constant buffer space is assigned to each router and divided among all its individual buffers. Another important factor in the evaluation of networks is the traffic pattern. The evaluation has been performed with synthetic workloads using typical traffic patterns. According to their effect on load balance, traffic patterns can be classified into benign and adverse. The former naturally balance the use of network resources, like uniform or local traffic, while the latter introduce contention and hotspots that reduce performance, as in complement or butterfly. Due to space limitations, only the results for three traffic patterns are shown, as they represent the behaviour observed with the rest. These are uniform, bit-complement and butterfly. Figure 4 shows the throughput and latency of king networks using Knaive compared to those of 2D tori and meshes. It shows that, thanks to their increased degree, king networks outperform their baseline counterparts by more than a factor of two. The average latency at zero load is reduced in accordance with the theoretical average-distance values. Packets are 16 phits long, which makes the latency improvement less obvious in the graphs. Observe that king meshes have significantly better performance than 2D tori, both in throughput and latency. Figure 5 presents an analysis of the different routing techniques under the three traffic patterns for 8 × 8 king tori and meshes. Comparing the results of networks of different sizes highlights that the throughput per node is halved.
This is due to the well-known fact that the number of nodes in square networks grows quadratically with the side while the bisection bandwidth grows only linearly. For benign traffic patterns, the best results are given by Knaive routing. However, under adverse traffic a sensible decrease in performance is observed, caused by its reduced path diversity. As mentioned in Section 3, this limitation is overcome

[Figure 5 shows, for an 8 × 8 king mesh and an 8 × 8 king torus, latency (cycles) and throughput (phits/cycle/node) versus offered load under uniform, complement and butterfly traffic, for the routings Knaive, Kmiss, UGAL and KBugal.]

Fig. 5. Throughput and latency of routings on 8×8 king meshes and tori under different traffic patterns


by the Kmiss routing. Indeed, this routing yields poor performance under the benign traffic pattern but very good performance under the adverse ones. Our composite routing algorithm KBugal gives the best average performance over all traffic patterns. In benign situations its throughput is slightly below Knaive's, and under adverse traffic its performance is similar to that of Kmiss routing, being even better in some situations. The results also show that KBugal performs better than its more generic predecessor UGAL: under benign traffic an improvement of 15% is obtained, and under adverse traffic between 10% (complement) and 90% (butterfly).
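The halving of per-node throughput with the side length can be made explicit with the bisection bound: under uniform traffic roughly half of all injected traffic must cross the bisection, so per-node throughput is at most 2B/n. A quick check with the Table 1 bisection of the king torus (a sketch assuming one phit per cycle per link):

```python
def uniform_throughput_bound(s: int, bisection_links: int,
                             phits_per_cycle: float = 1.0) -> float:
    """Under uniform traffic half the packets cross the bisection:
    n * T / 2 <= B  =>  T <= 2 * B / n, with n = s * s nodes."""
    n = s * s
    return 2 * bisection_links * phits_per_cycle / n

# King torus bisection is 12s links (Table 1): doubling the side halves
# the per-node bound, matching the halving observed in the simulations.
print(uniform_throughput_bound(8, 12 * 8))    # -> 3.0
print(uniform_throughput_bound(16, 12 * 16))  # -> 1.5
```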

5 Conclusion

In this paper we have presented the foundations of king networks. Their topological properties offer tantalising possibilities, positioning them as clear candidates for future network-on-chip systems. Noteworthy are king meshes, which have the implementation simplicity and wire length of a mesh yet better performance than 2D tori. In addition, we have presented a series of routing techniques specific to king networks, both adaptive and deadlock-free, which allow their topological richness to be exploited. A first performance evaluation of these algorithms based on synthetic traffic has been presented, in which their properties are highlighted. Further study will be required to take full advantage of these novel topologies, which promise higher throughput, smaller latency, trivial partitioning and high fault-tolerance.

Acknowledgment. This work has been partially funded by the Spanish Ministry of Education and Science (grant TIN2007-68023-C02-01 and Consolider CSD2007-00050), as well as by the HiPEAC European Network of Excellence.

References

1. Kim, J., Dally, W., Scott, S., Abts, D.: Technology-driven, highly-scalable dragonfly topology. SIGARCH Comput. Archit. News 36(3), 77–88 (2008)
2. Scott, S., Abts, D., Kim, J., Dally, W.: The BlackWidow high-radix Clos network. SIGARCH Comput. Archit. News 34(2), 16–28 (2006)
3. Kim, J., Balfour, J., Dally, W.: Flattened butterfly topology for on-chip networks. In: MICRO 2007: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 172–182. IEEE Computer Society, Washington (2007)
4. Wentzlaff, D., Griffin, P., Hoffmann, H., Bao, L., Edwards, B., Ramey, C., Mattina, M., Miao, C.C., Brown III, J.F., Agarwal, A.: On-chip interconnection architecture of the Tile processor. IEEE Micro 27, 15–31 (2007)
5. Vangal, S., Howard, J., Ruhl, G., Dighe, S., Wilson, H., Tschanz, J., Finan, D., Singh, A., Jacob, T., Jain, S., Erraguntla, V., Roberts, C., Hoskote, Y., Borkar, N., Borkar, S.: An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS. IEEE Journal of Solid-State Circuits 43(1), 29–41 (2008)


6. Igarashi, M., Mitsuhashi, T., Le, A., Kazi, S., Lin, Y., Fujimura, A., Teig, S.: A diagonal interconnect architecture and its application to RISC core design. IEIC Technical Report (Institute of Electronics, Information and Communication Engineers) 102(72), 19–23 (2002)
7. Marshall, A., Stansfield, T., Kostarnov, I., Vuillemin, J., Hutchings, B.: A reconfigurable arithmetic array for multimedia applications. In: FPGA 1999: Proceedings of the 1999 ACM/SIGDA Seventh International Symposium on Field Programmable Gate Arrays, pp. 135–143. ACM, New York (1999)
8. Tang, K., Padubidri, S.: Diagonal and toroidal mesh networks. IEEE Transactions on Computers 43(7), 815–826 (1994)
9. Shin, K., Dykema, G.: A distributed I/O architecture for HARTS. In: Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 332–342 (1990)
10. Hu, W., Lee, S., Bagherzadeh, N.: DMesh: a diagonally-linked mesh network-on-chip architecture. NoCArc (2008)
11. Honkala, I., Laihonen, T.: Codes for identification in the king lattice. Graphs and Combinatorics 19(4), 505–516 (2003)
12. Camara, J., Moreto, M., Vallejo, E., Beivide, R., Miguel-Alonso, J., Martinez, C., Navaridas, J.: Twisted torus topologies for enhanced interconnection networks. IEEE Transactions on Parallel and Distributed Systems 99 (2010) (PrePrints)
13. Dally, W., Towles, B.: Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco (2003)
14. Martinez, C., Stafford, E., Beivide, R., Camarero, C., Vallejo, F., Gabidulin, E.: Graph-based metrics over QAM constellations. In: IEEE International Symposium on Information Theory, ISIT 2008, pp. 2494–2498 (2008)
15. Singh, A.: Load-Balanced Routing in Interconnection Networks. PhD thesis (2005)
16. Valiant, L.: A scheme for fast parallel communication. SIAM Journal on Computing 11(2), 350–361 (1982)
17. Ridruejo Perez, F., Miguel-Alonso, J.: INSEE: An interconnection network simulation and evaluation environment (2005)
18. Puente, V., Izu, C., Beivide, R., Gregorio, J., Vallejo, F., Prellezo, J.: The adaptive bubble router. J. Parallel Distrib. Comput. 61(9), 1180–1208 (2001)

Optimizing Matrix Transpose on Torus Interconnects

Venkatesan T. Chakaravarthy, Nikhil Jain, and Yogish Sabharwal

IBM Research - India, New Delhi
{vechakra,nikhil.jain,ysabharwal}@in.ibm.com

Abstract. Matrix transpose is a fundamental matrix operation that arises in many scientific and engineering applications. Communication is the main bottleneck in performing matrix transpose on most multiprocessor systems. In this paper, we focus on torus interconnection networks and propose application-level routing techniques that improve load balancing, resulting in better performance. Our basic idea is to route the data via carefully selected intermediate nodes. However, directly employing this technique may worsen the congestion. We overcome this issue by employing the routing only for a selected set of communicating pairs. We implement our optimizations on the Blue Gene/P supercomputer and demonstrate up to 35% improvement in performance.

1 Introduction

Matrix transpose is a fundamental matrix operation that arises in many scientific and engineering applications. On a distributed multi-processor system, the matrix is distributed over the processors in the system, and performing the transpose involves communication amongst these processors. On most interconnects with sub-linear bisection bandwidth, communication is the primary bottleneck for the transpose. Matrix transpose is also included in the HPC Challenge benchmark suite [8], a suite for evaluating the performance of high performance computers. The HPCC matrix transpose benchmark mainly aims at evaluating the interconnection network of the distributed system.

We shall be interested in optimizing matrix transpose on torus interconnects. Torus interconnects are attractive interconnection architectures for distributed-memory supercomputers. They are more scalable than competing architectures such as the hypercube, and therefore many modern supercomputers such as the IBM Blue Gene and Cray XT are based on these interconnects. We will be mainly interested in the case of asymmetric torus networks. Recall that a torus network is said to be symmetric if all the dimensions are of the same length, and asymmetric otherwise. Notice that growing a torus from one symmetric configuration to a larger one requires adding nodes along all the dimensions, that is, a large number of nodes. It is therefore of considerable interest to study asymmetric configurations, as they are often realized in practice.

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 440–451, 2010. © Springer-Verlag Berlin Heidelberg 2010


Fig. 1. Illustration of SDR technique and routing algorithms

Adaptive routing is the most common routing scheme supported by torus networks at the hardware level. In this scheme, the data may take any of the shortest paths from the source to the destination; the exact path is determined dynamically based on network parameters such as congestion on the links. Dynamically choosing paths leads to better load balancing. The goal of this paper is to propose optimizations for matrix transpose on asymmetric torus networks when the underlying hardware-level routing scheme is adaptive routing.

A standard approach for performing matrix transpose (also used in the widely used linear algebra library ScaLAPACK [2]) is to split the processing into multiple phases. Each phase involves a permutation communication, wherein every node sends data to exactly one node and receives data from exactly one node. In other words, the communication happens according to a permutation π : P → P and each node u sends data to the node π(u), where P is the set of all nodes in the system. (If π(u) = u, then the node does not participate in the communication.) Our optimization is based on an application-level routing scheme that we call Short Dimension Routing (SDR), which reduces congestion for permutation communications on asymmetric torus interconnects. Apart from matrix transpose, permutation communication patterns also arise frequently in other HPC applications, such as binary-exchange based fast Fourier transforms [6] and recursive doubling for MPI collectives in MPICH [11]. Our techniques are therefore also useful in other applications that involve permutation communication.

The SDR technique is based on a simple but useful observation: on asymmetric torus networks, the links lying along the shorter dimensions are typically less loaded than the links lying along the longer dimensions. To see this in a quantitative manner, consider random permutation communication patterns on a 2-dimensional torus of size NX × NY.
We show that the expected load on any X-link (namely, a link lying along the X-dimension) is NX/8 and the expected load on any Y-link is NY/8. Thus, in the case of an asymmetric torus, say a torus with NX ≥ 2NY, the expected load on the X-links is twice that of the Y-links. The SDR technique exploits this fact and works as follows. Instead of sending data directly from sources to destinations, we route the data through intermediate nodes. Namely, for each source-destination pair, the data is first sent from the source to the intermediate node and then from the intermediate node to the destination. In both steps, communication takes place using the adaptive routing supported at the hardware level. The crucial aspect of the SDR technique is its choice of the intermediate node. We restrict the selection of the intermediate node to be from one amongst those that can


be reached by traversing only along the shorter dimensions from the source. To better illustrate the idea, consider the case of a 2-dimensional asymmetric torus of size NX × NY with NX ≥ 2NY. In this case, we will choose one of the nodes lying on the same column as the source node as the intermediate node. Intuitively, the above process improves the load balance of the X-links, while increasing the load on the Y-links. The increased load on the Y-links is not an issue, since to begin with they are less loaded than the X-links. Overall, the routing leads to load balancing among the X-links, without making the Y-links a potential bottleneck. This reduces congestion, resulting in better performance. The idea is illustrated in Figure 1. The figure shows a 2-dimensional torus of size NX = 8 and NY = 4 (the wrap-around links are omitted for the sake of clarity). For a communication with S as the source node, one of the black nodes will be used as the intermediate node.

The next important aspect of the SDR technique is the exact choice of the intermediate nodes. Consider a 2-dimensional torus of size NX × NY, with NX ≥ NY. Consider a communication from node u1 = ⟨x1, y1⟩ to node u2 = ⟨x2, y2⟩. Our strategy is to route this data through an intermediate node u′ = ⟨x1, y′⟩ where y′ = (y2 + NY/2) mod NY. Intuitively, this choice of u′ provides the adaptive routing scheme the maximum number of X-links to choose from when sending data from u′ to u2. Therefore, our algorithm leads to better load balancing on the X-links. Figure 1 illustrates the idea: for a communication with S as the source node, the node labeled X will serve as the intermediate node. However, when the torus is nearly symmetric (say NX ≤ 2NY), this particular choice of intermediate node may significantly increase the load on the Y-links, possibly making the Y-links the bottleneck. To overcome this hurdle, we employ the idea of selective routing.
In this scheme, the intermediate-node routing is performed only for a carefully selected subset of the source-destination pairs. To summarize, we carefully choose a subset of source-destination pairs and perform application-level routing for these pairs via carefully chosen intermediate nodes. We then generalize the above ideas to higher dimensions. We implement our optimizations on the Blue Gene/P supercomputer and demonstrate up to 35% speedup in performance for matrix transpose. Our experiments show that both the choice of the intermediate nodes and the idea of selective routing are critical in achieving the performance gain.

Related Work: Our work focuses on matrix transpose on asymmetric torus networks. The transpose problem has been well studied on other interconnection architectures [7,4,1,5]. The basic idea of intermediate-node routing goes back to the work of Valiant [12]. Valiant considers permutation communications on hypercubes and shows that routing data via randomly chosen intermediate nodes helps in load balancing. Subsequently, oblivious routing algorithms have been studied extensively. In these works, the entire routing path for each source-destination pair is fully determined by the algorithm; they also consider arbitrary network topologies given as input in the form of graphs. We refer to the survey by Räcke [9] for a discussion of this topic. In our case, we


focus on asymmetric torus networks, where the underlying hardware supports adaptive routing. Our goal is to use minimal application-level routing on top of the hardware-level adaptive routing and obtain load balancing. We achieve this by choosing a subset of source-destination pairs and routing the data for these pairs via carefully chosen intermediate nodes. We note that the routing from the source to the intermediate node and from the intermediate node to the destination is performed using the underlying hardware-level adaptive routing scheme. Our work shows that for permutation communication on asymmetric torus networks, the adaptive routing scheme can be improved.

2 Matrix Transpose Overview

In this section, we give a brief overview of performing matrix transpose in parallel. We first describe the underlying communication pattern, which specifies the interactions amongst the processors in the system. Then, we discuss a popular multi-phase algorithm [3] that has been incorporated in the widely used ScaLAPACK [2] library. This algorithm is also used in the HPC Challenge Transpose benchmark [8].

In parallel numerical linear algebra software, matrices are typically distributed in a block-cyclic manner [2]. Such a layout allows operations to be performed on submatrices instead of individual matrix elements, in order to take advantage of the hierarchical memory architecture. The processors in a parallel system are logically arranged in a P × Q grid. A matrix A of size N × M is divided into smaller blocks of size b × b which are distributed over the processor grid. For simplicity, we assume that b divides M and N, so the matrix can be viewed as an array of M/b × N/b blocks. Let R_{p,q} denote the processor occupying position ⟨p, q⟩ on the processor grid, where 0 ≤ p < P and 0 ≤ q < Q. Further, let B_{r,c} denote the block in position ⟨r, c⟩ of the matrix, where 0 ≤ r < M/b and 0 ≤ c < N/b. In the block-cyclic layout, the block B_{r,c} resides on the processor R_{r%P, c%Q}, where % denotes the mod operator. In parallel matrix transpose, each block B_{r,c} must be sent to the processor containing the block B_{c,r}. This requires communication between the processors R_{r%P, c%Q} and R_{c%P, r%Q}.

We now discuss the communication pattern for parallel matrix transpose. Let L denote the LCM of P and Q, and let G denote the GCD of P and Q. Using the generalized Chinese remainder theorem, it can be shown that a processor R_{p,q} communicates with a processor R_{p′,q′} iff p ≡ q′ (mod G) and q ≡ p′ (mod G). Based on this fact, it can be shown that any processor communicates with exactly (L/P) × (L/Q) processors. The multi-phase algorithm [3] works in (L/P) × (L/Q) phases.
In each phase, every processor sends data to exactly one processor and similarly receives data from exactly one processor. The communication for processor Rp,q is described in Figure 2. The important aspect to note is that the communication in each phase is a permutation communication.
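The block-cyclic ownership and transpose pairing described above can be sketched as follows (our illustrative Python, not the authors' code; the function names are ours):

```python
# Illustrative sketch (not the authors' code) of the block-cyclic layout
# and the transpose communication pattern described above.
from math import gcd

def owner(r, c, P, Q):
    """Grid position of the processor holding block B_{r,c}."""
    return (r % P, c % Q)

def transpose_partner(r, c, P, Q):
    """Processor that must receive block B_{r,c}: the owner of B_{c,r}."""
    return owner(c, r, P, Q)

# Sanity check of the counting argument: on a 4 x 6 grid (G = 2, L = 12),
# every processor communicates with exactly (L/P) * (L/Q) = 6 processors.
P, Q = 4, 6
L = P * Q // gcd(P, Q)
partners = {(p, q): set() for p in range(P) for q in range(Q)}
for r in range(L):
    for c in range(L):
        partners[owner(r, c, P, Q)].add(transpose_partner(r, c, P, Q))
assert all(len(s) == (L // P) * (L // Q) for s in partners.values())
```

Enumerating block indices up to L in each dimension suffices, since the owner and partner of a block depend only on its indices modulo P and Q.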


g = (q − p) % G
p̃ = (p + g) % P,  q̃ = (q − g) % Q
For j = 0 to L/P − 1
  For i = 0 to L/Q − 1
    p1 = (p̃ + i · G) % P
    q1 = (q̃ − j · G) % Q
    Send data to R_{p1,q1}
    p2 = (p̃ − i · G) % P
    q2 = (q̃ + j · G) % Q
    Receive data from R_{p2,q2}

Fig. 2. Communication for processor R_{p,q} in the multi-phase transpose algorithm
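A direct transcription of the send side of this schedule (our illustrative Python, not the authors' implementation; receive sources follow analogously) makes it easy to check that each phase is indeed a permutation communication:

```python
# Sketch of the send side of the schedule in Fig. 2 (illustrative code).
from math import gcd

def send_schedule(p, q, P, Q):
    """Yield, phase by phase, the processor R_{p1,q1} to which R_{p,q} sends."""
    G = gcd(P, Q)
    L = P * Q // G
    g = (q - p) % G
    pt, qt = (p + g) % P, (q - g) % Q  # p-tilde, q-tilde
    for j in range(L // P):
        for i in range(L // Q):
            yield ((pt + i * G) % P, (qt - j * G) % Q)

# In every phase, the send targets taken over all processors cover the
# whole processor grid: each phase is a permutation communication.
P, Q = 4, 6
L = P * Q // gcd(P, Q)
phase_targets = [set() for _ in range((L // P) * (L // Q))]
for p in range(P):
    for q in range(Q):
        for k, dst in enumerate(send_schedule(p, q, P, Q)):
            phase_targets[k].add(dst)
assert all(len(t) == P * Q for t in phase_targets)
```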

3 Load Imbalance on Asymmetric Torus Networks

As mentioned in the introduction, the shorter dimension routing technique is based on the observation that typical permutation communications on an asymmetric torus place a higher load on links of the longer dimensions than on links of the shorter dimensions. To demonstrate this observation quantitatively, we here consider random permutation communications on a 2-dimensional torus and show that the expected load on longer-dimension links is higher than that on shorter-dimension links.

Consider a torus of size NX × NY and a random permutation π. We assume that the underlying routing scheme is adaptive routing. We will show that the expected load on any X-link is NX/8 and, similarly, the expected load on any Y-link is NY/8. For a link e, let Le be the random variable denoting the load on the link e.

Theorem 1. For any X-link e, E[Le] = NX/8. Similarly, for any Y-link e, E[Le] = NY/8.

Proof. Let LX be the random variable denoting the sum of the load over all the X-links. Consider any node u = ⟨x1, y1⟩. Every node is equiprobable to be the destination π(u). Since we are dealing with a torus network, π(u) is equally likely to be located to the left of u or to the right of u. In either case, the packet traverses 0 to NX/2 X-links with equal probability. Therefore the expected number of X-links traversed by the packet is NX/4. Therefore,

    E[LX] = Σ_{u ∈ P} E[number of X-links traversed by the packet from u to π(u)]
          = Σ_{u ∈ P} (NX/4) = |P| · (NX/4),

where |P| = NX · NY is the number of nodes. The total number of X-links is 2|P| (considering bidirectionality of links) and no link is special; thus, the expected load is the same on every X-link. It follows that

    E[Le] = |P| · (NX/4) / (2|P|) = NX/8.

The case of Y-links is proved in a similar manner. □
The above theorem shows that in the case of an asymmetric torus, the links of different dimensions have different expected loads. For instance, if NX ≥ 2NY, the expected load on the X-links is twice that of the Y-links. Based on this observation, our application-level routing algorithms try to balance the load on the X-links by increasing the load on the Y-links, while ensuring that the Y-links do not become the bottleneck.
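Theorem 1 is easy to corroborate numerically. The sketch below (our code, illustrative) draws random permutations on a small torus and estimates the average per-link X load; with minimal routing, every packet crosses exactly DX(u, π(u)) X-links regardless of which shortest path the adaptive router picks.

```python
# Monte Carlo check of Theorem 1 (our illustrative code).
import random

def dx(x1, x2, NX):
    """Number of X-links crossed between columns x1 and x2 on the torus."""
    return min((x1 - x2) % NX, (x2 - x1) % NX)

def avg_xlink_load(NX, NY, trials=200, rng=None):
    """Average load per (directed) X-link under random permutations."""
    rng = rng or random.Random(1)
    nodes = [(x, y) for x in range(NX) for y in range(NY)]
    total = 0
    for _ in range(trials):
        perm = nodes[:]
        rng.shuffle(perm)
        total += sum(dx(u[0], v[0], NX) for u, v in zip(nodes, perm))
    return total / (trials * 2 * NX * NY)  # 2*NX*NY directed X-links

# Theorem 1 predicts NX/8 = 2.0 for a 16 x 4 torus.
assert abs(avg_xlink_load(16, 4) - 16 / 8) < 0.1
```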

4 Optimizing Permutation Communications

In this section, we discuss our heuristic application-level routing algorithm for optimizing permutation communications. As discussed earlier, matrix transpose is typically implemented in multiple phases, where each phase involves a permutation communication; as a consequence, we obtain improved performance for matrix transpose. Our application-level algorithm is based on two key ideas: the basic SDR technique and selective routing.

4.1 Basic Short Dimension Routing (SDR)

Consider a two-dimensional asymmetric torus of size NX × NY. Without loss of generality, assume that NX > NY. As discussed in Section 3, for typical permutation communication patterns, the X-links are expected to be more heavily loaded than the Y-links. We exploit this fact and design a heuristic that achieves better load balancing. Namely, for sending a packet from a source u1 to a destination u2, we will choose a suitable intermediate node u′ lying on the same column as u1. This ensures that the data from u1 to u′ traverses only Y-links. The choice of u′ is discussed next.

Consider a communication from a node u1 = ⟨x1, y1⟩ to a node u2 = ⟨x2, y2⟩. Recall that in adaptive routing, a packet from u1 to u2 may take any of the shortest paths from u1 to u2. In other words, the packet may traverse any of the links inside the smallest rectangle defined by taking u1 and u2 as the diagonally opposite corners. Let DX(u1, u2) denote the number of X-links crossed by any packet from u1 to u2. It is given by DX(u1, u2) = min{(x1 − x2) mod NX, (x2 − x1) mod NX}. Similarly, let DY(u1, u2) = min{(y1 − y2) mod NY, (y2 − y1) mod NY}. In adaptive routing, the load balancing on the X-links is proportional to DY(u1, u2), since a packet has as many as DY(u1, u2) choices of X-links to choose from. Similarly, the load balancing on the Y-links is proportional to DX(u1, u2). Therefore, maximum load balancing on the X-links is achieved when we route the packets from u1 to u2 through the intermediate node u′ = ⟨x1, (y2 + NY/2) mod NY⟩. Note that the packets from u1 to u′ traverse only Y-links and do not put extra load on the X-links. Among the nodes that have this property, u′ is the node having the maximum DY value with respect


to u2. Thus, the choice of u′ offers maximum load balance in the second phase, when packets are sent from the intermediate node to the destination.

An important issue with the above routing algorithm is that it may overload the Y-links, resulting in the Y-links becoming the bottleneck. This would happen when the torus is nearly symmetric (for instance, NX ≤ 2NY). To demonstrate this issue, we present an analysis of the basic SDR-based algorithm.

Analysis. Consider a random permutation π. The following lemma derives the expected load on the X-links and the Y-links for the basic SDR algorithm. For a link e, let Le be the random variable denoting the load on the link e.

Lemma 1. For an X-link e, the expected load is E[Le] = NX/8, and for a Y-link e, the expected load is E[Le] = 3NY/8.

Proof. Let us split the overall communication into two phases: sending packets from every node to the intermediate nodes, and sending packets from the intermediate nodes to the final destinations. Let us first derive the expected load for the case of X-links. In the first phase of communication, packets cross only Y-links and do not cross any X-link, so we need to consider only the second phase of communication. As in the case of Theorem 1, we have that the expected load is E[Le] = NX/8 for any X-link e.

Now consider the case of Y-links. For a Y-link e, let Le,1 be the random variable denoting the number of packets crossing the link during the first phase. For two numbers 0 ≤ y, j < NY, let y ⊕ j denote (y + j) mod NY; similarly, let y ⊖ j denote (y − j) mod NY. The Y-links come in two types: (i) links from a node ⟨x, a⟩ to the node ⟨x, a ⊕ 1⟩; (ii) links from a node ⟨x, a⟩ to the node ⟨x, a ⊖ 1⟩. Consider a link e of the first type. (The analysis for the second type of Y-links is similar.) The set of nodes that can potentially use the link e is given by {⟨x, a ⊖ i⟩ : 0 ≤ i ≤ NY/2}. Let ui denote the node ⟨x, a ⊖ i⟩ and let the intermediate node used by ui be ρ(ui) = ⟨x, yi⟩. The packet from ui to ρ(ui) will cross the link e if yi ∈ {a ⊕ 1, a ⊕ 2, ..., a ⊕ (NY/2 − i)}. Since π(ui) is random, yi is also random. Therefore, we have that

    Pr[the packet from ui crosses e in the first phase] = ((NY/2) − i) / NY.

The expected number of packets that cross the link e in the first phase is

    E[Le,1] = Σ_{i=0}^{NY/2} ((NY/2) − i) / NY = NY/8.

Hence, the expected load is E[Le,1] = NY/8. In the second phase of the communication, every packet traverses NY/2 Y-links. As there are NX · NY communicating pairs, the total load on the Y-links is NX · NY² / 2. The number of Y-links is 2 · NX · NY, so the expected load placed in the second phase on each Y-link e is NY/4. Thus, for a Y-link e, the expected combined load on the link e is E[Le] = NY/8 + NY/4 = 3NY/8. □


This shows that our routing algorithm will perform well on a highly asymmetric torus (i.e., when NX ≥ 3NY). However, when the torus is not highly asymmetric (for instance, when NX ≤ 2NY), the Y-links become the bottleneck.
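The basic SDR choice of intermediate node can be captured in a few lines (our illustrative sketch, not the authors' implementation; names are ours):

```python
# Basic SDR intermediate-node choice (illustrative sketch).

def d_torus(a, b, N):
    """Minimal hop distance between coordinates a and b on a ring of size N."""
    return min((a - b) % N, (b - a) % N)

def sdr_intermediate(src, dst, NY):
    """Intermediate node u' for routing src -> dst on an NX x NY torus
    (NX > NY): same column as the source, row diametrically opposite the
    destination row, which maximizes DY(u', dst)."""
    (x1, _y1), (_x2, y2) = src, dst
    return (x1, (y2 + NY // 2) % NY)

src, dst = (1, 0), (6, 3)
u = sdr_intermediate(src, dst, NY=4)
assert u[0] == src[0]                      # phase one uses only Y-links
assert d_torus(u[1], dst[1], 4) == 4 // 2  # maximal DY to the destination
```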

4.2 Selective Routing

We saw that for torus interconnects that are not highly asymmetric, the Y-links become the bottleneck under the above routing scheme. To overcome this issue for such networks, we modify our routing algorithm to achieve load balancing on the X-links without overloading the Y-links. The idea is to use application-level routing only for a chosen fraction of the communicating pairs. The chosen pairs communicate via intermediate nodes, which are selected as before; the remaining pairs communicate directly, without going via intermediate nodes. Recall that for a communicating pair u1 and u2, the load imbalance is inversely proportional to DX(u1, u2). Therefore, we select the pairs having small values of DX to communicate via intermediate nodes, and the remaining pairs are made to communicate directly. We order the communicating pairs in increasing order of their DX values and choose the first α fraction to communicate via intermediate nodes; the rest communicate directly. The fraction α is a tunable parameter, determined based on the values of NX and NY. When NX ≥ 3NY, we choose α = 1; otherwise, α is chosen based on the ratio NX/NY.
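The selection step above can be sketched as follows (our illustrative code, not the authors' implementation):

```python
# Selective routing sketch: only the alpha fraction of pairs with the
# smallest DX is routed via intermediate nodes.

def dx(x1, x2, NX):
    """Number of X-links crossed between columns x1 and x2 on the torus."""
    return min((x1 - x2) % NX, (x2 - x1) % NX)

def select_routed_pairs(pairs, NX, alpha):
    """pairs: list of ((x1, y1), (x2, y2)) source-destination pairs.
    Returns the subset to route via intermediate nodes; the rest
    communicate directly."""
    ranked = sorted(pairs, key=lambda p: dx(p[0][0], p[1][0], NX))
    return ranked[: int(alpha * len(ranked))]

pairs = [((0, 0), (4, 1)), ((0, 1), (1, 2)), ((0, 2), (3, 3)), ((0, 3), (2, 0))]
# With alpha = 0.5, the two pairs with the smallest X-distance are chosen.
assert select_routed_pairs(pairs, NX=8, alpha=0.5) == \
    [((0, 1), (1, 2)), ((0, 3), (2, 0))]
```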

4.3 Extending to Higher Dimensions

Our routing algorithm can be extended to higher-dimensional tori. We now briefly sketch the case of a 3-dimensional asymmetric torus of size NX × NY × NZ. Without loss of generality, assume that NX ≥ NY ≥ NZ. Consider communicating packets from a source node u1 = ⟨x1, y1, z1⟩ to the destination node u2 = ⟨x2, y2, z2⟩. In a three-dimensional torus, two natural choices exist for selecting the intermediate node: (i) u′ = ⟨x1, y1, (z2 + NZ/2) mod NZ⟩; (ii) u′′ = ⟨x1, (y2 + NY/2) mod NY, (z2 + NZ/2) mod NZ⟩. We need to decide which type of intermediate node to use. If we choose u′ as the intermediate node, then the packets from the source to the intermediate node traverse only Z-links. On the other hand, if we choose u′′, then the packets from the source to the intermediate node traverse both Y-links and Z-links. In case NY and NZ are close to NX, we send all the packets directly without using intermediate nodes (example: NX = NY = NZ). In case only NY is close to NX, we use the first type of intermediate nodes (example: NX = NY = 2NZ). Finally, if both NY and NZ are considerably smaller, we use the second type of intermediate nodes (example: NX = 2NY = 2NZ).

4.4 Implementation Level Optimizations

We fine-tune our algorithm further by employing the following strategies, which were devised based on experimental studies.

Chunk based exchange (CBE): The source node divides the data to be sent to the destination into β smaller chunks. This allows the intermediate routing node to forward the chunks received so far while other chunks are still being received, in a pipelined manner. Here β is a tunable parameter; our experimental evaluation suggests that β = 1024 gives the best results.

Sending to nearby destinations: For pairs that are separated by a very small distance along the longer dimension, we choose not to perform intermediate-node routing. For all pairs with distance less than γ along the longer dimension, we send the data directly. Here γ is a tunable parameter; our experimental evaluation suggests that γ = 2 gives the best results.

Handling routing node conflicts: A node x may be selected as the intermediate routing node for more than one communicating pair, resulting in extra processing load on this node. To avoid this scenario, only one communicating pair is allowed to use x as an intermediate routing node. For the other communicating pairs, we look for intermediate nodes that are at distance less than δ from x. In case all such nodes are also allocated for routing, the pair is made to communicate directly. Here δ is a tunable parameter; our experimental evaluation suggests that δ = 2 gives the best results.

Computing intermediate nodes: Note that our algorithm needs to analyze the permutation communication pattern and choose an α fraction of source-destination pairs for which intermediate-node routing is to be performed. This process is carried out on one of the nodes: the rest of the nodes send information about their destinations to this node, and it sends back information about the intermediate node to be used, if any. In case the multi-phase algorithm involves more than one phase, the above process is carried out on different nodes in parallel for the different phases.
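The chunking step of CBE can be sketched as follows (an illustrative helper with our names; real code would overlap the network transfers of successive chunks):

```python
# Chunking step of CBE (illustrative sketch): the payload is cut into at
# most beta chunks so an intermediate node can forward chunk k while
# chunk k+1 is still arriving.

def chunks(buf, beta):
    """Split buf into at most beta nearly equal chunks."""
    size = max(1, -(-len(buf) // beta))  # ceiling division
    return [buf[i:i + size] for i in range(0, len(buf), size)]

data = bytes(range(10))
parts = chunks(data, beta=4)
assert b"".join(parts) == data  # nothing lost or reordered
assert len(parts) <= 4
```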

5 Experimental Evaluation

In this section, we present an experimental evaluation of our application-level routing algorithm on the Blue Gene/P supercomputer. We first present an overview of the Blue Gene supercomputer and then discuss our results.

5.1 Blue Gene Overview

The Blue Gene/P [10] is a massively parallel supercomputer comprising quad-core nodes. The nodes themselves are physically small, allowing for very high packaging density in order to realize an optimal cost-performance ratio. The Blue Gene/P uses five interconnect networks for I/O, debug, and various types of inter-processor communication. The most significant of these interconnection networks is the three-dimensional torus, which has the highest aggregate bandwidth and handles the bulk of all communication. Each node supports six independent 850 MBps bidirectional nearest-neighbor links, with an aggregate bandwidth of 5.1 GBps. The torus network uses both dynamic (adaptive) and deterministic routing with virtual buffering and cut-through capability.

Fig. 3. Performance comparison of Base and Opt (performance in GB/s vs. number of nodes): (a) random mapping, (b) default mapping

5.2 Experimental Setup

All our experiments involve performing transpose operations on a matrix distributed over a Blue Gene system. We consider systems of sizes ranging from 32 to 2048 nodes. The torus dimensions of these systems are of the form d × d × 2d or d × 2d × 2d. The distributed matrix occupies 256–512 MB of data on each node. We consider two methods of mapping the MPI ranks to the physical nodes:

Default mapping: This is the traditional way of mapping MPI ranks to physical nodes on a torus interconnect. The ranks are allocated in a dimension-ordered manner, i.e., the coordinates of the physical node may be determined by examining contiguous bits in the binary encoding of the MPI rank. For example, the physical node for an MPI rank r may be determined using an XYZ ordering of dimensions as follows: the Z coordinate of the physical node is r mod NZ, the Y coordinate is (r/NZ) mod NY, and the X coordinate is r/(NZ · NY); here, divisions are integer divisions. The dimensions may be considered in other orders as well, for instance YXZ or ZXY.

Random mapping: In this case the MPI ranks are randomly mapped to the physical nodes. This results in a random permutation communication pattern, which is characteristic of the HPC Challenge Transpose benchmark [8], used to analyze the network performance of HPC systems.
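The default (dimension-ordered) mapping described above can be sketched as follows (our illustrative code; the function name is ours):

```python
# Sketch of the default XYZ dimension-ordered rank-to-node mapping.

def default_mapping(rank, NX, NY, NZ):
    """Map an MPI rank to (x, y, z) torus coordinates, Z varying fastest."""
    z = rank % NZ
    y = (rank // NZ) % NY
    x = rank // (NZ * NY)
    return (x, y, z)

NX, NY, NZ = 8, 8, 16  # a d x d x 2d configuration with d = 8
coords = {default_mapping(r, NX, NY, NZ) for r in range(NX * NY * NZ)}
assert len(coords) == NX * NY * NZ  # the mapping is one-to-one
assert default_mapping(1, NX, NY, NZ) == (0, 0, 1)
```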

5.3 Results

We begin with a comparison of the performance of the multi-phase transpose algorithm (Base) with our application-level heuristic routing algorithm (Opt). This comparison is followed by an experimental study of the effect of varying the heuristic parameters such as α and β. We also attempt to delineate the individual contribution of each optimization to the overall performance gain obtained by Opt.

Comparison of Opt with Base: For optimal performance of our heuristic algorithm, the parameter α was set to 0.5, β to 1024, γ to 2, and δ to 2. These parameters


Fig. 4. Contributions to Performance: (a) effect of α, (b) effect of β, (c) effect of γ, (d) effect of other factors. The table in panel (d):

  Version                  Performance (GB/s)
  Base                     59.64
  Base with CBE            60.86
  Opt with δ = 1, γ = 0    71.35
  Opt with δ = 1, γ = 2    73.74
  Opt                      75.98

have been obtained experimentally. The results are shown in Figure 3. The X-axis is the system size (number of nodes) and the Y-axis is the performance of the matrix transpose communication in GB/s. In Figure 3(a), the performance results are obtained by mapping the MPI tasks onto the nodes using the random mapping. We see that Opt provides significant improvements over Base; the gain varies from 12% for 32 nodes to 35% for 2048 nodes. In Figure 3(b), the performance results are obtained by mapping the MPI tasks onto the nodes using the default mapping. Again, Opt provides significant improvements over Base; the gain ranges from 11% to 29%, with the best performance observed on 2048 nodes. In both graphs, we observe that the performance is significantly better for systems of size 1024 and 2048 than for the smaller system sizes. This is due to the fact that the underlying interconnection network for these system sizes is a torus, whereas for the smaller system sizes it is a mesh. Note that the number of paths between intermediate routing nodes and destination nodes is doubled in a torus as compared to a mesh; an intermediate routing node in a mesh therefore offers a limited gain in the number of paths to the destination in comparison to the torus. Our results demonstrate the benefit of our application-level heuristic routing algorithm. The overheads involved in initially determining the intermediate nodes were found to be negligible.


Effect of α: The effect of varying α for 1024 nodes is shown in Figure 4(a). The performance improves as α is increased to about 1/2. This is expected, as the load gets balanced on the longer-dimension links. As we increase α further, the performance drops as the shorter-dimension links become more congested.

Effect of β: Figure 4(b) shows the effect of varying β for 1024 nodes. As expected, the performance improves as β is increased, and saturates for large values of β.

Effect of γ: The result of varying γ is shown in Figure 4(c) for 1024 nodes. In the extreme case when γ = 0, i.e., when all the communicating pairs try to use intermediate routing nodes, the performance is as low as 71 GB/s. The performance increases as γ is increased to 2. As γ is increased further, the performance begins to drop, as a large proportion of pairs communicate directly, thereby losing the advantage of SDR.

We also conducted experiments to separately study the contributions of the different optimizations; these results are presented in Figure 4(d).


Mobile and Ubiquitous Computing

Gregor Schiele, Giuseppe De Pietro, Jalal Al-Muhtadi, and Zhiwen Yu

Topic Chairs and Members

The tremendous advances in wireless networks, mobile computing, and sensor networks, along with the rapid growth of small, portable, and powerful computing devices, offer opportunities for pervasive computing and communications. Topic 14 deals with cutting-edge research in various aspects of the theory and practice of mobile computing and wireless and mobile networking, including architectures, algorithms, networks, protocols, modeling and performance, applications, services, and data management. This year, we received 11 submissions for Topic 14. Each paper was peer reviewed by at least three reviewers, and we selected 7 regular papers. The accepted papers discuss very interesting issues in wireless ad hoc networks, mobile telecommunication systems, and sensor networks. In their paper "cTrust: Trust Aggregation in Cyclic Mobile Ad Hoc Networks", Huanyu Zhao, Xin Yang, and Xiaolin (Andy) Li describe a novel trust aggregation scheme for cyclic MANETs. The second paper, "On Deploying Tree Structured Agent Applications in Embedded Systems" by Nikos Tziritas, Thanasis Loukopoulos, Spyros Lalis, and Petros Lampsas, presents a distributed algorithm that arranges communicating agents over a set of wireless nodes in order to optimize the deployment of embedded applications. The third paper, by Nicholas Loulloudes, George Pallis, and Marios Dikaiakos, is entitled "Caching Dynamic Information in Vehicular Ad-Hoc Networks". It proposes an approach based on caching techniques for minimizing the network overhead imposed by Vehicular Ad Hoc Networks and for assessing the performance of Vehicular Information Systems. The fourth paper, "Meaningful Metrics for Evaluating Eventual Consistency" by Joao Pedro Barreto and Paulo Ferreira, analyses different metrics for evaluating the effectiveness of eventually consistent systems.
In the fifth paper, "Collaborative GSM-based Location", David Navalho and Nuno Preguica examine how information sharing among nearby mobile devices can be used to improve the accuracy of GSM- or UMTS-based location estimation. The sixth paper, "@Flood: Auto-Tunable Flooding for Wireless Ad Hoc Networks" by Jose Mocito, Luis Rodrigues, and Hugo Miranda, proposes an adaptive routing system in which each node can adapt the routing process dynamically to the current system context; the approach integrates multiple routing protocols in a single system. Finally, the paper "Maximizing Growth Codes Utility in Large-scale Wireless Sensor Networks" by Zhao Yao and Xin Wang extends existing work on robust information distribution in wireless sensor networks using Growth Codes by loosening assumptions made in the original approach, which makes Growth Codes applicable to a wider range of applications.
P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 452–453, 2010. © Springer-Verlag Berlin Heidelberg 2010


We would like to take the opportunity to thank all authors who submitted a contribution, the Euro-Par Organizing Committee, and all reviewers for their hard and valuable work. Their efforts made this conference and this topic possible.

cTrust: Trust Aggregation in Cyclic Mobile Ad Hoc Networks

Huanyu Zhao, Xin Yang, and Xiaolin Li

Scalable Software Systems Laboratory, Department of Computer Science, Oklahoma State University, Stillwater, OK 74078, USA
{huanyu,xiny,xiaolin}@cs.okstate.edu

Abstract. In a Cyclic Mobile Ad Hoc Network (CMANET), where nodes move cyclically, we formulate trust management problems and propose the cTrust scheme to handle trust establishment and aggregation. Unlike trust management in conventional peer-to-peer (P2P) systems, trust management in MANETs is based on simple neighbor trust relationships and is also location and time dependent. In this paper, we focus on the trust management problem in highly mobile environments. We model trust relations as a trust graph in a CMANET to enhance the accuracy and efficiency of trust establishment among peers. Leveraging a stochastic distributed Bellman-Ford-based algorithm for fast and lightweight aggregation of trust scores, the cTrust scheme is decentralized and self-configurable. We use the NUS student contact patterns derived from campus schedules as our CMANET communication model. The analysis and simulation results demonstrate the efficiency, accuracy, and scalability of the cTrust scheme: with increasing ad hoc network scales and trust topology complexities, cTrust scales well with marginal overheads.

1 Introduction

Research in Mobile Ad Hoc Networks (MANETs) has made tremendous progress in fundamental protocols, routing, packet forwarding and data gathering algorithms, and systems. Unlike in conventional networks, nodes in MANETs carry out routing and packet forwarding functions, acting as both terminals and routers. The increasing popularity of these infrastructure-free systems, with autonomous peers and communication paradigms, has made MANETs prone to selfish behaviors and malicious attacks; MANETs are inherently insecure and untrustful. In MANETs, each peer is free to move independently and will therefore change its connections to other peers frequently, which results in a very high rate of network topology changes. Communication is usually multi-hop, and each node may forward traffic unrelated to its own use. The transmission power, computational ability, and available bandwidth of each node in a MANET are limited. In reality, we notice that a large part of MANET

The research presented in this paper is supported in part by the National Science Foundation (grants CNS-0709329, CCF-0953371, OCI-0904938, CNS-0923238).

P. D’Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 454–465, 2010. c Springer-Verlag Berlin Heidelberg 2010 


peers have cyclic movement traces and can be modeled as Cyclic Mobile Ad Hoc Networks (CMANETs), defined as MANETs in which nodes' mobility is cyclic [5,7]. In this paper, we focus on the trust management problem in CMANETs. Conventional centralized trust establishment approaches are not well suited to CMANET scenarios, and to the best of our knowledge, little research has investigated trust issues in CMANETs; trust establishment in CMANETs is still an open and challenging topic. Unlike P2P trust, trust in CMANETs is based on naive neighbor trust relationships, and it is also location and time dependent. We propose a trust aggregation scheme called cTrust for aggregating distributed trust information in completely decentralized and highly dynamic CMANET environments. Our contributions in this work are multifold. (1) We model the movement patterns and trust relationships in a CMANET as a trust graph, and we model the most trustable path finding process as a Markov Decision Process (MDP). (2) We propose a trust transfer function, a value iteration function, and a distributed trust aggregation algorithm to solve the most trustable path finding problem. The algorithm leverages a stochastic Markov-chain-based distributed Bellman-Ford algorithm, which greatly reduces message overhead; it requires only localized communication between neighboring peers, yet it captures a concise snapshot of the whole network from each peer's perspective. (3) We design evaluation metrics for cTrust. Using random and scale-free trust topologies, we conduct extensive experimental evaluations based on the real campus movement trace data of NUS students. The rest of the paper is structured as follows. Section 2 presents related work. Section 3 presents the trust model and the stochastic distributed cTrust aggregation algorithm. Section 4 presents simulation results exploring the performance of cTrust. Section 5 concludes the paper.

2 Related Work

The EigenTrust scheme proposed by Kamvar et al. obtains a global trust value for each peer by computing the principal eigenvector of a matrix of trust ratings [4]. Xiong and Liu developed a reputation-based trust framework, PeerTrust [11]. Zhou and Hwang proposed the PowerTrust system for DHT-based P2P networks [14]. H. Zhao and X. Li proposed the concept of the trust vector and a trust management scheme, VectorTrust, for aggregating distributed trust scores [13]; they also proposed a group trust rating aggregation scheme, HTrust, using the H-Index technique [12]. In the field of MANETs, Sonja Buchegger and Jean-Yves Le Boudec proposed a reputation scheme, based on a modified Bayesian estimation method, to detect misbehavior in MANETs [1]. Buchegger and Le Boudec also proposed a self-policing reputation mechanism [2]; the scheme is based on nodes' local observations, and it leverages second-hand trust information to rate and detect misbehaving nodes. The CORE system adopts a reputation mechanism to achieve node cooperation in MANETs [6]. The goal of the CORE


system is to prevent nodes' selfish behavior. [1], [2] and [6] mainly deal with the identification and isolation of misbehaving nodes in MANETs, but the mobility feature of MANETs is not fully addressed in these previous works. Yan Sun et al. considered trust as a measure of uncertainty and presented a formal model to represent, model, and evaluate trust in MANETs [9]. Ganeriwal et al. extended the trust scheme application scenario to sensor networks and built a trust framework for sensor networks [3]. Another ad hoc trust scheme is presented in [10], where a trust confidence factor is proposed.

3 cTrust Scheme

3.1 Trust Graph in CMANETs

In CMANETs, nodes have short radio range, high mobility, and uncertain connectivity. Two nodes are able to communicate only when they are within each other's transmission range. When two nodes meet at a particular time, they have a contact probability P (P ∈ [0, 1]) of actually contacting each other and starting transactions. The cyclic movement trace of a CMANET consisting of three nodes is shown in Figure 1. The unit time is set to 10. Each peer i moving cyclically has a motion cycle time Ci; from the trace we can tell that CA = 30, CB = 30, and CC = 20. The system motion cycle time CS is the Least Common Multiple (LCM) of all the peers' motion cycle times in the network: CS = lcm(CA, CB, CC) = 60. Note that CMANET movement traces are not required to follow particular shapes. "Cyclic" means that if two nodes meet at time T0, they have a high probability of meeting again after every time period TP. We draw the movement traces in this paper as simple shapes to ease presentation and understanding.

A

C

A

B

C

C

B

B

(a) T = 0

(b) T = 10

(c) T = 20

A

A

B

C

C

A

B

C

B

(d) T = 30

(e) T = 40

(f) T = 50

Fig. 1. CMANET Movement Trace Snapshots for One System Cycle (CS = 60)


[Fig. 2. Trust Graph. The vertices in the graph correspond to peer states (time and location) in a system. A directed solid edge shows an initial trust relation (trust rating and trust direction) and its contact time; a dashed edge shows a node's movement trace. Each peer maintains a local trust table to store trust information.]

To describe the system features and trust relationships of a CMANET, we combine the snapshot graph and the trust relationships into a directed trust graph, as shown in Figure 2. Each node is represented by several states, based on its periodically appearing locations, which form the vertices of the graph. We write a state Xi as i/Ti{Loc}, where i is the node ID and Ti{Loc} is the appearance time for a particular location. The appearance time is given by

    Ti{Loc} = T0i{Loc} + Ci × n,  n = 0, 1, 2, …    (1)

where T0i{Loc} is the first time node i appears at this location and Ci is node i's motion cycle time. For example, node A appears at three locations, so in the trust graph node A is represented by three states: A/TA{Loc0}, A/TA{Loc1}, and A/TA{Loc2}. Following Equation (1), we have TA{Loc0} = 0 + CA × n = 30n, TA{Loc1} = 10 + CA × n = 10 + 30n, and TA{Loc2} = 20 + CA × n = 20 + 30n (n = 0, 1, 2, …). The three states generated by node A are therefore A/30n, A/(10+30n), and A/(20+30n). State Xi's one-hop direct trust neighbors are represented by the set H'(Xi), e.g., H'(B/30n) = {A/30n, C/(10 + 20n)}. The directed dashed lines between states show nodes' movement traces as state transfer edges in the trust graph. The initial trust relationships are shown by the solid directed edges: there is an edge directed from peer i to peer j if and only if i has a trust rating on j. The value Ri,j (Ri,j ∈ [0, 1]) reflects how much i trusts j, where Ri,j = 0 indicates that i distrusts j (or has never interacted with j) and Ri,j = 1 indicates that i fully trusts j. The trust between different states of the same node i is Ri,i = 1. The trust rating is personalized: each peer maintains its own self-policing trust in other peers, rather than a global reputation value being maintained for each peer. We adopt personalized trust ratings because in an open and decentralized CMANET


environment, peers do not have any centralized infrastructure to maintain a global reputation. The trust rating on a solid edge could be obtained by applying functions that weigh the importance, date, and service quality of all past transactions between two peers. How to rate a service and how to generate and normalize accurate direct trust ratings are outside the scope of this paper; we assume the normalized trust ratings have already been generated. What we study here is the trust aggregation/propagation process in an ad hoc network with high mobility. Besides a trust rating, each edge is also labeled with a time function showing when the two nodes can communicate over this trust link. The appearance time for each link is given by

    TRi,j = T0Ri,j + lcm(Ci, Cj) × n,  n = 0, 1, 2, …    (2)
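Equations (1) and (2) can be sketched together in a few lines of code (hypothetical helpers, assuming the notation above; the horizon is inclusive):

```python
from math import gcd

def appearance_times(t0, cycle, horizon):
    """Equation (1): times t0 + cycle*n for n = 0, 1, 2, ... up to horizon."""
    return [t0 + cycle * n for n in range((horizon - t0) // cycle + 1)]

def link_appearance_times(t0_link, ci, cj, horizon):
    """Equation (2): a trust link between i and j reappears every lcm(Ci, Cj)."""
    period = ci * cj // gcd(ci, cj)
    return appearance_times(t0_link, period, horizon)

# Node A (C_A = 30) at Loc1, first seen at T = 10, over one system cycle (C_S = 60):
print(appearance_times(10, 30, 60))           # → [10, 40]
# A link first active at T = 20 with C_A = 30, C_C = 20, i.e. period lcm(30, 20) = 60:
print(link_appearance_times(20, 30, 20, 80))  # → [20, 80]
```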

where Ci and Cj are the relevant nodes' motion cycle times and T0Ri,j is the first time they meet over this link. The solid edges are labeled Ri,j/TRi,j. The system trust graph shows all the trust relationships, movement traces, and possible contacts of the network. For example, setting T = 0 in Figure 2, we obtain the snapshot trace of Figure 1(a) together with the trust links active at that time.

3.2 Trust Path Finding Problems in CMANETs

In the cTrust system, each peer maintains a local trust table. Each entry of the trust table consists of a remote peer ID, the trust rating for that peer, and the next hop to reach it; an entry stores only the next hop instead of the whole trust path. Initially, peers' trust tables contain only the trust information from their one-hop direct experience. Due to communication range and power constraints, peers are not able to communicate with remote peers directly. Suppose peer i wishes to start a transaction with a remote peer k: i wishes to infer an indirect trust rating for peer k to check k's reputation. In cTrust, trust transfer is defined as follows.

Definition 3.1 (Trust Transfer): If peer i has a trust rating Ri,j towards peer j, and peer j has a trust rating Rj,k towards peer k, then peer i has indirect trust Ri,k = Ri,j ⊗ Rj,k towards peer k.

In Definition 3.1, Ri,j and Rj,k can each be either direct or indirect trust. Besides the trust rating, peer i also wishes to find a trustable path and rely on multi-hop communication to complete the transaction. Among a set of paths between i and k, i tends to choose the Most Trustable Path (MTP).

Definition 3.2 (Most Trustable Path): The most trustable path from peer i to peer k is the trust path yielding the highest trust rating Ri,k. The MTP is computed as the maximal ⊗-product over all directed edges along a path, and this product is taken as i's trust rating towards peer k. The MTP provides a trustable communication path and is used to launch multi-hop transactions with an unfamiliar target peer. The cTrust scheme solves the trust rating transfer and MTP finding problems in CMANETs.
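Definition 3.2 can be illustrated with a brute-force search over simple paths (our own hypothetical helpers; plain `min` stands in for ⊗ here, which respects the upper bound the paper later imposes on ⊗, while cTrust itself uses the transfer function of Section 3.4 and a distributed algorithm rather than exhaustive search):

```python
from itertools import permutations

def path_trust(ratings, path, combine):
    """Fold the combine operator (stand-in for ⊗) along a path of peers."""
    r = ratings[(path[0], path[1])]
    for u, v in zip(path[1:], path[2:]):
        r = combine(r, ratings[(u, v)])
    return r

def most_trustable_path(ratings, src, dst, combine=min):
    """Definition 3.2 by brute force: enumerate simple paths from src to dst
    and keep the one with the highest combined trust rating."""
    nodes = {n for edge in ratings for n in edge}
    best, best_r = None, -1.0
    for k in range(len(nodes) - 1):                      # number of intermediate hops
        for mids in permutations(nodes - {src, dst}, k):
            path = (src,) + mids + (dst,)
            if all(e in ratings for e in zip(path, path[1:])):
                r = path_trust(ratings, path, combine)
                if r > best_r:
                    best, best_r = path, r
    return best, best_r

# Toy ratings (illustrative): the direct edge A→C is weaker than the path A→B→C.
R = {("A", "B"): 0.9, ("B", "C"): 0.9, ("A", "C"): 0.5}
print(most_trustable_path(R, "A", "C"))  # → (('A', 'B', 'C'), 0.9)
```

Exhaustive enumeration is exponential in the network size, which is exactly why the paper casts MTP finding as an MDP and solves it with distributed value iteration.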

3.3 Markov Decision Process Model

A Markov Decision Process (MDP) is a discrete-time stochastic control process consisting of a set of states. In each state there are several actions to choose from. For a state x and an action a, the state transition function Px,x' determines the transition probabilities to the next state, and a reward is earned for each state transition. We model the MTP finding process as an MDP and propose value iteration to solve the MTP finding problem.

Theorem 3.1: The MTP finding process is a Markov Decision Process.

Proof. For a sequence of random node states X1, X2, X3, …, Xt in the trust graph, the trust path satisfies

    Pr(Xt+1 = x | X1 = x1, X2 = x2, …, Xt = xt) = Pr(Xt+1 = x | Xt = xt)    (3)

Equation (3) indicates that the state transitions of a trust path possess the Markov property: future states depend only on the present state and are independent of past states. The components required in an MDP are defined as follows:
– S: the state space of the MDP, i.e., the node state set in the trust graph.
– A: the action set of the MDP, i.e., the state transition decisions.
– Px,x' = Pr(Xt+1 = x' | Xt = x, at = a): the probability that action a in node state x at time t leads to node state x' at time t + 1.
– Rx,x': the reward received after transitioning from state x to state x' with transition probability Px,x'. In our scheme, Rx,x' is a trust rating.

The transition probability from state x to state x' is computed by normalizing all of x's outgoing trust links (trust ratings):

    Px,x' = Pr(Xt+1 = x' | Xt = x) = Rx,x' / Σ_{y∈H'(x)} Rx,y    (4)

In each node state, the next-state probabilities sum to one. The trust path finding process is a stochastic process in which all state transitions are probabilistic. The goal is to maximize the cumulative trust rating of the whole path, i.e., the expected product from the source peer to the destination peer:

    γRs1,s2 ⊗ γ²Rs2,s3 ⊗ γ³Rs3,s4 ⊗ … ⊗ γᵗRst,st+1    (5)

where γ is the discount rate, satisfying 0 ≤ γ ≤ 1 and typically close to 1. Therefore, the MTP finding process is an MDP (S, A, P·,·, R·,·). The solution to this MDP can be expressed as a trust path π (the MTP), and the standard algorithm to compute the policy π is the value iteration process. ∎
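Equation (4) amounts to a weighted random choice of the next hop; a minimal sketch (hypothetical function names, with illustrative trust values) is:

```python
import random

def transition_probs(trust_out):
    """Equation (4): normalize a state's outgoing trust ratings into
    state transition probabilities."""
    total = sum(trust_out.values())
    return {nbr: r / total for nbr, r in trust_out.items()}

def pick_next_state(trust_out, rng=random):
    """Sample the next hop according to the transition probabilities."""
    probs = transition_probs(trust_out)
    states, weights = zip(*probs.items())
    return rng.choices(states, weights=weights, k=1)[0]

# Hypothetical outgoing trust links of state B/30n (values are illustrative):
out = {"A/30n": 0.9, "C/10+20n": 0.6}
print(transition_probs(out))  # probabilities sum to 1
```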

3.4 Value Iteration

Section 3.2 presented the trust transfer function Ri,k = Ri,j ⊗ Rj,k. The upper bound for Ri,j ⊗ Rj,k is min(Ri,j, Rj,k), because the combination of trust cannot exceed either original trust. Ri,j ⊗ Rj,k should also be larger than Ri,j × Rj,k, which avoids trust ratings dropping too quickly during trust transfer. The discount rate γ (γ ∈ [0, 1]) determines the importance of remote trust information. The trust transfer function Ri,j ⊗ Rj,k therefore needs to meet the condition

    Ri,j × γRj,k ≤ Ri,j ⊗ Rj,k ≤ min(Ri,j, γRj,k)    (6)

In the cTrust scheme, we set the trust transfer function as

    Ri,j ⊗ Rj,k = min(Ri,j, γRj,k) × (max(Ri,j, γRj,k))^(1/na)    (7)

We prove that the given function meets the condition in (6).

Proof. Since max(Ri,j, γRj,k) ≤ 1, we have (max(Ri,j, γRj,k))^(1/na) ≤ 1, and therefore

    min(Ri,j, γRj,k) × (max(Ri,j, γRj,k))^(1/na) ≤ min(Ri,j, γRj,k).

For the lower bound,

    min(Ri,j, γRj,k) × (max(Ri,j, γRj,k))^(1/na)
      = min(Ri,j, γRj,k) × max(Ri,j, γRj,k) / (max(Ri,j, γRj,k))^((na−1)/na)
      = Ri,j × γRj,k / (max(Ri,j, γRj,k))^((na−1)/na)
      ≥ Ri,j × γRj,k.

Therefore, the trust transfer function (7) meets the condition (6). ∎
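As a runnable sketch (function name is our own; the defaults γ = 1 and na = 9 follow the simulation setup of Section 4.1), the trust transfer function (7) can be written as:

```python
def trust_transfer(r_ij, r_jk, gamma=1.0, na=9):
    """Equation (7): Ri,j ⊗ Rj,k = min(Ri,j, γRj,k) · max(Ri,j, γRj,k)^(1/na)."""
    lo = min(r_ij, gamma * r_jk)
    hi = max(r_ij, gamma * r_jk)
    return lo * hi ** (1.0 / na)

# The result stays within the bounds of condition (6):
r = trust_transfer(0.8, 0.9)
assert 0.8 * 0.9 <= r <= min(0.8, 0.9)
```

With a large na the na-th root is close to 1, so the transferred trust sits near the upper bound min(Ri,j, γRj,k); na = 1 recovers the product Ri,j × γRj,k at the lower bound.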

By tuning the adjusting factor na (na = 1, 2, 3, …), Ri,j ⊗ Rj,k can slide between the lower and upper bounds. In each round of the iteration, the trust table of each node is updated by choosing an action (the next-hop state in the trust graph). The value iteration is executed concurrently on all nodes: each node compares newly received trust information with its old trust information and corrects its trust table accordingly. The trust tables are updated iteratively until they converge. Based on the trust transfer function, the value iteration function is

    Ri,k = max( Ri,k , α × min(Ri,j, γRj,k) × (max(Ri,j, γRj,k))^(1/na) )    (8)

where Ri,k is the trust rating towards peer k in peer i's local trust table, Ri,j is the direct link trust, and Rj,k is the received trust information towards peer k. α (α ∈ [0, 1]) is the learning rate, which determines to what extent newly acquired trust information replaces the old trust rating: α = 0 indicates that the node does not learn anything, while α = 1 indicates that the node fully trusts and learns the new information. When the value iteration converges, each peer's trust table contains the trust rating of the MTP.

3.5 cTrust Distributed Trust Aggregation Algorithm

In the initial stage of an evolving CMANET, pre-set direct trust ratings are stored in local trust tables. However, the direct trust information is limited and


does not cover all potential interactions. The distributed trust aggregation algorithm gathers trust ratings towards any peer in the network (Algorithm 1). In this algorithm, each trust path is aggregated into the MTP with the highest trust rating towards the target peer. Indirect trust information is added to the trust tables and updated as the aggregation process evolves. The algorithm is based on the distributed Bellman-Ford algorithm: updates are performed periodically, with peers retrieving one of their direct trust neighbors' trust tables, replacing existing trust ratings with higher ones in their local trust tables, and recording the relevant neighbors as the next hops.

Algorithm 1. Distributed Trust Aggregation Algorithm in CMANETs
 1: Initialize local trust tables.
 2: for each time slot do
 3:   for each peer i ∈ CMANET do
 4:     Find i's direct trust neighbor set H'(i).
 5:     if H'(i) ≠ ∅ then
 6:       Normalize transition probabilities by Equation (4).
 7:       Choose one target node j by transition probability in set H'(i).
 8:       Send a trust table request to j (the contact between i and j occurs with their contact probability P).
 9:       Receive incoming trust tables.
10:       Relax each trust table entry by the trust value iteration function (8); update next-hop peers.
11:       If a trust table request is received from another peer, send the local trust table back.
12:     end if
13:   end for
14: end for
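One time slot of Algorithm 1 can be sketched in a few dozen lines (the data layout and function names are our own illustration, not the paper's implementation; `trust_transfer` restates Equation (7)):

```python
import random

def trust_transfer(r_ij, r_jk, gamma=1.0, na=9):
    # Equation (7): min(Ri,j, γRj,k) · max(Ri,j, γRj,k)^(1/na)
    lo, hi = min(r_ij, gamma * r_jk), max(r_ij, gamma * r_jk)
    return lo * hi ** (1.0 / na)

def aggregation_round(tables, alpha=1.0, gamma=1.0, na=9, p_contact=0.9, rng=random):
    """One time slot of Algorithm 1. tables[i] maps target peer -> (rating, next_hop);
    a direct neighbor of i is an entry whose next hop is the target itself, i.e. H'(i)."""
    for i, table in tables.items():
        neighbors = [j for j, (_, nxt) in table.items() if nxt == j]   # H'(i)
        if not neighbors:
            continue
        # Equation (4): pick one neighbor with probability proportional to trust
        weights = [table[j][0] for j in neighbors]
        j = rng.choices(neighbors, weights=weights, k=1)[0]
        if rng.random() > p_contact:          # contact succeeds with probability P
            continue
        r_ij = table[j][0]
        for k, (r_jk, _) in list(tables[j].items()):   # j sends its trust table back
            if k == i:
                continue
            candidate = alpha * trust_transfer(r_ij, r_jk, gamma, na)
            if k not in table or candidate > table[k][0]:  # relaxation, Equation (8)
                table[k] = (candidate, j)

# Toy three-peer network holding only direct trust (illustrative values):
tables = {
    "A": {"B": (0.9, "B")},
    "B": {"C": (0.8, "C")},
    "C": {"A": (0.7, "A")},
}
for _ in range(3):
    aggregation_round(tables, p_contact=1.0)  # p_contact=1.0 makes the toy deterministic
```

After a few rounds, A holds an inferred rating for C with next hop B; with the paper's P = 0.9, each contact would instead succeed only with probability 0.9.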

4 Experimental Evaluation

4.1 Experiment Setup

CMANET Contact Pattern Model. We construct an unstructured network based on the NUS student trace. The contact pattern data comes from the class schedules of the Spring semester of 2006 at the National University of Singapore (NUS), covering 22341 students and 4875 sessions [7,8]. For each enrolled student we have her/his class schedule, which gives us extremely accurate information about the contact patterns among students over large time scales; the contact patterns among students inside classrooms were presented in [7]. Following the class schedules, students move around campus and meet each other when they attend the same session. The trace data set considers only student movements during business hours and ignores contacts made while students hang around campus for various activities outside of class. The time is compressed by removing idle time slots without any active sessions, so contacts take place only in classrooms. Two students are within communication range of each other if and


only if they are in the same classroom at the same time. The sessions can be considered as classes; the unit time is one hour, and a session may last multiple hours. The NUS contact patterns can thus be modeled as a CMANET. In our experiment, 100 to 1000 students are randomly chosen to simulate 100 to 1000 moving peers. The contact probability P is set to 0.9, indicating that when two nodes meet, they communicate with probability 0.9. We considered all 4875 sessions in the data set. The time for the whole system cycle (CS) is 77 hours (time units).

Trust Topology Model. The random trust topology and the scale-free trust topology are used to establish trust relationships in this simulation. In the random trust topology, the trust outdegree of a peer follows a normal distribution with mean μd = 20, 25, 30 and variance σd² = 5, so all peers have similar numbers of initial trust links. Under the scale-free trust topology, highly active peers possess large numbers of trust links, while most other peers have only a few; the number of trust links follows a power-law distribution with scaling exponent k = 2.

Parameter Setting. The network is configured with 100 to 1000 nodes. The network complexity is expressed as the nodes' average outdegree d; a network complexity of d = 20, 25, 30 indicates that, on average, nodes' initial outdegrees are 20, 25, and 30, respectively. Each peer's real behavior is represented by a rating score r ∈ [0, 1] drawn from a pre-set normal distribution (μr = 0.25 or 0.75, σr² = 0.2). As mentioned in Section 3, we assume accurate direct trust ratings have already been generated. This is reasonable because any trust inference scheme must rely on an accurate trust rating scheme; it is meaningless to study inferred trust if the underlying direct trust ratings are unreliable.
In our simulation, the direct trust rating R ∈ [0, 1] is therefore generated with a normal distribution (μR = r, σR² = 0.1) based on the peer's real behavior score r. The parameters of the iteration function are set as learning rate α = 1, discount factor γ = 1, and adjusting factor na = 9. To measure performance under dynamic models, new transactions are continuously generated between random source and destination nodes according to a Poisson distribution with an arrival rate of λ = 10 to 50 transactions per service cycle. New nodes randomly join the network, and peers also randomly leave or die. In such a dynamic model, as in a real mobile ad hoc network, it is hard to achieve strict convergence. The convergence in our simulation is therefore ε-convergence, defined as the variance between any peer's two consecutive trust tables being smaller than a pre-set threshold ε = 0.02.
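The ε-convergence test can be sketched as follows (our own reading of "variance between consecutive trust tables" as the maximum per-entry rating change; the function name is hypothetical):

```python
def eps_converged(prev_table, curr_table, eps=0.02):
    """ε-convergence check: every trust rating changed by less than eps
    between two consecutive trust tables (missing entries count as 0.0)."""
    keys = set(prev_table) | set(curr_table)
    return all(abs(curr_table.get(k, 0.0) - prev_table.get(k, 0.0)) < eps
               for k in keys)

# A 0.005 change is within the threshold; a 0.05 change is not:
assert eps_converged({"B": 0.90}, {"B": 0.905})
assert not eps_converged({"B": 0.90}, {"B": 0.95})
```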

4.2 Results and Analysis

Convergence Time. The convergence time is measured as the number of time units needed to reach ε-convergence. Figure 3 shows that cTrust needs only a small number of aggregation cycles before convergence. We also observe that convergence time increases with network complexity. As network size N increases, convergence time increases relatively slowly (O(n)), showing that cTrust features satisfactory scalability. We also observe that the trust topology affects the convergence time far less than network size does.

[Fig. 3. Convergence Time: convergence time vs. network size (100–1000 nodes) for d = 20, 25, 30; (a) random trust topology, (b) scale-free trust topology.]

Communication Message Overhead. Figure 4 shows the average communication message overhead per individual peer to reach convergence. cTrust greatly reduces the communication message overhead by using the MDP model: in each iteration, each node receives the trust table of only one of its most trusted neighbors (Equation (4)). The message overhead grows slowly as the network size grows, showing that cTrust is a lightweight scheme. In a network with high complexity, cTrust incurs more message overhead. In a typical cTrust network, the average message overhead is affected only by the network size N and complexity d, not by the trust topology; as a result, the overhead curves for both topologies appear similar.

[Fig. 4. Message Overhead: average messages per peer vs. network size (100–1000 nodes) for d = 20, 25, 30; (a) random trust topology, (b) scale-free trust topology.]

Average Trust Path Length. Figure 5 shows the average length of a trust path from a source peer to a destination peer at convergence. Generally, the trust path length increases with network size and complexity, indicating that peers gain more remote trust information. Under the scale-free trust topology, the trust path length is greatly reduced: most peers have only a few connections while a few power peers control many links, making trust information harder to spread. In a complex network, where trust information can spread farther, there are more long trust paths involving more trust transfers.

[Fig. 5. Average Trust Path Length: average path length vs. network size (100–1000 nodes) for d = 20, 25, 30; (a) random trust topology, (b) scale-free trust topology.]

[Fig. 6. Aggregation Accuracy: accuracy vs. network size (100–1000 nodes) for d = 20, 25, 30; (a) random trust topology, (b) scale-free trust topology.]

Accuracy. cTrust aggregation accuracy is measured by comparing all the inferred trust ratings with the peers' real behavior scores; the similarity is taken as the aggregation accuracy. As shown in Figure 6, on average, cTrust aggregation accuracy stays above 90%. This result is very encouraging because cTrust is a personalized trust system that uses inferred (not direct) trust, and the information each node can access in CMANETs is limited. As the network complexity increases, the accuracy decreases: in complex networks there are more long trust paths that involve more trust transfers, and these multi-hop relationships lower the accuracy of the inferred trust ratings. The accuracy in the scale-free trust topology is slightly higher than in the random trust topology; one reason is that in the scale-free trust topology the average trust path is shorter, which leads to higher accuracy in trust transfer.

5 Conclusion

We have presented the cTrust scheme for CMANETs. The cTrust scheme aims to provide a common framework enabling trust inference in a CMANET trust landscape. We presented the trust transfer function, the trust value iteration function, and the distributed cTrust trust aggregation algorithm. To validate the proposed algorithms and protocols, we conducted extensive evaluation based on NUS student trace data. The experimental results demonstrate that cTrust's trust aggregation is efficient. cTrust convergence time increases slowly with network size, message overhead is modest, and trust information spreads fast and extensively in CMANETs. The trust rating inference accuracy of the cTrust scheme is over 90%. We believe that cTrust establishes a solid foundation for designing trust-enabled applications and middleware in CMANETs.


Maximizing Growth Codes Utility in Large-Scale Wireless Sensor Networks

Yao Zhao, Xin Wang, Jin Zhao, and Xiangyang Xue

School of Computer Science, Fudan University, China
Shanghai Key Lab of Intelligent Information Processing, Shanghai, China

Abstract. The goal of Growth Codes, proposed by Kamra et al., is to increase the "persistence" of sensed data, i.e., to make data more likely to reach a data sink. Growth Codes are especially useful in "zero-configuration" sensor networks whose topology changes very rapidly. However, the design of Growth Codes rests on two assumptions: (1) each sensor node holds only a single snapshot of the monitored environment, and each packet contains only one sensed symbol; (2) all codewords have the same probability of being received by the sink. These two assumptions do not hold in many practical scenarios of large-scale sensor networks, so the performance of Growth Codes becomes sub-optimal. In this paper, we generalize the scenarios to include multi-snapshot and less random encounters. By associating a decimal degree with the codewords, and by using priority broadcast to exchange codewords, we aim to achieve better performance of Growth Codes over a wider range of sensor network applications. The proposed approaches are described in detail and evaluated by both analysis and simulations.

1 Introduction

Wireless sensor networks have been widely used for data perception and collection in scenarios such as floods, fires, earthquakes, and military areas. In such networks, the sensor nodes that collect and deliver data are themselves prone to fail suddenly and unpredictably. Thus, protocols designed for these sensor networks should focus on the reliability of data collection and temporarily store part of the data so that information survives. Many coding approaches have been proposed to achieve such robustness, since coding over the sensed data increases the likelihood that the data will survive as some nodes fail. Among them, Growth Codes [7] are specially designed to increase the "persistence" of the sensed data. Assuming that the network topology changes very rapidly, e.g., due to link instability or node failures, Growth Codes aim to increase data persistence, defined as the fraction of data generated within the network that eventually reaches the sink(s). The codewords employ a dynamically changing degree distribution to deliver data to the sink at a much faster rate, so that a substantial number of codewords can be decoded at any given time.

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 466–477, 2010. © Springer-Verlag Berlin Heidelberg 2010


However, the design of Growth Codes is based on two assumptions: (1) each sensor node contains only a single snapshot of the monitored environment, and each packet contains only one sensed symbol; (2) all codewords have the same probability of being received by the sink. This holds for information exchange at random encounters among nodes or under very high mobility, but it does not capture protocol behavior in realistic large-scale sensor networks, where the performance of Growth Codes is therefore sub-optimal. In this paper, we relax the above assumptions and generalize the scenarios of the original Growth Codes to a wider range of applications.

Firstly, the design of Growth Codes is generalized to multi-snapshot scenarios, where the buffer of each sensor node can store multiple symbols sensed from the monitored area, and the size of a transmission packet is larger than the size of a sensed symbol. Notice that this assumption is quite reasonable in practical large-scale sensor networks: a transmission packet may contain hundreds of bytes, whereas a sensed symbol may contain only a few bytes. For clarity, we use the term "packet" for the transmission data unit and "symbol" for the sensed data unit. Thus, when a node receives an input packet, it can disassemble the packet into several symbols and then generate a new packet over these and other symbols in its local buffer. In this case, Growth Codes can be modified to encode at the symbol level, instead of the packet level, to achieve better utility in multi-snapshot scenarios.

Secondly, the design of Growth Codes is generalized to scenarios of less random encounters, where the network scale is too large to guarantee that all codewords have the same probability of reaching the sink. In such scenarios, the nodes located far from the sink have less chance of delivering their symbols to the sink than the closer ones.
In this case, we need to investigate a special technique to increase the chances of symbols sensed in areas far from the sink. Motivated by the natural broadcast property of wireless sensor networks, we introduce priority broadcast to disseminate sensed data, which gives higher priority to data from farther away. We show that with priority broadcast, even in scenarios of less random encounters, the performance of Growth Codes approaches the theoretical value.

The rest of the paper is organized as follows. Section 2 gives an overview of related work. Section 3 briefly reviews data persistence in large-scale wireless sensor networks and introduces the original Growth Codes. In Section 4, the design of Growth Codes is modified to encode at the symbol level to maximize utility in the multi-snapshot case. In Section 5, the design of Growth Codes is modified to use priority broadcast to maximize utility at large network scale. Section 6 evaluates the performance of our proposed modifications by simulations. Section 7 concludes the paper.

2 Related Works

Storing and disseminating network-coded [2] information instead of the original data can bring significant performance improvements to wireless sensor network protocols [1]. However, though network coding is generally beneficial, coding over all available packets might leave symbols undecodable until the sink receives enough codewords. A decentralized implementation of erasure codes is proposed in [5]: assuming there are n storage nodes with limited memory and k < n sources generating the data, the authors show that the sink can retrieve all the data by querying any k nodes. Similarly, decentralized fountain codes [8] [3] are proposed to persist the cached data in wireless sensor networks. Growth Codes [7] are specifically designed to increase the "persistence" of the sensed data in dynamic environments: nodes initially disseminate the original data and then gradually code over other data as well to increase the probability that information survives. To achieve more data persistence under different node mobility, resilient coding algorithms [10] are proposed, which keep a balance between coding over all incoming packets and Growth Codes. Furthermore, the approach of Growth Codes is generalized to multi-snapshot scenarios in [9], where the authors aim to maximize the expected utility gain through joint coding and scheduling and propose two algorithms, with and without mixing different snapshots. When dealing with multi-snapshot, the authors prove the existence of an optimal codeword degree but do not propose any specific method. Border-node retransmission based probabilistic broadcast is proposed in [4] to reduce the number of rebroadcast messages; the authors suggest privileging retransmission by nodes located at the radio border of the sender. In some sense, their insight of border node retransmission motivates our proposed approach of priority broadcast.

3 Problem Description

3.1 Network Model

We start with the description of the network model for large-scale sensor networks. The network consists of a large number of sensors distributed randomly in a monitored disaster region (earthquakes, fires, floods, etc.). Because of sudden configuration changes due to failures, the network operates in a "zero-configuration" manner, and data collection must be initiated immediately, before the nodes have a chance to assess the current network topology. In addition, the location of the sink is unknown and all data is of equal importance. Each sensor node has a unique identifier (ID) and is capable of sensing an area around itself. Each sensor node also has a radio interface and can communicate directly with some of the sensors around it. We denote the number of nodes by N and the transmission radius by R. In particular, it is difficult for nodes located far from the sink to deliver their symbols to the sink, since doing so takes many hops. Moreover, we assume that each packet can contain at most P symbols and that the buffer size is B, as depicted in Fig. 1.

Fig. 1. Storing symbols of multi-snapshot in memory (P = 3, B = 5): the buffer holds symbols 1 to 5 from snapshots 1 to 5, and a transmission packet carries three of them.

Since the location of the sink is unknown, an intermediate node randomly chooses a neighbor and a random packet in its memory and then disseminates this packet to the selected neighbor. In Section 5, we also investigate the technique of priority broadcast, where the intermediate node chooses the packet with the highest priority and then disseminates it to all its neighbors.

3.2 Overview of Growth Codes

Growth Codes define the degree of a codeword as the number of packets XOR'd together to form the codeword. Each node initializes its buffer with its own packet (i.e., its sensed information); when a new packet is received, the node stores it at a random position in its memory, overwriting a random unit of previously stored information (but never its own sensed information). Though a high codeword degree increases the probability that a codeword provides innovative information, more codewords and more operations are needed to decode a received codeword. As described in [7], codewords in the network should start with degree one and then increase their degree over time. Specifically, for the first R1 = ⌊(N − 1)/2⌋ recovered packets, codewords of degree 1 are disseminated; after Rk = ⌊(kN − 1)/(k + 1)⌋ packets have been recovered, codewords of degree k + 1 are disseminated. Let d be the degree of a codeword to be transmitted after r packets have been recovered at the sink, and assume that all codewords have the same probability of being received by the sink; Growth Codes then give:

d = (N + 1) / (N − r).   (1)
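The transition schedule above can be sketched as follows (an illustrative helper, not the authors' implementation; `codeword_degree` walks the transition points R_k directly rather than rounding Eqn. (1), since the floor inside R_k can make simple rounding disagree at the boundaries):

```python
def transition_point(k: int, n: int) -> int:
    """R_k = floor((k*n - 1)/(k + 1)): once this many of the n packets are
    recovered at the sink, codewords of degree k + 1 are disseminated."""
    return (k * n - 1) // (k + 1)

def codeword_degree(r: int, n: int) -> int:
    """Integer codeword degree used after r of n packets have been recovered,
    following the R_k schedule (d approximates (n + 1)/(n - r))."""
    d = 1
    while d < n and r >= transition_point(d, n):
        d += 1
    return d
```

For N = 100, for example, degree 1 is used until R1 = 49 packets are recovered, then degree 2 until R2 = 66, and so on.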

4 Maximizing Growth Codes Utility by Coding on the Symbol Level

4.1 Decimal Codeword Degree

Growth Codes assume that each transmission packet contains only one sensed symbol, so the codeword degree in Growth Codes is always an integer. However, in many multi-snapshot scenarios a sensor node stores multiple symbols in its buffer, so a transmission packet contains several sensed symbols. Therefore, when using Growth Codes at the symbol level, it becomes possible to generate codewords of decimal (fractional) degree.
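The per-symbol XOR structure can be made concrete with a small sketch (illustrative only; coded symbols are modeled as frozensets of source-symbol ids, since XOR cancels a repeated id exactly as it cancels identical payload bits):

```python
def xor_combine(*coded):
    """XOR coded symbols, each a frozenset of source-symbol ids
    (an id appearing an even number of times cancels out)."""
    out = set()
    for c in coded:
        out ^= set(c)
    return frozenset(out)

def packet_degree(packet):
    """Decimal degree of a packet: the average degree (number of source
    symbols mixed in) over its coded symbols."""
    return sum(len(c) for c in packet) / len(packet)
```

For the example of Fig. 2 below, the packet (A1+B1, A2) has per-symbol degrees 2 and 1, so its packet degree is 1.5.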


Fig. 2. An example of a codeword with the decimal degree 1.5: packet A = (A1, A2) and packet B = (B1, B2), both of degree 1, are combined into the new codeword (A1+B1, A2).

Fig. 3. An example where the decimal degree outperforms the integer degree: (a) a 4-node network, when 2 packets have been recovered; (b) codewords of degree 1; (c) codewords of degree 2; (d) codewords of degree 1.5.

An example of the decimal degree is depicted in Fig. 2. An intermediate node stores 2 packets (packet A and packet B) in its buffer, each containing 2 symbols (a total of 4 symbols: A1, A2; B1, B2). To transmit, this node generates a new packet encoded over one whole packet (A1, A2) and half of the other packet (B1), yielding a new codeword of degree 1.5. The value 1.5 is calculated as (1/2)·2 + (1/2)·1, since A1+B1 has degree 2 while A2 has degree 1.

In Growth Codes, after Rk = ⌊(kN − 1)/(k + 1)⌋ packets have been recovered at the sink, a codeword of degree k + 1 is more likely to be successfully decoded than a codeword of degree k. However, when the decimal degree is used, a decimal degree such as k + 0.5 may outperform both degree k and degree k + 1. For clarity, an example where the decimal degree outperforms the integer degree is given in Fig. 3. It is a 4-node sensor network; each node has sensed 2 symbols and has the same transmission capacity of 2 symbols. At one point, the sink has already recovered 2 packets (see Fig. 3a); we then compare three codeword degrees: degree 1, degree 2, and degree 1.5. In the first case (Fig. 3b), the sink randomly receives a codeword of degree 1; in expectation, 0.5 new packets (1 new symbol) can be recovered. In the second case (Fig. 3c), the sink randomly receives a codeword of degree 2; in expectation, 0.5 new packets (1 new symbol) can also be recovered. In the third case (Fig. 3d), the sink may get 2 symbols (e.g., from a codeword of A1+C1 and C2) with probability 1/3, get 1 symbol (e.g., from a codeword of C1+D1 and C2) with probability 1/2, and get no symbol (e.g., from a codeword of A1+B1 and A2) with probability 1/6. In expectation, the number of newly recovered symbols is (1/3)·2 + (1/2)·1 + (1/6)·0 = 7/6, which is larger than 1.

4.2 Search for Appropriate Codeword Degree

Growth Codes select the codeword degree as d = ⌈(N + 1)/(N − r)⌉, but the previous subsection suggests that a codeword of decimal degree may perform better. To find an appropriate decimal codeword degree, we use interpolation. Note that the degree transition points suggested in Growth Codes, which can be regarded as the sample set for interpolation, are {⟨d, f(d)⟩ | 1 ≤ d ≤ N, f(d) = N − (N + 1)/d}. The problem of searching for an appropriate codeword degree then turns into the following interpolation problem: given the partition {i = 0, 1, ..., N − 1; x = {xi}, xi = i + 1; y = {yi}, yi = N − (N + 1)/xi}, how do we obtain the interpolation function s(x, y)?

Let si = (yi − yi−1)/(xi − xi−1) = yi − yi−1 (i = 1, ..., N − 1), and define mi = min(si, si+1). The method of Hermite spline interpolation [6] gives:

s(x, y) = mi−1(xi − x)^2(x − xi−1) − mi(x − xi−1)^2(xi − x) + yi−1(xi − x)^2[2(x − xi−1) + 1] + yi(x − xi−1)^2[2(xi − x) + 1]   (2)

The next theorem illustrates the efficiency of Hermite spline interpolation. An example of this kind of Hermite interpolation is illustrated in Fig. 4; for clarity, only the part with degree less than 10 is shown.

Theorem 1. The codeword degree distribution calculated by Eqn. 2 is near-optimal.

Proof. (1) Obviously s(xi) = yi for i = 0, 1, ..., N − 1. (2) Since s'(xi−1) = mi−1, s'(xi) = mi and s'(x) > 0 for xi−1 < x < xi, s(x, y) is monotonically increasing on each interval [xi−1, xi]. Thus, Eqn. 2 conforms to the "growth" property of Growth Codes, where the codeword degree gradually increases over time, so the codeword degree distribution calculated by Eqn. 2 is near-optimal.
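The interpolation and its inversion (degree as a function of packets recovered, as plotted in Fig. 4) can be sketched as follows. This is an illustrative reimplementation under stated assumptions: it uses the standard cubic Hermite basis, which is algebraically equivalent to Eqn. (2) for unit node spacing, with the node slopes m_i = min(s_i, s_{i+1}) from the text and secant slopes copied to the endpoints; `decimal_degree` inverts the monotone spline by bisection.

```python
import bisect

def samples(n):
    """Degree transition points used as interpolation nodes:
    x_i = i + 1 (candidate degree), y_i = n - (n + 1)/x_i (packets recovered)."""
    xs = [i + 1 for i in range(n)]
    ys = [n - (n + 1) / x for x in xs]
    return xs, ys

def spline(x, xs, ys):
    """Monotone cubic Hermite interpolant through (xs, ys), node spacing 1."""
    n = len(xs)
    s = [ys[i] - ys[i - 1] for i in range(1, n)]          # secant slopes
    m = [s[0]] + [min(s[i - 1], s[i]) for i in range(1, n - 1)] + [s[-1]]
    i = min(max(bisect.bisect_right(xs, x), 1), n - 1)    # segment [xs[i-1], xs[i]]
    t = x - xs[i - 1]
    h00, h10 = 2 * t**3 - 3 * t**2 + 1, t**3 - 2 * t**2 + t
    h01, h11 = -2 * t**3 + 3 * t**2, t**3 - t**2
    return h00 * ys[i - 1] + h10 * m[i - 1] + h01 * ys[i] + h11 * m[i]

def decimal_degree(r, n):
    """Invert the spline by bisection: the decimal degree x with s(x) = r
    (valid because the min-slope Hermite spline is monotone)."""
    xs, ys = samples(n)
    lo, hi = 1.0, float(n)
    for _ in range(60):
        mid = (lo + hi) / 2
        if spline(mid, xs, ys) < r:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

At the sample nodes the spline reproduces the Growth Codes transition points exactly; between them it yields the decimal degrees of Fig. 4.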

Having determined the appropriate degree, we modify the coding strategy as follows. When an intermediate node has a chance to disseminate a packet, it first determines an appropriate codeword degree d and then generates P coded symbols, among which (d − ⌊d⌋)·P coded symbols are encoded over ⌈d⌉ source symbols and (1 − (d − ⌊d⌋))·P coded symbols are encoded over ⌊d⌋ source symbols.
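This mixing rule can be sketched as follows (an illustrative helper, not the authors' code; the fractional count is rounded to the nearest integer):

```python
from math import ceil, floor

def degree_mix(d: float, p: int) -> dict:
    """How many of the p coded symbols in a packet get which integer degree
    so that the packet's average degree is d: frac(d)*p symbols of degree
    ceil(d), the remaining symbols of degree floor(d)."""
    hi = round((d - floor(d)) * p)
    return {ceil(d): hi, floor(d): p - hi} if hi else {floor(d): p}
```

For d = 1.5 and P = 10, this yields five degree-2 and five degree-1 coded symbols, whose average degree is exactly 1.5.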

5 Maximizing Growth Codes Utility by Priority Broadcast

5.1 Unicast

Fig. 4. Searching the decimal degree by interpolation (N = 500): codeword degree (decimal versus integer) as a function of the number of packets recovered at the sink.

Since the location of the sink is unknown, and since all data is of equal importance, Growth Codes use a simple algorithm to exchange information: a node randomly chooses a neighbor and a random codeword in its memory and exchanges information. However, this kind of unicast may be less efficient than broadcast, given the natural broadcast property of wireless sensor nodes. Furthermore, unicast is locally uniform and thus penalizes nodes located far from the sink when delivering their packets. Clearly, this violates the assumption in the design of Growth Codes that all codewords in the network have the same probability of being received by the sink. That assumption only holds if the sink encounters other nodes with uniform probability; in less random scenarios, coding performance decreases, so exchanging information by unicast is not an efficient way to maximize the utility of Growth Codes. (The simulation results are shown in Section 6.)

5.2 Priority Broadcast

Given the natural broadcast property of wireless sensor networks, and since broadcast can quickly diffuse a message from a source node to the other nodes in the network, broadcast is an intuitive method for more efficient information exchange. To maximize the Growth Codes utility, we use priority broadcast to disseminate packets: packets with higher priority are privileged for delivery.

Firstly, the priority relates to the intersection of the radio areas of two nodes. A packet received from a neighbor located at the radio border of an intermediate node should have a high priority, since this kind of border transmission is beneficial for disseminating sensed data quickly. Observe that the distance between two nodes in full-duplex communication can be evaluated by comparing their neighbor lists. When two nodes A and B can contact each other, the union of their communication areas ZA ∪ ZB can be partitioned into three zones (Fig. 5): Z1, Z2, Z3, where Z1 denotes ZA − ZB, Z3 denotes ZB − ZA, and Z2 denotes ZA ∩ ZB. We define the ratio μ1 by μ1 = Z1/(Z2 + Z3).
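The neighbor-list comparison can be sketched as follows (a rough sketch under stated assumptions: the ratio is taken as μ1 = Z1/(Z2 + Z3), and the zone areas are proxied by neighbor counts, which presumes roughly uniform node density; the function name is hypothetical):

```python
def estimate_mu1(neighbors_a, neighbors_b):
    """Approximate the distance ratio mu1 between nodes A and B from their
    neighbor lists: Z1 ~ A-only neighbors, Z2 ~ common neighbors,
    Z3 ~ B-only neighbors (areas approximated by node counts)."""
    a, b = set(neighbors_a), set(neighbors_b)
    z1, z2, z3 = len(a - b), len(a & b), len(b - a)
    return z1 / (z2 + z3) if (z2 + z3) else 0.0
```

Identical neighbor lists give μ1 = 0 (nodes effectively co-located), while many A-only neighbors drive μ1 up (nodes near each other's radio border).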

Fig. 5. Intersection of the radio areas of node A and node B (zones Z1, Z2, Z3)

Fig. 6. Example of the priority function convexity according to the δ parameter: (a) δ = 1/2, (b) δ = 1, (c) δ = 2

The parameter μ1 approximately reflects the distance between nodes A and B: a larger value of μ1 indicates a longer distance, while a smaller value indicates a shorter distance. When node A wants to know μ1, it has to identify its own neighbors and the neighbors of B. For that purpose, each node that forwards a broadcast adds the identities of all its neighbors to the message. When node B receives the broadcast message, it compares the list from the incoming message to its own neighbor list and can thus determine the approximate value of μ1. Therefore, when node B receives a packet of sensed data from node A, node B calculates the priority of this packet as follows:

p1 = (μ1/M1)^δ1.   (3)

Here, M1 is a constant representing the maximal value of μ1. It can be evaluated as the maximal value of the ratio Z1/(Z2 + Z3), which corresponds to the case when the distance between node A and node B equals the transmission radius (M1 ≈ 0.6). δ1 is a coefficient that controls the impact of μ1 (see Fig. 6).

Secondly, the priority relates to the walk length, i.e., the number of hops a source packet has taken to arrive at a given intermediate node. We set a counter μ2 for each source packet and increase it by one after each transmission until the packet reaches the sink. Therefore, when a packet is received by an intermediate node, its priority is calculated as follows:

p2 = (μ2/M2)^δ2.   (4)


Here, M2 is a constant representing the maximal value of μ2. Since each node can store at most B packets in its buffer, a data-disseminating walk is interrupted at a random node with probability 1/B; thus, the value of M2 can be approximated by the buffer size B. δ2 is a coefficient that controls the impact of μ2 (also see Fig. 6). Combining the previous equations, each packet in the buffer has the broadcast priority:

p = (μ1/M1)^δ1 + (μ2/M2)^δ2.   (5)

Theorem 2 shows the efficiency of priority broadcast.
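Eqns. (3) through (5), together with the proportional packet selection used later when disseminating, can be sketched as follows (illustrative only; M1 ≈ 0.6 and M2 = B follow the text, here with B = 10 packets, and δ1 = 2, δ2 = 0.5 are the values assumed in the simulation section):

```python
import random

def priority(mu1, mu2, m1=0.6, m2=10, d1=2.0, d2=0.5):
    """Broadcast priority of a packet: the border-distance term (Eqn. 3)
    plus the walk-length term (Eqn. 4), combined as in Eqn. 5."""
    return (mu1 / m1) ** d1 + (mu2 / m2) ** d2

def pick_packet(buffer, rng=random):
    """Pick packet k from [(mu1, mu2, payload), ...] with probability
    p_k / sum(p_i), i.e., proportionally to its broadcast priority."""
    weights = [priority(u1, u2) for u1, u2, _ in buffer]
    return rng.choices([pl for _, _, pl in buffer], weights=weights, k=1)[0]
```

A packet received from the radio border after a long walk (μ1 near M1, μ2 near M2) gets priority close to 2, while a freshly generated local packet gets priority close to 0 and is rarely rebroadcast.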

Theorem 2. There exist appropriate δ1 and δ2 such that the performance of priority broadcast in scenarios of less random encounters approaches that of random encounters.

Proof. Fig. 7 illustrates a scenario of less random encounters, where node S is an intermediate node in the network and node B is farther from node S than node A. When node S stores both packet A and packet B in its buffer, it calculates their priorities pA and pB by Eqn. 5: pA = 1 + (1/B)^δ2 and pB = (μ1,B/M1)^δ1 + (μ2,B/B)^δ2. Obviously, there exist values of δ1 and δ2 such that pA = pB. Thus, although node B is farther from node S than node A, the priorities of their packets can be made equal by choosing δ1 and δ2 appropriately.

Fig. 7. Priority broadcast in the scenario of less random encounters (nodes A, B, and C around intermediate node S)

Knowing the priority of each packet in the buffer, we modify the coding strategy as follows: when an intermediate node has a chance to disseminate a packet, packet k is chosen with probability p_k / Σ p_i, where the sum is taken over the B packets in the buffer.

6 Simulation Results

In this section, we look at how efficiently our proposed methods help the sink recover more data in wireless sensor networks. Our simulation is implemented in about 1000 lines of C++ code. We consider a random wireless sensor network where 500 nodes are randomly placed in a 1×1 square; each sensor node has the same transmission radius R, and a pair of sensor nodes is connected by a link if they are within distance R of each other.

6.1 Impact of Coding on the Symbol Level

First, we look at how codewords with decimal degree (coding on the symbol level) impact data persistence. In our simulation, we assume P = 10 and B = 100, i.e., each node can store at most 100 symbols (10 packets) in its buffer. The transmission radius R is 0.2. In multi-snapshot scenarios, a transmission packet may contain several sensed symbols; in this case, Growth Codes on the symbol level let the codeword degree increase at a finer granularity than coding on the packet level. As depicted in Fig. 8, the sink needs 365 codewords to recover 50 percent of the symbols when coding on the symbol level, but 400 codewords when coding on the packet level. When the sink has received 500 codewords, it can recover 59% of the original symbols by coding on the symbol level but only 52.6% by coding on the packet level. Thus, our proposed method persists much more data than the original Growth Codes, showing the efficiency of coding on the symbol level in multi-snapshot scenarios.

Fig. 8. Data persistence in a 500 node random sensor network (R = 0.2, B = 100)

6.2 Impact of Priority Broadcast

Second, we look at how information exchange by priority broadcast impacts the performance of Growth Codes in different scenarios, whether with less random or with random encounters. We simulate several networks where the node density varies from 10 to 30; two network scales, 500 and 1000 nodes, are considered, and we assume δ1 = 2 and δ2 = 0.5. In scenarios of random encounters (e.g., a dense network with node density 30), the original Growth Codes perform well with both unicast and priority broadcast. However, in scenarios of less random encounters, the performance of the original Growth Codes decreases, since it is difficult for some symbols to reach the sink. By associating Growth Codes with

Fig. 9. The performance of Growth Codes differs in different networks: (a) time taken to recover 50% of the data, (b) time taken to recover 80% of the data (in units of N), versus node density (average number of neighbors, 10 to 30), for N = 500 and N = 1000 with unicast and with priority broadcast, compared against the theoretical value.

priority broadcast, even symbols located far from the sink can also reach the sink. Fig. 9 depicts the time taken to recover a certain fraction of the sensed data when the nodes use Growth Codes, with both unicast and priority broadcast. The bottommost curve in Fig. 9a and Fig. 9b plots the theoretical performance of Growth Codes. The results suggest that the performance of the original Growth Codes differs across networks: if the network scale is large (e.g., N = 500 or 1000) but the node density is not large enough (e.g., less than 30), the performance of the original Growth Codes decreases. Associating Growth Codes with priority broadcast, however, approaches the theoretical performance in all kinds of scenarios, which shows the efficiency of priority broadcast.

7 Conclusion

In this paper, we generalize the application scenarios of the original Growth Codes to include multi-snapshot and less random encounters, so as to achieve better performance over a wider range of network applications, especially large-scale wireless sensor networks. The core ideas of this paper are: (1) to use a decimal codeword degree instead of an integer codeword degree to handle multi-snapshot scenarios; (2) to use priority broadcast, based on border-node transmission and walk length, to approach the theoretical performance of Growth Codes in scenarios of less random encounters. The proposed methods are compared with the original Growth Codes by simulations, and the results validate their efficiency. We believe that our work will greatly help the practical application of Growth Codes in various kinds of large-scale sensor networks.


Acknowledgement. This work was supported in part by the 863 program of China under Grant No. 2009AA01A348, the National Key S&T Project under Grant No. 2010ZX03003-00303, the Shanghai Municipal R&D Foundation under Grant No. 09511501200, and the Shanghai Rising-Star Program under Grant No. 08QA14009. Xin Wang is the corresponding author.


@Flood: Auto-Tunable Flooding for Wireless Ad Hoc Networks

José Mocito 1, Luís Rodrigues 2, and Hugo Miranda 3

1 INESC-ID / FCUL, 2 INESC-ID / IST, 3 University of Lisbon

Abstract. Flooding is a fundamental building block in multi-hop networks (both mobile and static); for instance, many routing protocols for wireless ad hoc networks use flooding as part of their route discovery/maintenance procedures. Unfortunately, most flooding algorithms have configuration parameters that must be tuned according to the execution environment, in order to provide the best possible performance. Given that ad hoc environments are inherently unpredictable, dynamic, and often heterogeneous, anticipating the most adequate configuration of these algorithms is a challenging task. This paper presents @Flood, an adaptive protocol for flooding in wireless ad hoc networks that allows each node to auto-tune the configuration parameters, or even change the forwarding algorithm, according to the properties of the execution environment. Using @Flood, nodes autoconfigure themselves, circumventing the need for pre-configuring all the devices in the network for the expected operational conditions, which is impractical or even impossible in very dynamic environments.

1 Introduction

Flooding is a communication primitive that provides best-effort delivery of messages to every node in the network. It is a fundamental building block of many communication protocols for multi-hop networks. In routing, flooding is typically used during the discovery phase, to find routes among nodes. It has many other uses, like building a distributed cache that keeps data records in the close vicinity of clients [11], performing robust and decentralized code propagation [8], executing distributed software updates [1], performing query processing [15], or enhancing the privacy of communicating nodes [5]. Given its paramount importance, it is no surprise that the subject of implementing flooding in an efficient manner has been extensively studied in the literature. Its simplest implementation requires every node in the network to retransmit every message once, which results in an overwhelming use of communication resources that can negatively affect the capacity of nodes to communicate with each other [12]. As a result, many different approaches to implement

This work was partially supported by FCT project REDICO (PTDC/EIA/71752/2006) through POSI and FEDER.
P. D’Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 478–489, 2010. © Springer-Verlag Berlin Heidelberg 2010


(optimized) flooding have been proposed [12,4,2,10,3]. Most require fine-tuning a number of operational parameters in order to provide the best performance. Such tuning can depend on several factors, like the node density, geographical distribution, device characteristics, or traffic load. An ill-configured algorithm can contribute to increased battery depletion of the devices and to an excessive occupation of the available bandwidth. Usually, flooding protocols provide no or very few mechanisms to self-adapt their parameters, and must be totally or partially configured off-line, before the actual deployment of nodes in the network. Such a task is obviously challenging, since multi-hop networks are typically highly dynamic: the topology is formed spontaneously, and the node distribution, movement, or sleeping schedule can create heterogeneous topologies that do not favor any static or homogeneous configuration. In other words, uniformly using the same configuration in every node is unlikely to provide optimal results. Moreover, even in the case of homogeneous and stable operating conditions, the task of pre-configuring a large number of devices may be impractical or even impossible (recall that flooding may be required to be already operational in order to perform configuration/software updates on the fly). This paper proposes Auto-Tunable Flooding, AT-Flooding for short, or simply @Flood, a generic auto-tuning flooding protocol where each node individually selects the most advantageous parameter configuration. Furthermore, flooding initiators can select the most adequate flooding algorithm given the perceived network conditions. This makes @Flood particularly useful in unpredictable, dynamic, and/or heterogeneous environments. The fact that each node can operate using a different configuration paves the way for efficient operation in scenarios where different portions of the network are subject to different operating conditions.
Given that each node auto-configures autonomously, the configuration of the network is scalable, as no global synchronization is required. The mechanism is based on a feedback loop that monitors the performance of the flooding algorithms. We illustrate the feasibility of the approach using different algorithms. Experimental results show that, in all scenarios, our technique is able to match or exceed the performance of the best off-line configuration. The remainder of the paper is organized as follows. Related work is presented in Section 2. In Section 3 we present descriptions of several flooding algorithms that are used to motivate our work and illustrate the application of our scheme. Section 4 provides experimental arguments advocating the need for a generic adaptive flooding solution. Section 5 provides a detailed description of the @Flood protocol. In Section 6 we evaluate our proposal. Section 7 concludes the paper and highlights future directions of this work.

2 Related Work

Devising adaptive solutions for flooding is not a new problem. However, the majority of the approaches focus on defining retransmission probabilities as a function of the perceived neighborhood, which may provide sub-optimal results


in the presence of uneven node distribution. We briefly describe the approaches that, to the best of our knowledge, are most closely related to our solution. Hypergossiping [6] is a dual-strategy flooding mechanism to overcome partitioning and to efficiently disseminate messages within a partition. It uses an adaptive approach where the retransmission probability at a node is a function of its neighbor count. Rather than defining a mechanism to locally tune this configuration value, the authors derive optimal assignments off-line on the basis of experimental results obtained a priori. Unfortunately, these results are intrinsically tied to the studied scenarios. Tseng et al. [13] proposed adaptive versions of the flooding protocols published in [12], which adapt to the density of the network where they are executing by observing the neighbor set at each node, either by overhearing traffic or through explicit periodic announcements. A more recent work by Liarokapis and Shahrabi [9] proposes a similar approach, specifically focused on a probability-based scheme, where the neighbor set is obtained by counting the number of retransmissions of a single message. In both solutions the configuration of the flooding strategy is obtained by a function, defined a priori, of the perceived neighbor count. Such a function is established by finding the best values through extensive simulations and refinements for a given scenario. Once again, this suggests that the resulting function may not perform well if the conditions change at run-time. Smart Gossip [7] is an adaptive gossip-based flooding service where retransmission probabilities are applied to reduce the communication overhead. Like our approach, the performance metric is defined for each source.
By knowing the estimated diameter of the network along with some node dependency sets that include, for instance, the set of nodes that are forwarders for messages of a given source, every node can determine its reception probability for that source and influence the retransmission probability of the forwarders. By requiring an estimate of the network diameter and accurate knowledge of a partial topology, this approach is not suitable for dynamic networks.

3 Flooding Algorithms

A plethora of flooding algorithms have been proposed in the literature [12,4,2,10,3]. Here we do not aim at providing a comprehensive survey of the related work (an interesting one can be found in [14]). Instead, in this section we focus on a small set of improved flooding algorithms that we consider representative, and that we have used to illustrate the operation of our @Flood protocol, namely GOSSIP1 [4], PAMPA [10], and 6SB [3]. GOSSIP1 employs a probabilistic approach where the forwarding of data packets is driven by a pre-configured probability p, i.e. upon receiving a message for the first time, every node decides with probability p to retransmit the message. PAMPA is a power-aware flooding algorithm where the received signal strength indication (RSSI) is used to estimate the distance of nodes to the source and to select the farthest ones as the first to retransmit. A pre-configured threshold c determines the number of retransmissions detected after which a node will no longer forward a message.


6SB employs the same rationale as PAMPA but uses a location-based approach where the geographic position of nodes determines the best candidates for retransmitting first, i.e. those farthest from the source. Again, a retransmission counter threshold c is employed to limit the number of forwards.
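To make these forwarding rules concrete, the following Python sketch (our own illustration, not code from any of the cited implementations; the class and method names are hypothetical) shows the per-message decision of GOSSIP1 and the counter-based rule shared by PAMPA and 6SB, with the RSSI- or position-derived back-off delay abstracted away:

```python
import random

class Gossip1:
    """GOSSIP1: forward each first-time message with probability p."""
    def __init__(self, p):
        self.p = p
        self.seen = set()   # message ids already handled (never re-forwarded)

    def on_receive(self, msg_id):
        if msg_id in self.seen:
            return False    # duplicate: drop silently
        self.seen.add(msg_id)
        return random.random() < self.p

class CounterBased:
    """PAMPA/6SB-style rule: each reception is counted while the node
    waits out a back-off delay; the message is forwarded only if no more
    than c copies were overheard meanwhile. (In PAMPA the delay is derived
    from RSSI, in 6SB from the geographic position; both are omitted here.)"""
    def __init__(self, c):
        self.c = c
        self.copies = {}    # msg_id -> number of copies heard so far

    def on_receive(self, msg_id):
        self.copies[msg_id] = self.copies.get(msg_id, 0) + 1

    def should_forward(self, msg_id):
        # called when this node's back-off delay for msg_id expires
        return self.copies.get(msg_id, 0) <= self.c
```

Note that with p = 1 GOSSIP1 degenerates into plain flooding, while a smaller c makes the counter-based protocols increasingly frugal.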

4 The Case for Adaptive Behavior

The challenge in developing a flooding protocol consists in finding the right tradeoff between robustness and efficiency. The aforementioned flooding protocols aim at minimizing the number of packets required to achieve some target level of reliability using different strategies. To motivate the need for supporting adaptive behavior, we performed a set of experimental tests that illustrate the impact of different factors on flooding performance. For that purpose we tested several algorithms using different configuration parameters, and measured their performance using the delivery ratio and the total amount of data transmitted. To run the experiments, we implemented the protocols in the NS-2 network simulation environment, and performed the tests in a MANET, with signal propagation following the Two Ray Ground model, composed of 50 nodes, 5 of which are senders producing 256 byte messages at a rate of one message per second. Nodes were distributed uniformly across the space. Unless otherwise stated, all results are averages of 25 runs in the same conditions using different scenarios. Network Density. The first, most obvious, factor that we consider is the network density. For self-containment, we illustrate the relevance of this aspect by depicting in Figure 1 the performance of GOSSIP1 in sparse (average neighbor count of 4) and dense (average neighbor count of 17) networks. As can be observed, different values of the retransmission probability p provide different delivery results: a value of 0.4 is enough to provide high delivery in dense scenarios, while in sparse scenarios a higher value is required to obtain high delivery ratios. The results also confirm that an increase of p produces more messages, which reflects the increasingly conservative behavior of the protocol.
Note that in the dense scenario, after some point, this increase in the number of messages is no longer productive, as there is no noticeable benefit in the delivery ratio. Geographical Positioning. Some forwarding algorithms rely on geographical positioning to perform optimized flooding. The position can be determined, for instance, using the Global Positioning System (GPS). These positioning systems are, however, prone to errors, and their accuracy can range from tens to hundreds of meters. In position-based algorithms, inaccurate positioning can cause sub-optimal performance by generating more traffic than required or have an impact on the delivery ratio. This case is depicted in Fig. 2, where the 6SB protocol is used in a network where the positioning error varies between 0 and 150 meters. As we can observe, the impact on the delivery ratio is quite significant, ranging from 2% when there is no positioning error to 15% when the error reaches 150 meters.


[Figure omitted: delivery ratio and traffic (KB) plotted against parameter p.]

Fig. 1. GOSSIP1 algorithm with different parameters: (a) dense scenario, (b) sparse scenario

[Figure omitted: 6SB delivery ratio vs. positioning error (0–150 m).]

Fig. 2. Positioning inaccuracy in 6SB

Discussion. There are many factors that may affect the performance of a flooding protocol. The most widely studied is the network density. However, as we have discussed in the previous paragraphs, different forwarding algorithms may be equally sensitive to other factors, like inaccuracies in the readings of GPS data or the signal strength. Operational factors may even invalidate the assumptions of a given forwarding strategy (for instance, the availability of positioning information in 6SB) and may require the use of a different forwarding strategy. Therefore, it is relevant to devise a scheme to automatically adapt the flooding protocol in the face of several types of changes in the operational envelope. The selected adaptation may either change the forwarding algorithm or just tune its configuration parameters.

5 @Flood

Auto-Tunable Flooding, AT-Flooding for short, or simply @Flood is an adaptive flooding protocol that aims at finding in an automated manner the most suitable forwarding strategy, and for the selected strategy, the most suitable parameter configuration. The protocol continuously monitors the performance of


Algorithm 1. @Flood: forwarding procedure
1: R(alg, pid) ← ∅
2: primary                                  ▷ primary algorithm
3: procedure flood(m)
4:   m.alg ← primary
5:   m.alg.send(<p, m>)
6: procedure receive(src, <orig, m>)
7:   if m ∉ R(m.alg, orig) then
8:     R(m.alg, orig) ← R(m.alg, orig) ∪ {m}
9:     m.alg.forward(<orig, m>)

the executing configuration, and performs adaptations as needed. The adaptation procedure is strictly local, thus involving no coordination between nodes. @Flood is composed of four complementary components, namely: 1) a forwarding procedure, in charge of implementing the message dissemination; 2) a probing module, that injects a negligible amount of data into the network using different flooding algorithms; 3) a feedback loop, that monitors the performance of the underlying flooding algorithms; and 4) an adaptation engine, in charge of selecting the flooding algorithm and tuning its configuration parameters at run-time. Each of these components is described in detail in the next subsections. Note that @Flood can use multiple forwarding algorithms. Therefore, some parts of the protocol are generic, and some parts are specific to the forwarding algorithms in use. In the following subsections we clarify this distinction.

5.1 Forwarding Procedure

The forwarding procedure consists in invoking the primary flooding algorithm, i.e. the one that is selected for carrying application data. Our protocol implements two methods: flood, called to send a message, and receive, called when a message is received from the network. We also assume that each message is tagged with the forwarding algorithm that has been selected by the adaptive mechanism. Each algorithm has its own specific send method, called to send a message, and forward, called to (possibly) retransmit the message. Alg. 1 describes the behavior of this component. When a node wishes to flood a message, it issues a call to the primary flooding algorithm (lines 3–5). Likewise, when a node receives a message for the first time (lines 6–9), it stores the message identifier and feeds the message to the flooding method of the algorithm selected for it. The reason for storing the message identifiers will become clear in the next sections.

5.2 Probing Module

To support the switching between flooding algorithms one needs to assess the performance of each alternative, in order to produce an adequate decision. In @Flood this is accomplished through a mechanism (Alg. 2) where, periodically, a negligible amount of probing messages is injected in the network using the different forwarding strategies. Therefore, throughout the execution of @Flood all the alternative strategies are running, producing traffic that is used by the


Algorithm 2. @Flood: probe module
1: alternatives                             ▷ list of alternative algorithms
2: upon periodicProbeTimer
3:   probe ← select from alternatives \ primary
4:   probe.send(<p, PROBE>)

Algorithm 3. @Flood: monitor module
1: Nn(alg, pid) ← ∅    ▷ list of sets of messages received by neighbor n from pid since last adaptation
2: A(alg, pid) ← ∅     ▷ set of all collected messages from pid since last adaptation
3: Pn(alg, pid) ← 0    ▷ list of performance metrics for every node n
4: upon periodicHelloTimer
5:   tp(alg, pid) ← c(alg, pid)
6:   datalink.send(<R, A, P, t>)
7: procedure receive(src, <r, a, p, t>)
8:   Nsrc ← Nsrc ∪ r
9:   A ← A ∪ a
10:  for n = 0 to p.length do
11:    if tn.isNewer(Tn) then
12:      Pn ← pn

monitor module to extract the information required by the performance metric. These messages are treated as regular messages and therefore use the forwarding mechanism described in the previous section. To optimize the usage of communication resources, the probing module is only activated when a node starts producing regular traffic.

5.3 Monitor Module

In order to adapt the underlying algorithm, @Flood needs to gather information about the performance of each flooding algorithm, including the primary and the alternatives. To that purpose, @Flood uses a monitor module that determines the performance of each forwarding strategy. Several implementations of the monitor are possible. In this paper we use a simple approach based on the exchange of periodic HELLO messages that can be either explicit, or piggy-backed on regular data or control traffic. A description of the monitor module is provided in Alg. 3. For each algorithm alg and traffic source s, three data structures, A(alg, s), Nn(alg, s), and Pn(alg, s), hold respectively the identifiers of messages known to have been sent in the system, those received by neighbor node n, and the performance metrics computed at every node n, since the last adaptation (lines 1–3). Periodically, a node a broadcasts to its neighbors a HELLO message containing R, A and P (lines 4–6). Upon reception of the HELLO message, node b updates its own data structures by integrating the received digests (lines 7–12). Each node p can compute the performance of the current configuration of any algorithm for any given data source from the number of messages detected to have been missed by the neighbors, compared with the number of messages produced since the last adaptation. The performance metric can be computed using the following function:

localPerf(alg, src) = average∀i∈Neighborhood(p) ( |Ni(alg, src)| / |A(alg, src)| )
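As an illustration, the two metrics could be computed as in the following Python sketch; the dictionary layout and function names are our own assumptions, not the paper's code:

```python
def local_perf(N, A, alg, src, neighborhood):
    """Average fraction of src's messages (for algorithm alg) that each
    neighbor is known to have received, per the localPerf formula.
    N[i][(alg, src)] : set of message ids neighbor i reported receiving
    A[(alg, src)]    : set of all message ids known to exist"""
    total = A.get((alg, src), set())
    if not total or not neighborhood:
        return 0.0
    ratios = [len(N[i].get((alg, src), set()) & total) / len(total)
              for i in neighborhood]
    return sum(ratios) / len(ratios)

def global_perf(P, alg, src, nodes):
    """Average of the per-node performance values Pi(alg, src) gossiped
    back in HELLO messages: the globalPerf metric used by initiators."""
    vals = [P[i].get((alg, src), 0.0) for i in nodes]
    return sum(vals) / len(vals) if vals else 0.0
```

For example, with four known messages and two neighbors that received two and four of them respectively, local_perf yields 0.75.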
Since the data initiators will typically perceive complete delivery of the produced messages to the neighboring nodes, the previous metric does not capture problems that occur further along the dissemination process. Therefore, these nodes use a different metric that captures the average delivery ratio perceived by every node in the network:

globalPerf(alg, p) = average∀i∈Nodes ( Pi(alg, p) )

In both metrics the sets of messages (N and A) and the list of performance measurements (P) are reset right after a reconfiguration action by the adaptation engine, in order to reflect the performance of the current configuration.

5.4 Adaptation Engine

The adaptation engine of @Flood works in a loop, where periodically: i) the monitor module is called to assess the performance of the different forwarding algorithms (both the primary and the alternatives being probed); ii) each algorithm (including the alternatives) adapts its configuration parameters in order to improve its performance; iii) the primary algorithm is changed if another alternative outperforms the current selection. The part of the protocol that depends on the forwarding algorithms is encapsulated in the adapt method that is exported by each implementation. This method adapts the configuration of the algorithm and accepts as parameters the current measured performance for the algorithm, the target performance, and a margin δ that introduces some hysteresis in the system by defining a stable band around the target performance given by the localPerf metric; δ is related to how accurately one can locally estimate this value, which depends on the sampling period. Naturally, the concrete action depends on the protocol in use, and might involve increasing or decreasing the value of configuration parameters. In our implementations, a delivery ratio below the target performance triggers, in GOSSIP1, an increment of the retransmission probability p, and in 6SB and PAMPA, an increment of the retransmission counter c. If the performance is above target, the parameters of the three algorithms are decremented. The algorithm used by the adaptation engine is illustrated in Alg. 4. In the first phase (lines 4–7) every flooding algorithm for which p is not the source is adapted towards meeting the target performance in terms of the average delivery ratio of the neighboring nodes (localPerf). In the second phase (lines 8–11), for the algorithms for which p is a source, the global delivery ratio metric (globalPerf) is computed and compared with that of the primary algorithm. If there is an alternative that meets the target performance and the primary algorithm does not, the latter is replaced by the former.
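The adaptation rule just described (increment the parameter when delivery is below the target band, decrement it when above, and switch the primary algorithm using the hysteresis band) can be sketched in Python as follows; all names and the parameter-bound handling are illustrative assumptions, not the paper's implementation:

```python
def adapt_step(algorithms, primary, target, delta, perf):
    """One pass of an @Flood-style adaptation loop (sketch; the real
    engine operates per traffic source and calls each algorithm's own
    adapt() method).
    algorithms: name -> dict with 'value', 'step', 'lo', 'hi' (parameter
                and its adjustment step/bounds; hypothetical layout)
    perf:       name -> measured delivery ratio for that algorithm
    Returns the (possibly new) primary algorithm name."""
    for name, params in algorithms.items():
        p = perf[name]
        if p < target - delta:        # under-performing: retransmit more
            params["value"] = min(params["hi"], params["value"] + params["step"])
        elif p > target + delta:      # over-performing: save transmissions
            params["value"] = max(params["lo"], params["value"] - params["step"])
    # switch primary only if it sits outside the hysteresis band while
    # some alternative sits inside it
    def in_band(p):
        return target - delta < p < target + delta
    if not in_band(perf[primary]):
        for name in algorithms:
            if name != primary and in_band(perf[name]):
                return name
    return primary
```

A real deployment would run this per traffic source, as Alg. 4 does, rather than once globally.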


Algorithm 4. @Flood: adaptive engine
1: target                                   ▷ performance target
2: upon periodicAdaptTimer
3:   Pp(primary, p) ← globalPerf(primary, p)
4:   for all alg ∈ alternatives do
5:     for all src ∈ sources \ p do
6:       Pp(alg, src) ← localPerf(alg, src)
7:       alg.adapt(Pp(alg, src), target, δ)
8:     Pp(alg, p) ← globalPerf(alg, p)
9:     if target − δ < Pp(alg, p) < target + δ then
10:      if Pp(primary, p) < target − δ ∨ Pp(primary, p) > target + δ then
11:        primary ← alg

6 Evaluation

To validate and evaluate the performance of @Flood, we performed a series of experiments using the forwarding algorithms listed in Sect. 3. The NS-2 simulator was used to test the protocol in a wireless ad hoc network of 50 nodes, uniformly distributed and using the Two Ray Ground signal propagation model. The area and transmission range were manipulated to obtain sparse network scenarios, where each node has, on average, four neighbors. Five nodes chosen at random produce 256 byte messages at a rate of one message per second. The target delivery ratio was set to 0.90 and δ to 0.05. The time period between adaptations was set to 30 seconds. Unless otherwise stated, all results are averages of 25 runs in the same conditions using different scenarios, and the 95% confidence intervals are presented in the figures.

6.1 Convergence

We first show how @Flood is able to make the configuration of a given forwarding algorithm converge to values that offer good performance (i.e., that match the target delivery ratio and reduce the number of redundant messages). For this purpose, we simulated the execution of @Flood using GOSSIP1 as the forwarding algorithm, starting from two different initial configurations (p = 0 and p = 1, respectively), and measured the evolution of the delivery ratio and amount of traffic in sampling periods of 10 seconds. The results are depicted in Fig. 3. For an initial p = 0 the delivery ratio is roughly 17%, due to the neighbors of the initiators receiving data. For an initial p = 1 the delivery ratio is roughly 97%, since every node retransmits (the 3% loss is due to collisions). We can observe that @Flood drives GOSSIP1 to a configuration that offers approximately the same performance, in terms of delivery ratio and produced traffic, regardless of its initial configuration. Moreover, the delivery ratio stabilizes close to the pre-defined goal of 0.90 (±0.05).

6.2 @Flood vs. Standard Pre-configuration

One interesting question is to assess how well @Flood performs with regard to non-adaptive versions of the protocol under the standard (homogeneous) pre-configuration of the nodes. Can @Flood achieve the same performance?


[Figure omitted: delivery ratio and traffic (KB) over simulation time, for initial configurations p = 0.0 and p = 1.0.]

Fig. 3. @Flood convergence with GOSSIP1

[Figure omitted: generated traffic (MB) without @Flood (delivery 91%) and with @Flood (delivery 90%).]

Fig. 4. @Flood vs. best pre-config

In this test we used the GOSSIP1 algorithm and performed several runs with different values of the parameter p to determine which results in a delivery performance close to 0.90. Coincidentally, that value was also 0.90, and it resulted in a delivery ratio of 0.91. The same algorithm was also executed using @Flood but without competing alternatives, i.e., GOSSIP1 was set as the single alternative. The configurable parameters were set to a low value and an initial warm-up period was exercised in order for these values to converge and bring the delivery performance close to the target of 0.90 (±0.05). The results for the amount of traffic generated are depicted in Fig. 4. We can observe that GOSSIP1 without @Flood sends slightly more traffic than with @Flood. We can therefore conclude that the reduction of retransmissions operated by @Flood compensates for the overhead introduced by the performance monitor.

6.3 Adapting the Forwarding Algorithm

We now illustrate the use of @Flood to change the forwarding algorithm, by switching between a position-based and a power-aware forwarding strategy, more precisely, between 6SB and PAMPA. This makes it possible to use 6SB when position information is accurate and PAMPA otherwise.


[Figure omitted: delivery ratio over simulation time for static 6SB and PAMPA, and for @Flood commuting between the two as the primary algorithm.]

Fig. 5. Positioning inaccuracy in 6SB and @Flood

To illustrate the operation of @Flood in such scenarios, we performed an experimental test, with a single data producer, where the error in the positioning estimation varies throughout the simulation time. More specifically, the test consisted of inducing a position estimation error (of up to 150 meters in a random direction) in every node at specific moments in time. @Flood was configured to start with 6SB as the primary forwarding algorithm and PAMPA as an alternative. The target was set to 0.98 and δ to 0.02. In Fig. 5 we show two runs of the same experiment, one with static versions of 6SB and PAMPA, and the other with @Flood commuting between the two. The first plot depicts the equivalent delivery performance of both algorithms and the significant performance loss of 6SB whenever an error is induced in the positioning information. The second plot exhibits a smaller penalty, since @Flood automatically switches to PAMPA as soon as the perceived performance of 6SB drops, and returns to 6SB when the performance metric measured on the probing traffic returns to values above the target threshold.

7 Conclusions and Future Work

In this paper we presented @Flood, a new adaptive flooding protocol for mobile ad hoc networks. Experimental results with different forwarding algorithms validate the effectiveness of the solution. The performance of the resulting system matches that of a system pre-configured with the best configuration values for the execution scenario, but with a smaller number of retransmissions, thus contributing to an extended network lifetime. This makes the approach very interesting for dynamic and/or large scale scenarios and eliminates the need to pre-configure the flooding protocol. In the future, we plan to experiment with other forwarding algorithms, study alternative schemes to monitor the performance of the network and drive the adaptation procedure, and investigate a mechanism to self-adapt the hysteresis parameter to the currently executing algorithms and network conditions.


References
1. Busnel, Y., Bertier, M., Fleury, E., Kermarrec, A.: GCP: Gossip-based code propagation for large-scale mobile wireless sensor networks. In: Proc. of the 1st Int. Conf. on Autonomic Computing and Communication Systems, pp. 1–5. ICST (2007)
2. Drabkin, V., Friedman, R., Kliot, G., Segal, M.: RAPID: Reliable probabilistic dissemination in wireless ad-hoc networks. In: Proc. of the 26th IEEE Int. Symp. on Reliable Distributed Systems, pp. 13–22. IEEE CS, Los Alamitos (2007)
3. Garbinato, B., Holzer, A., Vessaz, F.: Six-shot broadcast: A context-aware algorithm for efficient message diffusion in MANETs. In: Meersman, R., Tari, Z. (eds.) OTM 2008, Part I. LNCS, vol. 5331, pp. 625–638. Springer, Heidelberg (2008)
4. Haas, Z.J., Halpern, J.Y., Li, L.: Gossip-based ad hoc routing. IEEE/ACM Trans. on Networking 14(3), 479–491 (2006)
5. Kamat, P., Zhang, Y., Trappe, W., Ozturk, C.: Enhancing source-location privacy in sensor network routing. In: Proc. of the 25th IEEE Int. Conf. on Distributed Computing Systems, pp. 599–608. IEEE Computer Society, Los Alamitos (2005)
6. Khelil, A., Marrón, P.J., Becker, C., Rothermel, K.: Hypergossiping: A generalized broadcast strategy for mobile ad hoc networks. Ad Hoc Net. 5(5), 531–546 (2007)
7. Kyasanur, P., Choudhury, R.R., Gupta, I.: Smart gossip: An adaptive gossip-based broadcasting service for sensor networks. In: Proc. of the 2006 IEEE Int. Conf. on Mobile Adhoc and Sensor Systems, pp. 91–100 (October 2006)
8. Levis, P., Patel, N., Culler, D., Shenker, S.: Trickle: A self-regulating algorithm for code propagation and maintenance in wireless sensor networks. In: Proc. of the 1st Symp. on Networked Systems Design and Impl., p. 2. USENIX (2004)
9. Liarokapis, D., Shahrabi, A.: A probability-based adaptive scheme for broadcasting in MANETs. In: Mobility 2009: Proc. of the 6th International Conference on Mobile Technology, Application Systems, pp. 1–8. ACM, New York (2009)
10. Miranda, H., Leggio, S., Rodrigues, L., Raatikainen, K.: A power-aware broadcasting algorithm. In: Proc. of the 17th Annual IEEE Int. Symp. on Personal, Indoor and Mobile Radio Communications, September 11-14 (2006)
11. Miranda, H., Leggio, S., Rodrigues, L., Raatikainen, K.: Epidemic dissemination for probabilistic data storage. In: Emerging Comm.: Studies on New Tech. and Practices in Communication, Global Data Management, vol. 8. IOS Press, Amsterdam (2006)
12. Tseng, Y.C., Ni, S.Y., Chen, Y.S., Sheu, J.P.: The broadcast storm problem in a mobile ad hoc network. Wireless Networks 8(2/3), 153–167 (2002)
13. Tseng, Y.C., Ni, S.Y., Shih, E.Y.: Adaptive approaches to relieving broadcast storms in a wireless multihop mobile ad hoc network. In: Proc. of the 21st Int. Conf. on Distributed Computing Systems, p. 481. IEEE, Los Alamitos (2001)
14. Williams, B., Camp, T.: Comparison of broadcasting techniques for mobile ad hoc networks. In: Proc. of the 3rd ACM Int. Symp. on Mobile Ad Hoc Networking & Computing, pp. 194–205. ACM, New York (2002)
15. Woo, A., Madden, S., Govindan, R.: Networking support for query processing in sensor networks. Commun. ACM 47(6), 47–52 (2004)

On Deploying Tree Structured Agent Applications in Networked Embedded Systems

Nikos Tziritas (1,3), Thanasis Loukopoulos (2,3), Spyros Lalis (1,3), and Petros Lampsas (2,3)

(1) Dept. of Computer and Communication Engineering, Univ. of Thessaly, Glavani 37, 38221 Volos, Greece {nitzirit,lalis}@inf.uth.gr
(2) Dept. of Informatics and Computer Technology, Technological Educational Institute (TEI) of Lamia, 3rd km. Old Ntl. Road Athens, 35100 Lamia, Greece {luke,plam}@teilam.gr
(3) Center for Research and Technology Thessaly (CERETETH), Volos, Greece

Abstract. Given an application structured as a set of communicating mobile agents and a set of wireless nodes with sensing/actuating capabilities and agent hosting capacity constraints, the problem of deploying the application consists of placing all the agents on appropriate nodes without violating the constraints. This paper describes distributed algorithms that perform agent migrations until a “good” mapping is reached, the optimization target being the communication cost due to agent-level message exchanges. All algorithms are evaluated using simulation experiments and the most promising approaches are identified.

1 Introduction

Mobile code technologies for networked embedded systems, like Agilla [1], SmartMessages [2], Rovers [3] and POBICOS [4], allow the programmer to structure an application as a set of mobile components that can be placed on different nodes based on their computing resources and sensing/actuating capabilities. From a system perspective, the challenge is to optimize such a placement taking into account the message traffic between application components.
This paper presents distributed algorithms for the dynamic migration of mobile components, referred to as agents, in a system of networked nodes, with the objective of reducing the network load due to agent-level communication. The proposed algorithms are simple, so they can be implemented on nodes with limited memory and computing capacity. Also, modest assumptions are made regarding the knowledge of the routing paths used for message transport. The algorithms rely on information that can be provided by even simple networking or middleware logic without incurring (significant) additional communication overhead.
The contributions of the paper are the following: (i) we identify and formulate the agent placement problem (APP) in a way that is of practical use to the POBICOS middleware but can also prove useful to other work on mobile agent systems with placement constraints; (ii) we present a distributed algorithm that relies on minimal network knowledge and extend it so that it can exploit additional information about the underlying network topology (if available); (iii) we evaluate both algorithm variants via simulations and discuss their performance.

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 490–502, 2010. © Springer-Verlag Berlin Heidelberg 2010

2 Application and System Model, Problem Formulation

This section introduces the type of applications targeted in this work and the underlying system and network model. It then formulates the agent placement problem (APP) and the respective optimization objectives.

2.1 Application Model

We focus on applications that are structured as a set of cooperating agents organized in a hierarchy. For instance, consider a demand-response client which tries to reduce power consumption upon request of the energy utility. A simplified possible structure is shown in Fig. 1. The lowest level of the tree comprises agents that periodically report individual device status and power consumption to a room agent, which reports (aggregated) data for the entire room to the root agent. When the root decides to lower power consumption (responding to a request issued by the electric utility), it requests some or all room agents to curb power consumption as needed. In turn, room agents trigger the respective actions (turn off devices, lower consumption level) in the end devices by sending requests to the corresponding device agents.
Leaf (sensing and actuating) agents interact with the physical environment and must be placed on nodes that provide the respective sensors or actuators and are located in the required areas; hence they are called node-specific. On the other hand, intermediate agents perform their tasks using just general-purpose computing resources, which can be provided by any node; thus we refer to these agents as node-neutral. In Fig. 1, device agents are node-specific while all other agents are node-neutral.
Agents can migrate between nodes to offload their current hosts or to get closer to the agents they communicate with. In our work we consider migration only for node-neutral agents, because their operation is location- and node-independent by design, while node-specific agents remain fixed on the nodes where they were created.
Still, the ability to migrate node-neutral agents creates a significant optimization potential in terms of reducing the overall communication cost.

2.2 System Model

We assume a network of capacitated (resource-constrained) nodes with sensing and/or actuating capabilities. Let n_i denote the i-th node, 1 ≤ i ≤ N, and r(n_i) its resource capacity (processing power or memory size). The capacity of a node imposes a generic constraint on the number of agents it can host. Nodes communicate with each other on top of a (wireless) network that is treated as a black box. The underlying routing topology is abstracted as a graph, its vertices representing nodes and each edge representing a bidirectional routing-level link between a node pair. In this work we consider tree-based routing, i.e., there is exactly one path connecting any two nodes. Let D be an N×N×N boolean matrix encoding the routing topology as follows: D_{ijx} = 1 iff the path from n_i to n_j includes n_x, else 0. Since we assume that the network is a tree, D_{ijx} = D_{jix}. Also, D_{iii} = 1, D_{ijj} = 1 and D_{iij} = 0. Let h_{ij} be the path length between n_i and n_j, equal to 0 for i = j. Obviously, h_{ij} = h_{ji}.
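Under the tree-routing assumption, both the hop-distance matrix h and the path-membership matrix D can be derived from neighbor lists alone. The sketch below (hypothetical Python, 0-based node indices, an assumed 4-node chain topology) computes h by BFS from each node and D via the identity that, on a tree, n_x lies on the unique i–j path iff h(i,x) + h(x,j) = h(i,j):

```python
from collections import deque

def hop_distances(adj, n):
    """All-pairs hop counts h[i][j] on a tree, via BFS from every node."""
    h = [[0] * n for _ in range(n)]
    for s in range(n):
        seen, q = {s}, deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    h[s][v] = h[s][u] + 1
                    q.append(v)
    return h

def on_path(adj, n):
    """D[i][j][x] = True iff node x lies on the unique i-j routing path."""
    h = hop_distances(adj, n)
    return [[[h[i][x] + h[x][j] == h[i][j] for x in range(n)]
             for j in range(n)] for i in range(n)]

# Hypothetical 4-node chain topology: n0 - n1 - n2 - n3
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
h = hop_distances(adj, 4)
D = on_path(adj, 4)
```

Note how the stated properties of D fall out of the construction: D[i][i][i] holds, D[i][j][j] holds, and D[i][i][j] is false for j ≠ i.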


Fig. 1. Agent tree structure of an indicative sensing/control application

Each application is structured as a set of cooperating agents organized in a tree, the leaf agents being node-specific and all other agents being node-neutral. Assuming an enumeration of agents whereby node-neutral agents come first, let a_k be the k-th agent, 1 ≤ k ≤ A+S, with A and S being equal to the total number of node-neutral and node-specific agents, respectively. Let r(a_k) be the capacity required to host a_k. Agent-level traffic is captured via an (A+S)×(A+S) matrix C, where C_{km} denotes the load from a_k to a_m (measured in data units over a time period). Note that C_{km} need not be equal to C_{mk}. Also, C_{kk} = 0 since an agent does not send messages to itself.

2.3 Problem Formulation

For the sake of generality we target the case where all agents are already hosted on some nodes, but the current placement is non-optimal. Let P be an N×(A+S) matrix used to encode the placement of agents on nodes as follows: P_{ik} = 1 iff n_i hosts a_k, 0 otherwise. The total network load L incurred by the application for a placement P can then be expressed as:

\[ L = \sum_{k=1}^{A+S} \sum_{m=1}^{A+S} C_{km} \sum_{i=1}^{N} \sum_{j=1}^{N} h_{ij} P_{ik} P_{jm} \tag{1} \]

A placement P is valid iff each agent is hosted on exactly one node and the node capacity constraints are not violated:

\[ \sum_{i=1}^{N} P_{ik} = 1, \quad \forall\, 1 \le k \le A+S \tag{2} \]

\[ \sum_{k=1}^{A+S} P_{ik}\, r(a_k) \le r(n_i), \quad \forall\, 1 \le i \le N \tag{3} \]

Also, a migration is valid only if, starting from a valid placement P, it leads to another valid agent placement P' without moving any node-specific agents:

\[ P_{ik} = P'_{ik}, \quad \forall\, A < k \le A+S \tag{4} \]


The agent placement problem (APP) can then be stated as: starting from an initial valid agent placement P_old, perform a series of valid agent migrations, eventually leading to a new valid placement P_new that minimizes (1). Note that the solution to APP may be a placement that is actually suboptimal in terms of (1). This is the case when the optimal placement is infeasible due to (3), more specifically, when it can be reached only by performing a "swap" that cannot be accomplished because there is not enough free capacity on any node. A similar feasibility issue is discussed in [5], but in a slightly different context. Also, (1) does not take into account the cost of performing a migration. This is because we target scenarios where the application structure, agent-level traffic pattern and underlying routing topology are expected to be sufficiently stable to amortize the migration costs.
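The objective (1) and the validity constraints (2)-(3) are straightforward to express in code. The following sketch (plain Python with hypothetical helper names and 0-based indices, not part of the paper's system) evaluates a placement given the traffic matrix C, the hop-distance matrix h, and the placement matrix P:

```python
def total_load(C, h, P):
    """Network load per (1): each unit of agent-level traffic costs one
    transmission per hop between the host nodes of the two agents."""
    N, AS = len(P), len(C)
    # host[k] = index of the (single) node hosting agent k, assuming (2) holds.
    host = [next(i for i in range(N) if P[i][k] == 1) for k in range(AS)]
    return sum(C[k][m] * h[host[k]][host[m]]
               for k in range(AS) for m in range(AS))

def is_valid(P, r_agents, r_nodes):
    """Placement validity per (2) and (3)."""
    N, AS = len(P), len(r_agents)
    one_host = all(sum(P[i][k] for i in range(N)) == 1 for k in range(AS))   # (2)
    capacity = all(sum(P[i][k] * r_agents[k] for k in range(AS)) <= r_nodes[i]
                   for i in range(N))                                        # (3)
    return one_host and capacity

# Hypothetical tiny instance: two nodes one hop apart, two agents exchanging
# 5 data units in each direction, hosted on different nodes.
C = [[0, 5], [5, 0]]
h = [[0, 1], [1, 0]]
P = [[1, 0], [0, 1]]
```

Here total_load(C, h, P) gives 10 (5 units each way over 1 hop); co-hosting both agents would drive the load to 0, which is exactly the kind of improvement the migration algorithms seek.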

3 Uncapacitated 1-Hop Agent Migration Algorithm

This section presents an agent migration algorithm for the case where nodes can host any number of agents, i.e., without taking into account capacity limitations. In terms of routing knowledge, each node knows only its immediate (1-hop) neighbors involved in transporting inbound and outbound agent messages; we refer to this as 1-hop network awareness. This information can be provided by even a very simple networking layer. A node does not attempt to discover additional nodes but simply considers migrating agents to one of its neighbors. An agent may nevertheless move to distant nodes via consecutive 1-hop migrations.

Description. The 1-hop agent migration algorithm (AMA-1) works as follows. A node records, for each locally hosted agent, the traffic associated with each neighboring node as well as the local traffic, due to the message exchange with remote and local agents, respectively. Periodically, this information is used to decide if it is beneficial for the agent to migrate to a neighbor. More formally, let l_{ijk} denote the load associated with agent a_k hosted at node n_i for a neighbor node n_j (notice that n_i does not have to know D to compute l_{ijk}):

\[ l_{ijk} = \sum_{m=1}^{A+S} (C_{km} + C_{mk})\, D_{ixj}, \quad P_{xm} = 1 \tag{5} \]

The decision to migrate a_k from n_i to n_j is taken iff l_{ijk} is greater than the total load with all other neighbors of n_i plus the local load associated with a_k:

\[ l_{ijk} > l_{iik} + \sum_{x \ne i,j} l_{ixk}, \quad h_{ij} = h_{ix} = 1 \tag{6} \]

The intuition behind (6) is that by moving ak from its current host ni to a neighbor nj, the distance for the load with nj decreases by one hop while the distance for all other loads, including the load that used to take place locally, increases by one hop. If (6) holds, the cost-benefit of the migration is positive, hence the migration reduces the total network load as per (1).
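Rule (6) is cheap to evaluate locally. A minimal sketch of the decision, in hypothetical Python (the function name and load bookkeeping are illustrative, not the paper's implementation):

```python
def migration_target(local_load, neighbor_loads):
    """AMA-1 check (6) for one agent. neighbor_loads maps each 1-hop neighbor
    to l_ijk, the agent's traffic routed through that neighbor; local_load is
    l_iik, the traffic with co-hosted agents. Returns the neighbor to migrate
    to, or None if staying put is at least as good."""
    total = local_load + sum(neighbor_loads.values())
    for node, load in neighbor_loads.items():
        # Moving toward `node` shortens this load by one hop and lengthens
        # all other loads (including the formerly local one) by one hop,
        # so the move pays off iff this load outweighs all the rest combined.
        if load > total - load:
            return node
    return None
```

With the load values quoted in the example of Fig. 2 below (a3 hosted at n1: 2 units via n2, 3 via n3, 0 locally), the function returns n3; after the move (2 via n1, 2 via n5, 0 via n6, 1 locally) it returns None, i.e., a3 stays at n3.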


Fig. 2. (a) Application agent structure; (b) Agent placement on the network

Consider the application depicted in Fig. 2a, which comprises four node-specific agents (a4, a5, a6, a7), two intermediate node-neutral agents (a2, a3) and a node-neutral root agent (a1), and the agent placement on nodes shown in Fig. 2b. Let each node-specific agent generate 2 data units per time unit towards its parent, which in turn generates 1 data unit per time unit towards the root (edge values in Fig. 2a). Assume that n1 runs the algorithm for a3 (striped). The loads associated with a3 for the neighbor nodes n2 and n3 are l_{123} = 2 and l_{133} = 3, respectively, while the local load is l_{113} = 0. According to (6), the only beneficial migration for a3 is to move to n3. Continuing the example, assume that a3 indeed migrates to n3 and is (again) checked for migration. This time the relevant loads are l_{313} = 2, l_{353} = 2, l_{363} = 0, l_{333} = 1, thus a3 will remain at n3. Similarly, a1 will remain at n3, while a2 will eventually migrate from n4 to n2, then to n1, and finally to n3, resulting in a placement where all node-neutral agents are hosted at n3. This placement is stable, since there is no beneficial migration as per (6).
Implementation and complexity. For each local agent it is required to record the load with each neighboring node and the load with other locally hosted agents. This can be done using an A'×(g+1) load table, where A' is the number of local node-neutral agents and g is the node degree (number of neighbors). The destination for each agent can then be determined as per (6) in a single pass across the respective row of the load table, in O(g) operations, or a total of O(gA') for all agents. Note that the results of this calculation remain valid as long as the underlying network topology, application structure and agent message traffic statistics do not change.
Convergence. The algorithm does not guarantee convergence because it is susceptible to livelocks. Revisiting the previous example, assume that the application consists only of the right-hand subtree of Fig. 2a, placed as in Fig. 2b. Node n1 may decide to move a3 to n3, while n3 may decide to move a1 to n1. Later on, the same migrations may be performed in the reverse direction, restoring the old placement, and so on. We expect such livelocks to be rare in practice, especially if neighboring nodes invoke the algorithm at different intervals. Nevertheless, to guarantee convergence we introduce a coordination scheme in the spirit of a mutual exclusion protocol. When n_i decides to migrate a_k to n_j it asks for permission. To avoid "swaps", n_j denies this


request if: (i) it hosts an agent a_k' that is the child or the parent of a_k, (ii) it has decided to migrate a_k' to n_i, and (iii) the identifier of n_j is smaller than that of n_i (j < i).
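The denial rule can be sketched as follows (hypothetical Python; the data structures and names are illustrative, not the POBICOS API). Condition (iii) orders the two conflicting nodes by identifier so that exactly one side backs off:

```python
def denies(nj, ni, ak, hosted, planned, parent):
    """n_j's permission check for n_i's request to migrate agent a_k onto n_j.
    hosted: set of agents currently on n_j; planned: n_j's own pending
    migrations (agent -> destination node id); parent: agent -> parent agent
    in the application tree (root maps to nothing)."""
    # Agents related to a_k in the application tree: its parent and children.
    related = {parent.get(ak)} | {a for a, p in parent.items() if p == ak}
    for ak2 in hosted & related:
        # Deny iff n_j itself plans the symmetric move and has the smaller id.
        if planned.get(ak2) == ni and nj < ni:
            return True
    return False
```

In the livelock scenario above, n1 (id 1) wants a3 on n3 while n3 (id 3) wants a1 on n1: only n1 denies (it has the smaller identifier), so exactly one of the two symmetric migrations proceeds and the swap cycle is broken.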