Job Scheduling Strategies for Parallel Processing: 9th International Workshop, JSSPP 2003, Seattle, WA, USA, June 24, 2003, Revised Papers (Lecture Notes in Computer Science, 2862) 3540204059, 9783540204053


Table of contents:
Frontmatter
Scheduling in HPC Resource Management Systems: Queuing vs. Planning
TrellisDAG: A System for Structured DAG Scheduling
SLURM: Simple Linux Utility for Resource Management
OurGrid: An Approach to Easily Assemble Grids with Equitable Resource Sharing
Scheduling of Parallel Jobs in a Heterogeneous Multi-site Environment
A Measurement-Based Simulation Study of Processor Co-allocation in Multicluster Systems
Grids for Enterprise Applications
Performance Estimation for Scheduling on Shared Networks
Scaling of Workload Traces
Gang Scheduling Extensions for I/O Intensive Workloads
Parallel Job Scheduling under Dynamic Workloads
Backfilling with Lookahead to Optimize the Performance of Parallel Job Scheduling
QoPS: A QoS Based Scheme for Parallel Job Scheduling
Backmatter

Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen

2862

Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Tokyo

Dror Feitelson Larry Rudolph Uwe Schwiegelshohn (Eds.)

Job Scheduling Strategies for Parallel Processing 9th International Workshop, JSSPP 2003 Seattle, WA, USA, June 24, 2003 Revised Papers

Series Editors
Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Volume Editors
Dror Feitelson
The Hebrew University, School of Computer Science and Engineering
91904 Jerusalem, Israel
E-mail: [email protected]

Larry Rudolph
Massachusetts Institute of Technology, Laboratory for Computer Science
Cambridge, MA 02139, USA
E-mail: [email protected]

Uwe Schwiegelshohn
University of Dortmund, Computer Engineering Institute
44221 Dortmund, Germany
E-mail: [email protected]

Cataloging-in-Publication Data applied for. A catalog record for this book is available from the Library of Congress. Bibliographic information published by Die Deutsche Bibliothek: Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available in the Internet at .

CR Subject Classification (1998): D.4, D.1.3, F.2.2, C.1.2, B.2.1, B.6, F.1.2
ISSN 0302-9743
ISBN 3-540-20405-9 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable to prosecution under the German Copyright Law.

Springer-Verlag Berlin Heidelberg New York, a member of BertelsmannSpringer Science+Business Media GmbH
http://www.springeronline.com

© Springer-Verlag Berlin Heidelberg 2003
Printed in Germany

Typesetting: Camera-ready by author, data conversion by DA-TeX Gerd Blumenstein
Printed on acid-free paper
SPIN: 10968987 06/3142 5 4 3 2 1 0

Preface

This volume contains the papers presented at the 9th workshop on Job Scheduling Strategies for Parallel Processing, which was held in conjunction with HPDC12 and GGF8 in Seattle, Washington, on June 24, 2003. The papers went through a complete review process, with the full version being read and evaluated by five to seven members of the program committee. We would like to take this opportunity to thank the program committee, Su-Hui Chiang, Walfredo Cirne, Allen Downey, Wolfgang Gentzsch, Allan Gottlieb, Moe Jette, Richard Lagerstrom, Virginia Lo, Cathy McCann, Reagan Moore, Bill Nitzberg, Mark Squillante, and John Towns, for an excellent job. Thanks are also due to the authors for their submissions, presentations, and final revisions for this volume. Finally, we would like to thank the MIT Laboratory for Computer Science and the School of Computer Science and Engineering at the Hebrew University for the use of their facilities in the preparation of these proceedings.

This year we had papers on three main topics. The first was continued work on conventional parallel systems, including infrastructure and scheduling algorithms. Notable extensions include the consideration of I/O and QoS issues. The second major theme was scheduling in the context of grid computing, which continues to be an area of much activity and rapid progress. The third area was the methodological aspects of evaluating the performance of parallel job scheduling.

This was the ninth annual workshop in this series, which reflects the continued interest in this area. The proceedings of previous workshops are available from Springer-Verlag as LNCS volumes 949, 1162, 1291, 1459, 1659, 1911, 2221, and 2537, for the years 1995 to 2002, respectively. Except for the first three, they are also available on-line. We hope you find these papers interesting and useful.

August 2003

Dror Feitelson Larry Rudolph Uwe Schwiegelshohn

Scheduling in HPC Resource Management Systems: Queuing vs. Planning

Matthias Hovestadt (1), Odej Kao (1,2), Axel Keller (1), and Achim Streit (1)

(1) Paderborn Center for Parallel Computing, University of Paderborn, 33102 Paderborn, Germany
{maho,kel,streit}@upb.de

(2) Faculty of Computer Science, Electrical Engineering and Mathematics, University of Paderborn, 33102 Paderborn, Germany
[email protected]

Abstract. Nearly all existing HPC systems are operated by resource management systems based on the queuing approach. With the increasing acceptance of grid middleware like Globus, new requirements for the underlying local resource management systems arise. Features like advanced reservation or quality of service are needed to implement high level functions like co-allocation. However it is difficult to realize these features with a resource management system based on the queuing concept since it considers only the present resource usage. In this paper we present an approach which closes this gap. By assigning start times to each resource request, a complete schedule is planned. Advanced reservations are now easily possible. Based on this planning approach functions like diffuse requests, automatic duration extension, or service level agreements are described. We think they are useful to increase the usability, acceptance and performance of HPC machines. In the second part of this paper we present a planning based resource management system which already covers some of the mentioned features.

1 Introduction

A modern resource management system (RMS) for high performance computing (HPC) machines consists of many vital components. Assuming that they all work properly, the scheduler plays a major role when issues like acceptance, usability, or performance of the machine are considered. Much research was done over the last decade to improve scheduling strategies [8, 9]. Nowadays supercomputers become more and more heterogeneous in their architecture and configuration (e. g. special visualization nodes in a cluster). However, current resource management systems are often not flexible enough to reflect these changes. Additionally, new requirements from the upcoming grid environments [11] arise (e. g. guaranteed resource usage).

The grid idea is similar to the former metacomputing concept [25] but takes a broader approach. More different types of resources are joined besides supercomputers: network connections, data archives, VR-devices, or physical sensors


and actors. The vision is to make them accessible similar to the power grid, regardless where the resources are located or who owns them. Many components are needed to make this vision real. Similar to a resource management system for a single machine a grid scheduler or co-allocator is of major importance when aspects of performance, usability, or acceptance are concerned. Obviously the functionality and performance of a grid scheduler depends on the available features of the underlying local resource management systems. Currently the work in this area is in a difficult but also challenging situation. On the one hand requirements from the application are specified. Advanced reservations and information about future resource usage are mandatory to guarantee that a multi-site [2, 6] application starts synchronously. Only a minority of the available resource management systems provide these features. On the other hand the specification process and its results are influenced by the currently provided features of queuing based resource management systems like NQE/NQS, Loadleveler, or PBS. We present an approach that closes this gap between the two levels RMS and grid middleware. This concept provides complete knowledge about start times of all requests in the system. Therefore advanced reservations are implicitly possible. The paper begins with a classification of resource management systems. In Sect. 3 we present enhancements like diffuse resource requests, resource reclaiming, or service level agreement (SLA) aware scheduling. Sect. 4 covers an existing implementation of a resource management system, which already realizes some of the mentioned functions. A brief conclusion closes the paper.

2 Classification of Resource Management Systems

Before we start with classifying resource management systems we define some terms that are used in the following.

– The term scheduling stands for the process of computing a schedule. This may be done by a queuing or planning based scheduler.
– A resource request contains two information fields: the number of requested resources and a duration for how long the resources are requested.
– A job consists of a resource request as above plus additional information about the associated application. Examples are information about the processing environment (e. g. MPI or PVM), file I/O and redirection of stdout and stderr streams, the path and executable of the application, or startup parameters for the application. We neglect the fact that some of this extra job data may indeed be needed by the scheduler, e. g. to check the number of available licenses.
– A reservation is a resource request starting at a specified time for a given duration.

In the following the term Fix-Time request denotes a reservation, i. e. it cannot be shifted on the time axis. The term Var-Time request stands for a resource


request which can move on the time axis to an earlier or later time (depending on the used scheduling policy). In this paper we focus on space-sharing, i. e. resources are exclusively assigned to jobs.

The criterion for the differentiation of resource management systems is the planned time frame. Queuing systems try to utilize currently free resources with waiting resource requests. Future resource planning for all waiting requests is not done. Hence waiting resource requests have no proposed start time. Planning systems in contrast plan for the present and future. Planned start times are assigned to all requests and a complete schedule about the future resource usage is computed and made available to the users. A comprehensive overview is given in Tab. 1 at the end of this section.

2.1 Queuing Systems

Today almost all resource management systems fall into the class of queuing systems. Several queues with different limits on the number of requested resources and the duration exist for the submission of resource requests. Jobs within a queue are ordered according to a scheduling policy, e. g. FCFS (first come, first serve). Queues might be activated only for specific times (e. g. prime time, non prime time, or weekend). Examples for queue configurations are found in [30, 7].

The task of a queuing system is to assign free resources to waiting requests. The highest prioritized request is always the queue head. If it is possible to start more than one queue head, further criteria like queue priority or best fit (e. g. leaving least resources idle) are used to choose a request. If not enough resources are available to start any of the queue heads, the system waits until enough resources become available. These idle resources may be utilized with less prioritized requests by backfilling mechanisms. Two backfilling variants are commonly used:

– Conservative backfilling [21]: Requests are chosen so that no other waiting request (including the queue head) is further delayed.
– EASY backfilling [18]: This variant is more aggressive than conservative backfilling since only the waiting queue head must not be delayed.

Note, a queuing system does not necessarily need information about the duration of requests, unless backfilling is applied. Although queuing systems are commonly used, they also have drawbacks. Due to their design no information is provided that answers questions like “Is tomorrow’s load high or low?” or “When will my request be started?”. Hence advanced reservations are troublesome to implement, which in turn makes it difficult to participate in a multi-site grid application run. Of course workarounds with high priority queues and dummy requests were developed in the past. Nevertheless the ‘cost of scheduling’ is low and choosing the next request to start is fast.
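To make the difference concrete, the following sketch shows one EASY backfilling step. The data structures are hypothetical and not taken from any particular RMS; conservative backfilling would differ only in protecting every waiting request, not just the queue head.

```python
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    nodes: int      # number of requested nodes
    runtime: int    # estimated duration in seconds

def easy_backfill_step(queue, free_nodes, running, now):
    """Pick a request to start now without delaying the queue head.

    `queue` is the ordered list of waiting requests; `running` is a list
    of (finish_time, nodes) pairs for the requests currently executing.
    """
    if not queue:
        return None
    head = queue[0]
    if head.nodes <= free_nodes:
        return head                        # the head itself can start

    # Shadow time: the moment enough nodes accumulate for the head.
    nodes, shadow, spare = free_nodes, None, 0
    for finish, n in sorted(running):
        nodes += n
        if nodes >= head.nodes:
            shadow = finish                # head is planned to start here
            spare = nodes - head.nodes     # nodes the head will not use
            break
    if shadow is None:
        return None                        # head can never be satisfied

    # Backfill a later request only if it does not push the head back.
    for req in queue[1:]:
        if req.nodes <= free_nodes and (
                now + req.runtime <= shadow or req.nodes <= spare):
            return req
    return None
```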

2.2 Planning Systems

Planning systems do resource planning for the present and future, which results in an assignment of start times to all requests. Obviously duration estimates are mandatory for this planning. With this knowledge advanced reservations are easily possible. Hence planning systems are well suited to participate in grid environments and multi-site application runs.

There are no queues in planning systems. Every incoming request is planned immediately. Planning systems are not restricted to the mentioned scheduling policies FCFS, SJF (shortest job first), and LJF (longest job first). Each time a new request is submitted or a running request ends before it was estimated to end, a new schedule has to be computed. All non-reservations (i.e. Var-Time requests) are deleted from the schedule and sorted according to the scheduling policy. Then they are reinserted in the schedule at the earliest possible start time. We call this process replanning. Note, with FCFS the replanning process is not necessary, as new requests are simply placed as soon as possible in the schedule without discarding the current schedule.

Obviously some sort of backfilling is implicitly done during the replanning process. As requests are placed as soon as possible in the current schedule, they might be placed in front of already planned requests. However, these previously placed requests are not delayed (i. e. planned at a later time), as they already have a proposed start time assigned. Of course other more sophisticated backfilling strategies exist (e. g. slack-based backfilling [28]), but in this context we focus on the easier variant of backfilling.

Controlling the usage of the machine, as it is done in a queuing system by activating different queues for e. g. prime and non prime time, has to be done differently in a planning system. One way is to use time dependent constraints for the planning process (cf. Figure 1), e. g. “during prime time no requests with more than 75% of the machines resources are placed”. Also project or user specific limits are possible so that the machine size is virtually decreased.

[Figure 1 plots available system resources (25% to 100%) against time from Friday 18:00 through Saturday 07:00/18:00 to Sunday 07:00, showing a system wide node limit and separate limits for projects A and B.]

Fig. 1. Time dependent limits in a planning system
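A minimal sketch of the replanning step described above. The request objects and the discretized free-node profile are illustrative assumptions, not part of any concrete planning system; sorting by submission time gives FCFS, sorting by estimated duration gives SJF or LJF.

```python
def replan(requests, policy_key, total_nodes, horizon=7 * 24 * 3600, step=60):
    """Recompute a complete schedule for the present and future.

    Fix-Time reservations keep their start time; all Var-Time requests
    are removed, sorted by the active policy and reinserted at the
    earliest feasible start time (implicit backfilling).
    Each request has .nodes, .duration, .start and .fixed attributes.
    """
    slots = horizon // step
    free = [total_nodes] * slots              # free-node profile over time

    def occupy(req):
        first = req.start // step
        last = min(slots, -(-(req.start + req.duration) // step))
        for s in range(first, last):
            free[s] -= req.nodes

    for req in (r for r in requests if r.fixed):
        occupy(req)                           # reservations stay in place

    for req in sorted((r for r in requests if not r.fixed), key=policy_key):
        length = -(-req.duration // step)     # ceiling division
        for t in range(slots - length):
            if all(free[s] >= req.nodes for s in range(t, t + length)):
                req.start = t * step          # proposed start time
                occupy(req)
                break
    return requests
```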


Table 1. Differences of queuing and planning systems

                                     queuing system      planning system
  planned time frame                 present             present and future
  submission of resource requests    insert in queues    replanning
  assignment of proposed start time  no                  all requests
  runtime estimates                  not necessary (1)   mandatory
  reservations                       not possible        yes, trivial
  backfilling                        optional            yes, implicit
  examples                           PBS, NQE/NQS, LL    CCS, Maui Scheduler (2)

(1) Exception: backfilling.
(2) According to [15] Maui may be configured to operate like a planning system.

Times for allocating and releasing a partition are vital for computing a valid schedule. Assume two successive requests (A and B) using the same nodes and B has been planned one second after A. Hence A has to be released in at most one second. Otherwise B will be configured while A is still occupying the nodes. This delay would also affect all subsequent requests, since their planned allocation times depend on the release times of their predecessors.

Planning systems also have drawbacks. The cost of scheduling is higher than in queuing systems. And as users can view the current schedule and know when their requests are planned, questions like “Why is my request not planned earlier? Look, it would fit in here.” are most likely to occur.

3 Advanced Planning Functions

In this section we present features which benefit from the design of planning systems. Although a planning system is not strictly required for implementing these functionalities, its design significantly simplifies their realization.

3.1 Requesting Resources

The resources managed by an RMS are offered to the user by means of a set of attributes (e.g. nodes, duration, amount of main memory, network type, file system, software licenses, etc.). Submitting a job or requesting a resource demands the user to specify the needed resources. This normally has to be done either exactly (e.g. 32 nodes for 2 hours) or by specifying lower bounds (e.g. minimal 128 MByte memory). If an RMS should support grid environments two additional features are helpful: diffuse requests and negotiations.

Diffuse Requests. We propose two versions. Either the user requests a range of needed resources, like “Need at least 32 and at most 128 CPUs”. Or the RMS “optimizes” one or more of the provided resource attributes itself. Examples are “Need Infiniband or Gigabit-Ethernet on 128 nodes” or “Need as many nodes as possible for as long as possible as soon as possible”. Optimizing requires an objective function. The user may define a job specific one, otherwise a system default is taken.

Diffuse requests increase the degree of freedom of the scheduler because the amount of possible placements is larger. The RMS needs an additional component which collaborates with the scheduler. Figure 2 depicts how this optimizer is integrated in the planning process. With the numbered arrows representing the control flows, several scenarios can be described. For example, placing a reservation with a given start time results in the control flow ‘1,4’. Planning a diffuse request results in ‘1,2,3,4’.

[Figure 2 shows the user interface/API, the scheduler, the optimizer, and the HPC system/application connected by numbered control flows 1 to 6.]

Fig. 2. RMS supporting diffuse requests

Negotiation. One of the advantages of a grid environment is the ability to co-allocate resources (i.e. using several different resources at the same time). However one major problem of co-allocation is how to specify and reserve the resources. Like booking a journey with flights and hotels, this often is an iterative process, since the requested resources are not always available at the needed time or in the desired quality or quantity. Additionally, applications should be able to request the resources directly without human intervention. All this demands a negotiation protocol to reserve resources. Using diffuse requests eases this procedure. Referring to Fig. 2, negotiating a resource request (with a user or a co-allocation agent) would use the paths ‘1,2,3,6,1,...’. Negotiation protocols like SNAP [4] are mandatory to implement service level agreements (cf. Sect. 3.3).
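As an illustration only (this is not the interface of any existing RMS), a range-style diffuse request and its resolution against a user-supplied or default objective function might look like this:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class DiffuseRequest:
    min_nodes: int
    max_nodes: int
    duration: int
    # Objective over (nodes, start_time); if the user defines none,
    # a system default is used.
    objective: Optional[Callable[[int, int], float]] = None

def resolve(req, earliest_start, machine_nodes):
    """Turn a diffuse request into a concrete (nodes, start) placement.

    `earliest_start(nodes, duration)` is assumed to be provided by the
    planner: it returns the earliest feasible start for that shape.
    """
    default = lambda nodes, start: nodes / (1.0 + start)   # big and early
    score = req.objective or default
    best = None
    for nodes in range(req.min_nodes, min(req.max_nodes, machine_nodes) + 1):
        placement = (nodes, earliest_start(nodes, req.duration))
        if best is None or score(*placement) > score(*best):
            best = placement
    return best
```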

3.2 Dynamic Aspects

HPC systems are getting more and more heterogeneous, both in hardware and software. For example, they may comprise different node types, several communication networks, or special purpose hardware (e. g. FPGA cards). Using a deployment server allows to provide several operating system variants for special purposes (e.g. real time or support of special hardware). This allows to tailor the operating system to the application to utilize the hardware in the best possible


way. All these dynamic aspects should be reflected and supported by an RMS. The following sections illuminate some of these problem areas.

Variable Reservations. Fix-Time requests are basically specified like Var-Time requests (cf. Sect. 2). They come with information about the number of resources, the duration, and the start time. The start time can be specified either explicitly by giving an absolute time or implicitly by giving the end time or keywords like ASAP or NOW. However the nature of a reservation is that its start time is fixed. Assume the following scenario: if a user wants to make a resource reservation as soon as possible, the request is planned according to the situation when the reservation is submitted. If jobs planned before this reservation end earlier than estimated, the reservation will not move forward. Hence we introduce variable reservations. They are requested like normal reservations as described above, but with an additional parameter (e. g. -vfix). This flag causes the RMS to handle the request like a Var-Time request. After allocating the resources on the system, the RMS automatically switches the type of the request to a reservation and notifies the user (e. g. by sending an email) that the resources are now accessible. In contrast to a Var-Time request a variable reservation is never planned later than its first planned start time.

Resource Reclaiming. Space-sharing is commonly applied to schedule HPC applications because the resources are assigned exclusively. Parallel applications (especially from the domain of engineering technology) often traverse several phases (e.g. computation, communication, or checkpointing) requiring different resources. Such applications are called malleable or evolving [10] and should be supported by an RMS. Ideally, the application itself is able to communicate with the RMS via an API to request additional resources (duration, nodes, bandwidth, etc.) or to release resources at runtime. If an HPC system provides multiple communication networks (e.g. Gigabit, Infiniband, and Myrinet) combined with an appropriate software layer (e.g. Direct Access Transport (DAT) [24, 5]) it is possible to switch the network at runtime. For example, assume an application with different communication phases: one phase needs low latency whereas another phase needs large bandwidth. The running application may now request more bandwidth: “Need either 100 MBytes/s for 10 minutes or 500 MBytes/s for 2 minutes”. According to Fig. 2 the control flow ‘5,2,3,4,5...’ would be used to negotiate this diffuse request. Until now DAT techniques are often used to implement a failover mechanism to protect applications against network breakdowns. However, it should also be possible that the RMS causes the application to (temporarily) switch to another network in order to make the high speed network available to another application. This would increase the overall utilization of the system. It also helps to manage jobs with a deadline.

Automatic Duration Extension. Estimating the job runtime is a well known problem [21, 26]. It is annoying if a job is aborted shortly before termination


because the results are lost and the resources were wasted. Hence users tend to overestimate their jobs by a factor of at least two to three [27] to ensure that their jobs will not be aborted. A simple approach to help the users is to allow to extend the runtime of jobs while they are running. This might solve the problem, but only if the schedule allows the elongation (i. e. subsequent jobs are not delayed). A more sophisticated approach allows the delay of Var-Time requests, because delaying these jobs might be more beneficial than killing the running job and processing the resubmitted similar job with a slightly extended runtime. The following constraints have to be considered:

– The length of an extension has to be chosen precisely, as it has a strong influence on the costs of delaying other jobs. For example, extending a one day job by 10 minutes seems to be ok. However if all other waiting jobs are only 1 minute long, they would have to wait for 10 additional minutes. On the other hand these jobs may have already waited for half a day, so 10 minutes extra would not matter. The overall throughput of the machine (measured in useful results per time unit) would be increased substantially.
– The number of granted extensions: is once enough, or should it be possible to elongate the duration twice or even three times?

Although many constraints have to be kept in mind in this automatic extension process, we think that in some situations delaying subsequent jobs might be more beneficial than dealing with useless jobs that are killed and generated no result. Of course reservations must not be moved, although policies are thinkable which explicitly allow to move reservations in certain scenarios (cf. Sect. 3.3). In the long run automatic duration extension might also result in a more precise estimation behavior of the users as they need no longer be afraid of losing results due to aborted jobs. In addition to the RMS driven extension an application driven extension is possible [14].

Automatic Restart. Many applications need a runtime longer than anything allowed on the machine. Such applications often checkpoint their state and are resubmitted if they have been aborted by the RMS at the end of the requested duration. The checkpointing is done cyclically, either driven by a timer or after a specific amount of computing steps. With a restart functionality it is possible to utilize even short time slots in the schedule to run such applications. Of course the time slot should be longer than a checkpoint interval, so that no results of the computation are lost. If checkpointing is done every hour, additional information provided could be: “the runtime should be x full hours plus n minutes” where n is the time the application needs for checkpointing.

If the application is able to catch signals the user may specify a signal (e. g. USR1) and the time needed to checkpoint. The RMS is now able to send the given checkpoint signal in time, enforcing the application to checkpoint. After waiting the given time the RMS stops the application. This allows to utilize time


slots shorter than a regular checkpoint cycle. The RMS automatically resubmits the job until it terminates before its planned duration.

Space Sharing “Cycle Stealing”. Space-sharing can result in unused slots since jobs do not always fit together to utilize all available resources. These gaps can be exploited by applications which may be interrupted and restarted arbitrarily. This is useful for users who run “endless” production runs but do not need the results with a high priority. Such jobs do not appear in the schedule since they run in the “background”, stealing the idle resources in a space sharing system. This is comparable to the well known approach in time-sharing environments (Condor [19]). A prerequisite is that the application is able to checkpoint and restart. Similar to applying the “automatic restart” functionality, a user submits such a job by specifying the needed resources and a special flag causing the RMS to run this job in the background. Optionally the user may declare a signal enforcing a checkpoint. Specifying the needed resources via a diffuse request would enable the RMS to optimize the overall utilization when planning multiple “cycle stealing” requests. For example, assume two “cycle stealing” requests: A and B. A always needs 8 nodes and B runs on 4 to 32 nodes. If 13 nodes are available the RMS may assign 8 nodes to A and 5 to B.

Deployment Servers. Computing centers provide numerous different services (especially commercial centers). Until now they often use spare hardware to cope with peak demands. These spare parts are still often configured manually (operating system, drivers, etc.) to match the desired configuration. It would be more convenient if such a reconfiguration were done automatically. The user should be able to request something like:

– “Need 5 nodes running RedHat x.y with kernel patch z from 7am to 5pm.”
– “Need 64 nodes running ABAQUS on SOLARIS including 2 visualization nodes.”

Now the task of the RMS is to plan both the requested resources and the time to reconfigure the hardware.
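The checkpoint-and-stop sequence for filling a short slot could be driven as sketched below. The job object, its attributes and the resubmit callback are invented for illustration; only the overall order (send the signal, wait for the checkpoint time, stop, resubmit) follows the description above.

```python
import time

def run_in_slot(job, slot_seconds, resubmit):
    """Run a checkpointable job in a schedule gap shorter than its runtime.

    job.checkpoint_signal (e.g. SIGUSR1) and job.checkpoint_seconds are
    given by the user at submit time; job.launch() is assumed to return a
    process handle with poll(), send_signal() and terminate().
    """
    proc = job.launch()
    stop_at = time.time() + slot_seconds - job.checkpoint_seconds

    while proc.poll() is None and time.time() < stop_at:
        time.sleep(1)                             # job may finish on its own

    if proc.poll() is None:
        proc.send_signal(job.checkpoint_signal)   # ask for a checkpoint
        time.sleep(job.checkpoint_seconds)        # give it time to write
        proc.terminate()                          # free the slot
        resubmit(job)                             # continue in a later slot
```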

3.3 Service Level Agreements

Emerging network-oriented applications demand certain resources during their lifetime [11], so that the common best effort approach is no longer sufficient. To reach the desired performance it is essential that a specified amount of processors, network bandwidth or harddisk capacity is available at runtime. To fulfill the requirements of an application profile the computing environment has to provide a particular quality of service (QoS). This can be achieved by a reservation and a following allocation of the corresponding resources [12]


for a limited period, which is called advanced reservation [20]. It is defined as “[..] the process of negotiating the (possibly limited or restricted) delegation of particular resource capabilities over a defined time interval from the resource owner to the requester” [13].

Although the advanced reservation allows the application to start as planned, this is not enough for the demands of the real world. At least the commercial user is primarily interested in an end-to-end service level agreement (SLA) [17], which is not limited to the technical aspects of an advanced reservation. According to [29] an SLA is “an explicit statement of the expectations and obligations that exist in a business relationship between two organizations: the service provider and the customer”. It also covers subjects like involved parties, validity period, scope of the agreement, restrictions, service-level objectives, service-level indicators, penalties, or exclusions [22].

Due to the high complexity of analyzing the regulations of an SLA and checking their fulfillment, a manual handling obviously is not practicable. Therefore SLAs have to be unambiguously formalized, so that they can be interpreted automatically [23]. However high flexibility is needed in formulating SLAs, since every SLA describes a particular requirement profile. This may range up to the definition of individual performance metrics.

The following example illustrates what an SLA may look like. Assume that the University of Foo commits that in the time between 10/18/2003 and 11/18/2003 every request of the user “Custom-Supplies.NET” for a maximum of 4 Linux nodes and 12 hours is fulfilled within 24 hours. Example 1 depicts the related WSLA [23] specification. It is remarkable that this SLA is not a precise reservation of resources, but only the option for such a request. This SLA is quite rudimentary and does not consider issues like reservation of network bandwidth, computing costs, contract penalty or the definition of custom performance metrics. However it is far beyond the scope of an advanced reservation.

Service level agreements simplify the collaboration between a service provider and its customers. Their fulfillment requires that the additional information provided by an SLA has to be considered not only in the scheduling process but also during the runtime of the application.

SLA-aware Scheduler. From the scheduler's point of view the SLA life cycle starts with the negotiation process. The scheduler is included into this process, since it has to agree to the requirements defined in the SLA. Once both sides have agreed on an SLA, the scheduler has to ensure that the resources according to the clauses of the SLA are available. At runtime the scheduler is not responsible for measuring the fulfillment of the SLA, but to provide all granted resources.

Dealing with hardware failures is important for an RMS. For an SLA-aware scheduler this is vital. For example, assume a hardware failure of one or more resources occurs. If there are jobs scheduled to run on the affected resources, these jobs have to be rescheduled to other resources to fulfill the agreed SLAs. If there are not enough free resources available, the scheduler has to cancel or


Example 1 A Service Level Agreement specified in WSLA

[WSLA XML listing omitted; it names the parties Uni-Foo.DE and Custom-Supplies.NET, a validity period from Sat Oct 18 00:00:00 CET 2003 to Tue Nov 18 00:00:00 CET 2003, and a Mon–Fri, 0AM–12PM service schedule.]
delay scheduled jobs or to abort running jobs to ensure the fulfillment of the SLA. Since an agreement on the level of service has been made, it is not sufficient to use simple policies like FCFS. With SLAs the amount of job specific attributes increases significantly. These attributes have to be considered during the scheduling process.
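One possible shape for the failure handling just described, with all helper methods assumed rather than taken from a real scheduler:

```python
def handle_failure(failed_nodes, schedule):
    """Reschedule jobs hit by a hardware failure, protecting SLAs first."""
    affected = [j for j in schedule.jobs() if j.uses(failed_nodes)]

    # Jobs with an SLA are replanned before best-effort jobs.
    for job in sorted(affected, key=lambda j: j.has_sla, reverse=True):
        if schedule.replan(job, exclude=failed_nodes):
            continue                            # found new resources
        if not job.has_sla:
            schedule.cancel(job)                # no guarantee is violated
            continue
        # Make room by delaying or cancelling best-effort work.
        for victim in schedule.best_effort_jobs():
            schedule.delay_or_cancel(victim)
            if schedule.replan(job, exclude=failed_nodes):
                break
```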


Job Forwarding Using the Grid. Even if the scheduler has the potential to react to unexpected situations like hardware failures by rescheduling jobs, this is not always applicable. If there are no best effort jobs either running or in the schedule, the scheduler has to violate at least one SLA. However if the system is embedded in a grid computing environment, potentially there are matching resources available. Due to the fact that the scheduler knows each job’s SLA, it could search for matching resources in the grid. For instance this could be done by requesting resources from a grid resource broker. By requesting resources with the specifications of the SLA it is assured that the located grid resources can fulfill the SLA. In consequence the scheduler can forward the job to another provider without violating the SLA.

The decision of forwarding does not only depend on finding matching resources in the grid. If the allocation of grid resources is much more expensive than the revenue achieved by fulfilling the SLA, it can be economically more reasonable to violate the SLA and pay the penalty fee.

The information given in an SLA in combination with job forwarding gives the opportunity to use overbooking in a better way than in the past. Overbooking assumes that users overestimate the durations of their jobs and the related jobs will be released earlier. These resources are used to realize the additional (overbooked) jobs. However if jobs are not released earlier as assumed, the overbooked jobs have to be discarded. With job forwarding these jobs may be realized on other systems in the grid. If this is not possible, the information provided in an SLA may be used to determine suitable jobs for cancellation.
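The economic part of the forwarding decision reduces to comparing net outcomes; the cost model below is a deliberately simplified illustration, not a prescribed policy.

```python
def best_action(sla, grid_offers):
    """Choose between violating the SLA and forwarding the job.

    `grid_offers` maps a provider name to the price of running the job
    there under the SLA's resource specification; `sla.revenue` and
    `sla.penalty` come from the agreement itself.
    """
    outcome = {"violate SLA": -sla.penalty}            # pay the penalty fee
    for provider, price in grid_offers.items():
        outcome["forward to " + provider] = sla.revenue - price
    return max(outcome, key=outcome.get)
```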

4 The Computing Center Software

This section describes a resource management system developed and used at the Paderborn Center for Parallel Computing. It provides some of the features characterized in Sect. 3.

4.1 Architecture

The Computing Center Software [16] has been designed to serve two purposes: For HPC users it provides a uniform access interface to a pool of different HPC systems. For system administrators it provides a means for describing, organizing, and managing HPC systems that are operated in a computing center. Hence the name “Computing Center Software”, CCS for short.

A CCS island (Fig. 3) comprises five modules which may be executed asynchronously on different hosts to improve the response time of CCS.

– The User Interface (UI) provides a single access point to one or more systems via an X-window or a command line based interface.
– The Access Manager (AM) manages the user interfaces and is responsible for authentication, authorization, and accounting.
– The Planning Manager (PM) plans the user requests onto the machine.


[Figure 3 shows the CCS modules: the Island Manager (IM), one or more User Interfaces (UI), the Access Manager (AM), the Planning Manager (PM), and the Machine Manager (MM) in front of the hardware.]

Fig. 3. Interaction between the CCS components

– The Machine Manager (MM) provides machine specific features like system partitioning, job controlling, etc. The MM consists of three separate modules that execute asynchronously.
– The Island Manager (IM) provides CCS internal name services and watchdog facilities to keep the island in a stable condition.

4.2 The Planning Concept

The planning process in CCS is split into two instances, a hardware dependent and a hardware independent part. The Planning Manager (PM) is the hardware independent part. It has no information on mapping constraints (e. g. the network topology or location of visualization- or I/O-nodes). The hardware dependent tasks are performed by the Machine Manager (MM). It maps the schedule received from the PM onto the hardware considering system specific constraints (e. g. network topology). The following sections depict this split planning concept in more detail.

Planning. According to Sect. 2 CCS is a planning system and not a queuing system. Hence CCS requires the users to specify the expected duration of their requests. The CCS planner distinguishes between Fix-Time and Var-Time resource requests. A Fix-Time request reserves resources for a given time interval. It cannot be shifted on the time axis. In contrast, Var-Time requests can move on the time axis to an earlier or later time slot (depending on the used policy). Such a shift on the time axis might occur when other requests terminate before the specified estimated duration. Figure 4 shows the schedule browser.

The PM manages two “lists” while computing a schedule. The lists are sorted according to the active policy.

1. The New list (N-list): Each incoming request is placed in this list and waits there until the next planning phase begins.
2. The Planning list (P-list): The PM plans the schedule using this list.
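A highly simplified rendering of one planning phase over these two lists; the names follow the paper, but the data structures and planning functions are placeholders rather than the actual CCS code.

```python
def planning_phase(n_list, p_list, policy_key, plan_fix_time, plan_var_time):
    """Move freshly submitted requests from the N-list into the P-list."""
    for request in sorted(n_list, key=policy_key):     # FCFS, SJF or LJF
        if request.fix_time:
            # Reservations have a fixed start; reject them if they
            # cannot be placed (and later verified by the MM).
            if not plan_fix_time(request, p_list):
                request.reject()
                continue
        else:
            # Var-Time requests are planned as soon as possible,
            # scanning from the present into the future.
            plan_var_time(request, p_list)
        p_list.append(request)
    n_list.clear()
    return p_list
```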


CCS comes with three strategies: FCFS, SJF, and LJF. All of them consider project limits, system wide node limits, and Admin-Reservations (all described in Sect. 4.3). The system administrator can change the strategy at runtime. The integration of new strategies is possible, because the PM provides an API to plug in new modules.

Planning an Incoming Job: The PM first checks if the N-list has to be sorted according to the active policy (e. g. SJF or LJF). It then plans all elements of the N-list. Depending on the request type (Fix-Time or Var-Time) the PM calls an associated planning function. For example, if planning a Var-Time request, the PM tries to place the request as soon as possible. The PM starts in the present and moves to the future until it finds a suitable place in the schedule.

Backfilling: According to Sect. 2 backfilling is done implicitly during the replanning process if SJF or LJF is used. If FCFS is used the following is done: each time a request is stopped, an Admin-Reservation is removed, or the duration of a planned request is decreased, the PM determines the optimal time for starting the backfilling (backfillStart) and initiates the backfill procedure. It checks, for all Var-Time requests in the P-list with a planned time later than backfillStart, whether they could be planned between backfillStart and their current schedule.

Mapping. The separation between the hardware independent PM and the system specific MM allows to encapsulate system specific mapping heuristics in separate modules. With this approach, system specific requests (e. g. for I/O-nodes, specific partition topologies, or memory constraints) may be considered. One task of the MM is to verify if a schedule received from the PM can be realized with the available hardware. The MM checks this by mapping the

Fig. 4. The CCS schedule browser


user given specification with the static (e. g. topology) and dynamic (e. g. PE availability) information on the system resources. This kind of information is described by means of the Resource and Service Description (RSD, cf. Sect. 4.4). If the MM is not able to map a request onto the machine at the time given by the PM, the MM tries to find an alternative time. The resulting conflict list is sent back to the PM. The PM now checks this list: If the MM was not able to map a Fix-Time request, the PM rejects it. If it was a backfilled request, the PM falls back on the last verified start time. If it was not a backfilled request, the PM checks if the planned time can be accepted: Does it match Admin-Reservations, project limits, or system wide limits?

When managing a homogeneous system, verifying is not mandatory. However, if a system comprises different node types or multiple communication networks, or the user is able to request specific partition shapes (e. g. a 4 x 3 x 2 grid), verifying becomes necessary to ensure a deterministic schedule.

Another task of the MM is to monitor the utilization of partitions. If a partition is not used for a certain amount of time, the MM releases the partition and notifies the user via email. The MM is also able to migrate partitions when they are not active (i. e. no job is running). The user does not notice the migration unless she runs time-critical benchmarks for testing the communication speed of the interconnects. In this case the automatic migration facility may be switched off by the user at submit time.

4.3 Features

Showing Planned Start Times: The CCS user interface shows the estimated start time of interactive requests directly after the submitted request has been planned. This output will be updated whenever the schedule changes. This is shown in Example 2.

Reservations: CCS can be used to reserve resources at a given time. Once CCS has accepted a reservation, the user has guaranteed access to the requested resources. During the reserved time frame a user can start an arbitrary number of interactive or batch jobs.

Deadline Scheduling: Batch jobs can be submitted with a deadline notification. Once a job has been accepted, CCS guarantees that the job is completed at (or before) the specified time.

Limit Based Scheduling: In CCS authorization is project based. One has to specify a project at submit time. CCS knows two different limit time slots: weekdays and weekend. In each slot CCS distinguishes between day and night. All policies consider the project specific node limits (given in percent of the number of available nodes of the machine). This means that the scheduler will sum up the already used resources of a project in a given time slot. If the time dependent limit is reached, the request in question is planned to a later or earlier slot (depending on the request type: interactive, reservation, deadline etc.).


Example: We define a project-limit of 15% for weekdays daytime for the project FOO. Members of this project may now submit a lot of batch jobs and will never get more than 15% of the machine during daytime from Monday until Friday. Requests violating the project limit are planned to the next possible slot (cf. Fig. 1). Only the start time is checked against the limit, to allow that a request may have a duration longer than a project limit slot. The AM sends the PM the current project limits at boot time and whenever they change (e. g. due to crashed nodes).

System Wide Node Limit: The administrator may establish a system wide node limit. It consists of a threshold (T), a number of nodes (N), and a time slot [start, stop]. N defines the number of nodes which are not allocatable if a user requests more than T nodes during the interval [start, stop]. This ensures that small partitions are not blocked by large ones during the given interval.

Admin Reservations: The administrator may reserve parts or the whole system (for a given time) for one or more projects. Only the specified projects are able to allocate and release an arbitrary number of requests during this interval on the reserved number of nodes. Requests of other projects are planned to an earlier or later time. An admin reservation overrides the current project limit and the current system wide node limit. This enables the administrator to establish “virtual machines” with restricted access for a given period of time and a restricted set of users.

Duration Change at Runtime: It is possible to manually change the duration of already waiting or running requests. Increasing the duration may enforce a verify round. The MM checks if the duration of the given request may be increased without influencing subsequent requests. Decreasing the duration may change the schedule, because requests planned after the request in question may now be planned earlier.
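The project limit and the system wide node limit could be checked together as in the following sketch; slot bookkeeping and Admin-Reservations are left out, and all names are illustrative rather than part of CCS.

```python
def start_time_admissible(req, project, slot, machine_nodes,
                          project_used_nodes, system_limit=None):
    """Check a proposed start time against the configured limits.

    `project_used_nodes` is what the project already has planned in the
    limit slot; only the start time is checked, so a request may run past
    the end of the slot.  `system_limit` is an optional tuple
    (threshold_T, blocked_N, start, stop).
    """
    # Project limit: a percentage of the machine per time slot.
    allowed = project.limit_percent[slot] * machine_nodes / 100.0
    if project_used_nodes + req.nodes > allowed:
        return False

    # System wide node limit: requests above T nodes may not use the
    # last N nodes during [start, stop].
    if system_limit:
        threshold, blocked, start, stop = system_limit
        if start <= req.planned_start <= stop and req.nodes > threshold:
            if req.nodes > machine_nodes - blocked:
                return False
    return True
```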

4.4 Resource and Service Description

The Resource and Service Description (RSD) [1, 3] is a tool for specifying irregularly connected, attributed structures. Its hierarchical concept allows different dependency graphs to be grouped for building more complex nodes, i. e. hypernodes. In CCS it is used at the administrator level for describing the type and topology of the available resources, and at the user level for specifying the required system configuration for a given application. This specification is created automatically by the user interface.

In RSD resources and services are described by nodes that are interconnected by edges via communication endpoints. An arbitrary number of attributes may be assigned to each of these entities. RSD is able to handle dynamic attributes. This is useful in heterogeneous environments, where for example the temporary network load affects the choice of the mapping. Moreover, dynamic attributes may be used by the RMS to support the planning and monitoring of SLAs (cf. Sect. 3.3).
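RSD has its own specification language [1, 3]; the kind of structure it describes — attributed nodes with communication endpoints, grouped into hypernodes, possibly with dynamic attributes — can be mimicked in a few lines, purely as an illustration and not as RSD syntax.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    attributes: dict = field(default_factory=dict)   # static or dynamic values
    endpoints: list = field(default_factory=list)    # communication endpoints
    children: list = field(default_factory=list)     # members of a hypernode

# A toy description: a cluster hypernode with a switch and an I/O node.
switch = Node("gigabit-switch", {"bandwidth_mbit": 1000})
io_node = Node("io0", {"disk_gb": 500, "load": lambda: 0.3})  # dynamic attribute
io_node.endpoints.append(("eth0", switch))           # edge via an endpoint
cluster = Node("cluster", {"nodes": 96}, children=[switch, io_node])
```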


Example 2 Showing the planned allocation time

%ccsalloc -t 10s -n 7 shell date
ccsalloc: Connecting default machine: PSC
ccsalloc: Using default project     : FOO
ccsalloc: Using default name        : bar%d
ccsalloc: Emailing of CCS messages  : On
ccsalloc: Only user may access      : Off
ccsalloc: Request (51/bar_36): will be authenticated and planned
ccsalloc: Request (51/bar_36): is planned and waits for allocation
ccsalloc: Request (51/bar_36): will be allocated at 14:28 (in 2h11m)
ccsalloc: 12:33: New planned time is at 12:57 (in 24m)
ccsalloc: 12:48: New planned time is at 12:53 (in 5m)
ccsalloc: 12:49: New planned time is at 12:50 (in 1m)
ccsalloc: Request (51/bar_36): is allocated
ccsalloc: Request 51: starting shell date
Wed Mar 12 12:50:03 CET 2003
ccsalloc: Request (51/bar_36): is released
ccsalloc: Bye,Bye (0)

5 Conclusion

In this paper we have presented an approach for classifying resource management systems. According to the planned time frame we distinguish between queuing and planning systems. A queuing system considers only the present and utilizes free resources with requests. Planning systems in contrast plan for the present and future by assigning a start time to all requests.

Queuing systems are well suited to operate single HPC machines. However, with grid environments and heterogeneous clusters new challenges arise and the concept of scheduling has to follow these changes. Scheduling like a queuing system does not seem to be sufficient to handle the requirements, especially if advanced reservation and quality of service aspects have to be considered. The named constraints of queuing systems do not exist in planning systems due to their different design.

Besides the classification of resource management systems we additionally presented new ideas on advanced planning functionalities. Diffuse requests ease the process of negotiating the resource usage between the system and users or co-allocation agents. Resource reclaiming and automatic duration extension extend the term of scheduling. The task of the scheduler is no longer restricted to plan the future only, but also to manage the execution of already allocated requests. Features like diffuse requests and service level agreements in conjunction with job forwarding allow to build a control cycle comprising active applications, resource management systems, and grid middleware. We think this control cycle would help to increase the everyday usability of the grid, especially for commercial users.

The aim of this paper is to show the benefits of planning systems for managing HPC machines. We see this paper as a basis for further discussions.


References

[1] M. Brune, J. Gehring, A. Keller, and A. Reinefeld. RSD - Resource and Service Description. In Proc. of 12th Intl. Symp. on High-Performance Computing Systems and Applications (HPCS’98), pages 193–206. Kluwer Academic Press, 1998.
[2] M. Brune, J. Gehring, A. Keller, and A. Reinefeld. Managing Clusters of Geographically Distributed High-Performance Computers. Concurrency - Practice and Experience, 11(15):887–911, 1999.
[3] M. Brune, A. Reinefeld, and J. Varnholt. A Resource Description Environment for Distributed Computing Systems. In Proceedings of the 8th International Symposium on High-Performance Distributed Computing (HPDC 1999), Redondo Beach, pages 279–286. IEEE Computer Society, 1999.
[4] K. Czajkowski, I. Foster, C. Kesselman, V. Sander, and S. Tuecke. SNAP: A Protocol for Negotiation of Service Level Agreements and Coordinated Resource Management in Distributed Systems. In Proceedings of the 8th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), volume 2537 of Lecture Notes in Computer Science, pages 153–183. Springer Verlag, 2002.
[5] Direct Access Transport (DAT) Specification. http://www.datcollaborative.org, April 2003.
[6] C. Ernemann, V. Hamscher, A. Streit, and R. Yahyapour. Enhanced Algorithms for Multi-Site Scheduling. In Proceedings of 3rd IEEE/ACM International Workshop on Grid Computing (Grid 2002) at Supercomputing 2002, volume 2536 of Lecture Notes in Computer Science, pages 219–231, 2002.
[7] D. G. Feitelson and M. A. Jette. Improved Utilization and Responsiveness with Gang Scheduling. In D. G. Feitelson and L. Rudolph, editors, Proc. of 3rd Workshop on Job Scheduling Strategies for Parallel Processing, volume 1291 of Lecture Notes in Computer Science, pages 238–262. Springer Verlag, 1997.
[8] D. G. Feitelson and L. Rudolph. Towards Convergence in Job Schedulers for Parallel Supercomputers. In D. G. Feitelson and L. Rudolph, editors, Proc. of 2nd Workshop on Job Scheduling Strategies for Parallel Processing, volume 1162 of Lecture Notes in Computer Science, pages 1–26. Springer Verlag, 1996.
[9] D. G. Feitelson and L. Rudolph. Metrics and Benchmarking for Parallel Job Scheduling. In D. G. Feitelson and L. Rudolph, editors, Proc. of 4th Workshop on Job Scheduling Strategies for Parallel Processing, volume 1459 of Lecture Notes in Computer Science, pages 1–24. Springer Verlag, 1998.
[10] D. G. Feitelson, L. Rudolph, U. Schwiegelshohn, and K. C. Sevcik. Theory and Practice in Parallel Job Scheduling. In D. G. Feitelson and L. Rudolph, editors, Proc. of 3rd Workshop on Job Scheduling Strategies for Parallel Processing, volume 1291 of Lecture Notes in Computer Science, pages 1–34. Springer Verlag, 1997.
[11] I. Foster and C. Kesselman (Eds.). The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers Inc., San Francisco, 1999.
[12] I. Foster, C. Kesselman, C. Lee, R. Lindell, K. Nahrstedt, and A. Roy. A Distributed Resource Management Architecture that Supports Advance Reservations and Co-Allocation. In Proceedings of the International Workshop on Quality of Service, 1999.
[13] GGF Grid Scheduling Dictionary Working Group. Grid Scheduling Dictionary of Terms and Keywords. http://www.fz-juelich.de/zam/RD/coop/ggf/sd-wg.html, April 2003.


[14] J. Hungershöfer, J.-M. Wierum, and H.-P. Gänser. Resource Management for Finite Element Codes on Shared Memory Systems. In Proc. of Intl. Conf. on Computational Science and Its Applications (ICCSA), volume 2667 of LNCS, pages 927–936. Springer, May 2003.
[15] D. Jackson, Q. Snell, and M. Clement. Core Algorithms of the Maui Scheduler. In D. G. Feitelson and L. Rudolph, editors, Proceedings of 7th Workshop on Job Scheduling Strategies for Parallel Processing, volume 2221 of Lecture Notes in Computer Science, pages 87–103. Springer Verlag, 2001.
[16] A. Keller and A. Reinefeld. Anatomy of a Resource Management System for HPC Clusters. In Annual Review of Scalable Computing, vol. 3, Singapore University Press, pages 1–31, 2001.
[17] H. Kishimoto, A. Savva, and D. Snelling. OGSA Fundamental Services: Requirements for Commercial GRID Systems. Technical report, Open Grid Services Architecture Working Group (OGSA WG), http://www.gridforum.org/Documents/Drafts/default_b.htm, April 2003.
[18] D. A. Lifka. The ANL/IBM SP Scheduling System. In D. G. Feitelson and L. Rudolph, editors, Proc. of 1st Workshop on Job Scheduling Strategies for Parallel Processing, volume 949 of Lecture Notes in Computer Science, pages 295–303. Springer Verlag, 1995.
[19] M. Litzkow, M. Livny, and M. Mutka. Condor - A Hunter of Idle Workstations. In Proceedings of the 8th International Conference on Distributed Computing Systems (ICDCS’88), pages 104–111. IEEE Computer Society Press, 1988.
[20] J. MacLaren, V. Sander, and W. Ziegler. Advanced Reservations - State of the Art. Technical report, Grid Resource Allocation Agreement Protocol Working Group, Global Grid Forum, http://www.fz-juelich.de/zam/RD/coop/ggf/graap/sched-graap-2.0.html, April 2003.
[21] A. Mu’alem and D. G. Feitelson. Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling. IEEE Trans. Parallel & Distributed Systems, 12(6):529–543, June 2001.
[22] A. Sahai, A. Durante, and V. Machiraju. Towards Automated SLA Management for Web Services. HPL-2001-310 (R.1), Hewlett-Packard Company, Software Technology Laboratory, HP Laboratories Palo Alto, http://www.hpl.hp.com/techreports/2001/HPL-2001-310R1.html, 2002.
[23] A. Sahai, V. Machiraju, M. Sayal, L. J. Jin, and F. Casati. Automated SLA Monitoring for Web Services. In Management Technologies for E-Commerce and E-Business Applications, 13th IFIP/IEEE International Workshop on Distributed Systems: Operations and Management, volume 2506 of Lecture Notes in Computer Science, pages 28–41. Springer, 2002.
[24] Scali MPI Connect. http://www.scali.com, April 2003.
[25] L. Smarr and C. E. Catlett. Metacomputing. Communications of the ACM, 35(6):44–52, June 1992.
[26] W. Smith, I. Foster, and V. Taylor. Using Run-Time Predictions to Estimate Queue Wait Times and Improve Scheduler Performance. In D. G. Feitelson and L. Rudolph, editors, Proc. of 5th Workshop on Job Scheduling Strategies for Parallel Processing, volume 1659 of Lecture Notes in Computer Science, pages 202–219. Springer Verlag, 1999.
[27] A. Streit. A Self-Tuning Job Scheduler Family with Dynamic Policy Switching. In Proc. of the 8th Workshop on Job Scheduling Strategies for Parallel Processing, volume 2537 of Lecture Notes in Computer Science, pages 1–23. Springer, 2002.


[28] D. Talby and D. G. Feitelson. Supporting Priorities and Improving Utilization of the IBM SP2 Scheduler Using Slack-Based Backfilling. In 13th Intl. Parallel Processing Symp., pages 513–517, April 1999.
[29] D. Verma. Supporting Service Level Agreements on an IP Network. Macmillan Technology Series. Macmillan Technical Publishing, August 1999.
[30] K. Windisch, V. Lo, R. Moore, D. Feitelson, and B. Nitzberg. A Comparison of Workload Traces from Two Production Parallel Machines. In 6th Symposium Frontiers Massively Parallel Computing, pages 319–326, 1996.

TrellisDAG: A System for Structured DAG Scheduling

Mark Goldenberg, Paul Lu, and Jonathan Schaeffer

Department of Computing Science, University of Alberta, Edmonton, Alberta, T6G 2E8, Canada
{goldenbe,paullu,jonathan}@cs.ualberta.ca
http://www.cs.ualberta.ca/~paullu/Trellis/

Abstract. High-performance computing often involves sets of jobs or workloads that must be scheduled. If there are dependencies in the ordering of the jobs (e.g., pipelines or directed acyclic graphs), the user often has to carefully, manually submit the jobs in the right order and/or delay submitting dependent jobs until other jobs have finished. If the user can submit the entire workload with dependencies, then the scheduler has more information about future jobs in the workflow. We have designed and implemented TrellisDAG, a system that combines the use of placeholder scheduling and a subsystem for describing workflows to provide novel mechanisms for computing non-trivial workloads with inter-job dependencies. TrellisDAG also has a modular architecture for implementing different scheduling policies, which will be the topic of future work. Currently, TrellisDAG supports:

1. A spectrum of mechanisms for users to specify both simple and complicated workflows.
2. The ability to load balance across multiple administrative domains.
3. A convenient tool to monitor complicated workflows.

1 Introduction

High-performance computing (HPC) often involves sets of jobs or workloads that must be scheduled. Sometimes, the jobs in the workload are completely independent and the scheduler is free to run any job concurrently with any other job. At other times, the jobs in the workload have application-specific dependencies in their ordering (e.g., pipelines [20]) such that the user has to carefully submit the jobs in the right order (either manually or via a script) or delay submitting dependent jobs until other jobs have finished. The details of which job is selected to run on which processor are determined by a scheduling policy [5]. Often, the scheduler uses the knowledge of the jobs in the submission queue, the jobs currently running, and the history of past jobs to help make its decisions. In particular, any knowledge about future job arrivals can supplement other knowledge to make better policy choices.

Fig. 1. Example Bioinformatics Workflow: DNA/protein sequences are input to Job A (PSI-BLAST), whose output feeds Job B (function classifier) and Job C (localization classifier); Job D creates a summary.

The problem is that the mere presence of a job in the submission queue is usually interpreted by the scheduler to mean that the job can run concurrently with other jobs. Therefore, to avoid improper ordering of jobs, either the scheduler has to have a mechanism to specify job dependencies or the user has to delay the submission of some jobs until other jobs have completed. Without such mechanisms, managing the workload can be labour-intensive and deprives the scheduler of the knowledge of future jobs until they are actually submitted, even though the workflow "knows" that the jobs are forthcoming. If the user can submit the entire workload with dependencies, then the scheduler has access to more information about future jobs in the workflow.

We have designed, implemented, and evaluated the TrellisDAG system for scheduling workloads with job dependencies [10]. TrellisDAG is designed to support any workload with job dependencies and provides a framework for implementing different scheduling policies. So far, our efforts have focussed on the basic mechanisms to support the scheduling of workflows with directed acyclic graph (DAG) dependencies; our policies have been simple: first-come-first-serve (FCFS) of jobs with satisfied dependencies and simple approaches to data locality when placing jobs on processors [10]. We have not yet emphasized the development of new policies since our focus has been on the underlying infrastructure and framework.

The main contributions of this work are:

1. The TrellisDAG system and some of its novel mechanisms for expressing job dependencies, especially DAG description scripts (Section 3.2).
2. The description of an application with non-trivial workflow dependencies, namely building checkers endgame databases via retrograde analysis (Section 2).
3. A simple, empirical evaluation of the correctness and performance of the TrellisDAG system (Section 4).

1.1 Motivation

Our empirical evaluation in Section 4 uses examples from computing checkers endgame databases (Section 2). However, TrellisDAG is designed to be general-purpose and to support a variety of applications. For now, let us consider a bioinformatics application with a simple workflow dependency with four jobs: A, B, C, and D (Figure 1). This example is based on a bioinformatics research project in our department called Proteome Analyst (PA) [19]. In Figure 1, the input to the workflow is a large set of DNA or protein sequences, usually represented as strings over an alphabet. In high-throughput proteome analysis, the input to Job A can be tens of thousands of sequences. A common, first-stage analysis is to use PSI-BLAST [1] to find similar sequences, called homologs, in a database of known proteins. Then, PA uses the information from the homologs to predict different properties of the new, unknown proteins. For example, Job B uses a machine-learned classifier to map the PSI-BLAST output to a prediction of the general function of the protein (e.g., the protein is used for amino acid biosynthesis). Job C uses the same PSI-BLAST output from A and a different classifier to predict the subcellular localization of the protein (e.g., the protein is found in the Golgi complex). Both Jobs B and C need the output of Job A, but B and C can work concurrently. For simplicity, we will assume that all of the sequences must be processed by PSI-BLAST before any of the output is available to B and C. Job D gathers and presents the output of B and C. Some of the challenges are:

1. Identifying that Jobs B and C can be run concurrently, but A and B (and A and C) cannot be concurrent (i.e., knowledge about dependencies within the workflow).
2. Recognizing that there are, for example, three processors (not shown) at the moment that are ready to execute the jobs (i.e., finding the processor resources).
3. Mapping the jobs to the processors, which is the role of the scheduler.

Without knowledge about dependencies in the scheduler, the user may have to submit Job A, wait until A is completed, and then submit B and C so that the ordering is maintained. Otherwise, if Jobs A, B, C, and D are all in the submission queue, the scheduler may try to run them concurrently if, say, four processors are available. But having the user manually wait for A to finish before submitting the rest of the workflow can mean delays, and it means that the scheduler cannot see that B, C, and D will eventually be executed, depriving the policy of that knowledge of future jobs.
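To make the dependency structure concrete, here is a minimal, hypothetical sketch (not part of TrellisDAG itself) that encodes the four-job workflow of Figure 1 as a Python dictionary and lists which jobs are ready to run once a given set of jobs has finished; the job names A through D are the ones used above.

# Hypothetical encoding of the Figure 1 workflow: job -> set of jobs it depends on.
workflow = {
    "A": set(),          # PSI-BLAST over all input sequences
    "B": {"A"},          # function classifier needs A's output
    "C": {"A"},          # localization classifier needs A's output
    "D": {"B", "C"},     # summary needs both B and C
}

def ready_jobs(done):
    """Jobs whose dependencies are all satisfied and that have not yet run."""
    return {job for job, deps in workflow.items()
            if job not in done and deps <= done}

print(ready_jobs(set()))            # only A is ready at first
print(ready_jobs({"A"}))            # B and C may run concurrently
print(ready_jobs({"A", "B", "C"}))  # finally, D becomes ready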

1.2 Related Work and Context

The concept of grid computing is pervasive these days [7, 6]. TrellisDAG is part of the Trellis Project in overlay metacomputing [15, 14]. In particular, TrellisDAG is layered on top of the placeholder scheduling technique [14].

Fig. 2. Placeholder scheduling: a server with service routines (the command-line server) hands out command lines to placeholders running on execution hosts 1 through n.

The goals of the Trellis Project are more modest and simpler than those of grid computing. Trellis is focussed on supporting scientific applications on HPC systems; supporting business applications and Web services are not explicit goals of the Trellis Project. Therefore, we prefer to use the older terminology of metacomputing to reflect the more limited scope of our work.

In computing science, the dream of metacomputing has been around for decades. In various forms, and with important distinctions, it has also been known as distributed computing, batch scheduling, cycle stealing, high-throughput computing, peer-to-peer systems, and (most recently) grid computing. Some well-known, contemporary examples in this area include SETI@home [18], Project RC5/distributed.net [16], Condor [13, 8, 3], and the projects associated with Globus/Open Grid Service Architecture (OGSA) [9]. Of course, there are many other related projects around the world.

The Trellis philosophy has been to write the minimal amount of new software and to require the minimum of superuser support. Simplicity and software reuse have been important design principles; Trellis uses mostly widely-deployed, existing software systems. Currently, Trellis does not use any of the new software that might be considered part of grid computing, but the design of Trellis supports the incorporation of and/or co-existence with grid technology in the future.

At a high level, placeholder scheduling is illustrated in Figure 2. A placeholder is a mechanism for global scheduling in which each placeholder represents a potential unit of work. The current implementation of placeholder scheduling uses normal batch scheduler job scripts to implement a placeholder. Placeholders are submitted to the local batch scheduler with a normal, non-privileged user identity. Thus, local scheduling policies and job accounting are maintained. There is a central host that we will call the server. All the jobs (a job is anything that can be executed) are stored on the server (the storage format is user-defined; it can be a plain file, a database, or any other format).


There is also a set of separate programs called services that form a layer called the command-line server. Adding a new storage format or implementing a new scheduling policy corresponds to implementing a new service program in this modular architecture. There are a number of execution hosts (or computational nodes), the machines on which the computations are actually executed. On each execution host, one or more placeholders are running. Placeholders can handle either sequential or parallel jobs. A placeholder can be implemented as a shell script, a binary executable, or a script for a local batch scheduler. However, it has to use the service routines (or services) of the command-line server to perform the following activities:

1. Get the next job from the server,
2. Execute the job, and
3. Resubmit itself (if necessary).

Therefore, when the job (a.k.a. placeholder) begins executing, it contacts a central server and requests the job's actual run-time parameters (i.e., late binding). For placeholders, the communication across administrative domains is handled using Secure Shell (SSH). In this way, a job's parameters are pulled by the placeholder rather than pushed by a central authority. In contrast, normal batch scheduler jobs hard-code all the parameters at the time of local job submission (i.e., early binding).

Placeholder scheduling is similar to Condor's gliding in and flocking techniques [8]. Condor is, by far, the more mature and robust system. However, by design, placeholders are not as tightly-coupled with the server as Condor daemons are with the central Condor servers (e.g., no I/O redirection to the server). Also, placeholders use the widely-deployed SSH infrastructure for secure and authenticated communication across administrative domains. The advantage of the more loosely-coupled and SSH-based approach is that overlay metacomputers (which are similar to "personal Condor pools") can be quickly deployed, without superuser permissions, while maintaining functionality and security.

Recently, we used placeholder scheduling to run a large computational chemistry application across 18 different administrative domains, on 20 different systems across Canada, with 1,376 processors [15, 2]. This experiment, dubbed the Canadian Internetworked Scientific Supercomputer (CISS), was most notable for the 18 different administrative domains. No system administrator had to install any new infrastructure software (other than SSH, which was almost universally available already). All that we asked for was a normal, user-level account. In terms of placeholder scheduling, most CISS sites can be integrated within minutes. We believe that the low infrastructure requirements to participate in CISS were key in convincing such a diverse group of centres to join in.

1.3 Unique Features of TrellisDAG

Fig. 3. A 3-stage computation (Stage 1: jobs A and B; Stage 2: jobs C and D; Stage 3: jobs E and F). It is convenient to inquire about the status of each stage. If the second stage resulted in errors, we may want to disable the third stage and rerun starting from the second stage.

TrellisDAG enhances the convenience of monitoring and administering the computation by providing the user with a tool to translate the set of inter-dependent jobs into a hierarchical structure with naming conventions that are natural for the application domain. As shown in Figure 3, suppose that the computation is a 3-stage simulation, where the first stage is represented by jobs A and B, the second stage is represented by jobs C and D, and the third stage is represented by jobs E and F. The following are examples of the natural tasks of monitoring and administering such a computation:

1. Query the status of the first stage of the computation, e.g., is it completed?
2. Make the system execute only the first two stages of the computation, in effect disabling the third stage.
3. Redo the computation starting from the second stage.

TrellisDAG makes such services possible by providing a mechanism for a high-level description of the workflow, in which the user can define named groups of jobs (e.g., Stage 1) and specify collective dependencies between the defined groups (e.g., between Stage 1 and Stage 2). Finally, TrellisDAG provides mechanisms such that:

1. The user can submit all of the jobs and dependencies at once.
2. TrellisDAG provides a single point for describing the scheduling policies.
3. TrellisDAG provides a flexible mechanism by which attributes may be associated with individual jobs. The more information the scheduler has, the better scheduling decisions it may make.
4. TrellisDAG records the history information associated with the workflow. In future work, the scheduler may use machine learning in order to improve the overall computation time from one trial to another. Our system provides a mechanism for this capability by storing the relevant history information about the computation, such as the start times of jobs, the completion times of jobs, and the resources that were used for computing the individual jobs.

Notably, Condor has a tool called the Directed Acyclic Graph Manager (DAGMan) [4]. One can represent a hierarchical system of jobs and dependencies using DAGMan scripts. DAGMan and TrellisDAG share similar design goals.


However, DAGMan scripts are approximately as expressive as TrellisDAG's Makefile-based mechanisms (Section 3.2). It is not clear how well DAGMan's scripts will scale for complicated workflows, as described in the next section. Also, the widely-used Portable Batch System (PBS) has a simple mechanism to specify job dependencies, but jobs are named according to submit-time job numbers, which are awkward to script and re-use, and which do not scale to large workflows.

2 The Endgame Databases Application

We introduce an important motivating application, building checkers endgame databases via retrograde analysis, at a very high level. Then we highlight the properties of the application that are important for our project. The design of TrellisDAG, however, does not make any checkers-specific assumptions.

A team of researchers in the Department of Computing Science at the University of Alberta aims to solve the game of checkers [11, 12, 17]. For this paper, the detailed rules of checkers are largely irrelevant. In terms of workflow, the key application-specific "Properties" are:

1. The game starts with 24 pieces (or checkers) on the board. There are 12 black pieces and 12 white pieces.
2. Pieces are captured and removed from the board during the game. Once captured, a piece cannot return to the board.
3. Pieces start as checkers but can be promoted to be kings. Once a checker is promoted to a king, it can never become a checker again.

Solving the game of checkers means that, for any given position, the following question has to be answered: can the side to move force a win, or is the position a draw? Using retrograde analysis, a database of endgame positions is constructed [12, 17]. Each entry in the database corresponds to a unique board position and contains one of three possible values: WIN, LOSS, or DRAW. Such a value represents perfect information about a position and is called the theoretical value for that position.

The computation starts with the trivial case of one piece. We know that whoever has that last piece is the winner. If there are two pieces, any position in which one piece is immediately captured "plays into" the case where there is only one piece, for which we already know the theoretical value. In general, given a position, whenever there is at least one legal move that leads to a position that has already been entered in the database as a LOSS for the opponent, we know that the given position is a WIN for the side to move (since the player will take that move, the values of all other moves do not matter); conversely, if all legal moves lead to positions that were entered as a WIN for the opponent, we know that the given position is a LOSS for the side to move.

For a selected part of the databases, this analysis goes in iterations. An iteration consists of going through all positions to which no value has been assigned and trying to derive a value using the rules described above.

Fig. 4. Example workflow dependencies in checkers endgame databases, part of the 7-piece database (the nodes are the "4 vs. 3" slices, e.g., 4300, 3310, 2320, ..., 0043).

When an iteration does not result in any updates of values, the remaining positions are assigned a value of DRAW (since neither player can force a win).

If we could continue the process of retrograde analysis up to the initial position with 24 pieces on the board, then we would have solved the game. However, the number of positions grows exponentially, and there are O(10^20) possible positions in the game. That is why the retrograde analysis is combined with a forward search, in which the game tree rooted at the initial position of the game is constructed. When the two approaches "meet", we will have perfect information about the initial position of the game and the game will be solved.

In terms of workflow, strictly speaking, the positions with fewer pieces on the board must be computed and solved before the positions with more pieces. In a 3-piece position, a capture immediately results in a 2-piece position, as per Property 2. In other words, the 2-piece database must be computed before the 3-piece databases, which are computed before the 4-piece databases, and so on until, say, the 7-piece databases.

We can subdivide the databases further. Suppose that black has 4 pieces and white has 3 pieces, which is part of the 7-piece database (Figure 4).


We note that a position with 0 black kings, 3 white kings, 4 black checkers, and no white checkers (denoted 0340) can never play into a position with 1 black king, 2 white kings, 3 black checkers, and 1 white checker (denoted 1231), or vice versa. As per Property 3, 0340 cannot play into 1231 because the third white king in 0340 can never become a checker again; 1231 cannot play into 0340 because the lone black king in 1231 can never become a checker again. If 0340 and 1231 are separate computations or jobs, they can be computed concurrently as long as the database(s) that they do play into (e.g., 1330) are finished.

In general, let us denote any position by four numbers standing for the number of kings and checkers of each colour on the board, as illustrated above. A group of all positions that have the same 4-number representation is called a slice. Two slices bk1 wk1 bc1 wc1 and bk2 wk2 bc2 wc2 that have the same number of black and white pieces can be processed in parallel if one of the following two conditions holds: (wk1 > wk2 and bk1 < bk2), or (wk1 < wk2 and bk1 > bk2). Figure 4 shows the workflow dependencies for the case of 4 pieces versus 3 pieces. Each "row" in the diagram (e.g., the dashed-line box) represents a set of jobs that can be computed in parallel.

In turn, each of the slices in Figure 4 can be further subdivided (not shown in the figure) by considering the rank of the leading checker. The rank is the row of a checker as it advances towards becoming a king, as per Property 3. Only the rank of the leading or most advanced checker is currently considered. Ranks are numbers from 0 to 6, since a checker would be promoted to become a king on rank 7. For example, 1231.53 represents a position with 1 black king, 2 white kings, 3 black checkers, and 1 white checker, where the leading black checker is on rank 5 (i.e., 2 rows from being promoted to a king) and the leading white checker is on rank 3 (i.e., 4 rows from being promoted to a king). Consequently, slices where only one side has a checker (and the other side only has kings) have seven possible subdivisions based on the rank of the leading checker. Slices where both sides have checkers have 49 possible subdivisions. Of course, slices with only kings cannot be subdivided according to the leading-checker strategy.

To summarize, subdividing a database into thousands of jobs according to the number of pieces, then the number of pieces of each colour, then the combination of types of pieces, and finally by the rank of the leading checker has two main benefits: an increase in inter-job concurrency and a reduction in the memory and computational requirements of each job. We emphasize that the constraints of the workflow dependencies emerge from:

1. The rules of checkers,
2. The retrograde analysis algorithm, and
3. The subdivision of the checkers endgame databases.

TrellisDAG does not contribute any workflow dependencies in addition to those listed above.
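As an illustration of the slice-parallelism condition above, the following is a small sketch (not part of TrellisDAG; the encoding of a slice as a 4-digit string follows the notation in the text) that checks whether two slices with the same number of pieces per colour may be computed concurrently.

def can_run_concurrently(slice1, slice2):
    """Slices are 4-digit strings: black kings, white kings, black checkers, white checkers."""
    bk1, wk1, bc1, wc1 = (int(d) for d in slice1)
    bk2, wk2, bc2, wc2 = (int(d) for d in slice2)
    # Both slices must have the same number of black pieces and of white pieces.
    assert bk1 + bc1 == bk2 + bc2 and wk1 + wc1 == wk2 + wc2
    # Neither slice can play into the other (Property 3: kings never revert to checkers).
    return (wk1 > wk2 and bk1 < bk2) or (wk1 < wk2 and bk1 > bk2)

print(can_run_concurrently("0340", "1231"))  # True: neither slice plays into the other
print(can_run_concurrently("1330", "1231"))  # False: 1231 plays into 1330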

Fig. 5. An architectural view of TrellisDAG: a description layer is translated and submitted into the TrellisDAG jobs database; service routines (the command-line server) feed placeholders running on execution hosts 1 through n via placeholder scheduling.

Computing checkers endgame databases is a non-trivial application that requires large computing capacity and presents a challenge by demanding several properties of a metacomputing system:

1. A tool/mechanism for convenient description of a multitude of jobs and inter-job dependencies.
2. The ability to dynamically satisfy inter-job dependencies while efficiently using the application-specific opportunities for concurrent execution.
3. A tool for convenient monitoring and administration of the computation.

3 Overview of TrellisDAG

An architectural view of TrellisDAG is presented in Figure 5. To run the computation, the user has to submit the jobs and the dependencies to the jobs database using one of several methods described in Section 3.2. Then one or more placeholders have to be run on the execution nodes (workstations). These placeholders use the services of the command-line server to access and modify the jobs database. The services of the command-line server are described in Section 3.3. Finally, the monitoring and administrative utilities are described in Section 3.4.

3.1 The Group Model

In our system, each job is a part of a group of jobs, and explicit dependencies exist between groups rather than between individual jobs. This simplifies the dependencies between jobs (i.e., the order of execution of jobs within a group is determined by the order of their submission).

Fig. 6. Dashed ovals denote jobs. The dependencies between the jobs of group X are implicit and determined by the order of their submission.

A group may contain either only jobs or both jobs and subgroups. A group is called the supergroup with respect to its subgroups. A group that does not have subgroups is said to be a group at the lowest level. In contrast, a group that does not have a supergroup is said to be a group at the highest level. In general, we say that a subgroup is one level lower than its immediate supergroup.

With each group, there is an associated special group called the prologue group. The prologue group logically belongs to the level of its group, but it does not have any subgroups. Jobs of the prologue group (called prologue jobs) are executed before any job of the subgroups of the group is executed. We also distinguish epilogue jobs. In contrast to prologue jobs, epilogue jobs are executed after all other jobs of the group are complete. In this version of the system, epilogue jobs of a group are part of that group and do not form a separate group (unlike the prologue jobs). Note the following:

1. Jobs within a group will be executed in the order of their submission. In effect, they represent a pipeline. This is illustrated in Figure 6.
2. Dependencies can only be specified between groups and never between individual jobs. If such a dependency is required, a group can always be defined to contain one job. This is illustrated in Figure 7.
3. A supergroup may have jobs of its own. Such jobs are executed after all subgroups of the supergroup are completed. This is illustrated in Figure 8.
4. Dependencies can only be defined between groups with the same supergroup or between groups at the highest level (i.e., groups that do not have a supergroup). This is illustrated in Figure 9.
5. Dependencies between supergroups imply pairwise dependencies between their subgroups. These extra dependencies do not make the workflow incorrect, but they are an important consideration, since they may inhibit the use of concurrency. This is illustrated in Figure 10.
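The following is a small, hypothetical sketch (not TrellisDAG code) of the group model described above: each group is an ordered list of jobs, and group-level dependencies expand into the implied job-level dependency of Figure 7 (the first job of the dependent group depends on the last job of the prerequisite group), while jobs within a group form a pipeline.

# Groups are ordered job lists; dependencies are declared between groups (hypothetical encoding).
groups = {
    "X": ["A", "B", "C"],
    "Y": ["D", "E", "F"],
}
group_deps = {"Y": ["X"]}  # group Y depends on group X

def implied_job_deps(groups, group_deps):
    """Expand the group model into job-level dependencies."""
    deps = {}
    for name, jobs in groups.items():
        # Jobs within a group form a pipeline (submission order).
        for earlier, later in zip(jobs, jobs[1:]):
            deps.setdefault(later, set()).add(earlier)
        # The first job of a dependent group waits for the last job of each prerequisite group.
        for prereq in group_deps.get(name, []):
            deps.setdefault(jobs[0], set()).add(groups[prereq][-1])
    return deps

print(implied_job_deps(groups, group_deps))
# B depends on A, C on B, E on D, F on E, and D (first job of Y) on C (last job of X)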

Fig. 7. The workflow dependency of group Y (jobs D, E, F) on group X (jobs A, B, C) implies the dependency of the first job of Y on the last job of X (this dependency is shown by the dashed arc).

Assume that we have the 2-piece databases computed and that we would like to compute the 3-piece and 4-piece databases. We assume that there are scripts to compute individual slices. For example, running script 2100.sh would compute and verify the databases for all positions with 2 kings of one colour and 1 king of the other colour. The workflow for our example is shown in Figure 11. We can think of defining supergroups for the workflow in Figure 11 as shown by the dashed lines. Then, we can define dependencies between the supergroups and obtain a simpler-looking workflow.

3.2 Submitting the Workflow

There are several ways of describing the workflow. The user chooses the method depending on how complicated the workflow is and whether he or she wants to make use of the grouping capability of the system.

Flat Submission Script. The first way of submitting a workflow is by using the mqsub utility. This utility is similar to qsub of many batch schedulers. However, there is an extra command-line argument (i.e., -deps) to mqsub that lets the user specify workflow dependencies. Note that the names of groups are user-selected and are not scheduler job numbers; the scripts can easily be re-used. In our example, we define a group for each slice to be computed. The (full-size version of the) script in Figure 12 submits the workflow in Figure 11. Note that there are two limitations on the order of submission of the constituents of the workflow to the system:

1. The groups have to be submitted in some legal order, and
2. The jobs within a group have to be submitted in the correct order.
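Since Figure 12 is not reproduced here, the following is a rough, hypothetical sketch of the idea behind a flat submission script for the 3-piece slices of Figure 11, written as a short Python driver that shells out to mqsub. The exact mqsub option syntax (the group-name argument and the form of the -deps argument), and the particular dependency table shown, are assumptions for illustration, not the verbatim TrellisDAG interface.

import subprocess

# (group name, slice script, groups it depends on); hypothetical subset of the slices in Figure 11.
slices = [
    ("2100", "2100.sh", []),
    ("1110", "1110.sh", ["2100"]),
    ("2001", "2001.sh", ["2100"]),
    ("0120", "0120.sh", ["1110"]),
    ("1011", "1011.sh", ["1110", "2001"]),
    ("0021", "0021.sh", ["0120", "1011"]),
]

for group, script, deps in slices:
    cmd = ["mqsub", "-group", group]   # assumed way of naming the group
    if deps:
        cmd += ["-deps"] + deps        # -deps is the dependency argument described above
    cmd.append(script)
    subprocess.run(cmd, check=True)    # submit the job into the jobs database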

Fig. 8. Group X has subgroups K, L, and M and jobs A, B, and C. These jobs will be executed after all of the jobs of K, L, M and their subgroups are completed.

Fig. 9. Groups K, L, and M have the same supergroup, X; therefore, we can specify dependencies between these groups. Similarly, group Y is a common supergroup for groups P and Q. In contrast, groups K and P do not have a common (immediate) supergroup and cannot have an explicit dependency between them.

Using a Makefile. A higher-level description of a workflow is possible via a Makefile. The user simply writes a Makefile that can be interpreted by the standard UNIX make utility. Each rule in this Makefile computes a part of the checkers databases and specifies the dependencies of that part on other parts. TrellisDAG includes a utility, called mqtranslate, that translates such a Makefile into another Makefile, in which every command line is replaced by a corresponding call to mqsub. We present part of a translated Makefile for our example in Figure 13.

Fig. 10. Group Y depends on group X, and this implies pairwise dependencies between their subgroups (these dependencies are denoted by dashed arrows).

The DAG Description Script. Writing a flat submission script or a Makefile may be a cumbersome task, especially when the workflow contains hundreds or thousands of jobs, as in the checkers computation. For some applications, it is possible to come up with simple naming conventions for the jobs and to write a script that automatically produces a flat submission script or a Makefile. TrellisDAG helps the user by providing a framework for such a script. Moreover, through this framework (which we call the DAG description script), the additional functionality of supergroups becomes available. A DAG description script is simply a module coded in the Python scripting language; this module implements the functions required by the TrellisDAG interface. TrellisDAG has a way to transform that module into a Makefile and further into a flat submission script, as described above.

The sample DAG description script in Figure 14 describes the workflow in Figure 11 with supergroups. Line 19 states that there are two levels of groups. A group at level VS is identified by two integers, while a group at level Slice is identified by four integers. The function generateGroup (lines 21-41) returns the list of groups with a given supergroup at a given level. The generated groups correspond to the nodes (in the case of the level Slice) and the dashed boxes (in the case of the level VS) in Figure 11. The function getDependent (lines 57-61) returns a list of groups with a given supergroup, on which a given group with the same supergroup depends. The executables for the jobs of a given group are returned by the getJobsExecutables function (lines 43-46). Note that, in our example, computing each slice involves a computation job and a verification job; they are represented by executables with suffixes .comp.sh and .ver.sh, respectively. We will now turn to the getJobsAttributes function (lines 48-51).
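Figure 14 is not reproduced in this text; as a rough illustration only, here is a hypothetical sketch in the spirit of a DAG description script for the 3-piece portion of the workflow. The function names follow those mentioned above, but their exact signatures, the level-declaration variable, and the slice dependency table are assumptions made for illustration rather than the actual TrellisDAG interface.

# Hypothetical sketch of a DAG description script (not the verbatim TrellisDAG interface).
levels = ["VS", "Slice"]  # assumed way of declaring the two group levels

# Assumed dependency table for the 3-piece slices (see Figure 11).
SLICE_DEPS = {
    "2100": [], "1110": ["2100"], "2001": ["2100"],
    "0120": ["1110"], "1011": ["1110", "2001"], "0021": ["0120", "1011"],
}

def generateGroup(level, supergroup):
    """Return the groups at the given level under the given supergroup."""
    if level == "VS":
        return ["21"]                      # one VS supergroup: 2 pieces vs. 1 piece
    if level == "Slice" and supergroup == "21":
        return sorted(SLICE_DEPS)
    return []

def getDependent(level, supergroup, group):
    """Return the groups (with the same supergroup) that 'group' depends on."""
    if level == "Slice":
        return SLICE_DEPS.get(group, [])
    return []

def getJobsExecutables(group):
    """Each slice has a computation job followed by a verification job."""
    return [group + ".comp.sh", group + ".ver.sh"]

def getJobsAttributes(group):
    """Early release after the computation job; verification runs on the same host."""
    return [{"release": "yes"}, {"affinity": "yes"}]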

Fig. 11. The workflow for the running example (slices of the 3-piece database, 2 vs. 1, and of the 4-piece database, 3 vs. 1 and 2 vs. 2). The nodes are slices. The dashed lines represent the natural way of defining supergroups; if such supergroups were defined, then the only workflow dependencies would be expressed by the thick arrows.

Sometimes it is convenient to associate more information with a job than just the command line. Such information can be used by the scheduling system or by the user. In TrellisDAG, the extra information associated with individual jobs is stored as key-value pairs called attributes. In the current version of the system, we have several kinds of attributes that serve the following purposes:

1. Increase the degree of concurrency by relaxing workflow dependencies.
2. Regulate the placement of jobs, i.e., the mapping of jobs to execution hosts.
3. Store the history profile.

In this section, we concentrate on the first two kinds of attributes. By separating the computation of the checkers endgame databases from their verification, we can potentially achieve a higher degree of concurrency. To do that, we introduce an attribute that allows the dependent groups to proceed when a group reaches a certain point in the computation (i.e., a certain number of jobs are complete). We refer to this feature as early release. In our example, the computation job of all slices will have the release attribute set to yes. We also take into account that verification needs the data that has been produced during the computation. Therefore, it is desirable that verification run on the same machine as the computation. Hence, we introduce another attribute called affinity. When the affinity of a job is set to yes, the job is forced to be executed on the same machine as the previous job of its group.


Fig. 12. Part of the flat submission script for the running example

Fig. 13. Part of the Makefile with calls to mqsub for the running example

3.3 Services of the Command-Line Server

Services of the command-line server (see Figure 5) are the programs through which a placeholder can access and modify the jobs database. These services are normally called within an SSH session in the placeholder. All of the services get their parameters on the command line as key-value pairs. For example, if the value associated with the key id is 5, then the key-value pair on the command line is id=5.

We start with the service called mqnextjob. The output of this service is the job ID of the job that is scheduled to be run by the calling placeholder. For example, the service could be called as follows:

ssh server "mqnextjob sched=SGE sched_id=$JOB_ID \
    submit_host=brule host=`hostname`"

Once the job's ID is obtained, we can obtain the command line for that job using the mqgetjobcommand service. The command line is output to the standard output. The placeholder has access to the attributes of a job through the mqgetjobattribute service. The service outputs the value associated with the given attribute. When the job is complete, the placeholder has to inform the system about this event. This is done using the mqdonejob service.
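Putting these services together, a placeholder's main loop might look roughly like the following Python sketch. This is an illustration only, not TrellisDAG's actual placeholder: the service names (mqnextjob, mqgetjobcommand, mqdonejob) are the ones described above, but the exact argument syntax, the server name, and the resubmission step are assumptions.

import socket
import subprocess

SERVER = "server"  # hypothetical central host name

def call_service(command):
    """Run a command-line server service over SSH and return its stdout."""
    out = subprocess.run(["ssh", SERVER, command],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def placeholder_iteration():
    # 1. Get the next job assigned to this placeholder (argument syntax assumed).
    job_id = call_service("mqnextjob host=%s" % socket.gethostname())
    if not job_id:
        return False                       # no work available
    # 2. Execute the job's command line locally.
    command_line = call_service("mqgetjobcommand id=%s" % job_id)
    subprocess.run(command_line, shell=True)
    # 3. Report completion so that dependent jobs can become ready.
    call_service("mqdonejob id=%s" % job_id)
    return True

while placeholder_iteration():
    pass  # a real placeholder would also resubmit itself to the local batch scheduler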


Fig. 14. The DAG description script with supergroups and attributes


Fig. 15. The first screen of mqdump for the running example

The user can maintain his own status of the running job in the jobs database by using the mqstatus service. The status entered through this service is shown by the mqstat monitoring utility (see Section 3.4).

3.4 Utilities

The utilities described in this section are used to monitor the computation and perform some basic administrative tasks.

mqstat: Status of the Computation. This utility outputs information about running jobs: job ID, command line, hostname of the machine where the job is running, the time when the job started, and the status submitted by means of calling the mqstatus service.

mqdump: The Jobs Browser. mqdump is an interactive tool for browsing, monitoring, and changing the dynamic status of groups. The first screen of mqdump for the example corresponding to the DAG description script in Figure 14 is presented in Figure 15. Following the prompt, the user types a number from 1 to 11 corresponding to the command that must be executed. As we see from Figure 15, the user can dive into the supergroups, see the status of the computation of each group, and perform several administrative tasks. For example, the user can disable a group, thereby preventing it from being executed; or he can assert that a part of the computation was performed without using TrellisDAG by marking one or more groups as done.

3.5 Concluding Remarks

The features of TrellisDAG can be summarized as follows (see Figure 5):

1. The workflow and its dependencies can be described in several ways. The user chooses the mechanism or technique based on the properties of the workflow (such as its level of complexity) and his or her own preferences.
2. Whatever way of description the user chooses, utilities are provided to translate this description into the jobs database.
3. Services of the command-line server are provided so that the user can access and modify the jobs database dynamically, either from the command line or from within a placeholder.
4. mqstat and mqdump are utilities that can be used to perform monitoring and simple administrative tasks.
5. The history profile of the last computation is stored in the jobs' attributes.

4 Experimental Assessment of TrellisDAG

As a case study, we present an experiment that shows how TrellisDAG can be applied to compute part of the checkers endgame databases. The goal of these simple experiments is to demonstrate the key features of the system, namely:

1. Simplicity of specifying dependencies. Simple workflows are submitted using mqsub. For other workflows, writing the DAG description script is a fairly straightforward task (Figure 14), considering the complexity of the dependencies.
2. Good use of opportunities for concurrent execution. We introduce the notion of a minimal schedule and use that notion to argue that TrellisDAG makes good use of the opportunities for concurrency that are inherent in the given workflow.
3. Small overhead of using the system and placeholder scheduling. We quantify the overhead that the user incurs by using TrellisDAG. Since TrellisDAG is bound to placeholder scheduling, this overhead includes the overhead of running the placeholders.

We factored out the data-movement overheads in these experiments. However, data movement can be supported [10].

4.1 The Minimal Schedule

We introduce our concept of the minimal schedule. We use this concept to tackle the following question: what is considered a good utilization of opportunities for concurrent execution? The minimal schedule is a mapping of jobs to resources that is made during a model run: an emulated computation where an infinite number of processors is assumed and the jobs are run without any start-up overhead.


Therefore, the only limits on the degree of concurrency in the minimal schedule are the job dependencies (not resources), and there are no scheduler overheads. The algorithm for emulating a model run is as follows:

1. Set the time to 0 (i.e., t = 0) and the set of running jobs to be empty.
2. Add all of the ready jobs to the set of running jobs. For each job, store its estimated completion time (using the time measured during the sequential run); that is, if a job j1 took t1 seconds to be computed in the sequential run, then the estimated completion time of j1 is t + t1.
3. Find the minimum of all completion times in the set of running jobs and set the current time t to that time.
4. Remove those jobs from the set of running jobs whose completion time is t; these jobs are considered complete.
5. If there are more jobs to be executed, go to step 2.

No system that respects the workflow dependencies can beat the model computation as described above in terms of the makespan. Therefore, if we show that our system produces a schedule that is close to the minimal schedule, we will have shown that the opportunities for concurrency are used well.
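The emulation above is straightforward to express in code. The following is a small sketch of the minimal-schedule computation under the stated assumptions (unbounded processors, sequential-run job durations, no start-up overhead); the dictionary-based inputs are a hypothetical encoding, not TrellisDAG's own format.

def minimal_schedule_makespan(durations, deps):
    """durations: job -> sequential run time; deps: job -> set of prerequisite jobs."""
    t = 0.0
    done, running = set(), {}          # running: job -> estimated completion time
    while len(done) < len(durations):
        # Start every job whose prerequisites are all complete (unbounded processors).
        for job, prereqs in deps.items():
            if job not in done and job not in running and prereqs <= done:
                running[job] = t + durations[job]
        # Advance time to the earliest completion and retire the finished jobs.
        t = min(running.values())
        for job, finish in list(running.items()):
            if finish == t:
                done.add(job)
                del running[job]
    return t

# Example with the four-job workflow of Figure 1 (hypothetical durations, in seconds):
durations = {"A": 100.0, "B": 40.0, "C": 60.0, "D": 10.0}
deps = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
print(minimal_schedule_makespan(durations, deps))  # 170.0: A, then B and C in parallel, then D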

4.2 Experimental Setup

All of the experiments were performed using a dedicated cluster of 20 nodes with dual AMD Athlon 1.5 GHz CPUs and 512 MB of memory. Since the application is memory-intensive, we used only one of the CPUs on each node. Sun Grid Engine (SGE) was the local batch scheduler, and the PostgreSQL database was used to maintain the jobs database. After each run, the computed databases were verified against a trusted, previously-computed version of the database. If the ordering of computation in the workflow is incorrect, the database will not verify correctly. Therefore, this verification step also indicates a proper schedule for the workflow.

4.3 Computing All 4 versus 3 Pieces Checkers Endgame Databases

We use the grouping capability of TrellisDAG and define 3 levels: VS (as in 4 "versus" 3 pieces in the 7-piece database), Slice, and Rank (which are further subdivisions of the problem). Over 3 runs, the median time was 32,987 seconds (i.e., 9.16 hours), with a maximum of 33,067 seconds and a minimum of 32,891 seconds. The minimal schedule is computed to be 32,444 seconds, which implies that our median time is within 1.7% of the minimal schedule. Since the cluster is dedicated, this result is largely expected. However, the small difference between the minimal schedule and the computation does show that TrellisDAG makes good use of the available concurrency (otherwise, the deviation from the minimal schedule would be much greater). Furthermore, the experiment shows that the cumulative overhead of the placeholders and the TrellisDAG services is small compared to the computation.

Fig. 16. Degree of concurrency chart for the experiment without data movement (number of concurrently executing jobs versus time in seconds).

Fig. 17. Degree of concurrency chart for the experiment without data movement: minimal schedule.

The degree of concurrency chart is shown in Figure 16. This chart shows how many jobs were being executed concurrently at any given point in time. We note that the maximal degree of concurrency achieved in the experiment is 18. However, the total time during which the degree of concurrency was 18 (or greater than 12, for that matter) is very small, and adding several more processors would not significantly affect the makespan.

The degree of concurrency chart in Figure 17 corresponds to the minimal schedule. Note that the maximal degree of concurrency achieved by the minimal schedule is 20, and this degree of concurrency was achieved at approximately the same time in the computation as when the degree of concurrency of 18 was achieved in the real run (compare Figure 16 and Figure 17). With this exception, the degree of concurrency chart for the minimal schedule looks like a slightly condensed version of the chart for the real run.

Summary. The given experiment is an example of using TrellisDAG to compute a complicated workflow with hundreds of jobs and inter-job dependencies.


Writing a DAG description script of only a few dozen lines is enough to describe the whole workflow. Also, the makespan of the computation was close to that of the model run. Elsewhere, we show that the system can also be used for computations requiring data movement [10].

5 Concluding Remarks

We have shown how TrellisDAG can be effectively used to submit and execute a workflow represented by a large DAG of dependencies. So far, our work has concentrated on providing the mechanisms and capability for specifying (e.g., via DAG description scripts) and submitting entire workflows. The motivation for our work is two-fold: first, some HPC workloads, from bioinformatics to retrograde analysis, have non-trivial workflows; second, by providing scheduler mechanisms to describe workloads, we hope to facilitate future work on scheduling policies that can benefit from the knowledge of forthcoming jobs in the workflow. In the meantime, several properties of TrellisDAG have been shown:

1. DAG description scripts for the workflows are relatively easy to write.
2. The makespans achieved by TrellisDAG are close to the makespans achieved by the minimal schedule. We conclude that the system effectively utilizes the opportunities for concurrent execution.
3. The overhead of using TrellisDAG is not high, which is also justified by comparing the makespan achieved by the computation using TrellisDAG with the makespan of the model computation.

As discussed above, our future work with TrellisDAG will include research on new scheduling policies. We believe that the combination of knowledge of the past (i.e., trace information on service time and data accessed by completed jobs) and knowledge of the future (i.e., the workflow made visible to the scheduler) is powerful when dealing with multiprogrammed workloads and complicated job dependencies. For example, there is still a lot to explore in terms of policies that exploit data locality when scheduling jobs in an entire workload. In terms of deployment, we hope to use TrellisDAG in future instances of CISS [15] and as part of the newly-constituted, multi-institutional WestGrid Project [21] in Western Canada. In terms of applications, TrellisDAG will be a key component in the Proteome Analyst Project [19] and in our on-going attempts to solve the game of checkers.

References

[1] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25:3389-3402, 1997.
[2] CISS - The Canadian Internetworked Scientific Supercomputer. http://www.cs.ualberta.ca/~ciss/.
[3] Condor. http://www.cs.wisc.edu/condor.
[4] DAGMan Metascheduler. http://www.cs.wisc.edu/condor/dagman/.
[5] D. G. Feitelson, L. Rudolph, U. Schwiegelshohn, K. C. Sevcik, and P. Wong. Theory and Practice in Parallel Job Scheduling. In D. G. Feitelson and L. Rudolph, editors, Job Scheduling Strategies for Parallel Processing, volume 1291 of Lecture Notes in Computer Science, pages 1-34. Springer-Verlag, 1997.
[6] I. Foster and C. Kesselman, editors. The Grid: Blueprint for a Future Computing Infrastructure. Morgan-Kaufmann, 1999.
[7] I. Foster, C. Kesselman, J. M. Nick, and S. Tuecke. The Physiology of the Grid: An Open Grid Services Architecture for Distributed System Integration, June 2002.
[8] J. Frey, T. Tannenbaum, M. Livny, I. Foster, and S. Tuecke. Condor-G: A computation management agent for multi-institutional grids. In Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing, pages 55-63, San Francisco, CA, USA, August 2001. IEEE Computer Society Press.
[9] Globus Project. http://www.globus.org/.
[10] M. Goldenberg. TrellisDAG: A System For Structured DAG Scheduling. Master's thesis, Dept. of Computing Science, University of Alberta, Edmonton, Alberta, Canada, 2003.
[11] R. Lake and J. Schaeffer. Solving the Game of Checkers. In Richard J. Nowakowski, editor, Games of No Chance, pages 119-133. Cambridge University Press, 1996.
[12] R. Lake, J. Schaeffer, and P. Lu. Solving Large Retrograde Analysis Problems Using a Network of Workstations. Advances in Computer Chess, VII:135-162, 1994.
[13] M. J. Litzkow, M. Livny, and M. W. Mutka. Condor: A hunter of idle workstations. In 8th International Conference on Distributed Computing Systems, pages 104-111, Washington, D.C., USA, June 1988. IEEE Computer Society Press.
[14] C. Pinchak, P. Lu, and M. Goldenberg. Practical Heterogeneous Placeholder Scheduling in Overlay Metacomputers: Early Experiences. In 8th Workshop on Job Scheduling Strategies for Parallel Processing, Edinburgh, Scotland, U.K., July 24, 2002.
[15] C. Pinchak, P. Lu, J. Schaeffer, and M. Goldenberg. The Canadian Internetworked Scientific Supercomputer. In 17th Annual International Symposium on High Performance Computing Systems and Applications (HPCS), pages 193-199, Sherbrooke, Quebec, Canada, May 11-14, 2003.
[16] RC5 Project. http://www.distributed.net/rc5.
[17] J. Schaeffer, Y. Björnsson, N. Burch, R. Lake, P. Lu, and S. Sutphen. Building the Checkers 10-Piece Endgame Databases. In Advances in Computer Games X, 2003. In press.
[18] SETI@home. http://setiathome.ssl.berkeley.edu/.
[19] D. Szafron, P. Lu, R. Greiner, D. Wishart, Z. Lu, B. Poulin, R. Eisner, J. Anvik, C. Macdonell, and B. Habibi-Nazhad. Proteome Analyst - Transparent High-Throughput Protein Annotation: Function, Localization, and Custom Predictors. Technical Report TR 03-05, Dept. of Computing Science, University of Alberta, 2003. http://www.cs.ualberta.ca/~bioinfo/PA/.
[20] D. Thain, J. Bent, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, and M. Livny. The architectural implications of pipeline and batch sharing in scientific workloads. Technical Report 1463, Computer Sciences Department, University of Wisconsin, Madison, 2002.
[21] WestGrid. http://www.westgrid.ca/.

SLURM: Simple Linux Utility for Resource Management

Andy B. Yoo, Morris A. Jette, and Mark Grondona

Lawrence Livermore National Laboratory, Livermore, CA 94551
{ayoo,jette1,grondona1}@llnl.gov

Abstract. A new cluster resource management system called Simple Linux Utility Resource Management (SLURM) is described in this paper. SLURM, initially developed for large Linux clusters at the Lawrence Livermore National Laboratory (LLNL), is a simple cluster manager that can scale to thousands of processors. SLURM is designed to be flexible and fault-tolerant and can be ported to other clusters of different size and architecture with minimal effort. We are certain that SLURM will benefit both users and system architects by providing them with a simple, robust, and highly scalable parallel job execution environment for their cluster system.

1 Introduction

Linux clusters, often constructed by using commodity off-the-shelf (COTS) components, have become increasingly popular as a computing platform for parallel computation in recent years, mainly due to their ability to deliver a high performance-cost ratio. Researchers have built and used small- to medium-size clusters for various applications [3, 16]. The continuous decrease in the price of the COTS parts, in conjunction with the good scalability of the cluster architecture, has now made it feasible to economically build large-scale clusters with thousands of processors [18, 19].

This document was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor the University of California nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or the University of California, and shall not be used for advertising or product endorsement purposes. This work was performed under the auspices of the U. S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under Contract No. W-7405-Eng-48. Document UCRL-JC-147996.



An essential component that is needed to harness such a computer is a resource management system. A resource management system (or resource manager) performs such crucial tasks as scheduling user jobs, monitoring machine and job status, launching user applications, and managing machine configuration. An ideal resource manager should be simple, efficient, scalable, fault-tolerant, and portable. Unfortunately, there are no open-source resource management systems currently available that satisfy these requirements. A survey [12] has revealed that many existing resource managers have poor scalability and fault-tolerance, rendering them unsuitable for large clusters having thousands of processors [14, 11]. While some proprietary cluster managers are suitable for large clusters, they are typically designed for particular computer systems and/or interconnects [21, 14, 11]. Proprietary systems can also be expensive and unavailable in source-code form. Furthermore, proprietary cluster management functionality is usually provided as a part of a specific job scheduling system package. This mandates the use of the given scheduler just to manage a cluster, even though the scheduler does not necessarily meet the needs of the organization that hosts the cluster. A clear separation of the cluster management functionality from the scheduling policy is desired.

This observation led us to set out to design a simple, highly scalable, and portable resource management system. The result of this effort is Simple Linux Utility Resource Management (SLURM), a name that is a tip of the hat to Matt Groening and the creators of Futurama, where Slurm is the most popular carbonated beverage in the universe. SLURM was developed with the following design goals:

– Simplicity: SLURM is simple enough to allow motivated end-users to understand its source code and add functionality. The authors will avoid the temptation to add features unless they are of general appeal.
– Open Source: SLURM is available to everyone and will remain free. Its source code is distributed under the GNU General Public License [9].
– Portability: SLURM is written in the C language, with a GNU autoconf configuration engine. While initially written for Linux, other UNIX-like operating systems should be easy porting targets. SLURM also supports a general-purpose plug-in mechanism, which permits a variety of different infrastructures to be easily supported. The SLURM configuration file specifies which set of plug-in modules should be used.
– Interconnect independence: SLURM supports UDP/IP-based communication as well as the Quadrics Elan3 and Myrinet interconnects. Adding support for other interconnects is straightforward and utilizes the plug-in mechanism described above.
– Scalability: SLURM is designed for scalability to clusters of thousands of nodes. Jobs may specify their resource requirements in a variety of ways, including requirements options and ranges, potentially permitting faster initiation than otherwise possible.


– Robustness: SLURM can handle a variety of failure modes without terminating workloads, including crashes of the node running the SLURM controller. User jobs may be configured to continue execution despite the failure of one or more nodes on which they are executing. Nodes allocated to a job are available for reuse as soon as the job(s) allocated to that node terminate. If some nodes fail to complete job termination in a timely fashion due to hardware or software problems, only the scheduling of those tardy nodes will be affected.
– Secure: SLURM employs crypto technology to authenticate users to services and services to each other, with a variety of options available through the plug-in mechanism. SLURM does not assume that its networks are physically secure, but it does assume that the entire cluster is within a single administrative domain with a common user base across the entire cluster.
– System administrator friendly: SLURM is configured using a simple configuration file and minimizes distributed state. Its configuration may be changed at any time without impacting running jobs. Heterogeneous nodes within a cluster may be easily managed. SLURM interfaces are usable by scripts, and its behavior is highly deterministic.

The main contribution of our work is that we have provided a readily available tool that anybody can use to efficiently manage clusters of different sizes and architectures. SLURM is highly scalable; it was observed that it took less than five seconds for SLURM to launch a 1900-task job over 950 nodes on a recently installed cluster at Lawrence Livermore National Laboratory. SLURM can be easily ported to any cluster system with minimal effort using its plug-in capability, and it can be used with any meta-batch scheduler or Grid resource broker [7] through its well-defined interfaces.

The rest of the paper is organized as follows. Section 2 describes the architecture of SLURM in detail. Section 3 discusses the services provided by SLURM, followed by a performance study of SLURM in Section 4. A brief survey of existing cluster management systems is presented in Section 5. Concluding remarks and future development plans for SLURM are given in Section 6.

2 SLURM Architecture

As a cluster resource manager, SLURM has three key functions. First, it allocates exclusive and/or non-exclusive access to resources to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work on the set of allocated nodes. Finally, it arbitrates conflicting requests for resources by managing a queue of pending work. Users and system administrators interact with SLURM using simple commands. Figure 1 depicts the key components of SLURM. As shown in Figure 1, SLURM consists of a slurmd daemon running on each compute node, a central slurmctld daemon running on a management node (with optional fail-over twin), and five command line utilities, which can run anywhere in the cluster.

2 It was observed that it took less than five seconds for SLURM to launch a 1900-task job over 950 nodes on a recently installed cluster at Lawrence Livermore National Laboratory.


Fig. 1. SLURM Architecture

The entities managed by these SLURM daemons include nodes, the compute resource in SLURM, and partitions, which group nodes into logical, disjoint sets. The entities also include jobs, or allocations of resources assigned to a user for a specified amount of time, and job steps, which are sets of tasks within a job. Each job is allocated nodes within a single partition. Once a job is assigned a set of nodes, the user is able to initiate parallel work in the form of job steps in any configuration within the allocation. For instance, a single job step may be started which utilizes all nodes allocated to the job, or several job steps may independently use a portion of the allocation. Figure 2 exposes the subsystems that are implemented within the slurmd and slurmctld daemons. These subsystems are explained in more detail below.
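
To make these relationships concrete, the following is a minimal sketch in C of how the managed entities might be represented. The type and field names are hypothetical illustrations, not SLURM's actual internal data structures.

    /* Hypothetical, simplified view of the entities SLURM manages.
     * Type and field names are illustrative, not SLURM's real ones. */
    #include <stddef.h>
    #include <time.h>

    struct node {                      /* one compute node */
        char name[64];
        int  up;                       /* nonzero if usable */
    };

    struct partition {                 /* a logical, disjoint set of nodes */
        char         name[64];
        struct node *nodes;
        size_t       node_count;
    };

    struct job_step {                  /* a set of tasks within a job */
        int    step_id;
        size_t task_count;
    };

    struct job {                       /* an allocation of nodes to one user */
        int               job_id;
        int               user_id;
        struct partition *partition;   /* each job lives in a single partition */
        struct node     **allocated;   /* nodes granted to this job */
        size_t            allocated_count;
        time_t            time_limit;  /* the specified amount of time */
        struct job_step  *steps;       /* parallel work within the allocation */
        size_t            step_count;
    };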

2.1 SLURM Local Daemon (Slurmd)

The slurmd is a multi-threaded daemon running on each compute node. It reads the common SLURM configuration file and recovers any previously saved state information, notifies the controller that it is active, waits for work, executes the work, returns status, and waits for more work. Since it initiates jobs for other users, it must run with root privilege. The only job information it has at any given time pertains to its currently executing jobs. The slurmd performs the following major tasks.
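
As an illustration of the single, simple configuration file shared by the daemons, a slurm.conf-style excerpt might look like the following. The keywords shown follow the general style of SLURM configuration files, but the exact keyword names, plug-in identifiers, and node syntax vary by version, so treat this purely as an assumed sketch rather than a definitive reference.

    # Hypothetical excerpt of a SLURM configuration file (keywords illustrative)
    ControlMachine=mgmt1           # node running slurmctld
    BackupController=mgmt2         # optional fail-over twin
    AuthType=auth/munge            # authentication plug-in to load
    SlurmUser=slurm                # privileged SLURM user
    NodeName=linux[1-32] Procs=2   # compute nodes served by slurmd
    PartitionName=debug Nodes=linux[1-32] Default=YES State=UP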


Fig. 2. SLURM Architecture - Subsystems

– Machine and Job Status Services: Respond to controller requests for machine and job state information, and send asynchronous reports of some state changes (e.g. slurmd startup) to the controller.
– Remote Execution: Start, monitor, and clean up after a set of processes (typically belonging to a parallel job) as dictated by the slurmctld daemon or an srun or scancel command. Starting a process may include executing a prolog program, setting process limits, setting real and effective user id, establishing environment variables, setting working directory, allocating interconnect resources, setting core file paths, initializing the Stream Copy Service, and managing process groups. Terminating a process may include terminating all members of a process group and executing an epilog program.
– Stream Copy Service: Allow handling of stderr, stdout, and stdin of remote tasks. Job input may be redirected from a file or files, an srun process, or /dev/null. Job output may be saved into local files or sent back to the srun command. Regardless of the location of stdout or stderr, all job output is locally buffered to avoid blocking local tasks.
– Job Control: Allow asynchronous interaction with the Remote Execution environment by propagating signals or explicit job termination requests to any set of locally managed processes.

2.2 SLURM Central Daemon (Slurmctld)

Most SLURM state information is maintained by the controller, slurmctld. The slurmctld is multi-threaded with independent read and write locks for the various data structures to enhance scalability.


When slurmctld starts, it reads the SLURM configuration file. It can also read additional state information from a checkpoint file generated by a previous execution of slurmctld. Full controller state information is written to disk periodically, with incremental changes written to disk immediately for fault-tolerance. The slurmctld runs in either master or standby mode, depending on the state of its fail-over twin, if any. The slurmctld need not execute with root privilege. The slurmctld consists of three major components:
– Node Manager: Monitors the state of each node in the cluster. It polls slurmd's for status periodically and receives state change notifications from slurmd daemons asynchronously. It ensures that nodes have the prescribed configuration before being considered available for use.
– Partition Manager: Groups nodes into non-overlapping sets called partitions. Each partition can have associated with it various job limits and access controls. The partition manager also allocates nodes to jobs based upon node and partition states and configurations. Requests to initiate jobs come from the Job Manager. The scontrol command may be used to administratively alter node and partition configurations.
– Job Manager: Accepts user job requests and places pending jobs in a priority-ordered queue. The Job Manager is awakened on a periodic basis and whenever there is a change in state that might permit a job to begin running, such as job completion, job submission, partition-up transition, node-up transition, etc. The Job Manager then makes a pass through the priority-ordered job queue. The highest-priority jobs for each partition are allocated resources as available. As soon as an allocation failure occurs for any partition, no lower-priority jobs for that partition are considered for initiation. After completing the scheduling cycle, the Job Manager's scheduling thread sleeps. Once a job has been allocated resources, the Job Manager transfers the necessary state information to those nodes, permitting the job to commence execution. When the Job Manager detects that all nodes associated with a job have completed their work, it initiates clean-up and performs another scheduling cycle as described above.
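
The scheduling pass described for the Job Manager can be summarized in a short sketch. The code below is an illustrative rendering of the stated policy (walk the queue in priority order, and stop considering lower-priority jobs for a partition as soon as an allocation fails there); the types and helper functions are hypothetical, and this is not SLURM's actual implementation.

    /* Illustrative sketch of the Job Manager's scheduling pass.
     * queue_t, job_t, and the helpers below are hypothetical. */
    #include <stdbool.h>

    typedef struct job   job_t;
    typedef struct queue queue_t;

    extern job_t *next_job_by_priority(queue_t *q, job_t *after);
    extern int    partition_of(job_t *j);     /* index < MAX_PARTITIONS assumed */
    extern bool   try_allocate(job_t *j);     /* true if resources were granted */
    extern void   start_job(job_t *j);

    #define MAX_PARTITIONS 64

    void scheduling_pass(queue_t *pending)
    {
        bool blocked[MAX_PARTITIONS] = { false };

        /* Walk the pending queue in priority order. */
        for (job_t *j = next_job_by_priority(pending, NULL);
             j != NULL;
             j = next_job_by_priority(pending, j)) {

            int p = partition_of(j);
            if (blocked[p])
                continue;              /* a higher-priority job already failed here */

            if (try_allocate(j))
                start_job(j);
            else
                blocked[p] = true;     /* skip lower-priority jobs in this partition */
        }
        /* After the pass, the scheduling thread sleeps until the next state
         * change (job completion, submission, node-up transition, ...). */
    }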

3 SLURM Operation and Services

3.1 Command Line Utilities

The command line utilities are the user interface to SLURM functionality. They offer users access to remote execution and job control. They also permit administrators to dynamically change the system configuration. These commands all use SLURM APIs, which are directly available for more sophisticated applications.
– scancel: Cancel a running or a pending job or job step, subject to authentication and authorization. This command can also be used to send an arbitrary signal to all processes on all nodes associated with a job or job step.


– scontrol: Perform privileged administrative commands such as draining a node or partition in preparation for maintenance. Many scontrol functions can only be executed by privileged users.
– sinfo: Display a summary of partition and node information. An assortment of filtering and output format options are available.
– squeue: Display the queue of running and waiting jobs and/or job steps. A wide assortment of filtering, sorting, and output format options are available.
– srun: Allocate resources, submit jobs to the SLURM queue, and initiate parallel tasks (job steps). Every set of executing parallel tasks has an associated srun which initiated it and, if the srun persists, manages it. Jobs may be submitted for batch execution, in which case srun terminates after job submission. Jobs may also be submitted for interactive execution, where srun keeps running to shepherd the running job. In this case, srun negotiates connections with remote slurmd's for job initiation and to get stdout and stderr, forward stdin, and respond to signals from the user. The srun may also be instructed to allocate a set of resources and spawn a shell with access to those resources. srun has a total of 13 parameters to control where and when the job is initiated.
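
For illustration, a short interactive session with these utilities might look like the following. The -N (node count) and -n (task count) options are standard srun options, but option spellings differ across SLURM versions, and the job ID shown is hypothetical.

    # Run an 8-task job step across 4 nodes interactively
    srun -N 4 -n 8 /bin/hostname

    # Summarize partitions and nodes, then list queued and running jobs
    sinfo
    squeue

    # Cancel a job by its (hypothetical) job ID
    scancel 1234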

3.2 Plug-ins

In order to make the use of different infrastructures possible, SLURM uses a general-purpose plug-in mechanism. A SLURM plug-in is a dynamically linked code object which is loaded explicitly at run time by the SLURM libraries. A plug-in provides a customized implementation of a well-defined API connected to tasks such as authentication, interconnect fabric, and task scheduling. A common set of functions is defined for use by all of the different infrastructures of a particular variety. For example, the authentication plug-in must define functions such as slurm_auth_activate to create a credential, slurm_auth_verify to verify a credential to approve or deny authentication, slurm_auth_get_uid to get the user ID associated with a specific credential, etc. It must also define the data structure used, a plug-in type, and a plug-in version number. The available plug-ins are defined in the configuration file.
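
A rough sketch of what an authentication plug-in's interface could look like is given below. Only the function names slurm_auth_activate, slurm_auth_verify, and slurm_auth_get_uid come from the text; the struct layout and signatures are assumptions used to illustrate the plug-in idea, not SLURM's real plug-in ABI.

    /* Hypothetical shape of an authentication plug-in; not SLURM's real ABI. */
    #include <sys/types.h>   /* uid_t */

    typedef struct slurm_auth_credential slurm_auth_credential_t;  /* opaque */

    struct slurm_auth_plugin {
        const char *plugin_type;      /* e.g. "auth/munge" (illustrative) */
        unsigned    plugin_version;

        /* Create a credential for the calling user. */
        slurm_auth_credential_t *(*slurm_auth_activate)(void);

        /* Verify a credential; return 0 to approve, nonzero to deny. */
        int   (*slurm_auth_verify)(slurm_auth_credential_t *cred);

        /* Report the user ID bound to a verified credential. */
        uid_t (*slurm_auth_get_uid)(slurm_auth_credential_t *cred);
    };

    /* A plug-in is a dynamically linked object: the SLURM libraries would
     * load it at run time and look up a table like this one. */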

3.3 Communications Layer

SLURM presently uses Berkeley sockets for communications. However, we anticipate using the plug-in mechanism to easily permit use of other communications layers. At LLNL we are using Ethernet for SLURM communications and the Quadrics Elan switch exclusively for user applications. The SLURM configuration file permits the identification of each node's hostname as well as its name to be used for communications. While SLURM is able to manage 1000 nodes without difficulty using sockets and Ethernet, we are reviewing other communication mechanisms which may offer improved scalability. One possible alternative is STORM [8].


STORM uses the cluster interconnect and Network Interface Cards to provide high-speed communications, including a broadcast capability. STORM only supports the Quadrics Elan interconnect at present, but does offer the promise of improved performance and scalability.

3.4 Security

SLURM has a simple security model: Any user of the cluster may submit parallel jobs to execute and cancel his own jobs. Any user may view SLURM configuration and state information. Only privileged users may modify the SLURM configuration, cancel any job, or perform other restricted activities. Privileged users in SLURM include the users root and SlurmUser (as defined in the SLURM configuration file). If permission to modify SLURM configuration is required by others, set-uid programs may be used to grant specific permissions to specific users. We presently support three authentication mechanisms via plug-ins: authd [10], munged, and none. A plug-in can easily be developed for Kerberos or other authentication mechanisms as desired. The munged implementation is described below. A munged daemon running as user root on each node confirms the identity of the user making the request using the getpeername function and generates a credential. The credential contains a user ID, group ID, time-stamp, lifetime, some pseudo-random information, and any user-supplied information. The munged uses a private key to generate a Message Authentication Code (MAC) for the credential. The munged then uses a public key to symmetrically encrypt the credential including the MAC. SLURM daemons and programs transmit this encrypted credential with communications. The SLURM daemon receiving the message sends the credential to munged on that node. The munged decrypts the credential using its private key, validates it, and returns the user ID and group ID of the user originating the credential. The munged prevents replay of a credential on any single node by recording credentials that have already been authenticated. In SLURM's case, the user-supplied information includes node identification information to prevent a credential from being used on nodes it is not destined for. When resources are allocated to a user by the controller, a job step credential is generated by combining the user ID, job ID, step ID, the list of resources allocated (nodes), and the credential lifetime. This job step credential is encrypted with a slurmctld private key. This credential is returned to the requesting agent (srun) along with the allocation response, and must be forwarded to the remote slurmd's upon job step initiation. slurmd decrypts this credential with the slurmctld's public key to verify that the user may access resources on the local node. slurmd also uses this job step credential to authenticate standard input, output, and error communication streams.
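
The contents of the two credentials described above can be summarized in a small sketch. The field names and sizes below are assumptions chosen to mirror the fields listed in the text; they are not the actual munge or SLURM wire formats.

    /* Illustrative layouts mirroring the credential fields named in the text;
     * not the real munged or SLURM formats. */
    #include <stdint.h>
    #include <time.h>
    #include <sys/types.h>

    struct munge_credential {          /* issued by munged on the local node */
        uid_t   uid;
        gid_t   gid;
        time_t  timestamp;
        time_t  lifetime;
        uint8_t nonce[16];             /* pseudo-random information */
        char    payload[256];          /* user-supplied data, e.g. a node list */
        uint8_t mac[32];               /* MAC generated with munged's key */
    };

    struct job_step_credential {       /* issued by slurmctld at allocation time */
        uid_t    uid;
        uint32_t job_id;
        uint32_t step_id;
        char     node_list[1024];      /* resources the user may access */
        time_t   lifetime;
        /* protected with a slurmctld key so that slurmd can verify it */
    };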


Fig. 3. Job initiation connections overview. 1. The srun connects to slurmctld requesting resources. 2. slurmctld issues a response, with list of nodes and job credential. 3. The srun opens a listen port for every task in the job step, then sends a run job step request to slurmd. 4. slurmd’s initiate job step and connect back to srun for stdout/err

3.5 Job Initiation

There are three modes in which jobs may be run by users under SLURM. The first and simplest is interactive mode, in which stdout and stderr are displayed on the user's terminal in real time, and stdin and signals may be forwarded from the terminal transparently to the remote tasks. The second is batch mode, in which the job is queued until the request for resources can be satisfied, at which time the job is run by SLURM as the submitting user. In allocate mode, a job is allocated to the requesting user, under which the user may manually run job steps via a script or in a sub-shell spawned by srun. Figure 3 gives a high-level depiction of the connections that occur between SLURM components during a general interactive job startup. The srun requests a resource allocation and job step initiation from the slurmctld, which responds with the job ID, the list of allocated nodes, and a job credential if the request is granted. The srun then initializes listen ports for each task and sends a message to the slurmd's on the allocated nodes requesting that the remote processes be initiated. The slurmd's begin execution of the tasks and connect back to srun for stdout and stderr. This process and the other initiation modes are described in more detail below.


Fig. 4. Interactive job initiation. srun simultaneously allocates nodes and a job step from slurmctld, then sends a run request to all slurmd's in the job. Dashed arrows indicate a periodic request that may or may not occur during the lifetime of the job

Interactive Mode Initiation. Interactive job initiation is illustrated in Figure 4. The process begins with a user invoking srun in interactive mode. In Figure 4, the user has requested an interactive run of the executable “cmd” in the default partition. After processing command line options, srun sends a message to slurmctld requesting a resource allocation and a job step initiation. This message simultaneously requests an allocation (or job) and a job step. The srun waits for a reply from slurmctld, which may not come instantly if the user has requested that srun block until resources are available. When resources are available for the user's job, slurmctld replies with a job step credential, the list of nodes that were allocated, CPUs per node, and so on. The srun then sends a message to each slurmd on the allocated nodes requesting that a job step be initiated. The slurmd's verify that the job is valid using the forwarded job step credential and then respond to srun. Each slurmd invokes a job thread to handle the request, which in turn invokes a task thread for each requested task. The task thread connects back to a port opened by srun for stdout and stderr. The host and port for this connection are contained in the run request message sent to this machine by srun. Once stdout and stderr have successfully been connected, the task thread takes the necessary steps to initiate the user's executable on the node, initializing environment, current working directory, and interconnect resources if needed.
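
The steps the task thread performs when launching the user's executable can be sketched as follows. The helper for connecting back to srun and the exact ordering of the steps are assumptions for illustration; this is not slurmd's actual code.

    /* Schematic sketch of a task thread launching one user task;
     * helpers and error handling are simplified and hypothetical. */
    #include <sys/types.h>
    #include <unistd.h>

    /* Assumed helper: open a TCP connection back to the srun host and port
     * carried in the run request message. */
    extern int connect_back_to_srun(const char *host, int port);

    void launch_task(const char *srun_host, int srun_port,
                     uid_t uid, const char *workdir,
                     char *const argv[], char *const envp[])
    {
        /* 1. Connect stdout/stderr back to the waiting srun. */
        int io = connect_back_to_srun(srun_host, srun_port);
        dup2(io, STDOUT_FILENO);
        dup2(io, STDERR_FILENO);

        /* 2. Drop to the requesting user's identity and working directory. */
        if (setuid(uid) != 0 || chdir(workdir) != 0)
            _exit(1);

        /* 3. (Limits, environment, core file paths, and interconnect
         *    resources would be set up here.) */

        /* 4. Start the user's executable. */
        execve(argv[0], argv, envp);
        _exit(127);   /* only reached if execve fails */
    }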


Fig. 5. Queued job initiation. slurmctld initiates the user's job as a batch script on one node. The batch script contains an srun call which initiates parallel tasks after instantiating a job step with the controller. The shaded region is a compressed representation and is illustrated in more detail in the interactive diagram (Figure 4)

Once the user process exits, the task thread records the exit status and sends a task exit message back to srun. When all local processes terminate, the job thread exits. The srun process either waits for all tasks to exit, or attempts to clean up the remaining processes some time after the first task exits. Regardless, once all tasks are finished, srun sends a message to the slurmctld releasing the allocated nodes, then exits with an appropriate exit status. When the slurmctld receives notification that srun no longer needs the allocated nodes, it issues a request for the epilog to be run on each of the slurmd's in the allocation. As slurmd's report that the epilog ran successfully, the nodes are returned to the partition.
Batch Mode Initiation. Figure 5 illustrates the initiation of a batch job in SLURM. Once a batch job is submitted, srun sends a batch job request to slurmctld that contains the input/output location for the job, the current working directory, the environment, and the requested number of nodes. The slurmctld queues the request in its priority-ordered queue.


Once the resources are available and the job has a high enough priority, slurmctld allocates the resources to the job and contacts the first node of the allocation, requesting that the user job be started. In this case, the job may either be another invocation of srun or a job script which may have multiple invocations of srun within it. The slurmd on the remote node responds to the run request, initiating the job thread, task thread, and user script. An srun executed from within the script detects that it has access to an allocation and initiates a job step on some or all of the nodes within the job. Once the job step is complete, the srun in the job script notifies the slurmctld and terminates. The job script continues executing and may initiate further job steps. Once the job script completes, the task thread running the job script collects the exit status and sends a task exit message to the slurmctld. The slurmctld notes that the job is complete and requests that the job epilog be run on all nodes that were allocated. As the slurmd's respond with successful completion of the epilog, the nodes are returned to the partition.

Fig. 6. Job initiation in allocate mode. Resources are allocated and srun spawns a shell with access to the resources. When the user runs an srun from within the shell, a job step is initiated under the allocation

Allocate Mode Initiation. In allocate mode, the user wishes to allocate a job and interactively run job steps under that allocation. The process of initiation in this mode is illustrated in Figure 6. The invoked srun sends an allocate request to slurmctld, which, if resources are available, responds with a list of nodes allocated, the job id, etc.


The srun process spawns a shell on the user's terminal with access to the allocation, then waits for the shell to exit, at which time the job is considered complete. An srun initiated within the allocate sub-shell recognizes that it is running under an allocation and therefore already within a job. Provided with no other arguments, srun started in this manner initiates a job step on all nodes within the current job. However, the user may select a subset of these nodes implicitly. An srun executed from the sub-shell reads the environment and user options, then notifies the controller that it is starting a job step under the current job. The slurmctld registers the job step and responds with a job credential. The srun then initiates the job step using the same general method as described in the section on interactive job initiation. When the user exits the allocate sub-shell, the original srun receives the exit status, notifies slurmctld that the job is complete, and exits. The controller runs the epilog on each of the allocated nodes, returning nodes to the partition as they complete the epilog.

4 Related Work

Portable Batch System (PBS). The Portable Batch System (PBS) [20] is a flexible batch queuing and workload management system originally developed by Veridian Systems for NASA. It operates on networked, multi-platform UNIX environments, including heterogeneous clusters of workstations, supercomputers, and massively parallel systems. PBS was developed as a replacement for NQS (Network Queuing System) by many of the same people. PBS supports sophisticated scheduling logic (via the Maui Scheduler). PBS spawns daemons on each machine to shepherd the job's tasks. It provides an interface that allows administrators to integrate their own scheduling modules. PBS can support long delays in file staging with retry. Host authentication is provided by checking port numbers (low port numbers are only accessible to user root). A credential service is used for user authentication. It has a job prolog and epilog feature. PBS supports a high-priority queue for smaller "interactive" jobs. A signal to its daemons causes the current log file to be closed, renamed with a time-stamp, and a new log file to be created. Although PBS is portable and has a broad user base, it has significant drawbacks. PBS is single-threaded and hence exhibits poor performance on large clusters. This is particularly problematic when a compute node in the system fails: PBS tries to contact the down node while other activities must wait. PBS also has a weak mechanism for starting and cleaning up parallel jobs.

4.1 Quadrics RMS

Quadrics RMS [21] (Resource Management System) is for Unix systems having Quadrics Elan interconnects. RMS functionality and performance are excellent. Its major limitation is the requirement for a Quadrics interconnect. The proprietary code and cost may also pose difficulties under some circumstances.


Maui Scheduler. Maui Scheduler [17] is an advance reservation HPC batch scheduler for use with SP, O2K, and UNIX/Linux clusters. It is widely used to extend the functionality of PBS and LoadLeveler, which Maui requires to perform the parallel job initiation and management.
Distributed Production Control System (DPCS). The Distributed Production Control System (DPCS) [6] is a scheduler developed at Lawrence Livermore National Laboratory (LLNL). The DPCS provides basic data collection and reporting mechanisms for project-level, near real-time accounting and resource allocation to customers with established limits per customers' organization budgets. In addition, the DPCS evenly distributes workload across available computers and supports dynamic reconfiguration and graceful degradation of service to prevent overuse of a computer where not authorized. DPCS supports only a limited number of computer systems: IBM RS/6000 and SP, Linux, Sun Solaris, and Compaq Alpha. Like the Maui Scheduler, DPCS requires an underlying infrastructure for parallel job initiation and management (LoadLeveler, NQS, RMS, or SLURM).
LoadLeveler. LoadLeveler [11, 14] is a proprietary batch system and parallel job manager by IBM. LoadLeveler supports few non-IBM systems. Its native scheduling is very primitive, and other software, such as the Maui Scheduler or DPCS, is required for reasonable performance. LoadLeveler has a simple and very flexible queue and job class structure operating in a "matrix" fashion. The biggest problem of LoadLeveler is its poor scalability. It typically requires 20 minutes to execute even a trivial 500-node, 8000-task job on the IBM SP computers at LLNL.
Load Sharing Facility (LSF). LSF [15] is a proprietary batch system and parallel job manager by Platform Computing. Widely deployed on a wide variety of computer architectures, it has sophisticated scheduling software including fair-share, backfill, consumable resources, and job preemption, as well as a very flexible queue structure. It also provides good status information on nodes and LSF daemons. While LSF is quite powerful, it is not open-source and can be costly on larger clusters.
Condor. Condor [5, 13, 1] is a batch system and parallel job manager developed by the University of Wisconsin. Condor was the basis for IBM's LoadLeveler and both share very similar underlying infrastructure. Condor has a very sophisticated checkpoint/restart service that does not rely upon kernel changes, but on a variety of library changes (which prevents it from being completely general). The Condor checkpoint/restart service has been integrated into LSF, Codine, and DPCS. Condor is designed to operate across a heterogeneous environment, mostly to harness the compute resources of workstations and PCs. It has an interesting "advertising" service.


Servers advertise their available resources and consumers advertise their requirements for a broker to perform matches. The checkpoint mechanism is used to relocate work on demand (when the "owner" of a desktop machine wants to resume work).
Beowulf Distributed Process Space (BPROC). The Beowulf Distributed Process Space (BPROC) is a set of kernel modifications, utilities, and libraries which allow a user to start processes on other machines in a Beowulf-style cluster [2]. Remote processes started with this mechanism appear in the process table of the front-end machine in the cluster. This allows remote process management using the normal UNIX process control facilities. Signals are transparently forwarded to remote processes and exit status is received using the usual wait() mechanisms. This tight coupling of a cluster's nodes is convenient, but high scalability can be difficult to achieve.

5 Performance Study

We were able to perform some SLURM tests on a 1000-node cluster at LLNL. Some development was still underway at that time and tuning had not been performed. The results of executing a simple 'hostname' program on two tasks per node at various node counts are shown in Figure 7.

Fig. 7. Time (in seconds) to execute /bin/hostname at various node counts, for SLURM, RMS, and LoadLeveler


We found SLURM performance to be comparable to that of the Quadrics Resource Management System (RMS) [21] for all job sizes, and about 80 times faster than IBM LoadLeveler [14, 11] at the job sizes tested.

6 Conclusion and Future Plans

We have presented in this paper an overview of SLURM, a simple, highly scalable, robust, and portable cluster resource management system. The contribution of this work is that we have provided an immediately available, open-source tool that virtually anybody can use to efficiently manage clusters of different sizes and architectures. Looking ahead, we anticipate adding support for additional operating systems. We anticipate adding a job preempt/resume capability, which will provide an external scheduler with the infrastructure required to perform gang scheduling, as well as a checkpoint/restart capability. We also plan to use SLURM on IBM's Blue Gene/L platform [4] by incorporating into SLURM a capability to manage jobs on a three-dimensional torus machine.

Acknowledgments
Additional programmers responsible for the development of SLURM include Chris Dunlap, Joey Ekstrom, Jim Garlick, Kevin Tew, and Jay Windley.

References
[1] J. Basney, M. Livny, and T. Tannenbaum. High Throughput Computing with Condor. HPCU News, 1(2), June 1997.
[2] Beowulf Distributed Process Space. http://bproc.sourceforge.net
[3] Beowulf Project. http://www.beowulf.org
[4] Blue Gene/L. http://cmg-rr.llnl.gov/asci/platforms/bluegenel
[5] Condor. http://www.cs.wisc.edu/condor
[6] Distributed Production Control System. http://www.llnl.gov/icc/lc/dpcs overview.html
[7] I. Foster and C. Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, Inc., 1999.
[8] E. Frachtenberg, F. Petrini, et al. STORM: Lightning-Fast Resource Management. In Proceedings of SuperComputing, 2002.
[9] GNU General Public License. http://www.gnu.org/licenses/gpl.html
[10] Authd home page. http://www.theether.org/authd/
[11] IBM Corporation. LoadLeveler's User Guide, Release 2.1.
[12] M. Jette, C. Dunlap, J. Garlick, and M. Grondona. Survey of Batch/Resource Management-Related System Software. Technical Report, Lawrence Livermore National Laboratory, 2002.
[13] M. Litzkow, M. Livny, and M. Mutka. Condor - A Hunter of Idle Workstations. In Proc. International Conference on Distributed Computing Systems, June 1988.


[14] LoadLeveler. http://www-1.ibm.com/servers/eservers/pseries/library/sp books/loadleveler.html
[15] Load Sharing Facility. http://www.platform.com
[16] Loki – Commodity Parallel Processing. http://loki-www.lanl.org
[17] Maui Scheduler. mauischeduler.sourceforge.net
[18] Multiprogrammatic Capability Cluster. http://www.llnl.gov/linux/mcr
[19] Parallel Capacity Resource. http://www.llnl.gov/linux/pcr
[20] Portable Batch System. http://www.openpbs.org
[21] Quadrics Resource Management System. http://www.quadrics.com/website/pdf/rms.pdf

OurGrid: An Approach to Easily Assemble Grids with Equitable Resource Sharing

Nazareno Andrade1, Walfredo Cirne1, Francisco Brasileiro1, and Paulo Roisenberg2

1 Universidade Federal de Campina Grande, Departamento de Sistemas e Computação, 58.109-970 Campina Grande, PB, Brazil
{nazareno,walfredo,fubica}@dsc.ufcg.edu.br, http://dsc.ufcg.edu.br/ourgrid
2 Hewlett Packard Brazil, Porto Alegre, RS, Brazil
[email protected]

Abstract. Available grid technologies like the Globus Toolkit make it possible for one to run a parallel application on resources distributed across several administrative domains. Most grid computing users, however, don't have access to more than a handful of resources onto which they can use these technologies. This happens mainly because gaining access to resources still depends on personal negotiations between the user and each resource owner. To address this problem, we are developing the OurGrid resource sharing system, a peer-to-peer network of sites that share resources equitably in order to form a grid to which they all have access. The resources are shared according to a network-of-favors model, in which each peer prioritizes those who have credit in their past history of bilateral interactions. The emergent behavior in the system is that peers that contribute more to the community are prioritized when they request resources. We expect, with OurGrid, to solve the access gaining problem for users of bag-of-tasks applications (those parallel applications whose tasks are independent).

1 Introduction

To use grid computing, a user must assemble a grid. A user must not only have the technologies to use grid computing, but also access to resources on which she can use these technologies. For example, to use resources through the Globus Toolkit [18], she must have access, i.e., permission to use, resources on which Globus is installed. Today, gaining access to grid resources is done via personal requests from the user to each resource's owner. To run her application on the workstations of some laboratories in a university, a user must convince the system administrator of each laboratory to give her access to their system's workstations. When the resources the user wishes to use cross institutional boundaries, the situation gets more complicated, as possibly different institutional policies come into play.


Thus, it is very difficult for one to gain access to more than a handful of resources onto which she can use grid computing technologies to run her applications. As resource owners must provide access to their resources to allow users to form grids, there must be interest in providing resources to grid users. Also, as several grid users may demand the same resource simultaneously, there must be mechanisms for dealing with conflicting requests for resources, arbitrating them. As this can be seen as a supply and demand problem, approaches to solving it have been based on grid economy [3, 9, 30, 6], that is, on using in grids economic models from real markets. Although models that mimic real markets have the mechanisms to solve the problems of a grid's supply and demand, they rely on a not-yet-available infrastructure for electronic monetary transactions. To make it possible for users to securely verify what they have consumed and pay for it, there must be mature and well-deployed technologies for electronic currency and banking. As these technologies are not widely deployed yet, the actual use of the economic mechanisms and architectures in real settings is postponed until the technologies are mature and the infrastructure needed to use them is available. Nevertheless, there is presently demand for grids to be used in production. Aiming to provide, in the short term, an infrastructure that addresses this demand for a significant set of users, we are developing OurGrid. The OurGrid design is based on a model of resource sharing that provides equity with the minimum of guarantees needed. With it, we aim to provide an easy-to-install, open, and extensible platform, suitable for running a useful set of grid applications for users willing to share their resources in order to obtain access to the grid. Namely, the type of application for which OurGrid intends to provide resources is the class of parallel applications whose tasks are loosely coupled, known as bag-of-tasks (BoT) applications [13, 27]. BoT applications are those parallel applications composed of a set of independent tasks that need no communication among them during execution. Many applications in areas such as computational biology [28], simulations, parameter sweep [2], and computer imaging [26, 25] fit into this definition and are useful to large communities of users. Additionally, from the research perspective, there exists demand for understanding grid usage requirements and patterns in real settings. With a system such as OurGrid in production for real users, we will be able to gather valuable information about the needs and habits of grid users. This allows us both to provide better guidance to future efforts in more general solutions and to collect important data about grid usage, like workloads, for example. The remainder of this paper is structured in the following way. In Section 2 we go into further detail about the grid assembling problem, discussing related work and presenting our approach. We discuss how BoT applications are suitable for running with resources provided by OurGrid in Section 3. Section 4 describes the design of OurGrid and the network of favors model. An evaluation of the system is discussed in Section 5. In Section 6 we present the future steps planned in the OurGrid development. Finally, we make our concluding remarks in Section 7.

2 Assembling a Grid

In a traditional system, like a LAN or a parallel supercomputer, a user obtains access to resources by negotiating with the resources' owner the right to access them. Once access is granted, the system's administrator configures a set of permissions and priorities for the user. Although this procedure is also still used in grid computing, it is not suitable there, due to a grid's inherent wide distribution, spanning many administrative boundaries. Grid computing aims to deal with large, heterogeneous, and dynamic sets of users and resources [19]. Moreover, if we are to build large-scale grids, we must be able to form them with mutually untrusted and even unknown parties. In this scenario, however, it is very difficult for an ordinary user to obtain access to more than a small set of services whose owners are known. As grid computing aims to provide access to large quantities of widely distributed resources, giving users the possibility of accessing only small quantities of resources means neglecting the potential of grid computing. The problem of assembling a grid also raises some issues from the resource providers' perspective. Suppose a very simple scenario where just two institutions, A and B, want to create a grid joining their resources. Both of them are interested in having access to as many processors as possible. Also, both of them will want some fairness in the sharing. Probably both of them will want to assure not only that they will give access to their resources to the other institution's users, but also that their users will be able to access the other institution's resources, maybe in equal proportions. Existing solutions in grid computing allow these two institutions to define some policies in their resource sharing, creating static constraints and guarantees for the users of the grid [16, 29, 12]. However, if a third institution C joins the grid, new agreements must be negotiated between the institutions and configured on each of them. We can easily see that these mechanisms are neither scalable nor flexible enough for large-scale grid scenarios.

2.1 Related Work

Although grid computing is a very active area of research, until recently, research efforts on dynamic access gaining to resources did not exist. We attribute this mainly to the recentness of grid computing, which has made it necessary to postpone the question of access gaining until the technologies needed to use grids matured. Past efforts have been spent in defining mechanisms that support static access policies and constraints to allow the building of metacomputing infrastructures across different administrative domains, as in the Condor system [29] and in the Computational Co-op [12]. Since 1984 the Condor system has used different mechanisms for allowing a Condor user to access resources across institutional boundaries. After trying to use institution-level agreements [17], Condor was changed to a user-to-institution level [29], to provide flexibility, as requested by its users.


Recently, it was perceived that interoperability with grid middleware was also needed, and a new architecture for accessing grid resources was developed [20]. Although it has not dealt with dynamic access gaining, the Condor project has made valuable contributions to understanding the needs of users in accessing and using the grid. The Computational Co-op defined a mechanism for gathering sites in a grid using cooperatives as a metaphor. This mechanism allows all sites to control how much of their resources are being used by the grid and provides guarantees on how much of the grid's resources each site can use. This is done through a proportional-share ticket-based scheduler. The tickets are used by users to access both local and grid resources, obtaining priorities as they spend the tickets. However, both the need for negotiations between the owners of the sites to define the division of the grid tickets and the impossibility of ticket transfer or consumption make the Co-op not flexible enough for environments as dynamic as grids. Moreover, just like e-cash, it depends on a good cryptographic infrastructure to make sure that tickets are not forged. A recent effort related to access gaining in grid computing is the research on grid economy. Namely, the Grid Architecture for Computational Economy (GRACE) [8], the Nimrod/G system [3], and the Compute Power Market [9] are related to our work. GRACE is an abstract architecture that supports different economic models for negotiating access to grid resources. Nimrod/G is a grid broker for the execution of parameter sweep applications that implements GRACE concepts, allowing a grid client to negotiate access to resources by paying for it. The Compute Power Market aims to provide access to resources in a decentralized manner, through a peer-to-peer network, letting users pay in cash for using grid resources. An important point to note in these approaches is that to allow negotiations between service consumers and providers using secure global currencies, as proposed by Nimrod/G and the Compute Power Market, an infrastructure for secure negotiation, payment, and banking must be deployed. The level of maturity of the base technologies (for example, secure and well-deployed electronic money) makes it necessary to postpone the use of economy-based approaches in real systems.

2.2 OurGrid Approach

The central point of OurGrid is the utilization of assumptions that, although more restrictive to the system's usefulness, are easier to satisfy than those of existing systems based on grid economy. Our assumptions about the environment in which the system will operate are that (i) there are at least two peers in the system willing to share their resources in order to obtain access to more resources and (ii) the applications that will be executed using OurGrid need no quality of service (QoS) guarantees. With these assumptions, we aim to build a resource sharing network that promotes equity in resource sharing. By equity we mean that participants in the network that have donated more resources are prioritized when they ask for resources.


With the assumption that there will be at least two resource providers in the system, we ensure that there will exist participants in the system that own resources whose access can be exchanged. This makes possible the use of an exchange-based economic model, instead of the more commonly used price-based models [3, 9]. By assuming that there are no requirements for QoS guarantees, we put aside negotiations, since providers need not negotiate a product whose characteristics won't be guaranteed. Without negotiations, it becomes unnecessary for participants to even agree on values for the resources allocated and consumed. This simplifies the process, since consumers don't have to verify that an agreed value was really consumed and providers don't have to assure that resources are provided as agreed. Actually, in this way we are building the simplest form of an exchange-based economic model. As there's no negotiation, every participant does favors expecting to be reciprocated, and, in conflicting situations, prioritizes those who have done favors to it in the past. The more a participant offers, the more it expects to be rewarded. There are no negotiations or agreements, however. Each participant accounts for its favors only to itself, and cannot expect to profit from them in any way other than getting other participants to do it favors. As there is no cost in donating idle cycles (they will be forever lost if not consumed instantaneously), a participant in the model can only gain from donating them. As, by our first assumption, there exists at least one other participant sharing her idle resources, donating implies eventually benefiting from access to extra resources. As we shall see, from the local behavior of all participants, the emergent behavior of the system promotes equity in the arbitration of conflicting requests for the shared resources in the system. An important point is that the absence of QoS guarantees makes it impossible to guarantee equity in the resource sharing. The system can't guarantee that a user will access enough resources to compensate for the amount she donated to the community, because it can't guarantee that there will ever be available resources for the time needed. As such, we propose a system that aims not to guarantee, but to promote, resource sharing equity. Promoting equity means trying, via a best-effort strategy, to achieve equity. The proposed assumptions about the system ease the development and deployment of OurGrid, restricting, in turn, its utility. The necessity for participants to own resources excludes users that don't own any resources but are willing to pay (e.g., in cash) to use the grid. Also, the absence of QoS guarantees makes the advance reservation of resources impossible and, consequently, precludes mechanisms that provide synchrony to the execution of parallel applications that need communication between tasks. We believe, however, that even with these restrictions, OurGrid will still be very useful. OurGrid delivers services that are suitable for the bag-of-tasks class of applications. As stated before, these applications are relevant to many areas of research and are of interest to many users.

3 Bag-of-Tasks Applications

Due to the independence of their tasks, BoT applications are especially suited for execution on the grid, where both failures and slow communication channels are expected to be more frequent than in conventional platforms for the execution of parallel applications. Moreover, we argue that this class of applications can be successfully executed without the need for QoS guarantees, as in the OurGrid scenario. A BoT application can perform well with no QoS guarantees as it (i) does not need any synchronization between tasks, (ii) has no dependencies between tasks, and (iii) can tolerate faults caused by resource unavailability with very simple strategies. Examples of such strategies to achieve fault tolerance in the OurGrid scenario are the replication of tasks on multiple resources or the simple re-submission of tasks that failed to execute [23, 26]. As such, this class of application can cope very well with resources that are neither dedicated nor guaranteed to be available, as a failure in the execution of individual tasks does not impact the execution of the other tasks. Besides performing well with our assumptions, another characteristic of BoT applications that matches our approach is their users' work cycle. Experience says that, once the application is developed, users usually carry out the following cycle: (a) plan details of the computation, (b) run the application, (c) examine the results, (d) restart the cycle. Planning the details of the computation means spending the time needed to decide the parameters to run the application. Often, a significant amount of time is also needed to process and understand the results produced by a large-scale computation. As such, during the period in which a user is running her BoT application, she wants as many resources as possible, but during the other phases of her working cycle, she leaves her resources idle. These idle resources can be provided to other users to grant, in return, access to other users' resources, when needed. An example of this dynamic for two BoT application users is illustrated in Figure 1. In this figure, the users use both local resources and resources obtained from the grid (which, in this case, are only the other user's idle resources) whenever they need to run their BoT applications. Note that whenever a user needs her own resources, she has priority over the foreign user. Another point to note is that, as resources are heterogeneous, a user might not own the resources she needs to run an application that poses constraints on the resources it needs. For example, a user can own machines running both Linux and Solaris. If she wants to run an application that can run only on Solaris, she won't be able to use all of her resources. As such, it is possible for a user to share part of her resources while consuming another part, maybe in addition to resources from other users. In this way, we believe that expecting BoT application users to share some of their resources in order to gain access to more resources is very plausible. As stated before, this kind of exchange can be carried out without any impact on the resource owners, because


Fig. 1. Idle resource sharing between two BoT users

only resources that would otherwise be idle are exchanged. In return, the owners get extra resources when needed to run their applications.

4 OurGrid

Based on the approach discussed above, we intend to develop OurGrid to work as a peer-to-peer network of resources owned by a community of grid users. By adding resources to the peer-to-peer network and sharing them with the community, a user gains access to all the available resources on it. All the resources are shared respecting each provider's policies, and OurGrid strives to promote equity in this sharing. A user accesses the grid through the services provided by a peer, which maintains communication with other peers and uses the community services (e.g., application-level routing and discovery) to access them, acting as a grid broker to its users. A peer P will be accessed by native and foreign users. Native users are those who access the OurGrid resources through P, while foreign users have access to P's resources via other peers. A peer is both a consumer and a provider of resources. When a peer P does a favor in response to a request from a peer Q, P acts as a provider of resources to Q, while Q acts as a consumer of P's resources. The OurGrid network architecture is shown in Figure 2. Clients are software used by the users to access the community resources. A client is at least an application scheduler, possibly with extra functionalities. Examples of such clients are MyGrid [15], APST [10], Nimrod/G [2], and AppLeS [7]. We plan to provide access, through OurGrid, to different resource types. In Figure 2, for example, the resources of type A could be clusters of workstations accessed via Globus GRAM, the type B resources could be parallel supercomputers, and the type C resources could be workstations running MyGrid's UserAgent [15].


Fig. 2. OurGrid network architecture

Although resources of any granularity (i.e., workstations, clusters, entire institutions, etc.) can be encapsulated in an OurGrid peer, we propose that peers manage access to whole sites instead of individual resources. As resources are often grouped in sites, using this granularity in the system will give us some advantages: (i) the number of peers in the system diminishes considerably, improving the performance of searches; (ii) the system's topology becomes closer to its network infrastructure topology, alleviating traffic problems found in other peer-to-peer systems [24]; and (iii) the system becomes closer to the real ownership distribution of the resources, as they are usually grouped in sites, each with its own set of users and owners. Finally, an OurGrid community can be part of a larger set of resources that a user has access to, and users can be native users of more than one peer, either in the same or in different communities. In the rest of this section, we describe in detail the key aspects of the OurGrid design. In Subsection 4.1 we present the model according to which the resources are shared, the network of favors. Subsection 4.2 depicts the protocol used to gain access to the resources of an OurGrid community.

4.1 The Network of Favors

All resources in the OurGrid network are shared in a network of favors. In this network of favors, allocating a resource to a requesting consumer is a favor. As such, it is expected that the consumer becomes indebted to the owner of the consumed resources. The model is based on the expectation that its participants will reciprocate favors, when solicited, to those they are in debt with.


If a participant is not perceived to be acting in this way, it is gradually less prioritized, as its debt grows. Every peer in the system keeps track of a local balance for each known peer, based on their past interactions. This balance is used to prioritize peers with more credit when arbitrating conflicting requests. For a peer p, all consumption of p's resources by another peer p' is debited from the balance that p keeps for p', and all resources provided by p' to p are credited in the balance p maintains for p'. With all known peers' balances, each participant can maintain a ranking of all known participants. This ranking is updated on each provided or consumed favor. The quantification of each favor's value is done locally and independently (as negotiations and agreements aren't used), serving only the future resource allocation decisions of the local peer. As the peers in the system ask each other for favors, they gradually discover which participants are able to reciprocate their favors, and prioritize them, based on their debt or credit. As a consequence, while a participant prioritizes those who cooperate with it in satisfactory ways, it marginalizes the peers who, for any reason, do not reciprocate the favors satisfactorily. The non-reciprocation can happen for many reasons, like, for example: failures of services or of the communication network; the absence of the desired service in the peer; or the utilization of the desired service by other users at the moment of the request. Free-rider [4] peers may even choose not to reciprocate favors. In all of these cases, the non-reciprocation of the favors gradually diminishes the probability that the peer will access the grid's resources. Note that our prioritization mechanism is intended to resolve only conflicting situations. It is expected that, if a resource is available and idle, any user can access it. In this way, an ordinary user can, potentially, access all the resources in the grid. Thus, users that contribute very little or don't contribute can still access the resources of the system, but only if no other peer that has more credit requests them. The use of idle and unrequested resources by peers that don't contribute (i.e., free-riders) actually maximizes resource utilization and does not harm the peers who have contributed their resources. Another interesting point is that our system, as conceived, is totally decentralized and composed of autonomous entities. Each peer depends only on its local knowledge and decisions to be a part of the system. This characteristic greatly improves the adaptability and robustness of the system, which does not depend on coordinated actions or global views [5].
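
A compact sketch of this local bookkeeping follows. The data structures and the unit in which favors are quantified are assumptions, since the model deliberately leaves quantification to each peer; the sketch only illustrates the debit/credit and ranking behavior described above.

    /* Illustrative local bookkeeping for the network of favors: each peer
     * keeps, entirely on its own, a balance per known peer and prefers the
     * requester with the highest balance when requests conflict. */
    #include <string.h>

    #define MAX_PEERS 1024

    struct peer_balance {
        char   peer_id[64];
        double balance;     /* credit (> 0) or debt (< 0), in local units */
    };

    static struct peer_balance known[MAX_PEERS];
    static int known_count;   /* assumed to stay below MAX_PEERS in this sketch */

    static struct peer_balance *find_or_add(const char *peer_id)
    {
        for (int i = 0; i < known_count; i++)
            if (strcmp(known[i].peer_id, peer_id) == 0)
                return &known[i];
        struct peer_balance *p = &known[known_count++];
        strncpy(p->peer_id, peer_id, sizeof p->peer_id - 1);
        p->peer_id[sizeof p->peer_id - 1] = '\0';
        p->balance = 0.0;
        return p;
    }

    /* A favor this peer provided: debit the consumer's balance. */
    void favor_provided(const char *consumer, double local_cost_estimate)
    {
        find_or_add(consumer)->balance -= local_cost_estimate;
    }

    /* A favor this peer received: credit the provider's balance. */
    void favor_received(const char *provider, double local_cost_estimate)
    {
        find_or_add(provider)->balance += local_cost_estimate;
    }

    /* Arbitrate conflicting requests: serve the highest-ranked requester. */
    const char *pick_requester(const char *const requesters[], int n)
    {
        const char *best = requesters[0];
        for (int i = 1; i < n; i++)
            if (find_or_add(requesters[i])->balance >
                find_or_add(best)->balance)
                best = requesters[i];
        return best;
    }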

4.2 The OurGrid Resource Sharing Protocol

To communicate with the community and to gain access to, consume, and provide resources, all peers use the OurGrid resource sharing protocol. Note that the protocol concerns only the resource sharing in the peer-to-peer network. We consider that the system uses lower-level protocols for other necessary services, such as peer discovery and message broadcasting. An example of a platform that provides these protocols is the JXTA [1] project.


Fig. 3. Consumer and provider interaction

The three participants in the OurGrid resource sharing protocol are clients, consumers, and providers. A client is a program that manages access to the grid resources and runs the application's tasks on them. OurGrid will be one such resource, transparently offering computational resources to the client. As such, a client may (i) access both OurGrid peers and other resources directly, such as Globus GRAM [16] or a Condor pool [29]; and (ii) access several OurGrid peers from different resource sharing communities. We consider that the client encompasses the application scheduler and any other domain-specific module needed to schedule the application efficiently. A consumer is the part of a peer which receives requests from a user's client to find resources. The consumer is used first to request resources from providers that are able and willing to do it favors, and, after obtaining them, to execute tasks on the resources. The provider is the part of a peer which manages the resources shared with the community and provides them to consumers. As illustrated in Figure 3, every peer in the community has both consumer and provider modules. When a consumer receives a request for resources from a local user's client, it broadcasts the desired resources' characteristics to the peer-to-peer network in a ConsumerQuery message. The resources' characteristics are the minimum constraints needed to execute the tasks this ConsumerQuery message refers to. It is the client's responsibility to discover these characteristics, probably by asking the user for this information.



Note that, as it is broadcast, the ConsumerQuery message also reaches the provider that belongs to the same peer as the consumer. All providers whose resources match the requested characteristics and are available (according to their local policies) reply to the requester with a ProviderWorkRequest message. The set of replies received up to a given moment defines the grid that has been made available for the client request by the OurGrid community. Note that this set is dynamic, as replies can arrive later, when the resources needed to satisfy the request become available at more providers. With the set of available resources, the consumer peer can ask its client to schedule tasks onto them. This is done by sending a ConsumerScheduleRequest message containing all known available providers. The application scheduling step is kept out of the OurGrid scope to allow the user to select, among existing scheduling algorithms [11, 23], the one that optimizes her application according to her knowledge of its characteristics. Once the client has scheduled any number of tasks to one or more of the providers who sent ProviderWorkRequest messages, it sends a ClientSchedule message to the consumer from which it requested the resources. As each peer represents a site owning a set of resources, the ClientSchedule message can contain either a list of ordered pairs (task, provider) or a list of tuples (task, provider, processor). It is up to the client to decide how to format its ClientSchedule message. All tasks are sent through the consumer, and not directly from the client to the provider, to allow the consumer to account for its resource consumption. To each provider P_n in the ClientSchedule message, the consumer then sends a ConsumerFavor message containing the tasks to be executed at P_n, with all the data needed to run them. If the peer which received the ConsumerFavor message finishes its tasks successfully, it then sends back a ProviderFavorReport message to the corresponding consumer. After concluding each task execution, the provider also updates its local ranking of known peers, subtracting its accounting of the task execution cost from the consumer peer's balance. The consumer peer, on receiving the ProviderFavorReport, also updates its local ranking, but adding its accounting of the reported tasks' execution cost to the provider's balance. Note that the consumer may either trust the accounting sent by the provider or make its own autonomous accounting. While a provider has available resources that match the request's constraints and is willing to do favors, it keeps asking the consumer for tasks. A provider may decide to stop making favors to a consumer in order to prioritize another requester that is higher in its ranking. The provider also stops requesting tasks if it receives a message from the consumer informing it that there are no tasks left to schedule, or if it receives no response to a task request. Note that, after the first broadcast, the flow of requests is from the providers to the consumer. As the ProviderWorkRequest messages are the signal of availability, we relieve the consumer of the task of managing the state of its current providers.
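As a small illustration of the message contents just described (a sketch of ours, not OurGrid code; the function name is invented), the consumer can group the entries of a ClientSchedule, given either as (task, provider) pairs or as (task, provider, processor) tuples, into one ConsumerFavor task list per provider:

from collections import defaultdict

def group_client_schedule(client_schedule):
    # Turn a ClientSchedule, given as (task, provider) pairs or as
    # (task, provider, processor) tuples, into one task list per provider,
    # i.e. the payload of each ConsumerFavor message.
    favors = defaultdict(list)
    for entry in client_schedule:
        task, provider = entry[0], entry[1]
        processor = entry[2] if len(entry) > 2 else None  # optional third field
        favors[provider].append((task, processor))
    return dict(favors)

# Example using the simple (task, provider) form.
schedule = [("t1", "providerA"), ("t2", "providerA"), ("t3", "providerB")]
print(group_client_schedule(schedule))
# {'providerA': [('t1', None), ('t2', None)], 'providerB': [('t3', None)]}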




Fig. 4. Sequence diagram for a consumer and two providers interaction

In Figure 4, a sequence diagram for an interaction between a consumer and two providers is shown. The provider provider1 does a favor for the consumer, but provider2 either is unable or has decided not to provide any resources to the consumer. Although an OurGrid network will be an open system, potentially comprising different algorithms and implementations for its peers, we present in the following sections examples of expected correct behavior for both the provider and the consumer parts of a peer. The algorithms are intended to exemplify and make clearer how a peer should behave to obtain access to the community's shared resources. Provider Algorithm. A typical provider runs three threads: the receiver, the allocator, and the executor. The receiver and the allocator execute continuously; both of them access, add, remove, and alter elements of the lists of received requests and of known peers. The executor is instantiated by the allocator to take care of the execution and accounting of individual tasks. The receiver thread keeps checking for received requests. For each request received, it verifies whether the request can be fulfilled with the owned resources. It does so by verifying whether the provider owns resources that satisfy the request's requirements, no matter if they are currently available or not, according to the peer's sharing policies. If the consumer request can be satisfied, the receiver adds it to a list of received requests. There are two such lists, one for requests issued by local users and another for those issued by foreign users. This allows us to prioritize local users' requests when scheduling tasks on the local resources. The allocator thread algorithm is shown in Algorithm 1. While executing, the allocator thread continuously tries to satisfy the received requests with the available resources.



Algorithm 1: Provider's allocator thread algorithm
Data: communityRequests, localRequests, knownPeersBalances, localPriorityPolicies

while true do
    chosen = null;
    /* local users' requests are prioritized over the community's */
1   if localRequests.length > 0 then
        rank = 1;
        repeat
2           actual = getLocalRequestRanked(localRequests, localPriorityPolicies, rank++);
3           if isResourceToSatisfyAvailable(actual) then chosen = actual;
        until (chosen != null) or (rank > localRequests.length);
    end
    /* if there is no local user's request which can be satisfied */
4   if (chosen == null) && (communityRequests.length > 0) then
        rank = 1;
        repeat
5           actual = getCommunityRequestRanked(communityRequests, knownPeersBalances, rank++);
6           if isResourceToSatisfyAvailable(actual) then
                chosen = actual;
7               resourcesToAllocate = getResourcesToAllocateTo(chosen);
            end
        until (chosen != null) or (rank > communityRequests.length);
    end
    /* actually allocate resources to the chosen request */
    if chosen != null then
8       send(chosen.srcPeerID, ProviderWorkRequest);
9       receivedMessage = receiveConsumerFavorMessage(timeout);
        if receivedMessage != null then
            receivedTasks = getTasks(receivedMessage);
            foreach task in receivedTasks do
10              execute(task, resourcesToAllocate);
            end
        else
11          if isRequestLocal(chosen) then localRequests.remove(chosen);
12          else communityRequests.remove(chosen);
        end
    end
end

It first tries to find a request from a local user which can be fulfilled and, if there is none, it tries the same with the community requests received. The function getLocalRequestRanked(), in the line labeled 2, returns the request at the specified position in the priority ranking, according to a local set of policies. The policies can differ from peer to peer; examples of local prioritizing policies would be FIFO or prioritizing the users who have consumed less in the past.



The function getCommunityRequestRanked(), in the line labeled 5, does the same thing for the community requests. It must be based on the known peers' balances, which serve as a ranking to prioritize these requests. On the lines labeled 3 and 6, the allocator verifies whether the resources needed to fulfill the request passed as a parameter are available, according to the local availability policies. If some request was chosen to be answered in this iteration of the main loop, the allocator decides which resources will be allocated to this request (line labeled 7) and sends a message asking for tasks to execute. If it receives tasks to execute, it then schedules the received tasks on the resources allocated to that request. This is done by the execute() function, which creates a provider executor thread for each task to be executed. The executor first sets up the environment in which the task will execute. Setting up the environment means preparing whatever is specified by the local security policies; for example, it could mean creating a directory with restricted permissions or restricted size in which the task will execute. After the task has been executed and its results have been collected and sent back, the executor must update the balance of the consumer peer. The quantification function may differ from peer to peer. A simple example of how it could be done is to sum up all the CPU time used by the task and multiply it by the CPU speed, in MIPS. Once the peer has estimated the value of the favor it just did, it updates the knownPeersBalances list, decreasing the respective consumer's balance. For simplicity, we consider here that our provider's allocator scheduling is non-preemptive. However, it is reasonable to expect that, to avoid impacting interactive users of the shared resources, a provider may suspend or even kill tasks from foreign users. Consumer Algorithm. As we did with the provider, this section discusses a simple yet functional version of a consumer algorithm. The consumer runs three threads: the requester, the listener, and the remote executor. The requester's responsibility is to broadcast the client requests it receives as ConsumerQuery messages. After the ConsumerQuery message for a given ClientRequest message has been sent, the consumer listener thread starts waiting for its responses. It receives all the ProviderWorkRequest messages sent to the peer and, as they arrive, informs the client that the corresponding resources are available. Each instance of the remote executor thread, as illustrated in Algorithm 2, is responsible for sending a set of tasks to a provider, waiting for the responses, and updating the balance of that provider in the local peer. The quantification is done on the line labeled 1, and may differ from peer to peer. Examples of how it can be performed vary from simply using the accounting sent by the provider to more sophisticated mechanisms, such as sending a micro-benchmark to test the resource performance, collecting the CPU time consumed, and then calculating the favor cost as a function of both.



Algorithm 2: Consumer's remote executor thread algorithm
Data: provider, scheduledTasks, knownPeersBalances

send(provider, ConsumerFavor);
unansweredTasks = scheduledTasks;
while (unansweredTasks.length > 0) && (timeOutHasExpired() == false) do
    results = waitProviderFavorReport(provider);
    answeredTasks = results.getTasks();
    removeReportedTasks(answeredTasks, unansweredTasks);
    foreach task in answeredTasks do
        if isProviderLocal(provider) == false then
1           usage = quantifyUsage(results);
            previousBalance = getPeerBalance(knownPeersBalances, provider);
2           updatePeerBalance(knownPeersBalances, provider, previousBalance + usage);
        end
    end
end

Yet another possibility is to estimate the task size, perhaps by asking the user for this information, and then to assign a cost based on this size to each task execution. The provider's balance is updated on the line labeled 2. Note that the usage is added to the provider's balance, whereas in the provider's executor it was deducted.
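A minimal version of the simple quantification just described might look as follows (our sketch in Python; the CPU-time-times-MIPS rule is only the example given in the text, and the concrete numbers are assumptions):

def quantify_usage(cpu_seconds, cpu_speed_mips):
    # Simple favor quantification suggested in the text: CPU time consumed
    # multiplied by the speed (in MIPS) of the CPU that ran the task.
    return cpu_seconds * cpu_speed_mips

# Provider side: after running a task for 120 s on a 500-MIPS machine,
# the provider's executor deducts the favor value from the consumer's balance.
value = quantify_usage(120.0, 500.0)
provider_balances = {"consumerX": 0.0}
provider_balances["consumerX"] -= value

# Consumer side: on receiving the ProviderFavorReport, the remote executor adds
# the same (or its own, autonomously computed) value to the provider's balance.
consumer_balances = {"providerY": 0.0}
consumer_balances["providerY"] += value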

5 Evaluation

In this section we show some preliminary results from simulations and an analytical evaluation of the OurGrid system. Note that, due to its decentralized and autonomous nature, characterizing the behavior of an OurGrid community is quite challenging. Therefore, at this initial moment, we base our analysis on a simplified version of OurGrid, called OurGame. OurGame was designed to capture the key features of OurGrid, namely the system-wide behavior of the network of favors and the contention for finite resources. The simplification consists of grouping resource consumption into turns. In a turn, each peer is either a provider or a consumer. If a peer is a consumer, it tries to consume all available resources. If a peer is a provider, it tries to allocate all resources it owns to the current turn's consumers. In short, OurGame is a repeated game that captures the key features of OurGrid and allows us to shed some light on its system-wide behavior.

5.1 OurGame

In this model, our system comprises a community of n peers represented by a set P = {p_1, p_2, ..., p_n}. Each peer p_k owns a number r_k of resources. All resources are identical, but the amounts at each peer may differ. Each peer can be in one of two states: provider or consumer. When it is in the provider state, it is able to provide all its local resources, while in the consumer state it sends a request for resources to the community.



We consider that when a peer is in the consumer state it consumes all its local resources and, as such, cannot provide resources to the community. All requests sent by the consumers are equal, requesting as many resources as can be provided. A peer p_k is a tuple {id, r, state, ranking, ρ, allocationStrategy}. The id field is the peer's identification, used by other peers to keep track of its favor balance. As stated before, r represents p_k's amount of resources, and state represents the peer's current state, assuming the value provider or consumer. The ranking is a list of pairs (peer_id, balance) representing the known peers' ranking; in each pair, peer_id identifies a known peer and balance the credit or debit associated with it. For all unknown peers, we consider balance = 0. The ρ field is the probability of p_k being a provider in a given turn of the game. The allocationStrategy element of the tuple defines the peer's resource allocation behavior. As instances of allocationStrategy, we have implemented AllForOneAllocationStrategy and ProportionallyForAllAllocationStrategy. The former allocates all of the provider's resources to the consumer that has the greatest balance value (ties are broken randomly). The latter allocates the peer's resources proportionally to all requesting peers with positive balance values. If there are no peers with positive balance values, it allocates to all with zero balance values, and if there is no requesting peer with a non-negative balance value, it allocates proportionally to all requesting peers. In this model the timeline is divided into turns. The first action of all peers in every turn is to choose, according to their ρ, their state during the turn, either consumer or provider. Next, all peers that are currently in the consumer state send a request to the community. All requests arrive at all peers instantaneously, asking for as many resources as the peer owns. As our objective is to study how the system deals with conflicting requests, all consumers always ask for the maximum set of resources. On receiving a request, each provider chooses, based on its allocationStrategy, which resources to allocate to which consumers, always allocating all of its resources. All allocations last for the current turn only. At the end of the turn, each peer updates its ranking with perfect information about the resources it provided or consumed in this turn.
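The two allocation strategies can be paraphrased compactly. The sketch below is our own illustration in Python (function names are ours); it returns each requesting consumer's share of a provider's resources in a turn, and for the last fallback case of ProportionallyForAllAllocationStrategy we assume an equal split, since the balances involved are non-positive.

import random

def all_for_one(resources, requesters, balance):
    # AllForOneAllocationStrategy: give everything to the requester with the
    # greatest balance, breaking ties randomly.
    best = max(balance.get(p, 0.0) for p in requesters)
    winners = [p for p in requesters if balance.get(p, 0.0) == best]
    return {random.choice(winners): resources}

def proportionally_for_all(resources, requesters, balance):
    # ProportionallyForAllAllocationStrategy: split resources proportionally
    # among requesters with a positive balance; failing that, among those with
    # a zero balance; failing that, among all requesters (equal split assumed).
    positive = {p: balance.get(p, 0.0) for p in requesters if balance.get(p, 0.0) > 0}
    if positive:
        total = sum(positive.values())
        return {p: resources * b / total for p, b in positive.items()}
    zero = [p for p in requesters if balance.get(p, 0.0) == 0]
    group = zero if zero else list(requesters)
    return {p: resources / len(group) for p in group}

# Example turn: a provider with 40 resources and three requesting consumers.
balances = {"p1": 30.0, "p2": 10.0, "p3": -5.0}
print(all_for_one(40, ["p1", "p2", "p3"], balances))             # {'p1': 40}
print(proportionally_for_all(40, ["p1", "p2", "p3"], balances))  # {'p1': 30.0, 'p2': 10.0}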

5.2 Scenarios

To verify the system behavior, we varied the following parameters:
– Number of peers: We have simulated communities with 10, 100, and 1000 peers.
– Peer strategies: Regarding allocationStrategy, we have simulated the following scenarios: 100% of the peers using AllForOneAllocationStrategy, 100% using ProportionallyForAllAllocationStrategy, and the combinations (25%, 75%), (50%, 50%), and (75%, 25%) of the two strategies.
– Peer probability of being a provider in a turn (ρ): We have simulated all peers having a probability of 0.25, 0.50, or 0.75 of being in the provider state.



Also, we have simulated a heterogeneous scenario in which each peer has a probability of being a provider given by a uniform distribution in the interval [0.00..0.99]. We did not consider peers with a probability of 1.00 of being a provider because we believe that the desire to consume is the primary motivation for a site to join an OurGrid community, and a peer would not join it only to always be a provider.
– Amount of resources owned by a peer: All peers own an amount of resources drawn from a uniform distribution in the interval [10..50]. We consider this to be the size of a typical laboratory that will be encapsulated in an OurGrid peer.
All combinations of these parameters gave us 60 simulation scenarios. We have implemented the model and these scenarios using the SimJava [22] simulation toolkit¹.

5.3 Metrics

Since participation in an OurGrid community is voluntary, we have designed OurGrid (i) to promote equity (i.e., if the demand is greater than the offer of resources, the resources obtained from the grid should be equivalent to the resources donated to the grid), and (ii) to prioritize the peers that have helped the community the most (in the sense that they have donated more than they have consumed). We gauge equity using the Favor Ratio (FR) and prioritization using the Resource Gain (RG). The Favor Ratio FR_k of a peer p_k after a given turn is defined as the ratio of the accumulated amount of resources gained from the grid (note that this excludes the local resources consumed) to the accumulated amount of resources it has donated to the grid. More precisely, for a peer p_k which, during t turns, gained g_k resources from the grid and donated d_k to the grid, FR_k = g_k/d_k. As such, FR_k represents the relation between the amount of resources a peer gained and how many resources it has donated. If FR_k = 1, peer p_k has received from the grid an amount of resources equal to what it donated; that is, FR_k = 1 denotes equity. The Resource Gain RG_k of a peer p_k after a given turn is obtained by dividing the accumulated amount of resources used by it (both local and from the grid) by the accumulated amount of local resources it has used. As such, letting l_k be all the local resources a peer p_k consumed during t turns and g_k the total amount of resources it obtained from the grid during the same t turns, RG_k = (l_k + g_k)/l_k. RG_k measures the "speed-up" delivered by the grid, i.e., how much grid resources helped a peer in comparison to its local resources. Note that RG_k represents the resources obtained by a peer when it requested resources, because whenever a peer asks for grid resources it is also consuming its local resources. Thus, we can interpret RG_k as a quantification of how much that peer was prioritized by the community. To verify equity in the system-wide behavior, we thus expect to observe that, in situations of resource contention, FR_k = 1 for all peers p_k.

¹ SimJava is available at http://www.dcs.ed.ac.uk/home/simjava/



We also want to verify whether the peers which donated more to the community are really being prioritized. To gauge this, we use RG_k, which we expect to be greater for the peers with the greatest differences between what they have donated and what they have consumed. Given the assumptions of this model, we can easily derive a relation between RG_k and FR_k. Consider a peer p_k: let r_k be the amount of resources it owns, t the number of turns executed, l_k its local resources consumed, d_k the amount of resources it donated to the community, i_k the resources that went idle because there were no consumers in some turns in which p_k was in the provider state, and ρ_k the probability of p_k being a provider in a given turn. Let us also denote by R_k the total amount of resources that the peer had available during the t turns. As such:

\[
\left\{
\begin{array}{l}
R_k = t \, r_k \\
R_k = l_k + d_k + i_k \\
l_k = (1 - \rho_k) \, R_k
\end{array}
\right.
\qquad (1)
\]

From (1) and the definitions of FR_k and RG_k, we can derive that:

\[
RG_k = 1 + \frac{\rho_k \, FR_k}{1 - \rho_k} - \frac{i_k \, FR_k}{(1 - \rho_k) \, t \, r_k}
\qquad (2)
\]
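For clarity, we spell out the short derivation of (2), a straightforward expansion of the definitions above:

\[
RG_k = \frac{l_k + g_k}{l_k} = 1 + \frac{g_k}{l_k}, \qquad
g_k = FR_k \, d_k, \qquad
d_k = R_k - l_k - i_k = \rho_k \, t \, r_k - i_k ,
\]
\[
\text{and since } l_k = (1 - \rho_k) \, t \, r_k, \quad
RG_k = 1 + \frac{FR_k (\rho_k \, t \, r_k - i_k)}{(1 - \rho_k) \, t \, r_k}
     = 1 + \frac{\rho_k \, FR_k}{1 - \rho_k} - \frac{i_k \, FR_k}{(1 - \rho_k) \, t \, r_k}.
\]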

Another useful relation is obtained from the fact that the total amount of resources available in the system is the sum, over all peers, of the resources obtained from the grid, the local resources consumed, and the resources left idle. Since all donated resources are either consumed or left idle, and no resources are lost or created, we can state a resource conservation law as follows:

\[
\sum_k R_k = \sum_k g_k + \sum_k l_k + \sum_k i_k
\qquad (3)
\]

5.4 Results Discussion

Having presented the OurGame model, the scenarios in which we instantiated it, and the metrics used to measure its characteristics, we now show the results obtained so far. We divide the discussion between the scenarios in which all peers have the same providing probability ρ and those in which ρ_k is given by a uniform distribution in [0..0.99]. In each of them, we then examine how the number of peers, the providing probabilities ρ_k, and the allocationStrategy they were using impacted their RG_k and FR_k values and, consequently, the behavior of the network of favors and the resource contention in OurGrid. Results for Communities in Which All Peers Have Equal Providing Probabilities. For all scenarios in which ρ_k was equal for all peers, both FR_k and RG_k converged. FR_k always converged to 1, while RG_k converged to different values, depending on the scenarios' parameters.



(Plot: grid resources used divided by resources donated, versus turns, for the peer with the greatest, a mean, and the smallest amount of resources.)

Fig. 5. FR for peers with different resource quantities in a 10-peer community using ProportionallyForAllAllocationStrategy with ρ = 0.25

With all peers competing for resources with the same appetite, each peer gains back the same amount of resources it has donated to the community, which explains the convergence of FR_k. Figure 5 shows this happening despite the variance in the amount of resources owned by each peer: the three lines in the figure correspond to a peer with the greatest r_k, a peer with a mean value of r_k, and a peer with the smallest r_k in the scenario. Regarding RG_k, with FR_k = 1, equation (2) gives RG_k = 1 + ρ_k/(1 − ρ_k) − i_k/((1 − ρ_k)·t·r_k). To facilitate our understanding, we divide the analysis of the RG_k behavior into two situations: the scenarios in which there are no idle resources (i.e., i_k = 0) and the scenarios in which there are idle resources (i.e., i_k > 0). Analyzing the scenarios analytically, we observe that i_k > 0 happens if there is no consumer in some turn. The probability of all peers p_1, ..., p_n being in the provider state is AP = ∏_{1≤k≤n} ρ_k. Thus, the expected number of turns in which all resources in the community are idle in a given scenario is IT = t·AP. For all scenarios but the ones with 10 peers and 0.50 or 0.75 providing probabilities, IT ≈ 0. In the scenarios where i_k = 0, as FR_k = 1, we find from equation (2) that RG_k = 1 + ρ_k/(1 − ρ_k). As 0 ≤ ρ_k < 1, RG_k grows with ρ_k. Since the peers with greater ρ_k are the peers with the greatest difference between what they donated and what they consumed, the fact that RG_k grows with ρ_k shows that the more a peer contributes to the community, the more it is prioritized. In these scenarios, however, all peers have the same ρ and therefore the same RG_k. For example, in a community in which all peers have ρ = 0.50, we found that RG_k for all peers converged to RG = 1 + 0.5/(1 − 0.5) = 2, as can be seen in Figure 6 for a 100-peer community.



Fig. 6. RG in a 100-peer community using ProportionallyForAllAllocationStrategy and with ρ = 0.50

In the scenarios in which i_k > 0, we also found FR_k = 1, and RG_k also converged. However, we observed two differences in the behavior of the metrics: (i) FR_k took a greater number of turns to converge, and (ii) RG_k converged to a value smaller than in the scenarios where i_k = 0. The former difference happened because each peer took longer to rank the other peers, since there were turns with no resource consumption. The latter difference is explained by equation (3): as, for a given peer p_k, l_k is fixed, the idle resources i_k are resources that were not consumed. This means that g_k and, consequently, RG_k decrease as i_k increases. In short, as the total amount of resources consumed by the peers is less than the total amount of resources made available (i.e., both donated and idle), their RG_k is smaller than in the scenarios where all the resources made available were consumed. Finally, regarding the strategy the peers used to allocate their resources, we found that varying the strategy did not significantly affect the behavior of the metrics. The number of peers in the community, on the other hand, naturally affects the number of turns needed for both metrics to converge: this number grows with the size of the community. Results for Communities in Which Peers Have Different Providing Probabilities. After observing the effects of each of the simulation parameters in a community whose peers all had the same probability of consuming resources, we now discuss how variation in this probability affects our metrics. First, in the simulations of the 10-peer communities, we found that FR_k did not converge. Figure 7 shows FR_k for three peers of a community with this number of peers and in which the providing chance ρ_k of each peer p_k is given by a uniform distribution in the interval [0.00..0.99].



Fig. 7. FR for three peers in a 10-peer community with different providing probabilities using ProportionallyForAllAllocationStrategy

As can be seen, the peer which donated the least to the community (its providing chance is smaller than that of all other peers in Figure 7) obtained the greatest FR. This is easily explained if we also look at the RG_k values for these peers. The RG_k behavior for the same three peers is shown in Figure 8. Note that, for the peer with the greatest ρ, RG_k goes off the scale at its first request, after several turns spent providing, which gives the almost vertical solid line in the graph. Figure 8 shows how a peer is prioritized as it donates more resources to the community. Consequently, the peer which provided more resources is the peer with the greatest RG: whenever it asks for resources, it manages to get access to more resources than a peer that has provided less to the community. The peer with the lowest providing chance obtained more resources from the grid, and thus got a greater FR_k, because it actually requested more resources and donated only a little. As the providing probabilities are static, the peers with the greatest probabilities provided more and did not ask for resources often enough to make their FR_k rise. Thus, FR_k did not converge because there was not enough competition in these scenarios, and there were turns in which only peers which contributed small amounts of resources to the community requested resources. Note that without enough competition for the resources we cannot observe the fairness of the system. Nevertheless, by observing RG_k, we can still observe how the prioritization was done when the peers which contributed more to the community did ask for resources. An interesting behavior we have observed is that, as the community size grows, FR_k once again converges to 1.



(Plot: total resources used divided by local resources used, versus turns, for peers with providing chances 0.903513, 0.687385, and 0.17399.)

Fig. 8. RG for three peers in a 10-peer community with different providing probabilities using ProportionallyForAllAllocationStrategy

This happens for the 100-peer and 1000-peer communities in our simulations. The histogram² of FR_k in a 100-peer community at turn 3000 is shown in Figure 9. The convergence of FR_k happens due to the greater competition present in larger communities. As there are more peers, there are fewer turns in which only peers with small ρ_k request the resources of the community. As such, fewer peers manage to obtain FR_k values as high as in the 10-peer scenarios. This may still happen if, in a sufficiently large number of turns, only peers that donate very little request resources. Nevertheless, this is not harmful to our objectives, as these resources could not have been allocated to a peer that contributed significantly to the community. With FR_k = 1 and i_k = 0, we again find that RG_k grows with ρ_k. This shows that the peers which contributed more, that is, those with the highest ρ_k, were more prioritized. We remark that, again, our two allocation strategies did not show an impact on the simulation results. As such, in the long run, peers that allocate all of their resources to the highest ranked peer perform as well as peers that allocate their resources proportionally to the balances of the requesters.

6 Future Directions

The next steps in the OurGrid development are (i) simulating real grid users' workloads on the peers; (ii) studying the impact of malicious peers on the system; and (iii) the actual implementation of OurGrid.

² We opted to show a histogram due to the great number of peers in the simulation.



Fig. 9. FR histogram in a 100-peer community with different allocation strategies on turn 3000

Now that we have evaluated the key characteristics of our network of favors, simulating more realistic scenarios is needed to understand the impact of the grid environment on the model presented in this work. Peer maliciousness matters mostly in two aspects of OurGrid: a consumer peer will want assurance that a provider executed a task correctly, and it must not be possible to exploit the community through unfair accounting. More specifically, to deal with the consumer's need to ensure correct task execution by unreliable providers, we plan to study both (a) replication, in order to discover providers a consumer can trust, and (b) the insertion of application-specific verification, like the techniques described in [21]. To make the community tolerant of peers that use unfair accounting, marginalizing them, we aim to study the use of (a) autonomous accounting and (b) replication to determine whether a consumer should trust unknown providers. We plan to start the OurGrid implementation as an extension of MyGrid³ [15, 14], previous work done at UFCG. OurGrid will be able to serve as a MyGrid resource in the user's grid, and will initially obtain access to resources through the already existing MyGrid Grid Machine Interface. The Grid Machine Interface is an abstraction that provides access to different kinds of grid resources (Globus GRAM, MyGrid's UserAgent, Unix machines via ssh, etc.) and will allow OurGrid to interoperate with existing grid middleware. Interoperability is important both to take advantage of existing infrastructure and to ease the adoption of OurGrid by the community of users.

³ MyGrid is open source and is available at http://dsc.ufcg.edu.br/mygrid/


7 Conclusions

We have presented the design of OurGrid, a system that aims to allow users of BoT applications to easily obtain access to and use computational resources, dynamically forming an on-demand, large-scale grid. Also, by opting for simplicity in the services it delivers, OurGrid can be deployed immediately, both satisfying a current need of the BoT user community and helping researchers better understand how grids are really used in production, knowledge that will help guide future research directions. OurGrid is based on a network of favors, in which a site donates its idle resources as a favor, expecting to be prioritized when it asks for favors from the community. Our design aims to provide this prioritization in a completely decentralized manner. Decentralization is crucial to keep our system simple and not dependent on centralized services that might be hard to deploy, scale, and trust. Our preliminary results on the analysis, through simulation, of this design to resolve the conflict for resources in a decentralized community show that this approach is promising. We expect to evolve the present design into a solution that, due to its simplicity, will be able to satisfy a need of real grid users today.

Acknowledgments. We would like to thank Hewlett Packard, CNPq and CAPES for the financial support, which was crucial to the progress of this work. We would also like to thank Elizeu Santos-Neto, Lauro Costa and Jacques Sauvé for the insightful discussions that much contributed to our work.

References

[1] Project JXTA. http://www.jxta.org/.
[2] D. Abramson, J. Giddy, and L. Kotler. High performance parametric modeling with Nimrod/G: Killer application for the global grid? In Proceedings of IPDPS'2000, pages 520–528. IEEE CS Press, 2000.
[3] David Abramson, Rajkumar Buyya, and Jonathan Giddy. A computational economy for grid computing and its implementation in the Nimrod-G resource broker. Future Generation Computer Systems (FGCS) Journal, 18:1061–1074, 2002.
[4] Eytan Adar and Bernardo A. Huberman. Free riding on Gnutella. First Monday, 5(10), 2000. http://www.firstmonday.dk/.
[5] Ozalp Babaoglu and Keith Marzullo. Distributed Systems, chapter 4: Consistent Global States of Distributed Systems: Fundamental Concepts and Mechanisms. Addison-Wesley, 1993.
[6] Alexander Barmouta and Rajkumar Buyya. GridBank: A Grid Accounting Services Architecture (GASA) for distributed systems sharing and integration. In 26th Australasian Computer Science Conference (ACSC2003), 2003 (submitted).



[7] Fran Berman, Richard Wolski, Silvia Figueira, Jennifer Schopf, and Gary Shao. Application-level scheduling on distributed heterogeneous networks. In Supercomputing'96, 1996.
[8] R. Buyya, D. Abramson, and J. Giddy. An economy driven resource management architecture for computational power grids. In International Conference on Parallel and Distributed Processing Techniques and Applications, 2000.
[9] Rajkumar Buyya and Sudharshan Vazhkudai. Compute Power Market: Towards a Market-Oriented Grid. In The First IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2001), Beijing, China, 2000. IEEE Computer Society Press.
[10] H. Casanova, J. Hayes, and Y. Yang. Algorithms and software to schedule and deploy independent tasks in grid environments. In Workshop on Distributed Computing, Metacomputing and Resource Globalization, 2002.
[11] H. Casanova, A. Legrand, D. Zagorodnov, and F. Berman. Heuristics for scheduling parameter sweep applications in grid environments. In Proceedings of the 9th Heterogeneous Computing Workshop, pages 349–363, Cancun, Mexico, May 2000. IEEE Computer Society Press.
[12] W. Cirne and K. Marzullo. The computational Co-op: Gathering clusters into a metacomputer. In PPS/SPDP'99 Symposium, 1999.
[13] Walfredo Cirne, Francisco Brasileiro, Jacques Sauvé, Nazareno Andrade, Daniel Paranhos, Elizeu Santos-Neto, Raissa Medeiros, and Fabrício Silva. Grid computing for Bag-of-Tasks applications. In Proceedings of I3E2003, September 2003 (to appear).
[14] Walfredo Cirne and Keith Marzullo. Open Grid: A user-centric approach for grid computing. In 13th Symposium on Computer Architecture and High Performance Computing, 2001.
[15] Walfredo Cirne, Daniel Paranhos, Lauro Costa, Elizeu Santos-Neto, Francisco Brasileiro, Jacques Sauvé, Fabrício Alves Barbosa da Silva, and Cirano Silveira. Running bag-of-tasks applications on computational grids: The MyGrid approach. In Proceedings of ICPP'2003, the International Conference on Parallel Processing, October 2003.
[16] K. Czajkowski, I. Foster, N. Karonis, C. Kesselman, S. Martin, W. Smith, and S. Tuecke. A resource management architecture for metacomputing systems. In IPPS/SPDP'98 Workshop on Job Scheduling Strategies for Parallel Processing, pages 62–82, 1998.
[17] D. H. J. Epema, M. Livny, R. van Dantzig, X. Evers, and J. Pruyne. A worldwide flock of Condors: Load sharing among workstation clusters. Future Generation Computer Systems, 12:53–65, 1996.
[18] I. Foster and C. Kesselman. The Globus project: A status report. In IPPS/SPDP'98 Heterogeneous Computing Workshop, pages 4–18, 1998.
[19] Ian Foster. The anatomy of the Grid: Enabling scalable virtual organizations. Lecture Notes in Computer Science, 2150, 2001.
[20] James Frey, Todd Tannenbaum, Ian Foster, Miron Livny, and Steve Tuecke. Condor-G: A computation management agent for multi-institutional grids. Cluster Computing, 5:237–246, 2002.
[21] Philippe Golle and Ilya Mironov. Uncheatable distributed computations. Lecture Notes in Computer Science, 2020:425–441, 2001.
[22] Fred Howell and Ross McNab. SimJava: A discrete event simulation package for Java with applications in computer systems modelling. In Proceedings of the First International Conference on Web-Based Modelling and Simulation. Society for Computer Simulation, 1998.



[23] Daniel Paranhos, Walfredo Cirne, and Francisco Brasileiro. Trading cycles for information: Using replication to schedule bag-of-tasks applications on computational grids. In Proceedings of Euro-Par 2003: International Conference on Parallel and Distributed Computing, 2003.
[24] Matei Ripeanu and Ian Foster. Mapping the Gnutella network: Macroscopic properties of large-scale peer-to-peer systems. In First International Workshop on Peer-to-Peer Systems (IPTPS), 2002.
[25] E. L. Santos-Neto, L. E. F. Tenório, E. J. S. Fonseca, S. B. Cavalcanti, and J. M. Hickmann. Parallel visualization of the optical pulse through a doped optical fiber. In Proceedings of the Annual Meeting of the Division of Computational Physics, Boston, MA, USA, 2001.
[26] Shava Smallen, Walfredo Cirne, Jaime Frey, Francine Berman, Rich Wolski, Mei-Hui Su, Carl Kesselman, Steve Young, and Mark Ellisman. Combining workstations and supercomputers to support grid applications: The parallel tomography experience. In Proceedings of HCW'2000, the Heterogeneous Computing Workshop, 2000.
[27] J. Smith and S. K. Shrivastava. A system for fault-tolerant execution of data and compute intensive programs over a network of workstations. In Lecture Notes in Computer Science, volume 1123. IEEE Press, 1996.
[28] J. R. Stiles, T. M. Bartol, E. E. Salpeter, and M. M. Salpeter. Monte Carlo simulation of neuromuscular transmitter release using MCell, a general simulator of cellular physiological processes. Computational Neuroscience, pages 279–284, 1998.
[29] Douglas Thain, Todd Tannenbaum, and Miron Livny. Grid Computing: Making the Global Infrastructure a Reality, chapter 11: Condor and the grid. John Wiley, 2003.
[30] R. Wolski, J. Plank, J. Brevik, and T. Bryan. Analyzing market-based resource allocation strategies for the computational grid. International Journal of High-Performance Computing Applications, 15(3), 2001.

Scheduling of Parallel Jobs in a Heterogeneous Multi-site Environment

Gerald Sabin, Rajkumar Kettimuthu, Arun Rajan, and Ponnuswamy Sadayappan
The Ohio State University, Columbus OH 43201, USA
{sabin,kettimut,rajan,saday}@cis.ohio-state.edu

Abstract. Most previous research on job scheduling for heterogeneous systems considers a scenario where each job or task is mapped to a single processor. On the other hand, research on parallel job scheduling has concentrated primarily on the homogeneous context. In this paper, we address the scheduling of parallel jobs in a heterogeneous multi-site environment, where each site has a homogeneous cluster of processors, but processors at different sites have different speeds. Starting with a simple greedy scheduling strategy, we propose and evaluate several enhancements using trace driven simulations. We consider the use of multiple simultaneous reservations at different sites, use of relative job efficacy as a queuing priority, and compare the use of conservative versus aggressive backfilling. Unlike the single-site case, conservative backfilling is found to be consistently superior to aggressive backfilling for the heterogeneous multi-site environment.

1 Introduction

Considerable research has been conducted over the last decade on the topic of job scheduling for parallel systems, such as those used for batch processing at supercomputer centers. Much of this research has been presented at the annual Workshops on Job Scheduling Strategies for Parallel Processing [10]. With significant recent developments in creating the infrastructure for grid computing, the transparent sharing of resources at multiple geographically distributed sites is being facilitated. An important aspect of significance to multi-site job scheduling is heterogeneity: different sites are generally unlikely to have identical configurations of their processors and can be expected to have different performance characteristics. Much of the research to date on job scheduling for heterogeneous systems has only addressed the scheduling of independent sequential jobs or precedence-constrained task graphs where each task is sequential [4, 21, 34]. A direct extension of the heterogeneous scheduling strategies for sequential jobs and coarse-grained task graphs is not attractive for scheduling thousands of processors across multiple sites, due to the explosion in computational complexity. Instead, we seek to extend the practically effective backfilling-

Supported in part by NSF grant EIA-9986052




based parallel job scheduling strategies [11] used widely in practice for single-site scheduling. In this paper, we address the problem of heterogeneous multi-site job scheduling, where each site hosts a homogeneous cluster of processors. The paper is organized as follows. In Section 2, we provide some background information about parallel job scheduling and heterogeneous job scheduling. Section 3 provides information about the simulation environment used for this study. In Section 4, we begin by considering a simple greedy scheduling strategy for scheduling parallel jobs in a multi-site environment, where each site has a homogeneous cluster of processors, but processors at different sites have different speeds. We progressively improve on the simple greedy scheme, starting with the use of multiple simultaneous reservations at different sites. In Section 5, we compare the use of conservative versus aggressive backfilling in the heterogeneous context, and show that the trends are very different from the single-site case. In Section 6, we evaluate a scheduling scheme that uses the relative performance of jobs at different sites as the queue priority criterion for back-filling. In Section 7, we evaluate the implications of restricting the number of sites used for simultaneous reservations. Section 8 discusses related work. We conclude in Section 9.

2 Background

The problem we address in this paper is the following: Given a number of heterogeneous sites, with a homogeneous cluster of processors at each site, and a stream of parallel jobs submitted to a metascheduler, find an effective schedule for the jobs so that the average turnaround time of jobs is optimized. There has been a considerable body of work that has addressed the parallel job scheduling problem in the homogeneous context [6, 29, 31, 26, 11]. There has also been work on heterogeneous job scheduling [4, 21, 33, 34], but this has generally been restricted to the case of sequential jobs or coarse-grained precedence constrained task graphs. The fundamental approach used for scheduling in these two contexts has been very different. We provide a very brief overview. The Min-Min algorithm is representative of the scheduling approaches proposed for scheduling tasks on heterogeneous systems. A set of N tasks is given, with their runtimes on each of a set of P processors. Given a partial schedule of already scheduled jobs, for each unscheduled task, the earliest possible completion time is determined by considering each of the P processors. After the minimum possible completion time for each task is determined, the task that has the lowest “earliest completion time” is identified and is scheduled on the processor that provides its earliest completion time. This process is repeated N times, till all N tasks are scheduled. The problem has primarily been evaluated in a static “off-line” context - where all tasks are known before scheduling begins, and the objective is the minimization of makespan, i.e. the time to finish all tasks. The algorithms can be applied also in the dynamic “on-line” context, by “unscheduling” all non-started jobs at each scheduling event – when either a new job arrives or a job completes.
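To make the two nested minimizations concrete, the following is a minimal sketch of Min-Min in Python (our illustration; the runtimes matrix and all names are assumptions, not code from the cited work):

def min_min(runtimes):
    # Min-Min for independent sequential tasks on heterogeneous processors.
    # runtimes[t][p] is the runtime of task t on processor p; returns a mapping
    # task -> (processor, start_time), greedily minimizing completion times.
    num_procs = len(runtimes[0])
    ready = [0.0] * num_procs                  # time each processor becomes free
    unscheduled = set(range(len(runtimes)))
    schedule = {}
    while unscheduled:
        best = None                            # (completion, task, processor)
        for t in unscheduled:
            for p in range(num_procs):
                completion = ready[p] + runtimes[t][p]
                if best is None or completion < best[0]:
                    best = (completion, t, p)
        completion, t, p = best                # the task with the lowest
        schedule[t] = (p, ready[p])            # "earliest completion time" goes first
        ready[p] = completion
        unscheduled.remove(t)
    return schedule

# Three tasks, two heterogeneous processors.
print(min_min([[3.0, 5.0], [2.0, 1.0], [4.0, 8.0]]))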



Scheduling of parallel jobs has been addressed in the homogeneous context. It is usually viewed in terms of a 2D chart with time along one axis and the number of processors along the other axis. Each job can be thought of as a rectangle whose length is the user-estimated run time and whose width is the number of processors required. The simplest way to schedule jobs at a single site is to use a First-Come-First-Served (FCFS) policy. This approach suffers from low system utilization [23]. Backfilling was proposed to improve system utilization and has been implemented in several production schedulers. Backfilling works by identifying "holes" in the 2D chart and moving forward smaller jobs that fit those holes, without delaying any jobs with future reservations. There are two common variations of backfilling: conservative and aggressive (EASY) [13, 26]. In conservative backfilling, every job is given a reservation when it enters the system. A smaller job is moved forward in the queue as long as it does not delay any previously queued job. In aggressive backfilling, only the job at the head of the queue has a reservation. A small job is allowed to leap forward as long as it does not delay the job at the head of the queue. Thus, prior work on job scheduling algorithms for heterogeneous systems has primarily focused on independent sequential jobs or collections of single-processor tasks with precedence constraints. On the other hand, schemes for parallel job scheduling have not considered heterogeneity of the target systems. Extensions of algorithms like Min-Min are possible, but their computational complexity could be excessive for realistic systems. Instead, we pursue an extension of an approach that we previously proposed for distributed multi-site scheduling on homogeneous systems [30]. The basic idea is to submit each job to multiple sites, and to cancel the redundant submissions when one of the sites is able to start the job.
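As an illustration of the aggressive (EASY) rule, the sketch below (ours, not the schedulers simulated in this paper) tests whether a candidate job may start now without delaying the reservation held by the job at the head of the queue:

def easy_backfill_ok(now, free, running, head_req, cand_req, cand_runtime):
    # EASY/aggressive backfill test (illustrative sketch).
    # running: list of (end_time, procs) for jobs currently executing.
    # head_req: processors requested by the job at the head of the queue.
    if cand_req > free:
        return False                     # does not even fit right now
    # Find the head job's reservation (shadow time): walk job completions in
    # time order until enough processors would be free for the head job.
    avail, shadow = free, now
    for end, procs in sorted(running):
        if avail >= head_req:
            break
        avail += procs
        shadow = end
    extra = avail - head_req             # processors the head job leaves unused
    # Backfill is safe if the candidate ends before the shadow time, or if it
    # only uses processors the head job will not need when it starts.
    return now + cand_runtime <= shadow or cand_req <= extra

# One 4-processor job ends at t=10 on an 8-processor machine; the head job
# needs 6 processors. A 2-processor, 30-unit candidate may backfill.
print(easy_backfill_ok(now=0, free=4, running=[(10, 4)],
                       head_req=6, cand_req=2, cand_runtime=30))   # True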

3 Simulation Environment

In this work we employ trace-driven simulations, using workload logs from supercomputer centers. The job logs were obtained from the collection of workload logs available from Dror Feitelson's archive [12]. Results are reported for a 5000-job subset of the 430-node Cornell Theory Center (CTC) trace and a 5000-job subset of a trace from the 128-node IBM SP2 system at the San Diego Supercomputer Center (SDSC). The first 5000 jobs were selected, representing roughly one month of jobs. These traces were modified to vary the load and to model jobs submitted to a metascheduler from geographically distributed users (by time-shifting two of the traces by three hours, to model two centers each in the Pacific and Eastern U.S. time zones). The available job traces do not provide any information about runtimes on multiple heterogeneous systems. To model the workload characteristics of a heterogeneous environment, the NAS Parallel Benchmarks 2.0 [2] were used. Four Class B benchmarks were used to model the execution of jobs from the CTC and SDSC trace logs on a heterogeneous system. Each job was randomly chosen to represent one of the NAS benchmarks.



The processing power of each remote site was modeled after one of four parallel computers for which NAS benchmark data was available (cluster 0: SGI Origin 2000, cluster 1: IBM SP (WN/66), cluster 2: Cray T3E 900, cluster 3: IBM SP (P2SC 160 MHz)). The run times on the various machines were normalized with respect to the IBM SP (P2SC 160 MHz) for each benchmark. The jobs were scaled to represent their relative runtime (for the same number of nodes) on each cluster. These scaled runtimes represent the expected runtime of a job on a particular cluster, assuming the estimate from the original trace corresponded to an estimate on the IBM SP (P2SC 160 MHz). The total number of processors at each remote site was chosen to be the same as in the original trace (430 when simulating the CTC trace and 128 for SDSC). Therefore, in these simulations all jobs can run at any of the simulated sites. In this paper, we do not consider the scheduling of a single job across multiple sites. The benchmarks used were: LU (an application benchmark, solving a finite difference discretization of the 3-D compressible Navier-Stokes equations [3]), MG (Multi-Grid, a kernel benchmark, implementing a V-cycle multi-grid algorithm to solve the scalar discrete Poisson equation [25]), CG (Conjugate Gradient, which computes an approximation to the smallest eigenvalue of a large, sparse, symmetric positive definite matrix), and IS (Integer Sort, which tests a sorting operation that is important in "particle method" codes). The performance of the scheduling strategies at different loads was simulated by multiplying the runtimes of all jobs by a load factor. The runtimes of jobs were expanded so as to leave the duration of the simulated trace (roughly one month) unchanged. This generates a schedule equivalent to a trace in which the interarrival times of the jobs are reduced by a constant factor. This model of increasing load results in a linear increase in turnaround time, even if the wait times remain unchanged. However, the increase in wait time (due to the higher load) generally causes the turnaround time to increase at a faster rate (especially when the system approaches saturation, where a nonlinear increase in wait time causes a significant increase in turnaround time). Simulations were run with load factors ranging from 1.0 to 2.0, in increments of 0.2.

4 Greedy Metascheduling

We first consider a simple greedy scheduling scheme, where jobs are processed in arrival order by the metascheduler, and each job is assigned to the site with the lowest instantaneous load. The instantaneous load at a site is the ratio of the total remaining processor-runtime product of all jobs (either queued or running at that site) to the number of processors at the site. It thus represents the total amount of time needed to run all jobs assuming no processor cycles are wasted (i.e., jobs can be packed ideally). Recently, we evaluated a multi-site scheduling strategy that uses multiple simultaneous requests in the homogeneous multi-site context [30] and showed that it provided a significant improvement in turnaround time.



We first applied the idea of multiple simultaneous requests to the heterogeneous environment and compared its performance with the simple greedy scheme. With the MR (Multiple Requests) scheme, each job is sent to the K least loaded sites. Each of these K sites schedules the job locally. The scheduling scheme at the sites is aggressive backfilling with FCFS queue priority. When a job is able to start at any of the sites, that site informs the metascheduler, which in turn contacts the K − 1 other local schedulers to cancel the redundant request from their respective queues. This operation must be atomic to ensure that the job is executed at only one site. By placing each job in multiple queues, the expectation is that more jobs will be available in all local queues, thereby increasing the number of jobs that could fit into a backfill window. Furthermore, more "holes" will be created in the schedule, since K − 1 reservations are removed when a job starts running, enhancing backfill opportunities for queued jobs. The MR scheme was simulated with K = 4, i.e., each job was submitted to all four sites. It was hoped that these additional backfilling opportunities would lead to improved system utilization and reduced turnaround times. However, as shown in Fig. 1, the average turnaround time decreases only very slightly with the MR scheme when compared to the simple greedy scheme. Figure 2 shows that the average system utilization for the Greedy-MR scheme improves only slightly when compared to the simple greedy scheme, and utilization is quite high for both schemes. In a heterogeneous environment, the same application may perform differently when run on different clusters. Table 1 shows the measured runtime for three applications (NAS benchmarks) on the four parallel systems mentioned earlier, for execution on 8 or 256 nodes. It can be seen that performance differs for each application on the different machines. Further, no machine is the fastest on all applications; the relative performance of different applications on the machines can be very different. With the simple greedy and MR schemes, it is possible that jobs execute on machines where their performance is not the best. In order to assess this, we computed an "effective utilization" metric.
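The site-selection logic of the two schemes can be sketched as follows (our illustration in Python; the data structures and names are invented): the greedy scheme submits to the single least-loaded site, while the MR scheme submits to the K least-loaded sites and cancels the K − 1 redundant copies once one site starts the job.

def instantaneous_load(site):
    # Load = total remaining processor-runtime product of queued and running
    # jobs, divided by the number of processors at the site.
    work = sum(j["procs"] * j["remaining"] for j in site["jobs"])
    return work / site["processors"]

def greedy_submit(job, sites):
    # Simple greedy scheme: send the job to the least-loaded site only.
    target = min(sites, key=instantaneous_load)
    target["jobs"].append(job)
    return [target]

def mr_submit(job, sites, k):
    # MR scheme: send the job to the K least-loaded sites; all K schedule it.
    targets = sorted(sites, key=instantaneous_load)[:k]
    for s in targets:
        s["jobs"].append(job)
    return targets

def on_job_start(job, starting_site, submitted_sites):
    # When one site starts the job, cancel the redundant copies elsewhere.
    for s in submitted_sites:
        if s is not starting_site and job in s["jobs"]:
            s["jobs"].remove(job)

# Two toy sites and one job asking for 16 processors for 100 time units.
sites = [{"name": "A", "processors": 430, "jobs": []},
         {"name": "B", "processors": 128, "jobs": []}]
job = {"procs": 16, "remaining": 100}
submitted = mr_submit(job, sites, k=2)
on_job_start(job, submitted[0], submitted)   # one site starts it; the other copy is removed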


Fig. 1. Greedy Metascheduling in a Heterogeneous Environment: Turnaround time. Making multiple simultaneous requests only slightly improves the turnaround time


Fig. 2. Greedy Metascheduling: Utilization. With respect to utilization, making multiple requests results in a slight improvement over the greedy scheme


Fig. 3. Greedy Metascheduling: Effective Utilization. With respect to effective utilization, making multiple requests only results in a slight improvement over the greedy scheme

Table 1. Job Run-times on Different Systems

                          SGI           IBM          Cray      IBM SP+
                          Origin 2000   SP (WN/66)   T3E 900   (P2SC 160 MHz)
IS Class B (8 Nodes)         23.3          22.6        16.3*      17.7
MG Class B (8 Nodes)         35.3          34.3        25.3       17.2*
MG Class B (256 Nodes)        1.3           2.3         1.8        1.1*
LU Class B (256 Nodes)       20.3*         94.9        35.6       24.2

+ Original estimated runtime
* Best runtime for a job



We first define the efficacy of a job at any site as the ratio of its best runtime (among all the sites) to its runtime at that site. The effective utilization is a weighted utilization metric, where each job's processor-runtime product is weighted by its efficacy at the site where it ran. While utilization is a measure of the fraction of processor cycles used on the system, the effective utilization is a measure of the fraction of processor cycles used with respect to their best possible usage:

\[
\mathrm{Effective\ Utilization} = \frac{\sum_i \mathrm{ProcessorsUsed}_i \times \mathrm{Runtime}_i \times \mathrm{Efficacy}_i}{\mathrm{ProcessorsAvailable} \times \mathrm{Makespan}}
\qquad (1)
\]

where

\[
\mathrm{Makespan} = \mathrm{MaxCompletionTime} - \mathrm{MinStartTime}
\qquad (2)
\]

and

\[
\mathrm{Efficacy}_i = \mathrm{OptimalRuntime}_i \div \mathrm{ActualRuntime}_i
\qquad (3)
\]
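Computing the metric from a completed schedule is straightforward. The following sketch (ours) follows equations (1)-(3), describing each finished job by the processors it used, its actual runtime at the chosen site, its best runtime over all sites, and its start and end times:

def effective_utilization(jobs, processors_available):
    # jobs: list of dicts with 'procs', 'runtime' (at the chosen site),
    # 'best_runtime' (minimum over all sites), 'start' and 'end' times.
    # Implements (1)-(3) with efficacy = best runtime / actual runtime.
    makespan = max(j["end"] for j in jobs) - min(j["start"] for j in jobs)
    weighted = sum(j["procs"] * j["runtime"] * (j["best_runtime"] / j["runtime"])
                   for j in jobs)
    return weighted / (processors_available * makespan)

# A job that ran where it is 25% slower than at its best site contributes only
# its "best possible" processor-seconds to the effective utilization.
jobs = [{"procs": 8, "runtime": 125.0, "best_runtime": 100.0,
         "start": 0.0, "end": 125.0}]
print(effective_utilization(jobs, processors_available=16))   # 0.4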

5 Aggressive vs Conservative Scheduling

So far we have used aggressive backfilling, and seen that using multiple requests only improves performance slightly. In this scheme (Greedy-MR) a job runs at the site where it starts the earliest. In a heterogeneous context, the site where the job starts the earliest may not be the best site. The heterogeneity of the sites means that any given job can have different runtimes at the various sites. Thus, the site that gives the earliest start time need not give the earliest completion time. Therefore, it would be beneficial to use earliest completion time instead of earliest start time in determining which site should start a job. However, this poses a problem. When a job is ready to start at a particular site, it is possible to conclude that the site offers the job the earliest start time because other sites still have the job in their queue. However, estimating which site offers the earliest expected completion time is difficult without knowing when the jobs are expected to start at other sites. In order to be able to estimate the completion time of a job at all relevant sites, conservative backfilling has to be employed at each site. When a job is about to start at a site, the scheduler has to check the expected completion time of the same job at all sites where the job is scheduled, and determine if some other site has a better completion time. If the job is found to have a better completion time at another site, the job is not run and is removed from the queue at this site. Figure 4 compares the performance of the completion-based conservative scheme with the previously evaluated start-based aggressive scheme for the CTC and SDSC traces. It can be observed that the conservative scheme performs much better than the aggressive scheme. This is quite the opposite of what generally is observed with single-site scheduling, where aggressive backfilling performs better than conservative backfilling in regards to the turnaround time metric. It has been shown that aggressive backfilling consistently improves the performance of

94

Gerald Sabin et al.

long jobs relative to conservative backfilling [28] and the turnaround time metric is dominated by the long jobs. Indeed, that is what we observe with single site scheduling for these traces (in order to make the overall load comparable to the four-site experiments, the runtime of all jobs was scaled down by a factor of 4). Figures 6 and 7 provide insights into the reason for the superior performance of the completion based conservative scheme for multi-site scheduling. Even though the aggressive scheme has better “raw” utilization, the effective utilization is worse than the conservative scheme. The higher effective utilization with the completion-based conservative scheme suggests that basing the decision on expected job completion time rather than start time improves the chances of a job running on a site where its efficacy is higher, thereby making more effective use of the processor cycles. The figures show similar trends for both CTC and SDSC traces.

110000 90000 70000 50000 30000 10000

Average Turnaround Time

Greedy-MR (Aggressive) Conservative - MR

Average Turnaround Time

CTC

700000 600000 500000 400000 300000 200000 100000 0

1.0 1.2 1.4 1.6 1.8 2.0 Runtime Expansion Factor

SDSC Greedy-MR (Aggressive) Conservative-MR

1.0 1.2 1.4 1.6 1.8 2.0 Runtime Expansion Factor

Fig. 4. Performance of Conservative Multiple Request Scheme. Using a conservative completion-based scheme has a significant impact on the Greedy MR scheme

 

&7& $JJUHVVLYH &RQVHUYDWLYH

 

 $YHUDJH 7XUQDURXQG7LPH

$YHUDJH 7XUQDURXQG7LPH



 

6'6& $JJUHVVLYH &RQVHUYDWLYH

  

       5XQWLPH([SDQVLRQ)DFWRU

      5XQWLPH([SDQVLRQ)DFWRU

Fig. 5. Aggressive vs. Conservative Back-filling: Single Site. For a single site, aggressive backfilling outperforms conservative backfilling

CTC

88% 83% 78% 73% 68% 63% 58%

Greedy - MR (Aggressive) Conservative - MR 1.0

95

SDSC 98% 96% 94% 92% 90% 88% 86% 84% 82% 80%

Utilization

Utilization

Scheduling of Parallel Jobs in a Heterogeneous Multi-site Environment

1.2 1.4 1.6 1.8 2.0 Runtime Expansion Factor

Greedy-MR (Aggressive) Conservative-MR 1.0

1.2 1.4 1.6 1.8 2.0 Runtime Expansion Factor

Fig. 6. Utilization of Conservative Completion-Based Multiple Requests Scheme. When using the conservative competition-based scheme, the raw utilization actually decreases, even though the turnaround time has improved CTC

62% 58% 54% 50% 46%

72% Effective Utilization

Effective Utilization

66%

Greedy - MR (Aggressive) Conservative - MR 1.0 1.2 1.4 1.6 1.8 2.0 Runtime Expansion Factor

SDSC

70% 68% 66% 64% 62%

Greedy-MR (Aggressive) Conservative-MR 1.0 1.2 1.4 1.6 1.8 2.0 Runtime Expansion Factor

Fig. 7. Effective Utilization of Conservative Completion-Based Multiple Requests Scheme. Using a conservative completion-based scheme significantly increases effective utilization, even though raw utilization decreases

The average turnaround time for aggressive backfilling at a single site is better, compared to conservative backfilling, because of improved backfilling chances with aggressive backfilling. The backfilling opportunities in the singlesite context are poorer with conservative backfilling because each waiting job has a reservation, and the presence of multiple reservations creates impediments to backfilling. Conservative backfilling has been shown to especially prevent long narrow jobs from backfilling. The improvement in the average turnaround time with conservative backfilling in the heterogeneous context is also attributed to improved backfilling caused by the holes created by the dynamic removal of replicated jobs at each site, and an increased number of jobs to attempt to backfill at each site. Multiple reservation requests result in an increased number of jobs in all local queues at any given time. This gives the scheduler more jobs to choose from, when attempting to fill a backfill window. In a heterogeneous environment the presence of reservation replicas brings both backfilling advantages and the advantages due to reservation guarantees.

96

Gerald Sabin et al.

We then incorporated a refinement to the completion-based multiple reservation scheduling strategy. In the homogeneous case, when a job is ready to start at one of the sites, all other copies of that job at other sites can be canceled because none of those could possibly produce an earlier completion time. In the heterogeneous context, when a job is ready to start at a site, and appears to have the earliest completion time when compared to its reservation at other sites, there is still a possibility that future backfilling at a faster site might allow a faster completion at the faster site. In order to take advantage of these possible backfills, it might be worthwhile to keep the jobs in the queues at the faster sites, even though the current remote reservations do not provide a completion time better than at the site where the job is ready to start. We implemented a version of the completion - based conservative backfilling scheme where only jobs at slower sites were canceled when a job was started at a site. This improved performance, but not to a significant extent.

6

Efficacy Based Scheduling

The previous data has shown that the raw utilization is not a good indicator for how well a scheduling strategy performs in a heterogeneous environment. The turnaround time tracks more closely with the effective utilization. Therefore, a scheme which directly takes into account the efficacy of jobs would be desirable. A strategy which increases the efficacy of the jobs would lead to a higher effective utilization, which we expect will lower the average turn around time. To include efficacy in our strategies we propose using efficacy as the priority order for the jobs in the queue. Changing the order of the reserved jobs will change the backfilling order. In this case jobs with higher efficacies will attempt to backfill before jobs with lower efficacies, and thus will have more backfilling opportunities. In the case of identical efficacies the secondary priority will be FCFS. This priority scheme will guarantee a starvation free system using any of the given strategies. Furthermore, in a conservative backfilling scheduler, a priority queue based on efficacy will result in bounded delays for all jobs. This is due to each job being guaranteed to have an efficacy of 1.0 on at least one site. The job is guaranteed to make progress at this site, leading to a starvation free system. This has resulted in a minimal (less than 5%) increase in worst case turnaround time, when compared to an FCFS priority. By changing the priority order to efficacy, a job will have a greater chance to run on its faster machines (because it will have a higher priority on these machines than the jobs with a lower efficacy). This can be expected to increase the average efficacy of the system. This higher average efficacy (as shown by the effective utilization) is expected to lead to a lower turn around time for strategies which use an efficacy priority policy. Figure 10 shows that using an efficacy based priority policy indeed leads to a higher effective utilization. This is in spite of a FCFS priority queue having better raw utilization (Fig. 9). The higher effective utilization provides the basis for the improved turn around time seen in Fig. 8.

70000 60000 50000 40000 30000 20000 10000

CTC FCFS Priority Efficacy Priority

Average Turnaround Time

Average Turnaround Time

Scheduling of Parallel Jobs in a Heterogeneous Multi-site Environment

250000 200000 150000

97

SDSC FCFS Priority Efficacy Priority

100000 50000 0

1.0 1.2 1.4 1.6 1.8 2.0 Runtime Expansion Factor

1.0 1.2 1.4 1.6 1.8 2.0 Runtime Expansion Factor

CTC

86% 82% 78% 74% 70% 66% 62% 58%

6'6& 

FCFS Priority Efficacy Priority

8WLOL]DWLRQ

Utilization

Fig. 8. Performance of Efficacy Based Scheduling. Explicitly accounting for efficacy (by using efficacy for the priority policy) reduces turnaround time

 )&)63ULRULW\ (IILFDF\3ULRULW\

 

1.0

1.2 1.4 1.6 1.8 2.0 Runtime Expansion Factor



     5XQWLPH([SDQVLRQ)DFWRU

Fig. 9. Utilization of efficacy based scheduling. Raw utilization has decreased slightly when using an efficacy priority queue, even though the turnaround time has improved

CTC

64% 60% 56% 52%

FCFS Priority Efficacy Priority

 (IIHFWLYH 8WLOL]DWLRQ

Effective Utilization

68%

6'6&

   

)&)63ULRULW\ (IILFDF\3ULRULW\



48% 1.0 1.2 1.4 1.6 1.8 2.0 Runtime Expansion Factor

      5XQWLPH([SDQVLRQ)DFWRU

Fig. 10. Effective utilization of efficacy based scheduling. Using an efficacy priority queue increases effective utilization, in spite of the decrease in raw utilization. This explains the improvement in turnaround time

98

7

Gerald Sabin et al.

Restricted Multi-site Reservations

150000 130000 110000 90000 70000 50000 30000 10000

CTC 1 Site 2 Sites 3 Sites 4 Sites

1.0 1.2 1.4 1.6 1.8 2.0 Runtime Expansion Factor

$YHUDJH 7XUQDURXQG7LPH

Average Turnaround Time

So far the strategies implemented in this paper have concentrated on either making one reservation on a single site or reservations at all sites (where the total number of reservations is equal to the total number of sites). We have seen that making multiple reservations shows a substantial improvement in the average turnaround time. However, it is of interest to make fewer reservations, if possible. This is due to the overhead involved in maintaining a larger number of reservations (network latency bound). When a job is ready to start at a site, it must contact all other sites and determine whether it should start, based on the current strategy. When a job has determined it should start (by contacting all other sites where a reservation is held and receiving a reply), it must inform all other sites that the job is starting. This process may happen multiple times for each job (a maximum of once for each site which attempts to start the job). Therefore, a minimum of 3 ∗ (K − 1) messages must be transferred for each job to start. Further, the job must be transferred to each site where a reservation is made (network bandwidth bound). This network overhead could be substantially reduced by limiting the number of reservations. When fewer reservations are used per job, each site does not have to contact as many other sites before starting a job, and there is a lower chance that a job will be denied at start (there will be fewer sites to deny the job). These factors can substantially reduce the communication overhead needed. Figure 11 shows the turnaround time results when each job is submitted to K sites, with K varied from 1 to 4. The graphs show that the greatest degree of improvement is when the number of sites is increased from one to two. There is less of a benefit as the number of sites is further increased. Therefore, when network latencies are high, jobs can be submitted to a smaller number of sites and the multi-reservation scheduler can still realize a substantial fraction of the benefits achievable with a scheduler that schedules each job at all sites. In order to avoid starvation (when using efficacy as the priority) the efficacy is relative to the sites where the job was scheduled. Therefore, each job will still be guaranteed

  

6'6& 6LWH 6LWHV

6LWHV 6LWHV

         5XQWLPH([SDQVLRQ)DFWRU

Fig. 11. Performance with restricted multi-site reservation, no communication costs. As the number of scheduling sites are increased, the turnaround time monotonically decreases

Scheduling of Parallel Jobs in a Heterogeneous Multi-site Environment

70000 50000

CTC 2 Sites, Load 4 Sites, Load 2 Sites, Comp 4 Sites, Comp

30000 10000 1.0 1.2 1.4 1.6 1.8 2.0 Runtime Expansion Factor

Average Turnaround Time

Average Turnaround Time

90000

300000 250000 200000 150000 100000 50000 0

99

SDSC 2 Sites, Load 4 Sites, Load 2 Sites, Comp 4 Sites, Comp

1.0 1.2 1.4 1.6 1.8 2.0 Runtime Expansion Factor

Fig. 12. Effect of site selection criteria, no communication costs. Using the completion time of the reservations, as opposed to an instantaneous load, improves the turnaround time to have an efficacy of one at least on one of the sites. Hence the jobs are still guaranteed to be free from starvation. In our previous graphs and data, we used the instantaneous load metric (either maintained at the metascheduler or by periodically polling the remote sites) to choose which K sites to schedule each job. We next consider a more accurate approach to selecting the sites. Instead of using the instantaneous load, we query each site to determine the earliest completion time, based on its current schedule. This further takes into account the efficacy of the job at each location and there is a higher probability that the job will run on a site where its efficacy is maximum. Figure 12 shows that changing the mechanism for site selection can have significant impact on the turn around time. There is of course no change when the job is submitted to the maximum number of sites, because all sites are being chosen, regardless of the site selection mechanism. From Fig. 12 it can be observed that when using completion time as the site selection criterion, submitting to fewer sites can be almost as effective as submitting to all sites. However, this more accurate approach does not come free. There is an additional initial overhead that must be incurred to determine a job’s K best completion times, which is not incurred when using the instantaneous load. The load of each site can be maintained incrementally by the metascheduler, or the metascheduler can periodically update the load of each site; therefore there are no per-job communication costs incurred in selecting the K least loaded sites. In contrast, to determine the K best completion times, each site must be queried for its expected completion time. For N sites, the querying will require 2 ∗ N messages, N messages from the metascheduler (to contact each site with the job specifications) and a response from each site. When using completion time to determine the K sites, there are an additional 2 ∗ N messages needed to determine the minimum completion times. Therefore, a minimum of 3 ∗ (K − 1) messages per job are required when using the instantaneous load and a minimum of 2 ∗ N + 3 ∗ (K − 1) messages when using completion time. Next we assess the impact of communication overhead for data transfer when running a job at a remote site. We assumed a data transfer rate of 10Mbps.

Gerald Sabin et al.

$YHUDJH 7XUQDURXQG7LPH

  

&7& 6LWH 6LWHV 6LWHV 6LWHV

$YHUDJH 7XUQDURXQG7LPH

100

 

   

6'6& 6LWH 6LWHV 6LWHV 6LWHV

        5XQWLPH([SDQVLRQ)DFWRU

      5XQWLPH([SDQVLRQ)DFWRU

Fig. 13. Average turnaround time with a contention-less network model. Adding data transfer time uniformly increase turnaround time, but does not affect the trends &7&



    

6LWH 6LWHV

6LWHV 6LWHV

      5XQWLPH([SDQVLRQ)DFWRU

1XPEHURI 0HVVDJHV

1XPEHU2I 0HVVDJHV



6'6&

    

6LWH 6LWHV

6LWHV 6LWHV

      5XQWLPH([SDQVLRQ)DFWRU

Fig. 14. Number of messages with a contention-less network model. Decreasing the number of scheduling sites significantly reduces the number of control messages needed to maintain the schedule

Each job was assigned a random size (for executable plus data) between 500MB and 3GB. The data transfer overhead was modeled by simply increasing the length of a job if it is run remotely (local jobs do not involve any data transfer). Figure 13 shows the average turnaround time including the extra overheard. The turnaround times have increased due to the additional overhead, but relative trends remain the same. Figure 14 shows the number of control messages which were actually needed to maintain the schedule. There is a substantial increase in the number of messages when the value of K is increased.

8

Related Work

Recent advances in creating the infrastructure for grid computing (e.g. Globus [14], Legion[19], Condor-G[15] and UNICORE [24]) facilitate the deployment of metaschedulers that schedule jobs onto multiple heterogeneous sites. However there has been little work on developing and evaluating job scheduling schemes for a heterogeneous environment.

Scheduling of Parallel Jobs in a Heterogeneous Multi-site Environment

101

Application level scheduling techniques [3, 5, 17] have been developed to efficiently deploy resource intensive applications that require more resources than available at a single site and parameter sweep applications over the grid. There have been some studies on decoupling the scheduler core and application specific components [7] and introducing a Metascheduler [32] to balance the interests of different applications. But none of the above works address the problem of developing effective scheduling strategies for a heterogeneous environment. [8] proposes an economic model for scheduling in the heterogeneous grid environments where the objective is to minimize the cost function associated with each job - an aspect somewhat orthogonal to that addressed in this paper. [35] proposes a load sharing facility with emphasis on distributing the jobs among the various machines, based on the workload on the machines. Studies that have focused on developing job scheduling algorithms for the grid computing environment include [16, 18, 20, 27, 30]. Most of these studies do not address the issue of heterogeneity. In [22] a few centralized schemes for sequential jobs were evaluated. In [16], the performance of a centralized metascheduler was studied under different levels of information exchange between the meta scheduler and the local resource management systems where the individual systems are considered heterogeneous in that the number of processors at different sites differs, but processors at all sites are equally powerful. In [9], the impact of scheduling jobs across multiple homogeneous systems was studied, where jobs can be run on a collection of homogeneous nodes from independent systems s, where each systems may have a different number of nodes. The impact of advance reservations for meta-jobs on the overall system performance was studied in [27]. In [20, 30], some centralized and decentralized scheduling algorithms were evaluated for metacomputing, but only the homogeneous context is considered.

9

Current Status and Future Work

The simulation results show that the proposed scheduling strategy is promising. We plan next to implement the strategy in the Silver/Maui scheduler and evaluate it on the Cluster Ohio distributed system. The Ohio Supercomputer Center recently initiated the Cluster Ohio project [1] to encourage increased academic usage of cluster computing and to leverage software advances in distributed computing. OSC acquires and puts into production a large new cluster approximately every two years, following the budget cycle. OSC distributes the older machine in chunks to academic laboratories at universities around the state, with the proviso that the machines continue to be controlled centrally and available for general use by the OSC community. Each remote cluster is designed to be fully stand-alone, with its own file system and scheduling daemons. To allow non-trivial access by remote users, currently PBS and Maui/Silver are used with one queue for each remote cluster and require that users explicitly choose the remote destination. Remote users can access any cluster for a PBS job by using the Silver metascheduler. Globus is used to handle the mechanics of authentica-

102

Gerald Sabin et al.

tion among the many distributed clusters. We plan to deploy and evaluate the heterogeneous scheduling approach on the Cluster Ohio systems.

Acknowledgments We thank the Ohio Supercomputer Center for access to their resources. We also thank the anonymous referees for their suggestions.

References [1] Cluster Ohio Initiative. http://oscinfo.osc.edu/clusterohio. 101 [2] Nas Parallel Benchmark Results. http://www.nas.nasa.gov/NAS/NPB/NPB2Results/971117/all.html. 89 [3] D. Bailey, T. Harris, W. Saphir, R. Wijngaart, A. Woo, and M. Yarrow. The nas parallel benchmarks 2.0. Technical Report Report NAS-95-020, NASA Ames Research Center, Numerical Aerodynamic Simulation Facility, December 1995. 90, 101 [4] T. D. Braun, H. J. Siegel, N. Beck, L. Bo”lo”ni, M. Maheswaran, A. I. Reuther, J. P. R., M. D. Theys, B. Yao, D. A. Hensgen, and R. F. Freund. A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems. Journal of Parallel and Distributed Computing, 61:810–837, 2001. 87, 88 [5] H. Casanova, G. Obertelli, F. Berman, and R. Wolski. The AppLeS Parameter Sweep Template: User-Level Middleware for the Grid. In Supercomputing, November 2000. 101 [6] S. H. Chiang and M. K. Vernon. Production job scheduling for parallel shared memory systems. In Proceedings of International Parallel and Distributed Processing Symposium, 2002. 88 [7] H. Dail, H. Casanova, and F. Berman. A Decoupled Scheduling Approach for the GrADS Environment. In Proceedings of Supercomputing, November 2002. 101 [8] C. Ernemann, V. Hamscher, and R. Yahyapour. Economic scheduling in grid computing. In 8th Workshop on Job Scheduling Strategies for Parallel Processing, July 2002. in conjunction with the High Performance Distributed Computing Symposium (HPDC ’02). 101 [9] C. Ernemann, V.Hamscher, U. Schwiegelshohn, R. Yahyapour, and A. Streit. On advantages of grid computing for parallel job scheduling. 101 [10] D. Feitelson. Workshops on job scheduling strategies for parallel processing. www.cs.huji.ac.il/ feit/parsched/. 87 [11] D. Feitelson and M. Jette. Improved utilization and responsiveness with gang scheduling. In 3rd Workshop on Job Scheduling Strategies for Parallel Processing, number 1291 in LNCS, pages 238–261, 1997. 88 [12] D. G. Feitelson. Logs of real parallel workloads from production systems. URL: http://www.cs.huji.ac.il/labs/parallel/workload/. 89 [13] D. G. Feitelson, L. Rudolph, U. Schwiegelshohn, K. C. Sevcik, and P. Wong. Theory and practice in parallel job scheduling. In Dror G. Feitelson and Larry Rudolph, editors, Job Scheduling Strategies for Parallel Processing, pages 1–34. Springer Verlag, 1997. Lect. Notes Comput. Sci. vol. 1291. 89

Scheduling of Parallel Jobs in a Heterogeneous Multi-site Environment

103

[14] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. In Intl. J. Supercomputer Applications, volume 11, pages 115–128, 1997. 100 [15] J. Frey, T. Tannenbaum, M. Livny, I. Foster, and S.Tuecke. Condor-g: A computation management agent for multi-institutional grids. In Proc. Intl. Symp. On High Performance Distributed Computing, 2001. 100 [16] J. Gehring and T. Preiss. Scheduling a metacomputer with uncooperative subschedulers. In In Proc. JSSPP, pages 179–201, 1999. 101 [17] J. Gehring and A. Reinefeld. MARS - A Framework for Minimizing the Job Execution Time in a Metacomputing Environment. In Future Generation Computer Systems – 12, volume 1, pages 87–90, 1996. 101 [18] J. Gehring and A. Streit. Robust resource management for metacomputers. In High Performance Distributed Computing, pages 105–111, 2000. 101 [19] A. S. Grimshaw, W. A. Wulf, and the Legion team. The legion vision of a worldwide computer. Communications of the ACM, pages 39–45, January 1997. 100 [20] V. Hamscher, U. Schwiegelshohn, A. Streit, and R. Yahyapour. Evaluation of job-scheduling strategies for grid computing. In Proc. Grid ’00, pages 191–202, 2000. 101 [21] P. Holenarsipur, V. Yarmolenko, J. Duato, D. K. Panda, and P. Sadayappan. Characterization and enhancement of static mapping heuristics for heterogeneous systems. In Intl. Conf. On High-Performance Computing, December 2000. 87, 88 [22] H. A. James, K. A. Hawick, , and P. D. Coddington. Scheduling independent tasks on metacomputing systems. In Parallel and Distributed Systems, 1999. 101 [23] J. P. Jones and B. Nitzberg. Scheduling for parallel supercomputing: A historical perspective of achievable utilization. In 5th Workshop on Job Scheduling Strategies for Parallel Processing, 1999. 89 [24] M. Romberg. The UNICORE architecture: Seamless Access to Distributed Resources. In HPDC ’99, pages 287–293, 1999. 100 [25] W. Saphir, A. Woo, and M. Yarrow. The NAS Parallel Benchmarks 2.1 Results. Technical Report NAS-96-010, NASA, http://www.nas.nasa.gov/NAS /NPB/Reports/NAS-96-010.ps, August 1996. 90 [26] J. Skovira, W. Chan, H. Zhou, and D. Lifka. The EASY - LoadLeveler API project. In Dror G. Feitelson and Larry Rudolph, editors, Job Scheduling Strategies for Parallel Processing, pages 41–47. Springer-Verlag, 1996. Lect. Notes Comput. Sci. vol. 1162. 88, 89 [27] Q. Snell, M. Clement, D. Jackson, , and C. Gregory. The performance impact of advance reservation meta-scheduling. In D. G. Feitelson and L. Rudolph, editors, Workshop on Job Scheduling Strategies for Parallel Processing, volume 1911 of Lecture Notes in Computer Science. Springer-Verlag, 2000. 101 [28] S. Srinivasan, R. Kettimuthu, V. Subramani, and P. Sadayappan. Characterization of backfilling strategies for job scheduling. In 2002 Intl. Workshops on Parallel Processing, August 2002. held in conjunction with the 2002 Intl. Conf. on Parallel Processing, ICPP 2002. 94 [29] S. Srinivasan, R. Kettimuthu, V. Subramani, and P. Sadayappan. Selective reservation strategies for backfill job scheduling. In 8th Workshop on Job Scheduling Strategies for Parallel Processing, July 2002. 88 [30] V. Subramani, R. Kettimuthu, S. Srinivasan, and P. Sadayappan. Distributed job scheduling on computational grids using multiple simultaneous requests. In Proceedings of the 11th High Performance Distributed Computing Conference, 2002. 89, 90, 101

104

Gerald Sabin et al.

[31] D. Talby and D. Feitelson. Supporting priorities and improving utilization of the ibm sp scheduler using slack-based backfilling. In Proceedings of the 13th International Parallel Processing Symposium, 1999. 88 [32] S. S. Vadhiyar and J. J. Dongarra. A metascheduler for the grid. In 11-th IEEE Symposium on High Performance Distributed Computing, July 2002. 101 [33] J. Weissman and A. Grimshaw. A framework for partitioning parallel computations in heterogeneous environments. Concurrency: Practice and Experience, 7(5), August 1995. 88 [34] V. Yarmolenko, J. Duato, D. K. Panda, and P. Sadayappan. Characterization and enhancement of dynamic mapping heuristics for heterogeneous systems. In ICPP 2000 Workshop on Network-Based Computing, August 2000. 87, 88 [35] S. Zhou, X. Zheng, J. Wang, , and P. Delisle. Utopia: A load sharing facility for large heterogeneous distributed computer systems. Software - Practice and Experience (SPE), December 1993. 101

A Measurement-Based Simulation Study of Processor Co-allocation in Multicluster Systems S. Banen, A.I.D. Bucur, and D.H.J. Epema Faculty of Information Technology and Systems Delft University of Technology, P.O. Box 5031, 2600 GA Delft, The Netherlands {a.i.d.bucur,d.h.j.epema}@its.tudelft.nl

Abstract. In systems consisting of multiple clusters of processors interconnected by relatively slow network connections such as our Distributed ASCI Supercomputer (DAS), applications may benefit from the availability of processors in multiple clusters. However, the performance of single-application multicluster execution may be degraded due to the slow wide-area links. In addition, scheduling policies for such systems have to deal with more restrictions than schedulers for single clusters in that every component of a job has to fit in separate clusters. In this paper we present a measurement study of the total runtime of two applications, and of the communication time of one of them, both on single clusters and on multicluster systems. In addition, we perform simulations of several multicluster scheduling policies based on our measurement results. Our results show that in many cases, restricted forms of co-allocation in multiclusters have better performance than not allowing co-allocation at all.

1

Introduction

Over the last decade, clusters and distributed-memory multiprocessors consisting of hundreds or thousands of standard CPUs have become very popular. Compared to single-cluster systems, multicluster systems consisting of multiple, geographically distributed clusters interconnected by a relatively slow wide-area network can provide a larger computational power. Instead of smaller groups of users with exclusive access to their single clusters, larger groups of users can share the multicluster, potentially leading to lower turn-around times and a higher utilization, and making larger job sizes possible. One such multicluster system is the Distributed ASCI Supercomputer (DAS) [1], which was designed and deployed by the Dutch Advanced School for Computing and Imaging (ASCI) in the Netherlands. The possibility of creating multiclusters fits with the recent interest in computational and data GRIDs [2, 3], in which it is envisioned that applications can access resources (hardware resources such as processors, memory, and special instruments, but also data resources) in many different locations at the same time to accomplish their goals. There are two potential problems when employing multicluster systems. First, applications may not be suitable for multicluster execution because D. Feitelson, L. Rudolph, W. Schwiegelshohn (Eds.): JSSPP 2003, LNCS 2862, pp. 105–128, 2003. c Springer-Verlag Berlin Heidelberg 2003 

106

S. Banen et al.

they can not deal very well with the slow wide-area links. Second, scheduling a multicomponent application across a multicluster system (i.e., performing coallocation) meets with more restrictions than scheduling a job in a single cluster because now each of the components has to fit in a separate cluster. In this paper we first investigate and compare the total runtimes of single-cluster and multicluster execution of two parallel applications modeling physical phenomena by performing measurements on the DAS. Our main conclusion is that both applications, with appropriate parameter settings in one of them, are very well suited for multicluster operation. Subsequently, we assess the performance of several scheduling policies for co-allocation in multiclusters with simulations using the runtime measurements. We have also performed detailed measurements of the time spent in communication of one of the two applicatons. Because the results of these measurements are not used in our simulations, they are relegated to an appendix. In previous papers [4, 5, 6], we have assessed the influence on the mean response time of the job structure and size, the sizes of the clusters in the system, the ratio of the speeds of local and wide-area communications, and of the presence of a single or of multiple queues in the system. Also in [7], co-allocation (called multi-site computing there) is studied, with as performance metric the (average weighted) response time. There, jobs only specify a total number of processors, and are split up across the clusters. The slow wide-area communication is accounted for by a factor r by which the total execution times are multiplied. Co-allocation is compared to keeping jobs local and to only sharing load among the clusters, assuming that all jobs fit in a single cluster. One of the most important findings in [7] is that for r less than or equal to 1.25, it pays to use co-allocation. In [8], we consider the maximal utilization, i.e., the utilization at which the system becomes saturated, as a performance metric. Our five-cluster second-generation Distributed ASCI Supercomputer (DAS) [1, 9] (and its predecessor), which was an important motivation for this work, was designed to assess the feasibility of running parallel applications across widearea systems [10, 11, 12]. In the most general setting, grid resources are very heterogeneous; in this paper we restrict ourselves to homogeneous multicluster systems such as the DAS. Showing the viability of co-allocation in such systems may be regarded as a first step in assessing the benefit of co-allocation in more general grid environments.

2

The System Model

In this section we describe our model of multicluster systems and the scheduling policies we will evaluate. 2.1

The Distributed ASCI Supercomputer

The DAS (in fact the DAS2, the second-generation system which was installed at the end of 2001 when the first-generation DAS1 system was discontinued)

A Measurement-Based Simulation Study of Processor Co-allocation

107

is a wide-area computer system consisting of five clusters (one at each of five universities in The Netherlands, amongst which Delft University of Technology) of dual-processor nodes, one with 72, the other four with 32 nodes each. Each node contains two 1-Ghz Pentium-IIIs and at least 1GB RAM. The clusters are interconnected by the Dutch university backbone for wide-area communications (100 Mbit/s), while for local communications inside the clusters Myrinet LANs are used (1,200 Mbit/s). The system was designed for research on parallel and distributed computing. On single DAS clusters the PBS [13] scheduler is used, while jobs spanning multiple clusters can be submitted with Globus [14]. The current version of Globus is unable to use the fast local DAS interconnect (Myrinet); all Globus communication goes over TCP/IP sockets (this problem will be solved in the near future). 2.2

The Structure of the System

We model a multicluster system consisting of C clusters of processors, of possibly different sizes. We assume that all processors have the same service rate. By a job we understand a parallel application requiring some number of processors, possibly in multiple clusters (co-allocation). Jobs are rigid, so the numbers of processors requested by and allocated to a job are fixed. We call a task the part of a job that runs on a single processor. We assume that jobs only request processors and we do not include in the model other types of resources. 2.3

The Structure of Job Requests and the Placement Policies

Jobs that require co-allocation have to specify the number and the sizes of their components, i.e., of the sets of tasks that have to go to the separate clusters. A job is represented by a tuple of C values, at least one of which is strictly positive. We consider unordered requests, for which the components of the tuple specify the numbers of processors the job requires in the separate clusters, allowing the scheduler to choose the clusters for the components. Such requests model applications like FFT, where tasks in the same job component share data and need intensive communication, while tasks from different components exchange little or no information. To determine whether an unordered request fits, we try to schedule its components in decreasing order of their sizes on distinct clusters. We use Worst Fit (WF) to place the components on clusters. 2.4

The Scheduling Policies

In a multicluster system where co-allocation is used, jobs can be either singlecomponent or multi-component, and in a general case both types are simultaneously present in the system. A scheduler dealing with the first type of jobs can be local to a cluster and does not need any knowledge about the rest of the system. For multi-component jobs, the scheduler needs global information for its decisions.

108

S. Banen et al.

Treating both types of jobs equally or keeping single-component jobs local and scheduling only multi-component jobs globally over the entire system, having a single global scheduler or schedulers local to each cluster, all these are decisions that influence the performance of the system. In [6] we have studied several policies, some of which with multiple variations; in this paper we consider the following approaches: 1. [GS] The system has one global scheduler with one global queue, for both single- and multi-component jobs. All jobs are submitted to the global queue. The global scheduler knows at any moment the number of idle processors in each cluster and based on this information chooses the clusters for each job. 2. [LS] Each cluster has its own local scheduler with a local queue. All queues receive both single- and multi-component jobs and each local scheduler has global knowledge about the numbers of idle processors. However, singlecomponent jobs are scheduled only on the local cluster. The multi-component jobs are co-allocated over the entire system. When scheduling is performed all enabled queues are repeatedly visited, and in each round at most one job from each queue is started. When the job at the head of a queue does not fit, the queue is disabled until the next job departs from the system. At each job departure the queues are enabled in the same order in which they were disabled. 3. [LP] The system has both a global scheduler with a global queue, and local schedulers with local queues. Multi-component jobs go to the global queue and are scheduled by the global scheduler using co-allocation over the entire system. Single-component jobs are placed in one of the local queues and are scheduled by the local scheduler only on its corresponding cluster. The local schedulers have priority: the global scheduler can schedule jobs only when at least one local queue is empty. When a job departs, if one or more of the local queues are empty both the global queue and the local queues are enabled. If no local queue is empty only the local queues are enabled and repeatedly visited; the global queue is enabled and added to the list of queues which are visited when at least one of the local queues gets empty. When both the global queue and the local queues are enabled at job departures, they are always enabled starting with the global queue. The order in which the local queues are enabled does not matter since the jobs in them are only started on the local clusters. In all the cases considered, both the local and the global schedulers use the FCFS policy to choose the next job to run.

3

The Applications

In this section we describe the two applications for which we will perform measurements on the DAS.

A Measurement-Based Simulation Study of Processor Co-allocation

3.1

109

The Ensflow Application

The Ensflow application [15] uses the data-assimilation technique to understand the evolution of streams and eddies in the ocean near the southern tip of Africa. In this technique, information from observations of the system is combined with information on the evolution of the system obtained from an implementation of the laws of physics. This can be done by using ensemble models that do not calculate the evolution of a single state but rather of a large number (an ensemble, typically 50-500) of different states (ensemble members). In our case there are 60 ensemble members that evolve for a period of 20 days with a time step of 24 hours. Every 240 hourly time steps, an analysis and an update of the ensemble members are done to obtain the optimal estimate for the past period. Each of the ensemble members evolves independently of the others during the time between analysis and update. The sequence of ensemble averages over time describes the development of the ocean’s currents best fitting the observations. The application has the following structure: /*--------initialisation--------*/ initiate 60 ensembles; /*--------start main loop-------*/ while time < stop_time /* computation */ evolve the 60 ensembles; if (time = time_to_analyse) /* computation + communication */ analyse and update ensembles; endif endwhile /*---------end main loop--------*/

The main loop is executed 20 times, with two data adjustments. Only during the data adjustment phase (analysis and update ensembles) data are exchanged (using MPI). The data of the ensemble members are local to the processors, and the ensemble members are distributed evenly over the processors. To avoid processors from being unnecessarily idle, we choose the number of processors such that the number of ensemble members is an exact multiple of it. In [15], the Ensflow application is described in more detail, and measurements of the total runtime on two multiprocessors are presented. 3.2

The Poisson Application

Our Poisson application implements a parallel iterative algorithm to find a discrete approximation to the solution of the two-dimensional Poisson equation (a second-order differential equation governing steady-state heat flow in a twodimensional domain) on the unit square. For the discretization, a uniform grid of points in the unit square with a constant step in both directions is considered.

110

S. Banen et al.

The application uses a red-black Gauss-Seidel scheme (see for instance [16], pp. 429–433), for which the grid is split up into ”black” and ”red” points, with every red point having only black neighbours and vice versa. In every iteration, each grid point has its value updated as a function of its previous value and the values of its neighbours, and all points of one colour are visited first followed by the ones of the other colour. The application, which is implemented in MPI, has the following structure: /*--------initialisation--------*/ if proc_index = 0 read the initial data; /* communication */ broadcast data to all the processes; endif /*--------start main loop-------*/ while global-error => limit /* computation */ update black points; update red points; /* communication */ exchange borders with neighbours; /* communication + synchronization */ collect/distribute global error; endwhile /*---------end main loop--------*/

The domain of the problem is split up into a two-dimensional pattern of rectangles of equal size among the participating processes. In our experiments, we assign only one process to a processor. A way of splitting up the domain is called a process(or) configuration, and is indicated by h×v, with h, v the numbers of processes in the horizontal and vertical directions, respectively. In Sect. 4 we will consider the numbers of processors and the processor configurations as shown in Table 1. Every process communicates with each of its neighbours in order to exchange the values of the grid points on the borders and to compute a global stopping criterion. Exchanging borders takes place in four consecutive steps; first all com-

Table 1. The processor configurations used in our measurements total number processor of processors configuration 8 4x2 16 4x4 32 8x4 64 8x8

A Measurement-Based Simulation Study of Processor Co-allocation 3 2 1 0

7 6 5 4

11 10 9 8

111

15 14 13 12

Fig. 1. The process grid for the Poisson application for process configuration 4x4 divided over two clusters (left–right) munication in the direction top is performed, and then in the directions bottom, left and right. The amount of communication depends on the size of the grid, the number of participating processes, and the initial data. When we execute the Poisson application on multiple clusters, the process grid is split up into adjacent vertical strips of equal width, with each cluster running an equal consecutive number of processes (we assume processes to be numbered in column-major order). For instance, for process configuration 4x4 and two clusters, the processes are split up as depicted in Fig. 1. Here, processors 4–11 have to exchange border information with processors in the other cluster.

4

Runtime Measurements

In this section we present the results of the measurements of our two applications on the DAS. We use Globus for submitting multicomponent jobs to the DAS. In all of our experiments, the jobs always have components of equal size. Since Globus is currently unable to use the fast local DAS interconnect (Myrinet) but uses the slower local Ethernet instead, we employ both PBS and Globus for running the applications in a single DAS cluster. The PBS measurements yield the best performance of single-cluster operation, but the single-cluster Globus measurements make for a fairer comparison with the multicluster results. Measurements with Globus on a system with C clusters are labeled with Globus-C. 4.1

Total Runtime of the Ensflow Application

For an investigation of the total runtime we ran the Ensflow application once for different numbers of processors and clusters. The results of the measurements are presented in Fig. 2. The gaps for 15 processors and Globus-2 and Globus-4, for 20 processors and Globus-3, and for 30 processors and Globus-4 are due to the fact that then we cannot have equal-size job components. The gap for Globus-1 with 60 processors is caused by the limitation of 32 processors in a single cluster when using Globus. We find that the performance of multicluster execution for all numbers of clusters considered compared to single-cluster execution is very good for this application. In addition, the speedup is quite reasonable. Relative to the 12-processor case, the efficiency slowly decreases to about 0.7 for 60 processors. The explanation of the good performance of multicluster execution is that this application has a relatively small communication component.

112

S. Banen et al.

PBS

Globus-1

Globus-2

Globus-3

Globus-4

4000 3000 2000 1000 0 12

15

20

30

60

number of processors

Fig. 2. The total runtime of the Ensflow application (in seconds) for different numbers of processors and clusters. (No data when the number of processors is not a multiple of the number of clusters, and for 60 processors with Globus-1.)

4.2

Total Runtime of the Poisson Application

For a first investigation of the total runtime of the Poisson application, we ran the application once varying the grid size, the total number of processors (see Table 1 for the corresponding processor configurations), and the number of clusters. In addition to the total runtime, we also record the number of iterations needed to reach convergence. The results of the measurements are presented in Table 2, and graphically in Fig. 3. (Because of the numbers of processors we consider, we cannot use three clusters.) Again there are gaps for Globus-1, this time for 64 processors in a single cluster, for the same reason as above. We find that for a very small grid size, the runtime may increase considerably when using more clusters. However, for a large grid size, the performance of multicluster execution compared to single-cluster execution is quite reasonable. Since the processor configuration influences the number of iterations needed to reach convergence (which determines the total runtime), it is difficult to make a general statement about the speedup. In particular for grid sizes 1000x1000 and 2000x2000, the number of iterations is very variable. However, for grid size 4000x4000 the number of iterations is almost constant, and the speedup when going from 8 to 64 processors for PBS, Globus-2, and Globus-4 is 6.5, 6.0, and 5.8, respectively. For a further investigation of the total runtime we now fix the processor configuration to 4x4, and we add a few grid sizes. For every set-up (grid size and number of clusters) we ran the application ten times. The results of the measurements (minimum, average, and maximum) are presented in Table 3. For a better comparison, we depict in Fig. 4 for every grid size the (average) runtimes relative to the (average) single-cluster PBS runtimes (which are normalized to 1). It is clear that for large grid sizes, this application is well suited for multicluster execution. The explanation is that the two major components of the total runtime, the time for updating all grid points (computation) and the time

A Measurement-Based Simulation Study of Processor Co-allocation

113

Table 2. The number of iterations and the total runtime (in seconds) of the Poisson application for different grid sizes and numbers of processors and clusters grid size 100 x 100

1000 x 1000

2000 x 2000

4000 x 4000

total number number of of processors iterations 8 2436 16 2132 32 2158 64 2429 8 2630 16 4347 32 4356 64 2650 8 2630 16 4387 32 4387 64 2650 8 2630 16 2644 32 2651 64 2650

PBS Globus-1 Globus-2 Globus-4 0.74 0.74 0.93 1.21 70.9 60.2 34.3 8.1 291 265 134 46.8 1230 649 357 188

3.23 3.59 4.54 — 86.6 78.6 46.7 — 335 292 161 — 1277 725 371 —

11.5 12.1 17.4 24.2 109 119 68.8 30.7 358 339 193 80.4 1390 766 402 231

15.0 11.8 17.4 21.1 114 125 67.4 31.7 365 332 191 85.1 1463 767 440 251

for exchanging border grid points (communication), increase in a different way when the grid size increases. When the total number of grid points increases with a factor g, the number of grid points to be exchanged increases with a factor √ g. Since communication is the component that causes the poor performance of multicluster execution, it is to be expected that for larger grid sizes (with relatively smaller communication components), multicluster execution performs relatively better.

5

Performance Evaluation of the Scheduling Policies

In this section we assess the performance of the multicluster scheduling policies introduced in Sect. 2.4 with simulations for several workloads differentiated by the numbers of components into which jobs are split and by the percentages of jobs running each of the two applications introduced in Sect. 3. The simulations are for a multicluster with 4 clusters of 32 processors each. The simulation programs were implemented using the CSIM simulation package [17]. We will present our simulation results in terms of response time as a function of the utilization. We define the gross utilization as the utilization computed from the actual service times experienced by jobs, which for multicomponent jobs includes the time spent in the slow wide-area communication. The net utilization is defined as the utilization computed from the single-cluster service times of jobs of the same total size, which gives a measure of the throughput of the system.

114

S. Banen et al.

grid size 100x100 PBS

Globus-1

grid size 1000x1000

Globus-2

Globus-4

25

PBS

Globus-1

Globus-2

Globus-4

140 120 100 80 60 40 20 0

20 15 10 5 0 8

16

32

64

8

number of processors

16

32

64

number of processors

grid size 2000x2000 PBS

Globus-1

Globus-2

grid size 4000x4000

Globus-4 PBS

Globus-1

Globus-2

Globus-4

400 2000

300

1500

200 1000

100

500

0

0

8

16

32

number of processors

64

8

16

32

64

number of processors

Fig. 3. The total runtime of the Poisson application (in seconds) for different grid sizes and numbers of processors and clusters. (No data for 64 processors with Globus-1.) When there is no co-allocation, there is no wide-area communication and the net utilization is equal to the gross utilization. In this section we only look at the gross utilization and depict the response time as a function of this utilization, because that is a fair basis for comparing the policies. In Sect. 5.1 we present the workloads in the simulations. Section 5.2 discusses the influence of the numbers and sizes of the job components on the performance, while in Sect. 5.3 the benefits and disadvantages of co-allocation are discussed, compared to a system without co-allocation. In Sect. 5.4 we make a general comparison of the policies. Section 6 compares for all the policies and workloads the gross and the net utilization, which shows how efficient the global applications use the gross utilization offered. 5.1

The Workloads

Each of the jobs in the simulated workload is supposed to run one of our two applications; in the case of the Poisson application, we assume the grid size to

A Measurement-Based Simulation Study of Processor Co-allocation

115

Table 3. The total runtime (minimum, average, and maximum) of the Poisson application (in seconds) for processor configuration 4x4 for different grid sizes and numbers of clusters grid size 50 100 200 400 1000 2000 4000 10000

x x x x x x x x

min. 50 0.22 100 0.65 200 1.73 400 4.67 1000 60.7 2000 248 4000 701 10000 3734

PBS avg. max. 0.23 0.29 0.72 0.77 1.83 1.88 4.95 5.72 63.7 68.3 257 274 706 712 3841 3948

Globus-1 min. avg. max. 1.35 1.60 2.28 3.35 4.12 6.51 6.55 6.87 8.01 12.4 12.8 13.6 78.4 78.9 79.4 291 296 310 720 733 766 3878 3960 4078

Globus-2 min. avg. max. 5.93 6.29 6.86 14.7 15.3 16.7 26.4 27.6 30.0 32.0 36.5 38.7 101 105 108 306 309 311 743 750 757 4012 4081 4160

Globus-4 min. avg. max. 6.12 7.62 11.4 14.3 16.7 22.8 24.0 26.5 33.9 28.6 30.8 39.8 103 107 118 306 323 349 728 751 794 4215 4235 4285

be 4000x4000. We assess three cases: 100% of the jobs in the system run the Poisson application, 100% of the jobs run the Ensflow application, and each of the two applications is represented by 50% of the jobs in the system. Tables 4 and 5 display the execution times measured on the DAS for the two applications in the several configurations that we are using in the simulations. These values are the same as the ones depicted in Fig. 2, and in Fig. 3 for grid size 4000x4000; for a single cluster we use the PBS runtimes. We assume the interarrival times to be exponentially distributed. Jobs are split up in different ways, but their components are always of equal size, and we also keep the percentages of jobs for each total size always equal. For the same total size, the various splitting choices admitted in the system receive equal probabilities. We compare a no co-allocation case, when only single-component jobs are admitted, to several co-allocation cases. We define the following co-allocation rules: 1. [no] There are only single-component jobs, co-allocation is not allowed. 2. [co] Both single- and multi-component jobs are allowed, without restrictions on the sizes of job components and the numbers of components.

Table 4. The execution times (in seconds) for the Poisson application, depending on the total job size and the number of components, used in the simulations Total job size Number of job components 1 2 4 8 1230.0 1390.0 — 16 649.0 766.0 767.0 32 357.0 402.0 440.0

116

S. Banen et al.

PBS

Globus-1

Globus-2

Globus-4

35 30 25 20 15 10 5 0 50

100

200

400

1000

2000

4000

10000

grid size (one side)

Fig. 4. The total runtime of the Poisson application (in seconds) for processor configuration 4x4 for different grid sizes and numbers of clusters, normalized with respect to PBS

3. [rco] Both single- and multi-component jobs are allowed, but the jobcomponent sizes are restricted to half of the clusters’ sizes. 4. [fco] Both single- and multi-component jobs. The job-component sizes are restricted to half of the clusters’ sizes, and only multi-component jobs with two components are allowed. In Tables 6, 7, and 8 we show the resulting percentages of jobs for the numbers of components allowed for the Poisson application (here we disallow jobs of size 8 to be split into 4 components), for the Ensflow application, and for an even mix of these, respectively.

Table 5. The execution times (in seconds) for the Ensflow application, depending on the total job size and the number of components, used in the simulations Total job size Number of job components 1 2 3 4 12 3485.0 3494.0 3504.0 3507.0 15 2836.0 — 2884.0 — 20 1935.0 2207.0 — 2155.0 30 1563.0 1541.0 1584.0 —

A Measurement-Based Simulation Study of Processor Co-allocation

117

Table 6. The percentages of jobs with different numbers of components for the four job compositions for the Poisson application Total job size 8 16 32

Number of job components [no] [co] [rco] [fco] 1 1 2 4 1 2 4 1 2 33.34% 16.67% 16.67% — 16.67% 16.67% — 16.67% 16.67% 33.33% 11.11% 11.11% 11.11% 11.11% 11.11% 11.11% 16.665% 16.665% 33.33% 11.11% 11.11% 11.11% 0.0% 16.665% 16.665% 0.0% 33.33%

Table 7. The percentages of jobs with different numbers of components for the four co-allocation rules for the Ensflow application Total job size 12 15 20 30

[no] Number of job components 1 2 3 4 25.0% 0.0% 0.0% 0.0% 25.0% — 0.0% — 25.0% 0.0% — 0.0% 25.0% 0.0% 0.0% —

Total job size 12 15 20 30

[rco] Number of job components 1 2 3 4 6.25% 6.25% 6.25% 6.25% 12.5% — 12.5% — 8.34% 8.33% — 8.33% 8.34% 8.33% 8.33% —

Total job size 12 15 20 30

[co] Number of job components 1 2 3 4 6.25% 6.25% 6.25% 6.25% 12.5% — 12.5% — 0.0% 12.5% — 12.5% 0.0% 12.5% 12.5% —

Total job size 12 15 20 30

[fco] Number of job components 1 2 3 4 12.5% 12.5% 0.0% 0.0% 25.0% — 0.0% — 0.0% 25.0% — 0.0% 0.0% 25.0% 0.0% —

5.2

The Influence of the Numbers and Sizes of the Job Components

In Fig. 5 we show the response time as a function of the (gross) utilization for the three job mixes, the three scheduling policies, and the four co-allocation rules. (In Fig. 5 and in all subsequent figures, the legends are in the right-to-left order of the curves, and the average response time is in seconds.) Because our two applications have very different service times, we assess the performance more in terms of the point where the system saturates (where the reponse-time curves rise very steeply) than in terms of the actual reponse times. The performance is the best for the Poisson application; a reason for this is that in that case all the job sizes are also powers of two, like the clusters’ sizes, which makes them fit better in the system. For the Ensflow application the utilization achieved is worse because of the job sizes, which in most combinations add up in a way that leaves more idle processors in the system than in the case of the Poisson application. For all policies and co-allocation rules considered the worst performance is displayed

Table 8. The percentages of jobs with different numbers of components for the four co-allocation rules and a mix of the Poisson and Ensflow applications in equal proportions

  [no]             Number of job components
  Total job size      1        2        3        4
       8           16.67%    0.0%      —        —
      16           16.67%    0.0%      —       0.0%
      32           16.66%    0.0%      —       0.0%
      12           12.5%     0.0%     0.0%     0.0%
      15           12.5%      —       0.0%      —
      20           12.5%     0.0%      —       0.0%
      30           12.5%     0.0%     0.0%      —

  [co]             Number of job components
  Total job size      1        2        3        4
       8            8.335%   8.335%    —        —
      16            5.557%   5.557%    —       5.556%
      32            5.554%   5.553%    —       5.553%
      12            3.125%   3.125%   3.125%   3.125%
      15            6.25%     —       6.25%     —
      20            0.0%     6.25%     —       6.25%
      30            0.0%     6.25%    6.25%     —

  [rco]            Number of job components
  Total job size      1        2        3        4
       8            8.335%   8.335%    —        —
      16            5.557%   5.557%    —       5.556%
      32            0.0%     8.33%     —       8.33%
      12            3.125%   3.125%   3.125%   3.125%
      15            6.25%     —       6.25%     —
      20            4.167%   4.167%    —       4.166%
      30            4.167%   4.167%   4.166%    —

  [fco]            Number of job components
  Total job size      1        2        3        4
       8            8.335%   8.335%    —        —
      16            8.335%   8.335%    —       0.0%
      32            0.0%    16.66%     —       0.0%
      12            6.25%    6.25%    0.0%     0.0%
      15           12.5%      —       0.0%      —
      20            0.0%    12.5%      —       0.0%
      30            0.0%    12.5%     0.0%      —

For all policies and co-allocation rules considered, the worst performance is displayed by the mix of the two applications, where the different sizes of jobs are even more difficult to fit in an efficient way.

In all the graphs in Fig. 5 the [co] co-allocation rule yields the poorest performance. This shows that although in general co-allocation provides more flexibility in placing jobs on the system, jobs with conflicting requirements can make the performance worse than in the absence of co-allocation. The bad performance is due to the simultaneous presence in the system of large single-component jobs, using (almost) entire clusters, and of jobs with many components, even equal to the number of clusters, in which case each of the clusters has to have enough room to accommodate a job component. Possible improvements are to restrict the maximum size of job components and to limit the number of components of multi-component jobs. The [rco] co-allocation rule includes the first restriction, while [fco] includes both. The graphs show that in all the cases considered, imposing these restrictions significantly improves the performance. For LS, the performance for both the [rco] and [fco] cases proves to be much better than for the no-co-allocation case. The same result holds for LP. When there are only single-component jobs LP becomes LS, and that is why for the no-co-allocation case with LP the curve for LS is depicted.

Fig. 5. The performance of GS, LS, and LP (top–bottom) for the Poisson application, the Ensflow application, and a mix of the two in equal proportions (left–right), for the four co-allocation rules (legends are in right-to-left order of the curves; for GS, the curves for [fco] and [no] are nearly indistinguishable)

For GS, co-allocation does not enhance the performance: it maintains or only slightly improves it under the [fco] restrictions (large jobs are always split and at most two components are allowed), and deteriorates it in the other cases. For GS, the advantage of the additional flexibility brought by co-allocation does not compensate for the disadvantage of the longer service times due to inter-cluster communication. The GS policy does not restrict single-component jobs to the local clusters, which makes its performance in the absence of communication rather good. Jobs are scheduled in FCFS order from the single queue, and the greater freedom in spreading jobs over the clusters introduced by co-allocation is not exploited enough.

5.3 Co-allocation versus No Co-allocation

As Fig. 5 shows, in a large number of cases co-allocation can enhance the performance of a multicluster system, but it is necessary to avoid the simultaneous presence in the system of jobs with conflicting requirements. In [18] we have shown that large single-component jobs and jobs with many components deteriorate the performance. Moreover, combining such jobs makes it even worse, which is also confirmed by Fig. 5.

For LS and LP it is enough to avoid large single-cluster jobs to make co-allocation worthwhile. Since LS stores multi-component jobs in all local queues, it provides (compared to the other policies) more flexibility and a larger choice (any of the jobs at the top of the queues) at each moment when a scheduling decision has to be taken. This is why avoiding jobs with many components does not influence the performance much. LP keeps all multi-component jobs in the global queue, and the jobs with many components, which are more difficult to fit, impact the performance more. This can be concluded from the significant improvement brought by the [fco] restrictions compared to the [rco] ones. GS, as mentioned before, has good performance in the absence of communication due to the fact that it can run jobs from the single queue on any of the clusters. However, the same single queue makes co-allocation without restrictions ([co]) perform poorly, and only when both the numbers and the sizes of the components are restricted ([fco]) is co-allocation an advantage.

5.4 Comparing the Policies

In this section we compare the three policies defined, for the three application mixes and the different co-allocation rules. From Fig. 5 we conclude that, independent of the application mix, LS provides the best results for the co-allocation cases. When there are only single-component jobs the performance of GS is better. LP becomes LS when there are just single-component jobs, so the performance of the two policies is the same in the absence of co-allocation. With the [rco] restrictions LS displays much better results than LP, the difference between the two policies being that LP keeps all multi-component jobs in a single queue. This relates to our observation for GS that when there is a single queue for multi-component jobs, those with many components are hard to fit and have a strong negative impact on performance. GS is better for the single-component jobs, but once multi-component jobs are allowed, the extra queue for the global jobs in LP and the spreading of the global jobs among the local queues in the case of LS bring enough benefit to allow those policies to outperform GS.

Comparing all the cases considered, we conclude that the best results are displayed by LP and LS with the [fco] restrictions. The similar performance of LS and LP in that case shows that, for those sizes and numbers of job components, having a separate queue for the multi-component jobs is enough, and the backfilling effect with a window equal to the number of clusters induced by LS does not bring extra improvements.

6 Gross versus Net Utilization

In Sect. 5 we have studied the average response time as a function of the gross utilization. In this section we discuss the difference between gross and net utilization, and quantify this difference for the cases considered in Sect. 5.

Fig. 6. The response time as a function of the gross and the net utilization for the GS policy, the three application mixes, and the three co-allocation rules that allow co-allocation

We have defined the net and the gross utilization based on the job service times in single clusters with fast local communication, and on the longer service times displayed by multi-component jobs running the same application on multiple clusters (thus using slow inter-cluster communication), respectively. The difference between these utilizations is the capacity lost internally in multi-component jobs due to slow wide-area links. This internal capacity loss might be reduced by restructuring applications [12] or by having them use (collective) communication operations optimized for wide-area systems [11].

The performance of a multicluster policy may look good when considering the response time as a function of the gross utilization, but when there is much internal capacity loss, the performance as a function of the net utilization (or of the throughput) may be poor. This "real" performance of a multicluster policy would improve with more efficient applications or with faster global communication.

In Figs. 6, 7, and 8 we depict the average response time for our three policies, for the three application mixes, and for the different ways of co-allocation studied, as a function of both the gross and the net utilization. To assess the difference between the two utilizations at a certain response time, one should compare the graphs in the horizontal direction.


Fig. 7. The response time as a function of the gross and the net utilization for the LS policy, the three application mixes, and the three co-allocation rules that allow co-allocation

Of course, for the same workload (defined by the arrival rate, and so by the net utilization), the difference between the gross and the net utilization is the same for all scheduling policies and co-allocation rules, albeit at possibly different response times. The largest difference between the gross and the net utilizations is always displayed for the Poisson application. This is an expected consequence of the fact that this application requires the largest amount of communication. Spreading the jobs running this application over more clusters also results in more wide-area communication than for the Ensflow application or the equal mix of the two applications.

For all the policies and job mixes, comparing the three co-allocation cases we observe that the largest amount of inter-cluster communication is shown for the [rco] restrictions, and the least for the [co] restrictions. By limiting the size of the single-component jobs, [rco] and [fco] increase the percentage of multi-component jobs, which brings more wide-area communication. Since it limits the number of job components, [fco] yields a lower amount of inter-cluster communication compared to [rco]. These results are also valid for the Ensflow application, even though the differences are smaller since that application requires very little communication.

Despite the significant difference in performance, for the same application mix and the same restrictions imposed for co-allocation, all three policies show very similar differences between the graphs for net and gross utilization.


Fig. 8. The response time as a function of the gross and the net utilization for the LP policy, the three application mixes, and the three co-allocation rules that allow co-allocation

In general, we could expect that policies with better performance would show more wide-area communication for the same set of jobs.
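As a rough illustration of the gap between the two utilizations, the sketch below computes the internal capacity loss of a single job from the measured service times in Table 4; the formulas are our reading of the definitions above (net demand based on the single-cluster service time, gross demand based on the actual multi-component service time), not a calculation taken from the paper.

# Internal capacity loss of a 16-processor Poisson job split into two components,
# using the measured service times from Table 4 (in seconds).
processors = 16
t_single_cluster = 649.0   # 1 component
t_two_components = 766.0   # 2 components, with slow inter-cluster communication

net_demand = processors * t_single_cluster     # processor-seconds of "useful" work
gross_demand = processors * t_two_components   # processor-seconds actually occupied

loss = (gross_demand - net_demand) / gross_demand
print(f"internal capacity loss: {loss:.1%}")   # roughly 15% for this job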

7 Conclusions

We have performed measurements of two applications on our multicluster DAS system, and we have performed simulations of three multicluster scheduling policies incorporating co-allocation.

The performance of multicluster execution of the Ensflow application is very good, which can be explained by its relatively small communication component. The Poisson application is also well suited for multicluster execution, at least for large grid sizes, for which the communication component becomes relatively small. The penalty for the slow multicluster communication can be reduced by allowing the computation and communication parts of the processes of a multicluster job to overlap. To be able to make a well-considered decision on whether to submit an application to a single cluster or across multiple clusters, it would be convenient to have a synthetic application, parameterized by the way it is split up across clusters and by its communication pattern, to simulate a range of possible applications.

Our simulations of multicluster scheduling policies show that simply allowing co-allocation without any restrictions is not a good idea for any of the policies. In all cases, one should at least limit the job-component sizes, and preferably also the number of job components. Furthermore, we found that the policies with local queues (possibly with a global queue for multi-component jobs) yield better performance than having only a single global queue.

References

[1] The Distributed ASCI Supercomputer (DAS). (www.cs.vu.nl/das2)
[2] The Global Grid Forum. (www.gridforum.org)
[3] Foster, I., Kesselman, C., eds.: The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann (1999)
[4] Bucur, A., Epema, D.: The Influence of the Structure and Sizes of Jobs on the Performance of Co-Allocation. In Feitelson, D., Rudolph, L., eds.: 6th Workshop on Job Scheduling Strategies for Parallel Processing. Volume 1911 of LNCS. Springer-Verlag (2000) 154–173
[5] Bucur, A., Epema, D.: The Influence of Communication on the Performance of Co-Allocation. In Feitelson, D., Rudolph, L., eds.: 7th Workshop on Job Scheduling Strategies for Parallel Processing. Volume 2221 of LNCS. Springer-Verlag (2001) 66–86
[6] Bucur, A., Epema, D.: Local versus Global Queues with Processor Co-Allocation in Multicluster Systems. In Feitelson, D., Rudolph, L., Schwiegelshohn, U., eds.: 8th Workshop on Job Scheduling Strategies for Parallel Processing. Volume 2537 of LNCS. Springer-Verlag (2002) 184–204
[7] Ernemann, C., Hamscher, V., Schwiegelshohn, U., Yahyapour, R., Streit, A.: On Advantages of Grid Computing for Parallel Job Scheduling. In: 2nd IEEE/ACM Int'l Symposium on Cluster Computing and the GRID (CCGrid2002). (2002) 39–46
[8] Bucur, A., Epema, D.: The Maximal Utilization of Processor Co-Allocation in Multicluster Systems. In: Proc. Int'l Parallel and Distributed Processing Symp. (IPDPS). (2003)
[9] Bal, H.E., et al.: The Distributed ASCI Supercomputer Project. ACM Operating Systems Review 34 (2000) 76–96
[10] Bal, H., Plaat, A., Bakker, M., Dozy, P., Hofman, R.: Optimizing Parallel Applications for Wide-Area Clusters. In: Proc. of the 12th Int'l Parallel Processing Symp. (1998) 784–790
[11] Kielmann, T., Hofman, R., Bal, H., Plaat, A., Bhoedjang, R.: MagPIe: MPI's Collective Communication Operations for Clustered Wide Area Systems. In: ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming. (1999) 131–140
[12] Plaat, A., Bal, H., Hofman, R., Kielmann, T.: Sensitivity of Parallel Applications to Large Differences in Bandwidth and Latency in Two-Layer Interconnects. Future Generation Computer Systems 17 (2001) 769–782
[13] The Portable Batch System. (www.openpbs.org)
[14] Globus. (www.globus.org)
[15] van Hees, F., van der Steen, A., van Leeuwen, P.: A parallel data assimilation model for oceanographic observations. Concurrency and Computation: Practice and Experience (2003, to appear)
[16] Kumar, V., Grama, A., Gupta, A., Karypis, G.: Introduction to Parallel Computing. Benjamin/Cummings (1994)
[17] Mesquite Software, Inc.: The CSIM18 Simulation Engine, User's Guide
[18] Bucur, A., Epema, D.: An Evaluation of Processor Co-Allocation for Different System Configurations and Job Structures. In: Proc. of the 14th Symp. on Computer Architecture and High Performance Computing, IEEE Computer Society Press (2002) 195–203

A Communication-Time Measurements

In this appendix we measure the communication time needed for exchanging the values of border grid points in the Poisson application with processor configuration 4x4. All numbers presented below are averages over ten runs.

A.1 Per-process Communication Time for Exchanging Borders

We measure for each individual process(or) the total time (i.e., across all iterations) it spends exchanging borders with other processes. Figure 9 contains the results for different numbers of clusters and grid sizes 100x100 and 4000x4000. As expected, for Globus-2, the processes at the edges of the clusters need more time to communicate, although for some reason some other processes also take much time to communicate. This effect is relatively much larger for grid size 100x100 than for grid size 4000x4000. For Globus-2 with grid size 100x100 we also measure for each individual process, and in each direction, the total time it spends exchanging borders with other processes; the results are presented in Fig. 10. We see that in the cross-cluster directions left and right, receiving border information takes a relatively large amount of time.

Fig. 9. The total per-process communication times for different grid sizes (in seconds)

Fig. 10. The total per-process communication times in each of the four directions for Globus-2 and grid size 100x100 (in seconds)

A.2 Synchronized and Non-synchronized Operation

In our original Poisson application, we do not synchronize processes before they start their communication phases. So, as soon as a process finishes its computation in an iteration, it starts (trying) to communicate. We added an MPI command to our application in order to enable synchronized operation. Then, all processes synchronize in every iteration before the communication starts, so they all start communicating at (about) the same time. In both synchronized and non-synchronized operation, we measure the communication time in an iteration as the time elapsed between the last process finishing its computation phase and the last process finishing its communication phase. The cause of the difference between these communication times in the two modes of operation lies in the potential parallelism of computation and communication in non-synchronized operation. In general, in this mode of operation, the communication time is smaller, as we will indeed see in Sect. A.3.

A.3 Total Communication Time for Exchanging Borders

We define the total communication time for exchanging borders as the sum of the communication times of all iterations, both for synchronized and non-synchronized operation. In Table 9, we show the minimum, average, and maximum (across ten runs) total communication time for synchronized operation; the variation is in general not very large. The difference between single-cluster and multicluster performance (and between PBS and Globus-1) is very large. Table 10 presents the (average) total communication times for exchanging borders when processes are or are not synchronized before communication. (This table contains the average results of Table 9.) In Fig. 11 we depict the average total communication times from Table 10 after normalization with respect to PBS. We find that with synchronized operation, communication in a single cluster is, depending on the grid size, 10–35 times faster than multicluster communication, while for non-synchronized operation (and realistic grid sizes), this factor is reduced to about 13. In addition, in Table 10 we see that for large grid sizes the performance of multicluster communication strongly improves when the processes are not synchronized, but that this is not the case for a single cluster with PBS.
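The slowdown factors quoted here can be recomputed from the average times in Table 10 below. The short sketch that follows does so for two representative grid sizes; the variable names are ours, and "multicluster communication" is taken to mean the Globus-2 column.

# Ratio of multicluster (Globus-2) to single-cluster (PBS) total communication time,
# using the averages from Table 10 for two grid sizes.
table10 = {
    # grid side: (PBS sync, PBS non-sync, Globus-2 sync, Globus-2 non-sync)
    200:  (0.72, 0.54, 19.8, 7.09),
    4000: (4.88, 5.26, 161.0, 24.3),
}
for side, (pbs_sync, pbs_nosync, g2_sync, g2_nosync) in table10.items():
    print(side, round(g2_sync / pbs_sync, 1), round(g2_nosync / pbs_nosync, 1))
# 200x200:   roughly 27x synchronized, roughly 13x non-synchronized
# 4000x4000: roughly 33x synchronized, roughly 5x non-synchronized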

Table 9. The total communication time for exchanging borders (in seconds) with synchronized operation for processor configuration 4x4

                  number of        PBS                Globus-1             Globus-2             Globus-4
  grid size       iterations  min.  avg.  max.    min.  avg.  max.    min.  avg.  max.    min.  avg.  max.
  50x50               865     0.15  0.16  0.19    0.52  0.62  0.87    3.81  4.67  5.33    4.82  4.96  5.13
  100x100            2132     0.37  0.39  0.44    1.41  1.51  1.67    9.25  11.8  15.1    12.0  12.2  13.7
  200x200            3570     0.69  0.72  0.74    2.90  3.04  3.56    15.0  19.8  23.0    20.1  20.3  20.5
  400x400            3814     0.89  0.91  0.95    4.49  4.68  4.91    18.0  22.5  29.4    17.5  20.8  22.6
  1000x1000          4347     2.09  2.17  2.31    9.49  10.2  11.8    22.3  28.8  36.5    26.1  29.2  32.1
  2000x2000          4387     4.21  4.38  4.67    27.1  29.1  32.3    41.3  45.1  49.6    41.2  42.8  45.8
  4000x4000          2644     4.50  4.88  5.47    127   134   139     154   161   171     167   179   206
  10000x10000        2644     11.9  12.3  13.1    127   162   202     186   267   398     290   358   427

Table 10. The total communication time for exchanging borders (in seconds) with synchronized and non-synchronized operation for processor configuration 4x4

                  number of      PBS             Globus-1          Globus-2          Globus-4
                  iterations  synchronized    synchronized      synchronized      synchronized
  grid size                    yes     no       yes     no       yes     no        yes     no
  50x50               865      0.16    0.11     0.62    0.54     4.67    2.02      4.96    1.92
  100x100            2132      0.39    0.28     1.51    1.31     11.8    4.16      12.2    4.72
  200x200            3570      0.72    0.54     3.04    2.71     19.8    7.09      20.3    8.50
  400x400            3814      0.91    0.88     4.68    4.20     22.5    9.30      20.8    12.4
  1000x1000          4347      2.17    2.00     10.2    11.3     28.8    16.8      29.2    17.5
  2000x2000          4387      4.38    3.59     29.1    16.5     45.1    18.5      42.8    31.4
  4000x4000          2644      4.88    5.26     134     18.1     161     24.3      179     42.0
  10000x10000        2644      12.3    12.7     162     39.8     267     62.5      358     172

A.4 Data Transfer Rate of Exchanging Borders

We use the results of Table 9 (with synchronized operation) to calculate the data transfer rate when exchanging borders. For this calculation we assume that the slowest communicating process is always an interior process with four borders to exchange. Since the processor configuration is 4x4, the number of grid points to communicate per iteration by an interior process (send and receive) is twice the side of the grid, and 8 bytes are reserved per grid point. In Fig. 12 we see that for PBS the data transfer rate strongly increases when the amount of data to be communicated increases. The highest data transfer rate for PBS is 35 Mbyte/s, for grid size 4000x4000, while for multicluster execution (Globus-2 and Globus-4; the data points in these two cases are nearly indistinguishable) the highest data transfer rate, of over 3 Mbyte/s, is reached for a grid size of 2000x2000.
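The rate calculation described above is easy to reproduce. The sketch below does so for two of the entries, using the average times from Table 9; the function name and the byte accounting are our own reading of the description.

def transfer_rate_mbyte_per_s(grid_side, iterations, total_comm_time_s):
    # An interior process in the 4x4 configuration sends and receives
    # 2 * grid_side points per iteration, at 8 bytes per grid point.
    total_bytes = 2 * grid_side * 8 * iterations
    return total_bytes / total_comm_time_s / 1e6

# PBS, grid 4000x4000: 2644 iterations, 4.88 s total communication time (Table 9).
print(transfer_rate_mbyte_per_s(4000, 2644, 4.88))   # about 35 MByte/s
# Globus-2, grid 2000x2000: 4387 iterations, 45.1 s.
print(transfer_rate_mbyte_per_s(2000, 4387, 45.1))   # a little over 3 MByte/s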

Fig. 11. The total communication time for exchanging borders with synchronized (top) and non-synchronized (bottom) operation for processor configuration 4x4, normalized with respect to PBS

Fig. 12. The data-transfer rate when exchanging borders

Grids for Enterprise Applications

Jerry Rolia, Jim Pruyne, Xiaoyun Zhu, and Martin Arlitt

Hewlett Packard Labs
1501 Page Mill Rd., Palo Alto, CA 94304, USA
{jar,pruyne,xiaoyun,arlitt}@hpl.hp.com

Abstract. Enterprise applications implement business resource management systems, customer relationship management systems, and general systems for commerce. These applications rely on infrastructure that represents the vast majority of the world's computing resources. Most of this infrastructure is lightly utilized and inflexible. This paper builds on the strength of Grid technologies for the scheduling and access control of resources with the goal of increasing the utilization and agility of enterprise resource infrastructures. Our use of these techniques is directed by our observation that enterprise applications have characteristics that differ from those commonly found in the scientific domains in which the Grid is widely used. These include: complex resource topology requirements, very long lifetimes, and time-varying demand for resources. We take these differences into consideration in the design of a resource access management framework and a Grid-based protocol for accessing it. The resulting system is intended to support the new requirements of enterprise applications while continuing to be of value in the parallel job execution environment commonly found on today's Grid.

1 Introduction

Grid computing is emerging as a means to increase the effectiveness of information technology. Grid technologies a) enable sharing of critical, scarce, or specialized resources and data, b) provide wide-area resource allocation and scheduling capabilities, and c) form the basis for dynamic and flexible collaboration. These concepts are enabled through the creation of virtual organizations. This paper is focused on the resource allocation and scheduling aspects of the Grid, particularly as they relate to applications and resource environments found in enterprises.

Until recently, the focus of Grid computing has been on support for scientific or technical job-based parallel applications. In such Grids, a job's requirements are specified in a Resource Description Language (RDL). Grid services, such as those offered by Globus [1], match the requirements of jobs with a supply of resources. A resource management system (RMS), such as Condor [2], schedules access to resources by job and automates their deployment and execution. Different kinds of resources can contribute to such Grids.

For example, Grids may provide access to the spare compute cycles of a network of workstations, a supercomputing cluster, a much sought-after device, or shares of a large multi-processor system.

Enterprise applications have requirements on infrastructure that are different from and more complex than the typical requirements of parallel jobs. They may have topology requirements that include multiple tiers of servers and multiple networks, and may exploit networking appliances such as load balancers and firewalls that govern collaborations and other networking policies. Grids for enterprise applications require resource management systems that deal with such complexity. Resource management systems must not only govern which resources are used but must also automate the joint configuration of these compute, storage, and networking resources. Advances in resource virtualization now enable such cross-domain resource management systems [4]. Grid services may provide an open and secure framework to govern interactions between applications and such resource management systems across intranets and the Internet.

In this paper our contributions are as follows. We enumerate resource management requirements for enterprise applications and how they are addressed by recent advances in resource virtualization technologies. We propose a Resource Access Management (RAM) Framework for enterprise application infrastructure as an example of an RMS and present resulting requirements on Grid service protocols.

Section 2 elaborates on enterprise application requirements for resource management systems. We explain how advances in resource virtualization enable automated infrastructure management. A resource access management framework for enterprise application infrastructure is described in Section 3. Requirements upon Grid service protocols are presented in Section 4. Section 5 describes related work. Concluding remarks are offered in Section 6.

2 Background

In general, enterprise infrastructure is lightly utilized. Computing and storage resources are often cited as being less than 30% utilized. This is a significant contrast to most large scientific computing centers, which run at much higher utilization levels. Unfortunately, the complex resource requirements of enterprise applications and the desire to provision for peak demand are reasons for such low utilization. Configuration and resource requirement complexity can cause support costs for operations staff to exceed the cost of the infrastructure. Furthermore, such infrastructure is inflexible. It can be difficult for an enterprise to allocate resources based on business objectives and to manage collaborations within and across organizational boundaries.

Infrastructure consolidation is the current best practice for increasing asset utilization and decreasing operations costs in enterprise environments. Storage consolidation replaces host-based disks with virtual disks supported by storage area networks and disk arrays. This greatly increases the quantity of storage that can be managed per operator.

Server consolidation identifies groups of applications that can execute together on an otherwise smaller set of servers without causing significant performance degradations or functional failures. This can greatly reduce the number, heterogeneity, and distribution of servers that must be managed. For both forms of consolidation, the primary goal is typically to decrease operations costs.

Today's consolidation exercises are usually manual processes. Operations staff must decide which applications are associated with which resources. Configuration changes are then made by hand. Ideally, an RMS should make such decisions and automate the configuration changes. Recent advances in the virtualization of networking, servers, and storage now enable joint, secure, programmatic control over the configuration of computing, networking, and storage infrastructure. Examples of virtualization features include virtual Local Area Networks, Storage Area Networks and disk arrays that support virtual disks, and virtual machines and other partitioning technologies that enable resource sharing within servers. These are common virtualization features within today's enterprise infrastructure.

Programmable Data Centers (PDC) exploit virtualization technologies to offer secure multi-tier enterprise infrastructure on-demand [4]. As an example, Web, Application, and Database servers along with firewalls and load balancers can be programmatically associated with virtual Local Area Networks and virtual disks on-demand. Furthermore, additional servers or appliances can be made to appear on these networks or can be removed from the networks on-demand. Application topology requirements can be described in an RDL, submitted to a PDC, and rendered on-demand. This makes the Grid paradigm feasible for enterprise applications [5]. The automation capabilities of PDCs can be exploited by an RMS in many ways. The first is simply to automate the configuration of enterprise infrastructure with the goal of decreasing operations costs. However, this automation can then be further exploited by the management system to increase the utilization of resources and to better manage the allocation of resource capacity with respect to business objectives.

Next, we consider the requirements enterprise applications impose upon such resource management systems. Enterprise applications differ from parallel compute jobs in several ways. First, as we have already discussed, enterprise applications have different topology requirements. Second, enterprise applications have different workload characteristics. Often, a scientific or technical job requires a constant number of homogeneous resources for some duration. In contrast, enterprise applications require resources continuously over very long, perhaps undetermined time intervals. In addition, enterprise workloads have changes in numbers of users and workload mix, which may result in time-varying demands for resources, large peak-to-mean ratios for demand, and future demands that are difficult to predict precisely. Demands need to be expressed for a variety of resources that include servers, partitions of servers, appliances, storage capacity, and the network bandwidths that connect them.

For an enterprise application, reduced access to resource capacity degrades application Quality of Service (QoS). The result is that the users of the application face greater queuing delays and incur greater response times. The users do not in general repeat their requests in the future. Therefore, unsatisfied application resource demands on enterprise infrastructure are not, in general, carried forward as future demand.

If an application must satisfy a Service Level Agreement (SLA) with its own users, then it must also require an access assurance from the enterprise infrastructure RMS that it will get a unit of resource capacity precisely when needed. For example, the assurance could be a probability θ with value 1 for guaranteed access, or θ = 0.999 or θ = 0.99 for less assurance. Few applications always need an assurance of θ = 1 for all of their resource capacity requirements. An application may, for example, require n servers with an assurance of 1 and two additional servers with an assurance of θ = 0.999. Service-level assurance is a QoS that can be offered by an RMS as a class of service to applications.

Enterprise applications require several kinds of QoS. These include access assurance, availability, performance isolation, and security isolation. It is the role of a PDC RMS to provide such QoS to applications as they access the enterprise application infrastructure. In the next section we describe an RMS for enterprise applications. It relies on a PDC to automate the rendering of infrastructure. We use the system to help derive requirements for Grid service protocols.

3 Resource Access Management Framework

This section describes our proposed framework for Resource Access Management (RAM) as an RMS for enterprise infrastructure. The framework includes several components: admission control, allocation, policing, arbitration, and assignment. These address the questions: which applications should be admitted, how much resource capacity must be set aside to meet their needs, which applications are currently entitled to resource capacity on-demand, which requests for resource capacity will be satisfied, and which units of resource capacity are to be assigned to each application. Answering such questions requires statements of expected application demands and required qualities of service.

In our framework, we characterize time-varying application demands on enterprise infrastructure resources using an Application Demand Profile (ADP). The profile describes an application's demands for quantities of resource capacity (including bandwidths) and required qualities of service. The qualities of service are with respect to the application's requirements upon the enterprise infrastructure resources. An ADP includes compact specifications of demand with regard to calendar patterns, including their beginning and ending times. For example, a calendar pattern may describe a weekday, a weekend day, or a special day when end-of-month processing occurs. Within the pattern, time is divided into slots. The distribution of expected application demand is given for each slot [12]. These specifications are used by the RAM framework to extrapolate application demand for resources between the beginning and ending time. We refer to this interval as a capacity planning horizon [13].
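One minimal way to picture an ADP as described here is as a set of calendar patterns whose slots each carry a demand distribution. The Python sketch below is our own illustrative encoding under that assumption; the class and field names are not an interface defined in this paper.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SlotDemand:
    # pmf over the number of capacity units needed in this slot,
    # e.g. {1: 0.10, 2: 0.15, 3: 0.20, 4: 0.40, 5: 0.15}
    pmf: Dict[int, float]

@dataclass
class CalendarPattern:
    # a repeating pattern such as "weekday", "weekend day", or "end-of-month day"
    name: str
    slot_minutes: int
    slots: List[SlotDemand] = field(default_factory=list)

@dataclass
class ApplicationDemandProfile:
    # compact specification of time-varying demand between a begin and an end time;
    # that interval is the capacity planning horizon
    resource_type: str
    begin: str
    end: str
    patterns: List[CalendarPattern] = field(default_factory=list)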

We note that each application is responsible for the qualities of service it offers to its own users. Also, the relationship between application QoS and the expressed demands upon infrastructure is application specific [8]. The specification of application QoS requirements must be translated into QoS requirements upon enterprise infrastructure. An RMS must arbitrate among possibly competing business objectives to determine the assignment of resource capacity to applications at any point in time.

The application demand profile is encoded in an RDL and submitted to an RMS as part of a resource reservation request. The resource reservation process may include the use of brokers to ground the demands of the application. Once admitted by the RMS, an application must interact with the system to acquire and release resource capacity as needed to meet the QoS needs of its users. We expect per-application reservation requests to be infrequent. For most environments, reservation latencies of minutes to hours are not likely to be an issue. This is because applications are long-lived and reservations are likely to be made well in advance of actual resource consumption. Resource capacity requests may occur very frequently. The latency between the time of such a request and the time when capacity is made available may be governed by service level agreements.

Admission control and resource capacity acquisition processes are illustrated in Figure 1. Figure 1(a) illustrates an application reservation request. The RAM framework checks to ensure that there are sufficient resources available to support the application with its desired QoS. This involves an interaction with an allocation system that determines how much resource capacity must be set aside to support the application. A calendar maintains state information on expected and actual demand for hosted applications over a capacity planning horizon. Figure 1(b) illustrates resource capacity requests from admitted applications. Resource capacity requests are batched by the RAM framework. Remaining steps in the process decide which of the batched requests are honored. These are explained in subsequent sections of this paper. Finally, applications can always release resource capacity (for compactness this process is not illustrated in the figure).

Fig. 1. Resource Access Management Processes

We note that this paper does not consider language support for describing an application's infrastructure topology. Our focus is on a characterization of demand that supports resource access assurance. The following subsections describe this aspect of ADPs in more detail, resource access assurance Class of Service (CoS) as an example of a QoS, and the components of our RAM framework along with the issues they address.

3.1 Application Demand Profiles and Statistical Assurance

This section describes our approach for characterizing time-varying demands upon enterprise infrastructure resources. ADPs [12] represent historical and/or anticipated resource requirements for enterprise applications. Suppose there is a particular resource used by an application, that it has a single capacity attribute, and that it has a typical pattern of use each weekday. We model the corresponding quantity of capacity that is required, with respect to the attribute, as a sequence of random variables, {Xt, t = 1, ..., T}. Here each t indicates a particular time slot within the day, and T is the total number of slots used in this profile. For example, if each t corresponds to a 60-minute time slot and T = 24, then this profile represents resource requirements by hour of day. Our assumption here is that, for each fixed t, the behavior of Xt is predictable statistically given a sufficiently large number of observations from historical data. This means we can use statistical inference to predict how frequently a particular quantity of capacity may be needed. We use a probability mass function (pmf) to represent this information. Without loss of generality, suppose Xt can take a value from {1, ..., m}, where m is the observed maximum of the quantity of units of resource capacity of a particular type; then the pmf consists of a set of probabilities, {pk, k = 1, ..., m}, where pk = Pr[Xt = k]. Note that although m and pk do not have a subscript t, for simplicity of the notation, they are defined within each time slot.

Figure 2(a) shows the construction of a pmf for a given time slot (9–10 AM) for an ADP of an application. In this example, the application makes demands on a resource that represents a resource pool of servers. The servers are its units of capacity. The capacity attribute for the resource governs the allocation of servers to applications. In the example only weekdays, not weekends, are considered. The application in this example required between 1 and 5 servers over W weeks of observation. Since we are only considering weekdays, there are 5 observations per week; thus there are a total of 5W observations contributing to each application pmf. Figure 2(b) illustrates how the pmfs of many applications contribute to a pmf for a corresponding resource pool as a whole. We regard this aggregate demand as an ADP for the resource itself. The aggregate demand on the resource pool is modeled as a sequence of random variables, denoted as {Yt, t = 1, ..., T}. It is possible to estimate the distribution of the aggregate demand while taking into account correlations in application demands. The characterization of this ADP of aggregate demand enables the resource management system to provide statistical assurances to hosted applications so that they have a high probability of receiving resource capacity when needed [12].

So far, we have described an ADP with respect to demand for a specific type of resource with one capacity attribute. This notion can be generalized. For example, it can be augmented to include a characterization of multiple capacity attributes for each resource and to include a specification of demands for multiple resource types. For example, an ADP may describe the time-varying expected demands of an application on servers from a pool of servers, bandwidth to and from the Internet, bandwidth to and from a storage array, and disk space from the storage array. In subsequent sections, we assume this more general definition of an ADP.

To summarize, a resource management system can extrapolate the possibly time-varying expected demand from application ADPs onto its own calendar to obtain the ADP of aggregate demand upon its resource pools. If there are sufficient resources available, based on per-slot statistical tests as described above, then the application is a candidate for admission.
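A small sketch of the per-slot statistics described above follows. It builds a pmf from observed demands for one slot and then finds the smallest capacity that covers a target assurance level θ; the example probabilities are the ones shown for the 9–10 AM slot in Fig. 2, while the function names and the way observations are supplied are our own illustration.

from collections import Counter

def build_pmf(observations):
    # Estimate the pmf {p_k} for one time slot from historical observations
    # of the number of capacity units (e.g. servers) the application used.
    counts = Counter(observations)
    total = len(observations)
    return {k: counts[k] / total for k in sorted(counts)}

def capacity_for_assurance(pmf, theta):
    # Smallest capacity C such that Pr[demand <= C] >= theta.
    cumulative = 0.0
    for k in sorted(pmf):
        cumulative += pmf[k]
        if cumulative >= theta:
            return k
    return max(pmf)

# Example pmf for the 9-10 AM slot (the values shown in Fig. 2).
pmf_9_to_10 = {1: 0.10, 2: 0.15, 3: 0.20, 4: 0.40, 5: 0.15}
print(capacity_for_assurance(pmf_9_to_10, 0.99))  # 5 servers for theta = 0.99
print(capacity_for_assurance(pmf_9_to_10, 0.80))  # 4 servers suffice for theta = 0.80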

3.2 Resource Access Classes of Service

This section describes resource access classes of service for applications as an example of an RMS QoS for enterprise infrastructure. We assume that an application associates multiple classes of service with its profile. For example, the minimum or mean resource capacity that is required may be requested with a very high assurance, for example with θ = 1. Capacity beyond that may be requested with a lower assurance and hence at a lower cost. An application is expected to partition its requests across multiple CoS to achieve the application QoS it needs while minimizing its own costs. We define the following access assurance CoS for each request for a unit of resource capacity:

– Guaranteed: A request with this CoS receives a 100 percent, i.e. θ = 1, assurance that it will receive a resource from the resource pool.
– Predictable Best Effort with probability θ, PBE(θ): A request with this CoS is satisfied with probability θ, as defined in Section 3.1.
– Best Effort: An application may request resources on-demand for a specific duration with this CoS but will only receive them if the resource management system chooses to make them available. There is no assurance that such requests will be satisfied. These requests need not be specified within the application's ADP in the initial reservation request.

The guaranteed and predictable best effort classes of service enable capacity planning for the enterprise infrastructure. The best effort class of service provides a mechanism for applications to acquire resources that support exceptional demands. These may be made available, possibly via economic market-based mechanisms [6].

Consider a set of applications that exploit a common infrastructure. Using the techniques of Section 3.1, for the same applications, we will require more resource capacity for a guaranteed CoS than for a predictable best effort CoS. Similarly, larger values of θ will require more resource capacity than smaller values. In this way CoS has a direct impact on the quantity and cost of resources needed to support the applications. Next, we consider the various components of our proposed framework as illustrated in Figure 1.
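The three classes of service lend themselves to a simple request representation. The sketch below is one possible encoding, ours rather than the paper's, showing how an application might partition a reservation across assurance levels as discussed above (the field names are assumptions).

from dataclasses import dataclass
from enum import Enum
from typing import Optional

class CoS(Enum):
    GUARANTEED = "guaranteed"            # theta = 1
    PREDICTABLE_BEST_EFFORT = "pbe"      # satisfied with probability theta
    BEST_EFFORT = "best_effort"          # no assurance

@dataclass
class CapacityRequest:
    resource_type: str
    units: int
    cos: CoS
    theta: Optional[float] = None        # only meaningful for PBE(theta)

# An application that needs a baseline of 4 servers and 2 more at peaks
# might partition its reservation like this to lower its cost:
requests = [
    CapacityRequest("server", 4, CoS.GUARANTEED),
    CapacityRequest("server", 2, CoS.PREDICTABLE_BEST_EFFORT, theta=0.999),
]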

3.3 Admission Control

Admission control deals with the question of which applications are to be admitted by a resource management system. Admission control may be governed by issues that include:

– whether there are sufficient resources to satisfy an application's QoS requirements while satisfying the needs of applications that have already been admitted;
– risk and revenue objectives for the enterprise infrastructure; and
– the goal of increasing asset utilization.

Only after an application has been admitted is it able to acquire and release resource capacity.

Fig. 2. Application demand profiles

3.4 Allocation

Allocation decides how much resource capacity must be set aside to satisfy the reservation requirements of applications. The capacity to be set aside depends on:

– application demands as expressed by ADPs;
– the requested access assurance CoS;
– the granularity of virtualization supported by the enterprise infrastructure; and
– the topology of the enterprise infrastructure.

Within our framework, calendars keep track of allocations for the enterprise infrastructure. They are used to ensure that allocations do not exceed the expected availability of enterprise infrastructure. The granularity of virtualization supported by the infrastructure, in addition to application security requirements, may determine whether an offered resource is a partition of a physical resource. Allocation must also be sensitive to the topology of the infrastructure. Networking fabrics shared by applications must be able to satisfy joint network bandwidth requirements.

3.5 Policing

We assume that applications describe their expected time-varying demands to the resource management system using an ADP. The resource management system uses the information for capacity planning and also to provide statistical assurances. To enable the latter capability, we must ensure that the applications behave according to their ADPs. Consider an application that has been admitted. The policing component of the framework observes each request for resource capacity and decides whether the application is entitled to that additional capacity. If it is entitled, then presumably, not satisfying the request will incur some penalty. The choice of which requests will be satisfied is deferred to the arbitration component.

In [13], we described an approach for policing. The approach considers resource access over short time scales, for example over several hours, and long time scales, for example over several weeks or months. Application ADPs are augmented with bounds on resource capacity usage over multiple time scales. These are compared with actual usage to decide upon entitlement. We demonstrated the proposed approach in a case study. The approach provides an effective way to limit bursts in demand for resources over multiple time scales. This is essential for our RAM framework because it offers statistical guarantees for access to resources.
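The entitlement test from [13] is only described at a high level here, so the following is a loose illustration under our own assumptions: usage is tracked per time window (for example hours and weeks), and a request is labeled entitled only if granting it keeps usage within the bounds the ADP was augmented with.

def is_entitled(requested_units, usage_by_window, bounds_by_window):
    # usage_by_window and bounds_by_window map a window name (e.g. "hour", "week")
    # to the capacity already consumed in that window and to the agreed bound.
    return all(
        usage_by_window.get(window, 0) + requested_units <= bound
        for window, bound in bounds_by_window.items()
    )

# Example: 9 server-hours used in the current hour window and 180 in the week window,
# against bounds of 10 and 200.
print(is_entitled(2, {"hour": 9, "week": 180}, {"hour": 10, "week": 200}))  # False
print(is_entitled(1, {"hour": 9, "week": 180}, {"hour": 10, "week": 200}))  # True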

3.6 Arbitration

Arbitration decides precisely which applications will have their requests satisfied. It is likely to:

– favor entitled requests (i.e., those that pass the policing test) to avoid penalties;
– satisfy requests that are not entitled if the risk of future penalties is less than the additional revenue opportunities; and
– withhold resource capacity to make sure an application does not receive too high a quality of service.

Arbitration aims to best satisfy the needs of the applications and the goals of the enterprise infrastructure [7].
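One way to read these preferences is as an ordering over the batch of pending requests. The comparator below is a sketch of that reading only; the request fields and the revenue-versus-penalty comparison are our assumptions, not the arbitration algorithm of the paper.

from dataclasses import dataclass

@dataclass
class PendingRequest:
    app: str
    units: int
    entitled: bool            # label attached by the policing step
    expected_revenue: float   # value of satisfying the request now
    expected_penalty: float   # risk-weighted penalty of not satisfying it

def arbitration_order(requests):
    # Entitled requests first (to avoid penalties); among the rest, prefer requests
    # whose expected revenue most exceeds the risk of future penalties.
    return sorted(
        requests,
        key=lambda r: (not r.entitled, -(r.expected_revenue - r.expected_penalty)),
    )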

3.7 Assignment

Assignment decides which of the available units of resource capacity are to be given to each application. Assignment must take into account the resulting impact of application loads on the performance of any infrastructure that is shared by multiple applications. Specific assignment decisions may also be made to best satisfy the objectives for the enterprise infrastructure. For example, assignment decisions may aim to minimize the use of certain servers so that they can be powered off.

In [11] we describe a mathematical optimization approach for assigning units of resource capacity to a multi-tier application within a PDC. The method takes into account the availability of server resources, existing network loads on the PDC's networking fabrics, and the bandwidth requirements between resources in the multi-tier application.

3.8 Summary

We have described our assumptions about the nature of information that must be offered to a resource management system for enterprise infrastructure, the kinds of problems that must be addressed by the system, and components that address them. Together, we refer to these as a Resource Access Management (RAM) framework for enterprise infrastructure. In the next section, we consider how enterprise applications should interact with such a resource management system. Grid service protocols could be of value here to provide support for admission control and reservation, to acquire and release resource capacity over time, to change reservations, and to deal with issues of multiple resource types.

4 Protocols for Enterprise Application Grids

In the previous sections, we discussed the requirements of applications in an enterprise environment. We also described the framework we have developed for performing resource management for these applications. The final component of a complete system is the external interface that permits customers to interact with the resource management system. Just as the requirements of enterprise applications drive new requirements on the resource management system, so too do they drive changes in how customers interact with resource management systems. A principal reason for needing a new interaction model is the long-running, time-varying-demand nature of our application environment. These attributes give rise to a number of observations.

First, because of the time-varying demand, resource requirements will be complex. Some applications may have mid-day peaks in demand; others may peak in the mornings or near the end of the month or business quarter. At times the peaks may overlap. As discussed in earlier sections, an RMS must take such time-varying demands into account when making admission control decisions. Furthermore, we expect that, oftentimes, a resource management system will be unable to satisfy initial requests for resources with, for example, the desired resource types or classes of service. This means that we require a negotiation process between the customer and the resource provider to allow the two parties to determine a mutually agreeable reservation.

Second, because applications and their corresponding reservations are long-lived, customers of our system will continue to interact with the RMS throughout the application's lifetime. These interactions are needed to submit requests to acquire and release resource capacity, as reserved with respect to demand profiles. We also require an interaction model that allows customers and the RMS to change reservations when necessary, subject to a service level agreement associated with the reservation. Furthermore, an RMS may fail to satisfy the requirements of a reservation at some point in time. It must notify the application, possibly in advance. Also, an application may not be returning resources as agreed. A negotiation is needed so that the RMS reclaims the most appropriate resources. In general, these negotiations are needed to guide the RMS so that it makes appropriate decisions.

Finally, the resource management system's planning horizon may actually be shorter than the expected lifetime of the application. That is, the application may be making requests farther into the future than the resource management system is willing to make guarantees at the current time. Therefore, the agreement between the parties may be subject to re-negotiation when the resource management system is prepared to consider new points in time.

Advance reservation is becoming increasingly common in traditional parallel processing domains [15][16][22]. We aim to extend these methods to suit the needs of enterprise applications as much as possible. In particular, SNAP [15] provides an SLA basis for resource management. This appears to be suitable for our environment, although additional issues and requirements are raised in the enterprise environment. We outline extensions to what has thus far been proposed, particularly in the area of Reservation Service Level Agreements (RSLA).

4.1 Properties of a RSLA

We use the RSLA to represent the agreement between the customer and the resource management system. We require that the specification of this agreement be flexible enough to represent a wide variety of attributes that can be specified by either the customer or the resource management system. We use the following tuple to represent the RSLA:

⟨ I, c, t_dead, ⟨ {(r, ADP)_Q}_τ ⟩_R ⟩

Much of this notation is based on SNAP. The first three fields are exactly as defined by SNAP: I is the identifier of the RSLA, c is the customer of the SLA, and t_dead represents the end time (deadline) for the agreement. The fourth field is the description of the requested resources in a particular RDL, R, used by the system. It is modeled as a set of requests for individual resource types. Each resource type has a description, r, that enumerates attributes such as physical characteristics (for example, a CPU architecture) and demand as characterized by its ADP. The ADP provides information about each of a resource's demand attributes such as quantities of servers, bandwidth to and from other resource types, qualities of service, and policing information. If the ADP specifies an end time (i.e., a capacity planning horizon) later than t_dead, there is no commitment on the part of the RMS to honor the agreement after the SLA expires. Each resource type and ADP pair also has QoS requirements, Q, as described in Section 2. Topology requirements across all resource types are contained in τ. We note that by simply setting θ = 1, i.e. the guaranteed CoS within Q, for all of our demands, we have an agreement that requires guaranteed access to resources, so including the assurance in the RSLA specification need not weaken the original SNAP approach. Enterprise applications often have complex dependency requirements among the resources they use. For example, multi-tier application architectures, such as those employed by many application servers, are common. The wide use of virtualization technologies makes it possible to render the required topology in a PDC environment. The topology description τ defines how the resources will be interconnected. So, in the multi-tier example, we could describe a single logical Internet address provided by a load balancer for servers in an application tier that are also connected by a private sub-net to a back-end database layer consisting of machines of another logical type. Each type of resource (load balancers, application tier servers, and database servers) is a different logical type. SNAP captures the notion of multiple identical resources through an Array construct. However, the use of the ADP is central to our specification as described in Section 3.1. So, we make the ADP an explicit component of our RDL. Within the SNAP context, it is straightforward to make the ADP part of the Array in the form ADP × r × C, where r is a resource type in τ and C is the number of CoS used by the ADP, instead of the constant n × r notation in SNAP, where n is a specific number of resources. A successful negotiation leads to a reservation. The attributes of the RSLA are used by the application to identify context when requesting and releasing
resource capacity at runtime, when making best effort requests for resource capacity, and when a negotiation is needed so that the RMS can pre-emptively take back the most appropriate units of resource capacity. One of the desirable properties of the SLA abstraction for reservation is that it can be re-negotiated over time as needs or available resources change. However, in some cases, a change in need may be known a priori, such as when a trend toward increased or decreased demand is known. The ADP can be augmented with trending information and used when extrapolating the profile onto the RMS's capacity planning horizon. As a simple example, one could state that the demand will increase 10% per month over the life of the SLA. Finally, we note that the SLA termination time may be very far in the future in an enterprise context. It may be determined by the resource management system due to its capacity planning horizon rather than by the customer as an expected job length, as is commonly seen in today's reservation systems. SNAP purposely avoids addressing issues relating to the auditing of an SLA, but does discuss dimensions such as pricing that may be subject to auditing. Our framework suggests two varieties of audits. The first is policing as discussed in Section 3.5. In this case, we know how to audit to determine that an application is staying within the agreed-upon resource levels. The other dimension is assurance. The specified statistical assurance can be tested against actual performance. Such measurements are possible, but they may take a very long time when assurance levels are high but not equal to 1.
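
To make the structure of an RSLA concrete, the sketch below renders the tuple ⟨I, c, t_dead, ⟨{(r, ADP)_Q}_τ⟩_R⟩ as plain Python data classes. All class and field names, the slot granularity, and the example two-tier request are illustrative assumptions; they are not taken from SNAP or from any published RDL.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Tuple

@dataclass
class ADP:
    """Application demand profile: per-slot demand for each class of service."""
    slot_minutes: int                      # length of one time slot
    demand: Dict[str, List[int]]           # CoS name -> servers required in each slot
    trend_per_month: float = 0.0           # e.g. 0.10 means demand grows 10% per month

@dataclass
class ResourceRequest:
    description: Dict[str, str]            # r: physical characteristics (CPU arch, ...)
    profile: ADP                           # time-varying demand for this resource type
    qos: Dict[str, float]                  # Q: e.g. {"assurance": 0.999}

@dataclass
class RSLA:
    identifier: str                        # I
    customer: str                          # c
    t_dead: datetime                       # deadline (end time) of the agreement
    requests: List[ResourceRequest]        # the {(r, ADP)_Q} set
    topology: List[Tuple[int, int]]        # tau: which resource types are inter-connected
    rdl: str = "example-rdl"               # R: resource description language in use

# A two-tier request: a load-balanced application tier with a mid-day peak and a
# 10%-per-month growth trend, plus a database tier with constant demand.
app_tier = ResourceRequest({"cpu_arch": "ia32"},
                           ADP(60, {"guaranteed": [4] * 8 + [10] * 8 + [4] * 8},
                               trend_per_month=0.10),
                           {"assurance": 0.999})
db_tier = ResourceRequest({"cpu_arch": "ia64", "storage": "SAN"},
                          ADP(60, {"guaranteed": [2] * 24}),
                          {"assurance": 0.999})
sla = RSLA("rsla-0001", "customer-a", datetime(2003, 12, 15),
           [app_tier, db_tier], topology=[(0, 1)])
```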

4.2 Negotiating a RSLA

A RSLA will in most cases be the result of a negotiation process. The customer and the resource management system can be expected to go through multiple iterations to reach a mutually agreeable set of parameters for the SLA. Ultimately, the negotiation is completed in a two-phase protocol where both sides commit to the results of the negotiation. One possible negotiation scenario is shown in Figure 3. In this example, the customer contacts the RMS, and proposes an initial RSLA comprising the termination date, resource description, ADP, QoS including the desired level of service assurance, 0.999, and topology. The RMS cannot provide the type of resources requested, so proposes alternative resources of types r1 , r2 , r3 in the form of a Negative Reply. The negative reply indicates that no RSLA has been formed because the requirements could not be met. The consumer then alters its requirements to use resources of type r2 , updating the ADP and topology to match the capability of the selected resource type. This results in an RSLA which the RMS can render, but with a lesser QoS than was requested. In this case, the service assurance level will only be 0.9 on two specific dates, and no guarantee can be made at all after November 21 even though the request was for a lifetime through December 15. Although the conditions are different, the new RSLA is valid, so the RMS provides a Positive Reply with the limit that it must be accepted before time t1 . The RMS puts this time limit on the agreement so that it can negotiate with other customers without having to hold this reservation

Fig. 3. Example of negotiation process. (Message exchange between Customer and RMS: Req; Neg-Reply: no resources of type r, try r1, r2, or r3; Req; Pos-Reply: only 0.9 assurance on April 10 and 14, nothing after Nov. 21, valid until t1; Reserve; Reserve ACK.)

indefinitely. Here, we assume that the customer commits to the new RSLA within the specified time by sending a reserve message, and the RMS acknowledges that reservation. An important aspect of the negotiation process is that it is composable. That is, an entity that acts as the resource manager to some clients may in turn be a client to another resource manager. In this way, we allow for brokering to occur. A broker can appear to its clients as a resource provider when, in fact, it contracts out to other resource providers for physical resources. There are a wide variety of brokering strategies possible, but the important point is that each RSLA is made between the customer and the RMS it communicates with directly. As long as the specification of the RSLA is met, the customer need not be concerned with details of how the requirements are being satisfied. Even when the SLA is agreed upon, we expect that either party may unilaterally need to change the parameters of the SLA. The conditions upon which this can occur must be agreed to within the original RSLA including possible penalties. In practice, making changes to the SLA can be handled as a re-initiation of the negotiation protocol, but the penalties or conditions under which this is permitted are dependent on the semantics of the SLA, and are beyond the scope of this work. This does put a requirement on the customer that it be available to re-negotiate at any time which is often not the case in current systems. The common model today is “submit and forget” where once a reservation has been made, the customer assumes it will be fulfilled. To enable re-negotiation, customers will require a persistent presence on the network with which negotiation


can be initiated. The simplest approach is to provide an e-mail address that the RMS can send a message to requesting a new negotiation to occur. A variety of events may trigger this change from the resource management system’s side, including failures of resources, arrival of a higher priority customer, or other arbitrary changes in policy that invalidate a current RSLA.
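
As a rough illustration of the two-phase exchange in Figure 3, the toy sketch below walks through a proposal, a negative reply suggesting alternative resource types, a positive reply that is only valid until a deadline, and the final reserve/acknowledge step. The class, method, and message names are invented for this example and do not correspond to any protocol specification.

```python
import time

class ToyRMS:
    """Toy resource manager illustrating negative/positive replies and the
    two-phase commit of a reservation (an offer is only valid until a deadline)."""

    def __init__(self, available_types):
        self.available = set(available_types)
        self.offers = {}                      # offer id -> expiry timestamp

    def propose(self, offer_id, resource_type, hold_seconds=30):
        if resource_type not in self.available:
            return ("neg-reply", sorted(self.available))   # suggest alternatives
        self.offers[offer_id] = time.time() + hold_seconds
        return ("pos-reply", self.offers[offer_id])        # valid until time t1

    def reserve(self, offer_id):
        expiry = self.offers.pop(offer_id, None)
        if expiry is None or time.time() > expiry:
            return "reserve-nack"                           # unknown or expired offer
        return "reserve-ack"

rms = ToyRMS(available_types={"r1", "r2", "r3"})
print(rms.propose("offer-1", "r"))        # ('neg-reply', ['r1', 'r2', 'r3'])
print(rms.propose("offer-1", "r2"))       # ('pos-reply', <expiry time t1>)
print(rms.reserve("offer-1"))             # 'reserve-ack' when sent before t1
```

A broker could offer the same propose/reserve interface to its own clients while acting as a customer of another resource manager, which is the composability property noted above.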

5 Related Work

There are many examples of application control systems for business applications that are emerging in the literature. In general, these are control systems that are coupled with applications to recognize when additional resource capacity should be acquired or released and may determine how much an application is willing to pay for such capacity. These control systems could exploit our extensions to SNAP to interact with PDCs that act as resource providers for such applications. This section describes examples of application control systems, related work on offering statistical assurance, and examples of PDCs. First, we consider several examples of application control systems. With MUSE [6], Web sites are treated as services that run concurrently on all servers in a cluster. A pricing/optimization model is used to determine the fraction of cpu resources allocated to each service on each server. The over-subscription of resources is dealt with via the pricing/optimization model. When resources are scarce costs increase thereby limiting demand. Commercial implementations of such goal driven technology are emerging [17][18]. Levy et al. consider a similar problem but use RMS utility functions to decide how to deal with assignment when resources are scarce [19]. Ranjan et al. [10] consider QoS driven server migration for PDCs. They provide application level workload characterizations and explore the effectiveness of an online algorithm, Quality of Infrastructure on Demand (QuID), that decides when Web based applications should acquire and release resource capacity from a PDC. It is an example of an approach that could be used by an application to navigate within its demand profile, i.e., to decide how many resources it needs in its next time slot. Next, we consider statistical assurances for access to resources. Rolia et. al. consider statistical assurances for applications requesting resources from a RMS. They consider the notion of time varying demands and exploit techniques similar to those from the Network QoS literature [20][21] to compute assurance levels. Urgaonkar et. al. also recognize that significant resource savings can be made by offering less than a guaranteed level of service [14] in computing environments. However, they do not consider time varying demands or classes of service. We proposed the RAM framework, classes of service, and policing mechanisms in [13] and offered a study to evaluate the effectiveness of the policing methods. Finally, we consider examples of PDCs. A PDC named the Adaptive Internet Data Center is described in [3]. Its infrastructure concepts have been realized as a product [4]. It exploits the use of virtual LANs and SANs for partitioning resources into secure domains called virtual application environments. These


environments support multi-tier as well as single-tier applications. A second example is Oceano [9], an architecture for an e-business utility. Work is being performed within the Global Grid Forum to create standards for Grid-based advance reservation systems in the Grid Resource Allocation Agreement Protocol (GRAAP) [22] working group. The group’s goal is to make advance reservation systems interoperable through a standard set of protocols and interfaces. The group’s approach is also influenced significantly by SNAP, and so is consistent with our approach and goals. We are working with this group to help ensure the standards are applicable to enterprise grids as well as scientific grids.

6 Summary and Conclusions

Grid computing is emerging as a means to increase the effectiveness of information technology. Open standards for interactions between applications and resource management systems are desirable. Ideally such interaction methods can be unified for parallel and scientific computing and enterprise applications. Although the infrastructure requirements in the enterprise are often more complex, the advent of virtualization technologies and PDCs enable dynamic configuration of resources to meet these demands. The challenge becomes developing a resource management framework that meets the demands of enterprise applications. We have observed that these applications have long running times, but varying demand. This leads to under utilized infrastructures due to over provisioning based on peak demands. Our solution is a resource allocation framework that permits customers to describe, probabilistically, their applications’ demands for resources over time. We capture this behavior in an Application Demand Profile, and allocate resources based on a service assurance probability. We also augment the Application Demand Profile with information needed to police applications which go outside their profile [13]. Users interact with our system using an approach based on the Grid, and Reservation Service Level Agreements. The RSLA describes the agreement between the customer and the resource management system for resource utilization over time. Our formulation builds on existing work in this area, and extends it to support the topology, ADP, and service assurance used in our framework. We also base the RSLA formulation process on a two-phase negotiation protocol. We are striving for a converged environment that supports equally scientific and enterprise applications. We believe that our approach is consistent with, and extends on-going work in the parallel scheduling community, so we believe this vision can be a reality. As parallel workloads evolve into a more service oriented model, their demand characteristics are likely to evolve toward those currently seen in the enterprise, making this converged model even more valuable.


References

[1] Czajkowski K., Foster I., Karonis N., Kesselman C., Martin S., Smith W., and Tuecke S.: A Resource Management Architecture for Metacomputing Systems, JSSPP, 1998, LNCS vol. 1459, pp. 62-82.
[2] Litzkow M., Livny M. and Mutka M.: Condor - A Hunter of Idle Workstations. Proceedings of the 8th International Conference on Distributed Computing Systems, June 1988, pp. 104-111.
[3] Rolia J., Singhal S. and Friedrich R.: Adaptive Internet Data Centers. Proceedings of the European Computer and eBusiness Conference (SSGRR), L'Aquila, Italy, July 2000, http://www.ssgrr.it/en/ssgrr2000/papers/053.pdf.
[4] HP Utility Data Center Architecture, http://www.hp.com/solutions1/infrastructure/solutions/utilitydata/architecture/index.html.
[5] Graupner S., Pruyne J., and Singhal S.: Making the Utility Data Center a Power Station for the Enterprise Grid, HP Laboratories Technical Report, HPL-2003-53, 2003.
[6] Chase J., Anderson D., Thakar P., Vahdat A., and Doyle R.: Managing Energy and Server Resources in Hosting Centers, Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles (SOSP), Oct. 2001, pp. 103-116.
[7] Kelly, T.: Utility-Directed Allocation, Proceedings of the First Workshop on Algorithms and Architectures for Self-Managing Systems, June 2003. http://tesla.hpl.hp.com/self-manage03/Finals/kelly.ps. To appear as a Hewlett-Packard Technical Report.
[8] Foster I., Kesselman C., Lee C., Lindell B., Nahrstedt K., and Roy A.: A Distributed Resource Management Architecture that Supports Advance Reservation and Co-allocation, Proceedings of IWQoS 1999, June 1999, pp. 27-36, London, U.K.
[9] Appleby K., Fakhouri S., Fong L., Goldszmidt G. and Kalantar M.: Oceano - SLA Based Management of a Computing Utility. Proceedings of the IFIP/IEEE International Symposium on Integrated Network Management, May 2001.
[10] Ranjan S., Rolia J., Zu H., and Knightly E.: QoS-Driven Server Migration for Internet Data Centers. Proceedings of IWQoS 2002, May 2002, pp. 3-12, Miami, Florida, USA.
[11] Zhu X. and Singhal S.: Optimal Resource Assignment in Internet Data Centers, Proceedings of the Ninth International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, pp. 61-69, Cincinnati, Ohio, August 2001.
[12] Rolia J., Zhu X., Arlitt M., and Andrzejak A.: Statistical Service Assurances for Applications in Utility Grid Environments. Proceedings of the Tenth IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 12-16 October 2002, pp. 247-256, Fort Worth, Texas, USA.
[13] Rolia J., Zhu X., and Arlitt M.: Resource Access Management for a Resource Utility for Enterprise Applications, Proceedings of the International Symposium on Integrated Management (IM 2003), March 2003, pp. 549-562, Colorado Springs, Colorado, USA.
[14] Urgaonkar B., Shenoy P., and Roscoe T.: Resource Overbooking and Application Profiling in Shared Hosting Platforms, Proceedings of the Fifth Symposium on Operating Systems Design and Implementation (OSDI), Boston, MA, December 2002.
[15] Czajkowski K., Foster I., Kesselman C., Sander V., and Tuecke S.: SNAP: A Protocol for Negotiating Service Level Agreements and Coordinating Resource Management in Distributed Systems, JSSPP, 2002, LNCS vol. 2357, pp. 153-183.
[16] Jackson D., Snell Q., Clement M.: Core Algorithms of the Maui Scheduler, JSSPP 2001, LNCS vol. 2221, pp. 87-102.
[17] Sychron Enterprise Manager, http://www.sychron.com, 2001.
[18] Utility computing white paper, http://www.ejasent.com, 2001.
[19] Levy R., Nagarajarao J., Pacifici G., Spreitzer M., Tantawi A., and Youssef A.: Performance Management for Cluster Based Web Services, Proceedings of the International Symposium on Integrated Management (IM 2003), March 2003, pp. 247-261, Colorado Springs, Colorado, USA.
[20] Zhang Z., Towsley D., and Kurose J.: Statistical Analysis of Generalized Processor Sharing Scheduling Discipline, IEEE Journal on Selected Areas in Communications, vol. 13, no. 6, pp. 1071-1080, 1995.
[21] Knightly E. and Shroff N.: Admission Control for Statistical QoS: Theory and Practice, IEEE Network, vol. 13, no. 2, pp. 20-29, 1999.
[22] Global Grid Forum GRAAP Working Group Web Page, http://www.fz-juelich.de/zam/RD/coop/ggf/graap/graap-wg.html.

Performance Estimation for Scheduling on Shared Networks

Jaspal Subhlok and Shreenivasa Venkataramaiah
Department of Computer Science, University of Houston, Houston, TX 77204
{jaspal,shreeni}@cs.uh.edu
www.cs.uh.edu/~jaspal

Abstract. This paper develops a framework to model the performance of parallel applications executing in a shared network computing environment. For sharing of a single computation node or network link, the actual performance is predicted, while for sharing of multiple nodes and links, performance bounds are developed. The methodology for building such a shared execution performance model is based on monitoring an application's execution behavior and resource usage under controlled dedicated execution. The procedure does not require access to the source code and hence can be applied across programming languages and models. We validate our approach with experimental results for NAS benchmarks executed in different resource sharing scenarios on a small cluster. Applicability to more general scenarios, such as large clusters, memory and I/O bound programs, and wide area networks, remains an open question that is included in the discussion. This paper makes the case that understanding and modeling application behavior is important for resource allocation and offers a promising approach to put that in practice.

1 Introduction

Shared networks, varying from workstation clusters to computational grids, are an increasingly important platform for high performance computing. Performance of an application strongly depends on the dynamically changing availability of resources in such distributed computing environments. Understanding and quantifying the relationship between the performance of a particular application and available resources, i.e., how will the application perform under given network and CPU conditions, is important for resource selection and for achieving good and predictable performance. The goal of this research is automatic development of application performance models that can estimate application execution behavior under different network conditions. This research is motivated by the problem of resource selection in the emerging field of grid computing [1, 2]. The specific problem that we address can be stated as follows: “What is the best set of nodes and links on a given network computation environment for the execution of a given application under current network conditions?” Node selection based on CPU considerations has
been dealt with effectively by systems like Condor [3] and LSF [4], but network considerations make this problem significantly more complex. A solution to this problem requires the following major steps:

1. Application characterization: Development of an application performance model that captures the resource needs of an application and models its performance under different network and CPU conditions.
2. Network characterization: Tools and techniques to measure and predict network conditions such as network topology, available bandwidth on network links, and load on compute nodes.
3. Mapping and scheduling: Algorithms to select the best resources for an application based on existing network conditions and the application's performance model.

Figure 1 illustrates the general framework for resource selection. In recent years, significant progress has been made in several of these components. Systems that characterize a network by measuring and predicting the availability of resources on a network exist, some examples being NWS [5] and Remos [6]. Various algorithms and systems to map and schedule applications onto a network have been proposed, such as [7, 8, 9, 10, 11]. In general, these projects target specific classes of applications and assume a simple, well defined structure and resource requirements for their application class. In practice, applications show diverse structures that can be difficult to quantify. Our research is focused on application characterization and builds on earlier work on dynamic measurement of resource usage by applications [12]. The goal is to automatically develop application performance models to estimate performance in different resource availability scenarios. We believe that this is an important missing piece in successfully tackling the larger problem of automatic scheduling and resource selection. This paper introduces a framework to model and predict the performance of parallel applications with CPU and network sharing. The framework is designed to work as a tool on top of a standard Unix/Linux environment. Operating system features to improve sharing behavior have been studied in the MOSIX system [13]. The techniques employed to model performance with CPU sharing have also been studied in other projects with related goals [14, 15]. This paper generalizes the authors' earlier work [16] to a broader class of resource sharing scenarios, specifically loads and traffic on multiple nodes and communication links. For more complex scenarios, it is currently not possible to make accurate predictions, so this research focuses on computing upper and lower bounds on performance. A good lower bound on performance is the characteristic that is most useful for the purpose of resource selection. The approach taken in this work is to measure and infer the core execution parameters of a program, such as the message exchange sequences and CPU utilization pattern, and use them as a basis for performance modeling with resource sharing. This is fundamentally different from approaches that include analysis of application code to build a performance model that have been explored by many

Fig. 1. Framework for resource selection in a network computing environment. (Application characterization: profiling on a dedicated cluster yields the CPU, traffic, and synchronization pattern, which is analyzed and modeled mathematically to produce a performance prediction model. Network characterization: network profiling and measurements yield the current resource availability on the network and a forecast of resource availability. Both feed the scheduling decisions and mapping algorithms that map the application onto selected nodes and links of the network.)

researchers, some examples being [17, 18]. In our view, static analysis of application codes has fundamental limitations in terms of the program structures that can be analyzed accurately, and in terms of the ability to predict dynamic behavior. Further, assuming access to source code and libraries inherently limits the applicability of this approach. In our approach, all measurements are made by system level probes, hence no program instrumentation is necessary and there is no dependence on the programming model with which an application was developed. Some of the challenges we address are also encountered in general performance modeling and prediction for parallel systems [19, 20, 21]. We present measurements of the performance of the NAS benchmark programs to validate our methodology. In terms of the overall framework for resource selection shown in Figure 1, this research contributes and validates an application characterization module.

2 Overview and Validation Framework

The main contribution of this paper is construction of application performance models that can estimate the impact of competing computation loads and network traffic on the performance of parallel and distributed applications. The performance estimation framework works as follows. A target application is executed on a controlled testbed, and the CPU and communication activity on the network is monitored. This system level information is used to infer program level activity, specifically the sequence of time slots that the CPU spends in compute, communication, and idle modes, and the size and sequence of messages exchanged between the compute nodes. The program level information is then used to model execution with resource sharing. For simpler scenarios, specifically sharing of a single node or a single network link, the model aims to predict the actual execution time. For more complex scenarios that involve sharing on multiple nodes and network links, the model estimates upper and lower bounds on performance. The input to an application performance model is the expected CPU and network conditions, specifically the load average on the nodes and expected bandwidth and latency on the network routes. Computing these is not the subject of the paper but is an important component of any resource selection framework that has been addressed in related research [22, 6, 5]. We have developed a suite of monitoring tools to measure the CPU and network usage of applications. The CPU monitoring tool would periodically probe (every 20 milliseconds for the reported experiments) the processor status and retrieve the application’s CPU usage information from the kernel data structures. This is similar to the working of the UNIX top utility and provides an application’s CPU busy and idle patterns. The network traffic between nodes is actively monitored with tcpdump utility and application messages are reassembled from network traffic as discussed in [23]. Once the sequence of messages between nodes is identified, the communication time for the message exchanges is calculated based on the benchmarking of the testbed. This yields the time each node CPU spends on computation, communication and synchronization waits. In order to validate this shared performance modeling framework, extensive experimentation was performed with MPI implementation of Class A NAS Parallel benchmarks [24], specifically the codes EP (Embarrassingly Parallel), BT (Block Tridiagonal solver), CG (Conjugate Gradient), IS (Integer Sort), LU (LU solver), MG (Multigrid), and SP (Pentadiagonal solver). The compute cluster used for this research is a 100Mbps Ethernet based testbed of 500 MHz, Pentium 2 machines running FreeBSD and MPICH implementation of MPI. Each of the NAS codes was compiled with g77 or gcc for 4 nodes and executed on 4 nodes. The computation and communication characteristics of these codes were measured in this prototyping phase. The time spent by the CPUs of executing nodes

Fig. 2. CPU usage during execution of NAS benchmarks. (Stacked bars of percentage CPU utilization, split into computation, communication, and idle time, for CG, IS, MG, SP, LU, BT, and EP.)

in different activities is shown in Figure 2. The communication traffic generated by the codes is highlighted in Figure 3 and was verified with a published study of NAS benchmarks [25]. The details of the measured execution activity, including the average duration of the busy and idle phases of the CPU, are presented in Table 1. In the following sections we will discuss how this information was used to concretely model the execution behavior of NAS benchmarks with compute loads and network traffic. We will present results that compare the measured execution time of each benchmark under different CPU and network sharing scenarios, and how they compare with the estimates and bounds computed by the application performance model.
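
The monitoring described above (20 ms CPU probes in the spirit of top, plus message reassembly from tcpdump traces) can be approximated with very little code. The sketch below is a Linux-specific illustration that samples /proc/<pid>/stat to classify each 20 ms interval of a target process as busy or idle; the paper's own tools ran on FreeBSD and read kernel data structures directly, so the /proc parsing, the 50% busy threshold, and the self-monitoring demo are assumptions made only for illustration.

```python
import os
import time

def cpu_ticks(pid):
    """Cumulative user+system CPU ticks of a process, read from Linux /proc."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    fields = data[data.rindex(")") + 2:].split()    # skip the pid and (comm) fields
    return int(fields[11]) + int(fields[12])        # utime + stime

def busy_idle_pattern(pid, period=0.020, duration=10.0):
    """Classify each sampling period as 'busy' or 'idle' for the target process,
    mimicking the 20 ms probes used to build the CPU usage pattern."""
    hz = os.sysconf("SC_CLK_TCK")                   # kernel ticks per second
    pattern = []
    prev = cpu_ticks(pid)
    for _ in range(int(duration / period)):
        time.sleep(period)
        cur = cpu_ticks(pid)
        used = (cur - prev) / hz                    # CPU seconds in this interval
        pattern.append("busy" if used >= 0.5 * period else "idle")
        prev = cur
    return pattern

if __name__ == "__main__":
    phases = busy_idle_pattern(os.getpid(), duration=1.0)   # monitor ourselves as a demo
    print(phases.count("busy") * 0.020, "s busy,",
          phases.count("idle") * 0.020, "s idle")
```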

3 Modeling Performance with CPU Sharing

3.1 CPU Scheduler on Nodes

We assume that the node scheduler assigns the CPU to active processes fairly as follows. All processes on the ready queue are normally given a fixed execution time slice in round robin order. If a process blocks during execution, it is removed from the ready queue immediately and the CPU is assigned to another waiting process. A process gains priority (or collects credits) for some time while it is blocked so that when it is unblocked and joins the ready queue again it will receive a higher share of the CPU in the near future. The net effect is that each active process receives approximately equal CPU time even if some processes block for short intervals during execution. In our experience, this is qualitatively true at least of most Unix based systems, even though the exact CPU scheduling policies are complex and vary significantly among operating systems.

Fig. 3. Dominant communication patterns during execution of NAS benchmarks. The thickness of the lines is based on the generated communication bandwidth. (Each panel shows the four executing nodes, numbered 0 to 3, and the dominant message exchanges for BT, CG, EP, IS, LU, MG, and SP.)
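
The fair-share behavior assumed in Section 3.1 can be illustrated with a toy tick-level simulation: at every millisecond the runnable process with the least cumulative CPU time runs, which crudely captures the credit mechanism described above. This is a rough sketch, not a model of any real kernel scheduler; the 1 ms quantum and the busy/idle values below are arbitrary choices for illustration.

```python
def simulate(busy_ms, idle_ms, cycles, with_load=True):
    """Wall-clock time for an application that alternates busy_ms of CPU work
    with idle_ms of blocking, optionally sharing the CPU with an always-runnable
    compute load.  Each 1 ms tick goes to the runnable process with the least
    cumulative CPU time (a crude stand-in for fair-share scheduling)."""
    app_cpu = load_cpu = wall = 0
    for _ in range(cycles):
        need = busy_ms
        while need > 0:                    # application busy phase
            if with_load and load_cpu <= app_cpu:
                load_cpu += 1              # the load wins this tick
            else:
                app_cpu += 1               # the application wins this tick
                need -= 1
            wall += 1
        if with_load:
            load_cpu += idle_ms            # the load soaks up the idle phase
        wall += idle_ms                    # the application is blocked for idle_ms
    return wall

# The three cases discussed in Section 3.2 below: idle = 0, busy < idle, busy > idle.
for busy, idle in [(30, 0), (10, 30), (30, 10)]:
    alone = simulate(busy, idle, 200, with_load=False)
    shared = simulate(busy, idle, 200, with_load=True)
    print(f"busy={busy}ms idle={idle}ms: slowdown {shared / alone:.2f}")
```

Over many cycles the simulated slowdowns approach 2.0, 1.0, and 1.5 for the three cases, which matches the behavior derived in Section 3.2.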

3.2 CPU Shared on One Node

We investigate the impact on total execution time when one of the nodes running an application is shared by another competing CPU intensive process. The basic problem can be stated as follows: If a parallel application executes in time T on a dedicated testbed, what is the expected execution time if one of the nodes has a competing load? Suppose an application repeatedly executes on a CPU for busytimephase seconds and then sleeps for idletimephase seconds during dedicated execution. When the same application has to share the CPU with a compute intensive load, the scheduler will attempt to give equal CPU time slices to the two competing processes. The impact on the overall execution time due to CPU sharing depends on the values of busytimephase and idletimephase as illustrated in Figure 4 and explained below for different cases: – idletimephase = 0 : The CPU is always busy without load. The two processes get alternate equal time slices and the execution time doubles as shown in Figure 4(a). – busytimephase < idletimephase : There is no increase in the execution time. Since the CPU is idle over half the time without load, the competing process

Table 1. Measured execution characteristics of NAS benchmarks

Benchmark | Dedicated execution time (s) | % CPU computation | % CPU communication | % CPU idle | Avg. busy phase (ms) | Avg. idle phase (ms) | Messages on one link | Avg. message size (KBytes)
CG | 25.6 | 49.6 | 23.1 | 27.3 | 60.8 | 22.8 | 1264 | 18.4
IS | 40.1 | 43.4 | 38.1 | 18.5 | 2510.6 | 531.4 | 11 | 2117.5
MG | 43.9 | 71.8 | 14.4 | 13.8 | 1113.4 | 183.0 | 228 | 55.0
SP | 619.5 | 73.7 | 19.7 | 6.6 | 635.8 | 44.8 | 1606 | 102.4
LU | 563.5 | 88.1 | 4.0 | 7.9 | 1494.0 | 66.0 | 15752 | 3.8
BT | 898.3 | 89.6 | 2.7 | 7.7 | 2126.0 | 64.0 | 806 | 117.1
EP | 104.6 | 94.1 | 0 | 5.9 | 98420.0 | 618.0 | 0 | 0
gets more than its fair share of the CPU from the idle CPU cycles. This is illustrated in Figure 4(b).
– busytimephase > idletimephase: In this situation, the competing process cannot get its entire fair share of the CPU from idle cycles. The scheduler gives equal time slices to the two processes. This case is illustrated in Figure 4(c). The net effect is that every cycle of duration busytimephase + idletimephase now executes in 2 ∗ busytimephase time. Alternately stated, the execution time will increase by a factor of

(busytimephase − idletimephase) / (busytimephase + idletimephase)

Once the CPU usage pattern of an application is known, the execution time with a compute load can be estimated based on the above discussion. For most parallel applications, the CPU usage generally follows a pattern where busytimephase > idletimephase. A scheduler provides fairness by allocating a higher fraction of CPU in the near future to a process that had to relinquish its time slice because it entered an idle phase, providing a smoothing effect. For a parallel application that has the CPU busy cpubusy seconds, and idle for cpuidle seconds on aggregate during execution, the execution time often simply increases to 2 ∗ cpubusy seconds. This is the case for all NAS benchmark programs. The exception is when an application has long intervals of low CPU usage and long intervals of high CPU usage. In those cases, the impact on different phases of execution has to be computed separately and combined. Note that the execution time with two or more competing loads, or for a given UNIX load average, can be predicted in a similar fashion. In order to validate this approach, the execution characteristics of the NAS programs were computed as discussed and the execution time with sharing of

Fig. 4. Relationship between CPU usage pattern during dedicated execution and execution pattern when the CPU has to be shared with a compute load. (Timelines of application execution, load execution, and idle time for the three cases: (a) no idle time, idletimephase = 0; (b) busytimephase < idletimephase; (c) busytimephase > idletimephase.)

CPU on one node was estimated. The benchmarks were then executed with a load on one node and the predicted execution time was compared with the corresponding measured execution time. The results are presented in Figure 5. There is a close correspondence between predicted and measured values for all benchmarks, validating our prediction model for this simple scenario. It is clear from Figure 5 that our estimates are significantly more accurate than the naive prediction that the execution time doubles when the CPU has to be shared with another program.
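
The single-node estimate just described reduces to a few lines of arithmetic. The sketch below treats a run as repetitions of the average busy/idle phases reported in Table 1 and applies the phase rule above; the selection of benchmarks shown and the treatment of the whole run as uniform cycles are simplifying assumptions.

```python
def loaded_time_one_node(busy_phase_ms, idle_phase_ms, dedicated_s):
    """Estimated execution time when one node shares its CPU with a single
    compute-bound load, following the phase-based rule derived above."""
    if busy_phase_ms <= idle_phase_ms:
        return dedicated_s                     # idle cycles absorb the load
    cycle = busy_phase_ms + idle_phase_ms
    return dedicated_s * (2.0 * busy_phase_ms) / cycle   # each cycle stretches to 2*busy

# Average busy/idle phases (ms) and dedicated run times (s) from Table 1.
table1 = {"CG": (60.8, 22.8, 25.6), "IS": (2510.6, 531.4, 40.1),
          "SP": (635.8, 44.8, 619.5), "EP": (98420.0, 618.0, 104.6)}
for name, (busy, idle, t) in table1.items():
    est = loaded_time_one_node(busy, idle, t)
    print(f"{name}: dedicated {t:.1f} s -> about {est:.1f} s with a load on one node")
```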

3.3 CPU Shared on Multiple Nodes

We now consider the case where the CPUs on all nodes have to be shared with a competing load. Additional complication is caused by the fact that each of the nodes is scheduled independently, i.e., there is no coordinated (or gang) scheduling. This does not impact the performance of local computations but can have a significant impact on communication. When one process is ready to send a message, the receiving process may not be executing, leading to additional communication and synchronization delays. It is virtually impossible to predict the exact sequence of events and arrive at precise performance predictions in this case [14]. Therefore, we focus on developing upper and lower bounds for execution time.

Fig. 5. Comparison of predicted and measured execution times with a competing compute load on one node. The execution time with no load is normalized to 100 units for each program. (Bars of predicted and measured normalized execution time for CG, IS, MG, SP, LU, BT, and EP.)

During program execution without load, the CPU at any given time is computing, communicating or idle. We discuss the impact on the time spent on each of these modes when there is a competing load on all nodes. – Computation: The time spent on local computations with load can be computed exactly as in the case of a compute load on only one node that was discussed earlier. For most parallel applications, this time doubles with fair CPU sharing. – Communication: The CPU time for communication is first expected to double because of CPU sharing. Completion of a communication operation implemented over the networking (TCP/IP) stack requires active processing on sender and receiver nodes even for asynchronous operations. Further, when one process is ready to communicate with a peer, the peer process may not be executing due to CPU sharing since all nodes are scheduled independently. The probability that a process is executing at a given point when two processes are sharing the CPU is 50%. If a peer process is not active, the process initiating the communication may have to wait half a CPU time slice to start communicating. A simple analysis shows that the communication time could double again due to independent scheduling. However, this is the statistical worst case scenario since the scheduler will try to compensate the processes that had to wait, and because pairs of processes can start executing in lockstep in the case of regular communication. Hence, the communication time may increase by up to a factor of 4.

Fig. 6. Comparison of predicted and measured execution times with a competing compute load on all nodes. (Bars of the predicted upper bound, predicted lower bound, and measured normalized execution time for CG, IS, MG, SP, LU, BT, and EP.)

– Idle: For compute bound parallel programs, the CPU is idle during execution primarily waiting for messages or signals from another node. Hence, the idle time occurs while waiting for a sequence of computation and communication activities involving other executing nodes to complete. The time taken for computation and communication activities is expected to increase by factors of 2 and 4, respectively, with CPU sharing. Hence, in the worst case, the idle time may increase by a factor of 4.

Based on this discussion, we have the following result. Suppose comptime, commtime and idletime are the time spent by the node CPUs computing, communicating, and idling during execution on a dedicated testbed. The execution time is bounded from above by:

2 ∗ comptime + 4 ∗ (commtime + idletime)

The execution time is also loosely bounded from below by:

2 ∗ (comptime + commtime)

which is the expected execution time when the CPU is shared on only one node. For validation, the higher and lower bounds for execution time for the NAS benchmarks with a single load process on all nodes were computed and compared with the measured execution time under those conditions. The results are charted in Figure 6. We observe that the measured execution time is always within the computed bounds. The range between the bounds is large for communication intensive programs, particularly CG and IS. For most applications, the measured values are in the upper part of the range, often very close to the predicted upper bound.


The main conclusion is that the above analysis can be used to compute a meaningful upper bound for execution time with load on all nodes and independent scheduling. We restate that a good upper bound on execution time (or a lower bound on performance) is valuable for resource selection.
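
The bounds above are straightforward to compute once the dedicated-execution breakdown is known. The sketch below applies them to CG using the percentages from Table 1; splitting the dedicated run time by those percentages is an assumption made only for this illustration.

```python
def bounds_all_nodes_loaded(comp_s, comm_s, idle_s):
    """Lower and upper bounds (seconds) on execution time when every node
    shares its CPU with one compute-bound load, per the expressions above."""
    upper = 2 * comp_s + 4 * (comm_s + idle_s)
    lower = 2 * (comp_s + comm_s)      # same as the single loaded node estimate
    return lower, upper

# CG from Table 1: 25.6 s dedicated; 49.6% computing, 23.1% communicating, 27.3% idle.
t = 25.6
lower, upper = bounds_all_nodes_loaded(0.496 * t, 0.231 * t, 0.273 * t)
print(f"CG with a load on all nodes: between {lower:.1f} s and {upper:.1f} s")
```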

4 Modeling Performance with Communication Link Sharing

4.1 One Shared Link

We study the impact on execution time if a network link has to be shared or the performance of a link changes for any reason. We assume that the performance of a given network link, characterized by the effective latency and bandwidth observed by a communicating application, is known. We want to stress that finding the expected performance on a network link is far from trivial in general, even when the capacity of the link and the traffic being carried by the link are known. The basic problem can be stated as follows: If a parallel application executes in time T on a dedicated testbed, what is the expected execution time if the effective latency and bandwidth on a network link change from L and B to newL and newB, respectively? The difference in execution time will be the difference in the time taken for sending and receiving messages after the link properties have changed. If the number of messages traversing this communication link is nummsgs and the average message size is avgmsgsize, then the time needed for communication increases by:

[(newL + avgmsgsize/newB) − (L + avgmsgsize/B)] ∗ nummsgs

We use this equation to predict the increase in execution time when the effective bandwidth and latency on a communication link change. For the purpose of validation, the available bandwidth on one of the links on our 100Mbps Ethernet testbed was reduced to a nominal 10Mbps with the dummynet [26] tool. The characteristics of the changed network were measured and the information was used to predict the execution time for each NAS benchmark program. The programs were then executed on this modified network and the measured and predicted execution times were compared. The results are presented in Figure 7. We observe that the predicted and measured values are fairly close, demonstrating that the prediction model is effective in this simple scenario of resource sharing.
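
The expression above translates directly into code. In the sketch below the message count and average size are taken from Table 1 for MG, while the latency and bandwidth figures are illustrative assumptions rather than measurements from the paper.

```python
def comm_time_increase(n_msgs, avg_msg_bytes, lat_old_s, bw_old_bps, lat_new_s, bw_new_bps):
    """Increase in total communication time (seconds) when a link's effective
    latency (s) and bandwidth (bytes/s) change, following the expression above."""
    old = lat_old_s + avg_msg_bytes / bw_old_bps
    new = lat_new_s + avg_msg_bytes / bw_new_bps
    return (new - old) * n_msgs

# MG: 228 messages of about 55 KB on one link (Table 1); assume 100 us latency and
# a nominal drop from 100 Mbps to 10 Mbps (12.5 MB/s to 1.25 MB/s).
extra = comm_time_increase(228, 55.0e3, 100e-6, 12.5e6, 100e-6, 1.25e6)
print(f"Estimated extra communication time for MG: {extra:.1f} s")
```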

Fig. 7. Comparison of predicted and measured execution times with bandwidth reduced to 10 Mbps from 100 Mbps on one communication link. The execution time with no load is normalized to 100 units for each program. (Bars of predicted and measured normalized execution time for CG, IS, MG, SP, LU, BT, and EP.)

4.2 Multiple Shared Links

When the performance of multiple communication links is reduced, the application performance also suffers indirectly because of synchronization effects. As in the case of load on all nodes, we discuss how the time the CPU spends on computation, communication, and idle phases during execution on a dedicated testbed changes due to link sharing.
– Computation: The time spent on local computations remains unchanged with link sharing.
– Communication: The time for communication will increase as discussed in the case of sharing of a single link. The same model can be used to compute the increase in communication time.
– Idle: As discussed earlier, the idle time at nodes of an executing parallel program occurs while waiting for a sequence of computation and communication activities involving other executing nodes to complete. Hence, in the worst case, the idle time may increase by the same factor as the communication time.
We introduce commratio as the factor by which the time taken to perform the message exchange sequences on the executing nodes is expected to slow down due to link sharing. (The largest value is used when different nodes perform different sequences of communication.) That is, the total time to physically transfer all messages in an application run is expected to change by a factor commratio,
not including any synchronization related delay. This commratio is determined by two factors:
1. The message sequence sent between a pair of nodes, including the size of each message.
2. The time to transport a message of a given size between a pair of nodes.
The message sequences exchanged between nodes are computed ahead of time as discussed earlier in this paper. The time to transfer a message depends on the application level latency and bandwidth between the pair of nodes. A network measurement tool like NWS is used to obtain these characteristics. For the purpose of experiments reported in this paper, the effective latency and bandwidth were determined by careful benchmarking ahead of time with different message sizes and different available network bandwidths. The reason for choosing this way is to factor out errors in network measurements in order to focus on performance modeling. We then have the following result. Suppose comptime, commtime and idletime are the time spent by the node CPUs computing, communicating, and idling during execution on a dedicated testbed. An upper bound on the application execution time due to a change in the characteristics of all links is:

comptime + commratio ∗ (commtime + idletime)

A corresponding lower bound on execution time is:

comptime + commratio ∗ commtime + idletime

For validation, the bounds for execution time for the NAS benchmarks with nominal available bandwidth reduced from 100Mbps to 10Mbps were computed and compared with the measured execution times under those conditions. The results are charted in Figure 8. We note that all measured execution times are within the bounds, except that the measured execution time for IS is marginally higher than the upper bound. As expected, the range covered by the bounds is larger for the communication intensive programs CG, IS and MG. In the case of IS, the measured value is near the upper bound, implying that synchronization waits are primarily related to communication in the program, while the measured execution time of MG is near the lower bound, indicating that the synchronization waits are primarily associated with computations on other nodes. The main conclusion is that this analysis can be used to compute meaningful upper and lower bounds for execution time with sharing on all communication links. We have used the execution time from a representative run of the application for each scenario for our results. In most cases the range of execution time observed is small, typically under 1%. However, for applications with a high rate of message exchange, significant variation was observed between runs. For example, in the case of LU running with competing loads on all nodes, the difference between the slowest and fastest runs was around 10%.
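
A sketch of how commratio and the resulting bounds might be computed is given below. The message sizes, latency, and bandwidth figures are illustrative assumptions; the percentages and dedicated run time for IS come from Table 1.

```python
def estimate_commratio(msg_sizes_bytes, old_link, new_link):
    """commratio = total transfer time on the changed link / original transfer time.
    Each link is a (latency_seconds, bandwidth_bytes_per_second) pair."""
    def total(link):
        lat, bw = link
        return sum(lat + size / bw for size in msg_sizes_bytes)
    return total(new_link) / total(old_link)

def link_sharing_bounds(comp_s, comm_s, idle_s, commratio):
    """Bounds on execution time when all links slow message transfers by commratio."""
    upper = comp_s + commratio * (comm_s + idle_s)
    lower = comp_s + commratio * comm_s + idle_s
    return lower, upper

# IS-like traffic: 11 messages of roughly 2.1 MB; 100 Mbps throttled to 10 Mbps.
ratio = estimate_commratio([2.1e6] * 11, (100e-6, 12.5e6), (100e-6, 1.25e6))
t = 40.1                                          # IS dedicated time from Table 1
lower, upper = link_sharing_bounds(0.434 * t, 0.381 * t, 0.185 * t, ratio)
print(f"IS with all links at 10 Mbps: between {lower:.1f} s and {upper:.1f} s "
      f"(commratio {ratio:.1f})")
```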

Fig. 8. Comparison of estimated and measured execution times with bandwidth limited to 10 Mbps on all communication links. (Bars of the predicted lower bound, predicted upper bound, and measured normalized execution time for CG, IS, MG, SP, LU, BT, and EP.)

5 Limitations and Extensions

We have made a number of assumptions, implicit and explicit, in our treatment, and presented results for only a few scenarios. We now attempt to distinguish between the fundamental limitations of this work and the assumptions that were made for simplicity.
– Estimation of Network Performance: For estimation of performance on a new or changed network, we assume that the expected latency and bandwidth on the network links are known and can be predicted for the duration of the experiments. The results of performance estimation can only be as good as the prediction of network behavior. Estimation of expected network performance is a major challenge for network monitoring tools and accurate prediction is often not possible. However, this is orthogonal to the research presented in this paper. Our goal is to find the best performance estimates for given network characteristics.
– Asymmetrical Computation Loads and Traffic: We have developed results for the cases of equal loads on all nodes and equal sharing on all links. This was done for simplicity. The approach is applicable for different loads on different nodes and different available bandwidth on different links. The necessary input for analysis is the load average on every node and the expected latency and bandwidth on links. In such situations, typically the slowest node and the slowest link will determine the bounds on application speed. The modeling is also applicable when there is sharing of nodes as well as
links but we have omitted the details due to lack of space. More details are described in [27].
– Asymmetrical Applications: We have implicitly assumed that all nodes executing an application are following a similar execution pattern. In case of asymmetrical execution, the approach is applicable but the analysis has to be done for each individual node separately before estimating overall application performance. Similarly, if an application executes in distinctly different phases, the analysis would have to be performed separately for each phase.
– Execution on a Different Architecture from Where an Application Performance Model Was Prototyped: If the relative execution speed between the prototyping and execution nodes is fixed and can be determined, and the latency and bandwidth of the executing network can be inferred, a prediction can be performed. This task is relatively simple when moving between nodes of similar architectures, but is very complex if the executing nodes have a fundamentally different cache hierarchy or processor architecture as compared to the prototyping nodes.
– Wide Area Networks: All results presented in this paper are for a local cluster. The basic principles are designed to apply across wide area networks also, although the accuracy of the methodology may be different. An important issue is that our model does not account for sharing of bandwidth by different communication streams within an application. This is normally not a major factor in a small cluster where a crossbar switch allows all nodes to simultaneously communicate at maximum link speed. However, it is an important consideration in wide area networks where several application streams may share a limited bandwidth network route.
– Large Systems: The results developed in this paper are independent of the number of nodes but the experimentation was performed only on a small cluster. How well this framework will work in practice for large systems remains an open question.
– Memory and I/O Constraints: This paper does not address memory bound or I/O bound applications. In particular, it is assumed that sufficient memory is available for the working sets of applications even with sharing. In our evaluation experiments, the synthetic competing applications do not consume a significant amount of storage and hence the caching behavior of the benchmarks is not affected with processor sharing. Clearly more analysis is needed to give appropriate consideration to the storage hierarchy, which is critical in many scenarios.
– Different Data Sets and Number of Nodes than the Prototyping Testbed: If the performance pattern is strongly data dependent, an accurate prediction is not possible but the results from this work may still be used as a guideline. This work does not make a contribution for performance prediction when the number of nodes is scaled, but we conjecture that it can be matched with other known techniques.
– Application Level Load Balancing: We assume that each application node performs the same amount of work independent of CPU and network conditions. Hence, if the application had internal load balancing, e.g., a master-slave computation where the work assigned to slaves depends on their execution speed, then our prediction model cannot be applied directly.

6 Conclusions

This paper demonstrates that detailed measurement of the resources that an application needs and uses can be employed to build an accurate model to predict the performance of the same application under different network conditions. Such a prediction framework can be applied to applications developed with any programming model since it is based on system level measurements alone and does not employ source code analysis. In our experiments, the framework was effective in predicting the execution time or execution time bounds of the programs in the NAS parallel benchmark suite in a variety of network conditions. To our knowledge, this is the first effort in the specific direction of building a model to estimate application performance in different resource sharing scenarios, and perhaps, this paper raises more questions than it answers. Some of the direct questions about the applicability of this approach are discussed (but not necessarily answered) in the previous section. Different application, network, processor and system architectures raise issues that affect the applicability of the simple techniques developed in this paper. However, our view is that most of those problems can be overcome with improvement of the methodology that was employed. More fundamentally, the whole approach is based on the ability to predict the availability of networked resources in the near future. If resource availability on a network changes in a completely dynamic and unpredictable fashion, no best effort resource selection method will work satisfactorily. In practice, while future network state is far from predictable, reasonable estimates of the future network status can be obtained based on recent measurements. The practical implication is that the methods in this paper may only give a rough estimate of the expected performance on a given part of the network, since the application performance estimate is, at best, as good as the estimate of the resource availability on the network. However, these performance estimates are still a big improvement over current techniques that either do not consider application characteristics, or use a simplistic qualitative description of an application such as master-slave or SPMD. Even an approximate performance prediction may be able to effectively make a perfect (or the best possible) scheduling decision by selecting the ideal nodes for execution. In summary, the ability to predict the expected performance of an application on a given set of nodes, and using this prediction for making the best possible resource choices for execution, is a challenging problem which is far from solved by the research presented in this paper. However, this paper makes a clear contribution toward predicting application performance or application performance bounds. We believe this is an important step toward building good resource selection systems for shared computation environments.


Acknowledgments

This research was supported in part by the Los Alamos Computer Science Institute (LACSI) through Los Alamos National Laboratory (LANL) contract number 03891-99-23 as part of the prime contract (W-7405-ENG-36) between the DOE and the Regents of the University of California. Support was also provided by the National Science Foundation under award number NSF ACI-0234328 and the University of Houston's Texas Learning and Computation Center. We wish to thank other current and former members of our research group, in particular, Mala Ghanesh, Amitoj Singh, and Sukhdeep Sodhi, for their contributions. Finally, the paper is much improved as a result of the comments and suggestions made by the anonymous reviewers.


Scaling of Workload Traces

Carsten Ernemann, Baiyi Song, and Ramin Yahyapour

Computer Engineering Institute, University Dortmund, 44221 Dortmund, Germany
{carsten.ernemann,song.baiyi,ramin.yahyapour}@udo.edu

Abstract. The design and evaluation of job scheduling strategies often require simulations with workload data or models. Workload traces are usually the most realistic data source, as they include all explicit and implicit job patterns which are not always considered in a model. In this paper, a method is presented to enlarge and/or duplicate jobs in a given workload. This allows the scaling of workloads for later use on parallel machine configurations with a different number of processors. As quality criteria, the scheduling results produced by common algorithms have been examined. The results show a high sensitivity of schedule attributes to modifications of the workload. To this end, different strategies of scaling the number of job copies and/or the job size have been examined. The best results were achieved by adjusting the scaling factors to be higher than the precise ratio between the new, scaled machine size and the original source configuration.

1 Introduction

The scheduling system is an important component of a parallel computer. The applied scheduling strategy has a direct impact on the overall performance of the computer system with respect to the scheduling policy and objective. The design of such a scheduling system is a complex task which requires several steps, see [13]. The evaluation of scheduling algorithms is important to identify the appropriate algorithm and the corresponding parameter settings. The results of theoretical worst-case analysis are only of limited help, as typical workloads on production machines do not normally exhibit the specific structure that creates a really bad case. In addition, theoretical analysis is often very difficult to apply to many scheduling strategies. Further, there is no random distribution of job parameter values, see e.g. Feitelson and Nitzberg [9]. Instead, the job parameters depend on several patterns, relations, and dependencies. Hence, a theoretical analysis of random workloads will not provide the desired information either. A trial and error approach on a commercial machine is tedious and significantly affects the system performance. Thus, it is usually not practicable to use a production machine for the evaluation except for the final testing. This just leaves simulation for all other cases. Simulations may either be based on real trace data or on a workload model.
Workload models, see e.g. Jann et al. [12] or Feitelson and Nitzberg [9], enable a wide range of simulations by allowing job modifications, like a varying amount of assigned processor resources. However, many unknown dependencies and patterns may shape the actual workload of a real system. This is especially true as the characteristics of a workload usually change over time, ranging from daily or weekly cycles to changes in the job submissions over a year and over the lifetime of a parallel machine. Here, the consistency of a statistically generated workload model with real workloads is difficult to guarantee. On the other hand, trace data restrict the freedom of selecting different configurations and scheduling strategies, as a specific job submission depends on the original circumstances. A trace is only valid on a similar machine configuration and with the same scheduling strategy. For instance, trace data taken from a 128 processor parallel machine will lead to unrealistic results on a 256 processor machine. Therefore, the selection of the underlying data for the simulation depends on the circumstances determined by the MPP architecture as well as the scheduling strategy. A variety of examples already exists for evaluations via simulation based on a workload model, see e.g. Feitelson [5], Feitelson and Jette [8], or on trace data, see e.g. Ernemann et al. [4].

Our research on job scheduling strategies for parallel computers as well as for computational Grid environments led to the requirement of considering different resource configurations. As the individual scheduling objectives of users and owners are of high importance in this research, we have to ensure that the workload is very consistent with real demand. To this end, statistical distributions of the various parameters without the detailed dependencies between them cannot be applied. Therefore, real workload traces have been chosen as the source for our evaluations. In this paper, we address the question of how workload traces can be transformed to be used on different resource configurations while retaining important specifics.

In Section 2 we give a brief overview of previous work in workload modelling and analysis. In addition, we discuss our considerations for choosing a workload for evaluation. Our approach and the corresponding results are presented in Section 3. Finally, we conclude this paper with a brief discussion of the important key observations in Section 4.

2 Background

We consider on-line parallel job scheduling in which a stream of jobs is submitted to a job scheduler by individual users. The jobs are executed in a space-sharing fashion, and the job scheduling system is responsible for deciding when and on which set of resources each job is actually started. A job is first known to the system at its submission time. The job description contains information on its requirements, such as the number of processing nodes, the amount of memory, or the estimated execution length.

For the evaluation of scheduling methods it is a typical task to choose one or several workloads for the simulations. The designer of a scheduling algorithm must ensure that the workload is close to real user demand in the examined scenario.
Workload traces are recorded on real systems and contain information on the job requests, including the actual start and execution time of each job. Extensive research has been done to analyze workloads as well as to propose corresponding workload models, see e.g. [7, 3, 2, 1]. Generally, statistical models use distributions or a collection of distributions to describe the important features of real workload attributes and the correlations among them. Synthetic workloads are then generated by sampling from the probability distributions [12, 7]. Statistical workload models have the advantage that new sets of job submissions can be generated easily. Their consistency with real traces depends on the knowledge about the different examined parameters in the original workload. Many factors contribute to the actual process of workload generation on a real machine. Some of them are known, some are hidden and hard to deduce. It is difficult to find rules for job submissions by individual users.

The analysis of workloads shows several correlations and patterns in the workload statistics. For example, jobs on many parallel computers favor job sizes that are a power of two [15, 5, 16]. Other examples are the job distribution during the daily cycle, obviously caused by the individual working hours of the users, or the job distribution over different week days. Most approaches consider the different statistical moments in isolation. Some correlations are included in several methods. However, it is very difficult to identify whether the important rules and patterns have been extracted. In the same way, it is difficult to tell whether the inclusion of a result is actually relevant to the evaluation and therefore also relevant for the design of an algorithm. In general, only a limited number of users are active on a parallel computer, for instance, several dozen. Therefore, for some purposes it is not clear whether a given statistical model comes reasonably close to a real system. For example, some workload traces include singular outliers which significantly influence the overall scheduling result. In this case, a statistical model without this outlier might deviate significantly from the real-world result. In the same way, it may make a vast difference whether there are several outliers of the same or a similar kind. The relevance to the corresponding evaluation is difficult to judge, but this also renders the validity of the results undefined.

Due to the above mentioned reasons, it proved to be difficult to use statistical workload models for our research work. Therefore, we decided to use workload traces for our evaluations. The standard parallel workload archive [19] is a good source for job traces. However, the number of available traces is limited. Most of the workloads were observed on different supercomputers. Mainly, the total number of available processors differs among those workloads. Therefore, our aim was to find a reasonable method to scale workload traces to fit a standard supercomputer configuration. However, special care must be taken to keep the new workload as consistent as possible with the original trace. To this end, criteria for measuring the validity had to be chosen for the examined scaling methods. The following well-known workloads have been used: the CTC [11], the NASA [9], the LANL [6], the KTH [17], and three workloads from the SDSC [20]. All traces are available from the Parallel Workload Archive, see [19]. As shown in Table 1, the supercomputer from the LANL has the highest number of processors among the given computers, and so this number of processors was chosen as the standard configuration.

Table 1. The Examined Original Workload Traces

Workload   Number of jobs   Number of nodes   Size of the biggest job   Static factor f
CTC        79302            430               336                       3
NASA       42264            128               128                       8
KTH        28490            100               100                       10
LANL       201387           1024              1024                      1
SDSC95     76872            416               400                       3
SDSC96     38719            416               320                       3
SDSC00     67667            128               128                       8

Therefore, the workload from the LANL does not need to be modified, and the following modifications are only applied to the other workloads.

In comparison to statistical workload models, the use of actual workload traces is simpler, as they inherently include all submission patterns and underlying mechanisms. The traces reflect the real workload exactly. However, it is difficult to perform several simulations as the data basis is usually limited. In addition, the applicability of workload traces to other resource configurations with a different number of processors is complicated. For instance, this could result in a too high workload and an unrealistically long wait time for a job. Or, on the contrary, the machine is not fully utilized if the amount of computational work is too low. At the same time, it is difficult to change any parameter of the original workload trace, as this has an influence on its overall validity. For example, a reduction of the inter-arrival times destroys the distribution of the daily cycle. Similarly, modifications of the job length are inappropriate. Modifications of the requested processor number of a job change the original job size distribution. For instance, we might invalidate an existing preference for jobs requesting a power of 2 processors. In the same way, an alternative scaling of the number of processors requested by a job would lead to an unrealistic job size submission pattern. For example, scaling a trace taken from a 128 node MPP system to a 256 node system by just duplicating each job preserves the temporal distribution of job submissions. However, this transformation also leads to an unrealistic distribution, as no larger jobs are submitted. Note that the scaling of a workload to match a different machine configuration always alters the original distribution in some way. Therefore, as a trade-off, special care must be taken to preserve the original time correlations and job size distribution.
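To make the job size patterns discussed above easy to check once a trace is scaled, the following small sketch reads a trace and measures the share of power-of-two processor requests. It assumes the Standard Workload Format (SWF) used by the Parallel Workload Archive, with the submit time, run time, and allocated processors in fields 2, 4, and 5; these field positions and the helper functions are assumptions for illustration and not part of the evaluation tooling described in this paper.

    # Sketch: reading jobs from a trace in the Standard Workload Format (SWF) and
    # measuring the share of power-of-two processor requests, a property that a
    # scaled workload should preserve. Field positions are assumed as described above.
    def read_swf_jobs(path):
        jobs = []
        with open(path) as trace:
            for line in trace:
                if line.startswith(';') or not line.strip():  # ';' marks SWF comment/header lines
                    continue
                fields = line.split()
                jobs.append({
                    'submit': float(fields[1]),      # submit time (seconds)
                    'runtime': float(fields[3]),     # actual run time (seconds)
                    'requested': int(fields[4]),     # number of allocated processors
                })
        return jobs

    def power_of_two_share(jobs):
        # fraction of jobs whose processor request is an exact power of two
        pow2 = sum(1 for j in jobs
                   if j['requested'] > 0 and (j['requested'] & (j['requested'] - 1)) == 0)
        return pow2 / len(jobs)

Comparing power_of_two_share on the original and on a scaled trace gives a quick check of whether the preference for power-of-2 job sizes survives the transformation.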

3 Scaling Workloads to a Different Machine Size

The following three subsections present the examined methods to scale the workloads. We briefly discuss the different methods, as the results of each step motivated the next. First, it is necessary to select quality criteria for comparing the workload modifications.
Distribution functions could be used to compare the similarity of the modified workloads with the corresponding original workloads. This method might be valid; however, it is unknown whether the new workload has a similar effect on the resulting schedule as the original workload. As mentioned above, the scheduling algorithm that has been used on the original parallel machine also influences the submission behavior of the users. If a different scheduling system is applied and causes different response times, this will most certainly influence the submission pattern of later arriving jobs. This is a general problem [3, 1] that has to be kept in mind if workload traces or statistical models are used to evaluate new scheduling systems. The problem could be solved if the feedback mechanism of prior scheduling results on new job submissions were known. However, such feedback modelling is a difficult topic, as the underlying mechanisms vary between individual users and between single jobs.

For our evaluation, we have chosen the Average Weighted Response Time (AWRT) and the Average Weighted Wait Time (AWWT) generated by the scheduling process. Several other scheduling criteria, for instance the slowdown, can be derived from AWRT and AWWT. To match the original scheduling systems, we used First-Come-First-Serve [18] and EASY-Backfilling [17, 14] for generating the AWRT and AWWT. These scheduling methods are well known and were used for most of the original workloads. Note that the focus of this paper is not to compare the quality of the two scheduling strategies. Instead, we use the results of each algorithm to compare the similarity of each modified workload with the corresponding original workload. The definitions (1) to (3) apply, where index j denotes job j.

\text{Resource Consumption}_j = \text{requestedResources}_j \cdot (\text{endTime}_j - \text{startTime}_j) \quad (1)

\text{AWRT} = \frac{\sum_{j \in \text{Jobs}} \text{Resource Consumption}_j \cdot (\text{endTime}_j - \text{submitTime}_j)}{\sum_{j \in \text{Jobs}} \text{Resource Consumption}_j} \quad (2)

\text{AWWT} = \frac{\sum_{j \in \text{Jobs}} \text{Resource Consumption}_j \cdot (\text{startTime}_j - \text{submitTime}_j)}{\sum_{j \in \text{Jobs}} \text{Resource Consumption}_j} \quad (3)

In addition, the makespan is considered, which is the end time of the last job within the workload. The Squashed Area is given as a measure of the amount of processing power consumed by a workload; it is defined in (4).

\text{Squashed Area} = \sum_{j \in \text{Jobs}} \text{Resource Consumption}_j \quad (4)

Note that in the following we refer to jobs with a higher number of requested processors as bigger jobs, and to jobs with a smaller processor demand as smaller jobs.

Scaling only the number of requested processors of a job results in the problem that the whole job size distribution is scaled by this factor. In this case the modified workload might not contain jobs requesting one or only a few processors. In addition, the preference for jobs requesting a power of 2 processors is not modelled correctly for most scaling factors. Alternatively, the number of jobs can be scaled, i.e., each original job is duplicated into several jobs in the new workload. Using only this approach has the disadvantage that the new workload contains more small jobs in relation to the original workload. For instance, if the biggest job in the original workload uses the whole machine, a duplication of each job for a machine with twice the number of processors leads to a new workload in which no job requests the maximum number of processors at all.

3.1 Precise Scaling of Job Size

Based on the considerations above, a factor f is calculated that combines the scaling of the requested processor number of each job with the scaling of the total number of jobs. Table 1 lists the maximum number of processors requested by a job as well as the total number of available processors. As explained above, multiplying solely the number of processors of a job or solely the number of jobs by a constant factor is not reasonable. Therefore, the following combination of both strategies has been applied. In order to analyze the influence of both possibilities, the workloads were modified by using a probabilistic approach: a probability value p is used to specify whether the requested number of processors of a job is multiplied or whether copies of this job are created. During the scaling process each job of the original workload is modified by only one of the two alternatives. A random value between 0 and 100 is generated for p. A decision value d is used to discriminate which alternative is applied to a job: if the value p produced by the random generator is greater than d, the number of requested processors of the job is scaled; otherwise, f identical new jobs are included in the new workload. Hence, a greater value of d favors the creation of smaller jobs, while a smaller value of d results in fewer but bigger jobs.

As a first approach, integer scaling factors were chosen based on the relation to a 1024 processor machine. We restricted ourselves to integer factors, as modelling fractional job parts would require additional considerations. For the KTH workload a factor f of 10 is chosen, for the NASA and the SDSC00 workloads a factor of 8, and for all other workloads a factor of 3. Note that for the SDSC95 workload one job would request more than 1024 processors if multiplied by 3; this single job is therefore reduced to 1024 processors. For the examination of the influence of d, we created 100 modified workloads for each original workload with d between 0 and 100. However, with the exception of the NASA traces, this method did not produce satisfying results for the workload scaling. The imprecise factors increased the overall amount of workload by up to 26%, which led to a jump of the AWRT and AWWT by several factors. This shows how important the precise scaling of the overall amount of workload is.
Second, if the chosen factor f is smaller than the precise scaling factor, the workloads which prefer smaller jobs scale better than the workloads with bigger jobs. If f is smaller than or equal to the precise scaling factor, the modified workloads scale better for smaller values of d.

Based on these results, we introduced a precise scaling of the job size. As the scaling factors for the workloads CTC, KTH, SDSC95 and SDSC96 are not integer values, an extension of the previous method was necessary: in the case that a single large job is created, its number of requested processors is multiplied by the precise scaling factor and rounded. The scheduling results for the modified workloads are presented in Table 2. Only the results for the original workload (ref) and the modified workloads with the parameter settings d = {1, 50, 99} are shown. Now the modified CTC-based workloads are close to the original workloads in terms of AWWT, AWRT and utilization if only bigger jobs are created (d = 1). For increasing values of d, AWRT, AWWT and utilization increase as well. Overall, the results are closer to the original results than those obtained with an integer factor. A similar behavior can be found for the SDSC95 and SDSC96 workload modifications. For KTH the results are similar, with the exception that the results converge to the original workload for decreasing d. The modified NASA workloads show very similar AWRT and AWWT values for the derived and original workloads, independent of the scheduling algorithm used. Note that the NASA workload itself is quite different from the other workloads, as it includes a high percentage of interactive jobs. In general, the results for this method are still not satisfying. Using a value of d = 1 is not realistic, as mentioned in Section 2, because small jobs are then missing in relation to the original workload.
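The probabilistic scaling described in this subsection can be summarized by a small sketch. It is an illustration under assumptions (job representation, capping at the target machine size) and not code from the paper; the decision value d and the integer factor f follow the description in the text.

    # Sketch of the probabilistic scaling with an integer factor f: a random value p in
    # [0, 100) decides for each job whether it is widened (processor request multiplied
    # by f, capped at the target machine size) or duplicated into f identical copies.
    import random

    def scale_with_integer_factor(jobs, f, d, target_size=1024):
        scaled = []
        for job in jobs:
            p = random.uniform(0, 100)
            if p > d:
                wide = dict(job)                     # widen the job
                wide['requested'] = min(job['requested'] * f, target_size)
                scaled.append(wide)
            else:
                scaled.extend(dict(job) for _ in range(f))   # duplicate the job f times
        return scaled

With d = 1 almost every job is widened and with d = 99 almost every job is duplicated, matching the parameter settings reported in Table 2.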

3.2 Precise Scaling of Number and Size of Jobs

Consequently, the precise factor is also used for the duplication of jobs. However, as mentioned above, it is not trivial to create fractions of jobs. To this end, a second random variable p1 was introduced with values between 0 and 100. The variable p1 is used to decide whether the lower or the upper integer bound of the precise scaling factor is applied. For instance, the precise scaling factor for the CTC workload is 2.3814, so p1 decides whether a factor of 2 or 3 is used for a job: if p1 is smaller than 38.14 the upper bound of 3 is used, and 2 otherwise, so that the scaling factor is about 2.3814 on average. For the other workloads we used the same strategy with decision values of 24.00 for the KTH workload and 46.15 for the SDSC95 and SDSC96 workloads. This enhanced method improves the results significantly. In Table 3 the main results are summarized. Except for the simulations with the SDSC00 workload, all results show a clear improvement in terms of similar utilization for each of the corresponding workloads. The results for the CTC show again that only small values of d lead to convergence of AWRT and AWWT to the original workload.

The same qualitative behavior can be observed for the workloads derived from the KTH and SDSC00 workloads. The results for the NASA workload show that AWRT and AWWT do not change between the presented methods. This leads to the assumption that this specific NASA workload does not contain enough load to produce job delays. The results of the modifications for the SDSC95 and SDSC96 derived workloads are already acceptable, as the AWRT and AWWT of the original workloads and of the modified workloads with a mixture of smaller and bigger jobs (d = 50) are already very close. For these two workloads the scaling is acceptable.

In general, it can be summarized that the modifications still do not produce matching results for all original workloads. Although we use precise factors for scaling the job number and the job width, some of the scaled workloads yield better results than the original workload. This is probably caused by the fact that, depending on the factor d, the scaled workload is distributed over either more but smaller jobs (d = 99) or fewer but bigger jobs (d = 1). As mentioned before, the existence of more small jobs in a workload usually improves the scheduling result. The results show that a larger machine leads to smaller AWRT and AWWT values. Or, conversely, a larger machine can execute relatively more workload than a corresponding number of smaller machines at the same AWRT or AWWT. However, this applies only to the described workload modifications, in which relatively more small jobs are generated in relation to the original workload.
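The handling of the fractional part of the precise factor can be sketched as a stochastic rounding step that replaces the fixed integer f in the earlier sketch. The interpretation that the upper bound is drawn with a probability equal to the fractional part, so that the expected factor equals the precise factor, is an assumption consistent with the decision values quoted above.

    # Sketch: per-job realization of a non-integer precise scaling factor, e.g. 2.3814
    # for the CTC trace. The ceiling is drawn with probability equal to the fractional
    # part (38.14% here), so the scaling factor equals the precise value on average.
    import math
    import random

    def draw_scaling_factor(precise_factor):
        lower = math.floor(precise_factor)
        fraction = precise_factor - lower    # 0.3814 for CTC, 0.24 for KTH, 0.4615 for SDSC95/96
        p1 = random.uniform(0, 100)
        return lower + 1 if p1 < 100 * fraction else lower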

Table 2. Results for Precise Scaling for the Job Size and Estimated Scaling for Job Number: for each workload (CTC, KTH, NASA, SDSC00, SDSC95, SDSC96), each scheduling policy (EASY, FCFS), and the settings ref (original trace) and d = 1, 50, 99, the table reports the number of jobs, the makespan, the utilization, the AWWT, the AWRT, and the squashed area. (Full tabular data omitted.)

Table 3. Results using Precise Factors for Job Number and Size, in the same format as Table 2: number of jobs, makespan, utilization, AWWT, AWRT, and squashed area for each workload, scheduling policy, and d = 1, 50, 99 compared to the original trace (ref). (Full tabular data omitted.)

Table 4. Results for Increased Scaling Factors with d = 50: for each workload and scheduling policy, the adjusted factor f together with the number of jobs, makespan, utilization, AWWT, AWRT, and squashed area of the scaled workload, compared to the original trace (ref). (Full tabular data omitted.)

3.3 Adjusting the Scaling Factor

In order to compensate for the above-mentioned scheduling advantage of having more small jobs in relation to the original workload, the scaling factor f was modified to increase the overall amount of workload. The aim is to find a scaling factor f such that the results in terms of AWRT and AWWT match the original workload for d = 50. In this way, a combination of bigger as well as more smaller jobs exists. To this end, additional simulations have been performed with small increments of f. In Table 4 the corresponding results are summarized; more extended results are shown in Table 5 in the appendix.

It can be observed that the scheduling behavior does not depend strictly linearly on the incremented scaling factor f. The precise scaling factor for the CTC workload is 2.3814, whereas a slightly higher scaling factor yields an AWRT and AWWT close to the original workload results. The actual values differ slightly, e.g., for the EASY (f = 2.43) and the FCFS strategy (f = 2.45). Note that the makespan stays constant for different scaling factors. Obviously the makespan is dominated by one of the last jobs and is therefore independent of the increasing amount of computational work (squashed area, utilization and the number of jobs). This underlines that the makespan is predominantly an off-line scheduling criterion [10]. In an on-line scenario new jobs are continuously submitted to the system, and the last submitted jobs influence the makespan without regard to the overall scheduling performance on the whole workload.

An analogous procedure can be applied to the KTH, SDSC95 and SDSC96 workloads; the achieved results are very similar. The increment of the scaling factor f for the NASA workload leads to different effects: a marginal increase causes a significant change of the scheduling behavior. The values of the AWRT and AWWT increase drastically, while the makespan, the utilization and the amount of workload stay almost constant. This indicates that the original NASA workload contains almost no wait time, as a new job is typically started as soon as the previous job has finished. The approximation of an appropriate scaling factor for the SDSC00 workload differs from the previously described process, as the results for the EASY and FCFS strategies differ considerably. Here the AWRT and the AWWT for FCFS are more than an order of magnitude higher than with EASY backfilling. Obviously, the SDSC00 workload contains highly parallel jobs, which causes FCFS to suffer in comparison to EASY backfilling. In our opinion, it is more reasonable to use the results of the EASY strategy for the workload scaling, because the EASY strategy is more representative of many current systems and of the observed workloads. However, as discussed above, if the presented scaling methods are applied to other traces, it is necessary to use the original scheduling method that produced the workload trace.
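The calibration of f can be sketched as a simple search loop. The simulate() and scale_workload() helpers, the step size, and the tolerance are assumptions used for illustration; the paper determines suitable factors by running additional simulations with small increments of f.

    # Sketch: increase f in small steps until the AWRT of the scaled workload at d = 50
    # matches the AWRT of the original trace under the same scheduling strategy.
    def calibrate_factor(original_jobs, precise_factor, simulate, scale_workload,
                         d=50, step=0.01, tolerance=0.01, max_steps=100):
        target_awrt = simulate(original_jobs)             # AWRT of the original trace
        f = precise_factor
        for _ in range(max_steps):
            scaled = scale_workload(original_jobs, f, d)  # scaling routine as sketched earlier
            if abs(simulate(scaled) - target_awrt) / target_awrt <= tolerance:
                return f
            f += step
        return f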

4 Conclusion

In this paper we proposed a procedure for scaling different workloads to a uniform supercomputer configuration. To this end, the different development steps have been
presented, as each step motivated the corresponding next one. We used combinations of duplicating jobs and/or modifying the requested processor numbers. The results showed again how sensitively workloads react to modifications. Therefore, several steps were necessary to ensure that the scaled workload shows a scheduling behavior similar to the original one. Resulting schedule attributes, such as the average weighted response or wait time, have been used as quality criteria. The significant differences between the intermediate results for the modified workloads indicate the general difficulty of generating realistic workload models.

The presented method is motivated by the fact that the development of more complex scheduling strategies requires workloads that carefully reproduce real workloads. Only workload traces include all such explicit and implicit dependencies. As simulations are commonly used for evaluating scheduling strategies, there is demand for a sufficient database of workload traces. However, only a limited number of traces is available, and they originate from different systems. The presented method can be used to scale such workload traces to a uniform resource configuration for further evaluations. Note that we do not claim that our method extrapolates actual user behavior for a specific larger machine. Rather, we scale the real workload traces to fit a larger machine while maintaining the original workload properties. To this end, our method includes a combination of generating additional job copies and extending the job width. In this way, we ensure that some jobs utilize the same relative number of processors as in the original traces, while original jobs still occur in the workload. For instance, an existing preference for power of 2 jobs in the original workload is still included in the scaled workload. Similarly, other preferences or certain job patterns remain intact even if they are not explicitly known.

The presented model can be extended to scale other job parameters in the same fashion. Preliminary work has been done to include memory requirements or requested processor ranges. This list can be extended by applying additional rules and policies for the scaling operation.

References

[1] Allen B. Downey. A parallel workload model and its implications for processor allocation. In 6th Intl. Symp. High Performance Distributed Comput., Aug 1997.
[2] Allen B. Downey. Using queue time predictions for processor allocation. In Dror G. Feitelson and Larry Rudolph, editors, Job Scheduling Strategies for Parallel Processing, pages 35–57. Springer Verlag, 1997. Lect. Notes Comput. Sci. vol. 1291.
[3] Allen B. Downey and Dror G. Feitelson. The elusive goal of workload characterization. Perf. Eval. Rev., 26(4):14–29, Mar 1999.
[4] Carsten Ernemann, Volker Hamscher, and Ramin Yahyapour. Economic scheduling in grid computing. In Dror G. Feitelson, Larry Rudolph, and Uwe Schwiegelshohn, editors, Job Scheduling Strategies for Parallel Processing, pages 128–152. Springer Verlag, 2002. Lect. Notes Comput. Sci. vol. 2537.
[5] Dror G. Feitelson. Packing schemes for gang scheduling. In Dror G. Feitelson and Larry Rudolph, editors, Job Scheduling Strategies for Parallel Processing, pages 89–110. Springer-Verlag, 1996. Lect. Notes Comput. Sci. vol. 1162.
[6] Dror G. Feitelson. Memory usage in the LANL CM-5 workload. In Dror G. Feitelson and Larry Rudolph, editors, Job Scheduling Strategies for Parallel Processing, pages 78–94. Springer Verlag, 1997. Lect. Notes Comput. Sci. vol. 1291.
[7] Dror G. Feitelson. Workload modeling for performance evaluation. In M. C. Calzarossa and S. Tucci, editors, Performance Evaluation of Complex Systems: Techniques and Tools, pages 114–141. Springer Verlag, 2002. Lect. Notes Comput. Sci. vol. 2459.
[8] Dror G. Feitelson and Morris A. Jette. Improved utilization and responsiveness with gang scheduling. In Dror G. Feitelson and Larry Rudolph, editors, Job Scheduling Strategies for Parallel Processing, pages 238–261. Springer Verlag, 1997. Lect. Notes Comput. Sci. vol. 1291.
[9] Dror G. Feitelson and Bill Nitzberg. Job characteristics of a production parallel scientific workload on the NASA Ames iPSC/860. In Dror G. Feitelson and Larry Rudolph, editors, Job Scheduling Strategies for Parallel Processing, pages 337–360. Springer-Verlag, 1995. Lect. Notes Comput. Sci. vol. 949.
[10] Dror G. Feitelson and Larry Rudolph. Metrics and benchmarking for parallel job scheduling. In Dror G. Feitelson and Larry Rudolph, editors, Job Scheduling Strategies for Parallel Processing, pages 1–24. Springer-Verlag, 1998. Lect. Notes Comput. Sci. vol. 1459.
[11] Steven Hotovy. Workload evolution on the Cornell Theory Center IBM SP2. In Dror G. Feitelson and Larry Rudolph, editors, Job Scheduling Strategies for Parallel Processing, pages 27–40. Springer-Verlag, 1996. Lect. Notes Comput. Sci. vol. 1162.
[12] Joefon Jann, Pratap Pattnaik, Hubertus Franke, Fang Wang, Joseph Skovira, and Joseph Riodan. Modeling of workload in MPPs. In Dror G. Feitelson and Larry Rudolph, editors, Job Scheduling Strategies for Parallel Processing, pages 95–116. Springer Verlag, 1997. Lect. Notes Comput. Sci. vol. 1291.
[13] Jochen Krallmann, Uwe Schwiegelshohn, and Ramin Yahyapour. On the design and evaluation of job scheduling algorithms. In Dror G. Feitelson and Larry Rudolph, editors, Job Scheduling Strategies for Parallel Processing, pages 17–42. Springer Verlag, 1999. Lect. Notes Comput. Sci. vol. 1659.
[14] Barry G. Lawson and Evgenia Smirni. Multiple-queue backfilling scheduling with priorities and reservations for parallel systems. In Dror G. Feitelson, Larry Rudolph, and Uwe Schwiegelshohn, editors, Job Scheduling Strategies for Parallel Processing, pages 72–87. Springer Verlag, 2002. Lect. Notes Comput. Sci. vol. 2537.
[15] V. Lo, J. Mache, and K. Windisch. A comparative study of real workload traces and synthetic workload models for parallel job scheduling. In Dror G. Feitelson and Larry Rudolph, editors, Job Scheduling Strategies for Parallel Processing, pages 25–46. Springer-Verlag, 1998. Lect. Notes Comput. Sci. vol. 1459.
[16] Uri Lublin and Dror G. Feitelson. The workload on parallel supercomputers: Modeling the characteristics of rigid jobs. J. Parallel & Distributed Comput., (to appear).
[17] Ahuva W. Mu'alem and Dror G. Feitelson. Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Trans. Parallel & Distributed Syst., 12(6):529–543, Jun 2001.
[18] Uwe Schwiegelshohn and Ramin Yahyapour. Improving first-come-first-serve job scheduling by gang scheduling. In Dror G. Feitelson and Larry Rudolph, editors, Job Scheduling Strategies for Parallel Processing, pages 180–198. Springer Verlag, 1998. Lect. Notes Comput. Sci. vol. 1459.
[19] Parallel Workloads Archive. http://www.cs.huji.ac.il/labs/parallel/workload/, April 2003.
[20] K. Windisch, V. Lo, R. Moore, D. Feitelson, and B. Nitzberg. A comparison of workload traces from two production parallel machines. In 6th Symp. Frontiers Massively Parallel Comput., pages 319–326, Oct 1996.

A Appendix: Complete Results

Table 5. All Results for Increased Scaling Factors with d = 50: for each workload and scheduling policy, the series of examined factors f together with the number of jobs, makespan, utilization, AWWT, AWRT, and squashed area of each scaled workload, compared to the original trace (ref). (Full tabular data omitted.)

Gang Scheduling Extensions for I/O Intensive Workloads

Yanyong Zhang (1), Antony Yang (1), Anand Sivasubramaniam (2), and Jose Moreira (3)

(1) Department of Electrical & Computer Engg., Rutgers, The State University of New Jersey, Piscataway NJ 08854, {yyzhang,pheroth}@ece.rutgers.edu
(2) Department of Computer Science & Engg., The Pennsylvania State University, University Park PA 16802, [email protected]
(3) IBM T. J. Watson Research Center, P. O. Box 218, Yorktown Heights NY 10598-0218, [email protected]

Abstract. Scientific applications are becoming more complex and more I/O demanding than ever. For such applications, systems with dedicated I/O nodes do not provide enough scalability; a serverless approach is a viable alternative. However, with the serverless approach, a job's execution time is determined by whether it is co-located with the file blocks it needs. Gang scheduling (GS), which is widely used in supercomputing centers to schedule parallel jobs, is completely unaware of an application's spatial preferences. In this paper, we show that gang scheduling does not do a good job of scheduling I/O intensive applications. We extend gang scheduling by adding different levels of I/O awareness and propose three schemes. We show that all three new schemes are better than gang scheduling for I/O intensive jobs. One of them, with the help of migration, significantly outperforms the others for all the workloads we examine.

1 Introduction

Scientific applications, involving the processing, analysis, and visualization of scientific and engineering datasets, are of great importance to the advancement of society. Large-scale parallel machines are used in supercomputing centers to meet their needs. Scheduling of jobs onto the nodes of these machines thus becomes an important and challenging area of research. A large number of jobs must run on such a system. These jobs are usually long running and require significant amounts of CPU and storage resources. A scheduling scheme that is not carefully designed can easily result in unbearably high response times and low throughput. The research is challenging because of the numerous tradeoffs arising from the application characteristics as well as the underlying system configurations involved
in designing a scheduler. Some of the influencing factors are job arrival rates, job sizes, job behaviors, system configuration parameters, and system capacity. Now, this research is facing new challenges because the applications are increasing dramatically in their sizes, lengths, and resource requirements, at a pace which has never been seen before. Scheduling strategies can have a significant impact on the performance characteristics of a large parallel system [4, 5, 8, 10, 13, 14, 19, 20, 23]. Early strategies used a space-sharing approach, wherein jobs can run side by side on different nodes of the machine at the same time, but each node is exclusively assigned to a job. The wait and response times for jobs with an exclusively space-sharing strategy can be relatively high. Gang scheduling (or coscheduling) is a technique to improve the performance by adding a time-sharing dimension to space sharing [18]. This technique virtualizes the physical machine by slicing the time axis into multiple virtual machines. Tasks of a parallel job are coscheduled to run in the same time-slices. The number of virtual machines created (equal to the number of time slices), is called the multiprogramming level (MPL) of the system. This approach opens more opportunities for the execution of parallel jobs, and is thus quite effective in reducing the wait time, at the expense of increasing the apparent job execution time. Gang scheduling can increase the system utilization as well. A considerable body of previous work has been done in investigating different variations of gang scheduling [25, 24, 26, 22]. Gang-scheduling is now widely used in supercomputing centers. For example, it has been used in the prototype GangLL job scheduling system developed by IBM Research for the ASCI Blue-Pacific machine at LLNL (a large scale parallel system spanning thousands of nodes [16]). On the application’s end, over the past decade, the input data sizes that these applications are facing have increased dramatically, at a much faster pace than the performance improvement of the I/O system [15]. Supercomputing centers usually deploy dedicated I/O nodes to store the data and manage the data access. Data is striped across high-performance disk arrays. The I/O bandwidth provided in such systems is limited by the number of I/O channels and the number of disks, which are normally smaller than the number of available computing units. This is not enough to serve the emerging scientific applications that are more complex than ever, and are more I/O demanding than ever. Fortunately, several research groups have proposed a serverless approach to address this problem [1, 2]. In a serverless file system, each node in the system serves as both compute node and I/O node. All the data files are distributed across the disks of the nodes in the system. Suppose a task (from a parallel job) wants to access a data block. If the block is hosted on the disk of the node on which this task is running, then the data can be fetched from the local disk, incurring a local disk I/O cost. If the block resides on the disk of another node, then an extra network cost is needed, in addition to the cost of accessing the disk, to fetch the data. With this approach, we have as many I/O nodes as the compute nodes. It is much easier to scale than the dedicated solution. The only assumption this


approach makes is a reasonably fast interconnect between the nodes, which should not be a concern given the rapid progress in interconnect technology.

While the serverless approach shows a lot of promise, realizing that promise is not a trivial task, and it raises new challenges for the design of a job scheduler. Files are now partitioned across a set of nodes in the system, and jobs run on that same set of nodes. Suppose task t needs to access block b of file f. The associated access cost can vary considerably depending on the location of b. If b is hosted on the same node where t is running, then t only needs to go to the local disk to fetch the data; otherwise, t pays an extra network cost to fetch the data from a remote disk. Disk accesses thus become asymmetric. In order to avoid this extra network cost, a scheduler must co-locate the tasks of parallel jobs with the file blocks they need. This is even more important for I/O intensive jobs. Although gang scheduling, which is widely used in many supercomputing centers, is effective for traditional workloads, most of its implementations are completely unaware of applications' spatial preferences. Instead, they focus on maximizing system usage when allocating jobs to the system, so that a new job can be accommodated quickly and a running job can receive more CPU time. However, it is not yet clear which of the two factors - I/O awareness or system utilization - is more important to performance, whether both matter in some scenarios, and if so, which is better, in which scenario, and by how much. Can we do even better by adaptively balancing the two? This paper sets out to answer these questions by conducting extensive simulation-based experiments on a wide range of workloads. It proposes several new scheduling heuristics that try to schedule jobs onto their preferred nodes without compromising system utilization, and it evaluates their performance extensively. In this paper, we make the following contributions:

– We propose a new variation of gang scheduling, I/O aware gang scheduling (IOGS), which is aware of the jobs' spatial preferences and always schedules jobs onto their desired nodes.
– We quantitatively compare IOGS and pure gang scheduling under both I/O intensive workloads and medium I/O workloads. We find that pure gang scheduling is better than IOGS for medium workloads, while IOGS is much better than gang scheduling for I/O intensive workloads, which are the motivating workloads for this study.
– We propose a hybrid scheme, adaptive I/O aware gang scheduling (Adaptive-IOGS), which tries to combine the benefits of both schemes. We show that this adaptive scheme performs the best in most situations, except for I/O intensive workloads at very high loads.
– Further, we propose another scheme, migration adaptive I/O aware gang scheduling (Migrate-IOGS), which combines a migration technique with the adaptive scheme. Migration can move more jobs onto their desired nodes, thus leading to better performance.
– Finally, we show that Migrate-IOGS delivers the best performance among the four schemes across a wide range of workloads.


The rest of the paper is organized as follows: Section 2 describes the system and workload models used in this study. Section 3 presents the proposed scheduling heuristics and reports their performance results. Section 4 presents our conclusions and possible directions for future work.

2 System and Workload Model

Large-scale parallel applications demand large data inputs during their execution, and optimizing their I/O performance is becoming a critical issue. In this study, we set out to investigate the impact of application I/O intensity on the design of a job scheduler.

2.1 I/O Model

One of the latest trends in file system design is the serverless file system [1, 2], in which there is no centralized server and the data is distributed on the disks of the client nodes that participate in the file system. Each node in the system serves as both a compute node and an I/O node. In a system that has dedicated I/O nodes (such as a storage area network), the I/O bandwidth is limited by the number of I/O channels and the number of disks, which is not suitable for increasingly complex and I/O demanding applications. In a serverless approach, by contrast, there are as many I/O nodes as there are nodes in the system, which provides much better scalability. Earlier work has shown that the serverless approach is a viable alternative, provided the interconnect is reasonably fast. (The details of serverless file systems are beyond the scope of this paper; interested readers are referred to the references.) In this study, we adopt the serverless file system approach. The data files in the system are partitioned according to some heuristic and distributed over the disks of a subset of the nodes (the clients that participate in the file system). The file partitioning heuristic is a very important issue, and is described in detail in the next section. We use t^local_IO and t^remote_IO to denote the local and remote I/O costs, respectively. In this paper, when we use the term I/O costs, we mean the costs associated with those I/O requests that are not satisfied in the file cache and have to fetch data from disk.

Figure 1 illustrates such a system with 4 nodes, connected by a high-speed interconnect. File F has two partitions, F.1/2 and F.2/2, hosted by nodes 3 and 4 respectively. Parallel job J (with tasks J1 and J2) is running on nodes 2 and 3, and needs data from F during its execution. In this example, if the data needed by J2 belongs to partition F.1/2, then J2 spends much less time in I/O because all its requests can be satisfied by the local disk, while J1 has to fetch its data from remote disks.

2.2 File Partition Model

File partitioning plays an important role in determining a task's I/O behavior.

Fig. 1. The I/O model example: four nodes on a high-speed interconnect, each serving as both compute node and I/O node; F.1/2 and F.2/2 are hosted on nodes 3 and 4, and tasks J1 and J2 of job J run on nodes 2 and 3.

Even though a task is running on a node that hosts one partition of its file, it does not mean that the task enjoys lower, local I/O costs, because the data it needs may not belong to that partition. In the example shown in Figure 2, F has 18 blocks in total; all the odd-numbered blocks belong to F.1/2, while the even-numbered blocks belong to F.2/2. Task J1 is co-located with all the odd-numbered blocks of F. Unfortunately, J1 needs the first 10 blocks of file F, which are evenly distributed between F.1/2 and F.2/2. Thus, half of its I/O requests have to go to remote disks. This example shows that even if a job scheduler manages to assign tasks to the nodes where their files are hosted, their I/O requests may not be fully served by local disks. This observation suggests that we need to coordinate the file partitioning with the applications' data access patterns in order to fully realize the benefits of a good job scheduler. As suggested by Corbett and Feitelson in [3], it is possible for applications to pass their access patterns to the underlying file system, so that the system can partition the files accordingly, leading to a one-to-one mapping between the tasks and partitions.

Fig. 2. File partition example: the odd-numbered blocks of F form F.1/2 on node 3 and the even-numbered blocks form F.2/2 on node 4.

Fig. 3. The execution model for task i with d^I_i = 0.03 second, n^B_i = 1, and t^local_IO = 0.03 second (a 1 x 30 ms I/O operation after every 30 ms of CPU time).

The authors even suggested that the file partitioning can be changed dynamically if the access pattern changes. Thus, as long as a task is assigned to the appropriate node (the one hosting the corresponding partition), all of its I/O accesses are local. In this study, we use this approach. Although the hardware configurations in [3] and in this study are different, we believe the same idea can be implemented on our platform. Note that our schemes can work with other partitioning heuristics as well; in that case, we would need to model the jobs' access patterns and quantify the fraction of each task's I/Os that are local. In this study, we do not dynamically re-partition files, in order to avoid the associated overheads. In fact, as the statistics in [17] show, file sharing across applications is rare, which implies that dynamic re-partitioning is not necessary.

2.3 Workload Model

We conduct a simulation-based study of our scheduling strategies using synthetic workloads. The basic synthetic workloads are generated from stochastic models that fit actual workloads of the ASCI Blue-Pacific system at Lawrence Livermore National Laboratory (a 320-node RS/6000 SP). We first obtain one set of parameters that characterizes a specific workload. We then vary this set of parameters to generate a set of nine different workloads that impose an increasing load on the system. This approach, described in more detail in [12], allows us to do a sensitivity analysis of the impact of the scheduling algorithms over a wide range of operating points. From the basic synthetic workload, we obtain the following set of parameters for job i: (1) t^a_i: arrival time; (2) t^C_i: CPU time (if i is run in a dedicated setting, the total time spent on the CPU); (3) n_i: number of nodes. Next, we add one more dimension to the jobs: I/O intensity. In this study, we employ a simple model that assumes I/O accesses are evenly distributed throughout the execution. We use two more parameters to support this simple model for job i: (4) d^I_i: the I/O request interarrival time, and (5) n^B_i: the number of disk blocks per request. In the example shown in Figure 3, I/O operations are evenly distributed throughout the execution: there is an I/O operation every 0.03 second (30 milliseconds), and each operation accesses one block. Since we assume every I/O access is local, and the local cost is 30 milliseconds, every I/O operation takes 1 x 30 milliseconds. Note that we do not care whether these I/O requests are for one file or for multiple files. As discussed in Section 2.2,
all the files a job accesses are partitioned according to its access patterns. By varying the values of d^I_i and n^B_i, we can vary the I/O intensity of job i, with smaller d^I_i and greater n^B_i implying higher I/O intensity. In addition to these application parameters, system parameters such as the local and remote I/O costs (t^local_IO and t^remote_IO) also play a role in a job's I/O intensity, with higher I/O costs leading to higher I/O intensity. The I/O intensity of job i is thus calculated as

    (n^B_i x t^local_IO) / (n^B_i x t^local_IO + d^I_i),

where we assume that all the I/O requests are local. For example, if job i issues I/O requests every 0.03 second (d^I_i = 0.03 second), accessing one block per request (n^B_i = 1), and a local I/O request takes 0.03 second to finish, then i's I/O intensity is 0.03 / (0.03 + 0.03) = 50%.
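As a concrete illustration of this workload model, the short sketch below computes the I/O intensity of a job and the cost of a single I/O request. The function and parameter names are ours, and the 0.05-second remote cost is an assumed value (the paper only states that remote accesses add a network cost on top of the disk cost); this is a minimal sketch, not the simulator used in the study.

    # Minimal sketch of the I/O model parameters (our own illustration).
    def io_intensity(d_io, n_blocks, t_local):
        """Fraction of a task's time spent in (local) I/O:
        (n^B * t_local) / (n^B * t_local + d^I)."""
        io_time = n_blocks * t_local
        return io_time / (io_time + d_io)

    def request_cost(n_blocks, local, t_local=0.030, t_remote=0.050):
        """Cost of one I/O request; t_remote = 0.050 is an assumed value."""
        return n_blocks * (t_local if local else t_remote)

    # Job i of the running example: one block every 0.03 s, local access of 0.03 s.
    print(io_intensity(d_io=0.03, n_blocks=1, t_local=0.03))   # 0.5 -> 50% I/O intensity
    print(request_cost(n_blocks=1, local=False))               # remote request is costlier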

Based on the definition of the model, we can further derive (6) n^I_i = ⌈t^C_i / d^I_i⌉ ...

Algorithm 1 Filling M (fragment)
        if wj_i.size > j
            util ← m_{i-1,j}.util
            selected ← False
            bypassed ← True
        else
            util' ← m_{i-1, j - wj_i.size}.util + wj_i.size
            if util' ≥ m_{i-1,j}.util
                util ← util'
                selected ← True
                bypassed ← False
                if util' = m_{i-1,j}.util
                    bypassed ← True
            else
                util ← m_{i-1,j}.util
                selected ← False
                bypassed ← True

It was already noted in Section 3.2.2 that it is possible for both markers to be set simultaneously in an arbitrary cell m_{x,y}, which means that there is more than one possible schedule. In such a case, the algorithm follows the bypassed marker. In terms of scheduling, wj_x not being in S simply means that wj_x is not started at t, but this decision has a deeper meaning in terms of queue policy. Since the queue is traversed by Algorithm 2 from tail to head, skipping wj_x means that other jobs, closer to the head of the queue, will be started instead, and the same maximal utilization will still be achieved. By selecting jobs closer to the head of the queue, our produced schedule is not only more committed to the queue's FCFS policy, but can also be expected to receive a better score from evaluation metrics such as average response time, slowdown, etc.

♣ The resulting S contains a single job, wj2, and its scheduling at t is illustrated in Figure 3. Note that wj1 is not part of S; it is only drawn to illustrate that wj2 does not affect its expected start time, indicating that our produced schedule is safe.
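The sketch below is our own compact reading of the dynamic program described above (Algorithm 1 fills the matrix of util values with selected/bypassed markers, Algorithm 2 walks back to extract S); it covers only the basic case in which every candidate job must fit in the current free capacity, and all names are ours. On a tie it keeps the bypassed marker set, so the traceback prefers jobs nearer the head of the queue.

    def lookahead_schedule(sizes, n):
        """sizes[i] = processors requested by eligible waiting job i+1 (head first);
        n = free processors at time t.  Returns the queue indices to start now,
        maximising utilisation while preferring jobs nearer the head of the queue."""
        q = len(sizes)
        util = [[0] * (n + 1) for _ in range(q + 1)]
        selected = [[False] * (n + 1) for _ in range(q + 1)]
        bypassed = [[True] * (n + 1) for _ in range(q + 1)]
        for i in range(1, q + 1):
            for j in range(1, n + 1):
                util[i][j] = util[i - 1][j]                 # default: bypass job i
                if sizes[i - 1] <= j:
                    cand = util[i - 1][j - sizes[i - 1]] + sizes[i - 1]
                    if cand >= util[i - 1][j]:              # selecting is at least as good
                        util[i][j] = cand
                        selected[i][j] = True
                        # on a tie both markers stay set; traceback prefers bypass
                        bypassed[i][j] = (cand == util[i - 1][j])
        S, i, j = [], q, n
        while i > 0 and j > 0:                              # traceback (Algorithm 2)
            if bypassed[i][j]:
                i -= 1
            else:
                S.append(i)
                j -= sizes[i - 1]
                i -= 1
        return sorted(S)

    # Sizes of wj2..wj5 from the running example, shadow constraint ignored:
    print(lookahead_schedule([2, 1, 2, 3], 5))   # -> [1, 2, 3], the head-most subset of total 5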

3.3 The Full Algorithm

3.3.1 Maximizing Utilization One way to create a safe schedule is to require all jobs in S to terminate before the shadow time, so as not to interfere with wj1's reservation. This restriction can be relaxed in order to achieve a better schedule S', still safe but with much improved utilization.


Algorithm 2 Constructing S
    S ← {}
    i ← |WQ|
    j ← n
    while i > 0 and j > 0
        if m_{i,j}.bypassed = True
            i ← i - 1
        else
            S ← S ∪ {wj_i}
            j ← j - wj_i.size
            i ← i - 1

Fig. 3. Scheduling wj2 at t = 25.

This is possible due to the extra processors left over at the shadow time after wj1 is started. Waiting jobs that are expected to terminate after the shadow time can use these extra processors, referred to as the shadow free capacity, and run side by side with wj1 without affecting its start time. As long as the total size of the jobs in S' that are still running at the shadow time does not exceed the shadow free capacity, wj1 will not be delayed, and S' will be a safe schedule. If the first waiting job, wj1, can only start after rj_s has terminated, then the shadow free capacity, denoted extra, is calculated as follows:

    extra = n + Σ_{i=1..s} rj_i.size − wj_1.size

To use the extra processors, the jobs that are expected to terminate before the shadow time are distinguished from those expected to still be running at that time, which are therefore candidates for using the extra processors. Each waiting job wj_i in WQ is now represented by two values: its original size and its shadow size — its size at the shadow time. Jobs expected to terminate before the shadow time have a shadow size of 0. The shadow size is denoted ssize, and is calculated using the following rule:

    wj_i.ssize = 0            if t + wj_i.time ≤ shadow
    wj_i.ssize = wj_i.size    otherwise

If wj1 can start at t, the shadow time is set to ∞. As a result, the shadow size ssize of all waiting jobs is set to 0, which means that any computation involving extra processors is unnecessary; in this case, setting extra to 0 improves the algorithm's performance. All these calculations are done in a pre-processing phase, before running the dynamic programming algorithm.

♣ wj1, which can begin execution at t = 28, leaves 3 extra processors. shadow and extra are set to 28 and 3 respectively, as illustrated in Figure 4. In the queue shown in the figure, we use the notation size^ssize to represent the two size values. wj2 is the only job expected to terminate before the shadow time, so its shadow size is 0.

Fig. 4. Computing shadow and extra, and the processed job queue:
    wj   size^ssize   time
    1       7^7        4
    2       2^0        2
    3       1^1        6
    4       2^2        4
    5       3^3        5

3.3.2 A Three Dimensional Data Structure To manage the use of the extra processors, we need a three-dimensional matrix, denoted M', of size (|WQ|+1) x (n+1) x (extra+1). Each cell m_{i,j,k} now contains two integer values, util and sutil, and the two trace markers. util holds the maximal achievable utilization at t if the machine's free capacity is j, the shadow free capacity is k, and only waiting jobs {1..i} are available for scheduling. sutil holds the minimal number of extra processors required to achieve that util value.
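The pre-processing step of Section 3.3.1 amounts to a few lines of code. The sketch below is ours; the job tuples reproduce the running example of Figure 4 (t = 25, shadow = 28, extra = 3), and only the ssize computation is shown — extra itself follows directly from the formula above.

    def shadow_sizes(t, shadow, waiting):
        """waiting = [(name, size, runtime), ...], excluding wj1.
        ssize = size if the job would still be running at the shadow time, else 0."""
        return [(name, size, (0 if t + runtime <= shadow else size))
                for name, size, runtime in waiting]

    # Running example of Figure 4: t = 25, shadow = 28.
    waiting = [("wj2", 2, 2), ("wj3", 1, 6), ("wj4", 2, 4), ("wj5", 3, 5)]
    print(shadow_sizes(25, 28, waiting))
    # [('wj2', 2, 0), ('wj3', 1, 1), ('wj4', 2, 2), ('wj5', 3, 3)]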
Table 3. Initial k = 0 plane (the i = 0 row and j = 0 column hold 0^0; all other cells are still empty).

Table 4. k = 0 plane after filling (only wj2, whose ssize is 0, can be selected; m5,5,0.util = 2).

The selected and bypassed markers are used in the same manner as described in Section 3.2.2. As mentioned there, the i = 0 rows and j = 0 columns are initialized with zero values, this time for all k planes.

♣ M' is a 6 x 6 x 4 matrix. util and sutil are noted util^sutil. The notation of the selected and bypassed markers is unchanged. Table 3 describes the initial k = 0 plane; planes 1..3 are initially similar.

3.3.3 Filling M' The values in every m_{i,j,k} cell are calculated in an iterative manner using values from previously calculated cells, as described in Algorithm 3. The calculation is exactly the same as in Algorithm 1, except for the addition of a slightly more complicated condition that checks that enough processors are available both now and at the shadow time. The computation stops when reaching cell m_{|WQ|,n,extra}.

♣ When the shadow free capacity is k = 0, only wj2, whose ssize = 0, can be scheduled. As a result, the maximal achievable utilization of the j = 5 free processors, when considering all i = 5 jobs, is m5,5,0.util = 2, as can be seen in Table 4. This is of course the same utilization value (and the same schedule) achieved in Section 3.2.3, as the k = 0 case is identical to considering only jobs that terminate before the shadow time. When the shadow free capacity is k = 1, wj3, whose ssize = 1, is also available for scheduling.


Algorithm 3 Filling M'
– Note: to slightly ease the reading, m_{i,j,k}.util, m_{i,j,k}.sutil, m_{i,j,k}.selected, and m_{i,j,k}.bypassed are written util, sutil, selected, and bypassed respectively.
    for k = 0 to extra
        for i = 1 to |WQ|
            for j = 1 to n
                if wj_i.size > j or wj_i.ssize > k
                    util ← m_{i-1,j,k}.util
                    sutil ← m_{i-1,j,k}.sutil
                    selected ← False
                    bypassed ← True
                else
                    util' ← m_{i-1, j - wj_i.size, k - wj_i.ssize}.util + wj_i.size
                    sutil' ← m_{i-1, j - wj_i.size, k - wj_i.ssize}.sutil + wj_i.ssize
                    if util' > m_{i-1,j,k}.util or
                       (util' = m_{i-1,j,k}.util and sutil' ≤ m_{i-1,j,k}.sutil)
                        util ← util'
                        sutil ← sutil'
                        selected ← True
                        bypassed ← False
                        if util' = m_{i-1,j,k}.util and sutil' = m_{i-1,j,k}.sutil
                            bypassed ← True
                    else
                        util ← m_{i-1,j,k}.util
                        sutil ← m_{i-1,j,k}.sutil
                        selected ← False
                        bypassed ← True
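The sketch below extends the earlier two-dimensional sketch to the three-dimensional matrix of Algorithm 3; it is our own illustration, all names are ours, and the job list deliberately omits wj1 (which cannot fit in the 5 free processors of the running example).

    def fill_M3(jobs, n, extra):
        """jobs = [(size, ssize), ...] in queue order.  Our sketch of Algorithm 3:
        m[i][j][k] = (util, sutil, selected, bypassed)."""
        q = len(jobs)
        m = [[[(0, 0, False, True) for _ in range(extra + 1)]
              for _ in range(n + 1)] for _ in range(q + 1)]
        for k in range(extra + 1):
            for i in range(1, q + 1):
                size, ssize = jobs[i - 1]
                for j in range(1, n + 1):
                    util0, sutil0, _, _ = m[i - 1][j][k]
                    if size > j or ssize > k:                 # cannot take job i here
                        m[i][j][k] = (util0, sutil0, False, True)
                        continue
                    pu, ps, _, _ = m[i - 1][j - size][k - ssize]
                    cu, cs = pu + size, ps + ssize            # candidate including job i
                    if cu > util0 or (cu == util0 and cs <= sutil0):
                        tie = (cu == util0 and cs == sutil0)
                        m[i][j][k] = (cu, cs, True, tie)      # prefer fewer extra CPUs
                    else:
                        m[i][j][k] = (util0, sutil0, False, True)
        return m

    jobs = [(2, 0), (1, 1), (2, 2), (3, 3)]                   # wj2..wj5 of the example
    m = fill_M3(jobs, n=5, extra=3)
    print(m[4][5][3][:2])                                     # (5, 3), as in Table 7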

Table 5. k = 1 plane after filling (wj3, whose ssize is 1, becomes schedulable; from m3,3,1 on, wj2 and wj3 are selected).

As can be seen in Table 5, starting at m3,3,1, the maximal achievable utilization increases to 3, at the price of using a single extra processor. The two selected jobs are wj2 and wj3. As the shadow free capacity increases to k = 2, wj4, whose shadow size is 2, joins wj2 and wj3 as a valid scheduling option. Its effect is illustrated in Table 6, starting at m4,4,2: the maximal achievable utilization increases to 4 — the sum of the sizes of wj2 and wj4.


Table 6. k = 2 plane after filling (wj4 becomes schedulable; from m4,4,2 the maximal utilization is 4 using 2 extra processors; at m4,2,2 wj2 is preferred over wj4 since it uses no extra processors).

Table 7. k = 3 plane after filling (wj5 becomes schedulable; m5,5,3.util = 5 with m5,5,3.sutil = 3).

This comes at the price of using a minimum of 2 extra processors, corresponding to wj4's shadow size. It is interesting to examine the m4,2,2 cell, as it introduces an interesting heuristic decision. When the machine's free capacity is j = 2 and only jobs {1..4} are considered for scheduling, the maximal achievable utilization can be accomplished by scheduling either wj2 or wj4, both of size 2, yet wj4 would use 2 extra processors while wj2 would use none. The algorithm chooses to bypass wj4 and selects wj2, as this leaves more extra processors to be used by other jobs. Finally, the full k = 3 shadow free capacity is considered. wj5, whose shadow size is 3, can now join wj1..wj4 as a valid scheduling option. As can be seen in Table 7, the maximal achievable utilization at t = 25, when the machine's free capacity is n = j = 5, the shadow free capacity is extra = k = 3, and all five waiting jobs are available for scheduling, is m5,5,3.util = 5. The minimal number of extra processors required to achieve this utilization value is m5,5,3.sutil = 3.

3.3.4 Constructing S' Algorithm 4 describes the construction of S'. It starts at the last computed cell, m_{|WQ|,n,extra}, follows the trace markers, and stops when reaching the 0 boundaries of any plane. As explained in Section 3.2.4, when both trace markers are set simultaneously, the algorithm follows the bypassed marker, a decision which prioritizes jobs that have waited longer.


Algorithm 4 Constructing S'
    S' ← {}
    i ← |WQ|
    j ← n
    k ← extra
    while i > 0 and j > 0
        if m_{i,j,k}.bypassed = True
            i ← i - 1
        else
            S' ← S' ∪ {wj_i}
            j ← j - wj_i.size
            k ← k - wj_i.ssize
            i ← i - 1
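Continuing the fill_M3 sketch shown after Algorithm 3 (it reuses the m and jobs names defined there, so it is not standalone), the traceback below mirrors Algorithm 4; on ties it follows the bypassed marker, favouring jobs nearer the head of the queue.

    def construct_S_prime(m, jobs, n, extra):
        """Our sketch of Algorithm 4: walk back from m[|WQ|][n][extra]."""
        S, i, j, k = [], len(jobs), n, extra
        while i > 0 and j > 0:
            _, _, _, bypassed = m[i][j][k]
            if bypassed:
                i -= 1
            else:
                size, ssize = jobs[i - 1]
                S.append(i)
                j, k, i = j - size, k - ssize, i - 1
        return sorted(S)

    print(construct_S_prime(m, jobs, n=5, extra=3))   # -> [1, 2, 3], i.e. wj2, wj3, wj4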


Fig. 5. Scheduling wj2, wj3 and wj4 at t = 25.

♣ Both trace markers in m5,5,3 are set, which means there is more than one way to construct S'. In our example there are two possible schedules, both of which utilize all 5 free processors, resulting in a fully utilized machine. Choosing S' = {wj2, wj3, wj4} is illustrated in Figure 5; choosing S' = {wj2, wj5} is illustrated in Figure 6. Both schedules fully utilize the machine and ensure that wj1 will start without delay, so both are safe schedules, yet the first schedule (Figure 5) contains jobs closer to the head of the queue and is therefore more committed to the queue's FCFS policy. Based on the explanation in Section 3.2.4, choosing S' = {wj2, wj3, wj4} is expected to yield better results when evaluation metrics are considered.


Fig. 6. Scheduling wj2 and wj5 at t = 25.

3.4 A Note on Complexity

The most time- and space-demanding task is the construction of M'. It depends on |WQ| — the length of the waiting queue, n — the machine's free capacity at t, and extra — the shadow free capacity. |WQ| depends on the system load: on heavily loaded systems the average waiting-queue length can reach tens of jobs, with peaks sometimes reaching hundreds. Both n and extra fall in the range 0 to N, and their values depend on the size and time distribution of the waiting and running jobs. The termination of a small job causes only a small increase in the system's free capacity, so n increases by a small amount; when a large job terminates, it leaves much free space and n will consequently be large. extra is a function of the size of the first waiting job and of the size and time distribution of the running jobs. If wj1 is small but can only start after a large job terminates, extra will consequently be large. On the other hand, if the size of the terminating job is small and wj1's size is relatively large, fewer extra processors will be available.

3.5 Optimizations

It was mentioned in Section 3.4 that on heavily loaded systems the average waiting-queue length can reach tens of jobs, a fact that has a negative effect on the performance of the scheduler, since the construction of M' directly depends on |WQ|. Two enhancements can be applied in the pre-processing phase. Both result in a shorter waiting queue, |WQ'| < |WQ|, and thus improve the scheduler's performance. The first enhancement is to exclude jobs larger than the machine's current free capacity. If wj_i.size > n, it is clear that it cannot be started in the current scheduling step, so it can be safely excluded from the waiting queue without any effect on the algorithm's results.

Table 8. Optimized waiting queue WQ'
    wj   size^ssize
    2       2^0
    3       1^1
    4       2^2
    5       3^3

The second enhancement is to limit the number of jobs examined by the algorithm by including only the first C waiting jobs in WQ', where C is a predefined constant. We call this approach limited lookahead, since we limit the number of jobs the algorithm is allowed to examine. It is often possible to produce a schedule that maximizes the machine's utilization by looking only at the first C jobs; by limiting the lookahead, the same result is then achieved with much less computational effort. Obviously this is not always the case, and such a restriction might produce a schedule that is not optimal. The effect of limiting the lookahead on the performance of LOS is examined in Section 4.3.

♣ Looking at our initial waiting queue, described in the table in Figure 4, it is clear that wj1 cannot start at t since its size exceeds the machine's 5 free processors. It can therefore be safely excluded from the processed waiting queue without affecting the produced schedule. The resulting waiting queue WQ' holds only four jobs, as shown in Table 8. We could also limit the lookahead to C = 3 jobs, excluding wj5 from WQ'. In this case the produced schedule contains jobs wj2, wj3 and wj4; not only does it maximize the utilization of the machine, it is also identical to the schedule shown in Figure 5. By limiting the lookahead we improved the performance of the algorithm and achieved the same results.
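Both optimizations are simple queue filters. The following sketch (our own, with made-up names) reproduces the running example: dropping the job that cannot fit, then truncating to the first C jobs.

    def prune_queue(waiting, n, C=50):
        """Sketch of the two pre-processing optimisations of Sect. 3.5:
        drop jobs larger than the current free capacity n, then keep only the
        first C jobs (limited lookahead)."""
        fits = [job for job in waiting if job[1] <= n]   # job = (name, size)
        return fits[:C]

    queue = [("wj1", 7), ("wj2", 2), ("wj3", 1), ("wj4", 2), ("wj5", 3)]
    print(prune_queue(queue, n=5))          # wj1 is dropped, as in Table 8
    print(prune_queue(queue, n=5, C=3))     # limited lookahead C = 3 also drops wj5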

4 Experimental Results

4.1 The Simulation Environment

We implemented all aspects of the algorithm, including the optimizations described above, in a job scheduler we named LOS, and integrated LOS into the framework of an event-driven job scheduling simulator. We used logs of the Cornell Theory Center (CTC) SP2, the San Diego Supercomputer Center (SDSC) SP2, and the Swedish Royal Institute of Technology (KTH) SP2 supercomputers as a basis [22], and generated logs with loads ranging from 0.5 to 0.95 by multiplying the arrival time of each job by constant factors. For example, if the offered load in the CTC log is 0.60, then multiplying each job's arrival time by 0.60 generates a new log with a load of 1.0. To generate a load of 0.9, each job's arrival time is multiplied by a constant of 0.60/0.90.

Fig. 7. Mean job differential response time (Easy-LOS) vs. load for the (a) CTC, (b) SDSC, and (c) KTH logs.

We claim that, in contrast to other log modification methods, which modify the jobs' sizes or runtimes, our generated logs maintain characteristics resembling those of the original ones. The logs were used as input to the simulator, which generates arrival and termination events according to the job characteristics of a specific log. On each arrival or termination event, the simulator invokes LOS, which examines the waiting queue and, based on the current system state, decides which jobs to start. For each started job, the simulator updates the system's free capacity and enqueues a termination event corresponding to the job's termination time. For each terminated job, the simulator records its response time, bounded slowdown (applying a threshold of τ = 10 seconds), and wait time.
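The load transformation described above is a one-line rescaling of arrival times. The sketch below is ours (field names are assumptions, not the trace format of the archive); job sizes and runtimes are left untouched, which is the property the paragraph above argues for.

    def rescale_load(jobs, offered_load, target_load):
        """Multiply every arrival time by offered_load/target_load to raise the
        offered load to target_load without touching sizes or runtimes."""
        factor = offered_load / target_load
        return [dict(job, arrival=job["arrival"] * factor) for job in jobs]

    trace = [{"arrival": 0.0, "size": 8, "runtime": 3600},
             {"arrival": 600.0, "size": 16, "runtime": 1800}]
    print(rescale_load(trace, offered_load=0.60, target_load=0.90))
    # arrivals shrink by a factor of 2/3, packing the same work into less time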

Fig. 8. Mean job differential bounded slowdown (τ = 10, Easy-LOS) vs. load for the (a) CTC, (b) SDSC, and (c) KTH logs.

4.2 Improvement over EASY

We used the framework mentioned above to run simulations of the EASY scheduler [19, 26], and compared its results to those of LOS limited to a maximal lookahead of 50 jobs. By comparing the achieved utilization vs. the offered load of each simulation, we saw that for the CTC and SDSC workloads a discrepancy occurs at loads higher than 0.9, whereas for the KTH workload it occurs only at loads higher than 0.95. As such discrepancies indicate that the simulated system is actually saturated, we limit the x axis to the indicated ranges when reporting our results. As the results of schedulers processing the same jobs may be similar, we need to compute confidence intervals to assess the significance of observed differences. Rather than doing so directly, we first apply the "common random numbers" variance reduction technique [15].

Fig. 9. Effect of limited lookahead on mean job response time vs. load for the (a) CTC, (b) SDSC, and (c) KTH logs (Easy and LOS.10 through LOS.250).

For each job in the workload file, we tabulate the difference between its response time under EASY and under LOS. We then compute confidence intervals on these differences using the batch means approach. By comparing the difference between the schedulers on a job-by-job basis, the variance of the results is greatly reduced, and so are the confidence intervals. The results for response time are shown in Figure 7, and for bounded slowdown in Figure 8. The results for wait time are the same as those for response time, because we are looking at differences. In all the plots, the mean job differential response time (or bounded slowdown) is positive across the entire load range for all three logs, indicating that LOS outperforms EASY with respect to these metrics. Moreover, the lower boundaries of the 90% confidence intervals measured at key load values are also positive, i.e., 0 is not included in the confidence interval.

Fig. 10. Effect of limited lookahead on mean job bounded slowdown (τ = 10) vs. load for the (a) CTC, (b) SDSC, and (c) KTH logs (Easy and LOS.10 through LOS.250).

Thus the advantage of LOS over EASY is statistically significant at this level of confidence.
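The paired-difference and batch-means computation described above can be sketched in a few lines; this is our own illustration with synthetic numbers, and the z value assumes enough batches for a normal approximation (the paper does not give these details).

    import statistics

    def batch_means_ci(diffs, batches=10, z=1.645):
        """Confidence interval on per-job differences (EASY minus LOS) via batch
        means; z = 1.645 corresponds to 90% confidence under a normal approximation."""
        size = len(diffs) // batches
        means = [statistics.mean(diffs[b * size:(b + 1) * size]) for b in range(batches)]
        centre = statistics.mean(means)
        half = z * statistics.stdev(means) / (len(means) ** 0.5)
        return centre - half, centre + half

    # diffs[i] = response_time_easy[i] - response_time_los[i], paired job by job
    diffs = [30.0, -5.0, 120.0, 44.0, 10.0, 75.0, 2.0, 61.0, 18.0, 90.0] * 10
    print(batch_means_ci(diffs))   # LOS wins if the whole interval lies above 0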

4.3 Limiting the Lookahead

Section 3.5 proposed an enhancement called limited lookahead, aimed at improving the performance of the algorithm. We explored the effect of limiting the lookahead on scheduler performance by performing six LOS simulations with limited lookaheads of 10, 25, 35, 50, 100 and 250 jobs, respectively. Figure 9 presents the effect of the limited lookahead on the mean job response time, and Figure 10 presents its effect on the mean job bounded slowdown. Again, the effect on wait time is the same as that on response time. The notation LOS.X is used to represent LOS's result curve, where X is the maximal number of waiting jobs that LOS was allowed to examine on each


scheduling step (i.e., its lookahead limitation). We also plotted EASY's result curve to allow a comparison. We observe that for the CTC log in Figure 9(a) and the KTH log in Figure 9(c), when LOS is limited to examining only 10 jobs at each scheduling step, its resulting mean job response time is relatively poor, especially at high loads, compared to the result achieved when the lookahead restriction is relaxed. The same observation also applies to the mean job bounded slowdown for these two logs, as shown in Figure 10(a,c). As most clearly illustrated in Figures 9(a) and 10(a), the result curves of LOS and EASY intersect several times along the load axis, indicating that the two schedulers achieve similar results, with neither one consistently outperforming the other as the load increases. The reason for the poor performance is the low probability that a schedule which maximizes the machine utilization actually exists within the first 10 waiting jobs; although LOS produces the best schedule it can, it is rarely the case that this schedule indeed maximizes the machine utilization. However, for the SDSC log in Figures 9(b) and 10(b), LOS manages to provide good performance even with a limited lookahead of 10 jobs. As the lookahead limitation is relaxed, LOS's performance improves, but the improvement is not linear in the lookahead factor; in fact the resulting curves for both metrics are relatively similar for lookaheads in the range of 25-250 jobs. Thus we can safely use a bound of 50 on the lookahead, thereby bounding the complexity of the algorithm. The explanation is that at most scheduling steps, especially under low loads, the length of the waiting queue is small, so a lookahead of hundreds of jobs has no effect in practice. As the load increases and the machine advances toward its saturation point, the average number of waiting jobs increases, as shown in Figure 11, and the effect of changing the lookahead is seen more clearly. Interestingly, with LOS the average queue length is actually shorter, because LOS is more efficient in packing jobs, allowing them to start, and thus terminate, earlier.

5 Conclusions

Backfilling algorithms have several parameters. In the past, two parameters have been studied: the number of jobs that receive reservations, and the order in which the queue is traversed when looking for jobs to backfill. We introduce a third parameter: the amount of lookahead into the queue. We show that by using a lookahead window of about 50 jobs it is possible to derive much better packing of jobs under high loads, and that this improves both the average response time and the average bounded slowdown metrics. A future study should explore how the packing affects secondary metrics such as queue-length behavior. In Section 3.4 we stated that on a heavily loaded system the waiting-queue length can reach tens of jobs, so a scheduler capable of maintaining a smaller queue across a large portion of the scheduling steps increases users' satisfaction with the system. Alternative algorithms for constructing S' when several optional schedules are possible might also be examined. In Section 3.2.4 we stated that by following the bypassed marker we expect a better score from the evaluation metrics, but other heuristics, such as choosing the schedule with the minimal overall expected termination time, are also worthy of evaluation. Finally, extending our algorithm to perform reservations for more than a single job, and exploring the effect of such a heuristic on performance, presents an interesting challenge.

Fig. 11. Average queue length vs. load for the (a) CTC, (b) SDSC, and (c) KTH logs (Easy and Los.50).

References [1] O. Arndt, B. Freisleben, T. Kielmann, and F. Thilo, ”A Comparative Study of On-Line Scheduling Algorithms for Networks of Workstation”. Cluster Computing 3(2), pp. 95–112, 2000. 229, 230 [2] V. Balasundaram, G. Fox, K. Kennedy, and U. Kremer, ”A Static Performance Estimator to Guide Data Partitioning Decisions”. In 3rd Symp. Principles and Practice of Parallel Programming, pp. 213–223, Apr 1991. 230


[3] E. G. Coffman, Jr., M. R. Garey, D. S. Johnson, and R. E. Tarjan, ”Performance Bounds for Level-Oriented Two-Dimensional Packing Algorithms”. SIAM J. Comput. 9(4), pp. 808–826, Nov 1980. 230 [4] E. G. Coffman, Jr., M. R. Garey, and D. S. Johnson, ”Approximation Algorithms for Bin-Packing - An Updated Survey”. In Algorithm Design for Computer Systems Design, G. Ausiello, M. Lucertini, and P. Serafini (eds.), pp. 49–106, Springer-Verlag, 1984. 230 [5] M. V. Devarakonda and R. K. Iyer, ”Predictability of Process Resource Usage: A Measurement Based Study on UNIX”. IEEE Tans. Sotfw. Eng. 15(12), pp. 1579–1586, Dec 1989. 230 [6] D. G. Feitelson, A Survey of Scheduling in Multiprogrammed Parallel Systems. Research Report RC 19790 (87657), IBM T. J. Watson Research Center, Oct 1994; revised version, Aug 1997. 228 [7] D. G. Feitelson and L. Rudolph, ”Toward Convergence in Job Schedulers for Parallel Supercomputers”. In Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (eds.), Springer-Verlag, Lect. Notes Comput. Sci. Vol. 1162, pp. 1–26, 1996. 228 [8] D. G. Feitelson, L. Rudolph, U. Schweigelshohn, K. C. Sevcik, and P. Wong, ”Theory and Practice in Parallel Job Scheduling”. In Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (eds.), Springer-Verlag, Lect. Notes Comput. Sci. Vol. 1291, pp. 1–34, 1997. 228, 230 [9] D. G. Feitelson and L. Rudolph, ”Metrics and Benchmarking for Parallel Job scheduling”. In Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (eds.), Springer-Verlag, Lect. Notes Comput. Sci. Vol. 1459, pp. 1– 24, 1998. [10] D. Jackson, Q. Snell, and M. Clement, “Core Algorithms of the Maui Scheduler”. In Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (eds.), Springer-Verlag, Lect. Notes Comput. Sci. Vol. 2221, pp. 87–102, 2001. 231 [11] D. Karger, C. Stein, and J. Wein, ”Scheduling Algorithms”. In Handbook of algorithms and Theory of computation, M. J. Atallah (ed.), CRC Press, 1997. 230 [12] S. Krakowiak, Principles of Operating Systems. The MIT Press, Cambridge Mass., 1998. 230 [13] E. Krevat, J. G. Castanos, and J. E. Moreira, ”Job Scheduling for the BlueGene/L System”. In Job Scheduling Strategies for Parallel Processing, D. G. Feitelson, L. Rudolph, and U. Schweigelshohn (eds.), Springer-Verlag, Lect. Notes Comput. Sci. Vol. 2537, pp. 38–54, 2002. 229 [14] P. Krueger, T-H. Lai, and V. A. Radiya, ”Processor Allocation vs. Job Scheduling on Hypercube Computers”. In 11th Intl. Conf. Distributed Comput. Syst., pp. 394– 401, May 1991. 230 [15] A. M. Law and W. D. Kelton, Simulation Modeling and Analysis. 3rd ed., McGraw Hill, 2000. 245 [16] B. G. Lawson and E. Smirni, “Multiple-Queue Backfilling Scheduling with Priorities and Reservations for Parallel Systems”. In Job Scheduling Strategies for Parallel Processing, D. G. Feitelson, L. Rudolph, and U. Schweigelshohn (eds.), Springer-Verlag, Lect. Notes Comput. Sci. Vol. 2537, pp. 72–87, 2002. 231 [17] S. T. Leutenegger and M. K. Vernon, ”The Performance of Multiprogrammed Multiprocessor Scheduling Policies”. In SIGMETRICS Conf. Measurement and Modeling of Comput. Syst., pp. 226–236, May 1990. 230


[18] S. T. Leutenegger and M. K. Vernon, Multiprogrammed Multiprocessor Scheduling Issues. Research Report RC 17642 (#77699), IBM T. J. Watson Research Center, Nov 1992. 230 [19] D. Lifka, ”The ANL/IBM SP Scheduling System”, In Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (eds.), Springer-Verlag. Lect. Notes Comput. Sci. Vol. 949, pp. 295–303, 1995. 231, 232, 245 [20] S. Majumdar, D. L. Eager, and R. B. Bunt, ”Scheduling in Multiprogrammed Parallel Systems”. In SIGMETRICS Conf. Measurement and Modeling of Comput. Syst., pp. 104–113, May 1988. 230 [21] A. W. Mu’alem and D. G. Feitelson, “Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling”, In IEEE Trans. on Parallel and Distributed Syst. 12(6), pp. 529–543, Jun 2001. 230, 231 [22] Parallel Workloads Archive. URL http://www.cs.huji.ac.il/labs/parallel/workload/. 243 [23] V. Sarkar, ”Determining Average Program Execution Times and Their Variance”. In Proc. SIGPLAN Conf. Prog. Lang. Design and Implementation, pp. 298–312, Jun 1989. 230 [24] K. C. Sevick, ”Application Scheduling and Processor Allocation in Multiprogrammed Parallel Processing Systems”. Performance Evaluation 19(2–3), pp. 107–140, Mar 1994. 230 [25] J. Sgall, ”On-Line Scheduling — A Survey”. In Online Algorithms: The State of the Art, A. Fiat and G. J. Woeginger, editors, Springer-Verlag, 1998. Lect. Notes Comput. Sci. Vol. 1442, pp. 196–231. 230 [26] J. Skovira, W. Chan, H. Zhou, and D. Lifka, ”The EASY - LoadLeveler API Project”. In Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (eds.), Springer-Verlag. Lect. Notes Comput. Sci. Vol. 1162, pp. 41–47, 1996. 231, 245 [27] S. Srinivasan, R. Kettimuthu, V. Subramani, and P. Sadayappan, “Characterization of Backfilling Strategies for Parallel Job Scheduling”. In Proc. of 2002 Intl. Workshops on Parallel Processing, Aug 2002. 231 [28] D. Talby and D. G. Feitelson, ”Supporting Priorities and Improving Utilization of the IBM SP Scheduler Using Slack-Based Backfilling”. In 13th Intl. Parallel Processing Symp. (IPPS), pp. 513–517, Apr 1999. 231 [29] W. A. Ward, Jr., C. L. Mahood, and J. E. West “Scheduling Jobs on Parallel Systems Using a Relaxed Backfill Strategy”. In Job Scheduling Strategies for Parallel Processing, D. G. Feitelson, L. Rudolph, and U. Schweigelshohn (eds.), SpringerVerlag, Lect. Notes Comput. Sci. Vol. 2537, pp. 88–102, 2002. 231

QoPS: A QoS Based Scheme for Parallel Job Scheduling
Mohammad Islam, Pavan Balaji, P. Sadayappan, and D. K. Panda
Department of Computer and Information Science, The Ohio State University, Columbus, OH 43210, USA
{islammo,balaji,saday,panda}@cis.ohio-state.edu

Abstract. Although job scheduling has been much studied, the issue of providing deadline guarantees in this context has not been addressed. In this paper, we propose a new scheme, termed QoPS, to provide Quality of Service (QoS) in the response time given to the end user, in the form of guarantees on the completion time of submitted independent parallel jobs. To the best of our knowledge, this scheme is the first to implement admission control and guarantee deadlines for admitted parallel jobs.

Keywords: QoS, Job Scheduling, Deadlines, Parallel Job Scheduling

1 Introduction

A lot of research has focused on the problem of scheduling dynamically arriving independent parallel jobs on a given set of resources. The metrics evaluated include system metrics such as system utilization and throughput [4, 2], and user metrics such as turnaround time and wait time [9, 14, 3, 6, 13, 8]. There has also been some recent work on providing differentiated service to different classes of jobs using statically or dynamically calculated priorities [16, 1] assigned to the jobs. However, to our knowledge, there has been no work addressing the provision of Quality of Service (QoS) in parallel job scheduling. In current job schedulers, the charge for a run is based on the resources used, but is unrelated to the responsiveness of the system. Thus, a 16-processor job that ran for one hour would be charged for 16 CPU-hours irrespective of whether the turnaround time were one hour or one day. Further, on most systems, even if a user were willing to pay more to get a quicker turnaround on an urgent job, there is no mechanism to facilitate that. Some systems, e.g., NERSC [1], offer different queues with different costs and priorities: in addition to the normal priority queue, a high priority queue with double the usual charge and a low priority queue with half the usual charge. Jobs in the high priority queue get priority over those in the normal queue, until some threshold on the number of serviced jobs is exceeded.

This research was supported in part by NSF grants #CCR-0204429 and #EIA9986052



Such a system offers the users some choice, but does not provide any guarantee on the response time. It would be desirable to implement a charging model for a job with two components: one based on the actual resource usage, and another based on the responsiveness sought. Thus, if two users submit very similar jobs at the same time, where one is urgent and the other is not, the urgent job could be provided a quicker response time than the non-urgent job, but would also be charged more. We view the overall issue of providing QoS for job scheduling in terms of two related aspects, which can however be decoupled:

– Cost Model for Jobs: The quicker the sought response time, the larger should be the charge. The charge will generally be a function of many factors, including the resources used and the load on the system.
– Job Scheduling with Response-time Guarantees: If jobs are charged differently depending on the response time demanded by the user, the system must provide guarantees of completion time. Although deadline-based scheduling has been a topic of much research in the real-time community, it has not been much addressed in the context of job scheduling.

In this paper, we address the latter issue (job scheduling with response-time guarantees) by providing Quality of Service (QoS) in the response time given to the end user, in the form of guarantees on the completion time of submitted independent parallel applications. We do not explicitly consider the cost model for jobs; the way deadlines are associated with jobs in our simulation studies is explained in the subsequent sections. At this point, the following open questions arise:

– How practical is a solution to this problem?
– What are the trade-offs involved in such a scheme compared to a non-deadline-based scheme?
– How does the imposition of deadlines by a few jobs affect the average response time of jobs that do not impose any deadlines?
– Meeting deadlines for some jobs might result in starvation of other, non-deadline jobs. Does making the scheme starvation-free by giving artificial deadlines to the non-deadline jobs affect the true deadline jobs?

We study the feasibility of such an idea by providing a framework, termed QoPS (standing for QoS for Parallel job Scheduling), for providing QoS within job schedulers, and we compare its trade-offs against existing non-deadline-based schemes. We compare it to adaptations of two existing algorithms - the Slack-Based (SB) algorithm [16] and the Real-Time (RT) algorithm [15] - previously proposed in different contexts. The SB algorithm [16] was proposed as an approach to improve the utilization achieved by a backfilling job scheduler. The RT algorithm [15], on the other hand, was proposed to schedule non-periodic real-time jobs with hard deadlines, and was evaluated in a static scenario for scheduling uni-processor jobs on a multiprocessor system.


As explained later, we adapted these two schemes to schedule parallel jobs in a dynamic job scheduling context with deadlines. The remaining part of the paper is organized as follows. In Section 2, we provide some background on deadline based job scheduling and how some schemes proposed in other contexts may be modified to accommodate jobs with deadlines. In Section 3, we discuss the design and implementation of a new scheduling scheme that allows deadline specification for jobs. The simulation approach to evaluate the schemes is discussed in Section 4. In Section 5, we present results of our simulation studies comparing the various schemes. In Section 6, we conclude the paper.

2 Background and Related Work

Most earlier schemes proposed for scheduling independent parallel jobs dealt with maximizing system metrics such as system utilization and throughput, with minimizing user metrics such as turnaround time, wait time, and slowdown, or with both. Some other schemes have looked at prioritizing jobs based on a number of statically or dynamically determined weights. In this section, we review some of the related previous work and propose modifications to it to suit the problem we address.

2.1 Review of Related Work

In this subsection we describe some previous related work. In the next subsection, we show how these schemes can be modified to accommodate jobs with deadlines.

Slack-Based (SB) Algorithm The Slack-Based (SB) algorithm [16], proposed by Feitelson et al., is a backfilling algorithm used to improve system throughput and user response times. The main idea of the algorithm is to allow a slack, or laxity, for each job. The scheduler gives each waiting job a precalculated slack, which determines how long it may have to wait before running: 'important' and 'heavy' jobs have little slack in comparison with others. When other jobs arrive, each job may be pushed back in its scheduled start time as long as its execution stays within the initially calculated slack. The calculation of the initial slack involves cost functions that take into consideration certain priorities associated with the job. This scheme supports both user-selected and administrative priorities, and guarantees a bounded wait time for all jobs. Although this algorithm was proposed for improving system utilization and user response times, it can easily be modified to support deadlines by fixing the slack appropriately. We propose this modified algorithm in Section 2.2.

Real Time (RT) Algorithm It has been shown that for dynamic systems with more than one processor, a polynomial-time optimal scheduling algorithm does not exist [11, 10, 12]. The Real Time (RT) algorithm [15], proposed by


Ramamritham et al., is an approach for scheduling uni-processor tasks with hard real-time deadlines on multi-processor systems. The algorithm tries to meet the specified deadlines for the jobs by using heuristic functions. The tasks are characterized by worst-case computation times, deadlines, and resource requirements. Starting with an empty partial schedule, each step of the search extends the current partial schedule with one of the tasks yet to be scheduled. The heuristic functions used in the algorithm actively direct the search for a feasible schedule, i.e., they help choose the task that extends the current partial schedule. Earliest Deadline First and Least Laxity First are examples of such heuristic functions. In order to bring this algorithm into the domain of scheduling dynamically arriving parallel jobs, we made two modifications: the first is to allow parallel jobs to be submitted to the algorithm, and the second is to allow dynamically arriving jobs. The details of the modified algorithm are provided in Section 2.2.

2.2 Modifications of Existing Schemes

In this section we propose modifications to the Slack-Based and Real-Time algorithms to support deadlines for parallel jobs.

Modified Slack Based (MSB) Algorithm The pseudo code for the modified slack-based algorithm is shown below.

Checking the admissibility of job J with Latest Start Time into an existing profile of size N:
A. set cheapPrice to MAXNUMBER
B. set cheapSchedule to existing schedule
C. for each time slot ts in the profile starting from current time
   a. Remove all the jobs from slot ts to the end
   b. insert job J at slot ts
   c. schedule each removed job one by one in Ascending Scheduled Time (AST) order
   d. Calculate the price of this new schedule using the cost function
   e. if (price ... MAXBACKTRACK) then
      1. Job J is rejected
      2. Keep the old schedule and break
   else
      1. continue step E with new partial schedule
   end if
   end if


   end for
F. if (all jobs are placed in the schedule) then
   a. Job J is accepted
   b. Update the current schedule
   end if

The RT algorithm assumes that the calculation of the heuristic function for scheduling a job into a given partial schedule takes constant time. This assumption holds only for sequential (single-processor) jobs, which were the focus of the algorithm; the scenario we consider in this paper involves parallel jobs, where holes are possible in the partial schedule, and the assumption no longer holds. The Modified RT (MRT) algorithm uses the same technique as the RT algorithm but increases the time complexity to accommodate the parallel-job scenario. When a new job arrives, all the jobs that have not yet started (including the newly arrived job) are sorted using some heuristic function (this function could be Earliest Deadline First, Least Laxity First, etc.). Each of these jobs is inserted into the schedule in the sorted order. A partial schedule at any point during this algorithm is said to be feasible if every job in it meets its deadline. A partial schedule is said to be strongly feasible if the following two conditions are met:

– The partial schedule is feasible.
– The partial schedule would remain feasible when extended by any one of the unscheduled jobs.

When the algorithm reaches a point where the partial schedule obtained is not feasible, it backtracks to a previous strongly feasible partial schedule and tries to take a different path. A certain number of backtracks are allowed, after which the scheduler rejects the job.
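The sketch below is our own simplified illustration of this kind of bounded-backtracking search for parallel jobs, not the authors' MRT implementation: it packs jobs in EDF order onto a reservation profile, backtracks on a deadline miss, and gives up after a fixed number of misses; the strong-feasibility check of the original algorithm is omitted, and all names are ours.

    def free_at(schedule, t, n):
        """Processors free at time t given reservations [(start, end, size), ...]."""
        return n - sum(size for s, e, size in schedule if s <= t < e)

    def earliest_start(schedule, size, runtime, n, now):
        """Earliest time >= now at which `size` processors are free for `runtime`."""
        starts = sorted({now} | {e for _, e, _ in schedule if e > now})
        events = sorted({s for s, _, _ in schedule} | {e for _, e, _ in schedule})
        for t in starts:
            probes = [t] + [x for x in events if t < x < t + runtime]
            if all(free_at(schedule, p, n) >= size for p in probes):
                return t
        return None                       # only happens if size > n

    def feasible_schedule(jobs, n, now, max_backtracks=8):
        """jobs = [(name, size, runtime, deadline)]; EDF order, bounded backtracking."""
        state = {"backtracks": 0}

        def extend(schedule, placed, remaining):
            if not remaining:
                return placed
            for idx, (name, size, runtime, deadline) in enumerate(remaining):
                start = earliest_start(schedule, size, runtime, n, now)
                if start is not None and start + runtime <= deadline:
                    result = extend(schedule + [(start, start + runtime, size)],
                                    placed + [(name, start)],
                                    remaining[:idx] + remaining[idx + 1:])
                    if result is not None:
                        return result
                state["backtracks"] += 1
                if state["backtracks"] > max_backtracks:
                    return None
            return None

        edf = sorted(jobs, key=lambda j: j[3])   # Earliest Deadline First heuristic
        return extend([], [], edf)

    jobs = [("a", 4, 10, 12), ("b", 4, 10, 25), ("c", 8, 5, 30)]
    print(feasible_schedule(jobs, n=8, now=0))   # [('a', 0), ('b', 0), ('c', 10)]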

3 The QoPS Scheduler

In this section we present the QoPS algorithm for parallel job scheduling in deadline-based systems. As mentioned earlier, it has been shown that for dynamic systems with more than one processor, a polynomial-time optimal scheduling algorithm does not exist. The QoPS scheduling algorithm therefore uses a heuristic approach to search for feasible schedules. The scheduler considers a system where each job arrives with a corresponding completion-time deadline requirement. When a job arrives, the scheduler attempts to find a feasible schedule for it. A schedule is said to be feasible if it does not violate the deadline constraint of any job in the schedule, including the newly arrived job. The pseudo code for the QoPS scheduling algorithm is presented below:


Checking the admissibility of job J into an existing profile of size N:
A. For each time slot ts in position (0, N/2, 3N/4, 7N/8, ...) starting from Current Time
   1. Remove all waiting jobs from position ts to the end of the profile and place them into a Temporary List (TL)
   2. Sort the temporary list using the Heuristic function
   3. Set Violation Count = 0
   4. For each job Ji in the temporary list (TL)
      i. Add job Ji into the schedule
      ii. if (there is a deadline violation for job Ji at slot T) then
         a. Violation Count = Violation Count + 1
         b. if (Violation Count > K-Factor) break
         c. Remove all jobs from the schedule from position mid(ts + T) to position T and add them into TL
         d. Sort the temporary list using the Heuristic function
         e. Add the failed job Ji to the top of the temporary list to make sure it will be scheduled at mid(ts + T)
         end if
      end for
   5. if (Violation Count > K-Factor) then
      i. Job is rejected
      ii. break
      end if
   end for
B. if (violation count < K-Factor) then
   Job is accepted
   end if

The main difference between the MSB and the QoPS algorithm is the flexibility the QoPS algorithm offers in reordering the jobs that have already been scheduled (but not yet started). For example, suppose jobs J1, J2, ..., JN are the jobs currently in the schedule but not yet started. The MSB algorithm specifies a fixed order for the jobs, as calculated by some heuristic function (the heuristic function could be least laxity first, earliest deadline first, etc.). This ordering specifies the order in which the jobs are considered for scheduling, and for the rest of the algorithm this ordering is fixed. When a new job J(N+1) arrives, the MSB algorithm tries to fit this new job into the given schedule without any change to the initial ordering of the jobs. The QoPS scheduler, on the other hand, allows flexibility in the order in which jobs are considered for scheduling. The amount of flexibility offered is determined by the K-factor denoted in the pseudo code above. When a new job arrives, it is given log2(N) points in time where its insertion into the schedule is attempted, corresponding to the reserved start-times of jobs


When a new job arrives, it is given log2(N) points in time at which its insertion into the schedule is attempted, corresponding to the reserved start-times of jobs {0, N/2, 3N/4, ...} respectively, where N is the number of jobs currently in the queue. The interpretation of these options is as follows. For option 1 (corresponding to 0 in the list), we start by removing all the jobs from the schedule and placing them in a temporary ordering (TL). We then append the new job to TL and sort the N+1 jobs in TL according to some heuristic function (again, the heuristic function could be least laxity first, earliest deadline first, etc.). Finally, we try to place the jobs in the order specified by the temporary ordering TL. For option 2, we do not start with an empty schedule. Instead, we remove only the latter N/2 jobs of the original schedule, chosen in scheduled start-time order, and place them in the temporary list TL. We again append the new job to TL and sort the N/2 + 1 jobs in TL (based on the heuristic function). We finally generate reservations for these N/2 + 1 jobs in the order specified by TL. In this manner, log2(N) options for placement are evaluated.

For each option given to the newly arrived job, the algorithm tries to schedule the jobs based on this temporary ordering. If a job misses its deadline, it is considered a critical job and is pushed to the head of the list (thus altering the temporary schedule). This altering of the temporary schedule is allowed at most K times; after that, the scheduler decides that the new job cannot be scheduled while maintaining the deadlines of all the already accepted jobs, and rejects it. This results in a time complexity of O(K log2(N)) for the QoPS scheduling algorithm.
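The admission test above can be summarized in a compact Python sketch. This is only an illustration of the control flow, not the authors' code: the reserve() helper (assumed to return the completion time the backfilling profile would give a job appended to the current partial schedule), the earliest-deadline-first sort key, and the way the rollback midpoint is computed are all simplifying assumptions.

    import math

    def qops_admit(waiting, new_job, reserve, k_factor=5):
        """Try the log2(N) insertion options {0, N/2, 3N/4, ...}; return a feasible
        schedule (a job list in reservation order) if one is found, else None."""
        n = len(waiting)
        options = [0]
        i = 1
        while n > 1 and n // (2 ** i) >= 1:
            options.append(n - n // (2 ** i))     # keep the first 0, N/2, 3N/4, ... reservations fixed
            i += 1

        for ts in options:
            schedule = list(waiting[:ts])         # reservations left untouched for this option
            tl = sorted(waiting[ts:] + [new_job], key=lambda j: j.deadline)
            violations = 0
            feasible = True
            while tl:
                job = tl.pop(0)
                if reserve(schedule, job) <= job.deadline:
                    schedule.append(job)
                    continue
                violations += 1                   # job missed its deadline: critical job
                if violations > k_factor:
                    feasible = False
                    break
                # roll back the most recently placed half, re-sort, and force the
                # critical job to the head of the temporary list
                mid = ts + (len(schedule) - ts) // 2
                tl = sorted(schedule[mid:] + tl, key=lambda j: j.deadline)
                schedule = schedule[:mid]
                tl.insert(0, job)
            if feasible:
                return schedule                   # accept: all deadlines met
        return None                               # all options exhausted; reject the new job

Note that each successive option fixes a strictly larger prefix of the existing reservations, so later options perturb fewer already-scheduled jobs.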

4 Evaluation Approach

In this section we present the approach we took to evaluate the QoPS scheme, comparing it with the other deadline-based schemes and with the non-deadline-based EASY backfilling scheme.

4.1 Trace Generation

Job scheduling strategies are usually evaluated using real workload traces, such as those available at the Parallel Workloads Archive [5]. However, real job traces from supercomputer centers contain no deadline information. A possible approach to evaluating the QoPS scheduling strategy might be based on the methodology used in [15] to evaluate their real-time scheduling scheme. Their randomized synthetic job sets were created in such a way that a job set could be packed into a completely filled schedule, say from time = 0 to time = T, with no holes at all in the entire schedule. Each job was then given an arrival time of zero and a completion deadline of (1+r)*T. The value of ‘r’ represented the degree of difficulty in meeting the deadlines: a larger value of ‘r’ made the deadlines more lax. The initial synthetic packed schedule is clearly a valid schedule for all non-negative values of ‘r’. The real-time scheduling algorithm was evaluated for different values of ‘r’, over a large number of such synthesized task sets. The primary metric was the fraction of cases in which a valid schedule for all tasks was found by the scheduling algorithm.


It was found that as ‘r’ was increased, a valid schedule was found for a larger fraction of the experiments, asymptotically tending to 100%. We first attempted to extend this approach to the dynamic context. We used a synthetic packed schedule of jobs but, unlike the static context evaluated in [15], we set each job's arrival time to be its scheduled start time in the synthetic packed schedule, and set its deadline beyond its start time by (1+r) times its runtime. When we evaluated different scheduling algorithms, we found that when ‘r’ was zero, all schemes had a 100% success rate, while the success rate dropped as ‘r’ was increased! This was initially puzzling, but the reason quickly became apparent: with r = 0, as each job arrives, the only possible valid placement of the new job corresponds to its placement in the synthetic packed schedule, and any deadline-based scheduling algorithm exactly tracks the optimal schedule. When ‘r’ is increased, other choices become feasible, the schedules begin to diverge from the optimal schedule, and the failure rate increases. Thus, this approach to generating the test workload is attractive in that it has a known valid schedule meeting the deadlines of all jobs, but it leads to the unnatural trend of a decreasing scheduling success rate as the deadlines of jobs are made more relaxed.

Due to the above problem with the evaluation methodology used in [15], we pursued a different trace-driven approach to evaluation. We used traces from Feitelson's archive (5000-job subsets of the CTC and SDSC traces) and first used EASY backfill to generate a valid schedule for the jobs. Deadlines were then assigned to all jobs, based on their completion times in the schedule generated by EASY backfill. A deadline stringency factor determined how tight the deadline was to be set, compared to the EASY backfill schedule. With a stringency factor of 0, the deadlines were set to be the completion times of the jobs under the EASY backfill schedule. With a stringency factor of ‘S’, the deadline of each job was set ahead of its arrival time by max(runtime, (1-S)*EASY-Schedule-Responsetime). The metric used was the number of jobs successfully scheduled. As ‘S’ is increased, the deadlines become more stringent, so we would expect the number of successfully scheduled jobs to decrease.
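As a small illustration of this assignment rule (field names are hypothetical; easy_response denotes the job's response time under the EASY backfill schedule of the same trace):

    def assign_deadline(arrival, runtime, easy_response, stringency):
        """Deadline = arrival + max(runtime, (1 - S) * EASY-schedule response time)."""
        return arrival + max(runtime, (1.0 - stringency) * easy_response)

    # With S = 0 the deadline coincides with the job's EASY completion time;
    # increasing S tightens the deadline toward arrival + runtime.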

4.2 Evaluation Scenarios

In the first set of simulation experiments, the three schemes (MRT, MSB and QoPS) were compared under different offered loads. A load factor ‘l’ was varied from 1.0 to 1.6. The load was varied using two schemes: duplication and expansion. In the duplication scheme, with l = 1.0, only the original jobs in the trace subset were used; with l = 1.2, 20% of the original jobs were picked and duplicates were introduced into the trace at the same arrival times to form a modified trace. In the expansion scheme, the arrival times of the jobs were unchanged, but the run-times were magnified by a factor of 1.2 for l = 1.2. The modified trace was first scheduled using EASY backfill, and then the deadlines for the jobs were set as described above, based on the stringency factor.
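The two load-scaling transformations could be sketched as follows. This is a simplified illustration, assuming each trace record is a dict with 'arrival' and 'runtime' keys and that the duplicated jobs are picked at random; the paper does not specify the selection mechanism.

    import random

    def scale_by_duplication(trace, load_factor, seed=0):
        """Duplicate a randomly chosen (load_factor - 1) fraction of the jobs,
        keeping the duplicates' arrival times identical to the originals."""
        rng = random.Random(seed)
        picked = rng.sample(range(len(trace)), int((load_factor - 1.0) * len(trace)))
        return trace + [dict(trace[i]) for i in picked]

    def scale_by_expansion(trace, load_factor):
        """Keep arrival times unchanged and magnify run-times by the load factor."""
        return [dict(job, runtime=job['runtime'] * load_factor) for job in trace]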


After evaluating the schemes under the scenario described above, we carried out another set of experiments under a model where only a fraction of the jobs have deadlines associated with them. This may be a more realistic scenario at supercomputer centers: while some of the jobs may be urgent and impose deadlines, there would likely also be other, non-urgent jobs for which the users do not require any deadlines. In order to evaluate the MSB, MRT, and QoPS schemes under this scenario of mixed jobs, some with user-imposed deadlines and others without, we artificially created very lax deadlines for the non-deadline jobs. While the three schemes could be run with an "infinite" deadline for the non-deadline jobs, we did not do so, in order to avoid starvation of any jobs. The artificial deadline of each non-deadline job was set to max(24 hours, R*runtime), ‘R’ being a relaxation factor. Thus, short non-deadline jobs were given an artificial deadline of one day, while long jobs were given a deadline of R*runtime.
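The artificial deadline for non-deadline jobs then amounts to the following (a minimal sketch with times in seconds; R = 2 is the value used in the experiments reported below):

    def artificial_deadline(arrival, runtime, relaxation=2.0):
        """Non-deadline jobs get a deadline of max(24 hours, R * runtime) past arrival."""
        return arrival + max(24 * 3600.0, relaxation * runtime)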

5 Experimental Results

As discussed in the previous section, the deadline-based scheduling schemes were evaluated through simulation using traces derived from the CTC and SDSC traces archived at the Parallel Workloads Archive [5]. Deadlines were associated with each job in the trace as described earlier. Different offered loads were simulated by the addition of a controlled number of duplicate jobs. The results for the expansion scheme are omitted here for space reasons and can be found in [7]. The introduction of duplicate jobs was done incrementally, i.e., the workload for a load of 1.6 includes all the jobs in the trace for load 1.4, with the same arrival times for the common jobs. For a given load, different experiments were carried out for different values of the stringency factor ‘S’, with jobs having more stringent deadlines for a higher stringency factor. In this paper we show only the results for the CTC trace and refer the reader to [7] for the results with the SDSC trace. The values of the K-factor and the relaxation factor ‘R’ used for the QoPS scheme were 5 and 2 respectively, and the number of backtracks allowed for the MRT scheme was 300 in all the experiments.

5.1 All Jobs with Deadlines

We first present results for the scenario where all the submitted jobs have deadlines associated with them, determined as described in the previous section. The metrics measured were the percentage of unadmitted jobs and the percentage of lost processor-seconds from the unadmitted jobs. Figure 1 shows the percentage of unadmitted jobs and lost processor-seconds for the MRT, MSB and QoPS schemes, for a stringency factor of 0.2, as the load factor is varied from 1.0 to 1.6. It can be seen that the QoPS scheme performs better, especially at high load factors. In the case of QoPS, as the load is increased from 1.4 to 1.6, the total number of unaccepted jobs actually decreases, even though the total number of jobs in the trace increases from 7000 to 8000. The reason for this counter-intuitive result is as follows.


(Figure: % of unadmitted jobs and % of unadmitted processor-seconds vs. load factor for QoPS, MRT, and MSB)
Fig. 1. Admittance capacity for less stringent (Stringency Factor = 0.2) deadlines: (a) Unadmitted jobs, (b) Unadmitted Processor Seconds

(Figure: % of unadmitted jobs and % of unadmitted processor-seconds vs. load factor for QoPS, MRT, and MSB)
Fig. 2. Admittance capacity for more stringent (Stringency Factor = 0.5) deadlines: (a) Unadmitted jobs, (b) Unadmitted Processor Seconds

As the load increases, the average wait time for jobs under EASY backfill increases nonlinearly as we approach system saturation. Since the deadline associated with a job is based on its schedule under EASY backfill, the same job will have a higher response time, and hence a looser deadline, in a higher-load trace than in a trace with lower load. So it is possible for more jobs to be admitted with a higher-load trace than with a lower-load trace, if there is a sufficient increase in the deadlines. A similar and more pronounced downward trend with increasing load is observed for the unadmitted processor-seconds. This is due to the greater relative increase in the response time of "heavier" jobs (i.e., those with higher processor-seconds) than lighter jobs: as the load increases, more heavy jobs are admitted and more light jobs fail to be admitted.

The same overall trend also holds for a higher stringency factor (0.5), as seen in Figure 2. However, the performance of QoPS is closer to that of the other two schemes. In general, we find that as the stringency factor increases, the performance of the different strategies tends to converge. This suggests that the additional flexibility that QoPS tries to exploit in rearranging schedules is most beneficial when the jobs have sufficient laxity with respect to their deadlines.


(Figure: utilization vs. load factor for QoPS, MRT, MSB, EASY, and CONS at stringency factors 0.2 and 0.5)

Fig. 3. Utilization comparison: (a) Stringency Factor = 0.2, (b) Stringency Factor = 0.5

We next look at the achieved utilization of the system as the load is varied. As a reference, we compare the utilization for the deadline-based scheduling schemes with that of non-deadline-based job scheduling schemes (EASY and conservative backfilling) using the same trace. Since a fraction of the submitted jobs cannot be admitted under a deadline-based schedule, we can clearly expect the achieved system utilization to be worse than in the non-deadline case. Figure 3 shows the system utilization achieved for stringency factors of 0.2 and 0.5. There is a loss of utilization of about 10% for QoPS when compared to EASY and conservative backfilling when the stringency factor is 0.2. With a stringency factor of 0.5, fewer jobs are admitted, and the utilization achieved with the deadline-based scheduling schemes drops by 5-10%. Among the deadline-based schemes, QoPS and MSB perform comparably, with MRT achieving 3-5% lower utilization at high load (load factor of 1.6).

5.2 Mixed Job Scenario

We next consider the case when only a subset of the submitted jobs have user-specified deadlines. As discussed in Section 4, non-deadline jobs were given an artificial deadline that provided considerable slack but prevented starvation. We evaluated the following combinations through simulation: (a) 80% non-deadline jobs and 20% deadline jobs, and (b) 20% non-deadline jobs and 80% deadline jobs. For each combination, we considered stringency factors of 0.2 and 0.5.

Figure 4 shows the variation of the admittance of deadline jobs with offered load for the three schemes, when 80% of the jobs are deadline jobs and the stringency factor is 0.2. It can be seen that the QoPS scheme provides consistently superior performance compared to the MSB and MRT schemes, especially at high load. Figure 5 presents data for the case with 20% of jobs having user-specified deadlines and a stringency factor of 0.2.


(Figure: % of unadmitted jobs and % of unadmitted processor-seconds vs. load factor for QoPS, MRT, and MSB)
Fig. 4. Admittance capacity with a mix of deadline and non-deadline jobs (Percentage of Deadline Jobs = 80%) for less stringent (Stringency Factor = 0.2) deadlines: (a) Unadmitted jobs, (b) Unadmitted Processor Seconds

(Figure: % of unadmitted jobs and % of unadmitted processor-seconds vs. load factor for QoPS, MRT, and MSB)
Fig. 5. Admittance capacity with a mix of deadline and non-deadline jobs (Percentage of Deadline Jobs = 20%) for less stringent (Stringency Factor = 0.2) deadlines: (a) Unadmitted jobs, (b) Unadmitted Processor Seconds

Compared to the case with 80% deadline jobs, the QoPS scheme significantly outperforms the MSB and MRT schemes. This again suggests that in scenarios where many jobs have significant flexibility (here the non-deadline jobs comprise 80% of the jobs and have significant scheduling flexibility), the QoPS scheme makes effective use of the available flexibility. The results for a stringency factor of 0.5 are omitted and can be found in [7].

Figures 6 and 7 show the variation of the average response time and average slowdown of the non-deadline jobs with load, for the cases with 20% and 80% of the jobs being deadline jobs and a stringency factor of 0.2. In addition to the data for the three deadline-based scheduling schemes, data for the EASY and conservative backfill mechanisms is also shown. The average response time and slowdown can be seen to be lower for QoPS, MRT and MSB than for EASY or conservative backfilling. This is because the delivered load for the non-deadline-based schemes equals the offered load (the X-axis), whereas the delivered load for the deadline-based scheduling schemes is lower than the offered load.


(Figure: average response time (sec) and average slowdown vs. load factor for QoPS, MRT, MSB, EASY, and CONS)
Fig. 6. Performance of Non-deadline jobs with Percentage of Deadline Jobs = 20%; Stringency Factor = 0.2 (a) Response time, (b) Average slowdown

(Figure: average response time (sec) and average slowdown vs. load factor for QoPS, MRT, MSB, EASY, and CONS)
Fig. 7. Performance of Non-deadline jobs with Percentage of Deadline Jobs = 80%; Stringency Factor = 0.2 (a) Response time, (b) Average slowdown

In other words, with the non-deadline-based schemes all the jobs are admitted, whereas with the deadline-based schemes not all deadline jobs are admitted. This also explains why the performance of QoPS appears inferior to MSB and MRT: as seen from Figure 4, the rejected load from the deadline jobs is much higher for MRT than for QoPS. When the data for the case of 20% deadline jobs is considered (Figure 6), the performance of QoPS improves relative to MSB; the turnaround time is comparable or better except at a load of 1.6, and the average slowdown is lower at all loads. These user metrics are better for QoPS than for MSB/MRT despite its accepting a higher load (Figure 5).

The achieved utilization of the different schemes as a function of load is shown in Figure 8 for a stringency factor of 0.2. It can be seen that the utilization achieved with QoPS is roughly comparable with MSB and better than MRT, but worse than the non-deadline-based schemes.


(Figure: utilization vs. load factor for QoPS, MRT, MSB, EASY, and CONS)
Fig. 8. Utilization comparison for a mix of deadline and non-deadline jobs (Stringency Factor = 0.2): (a) Percentage of Deadline Jobs = 80%, (b) Percentage of Deadline Jobs = 20%

(Figure: average response time (sec) and average slowdown vs. utilization for QoPS, MRT, MSB, EASY, and CONS)
Fig. 9. Performance variation of Non-deadline jobs with utilization for Percentage of Deadline Jobs = 20%; Stringency Factor = 0.2 (a) Average response time, (b) Average slowdown

Compared to the case when all jobs had user-specified deadlines (Figure 3), the loss of utilization relative to the non-deadline-based schemes is much smaller: about 8% when 80% of the jobs are deadline jobs, and 2-3% when 20% of the jobs are deadline jobs.

As discussed above, a direct comparison of turnaround time and slowdown as a function of offered load is complicated by the fact that the different scheduling schemes accept different numbers of jobs. A better way of comparing the schemes directly is to plot average response time or slowdown against achieved utilization on the X-axis (instead of offered load). This is shown in Figure 9 for the case of 20% deadline jobs and a stringency factor of 0.2. When there is sufficient laxity available, it can be seen that QoPS is consistently superior to MSB and MRT. Further, QoPS also performs better than EASY and conservative backfilling, especially for the slowdown metric. Thus, despite the constraints imposed by the deadline jobs, QoPS is able to achieve better slowdown and response time for the non-deadline jobs than EASY and conservative backfilling.


(Figure: average response time (sec) and average slowdown vs. utilization for QoPS, MRT, MSB, EASY, and CONS)
Fig. 10. Performance variation of Non-deadline jobs with utilization for Percentage of Deadline Jobs = 80%; Stringency Factor = 0.2 (a) Average response time, (b) Average slowdown

In other words, instead of adversely affecting the non-deadline jobs for the same delivered load, QoPS provides better performance for them than the standard backfill mechanisms. However, when the fraction of deadline jobs increases to 80% (Figure 10), the performance of QoPS degrades.

6 Conclusions and Future Work

Scheduling dynamically arriving independent parallel jobs on a given set of resources is a long-studied problem; the issues addressed range from the evaluation of system and user metrics such as utilization, throughput, and turnaround time, to soft time guarantees for the response time of jobs using priority-based scheduling. However, the problem of providing Quality of Service (QoS) guarantees in parallel job scheduling had not previously been addressed. In this paper we proposed a new scheme, termed the QoPS scheduling algorithm, which provides QoS to the end user in the form of guarantees on the completion times of the submitted independent parallel jobs. Its effectiveness has been evaluated using trace-driven simulation.

The current scheme does not explicitly deal with cost metrics for charging jobs depending on their deadlines and resource usage. Also, when a job arrives, there are a number of options for its placement in the schedule. The current scheme examines these options in FCFS order and does not evaluate whether one option is better than the others with respect to system and user metrics (such as utilization). We plan to extend the scheme with cost functions both for charging jobs and for evaluating and choosing between several feasible schedules when admitting new jobs.


References

[1] NERSC. http://hpcf.nersc.gov/accounts/priority charging.html.
[2] S.-H. Chiang and M. K. Vernon. Production Job Scheduling for Parallel Shared Memory Systems. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium, April 2001.
[3] W. Cirne and F. Berman. Adaptive Selection of Partition Size of Supercomputer Requests. In Proceedings of the 6th Workshop on Job Scheduling Strategies for Parallel Processing, April 2000.
[4] D. Feitelson, L. Rudolph, U. Schwiegelshohn, K. C. Sevcik, and P. Wong. Theory and Practice in Parallel Job Scheduling. In Proceedings of the IEEE Workshop on Job Scheduling Strategies for Parallel Processing, 1997.
[5] D. G. Feitelson. Parallel Workloads Archive. http://www.cs.huji.ac.il/labs/parallel/workload/.
[6] P. Holenarsipur, V. Yarmolenko, J. Duato, D. K. Panda, and P. Sadayappan. Characterization and Enhancement of Static Mapping Heuristics for Heterogeneous Systems. In Proceedings of the IEEE International Symposium on High Performance Computing (HiPC), December 2000.
[7] M. Islam, P. Balaji, P. Sadayappan, and D. K. Panda. QoPS: A QoS Based Scheme for Parallel Job Scheduling. Technical report, The Ohio State University, Columbus, OH, April 2003.
[8] B. Jackson, B. Haymore, J. Facelli, and Q. O. Snell. Improving Cluster Utilization Through Set Based Allocation Policies. In IEEE Workshop on Scheduling and Resource Management for Cluster Computing, September 2001.
[9] P. Keleher, D. Zotkin, and D. Perkovic. Attacking the Bottlenecks in Backfilling Schedulers. In Cluster Computing: The Journal of Networks, Software Tools and Applications, March 2000.
[10] A. K. Mok. Fundamental Design Problems of Distributed Systems for the Hard Real-Time Environment. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, May 1983.
[11] A. K. Mok. The Design of Real-Time Programming Systems Based on Process Models. In Proceedings of the IEEE Real Time Systems Symposium, December 1984.
[12] A. K. Mok and M. L. Dertouzos. Multi-Processor Scheduling in a Hard Real-Time Environment. In Proceedings of the Seventh Texas Conference on Computing Systems, November 1978.
[13] A. W. Mu'alem and D. G. Feitelson. Utilization, Predictability, Workloads and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling. IEEE Transactions on Parallel and Distributed Systems, 12(6), June 2001.
[14] D. Perkovic and P. Keleher. Randomization, Speculation and Adaptation in Batch Schedulers. In Proceedings of the IEEE International Conference on Supercomputing, November 2000.
[15] K. Ramamritham, J. A. Stankovic, and P.-F. Shiah. Efficient Scheduling Algorithms for Real-Time Multiprocessor Systems. IEEE Transactions on Parallel and Distributed Systems, 1(2), April 1990.
[16] D. Talby and D. G. Feitelson. Supporting Priorities and Improving Utilization of the IBM SP2 Scheduler Using Slack Based Backfilling. In Proceedings of the 13th International Parallel Processing Symposium, pages 513-517, April 1997.

Author Index

Andrade, Nazareno 61
Arlitt, Martin 129
Balaji, Pavan 252
Banen, S. 105
Brasileiro, Francisco 61
Bucur, A.I.D. 105
Cirne, Walfredo 61
Epema, D.H.J. 105
Ernemann, Carsten 166
Feitelson, Dror G. 208, 228
Fernandez, Juan 208
Frachtenberg, Eitan 208
Goldenberg, Mark 21
Grondona, Mark 44
Hovestadt, Matthias 1
Islam, Mohammad 252
Jette, Morris A. 44
Kao, Odej 1
Keller, Axel 1
Kettimuthu, Rajkumar 87
Lu, Paul 21
Moreira, Jose 183
Panda, D. K. 252
Petrini, Fabrizio 208
Pruyne, Jim 129
Rajan, Arun 87
Roisenberg, Paulo 61
Rolia, Jerry 129
Sabin, Gerald 87
Sadayappan, P. 252
Sadayappan, Ponnuswamy 87
Schaeffer, Jonathan 21
Shmueli, Edi 228
Sivasubramaniam, Anand 183
Song, Baiyi 166
Streit, Achim 1
Subhlok, Jaspal 148
Venkataramaiah, Shreenivasa 148
Yahyapour, Ramin 166
Yang, Antony 183
Yoo, Andy B. 44
Zhang, Yanyong 183
Zhu, Xiaoyun 129