Readings in Distributed Computing Systems

Thomas L. Casavant and Mukesh Singhal
IEEE COMPUTER SOCIETY PRESS
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS, INC.
IEEE Computer Society Readings in series

Each volume in the Readings in series is coordinated with a special-interest issue of Computer magazine, the IEEE Computer Society's flagship periodical. Readings in volumes consolidate both tutorial and intermediate material to deliver the most up-to-date information on developing areas in computer science and engineering.
Readings in titles

Readings in Computer-Generated Music
Readings in Real-Time Systems
Readings in Distributed Computing Systems
Readings in Object-Oriented Systems and Applications
Readings in Computer Architectures for Intelligent Systems
Readings in Document Image Analysis
Readings in Rapid Systems Prototyping
Readings in Distributed Computing Systems
Thomas L. Casavant and Mukesh Singhal
IEEE Computer Society Press
Los Alamitos, California

Washington • Brussels • Tokyo
Library of Congress Cataloging-in-Publication Data

Casavant, Thomas.
Readings in Distributed Computing Systems / Thomas Casavant and Mukesh Singhal.
p. cm. — (A Computer Society Press tutorial)
Includes bibliographical references.
ISBN 0-8186-3032-9 (case). — ISBN 0-8186-3031-0 (fiche)
1. Electronic data processing — Distributed processing. I. Singhal, Mukesh. II. Title. III. Series: IEEE Computer Society Press tutorial.
QA76.9.D5C35 1994
004'.36 — dc20
92-40164 CIP
Published by the
IEEE Computer Society Press
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1264

© 1994 by the Institute of Electrical and Electronics Engineers, Inc. All rights reserved.

Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy isolated pages beyond the limits of US copyright law, for the private use of their patrons. For other copying, reprint, or republication permission, write to IEEE Copyrights Manager, IEEE Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.

IEEE Computer Society Press Order Number 3032
Library of Congress Number 92-40164
IEEE Catalog Number EH0359-0
ISBN 0-8186-3031-0 (microfiche)
ISBN 0-8186-3032-9 (case)

Additional copies can be ordered from:

IEEE Computer Society Press
Customer Service Center
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1264
Tel: (714) 821-8380
Fax: (714) 821-4641
Email: [email protected]
IEEE Service Center
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331
Tel: (908) 981-1393
Fax: (908) 981-9667

IEEE Computer Society
13, avenue de l'Aquilon
B-1200 Brussels
BELGIUM
Tel: +32-2-770-2198
Fax: +32-2-770-8505

IEEE Computer Society
Ooshima Building
2-19-1 Minami-Aoyama
Minato-ku, Tokyo 107
JAPAN
Tel: +81-3-3408-3118
Fax: +81-3-3408-3553
Technical editors: Jon T. Butler and V. Rao Vemuri
Production editor/copy editor: Phyllis Walker
Book layout: Diamond Disc, Christopher Patterson
Cover design/book design: VB Designs, Toni Van Buskirk
Cover photography: The Stock Market/Tom Sanders © 1990

Printed in the United States of America by Braun-Brumfield, Inc.
The Institute of Electrical and Electronics Engineers, Inc.
Foreword

Jon T. Butler and V. Rao Vemuri
This is the third of our Readings in volumes, a series published by the IEEE Computer Society Press. Our intent is to provide, in one text, tutorial and intermediate material on developing areas in computer science and engineering. A unique aspect of this series is its origin. Each volume is developed from a special issue of Computer magazine, the IEEE Computer Society's flagship periodical. That is, the editors have chosen papers to produce both a special issue of Computer and, subsequently, a Readings in volume. The papers in Computer provide a tutorial introduction to the subject matter and target an audience with a broad background. The papers in our Readings in series provide a wider perspective of the subject and significantly greater coverage.

The Readings in series is appropriate for (1) students in senior- and graduate-level courses, (2) engineers seeking a convenient way to improve their knowledge, and (3) managers wishing to augment their technical expertise. The guiding principle motivating this series is the delivery of the most up-to-date information in emerging fields. Because computer scientists and engineers deal with rapidly changing technology, they need access to tutorial descriptions of new and promising developments. Our Readings in texts will satisfy that need.

Papers chosen for this volume were judged on their technical content, quality, and clarity of exposition. As with other Computer Society Press products, this text has undergone thorough review. In addition, all of the previously published papers had to pass Computer magazine's strict review process.

We wish to thank all who have contributed to this effort: the authors, who devoted substantial effort to produce the high quality required for their papers to be selected, and our referees, who donated their expertise and time to evaluate the manuscripts and provide feedback to the authors. A special acknowledgment is due the editors, Thomas L. Casavant and Mukesh Singhal, whose time and energies were required to read the papers, direct an extensive administrative effort, coordinate referee reports, select final papers, and secure timely and high-quality revisions.
Preface

Thomas L. Casavant and Mukesh Singhal
The major impetus for this text was the overwhelming response to our call for papers for the August 1991 special issue of Computer magazine on distributed computing systems. Over 117 authors, some very well known, submitted papers of excellent quality. Since Computer could publish only seven of these papers because of space limitations, we viewed this as an ideal opportunity to publish a tutorial text on the subject. As an introduction and foundation to the subject, we took the seven papers that Computer had selected, and then we added 24 other papers to provide significantly broader coverage for specialists and nonspecialists alike.

Distributed computing systems has emerged as an active research area, with both the number of researchers in the field and the number of distributed system prototypes increasing dramatically. This unprecedented interest can be judged from the increase in

• Professional meetings: There are many international conferences, symposia, and workshops that focus totally on distributed computing systems, such as the International Conference on Distributed Computing Systems, the ACM Symposium on Principles of Distributed Computing, and the IEEE Workshop on Parallel and Distributed Systems. Other conferences, such as the International Conference on Parallel Processing and the Computer Software and Applications Conference (COMPSAC), now devote tracks solely to distributed computing systems.

• Journals: Journals dedicated to the topic of distributed computing systems include Distributed Computing, by Springer-Verlag; IEEE Transactions on Parallel and Distributed Systems; and the Journal of Parallel and Distributed Computing, by Academic Press. Other journals covering the topic as a regular subject include IEEE Transactions on Software Engineering; ACM Transactions on Computer Systems; and Real-Time Systems, by Kluwer Academic Publishers. Several journals have devoted special issues to distributed computing systems, including IEEE Transactions on Software Engineering, January 1987; IEEE Transactions on Computers, December 1988 and August 1989; Algorithmica, 1988; and Computer, August 1991.

• Industrial push: Industries and universities have developed a number of prototype implementations of distributed systems, including Mach at Carnegie Mellon University; V-Kernel at Stanford University; Sprite at the University of California, Berkeley; Amoeba at Vrije University, Amsterdam; System R* at IBM; Locus at the University of California, Los Angeles; VAX-Cluster at Digital Equipment Corporation; and the Spring Project at the University of Massachusetts at Amherst.

The purpose of this text is twofold: to present new developments in the distributed computing systems field and to summarize the current state of the art. Our goal is to provide condensed information to new researchers, as well as a unified perspective to researchers currently active in the field.
Contents

Foreword .... v
Preface .... vii
Introduction .... 1
   T.L. Casavant and M. Singhal

Chapter 1: Distributed Computing Systems: An Overview .... 5
   Distributed Computing .... 6
      J.A. Stankovic
   A Taxonomy of Scheduling in General-Purpose Distributed Computing Systems .... 31
      T.L. Casavant and J.G. Kuhl
   Deadlock Detection in Distributed Systems .... 52
      M. Singhal

Chapter 2: Theoretical Aspects .... 72
   Logical Time in Distributed Computing Systems .... 73
      C. Fidge
   The Many Faces of Consensus in Distributed Systems .... 83
      J. Turek and D. Shasha
   Self-Stabilization in Distributed Systems .... 100
      M. Flatebo, A.K. Datta, and S. Ghosh

Chapter 3: General .... 115
   A Model for Executing Computations in a Distributed Environment .... 116
      C.E. Wills
   Relaxed Consistency of Shared State in Distributed Servers .... 133
      K. Ravindran and S.T. Chanson

Chapter 4: Object-Oriented Systems .... 151
   An Object-Based Taxonomy for Distributed Computing Systems .... 152
      B.E. Martin, C.H. Pedersen, and J. Bedford-Roberts
   Fragmented Objects for Distributed Abstractions .... 170
      M. Makpangou, Y. Gourhant, J.-P. Le Narzul, and M. Shapiro
   Configuring Object-Based Distributed Programs in REX .... 187
      J. Kramer, J. Magee, M. Sloman, and N. Dulay

Chapter 5: Fault Tolerance and Crash Recovery .... 206
   Probabilistic Diagnosis of Multiprocessor Systems .... 207
      S. Lee and K.G. Shin
   The Delta-4 Distributed Fault-Tolerant Architecture .... 223
      D. Powell, P. Barrett, G. Bonn, M. Chereque, D. Seaton, and P. Verissimo
   Rollback Recovery in Concurrent Systems .... 249
      S.J. Upadhyaya and A. Ranganathan
   Problem Solving with Distributed Knowledge .... 267
      M.K. Saxena, K.K. Biswas, and P.C.P. Bhatt
Chapter 6: Performance Measurements and Modeling .... 285
   ZM4/Simple: A General Approach to Performance Measurement and Evaluation of Distributed Systems .... 286
      P. Dauphin, R. Hofmann, R. Klar, B. Mohr, A. Quick, M. Siegle, and F. Sotz

Chapter 7: Experimental Systems and Case Studies .... 310
   Tools for Distributed Application Management .... 311
      K. Marzullo, R. Cooper, M.D. Wood, and K.P. Birman
   The Architectural Overview of the Galaxy Distributed Operating System .... 327
      P.K. Sinha, M. Maekawa, K. Shimizu, X. Jia, H. Ashihara, N. Utsunomiya, K.S. Park, and H. Nakano
   Communication Facilities for Distributed Transaction-Processing Systems .... 346
      E. Mafla and B. Bhargava
   Supporting Utility Services in a Distributed Environment .... 355
      C.-Q. Yang, D. Hensgen, T.S. Thomas, and R. Finkel

Chapter 8: Distributed Shared Memory .... 374
   Distributed Shared Memory: A Survey of Issues and Algorithms .... 375
      B. Nitzberg and V. Lo
   Using Broadcasting to Implement Distributed Shared Memory Efficiently .... 387
      A.S. Tanenbaum, M.F. Kaashoek, and H.E. Bal
   Memory as a Network Abstraction .... 409
      G. Delp, D. Farber, R. Minnich, J. Smith, and I.M.-C. Tam

Chapter 9: Distributed File Systems .... 424
   Managing Highly Available Personal Files .... 425
      M.W. Mutka and L.M. Ni
   Pushing the Limits of Transparency in Distributed File Systems .... 447
      R. Floyd and C.S. Ellis
   Scale in Distributed Systems .... 463
      B.C. Neuman
   Transparent Access to Large Files That Are Stored across Sites .... 490
      H.F. Wedde, B. Korel, S. Chen, D.C. Daniels, S. Nagaraj, and B. Santhanam

Chapter 10: Distributed Databases .... 511
   Distributed Data Management: Unsolved Problems and New Issues .... 512
      M.T. Ozsu and P. Valduriez
   A Unified Approach to Distributed Concurrency Control .... 545
      P. Anastassopoulos and J. Dollimore
   Replicated Data Management in Distributed Systems .... 572
      M. Ahamad, M.H. Ammar, and S.Y. Cheung
   Scheduling Transactions for Distributed Time-Critical Applications .... 592
      S.H. Son and S. Park

Author Profiles .... 619
Introduction

Thomas L. Casavant and Mukesh Singhal
Most readers are already familiar with distributed computing systems, at least as users. Automated teller machine networks, airline reservation systems, and on-site validation of credit cards provide three examples of the pervasiveness of distributed systems in everyday life. The computer research community relies heavily on distributed systems for electronic mail, remote login, network file systems, page swapping, and remote file transfer. The Internet electronic mail system, for example, is one of the most useful pieces of scientific infrastructure developed in the past 10 years.

Basically, a distributed computing system consists of a collection of autonomous computers connected by a communication network. The sites typically do not share memory, and communication is solely by means of message passing. Over the past two decades, the availability of fast, inexpensive processors and advances in communication technology have made distributed computing an attractive means of information processing.

As early as 1978, Enslow [1] specified five properties for defining a distributed computing system:

• Multiplicity of general-purpose resource components, both physical and logical, that can be dynamically assigned to specific tasks;
• Physical distribution of the physical and logical resources by means of a communications network;
• High-level operating system that unifies and integrates the control of the distributed components;
• System transparency, which allows services to be requested by name only; and
• Cooperative autonomy, characterizing the operation and interaction of both physical and logical resources.

The many practical motivations for distributed systems include higher performance or throughput, increased reliability, and improved access to data over a geographically dispersed area [2]. However, pursuing these potential advantages has exposed
a broad set of new problems. Ironically, attempts to exploit a feature often degrade that very feature. For example, improving reliability through redundancy immediately requires recovery mechanisms, which can themselves become a serious source of potential failure and reduce overall system reliability.

During the last two decades, a number of subfields have become well established. Cutting across the subfields are two camps of researchers who see their motivations differently: One camp views distribution as a means to an end; the other views it as imposed by the situation. In the first case, distribution of resources is seen as a way to achieve some goal, such as

• Massively parallel, general-purpose, high-speed computing;
• Fault tolerance, reliability, or availability; or
• Real-time response demands.

In the second case, distributed computation is forced on a designer because of some existing, overriding situation, such as

• Distributed database systems,
• Automated manufacturing,
• Remote sensing and control, or
• Coordinated decision making.
Challenges

The issues facing the distributed-computing community divide roughly into two categories: system design issues and application-oriented issues. System design issues can be subdivided into hardware-oriented and software-oriented issues. Applications issues, which take the perspective of a system user, can be thought of as system models and programming support.
System design issues

The hardware-oriented issues of system design include hardware measures for fault tolerance, physical networks and communications protocol design, and physical clock synchronization. Fault-tolerance mechanisms generally demand redundancy in hardware resources. For example, in the Tandem system, which is designed to meet rigid reliability and availability requirements in a database environment, each hardware component is duplicated. This demands not only hardware but also software to "roll back" the system state to the last known consistent state before component failure. The design of physical networks and communication protocols can have a direct impact on system efficiency and reliability. For example, in the case of a system distributed throughout one building, a carrier-sense protocol on an Ethernet medium is sufficient. But a space-borne system, such as the space station, requires a radio-based, packet-switched network to function.

The software-oriented issues of system design include distributed algorithms, naming and resource location, resource allocation, distributed operating systems, system integration, reliability, tools and languages, real-time systems, and performance measurement. Distributed algorithms encompass many fundamental, challenging problems in distributed-systems design, including distributed mutual exclusion, distributed consensus, Byzantine agreement, deadlock handling, logical clocks, distributed snapshots, load sharing/scheduling, and crash recovery. Distributed-algorithm design is complicated by the lack of both global memory and physically synchronized clocks.
Distributed snapshots and clock synchronization have recently attracted attention as ways to compensate for the lack of global memory. A distributed-snapshot algorithm collects a partial global state of a distributed system to help identify several interesting properties. Clock-synchronization schemes compensate for the lack of a physically synchronized global clock. (A small sketch of a logical clock appears below.)

Fault tolerance and crash recovery are gaining importance, because distributed systems are being commercialized.

Load-sharing and distributed-scheduling techniques improve performance and efficiency by effectively distributing the work load throughout the system. Researchers are investigating load-sharing policies that are stable, efficient, and easy to implement. Distributed scheduling decomposes a task into several subtasks and allocates them over a set of processors to hasten computation by overlapping subtask execution and minimizing communication. Since optimal solutions to this problem are NP-hard (even if accurate system state information were available), researchers are looking into heuristic methods.

With performance measurement and modeling techniques, performance can be predicted and potential design flaws identified before an actual system is built. Performance analysis of distributed systems has been done primarily with analytic modeling and simulation, but it can be difficult. Thus, empirical techniques are needed to study actual performance.

As design techniques approach maturity, many companies and universities have launched projects to construct real-life distributed systems, and a number of them now have prototype implementations — for example, Mach at Carnegie Mellon University; V-Kernel at Stanford University; Sprite at the University of California, Berkeley; Amoeba at Vrije University in Amsterdam; System R* at IBM; Locus at the University of California, Los Angeles; VAX-Cluster at Digital Equipment Corporation; and the Spring Project at the University of Massachusetts at Amherst.
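The logical clocks mentioned above can be made concrete with a small example. The following is a minimal sketch of Lamport-style logical clocks; the class and method names are our own, invented for illustration, and not taken from any paper in this volume.

```python
# Minimal sketch of Lamport logical clocks: each process keeps a counter,
# increments it on every local event, stamps outgoing messages with it,
# and advances past any larger timestamp it receives.
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send_event(self):
        # A message carries the sender's clock value.
        self.time += 1
        return self.time

    def receive_event(self, msg_time):
        # The receiver moves its clock past the message timestamp.
        self.time = max(self.time, msg_time) + 1
        return self.time

# Two processes exchanging one message: the happened-before ordering
# of send and receive is reflected in the timestamps.
p1, p2 = LamportClock(), LamportClock()
p1.local_event()                 # p1: 1
ts = p1.send_event()             # p1: 2, message stamped 2
p2.local_event()                 # p2: 1
print(p2.receive_event(ts))      # p2: max(1, 2) + 1 = 3
```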
Application-oriented issues

From the application designer's point of view, which includes algorithmic design, the first question is "What does my system look like?" Many models have been proposed and continue to be explored. The communicating sequential processes (CSP) model is a very simple synchronous message-passing model. However, as the area has matured, the loose structure of CSP has led to more restricted paradigms, such as the rendezvous mechanism of Ada and the remote procedure call. Some models have been proposed to meet the needs of specific applications — for example, the transaction model as it relates to on-line transaction processing. Other modeling issues include virtual shared memory and decisions about processor granularity.

The application designer's second concern is the practical matter of programming support. Research in this area brings reality to the models described above. This reality comes in the form of languages, operating system interfaces, and higher level abstractions for building applications such as databases (for example, the general-purpose relational database Ingres [Interactive Graphics and Retrieval System]). In addition to a means of specifying computations and communications, automated and semiautomated tools are needed to help produce, debug, and verify distributed applications. In fact, the need for tools is becoming a pervasive problem that will require a great deal of attention.
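To make the CSP-style synchronous message passing described above concrete, here is a toy sketch of an unbuffered channel built from Python threads. The SyncChannel class and its interface are our own invention for illustration; real CSP implementations and the Ada rendezvous differ in many details.

```python
import threading, queue

class SyncChannel:
    """Unbuffered CSP-style channel: send() blocks until a receiver has
    taken the value, and recv() blocks until a sender arrives."""
    def __init__(self):
        self._item = queue.Queue(maxsize=1)
        self._ack = queue.Queue(maxsize=1)

    def send(self, value):
        self._item.put(value)   # hand the value over
        self._ack.get()         # block until the receiver confirms

    def recv(self):
        value = self._item.get()
        self._ack.put(None)     # release the blocked sender
        return value

ch = SyncChannel()

def producer():
    for i in range(3):
        ch.send(i)              # rendezvous: waits for the consumer

t = threading.Thread(target=producer)
t.start()
for _ in range(3):
    print(ch.recv())            # prints 0, 1, 2
t.join()
```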
The future of distributed computing

Although distributed computing has been an active research topic for at least two decades, several design issues still face researchers and system builders. The first issue involves theoretical aspects, including global state, logical/physical clock
synchronization, and algorithm verification. Global state is necessary to compensate for the lack of global memory. A distributed-snapshot algorithm collects a global state of a distributed system that helps identify interesting properties, but the lack of global memory makes verification of distributed algorithms tedious and error-prone. Verification techniques for shared-memory systems are not directly applicable to distributed-memory systems, which require other techniques (probably snapshot-based or temporal-logic-based).

Second, although fault tolerance and crash recovery have already received much attention, system reliability will become even more important as distributed systems become more and more commercialized. Instead of concentrating on hardware redundancy, future research efforts should also investigate software techniques for achieving fault tolerance.

Third, tools and languages are badly needed. This area includes tools for specifying, analyzing, transforming, debugging, creating, and verifying distributed software, as well as new languages, language extensions, operating system primitives, compilation techniques, debuggers, and software-creation methods. One of the greatest challenges to users of distributed systems is the creation of reliable, working distributed software. Advances in theory, practical tools, methods, and languages are essential for building reliable and efficient distributed software.

The fourth issue is high-performance systems. Constructing massively parallel systems (10^5 or more processors) will require physically distributed memory. Topological, as well as technological, issues will continue to be important in designing interconnection networks for these systems. Technological challenges will center on optical communications and the VLSI-related issues of module integration, multichip and board layout, and packaging.

Fifth, real-time distributed systems will become more important for automated manufacturing, remote sensing and control, and other time-critical missions. Past research in real-time distributed systems emphasized efficient task-scheduling algorithms to meet task deadlines under various constraints. Future research will focus on system structuring and total system design.

Finally, as design techniques approach perfection, we will see a proliferation of actual distributed systems with significantly improved fault tolerance, resource sharing, and communications. These systems will function as single, coherent, powerful virtual machines providing transparent user access to network-wide resources.
References

1. P.H. Enslow, Jr., "What is a 'Distributed' Data Processing System?," Computer, Vol. 11, No. 1, Jan. 1978, pp. 13-21.
2. J.A. Stankovic, "A Perspective on Distributed Computer Systems," IEEE Trans. Computers, Vol. C-33, No. 12, Dec. 1984, pp. 1102-1115.
Chapter 1
Distributed Computing Systems: An Overview
Distributed Computing

John A. Stankovic
A distributed computer system (DCS) is a collection of computers connected by a communications subnet and logically integrated in varying degrees by a distributed operating system and/or distributed database system. Each computer node may be a uniprocessor, a multiprocessor, or a multicomputer. The communications subnet may be a widely geographically dispersed collection of communication processors or a local area network. Typical applications that use distributed computing include e-mail, teleconferencing, electronic funds transfers, multimedia telecommunications, command and control systems, and support for general purpose computing in industrial and academic settings.

The widespread use of distributed computer systems is due to the price-performance revolution in microelectronics, the development of cost effective and efficient communication subnets [4] (which is itself due to the merging of data communications and computer communications), the development of resource sharing software, and the increased user demands for communication, economical sharing of resources, and productivity.

A DCS potentially provides significant advantages, including good performance, good reliability, good resource sharing, and extensibility [35, 41]. Potential performance enhancement is due to multiple processors and an efficient subnet, as well as avoiding contention and bottlenecks that exist in uniprocessors and multiprocessors. Potential reliability improvements are due to the data and control redundancy possible, the geographical distribution of the system, and the ability for hosts and communication processors to perform mutual inspection. With the proper subnet, distributed operating system [46], and distributed database [85], it is possible to share hardware and software resources in a cost effective manner, increasing productivity and lowering costs. Possibly the most important potential advantage of a DCS is extensibility. Extensibility is the ability to easily adapt to both short- and long-term changes without significant disruption of the system. Short-term changes include varying work loads and host or subnet failures or additions. Long-term changes are associated with major modifications to the requirements or content of the system.

DCS research encompasses many areas, including local- and
wide-area networks, distributed operating systems, distributed databases, distributed file servers, concurrent and distributed programming languages, specification languages for concurrent systems, theory of parallel algorithms, theory of distributed computing, parallel architectures and interconnection structures, fault tolerant and ultrareliable systems, distributed real-time systems, cooperative problem solving techniques of artificial intelligence, distributed debugging, distributed simulation, distributed applications, and a methodology for the design, construction, and maintenance of large, complex distributed systems.

Many prototype distributed computer systems have been built at university, industrial, commercial, and government research laboratories, and production systems of all sizes and types have proliferated. It is impossible to survey all distributed computing systems research. An extensive survey and bibliography would require hundreds of pages. Instead, this paper focuses on two important areas: distributed operating systems and distributed databases.
Distributed operating systems

Operating systems for distributed computing systems can be categorized into two broad categories: network operating systems and distributed operating systems [113].

(1) Network operating systems. Consider the situation where each of the hosts of a computer network has a local operating system that is independent of the network. The sum total of all the operating system software added to each host in order to communicate and share resources is called a network operating system (NOS). The added software often includes modifications to the local operating system. NOSs are characterized by being built on top of existing operating systems, and they attempt to hide the differences between the underlying systems, as shown in Figure 1.

[Figure 1. Network operating systems: four hosts running Unix, Ultrix, VMS, and OS/2, each with an NOS layer on top of its native operating system, connected by a local area network.]

(2) Distributed operating systems. Consider an integrated computer network where there is one native operating system for all the distributed hosts. This is called a distributed operating system (DOS). See Figure 2. Examples of DOSs include the V system [28], Eden [66], Amoeba [78], the Cambridge distributed computing system [79], Medusa [83], Locus [87], and Mach [93]. Examples of real-time distributed operating systems include MARS [59] and Spring [116]. A DOS is designed with the network requirements in mind from its inception, and it tries to manage the resources of the network in a global fashion; retrofitting a DOS onto existing operating systems and other software is therefore not an issue. Since DOSs are used to satisfy a wide variety of requirements, their various implementations are quite different.
[Figure 2. Distributed operating system — Mach.]
Note that the boundary between NOSs and DOSs is not always clearly distinguishable. In this paper we primarily consider distributed operating system issues, divided into six categories: process structures, access control and communication, reliability, heterogeneity, efficiency, and real-time. In the section entitled "Distributed databases," we present a similar breakdown applied to distributed databases.
Process structures

The conventional notion of a process is an address space with a single execution trace through it. Because of the parallelism inherent in multiprocessing and distributed computing, recent operating systems support the separation of the address space (sometimes called a task) and execution traces (called threads or lightweight processes) [93, 94]. In most systems the address space and threads are restricted to reside on a single node (a uni- or multiprocessor). However, some systems such as Ivy, the Apollo domain, and Clouds [33] support a distributed address space (sometimes called a distributed shared memory — see [82] for a summary of issues involved with distributed shared memory), and distributed threads executing on that address space. Regardless of whether the address space is local or distributed [42], there has been significant work done on the following topics: supporting very large, but sparse, address spaces; efficiently copying information between address spaces using a technique called copy-on-write, where only the data actually used gets copied [43] (a toy sketch appears below); and supporting efficient file management by mapping files into the address space and then using virtual memory techniques to access the file.

At a higher level, distributed operating systems use tasks and threads to support either a procedure-based or an object-based paradigm [29]. If objects are used, there are two variations: the passive and active object models. Because the object-based paradigm is so important and well-suited to distributed computing, we will present some basic information about objects and then discuss the active and passive object paradigms.

A data abstraction is a collection of information and a set of operations defined on that information. An object is an instantiation of a data abstraction. The concept of an object is usually supported by a kernel that may also define a primitive set of objects. Higher level objects are then constructed from more primitive objects in some structured fashion. All hardware and software resources of a DCS can be regarded as objects. The concept of an object and its implications form an elegant basis for a DCS [83, 113]. For example, distributed systems' functions such as allocation of objects to a host, moving objects, remote access to objects, sharing objects across the network, and providing interfaces between disparate objects are all "conceptually" simple, because they are all handled by yet other objects. The object concept is powerful and can easily support the popular client-server model of distributed computing.
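As a rough illustration of the copy-on-write technique mentioned above, the toy sketch below shares "pages" between two forked address spaces and duplicates a page only when one side writes to it. The classes and names are ours, not the mechanism of any system cited here; real implementations work at the level of hardware-protected pages.

```python
# Toy copy-on-write: "address spaces" share pages until a write occurs;
# only the page actually written is duplicated.
class Page:
    def __init__(self, data):
        self.data = data
        self.refcount = 1

class AddressSpace:
    def __init__(self, pages):
        self.pages = pages

    def fork(self):
        # The child shares every page; nothing is copied yet.
        for p in self.pages:
            p.refcount += 1
        return AddressSpace(list(self.pages))

    def read(self, i):
        return self.pages[i].data

    def write(self, i, data):
        page = self.pages[i]
        if page.refcount > 1:          # shared: copy before writing
            page.refcount -= 1
            page = Page(page.data)
            self.pages[i] = page
        page.data = data

parent = AddressSpace([Page("a"), Page("b")])
child = parent.fork()                  # no data copied at fork time
child.write(0, "c")                    # only page 0 is duplicated
print(parent.read(0), child.read(0))   # a c
```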
[Figure 3. Distributed computation (thread through passive objects).]
Objects also serve as the primitive entity supporting more complicated distributed computational structures. One type of distributed computation is a process (thread) that executes as a sequential trace through passive objects, but does so across multiple hosts, as shown in Figure 3. The objects are permanent, but the execution properties are supplied by an external process (thread) executing in the address space of the object. Another form of a distributed computation is to have clusters of objects, each with internal, active threads, running in parallel and communicating with each other based on the various types of interprocess communication (IPC) used. This is known as the active object model (see Figure 4). The cluster of processes may be colocated, or distributed in some fashion. Other notions of what constitutes a distributed program are possible; for example, object invocations can support additional semantics (such as found in database transactions).

[Figure 4. Local area network — Active objects.]

The major problem with object-based systems has been poor execution time performance. However, this is not really a problem with the object abstraction itself, but with inefficient implementation of access to objects. The most common reason given for poor execution time is that current architectures are ill-suited for object-based systems. Another problem is choosing the right granularity for an object. If every integer or character and their associated operations are treated as objects, then the overhead is too high. If the granularity of an object is too large, then the benefits of the object-based system are lost.
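The difference between the two models just described can be sketched in a few lines. In this toy example (our own names and structure, not any cited system's API), the passive object is executed by whatever external thread invokes it, while the active object contains its own thread and is driven by messages.

```python
import threading, queue

# Passive object: state plus operations; execution is supplied by
# whatever external thread happens to invoke it.
class PassiveQueueObj:
    def __init__(self):
        self._items = []
        self._lock = threading.Lock()
    def insert(self, x):
        with self._lock:          # callers' threads run this code
            self._items.append(x)

# Active object: an internal thread serves requests arriving as
# messages, so callers never execute inside the object's state.
class ActiveQueueObj:
    def __init__(self):
        self._inbox = queue.Queue()
        self._items = []
        threading.Thread(target=self._serve, daemon=True).start()
    def _serve(self):
        while True:
            op, arg = self._inbox.get()
            if op == "insert":
                self._items.append(arg)
    def insert(self, x):          # asynchronous invocation by message
        self._inbox.put(("insert", x))

passive = PassiveQueueObj(); passive.insert(1)  # caller's thread does the work
active = ActiveQueueObj(); active.insert(1)     # internal thread does the work
```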
Access control and communications

A distributed system consists of a collection of resources managed by the distributed operating system. Access to the resources must be controlled in two ways. First, the manner used to access a resource must be suitable to the resource and the requirements under consideration. For example, a printer must be shared serially, local data of an object should not be shareable, and a read-only file can be accessed simultaneously by any number of users. In an object-based system, access can be controlled on an operation-by-operation basis. For example, a given user may be restricted to only using the insert operation on a queue object, while another user may be able to access the queue using both insert and remove operations. Second, access to a resource must be restricted to a set of allowable users. In many systems this is done by an access control list, an access control matrix, or capabilities. The MIT Athena project developed an authentication server called Kerberos, based on a third-party authentication model [80], that uses private-key encryption.

Communication has been the focus of much of the work in distributed operating systems, providing the glue that binds logically and physically separated processes [54, 92]. Remote procedure calls (RPCs) extend the semantics of a programming language's procedure calls to communication across nodes of a DCS. Lightweight RPCs [18] have been developed to make such calls as efficient as possible. Many systems also support general synchronous and asynchronous send and receive primitives whose semantics are different from and more general than RPC semantics. Broadcasting and multicasting are also common primitives found in DOSs, and they provide useful services in achieving consensus and other forms of global coordination. For systems with high reliability requirements, reliable broadcast facilities might be provided [55]. Communication facilities are implemented either directly in the kernels of the operating systems (as in the V system [28]) or as user-level services (as in Mach [93]). This is a classical trade-off between performance and flexibility. Intermediate between these approaches lies the x-kernel [54], where basic primitives required for all communication primitives are provided at the kernel level and protocol-specific logic is programmable at a higher level.
Reliability

While reliability is a fundamental issue for any system, the redundancy found in DCSs makes them particularly well-suited for the implementation of reliability techniques. We begin the discussion on reliability with a few definitions. A fault is a mechanical or algorithmic defect that may generate an error. A fault may be permanent, transient, or intermittent. An error is an item of information which, when processed by the normal algorithms of the system, will produce a failure. A failure is an event at which a system violates its specifications. Reliability can then be defined as the degree of tolerance against errors and faults. Increased reliability comes from fault avoidance and fault tolerance. Fault
avoidance results from conservative design practices such as using high-reliability components and nonambitious design. Fault tolerance employs error detection and redundancy to deal with faults and errors. Most of what we discuss here relates to the fault tolerance aspect of reliability.

Reliability is a complex, multidimensional activity that must simultaneously address some or all of the following: fault confinement, fault detection, fault masking, retries, fault diagnosis, reconfiguration, recovery, restart, repair, and reintegration. Further, distributed systems require more than reliability — they need to be dependable. Dependability is the trustworthiness of a computer system, and it subsumes reliability, availability, safety, and security. System architectures such as Delta-4 [88] strive for dependability. We cannot do justice to all these issues in this short paper. Instead, we will discuss several of the more important issues related to reliability in DOSs.

Reliable DOSs should support replicated files, exception handlers, and testing procedures executed from remote hosts, and should avoid single points of failure by a combination of replication, backup facilities, and distributed control. Distributed control could be used for file servers, name servers, scheduling algorithms, and other executive control functions. Process structure, how environment information is kept, the homogeneity of various hosts, and the scheduling algorithm may allow for relocatability of processes. Interprocess communication (IPC) might be supported as a reliable remote procedure call [81, 109] and might also provide reliable atomic broadcasts, as is done in Isis [19]. Reliable IPC would enforce "at least once" or "exactly once" semantics, depending on the type of IPC being invoked (a small sketch appears at the end of this section); atomic broadcasts guarantee that either all processes that are to receive the message will indeed receive it, or none will. Other DOS reliability solutions are required to avoid invoking processes that are not active, to avoid the situation where a process remains active but is not used, and to avoid attempts to communicate with terminated processes.

ARGUS [70], a distributed programming language, has incorporated reliability concerns into the programming language explicitly. It does this by supporting the idea of an atomic object, transactions, nested actions, reliable remote procedure calls, stable variables, guardians (which are modules that service node failures and synchronize concurrent access to data), exception handlers, periodic and background testing procedures, and recovery of a committed update given that the present update does not complete. A distributed program written in ARGUS may potentially experience deadlock. Currently, deadlocks are broken by timing out and aborting actions.

Distributed databases make use of many reliability features such as stable storage, transactions, nested transactions [76], commit and recovery protocols [103], nonblocking commit protocols [102], termination protocols [104], checkpointing, replication, primary/backups, logs/audit trails, differential files [99], and time-outs to detect failures. Operating system support is required to make these mechanisms more efficient [47, 87, 119, 131].

One aspect of reliability not stressed enough in DCS research is the need for robust solutions; that is, the solutions must explicitly assume an unreliable network and tolerate host failures, network partitionings, and lost, duplicate, out-of-order, or noisy data.
Robust algorithms must sometimes make decisions after reaching only approximate agreement or by using statistical properties of the system (assumed known or dynamically calculated). A related question is, at what level should the robust algorithms, and reliability in general, be supported? Most systems attempt to have the subnet ensure reliable, error-free data transmission between processes. However, according to the end-to-end argument [97], such functions placed at the lower levels of the system are often redundant and unnecessary. The rationale for this argument is that since the application has to take into account errors introduced not only by the subnet but by other parts of the system as well, many of the error detection and recovery functions can be correctly and completely provided only at the application level.

The relationship of reliability to the other issues discussed in this paper is very strong. For example, object-based systems confine errors to a large degree, define a consistent system state to support rollback and restart, and limit propagation of rollback activities. However, if objects
are supported on a distributed shared memory, special problems arise [134]. Since objects can represent unreliable resources (such as processors and disks), and since higher level objects can be built using lower level objects, the goal of reliable system design is to create "reliable" objects out of unreliable objects. For example, a stable storage can be created out of several disk objects and the proper logic. Then a physical processor, a checkpointing capability, a stable storage, and logic can be used to create a stable processor. One can proceed in this fashion to create a very reliable system. The main drawback is potential loss of execution time efficiency. For many systems, it is just too costly to incorporate an extensive number of reliability mechanisms. Reliability is also enhanced by proper access control and judicious use of distributed control. The major challenge is to integrate solutions to all these issues in a cost effective manner and produce an extremely reliable system.
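The "at least once" and "exactly once" IPC semantics discussed earlier in this section can be illustrated with a toy sketch: a sender that retries until acknowledged achieves at-least-once delivery, and a receiver that remembers message identifiers turns duplicate deliveries into exactly-once effects. All names here are ours, and a real system would have to persist the duplicate table across crashes.

```python
# At-least-once: retry until acknowledged, so a lost ack causes a
# duplicate delivery. Exactly-once effects: the receiver remembers
# message ids and applies each operation only once.
class Receiver:
    def __init__(self):
        self.seen = set()
        self.total = 0
    def deliver(self, msg_id, amount):
        if msg_id not in self.seen:       # filter duplicates by id
            self.seen.add(msg_id)
            self.total += amount
        return "ack"

def send_at_least_once(receiver, msg_id, amount, drop_first_ack=True):
    acked = False
    while not acked:
        ack = receiver.deliver(msg_id, amount)   # may be a retransmission
        acked = ack == "ack" and not drop_first_ack
        drop_first_ack = False                   # simulate one lost ack

r = Receiver()
send_at_least_once(r, msg_id=1, amount=100)  # delivered twice, applied once
print(r.total)                               # 100
```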
Heterogeneity

Incompatibility problems arise in heterogeneous DCSs in a number of ways [10], and at all levels. First, incompatibility is due to the different internal formatting schemes that exist in a collection of different communication and host processors. Second, incompatibility also arises from the differences in communication protocols and topology when networks are connected to other networks via gateways. Third, major incompatibilities arise due to different operating systems, file servers, and database systems that might exist on a network or a set of networks.

The easiest solution to this general problem for a single DCS is to avoid the issue by using a homogeneous collection of machines and software. If this is not practical, then some form of translation is necessary. Some earlier systems left this translation to the user. This is no longer acceptable. Translation done by the DCS can be done at the receiver host or at the source host. If it is done at the receiver host, then the data traverse the network in their original form. The data usually are supplemented with extra information to guide the translation. The problem with this approach is that at every host there must be a translator to convert each format in the system to the format used on the receiving host. When there exist n different formats, this requires the support of (n - 1) translators at each host. Performing the translation at the source host before transmitting the data is subject to all the same problems.

There are two better solutions, each applicable under different situations: an intermediate translator, or an intermediate standard data format. An intermediate translator accepts data from the source and produces the acceptable format for the destination. This is usually used when the number of different types of necessary conversions is small. For example, a gateway linking two different networks acts as an intermediate translator. For a given conversion problem, if the number of different types to be dealt with grows large, then a single intermediate translator becomes unmanageable. In this case, an intermediate standard data format (interface) is declared: hosts convert to the standard, data are moved in the format of the standard, and another conversion is performed at the destination. By choosing the standard to be the most common format in the system, the number of conversions can be reduced. (A small sketch of this idea appears at the end of this section.)

At a high level of abstraction, the heterogeneity problem and the necessary translations are well understood. At the implementation level, a number of complications exist. The issues are precision loss; format incompatibilities (the minus-zero value in sign-magnitude and 1's-complement representations cannot be represented in 2's complement); data type incompatibilities (mapping an upper- or lowercase terminal to an uppercase-only terminal loses information); efficiency concerns; the number and locations of the translators; and what constitutes a good intermediate data format for a given incompatibility problem.

As DCSs become more integrated, one can expect that both programs and complicated forms of data might be moved to heterogeneous hosts. How will a program run on this host, given that
the host has different word lengths, different machine code, and different operating system primitives? How will database relations stored as part of a CODASYL database be converted to a relational model and its associated storage scheme? Moving a data-structure object requires knowledge about the semantics of the structure (for example, that some of the fields are pointers and these have to be updated upon a move). How should this information be imparted to the translators, what are the limitations, if any, and what are the benefits and costs of having this kind of flexibility?

In general, the problem of providing translation for movement of data and programs between heterogeneous hosts and networks has not been solved. The main problem is ensuring that such programs and data are interpreted correctly at the destination host. In fact, the more difficult problems in this area have been largely ignored. The Open Software Foundation (OSF) Distributed Computing Environment (DCE) is attempting to address the problem of programming and managing heterogeneous distributed computer systems by establishing a set of standards for the major components of such systems. This includes standards for RPCs, distributed file servers, and distributed management.
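As promised above, here is a small sketch of the intermediate-standard-format idea: with n host formats, each host needs only a translator to and from the standard, rather than a translator for every other format. The sketch uses Python's struct module, with 32-bit big-endian integers playing the role of the declared standard; the choice of standard and the "host format" are assumptions for illustration.

```python
import struct

# Declared interchange standard: 32-bit big-endian integers
# ("!" means network byte order in struct format strings).
STANDARD = "!i"

def to_standard(value):
    return struct.pack(STANDARD, value)

def from_standard(data):
    return struct.unpack(STANDARD, data)[0]

# A little-endian host converts to the standard before transmission;
# the receiver converts from the standard to its own representation.
native_little = struct.pack("<i", 1994)      # source host's native format
value = struct.unpack("<i", native_little)[0]
wire = to_standard(value)                    # what crosses the network
print(from_standard(wire))                   # 1994, on any receiving host
```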
Efficiency

Distributed computer systems are meant to be efficient in a multitude of ways. Resources (files, compilers, debuggers, and other software products) developed at one host can be shared by users on other hosts, limiting duplicate efforts. Expensive hardware resources can also be shared, minimizing costs. Communication facilities, such as the remote procedure call, electronic mail, and file transfer protocols, also improve efficiency by enabling better and faster transfer of information. The multiplicity of processing elements might also be exploited to improve response time and throughput of user processes.

While efficiency concerns exist at every level in the system, they must also be treated as an integrated "system" level issue. For example, a good design, the proper trade-offs between levels, and the paring down of overambitious features usually improve efficiency. Here we will concentrate on discussing efficiency as it relates to the execution time of processes (threads).

Once the system is operational, improving response time and throughput of user processes (threads) is largely the responsibility of scheduling and resource management algorithms [6, 30, 32, 39, 75, 111, 112, 126] and the mechanisms used to move processes and data [28, 33, 42, 127]. The scheduling algorithm is intimately related to the resource allocator, because a process will not be scheduled for the CPU if it is waiting for a resource. If a DCS is to exploit the multiplicity of processors and resources in the network, it must contain more than simply n independent schedulers. The local schedulers must interact and cooperate, and the degree to which this occurs can vary widely. We suggest that a good scheduling algorithm for a DCS will be a heuristic that acts like an "expert system." This expert system's task is to effectively utilize the resources of the entire distributed system given a complex and dynamically changing environment. We hope to illustrate this in the following discussion.

In the remainder of this section, when we refer to the scheduling algorithm, we are referring to the part of the scheduler (possibly an expert system) that is responsible for choosing the host of execution for a process. We assume that there is another part of the scheduler that assigns the local CPU to the highest-priority ready process.

We divide the characteristics of a DCS that influence response time and throughput into system characteristics and scheduling algorithm characteristics. System characteristics include the number, type, and speed of processors, caches, and memories; the allocation of data and programs; whether data and programs can be moved; the amount and location of replicated data and programs; how data are partitioned; partitioned functionality in the form of dedicated processors; any special-purpose hardware; characteristics of the communication subnet; and special problems of distribution, such as no central clock and the inherent delays in the system. A good scheduling
algorithm would take the system characteristics into account.

Scheduling algorithm characteristics include the type and amount of state information used, how and when that information is transmitted, how the information is used (the degree and type of cooperation between distributed scheduling entities), when the algorithm is invoked, the adaptability of the algorithm, and the stability of the algorithm [24, 110]. The type of state information used by scheduling algorithms includes queue lengths, CPU utilization, amount of free memory, estimated average response time, or combinations of various information in making scheduling decisions. The type of information also refers to whether the information is local or networkwide. For example, a scheduling algorithm on host 1 could use the queue lengths of all the hosts in the network in making its decision. The amount of state information refers to the number of different types of information used by the scheduler.

Information used by a scheduler can be transmitted periodically or asynchronously. If asynchronously, it may be sent only when requested (as in bidding), it may be piggybacked on other messages between hosts, or it may be sent only when conditions change by some amount. The information may be broadcast to all hosts, sent to neighbors only, or sent to some specific set of hosts. The information is used to estimate the loads on other hosts of the network in order to make an informed global scheduling decision. However, the data received are out of date, and even the ordering of events might not be known [63]. It is necessary to manipulate the data in some way to obtain better estimates. Several examples: very old data can be discarded; given that state information is time-stamped, a linear estimation of the state extrapolated to the current time might be feasible; conditional probabilities on the accuracy of the state information might be calculated in parallel with the scheduler by some monitor nodes and applied to the received state information; the estimates can be some function of the age of the state information; or some form of (iterative) message interchange might be feasible.

Before a process is actually moved, the cost of moving it must be accounted for in determining the estimated benefit of the move. This cost is different if the process has not yet begun execution than if it is already in progress. In both cases, the resources required must also be considered. If a process is in execution, then environment information (like the process control block) probably should be moved with the process. It is expected that in many cases the decision will be not to move the process.

Schedulers invoked too often will produce excessive overhead. If they are not invoked often enough, they will not be able to react fast enough to changing conditions, and there will be undue start-up delay for processes. There must be some ideal invocation schedule that is a function of the load. In a complicated DCS environment, it can be expected that the scheduler will have to be quite adaptive [74, 110]. A scheduler might make minor adjustments in weighing the importance of various factors as the network state changes in an attempt to track a slowly changing environment. Major changes in the network state might require major adjustments in the scheduling algorithms.
For example, under very light loads, there does not seem to be much justification for networkwide scheduling, so the algorithm might be turned off, except for the part that can recognize a change in the load. At moderate loads, the full-blown scheduling algorithm might be employed. This might include an individual host refusing all requests for information, and refusing to accept any process, because it is too busy. Under heavy loads on all hosts, it again seems unnecessary to use networkwide scheduling. A bidding scheme might use both source- and server-directed bidding [112]. An overloaded host asks for bids and is the source of work for some other hosts in the network. Similarly, a lightly loaded host may make a reverse bid (ask the rest of the network for some work). The two types of bidding might coexist. (A sketch of a simple bidding decision appears at the end of this section.) Schedulers could be designed in a multilevel fashion with decisions being made at different rates: local decisions and state information
updates occur frequently, but more global exchange of decisions and state information might proceed at a slower rate because of the inherent cost of these global actions.

A classic efficiency question in any system is what should be supported by the kernel, or more generally by the operating system, and what should be left to the user? The trend in DCSs is to provide minimal support at the kernel level — for example, supporting objects, primitive IPC mechanisms, and processes (threads). Other operating system functions are then supported as higher level processes. On the other hand, because of efficiency concerns, some researchers advocate putting more in the kernel, including communication protocols, real-time systems support, or even support for the concept of a transaction in the kernel. This argument will never be settled conclusively, since it is a function of the requirements and types of processes running.

Of course, many other efficiency questions remain. These include the efficiency of the object model, the end-to-end argument, locking granularity, performance of remote operations, improvements due to distributed control, the cost effectiveness of various reliability mechanisms, efficient handling of heterogeneity, hardware support for operating system functions [7], and handling the I/O bottleneck via disk arrays of various types, called RAID (redundant arrays of inexpensive disks) levels 1 through 6 [56, 86]. Efficiency is not a separate issue; it must be addressed for each issue in order to result in an efficient, reliable, and extensible DCS. A difficult question to answer is exactly what constitutes acceptable performance, given that multiple decisions are being made at all levels and that these decisions are being made in the presence of missing and inaccurate information.
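The source-directed bidding scheme mentioned above can be reduced to a few lines. In this toy sketch (the numbers, names, and cost model are ours), an overloaded host compares its local estimate against remote bids, charging each remote bid the cost of moving the process.

```python
# Source-directed bidding: an overloaded host asks other hosts for bids
# and moves work to the cheapest one, if the move cost is justified.
def request_bid(host_load, process_cost):
    # A host's bid: its estimated completion time if it accepts the work.
    return host_load + process_cost

def choose_host(local_load, remote_loads, process_cost, move_cost):
    local_estimate = request_bid(local_load, process_cost)
    best_host, best_bid = None, local_estimate
    for host, load in remote_loads.items():
        bid = request_bid(load, process_cost) + move_cost
        if bid < best_bid:        # a remote host wins only after paying move cost
            best_host, best_bid = host, bid
    return best_host               # None means "run locally"

loads = {"host2": 2.0, "host3": 9.0}   # stale estimates of remote queue lengths
print(choose_host(local_load=8.0, remote_loads=loads,
                  process_cost=1.0, move_cost=2.0))   # host2
```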
Real-time applications

Real-time applications such as nuclear power plants and process control are inherently distributed and have severe real-time and reliability requirements. These requirements add considerable complication to a DCS. Examples of demanding real-time systems include ESS [13], REBUS [9], and SIFT [132]. ESS is a software-controlled electronic switching system developed by the Bell System for placing telephone calls. The system meets severe real-time and reliability requirements. REBUS is a fault-tolerant distributed system for industrial real-time control, and SIFT is a fault-tolerant flight control system.

Generally, these systems are built with technology that is tailored to these applications, because many of the concepts and ideas used in general-purpose distributed computing are not applicable when deadlines must be guaranteed. For example, remote procedure calls, creating tasks and threads, and requesting operating system services are all done in a manner that ignores deadlines, causes processes to block at any time, and only provides reasonable average-case performance. None of these things are reasonable when it is critical that deadlines be met. In fact, many misconceptions [115] exist when dealing with distributed, real-time systems. However, significant new research efforts are now being conducted to combat these misconceptions and to provide a science of real-time computing in the areas of formal verification, scheduling theory and algorithms [72, 91], communications protocols [8, 135], and operating systems [98, 116, 125]. The goal of this new research in real-time systems is to develop predictable systems even when the systems are highly complex, distributed, and operate in nondeterministic environments [117].

Other distributed real-time applications, such as airline reservation and banking applications, have less severe real-time constraints and are easier to build. These systems can utilize most of the general-purpose distributed computing technology described in this paper, and they generally approach the problem as if real-time computing were equivalent to fast computing, which is false. While this seems to be common practice, some additions are required to deal with real-time constraints; for example, scheduling algorithms may give preference to processes with earlier deadlines, or processes holding a resource may be aborted if a process with a more urgent deadline
Results from the more demanding real-time systems may also be applicable to these soft real-time systems.
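As a concrete illustration of deadline-preferential scheduling, consider the following sketch (in Python, with invented names; it is not drawn from any of the systems cited above). Ready tasks are kept in a priority queue ordered by absolute deadline, so the most urgent task is always dispatched first:

```python
import heapq

class EDFScheduler:
    """Minimal earliest-deadline-first (EDF) ready queue (illustrative only)."""

    def __init__(self):
        self._queue = []   # heap of (deadline, sequence, task_name)
        self._seq = 0      # tie-breaker so equal deadlines stay FIFO

    def submit(self, task_name, deadline):
        heapq.heappush(self._queue, (deadline, self._seq, task_name))
        self._seq += 1

    def dispatch(self, now):
        """Return the most urgent task, or None if every deadline has passed."""
        while self._queue:
            deadline, _, task = heapq.heappop(self._queue)
            if deadline >= now:
                return task
            # Deadline already missed; a hard real-time system would have
            # to treat this as a fault rather than silently drop the task.
        return None

sched = EDFScheduler()
sched.submit("log-sensor", deadline=40)
sched.submit("close-valve", deadline=10)
print(sched.dispatch(now=0))   # -> close-valve (earliest deadline)
```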
Distributed databases

Database systems have existed for many years, providing significant benefits in high availability and reliability, reduced costs, good performance, and ease of sharing information. Most are built using what can be called a database architecture [34, 85] (see Figure 5). The architecture includes a query language for users; a data model that describes the information content of the database as seen by the users; a schema (the definition of the structure, semantics, and constraints on the use of the database); a mapping that describes the physical storage structure used to implement the data model; a description of how the physical data will be accessed; and, of course, the data (the database) itself. All these components are then integrated into a collection of software that handles all accesses to the database. This software is called the database management system (DBMS). The DBMS usually supports a transaction model. In this section, we discuss various transaction structures, access and concurrency control, reliability, heterogeneity, efficiency techniques, and real-time distributed databases.

Figure 5. Database architecture.
Transaction structures

A transaction is an abstraction that allows programmers to group a sequence of actions on the database into a logical execution unit [17, 58, 65, 107]. Transactions either commit or abort. If a transaction successfully completes all its work, it commits. A transaction that aborts does so without completing its operations, and any effect of its executed actions must be undone. Transactions have four properties, known as the ACID properties: atomicity, consistency, isolation, and durability. Atomicity means that either the entire transaction completes, or it is as if the transaction never executed. Consistency means that the transaction maintains the integrity constraints of the database. Isolation means that even if transactions execute concurrently, their results appear as if they were executed in some serial order. Durability means that all changes made by a committed transaction are permanent.

A distributed database is a single logical database that is physically distributed over many sites (nodes). A distributed DBMS controls access to the distributed data and supports the ACID properties, but with the added complication imposed by the physical distribution. For example, a user may issue a transaction that updates various data that, transparently to the user, physically reside on many different nodes [118]. The software that supports the ACID properties of transactions must then interact with all these nodes in a manner consistent with those properties. This usually requires supporting remote node slave transactions, distributed concurrency control, recovery, and commit protocols.
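To make the atomicity property concrete, here is a minimal sketch (in Python; the undo-log representation and names are ours, not any particular DBMS's) in which every write is recorded so that abort can restore the previous state:

```python
class Transaction:
    """Toy transaction: atomicity via an undo log (hypothetical API)."""

    def __init__(self, store):
        self.store = store
        self.undo_log = []            # (key, previous value) pairs

    def write(self, key, value):
        self.undo_log.append((key, self.store.get(key)))
        self.store[key] = value

    def commit(self):
        self.undo_log.clear()         # changes are now permanent

    def abort(self):
        # Undo in reverse order, restoring every overwritten value,
        # so either all of the transaction's effects survive or none do.
        for key, old in reversed(self.undo_log):
            if old is None:
                self.store.pop(key, None)
            else:
                self.store[key] = old
        self.undo_log.clear()

accounts = {"a": 100, "b": 50}
t = Transaction(accounts)
t.write("a", 70)
t.write("b", 80)
t.abort()                             # transfer fails: both writes vanish
print(accounts)                       # -> {'a': 100, 'b': 50}
```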
16
Distributed Computing
Many papers and books have been written about these basic aspects of distributed database management systems [15, 16, 17, 58, 85]. Rather than revisiting them, we discuss some of the extended transaction models that have recently been developed. The traditional transaction model, while powerful because of its ability to mask the effects of failures and concurrency, has shortcomings when applied to complex applications such as computer-aided design, computer-aided software engineering, distributed operating systems, and multimedia databases. In these applications, there is a need for greater functionality and performance than can be achieved with traditional transactions. For example, two programmers working on a joint programming project may wish for their transactions to be cooperative and to see each other's partial results, rather than being competitive and isolated, as the traditional transaction model dictates. Also, traditional transactions exploit only very simple semantics (such as read-only and write-only semantics) in order to achieve greater concurrency and performance.

Many extended transaction models have been proposed to support the greater functionality and performance required by complex applications. These include nested transactions [76, 89], multilevel transactions [11, 77], cooperative transactions [60], compensating transactions [61], recoverable communicating actions [128], split transactions [90], and sagas [44, 45]. Each of these models has different semantics with respect to visibility, consistency, recovery, and permanence in its attempt to be useful for various complex applications. As an example, a nested transaction is composed of subtransactions that may execute concurrently. Subtransactions are serializable with respect to siblings and other nonrelated transactions, and they are failure atomic with respect to their parent transaction: a subtransaction can abort without necessarily causing the parent to abort. The other extended models have features such as relaxed serializability or failure atomicity, may have structures other than hierarchical ones, and may exhibit different abort dependencies. The relationship and utility of these models are currently being explored. In this regard, a comprehensive and flexible framework called ACTA [31] has been developed to provide a formal description of, and a reasoning procedure for, the properties of all these extended transaction models.

Traditional transactions have also changed for complex applications along another dimension. Initially, transactions were considered as performing simple read or write operations on the database. However, a merger of ideas from object-based programming and database systems has produced object-based databases [14, 22, 73], in which transactions and extended transactions perform higher level operations on objects that reside in the database. Object-based databases provide more support for complex applications than working with simple read and write operations does.
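The nested transaction behavior described above can be sketched as follows (Python; the structure and names are illustrative, not those of any cited system). A child's abort is contained, while a child's commit merely hands its effects to the parent, whose fate they then share:

```python
class NestedTransaction:
    """Sketch of nesting: a child's abort does not abort the parent,
    and a child's commit becomes permanent only when the top-level
    transaction ultimately commits."""

    def __init__(self, parent=None):
        self.parent = parent
        self.effects = []             # effects tentatively committed to us

    def child(self):
        return NestedTransaction(parent=self)

    def do(self, effect):
        self.effects.append(effect)

    def abort(self):
        self.effects = []             # discard, without touching the parent

    def commit(self):
        if self.parent is not None:
            # Failure atomicity w.r.t. the parent: effects migrate upward
            # and share the parent's fate from now on.
            self.parent.effects.extend(self.effects)
            return None
        return self.effects           # top level: effects become durable

root = NestedTransaction()
ok = root.child()
ok.do("update part A")
ok.commit()
bad = root.child()
bad.do("update part B")
bad.abort()                           # contained: the parent survives
print(root.commit())                  # -> ['update part A']
```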
Access and concurrency control

Most access control in distributed databases is supported by the underlying operating system. In particular, it is the operating system that verifies that a database user is who he claims to be and that controls which users are allowed to read or write various data. Users are typically grouped so that each group has the same rights. Maintaining and updating the rights assigned to each group is nontrivial, and the problem is exacerbated when heterogeneous databases are considered.

In a database system, multiple transactions execute in parallel and may conflict over the use of data. Protocols for resolving data access conflicts between transactions are called concurrency control protocols [5, 15, 62, 67, 123]. The correctness of a concurrency control protocol is usually based on the concept of serializability [16]: the effect of a set of executed transactions (permitted to run in parallel) must be the same as some serial execution of that set of transactions. In many cases — and at all levels — the strict condition of serializability is not required.
Relaxing this requirement can usually yield access techniques that are more efficient, in the sense of allowing more parallelism and faster execution times. Due to space limitations, we do not discuss these extensions here; see [129, 130].

Three major classes of concurrency control protocols are locking, time stamp ordering [95], and validation (also called the optimistic approach [62]). Locking is a well-known technique; in the database context, its most common form is two-phase locking (see [16] for a description of this protocol). Time stamp ordering is an approach where all accesses to data are time stamped, and some common rule is then followed by all transactions so as to ensure serializability [123]. This technique can be useful at all levels of a system, especially if time stamps are already being generated for other reasons, such as detecting lost messages or failed hosts. A variation of the time-stamp-ordering approach is the multiversion approach, which is interesting because it integrates concurrency control with recovery. In this scheme, each change to the database results in a new version; concurrency control is enforced on a version basis, and old versions serve as checkpoints. Multiple versions can be handled efficiently with differential files [99]. Validation is a technique that permits unrestricted access to data items (resulting in no blocking and hence fast access to data) but then checks for potential conflicts at the commit point, the time at which a transaction is sure that it will complete. This approach is useful when few conflicts are expected, because access is very fast; if most transactions validate successfully (owing to the few conflicts), there is also little overhead from aborting nonvalidated transactions. Most validation protocols assume a particular recovery scheme and can also be considered to integrate concurrency control and recovery.
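A minimal sketch of the validation approach follows (Python; the names are invented, and the backward-validation rule shown is only one of several published variants). Transactions run against private copies without blocking, and at commit time the read set is checked against the write sets of transactions that committed in the meantime:

```python
class OptimisticTxn:
    """Backward-validation sketch: unrestricted access, conflict check
    at commit. Real protocols also order validation phases and
    integrate with recovery; this is illustrative only."""

    committed_write_sets = []         # shared history of committed writes

    def __init__(self, db):
        self.db = db
        self.start_len = len(OptimisticTxn.committed_write_sets)
        self.read_set, self.local = set(), {}

    def read(self, key):
        self.read_set.add(key)
        return self.local.get(key, self.db.get(key))

    def write(self, key, value):
        self.local[key] = value       # private copy: no locks, no blocking

    def commit(self):
        recent = OptimisticTxn.committed_write_sets[self.start_len:]
        if any(ws & self.read_set for ws in recent):
            return False              # conflict detected: caller must restart
        self.db.update(self.local)
        OptimisticTxn.committed_write_sets.append(set(self.local))
        return True

db = {"x": 1}
t1, t2 = OptimisticTxn(db), OptimisticTxn(db)
t1.write("x", t1.read("x") + 1)
t2.write("x", t2.read("x") + 10)
print(t1.commit(), t2.commit())       # -> True False (t2 must restart)
```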
Reliability

Recovery management in distributed databases is a complex process [58, 102, 103]. Recovery is initiated by problems such as invalid inputs, integrity violations, deadlocks, and node and media failures. All of these faults except node failures are addressed by simple transaction rollback; node failures require much more complicated solutions. All solutions are based on data redundancy and make use of stable storage, where past information concerning the database has been saved and can survive failures. The recovery manager component of a database system is responsible for recovery. Generally, the recovery manager must operate under six scenarios:

(1) Under normal operation, the recovery manager logs each transaction and its work; at transaction commit time, it checks transaction consistency, records the commit operation on the log, and forces the log to stable storage.
(2) If a transaction must be rolled back, the recovery manager performs the rollback to a specific checkpoint or completely aborts the transaction (essentially a rollback to the beginning of the transaction).
(3) If any database resource manager crashes, the recovery manager must obtain the proper log records and restore the resource manager to the most recent committed state.
(4) After a node crash, the recovery manager must restore the state of all the resource managers at that node and resolve any outstanding distributed transactions that were using that node at the time of the crash and could not be resolved because of it.
(5) The recovery manager is responsible for media recovery (such as after a disk crash), using an update log and archive copies, or copies from other nodes if the system supports replicated data.
(6) Distributed databases typically have recovery managers at each node that cooperate to handle many of the previous scenarios. These recovery managers may themselves fail; restart requires reintegrating a recovery manager into the set of active recovery managers in the distributed system.
Actually supporting all of these scenarios is difficult and requires sophisticated strategies for what information a log contains, how and when to write the log, how to maintain the log over a long period, how to undo and redo transaction operations, how to utilize the archives, how and when to checkpoint, and how to interact with concurrency control and commit processing. Many performance trade-offs arise when implementing recovery management: for example, how often and how to take checkpoints. If checkpoints are taken frequently, then recovery is faster and less work is lost; however, taking checkpoints too often significantly slows the "normal" operation of the system. Another question is how to log the information needed for recovery efficiently. For example, many systems perform a lazy commit, in which all the log records are created at commit time but are pushed to disk only later. This weakens the guarantees the system can provide about the updates, but it improves performance.
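Scenario (1) and the lazy-commit trade-off can be sketched as follows (Python; the log record format and file handling are invented for illustration and are greatly simplified relative to a production recovery manager):

```python
import json
import os

def commit(txn_id, log_records, log_path, lazy=False):
    """Scenario (1) in miniature: append the transaction's log records
    plus a commit record, then force the log to stable storage. With
    lazy commit the flush is skipped, trading durability guarantees
    for performance, as described above."""
    with open(log_path, "a") as log:
        for rec in log_records:
            log.write(json.dumps({"txn": txn_id, "op": rec}) + "\n")
        log.write(json.dumps({"txn": txn_id, "commit": True}) + "\n")
        if not lazy:
            log.flush()
            os.fsync(log.fileno())   # the "force" that makes commit durable

def recover(log_path):
    """After a crash, redo only operations of committed transactions."""
    records, committed = [], set()
    with open(log_path) as log:
        for line in log:
            rec = json.loads(line)
            if rec.get("commit"):
                committed.add(rec["txn"])
            else:
                records.append(rec)
    return [r for r in records if r["txn"] in committed]
```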
Heterogeneity

As defined above, a distributed database is a single logical database physically residing on multiple nodes; it has a single query language, data model, schema, and transaction management strategy. This is in contrast to a federated database system, a collection of autonomous database systems integrated for cooperation. The autonomous systems may have identical query languages, data models, schemas, and transaction management strategies, or they may be heterogeneous along any or all of these dimensions. The degree of integration varies from a single unified system, constructed with new models for query languages and data models and built on top of the autonomous components, to systems with minimal interaction and no unified models. These latter systems are called multidatabase systems. Federated systems arise when databases are developed independently and then need to be highly integrated; multidatabase systems arise when individual database management systems wish to retain a great degree of autonomy and provide only minimal interaction.

Heterogeneity in database systems also arises from differences in underlying operating systems or hardware, from differences in DBMSs, and from the semantics of the data itself. Several examples of these forms of heterogeneity follow. Operating systems on different machines may support different file systems, naming conventions, and IPC. Hardware instruction sets, data formats, and processing components may also differ across nodes. Each of these differences must be resolved for proper integration of the databases. If DBMSs have different query languages, data models, or transaction strategies, then problems arise. For example, differences in query languages may mean that some requests become illegal when issued on data in the "other" database. Differences in data models arise from many sources, including what structures each supports (for example, relations versus record types) and how constraints are specified. One transaction strategy might employ two-phase locking with after-image logging, while another uses some form of optimistic concurrency control.

Other problems arise from the semantics of the data. Semantic heterogeneity is not well understood; it arises from differences in the meaning, definition, or interpretation of related data in the various databases. For example, course grades may be defined on an [A, B, C, D, F] scale in one database and on a numerical scale in another. How do we resolve the differences when a query fetches grades from both databases?
Solutions to the heterogeneity problem are usually difficult and costly, both in development cost and in run-time execution cost. Differences in query languages are usually solved by mapping commands in one language to an equivalent set of commands in the other, and vice versa. If more than two query languages are involved, an intermediate language is defined and mappings occur to and from it. Mappings must also be defined for data model and schema integration; for example, an integrated schema would contain the description of the data described by all component schemas and the mappings between them. It is also possible to restrict what data in a given database can be seen by the federation of databases; this is sometimes called an export schema. Solutions must also be developed for query decomposition and optimization and for global transaction management in federated database systems. The complexities found here are beyond the scope of this paper; interested readers should see [71, 101, 124].
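The grade-scale example above conveys the flavor of a semantic mapping: before a federated query can combine the two databases, both representations must be translated to a common scale. The following sketch (Python) does this with cutoffs that are, of course, invented; real mappings must be agreed on by the component database administrators:

```python
# Map both grade representations onto a common numeric (grade-point) scale.
LETTER_TO_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

def to_common_scale(grade):
    if isinstance(grade, str):                # letter-grade database
        return LETTER_TO_POINTS[grade.upper()]
    return round(grade / 25.0, 1)             # 0-100 database, assumed linear

# A federated query can now merge grades fetched from both databases.
federated = [to_common_scale(g) for g in ["B", 87.5, "A", 62.0]]
print(federated)                              # -> [3.0, 3.5, 4.0, 2.5]
```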
Efficiency

Distributed database systems use many techniques for improving efficiency, including distributed and local query optimization, various forms of buffering, lazy evaluation, disk space allocation tailored to access requirements, mapping caches, parallelism in subtransactions, and nonserializability. Parallelism in subtransactions and nonserializability have been discussed above, so we do not revisit them here. Operating systems that support databases often must be specifically tailored for database functions; otherwise, the support they provide is usually inefficient (see [119] for a discussion of this issue).

To obtain good performance in distributed databases, query optimization is critical. Query optimization can be categorized as either heuristic, where ad hoc rules transform the original query into an equivalent query that can be processed more quickly, or systematic, where the estimated cost of various alternatives is computed using analytical cost models that reflect the details of the physical distributed database. In relational databases, Join, Selection, and Projection are the most time-consuming and frequent operations in queries, and hence most query optimizers deal with these operations. A Join, for example, is one of the most time-consuming relational operators because the size of its result can equal the product of the sizes of the original relations. The most common optimization is to perform Projections and Selections first, to minimize the intermediate relations to be Joined. In a distributed setting, minimizing the intermediate relations can also have the effect of minimizing the amount of data that must be transferred across the network. Specialized query optimization techniques also exist for statistical databases [84] and memory-resident databases [133].

Supporting a database buffer (sometimes called a database cache) is one of the most important efficiency techniques [40, 96]. The database buffer manager makes pages accessible in main memory (reading from the disk) and coordinates writes to the disk. In doing so, it attempts to minimize I/O and to perform as much I/O as possible in a lazy fashion. One example of lazy I/O is that at transaction commit time only the log records are forced to disk; the commit completes, and the actual data records are written later, when the disk is idle or when forced out for other reasons, such as a need for more buffer space. Part of the buffer manager's task is to interact with the log manager by writing to the log and to cooperate with the recovery manager. If this is done poorly, disk I/O for logging can become a bottleneck.

Disk space allocation is another important consideration in attaining good database performance. In general, the space allocation strategy should allow fast address translation from logical block numbers to physical disk addresses without any I/O, and allocation should support both direct and sequential access.
The full mapping of relations through multiple intermediate levels of abstraction (for example, from relations to segments, to OS files, to logical disks, to extents, and to disk blocks) down to the physical layer must also be done efficiently. In fact, one should try to eliminate unnecessary intermediate layers (while still retaining data independence) and use various forms of mapping caches to speed up the translations.
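The Selection/Projection pushdown heuristic described above can be seen in miniature below (Python, with relations held as lists of dictionaries; the data and names are invented). Filtering and projecting a relation before the Join shrinks the intermediate relation and, in a distributed setting, the data shipped to the join site:

```python
emp  = [{"eno": 1, "name": "Liu", "dno": 10, "salary": 90},
        {"eno": 2, "name": "Okafor", "dno": 20, "salary": 40}]
dept = [{"dno": 10, "dname": "Design"}, {"dno": 20, "dname": "Audit"}]

def select(rel, pred):               # Selection
    return [t for t in rel if pred(t)]

def project(rel, attrs):             # Projection
    return [{a: t[a] for a in attrs} for t in rel]

def join(r, s, attr):                # nested-loop Join on a common attribute
    return [{**t, **u} for t in r for u in s if t[attr] == u[attr]]

# Unoptimized: Join first, then filter; the intermediate relation is large.
slow = select(join(emp, dept, "dno"), lambda t: t["salary"] > 50)

# Optimized: push Selection and Projection below the Join, so only the
# qualifying, trimmed tuples participate (and would cross the network).
fast = join(project(select(emp, lambda t: t["salary"] > 50),
                    ["name", "dno"]),
            dept, "dno")
print(fast)   # -> [{'name': 'Liu', 'dno': 10, 'dname': 'Design'}]
```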
Real-time transaction systems

Real-time transaction systems are becoming increasingly important in a wide range of applications. One example is a computer-integrated manufacturing system, where the system keeps track of the state of physical machines, manages various presses in the production line, and collects statistical data from manufacturing operations. Transactions executing on the database may have deadlines, so as to reflect the state of manufacturing operations in a timely manner or to respond to control messages from operators. For instance, the information describing the current state of an object may need to be updated before a team of robots can work on the object. The update transaction is considered successful only if the data is changed consistently (in the view of all the robots) and the update is done within the specified time period, so that all the robots can begin working with a consistent view of the situation. Other applications of real-time database systems include program trading in the stock market, radar tracking systems, command and control systems, and air traffic control systems.

Real-time transaction processing is complex because it requires an integrated set of protocols that must not only satisfy database consistency requirements but also operate under timing constraints [36, 37, 50, 69]. The algorithms and protocols that must be integrated include CPU scheduling, concurrency control, conflict resolution, transaction restart, transaction wakeup, deadlock resolution, buffer management, and disk I/O scheduling [21, 23, 25, 26, 50, 100]. Each of these should directly address the real-time constraints. To date, work on real-time databases has investigated a centralized, secondary-storage real-time database [1, 2, 3]. As is usually required in traditional database systems, work so far has required that all real-time transaction operations maintain data consistency as defined by serializability. Serializability may be relaxed in some real-time database systems, depending on the application environment and data properties [69, 105, 114], but little actual work has been done in this area. Serializability is enforced by using a real-time version of either the two-phase locking protocol or optimistic concurrency control. Optimistic concurrency control has been shown to perform better than two-phase locking when integrated with priority-driven CPU scheduling in real-time database systems [48, 49, 53].

In addition to timing constraints, in many real-time database applications each transaction imparts a value to the system that is related to its criticalness and to when it completes execution (relative to its deadline). In general, the selection of a value function depends on the application [72]. To date, the value of a transaction has been modeled as a function of its criticalness, start time, deadline, and the current system time. Here, criticalness represents the importance of a transaction, while deadlines constitute its time constraints. Criticalness and deadline are two distinct characteristics of real-time transactions and are not necessarily related: a transaction with a short deadline need not have high criticalness, transactions with the same criticalness may have different deadlines, and transactions with the same deadline may have different criticalness values.
Basically, the higher the criticalness of a transaction, the larger its value to the system. It is important to note that the value of a transaction is time-variant. A transaction that has missed its deadline will not be as valuable to the system as it would be if it had completed before its deadline.
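One plausible (and purely illustrative) value function capturing these points follows; it awards a transaction its full criticalness if it completes by its deadline and a decaying value afterward for soft deadlines:

```python
def transaction_value(criticalness, deadline, now, soft=True):
    """Illustrative time-variant value function: full value before the
    deadline; for soft real-time transactions the value decays after
    the deadline instead of dropping to zero. The decay rule here is
    our own assumption, not taken from the cited work."""
    if now <= deadline:
        return criticalness
    if not soft:
        return 0.0                    # hard deadline: a late result is useless
    lateness = now - deadline
    return criticalness / (1.0 + lateness)   # value erodes with tardiness

print(transaction_value(10.0, deadline=100, now=90))    # -> 10.0
print(transaction_value(10.0, deadline=100, now=110))   # -> ~0.91 (10/11)
```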
Other important issues and results for real-time distributed databases include the following:
• In a real-time system, I/O scheduling is an important issue with respect to system performance. To minimize the probability of losing transactions, a good disk scheduling algorithm should take into account not only a transaction's time constraint but also its disk service time [25].
• When used for I/O, the earliest-deadline discipline ignores the characteristics of disk service time and therefore does not perform well except when the I/O load is low.
• Conflict resolution protocols that directly address deadlines and criticalness can have an important impact on performance relative to protocols that ignore such information.
• How can priority inversion (the situation where a high-priority transaction is blocked because a low-priority transaction holds a lock on a data item it needs) be solved [52, 106]? (See the sketch following this list.)
• How can soft real-time transaction systems be interfaced to hard real-time components?
• How can real-time transactions themselves be guaranteed to meet hard deadlines?
• How will real-time buffering algorithms impact real-time optimistic concurrency control [51]?
• How will semantics-based concurrency control techniques impact real-time performance?
• How will the algorithms and performance results be affected when extended to a distributed real-time system?
• How can correctness criteria other than serializability be exploited in real-time transaction systems?
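One standard answer to the priority-inversion question is priority inheritance, studied under two-phase locking in [52]. The sketch below (Python; the names are invented, and nesting, wait queues, and aborts are omitted) shows the core idea: a lock holder temporarily runs at the priority of the highest-priority transaction it blocks, so medium-priority work cannot prolong the inversion:

```python
class Lock:
    """Basic priority inheritance (illustrative sketch only)."""

    def __init__(self):
        self.holder = None

    def acquire(self, txn):
        if self.holder is None:
            self.holder = txn
            return True
        if txn.priority > self.holder.priority:
            # Inherit: boost the blocking holder to the waiter's priority.
            self.holder.priority = txn.priority
        return False                  # caller blocks until release

    def release(self, txn):
        txn.priority = txn.base_priority   # shed any inherited priority
        self.holder = None

class Txn:
    def __init__(self, name, priority):
        self.name = name
        self.priority = priority
        self.base_priority = priority

low, high = Txn("low", 1), Txn("high", 9)
lock = Lock()
lock.acquire(low)
lock.acquire(high)                    # blocks, but boosts low's priority
print(low.priority)                   # -> 9
```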
Summary

Distributed computer systems began in the early 1970s with a few experimental systems. Since that time, tremendous progress has been made in many of the disciplines that support distributed computing. The progress has been so remarkable that DCSs are now commonplace and quite large; for example, the Internet has over 500,000 nodes on it. This paper has discussed two of the areas that played a major role in this achievement: distributed operating systems and distributed databases. For more information on distributed computing, see the following books and surveys: [4, 35, 41, 46, 58, 85, 113, 115, 121].

As mentioned in the introduction, many areas of distributed computing could not be covered in this paper. One important omission is distributed file servers such as NFS and Andrew; for more information on these and other distributed file servers, see the survey paper [68].
Acknowledgments

I enthusiastically thank Panos Chrysanthis and Krithi Ramamritham for their valuable comments on this work.
Glossary

Access control list. A model of protection where rights are maintained as a list associated with each object. See Capability list and Access control matrix.
Access control matrix. A model of protection where rows of the matrix represent domains of execution and columns represent the objects in the system. The entries in the matrix indicate the allowable operations each domain of execution can perform on each object.
Active object model. An object that has one or more execution activities (for example, threads) associated with it at all times. Operation invocations on the object use these resident threads for execution.
Asynchronous send. An interprocess communication primitive where the sending process does not wait for a reply before continuing to execute. See Synchronous send.
Atomic broadcast. A communication primitive that guarantees that either all hosts (or processes) receive the message or none of them does. See Broadcasting.
Atomicity. A property of a transaction where either the entire transaction completes, or it is as if the transaction never executed. See Transaction.
Bidding. A distributed scheduling scheme that requests hosts to provide information, in the form of a bid, as to how well each host can accept new work.
Broadcasting. Sending a message to all hosts or processes in the system. See Multicasting.
Capability list. A model of protection where rights are maintained as a list associated with each execution domain. See Access control list and Access control matrix.
Client-server model. A software architecture that includes server processes that provide services and client processes that request services via well-defined interfaces. A particular process can be both a server and a client process.
Consistency. A property of a transaction which means that the transaction maintains the integrity of the database. See Transaction.
Copy-on-write. Data are delayed from being copied between address spaces until either the source or the destination actually performs a write operation.
Data abstraction. A collection of information (data) and a set of operations on that information.
Differential files. A representation of a collection of data as the difference from some point of reference. Used as a technique for storing large and volatile files.
Distributed computing environment (DCE). A computing environment that exploits the potential of computer networks without requiring users to understand the underlying complexity. The environment is intended to meet the needs of end users, system administrators, and application developers.
Distributed operating system (DOS). A native operating system that runs on and controls a network of computers.
Distributed shared memory. The abstraction of shared memory in a physically nonshared distributed system.
Durability. A property of a transaction which means that all changes made by a committed transaction are permanent. See Transaction.
Federated database system. A collection of autonomous database systems integrated for purposes of cooperation.
Hard real time. Tasks have deadlines or other timing constraints, and serious consequences could occur if a task misses a deadline. See Soft real time.
Isolation. A property of transactions which means that even if transactions execute concurrently, their results appear as if they were executed in some serial order. See Transaction and Serializability.
Lazy evaluation. A performance-improvement technique that postpones taking an action, or even part of an action, until the results of that action (subaction) are actually required.
Lightweight process. An efficiency technique that separates the address space and rights from the execution activity. Most useful for parallel programs and multiprocessors.
Lightweight remote procedure call. An efficiency technique to reduce the execution-time cost of remote procedure calls when the communicating processes happen to reside on the same host.
Multicasting. Sending a message to all members of a defined group. See Broadcasting.
Nested transaction. A transaction model that permits a transaction to be composed of subtransactions that can fail without necessarily aborting the parent transaction.
Network operating system (NOS). A layer of software added to local operating systems to enable a distributed collection of computers to cooperate.
Network partitioning. A failure situation where the communication network(s) connecting the hosts of a distributed system have failed in such a manner that two or more independent subnets are executing without being able to communicate with each other.
Object. An instantiation of a data abstraction. See Data abstraction.
Passive object. An object that has no execution activity assigned to it. The execution activity gets mapped into the object upon invocation of operations of the object.
Process. A program in execution, including the address space, the current state of the computation, and various rights to which the program is entitled.
RAID. Redundant arrays of inexpensive disks, used to enhance I/O throughput and fault tolerance.
Real-time applications. Applications where tasks have specific deadlines or other timing constraints, such as periodic requirements.
Recoverable communicating actions. A complex transaction model to support long and cooperative nonhierarchical computations involving communicating processes.
Remote procedure call (RPC). A synchronous communication method that provides the same semantics as a procedure call but occurs across hosts in a distributed system.
Sagas. A complex transaction model for long-lived activities consisting of a set of component transactions that can commit as soon as they complete. If the saga aborts, committed components are compensated.
Serializability. A correctness criterion which states that the effect of a set of executed transactions must be the same as some serial execution of that set of transactions.
Soft real time. In a soft real-time system, tasks have deadlines or other timing constraints, but no serious consequences occur if a deadline is missed. See Hard real time.
Split transactions. A complex transaction model where the splitting transaction delegates to the split transaction the responsibility for aborting or committing changes made to a subset of the objects it has accessed.
Synchronous send. An interprocess communication primitive where the sending process waits for a reply before proceeding. See Asynchronous send.
Time stamp ordering. A concurrency control technique where all accesses to data are time stamped and then some common rule is followed to ensure serializability. See Serializability.
Thread. The execution activity of a process. Multiple threads can exist in one process.
Transaction. An abstraction that groups a sequence of actions on a database into a logical execution unit. Traditional transactions have four properties. See Atomicity, Consistency, Isolation, and Durability.
Validation. A concurrency control technique that permits unrestricted access to data items but then checks for potential conflicts at the transaction commit point.
References

[1] R. Abbott and H. Garcia-Molina, "Scheduling Real-Time Transactions," ACM SIGMOD Record, Mar. 1988.
[2] R. Abbott and H. Garcia-Molina, "Scheduling Real-Time Transactions: A Performance Evaluation," Proc. 14th VLDB Conf., 1988.
[3] R. Abbott and H. Garcia-Molina, "Scheduling Real-Time Transactions with Disk Resident Data," Proc. 15th VLDB Conf., 1989.
[4] B. Abeysundara and A. Kamal, "High Speed Local Area Networks and Their Performance: A Survey," ACM Computing Surveys, Vol. 23, No. 2, June 1991.
[5] R. Agrawal, M.J. Carey, and M. Livny, "Concurrency Control Performance Modeling: Alternatives and Implications," ACM Trans. Database Systems, Vol. 12, No. 4, Dec. 1987.
[6] T. Anderson et al., "Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism," Tech. Report TR 90-04-02, Univ. of Washington, Oct. 1990.
[7] T. Anderson et al., "Interaction of Architecture and OS Design," tech. report, Dept. of Computer Science, Univ. of Washington, Aug. 1990.
[8] K. Arvind, K. Ramamritham, and J. Stankovic, "A Local Area Network Architecture for Communication in Distributed Real-Time Systems," invited paper, Real-Time Systems J., Vol. 3, No. 2, May 1991, pp. 113-147.
[9] J. Ayache, J. Courtiat, and M. Diaz, "REBUS, A Fault Tolerant Distributed System for Industrial Control," IEEE Trans. Computers, Vol. C-31, July 1982.
[10] M. Bach, N. Coguen, and M. Kaplan, "The ADAPT System: A Generalized Approach towards Data Conversion," Proc. VLDB, 1979.
[11] B. Badrinath and K. Ramamritham, "Performance Evaluation of Semantics-Based Multilevel Concurrency Control Protocols," Proc. ACM SIGMOD, 1990, pp. 163-172.
[12] J. Ball et al., "RIG, Rochester's Intelligent Gateway: System Overview," IEEE Trans. Software Eng., Vol. SE-2, No. 4, Dec. 1980.
[13] D. Barclay, E. Byrne, and F. Ng, "A Real-Time Database Management System for No. 5 ESS," Bell System Tech. J., Vol. 61, No. 9, Nov. 1982.
[14] D. Batory, "GENESIS: A Project to Develop an Extensible Database Management System," Proc. Int'l Workshop Object-Oriented Database Systems, 1986, pp. 207-208.
[15] P.A. Bernstein, D.W. Shipman, and J.B. Rothnie, Jr., "Concurrency Control in a System for Distributed Databases (SDD-1)," ACM Trans. Database Systems, Vol. 5, No. 1, Mar. 1980, pp. 18-25.
[16] P. Bernstein and N. Goodman, "Concurrency Control in Distributed Database Systems," ACM Computing Surveys, Vol. 13, No. 2, June 1981.
[17] P. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, Reading, Mass., 1987.
[18] B.N. Bershad, T.E. Anderson, and E.D. Lazowska, "Lightweight Remote Procedure Call," ACM Trans. Computer Systems, Vol. 8, No. 1, Feb. 1990, pp. 37-55.
[19] K. Birman, "Replication and Fault-Tolerance in the ISIS System," ACM Symp. OS Principles, Vol. 19, No. 5, Dec. 1985.
[20] A. Birrell et al., "Grapevine: An Exercise in Distributed Computing," Comm. ACM, Vol. 25, Apr. 1982, pp. 260-274.
[21] A. Buchmann et al., "Time-Critical Database Scheduling: A Framework for Integrating Real-Time Scheduling and Concurrency Control," Proc. Data Eng. Conf., 1989.
[22] M. Carey et al., "The Architecture of the EXODUS Extensible DBMS," Readings in Database Systems, Morgan Kaufmann, 1988, pp. 488-501.
[23] M.J. Carey, R. Jauhari, and M. Livny, "Priority in DBMS Resource Scheduling," Proc. 15th VLDB Conf., 1989.
[24] T. Casavant and J. Kuhl, "A Taxonomy of Scheduling in General Purpose Distributed Computing Systems," IEEE Trans. Software Eng., Vol. 14, No. 2, Feb. 1988.
[25] S. Chen et al., "Performance Evaluation of Two New Disk Scheduling Algorithms for Real-Time Systems," Real-Time Systems, Vol. 3, No. 3, Sept. 1991.
[26] S. Chen and D. Towsley, "Performance of a Mirrored Disk in a Real-Time Transaction System," Proc. 1991 ACM SIGMETRICS, 1991.
[27] D. Cheriton, H. Goosen, and P. Boyle, "Paradigm: A Highly Scalable Shared Memory Multicomputer Architecture," Computer, Feb. 1991.
[28] D. Cheriton and W. Zwaenepoel, "Distributed Process Groups in the V Kernel," ACM Trans. Computer Systems, Vol. 3, No. 2, May 1985.
[29] R. Chin and S. Chanson, "Distributed Object Based Programming Systems," ACM Computing Surveys, Vol. 23, No. 1, Mar. 1991.
[30] T. Chou and J. Abraham, "Load Balancing in Distributed Systems," IEEE Trans. Software Eng., Vol. SE-8, No. 4, July 1982.
[31] P. Chrysanthis and K. Ramamritham, "ACTA: A Framework for Specifying and Reasoning about Transaction Structure and Behavior," Proc. ACM SIGMOD Int'l Conf. Management Data, 1990, pp. 194-203.
[32] W. Chu et al., "Task Allocation in Distributed Data Processing," Computer, Vol. 13, Nov. 1980, pp. 57-69.
[33] P. Dasgupta et al., "The Clouds Distributed Operating System," Computer, Vol. 24, No. 11, Nov. 1991, pp. 34-44.
[34] C.J. Date, An Introduction to Database Systems, Addison-Wesley, Reading, Mass., 1975.
[35] D.W. Davies et al., Distributed Systems Architecture and Implementation, Vol. 105, Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, 1981.
[36] U. Dayal et al., "The HiPAC Project: Combining Active Database and Timing Constraints," ACM SIGMOD Record, Mar. 1988.
[37] U. Dayal, "Active Database Management Systems," Proc. 3rd Int'l Conf. Data and Knowledge Management, 1988.
[38] J. Dion, "The Cambridge File Server," ACM Operating Systems Rev., Oct. 1980.
[39] K. Efe, "Heuristic Models of Task Assignment Scheduling in Distributed Systems," Computer, Vol. 15, June 1982.
[40] W. Effelsberg and T. Haerder, "Principles of Database Buffer Management," ACM Trans. Database Systems, Vol. 9, No. 4, Dec. 1984.
[41] P. Enslow, "What is a Distributed Data Processing System," Computer, Vol. 11, Jan. 1978.
[42] E. Felten and J. Zahorjan, "Issues in the Implementation of a Remote Memory Paging System," Tech. Report TR 91-03-09, Univ. of Washington, Mar. 1991.
[43] R. Fitzgerald and R. Rashid, "Integration of Virtual Memory Management and Interprocess Communication in Accent," ACM Trans. Computer Systems, Vol. 4, No. 2, May 1986.
[44] H. Garcia-Molina et al., "Modeling Long-Running Activities as Nested Sagas," IEEE Tech. Committee Data Eng., Vol. 14, No. 1, Mar. 1991, pp. 14-18.
[45] H. Garcia-Molina and K. Salem, "SAGAS," Proc. ACM SIGMOD Int'l Conf. Management Data, 1987, pp. 249-259.
[46] A. Goscinski, Distributed Operating Systems: The Logical Design, Addison-Wesley, Sydney, Australia, 1991.
[47] J. Gray, "Notes on Database Operating Systems," Operating Systems: An Advanced Course, Springer-Verlag, Berlin, Germany, 1979.
[48] J.R. Haritsa, M.J. Carey, and M. Livny, "On Being Optimistic about Real-Time Constraints," Principles of Distributed Computing, 1990.
[49] J.R. Haritsa, M.J. Carey, and M. Livny, "Dynamic Real-Time Optimistic Concurrency Control," Proc. 11th Real-Time Systems Symp., 1990.
[50] J. Huang et al., "Experimental Evaluation of Real-Time Transaction Processing," Proc. Real-Time Systems Symp., 1989.
[51] J. Huang and J. Stankovic, "Real-Time Buffer Management," COINS TR 90-65, Univ. of Massachusetts, Aug. 1990.
[52] J. Huang et al., "Priority Inheritance under Two-Phase Locking," Proc. Real-Time Systems Symp., 1991.
[53] J. Huang et al., "Experimental Evaluation of Real-Time Optimistic Concurrency Control Schemes," Proc. VLDB, 1991.
[54] N. Hutchinson and L. Peterson, "The x-Kernel: An Architecture for Implementing Network Protocols," IEEE Trans. Software Eng., Vol. 17, No. 1, Jan. 1991.
[55] T. Joseph and K. Birman, "Reliable Broadcast Protocols," Tech. Report TR 88-918, Cornell Univ., June 1988.
[56] R. Katz, G. Gibson, and D. Patterson, "Disk System Architectures for High Performance Computing," Proc. IEEE, Vol. 77, No. 12, Dec. 1989.
[57] J.P. Kearns and S. DeFazio, "Diversity in Database Reference Behavior," Performance Evaluation Rev., Vol. 17, No. 1, May 1989.
[58] W. Kohler, "A Survey of Techniques for Synchronization and Recovery in Decentralized Computer Systems," ACM Computing Surveys, Vol. 13, No. 2, June 1981.
[59] H. Kopetz et al., "Distributed Fault Tolerant Real-Time Systems: The Mars Approach," IEEE Micro, Vol. 9, No. 1, Feb. 1989, pp. 25-40.
[60] H. Korth, W. Kim, and F. Bancilhon, "On Long-Duration CAD Transactions," Information Sciences, Vol. 46, No. 1-2, Oct.-Nov. 1988, pp. 73-107.
[61] H. Korth, E. Levy, and A. Silberschatz, "Compensating Transactions: A New Recovery Paradigm," Proc. 16th VLDB Conf., 1990, pp. 95-106.
[62] H.T. Kung and J.T. Robinson, "On Optimistic Methods for Concurrency Control," ACM Trans. Database Systems, Vol. 6, No. 2, June 1981.
[63] L. Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System," Comm. ACM, July 1978.
[64] L. Lamport, R. Shostak, and M. Pease, "The Byzantine Generals Problem," ACM Trans. Programming Languages and Systems, Vol. 4, No. 3, July 1982.
[65] B. Lampson, "Atomic Transactions," Lecture Notes in Computer Science, Vol. 105, Springer-Verlag, Berlin, Germany, 1980, pp. 365-370.
[66] E. Lazowska et al., "The Architecture for the Eden System," Proc. 8th Ann. Symp. Operating System Principles, 1981.
[67] G. LeLann, "Algorithms for Distributed Data-Sharing Systems That Use Tickets," Proc. 3rd Berkeley Workshop Distributed Databases and Computer Networks, 1978.
[68] E. Levy and A. Silberschatz, "Distributed File Systems: Concepts and Examples," ACM Computing Surveys, Vol. 22, No. 4, Dec. 1990.
[69] K.J. Lin, "Consistency Issues in Real-Time Database Systems," Proc. 22nd Hawaii Int'l Conf. System Sciences, 1989.
[70] B. Liskov and R. Scheifler, "Guardians and Actions: Linguistic Support for Robust, Distributed Systems," Proc. 9th Symp. Principles of Programming Languages, 1982, pp. 7-19.
[71] W. Litwin, L. Mark, and N. Roussopoulos, "Interoperability of Multiple Autonomous Databases," ACM Computing Surveys, Vol. 22, No. 3, Sept. 1990.
[72] C.D. Locke, Best-Effort Decision Making for Real-Time Scheduling, doctoral dissertation, Carnegie Mellon Univ., Pittsburgh, Pa., 1986.
[73] D. Maier et al., "Development of an Object-Oriented DBMS," Proc. Object-Oriented Programming Systems, Languages, and Applications, 1986, pp. 472-482.
[74] R. Mirchandaney, D. Towsley, and J. Stankovic, "Adaptive Load Sharing in Heterogeneous Distributed Systems," J. Parallel and Distributed Computing, Vol. 9, Sept. 1990, pp. 331-346.
[75] R. Mirchandaney, D. Towsley, and J. Stankovic, "Analysis of the Effects of Delays on Load Sharing," IEEE Trans. Computers, Vol. 38, No. 11, Nov. 1989, pp. 1513-1525.
[76] J.E.B. Moss, Nested Transactions: An Approach to Reliable Distributed Computing, doctoral thesis, Massachusetts Inst. of Technology, Cambridge, Mass., Apr. 1981.
[77] J.E.B. Moss, N. Griffeth, and M. Graham, "Abstraction in Recovery Management," Proc. ACM SIGMOD Int'l Conf. Management Data, 1986, pp. 72-83.
[78] S.J. Mullender et al., "Experiences with the Amoeba Distributed Operating System," Comm. ACM, Vol. 33, No. 12, Dec. 1990.
[79] R.M. Needham and A.J. Herbert, The Cambridge Distributed Computing System, Addison-Wesley, London, UK, 1982.
[80] R.M. Needham and M. Schroeder, "Using Encryption for Authentication in Large Networks of Computers," Comm. ACM, Vol. 21, No. 12, Dec. 1978, pp. 993-999.
[81] B.J. Nelson, "Remote Procedure Call," Tech. Report CSL-81-9, Xerox Corp., May 1981.
[82] B. Nitzberg and V. Lo, "Distributed Shared Memory: A Survey of Issues and Algorithms," Computer, Vol. 24, No. 8, Aug. 1991, pp. 52-60.
[83] J. Ousterhout, D. Scelza, and P. Sindhu, "Medusa: An Experiment in Distributed Operating System Structure," Comm. ACM, Vol. 23, Feb. 1980.
[84] G. Ozsoyoglu, V. Matos, and Z. Meral Ozsoyoglu, "Query Processing Techniques in the Summary-Table-by-Example Database Query Language," ACM Trans. Database Systems, Vol. 14, No. 4, 1989, pp. 526-573.
[85] M. Ozsu and P. Valduriez, Principles of Distributed Database Systems, Prentice Hall, Englewood Cliffs, N.J., 1991.
[86] D. Patterson, G. Gibson, and R. Katz, "A Case for Redundant Arrays of Inexpensive Disks (RAID)," Proc. ACM SIGMOD, 1988.
[87] G. Popek et al., "LOCUS, A Network Transparent, High Reliability Distributed System," Proc. 8th Symp. Operating System Principles, 1981, pp. 14-16.
[88] D. Powell et al., "The Delta-4 Distributed Fault Tolerant Architecture," Report No. 91055, Laboratoire d'Automatique et d'Analyse des Systemes, Feb. 1991.
[89] C. Pu, Replication and Nested Transactions in the Eden Distributed System, doctoral thesis, Univ. of Washington, 1986.
[90] C. Pu, G. Kaiser, and N. Hutchinson, "Split Transactions for Open-Ended Activities," Proc. 11th Int'l Conf. VLDB, 1988, pp. 26-37.
[91] K. Ramamritham, J. Stankovic, and P. Shiah, "Efficient Scheduling Algorithms for Real-Time Multiprocessor Systems," IEEE Trans. Parallel and Distributed Systems, Vol. 1, No. 2, Apr. 1990, pp. 184-194.
[92] R.F. Rashid and G.G. Robertson, "Accent: A Communication Oriented Network Operating System Kernel," Proc. 8th Symp. Operating System Principles, 1981.
[93] R. Rashid, "Threads of a New System," UNIX Review, Aug. 1986, pp. 37-49.
[94] R. Rashid et al., "Machine Independent Virtual Memory Management for Paged Uniprocessor and Multiprocessor Architectures," IEEE Trans. Computers, Vol. 37, No. 8, Aug. 1988.
[95] D.J. Rosenkrantz, R.E. Stearns, and P.M. Lewis, "System Level Concurrency Control for Distributed Database Systems," ACM Trans. Database Systems, Vol. 3, No. 2, June 1978.
[96] G.M. Sacco and M. Schkolnick, "Buffer Management in Relational Database Systems," ACM Trans. Database Systems, Vol. 11, No. 4, Dec. 1986.
[97] J.H. Saltzer, D.P. Reed, and D.D. Clark, "End-to-End Arguments in System Design," Proc. 2nd Int'l Conf. Distributed Computing Systems, 1981.
[98] K. Schwan, A. Geith, and H. Zhou, "From Chaos(Base) to Chaos(Arc): A Family of Real-Time Kernels," Proc. Real-Time Systems Symp., 1990, pp. 82-91.
[99] D.G. Severance and G.M. Lohman, "Differential Files: Their Application to the Maintenance of Large Databases," ACM Trans. Database Systems, Vol. 1, No. 3, Sept. 1976.
[100] L. Sha, R. Rajkumar, and J.P. Lehoczky, "Concurrency Control for Distributed Real-Time Databases," ACM SIGMOD Record, Mar. 1988.
[101] A. Sheth and J. Larson, "Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases," ACM Computing Surveys, Vol. 22, No. 3, Sept. 1990.
[102] D. Skeen, "Nonblocking Commit Protocols," Proc. ACM SIGMOD, 1981.
[103] D. Skeen and M. Stonebraker, "A Formal Model of Crash Recovery in a Distributed System," IEEE Trans. Software Eng., Vol. SE-9, No. 3, May 1983.
[104] D. Skeen, "A Decentralized Termination Protocol," Proc. 1st IEEE Symp. Reliability Distributed Software Database Systems, 1981.
[105] S.H. Son, "Using Replication for High Performance Database Support in Distributed Real-Time Systems," Proc. 8th Real-Time Systems Symp., 1987.
[106] S.H. Son and C.H. Chang, "Priority-Based Scheduling in Real-Time Database Systems," Proc. 15th VLDB Conf., 1989.
[107] A.Z. Spector and P.M. Schwarz, "Transactions: A Construct for Reliable Distributed Computing," ACM Operating Systems Rev., Vol. 17, No. 2, Apr. 1983.
[108] S.K. Shrivastava, "On the Treatment of Orphans in a Distributed System," Proc. 3rd Symp. Reliability Distributed Systems, 1983.
[109] S.K. Shrivastava and F. Panzieri, "The Design of a Reliable Remote Procedure Call Mechanism," IEEE Trans. Computers, Vol. C-31, July 1982.
[110] J.A. Stankovic, "Simulations of Three Adaptive, Decentralized, Controlled Job Scheduling Algorithms," Computer Networks, Vol. 8, No. 3, June 1984, pp. 199-217.
[111] J.A. Stankovic, "Bayesian Decision Theory and Its Application to Decentralized Control of Job Scheduling," IEEE Trans. Computers, Vol. C-34, Jan. 1985.
[112] J.A. Stankovic and I.S. Sidhu, "An Adaptive Bidding Algorithm for Processes, Clusters and Distributed Groups," Proc. 4th Int'l Conf. Distributed Computing, 1984.
[113] J.A. Stankovic, "A Perspective on Distributed Computer Systems," IEEE Trans. Computers, Vol. C-33, No. 12, Dec. 1984, pp. 1102-1115.
[114] J.A. Stankovic and W. Zhao, "On Real-Time Transactions," ACM SIGMOD Record, Mar. 1988.
[115] J.A. Stankovic, "Misconceptions about Real-Time Computing: A Serious Problem for Next Generation Systems," Computer, Vol. 21, No. 10, Oct. 1988, pp. 10-19.
[116] J.A. Stankovic and K. Ramamritham, "The Spring Kernel: A New Paradigm for Real-Time Systems," IEEE Software, Vol. 8, No. 3, May 1991, pp. 62-72.
[117] J.A. Stankovic and K. Ramamritham, "What is Predictability for Real-Time Systems — An Editorial," Real-Time Systems J., Vol. 2, Dec. 1990, pp. 247-254.
[118] M. Stonebraker and E. Neuhold, "A Distributed Database Version of INGRES," Proc. Berkeley Workshop Distributed Data Management and Computer Networks, 1977, pp. 19-36.
[119] M. Stonebraker, "Operating System Support for Database Management," Comm. ACM, Vol. 24, July 1981, pp. 412-418.
[120] H. Sturgis, J. Mitchell, and J. Israel, "Issues in the Design and Use of a Distributed File System," ACM Operating Systems Rev., July 1980.
[121] A. Tanenbaum and R. van Renesse, "Distributed Operating Systems," ACM Computing Surveys, Vol. 17, No. 4, Dec. 1985, pp. 419-470.
[122] M. Theimer, K. Lantz, and D.R. Cheriton, "Preemptable Remote Execution Facility for the V-System," Proc. 10th Symp. Operating Systems Principles, 1985, pp. 2-12.
[123] R.H. Thomas, "A Majority Consensus Approach to Concurrency Control for Multiple Copy Databases," ACM Trans. Database Systems, Vol. 4, No. 2, June 1979, pp. 180-209.
[124] G. Thomas et al., "Heterogeneous Distributed Database Systems for Production Use," ACM Computing Surveys, Vol. 22, No. 3, Sept. 1990.
[125] H. Tokuda and C. Mercer, "Arts: A Distributed Real-Time Kernel," ACM Operating Systems Rev., July 1989, pp. 29-53.
[126] D. Towsley, G. Rommel, and J. Stankovic, "Analysis of Fork-Join Program Response Times on Multiprocessors," IEEE Trans. Parallel and Distributed Systems, Vol. 1, No. 3, July 1990, pp. 286-303.
[127] R. Vaswani and J. Zahorjan, "Implications of Cache Affinity on Processor Scheduling for Multiprogrammed, Shared Memory Multiprocessors," Tech. Report TR 91-03-03, Univ. of Washington, Mar. 1991.
[128] S. Vinter, K. Ramamritham, and D. Stemple, "Recoverable Actions in Gutenberg," Proc. 6th Int'l Conf. Distributed Computing Systems, 1986, pp. 242-249.
[129] W. Weihl, Specification and Implementation of Atomic Data Types, doctoral thesis, Massachusetts Inst. of Technology, Cambridge, Mass., Mar. 1984.
[130] W. Weihl, "Commutativity Based Concurrency Control for Abstract Data Types," IEEE Trans. Computers, Vol. 37, No. 12, Dec. 1988, pp. 1488-1505.
[131] M. Weinstein et al., "Transactions and Synchronization in a Distributed Operating System," ACM Symp. Operating Systems Principles, Vol. 19, No. 5, Dec. 1985.
[132] J. Wensley et al., "SIFT: Design and Analysis of a Fault-Tolerant Computer for Aircraft Control," Proc. IEEE, Oct. 1978, pp. 1240-1255.
[133] K. Whang and R. Krishnamurthy, "Query Optimization in a Memory-Resident Domain Relational Calculus Database System," ACM Trans. Database Systems, Vol. 15, No. 1, Mar. 1990, pp. 67-95.
[134] K. Wu and W. Fuchs, "Recoverable Distributed Shared Virtual Memory," IEEE Trans. Computers, Vol. 39, No. 4, Apr. 1990.
[135] W. Zhao, J. Stankovic, and K. Ramamritham, "A Window Protocol for Transmission of Time Constrained Messages," IEEE Trans. Computers, Vol. 39, No. 9, Sept. 1990, pp. 1186-1203.
A Taxonomy of Scheduling in General-Purpose Distributed Computing Systems

Thomas L. Casavant and Jon G. Kuhl
The study of distributed computing has grown to include a large range of applications [16], [17], [31], [32], [37], [54], [55]. However, at the core of all the efforts to exploit the potential power of distributed computation are issues related to the management and allocation of system resources relative to the computational load of the system. This is particularly true of attempts to construct large, general-purpose multiprocessors [3], [8], [25], [26], [44], [45], [46], [50], [61], [67].

The notion that a loosely coupled collection of processors could function as a more powerful general-purpose computing facility has existed for quite some time. A large body of work has focused on the problem of managing the resources of a system in such a way as to effectively exploit this power. The result of this effort has been the proposal of a variety of widely differing techniques and methodologies for distributed-resource management. Along with these competing proposals has come the inevitable proliferation of inconsistent and even contradictory terminology, as well as a number of slightly differing problem formulations and assumptions. Thus, it is difficult to analyze the relative merits of alternative schemes in a meaningful fashion. It is also difficult to focus a common effort on approaches and areas of study that seem most likely to prove fruitful.

This paper attempts to tie the many facets of distributed scheduling together under a common, uniform set of terminology. We provide a taxonomy to classify distributed-scheduling algorithms according to a reasonably small set of salient features. This provides a convenient means to quickly describe the central aspects of a particular approach and offers a basis for comparison of commonly classified schemes.
Earlier works attempted to classify certain aspects of the scheduling problem. A paper by Casey [9] gives the basis of a hierarchical categorization. The taxonomy that we present here agrees with the nature of Casey's categorization. However, we include a large number of additional fundamental distinguishing features that differentiate among existing approaches. Hence, our taxonomy provides a more detailed and complete look at the basic issues addressed by Casey. This greater detail is necessary to allow meaningful comparisons of different approaches.
Wang and Morris [65] provide a taxonomy of load-sharing schemes that contrasts with the taxonomy presented by Casey. They succinctly describe the range of solutions to the load-sharing problem, characterizing each solution as either source initiative or server initiative and placing it along a continuous range according to the degree of information dependency involved. Our taxonomy takes a much broader view of the distributed-scheduling problem, in which load sharing is only one of several possible basic strategies available to a system designer. Thus, the Wang and Morris classifications describe only a narrow category within the taxonomy described here.

Among existing taxonomies, one can find examples of hierarchical and flat classification schemes. Our taxonomy proposes a hybrid of these two — hierarchical as long as possible in order to reduce the total number of classes, and flat when the descriptors of the system may be chosen in an arbitrary order. The levels in the hierarchy — chosen in order to keep the description of the taxonomy itself small — do not necessarily reflect any ordering of importance among characteristics. In other words, the descriptors comprising the taxonomy do not attempt to hierarchically order the characteristics of scheduling systems from more to less general. This point should be stressed especially with respect to the positioning of the flat portion of the taxonomy near the bottom of the hierarchy. For example, load balancing is a characteristic that pervades a large number of distributed-scheduling systems, but for the sake of reducing the size of the description of the taxonomy, it has been placed in the flat portion, and for the sake of brevity, the flat portion has been placed near the bottom of the hierarchy.

This paper is organized into four sections following this introduction. The first section, "The scheduling problem — Describing its solutions," defines the scheduling problem as it applies to distributed-resource management and contains a taxonomy that allows qualitative description and comparison of distributed-scheduling systems. The next section, "Examples," presents examples from the literature to demonstrate the use of the taxonomy in qualitatively describing and comparing existing systems. The last two sections, "Discussion" and "Conclusions," present ideas raised by the taxonomy and suggest areas in need of additional work.
The scheduling problem — Describing its solutions

The general scheduling problem has been described a number of times and in many different ways [12], [22], [63] and is usually a restatement of the classical notions of job sequencing [13] in the study of production management [7]. For the purposes of distributed-process scheduling, we take a broader view of the scheduling function as a resource management resource. This management resource is basically a mechanism or policy used to efficiently and effectively manage the access to and use of a resource by its various consumers. Hence, we may view every instance of the scheduling problem as consisting of three main components:

• Policy,
• Consumer(s), and
• Resource(s).

As with other management or control problems, the functioning of a scheduler is best understood by observing the effect it has on its environment. One can then observe the behavior of the scheduler in terms of how policy affects consumers and resources. Note that although there is only one policy, the scheduler may be viewed in terms of how it affects either or both consumers and resources. This relationship between the scheduler, the policy, consumers, and resources is shown in Figure 1.
Figure 1. Scheduling system: the scheduler, implementing a policy, mediates between consumers and resources.
In light of this description of the scheduling problem, two properties must be considered in evaluating any scheduling system:

• Performance: the satisfaction of the consumers with how well the scheduler manages the resource in question.
• Efficiency: the satisfaction of the consumers with how difficult or costly it is to access the management resource itself.

In other words, consumers want quick and efficient access to the actual resource in question, without being hindered by overhead associated with using the management function itself.

One by-product of this statement of the general scheduling problem is the unification of two terms in common use in the literature. There is often an implicit distinction between the terms allocation and scheduling. However, it can be argued that these are merely alternative formulations of the same problem, with allocation posed from the resources' point of view and scheduling posed from the consumers' point of view. In this sense, allocation and scheduling are merely two terms describing the same general mechanism from two different viewpoints.
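To make this three-component view concrete, the following minimal Python sketch (all names hypothetical, not drawn from any system discussed in this paper) models a scheduling instance as a policy mediating between consumers and resources, with performance and efficiency observable separately:

    from dataclasses import dataclass

    @dataclass
    class SchedulingInstance:
        # One instance of the scheduling problem: consumers, resources, a policy.
        consumers: list
        resources: list
        overhead: float = 0.0   # accumulated cost of the management function itself

        def policy(self, consumer):
            # The policy maps a consumer to a resource (here: least loaded first).
            self.overhead += 1          # every decision has a cost (efficiency)
            return min(self.resources, key=lambda r: r["load"])

    system = SchedulingInstance(
        consumers=["p1", "p2"],
        resources=[{"id": 0, "load": 3}, {"id": 1, "load": 1}])
    chosen = system.policy("p1")          # performance: how well consumers are served
    print(chosen["id"], system.overhead)  # efficiency: cost of using the scheduler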
The classification scheme

The usefulness of the four-category taxonomy of computer architecture presented by Flynn [20] has been well demonstrated by the ease with which systems can be compared through their relation to that taxonomy. The goal of our taxonomy is to provide a commonly accepted set of terms and a mechanism for comparing past work in the area of distributed scheduling in a qualitative way. In addition, it is hoped that the categories and their relationships to each other have been chosen carefully enough to indicate areas in need of future work and to help classify future work. We have tried to keep the taxonomy small by ordering it hierarchically when possible, but some choices of characteristics may be made independent of previous design choices and thus are specified as a set of descriptors from which a subset may be chosen; a sketch of this hybrid encoding follows.

The taxonomy, discussed here in terms of distributed-process scheduling, is also applicable to a larger set of resources. In fact, the taxonomy could be employed to classify any set of resource-management systems. However, we will focus our attention on the area of process management, since it is in this area that we hope to derive relationships useful in determining potential areas for future work.
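Because the taxonomy is hierarchical as long as possible and flat thereafter, a classification can be represented as an ordered path plus an unordered descriptor set. The sketch below is one hypothetical encoding of that idea; the descriptor terms mirror those defined later in this section:

    # Hierarchical portion: an ordered path of design choices through the tree.
    # Flat portion: an unordered subset drawn from a set of descriptors.
    FLAT_DESCRIPTORS = {
        "adaptive", "load-balancing", "bidding", "probabilistic",
        "one-time assignment", "dynamic reassignment",
    }

    def classify(path, descriptors):
        # A classification is the tree path plus any subset of flat descriptors.
        unknown = set(descriptors) - FLAT_DESCRIPTORS
        if unknown:
            raise ValueError(f"unknown flat descriptors: {unknown}")
        return tuple(path), frozenset(descriptors)

    # e.g., the adaptive bidding scheduler discussed later might be encoded as:
    example = classify(
        ("global", "dynamic", "physically distributed",
         "cooperative", "suboptimal", "heuristic"),
        {"adaptive", "bidding", "dynamic reassignment"})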
Hierarchical classification. The structure of the hierarchical portion of the taxonomy is shown in Figure 2. A discussion of the hierarchical portion follows.

Local versus global. We may distinguish between local and global scheduling at the highest level. Local scheduling involves the assignment of processes to the time-slices of a single processor. Since scheduling on single-processor systems [12], [62], as well as sequencing or job-shop processing [13], [18], has been actively studied for many years, our taxonomy focuses on global scheduling. Global scheduling involves deciding where to execute a process; the job of local scheduling is left to the operating system of the processor to which the process is ultimately allocated. This gives the processors in a multiprocessor increased autonomy while reducing the responsibility (and consequently the overhead) of the global-scheduling mechanism. Note that this does not imply that global scheduling must be done by a single central authority; rather, we view the problems of local and global scheduling as separate issues (at least logically), with separate mechanisms at work to solve each.

Static versus dynamic. The next level in the hierarchy (beneath global scheduling) is a choice between static and dynamic scheduling. This choice indicates the time at which scheduling or assignment decisions are made. In the case of static scheduling, information regarding the total mix of processes in the system, as well as all the independent subtasks involved in a job or task force [26], [44], is assumed to be available by the time the program object modules are linked into load modules. Hence, each executable image in a system has a static assignment to a particular processor, and each time that process image is submitted for execution, it is assigned to that processor. A more relaxed definition of static scheduling may include algorithms that schedule task forces for a particular hardware configuration: over a period of time, the topology of the system may change, but the characteristics describing the task force remain the same, so the scheduler may generate a new assignment of processes to processors to serve as the schedule until the topology changes again. Note that the term static scheduling as used here has the same meaning as deterministic scheduling in [22] and task scheduling in [56]. In our attempt to develop a consistent set of terms and taxonomy, we will not use these alternative terms here.

Optimal versus suboptimal. When all the information regarding the state of the system and the resource needs of a process is known, an optimal assignment can be made based on some criterion function [5], [14], [21], [35], [40], [48].
Examples of optimization measures are minimizing total process-completion time, maximizing utilization of resources in the system, and maximizing system throughput. In the event that these problems are computationally infeasible, suboptimal solutions may be tried [2], [34], [47]. Within the realm of suboptimal solutions to the scheduling problem, we may think of two general categories: approximate and heuristic.

Approximate versus heuristic. The first branch beneath the suboptimal solutions is labeled approximate. A solution in this branch uses the same formal computational model as an optimal algorithm, but instead of searching the entire solution space for an optimal solution, it is satisfied when it finds a "good" one. This solution is categorized as suboptimal-approximate. The assumption that a "good" solution can be recognized is not trivial, but in cases where a metric is available for evaluating a solution, this technique can be used to decrease the time taken to find an acceptable schedule. Factors that determine whether this approach is worth pursuing include

• The availability of a function to evaluate a solution,
• The time required to evaluate a solution,
• The ability to judge, according to some metric, the value of an optimal solution, and
• The availability of a mechanism for intelligently pruning the solution space.
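These four factors translate directly into a stopping rule: search the same solution space an optimal algorithm would, but stop at the first assignment whose evaluated cost is within a chosen bound. A minimal sketch, with an entirely hypothetical cost function and threshold:

    import itertools

    def approximate_assignment(tasks, processors, cost, good_enough):
        # Search the same space an optimal enumerative algorithm would, but
        # return the first assignment judged "good" by the evaluation function
        # (factors 1 and 2); good_enough encodes a metric-based estimate of the
        # optimum (factor 3). Real schemes would also prune subtrees (factor 4).
        best, best_cost = None, float("inf")
        for choice in itertools.product(processors, repeat=len(tasks)):
            assignment = dict(zip(tasks, choice))
            c = cost(assignment)
            if c <= good_enough:
                return assignment        # satisfied with a "good" solution
            if c < best_cost:
                best, best_cost = assignment, c
        return best                      # degenerates to a full optimal search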
The second branch beneath the suboptimal solutions is labeled heuristic [15], [30], [66]. This branch represents the category of static algorithms that make the most realistic assumptions about a priori knowledge concerning process and system-loading characteristics. It also represents the solutions to the static-scheduling problem that require the most reasonable amount of time and other system resources to perform their function.

The most distinguishing feature of heuristic schedulers is that they make use of special parameters that affect the system in indirect ways. Often, the parameter being monitored is correlated to system performance only indirectly, but it is much simpler to monitor or to calculate. For example, clustering groups of processes that communicate heavily on the same processor and physically separating processes that would benefit from parallelism [52] directly decrease the overhead involved in passing information between processors, while reducing the interference among processes that may run without synchronization with one another. This result has an impact on the overall service that users receive but cannot be directly related (in a quantitative way) to system performance as the user sees it. Hence, our intuition, if nothing else, leads us to believe that taking such actions when possible will improve system performance. However, we may not be able to prove that a first-order relationship exists between the mechanism employed and the desired result.

Optimal and suboptimal-approximate techniques. Regardless of whether a static solution is optimal or suboptimal-approximate, there are four basic categories of task-allocation algorithms that can be used to arrive at an assignment of processes to processors:

• Solution-space enumeration and search [48];
• Graph theoretic [4], [57], [58];
• Mathematical programming [5], [14], [21], [35], [40]; and
• Queuing theoretic [10], [28], [29].

Dynamic solutions. In dynamic scheduling, the more realistic assumption is made that very little a priori knowledge is available about the resource needs of a process. In static scheduling,
a decision is made for a process image before the process is ever executed, while in dynamic scheduling, no decision is made until a process begins its life in the dynamic environment of the system. Since it is the responsibility of the running system to decide where a process is to execute, it is natural to ask next where that decision itself is to be made.

Physically distributed versus physically nondistributed. The next level in the hierarchy involves whether the work involved in making decisions should be physically distributed among the processors [17] or whether the responsibility for the task of global dynamic scheduling should physically reside in a single processor (physically nondistributed) [44]. At this level, the concern is with the logical authority of the decision-making process.

Cooperative versus noncooperative. Within the realm of distributed, global, dynamic scheduling, we may also distinguish between mechanisms that involve cooperation between the distributed components (cooperative) and those in which the individual processors make decisions independent of the actions of the other processors (noncooperative). The question here is one of the degree of autonomy that each processor has in determining how its own resources should be used. In the cooperative case, each processor has the responsibility to carry out its own portion of the scheduling task, but all processors work toward a common, system-wide goal instead of making decisions based solely on the way in which a decision will affect local performance. In the noncooperative case, individual processors act alone as autonomous entities and arrive at decisions regarding the use of their resources independent of the effect of those decisions on the rest of the system.

As in the static-scheduling case, the taxonomy tree has reached a point where we may consider optimal, suboptimal-approximate, and suboptimal-heuristic solutions. The discussion presented for the static case also applies here.

Flat classification. In addition to the hierarchical portion of the taxonomy already discussed, there are a number of other distinguishing characteristics that scheduling systems may have. Here, we deal with characteristics that do not fit uniquely under any particular branch of the tree-structured taxonomy given thus far but are still important in describing the behavior of a scheduler. In other words, the following could be branches beneath several of the leaves shown in Figure 2. In the interest of clarity, these characteristics are not repeated under each leaf but are presented here as a flat extension to the scheme presented thus far. It should be noted that these attributes represent a set of characteristics, and any particular scheduling subsystem may possess some subset of this set. Finally, the placement of these characteristics near the bottom of the tree is not intended as an indication of their relative importance or of any relationship to other categories of the hierarchical portion; rather, this position was chosen primarily to reduce the size of the description of the taxonomy.

Adaptive versus nonadaptive. An adaptive solution to the scheduling problem is one in which the algorithms and parameters used to implement the scheduling policy change dynamically according to the previous and current behavior of the system, in response to previous decisions made by the scheduling system.
An example of such an adaptive scheduler would be one that takes many parameters into consideration in making its decisions [52]. In response to the behavior of the system, the scheduler may begin to ignore one parameter, or reduce its importance, if it believes that parameter is either providing information inconsistent with the rest of the inputs or providing no information about changes in system state relative to the values of the other parameters being observed.
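As a schematic illustration of this first example (the fragment below is hypothetical and not drawn from [52]), such a scheduler might down-weight any input whose recent readings disagree with the consensus of the other inputs:

    def update_weights(weights, readings, consensus, tolerance=0.25, decay=0.5):
        # weights:   {parameter name: importance used by the scheduling policy}
        # readings:  {parameter name: latest normalized observation}
        # consensus: combined estimate of system state from the other inputs
        for name, value in readings.items():
            if abs(value - consensus) > tolerance:
                weights[name] *= decay    # start to ignore an inconsistent parameter
            else:
                weights[name] = min(1.0, weights[name] / decay)  # restore trust slowly
        return weights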
A second example of adaptive scheduling would be one based on the stochastic learning automata model [39]. An analogy may be drawn here between the notion of an adaptive scheduler and adaptive control [38], although the usefulness of such an analogy for purposes of performance analysis and implementation is questionable [51]. In contrast, a nonadaptive scheduler is one that does not necessarily modify its basic control mechanism on the basis of the history of system activity. An example would be a scheduler that always weighs its inputs in the same way regardless of the history of the system's behavior.

Load balancing. This category of policies, which has received a great deal of attention in the literature [10], [11], [36], [40], [41], [42], [46], [53], approaches the problem with the philosophy that being fair to the hardware resources of the system is good for the users of that system. The basic idea is to attempt to balance (in some sense) the load on all processors in such a way that progress by all processes on all nodes proceeds at approximately the same rate. This solution is most effective when the nodes of a system are homogeneous, since homogeneity allows all nodes to know a great deal about the structure of the other nodes. Normally, information is passed about the network periodically or on demand [1], [60] in order to allow all nodes to obtain a local estimate of the global state of the system. The nodes then act together to remove work from heavily loaded nodes and place it at lightly loaded ones. This class of load-balancing solutions relies heavily on the assumption that the information at each node is quite accurate, in order to prevent processes from being circulated endlessly about the system without making much progress. Another concern here is deciding on the basic unit used to measure the load on individual nodes.

As was pointed out in the introduction, the placement of this characteristic near the bottom of the hierarchy in the flat portion of the taxonomy is not related to its relative importance or generality compared with characteristics at higher levels. In fact, it might be observed that — at the point at which a choice is made between optimal and suboptimal characteristics — a specific objective or cost function must already have been chosen. However, the purpose of the hierarchy is not so much to describe relationships between classes of the taxonomy as to reduce the size of the overall description of the taxonomy, so as to make it more useful in comparing different approaches to solving the scheduling problem.

Bidding. In this class of policy mechanisms, a basic protocol framework exists that describes the way in which processes are assigned to processors. The resulting scheduler is usually cooperative in the sense that enough information is exchanged (between nodes with tasks to execute and nodes that may be able to execute tasks) that an assignment of tasks to processors can be made that is beneficial to all nodes in the system as a whole. To illustrate the basic mechanism of bidding, the framework and terminology of [49] will be used. Each node in the network is responsible for two roles with respect to the bidding process: manager and contractor. The manager represents a task in need of a location to execute, and the contractor represents a node that is able to do work for other nodes.
Note that a single node takes on both of these roles, and there are no nodes that are strictly managers or strictly contractors. The manager announces the existence of a task in need of execution by a task announcement and then receives bids from the other nodes (contractors). A wide variety of possibilities exists concerning the type and amount of information exchanged in order to make decisions [53], [59]; these are the major factors in determining the effectiveness and performance of a scheduler employing the notion of bidding. A very important feature of this class of schedulers is that all nodes generally have full autonomy: the manager ultimately has the power to decide where to send a task from among those nodes that respond with bids, and the contractors are also autonomous, since they are never forced to accept work if they do not choose to do so.
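Using the roles of [49], one announcement-bid-award cycle can be sketched as follows. The node interface (evaluate, local_estimate) is hypothetical, but the autonomy on both sides is exactly as described above:

    def run_bidding(manager, contractors, task):
        # The manager announces a task; contractors may respond with bids;
        # the manager retains full authority over the final placement.
        announcement = {"task": task}
        bids = []
        for node in contractors:
            bid = node.evaluate(announcement)   # None: a contractor is never
            if bid is not None:                 # forced to accept work
                bids.append((bid, node))
        if not bids:
            return manager                      # no takers: keep the task locally
        best_bid, best_node = min(bids, key=lambda b: b[0])
        if best_bid < manager.local_estimate(task):
            return best_node                    # transferring beats running locally
        return manager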
Probabilistic. This classification has existed in scheduling systems for some time [13]. The basic idea is motivated by the fact that in many assignment problems, the number of permutations of the available work and of mappings to processors is so large that analytically examining the entire solution space would require a prohibitive amount of time. Instead, a process is chosen randomly (according to some known distribution) as the next to assign. Used repeatedly, this method generates a number of different schedules, and the set is then analyzed to choose the best among those randomly generated. Because an important attribute is used to bias the random choosing process, one would expect the resulting schedule to be better than one chosen entirely at random. The argument that this method actually produces a good selection rests on the expectation that the random choosing introduces enough variation to allow a "good" solution to get into the randomly chosen set.

An alternative view of probabilistic schedulers covers those that employ the principles of decision theory in the form of team theory [24]. These would be classified as probabilistic, since suboptimal decisions are influenced by prior probabilities derived from best guesses at the actual states of nature. In addition, these prior probabilities are used to determine (via some random experiment) the next action (or scheduling decision).

One-time assignment versus dynamic reassignment. In this classification, we consider the entities to be scheduled. If the entities are jobs in the traditional batch-processing sense of the term [19], [23], then we consider the single point in time at which a decision is made as to where and when the job is to execute. While this technique technically corresponds to a dynamic approach, it is static in the sense that once a decision has been made to place and execute a job, no further decisions are made concerning the job. We characterize this class as one-time assignment. Notice that in this mechanism, the only information usable by the scheduler is the information given it by the user or submitter of the job. This information might include estimated execution time or other system-resource demands. One critical point here is that once users of a system understand the underlying scheduling mechanism, they may present false information to the system in order to receive better response. This point fringes on the area of psychological behavior, but human interaction is an important design factor to consider here, since the scheduler's behavior attempts to embody a general philosophy; the interaction of this philosophy with the system's users must therefore be considered.

In contrast, solutions in the dynamic-reassignment class try to improve on earlier decisions by using information on smaller computation units — the executing subtasks of jobs or task forces. This category represents the set of systems that do not trust their users to provide accurate descriptive information and that use dynamically created information to adapt to the changing demands of user processes. This adaptation takes the form of migrating processes (including current process-state information).
There is clearly a price to be paid in terms of overhead, and this price must be carefully weighed against possible benefits. An interesting analogy exists between the differentiation made here and the question of preemption versus nonpreemption in uniprocessor scheduling systems. Here, the difference lies in whether or not to move a process from one place to another once an assignment has been made; in the uniprocessor case, the question is whether or not to remove the running process from the processor once a decision has been made to let it run.
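The trade-off reduces to a simple rule of thumb: reassign a running process only when the predicted gain exceeds the migration overhead. A hedged sketch, in which the remaining-time and cost estimators are placeholders:

    def should_migrate(remaining_here, remaining_there, migration_cost):
        # Dynamic-reassignment analog of the preemption decision: move a running
        # process only when the predicted gain outweighs the overhead of shipping
        # its current state to the other processor.
        return remaining_there + migration_cost < remaining_here

    # should_migrate(10.0, 4.0, 2.5) -> True, since 4.0 + 2.5 < 10.0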
Examples

In this section, examples are taken from the published literature to demonstrate their relationships to one another with respect to the taxonomy detailed in the preceding section. The purpose of this section is twofold: first, to show that many different scheduling algorithms fit into the taxonomy; and second, to show that the categories of the taxonomy actually correspond, in most cases, to methods that have been examined.
Global static

In [48], we see an example of an optimal, enumerative approach to the task-assignment problem. The criterion function is defined in terms of optimizing the amount of time a task will require for all interprocess communication and execution, where the tasks submitted by users are assumed to be broken into suitable modules before execution. The cost function is called a minimax criterion, since it is intended to minimize the maximum execution and communication time required by any single processor involved in the assignment. Graphs are used to represent the module-to-processor assignments, and the assignments are transformed to a type of graph matching known as weak homomorphisms. The optimal search of this solution space can then be done using the A* algorithm from artificial intelligence [43]. The solution also achieves a certain degree of processor load balancing.

Reference [4] gives a good demonstration of the usefulness of the taxonomy in that the paper describes its algorithm as a solution to the optimal dynamic assignment problem for a two-processor system. However, in attempting an objective comparison of this system with other dynamic systems, we see that the algorithm proposed is actually a static one. In terms of the taxonomy presented here, we would categorize this as a static, optimal, graph-theoretical approach in which the a priori assumptions are expanded to include more information about the set of tasks to be executed. The way in which reassignment of tasks is performed during process execution is decided before any of the program modules begin execution: instead of reassignment decisions being made during execution, the stronger assumption is made that all information about the dynamic needs of a collection of program modules is available a priori. That is, not only is the communication pattern among the modules assumed to be known at the start of execution, but any variation of that pattern over the course of execution is assumed to be predictable as well. Costs of relocation are also assumed to be available, and this assumption appears to be quite reasonable.

The model presented in [35] represents an example of an optimal, mathematical-programming formulation employing a branch-and-bound technique to search the solution space. The goals of the solution are to minimize interprocessor communication, balance the utilization of all processors, and satisfy all other engineering application requirements. The model defines a cost function that includes interprocessor-communication costs and processor-execution costs. The assignment is represented by a set of zero-one variables, and the total execution cost is represented by a summation of all costs incurred in the assignment. In addition, the problem is subject to constraints that allow the solution to satisfy the load-balancing and engineering-application requirements. The algorithm used to search the solution space (consisting of all potential assignments) is derived from the basic branch-and-bound technique.

Again, in [10], we see an example of the use of the taxonomy in comparing the proposed system to other approaches. The title of the paper, "Load Balancing in Distributed Systems," indicates that the goal of the solution is to balance the load among the processors in the system in some way.
However, the solution actually fits into the static, optimal, queuing-theoretical class. The goal of the solution is to minimize the execution time of the entire program to maximize performance,
and the algorithm is derived from results in Markov decision theory. In contrast to the definition of load balancing given earlier, where the goal was to even out the load and utilization of system resources, the approach in this paper is consumer oriented.

An interesting approximate, mathematical-programming solution, motivated from the viewpoint of fault tolerance, is presented in [2]. The algorithm is suggested by the computational complexity of the optimal solution to the same problem. In the basic solution to a mathematical-programming problem, the state space is either implicitly or explicitly enumerated and searched. One approximation method mentioned in this paper [64] involves first removing the integer constraint, solving the continuous-optimization problem, discretizing the continuous solution, and obtaining a bound on the discretization error. Whereas that bound is with respect to the continuous optimum, the algorithm proposed in this paper directly uses an approximation to solve the discrete problem and bounds its performance with respect to the discrete optimum.

The last static example to be given here appears in [66]. This paper gives a heuristic-based approach to the problem, using extractable data and the synchronization requirements of the different subtasks. The three primary heuristics used are

• Loss of parallelism,
• Synchronization, and
• Data sources.

Loss of parallelism is used to assign tasks to nodes one at a time so as to cause the least loss of parallelism, based on the number of units required for execution by the task currently under consideration. The synchronization constraints are phrased in terms of firing conditions that describe precedence relationships between subtasks. Finally, data-source information is used in much the same way that a functional program uses precedence relationships between parallel portions of a computation, with subtasks taking the roles of varying classes of suppliers of variables to other subtasks. The final heuristic algorithm weighs each of these heuristics and combines them, as sketched below. A distinguishing feature of the algorithm is its use of a greedy approach to find a solution; at the time decisions are made, there can be no guarantee that any decision is optimal. An optimal solution would search the solution space more carefully, using a backtracking or branch-and-bound method, as well as an exact optimization criterion instead of the heuristics suggested.
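The structure of such a greedy, weighted-heuristic assignment (in the spirit of [66], though the weights and scoring below are invented for illustration) is: score each candidate placement by a weighted sum of the three heuristics and commit tasks one at a time with no backtracking:

    def greedy_assign(tasks, nodes, h_parallelism, h_sync, h_data, w=(1.0, 1.0, 1.0)):
        # Each h_*(task, node, partial_assignment) returns a penalty; lower is
        # better. Tasks are committed one at a time, and a placement is never
        # revisited, so no individual decision is guaranteed to be optimal.
        assignment = {}
        for task in tasks:
            def score(node):
                return (w[0] * h_parallelism(task, node, assignment)
                        + w[1] * h_sync(task, node, assignment)
                        + w[2] * h_data(task, node, assignment))
            assignment[task] = min(nodes, key=score)
        return assignment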
Global dynamic

Among the dynamic solutions presented in the literature, the majority fit into the general category of physically distributed, cooperative, suboptimal, and heuristic. There are, however, examples for some of the other classes.

First, in the category of physically nondistributed, one of the best examples is the experimental system developed for the Cm* architecture — Medusa [44]. In this system, the functions of the operating system (for example, the file system and the scheduler) are physically partitioned and placed at different places in the system. Hence, the scheduling function is placed at a particular place and is accessed by all users at that location.

Another rare example exists in the physically distributed, noncooperative class. In this example [27], random level-order scheduling is employed at all nodes independently in a tightly coupled MIMD machine. Hence, the overhead involved in this algorithm is minimized, since no information need be exchanged to make random decisions. The mechanism suggested is thought to work best in moderately to heavily loaded systems, since in these cases a random policy is thought to give a reasonably balanced load on all processors. In contrast to a cooperative solution, this
algorithm does not detect or try to avoid system overloading by sharing loading information among processors; it assumes that it will be under heavy load most of the time and bases all of its decisions on that assumption. Clearly, here the processors are not necessarily concerned with the utilization of their own resources, but neither are they concerned with the effect their individual decisions will have on the other processors in the system. It should be pointed out that although the above two algorithms (and many others) are given in terms relating to general-purpose distributed-processing systems, they do not strictly adhere to the definition of a distributed data-processing system given in [17].

In [57], another rare example exists in the form of a physically distributed, cooperative, optimal solution in a dynamic environment. The solution is given for the two-processor case, in which critical load factors are calculated prior to program execution. The method employed is a graph-theoretical approach to solving for load factors for each process on each processor. These load factors are then used at runtime to determine when a task could run better if placed on the other processor.

The final class (and the largest in terms of the amount of existing work) is the class of physically distributed, cooperative, suboptimal, heuristic solutions. In [51], a solution is given that is adaptive, load-balancing, and makes one-time assignments of jobs to processors. No a priori assumptions are made about the characteristics of the jobs to be scheduled. One major restriction of these algorithms is that they consider only the assignment of jobs to processors: once a job becomes an active process, no reassignment of processes is considered, regardless of the possible benefit. This is quite defensible, though, if the overhead involved in moving a process is very high (which may be the case in many circumstances). While this solution cannot exactly be considered a bidding approach, information is exchanged between processes in order for the algorithms to function.

The first algorithm (a copy of which resides at each host) compares its own "busyness" with its estimate of the "busyness" of the least busy host; a sketch of this step appears below. If the difference exceeds the bias (or threshold) designated at the current time, one job is moved from the job queue of the busier host to the less busy one. The second algorithm allows each host to compare itself with all other hosts and involves two biases: if the difference exceeds bias1 but not bias2, one job is moved; if the difference exceeds bias2, two jobs are moved. There is also an upper limit on the number of jobs that can move at once in the entire system. The third algorithm is the same as the first, except that an antithrashing mechanism is added to account for the delay between the time a decision is made to move a job and the time it arrives at the destination. All three algorithms had an adaptive feature added that would turn off all parts of the respective algorithm (except the monitoring of load) when system load was below a particular minimum threshold; this stopped processor thrashing when it was practically impossible to balance the system load for lack of work to balance. In the high-load case, the algorithm was likewise turned off to avoid extraneous overhead when no redistribution of jobs could improve the system. This last feature also supports the notion, seen in the noncooperative example given earlier, that the load is usually automatically balanced as a side effect of heavy loading. The remainder of the paper focuses on simulation results revealing the impact of modifying the biasing parameters.
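The first algorithm reduces to a small per-host loop; the sketch below is a schematic reconstruction under assumed data structures, not the published code:

    def algorithm_one(my_load, estimated_loads, bias):
        # Runs at each host: compare this host's "busyness" with its *estimate*
        # of the least busy host; if the gap exceeds the current bias
        # (threshold), nominate one job for transfer to that host.
        least_busy = min(estimated_loads, key=estimated_loads.get)
        if my_load - estimated_loads[least_busy] > bias:
            return least_busy       # destination host for one job
        return None                 # load is acceptably balanced

    # Algorithm 2 generalizes this with bias1/bias2 (move one or two jobs);
    # algorithm 3 adds an antithrashing allowance for jobs already in transit.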
The work reported in [6] is an example of an algorithm that employs the heuristic of load balancing and probabilistically estimates the remaining processing times of processes in the system. The remaining processing time of a process was estimated by one of the following methods:

• Memoryless: Re(t) = E[S]
• Past repeats: Re(t) = t
• Distribution: Re(t) = E{S − t | S > t}
• Optimal: Re(t) = R(t)

where R(t) is the remaining time needed, given that t seconds have already elapsed; S is the service-time random variable; and Re(t) is the scheduler's estimate of R(t). The algorithm uses the first three methods to predict response times in order to obtain an expected-delay measure, which in turn is used by pairs of processors to balance their load on a pairwise basis. This mechanism is adopted by all pairs on a dynamic basis to balance the system load.
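The first two estimators can be written down directly; the distribution estimator requires the service-time distribution, which the sketch below (hypothetical names, an empirical sample standing in for the true distribution) approximates from observed service times:

    def memoryless(elapsed, samples):
        # Re(t) = E[S]: the expected service time, independent of elapsed time.
        return sum(samples) / len(samples)

    def past_repeats(elapsed, samples):
        # Re(t) = t: as much time remains as has already elapsed.
        return elapsed

    def distribution(elapsed, samples):
        # Re(t) = E{S - t | S > t}, here estimated from an empirical sample.
        longer = [s - elapsed for s in samples if s > elapsed]
        return sum(longer) / len(longer) if longer else 0.0

    # With samples [2, 4, 8, 16] and elapsed = 5:
    # memoryless -> 7.5, past_repeats -> 5, distribution -> (3 + 11) / 2 = 7.0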
Another adaptive algorithm, discussed in [55], is based on the bidding concept. The heuristic utilizes prior information concerning the known characteristics of processes, such as resource requirements, process priority, special resource needs, precedence constraints, and the need for clustering and distributed groups. The basic algorithm periodically evaluates each process at a current node to decide whether or not to transmit bid requests for a particular process. The bid requests include information needed for contractor nodes to make decisions regarding how well they may be able to execute the process in question. The manager receives bids, compares them to the local evaluation, and transfers the process if the difference between the best bid and the local estimate is above a certain threshold. The key to the algorithm is the formulation of a function to be used in a modified McCulloch-Pitts neuron. The neuron (implemented as a subroutine) evaluates the current performance of individual processes. Several different functions were proposed, simulated, and discussed in the paper. The adaptive nature of this algorithm lies in the fact that it dynamically modifies the number of hops that a bid request is allowed to travel, depending on current conditions. The most significant result is that the information regarding process clustering and distributed groups seems to have had little impact on the overall performance of the system.

The final example to be discussed here [53] is based on a heuristic derived from the area of Bayesian decision theory [33]. The algorithm uses no a priori knowledge regarding task characteristics and is dynamic in the sense that the probability distributions allowing maximizing decisions to be made based on the most likely current state of nature are updated dynamically. Monitor nodes make observations every f seconds and update the probabilities. Every d seconds, the scheduler itself is invoked to approximate the current state of nature and take the appropriate maximizing action. It was found that the parameters f and d could be tuned to obtain maximum performance for a minimum cost.
Discussion

In this section, we attempt to demonstrate the application of the qualitative description tool presented earlier in a role beyond that of classifying existing systems. In particular, we utilize two behavior characteristics — performance and efficiency — in conjunction with the classification mechanism of the taxonomy to identify general qualities of scheduling systems that lend themselves to managing large numbers of processors. In addition, the uniform terminology is employed to show that some notions earlier thought to be synonymous are actually distinct, and that the distinctions are valuable; in at least one case, two notions earlier thought to be different are shown to be much the same.
Decentralized versus distributed scheduling

When considering the decision-making policy of a scheduling system, there are two fundamental components — authority and responsibility. When authority is distributed to the entities of a resource management system, we call the scheduler decentralized. When responsibility for
making and carrying out policy decisions is shared among the entities in a distributed system, we say that the scheduler is distributed. This differentiation exists in many other organizational structures. Any system that possesses decentralized authority must have distributed responsibility, but it is possible to allocate the responsibility for gathering information and carrying out policy decisions without granting the authority to change past decisions or to make future ones.
Dynamic versus adaptive scheduling

The terms dynamic scheduling and adaptive scheduling are quite often attached to various proposed algorithms in the literature, but there appears to be some confusion about the actual difference between the two concepts. The more common property to find in a scheduler (or resource management subsystem) is the dynamic property: in a dynamic situation, the scheduler takes into account the current state of affairs as it perceives them in the system, during the normal operation of the system under a dynamic and unpredictable load. In an adaptive system, the scheduling policy itself reflects changes in its environment — the running system. Notice that the difference here is one of level in the hierarchical solution to the scheduling problem. Whereas a dynamic solution takes environmental inputs into account when making its decisions, an adaptive solution takes environmental stimuli into account to modify the scheduling policy itself.
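The distinction can be made mechanical: a dynamic scheduler feeds current state into a fixed policy, while an adaptive one also rewrites the policy's own parameters. A toy contrast, entirely illustrative:

    class DynamicScheduler:
        # Dynamic: decisions use the current system state; the policy is fixed.
        def place(self, task, state):
            return min(state["loads"], key=state["loads"].get)

    class AdaptiveScheduler(DynamicScheduler):
        # Adaptive: feedback from the running system also modifies the policy.
        def __init__(self, move_threshold=1.0):
            self.move_threshold = move_threshold  # a policy parameter, not an input

        def observe(self, outcome):
            if outcome.get("thrashing"):
                self.move_threshold *= 2.0  # demand a larger gap before moving work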
The resource/consumer dichotomy in performance analysis

As with describing the actions or qualitative behavior of a resource management subsystem, the performance of the scheduling mechanisms employed may be viewed from either the resource or the consumer point of view. When considering performance from the consumer (or user) point of view, the metric involved is often one of minimizing individual program completion times: response. Alternatively, the resource point of view also considers the rate of process execution in evaluating performance, but from the view of total system throughput. In contrast to response, throughput is concerned with seeing that all users are treated fairly and that all are making progress. Notice that the resource view of maximizing resource utilization is compatible with the desire for maximum system throughput. Put another way, all users, considered as a single collective user, are treated best in an environment that maximizes system throughput or resource utilization. This is the basic philosophy of load-balancing mechanisms. There is an inherent conflict, though, in trying to optimize both response and throughput.
Focusing on future directions

In this section, the taxonomy presented earlier, in conjunction with two terms used to quantitatively describe system behavior, is used to discuss possibilities for distributed scheduling in the environment of a large system of loosely coupled processors. In previous work related to the scheduling problem, the basic notion of performance has been concerned with evaluating the way in which users' individual needs are being satisfied. The metrics most commonly applied are response and throughput [23]. While these terms accurately characterize the goals of the system in terms of how well users are served, they are difficult to measure during the normal operation of a system. In addition, the metrics do not lend themselves well to direct interpretation as to what action should be performed to increase performance when it is not at an acceptable level. These metrics are also difficult to apply when analysis or simulation of such systems is attempted, because two important aspects of scheduling are necessarily intertwined: performance and efficiency. Performance is the part of a system's behavior that encompasses how well the resource to be managed is being used to the
benefit of all users of the system. Efficiency, in contrast, is concerned with the added cost (or overhead) associated with the resource management facility itself. In terms of these two criteria, we may think of desirable system behavior as that which achieves the highest possible level of performance while incurring the least overhead in doing so. Clearly, the exact combination of the two that brings about the most desirable behavior depends on many factors and in many ways resembles the space/time trade-off present in common algorithm design. The point to be made here is that simultaneous evaluation of performance and efficiency is very difficult because of this inherent entanglement. What we suggest is a methodology for designing scheduling systems in which performance and efficiency are separately observable. Current and future investigations will involve studies to better understand the relationships between performance, efficiency, and their components as they affect quantitative behavior. It is hoped that a much better understanding can be gained regarding the costs and benefits of alternative distributed-scheduling strategies.
Conclusions

This paper has sought to bring together the ideas and work in the area of resource management generated in the last 15 to 20 years. The intention has been to provide a suitable framework for comparing past work in the area of resource management, while providing a tool for classifying and discussing future work. This has been done through the presentation of common terminology and a taxonomy of the mechanisms employed in computer-system resource management. While the taxonomy could be used to discuss many different types of resource management, the attention of the paper and the included examples has been on the application of the taxonomy to the processing resource. Finally, recommendations regarding possibly fruitful areas for future research in scheduling for large-scale, general-purpose distributed computer systems have been discussed.

As is the case in any survey, there are many pieces of work to be considered. It is hoped that the examples presented fairly represent the true state of research in this area, while it is acknowledged that not all such examples have been discussed. In addition to the references at the end of this paper, an annotated bibliography is included. It lists work that — although not explicitly mentioned in the text — has aided in the construction of this taxonomy through the support of additional examples. The exclusion of any particular result was not intentional, nor should it be construed as a judgment of the merit of that work. Decisions as to which papers to use as examples were made purely on the basis of their applicability to the context of the discussion in which they appear.
References

[1] A.K. Agrawala, S.K. Tripathi, and G. Ricart, "Adaptive Routing Using a Virtual Waiting Time Technique," IEEE Trans. Software Eng., Vol. SE-8, No. 1, Jan. 1982, pp. 76-81.
[2] J.A. Bannister and K.S. Trivedi, "Task Allocation in Fault-Tolerant Distributed Systems," Acta Informatica, Vol. 20, 1983, pp. 261-281.
[3] J.F. Bartlett, "A Nonstop Kernel," Proc. Eighth ACM Symp. Operating Systems Principles, ACM Press, New York, N.Y., 1981, pp. 22-29.
[4] S.H. Bokhari, "Dual Processor Scheduling with Dynamic Reassignment," IEEE Trans. Software Eng., Vol. SE-5, No. 4, July 1979, pp. 326-334.
[5] S.H. Bokhari, "A Shortest Tree Algorithm for Optimal Assignments across Space and Time in a Distributed Processor System," IEEE Trans. Software Eng., Vol. SE-7, No. 6, Nov. 1981, pp. 335-341.
[6] R.M. Bryant and R.A. Finkel, "A Stable Distributed Scheduling Algorithm," Proc. Second Int'l Conf. Distributed Computing Systems, IEEE Computer Soc. Press, Los Alamitos, Calif., 1981, pp. 314-323.
[7] E.S. Buffa, Modern Production Management, fifth ed., Wiley, New York, N.Y., 1977.
[8] T.L. Casavant and J.G. Kuhl, "Design of a Loosely-Coupled Distributed Multiprocessing Network," Proc. 1984 Int'l Conf. Parallel Processing, IEEE Computer Soc. Press, Los Alamitos, Calif., 1984, pp. 42-45.
[9] L.M. Casey, "Decentralized Scheduling," Australian Computer J., Vol. 13, May 1981, pp. 58-63.
[10] T.C.K. Chou and J.A. Abraham, "Load Balancing in Distributed Systems," IEEE Trans. Software Eng., Vol. SE-8, No. 4, July 1982, pp. 401-412.
[11] Y.C. Chow and W.H. Kohler, "Models for Dynamic Load Balancing in a Heterogeneous Multiple Processor System," IEEE Trans. Computers, Vol. C-28, No. 5, May 1979, pp. 354-361.
[12] E.G. Coffman and P.J. Denning, Operating Systems Theory, Prentice-Hall, Inc., Englewood Cliffs, N.J., 1973.
[13] R.W. Conway, W.L. Maxwell, and L.W. Miller, Theory of Scheduling, Addison-Wesley Pub. Co., Reading, Mass., 1967.
[14] K.W. Doty, P.L. McEntire, and J.G. O'Reilly, "Task Allocation in a Distributed Computer System," Proc. INFOCOM '82, IEEE Computer Soc. Press, Los Alamitos, Calif., 1982, pp. 33-38.
[15] K. Efe, "Heuristic Models of Task Assignment Scheduling in Distributed Systems," Computer, Vol. 15, No. 6, June 1982, pp. 50-56.
[16] C.S. Ellis, J.A. Feldman, and J.E. Heliotis, "Language Constructs and Support Systems for Distributed Computing," Proc. ACM SIGACT-SIGOPS Symp. Principles Distributed Computing, ACM Press, New York, N.Y., 1982, pp. 1-9.
[17] P.H. Enslow, Jr., "What Is a 'Distributed' Data Processing System?," Computer, Vol. 11, No. 1, Jan. 1978, pp. 13-21.
[18] J.R. Evans et al., Applied Production and Operations Management, West Pub. Co., St. Paul, Minn., 1984.
[19] I. Flores, OSMVT, Allyn and Bacon, Inc., Rockleigh, N.J., 1973.
[20] M.J. Flynn, "Very High-Speed Computing Systems," Proc. IEEE, Vol. 54, Dec. 1966, pp. 1901-1909.
[21] A. Gabrielian and D.B. Tyler, "Optimal Object Allocation in Distributed Computer Systems," Proc. Fourth Int'l Conf. Distributed Computing Systems, IEEE Computer Soc. Press, Los Alamitos, Calif., 1984, pp. 84-95.
[22] M.J. Gonzalez, "Deterministic Processor Scheduling," ACM Computing Surveys, Vol. 9, No. 3, Sept. 1977, pp. 173-204.
[23] H. Hellerman and T.F. Conroy, Computer System Performance, McGraw-Hill, Inc., New York, N.Y., 1975.
[24] Y. Ho, "Team Decision Theory and Information Structures," Proc. IEEE, Vol. 68, No. 6, June 1980, pp. 644-654.
[25] E.D. Jensen, "The Honeywell Experimental Distributed Processor — An Overview," Computer, Vol. 11, No. 1, Jan. 1978, pp. 28-38.
[26] A.K. Jones et al., "StarOS, a Multiprocessor Operating System for the Support of Task Forces," Proc. Seventh ACM Symp. Operating Systems Principles, ACM Press, New York, N.Y., 1979, pp. 117-127.
[27] D. Klappholz and H.C. Park, "Parallelized Process Scheduling for a Tightly-Coupled MIMD Machine," Proc. 1984 Int'l Conf. Parallel Processing, IEEE Computer Soc. Press, Los Alamitos, Calif., 1984, pp. 315-321.
[28] L. Kleinrock, Queuing Systems, Vol. 2: Computer Applications, Wiley, New York, N.Y., 1976.
[29] L. Kleinrock and A. Nilsson, "On Optimal Scheduling Algorithms for Time-Shared Systems," J. ACM, Vol. 28, No. 3, July 1981, pp. 477-486.
[30] C.P. Kruskal and A. Weiss, "Allocating Independent Subtasks on Parallel Processors — Extended
Abstract," Proc. 1984 Int'l Conf. Parallel Processing, IEEE Computer Soc. Press, Los Alamitos, Calif., 1984, pp. 236-240.
[31] R.E. Larson, Tutorial: Distributed Control, IEEE Computer Soc. Press, Los Alamitos, Calif., 1979.
[32] G. Le Lann, Motivations, Objectives and Characterizations of Distributed Systems (Lecture Notes in Computer Science, Vol. 105), Springer-Verlag Pub., New York, N.Y., 1981, pp. 1-9.
[33] B.W. Lindgren, Elements of Decision Theory, Macmillan Pub. Co., New York, N.Y., 1971.
[34] V.M. Lo, "Heuristic Algorithms for Task Assignment in Distributed Systems," Proc. Fourth Int'l Conf. Distributed Computing Systems, IEEE Computer Soc. Press, Los Alamitos, Calif., 1984, pp. 30-39.
[35] P.Y.R. Ma, E.Y.S. Lee, and J. Tsuchiya, "A Task Allocation Model for Distributed Computing Systems," IEEE Trans. Computers, Vol. C-31, No. 1, Jan. 1982, pp. 41-47.
[36] R. Manner, "Hardware Task/Processor Scheduling in a Polyprocessor Environment," IEEE Trans. Computers, Vol. C-33, No. 7, July 1984, pp. 626-636.
[37] P.L. McEntire, J.G. O'Reilly, and R.E. Larson, Distributed Computing: Concepts and Implementations, IEEE Press, New York, N.Y., 1984.
[38] E. Mishkin and L. Braun, Jr., Adaptive Control Systems, McGraw-Hill, Inc., New York, N.Y., 1961.
[39] K. Narendra, "Learning Automata — A Survey," IEEE Trans. Systems, Man, and Cybernetics, Vol. SMC-4, No. 4, July 1974, pp. 323-334.
[40] L.M. Ni and K. Hwang, "Optimal Load Balancing Strategies for a Multiple Processor System," Proc. Int'l Conf. Parallel Processing, IEEE Computer Soc. Press, Los Alamitos, Calif., 1981, pp. 352-357.
[41] L.M. Ni and K. Abani, "Nonpreemptive Load Balancing in a Class of Local Area Networks," Proc. Computer Networking Symp., IEEE Computer Soc. Press, Los Alamitos, Calif., 1981, pp. 113-118.
[42] L.M. Ni and K. Hwang, "Optimal Load Balancing in a Multiple Processor System with Many Job Classes," IEEE Trans. Software Eng., Vol. SE-11, No. 5, May 1985, pp. 491-496.
[43] N.J. Nilsson, Principles of Artificial Intelligence, Tioga, Palo Alto, Calif., 1980.
[44] J. Ousterhout, D. Scelza, and P. Sindhu, "Medusa: An Experiment in Distributed Operating System Structure," Comm. ACM, Vol. 23, No. 2, Feb. 1980, pp. 92-105.
[45] G. Popek et al., "LOCUS: A Network Transparent, High Reliability Distributed System," Proc. Eighth ACM Symp. Operating Systems Principles, ACM Press, New York, N.Y., 1981, pp. 169-177.
[46] M.L. Powell and B.P. Miller, "Process Migration in DEMOS/MP," Proc. Ninth ACM Symp. Operating Systems Principles (OS Review), Vol. 17, No. 5, Oct. 1983, pp. 110-119.
[47] C.C. Price and S. Krishnaprasad, "Software Allocation Models for Distributed Computing Systems," Proc. Fourth Int'l Conf. Distributed Computing Systems, IEEE Computer Soc. Press, Los Alamitos, Calif., 1984, pp. 40-48.
[48] C. Shen and W. Tsai, "A Graph Matching Approach to Optimal Task Assignment in Distributed Computing Systems Using a Minimax Criterion," IEEE Trans. Computers, Vol. C-34, No. 3, Mar. 1985, pp. 197-203.
[49] R.G. Smith, "The Contract Net Protocol: High-Level Communication and Control in a Distributed Problem Solver," IEEE Trans. Computers, Vol. C-29, No. 12, Dec. 1980, pp. 1104-1113.
[50] M.H. Solomon and R.A. Finkel, "The ROSCOE Distributed Operating System," Proc. Seventh ACM Symp. Operating Systems Principles, ACM Press, New York, N.Y., 1979, pp. 108-114.
[51] J.A. Stankovic, "Simulations of Three Adaptive, Decentralized, Controlled Job Scheduling Algorithms," Computer Networks, Vol. 8, No. 3, June 1984, pp. 199-217.
[52] J.A.
Stankovic, "A Perspective on Distributed Computer Systems," IEEE Trans. Computers, Vol. C-33, No. 12, Dec. 1984, pp. 1102-1115.
[53] J.A. Stankovic, "An Application of Bayesian Decision Theory to Decentralized Control of Job Scheduling," IEEE Trans. Computers, Vol. C-34, No. 2, Feb. 1985, pp. 117-130.
[54] J.A. Stankovic et al., "An Evaluation of the Applicability of Different Mathematical Approaches to
the Analysis of Decentralized Control Algorithms," Proc. COMPSAC '82, IEEE Computer Soc. Press, Los Alamitos, Calif., 1982, pp. 62-69.
[55] J.A. Stankovic and I.S. Sidhu, "An Adaptive Bidding Algorithm for Processes, Clusters and Distributed Groups," Proc. Fourth Int'l Conf. Distributed Computing Systems, IEEE Computer Soc. Press, Los Alamitos, Calif., 1984, pp. 49-59.
[56] J.A. Stankovic et al., "A Review of Current Research and Critical Issues in Distributed System Software," IEEE Computer Soc. Distributed Processing Technical Committee Newsletter, Vol. 7, No. 1, Mar. 1985, pp. 14-47.
[57] H.S. Stone, "Critical Load Factors in Two-Processor Distributed Systems," IEEE Trans. Software Eng., Vol. SE-4, No. 3, May 1978, pp. 254-258.
[58] H.S. Stone and S.H. Bokhari, "Control of Distributed Processes," Computer, Vol. 11, No. 7, July 1978, pp. 97-106.
[59] H. Sullivan and T. Bashkow, "A Large-Scale Homogeneous, Fully Distributed Machine — II," Proc. Fourth Symp. Computer Architecture, IEEE Computer Soc. Press, Los Alamitos, Calif., 1977, pp. 118-124.
[60] A.S. Tanenbaum, Computer Networks, Prentice-Hall, Inc., Englewood Cliffs, N.J., 1981.
[61] D.P. Tsay and M.T. Liu, "MIKE: A Network Operating System for the Distributed Double-Loop Computer Network," IEEE Trans. Software Eng., Vol. SE-9, No. 2, Mar. 1983, pp. 143-154.
[62] D.C. Tsichritzis and P.A. Bernstein, Operating Systems, Academic Press, Inc., New York, N.Y., 1974.
[63] K. Vairavan and R.A. DeMillo, "On the Computational Complexity of a Generalized Scheduling Problem," IEEE Trans. Computers, Vol. C-25, No. 11, Nov. 1976, pp. 1067-1073.
[64] R.A. Wagner and K.S. Trivedi, "Hardware Configuration Selection through Discretizing a Continuous Variable Solution," Proc. Seventh IFIP Symp. Computer Performance Modeling, Measurement and Evaluation, 1980, pp. 127-142.
[65] Y.T. Wang and R.J.T. Morris, "Load Sharing in Distributed Systems," IEEE Trans. Computers, Vol. C-34, No. 3, Mar. 1985, pp. 204-217.
[66] M.O. Ward and D.J. Romero, "Assigning Parallel-Executable, Intercommunicating Subtasks to Processors," Proc. 1984 Int'l Conf. Parallel Processing, IEEE Computer Soc. Press, Los Alamitos, Calif., 1984, pp. 392-394.
[67] L.D. Wittie and A.M. Van Tilborg, "MICROS, a Distributed Operating System for MICRONET, a Reconfigurable Network Computer," IEEE Trans. Computers, Vol. C-29, No. 12, Dec. 1980, pp. 1133-1144.
Annotated bibliography

Application of taxonomy to examples from literature

References on this list contain additional examples not discussed in the section of the text entitled "Examples," as well as abbreviated descriptions of examples that are discussed there. The purpose of this list is to demonstrate the use of the taxonomy described in the section entitled "The scheduling problem — Describing its solutions" to classify large numbers of examples from the literature.

1. G.R. Andrews, D.P. Dobkin, and P.J. Downey, "Distributed Allocation with Pools of Servers," Proc. ACM SIGACT-SIGOPS Symp. Principles Distributed Computing, ACM Press, New York, N.Y., 1982, pp. 73-83. Global, dynamic, distributed (however, in a limited sense), cooperative, suboptimal, heuristic, bidding, nonadaptive, dynamic reassignment.

2. J.A. Bannister and K.S. Trivedi, "Task Allocation in Fault-Tolerant Distributed Systems," Acta Informatica, Vol. 20, 1983, pp. 261-281.
Global, static, suboptimal, approximate, mathematical-programming.

3. F. Berman and L. Snyder, “On Mapping Parallel Algorithms into Parallel Architectures,” Proc. 1984 Int’l Conf. Parallel Processing, IEEE Computer Soc. Press, Los Alamitos, Calif., 1984, pp. 307-309. Global, static, optimal, graph-theoretical.

4. S.H. Bokhari, “Dual Processor Scheduling with Dynamic Reassignment,” IEEE Trans. Software Eng., Vol. SE-5, No. 4, July 1979, pp. 326-334. Global, static, optimal, graph-theoretical.

5. S.H. Bokhari, “A Shortest Tree Algorithm for Optimal Assignments across Space and Time in a Distributed Processor System,” IEEE Trans. Software Eng., Vol. SE-7, No. 6, Nov. 1981, pp. 335-341. Global, static, optimal, mathematical-programming, intended for tree-structured applications.

6. R.M. Bryant and R.A. Finkel, “A Stable Distributed Scheduling Algorithm,” Proc. Second Int’l Conf. Distributed Computing Systems, IEEE Computer Soc. Press, Los Alamitos, Calif., 1981, pp. 314-323. Global, dynamic, physically distributed, cooperative, suboptimal, heuristic, probabilistic, load-balancing.

7. T.L. Casavant and J.G. Kuhl, “Design of a Loosely-Coupled Distributed Multiprocessing Network,” Proc. 1984 Int’l Conf. Parallel Processing, IEEE Computer Soc. Press, Los Alamitos, Calif., 1984, pp. 42-45. Global, dynamic, physically distributed, cooperative, suboptimal, heuristic, load-balancing, bidding, dynamic reassignment.

8. L.M. Casey, “Decentralized Scheduling,” Australian Computer J., Vol. 13, May 1981, pp. 58-63. Global, dynamic, physically distributed, cooperative, suboptimal, heuristic, load-balancing.

9. T.C.K. Chou and J.A. Abraham, “Load Balancing in Distributed Systems,” IEEE Trans. Software Eng., Vol. SE-8, No. 4, July 1982, pp. 401-412. Global, static, optimal, queuing-theoretical.

10. T.C.K. Chou and J.A. Abraham, “Load Redistribution under Failure in Distributed Systems,” IEEE Trans. Computers, Vol. C-32, No. 9, Sept. 1983, pp. 799-808. Global, dynamic (but with static pairings of supporting and supported processors), distributed, cooperative, suboptimal; provides three separate heuristic mechanisms, motivated from a fault-recovery aspect.

11. Y.C. Chow and W.H. Kohler, “Models for Dynamic Load Balancing in a Heterogeneous Multiple Processor System,” IEEE Trans. Computers, Vol. C-28, No. 5, May 1979, pp. 354-361. Global, dynamic, physically distributed, cooperative, suboptimal, heuristic, load-balancing (part of the heuristic approach is based on results from queuing theory).

12. W.W. Chu et al., “Task Allocation in Distributed Data Processing,” Computer, Vol. 13, No. 11, Nov. 1980, pp. 57-69. Global, static, optimal, suboptimal, heuristic; heuristic approaches based on graph theory and mathematical programming are discussed.

13. K.W. Doty, P.L. McEntire, and J.G. O’Reilly, “Task Allocation in a Distributed Computer System,” Proc. INFOCOM ’82, IEEE Computer Soc. Press, Los Alamitos, Calif., 1982, pp. 33-38. Global, static, optimal, mathematical-programming (nonlinear spatial dynamic programming).

14. K. Efe, “Heuristic Models of Task Assignment Scheduling in Distributed Systems,” Computer, Vol. 15, No. 6, June 1982, pp. 50-56. Global, static, suboptimal, heuristic, load-balancing.

15. J.A.B. Fortes and F. Parisi-Presicce, “Optimal Linear Schedules for the Parallel Execution of Algorithms,” Proc. 1984 Int’l Conf. Parallel Processing, IEEE Computer Soc. Press, Los Alamitos, Calif., 1984, pp. 322-329. Global, static, optimal, uses results from mathematical programming for a large class of data-dependency-driven applications.
16. A. Gabrielian and D.B. Tyler, “Optimal Object Allocation in Distributed Computer Systems,” Proc. Fourth Int’l Conf. Distributed Computing Systems, IEEE Computer Soc. Press, Los Alamitos, Calif., 1984, pp. 84-95. Global, static, optimal, mathematical-programming; uses a heuristic to obtain a solution close to optimal, then employs backtracking to find the optimal one from that.

17. C. Gao, J.W.S. Liu, and M. Railey, “Load Balancing Algorithms in Homogeneous Distributed Systems,” Proc. 1984 Int’l Conf. Parallel Processing, IEEE Computer Soc. Press, Los Alamitos, Calif., 1984, pp. 302-306. Global, dynamic, distributed, cooperative, suboptimal, heuristic, probabilistic.

18. W. Huen et al., “TECHNEC, a Network Computer for Distributed Task Control,” Proc. First Rocky Mountain Symp. Microcomputers, IEEE Computer Soc. Press, Los Alamitos, Calif., 1977, pp. 233-237. Global, static, suboptimal, heuristic.

19. K. Hwang et al., “A Unix-Based Local Computer Network with Load Balancing,” Computer, Vol. 15, No. 4, Apr. 1982, pp. 55-65. Global, dynamic, physically distributed, cooperative, suboptimal, heuristic, load-balancing.

20. D. Klappholz and H.C. Park, “Parallelized Process Scheduling for a Tightly-Coupled MIMD Machine,” Proc. 1984 Int’l Conf. Parallel Processing, IEEE Computer Soc. Press, Los Alamitos, Calif., 1984, pp. 315-321. Global, dynamic, physically distributed, noncooperative.

21. C.P. Kruskal and A. Weiss, “Allocating Independent Subtasks on Parallel Processors — Extended Abstract,” Proc. 1984 Int’l Conf. Parallel Processing, IEEE Computer Soc. Press, Los Alamitos, Calif., 1984, pp. 236-240. Global, static, suboptimal (but optimal for a set of optimistic assumptions), heuristic; problem stated in terms of queuing theory.

22. V.M. Lo, “Heuristic Algorithms for Task Assignment in Distributed Systems,” Proc. Fourth Int’l Conf. Distributed Computing Systems, IEEE Computer Soc. Press, Los Alamitos, Calif., 1984, pp. 30-39. Global, static, suboptimal, approximate, graph-theoretical.

23. V.M. Lo, “Task Assignment to Minimize Completion Time,” Proc. Fifth Int’l Conf. Distributed Computing Systems, IEEE Computer Soc. Press, Los Alamitos, Calif., 1985, pp. 329-336. Global, static, optimal, mathematical-programming for some special cases, but in general suboptimal, heuristic, using the LPT algorithm.

24. P.Y.R. Ma, E.Y.S. Lee, and J. Tsuchiya, “A Task Allocation Model for Distributed Computing Systems,” IEEE Trans. Computers, Vol. C-31, No. 1, Jan. 1982, pp. 41-47. Global, static, optimal, mathematical-programming (branch-and-bound).

25. S. Majumdar and M.L. Green, “A Distributed Real Time Resource Manager,” Proc. Distributed Data Acquisition, Computing, and Control Symp., IEEE Computer Soc. Press, Los Alamitos, Calif., 1980, pp. 185-193. Global, dynamic, distributed, cooperative, suboptimal, heuristic, load-balancing, nonadaptive.

26. R. Manner, “Hardware Task/Processor Scheduling in a Polyprocessor Environment,” IEEE Trans. Computers, Vol. C-33, No. 7, July 1984, pp. 626-636. Global, dynamic, distributed control and responsibility, but centralized information in hardware on bus lines; cooperative, optimal, (priority) load-balancing.

27. L.M. Ni and K. Hwang, “Optimal Load Balancing Strategies for a Multiple Processor System,” Proc. Int’l Conf. Parallel Processing, IEEE Computer Soc. Press, Los Alamitos, Calif., 1981, pp. 352-357. Global, static, optimal, mathematical-programming.

28. L.M. Ni and K. Abani, “Nonpreemptive Load Balancing in a Class of Local Area Networks,” Proc. Computer Networking Symp., IEEE Computer Soc. Press, Los Alamitos, Calif., 1981, pp. 113-118.
Global, dynamic, distributed, cooperative; optimal and suboptimal solutions given — mathematical-programming and adaptive load-balancing, respectively.

29. J. Ousterhout, D. Scelza, and P. Sindhu, “Medusa: An Experiment in Distributed Operating System Structure,” Comm. ACM, Vol. 23, No. 2, Feb. 1980, pp. 92-105. Global, dynamic, physically nondistributed.

30. M.L. Powell and B.P. Miller, “Process Migration in DEMOS/MP,” Proc. Ninth ACM Symp. Operating Systems Principles (OS Review), Vol. 17, No. 5, Oct. 1983, pp. 110-119. Global, dynamic, distributed, cooperative, suboptimal, heuristic, load-balancing, but no specific decision rule given.

31. C.C. Price and S. Krishnaprasad, “Software Allocation Models for Distributed Computing Systems,” Proc. Fourth Int’l Conf. Distributed Computing Systems, IEEE Computer Soc. Press, Los Alamitos, Calif., 1984, pp. 40-48. Global, static, optimal, mathematical-programming, but heuristics are also suggested.

32. C.V. Ramamoorthy et al., “Optimal Scheduling Strategies in a Multiprocessor System,” IEEE Trans. Computers, Vol. C-21, No. 2, Feb. 1972, pp. 137-146. Global, static; an optimal solution is presented for comparison with the heuristic one also presented; graph theory is employed in the sense that it uses task-precedence graphs.

33. K. Ramamritham and J.A. Stankovic, “Dynamic Task Scheduling in Distributed Hard Real-Time Systems,” Proc. Fourth Int’l Conf. Distributed Computing Systems, IEEE Computer Soc. Press, Los Alamitos, Calif., 1984, pp. 96-107. Global, dynamic, distributed, cooperative, suboptimal, heuristic, bidding, one-time assignments (a real-time guarantee is applied before migration).

34. J. Reif and P. Spirakis, “Real-Time Resource Allocation in a Distributed System,” Proc. ACM SIGACT-SIGOPS Symp. Principles Distributed Computing, ACM Press, New York, N.Y., 1982, pp. 84-94. Global, dynamic, distributed, noncooperative, probabilistic.

35. S. Sahni, “Scheduling Multipipeline and Multiprocessor Computers,” Proc. 1984 Int’l Conf. Parallel Processing, 1984, pp. 333-337. Global, static, suboptimal, heuristic.

36. T.G. Saponis and P.L. Crews, “A Model for Decentralized Control in a Fully Distributed Processing System,” Proc. COMPCON Fall ’80, IEEE Computer Soc. Press, Los Alamitos, Calif., 1980, pp. 307-312. Global, static, suboptimal, heuristic based on load balancing; also intended for applications of the nature of coupled recurrence systems.

37. C. Shen and W. Tsai, “A Graph Matching Approach to Optimal Task Assignment in Distributed Computing Systems Using a Minimax Criterion,” IEEE Trans. Computers, Vol. C-34, No. 3, Mar. 1985, pp. 197-203. Global, static, optimal, enumerative.

38. J.A. Stankovic, “The Analysis of a Decentralized Control Algorithm for Job Scheduling Utilizing Bayesian Decision Theory,” Proc. Int’l Conf. Parallel Processing, IEEE Computer Soc. Press, Los Alamitos, Calif., 1981, pp. 333-337. Global, dynamic, distributed, cooperative, suboptimal, heuristic, one-time assignment, probabilistic.

39. J.A. Stankovic, “A Heuristic for Cooperation among Decentralized Controllers,” Proc. INFOCOM ’83, IEEE Computer Soc. Press, Los Alamitos, Calif., 1983, pp. 331-339. Global, dynamic, distributed, cooperative, suboptimal, heuristic, one-time assignment, probabilistic.

40. J.A. Stankovic, “Simulations of Three Adaptive, Decentralized, Controlled, Job Scheduling Algorithms,” Computer Networks, Vol. 8, No. 3, June 1984, pp. 199-217. Global, dynamic, physically distributed, cooperative, suboptimal, heuristic, adaptive, load-balancing, one-time assignment; three variants of this basic approach given.
41. J.A. Stankovic, “An Application of Bayesian Decision Theory to Decentralized Control of Job Scheduling,” IEEE Trans. Computers, Vol. C-34, No. 2, Feb. 1985, pp. 117-130. Global, dynamic, physically distributed, cooperative, suboptimal, heuristic based on results from Bayesian decision theory.

42. J.A. Stankovic, “Stability and Distributed Scheduling Algorithms,” Proc. ACM Nat’l Conf., ACM Press, New York, N.Y., 1985. Here, two separate algorithms are specified. The first is global, dynamic, physically distributed, cooperative, heuristic, adaptive, dynamic reassignment based on stochastic learning automata. The second is global, dynamic, physically distributed, cooperative, heuristic, bidding, one-time assignment.

43. J.A. Stankovic and I.S. Sidhu, “An Adaptive Bidding Algorithm for Processes, Clusters and Distributed Groups,” Proc. Fourth Int’l Conf. Distributed Computing Systems, IEEE Computer Soc. Press, Los Alamitos, Calif., 1984, pp. 49-59. Global, dynamic, physically distributed, cooperative, suboptimal, heuristic, adaptive, bidding; additional heuristics regarding clusters and distributed groups.

44. H.S. Stone, “Critical Load Factors in Two-Processor Distributed Systems,” IEEE Trans. Software Eng., Vol. SE-4, No. 3, May 1978, pp. 254-258. Global, dynamic, physically distributed, cooperative, optimal (graph theory based).

45. H.S. Stone and S.H. Bokhari, “Control of Distributed Processes,” Computer, Vol. 11, No. 7, July 1978, pp. 97-106. Global, static, optimal, graph-theoretical.

46. H. Sullivan and T. Bashkow, “A Large-Scale Homogeneous, Fully Distributed Machine — I,” Proc. Fourth Symp. Computer Architecture, IEEE Computer Soc. Press, Los Alamitos, Calif., 1977, pp. 105-117. Global, dynamic, physically distributed, cooperative, suboptimal, heuristic, bidding.

47. A.M. Van Tilborg and L.D. Wittie, “Wave Scheduling — Decentralized Scheduling of Task Forces in Multicomputers,” IEEE Trans. Computers, Vol. C-33, No. 9, Sept. 1984, pp. 835-844. Global, dynamic, distributed, cooperative, suboptimal, heuristic, probabilistic, adaptive; assumes (logically) tree-structured task forces.

48. R.A. Wagner and K.S. Trivedi, “Hardware Configuration Selection through Discretizing a Continuous Variable Solution,” Proc. Seventh IFIP Symp. Computer Performance Modeling, Measurement and Evaluation, 1980, pp. 127-142. Global, static, suboptimal, approximate, mathematical-programming.

49. Y.T. Wang and R.J.T. Morris, “Load Sharing in Distributed Systems,” IEEE Trans. Computers, Vol. C-34, No. 3, Mar. 1985, pp. 204-217. Global, dynamic, physically distributed, cooperative, suboptimal, heuristic, one-time assignment, load-balancing.

50. M.O. Ward and D.J. Romero, “Assigning Parallel-Executable, Intercommunicating Subtasks to Processors,” Proc. 1984 Int’l Conf. Parallel Processing, IEEE Computer Soc. Press, Los Alamitos, Calif., 1984, pp. 392-394. Global, static, suboptimal, heuristic.

51. L.D. Wittie and A.M. Van Tilborg, “MICROS, a Distributed Operating System for MICRONET, a Reconfigurable Network Computer,” IEEE Trans. Computers, Vol. C-29, No. 12, Dec. 1980, pp. 1133-1144. Global, dynamic, physically distributed, cooperative, suboptimal, heuristic, load-balancing (also with respect to message traffic).
Deadlock Detection in Distributed Systems
A distributed system is a network of sites that exchange information with each other by message passing. A site consists of computing and storage facilities and an interface to local users and to a communication network. A primary motivation for using distributed systems is the possibility of resource sharing — a process can request and release resources (local or remote) in an order not known a priori, and a process can request some resources while holding others. In such an environment, if the sequence of resource allocation to processes is not controlled, a deadlock may occur. A deadlock occurs when processes holding some resources request access to resources held by other processes in the same set. The simplest illustration of a deadlock consists of two processes, each holding a different resource in exclusive mode and each requesting access to the resource held by the other. Unless the deadlock is resolved, all the processes involved are blocked indefinitely. Therefore, a deadlock requires, for its detection and resolution, the attention of a process outside those involved in it. A deadlock is resolved by aborting one or more processes involved in the deadlock and granting the released resources to other processes involved in it. A process is aborted by withdrawing all its resource requests, restoring its state to an appropriate previous state, relinquishing all the resources it acquired after that state, and restoring all the relinquished resources to their original states. In the simplest case, a process is aborted by starting it afresh and relinquishing all the resources it held.
Mukesh Singhal
Resource versus communication deadlock

Two types of deadlock have been discussed in the literature: resource deadlock and communication deadlock. In resource deadlocks, processes request access to resources (for example, data objects in database systems, or buffers in store-and-forward communication networks). A process acquires a resource before accessing it and relinquishes the resource after using it. A process that requires resources for execution cannot proceed until it has acquired all those resources.
A set of processes is resource-deadlocked if each process in the set requests a resource held by another process in the set. In communication deadlocks, messages are the resources for which processes wait.1 Reception of a message takes a process out of wait — that is, unblocks it. A set of processes is communication-deadlocked if each process in the set is waiting for a message from another process in the set and no process in the set ever sends a message. In this paper we limit our discussion to resource deadlocks in distributed systems.

To present the state of the art of deadlock detection in distributed systems, this paper describes a series of deadlock detection techniques based on centralized, hierarchical, and distributed control organizations. The paper complements one by Knapp, which discusses deadlock detection in distributed database systems.2 Knapp emphasizes the underlying theoretical principles of deadlock detection and gives an example of each principle. In contrast, this paper examines deadlock detection in distributed systems more from the point of view of its practical implications. It presents an up-to-date and comprehensive survey of deadlock detection algorithms, discusses their merits and drawbacks, and compares their performance (delays as well as message complexity). Moreover, this paper examines related issues, such as correctness of the algorithms, performance of the algorithms, and deadlock resolution, which require further research.

Figure 1. Resource allocation graph.
Graph-theoretic model of deadlocks

The state of a system is in general dynamic; that is, system processes continuously acquire and release resources. Characterization of deadlocks requires a representation of the state of process-resource interactions. The state of process-resource interactions is modeled by a bipartite directed graph called a resource allocation graph. Nodes of this graph are the processes and resources of a system, and edges of the graph depict assignments or pending requests. A pending request is represented by a request edge directed from the node of a requesting process to the node of the requested resource. A resource assignment is represented by an assignment edge directed from the node of an assigned resource to the node of the process to which it is assigned. For example, Figure 1 shows the resource allocation graph for two processes P1 and P2 and two resources R1 and R2, where edges R1 → P1 and R2 → P2 are assignment edges and edges P1 → R2 and P2 → R1 are request edges. A system is deadlocked if its resource allocation graph contains a directed cycle in which each request edge is followed by an assignment edge. Since the resource allocation graph of Figure 1 contains a directed cycle, processes P1 and P2 are deadlocked. A deadlock can be detected by constructing the resource allocation graph and searching it for cycles.

In a distributed database system (DDBS), the user accesses the data objects of the database by executing transactions. A transaction can be viewed as a process that performs a sequence of reads and writes on data objects. The data objects of a database can be viewed as resources that are acquired (by locking) and released (by unlocking) by transactions. In DDBS literature the resource allocation graph is referred to as a transaction-wait-for (TWF) graph.3 In a TWF graph, nodes are transactions, and there is a directed edge from node T1 to node T2 if T1 is blocked and must wait for T2 to release some data object. A system is deadlocked if and only if there is a directed cycle in its TWF graph. Since both graphs denote the state of process-resource interaction, we will collectively refer to them as state graphs.
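To make the cycle test concrete, the following minimal sketch (in Python; it is illustrative only and not an algorithm from the paper, and the graph encoding and the name find_cycle are assumptions) holds a TWF graph as an adjacency map and searches it depth-first for a directed cycle:

# Sketch: deadlock detection as cycle detection in a TWF graph.
# wait_for maps each blocked transaction to the transactions it waits for.

def find_cycle(wait_for):
    """Return a list of nodes forming a directed cycle, or None if acyclic."""
    on_path, done = set(), set()

    def dfs(node, path):
        on_path.add(node)
        path.append(node)
        for succ in wait_for.get(node, ()):
            if succ in on_path:                 # back edge: a cycle exists
                return path[path.index(succ):]
            if succ not in done:
                cycle = dfs(succ, path)
                if cycle:
                    return cycle
        on_path.discard(node)
        done.add(node)
        path.pop()
        return None

    for node in list(wait_for):
        if node not in done:
            cycle = dfs(node, [])
            if cycle:
                return cycle
    return None

# The deadlock of Figure 1 in TWF form: P1 waits for P2, and P2 for P1.
print(find_cycle({"P1": ["P2"], "P2": ["P1"]}))   # ['P1', 'P2']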
Deadlock-handling strategies

The three strategies for handling deadlocks are deadlock prevention, deadlock avoidance, and deadlock detection. In deadlock prevention, resources are granted to requesting processes in such a way that a request for a resource never leads to a deadlock. The simplest way to prevent a deadlock is to acquire all the needed resources before a process starts executing. In another method of deadlock prevention, a blocked process releases the resources requested by an active process. In the deadlock avoidance strategy, a resource is granted to a process only if the resulting state is safe. (A state is safe if there is at least one execution sequence that allows all processes to run to completion.) Finally, in the deadlock detection strategy, resources are granted to a process without any check. Periodically (or whenever a request for a resource has to wait), the status of resource allocation and pending requests is examined to determine whether a set of processes is deadlocked. This examination is performed by a deadlock detection algorithm. If a deadlock is discovered, the system recovers from it by aborting one or more deadlocked processes.

The suitability of a deadlock-handling strategy depends greatly on the application. Both deadlock prevention and deadlock avoidance are conservative, overly cautious strategies. They are preferred if deadlocks are frequent or if the occurrence of a deadlock is highly undesirable. In contrast, deadlock detection is a lazy, optimistic strategy, which grants a resource to a request if the resource is available, hoping that this will not lead to a deadlock. Deadlock handling is complicated in distributed systems because no site has accurate knowledge of the current state of the system and because every intersite communication involves a finite and unpredictable delay. Next, we examine the complexity and practicality of the three deadlock-handling approaches in distributed systems.
Deadlock prevention

Deadlock prevention is commonly achieved either by having a process acquire all the needed resources simultaneously before it begins executing or by preempting a process that holds the needed resource. In the former method, a process requests (or releases) a remote resource by sending a request message (or release message) to the site where the resource is located. This method has the following drawbacks:

(1) It is inefficient because it decreases system concurrency.

(2) A set of processes may get deadlocked in the resource-acquiring phase. For example, suppose process P1 at site S1 and process P2 at site S2 simultaneously request two resources R3 and R4, located at sites S3 and S4, respectively. It may happen that S3 grants R3 to P1 and S4 grants R4 to P2, resulting in a deadlock. This problem can be handled by forcing processes to acquire needed resources one by one, but that approach is highly inefficient and impractical.

(3) In many systems, future resource requirements are unpredictable (not known a priori).

In the latter method, an active process forces a blocked process, which holds the needed resource, to abort. This method is inefficient because several processes may be aborted without any deadlock.
Deadlock avoidance

For deadlock avoidance in distributed systems, a resource is granted to a process if the resulting global system state is safe (the global state includes all the processes and resources of the distributed system). The following problems make deadlock avoidance impractical in distributed systems:

(1) Because every site has to keep track of the global state of the system, huge storage capacity and extensive communication ability are necessary.
(2) The process of checking for a safe global state must be mutually exclusive. Otherwise, if several sites concurrently perform checks for a safe global state (each site for a different resource request), they may all find the state safe even though the resulting global state is not. This restriction severely limits the concurrency and throughput of the system.

(3) Due to the large numbers of processes and resources, checking for safe states is computationally expensive.
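To show what such a safety check involves, here is a minimal sketch of the classical banker's-style safety test in its single-site formulation (illustrative only; the paper does not prescribe this algorithm, and the function name and data are made up):

# Sketch of the "safe state" test that deadlock avoidance relies on:
# a state is safe if some ordering lets every process run to completion.

def is_safe(available, allocation, maximum):
    """available: free units per resource type.
       allocation[p]: units currently held by process p.
       maximum[p]: p's declared maximum demand."""
    need = [[m - a for m, a in zip(maximum[p], allocation[p])]
            for p in range(len(allocation))]
    work = list(available)
    finished = [False] * len(allocation)

    progress = True
    while progress:
        progress = False
        for p in range(len(allocation)):
            if not finished[p] and all(n <= w for n, w in zip(need[p], work)):
                # p can run to completion and return everything it holds
                work = [w + a for w, a in zip(work, allocation[p])]
                finished[p] = True
                progress = True
    return all(finished)

# One resource type with 2 of 10 units free, split between two processes.
print(is_safe([2], [[5], [3]], [[7], [4]]))   # True: P1 can finish, then P2

Even in this single-site form, the test examines every process against every resource type on each request; the distributed version must do so against a mutually exclusive view of the global state, which is the source of the costs listed above.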
Deadlock detection

Deadlock detection requires examination of process-resource interactions for the presence of cyclic wait. In distributed systems deadlock detection has two advantages:

• Once a cycle is formed in the state graph, it persists until it is detected and broken.

• Cycle detection can proceed concurrently with the normal activities of a system; therefore, it does not have a serious effect on system throughput.

For these reasons, the literature on deadlock handling in distributed systems is highly biased toward deadlock detection.
Issues in deadlock detection Deadlock detection involves two basic tasks: maintenance of the state graph and search of the state graph for the presence of cycles. Because in distributed systems a cycle may involve several sites, the search for cycles greatly depends on how the system state graph is represented across the system. Classified according to the way state graph information is maintained and the search for cycles is carried out, the three types of algorithms for deadlock detection in distributed systems are centralized, distributed, and hierarchical algorithms. In centralized algorithms the state graph is maintained at a single designated site, which has the sole responsibility of updating it and searching it for cycles. In distributed algorithms the state graph is distributed over many sites of the system, and a cycle may span state graphs located at several sites, making distributed processing necessary to detect it. In centralized algorithms the global state of the system is known and deadlock detection is simple. In distributed algorithms the problem of deadlock detection is more complex because no site may have accurate knowledge of the system state.4 In hierarchical algorithms sites are arranged in a hierarchy, and a site detects deadlocks involving only its descendant sites. Hierarchical algorithms exploit access patterns local to a cluster of sites to efficiently detect deadlocks.
Correctness of deadlock detection algorithms

To be correct, a deadlock detection algorithm must satisfy two criteria:

• No undetected deadlocks: the algorithm must detect all existing deadlocks in finite time.

• No false deadlocks: the algorithm should not report nonexistent deadlocks.

In distributed systems, where there is no global memory and communication occurs solely by messages, it is difficult to design a correct deadlock detection algorithm because sites may receive out-of-date and inconsistent state graphs of the system. As a result, sites may detect a cycle that never existed but whose different segments existed in the system at different times. That is why many deadlock detection algorithms reported in the literature are incorrect.
Strengths and weaknesses of centralized algorithms

In centralized deadlock detection algorithms, a designated site, often called the control site, has the responsibility of constructing the global state graph and searching it for cycles. The control site may maintain the global state graph all the time, or it may build it whenever deadlock detection is to be carried out, by soliciting the local state graph from every site. Centralized algorithms are conceptually simple and easy to implement. Deadlock resolution is simple in these algorithms — the control site has complete information about the deadlock cycle, and it can optimally resolve the deadlock. However, because control is centralized at a single site, centralized deadlock detection algorithms have a single point of failure. Communication links near the control site are likely to be congested because the control site receives state graph information from all the other sites. Also, the message traffic generated by deadlock detection activity is independent of the rate of deadlock formation and the structure of deadlock cycles.
Strengths and weaknesses of distributed algorithms In distributed deadlock detection algorithms, the responsibility of detecting a global deadlock is shared equally among the sites. The global state graph is spread over many sites, and several sites participate in the detection of a global cycle. Unlike centralized algorithms, distributed algorithms are not vulnerable to a single point of failure, and no site is swamped with deadlock detection activity. Deadlock detection is initiated only if a waiting process is suspected to be part of a deadlock cycle. But deadlock resolution is often cumbersome in distributed deadlock detection algorithms because several sites may detect the same deadlock and may not be aware of other sites and/or processes involved in the deadlock. Distributed algorithms are difficult to design because sites may collectively report the existence of a global cycle after seeing its segments at different instants (though all the segments never existed simultaneously) due to the system’s lack of globally shared memory. Also, proof of correctness is difficult for these algorithms.
Strengths and weaknesses of hierarchical algorithms

In hierarchical deadlock detection algorithms, sites are arranged hierarchically, and a site detects deadlocks involving only its descendant sites. To detect deadlocks efficiently, hierarchical algorithms exploit access patterns local to a cluster of sites. They tend to get the best of both worlds: they have no single point of failure (as centralized algorithms have), and a site is not bogged down by deadlock detection activities that do not concern it (as sometimes happens in distributed algorithms). For efficiency, most deadlocks should be localized to as few clusters as possible; the objective of hierarchical algorithms is defeated if most deadlocks span several clusters.

Next, we describe a series of centralized, distributed, and hierarchical deadlock detection algorithms. We discuss the basic idea behind their operation, compare them with each other, and discuss their pros and cons. We also summarize the performance of these algorithms in terms of message traffic, message size, and delay in detecting a deadlock (see Table 1). It is not possible to enumerate these performance measures with high accuracy for many deadlock detection algorithms, for the following reasons: the random nature of the TWF graph topology, the invocation of deadlock detection activities even though there is no deadlock, and the initiation of deadlock detection by several processes in a deadlock cycle. Therefore, for most algorithms we have given performance bounds rather than exact numbers (for example, the maximum number of messages transferred to detect a global cycle).
Centralized deadlock detection algorithms In the simplest centralized deadlock detection algorithm, a designated site called the control site maintains the state graph of the entire system and checks it for the existence of deadlock cycles.
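A minimal sketch of this scheme follows (illustrative only; it assumes the find_cycle function from the earlier sketch is in scope, and it models intersite messages as direct method calls):

# Sketch of the simplest centralized scheme: each site reports its local
# wait-for edges; the control site merges them and checks for a cycle.

class ControlSite:
    def __init__(self):
        self.reports = {}                      # site id -> local edge list

    def report(self, site_id, local_edges):
        """Conceptually invoked by a message from each site."""
        self.reports[site_id] = local_edges

    def detect(self):
        merged = {}
        for edges in self.reports.values():
            for waiter, holder in edges:
                merged.setdefault(waiter, []).append(holder)
        return find_cycle(merged)              # reuse the earlier cycle test

control = ControlSite()
control.report("S1", [("P1", "P2")])           # at S1, P1 waits for P2
control.report("S2", [("P2", "P1")])           # at S2, P2 waits for P1
print(control.detect())                        # ['P1', 'P2'] -- a global deadlock

Note how the cycle spans two sites: neither local graph alone contains it, which is exactly why the control site must assemble a global view.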
Table 1. Performance comparison of distributed deadlock detection algorithms (number of messages, delay, and message size) for the algorithms of Goldman, Isloor-Marsland, Menasce-Muntz, Obermarck, Chandy et al., Haas-Mohan, and Sugihara et al.
Logical Time in Distributed Computing Systems

e → f ≡ (e_i ≤ f_i) ∧ (e_j < f_j)

where i is the process in which event e occurred, j is the process in which event f occurred, and e_p denotes the counter value recorded for process p in e's time stamp.
We need to compare only two pairs from the sets to establish whether there is a causal relationship between the two events. The first comparison is true if and only if e can causally affect f. The second comparison precludes reflexivity, because it would be nonsensical to say that an event happened before itself.1 (Because synchronously communicating processes may independently time-stamp their part of a shared event, as do F and I in Figure 1, it may not always be easy to directly test e ≠ f.) Elsewhere I have formally shown that these rules are sufficient to implement the “happened before” relation.2 The rules are robust enough for asynchronous message “overtaking” (that is, non-FIFO queuing of messages destined for a particular process instance). The rules also preserve the equivalence between asynchronous message-passing and synchronous message-passing with an intervening buffer process, and between synchronous message-passing and a single event shared among the communicating processes.

As Figure 2 shows, substituting the appropriate counter values from the time stamps into the formula from the comparison property establishes the relationships defined by the computation in Figure 1. The final category in Figure 2 shows the principal advantage of partially ordered logical clocks over previous time-stamping methods.1 Where no causal relationship exists between events, no arbitrary ordering is imposed on them. Thus we can tell, for instance, that events K and J could occur in either order, or at exactly the same time. Even though Figure 1 suggests that K occurred before J in real time (taking a line drawn horizontally through the diagram to represent an instant in global real time), the logical behavior of the computation does not enforce this temporal ordering.
Tests for e → f where

• e and f are the same event: (2 ≤ 2) ∧ (2 ≮ 2) ⇒ ¬(C → C)

• e and f are different events in the same process: (1 ≤ 3) ∧ (1 < 3) ⇒ B → D; (2 ≰ 1) ∧ (2 ≮ 1) ⇒ ¬(C → B)

• e and f are in different processes with an intervening communication action: (2 ≤ 4) ∧ (0 < 3) ⇒ E → D; (2 ≰ 0) ∧ (3 ≮ 2) ⇒ ¬(J → E)

• e is in a subprocess of the process containing f, or vice versa: (1 ≤ 1) ∧ (0 < 2) ⇒ H → J

• e and f are different parts of a synchronous communication event: (3 ≤ 3) ∧ (1 ≮ 1) ⇒ ¬(F → I); (1 ≤ 1) ∧ (3 ≮ 3) ⇒ ¬(I → F)

• e and f are “potentially concurrent”; that is, there is no causal relationship between them: (2 ≰ 0) ∧ (1 < 2) ⇒ ¬(C → J); (2 ≰ 1) ∧ (0 < 2) ⇒ ¬(J → C); (1 ≰ 0) ∧ (0 < 2) ⇒ ¬(K → J); (2 ≰ 0) ∧ (0 < 1) ⇒ ¬(J → K)

Figure 2. Some “happened before” relationships defined between two arbitrary events e and f by the time stamps in Figure 1.
A slightly different interleaving of the same computation may result in J occurring before K. (Think of the dots in Figure 1 as beads free to slide up and down the time lines, as long as they do not violate causality by, for example, causing communication arrows to point backward.)
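The clock rules and the comparison property can be sketched in Python as follows (a sketch, not the paper's implementation: each time stamp's set of pairs is modeled as a dictionary from process identifier to counter, and dynamic process creation is omitted):

# Sketch of partially ordered logical clocks: each process keeps a set of
# (process id, counter) pairs, modeled here as a dict {pid: counter}.

class LogicalClock:
    def __init__(self, pid):
        self.pid = pid
        self.stamp = {pid: 0}             # knows only about itself at first

    def tick(self):
        """Performed at each event; returns the event's time stamp."""
        self.stamp[self.pid] += 1
        return dict(self.stamp)

    def send(self):
        """Tick, then piggyback the full stamp on the outgoing message."""
        return self.tick()

    def receive(self, piggybacked):
        """Merge pairwise maxima of the two stamps, then tick."""
        for pid, count in piggybacked.items():
            self.stamp[pid] = max(self.stamp.get(pid, 0), count)
        return self.tick()

def happened_before(e, e_pid, f, f_pid):
    """Two-pair test from the comparison property: the first comparison
    checks that e's self-knowledge has reached f; the second precludes
    reflexivity. Missing pairs count as zero."""
    return e[e_pid] <= f.get(e_pid, 0) and e.get(f_pid, 0) < f[f_pid]

p, q = LogicalClock("P"), LogicalClock("Q")
a = p.tick()                  # internal event a at P
m = p.send()                  # P sends a message...
b = q.receive(m)              # ...which Q receives at event b
print(happened_before(a, "P", b, "Q"))    # True: a causally precedes b
print(happened_before(b, "Q", a, "P"))    # False
print(happened_before(a, "P", a, "P"))    # False: no reflexivity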
Optimizations In their full generality, as described in the previous section, partially ordered logical clocks may be impractically expensive for long-lived computations. For instance, the rules placed no upper bound on the size of the set of pairs, and their number was limited only by the number of process instances created at runtime. Nevertheless, several optimizations are possible, depending on the application environment in which the clocks will be used.
Static number of processes

Where there is no process-nesting and the system knows at compile time the number of processes to be created, the “set of pairs” data structure is unnecessary. Instead, each process can use an array of counters, with one element reserved per process.3 This frequently used optimization is known as “vector time”3 because each clock reading is a vector (array) of counter values.
It has the obvious advantage of placing an upper bound on the storage requirements for the auxiliary clock variables. In a formal proof, Charron-Bost demonstrated that a vector of length n is minimal for n static processes.4

The vector-time optimization can be applied to languages with nested concurrency if they do not allow recursive process definitions.2 This restriction means that only one instance of each static process definition can execute at any given time. The system can determine from the source code the maximum number of runtime process instances simultaneously executing and reserve only one vector element for each. Every time a particular process definition is instantiated, it uses the same element in the logical clock vector, because no other copies of itself are currently running.
Comparisons known a priori So far we have assumed that the computation maintains counters for every process instance. However, if we know in advance the processes that contain the events we wish to study, then we need to keep counter values only for those processes.5 Nevertheless, other “uninteresting” processes must still maintain an auxiliary clock variable and transfer information at communication events. Otherwise, the partial ordering may fail to correctly reflect transitive interprocess dependences. All processes and all synchronizing actions in a computation must actively participate.1
Only new values piggybacked When the number of processes in the vector-time model is large, the transmission of the clock arrays during message-passing represents a significant overhead. In such cases each process can maintain, via two further auxiliary arrays, the value of the “local” counter when a vector was last sent to each other process, and when each counter for other processes was last updated. Using this information, the process can piggyback on an outgoing message only those counter values that have been modified since the last communication with the target process, assuming message overtaking is precluded.6
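A sketch of this optimization follows (illustrative only; it tracks what was last sent to each target and assumes, as the text requires, that messages between a pair of processes are not overtaken):

# Sketch of the piggybacking optimization for vector time: send only the
# vector entries that have changed since the last message to the same
# target. Assumes FIFO (non-overtaking) channels between process pairs.

class CompactVectorClock:
    def __init__(self, pid, n):
        self.pid, self.n = pid, n
        self.vec = [0] * n
        self.last_sent = {}               # target pid -> vector sent then

    def send(self, target):
        self.vec[self.pid] += 1                        # the send is an event
        prev = self.last_sent.get(target, [0] * self.n)
        delta = {i: v for i, v in enumerate(self.vec) if v != prev[i]}
        self.last_sent[target] = list(self.vec)
        return delta                       # piggyback only changed entries

    def receive(self, delta):
        for i, v in delta.items():
            self.vec[i] = max(self.vec[i], v)
        self.vec[self.pid] += 1                        # the receive, too

p = CompactVectorClock(0, 3)
print(p.send(1))   # {0: 1} -- only the changed entry travels
print(p.send(1))   # {0: 2}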
Implementations and applications Partially ordered logical clocks have been used in a number of practical and theoretical areas.
Languages

Inmos’s Occam has only one interprocess communication mechanism, synchronous message-passing, and nested concurrency without recursion. Thus, programmers can easily add partially ordered clocks. I have experimentally introduced “logical timers” in a way consistent with Occam’s existing real-time timer.2

So far we have assumed that message-passing is the only interprocess communication medium. Languages that allow interprocess synchronization in other ways — for example, through access to shared memory, monitors, and semaphores — must also incorporate rules for such synchronization. Bryan has defined partially ordered time for Ada.7 In Ada, there are several unusual ways that tasks (processes) define causal relationships between events. The Ada rendezvous causes two tasks to synchronize while the “accept” code is executed, after which independent execution continues. One task can unconditionally “abort” another. Unhandled exceptions may propagate to another task. When shared variables are used, the Ada standard guarantees synchronization only at certain points in the computation. Bryan has formally defined the causal relationships for all of these activities.
Debugging distributed systems
Programmers trying to debug distributed computing systems are faced with a frustrating inability to see what is happening in the network of processes.2 To detect the occurrence of events in geographically distant processes, an observer must receive a “notification” message. Because of unpredictable propagation delays, the arrival times of these notifications may bear no resemblance to the order in which the events originally occurred. Time-stamping the notifications at their source with the current real time is also unhelpful, even when the local real-time clocks are closely synchronized, because the real-time ordering of events may be affected by CPU loads unrelated to the computation of interest. The perceived ordering of events based on these time stamps may be merely an artifact of relative processor speeds, with no significance for the computation itself, and may be different each time the same computation is performed.

In the past, a popular approach to this problem was to time-stamp events using so-called Lamport clocks (totally ordered logical clocks).1 Unfortunately, these time stamps impose on unrelated concurrent events an arbitrary ordering that the observer cannot distinguish from genuine causal relationships. Time-stamping the event notifications with partially ordered time readings resolves all the debugging problems. The observer receives an accurate representation of the event orderings, can see all causal relationships, and can derive all possible totally ordered interleavings. Most importantly, the technique greatly reduces the number of tests required: it is never necessary to perform the same computation more than once to see whether different event orderings (interleavings) are possible.2

Partially ordered logical clocks have been used experimentally for the detection of global conditions in a homogeneous network of processors,8 for two prototype implementations of a monitor for Ada programs,5 and for a prototype temporal assertion checker for Occam.2
Definition of global states

The “happened before” relation provides for straightforward definitions of normally subtle concepts. For instance, a “cut” of a distributed computation partitions the events performed into two sets: “past” and “future.” In a consistent cut, the set of past events C does not violate causality; for example, it does not contain the reception of a message without its transmission. This concept has a very simple definition in terms of partially ordered time.3 Assume that E represents the set of all events performed during a computation using only asynchronous message-passing. Then a consistent cut C is a finite subset C ⊆ E such that

∀e: E; c: C • (e → c) ⇒ (e ∈ C)

In other words, if any event e “happened before” an event c in the cut set C, then e must also be in the cut set. This is an important concept in the theory of distributed error recovery and rollback. Consider a distributed system consisting of a static number of nonnested processes, each of which periodically stores a “snapshot” of its local state (including the contents of message queues). If a set of snapshot events S ⊆ C, one from each process, forms the leading edge of a consistent cut C — that is,

∀s: S; c: C • ¬(s → c)

then these local states form a valid global state from which an erroneous computation may be restarted. Partially ordered time has also been used in the analysis of other global state problems, for instance, characterization of distributed deadlocks.9
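With vector time stamps, the consistent-cut condition can be checked directly. The following sketch is illustrative only (events are modeled as pairs of a process index and a vector time stamp, and the happened-before test is the two-comparison rule used earlier):

# Sketch: a cut is consistent iff no event outside it happened before an
# event inside it, checked here with vector time stamps.

def happened_before(e, f):
    (ep, ev), (fp, fv) = e, f             # (process index, vector stamp)
    return ev[ep] <= fv[ep] and ev[fp] < fv[fp]

def is_consistent(cut, all_events):
    return not any(
        happened_before(e, c)
        for e in all_events if e not in cut
        for c in cut
    )

# Two processes: P0 sends at event a; P1 receives at event b.
a = (0, [1, 0])            # the send at P0
b = (1, [1, 1])            # the receive at P1 (merged P0's stamp, ticked)
events = [a, b]
print(is_consistent([b], events))      # False: b's cause a is outside the cut
print(is_consistent([a], events))      # True
print(is_consistent([a, b], events))   # True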
Concurrency measures

A concurrency measure is a software metric that objectively assesses how concurrent a computation is.
It measures the structure of the computation graph rather than elapsed execution time. Partially ordered logical clocks have proved important in the definition and proposed implementations of such measures. One of the simplest such measures, known as ω, counts the number of concurrent pairs of events that occurred during the computation and divides this by the total number of pairs of events between processes.10 For a given computation C, consisting of two or more nonnested processes, the measure is defined as

ω(C) = |{(e, f) : e ∥ f}| / |{(e, f) : e and f occur in different processes}|

where e ∥ f means that neither e → f nor f → e.
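The measure can be computed directly from the time stamps, as in the following sketch (under the same modeling assumptions as the earlier examples; here "concurrent" means that neither event happened before the other):

# Sketch: concurrency measure "omega" = concurrent inter-process pairs
# divided by all inter-process pairs, using the vector-time test above.

from itertools import combinations

def omega(events):
    """events: list of (process index, vector time stamp) pairs."""
    def hb(e, f):
        (ep, ev), (fp, fv) = e, f
        return ev[ep] <= fv[ep] and ev[fp] < fv[fp]

    inter = [(e, f) for e, f in combinations(events, 2) if e[0] != f[0]]
    concurrent = [p for p in inter if not hb(*p) and not hb(*reversed(p))]
    return len(concurrent) / len(inter) if inter else 0.0

# Two fully independent processes: every inter-process pair is concurrent.
events = [(0, [1, 0]), (0, [2, 0]), (1, [0, 1]), (1, [0, 2])]
print(omega(events))   # 1.0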