Distributed Computing - Introduction, RPC, Time, State, Consensus, Replication, Fault-Tolerance, PAXOS, Transactions, Consistency, Peer-to-Peer, Analytics, Datacenter Computing, Machine Learning, Blockchain, IoT, Edge Computing



English Pages [780]


Table of contents :
I
INTRODUCTION
BACKGROUND AND DEFINITIONS
System Model
Consistent System States
Interactions with the Outside World
In-Transit Messages
Logging Protocols
Stable Storage
Garbage Collection
CHECKPOINT-BASED ROLLBACK RECOVERY
Uncoordinated Checkpointing
Overview
Dependency Graphs and Recovery Line Calculation
The Domino Effect
Coordinated Checkpointing
Overview
Non-blocking Checkpoint Coordination
Checkpointing with Synchronized Clocks
Checkpointing and Communication Reliability
Minimal Checkpoint Coordination
Communication-induced Checkpointing
Overview
Model-based Protocols
Index-based Protocols
LOG-BASED ROLLBACK RECOVERY
The No-Orphans Consistency Condition
Pessimistic Logging
Overview
Techniques for Reducing Performance Overhead
Relaxing Logging Atomicity
Optimistic Logging
Overview
Synchronous vs. Asynchronous Recovery
Causal Logging
Overview
Tracking Causality
Comparison
IMPLEMENTATION ISSUES
Overview
Checkpointing Implementation
Concurrent Checkpointing
Incremental Checkpointing
System-level versus User-level Implementations
Compiler Support
Checkpoint Placement
Checkpointing Protocols in Comparison
Communication Protocols
Location-Independent Identities and Redirection
Reliable Channel Protocols
Log-based Recovery
Message Logging Overhead
Combining Log-Based Recovery with Coordinated Checkpointing
Stable Storage
Support for Nondeterminism
System Calls
Asynchronous Signals
Dependency Tracking
Recovery
Reinstating a Process in its Environment
Behavior During Recovery
Checkpointing and Mobility
Rollback Recovery in Practice
CONCLUDING REMARKS
Introduction
Implementation
Spanserver Software Stack
Directories and Placement
Data Model
TrueTime
Concurrency Control
Timestamp Management
Paxos Leader Leases
Assigning Timestamps to RW Transactions
Serving Reads at a Timestamp
Assigning Timestamps to RO Transactions
Details
Read-Write Transactions
Read-Only Transactions
Schema-Change Transactions
Refinements
Evaluation
Microbenchmarks
Availability
TrueTime
F1
Related Work
Future Work
Conclusions
Paxos Leader-Lease Management
Abstract
1 Introduction
2 Making Writes Efficient
2.1 Aurora System Architecture
2.2 Writes in Aurora
2.3 Storage Consistency Points and Commits
2.4 Crash Recovery in Aurora
3 Making Reads Efficient
3.1 Avoiding quorum reads
3.2 Scaling Reads Using Read Replicas
3.3 Structural Consistency in Aurora Replicas
3.4 Snapshot Isolation and Read View Anchors in Aurora Replicas
4 Failures and Quorum Membership
4.1 Using Quorum Sets to Change Membership
4.2 Using Quorum Sets to Reduce Costs
5 Related Work
6 Conclusions
Acknowledgments
References
Introduction
ALPS Systems and Trade-offs
Causal+ Consistency
Definition
Causal+ vs. Other Consistency Models
Causal+ in COPS
Scalable Causality
System Design of COPS
Overview of COPS
The COPS Key-Value Store
Client Library and Interface
Writing Values in COPS and COPS-GT
Reading Values in COPS
Get Transactions in COPS-GT
Garbage, Faults, and Conflicts
Garbage Collection Subsystem
Fault Tolerance
Conflict Detection
Evaluation
Implementation and Experimental Setup
Microbenchmarks
Dynamic Workloads
Scalability
Related Work
Conclusion
Formal Definition of Causal+
Abstract
1 Introduction
2 Background
2.1 Non-volatile main memory
2.2 High-performance networking
2.3 Goals of this paper
3 Evaluation setup
4 Low-latency writes
4.1 Persistent RDMA background
4.2 Durability guarantee of RDMA
4.3 Measurements
4.4 Newer NICs
4.5 Future RDMA extensions
4.6 Low-latency state machine replication
5 High-bandwidth bulk writes
5.1 Discussion on disabling DDIO
5.2 Improving RDMA bandwidth
5.3 DMA engine background
5.4 IOAT DMA microbenchmarks
5.5 Optimizing RPCs with DMA
6 Persistent log
6.1 Diagnosis: Cache line invalidation
6.2 Rotating counter
6.3 Extension to rotating registers
6.4 End-to-end performance
7 Related work
8 Conclusion
References
Introduction
Disaggregate Hardware Resource
Limitations of Monolithic Servers
Hardware Resource Disaggregation
OSes for Resource Disaggregation
The Splitkernel OS Architecture
LegoOS Design
Abstraction and Usage Model
Hardware Architecture
Process Management
Process Management and Scheduling
ExCache Management
Supporting Linux Syscall Interface
Memory Management
Memory Space Management
Optimization on Memory Accesses
Storage Management
Global Resource Management
Reliability and Failure Handling
LegoOS Implementation
Hardware Emulation
Network Stack
Processor Monitor
Memory Monitor
Storage Monitor
Experience and Discussion
Evaluation
Micro- and Macro-benchmark Results
Application Performance
Failure Analysis
Related Work
Discussion and Conclusion
Introduction
Background
Fast distributed transactions
RDMA
Choosing networking primitives
Advantage of RPCs
Advantage of datagram transport
Performance considerations
On small clusters
On medium-sized clusters
Reliability considerations
Stress tests for packet loss
FaSST RPCs
Coroutines
RPC interface and optimizations
Detecting packet loss
RPC limitations
Single-core RPC performance
Transactions
Handling failures and packet loss
Implementation
Transaction API
Evaluation
Object store
Single-key read-only transactions
Multi-key transactions
TATP
SmallBank
Latency
Future trends
Scalable one-sided RDMA
More queue pairs
Advanced one-sided RDMA
Related work
Conclusion
Introduction
Background
FaaS Workloads
Data Collection
Functions, Applications, and Triggers
Invocation Patterns
Function Execution Times
Memory Usage
Main Takeaways
Managing Cold Starts in FaaS
Design Challenges
Hybrid Histogram Policy
Implementation in Apache OpenWhisk
Evaluation
Methodology
Simulation Results
Experimental results
Production Implementation
Related Work
Conclusion
Abstract
1 Introduction
2 Motivation
3 Challenges
4 Overview of Cartel
5 Design Detail
5.1 Metadata Storage and Aggregation
5.2 Cartel – Three Key Mechanisms
5.3 Cartel Runtime
6 Evaluation
6.1 Experimental Methodology
6.2 Benefits from Cartel
6.3 Effect of Mechanisms
6.4 Use Case - Network Attack
7 Discussion
8 Related Work
9 Conclusion
References
Introduction
Background & Model
Smart-home Platforms
Programming Model
Failures in IoT Environments
Problem Study
Inconsistency
Dependency
Analysis and Findings
Transactuations
Abstraction & API
Chaining transactuations
Relacs
Relacs Store
Execution Model
Relacs Runtime
Fault Tolerance
Implementation
Discussion
Evaluation
Programmability
Correctness
Overhead
Related Work
Conclusion
Acknowledgment

Chapter 2: What Good are Models and What Models are Good?

Fred B. Schneider*
Department of Computer Science
Cornell University
Ithaca, New York 14853 U.S.A.

1. Refining Intuition

Distributed systems are hard to design and understand because we lack intuition for them. Perhaps this is because our lives are fundamentally sequential and centralized. Perhaps not. In any event, distributed systems are being built. We must develop an intuition, so that we can design distributed systems that perform as we intend and so that we can understand existing distributed systems well enough for modification as needs change. Although developing an intuition for a new domain is difficult, it is a problem that engineers and scientists have successfully dealt with before. We have acquired intuition about flight, about nuclear physics, and about chaotic phenomena (like turbulence). Two approaches are conventionally employed:

Experimental Observation. We build things and observe how they behave in various settings. A body of experience accumulates about approaches that work. Even though we might not understand why something works, this body of experience enables us to build things for settings similar to those that have been studied.

Modeling and Analysis. We formulate a model by simplifying the object of study and postulating a set of rules to define its behavior. We then analyze the model, using mathematics or logic, and infer consequences. If the model accurately characterizes reality, then it becomes a powerful mental tool.

Both of these approaches are illustrated in this text. Some chapters report experimental observation; others are more concerned with describing and analyzing models. Taken together, however, the chapters constitute a collective intuition for the design and analysis of distributed systems.

*This material is based on work supported in part by the Office of Naval Research under contract N00014-91-J-1219, the National Science Foundation under Grant No. CCR-8701103, and DARPA/NSF Grant No. CCR-9014363. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author and do not reflect the views of these agencies.

In a young discipline—like distributed systems—there is an inevitable tension between advocates for "experimental observation" and those for "modeling and analysis". This tension masquerades as a dichotomy between "theory" and "practice". Each side believes that theirs is the more effective way to refine intuition. Practitioners complain that they learn little from the theory. Theoreticians complain that practitioners are not addressing the right problems. A theoretician might simplify too much when defining a model; the analysis of such models will rarely enhance our intuition. A practitioner might incorrectly generalize from experience or concentrate on the wrong attributes of an object; our intuition does not profit from this, either. On the other hand, without experimental observation, we have no basis for trusting our models. And, without models, we have no hope of mastering the complexity that underlies distributed systems. The remainder of this chapter is about models for distributed systems. We start by discussing useful properties of models. We then illustrate that simple distributed systems can be difficult to understand—our intuition is not well developed—and how models help. We next turn to two key attributes of a distributed system and discuss models for these attributes. The first concerns assumptions about process execution speeds and message delivery delays. The implications of such assumptions are pursued in greater detail in Chapter 4 (???Ozalp+Marzullo) and Chapter 5 (???Ozalp+Sam). The second attribute concerns failure modes. This material is fundamental for later chapters on implementing fault tolerance, Chapter 6 (???FS state machine) and Chapter 7 (???FS et al primarybackup). Finally, we argue that all of these models are useful and interesting, both to practitioners and theoreticians.

2. Good Models

For our purposes, a model for an object is a collection of attributes and a set of rules that govern how these attributes interact. There is no single correct model for an object. Answering different types of questions about an object usually requires that we employ different models, each with different attributes and/or rules. A model is accurate to the extent that analyzing it yields truths about the object of interest. A model is tractable if such an analysis is actually possible. Defining an accurate model is not difficult; defining an accurate and tractable model is. An accurate and tractable model will include exactly those attributes that affect the phenomena of interest. Selecting these attributes requires taste and insight. Level of detail is a key issue.

In a tractable model, rules governing interactions of attributes will be in a form that supports analysis. For example, mathematical and logical formulas can be analyzed by uninterpreted symbolic manipulations. This makes these formalisms well suited for defining models. Computer simulations can also define models. While not as easy to analyze as a formula, a computer simulation is usually easier to analyze than the system it simulates.

In building models for distributed systems, we typically seek answers to two fundamental questions:

Feasibility. What classes of problems can be solved?

Cost. For those classes that can be solved, how expensive must the solution be?

Both questions have practical import as well as theoretical value. First, being able to recognize an unsolvable problem lurking beneath a system's requirements can head off wasted effort in design, implementation, and testing. Second, knowing the cost implications of solving a particular problem allows us to avoid designs requiring protocols that are inherently slow or expensive. Finally, knowing the inherent cost of solving a particular problem provides a yardstick with which we can evaluate any solution that we devise.

By studying algorithms and computational complexity, most undergraduates learn about undecidable problems and about complexity classes, the two issues raised above. The study builds intuition for a particular model of computation, one that involves a single sequential process and a uniform access-time memory. Unfortunately, this is neither an accurate nor a useful model for the systems that concern us. Distributed systems comprise multiple processes communicating over narrow-bandwidth, high-latency channels, with some processes and/or channels faulty. The additional processes provide more computational power but require coordination. The channel bandwidth limitations help isolate the effects of failures but also mean that interprocess communication is a scarce system resource. In short, distributed systems raise new concerns, and understanding these requires new models.

A Coordination Problem

In implementing distributed systems, process coordination and coping with failures are pervasive concerns. Here is an example of such a problem.

Coordination Problem. Two processes, A and B, communicate by sending and receiving messages on a bidirectional channel. Neither process can fail. However, the channel can experience transient failures, resulting in the loss of a subset of the messages that have been sent. Devise a protocol in which either of two actions, α or β, is possible, but (i) both processes take the same action and (ii) neither takes both actions.

That this problem has no solution usually comes as a surprise. Here is the proof. Any protocol that solves this problem is equivalent to one in which there are rounds of message exchange: first A (say) sends to B, next B sends to A, then A sends to B, and so on. We show that in assuming the existence of a protocol to solve the problem, we are able to derive a contradiction. This establishes that no such protocol exists.

Select the protocol that solves the problem using the fewest rounds. By assumption, such a protocol must exist, and, by construction, no protocol solving the problem using fewer rounds exists. Without loss of generality, suppose that m, the last message sent by either process, is sent by A. Observe that the action ultimately taken by A cannot depend on whether m is received by B, because its receipt could never be learned by A (since this is the last message). Thus, A's choice of action α or β does not depend on m. Next, observe that the action ultimately taken by B cannot depend on whether m is received by B, because B must make the same choice of action α or β even if m is lost (due to a channel failure). Having argued that the action chosen by A and B does not depend on m, we conclude that m is superfluous. Thus, we can construct a new protocol in which one fewer message is sent. However, the existence of such a shorter protocol contradicts the assumption that the protocol we employed used the fewest rounds.

By building a simple, informal model, we established that the Coordination Problem cannot be solved. Two insights were used in this model:

(1) All protocols between two processes are equivalent to a series of message exchanges.

(2) Actions taken by a process depend only on the sequence of messages it has received.

Having defined the model and analyzed it, we have now refined our intuition. Notice that in so doing we learned not only about this particular problem but also about variations. For example, we might wonder whether coordination of two processes is possible if the channel never fails (so messages are never lost) or if the channel informs the sender whenever a message is lost. For each modification, we can determine whether the change invalidates some assertion being made in the analysis (i.e., the proof above) and thus whether the change invalidates the proof.

3. Synchronous versus Asynchronous Systems

When modeling distributed systems, it is useful to distinguish between asynchronous and synchronous systems. With an asynchronous system, we make no assumptions about process execution speeds and/or message delivery delays; with a synchronous system, we do make assumptions about these parameters. In particular, the relative speeds of processes in a synchronous system are assumed to be bounded, as are any delays associated with communications channels.

Postulating that a system is asynchronous is a non-assumption. Every system is asynchronous. Even a system in which processes run in lock step and message delivery is instantaneous satisfies the definition of an asynchronous system (as well as that of a synchronous system). Because all systems are asynchronous, a protocol designed for use in an asynchronous system can be used in any distributed system. This is a compelling argument for studying asynchronous systems.

In theory, any system that employs reasonable schedulers can be regarded as being synchronous, because there are then (trivial) bounds on the relative speeds of processes and channels. This, however, is not a useful way to view a distributed system. Protocols that assume the system is synchronous exhibit performance degradation as the ratios of the various process speeds and delivery delays increase. Reasonable throughput can be attained by these protocols only when processes execute at about the same speed and delivery delays are not too large. Thus, there is no value to considering a system as being synchronous unless the relative execution speeds of processes and channel delays are close.

In practice, then, postulating that a system is synchronous constrains how processes and communications channels are implemented. The scheduler that multiplexes processors must not violate the constraints on process execution speeds. If long-haul networks are employed for communications, then queuing delays, unpredictable routings, and retransmission due to errors must not be allowed to violate the constraints on channel delays. On the other hand, asserting that the relative speeds of processes are bounded is equivalent to assuming that all processors in the system have access to approximately rate-synchronized real-time clocks. This is because either one can be used to implement the other. Thus, timeouts and other time-based protocol techniques are possible only when a system is synchronous.

An Election Protocol

In asserting that a system is synchronous, we rule out certain system behaviors. This, in turn, enables us to employ simpler or cheaper protocols than would be required to solve the same problem in an asynchronous system (where these behaviors are possible). An example is the following election problem.

Election Problem. A set of processes P1, P2, ..., Pn must select a leader. Each process Pi has a unique identifier uid(i). Devise a protocol so that all of the processes learn the identity of the leader. Assume all processes start executing at the same time and that all communicate using broadcasts that are reliable.



Solving this problem in an asynchronous system is not difficult, but somewhat expensive. Each process Pi broadcasts 〈i, uid(i)〉. Every process will eventually receive these broadcasts, so each can independently "elect" the Pi for which uid(i) is smallest. Notice that n broadcasts are required for an election. In a synchronous system, it is possible to solve the Election Problem with only a single broadcast. Let τ be a known constant bigger than the largest message delivery delay plus the largest difference that can be observed at any instant by reading clocks at two arbitrary processors. Now, it suffices for each process Pi to wait until either (i) it receives a broadcast or (ii) τ∗uid(i) seconds elapse on its clock at which time it broadcasts 〈i〉. The first process that makes a broadcast is elected. We have illustrated that by restricting consideration to synchronous systems, time can be used to good advantage in coordinating processes. The act of not sending a message can convey information to processes. This technique is used, for example, by processes in the synchronous election protocol above to infer values of uid(i) that are held by no process.
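The two election protocols above can be compared with a small simulation. This is only a sketch under simplifying assumptions that are not in the text: uids are positive integers, clocks are perfectly synchronized (so τ only has to exceed the message delay), and every message takes the worst-case delay `max_delay`. The function names are illustrative.

```python
def asynchronous_election(uids):
    """Every process broadcasts (i, uid(i)); each elects the smallest uid.

    uids maps a process index i to its unique identifier uid(i).
    Returns (leader index, number of broadcasts).
    """
    broadcasts = [(i, uid) for i, uid in uids.items()]  # one broadcast per process
    leader = min(broadcasts, key=lambda msg: msg[1])[0]
    return leader, len(broadcasts)


def synchronous_election(uids, tau, max_delay):
    """Process i broadcasts at local time tau * uid(i) unless it hears a broadcast first.

    Assumes synchronized clocks and 0 <= max_delay < tau.
    Returns (leader index, number of broadcasts).
    """
    if not 0 <= max_delay < tau:
        raise ValueError("tau must exceed the largest message delay")
    sent = []  # (send time, sender index), in order of sending
    for i in sorted(uids, key=uids.get):
        timeout = tau * uids[i]
        # Worst case: every earlier broadcast arrives max_delay after it was sent.
        heard = any(send_time + max_delay < timeout for send_time, _ in sent)
        if not heard:
            sent.append((timeout, i))
    return sent[0][1], len(sent)


uids = {1: 7, 2: 3, 3: 5}
print(asynchronous_election(uids))                          # n broadcasts
print(synchronous_election(uids, tau=1.0, max_delay=0.4))   # a single broadcast
```

In this run both protocols elect process 2, the one with the smallest uid; the asynchronous version uses three broadcasts and the synchronous version uses one, because the silence of the other processes until time τ∗uid(i) already conveys that no smaller uid exists.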

4. Failure Models

A variety of failure models have been proposed in connection with distributed systems. All are based on assigning responsibility for faulty behavior to the system's components: processors and communications channels. It is faulty components that we count, not occurrences of faulty behavior. And, we speak of a system being t-fault tolerant when that system will continue satisfying its specification provided that no more than t of its components are faulty.

By way of contrast, in classical work on fault-tolerant computing systems, it is the occurrences of faulty behavior that are counted. Statistical measures of reliability and availability, like MTBF (mean-time-between-failures) and probability of failure over a given interval, are deduced from estimates of elapsed time between fault occurrences. Such characterizations are important to users of a system, but there are real advantages to describing the fault tolerance of a system in terms of the maximum number of faulty components that can be tolerated over some interval of interest. Asserting that a system is t-fault tolerant is a measure of the fault tolerance supported by the system's architecture, in contrast to fault tolerance achieved simply by using reliable components.

Fault tolerance of a system will depend on the reliability of the components used in constructing that system, in particular on the probability that there will be more than t failures during the operating interval of interest. Thus, once t has been chosen, it is not difficult to derive the more traditional statistical measures of reliability. We simply compute the probabilities of having various configurations of 0 through t faulty components. So, no expressive power is lost by counting faulty components, as we do, rather than counting fault occurrences.

Some care is required in defining failure models, however, when it is the faulty components that are being counted. For example, consider a fault that leads to a message loss. We could attribute this fault to the sender, the receiver, or the channel. Message loss due to signal corruption from electrical noise should be blamed on the channel, but message loss due to buffer overflow at the receiver should be blamed on the receiver. Moreover, since replication is the only way to tolerate faulty components, the architecture and cost of implementing a t-fault tolerant system very much depend on exactly how fault occurrences are being attributed to components. Incorrect attribution leads to an inaccurate distributed system model; erroneous conclusions about system architecture are sure to follow.
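As a rough illustration of deriving a statistical reliability measure from t, the sketch below sums the probabilities of having 0 through t faulty components. It assumes, beyond what the text states, that the system has n components that fail independently, each with the same probability p over the interval of interest; the function name is illustrative.

```python
from math import comb


def prob_at_most_t_faulty(n, t, p):
    """Probability that no more than t of n components are faulty.

    Assumes independent failures, each with probability p over the
    operating interval of interest (a binomial model).
    """
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(t + 1))


# A 1-fault tolerant system built from 3 components, each faulty with probability 0.1
reliability = prob_at_most_t_faulty(n=3, t=1, p=0.1)
print(round(reliability, 3))  # 0.972
```

Under these assumptions the system satisfies its specification with probability 0.729 + 0.243 = 0.972, compared with 0.9 for a single unreplicated component.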


A faulty component exhibits behavior consistent with some failure model being assumed. Failure models commonly found in the distributed systems literature include:

Failstop. A processor fails by halting. Once it halts, the processor remains in that state. The fact that a processor has failed is detectable by other processors [S84].

Crash. A processor fails by halting. Once it halts, the processor remains in that state. The fact that a processor has failed may not be detectable by other processors [F83].

Crash+Link. A processor fails by halting. Once it halts, the processor remains in that state. A link fails by losing some messages, but does not delay, duplicate, or corrupt messages [BMST92].

Receive-Omission. A processor fails by receiving only a subset of the messages that have been sent to it or by halting and remaining halted [PT86].

Send-Omission. A processor fails by transmitting only a subset of the messages that it actually attempts to send or by halting and remaining halted [H84].

General Omission. A processor fails by receiving only a subset of the messages that have been sent to it, by transmitting only a subset of the messages that it actually attempts to send, and/or by halting and remaining halted [PT86].

Byzantine Failures. A processor fails by exhibiting arbitrary behavior [LSP82].

Failstop failures are the least disruptive, because processors never perform erroneous actions and failures are detectable. Other processors can safely perform actions on behalf of a faulty failstop processor.

Unless the system is synchronous, it is not possible to distinguish between a processor that is executing very slowly and one that has halted due to a crash failure. Yet, the ability to make this distinction can be important. A processor that has crashed can take no further action, but a processor that is merely slow can. Other processors can safely perform actions on behalf of a crashed processor, but not on behalf of a slow one, because subsequent actions by the slow processor might not be consistent with actions performed on its behalf by others. Thus, crash failures in asynchronous systems are harder to deal with than failstop failures. In synchronous systems, however, the crash and failstop models are equivalent.

The next four failure models (Crash+Link, Receive-Omission, Send-Omission, and General Omission) all deal with message loss, each modeling a different cause for the loss and attributing the loss to a different component. Finally, Byzantine failures are the most disruptive. A system that can tolerate Byzantine failures can tolerate anything.

The extremes of our spectrum of models, failstop and Byzantine, are not controversial, but there can be debate about the other models. Why not define a failure model corresponding to memory disruptions or misbehavior of the processor's arithmetic-logic unit (ALU)? The first reason brings us back to the two fundamental questions of Section 2. The feasibility and cost of solving certain fundamental problems are known to differ across the seven failure models enumerated above. (We return to this point below.) A second reason that these failure models are interesting is a matter of taste in abstractions. A reasonable abstraction for a processor in a distributed system is an object that sends and receives messages. The failure models given above concern ways that such an abstraction might be faulty. Failure models involving the contents of memory or the functioning of an ALU, for example, concern internal (and largely irrelevant) details of the processor abstraction. A good model encourages suppression of irrelevant details.

Fault Tolerance and Distributed Systems

As the size of a distributed system increases, so does the number of its components and, therefore, so does the probability that some component will fail. Thus, designers of distributed systems must be concerned from the outset with implementing fault tolerance. Protocols and system architectures that are not fault tolerant simply are not very useful in this setting.

The link between fault tolerance and distributed systems goes in the other direction as well. Implementing a distributed system is the only way to achieve fault tolerance. All methods for achieving fault tolerance employ replication of function using components that fail independently. In a distributed system, the physical separation and isolation of processors linked by a communications network ensures that components fail independently. Thus, achieving fault tolerance in a computing system can lead to solving problems traditionally associated with distributed computing systems.

Failures, be they hard or transient, can be detected only by replicating actions in failure-independent ways. One way to do this is by performing the action using components that are physically and electrically isolated; we call this replication in space. The validity of the approach follows from an empirically justified belief in the independence of failures at physically and electrically isolated devices. A second approach to replication is for a single device to repeatedly perform the action. We call this replication in time. Replication in time is valid only for transient failures.

If the results of performing a set of replicated actions disagree, a failure has occurred. Without making further assumptions, this is the strongest statement that can be made. In particular, if the results agree, we cannot assert that no component is faulty (and the results are correct). This is because if there are enough faulty components, all might be corrupted, yet still agree. For Byzantine failures, (t+1)-fold replication permits t-fault tolerant failure detection but not masking. This is because when there is disagreement among t+1 independently obtained results, one cannot assume that the majority value is correct. In order to implement t-fault tolerant masking, (2t+1)-fold replication is needed, since then as many as t values can be faulty without causing the majority value to be faulty. At the other extreme of our failure models spectrum, for failstop failures a single component suffices for detection. And, (t+1)-fold replication is sufficient for masking the failure of as many as t faulty components.
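The difference between detection and masking under Byzantine failures can be sketched with a simple comparison of replicated results. The function names are illustrative, and the sketch assumes that each replica's result is a hashable value and that at most t of the replicas are faulty.

```python
from collections import Counter


def detect_failure(results):
    """With t+1 replicas and at most t faulty, any disagreement reveals a failure."""
    return len(set(results)) > 1


def mask_failures(results, t):
    """Return the majority value of at least 2t+1 replicated results.

    If at most t results are faulty, a value reported by t+1 or more
    replicas must have been reported by at least one correct replica.
    """
    if len(results) < 2 * t + 1:
        raise ValueError("masking t faults needs at least 2t+1 results")
    value, count = Counter(results).most_common(1)[0]
    if count < t + 1:
        raise RuntimeError("no value has t+1 votes; more than t replicas are faulty")
    return value


print(detect_failure([42, 41]))          # disagreement among t+1 = 2 results
print(mask_failures([42, 7, 42], t=1))   # majority of 2t+1 = 3 results
```

With two results, `detect_failure` can only report that something went wrong; it cannot say which value is right. With three results and t = 1, `mask_failures` returns 42 despite the faulty replica that reported 7.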

5. Which Model When? Theoreticians have good reason to study all of the models we have discussed. The models each idealize some dimension of real systems, and it is useful to know how each system attribute affects the feasibility or cost of solving a problem. Theoreticians also may have reason to define new models. Identifying attributes that affect the problems that arise in distributed systems allows us to identify the key dimensions of the problems we face. The dilemma faced by practitioners is that of deciding between models when building a system. Should we assume that processes are asynchronous or synchronous, failstop or Byzantine? The answer depends on how the model is being used. One way to regard a model is as an interface definition—a set of assumptions that programmers can make about the behavior of system components. Programs are written to work correctly assuming the actual system behaves as prescribed by the model. And, when system behavior is not consistent with the model, then no guarantees can be made.


For example, a system designed assuming that Byzantine failures are possible can tolerate anything. Assuming Byzantine failures is prudent in mission-critical systems, because the cost of system failure is usually great, so there is considerable incentive to reduce the risk of a system failure. For most applications, however, it suffices to assume a more benign failure model. In those rare circumstances where the system does not satisfy the model, we must be prepared for the system to violate its specification.

Large systems, especially, are rarely constructed as single monolithic entities. Rather, the system is structured by implementing abstractions. Each abstraction builds on other abstractions, providing some added functionality. Here, having a collection of models can be used to good advantage. Among the abstractions that might be implemented are processors possessing the attributes discussed above. We might start by assuming our processors exhibit only crash failures. Failstop processors might then be approximated by using timeout-based protocols. Finally, if we discover that processors exhibit more than crash failures, we might go back and add various sanity tests to system code, causing processors to crash rather than behave in a way not permitted by the crash failure model.

Lastly, the various models can and should be regarded as limiting cases. The behavior of a real system is bounded by our models. Thus, understanding the feasibility and costs associated with solving problems in these models can give us insight into the feasibility and cost of solving a problem in some given real system whose behavior lies between the models.
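The idea of adding sanity tests so that a processor crashes instead of misbehaving might look like the following sketch. Everything here is hypothetical: `ProcessorCrash`, `checked_step`, and the example invariant are illustrative names, and raising an exception stands in for halting the processor.

```python
class ProcessorCrash(Exception):
    """Hypothetical stand-in for halting the processor (a crash failure)."""


def checked_step(state, step, sanity_test):
    """Apply one state transition, halting if the result fails the sanity test.

    Converting a detected inconsistency into a halt keeps the observable
    behavior within the crash failure model.
    """
    new_state = step(state)
    if not sanity_test(new_state):
        raise ProcessorCrash(f"sanity test failed for state {new_state!r}")
    return new_state


def withdraw(balance):
    """Example transition: withdraw 2 units."""
    return balance - 2


def non_negative(balance):
    """Example sanity test: a balance must never be negative."""
    return balance >= 0


print(checked_step(5, withdraw, non_negative))  # 3
try:
    checked_step(1, withdraw, non_negative)
except ProcessorCrash as crash:
    print("halted:", crash)
```

A sanity test can only catch the misbehavior it was written to look for, so this approximates the crash model rather than guaranteeing it.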

References [BMST92]

Budhiraja, N., K. Marzullo, F.B. Schneider, and S. Toueg. Primary-backup protocols: Lower bounds and optimal implementations. Proceedings of the Third IFIP Working Conference on Dependable Computing for Critical Applications (Mondello, Italy, Sept. 1992), 187-198.

[F83]

Fischer, M.J. The consensus problem in unreliable distributed systems (A brief survey). Proc. 1983 International Conference on Foundations of Computations Theory, Lecture Notes in Computer Science, Vol. 158, Springer-Verlag, New York, 1983, 127-140.

[H84]

Hadzilacos, V. Issues of Fault Tolerance in Concurrent Computations Ph.D. thesis, Harvard University, June 1984.

[LSP82]

Lamport, L., R. Shostak, and M. Pease. The Byzantine generals problem. ACM TOPLAS 4, 3 (July 1982), 382-401.

[PT86]

Perry, K.J. and S. Toueg. Distributed agreement in the presence of processor and communication faults. IEEE Transactions on Software Engineering SE-12, No. 3, (March 1986) 477-482

[S84]

Schneider, F.B. Byzantine generals in action: Implementing fail-stop processors. ACM TOCS 2, 2 (May 1984), 145-154.


Fallacies of Distributed Computing Explained
(The more things change the more they stay the same)
Arnon Rotem-Gal-Oz
[This whitepaper is based on a series of blog posts that first appeared in Dr. Dobb's Portal www.ddj.com/dept/architect]

The software industry has been writing distributed systems for several decades. Two examples are the US Department of Defense ARPANET (which eventually evolved into the Internet), established back in 1969, and the SWIFT protocol (used for money transfers), established in the same time frame [Britton2001]. Nevertheless, in 1994, Peter Deutsch, a Sun Fellow at the time, drafted 7 assumptions that architects and designers of distributed systems are likely to make and that prove wrong in the long run, resulting in all sorts of troubles and pains for the solution and the architects who made the assumptions. In 1997 James Gosling added another such fallacy [JDJ2004]. The assumptions are now collectively known as "The 8 fallacies of distributed computing" [Gosling]:

1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.

This whitepaper looks at each of these fallacies, explains them and checks their relevancy for distributed systems today.

The network is reliable

The first fallacy is "The network is reliable." Why is this a fallacy? Well, when was the last time you saw a switch fail? After all, even basic switches these days have MTBFs (Mean Time Between Failures) of 50,000 operating hours and more. For one, if your application is a mission critical 365x7 kind of application, you can just hit that failure--and Murphy will make sure it happens at the most inappropriate moment. Nevertheless, most applications are not like that. So what's the problem? Well, there are plenty of problems: power failures, someone tripping on the network cord, clients that all of a sudden connect wirelessly, and so on. If hardware isn't enough--the software can fail as well, and it does. The situation is more complicated if you collaborate with an external partner, such as an e-commerce application working with an external credit-card processing service. Their side of the connection is not under your direct control. Lastly there are security threats like DDoS attacks and the like.

What does that mean for your design? On the infrastructure side, you need to think about hardware and software redundancy and weigh the risks of failure versus the required investment. On the software side, you need to think about messages/calls getting lost whenever you send a message/make a call over the wire. For one, you can use a communication medium that supplies full reliable messaging; WebSphere MQ or MSMQ, for example. If you can't use one, prepare to retry, acknowledge important messages, identify/ignore duplicates (or use idempotent messages), reorder messages (or not depend on message order), verify message integrity, and so on.

One note regarding WS-ReliableMessaging: the specification supports several levels of message guarantee--at most once, at least once, exactly once and in order. You should remember, though, that it only takes care of delivering the message as long as the network nodes are up and running; it doesn't handle persistency, and you still need to take care of that (or use a vendor solution that does that for you) for a complete solution.

To sum up, the network is unreliable, and we as software architects/designers need to address that.
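The "retry, acknowledge, ignore duplicates" advice can be sketched in a few lines. This is only an illustration under stated assumptions: `FlakyChannel`, `Receiver` and `send_reliably` are hypothetical names, the lossy link is simulated in-process, and a real system would also persist the set of seen IDs.

```python
import random
import uuid


class FlakyChannel:
    """Hypothetical lossy link: may drop a request or its acknowledgement."""

    def __init__(self, receiver, loss_rate=0.3, seed=42):
        self.receiver = receiver
        self.loss_rate = loss_rate
        self.rng = random.Random(seed)

    def send(self, message_id, payload):
        if self.rng.random() < self.loss_rate:
            return None                     # request lost on the way out
        ack = self.receiver.handle(message_id, payload)
        if self.rng.random() < self.loss_rate:
            return None                     # acknowledgement lost on the way back
        return ack


class Receiver:
    """Applies each message at most once by remembering the IDs it has seen."""

    def __init__(self):
        self.seen = set()
        self.applied = []

    def handle(self, message_id, payload):
        if message_id not in self.seen:     # a retry of a known ID is ignored
            self.seen.add(message_id)
            self.applied.append(payload)
        return "ack"


def send_reliably(channel, payload, max_attempts=50):
    """Retry until acknowledged; returns the number of attempts used."""
    message_id = str(uuid.uuid4())          # the same ID is reused on every retry
    for attempt in range(1, max_attempts + 1):
        if channel.send(message_id, payload) == "ack":
            return attempt
    raise TimeoutError(f"no acknowledgement after {max_attempts} attempts")


receiver = Receiver()
channel = FlakyChannel(receiver)
for amount in (10, 20, 30):
    send_reliably(channel, amount)
print(receiver.applied)
```

Because the sender reuses one message ID across retries, a retry that follows a lost acknowledgement is recognized as a duplicate, so each payload should be applied once even though some requests are delivered more than once.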

Latency is zero The second fallacy of Distributed Computing is the assumption that "Latency is Zero". Latency is how much time it takes for data to move from one place to another (versus bandwidth which is how much data we can transfer during that time). Latency can be relatively good on a LAN--but latency deteriorates quickly when you move to WAN scenarios or internet scenarios.

Latency is more problematic than bandwidth. Here's a quote from a post by Ingo Rammer on latency vs. bandwidth [Ingo] that illustrates this: "But I think that it's really interesting to see that the end-to-end bandwidth increased by 1468 times within the last 11 years while the latency (the time a single ping takes) has only been improved tenfold. If this wouldn't be enough, there is even a natural cap on latency. The minimum round-trip time between two points of this earth is determined by the maximum speed of information transmission: the speed of light. At roughly 300,000 kilometers per second (3.6 * 10E12 teraangstrom per fortnight), it will always take at least 30 milliseconds to send a ping from Europe to the US and back, even if the processing would be done in real time."

You may think all is okay if you only deploy your application on LANs. However, even when you work on a LAN with Gigabit Ethernet you should still bear in mind that the latency is much bigger than that of accessing local memory. Assuming the latency is zero, you can be easily tempted to assume that making a call over the wire is almost like making a local call. This is one of the problems with approaches like distributed objects that provide "network transparency": they lure you into making a lot of fine-grained calls to objects which are actually remote and (relatively) expensive to call. Taking latency into consideration means you should strive to make as few calls as possible, and assuming you have enough bandwidth (which we'll talk about next time), you'd want to move as much data as possible in each of these calls. There is a nice example illustrating the latency problem and what was done to solve it in Windows Explorer in http://blogs.msdn.com/oldnewthing/archive/2006/04/07/570801.aspx
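A back-of-the-envelope calculation shows why fewer, larger calls win when latency dominates. The figures below are assumptions for illustration (1000 records of 200 bytes, a 30 ms round trip as in the quote above, and a link of roughly 10 Mbit/s); the model ignores protocol overhead and assumes the calls are made one after another.

```python
def total_time_ms(calls, round_trip_ms, payload_bytes, bandwidth_bytes_per_ms):
    """Time for `calls` sequential requests, each paying one round trip plus transfer time."""
    transfer_ms = payload_bytes / bandwidth_bytes_per_ms
    return calls * (round_trip_ms + transfer_ms)


ITEMS = 1000        # hypothetical: number of records the client needs
ITEM_BYTES = 200    # hypothetical size of one record
RTT_MS = 30         # WAN round trip, the lower bound from the quote above
BANDWIDTH = 1250    # bytes per millisecond, roughly 10 Mbit/s

# Fine-grained: one remote call per record.
chatty = total_time_ms(ITEMS, RTT_MS, ITEM_BYTES, BANDWIDTH)
# Coarse-grained: one remote call that returns every record.
batched = total_time_ms(1, RTT_MS, ITEMS * ITEM_BYTES, BANDWIDTH)

print(f"one call per item: {chatty:.0f} ms")
print(f"one batched call:  {batched:.0f} ms")
```

Under these assumptions the fine-grained version spends almost all of its time waiting on round trips, while the batched version is limited mainly by transfer time.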

Another example is AJAX. The AJAX approach allows for using the dead time the users spend digesting data to retrieve more data - however, you still need to consider latency. Let's say you are working on a new shiny AJAX front-end--everything looks just fine in your testing environment. It also shines in your staging environment, passing the load tests with flying colors. The application can still fail miserably in the production environment if you fail to test for latency problems: retrieving data in the background is good, but if you can't do that fast enough the application will still stagger and be unresponsive. (You can read more on AJAX and latency in [RichUI].) You can (and should) use tools like Shunra Virtual Enterprise, Opnet Modeler and many others to simulate network conditions and understand system behavior, thus avoiding failure in the production system.

Bandwidth is infinite

The next Distributed Computing Fallacy is "Bandwidth Is Infinite." This fallacy, in my opinion, is not as strong as the others. If there is one thing that is constantly getting better in relation to networks it is bandwidth. However, there are two forces at work to keep this assumption a fallacy. One is that while the bandwidth grows, so does the amount of information we try to squeeze through it. VoIP, videos, and IPTV are some of the newer applications that take up bandwidth. Downloads, richer UIs, and reliance on verbose formats (XML) are also at work, especially if you are using T1 or lower lines. However, even when you think that this 10Gbit Ethernet would be more than enough, you may be hit with more than 3 Terabytes of new data per day (numbers from an actual project). The other force at work to lower bandwidth is packet loss (along with frame size). This quote underscores the point very well: "In the local area network or campus environment, rtt and packet loss are both usually small enough that factors other than the above equation set your performance limit (e.g. raw available link bandwidths, packet forwarding speeds, host CPU limitations, etc.). In the WAN however, rtt and packet loss are often rather large and something that the end systems can not control. Thus their only hope for improved performance in the wide area is to use larger packet sizes. Let's take an example: New York to Los Angeles. Round Trip Time (rtt) is about 40 msec, and let's say packet loss is 0.1% (0.001). With an MTU of 1500 bytes (MSS of 1460), TCP throughput will have an upper bound of about 6.5 Mbps! And no, that is not a window size limitation, but rather one based on TCP's ability to detect and recover from congestion (loss). With 9000 byte frames, TCP throughput could reach about 40 Mbps. Or let's look at that example in terms of packet loss rates. Same round trip time, but let's say we want to achieve a throughput of 500 Mbps (half a "gigabit").
To do that with 9000 byte frames, we would need a packet loss rate of no more than 1x10^-5. With 1500 byte frames, the required packet loss rate is down to 2.8x10^-7! While the jumbo frame is only 6 times larger, it allows us the same throughput in the face of 36 times more packet loss." [WareOnEarth]

Acknowledging that bandwidth is not infinite has a balancing effect on the implications of the "Latency Is Zero" fallacy: if, acting on the realization that latency is not zero, we modeled a few large messages, bandwidth limitations direct us to strive to limit the size of the information we send over the wire. The main implication, then, is to consider that in the production environment of our application there may be bandwidth problems which are beyond our control, and we should bear in mind how much data is expected to travel over the wire. The recommendation I made in my previous post--to try to simulate the production environment--holds true here as well.
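The equation the quote refers to is not reproduced in this paper. The sketch below assumes it is the commonly cited approximation in which TCP throughput is bounded by C * MSS / (RTT * sqrt(p)), where p is the packet loss rate. The constant C = 0.7 and the jumbo-frame MSS of 8960 bytes (9000 minus the same 40 header bytes as in the 1500/1460 case) are assumptions, chosen because they land close to the quoted 6.5 Mbps and 40 Mbps figures.

```python
import math

MATHIS_CONSTANT = 0.7   # assumption: picked to match the throughput figures in the quote


def tcp_throughput_bps(mss_bytes, rtt_s, loss_rate, c=MATHIS_CONSTANT):
    """Approximate upper bound on steady-state TCP throughput, in bits per second."""
    return c * (mss_bytes * 8 / rtt_s) / math.sqrt(loss_rate)


def max_loss_rate(target_bps, mss_bytes, rtt_s, c=MATHIS_CONSTANT):
    """Largest loss rate at which the bound still reaches `target_bps`."""
    return (c * mss_bytes * 8 / (rtt_s * target_bps)) ** 2


RTT = 0.040             # New York to Los Angeles round trip, from the quote
LOSS = 0.001            # 0.1% packet loss, from the quote

standard = tcp_throughput_bps(1460, RTT, LOSS)   # 1500-byte MTU
jumbo = tcp_throughput_bps(8960, RTT, LOSS)      # 9000-byte frames (assumed MSS)

print(f"1500-byte frames: {standard / 1e6:.1f} Mbps")
print(f"9000-byte frames: {jumbo / 1e6:.1f} Mbps")
print(f"loss budget ratio at 500 Mbps: "
      f"{max_loss_rate(500e6, 8960, RTT) / max_loss_rate(500e6, 1460, RTT):.1f}")
```

Since the tolerable loss rate grows with the square of the segment size, a frame about 6 times larger tolerates roughly 36 times more loss, which is the point the quote makes. The loss rates this sketch computes are of the same order of magnitude as the quoted ones but not identical, so the source may have used a different constant or rounding.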

The network is secure

Peter Deutsch introduced the Distributed Computing Fallacies back in 1991. You'd think that in the 15 years since then "the network is secure" would no longer be a fallacy. Unfortunately, that's not the case--and not because the network is now secure. No one would be naive enough to assume it is. Nevertheless, a few days ago I began writing a report about a middleware product some vendor tried to inflict on us that has no regard whatsoever for security! Well, that is just anecdotal evidence, however. Statistics published at Aladdin.com [Aladdin] show that: "For 52% of the networks the perimeter is the only defense. According to Preventsys and Qualys, 52% of chief information security officers acknowledged having a "Moat & Castle" approach to their overall network security. They admitted that once the perimeter security is penetrated, their networks are at risk. Yet, 48% consider themselves to be "proactive" when it comes to network security and feel that they have a good grasp on their enterprise's security posture. 24% felt their security was akin to Fort Knox (it would take a small army to get through), while 10% compared their network security to Swiss cheese (security holes inside and out). The remaining 14% of respondents described their current network security as being locked down on the inside, but not yet completely secured to the outside. Preventsys and Qualys also found that 46% of security officers spend more than a third of their day, and in some cases as much as 7 hours, analyzing reports generated from their various security point solutions."

In case you just landed from another planet: the network is far from being secured. Here are a few statistics to illustrate that: "Through the continual 24x7 monitoring of hundreds of Fortune 1000 companies, RipTech has discovered several extremely relevant trends in information security. Among them:

1. General Internet attack trends are showing a 64% annual rate of growth
2. The average company experienced 32 attacks per week over the past 6 months
3. Attacks during weekdays increased in the past 6 months" [RipTech].

When I tried to find some updated incident statistics, I came up with the following [CERT]:

"Note: Given the widespread use of automated attack tools, attacks against Internet-connected systems have become so commonplace that counts of the number of incidents reported provide little information with regard to assessing the scope and impact of attacks. Therefore, as of 2004, we will no longer publish the number of incidents reported. Instead, we will be working with others in the community to develop and report on more meaningful metrics" (the number of incidents for 2003 was 137539 incidents...). Lastly, Aladdin claims that the costs of malware for 2004 (viruses, worms, Trojans etc.) are estimated at between $169 billion and $204 billion. [Aladdin]

The implications of network (in)security are obvious--you need to build security into your solutions from Day 1. I mentioned in a previous blog post that security is a system quality attribute that needs to be taken into consideration starting from the architectural level. There are dozens of books that talk about security and I cannot begin to delve into all the details in a short blog post. In essence you need to perform threat modeling to evaluate the security risks. Then, following further analysis, decide which risks should be mitigated by what measures (a tradeoff between costs, risks and their probability). Security is usually a multi-layered solution that is handled on the network, infrastructure, and application levels. As an architect you might not be a security expert--but you still need to be aware that security is needed and of the implications it may have (for instance, you might not be able to use multicast, user accounts with limited privileges might not be able to access some networked resource etc.).

Topology doesn't change

The fifth Distributed Computing Fallacy is about network topology: "Topology doesn't change." That's right, it doesn't--as long as it stays in the test lab. When you deploy an application in the wild (that is, to an organization), the network topology is usually out of your control. The operations team (IT) is likely to add and remove servers every once in a while and/or make other changes to the network ("this is the new Active Directory we will use for SSO; we're replacing RIP with OSPF and this application's servers are moving into area 51" and so on). Lastly there are server and network faults which can cause routing changes. When you're talking about clients, the situation is even worse. There are laptops coming and going, wireless ad-hoc networks, new mobile devices. In short, topology is changing constantly.

What does this mean for the applications we write? Simple. Try not to depend on specific endpoints or routes; if you can't, be prepared to renegotiate endpoints. Another implication is that you would want to either provide location transparency (e.g. using an ESB or multicast) or provide discovery services (e.g. Active Directory/JNDI/LDAP). Another strategy is to abstract the physical structure of the network. The most obvious example of this is using DNS names instead of IP addresses. Recently I moved my (other) blog from one hosting service to another. The transfer went without a hitch as I had both sites up and running. Then when the DNS records were updated (it takes a day or two for the change to ripple) readers just came to the new site without knowing the routing (topology) had changed under their feet.

An interesting example is moving from WS-Routing to WS-Addressing. In WS-Routing a message can describe its own routing path. This assumes that a message can know the path it needs to travel in advance, that is, that the topology doesn't change (this also causes a security vulnerability--but that's another story), whereas the newer WS-Addressing relies on "Next Hop" routing (the way TCP/IP works), which is more robust. Another example is routing in SQL Server Service Broker. The problematic part is that the routes need to be set inside Service Broker, so IT has to remember to go into Service Broker and update routing tables when the topology changes. However, to mitigate this problem the routing relies on next-hop semantics and allows for specifying the address by DNS name.
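One way to read "be prepared to renegotiate endpoints" is to look an endpoint up by name and to repeat the lookup whenever a connection attempt fails. The sketch below is an assumption-laden illustration: `ServiceClient`, the in-memory `directory` and the fake resolve/connect functions are hypothetical stand-ins for a real DNS or directory lookup and a real socket connection.

```python
class ServiceClient:
    """Looks an endpoint up by name and re-resolves it when a connection fails."""

    def __init__(self, name, resolve, connect):
        self.name = name
        self.resolve = resolve    # callable: name -> address (e.g. a DNS lookup)
        self.connect = connect    # callable: address -> connection, raises OSError
        self.address = None       # cached result of the last lookup

    def get_connection(self, max_attempts=3):
        for _ in range(max_attempts):
            if self.address is None:
                self.address = self.resolve(self.name)
            try:
                return self.connect(self.address)
            except OSError:
                self.address = None   # cached address may be stale: look it up again
        raise ConnectionError(f"could not reach {self.name}")


# Simulated environment: a name directory and the set of hosts currently up.
directory = {"orders.example.internal": "10.0.0.5"}
live_hosts = {"10.0.0.5"}


def fake_resolve(name):
    return directory[name]


def fake_connect(address):
    if address not in live_hosts:
        raise OSError(f"{address} unreachable")
    return f"connection to {address}"


client = ServiceClient("orders.example.internal", fake_resolve, fake_connect)
first = client.get_connection()

# Topology change: IT moves the service to a new address.
directory["orders.example.internal"] = "10.0.0.9"
live_hosts.clear()
live_hosts.add("10.0.0.9")
second = client.get_connection()

print(first)
print(second)
```

The client depends only on the name, so the move to a new address costs one failed attempt and a fresh lookup rather than a configuration change.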

There is one administrator

The sixth Distributed Computing Fallacy is "There is one administrator". You may be able to get away with this fallacy if you install your software on small, isolated LANs (for instance, a single-person IT "group" with no WAN/Internet). However, for most enterprise systems the reality is much different. The IT group usually has different administrators, assigned according to expertise--databases, web servers, networks, Linux, Windows, mainframe and the like. This is the easy situation. The problem occurs when your company collaborates with external entities (for example, connecting with a business partner), or if your application is deployed for Internet consumption and hosted by some hosting service and the application consumes external services (think mashups). In these situations, the other administrators are not even under your control and they may have their own agendas/rules.

At this point you may say "Okay, there is more than one administrator. But why should I care?" Well, as long as everything works, maybe you don't care. You do care, however, when things go astray and there is a need to pinpoint a problem (and solve it). For example, I recently had a problem with an ASP.NET application that required full trust on a hosting service that only allowed medium trust; the application had to be reworked (since changing the hosting service was not an option) in order to work. Furthermore, you need to understand that the administrators will most likely not be part of your development team, so you need to provide them with tools to diagnose and find problems. This is essential when the application involves more than one company ("Is it their problem or ours?"). A proactive approach is to also include tools for monitoring ongoing operations; for instance, to allow administrators to identify problems when they are small--before they become a system failure.

Another reason to think about multiple administrators is upgrades. How are you going to handle them? How are you going to make sure that the different parts of your application (distributed, remember?) are synchronized and can actually work together; for example, does the current DB schema match the current O/R mapping and object model? Again, this problem is aggravated when third parties are involved. Can your partner continue to interoperate with your system when you make changes to the public contract (in an SOA)? So, for example, you need to think about backward compatibility (or maybe even forward compatibility) when designing interoperability contracts.

To sum up, when there is more than one administrator (unless we are talking about a simple system, and even that can evolve later if it is successful), you need to remember that administrators can constrain your options (administrators who set disk quotas, limited privileges, limited ports and protocols and so on), and that you need to help them manage your applications.
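Backward and forward compatibility in a contract can be approached with a tolerant reader: default the fields an older sender omits, ignore the fields a newer sender adds, and reject only incompatible major versions. The message layout, field names and version scheme below are hypothetical and serve only to make the idea concrete.

```python
import json

SUPPORTED_MAJOR = 1   # hypothetical contract version this consumer understands


def parse_order(raw):
    """Tolerant reader: accepts older and newer minor versions of the contract."""
    message = json.loads(raw)
    major, minor = (int(part) for part in message.get("version", "1.0").split("."))
    if major != SUPPORTED_MAJOR:
        raise ValueError(f"unsupported contract version {major}.{minor}")
    return {
        "order_id": message["order_id"],             # required since 1.0
        "currency": message.get("currency", "USD"),  # added in 1.1; default for older senders
        # any other field (sent by newer partners) is deliberately ignored
    }


old = parse_order('{"version": "1.0", "order_id": 7}')
new = parse_order('{"version": "1.2", "order_id": 8, "currency": "EUR", "gift_wrap": true}')
print(old)
print(new)
```

With this shape, a partner running an older or newer minor version keeps working after an upgrade on either side, and only a major version change forces coordination between administrators.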

Transport cost is zero

On to Distributed Computing Fallacy number 7--"Transport cost is zero". There are a couple of ways you can interpret this statement, both of which are false assumptions. One way is that going from the application level to the transport level is free. This is a fallacy since we have to do marshaling (serialize information into bits) to get data onto the wire, which both takes computer resources and adds to the latency. Interpreting the statement this way reinforces the "Latency is Zero" fallacy by reminding us that there are additional costs (both in time and resources).

The second way to interpret the statement is that setting up and running the network is free (as in cash money). This is also far from being true. There are costs--costs for buying the routers, costs for securing the network, costs for leasing the bandwidth for internet connections, and costs for operating and maintaining the network. Someone, somewhere will have to pick up the tab and pay these costs. Imagine you have successfully built Dilbert's Google-killer search engine [Adams] (maybe using the latest Web 2.0 bells and whistles on the UI); you will still fail if you neglect to take into account the costs needed to keep your service up, running, and responsive (E3 lines, datacenters with switches, SANs etc.). The takeaway is that even in situations where you think the other fallacies are not relevant because you rely on existing solutions ("yeah, we'll just deploy Cisco's HSRP protocol and get rid of the network reliability problem"), you may still be bound by the costs of the solution, and you'd need to solve your problems using more cost-effective solutions.
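The first interpretation, that marshaling is not free, is easy to observe. The sketch below times JSON encoding and decoding of some made-up records; the record shape and count are arbitrary assumptions, and the timings will differ from machine to machine.

```python
import json
import time

# Hypothetical payload: 10,000 small records.
records = [{"id": i, "name": f"customer-{i}", "balance": i * 1.5} for i in range(10_000)]

start = time.perf_counter()
wire_bytes = json.dumps(records).encode("utf-8")    # marshal: objects -> bytes
marshal_s = time.perf_counter() - start

start = time.perf_counter()
restored = json.loads(wire_bytes.decode("utf-8"))   # unmarshal on the receiving side
unmarshal_s = time.perf_counter() - start

print(f"payload size: {len(wire_bytes)} bytes")
print(f"marshal:      {marshal_s * 1000:.2f} ms")
print(f"unmarshal:    {unmarshal_s * 1000:.2f} ms")
```

Both steps consume CPU time on each side of the wire, on top of the network latency itself, and the encoded size is what then competes for bandwidth.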

The network is homogeneous

The eighth and final Distributed Computing fallacy is "The network is homogeneous." While the first seven fallacies were coined by Peter Deutsch, I read [JDJ2004] that the eighth fallacy was added by James Gosling six years later (in 1997). Most architects today are not naïve enough to assume this fallacy. Any network, except maybe the very trivial ones, is not homogeneous. Heck, even my home network has a Linux-based HTPC, a couple of Windows-based PCs, a (small) NAS, and a Windows Mobile 2005 device, all connected by a wireless network. What's true on a home network is almost a certainty in enterprise networks. I believe that a homogeneous network today is the exception, not the rule. Even if you manage to keep your internal network homogeneous, you will hit this problem when you try to cooperate with a partner or a supplier.

Assuming this fallacy should not cause too much trouble at the lower network level, as IP is pretty much ubiquitous (e.g. even a specialized bus like Infiniband has an IP-over-IB implementation, although it may result in suboptimal use of the non-native IP resources). It is worthwhile to pay attention to the fact that the network is not homogeneous at the application level. The implication of this is that you have to assume interoperability will be needed sooner or later and be ready to support it from day one (or at least design where you'd add it later). Do not rely on proprietary protocols--it would be harder to integrate them later. Do use standard technologies that are widely accepted, the most notable examples being XML and Web Services. By the way, much of the popularity of XML and Web Services can be attributed to the fact that both these technologies help alleviate the effects of the heterogeneity of the enterprise environment.

To sum up, most architects/designers today are aware of this fallacy, which is why interoperable technologies are popular. Still, it is something you need to keep in mind, especially if you are in a situation that mandates use of proprietary protocols or transports.

Summary

With almost 15 years since the fallacies were drafted and more than 40 years since we started building distributed systems, the characteristics and underlying problems of distributed systems remain pretty much the same. What is more alarming is that architects, designers and developers are still tempted to wave some of these problems off, thinking technology solves everything. Remember that (successful) applications evolve and grow, so even if things look OK for a while, if you don't pay attention to the issues covered by the fallacies they will rear their ugly heads and bite you. I hope that reading this paper has both helped explain what the fallacies mean and provided some guidance on what to do to avoid their implications.

References
[Britton2001] IT Architecture & Middleware, C. Britton, Addison-Wesley 2001, ISBN 0-201-70907-4
[JDJ2004] http://java.sys-con.com/read/38665.htm
[Gosling] http://blogs.sun.com/roller/page/jag
[Ingo] http://blogs.thinktecture.com/ingo/archive/2005/11/08/LatencyVsBandwidth.aspx
[RichUI] http://richui.blogspot.com/2005/09/ajax-latency-problemsmyth-or-reality.html
[WareOnEarth] http://sd.wareonearth.com/~phil/jumbo.html
[Aladdin] http://www.esafe.com/home/csrt/statistics/statistics_2005.asp
[RipTech] http://www.riptech.com/
[CERT] http://www.cert.org/stats/#incidents
[Adams] http://www.dilbert.com/comics/dilbert/archive/dilbert20060516.html

Introduction to Distributed System Design

Table of Contents
Audience and Pre-Requisites
The Basics
So How Is It Done?
Remote Procedure Calls
Distributed Design Principles
Exercises
References

Audience and Pre-Requisites

This tutorial covers the basics of distributed systems design. The pre-requisites are significant programming experience with a language such as C++ or Java, a basic understanding of networking, and data structures & algorithms.

The Basics

What is a distributed system? It's one of those things that's hard to define without first defining many other things. Here is a "cascading" definition of a distributed system:

• A program is the code you write.
• A process is what you get when you run it.
• A message is used to communicate between processes.
• A packet is a fragment of a message that might travel on a wire.
• A protocol is a formal description of message formats and the rules that two processes must follow in order to exchange those messages.
• A network is the infrastructure that links computers, workstations, terminals, servers, etc. It consists of routers which are connected by communication links.
• A component can be a process or any piece of hardware required to run a process, support communications between processes, store data, etc.
• A distributed system is an application that executes a collection of protocols to coordinate the actions of multiple processes on a network, such that all components cooperate together to perform a single or small set of related tasks.

Why build a distributed system? There are lots of advantages including the ability to connect remote users with remote resources in an open and scalable way. When we say open, we mean each component is continually open to interaction with other components. When we say scalable, we mean the system can easily be altered to accommodate changes in the number of users, resources and computing entities. Thus, given the combined capabilities of the distributed components, a distributed system can be much larger and more powerful than combinations of stand-alone systems. But it's not easy - for a distributed system to be useful, it must be reliable. This is a difficult goal to achieve because of the complexity of the interactions between simultaneously running components. To be truly reliable, a distributed system must have the following characteristics:

• Fault-Tolerant: It can recover from component failures without performing incorrect actions.
• Highly Available: It can restore operations, permitting it to resume providing services even when some components have failed.
• Recoverable: Failed components can restart themselves and rejoin the system, after the cause of failure has been repaired.
• Consistent: The system can coordinate actions by multiple components often in the presence of concurrency and failure. This underlies the ability of a distributed system to act like a non-distributed system.
• Scalable: It can operate correctly even as some aspect of the system is scaled to a larger size. For example, we might increase the size of the network on which the system is running. This increases the frequency of network outages and could degrade a "non-scalable" system. Similarly, we might increase the number of users or servers, or overall load on the system. In a scalable system, this should not have a significant effect.
• Predictable Performance: The ability to provide desired responsiveness in a timely manner.
• Secure: The system authenticates access to data and services [1]

These are high standards, which are challenging to achieve. Probably the most difficult challenge is that a distributed system must be able to continue operating correctly even when components fail. This issue is discussed in the following excerpt of an interview with Ken Arnold. Ken is a research scientist at Sun, is one of the original architects of Jini, and was a member of the architectural team that designed CORBA.

Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the expectation of failure. Imagine asking people, "If the probability of something happening is one in 10^13, how often would it happen?" Common sense would be to answer, "Never." That is an infinitely large number in human terms. But if you ask a physicist, she would say, "All the time. In a cubic foot of air, those things happen all the time."

When you design distributed systems, you have to say, "Failure happens all the time." So when you design, you design for failure. It is your number one concern. What does designing for failure mean? One classic problem is partial failure. If I send a message to you and then a network failure occurs, there are two possible outcomes. One is that the message got to you, and then the network broke, and I just didn't get the response. The other is the message never got to you because the network broke before it arrived. So if I never receive a response, how do I know which of those two results happened? I cannot determine that without eventually finding you. The network has to be repaired or you have to come up, because maybe what happened was not a network failure but you died. How does this change how I design things? For one thing, it puts a multiplier on the value of simplicity. The more things I can do with you, the more things I have to think about recovering from. [2]
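The partial-failure problem described in the excerpt can be made concrete with a small simulation. Everything here is hypothetical (`Account`, `call_over_network` and the two failure labels); the point is only that the caller observes the same thing, no reply, in both cases while the remote state differs.

```python
class Account:
    """Stand-in for the remote party's state."""

    def __init__(self):
        self.balance = 0

    def deposit(self, amount):
        self.balance += amount
        return "ok"


def call_over_network(account, amount, failure):
    """Returns the reply, or None when the caller times out waiting for one."""
    if failure == "request_lost":
        return None                  # the network broke before the message arrived
    reply = account.deposit(amount)
    if failure == "reply_lost":
        return None                  # the message arrived, then the network broke
    return reply


outcomes = {}
for failure in ("request_lost", "reply_lost"):
    account = Account()
    reply = call_over_network(account, 100, failure)
    outcomes[failure] = (reply, account.balance)

for failure, (reply, balance) in outcomes.items():
    print(f"{failure}: caller saw {reply!r}, remote balance is {balance}")
```

From the caller's side the two runs are indistinguishable, so blindly resending the deposit is safe in one case and a double deposit in the other. That is why simpler, repeatable operations are easier to recover from.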

Handling failures is an important theme in distributed systems design. Failures fall into two obvious categories: hardware and software. Hardware failures were a dominant concern until the late 80's, but since then internal hardware reliability has improved enormously. Decreased heat production and power consumption of smaller circuits, reduction of off-chip connections and wiring, and high-quality manufacturing techniques have all played a positive role in improving hardware reliability. Today, problems are most often associated with connections and mechanical devices, i.e., network failures and drive failures. Software failures are a significant issue in distributed systems. Even with rigorous testing, software bugs account for a substantial fraction of unplanned downtime (estimated at 25-35%). Residual bugs in mature systems can be classified into two main categories [5].

• Heisenbug: A bug that seems to disappear or alter its characteristics when it is observed or researched. A common example is a bug that occurs in a release-mode compile of a program, but not when researched under debug-mode. The name "heisenbug" is a pun on the "Heisenberg uncertainty principle," a quantum physics term which is commonly (yet inaccurately) used to refer to the way in which observers affect the measurements of the things that they are observing, by the act of observing alone (this is actually the observer effect, and is commonly confused with the Heisenberg uncertainty principle).
• Bohrbug: A bug (named after the Bohr atom model) that, in contrast to a heisenbug, does not disappear or alter its characteristics when it is researched. A Bohrbug typically manifests itself reliably under a well-defined set of conditions. [6]

Heisenbugs tend to be more prevalent in distributed systems than in local systems. One reason for this is the difficulty programmers have in obtaining a coherent and comprehensive view of the interactions of concurrent processes.

Let's get a little more specific about the types of failures that can occur in a distributed system:

• Halting failures: A component simply stops. There is no way to detect the failure except by timeout: it either stops sending "I'm alive" (heartbeat) messages or fails to respond to requests. Your computer freezing is a halting failure.
• Fail-stop: A halting failure with some kind of notification to other components. A network file server telling its clients it is about to go down is a fail-stop.
• Omission failures: Failure to send/receive messages, primarily due to lack of buffering space, which causes a message to be discarded with no notification to either the sender or receiver. This can happen when routers become overloaded.
• Network failures: A network link breaks.
• Network partition failure: A network fragments into two or more disjoint subnetworks within which messages can be sent, but between which messages are lost. This can occur due to a network failure.
• Timing failures: A temporal property of the system is violated. For example, clocks on different computers that are used to coordinate processes are not synchronized, or a message is delayed longer than a threshold period.
• Byzantine failures: This captures several types of faulty behaviors including data corruption or loss, failures caused by malicious programs, etc. [1]
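The timeout-based detection of halting failures described above can be sketched as follows. This is a minimal illustration, not a production failure detector: the class name `HeartbeatMonitor`, the peer names, and the 3-second timeout are all assumptions made for the example, and time is passed in explicitly so the behavior is easy to follow.

```python
class HeartbeatMonitor:
    """Suspects a peer once no heartbeat has arrived within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # peer name -> time of its most recent heartbeat

    def heartbeat(self, peer, now):
        """Record an "I'm alive" message received from `peer` at time `now`."""
        self.last_seen[peer] = now

    def suspected(self, now):
        """Return the peers that have been silent for longer than the timeout."""
        return sorted(p for p, t in self.last_seen.items() if now - t > self.timeout)


# Hypothetical peers: both report at t=0, then only name-server keeps reporting.
monitor = HeartbeatMonitor(timeout=3.0)
monitor.heartbeat("file-server", now=0.0)
monitor.heartbeat("name-server", now=0.0)
monitor.heartbeat("name-server", now=2.5)
print(monitor.suspected(now=4.0))  # file-server has been silent for 4.0 s
```

Note that the monitor can only suspect a peer. A silent peer may have halted, or its heartbeats may simply have been lost or delayed by the network, which is exactly the ambiguity the failure list describes.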

Our goal is to design a distributed system with the characteristics listed above (fault-tolerant, highly available, recoverable, etc.), which means we must design for failure. To design for failure, we must be careful to not make any assumptions about the reliability of the components of a system. Everyone, when they first build a distributed system, makes the following eight assumptions. These are so well-known in this field that they are commonly referred to as the "8 Fallacies" [3].

1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.

• Latency: The time between initiating a request for data and the beginning of the actual data transfer.
• Bandwidth: A measure of the capacity of a communications channel. The higher a channel's bandwidth, the more information it can carry.
• Topology: The different configurations that can be adopted in building networks, such as a ring, bus, star or meshed.
• Homogeneous network: A network running a single network protocol.

So How Is It Done?

Building a reliable system that runs over an unreliable communications network seems like an impossible goal. We are forced to deal with uncertainty. A process knows its own state, and it knows what state other processes were in recently. But the processes have no way of knowing each other's current state. They lack the equivalent of shared memory. They also lack accurate ways to detect failure, or to distinguish a local software/hardware failure from a communication failure.

Distributed systems design is obviously a challenging endeavor. How do we do it when we are not allowed to assume anything, and there are so many complexities? We start by limiting the scope. We will focus on a particular type of distributed systems design, one that uses a client-server model with mostly standard protocols. It turns out that these standard protocols provide considerable help with the low-level details of reliable network communications, which makes our job easier. Let's start by reviewing client-server technology and the protocols.

In client-server applications, the server provides some service, such as processing database queries or sending out current stock prices. The client uses the service provided by the server, either displaying database query results to the user or making stock purchase recommendations to an investor. The communication that occurs between the client and the server must be reliable. That is, no data can be dropped and it must arrive on the client side in the same order in which the server sent it.

There are many types of servers we encounter in a distributed system. For example, file servers manage disk storage units on which file systems reside. Database servers house databases and make them available to clients. Network name servers implement a mapping between a symbolic name or a service description and a value such as an IP address and port number for a process that provides the service.

In distributed systems, there can be many servers of a particular type, e.g., multiple file servers or multiple network name servers. The term service is used to denote a set of servers of a particular type. We say that a binding occurs when a process that needs to access a service becomes associated with a particular server which provides the service. There are many binding policies that define how a particular server is chosen. For example, the policy could be based on locality (a Unix NIS client starts by looking first for a server on its own machine); or it could be based on load balance (a CICS client is bound in such a way that uniform responsiveness for all clients is attempted).

A distributed service may employ data replication, where a service maintains multiple copies of data to permit local access at multiple locations, or to increase availability when a server process may have crashed.

Caching is a related concept and very common in distributed systems. We say a process has cached data if it maintains a copy of the data locally, for quick access if it is needed again. A cache hit is when a request is satisfied from cached data, rather than from the primary service. For example, browsers use document caching to speed up access to frequently used documents. Caching is similar to replication, but cached data can become stale. Thus, there may need to be a policy for validating a cached data item before using it. If a cache is actively refreshed by the primary service, caching is identical to replication. [1]

As mentioned earlier, the communication between client and server needs to be reliable. You have probably heard of TCP/IP before. The Internet Protocol (IP) suite is the set of communication protocols that allow for communication on the Internet and most commercial networks. The Transmission Control Protocol (TCP) is one of the core protocols of this suite. Using TCP, clients and servers can create connections to one another, over which they can exchange data in packets. The protocol guarantees reliable and in-order delivery of data from sender to receiver.

The IP suite can be viewed as a set of layers, each layer having the property that it only uses the functions of the layer below, and only exports functionality to the layer above. A system that implements protocol behavior consisting of layers is known as a protocol stack. Protocol stacks can be implemented either in hardware or software, or a mixture of both. Typically, only the lower layers are implemented in hardware, with the higher layers being implemented in software.
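Returning to the cache validation policy mentioned above, one possible policy is to compare a version number with the primary service before trusting a cached item. The sketch below assumes this version-number scheme; `PrimaryService`, `ValidatingCache`, and the key `"doc"` are hypothetical names chosen for illustration, and both objects live in one process so the example stays runnable.

```python
class PrimaryService:
    """Authoritative copy of the data; each write bumps a version number."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def write(self, key, value):
        version = self._data.get(key, (0, None))[0] + 1
        self._data[key] = (version, value)

    def read(self, key):
        return self._data[key]      # full (version, value) transfer

    def version(self, key):
        return self._data[key][0]   # cheap validation request


class ValidatingCache:
    """Serves a cached value only if its version still matches the primary."""

    def __init__(self, primary):
        self.primary = primary
        self.entries = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        cached = self.entries.get(key)
        if cached is not None and cached[0] == self.primary.version(key):
            self.hits += 1          # cache hit: the local copy is still valid
            return cached[1]
        self.misses += 1            # absent or stale: refetch from the primary
        self.entries[key] = self.primary.read(key)
        return self.entries[key][1]


primary = PrimaryService()
primary.write("doc", "first draft")
cache = ValidatingCache(primary)
cache.get("doc")                      # miss: nothing cached yet
cache.get("doc")                      # hit: version unchanged
primary.write("doc", "second draft")  # cached copy is now stale
print(cache.get("doc"), cache.hits, cache.misses)
```

The trade-off in this design is that every read still costs one small validation round trip. A policy that skips validation is faster but may return stale data.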

Resource : The history of TCP/IP mirrors the evolution of the Internet. Here is a brief overview of this history.

There are four layers in the IP suite:

1. Application Layer: The application layer is used by most programs that require network communication. Data is passed down from the program in an application-specific format to the next layer, then encapsulated into a transport layer protocol. Examples of applications are HTTP, FTP or Telnet.

2. Transport Layer: The transport layer's responsibilities include end-to-end message transfer independent of the underlying network, along with error control, fragmentation and flow control. End-to-end message transmission at the transport layer can be categorized as either connection-oriented (TCP) or connectionless (UDP).

TCP is the more sophisticated of the two protocols, providing reliable delivery. First, TCP ensures that the receiving computer is ready to accept data. It uses a three-packet handshake in which both the sender and receiver agree that they are ready to communicate. Second, TCP makes sure that data gets to its destination. If the receiver doesn't acknowledge a particular packet, TCP automatically retransmits the packet, typically three times. If necessary, TCP can also split large packets into smaller ones so that data can travel reliably between source and destination. TCP drops duplicate packets and rearranges packets that arrive out of sequence.

UDP is similar to TCP in that it is a protocol for sending and receiving packets across a network, but with two major differences. First, it is connectionless. This means that one program can send off a load of packets to another, but that's the end of their relationship. The second might send some back to the first and the first might send some more, but there's never a solid connection. UDP is also different from TCP in that it doesn't provide any sort of guarantee that the receiver will receive the packets that are sent in the right order. All that is guaranteed is the packet's contents. This means it's a lot faster, because there's no extra overhead for error-checking above the packet level. For this reason, games often use this protocol. In a game, if one packet for updating a screen position goes missing, the player will just jerk a little. The other packets will simply update the position, and the missing packet - although making the movement a little rougher - won't change anything.

Although TCP is more reliable than UDP, the protocol is still at risk of failing in many ways. TCP uses acknowledgements and retransmission to detect and repair loss. But it cannot overcome longer communication outages that disconnect the sender and receiver for long enough to defeat the retransmission strategy. The normal maximum disconnection time is between 30 and 90 seconds. TCP could signal a failure and give up when both end-points are fine. This is just one example of how TCP can fail, even though it does provide some mitigating strategies.

3. Network Layer: As originally defined, the Network layer solves the problem of getting packets across a single network. With the advent of the concept of internetworking, additional functionality was added to this layer, namely getting data from a source network to a destination network. This generally involves routing the packet across a network of networks, e.g. the Internet. IP performs the basic task of getting packets of data from source to destination.

4. Link Layer: The link layer deals with the physical transmission of data, and usually involves placing frame headers and trailers on packets for travelling over the physical network and dealing with physical components along the way.
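The acknowledgement-and-retransmission idea from the transport layer discussion can be sketched with a toy stop-and-wait sender. This is a deliberately simplified model and not how TCP is actually implemented (real TCP uses sliding windows, adaptive timers and byte sequence numbers). The function name `send_reliably`, the event labels, and the scripted channel are assumptions made so the example is deterministic.

```python
def send_reliably(segments, channel_events, max_retries=3):
    """Stop-and-wait sketch: resend each segment until it is acknowledged.

    channel_events scripts the fate of each transmission, in order:
    "ok", "data_lost" (segment never arrives) or "ack_lost" (segment
    arrives but its acknowledgement does not). Missing entries mean "ok".
    Returns (delivered payloads, number of transmissions).
    """
    events = iter(channel_events)
    delivered = []       # what the receiver hands to the application
    next_expected = 0    # receiver side: next in-order sequence number
    transmissions = 0
    for seq, payload in enumerate(segments):
        for _ in range(1 + max_retries):
            transmissions += 1
            event = next(events, "ok")
            if event == "data_lost":
                continue              # no ack arrives; sender times out and retransmits
            if seq == next_expected:  # in-order segment: accept it
                delivered.append(payload)
                next_expected += 1
            # otherwise it is a duplicate of a delivered segment and is dropped
            if event == "ack_lost":
                continue              # sender cannot tell the data arrived; retransmit
            break                     # ack received; move on to the next segment
        else:
            raise TimeoutError(f"segment {seq} unacknowledged after {1 + max_retries} tries")
    return delivered, transmissions


delivered, sent = send_reliably(["a", "b", "c"],
                                ["ok", "data_lost", "ok", "ack_lost", "ok"])
print(delivered, sent)
```

If every acknowledgement for a segment is lost, this sender raises `TimeoutError` even though the receiver got the data, which mirrors the remark that the protocol can signal a failure when both end-points are fine.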

Resource : For more information on the IP Suite, refer to the Wikipedia article.

Remote Procedure Calls

Many distributed systems were built using TCP/IP as the foundation for the communication between components. Over time, an efficient method for clients to interact with servers evolved called RPC, which means remote procedure call. It is a powerful technique based on extending the notion of local procedure calling, so that the called procedure may not exist in the same address space as the calling procedure. The two processes may be on the same system, or they may be on different systems with a network connecting them.

An RPC is similar to a function call. Like a function call, when an RPC is made, the arguments are passed to the remote procedure and the caller waits for a response to be returned. In the illustration below, the client makes a procedure call that sends a request to the server. The client process waits until either a reply is received, or it times out. When the request arrives at the server, it calls a dispatch routine that performs the requested service, and sends the reply to the client. After the RPC call is completed, the client process continues.

Threads are common in RPC-based distributed systems. Each incoming request to a server typically spawns a new thread. A thread in the client typically issues an RPC and then blocks (waits). When the reply is received, the client thread resumes execution.

A programmer writing RPC-based code does three things:

1. Specifies the protocol for client-server communication
2. Develops the client program
3. Develops the server program

The communication protocol is created by stubs generated by a protocol compiler. A stub is a routine that doesn't actually do much other than declare itself and the parameters it accepts. The stub contains just enough code to allow it to be compiled and linked. The client and server programs must communicate via the procedures and data types specified in the protocol. The server side registers the procedures that may be called by the client and receives and returns data required for processing. The client side calls the remote procedure, passes any required data and receives the returned data. Thus, an RPC application uses classes generated by the stub generator to execute an RPC and wait for it to finish. The programmer needs to supply classes on the server side that provide the logic for handling an RPC request.

RPC introduces a set of error cases that are not present in local procedure programming. For example, a binding error can occur when a server is not running when the client is started. Version mismatches occur if a client was compiled against one version of a server, but the server has now been updated to a newer version. A timeout can result from a server crash, network problem, or a problem on a client computer.
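The division of labor between the client stub, the server's dispatch routine, and the registered procedures can be sketched as below. This is a hand-written toy, not the output of any real protocol compiler: the JSON wire format and the names `Server`, `ClientStub`, and `add` are assumptions for illustration, and the "network" is a direct function call so the example runs in one process.

```python
import json


class Server:
    """Holds the procedures a client may call and dispatches requests to them."""

    def __init__(self):
        self._procedures = {}

    def register(self, name, func):
        self._procedures[name] = func

    def dispatch(self, request_bytes):
        """Unmarshal a request, run the procedure, marshal the reply."""
        request = json.loads(request_bytes)
        func = self._procedures.get(request["proc"])
        if func is None:
            reply = {"error": "unknown procedure " + request["proc"]}
        else:
            reply = {"result": func(*request["args"])}
        return json.dumps(reply).encode()


class ClientStub:
    """Makes a remote call look like a local one by marshalling the arguments."""

    def __init__(self, transport):
        self._transport = transport  # callable: request bytes -> reply bytes

    def call(self, proc, *args):
        request = json.dumps({"proc": proc, "args": list(args)}).encode()
        reply = json.loads(self._transport(request))  # the caller blocks here
        if "error" in reply:
            raise RuntimeError(reply["error"])
        return reply["result"]


server = Server()
server.register("add", lambda a, b: a + b)
stub = ClientStub(transport=server.dispatch)  # stand-in for a network connection
print(stub.call("add", 2, 3))
```

Because arguments cross the boundary as marshalled bytes rather than as shared memory, only values that can be serialized make sense as parameters in this sketch.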

Some RPC applications view these types of errors as unrecoverable. Fault-tolerant systems, however, have alternate sources for critical services and fail over from a primary server to a backup server.

A challenging error-handling case occurs when a client needs to know the outcome of a request in order to take the next step, after failure of a server. This can sometimes result in incorrect actions and results. For example, suppose a client process requests a ticket-selling server to check for a seat in the orchestra section of Carnegie Hall. If it's available, the server records the request and the sale. But the request fails by timing out. Was the seat available and the sale recorded? Even if there is a backup server to which the request can be re-issued, there is a risk that the client will be sold two tickets, which is an expensive mistake in Carnegie Hall [1].

Here are some common error conditions that need to be handled:

• Network data loss resulting in retransmit: Often, a system tries to achieve 'at most once' delivery. In the worst case, if duplicate transmissions occur, we try to minimize any damage done by the data being received multiple times.
• Server process crashes during RPC operation: If a server process crashes before it completes its task, the system usually recovers correctly because the client will initiate a retry request once the server has recovered. If the server crashes after completing the task but before the RPC reply is sent, duplicate requests sometimes result due to client retries.
• Client process crashes before receiving response: The client is restarted, and the server discards the response data.
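One common way to limit the damage from duplicate requests is to tag each request with a unique identifier and have the server remember the reply it already produced. The sketch below applies that idea to the ticket example. It is an assumption-laden illustration: `TicketServer`, the request IDs, and the seat name are hypothetical, and the table of completed requests is kept in memory, so it would not survive the server crashes discussed above unless it were also logged or replicated.

```python
class TicketServer:
    """Sells each seat at most once and replays replies for repeated requests."""

    def __init__(self, seats):
        self.available = set(seats)
        self.sold = {}        # seat -> buyer
        self.completed = {}   # request_id -> reply already produced

    def buy(self, request_id, buyer, seat):
        if request_id in self.completed:
            # Duplicate of a request we already handled: replay the old reply
            # instead of executing the sale a second time.
            return self.completed[request_id]
        if seat in self.available:
            self.available.remove(seat)
            self.sold[seat] = buyer
            reply = ("sold", seat)
        else:
            reply = ("unavailable", seat)
        self.completed[request_id] = reply
        return reply


server = TicketServer({"orchestra-12"})
first = server.buy("req-1", "alice", "orchestra-12")  # suppose this reply is lost
retry = server.buy("req-1", "alice", "orchestra-12")  # client times out and retries
print(first, retry)
```

The client must reuse the same request identifier on every retry. A fresh identifier would be treated as a new purchase attempt.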

Some Distributed Design Principles

Given what we have covered so far, we can define some fundamental design principles which every distributed system designer and software engineer should know. Some of these may seem obvious, but it will be helpful as we proceed to have a good starting list.

• As Ken Arnold says: "You have to design distributed systems with the expectation of failure." Avoid making assumptions that any component in the system is in a particular state. A classic error scenario is for a process to send data to a process running on a second machine. The process on the first machine receives some data back and processes it, and then sends the results back to the second machine assuming it is ready to receive. Any number of things could have failed in the interim and the sending process must anticipate these possible failures.
• Explicitly define failure scenarios and identify how likely each one is to occur. Make sure your code is thoroughly covered for the most likely ones.
• Both clients and servers must be able to deal with unresponsive senders/receivers.
• Think carefully about how much data you send over the network. Minimize traffic as much as possible.
• Latency is the time between initiating a request for data and the beginning of the actual data transfer. Minimizing latency sometimes comes down to a question of whether you should make many little calls/data transfers or one big call/data transfer. The way to make this decision is to experiment. Do small tests to identify the best compromise.
• Don't assume that data sent across a network (or even sent from disk to disk in a rack) is the same data when it arrives. If you must be sure, do checksums or validity checks on data to verify that the data has not changed.
• Caches and replication strategies are methods for dealing with state across components. We try to minimize stateful components in distributed systems, but it's challenging. State is something held in one place on behalf of a process that is in another place, something that cannot be reconstructed by any other component. If it can be reconstructed it's a cache. Caches can be helpful in mitigating the risks of maintaining state across components. But cached data can become stale, so there may need to be a policy for validating a cached data item before using it. If a process stores information that can't be reconstructed, then problems arise. One possible question is, "Are you now a single point of failure?" I have to talk to you now - I can't talk to anyone else. So what happens if you go down? To deal with this issue, you could be replicated. Replication strategies are also useful in mitigating the risks of maintaining state. But there are challenges here too: What if I talk to one replicant and modify some data, then I talk to another? Is that modification guaranteed to have already arrived at the other? What happens if the network gets partitioned and the replicants can't talk to each other? Can anybody proceed? There are a set of tradeoffs in deciding how and where to maintain state, and when to use caches and replication. It's more difficult to run small tests in these scenarios because of the overhead in setting up the different mechanisms.
• Be sensitive to speed and performance. Take time to determine which parts of your system can have a significant impact on performance: Where are the bottlenecks and why? Devise small tests you can do to evaluate alternatives. Profile and measure to learn more. Talk to your colleagues about these alternatives and your results, and decide on the best solution.
• Acks are expensive and tend to be avoided in distributed systems wherever possible.
• Retransmission is costly. It's important to experiment so you can tune the delay that prompts a retransmission to be optimal.
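The checksum advice in the list above can be sketched with Python's standard `hashlib` module. The framing used here (a 32-byte SHA-256 digest placed in front of the payload) and the function names `wrap` and `unwrap` are assumptions chosen for illustration, not a standard wire format.

```python
import hashlib

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def wrap(payload: bytes) -> bytes:
    """Prefix the payload with its SHA-256 digest before sending."""
    return hashlib.sha256(payload).digest() + payload


def unwrap(message: bytes) -> bytes:
    """Verify the digest on receipt and return the payload, or raise ValueError."""
    digest, payload = message[:DIGEST_SIZE], message[DIGEST_SIZE:]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("checksum mismatch: data changed in transit")
    return payload


message = wrap(b"seat orchestra-12 sold to alice")
print(unwrap(message))

# Simulate corruption in transit by flipping one bit of the last byte.
corrupted = bytearray(message)
corrupted[-1] ^= 0x01
try:
    unwrap(bytes(corrupted))
except ValueError as err:
    print(err)
```

A plain digest like this detects accidental corruption only. Guarding against deliberate tampering would need a keyed construction, which is outside the scope of this sketch.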

Exercises

1. Have you ever encountered a Heisenbug? How did you isolate and fix it?
2. For the different failure types listed above, consider what makes each one difficult for a programmer trying to guard against it. What kinds of processing can be added to a program to deal with these failures?
3. Explain why each of the 8 fallacies is actually a fallacy.
4. Contrast TCP and UDP. Under what circumstances would you choose one over the other?
5. What's the difference between caching and data replication?
6. What are stubs in an RPC implementation?
7. What are some of the error conditions we need to guard against in a distributed environment that we do not need to worry about in a local programming environment?
8. Why are pointers (references) not usually passed as parameters to a Remote Procedure Call?
9. Here is an interesting problem called partial connectivity that can occur in a distributed environment. Let's say A and B are systems that need to talk to each other. C is a master that also talks to A and B individually. The communications between A and B fail. C can tell that A and B are both healthy. C tells A to send something to B and waits for this to occur. C has no way of knowing that A cannot talk to B, and thus waits and waits and waits. What diagnostics can you add in your code to deal with this situation?
10. What is the leader-election algorithm? How can it be used in a distributed system?
11. This is the Byzantine Generals problem: Two generals are on hills either side of a valley. They each have an army of 1000 soldiers. In the woods in the valley is an enemy army of 1500 men. If each general attacks alone, his army will lose. If they attack together, they will win. They wish to send messengers through the valley to coordinate when to attack. However, the messengers may get lost or caught in the woods (or brainwashed into delivering different messages). How can they devise a scheme by which they either attack with high probability, or not at all?

References

[1] Birman, Kenneth. Reliable Distributed Systems: Technologies, Web Services and Applications. New York: Springer-Verlag, 2005.
[2] Interview with Ken Arnold
[3] The Eight Fallacies
[4] Wikipedia article on IP Suite
[5] Gray, J. and Reuter, A. Transaction Processing: Concepts and Techniques. San Mateo, CA: Morgan Kaufmann, 1993.
[6] Bohrbugs and Heisenbugs

INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE

Logical Time: A Way to Capture Causality in Distributed Systems M. Raynal, M. Singhal

Research report No. 2472, March 1995, 22 pages

Programme 1: Parallel architectures, databases, networks and distributed systems. Project Adp.

ISSN 0249-6399

Abstract: The concept of causality between events is fundamental to the design and analysis of parallel and distributed computing and operating systems. Usually causality is tracked using physical time, but in distributed systems setting, there is no built-in physical time and it is only possible to realize an approximation of it. As asynchronous distributed computations make progress in spurts, it turns out that the logical time, which advances in jumps, is sufficient to capture the fundamental monotonicity property associated with causality in distributed systems. This paper reviews three ways to define logical time (e.g., scalar time, vector time, and matrix time) that have been proposed to capture causality between events of a distributed computation.

Key-words: Distributed systems, causality, logical time, happens before, scalar time, vector time, matrix time.

IRISA, Campus de Beaulieu, 35042 Rennes-Cedex. Dept. of Computer, Information Science, Columbus, OH 43210.

INRIA Rennes research unit: IRISA, Campus universitaire de Beaulieu, 35042 RENNES Cedex (France). Telephone: (33) 99 84 71 00. Fax: (33) 99 84 71 71.

Logical time in distributed systems, or how to capture causality

Summary: This report examines the different logical clock mechanisms that have been proposed to capture the causality relation between events of a distributed system. Key-words: causality, precedence, distributed system, logical time, linear time, vector time, matrix time.

Causality in Distributed Systems


1 Introduction

A distributed computation consists of a set of processes that cooperate and compete to achieve a common goal. These processes do not share a common global memory and communicate solely by passing messages over a communication network. The communication delay is finite but unpredictable. The actions of a process are modeled as three types of events, namely, internal events, message send events, and message receive events. An internal event only affects the process at which it occurs, and the events at a process are linearly ordered by their order of occurrence. Moreover, send and receive events signify the flow of information between processes and establish causal dependency from the sender process to the receiver process. It follows that the execution of a distributed application results in a set of distributed events produced by the processes. The causal precedence relation induces a partial order on the events of a distributed computation.

Causality (or the causal precedence relation) among events in a distributed system is a powerful concept in reasoning, analyzing, and drawing inferences about a computation. The knowledge of the causal precedence relation among processes helps solve a variety of problems in distributed systems. Among them we find:

• Distributed algorithms design: The knowledge of the causal precedence relation among events helps ensure liveness and fairness in mutual exclusion algorithms, helps maintain consistency in replicated databases, and helps design correct deadlock detection algorithms to avoid phantom and undetected deadlocks.
• Tracking of dependent events: In distributed debugging, the knowledge of the causal dependency among events helps construct a consistent state for resuming reexecution; in failure recovery, it helps build a checkpoint; in replicated databases, it aids in the detection of file inconsistencies in case of a network partitioning.
• Knowledge about the progress: The knowledge of the causal dependency among events helps a process measure the progress of other processes in the distributed computation. This is useful in discarding obsolete information, garbage collection, and termination detection.
• Concurrency measure: The knowledge of how many events are causally dependent is useful in measuring the amount of concurrency in a computation. All events that are not causally related can be executed concurrently. Thus, an analysis of the causality in a computation gives an idea of the concurrency in the program.


The concept of causality is widely used by human beings, often unconsciously, in planning, scheduling, and execution of a chore or an enterprise, in determining infeasibility of a plan or the innocence of an accused. In day-to-day life, the global time to deduce causality relation is obtained from loosely synchronized clocks (i.e., wrist watches, wall clocks). However, in distributed computing systems, the rate of occurrence of events is several orders of magnitude higher and the event execution time is several orders of magnitude smaller; consequently, if the physical clocks are not precisely synchronized, the causality relation between events may not be accurately captured. However, in a distributed computation, progress is made in spurts and the interaction between processes occurs in spurts; consequently, it turns out that in a distributed computation, the causality relation between events produced by a program execution, and its fundamental monotonicity property, can be accurately captured by logical clocks.

In a system of logical clocks, every process has a logical clock that is advanced using a set of rules. Every event is assigned a timestamp and the causality relation between events can be generally inferred from their timestamps. The timestamps assigned to events obey the fundamental monotonicity property; that is, if an event a causally affects an event b, then the timestamp of a is smaller than the timestamp of b.

This paper first presents a general framework of a system of logical clocks in distributed systems and then discusses three ways to implement logical time in a distributed system. In the first method, Lamport's scalar clocks, the time is represented by non-negative integers; in the second method, the time is represented by a vector of non-negative integers; in the third method, the time is represented as a matrix of non-negative integers.

The rest of the paper is organized as follows: The next section presents a model of the execution of a distributed computation. Section 3 presents a general framework of logical clocks as a way to capture causality in a distributed computation. Sections 4 through 6 discuss three popular systems of logical clocks, namely, scalar, vector, and matrix clocks. Section 7 discusses efficient implementations of the systems of logical clocks. Finally, Section 8 concludes the paper.

2 A Model of Distributed Executions

2.1 General Context

A distributed program is composed of a set of n asynchronous processes $p_1, p_2, \ldots, p_i, \ldots, p_n$ that communicate by message passing over a communication network. The processes do not share a global memory and communicate solely by passing messages. The communication delay is finite and unpredictable. Also, these processes do not share a global clock that is instantaneously accessible to these processes. Process execution and message transfer are asynchronous: a process may execute an event spontaneously, and a process sending a message does not wait for the delivery of the message to be complete.

2.2 Distributed Executions

The execution of process $p_i$ produces a sequence of events $e_i^0, e_i^1, \ldots, e_i^x, e_i^{x+1}, \ldots$ and is denoted by $\mathcal{H}_i$, where

$$\mathcal{H}_i = (h_i, \rightarrow_i),$$

$h_i$ is the set of events produced by $p_i$, and the binary relation $\rightarrow_i$ defines a total order on these events. Relation $\rightarrow_i$ expresses causal dependencies among the events of $p_i$.

A relation $\rightarrow_{msg}$ is defined as follows. For every message $m$ that is exchanged between two processes, we have

$$send(m) \rightarrow_{msg} receive(m).$$

Relation $\rightarrow_{msg}$ defines causal dependencies between the pairs of corresponding send and receive events.

A distributed execution of a set of processes is a partial order $\mathcal{H} = (H, \rightarrow)$, where

$$H = \bigcup_i h_i \quad \text{and} \quad \rightarrow \; = \Big( \bigcup_i \rightarrow_i \; \cup \; \rightarrow_{msg} \Big)^{+}.$$

Relation $\rightarrow$ expresses causal dependencies among the events in the distributed execution of a set of processes. If $e_1 \rightarrow e_2$, then event $e_2$ is directly or transitively dependent on event $e_1$. If $e_1 \not\rightarrow e_2$ and $e_2 \not\rightarrow e_1$, then events $e_1$ and $e_2$ are said to be concurrent and are denoted as $e_1 \parallel e_2$. Clearly, for any two events $e_1$ and $e_2$ in a distributed execution, $e_1 \rightarrow e_2$ or $e_2 \rightarrow e_1$, or $e_1 \parallel e_2$.

Figure 1 shows the time diagram of a distributed execution involving three processes. A horizontal line represents the progress of the process; a dot indicates an event; a slant arrow indicates a message transfer. In this execution, $a \rightarrow b$, $b \rightarrow d$, and $b \parallel c$.
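The construction of the causal relation as a transitive closure can be sketched directly. The code below builds the union of the per-process orders and the message relation, closes it with Warshall's algorithm, and tests concurrency. The execution it uses is hypothetical (it is not the execution of Figure 1), and the event names `x1`, `y1`, and so on are invented for the example.

```python
from itertools import product


def happened_before(process_events, messages):
    """Return the set of pairs (e1, e2) such that e1 -> e2.

    process_events: one list per process, with events in local order.
    messages: (send_event, receive_event) pairs.
    """
    relation = set(messages)                      # the message relation
    for events in process_events:
        relation.update(zip(events, events[1:]))  # consecutive local events
    all_events = [e for events in process_events for e in events]
    # Warshall's algorithm: the intermediate event k is the outermost loop.
    for k, a, b in product(all_events, repeat=3):
        if (a, k) in relation and (k, b) in relation:
            relation.add((a, b))
    return relation


def concurrent(e1, e2, relation):
    """Two distinct events are concurrent if neither causally precedes the other."""
    return e1 != e2 and (e1, e2) not in relation and (e2, e1) not in relation


# Hypothetical execution: x1 sends a message received at y1,
# and y2 sends a message received at z2.
processes = [["x1", "x2"], ["y1", "y2"], ["z1", "z2"]]
messages = [("x1", "y1"), ("y2", "z2")]
hb = happened_before(processes, messages)

print(("x1", "z2") in hb)          # x1 -> y1 -> y2 -> z2 by transitivity
print(concurrent("x2", "y1", hb))  # no causal path in either direction
```

The cubic closure is fine for a handful of events but would be too slow for long executions, which is one motivation for encoding causality in timestamps instead.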

2.3 Distributed Executions at an Observation Level

Generally, at a level or for an application, only a few events are relevant. For example, in a checkpointing protocol, only local checkpoint events are relevant. Let $R$ denote the set of relevant events. Let $\rightarrow_R$ be the restriction of $\rightarrow$ to the events in $R$; that is,

$$\forall\, e_1, e_2 \in R: \quad e_1 \rightarrow_R e_2 \iff e_1 \rightarrow e_2.$$

An observation level defines a projection of the events in the distributed computation. The distributed computation defined by the observation level $R$ is denoted as $\mathcal{R} = (R, \rightarrow_R)$. For example, if in Figure 1 only events a, b, c, and d are relevant to an observation level (i.e., $R = \{a, b, c, d\}$), then relation $\rightarrow_R$ is defined as follows: $\rightarrow_R = \{(a, b), (a, c), (a, d), (b, d), (c, d)\}$.

[Figure 1: The time diagram of a distributed execution, with processes $p_1$, $p_2$, $p_3$ and labeled events a, b, c, d.]

3 Logical Clocks: A Mechanism to Capture Causality

3.1 Definition

A system of logical clocks consists of a time domain $T$ and a logical clock $C$. Elements of $T$ form a partially ordered set over a relation $<$. [...]

• R1: [...]

$C_i := C_i + d \quad (d > 0)$

(each time R1 is executed, $d$ can have a different value).

• R2: Each message piggybacks the clock value of its sender at sending time. When a process $p_i$ receives a message with timestamp $C_{msg}$, it executes the following actions:
– $C_i := \max(C_i, C_{msg})$
– Execute R1.
– Deliver the message.

Figure 2 shows the evolution of scalar time with d=1 for the computation in Figure 1.
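Rules R1 and R2 for scalar time can be sketched as a small Python class. This is an illustrative sketch only: the class and method names are hypothetical, and it assumes that a send is itself an event (so R1 is applied before the clock value is piggybacked).

```python
class ScalarClock:
    """Scalar logical clock of one process, following rules R1 and R2."""

    def __init__(self, d=1):
        self.d = d      # increment used by R1; must be > 0
        self.time = 0   # C_i, the current clock value

    def tick(self):
        """R1: advance the clock before executing an event."""
        self.time += self.d
        return self.time

    def send(self):
        """A send is an event: apply R1, then piggyback the clock value."""
        return self.tick()

    def receive(self, c_msg):
        """R2: take the max with the piggybacked value, then apply R1."""
        self.time = max(self.time, c_msg)
        return self.tick()


p1, p2 = ScalarClock(), ScalarClock()
p1.tick()                 # internal event at p1
stamp = p1.send()         # send event at p1, message carries its timestamp
p2.tick()                 # internal event at p2
recv = p2.receive(stamp)  # receive event at p2
```

With d=1, the send is stamped 2 and the receive is stamped max(1, 2) + 1 = 3, so the receive is always timestamped later than the corresponding send.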

4.2 Basic Property

Clearly, scalar clocks satisfy the monotonicity and hence the consistency property. In addition, scalar clocks can be used to totally order the events in a distributed system as follows [7]: The timestamp of an event is denoted by a tuple $(t, i)$ where $t$ is its time of occurrence and $i$ is the identity of the process where it occurred. The total order relation $\prec$ on two events $x$ and $y$ with timestamps $(h, i)$ and $(k, j)$, respectively, is defined as follows:

$x \prec y \iff (h < k \text{ or } (h = k \text{ and } i < j))$

The total order is consistent with the causality relation "$\rightarrow$". The total order is generally used to ensure liveness properties in distributed algorithms (requests are timestamped and served according to the total order on these timestamps) [7].
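The total order on (time, process identity) pairs is a lexicographic comparison, which can be sketched as follows. The function name `precedes` and the sample timestamps are hypothetical and chosen only for illustration.

```python
def precedes(x, y):
    """Return True if x precedes y in the total order.

    x = (h, i) and y = (k, j) are (scalar time, process id) timestamps.
    """
    (h, i), (k, j) = x, y
    return h < k or (h == k and i < j)


# Hypothetical timestamps; two events share scalar time 3.
events = [(3, 2), (1, 3), (3, 1), (2, 1)]
# Python compares tuples lexicographically, which is the same rule.
ordered = sorted(events)
```

Ties on the scalar value are broken by the process identity, so any two distinct events are ordered.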

INRIA


Causality in Distributed Systems

[Figure 2: Evolution of scalar time.]

If the increment value d is always 1, the scalar time has the following interesting property: if event $e$ has a timestamp $h$, then $h-1$ represents the minimum logical duration, counted in units of events, required before producing the event $e$ [2]; we call it the height of the event $e$, in short $height(e)$. In other words, $h-1$ events have been produced sequentially before the event $e$ regardless of the processes that produced these events. For example, in Figure 2, five events precede event $b$ on the longest causal path ending at $b$.

However, a system of scalar clocks is not strongly consistent; that is, for two events $e_1$ and $e_2$, $C(e_1) < C(e_2) \not\Rightarrow e_1 \rightarrow e_2$. The reason is that the logical local clock and logical global clock of a process are squashed into one, resulting in the loss of information.

5 Vector Time

5.1 Definition

The system of vector clocks was developed independently by Fidge [2,3], Mattern [8], and Schmuck [10]. (A brief historical perspective of vector clocks is given in Annex 2.) In the system of vector clocks, the time domain is represented by a set of n-dimensional non-negative integer vectors. Each process $p_i$ maintains a vector $vt_i[1..n]$ where $vt_i[i]$ is the local logical clock of $p_i$ and describes the logical time progress at process $p_i$. $vt_i[j]$ represents process $p_i$'s latest knowledge of process $p_j$'s local time. If $vt_i[j] = x$, then process $p_i$ knows that local time at process $p_j$ has progressed till $x$. The entire vector $vt_i$ constitutes $p_i$'s view of the logical global time and is used to timestamp events. Process $p_i$ uses the following two rules R1 and R2 to update its clock:

• R1: Before executing an event, it updates its local logical time as follows:

$vt_i[i] := vt_i[i] + d \quad (d > 0)$

• R2: Each message $m$ is piggybacked with the vector clock $vt$ of the sender process at sending time. On the receipt of such a message $(m, vt)$, process $p_i$ executes the following sequence of actions:
– Update its logical global time as follows: $1 \le k \le n: vt_i[k] := \max(vt_i[k], vt[k])$
– Execute R1.
– Deliver the message $m$.

The timestamp associated with an event is the value of the vector clock of its process when the event is executed. Figure 3 shows an example of vector clocks progress with the increment value d=1.
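The vector clock rules R1 and R2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the class name is hypothetical, process indices are 0-based (the text uses 1..n), and the three-event run is made up rather than taken from Figure 3.

```python
class VectorClock:
    """Vector clock of one process in a system of n processes."""

    def __init__(self, pid, n, d=1):
        self.pid = pid        # 0-based index of this process
        self.d = d            # increment used by R1; must be > 0
        self.vt = [0] * n     # vt_i, this process's view of global time

    def tick(self):
        """R1: advance the local component before executing an event."""
        self.vt[self.pid] += self.d
        return tuple(self.vt)

    def send(self):
        """A send is an event; the returned timestamp is piggybacked."""
        return self.tick()

    def receive(self, vt_msg):
        """R2: component-wise max with the message timestamp, then R1."""
        self.vt = [max(own, other) for own, other in zip(self.vt, vt_msg)]
        return self.tick()


p1, p2, p3 = (VectorClock(i, 3) for i in range(3))
a = p1.send()       # p1 sends a message
b = p2.receive(a)   # p2 receives it
c = p3.tick()       # independent internal event at p3
```

The timestamps `a`, `b` and `c` come out as (1, 0, 0), (1, 1, 0) and (0, 0, 1): the receive at p2 has learned about p1's event, while p3's event carries no knowledge of the other two processes.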

5.2 Basic Property

Isomorphism

The following three relations are defined to compare two vector timestamps, $vh$ and $vk$:

$vh \le vk \iff \forall x: vh[x] \le vk[x]$
$vh < vk \iff vh \le vk \text{ and } \exists x: vh[x] < vk[x]$
$vh \parallel vk \iff \text{not } (vh < vk) \text{ and not } (vk < vh)$

Recall that relation "$\rightarrow$" induces a partial order on the set of events that are produced by a distributed execution. If events in a distributed system are timestamped using a system of vector clocks, we have the following property. If two events $x$ and $y$ have timestamps $vh$ and $vk$, respectively, then:

$x \rightarrow y \iff vh < vk$
$x \parallel y \iff vh \parallel vk$

[Figure 3: Evolution of vector time.]

Thus, there is an isomorphism between the set of partially ordered events produced by a distributed computation and their timestamps. This is a very powerful, useful, and interesting property of vector clocks. If the processes at which the events occurred are known, the test to compare two timestamps can be simplified as follows: if distinct events $x$ and $y$ respectively occurred at processes $p_i$ and $p_j$ and are assigned timestamps $(vh, i)$ and $(vk, j)$, respectively, then

$x \rightarrow y \iff vh[i] \le vk[i]$
$x \parallel y \iff vh[i] > vk[i] \text{ and } vh[j] < vk[j]$
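The comparison relations on vector timestamps can be sketched as plain functions. The function names are hypothetical. The simplified test below uses a non-strict comparison and assumes the two events are distinct: a send and its receive agree on the sender's component, so a strict comparison would miss that case.

```python
def leq(vh, vk):
    """vh <= vk: every component of vh is at most that of vk."""
    return all(h <= k for h, k in zip(vh, vk))


def less(vh, vk):
    """vh < vk: vh <= vk and at least one component is strictly smaller."""
    return leq(vh, vk) and any(h < k for h, k in zip(vh, vk))


def concurrent(vh, vk):
    """vh || vk: neither timestamp is less than the other."""
    return not less(vh, vk) and not less(vk, vh)


def precedes_fast(vh, i, vk):
    """Simplified test for distinct events: x occurred at process i (0-based)."""
    return vh[i] <= vk[i]


# Hypothetical timestamps: a send at p1, its receive at p2, an event at p3.
a, b, c = (1, 0, 0), (1, 1, 0), (0, 0, 1)
```

For these values `less(a, b)` holds and `b` and `c` are concurrent. The simplified test inspects a single component instead of all n.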

Strong Consistency

The system of vector clocks is strongly consistent; thus, by examining the vector timestamp of two events, we can determine if the events are causally related. However, the dimension of vector clocks cannot be less than n for this property [1].

Event Counting

If d is always 1 in the rule R1, then the $i$th component of the vector clock at process $p_i$, $vt_i[i]$, denotes the number of events that have occurred at $p_i$ until that instant. So, if an event $e$ has timestamp $vh$, $vh[j]$ denotes the number of events executed by process $p_j$ that causally precede $e$. Clearly, $\sum_j vh[j] - 1$ represents the total number of events that causally precede $e$ in the distributed computation.
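As a small worked instance of the counting property, assume d=1 and a hypothetical event $e$ at $p_2$ with timestamp $vh = (2, 3, 0)$ in a three-process system (values chosen for illustration, not read from Figure 3):

```latex
\sum_{j=1}^{3} vh[j] - 1 = (2 + 3 + 0) - 1 = 4
```

That is, two events at $p_1$ and two earlier events at $p_2$ causally precede $e$; the subtracted 1 accounts for $e$ itself, which is counted in $vh[2]$.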

Applications

Since vector time tracks causal dependencies exactly, it finds a wide variety of applications. For example, vector clocks are used in distributed debugging, in the implementations of causal ordering communication and causal distributed shared memory, in establishing global breakpoints, and in determining the consistency of checkpoints in optimistic recovery.

6 Matrix Time

6.1 Definition

In a system of matrix clocks, the time is represented by a set of $n \times n$ matrices of non-negative integers. A process $p_i$ maintains a matrix $mt_i[1..n, 1..n]$ where

• $mt_i[i, i]$ denotes the local logical clock of $p_i$ and tracks the progress of the computation at process $p_i$.
• $mt_i[k, l]$ represents the knowledge that process $p_i$ has about the knowledge of $p_k$ about the logical local clock of $p_l$.

The entire matrix $mt_i$ denotes $p_i$'s local view of the logical global time. Note that row $mt_i[i, .]$ is nothing but the vector clock $vt_i[.]$ and exhibits all properties of vector clocks. Process $p_i$ uses the following rules R1 and R2 to update its clock:

• R1: Before executing an event, it updates its local logical time as follows:

$mt_i[i, i] := mt_i[i, i] + d \quad (d > 0)$

• R2: Each message $m$ is piggybacked with matrix time $mt$. When $p_i$ receives such a message $(m, mt)$ from a process $p_j$, $p_i$ executes the following sequence of actions:
– Update its logical global time as follows:
  $1 \le k \le n: mt_i[i, k] := \max(mt_i[i, k], mt[j, k])$
  $1 \le k, l \le n: mt_i[k, l] := \max(mt_i[k, l], mt[k, l])$
– Execute R1.
– Deliver message $m$.

A system of matrix clocks was first informally proposed by Michael and Fischer in 1982 [4] and has been used by Wuu and Bernstein [12] and by Lynch and Sarin [9] to discard obsolete information in replicated databases.

6.2 Basic Property

Clearly, vector $mt_i[i, .]$ contains all the properties of vector clocks. In addition, matrix clocks have the following property:

$\min_k (mt_i[k, l]) \ge t \Rightarrow$ process $p_i$ knows that every other process $p_k$ knows that $p_l$'s local time has progressed till $t$.

If this is true, it is clear that process $p_i$ knows that all other processes know that $p_l$ will never send information with a local time $\le t$. In many applications, this implies that processes will no longer require certain information from $p_l$ and can use this fact to discard obsolete information. If d is always 1 in the rule R1, then $mt_i[k, l]$ denotes the number of events that occurred at $p_l$ and are known by $p_k$, as far as $p_i$'s knowledge is concerned.
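The matrix clock rules and the minimum property can be sketched together in Python. This is a hedged illustration: the class and method names (including `known_by_all`) are hypothetical, indices are 0-based, and a send is assumed to be an event to which R1 applies.

```python
class MatrixClock:
    """Matrix clock of one process in a system of n processes."""

    def __init__(self, pid, n, d=1):
        self.pid, self.n, self.d = pid, n, d
        self.mt = [[0] * n for _ in range(n)]

    def tick(self):
        """R1: advance the local logical clock mt[i][i]."""
        self.mt[self.pid][self.pid] += self.d

    def send(self):
        """Apply R1, then piggyback the sender id and a copy of the matrix."""
        self.tick()
        return self.pid, [row[:] for row in self.mt]

    def receive(self, sender, mt_msg):
        """R2: merge the sender's row into our row, merge the whole matrix, then R1."""
        i = self.pid
        for k in range(self.n):
            self.mt[i][k] = max(self.mt[i][k], mt_msg[sender][k])
        for k in range(self.n):
            for l in range(self.n):
                self.mt[k][l] = max(self.mt[k][l], mt_msg[k][l])
        self.tick()

    def known_by_all(self, l):
        """Largest t such that, to our knowledge, every process knows
        that process l's local time has progressed till t."""
        return min(self.mt[k][l] for k in range(self.n))


p0, p1 = MatrixClock(0, 2), MatrixClock(1, 2)
sender, mt = p0.send()
p1.receive(sender, mt)
```

After the receive, p1 holds the matrix [[1, 0], [1, 1]], so `p1.known_by_all(0)` is 1: p1 knows that both processes have seen p0's first event, which is the condition under which information about that event could be discarded.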

7 Efficient Implementations

If the number of processes in a distributed computation is large, then vector and matrix clocks will require piggybacking a large amount of information in messages for the purpose of disseminating time progress and updating clocks. In this section, we discuss efficient ways to maintain vector clocks; similar techniques can be used to efficiently implement matrix clocks. It has been shown [1] that if vector clocks have to satisfy the strong consistency property, then in general vector timestamps must be at least of size n. Therefore, in general the size of a vector timestamp is the number of processes involved in a distributed computation; however, several optimizations are possible, and next we discuss techniques to implement vector timestamps efficiently.

7.1 Singhal-Kshemkalyani's Differential Technique

Singhal-Kshemkalyani's technique [11] is based on the observation that between successive events at a process, only a few entries of the vector clock at that process are likely to change. This is all the more true when the number of processes is large, because only a few of them will interact frequently by passing messages. In Singhal-Kshemkalyani's differential technique, when a process $p_i$ sends a message to a process $p_j$, it piggybacks only those entries of its vector clock that have changed since the last message sent to $p_j$. Therefore, this technique cuts down the communication bandwidth and buffer (to store messages) requirements. However, a process needs to maintain two additional vectors to store the information regarding the clock values when the last interaction was done with other processes.

[Figure 4: Vector clocks progress in Singhal-Kshemkalyani technique]

Figure 4 illustrates the Singhal-Kshemkalyani technique. If entries $i_1, i_2, \ldots, i_{n_1}$ of the vector clock at $p_i$ have changed (to $v_1, v_2, \ldots, v_{n_1}$, respectively) since the last message sent to $p_j$, then process $p_i$ piggybacks a compressed timestamp of the form $\{(i_1, v_1), (i_2, v_2), \ldots, (i_{n_1}, v_{n_1})\}$ to the next message to $p_j$. When $p_j$ receives this message, it updates its clock as follows: $vt_j[i_k] := \max(vt_j[i_k], v_k)$ for $k = 1, 2, \ldots, n_1$. This technique can substantially reduce the cost of maintaining vector clocks in large systems if process interaction exhibits temporal or spatial localities. However, it requires that communication channels be FIFO.
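One way to realize the differential technique with two additional vectors is sketched below. The bookkeeping is an assumption, not something the text spells out: `last_sent[j]` is taken to hold the local clock value at the last send to process j, and `last_update[k]` the local clock value when entry k last changed. All names are hypothetical, indices are 0-based, d=1, and channels are assumed FIFO.

```python
class DiffVectorClock:
    """Vector clock that piggybacks only entries changed since the last send."""

    def __init__(self, pid, n):
        self.pid = pid
        self.vt = [0] * n
        self.last_update = [0] * n  # local time at which vt[k] last changed
        self.last_sent = [0] * n    # local time of the last send to process j

    def _tick(self):
        self.vt[self.pid] += 1
        self.last_update[self.pid] = self.vt[self.pid]

    def send(self, dest):
        """Return the compressed timestamp {index: value} for dest."""
        self._tick()
        diff = {k: self.vt[k] for k in range(len(self.vt))
                if self.last_update[k] > self.last_sent[dest]}
        self.last_sent[dest] = self.vt[self.pid]
        return diff

    def receive(self, diff):
        """Merge the received entries, then count the receive event."""
        changed = [k for k, v in diff.items() if v > self.vt[k]]
        for k in changed:
            self.vt[k] = diff[k]
        self._tick()
        for k in changed:
            self.last_update[k] = self.vt[self.pid]


p0, p1, p2 = (DiffVectorClock(i, 3) for i in range(3))
p0.receive(p2.send(0))   # p0 first learns about p2
m1 = p0.send(1)          # first message to p1: two entries are new
m2 = p0.send(1)          # second message to p1: only p0's own entry changed
p1.receive(m1)
p1.receive(m2)
```

The second message carries a single pair instead of the full vector, and p1 still ends with the same clock it would have obtained from full timestamps.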


7.2 Fowler-Zwaenepoel's Direct-Dependency Technique

Fowler-Zwaenepoel's direct-dependency technique [5] does not maintain vector clocks on-the-fly. Instead, in this technique a process only maintains information regarding direct dependencies on other processes. A vector time for an event, which represents transitive dependencies on other processes, is constructed off-line from a recursive search of the direct dependency information at processes. A process $p_i$ maintains a dependency vector $D_i$ that is initially $D_i[j] = 0$ for $j = 1..n$ and is updated as follows:

• Whenever an event occurs at $p_i$: $D_i[i] := D_i[i] + 1$.
• When a process $p_j$ sends a message $m$ to $p_i$, it piggybacks the updated value of $D_j[j]$ in the message.
• When $p_i$ receives a message from $p_j$ with piggybacked value $d$, $p_i$ updates its dependency vector as follows: $D_i[j] := \max\{D_i[j], d\}$.

Thus, the dependency vector at a process only reflects direct dependencies. At any instant, $D_i[j]$ denotes the sequence number of the latest event on process $p_j$ that directly affects the current state. Note that this event may precede the latest event at $p_j$ that causally affects the current state.

Figure 5 illustrates the Fowler-Zwaenepoel technique. This technique results in considerable savings in cost; only one scalar is piggybacked on every message. However, the dependency vector does not represent transitive dependencies (i.e., a vector timestamp). The transitive dependency (or the vector timestamp) of an event is obtained by recursively tracing the direct-dependency vectors of processes. Clearly, this will have overhead and will involve latencies. Therefore, this technique is not suitable for applications that require on-the-fly computation of vector timestamps. Nonetheless, this technique is ideal for applications where computation of causal dependencies is performed off-line (e.g., causal breakpoint, asynchronous checkpointing recovery).
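The direct-dependency bookkeeping and the off-line reconstruction can be sketched as follows. This is an illustrative sketch under assumptions: every event records a copy of the dependency vector in a per-process log, the recursive search is written as an explicit stack, indices are 0-based, and all names are hypothetical.

```python
class DirectDepProcess:
    """Process that tracks only direct dependencies."""

    def __init__(self, pid, n):
        self.pid = pid
        self.d = [0] * n
        self.log = []   # log[x - 1] is the vector recorded at event number x

    def _event(self):
        self.d[self.pid] += 1
        self.log.append(list(self.d))

    def send(self):
        """Only one scalar, D[pid], is piggybacked on the message."""
        self._event()
        return self.pid, self.d[self.pid]

    def receive(self, sender, value):
        self.d[sender] = max(self.d[sender], value)
        self._event()


def vector_timestamp(procs, pid, x):
    """Rebuild the vector timestamp of event number x at process pid
    by tracing the recorded direct-dependency vectors."""
    vt = [0] * len(procs)
    vt[pid] = x
    stack = [(pid, x)]
    while stack:
        j, y = stack.pop()
        for k, dep in enumerate(procs[j].log[y - 1]):
            if k != j and dep > vt[k]:
                vt[k] = dep
                stack.append((k, dep))
    return vt


procs = [DirectDepProcess(i, 3) for i in range(3)]
procs[1].receive(*procs[0].send())   # p0 -> p1
procs[2].receive(*procs[1].send())   # p1 -> p2
ts = vector_timestamp(procs, 2, 1)
```

The dependency vector at p2 is [0, 2, 1]: it records the direct dependency on p1 but not the transitive one on p0. The reconstruction recovers [1, 2, 1], the timestamp a full vector clock would have assigned to that receive event.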

7.3 Jard-Jourdan's Adaptive Technique

In Fowler-Zwaenepoel's direct-dependency technique, a process must observe an event (i.e., update and record its dependency vector) after receiving a message but before sending out any message. Otherwise, during the reconstruction of a vector timestamp from the direct-dependency vectors, not all the causal dependencies will be captured. If events occur very frequently, this technique will require recording the history of a large number of events. In Jard-Jourdan's technique [6], events can be adaptively observed while maintaining the capability of retrieving all the causal dependencies of an observed event.

[Figure 5: Vector clocks progress in Fowler-Zwaenepoel technique]

Jard and Jourdan define a pseudo-direct relation $\ll$ on the events of a distributed computation as follows: if events $e_i$ and $e_j$ happen at processes $p_i$ and $p_j$, respectively, then $e_j \ll e_i$ iff there exists a path of message transfers that starts after $e_j$ on $p_j$ and ends before $e_i$ on $p_i$ such that there is no observed event on the path. The partial vector clock $p\_vt_i$ at process $p_i$ is a list of tuples of the form $(j, v)$ indicating that the current state of $p_i$ is pseudo-dependent on the event on process $p_j$ whose sequence number is $v$. Initially, at a process $p_i$: $p\_vt_i = \{(i, 0)\}$.

• Whenever an event is observed at process $p_i$, the following actions are executed (let $p\_vt_i = \{(i_1, v_1), \ldots, (i, v), \ldots\}$ denote the current partial vector clock at $p_i$; variable $e\_vt_i$ holds the timestamp of the observed event):
– $e\_vt_i := \{(i_1, v_1), \ldots, (i, v), \ldots\}$.
– $p\_vt_i := \{(i, v + 1)\}$.
• When process $p_j$ sends a message to $p_i$, it piggybacks the current value of $p\_vt_j$ in the message.

[Figure 6: Vector clocks progress in Jard-Jourdan technique]

• When $p_i$ receives a message piggybacked with timestamp $p\_vt$, $p_i$ updates $p\_vt_i$ such that it is the union of the following (let $p\_vt = \{(i_{m1}, v_{m1}), \ldots, (i_{mk}, v_{mk})\}$ and $p\_vt_i = \{(i_1, v_1), \ldots, (i_l, v_l)\}$):
– all $(i_{mx}, v_{mx})$ such that $(i_{mx}, .)$ does not appear in $p\_vt_i$,
– all $(i_x, v_x)$ such that $(i_x, .)$ does not appear in $p\_vt$, and
– all $(i_x, \max(v_x, v_{mx}))$ for all $(i_x, .)$ that appear in both $p\_vt$ and $p\_vt_i$.

Figure 6 illustrates the Jard-Jourdan technique to maintain vector clocks. $eX\_pt_n$ denotes the timestamp of the Xth observed event at process $p_n$. For example, the third event observed at $p_3$ is timestamped $e3\_pt_3 = \{(3, 2), (4, 1)\}$; this timestamp means that the pseudo-direct predecessors of this event are located at processes $p_3$ and $p_4$ and are respectively the second event observed at $p_3$ and the first observed at $p_4$. So, considering the timestamp of an event, the set of the observed events that are its predecessors can be easily computed.
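The three Jard-Jourdan rules (observe, send, receive) can be sketched with a dictionary standing in for the partial vector clock. This is a hedged sketch: the class and attribute names are hypothetical, and the run below is an assumed scenario built to end with the timestamp {(3,2), (4,1)} quoted in the text; it is not a transcription of Figure 6.

```python
class JJProcess:
    """Process maintaining a partial vector clock as {process id: sequence number}."""

    def __init__(self, pid):
        self.pid = pid
        self.p_vt = {pid: 0}   # initially {(i, 0)}
        self.observed = []     # timestamps of the observed events, in order

    def observe(self):
        """Timestamp the observed event, then reset the clock to {(i, v + 1)}."""
        e_vt = dict(self.p_vt)
        self.observed.append(e_vt)
        self.p_vt = {self.pid: e_vt[self.pid] + 1}
        return e_vt

    def send(self):
        """Piggyback a copy of the current partial vector clock."""
        return dict(self.p_vt)

    def receive(self, p_vt_msg):
        """Union of both clocks, taking the max where a process appears in both."""
        for k, v in p_vt_msg.items():
            self.p_vt[k] = max(self.p_vt.get(k, v), v)


p1, p2, p3, p4, p5 = (JJProcess(i) for i in range(1, 6))
p2.receive(p1.send())
e1_p3 = p3.observe()
p3.receive(p2.send())
e2_p3 = p3.observe()
p5.observe()
p4.receive(p5.send())
p4.observe()
p3.receive(p4.send())
e3_p3 = p3.observe()
```

Because p4 observed an event before sending to p3, the dependency on p5 is cut from the third timestamp at p3: it contains only entries for p3 and p4, and p5 can be recovered by following p4's recorded timestamp.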

8 Concluding Remarks

The concept of causality between events is fundamental to the design and analysis of distributed programs. The notion of time is basic to capture causality between events; however, there is no built-in physical time in distributed systems and it is only possible to realize an approximation of it. But a distributed computation makes progress in spurts, and consequently logical time, which advances in jumps, is sufficient to capture the monotonicity property induced by causality in distributed systems. This paper presented a general framework of logical clocks in distributed systems and discussed three systems of logical clocks, namely, scalar, vector, and matrix, that have been proposed to capture causality between events of a distributed computation. We discussed properties and applications of these systems of clocks. The power of a system of clocks increases in this order, but so do the complexity and the overheads. We discussed efficient implementations of these systems of clocks.

References

[1] Charron-Bost, B. Concerning the size of logical clocks in distributed systems. Inf. Proc. Letters, vol.39, (1991), pp. 11-16.
[2] Fidge, C.J. Timestamp in message passing systems that preserves partial ordering. Proc. 11th Australian Comp. Conf., (Feb. 1988), pp. 56-66.
[3] Fidge, C. Logical time in distributed computing systems. IEEE Computer, (August 1991), pp. 28-33.
[4] Fischer, M.J., Michael, A. Sacrificing serializability to attain high availability of data in an unreliable network. Proc. of ACM Symposium on Principles of Database Systems, (1982), pp. 70-75.
[5] Fowler, J., Zwaenepoel, W. Causal distributed breakpoints. Proc. of 10th Int'l. Conf. on Distributed Computing Systems, (1990), pp. 134-141.
[6] Jard, C., Jourdan, G-C. Dependency tracking and filtering in distributed computations. In Brief announcements of the ACM symposium on PODC, (1994). (A full presentation appeared as IRISA Tech. Report No. 851, 1994).
[7] Lamport, L. Time, clocks and the ordering of events in a distributed system. Comm. ACM, vol.21, (July 1978), pp. 558-564.
[8] Mattern, F. Virtual time and global states of distributed systems. Proc. "Parallel and distributed algorithms" Conf., (Cosnard, Quinton, Raynal, Robert Eds), North-Holland, (1988), pp. 215-226.
[9] Sarin, S.K., Lynch, N.A. Discarding obsolete information in a replicated data base system. IEEE Trans. on Soft. Eng., vol.SE 13,1, (Jan. 1987), pp. 39-46.
[10] Schmuck, F. The use of efficient broadcast in asynchronous distributed systems. Ph. D. Thesis, Cornell University, TR88-928, (1988), 124 pages.
[11] Singhal, M., Kshemkalyani, A. An Efficient Implementation of Vector Clocks. Information Processing Letters, 43, August 1992, pp. 47-52.
[12] Wuu, G.T.J., Bernstein, A.J. Efficient solutions to the replicated log and dictionary problems. Proc. 3rd ACM Symposium on PODC, (1984), pp. 233-242.


Annex 1: Virtual Time

The synchronizer concept of Awerbuch [1] allows a synchronous distributed algorithm or program to run on an asynchronous distributed system. A synchronous distributed program executes in a lock-step manner and its progress relies on a global time assumption. A global time preexists in the semantics of synchronous distributed programs and it participates in the execution of such programs. A synchronizer is an interpreter for synchronous distributed programs and simulates a global time for such programs in an asynchronous environment [6].

In distributed, discrete-event simulations [5,8], a global time (the so-called simulation time) exists, the semantics of a simulation program relies on such a time, and its progress ensures that the simulation program has the liveness property. In the execution of a distributed simulation, it must be ensured that the virtual time progresses (liveness) in such a way that causality relations of the simulation program are never violated (safety).

The global time built by a synchronizer or by a distributed simulation runtime environment drives the underlying program and should not be confused with the logical time presented previously. The time provided by a synchronizer or a distributed simulation runtime support does belong to the underlying program semantics and is nothing but the virtual [4] counterpart of the physical time offered by the environment and used in real-time applications [2].


On the other hand, in the previous representations of logical time (e.g., linear, vector, or matrix), the aim is to order events according to their causal precedence in order to ensure some properties such as liveness, consistency, fairness, etc. Such a logical time is only a means among others to ensure some properties. For example Lamport's logical clocks are used in Ricart-Agrawala's mutual exclusion algorithm [7] to ensure liveness; this time does not belong to the mutual exclusion semantics or the program invoking mutual exclusion. In fact, other means can be used to ensure properties such as liveness; for example, Chandy and Misra's mutual exclusion algorithm [3] uses a dynamic, directed, acyclic graph to ensure liveness.

References

[1] Awerbuch, B. Complexity of network synchronization. Journal of the ACM, vol.32,4, (1985), pp. 804-823.
[2] Berry, G. Real time programming: special purpose or general purpose languages. IFIP Congress, Invited talk, San Francisco, (1989).
[3] Chandy, K.M., Misra, J. The drinking philosophers problem. ACM Toplas, vol.6,4, (1984), pp. 632-646.
[4] Jefferson, D. Virtual time. ACM Toplas, vol.7,3, (1985), pp. 404-425.
[5] Misra, J. Distributed discrete event simulation. ACM Computing Surveys, vol.18,1, (1986), pp. 39-65.
[6] Raynal, M., Helary, J.M. Synchronization and control of distributed systems and programs. Wiley & sons, (1990), 124 p.
[7] Ricart, G., Agrawala, A.K. An optimal algorithm for mutual exclusion in computer networks. Comm. ACM, vol.24,1, (Jan. 1981), pp. 9-17.
[8] Righter, R., Walrand, J.C. Distributed simulation of discrete event systems. Proc. of the IEEE, (Jan. 1988), pp. 99-113.


Annex 2: Vector Clocks - A Brief Historical Perspective

Although the theory associated with vector clocks was first developed in 1988 independently by Fidge, Mattern, and Schmuck, vector clocks were informally introduced and used by several researchers earlier. Parker et al. [2] used a rudimentary vector clock system to detect inconsistencies of replicated files due to network partitioning. Liskov and Ladin [1] proposed a vector clock system to define highly available distributed services. A similar system of clocks was used by Strom and Yemini [4] to keep track of the causal dependencies between events in their optimistic recovery algorithm and by Raynal to prevent drift between logical clocks [3].

References

[1] Liskov, B., Ladin, R. Highly available distributed services and fault-tolerant distributed garbage collection. Proc. 5th ACM Symposium on PODC, (1986), pp. 29-39.
[2] Parker, D.S. et al. Detection of mutual inconsistency in distributed systems. IEEE Trans. on Soft. Eng., vol.SE 9,3, (May 1983), pp. 240-246.
[3] Raynal, M. A distributed algorithm to prevent mutual drift between n logical clocks. Inf. Processing Letters, vol.24, (1987), pp. 199-202.
[4] Strom, R.E., Yemini, S. Optimistic recovery in distributed systems. ACM TOCS, vol.3,3, (August 1985), pp. 204-226.


Unité de recherche INRIA Lorraine, Technopôle de Nancy-Brabois, Campus scientifique, 615 rue du Jardin Botanique, BP 101, 54600 VILLERS LÈS NANCY
Unité de recherche INRIA Rennes, Irisa, Campus universitaire de Beaulieu, 35042 RENNES Cedex
Unité de recherche INRIA Rhône-Alpes, 46 avenue Félix Viallet, 38031 GRENOBLE Cedex 1
Unité de recherche INRIA Rocquencourt, Domaine de Voluceau, Rocquencourt, BP 105, 78153 LE CHESNAY Cedex
Unité de recherche INRIA Sophia-Antipolis, 2004 route des Lucioles, BP 93, 06902 SOPHIA-ANTIPOLIS Cedex

Éditeur INRIA, Domaine de Voluceau, Rocquencourt, BP 105, 78153 LE CHESNAY Cedex (France)
ISSN 0249-6399

Operating Systems

R. Stockton Gaines Editor

Time, Clocks, and the Ordering of Events in a Distributed System

Leslie Lamport
Massachusetts Computer Associates, Inc.

The concept of one event happening before another in a distributed system is examined, and is shown to define a partial ordering of the events. A distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events. The use of the total ordering is illustrated with a method for solving synchronization problems. The algorithm is then specialized for synchronizing physical clocks, and a bound is derived on how far out of synchrony the clocks can become.

Key Words and Phrases: distributed systems, computer networks, clock synchronization, multiprocess systems

CR Categories: 4.32, 5.29

General permission to make fair use in teaching or research of all or part of this material is granted to individual readers and to nonprofit libraries acting for them provided that ACM's copyright notice is given and that reference is made to the publication, to its date of issue, and to the fact that reprinting privileges were granted by permission of the Association for Computing Machinery. To otherwise reprint a figure, table, other substantial excerpt, or the entire work requires specific permission as does republication, or systematic or multiple reproduction. This work was supported by the Advanced Research Projects Agency of the Department of Defense and Rome Air Development Center. It was monitored by Rome Air Development Center under contract number F 30602-76-C-0094. Author's address: Computer Science Laboratory, SRI International, 333 Ravenswood Ave., Menlo Park CA 94025. © 1978 ACM 0001-0782/78/0700-0558 $00.75

Introduction

The concept of time is fundamental to our way of thinking. It is derived from the more basic concept of the order in which events occur. We say that something happened at 3:15 if it occurred after our clock read 3:15 and before it read 3:16. The concept of the temporal ordering of events pervades our thinking about systems. For example, in an airline reservation system we specify that a request for a reservation should be granted if it is made before the flight is filled. However, we will see that this concept must be carefully reexamined when considering events in a distributed system.

A distributed system consists of a collection of distinct processes which are spatially separated, and which communicate with one another by exchanging messages. A network of interconnected computers, such as the ARPA net, is a distributed system. A single computer can also be viewed as a distributed system in which the central control unit, the memory units, and the input-output channels are separate processes. A system is distributed if the message transmission delay is not negligible compared to the time between events in a single process. We will concern ourselves primarily with systems of spatially separated computers. However, many of our remarks will apply more generally. In particular, a multiprocessing system on a single computer involves problems similar to those of a distributed system because of the unpredictable order in which certain events can occur.

In a distributed system, it is sometimes impossible to say that one of two events occurred first. The relation "happened before" is therefore only a partial ordering of the events in the system. We have found that problems often arise because people are not fully aware of this fact and its implications. In this paper, we discuss the partial ordering defined by the "happened before" relation, and give a distributed algorithm for extending it to a consistent total ordering of all the events. This algorithm can provide a useful mechanism for implementing a distributed system. We illustrate its use with a simple method for solving synchronization problems. Unexpected, anomalous behavior can occur if the ordering obtained by this algorithm differs from that perceived by the user. This can be avoided by introducing real, physical clocks. We describe a simple method for synchronizing these clocks, and derive an upper bound on how far out of synchrony they can drift.

The Partial Ordering

Most people would probably say that an event a happened before an event b if a happened at an earlier time than b. They might justify this definition in terms of physical theories of time. However, if a system is to meet a specification correctly, then that specification must be given in terms of events observable within the system. If the specification is in terms of physical time, then the system must contain real clocks. Even if it does contain real clocks, there is still the problem that such clocks are not perfectly accurate and do not keep precise physical time. We will therefore define the "happened before" relation without using physical clocks.

We begin by defining our system more precisely. We assume that the system is composed of a collection of processes. Each process consists of a sequence of events. Depending upon the application, the execution of a subprogram on a computer could be one event, or the execution of a single machine instruction could be one event. We are assuming that the events of a process form a sequence, where a occurs before b in this sequence if a happens before b. In other words, a single process is defined to be a set of events with an a priori total ordering. This seems to be what is generally meant by a process.¹ It would be trivial to extend our definition to allow a process to split into distinct subprocesses, but we will not bother to do so.

We assume that sending or receiving a message is an event in a process. We can then define the "happened before" relation, denoted by "$\rightarrow$", as follows.

Definition. The relation "$\rightarrow$" on the set of events of a system is the smallest relation satisfying the following three conditions: (1) If a and b are events in the same process, and a comes before b, then $a \rightarrow b$. (2) If a is the sending of a message by one process and b is the receipt of the same message by another process, then $a \rightarrow b$. (3) If $a \rightarrow b$ and $b \rightarrow c$ then $a \rightarrow c$. Two distinct events a and b are said to be concurrent if $a \not\rightarrow b$ and $b \not\rightarrow a$.

We assume that $a \not\rightarrow a$ for any event a. (Systems in which an event can happen before itself do not seem to be physically meaningful.) This implies that $\rightarrow$ is an irreflexive partial ordering on the set of all events in the system.

It is helpful to view this definition in terms of a "space-time diagram" such as Figure 1. The horizontal direction represents space, and the vertical direction represents time, with later times being higher than earlier ones. The dots denote events, the vertical lines denote processes, and the wavy lines denote messages.² It is easy to see that $a \rightarrow b$ means that one can go from a to b in the diagram by moving forward in time along process and message lines. For example, we have $p_1 \rightarrow r_4$ in Figure 1.

[Fig. 1.]

[Fig. 2.]

Another way of viewing the definition is to say that $a \rightarrow b$ means that it is possible for event a to causally affect event b. Two events are concurrent if neither can causally affect the other. For example, events $p_3$ and $q_3$ of Figure 1 are concurrent. Even though we have drawn the diagram to imply that $q_3$ occurs at an earlier physical time than $p_3$, process P cannot know what process Q did at $q_3$ until it receives the message at $p_4$. (Before event $p_4$, P could at most know what Q was planning to do at $q_3$.)

This definition will appear quite natural to the reader familiar with the invariant space-time formulation of special relativity, as described for example in [1] or the first chapter of [2]. In relativity, the ordering of events is defined in terms of messages that could be sent. However, we have taken the more pragmatic approach of only considering messages that actually are sent. We should be able to determine if a system performed correctly by knowing only those events which did occur, without knowing which events could have occurred.

¹ The choice of what constitutes an event affects the ordering of events in a process. For example, the receipt of a message might denote the setting of an interrupt bit in a computer, or the execution of a subprogram to handle that interrupt. Since interrupts need not be handled in the order that they occur, this choice will affect the ordering of a process' message-receiving events.

² Observe that messages may be received out of order. We allow the sending of several messages to be a single event, but for convenience we will assume that the receipt of a single message does not coincide with the sending or receipt of any other message.

Communications of the ACM, July 1978, Volume 21, Number 7

Logical Clocks

We now introduce clocks into the system. We begin with an abstract point of view in which a clock is just a way of assigning a number to an event, where the number is thought of as the time at which the event occurred. More precisely, we define a clock $C_i$ for each process $P_i$ to be a function which assigns a number $C_i(a)$ to any event a in that process. The entire system of clocks is represented by the function C which assigns to any event b the number $C(b)$, where $C(b) = C_j(b)$ if b is an event in process $P_j$. For now, we make no assumption about the relation of the numbers $C_i(a)$ to physical time, so we can think of the clocks $C_i$ as logical rather than physical clocks. They may be implemented by counters with no actual timing mechanism.

Communications of the ACM

July 1978 Volume 21 Number 7

[Fig. 3. The space-time diagram of Figure 2 redrawn with straightened tick lines (figure not reproduced).]

We now consider what it means for such a system of clocks to be correct. We cannot base our definition of correctness on physical time, since that would require introducing clocks which keep physical time. Our definition must be based on the order in which events occur. The strongest reasonable condition is that if an event a occurs before another event b, then a should happen at an earlier time than b. We state this condition more formally as follows.

Clock Condition. For any events a, b: if $a \to b$ then $C(a) < C(b)$.

Note that we cannot expect the converse condition to hold as well, since that would imply that any two concurrent events must occur at the same time. In Figure 1, $p_2$ and $p_3$ are both concurrent with $q_3$, so this would mean that they both must occur at the same time as $q_3$, which would contradict the Clock Condition because $p_2 \to p_3$.

It is easy to see from our definition of the relation "$\to$" that the Clock Condition is satisfied if the following two conditions hold.
C1. If a and b are events in process $P_i$, and a comes before b, then $C_i(a) < C_i(b)$.
C2. If a is the sending of a message by process $P_i$ and b is the receipt of that message by process $P_j$, then $C_i(a) < C_j(b)$.

Let us consider the clocks in terms of a space-time diagram. We imagine that a process' clock "ticks" through every number, with the ticks occurring between the process' events. For example, if a and b are consecutive events in process $P_i$ with $C_i(a) = 4$ and $C_i(b) = 7$, then clock ticks 5, 6, and 7 occur between the two events. We draw a dashed "tick line" through all the like-numbered ticks of the different processes. The space-time diagram of Figure 1 might then yield the picture in Figure 2. Condition C1 means that there must be a tick line between any two events on a process line, and condition C2 means that every message line must cross a tick line. From the pictorial meaning of $\to$, it is easy to see why these two conditions imply the Clock Condition.

We can consider the tick lines to be the time coordinate lines of some Cartesian coordinate system on space-time. We can redraw Figure 2 to straighten these coordinate lines, thus obtaining Figure 3. Figure 3 is a valid alternate way of representing the same system of events as Figure 2. Without introducing the concept of physical time into the system (which requires introducing physical clocks), there is no way to decide which of these pictures is a better representation.

The reader may find it helpful to visualize a two-dimensional spatial network of processes, which yields a three-dimensional space-time diagram. Processes and messages are still represented by lines, but tick lines become two-dimensional surfaces.

Let us now assume that the processes are algorithms, and the events represent certain actions during their execution. We will show how to introduce clocks into the processes which satisfy the Clock Condition. Process $P_i$'s clock is represented by a register $C_i$, so that $C_i(a)$ is the value contained by $C_i$ during the event a. The value of $C_i$ will change between events, so changing $C_i$ does not itself constitute an event.

To guarantee that the system of clocks satisfies the Clock Condition, we will ensure that it satisfies conditions C1 and C2. Condition C1 is simple; the processes need only obey the following implementation rule:
IR1. Each process $P_i$ increments $C_i$ between any two successive events.
To meet condition C2, we require that each message m contain a timestamp $T_m$ which equals the time at which the message was sent. Upon receiving a message timestamped $T_m$, a process must advance its clock to be later than $T_m$. More precisely, we have the following rule.
IR2. (a) If event a is the sending of a message m by process $P_i$, then the message m contains a timestamp $T_m = C_i(a)$.
(b) Upon receiving a message m, process $P_j$ sets $C_j$ greater than or equal to its present value and greater than $T_m$.
In IR2(b) we consider the event which represents the receipt of the message m to occur after the setting of $C_j$. (This is just a notational nuisance, and is irrelevant in any actual implementation.) Obviously, IR2 ensures that C2 is satisfied. Hence, the simple implementation rules IR1 and IR2 imply that the Clock Condition is satisfied, so they guarantee a correct system of logical clocks.
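Rules IR1 and IR2 translate almost directly into a counter. The sketch below is one possible reading, not the paper's own code: the class name `LamportClock` and its method names are hypothetical, and the receive step uses the common `max(local, T_m) + 1` choice, which satisfies IR1 and IR2(b) at once.

```python
class LamportClock:
    """A logical clock for one process, following rules IR1 and IR2."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """IR1: increment the clock before a local event; return its time."""
        self.time += 1
        return self.time

    def send(self):
        """IR2(a): sending is an event; the message carries T_m = C_i(a)."""
        return self.tick()

    def receive(self, t_m):
        """IR2(b): advance past both the present value and T_m.

        The returned value is the time of the receipt event.
        """
        self.time = max(self.time, t_m) + 1
        return self.time


p, q = LamportClock(), LamportClock()
a = p.tick()          # local event in P at time 1
t_m = p.send()        # send event in P at time 2
q.tick()              # unrelated local event in Q at time 1
b = q.receive(t_m)    # receipt in Q: max(1, 2) + 1 = 3
```

Because the receipt time always exceeds the timestamp carried by the message, condition C2 holds for every message, and C1 holds because every event is preceded by an increment.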

Ordering the Events Totally

We can use a system of clocks satisfying the Clock Condition to place a total ordering on the set of all system events. We simply order the events by the times at which they occur. To break ties, we use any arbitrary total ordering $\prec$ of the processes. More precisely, we define a relation $\Rightarrow$ as follows: if a is an event in process $P_i$ and b is an event in process $P_j$, then $a \Rightarrow b$ if and only if either (i) $C_i(a) < C_j(b)$ or (ii) $C_i(a) = C_j(b)$ and $P_i \prec P_j$. It is easy to see that this defines a total ordering, and that the Clock Condition implies that if $a \to b$ then $a \Rightarrow b$. In other words, the relation $\Rightarrow$ is a way of completing the "happened before" partial ordering to a total ordering.³

The ordering $\Rightarrow$ depends upon the system of clocks $C_i$, and is not unique. Different choices of clocks which satisfy the Clock Condition yield different relations $\Rightarrow$. Given any total ordering relation $\Rightarrow$ which extends $\to$, there is a system of clocks satisfying the Clock Condition which yields that relation. It is only the partial ordering $\to$ which is uniquely determined by the system of events.

Being able to totally order the events can be very useful in implementing a distributed system. In fact, the reason for implementing a correct system of logical clocks is to obtain such a total ordering. We will illustrate the use of this total ordering of events by solving the following version of the mutual exclusion problem. Consider a system composed of a fixed collection of processes which share a single resource. Only one process can use the resource at a time, so the processes must synchronize themselves to avoid conflict. We wish to find an algorithm for granting the resource to a process which satisfies the following three conditions:
(I) A process which has been granted the resource must release it before it can be granted to another process.
(II) Different requests for the resource must be granted in the order in which they are made.
(III) If every process which is granted the resource eventually releases it, then every request is eventually granted.
We assume that the resource is initially granted to exactly one process.
These are perfectly natural requirements. They precisely specify what it means for a solution to be correct.⁴ Observe how the conditions involve the ordering of events. Condition II says nothing about which of two concurrently issued requests should be granted first.

It is important to realize that this is a nontrivial problem. Using a central scheduling process which grants requests in the order they are received will not work, unless additional assumptions are made. To see this, let $P_0$ be the scheduling process. Suppose $P_1$ sends a request to $P_0$ and then sends a message to $P_2$. Upon receiving the latter message, $P_2$ sends a request to $P_0$. It is possible for $P_2$'s request to reach $P_0$ before $P_1$'s request does. Condition II is then violated if $P_2$'s request is granted first.

To solve the problem, we implement a system of clocks with rules IR1 and IR2, and use them to define a total ordering $\Rightarrow$ of all events. This provides a total ordering of all request and release operations. With this ordering, finding a solution becomes a straightforward exercise. It just involves making sure that each process learns about all other processes' operations.

To simplify the problem, we make some assumptions. They are not essential, but they are introduced to avoid distracting implementation details. We assume first of all that for any two processes $P_i$ and $P_j$, the messages sent from $P_i$ to $P_j$ are received in the same order as they are sent. Moreover, we assume that every message is eventually received. (These assumptions can be avoided by introducing message numbers and message acknowledgment protocols.) We also assume that a process can send messages directly to every other process.

Each process maintains its own request queue which is never seen by any other process. We assume that the request queues initially contain the single message $T_0$:$P_0$ requests resource, where $P_0$ is the process initially granted the resource and $T_0$ is less than the initial value of any clock.

The algorithm is then defined by the following five rules. For convenience, the actions defined by each rule are assumed to form a single event.
1. To request the resource, process $P_i$ sends the message $T_m$:$P_i$ requests resource to every other process, and puts that message on its request queue, where $T_m$ is the timestamp of the message.
2. When process $P_j$ receives the message $T_m$:$P_i$ requests resource, it places it on its request queue and sends a (timestamped) acknowledgment message to $P_i$.⁵
3. To release the resource, process $P_i$ removes any $T_m$:$P_i$ requests resource message from its request queue and sends a (timestamped) $P_i$ releases resource message to every other process.
4. When process $P_j$ receives a $P_i$ releases resource message, it removes any $T_m$:$P_i$ requests resource message from its request queue.
5. Process $P_i$ is granted the resource when the following two conditions are satisfied: (i) There is a $T_m$:$P_i$ requests resource message in its request queue which is ordered before any other request in its queue by the relation $\Rightarrow$. (To define the relation "$\Rightarrow$" for messages, we identify a message with the event of sending it.) (ii) $P_i$ has received a message from every other process timestamped later than $T_m$.⁶
Note that conditions (i) and (ii) of rule 5 are tested locally by $P_i$.

It is easy to verify that the algorithm defined by these rules satisfies conditions I-III. First of all, observe that condition (ii) of rule 5, together with the assumption that messages are received in order, guarantees that $P_i$ has learned about all requests which preceded its current request. Since rules 3 and 4 are the only ones which delete messages from the request queue, it is then easy to see that condition I holds. Condition II follows from the fact that the total ordering $\Rightarrow$ extends the partial ordering $\to$. Rule 2 guarantees that after $P_i$ requests the resource, condition (ii) of rule 5 will eventually hold. Rules 3 and 4 imply that if each process which is granted the resource eventually releases it, then condition (i) of rule 5 will eventually hold, thus proving condition III.

This is a distributed algorithm. Each process independently follows these rules, and there is no central synchronizing process or central storage. This approach can be generalized to implement any desired synchronization for such a distributed multiprocess system. The synchronization is specified in terms of a State Machine, consisting of a set C of possible commands, a set S of possible states, and a function $e: C \times S \to S$. The relation $e(C, S) = S'$ means that executing the command C with the machine in state S causes the machine state to change to S'. In our example, the set C consists of all the commands $P_i$ requests resource and $P_i$ releases resource, and the state consists of a queue of waiting request commands, where the request at the head of the queue is the currently granted one. Executing a request command adds the request to the tail of the queue, and executing a release command removes a command from the queue.⁷

Each process independently simulates the execution of the State Machine, using the commands issued by all the processes. Synchronization is achieved because all processes order the commands according to their timestamps (using the relation $\Rightarrow$), so each process uses the same sequence of commands. A process can execute a command timestamped T when it has learned of all commands issued by all other processes with timestamps less than or equal to T. The precise algorithm is straightforward, and we will not bother to describe it.

³ The ordering $\prec$ establishes a priority among the processes. If a "fairer" method is desired, then $\prec$ can be made a function of the clock value. For example, if $C_i(a) = C_j(b)$ and $j < i$, then we can let $a \Rightarrow b$ if $j < C_i(a) \bmod N \le i$, and $b \Rightarrow a$ otherwise; where N is the total number of processes.
⁴ The term "eventually" should be made precise, but that would require too long a diversion from our main topic.
⁵ This acknowledgment message need not be sent if $P_j$ has already sent a message to $P_i$ timestamped later than $T_m$.
⁶ If $P_i \prec P_j$, then $P_i$ need only have received a message timestamped $\ge T_m$ from $P_j$.
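The five rules of the mutual exclusion algorithm above can be exercised in a small single-threaded simulation. This is a sketch under stated assumptions rather than the paper's implementation: the names `Process`, `deliver_all`, and `last_seen` are hypothetical, process ids double as the tie-breaking order, channels are FIFO queues, $T_0$ is encoded as -1, and each process is assumed to start with `last_seen` equal to the initial clock value 0 so that $P_0$ counts as initially granted.

```python
from collections import defaultdict, deque


class Process:
    def __init__(self, pid, n, net):
        self.pid, self.net = pid, net
        self.others = [j for j in range(n) if j != pid]
        self.clock = 0
        # Every queue starts with "T0:P0 requests resource" (T0 = -1, assumed).
        self.queue = {(-1, 0)}
        # Latest timestamp received from each other process (assumed start: 0).
        self.last_seen = {j: 0 for j in self.others}

    def _broadcast(self, kind):
        self.clock += 1  # IR1: the whole rule is a single timestamped event
        for j in self.others:
            self.net[(self.pid, j)].append((kind, self.clock, self.pid))
        return self.clock

    def request(self):  # rule 1
        self.queue.add((self._broadcast("request"), self.pid))

    def release(self):  # rule 3
        self.queue = {r for r in self.queue if r[1] != self.pid}
        self._broadcast("release")

    def receive(self, kind, ts, src):
        self.clock = max(self.clock, ts) + 1  # IR2(b)
        self.last_seen[src] = max(self.last_seen[src], ts)
        if kind == "request":  # rule 2: enqueue and acknowledge
            self.queue.add((ts, src))
            self.net[(self.pid, src)].append(("ack", self.clock, self.pid))
        elif kind == "release":  # rule 4
            self.queue = {r for r in self.queue if r[1] != src}

    def granted(self):  # rule 5, tested locally
        mine = [r for r in self.queue if r[1] == self.pid]
        if not mine:
            return False
        t_m = min(mine)[0]
        first = min(self.queue) == (t_m, self.pid)            # condition (i)
        heard = all(t > t_m for t in self.last_seen.values())  # condition (ii)
        return first and heard


def deliver_all(net, procs):
    """Deliver queued messages in FIFO order per channel until none remain."""
    while any(net.values()):
        for (src, dst), chan in list(net.items()):
            if chan:
                procs[dst].receive(*chan.popleft())


net = defaultdict(deque)
procs = [Process(i, 3, net) for i in range(3)]
procs[1].request()
procs[2].request()          # concurrent with P1's request, same timestamp
deliver_all(net, procs)
holders_initial = [p.pid for p in procs if p.granted()]
procs[0].release()
deliver_all(net, procs)
holders_after_p0 = [p.pid for p in procs if p.granted()]
procs[1].release()
deliver_all(net, procs)
holders_after_p1 = [p.pid for p in procs if p.granted()]
```

In this run the two requests carry the same timestamp, so the tie is broken by process id: the resource passes from $P_0$ to $P_1$ and then to $P_2$, with exactly one holder at each step.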
This method allows one to implement any desired form of multiprocess synchronization in a distributed system. However, the resulting algorithm requires the active participation of all the processes. A process must know all the commands issued by other processes, so that the failure of a single process will make it impossible for any other process to execute State Machine commands, thereby halting the system.

The problem of failure is a difficult one, and it is beyond the scope of this paper to discuss it in any detail. We will just observe that the entire concept of failure is only meaningful in the context of physical time. Without physical time, there is no way to distinguish a failed process from one which is just pausing between events. A user can tell that a system has "crashed" only because he has been waiting too long for a response. A method which works despite the failure of individual processes or communication lines is described in [3].

⁷ If each process does not strictly alternate request and release commands, then executing a release command could delete zero, one, or more than one request from the queue.
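The State Machine formulation described above can be sketched as a pure function applied to commands sorted by timestamp. The code below is an illustration only: `execute` and `replay` are hypothetical names, the state is modeled as a tuple of process ids with the current holder at the head, and each replica is assumed to already hold the complete command log (the waiting condition on timestamps is not modeled).

```python
def execute(command, state):
    """e(C, S) = S': apply one command to the queue of waiting requests."""
    kind, pid = command
    if kind == "request":
        return state + (pid,)                       # add to the tail
    if kind == "release":
        return tuple(p for p in state if p != pid)  # drop pid's requests
    raise ValueError(f"unknown command: {kind}")


def replay(commands, initial=(0,)):
    """Apply commands in the total order given by (timestamp, process id)."""
    state = initial
    for _timestamp, pid, kind in sorted(commands):
        state = execute((kind, pid), state)
    return state


# The same three commands, learned in different orders by two replicas.
log_at_p1 = [(1, 1, "request"), (4, 0, "release"), (1, 2, "request")]
log_at_p2 = [(1, 2, "request"), (1, 1, "request"), (4, 0, "release")]
state_at_p1 = replay(log_at_p1)
state_at_p2 = replay(log_at_p2)
```

Both replicas reach the same state because sorting by (timestamp, process id) gives every process the same command sequence regardless of arrival order.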

Anomalous Behavior

Our resource scheduling algorithm ordered the requests according to the total ordering $\Rightarrow$. This permits the following type of "anomalous behavior." Consider a nationwide system of interconnected computers. Suppose a person issues a request A on a computer A, and then telephones a friend in another city to have him issue a request B on a different computer B. It is quite possible for request B to receive a lower timestamp and be ordered before request A. This can happen because the system has no way of knowing that A actually preceded B, since that precedence information is based on messages external to the system.

Let us examine the source of the problem more closely. Let $\mathcal{S}$ be the set of all system events. Let us introduce a set of events $\underline{\mathcal{S}}$ which contains the events in $\mathcal{S}$ together with all other relevant external events, such as the phone calls in our example. Let $\boldsymbol{\to}$ denote the "happened before" relation for $\underline{\mathcal{S}}$. In our example, we had $A \boldsymbol{\to} B$, but $A \nrightarrow B$. It is obvious that no algorithm based entirely upon events in $\mathcal{S}$, and which does not relate those events in any way with the other events in $\underline{\mathcal{S}}$, can guarantee that request A is ordered before request B.

There are two possible ways to avoid such anomalous behavior. The first way is to explicitly introduce into the system the necessary information about the ordering $\boldsymbol{\to}$. In our example, the person issuing request A could receive the timestamp $T_A$ of that request from the system. When issuing request B, his friend could specify that B be given a timestamp later than $T_A$. This gives the user the responsibility for avoiding anomalous behavior.

The second approach is to construct a system of clocks which satisfies the following condition.

Strong Clock Condition. For any events a, b in $\mathcal{S}$: if $a \boldsymbol{\to} b$ then $C(a) < C(b)$.

This is stronger than the ordinary Clock Condition because $\boldsymbol{\to}$ is a stronger relation than $\to$. It is not in general satisfied by our logical clocks.

Let us identify $\underline{\mathcal{S}}$ with some set of "real" events in physical space-time, and let $\boldsymbol{\to}$ be the partial ordering of events defined by special relativity. One of the mysteries of the universe is that it is possible to construct a system of physical clocks which, running quite independently of one another, will satisfy the Strong Clock Condition. We can therefore use physical clocks to eliminate anomalous behavior. We now turn our attention to such clocks.

Physical Clocks

Let us introduce a physical time coordinate into our space-time picture, and let $C_i(t)$ denote the reading of the clock $C_i$ at physical time t.⁸ For mathematical convenience, we assume that the clocks run continuously rather than in discrete "ticks." (A discrete clock can be thought of as a continuous one in which there is an error of up to ½ "tick" in reading it.) More precisely, we assume that $C_i(t)$ is a continuous, differentiable function of t except for isolated jump discontinuities where the clock is reset. Then $dC_i(t)/dt$ represents the rate at which the clock is running at time t.

In order for the clock $C_i$ to be a true physical clock, it must run at approximately the correct rate. That is, we must have $dC_i(t)/dt \approx 1$ for all t. More precisely, we will assume that the following condition is satisfied:
PC1. There exists a constant $\kappa \ll 1$ such that for all i: $|dC_i(t)/dt - 1| < \kappa$.
[...]
PC2. For all i, j: $|C_i(t) - C_j(t)| < \epsilon$.
[...]

Hence, we need only consider events occurring in different processes. Let $\mu$ be a number such that if event a occurs at physical time t and event b in another process satisfies $a \boldsymbol{\to} b$, then b occurs later than physical time $t + \mu$. In other words, $\mu$ is less than the shortest transmission time for interprocess messages. We can always choose $\mu$ equal to the shortest distance between processes divided by the speed of light. However, depending upon how messages in $\underline{\mathcal{S}}$ are transmitted, $\mu$ could be significantly larger.

To avoid anomalous behavior, we must make sure that for any i, j, and t: $C_i(t + \mu) - C_j(t) > 0$. Combining this with PC1 and PC2 allows us to relate the required smallness of $\kappa$ and $\epsilon$ to the value of $\mu$ as follows. We assume that when a clock is reset, it is always set forward and never back. (Setting it back could cause C1 to be violated.) PC1 then implies that $C_i(t + \mu) - C_i(t) > (1 - \kappa)\mu$. Using PC2, it is then easy to deduce that $C_i(t + \mu) - C_j(t) > 0$ if the following inequality holds:

$$\epsilon/(1 - \kappa) \le \mu.$$

This inequality together with PC1 and PC2 implies that anomalous behavior is impossible.

⁸ We will assume a Newtonian space-time. If the relative motion of the clocks or gravitational effects are not negligible, then $C_i(t)$ must be deduced from the actual clock reading by transforming from proper time to the arbitrarily chosen time coordinate.
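The deduction that the inequality is sufficient can be written out in two steps. The following is a worked sketch, assuming PC1 bounds the clock rate as $|dC_i(t)/dt - 1| < \kappa$, PC2 bounds the skew as $|C_i(t) - C_j(t)| < \epsilon$, and clocks are only ever reset forward:

```latex
\begin{align*}
C_i(t+\mu) - C_j(t)
  &= \bigl[C_i(t+\mu) - C_i(t)\bigr] + \bigl[C_i(t) - C_j(t)\bigr] \\
  &> (1-\kappa)\mu - \epsilon
     && \text{by PC1 (first bracket) and PC2 (second bracket)} \\
  &\ge 0
     && \text{whenever } \epsilon/(1-\kappa) \le \mu .
\end{align*}
```

As an illustrative and purely hypothetical magnitude: with $\kappa = 10^{-6}$ and $\mu = 1$ ms, the condition allows any skew $\epsilon$ up to $(1 - 10^{-6})$ ms, so the achievable synchronization error, not the drift rate, is the binding constraint.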

We now describe our algorithm for ensuring that PC2 holds. Let m be a message which is sent at physical time t and received at time t'. We define $\nu_m = t' - t$ to be the total delay of the message m. This delay will, of course, not be known to the process which receives m. However, we assume that the receiving process knows some minimum delay $\mu_m \ge 0$ such that $\mu_m \le \nu_m$. We call $\xi_m = \nu_m - \mu_m$ the unpredictable delay of the message.

We now specialize rules IR1 and IR2 for our physical clocks as follows:
IR1'. For each i, if $P_i$ does not receive a message at physical time t, then $C_i$ is differentiable at t and $dC_i(t)/dt > 0$.
IR2'. (a) If $P_i$ sends a message m at physical time t, then m contains a timestamp $T_m = C_i(t)$. (b) Upon receiving a message m at time t', process $P_j$ sets $C_j(t')$ equal to maximum $(C_j(t' - 0), T_m + \mu_m)$.

Although the rules are formally specified in terms of the physical time parameter, a process only needs to know its own clock reading and the timestamps of messages it receives. For mathematical convenience, we are assuming that each event occurs at a precise instant of physical time, and different events in the same process occur at different times. These rules are then specializations of rules IR1 and IR2, so our system of clocks satisfies the Clock Condition. The fact that real events have a finite duration causes no difficulty in implementing the algorithm. The only real concern in the implementation is making sure that the discrete clock ticks are frequent enough so C1 is maintained.

We now show that this clock synchronizing algorithm can be used to satisfy condition PC2. We assume that the system of processes is described by a directed graph in which an arc from process $P_i$ to process $P_j$ represents a communication line over which messages are sent directly from $P_i$ to $P_j$. We say that a message is sent over this arc every $\tau$ seconds if for any t, $P_i$ sends at least one message to $P_j$ between physical times t and $t + \tau$.
The diameter of the directed graph is the smallest number d such that for any pair of distinct processes $P_j$, $P_k$, there is a path from $P_j$ to $P_k$ having at most d arcs.

In addition to establishing PC2, the following theorem bounds the length of time it can take the clocks to become synchronized when the system is first started.

THEOREM. Assume a strongly connected graph of processes with diameter d which always obeys rules IR1' and IR2'. Assume that for any message m, $\mu_m \le \mu$ for some constant $\mu$, and that for all $t > t_0$: (a) PC1 holds. (b) There are constants $\tau$ and $\xi$ such that every $\tau$ seconds a message with an unpredictable delay less than $\xi$ is sent over every arc. Then PC2 is satisfied with $\epsilon \approx d(2\kappa\tau + \xi)$ for all $t \gtrsim t_0 + \tau d$, where the approximations assume $\mu + \xi$ [...]

Conclusion

We have seen that the concept of "happening before" defines an invariant partial ordering of the events in a distributed multiprocess system. We described an algorithm for extending that partial ordering to a somewhat arbitrary total ordering, and showed how this total ordering can be used to solve a simple synchronization problem. A future paper will show how this approach can be extended to solve any synchronization problem.

The total ordering defined by the algorithm is somewhat arbitrary. It can produce anomalous behavior if it disagrees with the ordering perceived by the system's users. This can be prevented by the use of properly synchronized physical clocks. Our theorem showed how closely the clocks can be synchronized.

In a distributed system, it is important to realize that the order in which events occur is only a partial ordering. We believe that this idea is useful in understanding any multiprocess system. It should help one to understand the basic problems of multiprocessing independently of the mechanisms used to solve them.

Appendix

Proof of the Theorem

For any i and t, let us define $C_i^t$ to be a clock which is set equal to $C_i$ at time t and runs at the same rate as $C_i$, but is never reset. In other words,

$$C_i^t(t') = C_i(t) + \int_t^{t'} [dC_i(t)/dt]\,dt. \qquad (1)$$

Note that

$$C_i(t') \ge C_i^t(t') \text{ for all } t' \ge t. \qquad (2)$$

Suppose process $P_1$ at time $t_1$ sends a message to process $P_2$ which is received at time $t_2$ with an unpredictable delay $\le \xi$, where $t_0 \le t_1 \le t_2$. Then for all $t \ge t_2$ we have:

$$\begin{aligned}
C_2^{t_2}(t) &\ge C_2^{t_2}(t_2) + (1 - \kappa)(t - t_2) && \text{[by (1) and PC1]} \\
&\ge C_1(t_1) + \mu_m + (1 - \kappa)(t - t_2) && \text{[by IR2'(b)]} \\
&= C_1(t_1) + (1 - \kappa)(t - t_1) - [(t_2 - t_1) - \mu_m] + \kappa(t_2 - t_1) \\
&\ge C_1(t_1) + (1 - \kappa)(t - t_1) - \xi.
\end{aligned}$$

Hence, with these assumptions, for all $t \ge t_2$ we have:

$$C_2^{t_2}(t) \ge C_1(t_1) + (1 - \kappa)(t - t_1) - \xi. \qquad (3)$$

Now suppose that for $i = 1, \ldots, n$ we have $t_i \le t_i' <$ [...]

$$C_{n+1}^{t_{n+1}}(t) \ge C_1(t_1') + (1 - \kappa)(t - t_1') - n\xi. \qquad (4)$$

From PC1, IR1' and IR2' we deduce that

$$C_1(t_1') \ge C_1(t_1) + (1 - \kappa)(t_1' - t_1).$$

Combining this with (4) and using (2), we get

$$C_{n+1}(t) \ge C_1(t_1) + (1 - \kappa)(t - t_1) - n\xi$$

for $t \ge t_{n+1}$.

For any two processes P and P', we can find a sequence of processes $P = P_0, P_1, \ldots, P_{n+1} = P'$, $n \le d$, with communication arcs from each $P_i$ to $P_{i+1}$. By hypothesis (b) we can find times $t_i$, $t_i'$ with $t_i' - t_i$ [...]

[...] $n'$. We learn from these examples that (in general) a consistent global state requires

$$n = n'. \qquad (1)$$

ACM Transactions on Computer Systems, Vol. 3, No. 1, February 1985.

K. M. Chandy and L. Lamport

Let m be the number of messages received along c before q's state is recorded. Let m' be the number of messages received along c before c's state is recorded. We leave it up to the reader to extend the example to show that consistency requires

$$m = m'. \qquad (2)$$

In every state, the number of messages received along a channel cannot exceed the number of messages sent along that channel, that is,

$$n' \ge m'. \qquad (3)$$

From the above equations,

$$n \ge m. \qquad (4)$$

The state of channel c that is recorded must be the sequence of messages sent along the channel before the sender's state is recorded, excluding the sequence of messages received along the channel before the receiver's state is recorded; that is, if $n' = m'$, the recorded state of c must be the empty sequence, and if $n' > m'$, the recorded state of c must be the $(m' + 1)$st, ..., $n'$th messages sent by p along c. This fact and eqs. (1)-(4) suggest a simple algorithm by which q can record the state of channel c. Process p sends a special message, called a marker, after the nth message it sends along c (and before sending further messages along c). The marker has no effect on the underlying computation. The state of c is the sequence of messages received by q after q records its own state and before q receives the marker along c. To ensure eq. (4), q must record its state, if it has not done so already, after receiving a marker along c and before q receives further messages along c.

Our example suggests the following outline for a global state detection algorithm.

3.2 Global-State-Detection Algorithm Outline

Marker-Sending Rule for a Process p. For each channel c, incident on, and directed away from p:

    p sends one marker along c after p records its state and before p sends further messages along c.

Marker-Receiving Rule for a Process q. On receiving a marker along a channel c:

    if q has not recorded its state then
        begin q records its state;
              q records the state of c as the empty sequence
        end
    else q records the state of c as the sequence of messages received along c after q's state was recorded and before q received the marker along c.

3.3 Termination of the Algorithm

The marker receiving and sending rules guarantee that if a marker is received along every channel, then each process will record its state and the states of all incoming channels. To ensure that the global-state recording algorithm terminates in finite time, each process must ensure that (L1) no marker remains forever in an incident input channel and (L2) it records its state within finite time of initiation of the algorithm.

The algorithm can be initiated by one or more processes, each of which records its state spontaneously, without receiving markers from other processes; we postpone discussion of what may cause a process to record its state spontaneously. If process p records its state and there is a channel from p to a process q, then q will record its state in finite time because p will send a marker along the channel and q will receive the marker in finite time (L1). Hence if p records its state and there is a path (in the graph representing the system) from p to a process q, then q will record its state in finite time because, by induction, every process along the path will record its state in finite time. Termination in finite time is ensured if for every process q: q spontaneously records its state or there is a path from a process p, which spontaneously records its state, to q.

In particular, if the graph is strongly connected and at least one process spontaneously records its state, then all processes will record their states in finite time (provided L1 is ensured).

The algorithm described so far allows each process to record its state and the states of incoming channels. The recorded process and channel states must be collected and assembled to form the recorded global state. We shall not describe algorithms for collecting the recorded information because such algorithms have been described elsewhere [4, 10]. A simple algorithm for collecting information in a system whose topology is strongly connected is for each process to send the information it records along all outgoing channels, and for each process receiving information for the first time to copy it and propagate it along all of its outgoing channels. All the recorded information will then get to all the processes in finite time, allowing all processes to determine the recorded global state.

4. PROPERTIES OF THE RECORDED GLOBAL STATE

To gain an intuitive understanding of the properties of the global state recorded by the algorithm, we shall study Example 2.2. Assume that the state of p is recorded in global state $S_0$ (Figure 7), so the state recorded for p is A. After recording its state, p sends a marker along channel c. Now assume that the system goes to global state $S_1$, then $S_2$, and then $S_3$ while the marker is still in transit, and the marker is received by q when the system is in global state $S_3$. On receiving the marker, q records its state, which is D, and records the state of c to be the empty sequence. After recording its state, q sends a marker along channel c'. On receiving the marker, p records the state of c' as the sequence consisting of the single message M'. The recorded global state S* is shown in Figure 8. The recording algorithm was initiated in global state $S_0$ and terminated in global state $S_3$.

Observe that the global state S* recorded by the algorithm is not identical to any of the global states $S_0$, $S_1$, $S_2$, $S_3$ that occurred in the computation. Of what use is the algorithm if the recorded global state never occurred? We shall now answer this question.
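The marker-sending and marker-receiving rules can be replayed on the two-process scenario described in the paragraph above. The sketch below is illustrative only: the class `Proc` and its method names are hypothetical, and the three application events between $S_0$ and $S_3$ (p sends M, q sends M', p receives M') are an assumed reconstruction chosen to match the recorded outcome stated in the text.

```python
from collections import deque

MARKER = "MARKER"


class Proc:
    def __init__(self, state, inputs, outputs):
        self.state = state
        self.inputs = inputs        # incoming channels: name -> FIFO deque
        self.outputs = outputs      # outgoing channels: name -> FIFO deque
        self.recorded_state = None
        self.channel_state = {}     # recorded states of incoming channels
        self.recording = {}         # incoming channels still being recorded

    def record(self):
        """Record own state, then send one marker on every outgoing channel."""
        self.recorded_state = self.state
        self.recording = {c: [] for c in self.inputs}
        for chan in self.outputs.values():
            chan.append(MARKER)

    def send(self, channel, msg, new_state):
        self.outputs[channel].append(msg)
        self.state = new_state

    def receive(self, channel, new_state=None):
        msg = self.inputs[channel].popleft()
        if msg == MARKER:
            if self.recorded_state is None:
                self.record()       # channel will be recorded as empty
            # Messages seen since recording began form the channel state.
            self.channel_state[channel] = self.recording.pop(channel)
        else:
            if channel in self.recording:
                self.recording[channel].append(msg)
            if new_state is not None:
                self.state = new_state
        return msg


c, c_prime = deque(), deque()
p = Proc("A", inputs={"c'": c_prime}, outputs={"c": c})
q = Proc("C", inputs={"c": c}, outputs={"c'": c_prime})

p.record()                 # S0: p records state A and sends a marker along c
p.send("c", "M", "B")      # assumed event: p sends M (behind the marker)
q.send("c'", "M'", "D")    # assumed event: q sends M'
p.receive("c'", "A")       # assumed event: p receives M'
q.receive("c")             # q receives the marker: records D, c is empty
p.receive("c'")            # p receives the marker: c' recorded as [M']
```

The recorded global state is p in state A, q in state D, c empty, and c' holding M', as in the text. The message M is still in transit on c when recording ends, so the recorded state differs from the actual state at termination.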

[Fig. 8. A recorded global state for Example 2.2 (figure not reproduced).]

Let seq = $(e_i, 0 \le i)$ be a distributed computation, and let $S_i$ be the global state of the system immediately before event $e_i$, $0 \le i$, in seq. Let the algorithm be initiated in global state $S_\iota$ and let it terminate in global state $S_\phi$, $0 \le \iota \le \phi$; in other words, the algorithm is initiated after $e_{\iota-1}$ if $\iota > 0$, and before $e_\iota$, and it terminates after $e_{\phi-1}$ if $\phi > 0$, and before $e_\phi$. We observed in Example 2.2 that the recorded global state S* may be different from all global states $S_k$, $\iota \le k \le \phi$. We shall show that:
(1) S* is reachable from $S_\iota$, and
(2) $S_\phi$ is reachable from S*.
Specifically, we shall show that there exists a computation seq' where
(1) seq' is a permutation of seq, such that $S_\iota$, S* and $S_\phi$ occur as global states in seq',
(2) $S_\iota$ = S* or $S_\iota$ occurs earlier than S*, and
(3) $S_\phi$ = S* or S* occurs earlier than $S_\phi$ in seq'.

THEOREM 1. There exists a computation seq' = $(e_i', 0 \le i)$ where
(1) For all i, where $i <$ [...]

[...] $\phi$, are postrecording events in seq. There may be a postrecording event $e_{j-1}$ before a prerecording event $e_j$ for some j, $\iota < j < \phi$; this can occur only if $e_{j-1}$ and $e_j$ are in different processes (because if $e_{j-1}$ and $e_j$ are in the same process and $e_{j-1}$ is a postrecording event, then so is $e_j$).

We shall derive a computation seq' by permuting seq, where all prerecording events occur before all postrecording events in seq'. We shall show that S* is the global state in seq' after all prerecording events and before all postrecording events.

Assume that there is a postrecording event $e_{j-1}$ before a prerecording event $e_j$ in seq. We shall show that the sequence obtained by interchanging $e_{j-1}$ and $e_j$ must also be a computation. Events $e_{j-1}$ and $e_j$ must be on different processes. Let p be the process in which $e_{j-1}$ occurs, and let q be the process in which $e_j$ occurs. There cannot be a message sent at $e_{j-1}$ which is received at $e_j$ because (1) if a message is sent along a channel c when event $e_{j-1}$ occurs, then a marker must have been sent along c before $e_{j-1}$, since $e_{j-1}$ is a postrecording event, and (2) if the message is received along channel c when $e_j$ occurs, then the marker must have been received along c before $e_j$ occurs (since channels are first-in-first-out), in which case (by the marker-receiving rule) $e_j$ would be a postrecording event too.

The state of process q is not altered by the occurrence of event $e_{j-1}$ because $e_{j-1}$ is in a different process p. If $e_j$ is an event in which q receives a message M along a channel c, then M must have been the message at the head of c before event $e_{j-1}$, since a message sent at $e_{j-1}$ cannot be received at $e_j$. Hence event $e_j$ can occur in global state $S_{j-1}$.

The state of process p is not altered by the occurrence of $e_j$. Hence $e_{j-1}$ can occur after $e_j$. Hence the sequence of events $e_1, \ldots, e_{j-2}, e_j, e_{j-1}$ is a computation. From the arguments in the last paragraph it follows that the global state after computation $e_1, \ldots, e_j$ is the same as the global state after computation $e_1, \ldots, e_{j-2}, e_j, e_{j-1}$.

Let seq* be a permutation of seq that is identical to seq except that $e_j$ and $e_{j-1}$ are interchanged. Then seq* must also be a computation. Let $\bar{S}_i$ be the global state immediately before the ith event in seq*. From the arguments of the previous paragraph, $\bar{S}_i = S_i$ for all i where $i \ne j$.
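The interchange step can be applied repeatedly, much like a bubble sort over the event sequence. The following is a minimal sketch of that idea, not code from the paper: the function name `separate_events` and the `(name, process, kind)` tuple layout are assumptions made for illustration. It only swaps an adjacent pair when a postrecording event is directly followed by a prerecording event, and it checks the property the proof relies on, namely that such a pair is always on different processes.

```python
def separate_events(events):
    """Reorder events so every prerecording event precedes every
    postrecording event, using only adjacent interchanges.

    events: list of (name, process, kind) with kind "pre" or "post".
    The tuple layout is a hypothetical encoding chosen for this sketch.
    """
    seq = list(events)
    swapped = True
    while swapped:
        swapped = False
        for j in range(1, len(seq)):
            prev, cur = seq[j - 1], seq[j]
            if prev[2] == "post" and cur[2] == "pre":
                # The proof argues such a pair lies on different processes,
                # so interchanging it does not reorder any single process.
                assert prev[1] != cur[1], "same-process pair cannot occur"
                seq[j - 1], seq[j] = cur, prev
                swapped = True
    return seq


example = [
    ("a", "p", "pre"),
    ("b", "p", "post"),
    ("c", "q", "pre"),
    ("d", "q", "post"),
]
reordered = separate_events(example)
```

In the example, only the pair (b, c) is interchanged, so the relative order of events within each process is unchanged.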

By repeatedly swapping postrecording events that immediately follow prerecording events, we see that there exists a permutation seq' of seq in which
(1) all prerecording events precede all postrecording events,
(2) seq' is a computation,
(3) for all i where i …

Actions executed by a MH h

When h needs access to the token,
– it submits a request to its current local MSS, say M, and
– stores M in the local variable req_locn.

In addition to the above quantitative benefits, R-MH is vulnerable to disconnection of any MH, and a separate algorithm will need to be executed to reconfigure the ring amongst the remaining MHs. In comparison, with the logical ring within the fixed network, disconnection of a MH that does not need to access the token has no effect on the algorithm execution. Disconnection of a MH with a pending request can be easily handled, since a “disconnected” flag is set for the particular MH at some MSS M within the fixed network: when a MSS M’ (where the MH’s token request is pending) attempts to forward the token to the MH, it is informed by M of the MH’s disconnection, and M’ can then cancel the MH’s pending request.


After every move, h now includes req_locn with the join() message, i.e. it sends a join(h, req_locn) message to the MSS M’ upon entering the cell under M’.
– If req_locn received with the join() message is not P , then M’ sends an inform(h, M’) message to the MSS req_locn.

Comparison of search and inform strategies

To compare the search and inform strategies, let a MH h submit a request at MSS M and receive the token at M’. Assume that it makes MOB moves in the intervening period. Then, after each of these moves, an inform() message was sent to M, i.e. the inform cost is $MOB \times C_{\text{fixed}}$. In algorithm R-MSS:search, on the other hand, M would search for the current location of h, and the cost incurred would be $C_{\text{search}}$. Thus, the inform strategy is preferable to the search strategy when $MOB \times C_{\text{fixed}} < C_{\text{search}}$, i.e., if h changes cells “less often” after submitting its request, then it is better for h to inform M of every change in its location rather than M searching for h.
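The comparison above reduces to a single inequality. As a minimal sketch (the function names and the numeric costs in the usage lines are illustrative assumptions, not values from the paper), the choice between the two strategies can be written as:

```python
def inform_cost(mob, c_fixed):
    """Cost of sending one inform() message over the fixed network per move."""
    return mob * c_fixed


def preferred_strategy(mob, c_fixed, c_search):
    """Return "inform" when MOB * C_fixed < C_search, otherwise "search"."""
    return "inform" if inform_cost(mob, c_fixed) < c_search else "search"


# Hypothetical unit costs, chosen only to show both outcomes.
few_moves = preferred_strategy(mob=2, c_fixed=1, c_search=5)
many_moves = preferred_strategy(mob=10, c_fixed=1, c_search=5)
```

With these made-up costs, a host that moves twice is better served by inform, while one that moves ten times is better served by search.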

INFORM Strategy

An alternative to the search strategy to locate a migrant MH is to require the MH to notify the MSS (where it submitted its request) after every change in its location till it receives the token.

Actions executed by a MSS M

When h receives the token from the MSS req_locn, it accesses the critical region, returns the token to the same MSS and then sets req_locn to P .

On receipt of a request from a local MH h, M adds a request to the rear of its request queue.

Upon receipt of an inform(h, M’) message, the current value of locn(h) is replaced with M’ in the entry in M’s request queue.

On receipt of the token, M executes the following steps:
1. Entries from the request queue are moved to the grant queue.
2. Repeat
– Remove the request at the head of the grant queue
– If locn(h) == M, then deliver the token to h over the local wireless link

PROXY Strategy

The efficiency of search and inform strategies is determined by the number of moves made by a migrant MH. While a search strategy is useful for migrant MHs that “frequently” change their cells, the inform strategy is better for migrant MHs that change their locations less often. We now present a third strategy that combines advantages of both search and inform strategies, and is tuned for a mobility pattern wherein a migrant MH moves frequently between “adjacent” cells while rarely moving between non-adjacent cells. The set of all MSSs is partitioned into “areas”, and MSSs within the same area are associated with a common proxy. The proxy is a static host, but not necessarily a MSS. The token circulates in a logical ring, which now comprises only the proxies. On receiving the token, each proxy is responsible for servicing pending requests from its request queue. Each request in the queue is an ordered pair <h, proxy(h)>, where h is the MH submitting the request and proxy(h) represents the area, i.e. proxy, where h is currently located. A MH makes a “wide area” move when its local MSSs before and after the move are in different areas, i.e. the proxy associated with its new cell is different from that of the cell prior to the move. Analogously, a “local area” move occurs when its proxy does not change after a move. This assumes that a MH is aware of the identity of the proxy associated with its current cell; this could be implemented by having each MSS include the identity of its associated proxy in the periodic beacon message.

Actions executed by a proxy P

On receipt of a token request from a MH h (forwarded by a MSS within P’s local area), the request is appended to the rear of the request queue.

When P receives an inform(h, P’) message, the current value of proxy(h) in the entry is changed to P’.

When P receives the token from its predecessor in the logical ring, it executes the following steps:
1. Entries from the request queue are moved to the grant queue.
2. Repeat
– Delete the request from the head of the grant queue.
– If proxy(h) == P, i.e. h is located within P’s area, then deliver the token to h after searching for h within the MSSs under P.
– Else, forward the token to proxy(h) (different from P), which will deliver the token to h after a local search for h within its area.
– Await return of the token from h.
Until grant queue is empty
3. Forward token to P’s successor in the ring.

Actions executed by a MH h

– If init_proxy is not P , and init_proxy is different from the proxy P’ serving the new MSS, i.e. h has made a wide-area move, then the new MSS sends an inform(h, P’) message to init_proxy.

Communication cost

Let the number of proxies constituting the ring be $N_{\text{proxy}}$, and the number of MSSs be $N_{\text{mss}}$; the number of MSSs within each area is thus $N_{\text{mss}} / N_{\text{proxy}}$. Let $MOB_{\text{wide}}$ be the number of wide-area moves made by a MH in the period between submitting a token request and receiving the token; $MOB_{\text{local}}$ represents the total number of local-area moves in the same period, and MOB is the sum of local and wide area moves. Prior to delivering the token to a MH, a proxy needs to locate the MH amongst the MSSs within its area. We refer to this as a local search, with an associated cost $C_{\text{l-search}}$, and formulate the communication costs as follows:

• Cost of one token circulation in the ring: $N_{\text{proxy}} \times C_{\text{fixed}}$
• Cost of
– submitting a token request from a MH to its proxy: $C_{\text{wireless}} + C_{\text{fixed}}$
– delivering the token to the MH: $C_{\text{fixed}} + C_{\text{l-search}} + C_{\text{wireless}}$ (the $C_{\text{fixed}}$ term can be dropped if the MH receives the token in the same area where it submitted its request).
– returning the token from the MH to the proxy: $C_{\text{wireless}} + C_{\text{fixed}}$
The above costs add up to: $3C_{\text{wireless}} + 3C_{\text{fixed}} + C_{\text{l-search}}$
• Informing a proxy after a wide-area move: $C_{\text{fixed}}$
• The overall cost (worst case) of satisfying a request from a MH, including the inform cost, is then $(3C_{\text{wireless}} + 3C_{\text{fixed}} + C_{\text{l-search}}) + (MOB_{\text{wide}} \times C_{\text{fixed}})$

Comparison with search and inform strategies

It is obvious that the cost of circulating the token amongst the proxies is less than circulating it amongst all MSSs (as is the case for search and inform strategies) by a factor $N_{\text{proxy}} / N_{\text{mss}}$. The cost of circulating the token within the logical ring is a measure of distribution of the overall computational workload amongst static hosts. If all three schemes service the same number of token requests in one traversal of the ring, then this workload is shared by $N_{\text{mss}}$ MSSs in the search and inform schemes, while it is spread amongst $N_{\text{proxy}}$ fixed hosts under the proxy method. To compare the efficiency of each algorithm to handle mobility, we need to consider the communication cost of satisfying token requests.
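The proxy-strategy cost terms listed in this section can be added up mechanically. The sketch below is only an illustration of that arithmetic: the function names and the unit costs in the usage lines are assumptions, not figures from the paper.

```python
def ring_circulation_cost(n_hosts, c_fixed):
    """Cost of passing the token once around a ring of n_hosts static hosts."""
    return n_hosts * c_fixed


def proxy_request_cost(c_wireless, c_fixed, c_l_search, mob_wide):
    """Worst-case cost of one token request under the proxy strategy.

    submit   : c_wireless + c_fixed
    deliver  : c_fixed + c_l_search + c_wireless
    return   : c_wireless + c_fixed
    informs  : one fixed-network message per wide-area move
    """
    submit = c_wireless + c_fixed
    deliver = c_fixed + c_l_search + c_wireless
    give_back = c_wireless + c_fixed
    return submit + deliver + give_back + mob_wide * c_fixed


# Hypothetical unit costs, chosen only to exercise the formula.
total = proxy_request_cost(c_wireless=2, c_fixed=1, c_l_search=4, mob_wide=3)
proxy_ring = ring_circulation_cost(n_hosts=5, c_fixed=1)
mss_ring = ring_circulation_cost(n_hosts=50, c_fixed=1)
```

With these made-up numbers the ring of 5 proxies costs one tenth of a ring of 50 MSSs per circulation, which is the ratio of proxies to MSSs.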

Significance Threshold 1% 1% 𝑇

20

Approximate Synchronous Parallel The significance filter • Filter updates based on their significance

ASP selective barrier • Ensure significant updates are read in time

Mirror clock • Safeguard for pathological cases
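A significance filter of the kind named on this slide can be sketched as follows. The slide does not define "significance", so this sketch assumes it is the magnitude of the accumulated update relative to the current parameter value, compared against a fixed threshold; the class name, method name, and the small guard value for zero parameters are all hypothetical.

```python
class SignificanceFilter:
    """Accumulates per-parameter updates and releases them only once the
    accumulated change is large relative to the parameter's value.

    Assumption for this sketch: significance = |accumulated| / |value|.
    """

    def __init__(self, threshold=0.01):
        self.threshold = threshold
        self.pending = {}  # parameter key -> accumulated, unsent update

    def add(self, key, value, delta):
        """Record delta for key; return the aggregated update to send
        to other data centers, or None if it is still insignificant."""
        acc = self.pending.get(key, 0.0) + delta
        denom = abs(value) if value != 0 else 1e-12  # guard, assumed
        if abs(acc) / denom >= self.threshold:
            self.pending.pop(key, None)
            return acc
        self.pending[key] = acc
        return None


f = SignificanceFilter(threshold=0.01)
first = f.add("w", 100.0, 0.5)    # 0.5% of the value: held back
second = f.add("w", 100.0, 0.75)  # accumulated 1.25%: released
```

Holding back and aggregating small updates is what reduces traffic between data centers; the selective barrier and mirror clock listed on the slide are separate mechanisms and are not modeled here.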

21

ASP Selective Barrier
[Figure: parameter servers in Data Center 1 and Data Center 2 exchanging significant updates. Top panel: the significant updates are marked "Arrive too late!". Bottom panel: a selective barrier accompanies the significant updates.]
Only workers that depend on these parameters are blocked

Outline
• Problem & Goal
• Background & Motivation
• Gaia System Overview
• Approximate Synchronous Parallel
• System Implementation
• Evaluation
• Conclusion

Put it All Together: The Gaia System
[Figure: two Gaia parameter servers on either side of a data center boundary. Components shown: worker machines, local server, parameter store, significance filter, selective barrier, mirror server, mirror client, control queue, and data queue; flows are labeled "Update" and "Aggregated Update".]
• Control messages (barriers, etc.) are always prioritized
• No change is required for ML algorithms and ML programs

Problem: Broadcast Significant Updates

Communication overhead is proportional to the number of data centers

Mitigation: Overlay Networks and Hubs
[Figure: four data center groups, each with a hub.]
Save communication on WANs by aggregating the updates at hubs


Methodology
• Applications
  • Matrix Factorization with the Netflix dataset
  • Topic Modeling with the NYTimes dataset
  • Image Classification with the ILSVRC12 dataset
• Hardware platform
  • 22 machines with emulated EC2 WAN bandwidth
  • We validated the performance with a real EC2 deployment
• Baseline
  • IterStore (Cui et al., SoCC’14) and GeePS (Cui et al., EuroSys’16) on WAN
• Performance metrics
  • Execution time until algorithm convergence
  • Monetary cost of algorithm convergence

Performance – 11 EC2 Data Centers
[Figure: bar chart of normalized execution time for Baseline, Gaia, and LAN on Matrix Factorization, Topic Modeling, and Image Classification.]
Gaia achieves 3.7-6.0X speedup over Baseline
Gaia is at most 1.40X of LAN speeds

Performance and WAN Bandwidth
[Figure: two bar charts of normalized execution time for Baseline, Gaia, and LAN on Matrix Factorization, Topic Modeling, and Image Classification. Panels: V/C WAN (Virginia/California) and S/S WAN (Singapore/São Paulo).]
Gaia achieves 3.7-53.5X speedup over Baseline
Gaia is at most 1.23X of LAN speeds

Results – EC2 Monetary Cost
Gaia is 2.6-59.0X cheaper than Baseline
[Figure: stacked bar charts of normalized cost for Baseline and Gaia on Matrix Factorization, Topic Modeling, and Image Classification under EC2-ALL, V/C WAN, and S/S WAN. Cost components: Communication Cost, Machine Cost (Network), Machine Cost (Compute).]

More in the Paper
• Convergence proof of Approximate Synchronous Parallel (ASP)
• ASP vs. fully asynchronous
• Gaia vs. centralizing data approach

Key Takeaways • The Problem: How to perform ML on geo-distributed data? • Centralizing data is infeasible. Geo-distributed ML is very slow

• Our Gaia Approach • Decouple the synchronization model within the data center from that across data centers • Eliminate insignificant updates across data centers

• A new synchronization model: Approximate Synchronous Parallel • Retain the correctness and accuracy of ML algorithms

• Key Results: • 1.8-53.5X speedup over state-of-the-art ML systems on WANs • at most 1.40X of LAN speeds • without requiring changes to algorithms

34

Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds
Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, Onur Mutlu



Executive Summary • The Problem: How to perform ML on geo-distributed data? • Centralizing data is infeasible. Geo-distributed ML is very slow

• Our Goal • Minimize communication over WANs • Retain the correctness and accuracy of ML algorithms • Without requiring changes to ML algorithms

• Our Gaia Approach • Decouple the synchronization model within the data center from that across data centers: Eliminate insignificant updates on WANs • A new synchronization model: Approximate Synchronous Parallel

• Key Results: • 1.8-53.5X speedup over state-of-the-art ML systems on WANs • within 1.40X of LAN speeds

36

Approximate Synchronous Parallel The significance filter • Filter updates based on their significance

ASP selective barrier • Ensure significant updates are read in time

Mirror clock • Safeguard for pathological cases

37

Mirror Clock
[Figure: parameter servers in Data Center 1 and Data Center 2. Top panel: a barrier alone gives no guarantee under extreme network conditions. Bottom panel: Data Center 1 is at clock N and Data Center 2 is at clock N + DS.]
Guarantees all significant updates are seen after DS clocks

Effect of Synchronization Mechanisms
[Figure: objective value versus time (seconds) for Gaia and Gaia_Async, with the convergence value marked. Panels: Matrix Factorization and Topic Modeling.]

Methodology Details
• Hardware
  • A 22-node cluster. Each node has a 16-core Intel Xeon CPU (E5-2698), an NVIDIA Titan X GPU, 64GB RAM, and a 40GbE NIC
• Application details
  • Matrix Factorization: SGD algorithm, 500 ranks
  • Topic Modeling: Gibbs sampling, 500 topics
• Convergence criteria
  • The value of the objective function changes less than 2% over the course of 10 iterations
• Significance Threshold
  • 1% and shrinks over time

1% 2
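The convergence criterion on this slide (objective changes by less than 2% over 10 iterations) can be checked with a few lines of code. This is a sketch under one stated assumption: "changes over the course of 10 iterations" is read as comparing the latest objective value with the value recorded 10 iterations earlier. The function name and argument names are hypothetical.

```python
def has_converged(history, window=10, tolerance=0.02):
    """Return True when the objective changed by less than `tolerance`
    (relative) between the latest value and the one `window` steps earlier.

    history: list of objective values, one per iteration, oldest first.
    """
    if len(history) <= window:
        return False  # not enough iterations observed yet
    old, new = history[-window - 1], history[-1]
    if old == 0:
        return new == 0
    return abs(new - old) / abs(old) < tolerance


flat = [100.0] * 11                               # no change at all
falling = [100.0 - i for i in range(12)]          # still dropping by ~10%
too_short = [100.0] * 5
```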

40

ML System Performance Comparison
• IterStore [Cui et al., SoCC’14] shows 10X performance improvement over PowerGraph [Gonzalez et al., OSDI’12] for Matrix Factorization
• PowerGraph matches the performance of GraphX [Gonzalez et al., OSDI’14], a Spark-based system

41

Matrix Factorization (1/3) • Matrix factorization (also known as collaborative filtering) is a technique commonly used in recommender systems

42

Matrix Factorization (2/3)
[Figure: a User × Movie rating matrix factored into user preference parameters (θ) and movie parameters (x), each with the chosen rank.]

43

Matrix Factorization (3/3) • Objective function (L2 regularization)

• Solve with stochastic gradient descent (SGD)
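The objective formula itself did not survive on this slide, so the sketch below assumes the common form: for each observed rating r, minimize 0.5 · (θ·x − r)² + 0.5 · reg · (‖θ‖² + ‖x‖²), where θ is the user's parameter vector and x is the movie's. Under that assumption, one SGD step on a single rating looks like this (the function name, learning rate, and regularization weight are illustrative choices):

```python
def sgd_step(theta_u, x_m, rating, lr=0.01, reg=0.1):
    """One stochastic gradient step on a single (user, movie, rating) entry.

    theta_u: user preference parameters (list of floats)
    x_m:     movie parameters (list of floats, same length)
    Returns updated copies of both vectors.
    """
    pred = sum(t * x for t, x in zip(theta_u, x_m))
    err = pred - rating
    # Gradients of the assumed per-entry objective with L2 regularization.
    new_theta = [t - lr * (err * x + reg * t) for t, x in zip(theta_u, x_m)]
    new_x = [x - lr * (err * t + reg * x) for t, x in zip(theta_u, x_m)]
    return new_theta, new_x


theta, x = [1.0, 0.0], [1.0, 1.0]
new_theta, new_x = sgd_step(theta, x, rating=3.0, lr=0.1, reg=0.0)
old_err = abs(sum(t * v for t, v in zip(theta, x)) - 3.0)
new_err = abs(sum(t * v for t, v in zip(new_theta, new_x)) - 3.0)
```

In a parameter-server setting, the differences between the new and old vectors are the updates that workers would push.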

44

Background – BSP
• BSP (Bulk Synchronous Parallel)
  • All machines need to receive all updates before proceeding to the next iteration
[Figure: timeline of Worker 1, Worker 2, and Worker 3 over clocks 0 to 3.]

45

Background – SSP
• SSP (Stale Synchronous Parallel)
  • Allows the fastest worker to be ahead of the slowest worker by a bounded number of iterations
[Figure: timeline of Worker 1, Worker 2, and Worker 3 over clocks 0 to 3, with Staleness = 1.]
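The difference between BSP and SSP comes down to one bound on how far a worker may run ahead. A minimal sketch, assuming each worker's clock counts its completed iterations (the function name is hypothetical):

```python
def may_proceed(worker_clock, all_clocks, staleness):
    """A worker may start its next iteration only if it is at most
    `staleness` iterations ahead of the slowest worker.

    staleness = 0 gives BSP behavior; staleness > 0 gives SSP.
    """
    return worker_clock - min(all_clocks) <= staleness


clocks = [2, 1, 0]  # completed iterations of workers 1, 2, 3
ssp_mid = may_proceed(1, clocks, staleness=1)    # one ahead: allowed
ssp_fast = may_proceed(2, clocks, staleness=1)   # two ahead: must wait
bsp_fast = may_proceed(1, [1, 0, 0], staleness=0)  # BSP: must wait
```

With Staleness = 1 as on the slide, the worker that is two clocks ahead blocks until the slowest worker advances.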

46

Compare Against Centralizing Approach

Application            Setting   Gaia Speedup over Centralize   Gaia to Centralize Cost Ratio
Matrix Factorization   EC2-ALL   1.11                           3.54
Matrix Factorization   V/C WAN   1.22                           1.00
Matrix Factorization   S/S WAN   2.13                           1.17
Topic Modeling         EC2-ALL   0.80                           6.14
Topic Modeling         V/C WAN   1.02                           1.26
Topic Modeling         S/S WAN   1.25                           1.92
Image Classification   EC2-ALL   0.76                           3.33
Image Classification   V/C WAN   1.12                           1.07
Image Classification   S/S WAN   1.86                           1.08

SSP Performance – 11 Data Centers
[Figure: normalized execution time of Baseline, Gaia, and LAN for Matrix Factorization under BSP and SSP. Settings: Amazon-EC2, Emulation-EC2, and Emulation-Full-Speed.]

SSP Performance – 11 Data Centers
[Figure: normalized execution time of Baseline, Gaia, and LAN for Topic Modeling under BSP and SSP. Settings: Emulation-EC2 and Emulation-Full-Speed.]

SSP Performance – V/C WAN
[Figure: normalized execution time of Baseline, Gaia, and LAN under BSP and SSP. Panels: Matrix Factorization and Topic Modeling.]

SSP Performance – S/S WAN
[Figure: normalized execution time of Baseline, Gaia, and LAN under BSP and SSP. Panels: Matrix Factorization and Topic Modeling.]

Federated Learning: Collaborative Machine Learning without Centralized Training Data
Thursday, April 06, 2017
Posted by Brendan McMahan and Daniel Ramage, Research Scientists

Standard machine learning approaches require centralizing the training data on one machine or in a datacenter. And Google has built one of the most secure and robust cloud infrastructures for processing this data to make our services better. Now for models trained from user interaction with mobile devices, we're introducing an additional approach: Federated Learning. Federated Learning enables mobile phones to collaboratively learn a shared prediction model while keeping all the training data on device, decoupling the ability to do machine learning from the need to store the data in the cloud. This goes beyond the use of local models that make predictions on mobile devices (like the Mobile Vision API and OnDevice Smart Reply) by bringing model training to the device as well. It works like this: your device downloads the current model, improves it by learning from data on your phone, and then summarizes the changes as a small focused update. Only this update to the model is sent to the cloud, using encrypted communication, where it is immediately averaged with other user updates to improve the shared model. All the training data remains on your device, and no individual updates are stored in the cloud.

Your phone personalizes the model locally, based on your usage (A). Many users' updates are aggregated (B) to form a consensus change (C) to the shared model, after which the procedure is repeated.

Federated Learning allows for smarter models, lower latency, and less power consumption, all while ensuring privacy. And this approach has another immediate benefit: in addition to providing an update to the shared model, the improved model on your phone can also be used immediately, powering experiences personalized by the way you use your phone. We're currently testing Federated Learning in Gboard on Android, the Google Keyboard. When Gboard shows a suggested query, your phone locally stores information about the current context and whether you clicked the suggestion. Federated Learning processes that history on-device to suggest improvements to the next iteration of Gboard’s query suggestion model.

To make Federated Learning possible, we had to overcome many algorithmic and technical challenges. In a typical machine learning system, an optimization algorithm like Stochastic Gradient Descent (SGD) runs on a large dataset partitioned homogeneously across servers in the cloud. Such highly iterative algorithms require low-latency, high-throughput connections to the training data. But in the Federated Learning setting, the data is distributed across millions of devices in a highly uneven fashion. In addition, these devices have significantly higher-latency, lower-throughput connections and are only intermittently available for training. These bandwidth and latency limitations motivate our Federated Averaging algorithm, which can train deep networks using 10-100x less communication compared to a naively federated version of SGD. The key idea is to use the powerful processors in modern mobile devices to compute higher quality updates than simple gradient steps.
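The server-side averaging step can be sketched in a few lines. The post only says that device updates are averaged; weighting each device by its number of local training examples is an assumption made here for illustration, as are the function name and the plain-list representation of model weights.

```python
def federated_average(client_updates):
    """Combine locally trained weights into one shared model.

    client_updates: list of (weights, num_examples) pairs, where weights
    is a list of floats of equal length for every client.
    Weighting by num_examples is an assumption of this sketch.
    """
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    return [
        sum(w[i] * n for w, n in client_updates) / total
        for i in range(dim)
    ]


# Two hypothetical devices: one trained on 1 example, one on 3.
shared = federated_average([([1.0, 2.0], 1), ([3.0, 4.0], 3)])
```

Each device would run several local training steps before reporting its weights, which is where the reduction in communication rounds comes from.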

Since it takes fewer iterations of high-quality updates to produce a good model, training can use much less communication. As upload speeds are typically much slower than download speeds, we also developed a novel way to reduce upload communication costs up to another 100x by compressing updates using random rotations and quantization. While these approaches are focused on training deep networks, we've also designed algorithms for high-dimensional sparse convex models which excel on problems like click-through-rate prediction. Deploying this technology to millions of heterogeneous phones running Gboard requires a sophisticated technology stack. On-device training uses a miniature version of TensorFlow. Careful scheduling ensures training happens only when the device is idle, plugged in, and on a free wireless connection, so there is no impact on the phone's performance.

Your phone participates in Federated Learning only when it won't negatively impact your experience.

The system then needs to communicate and aggregate the model updates in a secure, efficient, scalable, and fault-tolerant way. It's only the combination of research with this infrastructure that makes the benefits of Federated Learning possible. Federated learning works without the need to store user data in the cloud, but we're not stopping there. We've developed a Secure Aggregation protocol that uses cryptographic techniques so a coordinating server can only decrypt the average update if 100s or 1000s of users have participated — no individual phone's update can be inspected before averaging. It's the first protocol of its kind that is practical for deep-network-sized problems and real-world connectivity constraints. We designed Federated Averaging so the coordinating server only needs the average update, which allows Secure Aggregation to be used; however the protocol is general and can be applied to other problems as well. We're working hard on a production implementation of this protocol and expect to deploy it for Federated Learning applications in the near future.

Our work has only scratched the surface of what is possible. Federated Learning can't solve all machine learning problems (for example, learning to recognize different dog breeds by training on carefully labeled examples), and for many other models the necessary training data is already stored in the cloud (like training spam filters for Gmail). So Google will continue to advance the state-of-the-art for cloud-based ML, but we are also committed to ongoing research to expand the range of problems we can solve with Federated Learning. Beyond Gboard query suggestions, for example, we hope to improve the language models that power your keyboard based on what you actually type on your phone (which can have a style all its own) and photo rankings based on what kinds of photos people look at, share, or delete. Applying Federated Learning requires machine learning practitioners to adopt new tools and a new way of thinking: model development, training, and evaluation with no direct access to or labeling of raw data, with communication cost as a limiting factor. We believe the user benefits of Federated Learning make tackling the technical challenges worthwhile, and are publishing our work with hopes of a widespread conversation within the machine learning community.

Acknowledgements

This post reflects the work of many people in Google Research, including Blaise Agüera y Arcas, Galen Andrew, Dave Bacon, Keith Bonawitz, Chris Brumme, Arlie Davis, Jac de Haan, Hubert Eichner, Wolfgang Grieskamp, Wei Huang, Vladimir Ivanov, Chloé Kiddon, Jakub Konečný, Nicholas Kong, Ben Kreuter, Alison Lentz, Stefano Mazzocchi, Sarvar Patel, Martin Pelikan, Aaron Segal, Karn Seth, Ananda Theertha Suresh, Iulia Turc, Felix Yu, Antonio Marcedone and our partners in the Gboard team.

Cartel: A System for Collaborative Transfer Learning at the Edge

Harshit Daga (Georgia Institute of Technology), Patrick K. Nicholson (Nokia Bell Labs), Ada Gavrilovska (Georgia Institute of Technology), Diego Lugones (Nokia Bell Labs)

ABSTRACT

1

As Multi-access Edge Computing (MEC) and 5G technologies evolve, new applications are emerging with unprecedented capacity and real-time requirements. At the core of such applications there is a need for machine learning (ML) to create value from the data at the edge. Current ML systems transfer data from geo-distributed streams to a central datacenter for modeling. The model is then moved to the edge and used for inference or classification. These systems can be ineffective because they introduce significant demand for data movement and model transfer in the critical path of learning. Furthermore, a full model may not be needed at each edge location. An alternative is to train and update the models online at each edge with local data, in isolation from other edges. Still, this approach can worsen the accuracy of models due to reduced data availability, especially in the presence of local data shifts. In this paper we propose Cartel, a system for collaborative learning in edge clouds, that creates a model-sharing environment in which tailored models at each edge can quickly adapt to changes, and can be as robust and accurate as centralized models. Results show that Cartel adapts to workload changes 4 to 8× faster than isolated learning, and reduces model size, training time and total data transfer by 3×, 5.7× and ~1500×, respectively, when compared to centralized learning.

The proliferation of connected devices is causing a compound annual growth rate of 47% in network traffic since 2016, i.e., an increase from 7 to 49 exabytes per month [15]. Service providers such as Facebook, Google, Amazon, and Microsoft rely on machine learning (ML) techniques [18, 47, 50] to extract and monetize insights from this distributed data. The predominant approach to learn from the geographically distributed data is to create a centralized model (see Figure 1a) by running ML algorithms over the raw data, or a preprocessed portion of it, collected from different data streams [12, 37]. More sophisticated solutions deal with geodistributed data by training models locally in the device, which are later averaged with other user updates in a centralized location – an approach known as federated learning [21, 33, 61, 64]. A centralized model can be very accurate and generic as it incorporates diverse data from multiple streams. From a system perspective, however, there is a challenge in moving all this data, and even the resulting model size can be significant, depending on the implementation, algorithm, and feature set size [2]. Concretely, as data sources spread geographically, the network becomes the bottleneck. In this case, ML algorithms [1, 12, 37], which are efficient in datacenters, can be slowed down by up to 53× when distributed to the network edge [21]. The emergent Multi-access Edge Computing (MEC) architecture, as well as 5G connectivity, are conceived to converge telecommunications and cloud services at the access network, and have the potential to cope with the challenge described above by enabling unprecedented computing and networking performance closer to users. In this context, the obvious alternative to centralized learning is to replicate the algorithms at each edge cloud and run them independently with local data, isolated from other edge clouds, as shown in Figure 1b. 
Isolated models can be useful in certain cases, e.g., when the data patterns observed by an edge are stationary and data is not significantly diverse. However, in more challenging scenarios, where the distribution of input data to the ML model is non-stationary, or when the application requires more complex models – only achievable with more data than the local to a particular edge – isolated models can have prohibitively high error rates (cf. Section 6.2). Therefore, although MEC and 5G technologies constitute the infrastructure needed to run distributed machine learning algorithms, we argue that there is a need for a coordination system that leverages the edge resources to learn collaboratively from logically similar nodes, reducing training times and excessive model updates. In this paper we introduce Cartel, a new system for collaborative ML at the edge cloud (Figure 1c). The idea behind Cartel is

CCS CONCEPTS
• Computer systems organization → Distributed architectures; n-tier architectures; • Computing methodologies → Machine learning; Transfer learning; Online learning settings.

KEYWORDS
Mobile-access Edge Computing (MEC), distributed machine learning, collaborative learning, transfer learning

ACM Reference Format:
Harshit Daga, Patrick K. Nicholson, Ada Gavrilovska, and Diego Lugones. 2019. Cartel: A System for Collaborative Transfer Learning at the Edge. In ACM Symposium on Cloud Computing (SoCC ’19), November 20–23, 2019, Santa Cruz, CA, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3357223.3362708

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SoCC ’19, November 20–23, 2019, Santa Cruz, CA, USA
© 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-6973-2/19/11…$15.00
https://doi.org/10.1145/3357223.3362708


INTRODUCTION



Figure 1: Machine learning systems with geographically distributed data streams, (a) Centralized learning, either raw data or partial models are shipped to a central datacenter for modeling (b) Isolated learning, models are replicated in edge cloud locations and maintained independently, and (c) Collaborative learning (Cartel), a distributed model-sharing framework to aggregate knowledge from related logical neighbors.

that centralized models, although trained on a broader variety of data, may not be required in full at each edge node. When changes in the environment or variations in workload patterns require the model to adapt, Cartel provides a jump start by transferring knowledge from other edge(s) where similar patterns have been observed. This allows for lightweight models, reduced backhaul data transfer, strictly improved model accuracy compared to learning in isolation, and similar performance to centralized models.

Cartel achieves the above by operating on metadata rather than on raw data, and uses metadata to decide when an edge-based model needs to be updated, which peer should be used for the model update, and how the knowledge should be transferred from one model to another. To support these decisions and the collaborative learning they enable, Cartel provides three key mechanisms. It uses drift detection to determine variability in the workload distribution observed at the edge (i.e., dataset shift) and in the accuracy of the model. It incorporates functions to identify logical neighbors, i.e., candidate edges from which models can be transferred in case of input or output drift. Finally, it provides interfaces to the ML modeling framework to support model-specific knowledge transfer operations used to update a model instance in one edge, using state from a model in another edge stack.

We describe the system functionality required to support these key mechanisms, their concrete implementation using specific algorithms which operate on model metadata, and evaluate the tradeoffs they afford for different collaborative learning models. We use both online random forest (ORF) [56] and online support vector machines (OSVMs) [11, 20] as running examples of learning algorithms. Both ORF and OSVM are well-known online classification techniques [42] that can operate in the streaming setting.
Moreover, since they function quite differently, and have different characteristics of the model and update sizes, together they facilitate a discussion of how Cartel can be used with different ML algorithms (see Section 7). Cartel is evaluated using several streaming datasets as workloads, which consist of randomized request patterns in the form of time series with stationary and non-stationary data distributions at each edge to cover many use cases. We compare Cartel to isolated and centralized approaches. Results show that collaborative learning allows for model updates that are 4× to 8× faster than isolated learning, and reduces the data transfer by ~200× to ~1500× compared to a centralized system while achieving similar accuracy. Moreover, at each edge Cartel reduces both the model size, by up

to 3×, and the training time, by 3 to 5.7× for the ML models in evaluation. In summary, our contributions are:
• A collaborative system to create, distribute and update machine learning models across geographically distributed edge clouds. Cartel allows for tailored models to be learned at each edge, but provides for a quick adaptation to unforeseen variations in the statistical properties of the values the model is predicting, e.g., concept drift, changes in class priors, etc.
• Cartel is designed to address the key challenges in learning at the edge (cf. §3) and relies on metadata-based operations to guide cross-edge node collaborations and reduce data transfer requirements in learning (cf. §4).
• We design generic metadata-based operations that underpin the three key mechanisms in Cartel, along with algorithms and system support enabling their implementation. In particular: i) a distance-based heuristic for detecting similarities with other edge clouds, that can serve as logical neighbors for knowledge transfer, and; ii) two generic algorithms for knowledge transfer, one of which (based on bagging) can be applied to any machine learning model (cf. §5).
• An experimental evaluation using different models, workload patterns, and datasets, illustrates how Cartel supports robust edge-based learning, with significant reductions in data transfer and resource demands compared to traditional learning approaches (cf. §6).

2 MOTIVATION

In this section, we briefly introduce MEC and collaborative learning, and summarize evidence from prior work on opportunities to leverage locality of information in MEC, which motivates the design of Cartel.

Multi-access Edge Computing. Cartel targets the emerging field of MEC [6, 16, 23, 41, 46, 57, 58, 60]. MEC offers computing, storage and networking resources integrated with the edge of the cellular network, such as at base stations, with the goal of improving application performance for the end users and devices. The presence of resources a hop away provides low latency for real time applications, and data offloading and analytics capabilities, while reducing the backhaul network traffic. 5G and novel use cases demanding low latency and/or high data rates, such as connected cars, AR/VR, etc., are among the primary drivers behind MEC [55].


Cartel: A System for Collaborative Transfer Learning at the Edge


Collaborative Learning. We define collaborative learning to be a model where edge nodes learn independently, but selectively collaborate with logical neighbors by transferring knowledge when needed. Knowledge transfer happens when a target edge executing a model detects an issue, such as sudden high error rates, with its current model. Logical neighbors are selected to assist the target edge, that is, edge nodes that: i) are most similar to the target in terms of either the data they have observed or their configuration, and; ii) have models that are performing adequately. Knowledge transfer involves transmitting some part of the model, or models (if there is more than one logical neighbor), from the logical neighbor to the target edge. This framework induces a set of primitive operations that can be directly used with existing ensemble-based machine learning algorithms. Note that the same learning algorithm is used in all edge nodes. For our proof-of-concept, we focus on ORF and OSVM, where the former makes use of bootstrap aggregation (or bagging) [9], but the latter does not. However, we emphasize that since the primitives can be applied to any learning algorithm utilizing bagging, and that since any machine learning algorithm can make use of bagging, this means that Cartel is not limited to the two techniques, but rather can be applied in general to any machine learning technique. An interesting future research direction would be to create a general abstraction to apply the set of primitive operations to any machine learning algorithm directly, without the use of bagging. A more surgical approach based on techniques such as patching [30] may be a good candidate for such an abstraction. However, patching raises potential model size issues after repeated application of the primitive operations, and thus further investigation is beyond the scope of this work. Locality in MEC. 
As the MEC compute elements are highly dispersed, they operate within contexts which may capture highly localized information. For instance, a recent study of 650 base station logs by AT&T [52] reports consistent daily patterns where each base station exhibits unique characteristics over time. Similarly, Cellscope [48] highlights differences in the base station characteristics, and demonstrates the change in a model’s performance with changes in data distributions. They also demonstrate the ineffectiveness of a global model due to the unique characteristics of each base station. In Deepcham [36], Li et al. demonstrate a deep learning framework that exploits data locality and coordination in near edge clouds to improve object recognition in mobile devices. Similar observations regarding locality in data patterns observed at an edge location are leveraged in other contexts, such as gaming [66] and transportation [3].

Figure 2: Cartel overview. A collaborative system consisting of edge nodes (E), where the E_i’s are trained independently and periodically update a MdS with metadata information about the node. A subset of logical neighbors is selected to help the target edge node (t) quickly adapt to change. [Diagram labels: Metadata Service (MdS); edge nodes E1 (target, t), E2, E3, E4; Request Batch; E_i’s register and send metadata; Request for nodes with similar model; Subset of helpful neighbors (E3, E4); Insights; steps numbered 1 to 4.]

3 CHALLENGES

To elaborate on the concepts behind Cartel, we first enumerate the complexities and the key challenges in implementing a distributed collaborative learning system in a general and effective manner. In the subsequent section we introduce our system design, components, and implementation, and explain how each challenge is addressed.

C1: When to execute the collaborative model transfer among edge clouds? Since participants run independent models that can evolve differently, each edge must determine when to initiate collaboration with edge peers in the system. Changes in the configuration of the edge stack or in the non-stationary statistical distributions of workloads can decrease model accuracy, thus requiring online retraining. The challenge is to create a mechanism to quickly detect and react to such variations.

C2: Which neighbors to contact? Our hypothesis for collaborative learning is that edge nodes running similar machine learning tasks can share relevant model portions, thereby achieving more efficient online adaptation to changes. The goal is to avoid sharing of raw data between edge nodes, which makes nodes oblivious of data trends at other edges. Therefore, the challenge is to discover appropriate logical neighbors dynamically while coping with variations in the workload of the multiple edges over time.

C3: How to transfer knowledge to the target? In order to update a model it may be possible and/or appropriate to either merge proper model portions from collaborating nodes or to simply replace local models with remote ones. The decision depends on various parameters such as the modeling technique, whether it allows for partitioning and merging of models, as well as the feature set, the pace at which the model needs to be updated, and the efficiency of the cross-node data transfer paths. Thus, the challenge is to provide support for the model-specific methods for sharing and updating model state.

4 OVERVIEW OF CARTEL

Goal. The main focus of Cartel is to reduce the dependence on data movement compared to a purely centralized model that must periodically push out model updates. In contrast, Cartel only performs knowledge transfer when a target node actively requests help from logical neighbors. Thus, when no such requests are active, Cartel only requires nodes to periodically share metadata, which is used


in the figure, the metadata service is only logically centralized, and may be realized in a distributed manner depending on the scale of the system.

Workflow. The operation of Cartel can be summarized as follows. Each edge node receives batches of requests (step 1) over a time period; these requests are the workload on which predictions are made by the resident model at the edge node. These batches are later used for retraining the model locally. If an edge node is experiencing poor model accuracy, we refer to that node as the target (t). We remark that this batching model fits well with predictive analytics in the streaming setting, e.g., predicting and classifying network resource demands based on the current set of users and their usage patterns. The metadata from each node is aggregated by the metadata service (step 2), which receives periodic updates from each edge node, described in Section 5.2. Cartel is aimed at scenarios with dynamic workload behaviors. As a result, the neighbor selection cannot be precomputed and stored, but is performed on-demand, based on dynamically aggregated metadata. When an edge node detects that its model accuracy has decreased significantly (drift detection), it asks the metadata service for similar nodes from which model updates should be requested (step 3). The metadata service processes the metadata on-demand to identify the corresponding logical neighbors. The target node interacts with (one of) its logical neighbors directly to request a model update (step 4), and applies the shared model state to update its resident model (knowledge transfer).
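The four workflow steps above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `MetadataService` class, the `handle_batch` function, the use of a single batch error rate as the only metadata, and the error-rate-only neighbor filter are hypothetical and are not Cartel's actual interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataService:
    """Toy in-memory stand-in for the MdS (hypothetical API)."""
    error_rates: dict = field(default_factory=dict)

    def update(self, node_id, error_rate):
        # Step 2: an edge node reports its latest metadata.
        self.error_rates[node_id] = error_rate

    def find_neighbors(self, target_id, max_error):
        # Step 3: placeholder selection that only filters on error rate.
        return sorted(node for node, err in self.error_rates.items()
                      if node != target_id and err <= max_error)


def handle_batch(node_id, batch_error, mds, drift_limit):
    """Return the ordered workflow steps that one request batch triggers."""
    steps = ["predict", "retrain"]            # step 1: serve, then retrain locally
    mds.update(node_id, batch_error)          # step 2: periodic metadata update
    steps.append("send_metadata")
    if batch_error > drift_limit:             # drift: this node becomes the target
        steps.append("request_neighbors")     # step 3: ask the MdS for neighbors
        neighbors = mds.find_neighbors(node_id, drift_limit)
        if neighbors:
            # Step 4: contact one logical neighbor directly for a model update.
            steps.append("request_model_update:" + neighbors[0])
    return steps
```

Note that in this sketch a node with no suitable peers still issues the neighbor request but performs no transfer, which mirrors the case where the MdS returns an empty result.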

to establish a relationship among nodes: raw training data is never explicitly transmitted among the nodes, or to a centralized location. Concepts. To achieve this goal while addressing the challenges enumerated above, Cartel relies on three key mechanisms: drift detection (C1), which allows a node to determine when to send a request to its edge peers for a model transfer; logical neighbors (C2), which for each node determines sufficiently similar nodes likely to provide the required model transfer; and knowledge transfer (C3), which allows the system to decide on how to merge model portions provided from peer nodes. A common principle underpinning these mechanisms is the use of system-level support for operating on the metadata. The principle allows Cartel to achieve its desired goal of supporting learning at the edge with adequate accuracy and reduced data transfer costs. As a result, Cartel provides a novel system-level support for metadata management – to store, aggregate, compare, and act on it. This functionality is used by Cartel’s key mechanisms, which in turn facilitate collaborative learning. Metadata can be any information about a node that could potentially distinguish it from other nodes. In other words, information about the physical hardware, software configuration, enabled features, active user distribution by segments, geographic information, estimates of class priors, etc. Some metadata, such as enabled software features, those related to active users, or class prior estimates, can change over time at a given node. When such changes occur, this usually leads to a degradation in model performance, as machine learning techniques are impacted by underlying dataset shifts. 
Examples of such dataset shifts include: changes in class priors (probability of certain classes), covariate shift (distribution from which input examples are sampled), concept shift/drift (classes added/removed or changing boundary functions), and other more general domain shifts (see [45] for a detailed survey). In our system discussion and experiments, we focus on the first type of shift, i.e., changes in class priors. Thus, in our illustrative online classification examples and experiments, metadata refers specifically to empirical estimates for the prior probabilities for each class, as well as overall and per-class error rates. Such metadata is available in general, and therefore allows us to concretely describe an implementation of each of the three key mechanisms that can be applied in general. We emphasize, however, that additional application specific (or even completely different) choices for metadata are possible, such as the ones enumerated above, to address other dataset shifts.

Architecture Overview. Figure 2 shows the architecture overview of Cartel. The system comprises edge nodes (E) and a metadata service (MdS). At a high level, an edge node maintains a tailored model trained using data observed at the node, and the metadata associated with that model. The metadata service is responsible for aggregating and acting on the metadata generated by edge nodes, so as to facilitate the selection of appropriate peer nodes for collaborative model updates. In other words, when a collaborative model update is requested, the metadata service is responsible for selecting the subset of edge nodes that can share portions of their models with the node that requests assistance. These edge nodes are then responsible for negotiating with the target to determine which portions to share. Although shown as a single component

5 DESIGN DETAIL

Next, we describe the three key mechanisms in Cartel in terms of their metadata requirements and the algorithms they rely on for metadata manipulations, and describe the system support that enables their implementation.

5.1 Metadata Storage and Aggregation

To support its three building blocks, Cartel maintains and aggregates model metadata in the metadata service, and also stores some metadata locally at each edge node. For each of the previous W ≥ 1 batches, up to the current batch, each edge node maintains: i) counts for each class observed in the batch; ii) the overall model error rate on the batch, and; iii) the error rate per-class for the batch. Here W is a user-defined window length parameter, which is used to adjust how sensitive the system is to changes in the model metadata. In terms of memory cost, at each edge node Cartel stores O(CW) machine words, where C is the total number of classes in the classification problem. To aggregate model metadata, the metadata service relies on periodic updates reporting the metadata generated at each edge node. We considered several aggregation policies which trade the data transfer requirements of Cartel for the quality of the selected logical neighbors. The most trivial approach is where edge nodes send updates after every request batch. This helps in ensuring there is no stale information about the edge node at any given time. Thus, O(CN) machine words are transferred after each batch, where C is the number of classes, and N the number of edge nodes. We refer to this operation policy as regular updates. These updates can further be sparsified by not sending them for every batch, but


for every m batches, referred to as interval updates. Interval updates can provide additional reduction in data transfer, but result in stale data at the MdS. For instance, an edge node may have model performance that degraded recently, but the MdS is oblivious to these changes during logical neighbor selection. One can fix this by adding further validation steps; however, this will delay the collaboration process. An alternative policy is to make use of threshold updates, where an edge sends an update only when there is a change in the metadata beyond some user-defined threshold. This reduces the required data transfer, while also attempting to avoid stale information, but requires an additional threshold parameter. In the evaluation of Cartel we primarily use the regular update policy, as it provides an upper bound on how much metadata Cartel must transfer. However, in Section 6.3 we also explore the additional benefits that a threshold update policy can provide. We remark that Gaia [21] also employs a threshold update methodology. However, a fundamental difference between Cartel and Gaia is that the latter focuses on a new ML synchronization model that improves the communication between the nodes sending the model updates. Though beneficial for models with smaller model update sizes, the amount of data transfer will increase if applied to models with more memory consumption, such as ORF. Further, Gaia’s goal is to build a geo-distributed generalized model, whereas Cartel supports tailored models at each node that only seek updates when a change is observed.
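As a rough illustration of the per-node metadata window and the three update policies (regular, interval, threshold), consider the following sketch. The names `MetadataWindow` and `should_send_update`, the default values of `m` and `threshold`, and the choice of "largest absolute change in any class prior" as the threshold test are assumptions made here; the text does not specify how the change in metadata is measured.

```python
from collections import Counter, deque


class MetadataWindow:
    """Per-node metadata for the last W batches (hypothetical helper)."""

    def __init__(self, window):
        # Each entry holds (class counts, per-class error counts) for a batch.
        self.batches = deque(maxlen=window)

    def add_batch(self, labels, predictions):
        counts = Counter(labels)
        errors = Counter(y for y, p in zip(labels, predictions) if y != p)
        self.batches.append((counts, errors))

    def class_priors(self):
        total = Counter()
        for counts, _ in self.batches:
            total.update(counts)
        n = sum(total.values())
        return {c: k / n for c, k in total.items()} if n else {}

    def error_rate(self):
        seen = sum(sum(c.values()) for c, _ in self.batches)
        wrong = sum(sum(e.values()) for _, e in self.batches)
        return wrong / seen if seen else 0.0


def should_send_update(policy, batch_index, current, last_sent,
                       m=5, threshold=0.1):
    """Decide whether a node reports its class priors after this batch."""
    if policy == "regular":
        return True                           # report after every batch
    if policy == "interval":
        return batch_index % m == 0           # report every m batches
    if policy == "threshold":
        if last_sent is None:
            return True                       # nothing reported yet
        classes = set(current) | set(last_sent)
        change = max(abs(current.get(c, 0.0) - last_sent.get(c, 0.0))
                     for c in classes)
        return change > threshold             # report only on a large change
    raise ValueError("unknown policy: " + policy)
```

The window stores O(CW) counters per node, in line with the memory cost stated above; the regular policy corresponds to the upper bound on metadata traffic.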

is to find a logical neighbor that has similar class priors to the target node, as this node has the most potential to help. Logical neighbors are computed by the metadata service after receiving the request for help from the target node. The mechanism relies on the model metadata collected from each of the edge nodes, and on a similarity measure used to compare models based on the metadata. Similarity measure. For our example where class priors are undergoing some shift, the empirical distributions from the target node can be compared with those from the other nodes to determine which subset of edge nodes are logical neighbors of the target node. The metadata service maintains a rolling average of the metadata provided by the edge nodes, which costs O(CN) machine words of memory, where C is the total number of classes in the classification problem and N is the total number of edge nodes in the system. The metadata service is thus responsible for determining which nodes are logical neighbors, and does so via a similarity measure. There are many measures that can be used for this purpose, such as Kullback-Leibler divergence (KLD) [28], Hellinger distance [5], Chi-squared statistic [54] and Kolmogorov-Smirnov distance [44]. After evaluating these techniques empirically, we selected Jensen-Shannon divergence (JSD) [40] (which is based on Kullback-Leibler divergence), as a function to determine the distance between two discrete probability distributions, $Q_1$ and $Q_2$:

\[ \mathrm{JSD}(Q_1, Q_2) = \frac{\mathrm{KLD}(Q_1, \tilde{Q}) + \mathrm{KLD}(Q_2, \tilde{Q})}{2}, \]

where $\tilde{Q} = (Q_1 + Q_2)/2$ and $\mathrm{KLD}(Q, \tilde{Q}) = \sum_i Q(i) \log_2 \frac{Q(i)}{\tilde{Q}(i)}$. JSD has convenient

properties of symmetry, and normalized values in [0, 1]; this is in contrast with KLD, which is unbounded. If $\mathrm{JSD}(Q_1, Q_2) = 0$ then the distributions $Q_1$ and $Q_2$ are considered identical by this measure. On the other hand, as $\mathrm{JSD}(Q_1, Q_2) \to 1$ the distributions are considered highly distant. Once a list of logical neighbors with high similarity is identified, the list is pruned to only contain neighbors with low model error rates. Neighbors with high error rate, e.g., those that are also currently undergoing dataset drift, are filtered from the list. At this point, the top-k logical neighbors in the list are returned to the target node, which then negotiates the knowledge transfer. Importantly, if the MdS finds no satisfactory logical neighbors (e.g., if all JSD scores exceed a user-defined threshold), it will return an empty result set to the target node.

Knowledge Transfer. The final step in Cartel is to take advantage of the logical neighbors’ models. The knowledge transfer consists of two abstract steps: partitioning and merging. The knowledge transfer process is dependent upon the machine learning technique used by the application and is accomplished through model transfer, a machine learning technique for transferring knowledge from a source domain (i.e., the data observed by the logical neighbors) to a target domain (i.e., the problematic data arriving at the target node) [49]. The main difference between standard model transfer and this partitioning and merging setting is that there is the potential to transfer knowledge from multiple sources, and also that there is a resident model already present at the target node. As part of the Cartel system, during the knowledge transfer step, after the target node receives the logical neighbors, the target node attempts to identify those classes that have been most problematic. The target node computes this information by examining the per-class error rates over the last W batches.
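The similarity measure and the neighbor filtering described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the parameters `k`, `max_jsd`, and `max_error`, and their default values are hypothetical, and class priors are assumed to be given as equal-length lists of probabilities. The base-2 logarithm keeps the JSD value within [0, 1].

```python
import math


def kld(q, q_tilde):
    """Kullback-Leibler divergence (base 2); zero-probability terms add 0."""
    return sum(qi * math.log2(qi / ti) for qi, ti in zip(q, q_tilde) if qi > 0)


def jsd(q1, q2):
    """Jensen-Shannon divergence between two discrete distributions."""
    mix = [(a + b) / 2 for a, b in zip(q1, q2)]
    return (kld(q1, mix) + kld(q2, mix)) / 2


def select_neighbors(target, priors, error_rates, k=2,
                     max_jsd=0.3, max_error=0.2):
    """Return up to k node ids, most similar first (empty list if none fit)."""
    candidates = []
    for node, q in priors.items():
        if node == target:
            continue
        distance = jsd(priors[target], q)
        # Keep only sufficiently similar nodes whose own model performs well.
        if distance <= max_jsd and error_rates[node] <= max_error:
            candidates.append((distance, node))
    return [node for _, node in sorted(candidates)[:k]]
```

Because the mixture distribution is nonzero wherever either input is nonzero, the logarithm in `kld` is always well defined here, which is one practical reason to prefer JSD over raw KLD for empirical class priors that may contain zeros.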
Any classes that have

5.2 Cartel – Three Key Mechanisms

Drift Detection. As discussed, a dataset shift can cause poor predictive performance of ML models. Through a drift detection mechanism, our goal is to quickly improve the performance of models on nodes where such dataset drift has occurred. Drift detection is a widely studied problem, especially in the area of text mining and natural language processing [62, 63]. It is important to note that in prior work on drift detection, there is often an interest in detecting both positive and negative drifts. However, for Cartel we only take action upon a negative drift (i.e., the error rate of the model increases over time). Any existing drift detection algorithm that can detect negative changes to model error rates can be used in Cartel. For ease of exposition, we opt for a straightforward threshold-based drift detection mechanism that requires a user-specified hard limit L ∈ [0, 1] on the error rate of a resident model. Thus, based on the two parameters, L and W, drift detection is performed locally at each edge node after processing a batch. The average error rate of the model is computed on the previous W batches to detect whether the hard threshold L has been exceeded, indicating a drift. Though this approach is simplistic, more sophisticated algorithms for drift detection also make use of two (and often more) such thresholds [19]: typically the thresholds are set with respect to statistical tests or entropy measures to determine what constitutes a significant change (cf. [7, 31]). Logical Neighbor. Although drift detection is useful in determining the need for help (i.e., for knowledge transfer) from an external source, we still face the challenge of finding out the node(s) that are most similar to the target in terms of their characteristics: either the data they have observed or their configuration. These nodes are also known as logical neighbors. Intuitively, the goal of Cartel


error-rates exceeding a user defined threshold are marked as problematic, and communicated to the logical neighbors as part of the request for help. Next, depending on the machine learning algorithm running on the edge nodes, partitioning and merging proceeds in different ways. We give two different methods for partitioning and merging that can be applied broadly to any classification problem. Bagging Approach. For the case of ORF, or any online learning algorithm that uses bootstrap aggregation, knowledge transfer is achieved via the following straightforward technique. Suppose the model at the target node contains M sub-models, each constructed on a separate bag: e.g., in the case of ORF, M is the number of online decision trees in the forest. By replacing a Z ∈ (0, 1] fraction of the sub-models in the target ensemble (Z of the trees from the target in the case of ORF), with sub-models collected from the logical neighbors, we can intuitively create a hybrid model at the target node, that is somewhere in between the old model and the models from the logical neighbors. In other words, we partition the models in the logical neighbors, selecting Z × M sub-models among the logical neighbors, and then merge these with the existing model at the target node. Before discussing how to set Z, it first makes sense to answer the question of which trees should be replaced in the target ensemble, and which trees should be used among those at the logical neighbors. We employ the following heuristic: replace Z of the trees having the highest error rate in the target node ORF, with the Z trees having the lowest error rate from the logical neighbors. To achieve partitioning and merging with bagging, Cartel must therefore additionally maintain, at each edge node, the error rates of each sub-model (e.g., decision tree in the forest). 
Fortunately, for many libraries implementing ensemble ML algorithms (such as scikit-learn [51]), this information is readily accessible from the model APIs. We also experiment with another heuristic that replaces Z trees with the highest error rate in the target node ORF, with the Z trees that have the lowest per-class error rate among the problematic classes. The exact setting of Z, in general, can depend on workload dynamics, as well as the distance between the target node and logical neighbors. For our datasets and workloads we experimentally found that Z = 0.3 worked well, and discuss this later in Section 6.3. However, other choices, such as Z in the range [0.3, 0.6], also behaved similarly. Thus, a precisely engineered value of Z is not required to yield similar benefits for the workloads and datasets we used. We leave automatic online tuning of Z as an interesting topic for future investigation. One-versus-Rest Approach. In contrast to the bagging approach above, linear OSVMs use a one-versus-rest (or one-versus-all) approach to multi-class classification to construct a set of C hyperplanes in an n-dimensional space, one for each class, each defining the boundary between its associated class and items not in that class. Each boundary is therefore represented as a row in a C × (F + 1) weight matrix containing the coefficients representing the hyperplane, one per feature plus an additional bias term, where F is the number of features for the model. For OSVM, knowledge transfer can be accomplished by updating the weights assigned to the features of problematic classes. The logical neighbors then partition the subset of these problematic classes

Figure 3: Cartel system functions. [Diagram labels: Edge Node (Target) and Edge Node (Logical Neighbor), each with a Collaborative block (Register, Analyzer, Communicator, Transfer) and a Learning block (Predict, Train, Partition, Merge) around an ML Model; Metadata Service (MdS); Store; Find Node; Data.]

from their weight matrices, by simply selecting the rows corresponding to these classes. These rows are transmitted to the target node where the node merges these model portions into its OSVM model by overwriting the corresponding rows with the weights from the logical neighbors. We note that the same approach applies in general to other one-versus-rest classifiers, but that it is especially appealing in the case of linear models like OSVM, as the data required to transmit the F + 1 weights is small compared to other approaches. Crucially, both methods presented have the property that the resulting model at the target node does not increase in size. In particular, the same number of sub-models is present for the bagging approach and for OSVM the matrix representing the hyperplanes is exactly the same size in terms of the total number of entries. This means that knowledge transfer can be repeated without gradually inflating the size of the model. The focus for Cartel is to provide the system support for partitioning and merging operations to be easily integrated for collaborative learning. The specific implementation of partitioning and merging is left to the user, beyond the generic approaches just described. Thus, Cartel abstracts this mechanism as a set of APIs for partitioning and merging that can be extended by any machine learning model to be incorporated into the Cartel system, as shown in Figure 3 and described in the next section.
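The two partition-and-merge methods described above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: sub-models are represented as hypothetical `(error_rate, model)` pairs, weight matrices as plain lists of rows, and the function names and the rounding of Z × M are assumptions. Both functions preserve the size of the target model, matching the property stated above.

```python
def merge_bagging(target, neighbor, z=0.3):
    """Replace the worst Z fraction of target sub-models with the best
    sub-models from a logical neighbor.

    target, neighbor: lists of (error_rate, sub_model) pairs.
    """
    # Never take more donors than the neighbor has, so the size is preserved.
    n_replace = min(int(round(z * len(target))), len(neighbor))
    by_error = lambda pair: pair[0]
    keep = sorted(target, key=by_error)[:len(target) - n_replace]
    donors = sorted(neighbor, key=by_error)[:n_replace]
    return keep + donors


def merge_one_vs_rest(target_weights, neighbor_weights, problematic):
    """Overwrite the rows of problematic classes in a C x (F + 1) matrix.

    Each row holds the F feature weights plus a bias term for one class.
    """
    merged = [list(row) for row in target_weights]
    for c in problematic:
        merged[c] = list(neighbor_weights[c])   # partition at the neighbor,
    return merged                               # merge at the target
```

For example, with M = 10 trees and Z = 0.3, `merge_bagging` keeps the 7 target trees with the lowest error and adds the 3 best neighbor trees, while `merge_one_vs_rest` transmits only F + 1 weights per problematic class.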

5.3 Cartel Runtime

The Cartel runtime at each edge node consists of two blocks of functions – Learning and Collaborative – as shown in Figure 3. The Learning component depends on the machine learning technique used and the type of problem it addresses (e.g., classification, regression, etc.). It constitutes the learning part of the system which makes predictions on the incoming data using the model, compares the predicted values to later observations determining the error rate, and subsequently re-trains the model using observations. It provides four interfaces: predict, (re)train, partition, and merge. The predict function keeps track of overall and classwise results predicted by the model while the (re)train function is responsible for training the model on the incoming data feed.


into feature values, and includes a mix of benign data samples as well as malicious attacks such as distributed denial-of-service (DDoS) attacks, Heartbleed, web and infiltration attacks captured over a period of 5 days. We use a subset of the features for two days of the dataset consisting of benign data samples, DDoS attacks, and port scan attacks, consisting of 500k data samples in total.

Workloads. Edge nodes process requests in discrete batches over time. A batch consists of a varying number of requests corresponding to different classes in a dataset (e.g., corresponding to one of the ten different digits from 0 to 9 in the case of MNIST, or to a different type of attack in the case of CICIDS2017). In our experiments, the dataset shift we focus on is a change in class priors, i.e., a change in the distribution of the classes arriving at a node. We generate several synthetic request patterns which correspond to different types of change patterns in the workload. The results presented in the paper are primarily based on the Introduction and Fluctuation patterns. Introduction corresponds to a case where a new class gets introduced at an edge node abruptly after 25 batches. Fluctuation is a distribution pattern where a new class is introduced at batch 25, but then disappears and re-appears at batches 50 and 75 respectively. This is analogous to the Introduction pattern with periodicity. Other patterns used in our evaluation include Uniform, where all classes are uniformly distributed across all edge nodes, and Spikes, where several new classes are introduced in succession, each of which does not persist. The results obtained from these latter two workloads are similar, so we omit them for brevity. We have used emulated nodes and synthetic workloads, primarily because of limited availability of real infrastructure and data. However, in the following section we successfully demonstrate the benefits of Cartel.
Moreover, with more edge nodes, we expect the savings from transmitting metadata compared to the raw data will persist.

Moreover, when a model update is requested by a target node, the partition function of the logical neighbor helps in finding the portion of the model that increases the model’s accuracy at the target edge node. Finally, the merge function of the target node incorporates the model update received from the supporting edges and completes a cycle in a collaborative learning process. The Collaborative component is independent of the machine learning technique used. It is responsible for drift detection, for triggering look-ups and for interacting with logical neighbors. It provides four functions – register, analyzer, communicator, and transfer. With register an edge node joins the Cartel knowledge sharing pool. The analyzer function analyzes the prediction results to determine a drift. Upon drift detection, it additionally performs data trend analysis to identify the problematic classes at the target node. The communicator function interacts with MdS to update the node’s metadata information, based on the metadata aggregation policy. Additionally, if a drift is detected, it also sends a request to MdS for logical neighbors. The transfer function opens a communication channel between the target node and the selected logical neighbor(s) to request and receive model portions.

6

EVALUATION

We present the results from the experimental evaluation of Cartel, and seek to answer the following questions: 1. How effective is Cartel in reducing data transfer costs, while providing for more lightweight and accurate models that can quickly adapt to changes? (Section 6.2); 2. What are the costs of the mechanisms of Cartel and of its design choices? (Section 6.3); 3. How does Cartel perform with realistic applications? (Section 6.4).

6.1

Experimental Methodology

Experimental Testbed. We evaluate Cartel on a testbed representing a distributed deployment of edge infrastructure, with emulated clients generating data to nearby edge nodes. The testbed consists of five edge nodes and a central node representing a centralized datacenter. All nodes in the system are Intel(R) Xeon(R) (X5650) with 2 hex-core 2.67GHz CPUs and 48GB RAM. Datasets and applications. The first experiment uses an image classification application where edge nodes participate in classifying images into different categories using ORF and OSVM. The models are implemented using the ORF library provided by Amir et al. [56] and, for OSVM, scikit-learn [51]. We use the MNIST database of handwritten digits [35], which consists of 70k digit images. A set of 1000 uniformly randomly selected images (training data), distributed across each of the edge nodes, is used for preliminary model training. The remainder of the dataset is used to generate a series of request patterns, following the different distribution patterns described below; batches from these requests are used for online training. We also evaluate Cartel with a second use case based on network monitoring and intrusion detection (Section 6.4) that uses the CICIDS2017 Intrusion Detection evaluation dataset [59]. This use case further illustrates some of the tradeoffs enabled by Cartel, and helps generalize the evaluation. The CICIDS2017 dataset consists of a time series of different network measurements, preprocessed

6.2

Benefits from Cartel

We compare Cartel to centralized and isolated learning, with respect to the changes observed at the edge in the three systems. A centralized system repeatedly builds a generic model using data collected from all the nodes in the system. This model is then distributed among the edge nodes. In such a system there exists a gap between the error bound and the model performance at the edge node. This is due to the time difference between the periodical update of the model at the edge nodes. In contrast, in an isolated environment, each edge node is trained individually and any change(s) in the workload pattern could impact the predictive performance of the model. We measure the time taken to adapt to changes in the class priors in the workload, and examine the resource demand (cost) in terms of data transferred over the backhaul network, time required to train the online model and model size. In Figures 4 and 5, we present the results for the image classification application with ORF and OSVM, and the Introduction and Fluctuation patterns. We use different workload patterns to assess the impact of change in request distribution on the performance of the systems. For each case, with a horizontal dashed red line, we show the error lower bound, obtained with offline model training. We used window size


Harshit Daga, Patrick K. Nicholson, Ada Gavrilovska, and Diego Lugones

Figure 4: Performance comparison for introduction workload distribution (lower error rate is better) showcasing overall model error and the introductory class error change. Misclassification of introductory class degrades the model performance. Cartel is able to quickly adapt to the change in distribution (given by the horizontal arrow). The horizontal line (in red) defines the lower bound obtained through an offline training. Panels: (a) ORF overall error; (b) ORF class error; (c) OSVM overall error; (d) OSVM class error.

Figure 5: Model performance comparison for fluctuation workload. Similar to Figure 4, Cartel is able to adapt to the changes in the distribution and behaves close to the centralized system. Panels: (a) ORF overall error; (b) ORF class error; (c) OSVM overall error; (d) OSVM class error.

W = 5 and a hard error rate limit of L = 0.15 for Cartel, and W = 10 for the centralized system, to provide the model updates at the edge. Adaptability to Change in the Workload. We observe that the centralized system is more resilient to the change in the distribution pattern. This is due to the generic nature of the edge model which is regularly synchronized with the central node and is built using prior knowledge from the other edge nodes. On the other hand, the isolated system and Cartel experience a spike in the model error rate for the same change. We define the time taken for the model error rate to return to the baseline (10%) as the adaptability of the system to change. This adaptability is denoted with a horizontal arrow in Figures 4 and 5. Cartel’s drift detection allows the target node to have increased adaptability with respect to the dataset shift (measured as a smaller horizontal spread in terms of number of batches) when compared to an isolated system. Specifically, when using OSVM and ORF techniques, Cartel adapts 8× and 4× faster, respectively, than an isolated system. The adaptability of Cartel is important for both workload patterns. For Fluctuation (Figure 5), Cartel helps to bring the system back to an acceptable predictive performance while the isolated system takes a longer time to adapt to the fluctuation. Data Transfer Cost. In a centralized system, an edge node proactively updates its model from the central server which helps in improving the inference at edge nodes. However, this improvement comes at the cost of a proactive model transfer between the edge and the central node. To capture the network backhaul usage we

divide the data transferred into two categories: (i) data / communication cost which includes the transfer of raw data or metadata updates, and (ii) model transfer cost which captures the amount of data transferred during model updates to the edge (periodically in case of a centralized system or a partial model request from a logical neighbor in Cartel). Cartel does not centralize raw data and only transfers models when there is a shift in the predictive performance. This design helps in reducing both data / communication cost and model transfer cost by ~1700×, and 66 to 200×, respectively, thereby reducing the overall cost of total data transferred for Cartel by two to three orders of magnitude, compared to the centralized system, as shown in Table 1. We note that for ORF the cost of the model updates is the dominating factor in the total data transferred, whereas the data / communication dominates for OSVM. As discussed, the data / communication for Cartel is O(CN) per batch. For a centralized model, the data / communication is O(BFN), where B is the average number of data points in a batch. Provided BF ≫ C, we can expect the data / communication cost to be much lower for Cartel than a centralized system. For applications where dataset shifts are less frequent, we expect Cartel will provide better predictive performance in the long run. We expect the gains of Cartel to persist even when considering federated learning-based approaches to building a centralized model [33]. Those are reported to reduce communication costs by one to two orders of magnitude, but, importantly, they strive to build at each edge a global model, and can miss the opportunity


Table 1: Ratio of data transferred in a centralized system versus Cartel. D/C represents data/communication, MU represents model update transfer cost and Total represents the combined cost.

ML     Workload       D/C (×)   MU (×)   Total (×)
ORF    Introduction   1745      200      212
ORF    Fluctuation    1760      66       71
OSVM   Introduction   1763      188      1573
OSVM   Fluctuation    1763      94       1404
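The asymptotic comparison in the data transfer cost discussion, O(CN) per batch for Cartel versus O(BFN) for a centralized system, can be made concrete with a small calculation. This is only an illustrative sketch: it counts transmitted values and ignores constant factors, it takes C to be the number of classes, F the number of features per data point, and N the number of edge nodes (an assumption, since these symbols are defined outside this passage), and the batch size B = 100 is a hypothetical value rather than one reported in the paper.

```python
def per_batch_values(num_classes, num_features, batch_points, num_nodes):
    """Return (cartel, centralized) counts of values sent per batch.

    Cartel:      each node sends one value per class      -> C * N
    Centralized: each node sends every raw data point     -> B * F * N
    Constant factors and encoding overheads are ignored (assumption).
    """
    cartel = num_classes * num_nodes
    centralized = batch_points * num_features * num_nodes
    return cartel, centralized

# MNIST has 10 classes and 28 x 28 = 784 pixel features; five edge nodes as in
# the testbed; B = 100 is a hypothetical batch size chosen for illustration.
cartel, centralized = per_batch_values(10, 784, 100, 5)
ratio = centralized / cartel  # equals B * F / C, independent of N
```

The ratio depends only on BF/C, which is why the condition BF ≫ C is what determines the saving; the measured ratios in Table 1 come from the experiments and are not reproduced by this estimate.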

training time by up to 5.7×, while for OSVM, since the model size is constant, the smaller training batch size alone reduces the training time by 3× compared to a centralized system. Impact of Machine Learning Algorithm. We remark that benefits from Cartel are not primarily related to the ML model accuracy, but rather to its adaptability during distribution changes. In the case of OSVM, although linear SVM exhibits high bias on the MNIST data, adding more features or using kernel SVMs is unlikely to speed up convergence. As such we expect the benefits of Cartel to persist even when other non-linear methods or feature transformations are used. To test this, we ran the MNIST workload as before, but with Random Fourier Features (RFF) [13, 53] to improve model accuracy.¹ Although these additional features lowered the average error rate, we observed similar improvements to adaptability, as well as a reduction in data transfer, as observed with linear SVM and ORF. Thus, Cartel can provide similar benefits when used with these additional techniques. In summary, Cartel boosts the system’s adaptability by up to 8×. It achieves a similar predictive performance compared to a centralized system while reducing the data transfer cost by up to three orders of magnitude. Cartel enables the use of smaller models at each edge and faster training.
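The adaptability metric used throughout this section (the time, in batches, for the model error rate to return to the 10% baseline after a change) can be computed from a per-batch error series roughly as follows. This is a hedged sketch rather than the authors' measurement code: the function name, the rule for locating the start of the spike, and the error series in the example are hypothetical.

```python
def adaptability(errors, change_batch, baseline=0.10):
    """Batches needed for the error rate to return to `baseline` after a change.

    errors:       per-batch error rates (fractions in [0, 1])
    change_batch: index of the batch at which the workload changes
    Returns 0 if the error never exceeds the baseline, or None if it
    exceeds the baseline and never returns within the series.
    """
    spike_start = None
    for i in range(change_batch, len(errors)):
        if spike_start is None:
            if errors[i] > baseline:
                spike_start = i          # first batch above the baseline
        elif errors[i] <= baseline:
            return i - spike_start       # width of the spike in batches
    return 0 if spike_start is None else None

# Hypothetical error series: both systems spike at batch 25.
collaborative = [0.08] * 25 + [0.40, 0.30, 0.20, 0.12, 0.09] + [0.08] * 5
isolated = [0.08] * 25 + [0.40] * 32 + [0.09]
speedup = adaptability(isolated, 25) / adaptability(collaborative, 25)
```

A smaller value corresponds to the shorter horizontal arrow in Figures 4 and 5; the ratio of two such values is how a statement like "8× faster" would be expressed under this definition.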

Figure 6: Time required to train the ORF and OSVM models. Similar trends are observed for different workload distributions. The combination of a global model and a bigger training dataset in a centralized system increases the online training time of the ML model. Panels: (a) ORF; (b) OSVM. Axes: Execution Time (s) versus Batch ID; series: Centralized, Isolated, Cartel.

Figure 7: Effect of different drift detection policies on overall model and introductory class error rate: “Delayed by X” implies the drift detection was delayed by X batches. Panels: (a) Overall model error; (b) Introductory class error.

to benefit from reduced model sizes or training times, as discussed next. Model Size. The model size depends on the machine learning technique used. It plays an important role in data transfer during model updates as mentioned above, as well as during retraining of the model. Cartel results in smaller, tailored models for each edge node, leading to faster online training time. Since ORF is an ensemble learning method, it builds multiple decision trees, each with a varying number of nodes. The size of the ORF model depends on several factors, such as the data and hyperparameters used. From our experiments with two edge nodes, we observe that Cartel results in a reduction of the model size by 3× on average when compared to a centralized system. This reduction is achieved because a tailored model in Cartel does not store as much information when compared to a generic model used in a centralized system. This is expected in MEC, because it operates in contexts with highly localized information that can result in fewer classes being active or observed at each edge [48, 52]. Beyond reduction in classes, the number of nodes in the ORF grows less quickly in Cartel vs. the centralized system due to fewer total training examples, further reducing the model size. For ORF, the model resulting from use of Cartel is similar in size to that of isolated learning, but has faster adaptability (as shown in the above discussion). Since OSVM uses a matrix to represent the hyperplane parameters corresponding to each class, there is no difference (without applying further compression) in the size of an OSVM model trained for a subset of classes compared to one trained using all classes. Training Time. The online training time for a machine learning model is a function of the training dataset and the model size. In an online system, a smaller model size and/or less data helps in training the model faster. 
Figure 6 shows the difference in the training times for the ORF and OSVM models during our experiment. The smaller model size and smaller local batch size reduce the ORF

6.3

Effect of Mechanisms

We next investigate the impact that each of the mechanisms in Cartel has on the overall system performance. Drift Detection Timeliness. Timely drift detection is important for Cartel, since a delay in detection can impact a model’s predictive performance. To demonstrate the impact of slow drift detection, we modify the drift policy to delay the request to the MdS for logical neighbors by a variable number of batches. The results in Figure 7 show the impact of drift detection delay on the overall model’s performance as well as the misclassification rate of the introduced class for ORF; we observe a similar pattern for OSVM. The request for model transfer from a logical neighbor at the actual time of drift detection stabilizes the system quickly. If the delay is too large (e.g., in the figure, a delay of 20 batches), the model transfer does not provide collaboration benefits, as online training eventually improves the model. In the current implementation of Cartel, drift detection triggers immediate requests for logical neighbors. We acknowledge that overly sensitive drift detection may cause short and non-persistent workload fluctuations to trigger

¹ We used the implementation of Random Fourier Features of Ishikawa [25].


Figure 8: Comparison of MdS metadata aggregation policies. Panels: (a) Data transfer; (b) Overall model error (Model Error % versus Batch ID). Series: Regular, Threshold (2%), Threshold (5%).

Figure 9: Comparison of adaptability of Cartel and total data transfer for various Z values when using ORF ML model on MNIST dataset. Panels: (a) Overall model error; (b) Data transfer.

unwarranted and frequent transfer requests, increasing the overall data transferred over the network; detailed sensitivity analysis can be used to develop automated methods for determining effective ranges for the delay parameter in the future. MdS Aggregation Policies. The metadata at MdS can be updated according to different policies described in Section 5.1. All other experiments use the “regular” update policy, but we note that additional data reduction benefits can be obtained through the “threshold” update policy when configured appropriately. Figure 8a shows a comparison of the regular update policy with different threshold policies. Here we use a 2% and 5% change in class priors at each edge as the threshold parameter. As the threshold increases, we observe a reduction in the total metadata transfer (labeled as data/communication cost transfers in the figure). However, thresholds that are too high could result in an increase in model update costs. Use of a threshold parameter can result in a misrepresentation of the distribution pattern of different edge nodes at the MdS. The consequence is the selection of incorrect logical neighbors, shown in Figure 8b, where the system repeatedly requests model updates and fails to adapt to the changes in the system due to incorrect logical neighbors, negating the benefit of reducing other communication costs. Finding Logical Neighbors. In a distributed deployment with hundreds of edge nodes, collaboration partners can be chosen, for example, by: i) selecting a node at random; ii) selecting based on geographical proximity; or iii) selecting based on similarities determined through node metadata information. We experiment with the impact of Cartel’s logical neighbor selection, compared to various baseline approaches, using the Introduction workload and ORF. 
The logical neighbor algorithm in MdS is modified to introduce i selection failures (by randomly selecting among the nodes not identified as a top match by the MdS algorithm), before finally providing the correct logical neighbor. Depending on the batch size and how long drift detection takes, multiple failures (in our experimental setup, more than two) negate any benefit that knowledge transfer has on the accuracy of the model. Thus, it is critical that the MdS uses timely metadata about edge nodes and an effective similarity measure to identify good logical neighbor candidates. Knowledge Transfer Balance. As discussed in Section 5.2, knowledge transfer for OSVM involves selecting the coefficients associated with the hyperplane for the problematic classes. However, ORF operates by selecting the Z trees with the lowest error rate for partitioning at a logical neighbor, raising the question of the value of Z. For our experimental setup, we tested the following values of Z: 0.1, 0.2, . . . , 1.0. We found that Z < 0.3 fails to help for the

Figure 10: Comparison of adaptability of the system and total data transfer between various knowledge transfer mechanisms that can be applied in Cartel, on the MNIST dataset using ORF as the machine learning model. Panels: (a) ORF overall model error (Model Error % versus Batch ID); (b) ORF data transfer. Series: Cartel, Merge (Z=0.3), Merge All, Replace All.

problematic class, while Z > 0.6 leads to a higher error rate for the non-problematic classes at the target node, as shown in Figure 9. Both result in more time required by Cartel to adapt to the changes. Hence, we selected Z = 0.3 as the optimal value, since higher Z values, for ORF, result in more bandwidth utilization. In addition, there are a few ways to apply a model update. In Figure 10 we evaluate the impact of each of these on the performance of Cartel, in terms of its ability to quickly converge to good models at the target node (with low model error rate) (left-hand-side graphs in the figure), and in terms of the data transfer requirements (right-hand-side graphs). For ORF, one can replace the existing forest with the logical neighbor’s forest (Replace All); two or more forests can be merged (taking the union of all the trees) (Merge All); the best performing fraction Z of trees from the logical neighbor can be merged with the target’s forest (Merge (Z = 0.3)); or, finally, the worst performing trees in the target model can be replaced by the best performing Z trees from the logical neighbor’s model. The latter is what is used in Cartel, and is enabled by the use of additional local metadata at each edge, examined during the knowledge transfer request. Replacing the entire edge model with the neighbor’s model might not work because each edge node experiences different distributions, and a blind merge from a logical neighbor would not work if only a few classes were common among the nodes. When possible (i.e., for ORF), merging all or a portion of the model (the ORF trees) seems to be a good solution when considering the error convergence time at the target. However, this increases the overall model size by up to 2× for Merge All, which further results in an increase in training time by an average of 2× for Merge All, or 1.3× for Merge (Z = 0.3). Replacing portions of the target model based

Figure 12: Comparison of three knowledge transfer mechanisms applied to the network attack dataset using ORF as the classification model. Panels: (a) ORF overall error (Model Error % versus Batch ID); (b) ORF data transfer.

Collaboration Scope. We evaluate Cartel under different data distribution scenarios. However, the evaluation is performed using synthetic workloads in an attempt to model realistic scenarios [52], and is limited to a few edge nodes. Given that a centralized model must periodically be pushed to all edge nodes, we expect that the data transfer reduction of Cartel will be larger in deployments with dozens or hundreds of edge nodes. Thus, it would be interesting to evaluate the benefits of Cartel in such scenarios, and for use cases with live traffic and other dataset shifts. In particular, the choice of some of the parameters and threshold values in Cartel depends on the workload and the characteristics of the underlying infrastructure, and further work is needed to establish practical policies for choosing effective values. Generalize to other machine learning algorithms. We present the benefits of Cartel with ORF and OSVM as the underlying machine learning algorithms, and developed general partitioning and merging algorithms that work in the bagging and one-versus-rest paradigms. Still, we plan to explore other partitioning and merging heuristics that can be used directly (rather than requiring bagging). We are interested in general methods for deep neural networks (DNN), and also in evaluating online regression problems in addition to online classification. As mentioned, one possibility is patching [30], though further developments are needed to ensure models do not become excessively large over many partitioning and merging operations. Recent work on knowledge transfer through merging of DNN [4, 14, 67] could be a stepping stone in extending Cartel to support DNN models. Other recent work has been done to partition DNNs across mobile, edge and cloud [22, 26, 29, 32], yet additional advances are needed in the ML algorithms to improve the efficacy of model transfer. Privacy. 
While a discussion about privacy is beyond the scope of this paper, we note that edge nodes in Cartel use the raw data to train models, but do not explicitly transmit this data to the other nodes. As such, the information sent to the MdS or to logical neighbors is a sketch derived from the raw data, e.g., a histogram of the data distribution at a node. However, we still foresee concerns about this sketched data, and believe such concerns also apply to other techniques such as federated learning. Regarding the possibility and handling of malicious edge nodes, our scope is limited to cooperating nodes that are owned or managed by a single service provider. We leave further exploration of issues, such as trust, as future work.

on the top performing Z classes in the neighbor’s model results in a target model that quickly converges to the model’s lower error bound, and keeps model transfer costs low. By enabling replacement of only those model portions which relate to the problematic classes at the target, Cartel achieves both quick adaptation and low transfer costs.
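The replacement strategy described above, in which the worst performing trees at the target are swapped for the best performing trees from the logical neighbor, can be sketched as follows. This is a simplified illustration and not Cartel's implementation: trees are represented as hypothetical (tree_id, error_rate) pairs, the error rate is assumed to be measured on the problematic classes, and Z is interpreted as a fraction of the neighbor's forest.

```python
def replace_worst_trees(target, neighbor, z):
    """Swap the target's worst trees for the neighbor's best trees.

    target, neighbor: lists of (tree_id, error_rate) pairs (hypothetical
                      representation; lower error_rate is better)
    z:                fraction of the neighbor's forest to transfer
    The forest size at the target stays unchanged.
    """
    k = min(len(target), max(1, round(z * len(neighbor))))
    best_from_neighbor = sorted(neighbor, key=lambda t: t[1])[:k]
    kept = sorted(target, key=lambda t: t[1])[:len(target) - k]
    return kept + best_from_neighbor

target_errs = [0.50, 0.90, 0.40, 0.80, 0.45, 0.70, 0.42, 0.41, 0.43, 0.44]
neighbor_errs = [0.20, 0.05, 0.30, 0.10, 0.25, 0.15, 0.35, 0.40, 0.45, 0.50]
target = [(f"t{i}", e) for i, e in enumerate(target_errs)]
neighbor = [(f"n{i}", e) for i, e in enumerate(neighbor_errs)]
updated = replace_worst_trees(target, neighbor, z=0.3)
updated_ids = {tree_id for tree_id, _ in updated}
```

Because trees are replaced rather than appended, the forest keeps its original size, which is consistent with the observation above that the merge-based alternatives increase model size and training time.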

6.4

Use Case - Network Attack

This scenario is based on data from the CICIDS2017 Intrusion Detection evaluation dataset [59]. The testbed consists of two edge nodes and a central node. One of the nodes (Edge 0) experiences a sustained port scan attack, while the other (Edge 1) experiences a DDoS attack. After a time period, Edge 0 (the target node) experiences a similar DDoS attack. The ML model used to classify the attacks is ORF, with 10 tree predictors at each node, each with a maximum depth of 16. The workload follows the Introduction distribution and consists of 900 batches, each with ~1000 data points of various network metrics (features). Result. As shown in Figure 11, collaboration with a logical neighbor helps the target node to adapt 5× faster to the introduction of the DDoS attack, which was already observed by Edge 1, compared to the isolated system. A centralized system with edge nodes receiving regular model updates from a central node does not require time to adapt; however, even with only two edge nodes, the total data transfer is 90× more, and the time taken for training is 2× that of Cartel. We performed a more comprehensive evaluation of Cartel using the intrusion detection dataset. Figure 12 demonstrates that the knowledge transfer mechanism in Cartel reduces model transfer cost by 2.5× or more compared to the other mechanisms, while keeping a lower overall model error rate. Additionally, the results from the Fluctuation workload exhibit increased adaptability of the system, reduced total data transfer (by 60×) and faster model training time (by 1.65×), compared to centralized learning. These results showcase a similar trend to the ones described for the MNIST-based use case in Section 6.2; the graphs are omitted for brevity.

Figure 11: Performance and total data transfer comparison for the network attack dataset using ORF to classify benign requests against DDoS and port scan attack requests. Panels: (a) ORF overall model error; (b) Data transfer.

7

DISCUSSION

We have shown how Cartel performs, in terms of data transfer, training time and model size. Our results demonstrate the potential of Cartel, but also illustrate several opportunities to be further explored.


8


RELATED WORK

[2] Amazon. 2018. Amazon Machine Learning. System Limits. https://docs.aws.amazon.com/machine-learning/latest/dg/system-limits.html.
[3] B. Amento, B. Balasubramanian, R. J. Hall, K. Joshi, G. Jung, and K. H. Purdy. 2016. FocusStack: Orchestrating Edge Clouds Using Location-Based Focus of Attention. In ACM Symposium on Edge Computing (SEC’16).
[4] Shabab Bazrafkan and Peter M Corcoran. 2018. Pushing the AI envelope: merging deep networks to accelerate edge artificial intelligence in consumer electronics devices and systems. IEEE Consumer Electronics Magazine 7, 2 (2018), 55–61.
[5] Rudolf Beran et al. 1977. Minimum Hellinger distance estimates for parametric models. The annals of Statistics 5, 3 (1977), 445–463.
[6] Ketan Bhardwaj, Ming-Wei Shih, Pragya Agarwal, Ada Gavrilovska, Taesoo Kim, and Karsten Schwan. 2016. Fast, Scalable and Secure Onloading of Edge Functions using AirBox. In Proceedings of the 1st IEEE/ACM Symposium on Edge Computing (SEC’16).
[7] Albert Bifet and Ricard Gavalda. 2007. Learning from time-changing data with adaptive windowing. In Proceedings of the 2007 SIAM international conference on data mining. SIAM, 443–448.
[8] Léon Bottou. 1998. Online Algorithms and Stochastic Approximations. (1998). http://leon.bottou.org/papers/bottou-98x revised, oct 2012.
[9] Leo Breiman. 1996. Bagging predictors. Machine Learning 24, 2 (01 Aug 1996), 123–140. https://doi.org/10.1007/BF00058655
[10] Ignacio Cano, Markus Weimer, Dhruv Mahajan, Carlo Curino, Giovanni Matteo Fumarola, and Arvind Krishnamurthy. 2017. Towards Geo-Distributed Machine Learning. IEEE Data Eng. Bull. 40, 4 (2017), 41–59. http://sites.computer.org/debull/A17dec/p41.pdf
[11] Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin. 2008. Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines. Journal of Machine Learning Research 9 (2008), 1369–1398. https://dl.acm.org/citation.cfm?id=1442778
[12] Trishul M Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project Adam: Building an Efficient and Scalable Deep Learning Training System. In OSDI, Vol. 14. 571–582.
[13] Radha Chitta, Rong Jin, and Anil K Jain. 2012. Efficient kernel clustering using random fourier features. In 2012 IEEE 12th International Conference on Data Mining. IEEE, 161–170.
[14] Yi-Min Chou, Yi-Ming Chan, Jia-Hong Lee, Chih-Yi Chiu, and Chu-Song Chen. 2018. Unifying and Merging Well-trained Deep Neural Networks for Inference Stage. CoRR abs/1805.04980 (2018). arXiv:1805.04980 http://arxiv.org/abs/1805.04980
[15] Cisco. 2017. Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2016–2021 White Paper. https://cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/mobile-white-paperc11-520862.pdf
[16] Stephane Daeuble. [n.d.]. Small cells and Mobile Edge Computing cover all the bases for Taiwan baseball fans. https://www.nokia.com/blog/small-cells-mobileedge-computing-cover-bases-taiwan-baseball-fans/.
[17] Wenyuan Dai, Qiang Yang, Gui-Rong Xue, and Yong Yu. 2007. Boosting for Transfer Learning. (2007), 193–200. https://doi.org/10.1145/1273496.1273521
[18] Facebook. [n.d.]. Applying machine learning science to Facebook products. https://research.fb.com/category/machine-learning/.
[19] João Gama, Indre Zliobaite, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. 2014. A survey on concept drift adaptation. ACM Comput. Surv. 46, 4 (2014), 44:1–44:37. https://doi.org/10.1145/2523813
[20] Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan. 2008. A dual coordinate descent method for large-scale linear SVM. In Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008 (ACM International Conference Proceeding Series), William W. Cohen, Andrew McCallum, and Sam T. Roweis (Eds.), Vol. 307. ACM, 408–415. https://doi.org/10.1145/1390156.1390208
[21] Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R Ganger, Phillip B Gibbons, and Onur Mutlu. 2017. Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds. In NSDI. 629–647.
[22] Ke-Jou Carol Hsu, Ketan Bhardwaj, and Ada Gavrilovska. 2019. Couper: DNN Model Slicing for Visual Analytics Containers at the Edge. In 4th ACM/IEEE Symposium on Edge Computing (SEC’19).
[23] Yun Chao Hu, Milan Patel, Dario Sabella, Nurit Sprecher, and Valerie Young. 2015. Mobile edge computing—A key technology towards 5G. ETSI white paper 11, 11 (2015), 1–16.
[24] Geoff Hulten, Laurie Spencer, and Pedro Domingos. 2001. Mining time-changing data streams. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 97–106.
[25] Tetsuya Ishikawa. 2019. Random Fourier Features. https://github.com/tiskw/RandomFourierFeatures
[26] Hyuk-Jin Jeong, Hyeon-Jae Lee, Chang Hyun Shin, and Soo-Mook Moon. 2018. IONN: Incremental Offloading of Neural Network Computations from Mobile Devices to Edge Servers. In Proceedings of the ACM Symposium on Cloud Computing. ACM, 401–411.

Cartel is a system that leverages the unique characteristics of each edge and enables collaboration across nodes to provide a head start in adapting the model for changes observed at the target node. This is in contrast to existing systems [38, 43, 65] where data processing and analysis happen in a single datacenter; however, the excessive communication overhead in distributed machine learning algorithms makes such systems unsuitable in a geo-distributed setting. Systems such as Gaia [21], Project Adam [12], Federated learning [33] and others [10, 39, 64] focus on addressing the communication overhead in running machine learning methods such as a parameter server and large deep neural networks in a geo-distributed environment. Additionally, the distributed setting involves interaction with a large number of nodes, where some of these nodes can experience failures. MOCHA [61] is a system designed to handle such stragglers. DeepCham [36], IONN [26], Neurosurgeon [29] and SplitNet [32] are examples of systems where the machine learning model is partitioned across mobile, edge or cloud, which work in a collaborative way to train a unified model. These systems do not consider custom models for each node in MEC, where an edge might not require a global model trained on a broad variety of data. Similarly to Cartel, Cellscope [48] is also aimed at creating better models at edge nodes. Using real data, the authors show evidence that global models can lead to poor accuracy and high variance. However, the focus of that work is on providing a bigger dataset by intelligently combining data from multiple base stations to help in building the local model at edge nodes. In contrast, Cartel avoids data transfer and aims to provide model updates from logical edge nodes only when there exists a data shift. 
Finally, there exist many machine learning algorithms [8, 24, 27, 34] to incrementally train machine learning models in an efficient manner and more sophisticated knowledge transfer techniques [17, 49] that Cartel can leverage to further improve the learning performance.

9 CONCLUSION

In this paper we introduce Cartel, a system for sharing customized machine learning models between edge datacenters. Cartel incorporates mechanisms to detect changes in the input patterns of the local machine learning model running at a given edge, to dynamically discover logical neighbors that have seen similar trends, and to request knowledge transfer from them. This creates a collaborative environment in which a node learns from other models only when required, and without sharing raw data. Experiments show that Cartel allows edge nodes to benefit from tailored models while adapting quickly to changes in their workloads and incurring significantly lower data transfer costs than approaches based on global models. As future work, we aim to explore the opportunities for additional gains from algorithmic improvements while adding other machine learning models to the system.

REFERENCES [1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016).


Cartel: A System for Collaborative Transfer Learning at the Edge

SoCC ’19, November 20–23, 2019, Santa Cruz, CA, USA

[27] Ruoming Jin and Gagan Agrawal. 2003. Efficient decision tree construction on streaming data. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 571–576. [28] James M Joyce. 2011. Kullback-leibler divergence. In International encyclopedia of statistical science. Springer, 720–722. [29] Yiping Kang, Johann Hauswald, Cao Gao, Austin Rovinski, Trevor Mudge, Jason Mars, and Lingjia Tang. 2017. Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. ACM SIGPLAN Notices 52, 4 (2017), 615–629. [30] Sebastian Kauschke and Johannes Fürnkranz. 2018. Batchwise Patching of Classifiers. In Thirty-Second AAAI Conference on Artificial Intelligence. [31] Daniel Kifer, Shai Ben-David, and Johannes Gehrke. 2004. Detecting change in data streams. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30. VLDB Endowment, 180–191. [32] Juyong Kim, Yookoon Park, Gunhee Kim, and Sung Ju Hwang. 2017. SplitNet: Learning to semantically split deep networks for parameter reduction and model parallelization. In International Conference on Machine Learning. 1866–1874. [33] Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 2016. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492 (2016). [34] Balaji Lakshminarayanan, Daniel M Roy, and Yee Whye Teh. 2014. Mondrian forests: Efficient online random forests. (2014), 3140–3148. [35] Yann LeCun. 2010. The MNIST database of handwritten digits. http://yann. lecun. com/exdb/mnist/ (2010). [36] Dawei Li, Theodoros Salonidis, Nirmit V Desai, and Mooi Choo Chuah. 2016. Deepcham: Collaborative edge-mediated adaptive deep learning for mobile object recognition. (2016), 64–76. [37] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. 2014. 
Scaling Distributed Machine Learning with the Parameter Server.. In OSDI, Vol. 14. 583– 598. [38] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. 14 (2014), 583–598. [39] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server.. In OSDI, Vol. 14. 583– 598. [40] Jianhua Lin. 1991. Divergence measures based on the Shannon entropy. IEEE Trans. Information Theory 37, 1 (1991), 145–151. https://doi.org/10.1109/18.61115 [41] Dirk Lindemeier. 2015. Nokia: EE and Mobile Edge Computing ready to rock Wembley stadium. https://www.nokia.com/blog/ee-mobile-edge-computingready-rock-wembley-stadium/. [42] Viktor Losing, Barbara Hammer, and Heiko Wersing. 2018. Incremental on-line learning: A review and comparison of state of the art algorithms. Neurocomputing 275 (2018), 1261–1274. https://doi.org/10.1016/j.neucom.2017.06.084 [43] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M Hellerstein. 2012. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment (2012), 716–727. [44] Frank J Massey Jr. 1951. The Kolmogorov-Smirnov test for goodness of fit. Journal of the American statistical Association 46, 253 (1951), 68–78. [45] Jose G. Moreno-Torres, Troy Raeder, Rocío Alaíz-Rodríguez, Nitesh V. Chawla, and Francisco Herrera. 2012. A unifying view on dataset shift in classification. Pattern Recognition 45, 1 (2012), 521–530. https://doi.org/10.1016/j.patcog.2011. 06.019 [46] IoT Now and Sheetal Kumbhar. 2017. Intel, RIFT.io, Vasona Networks and Xaptum to Demo IoT Multi-Access Edge Computing. https://tinyurl.com/intel-riftioVasona-Xaptum. [47] Opinov8. 2019. 
How Do Amazon, Facebook, Apple and Google Use AI? https://opinov8.com/how-do-amazon-facebook-apple-and-google-use-ai/. [48] Anand Padmanabha Iyer, Li Erran Li, Mosharaf Chowdhury, and Ion Stoica. 2018. Mitigating the Latency-Accuracy Trade-off in Mobile Data Analytics Systems. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking. ACM, 513–528. [49] Sinno Jialin Pan, Qiang Yang, et al. 2010. A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22, 10 (2010), 1345–1359. [50] Manohar Parakh. 2018. How Companies Use Machine Learning. https://dzone.com/articles/how-companies-use-machine-learning. [51] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake VanderPlas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Edouard Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830. http://dl.acm.org/citation.cfm?id=2078195 [52] Michele Polese, Rittwik Jana, Velin Kounev, Ke Zhang, Supratim Deb, and Michele Zorzi. 2018. Machine Learning at the Edge: A Data-Driven Architecture with Applications to 5G Cellular Networks. CoRR abs/1808.07647 (2018).

arXiv:1808.07647 http://arxiv.org/abs/1808.07647 [53] Ali Rahimi and Benjamin Recht. 2008. Random features for large-scale kernel machines. In Advances in neural information processing systems. 1177–1184. [54] Jon NK Rao and Alastair J Scott. 1981. The analysis of categorical data from complex sample surveys: chi-squared tests for goodness of fit and independence in two-way tables. Journal of the American statistical association 76, 374 (1981), 221–230. [55] Pablo Rodriguez. 2017. The Edge: Evolution or Revolution. In ACM/IEEE Symposium on Edge Computing (SEC’17). [56] Amir Saffari, Christian Leistner, Jakob Santner, Martin Godec, and Horst Bischof. 2009. On-line random forests. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on. IEEE, 1393–1400. [57] Mahadev Satyanarayanan. 2017. The emergence of edge computing. Computer 50, 1 (2017), 30–39. [58] Mahadev Satyanarayanan, Zhuo Chen, Kiryong Ha, Wenlu Hu, Wolfgang Richter, and Padmanabhan Pillai. 2014. Cloudlets: at the leading edge of mobilecloud convergence. In 2014 6th International Conference on Mobile Computing, Applications and Services (MobiCASE). IEEE, 1–9. [59] Iman Sharafaldin, Arash Habibi Lashkari, and Ali A Ghorbani. 2018. Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization.. In ICISSP. 108–116. [60] Weisong Shi, Jie Cao, Quan Zhang, Youhuizi Li, and Lanyu Xu. 2016. Edge computing: Vision and challenges. IEEE Internet of Things Journal 3, 5 (2016), 637– 646. [61] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. 2017. Federated multi-task learning. In Advances in Neural Information Processing Systems. 4424–4434. [62] Heng Wang and Zubin Abraham. 2015. Concept drift detection for streaming data. In 2015 International Joint Conference on Neural Networks, IJCNN 2015, Killarney, Ireland, July 12-17, 2015. 1–9. https://doi.org/10.1109/IJCNN.2015. 7280398 [63] Shuo Wang, Leandro L. 
Minku, Davide Ghezzi, Daniele Caltabiano, Peter Tiño, and Xin Yao. 2013. Concept drift detection for online class imbalance learning. In The 2013 International Joint Conference on Neural Networks, IJCNN 2013, Dallas, TX, USA, August 4-9, 2013. 1–10. https://doi.org/10.1109/IJCNN.2013.6706768 [64] Shiqiang Wang, Tiffany Tuor, Theodoros Salonidis, Kin K Leung, Christian Makaya, Ting He, and Kevin Chan. 2018. When edge meets learning: Adaptive control for resource-constrained distributed machine learning. (2018), 63–71. [65] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster computing with working sets. HotCloud 10, 10-10 (2010), 95. [66] Wuyang Zhang, Jiachen Chen, Yanyong Zhang, and Dipankar Raychaudhuri. 2017. Towards efficient edge cloud augmentation for virtual reality MMOGs. In Proceedings of the Second ACM/IEEE Symposium on Edge Computing. ACM, 8. [67] Liming Zhao, Jingdong Wang, Xi Li, Zhuowen Tu, and Wenjun Zeng. 2016. Deep convolutional neural networks with merge-and-run mappings. arXiv preprint arXiv:1611.07718 (2016).


Cartel: A System for Collaborative Transfer Learning at the Edge
Harshit Daga* | Patrick K. Nicholson+ | Ada Gavrilovska* | Diego Lugones+
*Georgia Institute of Technology, +Nokia Bell Labs

Multi-access Edge Computing (MEC)

• Compute and storage closer to the end user
• Provides ultra-low latency

Nokia


Machine Learning @ Edge
o There is tremendous growth of data generated at the edge from end-user devices and IoT.
o We explore machine learning in the context of MEC:
  • Results are only needed locally
  • Latency is critical
  • Data volume must be reduced

Microsoft


Existing Solution: Centralized System

[Figure: edge data is sent to a centralized system in the cloud.]

Problems
o Data movement is time consuming and uses a lot of backhaul network bandwidth.
o Distributed ML across geo-distributed data can slow down execution by up to 53x [1].
o Regulatory constraints (GDPR)

[1] Hsieh et al. Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds.

An Alternative Approach: Isolated System
• Train machine learning models independently at each edge, in isolation from other edge nodes.
• The performance of an isolated model is heavily impacted when it must adapt to a changing workload.

Motivation
Can we achieve a balance between centralized and isolated systems? Can we leverage the resource-constrained edge nodes to train customized (smaller) machine learning models in a manner that reduces training time and backhaul data transfer while keeping performance close to that of a centralized system?

Opportunity
• Each edge node has its own attributes and characteristics → a full generic model trained on a broad variety of data may not be required at an edge node.

Solution Overview
Cartel: A System for Collaborative Transfer Learning at the Edge

[Figure: edge (E) nodes under the Centralized, Isolated, and Cartel approaches, with a table comparing the three on lightweight models, data transfer, online training time, and high model accuracy.]

• Cartel maintains small customized models at each edge node.
• When there is a change in the environment or variations in workload patterns, Cartel provides a jump start to adapt to these changes by transferring knowledge from other edge(s) where similar patterns have been observed.

Key Challenges
C1: When to request a model transfer?
C2: Which node (logical neighbor) to contact?
C3: How to transfer knowledge to the target edge node?

Solution Design: Raw Data vs. Metadata
• Do not share raw data between any edge nodes or with the cloud.
• Use metadata:
  § Statistics about the network
  § Software configuration
  § Active user distribution by segments
  § Estimates of class priors (probability of certain classes), etc.

[Figure: edge node E1 and the Metadata Server (MdS).]

Cartel maintains and aggregates metadata locally and in the metadata server (MdS).

C1: When to request a model transfer?
Drift Detection
• Determine when to send a request to collaborate with edge nodes for a model transfer.
• In our prototype we use a threshold-based drift detection mechanism.

[Figure: edge nodes (E) E1 to E4 register with the Metadata Server (MdS) and send metadata; request batches arrive at the edge nodes.]
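A threshold-based drift check of the kind mentioned on this slide could look roughly like the Python sketch below. The slide does not say which signal is monitored or which values are used, so the choice of windowed prediction accuracy, the class name `ThresholdDriftDetector`, and the `window` and `threshold` defaults are all assumptions made for illustration.

```python
from collections import deque


class ThresholdDriftDetector:
    """Flags drift when windowed accuracy falls below a fixed threshold.

    Hypothetical helper: the monitored signal, window size, and threshold
    are illustrative assumptions, not values taken from Cartel.
    """

    def __init__(self, window=100, threshold=0.8):
        self.window = window
        self.threshold = threshold
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong

    def update(self, correct):
        """Record one prediction outcome; return True if drift is flagged."""
        self.outcomes.append(1 if correct else 0)
        if len(self.outcomes) < self.window:
            return False  # not enough evidence yet
        accuracy = sum(self.outcomes) / len(self.outcomes)
        return accuracy < self.threshold
```

In such a design, a `True` return value would be the point at which an edge node asks the MdS for logical neighbors; a smaller window reacts faster but is noisier.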

C2: Which neighbors to contact?
Logical Neighbor
• Find the neighbors whose class priors are similar to those of the target node.
• We call them “logical neighbors” because they can be anywhere in the network.
• In our prototype, when class priors are undergoing some shift, the empirical distribution from the target node is compared with those from the other nodes at the MdS to determine which subset of edge nodes are logical neighbors of the target node.
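The comparison of class-prior distributions described above can be sketched as follows. The slide does not name the distance measure, so the use of Jensen-Shannon divergence here is an assumption, as are the function names and the `max_divergence` cutoff.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def logical_neighbors(target_priors, priors_by_node, max_divergence=0.1):
    """Return ids of nodes whose class priors are close to the target's.

    Hypothetical selection rule: keep nodes within max_divergence,
    ordered from most to least similar.
    """
    scored = [
        (js_divergence(target_priors, priors), node)
        for node, priors in priors_by_node.items()
    ]
    return [node for score, node in sorted(scored) if score <= max_divergence]
```

With base-2 logarithms the divergence lies between 0 (identical priors) and 1 (disjoint classes), which makes a fixed cutoff easy to interpret.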

C3: How to transfer knowledge to the target?
Knowledge Transfer
• Two-step process:
  1. Partitioning
  2. Merging

[Figure: the target node sends a "Help Me (SOS)" request to a logical neighbor.]

Solution Overview

[Figure: architecture of an edge node. Incoming data feeds a Collaborative Component built on top of an existing ML library*. Labels on the Collaborative Component: Register, Predict, Train, Transfer Learning, Accuracy Trend, Distribution Drift. Labels on the ML model from the existing ML library*: Partition, Merge.]

Evaluation

Goals
• How effectively does the system adapt to changes in workload?
• How effective is Cartel in reducing data transfer costs, while providing lightweight and accurate models?
• What are the costs of Cartel's mechanisms and design choices?
• How does Cartel perform in a real-world scenario?

Methodology
• Workloads: Introduction Workload and Fluctuation Workload
• Machine learning models: ORF and OSVM
• Datasets used: MNIST and CICIDS2017


Evaluation: Adaptability to Change in the Workload

[Figure: Introduction Workload, Online Random Forest (ORF); axis label: Number of Requests.]
[Figure: Fluctuation Workload, Online Support Vector Machine (OSVM).]

• When changes in the environment or variations in workload patterns require the model to adapt, Cartel provides a jump start by transferring knowledge from other edge(s) where similar patterns have been observed.
• Cartel adapts to workload changes up to 8x faster than an isolated system while achieving predictive performance similar to that of a centralized system.

Evaluation: Data Transfer Cost
• Data/communication cost includes the transfer of raw data or metadata updates.
• Model transfer cost captures the amount of data transferred during model updates to the edge (periodically in the case of a centralized system, or as a partial model request from a logical neighbor in Cartel).
• Cartel reduces the total data transfer cost by up to 1500x compared to a centralized system.

Summary
• We introduce Cartel, a system for sharing customized machine learning models between edge nodes.
• Benefits of Cartel include:
  • Adapts quickly to changes in workload (up to 8x faster compared to an isolated system).
  • Reduces total data transfer costs significantly (1500x ↓ compared to a centralized system).
  • Enables use of smaller models (3x ↓) at an edge node, leading to faster training (5.7x ↓) when compared to a centralized system.

[Figure: Cartel workflow between the edge nodes (E) E1 (t), E2, E3, E4 and the Metadata Service (MdS). Labels: request batch; Eis register and send metadata; request for nodes with similar model; subset of helpful neighbors (E3, E4); insights.]


Appears in the Proceedings of the Third Symposium on Operating Systems Design and Implementation, New Orleans, USA, February 1999

Practical Byzantine Fault Tolerance
Miguel Castro and Barbara Liskov
Laboratory for Computer Science, Massachusetts Institute of Technology, 545 Technology Square, Cambridge, MA 02139
{castro,liskov}@lcs.mit.edu

Abstract

This paper describes a new replication algorithm that is able to tolerate Byzantine faults. We believe that Byzantine-fault-tolerant algorithms will be increasingly important in the future because malicious attacks and software errors are increasingly common and can cause faulty nodes to exhibit arbitrary behavior. Whereas previous algorithms assumed a synchronous system or were too slow to be used in practice, the algorithm described in this paper is practical: it works in asynchronous environments like the Internet and incorporates several important optimizations that improve the response time of previous algorithms by more than an order of magnitude. We implemented a Byzantine-fault-tolerant NFS service using our algorithm and measured its performance. The results show that our service is only 3% slower than a standard unreplicated NFS.

1 Introduction

Malicious attacks and software errors are increasingly common. The growing reliance of industry and government on online information services makes malicious attacks more attractive and makes the consequences of successful attacks more serious. In addition, the number of software errors is increasing due to the growth in size and complexity of software. Since malicious attacks and software errors can cause faulty nodes to exhibit Byzantine (i.e., arbitrary) behavior, Byzantine-fault-tolerant algorithms are increasingly important.

This paper presents a new, practical algorithm for state machine replication [17, 34] that tolerates Byzantine faults. The algorithm offers both liveness and safety provided at most $\lfloor (n-1)/3 \rfloor$ out of a total of $n$ replicas are simultaneously faulty. This means that clients eventually receive replies to their requests and those replies are correct according to linearizability [14, 4]. The algorithm works in asynchronous systems like the Internet and it incorporates important optimizations that enable it to perform efficiently.

There is a significant body of work on agreement and replication techniques that tolerate Byzantine faults (starting with [19]). However, most earlier work (e.g., [3, 24, 10]) either concerns techniques designed to demonstrate theoretical feasibility that are too inefficient to be used in practice, or assumes synchrony, i.e., relies on known bounds on message delays and process speeds. The systems closest to ours, Rampart [30] and SecureRing [16], were designed to be practical, but they rely on the synchrony assumption for correctness, which is dangerous in the presence of malicious attacks. An attacker may compromise the safety of a service by delaying non-faulty nodes or the communication between them until they are tagged as faulty and excluded from the replica group. Such a denial-of-service attack is generally easier than gaining control over a non-faulty node.

Our algorithm is not vulnerable to this type of attack because it does not rely on synchrony for safety. In addition, it improves the performance of Rampart and SecureRing by more than an order of magnitude as explained in Section 7. It uses only one message round trip to execute read-only operations and two to execute read-write operations. Also, it uses an efficient authentication scheme based on message authentication codes during normal operation; public-key cryptography, which was cited as the major latency [29] and throughput [22] bottleneck in Rampart, is used only when there are faults.

To evaluate our approach, we implemented a replication library and used it to implement a real service: a Byzantine-fault-tolerant distributed file system that supports the NFS protocol. We used the Andrew benchmark [15] to evaluate the performance of our system. The results show that our system is only 3% slower than the standard NFS daemon in the Digital Unix kernel during normal-case operation. Thus, the paper makes the following contributions:

• It describes the first state-machine replication protocol that correctly survives Byzantine faults in asynchronous networks.
• It describes a number of important optimizations that allow the algorithm to perform well so that it can be used in real systems.
• It describes the implementation of a Byzantine-fault-tolerant distributed file system.
• It provides experimental results that quantify the cost of the replication technique.

This research was supported in part by DARPA under contract DABT63-95-C-005, monitored by Army Fort Huachuca, and under contract F30602-98-1-0237, monitored by the Air Force Research Laboratory, and in part by NEC. Miguel Castro was partially supported by a PRAXIS XXI fellowship.

The remainder of the paper is organized as follows. We begin by describing our system model, including our failure assumptions. Section 3 describes the problem solved by the algorithm and states correctness conditions. The algorithm is described in Section 4 and some important optimizations are described in Section 5. Section 6 describes our replication library and how we used it to implement a Byzantine-fault-tolerant NFS. Section 7 presents the results of our experiments. Section 8 discusses related work. We conclude with a summary of what we have accomplished and a discussion of future research directions.

2 System Model

We assume an asynchronous distributed system where nodes are connected by a network. The network may fail to deliver messages, delay them, duplicate them, or deliver them out of order. We use a Byzantine failure model, i.e., faulty nodes may behave arbitrarily, subject only to the restriction mentioned below. We assume independent node failures. For this assumption to be true in the presence of malicious attacks, some steps need to be taken, e.g., each node should run different implementations of the service code and operating system and should have a different root password and a different administrator. It is possible to obtain different implementations from the same code base [28] and for low degrees of replication one can buy operating systems from different vendors. N-version programming, i.e., different teams of programmers produce different implementations, is another option for some services.

We use cryptographic techniques to prevent spoofing and replays and to detect corrupted messages. Our messages contain public-key signatures [33], message authentication codes [36], and message digests produced by collision-resistant hash functions [32]. We denote a message $m$ signed by node $i$ as $\langle m \rangle_{\sigma_i}$ and the digest of message $m$ by $D(m)$. We follow the common practice of signing a digest of a message and appending it to the plaintext of the message rather than signing the full message ($\langle m \rangle_{\sigma_i}$ should be interpreted in this way). All replicas know the others' public keys to verify signatures.

We allow for a very strong adversary that can coordinate faulty nodes, delay communication, or delay correct nodes in order to cause the most damage to the replicated service. We do assume that the adversary cannot delay correct nodes indefinitely. We also assume that the adversary (and the faulty nodes it controls) are computationally bound so that (with very high probability) it is unable to subvert the cryptographic techniques mentioned above. For example, the adversary cannot produce a valid signature of a non-faulty node, compute the information summarized by a digest from the digest, or find two messages with the same digest. The cryptographic techniques we use are thought to have these properties [33, 36, 32].

3 Service Properties

Our algorithm can be used to implement any deterministic replicated service with a state and some operations. The operations are not restricted to simple reads or writes of portions of the service state; they can perform arbitrary deterministic computations using the state and operation arguments. Clients issue requests to the replicated service to invoke operations and block waiting for a reply. The replicated service is implemented by $n$ replicas. Clients and replicas are non-faulty if they follow the algorithm in Section 4 and if no attacker can forge their signature.

The algorithm provides both safety and liveness assuming no more than $\lfloor (n-1)/3 \rfloor$ replicas are faulty. Safety means that the replicated service satisfies linearizability [14] (modified to account for Byzantine-faulty clients [4]): it behaves like a centralized implementation that executes operations atomically one at a time. Safety requires the bound on the number of faulty replicas because a faulty replica can behave arbitrarily, e.g., it can destroy its state.

Safety is provided regardless of how many faulty clients are using the service (even if they collude with faulty replicas): all operations performed by faulty clients are observed in a consistent way by non-faulty clients. In particular, if the service operations are designed to preserve some invariants on the service state, faulty clients cannot break those invariants.

The safety property is insufficient to guard against faulty clients, e.g., in a file system a faulty client can write garbage data to some shared file. However, we limit the amount of damage a faulty client can do by providing access control: we authenticate clients and deny access if the client issuing a request does not have the right to invoke the operation. Also, services may provide operations to change the access permissions for a client. Since the algorithm ensures that the effects of access revocation operations are observed consistently by all clients, this provides a powerful mechanism to recover from attacks by faulty clients.

The algorithm does not rely on synchrony to provide safety. Therefore, it must rely on synchrony to provide liveness; otherwise it could be used to implement consensus in an asynchronous system, which is not possible [9]. We guarantee liveness, i.e., clients eventually receive replies to their requests, provided at most $\lfloor (n-1)/3 \rfloor$ replicas are faulty and $delay(t)$ does not grow faster than $t$ indefinitely. Here, $delay(t)$ is the time between the moment $t$ when a message is sent for the first time and the moment when it is received by its destination (assuming the sender keeps retransmitting the message until it is received). (A more precise definition can be found in [4].) This is a rather weak synchrony assumption that is likely to be true in any real system provided network faults are eventually repaired, yet it enables us to circumvent the impossibility result in [9].

The resiliency of our algorithm is optimal: $3f+1$ is the minimum number of replicas that allow an asynchronous system to provide the safety and liveness properties when up to $f$ replicas are faulty (see [2] for a proof). This many replicas are needed because it must be possible to proceed after communicating with $n-f$ replicas, since $f$ replicas might be faulty and not responding. However, it is possible that the $f$ replicas that did not respond are not faulty and, therefore, $f$ of those that responded might be faulty. Even so, there must still be enough responses that those from non-faulty replicas outnumber those from faulty ones, i.e., $n - 2f > f$. Therefore $n > 3f$.

The algorithm does not address the problem of fault-tolerant privacy: a faulty replica may leak information to an attacker. It is not feasible to offer fault-tolerant privacy in the general case because service operations may perform arbitrary computations using their arguments and the service state; replicas need this information in the clear to execute such operations efficiently. It is possible to use secret sharing schemes [35] to obtain privacy even in the presence of a threshold of malicious replicas [13] for the arguments and portions of the state that are opaque to the service operations. We plan to investigate these techniques in the future.
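The resiliency bound argued in this section (a client of the group must be able to proceed after hearing from n - f replicas, and among those the non-faulty replies must outnumber the faulty ones, so n - 2f > f and hence n > 3f) can be checked with a few lines of arithmetic. The following Python sketch is only an illustration; the function names are hypothetical and do not come from the paper.

```python
def max_faulty(n):
    """Largest f with n >= 3f + 1, i.e. floor((n - 1) / 3)."""
    return (n - 1) // 3


def min_replicas(f):
    """Smallest group size that tolerates f Byzantine replicas: 3f + 1."""
    return 3 * f + 1


def nonfaulty_outnumber_faulty(n, f):
    """Check n - 2f > f.

    The system proceeds after n - f responses; up to f of those may come
    from faulty replicas, leaving at least n - 2f non-faulty responses.
    """
    return n - 2 * f > f
```

For example, four replicas tolerate one fault, and adding a fifth or sixth replica does not raise the number of tolerated faults; seven are needed to tolerate two.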

4 The Algorithm

Our algorithm is a form of state machine replication [17, 34]: the service is modeled as a state machine that is replicated across different nodes in a distributed system. Each state machine replica maintains the service state and implements the service operations. We denote the set of replicas by $\mathcal{R}$ and identify each replica using an integer in $\{0, \ldots, |\mathcal{R}|-1\}$. For simplicity, we assume $|\mathcal{R}| = 3f+1$ where $f$ is the maximum number of replicas that may be faulty; although there could be more than $3f+1$ replicas, the additional replicas degrade performance (since more and bigger messages are being exchanged) without providing improved resiliency.

The replicas move through a succession of configurations called views. In a view one replica is the primary and the others are backups. Views are numbered consecutively. The primary of a view is replica $p$ such that $p = v \bmod |\mathcal{R}|$, where $v$ is the view number. View changes are carried out when it appears that the primary has failed. Viewstamped Replication [26] and Paxos [18] used a similar approach to tolerate benign faults (as discussed in Section 8).

The algorithm works roughly as follows:

1. A client sends a request to invoke a service operation to the primary.
2. The primary multicasts the request to the backups.
3. Replicas execute the request and send a reply to the client.
4. The client waits for $f+1$ replies from different replicas with the same result; this is the result of the operation.

Like all state machine replication techniques [34], we impose two requirements on replicas: they must be deterministic (i.e., the execution of an operation in a given state and with a given set of arguments must always produce the same result) and they must start in the same state. Given these two requirements, the algorithm ensures the safety property by guaranteeing that all non-faulty replicas agree on a total order for the execution of requests despite failures.

The remainder of this section describes a simplified version of the algorithm. We omit discussion of how nodes recover from faults due to lack of space. We also omit details related to message retransmissions. Furthermore, we assume that message authentication is achieved using digital signatures rather than the more efficient scheme based on message authentication codes; Section 5 discusses this issue further. A detailed formalization of the algorithm using the I/O automaton model [21] is presented in [4].

4.1 The Client

A client $c$ requests the execution of state machine operation $o$ by sending a $\langle \text{REQUEST}, o, t, c \rangle_{\sigma_c}$ message to the primary. Timestamp $t$ is used to ensure exactly-once semantics for the execution of client requests. Timestamps for $c$'s requests are totally ordered such that later requests have higher timestamps than earlier ones; for example, the timestamp could be the value of the client's local clock when the request is issued.

Each message sent by the replicas to the client includes the current view number, allowing the client to track the view and hence the current primary. A client sends a request to what it believes is the current primary using a point-to-point message. The primary atomically multicasts the request to all the backups using the protocol described in the next section.

A replica sends the reply to the request directly to the client. The reply has the form $\langle \text{REPLY}, v, t, c, i, r \rangle_{\sigma_i}$ where $v$ is the current view number, $t$ is the timestamp of the corresponding request, $i$ is the replica number, and $r$ is the result of executing the requested operation. The client waits for $f+1$ replies with valid signatures from different replicas, and with the same $t$ and $r$, before accepting the result $r$. This ensures that the result is valid, since at most $f$ replicas can be faulty.

If the client does not receive replies soon enough, it broadcasts the request to all replicas. If the request has already been processed, the replicas simply re-send the reply; replicas remember the last reply message they sent to each client. Otherwise, if the replica is not the primary, it relays the request to the primary. If the primary does not multicast the request to the group, it will eventually be suspected to be faulty by enough replicas to cause a view change.

In this paper we assume that the client waits for one request to complete before sending the next one. But we can allow a client to make asynchronous requests, yet preserve ordering constraints on them.
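The client-side acceptance rule can be sketched as follows. This is an illustration only: the `Reply` record and `accepted_result` function are hypothetical names, and signature verification is assumed to have happened before a reply reaches this function.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional, Set


class Reply(NamedTuple):
    """Fields of a REPLY message (signature assumed already verified)."""
    view: int
    timestamp: int
    client: str
    replica: int
    result: bytes


def accepted_result(replies: Iterable[Reply], timestamp: int, f: int) -> Optional[bytes]:
    """Return a result once f + 1 distinct replicas agree on it, else None."""
    voters: Dict[bytes, Set[int]] = defaultdict(set)
    for reply in replies:
        if reply.timestamp != timestamp:
            continue  # reply belongs to some other request
        # A set of replica ids ensures repeated replies from one replica count once.
        voters[reply.result].add(reply.replica)
        if len(voters[reply.result]) >= f + 1:
            return reply.result
    return None
```

Because at most $f$ replicas are faulty, any result backed by $f+1$ distinct replicas was reported by at least one non-faulty replica.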

4.2 Normal-Case Operation

The state of each replica includes the state of the service, a message log containing messages the replica has accepted, and an integer denoting the replica's current view. We describe how to truncate the log in Section 4.3.

When the primary, $p$, receives a client request, $m$, it starts a three-phase protocol to atomically multicast the request to the replicas. The primary starts the protocol immediately unless the number of messages for which the protocol is in progress exceeds a given maximum. In this case, it buffers the request. Buffered requests are multicast later as a group to cut down on message traffic and CPU overheads under heavy load; this optimization is similar to a group commit in transactional systems [11]. For simplicity, we ignore this optimization in the description below.

The three phases are pre-prepare, prepare, and commit. The pre-prepare and prepare phases are used to totally order requests sent in the same view even when the primary, which proposes the ordering of requests, is faulty. The prepare and commit phases are used to ensure that requests that commit are totally ordered across views.

In the pre-prepare phase, the primary assigns a sequence number, $n$, to the request, multicasts a pre-prepare message with $m$ piggybacked to all the backups, and appends the message to its log. The message has the form $\langle\langle \text{PRE-PREPARE}, v, n, d \rangle_{\sigma_p}, m \rangle$, where $v$ indicates the view in which the message is being sent, $m$ is the client's request message, and $d$ is $m$'s digest.

Requests are not included in pre-prepare messages to keep them small. This is important because pre-prepare messages are used as a proof that the request was assigned sequence number $n$ in view $v$ in view changes. Additionally, it decouples the protocol to totally order requests from the protocol to transmit the request to the replicas, allowing us to use a transport optimized for small messages for protocol messages and a transport optimized for large messages for large requests.

A backup accepts a pre-prepare message provided: the signatures in the request and the pre-prepare message are correct and $d$ is the digest for $m$; it is in view $v$; it has not accepted a pre-prepare message for view $v$ and sequence number $n$ containing a different digest; the sequence number in the pre-prepare message is between a low water mark, $h$, and a high water mark, $H$. The last condition prevents a faulty primary from exhausting the space of sequence numbers by selecting a very large one. We discuss how $H$ and $h$ advance in Section 4.3.

If backup $i$ accepts the $\langle\langle \text{PRE-PREPARE}, v, n, d \rangle_{\sigma_p}, m \rangle$ message, it enters the prepare phase by multicasting a $\langle \text{PREPARE}, v, n, d, i \rangle_{\sigma_i}$ message to all other replicas and adds both messages to its log. Otherwise, it does nothing.

A replica (including the primary) accepts prepare messages and adds them to its log provided their signatures are correct, their view number equals the replica's current view, and their sequence number is between $h$ and $H$.

We define the predicate $\mathit{prepared}(m, v, n, i)$ to be true if and only if replica $i$ has inserted in its log: the request $m$, a pre-prepare for $m$ in view $v$ with sequence number $n$, and $2f$ prepares from different backups that match the pre-prepare. The replicas verify whether the prepares match the pre-prepare by checking that they have the same view, sequence number, and digest.

The pre-prepare and prepare phases of the algorithm guarantee that non-faulty replicas agree on a total order for the requests within a view. More precisely, they ensure the following invariant: if $\mathit{prepared}(m, v, n, i)$ is true then $\mathit{prepared}(m', v, n, j)$ is false for any non-faulty replica $j$ (including $i = j$) and any $m'$ such that $D(m') \neq D(m)$. This is true because $\mathit{prepared}(m, v, n, i)$ and $|\mathcal{R}| = 3f+1$ imply that at least $f+1$ non-faulty replicas have sent a pre-prepare or prepare for $m$ in view $v$ with sequence number $n$. Thus, for $\mathit{prepared}(m', v, n, j)$ to be true at least one of these replicas needs to have sent two conflicting prepares (or pre-prepares if it is the primary for $v$), i.e., two prepares with the same view and sequence number and a different digest. But this is not possible because the replica is not faulty. Finally, our assumption about the strength of message digests ensures that the probability that $m \neq m'$ and $D(m) = D(m')$ is negligible.

Replica $i$ multicasts a $\langle \text{COMMIT}, v, n, D(m), i \rangle_{\sigma_i}$ to the other replicas when $\mathit{prepared}(m, v, n, i)$ becomes true. This starts the commit phase. Replicas accept commit messages and insert them in their log provided they are properly signed, the view number in the message is equal to the replica's current view, and the sequence number is between $h$ and $H$.

We define the committed and committed-local predicates as follows: $\mathit{committed}(m, v, n)$ is true if and only if $\mathit{prepared}(m, v, n, i)$ is true for all $i$ in some set of $f+1$ non-faulty replicas; and $\mathit{committed\text{-}local}(m, v, n, i)$ is true if and only if $\mathit{prepared}(m, v, n, i)$ is true and $i$ has accepted $2f+1$ commits (possibly including its own) from different replicas that match the pre-prepare for $m$; a commit matches a pre-prepare if they have the same view, sequence number, and digest.

The commit phase ensures the following invariant: if $\mathit{committed\text{-}local}(m, v, n, i)$ is true for some non-faulty $i$ then $\mathit{committed}(m, v, n)$ is true. This invariant and the view-change protocol described in Section 4.4 ensure that non-faulty replicas agree on the sequence numbers of requests that commit locally even if they commit in different views at each replica. Furthermore, it ensures that any request that commits locally at a non-faulty replica will commit at $f+1$ or more non-faulty replicas eventually.

Each replica $i$ executes the operation requested by $m$ after $\mathit{committed\text{-}local}(m, v, n, i)$ is true and $i$'s state reflects the sequential execution of all requests with lower sequence numbers. This ensures that all non-faulty replicas execute requests in the same order as required to provide the safety property. After executing the requested operation, replicas send a reply to the client. Replicas discard requests whose timestamp is lower than the timestamp in the last reply they sent to the client to guarantee exactly-once semantics.

We do not rely on ordered message delivery, and therefore it is possible for a replica to commit requests out of order. This does not matter since it keeps the pre-prepare, prepare, and commit messages logged until the corresponding request can be executed.

Figure 1 shows the operation of the algorithm in the normal case of no primary faults. Replica 0 is the primary, replica 3 is faulty, and $C$ is the client.
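The acceptance conditions and the two local predicates of Section 4.2 can be sketched over a simplified message log. Everything here is illustrative: the `ReplicaLog` class, its field layout, and the use of digests as stand-ins for requests are assumptions, messages are assumed to be signature-checked before they are logged, and the water-mark test assumes the bounds $h < n \le H$, which the text does not spell out.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class ReplicaLog:
    """Simplified per-replica log, keyed by view v, sequence number n, digest d."""
    f: int
    low: int = 0     # low water mark h (assumed exclusive)
    high: int = 200  # high water mark H (assumed inclusive)
    requests: Set[str] = field(default_factory=set)  # digests of logged requests
    pre_prepares: Dict[Tuple[int, int], str] = field(default_factory=dict)
    prepares: Dict[Tuple[int, int, str], Set[int]] = field(default_factory=dict)
    commits: Dict[Tuple[int, int, str], Set[int]] = field(default_factory=dict)

    def accepts_pre_prepare(self, current_view: int, v: int, n: int, d: str) -> bool:
        """Backup-side checks: same view, no conflicting digest, n within water marks."""
        logged = self.pre_prepares.get((v, n))
        return (v == current_view
                and (logged is None or logged == d)
                and self.low < n <= self.high)

    def prepared(self, d: str, v: int, n: int) -> bool:
        """Request, matching pre-prepare, and 2f prepares from different backups."""
        return (d in self.requests
                and self.pre_prepares.get((v, n)) == d
                and len(self.prepares.get((v, n, d), set())) >= 2 * self.f)

    def committed_local(self, d: str, v: int, n: int) -> bool:
        """prepared holds and 2f + 1 matching commits from different replicas."""
        return (self.prepared(d, v, n)
                and len(self.commits.get((v, n, d), set())) >= 2 * self.f + 1)
```

Storing sender ids in sets makes "from different replicas" automatic: a faulty replica that repeats a message is counted once.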

[Figure 1: Normal Case Operation. The diagram shows client C and replicas 0, 1, 2, and 3 (replica 3 marked faulty) exchanging messages in the request, pre-prepare, prepare, commit, and reply phases.]

4.3 Garbage Collection

This section discusses the mechanism used to discard messages from the log. For the safety condition to hold, messages must be kept in a replica's log until it knows that the requests they concern have been executed by at least $f+1$ non-faulty replicas and it can prove this to others in view changes. In addition, if some replica misses messages that were discarded by all non-faulty replicas, it will need to be brought up to date by transferring all or a portion of the service state. Therefore, replicas also need some proof that the state is correct.

Generating these proofs after executing every operation would be expensive. Instead, they are generated periodically, when a request with a sequence number divisible by some constant (e.g., 100) is executed. We will refer to the states produced by the execution of these requests as checkpoints and we will say that a checkpoint with a proof is a stable checkpoint.

A replica maintains several logical copies of the service state: the last stable checkpoint, zero or more checkpoints that are not stable, and a current state. Copy-on-write techniques can be used to reduce the space overhead to store the extra copies of the state, as discussed in Section 6.3.

The proof of correctness for a checkpoint is generated as follows. When a replica $i$ produces a checkpoint, it multicasts a message $\langle \text{CHECKPOINT}, n, d, i \rangle_{\sigma_i}$ to the other replicas, where $n$ is the sequence number of the last request whose execution is reflected in the state and $d$ is the digest of the state. Each replica collects checkpoint messages in its log until it has $2f+1$ of them for sequence number $n$ with the same digest $d$ signed by different replicas (including possibly its own such message). These $2f+1$ messages are the proof of correctness for the checkpoint.

A checkpoint with a proof becomes stable and the replica discards all pre-prepare, prepare, and commit messages with sequence number less than or equal to $n$ from its log; it also discards all earlier checkpoints and checkpoint messages.

Computing the proofs is efficient because the digest can be computed using incremental cryptography [1] as discussed in Section 6.3, and proofs are generated rarely.

The checkpoint protocol is used to advance the low and high water marks (which limit what messages will be accepted). The low water mark $h$ is equal to the sequence number of the last stable checkpoint. The high water mark is $H = h + k$, where $k$ is big enough so that replicas do not stall waiting for a checkpoint to become stable. For example, if checkpoints are taken every 100 requests, $k$ might be 200.

4.4 View Changes

The view-change protocol provides liveness by allowing the system to make progress when the primary fails. View changes are triggered by timeouts that prevent backups from waiting indefinitely for requests to execute. A backup is waiting for a request if it received a valid request and has not executed it. A backup starts a timer when it receives a request and the timer is not already running. It stops the timer when it is no longer waiting to execute the request, but restarts it if at that point it is waiting to execute some other request.

If the timer of backup $i$ expires in view $v$, the backup starts a view change to move the system to view $v+1$. It stops accepting messages (other than checkpoint, view-change, and new-view messages) and multicasts a $\langle \text{VIEW-CHANGE}, v+1, n, \mathcal{C}, \mathcal{P}, i \rangle_{\sigma_i}$ message to all replicas. Here $n$ is the sequence number of the last stable checkpoint $s$ known to $i$, $\mathcal{C}$ is a set of $2f+1$ valid checkpoint messages proving the correctness of $s$, and $\mathcal{P}$ is a set containing a set $\mathcal{P}_m$ for each request $m$ that prepared at $i$ with a sequence number higher than $n$. Each set $\mathcal{P}_m$ contains a valid pre-prepare message (without the corresponding client message) and $2f$ matching, valid prepare messages signed by different backups with the same view, sequence number, and the digest of $m$.

When the primary $p$ of view $v+1$ receives $2f$ valid view-change messages for view $v+1$ from other replicas, it multicasts a $\langle \text{NEW-VIEW}, v+1, \mathcal{V}, \mathcal{O} \rangle_{\sigma_p}$ message to all other replicas, where $\mathcal{V}$ is a set containing the valid view-change messages received by the primary plus the view-change message for $v+1$ the primary sent (or would have sent), and $\mathcal{O}$ is a set of pre-prepare messages (without the piggybacked request). $\mathcal{O}$ is computed as follows:

1. The primary determines the sequence number min-s of the latest stable checkpoint in $\mathcal{V}$ and the highest sequence number max-s in a prepare message in $\mathcal{V}$.
2. The primary creates a new pre-prepare message for view $v+1$ for each sequence number $n$ between min-s and max-s. There are two cases: (1) there is at least one set in the $\mathcal{P}$ component of some view-change message in $\mathcal{V}$ with sequence number $n$, or (2) there is no such set. In the first case, the primary creates a new message $\langle \text{PRE-PREPARE}, v+1, n, d \rangle_{\sigma_p}$, where $d$ is the request digest in the pre-prepare message for sequence number $n$ with the highest view number in $\mathcal{V}$. In the second case, it creates a new pre-prepare message $\langle \text{PRE-PREPARE}, v+1, n, d^{\mathit{null}} \rangle_{\sigma_p}$, where $d^{\mathit{null}}$ is the digest of a special null request; a null request goes through the protocol like other requests, but its execution is a no-op. (Paxos [18] used a similar technique to fill in gaps.)
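The two-step computation of the pre-prepare set for a new view can be sketched as below. This is a rough illustration under stated assumptions: each view-change message is reduced to a dictionary with a hypothetical `stable_seq` entry and a `prepared` list of `(view, seq, digest)` triples, signatures and proofs are assumed already validated, and "between min-s and max-s" is read as the range from min-s + 1 to max-s inclusive, since requests up to min-s are covered by the stable checkpoint.

```python
from typing import Dict, List, Tuple

NULL_DIGEST = "null"  # placeholder for the digest of the special null request

ViewChange = Dict[str, object]  # {"stable_seq": int, "prepared": [(view, seq, digest), ...]}


def new_view_pre_prepares(new_view: int, view_changes: List[ViewChange]) -> List[Tuple[int, int, str]]:
    """Return (view, seq, digest) triples standing in for the new pre-prepares."""
    # Step 1: latest stable checkpoint and highest prepared sequence number.
    min_s = max(vc["stable_seq"] for vc in view_changes)
    seqs = [n for vc in view_changes for (_, n, _) in vc["prepared"]]
    max_s = max(seqs, default=min_s)

    # For each sequence number keep the digest from the highest view.
    best: Dict[int, Tuple[int, str]] = {}
    for vc in view_changes:
        for view, n, digest in vc["prepared"]:
            if n not in best or view > best[n][0]:
                best[n] = (view, digest)

    # Step 2: one pre-prepare per sequence number, null request for gaps.
    return [
        (new_view, n, best[n][1] if n in best else NULL_DIGEST)
        for n in range(min_s + 1, max_s + 1)
    ]
```

A backup can validate a new-view message by running the same computation on the view-change messages it contains and comparing the outcome.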

Next the primary appends the messages in $\mathcal{O}$ to its log. If min-s is greater than the sequence number of its latest stable checkpoint, the primary also inserts the proof of stability for the checkpoint with sequence number min-s in its log, and discards information from the log as discussed in Section 4.3. Then it enters view $v+1$: at this point it is able to accept messages for view $v+1$.

A backup accepts a new-view message for view $v+1$ if it is signed properly, if the view-change messages it contains are valid for view $v+1$, and if the set $\mathcal{O}$ is correct; it verifies the correctness of $\mathcal{O}$ by performing a computation similar to the one used by the primary to create $\mathcal{O}$. Then it adds the new information to its log as described for the primary, multicasts a prepare for each message in $\mathcal{O}$ to all the other replicas, adds these prepares to its log, and enters view $v+1$.

Thereafter, the protocol proceeds as described in Section 4.2. Replicas redo the protocol for messages between min-s and max-s but they avoid re-executing client requests (by using their stored information about the last reply sent to each client).

A replica may be missing some request message $m$ or a stable checkpoint (since these are not sent in new-view messages). It can obtain missing information from another replica. For example, replica $i$ can obtain a missing checkpoint state $s$ from one of the replicas whose checkpoint messages certified its correctness in $\mathcal{V}$. Since $f+1$ of those replicas are correct, replica $i$ will always obtain $s$ or a later certified stable checkpoint. We can avoid sending the entire checkpoint by partitioning the state and stamping each partition with the sequence number of the last request that modified it. To bring a replica up to date, it is only necessary to send it the partitions where it is out of date, rather than the whole checkpoint.

4.5 Correctness

This section sketches the proof that the algorithm provides safety and liveness; details can be found in [4].

4.5.1 Safety

As discussed earlier, the algorithm provides safety if all non-faulty replicas agree on the sequence numbers of requests that commit locally.

In Section 4.2, we showed that if $\mathit{prepared}(m, v, n, i)$ is true, $\mathit{prepared}(m', v, n, j)$ is false for any non-faulty replica $j$ (including $i = j$) and any $m'$ such that $D(m') \neq D(m)$. This implies that two non-faulty replicas agree on the sequence number of requests that commit locally in the same view at the two replicas.

The view-change protocol ensures that non-faulty replicas also agree on the sequence number of requests that commit locally in different views at different replicas. A request $m$ commits locally at a non-faulty replica with sequence number $n$ in view $v$ only if $\mathit{committed}(m, v, n)$ is true. This means that there is a set $R_1$ containing at least $f+1$ non-faulty replicas such that $\mathit{prepared}(m, v, n, i)$ is true for every replica $i$ in the set.

Non-faulty replicas will not accept a pre-prepare for view $v' > v$ without having received a new-view message for $v'$ (since only at that point do they enter the view). But any correct new-view message for view $v' > v$ contains correct view-change messages from every replica $i$ in a set $R_2$ of $2f+1$ replicas. Since there are $3f+1$ replicas, $R_1$ and $R_2$ must intersect in at least one replica $k$ that is not faulty. $k$'s view-change message will ensure that the fact that $m$ prepared in a previous view is propagated to subsequent views, unless the new-view message contains a view-change message with a stable checkpoint with a sequence number higher than $n$. In the first case, the algorithm redoes the three phases of the atomic multicast protocol for $m$ with the same sequence number $n$ and the new view number. This is important because it prevents any different request that was assigned the sequence number $n$ in a previous view from ever committing. In the second case no replica in the new view will accept any message with sequence number lower than $n$. In either case, the replicas will agree on the request that commits locally with sequence number $n$.
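The intersection step in the safety argument follows from counting. Writing $\mathcal{R}$ for the full replica set, with $|R_1| \ge f+1$, $|R_2| = 2f+1$, and $|\mathcal{R}| = 3f+1$, a short worked version (a sketch, not taken from the formal proof in [4]) is:

```latex
\[
|R_1 \cap R_2| \;\ge\; |R_1| + |R_2| - |\mathcal{R}|
\;\ge\; (f+1) + (2f+1) - (3f+1) \;=\; 1 .
\]
```

Every member of $R_1$ is non-faulty by construction, so any replica in the intersection is non-faulty as well.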

4.5.2 Liveness

To provide liveness, replicas must move to a new view if they are unable to execute a request. But it is important to maximize the period of time when at least $2f+1$ non-faulty replicas are in the same view, and to ensure that this period of time increases exponentially until some requested operation executes. We achieve these goals by three means.

First, to avoid starting a view change too soon, a replica that multicasts a view-change message for view $v+1$ waits for $2f+1$ view-change messages for view $v+1$ and then starts its timer to expire after some time $T$. If the timer expires before it receives a valid new-view message for $v+1$ or before it executes a request in the new view that it had not executed previously, it starts the view change for view $v+2$ but this time it will wait $2T$ before starting a view change for view $v+3$.

Second, if a replica receives a set of $f+1$ valid view-change messages from other replicas for views greater than its current view, it sends a view-change message for the smallest view in the set, even if its timer has not expired; this prevents it from starting the next view change too late.

Third, faulty replicas are unable to impede progress by forcing frequent view changes. A faulty replica cannot cause a view change by sending a view-change message, because a view change will happen only if at least $f+1$ replicas send view-change messages, but it can cause a view change when it is the primary (by not sending messages or sending bad messages). However, because the primary of view $v$ is the replica $p$ such that $p = v \bmod |\mathcal{R}|$, the primary cannot be faulty for more than $f$ consecutive views.

These three techniques guarantee liveness unless message delays grow faster than the timeout period indefinitely, which is unlikely in a real system.

4.6 Non-Determinism

State machine replicas must be deterministic but many services involve some form of non-determinism. For example, the time-last-modified in NFS is set by reading the server's local clock; if this were done independently at each replica, the states of non-faulty replicas would diverge. Therefore, some mechanism to ensure that all replicas select the same value is needed. In general, the client cannot select the value because it does not have enough information; for example, it does not know how its request will be ordered relative to concurrent requests by other clients. Instead, the primary needs to select the value either independently or based on values provided by the backups.

If the primary selects the non-deterministic value independently, it concatenates the value with the associated request and executes the three phase protocol to ensure that non-faulty replicas agree on a sequence number for the request and value. This prevents a faulty primary from causing replica state to diverge by sending different values to different replicas. However, a faulty primary might send the same, incorrect, value to all replicas. Therefore, replicas must be able to decide deterministically whether the value is correct (and what to do if it is not) based only on the service state.

This protocol is adequate for most services (including NFS) but occasionally replicas must participate in selecting the value to satisfy a service's specification. This can be accomplished by adding an extra phase to the protocol: the primary obtains authenticated values proposed by the backups, concatenates $2f+1$ of them with the associated request, and starts the three phase protocol for the concatenated message. Replicas choose the value by a deterministic computation on the $2f+1$ values and their state, e.g., taking the median. The extra phase can be optimized away in the common case. For example, if replicas need a value that is "close enough" to that of their local clock, the extra phase can be avoided when their clocks are synchronized within some delta.
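The median rule mentioned in Section 4.6 can be sketched in a few lines. The function name is illustrative, the proposals are assumed to be already authenticated, and the sketch ignores the replica state that the text says may also feed into the computation.

```python
from typing import Sequence


def choose_value(proposals: Sequence[int], f: int) -> int:
    """Pick the median of exactly 2f + 1 proposed values.

    With at most f values coming from faulty replicas, the median is bounded
    below and above by values proposed by non-faulty replicas.
    """
    if len(proposals) != 2 * f + 1:
        raise ValueError("expected exactly 2f + 1 proposals")
    return sorted(proposals)[f]
```

Sorting makes the result independent of the order in which the primary concatenated the proposals, so every replica that sees the same $2f+1$ values selects the same one.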

5 Optimizations

This section describes some optimizations that improve the performance of the algorithm during normal-case operation. All the optimizations preserve the liveness and safety properties.

5.1 Reducing Communication

We use three optimizations to reduce the cost of communication. The first avoids sending most large replies. A client request designates a replica to send the result; all other replicas send replies containing just the digest of the result. The digests allow the client to check the correctness of the result while reducing network bandwidth consumption and CPU overhead significantly for large replies. If the client does not receive a correct result from the designated replica, it retransmits the request as usual, requesting all replicas to send full replies.

The second optimization reduces the number of message delays for an operation invocation from 5 to 4. Replicas execute a request tentatively as soon as the prepared predicate holds for the request, their state reflects the execution of all requests with lower sequence number, and these requests are all known to have committed. After executing the request, the replicas send tentative replies to the client. The client waits for $2f+1$ matching tentative replies. If it receives this many, the request is guaranteed to commit eventually. Otherwise, the client retransmits the request and waits for $f+1$ non-tentative replies.

A request that has executed tentatively may abort if there is a view change and it is replaced by a null request. In this case the replica reverts its state to the last stable checkpoint in the new-view message or to its last checkpointed state (depending on which one has the higher sequence number).

The third optimization improves the performance of read-only operations that do not modify the service state. A client multicasts a read-only request to all replicas. Replicas execute the request immediately in their tentative state after checking that the request is properly authenticated, that the client has access, and that the request is in fact read-only. They send the reply only after all requests reflected in the tentative state have committed; this is necessary to prevent the client from observing uncommitted state. The client waits for $2f+1$ replies from different replicas with the same result. The client may be unable to collect $2f+1$ such replies if there are concurrent writes to data that affect the result; in this case, it retransmits the request as a regular read-write request after its retransmission timer expires.

5.2 Cryptography

In Section 4, we described an algorithm that uses digital signatures to authenticate all messages. However, we actually use digital signatures only for view-change and new-view messages, which are sent rarely, and authenticate all other messages using message authentication codes (MACs). This eliminates the main performance bottleneck in previous systems [29, 22].

However, MACs have a fundamental limitation relative to digital signatures: the inability to prove that a message is authentic to a third party. The algorithm in Section 4 and previous Byzantine-fault-tolerant algorithms [31, 16] for state machine replication rely on the extra power of digital signatures. We modified our algorithm to circumvent the problem by taking advantage of specific invariants, e.g., the invariant that no two different requests prepare with the same view and sequence number at two non-faulty replicas. The modified algorithm is described in [5]. Here we sketch the main implications of using MACs.

MACs can be computed three orders of magnitude faster than digital signatures. For example, a 200MHz Pentium Pro takes 43ms to generate a 1024-bit modulus RSA signature of an MD5 digest and 0.6ms to verify the signature [37], whereas it takes only 10.3 µs to compute the MAC of a 64-byte message on the same hardware in our implementation. There are other public-key cryptosystems that generate signatures faster, e.g., elliptic curve public-key cryptosystems, but signature verification is slower [37] and in our algorithm each signature is verified many times.

Each node (including active clients) shares a 16-byte secret session key with each replica. We compute message authentication codes by applying MD5 to the concatenation of the message with the secret key. Rather than using the 16 bytes of the final MD5 digest, we use only the 10 least significant bytes. This truncation has the obvious advantage of reducing the size of MACs and it also improves their resilience to certain attacks [27]. This is a variant of the secret suffix method [36], which is secure as long as MD5 is collision resistant [27, 8].

The digital signature in a reply message is replaced by a single MAC, which is sufficient because these messages have a single intended recipient. The signatures in all other messages (including client requests but excluding view changes) are replaced by vectors of MACs that we call authenticators. An authenticator has an entry for every replica other than the sender; each entry is the MAC computed with the key shared by the sender and the replica corresponding to the entry.

The time to verify an authenticator is constant but the time to generate one grows linearly with the number of replicas. This is not a problem because we do not expect to have a large number of replicas and there is a huge performance gap between MAC and digital signature computation. Furthermore, we compute authenticators efficiently; MD5 is applied to the message once and the resulting context is used to compute each vector entry by applying MD5 to the corresponding session key. For example, in a system with 37 replicas (i.e., a system that can tolerate 12 simultaneous faults) an authenticator can still be computed much more than two orders of magnitude faster than a 1024-bit modulus RSA signature.

The size of authenticators grows linearly with the number of replicas but it grows slowly: it is equal to $30 \times \lfloor (n-1)/3 \rfloor$ bytes. An authenticator is smaller than an RSA signature with a 1024-bit modulus for $n \le 13$ (i.e., systems that can tolerate up to 4 simultaneous faults), which we expect to be true in most configurations.
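The authenticator construction of Section 5.2 can be sketched with Python's standard `hashlib`. This is an illustration, not the authors' code: the function names are hypothetical, and "the 10 least significant bytes" is assumed here to mean the last 10 bytes of the 16-byte MD5 digest. MD5 is used only to mirror the text; it is no longer considered collision resistant.

```python
import hashlib
import hmac
from typing import Dict

MAC_BYTES = 10  # truncated MAC length described in the text


def make_authenticator(message: bytes, session_keys: Dict[int, bytes]) -> Dict[int, bytes]:
    """Build one truncated MAC per receiving replica.

    MD5 is applied to the message once; each entry then reuses a copy of that
    hash context and appends the session key shared with one replica.
    """
    base = hashlib.md5(message)
    authenticator = {}
    for replica, key in session_keys.items():
        ctx = base.copy()        # reuse the work already done on the message
        ctx.update(key)          # secret-suffix style: MD5(message || key)
        authenticator[replica] = ctx.digest()[-MAC_BYTES:]
    return authenticator


def verify_entry(message: bytes, key: bytes, tag: bytes) -> bool:
    """A receiver recomputes only its own entry, so verification is constant time in n."""
    expected = hashlib.md5(message + key).digest()[-MAC_BYTES:]
    return hmac.compare_digest(expected, tag)
```

With $n = 4$ the sender shares keys with three other replicas, so the authenticator occupies 30 bytes, matching the size formula above for $f = 1$.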


6 Implementation

This section describes our implementation. First we discuss the replication library, which can be used as a basis for any replicated service. In Section 6.2 we describe how we implemented a replicated NFS on top of the replication library. Then we describe how we maintain checkpoints and compute checkpoint digests efficiently.

6.1 The Replication Library

The client interface to the replication library consists of a single procedure, invoke, with one argument, an input buffer containing a request to invoke a state machine operation. The invoke procedure uses our protocol to execute the requested operation at the replicas and select the correct reply from among the replies of the individual replicas. It returns a pointer to a buffer containing the operation result.

On the server side, the replication code makes a number of upcalls to procedures that the server part of the application must implement. There are procedures to execute requests (execute), to maintain checkpoints of the service state (make_checkpoint, delete_checkpoint), to obtain the digest of a specified checkpoint (get_digest), and to obtain missing information (get_checkpoint, set_checkpoint). The execute procedure receives as input a buffer containing the requested operation, executes the operation, and places the result in an output buffer. The other procedures are discussed further in Sections 6.3 and 6.4.

Point-to-point communication between nodes is implemented using UDP, and multicast to the group of replicas is implemented using UDP over IP multicast [7]. There is a single IP multicast group for each service, which contains all the replicas. These communication protocols are unreliable; they may duplicate or lose messages or deliver them out of order.

The algorithm tolerates out-of-order delivery and rejects duplicates. View changes can be used to recover from lost messages, but this is expensive and therefore it is important to perform retransmissions. During normal operation recovery from lost messages is driven by the receiver: backups send negative acknowledgments to the primary when they are out of date and the primary retransmits pre-prepare messages after a long timeout. A reply to a negative acknowledgment may include both a portion of a stable checkpoint and missing messages. During view changes, replicas retransmit view-change messages until they receive a matching new-view message or they move on to a later view.

The replication library does not implement view changes or retransmissions at present. This does not compromise the accuracy of the results given in Section 7 because the rest of the algorithm is completely implemented (including the manipulation of the timers that trigger view changes) and because we have formalized the complete algorithm and proved its correctness [4].

6.2 BFS: A Byzantine-Fault-tolerant File System

We implemented BFS, a Byzantine-fault-tolerant NFS service, using the replication library. Figure 2 shows the architecture of BFS. We opted not to modify the kernel NFS client and server because we did not have the sources for the Digital Unix kernel.

A file system exported by the fault-tolerant NFS service is mounted on the client machine like any regular NFS file system. Application processes run unmodified and interact with the mounted file system through the NFS client in the kernel. We rely on user-level relay processes to mediate communication between the standard NFS client and the replicas. A relay receives NFS protocol requests, calls the invoke procedure of our replication library, and sends the result back to the NFS client.

[Figure 2 diagram: a client machine running the Andrew benchmark, the kernel NFS client, and a relay linked with the replication library, connected to replicas 0 through n; each replica runs snfsd with the replication library on top of the kernel VM.]

Figure 2: Replicated File System Architecture. Each replica runs a user-level process with the replication library and our NFS V2 daemon, which we will refer to as snfsd (for simple nfsd). The replication library receives requests from the relay, interacts with snfsd by making upcalls, and packages NFS replies into replication protocol replies that it sends to the relay. We implemented snfsd using a fixed-size memorymapped file. All the file system data structures, e.g., inodes, blocks and their free lists, are in the mapped file. We rely on the operating system to manage the cache of memory-mapped file pages and to write modified pages to disk asynchronously. The current implementation uses 8KB blocks and inodes contain the NFS status information plus 256 bytes of data, which is used to store directory entries in directories, pointers to blocks in files, and text in symbolic links. Directories and files may also use indirect blocks in a way similar to Unix. Our implementation ensures that all state machine

9

replicas start in the same initial state and are deterministic, which are necessary conditions for the correctness of a service implemented using our protocol. The primary proposes the values for time-last-modified and timelast-accessed, and replicas select the larger of the proposed value and one greater than the maximum of all values selected for earlier requests. We do not require synchronous writes to implement NFS V2 protocol semantics because BFS achieves stability of modified data and meta-data through replication [20].
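The timestamp rule above can be illustrated with a minimal sketch. The class and field names are hypothetical and timestamps are modeled as plain integers; this is not the snfsd code, only one way to read the rule "larger of the proposed value and one greater than the maximum selected so far".

```python
class TimestampSelector:
    """Deterministic choice of time-last-modified / time-last-accessed values.

    Hypothetical helper: every replica runs the same rule on the same
    sequence of proposals, so all replicas select the same values.
    """

    def __init__(self, initial: int = 0) -> None:
        # Largest value selected for any earlier request.
        self.max_selected = initial

    def select(self, proposed: int) -> int:
        # Larger of the primary's proposal and (previous maximum + 1).
        chosen = max(proposed, self.max_selected + 1)
        self.max_selected = chosen
        return chosen


selector = TimestampSelector()
selected = [selector.select(p) for p in (100, 90, 101, 500)]
print(selected)
```

Because the selected value never depends on a replica's local clock, a primary that proposes a stale or repeated time cannot make timestamps go backwards, and the values stay strictly increasing.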

6.3 Maintaining Checkpoints

This section describes how snfsd maintains checkpoints of the file system state. Recall that each replica maintains several logical copies of the state: the current state, some number of checkpoints that are not yet stable, and the last stable checkpoint.

snfsd executes file system operations directly in the memory mapped file to preserve locality, and it uses copy-on-write to reduce the space and time overhead associated with maintaining checkpoints. snfsd maintains a copy-on-write bit for every 512-byte block in the memory mapped file. When the replication code invokes the make_checkpoint upcall, snfsd sets all the copy-on-write bits and creates a (volatile) checkpoint record, containing the current sequence number, which it receives as an argument to the upcall, and a list of blocks. This list contains the copies of the blocks that were modified since the checkpoint was taken, and therefore, it is initially empty. The record also contains the digest of the current state; we discuss how the digest is computed in Section 6.4.

When a block of the memory mapped file is modified while executing a client request, snfsd checks the copy-on-write bit for the block and, if it is set, stores the block’s current contents and its identifier in the checkpoint record for the last checkpoint. Then, it overwrites the block with its new value and resets its copy-on-write bit. snfsd retains a checkpoint record until told to discard it via a delete_checkpoint upcall, which is made by the replication code when a later checkpoint becomes stable.

If the replication code requires a checkpoint to send to another replica, it calls the get_checkpoint upcall. To obtain the value for a block, snfsd first searches for the block in the checkpoint record of the stable checkpoint, and then searches the checkpoint records of any later checkpoints. If the block is not in any checkpoint record, it returns the value from the current state.

The use of the copy-on-write technique and the fact that we keep at most 2 checkpoints ensure that the space and time overheads of keeping several logical copies of the state are low. For example, in the Andrew benchmark experiments described in Section 7, the average checkpoint record size is only 182 blocks with a maximum of 500.

6.4 Computing Checkpoint Digests

snfsd computes a digest of a checkpoint state as part of a make_checkpoint upcall. Although checkpoints are only taken occasionally, it is important to compute the state digest incrementally because the state may be large. snfsd uses an incremental collision-resistant one-way hash function called AdHash [1]. This function divides the state into fixed-size blocks and uses some other hash function (e.g., MD5) to compute the digest of the string obtained by concatenating the block index with the block value for each block. The digest of the state is the sum of the digests of the blocks modulo some large integer. In our current implementation, we use the 512-byte blocks from the copy-on-write technique and compute their digest using MD5.

To compute the digest for the state incrementally, snfsd maintains a table with a hash value for each 512-byte block. This hash value is obtained by applying MD5 to the block index concatenated with the block value at the time of the last checkpoint. When make_checkpoint is called, snfsd obtains the digest for the previous checkpoint state (from the associated checkpoint record). It computes new hash values for each block whose copy-on-write bit is reset by applying MD5 to the block index concatenated with the current block value. Then, it adds the new hash value to the digest, subtracts the old hash value from the digest, and updates the table to contain the new hash value. This process is efficient provided the number of modified blocks is small; as mentioned above, on average 182 blocks are modified per checkpoint for the Andrew benchmark.
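The copy-on-write checkpoint records and the incremental AdHash-style digest described in Sections 6.3 and 6.4 can be sketched together as follows. This is an illustrative model under stated assumptions, not the snfsd implementation: the class name, the method names other than make_checkpoint, the 8-byte index encoding, and the modulus 2**256 are all hypothetical (the text only says "some large integer"). MD5 appears only because the text names it; it is no longer considered collision resistant.

```python
import hashlib

BLOCK_SIZE = 512      # block size stated in the text
MODULUS = 2 ** 256    # assumption: stands in for "some large integer"


def block_hash(index: int, data: bytes) -> int:
    """MD5 of the block index concatenated with the block value, as an integer."""
    digest = hashlib.md5(index.to_bytes(8, "big") + data).digest()
    return int.from_bytes(digest, "big")


class CheckpointedState:
    def __init__(self, num_blocks: int) -> None:
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]
        # Table of per-block hashes as of the last checkpoint.
        self.hashes = [block_hash(i, b) for i, b in enumerate(self.blocks)]
        self.digest = sum(self.hashes) % MODULUS
        self.cow = [False] * num_blocks   # copy-on-write bits
        self.records = []                 # checkpoint records, oldest first

    def make_checkpoint(self, seq: int) -> int:
        # Re-hash only blocks whose copy-on-write bit is reset, i.e. blocks
        # written since the previous checkpoint (all blocks the first time).
        for i, bit in enumerate(self.cow):
            if not bit:
                new = block_hash(i, self.blocks[i])
                self.digest = (self.digest + new - self.hashes[i]) % MODULUS
                self.hashes[i] = new
        self.cow = [True] * len(self.blocks)
        self.records.append({"seq": seq, "digest": self.digest, "saved": {}})
        return self.digest

    def write_block(self, i: int, data: bytes) -> None:
        if self.cow[i]:
            # First write since the last checkpoint: keep the old contents
            # in that checkpoint's record, then reset the bit.
            self.records[-1]["saved"][i] = self.blocks[i]
            self.cow[i] = False
        self.blocks[i] = data

    def get_checkpoint_block(self, seq: int, i: int) -> bytes:
        # Search the record for `seq`, then any later records; fall back to
        # the current state if no record holds a saved copy.
        for rec in self.records:
            if rec["seq"] >= seq and i in rec["saved"]:
                return rec["saved"][i]
        return self.blocks[i]


state = CheckpointedState(num_blocks=4)
d0 = state.make_checkpoint(128)
state.write_block(2, b"x" * BLOCK_SIZE)
d1 = state.make_checkpoint(256)
```

In this sketch the second make_checkpoint call hashes a single block, yet the resulting digest equals the one obtained by re-hashing the whole state, which is the property that makes the incremental update worthwhile when few blocks change between checkpoints.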

7 Performance Evaluation

This section evaluates the performance of our system using two benchmarks: a micro-benchmark and the Andrew benchmark [15]. The micro-benchmark provides a service-independent evaluation of the performance of the replication library; it measures the latency to invoke a null operation, i.e., an operation that does nothing.

The Andrew benchmark is used to compare BFS with two other file systems: one is the NFS V2 implementation in Digital Unix, and the other is identical to BFS except without replication. The first comparison demonstrates that our system is practical by showing that its latency is similar to the latency of a commercial system that is used daily by many users. The second comparison allows us to evaluate the overhead of our algorithm accurately within an implementation of a real service.

7.1 Experimental Setup

The experiments measure normal-case behavior (i.e., there are no view changes), because this is the behavior that determines the performance of the system. All experiments ran with one client running two relay processes, and four replicas. Four replicas can tolerate one Byzantine fault; we expect this reliability level to suffice for most applications.

The replicas and the client ran on identical DEC 3000/400 Alpha workstations. These workstations have a 133 MHz Alpha 21064 processor, 128 MB of memory, and run Digital Unix version 4.0. The file system was stored by each replica on a DEC RZ26 disk. All the workstations were connected by a 10Mbit/s switched Ethernet and had DEC LANCE Ethernet interfaces. The switch was a DEC EtherWORKS 8T/TX. The experiments were run on an isolated network.

The interval between checkpoints was 128 requests, which causes garbage collection to occur several times in any of the experiments. The maximum sequence number accepted by replicas in pre-prepare messages was 256 plus the sequence number of the last stable checkpoint.
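The two configuration facts above (four replicas tolerate one fault; sequence numbers are accepted up to 256 past the last stable checkpoint) can be written down as a small sketch. The function and constant names are hypothetical. The bound n >= 3f + 1 is the standard requirement for this class of protocol, and the lower bound in `accepts` (strictly greater than the last stable checkpoint) is an assumption, since this section states only the upper limit.

```python
CHECKPOINT_INTERVAL = 128   # requests between checkpoints (from the text)
WINDOW = 256                # accepted range above the last stable checkpoint


def max_faults(n: int) -> int:
    """Largest f such that n >= 3f + 1 replicas are available."""
    return (n - 1) // 3


def accepts(seq: int, last_stable: int) -> bool:
    """Whether a pre-prepare sequence number falls in the accepted window.

    Upper bound from the text; lower bound is an assumption of this sketch.
    """
    return last_stable < seq <= last_stable + WINDOW


print(max_faults(4))                    # one Byzantine fault with four replicas
print(accepts(256, last_stable=0))      # highest number accepted initially
print(accepts(257, last_stable=0))      # rejected until a checkpoint is stable
```

With a window of twice the checkpoint interval, a replica can keep accepting requests while the checkpoint taken at the midpoint of the window is still becoming stable.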

7.2 Micro-Benchmark

The micro-benchmark measures the latency to invoke a null operation. It evaluates the performance of two implementations of a simple service with no state that implements null operations with arguments and results of different sizes. The first implementation is replicated using our library and the second is unreplicated and uses UDP directly. Table 1 reports the response times measured at the client for both read-only and read-write operations. They were obtained by timing 10,000 operation invocations in three separate runs and we report the median value of the three runs. The maximum deviation from the median was always below 0.3% of the reported value. We denote each operation by a/b, where a and b are the sizes of the operation argument and result in KBytes.

arg./res. (KB)   replicated read-write   replicated read-only   without replication
0/0              3.35 (309%)             1.62 (98%)             0.82
4/0              14.19 (207%)            6.98 (51%)             4.62
0/4              8.01 (72%)              5.94 (27%)             4.66

Table 1: Micro-benchmark results (in milliseconds); the percentage overhead is relative to the unreplicated case.

The overhead introduced by the replication library is due to extra computation and communication. For example, the computation overhead for the read-write 0/0 operation is approximately 1.06ms, which includes 0.55ms spent executing cryptographic operations. The remaining 1.47ms of overhead are due to extra communication; the replication library introduces an extra message round-trip, it sends larger messages, and it increases the number of messages received by each node relative to the service without replication.

The overhead for read-only operations is significantly lower because the optimization discussed in Section 5.1 reduces both computation and communication overheads. For example, the computation overhead for the read-only 0/0 operation is approximately 0.43ms, which includes 0.23ms spent executing cryptographic operations, and the communication overhead is only 0.37ms because the protocol to execute read-only operations uses a single round-trip.

Table 1 shows that the relative overhead is lower for the 4/0 and 0/4 operations. This is because a significant fraction of the overhead introduced by the replication library is independent of the size of operation arguments and results. For example, in the read-write 0/4 operation, the large message (the reply) goes over the network only once (as discussed in Section 5.1) and only the cryptographic overhead to process the reply message is increased. The overhead is higher for the read-write 4/0 operation because the large message (the request) goes over the network twice and increases the cryptographic overhead for processing both request and pre-prepare messages.

It is important to note that this micro-benchmark represents the worst case overhead for our algorithm because the operations perform no work and the unreplicated server provides very weak guarantees. Most services will require stronger guarantees, e.g., authenticated connections, and the overhead introduced by our algorithm relative to a server that implements these guarantees will be lower. For example, the overhead of the replication library relative to a version of the unreplicated service that uses MACs for authentication is only 243% for the read-write 0/0 operation and 4% for the read-only 4/0 operation.

We can estimate a rough lower bound on the performance gain afforded by our algorithm relative to Rampart [30]. Reiter reports that Rampart has a latency of 45ms for a multi-RPC of a null message in a 10 Mbit/s Ethernet network of 4 SparcStation 10s [30]. The multi-RPC is sufficient for the primary to invoke a state machine operation, but for an arbitrary client to invoke an operation it would be necessary to add an extra message delay and an extra RSA signature and verification to authenticate the client; this would lead to a latency of at least 65ms (using the RSA timings reported in [29]). Even if we divide this latency by 1.7, the ratio of the SPECint92 ratings of the DEC 3000/400 and the SparcStation 10, our algorithm still reduces the latency to invoke the read-write and read-only 0/0 operations by factors of more than 10 and 20, respectively. Note that this scaling is conservative because the network accounts for a significant fraction of Rampart’s latency [29] and Rampart’s results were obtained using 300-bit modulus RSA signatures, which are not considered secure today unless the keys used to generate them are refreshed very frequently.

There are no published performance numbers for SecureRing [16] but it would be slower than Rampart because its algorithm has more message delays and signature operations in the critical path.
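The percentage overheads in Table 1 and the Rampart comparison are simple arithmetic, and a short sketch makes the derivation explicit. The numbers are copied from the text; the helper name and variable names are hypothetical, and rounding to the nearest integer is an assumption about how the percentages were reported.

```python
def overhead_pct(replicated_ms: float, baseline_ms: float) -> int:
    """Relative overhead of the replicated service, as a rounded percentage."""
    return round(100 * (replicated_ms - baseline_ms) / baseline_ms)


# arg./res. (KB) -> (read-write ms, read-only ms, unreplicated ms), from Table 1
TABLE1 = {
    "0/0": (3.35, 1.62, 0.82),
    "4/0": (14.19, 6.98, 4.62),
    "0/4": (8.01, 5.94, 4.66),
}

overheads = {
    op: (overhead_pct(rw, base), overhead_pct(ro, base))
    for op, (rw, ro, base) in TABLE1.items()
}
print(overheads)

# Read-write 0/0 breakdown: computation plus communication overhead.
rw_extra_ms = 1.06 + 1.47

# Rampart estimate: at least 65 ms, scaled down by the SPECint92 ratio of 1.7.
rampart_scaled_ms = 65.0 / 1.7
rw_factor = rampart_scaled_ms / TABLE1["0/0"][0]
ro_factor = rampart_scaled_ms / TABLE1["0/0"][1]
print(round(rampart_scaled_ms, 1), round(rw_factor, 1), round(ro_factor, 1))
```

The computation and communication components for the read-write 0/0 case add up to the difference between the replicated and unreplicated latencies, and the scaled Rampart latency divided by the measured latencies gives the "more than 10 and 20" factors quoted in the text.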

7.3 Andrew Benchmark

The Andrew benchmark [15] emulates a software development workload. It has five phases: (1) creates subdirectories recursively; (2) copies a source tree; (3) examines the status of all the files in the tree without examining their data; (4) examines every byte of data in all the files; and (5) compiles and links the files.

We use the Andrew benchmark to compare BFS with two other file system configurations: NFS-std, which is the NFS V2 implementation in Digital Unix, and BFS-nr, which is identical to BFS but with no replication. BFS-nr ran two simple UDP relays on the client, and on the server it ran a thin veneer linked with a version of snfsd from which all the checkpoint management code was removed. This configuration does not write modified file system state to disk before replying to the client. Therefore, it does not implement NFS V2 protocol semantics, whereas both BFS and NFS-std do.

Out of the 18 operations in the NFS V2 protocol only getattr is read-only because the time-last-accessed attribute of files and directories is set by operations that would otherwise be read-only, e.g., read and lookup. The result is that our optimization for read-only operations can rarely be used. To show the impact of this optimization, we also ran the Andrew benchmark on a second version of BFS that modifies the lookup operation to be read-only. This modification violates strict Unix file system semantics but is unlikely to have adverse effects in practice.

For all configurations, the actual benchmark code ran at the client workstation using the standard NFS client implementation in the Digital Unix kernel with the same mount options. The most relevant of these options for the benchmark are: UDP transport, 4096-byte read and write buffers, allowing asynchronous client writes, and allowing attribute caching.

We report the mean of 10 runs of the benchmark for each configuration. The sample standard deviation for the total time to run the benchmark was always below 2.6% of the reported value but it was as high as 14% for the individual times of the first four phases. This high variance was also present in the NFS-std configuration. The estimated error for the reported mean was below 4.5% for the individual phases and 0.8% for the total.

phase    BFS strict     BFS r/o lookup   BFS-nr
1        0.55 (57%)     0.47 (34%)       0.35
2        9.24 (82%)     7.91 (56%)       5.08
3        7.24 (18%)     6.45 (6%)        6.11
4        8.77 (18%)     7.87 (6%)        7.41
5        38.68 (20%)    38.38 (19%)      32.12
total    64.48 (26%)    61.07 (20%)      51.07

Table 2: Andrew benchmark: BFS vs BFS-nr. The times are in seconds.

Table 2 shows the results for BFS and BFS-nr. The comparison between BFS-strict and BFS-nr shows that the overhead of Byzantine fault tolerance for this service is low: BFS-strict takes only 26% more time to run the complete benchmark. The overhead is lower than what was observed for the micro-benchmarks because the client spends a significant fraction of the elapsed time computing between operations, i.e., between receiving the reply to an operation and issuing the next request, and operations at the server perform some computation. But the overhead is not uniform across the benchmark phases. The main reason for this is a variation in the amount of time the client spends computing between operations; the first two phases have a higher relative overhead because the client spends approximately 40% of the total time computing between operations, whereas it spends approximately 70% during the last three phases.

The table shows that applying the read-only optimization to lookup improves the performance of BFS significantly and reduces the overhead relative to BFS-nr to 20%. This optimization has a significant impact in the first four phases because the time spent waiting for lookup operations to complete in BFS-strict is at least 20% of the elapsed time for these phases, whereas it is less than 5% of the elapsed time for the last phase.

phase    BFS strict     BFS r/o lookup   NFS-std
1        0.55 (-69%)    0.47 (-73%)      1.75
2        9.24 (-2%)     7.91 (-16%)      9.46
3        7.24 (35%)     6.45 (20%)       5.36
4        8.77 (32%)     7.87 (19%)       6.60
5        38.68 (-2%)    38.38 (-2%)      39.35
total    64.48 (3%)     61.07 (-2%)      62.52

Table 3: Andrew benchmark: BFS vs NFS-std. The times are in seconds.

Table 3 shows the results for BFS vs NFS-std. These results show that BFS can be used in practice: BFS-strict takes only 3% more time to run the complete benchmark. Thus, one could replace the NFS V2 implementation in Digital Unix, which is used daily by many users, by BFS without affecting the latency perceived by those users. Furthermore, BFS with the read-only optimization for the lookup operation is actually 2% faster than NFS-std.

The overhead of BFS relative to NFS-std is not the same for all phases. Both versions of BFS are faster than NFS-std for phases 1, 2, and 5 but slower for the other phases. This is because during phases 1, 2, and 5 a large fraction (between 21% and 40%) of the operations issued by the client are synchronous, i.e., operations that require the NFS implementation to ensure stability of modified file system state before replying to the client. NFS-std achieves stability by writing modified state to disk whereas BFS achieves stability with lower latency using replication (as in Harp [20]). NFS-std is faster than BFS (and BFS-nr) in phases 3 and 4 because the client issues no synchronous operations during these phases.

8 Related Work

Most previous work on replication techniques ignored Byzantine faults or assumed a synchronous system model (e.g., [17, 26, 18, 34, 6, 10]). Viewstamped replication [26] and Paxos [18] use views with a primary and backups to tolerate benign faults in an asynchronous system. Tolerating Byzantine faults requires a much more complex protocol with cryptographic authentication, an extra pre-prepare phase, and a different technique to trigger view changes and select primaries. Furthermore, our system uses view changes only to select a new primary but never to select a different set of replicas to form the new view as in [26, 18].

Some agreement and consensus algorithms tolerate Byzantine faults in asynchronous systems (e.g., [2, 3, 24]). However, they do not provide a complete solution for state machine replication, and furthermore, most of them were designed to demonstrate theoretical feasibility and are too slow to be used in practice. Our algorithm during normal-case operation is similar to the Byzantine agreement algorithm in [2] but that algorithm is unable to survive primary failures.

The two systems that are most closely related to our work are Rampart [29, 30, 31, 22] and SecureRing [16]. They implement state machine replication but are more than an order of magnitude slower than our system and, most importantly, they rely on synchrony assumptions.

Both Rampart and SecureRing must exclude faulty replicas from the group to make progress (e.g., to remove a faulty primary and elect a new one), and to perform garbage collection. They rely on failure detectors to determine which replicas are faulty. However, failure detectors cannot be accurate in an asynchronous system [21], i.e., they may misclassify a replica as faulty. Since correctness requires that fewer than 1/3 of group members be faulty, a misclassification can compromise correctness by removing a non-faulty replica from the group. This opens an avenue of attack: an attacker gains control over a single replica but does not change its behavior in any detectable way; then it slows correct replicas or the communication between them until enough are excluded from the group.

To reduce the probability of misclassification, failure detectors can be calibrated to delay classifying a replica as faulty. However, for the probability to be negligible the delay must be very large, which is undesirable. For example, if the primary has actually failed, the group will be unable to process client requests until the delay has expired. Our algorithm is not vulnerable to this problem because it never needs to exclude replicas from the group.

Phalanx [23, 25] applies quorum replication techniques [12] to achieve Byzantine fault-tolerance in asynchronous systems. This work does not provide generic state machine replication; instead, it offers a data repository with operations to read and write individual variables and to acquire locks. The semantics it provides for read and write operations are weaker than those offered by our algorithm; we can implement arbitrary operations that access any number of variables, whereas in Phalanx it would be necessary to acquire and release locks to execute such operations. There are no published performance numbers for Phalanx but we believe our algorithm is faster because it has fewer message delays in the critical path and because of our use of MACs rather than public key cryptography. The approach in Phalanx offers the potential for improved scalability; each operation is processed by only a subset of replicas. But this approach to scalability is expensive: it requires $n > 4f + 1$ to tolerate $f$ faults; each replica needs a copy of the state; and the load on each replica decreases slowly with $n$ (it is $O(1/\sqrt{n})$).

9 Conclusions

This paper has described a new state-machine replication algorithm that is able to tolerate Byzantine faults and can be used in practice: it is the first to work correctly in an asynchronous system like the Internet and it improves the performance of previous algorithms by more than an order of magnitude.

The paper also described BFS, a Byzantine-fault-tolerant implementation of NFS. BFS demonstrates that it is possible to use our algorithm to implement real services with performance close to that of an unreplicated service: the performance of BFS is only 3% worse than that of the standard NFS implementation in Digital Unix. This good performance is due to a number of important optimizations, including replacing public-key signatures by vectors of message authentication codes, reducing the size and number of messages, and the incremental checkpoint-management techniques.

One reason why Byzantine-fault-tolerant algorithms will be important in the future is that they can allow systems to continue to work correctly even when there are software errors. Not all errors are survivable; our approach cannot mask a software error that occurs at all replicas. However, it can mask errors that occur independently at different replicas, including nondeterministic software errors, which are the most problematic and persistent errors since they are the hardest to detect. In fact, we encountered such a software bug while running our system, and our algorithm was able to continue running correctly in spite of it.

There is still much work to do on improving our system. One problem of special interest is reducing the amount of resources required to implement our algorithm. The number of replicas can be reduced by using replicas as witnesses that are involved in the protocol only when some full replica fails. We also believe that it is possible to reduce the number of copies of the state to 1 but the details remain to be worked out.

Acknowledgments

We would like to thank Atul Adya, Chandrasekhar Boyapati, Nancy Lynch, Sape Mullender, Andrew Myers, Liuba Shrira, and the anonymous referees for their helpful comments on drafts of this paper.

References

[1] M. Bellare and D. Micciancio. A New Paradigm for Collision-free Hashing: Incrementality at Reduced Cost. In Advances in Cryptology – Eurocrypt 97, 1997.
[2] G. Bracha and S. Toueg. Asynchronous Consensus and Broadcast Protocols. Journal of the ACM, 32(4), 1985.
[3] R. Canetti and T. Rabin. Optimal Asynchronous Byzantine Agreement. Technical Report #92-15, Computer Science Department, Hebrew University, 1992.
[4] M. Castro and B. Liskov. A Correctness Proof for a Practical Byzantine-Fault-Tolerant Replication Algorithm. Technical Memo MIT/LCS/TM-590, MIT Laboratory for Computer Science, 1999.
[5] M. Castro and B. Liskov. Authenticated Byzantine Fault Tolerance Without Public-Key Cryptography. Technical Memo MIT/LCS/TM-589, MIT Laboratory for Computer Science, 1999.
[6] F. Cristian, H. Aghili, H. Strong, and D. Dolev. Atomic Broadcast: From Simple Message Diffusion to Byzantine Agreement. In International Conference on Fault Tolerant Computing, 1985.
[7] S. Deering and D. Cheriton. Multicast Routing in Datagram Internetworks and Extended LANs. ACM Transactions on Computer Systems, 8(2), 1990.
[8] H. Dobbertin. The Status of MD5 After a Recent Attack. RSA Laboratories’ CryptoBytes, 2(2), 1996.
[9] M. Fischer, N. Lynch, and M. Paterson. Impossibility of Distributed Consensus With One Faulty Process. Journal of the ACM, 32(2), 1985.
[10] J. Garay and Y. Moses. Fully Polynomial Byzantine Agreement for n > 3t Processors in t+1 Rounds. SIAM Journal of Computing, 27(1), 1998.
[11] D. Gawlick and D. Kinkade. Varieties of Concurrency Control in IMS/VS Fast Path. Database Engineering, 8(2), 1985.
[12] D. Gifford. Weighted Voting for Replicated Data. In Symposium on Operating Systems Principles, 1979.
[13] M. Herlihy and J. Tygar. How to make replicated data secure. Advances in Cryptology (LNCS 293), 1988.
[14] M. Herlihy and J. Wing. Axioms for Concurrent Objects. In ACM Symposium on Principles of Programming Languages, 1987.
[15] J. Howard et al. Scale and performance in a distributed file system. ACM Transactions on Computer Systems, 6(1), 1988.
[16] K. Kihlstrom, L. Moser, and P. Melliar-Smith. The SecureRing Protocols for Securing Group Communication. In Hawaii International Conference on System Sciences, 1998.
[17] L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Commun. ACM, 21(7), 1978.
[18] L. Lamport. The Part-Time Parliament. Technical Report 49, DEC Systems Research Center, 1989.
[19] L. Lamport, R. Shostak, and M. Pease. The Byzantine Generals Problem. ACM Transactions on Programming Languages and Systems, 4(3), 1982.
[20] B. Liskov et al. Replication in the Harp File System. In ACM Symposium on Operating System Principles, 1991.
[21] N. Lynch. Distributed Algorithms. Morgan Kaufmann Publishers, 1996.
[22] D. Malkhi and M. Reiter. A High-Throughput Secure Reliable Multicast Protocol. In Computer Security Foundations Workshop, 1996.
[23] D. Malkhi and M. Reiter. Byzantine Quorum Systems. In ACM Symposium on Theory of Computing, 1997.
[24] D. Malkhi and M. Reiter. Unreliable Intrusion Detection in Distributed Computations. In Computer Security Foundations Workshop, 1997.
[25] D. Malkhi and M. Reiter. Secure and Scalable Replication in Phalanx. In IEEE Symposium on Reliable Distributed Systems, 1998.
[26] B. Oki and B. Liskov. Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems. In ACM Symposium on Principles of Distributed Computing, 1988.
[27] B. Preneel and P. Oorschot. MDx-MAC and Building Fast MACs from Hash Functions. In Crypto 95, 1995.
[28] C. Pu, A. Black, C. Cowan, and J. Walpole. A Specialization Toolkit to Increase the Diversity of Operating Systems. In ICMAS Workshop on Immunity-Based Systems, 1996.
[29] M. Reiter. Secure Agreement Protocols. In ACM Conference on Computer and Communication Security, 1994.
[30] M. Reiter. The Rampart Toolkit for Building High-Integrity Services. Theory and Practice in Distributed Systems (LNCS 938), 1995.
[31] M. Reiter. A Secure Group Membership Protocol. IEEE Transactions on Software Engineering, 22(1), 1996.
[32] R. Rivest. The MD5 Message-Digest Algorithm. Internet RFC-1321, 1992.
[33] R. Rivest, A. Shamir, and L. Adleman. A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Communications of the ACM, 21(2), 1978.
[34] F. Schneider. Implementing Fault-Tolerant Services Using The State Machine Approach: A Tutorial. ACM Computing Surveys, 22(4), 1990.
[35] A. Shamir. How to share a secret. Communications of the ACM, 22(11), 1979.
[36] G. Tsudik. Message Authentication with One-Way Hash Functions. ACM Computer Communications Review, 22(5), 1992.
[37] M. Wiener. Performance Comparison of Public-Key Cryptosystems. RSA Laboratories’ CryptoBytes, 4(1), 1998.

The Computing Landscape of the 21st Century

Mahadev Satyanarayanan, Carnegie Mellon University
Wei Gao, University of Pittsburgh
Brandon Lucia, Carnegie Mellon University

ABSTRACT

This paper shows how today’s complex computing landscape can be understood in simple terms through a 4-tier model. Each tier represents a distinct and stable set of design constraints that dominate attention at that tier. There are typically many alternative implementations of hardware and software at each tier, but all of them are subject to the same set of design constraints. We discuss how this simple and compact framework has explanatory power and predictive value in reasoning about system design.

ACM Reference Format: Mahadev Satyanarayanan, Wei Gao, and Brandon Lucia. 2019. The Computing Landscape of the 21st Century. In The 20th International Workshop on Mobile Computing Systems and Applications (HotMobile ’19), February 27–28, 2019, Santa Cruz, CA, USA. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3301293.3302357

1 Introduction

The creation of the Periodic Table in the late nineteenth and early twentieth centuries was an exquisite intellectual feat [36]. In a small and simple data structure, it organizes our knowledge about all the elements in our universe. The position of an element in the table immediately suggests its physical attributes and its chemical affinities to other elements. The presence of “holes” in early versions of the table led to the search and discovery of previously unknown elements with predicted properties. This simple data structure has withstood the test of time. As new man-made elements were created, they could all be accommodated within the existing framework. The quest to understand the basis of order in this table led to major discoveries in physics and chemistry. The history of the periodic table teaches us that there is high value in distilling and codifying taxonomical knowledge into a compact form.

Today, we face a computing landscape of high complexity that is reminiscent of the scientific landscape of the late 19th century. Is there a way to organize our computing universe into a simple and compact framework that has explanatory power and predictive value? What is our analog of the periodic table? In this paper, we describe our initial effort at such an intellectual distillation. The periodic table took multiple decades and the contributions of many researchers to evolve into the familiar form that we know today. We therefore recognize that this paper is only the beginning of an important conversation in the research community.

2 A Tiered Model of Computing

Today’s computing landscape is best understood by the tiered model shown in Figure 1. Each tier represents a distinct and stable set of design constraints that dominate attention at that tier. There are typically many alternative implementations of hardware and software at each tier, but all of them are subject to the same set of design constraints. There is no expectation of full interoperability across tiers: randomly choosing one component from each tier is unlikely to result in a functional system. Rather, there are many sets of compatible choices across tiers. For example, a single company will ensure that its products at each tier work well with its own products in other tiers, but not necessarily with products of other companies. The tiered model of Figure 1 is thus quite different from the well-known “hourglass” model of interoperability. Rather than defining functional boundaries or APIs, our model segments the end-to-end computing path and highlights design commonalities.

In each tier there is considerable churn at timescales of up to a few years, driven by technical progress as well as market-driven tactics and monetization efforts. The relationship between tiers, however, is stable over decade-long timescales. A major shift in computing typically involves the appearance, disappearance or repurposing of a tier in Figure 1.

We describe the four tiers of Figure 1 in the rest of this section. Section 3 then explains how the tiered model can be used as an aid to reasoning about the design of a distributed system. Section 4 examines energy relationships across tiers. Section 5 interprets the past six decades of computing in the context of Figure 1, and Section 6 speculates on the future.

2.1 Tier-1: Elasticity, Permanence and Consolidation

Tier-1 represents “the cloud” in today’s parlance. Two dominant themes of Tier-1 are compute elasticity and storage permanence. Cloud computing has almost unlimited elasticity, as a Tier-1 data center can easily spin up servers to rapidly meet peak demand. Relative to Tier-1, all other tiers have very limited elasticity. In terms of archival preservation, the cloud is the safest place to store data with confidence that it can be retrieved far into the future. A combination of storage redundancy (e.g., RAID), infrastructure stability (i.e., data center engineering), and management practices (e.g., data backup and disaster recovery) together ensure the long-term integrity and accessibility of data entrusted to the cloud. Relative to the data permanence of Tier-1, all other tiers offer more tenuous safety. Getting important data captured at those tiers to the cloud is often an imperative. Tier-1 exploits economies of scale to offer very low total costs of computing. As hardware costs shrink relative to personnel costs, it becomes valuable to amortize IT personnel costs over many machines in a large data center. Consolidation is thus a third dominant theme of Tier-1. For large tasks without strict timing, data ingress volume, or data privacy requirements, Tier-1 is typically the optimal place to perform the task.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). HotMobile ’19, February 27–28, 2019, Santa Cruz, CA, USA © 2019 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-6273-3/19/02. https://doi.org/10.1145/3301293.3302357

[Figure labels: Tier 1; Tier 2 (cloudlets: luggable, vehicular, mini-datacenter); Tier 3 (smartphones, AR/VR devices such as Microsoft HoloLens and Magic Leap, drones, static and vehicular sensor arrays); Tier 4 (RFID tag, swallowable capsule, robotic insect, WiFi backscatter device). Links: wide-area network; low-latency, high-bandwidth wireless network; immersive proximity.]

Figure 1: Four-tier Model of Computing

2.2

Tier-3: Mobility and Sensing

We consider Tier-3 next, because understanding its attributes helps to define Tier-2. Mobility is a defining attribute of Tier-3 because it places stringent constraints on weight, size, and heat dissipation of devices that a user carries or wears [29]. Such a device cannot be too large, too heavy or run too hot. Battery life is another crucial design constraint. Together, these constraints severely limit designs. Technological breakthroughs (e.g., a new battery technology or a new lightweight and flexible display material) may expand the envelope of designs, but the underlying constraints always remain. Sensing is another defining attribute of Tier-3. Today’s mobile devices are rich in sensors such as GPS, microphones, accelerometers, gyroscopes, and video cameras. Unfortunately, a mobile device may not be powerful enough to perform real-time analysis of data captured by its on-board sensors (e.g., video analytics). While mobile hardware continues to improve, there is always a large gap between what is feasible on a mobile device and what is feasible on a server of the same technological era. Figure 2 shows this large performance gap persisting over a 20-year period from 1997 to 2017. One can view this stubborn gap as a “mobility penalty” — i.e., the price one pays in performance foregone due to mobility constraints. To overcome this penalty, a mobile device can offload computation over a wireless network to Tier-1. This was first described by Noble et al [25] in 1997, and has since been extensively explored by many others [8, 32]. For example, speech recognition and natural language processing in iOS and Android nowadays work by offloading their compute-intensive aspects to the cloud. IoT devices can be viewed as Tier-3 devices. Although they may not be mobile, there is a strong incentive for them to be inexpensive. Since this typically implies meager processing capability, offloading computation to Tier-1 is again attractive.

2.3

Tier-2: Network Proximity

As mentioned in Section 2.1, economies of scale are achieved in Tier-1 by consolidation into a few very large data centers. Extreme consolidation has two negative consequences. First, it tends to lengthen network round-trip times (RTT) to Tier-1 from Tier-3 — if there are very few Tier-1 data centers, the closest one is likely to be far away. Second, the high fan-in from Tier-3 devices implies high cumulative ingress bandwidth demand into Tier-1 data centers. These negative consequences stifle the emergence of new classes of real-time, sensor-rich, compute-intensive applications [34].

Year | Tier-1 server processor | Tier-1 speed          | Tier-3 device     | Tier-3 speed
-----+-------------------------+-----------------------+-------------------+--------------------
1997 | Pentium II              | 266 MHz               | Palm Pilot        | 16 MHz
2002 | Itanium                 | 1 GHz                 | Blackberry 5810   | 133 MHz
2007 | Intel Core 2            | 9.6 GHz (4 cores)     | Apple iPhone      | 412 MHz
2011 | Intel Xeon X5           | 32 GHz (2x6 cores)    | Samsung Galaxy S2 | 2.4 GHz (2 cores)
2013 | Intel Xeon E5-2697v2    | 64 GHz (2x12 cores)   | Samsung Galaxy S4 | 6.4 GHz (4 cores)
     |                         |                       | Google Glass      | 2.4 GHz (2 cores)
2016 | Intel Xeon E5-2698v4    | 88.0 GHz (2x20 cores) | Samsung Galaxy S7 | 7.5 GHz (4 cores)
     |                         |                       | HoloLens          | 4.16 GHz (4 cores)
2017 | Intel Xeon Gold 6148    | 96.0 GHz (2x20 cores) | Pixel 2           | 9.4 GHz (4 cores)

Source: Adapted from Chen [3] and Flinn [8]. “Speed” metric = number of cores times per-core clock speed.

Figure 2: The Mobility Penalty: Impact of Tier-3 Constraints

Tier-2 addresses these negative consequences by creating the illusion of bringing Tier-1 “closer.” This achieves two things. First, it enables Tier-3 devices to offload compute-intensive operations at very low latency. This helps to preserve the tight response time bounds needed for immersive user experience (e.g., augmented reality (AR)) and cyber-physical systems (e.g., drone control). Proximity also results in a much smaller fan-in between Tier-3 and Tier-2 than is the case when Tier-3 devices connect directly to Tier-1. Consequently, Tier-2 processing of data captured at Tier-3 avoids excessive bandwidth demand anywhere in the system.

Server hardware at Tier-2 is essentially the same as at Tier-1 (i.e., the second column of Figure 2), but engineered differently. Instead of extreme consolidation, servers in Tier-2 are organized into small, dispersed data centers called cloudlets. A cloudlet can be viewed as “a data center in a box.” When a Tier-3 component such as a drone moves far from its current cloudlet, a mechanism analogous to cellular handoff is required to find and use a new optimal cloudlet [9]. The introduction of Tier-2 is the essence of edge computing [33].

Note that “proximity” here refers to network proximity rather than physical proximity. It is crucial that RTT be low and end-to-end bandwidth be high. This is achievable by using a fiber link between a wireless access point and a cloudlet that is many tens or even hundreds of kilometers away. Conversely, physical proximity does not guarantee network proximity. A highly congested WiFi network may have poor RTT, even if Tier-2 is physically near Tier-3.
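The distinction between network proximity and physical proximity can be made concrete with a small selection routine. The following is only a sketch under stated assumptions: the function name `selectCloudlet`, the record fields, and the numeric thresholds are all hypothetical and do not come from the paper. It ranks candidate cloudlets by measured RTT and bandwidth and deliberately ignores distance.

```javascript
// Hypothetical sketch: choose a cloudlet by network proximity, not distance.
// Field names (rttMs, bandwidthMbps, distanceKm) and thresholds are assumptions.
function selectCloudlet(cloudlets, needs) {
  // Keep only cloudlets that satisfy the application's latency and bandwidth needs.
  const usable = cloudlets.filter(
    (c) => c.rttMs <= needs.maxRttMs && c.bandwidthMbps >= needs.minBandwidthMbps
  );
  if (usable.length === 0) return null; // caller might fall back to Tier-1
  // Lowest RTT wins; physical distance plays no role in the decision.
  return usable.reduce((best, c) => (c.rttMs < best.rttMs ? c : best));
}

const candidates = [
  // Physically near, but reached over a congested WiFi network.
  { name: "closet-wifi", distanceKm: 0.05, rttMs: 80, bandwidthMbps: 20 },
  // Far away, but reached over a fiber link from the access point.
  { name: "metro-fiber", distanceKm: 60, rttMs: 4, bandwidthMbps: 900 },
];

const chosen = selectCloudlet(candidates, { maxRttMs: 20, minBandwidthMbps: 100 });
console.log(chosen ? chosen.name : "no cloudlet qualifies");
```

In this made-up scenario the distant, fiber-connected cloudlet is selected over the one in the nearby closet. A handoff mechanism of the kind mentioned above would presumably re-run such a check as measurements change.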

2.4

Tier-4: Longevity and Opportunism

A key driver of Tier-3 is the vision of embedded sensing, in which tiny sensing-computing-communication platforms continuously report on their environment. “Smart dust” is the extreme limit of this vision. The challenge of cheaply maintaining Tier-3 devices in the field has proved elusive because replacing their batteries or charging them is time-consuming and/or difficult. This has led to the emergence of devices that contain no chemical energy source (battery). Instead, they harvest incident EM energy (e.g., visible light or RF) to charge a capacitor, which then powers a brief episode of sensing, computation and wireless transmission. The device then remains passive until the next occasion when sufficient energy can be harvested to power another episode. This modality of operation, referred to as intermittent computing [17, 18, 21], eliminates the need for energy-related maintenance of devices in the field. This class of devices constitutes Tier-4 in the taxonomy of Figure 1. Longevity of deployment combined with opportunism in energy harvesting are the distinctive attributes of this tier.

The most successful Tier-4 devices today are RFID tags, which are projected to be a roughly $25 billion market by 2020 [27]. More sophisticated devices are being explored in research projects including, for example, a robotic flying insect powered solely by an incident laser beam [12]. A Tier-3 device (e.g., RFID reader) provides the energy that is harvested by a Tier-4 device. Immersive proximity is thus the defining relationship between Tier-4 and Tier-3 devices: they have to be physically close enough for the Tier-4 device to harvest sufficient energy for an episode of intermittent computation. Network proximity alone is not sufficient. RFID readers have a typical range of a few meters today. A Tier-4 device stops functioning when its energy source is misaimed or too far away.

3

Using the Model

The tiers of Figure 1 can be viewed as a canonical representation of components in a modern distributed system. Of course, not every distributed system will have all four tiers. For example, a team of users playing Pokemon Go will only use smartphones (Tier-3) and a server in the cloud (Tier-1). A worker in a warehouse who is taking inventory will use an RFID reader (Tier-3) and passive RFID tags that are embedded in the objects being inventoried (Tier-4). A more sophisticated design of this inventory control system may allow multiple users to work concurrently, and to use a cloudlet in the warehouse (Tier-2) or the cloud (Tier-1) to do aggregation and duplicate elimination of objects discovered by different workers. In general, one can deconstruct any complex distributed system and then examine the system from a tier viewpoint. Such an analysis can be a valuable aid to deeper understanding of the system.

As discussed in Section 2, each tier embodies a small set of salient properties that define the reason for the existence of that tier. Elasticity, permanence and consolidation are the salient attributes of Tier-1; mobility and sensing are those of Tier-3; network proximity to Tier-3 is the central purpose of Tier-2; and longevity combined with opportunism represents the essence of Tier-4. These salient attributes shape both hardware and software designs that are relevant to each tier. For example, hardware at Tier-3 is expected to be mobile and sensor-rich. Specific instances (e.g., a static array of video cameras) may not embody some of these attributes (i.e., mobility), but the broader point is invariably true. The salient attributes of a tier severely constrain its range of acceptable designs. A mobile device, for example, has to be small, lightweight, energy-efficient and have a small thermal footprint. This imperative follows directly from the salient attribute of mobility. A product that does not meet this imperative will simply fail in the marketplace.

The same reasoning also applies to software at each tier. For example, Tier-3 to Tier-1 communication is (by definition) over a WAN and may involve a wireless first hop that is unreliable and/or congested. Successful Tier-3 software design for this context has to embody support for disconnected and weakly-connected operation. On the other hand, Tier-3 to Tier-2 communication is expected to be LAN or WLAN quality at all times. A system composed of just those tiers can afford to ignore support for network failure. Note that the server hardware in the two cases may be identical: located in a data center (Tier-1) or in a closet nearby (Tier-2). It is only the placement and communication assumptions that are different.

Constraints serve as valuable discipline in system design. Although implementation details of competing systems with comparable functionality may vary widely, their tier structure offers a common viewpoint from which to understand their differences. Some design choices are forced by the tier, while other design choices are made for business reasons, for compatibility reasons with products in other tiers, for efficiency, usability, aesthetics, and so on. Comparison of designs from a tier viewpoint helps to clarify and highlight the essential similarities versus incidental differences.

Like the periodic table mentioned in Section 1, Figure 1 distills a vast space of possibilities (i.e., design choices for distributed systems) into a compact intellectual framework. However, the analogy should not be over-drawn since the basis of order in the two worlds is very different. The periodic table exposes order in a “closed-source” system (i.e., nature). The tiered model reveals structure in an “open-source” world (i.e., man-made system components). The key insight of the tiered model is that, in spite of all the degrees of freedom available to designers, the actual designs that thrive in the real world have deep structural similarities.

[Figure axes: horizontal axis “Power Budget (W)”, logarithmic from 10^-8 to 10^6; vertical axis “Importance of Energy in Design (not to scale)”. Labeled device classes: RFID tags, energy harvesting devices, microcontrollers, wearables, smartphones, laptops, desktop PCs, tower servers, data centers, grouped into Tier 4, Tier 3, Tier 2 and Tier 1. Annotation: “4 orders of magnitude”.]

Figure 3: Importance of Energy as a Design Constraint

4

The Central Role of Energy

A hidden message of Section 2 is that energy plays a central role in segmentation across tiers. As shown in Figure 3, the power concerns at different tiers span many orders of magnitude, from a few nanowatts (e.g., a passive RFID tag) to tens of megawatts (e.g., an exascale data center). Energy is also the most critical factor when making design choices in other aspects of a computing system. For example, the limited availability of energy could severely limit performance. The power budget of a system design could also be a major barrier to reductions of system cost and form factor. The relative heights of tiers in Figure 3 are meant to loosely convey the extent of energy’s influence on design at that tier.

Tier-1 (Data Centers): Power is used in a data center for IT equipment (e.g., servers, networks, storage, etc.) and infrastructure (e.g., cooling systems), adding up to as much as 30 MW at peak hours [6]. Current power saving techniques focus on load balancing and dynamically eliminating power peaks. Power oversubscription enables more servers to be hosted than theoretically possible, leveraging the fact that their peak demands rarely occur simultaneously.

Tier-2 (Cloudlets): Cloudlets can span a wide range of form factors, from high-end laptops and desktop PCs to tower or rack servers. Power consumption can therefore vary from ...

        ... = threshold && !active) {
          // if all fans can be on, set active to true
          actuateAll('fans', 'on');
          write('active', true);
        } else if (co2 < threshold && active) { ... }
      }, ['co2'], 5, 'all', 'all');
      // executes if both policies are met
      tx.onSuccess(func (evt) {
        let txs = Transactuation(evt);
        txs.perform({
          actuate('msg', 'CO2 is high');
        }, 'none', 'none');
        txs.execute();
      });
      // executes if either one policy is not met
      tx.onFailure(func (evt) {
        let txf = Transactuation(evt);
        ...
      });
      tx.execute();
    }

Listing 3: CO2 Vent written with transactuation. The code presented here is in synchronous style but our implementation uses asynchronous Node.js.

... 10 seconds has the following intent: a hard state passes validation if its most recent event and the transactuation triggering event are not more than 10 seconds apart.

A sensing policy is an acceptable level of hard-read failures that a transactuation can tolerate. It specifies under what condition a perform lambda can be executed over a returned list of window-validated sensors. The perform lambda in turn may or may not execute depending on the sensing policy. Transactuations support three sensing policies:

• All: ensures that the perform lambda executes only if all hard states in the sensor list pass validation. Consider an application that reads presence sensors of every user and turns on cameras if no one is present. For privacy, all sensors need to pass validation. If even one presence sensor fails, the application should not risk turning on the cameras, since doing so would violate privacy.
• Any: guarantees the execution of the perform lambda as long as at least one hard state in the sensor list passes validation. For example, an application that computes the average humidity level from multiple sensors to control fans still executes with correct semantics if some, but not all, sensors fail.
• None: states that the perform lambda executes over the returned validated list of hard states regardless of how many hard states are unavailable.

Observe that a time window along with a sensing policy helps preserve the HR→* dependency as per the developer’s intention to preserve invariant (D1). To preserve invariant (D2), a developer needs to specify an actuating policy. The actuating policy is an acceptable level of hard-write failures that is tolerable. To meet an actuating policy in case of a failure, soft writes inside a transactuation roll back to their initial values, and the onFailure lambda executes. Similar to a sensing policy, an actuating policy supports the following semantics:

• All: states that modifications to soft states commit if all hard writes successfully finish. An example of this policy is an application that locks all doors and sets the home state to safe. If even one door fails, the home state should not be set.
• Any: guarantees that soft state modifications inside a lambda commit if at least one hard write succeeds. An example is an application that actuates all sirens and sets the flag ringing. Even if only one siren rings, the flag should be set.
• None: states that soft writes commit despite failures.

onSuccess lambda. An onSuccess lambda executes if the perform lambda of a transactuation succeeds (i.e., sensing and actuating policies are met). A developer can assign an onSuccess lambda to a transactuation via onSuccess() as shown in line 17 of Listing 3.

onFailure lambda. An onFailure lambda executes if a transactuation cannot meet its sensing or actuating policies. It is assigned to a transactuation via onFailure() as depicted in line 25 of Listing 3.

When a developer has set up all the lambdas for a transactuation, she executes the transactuation by invoking execute() (line 29), which is an asynchronous call that executes the perform lambda in the background.

Listing 3 illustrates the CO2 Vent rewritten with the transactuation abstraction. The perform lambda is parameterized with a 5-second time window. The transactuation only reads one hard state, co2. The lambda executes if the latest sensor update from co2 and the triggering event, which is also co2, fall within the 5-second time interval.

USENIX Association
switches, which binds to an array of fans, requires the “all” policy if we want the soft writes to be consistent with the actuations. The soft state active will be set to true only if all fans can be turned on, otherwise, active remains unchanged.
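The window validation and the all/any/none policies described above can be sketched in a few lines. This is an illustrative sketch only: `passesWindow`, `policyMet`, and `canPerform` are hypothetical helper names, not part of the transactuation API, and timestamps are assumed to be milliseconds.

```javascript
// Hypothetical helpers; names and record shapes are assumptions.

// A hard state passes window validation if its most recent event and the
// triggering event are not more than `windowSec` seconds apart.
function passesWindow(lastEventMs, triggerMs, windowSec) {
  return Math.abs(triggerMs - lastEventMs) <= windowSec * 1000;
}

// The same semantics serves both policy kinds: for sensing, okCount is the
// number of validated hard reads; for actuating, the number of acked hard writes.
function policyMet(policy, okCount, total) {
  switch (policy) {
    case "all": return okCount === total;
    case "any": return okCount >= 1;
    case "none": return true;
    default: throw new Error(`unknown policy: ${policy}`);
  }
}

// Decide whether the perform lambda may run, and over which sensors.
function canPerform(sensors, triggerMs, windowSec, sensingPolicy) {
  const valid = sensors.filter((s) => passesWindow(s.lastEventMs, triggerMs, windowSec));
  return { run: policyMet(sensingPolicy, valid.length, sensors.length), valid };
}

// Two presence sensors; the second one last reported 9 seconds before the trigger.
const presence = [
  { id: "presence-1", lastEventMs: 9000 },
  { id: "presence-2", lastEventMs: 1000 },
];
console.log(canPerform(presence, 10000, 5, "all").run); // stale sensor blocks "all"
console.log(canPerform(presence, 10000, 5, "any").run);
```

With a 5-second window, only the first sensor validates, so an "all" policy withholds the perform lambda while "any" and "none" allow it.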

4.2

Chaining transactuations

A transactuation can be chained to other transactuations by invoking it in their onSuccess and onFailure lambdas. As we shall see in the next section, the runtime guarantees to execute chained transactuations sequentially: if a transactuation τj is invoked in the onSuccess lambda of τi, τj is guaranteed to see the updates τi makes. We call this ordered execution of transactuations a T-Chain. This is particularly relevant in an asynchronous runtime where high-latency operations can finish in arbitrary order, executing outside the critical path such as in worker threads [25, 44]. Thus, if τj wants to use a soft state written by τi, τj needs to be invoked in the onSuccess lambda of τi. In addition, if τj requires actuations of τi to complete before it, these two transactuations must form a T-Chain.
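The ordering guarantee of a T-Chain can be illustrated with a deliberately simplified, synchronous stand-in. Everything here is an assumption made for illustration: `runChain` and the object shape (`perform`, `onSuccess`, `onFailure` as plain properties) are hypothetical and do not mirror the real asynchronous runtime.

```javascript
// Hypothetical, synchronous stand-in for a T-Chain; not the Relacs API.
// Each transactuation has a perform(store) step that returns true when its
// policies are met, plus optional onSuccess / onFailure successors.
function runChain(store, first) {
  const order = [];
  let current = first;
  while (current) {
    const ok = current.perform(store);
    order.push(`${current.name}:${ok ? "success" : "failure"}`);
    // The successor starts only after the predecessor has finished,
    // so it observes every soft write the predecessor made.
    current = (ok ? current.onSuccess : current.onFailure) || null;
  }
  return order;
}

const notify = {
  name: "tj",
  perform(store) {
    // Reads the soft state written by its predecessor.
    store.message = store.active ? "fans on" : "fans off";
    return true;
  },
};

const vent = {
  name: "ti",
  perform(store) {
    store.active = true; // soft write that tj depends on
    return true;
  },
  onSuccess: notify,
};

const store = { active: false, message: null };
const order = runChain(store, vent);
console.log(order, store.message);
```

Because tj runs strictly after ti, it reads the updated soft state rather than the initial one, which is the property the chaining rule is meant to provide.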

5

Relacs

In this section, we detail the design of our runtime, called Relacs, which executes smart-home applications, along with a supporting key-value store called Relacs Store.

5.1

Relacs Store

All soft and hard states inside a transactuation are stored in a key-value store called Relacs Store. It hides all complexities of working with sensors and actuators by allowing developers to not only perform read/write operations on soft states inside a transactuation, but also to issue hard reads/writes. Conceptually, every state inside the Relacs Store maintains two values, speculative and final. A speculative value means that the state has been updated logically in the Relacs Store, but is not confirmed to be final (i.e., issued to an IoT device). For example, a transactuation that wants to unlock a door will have the speculative value of the door set to unlocked, before the actuation command succeeds. When Relacs receives an ack event confirming the success of an actuation command, it updates the final value and discards the speculative value. Along with setting the final value, the Relacs Store also logs the timestamp of the ack event for validating a time window of a transactuation reading that hard state. In Section 5.2, we explain how speculative states help Relacs to speculatively execute transactuations. Since multiple hard writes on the same state can execute before the system receives an ack from the corresponding device, Relacs Store needs to record all versions of speculative values that have not been finalized yet. When reading a state, Relacs Store returns the latest speculative value, or the final value if no speculative value exists. For instance, consider the following transactuations: a transactuation τi sets a lamp color to red. While the lamp is changing its color, τ j changes the lamp color to green. In this example, Relacs Store logs both speculative values. Thus, if τk tries to read the state of the lamp, Relacs Store returns green, even if the lamp has not completed executing the first actuation command to change its color to red.
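The speculative/final bookkeeping can be sketched as a small class. This is a minimal model under stated assumptions: the class name `StateEntry` and its methods are hypothetical, and acks are assumed to arrive in the order the writes were issued, which the text does not state.

```javascript
// Hypothetical model of one state in the store: a final value plus a queue
// of speculative values that have not been acknowledged yet (oldest first).
class StateEntry {
  constructor(finalValue) {
    this.finalValue = finalValue;
    this.speculative = []; // pending hard writes awaiting an ack
    this.lastAckMs = null; // timestamp of the latest ack, used for window checks
  }

  // A hard write is first recorded as a new speculative version.
  write(value) {
    this.speculative.push(value);
  }

  // Reads return the latest speculative value, or the final value if none exists.
  read() {
    const n = this.speculative.length;
    return n > 0 ? this.speculative[n - 1] : this.finalValue;
  }

  // Assumption: acks arrive in issue order, so the oldest version is finalized.
  ack(ackMs) {
    if (this.speculative.length === 0) return;
    this.finalValue = this.speculative.shift();
    this.lastAckMs = ackMs;
  }
}

// The lamp example: ti sets red, tj sets green before the first ack arrives.
const lamp = new StateEntry("white");
lamp.write("red");
lamp.write("green");
console.log(lamp.read()); // latest speculative value
lamp.ack(1000);
console.log(lamp.finalValue, lamp.read());
```

After the first ack the final value becomes red, yet a reader still sees green because a newer speculative version is pending.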

5.2

Execution Model

A transactuation execution model comprises the following three phases:

1. Hard read phase: to start executing a transactuation, the system first needs to determine if it can read the required hard states in the sensor list which satisfy the specified window and the sensing policy. If so, the system proceeds to the next phase. For a poll-based sensor, if Relacs fails to validate the window, it polls the sensor to check if it can get a fresh value. For a push-based sensor, Relacs simply waits, as long as the window is valid, to receive an event from the sensor. Observe that the window is valid as long as the specified time window has not passed since the transactuation triggering event. If the window becomes invalid, and the list of received events fails staleness validation, Relacs cannot execute the perform lambda, and proceeds to execute the onFailure lambda.

2. Speculative commit phase: since IoT devices cannot roll back, Relacs needs to make sure that a transactuation will definitely commit before performing real actuations. Therefore, it employs a speculative execution model where a perform lambda first executes speculatively, without performing any real actuation. Once the perform lambda finishes, it tries to speculatively commit like a normal transaction inside Relacs Store. Therefore, new speculative values are committed for modified soft and hard states. Additionally, committing new speculative values may trigger other handler functions subscribed to these states. Finally, Relacs starts executing the onSuccess lambda of the transactuation when it commits. Note that these lambdas triggered by speculative commit execute their transactuations speculatively.

3. Final commit phase: in the last phase, Relacs sends actuation commands that correspond to hard writes. A transactuation τi can start its final phase when the following three conditions hold. First, all transactuations that precede τi in the T-Chain have finally committed. Second, all transactuations updating states that τi read have finally committed. Third, no other finally committing τj conflicts with τi. More specifically, the readset of τi does not intersect the writeset of any finally committing transactuation, and the writeset of τi intersects neither the readset nor the writeset of any finally committing transactuation.

Relacs finally commits the transactuation when sufficient acks are received from actuators to satisfy its actuating policy. If the transactuation times out without satisfying its actuating policy, all soft writes inside the transactuation roll back to their initial state, and the transactuation finally commits. Next, the onFailure lambda executes if it has been defined. Moreover, all speculative transactuations invoked by the failed transactuation abort (e.g., chained transactuations), and transactuations that bear data dependencies with the failed transactuation need to re-execute.

2019 USENIX Annual Technical Conference
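The third final-commit condition is a read/write-set conflict test, which can be sketched as follows. The sketch assumes the conflict rule means read-write, write-read, and write-write overlaps; the names `conflicts` and `canStartFinalCommit` and the `readSet`/`writeSet` fields are hypothetical. The first two conditions (T-Chain predecessors and data-dependency predecessors) are not modeled here.

```javascript
// Hypothetical conflict test for the third final-commit condition only.
// Each transactuation is modeled as { readSet: Set, writeSet: Set } of state keys.
function intersects(a, b) {
  for (const key of a) {
    if (b.has(key)) return true;
  }
  return false;
}

// ti conflicts with a finally committing tj if ti reads what tj writes,
// or ti writes what tj reads or writes.
function conflicts(ti, tj) {
  return (
    intersects(ti.readSet, tj.writeSet) ||
    intersects(ti.writeSet, tj.readSet) ||
    intersects(ti.writeSet, tj.writeSet)
  );
}

// ti may enter its final phase only if it conflicts with none of them.
function canStartFinalCommit(ti, finallyCommitting) {
  return finallyCommitting.every((tj) => !conflicts(ti, tj));
}

const lockDoors = { readSet: new Set(["presence"]), writeSet: new Set(["door", "homeState"]) };
const ventFans = { readSet: new Set(["co2"]), writeSet: new Set(["fans", "active"]) };
const armAlarm = { readSet: new Set(["homeState"]), writeSet: new Set(["alarm"]) };

console.log(canStartFinalCommit(ventFans, [lockDoors])); // disjoint sets
console.log(canStartFinalCommit(armAlarm, [lockDoors])); // reads homeState written by lockDoors
```

In this made-up example the fan transactuation can proceed alongside the door-locking one, while the alarm transactuation must wait because it reads a state the other is still committing.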

5.3

Relacs Runtime

Relacs is built atop serverless computing [32, 42]. The runtime comprises two classes of functions, namely application functions and system functions. We explain these functions in detail here.

Application Functions. An application can comprise several handlers which are triggered when particular states in the Relacs Store change (publish-subscribe model), and each handler can comprise several transactuations. An application submitted to run by the Relacs system is transformed into a set of application functions to run on serverless instances as follows:

1. For each handler, Relacs transforms the logic of an embedded transactuation (i.e., perform lambda) into a transaction that can execute transactionally inside the Relacs Store.
2. The logic inside the onSuccess lambda and the onFailure lambda is transformed into stand-alone serverless functions, hereafter called success and failure functions, respectively. If an onSuccess lambda or onFailure lambda itself contains transactuations with their own onSuccess and onFailure lambdas (T-Chain), the transformations are applied recursively.
3. Finally, every handler is transformed into a runnable stand-alone serverless function, called a handler function.

System Functions. Relacs comprises a serverless function called the updater function that is invoked whenever the state of a sensor or an actuator changes. Upon receiving a notification, the updater updates the hard state corresponding to the event in Relacs Store, and launches an instance of the subscribed handler function(s).

Final-committer is a designated function to perform the final commits. It selects speculative transactuations that can finally commit without breaking the final commit rules, issues all of their actuation commands, and marks the actuations as issued. When a notification (ack) arrives from an IoT device for a successful actuation, the updater function transactionally updates the corresponding state in Relacs Store and marks the actuation command as done.

In order to detect an actuation failure, Relacs has a failure-detector function that runs periodically, and checks whether an ack has been received for an actuation command. If no ack is received within a certain threshold, the failure detector marks the actuation as failed. If the actuating policy is not met, the enclosing transactuation commits with rollback of soft writes, which triggers a re-executor function to re-execute transactuations that have data dependencies with the failed transactuation.
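The periodic failure detection described above can be sketched as a timeout scan over issued actuations. This is a sketch under assumptions: the record shape, the status strings, and the function names `detectFailures` and `finalOutcome` are hypothetical, and the outcome check is simplified to decide only once no actuation is still pending.

```javascript
// Hypothetical record shape: { id, status, issuedAtMs }, where status is
// "issued", "done", or "failed". The timeout threshold is an assumption.
function detectFailures(actuations, nowMs, timeoutMs) {
  const newlyFailed = [];
  for (const a of actuations) {
    if (a.status === "issued" && nowMs - a.issuedAtMs > timeoutMs) {
      a.status = "failed"; // no ack arrived within the threshold
      newlyFailed.push(a.id);
    }
  }
  return newlyFailed;
}

// Simplification: decide only when nothing is pending. A real runtime could
// commit an "any" policy as soon as the first ack arrives.
function finalOutcome(actuations, actuatingPolicy) {
  if (actuations.some((a) => a.status === "issued")) return "pending";
  const acked = actuations.filter((a) => a.status === "done").length;
  const met =
    actuatingPolicy === "none" ||
    (actuatingPolicy === "any" && acked >= 1) ||
    (actuatingPolicy === "all" && acked === actuations.length);
  return met ? "commit" : "rollback";
}

const fans = [
  { id: "fan-1", status: "done", issuedAtMs: 0 },
  { id: "fan-2", status: "issued", issuedAtMs: 0 },
];
console.log(finalOutcome(fans, "all")); // still waiting on fan-2
console.log(detectFailures(fans, 10000, 5000)); // fan-2 timed out
console.log(finalOutcome(fans, "all"), finalOutcome(fans, "any"));
```

Once the second fan times out, an "all" policy leads to rollback of the soft writes, whereas an "any" policy would still commit.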

5.4

Fault Tolerance

A function in serverless computing is not guaranteed to complete, and can terminate at any arbitrary point of execution. Yet, Relacs guarantees that applications execute reliably despite failures, as follows.

Relacs ensures that all transactuations are executed exactly once even if an application function (handler, success, or failure) fails during its execution. To this end, Relacs maintains two logs: a function log and a transactuation log. The function log is a write-ahead log for application functions. The function name along with the ID of the triggering event is recorded in the function log before the function executes. The transactuation log atomically records a transactuation name and the event ID during the speculative commit of a transactuation along with updates to soft/hard states.

A system function called the serverless checker runs periodically, and inspects the function log to find functions which have failed. The serverless checker invokes the failed functions again. This might lead to duplicated executions of transactuations that have already executed. To prevent this, Relacs checks if a particular transactuation is in the transactuation log, and skips its execution if present.¹

Currently, the updater failure is treated as an equivalent of sensor or actuator failure and it is handled by transactuation semantics. To address final committer failure, Relacs runs the final committer periodically to complete pending final commits by actuating unissued actuations. To preclude contention between the periodic and the regular final committer that can run concurrently, Relacs uses leases and ETags à la Tuba [21] in the final committer to ensure correctness.
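The interplay of the two logs can be sketched with in-memory stand-ins. This is only an illustration under assumptions: `functionLog`, `transactuationLog`, `transactuate`, `invoke`, and `checker` are hypothetical names, the logs are plain in-process collections rather than a durable store, and a thrown exception stands in for a serverless instance dying mid-execution.

```javascript
// Hypothetical in-memory stand-ins for the two logs.
const functionLog = new Map(); // "fn:event" -> { fnName, eventId, done }
const transactuationLog = new Set(); // "tx:event" for committed transactuations

// Runs a transactuation body at most once per (name, event).
function transactuate(name, eventId, body) {
  const key = `${name}:${eventId}`;
  if (transactuationLog.has(key)) return false; // already committed, skip
  body();
  // In the described design this record is written atomically with the commit.
  transactuationLog.add(key);
  return true;
}

// Records the invocation before running the function (write-ahead).
function invoke(fnName, eventId, handlers) {
  const key = `${fnName}:${eventId}`;
  functionLog.set(key, { fnName, eventId, done: false });
  handlers[fnName](eventId); // may fail at any point
  functionLog.get(key).done = true;
}

// Periodic checker: re-invoke every function that never finished.
function checker(handlers) {
  for (const entry of [...functionLog.values()]) {
    if (!entry.done) invoke(entry.fnName, entry.eventId, handlers);
  }
}

const counts = { t1: 0, t2: 0 };
let crashOnce = true;
const handlers = {
  onCo2(eventId) {
    transactuate("t1", eventId, () => { counts.t1 += 1; });
    if (crashOnce) {
      crashOnce = false;
      throw new Error("simulated crash"); // instance dies between transactuations
    }
    transactuate("t2", eventId, () => { counts.t2 += 1; });
  },
};

try {
  invoke("onCo2", "evt-1", handlers);
} catch (err) {
  // The failed run leaves an unfinished entry in the function log.
}
checker(handlers);
console.log(counts);
```

The re-invoked handler skips the first transactuation because it is already in the transactuation log and then runs the second, so each body executes once despite the crash.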

5.5

Implementation

We implemented the Relacs runtime and Relacs Store on top of Microsoft Azure. We used Azure Functions (serverless computing) to implement the runtime, and used Azure Cosmos DB to build Relacs Store. All serverless functions were implemented with Azure Functions. Application functions are triggered by HTTP calls and system functions are triggered on Cosmos DB updates or periodic timers. The parts of the protocol that need to update Relacs Store transactionally (including the perform lambda) are transformed into Cosmos DB stored procedures [3]. Currently, Relacs has only been integrated with Samsung SmartThings. SmartThings allows a developer to build a web service that connects with devices in a home [18]. We built a gateway that forwards actuation commands from Relacs to actuators and also polls sensor data.

5.6

Discussion

As described, Relacs validates sensor failures through event timestamps and actuator failures through timeouts. For sensor validation, as explained, if validation fails and a device is pollable, Relacs polls the device within the window constraints. If a device is push-based but pollable, Relacs polls the device and, if the validation fails again, waits for its push interval within the time window. However, if the device is purely push-based, Relacs cannot differentiate between inactivity and failure. We inspected 188 SmartThings-compatible devices and found that 113 of them are pollable. Likewise, actuation failures are detected with timeouts, first on the initial ack from the smart-home connector, followed by the notification on the final actuator state change. Again, if the ack message is lost, Relacs can incorrectly roll back soft states. However, transactuations can still help developers to prioritize home safety over convenience, such as by always setting a soft state to a conservative value; e.g., in Smart Security (Listing 2) to ensure that the alarm eventually rings.

¹ Note that any failure during the speculative commit results in a regular transactional abort and the transactuation log is not updated. Hence the transactuation is retried when the function re-executes.

6

Evaluation

In this section, we report our evaluation results on programmability, effectiveness of transactuations in enforcing correctness, and the overhead incurred by Relacs to provide transactuation semantics. We selected 10 SmartThings applications from the applications that we statically analyzed. These applications are publicly available in the SmartThings repository [19]. The applications cover the four most common categories: Security (Sc), Safety (Sf), Convenience (Cn), and Energy Efficiency (Ee). Instead of using the original version that runs on the SmartThings cloud, we implemented the following three versions of the applications, which run on Azure Functions using JavaScript (Node.js) [44]. This allows us to compare an application with transactuations against an application without transactuations in an apples-to-apples fashion.

• BE: we wrote a best-effort version (BE) of the applications without the transactuation abstraction. The BE version follows the default semantics that ignores device failure, exactly-once execution, and isolation.
• BE+Con: since the BE version ignores potential failures in devices or applications, we implemented a best-effort with consistency (BE+Con) version of an application which adds code that keeps device states consistent with application states. More specifically, BE+Con introduces both sensor window validation and soft state rollback code. However, it ignores the isolation guarantee that transactuations provide.
• TN: we also implemented these applications with the transactuation abstraction (TN). 5 of the 10 evaluated applications used T-Chain to establish order among hard and soft states.

Experimental setup. We set up SmartThings-compatible devices and measured the round-trip latency of four devices in a typical smart home: a door lock, a bulb, a power strip, and a smart power plug. The door lock has a significant latency of nearly 3.6s on average and a maximum of nearly 9.8s, over 100 trials. The other devices incur an average latency of nearly 0.7s, with a maximum of nearly 3.7s. Since we had a limited set of devices, we parallelized our experiments by simulating the devices, using the measured latency data, on a Raspberry Pi 3 Model B [13]. It comes with a 1.2 GHz 32-bit quad-core ARM Cortex-A53 processor and 1 GB RAM. In addition, the simulator allowed us to easily inject failures for our experiments.

2019 USENIX Annual Technical Conference


| Application | #HR | #HW | Transactuation Policy | LOC (BE) | LOC (BE+Con) | LOC (TN) |
|---|---|---|---|---|---|---|
| Rise And Shine (Cn1) | 1 (*) | 1 | 2 (none, none) | 72 | 195 | 68 |
| Whole House Fan (Cn2) | 1 (*), 3 | 2 (*) | 1 (none, none) | 29 | 176 | 26 |
| Thermostat Auto Off (Cn3) | 1 (*) | 2 | 1 (all, none), 1 (all, all), 1 (none, all) | 70 | 198 | 68 |
| Auto Humidity Vent (Ee1) | 1 (*), 1 | 3 (*), 1 | 1 (any, none), 1 (none, any), 1 (none, none), 1 (all, any) | 49 | 170 | 100 |
| Lights Off With No Motion (Ee2) | 1 (*), 1 | 1 (*) | 2 (all, all) | 56 | 161 | 67 |
| Cameras On When Away (Sc1) | 2 (*) | 2 (*) | 1 (all, none), 1 (any, none) | 31 | 149 | 88 |
| Nobody Home (Sc2) | 1 (*) | 1 | 1 (all, none), 1 (any, none), 1 (none, none) | 65 | 175 | 62 |
| Smart Security (Sc3) | 2 (*) | 2 (*) | 1 (all, all) | 144 | 323 | 144 |
| CO2 Vent (Sf1) | 1 | 2 (*) | 1 (all, all) | 29 | 152 | 26 |
| Lock It When I Leave (Sf2) | 3 (*) | 2 (*), 2 | 2 (none, none), 1 (all, none) | 51 | 180 | 54 |

Table 2: Properties of each benchmark application, including the number of hard reads (#HR) and hard writes (#HW) (* denotes an operation on an array of devices with a single command; for example, 2 (*) means 2 operations, each accessing a device group); the fault-tolerance policies for the TN configuration in the format (sensing, actuating) (Col 4); and programmability shown by LOC comparison among transactuation (TN), best effort (BE), and best effort with consistency (BE+Con) (Col 5).

6.1 Programmability

To evaluate the programmability and convenience of using transactuations, in contrast to manually writing failure-handling code, we compare the lines of code (LOC) of the applications, measured with CLOC [6]. Table 2 shows the programmability evaluation (LOC) along with the number of hard reads and writes, and the transactuation policies we employ for each application. Observe that the TN and BE versions are comparable in LOC even though the BE version provides no guarantees. The exception is Ee1, where we introduce new soft states and four transactuations, each part of a T-Chain, in order to ensure consistency. The BE+Con version requires substantial code to explicitly handle failures. As mentioned earlier, the BE+Con version validates sensor freshness in a way similar to transactuations, and may roll back soft states after determining the outcome of actuations for hard-write-to-soft-write dependencies. Finally, although transactuations require more code in order to create T-Chains, they handle failures automatically and considerably simplify writing reliable applications.

6.2 Correctness

Table 3 shows the applications that we evaluated with their inherent undesirable behaviors on transient or longer-duration failures. The second column shows the undesirable behaviors, and the third column shows the outcome of using transactuations. The last column explains the mechanism transactuations use to resolve or mitigate the issue. We considered different types of failures that transactuations can address (i.e., unavailable sensors and failed actuations), and injected these failures by dropping event or actuation messages. Transactuations address these issues with three techniques. First, sensor staleness validation prevents the execution of the perform lambda and executes the onFailure lambda, which can notify a user. Second, actuation losses are detected automatically and the associated soft writes are rolled back to ensure consistency. Third, when one actuation depends on another, we use an intermediate soft state to chain two transactuations, each having actuations. For example, in the Sc3 (Smart Security) application, the inconsistency between the alarm actuation and the soft write is resolved using rollback. However, some applications need multiple chained transactuations to correctly address actuation dependencies.
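The second technique, rolling back soft writes when an actuation is lost, can be illustrated with a small undo log. The sketch below is an assumption-laden illustration rather than the Relacs implementation: the function names, the in-memory store, and the boolean actuation outcome are all hypothetical.

```javascript
// Hypothetical sketch of soft-state rollback; not the Relacs implementation.

// Apply soft writes speculatively and remember the previous values.
function speculativeCommit(store, softWrites) {
  const undo = {};
  for (const key of Object.keys(softWrites)) {
    undo[key] = store[key]; // undefined means the key did not exist before
    store[key] = softWrites[key];
  }
  return undo;
}

// Restore the values recorded in the undo log.
function rollback(store, undo) {
  for (const key of Object.keys(undo)) {
    if (undo[key] === undefined) delete store[key];
    else store[key] = undo[key];
  }
}

// Final commit: keep the soft writes only if the actuation succeeded.
function finalCommit(store, undo, actuationSucceeded) {
  if (!actuationSucceeded) rollback(store, undo);
  return actuationSucceeded;
}

const state = { alarmActive: false };
const undo = speculativeCommit(state, { alarmActive: true });
finalCommit(state, undo, false); // the alarm command was lost
console.log(state.alarmActive); // false
```

Because the soft state returns to false, a later motion event can attempt the alarm actuation again instead of being suppressed by a stale flag.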

6.3 Overhead

To evaluate the overhead of transactuations, we measured the execution time of the applications as follows. We started timing when an application began executing, and stopped when every soft write committed and all actuations completed. Our performance results are summarized in Figure 1. Each value is the mean of 30 runs, with 95% confidence intervals.

Failure-free. We first compare the execution times of the TN and BE versions without any injected failures. The overhead of transactuations is attributed to (1) safeguarding against inconsistencies due to inherently concurrent execution, (2) providing fault tolerance, and (3) enforcing the actuation orders of T-Chains. We note that the final committer function imposes significant overhead on Relacs since it is invoked² automatically by Cosmos DB updates. For instance, we observed that its start may be delayed by between zero and five seconds. The periodic final committer, which we set to run every second, helps to mitigate this overhead. Figure 1a shows that, on average (geomean), the TN version incurs a 1.5 times slowdown compared to BE. Observe that the speculative commit duration (TN.SC) is significantly smaller than the final commit duration (TN.FC). Figure 1a also breaks down the final commit time into actuation time (TN.FC.ACT) and the final-committer triggering overhead (TN.FC.TRIG). As mentioned earlier, the triggering overhead is significant, especially in the case of a long T-Chain like Ee1 (4 transactuations).

2 Other functions except the re-executor are invoked by HTTP calls.

With failure. In this scenario, we conducted two experiments. In each experiment, we used a dummy application that issued a dummy actuation and updated a dummy soft state. In the first experiment, the dummy actuation turned on a smart switch (low-latency actuation). In the second one, it actuated a door lock (high-latency actuation). We introduced an artificial data dependency (RAW) by forcing all benchmark applications to read the dummy soft state before executing their core logic. Lastly, we injected a failure into the dummy actuation to trigger failure detection and handling in the dummy application, and re-execution of the benchmark applications to repair the broken data dependency. Because devices have different actuation latencies, the timeout thresholds to declare failed actuations are specific to each device. More specifically, we used the maximum observed latency for each device (i.e., 4s for the smart switch and 10s for the door lock). Figure 1b compares the execution time of the failure-free case against the two failure experiments. The additional overhead we observe here is the failure detection overhead, which includes the timeout (TN.FD.TO) and the overhead of triggering the re-executor function (TN.FD.TRIG). Similar to the final committer, the re-executor is invoked automatically by Cosmos DB when actuations are marked as failed, so it incurs similar overhead. Observe that the failure experiments have two stacked bars of speculative commits. The second bar shows the re-execution of transactuations with broken dependencies. As expected, introducing a failure results in longer execution times for the applications. This slowdown is caused by the timeout threshold plus the re-executor triggering overhead (~2s). Moreover, the difference between the middle and right bars for each application is the difference in timeout thresholds for low- and high-latency actuations (~6s).

App: Cn1 Cn2 Cn3 Ee1 Ee2 Sc1 Sc2 Sc3 Sf1 Sf2

Undesirable consequence: Mode not set permanently Incorrect behavior Fans not ON irreversibly Thermostat not OFF Incorrect mode Incorrect energy and operation time reported Incorrect behavior Incorrectly turning lights ON/OFF Incorrect behavior Actuation failure Incorrect mode set Home mode change w/o notification Intruder motion not detected Alarm not active irreversibly Incorrect behavior Exhausts not ON irreversibly Door unlocked but home vacant Door locked at arrival

Transactuation effect: Issue detected and user notified Issue detected and user notified Issue detected and user notified Issue detected and user notified Issue detected and user notified Issue detected and user notified Issue detected and user notified Issue detected and user notified

Mechanism used: Soft state rollback Sensor staleness validation Soft state rollback Soft state rollback Soft state rollback Soft state rollback and chaining Sensor staleness validation Sensor staleness validation Sensor staleness validation Chaining Sensor staleness validation soft state rollback Sensor staleness validation soft state rollback Sensor staleness validation soft state rollback Sensor staleness validation Chaining

Table 3: Applications with undesirable consequences on induced failures. Column 3 shows failure avoidance or mitigation when written with transactuations. Column 4 shows the internal mechanism used by the transactuations. A checkmark implies that transactuation automatically resolves the issue.
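The arithmetic behind the failure experiments can be written out directly. The sketch below uses only the figures stated in the text (4s and 10s timeout thresholds, and a roughly 2s re-executor triggering overhead); the function and constant names are made up for illustration, and the geomean helper simply shows how a "gm" summary is computed.

```javascript
// Timeout thresholds and the approximate triggering overhead come from the
// text; all identifiers here are illustrative.
const TIMEOUT_S = { smartSwitch: 4, doorLock: 10 };
const REEXECUTOR_TRIGGER_S = 2; // approximate (~2s)

// Added delay when an actuation fails: wait for the timeout to expire,
// then wait for the re-executor function to be triggered.
function failureDetectionOverhead(timeoutS, triggerS) {
  return timeoutS + triggerS;
}

// Geometric mean, the summary statistic reported as "gm".
function geomean(values) {
  const logSum = values.reduce((acc, v) => acc + Math.log(v), 0);
  return Math.exp(logSum / values.length);
}

const low = failureDetectionOverhead(TIMEOUT_S.smartSwitch, REEXECUTOR_TRIGGER_S);
const high = failureDetectionOverhead(TIMEOUT_S.doorLock, REEXECUTOR_TRIGGER_S);
console.log(low, high, high - low); // 6 12 6
```

The last value matches the roughly 6s gap between the middle and right bars, which comes entirely from the difference between the two timeout thresholds.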

7 Related Work

Checking Correctness. Soteria [22] employs model checking to identify contradicting interactions between IoT applications. For example, water leak detection turns off a water valve while smoke detection attempts to turn on a fire sprinkler. Prior work such as DeLorean [24] models absolute and relative time to find timing bugs in event-driven programs, e.g., a door open at unsafe times. In contrast, our work tackles a different problem, the lack of reliability and isolation, using a dynamic technique. IoT analyses also use techniques similar to dynamic taint analysis to detect the source of security breaches [46], and dynamic program slicing to explain behaviors [40]. We use static dependence analysis to report potential problems.

Programming abstractions. Using speculative execution to improve latency and performance is a common technique in many transactional and replicated systems. These can be classified into two categories: systems [34, 41, 47] that hide the effects of speculation from applications, and work [29, 31, 43] that exposes speculation results to applications. While certain applications in the latter case can benefit by reading speculative values, they need to handle possible side effects of acting on misspeculated values. With Relacs, the effects of speculatively committed transactuations are exposed to other transactuations. Yet, no transactuation can finally commit and actuate devices until all transactuations that it speculatively read from finally commit. Planet [43] provides a mechanism to speculate on the partial state of a transaction in distributed environments. The abstraction allows a developer to continue based on a predicted outcome, and later receive a confirmation or an apology. In contrast, we target a different environment and problem, and provide a simplified way to address device failure handling.

Execution semantics and conflict detection. IOTA [40] defines a calculus for programs in the IoT domain. It also defines an execution semantics to eliminate races on actions against the same physical event. Similar races can be resolved in our system by reordering transactuations according to programmer annotations, similar to Zave et al. [48]. IOTA also shows offline analyses to detect device conflicts. Conflict detection in a home can include static model checking [38] or dynamic analyses [48] to detect feature interactions [38] and accesses to the same device [26]. They detect commands to the same device that are due to a single event or to concurrent independent events, e.g., simultaneously turning a device on and off. The execution semantics of our system provides isolation naturally and can easily be enhanced to report device interactions by intersecting the read-write sets of transactuations dynamically.

[Figure 1 plots: y-axis, Execution Time (s); x-axis, Applications (Cn1, Cn2, Cn3, Ee1, Ee2, Sc1, Sc2, Sc3, Sf1, Sf2, gm).]

(a) Execution times for BE and TN versions in the failure-free case. We break down the execution time of TN into speculative commit (TN.SC) and final commit (TN.FC). TN.FC is shown as actuation time (TN.FC.ACT) and as overhead to trigger the final-committer function (TN.FC.TRIG).

(b) Execution time comparison for failure-free and failure cases. For each application, we show 3 bars: the failure-free case (the left bar), the low-latency actuation failure case (the middle bar), and the high-latency actuation failure case (the right bar). For the failure cases, the breakdown includes failure detection time (TN.FD), which is subdivided into timeout detection (TN.FD.TO) and re-execution triggering overhead (TN.FD.TRIG).

Figure 1: The execution time of 10 applications chosen from the SmartThings repository and their geomean (gm) for BE and TN versions of the applications in failure-free and failure scenarios.
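The rule that a transactuation cannot finally commit until every transactuation it speculatively read from has finally committed can be expressed as a simple dependency check. The sketch below is a hypothetical illustration: the status values, the readFrom field, and the function name are assumptions and do not reflect how Relacs stores this information.

```javascript
// Hypothetical bookkeeping; statuses and field names are assumptions.
// status is one of 'speculative', 'final', or 'aborted'.
function canFinalCommit(txId, txs) {
  const tx = txs[txId];
  if (tx.status !== 'speculative') return false;
  // Every transactuation this one speculatively read from must be final.
  return tx.readFrom.every((dep) => txs[dep].status === 'final');
}

const txs = {
  t1: { status: 'speculative', readFrom: [] },
  t2: { status: 'speculative', readFrom: ['t1'] },
};
console.log(canFinalCommit('t2', txs)); // false: t1 has not finally committed
txs.t1.status = 'final';
console.log(canFinalCommit('t2', txs)); // true
```

Gating actuations on this check is what keeps the effects of misspeculation away from physical devices, even though speculative values are visible to other transactuations.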

8 Conclusion

In this paper, we identified a fundamental problem that arises due to failures in IoT systems that interact with the physical world. We analyzed smart-home applications, and showed how application semantics is broken due to different failures that occur in an IoT environment. We introduced an abstraction, called transactuation, that allows a developer to build reliable IoT applications. Our runtime, called Relacs, enforces the semantic guarantees of transactuations. Our evaluation demonstrated programmability, performance, and effectiveness of the transactuation abstraction on top of our runtime.

9 Acknowledgment

We would like to thank our shepherd, Gernot Heiser, and the anonymous reviewers for their insightful and valuable feedback. We would also like to thank Nitin Agrawal, Arani Bhattacharya, Juan Colmenares, Iqbal Mohomed, Marc Shapiro, Pierre Sutra, Ahmad Bisher Tarakji, and Ashish Vulimiri for their suggestions and helpful discussions.


References

[1] Arrow functions. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Functions/Arrow_functions.

[2] AWS Lambda Retry Behavior. https://docs.aws.amazon.com/lambda/latest/dg/retries-on-errors.html.

[3] Azure Cosmos DB server-side programming: Stored procedures, database triggers, and UDFs. https://docs.microsoft.com/en-us/azure/cosmos-db/programming.

[4] Bluetooth Low Energy. https://www.bluetooth.com.

[5] CO2 Vent. https://github.com/SmartThingsCommunity/SmartThingsPublic/tree/master/smartapps/dianoga/co2-vent.src.

[6] Count Lines of Code. http://cloc.sourceforge.net.

[7] Expressions. https://docs.python.org/2/reference/expressions.html.

[8] Groovy ast interface. http://docs.groovy-lang.org/docs/groovy-2.4.0/html/api/org/codehaus/groovy/ast/package-summary.html.

[9] Inconsistent Behavior. https://community.smartthings.com/t/inconsistent-behavior/35284.

[10] IoTBench-test-suite. https://github.com/IoTBench/IoTBench-test-suite/tree/master/openHAB.

[11] Lambda Expressions. https://docs.oracle.com/javase/tutorial/java/javaOO/lambdaexpressions.html.

[12] OpenHAB: Empowering the Smart Home. https://www.openhab.org.

[13] Raspberry Pi 3 Model B. https://www.raspberrypi.org/products/raspberry-pi-3-model-b/.

[14] SmartThings. http://www.smartthings.com/.

[15] SSA1 / SSA2 Instruction Manual. https://support.smartthings.com/hc/en-us/article_attachments/200715310/ssa_manual_14may2011_-_new_address0.pdf.

[16] Z-Wave Alliance. http://www.z-wavealliance.org.

[17] ZigBee Alliance. http://www.zigbee.org/.

[18] Web Services SmartThings. https://docs.smartthings.com/en/latest/smartapp-webservices-developers-guide/index.html, 2018.

[19] SmartThings Smart Apps. https://github.com/SmartThingsCommunity/SmartThingsPublic/tree/master/smartapps, 2019.

[20] Masoud Saeida Ardekani, Rayman Preet Singh, Nitin Agrawal, Douglas B. Terry, and Riza O. Suminto. Rivulet: A Fault-tolerant Platform for Smart-home Applications. In Proceedings of the 18th Doctoral Symposium of the 18th International Middleware Conference (MIDDLEWARE ’17), Las Vegas, NV, December 2017.

[21] Masoud Saeida Ardekani and Douglas B. Terry. A Self-Configurable Geo-Replicated Cloud Storage System. In Proceedings of the 11th Symposium on Operating Systems Design and Implementation (OSDI ’14), Broomfield, CO, October 2014.

[22] Z. Berkay Celik, Patrick McDaniel, and Gang Tan. Soteria: Automated IoT Safety and Security Analysis. In Proceedings of the 2018 USENIX Annual Technical Conference (ATC ’18), Boston, MA, July 2018.

[23] Keith D. Cooper, Timothy J. Harvey, and Ken Kennedy. A Simple, Fast Dominance Algorithm. Rice University, CS Technical Report 06-33870, January 2001.

[24] Jason Croft, Ratul Mahajan, Matthew Caesar, and Madan Musuvathi. Systematically Exploring the Behavior of Control Programs. In Proceedings of the 2015 USENIX Annual Technical Conference (ATC ’15), Santa Clara, CA, July 2015.

[25] James Davis, Arun Thekumparampil, and Dongyoon Lee. Node.Fz: Fuzzing the Server-Side Event-Driven Architecture. In Proceedings of the 2017 European Conference on Computer Systems (EuroSys ’17), Belgrade, Serbia, April 2017.

[26] Colin Dixon, Ratul Mahajan, Sharad Agarwal, A. J. Brush, Bongshin Lee, Stefan Saroiu, and Paramvir Bahl. An Operating System for the Home. In Proceedings of the 9th Symposium on Networked Systems Design and Implementation (NSDI ’12), San Jose, CA, April 2012.

[27] Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. The Program Dependence Graph and Its Use in Optimization. ACM Transactions on Programming Languages and Systems, 9(3):319–349, July 1987.

[28] Jayavardhana Gubbi, Rajkumar Buyya, Slaven Marusic, and Marimuthu Palaniswami. Internet of Things (IoT): A vision, architectural elements, and future directions. Future Generation Computer Systems, 29(7):1645–1660, September 2013.

[29] Rachid Guerraoui, Matej Pavlovic, and Dragos-Adrian Seredinschi. Incremental Consistency Guarantees for Replicated Objects. In Proceedings of the 12th Symposium on Operating Systems Design and Implementation (OSDI ’16), Savannah, GA, November 2016.

[30] Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria. What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems. In Proceedings of the 5th ACM Symposium on Cloud Computing (SoCC ’14), Seattle, WA, November 2014.

[31] Pat Helland and Dave Campbell. Building on Quicksand. In Proceedings of the 4th Conference on Innovative Data Systems Research (CIDR ’09), Pacific Grove, CA, January 2009.

[32] Scott Hendrickson, Stephen Sturdevant, Tyler Harter, Venkateshwaran Venkataramani, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Serverless Computation with OpenLambda. In Proceedings of the 8th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud ’16), Denver, CO, June 2016.

[33] Timothy W. Hnat, Vijay Srinivasan, Jiakang Lu, Tamim I Sookoor, Raymond Dawson, John Stankovic, and Kamin Whitehouse. The hitchhiker’s guide to successful residential sensing deployments. In Proceedings of the 9th ACM Conference on Embedded Networked Sensor Systems (SenSys ’11), Seattle, WA, November 2011.

[34] Manos Kapritsos, Yang Wang, Vivien Quéma, Allen Clement, Lorenzo Alvisi, and Mike Dahlin. All about Eve: Execute-Verify Replication for Multi-Core Servers. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’12), October 2012.

[35] Mary Beth Kery, Claire Le Goues, and Brad A. Myers. Examining Programmer Practices for Locally Handling Exceptions. In Proceedings of the 13th International Conference on Mining Software Repositories (MSR ’16), Austin, TX, May 2016.

[36] Tim Kraska, Gene Pang, Michael J. Franklin, Samuel Madden, and Alan Fekete. MDCC: Multi-Data Center Consistency. In Proceedings of the 2013 European Conference on Computer Systems (EuroSys ’13), Prague, Czech Republic, April 2013.

[37] Tanakorn Leesatapornwongsa, Mingzhe Hao, Pallavi Joshi, Jeffrey F. Lukman, and Haryadi S. Gunawi. SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems. In Proceedings of the 11th Symposium on Operating Systems Design and Implementation (OSDI ’14), Broomfield, CO, October 2014.

[38] Chieh-Jan Mike Liang, Börje F. Karlsson, Nicholas D. Lane, Feng Zhao, Junbei Zhang, Zheyi Pan, Zhao Li, and Yong Yu. SIFT: Building an Internet of Safe Things. In Proceedings of the 14th International Conference on Information Processing in Sensor Networks (IPSN ’15), Seattle, WA, April 2015.

[39] Shan Lu, Soyeon Park, Eunsoo Seo, and Yuanyuan Zhou. Learning from Mistakes — A Comprehensive Study on Real World Concurrency Bug Characteristics. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’08), Seattle, WA, March 2008.

[40] Julie L. Newcomb, Satish Chandra, Jean-Baptiste Jeannin, Cole Schlesinger, and Manu Sridharan. IOTA: A Calculus for Internet of Things Automation. In Proceedings of the 2017 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (ONWARD ’17), Vancouver, Canada, October 2017.

[41] Edmund B. Nightingale, Peter M. Chen, and Jason Flinn. Speculative execution in a distributed file system. In Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP ’05), pages 191–205, 2005.

[42] Edward Oakes, Leon Yang, Dennis Zhou, Kevin Houck, Tyler Harter, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau. SOCK: Rapid task provisioning with serverless-optimized containers. In Proceedings of the 2018 USENIX Annual Technical Conference (USENIX ATC ’18), Boston, MA, July 2018.

[43] Gene Pang, Tim Kraska, Michael J. Franklin, and Alan Fekete. PLANET: Making Progress with Commit Processing in Unpredictable Environments. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD ’14), Snowbird, UT, June 2014.

[44] Stefan Tilkov and Steve Vinoski. Node.js: Using JavaScript to Build High-Performance Network Programs. IEEE Internet Computing, 14(6):80–83, November 2010.

[45] Blase Ur, Elyse McManus, Melwyn Pak Yong Ho, and Michael L. Littman. Practical Trigger-Action Programming in the Smart Home. In Proceedings of the 2014 SIGCHI Conference on Human Factors in Computing Systems (CHI ’14), Toronto, Canada, April 2014.

[46] Qi Wang, Wajih Ul Hassan, Adam M. Bates, and Carl A. Gunter. Fear and Logging in the Internet of Things. In Proceedings of the 25th Annual Network and Distributed System Security Symposium (NDSS ’18), San Diego, CA, February 2018.

[47] Benjamin Wester, James A. Cowling, Edmund B. Nightingale, Peter M. Chen, Jason Flinn, and Barbara Liskov. Tolerating Latency in Replicated State Machines Through Client Speculation. In Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’09), April 2009.

[48] Pamela Zave, Eric Cheung, and Svetlana Yarosh. Toward user-centric feature composition for the Internet of Things. arXiv preprint arXiv:1510.06714, October 2015.


Transactuations: Where Transactions Meet the Physical World Aritra Sengupta*, Tanakorn Leesatapornwongsa*, Masoud Saeida Ardekani*, Cesar A. Stuardo✚ *



When Smart Home Is Not Smart

Inconsistent Behavior, General SmartThings Discussion [1]

“... More importantly, we were robbed when we were out on vacation. I had it set to armed away. The logs show the motion of the robbers, but it never sounded the alarm ... I no longer trust it to do what it is supposed to do when it is supposed to do ...”

“... this system imo is a hobby for Samsung and has never been consistent since day one. There is no way I would trust this system to do anything beyond monitoring under sink leaks.”

[1] SmartThings Community: https://community.smartthings.com

Failures in IoT
• Device failures
  • Battery drainage
  • Subsystem failures
  • Etc.
• Network failures
  • RF interference
  • Copper slap flooring
  • Etc.
• Concurrent execution
  • Data race bugs

Analyzed 182 apps and found 309 possible problems

Solution Transaction

Transactuation

Transactuation
• High-level abstraction for building reliable IoT applications
• Providing strong semantics for systems interacting with the physical world

Track II: Session Runtimes

Transactuations: Where Transactions Meet the Physical World*
USENIX ATC ’19

Aritra Sengupta (Samsung Research)
Tanakorn Leesatapornwongsa (Microsoft Research)
Masoud Saeida Ardekani (Uber Technologies)
Cesar A. Stuardo (University of Chicago)

* Work done at Samsung Research America

• IoT solutions are becoming ubiquitous
• Hundreds of applications for smart homes
  • Automation
  • Security
  • Safety
• Early stage and immature

Failure implication goes beyond inconvenience!

When Smart Home Is Not Smart

Inconsistent Behavior [1], Upset Customer A

“... More importantly, we were robbed when we were out on vacation. I had it set to armed away. The logs show the motion of the robbers, but it never sounded the alarm ... I no longer trust it to do what it is supposed to do when it is supposed to do ...”

[1] SmartThings Community: https://community.smartthings.com

Intrusion Detection Application

    function handleMotion(evt) {
      // isIntruder reads other sensors
      // and determines intrusion
      if (isIntruder(evt) && !state.alarmActive) {
        alarm.strobe();
        state.alarmActive = true;
      }
    }

• Read sensor and app state: isIntruder(evt) && !state.alarmActive
• Actuating a device: alarm.strobe()
• Writing app state: state.alarmActive = true (for avoiding redundant actions)

Failure Example: Inconsistency

!state.alarmActive

state.alarmActive = true (for avoiding redundant actions)

What if the actuation command is lost, or there is a glitch in the alarm?

Physical state != Application state

WARNING! The alarm is based on wireless transmissions … can be subject to RF interference, … cause the alarm to not operate as intended …
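The inconsistency can be reproduced with a small simulation. In the sketch below, the handler logic follows the intrusion detection slide, while makeAlarm, its dropCommands switch, and the injected isIntruder function are hypothetical stand-ins added so the example runs on its own.

```javascript
// Simulated alarm; `dropCommands` is a hypothetical switch that models a
// lost actuation command (for example, due to RF interference).
function makeAlarm(dropCommands) {
  const alarm = { ringing: false };
  alarm.strobe = () => {
    if (!dropCommands) alarm.ringing = true;
  };
  return alarm;
}

// Same logic as the slide's handler, with its dependencies passed in.
function handleMotion(evt, alarm, state, isIntruder) {
  if (isIntruder(evt) && !state.alarmActive) {
    alarm.strobe();
    state.alarmActive = true; // written even if the command was lost
  }
}

const alarm = makeAlarm(true);
const state = { alarmActive: false };
const always = () => true;
handleMotion({}, alarm, state, always);
handleMotion({}, alarm, state, always); // ignored: the app believes the alarm is on
console.log(state.alarmActive, alarm.ringing); // true false
```

After the first lost command, the application state says the alarm is active while the physical alarm is silent, and every later motion event is suppressed by the stale flag.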

Failure makes application and device states inconsistent

Inherent concurrency in applications also leads to inconsistencies

How often can inconsistencies happen?
• Identified 3 classes of dependencies in application logic
• Dependencies capture semantic relationship between app and device
• These 3 dependencies are vulnerable to failures

By statically analyzing applications for dependencies, we can identify potential inconsistencies in smart applications

Dependency

1. Sensing → actuating

    c = co2.value()        // Reading sensor
    if (c > threshold) {
      fans.on()            // Actuating based on sensor read
    }

2. Sensing → app state update

    t = thermo.value()     // Reading sensor
    if (t > 90) {
      setMode("HOT")       // Updating app state based on sensor
    }

3. Actuating → app state update

    alarm.strobe()         // Actuating device
    active = "TRUE"        // Updating app state tied to device

Can Transactions address the problem?

NO
• IoT devices cannot be locked
  • Users can observe intermediate values
• Rolling back IoT devices has consequences
  • A user observes a door lock, then roll back to unlocked
  • Not a good user experience!
• Some actuations cannot be rolled back
  • Undoing a water dispenser

Transactuation
• High-level abstraction and programming model
  • Allows a developer to read/write from/to devices
  • Failure-aware association of application and device states
• Atomic durability for application states
  • Actuations never roll back
• (Internal) atomic visibility among transactuations
  • External atomic visibility cannot be guaranteed for end users!
  • Disallows several concurrency-related bugs
• Guarantees two invariants
  • Sensing Invariant: governs executing a transactuation
  • Actuating Invariant: governs committing a transactuation

Sensing Invariant
A transactuation executes only when the staleness of its sensor reads is bounded, as specified by the sensing policy.

Sensing policy
• How much staleness is acceptable
• How many failed sensors are acceptable

Example of sensing policy: at least one CO2 sensor can be read within the last 5 minutes

Actuating Invariant
When a transactuation commits its app states, a sufficient number of actuations have succeeded, as specified by the actuation policy.

Actuation policy
• How many failed actuations are acceptable

Example of actuation policy: at least one alarm should successfully turn on
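The two example policies can be read as simple predicates. The sketch below is one possible encoding, not the Relacs implementation: the function names, the `{ maxStalenessMs, minFresh }` and `{ minSucceeded }` policy shapes, and the sensor and device identifiers are all assumptions made for illustration.

```javascript
// Hypothetical check for a sensing policy: at least `minFresh` readings
// must be no older than `maxStalenessMs` at time `nowMs`.
function sensingPolicySatisfied(readings, policy, nowMs) {
  const fresh = readings.filter(
    (r) => nowMs - r.timestampMs <= policy.maxStalenessMs
  );
  return fresh.length >= policy.minFresh;
}

// Hypothetical check for an actuation policy: at least `minSucceeded`
// actuations must have reported success.
function actuationPolicySatisfied(results, policy) {
  const succeeded = results.filter((r) => r.ok).length;
  return succeeded >= policy.minSucceeded;
}

const MINUTE_MS = 60 * 1000;
const now = 10 * MINUTE_MS; // fixed clock so the example is deterministic

// "At least one CO2 sensor can be read within the last 5 minutes"
const co2Readings = [
  { sensor: 'co2-kitchen', timestampMs: now - 2 * MINUTE_MS }, // fresh
  { sensor: 'co2-hall', timestampMs: now - 9 * MINUTE_MS },    // stale
];
const canExecute = sensingPolicySatisfied(
  co2Readings,
  { maxStalenessMs: 5 * MINUTE_MS, minFresh: 1 },
  now
);

// "At least one alarm should successfully turn on"
const alarmResults = [
  { device: 'alarm-1', ok: false },
  { device: 'alarm-2', ok: true },
];
const canCommit = actuationPolicySatisfied(alarmResults, { minSucceeded: 1 });

console.log(canExecute, canCommit);
```

Under this encoding, the sensing check gates whether the transactuation may start, and the actuation check gates whether its app state may be committed.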

Simplified Example

let tx = new Transactuation();

tx.perform(
  ['co2'], 5m, 'sense_all',    // sensing policy
  'act_all',                   // actuating policy
  (sensors) => {               // application logic
    let active = read('active');
    if (sensors['co2'] > threshold && !active) {
      actuate('fans', 'on');
      write('active', true);
    }
    ...
  }
);

Execution Model
1. Start if the sensing policy is satisfied: execute app logic, defer actuations
2. Speculative commit: find a serializable order, avoid rollback, trigger T2
3. Final commit according to the actuating policy: actuate devices (final commit phase)

Overlapping computation and actuation
[Figure: timeline of T1 through its final commit phase while devices are actuated. Labels: Trigger T2, Trigger T3, Wait.]
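The three phases of the execution model can be sketched as a toy, single-threaded function. This is an illustration under stated assumptions, not the Relacs runtime: `runTransactuation`, its parameter names, the `Map`-based store, and the choice to drop buffered writes when the actuating policy fails are all hypothetical, and the sketch does not model ordering against concurrent transactuations.

```javascript
// Toy model of the three phases; every name here is hypothetical.
function runTransactuation({ store, sensors, devices, sensingOk, actuationOk, logic }) {
  // Phase 1: start only if the sensing policy is satisfied.
  if (!sensingOk(sensors)) return { status: 'not-started' };

  // Execute app logic: writes are buffered and actuations are deferred.
  const writes = new Map();
  const deferred = [];
  const ctx = {
    read: (key) => (writes.has(key) ? writes.get(key) : store.get(key)),
    write: (key, value) => writes.set(key, value),
    actuate: (device, command) => deferred.push({ device, command }),
  };
  logic(sensors, ctx);

  // Phase 2: speculative commit. A real runtime would pick a serializable
  // order here; this toy has no concurrent transactuations to order.

  // Phase 3: actuate devices, then commit app state only if the
  // actuating policy holds. Successful actuations are never undone.
  const succeeded = deferred.filter(
    ({ device, command }) => devices[device](command)
  ).length;
  if (!actuationOk(succeeded, deferred.length)) return { status: 'aborted' };
  for (const [key, value] of writes) store.set(key, value);
  return { status: 'committed' };
}

const threshold = 1000;
const co2Logic = (sensors, { read, write, actuate }) => {
  if (sensors.co2 > threshold && !read('active')) {
    actuate('fans', 'on');
    write('active', true);
  }
};
const actAll = (succeeded, total) => succeeded === total;

const store = new Map([['active', false]]);
const result = runTransactuation({
  store,
  sensors: { co2: 1200 },
  devices: { fans: () => true }, // fan accepts the command
  sensingOk: (s) => typeof s.co2 === 'number',
  actuationOk: actAll,
  logic: co2Logic,
});
console.log(result.status, store.get('active'));
```

If the fan rejects the command, the same call returns an aborted status and leaves `active` unchanged, so the app state never claims an actuation that did not happen.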

Implementation: Relacs
• A runtime called Relacs, built on Azure technology
  • Azure Functions (serverless functions)
  • Cosmos DB (Relacs store)
• Integrated with the Samsung SmartThings IoT platform

Evaluation
• Programmability
• Correctness
• Runtime overhead without failures
• Runtime overhead with failures

Programmability

Lines of code

Application                        Original App   Original App + Consistency   Transactuation
Rise and Shine (Cn1)                         72                          195               68
Whole House Fan (Cn2)                        29                          176               26
Thermostat Auto Off (Cn3)                    70                          198               68
Auto Humidity Vent (Ee1)                     49                          170              100
Lights Off With No Motion (Ee2)              56                          161               67
Cameras On When Away (Sc1)                   31                          149               88
Nobody Home (Sc2)                            65                          175               62
Smart Security (Sc3)                        144                          323              144
Co2 Vent (Sf1)                               29                          152               26
Lock It When I Leave (Sf2)                   51                          180               54
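As a quick way to summarize the programmability table, the short script below totals each column and derives the per-app averages. The numbers are copied from the table; the averaging itself is an added illustration and does not appear in the slides.

```javascript
// Lines of code per app: [app, original, original + consistency, transactuation]
const loc = [
  ['Cn1', 72, 195, 68],
  ['Cn2', 29, 176, 26],
  ['Cn3', 70, 198, 68],
  ['Ee1', 49, 170, 100],
  ['Ee2', 56, 161, 67],
  ['Sc1', 31, 149, 88],
  ['Sc2', 65, 175, 62],
  ['Sc3', 144, 323, 144],
  ['Sf1', 29, 152, 26],
  ['Sf2', 51, 180, 54],
];

// Sum one numeric column across all ten apps.
const total = (col) => loc.reduce((sum, row) => sum + row[col], 0);

const totals = {
  original: total(1),       // 596
  consistency: total(2),    // 1879
  transactuation: total(3), // 703
};
const averages = Object.fromEntries(
  Object.entries(totals).map(([name, sum]) => [name, sum / loc.length])
);
console.log(totals, averages);
```

On these figures, hand-written consistency handling roughly triples the average app size, while the transactuation versions stay close to the size of the original apps.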

Runtime Overhead without Failures
[Figure: bar chart of execution time (s), y-axis 0 to 14, comparing Original and Transactuation for Cn1, Cn2, Cn3, Ee1, Ee2, Sc1, Sc2, Sc3, Sf1, Sf2, and GM]
• 50% overhead with transactuations
• Serverless function triggering overhead

Conclusion
• Established a critical reliability issue due to inconsistencies
• Transactuation allows a developer to program in a failure-aware way
• Demonstrated transactuation's programmability, performance, and effectiveness

Additional Slides

Relacs
[Figure: Relacs architecture, showing the Relacs Store, the Relacs serverless system functions, and an application transformed to a serverless function]
• Transform to serverless function
1. Read Sensors
2. Speculative Commit
3. Final Commit (actuate devices)