Parallel and Distributed Computing Applications 1773615033, 9781773615035

Parallel and Distributed Computing Applications examines various dimensions of parallel and distributed computing applic


English Pages 200 [348] Year 2018






Edited by: Zoran Gacovski







Parallel and Distributed Computing Applications Zoran Gacovski

Arcler Press
2010 Winston Park Drive, 2nd Floor
Oakville, ON L6H 5R7 Canada
Tel: 001-289-291-7705
     001-905-616-2116
Fax: 001-289-291-7601
Email: [email protected]
e-book Edition 2019
ISBN: 978-1-77361-626-1 (e-book)

This book contains information obtained from highly regarded resources. Reprinted material sources are indicated. Copyright for individual articles remains with the authors as indicated and published under Creative Commons License. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data, and the views articulated in the chapters are those of the individual contributors, and not necessarily those of the editors or publishers. Editors or publishers are not responsible for the accuracy of the information in the published chapters or the consequences of their use. The publisher assumes no responsibility for any damage or grievance to persons or property arising out of the use of any materials, instructions, methods or thoughts in the book. The editors and the publisher have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission has not been obtained. If any copyright holder has not been acknowledged, please write to us so we may rectify.

Notice: Registered trademarks of products or corporate names are used only for explanation and identification without intent of infringement.

© 2019 Arcler Press
ISBN: 978-1-77361-503-5 (Hardcover)

Arcler Press publishes a wide variety of books and eBooks. For more information about Arcler Press and its products, visit our website at

DECLARATION
Some content or chapters in this book are open-access, copyright-free published research work, which is published under Creative Commons License and indicated with the citation. We are thankful to the publishers and authors of the content and chapters, as without them this book wouldn’t have been possible.


ABOUT THE EDITOR
Dr. Zoran Gacovski earned his PhD degree at the Faculty of Electrical Engineering, Skopje. His research interests include intelligent systems and software engineering, fuzzy systems, graphical models (Petri, neural and Bayesian networks), and IT security. He has published over 50 journal and conference papers, and he has been a reviewer for renowned journals. In his career he was awarded a Fulbright postdoctoral fellowship (2002) for a research stay at Rutgers University, USA. He has also earned a best-paper award at the Baltic Olympiad for Automation Control (2002), a US NSF grant for conducting specific research in the field of human-computer interaction at Rutgers University, USA (2003), and a DAAD grant for a research stay at the University of Bremen, Germany (2008). Currently, he is a professor in Computer Engineering at European University, Skopje, Macedonia.


TABLE OF CONTENTS

List of Contributors
List of Abbreviations
Preface

SECTION I PARALLEL COMPUTING MODELS AND ALGORITHMS

Chapter 1 A Parallel Algorithm for Global Optimization Problems in a Distributed Computing Environment
    Abstract
    Introduction
    Preliminaries
    The Parallel Algorithm
    Numerical Results
    Conclusion
    Appendix
    References

Chapter 2 Parallel Programming Design of BPSK Signal Demodulation Based on CUDA
    Abstract
    Introduction
    BPSK Signal Demodulation Algorithm
    The Parallel Computing Model Based on CUDA
    BPSK Signal Demodulation Parallel Programming
    Conclusion
    References

Chapter 3 Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU
    Abstract
    Introduction
    Active Appearance Model
    GPU Architecture And CUDA Model
    Design Of Parallel AAM Fitting Algorithm For GPUs
    Experiments And Results
    Conclusion
    Acknowledgments
    References

SECTION II DISTRIBUTED COMPUTING SYSTEMS

Chapter 4 Scheduling of Divisible Loads on Heterogeneous Distributed Systems
    Introduction
    The System Model
    Analysis of Optimal Solution
    Analysis of Two-Slave System
    The Sport Algorithm
    Conclusion
    References

Chapter 5 Fault Tolerance Mechanisms in Distributed Systems
    Abstract
    Introduction
    Distributed System
    Distributed System Architecture
    Fault Tolerance Systems
    Basic Concept of Fault Tolerance Systems
    Fault Tolerance Mechanism In Distributed Systems
    Conclusion
    References

Chapter 6 Parallel and Distributed Immersive Real-Time Simulation of Large-Scale Networks
    Introduction
    Background
    Real-Time Network Simulation
    Supporting Real-Time Performance
    Applications And Case Studies
    Conclusions And Future Work
    References

SECTION III SOFTWARE IN PARALLEL AND DISTRIBUTED SYSTEMS

Chapter 7 Distributed Software Development Tools For Distributed Scientific Applications
    Abstract
    Introduction
    The Platform For Research Collaborative Computing
    Prototyping Optimal Design Platform For Engineering
    Related Work
    Conclusions
    References

Chapter 8 A Performance-Driven Approach For Restructuring Distributed Object-Oriented Software
    Abstract
    Introduction
    Problem Definition
    Distributed Object-Oriented Performance Model
    Restructuring Scheme
    Simulation and Results
    Conclusions
    References

Chapter 9 Analysis and Design of Distributed Pair Programming System
    Abstract
    Introduction
    Related Work
    Analysis And Interaction In DPP System
    Requirements Of DPP System
    Design Of DPP System
    Conclusions
    References

SECTION IV APPLICATIONS OF DISTRIBUTED COMPUTING

Chapter 10 A Distributed Computing Architecture For The Large-Scale Integration Of Renewable Energy and Distributed Resources In Smart Grids
    Abstract
    Introduction
    High-Voltage Power Grid Optimization Models
    An Asynchronous Distributed Algorithm For Stochastic Unit Commitment
    Scalable Control For Distributed Energy Resources
    Conclusions
    Acknowledgements
    Nomenclature
    References

Chapter 11 Assigning Real-Time Tasks In Environmentally Powered Distributed Systems
    Abstract
    Introduction
    Related Works
    Preliminaries
    ACO Solution
    Performance Results
    Summary And Future Works
    References

Chapter 12 Cloud/Fog Computing System Architecture And Key Technologies For South-North Water Transfer Project Safety
    Abstract
    Background
    Application Of Cloud Computing In Water Conservancy Industry
    Characteristics Of Fog Computing Architecture
    Safety System Architecture For The SNWTP
    Key Technologies
    Conclusions
    Acknowledgments
    References

Chapter 13 Agent-Based Synthesis of Distributed Controllers For Discrete Manufacturing Systems
    Abstract
    Introduction
    Controller Modelling
    Distributed Software Design
    Conclusions
    References

Index


LIST OF CONTRIBUTORS

Marco Gaviano
Department of Mathematics and Informatics, University of Cagliari, Cagliari, Italy

Daniela Lera
Department of Mathematics and Informatics, University of Cagliari, Cagliari, Italy

Elisabetta Mereu
Department of Mathematics and Informatics, University of Cagliari, Cagliari, Italy

Yandu Liu
Equipment Academy, Beijing, China

Baoling Zhang
Equipment Academy, Beijing, China

Haixin Zheng
Equipment Academy, Beijing, China

Jinwei Wang
School of Computer Science and Technology, Tianjin University, Tianjin 300072, China
College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China

Xirong Ma
College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China

Yuanping Zhu
College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China

Jizhou Sun
School of Computer Science and Technology, Tianjin University, Tianjin 300072, China

Abhay Ghatpande
GITI, Waseda University, Tokyo 169-0051

Hidenori Nakazato
GITI, Waseda University, Tokyo 169-0051

Olivier Beaumont
LaBRI, France 33405

Arif Sari
Department of Management Information Systems, Girne American University, Kyrenia, Cyprus

Murat Akkaya
Department of Management Information Systems, Girne American University, Kyrenia, Cyprus

Jason Liu
Florida International University, United States

Vaidas Giedrimas
Siauliai University, Siauliai, Lithuania

Leonidas Sakalauskas
Siauliai University, Siauliai, Lithuania
Vilnius University, Vilnius, Lithuania

Anatoly Petrenko
National Technical University of Ukraine “Kyiv Polytechnic Institute”, Kyiv, Ukraine

Amal Abd El-Raouf
Computer Science Department, Southern CT State University, New Haven, CT, USA

Tahany Fergany
Computer Science Department, University of New Haven, CT, New Haven, USA

Reda Ammar
Computer Science and Engineering Department, University of Connecticut, CT, Storrs, USA

Safwat Hamad
Computer & Information Sciences Department, Ain Shams University, Abbassia, Cairo, Egypt

Wanfeng Dou
School of Computer Science and Technology, Jiangsu Research Center of Information Security & Confidential Engineering, Nanjing Normal University, Nanjing, China

Yifeng Wang
School of Computer Science and Technology, Jiangsu Research Center of Information Security & Confidential Engineering, Nanjing Normal University, Nanjing, China

Sen Luo
School of Computer Science and Technology, Jiangsu Research Center of Information Security & Confidential Engineering, Nanjing Normal University, Nanjing, China

Ignacio Aravena
CORE, Université Catholique de Louvain, Louvain-la-Neuve, Belgium

Anthony Papavasiliou
CORE, Université Catholique de Louvain, Louvain-la-Neuve, Belgium

Alex Papalexopoulos
ECCO International, San Francisco, CA, USA

Jian Lin
Department of Management Information Systems, University of Houston, Clear Lake, Houston, USA

Albert M. K. Cheng
Department of Computer Science, University of Houston, Houston, USA

Yaoling Fan
School of Water Conservancy & Environment, Zhengzhou University, Zhengzhou, China

Qiliang Zhu
School of Information Engineering, North China University of Water Resources and Electric Power, Zhengzhou, China

Yang Liu
School of Information Engineering, North China University of Water Resources and Electric Power, Zhengzhou, China

Ernesto López-Mellado
CINVESTAV Unidad Guadalajara, Zapopan, Mexico



LIST OF ABBREVIATIONS

Active Appearance Model
Ant Colony Optimization
Active Shape Model
Class Dependency Graph
Content Distribution Network
Computer Supported Cooperative Work
Compute Unified Device Architecture
Divisible Load Scheduling
Divisible Load Theory
Domain Name Servers
Distributed Object Oriented
Distributed Object Oriented Performance
Distributed Pair Programming
Deterministic Unit Commitment
Dynamic Voltage Scaling
Eclipse Communication Framework
Earliest Deadline First
Energy Harvesting Assignment
External User Request
Exponential Weighted Moving Average
Flow of Parts Graph
Field-Programmable Gate Array
Genetic Algorithm
Graphics Processing Units
Graphical User Interface
Global Virtual Time
Institute of Applied System Analysis
Inverse Compositional Image Alignment
Internet of Things
Logical Processes
Long Range Dependence
Micro-Electro-Mechanical Systems
Max-Min Ant System
Numerical Control Oscillator
Ordinary Differential Equations
Principal Component Analysis
Parallel Discrete-Event Simulation
Process-Level Redundancy
Pair Programming
Poisson Pareto Burst Process
Platform for Research Collaborative Computing
Remote Request
Single-Instruction Multiple-Thread
Streaming Multiprocessors
South-North Water Transfer Project
Service-Oriented Computing
Streaming Processors
Stochastic Unit Commitment
Time Dilation Factor
Triple Modular Redundancy
Virtual Machine
Virtual Private Network



PREFACE

Parallel and distributed systems are relatively new computer systems. Their appearance is due to several factors. Firstly, computers have become smaller and cheaper: thousands of computers can now be stored in a box, whereas in the past a single computer took up an entire room. Also, computers are faster than before. Network communication reduces the computing resources. Finally, interconnection technologies have advanced to the point that it is now simple and cost-effective to connect computers to a network. In a local network, we can expect speeds in the range of hundreds of Mbps to Gbps.

Tanenbaum defines a distributed system as a collection of independent computers that appears to its users as a single computer. There are two essential points in this definition. The first is the use of the word independent: architecturally, the machines are able to work on their own. The second point is that the software allows the set of connected machines to behave, from the user's point of view, as one computer. This is known as a single system image and is the most important goal in designing distributed systems that are easy to maintain and operate.

The main advantages of distributed and parallel systems are:
• Increased reliability. If a small percentage of the machines fail, the rest of the system should remain intact and should be able to perform useful work.
• Incremental growth. A company may buy a single computer and later find that the load is too big for the machine. With a single machine, the only option is to replace it with a faster one. A parallel and distributed architecture allows new resources to be added to the existing infrastructure as needed.
• Remote services. Users may need access to information held on other systems. Examples include the Web, remote access to files, and programs such as file-sharing, banking applications, and multimedia applications for accessing music and video files.

To implement a distributed system, there should be a common software framework that provides a global mechanism for inter-process communication (any process should be able to speak with any other process in the same way, be it local or remote).

This edition covers different topics from parallel and distributed systems, including parallel computing models and algorithms, distributed computing systems, software in distributed systems, and applications of distributed systems.

Section 1 focuses on parallel computing models and algorithms, describing a parallel algorithm for global optimization problems in a distributed computing environment, the parallel programming design of BPSK signal demodulation based on CUDA, and a parallel implementation of the active appearance model fitting algorithm on GPU.

Section 2 focuses on distributed computing systems, describing the scheduling of divisible loads on heterogeneous distributed systems, fault tolerance mechanisms in distributed systems, and parallel and distributed immersive real-time simulation of large-scale networks.

Section 3 focuses on software in distributed systems, describing software development tools for distributed scientific applications, a performance-driven approach for restructuring distributed object-oriented software, and the analysis and design of a distributed pair programming system.

Section 4 focuses on applications of distributed systems, describing a distributed computing architecture for the large-scale integration of renewable energy, the assignment of real-time tasks in environmentally powered distributed systems, a cloud/fog computing system architecture and key technologies for South-North Water Transfer Project safety, and the agent-based synthesis of distributed controllers for discrete manufacturing systems.





A PARALLEL ALGORITHM FOR GLOBAL OPTIMIZATION PROBLEMS IN A DISTRIBUTED COMPUTING ENVIRONMENT

Marco Gaviano, Daniela Lera, Elisabetta Mereu
Department of Mathematics and Informatics, University of Cagliari, Cagliari, Italy

ABSTRACT
The problem of finding a global minimum of a real function on a set $S \subseteq \mathbb{R}^n$ occurs in many real world problems. Since its computational complexity is exponential, its solution can be a very expensive computational task. In this paper, we introduce a parallel algorithm that exploits the latest computers on the market equipped with more than one processor, and used in clusters of computers. The algorithm belongs to the improvement of local minima algorithm family, and carries out local minimum searches iteratively while trying not to find an already found local optimizer again. Numerical experiments have been carried out on two computers equipped with four and six processors; fourteen configurations of the computing resources have been investigated. To evaluate the algorithm's performance, the speedup and the efficiency are reported for each configuration.

Keywords: Random Search, Global Optimization, Parallel Computing

Citation: M. Gaviano, D. Lera and E. Mereu, “A Parallel Algorithm for Global Optimization Problems in a Distributed Computing Environment,” Applied Mathematics, Vol. 3 No. 10A, 2012, pp. 1380-1387. doi: 10.4236/am.2012.330194.
Copyright: © 2012 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://

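INTRODUCTION
In this paper we consider the following global optimization problem.

Problem 1. Find $x^* \in S$ such that $f(x^*) \le f(x)$ for all $x \in S$, where $f: S \to \mathbb{R}$ is a function defined on a set $S \subseteq \mathbb{R}^n$.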

In order to solve Problem 1, a very large variety of algorithms has been proposed; several books that describe the research trends from different points of view have appeared in the literature [1-8]. Numerical techniques for finding solutions to such problems by using parallel schemes have been discussed in the literature (see, e.g., [9-13]). To generalize the investigation of the properties of the algorithms, these are classified in families whose members share common strategies or techniques. In [14] five basic families are defined: partition and search, approximation and search, global decrease, improvement of local minima, and enumeration of local minima.

Nemirovsky and Yudin [15], and Vavasis [16] have proved, under suitable assumptions, that the computational complexity of the global optimization problem is exponential; hence, the number of function evaluations required to solve Problem 1 grows dramatically as the number of variables of the problem increases. This feature makes the search for a global minimum of a given function a very expensive computational task. On the other hand, the latest computers on the market, equipped with more than one processor, and clusters of computers can be exploited.

In [17] a sequential algorithm, called Glob, was presented; it belongs to the improvement of local minima family and carries out local search procedures. Specifically, a local minimum finder algorithm is run iteratively and, to avoid finding the same local minimizer again, a local search execution rule was introduced; this was chosen such that the average number of function evaluations needed to move from a local minimum to a new one is minimal. The parameters used to define the execution rule at a given iteration were computed taking into account the previous history of the minimization process.
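The growth in function evaluations with the number of variables, noted above, can be made concrete with a small calculation. The sketch below is only an illustration and is not the argument used in [15] or [16]: it assumes a uniform grid search with a fixed number of sample points per variable, and the function name `grid_evaluations` is hypothetical.

```python
def grid_evaluations(points_per_axis: int, n_vars: int) -> int:
    """Number of function evaluations for a full uniform grid search.

    Illustrative assumption: the same number of sample points is used
    along every one of the n_vars coordinate axes.
    """
    return points_per_axis ** n_vars


if __name__ == "__main__":
    # With 10 points per axis, each extra variable multiplies the cost by 10.
    for n in (1, 2, 5, 10):
        print(n, grid_evaluations(10, n))
```

Under this assumption the cost is multiplied by the number of points per axis every time a variable is added, which is why exhaustive strategies quickly become impractical.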



In this paper we present a parallel algorithm that distributes the computations carried out by Glob across two or more processors. To keep the data passing operations between processors to a low level, the sequential algorithm is run on each processor, but the parameters of the execution rule are updated either after a fixed number of iterations has been completed or as soon as a new local minimizer is found. The new algorithm has been tested on several test functions commonly used in the literature. The numerical experiments have been carried out on two computers equipped with four and six processors; fourteen configurations of the computing resources have been investigated. To evaluate the algorithm's performance, the speedup and the efficiency are reported for each configuration.
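The text above does not restate how speedup and efficiency are computed, so the following sketch assumes the usual textbook definitions: speedup is the one-processor running time divided by the running time on p processors, and efficiency is the speedup divided by p. The function names and the timings in the usage example are hypothetical and are not taken from the paper's experiments.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup S_p = T_1 / T_p (standard definition, assumed here)."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, n_procs: int) -> float:
    """Efficiency E_p = S_p / p (standard definition, assumed here)."""
    return speedup(t_serial, t_parallel) / n_procs


if __name__ == "__main__":
    # Made-up timings in seconds, for illustration only.
    t1, t4 = 120.0, 40.0
    print(speedup(t1, t4), efficiency(t1, t4, 4))
```

An efficiency close to 1 would mean that the processors are almost fully used; values well below 1 usually point to communication or synchronization overhead.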

PRELIMINARIES

In this section we recall some results established in [17]. For Problem 1 we consider the following assumption.

Assumption 1
1) f(‧) has m local minimum points li, i = 1, …, m and ;
2) meas (S) = 1, with meas (S) denoting the measure of S.

Consider the following algorithm scheme.

Algorithm 1 (Algorithm Glob)
Choose x0 uniformly on S;
repeat
    choose x0 uniformly on S;
    if
    end if
    end if
until a stop rule is met;
end

The function rand(1) denotes a generator of random numbers in the interval [0, 1]. Further, we denote by local_search(x0) any procedure that, starting from a point x0, returns both a local minimum li of Problem 1 and its function value. In algorithm Glob a sequence of local searches is carried out. Once a local search has been completed and a new local minimum li has been found, a point x0 is chosen at random uniformly on S. Whenever f(x0) is less than f(li), a new search is performed from x0; otherwise a local search is performed with probability di. We assume that Problem 1 satisfies all the conditions required to make the procedure local_search(x0) convergent. We have the following proposition.
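The loop described above can be sketched in Python as follows. This is a minimal one-dimensional illustration under stated assumptions, not the authors' code: the names `glob`, `pattern_search` and the test function `f` are hypothetical, a fixed iteration budget stands in for the stop rule, a simple derivative-free search stands in for local_search, and a single constant probability `d` replaces the per-minimum values di.

```python
import random

def pattern_search(f, x, lo, hi, step=0.5, tol=1e-8):
    """Crude derivative-free local search on [lo, hi]; returns (x, f(x))."""
    fx = f(x)
    while step > tol:
        for cand in (max(lo, x - step), min(hi, x + step)):
            fc = f(cand)
            if fc < fx:
                x, fx = cand, fc
                break
        else:
            step /= 2          # no improving neighbour: refine the step
    return x, fx

def glob(f, lo, hi, d=0.1, max_iter=200, seed=0):
    """Multi-start loop in the spirit of algorithm Glob (illustrative only)."""
    rng = random.Random(seed)
    best_x, best_f = pattern_search(f, rng.uniform(lo, hi), lo, hi)
    searches = 1
    for _ in range(max_iter):              # stand-in for the stop rule
        x0 = rng.uniform(lo, hi)           # x0 uniform on S = [lo, hi]
        # Search from x0 if it already improves on the current minimum,
        # otherwise search only with probability d.
        if f(x0) < best_f or rng.random() < d:
            x, fx = pattern_search(f, x0, lo, hi)
            searches += 1
            if fx < best_f:
                best_x, best_f = x, fx
    return best_x, best_f, searches

def f(x):
    """Hypothetical test function with two local minima on [-2, 2]."""
    return (x * x - 1.0) ** 2 + 0.3 * x

x_best, f_best, n_searches = glob(f, -2.0, 2.0)
print(x_best, f_best, n_searches)
```

A small `d` keeps the number of local searches low once a good minimum is known, while the test `f(x0) < best_f` always allows a search from a point that is already better than the current minimum.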

Proposition 1. Let Assumption 1 hold and consider a run of algorithm Glob. Then the probability that li is a global minimum of Problem 1 tends to one as j→∞.

First, we settle the following notation.

Definition 1
· … starting from x, returns local minimum lj};
· … ;
· … starting from x, returns local minimum lj};
· … .
We have

We consider the following definitions for algorithm Glob.

Definition 2
· ti ≡ the probability that, having found the local minimum li, the algorithm detects no new local minimum in a subsequent iteration;
· the probability that the algorithm, having found the local minimum li, can find the local minimum lj in a subsequent iteration.

We calculate the average number of function evaluations that algorithm Glob, having found a local minimum, needs to find any new one. We assume that algorithm Glob can run an infinite number of iterations. Further, it is assumed that the values p0,j and pi,j, i = 1, ···, m − 1 and j = 1, ···, m, are known and that the number of function evaluations required by local_search is k = constant. The following holds.

Theorem 1. The average number of function evaluations that algorithm Glob, having found a local minimum li, needs to find any new one is given by


Problem 2. Let us consider Problem 1 and let the values k, p0,j and pi,j be given. Find the value d*i such that

We calculate which value of di gives the minimum of such a function. We have, for i = 1, ···, m − 1,

The derivative of evals1(di) is greater than or equal to zero for

(1)

Condition (1) links the probability pi,j with the number k of function evaluations performed at each local search, in order to choose the most convenient value of di: if the condition is met, we must take di = 0; otherwise di = 1.

In real problems we usually do not know the values p0,j and pi,j; hence the probabilities d1, d2, ···, dm that optimize the function in Problem 2 cannot be calculated exactly. By making the following approximation of the values

which appear in the definition of evals1 in Problem 2, we can devise a rule for choosing the values di, i = 1, ···, m in algorithm Glob. Specifically, from (1) we get

where p2, p3 and k are approximated as follows

(2)

k = no. of function evaluations in the local searches.

Hence the line of algorithm Glob is replaced by the line

(3)

where yes_box( ) is a procedure that returns 0 or 1 and is defined by

Procedure 1 (yes_box())
if p2=1
    p2=2×p3;
end if
if p2 ratio
    yes = 0;
else
    yes = 1;
end if
end.

In the sequel we shall denote by Globnew algorithm Glob completed with Procedure 1.

THE PARALLEL ALGORITHM

The message passing model will be used in the design of the algorithm we are going to introduce. This model is suitable for running computations on MIMD computers, in which, according to the classification of parallel systems due to Michael J. Flynn, each processor operates under the control of an instruction stream issued by its own control unit. The main point in this model is communication: messages are sent from a sender to one or more recipients, and to each send operation there must correspond a receive operation. Further, the sender either does not continue until the receiver has received the message, or continues without waiting for the receiver to be ready.

To design our parallel algorithm in an environment of N processors we separate the functions into two parts: server and client. The server task is executed by just one processor, while the remaining processors all execute the same client code. The server accomplishes the following tasks:
• reads all the initial data and sends them to each client;
• receives the intermediate data from a sender client;
• combines them with all the data already received;
• sends the updated data back to the sender client;
• gathers the final data from each client.

Each client accomplishes the following tasks:
• receives initial data from the server;
• runs algorithm Globnew;
• sends intermediate data to the server;
• receives updated values from the server;
• stops running Globnew whenever its stop rule is met in any client execution;
• sends final data to the server.

The communication between the server and each client mainly concerns the parameters in (2), that is, p2, p3 and k. As soon as a client either finds a new local minimizer or has executed a fixed number of iterations, it sends the server a message containing the data gathered since the last message was sent, that is:
• the last minimum found;
• the number of function evaluations since the last message;
• the number of iterations since the last message;
• the number of local searches carried out since the last message;
• a status variable of value 0 or 1, where 1 denotes that the stop rule has been met.

The server combines each set of intermediate data received with the data stored in its memory and sends the new data to the client. If the server receives a status variable equal to 1, the status variable keeps that value in all subsequent messages sent to clients, meaning that each client has to stop running Globnew and send its final data to the server. The initial data and intermediate data are embodied in the following data structures.
• Data_start = struct ('x', [ ], 'fx', [ ], 'sum_ev', [ ], 'sum_tr', [ ], 'sum_ls', [ ], 'fun', [ ], 'call_interval', [ ]).
• Data_mid = struct ('stop_flag', [ ], 'client', [ ], 'x', [ ], 'fx', [ ], 'sum_ev', [ ], 'sum_tr', [ ], 'sum_ls', [ ])

where the strings within single quotes denote the names of the members of the structure and [ ] their values. In Data_start the members 'x' and 'fx' refer to the starting point of the algorithm as defined by the user; 'sum_ev', 'sum_tr' and 'sum_ls' initialize the numbers of function evaluations, iterations and local searches. 'fun' and 'call_interval' denote the problem to solve and the number of iterations to be completed before an intermediate data passing takes place.

In Data_mid, 'stop_flag' and 'client' refer to the status variable and to the sender client, while the remaining members denote the same quantities as in Data_start: if the sender is a client, the values are those gathered since the last message was sent, whereas if the sender is the server, the values concern the overall minimization process. In the Appendix we report in pseudocode the basic instructions of the procedures to be executed in the server and client processors respectively.
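The way the server combines an intermediate message with its stored data can be sketched without any message passing library. The sketch below is an assumption-laden illustration, not the authors' C/OpenMPI code: the class names `DataMid` and `Server` and the method `merge` are hypothetical, the field names follow Data_mid, and the point x is taken to be a scalar for brevity.

```python
from dataclasses import dataclass

@dataclass
class DataMid:
    stop_flag: int      # 1 once some client has met the stop rule
    client: int         # id of the sending client
    x: float            # best point known to the sender
    fx: float           # function value at x
    sum_ev: int         # function evaluations
    sum_tr: int         # iterations
    sum_ls: int         # local searches

class Server:
    """Keeps the overall totals and merges each client message into them."""

    def __init__(self, x, fx):
        self.total = DataMid(0, -1, x, fx, 0, 0, 0)

    def merge(self, msg: DataMid) -> DataMid:
        t = self.total
        # Client counters are increments since its last message.
        t.sum_ev += msg.sum_ev
        t.sum_tr += msg.sum_tr
        t.sum_ls += msg.sum_ls
        if msg.fx < t.fx:                 # keep the better minimum
            t.x, t.fx = msg.x, msg.fx
        if msg.stop_flag == 1:            # once set, the flag stays set
            t.stop_flag = 1
        # The reply carries overall values, addressed to the sender.
        return DataMid(t.stop_flag, msg.client, t.x, t.fx,
                       t.sum_ev, t.sum_tr, t.sum_ls)
```

In a real run each call to `merge` would sit between a receive from one client and a send back to the same client.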

NUMERICAL RESULTS

In this section we report the numerical results obtained with an implementation of the algorithm outlined in the previous sections. First we describe the setup of our experiments. Four test problems have been solved; each has been chosen for specific features that test the performance of our algorithm.

Test problem 1


Test problem 2

with

Test problem 3





Test problem 4










, j=1,…,n, mi, i = 1, …, 9, and xt denoting ten points uniformly chosen in S such that the Bi balls do not overlap each other, Ji real values to be taken as the values of f(‧) at mi.

Test problems 1, 2 and 3 appeared in [18-20] respectively. Test problem 4 belongs to a set of problems introduced in [21] and implemented in the software GKLS (cf. [22]). We worked in a Linux environment, namely Ubuntu 10.04 LTS. All codes have been written in the C language in conjunction with the OpenMPI message passing library (version 1.4.2). The local minimization has been carried out by a code called cgtrust, written by C. T. Kelley [23]. This code implements a trust region type algorithm that uses a polynomial procedure to compute the step size along a search direction. Since cgtrust was originally written in MATLAB, it has been converted to C. All software used is open source. Two desktop computers have been used: the first is equipped with an Intel Quad CPU Q9400 with four processors, the second with an AMD Phenom II X6 1090T with six processors. Experiments have been carried out both on each single computer and on the two connected through a local network. In Table 1 we report the fourteen configurations of the computing resources used in our experiments.



Table 1. Configurations of the computing resources

Whenever just one processor of a computer was used, the code was written without any reference to the OpenMPI library. Hence that code is considerably simpler than the one used for more than one processor. Since our algorithm makes use of random procedures, 100 runs of the algorithm have been done on each test problem in order to get significant results. The data reported in the tables are all mean values. The parameter k, which measures the computational cost of local searches, has been computed as the sum of the function and gradient evaluations of the current objective function. The algorithm stops whenever the global minimum has been found within a fixed accuracy. That is, the stop rule is

with f* and the function values at the global minimum point and at the last local minimum found. To evaluate the performance of our algorithm we consider two indices, the speedup and the efficiency: the first estimates the decrease of the time of a parallel execution with respect to a sequential run; the second estimates how well the parallel execution exploits the computing resources.
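The text does not spell out formulas for the two indices. The sketch below assumes the usual definitions, speedup S = T1 / Tp and efficiency E = S / p, where T1 is the elapsed time on one processor and Tp the elapsed time on p processors. The function names and the timings are hypothetical and are not taken from the tables.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Ratio of sequential to parallel elapsed time."""
    return t_serial / t_parallel

def efficiency(t_serial: float, t_parallel: float, processors: int) -> float:
    """Speedup per processor; 1.0 means the processors are fully exploited."""
    return speedup(t_serial, t_parallel) / processors

# Hypothetical timings in seconds, for illustration only.
t1, t4 = 12.0, 4.0
print(speedup(t1, t4))        # 3.0
print(efficiency(t1, t4, 4))  # 0.75
```

A speedup below one, as reported for the easiest test problem, means that the parallel run is slower than the sequential one.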



In Tables 2-5 we report the results gathered for each configuration given in Table 1; in each table, for each function, we report the elapsed computing time, the speedup and the efficiency. The function evaluation is done in two ways: 1) we evaluate the function as it is defined; 2) we introduce an extra computation that consists of a loop of 5000 iterations, where at each iteration the square root of 10.99 is calculated. Hence we carried out our experiments by assigning different weights to the function evaluations. Note that the columns in Tables 4 and 5 referring to one processor have been calculated as the means of the values in the corresponding columns of Tables 2 and 3.

Table 2. Results working with Intel Quad

Table 3. Results working with AMD Phenom 6

Table 4. Results working with Intel Quad and AMD Phenom 6

From the data in the tables we can state the following remarks.
• Whenever the function is evaluated according to 1), the use of more than one processor fails to give significant improvements only for the first test problem. Indeed, this is a very easy problem to solve and does not require a large computational cost.
• Working with two processors, the speedup becomes less than one. Clearly this is because the complexity of the multiprocessor code is not balanced by the use of additional processors.
• When the function is evaluated according to 2), the advantage of the multiprocessor code becomes clear.
• The outcome for the four test problems is quite different: while problem 1 does not exhibit a good efficiency even with the extra computational cost, the remaining problems show significant improvements. For problem 4, the parallel algorithm greatly improves its performance with respect to the serial version.

CONCLUSION

In order to find the global minimum of a real function of n variables, a new parallel algorithm of the multi-start and local search type is proposed. The algorithm distributes the computations across two or more processors. The data passing between cores is minimal. Numerical experiments are carried out in a Linux environment and all code has been written in the C language linked to the OpenMPI libraries. Two desktop computers have been used: the first equipped with an Intel Quad CPU Q9400 with four processors, the second with an AMD Phenom II X6 1090T with six processors. Numerical experiments for solving four well-known test problems have been carried out both on each single computer and on the two connected through a local network. Several configurations with up to ten processors have been considered; for each configuration, speedup and efficiency are evaluated. The results show that the new algorithm performs well, especially on problems that require a large amount of computation.

Table 5. Results working with AMD Phenom 6 and Intel Quad

Marco Gaviano, Daniela Lera, Elisabetta Mereu



APPENDIX

Procedure (Glob_server)
fun=function to minimize;
np=number of processors;
call_interval=max interval between two server-client messages;
x1=starting point; fx1=fun(x1);
sum_ev=1; sum_tr=0; sum_ls=0;
Data_start=struct('x',x1,'fx',fx1,'sum_ev',sum_ev,'sum_tr',sum_tr,'sum_ls',sum_ls,'fun',fun,'call_interval',[ ]).
stop_flag=0; no_stop=0;
Send Data_start to all clients.
while no_stop…
    …fx1;
        Data_mid.x=x1; Data_mid.fx=fx1.
    else
        x1=Data_mid.x; fx1=Data_mid.fx;
    end
    if Data_mid.stop_flag==1;
        no_stops=no_stops+1; stop_flag=1; continue
    elseif stop_flag==1
        Data_mid.stop_flag=1; no_stops=no_stops+1;
    end
    send to client Data_mid
end.


Procedure (Glob_client)
client=client_name;
Receive Data_start from server;
x1=Data_start.x; fx1=Data_start.fx;
sum_ls=sum_ls+Data_start.sum_ls;
sum_tr=sum_tr+Data_start.sum_tr;
sum_ev=sum_ev+Data_start.sum_ev;
call_interval=Data_start.call_interval;
buf=struct('sum_ls', 0, 'sum_tr', 0, 'sum_ev', 0);
stop_flag=0; iter_client=0; yes=1;
while stop_flag==0
    flag_min=0; iter_client=iter_client+1;
    Choose x0 uniformly on S;
    fx=fun(x0); buf.sum_ev=buf.sum_ev+1; buf.sum_tr=buf.sum_tr+1;
    if fx …

… rj − tj. From the above observations, it is clear that after the reordering all conditions for feasibility are still satisfied. Moreover, the orders ≺a and ≺c are unchanged, and no additional processing time is required for the reordering. If a similar reordering is carried out for all such pairs (i, j), then the allocation precedence condition is satisfied with no addition in total processing time T.

Now if there is an optimal schedule for DLSRCHETS that does not satisfy the allocation precedence condition, then a reordering can be performed as described above so that the schedule satisfies the condition without an increase in the total processing time. That is, there always exists an optimal schedule that satisfies the allocation precedence condition, and only such schedules need be considered in the search for the optimal schedule. Two other basic lemmas are stated before the DLSRCHETS problem is defined.

Lemma 2. There exists an optimal schedule for DLSRCHETS that has no idle time between any two consecutive allocation phases and any two consecutive result collection phases. (There may exist other optimal schedules that do not satisfy this condition.)

Proof. Assume that a feasible schedule that obeys (1) to (9), and in addition satisfies the allocation precedence condition, has idle time between consecutive communication phases (see Figure 3). Let the processing time be $T$, the load distribution be $\alpha$, and $(\prec_a, \prec_c)$ be the orders of allocation and collection. According to the assumptions in the system model, all processors are available continuously and exclusively during the entire execution process, and the master can only communicate with one processor at a time. For any $i \prec_a j$, when processor $p_i$ completes the reception of its allocated task at time $t_i + \alpha_i C_i$, processor $p_j$ is already available and can start receiving data immediately at $t_j = t_i + \alpha_i C_i$. Because the schedule satisfies the allocation precedence condition, load is first distributed to all the processors sequentially before result collection begins. Thus the start time of each task $i \in \mathcal{T}$ can be brought forward so that $t_i = t_{\prec_a^+} + \sum_{j \in B^i_{\prec_a} \setminus \{i\}} \alpha_j C_j$, and the inequalities (1) and (2) are reduced to equalities without exceeding $T$.

Following a similar logic, the collection of each result $i \in \mathcal{R}$ can be delayed to the extent necessary to make the result collection start time $r_i = T - \sum_{j \in F^i_{\prec_c}} \delta \alpha_j C_j$, with inequalities (3) and (4) reduced to equalities and no extra time added to $T$. Since any feasible schedule can be reordered in this manner to eliminate the idle time between communication phases, it follows that an optimal schedule for DLSRCHETS also has no idle time between any two consecutive allocation and result collection phases.

Lemma 3. There exists an optimal schedule for DLSRCHETS that has no idle time between the allocation and computation phases of each processor. (There may exist other optimal schedules that do not satisfy this condition.)

Proof. Following an argument similar to the one used in Lemma 2, since all processors are always available, they can begin computing immediately upon receiving their load fractions in the allocation phase without affecting the schedule. Any processor $p_i$ begins computing its allocated task at time $t_{\prec_a^+} + \sum_{j \in B^i_{\prec_a}} \alpha_j C_j$ without crossing the time interval $T$. Since any feasible schedule can be reordered in this manner, an optimal schedule for DLSRCHETS also has no idle time between the allocation and computation phases of each processor.

Theorem 1 (Feasible Schedule Theorem). There exists an optimal schedule for DLSRCHETS that satisfies Lemmas 1 to 3.

Proof. If there exists an optimal schedule that does not satisfy any or all of Lemmas 1 to 3, it can always be reordered as explained in the respective proofs to satisfy them.

From Theorem 1, it follows that only those schedules that satisfy Lemmas 1 to 3 need be considered in the search for the optimal solution to DLSRCHETS. A possible timing diagram for such a schedule is shown in Figure 5. From the preceding discussion, it can be concluded that the start times $t$ and $r$ in the optimal schedule for DLSRCHETS can be determined from the



sequences ≺a and ≺c , and the load distribution α that minimize the processing time T. Hence instead of finding t and r as in traditional scheduling practice, the DLSRCHETS problem is formulated as a linear programming problem, to find ≺a , ≺c , and α that minimize T. Once the optimal values of these variables are known, it is straightforward to find the optimal schedule.

Figure 5. A schedule for m = 3 that satisfies the Feasible Schedule Theorem. Result collection begins only after the entire load is distributed. Each allocation and result collection phase follows its predecessor without delay. The computation phase of each processor follows its allocation phase without delay. Idle time may be present in each processor between the end of its computation phase and the start of the result collection phase.

The constraints (1) to (9) and the allocation precedence condition are combined into a unified form, and for each processor $p_i$, constraints on $T$ are written in terms of $B^i_{\prec_a}$ and $F^i_{\prec_c}$. The DLSRCHETS problem is defined in terms of a linear program as follows.

Definition 1 (Divisible Load Scheduling with Result Collection on HETerogeneous Systems). Given a heterogeneous network $H = (P, L)$, a divisible load $J$, and unit communication and computation times $C$, $E$, find the sequence pair $(\prec_a^*, \prec_c^*)$ and load distribution $\alpha^* = \{\alpha_1^*, \ldots, \alpha_m^*\}$ that

Minimize $T$

Subject to:

$$\sum_{j \in B^i_{\prec_a}} \alpha_j C_j + \alpha_i E_i + \sum_{j \in F^i_{\prec_c}} \delta \alpha_j C_j \le T, \quad i = 1, \ldots, m \tag{10}$$

$$\sum_{j=1}^{m} \alpha_j C_j + \sum_{j=1}^{m} \delta \alpha_j C_j \le T \tag{11}$$

$$\sum_{j=1}^{m} \alpha_j = J \tag{12}$$

$$T \ge 0, \quad \alpha_j \ge 0, \quad j = 1, \ldots, m \tag{13}$$

In the above formulation, for a sequence pair $(\prec_a, \prec_c)$ and a load distribution $\alpha$, the LHS (Left Hand Side) of constraint (10) indicates the total time spent in transmission of tasks to all the processors that must receive load before processor $p_i$ can begin processing its allocated task, the computation time on processor $p_i$ itself, and the time for transmission back to the master of the results of processor $p_i$ and of all subsequent result transfers. For the no-overlap model to be satisfied, the processing time $T$ should be greater than or equal to this time for all the $m$ processors. The single-port communication model is enforced by (11), since its LHS represents the lower bound on the time for distribution and collection under this model. The fact that the entire load is distributed amongst the processors is imposed by (12). This is the normalization equation. The non-negativity of the decision variables is ensured by constraint (13).
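For a fixed sequence pair and a fixed load distribution, the smallest T allowed by the constraints can be computed directly as the largest left-hand side. The sketch below assumes the reading of (10) and (11) given in the paragraph above: for each processor, the transfers that precede and include its own allocation, plus its computation, plus its own and all later result transfers; and for (11), all load and result transfers. The function name `processing_time`, the 0-based processor indices and the numerical values are hypothetical.

```python
def processing_time(alpha, C, E, delta, alloc, coll):
    """Smallest T satisfying constraints (10) and (11) for a given schedule.

    alpha: load fractions; C, E: unit communication / computation times;
    delta: factor scaling result transfer time relative to load transfer;
    alloc, coll: processor indices in allocation and collection order.
    """
    m = len(alpha)
    # Constraint (11): the master sends every load and receives every result.
    worst = sum((1 + delta) * alpha[j] * C[j] for j in range(m))
    for i in range(m):
        before = alloc[:alloc.index(i) + 1]   # up to and including i
        after = coll[coll.index(i):]          # i and its successors
        lhs = (sum(alpha[j] * C[j] for j in before)
               + alpha[i] * E[i]
               + delta * sum(alpha[j] * C[j] for j in after))
        worst = max(worst, lhs)               # constraint (10) for p_i
    return worst

# Hypothetical two-processor system, whole load on the first processor.
print(processing_time([1.0, 0.0], [1.0, 2.0], [3.0, 4.0], 0.5, [0, 1], [0, 1]))  # 4.5
```

Minimizing this quantity over the load fractions, subject to (12) and (13), is what the linear program does for each sequence pair.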

ANALYSIS OF OPTIMAL SOLUTION

Processors that are allocated load are called participating processors or participants.

Theorem 2 (Idle Time Theorem). There exists an optimal solution to the DLSRCHETS problem in which, irrespective of whether load is allocated to all available processors, at most one of the participating processors has idle time, and the idle time exists only when the result collection begins immediately after the completion of load distribution.

Proof. For a pair (≺a, ≺c), the DLSRCHETS problem defined by (10) to (13) always has a feasible solution. This is because, for any load distribution α that satisfies (12), T can be made arbitrarily large to satisfy the inequalities (10) and (11). This implies that the polyhedron formed by the constraints of the DLSRCHETS problem is non-empty: $P := \{X \in \mathbb{R}^{m+1} : AX \le B,\ X \ge 0\} \ne \emptyset$.

According to the theory of linear programming, the optimal solution to DLSRCHETS is obtained at some vertex of this polyhedron (Dantzig, 1963; Vanderbei, 2001). As the DLSRCHETS problem has m + 1 decision variables and 2m + 3 constraints, in a non-degenerate optimal solution, at the optimal vertex, m + 1 constraints out of these must be tight, i.e., satisfied with equality. In a degenerate optimal solution, more than m + 1 constraints are tight.



It is clear that in an optimal solution, the normalization constraint (12) will always be tight, and T will always be greater than zero. This means that m constraints out of the remaining 2m + 1 constraints will be tight in a non-degenerate optimal solution. There are two possible ways to proceed with the analysis at this point, depending on the allocated load fractions in the optimal solution.

1. ∀ k ∈ {1, . . . , m} : αk > 0. In this case, all the load fractions are greater than zero, i.e., the number of participants is m. Since all decision variables are positive, there can be no degeneracy (Vanderbei, 2001, Chapter 3). This leaves only the m + 1 constraints (10) and (11), out of which m will be tight in the optimal solution. Hence, in the optimal solution, either
(a) the m constraints (10) are tight, and the (11) constraint is not, or
(b) the (11) constraint is tight and one of the (10) constraints is not.
If any constraint from (10) and (11) is not tight in the optimal solution, it implies a shortfall in the LHS as compared to the optimal processing time. In constraints (10) this shortfall represents idle time in a processor, while in (11) it represents the intervening time interval between completion of load distribution from the master and the start of result transfer to the master. Thus, if option (a) is true, then none of the processors has any idle time in the optimal solution. If option (b) is true, then one of the processors has idle time, and since this happens only when constraint (11) is tight, it means that idle time in a processor exists only when result transfer to the master begins immediately after load allocation is completed. This is similar to the analysis in Beaumont, Marchal, Rehn & Robert (2005); Beaumont et al. (2006).

2. ∃ k ∈ {1, . . . , m} : αk = 0. In this case, some of the processors can be allocated zero load in the optimal solution. The analysis has two parts, one for non-degenerate and the other for degenerate optimal solutions.

Non-degenerate Optimal Solution

If there are p (p ≤ m) participants in the optimal solution, then m − p constraints of (13) are necessarily tight. This means that out of the m + 1 constraints (10) and (11), only p constraints will be tight in the optimal solution. Hence, in an optimal solution, either
(a) p of the (10) constraints are tight, m − p of the (10) constraints are not tight, and the (11) constraint is not tight, or
(b) the (11) constraint is tight, p − 1 of the (10) constraints are tight, and m − p + 1 of the (10) constraints are not tight.
In the optimal solution, if option (a) is true, then m − p processors have idle time, while if option (b) is true, then m − p + 1 processors have idle time. Since m − p processors are not allocated load, they are obviously idle throughout in either of the two options. The additional processor with idle time under option (b) has to be one of the participating processors. This means that idle time in a participating processor exists only when the result collection begins immediately upon completion of load allocation.

Degenerate Optimal Solution

Similar to the non-degenerate case, if there are p (p ≤ m) participants in the optimal solution, then m − p constraints of (13) are necessarily tight. Since the optimal solution is degenerate, more than p constraints out of the m + 1 constraints (10) and (11) will be tight. This means that in the optimal solution, irrespective of whether the (11) constraint is tight, at least p of the (10) constraints are tight, and fewer than m − p of the (10) constraints are not tight. Since m − p processors are necessarily idle, some of the (10) constraints corresponding to the processors allocated zero load are tight in the degenerate solution. Since $\forall k \in \{1, \ldots, m\}$, $B^k_{\prec_a}, F^k_{\prec_c} \subseteq \{1, \ldots, m\}$, it implies that

$$\sum_{j \in B^k_{\prec_a}} \alpha_j C_j \le \sum_{j=1}^{m} \alpha_j C_j, \quad \sum_{j \in F^k_{\prec_c}} \delta \alpha_j C_j \le \sum_{j=1}^{m} \delta \alpha_j C_j, \quad k \in \{1, \ldots, m\}$$

It follows that

$$\sum_{j \in B^k_{\prec_a}} \alpha_j C_j + \sum_{j \in F^k_{\prec_c}} \delta \alpha_j C_j \le \sum_{j=1}^{m} \alpha_j C_j + \sum_{j=1}^{m} \delta \alpha_j C_j, \quad k \in \{1, \ldots, m\} \tag{14}$$

If (11) is not tight, then the RHS (Right Hand Side) of (14) is strictly less than T. That is,

$$\sum_{j \in B^k_{\prec_a}} \alpha_j C_j + \sum_{j \in F^k_{\prec_c}} \delta \alpha_j C_j < T, \quad k \in \{1, \ldots, m\} \tag{15}$$

If $\exists k \in \{1, \ldots, m\} : \alpha_k = 0$, then $\alpha_k E_k = 0$, and from (15) it immediately follows that the corresponding constraint from (10) can never be tight.

Thus, a constraint corresponding to a processor $p_k$ allocated zero load is tight in the optimal solution only if

$$\sum_{j \in B^k_{\prec_a}} \alpha_j C_j + \sum_{j \in F^k_{\prec_c}} \delta \alpha_j C_j = T \tag{16}$$

or equivalently if (14) is satisfied with equality and the RHS of (14) is equal to T, i.e., the (11) constraint is tight. It is now clear that a degenerate optimal solution exists only when the (11) constraint is tight and condition (16) is satisfied. To find when the condition is satisfied, consider the case where, for some pair $(\prec_a, \prec_c)$, one or more of the processors allocated zero load follow each other at the end of the allocation sequence and the start of the result collection sequence in the optimal solution. For example, if $\alpha_i, \alpha_j, \alpha_k = 0$, and one or more of the following occur (the list is not exhaustive):
• $\prec_a^- = i$ and $\prec_c^+ = i$
• $i \prec_a j$, $\prec_a^- = j$ and $\prec_c^+ = i$
• $i \prec_a j$, $\prec_a^- = j$, $\prec_c^+ = k$ and $k \prec_c i$

(14) is satisfied with equality only if such tail-end zero-load processors exist. Finally, if constraint (11) is tight in the optimal solution, then it follows that the constraints corresponding to these processors are tight. The linear program obtained after eliminating the redundant constraints corresponding to the tail-end zero-load processors has a non-degenerate optimal solution. This is because the feasible region defined by the constraints of the non-degenerate problem does not change after addition of the redundant constraints. Hence only a single participant processor has idle time in the degenerate optimal solution. From the preceding discussion on the optimal solution to the linear program for a pair $(\prec_a, \prec_c)$, it follows that in the optimal solution to the DLSRCHETS problem, $(\prec_a^*, \prec_c^*, \alpha^*)$, at most one participating processor can have idle time. The idle time occurs only when the result



collection from processor $\prec_c^+$ starts immediately after completion of load allocation to processor $\prec_a^-$.

There are m! possible permutations each of ≺a and ≺c, and the linear program has to be evaluated $(m!)^2$ times to determine the globally optimal solution $(\prec_a^*, \prec_c^*, \alpha^*)$ for DLSRCHETS. Since the solution to the linear program is completely determined by the values of δ, C and E, along with the pair (≺a, ≺c), it is not possible to predict which of the processors or how many processors will be allocated zero load.
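The size of this search space can be checked by enumerating the sequence pairs explicitly. The sketch below only generates the pairs; solving the linear program for each pair is left out. The function name `sequence_pairs` is hypothetical and processors are indexed from 0.

```python
from itertools import permutations
from math import factorial

def sequence_pairs(m):
    """Yield every (allocation order, collection order) pair for m processors."""
    orders = list(permutations(range(m)))
    for alloc in orders:
        for coll in orders:
            yield alloc, coll

m = 3
pairs = list(sequence_pairs(m))
print(len(pairs), factorial(m) ** 2)   # 36 36
```

The count grows very quickly with m, which is why exhaustive evaluation is only practical for small systems.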

ANALYSIS OF TWO-SLAVE SYSTEM

For a sequence pair (σa, σc) and load distribution α = {α1, . . . , αm}, a slave processor pi may have idle time xi because it may have to wait for another processor to release the communication medium for result transfer (see Figure 5). In the optimal solution to DLSRCHETS, xi = 0 for all i ∈ {1, . . . , m} if and only if y > 0, and there exists a unique xi > 0 if and only if y = 0, where y is the intervening time interval between the end of the allocation phase of processor σa[m] and the start of result collection from processor σc[1]. For the FIFO schedule in particular, processor σa[m] can always be selected to have idle time when y = 0, i.e., in the FIFO schedule, xσa[m] > 0 if and only if y = 0. In the LIFO schedule, since y > 0 always, no processor has idle time, i.e., xi = 0 for all i ∈ {1, . . . , m} (Beaumont, Marchal, Rehn & Robert, 2005; Beaumont et al., 2006; Beaumont, Marchal & Robert, 2005).
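The quantities xi and y can be read off a schedule as slacks: xi is taken here as the gap between T and the time processor pi spends waiting for its load, computing and returning results together with all later result transfers, and y as the gap between T and the total communication time of the master. This interpretation follows the earlier discussion of non-tight constraints and is an assumption of the sketch, as are the function name `idle_times`, the 0-based indices and the numerical values.

```python
def idle_times(alpha, C, E, delta, alloc, coll, T):
    """Return (x, y): per-processor idle times and the master's gap y."""
    m = len(alpha)
    x = []
    for i in range(m):
        before = alloc[:alloc.index(i) + 1]   # allocations up to and including i
        after = coll[coll.index(i):]          # collections from i onwards
        busy = (sum(alpha[j] * C[j] for j in before) + alpha[i] * E[i]
                + delta * sum(alpha[j] * C[j] for j in after))
        x.append(T - busy)                    # shortfall for processor i
    # Shortfall of the total communication time with respect to T.
    y = T - sum((1 + delta) * alpha[j] * C[j] for j in range(m))
    return x, y

# Hypothetical FIFO schedule in which both processors are busy until T.
x, y = idle_times([12 / 19, 7 / 19], [1.0, 2.0], [3.0, 4.0], 0.5,
                  [0, 1], [0, 1], 61 / 19)
print(x, y)
```

In this example both idle times are zero up to rounding and y is positive, which matches the statement that no processor is idle when y > 0.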

Figure 6. The heterogeneous two-slave system. The two processors p1 and p2 are replaced by an equivalent virtual processor p1:2 on the right. The two network links l1 and l2 are replaced by an equivalent virtual link l1:2. As far as the master p0 is concerned, there is no difference in the time it takes for the equivalent processor to execute a task.



Let the allocation sequence be represented by σa, and the collection sequence by σc, both of which are permutations of the index set K = {1, . . . , m} of slave processors in the heterogeneous system H. For a pair (σa, σc), the solution to the linear program defined by (10) to (13) is completely determined by the values of δ, E, and C, and it is not possible to predict which processor has idle time in the optimal solution. In fact, it is possible that not all processors are allocated load in the optimal solution, in which case some processors are idle throughout.

The heterogeneous system H = (P, L) with m = 2 is shown in Figure 6. It is defined by P = {p0, p1, p2} and L = {l1, l2}. The unit computation and communication times are defined by the sets E = {E1, E2} and C = {C1, C2}. Without loss of generality, it is assumed that the total load to be processed available at the master is J = 1. It is also assumed that C1 ≤ C2. No assumptions are possible regarding the relationship between E1 and E2, or between C1 + E1 + δC1 and C2 + E2 + δC2. An important parameter, ρk, known as the network parameter, is introduced. It indicates, for a slave pk, how fast (or slow) its computation parameter Ek is with respect to the communication parameter Ck of its network link:

ρk = Ek / Ck    (17)

The master p0 distributes the load J between the two slave processors p1 and p2 so as to minimize the processing time T. Depending on the values of δ, E, and C, there are three possibilities:
1. The entire load is distributed to p1 only. The total processing time is given by T = C1 + E1 + δC1.
2. The entire load is distributed to p2 only. The total processing time in this case is T = C2 + E2 + δC2.
3. The load is distributed to both p1 and p2.
It can be proved that as long as C1 ≤ C2, only the schedules in Figs. 7, 8, and 9 can be optimal for a two-slave system. These schedules are the FIFO schedule, the LIFO schedule, and the FIFO schedule with idle time in p2.
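The first two possibilities can be checked numerically. The following is a minimal sketch, assuming the network parameter is ρk = Ek / Ck and that a single slave handling the whole unit load needs Ck + Ek + δCk (the same expressions the text compares), with δ taken to be the ratio of result size to load size. The class and function names are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slave:
    C: float  # unit communication time of the slave's link
    E: float  # unit computation time of the slave


def network_parameter(s: Slave) -> float:
    """rho_k = E_k / C_k (assumed definition of the network parameter)."""
    return s.E / s.C


def single_slave_time(s: Slave, delta: float, load: float = 1.0) -> float:
    """Time to send the load, compute it, and return results of relative size delta."""
    return load * (s.C + s.E + delta * s.C)


p1, p2 = Slave(C=1.0, E=2.0), Slave(C=2.0, E=3.0)
delta = 0.5
print(network_parameter(p1), network_parameter(p2))
print(single_slave_time(p1, delta), single_slave_time(p2, delta))
```

With these illustrative values, p1 alone is faster than p2 alone, which is the baseline any two-slave schedule has to beat.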



These schedules are referred to as Schedule f , Schedule l, and Schedule g respectively. Superscripts f , l, and g are used to distinguish the three schedules. The equations for load fractions, processing times, and the conditions for optimality of Schedules f , l, and g are not derived on account of space constraints. The interested reader is directed to (Ghatpande, Nakazato, Beaumont & Watanabe, 2008) for details.

Optimal Schedule in Two-Slave System
A few lemmas and theorems to determine the optimal schedule for a two-slave system are now stated without proof. Please refer to Ghatpande, Nakazato, Beaumont & Watanabe (2008) for the proofs.

Lemma 4. It is always advantageous to distribute the load to both the processors, rather than execute it on the individual processors (for the system model under consideration).

Lemma 5 (Idle Indicator Lemma). ρ1ρ2 ≤ δ is a necessary and sufficient condition for the presence of idle time in the FIFO schedule (i.e., Schedule g).

The simplicity of the condition to detect the presence of idle time in the FIFO schedule is both pleasing and surprising, and it has been derived for the first time. Further confirmation of this condition is obtained in Sect. 4.2.

Theorem 3 (Optimal Schedule Theorem). The optimal schedule for a two-slave system can be found as follows:
1. If δC2 > C1(1 + δ + ρ1), then Schedule l is optimal.
2. Else if δC2 ≤ C1(1 + δ + ρ1), ρ1ρ2 ≤ δ, and C2 ≤ C1 , then Schedule g is optimal.
3. Else if δC2 ≤ C1(1 + δ + ρ1), ρ1ρ2 ≤ δ, and C2 > C1 , then Schedule l is optimal.
4. Else if δC2 ≤ C1(1 + δ + ρ1), ρ1ρ2 > δ, and Tf ≤ Tl, then Schedule f is optimal.
5. Else if δC2 ≤ C1(1 + δ + ρ1), ρ1ρ2 > δ, and Tf > Tl, then Schedule l is optimal.

The optimal solution to DLSRCHETS, (σa∗, σc∗, α∗), for a system with two slave processors is a function of the system parameters and the application under consideration. Consequently, no particular sequence of allocation and collection can be defined a priori as the optimal sequence. The optimal solution can only be determined once all the parameters are known.



Figure 7. Equivalent processor in Schedule f . The total communication time remains the same as the original two processors. The equivalent computation time is equal to the interval between the end of allocation to p2 and the start of result collection from p1 .

The Concept of Equivalent Processor
To extend the above result to the general case with m slave processors, the concept of an equivalent processor is introduced. Consider the system in Figure 6. The processors p1 and p2 are replaced by a single equivalent processor p1:2 with computation parameter E1:2, connected to the root by an equivalent link l1:2 with communication parameter C1:2. The resulting system is called the equivalent system and the resulting schedule is known as the equivalent schedule. The values of the parameters for the three equivalent schedules are defined below. If the initial load distribution is α = {α1, α2}, and the processing time is T, then the equivalent system satisfies the following properties:
• The load processed by p1:2 is α1:2 = α1 + α2 = 1.
• The processing time is unchanged and equal to T.
• The time spent in load distribution and result collection is unchanged, i.e., for all three schedules,
  – α1:2 C1:2 = α1 C1 + α2 C2, and
  – δα1:2 C1:2 = δα1 C1 + δα2 C2.
• The time spent in load computation is equal to the intervening time interval between the end of the allocation phase and the start of the result collection phase, i.e.,
  – For Schedule f, α1:2 E1:2f = α1 E1 − α2 C2 = α2 E2 − δα1 C1.
  – For Schedule l, α1:2 E1:2l = α2 E2 = α1 E1 − α2 C2 − δα2 C2.
  – For Schedule g, α1:2 E1:2g = 0.
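The relations above, together with α1 + α2 = 1, are enough to compute the load fractions and the equivalent parameters for each schedule. The sketch below solves them directly. The closed forms in the code are derived here from those relations and are not quoted from the chapter; for Schedule g the relation used is α1 E1 = α2 C2 (a zero interval between the end of allocation and the start of collection). The feasibility flags assume the idle-time condition of Lemma 5 written as E1 E2 ≤ δ C1 C2, and the function name is hypothetical.

```python
def two_slave_schedules(C1, E1, C2, E2, delta):
    """Load fractions and equivalent parameters for Schedules f, l and g (J = 1)."""
    idle = E1 * E2 <= delta * C1 * C2       # rho1 * rho2 <= delta
    no_idle = E1 * E2 >= delta * C1 * C2    # rho1 * rho2 >= delta
    # alpha1 for each schedule, from its timing relation and alpha1 + alpha2 = 1
    alpha1 = {
        "f": (C2 + E2) / (E1 + delta * C1 + C2 + E2),
        "l": (E2 + (1 + delta) * C2) / (E1 + E2 + (1 + delta) * C2),
        "g": C2 / (E1 + C2),
    }
    feasible = {"f": no_idle, "l": True, "g": idle}
    result = {}
    for name, a1 in alpha1.items():
        a2 = 1.0 - a1
        C_eq = a1 * C1 + a2 * C2            # equivalent communication parameter
        E_eq = {"f": a1 * E1 - a2 * C2, "l": a2 * E2, "g": 0.0}[name]
        T = C_eq + E_eq + delta * C_eq      # processing time of the equivalent processor
        result[name] = {"alpha1": a1, "alpha2": a2, "C_eq": C_eq,
                        "E_eq": E_eq, "T": T, "feasible": feasible[name]}
    return result


res = two_slave_schedules(C1=1.0, E1=2.0, C2=2.0, E2=3.0, delta=0.5)
for name, r in res.items():
    print(name, round(r["T"], 4), r["feasible"])
```

For these illustrative values ρ1ρ2 > δ, so Schedule g is not feasible, and the LIFO schedule has a slightly smaller processing time than the FIFO schedule.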

The Equivalent Processor Theorem This leads to the following theorem: (refer to (Ghatpande, Nakazato, Beaumont & Watanabe, 2008) for proof.)

Figure 8. Equivalent processor in Schedule l. The total communication time remains the same as the original two processors. The equivalent computation time is equal to the computation time of p2 .

Theorem 4 (Equivalent Processor Theorem). In a heterogeneous system H with m = 2, the two slave processors p1 and p2 can be replaced without affecting the processing time T, by a single (virtual) equivalent processor p1:2 with equivalent parameters C1:2 and E1:2 , such that C1 ≤ C1:2 ≤ C2 and E1:2 ≤ E1 , E2 .

The equivalent processor enables replacement of two processors by a single processor whose communication parameter lies between the values of the communication parameters of the original two links. Because of this property, if the processors are arranged so that C1 ≤ C2 ≤ . . . ≤ Cm, and two processors are combined at a time sequentially starting from the fastest two, then the resultant equivalent processor does not disturb the order of the sequence. The equivalent processor for Schedule f provides additional confirmation of the condition for the presence of idle time in a FIFO schedule. It is known that idle time can exist in a FIFO schedule only when the intervening time interval y = 0. According to the definition of the equivalent processor, this interval corresponds to the equivalent computation capacity E1:2f. This value becomes zero only when ρ1 ρ2 − δ = 0. Thus, if ρ1 ρ2 < δ, then idle time must exist in the FIFO schedule.
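The claim that the equivalent computation capacity vanishes exactly when ρ1ρ2 = δ can be checked with a short derivation. This is a sketch that assumes the network parameter is ρk = Ek / Ck and uses only the Schedule f relation α1 E1 − α2 C2 = α2 E2 − δα1 C1 with α1 + α2 = 1; the symbol D is introduced here for the common denominator.

```latex
\begin{aligned}
\alpha_1 (E_1 + \delta C_1) &= \alpha_2 (C_2 + E_2), \qquad \alpha_1 + \alpha_2 = 1 \\
\alpha_1 &= \frac{C_2 + E_2}{D}, \qquad
\alpha_2 = \frac{E_1 + \delta C_1}{D}, \qquad
D = E_1 + \delta C_1 + C_2 + E_2 \\
E_{1:2}^{f} &= \alpha_1 E_1 - \alpha_2 C_2
 = \frac{E_1 (C_2 + E_2) - C_2 (E_1 + \delta C_1)}{D}
 = \frac{E_1 E_2 - \delta C_1 C_2}{D}
 = \frac{C_1 C_2 \, (\rho_1 \rho_2 - \delta)}{D}
\end{aligned}
```

Since C1, C2, and D are positive, the sign of the interval is the sign of ρ1ρ2 − δ, which matches the Idle Indicator Lemma.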

THE SPORT ALGORITHM
Algorithm 1 (SPORT).
1: arrange p1, . . . , pm such that C1 ≤ C2 ≤ . . . ≤ Cm
2: σa ← 1, σc ← 1, α1 ← 1
3: for k := 2 to m do
4:   C1 ← C1:k−1, E1 ← E1:k−1, C2 ← Ck, E2 ← Ek
5:   if δC2 > C1(1 + δ + ρ1) then
6:     /* Tl < Tf, Tg, use Schedule l */
7:     call schedule_lifo
8:   else
9:     /* Need to check other conditions */
10:    if ρ1ρ2 ≤ δ then
11:      /* Possibility of idle time */
12:      if C2 ≤ C1 then
13:        /* Tg < Tl, use Schedule g */
14:        call schedule_idle
15:        break for
16:      else
17:        /* Tl < Tg, use Schedule l */
18:        call schedule_lifo
19:      end if
20:    else
21:      /* No idle time present */
22:      if Tf ≤ Tl then
23:        /* Tf < Tl, use Schedule f */
24:        call schedule_fifo
25:      else
26:        /* Tl < Tf, use Schedule l */
27:        call schedule_lifo
28:      end if
29:    end if
30:  end if
31: end for
32: n ← numberOfProcessorsUsed
33: /* Update load fractions from stored values */
34:
35: T ← C1:n + E1:n + δ C1:n

The procedures in the algorithm are given below:

procedure schedule_idle
1:
2:
3: /* Update sequences for FIFO */
4: σa ← {σa, k}
5: σc ← {σc, k}
6: /* Compute equivalent processor parameters */
7:
8: E1:k ← 0
9: numberOfProcessorsUsed ← k
10: return

procedure schedule_lifo
1: rl1 ← ρ1
2: rl2 ← 1 + δ + ρ2
3:
4:
5: /* Update sequences for LIFO */
6: σa ← {σa, k}
7: σc ← {k, σc}
8: /* Compute equivalent processor parameters */
9:
10:
11: numberOfProcessorsUsed ← k
12: return

procedure schedule_fifo
7: σc ← {σc, k}
8: /* Compute equivalent processor parameters */
11: numberOfProcessorsUsed ← k
12: return

Figure 9. Equivalent processor in Schedule g. The total communication time remains the same as the original two processors. The equivalent computation time is equal to zero as the result collection begins immediately after the allocation phase ends.

Figure 10. The building of SPORT solution. At each step only two processors are involved (the state space remains constant). The optimal schedule for two processors can be easily computed in constant time using simple if-then-else statements in Theorem 3.

Algorithm Explanation
At the start, the processors are arranged so that C1 ≤ C2 ≤ . . . ≤ Cm, and the two processors with the fastest communication links are selected. The optimal schedule and load distribution for the two processors are found according to Theorem 3. If Schedule f or l is found optimal, then the two processors are replaced by their equivalent processor. In either case, since C1 ≤ C1:2 ≤ C2, the ordering of the processors does not change. In the subsequent iteration, the equivalent processor and the processor with the next fastest communication link are selected, and the steps are repeated until either all processors are used up, or Schedule g is found to be optimal. If Schedule g is found to be optimal in any iteration, then the algorithm exits after finding the load distribution for that iteration.

The computation of the allocation and collection sequences is straightforward. The allocation sequence σa is maintained in the order of decreasing communication link bandwidth of the processors. Irrespective of the schedule found optimal in iteration k, k is always appended to σa. The collection sequence σc is constructed as follows:
• If Schedule f or g is found optimal in iteration k, k is appended to σc.



Figure 11. Calculating the load fractions in SPORT. α1′ is the initial value of α1 . It is multiplied by the product term in (20) to get the final value of α1 = α1:n · α1:n−1 · · · α1:2 · α1′. This is equivalent to traversing the binary tree from the root to the leaf nodes and taking the product of all nodes (values) encountered. This calculation can be implemented in O(m) time by starting with αm and storing the intermediate values.

• If Schedule l is found optimal in iteration k, k is prepended to σc.

The calculation of load distribution to the processors occurs simultaneously with the search for the optimal schedule. As shown in Figure 11, the algorithm creates a one-sided binary tree of load fractions. If the number of processors participating in the computation is n, 2 ≤ n ≤ m, the root node of the binary tree is α1:n and the leaf nodes represent the final load fractions allocated to the processors. The value of the root node need not be calculated as it is equal to one. The individual load fractions, αk, are initially assigned value α′k (say), and then updated at the end as:

αk = α′k · α1:n · α1:n−1 · · · α1:max(k,2)    (20)

This is equivalent to traversing the binary tree from the root to each leaf node and taking the product of the nodes encountered (see Figure 11). This calculation can be easily implemented in O(m) time by starting with the computation of αn, storing the values of the product terms (i.e., ∏ α1:j) for each processor, and then using that value for the next processor. Once the sequences (σa, σc) and load distribution α are found, calculating the processing time is straightforward. The processing time is simply the sum of the (equivalent) parameters of the equivalent processor p1:n, i.e., T = C1:n + E1:n + δ C1:n.
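The O(m) update of the load fractions can be sketched as a single backward pass that keeps a running product. The data layout is an assumption made for this illustration: iteration k stores a pair (a_eq, a_new), where a_eq is the share given to the equivalent processor p1:k−1 and a_new is the share given to the newly added processor pk. The function name is hypothetical.

```python
def final_load_fractions(steps):
    """Turn per-iteration shares into final load fractions [alpha_1, ..., alpha_n].

    steps[i] = (a_eq, a_new) for iteration k = i + 2, with a_eq + a_new = 1:
      a_eq  is the share of the equivalent processor p_{1:k-1},
      a_new is the share of the newly added processor p_k.
    """
    n = len(steps) + 1
    alphas = [0.0] * n
    scale = 1.0                      # product of equivalent-processor shares seen so far
    for k in range(n, 1, -1):        # k = n, n-1, ..., 2 (start with alpha_n)
        a_eq, a_new = steps[k - 2]
        alphas[k - 1] = a_new * scale
        scale *= a_eq                # reuse the stored product for the next processor
    alphas[0] = scale                # alpha_1 collects the product of all a_eq values
    return alphas


fractions = final_load_fractions([(0.75, 0.25), (0.8, 0.2)])
print([round(a, 3) for a in fractions])
```

Each processor is visited once and the product is never recomputed, which is what keeps the update linear in the number of processors.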

In SPORT, defining the allocation sequence by sorting the values of Ck requires O(m log m) time, while finding the collection sequence and load distribution requires O(m) time in the worst case. Thus, if sorted values of Ck are given, then the overall complexity of the algorithm is polynomial in m and is equal to O(m).
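The overall flow of SPORT can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: instead of the closed-form tests of Theorem 3, the helper `combine` computes the processing times of the candidate two-processor schedules and picks the smaller one, using load fractions derived from the equivalent-processor relations and the idle-time condition ρ1ρ2 ≤ δ. Processor indices are zero-based, and all function names are hypothetical.

```python
def combine(C1, E1, C2, E2, delta):
    """Pick the faster two-processor schedule; return (name, a1, a2, C_eq, E_eq)."""
    if E1 * E2 > delta * C1 * C2:
        # rho1 * rho2 > delta: FIFO without idle time (Schedule f)
        a1 = (C2 + E2) / (E1 + delta * C1 + C2 + E2)
        fifo = ("f", a1, a1 * E1 - (1 - a1) * C2)
    else:
        # rho1 * rho2 <= delta: FIFO with idle time (Schedule g), E_eq = 0
        a1 = C2 / (E1 + C2)
        fifo = ("g", a1, 0.0)
    # Schedule l (LIFO): alpha2*E2 = alpha1*E1 - (1 + delta)*alpha2*C2
    a1 = (E2 + (1 + delta) * C2) / (E1 + E2 + (1 + delta) * C2)
    lifo = ("l", a1, (1 - a1) * E2)

    def finish(cand):
        name, a1, E_eq = cand
        C_eq = a1 * C1 + (1 - a1) * C2
        return (1 + delta) * C_eq + E_eq, name, a1, 1 - a1, C_eq, E_eq

    best = min(finish(fifo), finish(lifo), key=lambda t: t[0])  # ties go to FIFO
    return best[1:]


def sport(C, E, delta):
    """Return (sigma_a, sigma_c, steps, T) for unit load."""
    order = sorted(range(len(C)), key=lambda i: C[i])    # fastest links first
    sigma_a, sigma_c = [order[0]], [order[0]]
    C_eq, E_eq = C[order[0]], E[order[0]]
    steps = []                                           # (a_eq, a_k) per iteration
    for k in order[1:]:
        name, a_eq, a_k, C_eq, E_eq = combine(C_eq, E_eq, C[k], E[k], delta)
        steps.append((a_eq, a_k))
        sigma_a.append(k)                                # always appended
        if name == "l":
            sigma_c.insert(0, k)                         # LIFO: prepend
        else:
            sigma_c.append(k)                            # FIFO (f or g): append
        if name == "g":
            break                                        # idle time found: stop adding processors
    T = C_eq + E_eq + delta * C_eq
    return sigma_a, sigma_c, steps, T


print(sport([1.0, 2.0, 3.0], [2.0, 3.0, 4.0], delta=0.5))
```

After the initial sort, each iteration does a constant amount of work on two (equivalent) processors, which reflects the O(m) behaviour described above.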

Simulations and Analysis
The performance of SPORT was compared to four algorithms, viz. OPT, FIFOC, LIFOC, and ITERLP. The globally optimal schedule OPT is obtained after evaluation of the linear program for all possible (m!)² permutations of (σa, σc).

Table 1. Minimum statistics for SPORT simulations. In sets 1 and 2, the minimum errors in LIFOC are 2 orders of magnitude higher than SPORT, ITERLP, and FIFOC. In sets 3 and 4, FIFOC error is 2 to 3 orders of magnitude higher than the other three algorithms



Table 2. Maximum statistics for SPORT simulations. In sets 1 and 2, the maximum errors in LIFOC are 2 orders of magnitude higher than SPORT, ITERLP, and FIFOC. In sets 3 and 4, FIFOC error is 2 to 3 orders of magnitude higher than the other three algorithms

In FIFOC, processors are allocated load and results are collected in the order of decreasing communication link bandwidth of the processors. In LIFOC, load allocation is in the order of decreasing communication link bandwidth of the processors, while result collection is in the reverse order, i.e., in the order of increasing communication link bandwidth. ITERLP (Ghatpande, Beaumont, Nakazato & Watanabe, 2008) is a near-optimal algorithm for DLSRCHETS. To explore the effects of system parameter values on the performance of the algorithms, several sets of simulations were carried out:
Set 1: Homogeneous network and homogeneous processors
Set 2: Homogeneous network and heterogeneous processors
Set 3: Heterogeneous network and homogeneous processors
Set 4: Heterogeneous network and heterogeneous processors
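The fixed sequences used by FIFOC and LIFOC, as described before the list of simulation sets, can be written down directly. The sketch assumes that a smaller unit communication time Ck means a higher link bandwidth, uses zero-based processor indices, and uses a hypothetical function name.

```python
def fifoc_lifoc_sequences(C):
    """Return the (allocation, collection) sequences used by FIFOC and LIFOC."""
    # Decreasing bandwidth = increasing unit communication time C_k.
    by_bandwidth = sorted(range(len(C)), key=lambda i: C[i])
    return {
        "FIFOC": (by_bandwidth, list(by_bandwidth)),   # collect in allocation order
        "LIFOC": (by_bandwidth, by_bandwidth[::-1]),   # collect in reverse order
    }


seqs = fifoc_lifoc_sequences([3.0, 1.0, 2.0])
print(seqs["FIFOC"])
print(seqs["LIFOC"])
```

Both algorithms fix the sequence pair in advance and only solve for the load fractions, whereas SPORT decides in each iteration whether the new processor is appended or prepended to the collection sequence.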

The error values with respect to the optimal are calculated. Over 500,000 simulation runs are carried out. Further details can be found in (Ghatpande, Beaumont, Nakazato & Watanabe, 2008; Ghatpande, Nakazato, Beaumont & Watanabe, 2008). The minimum and maximum mean error values of each algorithm are tabulated in Tables 1 and 2. It can be observed that in sets 1 and 2, the minimum and maximum errors in LIFOC are 2 orders of magnitude higher than those of SPORT, ITERLP, and FIFOC. On the other hand, in sets 3 and 4, the FIFOC error is 2 to 3 orders of magnitude higher than that of the other three algorithms.



Figure 12. Comparison of wall-clock time for SPORT, LIFOC, and FIFOC. SPORT is two orders of magnitude faster than LIFOC and almost four orders of magnitude faster than FIFOC. This figure appears in (Ghatpande, Nakazato, Beaumont & Watanabe, 2008).

There is a significant downside to LIFOC because of its property of using all available processors: the time required to compute the optimal solution (wall-clock time) is almost two orders of magnitude greater than that of SPORT, as seen in Figure 12. These values were obtained by averaging the wall-clock time to compute a solution over 1000 runs. The results show that though both SPORT and LIFOC are O(m) algorithms given a set of processors sorted by decreasing communication bandwidth, SPORT is clearly the better performing algorithm, with the best cost-performance ratio for large values of m. The values for FIFOC are almost four orders of magnitude larger than those for SPORT. The extensive simulations show that:
• If network links are homogeneous, then LIFOC performance is affected for both homogeneous and heterogeneous computation speeds.
• If network links are heterogeneous, then FIFOC performance is affected for both homogeneous and heterogeneous computation speeds.
• SPORT performance is also affected to a certain degree by the heterogeneity in network links and computation speeds, but since SPORT does not use a single predefined sequence of allocation and collection, it is able to better adapt to the changing system conditions.
• ITERLP performance is somewhat better than SPORT, but ITERLP is computationally expensive. SPORT generates similar schedules at a fraction of the cost.

CONCLUSION
In this chapter, the DLSRCHETS problem for the scheduling of divisible loads on heterogeneous master-slave systems, considering the result collection phase, was formulated and analysed. A new polynomial-time algorithm, SPORT, was proposed and tested. Future work can proceed in the following main directions:

Theoretical Analysis
The complexity of DLSRCHETS is still an open issue. It makes for an interesting research topic. Is it at all possible that DLSRCHETS can be solved in polynomial time? Does imposition of some additional constraints make it tractable? What are those conditions?

Extending the System Model
This area has a large number of possibilities for future work. Scheduling purists may consider the system model used in this thesis to be quite simplistic. As future work, the conditions (constraints on values of Ek and Ck) that minimize the error need to be found. An interesting area would be the investigation of the effect of affine cost models, processor deadlines, and release times. Another important area would be to extend the results to multi-installment delivery and multi-level processor trees.

Modification of DLSRCHETS
The ways in which DLSRCHETS may be modified include: dynamism and uncertainty in the system parameters, non-clairvoyance, non-omniscience of the master, node (slave) turnover (failure), slave sharing, multiple jobs on one master, multiple masters, multiple jobs on several masters, decentralization of the scheduling decision (P2P model), QoS requirements, and buffer, bandwidth, and computation constraints on slaves.

Application Development
All the testing in this work has been carried out using simulations. It will be interesting to see how the algorithms perform in practice. New and different applications, apart from the possible scientific applications mentioned in the introduction, need to be developed that use the results in this work. This may require development of new libraries and middleware to support the computation models considered.




REFERENCES
1. Adler, M., Gong, Y. & Rosenberg, A. L. (2003). Optimal sharing of bags of tasks in heterogeneous clusters, SPAA ’03: Proceedings of the fifteenth annual ACM symposium on Parallel algorithms and architectures, ACM, New York, NY, USA, pp. 1–10.
2. Barlas, G. D. (1998). Collection-aware optimum sequencing of operations and closed-form solutions for the distribution of a divisible load on arbitrary processor trees, 9(5): 429–441.
3. Beaumont, O., Casanova, H., Legrand, A., Robert, Y. & Yang, Y. (2005). Scheduling divisible loads on star and tree networks: Results and open problems, 16(3): 207–218.
4. Beaumont, O., Marchal, L., Rehn, V. & Robert, Y. (2005). FIFO scheduling of divisible loads with return messages under the one-port model, Research Report 2005-52, LIP, ENS Lyon, France.
5. Beaumont, O., Marchal, L., Rehn, V. & Robert, Y. (2006). FIFO scheduling of divisible loads with return messages under the one port model, Proc. Heterogeneous Computing Workshop HCW’06.
6. Beaumont, O., Marchal, L. & Robert, Y. (2005). Scheduling divisible loads with return messages on heterogeneous master-worker platforms, Research Report 2005-21, LIP, ENS Lyon, France.
7. Bharadwaj, V., Ghose, D., Mani, V. & Robertazzi, T. G. (1996). Scheduling Divisible Loads in Parallel and Distributed Systems, IEEE Computer Society Press, Los Alamitos, CA.
8. Cheng, Y.-C. & Robertazzi, T. G. (1990). Distributed computation for a tree network with communication delays, 26(3): 511–516.
9. Comino, N. & Narasimhan, V. L. (2002). A novel data distribution technique for host-client type parallel applications, 13(2): 97–110.
10. Dantzig, G. B. (1963). Linear Programming and Extensions, Princeton Univ. Press, Princeton, NJ.
11. Ghatpande, A., Beaumont, O., Nakazato, H. & Watanabe, H. (2008). Divisible load scheduling with result collection on heterogeneous systems, Proc. Heterogeneous Computing Workshop (HCW 2008) held in the IEEE Intl. Parallel and Distributed Processing Symposium (IPDPS 2008), Miami, FL.
12. Ghatpande, A., Nakazato, H., Beaumont, O. & Watanabe, H. (2008). SPORT: An algorithm for divisible load scheduling with result collection on heterogeneous systems, IEICE Transactions on Communications E91-B(8).
13. Robertazzi, T. (2008). Divisible (partitionable) load scheduling research. URL: tom/dlt.html#THEORY
14. Rosenberg, A. (2001). Sharing partitionable workload in heterogeneous NOWs: Greedier is not better, IEEE International Conference on Cluster Computing, Newport Beach, CA, pp. 124–131.
15. Vanderbei, R. J. (2001). Linear Programming: Foundations and Extensions, Vol. 37 of International Series in Operations Research & Management, 2nd edn, Kluwer Academic Publishers. URL: http:// rvdb/LPbook/online.html
16. Yu, D. & Robertazzi, T. G. (2003). Divisible load scheduling for grid computing, Proc. International Conference on Parallel and Distributed Computing Systems (PDCS 2003), Vol. 1, Los Angeles, CA, USA.



FAULT TOLERANCE MECHANISMS IN DISTRIBUTED SYSTEMS
Arif Sari, Murat Akkaya
Department of Management Information Systems, Girne American University, Kyrenia, Cyprus

ABSTRACT
The use of technology has increased vastly, and today computer systems are interconnected via different communication media. The use of distributed systems in our day-to-day activities has improved with data distribution. This is because distributed systems enable nodes to organise and share their resources among the connected systems or devices, which integrates people with geographically distributed computing facilities. Distributed systems may suffer from a lack of service availability due to multiple system failures at multiple failure points. This article highlights the different fault tolerance mechanisms used in distributed systems to prevent multiple system failures at multiple failure points by considering replication, high redundancy, and high availability of the distributed services.

Keywords: Fault Tolerance, Distributed System, Replication, Redundancy, High Availability

Citation: Sari, A. and Akkaya, M. (2015), “Fault Tolerance Mechanisms in Distributed Systems”. International Journal of Communications, Network and System Sciences, 8, 471-482. doi: 10.4236/ijcns.2015.812042.
Copyright: © 2015 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://

INTRODUCTION
A faulty system can cause human and economic loss, for example in air and rail traffic control or telecommunications. A reliable fault tolerance mechanism reduces these risks to a minimum. In distributed systems, faults or failures are partial. A partial failure in a distributed system is not equally critical, because the entire system is not taken offline or brought down. For example, in a system with more than one processing core (CPU), if one of the cores fails, the system does not stop functioning as it would if that were the only physical core in the system; the other cores continue to function and process data normally. In a non-distributed system, by contrast, when one of the components stops functioning, the entire system or program malfunctions and the corresponding processes stop.

Fault tolerance is the dynamic method used to keep the interconnected systems together and to sustain reliability and availability in distributed systems. Hardware and software redundancy methods are the known techniques of fault tolerance in distributed systems. The hardware methods add hardware components such as CPUs, communication links, memory, and I/O devices, while in the software fault tolerance method, specific programs are included to deal with faults. An efficient fault tolerance mechanism helps in detecting faults and, if possible, recovering from them.

There are various definitions of fault tolerance. In dealing with fault tolerance, replication is typically used as a general method to protect against system failure [1] [2]. Sebepou et al. highlighted three major forms of replication mechanism [1] [2]:
• State Machine;
• Process Pairs;
• Roll Back Recovery.



1) State Machine
In this mechanism, the process state of a computer system is replicated on autonomous computer systems at the same time. All replica nodes process data in an analogous or matching way, the replica nodes coordinate their processing, and all inputs are sent to all replicas at the same time [2] [3]. An active replica is an example of a state machine [3] [4].

2) Process Pairs
Process pairs function like a master (primary)/slave (secondary) link in replication coordination. The primary workstation acts as the master and transmits its corresponding input to the secondary node. Both nodes maintain a good communication link [3]-[5].

3) Roll Back Recovery (Checkpoint-Based)
This mechanism takes checkpoints at intervals and transfers the checkpoint states to a stable storage device or to backup nodes. This enables a roll back to be carried out successfully during the recovery process: the state is reconstructed from the checkpoint taken before the most recent state [3]-[6].
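Checkpoint-based roll back recovery can be illustrated with a toy example. The sketch below is not taken from the cited works: the class, its fields, and the use of an in-memory copy as a stand-in for stable storage are all assumptions made for the illustration.

```python
import copy


class CheckpointedCounter:
    """Toy process whose state can be checkpointed and rolled back."""

    def __init__(self):
        self.state = {"processed": 0, "total": 0}
        self._stable = copy.deepcopy(self.state)   # stands in for stable storage

    def apply(self, value):
        self.state["processed"] += 1               # state is updated in two steps ...
        if value < 0:
            # ... so a fault here leaves the state inconsistent
            raise ValueError("faulty input")
        self.state["total"] += value

    def checkpoint(self):
        self._stable = copy.deepcopy(self.state)   # save a consistent snapshot

    def rollback(self):
        self.state = copy.deepcopy(self._stable)   # restore the last snapshot


node = CheckpointedCounter()
for v in [5, 7]:
    node.apply(v)
node.checkpoint()                                  # snapshot: 2 inputs, total 12
try:
    for v in [1, -1, 4]:
        node.apply(v)
except ValueError:
    node.rollback()                                # discard work done since the snapshot
print(node.state)
```

The work done after the checkpoint is lost on recovery, which is the usual trade-off between checkpoint frequency and the amount of recomputation after a fault.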

DISTRIBUTED SYSTEM
Distributed systems are systems that do not share memory or a clock. In a distributed system, nodes connect and relay information by exchanging it over a communication medium. The different computers in a distributed system have their own memory and OS; local resources are owned by the node using them, while resources that are accessed over the network or communication medium are known as remote resources [5]-[7]. Figure 1 shows the communication network between systems in the distributed environment.

In a distributed system, a pool of rules is executed to synchronise the actions of different processes on a communication network, thereby forming a distinct set of related tasks [6]-[9]. The independent systems or computers access resources remotely or locally in the distributed communication environment, and these resources are put together to form a single intelligible system. The user in the distributed environment is not aware of the multiple interconnected systems that ensure the task is carried out accurately. In a distributed system, no single system carries the load of the entire system in processing a task [8] [9].



Figure 1. Distributed system.

DISTRIBUTED SYSTEM ARCHITECTURE
The architecture of a distributed system is built on existing OS and network software [8]. A distributed system encompasses a collection of self-sufficient computers that are linked via a computer network and distribution middleware. The distribution middleware enables the computers to manage and share the resources of the system, so that computer users see the system as a single combined computing infrastructure [9] [10]. Middleware is the link that joins distributed applications across different geographical locations, computing hardware, network technologies, operating systems, and programming languages. The middleware delivers standard services such as naming, concurrency control, event distribution, security, and authorization. Figure 2 shows the distributed system architecture, with the middleware offering its services to the connected systems in the distributed environment [10] [11].

In a distributed system, the structure can be a fully connected network or a partially connected network [12]-[15]. As shown in Figure 3, a fully connected network is a network in which each node is connected to every other node. The disadvantage of this network is that when a new computer is added, the number of node-to-node connections increases, because the network connects each node directly to every other node. Because of this increase, the number of file descriptors and the difficulty for each node to communicate increase heavily. A file descriptor is an abstract indicator used to access a file or other input/output resource, such as a pipe or network connection [15]-[17]. Hence, the ability of the networked systems to continue functioning well is limited by the connected nodes' ability to open file descriptors and by their capability to manage new connections. Fully connected network systems are reliable because a message sent from one node to another goes through one link, and when a node or a link fails, the other nodes in the network can still communicate with each other.

Figure 2. A simple architecture of a distributed system.

Figure 3. Fully connected network.

In a partially connected network, some nodes have direct links while others do not. Some models of partially connected networks are star structured networks, multi-access bus networks, ring structured networks, and tree structured networks. Figures 4-7 illustrate the corresponding networks. The disadvantages of these networks are as follows. In the star structured network, when the main node fails, the entire networked system stops functioning. In the multi-access bus network, nodes are connected to each other through a shared communication link, the bus. If the bus fails, the nodes can no longer connect to each other, and the performance of the network drops as more nodes are added to the system or when heavy traffic occurs. In the ring network, each node is connected to at least two other nodes, creating a path for signals to be exchanged between the connected nodes. As new nodes are added to the network, the transmission delay becomes longer, and if a node fails, every other node in the network can become inaccessible. The tree structured network is a network with a hierarchy: each node has a fixed number of nodes attached to it at the next lower level of the tree. In this network, messages transmitted from a parent to a child node go through one link.
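The growth in connections that makes the fully connected network costly can be quantified with standard link counts for each topology. This is a small sketch; the function name and topology labels are hypothetical, and the bus is counted as a single shared medium.

```python
def link_count(topology: str, n: int) -> int:
    """Number of links (or shared media) needed to connect n >= 3 nodes."""
    if n < 3:
        raise ValueError("the formulas below assume at least three nodes")
    counts = {
        "full": n * (n - 1) // 2,   # every pair of nodes has its own link
        "star": n - 1,              # one link from each node to the central node
        "ring": n,                  # each node links to its successor
        "tree": n - 1,              # one link from each non-root node to its parent
        "bus": 1,                   # a single shared medium
    }
    return counts[topology]


for n in (5, 10, 50):
    print(n, {t: link_count(t, n) for t in ("full", "star", "ring", "tree", "bus")})
```

The fully connected count grows quadratically with the number of nodes, while the partially connected topologies grow at most linearly, at the price of the single points of failure described above.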

Figure 4. Tree structured network.

Figure 5. Ring structured network.



Figure 6. Multi-access bus network.

Figure 7. Star structured network.

For a distributed system to perform and function as designed, it must have the following characteristics: fault tolerance, scalability, predictable performance, openness, security, and transparency.

FAULT TOLERANCE SYSTEMS

Fault tolerance is a vital issue in distributed computing; it keeps the system in working condition when it is subject to failure. Its most important goal is to keep the system functioning even if any of its parts fails or becomes faulty [18]-[20]. Fault tolerance is closely related to the notion of dependable systems. Dependability covers several useful requirements for a fault-tolerant system: availability, reliability, safety, and maintainability.

• Availability: The system is in a ready state and can deliver its functions to its users. A highly available system is one that is most likely to be working at any given instant in time.
• Reliability: The ability of a computer system to run continuously without failure. Unlike availability, reliability is defined over a time interval rather than at an instant in time. A highly reliable system works continuously for a long period of time without interruption.
• Safety: When a system temporarily fails to carry out its processes correctly, no catastrophic event happens.
• Maintainability: How easily a failed system can be repaired. A highly maintainable system can also show a high degree of availability, especially if failures can be detected and fixed automatically.

As we have seen, a fault-tolerant system is one that keeps running correctly, executes its programs properly, and continues functioning in the event of a partial failure [21] [22], although the performance of the system may be affected by the failure. Some faults can be narrowed down to hardware or software failure (node failure) or unauthorised access (machine error). Failures are separated into the following categories: performance, omission, timing, crash, and fail-stop [22]-[24].

• Performance: The hardware or software components cannot meet the demands of the user.
• Omission: Components fail to carry out the actions of a number of distinct commands.
• Timing: Components cannot carry out the actions of a command at the right time.
• Crash: Certain components crash with no response and cannot be repaired.
• Fail-stop: When the software identifies an error, it ends the process or action. This is the easiest case to handle, but its simplicity sometimes keeps it from covering real situations.

With respect to timing, three forms of error can be distinguished:

1) Permanent error: causes lasting damage to software components and prevents the program from running or functioning. In this case the program is restarted; an example is a program crash.
2) Temporary error: causes only brief damage to the software component; the damage resolves after some time and the software continues to function normally.
3) Periodic error: occurs occasionally, for example when two programs conflict when run at the same time. To deal with this type of error, one of the programs is exited to resolve the conflict.

Most computers, if not all, have some fault tolerance technique such as micro diagnosis [25] [26], parity checking [27]-[29], or watchdog timers [30]-[34]. A partially fault-tolerant system has built-in resources that let it reduce its specified computing capability and shrink to a smaller system, either by removing some programs that were used previously or by reducing the rate at which specified processes are executed. The reduction is due to a decrease in the operational hardware configuration or to design faults in the hardware.
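Of the techniques just listed, parity checking is the simplest to illustrate. The following is a minimal sketch, assuming even parity over a short list of bits; the function names are hypothetical and are not taken from the cited works. It shows how a single flipped bit is detected (but not located or corrected).

```python
def parity_bit(bits):
    """Return the even-parity bit for a sequence of 0/1 values."""
    return sum(bits) % 2


def encode(bits):
    """Append the parity bit so the total number of 1s is even."""
    return list(bits) + [parity_bit(bits)]


def is_valid(word):
    """A stored word passes the check when its number of 1s is even."""
    return sum(word) % 2 == 0


word = encode([1, 0, 1, 1])   # three 1s in the data, so the parity bit is 1
corrupted = word.copy()
corrupted[2] ^= 1             # flip one bit to simulate a hardware fault

print(is_valid(word))         # the intact word passes
print(is_valid(corrupted))    # the single-bit error is detected
```

A single parity bit only detects an odd number of flipped bits; two simultaneous flips would go unnoticed, which is why it is usually combined with other mechanisms.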

BASIC CONCEPT OF FAULT TOLERANCE SYSTEMS

Fault tolerance mechanisms can be divided into three levels: hardware, software, and system fault tolerance [34].

Hardware Fault Tolerance: This involves the provision of supplementary backup hardware such as CPUs, memory, hard disks, and power supply units. Hardware fault tolerance can only support the hardware by providing a basic hardware backup system; it cannot detect or stop errors such as accidental interference with programs or program errors. In hardware fault tolerance, computer systems are built that automatically recover from faults occurring in hardware components. This technique often partitions the node into units that act as fault containment regions; each module is backed up with protective redundancy so that if one of the modules fails, the others can take over its function. There are two approaches to hardware fault recovery: fault masking and dynamic recovery [35]-[37].

Fault Masking: This is an important redundancy method that fully masks faults within a set of redundant units or components. Identical units carry out the same tasks, and their outputs are voted on to remove errors created by a defective module. A commonly used fault masking scheme is Triple Modular Redundancy (TMR). TMR triplicates the circuitry and votes on the outputs [38] [39]. The voting circuitry can also be triplicated so that voter failures can be corrected by the same process. A voted TMR system needs more hardware, but it enables computations to continue without interruption when a fault occurs, allowing existing operating systems to be used [40] [41].
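The voting step of Triple Modular Redundancy described above can be sketched in software. This is only an illustration under simple assumptions: the three "modules" are ordinary Python functions, and `healthy` and `faulty` are hypothetical stand-ins for hardware units.

```python
from collections import Counter


def tmr_vote(outputs):
    """Return the majority value among three module outputs.

    Raises RuntimeError when no two modules agree, which is the case
    in which TMR can no longer mask the fault.
    """
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three module outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one module is faulty")
    return value


def healthy(x):
    return x * x


def faulty(x):
    return x * x + 1  # hypothetical defective module


def run_tmr(modules, x):
    """Feed the same input to all three modules and vote on the results."""
    return tmr_vote([module(x) for module in modules])


print(run_tmr([healthy, faulty, healthy], 4))  # the faulty output is outvoted
```

The caller never sees the defective output as long as at most one module fails, which is what "masking" means here.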



Dynamic Recovery: In dynamic recovery, special mechanisms are needed to detect faults in the units, switch out a faulty module, switch in a spare, and carry out the software actions necessary to restore and continue computation, such as rollback, initialization, retry, and restart. In a single computer this requires special hardware and software, but in a multicomputer system the function is carried out by the other processors [42]-[45].

Software Fault Tolerance: This is special software designed to tolerate errors that originate from software or programming errors. Software fault tolerance uses static and dynamic redundancy methods similar to those used for hardware faults [46]. The N-version programming approach uses static redundancy in the form of independently written programs that perform the same function and produce outputs that are voted on at special checkpoints. Another approach is design diversity, which adds both hardware and software fault tolerance by deploying a fault-tolerant system that uses diverse hardware and software in the redundant channels. In design diversity, every channel is intended to carry out the same function, and a mechanism checks whether any channel deviates from the others. The aim of design diversity is to tolerate faults from both hardware and software. This approach is very expensive and is used mainly in aircraft control applications.

Note: Software fault tolerance also includes checkpoint storage and rollback recovery. A checkpoint is a snapshot of the entire system in a working state, taken regularly. The snapshot holds all the information required to restart the program from the checkpoint. The value of software fault tolerance lies in creating applications that store checkpoints regularly for targeted systems.
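The N-version programming idea described above can be illustrated with a small sketch. It assumes three hypothetical, independently written versions of one function (finding the largest element of a list), one of which carries a design fault; the names and the crash-handling policy are assumptions made for illustration, not part of the cited approach.

```python
from collections import Counter


def max_v1(values):
    return max(values)


def max_v2(values):
    return sorted(values)[-1]


def max_v3(values):
    # Hypothetical design fault: wrongly assumes the input is sorted.
    return values[-1]


def n_version_execute(versions, *args):
    """Run every version, then vote on the results at the decision point."""
    results = []
    for version in versions:
        try:
            results.append(version(*args))
        except Exception:
            continue  # a crashed version simply casts no vote
    if not results:
        raise RuntimeError("all versions failed")
    value, count = Counter(results).most_common(1)[0]
    if count <= len(versions) // 2:
        raise RuntimeError("no majority among versions")
    return value


print(n_version_execute([max_v1, max_v2, max_v3], [3, 9, 4]))
```

The scheme only helps when the versions fail independently; if two versions share the same design fault, the vote confirms the wrong answer.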
System Fault Tolerance: This is a complete system that not only stores checkpoints but also detects errors in applications and automatically stores memory blocks and program checkpoints. When a fault or an error occurs, the system provides a correcting mechanism, thereby correcting the error. Table 1 compares the three fault tolerance mechanisms.

Table 1. Comparison of fault tolerance mechanisms.

| Mechanism | Hardware Fault Tolerance | Software Fault Tolerance | System Fault Tolerance |
| Major technique | Hardware backup | Checkpoint storage, rollback recovery | Architecture with error detection and correction |
| Design complexity | | | |
| Time/cost expenditure | | | |
| Fault-tolerance level | | | |


FAULT TOLERANCE MECHANISM IN DISTRIBUTED SYSTEMS

Replication Based Fault Tolerance Technique

The replication based fault tolerance technique is one of the most popular methods. It replicates the data on several other systems. A request can be sent to any one replica among the set of replicas. In this way, if one or more nodes fail, the whole system does not stop functioning, as shown in Figure 8. Replication adds redundancy to a system. The replication protocol has several phases: client contact, server coordination, execution, agreement coordination, and client response. The major issues in replication based techniques are consistency, degree of replication, replica on demand, etc.

Consistency: This is a vital issue in replication. Several copies of the same entity create a consistency problem because any user can perform an update. Data consistency is ensured by criteria such as linearizability [47], sequential consistency, and causal consistency [48]. Sequential consistency and linearizability ensure strong consistency, whereas causal consistency defines a weak consistency criterion. For example, the primary-backup replication technique guarantees consistency through linearizability, as does the active replication technique.

Degree or Number of Replicas: Replication techniques use protocols such as primary-backup replication [49], voting [50], and primary-per-partition replication [51] to replicate data or objects. To attain a high level of consistency, a large number of replicas is needed. If the number of replicas is low, scalability, performance, and the capability to tolerate multiple faults suffer. To address the problem of too few replicas, an adaptive replica creation algorithm was proposed in [51].
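To make the failover behaviour of replication concrete, here is a deliberately simplified sketch. It assumes an in-memory key-value store with a write-all, read-any policy; the `Replica` and `ReplicatedStore` classes are hypothetical and do not implement any of the cited protocols (primary-backup, voting, or primary-per-partition).

```python
class Replica:
    """One copy of the data; `alive` is toggled to simulate a node failure."""

    def __init__(self, name):
        self.name = name
        self.store = {}
        self.alive = True

    def write(self, key, value):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.store[key] = value

    def read(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.store[key]


class ReplicatedStore:
    """Writes go to every live replica; reads fail over to the next replica."""

    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, key, value):
        written = 0
        for replica in self.replicas:
            try:
                replica.write(key, value)
                written += 1
            except ConnectionError:
                pass  # skip failed nodes; a real system must re-sync them later
        if written == 0:
            raise RuntimeError("no replica available")

    def read(self, key):
        for replica in self.replicas:
            try:
                return replica.read(key)
            except ConnectionError:
                continue  # redirect the request to another replica
        raise RuntimeError("no replica available")


nodes = [Replica("r1"), Replica("r2"), Replica("r3")]
store = ReplicatedStore(nodes)
store.write("x", 42)
nodes[0].alive = False      # one node fails
print(store.read("x"))      # the request is served by a surviving replica
```

The sketch ignores the consistency problem discussed above: a replica that misses a write while it is down would return stale data after it recovers, which is exactly what criteria such as linearizability are meant to rule out.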



Process Level Redundancy Technique

This fault tolerance technique is often used for transient faults, that is, faults that disappear without anything being done to remedy the situation. Transient faults occur when there is a temporary malfunction in a system component or, sometimes, environmental interference. Transient faults are hard to diagnose and handle, but they are less severe in nature. To handle transient faults, a software based technique such as Process-Level Redundancy (PLR) is used because hardware based fault tolerance is more expensive to deploy. As shown in Figure 9, PLR creates a set of redundant processes for each application process and compares them to ensure correct execution. Redundancy at the process level lets the OS freely schedule processes across all available hardware resources. PLR provides improved performance over existing software transient fault tolerance techniques, with a 16.9% overhead for fault detection [53]. PLR takes a software-centric approach, which shifts the focus from guaranteeing correct hardware execution to ensuring correct software execution.

Check Pointing and Roll Back: This is a popular technique whose first part, the check point, periodically stores the current state of the system. The check point information is kept on a stable storage device for easy roll back when a node fails. The stored information includes the environment, the process state, the values of the registers, etc. This information is very useful if a complete recovery needs to be done [50] [51]. Figure 10 illustrates the check pointing technique. The two best-known types of roll back recovery are checkpoint based and log based roll back recovery. Each uses a different mechanism: the checkpoint based technique relies on the checkpoint states that it has saved on a stable storage device, while the log based technique combines check pointing with the logging of events [51].
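The checkpoint based variant can be sketched in a few lines. This is a minimal illustration, assuming a single process whose whole state fits in a dictionary and a Python list standing in for the stable storage device; the class name is hypothetical.

```python
import copy


class CheckpointedProcess:
    """Toy process whose state can be saved and restored."""

    def __init__(self):
        self.state = {"step": 0, "total": 0}
        self._stable_storage = []  # stands in for a stable storage device

    def work(self, value):
        self.state["step"] += 1
        self.state["total"] += value

    def checkpoint(self):
        # Deep copy so later work cannot alter the saved snapshot.
        self._stable_storage.append(copy.deepcopy(self.state))

    def rollback(self):
        if not self._stable_storage:
            raise RuntimeError("no checkpoint to roll back to")
        self.state = copy.deepcopy(self._stable_storage[-1])


process = CheckpointedProcess()
process.work(1)
process.work(2)
process.checkpoint()   # snapshot taken after two good steps
process.work(1000)     # pretend this step was corrupted by a failure
process.rollback()     # restore the last saved state
print(process.state)
```

Everything done after the last checkpoint is lost on rollback, which is the gap that log based recovery closes by also recording the events that occurred since the checkpoint.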



Figure 8. Replication based technique in distributed system.

Figure 9. Process redundancy.

Figure 10. Check pointing technique.

In recovering from system failures, two types of check point technique are used: coordinated and uncoordinated check pointing. Both are related to message logging [34].

• Coordinated Check Point: In this technique, the processes coordinate their checkpoints so that together they form a consistent state. If the checkpoints are not consistent, a full and complete rollback of the system cannot be done [52]. When failures are frequent, the coordinated check point technique cannot be used. The recovery time can be set to a higher or lower value; a lower value improves the performance of the technique because recovery goes back only to the last correct state of the system instead of the very first checkpoint.
• Uncoordinated Check Point: This technique takes checkpoints, and performs recovery, independently in each process, and combines them with message logging to ensure that the rollback state is correct. There are three types of message logging protocol: optimistic, pessimistic, and causal. The optimistic protocol aims to log all messages but does so asynchronously, assuming that logging completes before a failure occurs. The pessimistic protocol makes sure that every message received by a process is logged to stable, reliable storage before it is forwarded into the system. The causal protocol logs the message information of a process only in the processes that are causally dependent on it [53].
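The pessimistic protocol described above ("log first, then deliver") can be sketched as follows. The example assumes deterministic message processing and uses a Python list as a stand-in for stable storage; `LoggedProcess` is a hypothetical name.

```python
class LoggedProcess:
    """Pessimistic logging: a message is logged before it is processed."""

    def __init__(self, stable_log):
        self.stable_log = stable_log  # list standing in for stable storage
        self.total = 0

    def deliver(self, message):
        self.stable_log.append(message)  # log first ...
        self._apply(message)             # ... then process

    def _apply(self, message):
        self.total += message

    @classmethod
    def recover(cls, stable_log):
        """Rebuild the state after a crash by replaying the log in order."""
        process = cls(stable_log)
        for message in stable_log:
            process._apply(message)  # replay without logging again
        return process


log = []
original = LoggedProcess(log)
original.deliver(5)
original.deliver(7)
del original                          # simulate a crash: in-memory state is lost

restored = LoggedProcess.recover(log)
print(restored.total)                 # state rebuilt purely from the log
```

Because every message reaches the log before it affects the state, recovery never needs to roll back other processes; the price is a synchronous write on every message.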

Fusion Based Technique

Replication is the most widely used fault tolerance technique. Its main downside is the number of backups it requires: the backups grow with the number of faults to be tolerated, and managing them is very expensive. The fusion based technique addresses this problem. It is an alternative that requires fewer backup machines than the replication based technique. As shown in Figure 11, the backup machines are fused with respect to the given set of machines [53] [54]. The fusion based technique has a very high overhead during recovery, so it is acceptable when the probability of faults in the system is low.

From Table 2, it is clear that all methods are capable of handling multiple faults. In every method, performance can be improved by addressing the critical aspects involved. All the techniques share a strong need for a reliable, accurate, and adaptive multiple failure detector mechanism [53], [54].
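The trade-off behind the fusion idea (fewer backups, costlier recovery) can be illustrated with a parity-style sketch. This is a simplified analogy, assuming each primary machine holds a single non-negative integer as its state; it is not the fused state machine construction of the cited work, and `FusedBackup` is a hypothetical name.

```python
from functools import reduce
from operator import xor


class FusedBackup:
    """One backup holding the XOR of all primary states (tolerates one crash)."""

    def __init__(self, states):
        self.fused = reduce(xor, states, 0)

    def on_update(self, old, new):
        # Remove the old contribution and add the new one.
        self.fused ^= old ^ new

    def recover(self, surviving_states):
        # XOR-ing out every surviving state leaves the lost one.
        return reduce(xor, surviving_states, self.fused)


states = [5, 9, 12]               # states of three primary machines
backup = FusedBackup(states)      # a single fused backup instead of three replicas

backup.on_update(9, 20)           # machine 1 changes state from 9 to 20
states[1] = 20

lost = states[2]                  # machine 2 crashes and its state is gone
print(backup.recover([states[0], states[1]]) == lost)
```

Recovery has to collect the state of every surviving machine, which mirrors the high recovery overhead noted above, whereas plain replication would simply read a copy.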



Figure 11. Fusion process technique.

Table 2. Comparison of the different fault tolerance techniques in distributed systems.

| Major Factors | Replication Based Technique | Check Pointing/Roll Back Technique | Fusion Based Technique | Process Level Redundancy Technique |
| | Redirected to replica | State saved on stable storage used for recovery | Backup machine | A set of redundant processes |
| | Some criterion, e.g. linearizability | Avoiding orphan messages | Among backup machines | Not a major issue |
| Multiple Faults Handling | Depends upon number of replicas | Depends upon check pointing scheduling | Depends upon number of backup machines | Depends upon set of redundant processes |
| | Decreases as number of replicas increases | Decreases with frequency and size of checkpoint | Decreases in case of faults, as recovery cost is high | Decreases as faults appear and disappear |
| | N replicas ensure N-1 faults | Uncoordinated, pessimistic, and N-level diskless used for N-1 faults | To handle N extra faults, N backup machines are required | Scaling the number of processes and majority voting |
| Multiple Failure Detector | Reliable, accurate, adaptive | Reliable, accurate, adaptive | Reliable, accurate, adaptive | Reliable, accurate, adaptive |



CONCLUSION

Fault tolerance is a major part of distributed systems because it ensures the continuity and functionality of a system when a fault or failure occurs. This research presented the different types of fault tolerance technique in distributed systems, such as the fusion based technique, the check pointing and roll back technique, and the replication based fault tolerance technique. Each mechanism has advantages over the others, and each is costly to deploy. In this paper we highlighted the levels of fault tolerance: hardware fault tolerance, which ensures that additional backup hardware such as memory blocks and CPUs is available; software fault tolerance, which comprises checkpoint storage and rollback recovery mechanisms; and system fault tolerance, a complete system that provides both software and hardware fault tolerance to ensure availability of the system during a failure, error, or fault. Future research will compare the various data security mechanisms and their performance metrics.

REFERENCES

1. Sebepou, Z. and Magoutis, K. (2011) CEC: Continuous Eventual Checkpointing for Data Stream Processing Operators. Proceedings of IEEE/IFIP 41st International Conference on Dependable Systems and Networks, 145-156.
2. Sari, A. and Necat, B. (2012) Impact of RTS Mechanism on TORA and AODV Protocol’s Performance in Mobile Ad Hoc Networks. International Journal of Science and Advanced Technology, 2, 188-191.
3. Chen, W.H. and Tsai, J.C. (2014) Fault-Tolerance Implementation in Typical Distributed Stream Processing Systems.
4. Sari, A. and Necat, B. (2012) Securing Mobile Ad Hoc Networks against Jamming Attacks through Unified Security Mechanism. International Journal of Ad Hoc, Sensor & Ubiquitous Computing, 3, 79-94. http://
5. Balazinska, M., Balakrishnan, H., Madden, S. and Stonebraker, M. (2008) Fault-Tolerance in the Borealis Distributed Stream Processing System. ACM Transactions on Database Systems, 33, 1-44. http://
6. Sari, A. (2014) Security Approaches in IEEE 802.11 MANET: Performance Evaluation of USM and RAS. International Journal of Communications, Network, and System Sciences, 7, 365-372. http://
7. Elnozahy, E.N.M., Alvisi, L., Wang, Y.M. and Johnson, D.B. (2002) A Survey of Rollback-Recovery Protocols in Message-Passing Systems. ACM Computing Surveys, 34, 375-408. http://dx.doi.org/10.1145/568522.568525
8. Sari, A. (2014) Security Issues in RFID Middleware Systems: A Case of Network Layer Attacks: Proposed EPC Implementation for Network Layer Attacks. Transactions on Networks & Communications, 2, 1-6.
9. Tanenbaum, A.S. (1995) Distributed Operating Systems. Prentice Hall, Upper Saddle River.
10. Sari, A. (2015) Lightweight Robust Forwarding Scheme for Multi-Hop Wireless Networks. International Journal of Communications, Network and System Sciences, 8, 19-28. ijcns.2015.83003



11. Coulouris, G., Dollimore, J. and Kindberg, T. (2001) Distributed Systems: Concepts and Design, 4th Edition, Pearson Education Ltd., New York.
12. Carter, W.C. and Bouricius, W.G. (1971) A Survey of Fault-Tolerant Computer Architecture and Its Evaluation. Computer, 4, 9-16.
13. Short, R.A. (1968) The Attainment of Reliable Digital Systems through the Use of Redundancy: A Survey. IEEE Computer Group News, 2, 2-17.
14. Sari, A. (2015) Two-Tier Hierarchical Cluster Based Topology in Wireless Sensor Networks for Contention Based Protocol Suite. International Journal of Communications, Network and System Sciences, 8, 29-42.
15. Cooper, A.E. and Chow, W.T. (1976) Development of On-Board Space Computer Systems. IBM Journal of Research and Development, 20, 5-19.
16. Tanenbaum, A. and Van Steen, M. (2007) Distributed Systems: Principles and Paradigms. 2nd Edition, Pearson Prentice Hall, Upper Saddle River.
17. Koren, I. and Krishna, C.M. (2007) Fault-Tolerant Systems. Elsevier Inc., San Francisco.
18. Sari, A. and Onursal, O. (2013) Role of Information Security in E-Business Operations. International Journal of Information Technology and Business Management, 3, 90-93.
19. Avizienis, A., Kopetz, H. and Laprie, J.C. (1987) Dependable Computing and Fault-Tolerant Systems, Volume 1: The Evolution of Fault-Tolerant Computing. Springer-Verlag, Vienna, 193-213.
20. Sari, A. and Çağlar, E. (2015) Performance Simulation of Gossip Relay Protocol in Multi-Hop Wireless Networks. Social and Applied Sciences Journal, Girne American University, 7, 145-148.
21. Harper, R., Lala, J. and Deyst, J. (1988) Fault-Tolerant Parallel Processor Architectural Overview. Proceedings of the 18th International Symposium on Fault-Tolerant Computing, Tokyo, 27-30 June 1988.
22. Sari, A. and Rahnama, B. (2013) Addressing Security Challenges in WiMAX Environment. In: Proceedings of the 6th International Conference on Security of Information and Networks, ACM Press, New York, 454-456.



23. Briere, D. and Traverse, P. (1993) AIRBUS A320/A330/A340 Electrical Flight Controls: A Family of Fault-Tolerant Systems. Proceedings of the 23rd International Symposium on Fault-Tolerant Computing, Toulouse, 22-24 June 1993.
24. Charron-Bost, B., Pedone, F. and Schiper, A. (2010) Replication: Theory and Practice. Lecture Notes in Computer Science, 5959.
25. Sari, A. (2015) Security Issues in Mobile Wireless Ad Hoc Networks: A Comparative Survey of Methods and Techniques to Provide Security in Wireless Ad Hoc Networks. New Threats and Countermeasures in Digital Crime and Cyber Terrorism, IGI Global, Hershey, 66-94.
26. Sari, A. and Rahnama, B. (2013) Simulation of 802.11 Physical Layer Attacks in MANET. Proceedings of the Fifth International Conference on Computational Intelligence, Communication Systems and Networks (CICSyN), Madrid, 5-7 June 2013, 334-337. cicsyn.2013.79
27. Tanenbaum, A.S. and van Steen, M. (2002) Distributed Systems: Principles and Paradigms. Pearson Education Asia.
28. Sari, A., Rahnama, B. and Caglar, E. (2014) Ultra-Fast Lithium Cell Charging for Mission Critical Applications. Transactions on Machine Learning and Artificial Intelligence, 2, 11-18. http://dx.doi.org/10.14738/tmlai.25.430
29. Ebnenasir, A. (2005) Software Fault-Tolerance. Computer Science and Engineering Department, Michigan State University, East Lansing. /ft1.pdf
30. Birman, K. (2005) Reliable Distributed Systems: Technologies, Web Services and Applications. Springer-Verlag, Berlin.
31. Obasuyi, G. and Sari, A. (2015) Security Challenges of Virtualization Hypervisors in Virtualized Hardware Environment. International Journal of Communications, Network and System Sciences, 8, 260-273.
32. Avizienis, A. (1975) Architecture of Fault-Tolerant Computing Systems. Proceedings of the 1975 International Symposium on Fault-Tolerant Computing, Paris, 18-20 June 1975, 3-16.
33. Sari, A. (2015) A Review of Anomaly Detection Systems in Cloud Networks and Survey of Cloud Security Measures in Cloud Storage Applications. Journal of Information Security, 6, 142-154. http://



34. Short, R.A. (1968) The Attainment of Reliable Digital Systems through the Use of Redundancy: A Survey. IEEE Computer Group News, 2, 2-17.
35. Sari, A. (2014) Influence of ICT Applications on Learning Process in Higher Education. Procedia - Social and Behavioral Sciences, 116, 4939-4945.
36. Huang, M. and Bode, B. (2005) A Performance Comparison of Tree and Ring Topologies in Distributed Systems. Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium, Denver, 4-8 April 2005, 258.1.
37. Huang, M. (2004) A Performance Comparison of Tree and Ring Topologies in Distributed System. Master’s Thesis.
38. Minar, N. (2001) Distributed Systems Topologies: Part 1. http://
39. Wiesmann, M., Pedone, F., Schiper, A., Kemme, B. and Alonso, G. (2000) Understanding Replication in Databases and Distributed Systems. Research Supported by EPFL-ETHZ DRAGON Project and OFES.
40. Herlihy, M. and Wing, J. (1990) Linearizability: A Correctness Condition for Concurrent Objects. ACM Transactions on Programming Languages and Systems, 12, 463-492.
41. Ahamad, M., Hutto, P.W., Neiger, G., Burns, J.E. and Kohli, P. (1994) Causal Memory: Definitions, Implementations and Programming. TR GIT-CC-93/55, Georgia Institute of Technology, Atlanta.
42. Rahnama, B., Sari, A. and Makvandi, R. (2013) Countering PCIe Gen. 3 Data Transfer Rate Imperfection Using Serial Data Interconnect. Proceedings of the 2013 International Conference on Technological Advances in Electrical, Electronics and Computer Engineering (TAEECE), Konya, 9-11 May 2013, 579-582. TAEECE.2013.6557339
43. Budhiraja, N., Marzullo, K., Schneider, F.B. and Toueg, S. (1993) The Primary-Backup Approach. In: Mullender, S., Ed., Distributed Systems, ACM Press, New York, 199-216.
44. Gifford, D.K. (1979) Weighted Voting for Replicated Data. Proceedings of the Seventh ACM Symposium on Operating Systems Principles, Pacific Grove, 10-12 December 1979, 150-162. http://dx.doi.org/10.1145/800215.806583



45. Osrael, J., Froihofer, L., Goeschka, K.M., Beyer, S., Galdamez, P. and Munoz, F. (2006) A System Architecture for Enhanced Availability of Tightly Coupled Distributed Systems. Proceedings of the First International Conference on Availability, Reliability and Security, Vienna, 20-22 April 2006.
46. Cao, H.H. and Zhu, J.M. (2008) An Adaptive Replicas Creation Algorithm with Fault Tolerance in the Distributed Storage Network. Proceedings of the Second International Symposium on Intelligent Information Technology Application, Shanghai, 20-22 December 2008, 738-741.
47. Shye, A., Blomstedt, J., Moseley, T., Reddi, V. and Connors, D.A. (2008) PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures. IEEE Transactions on Dependable and Secure Computing, 6, 135-148.
48. Agarwal, V. (2004) Fault Tolerance in Distributed Systems. Indian Institute of Technology, Kanpur.
49. Jung, H., Shin, D., Kim, H. and Lee, H.Y. (2005) Design and Implementation of Multiple Fault Tolerant MPI over Myrinet (M3). In: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, IEEE Computer Society, Washington DC, 32. SC.2005.22
50. Elnozahy, M., Alvisi, L., Wang, Y.M. and Johnson, D.B. (1996) A Survey of Rollback-Recovery Protocols in Message Passing Systems. Technical Report CMU-CS-96-81, School of Computer Science, Carnegie Mellon University, Pittsburgh.
51. Alvisi, L. and Marzullo, K. (1995) Message Logging: Pessimistic, Optimistic, and Causal. Proceedings of the 15th International Conference on Distributed Computing Systems (ICDCS 1995), Vancouver, 30 May-2 Jun 1995, 229-236. ICDCS.1995.500024
52. Garg, V.K. (2010) Implementing Fault-Tolerant Services Using Fused State Machines. Technical Report ECE-PDS2010-001, Parallel and Distributed Systems Laboratory, ECE Department, University of Texas, Austin.
53. Xiong, N., Cao, M., He, J. and Shu, L. (2009) A Survey on Fault Tolerance in Distributed Network Systems. Proceedings of the 2009 International Conference on Computational Science and Engineering, Vancouver, 29-31 August 2009, 1065-1070. CSE.2009.497
54. Tian, D., Wu, K. and Li, X. (2008) A Novel Adaptive Failure Detector for Distributed Systems. Proceedings of the 2008 International Conference on Networking, Architecture, and Storage, Chongqing, 12-14 June 2008, 215-221.




Citation: Jason Liu (January 1st 2010). “Parallel and Distributed Immersive Real-Time Simulation of Large-Scale Networks”, Parallel and Distributed Computing, Alberto Ros, IntechOpen, DOI: 10.5772/9453. Copyright: © 2010 by author and Intech. This paper is an open access article distributed under a Creative Commons Attribution 3.0 License.

INTRODUCTION

Network researchers need to embrace the challenge of designing the next-generation high-performance networking and software infrastructures that address the growing demand of distributed applications. These applications, particularly potential “game changers” or “killer apps” such as the voice-over-IP (VoIP) and peer-to-peer (P2P) applications that have surfaced in recent years, will significantly influence the way people conduct business and go about their daily lives. These distributed applications also include platforms that facilitate large-scale scientific experimentation through remote control and visualization. Many large-scale science applications, such as those in the fields of astronomy, astrophysics, climate and environmental science, material science, particle physics, and social science, depend on the availability of high-performance facilities and advanced experimental instruments. Extreme networking capabilities together with effective high-end middleware infrastructures are of great importance to interconnecting these applications, computing resources and experimental facilities.

“When all you have is a hammer, everything looks like a nail.” The success of advancing critical technologies, to a large extent, depends on the available tools that can help effectively prototype, test, and analyze new designs and new ideas. Traditionally, network research has relied on a variety of tools. Physical network testbeds, such as WAIL (Barford and Landweber, 2003) and PlanetLab (Peterson et al., 2002), provide physical network connectivity; these testbeds are designed specifically for studying network protocols and services under real network conditions. However, the network condition of these testbeds is by and large constrained by the physical setup of the system, which makes it difficult for network researchers to explore a wide spectrum of the design space. To allow more flexibility, some of these testbeds, such as EmuLab (White et al., 2002) and VINI (Bavier et al., 2006), also offer emulation capabilities by modulating network traffic according to the configuration and traffic condition of the target network. Physical and emulation testbeds are currently the mainstream for experimental networking research, primarily due to their capability of achieving desirable realism and accuracy. These testbeds, however, are costly to build. Due to the limited resources available, conducting prolonged large-scale experiments on these platforms is difficult.

Another solution is to use analytical models.
Although analytical models are capable of bringing us important insight into the system design, dealing with a system as complex as the global network requires significant simplifying assumptions to keep the models tractable. These simplifying assumptions often exclude implementation details, which are often crucial to the validity of the system design.

Simulation and emulation play an important role in network design and evaluation. While both refer to the technique of mimicking network operations in software, one major distinction is that simulation is purely virtual, whereas emulation focuses on interactions with real applications. A network simulation consists of software implementations of network protocols and various network entities, such as routers and links. Network operations (e.g., packet forwarding) are merely logical operations. As a result, the simulation time advancement bears no direct relationship to the wall-clock time. Emulation, on the other hand, focuses on interactions with real applications, such as distributed network services and distributed database systems. These real applications generate traffic; an emulator provides traffic shaping functions by adding appropriate packet delays and sometimes dropping packets. Emulation delivers more realism as it interacts with the physical entities. Comparatively, simulation is effective at capturing high-level design issues and answering what-if questions, and can therefore help us understand complex system behaviors, such as multi-scale interactions, self-organizing characteristics, and emergent phenomena. Unfortunately, simulation fares poorly in many aspects, notably the absence of operational realism. Further, simulation model development is both labor-intensive and error-prone; reproducing realistic network topology, representative traffic, and diverse operational conditions in simulation is known to be a substantial undertaking (Floyd and Paxson, 2001).

Real-time simulation combines the advantages of both simulation and emulation: it can run simulation and simultaneously interact with the physical world. Real-time network simulation, sometimes called immersive network simulation, can be defined as the technique of simulating computer networks and communication systems in real time so that the simulated network can interact with real implementations of network protocols, network services, and distributed applications. The word “immersive” suggests that the virtual network behavior should not be distinguishable from a physical network for conducting network traffic. That is, simulation should capture important characteristics of the target network and support seamless interactions with the real applications. Real-time network simulation is based on simulation, and therefore is fast in execution and flexible at answering what-if questions. It allows high-level mathematical models (such as stochastic network traffic models) to be incorporated into the system with relative ease.
The system interacts with real applications and real network traffic. It not only allows us to study the impact of real application traffic on the virtual network, but also supports studying the behavior of real applications under diverse simulated network conditions. The challenge is to keep the simulation running in real time. Since real applications operate in real time, real-time network simulation must meet real-time requirements. In particular, a large-scale network simulation must keep up with the wall-clock time and allow real-time interactions with potentially many real applications. A real-time simulator must also be able to characterize the behavior of a network, potentially with millions of network entities and with realistic traffic load at commensurate scale, all in real time. To speed up simulation, on the one hand, we need to apply parallel and distributed discrete-event simulation techniques to harness the computing resources of parallel computers so as to physically increase the event-processing power; on the other hand, we need to resort to multiresolution modeling techniques, using models at a high level of abstraction to reduce the computational demand. We also need to create a scalable emulation infrastructure, through which real applications can interact with the simulated network and sustain high emulation traffic intensity.

In this chapter, we review the techniques that allow real-time simulation to model large-scale networks and interact with many real applications under the real-time constraint. We discuss advanced modeling and simulation techniques supporting real-time execution. We describe the emulation infrastructure and machine virtualization techniques supporting the network immersion of a large number of real applications. Through case studies, we show the potential of real-time simulation in various areas of network science.

BACKGROUND

Existing Network Testbeds

We classify available network testbeds into physical, emulation, and simulation testbeds. We can further divide physical testbeds into production testbeds and research testbeds (Anderson et al., 2005). Production testbeds, such as CAIRN and Internet2, support network experiments directly on the network itself and thus with live traffic; however, they are very restrictive, allowing only certain types of experiments that do not disrupt normal network operations. Comparatively, research testbeds, such as WAIL and PlanetLab, provide far better flexibility. WAIL (Barford and Landweber, 2003) is a research testbed consisting of a large set of commercial networking components (including routers, switches, and end hosts) connected to form an experimental network capable of representing typical end-to-end configurations found on the Internet. PlanetLab (Peterson et al., 2002) is a well-known research facility consisting of machines distributed across the Internet and shared by researchers conducting experiments. Most research testbeds, however, can only provide an iconic view of the Internet at large. Also, the underlying facility is typically overloaded due to heavy use, which potentially affects its availability as well as accuracy (Spring et al., 2006).

Many research testbeds are based on emulation. Network emulation adds packet delays and possibly drops packets when conducting traffic between real applications. Examples of emulation testbeds include Ahn et al. (1995); Carson and Santay (2003); Herrscher and Rothermel (2002); Zheng and Ni (2003) and Huang et al. (1999). The traffic modulation function can be implemented at the sender or receiver side, or both. For example, in dummynet (Rizzo, 1997), each virtual network link is represented as a queue with specific bandwidth and delay constraints; packets are intercepted at the protocol stack of the sender and pushed through a finite queue to simulate the time it takes to forward the packet. Emulation testbeds can be built on a variety of computing infrastructures. For example, ModelNet (Vahdat et al., 2002) extends dummynet, where a large number of network applications can run unmodified on a set of edge nodes and communicate via a virtual network emulated on parallel computers at the core. EmuLab (White et al., 2002) is an experimentation facility consisting of a compute cluster integrated and coordinated to present a diverse virtual network environment. DETER (Benzel et al., 2006) extends EmuLab to support research and development of cyber security applications. Some of the emulation testbeds are built for distributed environments, such as X-Bone (Touch, 2000), VIOLIN (Jiang and Xu, 2004), VNET (Sundararaj and Dinda, 2004), and VINI (Bavier et al., 2006). Other emulation testbeds may require special programmable devices. For example, the Open Network Laboratory (DeHart et al., 2006) uses embedded processors and configures them to represent realistic network settings for experimentation and observation. ORBIT (Raychaudhuri et al., 2005) is an open large-scale wireless network emulation testbed that supports experimental studies using an array of real wireless devices.
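The queue-based link model described above for dummynet can be sketched in a few lines. This is a minimal illustration only, not dummynet's actual implementation: the class name `Pipe`, its parameters, and the drop-when-full policy are assumptions made for the example.

```python
from collections import deque


class Pipe:
    """Hypothetical link model: finite FIFO queue, fixed bandwidth, fixed delay."""

    def __init__(self, bandwidth_bps, delay_s, queue_limit):
        self.bandwidth_bps = bandwidth_bps   # link bandwidth in bits per second
        self.delay_s = delay_s               # one-way propagation delay in seconds
        self.queue_limit = queue_limit       # max packets queued or in transmission
        self._finish_times = deque()         # transmission finish time of each packet

    def send(self, now, size_bytes):
        """Return the delivery time of a packet arriving at `now`, or None if dropped."""
        # Forget packets whose transmission has already completed.
        while self._finish_times and self._finish_times[0] <= now:
            self._finish_times.popleft()
        if len(self._finish_times) >= self.queue_limit:
            return None                      # queue is full: the packet is dropped
        # Transmission starts when the packet ahead of it (if any) has left the link.
        start = self._finish_times[-1] if self._finish_times else now
        finish = start + size_bytes * 8 / self.bandwidth_bps
        self._finish_times.append(finish)
        return finish + self.delay_s
```

A packet's delivery time is therefore its queuing time plus its transmission time plus the propagation delay, which is the delay an emulator would impose on the real packet before releasing it.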
The CMU Wireless Emulator (Judd and Steenkiste, 2004) is a wireless network testbed based on a large Field-Programmable Gate Array (FPGA) that can modify wireless signals sent by real wireless devices according to signal propagation models.

A major distinction between simulation and emulation is that simulation contains only software modules representing network protocols and network entities, such as routers and links, and mimics network transactions as pure logic operations on the state variables. Examples of network simulators include Barr et al. (2005); Tyan and Hou (2001) and Varga (2001). The ns-2 simulator (Breslau et al., 2000) is one of the most popular simulators with a rich collection of network algorithms and protocols for both wired and wireless networks. To scale up network simulation, a number of parallel and distributed simulators have also been developed, which include SSFNet (Cowie et al., 1999), GTNets (Riley, 2003), ROSSNet (Yaun et al., 2003), and GloMoSim (Bajaj et al., 1999). Next, we describe parallel and distributed simulation as the enabling technique for real-time simulation.

Parallel and Distributed Simulation

Parallel and distributed simulation, also known as parallel simulation or parallel discrete-event simulation (PDES), is concerned with executing a single discrete-event simulation program on parallel computers (Fujimoto, 1990). By exploiting the concurrency of a simulation model, parallel simulation can overcome the limitations of sequential simulation in both execution time and memory space. The critical issue in allowing a discrete-event simulation program to run in parallel is to maintain the causality constraint, which means that simulation events in the system must be processed in non-decreasing timestamp order. This is because an event with a smaller timestamp has the potential to change the state of the system and affect events that happen later (with larger timestamps). Most parallel simulation adopts spatial decomposition: a model is divided into sub-models called logical processes (LPs), each of which maintains its own local simulation clock and can run on a different processor. For network simulation, a simulated network can be partitioned into smaller sub-networks, each handled by a different processor.

How the causality constraint is enforced divides parallel simulation into conservative and optimistic approaches. The conservative approach strictly prohibits out-of-order event execution: a processor must be blocked from processing the next event in its event queue until it is safe to do so. That is, it must ensure that no event will arrive from another processor with a timestamp earlier than the local simulation clock. In contrast, the optimistic approach allows events to be processed out of order. Once a causality error is detected (an event arrives at a logical process with a timestamp in the simulated past), the simulation will be rolled back to a state before the error occurs.
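As a concrete illustration of the conservative blocking rule described above, the following sketch shows a logical process that only executes events it knows to be safe. It is a simplified teaching example under stated assumptions, not the algorithm of any particular simulator: the class and method names are hypothetical, and each input channel is assumed to deliver messages in non-decreasing timestamp order.

```python
import heapq


class LogicalProcess:
    """Hypothetical conservative LP with one clock per input channel."""

    def __init__(self, input_channels):
        self.clock = 0.0
        self.events = []                      # heap of (timestamp, seq, payload)
        self.channel_clock = {ch: 0.0 for ch in input_channels}
        self._seq = 0                         # tie-breaker for equal timestamps

    def receive(self, channel, timestamp, payload):
        # Assumption: timestamps on a given channel never decrease, so the
        # latest timestamp bounds everything that channel can still deliver.
        self.channel_clock[channel] = timestamp
        heapq.heappush(self.events, (timestamp, self._seq, payload))
        self._seq += 1

    def safe_time(self):
        # No future message can carry a timestamp below this bound.
        return min(self.channel_clock.values())

    def process_safe_events(self):
        """Execute every pending event at or below the safe time; block otherwise."""
        processed = []
        bound = self.safe_time()
        while self.events and self.events[0][0] <= bound:
            timestamp, _, payload = heapq.heappop(self.events)
            self.clock = timestamp
            processed.append((timestamp, payload))
        return processed
```

An LP with a silent input channel cannot advance in this scheme, which is why conservative protocols need additional information about future messages to avoid blocking indefinitely.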
In order for the simulation to retract and recover from an erroneous execution path, state saving and recovery mechanisms are typically provided.

The seminal work for the conservative approach is the CMB algorithm, an asynchronous algorithm proposed independently by Chandy and Misra (1979), and Bryant (1977). The CMB algorithm provides several important observations that epitomize the fundamentals of conservative synchronization. One important concept is lookahead. To avoid deadlock, an LP must determine a lower bound on the timestamp of messages it will send to another LP. In essence, lookahead is the amount of simulation time that an LP can predict into the simulated future. Extensive performance studies emphasize the importance of extrapolating lookahead from the model (Fujimoto, 1988, 1989; Reed et al., 1988). Nicol (1996) gave a classification of lookahead based on different levels of knowledge that can be extracted from the model. The use of different dimensions of lookahead underlies conservative synchronization protocols. Several models have been shown to exhibit good lookahead properties, such as first-come-first-serve stochastic queuing networks (Nicol, 1988) and continuous-time Markov chains (Nicol and Heidelberger, 1995). In addition, several synchronization protocols have been developed to exploit lookahead for general applications, such as the conditional event approach by Chandy and Sherman (1989), the YAWNS protocol by Nicol (1991), the bounded lag algorithm by Lubachevsky (1988), the distance-between-objects algorithm by Ayani (1989), and the TNE algorithm by Groselj and Tropper (1988).

The first optimistic synchronization protocol is the Time Warp algorithm (Jefferson, 1985). Since the optimistic approach allows events to be processed out of timestamp order, Time Warp provides mechanisms to “roll back” erroneous event processing. An LP is able to save and later restore its state and “unsend” any messages it sent to other LPs during an erroneous execution. Since Time Warp requires state saving during event processing, the algorithm must be able to reclaim the memory resource; otherwise, the simulation would soon run out of memory. To accomplish this, the concept of global virtual time (GVT) is introduced as a timestamp lower bound of all unprocessed or partially processed events at any given time. It serves as a “moving commitment horizon”: any message and state with a timestamp less than GVT can be reclaimed and any irrevocable operations (such as I/O) that happen before GVT can be committed.
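The rollback and memory reclamation ideas above can be illustrated with a small sketch. It is a deliberately reduced model, not Jefferson's full Time Warp: the class `TimeWarpLP` and its methods are hypothetical names, the state is copied before every event, and the "unsend" mechanism for messages is omitted.

```python
import copy


class TimeWarpLP:
    """Hypothetical optimistic LP with copy-based state saving."""

    def __init__(self, initial_state):
        self.state = initial_state
        self.clock = 0.0
        self.history = []     # (timestamp, event, state snapshot taken before the event)
        self.rollbacks = 0

    def _apply(self, timestamp, event):
        # Save the state first so that this event can be undone later.
        self.history.append((timestamp, event, copy.deepcopy(self.state)))
        self.state["log"].append(event)      # stand-in for a real event handler
        self.clock = timestamp

    def handle(self, timestamp, event):
        undone = []
        if timestamp < self.clock:           # straggler: it arrived in the simulated past
            self.rollbacks += 1
            while self.history and self.history[-1][0] > timestamp:
                ts, ev, snapshot = self.history.pop()
                self.state = snapshot        # restore the state saved before that event
                undone.append((ts, ev))
        self._apply(timestamp, event)
        for ts, ev in reversed(undone):      # re-execute undone events in timestamp order
            self._apply(ts, ev)

    def fossil_collect(self, gvt):
        # History entries older than GVT can never be rolled back, so reclaim them.
        self.history = [entry for entry in self.history if entry[0] >= gvt]
```

The `fossil_collect` step shows why GVT matters: without a global lower bound, no saved state could ever be released safely.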
Time Warp needs to overcome several problems in order to maintain good efficiency. These problems have prompted a flood of research in areas of state saving (e.g., Gomes et al., 1996; Lin and Lazowska, 1990; Lin et al., 1993; Ronngren et al., 1996), rollback (e.g., Gafni, 1988; Reiher et al., 1990; West, 1988), GVT computation (e.g., Fujimoto and Hybinette, 1997; Mattern, 1993; Samadi, 1985), memory management (e.g., Jefferson, 1990; Lin and Preiss, 1991; Preiss and Loucks, 1995), and alternative optimistic execution (e.g., Dickens and Reynolds, 1990; Sokol et al., 1988; Steinman, 1991, 1993).

The jury is still out on which of the two approaches is the better choice, because parallel simulation performance largely depends on the characteristics of the simulation model. For network simulation, conservative synchronization is generally preferred because it requires a smaller memory footprint than its optimistic counterpart, which generally needs additional memory for state saving and rollback. An interesting exception is the reverse computation technique (Carothers et al., 1999). Instead of applying state saving, one performs reverse computation to re-create the original state when rollback happens. A recent study shows that, with careful implementation, reverse computation achieves great memory efficiency in simulating large networks (Yaun et al., 2003).
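To make the contrast with state saving concrete, the sketch below undoes events by running an inverse handler instead of restoring a saved copy of the state. It is an illustrative toy under stated assumptions, not the implementation of Carothers et al.: the class name, the counters, and the event format are hypothetical, and the technique only works when each handler has an exact inverse.

```python
class ReverseComputationLP:
    """Hypothetical LP that rolls back by inverting its event handlers."""

    def __init__(self):
        self.packets = 0
        self.bytes = 0
        self.done = []        # (timestamp, size) of processed events; no state copies

    def forward(self, timestamp, size):
        # Forward handler: account for one forwarded packet.
        self.packets += 1
        self.bytes += size
        self.done.append((timestamp, size))

    def _reverse(self, size):
        # Inverse handler: exactly undoes the effect of forward().
        self.bytes -= size
        self.packets -= 1

    def rollback(self, to_time):
        """Undo, newest first, every event with a timestamp later than `to_time`."""
        undone = []
        while self.done and self.done[-1][0] > to_time:
            timestamp, size = self.done.pop()
            self._reverse(size)
            undone.append((timestamp, size))
        return undone
```

Only the event records are kept, so the memory cost per event is independent of the size of the LP state, which is the source of the memory savings mentioned above.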

REAL-TIME NETWORK SIMULATION

Real-time simulation combines the advantages of simulation and emulation by conducting network simulation in real time and interacting with real applications and real network traffic. It allows us to study the impact of real application traffic on the virtual network and study real application behavior under a diverse set of simulated network conditions. Specifically, real-time network simulation provides the following capabilities:

• Accuracy. Real-time network simulation is based on simulation; thus, it is able to efficiently capture detailed packet-level transactions in the network. This is particularly true for simulating packet forwarding on wired infrastructure networks as it is relatively straightforward to calculate the link state with sufficient accuracy (such as the delay for a packet being forwarded from one router to the next). Real-time network simulation can also increase the fidelity of simulation since it can create real traffic conditions generated by real applications. Furthermore, existing implementations, such as routing protocols, can be incorporated directly in simulation rather than using a separate implementation just for simulation purposes. The design and implementation of network protocols, services, and applications is complex and labor-intensive. Maintaining code separately for simulation and for real deployment would have to include costly procedures for verification and validation.

• Repeatability. Repeatability is important to both protocol development and evaluation. In real-time network simulation, an experiment may or may not be repeatable, depending on whether interaction with the applications is repeatable or not. The virtual network in real-time network simulation is controlled by simulation events, and thus can be used to produce repeatable network conditions to test real network applications.

• Scalability. Emulation typically implements packet transmission by actually directing a packet across a physical link, although in some cases this process can be accelerated by using special programmable devices (e.g., DeHart et al., 2006). In comparison, network operations in real-time network simulation are handled in software; each packet transmission involves only a few changes to the state variables in simulation that are computationally insignificant compared to the I/O overhead. Furthermore, since packet forwarding operations are relatively easy to parallelize, the simulated network can be scaled up far beyond what could be supported by emulation.

• Flexibility. Simulation is both a tool for analyzing the performance of existing systems and a tool for evaluating new design alternatives potentially under various operating settings. Once a simulation model is in place, it takes little effort to conduct simulation experiments, for example, to explore a wide spectrum of design space. We can also incorporate different analytical models in real-time network simulation. For example, we can use low-resolution models to describe aggregate Internet traffic behavior, which can significantly increase the scale of the network being simulated.

Most real-time network simulators are based on existing network simulators extended with emulation capabilities in order to interact with real applications. Examples include NSE (Fall, 1999), IP-TNE (Bradford et al., 2000), MaSSF (Liu et al., 2003), and Maya (Zhou et al., 2004). NSE is an emulation extension of the popular ns-2 simulator with support for connecting with real applications and scheduling real-time events.
ns-2 is built on a sequential discrete-event simulation engine, which severely limits the size of the network it is capable of simulating; for real-time simulation, this means that the size of the network has to be kept small to allow real-time processing. IP-TNE is an emulation extension of an existing parallel network simulator. It is the first simulator we know of that applies parallel simulation to large-scale network emulations. MaSSF is built on our parallel simulator DaSSF with support for the grid computing environment. Maya is an emulation extension of a simulator for wireless mobile networks.

Our real-time network simulator is called PRIME, which stands for Parallel Real-time Immersive network Modeling Environment. The implementation of PRIME inherits most of our previous efforts in the development of DaSSF, a process-oriented and conservatively synchronized parallel simulation engine designed for multi-protocol communication networks. DaSSF can run on most platforms, including shared-memory multiprocessors and clusters of distributed-memory machines. The DaSSF simulation engine is very fast and has been demonstrated capable of handling large network models, including simulation of infrastructure networks, cellular systems, wireless ad hoc networks, and wireless sensor networks.

In order to support large-scale simulation, PRIME applies advanced parallel simulation techniques. For example, to achieve good performance on distributed-memory machines, PRIME adopts a hierarchical synchronization scheme to address the discrepancy in the communication cost between distributed-memory and shared-memory platforms (Liu and Nicol, 2001). Further, PRIME implements the composite synchronization algorithm (Nicol and Liu, 2002), which combines the traditional synchronous and asynchronous conservative parallel simulation algorithms. Consequently, PRIME is able to efficiently simulate diverse network scenarios, including those that exhibit large variability in link types (particularly with the existence of low-latency connections) and node types (especially for those with a large degree of connectivity).

PRIME extends DaSSF with emulation capabilities, where unmodified implementations of real applications can interact with the network simulator that operates in real time. Traffic originating from the real applications is captured by PRIME’s emulation facilities and forwarded to the simulator. The real network packets are treated as simulation events as they are “carried” on the virtual network and experience appropriate delays and losses according to the run-time state of the simulated network.

SUPPORTING REAL-TIME PERFORMANCE

Real-time network simulation needs to resolve two important and related issues: responsiveness and timeliness. Responsiveness dictates that the real-time simulator must be able to interact with real applications in time. That is, the system interface must be able to receive and respond to real-time events promptly according to proper real-time deadlines. Timeliness refers to the system’s ability to keep up with the wall-clock time. That is, the simulation must be able to characterize the behavior of the networks, potentially with millions of network entities and with a large number of network traffic flows, in real time. Failing to do so will introduce timing faults, where the simulation fails to process events before the designated deadlines. An elevated occurrence of timing faults will cause the simulator to become less responsive when interacting with real applications. In this section we briefly describe the techniques we have developed so far to address these issues.
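The notion of a timing fault can be made concrete with a small pacing loop. This is a hedged sketch and not PRIME's scheduler: the function `run_real_time`, the `tolerance` parameter, and the `ManualClock` helper (a deterministic stand-in for the wall clock) are all assumptions made for the example.

```python
import heapq


class ManualClock:
    """Deterministic stand-in for the wall clock (hypothetical test helper)."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def run_real_time(events, handler, clock, tolerance=0.001):
    """Process (timestamp, name) events paced against `clock`.

    Returns a list of (name, lateness) for events that missed their deadline.
    """
    pending = list(events)
    heapq.heapify(pending)
    faults = []
    while pending:
        timestamp, name = heapq.heappop(pending)
        now = clock.time()
        if now < timestamp:
            # The simulation is ahead of real time: idle until the event is due.
            clock.sleep(timestamp - now)
        elif now - timestamp > tolerance:
            # The simulation has fallen behind: record a timing fault.
            faults.append((name, now - timestamp))
        handler(name)
    return faults
```

In a live setting, the clock object would wrap a monotonic system clock and a real sleep call; the manual clock is used here only so that the behavior can be reproduced exactly.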

Hybrid Traffic Modeling

Large-scale real-time network simulation requires that the simulation be able to characterize the network behavior in real time. To speed up simulation, on the one hand, we apply parallel and distributed simulation techniques to harness the computing resources of parallel computers to physically increase the event-processing power; on the other hand, we resort to multi-resolution modeling techniques, mixing in models with a high level of abstraction (and low resolution) to reduce the computational demand. Our solution to this problem is to use a hybrid network traffic model that combines a fluid-based analytical model using ordinary differential equations (ODEs) with the traditional packet-oriented discrete-event simulation (Liu, 2006). The model extends the fluid model by Liu et al. (2004), where ODEs are used to predict the mean behavior of the dynamic TCP congestion windows, the network queue lengths, and packet loss probabilities, as traffic flows through a set of network queues. These network queues are augmented with functions to handle both fluid flows and individual packets, as well as the interaction between them. We briefly describe the functions of these equations below. A detailed discussion of the hybrid model can be found in Liu (2006). We first define the variables in Table 1.

(1) (2) (3)


Table 1. Variables defined in the hybrid model

(4) (5) (6) (7) (8)





(12) (13)

Equation (1) models the additive-increase-multiplicative-decrease (AIMD) behavior of a TCP congestion window during the congestion avoidance stage. The window size and the round-trip time determine the arrival rate at the first router in Equation (5). For UDP flows, we use a constant send rate instead. The arrival rate at subsequent routers is the same as the departure rate at the predecessor router, only postponed by the link’s propagation delay, as prescribed in Equation (6). Equation (7) sums up the arrivals of both fluid and packet flows. The total arrival rate, together with the loss probability and the link’s bandwidth, is used to determine the instantaneous queue length in Equation (2). An average queue length is then calculated in Equation (3), which is derived from the Exponential Weighted Moving Average (EWMA) calculation in network queues with RED (Random Early Detection) queue management. The calculated average queue length contributes to the loss probability as dictated by the RED policy in Equation (4). The loss probability for drop-tail queues can be calculated directly from projected buffer overflows. Equation (9) describes the departure rate as a function of the arrival rate postponed by the queuing delay calculated using Equation (8). Equations (10) and (11) calculate the delay and loss accumulated since the segment of the flow originated at the traffic source. The cumulative delay and loss are used to calculate the round-trip time and the total loss rate in Equations (12) and (13), which in turn are used to calculate the congestion window size.
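The feedback loop described above (window size drives arrival rate, arrival rate drives queue length, queue length drives loss, and loss drives the window) can be sketched numerically. The code below is not the chapter's Equations (1) to (13): it is a simplified single-queue, single-class fluid model in the general style of the TCP fluid-model literature, integrated with a plain Euler step. All parameter names and default values are assumptions chosen for illustration, feedback delays are ignored, and the EWMA is applied once per time step.

```python
def red_drop_probability(avg_queue, q_min, q_max, p_max):
    """RED drop probability as a piecewise-linear function of the average queue."""
    if avg_queue < q_min:
        return 0.0
    if avg_queue >= q_max:
        return 1.0
    return p_max * (avg_queue - q_min) / (q_max - q_min)


def simulate_fluid(n_flows=30, capacity=2500.0, base_rtt=0.02, buffer_size=500.0,
                   q_min=50.0, q_max=250.0, p_max=0.1, ewma_weight=0.002,
                   dt=0.001, duration=20.0):
    """Euler integration of one TCP fluid class through a single RED queue.

    Units are packets and seconds. Returns a list of
    (time, window, queue, loss_probability) samples, one per step.
    """
    window, queue, avg_queue = 1.0, 0.0, 0.0
    trace = []
    for step in range(int(round(duration / dt))):
        rtt = base_rtt + queue / capacity        # propagation plus queuing delay
        loss = red_drop_probability(avg_queue, q_min, q_max, p_max)
        arrival = n_flows * window / rtt         # aggregate arrival rate (packets/s)
        # AIMD: grow by one packet per RTT, halve the window per loss indication.
        d_window = 1.0 / rtt - (window / 2.0) * loss * (window / rtt)
        # The queue grows with accepted arrivals and drains at the link capacity.
        d_queue = arrival * (1.0 - loss) - capacity
        window = max(1.0, window + d_window * dt)
        queue = min(buffer_size, max(0.0, queue + d_queue * dt))
        avg_queue += ewma_weight * (queue - avg_queue)   # EWMA of the queue length
        trace.append((step * dt, window, queue, loss))
    return trace
```

The cost of such a model depends on the number of time steps and fluid classes rather than on the number of packets, which is the reason a fluid representation can be much cheaper than packet-level simulation for large traffic volumes.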


With proper performance optimization (Liu and Li, 2008), this hybrid traffic model can achieve significant performance improvement, in certain cases by more than three orders of magnitude. The hybrid model can also be parallelized to achieve even greater performance.

Figure 1. Instantaneous queue length.

Figure 2. Speedup over packet simulation.

To illustrate the potential of this approach, here we examine the accuracy and performance of the hybrid model using a simple dumbbell network model. In the experiment, the dumbbell network contains two routers in the middle connecting N server nodes on one side and N client nodes on the other side. Each server node directs M simultaneous TCP flows to the corresponding client node. All links are set with a propagation delay of 5 ms. The experiments were run sequentially on an Apple Mac Pro with two 3 GHz dual-core Intel Xeon processors and 9 GB of memory.

We first set N = 10 and M = 30. Half of the connections are established at time 10 and the rest at time 50. We set the bandwidth of the bottleneck link to be 20 Mb/s. Each server or client node connects to its adjacent router over a 10 Mb/s link. Figure 1 compares the instantaneous queue lengths at the bottleneck router as predicted by fluid-based and packet-oriented simulations, as well as a hybrid of the two. The result from the fluid-based simulation matches well with that of the packet-oriented simulation in terms of averaged behavior. The hybrid model (with 50% fluid flows and 50% packet flows) produces similar results.

To show the overall performance benefit of our hybrid approach, we use the same dumbbell topology but change the parameters, such as the bandwidth at the bottleneck link, so that the cost of the simulation increases proportionally as we increase the number of TCP sessions. Specifically, we vary M, the number of simultaneous TCP sessions between each pair of client-server nodes. We set the bandwidth of the link between each client or server node and its adjacent router to be (10 × M) Mb/s. The network queues at both ends of the link have a buffer size of M MB. The link between the two routers has a bandwidth of (10 × M × N) Mb/s. The corresponding network queues in the two routers have a buffer size of (M × N) MB. All TCP sessions start at time 0 and the experiments are run for 100 simulated seconds. The rest of the parameters are the same as in the previous experiment.
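The scaling rules of the second experiment can be written down as a small helper, which makes it easy to see how the link and buffer sizes grow with N and M. The function name and the dictionary keys are hypothetical; the formulas are the ones stated in the text above.

```python
def dumbbell_parameters(n, m):
    """Scaled dumbbell settings for n server/client pairs and m TCP flows per pair."""
    return {
        "tcp_flows": n * m,                       # total simultaneous TCP sessions
        "access_bandwidth_mbps": 10 * m,          # each client/server access link
        "access_buffer_mb": m,                    # queues at both ends of an access link
        "bottleneck_bandwidth_mbps": 10 * m * n,  # link between the two routers
        "bottleneck_buffer_mb": m * n,            # router queues on the bottleneck link
        "propagation_delay_ms": 5,                # same for every link
    }


# Example: the largest configuration mentioned in the text (N = 100, M = 40).
largest = dumbbell_parameters(100, 40)
```

Because every bandwidth and buffer scales with the number of sessions, the per-flow share of the network stays constant and the packet-level workload grows in proportion to N × M.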
Figure 2 shows the speedup of the fluid model over the pure packet simulation with different performance improvement techniques enabled one at a time (see Liu and Li, 2008 for more details about these performance improvement techniques). Here we set N = 100 and M ∈ {5, 10, 20, 40}. We see that, when we turn on all improvement techniques in the case of M = 40, we obtain a speedup of as much as 3,057 over packet-oriented simulation. The effective packet-event rate reaches over 566 million packet-events per second.

We further extend the hybrid model to represent network background traffic (Li and Liu, 2009a). In real-time network simulation, we can make a distinction between foreground traffic, which is generated by the real applications we intend to study with high fidelity, and background traffic, which represents the bulk of the network traffic that is of secondary interest and does not necessarily require significant accuracy. Nevertheless, background traffic interferes with foreground traffic as they both compete for network resources, and thus determines (and also is determined by) the behavior of network applications under investigation (Vishwanath and Vahdat, 2008). Our enhanced model enables bi-directional flows and uses heavy-tail distributions to describe the flow durations.

To enable bi-directional flows, we assume that the forwarding path of the TCP flows in the fluid class i (from the source to the destination) consists of n queues: f1, f2, ..., fn, and the reverse path (from the destination to the source) consists of m queues: r1, r2, ..., rm. We use Equation (5) to calculate the arrival rate at the first queue f1. For subsequent queues except r1, i.e., f2, ..., fn and r2, ..., rm, we use Equation (6) to calculate the arrival rate from the departure rate at the predecessor queue. For queue r1 (the first queue on the reverse path), we have:

(14)

where α1 is the average ACK packet size, and βi is the average data packet size in fluid class i. This equation represents the conversion from the data flows on the forwarding path to the corresponding ACK flows on the reverse path.

To capture traffic burstiness, we use the Poisson Pareto Burst Process (PPBP) model to predict the aggregate Internet traffic. PPBP is a process based on multiple overlapping bursts, with Poisson arrivals and burst lengths following a heavy-tail distribution (Zukerman et al., 2003). We schedule TCP session arrivals using the exponential distribution with a mean arrival rate μ. The durations of the TCP sessions d are independent and identically distributed Pareto random variables with parameters δ > 0 and 1 < γ < 2:

(15)

With the Pareto-distributed flow duration, we can reproduce the long-range dependence (LRD) characteristic of realistic background traffic in our model, which can be evaluated by a parameter called the Hurst parameter:

(16)

When 0.5 < H < 1, the traffic exhibits LRD and is self-similar. In our fluid model, we replace the constant number of homogeneous fluid flows ni with the PPBP process, Ni(t). Specifically, we redefine the equations for calculating the arrival rate at the first queue f1 (Equation 5), and the end-to-end packet loss rate (Equation 13) as follows:

(17) (18)

Figure 3 shows the result of an experiment using the same dumbbell model, measuring the number of packets per second sent over time for both packet simulation (left plots) and the fluid background traffic model (right plots). From top to bottom, we progressively decrease the sampling time scale while keeping the number of samples at 300. The starting time scale is 1 second; each subsequent plot is obtained from the previous one by concentrating on a randomly chosen sub-interval one tenth the length of the previous one. That is, the time resolution is increased by a factor of 10. To a large extent, the results from the packet-oriented simulation and from the fluid-based simulation are similar, except for the 10 ms timescale (bottom plots). The fluid model does not capture packet details at the sub-RTT level; the RTT for the dumbbell model is at least 10 ms.
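The session process described above can be sketched as follows. The code assumes the usual Pareto tail, P(d > x) = (x/δ)^(−γ) for x ≥ δ, and the standard PPBP relation H = (3 − γ)/2 between the Pareto shape and the Hurst parameter; whether these are written exactly as in Equations (15) and (16) is an assumption. The function names are hypothetical, and the symbols mu, delta, and gamma follow the text.

```python
import random


def pareto_duration(rng, delta, gamma):
    """Inverse-transform sample with P(d > x) = (x / delta) ** (-gamma) for x >= delta."""
    u = 1.0 - rng.random()                 # uniform on (0, 1], avoids division by zero
    return delta * u ** (-1.0 / gamma)


def hurst_parameter(gamma):
    """Hurst parameter of a PPBP whose burst lengths have Pareto shape gamma (1 < gamma < 2)."""
    return (3.0 - gamma) / 2.0


def ppbp_sessions(rng, mu, delta, gamma, horizon):
    """Generate (start, end) times of sessions arriving as a Poisson process of rate mu."""
    sessions = []
    t = rng.expovariate(mu)                # exponential inter-arrival times
    while t < horizon:
        sessions.append((t, t + pareto_duration(rng, delta, gamma)))
        t += rng.expovariate(mu)
    return sessions


def active_sessions(sessions, t):
    """Number of sessions in progress at time t, the quantity that replaces a constant flow count."""
    return sum(1 for start, end in sessions if start <= t < end)
```

A shape parameter closer to 1 gives a heavier tail, a larger H, and therefore burstier traffic across time scales; a value near 2 pushes H toward 0.5, where long-range dependence disappears.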


Figure 3. Traffic burstiness.

Scalable Emulation Infrastructure

A large-scale network simulation must be able to interact with a large number of real applications. The emulation infrastructure, which connects the simulator to the applications, must be able to embed real applications easily in the real-time simulation. There are several ways to incorporate real applications into a simulation environment; which one to use largely depends on where the interactions take place. Several techniques exist that allow running unmodified software, which include using packet capturing techniques (such as libpcap, IP table, and IP tunnel), preloading dynamic libraries, and modifying the binary executables. In certain cases, moderate software modifications are necessary to allow efficient direct execution.

Our first attempt follows an open system approach (Liu et al., 2007). The emulation infrastructure is built on the Virtual Private Network (VPN), which is customized to function as a gateway that bridges traffic between the physical entities and the simulated network (see Figure 4). Client machines run real applications. They establish a connection to the simulation gateway as VPN clients (by running an automatically generated VPN configuration script). Traffic generated by the applications running on the client machines and destined for the virtual network is directed by the emulation infrastructure to the real-time network simulator.

We use an example to show how it works. Suppose two client machines are connected to the simulation gateway (not necessarily the same one) and want to communicate with each other. Each client is assigned its own IP address. Packets sent from the first client to the second are forwarded to the VPN server at the simulation gateway. The VPN server has been altered to forward the packets to a daemon process (ssfgwd), which then sends the packets to the real-time simulator via a dedicated TCP connection. At the simulator, the packets are injected into the simulation event list; the simulator simulates the packets being forwarded on the virtual network as if they were created by the virtual node with the same IP address as the first client. Upon reaching the virtual node with the second client's IP address, the packets are exported from simulation and travel in the reverse direction via the simulation gateway back to the client machine assigned that IP address.
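The import and export steps in this example can be sketched as a tiny event-driven loop. This is an illustration of the idea only, not PRIME's or ssfgwd's code: the class name, the packet dictionary format, the per-hop delay table, and the use of integer milliseconds are assumptions, and the addresses in the usage note are placeholder documentation addresses.

```python
import heapq
import itertools


class GatewaySimulatorSketch:
    """Hypothetical simulator core that carries imported real packets as events."""

    def __init__(self, hop_delays_ms):
        # Maps (source address, destination address) to the delay of each virtual hop.
        self.hop_delays_ms = hop_delays_ms
        self.events = []                 # heap of (time_ms, seq, next_hop_index, packet)
        self.exported = []               # (time_ms, packet) handed back to the gateway
        self._seq = itertools.count()    # unique tie-breaker so packets are never compared

    def inject(self, now_ms, packet):
        """Import a captured packet at the virtual node that mirrors its sender."""
        heapq.heappush(self.events, (now_ms, next(self._seq), 0, packet))

    def run_until(self, end_ms):
        """Advance the simulation, exporting packets that reach their destination node."""
        while self.events and self.events[0][0] <= end_ms:
            time_ms, _, hop, packet = heapq.heappop(self.events)
            delays = self.hop_delays_ms[(packet["src"], packet["dst"])]
            if hop == len(delays):
                self.exported.append((time_ms, packet))    # leaves the virtual network
            else:
                heapq.heappush(
                    self.events,
                    (time_ms + delays[hop], next(self._seq), hop + 1, packet),
                )
```

For instance, with a three-hop virtual path of 5, 10, and 5 ms between two placeholder addresses, a packet injected at 1000 ms is exported at 1020 ms, at which point the gateway would send it on to the receiving client.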

Figure 4. VPN emulation infrastructure.



Figure 5. VM emulation infrastructure.

One distinct advantage of this approach is that the emulation infrastructure does not require special hardware to set up. It is also secure and scalable, merits inherited directly from the underlying VPN implementation. Multiple simulation gateways can run simultaneously. To produce accurate results, however, the emulation infrastructure needs a tight coupling between the emulated entities (i.e., the client machines) and the real-time simulator. In particular, the segment between the client machines and the real-time network simulator should consist of only low-latency links. To maintain high throughput, the segment must also provide sufficient bandwidth to carry the emulation traffic. With these constraints, the physical latency between the clients and the simulator can actually be made transparent in the network model (Liljenstam et al., 2005). The idea is to allow an emulation packet in simulation to preempt other simulated packets in the network queues so that the packet can be delivered ahead of its schedule, compensating for the physical delays.

We also examine machine virtualization as a way to provide an accurate environment for running real applications. Machine virtualization has found a number of interesting applications, including resource management in data centers, security, virtual desktop environments, and software distribution. Recently, researchers have also proposed using virtualization techniques for building network emulation testbeds. We follow the method proposed by Maier et al. (2007) to classify virtual machine (VM) solutions for network emulation. Classical virtual machines, such as VMWare Workstation and User-Mode Linux (Dike, 2000), provide full machine virtualization and can therefore run unmodified guest operating systems. These solutions offer complete transparency (with a complete abstraction of a computer system) to the guest operating system, but in doing so incur a large performance overhead. Lightweight virtual machines, such as Xen (Barham et al., 2003), VMWare ESX Server, and Denali (Whitaker et al., 2002), implement partial virtualization for greater efficiency, but require slight modification of guest OSes.

In addition to virtualizing an entire operating system instance, researchers have proposed virtual network stacks (Bavier et al., 2006; Huang et al., 1999; OpenVZ; Soltesz et al., 2007; Zec, 2003) and virtual routers (Maier et al., 2007; VRF) as alternative solutions. With virtual network stacks, applications running on the same OS instance are presented with multiple independent network stacks, which can be managed individually and control distinct physical devices. With virtual routers, a single OS instance can maintain multiple routing table instances, thereby allowing multiple instances of router software to execute side by side. Since these two techniques only virtualize the network resource, they provide greater efficiency than lightweight VMs. They do not, however, provide complete isolation of resources (such as CPU); they are also invasive, sometimes requiring substantial modification to the guest OS. Our work so far has explored the use of lightweight virtual machines and virtual network stacks as candidate emulated elements in a real-time simulation infrastructure.
We have built a real-time simulation infrastructure that can seamlessly use lightweight virtual machines to emulate arbitrary network elements, including routers and application endpoints. We looked into four types of network resources that may be provided by a virtual machine: network sockets, network interfaces, the forwarding table, and the loopback device. Network sockets (TCP, UDP, and raw sockets) are used by applications to establish connectivity and exchange information. Network interfaces and the forwarding table are used by routing protocols to conduct network forwarding. A network loopback device is sometimes used by separate processes to communicate on the same machine. We investigated four popular virtualization technologies (Xen, OpenVZ, Linux-VServer, and VRF) and found that, while Xen and OpenVZ provide all four types of network resources, Linux-VServer and VRF offer only partial network virtualization support.



Figure 5 shows a high-level view of our VM-based emulation infrastructure. We view each physical machine as a basic scaling unit, where emulated hosts are mapped to independent virtual machines (or virtual environments) so that they can run unmodified applications. Each instance of the real-time simulator runs on a separate virtual machine of the same physical machine and processes events associated with a designated sub-network. The simulator instances on different physical machines are synchronized using conservative parallel simulation techniques. Real network traffic generated by the applications is intercepted by the hypervisor (or VM manager) and sent to the virtual machine where the corresponding real-time simulator instance is located. The simulator then processes these packets, applying packet delays and losses according to the simulated network conditions.
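To make the last step concrete, the following is a minimal sketch of how delay and loss might be applied to one intercepted packet. It assumes a single link described by a bandwidth, a propagation delay, and an independent loss probability; this simple link model and the function name are illustrative assumptions, not the network model actually used by the simulator.

```python
import random


def apply_network_conditions(arrival_time, size_bytes, bandwidth_bps,
                             propagation_delay_s, loss_prob, rng):
    """Return the simulated delivery time of a packet, or None if it is dropped.

    Hypothetical single-link model: the packet is lost with probability
    loss_prob; otherwise it is delayed by its transmission time plus the
    propagation delay of the link.
    """
    if rng.random() < loss_prob:
        return None
    transmission_delay = size_bytes * 8 / bandwidth_bps
    return arrival_time + transmission_delay + propagation_delay_s


# A 1500-byte packet on a 12 Mb/s link with 20 ms propagation delay and no loss.
rng = random.Random(0)
delivered_at = apply_network_conditions(1.0, 1500, 12e6, 0.020, 0.0, rng)
```

With these numbers the transmission time is 1 ms, so the packet that arrives at time 1.0 s is delivered at roughly 1.021 s.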

APPLICATIONS AND CASE STUDIES

We have successfully applied real-time simulation to study many applications, including routing algorithms, transport protocols, content distribution services, web services, multimedia streaming, and peer-to-peer networks. In this section, we select several case studies to demonstrate the potential of real-time simulation.

Large-Scale Routing Experiments

The availability of open-source router platforms, such as XORP, Zebra, and Quagga, has simplified the task of researchers, who can now prototype and evaluate routing protocols with relative ease. To support experiments on a large-scale network consisting of many routers with multiple traffic sources and sinks, we need to integrate the open-source router platforms with the real-time network simulator. Since the routers are emulated outside the real-time simulator on client machines where they can run the real routing software directly, every packet traveling along its path from the source to the destination needs to be exported to each intermediate router for forwarding decisions, and subsequently imported back into the simulation engine. Thus, the forwarding operation for each packet at each hop would incur substantial I/O overhead. Consequently, the overall overhead would significantly impact the performance of the emulation infrastructure, especially in large-scale routing experiments. To avoid this problem, we propose a forwarding plane offloading approach, which moves the packet forwarding functions from the emulated router software to the simulation engine so that we can eliminate the I/O overhead associated with communicating bulk traffic back and forth between the router software and the real-time simulator (Li et al., 2008). In our current implementation, we combine XORP with PRIME to provide a scalable platform for conducting routing experiments. We create a forwarding plane plug-in in XORP, which maintains a command channel with the PRIME simulator for transferring forwarding information updates and network interface configuration requests between the XORP instance and the corresponding simulated router.

We carried out several experiments using the scalable routing platform. These experiments include an intra-domain routing experiment using a realistic Abilene network model (Li et al., 2008) with the objective of observing the convergence of OSPF and its effect on data traffic. We injected a link failure followed by a recovery between two routers on the network. We were able to measure their effect on the round-trip time and data throughput of end applications. We also conducted realistic large-scale inter-domain routing experiments consisting of major autonomous systems connecting Swedish Internet users with realistic routing configurations derived from the routing registry (Li and Liu, 2009b). We ran a series of real-time security exercises on this routing system to study the consequence of intentionally propagating false routing information on inter-domain routing and the effectiveness of corresponding defensive measures.
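The forwarding plane offloading idea can be illustrated with a small sketch. Here the simulated router keeps its own copy of the forwarding table, which the routing software updates through a command channel, so that data packets are forwarded by longest-prefix match without leaving the simulator. The class name, the `"add"`/`"delete"` operations, and the interface names are hypothetical; the actual message format exchanged between XORP and PRIME is not described in this chapter.

```python
import ipaddress


class SimulatedRouter:
    """Toy simulated router with a forwarding table maintained by external routing software."""

    def __init__(self):
        self.fib = {}  # ip_network -> name of the outgoing interface

    def handle_update(self, op, prefix, next_hop=None):
        # Called for each forwarding information update received on the
        # (hypothetical) command channel from the emulated routing software.
        net = ipaddress.ip_network(prefix)
        if op == "add":
            self.fib[net] = next_hop
        elif op == "delete":
            self.fib.pop(net, None)
        else:
            raise ValueError(f"unknown operation: {op}")

    def forward(self, dst):
        # Longest-prefix match performed inside the simulator, so the data
        # packet never has to be exported to the routing software.
        addr = ipaddress.ip_address(dst)
        matches = [net for net in self.fib if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.fib[best]


router = SimulatedRouter()
router.handle_update("add", "10.0.0.0/8", "if0")
router.handle_update("add", "10.1.0.0/16", "if1")
```

Only the control messages cross the boundary between the routing software and the simulator; a linear scan over the table is used here for brevity, whereas a production forwarding table would typically use a trie.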

Large-Scale TCP Evaluation

The TCP congestion control mechanism, which limits the rate of data entering the network, is essential to the overall stability of the network under traffic congestion and important to the protocol’s performance. It has been widely documented that the traditional TCP congestion control algorithms (such as TCP Reno and TCP SACK) have serious problems preventing TCP from reaching high data throughput over high-speed long-latency links. Consequently, quite a number of TCP variants have been proposed to directly tackle these problems. Compared with the traditional methods, these TCP variants typically adopt more aggressive congestion control methods in order to address the under-utilization problem of TCP over networks with a large bandwidth-delay product. The ability to establish an objective comparison between these high-performance TCP variants under diverse networking conditions and to obtain a quantitative assessment of their impact on the global network traffic is essential to a community-wide understanding of various design approaches. Small-scale experiments are insufficient for a comprehensive study of these TCP variants.

We developed a TCP performance evaluation testbed, called SVEET, based on the real-time simulation technique and using real implementations of the TCP variants, which are evaluated under diverse network configurations and workloads in large-scale network settings (Erazo et al., 2009). In order for SVEET to accommodate data communications with multi-gigabit throughput, we apply time dilation, proportionally slowing down the virtual machines and the network simulator. Time dilation allows us to provide much higher bandwidths than the physical system and the network simulator could otherwise provide, at the cost of increased experiment time. We adopt the time dilation technique developed by Gupta et al. (2006), which can uniformly slow the passage of time from the perspective of the guest operating system (XenoLinux). This is achieved primarily by enlarging the interval between timer interrupts delivered to the virtual machines from the Xen hypervisor by a specified factor, called the Time Dilation Factor (TDF). Time dilation scales the perceived I/O rate and processing power on the virtual machines by the same factor. For instance, if a virtual machine has a TDF of 10, time as perceived by the applications running on the virtual machine advances 10 times slower than wall-clock time. Correspondingly, the applications experience a tenfold increase in both network capacity and CPU cycles.

We ported a set of TCP congestion control algorithms from the ns-2 simulator, comprising thirteen TCP variants originally implemented for Linux. In doing so we are able to conduct large-scale experiments using simulated traffic generated by these TCP variants. We also customized the Linux kernel on the virtual machines to include these TCP variants so that we can test them using real applications running on the virtual machines and communicating via the TCP/IP stack. We conducted extensive experiments to validate our testbed and investigated the impact of TCP variants on web applications, multimedia streaming, and peer-to-peer traffic.
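The arithmetic behind the Time Dilation Factor can be written down directly. The helper functions below are illustrative only (their names are not taken from SVEET or from the work of Gupta et al.); they simply restate the scaling relations described above.

```python
def perceived_seconds(wall_seconds, tdf):
    """Time that appears to pass inside a dilated virtual machine."""
    return wall_seconds / tdf


def perceived_rate(physical_rate, tdf):
    """Rate (e.g. bits per second or CPU cycles per second) as seen by the guest."""
    return physical_rate * tdf


def wall_clock_cost(virtual_seconds, tdf):
    """Wall-clock time needed to run an experiment lasting virtual_seconds."""
    return virtual_seconds * tdf


# With a TDF of 10, a physical 1 Gb/s link appears as a 10 Gb/s link to the guest,
# and a 60-second experiment takes 600 seconds of wall-clock time.
link_seen_by_guest = perceived_rate(1e9, 10)
experiment_wall_time = wall_clock_cost(60, 10)
```

The last line shows the trade-off stated above: the higher perceived bandwidth is paid for with a proportionally longer experiment.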

Large-Scale Peer-to-Peer Content Distribution Network

We design one of the largest network experiments that involve a real implementation of a peer-to-peer content distribution system under HTTP traffic from a public-domain empirical workload trace and using a realistic large network model (Liu et al., 2009). The main idea behind the content distribution network (CDN) is to replicate content at the edge of the Internet closer to the clients. In doing so, a CDN can alleviate both the workload at the server and the traffic load at the network core. We choose to use an open-source CDN system called CoralCDN (Freedman et al., 2004), which is a peer-to-peer web-content distribution network that consists of three parts: 1) a network of cooperative web proxies for handling HTTP requests, 2) a network of domain name servers (DNS) to map clients to nearby web proxies, and 3) an underlying clustering mechanism and an indexing infrastructure to facilitate DNS mapping and content distribution. We statically mapped the clients to nearby Coral nodes to send HTTP requests. Thus we ignore CoralCDN’s DNS redirection function and focus only on web-content distribution for the experiment.

We extend the Rocketfuel dataset to build the network model for our study. Rocketfuel (Spring et al., 2004) contains the topology of 13 tier-1 ISPs, derived from information obtained from traceroute paths, BGP routing tables, and DNS. Previously, we created a best-effort Internet topology for large-scale network simulation studies using the Rocketfuel dataset (Liljenstam et al., 2003). Based on this study, we further process the Rocketfuel network topology to improve accuracy and reduce data noise. We choose to use one of the tier-1 ISP networks for our study, which contains 637 routers (out of which 235 are backbone routers) connected by 1,381 links. Attached to the backbone network are medium-sized stub networks, called campus networks. Each campus network consists of 504 end hosts, organized into 12 local area networks (LANs) connected by 18 routers. Four extra end hosts are designated to form a server cluster. Each LAN consists of a gateway router and 42 end hosts. The entire campus network is divided into four OSPF areas. The campus network is connected to the outside world through a BGP router. We attach 84 such campus networks to the tier-1 ISP network. The entire network thus contains 42,672 end hosts and 3,157 routers.

We place one CoralCDN node within each of the 12 LANs of the 84 campus networks (at one of the 42 end hosts in each LAN), thus making a total of 1,008 CoralCDN nodes overall. Each CoralCDN node is emulated in a separate OpenVZ container. The web clients are simulated; they send HTTP requests to the CoralCDN node within the same LAN and subsequently receive data objects from the Coral proxy. PRIME implements a full-fledged TCP model that allows simulated nodes to interact with real TCP counterparts. We attach a stub network to a backbone router in the tier-1 ISP network (located in Paris, France) to run a web server, emulated on a separate compute node.

We select the HTTP trace from the 1998 World Cup web site, which is publicly available (Arlitt and Jin, 1998). The trace contains all HTTP requests made to the 1998 World Cup web site. We select a 24-hour period of this trace (from June 5, 1998, 22:00:01 GMT to June 6, 1998, 22:00:00 GMT). The segment consists of 5,452,684 requests originating from 40,491 clients. We pre-process the trace to extract the sequence of requests sent from each client and randomly map the 40,491 clients to the end hosts in our network model for a complete daily pattern of the caching behavior. Through the experiment, we were able to collect three important metrics to analyze the performance of the peer-to-peer content distribution network: cache hit rate, web server load, and response time.
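The network dimensions quoted above can be cross-checked with a few lines of arithmetic. The sketch below assumes that the four server hosts are counted in addition to the 504 end hosts of each campus network, and that the 12 LAN gateway routers are counted in addition to the 18 routers that connect the LANs; the text does not state this explicitly, but it is the reading that reproduces the stated totals.

```python
ISP_ROUTERS = 637                # routers in the chosen tier-1 ISP network
CAMPUS_NETWORKS = 84
LANS_PER_CAMPUS = 12
HOSTS_PER_LAN = 42
SERVER_HOSTS_PER_CAMPUS = 4      # assumed to be extra to the 504 LAN hosts
CONNECTING_ROUTERS_PER_CAMPUS = 18  # assumed to exclude the LAN gateway routers


def model_size():
    """Recompute the size of the network model from its building blocks."""
    hosts_per_campus = LANS_PER_CAMPUS * HOSTS_PER_LAN + SERVER_HOSTS_PER_CAMPUS
    routers_per_campus = CONNECTING_ROUTERS_PER_CAMPUS + LANS_PER_CAMPUS
    return {
        "end_hosts": CAMPUS_NETWORKS * hosts_per_campus,
        "routers": ISP_ROUTERS + CAMPUS_NETWORKS * routers_per_campus,
        "coral_nodes": CAMPUS_NETWORKS * LANS_PER_CAMPUS,
    }


size = model_size()
```

Under these assumptions the computation gives 42,672 end hosts, 3,157 routers, and 1,008 CoralCDN nodes, matching the figures in the text.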

CONCLUSIONS AND FUTURE WORK

In this chapter we describe real-time simulation of large-scale networks and compare it against other major tools for networking research. We discuss the problems that may prevent simulation from achieving real-time performance and subsequently present our current solutions. We conduct large-scale network experiments incorporating real-time simulation to demonstrate its capabilities. Future work includes efficient background traffic models for large-scale networks, a high-performance communication conduit for connecting virtual machines and the real-time simulator, and effective methods for configuring, running, and visualizing network experiments.

Acknowledgments

This chapter significantly extends our previous work (Liu, 2008) with a high-level summary of results published thereafter. Our research reported in this chapter is supported in part by National Science Foundation grants CNS-0546712, CNS-0836408 and HRD-0833093.




REFERENCES

Jong Suk Ahn, Peter B. Danzig, Zhen Liu, and Limin Yan. Evaluation of TCP Vegas: emulation and experiment. In Proceedings of the 1995 ACM SIGCOMM Conference, pages 185-195, August 1995.
Thomas Anderson, Larry Peterson, Scott Shenker, and Jonathan Turner. Overcoming the Internet impasse through virtualization. Computer, 38(4):34-41, 2005.
Martin Arlitt and Tai Jin. 1998 World Cup web site access logs. Available at: http://www., August 1998.
Rassul Ayani. A parallel simulation scheme based on the distance between objects. Proceedings of the 1989 SCS Multiconference on Distributed Simulation, 21(2):113-118, March 1989.
Lokesh Bajaj, Mineo Takai, Rajat Ahuja, Ken Tang, Rajive Bagrodia, and Mario Gerla. GloMoSim: a scalable network simulation environment. Technical Report 990027, Department of Computer Science, UCLA, May 1999.
Paul Barford and Larry Landweber. Bench-style network research in an Internet instance laboratory. ACM SIGCOMM Computer Communication Review, 33(3):21-26, 2003.
Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP’03), 2003.
Rimon Barr, Zygmunt Haas, and Robbert van Renesse. JiST: An efficient approach to simulation using virtual machines. Software Practice and Experience, 35(6):539-576, May 2005.
Andy Bavier, Nick Feamster, Mark Huang, Larry Peterson, and Jennifer Rexford. In VINI veritas: realistic and controlled network experimentation. ACM SIGCOMM Computer Communication Review, 36(4):3-14, 2006.
Terry Benzel, Robert Braden, Dongho Kim, Clifford Neuman, Anthony Joseph, Keith Sklower, Ron Ostrenga, and Stephen Schwab. Experience with DETER: A testbed for security research. In Proceedings of 2nd International Conference on Testbeds and Research Infrastructures for the Development of Networks and Communities (TRIDENTCOM’06), March 2006.



Russell Bradford, Rob Simmonds, and Brian Unger. A parallel discrete event IP network emulator. In Proceedings of the 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS’00), pages 315-322, August 2000.
Lee Breslau, Deborah Estrin, Kevin Fall, Sally Floyd, John Heidemann, Ahmed Helmy, Polly Huang, Steven McCanne, Kannan Varadhan, Ya Xu, and Haobo Yu. Advances in network simulation. IEEE Computer, 33(5):59-67, May 2000.
Randal E. Bryant. Simulation of packet communication architecture computer systems. Technical Report MIT-LCS-TR-188, MIT, 1977.
Christopher D. Carothers, Kalyan S. Perumalla, and Richard M. Fujimoto. Efficient optimistic parallel simulations using reverse computation. ACM Transactions on Modeling and Computer Simulation, 9(3):224-253, July 1999.
Mark Carson and Darrin Santay. NIST Net: a Linux-based network emulation tool. SIGCOMM Computer Communication Review, 33(3):111-126, 2003.
K. M. Chandy and R. Sherman. The conditional event approach to distributed simulation. Proceedings of the 1989 SCS Multiconference on Distributed Simulation, 21(2):93-99, March 1989.
K. Mani Chandy and Jayadev Misra. Distributed simulation: A case study in design and verification of distributed programs. IEEE Transactions on Software Engineering, SE-5(5):440-452, May 1979.
James Cowie, David Nicol, and Andy Ogielski. Modeling the global Internet. Computing in Science and Engineering, 1(1):42-50, January 1999.
DaSSF. Dartmouth Scalable Simulation Framework. http://users.^liux/research/projects/dassf/index.html.
John DeHart, Fred Kuhns, Jyoti Parwatikar, Jonathan Turner, Charlie Wiseman, and Ken Wong. The open network laboratory. ACM SIGCSE Bulletin, 38(1):107-111, 2006.
Phillip M. Dickens and Paul F. Reynolds. SRADS with local rollback. Proceedings of the 1990 SCS Multiconference on Distributed Simulation, 22(1):161-164, January 1990.
Jeff Dike. A user-mode port of the Linux kernel. In Proceedings of the 4th Annual Linux Showcase & Conference, 2000.
Miguel Erazo, Yue Li, and Jason Liu. SVEET! A scalable virtualized evaluation environment for TCP. In Proceedings of the 5th International Conference on Testbeds and Research Infrastructures for the Development of Networks and Communities (TridentCom’09), April 2009.
Kevin Fall. Network emulation in the Vint/NS simulator. In Proceedings of the 4th IEEE Symposium on Computers and Communications (ISCC’99), pages 244-250, July 1999.
Sally Floyd and Vern Paxson. Difficulties in simulating the Internet. IEEE/ACM Transactions on Networking, 9(4):392-403, August 2001.
Michael J. Freedman, Eric Freudenthal, and David Mazieres. Democratizing content publication with Coral. In Proceedings of the 1st USENIX Symposium on Networked Systems Design and Implementation (NSDI’04), pages 239-252, 2004.
Richard M. Fujimoto. Lookahead in parallel discrete event simulation. In Proceedings of the 1988 International Conference on Parallel Processing, pages 34-41, August 1988.
Richard M. Fujimoto. Performance measurements of distributed simulation strategies. Transactions of the Society for Computer Simulation, 6(2):89-132, April 1989.
Richard M. Fujimoto. Parallel discrete event simulation. Communications of the ACM, 33(10):30-53, October 1990.
Richard M. Fujimoto and Maria Hybinette. Computing global virtual time in shared memory multiprocessors. ACM Transactions on Modeling and Computer Simulation, 7(4):425-446, October 1997.
A. Gafni. Rollback mechanisms for optimistic distributed simulation systems. Proceedings of the 1988 SCS Multiconference on Distributed Simulation, 19(3):61-67, July 1988.
Fabian Gomes, Brian Unger, and John Cleary. Language based state saving extensions for optimistic parallel simulation. In Proceedings of the 1996 Winter Simulation Conference (WSC’96), pages 794-800, December 1996.
Bojan Groselj and Carl Tropper. The time of next event algorithm. Proceedings of the 1988 SCS Multiconference on Distributed Simulation, 19(3):25-29, July 1988.
Diwaker Gupta, Kenneth Yocum, Marvin McNett, Alex Snoeren, Amin Vahdat, and Geoffrey Voelker. To infinity and beyond: time-warped network emulation. In Proceedings of the 3rd USENIX Symposium on Networked Systems Design and Implementation (NSDI’06), 2006.
Daniel Herrscher and Kurt Rothermel. A dynamic network scenario emulation tool. In Proceedings of the 11th International Conference on Computer Communications and Networks (ICCCN’02), pages 262-267, October 2002.
X. W. Huang, R. Sharma, and S. Keshav. The ENTRAPID protocol development environment. In Proceedings of the 1999 IEEE INFOCOM, pages 1107-1115, March 1999.
David R. Jefferson. Virtual time. ACM Transactions on Programming Languages and Systems, 7(3):404-425, July 1985.
David R. Jefferson. Virtual time II: Storage management in distributed simulation. In Proceedings of the 9th Annual ACM Symposium on Principles of Distributed Computing, pages 75-89, August 1990.
Xuxian Jiang and Dongyan Xu. VIOLIN: Virtual internetworking on overlay infrastructure. In Proceedings of the 2nd International Symposium on Parallel and Distributed Processing and Applications (ISPA’04), pages 937-946, 2004.
Glenn Judd and Peter Steenkiste. Repeatable and realistic wireless experimentation through physical emulation. ACM SIGCOMM Computer Communication Review, 34(1):63-68, 2004.
Ting Li and Jason Liu. A fluid background traffic model. In Proceedings of the 2009 IEEE International Conference on Communications (ICC’09), June 2009a.
Yue Li and Jason Liu. Real-time security exercises on a realistic interdomain routing experiment platform. In Proceedings of the 23rd Workshop on Principles of Advanced and Distributed Simulation (PADS’09), June 2009b.
Yue Li, Jason Liu, and Raju Rangaswami. Toward scalable routing experiments with real-time network simulation. In Proceedings of the 22nd Workshop on Principles of Advanced and Distributed Simulation (PADS’08), pages 23-30, June 2008.
Michael Liljenstam, Jason Liu, and David M. Nicol. Development of an Internet backbone topology for large-scale network simulations. In Proceedings of the 2003 Winter Simulation Conference, pages 694-702, 2003.
Michael Liljenstam, Jason Liu, David M. Nicol, Yougu Yuan, Guanhua Yan, and Chris Grier. RINSE: the real-time interactive network simulation environment for network security exercises. In Proceedings of the 19th Workshop on Parallel and Distributed Simulation (PADS’05), pages 119-128, June 2005.
Yi-Bing Lin and Edward D. Lazowska. Reducing the state saving overhead for Time Warp parallel simulation. Technical Report 9002-03, Department of Computer Science, University of Washington, February 1990.
Yi-Bing Lin and Bruno R. Preiss. Optimal memory management for Time Warp parallel simulation. ACM Transactions on Modeling and Computer Simulation, 1(4):283-307, October 1991.
Yi-Bing Lin, Bruno Richard Preiss, Wayne Mervin Loucks, and Edward D. Lazowska. Selecting the checkpoint interval in Time Warp simulation. In Proceedings of the 7th Workshop on Parallel and Distributed Simulation (PADS’93), pages 3-10, May 1993.
Jason Liu. Packet-level integration of fluid TCP models in real-time network simulation. In Proceedings of the 2006 Winter Simulation Conference (WSC’06), pages 2162-2169, December 2006.
Jason Liu. A primer for real-time simulation of large-scale networks. In Proceedings of the 41st Annual Simulation Symposium (ANSS’08), April 2008.
Jason Liu and Yue Li. On the performance of a hybrid network traffic model. Simulation Modelling Practice and Theory, 16(6):656-669, 2008.
Jason Liu and David M. Nicol. Learning not to share. In Proceedings of the 15th Workshop on Parallel and Distributed Simulation (PADS’01), pages 46-55, May 2001.
Jason Liu, Scott Mann, Nathanael Van Vorst, and Keith Hellman. An open and scalable emulation infrastructure for large-scale real-time network simulations. In Proceedings of 2007 IEEE INFOCOM MiniSymposium, pages 2471-2475, May 2007.
Jason Liu, Yue Li, and Ying He. A large-scale real-time network simulation study using PRIME. In Proceedings of the 2009 Winter Simulation Conference (WSC’09), December 2009. To appear.
Xin Liu, Huaxia Xia, and Andrew A. Chien. Network emulation tools for modeling grid behavior. In Proceedings of 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid’03), May 2003.
Yong Liu, Francesco Presti, Vishal Misra, Donald Towsley, and Yu Gu. Scalable fluid models and simulations for large-scale IP networks. ACM Transactions on Modeling and Computer Simulation (TOMACS), 14(3):305-324, July 2004.
Boris D. Lubachevsky. Bounded lag distributed discrete event simulation. Proceedings of the 1988 SCS Multiconference on Distributed Simulation, 19(3):183-191, July 1988.
Steffen Maier, Daniel Herrscher, and Kurt Rothermel. Experiences with node virtualization for scalable network emulation. Computer Communications, 30(5):943-956, 2007.
Friedemann Mattern. Efficient distributed snapshots and global virtual time approximation. Journal of Parallel and Distributed Computing, 18(4):423-434, August 1993.
David M. Nicol. Parallel discrete-event simulation of FCFS stochastic queueing networks. ACM SIGPLAN Notices, 23(9):124-137, September 1988.
David M. Nicol. Performance bounds on parallel self-initiating discrete-event simulations. ACM Transactions on Modeling and Computer Simulation, 1(1):24-50, January 1991.
David M. Nicol. Principles of conservative parallel simulation. In Proceedings of the 1996 Winter Simulation Conference (WSC’96), pages 128-135, December 1996.
David M. Nicol and Philip Heidelberger. A comparative study of parallel algorithms for simulating continuous time Markov chains. ACM Transactions on Modeling and Computer Simulation, 5(4):326-354, October 1995.
David M. Nicol and Jason Liu. Composite synchronization in parallel discrete-event simulation. IEEE Transactions on Parallel and Distributed Systems, 13(5):433-446, May 2002.
OpenVZ. http://
Larry Peterson, Tom Anderson, David Culler, and Timothy Roscoe. A blueprint for introducing disruptive technology into the Internet. HotNets-I, October 2002.
Bruno Richard Preiss and Wayne Mervin Loucks. Memory management techniques for Time Warp on a distributed memory machine. In Proceedings of the 9th Workshop on Parallel and Distributed Simulation (PADS’95), pages 30-39, June 1995.
PRIME.
Quagga.
D. Raychaudhuri, I. Seskar, M. Ott, S. Ganu, K. Ramachandran, H. Kremo, R. Siracusa, H. Liu, and M. Singh. Overview of the ORBIT radio grid testbed for evaluation of next-generation wireless network protocols. In Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC’05), March 2005.
Daniel A. Reed, Allen D. Malony, and Bradley McCredie. Parallel discrete event simulation using shared memory. IEEE Transactions on Software Engineering, 14(4):541-53, April 1988.
P. L. Reiher, R. M. Fujimoto, S. Bellenot, and D. Jefferson. Cancellation strategies in optimistic execution systems. Proceedings of the 1990 SCS Multiconference on Distributed Simulation, 22(1):112-121, January 1990.
George F. Riley. The Georgia Tech network simulator. In Proceedings of the ACM SIGCOMM Workshop on Models, Methods and Tools for Reproducible Network Research (MoMe-Tools’03), pages 5-12, August 2003.
Luigi Rizzo. Dummynet: a simple approach to the evaluation of network protocols. ACM SIGCOMM Computer Communication Review, 27(1):31-41, January 1997.
Robert Ronngren, Michael Liljenstam, Rassul Ayani, and Johan Montagnat. Transparent incremental state saving in Time Warp parallel discrete event simulation. In Proceedings of the 10th Workshop on Parallel and Distributed Simulation (PADS’96), pages 70-77, May 1996.
Behrokh Samadi. Distributed simulation, algorithms and performance analysis. PhD thesis, Department of Computer Science, UCLA, 1985.
L. M. Sokol, D. P. Briscoe, and A. P. Wieland. MTW: A strategy for scheduling discrete simulation events for concurrent execution. Proceedings of the 1988 SCS Multiconference on Distributed Simulation, 19(3):34-42, July 1988.
Stephen Soltesz, Herbert Potzl, Marc E. Fiuczynski, Andy Bavier, and Larry Peterson. Container-based operating system virtualization: A scalable, high-performance alternative to hypervisors. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems (EuroSys’07), March 2007.
Neil Spring, Ratul Mahajan, David Wetherall, and Thomas Anderson. Measuring ISP topologies with Rocketfuel. IEEE/ACM Transactions on Networking, 12(1):2-16, 2004.
Neil Spring, Larry Peterson, Andy Bavier, and Vivek Pai. Using PlanetLab for network research: myths, realities, and best practices. ACM SIGOPS Operating Systems Review, 40(1):17-24, 2006.
Jeff S. Steinman. SPEEDES: Synchronous parallel environment for emulation and discrete event simulation. Proceedings of the SCS Multiconference on Advances in Parallel and Distributed Simulation, SCS Simulation Series, 23(1):95-103, January 1991.
Jeff S. Steinman. Breathing Time Warp. In Proceedings of the 7th Workshop on Parallel and Distributed Simulation (PADS’93), pages 109-118, May 1993.
Ananth I. Sundararaj and Peter A. Dinda. Towards virtual networks for virtual machine grid computing. In Proceedings of the 3rd USENIX Conference on Virtual Machine Technology (VM’04), pages 14-14, 2004.
Joe Touch. Dynamic Internet overlay deployment and management using the X-Bone. In Proceedings of the 2000 International Conference on Network Protocols (ICNP’00), pages 59-68, 2000.
Hung-ying Tyan and Jennifer Hou. JavaSim: A component based compositional network simulation environment. In Proceedings of the Western Simulation Multiconference, January 2001.
Amin Vahdat, Ken Yocum, Kevin Walsh, Priya Mahadevan, Dejan Kostic, Jeff Chase, and David Becker. Scalability and accuracy in a large scale network emulator. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI’02), pages 271-284, December 2002.
András Varga. The OMNeT++ discrete event simulation system. In Proceedings of the European Simulation Multiconference (ESM’01), June 2001.
Kashi Venkatesh Vishwanath and Amin Vahdat. Evaluating distributed systems: Does background traffic matter. In Proceedings of the 2008 USENIX Technical Conference, pages 227-240, May 2008.

Parallel and Distributed Immersive Real-Time Simulation of....


90. VMWare ESX Server. 91. VMWare Workstation. workstation.html. VRF. Linux Virtual Routing and Forwarding. http:// linux-vrf/. Darrin West. Optimizing Time Warp: Lazy rollback and lazy re-evaluation. Master’s thesis, 92. Department of Computer Science, University of Calgary, January 1988. 93. A. Whitaker, M. Shaw, and S. Gribble. Denali: Lightweight virtual machines for distributed and networked applications. In Proceedings of the USENIX Annual Technical Conference, 94. June 2002. Brian White, Jay Lepreau, Leigh Stoller, Robert Ricci, Shashi Guruprasad, Mac Newbold, Mike Hibler, Chad Barb, and Abhijeet Joglekar. An integrated experimental environment for distributed systems and networks. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI 02), pages 255-270, December 2002. 95. XORP. 96. Garrett Yaun, David Bauer, Harshad Bhutada, Christopher Carothers, Murat Yuksel, and Shiv-kumar Kalyanaraman. Large-scale network simulation techniques: examples of TCP and OSPF models. ACM SIGCOMM Computer Communication Review, 33(3):27-41, 2003. 97. Zebra. 98. Marko Zec. Implementing a clonable network stack in the FreeBSD kernel. In Proceedings of the 2003 USENIX Annual Technical Conference, June 2003. 99. Pei Zheng and Lionel M. Ni. EMPOWER: a network emulator for wireline and wireless networks. In Proceedings of the 2003 IEEE INFOCOM, volume 3, pages 1933-1942, March/April 2003. 100. Junlan Zhou, Zhengrong Ji, Mineo Takai, and Rajive Bagrodia. MAYA: integrating hybrid network modeling to the physical world. ACM Transactions on Modeling and Computer Simulation (TOMACS), 14(2):149-169, April 2004. 101. Moshe Zukerman, Timothy D. Neame, and Ronald G. Addie. Internet traffic modeling and future technology implications. In Proceedings ofthe2003 IEEE INFOCOM, 2003.





Siauliai University, Siauliai, Lithuania


Vilnius University, Vilnius, Lithuania


National Technical University of Ukraine “Kyiv Polytechnic Institute”, Kyiv, Ukraine

ABSTRACT
This chapter presents a new methodology and two tools for the user-driven, Wikinomics-oriented development of scientific applications. Such applications use a service-oriented architecture in which the entire computing or simulation process that supports the research is broken down into a set of loosely coupled stages, implemented as interoperating, replaceable Web services that can be distributed over different clouds. Any piece of code and any application component deployed on a system can be reused and transformed into a service. The combination of service-oriented and cloud computing challenges the established way of developing research-supporting computing; the facilities it offers are considered in this chapter.

Keywords: service computing, engineering tools, Wikinomics, mathematical programming, software modeling

Citation: Vaidas Giedrimas, Leonidas Sakalauskas and Anatoly Petrenko (July 19th 2017). “Distributed Software Development Tools for Distributed Scientific Applications”, Recent Progress in Parallel and Distributed Computing, Wen-Jyi Hwang, IntechOpen, DOI: 10.5772/intechopen.68334. Copyright: © 2017 by author and Intech. This paper is an open access article distributed under a Creative Commons Attribution 3.0 License.

INTRODUCTION
The financial results of a business depend, among other factors, on the quality of the software it uses. Scientific software plays an even more special role: the reliability of scientific conclusions and the speed of scientific progress depend on its quality. However, the success rate of scientific software projects is no better than average: some projects fail, some exceed their budget, and some deliver an inadequate product. Misunderstandings between scientists as end users and software engineers are even more frequent than usual, because software engineers lack deep knowledge of the user’s domain (e.g., high energy physics, chemistry, and life sciences). To avoid such problems, scientists sometimes try to develop “home-made” software. However, the probability of failure in such projects is even higher, because scientists lack knowledge of the software engineering domain. For example, scientists commonly do not know good software engineering practices, processes, and so on. They may even be unaware of good practices or good software artefacts produced by their colleagues.

We are among those who believe that this problem can be solved using Wikinomics. The idea of Wikinomics (or Wiki economics) was introduced by Tapscott and Williams [15]. Wikinomics is a spatial activity that helps to achieve a result using only the available resources. Wiki technologies rest on very simple procedures: the project leaders gather a critical mass of volunteers who are willing and able to contribute on a small scale. The sum of such small contributions makes a huge contribution to the project result. The Wikipedia and Wikitravel portals can be presented as success stories of mass collaboration [17]. On the other hand, we believe that mass collaboration can improve only part of the scientific development process. We need software development solutions oriented to services and clouds in order to use all the available computational power of distributed infrastructures.

Service-oriented computing (SOC) is extremely powerful in terms of the help it gives the developer. The key point of modern scientific applications is a very quick transition from the hypothesis generation stage to the evaluating mathematical experiment, which is important for evidence and optimization of the result and its possible practical use. SOC technologies provide an important platform for making resource-intensive scientific and engineering applications more significant [1–4]. So any community, regardless of its working area, should be supplied with a technological approach for rapidly building its own distributed, compute-intensive, multidisciplinary applications. Service-oriented software developers work as application builders (service clients), service brokers, or service providers. Usually, a service repository is created that contains platform environment supporting services and application supporting services. The environment supporting services offer the standard operations for service management and hosting (e.g., cloud hosting, event processing and management, mediation and data services, service composition and workflow, security, connectivity, messaging, storage, and so on). They are correlated with generic services provided by other producers (e.g., EGI (http://www.egi.eu/), Flatworld (http://www.flatworldsolutions.com/), FI‐WARE (http://‐ enablers), SAP ( enterprise‐information‐management/), ESRC (http://ukdataservice., and so on). Two dimensions of service interoperability should be supported in the composition process: horizontal matching (communication protocol and data flow between services) and vertical matching (correspondence between an abstract user task and concrete service capabilities).
Modern scientific and engineering applications are built as a complex network of services offered by different providers, based on heterogeneous resources of different organizational structures. The services can be composed using orchestration or choreography. With orchestration, all the participating Web services are controlled by one central Web service. With choreography, there is no central orchestrator and the services are, to some extent, independent. Choreography is based on collaboration and is mainly used to exchange messages in public business processes. As SOC developed, a number of languages for service orchestration and choreography were introduced: BPEL4WS, BPML, WSFL, XLANG, BPSS, WSCI, and WSCL [5].
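The difference between the two composition styles can be illustrated with a minimal Python sketch. It is not taken from the chapter: the three service functions and the in-process `EventBus` are hypothetical stand-ins for remote Web services and a messaging layer, and the topic names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical services: plain functions standing in for remote Web services.
def preprocess(data):
    return [x for x in data if x is not None]

def simulate(data):
    return [x * 2 for x in data]

def summarize(data):
    return sum(data)

def orchestrate(data):
    """Orchestration: one central controller calls each service in turn."""
    cleaned = preprocess(data)
    simulated = simulate(cleaned)
    return summarize(simulated)

class EventBus:
    """Minimal in-process message bus (an assumption of this sketch)."""
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in self.handlers[topic]:
            handler(payload)

def choreograph(data):
    """Choreography: no controller; each service reacts to an event."""
    bus = EventBus()
    results = []
    # Each service knows only which event it consumes and which it emits.
    bus.subscribe("raw", lambda d: bus.publish("clean", preprocess(d)))
    bus.subscribe("clean", lambda d: bus.publish("simulated", simulate(d)))
    bus.subscribe("simulated", lambda d: results.append(summarize(d)))
    bus.publish("raw", data)
    return results[0]
```

Both functions compute the same result; what differs is where the knowledge of the invocation order lives: in one controller, or spread across the subscriptions.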

Our proposal has the following innovative features:

• Implementation of a novel service-oriented design paradigm in the area of distributed scientific application development, according to which all levels of research or design are divided into separate, loosely coupled stages and procedures that are subsequently converted into standardized Web services.
• Creation of a repository of research application Web services that support collective research computing, simulation, and the globalization of R&D activities.
• Adaptation of Wiki technologies for creating a repository of scientific applications’ source code, reusing existing software assets at the code level as well as at the Web services level.
• Personalization and customization of distributed scientific applications: users can build and adjust their research or design scenario and workflow by selecting the necessary Web services (as computing procedures) to be executed on cloud resources.

The rest of the paper is organized as follows: Section 2 presents the overall idea of the platform for research collaborative computing (PRCC). Section 3 presents a Web-enabled engineering design platform as one possible implementation of PRCC. Section 4 outlines the architecture and main components of our other system, which is based on Wiki technologies. Section 5 compares similar systems. Finally, conclusions are drawn and future work is discussed.

THE PLATFORM FOR RESEARCH COLLABORATIVE COMPUTING
Service-oriented computing is based on software services, which are platform-independent, autonomous computational elements. Services can be specified, published, discovered, and composed with other services using standard protocols. Such a composition of services can be treated as a widely distributed software system. Many languages for software service composition have been developed [5]. Their goal is to provide a formal way of specifying the connections and the coordination logic between services. In order to support the design, development, and execution of distributed applications in the Internet environment, we have developed an end-user development framework called the platform for research collaborative computing (PRCC).

Research collaborative computing is an emerging interdisciplinary field that embraces sciences such as chemistry, physics, biology, environmental sciences, hydrometeorology, and engineering, and even the arts and humanities. All these fields demand powerful tools for mathematical modeling and for supporting collaborative computing research. These tools should implement the idea of a virtual organization, including the possibility of combining distributed workflows, sequences of data processing functions, and so on. The platform for research collaborative computing is the answer to this demand. PRCC has the potential to benefit research in all disciplines at all stages of research. A well-constructed SOC can empower a research environment with a flexible infrastructure and processing environment by provisioning independent, reusable, automated simulation processes (as services) and by providing a robust foundation for leveraging these services.

Conceptually, PRCC is an intelligent multidisciplinary online gateway for researchers, available 24/7, that supports the following main user activities: login, new project creation, workflow creation, provision of input data such as the computational task description and constraints, specification of additional parameters, workflow execution, and collection of data for further analysis. User authorization is performed at two levels: for access to the virtual workplace (login and password) and for access to grid/cloud resources (grid certificate). Application creation: each customer can create projects using the services stored in the repository. Each application consists of a set of files containing information about the computing workflow, the solved tasks, execution results, and so on. A task to be solved can be described either in the problem-oriented languages of the respective services or with the graphic editor.
Constructing a computational route consists of choosing the computing services needed and connecting them in the required execution order. The workflow editor checks the compatibility of the numerical procedures to be connected. Parameters for the different computational tasks are provided through the respective Web interface elements or set by default (except for the control parameters, for instance, the desired value for time response, border frequencies for frequency analysis, and so on). It may also be necessary to provide the type and parameters of the output information (arguments of output characteristics, the scale type used for plot construction, and others). Launching execution initiates the generation of the application description in the internal format and its transfer to the task execution system. The Web and grid service orchestrator is responsible for automatically executing the route, which is composed of remote service invocations. Grid/cloud services invoked by the orchestrator during execution are responsible for preparing input data for a grid/cloud task, launching it, querying its execution state, downloading the task results, and transferring them to the orchestrator. Execution results consist of a set of files containing information on the results of the computation performed (according to the parameters set by the user), including plots and histograms, logs of the errors that stopped separate branches of the route, and ancillary data regarding the grid/cloud resources used and the execution of grid/cloud tasks. Based on an analysis of the results received, a customer can decide to repeat the execution of the computational workflow with changed workflow fragments, input data, and parameters of the computing procedures.

Managing service-oriented applications requires knowing more about the services, their providers, and the customers. There are two roles in the development process: service provider and application builder. This separation of concerns lets application architects concentrate on the business logic (in this case, research), while the technical details are left to service providers. A comprehensive repository of various services would make it possible to use the services for the personal or institutional requirements of scientific users by incorporating existing services into a widely distributed system (Figure 1).
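The compatibility check performed when services are connected into a route can be sketched as follows. This is only an illustration under stated assumptions: the chapter does not say how the workflow editor represents service inputs and outputs, so the `ServiceStep` record, its type labels, and the service names below are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServiceStep:
    """One computing service in a route (hypothetical description format)."""
    name: str
    consumes: str  # label of the data type the service expects
    produces: str  # label of the data type the service returns

def check_route(steps):
    """Return a list of mismatches between consecutive steps of a route."""
    problems = []
    for upstream, downstream in zip(steps, steps[1:]):
        if upstream.produces != downstream.consumes:
            problems.append(
                f"{upstream.name} -> {downstream.name}: "
                f"{upstream.produces} != {downstream.consumes}"
            )
    return problems

# A consistent route and a broken one (the middle step is missing).
route = [
    ServiceStep("model_builder", "netlist", "equations"),
    ServiceStep("transient_analysis", "equations", "waveforms"),
    ServiceStep("plotter", "waveforms", "plot"),
]
broken = [route[0], route[2]]
```

Calling `check_route(route)` yields an empty list, while `check_route(broken)` reports that the output of `model_builder` cannot be fed to `plotter`. A real editor would presumably compare richer interface descriptions than a single label.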
Services can be clustered into two main groups: application supporting services (with the subgroups data processing services and modeling and simulation services) and environment supporting (generic) services (with the subgroups cloud hosting for the provision of computational, network, and software resources; application/service ecosystems and delivery framework; security; workflow engine for calculation purposes; and digital science services). As far as the authors know, there are no similar user-oriented platforms supporting experiments in mathematics and applied sciences. PRCC introduces a new methodology for planning and modeling mathematical experiments. It can improve the future competitiveness of science by strengthening its scientific and technological base in the area of experimentation and data processing, which makes public service infrastructures and simulation processes smarter, i.e., more intelligent, more efficient, more adaptive, and more sustainable.

Possible content of services’ repository
Storing ever-increasing amounts of data, making them available for sharing, and providing scientists and engineers with efficient means of data processing are pressing problems today. In the PRCC, these problems are addressed by the service repository described here. From the beginning, it includes application supporting services (AS) for the typical scheme of a computational modeling experiment already considered.

Figure 1. General structure of PRCC.

Web services can contain program code that implements concrete tasks from mathematical modeling and data processing, and they can also present the results of calculations in grid/cloud e-infrastructures. They provide procedures for solving the equations of mathematical models, depending on the equation type (differential, nonlinear algebraic, and linear) and on the selected science and engineering analysis. Software services are the main building blocks for the following functionality: data preprocessing and results postprocessing, mathematical modeling, DC, AC, TR, STA, FOUR, and sensitivity analyses, optimization, statistical analysis and yield maximization, tolerance assignment, data mining, and so on. A more detailed description of the typical scheme of a computational modeling experiment, which is invariant across many fields of science and technology, is given in [3, 10]. The offered list of calculation types covers a considerable part of the possible needs in solving applied research tasks computationally in many fields of science and technology.

Services are registered in the UDDI (Universal Description, Discovery, and Integration) network service, which facilitates access to them from different clients. The needed functionality is exposed via the Web service interface. Each Web service can launch computations, start and cancel jobs, monitor their status, retrieve the results, and so on.

Besides modeling tasks, there are other types of computational experiments in which distributed Web service technologies for scientific data analysis can be used. User scenarios for them include curve fitting and approximation procedures for estimating the relationships among variables, classification techniques for sorting data into categories, clustering techniques for grouping a set of objects so that objects in the same group (cluster) are more similar to each other than to those in other groups, pattern recognition utilities, image processing, and filtering and optimization techniques. These computational Web services for data processing are used in different branches of science and technology during data collection, data management, data analytics, and data visualization, wherever there are very large data sets: earth observation data from satellites; data in meteorology, oceanography, and hydrology; experimental data in high energy physics; observational data in astrophysics; seismograms, earthquake monitoring data, and so on.
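The job-control operations that each Web service is said to expose (launching, cancelling, status monitoring, result retrieval) can be pictured with a small in-memory sketch. The class and method names are assumptions made for illustration; the chapter does not publish the actual service interface, and a real service would be reached over the network rather than called in process.

```python
import itertools
from enum import Enum

class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    DONE = "done"
    CANCELLED = "cancelled"

class ComputeService:
    """In-memory stand-in for one computational Web service (hypothetical)."""

    def __init__(self, procedure):
        self._procedure = procedure      # the numerical procedure offered
        self._ids = itertools.count(1)   # job identifier generator
        self._jobs = {}

    def submit(self, payload):
        """Register a job and return its identifier."""
        job_id = next(self._ids)
        self._jobs[job_id] = {"state": JobState.QUEUED,
                              "payload": payload, "result": None}
        return job_id

    def run(self, job_id):
        """Execute a queued job; cancelled or finished jobs are skipped."""
        job = self._jobs[job_id]
        if job["state"] is not JobState.QUEUED:
            return
        job["state"] = JobState.RUNNING
        job["result"] = self._procedure(job["payload"])
        job["state"] = JobState.DONE

    def cancel(self, job_id):
        """Cancel a job that has not started yet."""
        job = self._jobs[job_id]
        if job["state"] is JobState.QUEUED:
            job["state"] = JobState.CANCELLED

    def status(self, job_id):
        return self._jobs[job_id]["state"]

    def result(self, job_id):
        job = self._jobs[job_id]
        if job["state"] is not JobState.DONE:
            raise RuntimeError("result not available")
        return job["result"]
```

A client would typically call `submit`, poll `status` until it reports `JobState.DONE`, and then fetch the output with `result`.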
Services may be offered by different enterprises and communicate over the PRCC; therefore, they provide a distributed computing infrastructure for both intra- and cross-enterprise application integration and collaboration. For semantic service discovery in the repository, a set of ontologies was developed, which includes a resource ontology (hardware and software grid and cloud resources used for workflow execution), a data ontology (for annotating large data files and databases), and a workflow ontology (for annotating past workflows and enabling their reuse in the future). The ontologies will be separated into two levels: generic ontologies and domain-specific ontologies. Services will be annotated in terms of their functional aspects, such as IOPE (inputs, outputs, preconditions, and effects), internal states (an activity could be executed in a loop, and it will keep track of its internal state), data transformation (e.g., unit or format conversion between input and output), and internal processes (which can describe in detail how to interact with a service, e.g., a service that takes partial sets of data on each call and performs some operation on the full set after the last call).
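To make the idea of annotation-based discovery concrete, the following sketch stores an inputs/outputs/preconditions/effects record for each service and filters the repository by it. It assumes that annotations can be reduced to sets of plain labels; the ontologies described above would be far richer, and every name used here is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ServiceAnnotation:
    """Functional annotation of a service, reduced to sets of labels."""
    name: str
    inputs: frozenset
    outputs: frozenset
    preconditions: frozenset = field(default_factory=frozenset)
    effects: frozenset = field(default_factory=frozenset)

def discover(repository, available, wanted, facts=frozenset()):
    """Names of services that can run on `available` data and yield `wanted`.

    A service matches when all its inputs are available, all wanted outputs
    are produced, and all its preconditions are among the known facts.
    """
    return [s.name for s in repository
            if s.inputs <= available
            and wanted <= s.outputs
            and s.preconditions <= facts]

repository = [
    ServiceAnnotation("fourier_analysis",
                      frozenset({"waveform"}), frozenset({"spectrum"})),
    ServiceAnnotation("unit_converter",
                      frozenset({"waveform"}), frozenset({"waveform_si"}),
                      preconditions=frozenset({"units_declared"})),
]
```

With only a waveform available, a request for a spectrum finds `fourier_analysis`, whereas `unit_converter` is returned only once the fact `units_declared` is known.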

Management of Web Services
The service-oriented paradigm implies automated composition and orchestration of software services using workflows. Each workflow defines how tasks should be orchestrated and which components should be executed in which sequence. The workflow also includes the details of synchronization and data flows. Workflow management may be based on the standard Web service orchestration description language WS-BPEL 2.0 (Business Process Execution Language). The initial XML-based description of the abstract workflow, which contains the task description parameters (prepared by the user via the editor), is transformed into a WS-BPEL 2.0 description. Then, the orchestration engine invokes Web services, passing this task description to them for execution. The workflow management engine provides seamless and transparent execution of the concrete workflows generated by the composition service. This engine leverages existing solutions to allow execution of user-defined workflows on the right combination of resources and services available through clusters, grids, clouds, or Web services. Furthermore, the project plans to develop new scheduling strategies for workflow execution that take into account multicriteria expressions defined by the user as a set of preferences and requirements. In this way, workflow execution could be directed, for instance, to minimize execution time, to reduce the total fee, or any combination of both.

The configuration and coordination of services in service-based applications and the composition of services are equally important in modern service systems [6]. Services interact with each other via messages. Messaging can follow the “request-response” template, in which at a given time only one specific service is invoked by one user (“one-to-one” connection, or the synchronous model); the “publish/subscribe” template, in which many services can respond to one particular event (“one-to-many” communication, or the asynchronous model); or intelligent agents that determine the coordination of services, because each agent has at its disposal some knowledge of the business process and can share this knowledge with other agents. Such a system can combine the qualities of service-oriented systems (SOS), such as interoperability and openness, with the properties of multi-agent systems (MAS), such as flexibility and autonomy.
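One simple way to picture a multicriteria scheduling preference of the kind mentioned above is a weighted sum of normalized execution time and fee. The sketch below is an assumption for illustration only: the chapter does not specify the planned scheduling strategies, and the resource names and estimates are invented.

```python
def pick_resource(candidates, time_weight=0.5, cost_weight=0.5):
    """Choose the resource with the lowest weighted time/fee score.

    candidates maps a resource name to (estimated_hours, estimated_fee).
    Both criteria are scaled by their maximum so the weights are comparable;
    the maxima are assumed to be positive.
    """
    max_time = max(t for t, _ in candidates.values())
    max_cost = max(c for _, c in candidates.values())

    def score(item):
        hours, fee = item[1]
        return time_weight * hours / max_time + cost_weight * fee / max_cost

    return min(candidates.items(), key=score)[0]

# Hypothetical estimates for one workflow on three kinds of resources.
candidates = {
    "campus_cluster": (10.0, 0.0),
    "public_cloud": (2.0, 40.0),
    "grid_site": (5.0, 10.0),
}
```

With these figures, weighting only time selects `public_cloud`, weighting only the fee selects `campus_cluster`, and equal weights select `grid_site`. A weighted sum is just one of many ways to combine user preferences.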

PROTOTYPING OPTIMAL DESIGN PLATFORM FOR ENGINEERING
An analysis of state-of-the-art scientific platforms shows that there is a need for a platform oriented to distributed computing. This requires similar environments to be redesigned in terms of separate, interacting software services, so designers should specify a workflow of service interactions. Based on PRCC facilities, the Institute of Applied System Analysis (IASA) of NTUU “Kiev Polytechnic Institute” (Ukraine) has developed the use case WebALLTED as a Web-enabled engineering design platform intended, in particular, for the modeling and optimization of nonlinear dynamic systems that consist of components of different physical nature and are widespread in different scientific and engineering fields. It is a cross-disciplinary application for distributed computing. The engineering service-oriented simulation platform developed consists of several layers (Figure 2). The most important features of this architecture are the following: Web accessibility, the distribution of functionality across software services in the e-infrastructure, compatibility with existing protocols and standards, support for user-made scenarios at development time and at run time, and encapsulation of the complexity of software service interaction.


Figure 2. Main elements of SOA in the engineering simulation system.

The following functions are accessible via the user interface: authentication, workflow editor, artefact repository management environment, task monitoring, and more. The server side of the system is designed as a multitier one in order to implement the workflow concept described earlier. The first, access tier is the portal supporting the user environment. Its modules serve the following purposes: generating the abstract workflow specification from user input; passing the task specification to the lower tiers, where the task will be executed; and postprocessing the results, including saving the artefacts in the DB. The workflow manager works as the second, execution tier. It is deployed on the execution server. The purpose of this tier is to map the abstract workflow specification to particular software services. The orchestration is done using a specific language similar to WS-BPEL for BPEL instruments.

The workflow manager starts executing a particular workflow with the external orchestrator, observes the state of workflow execution, and collects its results. A particular workflow works with functional software services and performs the following actions: data preprocessing and postprocessing, simulation, optimization, and so on. If a high demand for resources is forecast, a single node could be loaded too heavily, so the computation is planned on separate nodes hosting grid/cloud services. These services make it possible to use a widespread infrastructure (such as a grid or cloud). It is possible to modify the system and to introduce new functions into it. The user does this by selecting or registering other Web or grid/cloud services.

The user is able to start the task in the execution tier. The task specification is transferred to the workflow management service. This abstract workflow is transformed into a particular implementation on the execution server. Then, the workflow manager analyses the specification, corrects possible errors in it (to some extent), requests the data about the services from the repository, and binds the sequence of activities to software service calls. To arrange software services in the correct invocation order, the Mapper unit is used in the workflow. It initializes XML messages, variables, etc., and provides the means of control at run time, including observing workflow execution, cancelling it, monitoring early results, and so on. Finally, the orchestrator executes this particular “script.” The user is informed about the progress of workflow execution by the monitoring unit, which communicates with the workflow manager. When execution is finished, the user can retrieve the results, browse and analyze them, and repeat this sequence if needed. The architecture hides the complexity of Web service interaction from the user behind the abstract workflow concept and a simple graphical workflow editor (Figure 3).
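The binding step, in which abstract activities are arranged in a valid invocation order and matched to registered services, can be sketched with a topological sort. The sketch assumes that the abstract workflow is a dependency graph; the activity names and the repository mapping are hypothetical, and `graphlib` is the Python standard-library module available from Python 3.9.

```python
from graphlib import TopologicalSorter

# Hypothetical abstract workflow: each activity lists the activities
# whose results it needs before it can run.
abstract_workflow = {
    "preprocess": set(),
    "simulate": {"preprocess"},
    "optimize": {"simulate"},
    "postprocess": {"simulate", "optimize"},
}

# Hypothetical repository lookup: activity -> registered service name.
repository = {
    "preprocess": "svc-data-prep",
    "simulate": "svc-transient",
    "optimize": "svc-param-opt",
    "postprocess": "svc-report",
}

def bind(workflow, services):
    """Return (activity, service) pairs in a valid invocation order."""
    order = list(TopologicalSorter(workflow).static_order())
    missing = [activity for activity in order if activity not in services]
    if missing:
        raise LookupError(f"no service registered for: {missing}")
    return [(activity, services[activity]) for activity in order]
```

For this graph the only valid order is preprocess, simulate, optimize, postprocess. If an activity has no registered service, the binding fails before anything is invoked, which loosely mirrors the specification check described above.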
Figure 3. WebALLTED graphical workflow editor.

Web services represent the basic building blocks of the simulation system’s functionality. They enable customers to build and adjust scenarios and workflows of their design procedures or mathematical experiments via the Internet by selecting the necessary Web services, including automatic creation of the equations of a mathematical model (of an object or a process) based on a description of its structure and the properties of the components used, operations with large-scale mathematical models, steady-state analysis, transient and frequency-domain analysis, sensitivity and statistical analysis, parametric optimization and optimal tolerance assignment, solution centering, yield maximization, and so on [3]. Computational supporting services are based mostly on innovative numerical methods and can be composed by an end user for workflow execution on available grid/cloud nodes [3]. They are oriented, first of all, to the design automation domain, where simulation, analysis, and design can be done for different electronic circuits, control systems, and dynamic systems composed of electronic, hydraulic, pneumatic, mechanical, electrical, electromagnetic, and other physical elements. The developed methodology and modeling toolkit support the collective design of various micro-electro-mechanical systems (MEMS) and different microsystems in the form of chips.

DISTRIBUTED WIKI-BASED SYSTEM FOR STOCHASTIC PROGRAMMING AND MODELING
As mentioned above, even with the huge computing power accessible via Web services and clouds, users have still not exhausted the possibilities, because of a lack of communication. Only communication and legal reuse of existing software assets, in addition to the available computing power, can ensure a high speed of scientific activities. This section describes another distributed scientific software development system, which was developed in parallel with and independently of the system described in the previous section. However, the two share similar ideas.

The Architecture of Stochastic Programming and Modeling System
We started from the following hypothesis: the development time of scientific software can be reduced and the quality of such software improved by combining the power of the grid/cloud infrastructure with Wiki-based technologies and software synthesis methods. The project was carried out in three main stages:

• Development of the portal for Wiki-based mass collaboration. This portal is used as the user interface in which scientists can specify software development problems, rewrite or refine the specifications and software artefacts provided by their (remote) colleagues, and contribute to the whole process of software development for a particular domain. The set of statistical simulation and optimization problems was selected as the target domain for the pilot project. In the future, the environment created can be applied to other domains as well.
• Development of the interoperability model in order to bridge the Wiki-based portal and the Lithuanian National Grid Infrastructure (NGI-LT) or other (European) distributed infrastructures. A private cloud based on Ubuntu One was created at Siauliai University within the framework of this pilot project.
• Refinement of existing methods for software synthesis using the power of distributed computing infrastructures. This stage is still under development, so it is not covered in this chapter. More details and early results are presented in [22].

The system for stochastic programming and statistical modeling based on Wiki technologies (WikiSPSM) consists of the following parts (Figure 4):

• Web portal with a content management system as the graphical user interface.
• Server-side back end for task scheduling and execution.
• Software artefact (programs, subroutines, models, etc.) storage and management system.

The user interface portal consists of four main components:

• Template-based generator of Web pages. This component helps the user to create Web page content using a template-based structure. The same component is used for the storage and version control of the generated Web pages.
• WYSIWYG text editor. This editor provides more functionality than a simple text editor on a Web page. It is dedicated to describing mathematical models and numerical algorithms. This component is enriched with text preprocessing algorithms that protect against hijacking attacks and code injection.
• IDE (integrated development environment) component, implemented for software modeling and code writing.
• Repository of mathematical functions. This component helps the user to retrieve and rewrite mathematical functions and to append new artefacts to the repository.

WikiSPSM uses the LAPACK API from the NetLib repository; however, it can be extended on demand to use other libraries, e.g., ESSL (Engineering and Scientific Subroutine Library) or Parallel ESSL [16]. WikiSPSM is easily extensible and evolvable because of the architectural decision to store all the mathematical models, algorithms, programs, and libraries in a central database.

Initially, it was planned that WikiSPSM would enable scientific users to write their software in C/C++, Java, Fortran 90, and Qt. Because of this, a command-line interface was chosen as the architecture of communication between the UI and the software generator part. The software generator performs the following functions: compilation, task submission (to a distributed infrastructure or to a single server), task monitoring, and control of the tasks and their results. For compiling the programs, we chose external command-line compilers. The architecture of the system allows its configuration to be changed so that it works with another programming language that has an SDK with a command-line compiler. Users are also encouraged to use command-line interfaces instead of GUIs.
The latest version of WikiSPSM does not support applications with graphical user interfaces, for two reasons: (a) many scientific applications are command-line-based, and the graphical representation of the data is performed with other tools; (b) GUI support imposes additional constraints on scientific applications.

Figure 4. Main components of the Wiki‐based stochastic programming and statistical modeling system.

In early versions of WikiSPSM, compilation and execution were performed on the server side. The object server creates a task object for each data array submitted through the portal. The task object parses the data and replies to the user (via the portal). A token is used for monitoring tasks and for retrieving their results. Once parsing is finished, the task object transfers the data to the server object, after which compilation and execution take place. Each task is queued and scheduled. If there are not enough resources (e.g., working nodes), the task is placed in the waiting queue. When the task is finished, its results are stored in the database (Figure 4).
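The queue-and-schedule behaviour described above can be illustrated with a minimal sketch. It is not WikiSPSM code: the class `TaskScheduler`, its method names, and the in-memory `results` dictionary (standing in for the database) are all assumptions made for illustration.

```python
from collections import deque


class TaskScheduler:
    """Toy model of the queue-and-schedule behaviour described above."""

    def __init__(self, free_nodes):
        self.free_nodes = free_nodes   # working nodes currently idle
        self.waiting = deque()         # tasks waiting for a node
        self.running = {}              # token -> callable being executed
        self.results = {}              # stands in for the results database

    def submit(self, token, job):
        """Queue a task; it starts at once if a node is free."""
        self.waiting.append((token, job))
        self._dispatch()

    def _dispatch(self):
        # Move waiting tasks onto idle nodes, oldest first.
        while self.waiting and self.free_nodes > 0:
            token, job = self.waiting.popleft()
            self.free_nodes -= 1
            self.running[token] = job

    def finish(self, token):
        """Complete a running task, store its result, free its node."""
        job = self.running.pop(token)
        self.results[token] = job()
        self.free_nodes += 1
        self._dispatch()

    def status(self, token):
        """Report the state of a task identified by its token."""
        if token in self.results:
            return "done"
        if token in self.running:
            return "running"
        if any(t == token for t, _ in self.waiting):
            return "waiting"
        return "unknown"
```

With a single working node, a second submitted task stays in the waiting queue until the first one finishes, which mirrors the resource check described in the text.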

Bridge to Distributed Systems

Soon after the first tests of WikiSPSM, it was observed that the client-server architecture does not meet the demand for computational resources. An increased number of users and tasks has a negative impact on the performance of the overall system. The architecture of the system was changed in order to solve this issue.



In the current architecture, the component for software generation was changed. This change was performed in two stages:

• Transformation between different OSs. The server side of the previous version was tightly coupled with the OS (in particular, Windows). It was based on the Qt API and command-line compilers. This fragment was reshaped completely. The new implementation is Linux oriented, so WikiSPSM can now be considered a multiplatform tool.
• Transformation between the paradigms. In order to ensure better throughput of computing applications, the server was redesigned to schedule tasks in distributed infrastructures. Ubuntu One and OpenStack private clouds were chosen for the pilot project (Figure 5). The NFS distributed file system is used for communication between working nodes.

Tests of the redesigned component show very good results. For example, for 150 tasks, a Monte Carlo problem was solved twice as fast with the new (bridged to distributed systems) execution component as with the initial server-based component. The “toy example” (calculation of the factorial of big numbers) was solved eight times faster. More comprehensive information about WikiSPSM can be found in Refs. [11, 12, 20, 21].

Figure 5. The architecture of WikiSPSM with the cloud computing component.



RELATED WORK

As far as the authors of this chapter know, the concept of engineering SOC with design procedures as Web services has almost no complete competitors worldwide [3]. However, a partial comparison to other systems is possible.

Original numerical algorithms form the background of WebALLTED [3, 7, 8, 9], e.g., algorithms for the analysis of steady or transient states and of frequency, algorithms for parametric optimization, yield maximization, and so on. The proposed approach to application design is completely different from present attempts to use whole, indivisible applied software in the grid/cloud infrastructure, as is done in CloudSME, TINACloud, PartSim, RT-LAB, and FineSimPro.

WebALLTED was compared to SPICE. The following positive features of WebALLTED were observed:

• Improvements in simulation speed and numerical convergence;
• Inclusive procedure of optimization and tolerance threshold setting;
• Sensitivity analysis tools;
• A different approach to the determination of secondary response parameters (e.g., delays);
• Richer possibilities to perform user-defined modeling;
• A novel way to generate a system-level model of MEMS from FEM component equations (e.g., obtained by means of ANSYS) [7];
• Dynamic configuration of the software architecture in terms of composed services and working nodes.

To evaluate the capabilities of WikiSPSM, it has been compared to commercial (Mathematica) and open-source (Scilab) products. All compared products support a rich set of mathematical functions; however, Mathematica's list of functions [13, 18] stands out most for problems of mathematical programming. WikiSPSM uses the LAPACK library [19] from the NetLib repository for C++ and FORTRAN, so it provides more functionality than Scilab [14]. In contrast to Mathematica and Scilab, WikiSPSM cannot reuse its functions directly, because it is Web based and all programs are executed on the server side, not locally. However, WikiSPSM shows the best result in the possibility to extend the system repository. The other systems have a different, single-user-oriented architecture. Moreover, they offer only limited possibilities to change system functions or to extend the core of the system with user subroutines.

CONCLUSIONS

The following conclusions can be made:

• The analysis of the current state of scientific software development tools shows an urgent need to re-engineer existing tools so that they can operate in distributed computing environments.
• The original concept of service-oriented distributed scientific application development (with computing procedures as Web services) has the following innovative features:
– Division of the entire computational process into separate, loosely coupled stages and procedures for their subsequent transfer to the form of unified software services;
– Creation of a repository of computational Web services which contains components developed by different producers that support collective research application development and globalization of R&D activities;
– Separation of services into environment-supporting (generic) services and application-supporting services;
– Unique Web services to enable automatic formation of mathematical models for the solution tasks in the form of equation descriptions or equivalent substituting schemes;
– Personalized and customized user-centric application design enabling users to build and adjust their design scenario and workflow by selecting the necessary Web services to be executed on grid/cloud resources;
– Re-composition of multidisciplinary applications at runtime, because Web services can be further discovered after the application has been deployed;
– Service metadata creation to allow meaningful definition of information in cloud environments for many service providers which may reside within the same infrastructure by agreement on a linked ontology;
– The possibility to collaborate using Wiki technologies and to reuse software at code level as well as at service level.
• The prototype of the service-oriented engineering design platform was developed on the basis of the proposed architecture for the electronic design automation (EDA) domain. Besides EDA, WebALLTED can be used for the simulation, analysis, and design of different control systems and dynamic systems.
• The prototype of the collaboration-oriented stochastic programming and modeling system WikiSPSM was developed on the basis of Wiki technologies and open-source software.

We believe that the results of the projects will have a direct positive impact on scientific software development, because they bridge two technologies, each of which promises good performance. The power of Wiki technologies, software services, and clouds will enable interactive collaboration on software development using the terms of a particular domain.




REFERENCES

1. Chen Y, Tsai W-T. Distributed Service-Oriented Software Development. Kendall Hunt Publishing; Iowa, USA. 2008. p. 467
2. Papazoglou MP, Traverso P, Dustdar S, Leymann F. Service-oriented computing: A research roadmap. International Journal of Cooperative Information Systems. 2008;17(2):223-255
3. Petrenko AI. Service-oriented computing (SOC) in a cloud computing environment. Computer Science and Applications. 2014;1(6):349-358
4. Kress J, Maier B, Normann H, Schmeidel D, Schmutz G, Trops B, Utschig-Utschig C, Winterberg T. Industrial SOA [Internet]. 2013. Available from: technetwork/articles/soa/ind-soa-preface-1934606.html [Accessed 2017-01-30]
5. OASIS. OASIS Web Services Business Process Execution Language [Internet]. 2008. Available from: https://www.oasis- committees/tc_home.php?wg_abbrev=wsbpel [Accessed 2017-06-01]
6. Petrenko AA. Comparing the types of service systems architectures [in Ukrainian]. System Research & Information Technologies. 2015;4:48-62
7. Zgurovsky M, Petrenko A, Ladogubets V, Finogenov O, Bulakh B. WebALLTED: Interdisciplinary simulation in grid and cloud. Computer Science (Cracow). 2013;14(2):295-306
8. Petrenko A, Ladogubets V, Tchkalov V, Pudlowski Z. ALLTED - A Computer-Aided System for Electronic Circuit Design. Melbourne: UICEE (UNESCO); 1997. p. 204
9. Petrenko AI. Macromodels of micro-electro-mechanical systems. In: Islam N, editor. Microelectro-Mechanical Systems and Devices. InTech Rijeka, Croatia; 2012. pp. 155-190
10. Petrenko AI. Collaborative complex computing environment (Com-Com). Journal of Computer Science and Systems Biology. 2015;8:278-284. DOI: 10.4172/jcsb.1000201
11. Giedrimas V, Sakalauskas L, Žilinskas K. Towards the Environment for Mass Collaboration for Software Synthesis, EGI User Forum [Internet]. 2011. Available from: event/207/session/15/contribution/117/material/slides/0.pdf [Accessed 2016-09-15]
12. Giedrimas V, Varoneckas A, Juozapavicius A. The grid and cloud computing facilities in Lithuania. Scalable Computing: Practice and Experience. 2011;12(4):417-421
13. Steinhaus S. Comparison of mathematical programs for data analysis. Munich; 2008. 347uyziem4evkr0i405.pdf [Accessed 2017-06-11]
14. Baudin M. Introduction to Scilab. The Scilab Consortium; 2010
15. Bunks C, Chancelier J-P, Delebecque F, Gomez C, Goursat M, Nikoukhah R, Steer S. Engineering and Scientific Computing with Scilab. Boston: Birkhauser; 1999
16. ESSL and Parallel ESSL Library [Internet]. IBM, 2016. Available from: https://www-03. [Accessed 2017-06-01]
17. Tapscott D, Williams AD. Wikinomics: How Mass Collaboration Changes Everything. Atlantic Books, London; 2011
18. Wolfram Research. Wolfram gridMathematica: Multiplying the power of Mathematica over the grid [Internet]. 2006. Available from: http:// [Accessed 2017-06-11]
19. Barker VA, et al. LAPACK User's Guide: Software, Environments and Tools. Society for Industrial and Applied Mathematics; Philadelphia, USA. 2001
20. Sakalauskas L. Application of the Monte-Carlo method to nonlinear stochastic optimization with linear constraints. Informatica. 2004;15(2):271-282
21. Giedrimas V, Sakalauskas L, Žilinskas K, Barauskas N, Neimantas M, Valčiukas R. Wiki-based stochastic programming and statistical modelling system for the cloud. International Journal of Advanced Computer Science & Applications. 2016;7(3):218-223
22. Giedrimas V. Distributed systems for software engineering: Non-traditional approach. In: Proceedings of the 7th International Conference on Application of Information and Communication Technologies (AICT 2013). Baku (Azerbaijan), 23-25 October. Published by Institute of Electrical and Electronics Engineers (IEEE); 2013. pp. 31-34




Computer Science Department, Southern CT State University, New Haven, CT, USA


Computer Science Department, University of New Haven, CT, New Haven, USA

Computer Science and Engineering Department, University of Connecticut, CT, Storrs, USA

Computer & Information Sciences Department, Ain Shams University, Abbassia, Cairo, Egypt.

Citation: A. El-Raouf, T. Fergany, R. Ammar and S. Hamad, “A Performance-Driven Approach for Restructuring Distributed Object-Oriented Software,” Journal of Software Engineering and Applications, Vol. 2 No. 2, 2009, pp. 127-135. doi: 10.4236/jsea.2009.22019.

Copyright: © 2009 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://



ABSTRACT

Object-oriented techniques make applications substantially easier to build by providing a high-level platform for application development. A large number of projects have used the Distributed Object Oriented approach to solve complex problems in various scientific fields. One important aspect of Distributed Object Oriented systems is the efficient distribution of software classes among different processors. The initial design of a Distributed Object Oriented application does not necessarily have the best class distribution and may need to be restructured. In this paper, we propose a methodology for efficiently restructuring Distributed Object Oriented software systems to obtain better performance. We use the Distributed Object-Oriented Performance (DOOP) model as guidance for our restructuring methodology. The proposed methodology consists of two phases. The first phase introduces a recursive graph clustering technique to partition the OO system into subsystems with low coupling. The second phase is concerned with mapping the generated partitions to the set of available machines in the target distributed architecture.

Keywords: Performance Modeling, Distributed Object Oriented, Restructuring Methodology

INTRODUCTION

Software restructuring techniques offer solutions to the hardware/software mismatch problem, in which the software structure does not match the available hardware organization [1,2,3,4]. There are two approaches to solving this problem: configuring the hardware to match the software components (hardware reconfiguration), and/or reorganizing the software components to match the available hardware (software restructuring). The first approach is impractical, especially for complex programs containing many interacting modules (or subtasks). The second approach is more practical, especially in computing environments with a large number of users. It provides an optimal way to use the available system capabilities, reduces the overall computational cost, and improves the overall system performance.

The basic idea of distributed software restructuring techniques, as described in [2], is to select the best alternative structure(s) for a constrained computing environment while reducing the overall resource needs. These structures can be created through granularity definition (merging of modules), alternative module ordering, loop decomposition, or multiple-server support. It has been shown that performing software restructuring ahead of the partitioning, allocation and scheduling phases improves the results obtained from these phases and reduces the overall resource cost. Unfortunately, this technique is not applicable to Distributed Object Oriented (DOO) systems.

In DOO systems, a program is organized as a set of interacting objects, each of which has its own private state, rather than as a set of functions that share a global state. Classes represent abstractions that make adapting software easier and thus lower the cost of reuse, maintenance and enhancement. The OO paradigm is based on several concepts such as encapsulation, inheritance, polymorphism, and dynamic binding [5,6]. Although these features contribute to the reusability and extensibility of systems, they produce complex dependencies among classes. The most interesting feature of the OO approach is that objects may be distributed and executed either sequentially or in parallel [7]. Distributed objects are one of the most popular programming paradigms in today's computing environments [8,9], naturally extending the sequential message-passing-oriented paradigm of objects.

The design of a distributed Object Oriented application depends on how performance-critical the application is. An important phase is to tune performance by “concretizing” object locations and communication methods [10]. At this stage, tools may be needed to analyze the communication patterns among objects in order to make the right allocation decision. The principal goal of this research is to develop a new methodology for efficiently restructuring DOO software so that it fully exploits the system resources and achieves better performance.

This paper is organized as follows. Section 2 states the problem definition, including our objective and assumptions. Section 3 describes the analytical DOOP model, its basic components, and how it can be used to measure the communication cost between different classes. Section 4 presents the restructuring scheme, the algorithm used in the first phase, and the different approaches proposed for the second phase of the restructuring process. Section 5 compares and analyzes the generated results. Finally, Section 6 draws our conclusions.



PROBLEM DEFINITION

In this paper, we consider restructuring DOO applications for mapping onto a distributed system that consists of a collection of fully connected homogeneous processors, in order to attain better performance. This process is achieved in the two phases illustrated in Figure 1. The first phase is concerned with identifying clusters, that is, dense communities of classes within the DOO system. This helps in decomposing the system into subsystems that have low coupling and are suitable for distribution. The inter-class communication is modeled as a class dependency graph. In the second phase, we investigate the possibilities of grouping the generated clusters and mapping them to the nodes of the distributed system. The main objective is to propose a group of subsystems such that the communication among the classes inside each subsystem is maximized and the communication cost among the subsystems is minimized. The proposed group of subsystems is then mapped to the available nodes in the distributed system.

Figure 1. Two-phase process for restructuring DOO software.



DISTRIBUTED OBJECT-ORIENTED PERFORMANCE MODEL

Performance analysis is a very important phase of the software development process. It ensures that the system satisfies its performance objectives. Performance analysis not only evaluates resource utilization but also supports comparing design alternatives, detecting bottlenecks, maintenance and reuse, and identifying tradeoffs. Most approaches to studying the performance of OO systems are based either on measuring the system after its development or on mapping it to a conventional performance model, and hence offer no way to preserve the OO features during the analysis phase. In [11], the Distributed Object Oriented Performance (DOOP) model was introduced. The DOOP model analyzes and evaluates the overall time cost, considering the communication overheads, while preserving the features, properties and relationships between objects. According to the model, each node in a DOO system is represented as shown in Figure 2.

Figure 2. The DOOP model node structure.



The performance model consists of two main parts: the execution server and the communication server. The execution server represents and evaluates the execution cost of the software modules that reside on the target node. The communication server provides the analysis of the communication activities (including object updating) as well as the evaluation of the communication cost. In our restructuring scheme, we use the DOOP model to evaluate the communication activities among classes, as follows.

Assume that the overall arrival rate to the communication queue of node k, λ_ck, is given by Equation (1), where λ_cs, λ_cn and λ_cu represent the communication arrivals due to External User Requests (EUR), Remote Requests (RR), and updates of objects' data on other nodes, respectively. β_s and β_n are the message multipliers for EUR and RR.

Let λ_cui be the arrival rate corresponding to data updates of object i. Since an update of object i occurs while processing an EUR or an RR, P_i1 is defined as the probability that object i is updated due to an EUR, and P_i2 as the probability that object i is modified due to an RR; λ_cui is expressed in terms of these probabilities.

The expected communication service times for each class are t_cs, t_cn and t_ui, for EUR, RR and update requests from object i, respectively. Here m_s, m_n and m_ui are the expected message sizes of an EUR, of an RR and of the update data sent for object i, and R is the communication channel capacity.

Furthermore, the average communication service time for node k is given by Equation (2), where P_cs and P_cn are the probabilities of activating the communication service by external user requests and by remote requests, respectively, and P_ui is the probability of sending object i's data update to other nodes.
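The displayed formulas of this section are not present in this copy of the text. The block below is a reconstruction inferred only from the symbol definitions given above, so it should be read as a plausible sketch and not as the authors' typeset equations. In particular, the base arrival rates \lambda_s and \lambda_n (for EUR and RR) are assumed symbols that do not appear in the surviving text, and the expression for \lambda_{cu_i} in terms of P_{i1} and P_{i2} is deliberately left out because it cannot be recovered from the definitions alone.

```latex
\begin{align}
\lambda_{ck} &= \lambda_{cs} + \lambda_{cn} + \lambda_{cu} \tag{1}\\
% Assumed: each request type generates a multiple of its base arrival rate.
\lambda_{cs} &= \beta_s \lambda_s, \qquad \lambda_{cn} = \beta_n \lambda_n \notag\\
% Assumed: update traffic is the sum over the objects being updated.
\lambda_{cu} &= \sum_i \lambda_{cu_i} \notag\\
% Service time = expected message size / channel capacity.
t_{cs} &= \frac{m_s}{R}, \qquad t_{cn} = \frac{m_n}{R}, \qquad t_{u_i} = \frac{m_{u_i}}{R} \notag\\
% Average service time weighted by the activation probabilities.
t_{ck} &= P_{cs}\, t_{cs} + P_{cn}\, t_{cn} + \sum_i P_{u_i}\, t_{u_i} \tag{2}
\end{align}
```

As a quick sanity check of the service-time form under these assumptions, a message of expected size 2000 bits on a channel of capacity 1000 bits per time unit would take 2 time units to serve.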

RESTRUCTURING SCHEME

Our restructuring scheme starts by using the DOOP model to map the distributed Object Oriented application into a Class Dependency Graph (CDG), described in the following subsection. This CDG is then restructured using our two-phase methodology. The first phase is based on a recursive use of a spectral graph bi-partitioning algorithm to identify dense communities of classes. The second phase is concerned with mapping the generated partitions to the set of available machines in the target distributed architecture. For the second phase we describe two proposed approaches: the Cluster Grouping Approach and the Double K-Clustering Approach. The details of each phase are described in the following subsections.

The Class Dependency Graph (CDG)

If we assume that each individual class is initially allocated to a separate node in the distributed system, then the DOOP model above can be used as an analytical tool to evaluate the communication activities among classes in a distributed OO system. The calculated values are used to generate the Class Dependency Graph (CDG) of the given OO application.

Figure 3. A class dependency graph for interclass communication.



Figure 4. A recursive graph clustering algorithm.

The class dependency graph (CDG), shown in Figure 3, is a graph representation of interclass communication. In the CDG, each class is represented as a vertex, and an edge between classes A and B represents a communication activity that exists between these two classes due to either data transfer or class dependency. The weight of the edge, W_AB, represents the cost of that communication activity between class A and class B. If no data communication or dependency relationship has been recognized between two classes, no edge connects them in the CDG.
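The CDG described above can be held in a simple weighted adjacency structure. The following is a minimal sketch, assuming the pairwise communication costs have already been computed (for example with the DOOP model); the function name `build_cdg` and the input format are illustrative choices, not part of the source.

```python
from collections import defaultdict


def build_cdg(costs):
    """Build an undirected, weighted class dependency graph.

    costs maps an ordered pair (class_a, class_b) to the communication
    cost observed in that direction; both directions are summed into a
    single edge weight W_AB.
    """
    cdg = defaultdict(dict)
    for (a, b), w in costs.items():
        if a == b or w <= 0:
            continue  # no communication recognized: no edge is created
        cdg[a][b] = cdg[a].get(b, 0.0) + w
        cdg[b][a] = cdg[a][b]  # keep the graph symmetric
    return dict(cdg)
```

For example, costs of 4.0 from A to B and 1.0 from B to A yield a single edge between A and B of weight 5.0, while a pair with zero cost produces no edge at all.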

First Phase: Clustering System Classes

In this section, we describe the clustering technique that forms the first phase of the restructuring approach proposed in this paper. After this step, the object-oriented system is decomposed into subsystems that have low coupling and are suitable for distribution. The technique is based on a recursive use of a spectral graph bi-partitioning algorithm to identify dense communities of classes. At each recursive call, the CDG is partitioned into two sub-graphs, each of which is further bi-partitioned as long as the partitioning results in clusters that are denser than their parents. Hence, the stopping condition depends on how good the produced partition is.

A sub-graph is considered well partitioned if the sum of the weights of its internal edges (those between vertices within the sub-graph) exceeds the sum of the weights of its external edges (those between vertices of the sub-graph and vertices in other sub-graphs). The iteration stops when at least one of the produced sub-graphs is badly partitioned (the sum of the weights of its external edges exceeds that of its internal edges). In this case, the bi-partitioning step is discarded and the algorithm backtracks to the clusters generated in the previous step. At the end, the identified sub-graphs are the suggested clusters of the system.

Figure 4 shows a detailed step-by-step description of the clustering algorithm as described in [12]. The mathematical derivation of the spectral factorization algorithm used in our clustering approach was introduced in [13]. It originally solves the l-bounded Graph Partitioning (l-GP) problem; however, we have adapted the solution methodology to fit our bi-partitioning approach.
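The recursion and its stopping rule can be sketched as follows. This is not the algorithm of Figure 4 or the spectral factorization of [13]: as a stand-in, each split uses the sign pattern of the Fiedler vector (the eigenvector of the second-smallest eigenvalue of the graph Laplacian), a standard spectral bisection heuristic. The function names are illustrative, and `W` is assumed to be a symmetric NumPy matrix of CDG edge weights with a zero diagonal.

```python
import numpy as np


def fiedler_split(W, nodes):
    """Split `nodes` in two using the sign of the Fiedler vector."""
    sub = W[np.ix_(nodes, nodes)]
    laplacian = np.diag(sub.sum(axis=1)) - sub
    _, vecs = np.linalg.eigh(laplacian)   # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    left = [n for n, x in zip(nodes, fiedler) if x < 0]
    right = [n for n, x in zip(nodes, fiedler) if x >= 0]
    return left, right


def well_partitioned(W, nodes):
    """Internal edge weight must exceed external edge weight."""
    internal = W[np.ix_(nodes, nodes)].sum() / 2.0   # each edge counted twice
    external = W[nodes, :].sum() - 2.0 * internal    # edges leaving the group
    return internal > external


def recursive_clusters(W, nodes=None):
    """Bi-partition recursively; backtrack when a split is bad."""
    if nodes is None:
        nodes = list(range(W.shape[0]))
    if len(nodes) < 2:
        return [nodes]
    left, right = fiedler_split(W, nodes)
    if not left or not right:
        return [nodes]
    if not (well_partitioned(W, left) and well_partitioned(W, right)):
        return [nodes]   # discard this split and keep the parent cluster
    return recursive_clusters(W, left) + recursive_clusters(W, right)
```

On a graph made of two tightly connected triangles joined by one light edge, the first split separates the triangles; any further split of a triangle leaves a part whose external weight exceeds its internal weight, so the recursion backtracks and reports the two triangles as clusters.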

Second Phase: Mapping Classes to Nodes

In this phase, the restructuring process is completed by mapping the set of DOO application clusters produced by the first phase to the different network nodes in order to attain better performance. To achieve this goal, we perform the mapping with the objective of minimizing the effect of class dependency and data communication. It is assumed that the target distributed system consists of a set of homogeneous processors that are fully connected via a communication network. The mapping process has two cases. The first case arises when the number of candidate clusters is less than or equal to the number of available nodes. In this case the mapping is done simply by assigning each cluster to one of the available nodes. The difficulty lies in the second case, when the number of generated clusters exceeds the number of available nodes. This is the more realistic case, since there will always be huge software systems with a large number of classes and limited hardware resources.



In the following subsections, we introduce two different approaches for performing the grouping and mapping steps of the second phase: the Cluster Grouping Approach and the Double K-Clustering Approach. In Section 5, their results are compared with the direct K-Partitioning Approach listed in Figure 4, in which the graph is partitioned into a pre-specified number of parts exactly equal to the number of nodes in the target distributed system. This is a one-step allocation process that also maintains the criterion of minimizing inter-cluster communication cost.

The Cluster Grouping Approach

In this approach, we use the clusters (subsystems) generated in the first phase as the main candidates for distribution. The technique is based on merging clusters into groups in a way that keeps the communication costs among the groups minimized. To this end, a cluster graph is formed, in which the nodes represent clusters and the edges capture the communication links that may exist among clusters. Then, the K-Partitioning algorithm is used to perform cluster grouping such that the number of resulting groups is equal to the number of available nodes. The result is a set of groups of clusters with minimal communication costs among them. Finally, those groups are assigned to the different available nodes in the distributed environment.
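The steps above can be sketched in a few lines. The source does not spell out the K-Partitioning algorithm, so this sketch substitutes a simple greedy rule: repeatedly merge the two groups joined by the heaviest inter-group link until the number of groups equals the number of nodes. The function names are illustrative, and `W` is assumed to be the symmetric CDG weight matrix with a zero diagonal.

```python
import numpy as np


def cluster_graph(W, clusters):
    """Aggregate class-level weights into inter-cluster edge weights."""
    k = len(clusters)
    C = np.zeros((k, k))
    for a in range(k):
        for b in range(a + 1, k):
            C[a, b] = C[b, a] = W[np.ix_(clusters[a], clusters[b])].sum()
    return C


def group_clusters(W, clusters, n_nodes):
    """Greedy stand-in for K-Partitioning on the cluster graph."""
    groups = [list(c) for c in clusters]
    while len(groups) > max(n_nodes, 1):
        C = cluster_graph(W, groups)
        np.fill_diagonal(C, -1.0)   # never select a group with itself
        a, b = np.unravel_index(np.argmax(C), C.shape)
        a, b = int(min(a, b)), int(max(a, b))
        groups[a] = groups[a] + groups[b]
        del groups[b]
    return groups


def communication_cost(W, groups):
    """Total weight of CDG edges that cross group boundaries."""
    return cluster_graph(W, groups).sum() / 2.0
```

Merging the most heavily communicating groups first keeps their traffic inside one node, which is the stated aim of the approach; a real K-Partitioning routine would typically optimize the cut more globally than this greedy rule does.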

The Double K-Clustering Approach

The K-Partitioning algorithm cannot predict the number of system modules or clusters as the recursive Bi-Partitioning algorithm does. Instead, the number of required clusters must be given as an input parameter, the value k. Hence, we make use of the advantages provided by both algorithms: the recursive Bi-Partitioning and the K-Partitioning. The first phase described above is treated as a pre-phase that estimates the number of sub-systems existing within the distributed application. The K-Partitioning algorithm is then used twice. The first time, the original class dependency graph (CDG) is clustered according to the number suggested by the pre-phase. The second time, the K-Partitioning algorithm is applied in the same way as in the Cluster Grouping Approach described above, so that the number of resulting groups is equal to the number of available nodes. Then, the resulting clusters are mapped to the available nodes.



SIMULATION AND RESULTS

In this section, we provide an analysis of both the clustering and the mapping phases described above. The analysis is carried out through a case study, which describes the detailed steps of the two-phase restructuring process, and a set of simulation experiments that were performed to validate our results.

Case Study

We have developed a performance-driven restructuring simulator using Matlab 7.0.4. A Graphical User Interface (GUI) utility has been implemented to show the system status after each step while the algorithm proceeds. The simulator has a user-friendly interface that allows the user to specify the nodes and edges of the system, from which it generates the class dependency graph (CDG). We conducted an experiment using an OO system that consists of 28 classes. The CDG that was generated by the simulator for this system is given in Figure 5. In this section we provide a step-by-step description of applying the proposed restructuring techniques described above, and we analyze and compare the results in each case. Figure 6 shows the system clusters generated by the proposed bi-partitioning algorithm after the first phase. The first-phase algorithm has created 7 clusters, each of which is marked with a different color and surrounded by an oval for emphasis.

Figure 5. The generated CDG



Figure 6. The resultant clusters generated by the restructuring scheme first phase

Figure 7. Mapping the DOO system to a 4-node environment using the direct partitioning approach

Now, let us assume that the target distributed environment consists of 4 homogeneous nodes that are fully connected. We need to apply one of the second-phase approaches to map the classes to the distributed nodes. First we applied the direct approach, which partitioned the original CDG into 4 clusters using the K-Partitioning algorithm. The resulting clusters are depicted in Figure 7. Figure 8 shows the result of applying the Cluster Grouping approach to merge the clusters of Figure 6 into 4 larger clusters. Applying the Double-K clustering approach, however, results in a completely different group of clusters, as shown in Figure 9. It started by generating 7 clusters, which were then grouped into 4 clusters. The first one includes {C1, C2, C3, C4, C5, C6, C7, C8}, the second one {C9, C10, C11, C12, C13, C14, C15, C16, C17, C18, C19, C20, C22, C25, C26, C28}, the third {C21, C23, C24}, and the last one includes only C27. Notice that each of the approaches resulted in a different grouping, each with a different communication cost. The Direct Partitioning Approach results in a communication cost of 267 time units. The Cluster Grouping approach resulted in a cost of 265 time units, while the Double-K approach produced a grouping with a cost of 223 time units.
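To put the three case-study costs side by side, the reduction relative to the Direct Partitioning Approach can be worked out from the figures quoted above. The percentages below are derived here from those three numbers and are not reported in the source.

```latex
\begin{align*}
\text{Cluster Grouping:} \quad & \frac{267 - 265}{267} = \frac{2}{267} \approx 0.7\% \\
\text{Double-K:} \quad & \frac{267 - 223}{267} = \frac{44}{267} \approx 16.5\%
\end{align*}
```

For this 28-class example, then, almost all of the gain over direct partitioning comes from the Double-K approach.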

Figure 8. Mapping the DOO system to a 4-node environment using the cluster grouping approach.



Figure 9. Mapping the DOO system to a 4-node environment using the Double-K approach.

Numerical Results

We conducted a number of experiments using a set of DOO applications whose class dependencies were randomly generated. The generated matrices are assumed to represent the adjacency matrices of the CDGs of the systems under inspection. The adjacency matrices are generated by the method of Andrew Wathen proposed in [14]. The resultant matrices are random, sparse, symmetric, and positive definite. In Figure 10, the X-axis represents the number of nodes in the target distributed environment and the Y-axis represents the communication cost, in units of time, measured between classes located on different nodes. The decision to allocate a group of classes to a specific node is made by one of three algorithms. Two of them are the algorithms proposed above: the Cluster Grouping Approach and the Double-K Approach. The third is the well-known K-Partitioning approach. Each data point in the comparison chart is an average over many trials. In each trial, a random adjacency matrix of a CDG with 107 classes is generated. For each generated matrix, the Bi-Partitioning algorithm is run to verify that it yields exactly 11 clusters; otherwise the matrix is discarded and another one is generated. The performance comparison depicted in Figure 10 shows that the Double-K Clustering Approach provides the best performance among the algorithms, since it gives the minimum interclass communication cost for various numbers of nodes (machines). The Cluster Grouping approach comes next, and the K-Partitioning approach last. However, when the number of generated clusters equals the number of machines (nodes), the Double-K Clustering Approach typically gives the same results as the K-Partitioning algorithm. This is a logical finding, since in this case the second clustering step of the Double-K Clustering approach is eliminated, reducing the approach to the original K-Partitioning Algorithm.

To confirm the correctness of the proposed methodologies, we conducted further experiments using various sets of classes and their associated clusters. These classes were generated randomly and the results were averaged, as in the experiment described above. Table 1 and Table 2 present the communication costs computed for each partition generated by the three approaches when applied to different sets of classes. Each table represents a simulation targeting a distributed architecture composed of a predetermined number of machines. Figure 11 and Figure 12 show the results when using 4 and 6 machines, respectively. The simulation results are consistent with those of the case study presented above: changing the number of classes and the structure of the underlying distributed architecture does not change the conclusions. The two proposed approaches, Cluster Grouping and Double-K Clustering, outperform the Direct Partitioning Approach as long as the number of nodes is less than the number of generated clusters.
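
The second-phase idea of merging first-phase clusters until one group remains per node can be illustrated with a short Python sketch. The merging rule used here (repeatedly merge the two groups with the heaviest communication between them) is an assumption made for illustration only and may differ from the Cluster Grouping algorithm defined earlier in the chapter. The function names and the sample graph are hypothetical.

```python
def inter_cluster_cost(weights, group_a, group_b):
    """Total weight of edges with one endpoint in each of the two groups."""
    return sum(
        w
        for (u, v), w in weights.items()
        if (u in group_a and v in group_b) or (u in group_b and v in group_a)
    )


def group_clusters(weights, clusters, num_nodes):
    """Greedily merge the two most heavily communicating groups until only
    num_nodes groups remain (illustrative rule, not the chapter's algorithm)."""
    groups = [set(c) for c in clusters]
    while len(groups) > num_nodes:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                cost = inter_cluster_cost(weights, groups[i], groups[j])
                if best is None or cost > best[0]:
                    best = (cost, i, j)
        _, i, j = best
        groups[i] |= groups[j]   # merge group j into group i
        del groups[j]
    return groups


# Hypothetical 6-class chain with made-up weights and 4 first-phase clusters.
weights = {("A", "B"): 5, ("B", "C"): 1, ("C", "D"): 6, ("D", "E"): 2, ("E", "F"): 4}
clusters = [{"A", "B"}, {"C"}, {"D"}, {"E", "F"}]

print(group_clusters(weights, clusters, 2))
print(group_clusters(weights, clusters, 4))
```

With as many nodes as clusters, the loop body never runs and the clusters are returned unchanged, which mirrors the observation that the second step disappears when the number of clusters equals the number of nodes.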

Figure 10. Interclass communication cost measured after applying different restructuring approaches.



Table 1. Simulation results considering a distributed architecture consisting of four nodes

Figure 11. The simulation result of mapping different DOO systems with different number of classes on four nodes.

Table 2. Simulation results considering a distributed architecture consisting of six nodes



Figure 12. The simulation result of mapping different DOO systems with different number of classes on six nodes

CONCLUSIONS

In this paper, we proposed an approach for restructuring DOO applications for a distributed system consisting of a collection of fully connected homogeneous processors. The restructuring process is performed in two phases: the clustering phase and the mapping phase. The first phase is based on the theory of spectral graph bi-partitioning, where the Distributed Object Oriented performance model is used to evaluate the communication costs between different classes. In the second phase, the identified subsystems are assigned to different machines in the target distributed environment. This is done through two proposed algorithms: the Cluster Grouping approach and the Double-K Clustering approach. A comparison was made between the proposed approaches and the k-partitioning algorithm. The results showed that the Double-K Clustering Approach provides the best performance in terms of minimizing the communication cost among classes located on different nodes (machines).




REFERENCES

1. A. Raouf, R. Ammar, and T. Fergany, “Object oriented performance modeling and restructuring on a pipeline architecture,” The Journal of Computational Methods in Science and Engineering, JCMSE, IOS Press, Vol. 6, pp. 59-71, 2006.
2. T. A. Fergany, “Software restructuring in performance critical distributed real-time systems,” Ph. D. Thesis, University of Connecticut, USA, 1991.
3. T. A. Fergany, H. Sholl, and R. A. Ammar, “SRS: A tool for software restructuring in real-time distributed environment,” in the Proceedings of the 4th International Conference on Parallel and Distributed Computing and Systems, October 1991.
4. H. Sholl and T. A. Fergany, “Performance-requirements-based loop restructuring for real-time distributed systems,” in the Proceedings of the International Conference on Mini and Microcomputers, From Micro to Supercomputers, Florida, December 1988.
5. B. Meyer, “Object-oriented software construction,” Prentice-Hall International (UK), Ltd, 1988.
6. Ostereich, “Developing software with UML: OO analysis and design in practice,” Addison Wesley, June 2002.
7. J. K. Lee and D. Gannon, “Object oriented parallel programming experiments and results,” in the Proceedings of Supercomputing 91, IEEE Computer Society Press, Los Alamitos, Calif, pp. 273-282, 1991.
8. Sun Microsystems Inc. Java home page,
9. J. Waldo, G. Wyant, A. Wollrath, and S. Kendall, “A note on distributed computing,” Sun Microsystems Laboratories, Technical Report-94-29, November 1994.
10. I. Sommerville, “Software Engineering,” 8th Edition, Addison-Wesley Publishers Ltd, New York, 2007.
11. A. A. El-Raouf, “Performance modeling and analysis of object oriented software systems,” PhD Dissertation, Department of Computer Science & Engineering, University of Connecticut, 2005.
12. S. Hamad, R. Ammar, A. Raouf, and M. Khalifa, “A performance-driven clustering approach to minimize coupling in a DOO system,” the 20th International Conference on Parallel and Distributed Computing Systems, Las Vegas, Nevada, pp. 24-26, September 2007.
13. J. P. Hespanha, “An efficient MATLAB algorithm for graph partitioning,” Technical Report, Department of Electrical & Computer Engineering, University of California, USA, October 2004.
14. A. J. Wathen, “Realistic eigenvalue bounds for the Galerkin mass matrix,” The Journal of Numerical Analysis, Vol. 7, pp. 449-457, 1987.



ANALYSIS AND DESIGN OF DISTRIBUTED PAIR PROGRAMMING SYSTEM

Wanfeng Dou, Yifeng Wang, Sen Luo

School of Computer Science and Technology, Jiangsu Research Center of Information Security & Confidential Engineering, Nanjing Normal University, Nanjing, China

ABSTRACT

Pair Programming (PP), which has gained extensive attention within pedagogical and industrial environments, is a programming practice in which two programmers use the same computer to work together on the analysis, design, and programming of the same segment of code. A Distributed Pair Programming (DPP) system is a programming system that helps two programmers, the driver and the navigator, complete a common task such as analysis, design, and programming on the same software from different locations. This paper first reviews the existing DPP tools and discusses the interaction and coordination mechanism in the DPP process. By means of activity theory and language-action theory, some basic requirements of the DPP system are presented. Then, a design framework for such a system and the functions of each sub-system are analyzed in depth. Finally, a system prototype is implemented as a plug-in in the Microsoft Visual Studio environment.

Keywords: Pair Programming, Distributed Pair Programming, Software Engineering, Extreme Programming

Citation: W. Dou, Y. Wang and S. Luo, “Analysis and Design of Distributed Pair Programming System,” Intelligent Information Management, Vol. 2 No. 8, 2010, pp. 487-497. doi: 10.4236/iim.2010.28059.

Copyright: © 2010 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://

INTRODUCTION

In recent years, agile software methodologies have attracted increasing interest within pedagogical and industrial environments, with extreme programming considered the most important of them [1]. In the agile manifesto, the authors state twelve general principles that all highlight the importance of flexibility and collaboration. One of the techniques being adopted by software development groups is Pair Programming (PP), in which two developers work side by side, on the same computer, to collaboratively produce a design, an algorithm, code, etc. [2]. Adopting these principles would imply a distributed application of agile methods, such as distributed extreme programming. Although some tools have been developed to better support distributed agile software development, there is still a need for additional research on tools and processes for distributed extreme programming, especially for solutions that extend the most obvious solution of providing a shared code editor. As the trend towards global software development continues, pair programming, in which two developers must work face to face, does not meet the needs of global software development. What is needed is a way to create computer programs through pair programming in which the developers are located at different workstations but collaborate simultaneously to solve the same problem. This approach is called Distributed Pair Programming (DPP). This paper reviews the existing distributed pair programming systems and presents a system design and implementation. The paper has six sections. After this introduction, Section 2 reviews related work on DPP tools. Section 3 discusses an analysis approach for DPP based on activity theory and language-action theory. The requirements of a DPP tool are presented in Section 4. Section 5 describes the design and implementation of the prototype system. Section 6 draws conclusions.



RELATED WORK

Pair Programming

Extreme programming, also known as XP, includes a set of principles and practices for the rapid development of high-quality software. XP identifies 12 best practices of software development and takes them to an extreme. Pair programming originated in industry as a key component of the XP development methodology. As the name suggests, pair programming involves two programmers working at the same computer to create code, analyze requirements, develop designs, etc. This provides a mechanism for real-time problem solving and real-time quality control [2]. One programmer acts as the driver, who actively writes the code or design document and has control of the keyboard and mouse. The other partner acts as the navigator, who helps plan, identifies and prevents syntactic or strategic deficiencies in the code or design document, thinks of alternatives, and asks questions [3]. The collaborators may exchange roles at any time during the pair programming session, or not at all. The concepts underlying pair programming are not new [4,5], but pair programming has only recently attracted significant attention and interest within the software industry and academia. Several previous controlled experiments have concluded that pair programming has many benefits over solo programming [6]. Pair programming brings significant improvements in code review and in various other measures of the quality of the programs being developed, including shorter duration, with only minor additional overhead in terms of cost or effort [4-5]. However, other work comparing it with solo programming showed no positive effects of pair programming with respect to time taken or improvement of the functional correctness of the software product [7].
The reasons for these differing results include differences in sample populations (e.g., students or professionals), study settings (e.g., amount of training in pair programming), lack of power (e.g., few subjects), different ways of treating the development variables (e.g., how correctness was measured and whether measures of development time also included rework), and task complexity (e.g., simple dependent tasks or complicated projects) [8-9].

Pair programming has been shown to be an effective pedagogical approach in teaching courses such as introductory computer science [10-11], undergraduate software engineering [12], and graduate object-oriented software development [13]. Studies have shown that pair programming creates an environment conducive to more advanced, active learning and social interaction, leading to students being less frustrated, more confident, and more interested in IT [14], and that it also improves the retention of women in computer science [15]. Pair programming encourages students to interact and cooperate with partners in their classes and laboratories, or development teams, thereby creating a more collaborative environment in which pairs discuss problems, brainstorm with each other, and share knowledge. Pair programming also benefits the teaching staff. A pair of students can often analyze and discuss between themselves the low-level technical or procedural questions that typically burden the teaching staff in the laboratory, hence there are fewer questions to be dealt with. Distributed pair programming is a style of programming in which two programmers who are geographically distributed and synchronously collaborating over the Internet work together on the same software artifact. Compared with pair programming, DPP reduces the scheduling issues that arise for developers trying to arrange collocated pair programming. Making DPP technology available to students increases the likelihood that they will pair program.
Trying distributed pair programming increases the likelihood that students will pair program remotely in the future. While DPP has been shown to be better than distributed non-pair programming, DPP is not perfect, mainly because it requires better tool support for the DPP process.

Tools of Distributed Pair Programming

In a pair programming environment, however, obstacles such as limited facilities, geographic separation, and scheduling often present challenges to collocated pair programming. DPP enables students or developers to collaborate from different locations and work on their programming projects remotely. One of the main trends in software development has been the globalization of the software industry. Motivating factors behind this trend include hiring qualified programmers in different cities and countries, placing development groups closer to their clients' locations, quickly creating virtual development groups, and working continuously on critical projects by spreading groups across different time zones [16]. Researchers have proposed several tools to better support distributed pair programming [17-22]. These existing tools either adopt an application sharing approach to enhance an existing editor suite, or provide customized tools that include various groupware features such as shared awareness [17]. Customized groupware tools do not support all of the features needed by pair programming and thus limit the partners' ability to accomplish their work. On the other hand, application sharing solutions lack process support and thus lack collaboration awareness.

Application Sharing Tools

The JAZZ system is an example of the application sharing approach [18]. It is an extension of Eclipse that supports XP and workflows in asynchronous interaction. JAZZ allows users to stay aware of co-workers, initiate chat sessions, and be invited to a synchronous pair programming session using an application sharing system. JAZZ implements a shared editing plug-in that provides a synchronous shared code editor based on an operational transformation approach. But this plug-in is not integrated into the workflow of pair programming, and thus does not provide awareness and has no explicit switching of roles. MILOS is another application sharing system [19]. It provides awareness of co-present users and, like JAZZ, allows users to initiate pair programming sessions using application sharing. MILOS makes use of existing IDEs and integrates single-user development environments into pair programming settings. However, the application sharing approach does not support flexible pairing, such as one-to-many pairing, or role switching.

TUKAN is a special-purpose groupware for all phases of the XP process [20]. It provides a shared editor for manipulating code together, and users can highlight important code using a remote selection. Moomba extends the awareness tools of TUKAN and supports a Java IDE in which users can use a shared Java editor [21]. However, TUKAN and Moomba use the ENVY environment and are built as proprietary tools, and therefore cannot provide the same domain-specific tool support that is present in modern IDEs. This is one of the reasons why they have not gained wide popularity.

Eclipse is a popular and more open environment that allows closer coupling of the developing IDEs [4]. Coordination work can be integrated into Eclipse in the internal browser window or in a special-purpose planning plug-in. The Eclipse Communication Framework (ECF) aims at integrating a collaboration infrastructure with the IDE. Sangam is an Eclipse plug-in that lets Eclipse users share a workspace so that developers may work as if they were using the same computer [22]. Sangam uses an event-driven design with three basic components: the event interceptor, the message server, and the event reproducer. The event interceptor captures each event when the driver does something in Eclipse and sends it to the message server. When the event reproducer receives a message, it interacts with Eclipse to perform the driver's action. The Saros plug-in supports driver-navigator interaction in Eclipse in a distributed pair programming session, and provides awareness of the files that are open at the driver's site [4]. Saros includes a shared editor that allows collaborative code creation, and remote selections that allow the navigator to point at relevant pieces of code. Xpairtise is an Eclipse plug-in that offers shared editing, project synchronization, shared program and test execution, user management, built-in chat communication, and a shared whiteboard [4].

RIPPLE is a plug-in for the popular Eclipse integrated development environment in which data on collaborative programming is collected. RIPPLE is designed for use in educational settings to facilitate various forms of collaborative programming problem solving, including distributed pair programming and distributed one-to-one tutoring [23]. RIPPLE extends the architecture implemented in Sangam. Compilation and execution of code, as well as generation of console messages, are performed directly by Eclipse. However, RIPPLE currently supports only Java programming, because its event-driven behavior requires that language-specific messages be transmitted between users. The textual dialogue of RIPPLE is an instant-message-style chat program that supports enforced turn-taking in dialogue.
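
The three-component, event-driven design described for Sangam (event interceptor, message server, event reproducer) can be sketched in a few lines of Python. This is only an illustration of the pattern, assuming simple text insert and delete events and an in-memory queue in place of a network. None of the class or method names below come from Sangam, RIPPLE, or Eclipse.

```python
import json
import queue


class MessageServer:
    """Relays serialized events from the driver's side to the navigator's side."""

    def __init__(self):
        self._outbox = queue.Queue()

    def publish(self, payload):
        self._outbox.put(payload)

    def deliver_all(self, reproducer):
        while not self._outbox.empty():
            reproducer.reproduce(self._outbox.get())


class EventInterceptor:
    """Captures the driver's edits and forwards them as messages."""

    def __init__(self, server):
        self._server = server

    def on_insert(self, offset, text):
        self._server.publish(json.dumps({"type": "insert", "offset": offset, "text": text}))

    def on_delete(self, offset, length):
        self._server.publish(json.dumps({"type": "delete", "offset": offset, "length": length}))


class EventReproducer:
    """Replays received events on the navigator's copy of the document."""

    def __init__(self):
        self.document = ""

    def reproduce(self, payload):
        event = json.loads(payload)
        start = event["offset"]
        if event["type"] == "insert":
            self.document = self.document[:start] + event["text"] + self.document[start:]
        elif event["type"] == "delete":
            self.document = self.document[:start] + self.document[start + event["length"]:]


server = MessageServer()
interceptor = EventInterceptor(server)
reproducer = EventReproducer()

interceptor.on_insert(0, "hello world")   # driver types text
interceptor.on_delete(5, 6)               # driver deletes " world"
interceptor.on_insert(5, "!")             # driver appends "!"
server.deliver_all(reproducer)
print(reproducer.document)  # hello!
```

Because only events travel between the two sides, each event type must be understood by the receiver, which is consistent with the remark that RIPPLE's messages are language-specific.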

Customized Tools

COLLECE, developed using Java technology, is a groupware system that supports synchronous distributed pair programming practices. COLLECE provides a set of tools including an editor, a session panel, coordination panels, and a structured chat [24]. The editor provides a workspace in which the driver inserts or modifies the source code of the program being built. The session panel provides simple partner awareness by showing the photo and name of each member of the pair. The coordination panels comprise three coordination tools that allow a collaboration protocol to be established: the edition coordination panel, the compilation coordination panel, and the execution coordination panel. The structured chat is used to express the conversational acts that commonly occur during program coding, compilation, and execution.

COPPER is a synchronous source code editor that allows two distributed software engineers to write a program using pair programming. Its functions include communication mechanisms, collaboration awareness, concurrency control, and a radar view of the documents, among others. The COPPER system is based on a client/server (C/S) architecture. It is composed of three subsystems: the collaborative editor, the user and document presence subsystem, and the audio subsystem. The editor is further decomposed into the Editor module and the document server. The Editor module implements a turn-taking synchronous editor, and the document server provides document storage, document editing access control, user authentication and permissions, and document presence extensions. However, a low display refresh rate can sometimes be confusing, or something significant may be lost in the remote display. Tracking the mouse pointer is another problem if the two developers do not use the same monitor resolution. Hence, next-generation tools still need to be analyzed and studied in terms of the requirements of distributed pair programming.
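
A turn-taking editor of the kind attributed to COPPER's Editor module can be illustrated as follows. This is a minimal sketch, assuming that only the current driver may edit and that roles are swapped explicitly. The class `TurnTakingEditor` and its methods are hypothetical and do not reproduce COPPER's actual interface.

```python
class TurnTakingEditor:
    """Shared document that only the current driver is allowed to modify."""

    def __init__(self, driver, navigator):
        self.driver = driver
        self.navigator = navigator
        self.text = ""

    def insert(self, user, offset, new_text):
        # Access control: reject edits from anyone who is not the driver.
        if user != self.driver:
            raise PermissionError(f"{user} does not hold the driver role")
        self.text = self.text[:offset] + new_text + self.text[offset:]

    def switch_roles(self):
        # The navigator takes the keyboard; the driver becomes the reviewer.
        self.driver, self.navigator = self.navigator, self.driver


editor = TurnTakingEditor(driver="alice", navigator="bob")
editor.insert("alice", 0, "x = 1\n")

try:
    editor.insert("bob", 0, "y = 2\n")      # bob is still the navigator
except PermissionError as err:
    print("rejected:", err)

editor.switch_roles()
editor.insert("bob", len(editor.text), "y = 2\n")
print(editor.text)
```

Serializing edits through a single role avoids concurrent-edit conflicts entirely, at the price of making role switching an explicit step in the session.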

ANALYSIS AND INTERACTION IN DPP SYSTEM

Analysis Based on Activity Theory in DPP System

Activity theory, as a social psychological theory of human self-regulation, is a well-suited epistemological foundation for design. Activity theory was first used for user interface design by Bodker, and has since been extended and refined by numerous other authors. In particular, activity theory is used to understand cooperative work activities supported by computers [25,26]. Pair programming is a social activity involving two programmers, the driver and the navigator. This paper uses activity theory as a theoretical basis for understanding the cooperative work activities in DPP. Broadly defined, activity theory is a philosophical framework for studying different forms of human praxis as development processes, with both individual and social levels interlinked. Three of the key ideas of activity theory can be highlighted here: activities as the basic unit of analysis, the historical development of activities, and internal mediation within activities [25]. Activities, several of which an individual can participate in at the same time, have the following properties:
1) an activity has a material object;
2) an activity is a collective phenomenon;
3) an activity has an active subject, who understands the motive of the activity;
4) an activity exists in a material environment and transforms it;
5) an activity is a historically developing phenomenon;
6) contradiction is realized through conscious and purposeful actions by participants;
7) the relationships within an activity are culturally mediated.

Y. Engestrom has attempted to establish a structural model of the concept of activity and the culturally mediated relationships within it (Figure 1). This structure includes three components, namely subject, object, and community, and forms three relationships: subject-object, subject-community, and object-community. The relationship between subject and object is mediated by tools, that between subject and community is mediated by rules, and that between object and community is mediated by the division of labor. Each of the mediating terms is historically formed and open to further development. In this activity model, four subsystems are formed: the production subsystem, the communication subsystem, the assignment subsystem, and the consumption subsystem. The production subsystem is used by the subject (e.g., driver and navigator) to transform the object into an outcome (e.g., the analysis, design, or programming of a piece of code). In Figure 1, the production subsystem involves three components: subject, object, and tool. In DPP, this subsystem is a shared editor that supports synchronous editing, role switching, test execution, file sharing, etc.
The communication subsystem, in Figure 1, also involves three components: subject, community, and rule. In DPP, for instance, the driver and navigator use this subsystem to communicate with each other so as to solve the problems met during pair programming. The driver and navigator, as a community, should abide by rules. For example, when one partner takes the role of driver, the other must be the navigator, and they switch roles at intervals. The communication subsystem, which covers the relationship and interaction between subject and community, should provide a chat session, a whiteboard, and audio or video communication. It must be designed so that users can easily discuss problems and suggestions about their task and then refocus on the shared code. The assignment subsystem builds the relationship between object and community by establishing the division of labor; that is to say, it assigns activities according to social rules and expectations. In DPP, the partner in the driver role is responsible for writing the code using the keyboard and mouse, and the partner in the navigator role is responsible for reviewing the code written by the driver and giving suggestions. During DPP, they should switch roles at intervals. The consumption subsystem describes how subject and community cooperate to act on the object. The consumption process stands for the inherent contradictions of the activity system. Although the goal of the production subsystem is to transform the object into an outcome, it also consumes the energy and resources of the subject and community. The consumption subsystem may plan arrangements and provide the resources for DPP. In Figure 1, the emphasis of the analysis of the activity system is the production subsystem. The production of the object is led by the results or intention of the activity system. For example, the activities of DPP lead to the production of high-quality code. The production subsystem is usually considered the most important subsystem. Hence, understanding the production subsystem is a good start for the design of a DPP system.
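
The mapping from the four subsystems of the activity model to DPP tool features can be summarized as a small lookup structure. The Python sketch below only restates what the text says about each subsystem. The names `Subsystem`, `ACTIVITY_MODEL`, and `subsystems_using` are hypothetical, and the feature labels are paraphrases chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subsystem:
    components: tuple    # elements of the activity model involved
    dpp_support: tuple   # what a DPP tool offers for this subsystem


ACTIVITY_MODEL = {
    "production": Subsystem(
        ("subject", "object", "tool"),
        ("synchronous editing", "role switching", "test execution", "file sharing"),
    ),
    "communication": Subsystem(
        ("subject", "community", "rule"),
        ("chat session", "whiteboard", "audio or video communication"),
    ),
    "assignment": Subsystem(
        ("object", "community", "division of labor"),
        ("driver role", "navigator role", "role switching at intervals"),
    ),
    "consumption": Subsystem(
        ("subject", "object", "community"),
        ("planning", "resource provision"),
    ),
}


def subsystems_using(component):
    """Return the names of the subsystems that involve the given component."""
    return sorted(name for name, s in ACTIVITY_MODEL.items() if component in s.components)


print(subsystems_using("community"))  # ['assignment', 'communication', 'consumption']
print(ACTIVITY_MODEL["production"].dpp_support)
```

A table like this can serve as a checklist when deriving tool requirements: each subsystem should be covered by at least one feature of the DPP system.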

Figure 1. Basic structure of an activity.

Conversation Model of DPP

In a DPP system, two programmers, the driver and the navigator, work together on the same task, such as code, a design, or an analysis, over a network with related tools. For their programming to be efficient, communication within the pair is essential to effective cooperation. While the driver is editing, the navigator may observe the code or design remotely and at any moment give suggestions about it, or think about alternative solutions or strategies. Conversely, the driver may ask the partner for acknowledgement while he/she writes the code. The conversation model describes the communication process between the driver and the navigator, to clarify how they communicate during pair programming. In the following, we construct a conversation model of DPP by means of language-action theory.

In designing a DPP system for practical situations, we need to consciously focus on the context and application. The structure of the system determines the possibilities for action by the people who use it, and it is this action structure that is ultimately important. Design is ontological: in designing a system, we are participating in the larger design of the organization and the collection of practices in which it plays a role. In describing or constructing any system, we are guided by a perspective. This perspective determines the kinds of questions that will be raised and the kinds of solutions that will be sought. One can consciously apply a perspective as a guide to design. It will not provide answers to all of the specific design questions, but serves to generate the questions that will demand answers. The language/action perspective is one of the relevant theoretical contributions that have appeared within cooperative work. Cooperative work is coordinated by the performance of language actions, in which the partners become mutually committed to the performance of future actions and make declarations that create social structures in which those acts are gathered and interpreted [27]. The language/action perspective has played a significant role in computer supported cooperative work.
PP or DPP is a cooperative activity with two actors, which can be modeled from the language/action perspective. The language/action perspective emphasizes pragmatics: not the form of language, but what people do with it. It distinguishes five fundamental illocutionary points, i.e., things you can do with an utterance [27]:
1) Assertive, which commits the speaker to something being the case, i.e., to the truth of the expressed proposition;
2) Directive, which attempts to get the hearer to do something;
3) Commissive, which commits the speaker to a future course of action;
4) Declaration, which brings about the correspondence between the propositional content of the speech act and reality;
5) Expressive, which expresses a psychological state about a state of affairs.

The need to support DPP with suitable computer-based tools implies investigating the deeper aspects of cooperation and clarification. Cooperation clarification, to the extent that it is made up of communication and negotiation, can be fully characterized under the assumption that DPP can be viewed as a special linguistic action game between the driver and the navigator, constituted by a set of rules defining the conversations possible within it. The results of conceptual and experimental research motivate the following answer: the driver and navigator spend their time making commitments to each other for future activities, coordinating the programming work, switching roles according to the situation, explaining the problems they encounter during pair programming, and reviewing the code. This requires conversations between the driver and navigator to be developed precisely, so that commitments can be made for effective negotiation and coordination of the activities. A conversation between the driver and the navigator during a DPP process is a sequence of related utterances. From the pragmatic point of view, the utterances within a conversation can be classified into basic categories of speech acts on the basis of their illocutionary point, namely directives (e.g., Request, Acceptance or Rejection of a promise) and commissives (e.g., Promise, Counter-offer, Acceptance or Rejection of a commitment, Declaration of commitment fulfillment). Each conversation involves the two actors in the DPP, the driver and the navigator, and follows a pattern that defines the possible sequences of speech acts characterizing the specific type of conversation. In accordance with language/action theory, there are three main types of conversation occurring in any PP. The first is the conversation for action, characterized by the definition of a commitment for doing an action.
In PP one can recognize, e.g., the conversation opened by a request, where the driver opening the conversation asks the partner to carry out some activity, and the conversation opened by a promise, where the navigator agrees and provides the support for its fulfillment. The second is the conversation for possibilities, where the pair discusses a new possibility for the code in terms of requirements, code structure, language, and related knowledge. These conversations, when successful and devoted to topics within the competence of the pair, end with a declaration explaining the concept and agreeing on the code. The third is the conversation for clarification, where the pair copes with or anticipates breakdowns concerning the interpretation of the conditions of satisfaction for an action. The conditions are always interpreted with respect to an implicit shared background, but sharing is partial and needs to be
negotiated. There is no sharp line between these three types, but they are accompanied by different moods. PP is characterized by specific organizational rules, which define the roles in pair programming, role switching, and the compatibility of pairs. These rules can be expressed in terms of the conversation possibilities open to each role. For example, two roles are defined in pair programming: the driver and the navigator. The driver is responsible for writing the code with the mouse and keyboard, while the navigator can view and test the code and think about the structure and strategy. The navigator cannot enter any code but can point out existing problems and request a discussion with the driver. The pair can switch roles periodically. The conversation for action forms the central fabric of DPP. In a simple conversation for action, the driver (A) makes a request to the navigator (B). The request is interpreted as having certain conditions of satisfaction, which characterize a future course of action by B. After the initial utterance (the request), B can accept (committing to the conditions of satisfaction), reject (ending the conversation), or counter-offer with alternative conditions. Each of these in turn has its own possible continuations (e.g., after a counter-offer by B, A can accept, reject, or counter-offer again). The meaning of a language/action exists in the interpretations of a driver and a navigator in a particular situation with their particular backgrounds. The request is the initial utterance, and the driver can make several kinds of request: 1) Help (request to collect some materials or to test the code); 2) Negotiation (request for clarification of some problems); and 3) Question (request concerning the design of the program). The point is not to reduce the complexity of a work process and of the communication going on within it; rather, participants need to be supported in coping with that complexity [28].
This means that any tool supporting the practice of conversation must broaden, not restrict, the range of possibilities open to its participants. The relationship between conversations and commitments is not one-to-one. Making a commitment explicit is sometimes very useful, in particular when we must ensure that it will be completed satisfactorily. A conversation can therefore be considered a sequence of communication events to which not only documents of any type but also any number of commitment negotiations can be attached. A DPP procedure includes a set of conversations. Each conversation, with a commitment and a title, includes a set of events. Each event is a structured message characterized by its completion time, content, associated code, attached documents, its
sender and its receiver. In a DPP procedure, many conversations occur between the driver and the navigator. For example, the driver may ask the navigator for help in finding materials related to the code, or may want to discuss an uncertain programming problem. At other times the driver may ask the navigator to test the code he/she has written. The navigator can point out existing problems while reviewing the code. A system supporting commitment negotiation is required to help users understand the context in which they are negotiating with each other, as well as the state of the negotiation. The goal of this conversation model is to develop a theoretical framework for understanding communication within a DPP process. Conversations are just sequences of communicative events involving two participants, the driver and the navigator in PP, where each participant is free to be as creative as he/she wants. Conversations can be supported by a system that makes accessible the sequence of records of the communicative events, together with the documents generated and/or exchanged and the commitment negotiation steps that occurred during them. Within this model, a commitment may be viewed as the negotiation steps performed within a conversation by the driver and the navigator, together with the documents attached to them. Any negotiation step of a commitment is characterized by its object, its time, and its state. Commitment negotiations are therefore fully transparent to their actors within conversations, without imposing on them the normative constraints of a fully scheduled model of conversation.
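The conversation for action described above (a request by A, followed by accept, reject, or counter-offer) can be read as a small state machine. The following Python sketch is only an illustration under stated assumptions: the state names, the `step` and `run` helpers, and the rule that the two actors alternate after each counter-offer are our own choices, not part of the model as published.

```python
from enum import Enum


class State(Enum):
    REQUESTED = "requested"            # A (driver) has made the initial request
    COUNTERED_BY_B = "countered_by_b"  # B proposed alternative conditions
    COUNTERED_BY_A = "countered_by_a"  # A answered with a new counter-offer
    ACCEPTED = "accepted"              # conditions of satisfaction agreed
    REJECTED = "rejected"              # conversation ended


# Hypothetical transition table: state -> {(actor, speech act): next state}.
# Assumption: after a counter-offer, it is the other actor's turn to respond.
TRANSITIONS = {
    State.REQUESTED: {
        ("B", "accept"): State.ACCEPTED,
        ("B", "reject"): State.REJECTED,
        ("B", "counter_offer"): State.COUNTERED_BY_B,
    },
    State.COUNTERED_BY_B: {
        ("A", "accept"): State.ACCEPTED,
        ("A", "reject"): State.REJECTED,
        ("A", "counter_offer"): State.COUNTERED_BY_A,
    },
    State.COUNTERED_BY_A: {
        ("B", "accept"): State.ACCEPTED,
        ("B", "reject"): State.REJECTED,
        ("B", "counter_offer"): State.COUNTERED_BY_B,
    },
}


def step(state, actor, act):
    """Return the next state, or raise ValueError if the move is not allowed."""
    try:
        return TRANSITIONS[state][(actor, act)]
    except KeyError:
        raise ValueError(f"{actor} cannot {act} in state {state.name}") from None


def run(moves):
    """Replay a list of (actor, act) moves that follow A's initial request."""
    state = State.REQUESTED
    for actor, act in moves:
        state = step(state, actor, act)
    return state
```

For example, `run([("B", "counter_offer"), ("A", "accept")])` ends in `State.ACCEPTED`, while any move attempted after acceptance or rejection raises `ValueError`, since those states have no outgoing transitions in this sketch.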

REQUIREMENTS OF DPP SYSTEM

Distributed pair programming means that two developers synchronously collaborate on the same design or code from different locations. Experimental results indicate that DPP is competitive with collocated pair programming in terms of software quality and development time [13], and that it is feasible and effective to develop software using distributed pair programming [29]. Considering the trend of globalization in software development, we aimed to find out how programmers could effectively apply the DPP technique with appropriate groupware tools, and what the requirements of such tools would be. For this purpose we defined a set of requirements for a distributed pair programming tool in terms
of the analysis of the existing groupware tools and DPP tools [16,22-24], and features of pair programming. According to the technology of Computer Supported Cooperative Work (CSCW) we have identified the following requirements of the DPP tool.

Shared Editing Integrating Existing Editor

As a source code editor, the tool should highlight keywords based on the programming language being used. It should provide not only conventional editing functions such as Cut, Copy, Paste, Find, and Replace, but also options to compile and execute the source code being edited, and it should notify the users of the error messages reported by the compiler. At the same time, existing editors integrated into a development environment for a specific language already offer very powerful functionality. Moreover, developers want to use their familiar editor or integrated development environment for pair programming, and some languages have several of them; Java, for example, has several editors or development environments that support editing and compiling source code, such as Eclipse, JDK, JBuilder, and Visual Café. A collaborative pair programming tool is therefore required to integrate these editors or development environments. However, some problems must be solved when paired developers use different editors or development environments. For example, interoperation is one of the main problems in exchanging information between two different editors for the same language, because they differ in the format of their editing commands and in the parameter options for compiling and executing the source code.

Shared File Repository

The source code files and related documents being edited should be kept under control in a repository. These files and documents should be shared among all members of the development team. Furthermore, configuration management tools should be available to control version changes of code files and documents. Mechanisms to request and obtain shared resources need to be provided so that developers can invite their partner for pair programming. In the DPP setting, users want to share intermediate results by passing them to one or more other users. A shared file repository lets users place and retrieve files, browse files, pair on these files, and organize the files in folders. The pair programming tool should also support text and audio or video-based communication so that the pair can discuss questions and choose among solutions, or perceive the partner’s feelings and intentions through these communication tools.

Activity Indicator

Users need time to perform a task, but only the results are shared among them. In the DPP setting, users need to be aware of each other’s activities, which can be shown in a peripheral area of the interface. The DPP interface should also display the current role of each partner and provide a function for exchanging roles.

Role Switch and Concurrent Control

When the navigator wants to take over the driver role and write code, the system should support applying for and releasing a token. When a role exchange occurs, the DPP tool should support file locking to control changes to the code. Concurrent operations on shared artifacts can lead to conflicting actions that confuse the interacting users, leave the artifacts inconsistent, and make interaction difficult. A token prevents this: only the user holding the token may modify or access the shared resources. In the DPP setting, the user in the driver role holds the token and is allowed to modify the code, while the user in the navigator role can only browse the code written by the partner. A role switch lets the two users exchange roles and changes the token holder.
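The token scheme in this requirement can be sketched as a single-token floor control. The Python class below is a minimal illustration, not the design of any existing DPP tool: the class name `PairSession`, its methods, and the rule that the driver releases the token only after the navigator has applied for it are all assumptions made for the example.

```python
class PairSession:
    """Single-token floor control: only the token holder (driver) may edit."""

    def __init__(self, driver, navigator):
        self.driver = driver            # current token holder
        self.navigator = navigator
        self.pending_request = False    # True once the navigator applied for the token
        self.code = []                  # shared artifact, one entry per line

    def role_of(self, user):
        if user == self.driver:
            return "driver"
        if user == self.navigator:
            return "navigator"
        raise ValueError(f"{user} is not part of this pair")

    def edit(self, user, line):
        """Append a line of code; refused unless the user holds the token."""
        if self.role_of(user) != "driver":
            raise PermissionError(f"{user} is the navigator and may only browse")
        self.code.append(line)

    def request_token(self, user):
        """The navigator applies for the driver role."""
        if self.role_of(user) != "navigator":
            raise ValueError("the driver already holds the token")
        self.pending_request = True

    def release_token(self, user):
        """The driver hands the token over, which swaps the two roles."""
        if self.role_of(user) != "driver":
            raise PermissionError("only the driver can release the token")
        if not self.pending_request:
            raise ValueError("no pending request for the token")
        self.driver, self.navigator = self.navigator, self.driver
        self.pending_request = False
```

Because there is exactly one token, two users can never edit at the same time, which is the simplest way to rule out the conflicting concurrent operations mentioned above. A real tool would additionally need to propagate the role change to both clients.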

DPP Communication Session

Pair programming is a process of negotiating programming problems such as design strategy, code specification, and collaborative testing. Its goal is to improve code quality and increase programming efficiency. Hence, a distributed pair programming tool should support free and natural problem negotiation through a set of communicative events associated with a conversation. In distributed interaction, communication between the partners plays an important role. There are many kinds of communication channels, such as text chat, whiteboard, remote selection, and audio or video channels. Text chat is the simplest communication style: users send short text messages, which are delivered to the partner’s site. The driver or navigator can initiate a conversation at any time about a code segment or a design. The conversation, which has a title, is composed of events. Those events all belong to the same conversation and are ordered by the time of occurrence of each of
them. Each event is represented as a message with a completion time, content, sender, receiver, and optionally a code segment and attached documents. The disadvantage of textual chat for DPP is that the driver needs his or her hands to produce code, whereas coding and talking normally go hand in hand. Thus, textual chat will not be the most important communication medium [4]. Whiteboard chat is similar to textual chat; the only difference is that a whiteboard uses graphical objects to support the interaction. A whiteboard is usually used to discuss software design problems. For example, pairs in DPP can use UML (Unified Modeling Language) to carry out the analysis and design of the software. As an alternative or addition to these communication functions, an audio or video channel can be embedded in the DPP tool. An audio or video channel allows communication and coding to proceed in parallel. The disadvantages of these channels are that they may consume too much network bandwidth, may not be stable enough, and may not be quick and easy to set up. Remote selection shows the remote user’s selection to the local user, so that each user is aware of which object the partner has selected or which code the partner has edited.

DESIGN OF DPP SYSTEM

The goal of a distributed pair programming system spanning heterogeneous code editors or development environments is to enhance the pair programming ability and cooperation capacity of the partners. Programmers use various single-user code editors or development environments every day for their design and programming tasks, and the functions of existing distributed pair programming systems are inferior to those of commercial or open-source single-user code editors or development environments. Figure 2 shows a framework for a distributed pair programming system in which the driver and the navigator use the same or different code editors or environments, each compatible with the specific programming language. The system is implemented with a client/server architecture. The communication management module is responsible for transferring operation information and events or messages between the driver and the navigator.
Collaborative Editing Subsystem

Moreover, existing collaborative code editors or environments lack good compatibility with commercial single-user systems, and their usability is poor. Programmers are therefore unlikely to accept these systems for their development tasks. Collaboration transparency technology has emerged to solve this problem. It allows a group of users to work together without any modification to their single-user code editors or development environments, so that they can directly use the familiar single-user tools for distributed pair programming tasks; research on collaboration transparency technology is therefore of vital value. In Figure 2, the driver can select any code editor or development environment that supports the specific programming language. The Code Adapter component captures the driver’s local operations, filters out inessential information, and recombines the rest into useful operation information in a common or standard format. Similarly, the navigator can select the same editor as the driver or a different one, depending on the navigator’s preference or experience. The Code Adapter component also transforms the operation information received by the Information Transfer component into a format suited to the local editor and executes it in the local editor. On the server side, the Central Repository server is a resource repository. Its resources include source code files, design documents, users, and pair information. The Central Repository is designed on the client/server architecture: the clients display these resources, and the server is responsible for updating them. In a one-to-many pairing mode, the core programmer needs to know about new changes to the code when he/she switches back to a previous partner.
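To make the role of the Code Adapter more concrete, the following Python sketch shows one way such a component might translate between two editors through a common operation format. Everything in it is hypothetical: the `Operation` record, the dictionary events of "editor A" (with 1-based positions), and the pipe-separated commands of "editor B" were invented for illustration and do not describe any real editor or the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Operation:
    """Hypothetical common format exchanged between the two sites."""
    kind: str     # "insert" or "delete"
    line: int     # 0-based line index
    column: int   # 0-based column index
    text: str


def from_editor_a(event: dict) -> Optional[Operation]:
    """Capture side: hypothetical editor A emits dicts with 1-based positions."""
    kinds = {"ins": "insert", "del": "delete"}
    if event.get("cmd") not in kinds:
        return None  # filter out inessential events, e.g. cursor moves
    return Operation(kinds[event["cmd"]], event["line"] - 1,
                     event["col"] - 1, event["txt"])


def to_editor_b(op: Operation) -> str:
    """Replay side: hypothetical editor B accepts 'I|line|col|text' commands."""
    code = {"insert": "I", "delete": "D"}[op.kind]
    return f"{code}|{op.line}|{op.column}|{op.text}"
```

With these assumptions, an insertion captured at the driver's site as `{"cmd": "ins", "line": 3, "col": 1, "txt": "int x;"}` is normalized to `Operation("insert", 2, 0, "int x;")` and replayed at the navigator's site as `"I|2|0|int x;"`. The point of the intermediate format is that supporting a new editor requires only one pair of conversion functions rather than one per editor combination.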



Figure 2. A framework of DPP system.

Conversation Negotiation Subsystem

The conversation negotiation subsystem is responsible for initiating, maintaining, organizing, and storing conversations. The messages of a conversation are transferred in a specific format between the driver and the navigator. Its role is to help the pair communicate and negotiate about coding or design problems. Each conversation corresponds to a commitment. In a DPP procedure, many conversations occur between the driver and the navigator. For example, the driver may ask the navigator for help in finding materials related to the code, or may want to discuss an uncertain programming problem. At other times the driver may ask the navigator to test the code he/she has written. The navigator, in turn, may point out existing problems through conversation negotiation while reviewing the code. Each conversation is associated with a sequence of messages, each composed of a title, time, source, destination, content, attached documents, and associated code segments. Figure 3 shows the model of conversation for the DPP process. The conversation negotiation server is responsible for recording all conversation information between the driver and the navigator so that it can be queried and indexed for later use.
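The message fields listed above suggest a simple record structure. The Python sketch below is one possible reading, offered as an assumption rather than the authors' data model: the class names `Event`, `Conversation`, and `ConversationLog`, the in-memory dictionary standing in for the server's store, and the keyword search are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Event:
    """One message of a conversation, with the fields named in the text."""
    time: datetime
    source: str
    destination: str
    content: str
    code_segment: Optional[str] = None
    attachments: List[str] = field(default_factory=list)


@dataclass
class Conversation:
    title: str
    commitment: str                      # each conversation corresponds to a commitment
    events: List[Event] = field(default_factory=list)

    def add(self, event: Event) -> None:
        self.events.append(event)
        self.events.sort(key=lambda e: e.time)  # keep the order of occurrence


class ConversationLog:
    """In-memory stand-in for the conversation negotiation server's store."""

    def __init__(self):
        self._by_title = {}

    def record(self, conversation: Conversation) -> None:
        self._by_title[conversation.title] = conversation

    def search(self, keyword: str) -> List[str]:
        """Titles of conversations with an event whose content mentions keyword."""
        kw = keyword.lower()
        return sorted(
            title
            for title, conv in self._by_title.items()
            if any(kw in e.content.lower() for e in conv.events)
        )
```

Keeping the events sorted by time preserves the sequence of negotiation steps, so the state of a commitment can later be reconstructed by replaying the conversation from its first event.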



A DPP System Prototype

We have implemented a preliminary prototype system that adopts the client/server architecture. The system provides the basic functions of distributed pair programming as a plug-in integrated into the MS Visual Studio environment. The plug-in, XPPlugin, consists of three subsystems: real-time communication, code synchronization, and pairing management. On the client side, communication and code synchronization are implemented as an MS Visual Studio plug-in. On the server side, all DPP management tasks are handled by a single program. Network communication between client and server uses XMPP (Extensible Messaging and Presence Protocol), an open instant messaging protocol based on XML (Extensible Markup Language). The prototype uses the open-source library agsXMPP in the .NET environment to support the interaction between the partners. The client program is a plug-in that conforms to the MS Visual Studio plug-in specification. Figure 4 shows the window of our prototype system, which consists of three sub-windows: the code sharing and editing window, the communication window, and the role switching and control window. The system architecture, as shown in Figure 5, is divided into four layers:

• User interface layer: provides functions such as login, text chat, code control, and role switch. The user interface is implemented by the LoginForm, chatControl, and CodeMonitorcontrol classes.
• Middle layer: determined by MS Visual Studio. Only through this layer can XPPlugin present its tool window pane as a standard Visual Studio tool pane that can be used freely. The design goal is to allow DPP tools to be embedded in the Visual Studio environment, thus increasing the efficiency of the prototype system.
• Interaction layer: implements the interaction between XPPlugin and the internal data of Visual Studio, and includes the XPPluginPackage and SccService classes. XPPluginPackage inherits from the Package class so that the whole program can be loaded into the Visual Studio environment as a plug-in. SccService implements code management, encapsulated as a service that can be called freely from within the program or by other plug-ins or programs of Visual Studio.
• Network interface layer: encapsulates a network communication class that uses the XMPP protocol to implement the interaction between client and server. Datahandler is an instance of such a network communication class.
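Since XMPP exchanges XML stanzas, a code synchronization message could in principle travel as a custom child element of a `<message>` stanza. The Python sketch below illustrates that idea with the standard library only. It is not the prototype's code (which uses agsXMPP under .NET): the namespace `urn:example:dpp:sync`, the `edit` element, and its `file` and `line` attributes are invented for this example, and the surrounding XMPP stream and its `jabber:client` namespace are omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace for the custom payload; not taken from the prototype.
SYNC_NS = "urn:example:dpp:sync"


def build_sync_stanza(sender, receiver, file_name, line, text):
    """Serialize one edit as a <message> stanza with a custom <edit> child."""
    message = ET.Element("message", {"from": sender, "to": receiver, "type": "chat"})
    edit = ET.SubElement(message, f"{{{SYNC_NS}}}edit",
                         {"file": file_name, "line": str(line)})
    edit.text = text  # special characters such as '<' are escaped on output
    return ET.tostring(message, encoding="unicode")


def parse_sync_stanza(xml_text):
    """Recover the edit description on the receiving side."""
    message = ET.fromstring(xml_text)
    edit = message.find(f"{{{SYNC_NS}}}edit")
    return {
        "from": message.get("from"),
        "to": message.get("to"),
        "file": edit.get("file"),
        "line": int(edit.get("line")),
        "text": edit.text,
    }
```

Putting the payload in its own namespace keeps it from clashing with standard chat content, so the same connection could carry both text chat and code synchronization traffic.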

CONCLUSIONS

In this paper, we have reviewed the features of the existing distributed pair programming tools and analyzed their advantages and disadvantages.

Figure 3. Model of conversation for the DPP process.

Figure 4. Main window of prototype system.



Figure 5. Layered structure of prototype system.

Activity theory is introduced to analyze the process model of DPP and its main subsystems. A conversation model with commitments, based on the language/action perspective, is presented as a framework for understanding communication within DPP processes. We have analyzed the requirements of a distributed pair programming system and presented four important aspects of its design: 1) interoperation between heterogeneous editors for the same language; 2) file sharing in the repository and awareness of pair programming information; 3) role switch and control; and 4) conversation patterns with negotiation. Finally, we have presented a framework supporting distributed pair programming with heterogeneous editors or development environments. In the future, we will improve our preliminary system with new collaborative tools that support communication over audio and video channels. We also hope to integrate existing development environments for the Java language, such as J++, JBuilder, and Visual Café, into our system.




REFERENCES

1. R. Duque and C. Bravo, “Analyzing Work Productivity and Program Quality in Collaborative Programming,” The 3rd International Conference on Software Engineering Advances, Sliema, 2008, pp. 270-276.
2. D. Preston, “Using Collaborative Learning Research to Enhance Pair Programming Pedagogy,” ACM SIGITE Newsletter, Vol. 3, No. 1, January 2006, pp. 16-21.
3. L. Williams, R. Kessler, W. Cunningham and R. Jeffries, “Strengthening the Case for Pair Programming,” IEEE Software, Vol. 17, No. 11, 2000, pp. 19-21.
4. T. Schummer and S. Lukosch, “Understanding Tools and Practices for Distributed Pair Programming,” Journal of Universal Computer Sciences, Vol. 15, No. 16, 2009, pp. 3101-3125.
5. M. M. Muller, “Two Controlled Experiments Concerning the Comparison of Pair Programming to Peer Review,” Journal of Systems and Software, Vol. 78, No. 2, 2005, pp. 166-179.
6. L. Williams, R. R. Kessler, W. Cunningham and R. Jeffries, “Strengthening the Case for Pair Programming,” IEEE Software, Vol. 17, No. 4, 2000, pp. 19-25.
7. J. Nawrocki and A. Wojciechowski, “Experimental Evaluation of Pair Programming,” Proceedings of European Software Control and Metrics Conference, London, 2001.
8. E. Arisholm, H. Gallis, T. Dyba and D. I. K. Sjoberg, “Evaluating Pair Programming with Respect to System Complexity and Programmer Expertise,” IEEE Transactions on Software Engineering, Vol. 33, No. 2, 2007, pp. 65-86.
9. H. Gallis, E. Arisholm and T. Dyba, “An Initial Framework for Research on Pair Programming,” Proceedings of the 2003 ACM-IEEE International Symposium on Empirical Software Engineering, Toronto, 2003, pp. 132-142.
10. C. McDowell, L. Werner, H. Bullock and J. Fernald, “The Effects of Pair Programming on Performance in an Introductory Programming Course,” Proceedings of the 33rd Technical Symposium on Computer Science Education, Cincinnati, 2002, pp. 38-42.
11. N. Nagappan, L. Williams, et al., “Improving the CS1 Experience with Pair Programming,” Proceedings of the 34th Technical Symposium on Computer Science Education, Reno, 2003, pp. 359-362.
12. L. Williams and R. L. Upchurch, “In Support of Student Pair Programming,” Proceedings of the 32nd Technical Symposium on Computer Science Education, Charlotte, 2001, pp. 327-331.
13. P. Baheti, E. Gehringer and D. Stotts, “Exploring the Efficacy of Distributed Pair Programming,” Proceedings of XP Universe, Springer-Verlag, Chicago, 2002, pp. 208-220.
14. L. Williams, D. M. Scott, L. Layman and K. Hussein, “Eleven Guidelines for Implementing Pair Programming in the Classroom,” Agile 2008 Conference, Kopaonik, 2008, pp. 445-451.
15. L. Werner, B. Hanks and C. McDowell, “Pair Programming Helps Female Computer Science Students,” ACM Journal of Education Resources in Computing, Vol. 4, No. 1, 2004, pp. 1-8.
16. H. Natsu, J. Favela, et al., “Distributed Pair Programming on the Web,” Proceedings of the 4th Mexican International Conference on Computer Science, Los Alamitos, 2003, pp. 81-88.
17. B. Hanks, “Tool Support for Distributed Pair Programming: An Empirical Study,” Proceedings of Conference Extreme Programming and Agile Methods, Calgary, 2004, pp. 1-18.
18. S. Hupfer, L. T. Cheng, S. Ross and J. Patterson, “Introducing Collaboration into an Application Development Environment,” Proceedings of the Computer Supported Cooperative Work, ACM Press, New York, 2004, pp. 21-24.
19. F. Maurer, “Supporting Distributed Extreme Programming,” Proceedings of Conference on Extreme Programming and Agile Methods, Springer-Verlag, London, 2002, pp. 13-22.
20. T. Schummer and J. Schummer, “Support for Distributed Teams in Extreme Programming,” In: G. Succi and M. Marchesi, Eds., Extreme Programming Examined, Addison Wesley, Boston, 2001, pp. 355-377.
21. M. Reeves and J. Zhu, “Moomba: A Collaborative Environment for Supported Extreme Programming in Global Software Development,” In: Lecture Notes in Computer Science: Extreme Programming and Agile Process in Software Engineering, Springer, London, 2004, pp. 38-50.
22. C. W. Ho, S. Raha, E. Gehringer and L. Williams, “Sangam: A Distributed Pair Programming Plug-in for Eclipse,” Proceedings of the 2004 OOPSLA Workshop on Eclipse Technology Exchange, New York, 2004, pp. 73-77.
23. K. Elizabeth, A. Dwight, et al., “A Development Environment for Distributed Synchronous Collaborative Programming,” Proceedings of the 13th Annual SIGCSE Conference on Innovation and Technology in Computer Science Education, Madrid, 2008, pp. 158-162.
24. R. Duque and C. Bravo, “Analyzing Work Productivity and Program Quality in Collaborative Programming,” The 3rd International Conference on Software Engineering Advances, 2008, pp. 270-276.
25. K. Kuutti, “The Concept of Activity as a Basic Unit for CSCW Research,” In: L. J. Bannon, M. Robinson and K. Schmid, Eds., Proceedings of the 2nd ECSCW, Kluwer Academic, Amsterdam, 1991, pp. 249-264.
26. K. Kuutti, “Identifying Potential CSCW Applications by Means of Activity Theory Concepts: A Case Example,” Proceedings of CSCW, ACM Press, New York, 1992, pp. 233-240.
27. T. Winograd, “A Language/Action Perspective on the Design of Cooperative Work,” Human Computer and Interaction, Vol. 3, No. 30, 1998, pp. 203-220.
28. G. D. Michelis and M. A. Grasso, “Situating Conversations within the Language/Action Perspective: The Milan Conversation Model,” Proceedings of the 5th Conference on CSCW, Chapel Hill, North Carolina, 1994, pp. 1-12.
29. P. Baheti, L. Williams, et al., “Exploring Pair Programming in Distributed Object-Oriented Team Projects,” Proceedings of OOPSLA Educators Symposium, Seattle, 2002, pp. 1-6.





CORE, Université Catholique de Louvain, Louvain-la-Neuve, Belgium


ECCO International, San Francisco, CA, USA

Citation: Ignacio Aravena, Anthony Papavasiliou and Alex Papalexopoulos (July 19th 2017). “A Distributed Computing Architecture for the Large-Scale Integration of Renewable Energy and Distributed Resources in Smart Grids”, Recent Progress in Parallel and Distributed Computing, Wen-Jyi Hwang, IntechOpen, DOI: 10.5772/67791.

Copyright: © 2017 by authors and Intech. This paper is an open access article distributed under a Creative Commons Attribution 3.0 License.

ABSTRACT

We present a distributed computing architecture for smart grid management, composed of two applications at two different levels of the grid. At the high voltage level, we optimize operations using a stochastic unit commitment (SUC) model with hybrid time resolution. The SUC problem is solved with an asynchronous distributed subgradient method, for which we propose stepsize scaling and fast initialization techniques. The asynchronous algorithm is implemented in a high-performance computing cluster and benchmarked against a deterministic unit commitment model with exogenous reserve targets in an industrial scale test case of the Central Western European system (679 buses, 1037 lines, and 656 generators). At the distribution network level, we manage demand response from small clients through distributed stochastic control, which enables harnessing residential demand response while respecting the desire of consumers for control, privacy, and simplicity. The distributed stochastic control scheme is successfully tested on a test case with 10,000 controllable devices. Both applications demonstrate the potential for efficiently managing flexible resources in smart grids and for systematically coping with the uncertainty and variability introduced by renewable energy.

Keywords: smart grids, stochastic programming, asynchronous distributed algorithm, stochastic control, demand response

INTRODUCTION

The progressive integration of renewable energy resources, demand response, energy storage, electric vehicles, and other distributed resources in electric power grids, which has been taking place worldwide in recent years, is transforming power systems and creating numerous operational challenges, including uncertainty of supply availability, distributed storage management, real-time coordination of distributed energy resources, and changing directions of flow in distribution networks. These challenges demand a shift from the traditional centralized power system operations paradigm toward the smart grid paradigm [1], where distributed computing and control stand out as a promising technology with the potential of achieving operations with optimal performance. The academic literature includes various applications of distributed computing in power system operations, including long- and mid-term planning, short-term scheduling, state estimation and monitoring, real-time control, and simulation [2–5]. Early studies pointed out several challenges related to communications and the heterogeneous characteristics of distributed computing systems, which needed to be addressed first in order to implement distributed computing applications. Nowadays, standard
communication protocols are a mature technology and most current distributed computing resources can perform a broad range of operations. Such advances in distributed computing technology have paved the way for developing and implementing scalable distributed algorithms for power systems. The prevailing industry view, as we move forward into the future smart grid, is that it will entail: (i) broadcasting of dynamic prices or other information and (ii) telemetry backhaul to market participants. In the proposed model, distributed energy resource aggregators are often regarded as transaction brokers between end customers and various upstream market participants. The “failure-free market” design for a pure market-driven solution under this paradigm has been elusive, despite decades of research and development. In this chapter, we analyze the deployment of distributed computing as an enabling tool for managing the short-term operations of smart grids at two levels:

At the level of the high-voltage grid, we centrally optimize operations using a stochastic unit commitment (SUC) model, which endogenously allocates reserve capacity by explicitly modeling uncertainty. Specifically, we present an asynchronous distributed algorithm for solving SUC, which extends the asynchronous algorithm proposed in Ref. [6] in three aspects: (i) we propose a hybrid approach for modeling quarterly dispatch decisions alongside hourly commitment decisions; (ii) we introduce a stepsize scaling on the iterative method to diminish the error due to asynchronous execution; and (iii) we propose two methods for a faster initialization of the algorithm. The asynchronous algorithm is implemented in a high-performance computing (HPC) cluster and benchmarked against a deterministic unit commitment model with exogenous reserve targets (DUCR). We find that distributed computing allows solving SUC within the same time frame required for solving DUCR. At the level of the distribution grid, we rely on stochastic distributed control to manage consumer devices using the ColorPower architecture [7–9], which enables harnessing flexible residential demand response while respecting the desire of consumers for control, privacy, and simplicity. The ColorPower control approach is inspired by the very automatic cooperative protocols that govern Internet communications. These protocols


Parallel and Distributed Computing Applications

represent a distributed and federated control paradigm, in which information and decision-making authority remain local, yet global system stability is ensured.

Centralized clearing at the high-voltage grid level and distributed clearing at the distribution grid level can be integrated in a co-optimization framework, as recently proposed by Caramanis et al. [10]. These two applications of distributed computing in power system operations demonstrate the potential to fully harness the flexibility of the grid and smoothly integrate large shares of renewable and other distributed energy resources in power systems without deteriorating the quality of service delivered to consumers.

The rest of the chapter is organized as follows: Section 2 introduces the deterministic and stochastic unit commitment problems. Section 3 proposes an asynchronous algorithm for solving SUC and presents numerical experiments on a network of realistic scale. Section 4 presents the ColorPower architecture for managing demand response in the distribution grid and demonstrates its capability through a numerical experiment. Finally, Section 5 concludes the chapter.

HIGH-VOLTAGE POWER GRID OPTIMIZATION MODELS

Overview

Operations of the high-voltage power grid are typically scheduled in two stages: (i) day-ahead scheduling, where operations are planned based on forecast conditions for the system and the on/off status of slow generators is fixed, and (ii) real-time scheduling, where system operators balance the system for the actual conditions using the available flexibility in the system. Models for short-term scheduling are solved on a daily basis, and they occupy a central role in clearing power markets and operating power systems. Until recently, power system operators have relied on deterministic short-term scheduling models with reserve margins to secure the system against load forecast errors and outages [11–14]. The integration of renewable energy sources has placed these practices under question because they ignore the inherent uncertainty of renewable energy supply, thereby motivating system operators and researchers to look for systematic methods to address uncertainty in real-time operations. A consistent methodology for

A Distributed Computing Architecture for the Large-Scale Integration...


mitigating the impacts of renewable energy uncertainty, and operational uncertainty in general, is stochastic programming. Stochastic models for short-term scheduling (i.e., SUC models) were originally considered in the seminal work of Takriti et al. [15] and Carpentier et al. [16], as an approach for mitigating demand uncertainty and generator outages. Subsequently, numerous variants of the SUC model have been proposed, which differ in the number of stages, the source of uncertainty, the representation of uncertainty, and the solution methods that are used. See Ref. [17] and references therein for a recent survey.

In the present work, we use the deterministic and stochastic unit commitment models for day-ahead scheduling presented in Sections 2.2 and 2.3. The proposed models differ from those previously proposed in the literature in that they use a hybrid time resolution: hourly commitment decisions (u, v, w and z) and 15-min dispatch decisions (p, r and f). This formulation allows modeling subhourly phenomena, which have been shown to be important for the operation of systems with significant levels of renewable energy integration [18].
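The hybrid time resolution amounts to a fixed mapping between the two index sets. The following is a minimal sketch of that mapping, assuming a one-day horizon with 1-based indices (24 hourly periods and 96 quarter-hourly periods); the function names are hypothetical and are not taken from the chapter.

```python
# Sketch of the hybrid time resolution: hourly commitment periods (tau)
# and 15-minute dispatch periods (t). The 24-hour horizon and the 1-based
# indexing are assumptions made for illustration only.
PERIODS_PER_HOUR = 4                       # four 15-min dispatch periods per hour
N_HOURS = 24                               # |T60| for a one-day horizon (assumed)
N_QUARTERS = N_HOURS * PERIODS_PER_HOUR    # |T15| = 96


def hour_of(t: int) -> int:
    """Return the hourly commitment period that contains dispatch period t."""
    if not 1 <= t <= N_QUARTERS:
        raise ValueError("dispatch period out of range")
    return (t - 1) // PERIODS_PER_HOUR + 1


def quarters_of(tau: int) -> list[int]:
    """Return the 15-min dispatch periods covered by commitment period tau."""
    if not 1 <= tau <= N_HOURS:
        raise ValueError("commitment period out of range")
    first = (tau - 1) * PERIODS_PER_HOUR + 1
    return list(range(first, first + PERIODS_PER_HOUR))
```

With this mapping, a dispatch constraint written for period t would refer to the commitment variable of period `hour_of(t)`.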

Deterministic Unit Commitment with Exogenous Reserve Targets

Using the notation provided in the beginning of the section, we model deterministic unit commitment with reserves (DUCR) as the minimization problem of Eqs. (1)–(9).

The objective function Eq. (1) corresponds to the total operating cost, composed of the no-load cost, the startup cost, and the production cost. Constraints Eq. (2) enforce nodal power balance, while allowing for production shedding. Demand shedding can be included in the present formulation by connecting a very expensive generator to each bus. Eq. (3) enforces the reserve margins on each area of the system, allowing for reserve cascading (secondary reserve capacity can be used to provide tertiary reserve). Eq. (4) models DC power flow constraints in terms of bus angles and thermal limits of transmission lines.

The feasible production set of thermal generators is described by Eqs. (5)–(9). Production and reserve provision limits are expressed as Eq. (5) for slow generators, which can provide reserves only when they are online, and as Eq. (6) for the remaining set of generators, which can provide secondary reserves when they are online and tertiary reserves both when they are online and offline. Ramp rate constraints Eqs. (7)–(8) are based on the formulation provided by Frangioni et al. [19]. Ramp-up rate constraints Eq. (8) enforce, in addition to the ramp-up rate limit on production, that there is enough ramping capability between periods t − 1 and t to ramp up r^2_{g,t} MW within ΔT2 minutes (which can be used to provide secondary reserve), and to ramp up r^3_{g,t} MW within ΔT3 minutes (which can be used to provide tertiary reserve). Constraints Eq. (9) enforce minimum up and down times, as proposed by Rajan and Takriti [20].

Boundary conditions of the problem are modeled by allowing the time indices to cycle within the horizon. In other words, for any commitment variable x_{·,τ} with τ < 1, we define x_{·,τ} := x_{·,((τ − 1) mod |T60|) + 1}. Similarly, for any dispatch variable x_{·,t} with t < 1 or t > |T15|, we define x_{·,t} := x_{·,((t − 1) mod |T15|) + 1}. In this fashion, we model initial conditions (τ < 1, t < 1) and restrain end effects of the model (τ = |T60|, t = |T15|), simultaneously.
In practical cases, initial conditions are given by the current operating conditions and end effects are dealt with by using an extended look-ahead horizon.
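The cyclic treatment of time indices described above can be sketched in a few lines. This is an illustration only, assuming 1-based period indices; `wrap_index` is a hypothetical helper name.

```python
def wrap_index(i: int, horizon: int) -> int:
    """Map any integer period index onto 1..horizon by cycling.

    Indices below 1 wrap around to the end of the horizon and indices
    above the horizon wrap around to its beginning (1-based indexing
    is an assumption of this sketch).
    """
    if horizon < 1:
        raise ValueError("horizon must be positive")
    return (i - 1) % horizon + 1


def lookback_periods(tau: int, min_up: int, horizon: int) -> list[int]:
    """Periods inspected by a minimum up time constraint ending at tau."""
    return [wrap_index(tau - k, horizon) for k in range(min_up)]
```

For example, with a 24-hour horizon a minimum up time of three hours evaluated at hour 1 would look back at hours 1, 24, and 23, which is how the initial conditions are tied to the end of the horizon.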



Two-stage Stochastic Unit Commitment and Scenario Decomposition

Following Papavasiliou et al. [21], we formulate SUC as the two-stage stochastic program of Eqs. (10)–(17).

The objective function in Eq. (10) corresponds to the expected cost over the set of scenarios S, with associated probabilities πs. Constraints in Eqs. (11)–(12) are analogous to Eqs. (2) and (4). No explicit reserve requirements are enforced in the stochastic unit commitment model, since reserves are endogenously determined by the explicit modeling of uncertainty. Consequently, generator constraints of the deterministic problem, Eqs. (5)–(8), become identical for all thermal generators and can be expressed as Eqs. (13)–(15). Nonanticipativity constraints Eq. (16) are formulated using state variables w and z for the commitment and startup decisions of slow thermal generators (first-stage decisions). We associate Lagrange multipliers μ and ν with nonanticipativity constraints. Constraints in Eq. (17) enforce minimum up and down times on unit commitment variables.

AN ASYNCHRONOUS DISTRIBUTED ALGORITHM FOR STOCHASTIC UNIT COMMITMENT

Scenario Decomposition of the SUC Problem

The SUC problem in Eqs. (10)–(17) grows linearly in size with the number of scenarios. Hence, SUC problems are in general of large scale, even for small system models. This motivated Takriti et al. [15] and Carpentier et al. [16] to rely on Lagrangian decomposition methods for solving the problem. Recent SUC studies have focused on designing decomposition algorithms capable of solving the problem in operationally acceptable time frames. Papavasiliou et al. [21] proposed a dual scenario decomposition scheme where the dual is solved using the subgradient method, and where the dual function is evaluated in parallel. Kim and Zavala [22] also used a dual scenario decomposition scheme, but solved the dual problem using a bundle method. Cheung et al. [23] present a parallel implementation of the progressive hedging algorithm of Rockafellar and Wets [24].

All previously mentioned parallel algorithms for SUC are synchronous algorithms, i.e., scenario subproblems are solved in parallel at each iteration of the decomposition method; however, it is necessary to solve all scenario subproblems before advancing to the next iteration. In cases where the solution times of subproblems differ significantly, synchronous algorithms lead to an underutilization of the parallel computing infrastructure and a loss of parallel efficiency. We have found instances where the time required to evaluate subproblems for difficult scenarios is 75 times longer than the solution time for easy scenarios.

Aiming at overcoming the difficulties faced by synchronous algorithms, we propose an asynchronous distributed algorithm for solving SUC. The algorithm is based on the scenario decomposition scheme for SUC proposed in Ref. [21], where the authors relax the nonanticipativity constraints Eq. (16) and form the Lagrangian dual problem, Eq. (18),



where h0(μ, ν) and hs(μs, νs) are defined according to Eqs. (19) and (20), respectively. We use boldface to denote vectors and partial indexation of dual variables with respect to scenarios, so that μs and νs collect the components of μ and ν associated with scenario s. The constraints within the infimum in Eq. (20) refer to constraints Eqs. (11)–(15) for scenario s (dropping the scenario indexation of variables).

Both h0(μ, ν) and hs(μs, νs) for all s ∈ S are nondifferentiable concave functions, since each is an infimum of functions that are affine in the multipliers. Evaluating h0(μ, ν) amounts to solving a small integer programming problem, for the constraints of which we have a linear-size convex hull description [20]. Evaluating hs(μs, νs) amounts to solving a deterministic unit commitment (DUC) problem without reserve requirements, which is a mixed-integer linear program of potentially large scale for realistic system models. In practice, the run time for evaluating hs(μs, νs) for any s and any dual multipliers is at least two orders of magnitude greater than the run time for evaluating h0(μ, ν).

The proposed distributed algorithm exploits the characteristics of h0(μ, ν) and hs(μs, νs) in order to maximize Eq. (18) and compute lower bounds on the optimal SUC solution, while recovering feasible nonanticipative commitment schedules with associated expected costs (upper bounds to the optimal SUC solution). The dual maximization algorithm is inspired by the work of Nedić et al. on asynchronous incremental subgradient methods [25].

Dual Maximization and Primal Recovery

For simplicity, assume that we have 1 + DP + PP available parallel processors which can all access a shared memory space. We allocate one processor to coordinate the parallel execution and manage the shared memory space, DP ≤ |S| processors to solve the dual problem in Eq. (18), and PP processors to recover complete solutions to the SUC problem in Eqs. (10)–(17). Interactions between different processors are presented in Figure 1.



Figure 1: Asynchronous algorithm layout. Information within square brackets is read or written at a single step of the algorithm.

We maximize the dual function in Eq. (18) using a block coordinate descent (BCD) method, in which each update is performed over a block of dual variables associated with a scenario, (μs, νs) for certain s ∈ S, following the direction of the subgradient of the dual function in the block of variables (μs, νs). The BCD method is implemented in parallel and asynchronously by having each dual processor perform updates on the dual variables associated with a certain scenario, which are not being updated by any other dual processor at the same time. Scenarios whose dual variables are not currently being updated by any processor are held in the dual queue QD, to be updated later.

We maintain shared memory registers of QD. We denote the current multipliers as (μs^{k(s)}, νs^{k(s)}), ∀s ∈ S, where k(s) is the number of updates to the block of scenario s; the previous-to-current dual multipliers as (μs^{k(s)−1}, νs^{k(s)−1}), together with their associated lower bound on hs, ∀s ∈ S; the global update count as k; and the best lower bound found on Eqs. (10)–(17) as LB. Additionally, a shared memory register of the primal queue QP is required for recovering primal solutions. Then, each dual processor performs the following operations:

1. Read and remove the first scenario s from QD.

2. Read the current multipliers of scenario s, (μs^{k(s)}, νs^{k(s)}), and evaluate hs at them approximately.

3. Read the previous-to-current multipliers (μω^{k(ω)−1}, νω^{k(ω)−1}) for all ω ∈ S∖{s}.

4. Construct the delayed multiplier vectors and evaluate h0(μ, ν) at them approximately.

5. Read the current global iteration count k and perform a BCD update on the dual multipliers of scenario s, where (w*, z*) is an approximate minimizer of Eq. (19) for the delayed multipliers, (p*, u*, v*, f*) is an approximate minimizer of Eq. (20) for the current multipliers of scenario s, and (u*SLOW, v*SLOW) corresponds to the commitment and startup for slow generators in (p*, u*, v*, f*).

6. Compute a new lower bound from the lower bounds of the MILP solutions of Eqs. (19) and (20).

7. Let k(s) := k(s) + 1 and update the shared memory registers.

8. Add s at the end of QD and return to 1.

Steps 1–3 of the dual processor algorithm are self-explanatory. Step 4 constructs a compound of the previous iterates which is useful for computing lower bounds.
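The queue discipline of the dual processors (take the first scenario from QD, work on it, put it back at the end) can be illustrated with a small event-driven simulation. The sketch below is not the authors' implementation: it collapses the evaluation and update steps into a single count, assumes a constant evaluation time per scenario, and uses hypothetical names throughout.

```python
import heapq
from collections import deque


def simulate_dual_queue(eval_time, n_processors, horizon):
    """Count block updates per scenario under the dual queue discipline.

    eval_time: dict mapping scenario -> assumed constant time to evaluate h_s.
    n_processors: number of dual processors (DP).
    horizon: simulated wall time after which no further updates are counted.
    Scenario keys must be mutually comparable (used to break ties in the heap).
    """
    queue = deque(eval_time)                  # dual queue Q_D
    updates = {s: 0 for s in eval_time}
    running = []                              # heap of (finish_time, scenario)
    for _ in range(min(n_processors, len(queue))):
        s = queue.popleft()                   # read and remove the first scenario
        heapq.heappush(running, (eval_time[s], s))
    while running:
        now, s = heapq.heappop(running)
        if now > horizon:
            break
        updates[s] += 1                       # evaluation and update, collapsed
        queue.append(s)                       # scenario goes to the end of Q_D
        nxt = queue.popleft()                 # the freed processor takes the next one
        heapq.heappush(running, (now + eval_time[nxt], nxt))
    return updates


if __name__ == "__main__":
    counts = simulate_dual_queue({"easy1": 1, "easy2": 1, "hard": 10}, 2, 20)
    print(counts)
```

In such a run the scenario with the long evaluation time receives far fewer updates than the short ones, while no processor ever waits for it, which is the behavior the asynchronous scheme is designed to obtain.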



During the execution of the algorithm, step 5 will perform updates to the blocks of dual variables associated with all scenarios. As hs(μs, νs) is easier to evaluate for certain scenarios than others, the blocks of dual variables associated with easier scenarios will be updated more frequently than those of harder scenarios. We model this process, in a simplified fashion, as if every update is performed on a randomly selected scenario from a nonuniform distribution, where the probability of selecting scenario s is proportional to 1/T^between_s, and T^between_s is the average time between two updates on scenario s (T^between_s is estimated during execution).

The asynchronous BCD method can then be understood as a stochastic approximate subgradient method [26, 27]. This is an approximate method for three reasons: (i) as the objective function contains a nonseparable nondifferentiable function h0(μ, ν), there is no guarantee that the expected update direction coincides with a subgradient of the objective of Eq. (18) at the current iterate, (ii) h0(μ, ν) is evaluated for a delayed version of the multipliers, and (iii) h0(μ, ν) and hs(μs, νs) are evaluated only approximately up to a certain MILP gap. Provided that we use a diminishing, nonsummable and square-summable stepsize αk of the type 1/k^q, and that the error in the subgradient is bounded, the method will converge to an approximate solution of the dual problem in Eq. (18) [26, 27].

In step 6, we compute a lower bound on the primal problem Eqs. (10)–(17) using previous evaluations of hs(μs, νs) recorded in memory, as proposed in Ref. [6]. Step 7 updates the shared memory registers for future iterations and step 8 closes the internal loop of the dual processor.

We recover primal solutions by taking advantage of the fact that (u*SLOW, v*SLOW) is a feasible solution for (w, z) in Eqs. (10)–(17). Therefore, in order to compute complete primal solutions and obtain upper bounds for the problem in Eqs. (10)–(17), we can fix w := u*SLOW and z := v*SLOW and solve the remaining problem, as proposed in Ref. [28]. After fixing (w, z), the remaining problem becomes separable by scenario; hence, in order to solve it, we need to solve a restricted DUC for each scenario in S. These primal evaluation jobs, i.e., solving the restricted DUC for {u*SLOW} × S, are appended at the end of the primal queue QP by dual processors after each update (step 7.e). Note that we do not require storing v*SLOW because its value is implied by u*SLOW.
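For the stepsize family of the type 1/k^q mentioned above, the three requirements (diminishing, nonsummable, square-summable) restrict the admissible exponent q. The following is the standard p-series argument, stated here as a reminder rather than as a result of this chapter:

```latex
\alpha_k = \frac{1}{k^{q}}, \qquad
\sum_{k=1}^{\infty} \alpha_k = \infty \iff q \le 1, \qquad
\sum_{k=1}^{\infty} \alpha_k^{2} = \sum_{k=1}^{\infty} \frac{1}{k^{2q}} < \infty \iff q > \tfrac{1}{2},
\qquad \text{hence} \qquad \tfrac{1}{2} < q \le 1 .
```

Any q in this range also gives a diminishing sequence, since q > 0.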



The primal queue is managed by the coordinator process, which assigns primal jobs to primal processors as they become available. The computation of primal solutions is therefore also asynchronous, in the sense that it runs independently of dual iterations and that the evaluation of candidate solutions u*SLOW does not require that the previous candidates have already been evaluated for all scenarios. Once a certain candidate u^l has been evaluated for all scenarios, the coordinator can compute a new upper bound to Eqs. (10)–(17) as the expected cost Σ_{s∈S} πs UB^l_s, where UB^l_s is the upper bound associated with u^l on the restricted DUC problem of scenario s. The coordinator process keeps track of the candidate associated with the smallest upper bound throughout the execution.
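The bookkeeping performed by the coordinator for primal candidates can be sketched as follows. The sketch assumes that the upper bound of a candidate is the probability-weighted sum of its per-scenario costs, consistent with the expected-cost objective of Eq. (10); the function names and data layout are hypothetical.

```python
def expected_upper_bound(prob, scenario_ub):
    """Probability-weighted cost of one candidate over all scenarios.

    prob: dict scenario -> probability pi_s.
    scenario_ub: dict scenario -> cost of the restricted DUC for this candidate.
    """
    if set(prob) != set(scenario_ub):
        raise ValueError("candidate not yet evaluated for all scenarios")
    return sum(prob[s] * scenario_ub[s] for s in prob)


def update_incumbent(best, candidate_id, prob, scenario_ub):
    """Keep the candidate with the smallest expected cost seen so far.

    best is a (candidate_id, upper_bound) pair, or None before the first
    candidate has been fully evaluated.
    """
    ub = expected_upper_bound(prob, scenario_ub)
    if best is None or ub < best[1]:
        return (candidate_id, ub)
    return best
```

Because a candidate is only scored once all of its scenario jobs have finished, candidates may complete out of order, and the incumbent is simply replaced whenever a cheaper one appears.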

Finally, the coordinator process will terminate the algorithm when 1 − LB/UB ≤ ϵ, where ϵ is a prescribed tolerance, or when reaching a prescribed maximum solution time. At this point, the algorithm retrieves the best-found solution and the bound on the distance of this solution from the optimal objective function value.
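The termination test can be written as a short predicate. This is a sketch under the assumption that costs are positive, so that the relative gap 1 − LB/UB is well defined; the function names are hypothetical.

```python
import math


def relative_gap(lb: float, ub: float) -> float:
    """Return 1 - LB/UB, or infinity while no finite positive UB is known."""
    if ub <= 0 or math.isinf(ub):
        return math.inf
    return 1.0 - lb / ub


def should_terminate(lb, ub, eps, elapsed, max_time):
    """Stop when the gap is within tolerance or the time budget is exhausted."""
    return relative_gap(lb, ub) <= eps or elapsed >= max_time
```

Treating a missing upper bound as an infinite gap means that, before the first primal candidate has been fully evaluated, only the time limit can stop the run.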

Dual Algorithm Initialization

The lower bounds computed by the algorithm presented in the previous section depend on previous evaluations of hs(μs, νs) for other scenarios. As the evaluation of hs(μs, νs) can require a substantial amount of time for certain scenarios, the computation of the first lower bound considering nontrivial values of hs(μs, νs) for all scenarios can be delayed significantly with respect to the advance of dual iterations and primal recovery. In other words, it might be the case that the algorithm finds a very good primal solution but is unable to terminate because it is missing the value of hs(μs, νs) for a single scenario.

In order to prevent these situations and to obtain nontrivial bounds faster, in the first pass of the dual processors over all scenarios, we can replace hs(μs, νs) with a surrogate ηs(μs, νs) which is easier to compute, such that ηs(μs, νs) ≤ hs(μs, νs) for any (μs, νs). We propose two alternatives for ηs(μs, νs):

1. The linear relaxation of the scenario DUC (LP).

2. An optimal power flow for each period (OPF), where (11st)–(13st) correspond to constraints Eqs. (11)–(13) for scenario s and period t.

The LP approach requires solving a linear program of the same size as the original problem in Eq. (20), but it has the advantage that it can be obtained as an intermediate result while evaluating hs(μs, νs) (the LP approach does not add extra computations to the algorithm). The OPF approach, on the other hand, requires solving many small MILP problems, which can be solved faster than the linear relaxation of Eq. (20). The OPF approach ignores several constraints and cost components, such as the startup cost of nonslow generators, and it adds extra computations to the algorithm.
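The property that makes a relaxation usable as a surrogate, namely that it never exceeds the value of the original minimization, can be checked on a toy problem. The sketch below is not the scenario subproblem of Eq. (20): it is a hypothetical single-period commitment problem with no-load and marginal costs only, solved by enumeration, and its linear relaxation is solved in closed form by merit order. All names and numbers are illustrative assumptions.

```python
from itertools import product


def dispatch_cost(units, demand, unit_cost):
    """Merit-order dispatch: serve demand from the cheapest unit upward."""
    remaining, cost = demand, 0.0
    for u in sorted(units, key=unit_cost):
        p = min(u["pmax"], remaining)
        cost += unit_cost(u) * p
        remaining -= p
    # Infeasible if the available capacity cannot cover the demand.
    return cost if remaining <= 1e-9 else float("inf")


def milp_cost(units, demand):
    """Exact optimum by enumerating all on/off combinations (toy sizes only)."""
    best = float("inf")
    for on in product([0, 1], repeat=len(units)):
        committed = [u for u, flag in zip(units, on) if flag]
        fixed = sum(u["noload"] for u in committed)
        variable = dispatch_cost(committed, demand, lambda u: u["marginal"])
        best = min(best, fixed + variable)
    return best


def lp_relaxation_cost(units, demand):
    """Optimum with fractional commitment.

    With u in [0, 1] and p <= pmax * u, the cheapest choice is u = p / pmax,
    so the no-load cost is spread over the capacity of each unit.
    """
    return dispatch_cost(
        units, demand, lambda u: u["marginal"] + u["noload"] / u["pmax"]
    )


UNITS = [
    {"noload": 100.0, "marginal": 10.0, "pmax": 100.0},
    {"noload": 20.0, "marginal": 30.0, "pmax": 50.0},
]
```

For a demand of 60 MW, `milp_cost(UNITS, 60.0)` is 700 and `lp_relaxation_cost(UNITS, 60.0)` is 660, so the relaxation is a valid but not tight lower bound.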

Implementation and Numerical Experiments

We implement the DUCR model using Mosel and solve it directly using Xpress. We also implement the proposed asynchronous algorithm for SUC (described in the previous subsections) in Mosel, using the module mmjobs for handling parallel processes and communications, while solving the subproblems with Xpress [29]. We configure Xpress to solve the root node using the barrier algorithm and we set the termination gap to 1%, for both the DUCR and SUC subproblems, and the maximum solution wall time to 10 hours.

Numerical experiments were run on the Sierra cluster hosted at the Lawrence Livermore National Laboratory. Each node of the Sierra cluster is equipped with two Intel Xeon EP X5660 processors (12 cores per node) and 24 GB of RAM. We use 10 nodes for the proposed distributed algorithm, assigning 5 nodes to dual processors, with 6 dual processors per node (DP = 30), and 5 nodes to primal recovery, with 12 primal processors per node. The coordinator is implemented on a primal node and occupies one primal processor (PP = 59).

We test the proposed algorithm on a detailed model of the Central Western European system, consisting of 656 thermal generators, 679 nodes, and 1037 lines. The model was constructed by using the network model of Hutcheon and Bialek [30], technical generator information provided to the authors by ENGIE, and multiarea demand and renewable energy information collected from national system operators (see [31] for details). We consider eight representative day types, one weekday and one weekend day per season, as being representative of the different conditions faced by the system throughout the year.

We consider four day-ahead scheduling models: the DUCR model and the SUC model with 30 (SUC30), 60 (SUC60), and 120 (SUC120) scenarios. The sizes of the different day-ahead scheduling models are presented in Table 1, where the size of the stochastic models refers to the size of the extensive form. While the DUCR model is of the scale of problems that fit in the memory of a single machine and can be solved by a commercial solver, the SUC models in extensive form are beyond current capabilities of commercial solvers.

Table 2 presents the solution time statistics for all day-ahead scheduling policies. In the case of SUC, we report these results for the two dual initialization alternatives proposed in Section 3.3. The results of Table 2 indicate that the OPF initialization significantly outperforms the LP approach in terms of termination time. This is mainly due to the fact that the OPF approach provides nontrivial lower bounds including information for all scenarios much faster than the LP approach.

On the other hand, the solution times of SUC60 and DUCR indicate that, using distributed computing, we can solve SUC at a comparable run time to that required by commercial solvers for DUCR on large-scale systems. Moreover, as shown in Table 3, for a given hard constraint on solution wall time such as 2 h (which is common for day-ahead power system operations), the proposed distributed algorithm provides solutions to SUC with up to 60 scenarios within 2% of optimality, which is acceptable for operational purposes.
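The processor counts quoted for the experiments follow from the node layout. The short calculation below only restates that arithmetic; the variable names are hypothetical, and the cores-per-dual-processor figure is an inference that assumes each dual node is fully used.

```python
CORES_PER_NODE = 12            # two six-core processors per node

dual_nodes, primal_nodes = 5, 5
dual_per_node = 6              # dual processors per dual node
primal_per_node = 12           # primal processors per primal node

DP = dual_nodes * dual_per_node
# The coordinator runs on a primal node and takes one primal processor slot.
PP = primal_nodes * primal_per_node - 1
total_processes = 1 + DP + PP

# Assumption: if a dual node is fully used, each dual processor gets 2 cores.
cores_per_dual_processor = CORES_PER_NODE // dual_per_node

print(DP, PP, total_processes, cores_per_dual_processor)
```

This reproduces DP = 30 and PP = 59, for a total of 1 + DP + PP = 90 processes on the 10 nodes.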



Table 1. Problem sizes

Table 2. Solution time statistics over 8 day types

Table 3. Worst optimality gap (over 8 day types) vs. solution wall time

SCALABLE CONTROL FOR DISTRIBUTED ENERGY RESOURCES

Overview

Residential demand response has gained significant attention in recent years as an underutilized source of flexibility in power systems, and is expected to become highly valuable as a balancing resource as increasing



amounts of renewable energy are being integrated into the grid. However, the mobilization of demand response by means of real-time pricing, which represents the economists’ gold standard and can be traced back to the seminal work of Schweppe et al. [32], has so far fallen short of expectations due to several obstacles, including regulation issues, market structure, incentives to consumers, and technological limitations. The ColorPower architecture [7, 8, 9] aims at releasing the potent power of demand response by approaching electricity as a service of differentiated quality, rather than a commodity that residential consumers are willing to trade in real time [33]. In this architecture, the coordination problem of determining which devices should consume power at what times is solved through distributed aggregation and stochastic control. The consumer designates devices or device modes using priority tiers (colors). These tiers correspond to “service level” plans, which are easy to design and implement: we can simply map the “color” designations of electrical devices into plans. A “more flexible” color means less certainty of when a device will run (e.g., time when a pool pump runs), or lower quality service delivered by a device (e.g., wider temperature ranges, slower electrical vehicle charging). These types of economic decision-making are eminently compatible with consumer desires and economic design, as evidenced by the wide range of quality-of-service contracts offered in other industries. Furthermore, the self-identified priority tiers of the ColorPower approach enable retail power participation in wholesale energy markets, lifting the economic obstacles for demand response: since the demand for power can be differentiated into tiers with a priority order, the demand in each tier can be separately bid into the current wholesale or local (DSO level) energy markets. 
The price for each tier can be set according to the cost of supplying demand response from that tier, which in turn is linked to the incentives necessary for securing customer participation in the demand response program. This allows aggregated demand to send price signals in the form of a decreasing buy bid curve. Market information thus flows bidirectionally. A small amount of flexible demand can then buffer the volatility of the overall power demand by yielding power to the inflexible devices as necessary (based upon the priority chosen by the customer), while fairly distributing power to all customer devices within a demand tier. Technological limitations to the massive deployment of demand response are dealt with by deploying field-proven stochastic control techniques across the distribution network, with the objective of subtly shifting the schedules of



millions of devices in real time, based upon the conditions of the grid. These control techniques include the CSMA/CD algorithms that permit cellular phones to share narrow radio frequency bands, telephone switch control algorithms, and operating system thread scheduling, as well as examples from nature such as social insect hive behaviors and bacterial quorum sensing. Moreover, the ubiquity of Internet communications allows us to consider using the Internet platform itself for end-to-end communications between machines. At a high level, the ColorPower algorithm operates by aggregating the demand flexibility state information of each agent into a global estimate of total consumer flexibility. This aggregate and the current demand target are then broadcast via IP multicast throughout the system, and every local controller (typically one per consumer or one per device) combines the overall model and its local state to make a stochastic control decision. With each iteration of aggregation, broadcast, and control, the overall system moves toward the target demand, set by the utility or the ISO, TSO, or DSO, allowing the system as a whole to rapidly achieve any given target of demand and closely tracking target ramps. Note that aggregation has the beneficial side-effect of preserving the privacy of individual consumers: their demand information simply becomes part of an overall statistic. The proposed architectural approach supplements the inadequacy of pure market-based control approaches by introducing an automated, distributed, and cooperative communications feedback loop between the system and large populations of cooperative devices at the edge of the network. TSO markets and the evolving DSO local energy markets of the future will have both deep markets and distributed control architecture pushed out to the edge of the network. 
This smart grid architecture for demand response in the mass market is expected to be a key asset in addressing the challenges of renewable energy integration and the transition to a low-carbon economy.

The ColorPower Control Problem

A ColorPower system consists of a set of n agents, each owning a set of electrical devices organized into k colors, where lower-numbered colors are intended to be shut off first (e.g., 1 for “green” pool pumps, 2 for “green” HVAC, 3 for “yellow” pool pumps, etc.), and where each color has its own time constants.



Within each color, every device is either Enabled, meaning that it can draw power freely, or Disabled, meaning that it has been shut off or placed in a lower power mode. In order to prevent damage to appliances and/or customer annoyance, devices must wait through a Refractory period after switching between Disabled and Enabled, before they return to being Flexible and can switch again. These combinations give four device states (e.g., Enabled and Flexible, EF), through which each device in the ColorPower system moves according to the modified Markov model of Figure 2: randomly from EF to DR and DF to ER (becoming disabled with probability poff and enabled with probability pon) and by randomized timeout from ER to EF and DR to DF (a fixed length of TF plus a uniform random addition of up to TV). The ColorPower control problem can then be stated as dynamically adjusting pon and poff for each agent and color tier, in a distributed manner, so that the aggregate consumption of the system follows a demand goal given by the operator of the high-voltage network.
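The four-state device model can be sketched as a small state machine. The code below is a simplified illustration, not the ColorPower implementation: it assumes that time advances in discrete control rounds, that TF and TV are given as whole numbers of rounds, and that the class and method names are hypothetical.

```python
import random


class Device:
    """One controllable device moving through the states EF, DR, DF, ER."""

    def __init__(self, t_fixed, t_var, rng=None):
        self.state = "EF"                  # start Enabled and Flexible
        self.timer = 0                     # remaining Refractory rounds
        self.t_fixed = t_fixed             # T_F, in rounds (assumed discrete)
        self.t_var = t_var                 # T_V, in rounds (assumed discrete)
        self.rng = rng or random.Random()

    def _start_refractory(self):
        # Fixed length T_F plus a uniform random addition of up to T_V.
        self.timer = self.t_fixed + self.rng.randint(0, self.t_var)

    def step(self, p_on, p_off):
        """Advance one round with the broadcast transition probabilities."""
        if self.state == "EF" and self.rng.random() < p_off:
            self.state = "DR"              # switched off, now Refractory
            self._start_refractory()
        elif self.state == "DF" and self.rng.random() < p_on:
            self.state = "ER"              # switched on, now Refractory
            self._start_refractory()
        elif self.state in ("ER", "DR"):
            self.timer -= 1
            if self.timer <= 0:            # timeout: become Flexible again
                self.state = "EF" if self.state == "ER" else "DF"
        return self.state
```

A device that has just switched cannot switch again until its timer expires, which is what prevents rapid cycling of appliances.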

The ColorPower Architecture

The block diagram of the ColorPower control architecture is presented in Figure 3. Each ColorPower client (i.e., the controller inside a device) regulates the state transitions of the devices under its control. Each client state s(t, a) is aggregated to produce a global state estimate ŝ(t), which is broadcast along with a goal g(t) (the demand target set by the utility or the ISO, TSO, or DSO), allowing clients to shape demand by independently computing the control state c(t, a).

The state s(t, a) of a client a at time t sums the power demands of the device(s) under its control, and these values are aggregated using a distributed algorithm (e.g., a spanning tree in Ref. [7]) and fed to a state estimator to get an overall estimate ŝ(t) of the true state of total demand in each state for each color. This estimate is then broadcast to all clients (e.g., by gossip-like diffusion in Ref. [7]), along with the demand shaping goal g(t) for the next total Enabled demand over all colors. The controller at each client a sets its control state c(t, a), defined as the set of transition probabilities pon,i,a and poff,i,a for each color i. Finally, demands move through their states according to those transition probabilities, subject to exogenous disturbances such as changes in demand due to customer override, changing environmental conditions, and imprecision in measurement, among others.
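The aggregation step can be illustrated with a flat sum over client states. This is a sketch only: a deployed system would aggregate over a network structure such as the spanning tree mentioned above, and the data layout (a dictionary keyed by color and state) and function name are assumptions made here.

```python
from collections import Counter


def aggregate(client_states):
    """Sum per-client demand into system totals per (color, state).

    client_states: iterable of dicts mapping (color, state) -> demand in kW,
    where state is one of "EF", "ER", "DF", "DR".
    """
    total = Counter()
    for state in client_states:
        total.update(state)       # adds demand values key by key
    return dict(total)


def enabled_demand(totals):
    """Total Enabled demand (EF plus ER) over all colors."""
    return sum(v for (_, st), v in totals.items() if st in ("EF", "ER"))
```

Only the totals leave the aggregation step, so the broadcast estimate carries no information that identifies an individual client.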



Figure 2: Markov model-based device state switching [8, 9].

Figure 3: Block diagram of the control architecture [8, 9].

Note that the aggregation and broadcast algorithms must be chosen carefully in order to ensure that the communication requirements are lightweight enough to allow control rounds that last for a few seconds on low-cost hardware. The choice of algorithm depends on the network structure: for mesh networks, for example, spanning tree aggregation and gossip-based broadcast are fast and efficient (for details, see [7]).

ColorPower Control Algorithm

The ColorPower control algorithm determines the control vector c(t, a) by a stochastic controller formulated to satisfy four constraints:

• Goal tracking: The total Enabled demand in s(t) should track g(t) as closely as possible, i.e., the sum of Enabled demand over all colors i should be equal to the goal.

• Color priority: Devices with lower-numbered colors should be shut off before devices with higher-numbered colors. This is formalized in terms of Di, the demand for the ith color and above, so that devices are Enabled from the highest color downward.

• Fairness: When the goal leads to some devices with a particular color being Enabled and other devices with that color being Disabled, each device has the same expected likelihood of being Disabled. This means that the control state is identical for every client.

• Cycling: Devices within a color trade off which devices are Enabled and which are Disabled, such that no device is unfairly burdened by initial bad luck. This is ensured by a constraint under which any color with a mixture of Enabled and Disabled Flexible devices will always be switching the state of some devices.

For this last constraint, there is a tradeoff between how quickly devices cycle and how much flexibility is held in reserve for future goal tracking; we balance these with a target ratio f of the minimum ratio between pairs of corresponding Flexible and Refractory states.

Since the controller acts indirectly, by manipulating the pon and poff transition probabilities of devices, the only resource available for meeting these constraints is the demand in the flexible states EF and DF for each tier. When it is not possible to satisfy all four constraints simultaneously, the ColorPower controller prioritizes the constraints in order of their importance. Fairness and qualitative color guarantees are given highest priority, since these are part of the contract with customers: fairness by ensuring that the expected enablement fraction of each device is equivalent (though particular clients may achieve this in different ways, depending on their type and customer settings). Qualitative priority is handled by rules that prohibit flexibility from being considered by the controller outside of contractually allowable circumstances.

Constraints are enforced sequentially. First comes goal tracking, the actual shaping of demand to meet power schedules. Second is the soft color priority, which ensures that in those transient situations when goal tracking causes some devices to be in the wrong state, it is eventually corrected. Cycling is last, because it is defined only over long periods of time and thus is the least time critical to satisfy. A controller respecting the aforementioned constraints is described in Ref. [8].

Numerical Experiment

We have implemented and tested the proposed demand response approach in the ColorPower software platform [8]. Simulations are executed with the following parameters: 10 trials per condition for 10,000 controllable devices; each device consumes 1 kW of power (for a total of 10 MW of demand); devices are 20% green (low priority), 50% yellow (medium priority) and 30% red (high priority); the measurement error is ε = 0.1% (0.001); the rounds are 10 seconds long; and all the Refractory time variables are 40 rounds. Error is measured as the ratio of the difference between a state and the optimal state to the total power. The results of the simulation test are shown in Figure 4. When peak control is desired, the aggregate demand remains below the quota, while individual loads are subjected stochastically to brief curtailments. Post-event rush-in, a potentially severe problem for both traditional demand response and price signal-based control systems, is also managed gracefully due to the specific design of the modified Markov model of Figure 2.
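The simulation parameters and the color priority rule can be made concrete with a short Python sketch. It is only an illustration under stated assumptions and not the ColorPower controller: the names `demand_by_color` and `enabled_by_color`, the numbering green = 1, yellow = 2, red = 3, and the 7 MW quota are hypothetical, and the sketch ignores the stochastic control, the Refractory states and the measurement error.

```python
# Illustrative only: the deterministic target split implied by color priority.
DEVICES = 10_000                         # controllable devices (from the experiment)
KW_PER_DEVICE = 1.0                      # each device draws 1 kW
SHARE_PERCENT = {1: 20, 2: 50, 3: 30}    # assumed numbering: 1=green, 2=yellow, 3=red


def demand_by_color(devices=DEVICES, kw=KW_PER_DEVICE, shares=SHARE_PERCENT):
    """Total demand (kW) per color."""
    return {color: devices * kw * pct / 100 for color, pct in shares.items()}


def enabled_by_color(goal_kw, demand):
    """Enable demand from the highest color downward until the goal is reached."""
    enabled = {}
    remaining = max(0.0, goal_kw)
    for color in sorted(demand, reverse=True):
        enabled[color] = min(demand[color], remaining)
        remaining -= enabled[color]
    return enabled


demand = demand_by_color()                    # 2 MW green, 5 MW yellow, 3 MW red
target = enabled_by_color(7_000.0, demand)    # hypothetical 7 MW quota
print(target)
```

Under these assumptions, a 7 MW quota keeps all red demand enabled, enables 4 MW of the 5 MW of yellow demand, and disables all green demand; which individual yellow devices are enabled is left to the randomized, fair cycling described in the text.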

Figure 4: Simulation results with 10,000 independently fluctuating power loads. Demand is shown as a stacked graph, with enabled demand at the bottom in dark tones, disabled demand at the top in light tones, and Refractory demand cross-hatched. The goal is the dashed line, which coincides with the total enabled demand for the experiment. The plot illustrates a peak shaving case where a power quota, the demand response target that may be provided from an externally-generated demand forecast, is used as a guide for the demand to follow.

Taken together, these results indicate that the ColorPower approach, when coupled with an appropriate controller, should have the technological capability to flexibly and resiliently shape demand in most practical deployment scenarios.

CONCLUSIONS

We present two applications of distributed computing in power systems. On the one hand, we optimize high-voltage power system operations using a distributed asynchronous algorithm capable of solving stochastic unit commitment in run times comparable to those of a deterministic unit commitment model with reserve requirements, and within operationally acceptable time frames. On the other hand, we control demand response at the distribution level using stochastic distributed control, thereby enabling large-scale demand shaping during real-time operations of power systems. Together, both applications of distributed computing demonstrate the potential for efficiently managing flexible resources in smart grids and for systematically coping with the uncertainty and variability introduced by renewable energy.

ACKNOWLEDGEMENTS

The authors acknowledge the Fair Isaac Corporation (FICO) for providing licenses for Xpress, and the Lawrence Livermore National Laboratory for granting access and computing time at the Sierra cluster. This research was funded by the ENGIE Chair on Energy Economics and Energy Risk Management and by the Université catholique de Louvain through an FSR grant.



NOMENCLATURE

Deterministic and stochastic unit commitment

Sets
T60: Hourly periods, T60 := {1,…, |T60|}
T15: 15-min periods, T15 := {1,…, |T15|}
S: Scenarios, S := {s1,…, sM}
A: Reserve areas
G: Thermal generators
[…]: Buses in area a
L(n, m): Lines between buses n and m, directed from n to m
[…]: Thermal generators at bus or bus set n
GSLOW: Slow generators, GSLOW ⊆ G

Parameters
[…]: Corresponding hour of quarter t
[…]: Probability of scenario s
[…]: Demand at bus n in period t
ξn,t, ξn,s,t: Forecast renewable supply, bus n, scenario s, quarter t
[…]: Secondary and tertiary reserve requirements in area a
ΔT2, ΔT3: Delivery time of secondary and tertiary reserves, 0 < ΔT2 < ΔT3 ≤ 15
F±l: Flow bounds, line l
[…]: Susceptance, line l
n(l), m(l): Departing and arrival buses, line l
P±g: Minimum stable level and maximum run capacity, generator g
R±g: Maximum 15-min ramp down/up, generator g
TLg: Maximum state transition level, generator g
UTg, DTg: Minimum up/down times, generator g
Kg: Hourly no-load cost, generator g
Sg: Startup cost, generator g
Cg(p): Quarterly production cost function, generator g (convex, piecewise linear)

Variables
pg,t, pg,s,t: Production, generator g, scenario s, quarter t
fl,t, fl,s,t: Flow through line l, scenario s, quarter t
θn,t, θn,s,t: Voltage angle, bus n, scenario s, quarter t
[…]: Capacity and ramp up rate reservation for secondary and tertiary reserve provision, generator g, quarter t
ug,τ, ug,s,τ: Commitment, generator g, scenario s, hour τ
vg,τ, vg,s,τ: Startup, generator g, scenario s, hour τ
wg,τ, zg,τ: Nonanticipative commitment and startup, generator g, hour τ
μg,s,τ, νg,s,τ: Dual multipliers of nonanticipativity constraints, generator g, scenario s, hour τ

Asynchronous distributed algorithm for stochastic unit commitment

Sets
QD: Dual queue (ordered set) of scenarios
[…]: Primal queue of pairs: ‹candidate solution, scenario›

Parameters
DP, PP: Number of dual and primal processors
[…]: Stepsize, asynchronous subgradient method
[…]: Stepsize scaling factor, scenario s

Variables
LB, UB: Lower and upper bound on objective of stochastic unit commitment
[…]: Upper bound of primal candidate l on scenario s

Distributed control for demand response

Parameters
TDF,i: Fixed rounds of disabled refractory time for tier i
TDV,i: Maximum random rounds of disabled refractory time for tier i
TEF,i: Fixed rounds of enabled refractory time for tier i
TEV,i: Maximum random rounds of enabled refractory time for tier i
f: Target minimum ratio of flexible to refractory demand
α: Proportion of goal discrepancy corrected each round

Variables
s(t, a): State of demand for agent a at time t
s(t): State of total power demand (watts) at time t
ŝ(t): Estimate of s(t)
|Xi,a|: Power demand (watts) in state X for color i at agent a
|Xi|: Total power demand (watts) in state X for color i
|X̂i|: Estimate of |Xi|
g(t): Goal total enabled demand for time t
c(t, a): Control state for agent a at time t
poff,i,a: Probability of a flexible color i device disabling at agent a
pon,i,a: Probability of a flexible color i device enabling at agent a
Di: Demand for ith color and above

REFERENCES

1. X. Fang, S. Misra, G. Xue and D. Yang, “Smart grid – the new and improved power grid: a survey,” in IEEE Communications Surveys & Tutorials, vol. 14, no. 4, pp. 944–980, Fourth Quarter 2012.
2. V.C. Ramesh, “On distributed computing for on-line power system applications,” International Journal of Electrical Power & Energy Systems, vol. 18, no. 8, pp. 527–533, 1996.
3. D. Falcão, “High performance computing in power system applications,” in Vector and Parallel Processing – VECPAR’96 (J. Palma and J. Dongarra, eds.), vol. 1215 of Lecture Notes in Computer Science, pp. 1–23, Springer Berlin Heidelberg, 1997.
4. M. Shahidehpour and Y. Wang, Communication and Control in Electric Power Systems: Applications of parallel and distributed processing, Wiley-IEEE Press, Piscataway, New Jersey, USA, July 2003.
5. S. Bera, S. Misra and J. J. P. C. Rodrigues, “Cloud computing applications for smart grid: a survey,” in IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 5, pp. 1477–1494, May 2015.
6. I. Aravena and A. Papavasiliou, “Distributed Control for Small Customer Energy Demand Management,” 2015 IEEE Power & Energy Society General Meeting, Denver, CO, 2015, pp. 1–5.
7. V. V. Ranade and J. Beal, “Distributed control for small customer energy demand management,” 2010 Fourth IEEE International Conference on Self-Adaptive and Self-Organizing Systems, Budapest, 2010, pp. 11–20.
8. J. Beal, J. Berliner and K. Hunter, “Fast precise distributed control for energy demand management,” 2012 IEEE Sixth International Conference on Self-Adaptive and Self-Organizing Systems (SASO), Lyon, 2012, pp. 187–192.
9. A. Papalexopoulos, J. Beal and S. Florek, “Precise mass-market energy demand management through stochastic distributed computing,” IEEE Transactions on Smart Grid, vol. 4, no. 4, pp. 2017–2027, Dec. 2013.
10. M. Caramanis, E. Ntakou, W. W. Hogan, A. Chakrabortty and J. Schoene, “Co-optimization of power and reserves in dynamic T&D Power markets with nondispatchable renewable generation and distributed energy resources,” in Proceedings of the IEEE, vol. 104, no. 4, pp. 807–836, April 2016.
11. APX Group, Belpex, Cegedel Net, EEX, ELIA Group, EnBw, E-On Netz, Powernext, RTE, RWE, and TenneT, “A report for the regulators of the Central West European (CWE) region on the final design of the market coupling solution in the region, by the CWE MC Project,” January 2010.
12. 50Hertz Transmission GmbH, Amprion GmbH, Elia System Operator NV, TenneT TSO B. V., TenneT TSO GmbH, and TransnetBW GmbH, “Potential cross-border balancing cooperation between the Belgian, Dutch and German electricity Transmission System Operators,” October 2014.
13. PJM Interconnection LLC, “PJM Manual 11: Energy & Ancillary Services Market Operations,” Revision 86, February 1, 2017.
14. Midcontinent ISO, “BPM 002 Energy and Operating Reserve Markets Business Practice Manual,” 15 March 2016.
15. S. Takriti, J. R. Birge, and E. Long, “A stochastic model for the unit commitment problem,” IEEE Transactions on Power Systems, vol. 11, no. 3, pp. 1497–1508, Aug 1996.
16. P. Carpentier, G. Cohen, J.-C. Culioli, and A. Renaud, “Stochastic optimization of unit commitment: a new decomposition framework,” IEEE Transactions on Power Systems, vol. 11, pp. 1067–1073, May 1996.
17. M. Tahanan, W. van Ackooij, A. Frangioni, and F. Lacalandra, “Large-scale unit commitment under uncertainty,” 4OR, vol. 13, no. 2, pp. 115–171, 2015.
18. J.P. Deane, G. Drayton, B.P. Ó Gallachóir, “The impact of sub-hourly modelling in power systems with significant levels of renewable generation,” Applied Energy, vol. 113, pp. 152–158, January 2014.
19. A. Frangioni, C. Gentile and F. Lacalandra, “Tighter approximated MILP formulations for unit commitment problems,” in IEEE Transactions on Power Systems, vol. 24, no. 1, pp. 105–113, Feb. 2009.
20. D. Rajan and S. Takriti, “Minimum up/down polytopes of the unit commitment problem with start-up costs,” IBM Research Report RC23628, Thomas J. Watson Research Center, June 2005.
21. A. Papavasiliou, S. S. Oren and B. Rountree, “Applying high performance computing to transmission-constrained stochastic unit commitment for renewable energy integration,” in IEEE Transactions on Power Systems, vol. 30, no. 3, pp. 1109–1120, May 2015.
22. K. Kim and V.M. Zavala, “Algorithmic innovations and software for the dual decomposition method applied to stochastic mixed-integer programs,” Optimization Online, 2015.
23. K. Cheung, D. Gade, C. Silva-Monroy, S.M. Ryan, J.P. Watson, R.J.B. Wets, and D.L. Woodruff, “Toward scalable stochastic unit commitment. Part 2: solver configuration and performance assessment,” Energy Systems, vol. 6, no. 3, pp. 417–438, 2015.
24. R.T. Rockafellar and R.J.-B. Wets, “Scenarios and policy aggregation in optimization under uncertainty,” Mathematics of Operations Research, vol. 16, no. 1, pp. 119–147, 1991.
25. A. Nedić, D. Bertsekas, and V. Borkar, “Distributed asynchronous incremental subgradient methods,” in Inherently Parallel Algorithms in Feasibility and Optimization and Their Applications (Y. Butnariu, S. Reich and Y. Censor eds.), vol. 8 of Studies in Computational Mathematics, pp. 381–407, Amsterdam: Elsevier, 2001.
26. Y. Ermoliev, “Stochastic quasigradient methods and their application to system optimization,” Stochastics, vol. 9, no. 1–2, pp. 1–36, 1983.
27. K. Kiwiel, “Convergence of approximate and incremental subgradient methods for convex optimization,” SIAM Journal on Optimization, vol. 14, no. 3, pp. 807–840, 2004.
28. S. Ahmed, “A scenario decomposition algorithm for 0–1 stochastic programs,” Operations Research Letters, vol. 41, no. 6, pp. 565–569, November 2013.
29. Y. Colombani and S. Heipcke, “Multiple models and parallel solving with Mosel,” February 2014. Available at: docs/DOC-1141.
30. N. Hutcheon and J. W. Bialek, “Updated and validated power flow model of the main continental European transmission network,” 2013 IEEE Grenoble Conference, Grenoble, 2013, pp. 1–5. doi: 10.1109/PTC.2013.6652178
31. I. Aravena and A. Papavasiliou, “Renewable Energy Integration in Zonal Markets,” in IEEE Transactions on Power Systems, vol. 32, no. 2, pp. 1334–1349, March 2017. doi: 10.1109/TPWRS.2016.2585222
32. F. C. Schweppe, R. D. Tabors, J. L. Kirtley, H. R. Outhred, F. H. Pickel and A. J. Cox, “Homeostatic utility control,” IEEE Transactions on Power Apparatus and Systems, vol. PAS-99, no. 3, pp. 1151–1163, May 1980.
33. S. S. Oren, “Product Differentiation in Service Industries,” working paper presented at the First Annual Conference on Pricing, New York, NY, December 1987.



ASSIGNING REAL-TIME TASKS IN ENVIRONMENTALLY POWERED DISTRIBUTED SYSTEMS

Jian Lin1, Albert M. K. Cheng2

1 Department of Management Information Systems, University of Houston, Clear Lake, Houston, USA
2 Department of Computer Science, University of Houston, Houston, USA

ABSTRACT

Harvesting energy for execution from the environment (e.g., solar or wind energy) has recently emerged as a feasible solution for low-cost and low-power distributed systems. When the real-time responsiveness of a given application has to be guaranteed, the rate at which energy is obtained inevitably affects the task scheduling. This paper extends our previous work in [1] [2] to explore the real-time task assignment problem on an energy-harvesting distributed system. A solution using Ant Colony Optimization (ACO) and several significant improvements are presented. Simulations compare the performance of the approaches and demonstrate the solutions' effectiveness and efficiency.

Keywords: Distributed Systems, Energy Harvesting, Real-Time Scheduling, Task Assignment

Citation: Lin, J. and Cheng, A. (2014), “Assigning Real-Time Tasks in Environmentally Powered Distributed Systems”. Circuits and Systems, 5, 98-113. doi: 10.4236/cs.2014.54012.
Copyright: © 2014 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://

INTRODUCTION

A distributed system consists of a collection of sub-systems, connected by a network, that perform computation tasks. Today, low energy consumption, or green computing, has become a key design requirement for many systems. Recently, a technology called energy harvesting, also known as energy scavenging, has received extensive attention. Energy harvesting is a process in which a system draws part or all of the energy for its operation from ambient energy sources, such as solar, thermal, and wind energy. Because the energy is absorbed from the system's ambient environment, it offers a low-cost and clean way to power computer systems.

A real-time system guarantees the completion of critical tasks for responsiveness and dependability: their execution must finish before a deadline. An important category of real-time tasks is the periodic task, in which a series of task instances arrives continuously at a certain rate. When such tasks are operated with energy harvesting, satisfying the timing requirement becomes a concern. Running computing tasks on a system consumes energy. To execute repeating, periodic tasks, a sustainable system must obtain energy no more slowly than it consumes it. Otherwise, as the scenario in Figure 1 shows, a system with two periodic tasks that repeat every 8 time units will ultimately crash, or its tasks will miss their deadlines, because of an energy shortage at run time.

In this work, we study the scheduling of a set of real-time tasks on a distributed system powered by energy harvesting. We consider static assignment, in which each task is assigned to a node before it executes. A feasible assignment cannot violate the tasks' deadline requirements, and the energy obtained from each node's environment must be sufficient to run the tasks assigned to it.

The rest of the paper is organized as follows. In the next section, the related work is discussed. Then, we introduce the system model and problem definition in Section III. Section IV studies the

Assigning Real-Time Tasks in Environmentally Powered Distributed ...


solutions using Ant Colony Optimization (ACO). Also, we will show how to improve the ACO approach to adapt to an unstable environment. The simulation results are shown in Section V, and we conclude the paper in the last section.

RELATED WORKS

Scheduling computer tasks on a distributed system is a classical problem in computer science research. In the literature, many results have been obtained without considering energy. In [3], a genetic-algorithm-based approach for solving the mapping and concurrent scheduling problem is proposed. In [4] [5], the authors study the assignment problem for a set of real-time periodic tasks. The work in [6] gives a good survey and a comparison study of eleven heuristics for mapping a class of independent tasks. Although these solutions assign and schedule sets of computer tasks, they do not account for energy.

Some research works schedule real-time tasks while also considering their energy consumption. For example, in [7]-[9], a technique called Dynamic Voltage Scaling (DVS) is used to minimize the energy consumption. By adjusting the voltage of a CPU, energy can be saved at the cost of slowing down the execution of a computing task. A common assumption in these works is that sufficient energy is available at any time. Unfortunately, this assumption does not always hold when energy harvesting is used.

Recently, energy harvesting has been studied in a variety of research works. In [10] [11], two prototypes are developed for systems that operate perpetually by scavenging solar energy. Lin et al. and Voigt et al. study clustering decisions within a network in which each node knows its currently generated power [12]. They show that the network lifetime can be optimized by appropriately shifting the workload. Energy harvesting has also been discussed for real-time systems. A. Allavena et al. propose an off-line algorithm in [13] for a set of frame-based real-time tasks, assuming a fixed battery recharge rate. Later, Moser et al. develop an on-line algorithm called lazy to avoid deadline violations [14]. Liu et al. improve the results by using DVS to reduce the deadline miss rate [15]. More recent work on energy harvesting in [16] shows that the Earliest Deadline First (EDF) scheduling algorithm has a zero competitive factor but is nevertheless optimal for on-line non-idling settings. None of the above works solves the scheduling problem on distributed real-time systems using energy harvesting, for which our solutions will be presented in the following sections.

PRELIMINARIES

In this section, we describe the system model as well as the task model and power model. We also define the problem statement and discuss its complexity. Lastly, we introduce certain preliminary results for our problem.

Figure 1. Consuming energy faster than energy harvested.

System, Task and Power Models

The frame-based task model used in [13] is adopted in this work. In this model, n real-time tasks must execute in a frame, that is, an interval of length L, and the frame is repeated. A frame-based real-time task set is a special type of periodic system in which every task shares a common period equal to L. In each frame, all tasks are released at the frame's beginning, and each must finish its execution, specified by its computation time c, by the frame's end. For example, the task set shown in Figure 1 has two tasks and the frame's length is 8 time units. Since all tasks start at the same time and share a common deadline in each frame, the order in which they execute within the frame does not matter, which simplifies the scheduling decision. A task consumes energy at an estimated rate p while it executes. No energy consumption is assumed for a task that is not executing.

The base system providing the computing power and energy is distributed and has m nodes. Each node is a complete system capable of executing real-time computer jobs, with an energy-harvesting device to generate energy. The m nodes are assumed to be at different locations, and each has an estimated rate rj (j=1,…,m) of obtaining energy from its ambiance to be fed into the energy storage. We consider rj a constant, but we relax this limitation in our later discussion. Each node has a storage device to store the energy. If the storage is full, further harvested energy is wasted and the excess energy is discarded.

The system can be homogeneous or heterogeneous. However, since the former is just a special case of the latter, we consider solving the problem on heterogeneous systems. On a heterogeneous system, the computing speed and the energy needed to run a task may differ from node to node. The parameters used in this paper are summarized below:

m: the number of nodes.
n: the number of tasks in the frame-based application.
L: the frame's length, which equals the period and relative deadline of every real-time task.
rj: the (recharge) rate of harvesting energy on node j, j=1,…,m.
cij: the computation time required to finish task i on node j, i = 1,…, n, j=1,…,m.
pij: the overall power consumption rate for executing task i on node j, i = 1,…, n, j=1,…,m.
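The parameter list above maps naturally onto a small data structure. The following Python sketch is one possible encoding and not part of the original paper: the class name `EHAInstance`, its methods, and the example values are all hypothetical, chosen only to show how m, n, L, rj, cij and pij fit together.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EHAInstance:
    """Hypothetical container for the parameters m, n, L, r_j, c_ij, p_ij."""
    L: float                 # frame length (= period = relative deadline)
    r: List[float]           # r[j]: recharge rate on node j
    c: List[List[float]]     # c[i][j]: computation time of task i on node j
    p: List[List[float]]     # p[i][j]: power consumption rate of task i on node j

    @property
    def n(self) -> int:
        """Number of tasks."""
        return len(self.c)

    @property
    def m(self) -> int:
        """Number of nodes."""
        return len(self.r)

    def validate(self) -> None:
        """Check that the matrices are n x m and the values are non-negative."""
        if self.L <= 0:
            raise ValueError("frame length must be positive")
        if len(self.p) != self.n:
            raise ValueError("c and p must have one row per task")
        for row in self.c + self.p:
            if len(row) != self.m or any(v < 0 for v in row):
                raise ValueError("each row needs m non-negative entries")

    def energy(self, i: int, j: int) -> float:
        """Energy consumed per frame by task i when it runs on node j."""
        return self.p[i][j] * self.c[i][j]


# Made-up heterogeneous example: 2 tasks, 2 nodes, frame length 8.
example = EHAInstance(
    L=8.0,
    r=[2.0, 1.0],
    c=[[3.0, 4.0], [2.0, 2.5]],
    p=[[4.0, 3.0], [1.0, 1.5]],
)
example.validate()
print(example.n, example.m, example.energy(0, 0))
```

Storing cij and pij as n x m matrices keeps the heterogeneous case general; a homogeneous system is simply the special case in which every row is constant.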

Problem Definition and Complexity

In statically partitioning the real-time tasks, a control system assigns the tasks to the nodes, and a task stays at its node for its lifetime after being assigned. In our later discussion, we also consider cases in which tasks can be reassigned when the environment changes. Our goal is to determine the feasibility of an assignment on a system using energy harvesting. The following definition formally defines the problem.

Definition 1: Energy Harvesting Assignment (EHA) problem. Given a set of frame-based real-time tasks and an energy-harvesting distributed system specified by the parameters above, find a feasible assignment for the tasks such that:

a. Every task meets the deadline.
b. The energy remaining in the storage of node j at the end of each frame interval is at least Ejinit, where Ejinit is the initial energy level in the storage on node j.

Constraint a specifies the real-time requirement for completing the tasks. Constraint b guarantees that the energy harvested within each frame interval is not less than the energy consumed by running the tasks. When Constraint b is satisfied, the system is powered continuously and sufficiently for executing the repeating, frame-based tasks.

Although all tasks in the application have the same starting time and deadline, the EHA problem is still NP-Hard, even for a homogeneous system. For completeness, we give the proof of the following theorem for the homogeneous version of the problem. The result for the heterogeneous version follows directly because it is a generalization of the homogeneous version.

Theorem 1: The EHA problem is NP-Hard.

Proof: We prove this theorem by a reduction from the 3-PARTITION problem, which is known to be NP-Hard [17], to the EHA problem.

3-PARTITION: Given a finite set A of 3m elements, a bound $B \in Z^+$, and a size $s(a) \in Z^+$ for each $a \in A$ such that each s(a) satisfies $B/4 < s(a) < B/2$ and such that $\sum_{a \in A} s(a) = mB$. Question: Can A be partitioned into m disjoint sets S1, S2, …, Sm such that, for $1 \le i \le m$, $\sum_{a \in S_i} s(a) = B$?

Suppose that an algorithm exists that solves the EHA problem in polynomial time, that is, one that finds a feasible assignment whenever the instance is feasible. Given an instance of 3-PARTITION, we construct an instance of the EHA problem. The instance includes a set of 3m frame-based tasks and a system of m nodes. The length of the frame is B. The computation time of task i, ci, equals the size of ai in A. We assume that all nodes have the same recharge rate r and all tasks have the same power consumption rate pi = r, which means that running the tasks consumes energy exactly as fast as it is harvested from the ambiance. Hence, a shortage of energy is not a concern. Also, since $\sum_{i=1}^{3m} c_i = mB$, which equals the total length of the frame over the m nodes, there is no idle interval in a feasible schedule because all CPU time must be used to execute tasks. If there is a solution for 3-PARTITION, we can apply it to our problem directly. Conversely, if a feasible schedule exists for our problem, the length of the schedule in a frame on each node is exactly B because the schedules do not have any idle intervals; this is also a solution of 3-PARTITION. The reduction is carried out in polynomial time. Thus, if the EHA problem could be solved in polynomial time, 3-PARTITION could be solved in polynomial time as well. Since 3-PARTITION is NP-Hard, the EHA problem is NP-Hard.
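The construction used in the proof can be written down directly. The Python sketch below builds the homogeneous EHA instance from a 3-PARTITION instance and maps a task-to-node assignment back to a partition; the function names, the dictionary layout and the sample numbers are hypothetical and serve only to illustrate the reduction.

```python
def eha_from_3partition(sizes, B, r=1.0):
    """Build the homogeneous EHA instance used in the proof (illustrative)."""
    if len(sizes) % 3 != 0:
        raise ValueError("3-PARTITION needs 3m elements")
    m = len(sizes) // 3
    if sum(sizes) != m * B:
        raise ValueError("sizes must sum to m * B")
    if any(not (4 * s > B and 2 * s < B) for s in sizes):
        raise ValueError("each size must satisfy B/4 < s < B/2")
    return {
        "L": B,                      # frame length equals the bound B
        "m": m,                      # one node per subset
        "c": list(sizes),            # c_i = s(a_i), same on every node
        "p": [r] * len(sizes),       # p_i = r: consumption equals recharge
        "r": [r] * m,
    }


def partition_from_assignment(sizes, B, assignment, m):
    """Return the subsets S_1..S_m if every node's load is exactly B, else None."""
    groups = [[] for _ in range(m)]
    for i, node in enumerate(assignment):
        groups[node].append(sizes[i])
    return groups if all(sum(g) == B for g in groups) else None


# Made-up instance: six elements, m = 2, B = 33.
sizes = [10, 11, 12, 10, 11, 12]
inst = eha_from_3partition(sizes, B=33)
print(partition_from_assignment(sizes, 33, [0, 0, 0, 1, 1, 1], inst["m"]))
```

Because pi = r for every task, no recharging idle time is ever needed, so an assignment is feasible exactly when each node's total computation time is at most B; with the sizes summing to mB, each node must then be filled to exactly B.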

Preliminary Results

In the EHA problem, if constraint b is ignored, the feasibility after all tasks have been assigned on the system can be tested as:

$\sum_{i \text{ assigned to } j} c_{i,j} \le L, \quad j = 1, \ldots, m$ (1)

where ci,j means the computation time of task i assigned on node j. If the energy issue is considered, A. Allavena et al. develop a schedulability condition in [13] for running a set of frame-based tasks on a single-node system that has a rechargeable battery:

$\sum_{i=1}^{n} c_i + t_{idle} \le L$ (2)

In the inequality, tidle is an idle interval without any task execution, used for energy replenishment if running the tasks consumes energy faster than it is obtained from the environment. To calculate tidle, p’i, the system's instantaneous energy replenishment rate while running task i, is calculated. Let p’i = r - pi. p’i > 0 means that running task i consumes energy more slowly than the system obtains energy, and p’i < 0 means the contrary. All tasks are then divided into two groups, namely the recharging tasks $R = \{i \mid p'_i \ge 0\}$ and the dissipating tasks $D = \{i \mid p'_i < 0\}$. Two sums are calculated respectively, as $|R| = \sum_{i \in R} p'_i c_i$ and $|D| = \sum_{i \in D} |p'_i| c_i$. Here, |R| (or |D|) is the amount of energy the system gains (or loses) while running the tasks in the group. The size of the idle interval is $t_{idle} = \max(0, (|D| - |R|)/r)$. We call this idle interval the recharging idle interval. Extending (2) to the EHA problem, we have:

$\sum_{i \text{ assigned to } j} c_{i,j} + t_{idle,j} \le L, \quad j = 1, \ldots, m$ (3)

where ci,j means the computation time of task i assigned on node j and $t_{idle,j}$ is the recharging idle interval of node j. Inequality (3) can be used to replace constraints a and b to determine the feasibility of an assignment instance of the problem. Because the problem is intractable, it is unlikely that a feasible assignment can be found for every instance in polynomial time. In the following sections, we will discuss approaches to solve the problem.
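The feasibility test described here can be sketched in a few lines of Python. The code assumes that the recharging idle interval is the net energy deficit divided by the recharge rate, i.e. tidle = max(0, (|D| - |R|)/r); this is a reading of the description above rather than a formula quoted from [13]. The function names and the sample numbers are hypothetical.

```python
def recharging_idle(tasks, r):
    """Idle time needed on one node; tasks is a list of (c, p) pairs, r > 0."""
    gained = sum((r - p) * c for c, p in tasks if p <= r)   # |R|
    lost = sum((p - r) * c for c, p in tasks if p > r)      # |D|
    return max(0.0, (lost - gained) / r)


def ec_length(tasks, r):
    """Total computation time plus recharging idle time on one node."""
    return sum(c for c, _ in tasks) + recharging_idle(tasks, r)


def is_feasible(assignment, c, p, r, L):
    """Check the per-node condition for an assignment (assignment[i] = node of task i)."""
    per_node = [[] for _ in r]
    for i, j in enumerate(assignment):
        per_node[j].append((c[i][j], p[i][j]))
    return all(ec_length(per_node[j], r[j]) <= L for j in range(len(r)))


# Made-up single-node example: r = 2, L = 8, tasks given as (c, p).
node_tasks = [(3.0, 4.0), (2.0, 1.0)]
print(recharging_idle(node_tasks, 2.0))   # deficit 6 - 2 = 4, recharged at rate 2
print(ec_length(node_tasks, 2.0))
```

In this example the node loses 6 units of energy on the first task and gains 2 on the second, so it needs 2 idle time units; the resulting length of 7 fits in the frame of 8, which agrees with the energy balance (14 units consumed against 16 harvested per frame). The sketch ignores storage capacity and the order in which energy becomes available within a frame.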

ACO SOLUTION

The ACO algorithm is a meta-heuristic technique for optimization problems. It has been successfully applied to some NP-Hard combinatorial optimization problems. In this section, we apply and improve ACO for our EHA problem. Before we describe our approach, we briefly introduce the ACO meta-heuristic. Readers who are familiar with ACO can skip the introduction in subsection A.

Basic ACO Structure

The ACO algorithm, originally introduced by Dorigo et al., is a population-based approach which has been successfully applied to solve NP-Hard combinatorial optimization problems [18]. The algorithm was inspired by ethological studies of the behavior of ants, which can find the optimal path between their colony and a food source without sophisticated vision. The ants look for the shortest path through an indirect form of communication called stigmergy, which relies on a chemical substance, or pheromone, that they leave on the paths. The higher the amount of pheromone on a path, the better the path is hinted to be, and the higher the chance that an ant will choose it. As an ant traverses a path, it reinforces that path with its own pheromone. Finally, the shortest path accumulates the highest amount of pheromone and in turn attracts most ants. The ACO algorithm imitates this mechanism through probabilistic solution construction.

    Initialization (parameters, pheromone trails)
    while true do
        Construct solutions;
        Local search;
        Update pheromone trails;
        Terminate if some condition is reached;
    end

Algorithm 1: ACO Algorithm’s Structure

In general, the structure of ACO algorithms can be outlined as in Algorithm 1. The basic ingredient of any ACO algorithm is a probabilistic construction of partial solutions. A constructive heuristic assembles solutions as sequences of elements from the finite set of solution components. The underlying belief is that a good solution is composed of a sequence of good components. Solution construction finds these good components and adds them to the solution one step at a time. The process of constructing solutions can be regarded as a walk (or a path) of artificial ants on the so-called construction graph. After a solution is constructed, it is evaluated by some metric. If the solution is good compared with other solutions, a designated amount of pheromone is added to each of its components, which makes those components slightly more likely to be selected the next time a solution is constructed. Components of good solutions accumulate selection probability over the iterations, and the selections finally converge to these good components. This process is repeated until some stopping condition is met.

ACO for EHA Problem In order to apply the ACO to the EHA, we see the EHA problem as an equivalent optimization problem. The following definitions define the metric used in the optimization problem. Definition 2: Energy and Computation Time Length (EC-Length), the EC-Length of a node j is defined as the total computation time plus the tidle for the tasks assigned on that node. Definition 3: Energy and Computation Time Makespan (EC-Makespan), Given an assignment, the EC-Makespan denotes the maximum value of the EC-Length upon all nodes. The objective in our ACO system is to minimize the EC-Makespan. The task set is schedulable only if it can find an assignment in which the minimum EC-Makespan is not larger than the frame’s length L. Initialization: At the beginning, the values of the pheromone are all initialized to a constant ph > 0. In [19] , a variation of the original ACO algorithm called Max-Min Ant System (MMAS), was proposed to improve the original ACOs performance. The intuition behind the MMAS is that it addresses the premature convergence problem by providing bounds on the pheromone trails to encourage broader exploration of the search space. We dynamically set the two bounds in our system according to current best solution as in [19] . Construct solutions: The solution component treated in our approach is the (task, node) pair. A chessboard can be used as the construction graph. Table 1 gives an example of 4 tasks and 3 nodes. In that example, ants traverse the chessboard by satisfying: One and only one cell is visited in each row; Constraints a and b must be respected. For each ant at a step, it first randomly selects a task, and then selects a pair of the (taski, nodej) to be added into the solution according to the probabilistic model (4):



$$p_{ij} = \frac{\tau_{ij}^{\alpha}\,\eta_{ij}^{\beta}}{\sum_{l \in N_i} \tau_{il}^{\alpha}\,\eta_{il}^{\beta}}, \qquad j \in N_i \tag{4}$$

where N_i is the set of possible next moves of the ant when assigning task i. In (4), τ denotes the amount of pheromone, which reflects the quality of assigning task i to node j in past solutions. The local heuristic η is defined as 1/(current EC-Makespan), where the current EC-Makespan is the one obtained after task i is assigned to node j, as evaluated by a heuristic. This local heuristic gives preference to the move that is likely to yield the smallest EC-Makespan, based on the partial solution completed so far. Using both τ and η avoids premature convergence either to the assignments of completed solutions or to selections that merely look good to the local heuristic. The parameters α and β control the relative importance of τ and η. We assign 0.5 to each of them to make τ and η equally important.

After all ants finish constructing their solutions, we check the final EC-Makespan. If the final EC-Makespan is not larger than L, the solution is identified as feasible; otherwise it is infeasible.

Table 1. An example of the chessboard
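The selection step described for rule (4) can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not the authors' implementation: the names `selection_probabilities`, `choose_node`, `task_cost`, `loads`, and `tau_row` are hypothetical, each node's EC-Length is modeled as a single number, and the heuristic evaluation of the "current EC-Makespan" is reduced to adding an assumed per-node cost to the partial loads. Constraints a and b are assumed to be already reflected in `candidates`.

```python
import random


def ec_makespan(loads):
    """EC-Makespan: the largest EC-Length over all nodes."""
    return max(loads)


def selection_probabilities(task_cost, loads, tau_row, candidates,
                            alpha=0.5, beta=0.5):
    """Return {node: probability} for assigning one task, in the spirit of (4).

    task_cost[j]: assumed EC-Length increase if the task runs on node j
    loads[j]:     EC-Length of node j in the current partial solution
    tau_row[j]:   pheromone of the (task, node j) pair
    candidates:   the set N_i of nodes the task may be assigned to
    """
    weights = {}
    for j in candidates:
        trial = list(loads)
        trial[j] += task_cost[j]
        # Local heuristic: 1 / EC-Makespan after the tentative assignment.
        eta = 1.0 / ec_makespan(trial)
        weights[j] = (tau_row[j] ** alpha) * (eta ** beta)
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


def choose_node(probabilities, rng=random):
    """Draw one node at random according to the given probabilities."""
    nodes = list(probabilities)
    return rng.choices(nodes, weights=[probabilities[j] for j in nodes], k=1)[0]


def is_feasible(loads, frame_length):
    """A complete solution is feasible if its EC-Makespan does not exceed L."""
    return ec_makespan(loads) <= frame_length
```

With equal pheromone on all pairs, the probabilities are driven by η alone, so nodes whose tentative assignment leaves the EC-Makespan smallest are favored. With α = β = 0.5, both factors enter through a square root, which dampens large differences in either one.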

Local Search: To further improve the solutions found by ACO, a local search optimization is performed on the infeasible ones. The local search in our problem tries to balance the load across nodes. We first find the most heavily loaded node, i.e., the node that determines the final EC-Makespan, and try to reduce its load by moving some of its tasks to other, lightly loaded nodes. A move is made only if the new final EC-Makespan does not exceed the old one. To reduce the time cost, at most 2 tasks on the node being balanced are tried during local search. In our experiments, local search greatly improved the performance of our ACO approach.
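A minimal Python sketch of this load-balancing step is given below. It is an illustration under stated assumptions rather than the authors' procedure: the function name `local_search` and the `cost[task][node]` table are hypothetical, the order in which tasks are tried (largest first) and the choice of target node (smallest resulting load) are assumptions the text does not specify, and constraints a and b are not modeled. At least two nodes are assumed.

```python
def local_search(assignment, cost, n_nodes, max_tries=2):
    """Try to lower the EC-Makespan by moving tasks off the heaviest node.

    assignment: dict mapping task -> node
    cost[task][node]: assumed EC-Length contribution of the task on that node
    Returns the (possibly improved) assignment and its EC-Makespan.
    """
    def loads_of(assign):
        loads = [0.0] * n_nodes
        for task, node in assign.items():
            loads[node] += cost[task][node]
        return loads

    best = dict(assignment)
    loads = loads_of(best)
    # The most heavily loaded node determines the final EC-Makespan.
    heavy = max(range(n_nodes), key=lambda j: loads[j])
    # Assumption: try the largest tasks on the heavy node first, at most max_tries.
    tasks = sorted((t for t, j in best.items() if j == heavy),
                   key=lambda t: cost[t][heavy], reverse=True)[:max_tries]
    for task in tasks:
        current = max(loads)
        # Assumption: move to the node that would end up with the smallest load.
        target = min((j for j in range(n_nodes) if j != heavy),
                     key=lambda j: loads[j] + cost[task][j])
        trial = dict(best)
        trial[task] = target
        trial_loads = loads_of(trial)
        # Accept the move only if the EC-Makespan does not get worse.
        if max(trial_loads) <= current:
            best, loads = trial, trial_loads
    return best, max(loads)
```

For example, with three tasks of cost 5, 3, and 2 all placed on the first of two identical nodes, the first move (the cost-5 task) is accepted and brings the EC-Makespan from 10 to 5, while the second move would raise it to 8 and is rejected.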



Pheromone updates: In each iteration, ants deposit pheromone according to the quality of the solutions they find. A straightforward measure of quality is the EC-Makespan, so the amount of pheromone deposited is set to 1/EC-Makespan. A general formula for updating the pheromone at the t-th iteration is given in (5): (5) Where 0< ρ