Networks for Grid Applications: Second International Conference, GridNets 2008, Beijing, China, October 8-10, 2008, Revised Selected Papers [1 ed.] 9783642020797, 3642020798

This book constitutes the thoroughly refereed post-conference proceedings of the Second International Conference on Networks for Grid Applications, GridNets 2008, held in Beijing, China, in October 2008.


Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

Editorial Board
Ozgur Akan, Middle East Technical University, Ankara, Turkey
Paolo Bellavista, University of Bologna, Italy
Jiannong Cao, Hong Kong Polytechnic University, Hong Kong
Falko Dressler, University of Erlangen, Germany
Domenico Ferrari, Università Cattolica Piacenza, Italy
Mario Gerla, UCLA, USA
Hisashi Kobayashi, Princeton University, USA
Sergio Palazzo, University of Catania, Italy
Sartaj Sahni, University of Florida, USA
Xuemin (Sherman) Shen, University of Waterloo, Canada
Mircea Stan, University of Virginia, USA
Jia Xiaohua, City University of Hong Kong, Hong Kong
Albert Zomaya, University of Sydney, Australia
Geoffrey Coulson, Lancaster University, UK


Pascale Vicat-Blanc Primet, Tomohiro Kudoh, Joe Mambretti (Eds.)

Networks for Grid Applications Second International Conference, GridNets 2008 Beijing, China, October 8-10, 2008 Revised Selected Papers


Volume Editors

Pascale Vicat-Blanc Primet
Ecole Normale Supérieure, LIP, UMR CNRS - INRIA - ENS, UCBL No. 5668
69364 Lyon Cedex 07, France
E-mail: [email protected]

Tomohiro Kudoh
Information Technology Research Institute
National Institute of Advanced Industrial Science and Technology
Tsukuba-Central 2, 1-1-1 Umezono, Tsukuba, Ibaraki 305-8568, Japan
E-mail: [email protected]

Joe Mambretti
Northwestern University, International Center for Advanced Internet Research
750 North Lake Shore Drive, Suite 600, Chicago, IL 60611, USA
E-mail: [email protected]

Library of Congress Control Number: Applied for
CR Subject Classification (1998): C.2, D.1.3, H.2.4, H.3.4, G.2.2
ISSN: 1867-8211
ISBN-10: 3-642-02079-8 Springer Berlin Heidelberg New York
ISBN-13: 978-3-642-02079-7 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

springer.com

© Springer-Verlag Berlin Heidelberg 2009
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
SPIN: 12649303 06/3180 543210

Report on GridNets 2008

Beijing, China, October 8–10, 2008
The Second International Conference on Networks for Grid Applications
In cooperation with ACM SIGARCH
Sponsored by ICST
Technically co-sponsored by Create-Net and EU-IST

The GridNets conference series is an annual international meeting which provides a focused and highly interactive forum where researchers and technologists have the opportunity to present and discuss leading research, developments, and future directions in the grid networking area. The objective of this event is to serve both as the premier conference presenting the best grid networking research and as a forum where new concepts can be introduced and explored. After the great success of last year's GridNets in Lyon, France, which was the first "conference" event, we decided to move GridNets to Beijing, China in 2008. We received 37 papers and accepted 19. For this single-track conference, there were 2 invited keynote speakers, 19 reviewed paper presentations, and 4 invited presentations. This program was supplemented by a workshop on Service-Aware Optical Grid Networks and a workshop on Wireless Grids; both workshops took place on the first day of the conference. Next year's event is already being planned, and it will take place in Athens, Greece. We hope to see you there!

Tomohiro Kudoh Pascale Vicat-Blanc Primet Joe Mambretti

Organization

Steering Committee
Imrich Chlamtac (Chair) - CREATE-NET
Pascale Vicat-Blanc Primet - INRIA, ENS-Lyon (France)
Michael Welzl - University of Innsbruck (Austria)

General Co-chairs
Chris Edwards - Lancaster University (UK)
Michael Welzl - University of Innsbruck (Austria)
Junsheng Yu - Beijing University of Posts and Telecommunications (China)

Local Co-chairs
Yuan'an Liu - Beijing University of Posts and Telecommunications (China)
Tongxu Zhang - China Mobile Group Design Institute Co. Ltd. (China)
Shaohua Liu - Beijing University of Posts and Telecommunications (China)

Program Committee Co-chairs
Pascale Vicat-Blanc Primet - INRIA, ENS-Lyon (France)
Joe Mambretti - Northwestern University (USA)
Tomohiro Kudoh - AIST (Japan)

Publicity Co-chairs
Serafim Kotrotsos, for Europe - EXIS IT (Greece)
Xingyao Wu, for Asia - China Mobile Group Design Institute Co. Ltd. (China)
Sumit Naiksatam, for USA - Cisco Systems

Publications Chair
Marcelo Pasin - INRIA, ENS-Lyon (France), email: [email protected]


Exhibits and Sponsorship Co-chairs
Peng Gao - China Mobile Group Design Institute Co. Ltd. (China)
Junsheng Yu - Beijing University of Posts and Telecommunications (China)

Workshops Chair
Ioannis Tomkos - Athens Information Technology (Greece)

Industry Track Chair
Tibor Kovács - ICST

Panels Chair
Chris Edwards - Lancaster University (UK)

Conference Organization Chair
Gergely Nagy - ICST

Finance Chair
Karen Decker - ICST

Webmaster
Yehia Elkhatib - Lancaster University (UK)

Technical Program Co-chairs
Pascale Vicat-Blanc Primet - INRIA, ENS-Lyon (France)
Joe Mambretti - Northwestern University (USA)
Tomohiro Kudoh - AIST (Japan)

Technical Program Members
Micah Beck - University of Tennessee (USA)
Augusto Casaca - INESC-ID (Portugal)
Piero Castoldi - SSSUP (Italy)
Cees De Laat - University of Amsterdam (Holland)
Chris Develder - Ghent University - IBBT (Belgium)
Silvia Figueira - Santa Clara University (USA)
Gabriele Garzoglio - Fermi National Accelerator Laboratory (USA)
Olivier Glück - INRIA, ENS-Lyon (France)
Paola Grosso - University of Amsterdam (Holland)
Yunhong Gu - University of Illinois (USA)
Wei Guo - Shanghai Jiao Tong University (China)
David Hausheer - University of Zurich (Switzerland)
Guili He - China Telecommunication Technology Lab (China)
Doan Hoang - University of Technology, Sydney (Australia)
Tao Huang - Institute of Software, Chinese Academy of Sciences (China)
Yusheng Ji - NII (Japan)
Raj Kettimuthu - Argonne National Laboratory (USA)
Oh-kyoung Kwon - KISTI (Korea)
Dieter Kranzlmueller - GUP-Linz (Austria)
Francis Lee Bu Sung - Nanyang Technological University (Singapore)
Shaohua Liu - Beijing University of Posts and Telecommunications (China)
Tao Liu - China Mobile Group Design Institute Co. Ltd. (China)
Tiejun Ma - University of Oxford (UK)
Olivier Martin - ICT Consulting (Switzerland)
Katsuichi Nakamura - Kyushu Institute of Technology (Japan)
Marcelo Pasin - INRIA, ENS-Lyon (France)
Nicholas Race - Lancaster University (UK)
Depei Qian - Beijing University of Aeronautics and Astronautics (China)
Dimitra Simeonidou - University of Essex (UK)
Burkhard Stiller - University of Zurich (Switzerland)
Zhili Sun - University of Surrey (UK)
Martin Swany - University of Delaware (USA)
Osamu Tatebe - Tsukuba University (Japan)
Dominique Verchere - Alcatel-Lucent (France)
Jun Wei - Institute of Software, Chinese Academy of Sciences (China)
Chan-Hyu Youn - ICU (Korea)
Wolfgang Ziegler - Fraunhofer-Gesellschaft (Germany)

Wireless Grids Workshop Organizing Committee
Frank H.P. Fitzek - Aalborg University (Denmark)
Marcos D. Katz - VTT (Finland)
Geng-Sheng Kuo - BUPT (China)
Jianwei Niu - BUAA (China)
Qi Zhang - DTU (Denmark)

Table of Contents

A High Performance SOAP Engine for Grid Computing
    Ning Wang, Michael Welzl, and Liang Zhang ..... 1

UDTv4: Improvements in Performance and Usability
    Yunhong Gu and Robert Grossman ..... 9

The Measurement and Modeling of a P2P Streaming Video Service
    Peng Gao, Tao Liu, Yanming Chen, Xingyao Wu, Yehia El-khatib, and Christopher Edwards ..... 24

SCE: Grid Environment for Scientific Computing
    Haili Xiao, Hong Wu, and Xuebin Chi ..... 35

QoS Differentiated Adaptive Scheduled Optical Burst Switching for Grid Networks
    Oliver Yu and Huan Xu ..... 43

Principles of Service Oriented Operating Systems
    Lutz Schubert and Alexander Kipp ..... 56

Preliminary Resource Management for Dynamic Parallel Applications in the Grid
    Hao Liu, Amril Nazir, and Søren-Aksel Sørensen ..... 70

Performance Evaluation of a SLA Negotiation Control Protocol for Grid Networks
    Igor Cergol, Vinod Mirchandani, and Dominique Verchere ..... 81

Performance Assessment Architecture for Grid
    Jin Wu and Zhili Sun ..... 89

Implementation and Evaluation of DSMIPv6 for MIPL
    Mingli Wang, Bo Hu, Shanzhi Chen, and Qinxue Sun ..... 98

Grid Management: Data Model Definition for Trouble Ticket Normalization
    Dimitris Zisiadis, Spyros Kopsidas, Matina Tsavli, Leandros Tassiulas, Leonidas Georgiadis, Chrysostomos Tziouvaras, and Fotis Karayannis ..... 105

Extension of Resource Management in SIP
    Franco Callegati and Aldo Campi ..... 113

Economic Model for Consistency Management of Replicas in Data Grids with OptorSim Simulator
    Ghalem Belalem ..... 121

A Video Broadcast Architecture with Server Placement Programming
    Lei He, Xiangjie Ma, Weili Zhang, Yunfei Guo, and Wenbo Liu ..... 130

VXDL: Virtual Resources and Interconnection Networks Description Language
    Guilherme Piegas Koslovski, Pascale Vicat-Blanc Primet, and Andrea Schwertner Charão ..... 138

Adding Node Absence Dynamics to Data Replication Strategies for Unreliable Wireless Grid Environments
    Soomi Yang ..... 155

Hop Optimization and Relay Node Selection in Multi-hop Wireless Ad-Hoc Networks
    Xiaohua (Edward) Li ..... 161

Localization Anomaly Detection in Wireless Sensor Networks for Non-flat Terrains
    Sireesha Krupadanam and Huirong Fu ..... 175

Network Coding Opportunities for Wireless Grids Formed by Mobile Devices
    Karsten Fyhn Nielsen, Tatiana K. Madsen, and Frank H.P. Fitzek ..... 187

Automatic Network Services Aligned with Grid Application Requirements in CARRIOCAS Project (Invited Paper)
    D. Verchere, O. Audouin, B. Berde, A. Chiosi, R. Douville, H. Pouyllau, P. Primet, M. Pasin, S. Soudan, T. Marcot, V. Piperaud, R. Theillaud, D. Hong, D. Barth, C. Cadéré, V. Reinhart, and J. Tomasik ..... 196

Communication Contention Reduction in Joint Scheduling for Optical Grid Computing (Invited Paper)
    Yaohui Jin, Yan Wang, Wei Guo, Weiqiang Sun, and Weisheng Hu ..... 206

Experimental Demonstration of a Self-organized Architecture for Emerging Grid Computing Applications on OBS Testbed
    Lei Liu, Xiaobin Hong, Jian Wu, and Jintong Lin ..... 215

Joint Scheduling of Tasks and Communication in WDM Optical Networks for Supporting Grid Computing
    Xubin Luo and Bin Wang ..... 223

Manycast Service in Optical Burst/Packet Switched (OBS/OPS) Networks (Invited Paper)
    Vinod M. Vokkarane and Balagangadhar G. Bathula ..... 231

OBS/GMPLS Interworking Network with Scalable Resource Discovery for Global Grid Computing (Invited Paper)
    J. Wu, L. Liu, X.B. Hong, and J.T. Lin ..... 243

Providing QoS for Anycasting over Optical Burst Switched Grid Networks
    Balagangadhar G. Bathula and Jaafar M.H. Elmirghani ..... 251

A Paradigm for Reconfigurable Processing on Grid
    Mahmood Ahmadi and Stephan Wong ..... 259

Author Index ..... 263

A High Performance SOAP Engine for Grid Computing

Ning Wang 1, Michael Welzl 2, and Liang Zhang 1

1 Institute of Software, Chinese Academy of Sciences, Beijing, China
[email protected]
2 Institute of Computer Science, University of Innsbruck, Austria
[email protected]

Abstract. Web Service technology still has many defects that make its usage for Grid computing problematic, most notably the low performance of the SOAP engine. In this paper, we develop a novel SOAP engine called SOAPExpress, which adopts two key techniques for improving processing performance: SCTP data transport and dynamic early binding based data mapping. Experimental results show a significant and consistent performance improvement of SOAPExpress over Apache Axis.

Keywords: SOAP, SCTP, Web Service.

1 Introduction

The rapid development of Web Service technology in recent years has attracted much attention in the Grid computing community. The recently proposed Open Grid Service Architecture (OGSA) represents an evolution towards a Grid system architecture based on Web Service concepts and technologies. The new WS-Resource Framework (WSRF) proposed by Globus, IBM and HP provides a set of core Web Service specifications for OGSA. Taken together, and combined with WS-Notification (WSN), these specifications describe how to implement OGSA capabilities using Web Services.

The low performance of the Web Service engine (SOAP engine) is problematic in the Grid computing context. In this paper, we propose two techniques for improving Web Service processing performance and develop a novel SOAP engine called SOAPExpress. The two techniques are using SCTP as a transport protocol, and dynamic early binding based data mapping. We conduct experiments comparing the new SOAP engine with Apache Axis by using the standard WS Test suite (http://java.sun.com/performance/reference/whitepapers/WS_Test-1_0.pdf). The experimental results show that, regardless of the type of Web Service call, SOAPExpress is consistently more efficient than Apache Axis. When handling an echoList Web Service call, SOAPExpress achieves a 56% reduction of the processing time.

After a review of related work, we provide an overview of SOAPExpress in Section 3, with more details about the underlying key techniques following in Section 4. We present a performance evaluation in Section 5 and conclude.

P. Primet et al. (Eds.): GridNets 2008, LNICST 2, pp. 1–8, 2009. © ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2009

2 Related Work

There have been several studies [3, 4, 5] on the performance of SOAP processing. These studies all agree that the XML-based SOAP protocol incurs a substantial performance penalty in comparison with binary protocols. Davis and Parashar conduct an experimental evaluation of the latency of various SOAP implementations, compared with other protocols such as Java RMI and CORBA [4]. They conclude that two factors cause the inefficiency of SOAP: one is the multiple system calls needed to realize one logical message send, and the other is XML parsing and formatting. A similar conclusion is drawn in [5] by comparing SOAP with CORBA. Chiu et al. point out in [3] that the most critical bottleneck in using SOAP for scientific computing is the conversion between floating point numbers and their ASCII representations.

Recently, various mechanisms have been utilized to optimize the deserialization and serialization between XML data and Java data. In [1], rather than re-serializing each message from scratch, a serialized XML message copy is cached in the sender's stub and reused as a template for the next message of the same type. The approach in [8] reuses the matching regions from previously deserialized application objects, and only performs deserialization for a new region that has not been processed before; however, for large SOAP messages, and especially for SOAP messages whose data always changes, the performance improvement of [8] decreases. Java reflection is also adopted by [8] as a means to set and get new values. For large Java objects, especially deeply nested objects, this negatively affects the performance.

The transport protocol is also a factor that can degrade SOAP performance. The traditional HTTP and TCP communication protocols exhibit many defects when used for Web Services, including "Head-Of-Line blocking (HOL)" delay (explained in Section 4.1), the three-way handshake, ordered data delivery, and half-open connections [2]. Some of these problems can be alleviated by using the SCTP protocol [7] instead of TCP. While this benefit was previously shown for similar applications, most notably MPI (see chapter 5 of [6] for a literature overview), to the best of our knowledge, using SCTP within a SOAP engine as presented in this paper is novel.

3 SOAPExpress Overview

As a lightweight Web Service container, SOAPExpress provides an integrated platform for developing, deploying, operating and managing Web Services, and fully reflects the important characteristics of a next generation SOAP engine, including QoS and diverse message exchange patterns. SOAPExpress not only supports the core Web Service standards such as SOAP and WSDL, but also inherits the open and flexible design style of Web Service technology because of its architecture: SOAPExpress can easily support different Web Service standards such as WS-Addressing, WS-Security and WS-ReliableMessaging. It can also be integrated with the major technologies for enterprise applications such as EJB and JMS to establish a more loosely coupled and flexible computing environment. To enable agile development of Web Services, we also provide development tools as plug-ins for the Eclipse platform.

Fig. 1. SOAPExpress architecture

The architecture of SOAPExpress consists of four parts, as shown in Fig. 1:

– Transport protocol adaptor: supports client access to the system through a variety of underlying protocols such as HTTP, TCP and SCTP, and offers the system an abstract SOAP message receiving and sending primitive (a minimal interface sketch follows this list).
– SOAP message processing module: provides an effective and flexible SOAP message processing mechanism and is able to access the data in the SOAP message at three layers, namely byte stream, XML object and Java object.
– Execution controller: with a dynamic pipeline structure, controls the flow of SOAP message processing, such as service identification, message addressing and message exchange pattern management, and supports various QoS modules such as security and reliable messaging.
– Integrated service provider: provides an integrated framework to support different kinds of information sources such as plain Java objects and EJBs, and wraps them into Web Services in a convenient way.
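The transport protocol adaptor's contract boils down to a single protocol-neutral send/receive pair. The Java interface below is our own minimal sketch of that idea; the interface and method names are hypothetical, not SOAPExpress's actual API:

```java
import java.io.IOException;

// Hypothetical sketch of the abstract primitive a transport protocol
// adaptor could expose: the engine sees one send/receive pair, while
// concrete implementations bind it to HTTP, TCP or SCTP.
interface TransportAdaptor {
    byte[] receiveSoapMessage() throws IOException;           // blocking receive of one message
    void sendSoapMessage(byte[] soapEnvelope) throws IOException;
}
```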

4 Key Techniques

In this section, we present the design details of the key techniques applied in SOAPExpress to improve its performance.

4.1 SCTP Transport

At the transport layer, we use the SCTP protocol [7] to speed up the execution of Web Service calls. This is done by exploiting two of its features: out-of-order delivery and multi-streaming. Out-of-order delivery eliminates the HOL delay of TCP: if, for example, packets 1, 2, 3, 4 are sent from A to B, and packet 1 is lost, packets 2, 3 and 4 arrive at the receiver before the retransmitted (and therefore delayed) packet 1. Then, even if the receiving application could already use the data in packets 2, 3 and 4, it has no means to access it, because the TCP semantics (in-order delivery of a consecutive data stream) prevent the protocol from handing over the content of these packets before the arrival of packet 1. In a Grid, these packets could correspond to function calls which, depending on the code, might need to be executed in sequence. If they do, the possibility of experiencing HOL blocking delay is inevitable; but if they don't, the out-of-order delivery feature of SCTP can speed up the transfer in the presence of packet loss.

Directly using the out-of-order delivery mechanism may not always be useful, as this would require each function call to be at most as large as one packet, thereby significantly limiting the number and types of parameters that could be embedded. We therefore used the multi-streaming feature, which bundles independent data streams together and allows out-of-order delivery only for packets from different streams. In our example, packets 1 and 3 could be associated with stream A, and packets 2 and 4 could be associated with stream B. The data of stream B could then be delivered in sequence before the arrival of packet 1, thereby speeding up the transfer.

For our implementation, we used a Java SCTP library by I. Skytte Joergensen (http://i1.dk/JavaSCTP/), which is based on the Linux kernel space implementation called "LKSCTP". Since our goal was to enable the use of SCTP instead of TCP without requiring the programmer to carry out a major code change, we used Java's inherent support for the factory pattern as a simple and efficient way to replace the TCP socket with an SCTP socket in existing source code. All that is needed to automatically make all socket calls use SCTP instead of TCP is to call the methods Socket.setSocketImplFactory and ServerSocket.setSocketFactory for the client and server side, respectively.

In order to avoid bothering the programmer with the need to determine which SCTP stream a function call should be associated with, we automatically assign socket calls to streams in a round-robin fashion. Clearly, it must be up to the programmer to decide whether function calls can be executed in parallel (in which case they are associated with multiple streams) or not. To this end, we also provide two methods called StartChunk() and EndChunk(), which can be used to mark parts of the data that must be delivered consecutively. All write() calls that are executed between StartChunk() and EndChunk() will cause data to be sent via the same stream.
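The stream assignment just described can be pictured with a short sketch. The Java class below is our own illustration of round-robin assignment combined with StartChunk()/EndChunk() pinning, assuming a fixed number of SCTP streams; it is not the actual SOAPExpress code:

```java
// Our illustrative sketch: independent writes rotate over the streams,
// while writes inside a "chunk" are pinned to a single stream so they
// are delivered in order relative to each other.
public class StreamAssigner {
    private final int numStreams;
    private int next = 0;
    private int pinned = -1;   // -1 means "not inside a chunk"

    public StreamAssigner(int numStreams) { this.numStreams = numStreams; }

    public synchronized void startChunk() { pinned = next; }       // corresponds to StartChunk()
    public synchronized void endChunk()   { pinned = -1; advance(); } // corresponds to EndChunk()

    // Called for every write(): returns the SCTP stream id to use.
    public synchronized int streamForNextWrite() {
        if (pinned >= 0) return pinned;   // keep the whole chunk on one stream
        int s = next;
        advance();                        // spread independent calls round-robin
        return s;
    }

    private void advance() { next = (next + 1) % numStreams; }
}
```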

4.2 Dynamic Early Binding Based Data Mapping

The purpose of the data mapping is to build a bridge between platform-independent SOAP messages and platform-dependent data such as Java objects. The indispensable elements of the data mapping include XML data definitions in an XML schema, data definitions in a specific platform, and the mapping rule between them. Before discussing our data mapping solution, let us first explain two pairs of concepts.

– Early binding and late binding: The difference between early binding and late binding lies in when the binding information is obtained and when it is used, as illustrated in Fig. 2. Here, the binding information refers to the mapping information between XML data and Java data. In early binding, all the binding information is retrieved before performing the binding, while in late binding, the binding is performed as soon as enough binding information is available.
– Dynamic binding and static binding: Here, dynamic binding refers to a binding mechanism which can add new XML-Java mapping pairs at run time. In contrast, static binding refers to a mechanism which can only add new mapping pairs at compilation time.

Fig. 2. Early binding and late binding

According to the above explanation, the existing data binding implementations can be classified into two schemes: dynamic late binding and static early binding. Dynamic late binding obtains the binding information by Java reflection at run time, and then uses the binding information to carry out the data binding between XML data and Java data. Dynamic late binding can dynamically add new XML-Java mapping pairs and avoids generating auxiliary code by using the dynamic features of Java; however, this flexibility is achieved by sacrificing efficiency. Representatives of this scheme are Apache Axis and Castor. For example, Castor uses Java reflection to instantiate a new class added to the XML-Java pairs at run time, and initializes it with the values in the XML through method reflection.

Static early binding generates Java template files which record the binding information before running, and then carries out the binding between XML data and Java data at runtime. Static early binding (as, e.g., in XMLBeans) improves the performance by avoiding the frequent use of Java reflection. However, new XML-Java mapping pairs cannot be added at runtime, which reduces the flexibility.

As illustrated in Fig. 3, we use a dynamic early binding scheme. This scheme establishes the mapping rules between the XML schema for some data type and the Java class for the same data type at compilation time. At run time, a Java template, which we call the Data Mapping Template (DMT), is generated from the XML schema, the Java class and their mapping rules by dynamic code generation techniques. The DMT is used to drive the data mapping procedure.

Fig. 3. Dynamic early binding

Dynamic early binding avoids Java reflection, so the performance is distinctly improved. At the same time, the DMT can be generated and managed at run time, which gives dynamic early binding the same flexibility as dynamic late binding. Dynamic early binding thus combines the advantages of static early binding and dynamic late binding.
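The contrast between the two schemes can be sketched in a few lines of Java. The classes below are our own illustration, not SOAPExpress code: the reflective mapper stands for dynamic late binding, and the generated-style template stands for the kind of direct-call code a DMT contains:

```java
import java.lang.reflect.Method;

// Illustrative target type (ours).
class Person {
    private String name;
    public void setName(String n) { name = n; }
    public String getName() { return name; }
}

// Dynamic late binding: the setter is resolved by reflection on every call.
class ReflectiveMapper {
    static void map(Object target, String property, Object value) throws Exception {
        String setter = "set" + Character.toUpperCase(property.charAt(0)) + property.substring(1);
        Method m = target.getClass().getMethod(setter, value.getClass());
        m.invoke(target, value);   // reflective call: flexible but slow
    }
}

// Dynamic early binding: code like this is generated once (e.g., via
// bytecode generation) when the XML-Java pair is registered, then reused.
class PersonDMT {
    static void map(Person target, String xmlValue) {
        target.setName(xmlValue);  // direct call: no reflection on the hot path
    }
}
```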

5 Performance Evaluation

We begin our performance evaluation with a study of the performance improvement from using SCTP with multi-streaming. We executed asynchronous Web Service calls with JAX-WS 2.0 to carry out a simple scalar product calculation, where sequential execution of the individual calls (multiplications) is not necessary. Three PCs were used: a server and a client (identical AMD Athlon 64 X2 Dual-Core 4200 machines with 2.2 GHz), and a Linux router which interconnected them (an HP Evo W6000 2.4 GHz workstation). At the Linux router, we generated random packet loss with NistNet (http://snad.ncsl.nist.gov/nistnet/).

Figure 4 shows the transfer time of these tests with various packet loss ratios. The results were taken as an average over 100 test runs with the same packet loss setting each. As could be expected from tests carried out in a controlled environment, there was no significant divergence between the results of these test runs. For each measurement, we sent 5000 integer values to the Web Service, which sent 2500 results back to the client. Eventually, at the client, the sum was calculated to finally yield the scalar product.

Clearly, if SCTP is used as we intended (with unordered delivery between streams and 1000 streams), it outperforms TCP, in particular when the packet loss ratio gets high. SCTP with one stream and ordered behavior is only included in Fig. 4 as a reference value; its performance is not as good as TCP's because the TCP implementation is probably more efficient (TCP has evolved over many years and, unlike our library, operates at the kernel level). Multiple streams with ordered transmission of packets between streams would theoretically be pointless; surprisingly, the result for this case is better than with only one stream. We believe that this is a peculiarity of the SCTP library that we used.

Fig. 4. Transfer time of TCP and SCTP using the Web Service

We then evaluated the performance of SOAPExpress as a whole, including SCTP and dynamic early binding based data mapping. We chose the WS Test 1.0 suite to measure the time spent on each stage of the SOAP message processing. Several kinds of test cases were carried out, each designed to measure the performance of a different type of Web Service call:

– echoVoid: send/receive a message with an empty body.
– echoStruct: send/receive an array of size 20, with each entry being a complex data type composed of an integer, a floating point number and a string.
– echoList: send/receive a linked list of size 20, with each entry being the same complex data type defined in echoStruct.

The experimental settings were: CPU: Pentium-4 2.40 GHz; Memory: 1 GB; OS: Ubuntu Linux 8.04; JVM: Sun JRE 6; Web Container: Apache Tomcat 6.0. The Web Service client performed each Web Service call 10,000 times, and the workload was 5 calls per second. The benchmark we chose is Apache Axis 1.2. Fig. 5 shows the experimental results. For echoStruct and echoList, the XML payload was about 4 KB. The measurement started from receiving a request and ended with returning the response.

Fig. 5. Performance comparison among different types of Web Service calls

We observed that for echoVoid, the processing time is very close between the two SOAP engines, since echoVoid has no business logic and just returns the SOAP message with an empty body. For echoStruct, the processing time of SOAPExpress is about 46% of Apache Axis's, and for echoList, the proportion drops to about 44%. This is a very sound overall performance improvement of SOAPExpress over Apache Axis.

6 Conclusion

In this paper, we presented the SOAPExpress engine for Grid computing. It uses two key techniques for reducing the Web Service processing time: the SCTP transport protocol and dynamic early binding based data mapping. Experiments were conducted to compare its performance with Apache Axis by using the standard WS Test suite. Our experimental results have shown that, regardless of the type of Web Service call, SOAPExpress is more efficient than Apache Axis.

Acknowledgments We thank Christoph Sereinig for his contributions. This work is partially supported by the FP6-IST-045256 project EC-GIN (http://www.ec-gin.eu).

References

1. Abu-Ghazaleh, N., Lewis, M.J., Govindaraju, M.: Differential serialization for optimized SOAP performance. In: HPDC 2004: Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing, Washington, DC, USA, pp. 55–64. IEEE Computer Society Press, Los Alamitos (2004)
2. Bickhart, R.W.: Transparent TCP-to-SCTP translation shim layer. In: Proceedings of the European BSD Conference (2007)
3. Chiu, K., Govindaraju, M., Bramley, R.: Investigating the limits of SOAP performance for scientific computing. In: HPDC 2002: Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing, Washington, DC, USA, p. 246. IEEE Computer Society Press, Los Alamitos (2002)
4. Davis, D., Parashar, M.P.: Latency performance of SOAP implementations. In: CCGRID 2002: Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid, Washington, DC, USA, p. 407 (2002)
5. Elfwing, R., Paulsson, U., Lundberg, L.: Performance of SOAP in Web Service environment compared to CORBA. In: APSEC 2002: Proceedings of the Ninth Asia-Pacific Software Engineering Conference, Washington, DC, USA, p. 84. IEEE Computer Society Press, Los Alamitos (2002)
6. Sereinig, C.: Speeding up Java web applications and Web Service calls with SCTP. Master's thesis, University of Innsbruck (April 2008), http://www.ec-gin.eu
7. Stewart, R.: Stream Control Transmission Protocol. RFC 4960 (September 2007)
8. Suzumura, T., Takase, T., Tatsubori, M.: Optimizing web services performance by differential deserialization. In: ICWS 2005: Proceedings of the IEEE International Conference on Web Services, Washington, DC, USA, pp. 185–192. IEEE Computer Society Press, Los Alamitos (2005)

UDTv4: Improvements in Performance and Usability

Yunhong Gu and Robert Grossman

National Center for Data Mining, University of Illinois at Chicago

Abstract. This paper presents UDT version 4 (UDTv4), the fourth generation of the UDT high performance data transfer protocol. The focus of the paper is on the new features introduced in version 4 during the past two years to improve the performance and usability of the protocol. UDTv4 introduces a new three-layer protocol architecture (connection-flow-multiplexer) for enhanced congestion control and resource management. The new design allows protocol parameters to be shared by parallel connections and to be reused by future connections. This improves the congestion control and reduces the connection setup time. Meanwhile, UDTv4 also provides better usability by supporting a broader variety of network environments and use scenarios.

1 Introduction

During the last decade there has been a marked boom in Internet applications, enabled by the rapid growth of raw network bandwidth. Examples of new applications include P2P file sharing, streaming multimedia, and grid/cloud computing. These applications vary greatly in traffic and connection characteristics. However, most of them still use TCP for data transfer. This is partly due to the fact that TCP is well established and contributes to the stability of the Internet.

TCP was designed as a general-purpose protocol and was first introduced three decades ago. It is not surprising that certain requirements from new applications cannot be perfectly addressed by TCP. Network researchers have proposed many changes to TCP to address those emerging problems and requirements (SACK, ECN, etc.) [6]. The new techniques are carefully studied and deployed, albeit slowly. For example, TCP's inefficiency problem in high bandwidth-delay product (BDP) networks was observed almost a decade ago, yet only recently were several new high-speed TCP variants deployed (CUBIC on Linux [13] and Compound TCP on Windows Vista [17]). Furthermore, because new TCP algorithms have to be compatible with the TCP standard, improvements to TCP are limited.

New transport protocols, DCCP [12] and SCTP [16], have also been proposed. However, it may take years for these new protocols to be widely deployed and used by applications (consider the example of IPv6). Moreover, both DCCP and SCTP are designed for specific groups of applications. New applications and requirements will continue to emerge, and it is not a scalable solution to design a new transport layer protocol every few years. It is necessary to have a flexible protocol that provides basic functions and allows applications to define their own data processing. This is what UDP was designed for.

P. Primet et al. (Eds.): GridNets 2008, LNICST 2, pp. 9–23, 2009. © ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2009


In fact, UDP has been used in many applications (e.g., Skype), but it is usually customized independently for each application. RTP [15] is a good example, and it is a great success in supporting multimedia applications. However, there are few general-purpose UDP-based protocols that application developers can use directly or customize easily.

UDT, or UDP-based Data Transfer protocol, is an application-level general-purpose transport protocol on top of UDP [8]. UDT addresses a large portion of the requirements of new applications by seamlessly integrating many modern protocol design and implementation techniques at the application level. The protocol was originally designed for transferring large scientific data over high-speed wide area networks, and it has been successful in many research projects. For example, UDT has been used to distribute the 13 TB SDSS astronomy data release to global astronomers [9].

UDT has been an open source project since 2001, and the first production release was made in 2004. While it was originally designed for big scientific data sets, the UDT library has been used in many other situations, either in its stock form or in a modified form. A great deal of user feedback has been received. The new version (UDTv4), released in 2007, introduces significant changes and supports better performance and usability:

– UDTv4 uses a three-layer architecture to enhance congestion control and reduce connection setup time by sharing control parameters among parallel connections and by using historical data.
– UDTv4 introduces new techniques in both protocol design and implementation to support better scalability, hence it can be used in a larger variety of use scenarios.

This paper describes these new features of UDTv4. Section 2 explains the protocol design. Section 3 describes several key implementation techniques. Section 4 presents the evaluation. Section 5 discusses the related work. Section 6 concludes the paper. Throughout the rest of the paper, we use UDT to refer to the most recent version, UDTv4, unless otherwise explicitly stated.

2 Protocol Design

2.1 Protocol Overview

UDT is a connection-oriented, duplex, unicast protocol. There are three logical layers in the design: UDT connection, UDT flow, and UDP multiplexer (Fig. 1). A UDT connection is set up between a pair of UDT sockets as a distinct data transfer entity to applications. It can provide either reliable data streaming services or partially reliable messaging services, but not both for the same socket. A UDT flow is a logical data transfer channel between two UDP addresses (IP and port) with a unique congestion control algorithm. That is, a UDT flow is defined by five elements: source IP, source UDP port, destination IP, destination UDP port, and congestion control algorithm. The UDT flow is transparent to applications.
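To make the five-element flow identity concrete, the following sketch (our own illustration in Java; the UDT implementation itself is C++) shows a flow key whose equality determines whether two connections are multiplexed over the same flow:

```java
import java.net.InetAddress;
import java.util.Objects;

// Our illustrative flow key: connections with equal keys share one UDT flow.
final class FlowKey {
    final InetAddress srcIp, dstIp;
    final int srcPort, dstPort;
    final String congestionControl;   // e.g., the CC algorithm's name

    FlowKey(InetAddress srcIp, int srcPort, InetAddress dstIp, int dstPort, String cc) {
        this.srcIp = srcIp; this.srcPort = srcPort;
        this.dstIp = dstIp; this.dstPort = dstPort;
        this.congestionControl = cc;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof FlowKey)) return false;
        FlowKey k = (FlowKey) o;
        return srcPort == k.srcPort && dstPort == k.dstPort
            && srcIp.equals(k.srcIp) && dstIp.equals(k.dstIp)
            && congestionControl.equals(k.congestionControl);
    }

    @Override public int hashCode() {
        return Objects.hash(srcIp, srcPort, dstIp, dstPort, congestionControl);
    }
}
```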

Fig. 1. UDT Connection, Flow, and UDP Multiplexer

One or more UDT connections are associated with one UDT flow if the connections share the same five elements described above. Every connection must be associated with one and only one flow. In other words, UDT connections sharing the same five elements are multiplexed over a single UDT flow. A UDT flow provides reliability control as it multiplexes individual packets from UDT connections, while UDT connections provide data semantics (streaming or messaging) management. Different types of UDT connections (streaming or messaging) can be associated with the same UDT flow. Congestion control is also applied to the UDT flow, rather than to the connections. Therefore, all connections in one flow share the same congestion control process. Flow control, however, is applied to each connection. Multiple UDT flows can share a single UDP socket/port, and a UDP multiplexer is used to send and dispatch packets for the different UDT flows. The UDP multiplexer is also transparent to applications.

2.2 UDP Multiplexing

Multiple UDT flows can bind to a single UDP port, and each packet is differentiated by the destination (UDT) socket ID carried in the packet header. The UDP multiplexing method helps to traverse firewalls and alleviates the system limitation on the port number space. The number of TCP ports is limited to 65536; in contrast, UDT can support up to 2^32 connections at the same time. UDP multiplexing also helps firewall traversal: by opening one UDP port, a host can open a virtually unlimited number of UDT connections to the outside.

2.3 Flow Management

UDT multiplexes multiple connections into one single UDT flow if the connections share the same attributes of source IP, source UDP port, destination IP, destination UDP port, and congestion control algorithm. This single flow for multiple connections helps to reduce control traffic, but more importantly, it uses a single congestion control instance for all connections sharing the same end points. This removes the unfairness of using parallel flows and, in most situations, improves throughput, because connections in a single flow coordinate with each other rather than compete with each other.

Fig. 2. UDT Flow and Connection

As shown in Fig. 2, the flow maintains all activities required for a regular data transfer connection, whereas the UDT connection is only responsible for the application interface (connection maintenance and data semantics). At the sender side, the UDT flow reads packets from each associated connection in a round-robin manner, assigns each packet a flow sequence number, and sends them out.

2.4 Connection Record Index/Cache

When a new connection is requested, UDT needs to look up whether a flow already exists between the same peers. A connection record index (Fig. 3) is used for this purpose. The index is sorted by the peer IP addresses. Each entry records the information between the local host and the peer address, including but not limited to RTT, path MTU, and estimated bandwidth. Each entry may contain multiple sub-entries for different ports, followed by multiple flows differentiated by congestion control (CC) algorithm.

Fig. 3. Connection Record Index

The connection record index caches the IP information (RTT, MTU, estimated bandwidth, etc.) even if the connection and flow are closed, in which case there is no port associated with the IP entry. This information can be used when a new connection is set up. Its RTT value can be initialized with a previously recorded value; otherwise it would take several ACKs to obtain an accurate value for the RTT.

If IP path MTU discovery is used, the MTU information can also be initialized with a historical value. An index entry without an active flow is removed when the maximum length of the index has been reached; the oldest entry is removed first. Although a cache entry may be removed very quickly on a busy server (e.g., a web server), the client side may contain the same cache and pass the values to the server. For example, a client that frequently visits a web server may keep the link information between the client and the server, while the server may have already removed it.

2.5 Garbage Collection

When a UDT socket is closed (either by the application or because of a broken connection), it is not removed immediately. Instead, it is tagged as having closed status. A garbage collection thread periodically scans the closed sockets and removes a socket when no API is accessing it. Without garbage collection, UDT would need stronger synchronization protection on its APIs, which would increase implementation complexity and add some slight overhead for the additional synchronization mechanism. In addition, because of the delayed removal, a new socket can reuse a closed socket and the related UDP multiplexer when possible, which improves connection setup efficiency.

Garbage collection also checks the buffer usage and decreases the size of the system-allocated buffer if necessary. If, during the last 60 seconds, less than 50% of the buffer was used, the buffer is reduced to half (a minimum size limit of 32 packets is used so that the buffer size cannot be decreased to a meaningless 1 byte).
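The buffer shrink rule just described amounts to only a few lines. The sketch below is our own Java illustration (UDT itself is C++, and the class and parameter names are ours):

```java
// Illustrative sketch of the shrink rule: if less than 50% of the buffer
// was used during the last 60-second window, halve it, but never below
// the 32-packet minimum.
class BufferShrinker {
    static final int MIN_PACKETS = 32;

    static int maybeShrink(int currentSizePkts, int maxUsedPktsLastMinute) {
        if (maxUsedPktsLastMinute < currentSizePkts / 2) {
            return Math.max(currentSizePkts / 2, MIN_PACKETS);
        }
        return currentSizePkts;   // usage too high: keep the current size
    }
}
```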

3 Implementation

UDT is implemented as an open source project and is available for download from SourceForge.net. The UDT library has been used in both research projects and commercial products. So far, 18,000 copies have been downloaded, excluding direct checkouts from CVS and redistribution from other websites. The UDT implementation is available on both POSIX and Windows systems, and it is thoroughly tested on Linux 2.4 and 2.6 and on Windows XP. The code is written in C++, with API wrappers for other languages available. The latest stable version of the UDT library (version 4.2) consists of approximately 11,500 lines of C++ code, including about 4000 semicolons; about 20% of the code is comments.

3.1 Software Architecture

Figure 4 shows the software architecture of the UDT implementation. A global UDT API module dispatches requests from applications to a specific UDT socket. Data transfer for the UDT socket is managed by a UDT flow, while the UDT flow communicates via a UDP multiplexer. One UDP multiplexer can support multiple UDT flows, and one UDT flow can support multiple UDT sockets. Finally, both the buffer management module and the garbage collection module work in global space to support resource management.

Fig. 4. UDT Software Architecture

Figure 5 shows the data flow in a single UDT connection. The UDT flow moves data packets from the socket buffer to its own sending buffer and sends the data out via the UDP multiplexer. Control information is exchanged in both directions of the data flow. At the sender side, the UDP multiplexer receives the control information (ACK, NAK, etc.) from the receiver and dispatches it to the corresponding UDT flow or connection. Loss lists are used at both sides to record the lost packets. Loss lists work at the flow level and only record flow sequence numbers. Flow control is applied to a UDT socket, while congestion control and reliability control are applied to the UDT flow.

Fig. 5. Data Flow over a Single UDT Connection

3.2 UDP Multiplexer and Queue Management

The UDP multiplexer maintains a sending queue and a receiving queue. Each queue manages a set of UDT flows that send or receive packets via the associated UDP port. The sending queue contains the set of UDT flows that have data to send out. If rate-based control is used, the flows are scheduled according to the next packet sending time; if pure window-based control is used, the flows are scheduled in a round-robin scheme. The sending queue checks the system time and, when it is time to send out the first packet, it removes the first flow from the queue and sends out its packet. If there are more packets to be sent for the particular flow, the flow is inserted into the queue again according to the next packet sending time determined by rate/congestion/flow control.

The sending queue uses a heap structure to maintain the flows. With the heap structure, each send or insert action takes at most log2(n) steps, where n is the total number of flows in the queue. The heap structure guarantees that the sender can find the flow instance with the smallest next scheduled packet sending time; it is not necessary to have all the flows sorted by the next scheduled time.

The job of the receiving queue is much simpler. It checks the timing events (retransmission timer, keep-alive, timer-based ACK, etc.) for each flow associated with the UDP multiplexer. Every fixed time interval (0.1 second), the flows are checked in a round-robin manner. However, if a packet arrives for a particular flow, the timers are checked for that flow and the flow is moved to the end of the queue for the next round of checks. The receiving queue uses a doubly linked list to store the flows, and each operation takes O(1) time. The receiving side of the UDP multiplexer also maintains a hash table for the associated UDT connections, so that when a packet arrives, the multiplexer can quickly look up the corresponding connection to process the packet. Note that the flow processing handler can be looked up via the socket instance.
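UDT itself is written in C++; the Java sketch below (classes and method names are ours) merely illustrates the heap-ordered sending queue described above, where the flow with the smallest next scheduled sending time is always on top:

```java
import java.util.PriorityQueue;

// Our illustrative flow: only the fields needed for scheduling are shown.
class Flow {
    long nextSendTimeNanos;
    boolean hasMoreData() { return true; }            // placeholder
    void sendOnePacket() { /* hand one packet to the UDP multiplexer */ }
    long computeNextSendTime(long now) { return now + 1_000_000; } // by rate control
}

class SendQueue {
    // Heap keyed by next scheduled packet sending time: peek is O(1),
    // insert/remove are O(log n), and no full sort is ever needed.
    private final PriorityQueue<Flow> heap =
        new PriorityQueue<>((a, b) -> Long.compare(a.nextSendTimeNanos, b.nextSendTimeNanos));

    void schedule(Flow f) { heap.add(f); }            // O(log n)

    void runOnce() {
        Flow f = heap.peek();
        if (f == null || System.nanoTime() < f.nextSendTimeNanos) return; // not due yet
        heap.poll();                                   // O(log n)
        f.sendOnePacket();
        if (f.hasMoreData()) {                         // re-insert by its next due time
            f.nextSendTimeNanos = f.computeNextSendTime(System.nanoTime());
            heap.add(f);
        }
    }
}
```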

3.3 Connection and Flow Management

In the UDT implementation, a flow is a special connection that contains pointers to all connections within the same flow, including itself. The first connection of the flow is set up by the normal 3-way handshake process. Further connections are set up by a simplified 2-way handshake as they join an existing flow. The first connection automatically becomes the flow and manages all the connections. If the current "flow" connection is closed or leaves (because of an IP address change), another connection will become the flow, and the related flow information will be moved to the new flow from the old one.

The flow maintains a separate sending buffer in addition to the connections' sending buffers. In an ideal world, the flow should read packets from each connection in a round-robin fashion. However, in this way the flow would either need to keep track of the source of each packet or copy the packet into its own buffer, because each ACK or NAK processing step needs to locate the original packet. In the current implementation, the socket sending buffer is organized as a chain of multiple 32-packet blocks. The UDT flow reads one 32-packet block from each connection in round-robin fashion, removes the block from the socket's sending buffer, and links the block into its own (flow) sending buffer. Note that there may be fewer than 32 packets in a block if there is not enough data to be sent for a particular connection.

Flow control is enforced at the socket level. The UDT send call is blocked if either the sender buffer limit or the receiver buffer limit is reached. This guarantees that data in the flow sending buffer is not limited by flow control. By using this strategy, the flow simply applies ACKs and NAKs to its own buffer, and avoids both memory copies between the flow and connections and a data structure to map flow sequence numbers to connection sequence numbers. In the latter case, UDT would also need to check every single packet being acknowledged, because the packets may belong to different connections and may not be contiguous. At the receiver side, all connections have their own receiver buffer for application data reading. However, only the flow maintains a loss list to recover packet losses.

Rendezvous connection setup. In addition to the regular client/server mode, UDT provides a rendezvous connection setup method: both peers can connect to each other at (approximately) the same time, provided that they know the peer's address beforehand (e.g., via a third, known server).

3.4 Performance Considerations

Multi-core processing. The UDT implementation uses multiple threads to exploit the multi-core capability of modern processors. Network bandwidth increases faster than CPU speed, and a single core of today's processors is barely enough to saturate 10 Gb/s. One single UDT connection can use 2 cores (sending and receiving) per data traffic direction on each side. Meanwhile, each UDP multiplexer has its own sending thread and receiving thread. Therefore, users can start more UDT threads by binding UDT sockets to different UDP ports; more UDP multiplexers will then be started, and each multiplexer will start its own packet processing threads.

New select API. UDT provides a new version of the select API, in which the result socket descriptor set is an independent output, rather than overwriting the input directly. The BSD-style select API is inefficient for large numbers of sockets, because the input is modified and applications have to reinitialize the input each time. In addition, UDT provides a way to iterate over the result set; in contrast, with the BSD socket API, applications have to test each socket against the result set.

New sendfile/recvfile API. UDT provides both sendfile and recvfile APIs to save one memory copy by exchanging data between the UDT buffer and the application file directly. These two APIs also simplify application development in certain cases. It is important to mention that file transfer can operate under both streaming mode and messaging mode. However, messaging mode is more efficient in this case, because recvfile does not require contiguous data block receiving, and therefore in messaging mode data blocks can be read into files out of order without the "head of line" blocking problem. This is especially useful when the packet loss rate is high.

Buffer auto-sizing. All UDT connections/flows share the same buffer space, which grows when necessary. The UDT socket buffer size is only an upper limit, and the buffer is not allocated until it has to be. UDT automatically increases the socket buffer size limit to 2*BDP if the default or user-specified buffer size is less than this value. However, if the default or user-specified value is greater than this value, UDT will not decrease the buffer size. The bandwidth value (B in BDP) is estimated by the maximum packet arrival rate at the receiver side. The garbage collection thread may decrease the system buffers when it detects that less than half of the buffers are used.
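As a worked example of the 2*BDP rule (our own numbers, not from the paper): on a path with an estimated bandwidth of 1 Gb/s and an RTT of 100 ms, BDP = 10^9 bit/s x 0.1 s = 10^8 bits = 12.5 MB, so a smaller default buffer limit would automatically be raised to 2 x 12.5 MB = 25 MB.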

4 Evaluation

This section evaluates UDT's scalability, performance, and usability. UDT provides superior usability compared to TCP, and although it runs at the application level, its implementation efficiency is comparable to that of the highly optimized Linux TCP implementation in kernel space. More importantly, UDT effectively addresses many application requirements and fills a gap left by transport layer protocols.

4.1 Performance Characteristics

This section summarizes the performance characteristics of UDT, in particular its scalability.

Packet header size. UDT consumes 24 bytes (16-byte UDT header plus 8-byte UDP header) per data packet. In contrast, TCP uses a 20-byte packet header, SCTP uses a 28-byte packet header, and DCCP uses 12 bytes without reliability.

Control traffic per flow. UDT sends one ACK per 0.01 second when there is data traffic. This can be overridden by a user-defined congestion control algorithm if more ACKs are necessary; however, such user-defined ACKs are lightweight ACKs and consume less bandwidth and CPU [8]. An ACK2 packet is generated occasionally, at a decreasing frequency (at most one ACK2 per second). In contrast, TCP implementations usually send one ACK every one or two segments. In addition, UDT may also send NAKs, message drop requests, or keep-alive packets when necessary, but these packets are much less frequent than ACK and ACK2.

Limit on number of connections. The maximum number of flows and connections supported by UDT is virtually limited only by system resources (2^32).

Multi-threading. UDT starts two threads per UDP port, in addition to the application thread. Users can control the number of data processing threads by using a different number of UDP ports.

Summary of data structures. At the UDP multiplexer level, UDT maintains the sending queue and the receiving queue. The sending queue costs O(log n) time to insert or remove a flow, where n is the total number of flows.


The receiving queue checks the timers of each UDT flow every 0.1 second, but it is self-clocked by the arrival of packets; each check costs O(1) time. Finally, the hash table used by the UDP multiplexer to locate a socket has O(1) lookup time. The UDT loss list is organized around congestion events, and each scan time is proportional to the number of congestion events rather than the number of lost packets [8].

4.2 Implementation Efficiency

UDT's implementation performance has been extensively tuned. This sub-section lists the CPU usage for one or more data flows between two directly connected, identical Linux servers. Each server runs Debian Linux (kernel 2.6.18) on dual AMD Opteron dual-core 3.0GHz processors, with 4 GB of memory and a 10GbE Myrinet NIC. All system parameters are left at their defaults, except that the MTU is set to 9000 bytes. No TCP or UDP offload is enabled.

Figure 6 shows the CPU usage of a single TCP, UDP, and UDT flow (with or without memory copy avoidance). The total CPU capacity is 400%, because there are four cores. Because each flow achieves a different throughput (varying between 5.4Gb/s for TCP and 7.5Gb/s for UDT with memory copy avoidance), the values listed in Figure 6 are CPU usage per Gb/s of throughput. According to Figure 6, UDT with memory copy avoidance costs about as much CPU time as UDP and less than TCP, while UDT without memory copy avoidance costs approximately twice the CPU time of the other three cases. For a single UDT flow without memory copy avoidance, at 7.4Gb/s, the UDT thread and the application thread at the sender side cost 99% and 40%, respectively (per-thread CPU time is not shown in Figure 6); the UDT thread and the application thread at the receiver side cost 90% and 36%, respectively.

[Figure: bar chart of send and recv CPU usage (percentage per Gb/s, 0-20) for TCP, UDP, UDT, and UDT without memory copy avoidance]

Fig. 6. CPU Usage of Single Data Flow


Although memory copy avoidance happens in the application thread, when it is used it also reduces the CPU usage of the UDT sending and receiving threads, because more memory bandwidth is available to the UDT threads and the cache hit ratio is higher. Figure 7 shows the CPU usage (per Gb/s of throughput, as in Figure 6) of multiple parallel TCP and UDT connections. UDT memory copy avoidance is not enabled in these experiments because, with multiple connections, receiver-side memory copy avoidance does not work well (see Section 3.6). The connection concurrency is 10, 100, and 500, respectively, for each group (TCP, UDT with all connections sharing a single flow, and UDT with each connection having its own flow).

[Figure: bar chart of send and recv CPU usage (percentage per Gb/s, 0-30) for TCP, UDT single flow, and UDT multi flow]

Fig. 7. CPU Usage of Concurrent Data Flows

According to Figure 7, the CPU usage of UDT increases slowly as the number of parallel connections increases. The design and implementation scale well with connection concurrency and are comparable to the kernel-space TCP implementation. Furthermore, the second group (all connections sharing one flow) costs slightly less CPU than the third group: with multiple flows, the overhead of UDT control packets increases in proportion to the number of flows, because each flow sends its own control packets.

4.3 Usability

UDT is designed to be a general-purpose and versatile transport protocol. The stock form of UDT can be used for regular data transfer. In addition, the messaging UDT socket can be used in multimedia applications, RPC, file transfer, web services, and so on. UDT is already used in many real-world applications, including data distribution and file transfer (especially of scientific data), P2P applications (both data transfer and system messaging), and remote visualization.


While TCP is mostly used for regular file and data transfer, UDT offers a richer set of data transfer semantics and congestion control algorithms. Appropriate congestion control algorithms can be used in special environments such as wireless networks. UDT does not worsen Internet congestion by allowing users to easily modify the congestion control algorithm: it has always been trivial to obtain an unfair bandwidth share by using parallel TCP or constant-bit-rate UDP. In fact, UDT's connection/flow design improves Internet congestion control by removing the unfairness and traffic oscillation caused by applications that start parallel data connections between the same pair of hosts.

The configurable congestion control feature of UDT can help network researchers rapidly implement and experiment with control algorithms. To demonstrate this ability, six additional TCP control algorithms (Scalable, HighSpeed, BiC, Westwood, Vegas, and FAST) were implemented beyond the three predefined algorithms in the UDT release; the implementations of these control algorithms vary between 11 and 73 lines of code [8].

UDT can also be modified to implement other protocols at the application level; one example is implementing forward error correction (FEC) on top of UDT for low-bandwidth, high-link-error environments. While UDT is not a completely modularized framework like CTP [4], due to performance considerations, it still provides high configurability (congestion control, user-defined packets, user-controllable ACK intervals, etc.). It is also much easier to modify UDT than to modify a kernel-space TCP implementation, and there are fewer limitations on deployment and protocol standardization. Finally, UDT better supports firewall traversal (e.g., NAT punching) through UDP multiplexing and rendezvous connection setup.
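To make this configurability concrete, the following is a minimal sketch of a user-defined congestion controller, assuming the CCC base class, the CCCFactory template, and the UDT_CC socket option shipped with the UDT4 release; the callback set and member names should be checked against the release headers. The controller shown is a deliberately simple fixed-rate pacer, not one of the algorithms named above, and the rate, packet size, and window constant are illustrative.

```cpp
#include <udt.h>
#include <ccc.h>   // CCC base class from the UDT4 source tree (assumed location)

// A trivial user-defined controller: disable window-based control and pace
// packets at a constant rate. Real algorithms would override onACK(),
// onLoss(), and onTimeout() to adjust m_dPktSndPeriod (inter-packet time
// in microseconds) and m_dCWndSize in response to network feedback.
class CConstRateCC : public CCC
{
public:
    virtual void init()
    {
        m_dCWndSize = 83333.0;          // effectively unlimited window (illustrative)
        setRate(100.0);                 // target 100 Mb/s (illustrative)
    }

    virtual void onACK(int32_t) {}               // a reactive algorithm acts here
    virtual void onLoss(const int32_t*, int) {}  // ...and here
    virtual void onTimeout() {}

private:
    void setRate(double mbps)
    {
        // Assuming 1500-byte packets: bits-per-packet / (bits-per-microsecond)
        // gives the sending period in microseconds.
        m_dPktSndPeriod = (1500.0 * 8.0) / mbps;
    }
};

// Installing the controller on a socket (factory template from UDT4):
//   UDTSOCKET u = UDT::socket(AF_INET, SOCK_STREAM, 0);
//   UDT::setsockopt(u, 0, UDT_CC, new CCCFactory<CConstRateCC>,
//                   sizeof(CCCFactory<CConstRateCC>));
```

Because the controller is supplied per socket, different connections in the same process can run different algorithms, which is the flexibility the text above contrasts with CM and DCCP.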

5 Related Work

While transport protocols have been an active research topic in computer networks for decades, there are actually few general-purpose transport protocols running at the application level today. In this sense UDT fills a void left by the transport layer protocols where they cannot perfectly support all applications. Setting aside its application-level advantage, however, UDT can be broadly compared to several other transport protocols. (In fact, the UDT protocol could be implemented directly on top of IP, but the application-level implementation was one of the major objectives in developing this protocol.)

UDT borrows its messaging and partial reliability semantics from SCTP. However, SCTP is specially designed for VoIP and telephony, whereas UDT targets general-purpose data transfer, and UDT unifies both messaging and streaming semantics in one protocol. UDT's connection/flow design can also be compared to the multi-streaming feature of SCTP: SCTP creates an association (analogous to a UDT flow) between two addresses, over which multiple independent streams (analogous to UDT connections) can be set up.


However, in SCTP applications need to explicitly create the association, and the number of streams is fixed at the beginning, while UDT implicitly joins connections into the same UDT flow (applications only create independent connections). Furthermore, SCTP applies flow control at the association level, whereas UDT applies flow control at the connection level.

This layered design (connection/flow in UDT, stream/association in SCTP) can also be found in Structured Stream Transport (SST) [7]. SST creates channels (analogous to UDT flows) between a pair of hosts and starts multiple lightweight streams (analogous to UDT connections) atop the same channel. However, the rationales behind SST and UDT are fundamentally different. In SST, the channel provides an (optionally) secured virtual connection to support multiple independent application streams and to reduce stream setup time; this design particularly targets applications that require multiple data channels. In contrast, a UDT flow automatically aggregates multiple independent connections to reduce control traffic and to provide better congestion control. Both protocols apply congestion control at the lower layer (channel and flow), but the SST channel provides only unreliable packet delivery, so the streams have to perform reliability control independently, whereas a UDT flow provides reliable packet delivery (unless a UDT connection requests a message drop). Beyond this two-layer design, SST and UDT differ significantly in the details of reliability control (ACKs, etc.), congestion control, data transfer semantics, and API semantics.

UDT enforces congestion control at the flow level, which carries traffic from multiple UDT connections. This leads to an objective similar to that of the congestion manager (CM) [2, 3]. UDT's flow/connection design makes it natural to share congestion control among connections between the same address pair. This design is transparent to existing congestion control algorithms, because any algorithm originally designed for a single connection can still work on the UDT flow without modification. In contrast, CM introduces its own congestion control, specially designed for a group of connections; furthermore, CM enforces congestion control at the system level and does not give individual connections the flexibility to use different control algorithms.

UDT allows applications to choose a predefined congestion control algorithm for each connection. A similar approach is taken in DCCP; however, UDT goes further by allowing users to redefine the control event handlers and write their own congestion control algorithms.

Some of the implementation techniques used in UDT are interchangeable with those of kernel-space TCP implementations. Here are several examples. UDT's buffer management is similar to the slab cache in Linux TCP [10]. UDT automatically adjusts the socket buffer size to maximize throughput; Windows Vista provides socket buffer auto-sizing, while SOBAS [5] provides application-level TCP buffer auto-tuning. UDT uses a congestion-event-based loss list that significantly reduces the scan time of the packet loss list; the same problem occurred in the Linux SACK implementation (when a SACK packet arrived, Linux used to scan the complete list of in-flight packets, which could be very large on high-BDP links) and was fixed later. Despite these many similarities in implementation issues, UDT's application-level implementation is largely different from TCP's kernel-space implementation.


UDT cannot directly use kernel-space threads, kernel timers, hardware interrupts, processor binding, and so on. It is more challenging to realize a high performance implementation at the application level.

6 Conclusions

This paper has described the design and implementation of version 4 of the UDT protocol and demonstrated its scalability, performance, and usability through its layered protocol design (UDP multiplexer, UDT flow, and UDT connection), data transfer semantics, configurability, and efficient application-level implementation.

As the end-to-end principle [14] indicates, the kernel space should provide the simplest possible protocol, and applications should handle application-specific operations. From this point of view, the transport layer does provide UDP, while UDT can bridge the gap between the transport layer and applications. This design rationale of UDT does not conflict with the existence of other transport layer protocols, as they provide direct support for large groups of applications with common requirements.

The UDT software is currently of production quality. At the application level, it is much easier to deploy than new TCP variants or new kernel-space protocols (e.g., XCP [11]). It also provides a platform for rapidly prototyping and evaluating new ideas in transport protocols; some of the UDT approaches can be implemented in kernel space if they prove effective in real-world settings. Furthermore, many UDT modifications are expected to be application or domain specific, so they do not need to be compatible with any existing protocols. Freed from the limitations of deployment and compatibility, even more innovative transport protocol technologies can be encouraged and implemented than before, which is another ambitious objective of the UDT project.

References

[1] Allman, M., Paxson, V., Stevens, W.: TCP Congestion Control. RFC 2581 (April 1999)
[2] Andersen, D.G., Bansal, D., Curtis, D., Seshan, S., Balakrishnan, H.: System Support for Bandwidth Management and Content Adaptation in Internet Applications. In: 4th USENIX OSDI Conf., San Diego, California (October 2000)
[3] Balakrishnan, H., Rahul, H., Seshan, S.: An Integrated Congestion Management Architecture for Internet Hosts. In: Proc. ACM SIGCOMM, Cambridge, MA (September 1999)
[4] Bridges, P.G., Hiltunen, M.A., Schlichting, R.D., Wong, G.T.: A Configurable and Extensible Transport Protocol. IEEE/ACM Transactions on Networking 15(6) (December 2007)
[5] Dovrolis, C., Prasad, R., Jain, M.: Socket Buffer Auto-Sizing for High-Performance Data Transfers. Journal of Grid Computing 1(4) (2004)
[6] Duke, M., Braden, R., Eddy, W., Blanton, E.: A Roadmap for Transmission Control Protocol (TCP). RFC 4614, IETF (September 2006)
[7] Ford, B.: Structured Streams: A New Transport Abstraction. In: ACM SIGCOMM 2007, Kyoto, Japan, August 27-31 (2007)
[8] Gu, Y., Grossman, R.L.: UDT: UDP-based Data Transfer for High-Speed Wide Area Networks. Computer Networks 51(7) (May 2007)


[9] Gu, Y., Grossman, R.L., Szalay, A., Thakar, A.: Distributing the Sloan Digital Sky Survey Using UDT and Sector. In: Proceedings of e-Science (2006)
[10] Herbert, T.: Linux TCP/IP Networking for Embedded Systems (Networking), 2nd edn. Charles River Media (November 2006)
[11] Katabi, D., Handley, M., Rohrs, C.: Internet Congestion Control for High Bandwidth-Delay Product Networks. In: ACM SIGCOMM 2002, Pittsburgh, PA, pp. 89–102 (2002)
[12] Kohler, E., Handley, M., Floyd, S.: Designing DCCP: Congestion Control Without Reliability. In: Proceedings of SIGCOMM (September 2006)
[13] Rhee, I., Xu, L.: CUBIC: A New TCP-Friendly High-Speed TCP Variant. In: PFLDnet, Lyon, France (2005)
[14] Saltzer, J.H., Reed, D.P., Clark, D.D.: End-to-End Arguments in System Design. ACM Transactions on Computer Systems 2(4), 277–288 (1984)
[15] Schulzrinne, H., Casner, S., Frederick, R., Jacobson, V.: RTP: A Transport Protocol for Real-Time Applications. RFC 3550 (July 2003)
[16] Stewart, R. (ed.): Stream Control Transmission Protocol. RFC 4960 (September 2007)
[17] Tan, K., Song, J., Zhang, Q., Sridharan, M.A.: Compound TCP: An Approach for High-Speed and Long Distance Networks. In: Proceedings of INFOCOM 2006, pp. 1–12. IEEE Computer Society Press, Los Alamitos (2006)

The Measurement and Modeling of a P2P Streaming Video Service*

Peng Gao1,2, Tao Liu2, Yanming Chen2, Xingyao Wu2, Yehia El-khatib3, and Christopher Edwards3

1 Graduate University of Chinese Academy of Sciences, Beijing, China
2 China Mobile Group Design Institute Co., Ltd, Beijing, China {gaopeng,liutao1,chenyanming,wuxingyao}@cmdi.chinamobile.com
3 Computing Department, InfoLab21, Lancaster University, Lancaster LA1 4WA, United Kingdom {yehia,ce}@comp.lancs.ac.uk

Abstract. Most work on grid technology in the video area has been restricted to aspects of resource scheduling and replica management. The traffic of such a service has many characteristics in common with that of traditional video services; however, the architecture and user behavior in grid networks are quite different from those of the traditional Internet. Considering the potential of grid networks and video sharing services, measuring and analyzing P2P IPTV traffic is important and fundamental work in the field of grid networks. This paper investigates the features of PPLive, the most popular streaming service in China, which is based on P2P technology. By monitoring and analyzing PPLive traffic streams, the characteristics of a P2P streaming service have been studied. The analyses cover bearing protocols, geographical distribution, and the self-similarity properties of the traffic. A streaming service traffic model has been created and verified through simulation. The simulation results indicate that the proposed streaming service traffic model complies well with the real IPTV streaming service. It can also function as a step towards studying video-sharing services on grids.

Keywords: grid, peer to peer, p2p, video streaming, traffic characteristics, traffic modeling, ns2 simulation.

1 Introduction

Along with the rapid development of P2P file sharing and IPTV video services, P2P streaming services have become the main multi-user video sharing application on the Internet. The focus of grid technology in the video area is generally on resource scheduling and replica management [1][2], where the service traffic characteristics are still similar to those of traditional video services. In-depth work has been carried out on monitoring and modeling video traffic [3].

* This work is partially supported by the European Sixth Framework Programme (FP6) under contract STREP FP6-2006 IST-045256.



Therefore, considering the developing trends of grid systems and video sharing, the monitoring and analysis of P2P IPTV traffic are interesting and promising research topics. They are the main focus of this paper.

There have been many research efforts in the area of P2P live streaming. Silverston et al. [4] analyzed and compared the different mechanisms and traffic patterns of four mainstream P2P video applications, focusing on IPTV system aspects such as peer survival time. Hei et al. [5] made a comprehensive investigation of aspects such as P2P live streaming system architecture, user behavior, user distribution, and software realization. The work presented in these two papers is similar in part to the work presented here. However, our objectives and outcomes differ in two ways. First, the aim of our traffic monitoring is to obtain the characteristics of the service traffic, which is analyzed in order to derive models that can be used in simulation. Second, some of our traffic analysis results differ from those of the two aforementioned papers because of PPLive software updates [7].

The rest of this paper is organized as follows: Section 2 describes the monitoring setup, and Section 3 details the results of analyzing the measurement data. In order to verify the accuracy of the analysis, a traffic model is developed and evaluated through simulation in Section 4.

2 Measurement Setup

The measurement was carried out in the intranet of the China Mobile Group Design Institute (CMDI) in Beijing. Fig. 1 shows the measurement environment, where the intranet access bandwidth is 100Mbps and the Internet access bandwidth is 10Mbps.

Fig. 1. PPLive Measuring Environment

The monitoring terminals chose a representative popular channel at a peak time to collect the service data. The measurement lasted about 2 hours, from the start to the end of the service; the resulting data file is 896MB, stored in pcap format. In order to simplify the data analysis process while keeping it detailed, the measurement data is divided into several parts, from which two portions (the service beginning and the service steady stage) are chosen as the emphases of the analysis. The monitoring terminal runs the PPLive software (version 1.8.18 build 1336). The service uses the default configuration: the network type is community broadband, the maximum number of connections per channel is 30, and the maximum number of simultaneous network connections is 20.


Monitoring terminals use Ethereal version 0.99.0 [8] to collect the traffic data generated by the PPLive service. Measurement data processing is done with Gawk [9], the plots are drawn with gnuplot [10], and the mathematical analysis tool EasyFit [11] is used for distribution fitting.
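As an illustration of the kind of post-processing applied to such a capture, the following is a minimal C++/libpcap sketch that tallies UDP datagram sizes from a pcap file into the two bins discussed in Section 3.1. It is an illustrative reconstruction, not the authors' actual Gawk-based tooling, and it assumes plain Ethernet + IPv4 framing.

```cpp
#include <pcap.h>          // libpcap: reading the stored .pcap trace
#include <netinet/in.h>
#include <netinet/ip.h>    // BSD-style struct ip (Linux: may need _BSD_SOURCE)
#include <iostream>

// Count packets in a pcap trace, splitting UDP datagrams into "small"
// (control, roughly 100-200 bytes) and "large" (video, ~1145 bytes) bins.
int main(int argc, char* argv[])
{
    if (argc < 2) { std::cerr << "usage: tally <trace.pcap>\n"; return 1; }

    char errbuf[PCAP_ERRBUF_SIZE];
    pcap_t* pc = pcap_open_offline(argv[1], errbuf);
    if (!pc) { std::cerr << errbuf << "\n"; return 1; }

    long small = 0, large = 0, tcp = 0;
    pcap_pkthdr* hdr;
    const u_char* pkt;
    while (pcap_next_ex(pc, &hdr, &pkt) == 1) {
        if (hdr->caplen < 14 + sizeof(ip)) continue;   // skip truncated frames
        const ip* iph = reinterpret_cast<const ip*>(pkt + 14); // 14-byte Ethernet header
        if (iph->ip_p == IPPROTO_TCP) { ++tcp; continue; }
        if (iph->ip_p != IPPROTO_UDP) continue;
        // Classify by total frame length, matching the Section 3.1 threshold.
        if (hdr->len >= 600) ++large; else ++small;
    }
    pcap_close(pc);

    std::cout << "TCP packets: " << tcp
              << "\nlarge UDP (video): " << large
              << "\nsmall UDP (control): " << small << "\n";
    return 0;
}
```

Run against the 3600-4200s slice of the trace, a tally of this kind yields the per-protocol counts and the two UDP size populations reported in the next section.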

3 Measurement Data Analysis

This section presents the results of analyzing different aspects of the measurement data. Since the amount of captured data is relatively large, five time sections of identical length are selected; each section is ten minutes long, and together they include the service establishment, middle, and end periods. The five time sections are 0-600s, 1800-2400s, 3600-4200s, 5400-6000s and 6000-7200s. Section 3.1 uses the data from the 3600-4200s time section to describe the bearing protocols and the sizes of the packets used to transfer both data and control signals. Section 3.2 presents the geographical distribution characteristics of the peers during all five time sections. Section 3.3 introduces the characteristics of the upload and download traffic based on the first two time sections. The data from the 5400-6000s time section is used in Section 3.4 to investigate the self-similarity properties of the traffic.

3.1 General Measurement

In order to study the network transport protocols that deliver the video streams, a period with relatively steady traffic has to be considered first; the 3600-4200s time section was selected. Fig. 2 shows the size distribution of TCP packets (green dots) and UDP datagrams (red dots) during this time section. The total number of TCP packets observed during this time section is 462, less than 0.46% of the total number of packets (101620); further, the TCP packets are generally small. The UDP datagrams, on the other hand, clearly fall into two types: datagrams bigger than 600 bytes, most being 1145 bytes long, and datagrams of 100 to 200 bytes. The former type makes up 96% of all the traffic, while the latter makes up only 4%. We conclude that PPLive uses UDP as the main bearing protocol for video content, with packets mostly 1145 bytes long, while TCP is used to transmit some control frames and a few video frames. (The libpcap sketch at the end of Section 2 reproduces this kind of per-protocol size tally.)

3.2 Peer IP Address Distribution

Since the data collection is on the Chinese financial information channel, the viewers mainly come from China's inland areas. Using the IP address distribution information from the China Internet Network Information Center [12], the locations of the different upload and download peers can be determined from their IP addresses. Fig. 3 shows the geographical distribution of the IP addresses. The analysis of the peer distribution is based on the data in the five time sections: 1-600 seconds, 1800-2400 seconds, 3600-4200 seconds, 5400-6000 seconds and 6000-7200 seconds.


Fig. 2. Packet Size Distribution per Bearing Protocol

Fig. 3. Peer Distribution for Download and Upload Traffic

The download and upload traffic are processed separately. The analysis of the peer distribution focuses on the relationship between the amount of service traffic and the corresponding source or destination area. In each time section, the peer IP addresses are sorted according to the number of video packets; the traffic is then aggregated by the geographic area of the IP addresses. Fig. 3 shows the traffic ratio from or to different areas in the five time sections. The red block represents the first contributing (or benefiting) area, the yellow the second, the black the third, and the blue the set of all other areas. In the download traffic, the contributors come from about 20 provinces; Fig. 3-a shows that more than 50% of the traffic comes from the first contributing area. In the upload traffic, the beneficiaries come from about 25 provinces. Compared with the download traffic, the upload traffic is more evenly distributed across the benefiting areas, as shown in Fig. 3-b.


It is interesting that the download traffic comes mainly from the network operated by China Netcom (CNC), while the upload traffic goes mainly to the network operated by China Telecom (CT). The traffic ratios are shown in Table 1.

Table 1. Traffic Distribution of the Different Time Sections

Operator         0-600s   1800-2400s   3600-4200s   5400-6000s   6600-7200s
CNC (Download)   86.9%    86.2%        68.78%       84.3%        80.7%
CT (Upload)      80.7%    56.3%        69.2%        85.1%        70%

3.3 Traffic Characteristics

The upload and download throughput during the first two time sections are shown in Fig. 4. The green download traffic maintains a relatively steady rate during the whole monitoring period, averaging about 50KB/s with no dramatic fluctuation from beginning to end. The red upload traffic can be divided into two periods: the service beginning period and the service steady period. During the beginning period there is an obvious ramp-up around the 10th minute; before that point there are mainly detect (probe) packets and no video packet transmission. During the steady period, data packets are transmitted according to the link condition of the user. Fig. 4 also shows the top-peer traffic for download and upload, in blue and purple respectively, which reflect the download and upload policies. Notice that the output traffic to the top beneficiary is almost equal to the contribution from the top contributor in the download traffic. The statistical granularity is again one second, and the average rate of the top peer, for both upload and download, is about 12KByte/s.

Fig. 4. Upload and Download Traffic Performance with Each Top One Peer Traffic


The relationship between nodes and throughput is analyzed with reference to the data between 1800s and 2400s; only the upload traffic is considered here. The beneficiaries are ranked according to their received traffic. Compared with the download traffic, the number of accumulated nodes in the upload traffic is larger, so the upload traffic obtained by each node is comparatively smaller: the top node only obtains 5% of the overall upload traffic. The upload traffic decreases markedly after the 60th peer, and after the 90th peer there is basically only control packet output. Over the whole monitoring period, only 4% of the peers (the top 100, with nodes identified by IP address) receive 96% of the data traffic; that 96% of the upload traffic is composed of 1145-byte video packets, with the remainder composed of 100-200 byte detect packets.

3.4 Self-similarity Analysis

The analysis of self-similarity and long-range dependence uses the Selfis tool [13] by Thomas Karagiannis et al. The data in the 5400-6000s time section is used as the basis for both download and upload analysis. Fig. 5 shows the R/S estimation and the autocorrelation function (ACF) for the download and upload traffic, respectively. For the download traffic, the R/S linear fit is 0.1723x + 0.01451, giving a Hurst value of 0.1723. As the lag increases, the autocorrelation function shows no notable decline, and its non-summability is not clear. For the upload traffic, the R/S linear fit is 0.6216x + 0.0349, and the corresponding Hurst value is 0.6216. The non-summability of its autocorrelation function is also unclear, but since the Hurst value exceeds 0.5 (H > 0.5), the upload traffic can be regarded as long-range dependent. Across several time sections, the Hurst value of the download traffic fluctuates between 0.2 and 0.3, and that of the upload traffic between 0.6 and 0.7. Over a longer time scale (more than 50 min), the Hurst values of the download and upload traffic increase markedly, to about 0.35 and 0.75 respectively, and the ACF still shows no clear decline.

3.5 Traffic Statistical Fitting

From the above analysis, we draw two conclusions about modeling PPLive traffic. Firstly, the traffic model of the PPLive service should be based on the upload traffic. Secondly, for simplicity, the upload traffic should be modeled as a whole, although it is actually composed of traffic to different beneficiaries. For the inter-arrival times of video packet groups, the candidate fitting PDFs and their parameters, ranked by the K-S test, are shown in Table 2 below; the log-normal PDF is chosen as the video packet transmission function of the traffic model.
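To make the R/S estimation concrete, the following is a small illustrative sketch of the procedure (our own code, not the Selfis tool): the series is cut into blocks of size n, the rescaled range R/S is averaged over the blocks for a range of n, and the Hurst value is the least-squares slope of log(R/S) against log(n), i.e. the slope of fits such as 0.6216x + 0.0349 above.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Rescaled range R/S of one block: range of the cumulative mean-adjusted
// sums, divided by the block's standard deviation.
static double rescaledRange(std::vector<double>::const_iterator first,
                            std::vector<double>::const_iterator last)
{
    const double n = static_cast<double>(last - first);
    const double mean = std::accumulate(first, last, 0.0) / n;
    double sum = 0.0, lo = 0.0, hi = 0.0, var = 0.0;
    for (auto it = first; it != last; ++it) {
        sum += *it - mean;                // cumulative deviation from the mean
        lo = std::min(lo, sum);
        hi = std::max(hi, sum);
        var += (*it - mean) * (*it - mean);
    }
    const double s = std::sqrt(var / n);
    return s > 0.0 ? (hi - lo) / s : 0.0;
}

// Hurst estimate: slope of log(mean R/S) vs log(n) over doubling block
// sizes, fitted by least squares. Expects a series of at least 32 samples.
double hurstRS(const std::vector<double>& x)
{
    std::vector<double> lx, ly;
    for (std::size_t n = 16; n <= x.size() / 2; n *= 2) {
        const std::size_t blocks = x.size() / n;
        double rs = 0.0;
        for (std::size_t b = 0; b < blocks; ++b)
            rs += rescaledRange(x.begin() + b * n, x.begin() + (b + 1) * n);
        lx.push_back(std::log(static_cast<double>(n)));
        ly.push_back(std::log(rs / blocks));
    }
    const std::size_t m = lx.size();
    const double mx = std::accumulate(lx.begin(), lx.end(), 0.0) / m;
    const double my = std::accumulate(ly.begin(), ly.end(), 0.0) / m;
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < m; ++i) {
        num += (lx[i] - mx) * (ly[i] - my);
        den += (lx[i] - mx) * (lx[i] - mx);
    }
    return num / den;   // the Hurst exponent H
}
```

Applied to the per-second byte counts of a trace section, the returned slope should correspond to the Hurst values reported above (around 0.2-0.3 for download, 0.6-0.7 for upload).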


Fig. 5. R/S and ACF of Download and Upload Traffic

Table 2. PDF Fitting

Distribution     K-S Test   Parameters
LogNormal        0.03177    σ=1.5514, µ=-3.7143
LogNormal (3P)   0.03385    σ=1.7522, µ=-3.86, γ=9.6370E-4
Weibull (3P)     0.03614    α=0.61622, β=0.04795, γ=0.0011
Gamma (3P)       0.07984    α=0.53457, β=0.12649, γ=0.0011
Weibull          0.08644    α=0.62782, β=0.05058

Fig. 6 shows the Q-Q plot of the log-normal distribution, which gave the best K-S test result when analyzing the video packet inter-arrival times. Before 0.4 seconds the log-normal distribution matches the actual data well, but it deflects gradually thereafter. However, since the inter-packet times are all less than 3s and almost 92% of them are less than 0.4 seconds, this does not influence the traffic model much. The mathematical analysis uses the EasyFit free trial software [11]. The control packets are analyzed in the same way: the candidate fitting PDFs of the small-packet inter-arrival time and their parameters are ranked by the K-S test, and again the log-normal PDF is chosen as the control packet function of the traffic model.


Fig. 6. Statistic Fitting of Video Packet

4 Modeling Design and Simulation

The actual network system is heterogeneous and complicated; managing and developing such a large-scale distributed system for experiments is hardly feasible, so modeling is the best way to simulate its operation. This paper provides an effective method to model IPTV traffic, which can serve as a reference for similar work in the future. The simulation results also verify the accuracy of the analysis and modeling methods used.

4.1 Algorithm

In this model, an algorithm has been put forward to control the packet generation sequence, as shown in Fig. 7. First, data initialization is performed:

1. Send a video packet when the simulation begins.
2. Compute the next video packet sending time and put it into a variable NextT.

Next, the time needed to send the next packet is computed. To account for different packet sizes, different parameters are used to calculate the inter-video-packet time (variable NextT) and the inter-control-packet times (array t_i). The values t_1 to t_n are summed into a variable SmallT. As long as SmallT is less than NextT, t_i is used as the inter-packet time for sending small packets. Otherwise, a large packet is sent immediately with an inter-packet time of NextT - (SmallT - t_i). A code sketch of this generation loop follows the flowchart below.

4.2 Simulation and Evaluation

For the simulation, version 2.29 of the widely used NS-2 [14] is used to implement the model. The script analysis language Awk, the plotting tool gnuplot, and the statistical fitting tool EasyFit are used for the simulation data analysis.


Fig. 7. Packet Generation Sequence Flowchart
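The following is a hedged C++ rendering of the flowchart logic (illustrative only; the authors' implementation is an NS-2 traffic generator). The log-normal parameters are the fitted values quoted below; the seed, simulated duration, and textual output are assumptions made for the sketch.

```cpp
#include <cstdio>
#include <random>

// Sketch of the Section 4.1 generation loop: control ("small") packets are
// emitted at log-normally distributed intervals until the accumulated time
// SmallT would pass the next video packet time NextT, at which point the
// video ("large") packet is sent and a new NextT is drawn.
int main()
{
    std::mt19937 rng(42);
    // std::lognormal_distribution(mu, sigma) of the underlying normal;
    // the (mu, sigma) pairs are the paper's fitted values.
    std::lognormal_distribution<double> videoGap(-3.7143, 1.5514);
    std::lognormal_distribution<double> ctrlGap(-9.1022, 2.5647);

    const int kVideoBytes = 8015, kCtrlBytes = 130; // sizes used in Section 4.2
    const double kDuration = 600.0;                 // simulate ten minutes

    double now = 0.0;
    double nextT = videoGap(rng);      // time until the next video packet
    while (now < kDuration) {
        double smallT = 0.0;           // accumulated control-packet time
        for (;;) {
            const double ti = ctrlGap(rng);
            if (smallT + ti >= nextT) {
                // Video packet is due: send it after the remaining gap,
                // i.e. NextT - (SmallT - t_i) relative to the last packet.
                now += nextT - smallT;
                std::printf("%f video %d\n", now, kVideoBytes);
                nextT = videoGap(rng);
                break;
            }
            smallT += ti;
            now += ti;
            std::printf("%f ctrl %d\n", now, kCtrlBytes);
        }
    }
    return 0;
}
```

In the paper's setup, the same two distributions and packet sizes drive the NS-2 generator; the printf trace here simply stands in for the simulator's packet-send events.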

Fig. 8. Q-Q Plot of Packets Inter-Arrival Time

The log-normal distribution with parameters α = 1.5514 and β = -3.7143 (the σ and µ fitted in Table 2) is used to simulate the inter-video-packet time, as discussed before. A log-normal distribution with parameters α = 2.5647 and β = -9.1022 simulates the inter-packet time for small packets, as explained in Section 4.1, while the size of control packets is 130 bytes and the size of video packets is 8015 bytes. Fig. 8 shows the Q-Q plot of the inter-video-packet times from the simulation, which match the actual data very well during the first 2 seconds and then depart gradually.


In order to further validate the correctness of the traffic model, ten minutes of simulated traffic data was collected to draw a throughput plot. This is compared with the actual throughput for 3600-4200s by plotting both on the same figure, as shown in Fig. 9; the red points are the real traffic data and the blue points the simulation data.

Fig. 9. Comparison between Original and Simulated Traffic

The performance of the two sets of traffic data is compared in Table 3, which indicates that this model can be used to represent traffic generated by IPTV peers on a simulation platform.

Table 3. Comparing data

Data         Total Traffic (Kbyte)   Average Value (Kbyte/s)   Standard Deviation   Confidence Level (95%)
Actual       30907                   51.59766                  51.0418              4.0958
Simulation   38197                   63.76795                  52.0086              4.1734

5 Conclusion

This paper investigates features of PPLive, a representative P2P streaming application, such as the distribution of peer IP addresses, the number of peers, and various traffic characteristics including self-similarity. Based on this analysis, we designed a model for the streaming video traffic generated by PPLive and verified its correctness using simulations of the video streaming service. The results indicate that the proposed streaming service traffic model complies with the actual IPTV streaming service. This study of IPTV traffic streams can be used in the future as a foundation for characterizing grid video-sharing services.


Acknowledgment

First of all, we thank the Europe-China Grid InterNetworking (EC-GIN) project, which provided a wonderful platform for in-depth research on grid networks. We are indebted to the members of the EC-GIN WP2 group for their collaboration and support. Our team members did their best, and now we can share in the success. Finally, we thank all the people whose experience we have benefited from, and all those working on the improvement of grids.

References

1. Cuilian, L., Yunsheng, M., Junchun, R.: A Grid-based VOD System and Its Simulation. Journal of Fudan University (Natural Science) 43(1), 103–109 (2004)
2. Anfu, W.: Program Scheduling and Implementing in a Grid-VOD System. Engineering Journal of Wuhan University 40(4), 149–152 (2007)
3. Garrett, M.W., Willinger, W.: Analysis, Modeling and Generation of Self-Similar VBR Video Traffic. In: Proceedings of SIGCOMM 1994 (September 1994)
4. Silverston, T., Fourmaux, O.: Measuring P2P IPTV Systems. In: Proceedings of the 17th ACM International Workshop on Network and Operating Systems Support for Digital Audio & Video (NOSSDAV 2007), Urbana, IL, USA, pp. 83–88 (June 2007)
5. Hei, X., Liang, C., Liang, J., Liu, Y., Ross, K.W.: A Measurement Study of a Large-Scale P2P IPTV System. IEEE Transactions on Multimedia 9(8), 1672–1687 (2007)
6. El-khatib, Y., Edwards, C. (eds.), Damjanovic, D., Heiß, W., Welzl, M., Stiller, B., Gonçalves, P., Loiseau, P., Vicat-Blanc Primet, P., Fan, L., Wu, J., Yang, Y., Zhou, Y., Hu, Y., Li, L., Li, S., Liu, S., Ma, X., Yang, M., Zhang, L., Kun, W., Liu, Z., Chen, Y., Liu, T., Zhang, C., Zhang, L.: Survey of Grid Simulators and a Network-level Analysis of Grid Applications. EC-GIN Deliverable 2.0, University of Innsbruck (April 2008)
7. PPLive, http://www.pplive.com/
8. Ethereal: A Network Protocol Analyzer, http://www.ethereal.com/
9. Gawk, http://www.gnu.org/software/gawk/
10. gnuplot, http://www.gnuplot.info/
11. EasyFit: Distribution Fitting Software, http://www.mathwave.com
12. China Internet Network Information Center, http://www.cnnic.net.cn/en/index/index.htm
13. The SELFIS Tool, http://www.cs.ucr.edu/~tkarag/Selfis/Selfis.html
14. NS-2: Simulation Tool, http://nsnam.isi.edu/nsnam/

SCE: Grid Environment for Scientific Computing⋆

Haili Xiao, Hong Wu, and Xuebin Chi

Supercomputing Center, Chinese Academy of Sciences, P.O. Box 349, Beijing 100190, China {haili,wh,chi}@sccas.cn

Abstract. Over the last few years Grid computing has evolved into an innovative technology and gained increased commercial adoption. However, existing Grids do not have enough users for sustainable development in the long term. This paper proposes several suggestions for this problem on the basis of long-term experience and careful analysis. The Scientific Computing Environment (SCE) in the Chinese Academy of Sciences is introduced as a completely new model and a feasible solution to this problem.

Keywords: Grid, scientific computing, PSE.

1 Introduction

This paper begins with this short introduction. In the second part, an overview of several large Grid projects and their applications (and application Grids) is given, followed by a discussion of existing problems of those Grids, including EGEE, TeraGrid, CNGrid and ScGrid. The third part mainly discusses the Scientific Computing Environment (SCE): what it is, why it is important and how it works. The following part introduces applications built upon SCE and how scientists benefit from this integrated environment. The paper ends with conclusions and future work.

2 Grid Computing Overview

2.1 Grid Computing

Grid computing has been defined in a number of different ways, especially as it has gained increased commercial adoption. People from the scientific and research community do not share the same understanding as those from companies like IBM, Oracle, Sun, etc. However, there is a consensus that Grid computing involves the integration of large computing resources, huge storage and expensive instruments, generally linked together from geographically diverse sites.

⋆ This work is supported by the National High Technology Research and Development Program of China (863 Program): "Research of management in CNGrid" (2006AA01A117), "Environment of supercomputing services facing scientific research" (2006AA01A116), and "Application of computational chemistry" (2006AA01A119).



2.2 Well-Known Grid Projects

Over the last few years, since it was started at Argonne National Labs in 1990, Grid computing has evolved rapidly. Many national Grids, and even multi-national Grids, have been funded to build collaboration environments for scientists and researchers.

EGEE. The Enabling Grids for E-sciencE (EGEE) project brings together scientists and engineers from more than 240 institutions in 45 countries world-wide to provide a seamless Grid infrastructure for e-Science that is available to scientists 24 hours a day. The EGEE project aims to provide researchers in academia and industry with access to major computing resources, independent of their geographic location. The first period of the EGEE project officially ended on 31 March 2006, and EGEE-II started on 1 April 2006. The EGEE Grid consists of 41,000 CPUs available to users 24 hours a day, 7 days a week, together with about 5 PB of disk storage (5 million gigabytes) plus tape mass storage, and it sustains 100,000 concurrent jobs. Having such resources available changes the way scientific research takes place; the end use depends on the users' needs: large storage capacity, the bandwidth that the infrastructure provides, or the sheer computing power available.

The project primarily concentrates on three core areas:
– To build a consistent, robust and secure Grid network that will attract additional computing resources.
– To continuously improve and maintain the middleware in order to deliver a reliable service to users.
– To attract new users from industry as well as science and ensure they receive the high standard of training and support they need.

Expanding from the original two pilot scientific application fields, high energy physics and life sciences, EGEE now integrates applications from many other scientific fields, ranging from geology to computational chemistry. Generally, the EGEE Grid infrastructure serves scientific research, especially where the time and resources needed to run the applications would be impractical on traditional IT infrastructures.

TeraGrid. TeraGrid is one of the world's largest and most comprehensive distributed cyberinfrastructures for open scientific research. TeraGrid integrates high-performance computers, data resources and tools, and high-end experimental facilities around the USA through high-performance network connections. Currently, TeraGrid resources include more than 750 teraflops of computing capability and more than 30 petabytes of online and archival data storage, with rapid access and retrieval over high-performance networks.


A TeraGrid Science Gateway is a community-developed set of tools, applications, and data collections integrated via a portal or a suite of applications. Gateways provide access to a variety of capabilities including workflows, visualization, resource discovery, and job execution services. Science Gateways enable entire communities of users associated with a common scientific goal to use national resources through a common interface. They are enabled by a community allocation whose goal is to delegate account management, accounting, certificate management, and user support to the gateway developers. TeraGrid is coordinated through the Grid Infrastructure Group (GIG) at the University of Chicago, working in partnership with the Resource Provider sites.

CNGrid. China National Grid (CNGrid) is a key project launched in May 2002 and supported by the China National High-Tech Research and Development Program (the 863 program); it is a testbed for a new information infrastructure that aggregates high-performance computing and transaction processing capability. CNGrid promotes national information construction and the development of relevant industries through technical innovation. Ten resource providers (also known as grid nodes) from all around China have joined CNGrid, contributing more than 18 teraflops of computing capability. CNGrid is equipped with domestically built, grid-oriented high performance computers (the Lenovo DeepComp 6800 in Beijing and the Dawning 4000A in Shanghai). Two 100-teraflops HPC systems are being built and will be installed before the end of 2008, and 1000-teraflops systems will be built and installed in the near future. Through its first five years, CNGrid has effectively supported many scientific research and application domains, such as resources and environment, advanced manufacturing, and information services. CNGrid is coordinated through the CNGrid Operation Center at the Supercomputing Center of the Chinese Academy of Sciences. The CNGrid Operation Center, established in September 2004, works with all CNGrid nodes and is responsible for system monitoring, user management, policy management, training, technical support and international cooperation.

ScGrid. In the Chinese Academy of Sciences, the construction of the supercomputing environment is one of the key supporting areas. To build the e-Science infrastructure and make the best use of the supercomputing environment, the CAS has clear, long-term plans for Grid research and for the construction of the Scientific Computing Grid (ScGrid). ScGrid provides users a problem solving environment based on Grid technology, which is a generic Grid computing platform. ScGrid has supported typical applications from bioinformatics, environmental sciences, material sciences, computational chemistry, geophysics, etc.

There are other well-known Grid projects, such as e-Science in the UK, NAREGI in Japan and K*Grid in Korea.

2.3 Analysis of Existing Grids

The emerging Grid technology provides a cyberinfrastructure for application science communities. In the UK and China, the term e-Science is used more often when talking about Grid technology. Both characterizations regard the Grid as the third basic way of doing scientific research; the difference is that 'Cyberinfrastructure' focuses on the method of scientific research, while 'e-Science' emphasizes the prospects and mode of future research.

All Grids are by nature driven by applications, or built for specific applications, and the above descriptions show that demands from users and applications do exist. Scientific research can be done in a faster and more efficient way by using Grid technology, as the Grids above have proven. However, Grids, and especially production Grids, still need to find more users and arouse more interest; Grids and Grid applications need more actively engaged domain scientists. As an example, in the TeraGrid annual report for 2007 [1], the milestone of 300 Science Gateway users was delayed by six months (from month 18 to month 24). In our experience running the China National Grid Operation Center over the past four years, most users' complaints have been:

1. the usability and stability are not sufficient for daily use;
2. the integrated applications do not cover their needs;
3. the Grid portal is not as convenient as their SSH terminal.

There are a number of issues at stake when we talk about the Grid and its value for users, which are also the reasons users might choose the Grid:

1. Can the Grid reduce computation time and/or queuing time? This is essential when a user is deciding whether to adopt a new, alternative method (like the Grid) to finish her work. Unfortunately, the Grid itself is not a technology that makes a computation finish faster. But a Grid often consists of more than one computation node; when a job is submitted, a 'best' destination node (a faster or less busy one, according to the scheduling policy) can be selected to finish the job, without the user being aware of it. So under many conditions, a computation job can finish earlier on a Grid.

2. Is the Grid free of charge? Users will be very happy if it is; furthermore, if the computation (CPU cycles) is also free while using the Grid, but not otherwise, they will certainly love the Grid. In the long term, however, this is not good, because no new technology, the Grid included, can last long if it depends entirely on free use.

3. Can the Grid give users a wonderful experience? Some users like new things: a beautiful, well-designed UI and convenient drag-and-drop mouse manipulation. Users need more choices so they can select the most familiar and comfortable way to access Grid resources, i.e. via a Grid portal or a traditional terminal, from a Linux workstation or a Windows laptop.

4. Apply once, run anywhere in the Grid. Many users have accounts on several HPC systems and always need to apply for a new account on each one. On a Grid, they can apply only once, for a Grid account, and obtain a local account on each Grid node. This can save much time.


From the above analysis, resource sharing and scheduling remain the key value of the Grid for users. Each Grid user must have a local account on every single Grid node, and each submitted job must be delivered to a specific Grid node. This is the basic concept of the Grid, though it is sometimes idealized in reality. In this situation, on the one hand the functionality of the software needs to be improved; on the other hand, operation and management is also a big issue that needs more attention.

3 Scientific Computing Environment

In the eleventh five-year plan of the Chinese Academy of Sciences, the construction of the supercomputing environment is designed as a pyramidal structure. The top layer is a centralized high-end computing environment, building a one-hundred-teraflops computer and a software support platform. The middle layer is distributed across China: five sub-centers integrating fifty teraflops of computing resources. The bottom layer consists of institutes of the CAS, which can have their own HPC systems of different sizes. Under this pyramidal layout, based on ScGrid and the work done during the tenth five-year plan, Grid computing technology is the best choice for sharing resources and integrating computing power.

3.1 Structure of SCE

Supercomputing Center continues the research and construction of the Scientific Computing Grid during the advance of the e-Science infrastructure in the Chinese Academy of Sciences. To make the best use of the Scientific Computing Environment (SCE), Supercomputing Center provides users an easy-to-use problem solving environment based on Grid technology. The basic structure of SCE is illustrated in Figure 1. At the bottom, different high performance computers and visualization workstations constitute the hardware layer. The Grid middleware is an abstraction layer that hides the differences of geographical location and heterogeneous machines in the hardware layer; it also serves as a Grid services provider for the upper client tools and portals. Users choose whichever client tools or portals they like to access Grid resources.

3.2 Features of SCE

Integrated and transparent. In the integrated scientific computing environment, computing jobs are scheduled among all computing nodes. Jobs can be submitted to a specified node or to an automatically selected node. Large-scale, long-running jobs can be scheduled to the top or middle layer, which have more powerful computing capability; other small or short jobs are scheduled to the middle or bottom layer. All in all, the utilization of the high performance computers is maximized.


Fig. 1. Basic structure of SCE

Fig. 2. Submitting a Gaussian Job in Grid portal

On the other hand, users do not need to log on to different nodes: they can log in once and use all the nodes. Everything is transparent to users, as if it were one larger computer.

Flexible and easy to use. Users can choose either the web portal or the client tool. In the web portal (see Figure 2), normal jobs can be submitted, job status can be checked, intermediate output can be viewed, and final results can be downloaded; Figure 2 shows the status of a running Gaussian job, with the instant output of the intermediate file dumped in the center window. In the client tools, in addition, files can be transferred with multiple threads, and remote file systems on all Grid nodes can be mounted locally in Windows Explorer.


Directories and files of three remote HPC systems can be listed concurrently in the left tree view of Windows Explorer, and remote file operations are as easy and natural as local ones.

Lightweight and scalable. The source code of the SCE middleware is mainly written in C and shell script; the core code is around ten thousand lines, so it is very easy to deploy and set up. The modular design also makes it easy to scale: new functions can easily be added.

3.3 Applications Based on SCE

Virtual Laboratory of Computational Chemistry. The VLCC is a very successful example of the combination of Grid technology and a specific application. Established in 2004, it aims to build for computational chemistry a platform for research, communication between scientists, training for newcomers, and application software development. Until now, more than fifty members from universities and institutes have joined the virtual laboratory. Much software is shared among the members, from commercial packages, i.e. Gaussian 2003, ADF 2004, Molpro and VASP, to free software such as CPMD, GAMESS and GROMACS. Scientists also share software developed by themselves, including XIAN-CI by Professor Zhenyuan Wen of Northwest University of China, CCSD and SCIDVD by Professor Zhigang Shuai of the Institute of Chemistry of CAS, and 3D by Professor Keli Han of the Dalian Institute of Chemical Physics. The success of the VLCC has inspired researchers from other application domains, and similar virtual laboratories are planned.

4 Conclusions and Future Work

Grid computing involves the integration of large computing resources, huge storage and expensive instruments. Over the last few years, Grid computing has evolved rapidly, and many national and even multi-national Grids, such as EGEE, TeraGrid and CNGrid, have been built. The Scientific Computing Environment (SCE) in the Chinese Academy of Sciences is a completely new model, putting emphasis on consistent operation and management and on versatile access methods that meet users' demands. Successful applications in the SCE integrated environment show it to be a feasible solution to this problem, and more applications are being integrated into SCE.

Acknowledgement

We would like to thank the scientists and researchers from CAS for their support of our work. We also thank the staff and students of the Grid team at SCCAS; their spirit of teamwork greatly encouraged us and finally got things done.


References

1. Catlett, C., et al.: TeraGrid Report for CY 2006 and Program Plan for 2007 (January 2007)
2. Foster, I., Kesselman, C.: The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco (1998)
3. Foster, I., Kesselman, C., Nick, J., Tuecke, S.: The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration (January 2002)
4. von Laszewski, G., Pieper, G.W., Wagstrom, P.: Gestalt of the Grid (May 12, 2002)
5. Yin, X.-c., Lu, Z., Chi, X., Zhang, H.: The China ACES iSERVO Grid Node. Computing in Science & Engineering 7(4), 38–42 (2005)
6. Zhu, J., Guo, A., Lu, Z., Wu, Y., Shen, B., Chi, X.: Analysis of the Bioinformatics Grid Technique Application in China. In: Fourth International Workshop on Biomedical Computations on the Grid, Nanyang Technological University, Singapore, May 16-19 (2006)
7. Sun, Y., Shen, B., Lu, Z., Jin, Z., Chi, X.: GridMol: A Grid Application for Molecular Modeling and Visualization. Journal of Computer-Aided Molecular Design 22(2), 119–129 (2008)
8. Wu, H., Chi, X.-b., Xiao, H.-l.: An Example of an Application-Specific Portal in a Web-based Supercomputing Environment. In: International Conference on Parallel Algorithms and Computing Environments (ICPACE), October 8-11, pp. 135–138 (2003)
9. Haili, X., Hong, W., Xuebin, C.: An Implementation of Interactive Jobs Submission for Grid Computing Portals. Australian Computer Science Communications 27(7), 67–70 (2005)
10. Xiao, H., Wu, H., Chi, X.: A Novel Approach to Remote File Management in Grid Environments. In: DCABES 2007, vol. II, pp. 641–644 (2007)
11. Dai, Z., Wu, L., Xiao, H.: A Lightweight Grid Middleware Based on OPENSSH SCE. In: Grid and Cooperative Computing, pp. 387–394 (2007)
12. http://www.863.org.cn
13. http://www.cngrid.org
14. http://www.eu-egee.org
15. http://www.teragrid.org
16. http://www.sccas.cn
17. http://www.scgrid.cn
18. http://www.top500.org

QoS Differentiated Adaptive Scheduled Optical Burst Switching for Grid Networks
Oliver Yu and Huan Xu
Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, IL 60607

Abstract. This paper presents the QoS differentiated Adaptive Scheduled Optical Burst Switching (QAS-OBS) paradigm, which efficiently supports dynamic grid applications. QAS-OBS enables differentiated-services optical burst switching via two-way reservation and adaptive QoS control in distributed wavelength-routed networks. Based on the network loading status, the edge nodes dynamically optimize per-class lightpath reservation control to provide proportional QoS differentiation while maximizing the overall network throughput. Unlike existing QoS control schemes that require a special core network to support burst switching and QoS control, QAS-OBS pushes all the complexity of burst switching and QoS control to the edge nodes, so that the core network design is maximally simplified. Simulation results evaluate the performance of QAS-OBS and verify that the proposed control schemes provide proportional QoS control while maximizing the overall network throughput.

1 Introduction
Optical network infrastructures have been increasingly deployed in grid systems to support data-intensive scientific and engineering applications with high-bandwidth traffic. Wavelength-routed optical networks (WRONs) based on photonic all-optical switches and a traffic control plane (e.g. GMPLS) have been used to support the bandwidth-intensive traffic of data-grid applications over multi-gigabit-rate lightpaths, i.e. end-to-end wavelength channels or lambdas. Interactive access-grid applications with time-varying participants require dynamic lambda grid systems, which connect grid resources (computation, storage, etc.) with on-demand provisioned lightpaths. For instance, a collaborative visualization requires connections among multiple remote computing clusters over a long period of time, yet only a portion of these clusters is accessed at any given time. Thus, statically provisioned lightpaths greatly reduce network resource utilization in this situation. Discovery and reservation of optical networking resources and grid resources can be based on either the overlay or the integrated model. The overlay model [1] specifies the layer of grid resources to sit over the optical network, with separate resource control mechanisms. The integrated model [2] specifies a combined resource control mechanism (e.g. extended GMPLS) to support unified provisioning of optical networking and grid resources.


Emerging collaborative grid applications have an increasing demand for support of diverse bandwidth granularities and multimedia traffic. This paper presents an optical burst switching (OBS) based grid network architecture to support such collaborative grid applications. An open issue for an OBS-based grid network architecture is provisioning for grid applications with multimedia traffic components requiring differentiated quality-of-service (QoS) performance. Optical Burst Switching (OBS) schemes [3][4] have been proposed to enable wavelength statistical multiplexing for bursty traffic. The one-way reservation of OBS, however, incurs burst contention, and data may be dropped at the core nodes due to wavelength reservation contention. Burst contention blocking recovery controls (including time, wavelength and link deflection) are employed to minimize contention blocking. These recovery controls, however, may incur extra end-to-end data delay. In addition, some burst contention recovery controls require special devices such as Fiber Delay Lines (FDLs) at the core network. Wavelength-routed OBS (WR-OBS) [5] guarantees data delivery through two-way wavelength reservation. WR-OBS is based on a centralized control plane and does not require special devices (e.g. FDLs) to be implemented at core nodes. However, centralized control may not be practical for large-scale grid systems. Adaptive Reliable Optical Burst Switching (AR-OBS) [6] minimizes data loss in distributed-controlled WRONs based on wavelength-routed optical burst switching. AR-OBS minimizes the overall data loss by adapting the lightpath reservation robustness (maximal allowed time for blocking recovery) of each burst to the network loading status. To optimize the sharing of optical network resources, it is desirable to provide service differentiation in OBS-based grid networks. Service differentiation via assignment of different offset delays is proposed in [7][8]. In these schemes, higher-priority bursts are assigned an extra offset to avoid burst contention with lower-priority bursts. However, the higher-priority bursts always suffer from longer end-to-end data delay, and such control schemes are unfair to large bursts [9][10]. Look-ahead Window (LaW) based service differentiation schemes are proposed in [11][12]. Each core node buffers burst headers in a sliding window, and a dropping-based decision is made for the buffered bursts based on the QoS requirements. The collective view of multiple burst headers in the LaW results in more efficient dropping decisions than simple dropping-based schemes (e.g. the scheme proposed in [9]). The overall end-to-end delay of LaW is sacrificed, however, due to the extra queuing delay in the window. By implementing a Weighted Fair Queue for the burst reservation requests in the central controller [13], WR-OBS can provide prioritized QoS; as mentioned above, though, centralized control schemes may not be practical in large-scale network systems. To facilitate an easy pricing model for grid applications, it is better to provide proportional service differentiation. By intentionally dropping lower-class bursts at the core network nodes, the QoS control scheme proposed in [9] provides proportional service differentiation in terms of data loss ratio. Simple dropping-based schemes at each core node, however, may cause unnecessary drops and deteriorate the wavelength utilization.
The preemption-based QoS control scheme proposed in [14] provides proportional service differentiation by maintaining the number of wavelengths occupied by each class of bursts. To guarantee the preset differentiation ratio, bursts satisfying their corresponding QoS usage profiles will preempt those not satisfying their usage profiles. The preempted bursts, however, waste some wavelength resources and lower the wavelength utilization.


An absolute QoS model for OBS networks is proposed in [15] to ensure that the data loss of each QoS class does not exceed a certain value. Loss guarantees are provided via two mechanisms: early drop and wavelength grouping. The optimal wavelength sharing problem for OBS networks with QoS constraints is discussed in [16]. Optimized wavelength sharing policies are derived based on a Markov decision process to maximize wavelength utilization while ensuring that the data loss rate of each QoS class does not exceed the given threshold. All of the aforementioned QoS control schemes require implementing QoS control at every core network node, which increases the complexity of the core network design. In addition, some of the schemes demand special optical switch architectures. For example, differentiated offset delay and LaW-based QoS control schemes need the support of FDLs at every core network node to buffer the incoming optical bursts. The QoS control schemes proposed in [15] and [16] require every optical node to be wavelength convertible. These requirements may not be easily satisfied in existing WRONs and will increase the investment cost of the core networks. To simplify the core network design and push the complexity of burst switching and QoS control to the edge nodes in WRONs, QoS differentiated Adaptive Scheduled Optical Burst Switching (QAS-OBS) is proposed. QAS-OBS provides proportional service differentiation in terms of data loss rate while maximizing the overall network throughput. QAS-OBS employs two-way reservation to guarantee data delivery. The core network controllers are not involved in burst QoS control (e.g. dropping decisions); the core network control can employ distributed reservation protocols with blocking recovery control mechanisms (e.g. S-RFORP [17] or RFORP [18]). The QoS control in QAS-OBS is implemented at the edge nodes, which dynamically adjust the lightpath reservation robustness (i.e. the maximal allowed signaling delay) of each burst to provide the service differentiation. To guarantee proportional differentiation under different network loading conditions, the edge controller dynamically adjusts the lightpath reservation robustness of each burst based on the network loading status. The Round Trip Signaling Delay (RTSD) of the two-way reservation signaling is employed as an indicator to estimate the network loading status without demanding explicit distribution of network loading information. A heuristic-based control algorithm is proposed in this paper. Simulation results show that QAS-OBS achieves proportional QoS provisioning while maximizing the network throughput. The remainder of the paper is organized as follows. Section 2 discusses the general architecture of QAS-OBS. The core signaling control is illustrated in Section 3. Section 4 presents the multi-class QoS control. Section 5 presents the simulation results.

2 Architecture of QAS-OBS
QAS-OBS contains two functional planes: a distributed data plane consisting of the optical core network switches, and a distributed control plane taking care of the signaling control and switch configuration of each optical switch. The motivation of QAS-OBS is to push the complexity of QoS control from the network core to the edge. Edge nodes control the lightpath setup robustness (i.e. the maximal allowed lightpath signaling time) and the burst scheduling. Core network nodes are in charge of the lightpath reservation and do not need to participate in burst QoS control, which simplifies the implementation of the core network.


At the edge nodes, incoming data packets are aggregated in different buffers according to their QoS classes and destinations. QAS-OBS employs two-way reservation with lightpath setup acknowledgment to guarantee data delivery. To avoid the long delay incurred by two-way reservation, the lightpath signaling process is initiated during the burst aggregation phase. Thus, the signaling delay is overlapped with the burst aggregation delay, and the end-to-end data delay is minimized. To support service-differentiated control, the robustness of the lightpath reservation for each burst is set according to its QoS class. For higher-class bursts, lower reservation blocking is achieved by assigning larger lightpath reservation robustness to the corresponding resource reservation process. Some optical resource reservation protocols provide adjustable lightpath reservation robustness. For example, the maximal allowed number of blocking recovery retries in S-RFORP [17] or RFORP [18] can be adjusted for each request; a larger number of blocking recovery retries results in lower reservation blocking. In the remainder of this paper, we assume that QAS-OBS employs a resource reservation protocol that supports adjustable reservation robustness by assigning different upper bounds on the signaling delay (e.g. a limit on the number of blocking recovery retries in S-RFORP), as sketched below. To meet the QoS requirements under different network loading conditions, the burst reservation robustness in QAS-OBS is adapted to the network loading status. The reservation robustness of each burst is selected to maximize the overall throughput while satisfying the QoS constraints. Delayed reservation is employed in QAS-OBS to maximize the wavelength utilization.
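As a rough illustration of this per-class robustness knob, the following sketch maps each QoS class to an upper bound on blocking-recovery retries. The class labels and numeric limits are hypothetical; in QAS-OBS the actual bound is derived dynamically as described in Section 4.

```python
# Hypothetical per-class reservation robustness: an upper bound on the
# number of blocking-recovery retries per reservation request (the knob
# that protocols such as S-RFORP expose). Class 0 is the higher class here.
PER_CLASS_MAX_RETRIES = {0: 4, 1: 2}

def reservation_robustness(qos_class: int) -> int:
    """Return the maximal allowed number of blocking-recovery retries for a
    burst of the given class; a larger value lowers reservation blocking at
    the cost of a longer allowed signaling delay."""
    return PER_CLASS_MAX_RETRIES[qos_class]
```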

Fig. 1. Overall Architecture of QAS-OBS


Unlike offset-delay-based QoS control in OBS systems, which isolates different classes in the time dimension, QAS-OBS provides service differentiation by controlling the blocking probability of the lightpath reservation of each burst. The lower blocking does not trade off end-to-end data delay in QAS-OBS, since the lightpath signaling process is required to complete by the end of the burst aggregation process. Fig. 1 shows the overall architecture of QAS-OBS with the edge node function blocks. The incoming data is aggregated at the per-class, per-destination aggregator. The User Traffic Rate Estimator monitors the data incoming rate per destination and per QoS class, and predicts the burst sending time based on the criteria of avoiding buffer overflow or data expiration at the edge node. Based on the estimated burst sending time and the network loading status, the Multi-Class QoS Controller dynamically selects the lightpath reservation robustness for each burst and triggers the lightpath signaling procedure accordingly. We define the offset delay to be the time interval between triggering the lightpath reservation and sending out the burst, which is also the time allowed for lightpath signaling (the reservation robustness). To maintain proportional data loss rates among QoS classes and maximize the overall data throughput, the offset delay is dynamically adjusted for each QoS class according to the current network loading status. In QAS-OBS, the network loading status is estimated from the round-trip signaling delay (RTSD): a longer RTSD means there were more blocking recovery retries during lightpath reservation, and is an indicator that the network is heavily loaded. After receiving a lightpath setup request carrying the burst sending time and burst size from the Multi-Class QoS Controller, the Burst Signaling Controller selects the route and sends out the per-burst reservation request to the core network. The routing control in QAS-OBS is assumed to be based on shortest-path selection, while the wavelength selection algorithm depends on the employed signaling protocol (e.g. first fit or random fit). When the lightpath setup acknowledgement returns, the Burst Transmission Controller triggers the burst transmission, and the data is dumped from the edge buffer into the optical network as an optical burst at the predicted burst sending time.

3 Burst Signaling Control
Burst signaling control reserves the wavelengths for each burst at the specified burst transmission time. The signaling control is supported by both edge and core nodes. Based on the incoming rate of an aggregating burst, the edge node determines the sending time of the lightpath reservation signaling and the wavelength channel holding time for the corresponding burst. Core nodes interpret the signaling message and configure the corresponding optical switch to support the wavelength reservation. The objectives of burst signaling control include minimizing the end-to-end data delay and maximizing the wavelength utilization. In this section, we focus on minimizing the end-to-end data delay; the wavelength utilization is maximized in QAS-OBS via delayed reservation [19]. In QAS-OBS networks, the burst is sent out after the lightpath setup acknowledgement returns, so data delivery is guaranteed. The offset delay is required to be larger than the signaling round-trip delay. To avoid the larger end-to-end delay caused by two-way reservation, QAS-OBS signals the lightpath before the burst is fully assembled.

Fig. 2. Burst Signaling Control of QAS-OBS

If the signaling is sent out early enough to ensure that the lightpath acknowledgement returns before the burst is fully assembled, the total end-to-end data delay is minimized, since the offset delay is not included. Taking into account the extra delay incurred by blocking recovery, the offset delay in QAS-OBS consists of the error-free signaling delay (EFSD) and the maximal allowed recovery delay (MARD). The EFSD is the signaling delay on a given route if no blocking occurs. The MARD is the maximal allowed time for blocking recovery control in the signaling process, which depends on the maximal allowed signaling delay. In the signaling procedure, the actual recovery delay (ARD) is limited by the MARD to ensure that the acknowledgement returns before the signaled burst transmission time. If the assigned MARD is not large enough to cover the ARD, the lightpath request will be dropped. The actual round trip signaling delay (RTSD) is the EFSD plus the ARD, and is bounded by the given offset delay. A signaling scenario is shown in Fig. 2 to illustrate the signaling control procedure. As shown in the figure, a new burst starts aggregation at t1. After some processing time, the signaling request is sent out to the core network at t2, before the burst is fully assembled. The estimated wavelength channel holding time and burst sending time are carried by the signaling message. As shown in Fig. 2, the signaling message traverses three hops from source to destination during the wavelength discovery phase.


When the signaling message reaches the destination node, the wavelength reservation procedure is triggered if there is a common available wavelength along the route. The signaling message is then sent back from the destination node to the source node. If there is no wavelength contention in the reservation phase, the signaling message returns to the source node at t3; the time between t2 and t3 is the EFSD. In Fig. 2, blocking recovery processing is triggered at the blocked link (Link 2) due to wavelength reservation contention. The illustrated blocking recovery control is based on the alternate wavelength selection proposed in [17]. After the contention blocking is recovered at Link 2, the signaling message is passed on to the next link. The total allowed blocking recovery time along the route is bounded by the MARD, and the actual recovery delay may take less time than the MARD. In the figure, the signaling message returns to the source node at t4. The source node sends out the burst at t5, the signaled burst sending time. To maximize the wavelength utilization, delayed wavelength reservation is employed, as marked in Fig. 2. The wavelength holding time (t5 to t6) of each burst is determined by the burst size and the core bandwidth. The serialized signaling procedure shown in this scenario is for illustration only and may not be efficient for supporting burst switching. QAS-OBS employs a fast and robust signaling protocol, namely S-RFORP, proposed in [17]; interested readers may refer to [17] for details of the signaling protocol.
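To make the timing relations of this section concrete, the following minimal sketch checks the feasibility of a reservation under the definitions above. It is an illustration only; concrete values would come from the route length and protocol parameters.

```python
# Timing relations of QAS-OBS two-way reservation (all values in TU):
#   offset delay = EFSD + MARD,  RTSD = EFSD + ARD,
# and a request succeeds in time only if ARD <= MARD (RTSD <= offset delay).

def offset_delay(efsd: float, mard: float) -> float:
    """Interval between triggering the reservation and sending the burst."""
    return efsd + mard

def acknowledged_in_time(efsd: float, mard: float, ard: float) -> bool:
    """True if the acknowledgement returns before the signaled send time."""
    rtsd = efsd + ard                        # actual round-trip signaling delay
    return rtsd <= offset_delay(efsd, mard)  # equivalent to ard <= mard

def holding_time(burst_size_bits: float, core_rate_bps: float) -> float:
    """Wavelength holding time (t5 to t6): burst size over core bandwidth."""
    return burst_size_bits / core_rate_bps
```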

4 Multi-Class QoS Controller
In QAS-OBS, the MARD is the time reserved for lightpath reservation blocking recovery. To provide differentiated QoS control in terms of data loss rate, the MARD of each burst is adjusted according to the service class of that burst. A larger MARD gives the signaling protocol more chances for connection blocking recovery. For example, the signaling protocols proposed in [17][18] utilize the MARD to recover from wavelength contention blocking via localized rerouting or alternative wavelength selection. To guarantee proportional differentiated services in terms of data loss rate, the MARD should be adjusted according to the network loading status. In distributed networks, however, real-time link status information is hard to collect, either due to the excessive information exchange or for security reasons. The actual additional signaling delay (i.e. the ARD) is therefore employed in QAS-OBS as an indicator of the network loading status: wavelength contention and congestion occur more often as the average link loading increases, and if blocking recovery control is employed, the additional signaling delay caused by blocking recovery depends on the probability of contention or congestion blocking, which is determined by the link loading. The Multi-Class QoS Controller determines the offset delay, the burst transmission starting time and the burst size for each burst. Fig. 3 shows the structure of the Multi-Class QoS Controller. Based on the QoS performance requirements, the optimization heuristic in the QoS Control Database Configuration module sets up the mapping between the monitored ARD (ÂRD) and the per-class MARD, and stores this information in the ÂRD to per-class MARD mapping database. The mapping table in the database is the decision pool for the Per-Class MARD Adaptation module, which assigns each QoS class a MARD based on the monitored ÂRD.

Fig. 3. Multi-Class QoS Controller

According to the QoS class, the selected route and the estimated burst aggregation time of an aggregating burst, the Per-Burst Offset Delay & Transmission Starting Time Decision Controller selects the offset delay for each aggregated burst. When it is time to send out the signaling, this controller triggers the lightpath reservation by passing the burst transmission time and size to the Burst Signaling Controller; meanwhile, the burst transmission time is sent to the Burst Transmission Controller. A larger MARD results in lower lightpath connection blocking; however, it also increases the data loss due to edge buffer overflow. As discussed in Sections 2 and 3, the burst size and sending time are based on estimation. The estimation processing time (shown in Fig. 2) is inversely proportional to the MARD, and the estimation error depends on the processing time; in QAS-OBS, the estimation error is thus proportional to the MARD. Taking into account this tradeoff between data loss due to connection blocking and edge buffer overflow, a mapping between MARD and data loss rate can be established. Our control algorithm selects the optimal MARD that minimizes the total data loss for the QoS class requiring the minimal data loss rate; the MARDs for the other QoS classes are then selected based on the data loss rate ratios and the mapping between MARD and data loss rate. Such a mapping can be set up by monitoring history data; an example of the mapping between MARD and data loss rate is illustrated in the simulation section, and a sketch of the selection heuristic is given below. Among the function modules in Fig. 3, the most important one is the ÂRD to MARD mapping database, which is the decision pool for MARD selection. Since the MARD determines the reservation blocking for each burst, there is a mapping function between MARD and burst data loss rate for a given network loading status. The one-to-one mapping between network loading and ÂRD is shown in the simulation. Based on this mapping, we can set up the ÂRD to MARD mapping database that satisfies the QoS requirements.
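A minimal sketch of this selection heuristic follows, assuming a measured mapping from candidate MARD values to total data loss rates under the current ÂRD; the function and variable names are ours, not the paper's.

```python
# Hedged sketch of per-class MARD selection. loss_of_mard maps each
# candidate MARD to the total data loss rate (connection blocking plus
# edge-buffer overflow) measured under the current load indicator (ARD-hat);
# ratios gives the target loss-rate ratio of each class relative to the
# most demanding class, e.g. {0: 1.0, 1: 2.0}.

def select_per_class_mard(loss_of_mard, candidate_mards, ratios):
    # 1. Most demanding class: pick the MARD minimising total data loss.
    best_mard = min(candidate_mards, key=lambda m: loss_of_mard[m])
    base_loss = loss_of_mard[best_mard]
    per_class = {}
    for cls, ratio in ratios.items():
        target = base_loss * ratio
        # 2. Other classes: pick the MARD whose loss rate is closest to the
        #    proportional target derived from the base class.
        per_class[cls] = min(candidate_mards,
                             key=lambda m: abs(loss_of_mard[m] - target))
    return per_class
```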


5 Simulation Results
The simulation results consist of two parts. The first part presents the simulated results needed by the proposed control algorithm. The second part shows the performance of QAS-OBS, such as the overall throughput and the proportional data loss ratio between two classes.

Fig. 4. 14-node NSF network topology

The network used for simulation is the 14-node NSF network shown in Fig. 4. Without loss of generality, we assume that each link has an identical propagation delay, normalized to 1 time unit (TU). The wavelength discovery and reservation processing delay of each node is assumed to be 5 TU, and the switching latency of each optical switch is normalized to 3 TU. Shortest-path routing and first-fit wavelength assignment are assumed. The incoming traffic to each edge node is VBR traffic; the packet size follows a Poisson distribution with an average packet size of 10 kb, and the maximal allowable edge delay is 100 TU. An exponentially weighted sliding-window estimator is implemented. The simulation runs on a self-designed simulator written in C++. The parameters of the topology model are: each node is assumed to have full wavelength convertibility; the number of links is E = 42 (bi-directional links assumed); the default number of wavelengths per link is W = 8. The average per-link wavelength loading is defined as

Wavelength Loading = ∑_i (λ_i ⋅ H_i) / (µ_i ⋅ |E| ⋅ W)

where i is the index of a source–destination pair, λ_i is the average connection arrival rate on the route of source–destination pair i, 1/µ_i is the average wavelength holding time for the bursts transmitted on the route of i, and H_i is the number of hops on the route of i.
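For clarity, the loading formula can be computed as in the following sketch; the traffic values in the example are hypothetical, while |E| = 42 and W = 8 follow the topology model above.

```python
# Average per-link wavelength loading: sum over source-destination pairs i of
# lambda_i * H_i / (mu_i * |E| * W), where 1/mu_i is the mean holding time.

def wavelength_loading(pairs, num_links=42, wavelengths_per_link=8):
    """pairs: iterable of (arrival_rate, service_rate, hops) per pair i."""
    return sum(lam * hops / (mu * num_links * wavelengths_per_link)
               for lam, mu, hops in pairs)

# Hypothetical example: two pairs, each with mean holding time 1/mu = 100 TU.
load = wavelength_loading([(0.4, 0.01, 3), (0.2, 0.01, 2)])
```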

A. Results Needed by the Control Algorithm
Fig. 5 shows the effects of the MARD on burst data loss for a given wavelength loading.


Fig. 5. Effects of Increment of Offset Delay on data loss rate

Fig. 6. Effect of Network Loading on ARD

As shown in Fig. 5, increasing the MARD reduces the burst reservation blocking, but the efficiency of reducing connection blocking by increasing the MARD diminishes once the MARD is large enough. As the MARD increases further, the increase in edge overflow blocking outweighs the decrease in burst connection blocking, and the overall data loss rate increases slightly for large MARD values. Fig. 6 shows how the ARD works as a network loading indicator. In QAS-OBS, the sliding-window average ARD (ÂRD) is employed to indicate the network loading status. The figure shows that ÂRD is proportional to the wavelength loading. In addition, ÂRD depends on the loading status only and is largely independent of the MARD. This characteristic qualifies ÂRD as a good loading status indicator. After obtaining the results shown in Figs. 5 and 6, the ÂRD to per-class MARD mapping database can be derived following the steps presented in Section 4. The resulting database is shown in Fig. 7: the per-class MARD value that satisfies the QoS requirement has a one-to-one mapping with the monitored ÂRD.
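The ÂRD monitor can be realized, for instance, as in the sketch below. The paper states that an exponentially weighted sliding-window estimator is implemented; the specific smoothing form and the value of alpha here are our assumptions.

```python
# Exponentially weighted estimate of the actual recovery delay (ARD),
# used as the network loading indicator; alpha = 0.2 is a hypothetical
# smoothing weight for the newest sample.

class ARDMonitor:
    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha
        self.ard_hat = 0.0  # sliding-window average ARD

    def update(self, rtsd: float, efsd: float) -> float:
        ard = max(0.0, rtsd - efsd)  # recovery delay of the latest reservation
        self.ard_hat = self.alpha * ard + (1 - self.alpha) * self.ard_hat
        return self.ard_hat
```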

Fig. 7. ÂRD to per-class MARD mapping database

Fig. 8. Effects of Average Wavelength Loading on Data Loss Rate

Fig. 9. Effects of Average Wavelength Loading on Overall Throughput (normalized throughput vs. average wavelength loading, for the single-class case and different C0:C1 traffic mixes)


B. Performance Results
Fig. 8 shows that one of the QoS constraints, the proportional data loss ratio, is satisfied under different wavelength loading and traffic ratios. Three combinations with different traffic ratios of C0 to C1 are simulated for a given total traffic amount. For example, C0:C1 = 2:1 means that, for a given total traffic loading, 67% of the traffic belongs to class 0 and 33% to class 1. The data loss rate of each class increases as the average wavelength loading increases, while the data loss rate ratio between the two classes stays largely fixed and independent of the wavelength loading. Thus, one of the QoS requirements (a fixed data loss ratio between the two classes) is satisfied. In Fig. 9, the overall throughput of the multi-class case is compared with the classless (single-class) case. The overall throughput of the single-class case can be regarded as near-optimal, because every burst is scheduled to minimize blocking and maximize the overall throughput; it can therefore be considered an upper bound for the multi-class case. This is because the QoS constraint in the multi-class case may result in lower overall throughput, since some lower-class bursts are reserved with low reservation robustness, even though they could be transmitted, in order to keep the fixed data loss ratio. The simulation results show that the data loss rate and overall throughput in the multi-class case are very close to the performance of the single-class case. Thus, the other QoS performance requirement, maximizing the overall throughput, is satisfied. In Figs. 8 and 9, different combinations of class 1 and class 0 traffic affect the overall data loss rate. This is because, for a given traffic loading, the blocking probability cannot be made arbitrarily low, and lower-class traffic may be assigned low lightpath reservation robustness to keep the proportional data loss rate. When the lower-class traffic becomes dominant, the overall data loss rate decreases. In such cases, adjusting the blocking ratio may be needed.

6 Conclusion
In this paper, we presented QAS-OBS to provide proportional QoS differentiation for optical burst switching in wavelength-routed optical networks supporting grid applications. Adaptive burst reservation robustness control is implemented to provide service differentiation while maximizing network throughput. QAS-OBS simplifies the core network design by pushing all the QoS control to the edge nodes, which makes it easy to implement QAS-OBS in existing WRONs. The control heuristic is presented and its performance is evaluated through simulation. The results show that QAS-OBS satisfies the QoS constraints while maximizing the overall network throughput.

References
1. Yu, O., et al.: Multi-domain lambda grid data portal for collaborative grid applications. Elsevier Journal of Future Generation Computer Systems 22(8) (October 2006)
2. Simeonidou, D., et al.: Dynamic Optical-Network Architectures and Technologies for Existing and Emerging Grid Services. J. Lightwave Technology 23(10), 3347–3357 (2005)
3. Qiao, C., Yoo, M.: Optical burst switching (OBS) - A new paradigm for an optical internet. J. High Speed Networks 8(1), 69–84 (1999)


4. Xiong, Y., Vandenhoute, M., Cankaya, H.: Control Architecture in Optical Burst-Switched WDM Networks. IEEE J. Select. Areas Commun. 18(10), 1838–1851 (2000)
5. Duser, M., Bayvel, P.: Analysis of a dynamically wavelength-routed optical burst switched network architecture. J. Lightwave Technology 20(4), 574–585 (2002)
6. Yu, O., Xu, H., Yin, L.: Adaptive reliable optical burst switching. In: 2nd International Conference on Broadband Networks, October 3-7, 2005, vol. 2, pp. 1419–1427 (2005)
7. Yoo, M., Qiao, C., Dixit, S.: QoS Performance of Optical Burst Switching in IP-over-WDM Networks. IEEE J. Sel. Areas Commun. 18(10), 2062–2071 (2000)
8. Yoo, M., Qiao, C.: Supporting Multiple Classes of Services in IP over WDM Networks. In: Proc. IEEE Globecom 1999, pp. 1023–1027 (1999)
9. Chen, Y., Hamdi, M., Tsang, D.H.K.: Proportional QoS over OBS networks. In: Global Telecommunications Conference, GLOBECOM 2001, vol. 3, pp. 1510–1514 (2001)
10. Poppe, F., Laevens, K., Michiel, H., Molenaar, S.: Quality-of-service differentiation and fairness in optical burst-switched networks. In: Proc. SPIE OptiComm, Boston, MA, July 2002, vol. 4874, pp. 118–124 (2002)
11. Chen, Y., Turner, J.S., Zhai, Z.: Contour-Based Priority Scheduling in Optical Burst Switched Networks. Journal of Lightwave Technology 25(8), 1949–1960 (2007)
12. Farahmand, F., Jue, J.P.: Supporting QoS with look-ahead window contention resolution in optical burst switched networks. In: Global Telecommunications Conference, GLOBECOM 2003, December 1-5, 2003, vol. 5, pp. 2699–2703 (2003)
13. Kozlovski, E., Bayvel, P.: QoS performance of WR-OBS network architecture with request scheduling. In: IFIP 6th Working Conf. Optical Network Design and Modeling, Torino, Italy, February 4-6, 2002, pp. 101–116 (2002)
14. Loi, C.-H., Liao, W., Yang, D.-N.: Service differentiation in optical burst switched networks. In: IEEE GLOBECOM, November 2002, vol. 3, pp. 2313–2317 (2002)
15. Zhang, Q., et al.: Absolute QoS Differentiation in Optical Burst-Switched Networks. IEEE Journal of Selected Areas in Communications 22(9), 1781–1795 (2004)
16. Yang, L., Rouskas, G.N.: Optimal Wavelength Sharing Policies in OBS Networks Subject to QoS Constraints. IEEE Journal of Selected Areas in Communications 22(9), 40–50 (2007)
17. Xu, H., Yu, O., Yin, L., Liao, M.: Segment-Based Robust Fast Optical Reservation Protocol. In: High-Speed Networks Workshop, May 11, 2007, pp. 36–40 (2007)
18. Yu, O.: Intercarrier Interdomain Control Plane for Global Optical Networks. In: Proc. ICC 2004, IEEE International Communications Conference, June 20-24, 2004, vol. 3, pp. 1679–1683 (2004)
19. Yoo, M., Qiao, C.: Just-enough-time (JET): A high speed protocol for bursty traffic in optical networks. In: Proc. IEEE/LEOS Conf. on Technologies for a Global Information Infrastructure, August 26-27, 1997, pp. 26–27 (1997)

Principles of Service Oriented Operating Systems
Lutz Schubert and Alexander Kipp
HLRS - University of Stuttgart, Intelligent Service Infrastructures, Allmandring 30, 70569 Stuttgart, Germany
[email protected]

Abstract. Grid middleware support and the Web Services domain have advanced significantly over recent years, reaching a point where resource exposition and usage over the web has not only become feasible, but an actual commercial reality. Nonetheless, commercial uptake is still slow, and certainly not progressing the way other web developments have - mostly because usage is still complicated and restrictive. This paper discusses a new approach towards tackling grid-like networking across organisational boundaries and across different types of resources by moving main parts of the infrastructure to a lower (OS) level. This will allow more intuitive use of Grid infrastructures for current types of users.
Keywords: Service Oriented Architecture, distributed operating systems, Grid, virtualisation, encapsulation.

1 The “Grid”

Over recent years, the internet has become the most important means of communication and hence implicitly an important means for performing business and collaborative tasks. But the internet offers more potential than this: with ever-increasing bandwidth, the internet is no longer restricted to pure communication, but is also capable of exchanging huge, complex data sets within a fraction of the time once needed to fulfill such tasks. Likewise, it has become possible to send requests to remote services and have them execute complex tasks on one's own behalf, giving the user the possibility to outsource resource-intensive tasks. Recently, web services have found wide interest both in the commercial and the academic domain. They offer the capability to communicate in a standardised fashion across the web to invoke specific functionalities as exposed by the respective resource(s), so that from a pure developer's perspective the corresponding capabilities can be treated just like a (local) library extension. In principle, this would allow the generation of complex resource sharing networks that build a grid-like infrastructure over which distributed processes can be enacted. However, the concept fails for multiple reasons:


– notwithstanding standardised messaging, each provider exposes its own types of interfaces, which are hardly ever compatible with another provider's interfaces, even when essentially providing the same functions;
– additional capabilities are necessary in order to support the full scope of business requirements in situations such as joint ventures, in particular with respect to legal restrictions and requirements;
– exposing capabilities over the web, and adapting one's own processes to cater for the corresponding requirements, is typically very time- and effort-consuming;
– a high degree of technical capability is required of the persons managing services and resources in a Virtual Organisation.
Current Grid-related research projects therefore focus strongly on overcoming these deficiencies by realising more flexible interfaces and more intuitive means of interaction and management. One of the most sophisticated projects in this area is certainly BREIN [1], which tries to enhance typical Grid support with a specific focus on simplifying usage for the human behind the system. It therefore introduces extensions from the Agent, Semantic and AI domains, as shall be discussed in more detail below. Whilst such projects hence enrich the capabilities of the system, it remains questionable whether this is not a sub-efficient solution, given that the underlying infrastructure suffers from the additional messaging overhead and that the actual low-level management issues have not essentially been simplified, but mostly just moved further away from the end user. This paper presents an alternative approach to realising a grid-enabled networking infrastructure that significantly reduces the administrative overhead on the individual user. The approach builds strongly on the concept of distributed operating systems, as already envisaged e.g. by Andrew S. Tanenbaum [2]. We will demonstrate how modern advances in virtualisation and encapsulation can be taken to a lower (OS) layer, which will not only allow cross-internet resource usage in a more efficient way, but will also enable new forms of more complex applications running on top of it, exceeding current Grid solutions. We will therefore examine the potential of what is commonly called “the grid” (section 2), how a low-level, OS-embedded Grid system would look (section 3) and how it could be employed in collaborative environments (section 4). Based on this, we will then derive requirements towards future network support for such a system (section 5) and conclude with a summary and outlook (section 6).

2 Low level Grids?

All “Grid” definitions have in common that remote resources are used as, respectively instead of, local ones, so as to enhance or stabilise the execution of complex tasks; see e.g. the EGEE project [3]. What, though, is the advantage on the Operating System layer? Currently, more and more services offered over the internet are, in essence, at a low hardware level: storage can be rented on the web that grants access from all over the world;


computational platforms can be used to reduce the local CPU usage; printing services expose capabilities to print high-quality documents and photos; remote recording services store data, burn it to a disk and send it by mail; and so on.
Interoperability & Usability. Using these capabilities currently requires the user to write code in a specific manner, integrating public interfaces, making use of APIs and/or using browser interfaces over which to up- and download data. As opposed to this, local resources are provided to the user via the OS's typical interfaces, such as Explorer, which are generally more intuitive to use and in particular use OS-layer drivers to communicate with the devices, rather than leaving it up to the user to make the resources usable. However, low-layer resources in particular, such as the ones mentioned above, could in principle be treated in the same fashion as local resources, at the cost of communication delays. We will treat the communication problem in more detail below.
Resource Enhancement. The obvious benefit of integrating remote resources consists in the enhancement of local resources, e.g. to extend storage space, but also to add computing power, as we will discuss below. In particular on the level of printers, disk burners and storage extensions, remote resources are already in use, but with the drawbacks described above.
Stability and Reliability. By using remote resources for backup and replication features, the reliability of overall process execution and data maintenance can increase significantly. There are already backup solutions duplicating disk storage onto a resource accessible over the web, and systems such as the one by Salesforce [4] allow for stable, managed execution of code using local replication techniques (cf. [5], [6]). However, none of them are intuitive to use or allow for interactive integration. Tight integration on the OS layer would hence allow making use of remote resources in a simple fashion, without having to care about interoperability issues.

3 Building up the Future Middleware

Web Service and Grid middlewares typically make use of HTTP communication with the SOAP protocol on top of it, which renders interactions rather slow and thus basically useless at a level that wants to make full use of the resources. Classical Grid applications mostly assume decoupled process execution, where interaction only takes place to receive input data and to return, respectively forward, operation results. Even “fast” Grid networks build more on dynamic relationships than on actual interactivity [7], [8], [9]. There have hence been multiple approaches to improve message throughput, such as SOAP over TCP and XML integration into C++ [10], [11], or standard binary encoding of SOAP messages [12], [13]. None of these approaches, however,


can overcome the full impact of the high SOAP overhead, thus rendering lower-level resource integration nearly impossible.
3.1 A Model for Service Oriented Operating Systems

As has been noted, the main feature of Grid / Web Service oriented middlewares consists in provider-transparent resource integration for local usage. Notably, higher-level capabilities are relevant for distributed process enactment, but from each provider's local perspective the same statement holds true. On the operating system level, resources cannot be treated as transparently as on the application level, as the resources, their location and the means of transaction have to be known in full detail - implicitly taking this burden away from the user. In a classical OS approach, the operating system would just provide the means of communication and leave it up to higher layers to take care of identification and maintenance of any remote resource.

Fig. 1. Schema of a service oriented, distributed operating system

Figure 1 shows a high-level view of a Service Oriented OS and its relationships to resources in its environment: from the user's perspective the OS spans all resources, independent of their location etc., and represents them as locally integrated. In order to realise this, the main operating instance provides common communication means on the basis of an enhanced I/O management system. We distinguish three types of OS instances, depending on usage and location: The Main Operating Instance represents the user side - as opposed to all other instances, this one maintains resources in the network and controls their usage. It performs the main OS tasks, such as process maintenance and in particular virtual memory management. All MicroKernel instances are extensions to the local resources and are treated as such. The system therefore does not promote a flat grid hierarchy where all nodes can be treated as extensions to each other.


The Embedded MicroKernel is similar to the classical extension of a networked OS: it is actually a (network) interface to the device controller - depending on the type of resource, such control may range from a simple serial/parallel interface, e.g. for printers and storage, up to standalone job and resource management, e.g. if the resource is a computer itself that can host jobs. The Standalone MicroKernel is an extension to the Embedded MicroKernel and allows the same functions to run on top of a local operating system, i.e. it acts as an application rather than a kernel. As such it is principally identical to current Grid middleware extensions, which run as normal processes. Leaving performance aside (in fact, performance would suffer to the point where hardly any reasonable usage of the system would be possible), local and remote resources could in principle be treated in the same fashion, i.e. using common interfaces and I/O modules. We will use this simplification in order to provide a sketch of the individual operating system instances and their main features as they differ from classical OS structures.

Fig. 2. Sketch of the individual kernel models
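As a purely illustrative sketch of such a common interface - the class names and methods below are our assumptions, not part of the paper - the main operating instance could address local devices and remote MicroKernels through one uniform facade:

```python
# Hypothetical uniform I/O facade: the main operating instance addresses a
# locally attached device and a remote (embedded/standalone) MicroKernel
# through the same interface, hiding the controller logic behind it.

from abc import ABC, abstractmethod

class ResourceInterface(ABC):
    @abstractmethod
    def read(self, address: int, length: int) -> bytes: ...
    @abstractmethod
    def write(self, address: int, data: bytes) -> None: ...

class LocalDeviceResource(ResourceInterface):
    """Locally attached resource, here simulated by an in-memory store."""
    def __init__(self, size: int):
        self._store = bytearray(size)
    def read(self, address, length):
        return bytes(self._store[address:address + length])
    def write(self, address, data):
        self._store[address:address + len(data)] = data

class MicroKernelResource(ResourceInterface):
    """Remote resource behind an embedded or standalone MicroKernel; the
    requests would be marshalled over the network to the device controller."""
    def __init__(self, endpoint: str):
        self.endpoint = endpoint  # e.g. an IPv6 address of the MicroKernel
    def read(self, address, length):
        raise NotImplementedError("forward to the remote device controller")
    def write(self, address, data):
        raise NotImplementedError("forward to the remote device controller")
```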

The main issue with respect to building a grid-like Service Oriented OS consists in identifying the required connectivity between building blocks and resources, and thus candidates for outsourcing and remote usage. It is obvious that actual process management, execution, virtual memory access and all executable code need to be accessible via a local cache; these capabilities cannot be outsourced without, as a consequence, the operating system failing completely. The key to all remoting consists in the capability of virtualising devices and resources on a low, operating-system level. Virtualisation techniques imply an enhanced interface that exposes the functionalities of the respective resource in a common structure [14] understandable to the operating system. Accordingly, such interfaces act as an intermediary between the device controller and the respective communication means, and as such hide the controller logic from the main operating instance. Using pre-defined descriptions for common devices, this technique not only increases the “resource space”, but also reduces the negative impact that device drivers typically inflict upon the operating system. Instead, maintenance of the resource is completely separated from the operating system and user context.


In the following sections, we will describe in more detail how storage and computational resources in particular are integrated into and used by a Service Oriented Operating System.
3.2 Memory and Storage Resources

We may distinguish three types of storage, depending on speed and availability:
– local RAM: the fastest and most available type of storage for the computer. It is typically limited in size and is reserved for the executable code and its direct execution environment;
– local storage, such as the computer's hard drive, which is already used as an extension of the local RAM for less frequent access or as a swapping medium for whole processes. It also serves as pure file storage;
– networked storage, which typically extends the local file storage and only grants slow access, sometimes with limited reliability, respectively availability.
All these resources together form the storage space available to the user. Classical operating systems distinguish these types at the application layer, thus leaving selection and usage to the user. With the Service Oriented Operating System, the whole storage is virtualised as one virtual memory, which the OS uses according to the processes' requirements, such as speed and size. In addition, the user will only be aware of the disk space distributed across local and remote resources. This implies that there is no direct distinction between local RAM and remote, networked storage. A process' variables, and even parts of the code itself, may be stored on remote resources if the right conditions apply (see computational resources below). By making use of a “double-blind” virtualisation technique, it is ensured that a provider can locally maintain his/her resources at his/her own discretion (i.e. most likely according to the operating system's requirements) without being impacted by the remote user. In other words, remote access to local resources is treated identically to local storage usage, with the according swapping and fragmenting techniques, and without the remote user's OS (the main instance) having to deal with changes in allocation etc. In the case of virtual memory access, “double-blindness” works as follows (cf. Figure 3): the calling process maintains its own local memory address range (VMem_proc), which on the global level is mapped to a globally unique identifier (VMem_global). Each providing resource must convert this global identifier to a local address space that reflects the actual physical storage in the case of embedded devices, or the reserved storage in the case of OS-driven devices (Mem_local). A sketch of this mapping follows after Figure 3.
Measuring Memory Usage. The Service Oriented Operating System monitors the memory usage of individual processes in two ways: depending on the actual usage in the code (the usage context), and according to the access frequency of specific memory parts. Accordingly, performance and memory usage will improve over time, as the process behaviour is continuously analysed.


Fig. 3. Double blind memory virtualisation
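To make the mapping of Figure 3 concrete, the following minimal sketch models the three address spaces; the data structures and identifier format are our illustrative assumptions.

```python
# Double-blind mapping: process-local range (VMem_proc) -> globally unique
# identifier (VMem_global) -> provider-local address (Mem_local). The user
# side never sees provider-local addresses, and the provider never sees
# process-local ones.

import itertools

_global_ids = itertools.count(1)

class UserSideMapper:
    """Main operating instance: maps process address ranges to global ids."""
    def __init__(self):
        self.proc_to_global = {}
    def allocate(self, proc_range: tuple) -> int:
        gid = next(_global_ids)                # VMem_global
        self.proc_to_global[proc_range] = gid
        return gid

class ProviderSideMapper:
    """Providing resource: maps global ids to local storage. The provider may
    swap or defragment Mem_local freely without informing the remote user."""
    def __init__(self):
        self.global_to_local = {}
    def resolve(self, gid: int) -> int:
        # allocate lazily; the local address may change over time
        return self.global_to_local.setdefault(gid, 0x1000 * gid)
```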

Whilst the drawback of this consists in sub-optimal behaviour in the beginning, the advantage is that coders and users will not have to consider aspects related to remote resources. Notably, preferences may be stated per process, even though the operating system takes the final decision on resource distribution.
3.3 Computational Resources

Computational resources are, in principle, computers that allow remote users to deploy and run processes. Depending on the type of resource (cf. above), these processes are either under the control of the Service Oriented Operating System or of the hosting OS. Computing devices do not directly extend the pool of available resources in the way storage resources do: as has often been shown by parallelisation efforts and by multi-core CPUs, doubling the number of CPUs does not imply doubling the speed. More precisely, a CPU can typically take over individual processes, but not jointly participate in executing a single process. Parallel execution on current single-core CPUs is realised through process swapping, i.e. through freezing individual processes and switching to other scheduled jobs. The Service Oriented Operating System is capable of swapping across computational resources, thus allowing semi-shared process execution. By building up a shared process memory space, parts of the process can be distributed across multiple computational resources and still be executed in a linear fashion, as the virtual storage space builds a shared memory.
Measuring Process Requirements. The restrictions on storage usage apply in an even stronger form to process execution: each process requires a direct hosting environment including the process status, the passing of which is time-critical. Whilst the hosting environment itself is part of the embedded, respectively standalone, kernels, the actual process status (cache) is not. Therefore the Service Oriented Operating System analyses the relationship between process parts and memory usage in order to identify distributable parts.


Fig. 4. Invoking remote code blocks from a local interactive interface

Typical examples consist in applications with high user interaction on the one hand, but a set of computing-intensive background operations executed upon request on the other, such as CAD and rendering software, which requires high availability and interactivity during modelling but is decoupled from the actual rendering process, as sketched below.
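A hedged sketch of this pattern, in the spirit of Fig. 4: the interactive front end stays local, while a decoupled, computing-intensive block runs on a remote computational resource. The executor below merely stands in for the OS-level process swapping; it is not the paper's mechanism.

```python
# Interactive part runs locally and returns immediately; the decoupled,
# computing-intensive block (here: rendering) is handed to a stand-in for
# a remote computational resource sharing the virtual memory space.

from concurrent.futures import ThreadPoolExecutor

remote_executor = ThreadPoolExecutor(max_workers=1)  # stand-in for remote CPU

def render_scene(model: dict) -> bytes:
    """Computing-intensive operation, decoupled from user interaction."""
    return b"rendered:" + repr(sorted(model)).encode()

def on_user_request(model: dict):
    future = remote_executor.submit(render_scene, model)
    return future  # collected once the background block completes
```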

3.4 Other Resources

Other devices, such as printers, DVD writers or complex process enactors, all communicate with the main operating instance over the local OS instance (embedded / standalone) using a common I/O interface. Main control over the resource is retained by the provider, thus allowing for complete provider-side maintenance: not only can local policies be enforced (such as usage restrictions and prioritisation), but also the actual low-level resource control, i.e. the device drivers, is maintained locally, thus decoupling it from the user-side maintenance tasks. This obviously comes at the cost of less control from the user side and accordingly fewer means to influence the details of individual resources, as all are treated identically. As an example, a printer allowing holes to be punched into the printouts will not be able to expose this capability to the main operating instance, as it is not a commonly known feature - notably, the protocol could be extended to cater for device-specific capabilities too. However, it is the strong opinion of the authors that such behaviour is not always desirable. As with the original Grid, resources should be made available according to need, i.e. printers upon a print-out request, CPUs upon execution of a high number of computing-intensive processes, storage upon shortage etc. Typically, in these cases no special requirements are put forward to these resources, so that simple context-driven retrieval and integration is sufficient. We can assume without loss of generality that all additional requirements are specified on the application level.

4 Putting Service Oriented Operating Systems to Use

The benefits of such a system may not be obvious straight away, though its applications are manifold:


With remote resources being available at a low operating system layer, applications and data can be agnostic to their distribution across the internet (leaving issues of speed aside). This leads to a variety of use cases, which shall be detailed in this section:
4.1 Mobile Office

With code and data principally distributed across the internet, the execution of applications and their usage context becomes mostly independent of the runtime environment of the main system and its resources. Instead, the same application could be hosted and maintained remotely, whilst the actual user accesses the full environment from a local low-level computer with the right network connection, as only direct interfaces and interactive code parts may have to be hosted locally. Accordingly, an employee could in principle use the full office working environment from any location with appropriate web access (cf. also section 5) and with any device of his or her choice. By using the main machine to maintain the user profile and additional information, such as for identification, the user can also recreate the typical working environment's look and feel regardless of his or her current location. This will allow better integration of mobile workers in a company, thus realising the “mobile office”.
4.2 Collaborative Workspaces

The IST project CoSpaces [15] is examining means to share applications across the internet so as to allow concurrent access to the same data via the application, i.e. to collaboratively work on a dataset, such as a CAD design. One of the specific use cases of CoSpaces consists in a mechanical engineer trying to repair a defective aircraft without knowing the full design: through CoSpaces he will not only get access to the design data, but also may invite experts who have better knowledge about the problem and the design than he does. Using a common application, the experts can together discuss the problem and its solution, and directly communicate the work to be done in order to fix it. This implies that all people involved in the collaboration can see the same actions, changes etc., but can also concurrently work on the data. With a Service Oriented Operating System, application sharing becomes far simpler: as the actual code may be distributed across the web, local execution will require only the minimal interfaces for interaction, whilst the actual main code may be hosted remotely. With the right means to resolve conflicting actions upon both code and data, and with network-based replication support (cf. section 5), it will be possible to grant concurrent access to common code and parameter environment parts, thus effectively using the same application across the network in a collaborative fashion. CoSpaces is currently taking a web-service-based approach, which does not provide the data throughput necessary for such a scenario, but is vital for elaborating the detailed low-level protocols and the high-level management means which would have to be incorporated into the OS to meet all ends.


4.3 Distributed Intelligent Virtual Organisations

A classical use case for distributed systems consists in so-called “Virtual Organisations” (VOs), where an end user brings together different service providers so that they can collectively realise a task none of them would have been able to fulfil individually. Typically, this involves the execution of a cross-organisational workflow that distributes and collects data, respectively products, from the individual participants, thus generating a larger product or result. Currently, VO support middlewares and frameworks are still difficult to use and typically require a lot of technical expertise from the respective user, ranging from deployment issues over policy descriptions to specification integration and application / service adaptation. And hardly ever are the overall business policies and goals of the user respected or even taken into consideration. The IST project BREIN [1] is addressing this issue from a user-centric point of view, i.e. trying primarily to make the use of VO middleware systems easier for the user by adding self-management and intelligent adaptation features as a kind of additional management layer on top of the base support system. This implies enhancements to the underlying infrastructure layer that is responsible for actual resource provisioning and maintenance. With a Service Oriented Operating System, remote resources can be linked on an OS level as if locally available - accordingly, execution of distributed workflows shifts the main issues related to discovery, binding and maintenance down to the lower layer. This allows the user to generate execution processes over resources in the same way as executing simple batch programs: by accessing and using local resources and not caring about their details. User profiles may be exploited to define discovery and integration details, which again leads to the problem of the technical expertise needed to fully exploit the respective capabilities. Future versions of the SOS may also consider more complex services as part of the low-level resource infrastructure. The BREIN project will deliver results with respect to how the according low-level protocols need to be extended and in particular how the high-level user interface of the OS needs to be enhanced to allow optimal representation of the user’s requirements and goals.

5 Future Network Requirements

Like most web-based applications, the Service Oriented Operating System, too, profits most from an increase in speed, reliability and bandwidth of the network. The actual detailed requirements are defined more by the respective use case than by the overall system as such - however, one will notice that there is a set of base requirements and assumptions recurring in recent network research areas, which in fact can be approached on different levels and with different technologies. Most prominent of these are obviously virtualisation technologies, which hide the actual context and details of the resources and endpoints from the overlaying system (cf. [1], [16], [17]). This does not only allow for easier communication management, but in particular for higher dynamicity and mobility, as the full messaging details do not need to be communicated to all the endpoints and protocols do not need to be renegotiated every time.

Other issues along the same line involve secure authentication, encryption and in particular unique identification under highly dynamic circumstances, such as in mobile environments. Intermediary servers and routers may be used for caching and / or as resources themselves, but should not be overloaded with additional tasks of name and location resolution, in particular if the latter are subject to change. With the current instability of network servers, the overhead for (re)direction reliability just diminishes the efficiency of the system. Resources can be treated both as consumers and providers in a SOS network, meaning that hosting (serving) and consumption of messages need to be treated at the same level, i.e. consumers need to be directly accessible without intermediary services. The IPv6 protocol [18] offers most of the base capabilities for stable direct consumer access under dynamic conditions. What is more, to increase the performance of the system, context information related to the connectivity and reliability of the respective location needs to be taken into consideration to ensure stable execution on an Operating System level. Protocols such as SIP allow the communication of location-sensitive information [19], but there is as yet no approach to exploit network-traffic related quality information in the context of traffic control and automatic protocol adaptation via location-specific data, which is heuristically sufficient to indicate average performance issues. In the following we elaborate the specifics in the context of the individual use cases in some more detail:

Mobile Offices do not put forward requirements with respect to concurrent access (unlike the other scenarios) but are particularly sensitive to performance and reliability issues, as the main goal tends towards distributed application execution with little to no performance loss. We furthermore need to distinguish between access to local (intranet) and remote resources. Working towards uniformity in both usage and production, both connectivity types should make use of the same physical network, ideally over a self-adaptive protocol layer. This would reduce the need for dedicated docking stations. As we will see, the security requirements in this scenario are of lower priority than they appear to be: even though the corporate network may grant secure access to the local data structures, authentication and encryption do not require the same dynamicity support as in the other scenarios.

Collaborative Workspaces share access to common applications and need to distinguish between two types of data, even though they are identical from the application point of view: user-specific and shared information. Though access is performed identically, the network may not grant cross-user access to private information, whilst shared data must be subject to atomic transactions to reduce potential conflicts. Note that the degree of transaction freedom may be subject to individual roles’ privileges.


Accordingly, the network needs to provide and maintain role- and profile-specific information that can be used for secure authentication, access restriction and in particular for differentiation between individual code-data-user relationships. Further to this, replication can be supported through low-level multiplexing according to context-specific information, thus reducing the management overhead for the operating system. In other words, extended control mechanisms of the higher layer may be exploited to control replication on the network layer. By making use of unique dynamic identifiers as discussed above, this would allow usage-independent access to data even in a replicated state and with according dynamic shifting of this replicated data.

Distributed VO Operations do maintain access restrictions according to dynamic end-user profiles, but typically do not share applications. Accordingly, most relevant for this scenario is data maintenance across the network, i.e. conflict management and atomic transactions to reduce the impact of potential conflicts.

6 Summary and Implications

We introduced in this paper a next generation operating system that realises the grid on a low OS level, thus allowing for full integration and usage of remote resources in a similar fashion to local ones. With such an operating system, management and monitoring of resources, workflow execution over remote resources etc. are simplified significantly, leading to more efficient grid systems, in particular in the areas of distributed computing and database management. Obviously, such an operating system simplifies resource integration at the cost of flexibility and standardised communication on this low layer. In other words, all low-level control is passed to the OS and influence by the user is restricted. For the average user, this is a big advantage, but it increases the effort of kernel maintenance in case of required updates from the standardisation side. As it stands so far, the Service Oriented Operating System does not cater for reliability issues, i.e. it treats resources according to their current availability - this means that the Operating System will shift jobs and files to remote resources independently of potential reliability issues. Accordingly, code and data blocks may become invisible during execution and even fail, leading to unwanted behaviour and system crashes. Different approaches towards increasing reliability exist, ranging from simple replication to device behaviour analysis. The approach will differ on a case-by-case basis, as it will depend on the requirements put towards the respective processes and / or data. Notably, the system can in principle extend local execution reliability when replicating critical processes. Along that line it is not yet fully specified how these requirements should be defined and on what level - it is currently assumed that the user or coder specifies these directly per process or file. However, in particular for commonly used devices (printers, storage etc.) it may be sensible to define and use user profiles that represent the most typical user requirements under specific circumstances and could thus serve as the base setup for most resources.


As the system is still in its initial stage, no performance measurements exist as yet that could quantify the discussions above - in particular since reliability issues have not been covered so far. This would also involve measurements under different network conditions, i.e. bandwidth, reliability etc., in order to enable a more precise definition of the requirements towards future networks.

Acknowledgments. This work has been supported by the BREIN project (http://www.gridsforbusiness.eu) and the CoSpaces project (http://www.cospaces.org) and has been partly funded by the European Commission’s IST activity of the 6th Framework Programme under contract numbers 034556 and 034245. This paper expresses the opinions of the authors and not necessarily those of the European Commission. The European Commission is not liable for any use that may be made of the information contained in this paper.

References

1. BREIN: Business objective driven reliable and intelligent grid for real business, http://www.eu-brein.com
2. Tanenbaum, A.S.: Modern Operating Systems. Prentice Hall PTR, Upper Saddle River (2001)
3. EGEE: Enabling grids for e-science, http://www.eu-egee.org
4. SalesForce: force.com - platform as a service, http://www.salesforce.com
5. Mitchell, D.: Defining platform-as-a-service, or PaaS (2008), http://bungeeconnect.wordpress.com/2008/02/18/defining-platform-as-a-service-or-paas
6. Hinchcliffe, D.: The next evolution in web apps: Platform-as-a-service, PaaS (2008), http://bungee-media.s3.amazonaws.com/whitepapers/hinchcliffe/hinchcliffe0408.pdf
7. Wesner, S., Schubert, L., Dimitrakos, T.: Dynamic virtual organisations in engineering. In: Computational Science and High Performance Computing II. Notes on Numerical Fluid Mechanics and Multidisciplinary Design, vol. 91, pp. 289–302. Springer, Berlin (2006)
8. Wilson, M., Schubert, L., Arenas, A.: The TrustCoM framework v4. Technical report (2007), http://epubs.cclrc.ac.uk/work-details?w=37589
9. Jun, T., Ji-chang, S., Hui, W.: Research on the evolution model for the virtual organization of cooperative information storage. In: International Conference on Management Science and Engineering, ICMSE 2007, pp. 52–57 (2007)
10. van Engelen, R.A., Gallivan, K.: The gSOAP toolkit for web services and peer-to-peer computing networks. In: Proceedings of the 2nd IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2002), pp. 128–135 (2002)
11. Head, M.R., Govindaraju, M., Slominski, A., Liu, P., Abu-Ghazaleh, N., van Engelen, R., Chiu, K., Lewis, M.J.: The GSI plug-in for gSOAP: Enhanced security, performance, and reliability. In: ITCC 2005, pp. 304–309 (2005)
12. Ng, A., Greenfield, P., Chen, S.: A study of the impact of compression and binary encoding on SOAP performance. In: Schneider, J.G. (ed.) The Sixth Australasian Workshop on Software and System Architectures: Proceedings, pp. 46–56. AWSA (2005)
13. Smith, J.: Inside Windows® Communication Foundation. Microsoft Press (2007)
14. Nash, A.: Service virtualization - key to managing change in SOA (2006)
15. CoSpaces: CoSpaces project website, http://www.cospaces.org
16. 4WARD: The 4WARD project, http://www.4ward-project.eu
17. IRMOS: Interactive realtime multimedia applications on service oriented infrastructures, http://www.irmosproject.eu/
18. Hinden, R., Deering, S.: IP Version 6 Addressing Architecture (RFC 2373). Technical report, Network Working Group (1998)
19. Polk, J., Rosen, B.: Session Initiation Protocol Location Conveyance. Technical report, SIP Working Group (2007)

Preliminary Resource Management for Dynamic Parallel Applications in the Grid

Hao Liu, Amril Nazir, and Søren-Aksel Sørensen

Department of Computer Science, University College London, United Kingdom
{h.liu,a.nazir,s.sorensen}@cs.ucl.ac.uk

Abstract. Dynamic parallel applications such as CFD-OG impose a new problem for distributed processing because of their dynamic resource requirements at run-time. These applications are difficult to adapt to the current distributed processing model (such as the Grid) due to the lack of an interface for them to directly communicate with the runtime system, and due to the delay of resource allocation. In this paper, we propose a novel mechanism, the Application Agent (AA), embedded between an application and the underlying conventional Grid middleware to support dynamic resource requests on the fly. We introduce AA's dynamic process management functionality and its resource buffer policies, which efficiently store resources in advance to maintain the execution performance of the application. To this end, we introduce the implementation of AA.

Keywords: resource management, dynamic parallel application, resource buffer.

1 Introduction

The Grid is commonly used to submit batch and workflow type jobs. However, many scientific applications, such as those in astrophysics, mineralogy and oceanography, have several distinctive characteristics that differ from batch and workflow type applications. Some of them are so-called dynamic parallel applications, allowing significant changes to be made to the structure of the datasets themselves when necessary. Consequently they may require resources dynamically during the execution to meet their performance benchmarks. CFD-OG (Computational Fluid Dynamics - Object Graph) is one example in that respect. Such applications are hardly supportable by the current distributed processing model, which needs knowledge of application execution behavior and resource requirements prior to execution. Our strategy is to introduce an agent that enables a running Grid application to negotiate the computational resources assigned to it at run-time. The agent provides an interface for the application to call for resources on the fly while it communicates with the Grid middleware to allocate resources that satisfy these requests on-demand. In this paper we introduce the dynamic parallel applications demonstrated by CFD-OG and the necessity of external resource management for them. We propose a novel mechanism, the Application Agent (AA), embedded between the applications and


conventional Grid middleware for resource management. We also propose two resource buffer policies to maintain the application performance cost-effectively while the resource demands change constantly.

2 Dynamic Parallel Applications

Computational Fluid Dynamics (CFD) is one of the branches of fluid mechanics that uses numerical methods and algorithms to solve and analyze problems that involve fluid flows. One method to compute a continuous fluid is to discretize the spatial domain into small cells to form a volume mesh or grid, and then apply a suitable algorithm such as the Eulerian methodology [18] to solve the equations of motion. The approach assumes that the values of physical attributes are the same throughout a volume. If scientists need a higher resolution of the result, they have to replace a volume element with a number of smaller ones. Since the physical conditions change very rapidly, high resolution is needed dynamically to represent the flux (amount per time unit per surface unit) of all the physical attributes with sufficient accuracy. Consequently the whole volume grid is changing constantly, which may lead to dynamic resource requirements (Fig. 1).

Fig. 1. A series of slides of the flow across a hump. The grid structure is constantly changing at run-time to adjust the reasonable resolution. The resource requirements in the last slide are around 120 times those in the first slide [12].

The usual method for introducing high resolution is to replace a volume element with a number of smaller ones. This is a difficult process in traditional CFD because it uses a Cartesian grid of indexed volumes (V_ijk). To introduce smaller volumes into such a structure would require further hierarchical indexing, leading to complex treatment. An alternative approach, introduced by Sørensen, is the Object Graph (CFD-OG) [12]. A CFD-OG application has two base objects: cells and walls. A cell represents the fluid element and holds the physical values for a volume. It is surrounded by a number of wall objects known to it by their addresses (pointers); a cell simply holds a number of wall pointers. The walls, on the other hand, only know two objects: the cells on either side of the wall (see the sketch at the end of this section). This is a simple structure that is completely unaware of the physical geometry and topology of the model. The advantage of the object graph is that it uses reference by addressing only. It is therefore possible to change the whole grid topology when smaller volume elements are introduced. This approach also benefits distributed processing, since it is easy to substitute the local addresses (pointers) with global addresses (host, process, pointer) without changing the structure. That being the case, the objects can be distributed over the network computer nodes in any manner.

With the object graph, a distributed CFD-OG application can be very dynamic and autonomic. The application is composed of a number of processes executing synchronously, and normally one process runs on one node. In this context, each process holds a number of computational objects (cells and walls) that can migrate from one process to another. As the application progresses, a built-in physical manager monitors the local conditions. If it detects that the local resolution is too low, it will ask the built-in object manager to introduce smaller volumes. If, on the other hand, the built-in physical manager detects superfluous resolution, it asks to replace several smaller cells with fewer larger ones. This may create imbalances in the processing, and the object manager may subsequently attempt to balance this by moving objects onto lightly loaded nodes. Such load balancing is limited. The application may have to demand additional resources (processors) immediately to maintain the required performance, such as a certain execution time progression. Likewise, the application may want to release redundant allocated resources when a lower resolution is acceptable. Such dynamic resource requirements on the fly are a huge challenge for the current distributed environment. This is because there is no well defined interface for applications to communicate with the Grid to add/release resources at run-time; furthermore, a significant delay is normally associated with the allocation of a resource in response to a request, due both to resource competition between running jobs and to the time conflict of concurrent requests, which will influence the smooth execution of the application. For the first problem, we define a mechanism with several functions that applications can call to add and release resources during execution in a way that is independent of resource managers. For the second problem, we propose resource buffer management to hide the delay of resource allocation.
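As an illustration of the object-graph structure described above, the following minimal C++ sketch shows cells holding wall pointers, walls knowing their two neighbouring cells, and a global address triple for distribution; all type and field names are our own assumptions, not taken from the CFD-OG implementation:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical global address: (host, process, local pointer), so an object
// can be referenced across nodes without changing the graph structure.
struct GlobalAddr {
    uint32_t host;     // node identifier
    uint32_t process;  // process identifier on that node
    void*    local;    // address within that process
};

struct Wall;  // forward declaration

// A cell holds the physical values of one volume and the addresses of the
// walls that surround it; it knows nothing about geometry or topology.
struct Cell {
    std::vector<double> physicalValues;  // e.g. density, momentum, energy
    std::vector<Wall*>  walls;           // surrounding walls, by address only
};

// A wall knows exactly two objects: the cells on either side of it.
struct Wall {
    Cell* left;
    Cell* right;
};
```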

3 Dynamic Resource Support

As introduced, a dynamic parallel application is composed of a number of processes, each of which holds a number of computational objects that can be migrated from one process to another by the application itself. Therefore, when the application needs an additional resource, it requests to deploy a new process into which it can move objects. The resource requirement is thus interpreted as adding a new process with no objects initially. We define a mechanism called the Application Agent (AA) that sits between applications and Grid middleware to support such dynamic addition/release of processes. Programmers must build applications based on the programming interface provided by AA. Each process’s binary file therefore has to be compiled with the AA library. That means each running process is associated with an AA which keeps information synchronized across the whole application. Programmers can perform three basic operations inherited from AA: add a process, stop a process, and exchange messages with a process. As soon as the application is running, AA will start and act as an agent


to communicate with the Grid environment on behalf of the application, i.e. to find a new node and deploy a new process, stop a process and return a node, and transport messages during the execution of the application. The process requests are served by a scheduler that is associated with AA. The scheduler can apply different scheduling policies (e.g. FCFS (First Come First Served), priority-based) according to the preference of the application. We do not address the scheduling issue in this paper.

3.1 Adding a Process

AA allows the application programmer to start a named process at any time during the execution. The process request is performed using AddProcess( Name of executable ), which returns an integer ID that is subsequently used to address the process. The function AddProcess() is non-blocking. As soon as it is invoked, AA will check if its Resource Buffer (RB) holds an additional prepared process (an idle process that has been deployed on a new node previously). If so, it immediately returns the process ID to the application, which can activate this process for migrating objects. This is called a ”successful request”. If not, it will contact the underlying Grid scheduler such as Condor [3], SGE [1] or GRAM [7] to request an allocation, simultaneously returning the integer 0 to the application. This is called a ”failed request”. As in a non-dedicated Grid environment the amount of time it takes for a process to be allocated is not bounded, it is the programmer’s responsibility to request a process again if the previous request has failed (see the sketch below). Once the new process is deployed, AA will store the ID of the process in the RB and return it to the application as soon as the next request is performed.

One problem with the dynamic addition of processes is that current Grid schedulers do not support dynamic resource co-allocation and deployment. They treat each later-added process as an independent single job and do not deploy it so that it can communicate with the other processes of the parallel application. One approach (approach A) to this problem is that once the process is allocated by a scheduler, the AA that is associated with this new process must broadcast its address to the other, older processes so that they can communicate. This is achieved by registering this process’s address in the RB, which is synchronized throughout the application. The new process is assigned a unique global id for further use by the application during the registration. This approach can be implemented with general socket programming or based on a distributed computing framework (e.g. CORBA [10], Jini [11]). An alternative approach (approach B) is to use probes for resource allocation while the actual process startup is accomplished by AA itself. A probe is a small piece of deployment code that does not perform any computation for the application. The probe runs until AA kills it, in order to hold that node. Once the probe is allocated by the scheduler, it configures the node for AA use and notifies the master AA, which takes charge of process spawning, of its node address. Then the master AA transfers the process binary onto that node and starts the process. This approach can make full use of existing process management systems (e.g. PVM [6], LAM/MPI [2]) that will return a process identifier for the application to address the process. Once the new process is deployed, the object manager of the application can then migrate the target objects into this process for load balancing. The migration procedure is beyond the scope of this paper.
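A minimal sketch of the non-blocking AddProcess() idiom from the application side, assuming a hypothetical C++ binding for the AA library (the AA struct and migrateObjects helper are our own illustration; the paper specifies only the semantics: a positive ID on a successful request, 0 on a failed one):

```cpp
#include <string>

// Hypothetical AA binding; only the semantics of AddProcess()/StopProcess()
// are taken from the paper.
struct AA {
    int AddProcess(const std::string& executable);  // non-blocking; 0 = failed
    void StopProcess(int id);
};

void migrateObjects(int pid);  // hypothetical application-level balancing step

// Retry idiom: since the allocation delay is not bounded, the application
// simply re-requests on a later iteration if the previous request failed.
int acquireExtraProcess(AA& aa) {
    int pid = aa.AddProcess("cfd_worker");
    if (pid == 0) {
        // "Failed request": AA has forwarded an allocation request to the
        // Grid scheduler and will buffer the process once it is deployed;
        // the caller retries in a subsequent iteration.
        return 0;
    }
    // "Successful request": a prepared process came straight from the RB,
    // so the application can migrate computational objects into it now.
    migrateObjects(pid);
    return pid;
}
```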


3.2 Stopping a Process

Programmers may want to stop a process and release the node when the application’s resource requirement is low. The application first autonomically vacates the process (computational object migration), then invokes the function StopProcess( id ). This causes the process with the given ID to be removed from the application and the related computer node to be disassociated. However, AA may still reserve this prepared process for possible future requests from the application. The decision is made according to AA’s buffer policies (we discuss them in the next section). If AA does decide to release the node, it contacts the local scheduler to kill the vacated process and finally returns the node to the pool. A sketch of this release path follows below.

3.3 Communication

Technically, AA could either have a new communication mechanism if using approach A for adding processes, or use PVM/MPI as the lower communication service if using approach B. In the latter case, programmers can still use PVM/MPI routines for communication, and use AA’s routines for process management.
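The release path, sketched under the same hypothetical binding as above (vacateObjects is our own placeholder for the application's object migration step):

```cpp
void vacateObjects(int pid);  // hypothetical: migrate all objects away first

// Release a surplus process: the application empties it, then hands it back.
// Whether the node is actually returned to the pool is decided inside AA by
// its Resource Buffer policy; the process may instead be kept as "prepared".
void releaseSurplusProcess(AA& aa, int pid) {
    vacateObjects(pid);   // application side: process must hold no objects
    aa.StopProcess(pid);  // AA detaches it; the RB policy decides node release
}
```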

4 Resource Buffer

In non-dedicated Grid environments the amount of time it takes until a process is allocated is not bounded. In order to satisfy process addition requests on-demand at run-time, we propose the Resource Buffer (RB) to hide the delay of process allocation. An RB stores a number of prepared processes that can be returned to the application immediately when it calls AddProcess(). AA manages the RB by requesting the allocation of processes from schedulers in advance, so as to satisfy the application demands based on the RB policies. In order to measure the RB performance, we propose two simple metrics: Request Satisfaction S and Resource Waste W. Since process release requests do not benefit from the policies, the Request Satisfaction is defined as the percentage of successfully added processes out of the total number of process addition requests. If S approaches 1, it indicates that the dynamic application can run smoothly, since all dynamic resource requests are satisfied on-demand. The Resource Waste is defined as the accumulated total time that prepared processes spend in the RB:

W = \sum_{t=0}^{T_{exe}} R_t

where T_{exe} is the total execution time and R_t is the number of idle processes in the RB at time t. If the RB policy is good enough, W should approach 0 while S approaches 1 (a sketch of how both metrics can be tracked is given after the list below). In order to investigate the RB policies, we made a few assumptions regarding the dynamic parallel application:

1. It is a parallel iterative application.
2. Its dynamic behavior is not random. Similar to CFD-OG, the application only requests resources at some stages (e.g. during iterations 100 ~ 110) when the physical conditions change.
3. It only requests to add processes, not to release processes.
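A minimal sketch of how S and W can be accumulated during a run, directly from their definitions above (the class and member names are our own):

```cpp
// Tracks Request Satisfaction S and Resource Waste W as defined above.
// W is accumulated per unit of simulated time as the number of idle
// prepared processes currently sitting in the Resource Buffer.
struct RBMetrics {
    long requests  = 0;   // total AddProcess() calls
    long satisfied = 0;   // calls answered immediately from the RB
    double waste   = 0.0; // W = sum over time of idle RB processes

    void onAddProcess(bool servedFromBuffer) {
        ++requests;
        if (servedFromBuffer) ++satisfied;
    }
    // Called once per time step t with the current RB occupancy R_t.
    void onTick(int idleInBuffer) { waste += idleInBuffer; }

    double S() const { return requests ? double(satisfied) / requests : 1.0; }
    double W() const { return waste; }
};
```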


4.1 Policies

We consider two corresponding heuristic policies, the Periodic (P) policy and the Periodic Prediction (PP) policy, to manage the RB. The P policy periodically reviews the RB to ensure that it is kept at a predetermined process amount R_L at each predefined interval t_p. For every time interval, the number N of processes requested is:

N = \begin{cases} +(R_L - R) & : R < R_L \\ -(R - R_L) & : R > R_L \end{cases}

where R_L is the request (threshold) level and R is the current number of prepared processes in the RB at each periodic review. ”+” means requesting the allocation of a node and deploying a new process, and ”-” means requesting the release of a process and returning the node to the pool. The value of R_L is crucial for the P policy. R_L can be determined by the maximum number of requests that the application could consecutively make. Generally, if the application requests processes frequently, R_L will be higher.
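A sketch of one periodic review step of the P policy, following the piecewise rule above (the two AA hooks are our own hypothetical names for the internal allocate/release operations):

```cpp
// Hypothetical internal AA hooks: request one allocation from the Grid
// scheduler (to become a prepared process), or release one idle prepared
// process and return its node to the pool.
void requestAllocation();
void releasePreparedProcess();

// One review of the P policy, run every t_p time units: keep the Resource
// Buffer at the threshold level R_L.
void periodicReview(int R /* current number of prepared processes */, int RL) {
    for (int i = R; i < RL; ++i) requestAllocation();       // +(R_L - R)
    for (int i = R; i > RL; --i) releasePreparedProcess();  // -(R - R_L)
}
```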

Fig. 2. AA periodically predicts the number of requests the application will make. E.g. when the application is running at iteration 5, AA predicts the period from iteration 25 to 30.

If the time T_alloc that the Grid scheduler takes to allocate a process in the cluster is known and relatively stable, we can use the PP policy. The PP policy periodically (at interval t_p) predicts the number of requests that the application will make after a time θ (Figure 2). If AA predicts that after θ the application will make n requests, AA will place n requests with the scheduler immediately. We let θ = max T_alloc, i.e. an upper bound over all observed allocation times, so that the application can use the pre-placed process right after it has been allocated. PP uses the application’s historical execution behavior for the prediction. Since the application execution behavior may vary due to its execution environment, PP predicts the probability of a request in an iteration range (e.g. iteration 25 ~ iteration 30, Figure 2) rather than in a single iteration. The probability P_{i~i+range} of a request in the future iterations i ~ i+range is calculated as

P_{i \sim i+range} = \sum_{j=i}^{i+range} r_j / times

where r_j is the total number of requests at iteration j during the recorded executions, times is the number of recorded historical executions, and range is the number of iterations the prediction covers. The number of requests n_{i~i+range} that the application is predicted to make in iterations i ~ i+range can then be calculated as \lfloor P_{i \sim i+range} \rfloor.

In order to avoid a redundancy of processes in the RB, PP also has a request (threshold) level R_L. As the current number of processes R rises beyond R_L, AA initiates the release of processes. The full pseudo-code for the PP policy in AA is shown in Figure 3.

Fig. 3. Pseudo-code for PP policy
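The pseudo-code figure itself does not survive in this text-only copy; the following C++ sketch reconstructs one PP step from the description above (all names are our own, and the history store is simplified to a per-iteration request count):

```cpp
#include <vector>

// Hypothetical AA hooks, as in the P-policy sketch above.
void requestAllocation();
void releasePreparedProcess();

// Simplified PP step. history[j] holds the total number of requests observed
// at iteration j over all recorded executions; times is the number of
// recorded historical executions.
struct PPPolicy {
    std::vector<long> history;
    int times = 0;
    int range = 5;  // width of the predicted iteration window
    int RL    = 1;  // threshold level bounding the RB size

    // n_{i~i+range} = floor(P_{i~i+range}) with P = sum(r_j) / times.
    int predict(int i) const {
        if (times == 0) return 0;  // no history recorded yet
        long sum = 0;
        for (int j = i; j <= i + range && j < (int)history.size(); ++j)
            sum += history[j];
        return (int)(sum / times);  // integer division acts as the floor
    }

    // Run every t_p iterations. futureIter is the iteration the application
    // is expected to reach after theta (~ max T_alloc); R is the current RB
    // occupancy.
    void step(int futureIter, int R) const {
        int n = predict(futureIter);
        for (int k = 0; k < n; ++k) requestAllocation();   // place early
        while (R > RL) { releasePreparedProcess(); --R; }  // trim beyond R_L
    }
};
```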

4.2 Simulation and Results

Based on the assumptions we made, we use Clown [17], a discrete-event simulator, to simulate a simple iterative application that behaves similarly to the real dynamic parallel application. Initially, this application has three processes, each of which holds 10 computational objects. To simulate the dynamic behavior, one of the processes generates 10 objects every 100 iterations. Due to the increase in objects, the application will detect that the execution speed is getting slower and subsequently responds by requesting additional resources/processes via AddProcess() to balance the computation and ensure smooth execution. If AA cannot satisfy the requests at some stage, the application runs more slowly with insufficient resources and will request to add processes again in the next iteration. In order to simulate the execution of the application, we simulate 500 homogeneous computer nodes where the average speed sp_node is 11.2 (it takes 11.2 units of simulation time to complete 1 object, and 112 to complete 10 objects). The execution speed fluctuates in each run according to a Gaussian distribution with variance equal to 1.0. In the cluster, the time T_alloc it takes to allocate a node is Gaussian distributed with average 1000 and variance 100. The simulated execution speed sp_exe (the time it takes to run 1 iteration) with S = 1 can be monitored as ≈ 120. The main purpose of the simulation is to evaluate the effectiveness of the proposed buffer policies under the simulated environment. For each policy, we run the application 100 times. For each execution, the application is run for 3000 iterations.

Figure 4 shows the value of S for both policies during the 100 executions. We can see that the Request Satisfaction of the P policy is slightly higher than what the PP policy can provide. For the PP policy, S is very low in the beginning and increases to 90% around the 5th execution. This is because PP is based on historical information and there is not enough information for predicting the execution behavior in the beginning. During the 100 executions, both policies can provide reasonable S (S > 80%). The small fluctuations are caused by the unstable T_alloc and sp_node.

Fig. 4. The Request Satisfaction S of both policies during 100 executions. PP: range = 5, RL = 1, tp = 5 iterations. P: RL = 1, tp = 600.

Fig. 5. The Resource Waste W of both policies during 100 executions. PP: range = 5, RL = 1, tp = 5 iterations. P: RL = 1, tp = 600.

Figure 5 shows the value of W for both policies during the 100 executions. We can see that the PP policy is far more efficient than the P policy. The W of PP in the first few executions is relatively high, since historical information is insufficient. W then drops rapidly through the learning process and maintains a low value with small fluctuations over the rest of the executions. The results show that both proposed policies have their pros and cons. The P policy provides slightly higher Request Satisfaction but also leads to very high Resource Waste. The PP policy is considered more advanced: it can perform simple prediction and make requests at the right time. However, the current PP policy is not suitable for a cluster where T_alloc is random and unpredictable, whereas the P policy is suitable for any environment.

5 Implementation

The current implementation of AA is developed on top of PVM. The process management follows approach B. Requests are served according to the FCFS policy. The test environment is a Condor pool of 50 Linux machines. The cluster has NFS (Network File System) installed and each machine has PVM installed.


All of an AA-enabled application’s binaries and related files must be put in an NFS-mounted directory. The application is first started on the Condor submitting machine. The first process started is called the master process and manages the whole application; the corresponding AA is called the master AA. When AA decides to add a process for the application (on receiving a request or according to the RB policies), it first asks the master AA to submit a probe program to Condor, specifying the resource requirements (e.g. Arch == ”INTEL”) of the application in the submit file. Once the probe is allocated, it notifies the master AA by writing the node information into an XML file on the NFS. AA keeps reading this file. Once it finds that a new node has been added, it immediately adds that node to its virtual machine by pvm_addhosts(). It then starts a process on that node by pvm_spawn(), and stores the process id in its RB. Since all the binaries are on the NFS, AA does not need to transfer any files. AA in its current implementation does not support heterogeneous deployment; it does, however, allow heterogeneous processing as long as a suitable executable is present on the target node. When AA releases a process, it first stops the process by pvm_kill(), then excludes the node from its virtual machine by pvm_delhosts(). It finally returns this node to the pool by killing the probe via condor_rm. AA’s message passing module is a C++ wrapper of PVM’s interface. A condensed sketch of this add/release path is given below.
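The sketch below uses the standard PVM 3 API named above (pvm_addhosts(), pvm_spawn(), pvm_kill(), pvm_delhosts() are the real PVM calls); the surrounding function structure and the hostname handling are our simplification of AA's internal bookkeeping:

```cpp
#include <pvm3.h>

// Add the node reported by the probe to the virtual machine and start one
// worker process on it; returns the PVM task id stored in the RB, or -1.
int addNode(char* hostname, const char* executable) {
    int info;
    char* hosts[1] = { hostname };
    if (pvm_addhosts(hosts, 1, &info) < 0) return -1;  // join the VM

    int tid;
    // PvmTaskHost: spawn exactly on the named host (binary found via NFS).
    int started = pvm_spawn(const_cast<char*>(executable), nullptr,
                            PvmTaskHost, hostname, 1, &tid);
    return (started == 1) ? tid : -1;
}

// Release a process and return its node: kill the task, then drop the host
// from the virtual machine; the probe itself is removed via condor_rm.
void releaseNode(int tid, char* hostname) {
    int info;
    char* hosts[1] = { hostname };
    pvm_kill(tid);
    pvm_delhosts(hosts, 1, &info);
}
```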

6 Related Work

Condor-PVM [3] provides dynamic resource management for PVM applications executing in a Condor cluster. Whenever a PVM program asks for nodes, the request is re-mapped to Condor, which then finds a machine in the Condor pool via the usual mechanisms and adds it to the PVM virtual machine. This system is intended to integrate the resource management (RM) system (Condor) with the parallel programming environment (PVM) [14]. Although it supports runtime resource requests similar to what AA supports, it does not put any effort into the performance of the application, e.g. buffer management. Moreover, the request scheduling for the application is entirely managed by Condor, which offers no scalability for adding other application-level scheduling policies.

Gropp et al. [9] introduce an extension of MPI for MPI applications to communicate with the job scheduler to add and subtract processes during the computation. It proposes a non-blocking call MPI_IALLOCATE to reserve processors from the scheduler, returning a set of MPI requests. The actual process startup can then be accomplished with the conventional MPI_START or MPI_STARTALL calls. This paper however does not provide detailed implementation information.

DUROC [4] implements an interactive transaction strategy and mechanism for resource co-allocation in a multi-cluster environment. It accepts multiple requests, each written as an RSL expression and each specifying a subjob of an application. In order to tolerate resource failures and timeouts, some resources may be specified as ”interactive” and ”optional”. Successful completion of a DUROC co-allocation request results in the creation of a set of application processes that are able to communicate with one another. An important issue in resource co-allocation is that the required resources have to be available at the same time, otherwise the computation cannot


proceed. In our model, by contrast, a dynamic parallel application can continue computation with insufficient resources and request additional resources via AA during the computation to maintain its ideal performance.

7 Conclusion and Future Direction

The contribution of this paper is the proposal of an application agent, AA, that supports the dynamic resource requirements of dynamic parallel applications such as CFD-OG. An AA-enabled application is able to add a new resource (deploy a new process) and release surplus (release a process) at run-time. To maintain the smooth execution of the application, the Resource Buffer service embedded in AA is proposed to relieve the cost of waiting for resources. Two heuristic policies are introduced to examine how the RB concept can be managed more effectively and efficiently. The current version of AA is implemented with approach B (detailed in section 3) and tested in a Condor cluster. As we mentioned, this approach is restricted by existing systems. For example, PVM is bound to join resources that are located in the same network domain, so AA cannot perform wide-area computing based on PVM. Some extensions (e.g. PMVM [13]) enable PVM to create multi-domain virtual machines. Future work will involve implementing and testing AA in a multi-domain environment. We aim to investigate whether the virtual machine architecture would apply in this setting or whether it is more appropriate to apply approach A, which loosely links distributed processes across the network. The security problems arising from a multi-domain environment will also be addressed. The RB policies also need more precise investigation. The two policies will be further tested with two real-world dynamic parallel applications, CFD-OG and RUNOUT [16]. The policies will be further extended to react intelligently to changes in the resource environment, so that the smooth execution of the application is not affected.

References

1. N1 Grid Engine 6 administration guide. Technical report, Sun Microsystems, Inc.
2. Burns, G., Daoud, R., Vaigl, J.: LAM: An Open Cluster Environment for MPI. In: Proceedings of Supercomputing Symposium, pp. 379–386 (1994)
3. Condor: Condor online manual version 6.5, http://www.cs.wisc.edu/condor/manual/v6.5
4. Czajkowski, K., Foster, I.T., Kesselman, C.: Resource co-allocation in computational grids. In: HPDC (1999)
5. Foster, I.: Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Addison-Wesley Longman Publishing Co., Inc., Boston (1995)
6. Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., Sunderam, V.S.: PVM: Parallel Virtual Machine: A Users’ Guide and Tutorial for Networked Parallel Computing. MIT Press, Cambridge (1994)
7. Globus Alliance: Globus Toolkit, http://www.globus.org/toolkit/
8. Goux, J.-P., Kulkarni, S., Linderoth, J., Yoder, M.: An enabling framework for master-worker applications on the computational grid. In: HPDC, pp. 43–50 (2000)
9. Gropp, W., Lusk, E.: Dynamic process management in an MPI setting. In: SPDP 1995, p. 530 (1995)
10. OMG: The Common Object Request Broker: Architecture and Specification
11. Sun Microsystems: Jini network technology. Technical report, http://www.sun.com/software/jini/
12. Nazir, A., Liu, H., Sørensen, S.-A.: PowerPoint presentation: Steering dynamic behaviour. In: Open Grid Forum 20, Manchester, UK (2007)
13. Petrone, M., Zarrelli, R.: Enabling PVM to build parallel multidomain virtual machines. In: PDP 2006: Proceedings of the 14th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, Washington, DC, USA, pp. 187–194. IEEE Computer Society Press, Los Alamitos (2006)
14. Pruyne, J., Livny, M.: Providing resource management services to parallel applications (1995)
15. Shao, G.: Adaptive scheduling of master/worker applications on distributed computational resources (2001)
16. Sørensen, S.-A., Bauer, B.: On the dynamics of the Köfels sturzstrom. Geomorphology (2003)
17. Sørensen, S.-A., Jones, M.G.W.: The Clown network simulator. In: 7th UK Computer and Telecommunications Performance Engineering Workshop, London, UK, pp. 123–130. Springer, Heidelberg (1992)
18. Trac, H., Pen, U.-L.: A primer on Eulerian computational fluid dynamics for astrophysics. Publications of the Astronomical Society of the Pacific 115, 303 (2003)
19. Wolski, R., Spring, N.T., Hayes, J.: The Network Weather Service: a distributed resource performance forecasting service for metacomputing. Future Generation Computer Systems 15(5–6), 757–768 (1999)

Performance Evaluation of a SLA Negotiation Control Protocol for Grid Networks

Igor Cergol¹, Vinod Mirchandani¹, and Dominique Verchere²

¹ Faculty of Information Technology, The University of Technology, Sydney, Australia
{icergol,vinodm}@it.uts.edu.au
² Bell Labs, Alcatel-Lucent, Route de Villejust, 91620 Nozay, France
[email protected]

Abstract. A framework for an autonomous negotiation control protocol for service delivery is crucial to enable the support of the heterogeneous service level agreements (SLAs) that will exist in distributed environments. We first give a gist of our augmented service negotiation protocol to support distinct service elements. The augmentations also encompass the related composition of services and negotiation with several service providers simultaneously. All the incorporated augmentations will enable the consolidation of service negotiation operations for telecom networks, which are evolving towards Grid networks. Furthermore, our autonomous negotiation protocol is based on a distributed multi-agent framework to create an open market for Grid services. Second, we concisely present key simulation results of our work in progress. The results exhibit the usefulness of our negotiation protocol for realistic scenarios that involve different background traffic loads, message sizes and traffic flow asymmetry between background and negotiation traffic.

Keywords: Grid networks, Simulation, Performance Evaluation, Negotiation protocols.

1 Introduction

Network forums like the IETF have given rise to several different negotiation protocols such as COPS-SLS [1], DSNP [2], SrNP, QoS-NSIS and RNAP [3]. Apart from these protocols, FIPA Contract Net [4] is a generic interaction protocol popular within the multi-agent community. Negotiation protocols enable service providers to dynamically tune network-related parameters as well as take into account the level of pricing that can actually be afforded by the consumers. From the afore-mentioned negotiation protocol proposals, we have identified scope for improvement in the individual mechanisms by means of consolidation, addressing the key areas of efficiency, reliability, flexibility and completeness. First we discuss our protocol for autonomous dynamic negotiation within a distributed open market environment. The protocol can operate on any underlying network


switching technology, i.e. L1 services (e.g. fiber or TDM), L2 services (e.g. Carrier Grade Ethernet) or L3 services (e.g. IP/MPLS). Then we present and discuss OMNeT++ based models and related stochastic simulation results that reflect the impacts on the performance of our negotiation protocol of different background traffic loads, message sizes and traffic flow asymmetry. This paper is organized as follows: Section 2 gives an outline of the distributed Grid market framework in open environments. It also gives the essence of our negotiation protocol proposed in [5], first for a single service element between two parties and then for a bundle of heterogeneous service elements from multiple service providers that can complement each other or compete for the service element offers. Performance evaluation results are also presented and discussed in Section 2. The conclusions that can be drawn from this paper are stated in Section 3.

2 Negotiation Framework Overview

The framework for managing Quality of experience Delivery In New generation telecommunication networks with E-negotiation (QDINE) [6] is shown in Figure 1.

Fig. 1. The QDINE Negotiation Framework

The QDINE approach adopts a distributed, open-market approach to service management. It makes services available via a common repository with an accompanying list of negotiation models that are used to distribute service level agreements (SLAs) for the services. A QDINE market is a multi-agent system with each agent adopting one or more market roles. Every QDINE market has exactly one agent acting in the market agent (MA) role and multiple service providers (SPs), consumers and billing providers.

2.1 Proposed Service Negotiation Protocol

Our negotiation protocol [5] is a session-oriented, client-initiated protocol that can be used for intra- as well as inter-domain negotiation in a Grid environment. A key


distinguishing feature of our protocol is that it enables the negotiation of a single service as well as the composition of several service elements from multiple service providers (SPs) within one session. There are two possible scenarios, shown in Figs. 2 and 3, that we consider for the negotiation of services between the consumer agent (CA) and the SP.

2.1.1 Negotiation of a Single Service
The sequence diagram for the negotiation of a single service is explained with the help of Fig. 2, for the case of a direct negotiation between the service provider (SP) and consumer agent (CA). The numbers in parentheses that precede the messages describe the order of the interactions. Upon receiving the Request message (1), the SP has several options:

• It can immediately provide the service by sending an Agree message (2a) back to the CA. This forms the beginning of the SLA binding process for the service element.
• If the SP cannot deliver the service requested and chooses to end the negotiation, it may reply with a Refuse message (2b), stating the reason for refusal.

The SP may instead choose to continue the negotiation sequence by offering an alternative SLA parameter proposal, in which case it responds with a Propose message (2c), containing the modified SLA and any additional constraints for an acceptable contract agreement. The consumer may reply to the new proposal with another Request message (3), or accept the proposal (5b). Several Request and Propose messages (3) and (4) may be exchanged between the CA and the SP. If the consumer chooses to end the negotiation, it can reply with a Reject-proposal message (5a). Alternatively, if the consumer accepts the proposal, it responds with an Accept-proposal message (5b) indicating the accepted SLA. After receiving the Accept-proposal message from the CA, the SP can bind the SLA by sending an Acknowledge message (6). The SLA under negotiation can be validated at any time by the participants, using the appropriate service specification(s) and current constraints. Invalid proposals may result in a Reject-proposal message (5a) or a Refuse message (2b), sent by the consumer or provider, respectively. A sketch of the resulting message vocabulary is given below, after the caption of Fig. 2.

2.1.2 Negotiation of Multiple Service Elements
We illustrate the protocol sequence diagram in Figure 3 for a multi-service negotiation using the following scenario: a QDINE market consumer needs to perform a highly demanding computational fluid dynamics (CFD) simulation workflow on a specific computer infrastructure and then to store the processed data results in other remote locations. The consumer therefore needs to negotiate a bundled service from several service providers (SPs): (i) a high performance computing (HPC) SP to execute the workflow of the simulation program hosted on the machine, (ii) a storage SP to store the results produced by the computational application, and (iii) a network (NW) SP to provide the required bandwidth for transmitting the data between the different resource centers within a time interval window, so as to avoid internal delays in the execution of the global application workflow.


Fig. 2. Direct negotiation of a single service
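To make the single-service exchange concrete, here is a minimal C++ sketch of the message vocabulary and the SP-side decision from Fig. 2; the enum and function are our own illustration, not part of the protocol specification in [5]:

```cpp
// Message vocabulary of the single-service negotiation (Fig. 2).
enum class Msg {
    Request,         // (1),(3) CA asks for a service / counter-requests
    Agree,           // (2a) SP accepts immediately: SLA binding starts
    Refuse,          // (2b) SP ends the negotiation, stating a reason
    Propose,         // (2c),(4) SP offers a modified SLA
    RejectProposal,  // (5a) CA ends the negotiation
    AcceptProposal,  // (5b) CA accepts the (modified) SLA
    Acknowledge      // (6) SP binds the agreed SLA
};

struct SLA;  // opaque here: the parameters under negotiation

// SP-side choice on receiving a Request: agree, counter-propose, or refuse.
Msg onRequest(const SLA& requested, bool deliverable, bool worthCounter) {
    if (deliverable)  return Msg::Agree;    // (2a)
    if (worthCounter) return Msg::Propose;  // (2c) modified SLA
    return Msg::Refuse;                     // (2b)
}
```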

Fig. 3. Negotiation of multiple services

In the configuration phase the consumer sends a Request message (1) to the MA to get the necessary information enabling the consumer agent (CA) to fill in the proxy message in a template defined by the MA. The MA responds to the request with an Inform message (2), carrying any configuration information updated since consumer registration. In this example scenario, after obtaining the configuration information, the CA sends a Proxy message (3) to the MA, requesting it to select the appropriate SPs and negotiate the three afore-mentioned services. All of these services are required for a successful negotiation in this example (a sketch of such a bundled request follows the list):

• a call-for-proposal (cfp) for service S1 (HPC) with three preferred providers (SP1, SP2 and SP3), returning the best offer,
• a bilateral bargaining with any appropriate provider for service S2 (storage), returning the best offer, and
• a bilateral bargaining with service provider SP5 for the network service S3 (constraints expressed as bandwidth), returning the final offer.
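The Proxy message for this scenario could be represented roughly as below; the struct layout and the selection-function strings are our own illustration of the three bundled requests, not the actual QDINE message format:

```cpp
#include <string>
#include <vector>

// One service element inside a Proxy message: what to negotiate, with whom,
// and how the MA should pick among the offers.
struct ServiceRequest {
    std::string service;                 // e.g. "HPC", "storage", "network"
    std::vector<std::string> providers;  // preferred SPs; empty = any
    std::string mechanism;               // "cfp" or "bilateral"
    std::string selection;               // e.g. "best-offer", "final-offer"
};

// The bundled request of section 2.1.2: all three must succeed together.
std::vector<ServiceRequest> proxyBundle = {
    {"S1: HPC",     {"SP1", "SP2", "SP3"}, "cfp",       "best-offer"},
    {"S2: storage", {},                    "bilateral", "best-offer"},
    {"S3: network", {"SP5"},               "bilateral", "final-offer"},
};
```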


(The SPs for common services are shown collocated in Figure 3 due to space limitations.) Upon receiving the Proxy message, the MA assesses it and, if it validates the request, replies with an Agree message (4b) to the CA. According to the request for service element S1, the MA issues a cfp (5) to service providers SP1, SP2 and SP3. One of them refuses to provide the service, sending a Refuse message (6a); the other two HPC service providers each submit a proposal, using a Propose message (6b). The MA validates the proposals and uses the offer selection function (osf) to pick the most appropriate service offer. The MA requests the chosen SP to wait (7b), as it does not want to bind the agreement for service element S1 until it has collected acceptable proposals for all three service elements. According to the “any appropriate” provider selection function submitted by the CA for service S2, the MA then selects the only appropriate SP (SP4) for the storage service. The MA then sends a Request message (9) to SP4, invoking a bilateral bargaining based mechanism. For simplicity, we assume that the chosen SP immediately agrees to provide the requested service by replying with an Agree message (10). In reality, the negotiation sequence could be similar to the one depicted in Figure 2. When an agreement is formed, the MA requests SP4 to wait (11) while the other service negotiations are completed. SP4 replies with an Agree message (12). Steps 13-16 can be referred to in [5]. The service providers bind the agreement by replying with an Acknowledgement (17) for the bundled service (S1, S2, S3).

2.2 Performance Evaluation

An important part of the specification phase of the negotiation protocol is to pre-validate its sequences and evaluate its performance through simulations. A Metro Ethernet Network (MEN) architecture was modelled as the underlying transport network infrastructure in the QDINE service negotiation framework. This section documents the results of the simulation study of the negotiation protocol performance for different realistic parameters such as negotiation message sizes, negotiation frequency and input load. We believe this study is valuable as it considers the performance during the negotiation phase over the MEN rather than the end-to-end performance of the transport of data traffic. The study investigates the following negotiation sequences: (i) the negotiation of a single service element, where the negotiation is direct between the SP and CA, and (ii) the negotiation of multiple service elements simultaneously from different types of SPs via the MA. To the best of our knowledge, whilst there have been other studies conducted on negotiation protocols, they evaluate the performance neither for different negotiation message sizes nor for varying background input load from other stations connected to the network.

Simulation Model. The simulations were performed using the OMNeT++ tool [7]; the INET framework, used for the simulation of wired networks (LAN, WAN), was employed in our model environment. The QDINE framework with the MEN infrastructure was modeled and simulated distinctly for (i) direct negotiation and (ii) indirect negotiation. The MEN service type that we used is the Extended Virtual Private LAN (EVP-LAN) service, and the background traffic generated was best-effort service, which is carried by the EVP-LAN. The diverse actors in the QDINE negotiation framework - CA, SP and MA - were modeled as network stations, i.e. Ethernet hosts. Each Ethernet


host was modeled as a stack that, in top-to-bottom order, comprised: (a) two application models - a simple traffic generator (client side) for generating request packets and a receiving entity (server side), which also generates response packets - (b) an LLC module, and (c) a CSMA/CD MAC module. A sketch of the client-side traffic generator model is given below.
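A minimal OMNeT++ simple-module sketch of such a client-side generator is shown below; it uses the modern OMNeT++ 5.x C++ API (cSimpleModule, scheduleAt, the cPacket and cPar calls), which postdates the original study, and the module name, parameter names and gate name are our own assumptions:

```cpp
#include <omnetpp.h>
using namespace omnetpp;

// Client-side application model: emits fixed-size request packets with
// exponentially distributed inter-arrival times, as in the study's setup.
class NegotiationTrafficGen : public cSimpleModule {
  protected:
    virtual void initialize() override {
        scheduleAt(simTime() + par("interArrival").doubleValue(),
                   new cMessage("sendTimer"));
    }
    virtual void handleMessage(cMessage* msg) override {
        // Emit one request packet towards the LLC layer below.
        cPacket* pkt = new cPacket("negotiationRequest");
        pkt->setByteLength(par("packetSize").intValue());
        send(pkt, "out");
        // Re-arm the timer for the next packet.
        scheduleAt(simTime() + par("interArrival").doubleValue(), msg);
    }
};

Define_Module(NegotiationTrafficGen);
```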

Fig. 4. Metro Ethernet Network (MEN) model

2.2.1 Results and Discussion - Direct Negotiation for a Single Service
In direct negotiation (refer to Fig. 4), the negotiation messages flow directly between the consumer agent (CA) and the service provider (SP). The input data traffic load on the MEN was generated by the linked nodes (stations) 1…m other than the negotiating control stations. The input load is expressed as a fraction of the 10 Mbps capacity of the MEN. In the experiments performed, we varied the input load by keeping the packet sending rate constant and increasing the number of stations.

Table 1. Attributes specific for direct negotiation

Parameter                                 | Value
------------------------------------------|------------------------
Packet size - background traffic          | 1000 bytes
Packet arrival rate - background traffic  | 5, 10 packets/sec
Negotiation frequency                     | 1, 10 negotiations/sec
Packet generation                         | Exp. distributed
Negotiation message sizes                 | 100-300 bytes

A) Impact of negotiation frequency and background traffic packet arrival rate: The attributes used for this simulation are given in Table 1. Fig. 5 shows the impact of the negotiation control frequency, as well as of the generated background data traffic's packet arrival rate, on the mean transfer delay of the negotiation packets received at the CA. The negotiation control protocol used was bilateral bargaining. From Fig. 5a it can be seen that an increase in the negotiation frequency has only a marginal effect on the mean delay up to the input load at which the mean delay starts to increase asymptotically (0.7 in Fig. 5a). This shows that the negotiation protocol is influenced more by the input data traffic load on the network than by the negotiation control frequency, which helps the scalability of the number of negotiation stations on the underlying network.

B) Impact of different negotiation message sizes: We studied this to determine the impact on the performance at the SP, at the CA and on the overall network of the following negotiation control message sizes: (i) 100-300 bytes, (ii) 300-900 bytes, (iii) 300-1800 bytes.
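For intuition only, the asymptotic growth of the mean delay beyond an input load of about 0.7 in Fig. 5a is the classic signature of queueing saturation. The following M/M/1 approximation is our own illustration, not part of the reported OMNeT++ simulations; the queueing model and packet size are assumptions:

```python
# Illustrative M/M/1 approximation of mean transfer delay vs. input load.
# The queueing model and its parameters are assumptions for this sketch,
# not the paper's simulated MEN with CSMA/CD contention.

LINK_CAPACITY_BPS = 10_000_000      # 10 Mbps MEN capacity, as in the study
PACKET_SIZE_BITS = 1000 * 8         # 1000-byte background packets

def mean_delay_s(load: float) -> float:
    """Mean sojourn time T = 1/(mu - lambda) of an M/M/1 queue,
    with rates expressed in packets per second."""
    mu = LINK_CAPACITY_BPS / PACKET_SIZE_BITS   # service rate
    lam = load * mu                             # arrival rate at this load
    if lam >= mu:
        return float("inf")                     # saturated: delay unbounded
    return 1.0 / (mu - lam)

for load in (0.1, 0.3, 0.5, 0.7, 0.9, 0.99):
    print(f"load {load:4.2f} -> mean delay {mean_delay_s(load) * 1000:8.3f} ms")
```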


Fig. 5. (a) Time Performance at the consumer agent (CA) (b) Performance at the CA – Mean delay vs. Input load

The impact of negotiation message size on the mean transfer delay obtained at the CA is shown in Fig. 5b. The impact was found to be more pronounced at the SP than at the CA; the results for the SP are not shown here due to space limitations. Most of the large messages are received at the SP rather than at the CA, which causes a larger overall delay there. Note: the total end-to-end delay of a control frame transfer is the sum of the queuing delay, transmission delay, propagation delay and contention time. Except for the propagation delay (fixed for a medium of length L), all of these delays are higher for larger frames.

2.2.2 Results and Discussion – Indirect Negotiation for Multiple Services
This subsection provides an evaluation of our service negotiation control protocol for multiple service elements negotiated between SPs and a CA. This study uses parameter values similar to those given in Table 1. As the negotiation messages and the background traffic have different packet sizes and exhibit relative traffic flow asymmetry, the performance study is more realistic. The method for the indirect negotiation of multiple services was proposed in Section 2.1.2. The modelled operation of the indirect negotiation incorporated negotiations between the CA and the SPs for multiple services indirectly via the MA. The MA can also be configured to possess the functionality of a resource broker, which helps identify the state of, and the services available at, the various SPs. This helps reduce the overall time of negotiation between the CA and the SPs via the MA. The mean delay of the messages received at the MA is presented in Fig. 6a, for the case where the MA actively negotiates with the SPs on behalf of the CA.

Fig. 6. (a) Mean delay vs. No. of neg. stations (No. of background stations) at the MA. (b) Mean negotiation time vs. No. of neg. stations (No. of background stations).


The SPs, on receipt of the cfp, respond in a distributed manner, which causes relatively lower contention delays, and hence a lower mean delay, than if all the SPs were to respond simultaneously. Because of this, the mean delay of the messages at the MA is decreased. Also, the background traffic has more impact on the performance than the number of SPs. In the indirect negotiation cases, the SPs were considered to be grouped in blocks of 10, each block forming a bundled service. The SPs within each block were candidates for providing a distinct service element. The mean negotiation time for this scenario is shown in Fig. 6b. It can be seen that in this case, too, the background traffic influences the delay. The result in Fig. 6b indicates that an increase in the number of negotiating stations has relatively less impact than the background traffic. This means that a higher capacity network, such as 100 Mbps/1 Gbps/10 Gbps, would be a more suitable infrastructure to limit the effect of the background traffic.

3 Conclusions

We have proposed a dynamic negotiation protocol and explained it within a generic QoS framework for realistic scenarios involving intra-domain negotiation. Salient performance evaluation results obtained from extensive simulations showed the effectiveness of the protocol for both direct and indirect negotiations. In particular, the negotiation protocol was shown to be feasible and scalable, i.e. the negotiation protocol messages did not limit the scalability.

Acknowledgements. This research was performed as part of an ARC Linkage Grant between Bell Labs (Alcatel-Lucent) and UTS, Sydney. We express our thanks to Les Green and Prof. John Debenham for their contributions to the QDINE framework.

References

1. Nguyen, T.M.T., Boukhatem, N., Doudane, Y.G., Pujolle, G.: COPS-SLS: A Service Level Negotiation Protocol for the Internet. IEEE Communications Magazine (2002)
2. Chen, J.-C., McAuley, A., Sarangan, V., Baba, S., Ohba, Y.: Dynamic Service Negotiation Protocol (DSNP) and Wireless Diffserv. In: Proceedings of ICC, pp. 1033–1038 (2002)
3. Wang, X., Schulzrinne, H.: RNAP: A Resource Negotiation and Pricing Protocol. In: Proceedings of NOSSDAV, NJ, USA, pp. 77–93 (1999)
4. FIPA Contract Net Interaction Protocol Specification (December 2003), http://www.fipa.org
5. Green, L., Mirchandani, V., Cergol, I., Verchere, D.: Design of a Dynamic SLA Negotiation Protocol for Grids. In: Proc. ACM GridNets, Lyon, France (2007)
6. Green, L.: PhD Thesis, Automated, Ubiquitous Delivery of Generalised Services in an Open Market (April 2007)
7. Discrete Event Simulation Environment OMNeT++, http://www.omnetpp.org

Performance Assessment Architecture for Grid

Jin Wu and Zhili Sun

Centre for Communication System Research, University of Surrey, Guildford, GU2 7XH, United Kingdom
{jin.wu,z.sun}@surrey.ac.uk

Abstract. For the sake of simplicity in resource sharing, Grid services expose only a functional access interface to users. However, some Grid users want to know more about the performance of services in the planning phase. A problem emerges since the Grid simply has no means of determining how well the system will cope with user demands. Current Grid infrastructures do not integrate adequate performance assessment measures to meet this user requirement. In this paper, the architecture of the Guided Subsystem Approach for Grid performance assessment is presented. We propose an assessment infrastructure that allows the user to collect information to evaluate the performance of Grid applications. Based on this infrastructure, a user-centric performance assessment method is given. It is expected that this research will lead to an extension of Grid middleware that gives the Grid platform the ability to handle applications with higher reliability requirements.

Keywords: Grid, performance assessment, reliability.

1 Introduction

The lack of performance indications has become an obstacle to the continued promotion of the Grid. Users' scepticism about service quality significantly holds back efforts to put more applications onto the Grid: it is hard to convince users to transfer valuable applications onto the Grid infrastructure before they have been clearly notified of the service quality they will receive. This problem arises because the Grid middleware hides system details, allowing users to access a service through a portal without recognising any detail of the resource providers. On the one hand, a standardised interface provides easy access to heterogeneous resources. On the other hand, the system details that would let users gauge the performance of Grid services are hidden. This paper studies a performance assessment architecture to solve the problem. It collects the requirements from the user and, after examining the related Grid components, feeds back how well the system will cope with the user's requirement. This architecture does not help to improve the performance of individual components, but assists users in selecting services more wisely by enabling the comparison of available service providers on a like-for-like basis. Two approaches can be used to obtain the performance assessment of an application: user-initiated assessment and infrastructure-based assessment. The user-initiated assessment approach evaluates the performance of an application by the users'


own efforts. The services from all providers are trialled one by one, and their performances are documented for like-for-like comparison. While this approach is easy to use, its utility is generally limited by its overhead, accuracy, and management concerns. For instance, a large number of users making independent and frequent assessment trials could have a severe impact on the performance of the services themselves. Furthermore, the assessment capabilities at user ends are always limited, which weakens the accuracy of the assessment results. More importantly still, application providers might not agree to open their systems for trials, on economic and security grounds. Ideally, dedicated modules attached to service providers should be in place, with their assessment results made available, at low overhead, to all prospective users. This is the basic motivation for infrastructure-based performance assessment of Grid services. Our paper extends the research by combining the above two ideas: an assessment infrastructure is put in place to assist application selection. The novelties of this research lie in the following: a) the assessment infrastructure has been extended to support performance assessment beyond network performance metrics; b) a general architecture considering both application layer and transmission network performance is proposed; and c) Grid application users are notified of the likely performance before their executions. The remainder of the paper is organised as follows. Section 2 reviews the work related to this research. Section 3 gives an architecture-level study of the performance assessment. Section 4 gives the detailed procedures for the performance assessment identified in Section 3. Finally, Section 5 summarises this work.

2 Related Works

Current research places limited emphasis on Grid performance assessment. Previous studies have discussed the use of infrastructure to obtain data transmission performance between Internet hosts. A simple protocol for such a service, SONAR, was discussed in the IETF as early as February 1996 [2], and in April 1997 as a more general service called HOPS (Host Proximity Service) [3]. Both of these efforts proposed lightweight client-server query and reply protocols similar to the DNS query/reply protocol. Ref. [4] proposed a global architecture, called "IDMaps", for Internet host distance estimation and distribution. This work proposes an efficient E2E probing scheme that satisfies the requirements for supporting large numbers of client-server pairs. However, the timeliness of such information is anticipated to be on the order of a day, and it therefore does not reflect transient properties. Furthermore, this work can only assess the transmission delay among Internet hosts; this could serve as underlying support for performance assessment, but it does not touch the performance issues related to users. The topic of server selection has also been addressed. Ref. [5] proposed passive server selection schemes that collect response times from previous transactions and use this data to direct clients to servers. Refs. [6, 7] proposed active server selection schemes. Under these schemes, measurement servers that periodically probe network paths are distributed throughout the network. Based on the probed Round Trip Time, an estimated metric is assigned to the path between each node pair. Ref. [1] studies a scenario where a multimedia Content Delivery Network distributes thousands of servers throughout the Internet.


In such a scenario, the number of tracers and the probe frequency must be limited to the minimum required to accurately report network distance on some time scale.

3 Architectural Level Study of Performance Assessment

An architecture-level study of performance assessment is given in this section. Performance assessment is used to find out the degree of likelihood with which the performance of a Grid service fulfils the user's requirements. In many cases, multiple component services from different administrative domains are federated into a composite service which is finally exposed to users. Needless to say, the performances of the related component services affect the performance of the composite service that is directly visible to users. Accuracy, efficiency, and scalability are the main concerns for the performance assessment of Grid services. Generally, Grids can be large-scale heterogeneous systems with frequent simultaneous user accesses. Multiple performance assessment requests must be handled in a timely and overhead-efficient manner. Apparently, the larger and more complex the system, and the more varied its characteristics, the more difficult it becomes to conduct an effective performance assessment. The subsystem-level approach is a solution to the scalability problem of performance assessment over large-scale systems. As its name suggests, the subsystem-level approach accomplishes a performance assessment by developing a collection of assessments subsystem by subsystem. The rationale for such an approach is that a performance meltdown experienced by the user will show up as a risk within one or more of the subsystems. Therefore, doing a good job at the subsystem level will cover all the important areas of the whole system. The following observations identify the limitations of this approach. First, independent subsystem performance assessment tends to give a static picture of performance, assuming that an upcoming invoking request does not have a meaningful impact on the subsystem performance. But a subsystem can behave differently when an additional invoking request is applied. Without a dynamic performance assessment that takes into account the characteristics of both the invoking request and the subsystem, the performance assessment result has a systematic inaccuracy. Second, independent subsystem performance assessment is prone to variance in the way performance metrics are characterised and described. Without a common basis applied across the system, the subsequent aggregation of subsystem results is meaningless or impossible to accomplish. A consequence is inevitable misunderstanding and reduced accuracy of the performance assessment. Third, Subsystem A's analysts can only depict the operational status and risk of Subsystem A; they may not have an accurate understanding of Subsystem A's criticality to the overall service performance. Viewed from the perspective of the top-level performance assessment, the subsystem result may represent wasted effort assessing unrelated performance metrics, or a subsystem performance metric that is crucial to the overall assessment may not be measured at all. Given the weaknesses of the conventional subsystem performance assessment approaches, we propose the Guided Subsystem Approach to performance assessment as an enhancement to conventional approaches.


Fig. 1. Architectural Level Process for Guided Subsystem Assessment

The main idea of the Guided Subsystem Approach is to introduce a user-centric component that mediates between the user's assessment requests and the assessment functions of the subsystems. It conducts macro-level pre-analysis. Top-down efforts are employed to translate the user's performance assessment requests and transmit them to the subsystem assessment modules. This dynamic generation of assessment requests to the subsystems helps to prevent mismatches between the user's assessment requirements and the assessment actions applied to the subsystems. The Guided Subsystem Approach takes advantage of both the top-down and the subsystem approaches while providing better efficiency, consistency, and systematic accuracy. Furthermore, the Guided Subsystem Approach combines top-down characteristics with subsystem characteristics: the user-centric interpretation components and the subsystem performance assessment components can be designed and implemented separately.

4 Guided Subsystem Performance Assessment

An assessment infrastructure needs to give performance indications after users express their expectations of the target Grid service. Users can have diverse performance expectations for a Grid service. Their performance expectations should be expressed in a standardised way to let the assessment infrastructure learn the needs of users. A performance indicator is defined to notify users of the result of the performance assessment. The performance indicator should have a pragmatic meaning and allow


the user to perform like-with-like comparisons between similar Grid services delivered by different providers. For the purpose of measuring the ECU of a target Grid service, three types of inputs need to be identified:

• Set of Performance Malfunctioning (SPM) is a set containing every possible performance malfunctioning feature that Internet application users might experience. It is a framework of discernment attached to applications.
• Consequence of Performance Malfunctioning (CPM) measures the damage that each particular performance malfunctioning feature could possibly cause. It depicts the users' expectations of performance towards the application.
• Probability of Performance Malfunctioning (PPM) measures the possibility that a performance malfunctioning appears under a given framework of discernment. This input depicts how the underlying system reacts to the invoking requests.

The ECU of a target Grid service can be given upon the collection of the above three inputs. Obtaining the SPM is the first step of the ECU measurement. A performance malfunctioning is a feature of an application that precludes it from performing according to its performance specifications; a performance malfunctioning occurs if the actual performance of the system is below the specified values. The SPM is usually defined when a type of Grid service is composed; when a type of Grid service is identified, its standard SPM can be obtained. Different types of Grid services can lead to different SPMs. Basically, an SPM is authored through system analysis and can nevertheless be updated thanks to users' feedback. However, the management and standardisation of SPMs are outside the scope of this research. Without loss of generality, we simply consider that standard SPMs can be given by a third-party generalised service when a use-case is presented, where the SPM of service $u$ is denoted $SPM_u$. For a service $u$ with a standard SPM including $n$ malfunctioning types, there exists $SPM_u = \{s_1, \ldots, s_n\}$, which can also be denoted as an $n$-dimensional vector $\overrightarrow{SPM_u} = (s_1, \ldots, s_n)$.

Identifying the severity of performance malfunctioning is another important aspect of the ECU measurement. The CPM is a set of parameters configured by users representing the severity of the damage each performance malfunctioning feature could possibly cause. Define a function $p_{\langle user_x, SPM_u \rangle}$ representing the CPM configuration process of user $user_x$ on the framework of discernment of use-case $u$, $SPM_u$. Denote this CPM as an $n$-dimensional vector $CPM_u = (c_1, \ldots, c_x, \ldots, c_n)$, where $c_x$ is the cost of the damage when the $x$th performance malfunctioning exists. Then, for $\forall s_x \in SPM$, $\exists c_x \in [0, +\infty)$ satisfying $p_{\langle user_x, SPM_u \rangle} : s_x \to c_x$. It is the users' responsibility to configure the CPM; in many cases, policy-based automatic configuration is possible in order to make this process more user-friendly.
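To make the first two inputs concrete, the sketch below shows one plausible in-memory representation of an SPM and a user-configured CPM; the malfunction labels and cost values are invented for illustration and do not come from the paper:

```python
# Illustrative representation of SPM_u and the CPM configuration process
# p<user_x, SPM_u>. Labels and costs are assumptions for this sketch.

# SPM_u: the framework of discernment, an ordered set {s1, ..., sn}
SPM_u = ("no_connectivity", "severe_perf_impact", "minor_perf_impact")

def configure_cpm(user_costs: dict) -> tuple:
    """Map each malfunction s_x to a non-negative damage cost c_x,
    yielding the n-dimensional vector CPM_u = (c1, ..., cn)."""
    cpm = []
    for s_x in SPM_u:
        c_x = float(user_costs.get(s_x, 0.0))   # unconfigured features cost 0
        if c_x < 0:
            raise ValueError("every c_x must lie in [0, +inf)")
        cpm.append(c_x)
    return tuple(cpm)

# A user who mostly fears losing connectivity:
CPM_u = configure_cpm({"no_connectivity": 100.0, "severe_perf_impact": 40.0})
print(CPM_u)   # (100.0, 40.0, 0.0)
```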


The third part of the performance assessment is to find out the performance of the invoked system, in terms of under-performing possibility, when an application scenario is applied. The performance of the invoked system relates to the capability of the physical resources that the system contains, and to their management. An $n$-tuple, PPM, is defined to measure the performance malfunctioning probability of a given function under a scenario. Denote $s_x$ as a performance metric satisfying $s_x \in SPM_u$. For a use-case $u$, suppose the specified performance values are $SPM_u' = \{s_1', \ldots, s_n'\}$. The scenario $e$ is executed to apply the use-case $u$. $p_x$ is defined as the possibility of performance malfunctioning exhibited by the $x$th performance metric, where $p_x = \Pr(s_x < s_x')$. $u_{\langle x,y \rangle}(e) \in \{0,1\}$ is the state value describing the healthy status of component $x$ with regard to the performance metric $y$, and satisfies

$u_{\langle x,y \rangle}(e) = 1$, when $P_{\langle x,y \rangle} < R_{\langle x,y \rangle}(e)$;
$u_{\langle x,y \rangle}(e) = 0$, when $P_{\langle x,y \rangle} \ge R_{\langle x,y \rangle}(e)$.

An $n$-tuple, $U_x(e)$, is used to depict the performance status of component $x$ under scenario $e$:

$U_x(e) = (u_{\langle x,1 \rangle}(e), \ldots, u_{\langle x,n \rangle}(e))$

where $n$ is the total number of measurable metrics for the component $x$. The under-performance of components naturally affects the performance of the scenario. A performance malfunctioning assignment function is used to represent the degree of influence on the SPM when a particular performance state is exhibited by a particular component. A performance malfunctioning assignment function $m : SPM \to [0,1]$ is defined when it verifies the following two formulas:

$\forall A \in SPM, \; m_{U_x(e)}(A) \in [0,1]$, and $\sum_{A \in SPM} m_{U_x(e)}(A) \le 1$

where $m_{U_x(e)}$ denotes the performance assignment function during the performance state $U_x(e)$. A higher value of $m_{U_x(e)}(A)$ denotes a higher influence of the state $U_x(e)$ on the performance malfunctioning $A$, and vice versa. The $m$ function can be co-authored by system analysis and simulations. However, how to obtain the $m$ function is not within the scope of this research; we consider that the $m$ function can be generated by a third-party service. We then study how to assess the performance of a component. Let $R'_{\langle x,y \rangle}$ denote the predicted performance for $\langle x,y \rangle$. For $\forall x, y$, there exists


$v_{\langle x,y \rangle}(e) = \Pr(R'_{\langle x,y \rangle} < P_{\langle x,y \rangle}(e))$.

Once the component $x$ is invoked by scenario $e$, let $v_{\langle x,y \rangle}(e)$ be the degree of likelihood of under-performance measured for $\langle x,y \rangle$. The state of likelihood of under-performance of component $x$ can be represented as an $n$-tuple:

$V_x(e) = (v_{\langle x,1 \rangle}(e), \ldots, v_{\langle x,n \rangle}(e))$.

A similarity measurement function $sim : (X, Y) \to [0,1]$ is defined, where $X = (x_1, \ldots, x_n)$ and $Y = (y_1, \ldots, y_n)$ are two $n$-tuples, and

$sim(X, Y) = \prod_{k=1}^{n} (1 - |x_k - y_k|)$.

Then, when $V_x(e)$ can be obtained, the performance assignment function can be given as

$m_{V_x(e)}(A) = \sum_{U_x(e) \in \Phi_x} sim(U_x(e), V_x(e)) \times m_{U_x(e)}(A)$

and we also have

$m_{V_x(e)}(SPM) = \sum_{U_x(e) \in \Phi_x} sim(U_x(e), V_x(e)) \times m_{U_x(e)}(SPM)$.
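The two formulas transcribe directly into code. In the sketch below, Φx is enumerated as all 2^n binary state tuples, which is an assumption of the sketch; the paper leaves the construction of Φx and of the m function to external analysis:

```python
from itertools import product

def sim(X, Y):
    """Similarity of two n-tuples: prod_k (1 - |x_k - y_k|)."""
    s = 1.0
    for x_k, y_k in zip(X, Y):
        s *= 1.0 - abs(x_k - y_k)
    return s

def m_V(V, m_U, n):
    """m_{V_x(e)}(A) = sum over U in Phi_x of sim(U, V) * m_{U_x(e)}(A).
    m_U maps a binary state tuple U to its assignment value for a fixed A;
    states absent from m_U are taken to have assignment 0."""
    return sum(sim(U, V) * m_U.get(U, 0.0)
               for U in product((0, 1), repeat=n))

# Example with n = 2 metrics: states where metric 1 is unhealthy
# influence malfunctioning A strongly.
m_U = {(1, 0): 0.8, (1, 1): 0.9}
V = (0.7, 0.1)        # measured likelihoods of under-performance
print(round(m_V(V, m_U, n=2), 4))   # 0.567
```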

From these formulas, the relationship between the performance of component $x$ and the SPM is given. Multiple components are involved when a scenario is being executed. To account for the probability of a component failure manifesting itself as user-end performance degradation, Dynamic Metrics (DM) are used. Dynamic Metrics measure the dynamic behaviour of a system based on the premise that active components are the sources of failure [2]. Naturally, a performance meltdown of a component may not affect the remote instrumentation scenario's performance if the component is not invoked, so it is reasonable to use measurements obtained from executing the system models to perform the performance analysis. There is a higher probability that, if an under-performance event is likely to exist in an active component, it will easily lead the scenario into malfunctioning. A data structure denoting the Component Status Description (CSD) is defined as:

$CSD_x(e) = \langle i, x, start_x(e), duration_x(e), m_{V(x)}(SPM) \rangle$

where $i$ and $x$ are the unified identifications of the invoking request and the component; $start_x(e)$ and $duration_x(e)$ are, respectively, the start time and the expected execution duration of component $x$. $start_x$ can be given by analysing the scenario that invokes the component; we assume the value of $duration_x$ can be given by an external service. When the CSDs of all components involved in a scenario are given, a mapping from the Time-Component Representation of a scenario to a Time-Discrete Representation can be carried out. A data structure denoting the Scenario Status Description (SSD) is defined as:

$SSD(e) = \langle j, \{x \mid x \in set(j)\}, start_j(e), duration_j(e), \bigcup_{x \in set(j)} m_{V(x)}(SPM) \rangle$


where $j$ is the serial number of the time fragment; $set(j)$ is the set of active components in the $j$th time fragment; $start_j(e)$ and $duration_j(e)$ are the start time and the time length of the $j$th time fragment when the scenario $e$ is applied. Since all components operate independently, the accumulated performance malfunctioning assignment can be given by the following formula:

$\bigcup_{x \in set(j)} m_{V(x)}(SPM) = 1 - \prod_{x \in set(j)} (1 - m_{V(x)}(SPM))$

Therefore, the PPM of a scenario $e$ can be given as:

$PPM(e, SPM_e) = 1 - \prod_{j \in scenario\ e} \left( 1 - \frac{duration_j(e)}{T(e)} \cdot \bigcup_{x \in set(j)} m_{V(x)}(SPM) \right)$

where $T(e)$ is the total time required for scenario $e$. Finally, the performance evaluation of a Grid service can be given as

$Cost(e) = \sum_{A \in SPM} CPM(e, A) \cdot PPM(e, A)$

where, for $\forall A \in SPM$,

$PPM(e, A) = 1 - \prod_{j \in scenario\ e} \left( 1 - \frac{duration_j(e)}{\sum_{j \in scenario\ e} duration_j(e)} \cdot \bigcup_{x \in set(j)} m_{V(x)}(A) \right)$,

$\bigcup_{x \in set(j)} m_{V(x)}(A) = 1 - \prod_{x \in set(j)} (1 - m_{V(x)}(A))$, and

$m_{V_x(e)}(A) = \sum_{U_x(e) \in \Phi_x} sim(U_x(e), V_x(e)) \times m_{U_x(e)}(A)$.
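Putting the formulas together, the sketch below evaluates PPM(e, A) for each malfunctioning A and then Cost(e); the time fragments, assignment values and costs are invented inputs for illustration only:

```python
# Toy end-to-end evaluation of PPM(e, A) and Cost(e).
# All inputs below are illustrative assumptions, not measured data.

SPM = ("no_connectivity", "severe_perf_impact")
CPM = {"no_connectivity": 100.0, "severe_perf_impact": 40.0}

# Per time fragment j: (duration_j(e), {component: {A: m_{V(x)}(A)}})
fragments = [
    (10.0, {"hpc":     {"no_connectivity": 0.02, "severe_perf_impact": 0.10}}),
    (30.0, {"hpc":     {"no_connectivity": 0.02},
            "storage": {"no_connectivity": 0.05, "severe_perf_impact": 0.03}}),
]
T = sum(duration for duration, _ in fragments)   # total scenario time T(e)

def union_m(active, A):
    """Accumulated assignment 1 - prod_x (1 - m_{V(x)}(A)) over active components."""
    prod = 1.0
    for m in active.values():
        prod *= 1.0 - m.get(A, 0.0)
    return 1.0 - prod

def ppm(A):
    """PPM(e, A) = 1 - prod_j (1 - duration_j/T * union_m(set(j), A))."""
    prod = 1.0
    for duration_j, active in fragments:
        prod *= 1.0 - (duration_j / T) * union_m(active, A)
    return 1.0 - prod

cost = sum(CPM[A] * ppm(A) for A in SPM)
for A in SPM:
    print(f"PPM(e, {A}) = {ppm(A):.4f}")
print(f"Cost(e) = {cost:.2f}")
```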

5 Summary

This paper has described ongoing research on Grid service performance assessment. It explores the approach of assessing Grid service performance using the Guided Subsystem Approach. The assessment approach given is an effective way to measure composite Grid services that include multiple loosely coupled component services. A top-down approach is used to capture the characteristics of user demands. Performance requirements are automatically translated and transferred to the component services. The subsystem approach locally analyses the locally available resources and demands, and concludes the possibility of under-performance for the local module. Dynamic metrics are used to combine the performance assessment results, taking into account the importance of each component. Future work will apply the assessment approach to specific applications to show its advantages.

References

1. Akamai Corporation, http://www.akamai.com
2. Moore, K., Cox, J., Green, S.: Sonar - A Network Proximity Service. Internet-Draft (February 1996), http://www.netlib.org/utk/projects/sonar/
3. Francis, P.: Host Proximity Service (HOPS) (preprint) (August 1998)


4. Francis, P., Jamin, S., Jin, C., Raz, D., Shavitt, Y., Zhang, L.: IDMaps: A Global Internet Host Distance Estimation Service. IEEE/ACM Trans. on Networking (October 2001)
5. Seshan, S., Stemm, M., Katz, R.: SPAND: Shared Passive Network Performance Discovery. In: USENIX Symposium on Internet Technologies and Systems (December 1997)
6. Bhattacharjee, S., Fei, Z.: A Novel Server Selection Technique for Improving the Response Time of a Replicated Service. In: Proc. of IEEE Infocom (1998)
7. Crovella, M., Carter, R.: Dynamic Server Selection in the Internet. In: Proc. of the 3rd IEEE Workshop on the Architecture and Implementation of High Performance Communication Subsystems (HPCS 1995) (1995)

Implementation and Evaluation of DSMIPv6 for MIPL*

Mingli Wang¹, Bo Hu¹, Shanzhi Chen², and Qinxue Sun¹

¹ State Key Lab of Switching and Networking Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
² State Key Lab of Wireless Mobile Communication, China Academy of Telecommunication Technology, Beijing 100083, China
[email protected]

Abstract. Mobile IPv6 performs mobility management for mobile nodes within IPv6 networks. Since IPv6 is not widely deployed yet, it is an important feature to let Mobile IPv6 support a mobile node moving to an IPv4 network while maintaining its established communications. The DSMIPv6 specification extends Mobile IPv6 capabilities to allow dual stack mobile nodes to move within IPv4 and IPv6 networks. This paper describes the implementation of this feature based on the open source MIPL under Linux. By performing experiments on a testbed using the implementation, it is confirmed that DSMIPv6 works as expected.

Keywords: Mobile IPv6, DSMIPv6, MIPL.

1 Introduction

Mobile IP allows mobile nodes to move within the Internet while maintaining reachability and ongoing sessions, using a permanent home address. It can serve as the basic mobility management method in IP-based wireless networks. Mobile IPv6 (MIPv6) [1] shares many features with Mobile IPv4 (MIPv4) and offers many other improvements. There are many different implementations of MIPv6 available today, such as MIPL [2] and SunLab's Mobile IP [3]. Some work has been done on testbeds for MIPv6, such as [4][5], but these only support IPv6 networks. However, since IPv6 has not been widely deployed, it is unlikely that mobile nodes will use IPv6 addresses only for their connections. It is reasonable to assume that mobile nodes will move to networks that might not support IPv6 and would therefore need the capability to support IPv4 care-of addresses. Since running MIPv4 and MIPv6 simultaneously is problematic (as illustrated in [6]), the Dual Stack MIPv6 (DSMIPv6) [7] specification extends Mobile IPv6 capabilities to allow dual stack mobile nodes to request that their home agent forward IPv4/IPv6 packets, addressed to their home addresses, to their IPv4/IPv6 care-of addresses. Mobile nodes can maintain

* Foundation item: This work is supported by the National High-Technology (863) Program of China under Grants No. 2006AA01Z229, 2007AA01Z222 and 2008AA01A316, and by the Sino-Swedish Strategic Cooperative Programme on Next Generation Networks No. 2008DFA12110.


their communications with other nodes by using DSMIPv6 only, regardless of whether the visited network is IPv4 or IPv6. There are already some implementations of DSMIPv6, such as UMIP-DSMIP [8] and the implementation for SHISA [9]. Although these two implementations realize the basic features of DSMIPv6, both of them have some features not yet implemented, such as support for private IPv4 care-of addresses. In this paper, we mainly describe our implementation of DSMIPv6, which supports mobile nodes moving to IPv4 networks and using private IPv4 care-of addresses. We also performed experiments using our implementation, and the results are shown in this paper too. This paper is organized as follows. We give an overview of DSMIPv6 and its existing implementations in Sec. 2. We then present the design of our implementation in Sec. 3. Experiments performed with our implementation and their results are presented in Sec. 4. In Sec. 5, we give the conclusions of this paper.

2 DSMIPv6

2.1 Principle of DSMIPv6

A dual stack node is a node that supports both IPv4 and IPv6 networks. In order to use DSMIPv6, dual stack mobile nodes need to manage an IPv4 and an IPv6 home or care-of address simultaneously and update their home agents' bindings. The basic features of DSMIPv6 are: the possibility to configure an IPv4 care-of address on the mobile node; the possibility to connect the mobile node behind a NAT device; the possibility to connect the home agent behind a NAT device; and the possibility to use an IPv4 home address on the mobile node. A mobile node, being a dual stack node, has both IPv4 and IPv6 home addresses. The home agent is connected to both IPv4 and IPv6 networks. When in an IPv4 network, the mobile node gets an IPv4 care-of address and registers it with the home agent. Both the IPv4 and IPv6 home addresses are bound to this address. IPv4 traffic goes through an IPv4-in-IPv4 tunnel between the mobile node and the home agent, while IPv6 traffic goes through an IPv6-in-IPv4 tunnel. In a similar way, if the mobile node moves to an IPv6 network and registers an IPv6 care-of address, traffic goes through an IPv4-in-IPv6 tunnel or an IPv6-in-IPv6 tunnel. NAT devices between the mobile node and the home agent can be detected.

2.2 Implementations of DSMIPv6

There are already some implementations of DSMIPv6, such as UMIP-DSMIP [8] and the implementation for SHISA [9]. UMIP-DSMIP is an implementation of DSMIPv6 for the UMIP stack. It is developed by Nautilus6 [8] and is shipped as a set of patches for UMIP 0.4 and 2.6.24 kernels. The implementation for SHISA introduced in [9] is based on BSD operating systems. Although some basic features of DSMIPv6 are realized in the two implementations, both of them have some features not yet implemented, such as the Network Address Translator (NAT) [10] traversal feature. Since public IPv4 addresses are in short supply, NAT is widely used to mitigate this problem. Suppose a mobile node moves to a GPRS or CDMA network; it can only get private IPv4 care-of addresses.


To use these care-of addresses to maintain communication, the NAT traversal feature must be realized in this situation; our implementation includes this feature. The underlying MIPv6 implementations also differ: UMIP-DSMIP is based on the UMIP stack and the other is based on SHISA. Our implementation is based on MIPL (Mobile IPv6 for Linux), which is open source under the GNU GPL. MIPL is developed by HUT to run on Linux environments that support IPv6. The latest version of MIPL is 2.0.2, which requires kernel version 2.6.16.

3 Implementations

3.1 Overview

We implemented the basic features of DSMIPv6, including the possibility to configure an IPv4 care-of address on the mobile node and the possibility to connect the mobile node behind a NAT device. Using our implementation, a mobile node can move to an IPv4-only network and keep its communications. Private IPv4 care-of addresses are supported. Fig. 1 shows the supported scenes. Our implementation is based on MIPL version 2.0.2.

Fig. 1. Supported scenes of our implementation

When a mobile node moves to an IPv6 network and gets an IPv6 care-of address, the basic MIPL works. In addition, it updates the new members of the binding update list and binding cache described in Sections 3.3 and 3.4. If the mobile node moves to an IPv4 network and gets an IPv4 care-of address, it needs to send a binding update message to the home agent's IPv4 address. The binding update message contains the mobile node's IPv6 home address in the home address option. However, since the care-of address is an IPv4 address, the mobile node must include its IPv4 care-of address in the IPv6 packet. After accepting the binding update message, the home agent will update the related binding cache entry, or create a new binding cache entry if such an entry does not exist. The binding cache entry points to the


mobile node's IPv4 care-of address. All packets addressed to the mobile node's home address will be encapsulated in a tunnel that has the home agent's IPv4 address in the source address field and the mobile node's IPv4 care-of address in the destination address field. After creating the corresponding binding cache entry, the home agent sends a binding acknowledgment message to the mobile node.

3.2 Tunnel

In the implementation of MIPL, when a mobile node moves to an IPv6 foreign network, an IPv6-in-IPv6 tunnel named "ip6tnl" is created. All packets sent to the mobile node's home address are encapsulated in this tunnel from the home agent to the mobile node's care-of address. When the mobile node moves to an IPv4 foreign network, we should set up an IPv6-in-IPv4 tunnel to transmit packets between the mobile node and the home agent. However, if the mobile node's IPv4 care-of address is private, the communication link between the mobile node's care-of address and the home agent will contain NAT devices. Since common NAT devices do not support IPv6-in-IPv4 packets (protocol type 41), we should tunnel the IPv6 packets in UDP and IPv4, i.e. an IPv6-in-UDP tunnel. Fig. 2 shows the differences between the two tunnels.

Fig. 2. Differences between the two tunnel types

The Linux kernel provides an IPv6-in-IPv4 tunnel named "sit"; its two end-points need to be configured before it can be used to transmit packets. We modified the code to realize the IPv6-in-UDP tunnel by adding a UDP header between the IPv4 header and the inner IPv6 header before transmission; after receiving tunnelled packets, the UDP header is pulled out together with the IPv4 header. There are two reasons for implementing the tunnel in kernel space. Firstly, it is efficient. Secondly, encapsulation and decapsulation are functions common to the mobile node, the home agent and the access router; separating this function from the others and modelling the tunnel so that it provides a common interface to upper applications means that applications do not need to consider the tunnel. This method improves extensibility.
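For illustration only (this is not the kernel patch itself), the framing just described can be sketched as follows; the tunnel port number is a placeholder, and checksum handling is left to the network stack:

```python
# Illustrative IPv6-in-UDP framing: [outer IPv4][UDP][inner IPv6 packet].
# The port number is an assumption for the sketch, not from the paper.
import struct

TUNNEL_PORT = 4500   # placeholder

def udp_encapsulate(inner_ipv6: bytes, src_port: int = TUNNEL_PORT,
                    dst_port: int = TUNNEL_PORT) -> bytes:
    """Prepend an 8-byte UDP header (src port, dst port, length, checksum)
    to the inner IPv6 packet; the outer IPv4 header is then added by the
    IP layer when the datagram is handed down."""
    length = 8 + len(inner_ipv6)
    checksum = 0                     # left for the stack to fill in
    udp_header = struct.pack("!HHHH", src_port, dst_port, length, checksum)
    return udp_header + inner_ipv6

def udp_decapsulate(payload: bytes) -> bytes:
    """Strip the 8-byte UDP header, recovering the inner IPv6 packet."""
    return payload[8:]
```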


Fig. 3. Process flow for IPv4 care-of address

3.3 Mobile Node Modifications

When an IPv4 care-of address is detected, an IPv4 care-of address option is created and included in the Mobility Header of the binding update message sent from the mobile node to the home agent. Before sending the binding update message, the mobile node sets up the sit tunnel on its own side and signals the home agent to set up the other end-point of the tunnel. After receiving the message confirming that the tunnel has been set up at the home agent, the mobile node creates or updates a binding update list entry for its home address. The binding update message is then encapsulated in the IPv6-in-UDP tunnel and sent to the home agent if everything has been processed successfully. The process flow is shown in Fig. 3. Based on the structure of the binding update list in MIPL, we add two members: one stores the IPv4 care-of address, and the other marks the current type of care-of address, IPv4 or IPv6. By doing this, we use only one entry per home address. After receiving the binding acknowledgement message, the mobile node updates the corresponding binding update list entry to change the binding status. If the binding acknowledgement indicates a binding failure, the previously set tunnel should be deleted.

3.4 Home Agent Modifications

The home agent processes the received binding update message and creates or updates a binding cache entry. An IPv4 address acknowledgement option is created and included in the Mobility Header of the binding acknowledgement message sent from the home agent to the mobile node. This option indicates whether a binding cache entry was created or updated for the mobile node's IPv4 care-of address. If the home registration failed, the previously set tunnel should be deleted. The process flow is shown in Fig. 3. Like the binding update list, we add two members to the structure of the binding cache too: one stores the IPv4 care-of address, and the other marks the current type of care-of address. A sketch of this extension is given below.
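As a sketch of the extended entry (the real MIPL structures are C structs inside the daemon; the field names below are invented for illustration):

```python
# Illustrative sketch of a binding update list / binding cache entry
# extended with the two new members described above.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class CoaType(Enum):
    IPV4 = 4
    IPV6 = 6

@dataclass
class BindingEntry:
    home_address: str                 # the MN's IPv6 home address
    care_of_address: str              # currently registered care-of address
    coa_type: CoaType                 # added member: marks the address type
    ipv4_coa: Optional[str] = None    # added member: stores the IPv4 care-of address

# One entry per home address, whichever network the MN is visiting:
entry = BindingEntry("2001:db8::1", "192.0.2.10", CoaType.IPV4,
                     ipv4_coa="192.0.2.10")
```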

4 Evaluations

4.1 Testbed Details

The testbed's network topology is shown in Fig. 4. The columns in the figure represent PC routers, implemented on the Linux platform. The CN (correspondent node) is a PC connected


Fig. 4. Network topology of testbed

with the home agent (HA) via an IPv6 network. The MN (mobile node), which moves between the home network and different foreign networks, is a laptop with a Linux OS. The protocol is IPv6 in the home network and IPv4 in the foreign networks. The modified MIPL runs on the MN and the HA. The MN has an IPv6 home address and can get both IPv6 and IPv4 care-of addresses. The CN has only an IPv6 address and communicates with the MN. The HA has both IPv6 and IPv4 addresses; its IPv4 address is public and is known by the MN. The MN can get a private IPv4 address as its care-of address in the WLAN or GPRS network. Initially the MN is in the home network and communicates with the CN via IPv6. After moving to a foreign network, WLAN or GPRS, it gets an IPv4 care-of address. The MN then moves back to the home network and uses its IPv6 address again.

4.2 RTT (Round Trip Time) Test

We use the "ping6" command to test the RTT, performing 200 runs to obtain the mean values. Table 1 compares the RTT in the different networks, for the scenes where the MN is in the home network and in the foreign networks.

Table 1. RTT comparisons in different networks

                 MN in home network   MN in WLAN   MN in GPRS
Mean RTT (ms)    2.559                223.6        615.3

The RTT in the home network is the smallest of the three networks, since the MN is directly linked with the HA. The RTT in the GPRS network is the biggest, since the packets are tunnelled and pass through more hops along the path. The results confirm that the implementation of DSMIPv6 works as expected; most importantly, the connection was maintained through a private IPv4 address and a tunnel in the WLAN and GPRS networks.

4.3 Handover Delay Test

The "crontab" tool of Linux is used to trigger handovers automatically, and we collected 100 handovers for analysis. The results are listed in Table 2.

Table 2. Handover delays in different situations

                Home to WLAN   WLAN to GPRS   GPRS to Home
Mean time (s)   1.942          2.038          2.846

As Table 2 shows, the delay of moving from the home network to a foreign network, and of moving between different foreign networks, is approximately 2 seconds. The delay of moving from a foreign network back to the home network is a little bigger, as there are more operations when the MN returns home. These tests also confirm the validity of our implementation. We also ran services, such as ftp and video, to test the effect of the handover between networks. The interruption was not very noticeable and did not cause a bad user experience.

5 Conclusions

This paper describes the DSMIPv6 implementation on MIPL for Linux systems. The DSMIPv6 implementation was designed to support a mobile node moving between IPv4 and IPv6 networks while maintaining its established communications. An IPv6-in-UDP tunnel is set up to transmit IPv6 packets over an IPv4 link and traverse NAT devices. An IPv4 care-of address and an address flag are added to the binding update list and the binding cache, storing the IPv4 care-of address and marking the address type, respectively. An IPv4 care-of address option and an IPv4 address acknowledgement option are used to register the IPv4 care-of address with the home agent. By evaluating the implementation, it is confirmed that the DSMIPv6 implementation works as expected. This suggests that the specification is stable enough to work in real environments.

References

1. Johnson, D.: Mobility Support in IPv6, RFC 3775, The Internet Engineering Task Force (June 2004)
2. Mobile IPv6 for Linux, http://www.mobile-ipv6.org/
3. Networking and Security Center, Sun Microsystems Laboratories, http://playground.sun.com
4. Busaranun, A., Pongpaibool, P., Supanakoon, P.: Handover Performance of Mobile IPv6 on Linux Testbed. ECTI-CON (2006)
5. Lei, Y.: The Principle of Mobile IPv6 and Deployment of Test Bed. Sciencepaper Online
6. Tsirtsis, G., Soliman, H.: Mobility Management for Dual Stack Mobile Nodes, A Problem Statement, RFC 4977, The Internet Engineering Task Force (August 2007)
7. Soliman, H. (ed.): Mobile IPv6 Support for Dual Stack Hosts and Routers (DSMIPv6), Internet Draft draft-ietf-mip6-nemo-v4traversal-02, The Internet Engineering Task Force (June 2006)
8. UMIP-DSMIP, http://www.nautilus6.org/
9. Mitsuya, K., Wakikawa, R., Murai, J.: Implementation and Evaluation of Dual Stack Mobile IPv6. In: AsiaBSDCon 2007 (2007)
10. Egevang, K.: The IP Network Address Translator (NAT), RFC 1631, The Internet Engineering Task Force (May 1994)

Grid Management: Data Model Definition for Trouble Ticket Normalization

Dimitris Zisiadis¹, Spyros Kopsidas¹, Matina Tsavli¹, Leandros Tassiulas¹, Leonidas Georgiadis¹, Chrysostomos Tziouvaras², and Fotis Karayannis²

¹ Center for Research and Technology Hellas, 6th km Thermi-Thessaloniki, 57001, Hellas
² GRNET, 56 Mesogion Av., 11527, Athens, Hellas

Abstract. Handling multiple sets of trouble tickets (TTs) originating from different participants in today's Grid interconnected network environments poses a series of challenges for the involved institutions. Each of the participants follows different procedures for handling trouble incidents in its domain, according to its local technical and linguistic profile. The TT systems of the participants collect, represent and disseminate TT information in different formats. As a result, management of the daily workload by a central Network Operations Centre (NOC) is a challenge on its own. Normalization of TTs to a common format for presentation and storage at the central NOC is mandatory. In the present work we provide a model for automating the collection and normalization of the TTs received from the multiple networks forming the Grid. Each participant uses its home TT system within its domain for handling trouble incidents, whereas the central NOC gathers the tickets in the normalized format for storage and handling. Our approach uses XML as the common representation language. The model was adopted and used as part of the SA2 activity of the EGEE-II project.

Keywords: Network management, trouble ticket, grid services, grid information systems, problem solving.

1 Introduction

Modern telecommunications networks aim to provide a plethora of differentiated services to their customers. Networks are becoming more sophisticated by the day, while their offerings span a wide variety of customer types and services. Quality of Service (QoS) [1] and Service Level Agreement (SLA) [2] provisioning are fundamental ingredients. Multiple interconnected institutions, targeting a common approach to service offering, along with a unified network operation scheme to support these services, form Grid networks. Network management is crucial for the success of the Grid. Problem reporting, identification and handling, as well as trouble information dissemination and delegation of authority, are some of the main tasks that have to be implemented by the members of the Grid.


GÉANT2 [3] is an example of a Grid. It is the seventh generation of the pan-European research and education network, successor to the pan-European multi-gigabit research network GÉANT. The GÉANT2 network connects 34 countries through 30 national research and education networks (NRENs), using multiple 10 Gbps wavelengths. GÉANT2 puts user needs at the forefront of its plans for network services and research. Usually a central Network Operations Centre (NOC) is established at the core of the network to achieve network and service integration support. Ideally, a uniform infrastructure should be put in place, with interoperating network components and systems, in order to provide services to the users of the Grid and to manage the network. In practice, though, this is not the case. Unfortunately, different trouble ticket (TT) systems are used by the participating networks. There is a wide variety of TT systems available, with differing functionality and pricing. Examples are Keystone [4], ITSM [5], HEAT [6], SimpleTicket [7] and OTRS [8]. Moreover, in-house developed systems, as is the case for GRnet [9], are another option. The advantages of this option are that it offers freedom in design and localization options and that it meets the required functionality in full. It has, though, the disadvantage of local deployment and maintenance. Nevertheless, it is adopted by many Internet Service Providers (ISPs), both academic and commercial, as the pros of this solution enable service delivery and monitoring with total control over the network.

The current work evolved within the Service Activity 2 (SA2) activity of the European-funded EGEE-II project [10]. A central NOC, called the ENOC [11], is responsible for collecting and handling the multiple TTs received from the participating institutions' TT systems. Various TT systems are used by the participants, delivering TTs in different formats, while the TT load grows proportionally with the network size and the number of serviced users. TT normalization, i.e. transformation to a common format that is reasonable for all parties and copes with service demands in a dynamic and effective way, is of crucial importance for the successful management of the Grid. In the present work we define a data model for TT normalization for the participating institutions in EGEE-II. The model is designed in accordance with the specific needs of the participants, meeting the requirements of the multiple TT systems used. It is both effective and comprehensive, as it covers the core activities of the NOCs. It is also dynamic, as it allows other options to be included in the future, according to demand.

This paper is organized as follows: section 2 outlines related work on TT normalization. In section 3 we present our data model in detail, whereas in section 4 we provide a prototype implementation of the proposed solution. Finally, in section 5 we discuss the conclusions of this work.

2 Related Work

Whenever multiple organizations and institutions form a Grid, or some other form of cooperative platform for network service deployment, the need arises to define a common understanding of network operations and management issues. Trouble incidents are recorded whenever a problem arises that affects normal network operations or


services. Typical problems are failures of network links or other network elements (i.e. routers, servers), security incidents (i.e. intrusion detection) or any other problem that affects normal service delivery (i.e. service overload). The incidents are represented in specific formats, namely TTs. A TT is issued in order for the network operators to record and handle the incident. RFC 1297 [12], titled "NOC Internal Integrated Trouble Ticket System Functional Specification Wishlist", describes general functions of a TT system that could be designed for NOCs, exploring competing uses, architectures, and desirable features of integrated internal trouble ticket systems for Network and other Operations Centres.

The network infrastructure available to EGEE is served by a set of National Research and Education Networks (NRENs) via the GÉANT2 network. Reliable network resource provision to the Grid infrastructure depends highly on coherent collaboration between a large number of different parties from both the NREN/GÉANT2 and EGEE sides, as described in [13]. Common problems and solutions, as well as strategies for investigating problem reports, have been presented in [14][15]. The concept of the Multi-Domain Monitoring (MDM) service, which describes the transfer of end-to-end monitoring services in order to serve the needs of different user groups, is discussed in [16].

The OSS Trouble Ticket API (OSS/J API) [17] provides interfaces for creating, querying, updating, and deleting trouble tickets (trouble reports). The Trouble Ticket API focuses on the application of the Java 2 Platform, Enterprise Edition (J2EE) and XML technologies to facilitate the development and integration of OSS components with trouble ticket systems. The Incident Object Description Exchange Format (IODEF) [18] constitutes a format for representing computer security information commonly exchanged between Computer Security Incident Response Teams (CSIRTs). It provides an XML representation for conveying incident information across administrative domains between parties that have an operational responsibility for remediation or a watch-and-warning role over a defined constituency. The data model encodes information about hosts, networks, and the services running on these systems; attack methodology and associated forensic evidence; impact of the activity; and limited approaches for documenting workflow.

The EGEE project heavily uses shared resources spanning more than 45 countries and involving more than 1600 production hosts. To link these resources together, the network infrastructure used by EGEE is mainly served by GÉANT2 and the NRENs. The NRENs provide links to sites within a country, while GÉANT2, the seventh generation of pan-European research and education network, connects the countries. To link the Grid and network worlds, the ENOC [11], the EGEE Network Operations Centre, has been defined in EGEE as the operational interface between the EGEE Grid, GÉANT2 and the NRENs, checking the end-to-end connectivity of Grid sites. Through daily relations with all providers of the network infrastructures on top of which EGEE is built, it ensures that the complex nexus of domains involved in linking the Grid sites performs efficiently. The ENOC deals with network problem troubleshooting, notifications from the NRENs, network Service Level Agreement (SLA) installation and monitoring, and network usage reporting. The ENOC acts as the network support unit in the Global Grid User Support (GGUS) of EGEE to provide coordinated user support across Grid and network services.


In the next section we describe the Data Model that was adopted by the EGEE parties for TT normalization.

3 Definition of the Data Model

There has been a long discussion on the functionality of the emerging data model. We thoroughly examined the various fields supported by the numerous ticketing systems in use. There has also been a lot of effort to incorporate all the critical fields that could ease network monitoring and management of the Grid. We consolidated all the experts' opinions regarding the importance of each field and its effects on the management of both the individual NRENs and the Grid. The goal was to define a comprehensive set of fields that would best fit the network management needs of the EGEE Grid. As a result of this procedure, we provide below the definition of the data model that aims to achieve the required functionality for the management of the Grid.

3.1 Terminology

The Trouble Ticket Data Model (TTDM) uses specific keywords to describe the various data elements. These keywords are Defined, Free, Multiple, List, Predefined String, String, Datetime, Solved, Cancelled, Inactive, Superseded, Opened/Closed, Operational, Informational, Administrative and Test, and they are interpreted as described in Sections 3.5 and 3.6.

3.2 Notations

This section provides a Unified Modelling Language (UML) model describing the individual classes and their relationships with each other. The semantics of each class are discussed and their attributes are explained. The terms "class" and "element" will be used interchangeably to reference a given UML class in the data model.

3.3 The TTDM Attributes

The Field Name class has four attributes. Each attribute provides information about a Field Name instance. The attributes that characterize one instance constitute all the information required to form the data model.

− DESCRIPTION: This field contains a short description of the field name.
− TYPE: The TYPE attribute contains information about the type of the field name it depends on. The values that it may contain are: Defined, Free, Multiple, List.
− VALID FORMAT: This attribute contains information about the format of each field. The values that it may contain are: Predefined String, String, Datetime.
− MANDATORY: This attribute indicates whether the information of each field is required or optional. When the information is required, the MANDATORY field contains the word Yes; when filling in the information is optional, it contains the word No.


3.4 The TTDM Aggregate Classes

The TTs collected and processed from the multiple telecommunications networks are cast into the normalized TTDM. In this section, the individual components of the TTDM are discussed in detail. The TTDM aggregate class provides a standardized representation for commonly exchanged Field Name data. We provide below the field name values that are defined in our model. For convenience, as most names are self-explanatory, and for readability reasons, we only provide the values: Partner_ID, Original_ID, TT_ID, TT_Open_Datetime, TT_Close_Datetime, Start_Datetime, Detect_Datetime, Report_Datetime, End_Datetime, TT_Lastupdate_Time, Time_Window_Start, Time_Window_End, Work_Plan_Start_Datetime, Work_Plan_End_Datetime, TT_Title, TT_Short_Description, TT_Long_Description, Type, TT_Type, TT_Impact_Assessment, Related_External_Tickets, Location, Network_Node, Network_Link_Circuit, End_Line_Location_A, End_Line_Location_B, Open_Engineer, Contact_Engineers, Close_Engineer, TT_Priority, TT_Status, Additional_Data, Related_Activity_History, Hash, TT_Source, Affected_Community, Affected_Service.

3.5 Types and Definitions of the TYPE Class

The TYPE class defines four data types, as follows:

− Defined: the TTDM provides a means to compute this value from the rest of the fields.
− Free: the value can be freely chosen.
− Multiple: one value among multiple fixed values.
− List: several values among multiple fixed values.

3.6 Types and Definitions of the VALID FORMAT Class

The VALID FORMAT class defines three data types, as follows:

− Predefined String: a predefined value in the data model.
− String: a value defined by the user of the model.
− Datetime: a date-time string that indicates a particular instant in time.

The predefined values are associated with the appropriate Field Name class. The values are strict and should not be altered. The values defined in our model are the following:

− TT_Type, with accepted predefined values: Operational, Informational, Administrative, Test.
− Type, with accepted predefined values: Scheduled, Unscheduled.
− TT_Priority, with accepted predefined values: Low, Medium, High.
− TT_Short_Description, with accepted predefined values: Core Line Fault, Access Line Fault, Degraded Service, Router Hardware Fault, Router Software Fault, Routing Problem, Undefined Problem, Network Congestion, Client upgrade, IPv6, QoS, Other.
− TT_Impact_Assessment, with accepted predefined values: No impact, Reduced redundancy, Minor performance impact, Severe performance impact, No connectivity, On backup, At risk, Unknown.


− TT_Status with accepted predefined values: Solved, Cancelled, Inactive, Superseded, Opened/Closed.
− TT_Source with accepted predefined values: Users, Monitoring, Other NOC.
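Putting the pieces together, a normalized ticket is essentially a set of Field Name values respecting these formats. The following minimal sketch shows a hypothetical instance (identifiers and timestamps are invented; only predefined values from above are used where applicable):

```python
# Hypothetical normalized trouble ticket (illustrative values only).
ticket = {
    "Partner_ID":           "GRNET",
    "TT_ID":                "TT-2008-0042",
    "TT_Open_Datetime":     "2008-10-08T09:30:00Z",
    "Type":                 "Unscheduled",
    "TT_Type":              "Operational",
    "TT_Priority":          "High",
    "TT_Short_Description": "Core Line Fault",
    "TT_Impact_Assessment": "No connectivity",
    "TT_Status":            "Opened/Closed",
    "TT_Source":            "Monitoring",
}
```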

4 Implementation

XML [19] was the choice for the implementation schema, due to its powerful mechanisms and its global acceptance. The implemented system operates as depicted in Fig. 1, below.

Fig. 1. The Implemented System

Our system connects to the GRnet ticketing system and uses POP e-mail to download the TTs. It then converts the TTs according to the data model presented, stores them in a database and finally sends them via e-mail to a specified ENOC address. More options are available:
− TTs can be sent via the HTTP protocol to a web service or a form.
− TTs can be stored in another (remote) database.
− TTs can be sent via e-mail in XML format (not suggested, since the XML format is not human readable).

An SMS send option (to mobile phones) is under development, since this proves to be vital for tickets of extreme importance. For this option to work, an SMS server needs to be provided. Language issues are also being addressed, in order to ease understanding of all fields in a TT; i.e. Greek-to-English translation needs to be performed for some predefined fields, like TT_Type. Our system also improves security: most web forms use ASP, PHP or CGI to update the database or perform some other action. This is inherently insecure because the database needs to be accessible from the web server. Our system offers a number of advantages:


− The e-mail can be sent to a completely separate and secure PC.
− Our system can process the e-mail without ever needing access to the web server or being accessible from the Internet.
− If a PC or network connection is down, the e-mails will sit waiting. Our system will 'catch up' when the PC or network is back up.

Moreover, we offer increased redundancy: when using web forms to update backend databases, the problem always arises of what to do if the database is down. Our system resolves this problem, because the e-mail will always get sent. In case the database cannot be updated, our system will wait and process the e-mail later.
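To make the processing chain concrete, here is a minimal sketch of the described pipeline, assuming hypothetical server names, credentials, table layout and a placeholder normalize() mapping (none of these names come from the actual implementation):

```python
import poplib
import smtplib
import sqlite3
from email import message_from_bytes
from email.mime.text import MIMEText

def normalize(msg):
    # Placeholder: map an NREN-specific ticket mail onto the TTDM fields.
    return {"TT_ID": msg["Subject"], "xml": "<ticket>...</ticket>"}

def process_tickets(pop_host, user, password, enoc_addr):
    db = sqlite3.connect("tickets.db")                    # local store
    db.execute("CREATE TABLE IF NOT EXISTS tickets (id TEXT, xml TEXT)")
    box = poplib.POP3_SSL(pop_host)                       # fetch TTs over POP
    box.user(user)
    box.pass_(password)
    for i in range(len(box.list()[1])):
        raw = b"\n".join(box.retr(i + 1)[1])
        ticket = normalize(message_from_bytes(raw))       # convert to the data model
        db.execute("INSERT INTO tickets VALUES (?, ?)",
                   (ticket["TT_ID"], ticket["xml"]))
        out = MIMEText(ticket["xml"], "xml")              # forward to ENOC by e-mail
        out["From"], out["To"] = "tts@example.org", enoc_addr
        with smtplib.SMTP("localhost") as s:
            s.send_message(out)
    db.commit()
    box.quit()
```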

5 Conclusions

In the present work, a common format for normalizing trouble tickets from the various NRENs participating in the Grid, implemented in the EGEE project framework, has been designed and implemented. XML was used to represent the common format. The adopted transformation schema is lightweight yet effective and has been accepted by all participating partners. The solution has passed beta testing and is already in use for the GRNET TTs. The other NRENs are migrating to the solution gradually.

Acknowledgments

This paper is largely based on work for TT normalization in the SA2 Activity of EGEE. The following groups and individuals contributed substantially to this document and are gratefully acknowledged:
− Jeannin Xavier, Goutelle Mathieu, Romier Genevieve, UREC/CNRS
− Cessieux Guillaume, CNRS-CC IN2P3
− Rodwell Toby, Apted Emma, DANTE
− Allocchio Claudio, Vuagnin Gloria, Battista Claudia, GARR
− Schauerhammer Karin, Stoy Robert, DFN

References
1. Quality of Service (QoS), http://en.wikipedia.org/wiki/Quality_of_service
2. Service Level Agreement (SLA), http://en.wikipedia.org/wiki/Service_Level_Agreement
3. GÉANT2, http://www.geant2.net/
4. Keystone, http://www.stonekeep.com/keystone.php/
5. ITSM, http://www.bmc.com/products/products_services_detail/0,0,19052_19426_52560284,00.html
6. HEAT, http://www.frontrange.com/ProductsSolutions/SubCategory.aspx?id=35&ccid=6

7. SimpleTicket, http://www.simpleticket.net/
8. OTRS, http://otrs.org/
9. GRnet, http://www.grnet.gr/
10. Service Activity 2 (SA2), http://www.eu-egee.org/
11. ENOC, http://egee-sa2.web.cern.ch/egee-sa2/ENOC.html
12. RFC 1297, http://www.ietf.org/rfc/rfc1297.txt
13. EGEE-SA2-TEC-503527-NREN Interface-v2-7
14. PERTKB Web, http://pace.geant2.net/cgi-bin/twiki/view/PERTKB/WebHome
15. GÉANT2 Performance Enhancement and Response Team (PERT) User Guide and Best Practice Guide, http://www.geant2.net/upload/pdf/GN2-05-176v4.1.pdf
16. Implementing a Multi-Domain Monitoring Service for the European Research Networks, http://tnc2007.terena.org/core/getfile.php?file_id=148
17. The OSS Trouble Ticket API (OSS/J API), http://jcp.org/aboutJava/communityprocess/first/jsr091/index.html
18. The Incident Object Description Exchange Format (IODEF), http://tools.ietf.org/html/draft-ietf-inch-implement-01
19. XML, http://en.wikipedia.org/wiki/XML

Extension of Resource Management in SIP
Franco Callegati and Aldo Campi
University of Bologna, Italy
{franco.callegati,aldo.campi}@unibo.it

Abstract. In this work we discuss the issue of communication QoS management in a high-performance network oriented to carrying GRID applications. Based on a set of previous works that successfully proved the feasibility of the concept, we propose to use sessions to logically identify and manage the communication between the applications. As a consequence, the quality of service of the communication is mapped onto the reservation of network resources to support a given session. Starting from a framework defined to support such a task for VoIP applications, we show here how this framework can be extended to match the needs of GRID computing.

Keywords: Application oriented networking, QoS, SIP, Session layer.

1 Introduction

In a GRID computing environment the network becomes part of the computation resources and its performance becomes a major issue. Up to a certain extent, the rather “trivial” issue of bandwidth availability can be solved by “brute force”, enhancing the technology of the links and making them more powerful. Nonetheless, other forms of more intelligent quality of service management, as well as the signaling to provide them, require some enhancement of the network infrastructure. In general terms, the network will successfully satisfy the applications' demands only if it is able to provide the necessary grade of service. For instance, the time needed to transfer a data object that is part of a more elaborate, distributed processing task will impact the overall time needed to complete the processing task itself. The network must match such requirements, and the applications must be able to make their requests evident at the session establishment phase. This is a general issue that requires a general answer. The work presented in this paper refers to the study case of an optical network using a fast switching technology such as Optical Burst Switching to support short-lived consumer grid applications [1][2]. For such a case study, previous works [3][4] investigated the possibility of implementing all the signaling needed to support the GRID services as part of the network, by means of a network session management layer using the SIP protocol [5]. Tasks like resource publication, discovery and reservation may be accomplished by means of SIP message exchanges as part of communication sessions [4]. This solution appears appealing since it leverages a lot on existing building blocks, even though the conventional SIP server must be enhanced with modules that interact with the network and


2 SIP Based Application Oriented Networking A trend we can envisage in new IT services such as GRID computing is that they tend to be “state-full” rather than state-less. The more complex communication paradigms require a number of state variables to manage the information flows. Moreover the Quality of Service issue is more and more important when the communication paradigms become more complex. Multimedia streams have rather stringent QoS requirements and communication without QoS guarantees is rather meaningless in this context. We believe that the state-full approach with QoS management capabilities has to be pursued for an “application aware” control plane that must manage “vertically” the communication. This is where the session concepts and the SIP protocol get into play: − sessions are used to handle the communication requests and maintain their state by mapping it into session attributes; − sessions are mapped into a set of networking resources with QoS guarantees; − the SIP protocol is used to manage the sessions since it provides all the primitives for user authentication, session set-up, suspension and retrieval, as well modification of the service by adding or taking away resources or communication facilities according to the needs. In [3] and [4] we have already described some different approaches to resource discovery and reservation exploiting the SIP protocol for GRID network. The resource reservation is part of the establishment of a session that is then used to carry the information flows. We considered that the SIP user agent (UA) already knows the location of the resources into the network, so direct reservation can be performed. In this scenario a SIP UA coupled with an application opens a dialog with a peer SIP UA. The dialog is the logical entity identifying the service relation between end-points (e.g. Video on


Demand dialog) and is made of one or many sessions. The sessions are the logical entities representing the communication services delivered between end-points, e.g. an audio session, video session, chat session, etc. Sessions exploit the network resources and may be associated with QoS requests. These concepts are in line with the IP Multimedia Subsystem (IMS) architecture [8]. Generally speaking, a GRID application can be successfully served only if both the application resources to execute the required processing and the network resources required to exchange the related data are available. Two models are possible to finalize the network resource reservation:
− the end-to-end model, requiring that the peers have the capability to map the media streams of a session into network resource reservations; for example, IMS supports several end-to-end QoS models, and terminals may use link-layer resource reservation protocols (PDP Context Activation), the Resource ReSerVation Protocol (RSVP), or Differentiated Services (DiffServ) directly;
− the core-oriented model, where the transport network already provides QoS-oriented service classes (for instance using DiffServ) and the sessions are mapped directly into these classes by the network itself.

In both models the sequence of actions to be performed at the application layer is:
− agree on the QoS characteristics for the communication paths of the session;
− check if the related network resources are available;
− reserve the resources for the desired session;
− set up the application session and consequently start the data streams.

At the network layer the aforementioned actions are mapped into:
− checking, by means of the network control plane, whether enough resources are available;
− reserving the network resources providing the required QoS.

The main issue is how the search and reservation of application and network resources are combined in time during the session start-up phase. This matters because the completion of the session requires alerting the end-user to ask whether the session is acceptable or not (the phone ringing in a VoIP call, for instance). Whether this has to be done before, while or after the network resources required for a given QoS communication are reserved is a matter to be discussed and tailored according to the specific service. It is worth mentioning that for communications spanning multiple domains, QoS support is not straightforward. In the remainder of this work we do not consider the multi-domain issue, which will be the subject of further investigation, and focus on QoS guarantees within a single network domain.

3 Session QoS Management with SIP

Given that a session set-up phase is always started with an INVITE message sent by the caller to the callee, several interaction models are possible to guarantee the establishment of a session with network QoS guarantees.

Network reservation during session set up: while the INVITE message goes through the network (e.g. with anycast procedures), a network path is reserved before the


INVITE message reaches the resource destination. This method can be used for very fast provisioning purposes.

Network reservation before application reservation: as soon as the INVITE message arrives at the destination and the application resources are checked, the network reservation starts. After the network resources are reserved, the application resource is reserved and the session started. This is the case of VoIP calls.

Network reservation before session set up: as soon as the INVITE message arrives at the destination and the application resources are found, the application and network resource reservations are started in parallel. Once both reservations have completed independently, and both the application resource and the network path are available, the session is started. This can be the case of standard GRID sessions.

Network reservation after application reservation: the INVITE message arrives at the destination and the application resources are checked and reserved. Then the network reservation starts in the network, and as soon as the path between GRID user and GRID resource is established, the application session is started. This can be used when there are few application resources available and the probability of finding a free application resource is low.

Network reservation after session set up: the INVITE message arrives at the destination and the application resources are checked and reserved; the application session can start immediately, without any network resource reserved. Then the network reservation starts in the network, and as soon as the path between GRID user and GRID resource is established, the already started application session can take advantage of it, moving the transmission state from best-effort to QoS-enabled. This can be used when QoS is not crucial for the application (or at least not in the beginning part of the session).

The management of the QoS issue is not part of the standard SIP protocol, but the issue is there and, not surprisingly, a resource management framework was defined for establishing SIP sessions with QoS [7]. The solution proposed exploits the concept of precondition. The precondition is a set of “desiderata” that are negotiated during the session set-up phase between the two SIP UAs involved in the communication (the application terminals). If the preconditions can be met by the underlying networking infrastructure then the session is set up; otherwise the set-up phase fails and the session is not established. This scheme is mandatory because the reservation of network resources frequently requires learning the IP address, port, and session parameters of the callee. Moreover, in a bidirectional communication the QoS parameters must be agreed upon between caller and callee. Therefore the reservation of network resources cannot be done before an exchange of information between caller and callee has been finalized. This exchange of information is the result of the initial offer/answer message exchange at the beginning of the session start up. The information exchange sets the preconditions of the session, which is established if and only if they can be met.

3.1 Weakness of the Current Resource Management Framework

The current framework says that the QoS preconditions are included in the SDP [9] message, in two state variables: current status and desired status. Consequently, the


SIP UA treats these variables like all other SDP media attributes. The current and desired status variables are exchanged between the UAs using offers and answers in order to have a shared view of the status of the session. The framework has been proposed for VoIP applications and is tailored to this specific case, but it has the following limitations when considering a generalized application-aware environment.
1. Both session and quality of service parameters are carried by the same SDP document. Thus, a session established with another session protocol (e.g. the Job Submission Description Language (JSDL)) is not possible.
2. The precondition framework imposes a strict “modus operandi”, since preconditions must be met before alerting the user. Therefore network reservation must always be completed before service reservation.
3. Since the network reservation is performed by the UA at the edge of the network, only a mechanism based on end-to-end reservation is applicable.
4. QoS is handled on each single connection, without the ability to group connections resulting from sessions established by others.
5. Since SDP is a line-based description protocol, it has reduced enquiring capability. Moreover, only one set of preconditions can be expressed in SDP; multiple negotiations of preconditions are not possible.
6. The SDP semantics are rather limited. It is not possible to specify detailed QoS parameters, since the SDP lines are marked with a “QoS” parameter without specifying any additional detail (i.e. bandwidth, delay, jitter, etc.).

Given these issues, we believe an extended resource management framework is needed in order to achieve a general framework for QoS in application-oriented networks supporting GRID applications. In the next section we propose an extension of the framework to generalize its applicability to the Network reservation before application reservation scenario, which we believe to be the most likely to be used in a GRID network. Extensions to other scenarios are possible and will be presented in future works.
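For reference, points 1, 5 and 6 above concern the way RFC 3312 encodes preconditions as SDP attributes. A minimal sketch of such an offer is given below (addresses and ports are placeholders; the attribute syntax follows RFC 3312):

```python
# Sketch of RFC 3312 precondition attributes inside an SDP offer.
sdp_offer = "\n".join([
    "m=audio 20000 RTP/AVP 0",
    "c=IN IP4 192.0.2.1",
    "a=curr:qos e2e none",               # current status: nothing reserved yet
    "a=des:qos mandatory e2e sendrecv",  # desired status: QoS required in both directions
])
```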

4 The Extended Resource Management Framework

The extension can be implemented exploiting the following ideas:
1. do not limit the framework to the use of the SDP protocol, but allow more general protocols in the SIP payload at session start-up;
2. separate the information about requests and requirements related to the application layer from the information about requests and requirements at the network layer, using two different protocols to carry them;
3. allow as many re-negotiations as needed of the preconditions, both at session start-up and while the session is running.

Regarding the protocols used to declare the preconditions, we propose to use two protocols.
− Application Description Protocol (ADP): used to describe application requirements, for instance JSDL, which is a consolidated language for describing job submission in GRID networks.


− Network Resource Description Protocol (NRDP): used to describe the network requirements and QoS requests.

The use of a completely different document descriptor for QoS allows the use of many ADPs for describing application requirements, and not only SDP. In this work we assume the Resource Description Framework (RDF) as the candidate protocol for this task. The Resource Description Framework [10] is a general method of modeling information through a variety of syntax formats. RDF has the capability to describe both resources and their state. For this reason RDF can be used instead of SDP as a more general description language for the network resources. Furthermore, because RDF is a structured protocol, it is possible to enrich its semantics at will with detailed information about QoS (i.e. bandwidth, jitter, delay, etc.). The main advantage of this approach is that the network does not need the capability to understand the ADP, since the QoS requirements are only described in the NRDP. The main drawback is that a link between ADP and NRDP is needed; therefore some sort of “interface” has to be implemented on top of a standard SIP UA. The extension to precondition negotiation is implemented exploiting two concepts already present in SIP:
− the INVITE multi-part message: an INVITE that carries more than one protocol in the payload (multi-body), in this case for instance RDF and JSDL;
− the NOTIFY message [11]: a message that allows the exchange of information related to a “relationship” between two UAs, for instance a session that is starting up or that is already running.

Fig. 1 presents an overview of the call flow in an end-to-end scenario where the network resource reservation is a precondition to the session set up. User A sends an initial INVITE multi-part message including both the network QoS requests and an application protocol for the application requests. The QoS requests are expressed by means of an RDF document, while the application protocol depends on the type of application (i.e. SDP for VoIP, JSDL for GRID computing, etc.). A does not want B to start providing the requested service until the network resources are reserved in both directions and end-to-end. B starts by checking whether the service is available; if so, it agrees to reserve network resources for this session before starting the service. B will handle resource reservation in the B⇒A direction, but needs A to handle the A⇒B direction. To indicate this, B returns a 183 (Session Progress) response with an RDF document describing the quality of service from its point of view. This first phase goes on with the two peers exchanging their views of the desired QoS of the communication, thus setting the “preconditions” of the session in terms of communication quality. Then both A and B start the network resource reservation. When a UA finishes reserving resources in one direction, it sends a NOTIFY message to the other UA to report the current reservation status, with an RDF document in the message body. The NOTIFY message is used to specify the notification of an event. Only when the network channel meets the precondition does B start the reservation of the desired resources in the application domain, and session establishment may complete as normal.
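The following minimal sketch shows how such a multi-part INVITE body could be assembled; the JSDL fragment and the RDF QoS vocabulary (the qos: terms) are hypothetical illustrations, not a standardized schema:

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# ADP part: application requirements (here a JSDL skeleton).
jsdl = """<jsdl:JobDefinition xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl">
  <jsdl:JobDescription>...</jsdl:JobDescription>
</jsdl:JobDefinition>"""

# NRDP part: network QoS requirements as RDF (the qos: vocabulary is invented).
rdf = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:qos="http://example.org/qos#">
  <rdf:Description rdf:about="#session-1">
    <qos:bandwidth>100Mbps</qos:bandwidth>
    <qos:maxDelay>20ms</qos:maxDelay>
    <qos:status>desired</qos:status>
  </rdf:Description>
</rdf:RDF>"""

body = MIMEMultipart("mixed")         # carried as the SIP INVITE payload
body.attach(MIMEText(jsdl, "xml"))    # ADP body part
body.attach(MIMEText(rdf, "xml"))     # NRDP body part
print(body.as_string())
```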


Once the service is ready to be provided, B sends back a 200 OK message to A. Data can then be exchanged by the two end-points, since both application and network resources have been reserved. A BYE message closes the dialog.

Fig. 1. Example of Call Flow at session set-up

The main differences with the current approach are:
− Use of a separate RDF document in a multi-part INVITE message. The INVITE message starts a SIP dialog between the parties. The peer's RDF is carried by a 183 response.
− Use of NOTIFY messages instead of UPDATE. Since ADP and NRDP are separated (even if carried by the same INVITE message), the UPDATE message can be sent only by UA A and can be used to update both ADP and NRDP. The NOTIFY can be sent by both UAs, is related to the NRDP only, and guarantees that the issue of network resource reservation is addressed independently from that of application resource reservation. The NOTIFY messages have the same Dialog-ID, From and To tags as the INVITE message.
− The INVITE message implies the opening of a logical relationship between the peers (a subscription) that ends with the dialog tear down. Each time the network reservation state changes, a NOTIFY message is used to notify the changes for a specific subscription.

The Resource Description Framework is used to generate an offer regarding the desired QoS in the network. The network resource reservation parameters can be many, with different semantics. The purpose of this paper is not to give an extensive and detailed list of reservation and QoS parameters, but to present the general


solution which, exploiting the flexibility of RDF, can be adapted by the end users or by the network provider according to their specific needs.

5 Conclusion

In this work we have presented a possible extension of the resource management framework for QoS in SIP. We have addressed the issues of application-oriented networking, presenting an approach based on communication services mapped into sessions and showing that different session requirements require different interactions between the application and network layers. The current SIP resource management framework for QoS can cope only with VoIP applications, using a generic precondition mechanism. We have presented the first step of an extended resource management framework, showing a call flow that fits GRID applications. Extensions to other scenarios and details about the interaction between RDF and JSDL will be presented in future works.

Acknowledgments

The work described in this paper was carried out with the support of the BONE project (“Building the Future Optical Network in Europe”), a Network of Excellence funded by the European Commission through the 7th ICT Framework Programme.

References
1. Simeonidou, D., et al.: Optical Network Infrastructure for Grid. Open Grid Forum document GFD.36 (August 2004)
2. Simeonidou, D., et al.: Dynamic Optical-Network Architectures and Technologies for Existing and Emerging Grid Services. IEEE Journal on Lightwave Technology 23(10), 3347–3357 (2005)
3. Campi, A., et al.: SIP Based OBS Networks for Grid Computing. In: Proceedings of ONDM 2007, Athens, Greece, May 29-31 (2007)
4. Zervas, G., et al.: SIP-enabled Optical Burst Switching architectures and protocols for application-aware optical networks. Computer Networks 52(10) (2008)
5. Rosenberg, J., et al.: SIP: Session Initiation Protocol. IETF RFC 3261 (June 2002)
6. Zervas, G., et al.: Demonstration of Application Layer Service Provisioning Integrated on Full-Duplex Optical Burst Switching Network Test-bed. In: OFC 2008, California, USA (June 2008)
7. Camarillo, G., et al.: Integration of Resource Management and Session Initiation Protocol (SIP). IETF RFC 3312 (October 2002)
8. Poikselka, M., et al.: The IMS: IP Multimedia Concepts and Services, 2nd edn. Wiley, Chichester (2006)
9. Rosenberg, J., Schulzrinne, H.: An Offer/Answer Model with Session Description Protocol (SDP). IETF RFC 3264 (June 2002)
10. Beckett, D. (ed.): RDF/XML Syntax Specification (Revised). W3C Recommendation, February 10 (2004)
11. Roach, A.B.: Session Initiation Protocol (SIP)-Specific Event Notification. IETF RFC 3265 (June 2002)

Economic Model for Consistency Management of Replicas in Data Grids with OptorSim Simulator
Ghalem Belalem
Dept. of Computer Science, Faculty of Sciences, University of Oran – Es Senia, Oran, Algeria
[email protected]

Abstract. Data Grids are solutions currently suggested to meet the needs of large-scale systems. They provide a set of varied, geographically distributed resources whose goal is to ensure fast and effective access to data, to improve availability, and to tolerate failures. In such systems, these advantages are only possible through the use of replication. Nevertheless, several problems can appear with the use of replication techniques, the most significant being the maintenance of consistency of modified data. Strategies for replication and job scheduling are tested by simulation, and several grid simulators have been developed; one of the important simulators for Data Grids is the OptorSim tool. In this work, we present an extension of OptorSim with a consistency management service for replicas in a Data Grid. This service is based on a hybrid approach, combining the pessimistic and optimistic approaches with an economic model, articulated on a hierarchical model.

Keywords: Data Grid, Replication, Consistency, OptorSim, Optimistic approach, Pessimistic approach, Economic models.

1 Introduction

Replication in Grids [3] is a popular research topic and a widely accepted way to improve data availability and fault tolerance in Grids. Broadly speaking, we can say that the technique of replication raises several problems [4, 5, 7], such as: (i) the degree of replication; (ii) the choice of replicas; (iii) the placement of replicas; (iv) the service for consistency management of replicas. Two approaches are generally used for consistency management: the optimistic approach and the pessimistic approach [6]. In the present paper, we propose a service based on a hybrid approach, combining the two consistency approaches (pessimistic and optimistic) with the market economic model. The proposed approach rests on a hierarchical model with two layers; it aims principally at maintaining the consistency of large-scale systems. The remaining part of this article is organized as follows: in Section 2, we present the fundamental principles of the consistency management approaches. Section 3 describes the hybrid approach and the proposed model in terms of structure, functionality, features and the principal algorithm of our economic-model-based approach for the consistency management of


replicas. The various preliminary experiments are discussed in Section 4. Finally, we close this article with a conclusion and some future work perspectives.

2 Approaches of Consistency Management

Replicated data consist of multiple copies held on separate computers. Replication is an important technique that improves availability and performance, but the various replicas of the same object must remain consistent, i.e. appear as a single copy. There are many consistency models; they do not all offer the same performance, nor do they impose the same constraints on application programmers. The consistency management of replicas can be done either in a synchronous way, using so-called pessimistic algorithms, or in an asynchronous way, using so-called optimistic algorithms [6]. The synchronous model propagates updates immediately to the replicas while blocking access, whereas the asynchronous model defers the propagation of updates, which introduces a certain level of divergence between replicas.

1. Pessimistic approach: this forbids any access to a replica unless it is up to date [1, 6], which gives the user the illusion of a single consistent copy. The main advantage of this approach is that all replicas converge at the same time, which guarantees strong consistency and thus avoids any problem linked to a reconciliation stage. This type of approach is well adapted to small and medium-scale systems, but it becomes very complex to implement at large scale. Its major inconveniences are: (i) it is very badly adapted to unstable environments, such as mobile systems or grids with a high rate of change; (ii) it cannot support the cost of updates when the degree of replication is very high.

2. Optimistic approach: this approach authorizes access to any replica at any time. In this way, it is possible to access a replica which is not necessarily consistent [1, 6]; the approach tolerates a certain divergence between replicas. On the other hand, it requires a phase of detection of differences between replicas, followed by a phase of correction of these differences by converging the replicas to a consistent state. Although it does not guarantee strong consistency as in the pessimistic case, it nevertheless possesses a number of advantages, which we can summarize as follows [1]: (i) it improves availability, because access to data is never blocked; (ii) it is flexible as regards network management, which does not need to be fully connected for the replicas to be completely accessible, as in mobile environments; (iii) it can support a large number of replicas, since it requires little synchronization between replicas; (iv) its algorithms are well adapted to large-scale systems.


3 Service of Consistency Management Based on Economic Models

To improve the quality of service (QoS) of consistency management of replicas in data grids, we deem it valuable to extend the work presented in [1] with a market economic model for resolving conflicts between replicas. In the real-world market, there exist various economic models for setting the price of services based on supply and demand and their value to users. These real-world economy models, such as the commodity market model, market bidding, the auction model, the bargaining model, etc., can also be applied to allocate resources and tasks in distributed computing [2]. In our work, we regard a grid as a distributed collection of computing elements (CEs) and storage elements (SEs). These elements are connected by a network to form a site. The hierarchical model is composed of two levels (see Figure 1): level 0 is in charge of controlling and managing local consistency within each site, while level 1 ensures the global consistency of the grid.

Fig. 1. Proposed Model

The main stages of the proposed global process of consistency management of replicas are defined in Figure 2, below.

Fig. 2. Principal stages of consistency management


− Stage 1: Information collection: examples of collected information are the number of conflicts and divergences per site, as well as the current time.
− Stage 2: Metric computation (analysis): the collected information is analyzed and compared with the tolerated thresholds (conflict threshold, divergence threshold, ...).
− Stage 3: Decision: the adequate consistency management algorithm for the replicas is chosen.

To study the collaboration between the various entities of the consistency management service, we used the activity diagram shown in Figure 3.

Fig. 3. Diagram of activity of service consistency management

3.1 Local Consistency

Local consistency, also called intra-site coherence, concerns the sites which make up the grid. Each site contains CEs and SEs. The replicated data are stored on the SEs and are accessible by the CEs via read or write operations. Each replica is attached to additional information called metadata (timestamps, indices, versions, ...). The control and management of consistency are ensured by representatives chosen through an election mechanism; the goal of these representatives is to process the read and write requests coming from the external environment, and also to control the degree of divergence within the site, by computing the measures appearing in Table 1: the rate of conflicts per site (τi), the distance within a site (Dlocal_i), and the dispersion of versions (σi).


Table 1. Measurements calculated in local consistency

Measure      Definition
$n_i$        Number of replicas of the same object inside $Site_i$
$D_{local\_i}$   $Version_{max} - Version_{min}$ of the replicas inside $Site_i$
$\tau_i$     Rate of the number of conflicts inside $Site_i$
$\sigma_i$   Standard deviation of $Site_i$: $\sigma_i = \sqrt{\frac{1}{n_i}\sum_{t=1}^{n_i}\left(V_{it}-\bar{V}_i\right)^2}$, where $\bar{V}_i$ is the average version inside $Site_i$
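A minimal sketch of these per-site measures follows, assuming each replica simply carries an integer version number and the site keeps a conflict counter (the exact conflict-rate formula is not spelled out in the model, so it is approximated here as conflicts per replica); the is_critical test anticipates the detection conditions described next:

```python
import statistics

def local_measures(versions, n_conflicts):
    """versions: version numbers of the replicas of one object inside a site."""
    n_i = len(versions)
    d_local = max(versions) - min(versions)   # D_local_i
    tau = n_conflicts / n_i                   # assumed conflict-rate definition
    sigma = statistics.pstdev(versions)       # sigma_i: dispersion of versions
    return d_local, tau, sigma

def is_critical(d_local, tau, sigma, d_tol, tau_tol, sigma_tol):
    # A site is in a critical situation when any tolerated threshold is exceeded.
    return tau > tau_tol or d_local > d_tol or sigma > sigma_tol
```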

We detect a critical situation in a site when one of the following conditions is met:
− τi > tolerated rate of conflicts;
− Dlocal_i > tolerated distance;
− σi > σm, the tolerated rate of dispersion of versions.

3.2 Global Consistency

This level, also called inter-site coherence, is responsible for the global consistency of the data grid; each site cooperates with the other sites via the representatives, who communicate among themselves. The goal of the representatives is to control the degree of divergence and conflict between sites, by computing the measures appearing in Table 2.

Table 2. Measurements calculated in global consistency

Measure          Definition
$D_{global}(i,j)$   If $n_i < n_j$ then $\sqrt{\frac{1}{n_i}\sum_{t=1}^{n_i}\left(V_{it}-V_{jt}\right)^2}$ else $\sqrt{\frac{1}{n_j}\sum_{t=1}^{n_j}\left(V_{it}-V_{jt}\right)^2}$
$\tau_{ij}$      $(\tau_i+\tau_j)/(n_i+n_j)$: rate of conflicts between $Site_i$ and $Site_j$
$Cov(i,j)$       If $n_i < n_j$ then $\frac{1}{n_i}\sum_{t=1}^{n_i}\left(V_{it}-\bar{V}_i\right)\left(V_{jt}-\bar{V}_j\right)$ else $\frac{1}{n_j}\sum_{t=1}^{n_j}\left(V_{it}-\bar{V}_i\right)\left(V_{jt}-\bar{V}_j\right)$
$\rho_{ij}$      $Cov(i,j)/(\sigma_i\,\sigma_j)$: coefficient of correlation between $Site_i$ and $Site_j$

Two situations can be treated. The first situation corresponds to a competitive negotiation, started after the detection of a critical situation in a site, which is reflected by:
− τij > threshold of the tolerated number of conflicts;
− Dglobal(i,j) > threshold of tolerated divergence;
− |ρij| < ε, where ε ≪ 1.

Algorithm 1.
...
if (... > 1) then
  candidates ← true
  Algorithm Negotiation /* negotiation of candidates */
else
  propagation of the updates of the stable site to the sites in crisis
end if
end for

If there are two or more candidate representatives (candidates = true) able to bring the site in crisis back to a stable state, a negotiation takes place between these candidates (see Algorithm 1); the best of them is the most stable one, and it will proceed to update the site in crisis. The second situation corresponds to a co-operative negotiation (see Algorithm 2) between the representatives of each site, triggered by the expiration of the period defined for global consistency.

Algorithm 2. NEGOTIATION
1: τi, Dlocal_i, σi /* measurements of the ith representative for a stable site */
2: winner ← first of all stable sites /* candidate supposed winner */
3: no_candidate ← false
4: for all representatives − {candidate_winner} do
5:   if (τi < τwinner) ∧ (Di < Dwinner) ∧ (σi < σwinner) then ...

Algorithm 3. DUTCH AUCTION
...
  if (τmax > τi) ∨ (Dlocal_max > Dlocal_i) ∨ (σmax > σi) then
    if versioni ≤ version_reserves then
      no_candidate ← true
    else
      representativemax ← representativei; no_candidate ← false
    end if
  end if
end for
if (no_candidate = false) then
  representativemax propagates the updates to the whole set of representatives
else
  Algorithm English Auction
end if

The English auction (Algorithm 4) follows the same principle as the Dutch auction, except that it is ascending: it starts with the representative holding the oldest replica and moves up according to the measurements computed within each site.

Algorithm 4. ENGLISH AUCTION
1: representativemin ← representative having the oldest version
2: version_reserves /* the average of all vectors of versions */
3: τi, Dlocal_i, σi /* measurements of the ith representative */
4: for all representatives − {representativemin} do
5:   if (τmin > τi) ∨ (Dlocal_min > Dlocal_i) ∨ (σmin > σi) then
6:     representativemin ← representativei
7:   end if
8: end for
9: representativemin propagates the updates to the whole set of representatives

4 Experimental Study

We have implemented a simulation tool for our approach and its components, whose overall architecture was discussed in the previous sections. The tool is based on the OptorSim Grid simulator [1], extended with our hybrid approach. In order to validate and evaluate our approach to consistency management of replicas compared with the pessimistic and optimistic approaches, we carried out a series of experiments whose results and interpretations are covered in this section. In order to analyze the results


relating to the experimentation of our approach, we used three metrics: the response time, the number of divergences, and the number of conflicts among replicas. These experiments were carried out with the following simulation parameters: 10 sites, 100 nodes, data replicated 10 times in each site according to the multi-master strategy, the bandwidth fixed at 100 Mb/ms for a star topology, and the number of (write) requests varied from 100 to 500 in steps of 100.

Fig. 4. Average response time / #Requests

Fig. 5. Average response time / #Requests

The graph in Figure 4 shows that, whatever the number of requests executed, the response times of the two pessimistic approaches, ROWA and Quorum, are clearly higher than those of the optimistic approach, the hybrid approach [1] and our approach. We notice in Figure 5 that the optimistic approach, which does not block requests, has a reduced execution time compared with the two other approaches, whose curves almost coincide; their increase is explained by the fact that local and global consistency were triggered to propagate the updates in both approaches. Although the two curves are very close, the curve of our approach remains below that of the hybrid approach. The next experiment allows us to evaluate and compare the QoS, taking into account the divergences and the conflicts measured during the execution of write requests, between our suggested approach and the optimistic one. The results presented in Figures 6, 7, 8 and 9 show significantly that our approach resolves divergences and conflicts quickly by using economic models.

Fig. 6. Number of divergences / #Requests

Fig. 7. Number of divergences / #Requests


Fig. 8. Number of conflicts / #Requests


Fig. 9. Number of conflicts / #Requests

5 Conclusion and Future Works

In this work, we tried to find a balance between performance and quality of service. This led us to propose a consistency management service based on a hybrid approach, combining the consistency approaches (pessimistic and optimistic) with the market economic model, which makes it possible to ensure both local consistency for each site and global consistency of the entire grid. The results of these experiments are rather encouraging and show that the objective of a compromise between quality of service and performance was achieved. There are a number of directions which we think are interesting; we can mention:
− integrating our approach as a Web service in the Globus environment by using WSDL technology [3];
− currently, we place replicas randomly; it would be useful to explore static or dynamic placement to improve QoS.

References
1. Belalem, G., Slimani, Y.: Hybrid Approach for Consistency Management in OptorSim Simulator. International Journal of Multimedia and Ubiquitous Engineering (IJMUE) 2(2), 103–118 (2007)
2. Buyya, R., Murshed, M.: GridSim: A Toolkit for the Modeling and Simulation of Distributed Resource Management and Scheduling for Grid Computing. The Journal of Concurrency and Computation: Practice and Experience 14(13-15), 1175–1220 (2002)
3. Foster, I., Kesselman, C.: The Grid 2: Blueprint for a New Computing Infrastructure. Elsevier Series in Grid Computing. Morgan Kaufmann Publishers, San Francisco (2004)
4. Gray, J., Helland, P., O'Neil, P., Shasha, D.: The Dangers of Replication and a Solution. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada, June 4-6, 1996, pp. 173–182 (1996)
5. Ranganathan, K., Foster, I.: Identifying Dynamic Replication Strategies for a High-Performance Data Grid. In: Lee, C.A. (ed.) GRID 2001. LNCS, vol. 2242, pp. 75–86. Springer, Heidelberg (2001)
6. Saito, Y., Shapiro, M.: Optimistic Replication. ACM Computing Surveys 37(1), 42–81 (2005)
7. Xu, J., Li, B., Li, D.: Placement Problems for Transparent Data Replication Proxy Services. IEEE Journal on Selected Areas in Communications 20(7), 1383–1398 (2002)

A Video Broadcast Architecture with Server Placement Programming
Lei He¹, Xiangjie Ma², Weili Zhang¹, Yunfei Guo², and Wenbo Liu²
China National Digital Switching System Engineering and Technology Research Center (NDSC), P.O. 1001, 450002 Zhengzhou, China
{hl.helei,clairezwl}@gmail.com, {mxj,gyf,lwb}@mail.ndsc.com.cn

Abstract. We propose a hybrid architecture, MTreeTV, to support fast channel switching. MTreeTV combines the use of P2P networks with dedicated streaming servers, and was proposed to build on the advantages of both the P2P and CDN paradigms. We study the placement of the servers with constraints on the client-to-server paths and evaluate the effect of the server parameters. Through analysis and simulation, we show that MTreeTV supports fast channel switching (98%) and little control overload (

The packets transmitted by nodes i ≥ j, including the node j itself, are known to the node j. Only the packets transmitted by nodes i < j are new to the node j. Therefore, using successive interference cancellation (SIC) and knowledge of all channels, the node j can remove the known packets from (1) and reduce $x_j(k)$ to
$$\hat{x}_j(k) = \sum_{i=0}^{j-1} \sqrt{p\,g_{i,j}}\, e^{j\theta_{i,j}}\, u(k-i) + \sqrt{N}\, v_j(k). \tag{2}$$


For this signal, the signal-to-interference plus noise ratio (SINR) is
$$s_j(k) = \frac{p\,g_{j-1,j}}{\sum_{i=0}^{j-2} p\,g_{i,j} + N}. \tag{3}$$
However, this is not the only signal that the node j has for the detection of packet u(k − j + 1), and thus this SINR can be improved by other signals. In the slot k − 1, the node j has obtained a signal similar to (2), which is
$$x_j(k-1) = \sum_{i=0}^{j-1} \sqrt{p\,g_{i,j}}\, e^{j\theta_{i,j}}\, u(k-1-i) + \sqrt{N}\, v_j(k-1). \tag{4}$$

In the slot k − 1, it should have decoded and thus known the packet u(k − j). It can now remove this packet and reduce (4) to
$$\hat{x}_j(k-1) = \sum_{i=0}^{j-2} \sqrt{p\,g_{i,j}}\, e^{j\theta_{i,j}}\, u(k-1-i) + \sqrt{N}\, v_j(k-1), \tag{5}$$
which contains information about the packet u(k − j + 1) as well. The SINR for the signal (5) is
$$s_j(k-1) = \frac{p\,g_{j-2,j}}{\sum_{i=0}^{j-3} p\,g_{i,j} + N}. \tag{6}$$

The above procedure can easily be extended to reducing all signals that contain the packet u(k − j + 1). Specifically, the node j can exploit its received and processed signals in slots k − j + 1, · · · , k − 1, k, which have the general form
$$\hat{x}_j(k-\ell) = \sum_{i=0}^{j-\ell-1} \sqrt{p\,g_{i,j}}\, e^{j\theta_{i,j}}\, u(k-\ell-i) + \sqrt{N}\, v_j(k-\ell), \tag{7}$$
where ℓ = j − 1, · · · , 0, to detect the packet u(k − j + 1). The SINR for $\hat{x}_j(k-\ell)$ is
$$s_j(k-\ell) = \frac{p\,g_{j-\ell-1,j}}{\sum_{i=0}^{j-\ell-2} p\,g_{i,j} + N}. \tag{8}$$
Now the node j has j received signals to detect the packet u(k − j + 1), which need to be optimally combined to maximize the SINR. One way of optimal combining is maximal ratio combining (MRC). In order to derive the MRC, we first need to normalize the signals $\hat{x}_j(k-\ell)$ by their corresponding interference plus noise power. Specifically, the interference plus noise power of the signal $\hat{x}_j(k-\ell)$ is
$$I_j(k-\ell) = \sum_{i=0}^{j-\ell-2} p\,g_{i,j} + N, \tag{9}$$
which is exactly the denominator of (8). Then the signals can be normalized as
$$\tilde{x}_j(k-\ell) = \frac{1}{\sqrt{I_j(k-\ell)}}\, \hat{x}_j(k-\ell). \tag{10}$$
Note that after normalization, the SINR for $\tilde{x}_j(k-\ell)$ is still (8).


Then the MRC is conducted as
$$y_j(k) = \sum_{\ell=0}^{j-1} a_\ell\, \tilde{x}_j(k-\ell), \tag{11}$$
with combining weights $a_\ell$. The optimization objective is to maximize the SINR of $y_j(k)$, which we denote as $s_j$.

Proposition 1. With the optimal MRC coefficients
$$a_\ell = \sqrt{s_j(k-\ell)}\, e^{-j\theta_{j-\ell-1,j}}, \tag{12}$$
the SINR of $y_j(k)$ in (11) is maximized and equals the summation of the individual SINRs in (8), i.e.,
$$s_j = \sum_{\ell=0}^{j-1} s_j(k-\ell). \tag{13}$$

Proof. See [14]. ⊓⊔

Based on Proposition 1 and (13), we can calculate the SINR for a node j in an H-hop relaying path when detecting packets as
$$s_j = \sum_{\ell=0}^{j-1} \frac{p\,g_{j-\ell-1,j}}{\sum_{i=0}^{j-\ell-2} p\,g_{i,j} + N}, \tag{14}$$
for any node j = 1, · · · , H.

(15)

Furthermore, in a network with J +1 nodes, in order to find the highest H-hop transmission capacity from node 0 to node J, we need to select the best H − 1 nodes to form an H-hop transmission path that has the highest capacity. This can be configured as a max-min optimization problem C(H) =

max

nodes {1,···,H−1}⊂{1,J−1}

C1,···,H (H).

(16)

Unfortunately, exhaustive search of all possible node combinations becomes prohibitive even for small J. Therefore we need to look for new methods with reduced complexity.

168

3.2

X.(E.) Li

Hop Optimization in Source-Destination Line and Node Selection

From the SINR expression (14), we have seen the complex relationship among the nodes. For simplification, we consider the case that the first term in sj (with ℓ = 0) is the dominating one, i.e., j−1

 pgj−1,j pgj−ℓ−1,j ≫ . j−2 j−ℓ−2 pgi,j + N i=0 pgi,j + N i=0 ℓ=1

(17)

Intuitively, this means that the transmission of the node j − 1 has a dominating contribution to the received signal of the node j. Obviously, this is a reasonable assumption for a fixed H hop count. Otherwise, if the first term is in-significant, then the transmission of nodes 0 to j − 2 is even stronger than node j − 1 to the node j. This means that the node j −1 in fact wastes its transmission power, and this path can not have the highest capacity among the H-hop paths. Therefore, there is no loss to avoid considering such cases. Under the assumption (17), we can derive a simple way for selecting the hop nodes to enhance the transmission capacity. Proposition 2. For any H-hop relaying path, there exists a corresponding H-hop relaying path along the line connecting the source and the destination that has larger transmission capacity, if the nodes can be put in corresponding places on this line. ⊓ ⊔

Proof. See [14].

The significance of Proposition 2 is that the upper bound of the H-hop path capacity can be found by a max-min optimization along the source-destination line. This max-min optimization can be conducted relatively efficiently. Specifically, in order to find the highest capacity of H-hop relaying, we just need to find the H − 1 positions on the line that give the highest SINR. Let the parameter $d_k$, $k = 0, \cdots, H-1$, denote the distance between the node k and the node k + 1. Then the max-min optimization is formulated as a constrained optimization
$$\max_{\{d_k\}} \; \min_{1 \le j \le H} \; \sum_{\ell=0}^{j-1} \frac{p \left( \sum_{m=j-\ell-1}^{j-1} d_m \right)^{-\alpha}}{\sum_{i=0}^{j-\ell-2} p \left( \sum_{m=i}^{j-1} d_m \right)^{-\alpha} + N}, \tag{18}$$
under the constraint $\sum_{k=0}^{H-1} d_k = d_{0,J}$. We may also need the constraints $d_k \ge 1$ for $k = 0, \cdots, H-1$ to avoid the impractical case that a small $d_k$ makes the received power larger than the transmission power. Unfortunately, the evaluation of the max-min optimization (18) is nontrivial, and may only be conducted by numerical algorithms. Even with numerical evaluation, the results still rely on good initial conditions. Some simulation results based on numerical optimization were given in [14].
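As an illustration of such a numerical evaluation, the following sketch feeds (18) to a generic constrained optimizer; all parameter values are hypothetical, and the paper's own results in [14] were obtained with a different numerical setup:

```python
import numpy as np
from scipy.optimize import minimize

alpha, p, N, H, d0J = 3.0, 1.0, 1e-3, 4, 100.0   # assumed values

def node_sinr(j, d):
    """SINR of node j for hop distances d[0..H-1], as in (18)."""
    s = 0.0
    for l in range(j):
        num = p * sum(d[j - l - 1:j]) ** (-alpha)
        den = sum(p * sum(d[i:j]) ** (-alpha) for i in range(j - l - 1)) + N
        s += num / den
    return s

def neg_min_sinr(d):
    return -min(node_sinr(j, d) for j in range(1, H + 1))

res = minimize(neg_min_sinr, np.full(H, d0J / H),
               bounds=[(1.0, d0J)] * H,                        # d_k >= 1
               constraints=[{"type": "eq",
                             "fun": lambda d: d.sum() - d0J}])  # sum of d_k = d_0J
print(res.x, -res.fun)   # optimized hop distances and the max-min SINR
```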

4 An Approximation Method to Optimize Hop Selection for Arbitrary Networks

In this section, we develop a new method to solve the hop optimization problem. We first reduce the max-min optimization problem to a simple high-order equation solving (or root finding) problem by taking some approximations. Then, based on the roots, we propose an iterative algorithm to select hop nodes for arbitrary H-hop wireless networks.

Let us begin from the node SINR expression (14). We can rewrite it as
$$s_j = \frac{p\,g_{j-1,j}}{\sum_{i=0}^{j-2} p\,g_{i,j} + N} + \sum_{\ell=1}^{j-2} \left( \frac{\sum_{i=0}^{j-\ell-1} p\,g_{i,j} + N}{\sum_{i=0}^{j-\ell-2} p\,g_{i,j} + N} - 1 \right) + \frac{p\,g_{0,j}}{N}. \tag{19}$$
Using the Schwartz inequality, we have
$$s_j \ge j \left( \frac{p\,g_{j-1,j}}{N} \right)^{\frac{1}{j}} - j + 1. \tag{20}$$
Obviously, if we just use the right-hand side of (20) as a lower bound of the SINR to conduct optimization, we can greatly simplify the problem. Considering that $g_{j-1,j} = d_{j-1}^{-\alpha}$, we can change (20) to
$$s_j \ge j \left( \frac{P}{N d_{j-1}^{\alpha}} \right)^{\frac{1}{j}} - j + 1, \quad j = 1, \cdots, H. \tag{21}$$

Note that our objective is to find the distances $d_{j-1}$, for $j = 1, \cdots, H$. Because of the simplicity of (21), we can first try to describe the distances $d_{j-1}$, $j = 2, \cdots, H$, as functions of $d_0$. For this purpose, let us compare $s_j$ and $s_1$. For an optimally designed multi-hop path, we would like each node to have the highest available SINR, which then enhances the SINR, and hence the capacity, of the transmission path. Obviously, if node position is not a constraint, then the optimal solution would have $s_j = s_1$ for any $j = 2, \cdots, H$. This phenomenon was in fact observed when we conducted simulations in [14]. Therefore, if we let $s_j = s_1$, then we have
$$j \left( \frac{P}{N d_{j-1}^{\alpha}} \right)^{\frac{1}{j}} - j + 1 = \frac{P}{N}\, d_0^{-\alpha}. \tag{22}$$
We need to describe $d_{j-1}$ as a function of $d_0$. However, (22) is still too complex for this purpose. Fortunately, we can see that the residual factor (j − 1) in $s_j$ is usually much smaller than the other parts. Remember that $s_j$ is an SINR, which is usually very large in value. Therefore, we can further approximate (22) by dropping the factor (j − 1), which gives us
$$j \left( \frac{P}{N d_{j-1}^{\alpha}} \right)^{\frac{1}{j}} = \frac{P}{N}\, d_0^{-\alpha}. \tag{23}$$


From (23), we can derive
$$d_{j-1} = j^{\frac{j}{\alpha}} \left( \frac{P}{N} \right)^{-\frac{j-1}{\alpha}} d_0^{\,j}. \tag{24}$$

Consider the optimization problem (18) on the straight line, i.e., with the constraint
$$\sum_{j=1}^{H} d_{j-1} = d_{0,J}. \tag{25}$$
Note that in [14] we conducted the numerical optimization (18) on a line (similarly under constraint (25)), and then used the optimal hop points on this line to select the relaying nodes. In this paper, we conduct a similar procedure, with the much simpler equation (24). One of the major differences is that the max-min optimization (18) is now avoided. Specifically, considering (24) and (25), we can find $d_0$ by solving the following equation:
$$\sum_{j=1}^{H} j^{\frac{j}{\alpha}} \left( \frac{P}{N} \right)^{-\frac{j-1}{\alpha}} d_0^{\,j} = d_{0,J}. \tag{26}$$
Note that (26) is an H-order equation. From simulations, we find that it has only one real solution. After finding the distance of the first hop, $d_0$, we can determine the distances of all the other hops $d_{j-1}$ by (24). However, because of the approximate nature of (26), the accuracy of $d_{j-1}$ for larger j is usually not high enough. To avoid potential problems, we may just use $d_0$ to determine the first hop node. Based on this idea, we can use an iterative method to determine all the H hops. The procedure is described as follows. First, along the line from the source node to the destination node, we find the position of the first hop by solving (26) for $d_0$. Then we select the node that is nearest to this point. During the next iteration, we perform the same procedure along the line from this relay node to the destination node, i.e., we solve (26) again (with a different $d_{0,J}$ and total hop count H − 1) to determine the next relay, which is in fact the second hop node. This procedure is repeated until all H hop nodes are determined. The iterative procedure only uses the solution $d_0$, which is the most accurate one under the above-mentioned approximation. Therefore, the accuracy of the hop selection and optimization is approximately as high as possible. On the other hand, the major advantage of this procedure, compared with the dynamic programming procedure in [14], is that the complexity is drastically reduced. In fact, the complexity is nothing more than solving H − 1 equations of orders 2 to H. The complexity is linear in the hop count H, but independent of the network size or total number of nodes J. As a result, it scales well with network size.
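A minimal sketch of this procedure is given below; the parameter values and the node layout are hypothetical, and for simplicity already-selected relays are not excluded from later iterations:

```python
import numpy as np

alpha, PN = 3.0, 1e4   # assumed path-loss exponent and P/N ratio

def first_hop_distance(d0J, H):
    """Solve the H-order equation (26) for d0 (its single positive real root)."""
    c = np.zeros(H + 1)                    # coefficients, lowest order first
    c[0] = -d0J
    for j in range(1, H + 1):
        c[j] = j ** (j / alpha) * PN ** (-(j - 1) / alpha)
    roots = np.polynomial.polynomial.polyroots(c)
    return min(r.real for r in roots if abs(r.imag) < 1e-9 and r.real > 0)

def select_relays(src, dst, nodes, H):
    """Iteratively pick H-1 relays nearest to the successive first-hop points."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    path, cur = [src], src
    for hops_left in range(H, 1, -1):
        d0 = first_hop_distance(np.linalg.norm(dst - cur), hops_left)
        target = cur + d0 * (dst - cur) / np.linalg.norm(dst - cur)
        relay = min(nodes, key=lambda n: np.linalg.norm(np.asarray(n, float) - target))
        cur = np.asarray(relay, float)
        path.append(cur)
    path.append(dst)
    return path
```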

5 Simulations

In this section, we use Monte-Carlo simulations to verify the proposed method. We assume L = 100 meters. For each hop count H, we use the iterative procedure in Section 4 to determine the optimal relay locations. The corresponding path capacity can also be calculated by (15)-(16). We normalize the path capacity by the direct source-to-destination transmission capacity as C(H)/C(1). The capacity based on numerical evaluation of (18) is shown in Fig. 4, where we denote the numerical results as "analysis" results. In the Monte-Carlo simulations, for various node numbers J, we randomly place the nodes. Then we simulate both the complete exhaustive search, with complexity (J − 1) × (J − 2) × · · · × (J − H), and the proposed algorithm, with complexity MH. We denote them as "Exhaustive" and "Proposed" results in the figures. In Fig. 3, we clearly see that the proposed method fits the complete exhaustive search very well. The error is very small, especially when the number of nodes is not very small. In addition, the proposed method works for extremely large numbers of nodes and long hops, where the exhaustive search method becomes computationally prohibitive. The average capacity increases when the node number J increases, or when the hop count H increases. In Fig. 4, we see that the maximum capacities obtained by the three approaches agree very well. When the hop count is small, the analysis results and the results of the proposed method are both almost identical to the exhaustive search results. When the hop count H increases, however, the proposed method gives results smaller than the analysis results, because the number of simulation iterations was limited and we could not encounter the optimal node placements. In Fig. 5, we further see that the maximum capacity found by our proposed method fits well with the exhaustive search method. From the results in Figs. 3-5, we can see that for multi-hop wireless networks, the transmission capacity increases with the hop count. The more hops we can use, the higher the capacity we can get. This more or less fits the fact that more


Fig. 3. Average capacity as a function of hop count H and node number J



Fig. 4. Maximum capacity as a function of hop count H


Fig. 5. Maximum capacity as a function of hop count H and node number J


Fig. 6. Average capacity per hop as a function of hop count H and node number J



Fig. 7. Maximum capacity per hop as a function of hop count H


Fig. 8. Maximum capacity per hop as a function of hop count H and node number J

transmission energy is used when more hops are involved. Therefore, another way to compare the network capacity is to study the capacity normalized by the transmission power, or the capacity per unit of energy used. In our simulations, we use the capacity divided by the hop count H to describe the capacity normalized by total transmission energy. The results corresponding to Figs. 3-5 are redrawn in Figs. 6-8. From these three figures, we can clearly see that using fewer hops increases the energy efficiency. The major reason might be that less energy has to be used to combat mutual interference.

References

[1] Akyildiz, I.F., Su, W., Sankarasubramaniam, Y., Cayirci, E.: A survey on sensor networks. IEEE Commun. Mag., 102–114 (August 2002)
[2] Gupta, P., Kumar, P.R.: The capacity of wireless networks. IEEE Trans. Inform. Theory 46, 388–404 (2000)


[3] Gastpar, M., Vetterli, M.: On the capacity of wireless networks: the relay case. In: Proc. IEEE INFOCOM 2002, New York, June 2002, pp. 1577–1586 (2002)
[4] Bolcskei, H., Nabar, R.U., Oyman, O., Paulraj, A.J.: Capacity scaling laws in MIMO relay networks. IEEE Trans. Wirel. Commun. (2006)
[5] Gupta, P., Kramer, G., Gastpar, M.: Cooperative strategies and capacity theorems for relay networks. IEEE Trans. Inform. Theory 51(9), 3037–3063 (2005)
[6] Toumpis, S., Goldsmith, A.J.: Capacity regions for wireless ad hoc networks. IEEE Trans. Wirel. Commun. 2(4), 736–748 (2003)
[7] Ganesan, D., Govindan, R., Shenker, S., Estrin, D.: Highly-resilient, energy-efficient multipath routing in wireless sensor networks. ACM SIGMOBILE Mobile Computing and Communications Review 5(4) (October 2001)
[8] Tillett, J., Rao, R., Sahin, F., Rao, T.M.: Particle swarm optimization for the clustering of wireless sensors. In: Proc. SPIE, vol. 5100, pp. 73–83 (2003)
[9] Wang, S.Y.: Optimizing the packet forwarding throughput of multi-hop wireless chain networks. Computer Communications 26, 1515–1532 (2003)
[10] Sherif, M.R., Habib, I.W., Naghshineh, M., Kermani, P.: Adaptive allocation of resources and call admission control for wireless ATM using genetic algorithms. IEEE J. Select. Areas Commun. 18(2), 268–282 (2000)
[11] Adickes, M.D., Billo, R.E., Norman, B.A., Banerjee, S., Nnaji, B.O., Rajgopal, J.: Optimization of indoor wireless communication network layouts. IIE Transactions 34, 823–836 (2002)
[12] Sendonaris, A., Erkip, E., Aazhang, B.: User cooperation diversity, Part I, II. IEEE Trans. Commun. 51(11), 1927–1948 (2003)
[13] Li, X.: Space-time coded multi-transmission among distributed transmitters without perfect synchronization. IEEE Signal Processing Lett. 11(12), 948–951 (2004)
[14] Li, X.: An efficient method for hop selection and capacity enhancement in multi-hop wireless ad-hoc networks. In: IEEE MILCOM (October 2007)

Localization Anomaly Detection in Wireless Sensor Networks for Non-flat Terrains

Sireesha Krupadanam and Huirong Fu, Member, IEEE

Abstract. In this paper, we develop and evaluate a Localization Anomaly Detection (LAD) scheme for wireless sensor networks on non-flat surfaces. The beacon-less grid localization scheme proposed for non-flat terrains in [1] is used for localization, and the localization anomaly detection uses observations of the sensor node at two different reception ranges. Moreover, a new Signal Strength (SS) Metric is proposed and evaluated for LAD. Simulations show that the beacon-less localization method, when combined with the LAD scheme, gives good detection rates with low false positive rates for the proposed Signal Strength Metric and the Difference Metric. The results show that the Signal Strength Metric is comparable to existing metrics while being more difficult to attack.

Keywords: Wireless Sensor Network, LAD, Attack, Localization, Non-flat Terrain.

1 Introduction

Many Wireless Sensor Networks (WSNs) are deployed in unattended and often hostile environments such as those in military and homeland security operations. Therefore, security mechanisms providing confidentiality, authentication, data integrity, and non-repudiation, among other security objectives, are vital to ensure proper network operations. A future WSN is expected to consist of hundreds or even thousands of sensor nodes, which renders it impractical to monitor and protect each individual node from either physical or logical attack. Localization Anomaly Detection (LAD) is the ability of the sensor network to detect anomalies in the reported locations of the sensors. The anomalies may be caused by malicious attacks against the localization scheme to corrupt the sensor locations and thereby render the network measurements worthless. Most of the techniques for location anomaly detection described in the literature cannot be applied to wireless sensor networks as they are computationally intensive. Furthermore, almost all of them are applicable only to sensor networks with beacon nodes. Lazos and Poovendran proposed a Secure Range-independent Localization (SeRLoc) [4] scheme that assumes that the network is comprised of sensor nodes and anchors/locators. In [5], Tang et al. presented an RSSI-based cooperative anomaly detection scheme to detect physical displacement attacks in wireless sensor networks. However, little work has been done in the area of location anomaly detection using beacon-less localization schemes. Du, Fang and Ning [2] proposed a


general scheme to detect localization anomalies that are caused by adversaries. Their method uses the deployment knowledge and actual observations of the nodes to detect localization anomalies in beacon-less sensor networks. Their scheme has been used as the basis for the proposed grid-based location anomaly detection method. In this research, we propose a LAD method applicable to non-flat terrains based on the beacon-less grid localization of [1]. We discuss various attacks on localization and evaluate the impact of the attacks on the performance of our proposed scheme in terms of ROC curves for different metrics, degrees of damage, percentages of compromised nodes, node densities, and different attacks. Moreover, a new Signal Strength Metric to detect anomalies in the localization is proposed and evaluated. The Difference metric, Add-all metric and Probability metric proposed by Du et al. [2] are modified and evaluated for the various cases. Our simulation results show that the proposed location anomaly detection method gives good results for sensor networks deployed over non-flat terrains, with good detection rates and low false positive rates. The organization of the paper is as follows: section 2 describes the proposed beacon-less grid localization method, section 3 describes the proposed localization anomaly detection, section 4 introduces the new Signal Strength metric, section 5 presents simulation results of the proposed LAD method, section 6 presents the simulation results for the Signal Strength Metric, and section 7 provides a summary and proposed future work.

2 Beacon-Less Grid Localization for Non-flat Terrains

For beacon-less grid localization, two different deployment models are studied for non-flat terrains. The terrain used is shown in Figure 1, which is the Sea-cliff terrain [3]. In the Static deployment case, it is assumed that the sensor nodes fall around the deployment point in a Gaussian distribution and stay at their drop point on the non-flat terrain. In the Dynamic deployment case, the sensor nodes slide to their final locations based on the surface characteristics. The results presented in this paper are mainly for the more challenging dynamic deployment.

Fig. 1. Sea-cliff Terrain


2.1 Static Node Deployment

The deployment is over a 1000m x 1000m area divided into a 10 x 10 grid of 100m x 100m sections. The deployment points are the centroids of the grid sections.

2.2 Dynamic Node Deployment

In this case, the initial distributions of the nodes follow the Gaussian distribution as in the case of Static Node Deployment. However, due to the surface characteristics of the terrain, the nodes will not remain at the same location after the deployment. The nodes slide/roll to their final locations. This is modeled using the information about the terrain and the characteristics of the nodes, and is described in [1].

2.3 Sensor Node Localization

After sensors are deployed, each sensor broadcasts its group id to its neighbors, and each sensor can count the number of neighbors from group $G_i$, for $i = 1, \ldots, n$, within a radius of the transmission range R. Assume that a sensor finds out that it has $o_1, \ldots, o_n$ neighbors from groups $G_1, \ldots, G_n$, respectively. The actual observation of the sensor is $o = (o_1, \ldots, o_n)$, where n is the number of deployment points. Based on the actual observation of the sensor, and using the deployment knowledge, it finds the nearest location where the expected observation $\mu = (\mu_1, \ldots, \mu_n)$ is closest to o. This location is the localization. The difference between the actual location and the localization is the localization error. In our localization method, the deployment area is divided into a grid of sufficiently high resolution. The resolution is chosen such that the constraints on sensor node memory are satisfied. Each sensor is equipped with the expected observations only for the localization grid points. The expected observation of each grid point $p_i$, for $i = 1, \ldots, l$, where l is the number of grid points, is defined as $\xi = (\xi_1, \ldots, \xi_n)$ where $\xi_i = \{\mu \mid \text{location} = p_i\}$. The sensor finds the grid point whose expected observation $\xi$ is closest to its actual observation o. This grid point location p is taken as the localized position of the sensor node.
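This nearest-observation search can be sketched as follows (our own illustration; the paper does not specify the distance measure, so Euclidean distance over the observation vectors is assumed):

```python
# A minimal sketch of the grid localization step: pick the grid point whose
# stored expected observation is closest to the sensor's actual observation.
import numpy as np

def localize(actual_obs, grid_points, expected_obs):
    """actual_obs: length-n count vector; expected_obs: (l, n) array where
    row i is the expected observation for grid point i."""
    dists = np.linalg.norm(expected_obs - actual_obs, axis=1)
    return grid_points[np.argmin(dists)]
```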

3 Localization Anomaly Detection for Non-flat Terrains

The Localization Anomaly Detection (LAD) proposed here uses deployment knowledge to find anomalies in the localization information of the nodes. The observations of sensor nodes within a certain radius are used for localization. For the detection of location anomalies a different range, called the LAD range, is used for observations. Based on the deployment knowledge and its localization, the sensor calculates the number of nodes from each group that should be observed within the LAD range. The


difference between this expected observation for the LAD range and the actual observation provides a measure of the location anomaly. Metrics to quantify the location anomaly are then calculated using this error. The metrics are compared against preset thresholds to determine if there is an anomaly. After sensors are deployed, each sensor broadcasts its group id to its neighbors, and each sensor can count the number of neighbors from group $G_i$, for $i = 1, \ldots, n$. Assume that a sensor finds out that it has $o_1, \ldots, o_n$ neighbors from groups $G_1, \ldots, G_n$, respectively, within its LAD range. Then $o = (o_1, \ldots, o_n)$ is the actual observation of the sensor. Based on the localization of the sensor, and using on-board deployment knowledge, it finds the expected observation within the LAD range, $\mu = (\mu_1, \ldots, \mu_n)$.

3.1 Attacks on Localization

For localization anomaly detection, we first attack the sensor network by compromising N sensor nodes using dec-bounded and dec-only attacks [2]. To simulate the attacks with degree of damage D, we use the following procedure.

1. A node v at location $L_a$ is randomly picked, and its actual observation a is obtained.
2. To simulate the attack against the localization, we add nodes to the neighborhood of $L_a$ such that the localization observation $p_i$ equals the localization observation $g_i$ at a grid point a distance D away. A tolerance of $2d_g$, where $d_g$ is the grid size, is used to get grid points in all directions around the attacked node at a distance D. This is done because the node is not exactly at a grid point, so we need a list of grid points in a circular annulus close to the intended degree of damage D. From this list, a random attack location is chosen. Boundary nodes are not considered in this case.
3. For the dec-bounded attack, based on the above, the observation o of the node is obtained from its actual observation a by compromising nodes and using multi-impersonating nodes. The nodes which are compromised are not considered for the localization and are dropped.
4. For the dec-only attack, the observation o of the node is obtained from its actual observation a by removing nodes, causing the localization error (g - p) in a decreasing manner only.
5. For the attacked locations, the expected observations within a 100m radius, µ, are used for the LAD metrics.
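As a toy illustration of the dec-only attack (our own sketch: the count vectors are made up, and modelling the attack as an element-wise minimum is our simplification of "decreasing manner only"):

```python
# Dec-only attack sketch: observed neighbour counts may only be suppressed,
# never increased, while steering the observation towards the target g.
import numpy as np

def dec_only(a, g):
    """Decrease entries of the actual observation a towards the target
    observation g; entries may only go down, never up."""
    return np.minimum(a, g)

a = np.array([12, 7, 0, 3])   # true neighbour counts per group
g = np.array([5, 9, 1, 0])    # expected counts at the chosen attack location
print(dec_only(a, g))         # -> [5 7 0 0]
```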

3.2 LAD Scheme

For the LAD scheme, the deployment of nodes on the surface is simulated a priori, with m nodes at each deployment point. The number of sensors from


different groups observed at each grid point is calculated and stored as the expected observation at that grid point. To determine its location, the sensor finds its actual observation using the neighbor information within the localization range, and determines its location as the grid point with the closest expected observation. Using this localization, the sensor finds the expected observation within the LAD range using the a priori simulation data. The sensor then compares the actual LAD observation with the expected LAD observation and detects anomalies using the metrics by specifying a threshold. For example, if the sensor node at location A observes the same set of neighbors as the sensor node at location B within the transmission range of 60m, it is not possible for the LAD scheme [2] to detect the accurate location of the sensor node. However, in the proposed LAD method, it is possible to detect the location quite accurately, as this method uses different transmission ranges for the localization and for the LAD. Thus, if we look at the broader range, i.e. 100m, the observations are different and hence it is possible to detect the anomaly.

3.3 Metrics

In this research, we extend and implement the Difference, Add-all, and Probability metrics [2]. A new metric called the Signal Strength Metric (SSM) is proposed, and the results of the simulations are described in the following sections.

3.3.1 Difference Metric

The difference metric, DM, uses the sum of the differences in the nodes observed from each group for the expected and actual observations:

$$ DM = \sum_{i=1}^{n} |o_i - \mu_i| $$

3.3.2 Add-All Metric

The Add-all metric, AM, takes the union of the expected and actual observations and obtains the maximum value of the nodes observed for each group. These larger observations for each node group are summed to arrive at the final metric. The metric increases with increasing localization error:

$$ AM = \sum_{i=1}^{n} \max\{o_i, \mu_i\} $$

3.3.3 Probability Metric

The Probability metric, PM, computes the probability that the localization of a node sees exactly $o_i$ neighbors from group $G_i$. We approximate the probability for nodes from each group $G_i$ as $\mu_i/m$. Thus, if the localization is $L_e$ and $X_i$ represents the number of neighbors that come from group $G_i$, the probability metric is calculated as


$$ PM_i = \Pr(X_i = o_i \mid L_e) = \binom{m}{o_i} \left(\frac{\mu_i}{m}\right)^{o_i} \left(1 - \frac{\mu_i}{m}\right)^{m - o_i}, \qquad PM = \sum_i PM_i $$

In [8], $PM_i$ is used as the probability metric and any abnormal drop is used to trigger the anomaly alarm. With our approximation of the probability density functions for each node group, the probabilities for some node groups are very low, both for regular nodes localized to a nearby grid point and for attacked nodes localized incorrectly. However, the probabilities are very easy to calculate. Hence we modify the probability metric as the sum $PM = \sum_i PM_i$ for our simulations.
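For illustration, the three metrics can be computed as follows (a minimal sketch of our own; the binomial form follows the PM formula above, and the use of a fixed threshold is an assumption about how detection is triggered):

```python
# Sketch of the three LAD metrics for observation vector o and expected
# vector mu, with m nodes deployed per group.
import numpy as np
from scipy.stats import binom

def lad_metrics(o, mu, m):
    dm = np.sum(np.abs(o - mu))              # Difference metric
    am = np.sum(np.maximum(o, mu))           # Add-all metric
    pm = np.sum(binom.pmf(o, m, mu / m))     # Probability metric
    return dm, am, pm

# An anomaly is flagged when a metric crosses its preset threshold, e.g.:
# anomaly = dm > DM_THRESHOLD
```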

4 Signal Strength Metric

We propose a new metric, the Signal Strength Metric, described by Figure 2, to utilize the information from the strength of the signal received by the sensor node. The strength of the signals received from each node can be computed. Using this signal strength, an estimated distance can be obtained for each node within the transmission range. The difference between the expected and observed distributions of nodes is then used as the metric. In the Localization Anomaly Detection method, the LAD range observations could themselves be attacked. However, using the distribution of all the nodes against distance ensures that any attack that adds or removes nodes at a particular distance is easily identified by a comparison of the expected and actual distributions. The signal strength metric performs such a comparison. The distances of all observed nodes from the center to the edge of the transmission range, for a group i, are obtained as $\{d_o\}_i^k$. Similarly, the distances of the expected nodes from group i are obtained as $\{d_\mu\}_i^k$. The metric in the outward direction is computed as

$$ SS_{ci} = \int_0^R \left| \operatorname{sum}\left(\{d_o\}_i^k < z\right) - \operatorname{sum}\left(\{d_\mu\}_i^k < z\right) \right| dz $$

However, this term is biased towards nodes which are close to the node being considered. In order to eliminate this bias, we also integrate the cumulative error in the observed and expected number of nodes for each group from the edge of the transmission range inwards towards the center. The metric for this calculation is

$$ SS_{ei} = \int_0^R \left| \operatorname{sum}\left(\{d_o\}_i^k > z\right) - \operatorname{sum}\left(\{d_\mu\}_i^k > z\right) \right| dz $$

The signal strength metric for each observed group is thus computed by adding the center-to-edge and edge-to-center metrics, as shown in Figure 2. The values for each node group are then added together to obtain the final metric:

$$ SS = \sum_i \left( SS_{ci} + SS_{ei} \right) $$

As described earlier, the advantage of the signal strength metric over the previous metrics is that it is able to detect changes in the distribution of the nodes even when the overall number of nodes from each node group is the same. Increasing differences between the observed and expected distributions of distances of the nodes lead to larger values for the metric.


Fig. 2. Signal Strength Metric

The Signal Strength Metric outperforms the difference metric for short transmission ranges, as the difference metric only considers the total number of observed nodes from each group without regard to their spatial distribution. With increasing transmission range and node density, more information about the distribution is available through $\{o_i\}$, thereby reducing the advantage of the signal strength metric.
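A numerical sketch of the per-group signal strength computation (our own illustration; the discretization of z over a fixed grid is an implementation choice, not from the paper):

```python
# Sketch of the Signal Strength Metric for one node group, following the two
# integrals above. d_obs and d_exp are numpy arrays of estimated distances of
# observed/expected group-i nodes; R is the transmission range.
import numpy as np

def ss_metric_group(d_obs, d_exp, R, steps=1000):
    z = np.linspace(0.0, R, steps)
    dz = R / (steps - 1)
    # cumulative count differences below z (center-to-edge) and above z
    # (edge-to-center), evaluated on the z grid
    below = np.abs((d_obs[:, None] < z).sum(0) - (d_exp[:, None] < z).sum(0))
    above = np.abs((d_obs[:, None] > z).sum(0) - (d_exp[:, None] > z).sum(0))
    ss_c = np.trapz(below, dx=dz)
    ss_e = np.trapz(above, dx=dz)
    return ss_c + ss_e   # summed over all groups to obtain the final SS metric
```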

5 Simulation Results

The simulations of the localization anomaly detection on the Sea-cliff terrain are described in this section.

5.1 Attacks on Localization

A total of 1000 randomly selected nodes are simulated for degree of damage values D of 80m, 120m and 160m, respectively. The nodes are attacked by dec-bounded and dec-only attacks. Several experiments varying parameters such as


network density (m), degree of damage (D), and percentage of compromised nodes (x) are carried out, and ROC curves are plotted for the different cases.

5.2 ROC Curves for Different Metrics

Figure 3 shows ROC curves for three metrics for dynamic deployment of nodes on a non-flat terrain. These ROC curves are for different detection metrics and different degrees of damage. The percentage of compromised nodes is set to 10% and the number of nodes deployed at each deployment point is set to 300. The results show that the LAD method gives better results for attacks with a high degree of damage. The Difference metric shows good anomaly detection rates at low false positive rates. Even though the experiment was carried out on a non-flat terrain, the results are comparable to the results of static deployment on a flat terrain. As the degree of damage increases, the detection rate increases.

Fig. 3. False Positive Rate vs. Detection Rate for Dynamic Deployment of Nodes on the Non-flat Terrain

5.3 ROC Curves for Different Attacks

Figure 4 shows the ROC curves for the Diff metric for dec-bounded and dec-only attacks for the dynamic deployment. The results shown are for degrees of damage D = 40m and 80m.

5.4 Detection Rate vs. Node Compromise Ratio

In this experiment, the ROC curves are plotted for detection rate versus node compromise ratio. Figure 5 shows the ROC curve for dynamic deployment of nodes on the non-flat terrain.


As expected, the detection rate decreases as the percentage of compromised nodes increases. When the number of compromised nodes increases, the localization error increases, which results in anomalies. The detection rates are lower for the dynamic deployment when compared to static deployments, as a result of the depletion of nodes from the slopes of the terrain.

Fig. 4. False Positive Rate vs. Detection Rate for Different Attacks for Dynamic Deployment of Nodes

Fig. 5. Detection Rate vs. Node Compromise Ratio (Degree of Damage = 80m, 120m, 160m) for Static Deployment of Nodes

5.5 Detection Rate vs. Network Density

The localization becomes more accurate when the number of nodes deployed at each deployment point in the sensor network increases. In order to demonstrate this, the false positive rate is set to 0.1, and the results show the detection rate for the Diff metric when the attack is dec-bounded.


Fig. 6. Detection Rate vs. Network Density (Percentage of Compromised Nodes = 10%, 20%, 30%) for Dynamic Deployment

As the degree of damage and the number of nodes deployed at each deployment point increase, the detection rate increases. Figure 6 shows good detection rates when the nodes are deployed dynamically, which causes larger localization errors. The false positive rate is set to 0.1 in this case. In all these cases, the detection rate increases as the network density increases.

6 Simulation Results of the Signal Strength Metric

This section describes the simulation results of the signal strength metric. Here we compare the results of the signal strength metric with the results of the difference metric. The simulation results when 30% of the nodes are compromised are compared for the two metrics.

6.1 ROC Curves for Different Metrics

Figure 7 shows the ROC curve for detection rate vs. false positive rate. When compared to the difference metric, the Signal Strength metric shows good detection rates for smaller transmission ranges while being close in performance overall. The main advantage of the signal strength metric is that it is not susceptible to an attack on the observations in the LAD range. That is, if both the localization and LAD ranges are attacked, the Diff metric would not be able to detect the attack. The entire distribution of nodes against distance needs to be exactly replicated in order to defeat the signal strength metric.

6.2 Detection Rate vs. Degree of Damage

Figure 8 shows the results of detection rate vs. degree of damage. In both cases, the results of the Signal Strength metric are comparable to the results of the Difference metric. The Signal Strength metric performs better for smaller ranges.


Fig. 7. Detection Rate vs. False Positive Rate for Diff Metric and Signal Strength Metric for Dynamic Deployment of Nodes

Fig. 8. Detection Rate vs. Degree of Damage for Diff Metric and Signal Strength Metric for Dynamic Deployment of Nodes

7 Conclusion and Future Work

In this research, a LAD method for non-flat terrains is proposed and evaluated. The beacon-less localization scheme proposed for non-flat terrains in Krupadanam and Fu [1] is used to find the location of the sensor nodes for the LAD algorithm. A new metric based on signal strength is proposed for LAD. This metric achieves better detection rates with low false positives for smaller signal ranges while being less susceptible to attack. Moreover, the LAD method developed in this research demonstrates significant robustness with a sparse localization grid. In future work, the impact of errors in the


deployment information on the performance of the LAD method needs to be quantified and analyzed. Other characteristics of the deployment, such as wind effects and non-uniform deployment grids, need to be modeled.

References

1. Krupadanam, S., Fu, H.: Beacon-less Location Detection in Wireless Sensor Networks for Non-flat Terrain. In: International Conference on Future Generation Communication and Networking (FGCN 2007), Jeju Island, Korea, December 6-8 (2007)
2. Du, W., Fang, L., Ning, P.: LAD: Localization Anomaly Detection for Wireless Sensor Networks. In: 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2005) (2005)
3. The sample data for the non-flat terrain is taken from http://www.spinmass.com/2life/docs/Seacliff_Terrain.txt
4. Lazos, L., Poovendran, R.: SeRLoc: Secure Range-Independent Localization for Wireless Sensor Networks. In: ACM WiSe, pp. 21–30 (2004)
5. Tang, J., Fan, P., Tang, X.: A RSSI-Based Cooperative Anomaly Detection Scheme for Wireless Sensor Networks. In: International Conference on WiCom 2007, Shanghai, China (2007)

Network Coding Opportunities for Wireless Grids Formed by Mobile Devices Karsten Fyhn Nielsen, Tatiana K. Madsen, and Frank H.P. Fitzek Dept. of Electronic Systems, Aalborg University {kfyhn,tatiana,ff}@es.aau.dk

Abstract. Wireless grids have potential in sharing communication, computational and storage resources, making these networks more powerful, more robust, and less cost intensive. However, to enjoy the benefits of cooperative resource sharing, a number of issues should be addressed, and the cost of the wireless link should be taken into account. We focus on the question of how nodes can efficiently communicate and distribute data in a wireless grid. We show the potential of a network coding approach when nodes have the possibility to combine packets, thus increasing the amount of information per transmission. Our implementation demonstrates the feasibility of network coding for wireless grids formed by mobile devices.

Keywords: Wireless grids, network coding, implementation, mobile phones.

1 Introduction and Motivation

Grid computing and grid topologies are attracting attention as sharing resources among devices has proven to bring benefits to the overall system performance [1]. A virtual pool of resources may consist of computing power, storage capacity or communication bandwidth. Despite the variety of applications, the underlying concept of grid technologies is cooperation among devices and willingness to share resources. This concept has expanded into the world of wireless communication, and by wireless grids we understand wireless devices forming cooperative clusters. Mobile devices within a wireless grid can use short-range links in addition to their cellular communication interfaces to share and combine resources and capabilities. To exploit the benefits that device cooperation potentially offers, proper solutions should be found to a number of technical challenges. Compared with their wired counterpart, wireless grids are characterized by device mobility and fluctuating capacity of wireless links. However, both wireless and wired grid technologies have to overcome the following common set of challenges [2]:

• Efficient routing protocols
• Discovery semantics and protocols
• Security
• Policy management



In this work we focus on the first item, which can also be formulated as: how can data be efficiently distributed among devices in a cooperative cluster? Wireless grids formed by mobile devices are known to be constrained by limited communication bandwidth. Additionally, mobile devices are battery powered and thus energy limited. Most of the energy consumption of a mobile device is due to sending and receiving operations. Under these conditions, efficient data distribution translates into minimization of the number of transmissions required to distribute the data. An extensive amount of research has been focused on power conservation techniques and energy-efficient routing algorithms for wireless grids (a survey can be found e.g. in [3]). In this work we investigate another approach than traditional data forwarding. It is based on network coding, and we demonstrate its benefits for wireless grids. Using traditional routing techniques, intermediate nodes between the source and the destination relay and replicate data messages. With network coding, intermediate nodes code incoming messages, e.g. by using exclusive-ORs for packet combinations. The packets for coding are chosen in such a way that the destination node is capable of decoding the information. The more coding opportunities an intermediate node can find, the greater is the amount of information combined in one coded packet and the smaller the overall number of transmissions required. Many analytical studies and simulation evaluations advocate the usefulness of a network coding approach in terms of network throughput improvement. However, its practical applicability for off-the-shelf devices with standard protocols can be demonstrated only by experimental studies. Experimental evaluation allows us to observe the influence of realistic operation conditions, including the error-proneness of the wireless channel and processing delays. Additionally, for wireless grids an extra challenge is the distributed nature of the network. There exist a few implementations that address the issue of network coding for wireless mesh networks. One should mention COPE [4], where devices perform opportunistic coding using the XOR operation, a recent work [5] presenting CLONE algorithms for unicast wireless sessions, and [6] with a lightweight localized network coding protocol, BFLY. The focus of this paper is on understanding the potential of network coding implemented on small hand-held mobile devices forming a wireless grid. The implementation platform is the Nokia N810, where a built-in WLAN interface is used for the local packet interchange. We limit our investigations to one example, the well-known "Alice and Bob" example [7] (explained in detail in the next section). The performance of network coding is compared with a traditional routing approach based on a reliable broadcast implementation. We present a detailed study on how each individual transmission increases the accumulated knowledge in the whole network.

2 Problem Statement

We consider the following scenario consisting of three nodes: A, B and C. Nodes A and C are located far apart and cannot directly communicate with each other. All packets have to be relayed through B, see Figure 1. We say that A and C are outside communication range of each other; however, they are within interference range, thus the hidden terminal problem is eliminated.


Fig. 1. The configuration of the three nodes for the test

It is assumed that each of the nodes has a unique part of a set of data, and all nodes need the full set. When node B broadcasts its packets, they will be received by both A and C. Packets from A (and respectively from C) will be received by B, stored in its memory and forwarded to the other node. The traditional routing approach consists in forwarding at the intermediate node B, resulting in one transmission per packet. With a network coding approach, node B may XOR two packets received from A and C and broadcast the result, saving one transmission. For the simplicity of the further explanations, we consider that all data to be exchanged consists of 240 packets, with 80 different packets on each node. The network is expected to behave as shown in Figure 2. The outcome of both distribution methods is separated into two phases.


Fig. 2. The expected distribution of packets in the test

The first phases are identical. The two outer nodes send one packet to the middle node, and the middle node broadcasts one of its own packets to the two outer nodes. This results in four satisfied nodes using three transmissions, which can be used to denote the "speed" of the distribution. Here we say the speed is 4/3, because four pieces of information were received using three transmissions. The speed is larger than one because of the broadcast. This phase is repeated until the two outer nodes have sent all of their packets to the middle node, and the middle node has broadcast all of its own packets. This is the end of the first phase, and it occurs when 77.78% of the packets are distributed in the system on average, because each outer node has 2/3 of the packets and the middle node has all the packets. Then the second phase begins. With reliable broadcast the middle node must transmit 160 packets in 160 transmissions, because only one node can use each packet. This results in a speed of one, i.e. a lower speed than in the first phase. With network coding the middle node can code two packets together and broadcast the result, thereby sending 160 packets in 80 transmissions. This gives a speed of 2, i.e. a higher speed than in the first phase. Assuming no collisions and thus no retransmissions, this results in reliable broadcast


Fig. 3. The implementation platform and GUI


Fig. 4. The expected outcome of the test. The dotted line separates the two phases of the distribution at 77.78%.

finishing in (80+80+80+80+80)/3 = 133.33 transmissions, whereas network coding should be able to finish in (80+80+80+80)/3 = 106.67 transmissions. The expected results of the test are shown in Figure 4. In the following we compare the measured data exchange process with the presented theoretical one. But first, we describe the implementation details.
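As a side note, these expected per-node counts can be reproduced with a few lines of arithmetic (our own illustration, not part of the original test software):

```python
# Expected number of transmissions per node (averaged over the three nodes)
# for the two distribution schemes, with 80 packets initially on each node.
packets_per_node = 80

# Phase 1: A->B, C->B and one broadcast by B per round, 3 transmissions each,
# repeated for all 80 rounds (identical for both schemes).
phase1 = 3 * packets_per_node            # 240 transmissions in total

# Phase 2: B redistributes the 2*80 packets received from A and C.
phase2_broadcast = 2 * packets_per_node  # one useful packet per transmission
phase2_coding = packets_per_node         # XOR'ed pairs carry two packets each

print((phase1 + phase2_broadcast) / 3)   # 133.33... (reliable broadcast)
print((phase1 + phase2_coding) / 3)      # 106.66... (network coding)
```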


3 Implementation

As the platform for the implementation, the Nokia N810 Internet Tablet is chosen: because of its large screen, which makes it useful for visualization; because of its built-in WLAN interface; and because of its operating system, which is a Linux distribution, making it easy to develop for thanks to the many readily available tools. The platform has the following specifications:

• Processor - TI OMAP 2420, 400 MHz ARM11
• Memory - 128 MB + 128 MB swap
• WLAN - IEEE 802.11 b/g (only b in ad-hoc mode)
• Operating System - Maemo¹ OS2008 (Linux kernel 2.6.21-omap1)

Programs must be compiled for the ARM processor, so a special environment is set up for the development. This platform is called Scratchbox² and is a cross-compilation toolkit which may be used with the Maemo SDK for software development for the Maemo platform. The developed implementation consists of the following four levels:

• Test Application: At this level, a GUI has been implemented to show the distribution of packets (Figure 3). This level also manages a logging facility.
• Framework: This level is responsible for placing the implementation of network coding as an extra protocol between the IP and MAC layers. This has one very important advantage: it relies only on the information in the IP header, which may be retrieved using raw sockets directly after the MAC layer. By also placing the implementation beneath the normal IP layer, it becomes possible to use a virtual network interface as the interface between the first and the second level. Thereby communication between the two levels can happen solely through standard Berkeley socket calls. The framework also provides functions for sending packets through the socket, and for sending special packets that start and stop the lower levels. With very few changes this level could instead be run as a daemon on the OS, making it a completely separate entity from the application or applications using it.
• Logistics Platform: It contains all the data structures and functions for the logistics of network coding. In the implementation of network coding for distributed wireless grids, this is especially the knowledge of which packets the local node has, and the knowledge of which packets all remote nodes have.
• Schemes: This level holds the algorithms for encoding and decoding. These may differ while still using the same logistics. For this implementation two schemes have been implemented: one for reliable broadcast, and one for network coding.

The two schemes are further explained in the following, but first some important data structures are introduced, which are heavily used in the schemes (a toy sketch of these structures is given after the list):

¹ http://www.maemo.org
² http://www.scratchbox.org




• NC packet: When a node sends out a transmission, we refer to it as an NC packet. An NC packet may contain zero, one or a combination of many IP packets, depending on what the scheme has found. But always present in an NC packet is an NC header. This header is used to distribute knowledge between the nodes, e.g. the node's reception vector.
• Reception vector: The reception vector linked to a node is a bitarray with as many bits as there are packets in the set.
• Packet pool: The reception vectors are closely intertwined with the packet pool. For a scheme to be able to minimize the number of transmissions by sending combinations of IP packets in each transmission, the IP packets must be readily available in some form of persistent store. Packets are therefore not only forwarded between the application on the first level and the network, but are stored in the logistics platform, to provide the scheme with more coding opportunities.
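A rough illustration of these structures (the field layout is our guess for exposition, not the actual wire format used on the N810s):

```python
# Toy models of the NC packet and packet pool described above.
from dataclasses import dataclass, field

@dataclass
class NCPacket:
    sender_id: int
    reception_vector: int                 # bitarray packed into an int; bit i set = packet i held
    coded_ids: list = field(default_factory=list)   # ids of IP packets XOR'ed together
    payload: bytes = b""                  # XOR of the listed IP packets (empty if none)

@dataclass
class PacketPool:
    packets: dict = field(default_factory=dict)     # packet id -> raw IP packet bytes

    def reception_vector(self) -> int:
        """Derive the node's reception vector from the pool contents."""
        vec = 0
        for pid in self.packets:
            vec |= 1 << pid
        return vec
```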

The usefulness of these data structures becomes clear in the following, where the two implemented schemes are explained. The algorithm for network coding finds the IP packets to send in one NC packet using exclusive-ORs for packet combinations. This is done by iteratively finding IP packets to code with, until all coding opportunities have been explored. A coding opportunity is defined as the opportunity for a local node to combine one or more packets together for a single transmission. Because each NC packet is only useful to a node if it contains exactly one unknown IP packet, it is important to take this into account in the algorithm. If an IP packet is chosen for transmission with node A as receiver, no IP packet unknown to node A must be chosen in any of the following iterations. To avoid this, a coding vector is used to contain knowledge of IP packets which can be used for coding. The coding vector must, for each iteration, contain the ids of IP packets which all receivers found in previous iterations have in common. Therefore, if one of these is chosen for transmission, all receivers can decode it. It is also important that when an IP packet A is chosen for transmission in an iteration, none of the IP packets found in previous iterations must be unknown to the receiver(s) of A. In the implementation of the scheme for network coding, this has been solved by AND'ing the coding vector with the vectors of the receiving nodes in each iteration. The scheme for reliable broadcast is somewhat similar and much simpler. At each iteration a node has to choose an IP packet for transmission. The algorithm for reliable broadcast runs through all available IP packets on the local node, to find the first IP packet missing on a remote node. The exchange of reception vectors plays the role of acknowledgements.
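A condensed sketch of this coding-opportunity search (our own re-expression, with Python sets standing in for the bitarrays; the function name is ours):

```python
def find_coding_set(local_have, remote_have):
    """Pick packet ids to XOR into one NC packet.

    local_have:  set of packet ids held by this node.
    remote_have: dict mapping node id -> set of packet ids that node holds.
    """
    coding_vector = set(local_have)  # ids known to every receiver chosen so far
    chosen = {}                      # receiver node id -> packet id picked for it
    for node, have in remote_have.items():
        # packets already in the combination must all be known to this node,
        # otherwise it would see more than one unknown packet and fail to decode
        if any(p not in have for p in chosen.values()):
            continue
        # a usable packet is one this node misses but all earlier receivers know
        candidates = coding_vector - have
        if not candidates:
            continue
        chosen[node] = min(candidates)
        coding_vector &= have        # AND with the receiver's reception vector
    return chosen                    # XOR the chosen packets into one NC packet
```

Each receiver thus sees exactly one unknown packet in the combination and can recover it by XOR'ing the NC payload with the packets it already holds.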

4 Results

In the experimental evaluation we focus on the average number of packets available on the nodes as a function of the average number of transmissions for each node. The results are averaged over five test runs and normalized to show the number of packets as a percentage of the total set.


The result of the test is shown in Figure 5. As can be seen, the test run is separated into two phases, the first running from the start until the system on average has 77.78% of the set, and the second from then until all nodes have the entire set. In the first phase all nodes are sending with a speed of 4/3. This is the case for both transmission methods, and they are therefore similar. In the second phase, reliable broadcast only sends with a speed of 1, and therefore its curve declines, whereas the speed of network coding becomes 2, so its curve increases.


Fig. 5. The measured results of the test. The dotted line separates the two phases of the distribution at 77.78%.

The measured and theoretical transmission counts are compared in Table 1. The reliable broadcast scheme distributes the set with 136.4 transmissions on average, where the theoretical limit was 133.33, giving only about three redundant transmissions per node; network coding finishes with 112.2 transmissions compared to the theoretical 106.67, meaning about six redundant transmissions per node.

Table 1. Average number of transmissions per node required to distribute data for the two transmission schemes

                      Measurements    Theory
Reliable Broadcast    136.4           133.33
Network Coding        112.2           106.67


These redundant transmissions are caused by an initial transmission made when nodes do not yet have knowledge about the status of the other nodes. Additionally, collisions, retransmissions and acknowledgements are other sources of redundant transmissions. The behavior of the three individual nodes is shown in Figures 6 and 7.


90

Fig. 6. The result of the test for the outer nodes A and C Node no. 2

Average number of packets available on node [%]

100

90

80

70

60

50

40 Broadcast Network Coding 30

0

50

100 150 Average number of transmissions

200

250

Fig. 7. The result of the test for the middle node B

5 Conclusions

This paper demonstrates the practical potential of network coding for wireless grids using a simple three-node example. As can be seen from the test bed results, network coding sends the same information with approximately 20% fewer


transmissions compared to reliable broadcast, thereby increasing the throughput. A closer look at the system behavior shows that after some time network coding is "speeding up" the distribution of information in the network, while the traditional approach based on broadcast is slowing down. The measured number of required transmissions is very close to the theoretical one and suggests the feasibility of network coding implementations even on small hand-held devices such as mobile phones. The slight difference in the results can be explained by redundant transmissions and acknowledgements, and most probably it can be eliminated, or at least reduced, by using intraflow network coding [7] instead of sending single packets. This would greatly reduce the need for acknowledgements, and thereby reduce the required number of transmissions further.

References

1. Fitzek, F., Katz, M.: Cooperation in Wireless Networks: Principles and Applications - Real Egoistic Behavior is to Cooperate! Springer, Heidelberg (2006)
2. Agarwal, A., Norman, D., Gupta, A.: Wireless Grids: Approaches, Architectures and Technical Challenges. MIT Sloan Working Paper No. 4459-04 (2004)
3. Ahuja, S.P., Mayers, J.R.: A Survey on Wireless Grid Computing. The Journal of Supercomputing 37(1), 3–21 (2006)
4. Katti, S., Rahul, H., Hu, W., Katabi, D., Medard, M.: The Importance of Being Opportunistic: Practical Network Coding for Wireless Environments. In: Allerton Conference (2005)
5. Rayanchu, S., Sen, S., Wu, J., Sengupta, S., Banerjee, S.: Loss-Aware Network Coding for Unicast Wireless Sessions: Design, Implementation, and Performance Evaluation. ACM SIGMETRICS Performance Evaluation Review 36(1), 85–96 (2008)
6. Omiwade, S., Zheng, R., Hua, C.: Butterflies in the Mesh: Lightweight Localized Wireless Network Coding. In: Proc. of Fourth Workshop on Network Coding, Theory, and Applications (2008)
7. Fragouli, C., Katabi, D., Markopoulou, A., Medard, M., Rahul, R.: Wireless Network Coding: Opportunities & Challenges. In: Military Communications Conference, MILCOM 2007, pp. 1–8 (2007)

Automatic Network Services Aligned with Grid Application Requirements in CARRIOCAS Project (Invited Paper)

D. Verchere¹, O. Audouin¹, B. Berde¹, A. Chiosi¹, R. Douville¹, H. Pouyllau¹, P. Primet², M. Pasin², S. Soudan², T. Marcot³, V. Piperaud⁴, R. Theillaud⁴, D. Hong⁵, D. Barth⁶, C. Cadéré⁶, V. Reinhart⁷, and J. Tomasik⁷

¹ Alcatel-Lucent Bell Labs, Nozay, 91650 France, [email protected]
² INRIA, [email protected]
³ Orange Labs, [email protected]
⁴ Marben Products, [email protected]
⁵ N2Nsoft, [email protected]
⁶ PriSM, [email protected]
⁷ Supélec, [email protected]

Abstract. An automatic service framework named Scheduling, Reconfiguration and Virtualization (SRV) is developed in the CARRIOCAS project to enhance existing Telecom network infrastructures for supporting grid applications that share IT resources interconnected by ultra-high performance optical networks. From the requirements of Grid applications, a classification is proposed to specify the network services and their attributes. For large-scale collaborative environments, the SRV solution is described, enabling automatic network service operations according to high-performance computing service access. The resources hosted at datacenters are virtualized and attached to the transport network infrastructure, offering uniform interfaces towards external customers. A new level of service binding with network services is defined for the execution of Grid applications' workflows. An on-demand intensive computing and visualization services scenario is described in a Telecom environment.

Keywords: network services, infrastructure control and management, virtualization, collaborative applications.

1 Introduction

New management and control functions are required to adapt existing Telecom network infrastructures to deliver commercial IT services for company customers.


Large-scale distributed industrial applications – termed Grid applications – require the use of ultra-high performance network infrastructure and multiple types of Grid resources such as computation, storage and visualization. Collaborating engineers further need to interact with massive amounts of data to analyze simulation results on high-resolution visualization screens fed by storage servers. Network operators are looking forward to delivering IT computing services to their company customers, with several advantages. The first is the practice of the Virtual Organization (VO) with the related infrastructure access: IT, labs or equipment. The enterprises come together to share their core competencies, and an efficient networking solution is key to implementing a productive VO. Second, collaboration enables better execution of the projects attached to market opportunities. The third motivation is to split the fixed costs of maintenance and management by outsourcing the infrastructure. But a "utility computing" model still requires standard interfaces [7] at the network service management level [4]. The advantage for Telecom operators is their central role in interconnecting datacenters and company customers, with the possibility of managing more dynamically the network services offered to Grid Service Providers (GSP), see Fig. 1. By enabling the network service management to intercept the grid application workflows, the operator can better manage the resources required to fulfill end-to-end QoS requirements. These workflows require explicit reservation of network resources along with other types of Grid resources (e.g. computational, storage); intercepting information from the workflows will therefore enable operators to improve the utilization of their infrastructure. The enterprises will express their QoS requirements to the GSP, including the maximum cost they are willing to sustain and a time window within which workflows have to be completed. On the other side, the infrastructure providers will publish their services to the GSP according to their policies and negotiations, to maximize utilization rates together with Telecom and datacenter operators. The binding of customer applications' workflows to connectivity services, triggering network service reconfigurations, is a capability that Telecom operators are now integrating. This binding requires vertical service interactions allowing the negotiation of connectivity services between a network operator and GSPs. The companies sharing their datacenters have

Fig. 1. Actor interactions positioning the central role of Telecom network infrastructure


incongruous infrastructure needs. As described in Fig. 1, SLA usages and SLA providers are bound by the GSP. The parameters include the amount of IT and network resources required, the type required, the class of end devices used to deliver the services, the duration, the connectivity service, and the service accounting. This paper is organized as follows: Section 2 gives an outline of the grid application classifications and their requirements, used to define the network service management layer. Contributions on the Scheduling, Reconfiguration and Virtualization (SRV) function, with the network architecture to support the delivery of network services for GSPs, are presented in Section 3, and the SRV functional architecture in Section 4. Section 5 presents the management and control function extensions for the provisioning of scheduled connections. Section 6 presents the SRV-oriented transport network scenario considered in the CARRIOCAS network pilot. The conclusion is drawn in Section 7 with statements on the potential standardization of a Network Service Interface [10] for Telecom infrastructure management and network services.

2 Grid Application Requirements for the Network Services

It is complex to define a generic classification of all distributed applications. The focus is on large-scale grid applications requiring ultra-high bit rate network services, on the order of the transmission capacity of the network infrastructures. Seven types of Grid applications are listed depending on their connectivity characteristics, capacity requirements, localization constraints, performance constraints and scenarios between networks and grid services offered to end-users [1]. The classification of distributed applications in Table 1 was specified from developers/users contributing to CARRIOCAS, and through specific workshops and interviews [7]. These applications combine the following requirements: (i) real-time remote interactions with constant bandwidth requirements; (ii) many large data file transfers between distant sites whose localization is known in advance (distributed storage and access to data centers), with sporadic bandwidth requirements; (iii) many data-file streams between anonymous sites (e.g. multimedia production), requiring statistical guarantees; and (iv) data and file transfers between locations carried on best-effort network services. As the connectivity service needs are very different, the network infrastructure must be able to support the classes of service with ultra-high performance parameters, including bandwidth and edge-to-edge latency. The connection configuration must be dynamic, to be adapted to different grid application usages. For applications in which user sites and datacenter sites must be tightly coupled and reconfigured, the performance of the network infrastructure should make it possible to abstract the location constraints of IT resources over a wide area network; this is one of the challenges to be demonstrated. The actors each have the objective of maximizing their utility functions [7]. For the customer, the objective means executing the submitted jobs at low cost within a required time window. For the GSP, it means delivering connectivity services and the other IT services at the highest quality, including performance and security, and at the lowest cost. Each operator wants to maximize infrastructure utilization (CAPEX) with minimal operation effort (OPEX) to finally obtain the best possible return on investment (ROI). These connectivity services require a pool of resources to be explicitly reserved, and the resource pools are computed by the network resource planning


Table 1. Classification of Distributed Applications

Type A (parallel distributed computing, tightly synchronized computation): scientific applications involving tightly coupled tasks. Capacity requirement: high; localization constraint: low; performance constraints: computational, connectivity latency. Examples: grand scientific problems requiring huge memory and computing power.

Type B (high throughput, low coupling computation): great number of independent and long computing jobs. Capacity requirement: high; localization constraint: low; performance constraint: computation. Examples: combinatory optimization, exhaustive search (cryptography, genomics), stochastic simulations (e.g. finance).

Type C (computing on demand): access to a shared remote computing server during a given period. Capacity requirement: medium; localization constraint: low; performance constraints: computational, high bit rate, connectivity latency. Examples: access to corporate computing infrastructure, load balancing between servers, virtual prototyping.

Type D (pipeline computing): applications intensively exchanging streamed or real-time data. Capacity requirement: medium; localization constraint: average; performance constraints: computation, bit rate, latency, storage. Examples: high performance visualization, real-time signal processing.

Type E (massive data processing): treatment of distributed data with intensive communication between computers and data sources. Capacity requirement: low; localization constraint: high; performance constraints: high bit rates and connectivity latency. Examples: distributed storage farms, distributed data acquisition, environmental data treatment, bioinformatics, finance, high energy physics.

Type F (largely distributed computing): search and modification on distributed databases with low computing and data volume requirements. Capacity requirement: medium; localization constraint: high; performance constraint: connectivity latency. Examples: fine-grained computing applications, where network congestion can cause frequent disruptions between clients.

Type G (collaborative computing): remote users interacting on a shared virtual visual space. Capacity requirement: medium; localization constraint: high; performance constraints: bit rate and latency (for interactivity). Examples: collaborative visualization for scientific analysis or virtual prototyping.

functions, i.e., the Network Planning Tool (NPT). Today, network services are provisioned in the infrastructure without being automatically related to how the business rules operate during publication, negotiation and service notification to the customers. Second, the network management system (NMS) must cope with scheduled service deliveries, enabling connection provisioning to become more autonomic. The NMS extensions must ensure that the infrastructure delivers the on-demand or scheduled network services requested by external entities such as the GSP. The network services must be adjusted in response to changing customer environments, such as a new organization joining and then connecting to the network and datacenter infrastructures (e.g. a car designer being connected to a computational fluid dynamics simulation application for two weeks next month), or a connected organization that will stop using the IT services during the next week. The CARRIOCAS network has to provide dynamic and automatic (re)configuration of the network devices according to the roles and business rules of the different actors. Provider network services (respectively connectivity services) are described from the point of view of the network infrastructure operator (respectively the customer), which is distinct from solutions implemented through customer edge (CE) node based approaches [4]. Virtual Private Network (VPN) services provide secure and dedicated data communications over Telecom networks through the use of standard tunneling, encryption and authentication functions. These services contrast with leased lines, which are configured manually and allocated to a single company. A VPN is a network service dedication achieved through logical configuration rather than dedicated physical equipment, using virtualization technologies. To reconfigure the provisioning of VPNs automatically, the SRV should accommodate additions,


deletions, moves and/or changes of access among datacenter sites and company members with no manual intervention. In the CARRIOCAS network pilot, three classes of connectivity services, named Ethernet Virtual Circuits (EVC), are proposed to customers: point-to-point EVC, multipoint-to-multipoint EVC and rooted-multipoint EVC. The description of each EVC includes the User-to-Network Interface (UNI), corresponding to the reference point between the provider edge (PE) node and the CE node. At a given UNI, more than one EVC can be provisioned or signaled according to the multiplexing capabilities. The CARRIOCAS network envisions reconfiguration functions for wavelength connections thanks to reconfigurable optical add-drop multiplexer (ROADM) nodes and the network control functions. GMPLS controllers make it possible to establish wavelength-switched connections on demand by communicating the connectivity service end-points to the PE node through a dynamic protocol message exchange in the form of RSVP-TE messages.
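To make these service classes concrete, the following minimal sketch models an EVC request as it might be handed to the UNI signaling function; the class names, fields and the validate helper are illustrative assumptions, not an actual CARRIOCAS or standard API.

from dataclasses import dataclass
from enum import Enum
from typing import List

class EvcType(Enum):
    POINT_TO_POINT = "P2P"              # exactly two UNIs
    MULTIPOINT_TO_MULTIPOINT = "MP2MP"  # any-to-any among the listed UNIs
    ROOTED_MULTIPOINT = "RMP"           # one root UNI, several leaf UNIs

@dataclass
class EvcRequest:
    evc_type: EvcType
    unis: List[str]        # reference points between PE and CE nodes
    bandwidth_mbps: int
    latency_ms: float      # edge-to-edge latency bound from the SLA provider

def validate(req: EvcRequest) -> None:
    # Basic structural checks before the request is signaled at the UNI.
    if len(req.unis) < 2:
        raise ValueError("an EVC needs at least two end points")
    if req.evc_type is EvcType.POINT_TO_POINT and len(req.unis) != 2:
        raise ValueError("a point-to-point EVC connects exactly two UNIs")

validate(EvcRequest(EvcType.POINT_TO_POINT, ["UNI-A", "UNI-B"], 1000, 10.0))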

3 SRV Oriented Transport Network

To establish network services automatically, the connectivity requirements of the grid application workflows gathered by the GSP are intercepted automatically by the network service management functions. The CARRIOCAS transport network architecture is extended to position the functional components of the network service management layer between the GSP and the network resource management and control functions. The network service management functions are organized in the Scheduling, Reconfiguration and Virtualization (SRV) service components, separately from the network resource management and control functions. The SRV is central to the network infrastructure operations and integrates different service interfaces according to the entities it interacts with. The SRV northbound interface, which depends on the technology implemented at the GSP interface, enables the SLA provider to be described. Towards the network infrastructure, the SRV southbound interface offers two options: the SRV requests switched connections through the provider UNI (UNI-N) if it is a GMPLS-controlled network [2], or it requests permanent or soft-permanent connections through the Network Management Interface based on MTOSI from the TeleManagement Forum [3]. Connections are dynamically established (provisioned through the NMS or signaled through the UNI) with the guaranteed QoS defined in the SLA provider. In the first version, the selection of computing and storage resources is performed by the grid application middleware hosted at the end-users or the GSP, outside the network infrastructure management. The SRV east-west interface supports edge-to-edge connectivity services covering multiple routing domains. The SRV interfacing the GSP manages the network service at the ingress network domain, and the interdomain connectivity services through peer operation information exchanges with the other SRVs involved in the chain [10]. By integrating the concepts of virtualized infrastructures (datacenters and transport networks), virtualized services can be delivered by the second version of the SRV. A virtualized service is a composition of computing, storage and connection resources delivered as a managed composite service. Specific software applications can further be virtualized and integrated as elements of the composite service too. The virtualized services require explicit reservation of different types of resources, offering the SRV the possibility to co-allocate them.

Fig. 2. SRV oriented Transport Network for Grid Service Provider

After the resource reservation phase is complete and before the allocation phase starts, cross-optimization can be executed across the different types of resources. During the selection of a computational server at a datacenter location to execute a workflow, the virtualized service routing functions take into account heterogeneous criteria, including the computational capability of the server (e.g. operating system, cycle speed, RAM) and the available network capability (interfaces, bandwidth, latency) on the selected connection path. Such an extension requires an additional interface between the SRV and the Grid Resource Management System (GMS) at the datacenter infrastructures (not represented in Fig. 2) to exchange management/control information between grid application and network services, a capability not supported in standard infrastructures [2]. However, several technical challenges still need to be overcome, such as the amount of information to be managed, the security and confidentiality constraints between the network operators and the other business actors, and the reconfiguration time delays of the different resource types of the virtualized infrastructures.
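As a rough illustration of such cross-optimization, the sketch below ranks candidate servers by mixing computational capability with the network capability of the path towards each datacenter. The attribute names and weights are assumptions made for illustration; they do not come from the CARRIOCAS specifications.

# Hypothetical candidates: compute attributes plus the network capability of
# the selected connection path towards each datacenter.
candidates = [
    {"name": "dc1-srv3", "cpu_ghz": 3.0, "ram_gb": 64,
     "path_bw_gbps": 10, "path_latency_ms": 4.0},
    {"name": "dc2-srv1", "cpu_ghz": 2.4, "ram_gb": 128,
     "path_bw_gbps": 40, "path_latency_ms": 9.0},
]

def score(server, w_cpu=0.4, w_bw=0.4, w_lat=0.2):
    # Weighted mix of compute and network criteria; the weights are invented.
    return (w_cpu * server["cpu_ghz"]
            + w_bw * server["path_bw_gbps"]
            - w_lat * server["path_latency_ms"])

best = max(candidates, key=score)
print(best["name"])  # the server/path pair with the best combined score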

4 SRV Functions and Scheduled Network Management

The SRV internal architecture is composed of the functions needed to deliver connectivity services fulfilling the SLA requests of a GSP. Three internal functional layers are identified: the GSP interface, the Network Management System interface, and the mediation layer, as represented in Fig. 3. The SRV functional components interacting with the GSP are the service publication, service negotiation and service notification components, which respectively allow publishing, negotiating and notifying the connectivity services from the SRV towards external GSP customers. This interface is based on the VXDL description language [6] for the services (connectivity services or virtualized services), compliant with the specifications of the Open Grid Services Architecture (OGSA). Based on XML, VXDL has the advantage of enabling the composition of multiple service elements by the GSP (e.g. a computing service combined with a connectivity service).


Fig. 3. The SRV functional model

Fig. 4. Connectivity service request and network service provisioning sequence diagram

The network services are separated from the description of the connectivity services and are associated with the NMS, which provisions the resources. The interactions between the GSP and the SRV follow a customer-producer pattern involving the SRV database. Three types of interactions between the SRV and GSPs are supported: publish/subscribe, query/response, and notification. When the SRV acts upon a request initiated by the GSP, the query/response interaction pattern is used (see Fig. 4). Other types of negotiation sequences are also considered, as evaluated in [5]. The service publication function is attached to the creation of the SRV-DB, the repository of information related to the states of the connections, from which the SRV chooses a class of connectivity service according to the GSP requirements.


The high-level description of the service offerings exposed by the SRV belongs to the CustomerFacingServiceSpec class of the connectivity service specifications, i.e., the service functions that a GSP can request and purchase from the network operator [3]. The parameters of the connectivity services provided by the GSP in the SLA provider are derived from the grid application workflow. The connectivity service negotiation enables the GSP to negotiate the QoS and other attributes such as confidentiality. The sequence associated with an SLA for connectivity services is complicated by the need to coordinate the (re)configuration of multiple elements simultaneously in the network infrastructure. At the NMS interface layer, the network infrastructure is virtualized according to the Service class of the Shared Information/Data (SID) model [3]. The abstraction depends on the connectivity services published to the GSPs. The network services are generalized as Lx-VPNs, with x = 1, 2 or 3. The basic elements described belong to the ResourceFacingServiceSpec class of the network services. At the mediation layer of the SRV, the selection of network service elements and their composition make it possible to bind the GSP connectivity service descriptions according to the rules defined by the network operator for provisioning (Policies). The Policy component ensures the continuity from the GSP connectivity service requirements to the connection requests towards the NMS interface. The rules select the appropriate process, either to provision the connections via the NMS or to signal the establishment of the connections via the ingress node controller. The bulk data transfer scheduling (BDTS) service enables the scheduling of network services between end-points by providing time guarantees that a specified amount of streamed data is transferred within a strict time window [8]. The service scheduling (Sched) component interfaces with the network resource scheduler (NRS). The NRS is an extension of the NMS that integrates the time constraints expressed by the GSP according to the execution of the grid application workflows. The NRS can signal scheduled connections, e.g. Label Switched Paths for RSVP-TE signaled connections. The Scheduled-PCE (S-PCE) is an extension of the Path Computation Element architecture [9] and its PCE Protocol (PCEP) that enables the selection of the available network resources according to the time window constraints. Each connection provisioning request from the NRS is characterized by a source node, a destination node, a QoS (bandwidth and availability), a starting time noted START, and an ending time noted END. The times START and END define a time window managed by the NRS for resource allocation. The provisioning sequence for one or several connections is divided into four phases: (i) a resource request phase (e.g. between the NRS and the network elements), (ii) a reservation phase, (iii) an intermediate phase during which modification of the reservation is possible, and (iv) a resource allocation phase during which the network service is provisioned, as recommended in [2].
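The request fields and the four phases above map naturally onto a small state machine; the sketch below is a minimal reading of this description under stated assumptions, not the actual NRS implementation, and all names are illustrative.

from dataclasses import dataclass

PHASES = ["request", "reservation", "intermediate", "allocation"]

@dataclass
class ScheduledConnection:
    source: str
    destination: str
    bandwidth_gbps: float   # QoS: bandwidth
    availability: float     # QoS: availability
    start: float            # START of the time window managed by the NRS
    end: float              # END of the time window
    phase: str = "request"

    def advance(self) -> None:
        # Move to the next provisioning phase; modification of the
        # reservation is only allowed while in the intermediate phase.
        i = PHASES.index(self.phase)
        if i + 1 < len(PHASES):
            self.phase = PHASES[i + 1]

conn = ScheduledConnection("PE-A", "PE-B", 10.0, 0.999, start=100.0, end=200.0)
for _ in range(3):
    conn.advance()
assert conn.phase == "allocation"  # the service is provisioned in this phase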

5 Use-Case and CARRIOCAS Test-Bed

Many R&D projects require massive high-performance computing (HPC) power and generate large amounts of data, which pushes enterprises to access remote storage. Data storage demands are increasing because of the meshing resolutions (3D, time, temperature) of the applied mathematics methods involved. It is often required to process the raw data, to visualize the results on visualization facilities and to interpret them


immediately, and then to restart a new cycle of application workflows. These scenarios, named VISUPIPE, are developed for industrial companies connected to the CARRIOCAS network [6]. The BDTS service (Sched) allows reserving network services in advance with negotiated QoS and then allocating each service between two or more access end-points [8]. The Grid resources are located at different points. The GSP workflow requests are modeled as a graph: each vertex characterizes a task and its QoS requirements (computational, storage and visualization), while each edge represents the connectivity QoS requirements (bandwidth and latency). In the first version, the datacenter resources are selected first, which defines the vertices of the graph; the GSP then requests connectivity services from the SRV. The sequence diagram (Fig. 4) presents a scheduled connectivity request served by a connection provisioned through the NMS. The sequence diagram of a switched connection differs only slightly. A background process monitors the states (available or used) of the connections in the network and logs them in the SrvDB. When the connectivity service is bound to the network service delivered by the NRS (i.e., after the connectivity binding stage), the NRS returns a connectivity service ID to the SRV, which can be used to reference the connectivity service request from the GSP. Before the grid application VISUPIPE workflow starts (i.e., at T1-ε), the provisioned connection is activated by the NRS. Similarly, after the time T2 negotiated by the GSP and accepted by the SRV, the NRS deactivates the connection provisioned by the NMS, and the connection state is changed back to available.
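A minimal sketch of this graph model follows: vertices carry task QoS, edges carry connectivity QoS, and once the vertices are mapped to sites each inter-site edge yields a connectivity request towards the SRV. All names and figures are illustrative assumptions.

# Vertices: task QoS requirements; edges: connectivity QoS requirements.
tasks = {
    "compute": {"cpu_hours": 500, "storage_tb": 2},
    "visualize": {"gpu": True},
}
edges = {
    ("compute", "visualize"): {"bandwidth_gbps": 10, "latency_ms": 5},
}
# First version: the datacenter resources are selected first, fixing vertices.
placement = {"compute": "datacenter-1", "visualize": "user-site"}

def connectivity_requests(edges, placement):
    # Turn each inter-site edge into a connectivity service request to the SRV.
    for (a, b), qos in edges.items():
        if placement[a] != placement[b]:
            yield {"from": placement[a], "to": placement[b], **qos}

for req in connectivity_requests(edges, placement):
    print(req)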

6 Conclusion

The SRV-oriented network infrastructure automatically provides multiple network services through a generic network service interface, according to the connectivity requirements of the grid application workflows. Integrated services rely on virtualized infrastructures combining the connectivity service storefronts and the transport network infrastructures owned and operated by different organizations. Both immediate and scheduled network services will have to be supported. The specification of a standardized network service interface will facilitate the delivery of on-demand connectivity services and their integration with the other services delivered by datacenter infrastructures [10]. CARRIOCAS aims to demonstrate a reconfigurable transport network with 40 Gb/s optical transmission links providing connectivity services between enterprise clients and remote datacenter servers to run computing- and data-intensive applications. In the first part of the project, the requirements of the distributed applications were analyzed and the optimal network system architecture evaluated; the second phase, the deployment of the network and application server infrastructures, is complete. The specifications of the SRV are complete and the design phase is in progress. Beyond the production network pilot used for the advanced tests and experimentations, one of the objectives is to define new commercial usages of virtualized infrastructures. Furthermore, CARRIOCAS is a collaborative platform for the design and analysis of complex numerical systems and is fostering new business models.


Acknowledgments

The authors thank all the partners of the CARRIOCAS project, including O. Leclerc (Alcatel-Lucent), L. Tuhal, J.-L. Leroux, M. Chaitou and J. Meuric (France-Telecom), X. Gadefait (Marben Products), J.-M. Fourneau and N. Izri (PRiSM), L. Zitoune and V. Vèque (IEF), M. Gagnaire and R. Aoun (Telecom Paris-Tech), I. Wang and A. Cavali (GET/INT), E. Vieira and J.-P. Gallois (CEA-list), C. Mouton (EDF) and J.-C. Lafoucrière (CEA Ter@tec), who provided valuable inputs for this paper. The CARRIOCAS project gathers 22 institutions (industrial companies, SMEs and academia) and is carried out in the French cluster "system@tic Paris-Region". The project is funded by the French Industry Ministry and the Essonne, Hauts-de-Seine and Paris general councils: http://www.carriocas.org

References
1. Audouin, O., et al.: CARRIOCAS project: An experimental high bit rate optical network tailored for computing and data intensive distributed applications. In: APOC 2007, Wuhan, China (November 2007)
2. Requirements for Automatic Switched Transport Networks (ASTN). ITU-T Recommendation G.807/Y.1302 (July 2001)
3. Multi-Technology Operations Systems Interface (MTOSI) 2.0. TM Forum (May 2008)
4. Baroncelli, F., Martini, B., et al.: A Service Oriented Network Architecture suitable for Global Grid Computing. Personal Communication (October 2004)
5. Cergol, I., et al.: Performance Evaluation of a SLA Negotiation Control Protocol for Grid Networks. In: GridNets 2008 ICST Conf., China (2008)
6. Koslovski, G., Primet, P., et al.: VXDL: Virtual Resources and Interconnection Networks Description Language. In: GridNets 2008 ICST Conf., China (2008)
7. Utility computing enhanced with high bit rate networks: Opportunity and requirements. In: GridNets 2007 workshop, Lyon (October 2007)
8. Primet, P., et al.: Supporting bulk data transfers of high-end applications with guaranteed completion time. In: ICC 2007, Glasgow (June 2007)
9. Farrel, A., et al.: A Path Computation Element (PCE)-Based Architecture. IETF RFC 4655 (August 2006)
10. Network Service Interface. Open Grid Forum, GHPN working group, OGF24, Singapore (September 2008), http://www.ogf.org

Communication Contention Reduction in Joint Scheduling for Optical Grid Computing (Invited Paper)

Yaohui Jin, Yan Wang, Wei Guo, Weiqiang Sun, and Weisheng Hu

State Key Laboratory on Advanced Optical Communication Systems and Networks, Shanghai Jiao Tong University, Shanghai 200240, China
[email protected]

Abstract. Optical networks, which can provide guaranteed quality-of-service (QoS) connections, are considered a promising infrastructure for grid computing to solve increasingly complex scientific problems. When optical links are regarded as resources and jointly scheduled with other grid resources, communication contention must be taken into consideration for efficient task scheduling. This paper models optical grid computing as a communication-aware Directed Acyclic Graph (DAG) scheduling problem. To reduce the communication contention, we propose to use a hop-bytes metric (HBM) heuristic to select computing resources. Simulation results show that the HBM approach combined with an adaptive routing scheme achieves better performance in terms of normalized schedule length and link utilization. Keywords: Optical Grid, DAG, Communication Contention.

1 Introduction

By using open middleware technologies [1], Grid computing enables the sharing, selection, and aggregation of a wide variety of geographically distributed computational resources (e.g. supercomputers, data sources, storage systems, instruments, etc.) to solve increasingly complex problems in scientific research. Optical circuit-switched (OCS) networking technologies are considered better suited to fulfill the QoS requirements, i.e., to offer huge capacity and relatively low latency, as well as dynamic control and allocation of bandwidth at various granularities [2][3]. Thus optical networking is expected to play an important role in creating an efficient infrastructure for supporting such advanced Grid computing applications, which is called the optical Grid or photonic Grid [4]. Recently, significant research has been done on testbeds and architectures for optical Grid applications [5-9]. These efforts mainly aim at the integration of optical networks as Grid services, or at making optical circuit-switched networks more suitable for the Grid requirements, such as user-controlled capability, fast lightpath provisioning, and flexible dynamic control. However, few works focus on the scheduling problem for optical Grid computing in theoretical


details. A Grid computing application can be modeled as a directed acyclic graph (DAG) [10]. The scheduling of Grid computing maps the DAG onto the computational resources with efficient resource utilization. Although many algorithms have been proposed for DAG scheduling [11-14], these algorithms cannot be directly used for optical Grid computing applications, because most of them assume an ideal communication system in which the Grid resources are fully connected and the communication between any two Grid resources can be provisioned whenever needed. These assumptions are not consistent with practical OCS networks, in which a lightpath must first be set up before each communication and torn down after the communication finishes. While a lightpath is occupied by one communication, other communications cannot use it, and contention therefore arises. There have been some attempts to incorporate communication contention awareness into DAG scheduling. A few algorithms were proposed that consider network link contention [14] or end-port contention [17]. Sinnen and Sousa [18] propose a new graph model of the system network which is capable of capturing both end-point and network link contention. Agarwal et al. [16] propose a hop-bytes metric for task mapping in large parallel machines. Hop-bytes is the total size of inter-processor communication in bytes weighted by the distance between the respective end-processors. Using a hop-bytes based estimation function, in each iteration of the mapping algorithm the more communication-heavy task can be mapped onto a nearby processor. A joint scheduling model of computing and networking resources for optical Grid applications was proposed in [15] by incorporating the link communication contention of the optical networks into DAG scheduling. The optical network resource takes the form of a lightpath composed of a series of links, which are viewed as network resources to be shared among Grid users like other traditional computing resources. In this paper, we investigate how to reduce the communication contention so as to minimize the schedule length in DAG scheduling under the joint scheduling model. In extending the classic list scheduling algorithm to fulfill the joint scheduling, we find there are basically two ways to reduce the communication contention: (1) using an adaptive routing scheme to detour heavy traffic; and (2) mapping task objects to nearby grid resources to avoid long-hop communications. In this paper, we focus on the second contention reduction scheme and incorporate a hop-bytes metric [16] into the grid resource selection phase of the algorithm. The rest of this paper is organized as follows: In Section 2 we describe the joint scheduling model for optical grid computing. In Section 3, we redefine the HBM for DAG scheduling and elaborate on how to incorporate it into the resource selection phase. In Section 4, we provide simulation results to evaluate the performance. Finally, Section 5 concludes this paper.

2 Joint Scheduling for Optical Grid Computing

In traditional DAG scheduling, networks are seldom thought of as resources. In this section, the optical networks are treated as network resources in the same way as processing and storage resources, and all of these resources are jointly scheduled in the DAG scheduling.


2.1 Resource Model

In optical networks, the resources mainly include optical switch nodes and fiber links. Fig. 1(b) depicts an example of the optical Grid extended resource model with 7 Grid resources and 4 optical switches. Adjacent optical switches are connected via WDM fiber links. Each grid resource is connected to an optical switch via an access link. Therefore, the traffic from and to the end grid resources can be mapped onto wide-area SONET/SDH circuits or all-optical lightpaths. In our model, each optical switch is assumed to be equipped with full wavelength converters, so there is no wavelength-continuity constraint for routing. Our optical grid resource model can then be formulated as an extended resource graph OGR = (N, L, type, bw, d), where N is the set of network nodes with N = R + S, where a node r ∈ R represents a grid resource and a node s ∈ S represents an optical switch. L is a set of undirected links with L = LA + LT, where each link l ∈ LA represents an access link between a grid resource and an optical switch, while a link l ∈ LT represents a transmission link between two optical switches. The notation type(r) is the type of r; for example, 1 represents a computer, 2 storage and 3 an I/O device, etc. The weight bw(l) represents the bandwidth of link l and the weight d(l) denotes the distance of link l.

Fig. 1. Optical grid joint scheduling example. (a) DAG-modeled Grid application and (b) optical grid extended resource model.
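A direct rendering of OGR = (N, L, type, bw, d) as a data structure may help; the sketch below follows the formulation above, with a toy two-switch instance whose numbers are invented for illustration.

from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class Node:
    kind: str          # "resource" (r in R) or "switch" (s in S)
    res_type: int = 0  # type(r): 1 = computer, 2 = storage, 3 = I/O device

@dataclass
class Link:
    kind: str          # "access" (in LA) or "transmission" (in LT)
    bw: float          # bw(l): bandwidth of link l
    d: float           # d(l): distance of link l

# Toy instance: two grid resources attached to two connected optical switches.
nodes: Dict[str, Node] = {
    "r1": Node("resource", 1), "r2": Node("resource", 2),
    "s1": Node("switch"), "s2": Node("switch"),
}
links: Dict[Tuple[str, str], Link] = {
    ("r1", "s1"): Link("access", 10.0, 1.0),
    ("r2", "s2"): Link("access", 10.0, 1.0),
    ("s1", "s2"): Link("transmission", 40.0, 300.0),  # WDM fiber link
}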

2.2 Task Model

A Grid computing application can be modeled by a directed acyclic graph (DAG) [10]. We formulate the task model as GDAG = (V, E, type, c, w), where V is a set of v tasks and E is a set of e edges between the tasks. Each edge emn ∈ E represents a precedence constraint such that task vn cannot start execution until vm finishes. The notation type(v) represents the type of task v. Note that a task can only be scheduled onto a Grid resource of the same type. The weight c(v) denotes the average execution time required by v on a reference resource in the heterogeneous system. The weight w(e) denotes the data volume transmitted on edge e. In a given DAG, the set of all direct predecessors of task v is denoted by pred(v) and the set of all direct successors of v is denoted by succ(v). A task vertex v without


predecessors, pred(v) = φ, is called a source node, and a task vertex without successors, succ(v) = φ, is called a sink node. Fig. 1(a) shows an example DAG.

2.3 Communication Contention Aware DAG Scheduling

To schedule a DAG onto the optical grid extended resource system, awareness of communication contention can be achieved by edge scheduling, i.e., scheduling the edges of the DAG onto the links of the extended resource graph, in a manner similar to how the task nodes are scheduled on the processing resources. Fig. 1 exemplifies an optical grid schedule consisting of two parts: task scheduling, which maps the computation tasks onto grid resources (e.g., v4→r7), and communication scheduling, which maps the edges onto each link along the lightpath (e.g., e46→(r7,r4)). A scheduled communication should start and end on all the links along the route simultaneously, since the route is a cut-through lightpath in the network without any store-and-forward stage.
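Before the algorithm listing in Fig. 2, the task model and the pred/succ relations can be sketched directly. The bottom-level computation used in Step 1 is shown under its common definition (longest remaining path to a sink, counting both computation and communication weights); the paper does not spell this definition out, so it is an assumption here.

# Task model GDAG = (V, E, type, c, w): c(v) execution times, w(e) data volumes.
V = {"v1": {"type": 1, "c": 10}, "v2": {"type": 1, "c": 5},
     "v3": {"type": 1, "c": 8}, "v4": {"type": 1, "c": 3}}
E = {("v1", "v2"): 40, ("v1", "v3"): 25, ("v2", "v4"): 30, ("v3", "v4"): 15}

def pred(v):  # direct predecessors of v
    return [a for (a, b) in E if b == v]

def succ(v):  # direct successors of v
    return [b for (a, b) in E if a == v]

sources = [v for v in V if not pred(v)]  # pred(v) = φ -> source nodes
sinks = [v for v in V if not succ(v)]    # succ(v) = φ -> sink nodes

def bottom_level(v):
    # Longest remaining path to a sink (assumed definition, see above).
    if not succ(v):
        return V[v]["c"]
    return V[v]["c"] + max(E[(v, s)] + bottom_level(s) for s in succ(v))

LIST = sorted(V, key=bottom_level, reverse=True)  # Step 1 of the ELS algorithm
print(LIST)  # -> ['v1', 'v2', 'v3', 'v4']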

1: Step 1: Determine the scheduling list
2:   Determine each task's bottom level of the DAG
3:   Sort each task v ∈ V into a list LIST by decreasing order of bottom levels
4: Step 2: Sequential scheduling over the list
5: for each task vn ∈ LIST do
6:   for each resource r ∈ R with type(vn) = type(r) tentatively do
7:     for each vm ∈ pred(vn) in a definite order do
8:       if tasks vm and vn are scheduled on two distinct resources then
9:         Find a route Rt = l1, l2, …, lk for edge emn between the two resources
10:        Schedule emn on each link along Rt
11:      else
12:        Neglect the communication cost of emn
13:      end if
14:    end for
15:    Schedule task vn on resource r tentatively
16:  end for
17:  Select the resource rmin on which task vn has the earliest finish time
18:  Schedule each incoming edge emn of vn, vm ∈ pred(vn), on its determined route
19:  Schedule task vn on resource rmin
20: end for

Fig. 2. The extended list scheduling (ELS) algorithm

The objective of DAG scheduling is to minimize the schedule length. The scheduling problem under our communication contention model has been proved to be NP-hard [18]. Heuristics therefore try to produce near-optimal solutions in acceptable solving time. As list scheduling is one of the most common heuristics for DAG scheduling, we extend the classic list scheduling algorithm to implement the joint scheduling. The extended list scheduling (ELS) algorithm is outlined in Fig. 2.


Since communication scheduling is included in the DAG scheduling, communication contention naturally increases the schedule length. However, the ELS algorithm only describes how to implement communication contention aware DAG scheduling, without any means to reduce the contention. From Fig. 2, we find there are basically two ways to alleviate the network resource contention: one is to improve the routing scheme (line 9 in Fig. 2) and the other is to improve the computing resource selection scheme (line 17 in Fig. 2). There are generally three approaches to establishing lightpaths in optical networks: fixed routing, fixed-alternate routing and adaptive routing [20]. We discuss the resource selection scheme in the following section.

3 Hop-Bytes Metric Based Grid Resource Selection Scheme

To improve the resource selection scheme, our motivation is to map each task onto a nearby resource so as to reduce link contention, by introducing the Hop-Bytes Metric (HBM) [16] into the grid resource selection phase of DAG scheduling. As the HBM was originally used to judge the quality of the solution produced by an independent job mapping algorithm, we now redefine it for communication contention aware DAG scheduling. Definition: The HBM of task vi scheduled on computing resource r is defined as the total data volume in bytes carried by each incoming edge of vi, weighted by the number of hops of the route on which the edge is scheduled:

hb(vi, r) = Σ_{vj ∈ pred(vi)} w(eji) × h(rsc(vj), rsc(vi))  if pred(vi) ≠ φ,  and hb(vi, r) = 0 otherwise    (1)

where eji is the incoming edge from predecessor task vj to the current unscheduled task vi; w(eji) is the weight, i.e., the data volume in bytes, of the incoming edge eji; rsc(vj) and rsc(vi) denote the grid resources where tasks vj and vi are allocated, respectively; and h(rsc(vj), rsc(vi)) is the number of hops of the route between resources rsc(vj) and rsc(vi). In the basic ELS algorithm, there is only one objective in the grid resource selection phase, i.e., minimize {tf(v, r)}, where tf(v, r) denotes the finish time of task v ∈ V on grid resource r ∈ R. When the hop-bytes metric is taken into consideration for the reduction of long-hop communication, we have two objectives, i.e., min [tf(v, r), hb(v, r)], for all r ∈ R. There are many approaches to solving multi-objective or multi-criteria problems [21]; here we use a basic weighted sum method. Since the finish time and the HBM are different metrics, we first normalize them to the same measurement scale. At each resource selection phase, we tentatively schedule the current task v on all the resources and record the maximum and minimum finish times of v, denoted tfmax(v) and tfmin(v), as well as the maximum and minimum HBM of v, denoted


hbmax(v) and hbmin(v). Then the normalized finish time and HBM are given as:

tf′(v, r) = (tfmax(v) − tf(v, r)) / (tfmax(v) − tfmin(v)) ∈ [0, 1],
hb′(v, r) = (hbmax(v) − hb(v, r)) / (hbmax(v) − hbmin(v)) ∈ [0, 1]    (2)

The weighted sum objective function is written as FH(v, r) = λ · tf′(v, r) + (1 − λ) · hb′(v, r), with λ ∈ [0, 1]. In the following step, we can directly select the best resource rmin with the maximum FH(v, r), i.e.,

FH(v, rmin) = max_{r ∈ R} [FH(v, r)],  rmin ∈ R,  λ ∈ [0, 1]    (3)

When there is more than one resource having the same maximum FH(v, r), the best one is selected using a first-fit strategy.

Fig. 3. Schedule results under various coefficients λ for the weighted sum method, in terms of schedule length and link utilization

For the weighted sum method, the problem is how to determine the value of the coefficient λ. There are two extreme values for λ: λ = 0 and λ = 1. If λ = 0, the objective becomes to minimize the HBM for resource selection; if λ = 1, the objective is to minimize the finish time. As mentioned before, tf(v, r) is more important than hb(v, r) for obtaining the minimal schedule length, so λ should be more than 0.5. We therefore have 0.5 < λ < 1. Next we obtain a more precise value of λ through simulation. We randomly generate DAGs with 250 task nodes. Each DAG is then scheduled onto 16 resources interconnected by a 16-node NSFNET. Each link has only one unit of bandwidth. We average the results over 100 simulations. Fig. 3 depicts the schedule length and the link utilization, the latter defined as the sum over all links of the product of occupied time and occupied


bandwidth, under different values of λ. We obtain a relatively smaller schedule length when λ is around 0.8, as can be seen in the zoom-in inset of Fig. 3. When we change the network topology (e.g. a 16-node mesh-torus) or the DAG size (with the number of DAG nodes ranging from 128 to 1024), we obtain similar simulation results, and λ = 0.8 always produces relatively good results in terms of both schedule length and link utilization.
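A compact sketch of equations (1)-(3) may help: compute hb(v, r) per equation (1), normalize finish time and HBM per equation (2), and pick the resource maximizing the weighted sum of equation (3). The tentative finish times are supplied here as plain numbers for illustration; producing them for real requires the full ELS pass.

def hb(incoming_edge_bytes, route_hops):
    # Equation (1): data volumes w(e_ji) weighted by the hop counts h(.,.).
    return sum(w * h for w, h in zip(incoming_edge_bytes, route_hops))

def select_resource(candidates, lam=0.8):
    # candidates: {resource: (tentative finish time, hop-bytes)}.
    # lam = 0.8 is the coefficient found to work well in the simulations above.
    tf = {r: ft for r, (ft, _) in candidates.items()}
    hbm = {r: b for r, (_, b) in candidates.items()}
    tf_max, tf_min = max(tf.values()), min(tf.values())
    hb_max, hb_min = max(hbm.values()), min(hbm.values())

    def F_H(r):
        # Equation (2): normalize so larger is better, then weight (eq. (3)).
        tf_n = (tf_max - tf[r]) / (tf_max - tf_min) if tf_max > tf_min else 1.0
        hb_n = (hb_max - hbm[r]) / (hb_max - hb_min) if hb_max > hb_min else 1.0
        return lam * tf_n + (1 - lam) * hb_n

    # Ties resolve to the first candidate encountered (first-fit).
    return max(candidates, key=F_H)

best = select_resource({"r1": (120.0, 800), "r2": (110.0, 1400), "r3": (115.0, 600)})
print(best)  # -> "r2": the earliest finish time dominates at lam = 0.8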

4 Simulation Study

In this section, we evaluate through simulations the performance of the ELS algorithm with the hop-bytes based grid resource selection scheme. Two typical network topologies are employed in the following simulations: a 64-node mesh-torus and a 46-node USNET. The link capacity is kept minimal in order to maximize the communication contention. We also employ two routing schemes and two resource selection schemes in the following simulations (see Table 1).

Table 1. Routing and resource selection schemes in the simulation

FR: fixed shortest-path routing scheme, in which each route is pre-computed.
AR: adaptive routing scheme, which finds an earliest-start route for the current communication according to the current network state.
EFT: earliest finish time method for resource selection.
HB_WS: HBM-based weighted sum method for resource selection.

Fig. 4. Scheduling results of the 4 combinations of the two routing and two resource selection schemes, in terms of (a) normalized schedule length and (b) link utilization vs. DAG size (184, 736)

We use the same random DAG generator as in [15]. The average out-degree of the DAGs is 2. The DAG node weights are taken randomly from a uniform distribution around 10 [18], so the average node weight is 10. The communication-to-computation ratio (CCR) is set to 2 to simulate applications with heavier communication.


Moreover, it is assumed that all DAG nodes and grid resources have identical types and that all grid resources are homogeneous. The performance results are averaged over 100 simulations. We compare the 4 scenarios combining the different routing and resource selection schemes. The scheduling results in terms of normalized schedule length and link utilization are given in Fig. 4. Compared with the results of FR+EFT, we find that the WS scheme contributes more to reducing link utilization, but has little effect on the schedule length. The AR scheme contributes a much shorter schedule length, but at the cost of higher link utilization (which is not desirable for a publicly shared network). When we combine the two contention reduction schemes (i.e., AR+WS) in the DAG scheduling, most of the communication contention can be removed, producing the minimal schedule length with relatively low link utilization (even lower than FR+EFT by 4.2%).

5 Conclusions

Optical grid computing can be modeled as communication contention aware DAG scheduling over optical circuit-switched networks. There are basically two ways to reduce the communication contention in the ELS algorithm: an adaptive routing scheme and a hop-bytes based grid resource selection scheme. This paper mainly focused on the latter. We incorporated a hop-bytes metric into the resource selection and proposed two methods, a multilevel method and a weighted sum method, to schedule tasks onto nearby resources. Simulation shows that the weighted sum method is better than the multilevel method in terms of network resource utilization. We also demonstrated that the hop-bytes based resource selection scheme contributes to lower resource utilization, while the adaptive routing scheme has the advantage of reducing the schedule length. When the two schemes are employed together, both merits are achieved and most of the communication contention can be avoided, leading to the smallest schedule length with relatively low link utilization.

Acknowledgements

This work is supported by the China 863 Program and the National Natural Science Foundation of China under grants 2006AA01Z247, 60672016 and 60502004.

References
1. Foster, I., Grossman, R.: Data integration in a bandwidth-rich world. Commun. ACM 46, 50–57 (2003)
2. Veeraraghavan, M., et al.: On the use of connection-oriented networks to support grid computing. IEEE Commun. Mag. 44, 118–123 (2006)
3. Barker, K.J., et al.: On the Feasibility of Optical Circuit Switching for High Performance Computing Systems. In: Supercomputing (2005)
4. Simeonidou, D., et al.: Optical network infrastructure for grid. Global Grid Forum Document GFD.36, Grid High Performance Networking Group (2004)


5. Baroncelli, F., et al.: A Service Oriented Network Architecture suitable for Global Grid Computing. In: Conference on Optical Network Design and Modeling (2005)
6. Figueira, S., et al.: DWDM-RAM: Enabling Grid Services with Dynamic Optical Networks. In: IEEE International Symposium on Cluster Computing and the Grid (2004)
7. Lehman, T., et al.: DRAGON: A Framework for Service Provisioning in Heterogeneous Grid Networks. IEEE Commun. Mag. 44, 84–90 (2006)
8. Zheng, X., Veeraraghavan, M., Rao, N.S.V., Wu, Q., Zhu, M.: CHEETAH: Circuit-switched high-speed end-to-end transport architecture testbed. IEEE Commun. Mag. 43, 11–17 (2005)
9. Smarr, L.L., et al.: The OptIPuter. Commun. ACM 46(11), 59–67 (2003)
10. Sarkar, V.: Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors. MIT Press, Cambridge (1989)
11. Topcuoglu, H., et al.: Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing. IEEE Trans. Parallel Distrib. Syst. 13(3) (2002)
12. Wu, M.Y., et al.: Hypertool: A Programming Aid for Message-Passing Systems. IEEE Trans. Parallel Distrib. Syst. 1, 330–343 (1990)
13. Yang, T., et al.: DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors. IEEE Trans. Parallel Distrib. Syst. 5, 951–967 (1994)
14. Kwok, Y.K., Ahmad, I.: Link Contention-Constrained Scheduling and Mapping of Tasks and Messages to a Network of Heterogeneous Processors. Cluster Computing 3, 113–124 (2000)
15. Wang, Y., et al.: Joint scheduling for optical grid applications. Journal of Optical Networking 6(3) (2007)
16. Agarwal, T., et al.: Topology-aware Task Mapping for Reducing Communication Contention on Large Parallel Machines. In: IEEE Parallel and Distributed Processing Symposium (2006)
17. Beaumont, O., et al.: A Realistic Model and an Efficient Heuristic for Scheduling with Heterogeneous Processors. In: Proc. 11th Heterogeneous Computing Workshop (2002)
18. Sinnen, O., Sousa, L.A.: Communication contention in task scheduling. IEEE Trans. Parallel Distrib. Syst. 16, 503–515 (2005)
19. He, E., et al.: AR-PIN/PDC: Flexible Advance Reservation of Intradomain and Interdomain Lightpaths. In: IEEE Global Telecommunications Conference (2005)
20. Zang, H., et al.: Dynamic Lightpath Establishment in Wavelength-Routed WDM Networks. IEEE Commun. Mag. 39, 100–108 (2001)
21. Das, I.: Multi-Objective Optimization (1997), http://www-fp.mcs.anl.gov/otc/Guide/OptWeb/multiobj/

Experimental Demonstration of a Self-organized Architecture for Emerging Grid Computing Applications on OBS Testbed

Lei Liu, Xiaobin Hong, Jian Wu, and Jintong Lin

P.O. Box 55#, Key Laboratory of Optical Communication and Lightwave Technologies, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing, China
[email protected], {xbhong,jianwu,ljt}@bupt.edu.cn

Abstract. As Grid computing continues to gain popularity in industry and the research community, it also attracts more attention at the consumer level. The large number of users and the high frequency of job requests in the consumer market make this setting challenging. Current Client/Server (C/S)-based architectures will become infeasible for supporting large-scale Grid applications due to their poor scalability and poor fault-tolerance. In this paper, based on our previous works [1, 2], a novel self-organized architecture realizing a highly scalable and flexible platform for Grids is proposed. Experimental results show that this architecture is suitable and efficient for consumer-oriented Grids. Keywords: optical Grid, self-organization, optical burst switching (OBS).

1 Introduction

Optical networking for Grid computing is an attractive proposition, offering huge amounts of affordable bandwidth and global reach of resources. Optical burst switching (OBS) [3], which combines the best of optical circuit switching (OCS) and optical packet switching (OPS), is widely regarded as a promising technology for supporting future Grid computing applications [4]. In recent years, Grid over OBS architectures have been extensively studied [5-7]. The latest research proposed using the Session Initiation Protocol (SIP) to construct an optical Grid architecture [8]. But to the best of the authors' knowledge, these architectures all follow the conventional C/S model, in which the roles are well separated: all grid resources publish their available resource information to the corresponding server (or SIP proxy); a job request is first sent to a server (or SIP proxy) for resource discovery, and if no satisfactory resource is found, the request is then transferred to another server (or SIP proxy) for further handling. The Grid will open to the consumer market in the future, a setting challenged by the potentially large number of resources and users (perhaps millions), the high frequency of job requests and the considerable heterogeneity in resource types. Against this background, a C/S-based optical Grid architecture will become infeasible due to its poor scalability and poor fault-tolerance. In order to address this issue, a P2P-based


architecture is proposed in [1]. Compared with the conventional C/S solution, the significant improvement of the P2P-based architecture is that the resource discovery scheme is fully decentralized. Experimental results in [1] show that the P2P-based architecture improves network scalability and is a better choice for large-scale Grid applications. However, the shortcoming of the P2P-based solution is that only non-network resources are considered during resource discovery. So in [2], we proposed a self-organized resource discovery and management (SRDM) scheme, in which both network and non-network resources are taken into account for resource discovery. But the resource reservation in [2] is not efficient enough and introduces additional time delay. So in this paper, based on [1] and [2], a novel self-organized architecture for the optical Grid is investigated. Experimental results show that this architecture is suitable and efficient for Grid applications. Moreover, it outperforms our previous works [1, 2] in supporting future large-scale consumer-oriented Grid computing applications. The rest of this paper is organized as follows. Section 2 proposes the self-organized optical Grid architecture. Section 3 investigates the signaling process in this architecture. Section 4 presents the performance evaluation and experimental demonstration of the proposed architecture. Section 5 concludes this paper.

2 Self-organized Network Architecture

Fig. 1 shows the self-organized architecture for Grid over OBS. The architecture is separated into two layers: a transport layer and a protocol layer.

Fig. 1. Self-organized architecture for Grid over OBS

The transport layer is the actual Grid network. Fig. 1 shows a typical Grid over OBS infrastructure [4]: the Grid users/resources are divided into several Grid virtual organizations connected through the OBS network. Two interfaces, the Grid User Network Interface (GUNI) and the Grid Resource Network Interface (GRNI), are used to connect the Grid and the OBS network. The protocol layer is used for implementing the resource discovery scheme. It is composed of virtual nodes mapped from the Grid users/resources in the transport layer. A consistent hash function assigns each node in the protocol layer an m-bit


identifier (node ID) using SHA-1 [9] as a base hash function. A node's identifier is chosen by hashing the node's IP address. In the rest of this paper, the term "node" refers either to an actual node in the transport layer or to a virtual node in the protocol layer, as will be clear from the context. In the self-organized architecture, each node maintains three tables: a Finger Table (FT), a Latency Information Table (LIT) and a Blocking Information Table (BIT). The FT is a routing table (Fig. 2(a)) which includes both the identifier and the IP address of the relevant node. Note that the first finger of n is the immediate successor of n on the protocol layer; for convenience we often refer to the first finger as the successor. The generation process and algorithm for the FT are introduced in detail in [1].

Fig. 2. An example of (a) FT, (b) LIT and (c) BIT of node 10 (N10)
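The identifier assignment above is standard Chord-style consistent hashing; a minimal sketch follows, assuming m = 6 to match the small identifier ring of the examples (a real deployment would use a much larger m).

import hashlib

M = 6  # identifier length in bits; the examples use a small ring (mod 2^6)

def node_id(ip: str) -> int:
    # m-bit node ID: SHA-1 of the node's IP address, truncated mod 2^m.
    digest = hashlib.sha1(ip.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** M)

print(node_id("10.0.0.1"))  # a position on the identifier circle [0, 2^m)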

The LIT and BIT store, respectively, the end-to-end latency and the blocking probability from the current user to different resources. This information is obtained by a self-learning mechanism. Once a resource is discovered, its IP address is saved in the LIT and BIT. For each resource, the user periodically sends "Hello" signaling to it to measure the end-to-end latency and saves it in the LIT. Meanwhile, the user records the job history (success or failure), calculates the job blocking probability for each resource and saves it in the BIT. The BIT is cleared every T minutes to eliminate out-of-date information. Fig. 2(b) and Fig. 2(c) show examples of the LIT and BIT of node 10.

3 Signaling Process

Fig. 3(a) shows the signaling process for Grid applications in the self-organized architecture, which can be divided into three steps: preparation, resource discovery and job execution. A novel P2P protocol based on Chord [10] is integrated into this process.

3.1 Preparation

This step generates the Resource Publication Message (RPM), as further illustrated in Fig. 3(b). In computational Grids, Grid resources can be divided into two types: static resources, including operating system configuration, system version, the service types that the resource can provide, etc., and dynamic resources,


Fig. 3. (a) Signaling process (b) Resource Publication Message (RPM) generation (c) Protocol layer consisting of 11 virtual nodes storing 10 RPMs

including idle CPU cycles, available disk space, free RAM, etc. The dynamic resources change during Grid job execution, while the static resources remain unchanged. Each resource in the Grid network is required to describe its available static and dynamic resources in the same description method and format, which must be negotiated by all the Grid users. This description should be a structured naming or description, such as [11-12]. The RPM is composed of two parts: the top m bits are the key, generated by hashing the static resource description using SHA-1; the remaining bits are the dynamic resource description (DRD). The dynamic resource description is reversible and can be parsed by every Grid user, while the key is irreversible due to SHA-1. The RPM of node n is denoted RPM(n, Key k, DRDn) in the remainder of this paper. After the RPMs are generated, they are published to nodes residing in the protocol layer, which is an identifier circle modulo 2^m. As shown in Fig. 3(c), RPM(n, Key k, DRDn) is assigned to the first node whose identifier is equal to or follows key k in the identifier space. This node is called the successor node of key k, denoted successor(k). If identifiers are represented as a circle of numbers from 0 to 2^m − 1, then successor(k) is the first node clockwise from k.

3.2 Resource Discovery, Reservation and Release

The process of resource discovery, reservation and release is described in this section. First, the user specifies the job requirements and the job characteristic (i.e., loss-sensitivity or delay-sensitivity) through a web portal in which dynamic Web Service technology is implemented. After that, a job to be executed remotely generates a Resource Discovery Message (RDM) according to its job requirements. The top m bits of the RDM are again a key, which is the hash of the static resource requirements; the remaining bits are the description of the dynamic resource requirements (DRR) and the network information (NI).
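Both the RPM and the RDM put an m-bit SHA-1 key in their top bits: the RPM hashes the static resource description, the RDM the static resource requirements, so a lookup for given requirements lands on the node storing the matching publications. A minimal sketch of this shared key construction, with a placeholder dict standing in for the negotiated description format:

import hashlib, json

M = 6  # key length in bits, matching the identifier space of the ring

def make_key(static_desc: dict) -> int:
    # Irreversible key: SHA-1 over a canonical form of the static description.
    canonical = json.dumps(static_desc, sort_keys=True).encode()
    return int.from_bytes(hashlib.sha1(canonical).digest(), "big") % (2 ** M)

static = {"os": "linux", "service": "render"}           # negotiated static part
rpm = (make_key(static), {"idle_cpu": 0.7, "free_ram_gb": 12})  # key + DRD
rdm_key = make_key(static)       # the same static requirements ...
assert rdm_key == rpm[0]         # ... yield the same key

# The RPM is published to successor(key); an RDM lookup for the same static
# requirements therefore reaches the node holding this RPM.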


The input and output for generating the NI can be described by the following program. Based on the end-to-end latency/blocking information, the IP addresses of the resources (IP1, IP2, ..., IPn) stored in the LIT and BIT are sorted into increasing order (IP'1, IP'2, ..., IP'n). After that, the reordered permutation (IP'1, IP'2, ..., IP'n) is saved in the NI. There are several applicable sorting algorithms, investigated in detail in [13]. Fig. 4(a) shows the generation process of the RDM. The RDM with key k is denoted RDM(Key k, DRR, NI) in the rest of this paper. Program for the generation of the network information (NI) in the RDM:

Input: (1) a sequence of IP addresses stored in the LIT and BIT, IP1, IP2, ..., IPn; (2) their corresponding end-to-end latencies L1, L2, ..., Ln (Fig. 2(b)); (3) their blocking probability values P1, P2, ..., Pn (Fig. 2(c)); and (4) the job characteristic specified by the user (i.e., loss-sensitive or delay-sensitive).

Output: a permutation (reordering) IP'1, IP'2, ..., IP'n of the input sequence such that L'1 ≤ L'2 ≤ … ≤ L'n (delay-sensitive case) or P'1 ≤ P'2 ≤ … ≤ P'n (loss-sensitive case).
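The program above amounts to a plain sort keyed on latency or blocking probability depending on the job characteristic; a direct sketch, with invented addresses and values:

def make_ni(lit, bit, characteristic):
    # lit: {ip: end-to-end latency}; bit: {ip: blocking probability}.
    # Returns the IP addresses reordered according to the job characteristic.
    key = lit if characteristic == "delay-sensitive" else bit
    return sorted(key, key=key.get)  # ascending latency or blocking

ni = make_ni({"10.0.0.1": 12.0, "10.0.0.2": 5.0},
             {"10.0.0.1": 0.02, "10.0.0.2": 0.10},
             "delay-sensitive")
print(ni)  # -> ['10.0.0.2', '10.0.0.1']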

The resource discovery process can be described as follows: the operation find_successor is first invoked at node n to find the key of the RDM. If the key falls between n and its successor, find_successor finishes and node n returns its successor. Otherwise, n searches its finger table for the node n' whose ID most immediately precedes the key, and then invokes find_successor at n'. The reason behind this choice of n' is that the closer n' is to the key, the more it knows about the identifier circle in the region of the key. When the key's successor is found, the dynamic resource descriptions are compared to find out which nodes can meet the dynamic resource requirements of the job. After a list of candidate resources satisfying the specified requirements is obtained, i.e., list L, the (IP'1, IP'2, ..., IP'n) in the NI are compared to the IP addresses in L in order to choose the best resource (i.e., least latency or least blocking). A first-fit mechanism is used here, since the (IP'1, IP'2, ..., IP'n) in the NI are already sorted. If there is no relevant record in the NI, an IP address is randomly selected from list L as the resource discovery result. After a resource is chosen, the non-network resources can be reserved by updating the RPM. Together with OBS bandwidth reservation protocols (e.g., Just-Enough-Time (JET) [3]), the proposed self-organized architecture enables a more flexible end-to-end reservation (compared with [2]) of both network and non-network resources in a fully decentralized manner.
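The lookup logic described above is the classic Chord find_successor [10]. The sketch below reproduces it over an in-memory ring so the routing behaviour can be traced; it omits the RPC layer of a real deployment, trims the demo finger tables to just the successor (full tables give O(log N) hops), and applies a simplification when no closer finger is known.

def in_interval(x, a, b, m=6):
    # True if x lies in the circular interval (a, b] on the 2^m identifier ring.
    size = 2 ** m
    return x != a and (x - a) % size <= (b - a) % size

class VirtualNode:
    def __init__(self, ident):
        self.id = ident
        self.successor = None  # the first finger
        self.fingers = []      # finger table entries

    def closest_preceding(self, key):
        # Finger whose ID most immediately precedes the key, per the text above.
        for f in reversed(self.fingers):
            if f.id != key and in_interval(f.id, self.id, key):
                return f
        return self

    def find_successor(self, key):
        if in_interval(key, self.id, self.successor.id):
            return self.successor
        nxt = self.closest_preceding(key)
        if nxt is self:               # simplification: no closer finger known
            return self.successor
        return nxt.find_successor(key)

# Demo ring; node IDs chosen to mirror the node-10/key-54 example of Fig. 4(b).
ids = [1, 10, 25, 40, 54]
ring = {i: VirtualNode(i) for i in ids}
for a, b in zip(ids, ids[1:] + ids[:1]):
    ring[a].successor = ring[b]
    ring[a].fingers = [ring[b]]       # trimmed finger tables for the demo

print(ring[10].find_successor(54).id)  # -> 54, cf. the query in Fig. 4(b)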

Fig. 4. (a) Resource Discovery Message (RDM) generation (b) Path of a query for RDM (Key54, DRR, NI) starting at node 10


Fig. 4(b) gives an example in which node 10 wants to find the successor of RDM(Key54, DRR, NI). The numbers (1-5) in Fig. 4(b) indicate the resource discovery procedure.

3.3 Job Execution

Once the resource discovery result is obtained, the user sends the actual job to the Edge Router (ER) for transmission to the resource. The ER aggregates the job into optical bursts, which are then sent to the reserved resources using the JET bandwidth reservation scheme. The edge router is able to send data from different users to different reserved resources across the network. After the job is successfully executed, the job results are returned to the user and, at the same time, the RPM is updated to release the reserved non-network resources.

4 Experimental Setup, Results and Discussions

The experimental setup is shown in Fig. 5. Grid users and resources were connected through the JET-OBS testbed [14]. Various latencies and blocking probabilities from the users to resources 1 and 2 were introduced by injecting background traffic. The resource discovery signaling messages (e.g., Hello, PRS) were encapsulated into bursts for transmission in order to avoid O/E/O conversion and message processing delay. About 1000 jobs were randomly generated with random resource requirements, job characteristics and start times (Fig. 7(a)). OBS edge routers and core routers were connected with fibre links carrying two DWDM data channels (1554.94 nm, 1556.55 nm) and one dedicated control channel (1.31 µm). The bit-rate of all channels is 1.25 Gb/s. The experimental results (Fig. 6) show that the proposed self-organized architecture can be well applied to an optical Grid with different job burst sizes (Fig. 6(a, b)) and different job request frequencies (Fig. 6(b, d)). The bursts together with the eye diagram (Fig. 6(d, e)) show that a 19.43 dB extinction ratio can be achieved. In our experiment, the shortest resource discovery time is 31.25 ms (Fig. 7(b)) and the longest is 640.625 ms (Fig. 7(c)). For all the Grid jobs, the average resource discovery time is nearly 200 ms and the lookup success rate is 100%.

Fig. 5. Experimental setup

The results in Fig. 7(d) show that the self-organized architecture outperforms the P2P-based architecture [1] in terms of job blocking and end-to-end latency, since resource discovery in the self-organized architecture is capable of considering both network and non-network resources. It can be seen that in the self-organized architecture, each user has its own intelligence to manage resource discovery requests and make proper decisions based on its own information about the whole Grid network.


Fig. 6. Experimental results (a) a small job encapsulated in a burst for transmission (b) a large job encapsulated in a burst for transmission (c) job result sent back to the user (d) several job transmissions when the job request frequency is high (e) eye-diagram of job bursts

Fig. 7. Experimental results (a) Web interface (b) shortest resource discovery time (c) longest resource discovery time (d) comparison of self-organized architecture and [1]

Clearly, by employing this distributed mechanism, it is not necessary to deploy powerful centralized servers for storing Grid resource information, which makes it possible to construct a more scalable and fault-tolerant network for a large-scale consumer Grid.

5 Conclusions

In this paper, a novel self-organized architecture for optical Grid is proposed and experimentally demonstrated on an OBS testbed. By introducing self-organization into the optical Grid, the disadvantages of the C/S-based Grid architecture are overcome, and many benefits are brought to the optical Grid by the inherent advantages of self-organization (e.g., flexibility, scalability, fault-tolerance). Both network and non-network resources are taken into account for resource discovery in this architecture, which results in better performance than our previous works. The experimental results verify that this architecture is suitable and efficient for future large-scale consumer-oriented Grid computing applications.


Acknowledgments. This work was supported by 863 Program (2007AA01Z248), MOST Program (No.2006DFA11040), PCSIRT (No.IRT0609) and 111 Project (B07005).

References
1. Liu, L., Hong, X.B., Wu, J., Lin, J.T.: Experimental Demonstration of P2P-based Optical Grid on LOBS Testbed. In: Optical Fiber Communication Conference (OFC), San Diego, USA (2008)
2. Liu, L., Guo, H., et al.: Demonstration of a Self-organized Consumer Grid Architecture. In: European Conference on Optical Communications (ECOC), Brussels, Belgium (accepted, 2008)
3. Qiao, C., Yoo, M.: Optical Burst Switching (OBS): a New Paradigm for an Optical Internet. J. High Speed Netw. 8(1), 69–84 (1999)
4. Nejabati, R. (ed.): Grid Optical Burst Switched Networks (GOBS). Technical report, Open Grid Forum (OGF), GFD.128 (2008)
5. Zervas, G., Nejabati, R., Wang, Z., Simeonidou, D., Yu, S., O'Mahony, M.: A Fully Functional Application-aware Optical Burst Switched Network Test-bed. In: Optical Fiber Communication Conference (OFC), Anaheim, California, USA (2007)
6. Vokkarane, V.M., Zhang, Q.: Reliable Optical Burst Switching for Next-generation Grid Networks. In: IEEE/CreateNet GridNets, Boston, USA, pp. 505–514 (2005)
7. Farahmand, F., De Leenheer, M., Thysebaert, P., Volckaert, B., De Turck, F., Dhoedt, B., Demeester, P., Jue, J.P.: A Multi-layered Approach to Optical Burst-switched Based Grids. In: International Conference on Broadband Networks (BroadNets), Boston, USA, vol. 2, pp. 1050–1057 (2005)
8. Zervas, G., Nejabati, R., Simeonidou, D., Campi, A., Cerroni, W., Callegati, F.: SIP Based OBS Networks for Grid Computing. In: Tomkos, I., Neri, F., Solé Pareta, J., Masip Bruin, X., Sánchez Lopez, S. (eds.) ONDM 2007. LNCS, vol. 4534, pp. 117–126. Springer, Heidelberg (2007)
9. FIPS 180-1, Secure Hash Standard. U.S. Department of Commerce/NIST, National Technical Information Service, Springfield, VA (1995)
10. Stoica, I., Morris, R., Karger, D., Kaashoek, F., Balakrishnan, H.: Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications. In: ACM SIGCOMM, San Diego, USA, pp. 149–160 (2001)
11. Globus Project, The Globus Resource Specification Language RSL v1.0, http://www.globus.org/toolkit/docs/2.4/gram/rsl_spec1.html
12. Smirnova, O.: Extended Resource Specification Language, reference manual, http://www.nordugrid.org/documents/xrsl.pdf
13. Cormen, T., Leiserson, C., Rivest, R., Stein, C.: Introduction to Algorithms, 2nd edn. The MIT Press, Cambridge (2001)
14. Guo, H., Lan, Z., Wu, J., Gao, Z., Li, X., Lin, J., Ji, Y., Chen, J., Li, X.: A Testbed for Optical Burst Switching Network. In: Optical Fiber Communication Conference (OFC), Anaheim, California, USA (2005)



Joint Scheduling of Tasks and Communication in WDM Optical Networks for Supporting Grid Computing

Xubin Luo and Bin Wang
Department of Computer Science and Engineering, Wright State University, Dayton, OH 45435
{luo.4,bin.wang}@wright.edu

Abstract. A flexible task model (FTM) is proposed for modeling the relationships between grid tasks. We investigate the problem of scheduling grid computing tasks under FTM using light-trails in WDM networks to support the data communication between the tasks. Simulation results show that our proposed task scheduling algorithm under FTM significantly reduces the total task completion time.

1 Introduction

A grid application may consist of multiple interrelated tasks. A task model is used to capture the tasks and their interrelationships, and is commonly represented by a directed acyclic graph (DAG), also referred to as a task graph. An example DAG is shown in Figure 1(a). Conventional task models (CTM) used in most existing work assume that both task execution and data communication between tasks are atomic, in the sense that a task cannot produce its output until it has completed its execution, that upon completion of execution all the output is ready at that moment, and that a task cannot start to execute until it receives all the inputs from its predecessors. This is reasonable because CTM was originally proposed for modeling and scheduling the comparatively lightweight processes of a parallel program in a multi-processor system. However, in a grid application environment, grid tasks usually take much longer to execute and the amount of data exchanged between tasks is much larger. Many practical scenarios exist in which a task generates output in the middle of execution or as soon as execution starts, and in which a task may start to execute when it receives an adequate amount of (not necessarily all) input from its predecessors. Therefore, a flexible task model (FTM) is proposed for modeling the relationship between the grid tasks. In FTM, a task may generate output before it completes and may start to execute when it has collected a minimum amount of required input from its predecessors. For a task, the required percentage of input before starting execution and the percentage of execution before generating output can be any value from 0 to 100%. Thus, FTM is more general and flexible than the conventional task graph model considered in previous work.


Fig. 1. Task graph examples: (a) CTM; (b) FTM

Fig. 2. A grid with five switching nodes and three types of resources

At the same time, we assume that the output of a task in FTM is produced as a steady stream. Figure 1(b) is an example FTM task graph. Specifically, the number 5 associated with link (T1, T4) is the total amount of data produced by T1 that needs to be transmitted to T4; 0.2 is the portion of its execution that T1 has to finish before it starts to generate output for T4; and 0.1 is the portion of the data that T4 needs to collect from T1 before it starts to execute. We assume that the grid is composed of two parts: the WDM optical network for provisioning communication services between the grid tasks, and the resources attached to the switching nodes in the WDM network for executing the grid tasks. Figure 2 shows an example of a grid with five switching nodes and three types of resources. We assume no communication contention on the access links, i.e., the access links connecting the resources to the switching nodes are assumed to have abundant bandwidth. Therefore, there is no communication contention within a cluster of resources connected to the same switching node. We adopt a centralized architecture for task scheduling in which a central server is responsible for allocating the tasks to proper resources and managing the light-trails in the network. The server also keeps the status of all the resources, optical links, and the established light-trails in the optical network. The problem considered in this paper is to determine an optimal joint schedule for executing the grid tasks and for supporting communication between tasks using light-trails such that the total time taken to complete all the tasks is minimized. The problem is NP-complete because conventional task graph scheduling, which has been proved to be NP-complete [1], is a special case of it. In [2] the communication service between the tasks is provided using lightpaths. However, a new connection requires setting up a new lightpath unless there happens to be an existing lightpath with sufficient spare capacity between


the same pair of source and destination nodes. In this work, we propose to provision the communication services using light-trails [3,4,5,6] in WDM optical networks. A light-trail can support a connection even if the connection's source and destination nodes are in the middle of the light-trail, as long as the source node is upstream of the destination node and the spare capacity of the light-trail is sufficient to support the connection. We develop scheduling algorithms to solve the joint scheduling problem. Extensive simulations are conducted and analyzed. Our algorithms show excellent performance by significantly reducing the total task completion time, compared with the approaches taken under the conventional task graph model. The rest of this paper is organized as follows. Section 2 presents a formal definition of the problem. Section 3 analyzes the problem by interpreting the relationships between the tasks in the task graph, based on which Section 4 proposes the joint scheduling algorithm. The performance evaluation is reported in Section 5. Section 6 concludes the paper.

2 Problem Definition

In this section we formally define the scheduling problem considered in this paper. The notations used are as follows:

– T = {T1, T2, ..., Tn}: the set of nodes (tasks) in the task graph;
– L: the set of directed links in the task graph: L = {(Ti, Tj) | Ti is a predecessor of Tj};
– w(Ti): the relative execution time of task Ti;
– t(Ti): the type of resource required by task Ti for execution;
– w(i,j): the weight of the directed link from node Ti to Tj, which is the amount of data to be transmitted from task Ti to task Tj;
– α(i,j): a number between 0 and 1 that specifies the percentage of the execution that task Ti needs to complete before it starts to produce output for task Tj;
– β(i,j): a number between 0 and 1 that specifies the percentage of the output from task Ti that task Tj has to collect before it starts to execute.

The WDM optical network is modeled as a graph (N, E), where N is the set of switching nodes and E is the set of optical links between the switching nodes. Each optical link e is assumed to have W wavelengths, and each wavelength has a capacity of C. Moreover, we assume no wavelength conversion. At the same time, a set of resources {M1, M2, ...} is attached to the switching nodes. t(Mh) indicates the resource type of resource Mh. Resource Mh is characterized by its processing power p(Mh). For a task Ti compatible with Mh, the time it takes Mh to execute Ti is w(Ti)/p(Mh). Furthermore, a set of light-trails LT = {lt1, lt2, ...} will be created in the WDM network to support the communication between tasks. The problem considered in this work is to determine a joint schedule such that the tasks are assigned to resources for execution, and the communication


channels in the WDM network are allocated for supporting the data communication between the tasks. The objective is to minimize the total amount of time to complete all the grid application tasks.

3 Problem Analysis

3.1 A Basic Case for Link (Ti, Tj)

We first examine a basic case in which (1) Ti and Tj are connected by a directed link (Ti, Tj), and (2) no other directed link leads to Tj. Suppose Ti and Tj are assigned to resources Mi and Mj, respectively, where Mi and Mj may (or may not) be the same resource. The following notations and parameters are introduced:

– ts(Ti): the start time of task Ti;
– te(Ti): the end time of task Ti;
– to_s^(i,j): the time when Ti starts to produce output for Tj. We have:

    to_s^(i,j) = ts(Ti) + α(i,j) · (te(Ti) − ts(Ti));    (1)

– ro^(i,j): the data rate of Ti's output for Tj: ro^(i,j) = w(i,j)/(te(Ti) − to_s^(i,j));
– ts^(i,j): the transmission start time of the data stream from task Ti to task Tj;
– te^(i,j): the transmission end time of the data stream from task Ti to task Tj;
– r(i,j): the actual transmission data rate of the data stream from task Ti to task Tj, which depends on the provisioning capability of the underlying network as well as the rate at which Ti produces data. Therefore r(i,j) = min{ro^(i,j), C};
– t_ir^(i,j): the time when the data stream from task Ti to task Tj has accumulated enough for the execution of task Tj to start. We have:

    t_ir^(i,j) = ts^(i,j) + β(i,j) · (te^(i,j) − ts^(i,j)).    (2)

Three cases need to be considered for scheduling the execution of Tj as well as the data transmission between Ti and Tj:

– Case 1: Mi = Mj. Obviously we have ts(Tj) ≥ te(Ti).
– Case 2: Mi ≠ Mj and they are located in the same cluster. The execution sequence of the two tasks is shown in Figure 3(a). The execution of tasks and the output/input data streams are represented by rectangles whose lengths correspond to the time durations. In this case the only requirement is ts(Tj) ≥ t_ir^(i,j), where t_ir^(i,j) can be obtained from Eq. (2), in which ts^(i,j) = to_s^(i,j) and te^(i,j) = te(Ti).
– Case 3: Mi ≠ Mj and they are located in different clusters. The execution sequence of the two tasks is shown in Figure 3(b). In this case a light-trail is needed in the WDM network to accommodate the data stream from Ti to Tj as soon as possible and no earlier than to_s^(i,j).


Fig. 3. The execution sequence of tasks Ti and Tj

The light-trail can be an existing one or a newly established one. Its spare capacity should be no less than r(i,j) over a time interval of length w(i,j)/r(i,j). Because ts^(i,j) is the time at which a light-trail is available to accommodate the data stream from Ti to Tj, we have ts^(i,j) ≥ to_s^(i,j) and te^(i,j) = ts^(i,j) + w(i,j)/r(i,j). More details are explained in Section 4.

When the incoming data are ready (at t_ir^(i,j)), we need to check the availability of Mj. Therefore, we maintain the availability of each resource over time. Based on the availability information, we can then determine a value for ts(Tj) such that ts(Tj) ≥ t_ir^(i,j) and the resource is available during [ts(Tj), ts(Tj) + w(Tj)/p(Mj)]. te(Tj) depends on the actual execution time of Tj as well as on the input data stream, whichever ends later. As a result, we have

    te(Tj) = max{(ts(Tj) + w(Tj)/p(Mj)), te^(i,j)}.    (3)
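A small Python sketch of these timing computations for Case 3 follows; the helper trail_available_at is a hypothetical stand-in for the light-trail availability check described in Section 4, and the sketch assumes α(i,j) < 1.

def basic_case_schedule(ts_i, te_i, w_ij, alpha_ij, beta_ij, C,
                        trail_available_at):
    """Timing for link (Ti, Tj) in Case 3 (Mi != Mj, different clusters).
    trail_available_at(t, rate) is an assumed network-dependent helper
    returning the earliest time >= t at which a light-trail with spare
    capacity `rate` is available."""
    # Eq. (1): the time when Ti starts to produce output for Tj
    to_s = ts_i + alpha_ij * (te_i - ts_i)
    # Output data rate of Ti for Tj, capped by the wavelength capacity C
    r = min(w_ij / (te_i - to_s), C)
    # Transmission starts no earlier than to_s, subject to availability
    ts_tx = trail_available_at(to_s, r)
    te_tx = ts_tx + w_ij / r
    # Eq. (2): when Tj has accumulated enough input data to start
    t_ir = ts_tx + beta_ij * (te_tx - ts_tx)
    return t_ir, te_tx

def task_end_time(ts_j, w_j, p_Mj, te_tx):
    # Eq. (3): Tj ends with its execution or its input stream,
    # whichever finishes later
    return max(ts_j + w_j / p_Mj, te_tx)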

3.2 Task Scheduling with Multiple Predecessors

For a task Tj with multiple predecessors, the start time ts(Tj) is constrained by the data streams from all of its predecessors, and Tj can start to execute only when all of these constraints are met. Therefore, we have Eqs. (4) and (5) for ts(Tj):

    ts(Tj) ≥ max over all i with (Ti, Tj) ∈ L and Mi ≠ Mj of t_ir^(i,j),    (4)

    ts(Tj) ≥ te(Ti), if Mi = Mj,    (5)

and Eq. (6) for te(Tj):

    te(Tj) = max{ (ts(Tj) + w(Tj)/p(Mj)), max over all i with (Ti, Tj) ∈ L of te^(i,j) }.    (6)

4 Proposed Algorithm

We extend the list scheduling algorithm ([2], [7], [8]) to solve the joint scheduling problem. Each task in the task graph is assigned a priority value. Tasks are processed in decreasing order of their priorities. When a task is processed, it is assigned to a selected resource so that it can complete as soon as possible. This is achieved by tentatively allocating the task to every compatible resource in the grid and comparing the outcomes of the different assignments. The priority P(Ti) for task Ti is determined by Eq. (7):

    P(Ti) = ϕ(Ti), if Ti is an exit node;
    P(Ti) = max over all j with (Ti, Tj) ∈ L of { P(Tj) + (α(i,j) + (1 − α(i,j)) · β(i,j)) · ϕ(Ti) }, otherwise.    (7)

Basically, P(Ti) is an approximation of the least amount of time it takes to complete task Ti and all its offspring in the task graph. ϕ(Ti) is an approximate execution time of Ti, namely w(Ti) divided by the average processing power of the resources compatible with Ti:

    ϕ(Ti) = ( Σ over all h with t(Mh) = t(Ti) of 1 ) / ( Σ over all h with t(Mh) = t(Ti) of p(Mh) ) · w(Ti).    (8)

If task Tj is assigned to a compatible resource M, the scheduling of Tj is determined as follows: (1) If Tj is an entry node, no constraint is imposed by predecessors; thus the earliest time when M is available is the time to start the execution of Tj. (2) If Tj is not an entry node, in addition to the resource contention, we need to inspect each of Tj's predecessors to determine the constraints imposed on the execution of Tj, by applying the analysis described in the previous section. When a light-trail is needed for the data stream from one of Tj's predecessors, we consider both using an existing light-trail and routing a new light-trail, whichever provisions the data stream earlier. When considering an existing light-trail, we need to make sure that it has enough spare capacity during the time interval over which the data stream sustains. Also, when routing a new light-trail, all the wavelength-links on the trail should be available during the same time interval. Therefore we need to keep track of the status of every wavelength-link as well as every light-trail that has been established in the network, so that based on the status of the existing light-trails we can determine the earliest available light-trail that can support a connection, or route a new one. The details of the algorithms are not included in this paper due to the space limit.
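As an illustration of the bookkeeping this requires, the following Python sketch checks whether an existing light-trail can carry a data stream; the trail object's nodes, spare, and earliest_free members are assumptions for illustration, not an interface defined in the paper.

def trail_can_carry(trail, src, dst, rate, t_start, t_end):
    """trail: object with an ordered .nodes list and a .spare(t0, t1)
    method returning the minimum spare capacity over [t0, t1] (both
    assumed). A light-trail can serve a connection only if src precedes
    dst on it, i.e., src is upstream of dst."""
    nodes = trail.nodes
    if src not in nodes or dst not in nodes:
        return False
    if nodes.index(src) >= nodes.index(dst):
        return False
    return trail.spare(t_start, t_end) >= rate

def earliest_trail(trails, src, dst, rate, duration, ready_at):
    # Scan the existing light-trails and pick the one that can carry
    # the stream earliest; comparing against routing a new trail is
    # omitted here. earliest_free is an assumed helper returning the
    # earliest start time >= ready_at with the needed spare capacity.
    best = None
    for tr in trails:
        t = tr.earliest_free(ready_at, rate, duration)
        if t is not None and trail_can_carry(tr, src, dst, rate,
                                             t, t + duration):
            if best is None or t < best[0]:
                best = (t, tr)
    return best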

5 Performance Evaluation

We evaluate the performance of the proposed algorithm with three sets of randomly generated task graphs with different communication-to-computing ratios (CCR): 0.1, 1.0, and 10.0. CCR is defined as the average communication time


divided by the average execution time on a given system. The number of nodes in the DAGs (DAG size) varies from 20 to 300 in increments of 20. Three topologies are used for the WDM networks in our simulation: NSFNET, a 4 × 4 Torus, and a 16-node topology. We assume that the links in each topology are optical links with 4 wavelengths. The wavelength capacity C is assumed to be 1. Each node in the topologies is regarded as a switching node, to which a local cluster of resources is attached. 20 resources are randomly distributed in the grid. Each resource has a processing power of 1 and is randomly assigned type τ0 or τ1.


Fig. 4. Total task completion time on NSFNET: (a) CCR=0.1; (b) CCR=1.0; (c) CCR=10

Figure 4 shows the simulation results for NSFNET (the results for the other two topologies are similar and are not included here). The curves labeled FTM and CTM show the task completion time under FTM and CTM, respectively. The other curve indicates the percentage by which the task completion time is reduced under FTM. As we can see from the simulation results, under FTM the total task completion time is significantly reduced compared with that under CTM. This is especially true for the cases where CCR=0.1 or 1, where the reduction of the task completion time ranges from 25% to 31% in all three topologies. This shows FTM's excellent capability of capturing the concurrency characteristics between the tasks, as well as the efficiency of our scheduling algorithm. However, we notice that in all three topologies FTM and CTM have close performance when CCR=10, where the reduction of the task completion time ranges from 0 to 10%. This is reasonable since in this case the communication time is ten times the computation time; most of the time is consumed by transferring data between tasks, so that concurrency in the execution of tasks may not lead to a significant reduction in the total task completion time. At the same time, we observe that FTM uses more wavelength-links than CTM in most cases (figures not included due to the space limit). This is because FTM models the output data of a task as a constant data stream, which most likely results in sub-wavelength connections; thus the optical channels in FTM are more likely to be under-utilized than in CTM. We also compare FTM with FTM without communication constraints (FTM-NC). Figure 5(a), (b) and (c) present the task completion time under FTM in the three topologies relative to that of FTM-NC. The results show that in all three topologies, when CCR=0.1 and 1.0, the communication overhead is very small, while for CCR=10 the task completion time of FTM is over 2 times that of FTM-NC.


Fig. 5. Total task completion time under FTM relative to that under FTM-NC

This shows that the communication service provisioning scheme is very efficient.

6 Conclusions

In this work a flexible task model (FTM) has been proposed for modeling the relationships between grid tasks. We investigated the problem of scheduling grid computing tasks modeled by FTM with light-trails in WDM networks to support the data communication between the tasks. Extensive simulations were conducted, and the results showed that our proposed task scheduling algorithm under FTM significantly reduces the total task completion time.

References
1. Garey, M., Johnson, D.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, New York (1979)
2. Wang, Y., Jin, Y., Guo, W., Sun, W., Hu, W., Wu, M.: Joint scheduling for optical grid applications. Journal of Optical Networking, 304–319 (March 2007)
3. Gumaste, A.: Light-trails as a SAN solution: Providing dynamic synchronous and multicasting connections in optical networks. In: SPIE Opticomm 2003, Workshop on Optical Networking Solutions for Global SAN (ONSAN), Dallas, TX (October 2003)
4. Gumaste, A., Chlamtac, I.: Adaptations to a GMPLS framework for IP over Optical Communication. In: National Fiber Optic Engineers Conference (NFOEC), Orlando, FL (September 2003)
5. Gumaste, A., Chlamtac, I.: Light-trails: A Novel Conceptual Framework for Conducting Optical Communications. In: Proceedings of IEEE Workshop on High Performance Switching and Routing, Torino, Italy (June 2003)
6. Chlamtac, I., Gumaste, A.: Light-trails: A Solution to IP Centric Communication over Optical Networks. In: IEEE 2nd QoS IP Conference, Milan, Italy (February 2003)
7. Topcuoglu, H., Hariri, S., Wu, M.Y.: Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems 13(3), 260–274 (2002)
8. Sinnen, O., Sousa, L.A.: List scheduling: extension for contention awareness and evaluation of node priorities for heterogeneous cluster architectures. IEEE Transactions on Parallel and Distributed Systems, 263–275 (March 2005)

Manycast Service in Optical Burst/Packet Switched (OBS/OPS) Networks (Invited Paper)

Vinod M. Vokkarane 1 and Balagangadhar G. Bathula 2
1 Department of Computer and Information Science, University of Massachusetts, Dartmouth, MA 02747, USA
2 School of Electronic and Electrical Engineering, University of Leeds, Leeds LS2 9JT
[email protected], [email protected]

Abstract. Recently there has been an emergence of many Internet applications such as distributed interactive simulations (DIS) and high-performance scientific computations such as Grid computing. These applications require huge amounts of bandwidth and a viable communication paradigm to coordinate multiple sources and destinations. In this work we propose a variation of multicasting called quorumcasting or manycasting. In manycasting the destinations are to be determined, rather than given as in the case of multicasting. We first present the need to support manycasting over OBS networks. Quality of Service (QoS) policies implemented in IP do not apply to optical burst switched (OBS) networks, as an optical counterpart of the store-and-forward model does not exist. Hence there is a need to support QoS for manycasting over OBS networks. In this work we focus on QoS parameters such as contention, optical signal quality, reliability, and propagation delay. Burst loss in an OBS network can occur due to contention or bit-error rate (BER). We propose algorithms to decrease the overall burst loss. We show that IP-based manycasting has poor performance compared to our proposed algorithms. Our simulation results are verified with the help of an analytical model. This work is further extended to the multi-constrained manycast problem (MCMP), in which we address burst scheduling under multiple QoS constraints. We propose algorithms to minimize burst loss based on given service requirements. The goal of this work is to develop service-oriented optical networks (SOON) for many emerging Internet applications. Keywords: WDM, QoS, GoOBS, Manycasting.

1 Introduction

There has been a recent emergence of many Internet applications such as multimedia, video conferencing, distributed interactive simulations (DIS), and high-performance scientific computations like Grid computing. These applications require huge amounts of bandwidth and a viable communication paradigm


to coordinate multiple sources and destinations. Existing communication paradigms include broadcast and multicast. As optical networks are the potential candidates for providing such high bandwidth, supporting these paradigms over optical networks is necessary. QoS policies implemented in the Internet Protocol (IP) do not apply to wavelength division multiplexed (WDM) or optical burst switched (OBS) networks, as an optical counterpart of the store-and-forward model does not exist. Hence there is a need to provision QoS over optical networks. These QoS requirements can include contention, optical signal quality, reliability, and delay. To support these diverse requirements, optical networks must be able to manage the available resources effectively. The destinations participating in a multicast session are fixed (or rather static). Due to random contention in the network, if one or more destinations are not reachable, the requested multicast session cannot be established. This results in loss of the multicast request with a high blocking probability. Incorporating wavelength converters (WCs) at the core nodes can decrease the contention loss. However, WCs require optical-electrical-optical (O/E/O) conversion, which increases the delay incurred by the optical signal. On the other hand, all-optical WCs are expensive and increase the cost of the network if deployed. The goal of this work is to provide hop-to-hop QoS on an existing all-optical network (AON) with no WC or optical regeneration capability. In order to minimize the requests lost due to contention in an AON, we propose a variation of multicasting called quorumcasting or manycasting. In quorumcasting, destinations can join (leave) the group depending on whether or not they are reachable. In other words, destinations have to be determined rather than given, as in the case of multicasting. The quorum pool is defined as the minimum number of destinations (k) that are required to participate in the session for successful accomplishment of the job. Providing QoS for manycasting over OBS has not been addressed so far in the literature. The contribution of this work is to provide the necessary QoS for a given manycast request. In this work we study the behavior of manycasting over OBS networks. In OBS networks, packets from the upper layers (such as IP, ATM, STM) are assembled into a burst at the edge router. By using O/E/O conversion at the edge nodes, these optical bursts are scheduled to the core node. A control header packet, or burst header packet (BHP), is sent prior to the transmission of the burst. The BHP configures the core nodes, and the burst is scheduled on the channel after a certain offset time.

2 Manycast Service

Distributed applications require multiple destinations to be coordinated with a single source, and multicasting is thus one approach to implementing them. However, in multicasting the destination set is fixed, and the dynamic behavior of the network cannot be exploited. A variation is to dynamically vary the destinations depending on the status of the network. Hence in distributed applications, the first step is to identify potential destination


candidates and then select the required number. This is called manycasting, and the problem is defined as follows: given a network G(V, E), with node set V and edge set E, an edge cost function g : E → R+, an integer k, a source s, and a subset of candidate destinations Ds ⊆ V, |Ds| = m ≥ k, where |Ds| is the cardinality of the set Ds. If k = 1, one destination is chosen from the set Ds, and this is called anycasting. A manycast request is denoted simply by (s, Ds, k). We have to send the burst to k destinations out of the m (|Ds| = m) possible candidate destinations. Due to burst loss caused by burst contention and/or signal degradation, there is no guarantee that exactly k destinations receive the burst. In general, most multicasting solution approaches are largely applicable to manycasting. Networks that can support optical multicast can also support optical manycasting. Thus, manycasting can be implemented by multicast-capable optical cross-connect (MC-OXC) switches [1] that use a Splitter-and-Delivery (SaD) switch to split the optical signal [2]. For routing the burst, a shortest-path tree (SPT) can be computed using the following three steps:

– Step 1: Find the shortest path from source s to all the destinations in Ds. Let Ds = {d1, d2, ..., d|Ds|=m}, and let the minimum hop distance from s to di, where 1 ≤ i ≤ m, be H(s) = {h1, h2, ..., hm}.
– Step 2: Sort all the destinations in Ds in non-decreasing order of their path distance from source s. Let D's = {d'1, d'2, ..., d'm} be the new set in this order.
– Step 3: Select the first k destinations from D's.

For a network of size n, the three steps require time complexities of O(n^2), O(1), and O(n), respectively. If the shortest path distances to all the destinations are known, the time complexity of the SPT algorithm reduces to O(n). We implement the SPT algorithm in a distributed manner; a sketch of the destination selection is given below. Step 1 is implemented by the unicast routing table. Step 2 sorts the destinations at the source node, in constant time. Step 3 works as follows: the first k destinations are selected from D'm and the BHP is sent to all next-hop nodes (or child nodes). Let the child nodes be {c1, c2, ..., cj}, where 1 ≤ j ≤ k. The maximum number of child nodes is k, reached when each destination has a different next-hop node. Upon receiving the BHP at a next-hop node, the above three steps are applied again. The process ends when the packet reaches all k destinations or is dropped at intermediate nodes (due to data loss). Even though signal degradation along the shortest path is low, the BER is not necessarily within the threshold requirement. This indicates the need to develop physical-layer-aware manycasting algorithms, which are explained in the following section. Bursts for manycast are assembled in the same way as for unicast. When a burst is ready to be transmitted, a BHP is sent out along the route for the manycast request [3]. The well-known OBS signaling protocols for unicast traffic, such as tell-and-wait (TAW), tell-and-go (TAG), just-in-time (JIT), and just-enough-time (JET) [4], can be used for manycasting with the modifications described in the above centralized or distributed version of the SPT algorithm. In this paper we study manycasting with just-enough-time (JET) signaling.
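A compact Python sketch of the three SPT steps follows; the function name spt_select and the hop_dist callback are illustrative assumptions standing in for the unicast routing table.

def spt_select(source, candidates, k, hop_dist):
    """Steps 1-3 of the SPT computation. hop_dist(s, d) returns the
    minimum-hop (shortest-path) distance from s to d."""
    # Step 1: shortest-path distance from the source to each candidate
    dists = {d: hop_dist(source, d) for d in candidates}
    # Step 2: sort candidates in non-decreasing order of distance (D's)
    ordered = sorted(candidates, key=dists.get)
    # Step 3: select the first k destinations
    return ordered[:k]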

3 Impairment Aware Manycasting over OBS

Data loss in an OBS network can occur either due to burst contention or due to impairments in the fiber. Burst contention is a special issue in OBS networks, occurring due to the burstiness of IP traffic and the lack of optical buffering. Contention occurs when multiple bursts contend for the same outgoing port at the same time. Many schemes have been proposed to resolve burst contention [3]. However, all of them assume that the underlying physical fiber medium is error-free, which in practice is not the case. Bursts are transmitted all-optically in the fiber; they traverse many optical components, such as fiber, multiplexers, demultiplexers, splitters, and optical amplifiers. This causes the quality of the signal to degrade. The received signal contains amplified spontaneous emission (ASE) noise due to the optical amplifiers in the network [5]. The common metric to characterize signal quality is the optical signal-to-noise ratio (OSNR), defined as the ratio of the received signal power to the ASE noise power [6]. Multicast-capable switches split the optical power according to the number of output ports. The power is reduced as the signal propagates towards the destination, thus decreasing the OSNR. The bit error rate (BER) of the signal is related to the OSNR: a decrease in OSNR causes an increase in BER. Thus a burst scheduled on a wavelength can be lost due to the high BER of the signal. The BER of the signal can be computed through the q-factor [6]: if the signal has a low q, its BER is high, and vice versa. Thus a burst successfully scheduled on a wavelength can still be lost due to a low q. These impairment studies have been done extensively in the past. A recent challenge is to develop impairment-aware routing algorithms that run before scheduling the data transmission [7]. As a first step toward implementing impairment-aware manycasting, in this paper we consider only the OSNR constraint. We develop algorithms that implement manycasting considering both burst contention and optical impairments. In order to provide impairment-awareness during burst transmission, we modify the manycast request to (u, D'u, ku, P(u), Pase(u)), where the last two tuple elements indicate the signal power and noise power, respectively. Here u can be the source s or an intermediate node, with sorted destination set D'u and intended number of destinations ku.

3.1 Impairment Aware Shortest Path Tree (IA-SPT)

The IA-SPT algorithm uses a pre-computed shortest path tree. Based on the three steps mentioned in Section 2, the tree is constructed for each manycast request. The recursive power relations given in [1] can be used to compute the OSNR of the optical signal along its path. If the link from the source node to one of the child nodes is free, then q is computed. If the q-factor is above the threshold value qth, then the channel is scheduled for burst transmission. Hence, successful reception of the burst at the destination node guarantees that the signal is error-free. This continues until k destinations are reached. If the burst reaches fewer than k destinations, the manycast request is said to be blocked. As IA-SPT is implemented on a pre-computed routing tree, it does not consider the dynamic


nature of the network. This algorithm suffers from high burst loss due to fixed routing along the shortest path tree, which is verified by the simulation results. The other proposed algorithms decrease the burst loss in the presence of optical-layer impairments. Pseudo code for this algorithm is described with the help of an example in [8]. Let D be the set of all destinations that can be reached from node u. If |D| < ku, then the request is said to be blocked, and the probability of request blocking is given by 1 − |D|/ku.

3.2 Impairment Aware Static Over Provisioning (IA-SOP)

The IA-SOP algorithm is similar to IA-SPT except that we do not limit the number of destinations to k, but send the burst to k + k′ destinations, where 0 ≤ k′ ≤ m − k. With k′ = 0, IA-SOP reduces to IA-SPT, i.e., no over-provisioning. In this algorithm, the first k + k′ destinations are selected from the set D'c. Sending the burst to more than k destinations improves the chance that it reaches at least k of them. However, over-provisioning increases the fan-out of the splitter, thereby increasing the BER. In spite of the decrease in contention loss, there is no significant improvement in the overall loss. From the simulation results we see that IA-SOP shows slightly better performance than IA-SPT. The IA-SOP algorithm is the same as IA-SPT, but with ku replaced by ku + k′. The probability of request blocking is given by 1 − min(|D|, ku)/ku. This is because if all the ku + k′ destinations are free, the burst is sent to more destinations than intended (i.e., ku), but from the user perspective only ku need to be reached. If |D| > ku, then min(|D|, ku) = ku and the request blocking ratio is zero.

3.3 Impairment Aware Dynamic Membership (IA-DM)

IA-DM takes the dynamic network status into consideration. Instead of selecting the destinations before the burst is transmitted, we dynamically add members as possible destinations, depending on contention and the quality of the link. IA-DM works with a distributed version of SPT. The set of k destinations is tentatively set up at the source node. We do not discard the remaining m − k destinations, but instead keep them as child branches at the source node. IA-DM differs from deflection routing in that, in the latter, the burst is routed to the same destination but over another route. Our simulation results show a significant decrease in burst blocking due to contention and BER for IA-DM. The pseudo-code for this algorithm is explained in [1].

3.4 IP Manycasting

The selection of k destinations out of m by the IP layer is similar to the random algorithm in [9]; we also present a simple analytical model for manycasting with random selection of k destinations. Our results show that random selection of destinations has poor performance; hence supporting manycasting at the OBS layer is necessary. A manycast request is said to be blocked if the burst reaches fewer than k destinations.

4 Provisioning QoS for Manycasting over OBS

We study the behavior of manycasting over optical burst switched (OBS) networks under multiple quality of service (QoS) constraints. These constraints can take the form of physical-layer impairments, transmission delay, and reliability of the links. Each application has its own QoS threshold attributes. Destinations qualify only if they satisfy the QoS constraints set up by the application. We propose a decentralized way of routing the burst towards its destinations. With the help of the local network-state information available at each node, the burst is scheduled only if it satisfies the set of constraints. Correspondingly, reception of the burst at a node ensures that all the QoS constraints are met, and the burst is forwarded to the next hop. Due to the multiple constraints, burst blocking could be high. We propose algorithms to minimize request blocking for the Multi-Constrained Manycast Problem (MCMP). With the help of simulations we calculate the average request blocking for the proposed algorithms. Our simulation results show that MCM-shortest path tree (MCM-SPT) performs better than MCM-dynamic membership (MCM-DM) for delay-constrained and real-time services, whereas data services can be provisioned using MCM-DM. We define ηj, γj, and τj as the noise factor, reliability factor, and end-to-end propagation delay of link j, respectively. These service attributes can be used to maintain the local network information, and by properly comparing these vectors, destinations can be chosen. Comparison of multi-dimensional metrics can be done using the notion of lattices [10]. Lattices are explained using an ordering which has the properties of reflexivity, antisymmetry, and transitivity. We denote the information vector of link j as

    Ωj = (ηj, γj, τj)^T.    (1)

Definition 1. Let Ωj and Ωk be the two information vectors for links j and k, respectively. We say that Ωj precedes Ωk (the vectors are comparable) if and only if

    (ηj ≤ ηk) ∧ (γj ≥ γk) ∧ (τj ≤ τk).    (2)

Service attributes are either multiplicative (product) or additive (sum). The ordering condition in (2) is chosen such that the noise factor and propagation delay are minimized and the reliability is maximized. Each information vector is a 3-tuple and hence belongs to a 3-dimensional vector space over the real field R, denoted by R^3. Information vectors are combined by the operation

    ◦ : Ωj ∈ R^3, Ωk ∈ R^3 → Ωj ◦ Ωk ∈ R^3,    (3)

where the operation ◦ on two vectors Ωj and Ωk is defined by

    Ωj ◦ Ωk = (ηj · ηk, γj · γk, τj + τk)^T.    (4)
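The following Python sketch mirrors these definitions; the class name InfoVector and the method names are illustrative.

from dataclasses import dataclass

@dataclass
class InfoVector:
    eta: float    # noise factor (multiplicative along a path)
    gamma: float  # reliability factor (multiplicative along a path)
    tau: float    # end-to-end propagation delay (additive along a path)

    def precedes(self, other):
        # The ordering of Definition 1 / Eq. (2): noise and delay are
        # no worse and reliability is no worse
        return (self.eta <= other.eta and self.gamma >= other.gamma
                and self.tau <= other.tau)

    def combine(self, other):
        # The operation of Eqs. (3)-(4): noise and reliability factors
        # multiply along a path, delays add
        return InfoVector(self.eta * other.eta,
                          self.gamma * other.gamma,
                          self.tau + other.tau)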

4.1 Multi-Constrained Manycast (MCM) Problem

We explain the multi-constrained manycast algorithms with the help of an example. We propose two algorithms, MCM-Shortest Path Tree (MCM-SPT) and MCM-Dynamic Membership (MCM-DM), for evaluating the performance of manycasting with quality of service (QoS) constraints. These algorithms are distributed: each node individually maintains the network state information and executes the algorithm. Algorithms implemented in a centralized way may fail due to a single failure, resulting in poor performance. Our proposed algorithms have the following functionality:

1. Handling multiple constraints with the help of link state information available locally.
2. Service-differentiated provisioning of manycast sessions.
3. Finding the best possible destinations in terms of service requirements for the manycast sessions.

We use BHPs as the control packets and we propose new BHP fields which provide information about the QoS. In previous works [1,8] the BHP was modified to accommodate the q-factor (i.e., BER) and bursts were scheduled based on the BER threshold. Table 1 lists the fields associated with QoS-based scheduling of bursts [11].

Table 1. Control packet frame fields

BHP Field             Description
Manycast Id           Manycast request identification number
Burst Id              Burst identification number used for sequencing
Source (u)            Initial or starting node of the burst
Quorum members (Du)   The probable destinations the burst can reach
ku                    Number of members in the manycast session
⊤(θp)                 Threshold information vector for service θp
Ω(u−1,u)              Link information vector for the link between u−1 and u
Ingress Channel       Wavelength used for the data burst
Duration              Duration of the data burst in seconds
Offset                Time offset between the control packet and the data packet

The manycast request (id, u, Du, ku, ⊤(θp), Ω(u−1,u)) arrives at the source node u with a candidate destination set Du, along with the intended number of destinations ku.
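As an illustration only, a Python dataclass mirroring the fields of Table 1 follows; the field types are assumptions, and InfoVector refers to the sketch in Section 4 above.

from dataclasses import dataclass
from typing import List

@dataclass
class BHP:
    # Field names mirror Table 1; types are illustrative assumptions
    manycast_id: int           # manycast request identification number
    burst_id: int              # burst identification number (sequencing)
    source: str                # initial or starting node u of the burst
    quorum_members: List[str]  # probable destinations Du
    k: int                     # intended number of destinations ku
    threshold: InfoVector      # threshold vector T(theta_p) for the service
    link_info: InfoVector      # accumulated link vector Omega(u-1,u)
    ingress_channel: int       # wavelength used for the data burst
    duration: float            # duration of the data burst in seconds
    offset: float              # offset between control packet and data burst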

5 Simulation Results

In this section we present our simulation results. We consider the average request blocking as the performance metric. We define the average request blocking ratio as given by [3].


Fig. 1. NSF network with 14 nodes and 21 bi-directional links. The weights represent the distance in km and the corresponding reliability factor of each link.

Let f be the total number of manycast requests used in the simulation. Consider a manycast request (s, Ds, k), and let D be the set of destinations that actually receive the data. Then the average request blocking is given by

    Bavg = (1/f) · Σ over all manycast requests of [ 1.0 − min(|D|, k)/k ].    (5)
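A one-function Python sketch of Eq. (5) follows; the function and argument names are illustrative.

def average_request_blocking(outcomes):
    """outcomes: list of (reached, k) pairs, one per manycast request,
    where `reached` = |D| is the number of destinations that actually
    received the data."""
    # Eq. (5): average over all requests of 1 - min(|D|, k)/k
    return sum(1.0 - min(reached, k) / k
               for reached, k in outcomes) / len(outcomes)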

The NSF network shown in Fig. 1 is used for our simulation. All links are bi-directional and have the same transmission rate of 10 Gb/s. Burst arrivals follow a Poisson process with an arrival rate of λ bursts per second. The length of a burst is exponentially distributed with an expected service time of 1/µ seconds.

5.1 Assumptions

1. Only one wavelength is considered for the analysis; hence the dependency of the q-factor on the wavelength is ignored.
2. Wavelength converters are not used in the network.
3. The calculation of the noise factor is based on losses due to attenuation, mux/demux, and tap and split loss. Only amplified spontaneous emission (ASE) noise is considered for the OSNR; shot noise and beat noise are ignored.
4. Effects of the offset time are ignored.
5. In-line amplifiers are placed along the links, with a spacing of 70 km between amplifiers.
6. There are no optical buffers or wavelength converters in the network.
7. The reliability factor is the same along both directions of a fiber.

Using discrete-event simulations we compute Bavg using (5) and compare our results with those without impairment-awareness, as given in [3]. Fig. 2a shows the comparison of the impairment-aware average request blocking to that of the regular algorithms.


Fig. 2. (a) Comparison of algorithms with and without impairment awareness. (b) The blocking performance comparison between IA-SPT, IA-SOP and IA-DM for manycast configuration 7/4 under high load.

From these graphs we observe that there is a significant difference in Bavg under low-load conditions. This is because under low load, contention blocking is small, and hence the regular algorithms used in [3] do not provide a correct estimate of the blocking. From Fig. 2a we also observe that IA-DM has lower blocking than IA-SPT and IA-SOP; thus, impairment-aware manycasting over OBS can be improved by using IA-DM. From Fig. 2a we further observe that without impairments all three algorithms perform almost the same. However, in the presence of impairments there is a significant reduction in the burst loss when IA-DM is used. Our simulation results show that even under high loads IA-DM is better than the other two, as shown in Fig. 2b. We validate our simulation results with the analytical model explained in [8]. Fig. 3a shows that our model is accurate for IA-SPT.


Fig. 3. Comparison of Binomial, Analytical and Simulation results for overall blocking probability for (a) IA-SPT (b) IA-SOP with k′ = 3 under low load


This graph also indicates that random selection of k destinations from Dc (IP manycasting) has poor performance compared to IA-SPT. A significant reduction in blocking can be achieved by using IA-SPT. From Fig. 3b we observe that our analytical model over-estimates the blocking probability of IA-SOP at low loads. This is due to the size of the intended destination set: in our case k′ = 3, which is equivalent to multicasting. At high loads, however, these results converge. Finally, we validate our simulation results for IA-DM using Poisson splitting. From Fig. 4 we observe that the Poisson-split model slightly over-estimates the blocking probability compared with the simulation. This is because (5) does not distinguish between primary and secondary destinations as the Poisson split does. However, the difference is very small, so it provides a good estimate for impairment-aware manycasting. Also, by using Poisson splitting we keep the arrival process at the secondary destinations Poisson, which makes the analysis computationally efficient. In Fig. 4 we also compare our results without the split, which clearly validates our simulation results.


Fig. 4. Comparison of Analytical (with and without Poisson split) and Simulation results for overall blocking probability for IA-DM under low load

We now discuss the performance of manycasting over OBS for different QoS requirements. We differentiate among service requirements, i.e., different services impose different constraints. The differentiated services considered for the simulation are ⊤(θ1) = [5.7, 0.6, 20]^T, ⊤(θ2) = [5.7, 0.6, 10]^T, ⊤(θ3) = [4.25, 0.9, 10]^T and ⊤(θ4) = [4.25, 0.8, 10]^T. We consider ⊤(θ2) to be a real-time service, since it has the more stringent delay requirement. Service ⊤(θ1) can represent a data service, as it has more relaxed delay requirements. The other two services have high threshold requirements. Figure 5a shows the performance of MCM-SPT for the different sets of services. The more stringent the requirements of a service, the higher its blocking. As MCM-SPT uses shortest-path routing, one might expect a lower QoS blocking.

Fig. 5. Blocking probability performance of (a) SPT and (b) DM for different service thresholds

However, due to the random contention along the links, if any one of the destinations is not reachable, the entire manycast request would be blocked. On the contrary, MCM-DM adds or removes destinations based on the contention in the network. However, a destination added to the quorum pool can be at a longer distance than the destination which is not reachable; as a result, the QoS of this destination can be degraded. In spite of the degraded values, if the path information vector is within the threshold condition of the service, the request can still be satisfied. Fig. 5b shows the average request blocking for MCM-DM under different service thresholds. At high loads, most of the blocking is contention blocking, and hence the effect of QoS is less visible. As our aim is to show the effects of QoS, all the results are simulated under medium network load conditions.

6 Summary

In this paper we have evaluated the performance of manycasting over optical burst-switched networks with QoS provisioning. Algorithms were proposed with a view to decreasing the average request loss for manycasting, and their performance was studied under differentiated services. This work demonstrates the necessity of providing QoS for manycasting over OBS networks. It can be further extended by considering sparse wavelength regeneration, which can decrease the noise factor on routes that traverse longer paths.

References
1. Bathula, B.G., Vokkarane, V.M., Bikram, R.R.C.: Impairment-aware manycasting over optical burst switched networks. In: Proceedings of IEEE International Conference on Communications, pp. 5234–5238. IEEE, Los Alamitos (2008)
2. Hu, W.S., Zeng, Q.J.: Multicasting optical cross connects employing splitter-and-delivery switch. IEEE Photonics Technology Letters 10, 970–972 (1998)
3. Huang, X., She, Q., Vokkarane, V.M., Jue, J.P.: Manycasting over optical burst-switched networks. In: Proceedings of IEEE International Conference on Communications (ICC 2007), Glasgow, Scotland, pp. 2353–2358 (2007)
4. Chen, Y., Qiao, C., Yu, X.: Optical burst switching: A new area in optical networking research. IEEE Network 18, 16–23 (2004)
5. Ramamurthy, B., Datta, D., Feng, H., Heritage, J.P., Mukherjee, B.: Impact of transmission impairments on the teletraffic performance of wavelength-routed optical networks. IEEE/LEOS Journal of Lightwave Technology 17, 1713–1723 (1999)
6. Ramaswami, R., Sivarajan, K.N.: Optical Networks. Morgan Kaufmann Publishers, San Francisco (2004)
7. Martinez, R., Cugini, F., Andriolli, N., Wosinska, L., Comellas, J.: Challenges and requirements for introducing impairment-awareness into management and control planes of ASON/GMPLS WDM networks. IEEE Communications Magazine 44, 75–76 (2007)
8. Bathula, B.G., Bikram, R.R.C., Vokkarane, V.M., Talabattula, S.: Impairment-aware manycasting algorithms over optical burst switched networks. In: Proceedings of IEEE International Conference on Computer Communications and Networks, pp. 1–6. IEEE, Los Alamitos (2008)
9. Cheung, S.Y., Kumar, A.: Efficient quorumcast routing algorithms. In: Proceedings of IEEE INFOCOM 1994, Toronto, Ontario, Canada, pp. 840–847 (1994)
10. Przygienda, A.B.: Link state routing with QoS in ATM LANs. PhD thesis, Swiss Federal Institute of Technology (1995)
11. Bathula, B.G.: QoS Aware Quorumcasting Over Optical Burst Switched Networks. PhD thesis, Indian Institute of Science (2008)

OBS/GMPLS Interworking Network with Scalable Resource Discovery for Global Grid Computing (Invited Paper)

J. Wu, L. Liu, X.B. Hong, and J.T. Lin

P.O. Box 55#, Key Laboratory of Optical Communication and Lightwave Technologies, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing, China
{jianwu,xbhong}@bupt.edu.cn, [email protected]

Abstract. In recent years, Grid computing has become common in industry and the research community, and it will open to the consumer market in the future. The final objective is the achievement of global Grid computing, in which computing and networks are flexibly integrated across the world and a scalable resource discovery scheme is implemented. In this paper, a promising architecture, i.e., an optical burst switching (OBS)/generalized multi-protocol label switching (GMPLS) interworking network with a Peer-to-Peer (P2P)-based scheme for resource discovery, is investigated to realize a highly scalable and flexible platform for Grids. Experimental results show that this architecture is suitable and efficient for future global Grid computing.

Keywords: optical Grid, OBS, GMPLS, resource discovery, P2P.

1 Introduction

Optical networks have been identified as the network infrastructure that would enable the widespread deployment of Grid computing, i.e., global Grid computing. However, large-bandwidth connections alone are not sufficient for the requirements of global Grid computing; to achieve this goal, a scalable and efficient architecture must also be considered. In this paper, a promising architecture, i.e., an OBS/GMPLS interworking network with a P2P-based scheme for resource discovery, is investigated through experiments. The results show that this architecture is suitable and efficient for future global Grid computing.

The rest of this paper is organized as follows. Section 2 proposes the OBS/GMPLS interworking network. Section 3 investigates the P2P-based resource discovery scheme. Section 4 concludes this paper.

2 Network Infrastructure

The choice of the network infrastructure for global Grid computing is mainly driven by the fulfillment of a number of requirements, such as high bandwidth utilization (the user's requirement), a controllable and manageable infrastructure (the network manager's requirement), and efficient job transmission in a large-sized network. Clearly, OBS can fulfill the user requirement of efficient bandwidth utilization thanks to its statistical multiplexing [1]. But due to its one-tier signaling protocols (e.g. JET, JIT), OBS is not suitable for large-sized networks. On the contrary, GMPLS utilizes a unified control plane to manage heterogeneous network elements and is expected to be gradually deployed to provide flexible control and management for carriers' optical transport networks in the near future. So the best solution for future global Grid computing is an OBS over GMPLS network, which limits the number of nodes involved in the OBS signaling transactions and relies on the GMPLS control plane to handle recovery and network resource optimization issues.

Fig. 1. OBS over GMPLS multi-layer architecture

This architecture was first proposed by the Open Grid Forum (OGF) and is investigated in detail in [2]. In this paper, such a network architecture is demonstrated through experiments. We investigated an overlay model, as shown in Fig. 1, for the integrated OBS over GMPLS network. In this overlay network, the OBS networks are located at the edge and are capable of accessing various bursty traffic, while the GMPLS core network is responsible for providing layer-1 interconnections for the attached OBS client networks. An OBS Border Controller (OBC) and a GMPLS Border Controller (GBC) are introduced at the border between an OBS client network and the GMPLS core network. Each pair of OBC and GBC has a dedicated control interface that enables the GBC to dynamically establish or release a lightpath in the GMPLS network for Data Burst (DB) transmission upon request from the OBC. Moreover, a simple client/server model with request/reply transactions can be applied between OBC and GBC, so no inter-layer routing or signaling between the GMPLS and OBS networks is required, which eases network migration.

This overlay model keeps the integrated OBS over GMPLS network simple. However, since the Burst Header Packet (BHP) in the OBS network needs to precede its corresponding DB (i.e. the Grid job) by a particular offset time in order to configure the optical switches in advance, the BHP must traverse the same route and undergo the same transmission latency as the DB. This requirement cannot be guaranteed if the BHP is transported over a conventional GMPLS control plane. Therefore, a lightpath provisioned in an extended way by a so-called group-LSP, which consists of multiple LSPs with one dedicated LSP for BHPs, is introduced here for BHP/DB transmissions in the GMPLS data plane, as shown in Fig. 2.

Fig. 2. Group LSP in GMPLS network for BHP/DB transmissions

Fig. 3 illustrates the detailed routing and signaling procedures for conducting inter-domain Grid job transmission in this overlay-based network. The OSPF protocol can still be used for intra-domain routing in each independent OBS network. In addition, considering that only one control channel is required during the period of inter-domain routing, the group LSP established beforehand may contain only a single BHP LSP, while the DB LSPs are created dynamically upon traffic request. Once the inter-domain reachability information has been distributed to the intra-domain nodes, the OBS edge node is able to access the traffic destined to external domains. Meanwhile, an e2e LSP via the BHP LSP can also be established across the GMPLS network in order to forward the inter-domain BHP/DB pairs. Finally, when a BHP destined to a peered OBS domain arrives at the exit OBS border node, the group LSP containing the BHP LSP and DB LSPs is selected to forward the BHP and its corresponding DB. However, if the DB LSP on the specified wavelength does not exist, the OBC sends a request to the attached GBC to create a new one.
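Since the OBC-GBC interface is a plain client/server request/reply exchange, its shape can be sketched in a few lines. The message fields and the handler below are illustrative assumptions intended to show the transaction pattern, not an interface specified in [2]:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LspRequest:
    action: str        # "create" or "release" (assumed message fields)
    dst_domain: str    # peered OBS domain the DB is destined to
    wavelength: int    # wavelength requested for the DB LSP

class GBC:
    """GMPLS Border Controller: serves LSP requests from its paired OBC."""
    def __init__(self) -> None:
        self.active = {}      # (dst_domain, wavelength) -> lsp_id
        self._next_id = 0

    def handle(self, req: LspRequest) -> Optional[int]:
        key = (req.dst_domain, req.wavelength)
        if req.action == "create":
            if key not in self.active:   # set up a DB LSP on demand
                self._next_id += 1
                self.active[key] = self._next_id
            return self.active[key]
        self.active.pop(key, None)       # release the lightpath
        return None

class OBC:
    """OBS Border Controller: asks its GBC for a DB LSP when none exists."""
    def __init__(self, gbc: GBC) -> None:
        self.gbc = gbc

    def forward_db(self, dst_domain: str, wavelength: int) -> int:
        # Request/reply only -- no inter-layer routing or signaling needed.
        return self.gbc.handle(LspRequest("create", dst_domain, wavelength))

obc = OBC(GBC())
print(obc.forward_db("OBS-domain-2", wavelength=3))   # -> 1 (LSP created)
print(obc.forward_db("OBS-domain-2", wavelength=3))   # -> 1 (LSP reused)
```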

Fig. 3. Routing and Signaling procedures in the OBS over GMPLS network


Fig. 4. Experimental results for an e2e LSP setup for Grid job transmission

The results in Fig. 4 verify the end-to-end Grid job transmissions. In the experiment, the BHP length is about 200 ns, and the assembled burst length and burst interval are configured as 16 ms and 25 ms, respectively. The offset time between a data burst and its BHP is set to 106 ms, taking into account the LSP setup latency of about 101 ms. It can be seen from Fig. 4 that after the first data burst (DB1) destined to OBS Edge Node 2 (as shown in Fig. 3) is generated, a new e2e LSP setup is triggered. Before the end of this procedure, a total of 5 data bursts and BHPs are generated and buffered at OBS Edge Node 1 (as shown in Fig. 3). As soon as the new LSP is available, the 5 BHPs buffered inside Edge Node 1 are sent out within a very short time, which makes them appear as a single BHP on the large time scale shown in the middle part of Fig. 4. On the small time scale of 500 ns/Div they can be clearly distinguished, as shown in the top part of Fig. 4. Because the first BHP (BHP1) is buffered at Edge Node 1 for about 101 ms, the offset time remaining between DB1 and BHP1 is only Tr = 106 ms - 101 ms = 5 ms, even though the initial offset time is 106 ms. For the successive 4 data bursts, the remaining offset times are 30 ms, 55 ms, 80 ms and 105 ms, respectively. When the sixth data burst and its BHP are generated, the LSP is already established; the BHP is sent out immediately and Tr remains 106 ms as configured, as shown in Fig. 4. In this situation, a smaller offset time could be configured instead in order to reduce the end-to-end latency for delay-sensitive traffic.
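The remaining offset times quoted above follow from a simple buffering model: a BHP generated during LSP setup waits until the LSP is up, while its DB always leaves one full offset after generation. A minimal sketch of this arithmetic, using the parameters of the experiment:

```python
OFFSET = 106.0     # ms, initial offset between a BHP and its DB
SETUP = 101.0      # ms, e2e LSP setup latency
INTERVAL = 25.0    # ms, burst generation interval

def remaining_offset(i: int) -> float:
    """Remaining offset Tr for the i-th burst (i = 0 for DB1)."""
    t_gen = i * INTERVAL          # generation time of burst i
    bhp_tx = max(t_gen, SETUP)    # BHP waits until the LSP is established
    db_tx = t_gen + OFFSET        # DB always leaves one full offset later
    return db_tx - bhp_tx

print([remaining_offset(i) for i in range(6)])
# -> [5.0, 30.0, 55.0, 80.0, 105.0, 106.0]
```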

3 Scalable Resource Discovery Scheme

Currently, the available resource discovery and management schemes are conventional Client/Server-based (centralized) schemes [3-7]. In these solutions, several centralized servers, which store real-time information about the available Grid resources, reside in the network to handle all resource discovery requests and make decisions on them. The decision is sent back to tell the user where to send the job. However, this C/S mechanism confines the scalability of the whole Grid network: such solutions suffer from poor scalability, poor fault tolerance and low efficiency as the number of Grid users and the job frequency increase.

Fig. 5. Optical Grid architecture with P2P-based resource discovery capability

To overcome these disadvantages, a P2P-based resource discovery scheme (a decentralized scheme) was proposed in [8], which distributes the resource information over many nodes using a distributed hash table [9,10]. Fig. 5 shows the Grid over OBS architecture, in which a virtual protocol layer is introduced to implement the peer-to-peer resource discovery scheme. It is composed of virtual nodes mapped from the actual Grid users/resources. A consistent hash function assigns each node in the protocol layer an m-bit identifier, using SHA-1 [9] as the base hash function. The signaling process of the P2P-based resource discovery is introduced in detail in [8].

Using the same experimental setup as in [3] and [8] on our OBS testbed [11], the performance of C/S-based and P2P-based resource discovery can be compared. As shown in Fig. 6, as the job request frequency increases, the performance of the P2P model does not degrade, in contrast to the results of the C/S-based scheme. The reason is that in the C/S-based solution the server stores a very large number of items of resource information, while in the P2P solution these items are distributed over many nodes by the distributed hash table. A job request needs to check these items to find the most appropriate resource, so in the C/S solution, when the request frequency is high, the server cannot handle resource queries and updates in time. This results in a sharply increased average response time (Fig. 6(a)), and out-of-date resource information is often returned to the requests, which causes a lower resource discovery success rate (Fig. 6(b)). Clearly, by employing this P2P-based resource discovery mechanism, it is not necessary to keep a powerful centralized server and distribute real-time information to it, which makes for a more feasible and scalable Grid network. The experimental results show that the P2P-based scheme outperforms C/S schemes in terms of discovery success rate and discovery time in a large-scale Grid environment with a high job request frequency. It is thus suitable for future global Grid computing.
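To illustrate how such a virtual protocol layer places resource records, the sketch below applies Chord-style consistent hashing with SHA-1 identifiers [9,10]: every node and every resource keyword is hashed onto an m-bit ring, and a record is stored at the first node whose identifier succeeds the key. The identifier length and the keyword format are assumptions for illustration, not details taken from [8]:

```python
import hashlib
from bisect import bisect_right

M = 16  # identifier length in bits (assumed; Chord [10] uses 160-bit SHA-1 IDs)

def chord_id(name: str) -> int:
    """Consistent m-bit identifier derived from the SHA-1 base hash."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** M)

class VirtualRing:
    def __init__(self, nodes):
        # Sorted (identifier, node) pairs form the virtual protocol layer.
        self.ring = sorted((chord_id(n), n) for n in nodes)
        self.ids = [i for i, _ in self.ring]

    def successor(self, key: int) -> str:
        """First virtual node whose identifier follows the key (wraps around)."""
        return self.ring[bisect_right(self.ids, key) % len(self.ring)][1]

    def locate(self, resource_desc: str) -> str:
        """Node responsible for storing this resource record."""
        return self.successor(chord_id(resource_desc))

ring = VirtualRing(["user-1", "user-2", "resource-A", "resource-B"])
print(ring.locate("cpu=4,mem=8GB"))   # the peer holding this record
```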


Fig. 6. Comparison of C/S-based and P2P-based resource discovery schemes: (a) average response time vs. job request frequency; (b) discovery success rate vs. job request frequency

Although the P2P-based resource discovery scheme improves the scalability of Grid networks, its shortcoming is that only non-network resources are taken into consideration during resource discovery. To solve this problem, a novel Self-organized Resource Discovery and Management (SRDM) scheme built on the P2P scheme is investigated in [12]. In SRDM, each user runs the P2P protocol proposed in [8], and two new tables are added: a Latency Table (LT) and a Blocking Table (BT), which store the end-to-end latency and blocking probability from the current user to the different resources. This information is obtained by a self-learning mechanism. Once a resource is discovered, its IP address is saved in the LT and BT. For each resource, the user periodically sends "Hello" signaling to it to measure the end-to-end latency and saves the result in the LT. Meanwhile, the user records the job history (success or failure), calculates the job blocking probability for each resource and saves it in the BT. The BT is cleared every T minutes to eliminate out-of-date information. Note that although exact network information could be obtained from the network manager, in many operational cases this would increase the burden on the manager and confine scalability. Moreover, carriers may not want to expose too much detailed internal network information to customers due to security concerns.

The process of SRDM is as follows: first, the user specifies the job requirements and the job characteristic (i.e. loss-sensitive or delay-sensitive) through a web portal in which dynamic Web Service technology is implemented. After that, the P2P-based resource discovery scheme is utilized to obtain a list of candidate resources satisfying the specified requirement. Then the end-to-end blocking probability and latency from the user to these resources are compared in order to choose the best one (least blocking or least latency) according to the job characteristic. A random or first-fit mechanism is used if there is no relevant record in the LT and BT. After a resource is chosen, the non-network resource can be reserved using the same method as in [8]. Together with OBS-JET and GMPLS-RSVP signaling, SRDM enables flexible end-to-end reservation of both network and non-network resources in a fully decentralized manner.
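The table-driven choice at the heart of SRDM can be sketched as follows; the record layout is an assumption for illustration, while the selection rule follows the description above (least blocking for loss-sensitive jobs, least latency for delay-sensitive ones, random/first-fit when no history exists):

```python
import random

# Self-learned tables: resource IP -> measured value (assumed layout).
latency_table = {"10.0.0.1": 12.5, "10.0.0.2": 4.0}     # ms, from "Hello"
blocking_table = {"10.0.0.1": 0.02, "10.0.0.2": 0.15}   # job-history ratio

def choose_resource(candidates, job_characteristic):
    """Pick the best candidate resource according to the SRDM rule."""
    table = (blocking_table if job_characteristic == "loss-sensitive"
             else latency_table)
    known = [r for r in candidates if r in table]
    if not known:                        # no record yet: random/first-fit
        return random.choice(candidates)
    return min(known, key=table.get)     # least blocking or least latency

# Candidates returned by the P2P discovery step:
print(choose_resource(["10.0.0.1", "10.0.0.2"], "delay-sensitive"))  # 10.0.0.2
print(choose_resource(["10.0.0.1", "10.0.0.2"], "loss-sensitive"))   # 10.0.0.1
```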


Fig. 7. Experimental setup

Fig. 8. Comparison between P2P [8] and SRDM

The experimental setup is shown in Fig. 7. Various latencies and blocking probabilities from the users to resources 1 and 2 were introduced by injecting background traffic. The SRDM signaling (e.g. Hello) was encapsulated into bursts for transmission to avoid O/E/O conversion and message processing delay. About 1000 jobs were randomly generated with random resource requirements, job characteristics and start times. The results in Fig. 8 show that SRDM outperforms the P2P scheme in terms of job blocking and end-to-end latency, since resource discovery in SRDM is capable of considering both network and non-network resources. It can be seen that in SRDM each user has its own intelligence to manage resource discovery requests and make proper decisions based on its own information about the whole Grid network. Clearly, by employing this distributed mechanism it is not necessary to deploy powerful centralized servers for storing Grid resource information, which enables the construction of a more scalable and fault-tolerant network for future global Grid computing.

4 Conclusions

In this paper, a promising architecture, an OBS over GMPLS network infrastructure with a P2P-based resource discovery scheme, is proposed for future global Grid computing. Experimental results show that this architecture is efficient and scalable. Given the wide use of GMPLS and P2P technology in the world today, we believe the work in this paper can facilitate the achievement of global Grid computing.

Acknowledgments. This work was supported by the 863 Program (2007AA01Z248), MOST Program (No. 2006DFA11040), PCSIRT (No. IRT0609) and the 111 Project (B07005).

References

1. Qiao, C., Yoo, M.: Optical Burst Switching (OBS): a New Paradigm for an Optical Internet. J. High Speed Netw. 8(1), 69–84 (1999)
2. Nejabati, R. (ed.): Grid Optical Burst Switched Networks (GOBS). Technical report, Open Grid Forum (OGF), GFD.128 (2007)


3. Zervas, G., Nejabati, R., Wang, Z., Simeonidou, D., Yu, S., O'Mahony, M.: A Fully Functional Application-aware Optical Burst Switched Network Test-bed. In: Optical Fiber Communication Conference (OFC), Anaheim, California, USA (2007)
4. Vokkarane, V.M., Zhang, Q.: Reliable Optical Burst Switching for Next-generation Grid Networks. In: IEEE/CreateNet GridNets, Boston, USA, pp. 505–514 (2005)
5. De Leenheer, M., et al.: An OBS-based Grid Architecture. In: Global Telecommunications Conference (GLOBECOM) Workshop, Dallas, USA, pp. 390–394 (2004)
6. Farahmand, F., De Leenheer, M., Thysebaert, P., Volckaert, B., De Turck, F., Dhoedt, B., Demeester, P., Jue, J.P.: A Multi-layered Approach to Optical Burst-switched Based Grids. In: International Conference on Broadband Networks (BroadNets), Boston, USA, vol. 2, pp. 1050–1057 (2005)
7. Zervas, G., Nejabati, R., Simeonidou, D., Campi, A., Cerroni, W., Callegati, F.: SIP Based OBS Networks for Grid Computing. In: Tomkos, I., Neri, F., Solé Pareta, J., Masip Bruin, X., Sánchez Lopez, S. (eds.) ONDM 2007. LNCS, vol. 4534, pp. 117–126. Springer, Heidelberg (2007)
8. Liu, L., Hong, X.B., Wu, J., Lin, J.T.: Experimental Demonstration of P2P-based Optical Grid on LOBS Testbed. In: Optical Fiber Communication Conference (OFC), San Diego, USA (2008)
9. FIPS 180-1, Secure Hash Standard. U.S. Department of Commerce/NIST, National Technical Information Service, Springfield, VA (1995)
10. Stoica, I., Morris, R., Karger, D., Kaashoek, F., Balakrishnan, H.: Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications. In: ACM SIGCOMM, San Diego, USA, pp. 149–160 (2001)
11. Guo, H., Lan, Z., Wu, J., Gao, Z., Li, X., Lin, J., Ji, Y., Chen, J., Li, X.: A Testbed for Optical Burst Switching Network. In: Optical Fiber Communication Conference (OFC), Anaheim, California, USA (2005)
12. Liu, L., Guo, H., et al.: Demonstration of a Self-organized Consumer Grid Architecture. In: European Conference on Optical Communications (ECOC), Brussels, Belgium (2008, accepted)

Providing QoS for Anycasting over Optical Burst Switched Grid Networks

Balagangadhar G. Bathula and Jaafar M.H. Elmirghani

School of Electronic and Electrical Engineering, University of Leeds, Leeds LS2 9JT
{b.bathula,j.m.h.elmirghani}@leeds.ac.uk

Abstract. This paper presents a mathematical framework to provide Quality of Service (QoS) for Grid applications over optical networks. The QoS parameters include resource availability, reliability, propagation delay, and quality of transmission (QoT). These multiple services are needed to ensure the successful completion of a Grid job. With the help of the link-state information available at each Network Element (NE), bursts are scheduled onto their next link. This decentralized way of routing helps to provide optimal QoS and hence decreases the loss of Grid jobs due to multiple constraints.

Keywords: WDM, QoS, GoOBS, Anycasting.

1 Introduction

The enormous bandwidth capability of optical networks helps the network user community realize many distributed applications such as the Grid. These emerging interactive applications require a user-controlled network infrastructure [1], which has led many researchers to investigate control plane architectures for optical networks. A comprehensive review of the optical control plane for the Grid community can be found in [2]. QoS policies implemented in IP networks do not work in optical networks, as the store-and-forward model does not exist [3]. We thus see the need for an intelligent control plane in the optical network that can provide the required QoS for Grid applications.

With the advent of many new switching techniques, researchers have been able to tap the huge bandwidth capacity of the fiber. Fast and dynamic connection establishment using Optical Burst Switched (OBS) networks has been achieved at much lower switching costs. The Open Grid Forum (OGF) is a community that aims to develop standards, protocols and solutions to support OBS-based Grid networks [1]. A general layered Grid architecture and the role of the OBS network are discussed in [4]. Delivering a Grid application effectively involves many aspects, such as the design of efficient control plane architectures, routing algorithms, and QoS and resilience guarantees.

Anycast can be defined as a variation on unicast in which the destination is not known a priori [5,6]. Anycasting is similar to deflection routing, except for the


fact that a different destination can be selected instead of routing the burst to the same destination along another path. Routing can be accomplished by a label-based control framework [7] using an optical core network such as OBS. Anycasting gives a Grid job the flexibility to effectively identify the destination that meets its QoS parameters. By incorporating an intelligent control plane and using efficient signaling techniques, anycasting provides a viable communication paradigm for Grid applications.

The rest of the paper is organized as follows: in Section 2 we describe the notation used in the paper. The QoS parameters used for burst scheduling are discussed in Section 2.1. The mathematical framework for ordering the destinations, based on lattice theory, is discussed in Section 3. We explain this mathematical framework with the help of a simple network example in Section 4.1, and we conclude the paper in Section 5.

2 Notations

An anycast request can be denoted by $(s, D, 1)$, where $s$ denotes the source, $D$ the destination set, and the last element indicates that a single destination has to be chosen from the set $D$. This notation is a generalization of manycast [8]. Let $m = |D|$ denote the cardinality of the set. Each Grid job has a service class, and we hence define the service class set as $S = \{S_1, S_2, \ldots, S_p\}$. Each service class has an associated threshold requirement that the QoS parameters must not exceed. We define this threshold parameter as $T^{(S_i)}$, where $S_i \in S$.

2.1 Service Parameters

We define $w_j$, $\eta_j$, $\gamma_j$, and $\tau_j$ as the residual wavelengths, noise factor, reliability factor, and propagation delay for link $j$, respectively.

In wavelength-routed optical burst switched networks (WROBS), connection requests arrive at a very high speed, while the average duration of each connection is only of the order of hundreds of milliseconds [9]. To support such bursty traffic, it is always advisable to choose the path with the largest number of free wavelengths (the least congested path). $w_j$ indicates the number of free (or residual) wavelengths available on link $j$. We consider an All-Optical Network (AON) architecture with no wavelength conversion, thereby resulting in the wavelength-continuity constraint (WCC). Let $W_i$ and $W_j$ be the two sets of free wavelengths available on links $i$ and $j$, respectively. Without loss of generality we assume that $W_i \cap W_j \neq \emptyset$. We propose to select the path towards the destination with the largest number of free wavelengths, using the operation $|\cdot \cap \cdot|$, which gives the number of common wavelengths on the links. If we assume that each uni-directional link can support 5 wavelengths, then $|W_i \cap W_j|$ is an integer $\leq 5$. The number of free wavelengths on a route is given by
$$w_R = \Big|\bigcap_{\forall i \in R} W_i\Big|, \qquad (1)$$


where $R$ denotes the route and $w_R$ represents the number of free wavelengths available. If $w_R = 0$, the destination is said to be unreachable due to contention.

The noise factor is defined as the ratio of the input optical signal-to-noise ratio ($OSNR_{i/p} \equiv OSNR_i$) to the output optical signal-to-noise ratio ($OSNR_{o/p} \equiv OSNR_{i+1}$); thus we have
$$\eta_j = \frac{OSNR_{i/p}}{OSNR_{o/p}}, \qquad (2)$$
where the $OSNR$ is defined as the ratio of the average signal power received at a node to the average ASE noise power at that node. The OSNR of the link and the q-factor are related as
$$q = \frac{2\sqrt{\frac{B_o}{B_e}}\; OSNR}{1 + \sqrt{1 + 4\, OSNR}}, \qquad (3)$$
where $B_o$ and $B_e$ are the optical and electrical bandwidths, respectively [11]. The bit-error rate is related to the q-factor as follows:
$$BER = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{q}{\sqrt{2}}\right). \qquad (4)$$
In our proposed routing algorithm, we choose a route that has the minimum noise factor. The overall noise factor of a route is given by
$$\eta_R = \prod_{\forall i \in R} \eta_i. \qquad (5)$$

The other two parameters considered in our approach are the reliability factor and the propagation delay of the burst along the link. The reliability factor of link $j$ is denoted by $\gamma_j$; it indicates the reliability of the link expressed as a fraction, and its value lies in the interval $[0, 1]$. The overall reliability of a route is a multiplicative constraint and is given by [8,10]
$$\gamma_R = \prod_{\forall i \in R} \gamma_i. \qquad (6)$$
The propagation delay on link $j$ is denoted by $\tau_j$, and the overall propagation delay of route $R$ is given by
$$\tau_R = \sum_{\forall i \in R} \tau_i. \qquad (7)$$
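As a numerical illustration of this QoT chain, the sketch below evaluates (3) and (4) and composes per-link noise factors along a route as in (5). It is a minimal sketch: the launch OSNR, the bandwidth ratio and the link values are illustrative assumptions.

```python
import math

def q_factor(osnr: float, bo_over_be: float) -> float:
    """q-factor from OSNR, eq. (3)."""
    return (2.0 * math.sqrt(bo_over_be) * osnr
            / (1.0 + math.sqrt(1.0 + 4.0 * osnr)))

def ber(q: float) -> float:
    """Bit-error rate from the q-factor, eq. (4)."""
    return 0.5 * math.erfc(q / math.sqrt(2.0))

# Per-link noise factors along a route compose multiplicatively, eq. (5).
link_noise_factors = [1.5, 3.0]            # illustrative values
eta_route = math.prod(link_noise_factors)

osnr_in = 1000.0                           # assumed launch OSNR (linear scale)
osnr_out = osnr_in / eta_route             # OSNR degraded along the route, eq. (2)
q = q_factor(osnr_out, bo_over_be=2.0)     # Bo/Be = 2 is an assumption
print(f"q = {q:.2f}, BER = {ber(q):.2e}")
```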

3 Mathematical Framework

In this section we provide the mathematical formulation for selecting the destination based on the above-mentioned service parameters. We define the Network Element Vector (NEV), which maintains information about the QoS parameters at each Network Element (NE). This information is contained in the Optical Control Plane (OCP). In the distributed routing approach, current GMPLS routing protocols can be modified to carry the service information [12,13]. A global Traffic Engineering Database (TED) at each OCP maintains an up-to-date picture of the NEVs.

Definition 1. We denote the network element vector for a link $i$ as
$$NEV_i = \begin{pmatrix} w_i \\ \eta_i \\ \gamma_i \\ \tau_i \end{pmatrix}. \qquad (8)$$

Definition 2. Let $NEV_i$ and $NEV_j$ be the network element information vectors of links $i$ and $j$, respectively; then we define a comparison $\succeq$ given by
$$\begin{pmatrix} w_i \\ \eta_i \\ \gamma_i \\ \tau_i \end{pmatrix} \succeq \begin{pmatrix} w_j \\ \eta_j \\ \gamma_j \\ \tau_j \end{pmatrix}, \qquad (9)$$
which implies that
$$(w_i \geq w_j) \wedge (\eta_i \leq \eta_j) \wedge (\gamma_i \geq \gamma_j) \wedge (\tau_i \leq \tau_j). \qquad (10)$$
Equation (10) is chosen such that the path towards the destination has more residual wavelengths, a lower noise factor, higher reliability and lower propagation delay.

Definition 3. The overall service information of a destination $d_n \in D$, $1 \leq n \leq m$, along the shortest-path route $R(d_n)$ is given by
$$NEV_{R(d_n)} = NEV_{R(d_n)}[s, h_1] \circ NEV_{R(d_n)}[h_1, h_2] \circ \ldots \circ NEV_{R(d_n)}[h_k, d_n], \qquad (11)$$
$$NEV_{R(d_n)} = \left[\,\Big|\bigcap_{\forall i \in R(d_n)} W_i\Big|,\ \prod_{\forall i \in R(d_n)} \eta_i,\ \prod_{\forall i \in R(d_n)} \gamma_i,\ \sum_{\forall i \in R(d_n)} \tau_i \,\right]^T, \qquad (12)$$
where in (11) $h_1, \ldots, h_k$ are the intermediate hop nodes along the shortest path. The operation $\circ$ performs $|\cdot \cap \cdot|$ on the wavelength sets, multiplication on the noise factor, multiplication on the reliability, and addition on the propagation delay. Equation (12) represents the overall QoS information vector for the destination $d_n$.

Definition 4. A destination $d_n$ is said to be feasible for a given service requirement $T^{(S_i)}$ if
$$NEV_{R(d_n)} \succeq T^{(S_i)}. \qquad (13)$$
The comparison of two multidimensional vectors using $\succeq$ follows from the notion of lattices [14]. Using this ordering technique, bursts can be scheduled to a destination that satisfies the service requirement and is the best among the given set of destinations. In the next section we explain the proposed algorithm with the help of a network example.
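The composition $\circ$ and the order $\succeq$ translate directly into code. The sketch below represents an NEV as a tuple (free-wavelength set, noise factor, reliability, delay); the link values reuse the two-link route toward destination 2 from the network example of Section 4.1, and the threshold encoding is an assumption made for illustration.

```python
from functools import reduce

def compose(a, b):
    """Path-algebra operation 'o' of (11)-(12): intersect wavelength sets,
    multiply noise factors and reliabilities, add delays."""
    return (a[0] & b[0], a[1] * b[1], a[2] * b[2], a[3] + b[3])

def dominates(a, b):
    """Partial order of (9)-(10): a is at least as good as b component-wise."""
    return (len(a[0]) >= len(b[0]) and a[1] <= b[1]
            and a[2] >= b[2] and a[3] <= b[3])

# Links 6->1 and 1->2 of the network example in Section 4.1:
link_6_1 = ({"l1", "l2", "l5"}, 2.5, 0.92, 0.12)
link_1_2 = ({"l2", "l5"},       3.0, 0.97, 0.16)
nev_r2 = reduce(compose, [link_6_1, link_1_2])
print(len(nev_r2[0]), nev_r2[1:])   # 2, ~(7.5, 0.8924, 0.28) -- matches (17)

# Feasibility test (13): the threshold's wavelength entry is a dummy set
# whose size encodes the minimum number of free wavelengths required.
threshold = ({"any"}, 10.0, 0.85, 0.5)
print(dominates(nev_r2, threshold))  # True -> destination 2 is feasible
```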

4 QoS Aware Anycasting Algorithm (Q3A)

Below is the pseudo-code for the proposed algorithm. As we consider service-differentiated scheduling, the threshold parameters of a particular service are known a priori. In the initialization step, the cardinality of the free-wavelength set is taken as the number of wavelengths the fiber can support; the other service parameters are initialized to 1 for the multiplicative metrics and 0 for the additive metric, as indicated in Line 1 of the algorithm.

For each destination $d_n \in D$, the next-hop node is obtained from shortest-path routing (Line 2). Using the path algebra given in (11), the new network element information vector is computed and updated at the next-hop node $n_k$ for $d_n$. A destination node $d_n$ is said to be qualified for the assigned Grid job when $NEV_{R(d_n)}[s, n_k] \succeq T^{(S_i)}$ (Line 4). If the required QoS is not met, the anycast request is updated with the new destination set as given in Line 7. If the cardinality of $D$ drops to zero, the anycast request is said to be blocked for the given service threshold condition $T^{(S_i)}$. However, the same anycast request may still satisfy another service $S_j$, $i \neq j$, with lower threshold requirements.

Input: $T^{(S_i)}$, $NEV_{R(d_n)}[s, n_{k-1}]$
Output: $NEV_{R(d_n)}[s, n_k]$
1: Initialization: $NEV_{init} = [w_{max}, 1, 1, 0]^T$
2: $NEXT\_HOP\_NODE[s, d_n] = n_k$  /* $n_k$ is calculated from the shortest path */
3: $NEV_{R(d_n)}[s, n_k] \leftarrow NEV_{R(d_n)}[s, n_{k-1}] \circ NEV_{R(d_n)}[n_{k-1}, n_k]$
4: if $NEV_{R(d_n)}[s, n_k] \succeq T^{(S_i)}$ then
5:   the path $[s, n_k]$ is feasible and destination $d_n$ can be reached
6: else
7:   update the destination set $D \leftarrow D \setminus \{d_n\}$  /* the route to $d_n$ does not satisfy the QoS requirement of service $S_i$ */
8: end if
9: if $D = \emptyset$, the anycast request is blocked (lost)
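A hop-by-hop realization of this pruning loop might look as follows. It is a sketch under simplifying assumptions: routes are supplied as precomputed per-link NEV lists, and the threshold is a vector of per-metric bounds. Among the destinations that survive, the ordering of (14)-(16) below then selects the final one.

```python
WMAX_SET = frozenset(range(5))   # all wavelengths the fiber supports

def compose(a, b):
    return (a[0] & b[0], a[1] * b[1], a[2] * b[2], a[3] + b[3])

def feasible(nev, thr):
    """NEV >= T(Si) per eq. (10); thr = (min_w, max_eta, min_gamma, max_tau)."""
    return (len(nev[0]) >= thr[0] and nev[1] <= thr[1]
            and nev[2] >= thr[2] and nev[3] <= thr[3])

def q3a(routes, threshold):
    """Prune infeasible destinations hop by hop; return surviving NEVs.

    routes: dict mapping destination -> list of per-link NEVs along its
    shortest path (an assumed input representation)."""
    init = (WMAX_SET, 1.0, 1.0, 0.0)    # Line 1 of the pseudo-code
    survivors = {}
    for dest, links in routes.items():
        nev = init
        for link in links:              # Lines 2-3: extend the path hop by hop
            nev = compose(nev, link)
            if not feasible(nev, threshold):
                break                   # Line 7: drop dest, stop scheduling
        else:
            survivors[dest] = nev       # Lines 4-5: destination qualifies
    return survivors                    # empty dict -> request blocked (Line 9)
```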

This algorithm computes the NEVs at the intermediate and destination nodes. At an intermediate node, the NEV is checked against the threshold condition, and the corresponding destination is discarded without further scheduling of the burst if the condition is violated. Once the NEVs $NEV_{R(d_n)}[s, d_n]$ have been computed for the entire updated destination set, they are re-ordered, and the destination corresponding to the optimal NEV is selected. The equations below show the ordering technique used in selecting the final anycast destination:
$$NEV = \{NEV_{R(d_1)}, NEV_{R(d_2)}, \ldots, NEV_{R(d_p)}\}, \quad 1 \leq p \leq n \quad \text{(un-sorted)} \qquad (14)$$
$$NEV = \{NEV_{R(d'_1)}, NEV_{R(d'_2)}, \ldots, NEV_{R(d'_p)}\} \quad \text{(sorted)} \qquad (15)$$
$$NEV_{R(d'_1)} \succeq NEV_{R(d'_2)} \succeq \ldots \succeq NEV_{R(d'_p)} \succeq T^{(S_i)} \qquad (16)$$


From (16), $d'_1$ is the best destination in $D$ that can meet the service requirement of $S_i$ effectively. The Q3A approach can be implemented in a distributed way with the help of a signaling approach [12]. The Burst Control Packet (BCP), or Burst Header Packet (BHP), can be used to carry the NEVs and update them as it traverses each NE. At each NE, the TED used to maintain the traffic engineering (TE) information can be modified to maintain the NEV.

4.1 Network Example

In this section we discuss Q3A with the help of an example to show the effectiveness of the algorithm in providing the QoS parameters. Consider the network shown in Fig. 1 and the anycast request (6, {2, 3, 4}, 1). The dotted lines in Fig. 1 represent the shortest paths from source node 6 to the respective destinations. The weights on each link represent the fiber distance in km, the noise factor, the reliability factor, and the propagation delay in milliseconds (the propagation delay is the ratio of the distance in km to the velocity of light, taken as 250 km/ms). Table 1 shows the sets of free wavelengths on the links at the time of the anycast request.

Fig. 1. Network example used to explain the proposed algorithm

The NEV for each destination can be calculated as given in the equations below:
$$NEV_{R(2)} = [W(6,1), 2.5, 0.92, 0.12]^T \circ [W(1,2), 3, 0.97, 0.16]^T = [\,|W(6,2)|, 7.5, 0.89, 0.28\,]^T = [2, 7.5, 0.89, 0.28]^T. \qquad (17)$$
The free wavelengths on each link are obtained from Table 1, and the cardinality of the common wavelengths appears in (17). This ensures the WCC in the all-optical network, where wavelength converters are absent.


Table 1. Residual wavelengths available on links to all destinations

  #   Link (i → j)   Residual wavelength set W(i, j)
  1   6 → 5          {λ1, λ2, λ3, λ4, λ5}
  2   5 → 4          {λ1, λ2, λ3}
  3   6 → 1          {λ1, λ2, λ5}
  4   1 → 2          {λ2, λ5}
  5   2 → 3          {λ3, λ4, λ5}

As the route towards destination 3 shares the common path up to node 2, its NEV is given by
$$NEV_{R(3)} = NEV_{R(2)} \circ [W(2,3), 1.5, 0.96, 0.04]^T = [\,|W(6,2)|, 7.5, 0.89, 0.28\,]^T \circ [W(2,3), 1.5, 0.96, 0.04]^T = [1, 11.25, 0.85, 0.32]^T. \qquad (18)$$
$$NEV_{R(4)} = [W(6,5), 1.5, 0.96, 0.04]^T \circ [W(5,4), 4, 0.95, 0.28]^T = [3, 6, 0.91, 0.32]^T. \qquad (19)$$

From (17), (18), and (19) we observe that destination 4 has the optimal QoS parameters (except for the propagation delay, which is slightly larger than that of $NEV_{R(2)}$). This confirms the benefit of specifying service requirements, whereby a destination can be chosen deliberately rather than selected at random.

5 Conclusion

In this paper we discussed the provisioning of QoS for anycasting in Grid optical networks. Using the information vectors available at each NE, the QoS parameters are computed; the parameters considered can be additive or multiplicative. Providing QoS for anycast communication allows a Grid application to choose a candidate destination according to its service requirements. This flexibility helps realize a user-controlled network. Our proposed algorithm also supports service-differentiated routing.

References

1. Nejabati, R.: Grid Optical Burst Switched Networks (GOBS), http://www.ogf.org
2. Jukan, A.: Optical Control Plane for the Grid Community. IEEE Communications Surveys & Tutorials 9(3), 30–44 (2007)
3. Kaheel, A., Khattab, T., Mohamed, A., Alnuweiri, H.: Quality-of-service mechanisms in IP-over-WDM networks. IEEE Communications Magazine 40(12), 38–43 (2002)


4. Farahmand, F., De Leenheer, M., Thysebaert, P., Volckaert, B., De Turck, F., Dhoedt, B., Demeester, P., Jue, J.P.: A Multi-Layered Approach to Optical Burst-Switched Based Grids. In: Proc. IEEE International Conference on BROADNETS, Boston, MA, USA, pp. 1050–1057 (October 2005)
5. De Leenheer, M., et al.: Design and Control of Optical Grid Networks. In: Proc. IEEE International Conference on BROADNETS, Raleigh, North Carolina, USA, pp. 107–115 (September 2007)
6. De Leenheer, M., et al.: Anycast algorithms supporting optical burst switched grid networks. In: Proc. IEEE International Conference on Networking and Services (ICNS 2006), Silicon Valley, CA, USA, pp. 63–69 (July 2006)
7. Lu, K., et al.: An anycast routing scheme for supporting emerging grid computing applications in OBS networks. In: Proc. IEEE International Conference on Communications (ICC 2007), Glasgow, UK, pp. 2307–2312 (June 2007)
8. Bathula, B.G.: QoS Aware Quorumcasting Over Optical Burst Switched Networks. Ph.D. dissertation, Department of Electrical and Communication Engineering, Indian Institute of Science (IISc), Bangalore, India (2008)
9. Duser, M., Bayvel, P.: Analysis of a dynamically wavelength-routed optical burst switched network architecture. J. Lightwave Technol. 20(4), 574–585 (2002)
10. Jukan, A., Franzl, G.: Path selection methods with multiple constraints in service-guaranteed WDM networks. IEEE/ACM Trans. Networking 12(1), 59–72 (2004)
11. Ramaswami, R., Sivarajan, K.N.: Optical Networks. Morgan Kaufmann Publishers, San Francisco (2004)
12. Martinez, R., Cugini, F., Andriolli, N., Wosinska, L., Comellas, J.: Challenges and Requirements for Introducing Impairment-Awareness into Management and Control Planes of ASON/GMPLS WDM Networks. IEEE Communications Magazine 44(12), 76–85 (2007)
13. Farrel, A., Bryskin, I.: GMPLS: Architecture and Applications. Morgan Kaufmann Publishers, San Francisco (2006)
14. Przygienda, A.B.: Link state routing with QoS in ATM LANs. Ph.D. dissertation, Swiss Federal Institute of Technology (ETH), Zurich, Switzerland (1995)

A Paradigm for Reconfigurable Processing on Grid

Mahmood Ahmadi and Stephan Wong

GridSim Toolkit Simulator

In a Grid environment, it is hard, even impossible, to perform scheduler performance evaluation in a controllable manner, as resources and users are distributed across multiple organizations with their own policies. To overcome this limitation, a Java-based discrete-event grid simulation toolkit called GridSim has been developed. The main characteristics of GridSim are as follows:

- The toolkit supports modeling and simulation of heterogeneous grid resources, users and application models.
- It provides primitives for the creation of application tasks, the mapping of tasks to resources, and their management.
- It investigates techniques to incorporate background traffic and network effects in GridSim.
- It supports modeling and simulation of reconfigurable architectures as well as general-purpose architectures (our contribution).
- It supports cooperative processing using the neighboring concept (our contribution).

Our assumptions

- Each application can be broken into different subjobs called gridlets.
- Each application is packaged as gridlets whose contents include the job length in MI (million instructions). The job length is expressed in terms of the time it takes to run on a standard GPP with a MIPS rating of 100.
- Each reconfigurable element accelerates the submitted subjob compared to a GPP; this acceleration is represented by a speedup factor.
- A grid consists of reconfigurable and general-purpose elements.
- The waiting time for reconfigurable elements is the sum of the standard waiting time (queuing time) and the reconfiguration time.
- There is no background traffic on the network.

Methodology

Processing elements communicate and collaborate based on the neighboring concept, in which each processing element cooperates with its neighbors at minimum cost. In other words, the network is composed of primitives, each of which includes a collaborator processing element and a requester processing element. The basic primitives are depicted in Figure 1.


Fig. 1. Basic primitives: (A) 1-array primitive, (B) n-array primitive, (C) complex primitive


Case studies

The basic primitives in Figure 1 are investigated as case studies with the following specifications:

- maximum packet size = 1500 bytes
- user-router bandwidth = 10 Mb/s
- router-router bandwidth = 100 Mb/s
- number of gridlets = 10
- size of each gridlet = 5000 MI (all gridlets have the same size)
- MIPS rating of a GPP = 377 MIPS
- minimum speedup for reconfigurable elements = 2
- maximum speedup for reconfigurable elements = 10
- reconfiguration file size = 6 Mb
- reconfiguration speed = 2 Mb/s
- reconfiguration time = reconfiguration file size / reconfiguration speed = 3 s

Users submit their jobs to the related resource; after that, each resource finds a neighbor resource for cooperative processing. Each resource can include a GPP, a reconfigurable element (RE) alone, or a combination such as a cluster of processing elements. In this case we utilize only one GPP or RE in each resource.
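As a quick sanity check of these figures, the sketch below computes a gridlet's execution time on a GPP and on an RE (job length in MI divided by the MIPS rating, with the RE further divided by the speedup factor and charged the 3 s reconfiguration time). The break-even condition noted in the Results below, reconfiguration time plus accelerated CPU time less than normal CPU time, is checked as well.

```python
GRIDLET_MI = 5000      # gridlet length in million instructions
GPP_MIPS = 377         # MIPS rating of the general-purpose processor
RECONF_TIME = 6 / 2    # file size (Mb) / reconfiguration speed (Mb/s) = 3 s

def gpp_time(mi=GRIDLET_MI, mips=GPP_MIPS):
    return mi / mips                            # plain execution on the GPP

def re_time(speedup, mi=GRIDLET_MI, mips=GPP_MIPS):
    return RECONF_TIME + mi / (mips * speedup)  # reconfigure, then run

for s in (2, 10):                               # minimum and maximum speedups
    t_gpp, t_re = gpp_time(), re_time(s)
    worthwhile = t_re < t_gpp                   # RE pays off only if this holds
    print(f"speedup {s:2d}: GPP {t_gpp:5.2f} s, RE {t_re:5.2f} s, "
          f"worthwhile: {worthwhile}")
# speedup  2: GPP 13.26 s, RE  9.63 s, worthwhile: True
# speedup 10: GPP 13.26 s, RE  4.33 s, worthwhile: True
```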

Results

The results show that utilizing reconfigurable architectures increases performance. In this case, the best performance is achieved when all elements are REs and the speedup factor is set to 10. When a combination of GPP and RE is utilized, a variation in waiting time and turnaround time is observed, which is due to the MIPS difference between GPPs and REs. To achieve the maximum performance, the significant factors are:

- Application type: cooperative processing is useful for computation-intensive applications.
- Reconfiguration time: another important factor is the reconfiguration time. It should be noted that the sum of the reconfiguration time and the accelerated CPU time must be less than the normal CPU time (the job execution time on a GPP).

The results generalize to the case where the sizes of the gridlets and the MIPS ratings of the processing elements are random.

Mathematical Analysis

This analysis is performed to show the effect of different parameters on the processing time in collaborative and non-collaborative systems. To describe the related equations, the following notation is defined:

- $N$ = number of subtasks
- $S$ = size of a subtask in MI
- $t_g$ = processing time for each instruction on a GPP
- $t_r$ = processing time for each instruction on an RE


- $T_{comm}$ = communication time between different processing elements in collaborative systems
- $\alpha$ = speedup factor
- $N_b$ = number of subtasks that can be processed by a neighbor
- $P$ = packet size
- $B$ = network bandwidth
- $N_p$ = number of packets

With this notation, the total processing time of a non-collaborative system is
$$T_{nc} = \sum_{i=1}^{N} S\, t_g = N\, S\, t_g,$$
while in a collaborative system, where $N_b$ subtasks are offloaded to a neighboring element with $t_r = t_g/\alpha$, the total processing time is
$$T_{c} = (N - N_b)\, S\, t_g + N_b\, S\, t_r + T_{comm}, \qquad T_{comm} = N_p \cdot \frac{P}{B}.$$
